After what I think was a first revolution in formally approaching the problem of causation, with the advent of logical inference, there came a second revolution (so to speak) in the area of causality: the attempt to quantify the idea of chance, of something happening for no reason as opposed to being produced by, or having, a cause. There are all sorts of interesting epistemological challenges around such a question. For example, science, and particularly classical mechanics, establishes that everything has a cause and there is no chance. This may suggest that something appearing to happen by chance is only so in appearance, because its seemingly random nature merely reflects the observer's inability to identify the cause at work.

Italian and French mathematicians such as Cardano, Pascal, Fermat, and Borel were the first to attempt to characterize chance in a mathematical way, and we owe them many of the concepts and ideas still in use today in areas such as probability and statistics. Further progress was made by German-speaking mathematicians such as Jacob Bernoulli, and much later by Richard von Mises, providing the foundations of modern probability theory. But it was Andrey Kolmogorov who gave what we know today as probability theory its final formal version, what in mathematics is called an axiomatization. Interestingly, both von Mises and Kolmogorov felt they had arrived at a limited and weak characterization of randomness by means of probability distributions. And while von Mises did not have available some crucial concepts needed for a better characterization of randomness, Kolmogorov combined ideas taken from the rising area of computation to arrive at what we know today as a mathematical definition of randomness, which we will be covering and which is at the center of this course and of the techniques we will introduce and study.

The aim of classical probability and traditional statistics is to help with the problem of causal inference by way of calculating probability distributions. One main criticism of the way statistics approaches causality is its dependence on assumptions and expectations based on those probability distributions. Indeed, without knowing the generating cause behind the appearance of, for example, 5000 white swans parading one after the other, a statistician would approach the problem of guessing the color of the next swans by simply calculating a distribution based on how many times white swans have been seen versus some other color. Based on traditional statistics, the sudden appearance of a black swan in the long list of white swans would be a great surprise and would be considered an oddity, and it may be so. However, if there is a swan owner releasing the swans in a specific order, and that owner had decided to release all the white ones first followed by the black ones, then we would have made the awful assumption that the swans were somehow appearing at random, and we would have thought the black swan was an oddity when we first saw it coming. In other words, knowing, or attempting to come up with, a model of the generating mechanism, removed from the assumption of randomness, would have the advantage of leading to better explanations and the means to make more accurate predictions. Of course, it is one thing to attempt this and another to achieve it. But statistics is not designed to generate candidate models, only to describe the data probabilistically.
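To make the swan example concrete, here is a minimal sketch in Python; the owner's ordering and the number of black swans are made-up illustrative assumptions, not details from the course. It contrasts a purely frequency-based estimate of the next swan's color with a generating mechanism that is in fact fully deterministic.

    from collections import Counter

    # Hypothetical generating mechanism (illustrative assumption): the owner
    # releases 5000 white swans first, followed by 100 black ones.
    # Nothing here is random.
    def release_swans():
        return ["white"] * 5000 + ["black"] * 100

    observed = release_swans()[:5000]   # what has been observed so far

    counts = Counter(observed)
    total = sum(counts.values())
    p_white = counts["white"] / total          # frequency estimate ~ 1.0
    p_black = counts.get("black", 0) / total   # frequency estimate ~ 0.0

    print(f"P(next is white) ~ {p_white:.3f}, P(next is black) ~ {p_black:.3f}")
    # The purely statistical description assigns probability ~0 to the black
    # swan that the actual generating mechanism guarantees will appear next.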
One limitation is confounding causes with effects, that is, the problem of disentangling a common cause from the effects observed in a given situation; this comes up all over science and is among the main concerns when designing an experiment. In other cases, it may simply be unclear which is the cause and which is the effect. For example, some study may claim that children who watch a lot of TV are the most violent. Clearly, TV makes children more violent, they would say. But this could easily be produced by some other cause: for example, violent children may like watching more TV than less violent children because they become reclusive. In the children example, what one needs is to introduce what is called a control experiment, to disentangle whether the children were already violent before the experiment or whether they became more addicted to TV as a result of their violent personalities. These safeguards are called controlled experiments. Another problem is that one cannot reset the same child to be violent in one environment and non-violent in another, or the other way around. So the experiment has to be performed on an already existing population that will necessarily be influenced by other, uncontrolled causes that cannot be completely isolated. The purpose of control experiments, however, is to control for the most obvious biases or confounding causes so that we can draw more meaningful conclusions and results. Control experiments are always of the form: what would happen if something else had or had not happened, or what would happen if we did or did not apply, or removed, some other influence from the experiment?

Pierre-Simon Laplace was the first to use what are called uniform priors when faced with a complete lack of knowledge, that is, a distribution that assumes all events are equally likely. He introduced a principle known as "the principle of insufficient reason," also known as "the principle of indifference." The principle of insufficient reason is similar to Occam's Razor in nature, in that it is a guiding principle with no strong evidence in favor of or against it. The principle states that if there are n possible causes indistinguishable except perhaps by their names, then each possible cause should be assigned a probability of 1/n, that is, equal probability, and none should be discarded or ruled out. While the principle of insufficient reason is a reasonable principle to follow, we will challenge some of its assumptions. It may be desirable to keep all possible causes at nonzero probability, as another principle suggests, namely the principle of multiple explanations, which establishes that if several theories are consistent with the observed data, we should retain them all; but there are strong reasons to assign different probabilities to different explanations instead of assigning equal probability to each of them. In fact, the principle of indifference seems to somewhat contradict Occam's Razor, which suggests that overly complicated causes should not be given the same weight as simpler ones. All these assumptions made in traditional statistics reveal that there is a highly subjective component in the way classical probability deals with causality, particularly in the absence of data and knowledge about the generating source, which is pretty much the general case. Nevertheless, all these methods are widely accepted and used, for example in the field of machine learning.
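As a small illustration of the principle of indifference just described, the following minimal sketch (with hypothetical cause names made up for illustration) assigns each of n indistinguishable candidate causes the same prior probability 1/n.

    # Principle of indifference: n indistinguishable candidate causes each
    # receive prior probability 1/n. The cause names are hypothetical.
    def uniform_prior(causes):
        n = len(causes)
        return {cause: 1.0 / n for cause in causes}

    print(uniform_prior(["cause_A", "cause_B", "cause_C", "cause_D"]))
    # {'cause_A': 0.25, 'cause_B': 0.25, 'cause_C': 0.25, 'cause_D': 0.25}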
We will challenge the use of uniform priors, that is, the use of uniform distributions as a first assumption in the study of causal systems as opposed to random, also called stochastic, systems. We have shown, for example, that challenging the common use of these uniform priors has interesting advantages, including an acceleration in the convergence of biological evolution compared to assuming uniform random mutations, something that we will cover in the last module, so we will come back to all of this in detail. The idea that probability should be interpreted as a subjective degree of belief in a proposition was first proposed by John Maynard Keynes in the early 1920s. But even today, methods such as Shannon entropy are taken far more seriously than they should be. Entropy, in the way Claude Shannon introduced it, is usually presented as a measure of surprise in the context of sending messages. As we will see, Shannon entropy is a measure of the degree of uncertainty stemming from one's own lack of knowledge, rather than of any objective uncertainty related to the possible cause of a process. So, contrary to the general claim, we will demonstrate that Shannon entropy is not a syntactic measure at all, but a highly semantic one. That does not mean it is better or worse. Shannon's entropy is interesting because it introduced logic and computation as descriptions of operations on information. We will explain later in greater detail how some of these ways of characterizing randomness are often, if not always, very fragile. We will illustrate this with examples, but it is important that you already have some notion of these concepts.

One fundamental concept in statistics is that of correlation, which is not a minor concept but is actually, in some way, at the heart of statistics. Statistics is all about finding statistical patterns in the form of regularities; as we will see, anything more sophisticated than that will be missed by statistical approaches to inferring causal mechanisms in data. A statistical regularity can be, for example, the tendency of some data points to lie on a plane, or of a time series to display periodic behavior. These are typical plots of positive and negative correlation. Think of two processes from which we obtain some data; it can be a time series, which is a collection of data points sorted by time. Imagine you want to see whether these two series are correlated or causally connected in some way. Think of the x-axis as one series and the y-axis as the other. Then one can see whether the data points line up, which would mean that they are distributed in a similar fashion. Correlation values, traditionally denoted by a Greek letter, in this case rho, are typically given between -1 and 1; when rho is close to 1 or -1, the data are either positively or negatively correlated. These plots are called scatter plots. There are several ways to measure correlation, but they are all very similar and consist in taking distances among data points. One of the most popular methods is Pearson's correlation, which measures correlation among the data point values themselves. Other popular ones are Kendall's and Spearman's correlations, which measure correlation when only the rank order matters.
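Here is a minimal sketch (assuming NumPy and SciPy are available; the two series are made up for illustration) of how these correlation measures are computed in practice, together with a structured but "uncorrelated" case that foreshadows the false negatives discussed next.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 200)                      # first series
    y = 2.0 * x + rng.normal(0, 1.0, size=x.size)    # second series: linear trend plus noise

    rho_p, _ = pearsonr(x, y)     # correlation of the values themselves
    rho_s, _ = spearmanr(x, y)    # correlation of the ranks (only order matters)
    tau_k, _ = kendalltau(x, y)   # another rank-based measure

    print(f"Pearson rho  = {rho_p:.3f}")   # close to 1: strong positive correlation
    print(f"Spearman rho = {rho_s:.3f}")
    print(f"Kendall tau  = {tau_k:.3f}")

    # A structured relationship that correlation misses: y is fully determined
    # by x, yet the symmetry of the parabola drives Pearson's rho to about zero.
    x2 = np.linspace(-5, 5, 200)
    y2 = x2 ** 2
    rho_parabola, _ = pearsonr(x2, y2)
    print(f"Pearson rho for y = x^2: {rho_parabola:.3f}")   # ~ 0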
Because of the limitations mentioned before, traditional statistics often leads to spurious models, producing false positives and false negatives with high probability. A false positive is a regularity in the data that appears real but is not: it is only an artifact, leaving the wrong impression about cause and effect. A false negative is a real regularity that the statistical test fails to detect. Here are examples of false negatives, meaning that the correlation test quantified by rho suggests that there is no correlation between the x and y axes. Yet just by looking at these plots one can see structure, immediately suggesting that something interesting is happening in the way the data points are distributed along the axes; still, the rho value is almost zero in all of them, suggesting that nothing interesting is going on. There are other examples in which the correlation is not zero, but all the plots share exactly the same rho value even though they are clearly very different and have very different structures. According to the Stanford Encyclopedia of Philosophy, attempts to analyze causation in terms of statistical patterns are referred to as "regularity theories of causation." Statistical regularities are only a subset of the possible properties that a phenomenon may display. A statistical approach offers an explanation for the distribution of the data but leaves to the scientist the heavy task of interpreting the data and coming up with a model for it.

Traditionally, what a scientist does is fit some curve; the equation of the curve is then taken as a generating model, both to explain the distribution of the data and to make predictions. In the typical case of positive correlation, for example, it is not difficult to fit a line. This is called a linear fit, because the fitted function is linear: a line, that is, a polynomial of degree 1. However, one can always force a curve to go exactly through any number of data points by using a polynomial of degree proportional to the number of points. One can see how, by increasing the degree of the polynomial, one can make the curve go through, or close to, the data points. All these limitations of traditional statistical approaches to causality can be summarized in one of the most common sayings in the area: that association is not causation, or that correlation does not imply causation, but also that lack of correlation does not mean lack of causation. In other words, the fact that you can fit a curve to some data points does not necessarily mean that the curve truly has anything to do with those data points.

Using a very interesting technique, a recent paper showed the degree to which correlation can be fooled. All these scatter plots have identical summary values, that is, the same mean, standard deviation, and Pearson correlation to two decimal places, yet they clearly do not seem to have been generated at random. Some versions of Shannon entropy can tell some of these cases apart, but will fail for most if no method other than statistics is used to infer the underlying probability distribution. We will explain these concepts in greater detail in the modules that follow, but bear in mind that this is a general problem that traditional statistics and classical probability will always bring to the table.
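To illustrate the curve-fitting point made above, here is a minimal sketch (assuming NumPy; the data are pure noise, generated only for illustration) showing that a polynomial of degree n-1 can be forced exactly through n data points, whether or not the curve has anything to do with how the data were produced.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.arange(8, dtype=float)        # 8 made-up data points
    y = rng.normal(0, 1, size=x.size)    # pure noise: there is no underlying curve

    linear = np.polyfit(x, y, deg=1)           # linear fit: polynomial of degree 1
    exact = np.polyfit(x, y, deg=x.size - 1)   # degree 7: passes through every point

    print("max residual, degree 1:", np.abs(np.polyval(linear, x) - y).max())
    print("max residual, degree 7:", np.abs(np.polyval(exact, x) - y).max())   # ~ 0
    # The degree-7 curve "explains" the noise perfectly, yet it says nothing
    # about any generating mechanism, because there is none to be found.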
In another example, to contrast what a mechanistic model is as opposed to a statistical description: according to a statistical description of the phenomenon, car arrivals at any point on a free-flowing road follow what is known as a Poisson distribution. This is because distributions can have characteristic shapes when they are plotted, and one of them is the Poisson. However, what the statistical approach describes is the effect of the mechanistic cause, not the cause itself; it provides no clue about the generating mechanism. It may sometimes provide hints about the causes, but the interpretation is left to the scientist; statistics does not provide any model by itself. The cause behind the Poisson distribution is that slower drivers make faster drivers accumulate behind them, producing patches of cars running together on the highway and arriving at gas stations at about the same time. But it is not the Poisson distribution that causes the way in which cars accumulate, nor does it suggest how or why this happens. In contrast, a mechanistic approach attempts to provide a causal model that may help design ways in which the system can be manipulated, because it points out the mechanism that can be changed to achieve a different purpose.

Probability and statistics have led a revolution in the study of causality, but in some fundamental way they have been exhausted as a means of making further progress in modern science. This course is all about trying to complement and provide an alternative to traditional statistics and classical probability. We will see how what we call algorithmic information dynamics provides interesting avenues to better deal with, and help reveal, mechanistic causes.
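As a closing illustration of the contrast between a statistical description and a mechanistic model, here is a minimal sketch of a toy simulation (all parameters, such as road length, speeds, and entry times, are made-up assumptions, not taken from the course) in which faster cars get stuck behind slower ones and therefore arrive at an observation point in bunches. The histogram at the end is the statistical summary; the loop enforcing no overtaking is the toy mechanism that produces it.

    import numpy as np

    rng = np.random.default_rng(2)

    road_length = 10.0                                    # km to the observation point
    entry_times = np.sort(rng.uniform(0, 60, size=200))   # cars enter over an hour (minutes)
    speeds = rng.uniform(0.5, 2.0, size=200)              # km per minute

    free_arrival = entry_times + road_length / speeds     # arrival time if never blocked

    # Toy mechanism: no overtaking, so a car can never arrive earlier than the
    # car ahead of it (plus a small following gap), whatever its own speed.
    arrival = free_arrival.copy()
    for i in range(1, arrival.size):
        arrival[i] = max(arrival[i], arrival[i - 1] + 0.05)

    counts, _ = np.histogram(arrival, bins=np.arange(0, 100, 5))
    print("cars arriving per 5-minute window:", counts)
    # The histogram is only a statistical description of the effect; the
    # no-overtaking rule above is the mechanism one could intervene on.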