So in the previous lectures we saw how computable measures such as Shannon entropy, and even lossless compression, have some limitations. Here we will introduce new concepts and alternative methods to measure algorithmic complexity by means completely different from those of classical information theory, and different from, or complementary to, the use of lossless compression algorithms to approximate algorithmic complexity. We will also see how measuring algorithmic complexity can help in analysing real-world data and in producing candidate mechanistic models, and therefore how it contributes to the discussion of causation.

By now you should remember the most basic definition of algorithmic or Kolmogorov-Chaitin complexity: the algorithmic complexity of a string or object s is the length of the shortest computer program p that, running on a Turing machine U, produces s. Another suggestive way to read this formula is as follows. The string or object s can be an observation of the world; then we aim at finding models, such as the computer program p, that are generating mechanisms, because they are computer programs that can be followed mechanistically, step by step, when running on a Turing machine. This interpretation is the same as the one suggested by another area called Computational Mechanics, introduced by James Crutchfield and others.

As we know, algorithmic complexity is uncomputable because there is no algorithm to find, and prove, that a program p is actually the shortest. So what others have done before is to reduce the power of the Turing machine on which the programs run, and thus of the computer programs themselves. For example, if the Turing machine U is replaced by a finite automaton, a much simpler machine, then the quest to find a short, or even the shortest, finite automaton producing a string becomes feasible. But this comes at a cost: the simpler the machine, the easier it is to find those programs, but the fewer features of the observation those programs will capture, to the point where, just like Shannon entropy and other computable measures, such restricted models mostly capture statistical properties. They are still very interesting because they offer mechanistic models, that is, an automaton diagram from which more data can be produced to test a model.

You can also see immediately from the definition how algorithmic complexity captures Ockham's razor, that is, the assumption that the simplest generating model is also the most likely; algorithmic complexity formalises Ockham's razor by taking simple to mean shortest. This is usually considered a reasonable assumption in science, and it can even be said to drive science itself, as we try to find simple explanations for complex phenomena. From the definition one can also see how algorithmic complexity offers a way to produce a mechanistic model that is constructible by algorithmic means, as both p and U are realisable, meaning that one can actually run them on a digital computer and follow them step by step. One can also see how the model has to be compatible with the observation, because of the equality between model output and observation, and how this is possibly a very strict requirement: we would readily accept models that conform with the data, not necessarily a model that reproduces every aspect of the data exactly, as the pure form of algorithmic complexity would require.
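The formula being read here is the standard definition of algorithmic complexity; written out, it is

$$ K_U(s) = \min\{\, |p| \;:\; U(p) = s \,\}, $$

where $|p|$ denotes the length of the program $p$ in bits.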
In some ways this wouldn't even be desirable, because observations sometimes come with noise and we may not want the noise to be represented in the model. And while K is uncomputable, there are means to approximate it because, as we have seen before, it is semi-computable from above, which means, as you already know by now, that one can find upper bounds and make continuous improvements towards the values of K. So, if we are not proposing the popular lossless compression algorithms, how do we propose to calculate K? The answer is by exploiting algorithmic probability and the fundamental Coding theorem. The method is based on the idea that, contrary to the big-data dictum, small data actually matters, especially for causation. The main idea is that one can run all computer programs up to some size to see how many of them generate the same strings, and then continue running more programs to produce more data, that is, more strings and their associated frequencies, providing estimations of their algorithmic probability. Then, by the algorithmic Coding theorem, we can convert those frequencies into an estimation of the algorithmic complexity of each string. The programs will be generative mechanisms, candidate models explaining the data, and the complexity values will serve as guides for comparing models across data.

As a reminder, the algorithmic Coding theorem establishes the connection between the complexity of an object and its algorithmic probability of being produced by a random computer program. The intuition behind the Coding theorem is the beautiful relationship between program-size complexity and frequency of production: the less random an object, the more likely it is to be produced by a short program. There is a greater probability of producing the program that produces the object, especially if that program is short. Therefore, if an object is compressible, then the chances of producing the compressed program that produces said object are greater than they would be for a random object. By way of the algorithmic Coding theorem one can thus approximate the algorithmic complexity of an object. So what we do is look at the empirical output distribution of trillions of small computer programs and count how many times each object is produced. Objects produced many times will have high algorithmic probability and therefore low algorithmic complexity, and vice versa.

Here are two key formulas for understanding the method. Let (n,2) be the set of all Turing machines with n states and 2 symbols, and let D be the function that assigns to every finite binary string s the number of machines that produce s divided by the total number of machines of that size. Then D(s) is an approximation of the algorithmic probability of s, and by the Coding theorem we can obtain K(s). For this reason we call our algorithm CTM, standing for Coding Theorem Method. The greater the value of n we use to calculate D(s), the better the approximation to both the algorithmic probability of s and K(s). D(n) can also be seen as averaging K over a large set of possible computer languages, each represented by a small Turing machine, in order to reduce the possible impact of the constant from the invariance theorem that we covered two units ago. Another way to see the method is as averaging over a very large number of short computer programs whose frequency is an indication of their complexity according to the Coding theorem.
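In symbols, and following the lecture's own description (with the total number of machines of that size in the denominator), the two key formulas are

$$ D_n(s) = \frac{|\{ T \in (n,2) : T \text{ halts and outputs } s \}|}{|\{ T \in (n,2) \}|}, \qquad K(s) \approx -\log_2 D_n(s), $$

where the subscript $n$ makes explicit the dependence on the machine size, written D(n) above.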
The CTM allows us not only to build an empirical distribution of computer program outputs from smaller to larger sizes; once a string is generated for the first time among all the computer programs of the smallest size, we also know which Turing machine is the smallest one producing that string, as well as the runtime of that Turing machine.

Here you can see a flow diagram of the algorithm behind our method. The idea is to start from the smallest possible programs and follow an enumeration going through all possible computer programs of increasing length, because we know that computer programs, just like Turing machines, are enumerable or countable; length here is quantified by the number of states of the Turing machine implementing the computer program. Then we iterate through that enumeration, looking at the output of each computer program and incrementing a counter indicating how many times each output has been produced, effectively producing an empirical distribution, related to the Universal Distribution, that we will use to estimate the algorithmic probability of a string or an object.

So we start with computer program 0, meaning it carries the label 0, where 0 is nothing special but is one small program among all those that can be carried out by Turing machines with up to a certain number of states, in this case 2 states, because Turing machines with only 1 state do not compute anything. Then we exhaust all programs of that size, which makes it irrelevant which one we started with, before moving on to the computer programs that Turing machines with 3 states can run, and so on.

Notice that we only explore Turing machines with empty input, and you may wonder why, and how this could be a complete exploration of all computer programs disregarding possible inputs. This is because Turing universality tells us that one can hardcode programs into hardware and the other way around. This is related to the concept of the stored computer program: you can always copy a computer program into a computer, the program becomes part of the computer, and then you can call that program just by calling its name, in this case its number, equivalent to clicking an icon on your computer instead of inserting a floppy disk or a CD, because it is already in memory. So when we go through all possible programs, even with no input, we know we are really exploring the whole space of computer programs, including all those that can be fed, or even reprogrammed, with non-empty inputs.

Once we run each computer program, we then ask whether the machine has halted, which we cannot know in general, but in some cases we can, for example by using the values of the Busy Beaver. Because the Busy Beavers are the Turing machines that run the longest for a fixed number of states, if a computer program with that number of states has not halted by the running time of the Busy Beaver, then we know it will never halt and we can simply ignore it, because it will never produce any output. However, if it does halt, then we can look at its output and count how many times the same output has been produced before. Now you can see that some outputs will be produced with high frequency, because according to algorithmic complexity and the Coding theorem, if a string is very simple, such as a sequence of 0s, it will be produced by many short computer programs. And in a second we will see whether this is actually the case, because before this very simple experiment nobody knew.
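The following is a minimal sketch, in Python, of the loop just described, restricted to the tiny space of 2-state, 2-symbol machines. The machine encoding, the output convention (the visited tape cells) and the use of the total machine count as denominator are simplifications of my own for illustration; they do not reproduce the exact formalism used in the published CTM calculations.

```python
import math
from collections import Counter
from itertools import product

STATES = 2
SYMBOLS = (0, 1)
BB_STEPS = 6   # S(2) = 6: the longest a halting 2-state, 2-symbol machine runs
HALT = -1      # marker for the halting "state"

def run(machine, max_steps):
    """Run a machine on an empty (all-0) tape; return the visited tape cells
    as a string if it halts within max_steps, otherwise None."""
    tape, pos, state = {}, 0, 0
    visited = set()
    for _ in range(max_steps):
        visited.add(pos)
        write, move, nxt = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        if nxt == HALT:
            cells = range(min(visited), max(visited) + 1)
            return "".join(str(tape.get(i, 0)) for i in cells)
        state = nxt
    return None  # not halted by the Busy Beaver bound: it never will

# Every possible transition: (symbol to write, head move, next state or HALT).
choices = [(w, m, s) for w in SYMBOLS for m in (-1, 1)
           for s in list(range(STATES)) + [HALT]]
keys = list(product(range(STATES), SYMBOLS))   # (state, read symbol) pairs

counts, total = Counter(), 0
for table in product(choices, repeat=len(keys)):
    out = run(dict(zip(keys, table)), BB_STEPS)
    total += 1
    if out is not None:
        counts[out] += 1

# D(s) and the Coding-theorem estimate K(s) ~ -log2 D(s), most frequent first.
for s, c in counts.most_common(8):
    D = c / total
    print(f"{s!r:>8}  D(s)={D:.5f}  K(s)~{-math.log2(D):.2f} bits")
```

Running it, the most frequent outputs are the simplest strings, which is exactly the behaviour the Coding theorem predicts.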
With these results we can then use the Coding theorem and make continuous improvements on our estimations of both algorithmic probability and algorithmic complexity. Notice how powerful this is under the Church-Turing thesis, which establishes that everything that is computable is computable by a Turing machine: it means that we would eventually cover every possible model of natural phenomena if we had enough computational resources. We will see how we can partially circumvent, or strike a compromise between, the lack of sufficient resources and the power of this scheme.

Additionally, from this calculation we also know which was the first, and thus the shortest, computer program producing a string, and we also record the runtime of that Turing machine, which allows us to estimate another seminal measure of complexity, Bennett's Logical Depth (LD), approximated by this same Coding Theorem Method. As you may remember, Logical Depth is a measure of sophistication, able to tell apart sophistication or depth from both simplicity and randomness, defined as the computation time it takes to decompress an object, such as a string, from its shortest compressed version. With this method we can generate all small computer programs and average their running times to get a value based on this Logical Depth measure, which we will briefly use in the next unit in an application to a real-world problem related to cognition.

The calculation of these empirical distributions related to the Universal Distribution was anything but trivial, and in some ways it hasn't finished and never will: we are still exploring larger rule spaces of longer computer programs to produce a greater number of computable models of the world for our purposes in our quest to tackle causation. The calculation not only required all sorts of hacks when dealing with running trillions of small computer programs and producing huge numbers of strings and values, but we also had to do the calculation in the cleverest way possible to save time. For example, we did not run all programs, because we were able to know which programs would produce the same output by looking at their behaviour. For instance, for each program moving the head to one side of the Turing machine tape, we know there is the same number of programs going to the other side of the tape, so we could complete the picture by taking such symmetries into consideration.

To explore all the computer programs of up to 4 states, it took us about a week on a laptop to go through more than a trillion small computer programs, but to explore all possible computer programs implemented by Turing machines of up to 5 states, it took a medium-sized computer with 30 CPUs running for 3 months. Currently we are conducting the next run on one of the most powerful computers in the world, with thousands of CPUs, and we keep increasing the accuracy of our prior distribution, that is, the numerical approximation of a Universal Distribution. You can see in this table that we also explored other spaces, such as computer programs that print in 2 dimensions, and Turing machines with more than 2 symbols, to be able to quantify non-binary objects.
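As a rough illustration of how the recorded runtimes can feed a Logical Depth estimate, here is a minimal sketch. The `records` data are hypothetical, and averaging the runtimes of the smallest machines producing a string follows the description given in this lecture rather than any particular published implementation.

```python
# `records` is hypothetical example data: (machine size in states, steps to
# halt) for every machine found to produce one given string.

def logical_depth_estimate(records):
    """Average running time of the smallest machines producing the string,
    following the lecture's description of averaging running times."""
    shortest = min(size for size, _ in records)
    times = [steps for size, steps in records if size == shortest]
    return sum(times) / len(times)

records = [(2, 6), (2, 4), (3, 11), (3, 9)]
print(logical_depth_estimate(records))   # -> 5.0
```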
As an example of one validation and sanity check among many that we performed, we found that we were performing in line with other methods. For instance, we compared our distribution with the actual shortest computer programs found following the formalism shown in the previous flow diagram, because, as I said before, from CTM we obtain not only the frequency of production of a string, that is, the number of computer programs that produce it, but also the first time a string was produced, and thus the shortest computer program producing that string for that computing model. So this figure plots actual instruction size, that is, the number of instructions actually used in the shortest computer programs found, versus the estimated algorithmic probability of each of these strings. For strings of 2 to 12 bits we found that the correlation was very high, indicating that we were moving in the right direction. One advantage of this numerical approach to an empirical approximation of the Universal Distribution is that we do not rely only on having found one short computer program generating an object, but on averaging over all the short computer programs found, so the method is more robust than simply finding a single short computer program. This also allows us to generate a set of mechanistic models explaining the data.

Then we of course compared our results against compression, in particular the popular LZW method using the Compress function, and we also found a strong correlation. But we also found that compression is not only limited by the way in which popular lossless compression algorithms are implemented, but is also extremely insensitive, especially for small objects such as the short strings in this case, producing only a very coarse-grained view with a few distinct values for all the $2^n$ strings, in comparison to the much finer-grained values of algorithmic probability approximated by our methods. Yet what we found is that what was not complex for compression algorithms, that is, what was highly compressible, also had high algorithmic probability, consistent with the theoretical expectation, because remember that high algorithmic probability means low algorithmic complexity, which is exactly what high compressibility means. However, the converse was not as strong, and there we had to look further to find that the main difference was that our methods were identifying some strings as being of lower algorithmic complexity than the compression algorithm suggested, that is, strings that should be compressible but that the compression scheme failed to compress. We will see in the next lecture how we tested this and how it helps us to better deal with causality.

So what does the empirical distribution look like? It looks like this. You can see immediately that it seems to make sense: simpler strings appear at the top, and the values follow an exponential decay, exactly as predicted by algorithmic probability. We can see that most Turing machines produce a simple 0 or 1 and halt, and so on. And while you may think that this table is sorted by string length, it is not. Here on the right are some cases of longer strings that appear high in the ranking because they are simpler. We call these strings jumpers because they land far from their length group. By looking at this table and these jumpers one can see algorithmic probability in action.
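A small illustration (not the original experiment) of why generic lossless compression is too coarse-grained for short strings: compress every binary string of a given length and count how many distinct compressed sizes appear. Python's zlib (Deflate) is used here as a stand-in for whichever lossless compressor one prefers.

```python
import zlib
from itertools import product

n = 12
sizes = set()
for bits in product("01", repeat=n):
    s = "".join(bits).encode()
    sizes.add(len(zlib.compress(s, 9)))   # compressed length at maximum level

# Only a handful of distinct values for thousands of strings, far coarser
# than the fine-grained algorithmic probability estimates of CTM.
print(f"{2**n} strings of length {n} -> {len(sizes)} distinct compressed sizes")
```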
One other advantage of precomputing this empirical distribution is that, while its calculation is intractable even in the best case, being among the most intensive computations possible, bounded by, and indeed equivalent to, calculating the Busy Beaver function, which we know involves the most demanding computer programs, we only need to calculate it once. Once calculated, one can implement a lookup table with the strings themselves as keys, and the application becomes linear, each query being a simple table lookup. Of course, in the general case completing this distribution is uncomputable and thus unachievable, but this also means we will always have a job to keep us busy making improvements over previous estimations.

On the left we have the table without the machines that never halted, and on the right we have the machines that never halted, identified by the empty string epsilon. It turns out that 80% of the Turing machines in this space don't halt, and the greater the rule space and the more computer programs there are, the smaller the fraction that halt. Still, by increasing the number of computer programs we produce more strings and can cover more cases.

There is something else that may be of concern, namely the enumeration, or the Turing machine model chosen, and its effect on the empirical distribution. As you may remember from the sketch at the bottom, reproduced from one of the previous units, the invariance theorem tells us that we cannot be sure how fast the values of algorithmic complexity will converge in the face of changes of universal Turing machine or Turing-complete programming language, even though the same theorem also tells us that eventually all values will converge up to some constant. So this could be a concern for our approach: how can we know whether we are far off in our estimations, and whether this particular enumeration from shorter to longer computer programs is unstable at the beginning? Well, our numerical calculations suggest that this is not the case. We have shown that the method is well behaved and does not seem to diverge, because in exploring greater rule spaces, according to increasing Turing machine size, the output distributions converge, while the differences remain very small and thus under apparent control. Notice that no other method is immune to this problem, including lossless compression algorithms, where the analogous problem is the choice of one particular lossless compression algorithm over another; all we can do is check whether, in applying the methods, the results conform to the theoretical expectations, and in our case they do.
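A minimal sketch of the precompute-once-then-look-up idea; the values below are hypothetical placeholders, not real CTM estimates, and in practice the table would be loaded from the precomputed output distribution.

```python
# Hypothetical precomputed K estimates (in bits), keyed by the strings themselves.
ctm_table = {
    "0": 2.5,
    "1": 2.5,
    "01": 3.3,
    "0101": 5.0,
}

def ctm_complexity(s):
    """Return the precomputed estimate of K(s), or None if the string falls
    outside the table and would require extending the computation."""
    return ctm_table.get(s)

print(ctm_complexity("01"))              # a constant-time table lookup
print(ctm_complexity("001101010111"))    # None: not yet precomputed
```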