As we saw in the previous unit, traditional probability theory and classical information theory quantify properties related to randomness by asking how many bits are needed to encode or communicate a sequence. Algorithmic Information Theory, or the theory of algorithmic complexity, is of a very different nature than classical information theory. The motivation behind algorithmic complexity is the question of what it means, in the broadest and most general terms, for an object to be truly random, with respect to the following properties:

Unpredictability: the impossibility of predicting an outcome or an object.
Incompressibility: the impossibility of describing something random in a simple or short fashion.
Typicality: a random event has nothing particularly special about it.

A sequence of coin tosses, for example, would be called typical if it looks unbiased, that is, random; but if it comes out tails every time, it looks atypical, compressible and highly predictable.

In contrast to traditional statistics and classical information theory, algorithmic information theory avoids probability distributions and provides a fundamentally more general approach to randomness by focusing on the properties of data that can be modelled by computer programs: something is random if no computer program shorter than the data itself can model or reproduce it.

So we have arrived at one of the main concepts in this course, the concept of algorithmic complexity. Introduced by Solomonoff, Kolmogorov, Chaitin and Levin in the 1960s and 1970s, program-size complexity, also known as algorithmic complexity or Kolmogorov complexity, is a measure that quantifies algorithmic randomness, a type of randomness strictly stronger than statistical randomness. Formally, the algorithmic complexity, which we will denote by K, of a string s is the length of the shortest computer program p that, running on a universal Turing machine T, generates s as output and halts; in symbols, K(s) = min{ |p| : T(p) = s }. The difference between the length of the string and the length of this shortest program determines how random the string is: a string is said to be random if the length of the shortest program and the length of the string, both measured in bits, are not very different.

However, K is uncomputable because of the halting problem: one cannot always find the shortest program in finite time, since that would require running every candidate program, and some of them may never halt, so we might have to wait forever. In practice, one uses more pragmatic approaches, such as lossless compression algorithms. Lossless here means that no information is lost in compression, and decompression retrieves exactly the original object. The length of the compressed version of an object is an upper bound on its algorithmic complexity: the actual algorithmic complexity of a string cannot be greater than its compressed length, and thus finding a short representation of an object by compressing it is a sufficient test of non-randomness.
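As a rough illustration of this compression-based upper bound, here is a minimal Wolfram Language sketch. The helper name compressedLengthInBits and the choice of 2000-bit test strings are ours, and Compress[] is only a generic lossless encoder, so the figure it gives is an approximate upper bound with some fixed overhead, not K itself.

  (* Illustrative helper: upper-bound estimate of K(s) as the length in bits of the
     lossless Compress[] encoding, counting 8 bits per character of the compressed string. *)
  compressedLengthInBits[s_String] := 8*StringLength[Compress[s]]

  (* A highly regular 2000-bit string versus a pseudo-random one. *)
  regular = StringRepeat["01", 1000];
  random = StringJoin[ToString /@ RandomInteger[{0, 1}, 2000]];

  (* Treating each character of a binary string as one bit of original length, the regular
     string typically compresses well below 2000 bits, while the pseudo-random one does not. *)
  {StringLength[regular], compressedLengthInBits[regular]}
  {StringLength[random], compressedLengthInBits[random]}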
For finite strings, it is sufficient to calculate the length in bits of the compressed version and compare it to the length of the original string, as in the sketch above. A non-random binary string, such as a long alternation of 0s and 1s, will be highly compressible, but a pseudo-random sequence is much harder to compress: in that case the compressed length in bits is actually not shorter than the uncompressed length.

The problem, of course, is the case in which no compressed version is found. Because K is not computable, we cannot guarantee that the lack of a compressed version of an object means that no such version exists, and so not much can be said about it. Because of this limitation, it is in general far more informative to find that something is compressible than to find that it is not.

As a generalisation to ever-growing sequences, a sequence s can be said to be algorithmically random if none of its initial segments is compressible by more than a constant, so that from some length on the growing sequence is essentially incompressible. This can be written as follows: there exists a constant c such that, for every n, K(s_1 s_2 ... s_n) ≥ n − c.

This means, for example, that unlike for random strings, for a non-random string such as 0 and 1 alternating two thousand times, one can find a constant that makes all initial segments have low algorithmic complexity from some point on. From initial segment number 80 on, for example, all compressed versions are shorter than the initial segments themselves, and after adding the constant 80, all compressed initial segments are shorter than the corresponding initial segments; 176 bits is then an upper bound on the algorithmic complexity of this sequence. For a random sequence no such constant can be found, as can be illustrated with a pseudo-random string of 2000 bits. The Kolmogorov complexity characterisation thus conveys the intuition that a random sequence is not compressible.

You can see how Algorithmic Information Theory draws heavily on the theory of computation, as founded by Alan Turing, by looking at the computer programs behind the data, that is, the programs producing that data. Algorithmic complexity can help formulate and tackle questions that traditional tools from information theory are poorly equipped to deal with. For example, the string 01101001100101101001011001101001... may look random when viewed through the lens of Shannon entropy, but the sequence is simply generated by starting with 0 and successively appending the Boolean complement of the sequence so far, that is, 1 where there is a 0 and 0 where there is a 1. Another look at the way this string is generated reveals the process: you start from 0; the Boolean complement of 0 is 1, so you end up with the sequence 01; then you take the Boolean complement of 01, which is 10, and append it to the sequence, giving 0110; and so on. The resulting infinite sequence is called the Thue-Morse sequence, after Axel Thue and Marston Morse; a short program that builds it is sketched at the end of this passage.

Despite the potentially infinite length of this sequence, the generating mechanism is a very short computer program of fixed length that can generate every bit, analogous to the short description we just gave in plain English. Such a computer program, however, would never be taken into consideration by Shannon entropy, which would assign the sequence a large value because the number of its different substrings keeps growing, in spite of the fact that the generating program remains the same small size. Shannon entropy and algorithmic complexity use and work with the same units: bits.
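Returning to the Thue-Morse construction just described, here is a minimal Wolfram Language sketch of it (the variable name thueMorse and the choice of five doubling steps are ours): starting from {0}, each step appends the Boolean complement of the list built so far.

  (* Build the first 2^5 = 32 bits of the Thue-Morse sequence by starting from {0}
     and repeatedly appending the complement (1 - bit) of the sequence so far. *)
  thueMorse = Nest[Join[#, 1 - #] &, {0}, 5]
  (* {0,1,1,0,1,0,0,1,1,0,0,1,0,1,1,0,1,0,0,1,0,1,1,0,0,1,1,0,1,0,0,1} *)

The program stays the same size however many doubling steps are requested, which is exactly why the algorithmic complexity of the sequence remains small even though entropy-based estimates keep growing with its length.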
But although the units are the same, information is interpreted in very different ways. For example, algorithmic complexity considers individual objects independently of any probability distribution. If you do not know the possible deterministic source of a sequence and assume a uniform distribution, different initial segments of a deterministic sequence such as the Fibonacci sequence have different entropy values, because the larger the numbers, the more bits are required to encode them in binary. For algorithmic complexity, however, the size of the generating program remains the same: the Fibonacci sequence has a low and constant algorithmic complexity, and every segment of the Fibonacci sequence has essentially the same algorithmic complexity. This example reveals a great difference: two messages can have different entropy and yet the same algorithmic complexity.

So, what are the most salient properties of algorithmic complexity? The central idea is that a sequence of bits is random if there is no computer program whose length in bits is shorter than the sequence itself; this basically means that random sequences are those that cannot be compressed.

By a simple combinatorial argument, one can see that almost all strings are random, that is, that the shortest program producing a string is rarely much shorter than the string itself. For instance, there are 2^20 = 1,048,576 strings of 20 bits, but only 2^20 − 1 = 1,048,575 binary programs shorter than 20 bits, so at least one string cannot be compressed at all. Programs shorter than 19 bits are significantly fewer, only 2^19 − 1 = 524,287, so at most half of the 20-bit strings can be compressed by more than one bit, at most a quarter by more than two bits, and so on (see the short sketch further below). Most strings are therefore close to maximal complexity, because they are compressible by at most a few bits. This means that one cannot pair off all n-bit binary strings with binary programs of much shorter length, because there simply are not enough short programs to encode all strings as shorter strings, even under optimal circumstances.

Another important property of algorithmic complexity is commonly seen as its greatest burden: its uncomputable nature. A function is uncomputable if there is no Turing machine that is guaranteed to produce an output for all inputs, or, in other words, if the machine computing the function does not halt for some inputs. For Kolmogorov complexity this means that the function s → K(s) has no effective procedure (no Turing machine): there is no general procedure that, given a specific string, can generate the shortest program producing that string. This uncomputability of the function s → K(s) is, however, also the source of its greatest strength. Contrary to the common belief that the greatest burden of K is its uncomputability, it is precisely its uncomputability that provides K with its greatest power: Algorithmic Information Theory proves that no computable measure can find all possible regularities among all possible infinite sequences, or even among all finite strings.
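Returning to the counting argument above, the numbers can be checked directly; this is a small illustrative sketch of our own.

  (* Number of binary strings of length 20. *)
  2^20                        (* 1048576 *)

  (* Number of binary programs shorter than 20 bits: 2^0 + 2^1 + ... + 2^19. *)
  Total[2^Range[0, 19]]       (* 1048575, one fewer than the number of strings *)

  (* Number of binary programs shorter than 19 bits. *)
  Total[2^Range[0, 18]]       (* 524287, roughly half the number of strings *)

  (* Upper bound on the fraction of 20-bit strings compressible by more than c bits,
     i.e. admitting a program shorter than 20 - c bits. *)
  Table[{c, N[(2^(20 - c) - 1)/2^20]}, {c, 1, 4}]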
No computable measure can be up to this task because there is an uncountable number of possible regularities over the set of all possible finite strings (the set of strings is infinite, so it has uncountably many subsets, each of which can be seen as a candidate regularity), while there is only a countable number of possible Turing machines (or computer programs), so there are not enough of them to spot every possible regularity; any measure up to the task therefore has to be uncomputable.

It is more precise and appropriate to refer to the uncomputability of the function s → K(s) as semi-computability, because one can actually approximate K(s) from above, that is, one can calculate upper bounds on K. One traditional way to calculate upper bounds on K is, as we said before, to use lossless compression algorithms. A trivial upper bound on K for any string s is simply the program print(s), whose length is the length of s plus a small constant. If a string s admits no program shorter than print(s), then s can be said to be incompressible, or algorithmically random. A proof of the uncomputability of K is sometimes explained using Berry's paradox.

Popular lossless compression algorithms such as Lempel-Ziv and Compress, which are behind formats such as zip, are regularly used to estimate algorithmic complexity. As we said, the usefulness of lossless compression algorithms as a method for approximating algorithmic complexity derives from the fact that compression is a sufficient test for non-randomness. For example, the difference in length between the compressed and uncompressed forms of the output of a cellular automaton is an approximation of its algorithmic complexity. In most cases, the length of the compressed form levels off, indicating that the cellular automaton's output is repetitive and can easily be described. However, in cases like rules 30, 45, 110 or 73, the length of the compressed form grows rapidly, corresponding to the apparent randomness and lack of structure in the display.

At the same time, according to algorithmic complexity, the shortest description of an unfolding object is the rule from which it evolves. So the algorithmic complexity of each of the 256 Elementary Cellular Automata (ECA) is no larger than about 8 bits plus the size of the CellularAutomaton[] function. That does not mean that all these rules are equally algorithmically random or complex: rule 0, for example, which evolves into all blank cells, can be encoded in fewer than 8 bits plus the size of CellularAutomaton[], and is thus very likely less algorithmically complex than, say, rule 30. The Elementary Cellular Automaton rules can accordingly be ranked in a table from highest compressibility, and thus lowest estimated algorithmic complexity by lossless compression, to lowest compressibility, or highest estimated algorithmic complexity.
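One rough way to produce such a ranking can be sketched as follows; the helper name ecaCompressedLength, the choice of 100 steps from a single black cell, and the use of Compress[] as an upper-bound proxy for K are all our own illustrative choices.

  (* Upper-bound proxy for the algorithmic complexity of an elementary cellular automaton:
     the length of the Compress[] encoding of its first 100 rows, started from a single 1. *)
  ecaCompressedLength[rule_Integer] :=
    StringLength[Compress[CellularAutomaton[rule, {{1}, 0}, 100]]]

  (* All 256 elementary rules, from most compressible (lowest estimated complexity)
     to least compressible (highest estimated complexity). *)
  SortBy[Range[0, 255], ecaCompressedLength]

  (* Rule 0 (all blank cells) typically yields a far shorter encoding than rule 30. *)
  ecaCompressedLength /@ {0, 30}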