So in the previous lectures we saw how computable measures such as Shannon entropy, and even lossless compression, have some limitations. Here we will introduce new concepts and alternative methods to measure algorithmic complexity by means completely different from those of classical information theory, and different from, or complementary to, the use of lossless compression algorithms to approximate algorithmic complexity. We will also see how measuring algorithmic complexity can help in analysing real-world data and in producing candidate mechanistic models, and therefore how it contributes to the discussion of causation.

By now you should remember the most basic definition of algorithmic or Kolmogorov-Chaitin complexity: the algorithmic complexity of a string or object s is the length of the shortest computer program p that, running on a Turing machine U, produces s. Another suggestive way to read this formula is as follows. The string or object s can be an observation of the world; then we aim at finding models, such as the computer program p, that are generating mechanisms, because they are computer programs that can be followed mechanistically, step by step, when running on a Turing machine. This interpretation is the same as the one suggested by another area called Computational Mechanics, introduced by James Crutchfield and others.

As we know, algorithmic complexity is uncomputable because there is no algorithm to find, and prove, that a program p is actually the shortest. So what others have done before is to reduce the power of the Turing machine on which the programs run, and thus of the computer programs themselves. For example, if the Turing machine U is replaced by a finite automaton, a much simpler machine, then the quest to find a short, or even the shortest, finite automaton producing a string becomes feasible. But this comes at a cost: the simpler the machine, the easier it is to find those programs, but the fewer features of the observation those programs will capture, to the point where, just like Shannon entropy and other computable measures, such restricted models mostly capture statistical properties. They are still very interesting because they offer mechanistic models, that is, an automaton diagram from which more data can be produced to test a model.

You can also see immediately from the definition how algorithmic complexity captures Ockham's razor, that is, the assumption that the simplest generating model is also the most likely; algorithmic complexity formalises Ockham's razor by taking simple to mean shortest. This is usually considered a reasonable assumption in science, and it can even be said to drive science itself, as we try to find simple explanations for complex phenomena. From the definition one can also see how algorithmic complexity offers a way to produce a mechanistic model that is constructible by algorithmic means, as both p and U are realisable, meaning that one can actually run them on a digital computer and follow them step by step. One can also see how the model has to be compatible with the observation, because of the equality between model output and observation, and how this is possibly a very strict requirement: we would readily accept models that conform with the data, not necessarily a model that reproduces every aspect of the data exactly, as the pure form of algorithmic complexity would require.
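The formula being read here is the standard definition of algorithmic complexity; written out, it is

$$ K_U(s) = \min\{\, |p| \;:\; U(p) = s \,\}, $$

where $|p|$ denotes the length of the program $p$ in bits.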
In some ways this wouldn't even be desirable, because observations sometimes come with noise and we may not want the noise to be represented in the model. And while K is uncomputable, there are means to approximate it because, as we have seen before, it is semi-computable from above, which means, as you already know by now, that one can find upper bounds and make continuous improvements towards the values of K. So, if we are not proposing the popular lossless compression algorithms, how do we propose to calculate K? The answer is by exploiting algorithmic probability and the fundamental Coding theorem. The method is based on the idea that, contrary to the big-data dictum, small data actually matters, especially for causation. The main idea is that one can run all computer programs up to some size to see how many of them generate the same strings, and then continue running more programs to produce more data, that is, more strings and their associated frequencies, providing estimations of their algorithmic probability. Then, by the algorithmic Coding theorem, we can convert those frequencies into an estimation of the algorithmic complexity of each string. The programs will be generative mechanisms, candidate models explaining the data, and the complexity values will serve as guides for comparing models across data.

As a reminder, the algorithmic Coding theorem establishes the connection between the complexity of an object and its algorithmic probability of being produced by a random computer program. The intuition behind the Coding theorem is the beautiful relationship between program-size complexity and frequency of production: the less random an object, the more likely it is to be produced by a short program. There is a greater probability of producing the program that produces the object, especially if that program is short. Therefore, if an object is compressible, then the chances of producing the compressed program that produces said object are greater than they would be for a random object. By way of the algorithmic Coding theorem one can thus approximate the algorithmic complexity of an object. So what we do is look at the empirical output distribution of trillions of small computer programs and count how many times each object is produced. Objects produced many times will have high algorithmic probability and therefore low algorithmic complexity, and vice versa.

Here are two key formulas for understanding the method. Let (n,2) be the set of all Turing machines with n states and 2 symbols, and let D be the function that assigns to every finite binary string s the number of machines that produce s divided by the total number of machines of that size. Then D(s) is an approximation of the algorithmic probability of s, and by the Coding theorem we can obtain K(s). For this reason we call our algorithm CTM, standing for Coding Theorem Method. The greater the value of n we use to calculate D(s), the better the approximation to both the algorithmic probability of s and K(s). D(n) can also be seen as averaging K over a large set of possible computer languages, each represented by a small Turing machine, in order to reduce the possible impact of the constant from the invariance theorem that we covered two units ago. Another way to see the method is as averaging over a very large number of short computer programs whose frequency is an indication of their complexity according to the Coding theorem.
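In symbols, and following the lecture's own description (with the total number of machines of that size in the denominator), the two key formulas are

$$ D_n(s) = \frac{|\{ T \in (n,2) : T \text{ halts and outputs } s \}|}{|\{ T \in (n,2) \}|}, \qquad K(s) \approx -\log_2 D_n(s), $$

where the subscript $n$ makes explicit the dependence on the machine size, written D(n) above.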
The CTM allows us not only to build an empirical distribution of computer program outputs from smaller to larger sizes; once a string is generated for the first time among all the computer programs of the smallest size, we also know which Turing machine is the smallest one producing that string, as well as the runtime of that Turing machine.

Here you can see a flow diagram of the algorithm behind our method. The idea is to start from the smallest possible programs and follow an enumeration going through all possible computer programs of increasing length, because we know that computer programs, just like Turing machines, are enumerable or countable; length here is quantified by the number of states of the Turing machine implementing the computer program. Then we iterate through that enumeration, looking at the output of each computer program and incrementing a counter indicating how many times each output has been produced, effectively producing an empirical distribution, related to the Universal Distribution, that we will use to estimate the algorithmic probability of a string or an object.

So we start with computer program 0, meaning it carries the label 0, where 0 is nothing special but is one small program among all those that can be carried out by Turing machines with up to a certain number of states, in this case 2 states, because Turing machines with only 1 state do not compute anything. Then we exhaust all programs of that size, which makes it irrelevant which one we started with, before moving on to the computer programs that Turing machines with 3 states can run, and so on.

Notice that we only explore Turing machines with empty input, and you may wonder why, and how this could be a complete exploration of all computer programs disregarding possible inputs. This is because Turing universality tells us that one can hardcode programs into hardware and the other way around. This is related to the concept of the stored computer program: you can always copy a computer program into a computer, the program becomes part of the computer, and then you can call that program just by calling its name, in this case its number, equivalent to clicking an icon on your computer instead of inserting a floppy disk or a CD, because it is already in memory. So when we go through all possible programs, even with no input, we know we are really exploring the whole space of computer programs, including all those that can be fed, or even reprogrammed, with non-empty inputs.

Once we run each computer program, we then ask whether the machine has halted, which we cannot know in general, but in some cases we can, for example by using the values of the Busy Beaver. Because the Busy Beavers are the Turing machines that run the longest for a fixed number of states, if a computer program with that number of states has not halted by the running time of the Busy Beaver, then we know it will never halt and we can simply ignore it, because it will never produce any output. However, if it does halt, then we can look at its output and count how many times the same output has been produced before. Now you can see that some outputs will be produced with high frequency, because according to algorithmic complexity and the Coding theorem, if a string is very simple, such as a sequence of 0s, it will be produced by many short computer programs. And in a second we will see whether this is actually the case, because before this very simple experiment nobody knew.
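The following is a minimal sketch, in Python, of the loop just described, restricted to the tiny space of 2-state, 2-symbol machines. The machine encoding, the output convention (the visited tape cells) and the use of the total machine count as denominator are simplifications of my own for illustration; they do not reproduce the exact formalism used in the published CTM calculations.

```python
import math
from collections import Counter
from itertools import product

STATES = 2
SYMBOLS = (0, 1)
BB_STEPS = 6   # S(2) = 6: the longest a halting 2-state, 2-symbol machine runs
HALT = -1      # marker for the halting "state"

def run(machine, max_steps):
    """Run a machine on an empty (all-0) tape; return the visited tape cells
    as a string if it halts within max_steps, otherwise None."""
    tape, pos, state = {}, 0, 0
    visited = set()
    for _ in range(max_steps):
        visited.add(pos)
        write, move, nxt = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        if nxt == HALT:
            cells = range(min(visited), max(visited) + 1)
            return "".join(str(tape.get(i, 0)) for i in cells)
        state = nxt
    return None  # not halted by the Busy Beaver bound: it never will

# Every possible transition: (symbol to write, head move, next state or HALT).
choices = [(w, m, s) for w in SYMBOLS for m in (-1, 1)
           for s in list(range(STATES)) + [HALT]]
keys = list(product(range(STATES), SYMBOLS))   # (state, read symbol) pairs

counts, total = Counter(), 0
for table in product(choices, repeat=len(keys)):
    out = run(dict(zip(keys, table)), BB_STEPS)
    total += 1
    if out is not None:
        counts[out] += 1

# D(s) and the Coding-theorem estimate K(s) ~ -log2 D(s), most frequent first.
for s, c in counts.most_common(8):
    D = c / total
    print(f"{s!r:>8}  D(s)={D:.5f}  K(s)~{-math.log2(D):.2f} bits")
```

Running it, the most frequent outputs are the simplest strings, which is exactly the behaviour the Coding theorem predicts.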
With these results we can then use the Coding theorem and make continuous improvements on our estimations of both algorithmic probability and algorithmic complexity. Notice how powerful this is under the Church-Turing thesis, which establishes that everything that is computable is computable by a Turing machine: it means that we would eventually cover every possible model of natural phenomena if we had enough computational resources. We will see how we can partially circumvent, or strike a compromise between, the lack of sufficient resources and the power of this scheme.

Additionally, from this calculation we also know which was the first, and thus the shortest, computer program producing a string, and we also record the runtime of that Turing machine, which allows us to estimate another seminal measure of complexity, Bennett's Logical Depth (LD), approximated by this same Coding Theorem Method. As you may remember, Logical Depth is a measure of sophistication, able to tell apart sophistication or depth from both simplicity and randomness, defined as the computation time it takes to decompress an object, such as a string, from its shortest compressed version. With this method we can generate all small computer programs and average their running times to get a value based on this Logical Depth measure, which we will briefly use in the next unit in an application to a real-world problem related to cognition.

The calculation of these empirical distributions related to the Universal Distribution was anything but trivial, and in some ways it hasn't finished and never will: we are still exploring larger rule spaces of longer computer programs to produce a greater number of computable models of the world for our purposes in our quest to tackle causation. The calculation not only required all sorts of hacks when dealing with running trillions of small computer programs and producing huge numbers of strings and values, but we also had to do the calculation in the cleverest way possible to save time. For example, we did not run all programs, because we were able to know which programs would produce the same output by looking at their behaviour. For instance, for each program moving the head to one side of the Turing machine tape, we know there is the same number of programs going to the other side of the tape, so we could complete the picture by taking such symmetries into consideration.

To explore all the computer programs of up to 4 states, it took us about a week on a laptop to go through more than a trillion small computer programs, but to explore all possible computer programs implemented by Turing machines of up to 5 states, it took a medium-sized computer with 30 CPUs running for 3 months. Currently we are conducting the next run on one of the most powerful computers in the world, with thousands of CPUs, and we keep increasing the accuracy of our prior distribution, that is, the numerical approximation of a Universal Distribution. You can see in this table that we also explored other spaces, such as computer programs that print in 2 dimensions, and Turing machines with more than 2 symbols, to be able to quantify non-binary objects.
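As a rough illustration of how the recorded runtimes can feed a Logical Depth estimate, here is a minimal sketch. The `records` data are hypothetical, and averaging the runtimes of the smallest machines producing a string follows the description given in this lecture rather than any particular published implementation.

```python
# `records` is hypothetical example data: (machine size in states, steps to
# halt) for every machine found to produce one given string.

def logical_depth_estimate(records):
    """Average running time of the smallest machines producing the string,
    following the lecture's description of averaging running times."""
    shortest = min(size for size, _ in records)
    times = [steps for size, steps in records if size == shortest]
    return sum(times) / len(times)

records = [(2, 6), (2, 4), (3, 11), (3, 9)]
print(logical_depth_estimate(records))   # -> 5.0
```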
As an example of one validation and sanity check among many that we performed, we found that we were performing in line with other methods. For instance, we compared our distribution with the actual shortest computer programs found following the formalism shown in the previous flow diagram, because, as I said before, from CTM we obtain not only the frequency of production of a string, that is, the number of computer programs that produce it, but also the first time a string was produced, and thus the shortest computer program producing that string for that computing model. So this figure plots actual instruction size, that is, the number of instructions actually used in the shortest computer programs found, versus the estimated algorithmic probability of each of these strings. For strings of 2 to 12 bits we found that the correlation was very high, indicating that we were moving in the right direction. One advantage of this numerical approach to an empirical approximation of the Universal Distribution is that we do not rely only on having found one short computer program generating an object, but on averaging over all the short computer programs found, so the method is more robust than simply finding a single short computer program. This also allows us to generate a set of mechanistic models explaining the data.

Then we of course compared our results against compression, in particular the popular LZW method using the Compress function, and we also found a strong correlation. But we also found that compression is not only limited by the way in which popular lossless compression algorithms are implemented, but is also extremely insensitive, especially for small objects such as the short strings in this case, producing only a very coarse-grained view with a few distinct values for all the $2^n$ strings, in comparison to the much finer-grained values of algorithmic probability approximated by our methods. Yet what we found is that what was not complex for compression algorithms, that is, what was highly compressible, also had high algorithmic probability, consistent with the theoretical expectation, because remember that high algorithmic probability means low algorithmic complexity, which is exactly what high compressibility means. However, the converse was not as strong, and there we had to look further to find that the main difference was that our methods were identifying some strings as being of lower algorithmic complexity than the compression algorithm suggested, that is, strings that should be compressible but that the compression scheme failed to compress. We will see in the next lecture how we tested this and how it helps us to better deal with causality.

So what does the empirical distribution look like? It looks like this. You can see immediately that it seems to make sense: simpler strings appear at the top, and the values follow an exponential decay, exactly as predicted by algorithmic probability. We can see that most Turing machines produce a simple 0 or 1 and halt, and so on. And while you may think that this table is sorted by string length, it is not. Here on the right are some cases of longer strings that appear high in the ranking because they are simpler. We call these strings jumpers because they land far from their length group. By looking at this table and these jumpers one can see algorithmic probability in action.
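A small illustration (not the original experiment) of why generic lossless compression is too coarse-grained for short strings: compress every binary string of a given length and count how many distinct compressed sizes appear. Python's zlib (Deflate) is used here as a stand-in for whichever lossless compressor one prefers.

```python
import zlib
from itertools import product

n = 12
sizes = set()
for bits in product("01", repeat=n):
    s = "".join(bits).encode()
    sizes.add(len(zlib.compress(s, 9)))   # compressed length at maximum level

# Only a handful of distinct values for thousands of strings, far coarser
# than the fine-grained algorithmic probability estimates of CTM.
print(f"{2**n} strings of length {n} -> {len(sizes)} distinct compressed sizes")
```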
One other advantage of precomputing this empirical distribution is that, while its calculation is intractable even in the best case, being among the most intensive computations possible, bounded by, and indeed equivalent to, calculating the Busy Beaver function, which we know involves the most demanding computer programs, we only need to calculate it once. Once calculated, one can implement a lookup table with the strings themselves as keys, and the application becomes linear, each query being a simple table lookup. Of course, in the general case completing this distribution is uncomputable and thus unachievable, but this also means we will always have a job to keep us busy making improvements over previous estimations.

On the left we have the table without the machines that never halted, and on the right we have the machines that never halted, identified by the empty string epsilon. It turns out that 80% of the Turing machines in this space don't halt, and the greater the rule space and the more computer programs there are, the smaller the fraction that halt. Still, by increasing the number of computer programs we produce more strings and can cover more cases.

There is something else that may be of concern, namely the enumeration, or the Turing machine model chosen, and its effect on the empirical distribution. As you may remember from the sketch at the bottom, reproduced from one of the previous units, the invariance theorem tells us that we cannot be sure how fast the values of algorithmic complexity will converge in the face of changes of universal Turing machine or Turing-complete programming language, even though the same theorem also tells us that eventually all values will converge up to some constant. So this could be a concern for our approach: how can we know whether we are far off in our estimations, and whether this particular enumeration from shorter to longer computer programs is unstable at the beginning? Well, our numerical calculations suggest that this is not the case. We have shown that the method is well behaved and does not seem to diverge, because in exploring greater rule spaces, according to increasing Turing machine size, the output distributions converge, while the differences remain very small and thus under apparent control. Notice that no other method is immune to this problem, including lossless compression algorithms, where the analogous problem is the choice of one particular lossless compression algorithm over another; all we can do is check whether, in applying the methods, the results conform to the theoretical expectations, and in our case they do.
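A minimal sketch of the precompute-once-then-look-up idea; the values below are hypothetical placeholders, not real CTM estimates, and in practice the table would be loaded from the precomputed output distribution.

```python
# Hypothetical precomputed K estimates (in bits), keyed by the strings themselves.
ctm_table = {
    "0": 2.5,
    "1": 2.5,
    "01": 3.3,
    "0101": 5.0,
}

def ctm_complexity(s):
    """Return the precomputed estimate of K(s), or None if the string falls
    outside the table and would require extending the computation."""
    return ctm_table.get(s)

print(ctm_complexity("01"))              # a constant-time table lookup
print(ctm_complexity("001101010111"))    # None: not yet precomputed
```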