In this lesson, we will see how algorithmic complexity and algorithmic probability can contribute to today's trendy field of machine learning, including deep learning and deep neural networks. Judea Pearl, one of the most respected researchers in areas related to artificial intelligence, believes that to have truly intelligent machines we need to teach them cause and effect. In his criticism of contemporary artificial intelligence, he notes, as is also widely acknowledged, that current approaches fall very short at some tasks, even when they may have reached superhuman capabilities in other areas such as classification and optimization, including winning at some highly sophisticated board games. Yet machines and current statistical algorithms, including deep neural networks, fail at basic tasks such as inference, abstraction, causal explanation and model generation. So how do we make progress? It was remarkable to see how Marvin Minsky, another of the founders of artificial intelligence, acknowledged, a few months before passing away, the importance of algorithmic probability. This is a quote from a panel discussion in which he participated in 2014. It reads: "It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called algorithmic probability, which is a fundamental new theory of how to make predictions given a collection of experiences. Everybody should learn all about that and spend the rest of their lives working on it." And I'm glad to report that you are now one of those who have decided, perhaps not to devote their entire lives, but part of them, to actually learn about it, to start thinking of ways to use it, and perhaps even to convince everyone else to leave behind the practice of stretching tools from statistics and probability beyond their limits to deal with causation, and instead to embrace new tools based on and empowered by models of computation and model-driven approaches in the practice of everyday modern science. Here is another celebrated scientist, this time a molecular biologist, Sydney Brenner, also acknowledging how much biology needs to bring in concepts from fundamental computer science and Turing's work. If you put together these three people and the works of Solomonoff, Chaitin and Kolmogorov, this is really a no-brainer: we are heading in the right direction, and we are opening new ground with state-of-the-art theory and methods to disrupt the way in which we do science. We have come a long way, covering many areas of science connected to algorithmic information dynamics, areas on which we stand and which we have put together to contribute to probably the most important topic in science, that of causation. And it is in this sense that we think we can also make contributions to the current area of machine learning, with what we call causal deconvolution and program synthesis, that is, model generation, but also to more specific challenges in the area, such as hyperparameter tuning, feature selection, data dimension reduction and faster optimisation techniques. The main idea behind these contributions is the replacement of methods and functions currently based on traditional statistics with methods rooted in algorithmic probability and based on approximations to the Universal Distribution. As we have seen before, algorithmic probability provides the accepted definition of randomness through its equivalent formulation, algorithmic complexity.
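As a brief reminder of the formal statement behind this equivalence (the definition of algorithmic probability due to Solomonoff and Levin, and the coding theorem), the algorithmic probability of a string x and its relation to algorithmic (Kolmogorov) complexity can be written as

$$ m(x) \;=\; \sum_{p\,:\,U(p)=x} 2^{-|p|}, \qquad K(x) \;=\; -\log_2 m(x) + O(1), $$

where U is a prefix-free universal Turing machine and |p| is the length in bits of a program p that outputs x.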
Algorithmic probability also defines optimal prediction, or optimal inference, and connects and advances classical probability with the use of modern computation by introducing the idea of computability. As we have seen, the main idea of algorithmic probability is to quantify the probability that a random computer program produces a piece of data. We have also seen how all of this addresses today's major challenge in science, that of causality, of finding causes for effects in all areas of science. And by now, I hope you can see how universal the theory is, independent of the model of computation, and how it connects to the practice of science in many, if not most, possible ways. One first contribution was conceived to help with the challenge of deconvolving systems. When we observe nature, we rarely see systems in isolation. They are always interacting with each other, and when looking at these convoluted signals it is key to be able to separate them, in the ideal case by their most likely origin or source, in order to study and eventually manipulate them. So this algorithm we recently introduced traverses signals within an observation window to try to separate the regions that are likely to come from the same causal origin. This is done in the same way we have explored before: by trying to explain every part of the observation by an associated underlying candidate model, inferred using CTM and BDM. Then we compare those models across a sliding window to separate regions into their most likely algorithmic origins. When we tested these methods on strings and on dynamical systems producing highly intertwined images, we were able to separate them. In figures A and B, for example, we had a string with two different regimes: one part was generated randomly and the other was very simple, generated by a very simple algorithm. The idea illustrated is that we can find an inflection point indicating two groups, and then each element of each group can be traced back to a similar origin, hence providing a clue as to how each part may have been generated. Figures C and D show, for example, the transition diagram of the simple Turing machine that can explain the regular region of the string in figure A. The difference between figures A and B is that we reversed the same string to show that we can still find such an inflection point and still colour each part correctly by its common mechanistic source. In figures F and G, we did the same with two interacting cellular automata. In this case, even when there is a rule determining the interaction between the two cellular automata rules, one completely dominates the other. But in more complicated cases, like the one in figure A2, it can be seen in figures D2 and E2 how the algorithm does a good job at separating each region. We also tested the same procedure using other measures, applying the same approach with Shannon entropy and compression algorithms, and they were either unable to do it properly or failed completely. In figure B2 you can see the two original ECAs, or Elementary Cellular Automata, and in figures D2 and E2 the separation obtained with our deconvolution method. Figure F2 is a sanity check, showing how the left and right sides have different algorithmic properties according to the BDM method. Figure C2 only shows what we see as observers: the last step of the evolution of two convolved systems such as cellular automata, where no difference coming from one side or the other is apparent.
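To make the idea concrete, here is a minimal sketch of the sliding-window deconvolution, using a crude compression-based stand-in (zlib) for the CTM/BDM estimates used in the actual method; the window size, the step and the use of the largest gap in the ranked complexities as the inflection point are illustrative assumptions, not the original implementation.

```python
import random
import zlib

def complexity(block: str) -> float:
    """Crude stand-in for a CTM/BDM estimate: compressed length in bits."""
    return 8 * len(zlib.compress(block.encode()))

def deconvolve(signal: str, window: int = 32, step: int = 8):
    """Slide a window over the signal, rank the windows by estimated
    complexity, and split them at the largest gap in the ranked values
    (the 'inflection point') into two groups of likely distinct causal origin."""
    starts = range(0, len(signal) - window + 1, step)
    ranked = sorted((complexity(signal[i:i + window]), i) for i in starts)
    gaps = [ranked[k + 1][0] - ranked[k][0] for k in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1                # inflection point
    low = sorted(i for _, i in ranked[:cut])       # window starts, simple group
    high = sorted(i for _, i in ranked[cut:])      # window starts, complex group
    return low, high

# Toy example: a very regular segment followed by a pseudo-random one.
random.seed(0)
s = "01" * 64 + "".join(random.choice("01") for _ in range(128))
low, high = deconvolve(s)
print("likely simple origin:", low)
print("likely random origin:", high)
```

On this toy string, the low-complexity windows should mostly fall in the first, regular half and the high-complexity ones in the random half, mirroring the separation by likely causal origin described above.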
We also tested the deconvolution method on networks that were generated by different mechanistic sources. One was generated by preferential attachment, producing a small scale-free network. We connected it to both a complete graph and a random Erdős–Rényi network. Red edges show the elements that our algorithm is able to detect or identify as the ones that should be broken in order to separate the major components, with most elements associated to one or another network by their most likely source. Below each case, there are histograms showing how the extraction of each network corresponded exactly with the degree sequence distribution of that network, basically making a perfect match. We tested the algorithm while increasing the random connections between these networks, and even connecting more networks, and in general it did a very good job. In figures A and B one can see how the same algorithm can be used to partition objects. The main idea is perturbation analysis, separating neutral elements from negative and positive ones. Here is a slightly more sophisticated case, not too complicated, so that it can still be understood and used as an illustration of the method, but less trivial than the previous one. In figure A, we have a network composed of three other networks, each with different topological properties as a result of being generated in different ways. One is a star graph, another one is a random graph and another one is a complete graph. In principle, you may not even notice the difference between the networks if the layout does not help you visualise it, as in figure A. But one can take the signature, the information signature, of the network, as we have done in previous lessons and units, and it is shown in figure C in grey. Then the steps shown by the signature are indications of the way in which the network has to be broken into its most likely common sources, and they can be detected by taking the differences of consecutive signature values, as shown in figure C with the orange curve. Now, we have that anything above the logarithm of a constant is not part of the evolution of the original system. So we use it as a cut-off, and any value above that baseline can be decomposed, as shown in figure D. Figure D shows the signature, but now broken up and coloured corresponding to the elements in the original network. And there we recover exactly the desired decomposition of the original network. The baseline also gives us a natural termination criterion for the algorithm, that is, when to stop decomposing the system further. If there are no more peaks over the baseline, shown in black in figure C, we simply stop the program. This means that the algorithm is unsupervised and parameter-free. There is no need to select any features of interest or to make any arbitrary choices. So, as you can see, the curve in orange suggests that there should be about four components, because those values are above the black baseline curve. And that is exactly the case, because we have three graphs and then the edges that connect those graphs.
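Before moving on, here is a minimal sketch of the signature-and-baseline idea on a composite graph, again with a compression-based placeholder for BDM; the networkx toy graphs, the bridge edges and the constant used for the log baseline are illustrative assumptions, not the actual implementation.

```python
import math
import zlib
import networkx as nx

def graph_complexity(g: nx.Graph) -> float:
    """Placeholder for a BDM estimate: compressed size (in bits) of the
    upper triangle of the adjacency matrix."""
    nodes = sorted(g.nodes())
    bits = "".join("1" if g.has_edge(u, v) else "0"
                   for i, u in enumerate(nodes) for v in nodes[i + 1:])
    return 8 * len(zlib.compress(bits.encode()))

def information_signature(g: nx.Graph):
    """Information value of each edge: the change in estimated complexity
    when that edge is removed, sorted from highest to lowest."""
    base = graph_complexity(g)
    sig = []
    for e in g.edges():
        h = g.copy()
        h.remove_edge(*e)
        sig.append((base - graph_complexity(h), e))
    return sorted(sig, reverse=True)

def candidate_cuts(sig, constant=2.0):
    """Edges where the difference between consecutive signature values
    rises above the log-of-a-constant baseline."""
    baseline = math.log2(constant)
    diffs = [sig[k][0] - sig[k + 1][0] for k in range(len(sig) - 1)]
    return [sig[k][1] for k, d in enumerate(diffs) if d > baseline]

# Toy composite network: a complete graph, a star and a random graph,
# loosely joined by two bridge edges.
g = nx.disjoint_union_all([nx.complete_graph(8),
                           nx.star_graph(7),
                           nx.erdos_renyi_graph(8, 0.3, seed=1)])
g.add_edges_from([(0, 8), (9, 16)])
print(candidate_cuts(information_signature(g)))
```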
Another application is data dimension reduction. This is because we can identify the regions of an image that contribute less to the candidate models able to explain the original image. So we are able to perform some sort of highly optimal reduction of the data, theoretically the most optimal possible, limited only by our own approximations, because the aim of the algorithm is to preserve all the features of the data that contribute to its description. Here in figures A and B are two examples where the parts that contribute less to the algorithmic description of each elementary cellular automaton are coloured in light grey, and where you can see how the algorithm can keep reducing the data in several steps, preserving the main features until the end. These two examples are quite simple, but when applied to more sophisticated data, including networks, we have shown that the method outperforms the best algorithms for network reduction, which is, by the way, called 'sparsification'. Again, in this method there are no parameters involved, nor any need to supervise it. In fact, the termination criterion is very similar to that of the previous algorithm: as soon as removing more parts leads to a greater loss of algorithmic information, we simply stop. We continue only if we wish to reduce the data further, but that is an option, not a requirement for the algorithm to work properly. One can also see how algorithmic feature selection and model generation can complement the current area of deep learning and some of its methods, for example by reducing the training sets and focusing on causal elements rather than simply on combinatorial statistics. It can also trim models and help choose the right parameters of the model, known as hyperparameters, that is, for example, the size of the network, the number of layers or the precision of the weights, and so on. The feature selection would be very similar to the application that we saw before with the investigation of eigenvalues in graph spectra, where we can find which elements are the most informative to learn from and which are not even worth considering. In some way, and together with many other ideas we are exploring and pursuing, these methods are introducing cause and effect into machine learning. So, soon, in the algorithmic complexity calculator, you will be able to provide your data and the calculator will retrieve a set of very abstract, general candidate models able to explain the data, together with features that may have been extracted in the process and identified as key for the causal explanation of the system. The take-home message is that with all our tools and methods based on algorithmic probability, we can make significant contributions to introducing causation into machine and deep learning, and make methods and approaches such as deep neural networks more robust. We call this approach algorithmic machine learning, or algorithmic machine intelligence. In the next lecture, we will see some other possible directions of future research related to these ideas.
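Finally, to make the reduction method's termination criterion concrete, here is a minimal sketch under the same toy assumptions as the earlier sketches (zlib standing in for BDM, and an illustrative loss threshold in bits); it is not the actual sparsification algorithm, only the stop-when-information-loss-jumps idea.

```python
import zlib
import numpy as np

def info(x: np.ndarray) -> float:
    """Placeholder for a BDM estimate: compressed size (in bits) of the array."""
    return 8 * len(zlib.compress(np.ascontiguousarray(x).tobytes()))

def reduce_data(x: np.ndarray, max_loss_bits: float = 8.0) -> np.ndarray:
    """Greedily delete the row whose removal loses the least estimated
    information, and stop as soon as any further deletion would lose more
    than the threshold (the termination criterion described above)."""
    current = x.copy()
    while current.shape[0] > 1:
        base = info(current)
        losses = [(base - info(np.delete(current, r, axis=0)), r)
                  for r in range(current.shape[0])]
        loss, row = min(losses)
        if loss > max_loss_bits:
            break
        current = np.delete(current, row, axis=0)
    return current

# Toy data: a repetitive binary array with two distinctive (random) rows.
rng = np.random.default_rng(0)
x = np.tile(np.array([0, 1] * 8, dtype=np.uint8), (20, 1))
x[5] = rng.integers(0, 2, size=16, dtype=np.uint8)
x[12] = rng.integers(0, 2, size=16, dtype=np.uint8)
print(x.shape, "->", reduce_data(x).shape)
```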