In this lesson, we will see how algorithmic complexity and algorithmic probability can contribute to today's trendy field of machine learning, including deep learning and deep neural networks. Judea Pearl, one of the most respected researchers in areas related to artificial intelligence, believes that to have truly intelligent machines we need to teach them cause and effect. In his criticism of contemporary artificial intelligence, he notes, as is also widely acknowledged, that current approaches fall very short at some tasks, even when they may have reached superhuman capabilities in other areas such as classification and optimization, including winning at some highly sophisticated board games. Yet machines and current statistical algorithms, including deep neural networks, fail at basic tasks such as inference, abstraction, causal explanation and model generation. So how do we make progress? It was remarkable to see how Marvin Minsky, another of the founders of artificial intelligence, acknowledged, a few months before passing away, the importance of algorithmic probability. This is a quote from a panel discussion in which he participated in 2014. It reads: "It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called algorithmic probability, which is a fundamental new theory of how to make predictions given a collection of experiences. Everybody should learn all about that and spend the rest of their lives working on it." And I'm glad to report that you are now one of those who have decided, perhaps not to devote their entire lives, but part of them, to actually learn about it, to start thinking of ways to use it, and perhaps even to convince everyone else to leave behind the practice of stretching tools from statistics and probability beyond their limits to deal with causation, and instead to embrace new tools based on and empowered by models of computation and model-driven approaches in the practice of everyday modern science. Here is another celebrated scientist, this time a molecular biologist, Sydney Brenner, also acknowledging how much biology needs to bring in concepts from fundamental computer science and Turing's work. If you put together these three people and the works of Solomonoff, Chaitin and Kolmogorov, this is really a no-brainer: we are heading in the right direction, and we are opening new ground with state-of-the-art theory and methods to disrupt the way in which we do science. We have come a long way, covering many areas of science connected to algorithmic information dynamics, areas on which we stand and which we have put together to contribute to probably the most important topic in science, that of causation. And it is in this sense that we think we can also make contributions to the current area of machine learning, with what we call causal deconvolution and program synthesis, that is, model generation, but also to more specific challenges in the area, such as hyperparameter tuning, feature selection, data dimension reduction and faster optimisation techniques. The main idea behind these contributions is the replacement of methods and functions currently based on traditional statistics with methods rooted in algorithmic probability and based on approximations to the Universal Distribution. As we have seen before, algorithmic probability provides the accepted definition of randomness through its equivalent formulation, algorithmic complexity.
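As a brief reminder of the formal statement behind this equivalence (the definition of algorithmic probability due to Solomonoff and Levin, and the coding theorem), the algorithmic probability of a string x and its relation to algorithmic (Kolmogorov) complexity can be written as

$$ m(x) \;=\; \sum_{p\,:\,U(p)=x} 2^{-|p|}, \qquad K(x) \;=\; -\log_2 m(x) + O(1), $$

where U is a prefix-free universal Turing machine and |p| is the length in bits of a program p that outputs x.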
Algorithmic probability also defines optimal prediction, or optimal inference, and connects and advances classical probability with the use of modern computation by introducing the idea of computability. As we have seen, the main idea of algorithmic probability is to quantify the probability that a random computer program produces a piece of data. We have also seen how all of this addresses today's major challenge in science, that of causality, of finding causes for effects in all areas of science. And by now, I hope you can see how universal the theory is, independent of the model of computation, and how it connects to the practice of science in many, if not most, possible ways. One first contribution was conceived to help with the challenge of deconvolving systems. When we observe nature, we rarely see systems in isolation. They are always interacting with each other, and when looking at these convoluted signals it is key to be able to separate them, in the ideal case by their most likely origin or source, in order to study and eventually manipulate them. So this algorithm we recently introduced traverses signals within an observation window to try to separate the regions that are likely to come from the same causal origin. This is done in the same way we have explored before: by trying to explain every part of the observation by an associated underlying candidate model, inferred using CTM and BDM. Then we compare those models across a sliding window to separate regions into their most likely algorithmic origins. When we tested these methods on strings and on dynamical systems producing highly intertwined images, we were able to separate them. In figures A and B, for example, we had a string with two different regimes: one part was generated randomly and the other was very simple, generated by a very simple algorithm. The idea illustrated is that we can find an inflection point indicating two groups, and then each element of each group can be traced back to a similar origin, hence providing a clue as to how each part may have been generated. Figures C and D show, for example, the transition diagram of the simple Turing machine that can explain the regular region of the string in figure A. The difference between figures A and B is that we reversed the same string to show that we can still find such an inflection point and still colour each part correctly by its common mechanistic source. In figures F and G, we did the same with two interacting cellular automata. In this case, even when there is a rule determining the interaction between the two cellular automata rules, one completely dominates the other. But in more complicated cases, like the one in figure A2, it can be seen in figures D2 and E2 how the algorithm does a good job at separating each region. We also tested the same procedure using other measures, applying the same approach with Shannon entropy and compression algorithms, and they were either unable to do it properly or failed completely. In figure B2 you can see the two original ECAs, or Elementary Cellular Automata, and in figures D2 and E2 the separation obtained with our deconvolution method. Figure F2 is a sanity check, showing how the left and right sides have different algorithmic properties according to the BDM method. Figure C2 only shows what we see as observers: the last step of the evolution of two convolved systems such as cellular automata, where no difference coming from one side or the other is apparent.
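To make the idea concrete, here is a minimal sketch of the sliding-window deconvolution, using a crude compression-based stand-in (zlib) for the CTM/BDM estimates used in the actual method; the window size, the step and the use of the largest gap in the ranked complexities as the inflection point are illustrative assumptions, not the original implementation.

```python
import random
import zlib

def complexity(block: str) -> float:
    """Crude stand-in for a CTM/BDM estimate: compressed length in bits."""
    return 8 * len(zlib.compress(block.encode()))

def deconvolve(signal: str, window: int = 32, step: int = 8):
    """Slide a window over the signal, rank the windows by estimated
    complexity, and split them at the largest gap in the ranked values
    (the 'inflection point') into two groups of likely distinct causal origin."""
    starts = range(0, len(signal) - window + 1, step)
    ranked = sorted((complexity(signal[i:i + window]), i) for i in starts)
    gaps = [ranked[k + 1][0] - ranked[k][0] for k in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1                # inflection point
    low = sorted(i for _, i in ranked[:cut])       # window starts, simple group
    high = sorted(i for _, i in ranked[cut:])      # window starts, complex group
    return low, high

# Toy example: a very regular segment followed by a pseudo-random one.
random.seed(0)
s = "01" * 64 + "".join(random.choice("01") for _ in range(128))
low, high = deconvolve(s)
print("likely simple origin:", low)
print("likely random origin:", high)
```

On this toy string, the low-complexity windows should mostly fall in the first, regular half and the high-complexity ones in the random half, mirroring the separation by likely causal origin described above.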
We also tested the deconvolution method on networks that were generated by different mechanistic sources. One was generated by preferential attachment, producing a small scale-free network. We connected it to both a complete graph and a random Erdős–Rényi network. Red edges show the elements that our algorithm is able to detect or identify as the ones that should be broken in order to separate the major components, with most elements associated to one or another network by their most likely source. Below each case, there are histograms showing how the extraction of each network corresponded exactly with the degree sequence distribution of that network, basically making a perfect match. We tested the algorithm while increasing the random connections between these networks, and even connecting more networks, and in general it did a very good job. In figures A and B one can see how the same algorithm can be used to partition objects. The main idea is perturbation analysis, separating neutral elements from negative and positive ones. Here is a slightly more sophisticated case, not too complicated, so that it can still be understood and used as an illustration of the method, but less trivial than the previous one. In figure A, we have a network composed of three other networks, each with different topological properties as a result of being generated in different ways. One is a star graph, another one is a random graph and another one is a complete graph. In principle, you may not even notice the difference between the networks if the layout does not help you visualise it, as in figure A. But one can take the signature, the information signature, of the network, as we have done in previous lessons and units, and it is shown in figure C in grey. Then the steps shown by the signature are indications of the way in which the network has to be broken into its most likely common sources, and they can be detected by taking the differences of consecutive signature values, as shown in figure C with the orange curve. Now, we have that anything above the logarithm of a constant is not part of the evolution of the original system. So we use it as a cut-off, and any value above that baseline can be decomposed, as shown in figure D. Figure D shows the signature, but now broken up and coloured corresponding to the elements in the original network. And there we recover exactly the desired decomposition of the original network. The baseline also gives us a natural termination criterion for the algorithm, that is, when to stop decomposing the system further. If there are no more peaks over the baseline, shown in black in figure C, we simply stop the program. This means that the algorithm is unsupervised and parameter-free. There is no need to select any features of interest or to make any arbitrary choices. So, as you can see, the curve in orange suggests that there should be about four components, because those values are above the black baseline curve. And that is exactly the case, because we have three graphs and then the edges that connect those graphs.
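Before moving on, here is a minimal sketch of the signature-and-baseline idea on a composite graph, again with a compression-based placeholder for BDM; the networkx toy graphs, the bridge edges and the constant used for the log baseline are illustrative assumptions, not the actual implementation.

```python
import math
import zlib
import networkx as nx

def graph_complexity(g: nx.Graph) -> float:
    """Placeholder for a BDM estimate: compressed size (in bits) of the
    upper triangle of the adjacency matrix."""
    nodes = sorted(g.nodes())
    bits = "".join("1" if g.has_edge(u, v) else "0"
                   for i, u in enumerate(nodes) for v in nodes[i + 1:])
    return 8 * len(zlib.compress(bits.encode()))

def information_signature(g: nx.Graph):
    """Information value of each edge: the change in estimated complexity
    when that edge is removed, sorted from highest to lowest."""
    base = graph_complexity(g)
    sig = []
    for e in g.edges():
        h = g.copy()
        h.remove_edge(*e)
        sig.append((base - graph_complexity(h), e))
    return sorted(sig, reverse=True)

def candidate_cuts(sig, constant=2.0):
    """Edges where the difference between consecutive signature values
    rises above the log-of-a-constant baseline."""
    baseline = math.log2(constant)
    diffs = [sig[k][0] - sig[k + 1][0] for k in range(len(sig) - 1)]
    return [sig[k][1] for k, d in enumerate(diffs) if d > baseline]

# Toy composite network: a complete graph, a star and a random graph,
# loosely joined by two bridge edges.
g = nx.disjoint_union_all([nx.complete_graph(8),
                           nx.star_graph(7),
                           nx.erdos_renyi_graph(8, 0.3, seed=1)])
g.add_edges_from([(0, 8), (9, 16)])
print(candidate_cuts(information_signature(g)))
```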
Another application is data dimension reduction. This is because we can identify the regions of an image that contribute less to the candidate models able to explain the original image. So we are able to perform some sort of highly optimal reduction of the data, theoretically the most optimal possible, limited only by our own approximations, because the aim of the algorithm is to preserve all the features of the data that contribute to its description. Here in figures A and B are two examples where the parts that contribute less to the algorithmic description of each elementary cellular automaton are coloured in light grey, and where you can see how the algorithm can keep reducing the data in several steps, preserving the main features until the end. These two examples are quite simple, but when applied to more sophisticated data, including networks, we have shown that the method outperforms the best algorithms for network reduction, which is, by the way, called 'sparsification'. Again, in this method there are no parameters involved, nor any need to supervise it. In fact, the termination criterion is very similar to that of the previous algorithm: as soon as removing more parts leads to a greater loss of algorithmic information, we simply stop. We continue only if we wish to reduce the data further, but that is an option, not a requirement for the algorithm to work properly. One can also see how algorithmic feature selection and model generation can complement the current area of deep learning and some of its methods, for example by reducing the training sets and focusing on causal elements rather than simply on combinatorial statistics. It can also trim models and help choose the right parameters of the model, known as hyperparameters, that is, for example, the size of the network, the number of layers or the precision of the weights, and so on. The feature selection would be very similar to the application that we saw before with the investigation of eigenvalues in graph spectra, where we can find which elements are the most informative to learn from and which are not even worth considering. In some way, and together with many other ideas we are exploring and pursuing, these methods are introducing cause and effect into machine learning. So, soon, in the algorithmic complexity calculator, you will be able to provide your data and the calculator will retrieve a set of very abstract, general candidate models able to explain the data, together with features that may have been extracted in the process and identified as key for the causal explanation of the system. The take-home message is that with all our tools and methods based on algorithmic probability, we can make significant contributions to introducing causation into machine and deep learning, and make methods and approaches such as deep neural networks more robust. We call this approach algorithmic machine learning, or algorithmic machine intelligence. In the next lecture, we will see some other possible directions of future research related to these ideas.
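Finally, to make the reduction method's termination criterion concrete, here is a minimal sketch under the same toy assumptions as the earlier sketches (zlib standing in for BDM, and an illustrative loss threshold in bits); it is not the actual sparsification algorithm, only the stop-when-information-loss-jumps idea.

```python
import zlib
import numpy as np

def info(x: np.ndarray) -> float:
    """Placeholder for a BDM estimate: compressed size (in bits) of the array."""
    return 8 * len(zlib.compress(np.ascontiguousarray(x).tobytes()))

def reduce_data(x: np.ndarray, max_loss_bits: float = 8.0) -> np.ndarray:
    """Greedily delete the row whose removal loses the least estimated
    information, and stop as soon as any further deletion would lose more
    than the threshold (the termination criterion described above)."""
    current = x.copy()
    while current.shape[0] > 1:
        base = info(current)
        losses = [(base - info(np.delete(current, r, axis=0)), r)
                  for r in range(current.shape[0])]
        loss, row = min(losses)
        if loss > max_loss_bits:
            break
        current = np.delete(current, row, axis=0)
    return current

# Toy data: a repetitive binary array with two distinctive (random) rows.
rng = np.random.default_rng(0)
x = np.tile(np.array([0, 1] * 8, dtype=np.uint8), (20, 1))
x[5] = rng.integers(0, 2, size=16, dtype=np.uint8)
x[12] = rng.integers(0, 2, size=16, dtype=np.uint8)
print(x.shape, "->", reduce_data(x).shape)
```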