In this lesson we will see some applications of algorithmic information dynamics, the perturbation causal calculus, and reprogrammability to molecular biology and genetics - in particular to biological networks and cell dynamics. Before getting into the details, it is important to understand what a gene ontology is. As you may know, even though we know very little about the specific functions of most genes, a good amount of knowledge has been accumulated and published in the biological literature over the last decades. So, a gene ontology is a database that captures that knowledge from the literature in a computable form that can be used for other purposes - in particular, for validating new experiments. In general, these databases assign to each gene or protein some categories describing the processes in which they are known, or believed, to be involved. By "believed" I mean that there is good statistical evidence that this is actually the case. Those categories can be the cellular components the gene is associated with, or the type of molecular function it is known for. For example, some genes may be known to be related to energy production, to building a specific tissue, and so on. This is what a category in a gene ontology database may look like. Different genes and proteins may be associated with each of these nodes, and they can be associated with more than one. And if a gene is associated with a lower-level component, it is also associated with the larger component that contains the smaller one. That is why it is called an "ontology" - because it is basically a hierarchical representation of knowledge. Here is an example of a gene product and its different categories, taken from a gene ontology database. This cytochrome c, for example, is associated with a process related to cell death and with the mitochondria.
This is another example and representation, in the format of what is called a "gene browser" - the name given to browsers in which genes and proteins can be queried to find their associations in gene ontology databases. It shows a breakdown of biological terms: how many genes and proteins fall in each category and subcategory, and how many are associated with the different categories. But now that you know what a gene ontology is, let's get back to the application of algorithmic complexity to biology. A landmark paper in the application of algorithmic information theory to molecular biology and genetics is this one by Cilibrasi and Vitányi, two very important researchers in the field of Kolmogorov complexity. They defined a measure called the "normalized information distance," and a related measure that allows numerical approximations, called the "normalized compression distance," which can be implemented using popular lossless compression algorithms. They applied this measure to the full genomes of different animals, extracted precisely from one of those popular gene browsers. What they did was to compress those full genomes and compare them to each other according to the compression measure. And what they found is that they were able to reconstruct a phylogenetic tree that closely corresponds to the evolutionary history and genetic relationships of the species. Of course, this was an important result, but it should be put in context. One major problem and challenge in science is that we often don't talk to scientists in other areas - and I think this case illustrates it very well. Because it turns out that there is a property of the genome of every species called the "GC content," which is simply the proportion of Gs and Cs - that is, of guanine and cytosine nucleotides - in a DNA sequence or a full genome.
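To give a rough idea of how the normalized compression distance works in practice, here is a minimal Python sketch. It uses the standard-library zlib compressor as a stand-in for the compressors Cilibrasi and Vitányi experimented with, and made-up toy strings rather than real genomes:

```python
import random
import zlib

def c(s: bytes) -> int:
    """Approximate Kolmogorov complexity by compressed length."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Toy "genomes": b is a slight mutation of a, r is unrelated random DNA
a = b"ATGGCGTACGATCGATCGTTAGC" * 40
b = a[:-3] + b"TCC"
r = bytes(random.Random(0).choices(b"ACGT", k=len(a)))
print(ncd(a, b) < ncd(a, r))  # the related pair should be closer
```

The idea is that if two sequences share information, compressing their concatenation costs little more than compressing the longer one alone, so the distance is small; real compressors only approximate this, which is part of the "black box" concern raised below.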
Well, it turns out that each species has a very specific GC content, which is about the same for two species that are evolutionarily related, because by definition their genomic alignment is very high. In other words, the gold standard of phylogenetic tree reconstruction is built by aligning genomic sequences. But looking at GC content alone, one can reconstruct the tree with very high accuracy and precision. And the Shannon entropy of a DNA sequence is essentially a function of its GC content, so it is also not difficult to see how Shannon entropy alone would reconstruct the same phylogenetic tree. So we are in a situation in which even the most trivial measures can reconstruct the same phylogenetic tree that Cilibrasi and Vitányi reconstructed with quite a convoluted measure, using a black-box algorithm - namely, a lossless compression algorithm. This is why we place so much value on working with other experts, and why we have pushed ourselves to learn as much as we can about the areas in which we are trying to make a contribution, especially molecular biology and genetics. So, let's get back to algorithmic information dynamics and how it can contribute to the study of these types of biological systems, perhaps in a very different way. Just as a reminder, you may recall that one can rank the elements of a network by applying our perturbation causal calculus, evaluating how much each element of a network - or of a system in general - contributes to the original system or network. In this case, for example, we delete every node and see how much the algorithmic complexity of the mutated network changes with respect to the algorithmic complexity of the original network - here, a Watts-Strogatz small-world network with different rewiring probabilities. You start from a regular cycle graph and then randomly reconnect a few edges according to that rewiring probability value.
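To make the GC-content point concrete, here is a small Python sketch on a made-up toy sequence. Under Chargaff's parity rule (A ≈ T and G ≈ C), the nucleotide frequencies of a genome - and hence its Shannon entropy - are essentially determined by the GC content alone:

```python
import math
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per nucleotide) of the empirical
    nucleotide distribution of the sequence."""
    n = len(seq)
    return -sum((k / n) * math.log2(k / n) for k in Counter(seq).values())

seq = "ATGCGCGCATATGCGGCCAT"  # toy sequence, not a real genome
gc = gc_content(seq)
print(gc)                    # 0.6
print(shannon_entropy(seq))  # at most 2 bits for a 4-letter alphabet

# Under Chargaff parity (p_A = p_T, p_G = p_C), entropy is a direct
# function of GC content alone:
h = -gc * math.log2(gc / 2) - (1 - gc) * math.log2((1 - gc) / 2)
print(h)  # matches the empirical entropy for this balanced toy sequence
```

So a measure as simple as counting two letters already carries nearly the same signal as the entropy of the whole sequence.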
By performing our perturbation calculus, we are able to detect which nodes were involved in the rewiring, because deleting them changes the algorithmic complexity of the mutated network relative to the original regular cycle. We can measure by how much, and color each node according to its perturbation magnitude, providing a kind of disruptability index, so to speak. We did the same with one of the most curated and validated biological networks available: the transcription factor network of E. coli. A transcription factor network does not include all the genes of an organism, such as this bacterium, but only the few genes that regulate other genes. These are called "transcription factors" because they regulate the transcription of those other genes - that is, whether or not they are expressed. So, basically, they turn genes on and off. Here, we colored in red those genes that, if deleted, would send the network towards randomness, while nodes further away from red on a regular color spectrum send the network away from randomness - effectively ranking all the genes in this network. These are the information signatures from the application of the perturbation analysis. We were able to cluster genes by whether they have about the same algorithmic information value, or contribution, to the original network. This is how the network looks when separating nodes by their contribution, moving the E. coli network either towards randomness or away from randomness according to BDM. So, because a lot is known about E. coli and all its genes - in particular the transcription factors - we extracted the gene ontology terms for each gene in each cluster and checked whether genes in the same cluster were associated with similar processes. It turns out that they were.
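The node-deletion calculus described above can be sketched in a few lines. This is only an illustration: it uses the compressed size of the adjacency matrix as a crude stand-in for BDM (which the actual analysis uses), on a small cycle graph with one hypothetical rewired chord:

```python
import zlib

def complexity(nodes, edges):
    """Crude complexity estimate: compressed size of the adjacency
    matrix (a stand-in for BDM, which the real analysis uses)."""
    bits = "".join(
        "1" if (u, v) in edges or (v, u) in edges else "0"
        for u in nodes for v in nodes
    )
    return len(zlib.compress(bits.encode(), 9))

def perturbation_scores(nodes, edges):
    """Delete each node in turn and record how the estimated
    complexity of the mutated network changes."""
    base = complexity(nodes, edges)
    return {
        x: complexity(
            [n for n in nodes if n != x],
            {(u, v) for (u, v) in edges if x not in (u, v)},
        ) - base
        for x in nodes
    }

# Cycle graph on 16 nodes plus one rewired edge, Watts-Strogatz style
nodes = list(range(16))
edges = {(i, (i + 1) % 16) for i in range(16)} | {(0, 8)}
scores = perturbation_scores(nodes, edges)
print(scores)  # one complexity shift per deleted node
```

Ranking or coloring the nodes by these shifts gives exactly the kind of disruptability index described above, though with a real complexity estimator such as BDM rather than a compressor.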
These very small p-values tell us that, according to BDM, genes are not clustered by chance but by gene function with high accuracy, because the genes in each cluster have over-represented categories. This type of analysis is called "enrichment analysis," because you are enriching your results with everything the literature already says about those genes, and then testing whether your experiment found something that the literature also supports. And that was actually the case. The same happened when testing against the three most important gene ontology databases. Here is, for example, KEGG. And here is EcoCyc, strengthening our finding that algorithmic information dynamics is actually capturing function when clustering genes by algorithmic causal perturbation analysis. However, when using other measures in exactly the same way as we used BDM - replacing it with measures such as Shannon entropy - no significant groups were found after the gene ontology enrichment analysis. Entropy produced three clusters that we call "above the baseline," "baseline" and "below the baseline." So, entropy proved to be less sensitive. This is how entropy colors the network, and you can see how poorly it performs. Only two clusters could be identified using lossless compression - in this case a compression algorithm dividing nodes into above-baseline and baseline only. Above-baseline nodes were enriched for transcription factors, but no significant groups were found after the enrichment analysis. This is how compression colored the network: quite insensitive to small changes, as one would expect, because - as we have shown before - small changes are traditionally not captured by compression. So compression is quite useless for this kind of perturbation analysis. Nothing particularly interesting came up when using lossless compression algorithms after enrichment analysis either.
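The p-values in an enrichment analysis typically come from a hypergeometric (Fisher's exact) test: given how many genes in the whole network carry a particular gene ontology annotation, how surprising is it that one cluster contains so many of them? Here is a minimal stdlib sketch with made-up counts (not the actual E. coli numbers):

```python
from math import comb

def enrichment_p(N, K, n, k):
    """One-sided hypergeometric tail: probability that a random
    cluster of n genes out of N contains k or more of the K genes
    annotated with a given category."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# Hypothetical counts: 200 genes in total, 20 annotated with the
# category, and a cluster of 10 genes containing 6 of them.
p = enrichment_p(N=200, K=20, n=10, k=6)
print(p < 0.001)  # the category is strongly over-represented
```

A small p-value means the annotation is over-represented in the cluster far beyond chance, which is what "enriched" means here; real tools also correct these p-values for testing many categories at once.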
Moreover, unlike graph-theoretic measures that can be described as single or composite functions of other graph-theoretic measures, BDM was not found to be correlated with any of the most popular measures - not with lossless compression or Shannon entropy, nor with node degree, in-degree, out-degree, or betweenness centrality. This diagram summarizes the results for E. coli using the three most popular gene ontology databases, all three producing the same results when classifying transcription factors in the most studied biological network of this bacterium, which kills so many people around the world. There are many questions yet to answer - for example, whether such a perturbation analysis can tell us anything about some genes being more fundamental than others.
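A correlation check like the one just described can be sketched as a rank correlation between two node rankings. Here is a small stdlib-only Spearman implementation, with made-up scores rather than the actual E. coli values:

```python
def ranks(xs):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for t in order[i:j + 1]:
            out[t] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Made-up perturbation scores vs. node degrees for six nodes
bdm_scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8]
degrees = [3, 4, 2, 5, 1, 3]
print(spearman(bdm_scores, degrees))  # near zero: the rankings disagree
```

A correlation near zero between the BDM perturbation ranking and a graph-theoretic ranking such as degree is the kind of evidence suggesting the perturbation measure is not just a disguised version of the simpler measure.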