We saw in the last lectures how to measure Entropy. Another contribution of information theory is to our understanding of the concepts of noise and redundancy, and of how noise and redundancy counteract and complement each other. Imagine that you were given a message in a language in which every bit is essential for the message to be understood. Such a language would be very fragile, because the loss of a single bit would mean losing the entire message. Fortunately, most human languages are not of this type; they are highly redundant. That is why you can still understand a written or spoken message even when some letters or even full words are lost or deleted. This may explain why some languages use grammatical gender, articles, and other linguistic features as a way to add redundancy to the language, even when they may not appear fundamental in any way.

Let's take a piece of text from a popular source, for example an excerpt of 100 words from chapter one of Jane Austen's Pride and Prejudice, and let's replace every letter "u" with an underscore: I am certain that you are still able to recover all the text with absolutely no loss of information or meaning, because you are able to mentally fill the gaps. In this case we removed only one of the lowest-frequency vowels, but what if, in addition to the letter "u", we removed the letter "e", the most frequent letter in English? Then things would start getting more difficult, but with a little extra effort one can still fully recover the text. However, there is always a point of inflection beyond which recovering information or meaning becomes impossible. The following text, for example, can still be recovered by hand, and even faster with the help of a computer looking for matches with real English words to complete the missing letters, but the point of no recovery is not much farther away than this: It turns out that the ultimate lower limit to this process of deletion, the point of no recovery, is precisely given by Shannon's Entropy. How far removed a language is from that point of no recovery is called its Redundancy.

Redundancy in a message is related to the extent to which the message can be compressed. Our process of deleting letters is a form of compression, and as long as we are able to recover the original message, that compression is said to be lossless. What traditional lossless data compression does is reduce the number of bits used to encode a message by identifying and eliminating statistical redundancy. When we compress data without losing any information, we remove redundancy. When we compress a message, we encode the same amount of information using fewer bits, so we end up with more Shannon information per symbol in the compressed message. A compressed message therefore looks less predictable, because we have deleted the redundancies such as repetitions and other statistical regularities. On the one hand, the so-called Shannon Source Coding Theorem states that a lossless data compression scheme cannot compress messages, on average, beyond the limit of one bit of Shannon information per bit, which is reached when the message looks random and all its symbols are equally distributed. On the other hand, redundancy may often be a desirable feature in data, because data is always subject to errors in its transmission.
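To make the compression side of this picture concrete before we turn to noise, here is a minimal Python sketch (not the material shown in the lecture): it blanks out the letters "u" and "e" in a short placeholder English text, estimates the empirical Shannon Entropy per character, and compares the structured text with a random string over a 27-symbol alphabet under a generic lossless compressor (zlib). The text and the alphabet are illustrative assumptions; any ordinary English excerpt behaves the same way.

```python
import math
import random
import string
import zlib
from collections import Counter

def entropy_bits_per_char(s):
    """Empirical Shannon Entropy of single characters, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Placeholder text standing in for the Pride and Prejudice excerpt.
english = ("it is a truth universally acknowledged that a single man in "
           "possession of a good fortune must be in want of a wife ") * 8

# The deletion experiment described above: blank out "u", then "e" as well.
print(english[:63].replace("u", "_"))
print(english[:63].replace("u", "_").replace("e", "_"))

# A random string over a 27-symbol alphabet (26 letters plus space).
rng = random.Random(0)
alphabet = string.ascii_lowercase + " "
gibberish = "".join(rng.choice(alphabet) for _ in range(len(english)))

# Structured text compresses to far fewer than 8 bits per character; the random
# string barely compresses, staying near its Entropy of log2(27) ~ 4.75 bits.
# (zlib also exploits repeated phrases, so on this repetitive sample it can go
# below the single-character Entropy estimate; Shannon's bound applies to the
# true Entropy rate of the source, which such repetition lowers.)
for name, s in [("english", english), ("random", gibberish)]:
    bits = 8 * len(zlib.compress(s.encode(), 9))
    print(f"{name:8s} entropy ~ {entropy_bits_per_char(s):5.2f} bits/char, "
          f"compressed to {bits / len(s):5.2f} bits/char")
```

The structured English compresses well, while the random string does not: a message with no statistical redundancy left is, by that very fact, essentially incompressible.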
These errors in transmission, or additions to a transmission, are usually called "noise", and Shannon's Noisy-Channel Coding Theorem establishes a fundamental tradeoff between redundancy and noise: it tells us how much redundancy one needs in a message to deal with a given amount of noise. Here is some code illustrating the theorem (a minimal sketch along these lines is given at the end of this passage). The code introduces artificial noise that destroys the message, but one can increase the redundancy of the message to counteract the effect of the noise and reconstruct the message in full, with no error, on the receiving side. Increasing the noise in a message will inevitably make it lose information, but increasing the redundancy will increase the resiliency and robustness of the message even in the face of additive noise. So we have here two messages, one highly structured and one random-looking. We can add noise and see how the structured message starts to look random, and to counteract that loss we can add redundancy so as to be able to recover the original message.

Let's come back to human languages. One consequence of redundancy is that letters and sequences of letters have different probabilities of appearance. If we assume we are dealing with 27 characters (that is, 26 letters plus the space), and that all of these characters are equally probable, then we have an information Entropy of about 4.8 bits per character. But we know that the characters are not equally probable. For instance, the letter "e" has the highest frequency of occurrence, as we said before, while the letter "z", coincidentally, has the lowest. This is related to the concept of redundancy, which is nothing more than the constraints imposed on text in the English language: for example, the letter "q" is most of the time, if not always, followed by "u" in English, and we also have rules such as "i before e except after c", and so on. This plot shows the frequency of sequences of n letters calculated from the text of the Universal Declaration of Human Rights in 20 languages, and it illustrates the Entropy rate calculated for different block sizes. The Entropy rate shows how the Entropy changes as a function of the number of letters in a block. The Entropy of a language can be seen as an estimate of the probabilistic information content of each letter in that language. By playing with this computer program to estimate the Entropy of n-grams for 20 languages (a small sketch of such an estimate also appears at the end of this passage), you can see how different languages have different Entropy rates, indicating different degrees of redundancy.

So let's turn our attention to biological information. How does biology store information? One main repository is of course the genome and the DNA. In some fundamental way, DNA can be considered the source code of a living organism: an organism comes out of DNA and its interaction with the environment. So what about the redundancy in DNA? Would you expect it to be redundant? How much can one tamper with DNA before producing important changes in the phenotype? Well, that is one of the main questions in molecular biology and genetics, and one of the answers is that different regions of the genome have different degrees of redundancy according to whether they are under evolutionary selective pressure, that is, whether a DNA region plays an important role in the unfolding of the organism's biological development.
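The program used in the lecture is not reproduced here, but the following minimal Python sketch illustrates the same noisy-channel idea: a binary symmetric channel flips each transmitted bit with some probability, and a simple repetition code (each bit sent r times, decoded by majority vote) adds exactly the kind of redundancy that lets the receiver reconstruct the message. The message string and the noise level are arbitrary choices for illustration.

```python
import random

def to_bits(msg):
    """Turn a text message into a list of 0/1 bits (8 bits per byte)."""
    return [int(b) for ch in msg.encode() for b in format(ch, "08b")]

def from_bits(bits):
    """Reassemble bits into text, replacing any bytes damaged beyond repair."""
    chars = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)]
    return bytes(chars).decode(errors="replace")

def noisy_channel(bits, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def encode_repetition(bits, r):
    """Add redundancy by sending every bit r times."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(bits, r):
    """Recover each bit by majority vote over its r noisy copies."""
    return [int(sum(bits[i:i + r]) > r // 2) for i in range(0, len(bits), r)]

rng = random.Random(1)
message = "redundancy protects information against noise"
p = 0.05  # 5% of transmitted bits are flipped by the channel

for r in (1, 3, 9):  # r = 1 means no redundancy at all
    sent = encode_repetition(to_bits(message), r)
    received = decode_repetition(noisy_channel(sent, p, rng), r)
    errors = sum(a != b for a, b in zip(received, to_bits(message)))
    print(f"repetition x{r}: {errors:3d} bit errors -> {from_bits(received)!r}")
```

With no redundancy (r = 1) the received text comes out garbled; with enough repetition the same amount of noise leaves the message intact, which is the tradeoff the Noisy-Channel Coding Theorem quantifies.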
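For the block-Entropy estimates mentioned above, here is another minimal sketch, again not the lecture's own program: it computes the empirical Entropy of n-character blocks, divided by n, for a small placeholder sample (the lecture uses the Universal Declaration of Human Rights in 20 languages, which is not reproduced here). On short samples the estimates for larger n are biased downward, so this is only a qualitative illustration of how the Entropy rate falls as longer-range structure is taken into account.

```python
import math
from collections import Counter

def block_entropy_rate(text, n):
    """Empirical Entropy of n-character blocks, in bits per character."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / n

# Placeholder sample; substitute any corpus in any language to compare rates.
sample = ("all human beings are born free and equal in dignity and rights "
          "they are endowed with reason and conscience ") * 4

for n in range(1, 6):
    print(f"n = {n}: {block_entropy_rate(sample, n):.3f} bits/char")
```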
It turns out that, in general, the genome and almost every repository of biological information is highly redundant: not only does the genome have many copies of the same genes, but many proteins are also encoded by multiple genes, and so on. One purpose of this redundancy is that many genes can produce a stable number of proteins, so small variations to the source code have no significant effect. This is a highly desirable property for biological robustness, because mistakes always happen in the real world, yet biological organisms are very resilient, and information theory may help explain this resiliency. The production of essential proteins in plants and animals has different degrees of resilience to the different kinds of mutations that organisms suffer all the time as a consequence of copying errors or things like free radicals in the environment. Interestingly, this is reflected in the Shannon Entropy of the genomes of different plants and animals.

Indeed, each organism has a characteristic redundancy known as its GC content, the proportion of the genome made up of two of the four nucleotides, guanine and cytosine. GC content is closely related to Entropy because it represents redundancy. It turns out that each species has a specific associated redundancy, or GC content, so if two species have about the same GC content, because they are evolutionarily related, then they will also have about the same genomic Entropy. With this computer program we can simulate that redundancy by creating artificial DNA sequences with exactly the same GC content and Entropy as real genomes, and we can do it for a wide range of species, both for DNA and RNA. We can also play with the GC content directly and see what a DNA sequence with a GC content of 100% would mean, and how that would hurt the code of life by making it significantly less expressive, because it would be unable to encode the same number of proteins with only two letters instead of four. On the other hand, just as for languages, having no redundancy at all is dangerous, because errors in communication or storage may be fatal and DNA segments key to producing some proteins would be unrecoverable, so nature and organisms are always being optimised by evolution to find the sweet spot between redundancy and efficiency.

In this piece of software that we have written, which is available online, we can generate pseudorandom sequences of the four letters representing the nucleotides in DNA while taking the GC content into consideration (a small sketch along these lines follows below). Known GC contents can also be chosen for popular organisms, including many mammals but also a few bacteria, ranging in GC content from 20% to almost 52%. Even though the GC content of the human genome can vary from 35% to 60% from chromosome to chromosome, the average human genome GC content is 46.1%.
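As a minimal sketch of the kind of tool described above (not the online program itself), the following Python code draws a pseudorandom DNA sequence with a chosen GC content and computes the Shannon Entropy per nucleotide that this single constraint implies. The organism labels, the 20% figure, and the 46.1% figure simply echo the values quoted in this lecture; everything else is an illustrative assumption.

```python
import math
import random

def random_dna(length, gc_content, rng):
    """Draw G and C with total probability gc_content, A and T with the rest."""
    weights = [gc_content / 2, gc_content / 2,              # G, C
               (1 - gc_content) / 2, (1 - gc_content) / 2]  # A, T
    return "".join(rng.choices("GCAT", weights=weights, k=length))

def entropy_from_gc(gc_content):
    """Entropy in bits per nucleotide when GC content is the only constraint."""
    probs = [gc_content / 2] * 2 + [(1 - gc_content) / 2] * 2
    return -sum(p * math.log2(p) for p in probs if p > 0)

rng = random.Random(0)
# Illustrative GC contents; 0.461 echoes the human average quoted above.
for name, gc in [("low-GC bacterium", 0.20),
                 ("human (average)", 0.461),
                 ("GC = 100%", 1.00)]:
    seq = random_dna(60, gc, rng)
    print(f"{name:18s} GC={gc:.0%}  H={entropy_from_gc(gc):.3f} bits/nt  {seq[:40]}...")
```

At a GC content of 50% the Entropy reaches the full 2 bits per nucleotide, while at 100% it collapses to 1 bit, which is the quantitative version of the point above about being left with only two letters instead of four.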