We saw in the last lectures how to measure Entropy. Another contribution of information theory is to our understanding of the concepts of noise and redundancy, and of how noise and redundancy counteract and complement each other. Imagine that you were given a message in a language in which every bit is essential for the message to be understood. Such a language would be very fragile, because the loss of a single bit would mean losing the entire message. Fortunately, most human languages are not of this type; they are highly redundant. That is why you can still understand a written or spoken message even when some letters or even full words are lost or deleted. This may explain why some languages use grammatical gender, articles, and other linguistic features as a way to add redundancy to the language, even when they may not appear fundamental in any way.

Let's take a piece of text from a popular source, for example an excerpt of 100 words from chapter one of Jane Austen's Pride and Prejudice, and let's replace every letter "u" with an underscore: I am certain that you are still able to recover all the text with absolutely no loss of information or meaning, because you are able to mentally fill the gaps. In this case we removed only one of the lowest-frequency vowels, but what if, in addition to the letter "u", we removed the letter "e", the most frequent letter in English? Then things would start getting more difficult, but with a little extra effort one can still fully recover the text. However, there is always a point of inflection beyond which recovering information or meaning becomes impossible. The following text, for example, can still be recovered by hand, and even faster with the help of a computer looking for matches with real English words to complete the missing letters, but the point of no recovery is not much farther away than this: It turns out that the ultimate lower limit to this process of deletion, the point of no recovery, is precisely given by Shannon's Entropy. How far removed a language is from that point of no recovery is called its Redundancy.

Redundancy in a message is related to the extent to which the message can be compressed. Our process of deleting letters is a form of compression, and as long as we are able to recover the original message, that compression is said to be lossless. What traditional lossless data compression does is reduce the number of bits used to encode a message by identifying and eliminating statistical redundancy. When we compress data without losing any information, we remove redundancy. When we compress a message, we encode the same amount of information using fewer bits, so we end up with more Shannon information per symbol in the compressed message. A compressed message therefore looks less predictable, because we have deleted the redundancies such as repetitions and other statistical regularities. On the one hand, the so-called Shannon Source Coding Theorem states that a lossless data compression scheme cannot compress messages, on average, beyond the limit of one bit of Shannon information per bit, which is reached when the message looks random and all its symbols are equally distributed. On the other hand, redundancy may often be a desirable feature in data, because data is always subject to errors in its transmission.
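To make the compression side of this picture concrete before we turn to noise, here is a minimal Python sketch (not the material shown in the lecture): it blanks out the letters "u" and "e" in a short placeholder English text, estimates the empirical Shannon Entropy per character, and compares the structured text with a random string over a 27-symbol alphabet under a generic lossless compressor (zlib). The text and the alphabet are illustrative assumptions; any ordinary English excerpt behaves the same way.

```python
import math
import random
import string
import zlib
from collections import Counter

def entropy_bits_per_char(s):
    """Empirical Shannon Entropy of single characters, in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Placeholder text standing in for the Pride and Prejudice excerpt.
english = ("it is a truth universally acknowledged that a single man in "
           "possession of a good fortune must be in want of a wife ") * 8

# The deletion experiment described above: blank out "u", then "e" as well.
print(english[:63].replace("u", "_"))
print(english[:63].replace("u", "_").replace("e", "_"))

# A random string over a 27-symbol alphabet (26 letters plus space).
rng = random.Random(0)
alphabet = string.ascii_lowercase + " "
gibberish = "".join(rng.choice(alphabet) for _ in range(len(english)))

# Structured text compresses to far fewer than 8 bits per character; the random
# string barely compresses, staying near its Entropy of log2(27) ~ 4.75 bits.
# (zlib also exploits repeated phrases, so on this repetitive sample it can go
# below the single-character Entropy estimate; Shannon's bound applies to the
# true Entropy rate of the source, which such repetition lowers.)
for name, s in [("english", english), ("random", gibberish)]:
    bits = 8 * len(zlib.compress(s.encode(), 9))
    print(f"{name:8s} entropy ~ {entropy_bits_per_char(s):5.2f} bits/char, "
          f"compressed to {bits / len(s):5.2f} bits/char")
```

The structured English compresses well, while the random string does not: a message with no statistical redundancy left is, by that very fact, essentially incompressible.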
These errors in transmission, or additions to a transmission, are usually called "noise", and Shannon's Noisy-Channel Coding Theorem establishes a fundamental tradeoff between redundancy and noise: it tells us how much redundancy one needs in a message to deal with a given amount of noise. Here is some code illustrating the theorem (a minimal sketch along these lines is given at the end of this passage). The code introduces artificial noise that destroys the message, but one can increase the redundancy of the message to counteract the effect of the noise and reconstruct the message in full, with no error, on the receiving side. Increasing the noise in a message will inevitably make it lose information, but increasing the redundancy will increase the resiliency and robustness of the message even in the face of additive noise. So we have here two messages, one highly structured and one random-looking. We can add noise and see how the structured message starts to look random, and to counteract that loss we can add redundancy so as to be able to recover the original message.

Let's come back to human languages. One consequence of redundancy is that letters and sequences of letters have different probabilities of appearance. If we assume we are dealing with 27 characters (that is, 26 letters plus the space), and that all of these characters are equally probable, then we have an information Entropy of about 4.8 bits per character. But we know that the characters are not equally probable. For instance, the letter "e" has the highest frequency of occurrence, as we said before, while the letter "z", coincidentally, has the lowest. This is related to the concept of redundancy, which is nothing more than the constraints imposed on text in the English language: for example, the letter "q" is most of the time, if not always, followed by "u" in English, and we also have rules such as "i before e except after c", and so on. This plot shows the frequency of sequences of n letters calculated from the text of the Universal Declaration of Human Rights in 20 languages, and it illustrates the Entropy rate calculated for different block sizes. The Entropy rate shows how the Entropy changes as a function of the number of letters in a block. The Entropy of a language can be seen as an estimate of the probabilistic information content of each letter in that language. By playing with this computer program to estimate the Entropy of n-grams for 20 languages (a small sketch of such an estimate also appears at the end of this passage), you can see how different languages have different Entropy rates, indicating different degrees of redundancy.

So let's turn our attention to biological information. How does biology store information? One main repository is of course the genome and the DNA. In some fundamental way, DNA can be considered the source code of a living organism: an organism comes out of DNA and its interaction with the environment. So what about the redundancy in DNA? Would you expect it to be redundant? How much can one tamper with DNA before producing important changes in the phenotype? Well, that is one of the main questions in molecular biology and genetics, and one of the answers is that different regions of the genome have different degrees of redundancy according to whether they are under evolutionary selective pressure, that is, whether a DNA region plays an important role in the unfolding of the organism's biological development.
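The program used in the lecture is not reproduced here, but the following minimal Python sketch illustrates the same noisy-channel idea: a binary symmetric channel flips each transmitted bit with some probability, and a simple repetition code (each bit sent r times, decoded by majority vote) adds exactly the kind of redundancy that lets the receiver reconstruct the message. The message string and the noise level are arbitrary choices for illustration.

```python
import random

def to_bits(msg):
    """Turn a text message into a list of 0/1 bits (8 bits per byte)."""
    return [int(b) for ch in msg.encode() for b in format(ch, "08b")]

def from_bits(bits):
    """Reassemble bits into text, replacing any bytes damaged beyond repair."""
    chars = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)]
    return bytes(chars).decode(errors="replace")

def noisy_channel(bits, p, rng):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def encode_repetition(bits, r):
    """Add redundancy by sending every bit r times."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(bits, r):
    """Recover each bit by majority vote over its r noisy copies."""
    return [int(sum(bits[i:i + r]) > r // 2) for i in range(0, len(bits), r)]

rng = random.Random(1)
message = "redundancy protects information against noise"
p = 0.05  # 5% of transmitted bits are flipped by the channel

for r in (1, 3, 9):  # r = 1 means no redundancy at all
    sent = encode_repetition(to_bits(message), r)
    received = decode_repetition(noisy_channel(sent, p, rng), r)
    errors = sum(a != b for a, b in zip(received, to_bits(message)))
    print(f"repetition x{r}: {errors:3d} bit errors -> {from_bits(received)!r}")
```

With no redundancy (r = 1) the received text comes out garbled; with enough repetition the same amount of noise leaves the message intact, which is the tradeoff the Noisy-Channel Coding Theorem quantifies.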
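For the block-Entropy estimates mentioned above, here is another minimal sketch, again not the lecture's own program: it computes the empirical Entropy of n-character blocks, divided by n, for a small placeholder sample (the lecture uses the Universal Declaration of Human Rights in 20 languages, which is not reproduced here). On short samples the estimates for larger n are biased downward, so this is only a qualitative illustration of how the Entropy rate falls as longer-range structure is taken into account.

```python
import math
from collections import Counter

def block_entropy_rate(text, n):
    """Empirical Entropy of n-character blocks, in bits per character."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / n

# Placeholder sample; substitute any corpus in any language to compare rates.
sample = ("all human beings are born free and equal in dignity and rights "
          "they are endowed with reason and conscience ") * 4

for n in range(1, 6):
    print(f"n = {n}: {block_entropy_rate(sample, n):.3f} bits/char")
```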
It turns out that, in general, the genome and almost every repository of biological information is highly redundant: not only does the genome have many copies of the same genes, but many proteins are also encoded by multiple genes, and so on. One purpose of this redundancy is that many genes can produce a stable number of proteins, so small variations to the source code have no significant effect. This is a highly desirable property for biological robustness, because mistakes always happen in the real world, yet biological organisms are very resilient, and information theory may help explain this resiliency. The production of essential proteins in plants and animals has different degrees of resilience to the different kinds of mutations that organisms suffer all the time as a consequence of copying errors or things like free radicals in the environment. Interestingly, this is reflected in the Shannon Entropy of the genomes of different plants and animals.

Indeed, each organism has a characteristic redundancy known as its GC content, the proportion of the genome made up of two of the four nucleotides, guanine and cytosine. GC content is closely related to Entropy because it represents redundancy. It turns out that each species has a specific associated redundancy, or GC content, so if two species have about the same GC content, because they are evolutionarily related, then they will also have about the same genomic Entropy. With this computer program we can simulate that redundancy by creating artificial DNA sequences with exactly the same GC content and Entropy as real genomes, and we can do it for a wide range of species, both for DNA and RNA. We can also play with the GC content directly and see what a DNA sequence with a GC content of 100% would mean, and how that would hurt the code of life by making it significantly less expressive, because it would be unable to encode the same number of proteins with only two letters instead of four. On the other hand, just as for languages, having no redundancy at all is dangerous, because errors in communication or storage may be fatal and DNA segments key to producing some proteins would be unrecoverable, so nature and organisms are always being optimised by evolution to find the sweet spot between redundancy and efficiency.

In this piece of software that we have written, which is available online, we can generate pseudorandom sequences of the four letters representing the nucleotides in DNA while taking the GC content into consideration (a small sketch along these lines follows below). Known GC contents can also be chosen for popular organisms, including many mammals but also a few bacteria, ranging in GC content from 20% to almost 52%. Even though the GC content of the human genome can vary from 35% to 60% from chromosome to chromosome, the average human genome GC content is 46.1%.
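As a minimal sketch of the kind of tool described above (not the online program itself), the following Python code draws a pseudorandom DNA sequence with a chosen GC content and computes the Shannon Entropy per nucleotide that this single constraint implies. The organism labels, the 20% figure, and the 46.1% figure simply echo the values quoted in this lecture; everything else is an illustrative assumption.

```python
import math
import random

def random_dna(length, gc_content, rng):
    """Draw G and C with total probability gc_content, A and T with the rest."""
    weights = [gc_content / 2, gc_content / 2,              # G, C
               (1 - gc_content) / 2, (1 - gc_content) / 2]  # A, T
    return "".join(rng.choices("GCAT", weights=weights, k=length))

def entropy_from_gc(gc_content):
    """Entropy in bits per nucleotide when GC content is the only constraint."""
    probs = [gc_content / 2] * 2 + [(1 - gc_content) / 2] * 2
    return -sum(p * math.log2(p) for p in probs if p > 0)

rng = random.Random(0)
# Illustrative GC contents; 0.461 echoes the human average quoted above.
for name, gc in [("low-GC bacterium", 0.20),
                 ("human (average)", 0.461),
                 ("GC = 100%", 1.00)]:
    seq = random_dna(60, gc, rng)
    print(f"{name:18s} GC={gc:.0%}  H={entropy_from_gc(gc):.3f} bits/nt  {seq[:40]}...")
```

At a GC content of 50% the Entropy reaches the full 2 bits per nucleotide, while at 100% it collapses to 1 bit, which is the quantitative version of the point above about being left with only two letters instead of four.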