So we saw in the last lesson how Entropy is defined to quantify the apparent disorder of a room full of particles, and how it provides statistical information about the average behaviour of those particles in the form of a description of their statistical properties. In information theory, particles are replaced by symbols forming objects such as sequences or messages, and the question becomes how ordered or disordered a sequence of symbols looks, rather than a room full of particles. In the context of messages written in some human language, the equivalent of the room would be the space of all possible well-formed sentences in that language.

For example, a sequence of bits that is just a repetition of the same symbol, say a long run of 1s, looks highly ordered: there is very little surprise in scanning such a sequence and finding that more and more 1s keep coming. A sequence whose bits appear scattered at random, however, looks much less ordered. It is also said to look more typical, because it is the kind of sequence one would expect by chance, yet every bit comes as a surprise, since the first segment of the sequence gives no information about what is to come next. This is what one would expect from, for example, the result of tossing a coin, with 0 corresponding to heads and 1 to tails, each toss independent of the others and telling us nothing about the next result.

Notice that, just as with the room in the previous lesson, this kind of order and disorder is of a statistical nature, that is, it concerns how symbols appear to be distributed along the sequence of bits. Two sequences will have the same Entropy if they have the same statistical properties, and they will have the same statistical properties if each symbol occurs in the same proportion relative to the occurrences of all the other possible symbols. In the case of a binary sequence, for example, if it has about the same number of 1s as 0s, then the sequence will be assigned high Entropy. Take a binary string with four 0s and six 1s, and then a sample of 10 permutations of the bits of that same string. All these strings will have the same Shannon Entropy within the set of all possible binary strings of the same length, because they all have exactly the same number of 0s and 1s as the original.

The formula to calculate the Shannon Entropy, denoted by the letter H, of a string or sequence s is traditionally written as follows:

H(s) = -Σ_i P(s_i) log P(s_i)

that is, minus the sum, over all n possible symbols, of the probability of occurrence of the symbol s_i in s multiplied by the logarithm of that same probability.

We can illustrate its application with the following example. Let's take as our sequence something written in English, such as 'this is not a sentence', and look at its distribution of letters. We can count how many times each symbol, in this case each letter, occurs in the sentence, effectively producing a discrete distribution of letters from that message. The letter 'a' occurs once, just like 'c', 'h' and 'o', followed by the letter 'i' appearing twice, then 'e', 'n', 's' and 't' appearing three times each, and the space 4 times in the sequence. The distribution looks like this, with no particular order on the x axis.
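Here is a minimal sketch of how one might reproduce this count and the bar chart in the Wolfram Language; the variable names and chart options are my own choices for illustration:

    s = "this is not a sentence";
    tally = Sort[Counts[Characters[s]]]
    (* counts sorted in ascending order: "a", "c", "h", "o" -> 1; "i" -> 2; "e", "n", "s", "t" -> 3; " " -> 4 *)
    BarChart[Values[tally], ChartLabels -> Keys[tally]]  (* the distribution of letters *)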
It does not look very special, but it also does not look completely uniform, which means that it may be conveying some information. Then, for every symbol s_i, in this case every letter, we can calculate its probability of occurrence in the sentence, multiply it by the logarithm of that probability, and finally take the sum over all the elements. We have thus seen how to arrive at an Entropy function that we can write in the Wolfram Language (a minimal version is sketched a little further below), and with it we can directly get a value for the entropy of our example sequence. In this case, our sequence of letters has Shannon Entropy of about 3.14 bits (taking the logarithm in base 2).

So, if the distribution is not typical in the sense of looking random, which would mean seeing every letter with about the same frequency, what do the distribution and this Entropy value tell us? Actually the sentence in English is telling us a few things, despite being so short. It is already suggesting that among the most frequent letters in English are, for example, 's' and 'e'. So you can see how a non-uniform distribution conveys some information about the underlying rules that made that sequence a sentence written in English. We will see more examples of this later.

The reason the logarithm is used in the formula for Entropy is that, besides having useful mathematical properties, such as being the inverse of exponentiation, being the result of integrating 1/x, and turning multiplication into addition (the logarithm of a product is the sum of the logarithms), the logarithm also helps in dealing with large numbers: it acts as a sort of natural mathematical compression method that works by changing the base of the alphabet used to encode information.

Let me explain with the following example. Imagine that you were asked to guess a number x and that the only information you had is that the number lies in some interval 1 <= x <= N. In this situation, you could ask one by one whether it is the number 1, the number 2, the number 3 and so on, and, if the number was picked at random, you would need about N/2 questions on average to guess the correct number (about 50 questions for N = 100). However, the optimal strategy for guessing a number is the binary search algorithm, which finds the number x in about log_2(N) questions, also written in O notation as O(log_2 N), where O stands for 'order', as in the order of some quantity. So let's say I tell you the number is between 1 and 100; the log_2 of 100 is about 6.64, so rounding up gives 7. It means that with about 7 yes-or-no answers (hence binary) you should be able to guess the number. So, let's say the number is 17; then you could ask the following questions. Is the number:

(1) between 1 and 50? In this case, yes.
(2) between 1 and 25? Yes.
(3) between 1 and 13? No.
(4) between 14 and 20? Yes.
(5) between 14 and 17? Yes.
(6) between 16 and 17? Yes.
(7) is it 16? No.

It is therefore 17, and it could have taken only 6. This strategy always works; try it for yourself by picking some other number. So the log_2 of an integer gives the number of binary (yes-or-no) questions needed to encode any number given in decimal representation. Can you think of other binary questions that lead to the correct answer in the same number of guesses? Do all binary questions lead to the same number of guesses?

The Wolfram Language has its own Entropy function that you can use directly. The first input parameter of the Entropy function is the base, which changes the base of the logarithm in the formula.
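Going back to the guessing game, here is a small sketch of such a binary search in the Wolfram Language; the function name guessNumber and its exact form are my own, meant only to illustrate the counting of yes-or-no questions:

    guessNumber[target_, max_] := Module[{low = 1, high = max, mid, questions = 0},
      While[low < high,
        mid = Floor[(low + high)/2];
        questions++;                       (* one yes-or-no question per halving *)
        If[target <= mid, high = mid, low = mid + 1]];
      {low, questions}]

    guessNumber[17, 100]  (* {17, 7}: the number is found after 7 questions *)
    N[Log2[100]]          (* 6.64386, which rounds up to 7 *)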
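And here is one way such a hand-written Entropy function could look next to the built-in one; the name entropyH and the default base of 2 are my own choices for this sketch:

    entropyH[str_String, b_: 2] := With[{p = N[Values[Counts[Characters[str]]]/StringLength[str]]},
      -Total[p Log[b, p]]]   (* H = minus the sum of p_i log_b(p_i) over the distinct symbols in str *)

    entropyH["this is not a sentence"]       (* about 3.14 bits *)
    N[Entropy[2, "this is not a sentence"]]  (* the built-in Entropy, with the base 2 as first argument *)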
We will usually take base 2, as we are interested in bits in this course unless otherwise stated; for sentences in human languages we can also take the natural logarithm. You can verify that my function gives the same values as the function implemented in the Wolfram Language:
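For instance, using the entropyH sketch from above and an illustrative binary string with four 0s and six 1s (not necessarily the one shown earlier), the check could look like this:

    s = "this is not a sentence";
    Chop[entropyH[s] - N[Entropy[2, s]]]    (* 0: both give the same value, about 3.14 bits *)

    bits = {0, 1, 1, 0, 1, 1, 0, 1, 1, 0};  (* four 0s and six 1s *)
    perms = Table[RandomSample[bits], 10];  (* 10 shuffled copies of the same bits *)
    Union[N[Entropy[2, #]] & /@ perms]      (* a single value, about 0.971: every permutation has the same Entropy *)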