Shannon's original motivation was to come up with a measure useful for quantifying the channel capacity needed to send a binary message through telephone lines. It is often said that Shannon Entropy does not quantify meaning, which is, to some degree, correct in the context of communication theory, because Shannon Entropy seems not to care about the meaning of a message or its actual content. In other words, Shannon Entropy does not care whether you are trying to send a message such as "I'll meet you for lunch on Tuesday 2pm" or a possibly random-looking number such as "84592646". However, there is another angle of Shannon Entropy that suggests exactly the contrary, and it is both what makes this measure interesting and what makes it limited. Meaning in Shannon Entropy is deeply encoded in the form of the underlying working or assumed distributions, in other words, in the context in which a question about the Entropy of some specific feature of an element of a distribution is asked. For example, if the number "84592646" in our example were a telephone number, the underlying distribution would not be the distribution of all alphanumerical strings of any length but, for that example, the set of valid telephone numbers. So Shannon Entropy does distinguish between a sentence such as "I'll meet you for lunch on Tuesday 2pm" and a telephone number, provided the former is known to be a sentence written in English and the latter a valid telephone number. So if you have any knowledge of the ensemble behind your distribution, then Shannon Entropy is all about meaning. For example, the sentence "I'll meet you for lunch on Tuesday 2pm" only has meaning if you know that it is a sentence written in English and you know English; by knowing that this sentence belongs to the subset of well-formed English sentences, its Entropy becomes significantly lower than if we assume the string lives in the space of all possible letter and word arrangements, for which the Entropy would be larger.

The problem with Entropy is not that it is unable to convey or capture meaning but that it is ambiguous and fragile for exactly the same reason, the one related to probability distributions: Shannon Entropy by itself does not provide any means to estimate those distributions, so in practice it relies on traditional statistics or on the observer's beliefs or lack of knowledge. In general, one ends up assuming a uniform distribution, which reduces Entropy to a trivial symbol-counting function. Indeed, if the uniform distribution is assumed, as it is in most cases, what Shannon Entropy measures is the multiplicity of the different symbols used in a sequence, just as its counterpart measure of Entropy in physics counts the number of possible micro-states, such as particles or molecules in a given volume of space.

Leaving aside those arguments about meaning and the limitations of Entropy, there are interesting properties of Entropy worth mentioning and studying. For example, one general property of Shannon Entropy is that redundancy does not add new information, as one would theoretically expect: once the number of symbols or letters is fixed, the greater the redundancy, the lower the Entropy. For example, repeating the letter e at the end of some words provides no new information beyond the original sentence, and, as a function of the sequence length, the Entropy per symbol drops, as can be seen.
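A minimal sketch of this effect in the Wolfram Language, using the built-in Entropy function over the empirical character distribution; the sentence, the helper name h, and the range of repetitions are arbitrary choices made here for illustration:

    (* per-character Shannon Entropy (base 2) of a string *)
    h[s_String] := N[Entropy[2, Characters[s]]]

    sentence = "I'll meet you for lunch on Tuesday 2pm";

    (* Entropy per character after appending k redundant copies of "e" *)
    ListLinePlot[
      Table[{k, h[sentence <> StringRepeat["e", k]]}, {k, 1, 100}],
      AxesLabel -> {"repetitions of e", "bits per character"}]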
In this plot we can see that a growing sequence with an increasing number of trivial repetitions decreases the information content per digit or letter. In practice, with no means to calculate or make a well-educated guess of the underlying probability distributions, Entropy is indeed blind to meaning and is a measure of combinatorial diversity, so a sentence may have the same Entropy as some other, scrambled sentence as long as it uses the same letters. For example, a sentence and a scrambled arrangement of the very same letters have the same Entropy when considered against the ensemble of all possible sequences of the same length over the letters of the Latin alphabet.

Now, a generalization of the concept of a letter in a message is the concept of a micro-state. A micro-state can be almost anything; in this case it is a binary digit. Again, as long as there are the same number of 1s as 0s, and assuming equal probability for all sequences of the same length, a pseudo-random sequence of two thousand 0s and 1s produced by the function RandomInteger[] in the Wolfram Language has almost the same Shannon Entropy as the highly structured sequence obtained by repeating 01 a thousand times (a sketch follows below). A hundred experiments, each producing a pseudo-random sequence of 100 binary digits, show that all of them come very close to 1 bit. This is because the intended behaviour of a pseudo-random generating function like RandomInteger[], if it is a good pseudo-random number generator, is to produce about the same number of 1s and 0s in a disarranged fashion. So the Shannon Entropy of these sequences will be close to log 2, that is, about 1 bit in base two, with a very small standard deviation among the trials, that is, among many different pseudo-random arrangements.

Consider now the Entropy of a sequence of results of tossing a coin, represented by the letters h and t for heads and tails. The Entropy of such a sequence is maximized when the coin is fair, producing about the same number of h's and t's, that is, when both outcomes have equal probability 1/2, or 50:50. This is the situation of maximum uncertainty, as it is most difficult to predict the outcome of the next toss in the sequence; each toss of the coin is said to deliver one full bit of information when you have no idea what will come next. However, if we know the coin is not fair but comes up heads more often than tails, then there is less uncertainty, because you have some information about the outcome: you know that it will come up h or t with greater or lesser probability, and every time it is tossed, one side is more likely to come up than the other. This reduced uncertainty is quantified by Entropy. If the coin is completely biased and always comes up heads, the Entropy will be zero, because each toss of the coin delivers no new information, as the outcome of each toss is completely determined. This is what the typical textbook example of the Shannon Entropy of a random variable looks like, and it is important that you remember it, because when we introduce new tools in this course we will make comparisons to Entropy. So maximum fairness is reached when the coin produces the same number of heads and tails, that is, a 50% chance of being h or t, where uncertainty is maximal. Note that the maximum value of the curve depends on the logarithmic base and the distribution, which are the two parameters of Shannon Entropy. Here, the Entropy is at most 1, the value reached at the uniform (fair) distribution, and we use base two, for which the result is said to be in bits (also sketched below).
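Here is a minimal sketch in the Wolfram Language of both observations, using the built-in Entropy and RandomInteger functions; the variable names and the choice of a hundred trials of length 100 simply follow the description above, and the exact numbers vary slightly from run to run:

    (* the highly structured sequence: 01 repeated a thousand times *)
    ordered = Flatten[ConstantArray[{0, 1}, 1000]];

    (* a pseudo-random sequence of two thousand 0s and 1s *)
    random = RandomInteger[1, 2000];

    (* taking single bits as micro-states, both give (almost) 1 bit per digit *)
    N[{Entropy[2, ordered], Entropy[2, random]}]

    (* a hundred experiments, each a pseudo-random sequence of 100 binary digits *)
    trials = Table[N[Entropy[2, RandomInteger[1, 100]]], {100}];
    {Mean[trials], StandardDeviation[trials]}

The structured sequence gives exactly 1 bit, the pseudo-random one a value extremely close to it, and the hundred trials cluster tightly around 1 bit with a very small standard deviation.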
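And here is a sketch of the coin-toss picture, written directly from the formula H(p) = -p log2(p) - (1 - p) log2(1 - p), the Entropy in bits of a coin that lands heads with probability p; the name hCoin and the example bias of 9/10 are illustrative choices only:

    (* Shannon Entropy, in bits, of a coin that lands heads with probability p *)
    hCoin[p_] := If[p == 0 || p == 1, 0, -p Log[2, p] - (1 - p) Log[2, 1 - p]]

    (* fair coin: 1 bit; biased coin (9/10 heads): less than 1 bit; fully biased: 0 *)
    {hCoin[1/2], N[hCoin[9/10]], hCoin[1]}

    (* the familiar curve: maximal at p = 1/2, zero at p = 0 and p = 1 *)
    Plot[hCoin[p], {p, 0, 1}, AxesLabel -> {"p(heads)", "bits"}]

The maximum value of 1 in this curve comes from choosing base two; a different base or a larger alphabet would change it.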
From that plot, you can see how Entropy is a function of symbol density or symbol count. One can construct examples of binary and ternary sequences in which both the number of distinct symbols and the number of repetitions determine their Shannon Entropy, whether computed with natural or with binary logarithms. You can see that, in this particular case, when taking every bit as the micro-state for Entropy, the arrangement of the 0s and 1s is irrelevant as long as the number of 0s and 1s remains the same. Entropy values also vary significantly as a function of the number of potential symbols available. One way to tell apart cases that have basically the same Entropy, such as 0 and 1 repeated in strict alternation versus randomly arranged, is to take different micro-state lengths, that is, to coarse-grain the sequence. For example, taking units of 2 bits as the micro-state for the measurement of Entropy, the ordered sequence gets an Entropy of 0, but if we do the same with the random-looking sequence we now get a value that diverges from the ordered case (a sketch is given at the end of this section). We will later see how this idea leads to what is called the Entropy rate, the best version of Entropy.

But it is important to see how the assumptions about the underlying probability distributions drive Entropy. For example, if the Entropy values of a message and of its scrambled version are the same, it is only because we may lack information about the source of the message and the language in which it was written, and thus about the meaning it may carry for speakers of that language. So Entropy is not only about the internal syntactic structure of a message but also about how much we know about the underlying ensembles and assumed distributions; it is therefore, in a way, a highly epistemological measure, although in practice it is reduced to a syntactic one in the face of a complete lack of information.
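To make the coarse-graining step concrete, here is a minimal sketch in the Wolfram Language, redefining the ordered and pseudo-random sequences so the block is self-contained; Partition groups the bits into 2-bit micro-states:

    ordered = Flatten[ConstantArray[{0, 1}, 1000]];  (* 01 repeated a thousand times *)
    random = RandomInteger[1, 2000];                 (* pseudo-random bits *)

    (* single-bit micro-states: both sequences sit at about 1 bit *)
    N[{Entropy[2, ordered], Entropy[2, random]}]

    (* 2-bit micro-states: the ordered sequence contains only one block type,
       {0, 1}, so its Entropy collapses to 0, while the pseudo-random one uses
       about four block types in roughly equal proportion *)
    N[{Entropy[2, Partition[ordered, 2]], Entropy[2, Partition[random, 2]]}]

Under this coarse-graining the ordered sequence drops to zero Entropy while the pseudo-random one comes out close to 2 bits, which is the divergence mentioned above.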