In this video I'm going to show you how to compute Shannon information content.
First let's look at the analogy between Boltzmann entropy and Shannon information.
Shannon got his idea for characterizing information from the statistical mechanics of Boltzmann.
So, recall that we defined the notion of a "microstate" as some detailed configuration of the system components---
so, in the example of the slot machine, that would be one configuration of the slot machine windows, such as
"apple, pear, cherry"---and a "macrostate" was some collection or set of microstates, such as
"all three the same" or "exactly one apple", and the entropy, S, assumes that all microstates are equally probable,
and here was the equation, engraved on Boltzmann's tomb, that says that the entropy of a particular macrostate
was equal to k, Boltzmann's constant, times the log of W. Here this log---"l-o-g"---he used this to mean
the natural log, and W was the number of microstates corresponding to this macrostate.
The k just gave us a way of assigning units, often measured in joules per kelvin, but for our purposes let's just assume
that k is equal to 1---people actually do that sometimes when computing entropy.
It just gives us the entropy in different units, and we can still use it to compare entropies.
So, for example, let's look back at our slot machine.
Remember our quiz, where we asked you how many microstates give rise to the "win" macrostate,
"all three the same", and that was 5, and how many microstates give rise to the "lose" macrostate,
and that was 120, so according to Boltzmann, if we assume that k, the Boltzmann constant, is 1, then
we get S of this macrostate, let's call this S of the win macrostate, is equal to the natural
log of 5, which is about 1.61, and the S of lose is equal to the natural log of 120, which is about 4.79---
and the reason Boltzmann used the natural log here was to get this in a certain range of numbers,
so usually we're talking about systems with a huge number of microstates that give rise to particular macrostates,
and the natural log was the way of scaling those very large numbers.
You don't have to worry about the details of that, but you can see that the information---I'm sorry,
the Boltzmann entropy of the "win" macrostate is much smaller than the Boltzmann entropy
of the "lose" macrostate, which matches our intuition.
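The calculation above can be sketched in a few lines of Python (the function name is my own; the microstate counts 5 and 120 come from the slot-machine quiz, and k is set to 1 as discussed):

```python
import math

def boltzmann_entropy(num_microstates, k=1.0):
    """Boltzmann entropy S = k * ln(W), with k = 1 for unitless comparison."""
    return k * math.log(num_microstates)

# Slot-machine macrostates from the quiz:
# "win" (all three windows the same) has 5 microstates,
# "lose" has 120 microstates.
print(boltzmann_entropy(5))    # ~1.61
print(boltzmann_entropy(120))  # ~4.79
```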
Now, returning to our analogy, the Shannon information version of a microstate is a message---
a symbol, number, or word---and the Shannon information version of a macrostate was a message source,
which is a collection or set of possible messages, with some probability for sending each possible message.
Now, just as we did for Boltzmann entropy, we're going to assume here that all the messages are equally probable,
with M being the number of messages.
Now, we can define H, the Shannon information content of a message source,
as being equal to the log base 2 of M, that is, the log base 2 of the number of messages.
The log base 2 allows us to measure the information content in bits per message.
Here's our example of our 1-year-old, who only says "da da da da", so there's only one message here,
and our Shannon information content is equal to the log, base 2, of 1.
Well, 2 raised to the 0th power is 1, so the information content is 0, which went with our intuition---
there's no unpredictability, there's no surprise. Now,
suppose now that instead of "da da da", our baby said "da ba ma", that is three messages,
so if that were the case, then M would be equal to 3, and H would be equal to log base 2 of 3, which,
according to my calculator, is 1.58. So, that gives us quite a bit more information content than
when there's only one message.
In our example with a fair coin, heads or tails, there are two messages,
so H of a fair coin is equal to the log base 2 of 2, which equals 1,
so the information content here is 1, which is always the information content
if we have two choices, heads or tails, 0 or 1, with equal probability.
One more example of this kind, the information content of a fair die is equal to, well here
M equals 6---there's 6 possible messages, one for each side of the die,
so it's equal to the log, base 2, of 6, which is approximately equal to 2.58, and that's in bits.
So I'll tell you a little bit more, later on, what exactly this represents, in terms of
coding or computer memory, but for now it goes along with our intuition that this has the highest information content
so far that we've seen because there's 6 different messages.
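All of the equal-probability examples so far can be checked with a short Python sketch (the function name is my own):

```python
import math

def shannon_info_equal(num_messages):
    """H = log2(M) bits per message, assuming all M messages are equally probable."""
    return math.log2(num_messages)

print(shannon_info_equal(1))  # baby who only says "da": 0 bits, no surprise
print(shannon_info_equal(2))  # fair coin: 1 bit
print(shannon_info_equal(3))  # "da", "ba", "ma": ~1.58 bits
print(shannon_info_equal(6))  # fair die: ~2.58 bits
```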
Well now I'm gonna write down a more general formula.
For the previous formula, we were assuming that all messages have equal probability,
but most often that's not the case, there's different messages with different probabilities---
we have a biased coin, or, more realistically, we have a person who's talking and the words that
come out of their mouth are not going to be equally probable.
A more general formula---and this is the formula that Shannon actually wrote down---says,
let M be the number of possible messages, and we're going to assign a probability
to each message. So we'll call the probability of message i, that's one of the M messages,
p, sub i---this is just a name for a probability given to message i---and Shannon's
formula said the information content, H, of this message source---well, this is a summation symbol,
which says we're going to sum up, for each message i, from 1 all the way up to the total
number of messages, the probability times the log base 2 of the probability. So this is like taking a weighted average,
weighted by probability, and we put a minus sign out here because these probabilities are all fractions,
and the log of a fraction is going to be negative, so we put the minus sign out here
to counteract that negative, so you'll see that in a couple of examples.
Now, if you don't understand this formula, just be patient, because I'll show you how it works in practice.
Let's assume now that we have a biased coin,
and that the probability of heads is no longer 1/2, but let's say that it's .6, and the probability of tails is .4.
Ok, what is the information content of this?
So, let's write down our formula: H, the information content of the biased coin,
is equal to minus the sum of the two components here, which is .6
times log base 2 of .6, plus .4 times log base 2 of .4, which, on my calculator,
is approximately .971 bits. So this is lower information content than the fair coin,
which had information content of 1 bit, and that's because, of course, this is more predictable---
heads is more likely to come up than tails.
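Shannon's general formula and the biased-coin calculation can be sketched in Python like this (the function name is my own):

```python
import math

def shannon_entropy(probs):
    """H = -sum(p_i * log2(p_i)) over all messages i, in bits per message.

    The minus sign out front counteracts the negative logs of the fractions.
    Messages with zero probability contribute nothing, so they are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Biased coin: P(heads) = 0.6, P(tails) = 0.4
print(shannon_entropy([0.6, 0.4]))  # ~0.971 bits
# Fair coin, for comparison:
print(shannon_entropy([0.5, 0.5]))  # 1 bit
```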
Well now I can do something a little more general and look at an example of text information content.
So suppose that I have a text, my question is what is the information content of this text?
There's actually a number of ways to calculate that, and people do
calculate the information content of text, as a measure of its complexity, for example, but
the way that I'm going to do it is to look at the frequency of the different words
as a measure of their probability. For each word, I'm going to write down the word,
its frequency, and then, what I'm going to call its relative frequency.
So let's see... the word "to" appears twice, "*to* be or not *to* be", but there's 6 words altogether,
so I'm going to call its relative frequency 2/6, so out of the 6 words
it appears twice. "Be" appears twice also, "or" appears once, "not" appears once...
we'll make these relative frequencies their probabilities so we can say that the information content
of this text is equal to minus, and now we're going to sum up for each message,
that is, each word, its probability times the log base 2 of its probability,
and we're going to do that for each word...
and that's approximately equal to 1.9 bits.
So this is a way of calculating the information content of a piece of text---we'll see a little bit more
on that in the exercises.
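The word-frequency calculation can be sketched in Python as well (the function name is my own; it treats each word's relative frequency as its probability, as described above):

```python
import math
from collections import Counter

def text_information_content(text):
    """Estimate the information content of a text, in bits per word,
    using each word's relative frequency as its probability."""
    words = text.lower().split()
    counts = Counter(words)          # e.g. "to": 2, "be": 2, "or": 1, "not": 1
    total = len(words)               # 6 words altogether in the example
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(text_information_content("to be or not to be"))  # ~1.9 bits
```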