1
00:00:00,572 --> 00:00:06,914
In this video I'm going to show you how to compute Shannon information content.
2
00:00:06,914 --> 00:00:13,242
First let's look at the analogy between Boltzmann entropy and Shannon information.
3
00:00:13,242 --> 00:00:21,472
Shannon got his idea for characterizing information from the statistical mechanics of Boltzmann.
4
00:00:21,472 --> 00:00:29,862
So, recall that we defined the notion of a "microstate" as some detailed configuration of the system components---
5
00:00:29,862 --> 00:00:36,123
so, in the example of the slot machine, that would be one configuration of the slot machine windows, such as
6
00:00:36,139 --> 00:00:42,572
"apple, pear, cherry"---and a "macrostate" was some collection or set of microstates, such as
7
00:00:42,572 --> 00:00:52,432
"all three the same" or, "exactly one apple", and the entropy, S, assumes that all microstates are equally probable
8
00:00:52,432 --> 00:01:00,156
and here was the equation, engraved on Boltzmann's tomb, that says that the entropy of a particular macrostate
9
00:01:00,156 --> 00:01:12,612
was equal to k, Boltzmann's constant, times the log of W. Here this log---"l-o-g"---is used to mean
10
00:01:12,612 --> 00:01:19,891
the natural log, and W was the number of microstates corresponding to this macrostate.
11
00:01:19,891 --> 00:01:29,789
k just gave us a way of assigning units, often measured in joules per kelvin, but for our purposes
12
00:01:29,789 --> 00:01:37,927
k is equal to 1---let's just assume that---people actually do that to compute entropy sometimes,
13
00:01:37,927 --> 00:01:42,655
and it just gives us the entropy in different units, but we can use it to compare entropies.
14
00:01:42,655 --> 00:01:45,951
So, for example, let's look back at our slot machine.
15
00:01:45,951 --> 00:01:51,123
Remember our quiz, where we asked you how many microstates give rise to the "win" macrostate,
16
00:01:51,123 --> 00:02:00,301
"all three the same", and that was 5, and how many microstates give rise to the "lose" macrostate,
17
00:02:00,301 --> 00:02:09,310
and that was 120, so according to Boltzmann, if we assume that k, the Boltzmann constant, is 1, then
18
00:02:09,310 --> 00:02:18,878
we get S of this macrostate, let's call this S of the win macrostate, is equal to the natural
19
00:02:18,878 --> 00:02:43,571
log of 5, which is about 1.61, and the S of lose is equal to the natural log of 120, which is about 4.79---
20
00:02:43,571 --> 00:02:51,722
and the reason Boltzmann used the natural log here was to get this in a certain range of numbers,
21
00:02:51,722 --> 00:02:59,865
so usually we're talking about systems with a huge number of microstates that give rise to particular macrostates,
22
00:02:59,865 --> 00:03:05,462
and the natural log was the way of scaling those very large numbers.
23
00:03:05,462 --> 00:03:10,421
You don't have to worry about the details of that, but you can see that the information---I'm sorry,
24
00:03:10,421 --> 00:03:17,787
the Boltzmann entropy of this macrostate is much smaller than the Boltzmann entropy
25
00:03:17,787 --> 00:03:21,494
of this macrostate, which was our intuition.
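[As an aside, the two values above can be checked with a short Python sketch, assuming k = 1 as in the video:]

```python
import math

k = 1  # Boltzmann's constant set to 1, as assumed in the video

def boltzmann_entropy(num_microstates):
    """S = k * ln(W), where W is the number of microstates of a macrostate."""
    return k * math.log(num_microstates)

s_win = boltzmann_entropy(5)     # "win" ("all three the same"): 5 microstates
s_lose = boltzmann_entropy(120)  # "lose": 120 microstates

print(round(s_win, 2))   # about 1.61
print(round(s_lose, 2))  # about 4.79
```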
26
00:03:21,494 --> 00:03:29,217
Now, returning to our analogy, the Shannon information version of a microstate is a message---
27
00:03:29,217 --> 00:03:37,299
a symbol, number, or word---and the Shannon information version of a macrostate was a message source,
28
00:03:37,299 --> 00:03:43,424
which is a collection or set of possible messages, with some probability for sending each possible message.
29
00:03:43,424 --> 00:03:51,254
Now, just as we did for Boltzmann entropy, we're going to assume here that all the messages are equally probable,
30
00:03:51,254 --> 00:03:55,897
with M being the number of messages.
31
00:03:55,897 --> 00:04:00,764
Now, we can define H, the Shannon information content of a message source,
32
00:04:00,764 --> 00:04:07,026
as being equal to the log, base 2, of M, that is, the log base 2 of the number of messages.
33
00:04:07,026 --> 00:04:14,028
The log base 2 allows us to measure the information content in bits-per-message.
34
00:04:14,028 --> 00:04:23,538
Here's our example of our 1-year-old, who only says "da da da da", so there's only one message here,
35
00:04:23,538 --> 00:04:36,260
and our Shannon information content is equal to the log, base 2, of 1.
36
00:04:36,260 --> 00:04:45,966
Well, 2 raised to the 0th power is 1, so the information content is 0, which went with our intuition---
37
00:04:45,966 --> 00:04:50,579
there's no unpredictability, there's no surprise. Now,
38
00:04:50,579 --> 00:04:59,905
suppose now that instead of "da da da", our baby said "da ba ma", that is, three messages,
39
00:04:59,905 --> 00:05:11,551
so if that were the case, then M would be equal to 3, and H would be equal to log base 2 of 3, which,
40
00:05:11,551 --> 00:05:24,369
according to my calculator, is 1.58. So, that gives us quite a bit more information content than
41
00:05:24,369 --> 00:05:29,012
when there's only one message.
42
00:05:29,012 --> 00:05:34,970
In our example with a fair coin, heads or tails, there are two messages,
43
00:05:34,970 --> 00:05:44,770
so H of a fair coin is equal to the log base 2 of 2, which equals 1,
44
00:05:44,770 --> 00:05:51,561
so the information content here is 1, which is always the information content
45
00:05:51,561 --> 00:05:56,462
if we have two choices, heads or tails, 0 or 1, with equal probability.
46
00:05:56,462 --> 00:06:07,183
One more example of this kind, the information content of a fair die is equal to, well here
47
00:06:07,183 --> 00:06:12,892
M equals 6---there are 6 possible messages, one for each side of the die,
48
00:06:12,892 --> 00:06:24,384
so it's equal to the log, base 2, of 6, which is approximately equal to 2.58, and that's in bits.
49
00:06:24,432 --> 00:06:30,261
So I'll tell you a little bit more, later on, what exactly this represents, in terms of
50
00:06:30,261 --> 00:06:37,586
coding or computer memory, but for now it goes along with our intuition that this has the highest information content
51
00:06:37,586 --> 00:06:41,456
so far that we've seen, because there are 6 different messages.
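[The equal-probability examples so far---the one-word baby, the fair coin, the talkative baby, and the fair die---can be reproduced with a minimal Python sketch of H = log2(M):]

```python
import math

def shannon_equal(m):
    """H = log2(M) for a source of M equally probable messages, in bits."""
    return math.log2(m)

print(shannon_equal(1))            # "da da da": one message, 0 bits
print(shannon_equal(2))            # fair coin: 1 bit
print(round(shannon_equal(3), 2))  # "da ba ma": about 1.58 bits
print(round(shannon_equal(6), 2))  # fair die: about 2.58 bits
```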
52
00:06:41,456 --> 00:06:46,414
Well now I'm going to write down a more general formula.
53
00:06:46,414 --> 00:06:51,972
For the previous formula, we were assuming that all messages have equal probability,
54
00:06:51,972 --> 00:06:57,907
but most often that's not the case: different messages have different probabilities---
55
00:06:57,907 --> 00:07:05,750
we have a biased coin, or, more realistically, we have a person who's talking and the words that
56
00:07:05,750 --> 00:07:09,005
come out of their mouth are not going to be equally probable.
57
00:07:09,005 --> 00:07:14,141
A more general formula---and this is the formula that Shannon actually wrote down---says,
58
00:07:14,141 --> 00:07:19,039
let M be the number of possible messages, and we're going to assign a probability
59
00:07:19,039 --> 00:07:25,464
to each message. So we'll call the probability of message i, that's one of the M messages,
60
00:07:25,464 --> 00:07:34,499
p, sub i---this is just a name for a probability given to message i---and Shannon's
61
00:07:34,499 --> 00:07:42,386
formula said the information content of this message source---well, this is a summation symbol,
62
00:07:42,386 --> 00:07:48,853
this says we're going to sum up all of the different---for each message i, all the way up to the total
63
00:07:48,853 --> 00:08:00,087
number of messages, this log-2 of the probability, times the probability. So this is like taking a weighted average,
64
00:08:00,087 --> 00:08:07,257
weighted by probability, and we put a minus sign out here because these probabilities are all fractions,
65
00:08:07,257 --> 00:08:12,107
and the log of a fraction is going to be negative, so we put the minus sign out here
66
00:08:12,107 --> 00:08:17,191
to counteract that negative, so you'll see that in a couple of examples.
67
00:08:17,191 --> 00:08:24,124
Now, if you don't understand this formula, just be patient, because I'll show you how it works in practice.
68
00:08:24,124 --> 00:08:29,777
Let's assume now that we have a biased coin,
69
00:08:29,777 --> 00:08:42,654
and that the probability of heads is no longer 1/2, but let's say that it's .6, and the probability of tails is .4.
70
00:08:42,654 --> 00:08:48,717
Ok, what is the information content of this?
71
00:08:48,717 --> 00:08:56,185
So, let's write down our formula, H, the information content of the biased coin,
72
00:08:56,185 --> 00:09:04,583
is equal to minus the sum of the two components here, which is .6
73
00:09:04,583 --> 00:09:13,875
times log base 2 of .6, plus .4 times log base 2 of .4, which, on my calculator,
74
00:09:13,875 --> 00:09:21,781
is approximately .971 bits. So this is lower information content than the fair coin,
75
00:09:21,781 --> 00:09:27,617
which had information content of 1 bit, and that's because, of course, this is more predictable---
76
00:09:27,617 --> 00:09:30,907
heads is more likely to come up than tails.
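[Shannon's general formula, and the biased-coin value above, can be checked with a minimal Python sketch:]

```python
import math

def shannon_entropy(probs):
    """H = -sum(p_i * log2(p_i)) over messages with probability p_i > 0, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))           # fair coin: 1.0 bit
print(round(shannon_entropy([0.6, 0.4]), 3)) # biased coin: about 0.971 bits
```

Note that the minus sign out front counteracts the negative logs of the fractional probabilities, exactly as described in the video.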
77
00:09:30,907 --> 00:09:36,910
Well now I can do something a little more general and look at an example of text information content.
78
00:09:36,910 --> 00:09:46,715
So suppose that I have a text; my question is, what is the information content of this text?
79
00:09:46,715 --> 00:09:50,846
There's actually a number of ways to calculate that, and people do
80
00:09:50,846 --> 00:09:56,734
calculate the information content of text, as a measure of its complexity, for example, but
81
00:09:56,734 --> 00:10:01,248
the way that I'm going to do it is to look at the frequency of the different words
82
00:10:01,248 --> 00:10:07,301
as a measure of their probability. For each word, I'm going to write down the word,
83
00:10:07,301 --> 00:10:14,291
its frequency, and then, what I'm going to call its relative frequency.
84
00:10:14,291 --> 00:10:27,385
So let's see... the word "to" appears twice, "to be or not to be", but there's 6 words altogether,
85
00:10:27,385 --> 00:10:34,294
so I'm going to call its relative frequency 2/6, so out of the 6 words
86
00:10:34,294 --> 00:10:52,522
it appears twice. "Be" appears twice also, "or" appears once, "not" appears once...
87
00:10:52,522 --> 00:11:01,941
we'll make these relative frequencies their probabilities so we can say that the information content
88
00:11:01,941 --> 00:11:07,517
of this text is equal to minus, and now we're going to sum up for each message,
89
00:11:07,517 --> 00:11:15,198
that is, each word, its probability times the log base 2 of its probability,
90
00:11:15,198 --> 00:11:19,271
and we're going to do that for each word...
91
00:11:19,271 --> 00:11:31,211
and that's approximately equal to 1.9 bits.
92
00:11:31,211 --> 00:11:39,180
So this is a way of calculating the information content of a piece of text---we'll see a little bit more
93
00:11:39,180 --> 00:11:41,834
on that in the exercises.
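[The whole word-frequency calculation for "to be or not to be" can be sketched in a few lines of Python, using relative frequencies as probabilities just as in the video:]

```python
import math
from collections import Counter

def text_entropy(text):
    """Information content of a text, in bits, using each word's
    relative frequency as its probability."""
    words = text.split()
    counts = Counter(words)       # e.g. "to" appears 2 times out of 6 words
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(round(text_entropy("to be or not to be"), 1))  # about 1.9 bits
```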