Let's keep on looking at the relationship between information and probability. Probability is just flipping coins: heads, heads, tails, thank goodness. Actually, by the way, I had a legal two-headed quarter issued by the US Mint. Can you guess what it is? It's actually the New Hampshire quarter, which has George Washington on one side and the Old Man of the Mountain on the other. The Old Man of the Mountain is a rock formation that looks like a head; unfortunately it fell down about ten years ago, so now they're putting out a new New Hampshire quarter. I also have a legal US quarter that has five heads, and I urge you to consider what quarter that might be.

So let's look at this relationship between information and probability. We can either think of probability as something that's a prior probability: the probability of heads is one half and the probability of tails is one half, because there is no particular reason to choose between heads and tails when I flip this coin. But we can also think of it in terms of frequencies. Suppose we have a long, long sequence of heads and tails. I'm trying to make this random, but I have a funny feeling it's exactly the same sequence of heads and tails that I put there before. My students complain that whenever I write a random string of zeros and ones on the board, it's always the same sequence, which can't really be quite random, can it? Anyway, what we can do is take a very, very long sequence and count the number of heads and tails. The frequency of heads is equal to the number of heads divided by the total length of the sequence, and the frequency of tails is defined the same way. Then let's take our intuition that if the coin is fair, almost all such sequences that we would generate would have roughly fifty percent heads and fifty percent tails, and make that mathematically more precise.

So let's look at the total number of sequences of coin tosses of length m which have exactly m_h heads and m_t tails. The sum of these two numbers, m_h + m_t, is equal to m, the total length of the sequence. Now the total number of such sequences is a number called m choose m_h. If I call heads zero and tails one, it's the number of binary sequences of length m that have exactly m_h zeros and m_t ones. These are the same thing, because the number of sequences that have exactly m_h heads is exactly the same as the number of sequences that have exactly m_t tails. So what is this number? It's written as this funny thing with parentheses, with m on top and m_h at the bottom, and it's pronounced "m choose m_h" because it's the number of ways of choosing m_h spots, or locations, out of m possible spots. So we have zero, zero, zero, one, one, zero; I'm again falling into the same sequence. We have a whole bunch of different spots, and we want to count the number of places where we can put the zeros. We have a certain number of zeros and we want to put them down in a particular way. This total number is the number of sequences that have exactly m_h heads in them. So I'll just write down what this number is: m choose m_h is equal to m! / (m_h! * (m - m_h)!), which is equal to m! / (m_h! * m_t!). I use this funny notation with an exclamation mark. The exclamation mark says that if I have a number m, then m! is equal to one times two times three times four times ... times (m-1) times m. It's the product of all the numbers up to m.
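(A quick aside, not part of the lecture: here is a minimal Python sketch that checks this counting formula numerically, once by brute-force enumeration and once with the factorial expression. The function name count_by_enumeration and the example values m = 10, m_h = 4 are just illustrative choices of mine.)

```python
from itertools import product
from math import comb, factorial

def count_by_enumeration(m, m_h):
    """Brute force: list every length-m sequence of 0s (heads) and 1s (tails)
    and count those with exactly m_h heads."""
    return sum(1 for seq in product((0, 1), repeat=m) if seq.count(0) == m_h)

m, m_h = 10, 4   # small enough that listing all 2**10 = 1024 sequences is cheap

# The formula from the lecture: m! / (m_h! * (m - m_h)!)
by_factorials = factorial(m) // (factorial(m_h) * factorial(m - m_h))

print(count_by_enumeration(m, m_h))  # 210
print(by_factorials)                 # 210
print(comb(m, m_h))                  # 210, Python's built-in "m choose m_h"
```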
So, for example, 2! is equal to one times two, which is two. 3! is equal to one times two times three, which is six, et cetera. They get big really, really fast, these numbers. Now, in this number m choose m_h, the m! in the numerator is the total number of ways of rearranging m objects. So if we have a certain number of zeros and ones, we can rearrange these m objects in any order we like; but if we have a zero in the first place and a zero in the second place, we don't care whether this is this zero or the other zero. They're both zeros. So we have to divide by the number of ways of rearranging the zeros among each other. That's the number m_h!, the number of ways of rearranging the zeros, and m_t! is the number of ways of rearranging the ones. The total number of sequences that have exactly m_h zeros and m_t ones is the total number of ways of rearranging the m objects, divided by the number of ways of rearranging the zeros, divided by the number of ways of rearranging the ones. This is a fact; if it doesn't make sense, just suck it up. Now we have a nice mathematical formula for this number.

Here is another fun fact. I'm now going to define a probability. The probability of heads I'll call q(h), because we agreed before that the probability of heads was one half and the probability of tails was one half. We'll define q(h) to be the number of heads divided by m. This is the frequency of heads, so it's what we observe the fraction of heads to be: the observed probability, as it were, of heads. But it's just a frequency; we don't know if it's the probability. And 1 - q(h) is equal to q(t), which is equal to the frequency of tails.

Here is the fundamental formula of information theory. I'll say it again: the fundamental formula of information theory. We write S for entropy, or I for information; it gets written both ways because, remember, entropy, the quantity that people discovered in the 19th century, is just information: the information required to label the different possible configurations of atoms and molecules. In this case we're trying to label the total number of possibilities of heads and tails, so this is equal to -q(h) log base 2 q(h) - q(t) log base 2 q(t). This is a funny formula. There are these minus signs, these logarithms, these probabilities, et cetera. So let me actually write it out in a particular case, just to give you a feeling for what it is. Suppose q(h) turns out to be exactly one half, so exactly half of the flips are heads, and suppose q(t) is the same thing, because it's one minus q(h). Then this quantity S or I, information and entropy, two sides of the same coin: I is equal to -0.5 log base 2 of 0.5, because q(h) is 0.5, plus another -0.5 log base 2 of 0.5, because q(t) is also 0.5. There are two of these terms, each multiplied by 0.5, so adding them up, this is equal to -log base 2 of 0.5. But log base 2 of 0.5 is the power to which two must be raised to equal 0.5, and that's -1, because 2^(-1) is 0.5. So this is equal to -(-1), which is equal to one, and it actually has a unit: it's one bit. So this formula, I won't call it magical, it's just mathematical, but it's extremely useful. It says that if I have a coin and I flip it, and its probability of heads is 0.5 and its probability of tails is 0.5, then the amount of information that's encompassed in a single coin flip, if you like, the amount of information that is generated by the coin flip, is one bit.
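(Again as a sketch that isn't part of the lecture: this snippet evaluates the fundamental formula -q(h) log base 2 q(h) - q(t) log base 2 q(t) for a few coins. The function name coin_entropy is just a label I've chosen, and the q > 0 check is the usual convention that a zero-probability outcome contributes nothing.)

```python
from math import log2

def coin_entropy(q_h):
    """S = -q(h) log2 q(h) - q(t) log2 q(t), in bits, with q(t) = 1 - q(h)."""
    total = 0.0
    for q in (q_h, 1.0 - q_h):
        if q > 0:              # a term with q = 0 contributes nothing
            total -= q * log2(q)
    return total

print(coin_entropy(0.5))   # 1.0 bit: a fair coin flip generates one bit
print(coin_entropy(0.9))   # about 0.47 bits: a biased coin is more predictable
print(coin_entropy(1.0))   # 0.0 bits: the two-headed quarter tells you nothing new
```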
It happens to be tails this time. So flipping a coin where the frequency of heads and the frequency of tails are each 0.5 gives you one bit of information.

Now, if we go back to our counting argument about probability, what we find is that the number of bits required to describe which particular arrangement of exactly m_h heads we got, and remember, there are m choose m_h such arrangements, is just equal, to a very good approximation for long sequences, to m times (-q(h) log base 2 of q(h) - q(t) log base 2 of q(t)). So this is equal to m times the amount of information. I keep getting tempted to call it magical, but I don't believe in magic. This mathematical formula for the information is really just a way of counting the number of possibilities. We look at the number of possible ways of getting exactly m_h heads, we take its logarithm, that's the number of bits of information in this particular sequence, and we find that it's equal to m, the total number of flips, times this quantity, the information per flip.

So, just to summarize: the fundamental formula of information theory is that the amount of information, which is also the amount of entropy, if I have two possibilities, is -q(h) log base 2 of q(h) - q(t) log base 2 of q(t). If I have more possibilities, say k possible outcomes, then the amount of information is equal to minus the sum from i = 1 to k of q(i) log base 2 of q(i). And we get this just from counting the number of possibilities that go into constructing these frequencies. This extremely useful formula is the basis for all of information theory, including the mathematical theory of communication and the theory of computation.
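(One more illustrative sketch, not from the lecture itself: it compares the exact number of bits needed to label an arrangement, log base 2 of m choose m_h, with m times the per-flip information, and then applies the same entropy function to a k-outcome example. The values m = 1000, m_h = 300 and the six-sided die are arbitrary choices of mine.)

```python
from math import comb, log2

def entropy(qs):
    """General formula: S = -sum over i of q(i) log2 q(i), in bits."""
    return -sum(q * log2(q) for q in qs if q > 0)

m, m_h = 1000, 300
q_h = m_h / m

# Bits needed to single out one arrangement of 300 heads among 1000 flips
bits_exact = log2(comb(m, m_h))

# The counting argument: m times the per-flip information
bits_estimate = m * entropy([q_h, 1 - q_h])

print(bits_exact)      # about 876.1 bits
print(bits_estimate)   # about 881.3 bits; the relative gap shrinks as m grows

# The same formula handles k outcomes, e.g. a fair six-sided die:
print(entropy([1 / 6] * 6))   # about 2.585 bits per roll
```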