So we've been talking about information: how you measure information (in bits, of course), and how you can use information to label things, for instance with bar codes. Then we talked about probability and information: if I have probabilities p_i for events, the amount of information associated with those events is minus the sum over i of p_i log to the base 2 of p_i. This beautiful formula was developed by Maxwell, Boltzmann, and Gibbs back in the middle of the nineteenth century to talk about the amount of entropy in atoms and molecules, which is why it's often called S, for entropy, and it was rediscovered by Claude Shannon in the 1940s to talk about information theory in the abstract, the mathematical theory of communication.

In fact, there's a funny story about this. When Shannon came up with this formula, minus the sum over i of p_i log to the base 2 of p_i, he went to John von Neumann, the famous mathematician, and asked, "What should I call this quantity?" Von Neumann said, "You should call it H, because that's what Boltzmann called it." But von Neumann, who had a famously good memory, apparently forgot that Boltzmann's H, from his famous "H theorem," was the same thing without the minus sign, so it's a negative quantity that gets more and more negative, as opposed to entropy, which is a positive quantity that gets more and more positive. So these fundamental formulas of information theory actually go back to the mid nineteenth century, a hundred and fifty years ago.

Now we'd like to apply them to ideas about communication, and to do that I'd like to tell you a little bit more about probability. We talked about probabilities for events: the probability of X, where X is "it's sunny," and the probability of Y, where Y is "it's raining." Using the notation I introduced for Boolean logic before, where the wedge symbol means AND, we can look at the probability of X AND Y, or, streamlining our notation further, just the probability of XY: the probability that it's sunny and it's raining at the same time. Now, in most of the world this is a pretty small probability, but here in Santa Fe it happens all the time, and as a result you get rather beautiful rainbows, single, double, triple, on a daily basis.

This is what's called the joint probability: the joint probability that it's sunny and it's raining, the joint probability of X and Y. What do we expect of this joint probability? We have the probability of X AND Y, which tells us the probability that it's sunny and it's raining, and we can also look at the probability of X AND (NOT Y), again using the notation introduced to us by George Boole, the famous husband of Mary Everest, niece of the Surveyor General of British India. And there's a relationship which says that the probability of X on its own should be equal to the probability of X AND Y plus the probability of X AND (NOT Y). The probability of X on its own is called the "marginal probability": the probability that it's sunny on its own is the probability that it's sunny and it's raining plus the probability that it's sunny and it's not raining. I think that makes some kind of sense. Why is it called the "marginal probability"? I have no idea, so let's not even worry about it.
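To make these two notions concrete, here is a minimal sketch in Python of the information formula and of the marginal relation; the sunny/raining probabilities in it are invented purely for illustration, not taken from the lecture or from real weather data.

```python
# Minimal sketch: the information formula -sum_i p_i log2(p_i), and the
# marginal relation P(X) = P(X AND Y) + P(X AND NOT Y).
# The sunny/raining numbers below are invented for illustration only.
from math import log2

def information(probs):
    """Return -sum_i p_i log2(p_i), skipping zero-probability events."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities for X = "it's sunny", Y = "it's raining":
p_sunny_and_raining = 0.02         # P(X AND Y): rare, but it happens in Santa Fe
p_sunny_and_not_raining = 0.83     # P(X AND NOT Y)
p_not_sunny_and_raining = 0.10
p_not_sunny_and_not_raining = 0.05

# Marginal probability that it's sunny: P(X) = P(X AND Y) + P(X AND NOT Y)
p_sunny = p_sunny_and_raining + p_sunny_and_not_raining
print(p_sunny)   # 0.85

# Information in the full set of four joint events:
print(information([p_sunny_and_raining, p_sunny_and_not_raining,
                   p_not_sunny_and_raining, p_not_sunny_and_not_raining]))
```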
There's a very nice picture of probabilities in terms of set theory. I don't know about you, but I grew up in the age of "new math," where they tried to teach us about set theory, unions of sets, intersections of sets, and things like that, starting at a very early age, which means people of my generation are completely unable to do their tax returns. But for me, dealing a lot with math, it has actually been quite helpful for my career to learn about set theory at the age of three or four, or whatever it was.

So we have a picture like this. This is the space, the set of all events. Here is the set X, the set of events where it's sunny, and here is the set Y, the set of events where it's raining. This region right here is called "X intersection Y," the set of events where it's both sunny and raining. In contrast, this whole region right here is "X union Y," the set of events where it's either sunny or raining. And now you can kind of see where George Boole got his funny "cap" and "cup" notation: the logical statement X AND Y corresponds to the intersection of the two sets, the events where it's sunny and it's raining, while X OR Y corresponds to the union, the events where it's sunny or it's raining. And you can draw all kinds of nice pictures. Here's Z, where let's say it's snowy at the same time it's sunny, which is something I've seen happen here in Santa Fe; it's not so strange here. There we have X intersection Y intersection Z, which is not the empty set when it comes to Santa Fe.
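As a small aside, this correspondence between AND/OR and intersection/union is easy to play with directly. Here is a toy sketch using Python's built-in sets; the day labels, and which days count as sunny, rainy, or snowy, are made up purely for illustration.

```python
# Toy illustration of the set picture: events are days, and X, Y, Z are the
# sets of days on which it's sunny, raining, or snowy.  All labels invented.
days = {"mon", "tue", "wed", "thu", "fri"}   # the space of all events
X = {"mon", "tue", "wed"}                    # days when it's sunny
Y = {"wed", "thu"}                           # days when it's raining
Z = {"mon", "wed"}                           # days when it's snowy

print(X & Y)       # X intersection Y: days when it's sunny AND raining -> {'wed'}
print(X | Y)       # X union Y: days when it's sunny OR raining
print(X & Y & Z)   # sunny AND raining AND snowy: not empty here either -> {'wed'}

# If every day is equally likely, probabilities are just set sizes divided by
# the size of the whole space:
prob = lambda S: len(S) / len(days)
print(prob(X & Y))   # joint probability that it's sunny and raining -> 0.2
```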
OK, so now let's actually look at the kinds of information associated with this. Suppose I have two sets of possible events, one labeled by i and the other labeled by j, and I look at p of i and j, the probability that the first type of event is i and the second type of event is j. Actually, let's be slightly fancier and call these events x_i and y_j, so that i labels the different events of X and j labels the different events of Y. For instance, x_i could be two events: i = 0 could be "it's not sunny" and i = 1 could be "it's sunny." Similarly, y_j could be "it's not raining" or "it's raining," so there are two possible values of y. I'm just trying to make my life easier.

So we have a joint probability distribution over x_i and y_j; this is our joint probability, as before. And now we have a joint information, which we'll call I(X,Y). This is the information inherent in the joint set of events X and Y, in our case it being sunny or not sunny, raining or not raining, and it takes the same form as before: minus the sum over all the different possibilities (sunny and raining, not sunny and raining, sunny and not raining, not sunny and not raining; this is why one shouldn't try to enumerate these things) of p of x_i y_j times the log to the base 2 of p of x_i y_j. So this is the amount of information inherent in these two sets of events taken together.

And of course we still have, if you like, the marginal information, the information of X on its own, which is built from the marginal distribution, the probability for X on its own (why it's called "marginal" I don't know): I(X) is minus the sum over i of p of x_i log to the base 2 of p of x_i. This is the amount of information inherent in whether it's sunny or not sunny; it could be up to a bit of information, if the probability of being sunny is one half. Let me tell you, in Santa Fe there's far less than a bit of information in whether it's sunny or not, because it's sunny most of the time. Similarly, I(Y) is minus the sum over j of p of y_j log to the base 2 of p of y_j, the information in whether it's raining or not raining. That could also be up to a bit of information if each of those probabilities is 1/2, but again, we're in the high desert here and it's normally not raining, so there's far less than a bit of information in the question of whether it's raining or not. So we have joint information, constructed out of joint probabilities, and marginal information, the information in the original variables on their own, constructed out of marginal probabilities.
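Here is a short sketch, again with invented numbers, that computes the joint information and the two marginal informations for the sunny/raining example just described.

```python
# Sketch of joint information I(X,Y) and marginal informations I(X), I(Y)
# for the sunny/raining example.  p[i][j] is the joint probability of x_i and
# y_j: i = 0 means "not sunny", i = 1 means "sunny"; j = 0 means "not raining",
# j = 1 means "raining".  Numbers are invented for illustration.
from math import log2

p = [[0.05, 0.10],    # not sunny & not raining, not sunny & raining
     [0.83, 0.02]]    # sunny & not raining,     sunny & raining

def info(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

# Joint information: minus the sum over all four (i, j) possibilities.
I_XY = info([p[i][j] for i in range(2) for j in range(2)])

# Marginal distributions, obtained by summing out the other variable.
p_x = [sum(row) for row in p]         # p(x_i) = sum_j p(x_i, y_j)
p_y = [sum(col) for col in zip(*p)]   # p(y_j) = sum_i p(x_i, y_j)
I_X, I_Y = info(p_x), info(p_y)

print(I_XY, I_X, I_Y)   # each marginal is well under one bit: usually sunny and dry
```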
Let me end this little section by defining a very useful quantity called the mutual information, I(X:Y). I normally write it with this little colon right in the middle because it looks nice and symmetrical, and we'll see that it is in fact symmetric. It's the information in X, plus the information in Y, minus the information in X and Y taken together. It's possible to show that this is always greater than or equal to zero, and it can be thought of as the amount of information the variable X has about Y. If X and Y are completely uncorrelated, so that it's completely uncorrelated whether it's sunny or not sunny and whether it's raining or not raining, then the mutual information is zero. In the case of sunny and raining, however, they are very correlated, in the sense that once you know it's sunny, it's probably not raining, even though sometimes that does happen here in Santa Fe. So in that case, and in most places in fact, you'd expect to find a large amount of mutual information: knowing whether it's sunny or not sunny gives you a very good prediction about whether it's raining or not raining.

So mutual information measures the amount of information that X can tell us about Y. It's symmetric, so it also measures the amount of information that Y can tell us about X. Another way of thinking about it is that it's the amount of information that X and Y hold in common, which is why it's called "mutual information."
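To close the loop, here is a sketch of the mutual information computed from the same invented joint distribution as above, compared against an uncorrelated (product) distribution with the same marginals; the numbers are illustrative only.

```python
# Sketch of the mutual information I(X:Y) = I(X) + I(Y) - I(X,Y).  For a
# product distribution p(x_i, y_j) = p(x_i) p(y_j) it comes out zero; for the
# correlated sunny/raining distribution it is positive.  Numbers invented.
from math import log2

def info(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

def marginals(p):
    p_x = [sum(row) for row in p]
    p_y = [sum(col) for col in zip(*p)]
    return p_x, p_y

def mutual_information(p):
    p_x, p_y = marginals(p)
    I_XY = info([q for row in p for q in row])
    return info(p_x) + info(p_y) - I_XY

correlated = [[0.05, 0.10],     # rain mostly happens when it's not sunny
              [0.83, 0.02]]
p_x, p_y = marginals(correlated)
uncorrelated = [[px * py for py in p_y] for px in p_x]   # product of the marginals

print(round(mutual_information(correlated), 3))     # about 0.255 bits: X tells you about Y
print(round(mutual_information(uncorrelated), 3))   # essentially 0 (up to rounding)
```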