The last thing I promised you was to tell you how coarse graining fits into the story of information theory. So, even if you have not discovered the information theory lectures, you will get a very brief introduction here. Let's imagine that we have some process with, say, 26 options: there's the probability that the process emits A, the probability that it emits B, all the way down to the probability that it emits Z. One of the canonical questions information theory asks is: How much information is in that process? This term also comes up: How much entropy is in the process? Another term is: How much uncertainty ... is in the process? A process that has higher uncertainty – in this case, a process where, when you stick your hand in the bag, you don't know whether you're going to get an A or a B or a C, all the way down to Z – is more uncertain, has higher entropy, and carries more information. Claude Shannon invented a way to measure, to quantify – to turn this big list of numbers, this big list of probabilities, into a single number, which he then called the uncertainty of the process itself. He called that function H, I think because he had spent some time in Germany, although I'm not sure. So, what H does is eat a list of probabilities and spit out a single number on the other side; in this case, that number might be around 4, and the units turn out to be bits. So, how do we take a particular set of probabilities and map them to a number? Well, H of the pᵢ is defined in general as H = −Σᵢ pᵢ log₂ pᵢ: the negative of the sum, over all the options – in this case 26 options – of pᵢ times the log base 2 of pᵢ. This is the fundamental quantity of information theory. One of the questions you might have is: How did Shannon derive it? What he did was come up with a series of axioms that he wanted this function to satisfy. He said, "If you want to measure uncertainty, here are a couple of things that it should do."
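The formula above is easy to check numerically. Here is a minimal sketch in Python (the lecture itself contains no code, so the function name and the example distribution are my own choices); for the uniform distribution over 26 letters the answer comes out to log₂(26), a little under 5 bits:

```python
import math

def shannon_entropy(probs):
    """Shannon's H: the negative sum of p * log2(p) over all options.
    Options with p == 0 contribute nothing (the limit of p*log2(p) is 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over the 26 letters A..Z:
uniform = [1 / 26] * 26
print(shannon_entropy(uniform))  # log2(26) ≈ 4.70 bits
```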
First of all, the maximum uncertainty that this distribution can have is when each of the probabilities is equally likely. So if the probability of A is 1 in 26, the probability of B is 1 in 26, and so on all the way down to Z – if that's the case, you should be maximally uncertain about what's going to happen. That's the first thing he said: if the probabilities are perfectly uniform, that should be the condition of maximum uncertainty. He didn't strictly need to say the following, but it goes along with it: if all of the probability is located in a single option, then the system has uncertainty 0. If you know that every time you stick your hand in the bag you're going to get an A, the system is fundamentally certain. There is no uncertainty about what's going to happen; when you stick your hand in, you always get an A. So that's the first thing Shannon wanted this function to do. He also wanted it to be symmetric: he didn't want it to discriminate between the different options. So, for example, a probability distribution where A and B both have probability 0.5, and everything else is 0, should have the same uncertainty as the distribution where Y and Z both have probability 0.5, and everything else is 0. He didn't want to discriminate between the different options, and, for this reason, information theory is sometimes called a syntactic theory. So, well, that makes sense, right? If you shuffle these probabilities around, the answer shouldn't change; it doesn't matter what the probabilities attach to. So we have symmetry, and the condition of maximal uncertainty. It turns out those two are not quite enough to uniquely specify a function until you introduce the coarse-graining axiom. We'll leave up Shannon's information-theoretic account of how to measure uncertainty – the negative sum of pᵢ log₂ pᵢ over all the different options. The coarse-graining axiom ...
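These first two requirements – maximum uncertainty at the uniform distribution, zero uncertainty when one option is certain, and symmetry under shuffling – can each be checked with a few lines of Python (a sketch of my own, not from the lecture; the four-option distributions are just illustrative):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum uncertainty: the uniform distribution beats any skewed one.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
assert H(uniform) > H(skewed)

# Zero uncertainty: all probability on a single option.
assert H([1.0, 0.0, 0.0, 0.0]) == 0.0

# Symmetry: it doesn't matter which options carry which probabilities.
assert H([0.5, 0.5, 0.0, 0.0]) == H([0.0, 0.0, 0.5, 0.5])
```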
says the following: let's say you have three options, and we'll write it like this, so now we'll have a distribution over just 3 letters – A, B, and C. It could be a distribution over rock, paper and scissors. So, we have P_A, the probability of A happening, a probability of B happening, and a probability of C happening; a probability of rock, of paper and of scissors, if you like. And, in this case, for simplicity, we'll assume that these are independent in the sense that if I get an A, it doesn't affect what I'm going to get next; if I play rock, it doesn't affect what I'm going to get next. So, Shannon said, okay, the entropy of that system should be equal to the following: the entropy of the choice "A versus B-or-C", plus the probability of "B or C" times the entropy of the residual choice between B and C. He imagined, in other words, that instead of finding out whether the system had chosen A, B or C all at once, first you found out whether the system had chosen A, or "B or C"; and then, in the cases where the system had chosen "B or C", you had to make a second inquiry to figure out whether it was B or C. So, the coarse-graining axiom says the uncertainty about the fine-grained state of the system is equal to the entropy of a coarse-grained version of the system – in this case, and this is just an example, the coarse graining says, "Ah, B and C are kind of the same. You know, they're both water signs or something, right?" – plus the weighted entropies of those fine-grained distinctions. And once you demand that this property be obeyed, then you recover Shannon's original formulation of entropy.
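The two-stage story can be written out directly. As a sketch (the probabilities 0.5, 0.3, 0.2 are my own illustrative choices, not from the lecture), the coarse-graining axiom says H(p_A, p_B, p_C) = H(p_A, p_B + p_C) + (p_B + p_C) · H(p_B/(p_B+p_C), p_C/(p_C+p_B)):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pA, pB, pC = 0.5, 0.3, 0.2  # illustrative probabilities

# Fine-grained uncertainty: finding out A, B, or C all at once.
fine = H([pA, pB, pC])

# Two-stage version: first "A versus (B or C)", then, weighted by the
# probability pB + pC, the residual choice between B and C (renormalized).
pBC = pB + pC
two_stage = H([pA, pBC]) + pBC * H([pB / pBC, pC / pBC])

assert abs(fine - two_stage) < 1e-12  # the two bookkeepings agree
```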
And it turns out that this formulation is unique: there is no other functional form for converting a list of probabilities into a number describing uncertainty that obeys the condition of maximum uncertainty, that obeys the symmetry property of being able to shuffle the probabilities and get the same answer, and that, finally, obeys Shannon's coarse-graining axiom. If you enforce all of those, this is the formula that you're left with; no other formula will do. And one of the very nice things about this is that the coarse-graining axiom shows you how, when you coarse-grain a system, the uncertainty – the information that you have – goes down. So let's say that instead of representing the system like this, I broke it up into these two pieces here, a coarse-grained uncertainty and a fine-grained uncertainty, where I split the "B or C" case, and then said, you know what, I don't care about that fine-grained distinction. In that case, the equality becomes a greater-than-or-equal-to sign. In fact, it's strictly "greater than" as long as there's some probability of B and some probability of C – as long as there is a fine-grained distinction to be made. So, in this first module, what have we done? I've given you a short introduction to the idea of renormalization through the microeconomics example: you have a microeconomic account of the world, a very detailed description of what people are doing, that you then coarse grain to a macroeconomic account. That coarse graining generally erases distinctions: instead of distinguishing case B from case C, you say, "Ah, you know what? They're pretty much the same." We had an example of that coarse-graining property in the case of coarse-graining the image of Alice in Wonderland.
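The claim that coarse graining strictly reduces entropy whenever there is a real fine-grained distinction to erase – and leaves it unchanged otherwise – can also be checked numerically (a sketch with illustrative probabilities of my own, not from the lecture):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Strictly greater: both B and C have some probability, so merging
# them into one coarse option "B or C" throws information away.
assert H([0.5, 0.25, 0.25]) > H([0.5, 0.5])

# Equality: all of "B or C" sits on B alone, so there was no
# fine-grained distinction to erase in the first place.
assert H([0.5, 0.5, 0.0]) == H([0.5, 0.5])
```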
And there, I actually gave you three examples: majority-rule coarse graining, decimation coarse graining, and a more complicated, Fourier-space kind of coarse graining called JPEG, which of course is now in so much use that your computer has a special chip that does JPEG decompression. So, I gave you an example of how coarse graining worked, and, in this final part of the lecture of this first module, what I've done is tell you a little bit about how coarse graining plays a central role in information theory. But coarse graining is not enough. It's not enough just to simplify the world, because, in the end, what scientists do is something more than just create JPEGs. What a biologist does is something more than just clustering – or rather, you hope a biologist is doing something more than just clustering, although you should check the literature, because sometimes they don't. And, in fact, many scientists fall victim to this: they think that if you have a compressed, efficient description of the world, then you're done. But you're not done, because what you want to do with that compressed description is make predictions, produce explanations – not just describe, but explain what happened before and what's going to happen next. And in order to do that, you need a theory. And what renormalization does is say, "Great, congratulations, you've constructed a coarse graining. Now, how is the theory connecting those different coarse-grained states related to the theory at the fundamental level? What happens, in other words, when you take a full economic and psychological theory of the world and coarse-grain it to macroeconomics? What happens when the macroeconomist then tries to build a theory that relates those coarse-grained quantities?"
And so, in the next module, we will give you your first simple example of how theories about the world change when the objects that they describe simplify.