We've introduced this problem of producing a parsimonious model of the data, meaning a description of the probabilities of each of the possible configurations. Now I'm going to show you a general method for enlarging a parsimonious model, or conversely, a general method for producing a model that's more parsimonious than an exact reproduction of the data. That method is called the method of maximum entropy, or the MaxEnt principle.

The example I'm going to talk about is predicting when you're going to get a cab in New York City. The joke about New York is that you can never get a cab, except when you don't need one, and then there are cabs everywhere. And, of course, there are some quirky reasons why this might be the case, but if you've ever tried to get a cab in the early morning going south on Park Avenue, forget it, you will never get a cab.

Let's say you decide to be a scientist about this, and so you gather data: every time you need a cab, you go out into the street and record how long you wait before you finally get a cab you can get into, one that's free and on duty. Say I kept records like that for a while, and here are some data I've gathered: the time it took me to get a cab, in minutes. One time it took me 6 minutes, then it took me 3 minutes, then 4 minutes, another time it took 6 minutes again, and so forth. So this is a set of observations about a really basic empirical question: how long does it take to get a cab? And then the question is: what should I believe about the waiting time for a New York City cab?

You're actually pretty good at this already. You know, for example, that one way to do it is to take this data here. I have 10 data points on how long it takes to get a cab, and the probability of me waiting 6 minutes looks like it's about... well, three times out of ten I got a cab after six minutes, so that means there's about a 30% chance that I'm going to have to wait 6 minutes. And, for example, the chance that I'll have to wait 2 minutes looks like it's 20%.

You can see right away there's a huge problem here, because if I follow this naive model directly, it says the chance of me getting a cab in 1 minute is zero. And not only that, it says the chance of me waiting 7 minutes for a cab is also zero. This is puzzling; it's what we call overfitting the data. We're describing the data in a way that puts too much structure in. The fact that I never waited more than 6 minutes, and that I waited exactly 6 minutes three times, seems to be an accident of the data. We don't want to build that into our model.

So, instead of the naive method, what I'm going to do, and this is the core of the maximum entropy method, is produce a probability distribution with two properties. First, the distribution, which I'll write P_{MaxEnt}, or P_{ME} for short, satisfies a limited number of constraints, and I'll tell you explicitly what a constraint is in a moment. Second, of all the distributions that satisfy those constraints, it is the one with the maximum entropy.
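To make the naive model concrete, here's a minimal sketch in Python. The full dataset isn't listed in the lecture, so the list below is a hypothetical one, chosen only to be consistent with the statistics quoted: 10 observations, a 30% share of 6-minute waits, a 20% share of 2-minute waits, and an average of 4 minutes.

```python
from collections import Counter

# Hypothetical dataset (not listed in full in the lecture), chosen to match
# the quoted statistics: 10 observations, mean 4 minutes, three waits of
# 6 minutes, two waits of 2 minutes.
waits = [6, 3, 4, 6, 2, 2, 6, 3, 4, 4]

n = len(waits)
# The naive model: each waiting time gets the fraction of times it was observed.
naive = {x: count / n for x, count in sorted(Counter(waits).items())}

print(naive)                 # {2: 0.2, 3: 0.2, 4: 0.3, 6: 0.3}
print(naive.get(1, 0.0))     # 0.0 -- the naive model says a 1-minute wait is impossible
print(naive.get(7, 0.0))     # 0.0 -- and so is a 7-minute wait
```

Those two zeros at the end are the overfitting problem in code form: the naive model forbids anything it happened not to see.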
So what we'll find is that there are potentially many probability distributions that satisfy the constraints, and we're going to pick the one, and it turns out to be the unique one, that has the maximum entropy out of all of them.

The constraints will always be in the form of expectation values: constraints on the average of some quantity you measure on the data. For example, we could have a constraint on the expectation value of the waiting time, which we write like this:

⟨x⟩ = ∫₀^∞ x P(x) dx

The angle brackets mean the expectation value of x, and the way we compute it is to weight each waiting time x by the probability P(x) of waiting that long, and integrate from 0 to ∞. If we're willing to discretize and talk about whole minutes, rounding to the nearest minute, we can also write it as

⟨x⟩ = Σ_{x=0}^{∞} x P(x)

where instead of integrating over a continuum of times, here we just sum over 0 minutes, 1 minute, 2 minutes, 3 minutes, and so on. (0 minutes: the cab is right there, you open the door, it's a magical day.) So this is an expectation value of the average waiting time.

Just to give you another example of an expectation value you might measure, here's the average of the square of the waiting time, ⟨x²⟩ = ∫₀^∞ x² P(x) dx, and in general the expectation value of a function f(x) weights f(x) by the probability of each x:

⟨f(x)⟩ = ∫₀^∞ f(x) P(x) dx

If this notation is unfamiliar or uncomfortable, you should take some time and figure out why this is the correct way to talk about the average value of x. And if you like, the discrete sum might be more familiar to you if integrals are still a little scary, which they shouldn't be.

What we're going to do in this particular application of the maximum entropy principle is this: P_{ME}(x) will be constrained so that the average waiting time under the distribution P_{ME} is equal to that in the data. And, in fact, if you count up the data here and measure the average waiting time, you discover, and I'm quite happy about this, that it comes out to exactly 4 minutes. So what we're going to say is: give me probability distributions whose average waiting time is 4 minutes. That's step one, the constraint step.

And you can see right away that there are many distributions with an average waiting time of 4 minutes. Here's one: P(x) = 0 except when x = 4, where P(4) = 1. (Just to be technical, this definition only works in the discrete case; in the continuum we'd need delta functions, and I'll spare you the delta functions.) Here's another example: P(3) = 0.5, P(5) = 0.5, and P(x) = 0 otherwise. These are all potential models of catching a cab in New York City that satisfy the constraint that their average is four minutes. So someone could say, "Hey, I know, here's a good model of your data: cabs either take 3 minutes or 5 minutes and never any other time."

And, of course, you can think about mixing these two together, so you'd have a distribution, I'll just draw it graphically here, with some spread over waiting times between 3, 4, and 5 minutes. And, of course, the original empirical distribution satisfies the constraint as well.
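Since the constraint is just an expectation value, it's easy to check mechanically that all of these candidate models satisfy it, and to see that they nonetheless differ in entropy. A minimal sketch, reusing the hypothetical dataset from above:

```python
import math

def mean(p):
    """Expectation value <x> = sum over x of x * p(x) for a discrete distribution."""
    return sum(x * px for x, px in p.items())

def entropy(p):
    """Shannon entropy H = -sum over x of p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# Three candidate models from the lecture, all with mean waiting time 4:
candidates = {
    "always 4 minutes": {4: 1.0},
    "3 or 5 minutes":   {3: 0.5, 5: 0.5},
    "naive/empirical":  {2: 0.2, 3: 0.2, 4: 0.3, 6: 0.3},
}

for name, p in candidates.items():
    print(f"{name:18s} mean = {mean(p):.1f}, entropy = {entropy(p):.3f} bits")
# always 4 minutes   mean = 4.0, entropy = 0.000 bits
# 3 or 5 minutes     mean = 4.0, entropy = 1.000 bits
# naive/empirical    mean = 4.0, entropy = 1.971 bits
```

All three pass the constraint check, which is exactly the point: the constraint alone doesn't pick a model, and the entropies show how much extra structure each candidate smuggles in.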
By definition, if we take the distribution that's non-zero only at the observed points, weighted by the number of times each value appears in the data, its expectation will also be 4 minutes. So we have a plethora of candidate models, a plethora of models that satisfy this one particular constraint. Step two: pick the one that maximizes the entropy. You should remember the definition of entropy; if not, this is the perfect time to pause the video and review.

What we want is the distribution whose entropy is maximized. Another way to say it: we want the distribution that leaves us maximally uncertain about how long the cab will take to arrive, except for the one thing we've constrained, which is that the cab takes four minutes on average. But otherwise I want to be maximally uncertain. The way we'd say this philosophically is: I don't have any prejudices about what New York City cabs do; I want to be maximally uncertain about their behavior, subject to this one constraint.

And you can see intuitively that the idea that cabs always take exactly 4 minutes satisfies the average criterion but puts in a huge amount of additional structure. It says that for some reason all waiting times except 4 minutes are forbidden, and that seems like it would require extra justification. We're trying to be minimally biased, to allow the maximal possible range over all configurations of the system, subject to a constraint on the average behavior we observe. The two-spike model at 3 and 5 minutes is slightly better because it allows a wider range, and in fact the mixture of the two is better still. What we would like is the distribution under which you have to ask, on average, the maximal number of questions to determine how long the cab actually took; that's one way to interpret entropy. So this step selects, from the plethora of candidate models that satisfy the constraints, one particular model.
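For this particular constraint there happens to be a known closed form, which the lecture hasn't derived yet: over nonnegative integer minutes, the maximum-entropy distribution with a fixed mean is geometric, P(x) = (1 − q)qˣ, with mean q/(1 − q). A minimal numeric sketch, assuming a mean of 4 minutes:

```python
import math

# Assumption: waiting times take values on the nonnegative integers
# {0, 1, 2, ...}. With a single constraint on the mean, the maximum-entropy
# distribution is known to be geometric, P(x) = (1 - q) * q**x,
# whose mean is q / (1 - q).

mean_wait = 4.0
q = mean_wait / (1.0 + mean_wait)        # solve q / (1 - q) = 4  ->  q = 0.8

def p_me(x):
    return (1.0 - q) * q**x

# Verify the constraint and compute the entropy, truncating the infinite
# sums where the geometric tail is numerically negligible.
xs = range(2000)
print(sum(x * p_me(x) for x in xs))                    # -> 4.0, constraint holds
print(-sum(p_me(x) * math.log2(p_me(x)) for x in xs))  # -> ~3.61 bits
print(p_me(1), p_me(7))                                # -> 0.16, ~0.042
```

Its entropy, about 3.61 bits, beats every candidate above, and P(1 minute) and P(7 minutes) are no longer zero, which is exactly the overfitting of the naive model being repaired.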