We've boiled down our maximum-entropy prescription into two steps. The first is that you want the probability distribution to satisfy this constraint on the average value of the waiting time. And the second is that we want that particular probability distribution to have the maximum entropy - to maximize the function p log p, and I always forget the negative sign. So in fact, we're maximizing the function negative sum over all states of the system of p log p. And remember, states of the system here are how long you've waited for a particular cab. Or rather, how long you've waited from a particular time you decided to start waiting. So, this turns out to be a hard problem, or at least a non-trivial problem. If you have a little calculus, you're really good at maximizing functions, so let's start there. Let's imagine in particular...and I'll use this very simple example of maximizing a function over a 2-dimensional space. Let's call these the x1 and x2 axes. So I have some function. I'm just going to draw the contours for you. So here we go. And what I'm going to do is make it such that there's a single maximum over this space. And what we'll talk about in one of the appendices, if we have time, is why we can prove that the entropy function has a unique maximum even when you subject it to these constraints. For now, you can take it on faith that, in fact, this problem has a unique solution. So, here's a function. I've given it a unique maximum, and so we'll call this function f. So, using your amazing calculus skills, you know that the maximum of this function is defined as the point where the derivative of f with respect to x is 0. And remember, x is a vector here, so what this means is df/dx1 is 0 and df/dx2 is 0. Now, yes, you might have accidentally hit a minimum, so just check to make sure it's actually a maximum. That's what you would do. So now the problem is that we're not allowed to range over this entire space. We're restricted to some subspace. 
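Here's a minimal sketch of that unconstrained recipe - set the gradient to zero, then check you've found a maximum rather than a minimum. The function f below is my own made-up example with a single peak at (1, 2); it is not anything from the lecture's board.

```python
# Unconstrained maximization: set df/dx1 = df/dx2 = 0, then check the
# second derivatives. f is an illustrative choice with one peak at (1, 2).

def f(x1, x2):
    return 5 - (x1 - 1)**2 - (x2 - 2)**2

def grad_f(x1, x2):
    # df/dx1 and df/dx2, computed by hand for this example
    return (-2 * (x1 - 1), -2 * (x2 - 2))

# Setting both partial derivatives to zero gives x1 = 1, x2 = 2.
x1_star, x2_star = 1.0, 2.0
assert grad_f(x1_star, x2_star) == (0.0, 0.0)

# The "make sure it's actually a maximum" check: here d2f/dx1^2 and
# d2f/dx2^2 are both -2 < 0 (and the cross term is 0), so this critical
# point is a peak, not a valley.
print(f(x1_star, x2_star))  # → 5.0, the value at the peak
```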
We're restricted in particular by some constraint here. So how can we find the maximum of the function - not the global maximum, but the maximum that also satisfies a set of constraints? What I'll do is draw those constraints as a line in this space. A point here is a valid argument for the function f, but it doesn't satisfy this constraint here. And what I'll do is define this constraint in the following way. I'll say the constraint is that g(x) = c, where c is some number. And just to be clear, for us, it's better to write g(p). We set that equal to 4 minutes. That's just to remind you that our particular constraint is that the function g is equal to 4. This is the general case here. So what we want to do now is not find the maximum point, the top of the mountain. We want to find the point that is the maximum along this line - the line defined by g(x) = c. So let me give you a little intuition about how you might do this. Imagine you're on a train going through this mountainous landscape. As you're going along, down here, you'll see you're crossing contours of the function f. In this case, you're going uphill - the function is increasing - so you know a point there is not the maximum of the function along this line, because if you wait a little bit longer, you'll get to this point here, and you've already crossed the contour line. So here, you're going up. Notice here, you're going down the other side of the mountain - you're crossing contour lines the other way - so you know, in fact, that the maximum can't be over here, because you were already higher over here. So, somewhere between here and here is the maximum - somewhere in the middle you reach the peak. You reach the peak when the contours of the function f are parallel to the track that you are taking - where there is a tangent point between the contour and the direction of your motion on this fictitious train. 
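The train-ride intuition can be checked numerically: walk along the constraint line and watch f rise and then fall. Both f and the constraint g(x1, x2) = x1 + x2 = 2 below are my own illustrative choices, not the lecture's.

```python
# Ride the "train" along the constraint x1 + x2 = 2 and record the value
# of f at each point. f peaks somewhere in the middle of the track.

def f(x1, x2):
    return 5 - (x1 - 1)**2 - (x2 - 2)**2

# Parameterize the track: every point (t, 2 - t) satisfies x1 + x2 = 2.
ts = [i / 1000 for i in range(-1000, 2001)]
values = [f(t, 2 - t) for t in ts]

# Find where the ride stops going uphill and starts going downhill.
best = max(range(len(ts)), key=lambda i: values[i])
print(ts[best], 2 - ts[best])  # → 0.5 1.5, the peak along the track
```

Note that the constrained peak (0.5, 1.5) is not the global peak (1, 2): the top of the mountain is off the track, so the best you can do is the tangent point.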
So, we know how to get the directions of the contours of the function f - those are, indeed, given by the gradient of the function...this is a vector, remember. And we are going to set these equal to the perpendiculars to the train tracks. So if the perpendicular to the contour is parallel to the perpendicular to the train track, that means the direction of the contour is parallel to the direction of the train track. If the two perpendiculars are parallel, so are the two original vectors. So then the next question is: how can I get the perpendicular to the train track? What I want you to do is imagine this is the train track for g(x) = c, and here is the train track for g(x) = c', and so forth. So here is another set of contours, defined by the function g, and we want the perpendiculars to these contours to be parallel to the perpendiculars to the contours of f. So what that means is that this gradient here - these arrows here, and in particular, these arrows right here - is equal to some real number lambda times the gradient of the constraint: df/dx = lambda dg/dx. When this equation is satisfied, these contours here are precisely parallel to these contours here. So in order to maximize a function f subject to a set of constraints, don't solve df/dx = 0. Solve df/dx = lambda dg/dx instead. And now you notice you have this mysterious value lambda. This is called the Lagrange multiplier. So what we're going to do is try to find a solution where the gradients are parallel to each other - in other words, where one can be transformed into the other by scaling by some constant factor. So this is the intuitive motivation for how to solve the problem of maximization subject to constraints when you have only a single constraint. What we do is find the point at which these two gradients align. There's a wrinkle. The wrinkle is the following. 
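To make the Lagrange condition concrete, here is a sketch that solves df/dx = lambda dg/dx for an illustrative pair of my own choosing - f(x) = 5 - (x1-1)^2 - (x2-2)^2 with the constraint g(x) = x1 + x2 = 2 - and then checks that the two gradients really do align at the constrained optimum.

```python
# Lagrange condition: grad f = lambda * grad g, solved for a toy example.

def grad_f(x1, x2):
    return (-2 * (x1 - 1), -2 * (x2 - 2))

def grad_g(x1, x2):
    # g(x1, x2) = x1 + x2, so its gradient is constant
    return (1.0, 1.0)

# Solving grad f = lambda * grad g by hand:
#   -2(x1 - 1) = lambda  and  -2(x2 - 2) = lambda   =>   x2 = x1 + 1,
# and the constraint x1 + x2 = 2 then pins down the point.
x1, x2 = 0.5, 1.5
lam = -2 * (x1 - 1)  # lambda = 1.0 ("lam" because "lambda" is reserved)

gf = grad_f(x1, x2)
scaled_gg = tuple(lam * c for c in grad_g(x1, x2))
print(gf, scaled_gg)  # → (1.0, 1.0) (1.0, 1.0): the gradients align
```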
It looks like we have only one constraint here, and our constraint is just that this function is equal to 4. But we actually have 2 constraints. Our second constraint is overall normalization, and it says that we want this function p here to normalize to 1. If you sum the probabilities of all the arrival times, they have to equal 1. Now, of course, p is a probability, so we know that has to be true. We didn't talk about this explicitly, but when we start wandering over function space - when these xs here become ps, and we start manipulating the probabilities - what we want to do is relax the constraint that they sum to 1 when we consider maximizing this function f. We want to be able to range over the entire space, including, for example, points where all the probabilities - all the ps - are 0. And then later what we'll do is impose the normalization constraint. So we actually have 2 constraints, and not just 1.
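With both constraints in hand - normalization and a mean waiting time of 4 - the maximum-entropy distribution is known to take the exponential form p(t) proportional to exp(-lambda t). The sketch below just verifies that the right exponential satisfies both constraints at once; discretizing the waiting time into whole minutes t = 0, 1, 2, ... is my simplification, not the lecture's.

```python
# Max entropy with two constraints: sum p = 1 and mean waiting time = 4.
# For discrete t, the exponential form is p(t) = (1 - r) * r**t, whose
# mean is r / (1 - r); a mean of 4 gives r = 0.8.
import math

r = 0.8
T = 2000  # truncate the infinite sum far into the tail
p = [(1 - r) * r**t for t in range(T)]

total = sum(p)                                   # normalization constraint
mean = sum(t * pt for t, pt in enumerate(p))     # mean-waiting-time constraint
entropy = -sum(pt * math.log(pt) for pt in p)    # -sum p log p

# total ≈ 1, mean ≈ 4; entropy is -log(1-r) - 4*log(r) ≈ 2.502
print(round(total, 6), round(mean, 6), round(entropy, 4))
```

Any other distribution with the same mean of 4 - say, all the probability piled on t = 4 - satisfies both constraints but has strictly lower entropy, which is why the exponential wins.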