1
00:00:01,111 --> 00:00:04,561
We've boiled down our max ent. prescription
2
00:00:04,561 --> 00:00:07,911
into 2 steps. The first thing is you want the
3
00:00:07,911 --> 00:00:17,711
probability distribution to satisfy this constraint
4
00:00:17,711 --> 00:00:21,031
on the average value of the waiting time.
5
00:00:24,738 --> 00:00:26,598
And the second thing is we want
6
00:00:26,598 --> 00:00:28,698
that particular probability distribution to
7
00:00:28,698 --> 00:00:36,128
have the maximum entropy - to maximize the
8
00:00:36,128 --> 00:00:44,318
function p log p, and I always forget the negative
9
00:00:44,318 --> 00:00:46,868
sign. So in fact, we're maximizing the function
10
00:00:46,868 --> 00:00:49,458
negative sum over all states of the system
11
00:00:49,458 --> 00:00:51,598
p log p. And remember, states of the system
12
00:00:51,598 --> 00:00:53,688
here are how long you've waited for a particular
13
00:00:53,688 --> 00:00:55,958
cab. Or rather, how long you've waited from
14
00:00:55,958 --> 00:00:58,718
a particular time you decided to start waiting.
15
00:00:58,718 --> 00:01:01,948
So, this turns out to be a hard problem, or
16
00:01:01,948 --> 00:01:04,448
at least a non-trivial problem. If you have
17
00:01:04,448 --> 00:01:06,438
a little calculus, you're really good at maximizing
18
00:01:06,438 --> 00:01:09,168
functions, so let's start there. Let's imagine
19
00:01:09,168 --> 00:01:12,408
in particular...and I'll use this very simple example
20
00:01:12,408 --> 00:01:14,648
of maximizing a function over a 2-dimensional
21
00:01:14,648 --> 00:01:23,198
space. Let's call these the x1 and x2 axes.
22
00:01:25,801 --> 00:01:26,911
So I have some function. I'm just
23
00:01:26,911 --> 00:01:29,011
going to draw the contours for you.
24
00:01:29,011 --> 00:01:32,221
So here we go. And what I'm going to do is
25
00:01:32,221 --> 00:01:38,211
make it such that there's a single maximum over
26
00:01:38,211 --> 00:01:41,081
this space. And what we'll talk about in one of
27
00:01:41,081 --> 00:01:43,571
the appendices, if we have time, is why we can
28
00:01:43,571 --> 00:01:46,361
prove that the entropy function has a unique
29
00:01:46,361 --> 00:01:50,571
maximum even what you subject it to these
30
00:01:50,571 --> 00:01:54,731
constraints. For now, you can take it on faith
31
00:01:54,731 --> 00:01:59,621
that, in fact, this problem has a unique solution.
32
00:01:59,621 --> 00:02:01,961
So, here's a function. I've given it a unique
33
00:02:01,961 --> 00:02:07,191
maximum, and so we'll call this function f.
34
00:02:07,191 --> 00:02:10,731
So, using your amazing calculus skills, you know
35
00:02:10,731 --> 00:02:13,391
that the maximum of this function is defined as
36
00:02:13,391 --> 00:02:15,661
the point where the derivative of f with
37
00:02:15,661 --> 00:02:18,581
respect to x is 0. And remember this is a vector
38
00:02:18,581 --> 00:02:21,571
here so what this means is df/dx1 is 0 and
39
00:02:21,571 --> 00:02:25,571
df/dx2 is zero. Now, yes, you might have
40
00:02:25,571 --> 00:02:29,271
accidentally hit a minimum, so just check to
41
00:02:29,271 --> 00:02:31,371
make sure it's actually a minimum. That's
42
00:02:31,371 --> 00:02:36,201
what you would do. So now the problem is that
43
00:02:36,201 --> 00:02:39,411
we're not allowed to range over this entire space.
44
00:02:39,411 --> 00:02:42,401
We're restricted to some subspace. We're restricted
45
00:02:42,401 --> 00:02:49,041
in particular to some constraint here. So how
46
00:02:49,041 --> 00:02:51,671
can we find the maximum of the function, not
47
00:02:51,671 --> 00:02:54,881
the global maximum, but the maximum that
48
00:02:54,881 --> 00:02:57,941
is also satisfying a set of constraints and what I'll
49
00:02:57,941 --> 00:03:03,431
do is I'll draw those constraints as a line
50
00:03:03,431 --> 00:03:08,011
in this space. A point here is a valid argument
51
00:03:08,011 --> 00:03:10,561
for the function f, but it doesn't satisfy this
52
00:03:10,561 --> 00:03:12,241
constraint here and what I'll do is
53
00:03:12,241 --> 00:03:14,461
define this constraint in the following way.
54
00:03:14,461 --> 00:03:17,931
I'll say the constraint is that g(x) = c, where c
55
00:03:17,931 --> 00:03:21,221
is some number. And just to be clear, for us,
56
00:03:21,221 --> 00:03:35,771
it's better to write g(p). We set that equal to
57
00:03:35,771 --> 00:03:39,291
4 minutes. That's just to remind you our particular
58
00:03:39,291 --> 00:03:44,641
constraint is that the function g is equal to 4.
59
00:03:44,641 --> 00:03:50,901
This is the general case here. So what we want
60
00:03:50,901 --> 00:03:53,621
to do now is not find the maximum point, the top
61
00:03:53,621 --> 00:03:56,141
of the mountain. We want to find the point
62
00:03:56,141 --> 00:03:59,871
that is the maximum along this line - this
63
00:03:59,871 --> 00:04:03,171
line defined by g(x) = c. So let me give you
64
00:04:03,171 --> 00:04:05,391
a little intuition about how you might do this.
65
00:04:05,391 --> 00:04:08,821
Imagine you're on a train going through this
66
00:04:08,821 --> 00:04:12,001
mountainous landscape, and as you're going
67
00:04:12,001 --> 00:04:15,871
through, down here, you'll see you're crossing
68
00:04:15,871 --> 00:04:19,111
contours of the function f. In this case, you're
69
00:04:19,111 --> 00:04:22,661
going uphill - the function is increasing - so you
70
00:04:22,661 --> 00:04:26,661
know a point there is not a maximum of the function
71
00:04:26,661 --> 00:04:28,321
along this line because if you wait a little bit
72
00:04:28,321 --> 00:04:30,111
longer, you'll get to this point here and you've
73
00:04:30,111 --> 00:04:33,101
already crossed the contour line. So here,
74
00:04:33,101 --> 00:04:38,091
you're going up. Notice here, you're going down
75
00:04:38,091 --> 00:04:39,911
the other side of the mountain - you're crossing
76
00:04:39,911 --> 00:04:41,341
contour lines in the other way, so you know,
77
00:04:41,341 --> 00:04:43,581
in fact, that the maximum can't be over here
78
00:04:43,581 --> 00:04:45,711
because you were already higher over here.
79
00:04:45,711 --> 00:04:48,111
So, somewhere between here and here is the
80
00:04:48,111 --> 00:04:54,231
maximum - somewhere in the middle you reach the
81
00:04:54,231 --> 00:05:03,291
peak. You reach the peak when the contours of the
82
00:05:03,291 --> 00:05:09,821
function f are parallel to the track that you are taking -
83
00:05:09,821 --> 00:05:13,511
where there is a tangent point between the contour
84
00:05:13,511 --> 00:05:15,781
and the direction of your motion on this fictitious
85
00:05:15,781 --> 00:05:23,581
train. So, we know how to get the directions of the
86
00:05:23,581 --> 00:05:28,511
contours of the function f - those are, indeed, just the
87
00:05:28,511 --> 00:05:31,701
gradient of the function...this is a vector, remember.
88
00:05:31,701 --> 00:05:34,891
And we are going to say these are equal to the
89
00:05:34,891 --> 00:05:40,241
perpendiculars to the train tracks. So if the
90
00:05:40,241 --> 00:05:43,331
perpendicular to the contour is parallel to the
91
00:05:43,331 --> 00:05:46,871
perpendicular to the train track, that means
92
00:05:46,871 --> 00:05:49,731
that the direction of the contour is parallel
93
00:05:49,731 --> 00:05:52,201
to the direction of the train track. If the two
94
00:05:52,201 --> 00:05:55,661
perpendiculars are parallel, so are the two
95
00:05:55,661 --> 00:05:58,771
original vectors. So then the next question is
96
00:05:58,771 --> 00:06:03,851
how can I get the perpendicular to the train track.
97
00:06:03,851 --> 00:06:06,201
What I want you do is imagine this is the train
98
00:06:06,201 --> 00:06:09,231
track for g(x) = c, and here is the train track
99
00:06:09,231 --> 00:06:15,931
for g(x) = c' and so forth. So here is another set
100
00:06:15,931 --> 00:06:19,775
of contours defined by the function g and we
101
00:06:19,775 --> 00:06:24,185
want the perpendiculars to these contours to
102
00:06:24,185 --> 00:06:29,195
be parallel to the contours for f - the
103
00:06:29,195 --> 00:06:32,475
perpendiculars for the contours of f. So what
104
00:06:32,475 --> 00:06:36,115
that means is that this gradient here - these arrows
105
00:06:36,115 --> 00:06:41,565
here, and in particular, these arrows right here -
106
00:06:41,565 --> 00:06:48,815
are equal to some real number lambda times
107
00:06:48,815 --> 00:06:54,685
the gradient of the constraint. When this equation
108
00:06:54,685 --> 00:07:03,265
is satisfied, that means that these contours here
109
00:07:03,265 --> 00:07:09,315
are precisely parallel to these contours here.
110
00:07:09,315 --> 00:07:12,185
So in order to maximize a function f subject
111
00:07:12,185 --> 00:07:15,745
to a set of constraints, don't solve this here.
112
00:07:15,745 --> 00:07:18,415
Don't solve this problem, solve this problem.
113
00:07:18,415 --> 00:07:20,405
And now you notice you have this mysterious
114
00:07:20,405 --> 00:07:25,925
value lambda. This is called the Lagrange multiplier.
115
00:07:25,925 --> 00:07:29,464
So what we're going to do is try to find a solution
116
00:07:29,464 --> 00:07:34,364
where the gradients are parallel to each other.
117
00:07:34,364 --> 00:07:36,154
In other words, that one can be transformed into
118
00:07:36,154 --> 00:07:39,214
another by scaling by some constant factor
119
00:07:39,214 --> 00:07:44,924
along all axes. So this is the intuitive motivation
120
00:07:44,924 --> 00:07:48,664
for how to solve the problem of maximization
121
00:07:48,664 --> 00:07:51,784
subject to constraints when you have only
122
00:07:51,784 --> 00:07:56,794
a single constraint. What we do is find the point
123
00:07:56,794 --> 00:08:03,944
at which these two gradients align. There's a wrinkle.
124
00:08:03,944 --> 00:08:06,174
The wrinkle is the following. It looks like we have only
125
00:08:06,174 --> 00:08:10,454
one constraint here and our contraint is just
126
00:08:10,454 --> 00:08:14,494
that this function is equal to 4. But we actually have
127
00:08:14,494 --> 00:08:20,014
2 constraints. Our second constraint is overall
128
00:08:20,014 --> 00:08:25,724
normalization and it says that we want this
129
00:08:25,724 --> 00:08:31,524
function p here to normalize to 1. If you
130
00:08:31,524 --> 00:08:34,334
sum the probabilities of all the arrival times, they
131
00:08:34,334 --> 00:08:38,334
have to equal 1. Now, of course, p is a probability
132
00:08:38,334 --> 00:08:41,194
so we know that has to be true. We didn't talk about
133
00:08:41,194 --> 00:08:43,254
this explicitly, but when we start wandering over
134
00:08:43,254 --> 00:08:47,254
function space - when these xs here become ps,
135
00:08:47,254 --> 00:08:49,484
we start manipulating the probabilities - what we
136
00:08:49,484 --> 00:08:51,614
want to do is be able to relax the constraint
137
00:08:51,614 --> 00:08:54,364
that they sum to 1 when we consider maximizing
138
00:08:54,364 --> 00:08:56,954
this function f. We want to be able to range over
139
00:08:56,954 --> 00:08:59,554
the entire space including, for example, points
140
00:08:59,554 --> 00:09:02,084
where all the probabilities - all the ps - are 0.
141
00:09:02,084 --> 00:09:04,724
And then later what we'll do is we'll impose the
142
00:09:04,724 --> 00:09:08,064
normalization constraint. So we actually have
143
00:09:08,064 --> 00:09:11,544
2 constraints, and not just 1.