We've introduced this problem of producing a parsimonious model of the data, meaning a description of the probabilities of each of the possible configurations. Now I'm going to show you a general method for enlarging a parsimonious model, or conversely, a general method for producing a model that's more parsimonious than an exact reproduction of the data. That method is called the method of maximum entropy, or the MaxEnt principle.

The example I'm going to talk about is predicting when you're going to get a cab in New York City. The joke about New York is that you can never get a cab, except when you don't need one, and then there are cabs everywhere. And, of course, there are some quirky reasons why this might be the case, but if you've ever tried to get a cab in the early morning going south on Park Avenue, forget it, you will never get a cab.

Let's say you decide to be a scientist about this, and so you gather data: every time you need a cab, you go out into the street and record how long you wait before you finally get a cab you can get into, one that's free and on duty. Say I kept records like that for a while, and here are some data I've gathered: the time it took me to get a cab, in minutes. One time it took me 6 minutes, then it took me 3 minutes, then 4 minutes, another time it took 6 minutes again, and so forth. So this is a set of observations about a really basic empirical question: how long does it take to get a cab? And then the question is: what should I believe about the waiting time for a New York City cab?

You're actually pretty good at this already. You know, for example, that one way to do it is to take this data here. I have 10 data points on how long it takes to get a cab, and the probability of me waiting 6 minutes looks like it's about... well, three times out of ten I got a cab after six minutes, so that means there's about a 30% chance that I'm going to have to wait 6 minutes. And, for example, the chance that I'll have to wait 2 minutes looks like it's 20%.

You can see right away there's a huge problem here, because if I follow this naive model directly, it says the chance of me getting a cab in 1 minute is zero. And not only that, it says the chance of me waiting 7 minutes for a cab is also zero. This is puzzling; it's what we call overfitting the data. We're describing the data in a way that puts too much structure in. The fact that I never waited more than 6 minutes, and that I waited exactly 6 minutes three times, seems to be an accident of the data. We don't want to build that into our model.

So, instead of the naive method, what I'm going to do, and this is the core of the maximum entropy method, is produce a probability distribution with two properties. First, the distribution, which I'll write P_{MaxEnt}, or P_{ME} for short, satisfies a limited number of constraints, and I'll tell you explicitly what a constraint is in a moment. Second, of all the distributions that satisfy those constraints, it is the one with the maximum entropy.
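To make the naive model concrete, here's a minimal sketch in Python. The full dataset isn't listed in the lecture, so the list below is a hypothetical one, chosen only to be consistent with the statistics quoted: 10 observations, a 30% share of 6-minute waits, a 20% share of 2-minute waits, and an average of 4 minutes.

```python
from collections import Counter

# Hypothetical dataset (not listed in full in the lecture), chosen to match
# the quoted statistics: 10 observations, mean 4 minutes, three waits of
# 6 minutes, two waits of 2 minutes.
waits = [6, 3, 4, 6, 2, 2, 6, 3, 4, 4]

n = len(waits)
# The naive model: each waiting time gets the fraction of times it was observed.
naive = {x: count / n for x, count in sorted(Counter(waits).items())}

print(naive)                 # {2: 0.2, 3: 0.2, 4: 0.3, 6: 0.3}
print(naive.get(1, 0.0))     # 0.0 -- the naive model says a 1-minute wait is impossible
print(naive.get(7, 0.0))     # 0.0 -- and so is a 7-minute wait
```

Those two zeros at the end are the overfitting problem in code form: the naive model forbids anything it happened not to see.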
So what we'll find is that there are potentially many probability distributions that satisfy the constraints, and we're going to pick the one, and it turns out to be the unique one, that has the maximum entropy out of all of them.

The constraints will always be in the form of expectation values: constraints on the average of some quantity you measure on the data. For example, we could have a constraint on the expectation value of the waiting time, which we write like this:

⟨x⟩ = ∫₀^∞ x P(x) dx

The angle brackets mean the expectation value of x, and the way we compute it is to weight each waiting time x by the probability P(x) of waiting that long, and integrate from 0 to ∞. If we're willing to discretize and talk about whole minutes, rounding to the nearest minute, we can also write it as

⟨x⟩ = Σ_{x=0}^{∞} x P(x)

where instead of integrating over a continuum of times, here we just sum over 0 minutes, 1 minute, 2 minutes, 3 minutes, and so on. (0 minutes: the cab is right there, you open the door, it's a magical day.) So this is an expectation value of the average waiting time.

Just to give you another example of an expectation value you might measure, here's the average of the square of the waiting time, ⟨x²⟩ = ∫₀^∞ x² P(x) dx, and in general the expectation value of a function f(x) weights f(x) by the probability of each x:

⟨f(x)⟩ = ∫₀^∞ f(x) P(x) dx

If this notation is unfamiliar or uncomfortable, you should take some time and figure out why this is the correct way to talk about the average value of x. And if you like, the discrete sum might be more familiar to you if integrals are still a little scary, which they shouldn't be.

What we're going to do in this particular application of the maximum entropy principle is this: P_{ME}(x) will be constrained so that the average waiting time under the distribution P_{ME} is equal to that in the data. And, in fact, if you count up the data here and measure the average waiting time, you discover, and I'm quite happy about this, that it comes out to exactly 4 minutes. So what we're going to say is: give me probability distributions whose average waiting time is 4 minutes. That's step one, the constraint step.

And you can see right away that there are many distributions with an average waiting time of 4 minutes. Here's one: P(x) = 0 except when x = 4, where P(4) = 1. (Just to be technical, this definition only works in the discrete case; in the continuum we'd need delta functions, and I'll spare you the delta functions.) Here's another example: P(3) = 0.5, P(5) = 0.5, and P(x) = 0 otherwise. These are all potential models of catching a cab in New York City that satisfy the constraint that their average is four minutes. So someone could say, "Hey, I know, here's a good model of your data: cabs either take 3 minutes or 5 minutes and never any other time."

And, of course, you can think about mixing these two together, so you'd have a distribution, I'll just draw it graphically here, with some spread over waiting times between 3, 4, and 5 minutes. And, of course, the original empirical distribution satisfies the constraint as well.
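Since the constraint is just an expectation value, it's easy to check mechanically that all of these candidate models satisfy it, and to see that they nonetheless differ in entropy. A minimal sketch, reusing the hypothetical dataset from above:

```python
import math

def mean(p):
    """Expectation value <x> = sum over x of x * p(x) for a discrete distribution."""
    return sum(x * px for x, px in p.items())

def entropy(p):
    """Shannon entropy H = -sum over x of p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# Three candidate models from the lecture, all with mean waiting time 4:
candidates = {
    "always 4 minutes": {4: 1.0},
    "3 or 5 minutes":   {3: 0.5, 5: 0.5},
    "naive/empirical":  {2: 0.2, 3: 0.2, 4: 0.3, 6: 0.3},
}

for name, p in candidates.items():
    print(f"{name:18s} mean = {mean(p):.1f}, entropy = {entropy(p):.3f} bits")
# always 4 minutes   mean = 4.0, entropy = 0.000 bits
# 3 or 5 minutes     mean = 4.0, entropy = 1.000 bits
# naive/empirical    mean = 4.0, entropy = 1.971 bits
```

All three pass the constraint check, which is exactly the point: the constraint alone doesn't pick a model, and the entropies show how much extra structure each candidate smuggles in.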
By definition, if we take the distribution that's non-zero only at the observed points, weighted by the number of times each value appears in the data, its expectation will also be 4 minutes. So we have a plethora of candidate models, a plethora of models that satisfy this one particular constraint. Step two: pick the one that maximizes the entropy. You should remember the definition of entropy; if not, this is the perfect time to pause the video and review.

What we want is the distribution whose entropy is maximized. Another way to say it: we want the distribution that leaves us maximally uncertain about how long the cab will take to arrive, except for the one thing we've constrained, which is that the cab takes four minutes on average. But otherwise I want to be maximally uncertain. The way we'd say this philosophically is: I don't have any prejudices about what New York City cabs do; I want to be maximally uncertain about their behavior, subject to this one constraint.

And you can see intuitively that the idea that cabs always take exactly 4 minutes satisfies the average criterion but puts in a huge amount of additional structure. It says that for some reason all waiting times except 4 minutes are forbidden, and that seems like it would require extra justification. We're trying to be minimally biased, to allow the maximal possible range over all configurations of the system, subject to a constraint on the average behavior we observe. The two-spike model at 3 and 5 minutes is slightly better because it allows a wider range, and in fact the mixture of the two is better still. What we would like is the distribution under which you have to ask, on average, the maximal number of questions to determine how long the cab actually took; that's one way to interpret entropy. So this step selects, from the plethora of candidate models that satisfy the constraints, one particular model.
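For this particular constraint there happens to be a known closed form, which the lecture hasn't derived yet: over nonnegative integer minutes, the maximum-entropy distribution with a fixed mean is geometric, P(x) = (1 − q)qˣ, with mean q/(1 − q). A minimal numeric sketch, assuming a mean of 4 minutes:

```python
import math

# Assumption: waiting times take values on the nonnegative integers
# {0, 1, 2, ...}. With a single constraint on the mean, the maximum-entropy
# distribution is known to be geometric, P(x) = (1 - q) * q**x,
# whose mean is q / (1 - q).

mean_wait = 4.0
q = mean_wait / (1.0 + mean_wait)        # solve q / (1 - q) = 4  ->  q = 0.8

def p_me(x):
    return (1.0 - q) * q**x

# Verify the constraint and compute the entropy, truncating the infinite
# sums where the geometric tail is numerically negligible.
xs = range(2000)
print(sum(x * p_me(x) for x in xs))                    # -> 4.0, constraint holds
print(-sum(p_me(x) * math.log2(p_me(x)) for x in xs))  # -> ~3.61 bits
print(p_me(1), p_me(7))                                # -> 0.16, ~0.042
```

Its entropy, about 3.61 bits, beats every candidate above, and P(1 minute) and P(7 minutes) are no longer zero, which is exactly the overfitting of the naive model being repaired.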