The last thing I promised you was to tell you how coarse graining fits into the story of information theory. So, even if you have not discovered the information theory lectures, you will get a very brief introduction here. Let's imagine that we have some process with, say, 26 options: there's the probability that the process emits A, the probability that it emits B, all the way down to the probability that it emits Z. One of the canonical questions information theory asks is: How much information is in that process? This term also comes up: How much entropy is in the process? Another term is: How much uncertainty ... is in the process? A process that has higher uncertainty – in this case, a process where, when you stick your hand in the bag, you don't know whether you're going to get an A or a B or a C, all the way down to Z – is more uncertain, has higher entropy, and carries more information. Claude Shannon invented a way to measure, to quantify – to turn this big list of numbers, this big list of probabilities, into a single number, which he then called the uncertainty of the process itself. He called that function H, I think because he had spent some time in Germany, although I'm not sure. So, what H does is eat a list of probabilities and spit out a single number on the other side; in this case, that number might be around 4, and the units turn out to be bits. So, how do we take a particular set of probabilities and map them to a number? Well, H of the pᵢ is defined in general as H = −Σᵢ pᵢ log₂ pᵢ: the negative of the sum, over all the options – in this case 26 options – of pᵢ times the log base 2 of pᵢ. This is the fundamental quantity of information theory. One of the questions you might have is: How did Shannon derive it? What he did was come up with a series of axioms that he wanted this function to satisfy. He said, "If you want to measure uncertainty, here are a couple of things that it should do."
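The formula above is easy to check numerically. Here is a minimal sketch in Python (the lecture itself contains no code, so the function name and the example distribution are my own choices); for the uniform distribution over 26 letters the answer comes out to log₂(26), a little under 5 bits:

```python
import math

def shannon_entropy(probs):
    """Shannon's H: the negative sum of p * log2(p) over all options.
    Options with p == 0 contribute nothing (the limit of p*log2(p) is 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over the 26 letters A..Z:
uniform = [1 / 26] * 26
print(shannon_entropy(uniform))  # log2(26) ≈ 4.70 bits
```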
First of all, the maximum uncertainty that this distribution can have is when each of the probabilities is equally likely. So if the probability of A is 1 in 26, the probability of B is 1 in 26, and so on all the way down to Z – if that's the case, you should be maximally uncertain about what's going to happen. That's the first thing he said: if the probabilities are perfectly uniform, that should be the condition of maximum uncertainty. He didn't strictly need to say the following, but it goes along with it: if all of the probability is located in a single option, then the system has uncertainty 0. If you know that every time you stick your hand in the bag you're going to get an A, the system is fundamentally certain. There is no uncertainty about what's going to happen; when you stick your hand in, you always get an A. So that's the first thing Shannon wanted this function to do. He also wanted it to be symmetric: he didn't want it to discriminate between the different options. So, for example, a probability distribution where A and B both have probability 0.5, and everything else is 0, should have the same uncertainty as the distribution where Y and Z both have probability 0.5, and everything else is 0. He didn't want to discriminate between the different options, and, for this reason, information theory is sometimes called a syntactic theory. So, well, that makes sense, right? If you shuffle these probabilities around, the answer shouldn't change; it doesn't matter what the probabilities attach to. So we have symmetry, and the condition of maximal uncertainty. It turns out those two are not quite enough to uniquely specify a function until you introduce the coarse-graining axiom. We'll leave up Shannon's information-theoretic account of how to measure uncertainty – the negative sum of pᵢ log₂ pᵢ over all the different options. The coarse-graining axiom ...
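These first two requirements – maximum uncertainty at the uniform distribution, zero uncertainty when one option is certain, and symmetry under shuffling – can each be checked with a few lines of Python (a sketch of my own, not from the lecture; the four-option distributions are just illustrative):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum uncertainty: the uniform distribution beats any skewed one.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
assert H(uniform) > H(skewed)

# Zero uncertainty: all probability on a single option.
assert H([1.0, 0.0, 0.0, 0.0]) == 0.0

# Symmetry: it doesn't matter which options carry which probabilities.
assert H([0.5, 0.5, 0.0, 0.0]) == H([0.0, 0.0, 0.5, 0.5])
```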
says the following: let's say you have three options, and we'll write it like this, so now we'll have a distribution over just 3 letters – A, B, and C. It could be a distribution over rock, paper and scissors. So, we have P_A, the probability of A happening, a probability of B happening, and a probability of C happening; a probability of rock, of paper and of scissors, if you like. And, in this case, for simplicity, we'll assume that these are independent in the sense that if I get an A, it doesn't affect what I'm going to get next; if I play rock, it doesn't affect what I'm going to get next. So, Shannon said, okay, the entropy of that system should be equal to the following: the entropy of the choice "A versus B-or-C", plus the probability of "B or C" times the entropy of the residual choice between B and C. He imagined, in other words, that instead of finding out whether the system had chosen A, B or C all at once, first you found out whether the system had chosen A, or "B or C"; and then, in the cases where the system had chosen "B or C", you had to make a second inquiry to figure out whether it was B or C. So, the coarse-graining axiom says the uncertainty about the fine-grained state of the system is equal to the entropy of a coarse-grained version of the system – in this case, and this is just an example, the coarse graining says, "Ah, B and C are kind of the same. You know, they're both water signs or something, right?" – plus the weighted entropies of those fine-grained distinctions. And once you demand that this property be obeyed, then you recover Shannon's original formulation of entropy.
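The two-stage story can be written out directly. As a sketch (the probabilities 0.5, 0.3, 0.2 are my own illustrative choices, not from the lecture), the coarse-graining axiom says H(p_A, p_B, p_C) = H(p_A, p_B + p_C) + (p_B + p_C) · H(p_B/(p_B+p_C), p_C/(p_C+p_B)):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

pA, pB, pC = 0.5, 0.3, 0.2  # illustrative probabilities

# Fine-grained uncertainty: finding out A, B, or C all at once.
fine = H([pA, pB, pC])

# Two-stage version: first "A versus (B or C)", then, weighted by the
# probability pB + pC, the residual choice between B and C (renormalized).
pBC = pB + pC
two_stage = H([pA, pBC]) + pBC * H([pB / pBC, pC / pBC])

assert abs(fine - two_stage) < 1e-12  # the two bookkeepings agree
```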
And it turns out that this formulation is unique: there is no other functional form for converting a list of probabilities into a number describing uncertainty that obeys the condition of maximum uncertainty, that obeys the symmetry property of being able to shuffle the probabilities and get the same answer, and that, finally, obeys Shannon's coarse-graining axiom. If you enforce all of those, this is the formula that you're left with; no other formula will do. And one of the very nice things about this is that the coarse-graining axiom shows you how, when you coarse-grain a system, the uncertainty – the information that you have – goes down. So let's say that instead of representing the system like this, I broke it up into these two pieces here, a coarse-grained uncertainty and a fine-grained uncertainty, where I split the "B or C" case, and then said, you know what, I don't care about that fine-grained distinction. In that case, the equality becomes a greater-than-or-equal-to sign. In fact, it's strictly "greater than" as long as there's some probability of B and some probability of C – as long as there is a fine-grained distinction to be made. So, in this first module, what have we done? I've given you a short introduction to the idea of renormalization through the microeconomics example: you have a microeconomic account of the world, a very detailed description of what people are doing, that you then coarse grain to a macroeconomic account. That coarse graining generally erases distinctions: instead of distinguishing case B from case C, you say, "Ah, you know what? They're pretty much the same." We had an example of that coarse-graining property in the case of coarse-graining the image of Alice in Wonderland.
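The claim that coarse graining strictly reduces entropy whenever there is a real fine-grained distinction to erase – and leaves it unchanged otherwise – can also be checked numerically (a sketch with illustrative probabilities of my own, not from the lecture):

```python
import math

def H(probs):
    """Shannon entropy in bits, skipping zero-probability options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Strictly greater: both B and C have some probability, so merging
# them into one coarse option "B or C" throws information away.
assert H([0.5, 0.25, 0.25]) > H([0.5, 0.5])

# Equality: all of "B or C" sits on B alone, so there was no
# fine-grained distinction to erase in the first place.
assert H([0.5, 0.5, 0.0]) == H([0.5, 0.5])
```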
And there, I actually gave you three examples: majority-rule coarse graining, decimation coarse graining, and a more complicated, Fourier-space kind of coarse graining called JPEG, which of course is now in so much use that your computer has a special chip that does JPEG decompression. So, I gave you an example of how coarse graining worked, and, in this final part of the lecture of this first module, what I've done is tell you a little bit about how coarse graining plays a central role in information theory. But coarse graining is not enough. It's not enough just to simplify the world, because, in the end, what scientists do is something more than just create JPEGs. What a biologist does is something more than just clustering – or rather, you hope a biologist is doing something more than just clustering, although you should check the literature, because sometimes they don't. And, in fact, many scientists fall victim to this: they think that if you have a compressed, efficient description of the world, then you're done. But you're not done, because what you want to do with that compressed description is make predictions, produce explanations – not just describe, but explain what happened before and what's going to happen next. And in order to do that, you need a theory. And what renormalization does is say, "Great, congratulations, you've constructed a coarse graining. Now, how is the theory connecting those different coarse-grained states related to the theory at the fundamental level? What happens, in other words, when you take a full economic and psychological theory of the world and coarse-grain it to macroeconomics? What happens when the macroeconomist then tries to build a theory that relates those coarse-grained quantities?"
And so, in the next module, we will give you your first simple example of how theories about the world change when the objects that they describe simplify.