Let's keep on looking at the relationship
between information and probability.
Probability is just flipping coins: heads, heads, tails, thank goodness.
Actually, by the way, I have a legal two-headed quarter issued by the US Mint.
Can you guess what it is?
It's actually the New Hampshire quarter
that has George Washington on one side
and it's got The Old Man of the Mountain
on the other.
The Old Man of the Mountain is a rock formation
that looks like a head,
Unfortunately, it fell down about ten years ago, so now they're issuing a new New Hampshire quarter.
I also have a legal US quarter that has five heads,
and I urge you to consider what quarter that might be.
So let's look at this relationship between
information and probability.
We can either think of probability as something that's a prior probability: the probability of heads equals the probability of tails equals one half, because there is no particular reason to choose between heads or tails when I flip this coin.
But we can also think of it in terms of frequencies.
If we have a long, long sequence of heads
and tails, I'm trying to make this random but
I have a funny feeling it's exactly the same
sequence of heads and tails that I put there before.
My students complain that whenever I write
a random string of zeros and ones on the board,
it's always the same sequence, which can't
really be quite random, can it?
Anyway, what we can do is we can say, let's
take a very very long sequence.
Let's count the number of heads and tails: the frequency of heads is equal to the number of heads divided by the total length of the sequence, and the frequency of tails is defined the same way.
And then let's ask about this intuition: if the coin is fair, almost all such sequences that we would generate should have roughly fifty percent heads and fifty percent tails.
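As a quick numerical check on that intuition, here is a minimal Python sketch (the names m, flips, and m_h are my own, chosen to match the notation below) that flips a fair coin many times and reports the frequency of heads:

```python
import random

m = 100_000                                         # length of the sequence
flips = [random.random() < 0.5 for _ in range(m)]   # True means heads
m_h = sum(flips)                                    # number of heads

print(m_h / m)  # frequency of heads; almost always very close to 0.5
```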
Let me now make that mathematically more precise.
So let's look at the total number of sequences
of coin tosses of length m,
which have exactly m_h heads and m_t tails.
The sum of these two things is equal to m,
the total length of the sequence.
Now the total number of such sequences is a number called m choose m_h.
If I have a sequence of length m, and I call heads zero and tails one, it's the number of binary sequences of length m that have exactly m_h zeros and m_t ones.
These are the same thing, because the number of sequences that have exactly m_h heads is exactly the number of sequences that have exactly m_t tails.
So what is this number?
It's written as this funny thing with parentheses, m on top and m_h at the bottom, and it's pronounced m choose m_h, because it's the number of ways of choosing m_h spots or locations out of m possible spots.
So we have zero zero zero one one zero, and I'm again falling into the same pattern. We have a whole bunch of different spots, and we want to count the number of ways we can place the zeros among them.
This count is the total number of sequences that have exactly m_h heads.
So this number, I'll just write down what it is: m choose m_h is equal to m! / (m_h! * (m - m_h)!), which is equal to m! / (m_h! * m_t!).
I'm using this funny notation with an exclamation mark: m! is equal to one times two times three times four times ... times (m-1) times m. It's the product of all whole numbers up to m.
So, for example, 2! is equal to one times two
equals two.
3! is equal to one times two times three is
equal to six, et cetera.
They get big really really fast, these numbers.
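To see just how fast they grow, a short sketch using Python's standard math.factorial:

```python
import math

for m in range(1, 11):
    print(m, math.factorial(m))
# 1! = 1, 2! = 2, 3! = 6, ..., 10! = 3628800: already in the millions by m = 10
```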
So back to this number m choose m_h: m! is the total number of ways of rearranging m objects.
So if we have a certain number of zeros and ones, we can rearrange these m objects in m! different ways, but if we have a zero in the first place and a zero in the second place, we don't care which zero is which. They're both zero.
So we have to divide by the number of ways of rearranging the zeroes among each other.
That's this number m_h!, the number of ways of rearranging the zeroes; m_t! is the number of ways of rearranging the ones.
The total number of sequences that have exactly m_h zeroes and m_t ones is the total number of ways of rearranging all m objects, divided by the number of ways of rearranging the zeroes, divided by the number of ways of rearranging the ones.
This is a fact, if it doesn't make sense,
just suck it up.
Now we have a nice mathematical formula
for this number.
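As a sanity check on that formula, here is a small sketch comparing m! / (m_h! * m_t!) against Python's built-in binomial coefficient, math.comb (available in Python 3.8 and later):

```python
import math

m, m_h = 10, 4
m_t = m - m_h

by_formula = math.factorial(m) // (math.factorial(m_h) * math.factorial(m_t))
built_in = math.comb(m, m_h)

print(by_formula, built_in)  # both are 210: 10 choose 4
```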
Here is another fun fact.
So I'm now going to define a probability. The probability of heads I'll call q(h). We agreed before that the probability of heads was one half and the probability of tails was one half, but here we'll define q(h) to be the number of heads divided by m. This is the frequency of heads, the observed probability of heads, as it were. But it's just a frequency; we don't know if it's the true probability.
And 1-q(h) is equal to q(t) is equal to the
frequency of tails.
Here is the fundamental formula of information
theory.
I'll say it again, the fundamental formula
of information theory.
S for entropy, or I for information, it gets written both ways because, remember, entropy, the quantity that people discovered in the 19th century, is just information: the information required to label the different possible configurations of atoms and molecules.
In this case we're trying to label the total
number of possibilities of heads and tails,
so this is equal to
-q(h) log base 2 q(h) - q(t) log base 2 q(t).
This is a funny formula.
There's these minus signs, there's these logarithms,
there's these probabilities, et cetera.
So let me just actually write it out in a
particular case.
Suppose, just to give you a feeling for what it is, q(h) turns out to be exactly one half. So exactly half of the flips are heads. And suppose q(t) is the same thing, because q(t) is one minus q(h).
Then consider this quantity S or I, information and entropy, two sides of the same coin. I is equal to -0.5 log base 2 of 0.5, because q(h) is 0.5, and then minus another 0.5 log base 2 of 0.5, because q(t) is also 0.5. There are two of these terms, each multiplied by 0.5.
Adding them up, this is equal to -log base 2 of 0.5. But log base 2 of 0.5 is the power to which two must be raised to equal 0.5, and that's -1, because 2^(-1) is 0.5. So this is equal to -(-1), which is equal to one, and it actually has a unit: it's one bit.
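Redoing that arithmetic in code, a minimal sketch (entropy_two is my own name for the two-outcome formula):

```python
import math

def entropy_two(q_h):
    """S = -q(h) log2 q(h) - q(t) log2 q(t) for a two-outcome coin."""
    q_t = 1.0 - q_h
    return -q_h * math.log2(q_h) - q_t * math.log2(q_t)

print(entropy_two(0.5))  # 1.0 bit, exactly as computed above
print(entropy_two(0.9))  # about 0.469 bits: a biased coin gives less information
```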
So this formula, I won't call it magical,
it's just mathematical,
but it's extremely useful.
This formula says, if I have a coin, and I flip it, and its probability of heads is 0.5 and its probability of tails is 0.5, then the amount of information that's encompassed in a single coin flip, if you like, the amount of information that is generated by the coin flip, is one bit.
It happens to be tails this time.
So, flipping a coin whose frequency of heads and frequency of tails are both 0.5 gives you one bit of information.
Now, if we go back to our counting argument about probability, what we find is that the number of bits required to describe the ways of arranging exactly m_h heads among the m flips, that is, log base 2 of m choose m_h, is, for long sequences, just equal to m times (-q(h) log base 2 of q(h) - q(t) log base 2 of q(t)), and so this is equal to m times the amount of information per flip.
I keep getting tempted to call it magical,
but I don't believe in magic.
So this mathematical formula for the information is really just a way of counting the number of possibilities. We look at the number of possible ways of getting exactly m_h heads. We take its logarithm, which is the number of bits of information in such a sequence, and we find that it's equal to m, the total number of flips, times this quantity, the information per flip.
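Here is a sketch of that counting claim in Python: the logarithm of m choose m_h really does approach m times the entropy for long sequences. The small gap is the usual correction from Stirling's approximation, which vanishes relative to m as m grows.

```python
import math

m, m_h = 10_000, 5_000  # a long sequence, exactly half heads
q_h = m_h / m
q_t = 1.0 - q_h

bits_exact = math.log2(math.comb(m, m_h))  # log2 of the number of such sequences
bits_entropy = m * (-q_h * math.log2(q_h) - q_t * math.log2(q_t))

print(bits_exact, bits_entropy)  # roughly 9993 vs 10000: equal to leading order
```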
So just to summarize, the fundamental formula of information theory is that the amount of information, also the amount of entropy, if I have two possibilities, is just -q(h) log base 2 of q(h) - q(t) log base 2 of q(t).
If I have more possibilities, let's call
them k possible outcomes,
then the amount of information is equal to minus the sum over possible outcomes, from i = 1 to k, of q(i) log base 2 of q(i).
And we just get this from counting the
number of possibilities that go into constructing
these frequencies.
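The general k-outcome formula is easy to express in code; a minimal sketch (the function name entropy is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum over i of q(i) * log2(q(i))."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit: the fair coin
print(entropy([0.25] * 4))       # 2.0 bits: four equally likely outcomes
print(entropy([0.7, 0.2, 0.1]))  # about 1.157 bits: k = 3 outcomes
```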
And this extremely useful formula is the basis for all of information theory, including the mathematical theory of communication and the theory of computation.