It is important to understand how entropy works in the face of multiple variables, because one of the main applications of classical information theory is to try to find common or shared statistical properties across different objects. And, while we will use some basic concepts from entropy, we will also be replacing it in some applications with measures of an algorithmic nature, as opposed to measures of a statistical nature such as entropy. Others have already done a great job of describing the many versions and types of applications of entropy, so this is only a brief overview of the ways in which entropy approaches and statistical information can be helpful for our own discussion when moving to algorithmic complexity.

In traditional information theory, one of the things that can be done is to ask what kind of similar statistical properties two or more objects may share. Joint entropy is a measure associated with a set of random variables. A random variable can be a binary sequence, and the events of a sequence are its bits. Joint entropy is simply the entropy of two variables taken together, and, in general, it can be defined for any number of variables. For example, you may want to know the joint entropy of two random variables related to weather. Let's say that we are interested in whether today will be a sunny or a rainy day, and whether it will be hot or cool, and that we have empirically calculated these probabilities based on previous days' and previous years' records. Then the joint entropy of the variables X and Y follows from the joint entropy formula. The resulting value indicates that these variables are not independent, because the joint entropy is smaller than the sum of the entropies of the single variables alone. This is because there is a greater chance of a hot day if it is sunny than if it is rainy; and, if it is rainy, it is also more likely to be cooler, according to the empirical distributions that we assumed and calculated in this example. Something important to notice is that to apply entropy one needs to be able to properly calculate those joint distributions.

Now, if you want to know how much of the statistical properties of one sequence, X, you can guess from the statistical properties of another sequence, Y, then we are talking about what is known as "conditional entropy." Conditional entropy quantifies the amount of information needed to describe the outcome of a random variable, Y, given the value of another random variable, X. For example, let's say we have two sequences with very similar statistical properties. Their conditional entropy should be very low, because knowing one of them leaves very little uncertainty about the other. You can write a function for conditional entropy in the Wolfram Language yourself; a sketch is given below. There is also an undocumented function in some versions of the Wolfram Language to calculate conditional entropies directly, but you should not tell Wolfram I told you about it; it may not be available on all platforms and versions of Mathematica, but you can try it. As extreme cases, the conditional entropy, denoted by H of Y given X, is equal to zero if and only if the value of Y is completely determined by the value of X. Conversely, the entropy of Y given X is equal to the entropy of Y alone if and only if Y and X are independent random variables, which means that X does not provide any information about Y, and thus asking for the entropy of Y given X is simply equal to asking for the entropy of Y directly, with no access to X.
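As a minimal sketch of how such functions might look in the Wolfram Language, assuming the sequences are given as two equal-length lists of symbols and that probabilities are estimated from empirical frequencies: the helper names jointEntropy and conditionalEntropy here are my own, and only Entropy is a built-in function.

    (* Empirical joint entropy in bits: each pair of aligned values is treated as one joint symbol *)
    jointEntropy[x_List, y_List] := N[Entropy[2, Transpose[{x, y}]]]

    (* Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X) *)
    conditionalEntropy[y_List, x_List] := jointEntropy[x, y] - N[Entropy[2, x]]

    (* Two toy sequences with very similar statistical properties *)
    x = {0, 1, 0, 1, 0, 1, 0, 1};
    y = {1, 0, 1, 0, 1, 0, 1, 0};
    conditionalEntropy[y, x]  (* 0 here: each bit of y is completely determined by the corresponding bit of x *)

For these toy sequences the conditional entropy is exactly zero, the first extreme case mentioned above; with noisier pairs of sequences the value would lie somewhere between zero and the entropy of Y alone.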
The mutual information of two random variables or random sequences, as defined in terms of Shannon entropy and denoted by the letter I, is a measure of the variables' common statistical properties; that is, the amount of statistical information that one variable may tell us about the other. In terms of joint entropy, I can be defined as the joint entropy of X and Y, minus the conditional entropy of X given Y, minus the conditional entropy of Y given X; that is, all the information that can be known about each variable or sequence from the other. Mutual information is "commutative," meaning that the order of the sequences does not matter: whatever you can guess of X from Y, you can also guess of Y from X. This is a problem that entropy carries with it. The fact that it is a symmetric function means that it does not allow you to determine a causal direction when dealing with causality, but only a correlation. There are ways to fix this by introducing time, but this only partially fixes the problem, because entropy will miss anything that is not of a statistical nature, as we will later see, such as algorithmic connections between the variables.

Conditional entropy and mutual information can be thought of as complementary operations. The conditional entropy of Y given X is a measure of what X does not say about Y, or the amount of information remaining about Y after knowing X. And the mutual information between X and Y is a measure of what Y says about X, or how much information X shares with Y. You will find a very useful Venn diagram explaining each of these measures in most textbooks about information theory.

In science in general, we are interested in causal relationships. That is, what is the cause of events, such as: what is the cause of lung cancer? Do people who smoke have greater chances of developing lung cancer? As you can see, this sounds very much like mutual information and conditioning. For example, you can think of whether knowing that someone smokes two packs of cigarettes every day will tell you anything about that person getting cancer, and one can calculate all the joint probabilities and apply information theory to get an answer. Imagine that you perform an experiment where you find out that people who are more stressed also have greater chances of developing cancer. But what happens in reality is that people who smoke are more likely to smoke more when they are stressed. So, at best, the stress is only indirectly causing cancer; it is not a direct cause, because there may be stressed people who do not relieve their stress by smoking and who thus do not increase their chances of developing lung cancer. It is often the case that one cannot isolate these causes and cases in some studies because of a lack of control over the variables. But, using conditional mutual information, one may sometimes be able to tell direct causes from indirect causes. For example, whether it is statistically more informative to know that someone is stressed or that someone is a smoker, with respect to the risk of developing lung cancer, is something we can explore with classical information theory and Shannon entropy. We basically want to disentangle the following causal scenario, in which we believe that smoking is the direct cause, but we want to test whether being stressed is equally informative or only an indirect cause, by way of stressed people who decide to smoke.
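Before setting up that test, it is worth writing mutual information itself down in code. This is again only a sketch, reusing the hypothetical jointEntropy and conditionalEntropy helpers from above; it also makes the symmetry just mentioned concrete.

    (* Mutual information in bits: I(X;Y) = H(X,Y) - H(X|Y) - H(Y|X),
       equivalently I(X;Y) = H(X) + H(Y) - H(X,Y) *)
    mutualInformation[x_List, y_List] :=
      jointEntropy[x, y] - conditionalEntropy[x, y] - conditionalEntropy[y, x]

    (* Symmetry: whatever X tells us about Y, Y tells us about X *)
    a = {0, 0, 1, 1, 0, 1, 0, 1};
    b = {0, 1, 1, 1, 0, 1, 0, 0};
    {mutualInformation[a, b], mutualInformation[b, a]}  (* identical, up to numerical round-off *)

This symmetry is exactly why mutual information by itself reports a correlation rather than a causal direction, which is what the smoking example below tries to tease apart.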
The first thing to notice is that we have three random variables, which in this case may be binary, though they could also be given weights. The variables are: being a smoker, which can have the answer yes or no; being stressed; and having lung cancer. So, if we want to test whether being stressed or being a smoker carries the greater risk of developing lung cancer, we can write both statements as conditional entropies and compare them; a sketch is given at the end of this section. If the two turn out to be equal, this specific test may fail to tell us the direct cause, and we would need to come up with a more sophisticated case. One question would be whether such an equality tells us something about the entropy of being a smoker given being stressed, or the other way around: clearly, if the entropy of being a smoker given being stressed is equal to the entropy of being stressed given being a smoker, then both would be equally informative. Would it imply that virtually every smoker is stressed, and the other way around? Maybe you can think about it.

We may find that the uncertainty, or degree of surprise, of having cancer given that one is a smoker is much less than the uncertainty of having cancer given that one is stressed, which would mean that it is more likely to develop cancer by virtue of being a smoker than by being stressed. We know, however, that stressed people and smokers are not disjoint sets of people, because many stressed people decide to smoke. So, think about whether it makes sense to ask about the mutual information of cases such as the mutual information between cancer and being a smoker given being stressed, versus only the mutual information between cancer and being a smoker, and what values of each would tell us something about the relationship among these three variables. We can see how classical information theory and Shannon entropy provide a nice language for dealing with probability distributions and framing these kinds of questions in a statistical framework, and one can also see that how good the answer is depends entirely on how much we know about, or how well we can approximate, all the empirical distributions involved.
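As before, here is only a hypothetical sketch in the Wolfram Language of how those comparisons might be set up, reusing the helpers from above. The 0/1 lists are placeholder data, one entry per person (smoker?, stressed?, lung cancer?), and jointEntropy3 and conditionalMutualInformation are my own helper names, not built-ins.

    (* Placeholder 0/1 records, purely for illustration *)
    smoker   = {1, 1, 0, 1, 0, 0, 1, 0, 1, 1};
    stressed = {1, 0, 0, 1, 1, 0, 1, 0, 1, 1};
    cancer   = {1, 1, 0, 1, 0, 0, 1, 0, 0, 1};

    (* The two competing statements: is the uncertainty about cancer smaller
       given that a person is a smoker, or given that a person is stressed? *)
    conditionalEntropy[cancer, smoker] < conditionalEntropy[cancer, stressed]

    (* Conditional mutual information: I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) *)
    jointEntropy3[x_List, y_List, z_List] := N[Entropy[2, Transpose[{x, y, z}]]]
    conditionalMutualInformation[x_List, y_List, z_List] :=
      jointEntropy[x, z] + jointEntropy[y, z] - jointEntropy3[x, y, z] - N[Entropy[2, z]]

    (* Does stress still inform us about cancer once smoking is accounted for, and vice versa? *)
    conditionalMutualInformation[cancer, stressed, smoker]
    conditionalMutualInformation[cancer, smoker, stressed]

If the first of these quantities drops close to zero while the second stays clearly positive, that is the kind of evidence suggesting that smoking, rather than stress, is the direct statistical contributor; but, as noted above, how much one can trust such a conclusion depends entirely on how well the empirical distributions have been estimated.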