So, can we say anything general about that process? The answer turns out to be: yes. Rather than writing this matrix in the familiar ϵ and 1 − ϵ form, where the column vector holds the probability of being in state A and the probability of being in state B, what we're going to do is diagonalize it. If the matrix is T, we find its eigenvectors, which are defined by the property that T acting on, say, the first eigenvector gives back that same eigenvector times a constant: Tv₁ = λ₁v₁. Here v₁ is an eigenvector of the transition matrix (the evolution operator) T, and λ₁ is its eigenvalue. A 2 × 2 matrix like this generically has two eigenvectors, v₁ and v₂, with eigenvalues λ₁ and λ₂. Now take an arbitrary probability distribution over states A and B and represent it as a weighted sum over these two eigenvectors: α₁v₁ + α₂v₂. In eigenvector space, that distribution becomes the column vector (α₁, α₂). Once the matrix is diagonalized, its action on this transformed space takes a very simple form: it becomes diagonal, with λ₁ and λ₂ on the diagonal, and acting on (α₁, α₂) it simply rescales each entry, giving (λ₁α₁, λ₂α₂), exactly as you would expect from the definition of the eigenvectors. So what I've done here is go from the original representation of the Markov chain to a diagonalized representation, call it T̂, which acts not on the probability of being in states A and B but on the amount of probability you have in the first eigenvector pattern and the amount in the second eigenvector pattern.
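To make the diagonalization concrete, here is a minimal numerical sketch in Python. I'm assuming the specific parameterization T = [[ϵ, 1 − ϵ], [1 − ϵ, ϵ]], where ϵ is the probability of staying in the current state and each column gives the transition probabilities out of one state; the value ϵ = ¼ is just an illustrative choice.

```python
import numpy as np

eps = 0.25                     # self-loop probability (assumed value)
# Column-stochastic matrix: column j gives P(next state | current state j).
T = np.array([[eps, 1 - eps],
              [1 - eps, eps]])

# Diagonalize: the columns of V are the eigenvectors v1 and v2.
lams, V = np.linalg.eig(T)
print(lams)                    # one eigenvalue is 1, the other is 2*eps - 1

# Represent an arbitrary distribution p in eigenvector coordinates:
p = np.array([0.9, 0.1])
alpha = np.linalg.solve(V, p)  # p = alpha[0]*v1 + alpha[1]*v2

# Acting with T in state space equals rescaling each alpha by its eigenvalue:
T_hat = np.diag(lams)
assert np.allclose(V @ (T_hat @ alpha), T @ p)
```

The final assertion checks the key identity of the lecture: applying T to a distribution is the same as scaling its eigenvector coefficients by the eigenvalues and transforming back.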
What you can see here is that of the two eigenvalues this matrix has, one is equal to 1. In fact, a theorem you can prove in linear algebra says that any stochastic matrix has, as its largest eigenvalue, something whose absolute value is equal to 1. Sometimes an eigenvalue of absolute value 1 can be −1; sometimes it can even be complex. But for most ordinary stochastic matrices, including the one we're looking at here — the precise conditions are somewhat technical, requiring irreducibility and aperiodicity (ergodicity) — the largest eigenvalue equals unity, and all the other eigenvalues have absolute values strictly smaller than 1. Notice that as long as ϵ is less than unity — as long as there is some probability of a transition between A and B — the second eigenvalue, 2ϵ − 1, has absolute value less than 1. If ϵ is less than ½, that eigenvalue is in fact negative, but its absolute value is still below 1. So now we've reduced the problem of computing Tⁿ to the problem of computing T̂ⁿ, where T̂ has 1 and 2ϵ − 1 on its diagonal, acting on the transformed representation of the probabilities of being in states A and B. Raising T̂ to the power n gives a very simple matrix: the first diagonal term, 1ⁿ, is just 1; the off-diagonal terms remain 0; and the other diagonal term becomes (2ϵ − 1)ⁿ, again acting on the transformed representation. As n gets very large, this term, having absolute value less than 1, keeps getting smaller and smaller. So, for example, let's take the simple case where ϵ = ¼.
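The claim that Tⁿ reduces to powers of the eigenvalues can be checked directly. Again assuming the symmetric chain T = [[ϵ, 1 − ϵ], [1 − ϵ, ϵ]] with the illustrative value ϵ = ¼:

```python
import numpy as np

eps = 0.25
T = np.array([[eps, 1 - eps],
              [1 - eps, eps]])
lams, V = np.linalg.eig(T)

# T^n via the diagonalization: T^n = V @ diag(lams**n) @ V^{-1}.
n = 50
Tn_eig = V @ np.diag(lams ** n) @ np.linalg.inv(V)
Tn_pow = np.linalg.matrix_power(T, n)
assert np.allclose(Tn_eig, Tn_pow)

# The second eigenvalue term (2*eps - 1)^n shrinks geometrically:
second = 2 * eps - 1
print(abs(second) ** n)        # about 9e-16 -- effectively gone by n = 50
```

So taking the chain to a large power costs nothing in the diagonal basis: only the eigenvalues get raised to the power n.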
In that case, along the diagonal of T̂ⁿ you'll have a 1 up here and a (−½)ⁿ down here. When this acts upon an arbitrary column vector (α₁, α₂), α₁ remains unchanged but α₂ is suppressed by an enormous factor: already at n = 20, the factor 1/2²⁰ is down by about a million, and at n = 1000 the suppression is astronomically larger. Because of the minus sign, the surviving piece of α₂ is a very tiny positive number or a very tiny negative number depending on whether n is even or odd. But in general, as you take increasingly large powers of T, no matter what combination of α₁ and α₂ you put into the system, out the other side you get a vector that is pure v₁. Another way to say this, in the language of the probability of being in each of these states, is that if you coarse-grain your data enough, the corresponding model will take any probability distribution and map it directly onto that first eigenvector of the system. That first eigenvector has a special name: it is called the stationary distribution of the original stochastic matrix. And so now we know that no matter what you put in — let's say you begin entirely in state A — if the system has been coarse-grained enough, then after one coarse-grained time step, whose evolution operator is Tⁿ with n very large, you will go to a unique probability distribution given by v₁; and if you begin in state B, you will be taken to the identical probability distribution v₁. And so in the end the matrix will take the following form: if you begin in state A, you have some probability, call it P(A), of ending up in state A.
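Here is a quick numerical sketch of this collapse onto the stationary distribution, under the same assumed symmetric chain with ϵ = ¼ (for which v₁ is the uniform distribution):

```python
import numpy as np

eps = 0.25
T = np.array([[eps, 1 - eps],
              [1 - eps, eps]])

# A large power of T, standing in for the coarse-grained evolution operator:
Tn = np.linalg.matrix_power(T, 1000)

start_in_A = np.array([1.0, 0.0])
start_in_B = np.array([0.0, 1.0])
pA = Tn @ start_in_A
pB = Tn @ start_in_B

# Both starting points land on the same stationary distribution v1:
assert np.allclose(pA, pB)
print(pA)                      # [0.5, 0.5] for this symmetric chain
```

Starting entirely in A and starting entirely in B give the same output distribution, exactly as the eigenvalue argument predicts.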
And you have some probability, called P(B), of ending up in state B, and we know, of course, that P(B) = 1 − P(A). Similarly, if you begin in state B, you have some probability of ending up in state B and some probability of ending up back in state A, and again the probability of ending in state B can be written as 1 minus the probability of ending in state A. What this simplification means is that no matter what transition probabilities you put in at short time scales — in particular, we are free to choose an arbitrary self-loop probability for A and an arbitrary self-loop probability for B — if you begin with that short-time-scale description, renormalize enough, coarse-grain the observations enough, and ask what the corresponding model is, you find that every model can be described by a single parameter, P(A). You move from models that live in a two-dimensional space, where I have to tell you the self-loop probability for each of the two states, to a set of models described by a single probability, P(A). So one way to imagine the limit of this coarse-graining process — how all of these different Markov chains flow — is the following diagram, which we sometimes call the phase diagram for this system. On the x-axis we have the self-loop probability for the A state, and on the y-axis the self-loop probability for the B state, so every point in this plane refers to a distinct Markov chain model. All of these models actually flow to a single line on this plane.
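The collapse from two parameters to one can be sketched for an asymmetric chain as well. The self-loop probabilities `a` and `b` below are hypothetical values chosen purely for illustration, and I take the coarse-grained model to be a large power of T:

```python
import numpy as np

# Hypothetical short-time-scale model: arbitrary self-loop probabilities.
a, b = 0.9, 0.6                # P(stay in A), P(stay in B) -- assumed values
T = np.array([[a, 1 - b],
              [1 - a, b]])     # column-stochastic

# Renormalize: after enough coarse-graining the model looks like T^n.
Tn = np.linalg.matrix_power(T, 200)

# The coarse-grained self-loop probabilities collapse onto one parameter:
PA = Tn[0, 0]                  # flows to the stationary probability P(A)
PB = Tn[1, 1]                  # flows to 1 - P(A)
print(PA, PB)                  # 0.8 and 0.2 for these values of a and b
assert np.isclose(PA + PB, 1.0)
```

Although the input model needed two numbers (a, b), the coarse-grained self-loop probabilities sum to 1, so a single parameter P(A) describes the limiting model — the point on the line that this chain flows to.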
All the points on this line have the potential to be fixed points: as you coarse-grain the system further and further, every model ends up on the line where the self-loop probability for the A state equals 1 minus the self-loop probability for the B state. The way to think about this is that P(A | A), when you coarse-grain enough, just becomes the stationary probability of A — the weight of A in the stationary distribution of that Markov chain — and the self-loop probability for B just becomes the stationary probability of B, which we know must equal 1 − P(A). So what we find is that as you continue to renormalize this system — as you continue to coarse-grain and ask what the corresponding model is — you get driven toward this line, and as you do you flow to what we call a fixed point, getting closer and closer to the limit of Tⁿ as n goes to infinity. That's a more formal, though still somewhat cartoonish, version of the story. The reason I say it's cartoonish is that I haven't told you the exact conditions for this flow to work: implicitly, the first eigenvalue has to equal 1, and there can be no other eigenvalues whose absolute value also equals 1. For example, a model that does not flow to somewhere on this line is one where the system continually oscillates back and forth. In that case the self-loop probability for A is zero and the self-loop probability for B is zero, and no matter how many times you multiply the matrix by itself, it never blurs out and gives you a stationary distribution. I haven't told you the exact conditions that rule out models like that, which sit on the corners of this space. But basically, if you're anywhere in the interior, you always flow to a unique fixed point under the coarse-graining operation.
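The oscillating counterexample is easy to exhibit numerically: the deterministic flip chain has eigenvalues 1 and −1, so the second eigenvalue never decays and the powers of T never converge.

```python
import numpy as np

# Deterministic oscillator: zero self-loop probability for both states.
T = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(np.linalg.eigvals(T))    # two eigenvalues of absolute value 1: +1 and -1

# Powers of T never blur out into a stationary distribution -- they alternate:
p0 = np.array([1.0, 0.0])      # begin entirely in state A
even = np.linalg.matrix_power(T, 1000) @ p0
odd = np.linalg.matrix_power(T, 1001) @ p0
assert np.allclose(even, [1.0, 0.0])
assert np.allclose(odd, [0.0, 1.0])
```

This is exactly the corner of the phase diagram that the flow argument excludes: with two eigenvalues on the unit circle, Tⁿ has no limit as n goes to infinity.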