So, can we say anything
general about that process?
The answer turns out to be: Yes.
And what we're going to do is represent this matrix not in the form you're familiar with, the ϵ and 1 - ϵ form, where the column vector gives the probability of being in state A and the probability of being in state B. Instead, we're going to diagonalize the matrix. So if the matrix is T, we find its eigenvectors, which are defined by T acting on, say, the first eigenvector and giving back that same eigenvector, potentially times some constant λ: T v₁ = λ₁ v₁. Here v₁ is an eigenvector of the transition matrix, the evolution operator T, and λ₁ is the corresponding eigenvalue.
A 2×2 matrix like this generically has two eigenvectors and two eigenvalues, and so we can write the eigenvectors as v₁ and v₂.
And now, for an arbitrary probability distribution over states A and B, I'm going to represent it as a weighted sum over these two eigenvectors: α₁v₁ + α₂v₂. So if I make a column vector that represents that probability in eigenvector space, its two entries are α₁ and α₂.
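To make this concrete, here is a minimal numerical sketch, assuming the symmetric two-state matrix with self-loop probability ϵ for each state; the value ϵ = ¼ and the variable names are just illustrative:

```python
import numpy as np

eps = 0.25  # self-loop probability, an assumed value for illustration
# Columns give the next-step probabilities out of states A and B
T = np.array([[eps,     1 - eps],
              [1 - eps, eps    ]])

# Eigendecomposition: the columns of V are the eigenvectors v1, v2
lam, V = np.linalg.eig(T)
print(lam)  # eigenvalues: 1 and 2*eps - 1

# Express an arbitrary distribution p = alpha1*v1 + alpha2*v2
# by solving V @ alpha = p for the coordinates (alpha1, alpha2)
p = np.array([0.9, 0.1])
alpha = np.linalg.solve(V, p)
print(alpha)
```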
So once I have diagonalized the matrix, its action on this transformed space takes a very simple form, and in fact I'll write out the exact form it takes: the matrix becomes diagonal, and when it acts on the diagonalized version of the probability vector, as you would expect given the defining property of the eigenvectors, it simply rescales each of the two entries, (α₁, α₂) ↦ (λ₁α₁, λ₂α₂).
So what I've done here is go from the original way of representing the Markov chain to this diagonalized representation, which we can call T hat. T hat acts not on the probabilities of being in states A and B, but on the amount of probability you have in the first eigenvector pattern and the amount you have in the second eigenvector pattern.
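As a sketch of that change of basis, with the same assumed matrix as above, the similarity transform V⁻¹ T V is exactly the diagonal matrix T hat:

```python
import numpy as np

eps = 0.25
T = np.array([[eps,     1 - eps],
              [1 - eps, eps    ]])

lam, V = np.linalg.eig(T)           # columns of V are v1, v2
T_hat = np.linalg.solve(V, T @ V)   # T_hat = V^{-1} T V
print(np.round(T_hat, 12))          # diagonal entries: 1 and 2*eps - 1

# Acting on the transformed vector (alpha1, alpha2) just rescales each entry
alpha = np.array([1.0, 0.3])
print(T_hat @ alpha)                # (1*alpha1, (2*eps - 1)*alpha2)
```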
What you can see here is that of the two eigenvalues this matrix has, one is equal to 1. In fact, it follows from a theorem you can prove in linear algebra that the largest eigenvalue of any stochastic matrix has absolute value equal to 1. Sometimes an eigenvalue on the unit circle can be negative one; sometimes it can even be complex. But for most ordinary stochastic matrices, including the one we're looking at here (the precise conditions are somewhat technical: you need irreducibility and aperiodicity, which together give ergodicity), the largest eigenvalue is exactly unity, and all the other eigenvalues have absolute values strictly smaller than 1.
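You can check this claim numerically; here is a small sketch using a random column-stochastic matrix, where the size and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((4, 4))
M /= M.sum(axis=0)   # normalize columns so M is stochastic

mags = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
print(mags)  # the largest is 1.0; for a generic positive M the rest are strictly smaller
```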
So notice here that as long as ϵ is strictly between 0 and 1, that is, as long as there is some probability of a transition between A and B and some probability of staying put, this second eigenvalue, 2ϵ - 1, will have an absolute value less than 1. Notice that if ϵ is less than ½ the second eigenvalue will in fact be negative, but its absolute value will still be less than 1.
So now we've reduced the problem of taking T to the n to the problem of taking T hat to the n, where T hat has 1 and 2ϵ - 1 on the diagonal, acting on this transformed representation of the probabilities of being in states A and B. If we raise T hat to the power of n, the matrix takes a very simple form: the first diagonal term is just 1, since 1 to the power of n is 1; the off-diagonal terms remain 0; and the second diagonal term becomes (2ϵ - 1)ⁿ, again acting on the transformed representation.
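Here is a quick sketch of that reduction, comparing the eigenbasis computation of Tⁿ against the brute-force matrix power, with the same assumed ϵ = ¼ as before:

```python
import numpy as np

eps, n = 0.25, 10
T = np.array([[eps,     1 - eps],
              [1 - eps, eps    ]])

lam, V = np.linalg.eig(T)
T_hat_n = np.diag(lam ** n)              # diag(1, (2*eps - 1)**n)

print(V @ T_hat_n @ np.linalg.inv(V))    # T^n computed in the eigenbasis...
print(np.linalg.matrix_power(T, n))      # ...matches the direct matrix power
```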
As n gets very large, this (2ϵ - 1)ⁿ term, because its absolute value is less than 1, will keep getting smaller and smaller. So, for example, take the simple case where ϵ is equal to ¼. In that case, along the diagonal you'll have a 1 up here and a (-½)ⁿ down here, and when this acts on an arbitrary column vector (α₁, α₂), α₁ will remain unchanged, but α₂ will be suppressed by an enormous factor: if n is a thousand, then (½)¹⁰⁰⁰ is down by a factor of roughly 10³⁰¹.
There will also be the minus sign here, so depending on whether n is even or odd, the α₂ component will be a very tiny positive number or a very tiny negative number. But in general, as you take increasingly large powers of T, you will find that no matter what you put into the system, no matter what combination of α₁ and α₂, out the other side you will get a vector that is pure α₁, pure first eigenvector.
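Here is that decay in action, a sketch that starts the chain entirely in state A:

```python
import numpy as np

eps = 0.25
T = np.array([[eps,     1 - eps],
              [1 - eps, eps    ]])

p0 = np.array([1.0, 0.0])   # begin entirely in state A
for n in [1, 2, 5, 10, 50]:
    print(n, np.linalg.matrix_power(T, n) @ p0)
# The alpha2 piece scales as (-1/2)**n: it alternates in sign with n
# but shrinks toward zero, leaving only the alpha1 (v1) component.
```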
Another way to say this, using the language of the probability of being in each of these states, is that if you coarse-grain your data enough, the corresponding model will take any probability distribution and map it directly onto the first eigenvector of the system. That first eigenvector has a special name: it's called the stationary distribution of the original stochastic matrix.
And so now we know what happens no matter what you put in. Say you begin entirely in state A: if the system has been coarse-grained enough, then after one coarse-grained time step, using the evolution operator Tⁿ with n very large, you will go to a unique probability distribution given by v₁. And if you begin in state B, you will be taken to that identical probability distribution v₁.
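Equivalently, every column of Tⁿ converges to the same vector v₁; a sketch, under the same assumed ϵ = ¼:

```python
import numpy as np

eps = 0.25
T = np.array([[eps,     1 - eps],
              [1 - eps, eps    ]])

print(np.linalg.matrix_power(T, 1000))
# Both columns equal (0.5, 0.5): starting from A or from B,
# you land on the same stationary distribution v1.
```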
And so in the end the matrix will take the following form. If you begin in state A, you have some probability, call it P(A), of staying in state A, and some probability, call it P(B), of ending up in state B, and we know, of course, that P(B) = 1 - P(A). Similarly, if you begin in state B, you have some probability of staying in state B and some probability of ending up back in state A, and again we can write the probability of being in state B as 1 minus the probability of being in state A. In other words, both columns of the limiting matrix are identical: each one is the stationary distribution (P(A), P(B)).
So what this simplification means is that no matter what transition probabilities you put in here, and in particular we are free at short time scales to put in arbitrary probabilities for the self-loop on A and arbitrary probabilities for the self-loop on B: if you begin with this description on short time scales, and you renormalize enough, coarse-graining the observations enough and asking what the corresponding model is, you find that in fact every model can be described by a single parameter, P(A). You move from models that live in a two-dimensional space, where I have to tell you the self-loop probabilities for each of the individual states, to, once you coarse-grain enough, a set of models that can be described by a single probability, P(A).
And so one way to imagine the limit of this coarse-graining process, how all of these different Markov chains flow, is the following diagram, which we sometimes call the phase diagram for this system. On the x-axis we have the self-loop probability for the A state, and on the y-axis we have the self-loop probability for the B state, so every point in this plane refers to a distinct Markov chain model.
All of these models flow to a single line in this plane. The points on this line are the potential fixed points: as you coarse-grain the system further and further, every model ends up on the line where the self-loop probability for the A state equals 1 minus the self-loop probability for the B state. The way to think about this is that P(A|A), when you coarse-grain enough, just becomes the stationary probability P(A) of that Markov chain's stationary distribution, and the self-loop probability for B just becomes the probability P(B), which we know has to equal 1 - P(A) for the stationary distribution.
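You can watch this flow numerically. In the sketch below, coarse-graining by doubling the time step is just squaring the matrix; the starting self-loop probabilities are an arbitrary choice:

```python
import numpy as np

p, q = 0.9, 0.4   # assumed initial self-loop probabilities for A and B
T = np.array([[p,     1 - q],
              [1 - p, q    ]])

for step in range(6):
    print(step, (round(T[0, 0], 6), round(T[1, 1], 6)))
    T = T @ T     # double the time step: one coarse-graining move
# The point (P(A|A), P(B|B)) flows onto the line x + y = 1,
# here landing at (6/7, 1/7): the stationary distribution.
```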
So what we find is that as you continue to renormalize this system, as you continue to coarse-grain and ask what the corresponding model is, you get driven toward this line, and as you get driven toward it, you flow to what we call a fixed point, getting closer and closer to the limit of Tⁿ as n goes to infinity.
So that's a more formal, though still somewhat cartoonish, story. The reason I say it's cartoonish is that I haven't told you the exact conditions for this flow to work. Implicitly, what you have to have is that the first eigenvalue equals one and that no other eigenvalue also has absolute value equal to one.
So, for example, a model that does not flow to somewhere on this line is one where the system deterministically oscillates back and forth: the self-loop probability for A is zero and the self-loop probability for B is zero. No matter how many times you multiply that matrix by itself, it never blurs out and gives you a stationary distribution.
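Here is a sketch of that pathological case, the deterministic flip-flop with both self-loops equal to zero:

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # A -> B -> A -> ... with certainty

print(np.linalg.eigvals(T))             # 1 and -1: two eigenvalues on the unit circle
print(np.linalg.matrix_power(T, 1000))  # identity matrix: back where you started
print(np.linalg.matrix_power(T, 1001))  # the flip again -- it never settles down
```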
I haven't told you the exact conditions that rule out some of these models, sitting at the corners of this space. But basically, if you're anywhere in the interior, you always flow to a unique fixed point under the coarse-graining operation.