Monte Carlo methods: Markov Chains

Today: Markov Chains

Our goals: understand what Markov chains are, compute \(n\)-step transition probabilities, and see when a chain converges to a unique equilibrium distribution no matter where it starts.

Reading: Notes from Richard Weber’s course on Markov chains, or Lange, Chapter 23.

A silly example

A frog hops around on some lily pads, choosing where to hop somewhat randomly.

We would like to be able to make probability statements about where the frog is.

Definitions

A Markov chain on a countable state space \(I\) is a sequence of random variables \((X_n)_{n \ge 0}\) taking values in \(I\) such that, for an initial distribution \(\lambda\) and a transition matrix \(P = (p_{ij})\), \[ P(X_0 = i) = \lambda_i \] and \[ P(X_{n+1} = j \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i) = p_{ij} \] The next state depends on the history only through the current state, and each row of \(P\) sums to 1.

For the frog, \(I\) is the set of lily pads and \(p_{ij}\) is the probability that the frog hops from pad \(i\) to pad \(j\).

n-step transition matrix

Given a starting distribution \(\lambda\), we would like to know \(P(X_n = j)\).

We will use the notation \(P_i(A) = P(A \mid X_0 = i)\) for \(A\) any event.

For \(n = 1\), we know that \[ P(X_1 = j) = \sum_{i \in I} \lambda_i P_i(X_1 = j) = \sum_{i \in I} \lambda_i p_{ij} = (\lambda P)_j \]

For \(n = 2\), we can write \[ P_i(X_2 = j) = \sum_k P_i(X_1 = k, X_2 = j) = \sum_k p_{ik} p_{kj} = (P^2)_{ij} \]

\[ P(X_2 = j) = \sum_{i,k} \lambda_i P_i(X_1 = k, X_2 = j) = \sum_{i,k} \lambda_i p_{ik} p_{kj} = (\lambda P^2)_j \]

So in general, if \(\delta_i\) is the vector with a 1 at index \(i\) and 0’s everywhere else, \[ P_i(X_n = j) = (\delta_i P^n)_j = (P^n)_{ij} \] and \[ P(X_n = j) = \sum_i \lambda_i P_i(X_n = j) = (\lambda P^n)_j \]

So the \(n\)-step transition matrix is simply \(P^n\).

Example of n-step transition matrix computations

## transition matrix for the three-pad frog chain
P = matrix(0, nrow = 3, ncol = 3)
P[1,2] = 1
P[2,2] = P[2,3] = P[3,1] = P[3,3] = 1/2
P
##      [,1] [,2] [,3]
## [1,]  0.0  1.0  0.0
## [2,]  0.0  0.5  0.5
## [3,]  0.5  0.0  0.5
## the frog starts at pad 1
lambda = c(1,0,0)
lambda %*% P
##      [,1] [,2] [,3]
## [1,]    0    1    0
lambda %*% P %*% P
##      [,1] [,2] [,3]
## [1,]    0  0.5  0.5

What happens after a long time?

## computes P^(2^k) by squaring k times
pow_P = function(P, k) {
    for(i in seq_len(k)) {
        P = P %*% P
    }
    return(P)
}
## 8-step probabilities 
lambda %*% pow_P(P, 3)
##          [,1]      [,2]      [,3]
## [1,] 0.203125 0.3984375 0.3984375
## 16-step probabilities 
lambda %*% pow_P(P, 4)
##           [,1]      [,2]      [,3]
## [1,] 0.2000122 0.3999939 0.3999939
## 32-step probabilities 
lambda %*% pow_P(P, 5)
##      [,1] [,2] [,3]
## [1,]  0.2  0.4  0.4
## 64-step probabilities 
lambda %*% pow_P(P, 6)
##      [,1] [,2] [,3]
## [1,]  0.2  0.4  0.4

What if the frog starts at a different location?

## 2^6-step probabilities, starting at 2
lambda = c(0,1,0)
lambda %*% pow_P(P, 6)
##      [,1] [,2] [,3]
## [1,]  0.2  0.4  0.4
## 2^6-step probabilities, starting at 3
lambda = c(0,0,1)
lambda %*% pow_P(P, 6)
##      [,1] [,2] [,3]
## [1,]  0.2  0.4  0.4

In this case, no matter where the frog starts, we get the same probabilities:

pow_P(P, 6)
##      [,1] [,2] [,3]
## [1,]  0.2  0.4  0.4
## [2,]  0.2  0.4  0.4
## [3,]  0.2  0.4  0.4

Another example

## a bigger pond: seven lily pads
P_big = matrix(0, nrow = 7, ncol = 7)
P_big[1:3,1:3] = P
P_big[4,3:5] = c(.25, .5, .25)
P_big[5, 6:7] = c(.5, .5)
P_big[6,4] = 1
P_big[7,7] = 1
P_big
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  0.0  1.0 0.00  0.0 0.00  0.0  0.0
## [2,]  0.0  0.5 0.50  0.0 0.00  0.0  0.0
## [3,]  0.5  0.0 0.50  0.0 0.00  0.0  0.0
## [4,]  0.0  0.0 0.25  0.5 0.25  0.0  0.0
## [5,]  0.0  0.0 0.00  0.0 0.00  0.5  0.5
## [6,]  0.0  0.0 0.00  1.0 0.00  0.0  0.0
## [7,]  0.0  0.0 0.00  0.0 0.00  0.0  1.0

Let’s do the same thing: if we start the frog at different locations, compute the probability of being at any of the lily pads after 64 steps.

## 2^6-step probabilities, starting at 1
lambda = c(1, rep(0, 6))
lambda %*% pow_P(P_big, 6)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  0.2  0.4  0.4    0    0    0    0
## 2^6-step probabilities, starting at 7
lambda = c(rep(0, 6), 1)
lambda %*% pow_P(P_big, 6)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    0    0    0    0    0    0    1
## 2^6-step probabilities, starting at 6
lambda = c(rep(0, 5), 1, 0)
lambda %*% pow_P(P_big, 6)
##           [,1]      [,2]      [,3]         [,4]         [,5]         [,6]
## [1,] 0.1333333 0.2666667 0.2666667 1.905188e-09 6.499812e-10 4.435003e-10
##           [,7]
## [1,] 0.3333333
round(pow_P(P_big, 6), digits = 2)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.20 0.40 0.40    0    0    0 0.00
## [2,] 0.20 0.40 0.40    0    0    0 0.00
## [3,] 0.20 0.40 0.40    0    0    0 0.00
## [4,] 0.13 0.27 0.27    0    0    0 0.33
## [5,] 0.07 0.13 0.13    0    0    0 0.67
## [6,] 0.13 0.27 0.27    0    0    0 0.33
## [7,] 0.00 0.00 0.00    0    0    0 1.00

Our next goal is to find the conditions under which this convergence occurs.

We’ll need two concepts: the class structure of the chain (leading to irreducibility) and invariant distributions.

Class structure

We say that state \(i\) leads to \(j\), and write \(i \to j\), if \[ P_i(X_n = j \text{ for some } n \ge 0) > 0 \]

We say that \(i\) communicates with \(j\) and write \(i \leftrightarrow j\) if \(i \to j\) and \(j \to i\).

Communicating classes

From what we saw before, we can see that \(\to\) is transitive: if \(i \to j\) and \(j \to k\), then \(i \to k\). By its definition, \(\leftrightarrow\) is then symmetric and transitive.

If we add that \(i \leftrightarrow i\), \(\leftrightarrow\) satisfies the conditions for an equivalence relation.

We use \(\leftrightarrow\) to partition the state space \(I\) into communicating classes.
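
A small sketch of how to compute this in R (the reach() helper and the comm matrix are our own, not from the notes): for a chain on \(m\) states, \(i \to j\) exactly when entry \((i, j)\) of \((I + P)^{m-1}\) is positive, so the communicating classes can be read off from the reachability matrix. For the seven-pad chain:

## reachability matrix: entry (i,j) is TRUE iff i leads to j,
## since for m states, i leads to j exactly when entry (i,j)
## of (I + P)^(m-1) is positive
reach = function(P) {
    m = nrow(P)
    A = diag(m) + P
    B = diag(m)
    for(i in seq_len(m - 1)) {
        B = B %*% A
    }
    return(B > 0)
}
## i communicates with j iff i leads to j and j leads to i
comm = reach(P_big) & t(reach(P_big))
## the communicating class of each state
apply(comm, 1, function(row) paste(which(row), collapse = " "))
## classes: {1,2,3}, {4,5,6}, and {7}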

Irreducibility

We will want Markov chains that are irreducible.

A chain with transition matrix \(P\) is irreducible if the state space \(I\) forms a single communicating class.

Is the little frog Markov chain irreducible? The three-pad chain is: from any pad the frog can eventually reach any other. The seven-pad chain is not: pad 7 is absorbing, and the class \(\{1, 2, 3\}\) can never be left once entered.
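
Using the reach() helper sketched above, irreducibility amounts to every entry of the reachability matrix being TRUE:

all(reach(P))      ## TRUE: the three-pad chain is irreducible
all(reach(P_big))  ## FALSE: e.g., pad 7 never leads back to pad 1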

Invariant distributions

Definition: A distribution \(\pi\) on \(I\) (so \(\pi_i \ge 0\) and \(\sum_i \pi_i = 1\)) is invariant for \(P\) if \[ \pi P = \pi, \quad \text{that is,} \quad \pi_j = \sum_{i \in I} \pi_i p_{ij} \text{ for all } j \]

Interpretation: If \(X_0 \sim \pi\), then \(X_n \sim \pi\) for every \(n\): once the chain reaches the distribution \(\pi\), it stays there.
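
We can check directly in R that the limit \((0.2, 0.4, 0.4)\) from our earlier computations is invariant for the three-pad chain, and sketch one standard way of finding \(\pi\) (our own code, not from the notes): \(\pi\) is a left eigenvector of \(P\) with eigenvalue 1, rescaled to sum to 1.

pi = c(0.2, 0.4, 0.4)
pi %*% P   ## returns (0.2, 0.4, 0.4): pi P = pi
## pi as a left eigenvector of P with eigenvalue 1; eigen() sorts
## eigenvalues by decreasing modulus, so eigenvalue 1 comes first here
v = Re(eigen(t(P))$vectors[, 1])
v / sum(v)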

Convergence of chains to equilibrium

(Theorem 9.8 in the linked notes)

Let \(P\) be irreducible, positive recurrent, and aperiodic (we haven’t defined positive recurrent or aperiodic here, but you can look in the notes), with invariant distribution \(\pi\).

Then for any initial distribution, \(P(X_n = j) \to \pi_j\) as \(n \to \infty\).

Interpretation: No matter where we start, the chain converges to the same unique distribution. This is exactly what we saw numerically for the three-pad chain: every row of \(P^{64}\) was \((0.2, 0.4, 0.4)\).

Ergodic theorem

Let \((X_n)_{n \ge 0}\) be a Markov chain with transition matrix \(P\).

If \(P\) is irreducible and positive recurrent, then for any bounded function \(f\) we have \[ P\left(\frac{1}{n} \sum_{k=1}^n f(X_k) \to \bar f \text{ as } n \to \infty\right) = 1 \] where \[ \bar f = \sum_{i \in I} \pi_i f(i) \] and \(\pi\) is the unique invariant distribution.

Why is this useful? It means that a long-run average of \(f\) along a single run of the chain converges to the expectation of \(f\) under \(\pi\). If we can construct a chain whose invariant distribution is a distribution we care about, simulating the chain gives us Monte Carlo estimates of expectations under it.
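
Here is a small simulation sketch of the ergodic theorem for the frog (simulate_chain is our own helper): taking \(f\) to be the indicator of each pad, the long-run fraction of time spent on each pad should approach \(\pi = (0.2, 0.4, 0.4)\).

## simulate n steps of a chain with transition matrix P, started at x0
simulate_chain = function(P, x0, n) {
    x = numeric(n)
    x[1] = x0
    for(k in 2:n) {
        x[k] = sample(nrow(P), size = 1, prob = P[x[k - 1], ])
    }
    return(x)
}
set.seed(1)
x = simulate_chain(P, x0 = 1, n = 1e5)
table(x) / length(x)   ## fraction of time on each pad, close to (0.2, 0.4, 0.4)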

Looking forward

Now we know: what a Markov chain is, how to compute the distribution of \(X_n\) as \(\lambda P^n\), and conditions under which a chain converges to a unique invariant distribution no matter where it starts.

Next time: how to construct Markov chains whose invariant distribution is a target distribution we want to sample from.

By the way: