Markov Chain Monte Carlo
Metropolis acceptance
Markov Chain Monte Carlo (MCMC) does not need an introduction honestly. It is used to generate samples from a “target” probability distribution \(f(x)\), which is not really easy to sample. MCMC does so by constructing a Markov Chain, that is generating a new sample is only conditioned on the previous sample. More formally, this can be written as
\[\mathrm{P}(x^{i+1} | x^i, x^{i-1},..., x^{1}) = \mathrm{P}(x^{i+1} | x^i) = g(x^{i+1} | x^i)\]where \(g( \cdot | x^i)\) is called the “proposal” distribution, and it proposes new samples from the prior distribution. The right choice of the proposal affects the convergence of MCMC. We start with the case when this proposal is symmetric, that is, there is an as chance of moving of \(x^i \rightarrow x^j\) as of \(x^j \rightarrow x^i\), that is,
\[g(x^{i+1} | x^i) = g(x^{i} | x^{i+1})\]But for any distributions that follow the symmetry of \(g( \cdot | x^i)\), the MCMC defines a Metropolis ratio, which determines whether a proposed sample will be accepted or not. The candidate proposal \(x^*\) is accepted with the probability
\[\alpha(x^i, x^*) = \min \left(1, \frac{f(x^*)}{f(x^i)} \right)\]Intuitively, this means that the candidate is always accepted if it is moving to a higher probability region, and is only sometimes accepted when going to the lower probability areas. The ratio inside \(\min\) is called the Metropolis ratio.
Detailed balance and stationary distribution
Our objective from MCMC is to sample the posterior distribution, i.e., the samples generated at i-th and (i+1)-th step belong to the same distribution. Such a distribution which after a transition step comes back to itself is called a stationary distribution of that Markov chain. We want this stationary distribution to be the distribution we sample from. Any ergodic Markov chain that has the detailed balance sufficiently has a target stationary distribution. The detailed balance is defined as:
\(f(x) T(x,y) = f(y) T(y,x) \implies f(x^i = x) T(x^i = x, x^{i+ 1} = y) = f(x^i = y) T(x^i= y, x^{i+1} = y)\) where \(T(x,y)\) is the transition matrix from state \(x\) to \(y\), which for our case is \(g(y|x)\) . In order to show that such a chain has a stationary distribution, we have
\[\begin{aligned} f_{i+1}(y) &= \sum_x f_{i}(x) T(x, y) \quad \text{OR} \sum_x f_{i}(x^i = x) T(x^i = x, x^{i+ 1} = y) \\ & \text{(Expressing as marginal distribution)} \\ &= \sum_x f_{i}(y) T(y, x) \quad \text{OR} \sum_x f_{i}(x^i = y) T(x^i = y, x^{i+ 1} = x) \\ & \text{(By detailed balance)} \\ &= f_{i}(y) \sum_x T(y, x) \quad \text{OR} f_{i}(x^i = y) \sum_x T(x^i = y, x^{i+ 1} = x) \\ &= f_{i}(y) \\ & \text{(The summand is the sum of probabilities from any obtained value in state} x^{i} \text{ to } x^{i+1} = j, \text{which should be 1)} \\ \end{aligned}\]Intuitively, detailed balance states that moving from state \(x\) to state \(y\) is as likely as the other way round, this also implies reversibility. Detailed balance enforces that the under condition of ergodicity, if the chain samples from a distribution at i-th step, it will do so in the next step.
Now, in Metropolis algorithm, the transition probability is given by
\[\begin{aligned} T(x^i, x^*) &= g(x^* | x^i) \alpha(x^i, x^*) \\ &= g(x^* | x^i) \min \left(1, \frac{f(x^*)}{f(x^i)} \right) \\ \therefore & \text{(From detailed balance)} \\ f(x^i) T(x^i, x^*) &= f(x^i) g(x^* | x^i) \min \left(1, \frac{f(x^*)}{f(x^i)} \right) \\ &= g(x^* | x^i) \min \left(f(x^i), f(x^*) \right) \\ &= g(x^i | x^*) \min \left(f(x^i), f(x^*) \right) \\ & (g(\cdot | \cdot) \text{is symmeteric for Metropolis algorithm})\\ &= f(x^*) g(x^i | x^*) \min \left(1, \frac{f(x^i)}{f(x^*)} \right) \\ &= f(x^*) T(x^*, x^i) \end{aligned}\]Therefore, the Metropolis algorithm always has the stationary distribution defined by the “target” distribution. Though starting from a random prior point, it takes a few iterations to get to the space of exploring the posterior. This period is called burn-in and the samples generated during the time are generally ignored.
Before we go further, the following are special cases of Metropolis algorithm.
Random Walk Metropolis
Random Walk Metropolis is a special case of the above when we choose \(g(x^i|x^j) = h(x^i -x^j)\), an example of \(h(\cdot)\) would be the normal distribution, eg.
\(g(x^{j} | x^i) = h(x^j- x^i) = \mathrm{N}(x^j- x^i, \sigma^2) \quad \text{or} \quad x^j \sim \mathrm{N}(x^i, \sigma^2)\) We, therefore, propose a new sample that is in the vicinity of the prebious one by taking small steps, defined by \(\sigma\), in random directions. This works for simple problems, but even the step-size needs to be adequate. If it’s too small, we’re not exploring the prior space, if it’s too large, we’re wasting a lot of time on samples that may not be accepted.
Independent chain
Independent chain is another sub-class which allows the proposal to be independent of the previous state.
\[g(x^{j} | x^i) = h(x^j)\]This might work only for very specific cases.
Metropolis-Hastings
While Metropolis requires the proposal distribution \(g(\cdot | \cdot)\) be symmetric, this is not necessarily required for MCMC. In case of assymetric proposal, we modify the acceptance as:
\[\alpha(x^i, x^*) = \min \left(1, \frac{f(x^*) g(x^i | x^*)}{f(x^i) g(x^* | x^i)} \right)\]To show that MH also has detailed balance, we have
\[\begin{aligned} f(x^i) T(x^i, x^*) &= f(x^i) g(x^* | x^i) \min \left(1, \frac{f(x^*) g(x^i | x^*)}{f(x^i) g(x^* | x^i)} \right) \\ &= \min \left(f(x^i) g(x^* | x^i), f(x^*) g(x^i | x^*) \right) \\ &= f(x^*) g(x^i | x^*) \min \left(1, \frac{f(x^i) g(x^* | x^i)}{f(x^*) g(x^i | x^*)} \right) \\ &= f(x^*) T(x^*, x^i) \end{aligned}\]