### Transcription of Introduction to Markov Chain Monte Carlo

#### 1. Introduction to Markov Chain Monte Carlo

Charles J. Geyer

**History**

Despite a few notable uses of simulation of random processes in the pre-computer era (Hammersley and Handscomb, 1964, Section ; Stigler, 2002, Chapter 7), practical widespread use of simulation had to await the invention of computers. Almost as soon as computers were invented, they were used for simulation (Hammersley and Handscomb, 1964, Section ). The name "Monte Carlo" started as cuteness: gambling was then (around 1950) illegal in most places, and the casino at Monte Carlo was the most famous in the world. But the name soon became a colorless technical term for simulation of random processes. Markov chain Monte Carlo (MCMC) was invented soon after ordinary Monte Carlo at Los Alamos, one of the few places where computers were available at the time. Metropolis et al. (1953) simulated a liquid in equilibrium with its gas phase.

The obvious way to find out about the thermodynamic equilibrium is to simulate the dynamics of the system and let it run until it reaches equilibrium. The tour de force was their realization that they did not need to simulate the exact dynamics; they only needed to simulate some Markov chain having the same equilibrium distribution. Simulations following the scheme of Metropolis et al. (1953) are said to use the *Metropolis algorithm*. As computers became more widely available, the Metropolis algorithm was widely used by chemists and physicists, but it did not become widely known among statisticians until after 1990. Hastings (1970) generalized the Metropolis algorithm, and simulations following his scheme are said to use the *Metropolis-Hastings algorithm*. A special case of the Metropolis-Hastings algorithm was introduced by Geman and Geman (1984), apparently without knowledge of earlier work.

Simulations following their scheme are said to use the *Gibbs sampler*. Much of Geman and Geman (1984) discusses optimization to find the posterior mode rather than simulation, and it took some time for it to be understood in the spatial statistics community that the Gibbs sampler simulated the posterior distribution, thus enabling full Bayesian inference of all kinds. A methodology that was later seen to be very similar to the Gibbs sampler was introduced by Tanner and Wong (1987), again apparently without knowledge of earlier work. To this day, some refer to the Gibbs sampler as *data augmentation*, following these authors. Gelfand and Smith (1990) made the wider Bayesian community aware of the Gibbs sampler, which up to that time had been known only in the spatial statistics community. Then it took off; as of this writing, a search for Gelfand and Smith (1990) on Google Scholar yields 4003 links to other works.

It was rapidly realized that most Bayesian inference could be done by MCMC, whereas very little could be done without MCMC. (The fifth author of Metropolis et al., 1953, was Edward Teller, the father of the hydrogen bomb.) It took a while for researchers to properly understand the theory of MCMC (Geyer, 1992; Tierney, 1994) and that all of the aforementioned work was a special case of the notion of MCMC. Green (1995) generalized the Metropolis-Hastings algorithm as much as it can be generalized. Although this terminology is not widely used, we say that simulations following his scheme use the *Metropolis-Hastings-Green algorithm*. MCMC is not used only for Bayesian inference. Likelihood inference in cases where the likelihood cannot be calculated explicitly due to missing data or complex dependence can also use MCMC (Geyer, 1994, 1999; Geyer and Thompson, 1992, 1995, and references cited therein).

**Markov Chains**

A sequence X1, X2, ... of random elements of some set is a *Markov chain* if the conditional distribution of Xn+1 given X1, ..., Xn depends on Xn only. The set in which the Xi take values is called the *state space* of the Markov chain. A Markov chain has *stationary transition probabilities* if the conditional distribution of Xn+1 given Xn does not depend on n. This is the main kind of Markov chain of interest in MCMC. Some kinds of adaptive MCMC (Chapter 4, this volume) have nonstationary transition probabilities. In this chapter we always assume stationary transition probabilities. The joint distribution of a Markov chain is determined by

- the marginal distribution of X1, called the *initial distribution*, and
- the conditional distribution of Xn+1 given Xn, called the *transition probability distribution* (because of the assumption of stationary transition probabilities, this does not depend on n).

People introduced to Markov chains through a typical course on stochastic processes have usually only seen examples where the state space is finite or countable. If the state space is finite, written {x1, ..., xn}, then the initial distribution can be associated with a vector λ = (λ1, ..., λn) defined by

Pr(X1 = xi) = λi,    i = 1, ..., n,

and the transition probabilities can be associated with a matrix P having elements pij defined by

Pr(Xn+1 = xj | Xn = xi) = pij,    i = 1, ..., n and j = 1, ..., n.

When the state space is countably infinite, we can think of an infinite vector and matrix. But most Markov chains of interest in MCMC have uncountable state space, and then we cannot think of the initial distribution as a vector or the transition probability distribution as a matrix. We must think of them as an unconditional probability distribution and a conditional probability distribution.
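In the finite case this vector-matrix formalism can be computed directly. The sketch below (pure Python; the two-state chain and its numbers are our illustration, not from the chapter) stores λ as a list and P as a nested list, and computes the marginal distribution of X2 as the vector-matrix product λP.

```python
# Finite-state Markov chain: initial distribution as a vector,
# transition probabilities as a matrix (illustrative two-state example).

def vec_mat(v, P):
    """Row vector times matrix: returns the list v @ P."""
    return [sum(v[i] * P[i][j] for i in range(len(v)))
            for j in range(len(P[0]))]

# State space {x1, x2}; lam[i-1] = Pr(X1 = xi).
lam = [0.9, 0.1]

# P[i-1][j-1] = Pr(X_{n+1} = xj | X_n = xi); each row sums to 1.
P = [[0.7, 0.3],
     [0.4, 0.6]]

# The marginal distribution of X2 is the vector-matrix product lam P.
marg2 = vec_mat(lam, P)
print(marg2)  # approximately [0.67, 0.33]
```

Iterating `vec_mat` gives the marginal of any Xn, which is exactly the computation that becomes unavailable when the state space is uncountable.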

**Computer Programs and Markov Chains**

Suppose you have a computer program

```
Initialize x
repeat {
    Generate pseudorandom change to x
    Output x
}
```

If x is the entire state of the computer program exclusive of random number generator seeds (which we ignore, pretending pseudorandom is random), this is MCMC. It is important that x be the entire state of the program; otherwise the resulting stochastic process need not be Markov. There is not much structure here. Most simulations can be fit into this format. Thus most simulations can be thought of as MCMC if the entire state of the computer program is considered the state of the Markov chain. Hence, MCMC is a very general simulation methodology.

**Stationarity**

A sequence X1, X2, ... of random elements of some set is called a *stochastic process* (Markov chains are a special case).
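The generic program pattern in the "Computer Programs and Markov Chains" discussion can be made concrete. The sketch below is our own illustration, not code from the chapter: a random-walk Metropolis sampler whose entire state is the single variable x, with a standard normal equilibrium distribution (the function name and parameters are ours).

```python
import math
import random

def metropolis_normal(n_steps, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a standard normal target.

    The variable x is the entire state of the program (apart from the
    RNG state, which we ignore as the text suggests), so the sequence
    of outputs is a Markov chain with N(0, 1) as its equilibrium
    distribution.
    """
    rng = random.Random(seed)
    x = 0.0                            # Initialize x
    out = []
    for _ in range(n_steps):           # repeat {
        y = x + rng.gauss(0.0, step)   #   pseudorandom change to x
        # Accept with probability min(1, pi(y)/pi(x)), pi = N(0,1) density.
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2.0)):
            x = y
        out.append(x)                  #   Output x }
    return out

samples = metropolis_normal(10000)
```

Note why "entire state" matters: if the proposals also depended on, say, a running accumulator that was not part of x, the output sequence alone would no longer be Markov.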

A stochastic process is *stationary* if, for every positive integer k, the distribution of the k-tuple (Xn+1, ..., Xn+k) does not depend on n. A Markov chain is stationary if it is a stationary stochastic process. In a Markov chain, the conditional distribution of (Xn+2, ..., Xn+k) given Xn+1 does not depend on n. It follows that a Markov chain is stationary if and only if the marginal distribution of Xn does not depend on n. An initial distribution is said to be *stationary* or *invariant* or *equilibrium* for some transition probability distribution if the Markov chain specified by this initial distribution and transition probability distribution is stationary. We also indicate this by saying that the transition probability distribution *preserves* the initial distribution. Stationarity implies stationary transition probabilities, but not vice versa. Consider an initial distribution concentrated at one point.

The Markov chain can be stationary if and only if all iterates are concentrated at the same point, that is, X1 = X2 = ..., so the chain goes nowhere and does nothing. Conversely, any transition probability distribution can be combined with any initial distribution, including those concentrated at one point. Such a chain is usually not stationary (even though the transition probabilities are stationary). Having an equilibrium distribution is an important property of a Markov chain transition probability. As we shall see below, MCMC samples the equilibrium distribution, whether the chain is stationary or not. Not all Markov chains have equilibrium distributions, but all Markov chains used in MCMC do. The Metropolis-Hastings-Green (MHG) algorithm (described below) constructs transition probability mechanisms that preserve a specified equilibrium distribution.
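For a finite state space both claims can be checked numerically: an equilibrium distribution π satisfies πP = π, and iterating the transition matrix from an initial distribution concentrated at one point (a nonstationary chain) drives the marginals toward π anyway. A minimal sketch with an illustrative two-state matrix (the numbers are ours, not the chapter's):

```python
def vec_mat(v, P):
    """Row vector times matrix: returns the list v @ P."""
    return [sum(v[i] * P[i][j] for i in range(len(v)))
            for j in range(len(P[0]))]

P = [[0.7, 0.3],
     [0.4, 0.6]]
pi = [4 / 7, 3 / 7]   # equilibrium distribution of P

# pi is preserved by P: starting the chain at pi, every X_n has marginal pi,
# so the chain is stationary.
assert all(abs(a - b) < 1e-12 for a, b in zip(vec_mat(pi, P), pi))

# Starting instead at a point mass, the chain is not stationary, but the
# marginal of X_{n+1}, namely mu P^n, still converges to the equilibrium pi.
mu = [1.0, 0.0]
for _ in range(50):
    mu = vec_mat(mu, P)
print(mu)  # numerically indistinguishable from [4/7, 3/7]
```

This is the finite-state picture behind the statement that MCMC samples the equilibrium distribution whether or not the chain is started there.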

**Reversibility**

A transition probability distribution is *reversible* with respect to an initial distribution if, for the Markov chain X1, X2, ... they specify, the distribution of pairs (Xi, Xi+1) is exchangeable. A Markov chain is reversible if its transition probability is reversible with respect to its initial distribution. Reversibility implies stationarity, but not vice versa. A reversible Markov chain has the same laws running forward or backward in time; that is, for any i and k the distributions of (Xi+1, ..., Xi+k) and (Xi+k, ..., Xi+1) are the same. Hence the name. Reversibility plays two roles in Markov chain theory. All known methods for constructing transition probability mechanisms that preserve a specified equilibrium distribution in non-toy problems are special cases of the MHG algorithm, and all of the elementary updates constructed by the MHG algorithm are reversible (which accounts for its other name, the reversible jump algorithm).
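For a finite chain, reversibility with respect to π is the detailed balance condition π_i p_ij = π_j p_ji for all i and j, which makes the joint distribution of the pair (Xn, Xn+1) exchangeable. The sketch below (our illustration; the matrices are made up) checks detailed balance for a two-state chain, and exhibits a three-state "rotation" chain that is stationary under the uniform distribution but not reversible, showing that stationarity does not imply reversibility.

```python
def detailed_balance(pi, P, tol=1e-12):
    """Check pi[i] * P[i][j] == pi[j] * P[j][i] for all i, j."""
    n = len(pi)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

# Two-state chain: reversible with respect to its equilibrium distribution.
P2 = [[0.7, 0.3],
      [0.4, 0.6]]
pi2 = [4 / 7, 3 / 7]
print(detailed_balance(pi2, P2))   # True

# Deterministic rotation 1 -> 2 -> 3 -> 1: the uniform distribution is
# stationary (P3 is doubly stochastic), but the chain only moves one way,
# so running it backward looks different and it is not reversible.
P3 = [[0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [1.0, 0.0, 0.0]]
pi3 = [1 / 3, 1 / 3, 1 / 3]
print(detailed_balance(pi3, P3))   # False
```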