
Stochastic Alternating Projections

Persi Diaconis
Departments of Mathematics and Statistics, Stanford University and University of Nice - Sophia Antipolis

Kshitij Khare
Department of Statistics, Stanford University

Laurent Saloff-Coste
Department of Mathematics, Cornell University

March 16, 2007

Abstract

We show that a standard Monte Carlo algorithm - The Gibbs sampler - can be seen as alternating projections into closed subspaces of a Hilbert space. This allows classical convergence theorems of von Neumann and others to be harnessed for proving convergence.

In the other direction, it shows that classical convergence rates involving the angle between subspaces can be substantially refined in several cases.

1 Introduction

This paper gives a sharp connection between two classical areas of applied mathematics.

• Von Neumann’s Alternating Projection Algorithm. Let P₁, P₂ be the orthogonal projections onto closed subspaces M₁, M₂ of a Hilbert space H. Let P_I be the orthogonal projection onto the intersection M₁ ∩ M₂. If T = P₂P₁, then T^k → P_I as k → ∞. That is,

||T^k(h) − P_I(h)|| → 0 for each h ∈ H.

• The Gibbs Sampler. Let f(x, y) be a positive probability density on the measurable space (X × Y, µ × ν). Run a Markov chain (X₀, Y₀), (X₁, Y₁), ... on X × Y starting from (X₀, Y₀) as follows:

– From (x, y), choose y′ from the conditional density f(y′ | x)

– From y′, choose x′ from the conditional density f(x′ | y′)

This Markov chain has stationary density f(x, y) and

$$\frac{1}{N} \sum_{i=1}^{N} \delta_{(X_i, Y_i)} \Rightarrow f, \quad \text{almost surely, weak star.}$$

A simple example is given in detail in Section 5.3 below. Both algorithms alternate: the von Neumann algorithm projects first into M₁, then into M₂, then into M₁ and so on. We will show that the Gibbs sampler amounts to computing projections in the Hilbert space L²(f), with M₁ = L²(σ(Y), f) and M₂ = L²(σ(X), f), where X(x, y) = x and Y(x, y) = y are the coordinate mappings. Both algorithms were developed for problems where each step is easy to carry out, but projecting into the intersection is intractable. There is one difference: the mapping T^k converges, while the Markov chain (X_i, Y_i), 0 ≤ i < ∞, bobbles around without converging.
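The two-step update described above can be sketched on a small finite state space. This is only an illustration: the 2 × 3 table `F`, the helper names `draw` and `gibbs_empirical`, the seed and the run length are all assumptions made for the example, not taken from the paper. The chain itself keeps moving, but the empirical frequencies of the visited states settle near f.

```python
import random

# Hypothetical joint probability table f[x][y] (all entries positive, total 1).
F = [[0.10, 0.20, 0.10],
     [0.25, 0.05, 0.30]]


def draw(weights, rng):
    """Sample an index with probability proportional to the given weights."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if u < acc:
            return i
    return len(weights) - 1


def gibbs_empirical(f, n_steps, seed=0):
    """Run the two-component Gibbs sampler; return empirical state frequencies."""
    rng = random.Random(seed)
    nx, ny = len(f), len(f[0])
    counts = [[0] * ny for _ in range(nx)]
    x = 0
    for _ in range(n_steps):
        y = draw(f[x], rng)                          # y' from f(y' | x)
        x = draw([f[i][y] for i in range(nx)], rng)  # x' from f(x' | y')
        counts[x][y] += 1
    return [[c / n_steps for c in row] for row in counts]


EMP = gibbs_empirical(F, 50_000)
MAX_ERR = max(abs(EMP[i][j] - F[i][j]) for i in range(2) for j in range(3))
```

Only unnormalized rows and columns of the table are passed to `draw`, which mirrors the fact that each step needs the conditional densities only up to a constant.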

Section 2 reviews the alternating projection method, its extension to many projections, rates of convergence and a stochastic version: the “alternierende Verfahren” of Burkholder, Chow, Rota and Delyon-Delyon. We present a physical demonstration using a string and paper clips shown to us by Don Burkholder. Section 3 reviews the Gibbs sampler (also known as Glauber dynamics or the heat-bath algorithm) and several variants. Section 4 gives our main result:

In L²(f), let K be the Markov operator corresponding to the Gibbs sampler. Let P₁ be the orthogonal projection onto the subspace M₁ = L²(σ(Y), f), P₂ be the orthogonal projection onto the subspace M₂ = L²(σ(X), f). Then,

Kⁿ = (P₂P₁)ⁿ for all n ≥ 0.

In fact, the result is true for a sequence of k alternating projections corresponding to the Gibbs sampler for a density in k dimensions, for any k ∈ N. The proof is similar to the k = 2 case and is presented in Section 4. For ease of exposition, we stick to the k = 2 case in most examples.

Section 5 draws some consequences of this equality. The rate of convergence in von Neumann’s algorithm is shown to be related to probabilists’ maximal correlation. The Hilbert space notion of strong convergence is shown to be fairly weak in the probabilistic setting where much more demanding topologies are standard fare. This suggests new problems in the Hilbert space setting.

2 Alternating Projection Algorithms

Von Neumann’s alternating projection theorem stated above has many extensions and applications. A splendid textbook account appears in [15, Chapter 1], while applications are surveyed in [16]. These trace the history back to Schwarz [29], who used the method to solve the Dirichlet problem on a region given as a union of regions each having a simple to solve Dirichlet problem (e.g. a union of disks).

One important extension is due to Halperin [19], who shows that the conclusion holds as stated for n subspaces with T = P₁P₂· · · P_n. An elegant proof of this using elementary arguments is in [25].

There is some classical work on the rate of convergence in von Neumann’s theorem. If M₁, M₂ are closed subspaces of the Hilbert Space H, let

c = sup{⟨v₁, v₂⟩ | v_i ∈ M_i ∩ (M₁ ∩ M₂)^⊥, ||v_i|| ≤ 1}. (1)

This is the cosine of the angle between M₁ and M₂. If P_i is the projection into M_i and P_I is the orthogonal projection into M₁ ∩ M₂, Aronszajn [1] proved that

||(P₂P₁)ⁿ(x) − P_I(x)|| ≤ c²ⁿ⁻¹||x||, for all x ∈ H. (2)

This result is best possible, and some extensions to more subspaces are available. See [15, page 220] and Section 5.5 below.
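A small numerical sketch may make the bound (2) concrete. The setting is an assumption chosen for illustration: H = R³, M₁ spanned by `A` and `E`, M₂ spanned by `B` and `E`, so that the intersection is the line through `E` and c is the absolute cosine of the angle between `A` and `B`. The helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def project(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for u in basis:
        coef = dot(x, u)
        out = [o + coef * ui for o, ui in zip(out, u)]
    return out


T = math.pi / 5                           # assumed angle between the two planes
E = [0.0, 0.0, 1.0]                       # spans the intersection of M1 and M2
A = [1.0, 0.0, 0.0]                       # M1 = span{A, E}
B = [math.cos(T), math.sin(T), 0.0]       # M2 = span{B, E}
C = abs(dot(A, B))                        # cosine of the angle, as in (1)


def gaps(x, n_max):
    """Rows (n, ||(P2 P1)^n x - P_I x||, c^(2n-1) ||x||) for n = 1..n_max."""
    target = project(x, [E])
    rows, cur = [], list(x)
    for n in range(1, n_max + 1):
        cur = project(project(cur, [A, E]), [B, E])   # apply P1, then P2
        err = norm([c - t for c, t in zip(cur, target)])
        rows.append((n, err, C ** (2 * n - 1) * norm(x)))
    return rows


ROWS = gaps([1.0, 2.0, -1.0], 8)
```

Starting from x = `A` the error equals the bound at every step in this configuration, which is consistent with the remark that (2) is best possible.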

Not all projections into closed subspaces can be realized by computing conditional expectations. For example, consider H = L²((−π, π), dx/2π). The subspace M₁ of functions which vanish (almost surely) on a subset of (−π, π) is closed, but not the range of conditional expectations given a sub-σ-algebra (the constant functions are not in M₁). The subspace M₂ of functions with vanishing negative Fourier coefficients is a closed subspace containing the constants, but not the range of conditional expectation given a sub-σ-algebra (the projection of a positive function in M₂ need not be positive). Alternating projections into these subspaces are a key ingredient of the classical work of Landau, Logan, Pollak and Slepian on band limited functions. See [21] for a readable overview. An elegant necessary and sufficient condition for a subspace of L²(P) to be the range of a conditional expectation operator (for some sub-σ-algebra) is given in Neveu [24, Exercise IV.3.1, Page 123]. The subspace V must be closed, contain the constants and if f is in V, then max(f, 0) must be in V.

Example For a finite space X = {1, 2, ..., n}, let θ_i > 0, Σ_{i=1}^n θ_i = 1. A σ-field A on X is specified by a partition {A₁, A₂, ..., A_k}, A_i ≠ ∅, A_i ∩ A_j = ∅ for i ≠ j, ∪_{i=1}^k A_i = X. Let P_A be the projection from L²(X, θ) to L²(X, A, θ). The matrix of P_A has (i, j) entry ¯θ_j δ_A(i, j), 1 ≤ i, j ≤ n, where δ_A(i, j) is one or zero according to whether or not i and j are in the same block A_l, and ¯θ_j = θ_j / Σ_{i∈A_l} θ_i for j in the block A_l. It follows that the number of subspaces of L²(X, θ) which are the range of a conditional expectation operator is the Bell number B(n) (that is, the number of set partitions of {1, 2, ..., n}). There is a continuum of other subspaces.
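The matrix in this example can be built and checked directly. The sketch below is illustrative only: the weights `THETA`, the partition `BLOCKS` (written with 0-based indices) and the helper names are assumptions. Exact rational arithmetic is used so that idempotence and self-adjointness in L²(θ) can be tested with equality.

```python
from fractions import Fraction as Fr


def cond_exp_matrix(theta, blocks):
    """Matrix of E(. | partition): entry (i, j) is theta_j / theta(block) when
    i and j lie in the same block, and 0 otherwise."""
    n = len(theta)
    mat = [[Fr(0)] * n for _ in range(n)]
    for block in blocks:
        mass = sum(theta[j] for j in block)
        for i in block:
            for j in block:
                mat[i][j] = theta[j] / mass
    return mat


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bell(n):
    """Bell number B(n), n >= 1, computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new = [row[-1]]
        for v in row:
            new.append(new[-1] + v)
        row = new
    return row[-1]


THETA = [Fr(1, 10), Fr(2, 10), Fr(3, 10), Fr(4, 10)]   # hypothetical weights
BLOCKS = [[0, 1], [2, 3]]                              # hypothetical partition
P = cond_exp_matrix(THETA, BLOCKS)
```

For n = 4 points, `bell(4)` counts the 15 partitions, hence the 15 subspaces that arise as ranges of conditional expectations.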

Iterated conditional expectation operators have been studied by Burkholder and Chow (see [8]), Burkholder (see [9], [10], [11]) and Rota (see [28]). See [31] and [14] for recent results.

Let (Ω, F , P ) be a probability space with U ∈ L²(P ). Let A₁, A₂ be sub-σ algebras of F . Set

U₀ = U, U_{2i+1} = E(U_{2i} | A₁), U_{2i+2} = E(U_{2i+1} | A₂), 0 ≤ i < ∞.

Burkholder and Chow [8] proved that

U_n → E(U | ¯A₁ ∩ ¯A₂), almost surely. (3)

In (3), ¯A_i is the smallest σ-algebra containing A_i and the P-null sets in F. The following two examples were suggested by David Freedman.

Example Let (Ω, F ) be the Borel unit square with P the uniform probability on the diagonal ∆.

Take A_i = σ(X_i) with X_i(ω) = ω_i for ω = (ω₁, ω₂). Then A₁ ∩ A₂ = {∅, Ω}, and ¯A₁ ∩ ¯A₂ = F. Here U_{2i+1}(ω₁, ω₂) = U(ω₁, ω₁) and U_{2i+2}(ω₁, ω₂) = U(ω₂, ω₂). The iterations do not converge.

Example Let Ω be the Borel unit square. Let P be the uniform distribution supported on the upper left and lower right quarter squares. Now P has a density f (x, y) with respect to Lebesgue measure on Ω, but ¯A₁ ∩ ¯A₂ contains the 4 quarter squares. Thus, the iterations do not converge to constant functions.

A crucial step in connecting the Burkholder-Chow result (3) and other results below to the Gibbs sampler is understanding when ¯A₁ ∩ ¯A₂ = A₁ ∩ A₂. After all, if Ω = Ω₁ × Ω₂ is a product space and A_i is the σ-algebra generated by the projection on the i-th coordinate, A₁ ∩ A₂ is the trivial σ-algebra and (3) then says U_n converges to E(U) if ¯A₁ ∩ ¯A₂ = A₁ ∩ A₂. In response to questions raised by an early version of the present paper, Patrizia Berti, Luca Pratelli and Pietro Rigo [6] developed an elegant necessary and sufficient condition. Here is one version of their result.

Theorem 2.1 (Berti, Pratelli, Rigo) Let (Ω, F, P) be a probability space and A₁, A₂ ⊆ F sub σ-fields. Let N = {F ∈ F : P(F) ∈ {0, 1}}, ¯G = σ(G ∪ N) for any subclass G ⊆ F. In order that ¯A₁ ∩ ¯A₂ = A₁ ∩ A₂ it is necessary and sufficient that

A₁ ∈ A₁, A₂ ∈ A₂ and P(A₁ ∩ A₂) = P(A₁^c ∩ A₂^c) = 0 implies P(A₁ Δ B) = 0 or P(A₂ Δ B) = 0 for some B ∈ A₁ ∩ A₂.

They give many useful corollaries, some of which are down to earth and useful for the Gibbs sampling application. See Sections 3, 5 below.

Burkholder [10] and Ornstein [26] show that convergence in (3) can fail if only X ∈ L¹(P ).

These authors show that a necessary and sufficient condition is that

∫ |X(ω)| log(1 + |X(ω)|) P(dω) < ∞.

The extension to more than two σ-algebras was open for more than forty years. Consider the case of three σ-algebras A₁, A₂, A₃. If the iterations are taken in order

A₁, A₂, A₃, A₂, A₁, A₂, A₃, A₂, A₁, ...

then the original arguments go through to show almost sure convergence to E(U | ¯A₁ ∩ ¯A₂ ∩ ¯A₃). The straightforward extension of Halperin’s Theorem, where the iterations are taken in order

A₁, A₂, A₃, A₁, A₂, A₃, A₁, A₂, A₃, ...

resisted solution. Convergence was finally proved by Delyon and Delyon [14]. Their argument uses a fascinating extension of spectral theory to non-normal operators. It cries out for a more probabilistic proof.

Von Neumann’s theorem has been widely developed and applied. For alternating minimization procedures, see [3]. For convex optimization using random alternating projections, see [12]. Applications to best approximation are in [4]. Proofs of extensions of the Ergodic theorem are in [7]. A useful survey geared towards projections into the intersections of convex sets is in [5].

We cannot leave this part of the subject without mentioning a charming demonstration of the theorems on alternating projections shown to us by Don Burkholder. Take a piece of string about two feet long. Attach two paper clips at two arbitrary positions. Call these the “Left” and “Right” paper clips.

Figure 1: panels (a), (b), (c).

At any stage, proceed as follows: Fold the right end of the string over to touch the left paper clip. Holding this clip (fig. 1(b), top) (and the right end) with the left hand fingers, slide the right clip to the right until it hits the right end of the loop formed (fig. 1(b), bottom).

Unfold the string, fold the left end of the string over to touch the right clip (fig. 1(c), top).

Hold it there with the right hand fingers and slide the left-most clip to the left until it hits the left end of the loop formed (fig. 1(c), bottom).

Unfold the string. These two stages constitute one pass of the algorithm. If this is repeated a few times, the position of the clips will converge to 1/3 and 2/3 of the total length.

To make the connection with the developments of this section, suppose the clips are originally at distance x and x + y on a string of length x + y + z (fig. 2(a)). Folding the right end over and sliding the right clip results in (fig. 2(b)). Folding the left end over and sliding the left clip results in (fig. 2(c)).

Figure 2: segment lengths between the clips. (a) x, y, z; (b) x, (y+z)/2, (y+z)/2; (c) (2x+y+z)/4, (2x+y+z)/4, (y+z)/2.

These transformations correspond to two projections.

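$$P_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix}, \qquad P_2 = \begin{pmatrix} \tfrac12 & \tfrac12 & 0 \\ \tfrac12 & \tfrac12 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \text{with} \qquad P_2 P_1 = \begin{pmatrix} \tfrac12 & \tfrac14 & \tfrac14 \\ \tfrac12 & \tfrac14 & \tfrac14 \\ 0 & \tfrac12 & \tfrac12 \end{pmatrix}.$$

Since it is doubly stochastic with (1/3, 1/3, 1/3)^T as a unique stationary vector,

$$(P_2 P_1)^n \begin{pmatrix} x \\ y \\ z \end{pmatrix} \to \begin{pmatrix} \frac{x+y+z}{3} \\ \frac{x+y+z}{3} \\ \frac{x+y+z}{3} \end{pmatrix}.$$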

If one were to demonstrate this, say by making a prediction for the length of the leftmost clip before the initial placement of the clips, it is desirable to know how many iterations are required for convergence. The eigenvalues and right eigenvectors of P₂P₁ are

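$$1,\ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}; \qquad \tfrac14,\ \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}; \qquad 0,\ \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}.$$

Then

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \frac{x+y+z}{3} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \left( \frac{2}{3}x - \frac{y}{3} - \frac{z}{3} \right) \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix} + (y - x) \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}.$$

Thus

$$(P_2 P_1)^n \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \frac{x+y+z}{3} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \frac{1}{4^n} \left( \frac{2}{3}x - \frac{y}{3} - \frac{z}{3} \right) \begin{pmatrix} 1 \\ 1 \\ -2 \end{pmatrix}.$$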

Given x + y + z = c, the error is largest when x = c, y = z = 0. This corresponds to both clips starting at the right end of the string. The deviation of the leftmost clip from c/3 after n steps is 2c/(3 · 4ⁿ). For c = 2 feet, if we want an error of 1/10 inch, we must take n ≥ 4. For the historical record, we note that Burkholder demonstrated this to us by repeatedly folding a 3 × 5 index card instead of a piece of string.
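The arithmetic of the string demonstration can be replayed exactly. The sketch below uses the two averaging matrices of this section acting on the segment lengths (x, y, z); the function names and the choice of inches as the unit are assumptions for illustration.

```python
from fractions import Fraction as Fr

H = Fr(1, 2)
P1 = [[Fr(1), Fr(0), Fr(0)], [Fr(0), H, H], [Fr(0), H, H]]   # average y and z
P2 = [[H, H, Fr(0)], [H, H, Fr(0)], [Fr(0), Fr(0), Fr(1)]]   # average x and y


def apply(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def one_pass(v):
    """One pass of the folding procedure: apply P1, then P2."""
    return apply(P2, apply(P1, v))


def passes_needed(total, tol):
    """Smallest n for which the first segment is within tol of total/3,
    starting from the worst case (total, 0, 0)."""
    v = [Fr(total), Fr(0), Fr(0)]
    n = 0
    while abs(v[0] - Fr(total) / 3) > tol:
        v = one_pass(v)
        n += 1
    return n


N_PASSES = passes_needed(24, Fr(1, 10))   # 2 feet = 24 inches, tolerance 1/10 inch
```

After one pass from (24, 0, 0) the first segment is 12, a deviation of 4 = 2 · 24/(3 · 4) from 8, and the loop stops at four passes, matching n ≥ 4.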

3 The Gibbs Sampler

The Gibbs sampler, also known as Glauber dynamics or the heat-bath algorithm, is an important tool of scientific computing. It gives a way of sampling from an intractable high dimensional probability density (perhaps known up to a normalizing constant) by a sequence of one-dimensional updates. We will not attempt to review the extensive literature on the Gibbs sampler here. See [13] for a gentle introduction, [22] for a textbook treatment and [20] for the literature on rates of convergence. A more detailed review is in [17, Section 2].

In the present paper we focus on two-component Gibbs samplers. Thus, let f (x, y) be a probability density with respect to the σ-finite measure µ × ν on X × Y. This has marginal densities

m₁(x) = ∫ f(x, y) ν(dy), m₂(y) = ∫ f(x, y) µ(dx).

For simplicity, suppose throughout that 0 < m₁(x), m₂(y) < ∞ for all x, y. Then, the conditional densities

f(x | y) = f(x, y) / m₂(y), f(y | x) = f(x, y) / m₁(x)

are well defined. The Gibbs sampler is a Markov chain which may be described by

• From (x, y) choose y′ from f(y′ | x) and then x′ from f(x′ | y′).

This gives a kernel (w.r.t. µ × ν):

k(x, y ; x′, y′) = f(y′ | x) f(x′ | y′)

Let kⁿ(x, y ; x′, y′) = ∫ kⁿ⁻¹(x, y ; w, z) k(w, z ; x′, y′) µ(dw)ν(dz), with Kⁿ the associated operator on L²(f). The following result is one consequence of von Neumann’s Theorem and the results on iterated conditional expectations, via the developments in Section 4 and Section 5.

Corollary 3.1 Assume that f(x, y) > 0 for all x, y. Then for U ∈ L²(f), with

¯U = ∫ U(x, y) f(x, y) µ(dx)ν(dy),

• ∫ (KⁿU(x, y) − ¯U)² f(x, y) µ(dx)ν(dy) → 0 as n → ∞

• KⁿU(x, y) → ¯U a.e. (x, y) as n → ∞

We note that if the density f(x, y) is not assumed positive everywhere, it appears to be a delicate matter to determine when the limiting projection in (3) of Section 2 is almost surely constant (cf. the Examples in Section 2 above). Useful sufficient conditions which handle most cases that arise in practice are given in the paper by Berti-Pratelli-Rigo [6]. Here is one of their theorems followed by some examples.

Theorem 3.1 (Berti, Pratelli, Rigo) Let f (x, y) be the probability density of a measure P on X × Y with respect to µ × ν. Let X and Y be the coordinate projections and N = {F : P (F ) ∈ {0, 1}}. In order that σ(X) ∩ σ(Y ) = N , it is sufficient that

(U × Y) ∪ (X × V ) ⊃ {f > 0} ⊃ U × V, for some U and V with P ((X, Y ) ∈ U × V ) > 0.

Example If f (x, y) > 0 for all (x, y), the hypothesis holds with U = X , V = Y. In the case that X × Y is the unit square and f is positive on a triangle below the main diagonal or any disc or indeed any convex open non-empty set, the hypothesis evidently holds and then the conclusions of Corollary 3.1 are valid. We have worked with densities in this section and in Section 4. With easy modifications, the results of Corollary 3.1 and Theorem 4.1 hold for a general probability on X × Y provided only that regular versions of the two conditional probabilities exist.

4 Gibbs Sampling as Alternating Projections

This section shows that the Gibbs sampler can be regarded as an alternating projection algorithm. Let (X , µ(dx)), (Y, ν(dy)) be σ-finite measure spaces. Let f (x, y) be a probability density with respect to µ × ν. This determines a Hilbert space L²(f ). If X(x, y) = x, Y (x, y) = y are the coordinate projections and σ(X), σ(Y ) the associated σ-algebras, let the marginals be

m_X(x) = ∫ f(x, y) ν(dy), m_Y(y) = ∫ f(x, y) µ(dx).

Let

M₁ = L²(σ(Y), f) ≅ L²(m_Y), M₂ = L²(σ(X), f) ≅ L²(m_X).

These are closed subspaces of L²(f ) and the orthogonal projections onto M₁, M₂ are realized by the conditional expectations E( . | σ(Y )), E( . | σ(X)) respectively. See [24, Proposition IV.3.1, Page 122].

Consider the Gibbs Sampling algorithm of Section 3. This has transition density

k(x, y ; x′, y′) = f(x′ | y′) f(y′ | x) (4)

We now arrive at our main result.

Theorem 4.1 Let K be the operator on L²(f ) associated to the Gibbs sampling kernel (4). Let P₁, P₂ be orthogonal projections onto the subspaces M₁, M₂ defined above. Then,

K = P₂P₁ and so Kⁿ = (P₂P₁)ⁿ for n = 0, 1, 2, ...

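Proof Using (4), for U ∈ L²(f),

$$K(U)(x, y) = \int_{\mathcal X} \int_{\mathcal Y} U(x', y')\, f(y' \mid x)\, f(x' \mid y')\, \nu(dy')\, \mu(dx'),$$

and

$$\begin{aligned}
P_2 P_1(U)(x, y) &= E[P_1(U) \mid X = x] \\
&= E[E[U \mid \sigma(Y)] \mid X = x] \\
&= \int_{\mathcal Y} \left[ \int_{\mathcal X} U(x', y')\, f(x' \mid y')\, \mu(dx') \right] f(y' \mid x)\, \nu(dy') \\
&= \int_{\mathcal Y} \int_{\mathcal X} U(x', y')\, f(x' \mid y')\, f(y' \mid x)\, \mu(dx')\, \nu(dy') \\
&= \int_{\mathcal X} \int_{\mathcal Y} U(x', y')\, f(y' \mid x)\, f(x' \mid y')\, \nu(dy')\, \mu(dx').
\end{aligned}$$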

This proves the result for n = 1. The result for general n now follows by induction.
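On a finite state space the identity K = P₂P₁ can be verified by direct computation. The following sketch assumes a hypothetical 2 × 3 joint table `F` and a hypothetical test function `U`; functions on X × Y are stored as nested lists indexed `[x][y]`, and exact fractions make the comparison an equality.

```python
from fractions import Fraction as Fr

# Hypothetical positive joint table f[x][y]; the entries sum to 1.
F = [[Fr(2, 20), Fr(4, 20), Fr(2, 20)],
     [Fr(5, 20), Fr(1, 20), Fr(6, 20)]]
NX, NY = 2, 3
MX = [sum(F[x]) for x in range(NX)]                          # marginal of x
MY = [sum(F[x][y] for x in range(NX)) for y in range(NY)]    # marginal of y


def p1(u):
    """P1 u = E[u | sigma(Y)]: average over x' with weights f(x' | y)."""
    return [[sum(F[a][y] / MY[y] * u[a][y] for a in range(NX))
             for y in range(NY)] for _ in range(NX)]


def p2(u):
    """P2 u = E[u | sigma(X)]: average over y' with weights f(y' | x)."""
    return [[sum(F[x][b] / MX[x] * u[x][b] for b in range(NY))] * NY
            for x in range(NX)]


def gibbs(u):
    """K u(x, y) = sum over (x', y') of f(y' | x) f(x' | y') u(x', y')."""
    return [[sum(F[x][b] / MX[x] * F[a][b] / MY[b] * u[a][b]
                 for a in range(NX) for b in range(NY))] * NY
            for x in range(NX)]


U = [[Fr(3), Fr(-1), Fr(4)], [Fr(1), Fr(-5), Fr(9)]]
```

Comparing `gibbs(U)` with `p2(p1(U))` reproduces the n = 1 case; iterating both sides gives the induction step.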

The argument can be generalized to higher dimensions. Let (X_i, µ_i(dx_i)), i = 1, 2, ..., k, be σ-finite measure spaces. Let f(x₁, x₂, ..., x_k) be a probability density with respect to ∏_{i=1}^k µ_i. In what follows, we will write ∏_{i=1}^k dx_i for the σ-finite dominating measure ∏_{i=1}^k µ_i(dx_i). Let X_i(x₁, x₂, ..., x_k) = x_i, i = 1, 2, ..., k, be the coordinate projections. Define the σ-algebras

A_j = σ(X₁, X₂, ..., X_{j−1}, X_{j+1}, ..., X_k), 1 ≤ j ≤ k,

and the corresponding Hilbert spaces

M_j = L²(A_j, f), 1 ≤ j ≤ k.

The orthogonal projection P_j onto M_j is realized by the conditional expectation E(. | A_j), 1 ≤ j ≤ k. Consider the Gibbs sampling algorithm with transition density

$$k(x_1, x_2, \ldots, x_k ; x'_1, x'_2, \ldots, x'_k) = \prod_{i=1}^{k} f(x'_i \mid x_1, x_2, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_k). \tag{5}$$

Theorem 4.2 Let K be the operator on L²(f) associated to the Gibbs sampling chain (5). Then,

K = P_k P_{k−1} ... P₁ and so Kⁿ = (P_k P_{k−1} ... P₁)ⁿ for all 0 ≤ n < ∞.

Proof Let x = (x₁, x₂, ..., x_k) and x′ = (x′₁, x′₂, ..., x′_k). Using (5), for U ∈ L²(f), we have

$$K(U)(x) = \int_{\mathcal X_1} \int_{\mathcal X_2} \cdots \int_{\mathcal X_k} U(x') \prod_{i=1}^{k} f(x'_i \mid x_1, x_2, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_k) \prod_{i=1}^{k} dx'_i.$$

Define U₀ = U and

U_{ki+j} = E(U_{ki+(j−1)} | A_j), j = 1, 2, ..., k, i = 0, 1, 2, ...

Then,

$$\begin{aligned}
P_k P_{k-1} \cdots P_1 (U)(x_1, x_2, \ldots, x_k) &= U_k(x_1, x_2, \ldots, x_{k-1}) \\
&= E[U_{k-1} \mid A_k](x_1, x_2, \ldots, x_{k-1}) \\
&= \int_{\mathcal X_k} U_{k-1}(x_1, x_2, x_3, \ldots, x_{k-2}, x'_k)\, f(x'_k \mid x_1, x_2, \ldots, x_{k-1})\, dx'_k \\
&= \int_{\mathcal X_k} \int_{\mathcal X_{k-1}} U_{k-2}(x_1, \ldots, x_{k-2}, x'_{k-1}, x'_k)\, f(x'_{k-1} \mid x_1, \ldots, x_{k-2}, x'_k)\, f(x'_k \mid x_1, x_2, \ldots, x_{k-1})\, dx'_{k-1}\, dx'_k.
\end{aligned}$$

The last statement uses U_{k−1} = E[U_{k−2} | A_{k−1}]. Continuing like this we get

$$P_k P_{k-1} \cdots P_1 (U) = \int_{\mathcal X_k} \int_{\mathcal X_{k-1}} \cdots \int_{\mathcal X_1} U(x') \prod_{i=1}^{k} f(x'_i \mid x_1, x_2, \ldots, x_{i-1}, x'_{i+1}, \ldots, x'_k) \prod_{i=1}^{k} dx'_i.$$

This proves the result for n = 1. The result for general n now follows by induction.
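The same check can be run for k = 3 with binary coordinates. Everything specific in this sketch is an assumption made for illustration: the weights defining `F`, the test function `U`, and the helper names. Coordinates are 0-based in the code, so index 0 corresponds to x₁.

```python
from fractions import Fraction as Fr
from itertools import product

STATES = list(product((0, 1), repeat=3))
WEIGHTS = [1, 2, 3, 4, 2, 1, 5, 2]          # hypothetical positive weights
F = {s: Fr(w, sum(WEIGHTS)) for s, w in zip(STATES, WEIGHTS)}


def with_coord(s, j, v):
    """Copy of the state s with coordinate j set to v."""
    return s[:j] + (v,) + s[j + 1:]


def cond(s, j):
    """Conditional probability of s[j] given the other coordinates of s."""
    return F[s] / sum(F[with_coord(s, j, v)] for v in (0, 1))


def project(u, j):
    """P_j u = E(u | A_j): integrate out coordinate j given the others."""
    return {s: sum(cond(with_coord(s, j, v), j) * u[with_coord(s, j, v)]
                   for v in (0, 1)) for s in STATES}


def kernel(x, xp):
    """Transition probability (5) for k = 3."""
    return (cond((x[0], x[1], xp[2]), 2)       # f(x3' | x1, x2)
            * cond((x[0], xp[1], xp[2]), 1)    # f(x2' | x1, x3')
            * cond(xp, 0))                     # f(x1' | x2', x3')


def gibbs(u):
    return {x: sum(kernel(x, xp) * u[xp] for xp in STATES) for x in STATES}


U = {s: Fr(i * i - 3) for i, s in enumerate(STATES)}
LHS = gibbs(U)
RHS = project(project(project(U, 0), 1), 2)    # P_3 P_2 P_1 U
```

The order matters here: P₁ is applied first and P₃ last, which corresponds to the chain updating the third coordinate first.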

5 Some Consequences and Relations

In this section we relate the angle between subspaces to the maximal correlation and the second eigenvalue of the Gibbs sampler. We use this to get rates of convergence.

5.1 Angles Between Subspaces

Let M₁, M₂ be closed subspaces of the Hilbert space H. Let P_i be the orthogonal projection onto M_i and P_I be the orthogonal projection onto M₁ ∩ M₂. Aronszajn’s estimate of the convergence of (P₂P₁)ⁿ to P_I involved the cosine of the angle between M₁ and M₂:

c = sup{⟨v₁, v₂⟩ | v_i ∈ M_i ∩ (M₁ ∩ M₂)^⊥, ||v_i|| ≤ 1, i = 1, 2}.

By definition of the orthogonal projection P₁, for any h ∈ H, we have

⟨P₁h, h⟩^{1/2} = ||P₁h|| = max{⟨g, h⟩ | g ∈ M₁, ||g|| ≤ 1}. (6)

Indeed, for any g ∈ M₁, h − P₁h is orthogonal to g and thus ⟨g, h⟩ = ⟨g, P₁h⟩. It follows that max{⟨g, h⟩ | g ∈ M₁, ||g|| ≤ 1} = ||P₁h||.

The first equality in (6) follows from the fact that P₁ is self-adjoint and is a projection. For any v₂ ∈ M₂ ∩ (M₁ ∩ M₂)^⊥, we have

sup{⟨v₁, v₂⟩ | v₁ ∈ M₁ ∩ (M₁ ∩ M₂)^⊥, ||v₁|| ≤ 1} = ⟨P₁v₂, v₂⟩^{1/2}

and it follows that

c = sup{⟨P₁v₂, v₂⟩^{1/2} | v₂ ∈ M₂ ∩ (M₁ ∩ M₂)^⊥, ||v₂|| ≤ 1}. (7)

This can also be understood as saying that the norm of P₁ as an operator from M₂ ∩ (M₁ ∩ M₂)^⊥ to M₁ is bounded by c. This point is important in deriving Aronszajn’s bound (see below). By symmetry, c is also a bound on the norm of P₂ acting from M₁ ∩ (M₁ ∩ M₂)^⊥ to M₂. Several of the results in this section use the hypothesis that ¯A₁ ∩ ¯A₂ is trivial. We refer to the theorems of Berti-Pratelli-Rigo [6] discussed above for conditions yielding this hypothesis.

5.2 Angles and Eigenvalues

Consider now the special case where H = L²(Ω, F, P), for P a probability measure, and M_i = L²(Ω, A_i, P) with A_i a sub σ-algebra in F. Then P_i = E(. | A_i). Recall next that the maximal correlation is defined by

γ(A₁, A₂) = sup{E(X₁X₂) | X_i ∈ M_i, E(X_i) = 0, E(X_i²) ≤ 1, i = 1, 2}.

If M₁ ∩ M₂ consists of constant functions, (M₁ ∩ M₂)^⊥ consists of mean zero functions. Thus,

Proposition 5.1 If ¯A₁ ∩ ¯A₂ is trivial, then γ(A₁, A₂) = c.

Next we give an interpretation of this number in terms of the Gibbs sampler of Section 3 which has P₁ = E( . | σ(Y )), P₂ = E( . | σ(X)), M₁ = L²(σ(Y ), f ), M₂ = L²(σ(X), f ). The operator Q = P₂P₁P₂ : L²(σ(X), f ) → L²(σ(X), f ) is evidently self-adjoint and corresponds to the marginal x-chain. Its norm

||Q||₀ = ||Q||_{L²₀(σ(X), f) → L²₀(σ(X), f)}

on

L²₀(σ(X), f) = {g ∈ L²(σ(X), f) | E(g) = 0}

can be computed (by self-adjointness) as

||Q||₀ = sup{|⟨Qu, u⟩| : u ∈ L²₀(σ(X), f), E(u²) ≤ 1}.

Observe further that

⟨Qu, u⟩ = ⟨P₂P₁P₂u, u⟩ = ⟨P₁u, u⟩

since u ∈ M₂ and P₂ is self-adjoint. Hence, using (7), we have established that

||Q||₀ = c². (8)

Let β₁ be the second largest eigenvalue of Q. β₁ may be taken as the maximum of the support of the spectral measure of Q − I if eigenvalues do not exist. The classical minimax characterization of eigenvalues shows that

β₁ = sup{⟨Qg, g⟩ | g ∈ L²(σ(X), f), E(g) = 0, E(g²) = 1}

= sup{⟨P₁g, g⟩ | g ∈ L²(σ(X), f), E(g) = 0, E(g²) = 1}

= c².

This proves the following proposition.

Proposition 5.2 With notation as above, assuming that A₁ ∩ A₂ is trivial, then γ(A₁, A₂) = c = √β₁.
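For a 2 × 2 joint table the identity γ = c = √β₁ can be checked by hand, because the mean-zero functions of each coordinate form a one-dimensional space and the maximal correlation reduces to the ordinary correlation of the two indicator variables. The sketch below assumes a hypothetical table `F`; the names are illustrative.

```python
from fractions import Fraction as Fr

# Hypothetical positive 2x2 joint table f[x][y].
F = [[Fr(4, 10), Fr(1, 10)],
     [Fr(2, 10), Fr(3, 10)]]
MX = [sum(row) for row in F]
MY = [F[0][y] + F[1][y] for y in range(2)]


def x_chain():
    """Transition matrix of the marginal x-chain:
    q(x, x') = sum over y of f(y | x) f(x' | y)."""
    return [[sum(F[x][y] / MX[x] * F[xp][y] / MY[y] for y in range(2))
             for xp in range(2)] for x in range(2)]


Q = x_chain()
# A 2x2 stochastic matrix has eigenvalues 1 and trace - 1.
BETA1 = Q[0][0] + Q[1][1] - 1

# Squared correlation of the indicators of {X = 1} and {Y = 1}.
COV = F[1][1] - MX[1] * MY[1]
RHO_SQ = COV ** 2 / (MX[0] * MX[1] * MY[0] * MY[1])
```

With this table both quantities equal 1/6, so c = γ = 1/√6 ≈ 0.408.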

5.3 Rates of Convergence

The weak convergence in von Neumann’s theorem is not suitable for use in the quantitative parts of Markov chain theory. However, Aronszajn’s inequality is closely related to the basic spectral gap technique used in Markov chain theory. Here is a brief discussion of the technique and connections to standard modes of convergence. For ease of exposition, restrict attention to a compact Markov operator K on a countable state space Ω. Assume further that K is ergodic, with unique stationary distribution π(ω) (thus Σ_ω π(ω)k(ω, ω′) = π(ω′), where k is the kernel corresponding to K). One condition on (K, π) that is often used is the contraction estimate

||(K − π)g|| ≤ β||g|| (9)

where ||g|| = (Σ_{ω∈Ω} |g(ω)|² π(ω))^{1/2} is the norm on L²(Ω, π). This is the same as an operator norm estimate on K acting on L²₀(π) = {g ∈ L²(Ω, π) | π(g) = 0} (where π(g) = Σ_{ω∈Ω} g(ω)π(ω)), and is also equivalent to the singular value bound σ₁(K) ≤ β, where σ₁(K) is the second largest singular value of K on L²(Ω, π) (the largest is σ₀(K) = 1). By definition σ₁(K) is also the square root of the second largest eigenvalue of the self-adjoint operator K^∗K on L²(Ω, π).

The typical convergence bounds derived from such a condition are

$$\chi^2_\omega(n) := \sum_{\omega'} \left| \frac{k^n(\omega, \omega')}{\pi(\omega')} - 1 \right|^2 \pi(\omega') \le \pi(\omega)^{-1} \beta^{2n} \tag{10}$$

and

$$2\, \| K^n_\omega - \pi \|_{TV} = \sum_{\omega'} | k^n(\omega, \omega') - \pi(\omega') | \le \pi(\omega)^{-1/2} \beta^n.$$

These are easily deduced from (9) by passing to the adjoint, i.e., observing that (9) is equivalent to ||(K^∗ − π)g|| ≤ β||g||, and using the test functions g = π(ω)⁻¹δ_ω, where δ_ω(ω′) = 1 if ω = ω′ and equals 0 otherwise. These are crude but useful bounds. As an example, we consider a Gibbs sampling chain from [17]. As explained below, it uses all of the singular values of the projection P₂P₁.

Example (Poisson-Exponential). Let X = {0, 1, 2, 3, 4, ...}, Y = (0, ∞), µ(dx) = counting measure, ν(dy) = Lebesgue measure. Set

f(x, y) = e^{−2y} y^x / x!

This example is natural in a Bayesian statistics setting where f(x | y) = e^{−y} y^x / x! is the Poisson distribution with parameter y. If e^{−y} is taken as the prior density of y, the joint density is f(x, y). The conditional density

f(y | x) = 2^{x+1} e^{−2y} y^x / x!

is the Γ(x + 1, 1/2) density. The following theorem is proved in [17].

Theorem 5.1 For any starting state x, the x-chain of the Gibbs sampler satisfies

χ²_x(n) ≤ 2^{−2c} for n = log₂(x + 1) + c, c > 0,

χ²_x(n) ≥ 2^{2c} for n = log₂(x − 1) − c, c > 0.

This shows order log₂(x) steps are necessary and sufficient for convergence. On the other hand, the self-adjoint operator Q corresponding to the x-chain has second largest eigenvalue 1/2. The stationary distribution is given by m_X(x) = 1/2^{x+1}. Hence, the bound (10) becomes

χ²_x(n) ≤ (1/2^{x+1})⁻¹ (1/2)^{2n} = 2^{x+1−2n}.

The last inequality implies that χ²_x(n) is small after order (x + 1)/2 steps. This bound is “off” compared to the correct answer log₂(x).
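The quantities in this example can be explored numerically. The closed form used for the x-chain kernel below is not stated in the text; it is obtained here by integrating f(y | x) f(x′ | y) over y with f(y | x) proportional to y^x e^{−2y}, and should be read as an assumption of this sketch, as should the truncation level `TRUNC` and the function names. The last two functions simply solve the two bounds for n.

```python
import math
from fractions import Fraction as Fr

TRUNC = 200   # truncation of the infinite state space, for numerical sums only


def k(x, xp):
    """x-chain kernel, from integrating f(y | x) f(x' | y) over y:
    C(x + x', x) * 2^(x+1) / 3^(x+x'+1)."""
    return Fr(math.comb(x + xp, x) * 2 ** (x + 1), 3 ** (x + xp + 1))


def m(x):
    """Stationary distribution m_X(x) = 1 / 2^(x+1)."""
    return Fr(1, 2 ** (x + 1))


def row_mass(x):
    return sum(k(x, xp) for xp in range(TRUNC))


def mean_next(x):
    """E[x' | x]; equals (x + 1)/2, so x - 1 is an eigenfunction with eigenvalue 1/2."""
    return sum(xp * k(x, xp) for xp in range(TRUNC))


def stationary_mass(xp):
    return sum(m(x) * k(x, xp) for x in range(TRUNC))


def steps_from_spectral_bound(x, c):
    """Smallest n with 2^(x+1-2n) <= 2^(-2c)."""
    return (x + 1) / 2 + c


def steps_from_theorem(x, c):
    """n = log2(x + 1) + c, as in Theorem 5.1."""
    return math.log2(x + 1) + c
```

For a start at x = 1023 and c = 1 the spectral bound asks for 513 steps while Theorem 5.1 needs 11, which is the gap described above.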


5.4 The Gibbs Sampler and Aronszajn’s Bound

We now return to the Gibbs sampler. The Gibbs sampling chain corresponding to the operator K = P₂P₁ on L²(f) has invariant measure π = f. We assume that σ(X) ∩ σ(Y) is trivial so that the intersection of M₁ = L²(σ(Y), f) and M₂ = L²(σ(X), f) is the space of constant functions. Strictly speaking, so far we have not derived a bound on Kⁿ − f acting on L²(f). Instead, we have a bound on Qⁿ − m_X (where Q = P₂P₁P₂) acting on L²(σ(X), f). However, from our previous discussion we can extract a bound on Kⁿ − f. Namely, we have seen that the parameter c bounds the norm of P₁ from L²₀(σ(X), f) to L²₀(σ(Y), f). Since P₂ sends L²₀(f) into L²₀(σ(X), f) with norm 1, we find that

||P₁P₂||_{L²₀(f) → L²₀(f)} ≤ c

and hence

||(P₁P₂)ⁿ||_{L²₀(f) → L²₀(f)} ≤ cⁿ, n ≥ 1.

This is the same as

||((P₁P₂)ⁿ − f)g|| ≤ cⁿ||g||.

Suppose Ω = X × Y is countable. Note that (P₁P₂)ⁿ = (Kⁿ)^∗. With ω = (x, y), using g = f(ω)⁻¹δ_ω, we get the total variation bound

||Kⁿ_{(x,y)} − f||_TV ≤ f(x, y)^{−1/2} cⁿ.

We can get Aronszajn’s bound by considering the self-adjoint operator Q = P₂P₁P₂ acting on L²₀(σ(X), f). Indeed,

(P₁P₂)ⁿ = P₁Qⁿ⁻¹.

Recall that c bounds the norm of P₁ from L²₀(σ(X), f) to L²₀(σ(Y), f). Hence by (8),

||(P₁P₂)ⁿ||_{L²₀(f) → L²₀(f)} ≤ c²ⁿ⁻¹

which is exactly Aronszajn’s bound. This implies the following result.

Proposition 5.3 Consider the Gibbs sampling chain as described in Section 3. Let Ω = X × Y be countable. Assume σ(X) ∩ σ(Y) is trivial and let c be as in Propositions 5.1 and 5.2. Then for any ω = (x, y) ∈ Ω, and all n ≥ 1,

||Kⁿ_{(x,y)} − f||_TV ≤ f(x, y)^{−1/2} c²ⁿ⁻¹.
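A finite example gives a feel for how the bound in Proposition 5.3 compares with the actual total variation distance. The sketch assumes a hypothetical 2 × 2 table `F` and takes total variation as half the ℓ¹ distance, matching the convention of Section 5.3; c is computed from the x-chain as the square root of its second eigenvalue.

```python
import math

# Hypothetical positive 2x2 joint table f[x][y].
F = [[0.4, 0.1], [0.2, 0.3]]
MX = [sum(row) for row in F]
MY = [F[0][y] + F[1][y] for y in range(2)]

# c^2 is the second eigenvalue of the 2x2 x-chain, i.e. its trace minus 1.
TRACE = sum(F[x][y] / MX[x] * F[x][y] / MY[y] for x in range(2) for y in range(2))
C = math.sqrt(TRACE - 1.0)


def step(dist):
    """Push a distribution on pairs (x, y) through one Gibbs step."""
    new = [[0.0, 0.0], [0.0, 0.0]]
    for x in range(2):
        mass_x = dist[x][0] + dist[x][1]     # the kernel ignores the current y
        for yp in range(2):
            for xp in range(2):
                new[xp][yp] += mass_x * F[x][yp] / MX[x] * F[xp][yp] / MY[yp]
    return new


def tv_table(x, y, n_max):
    """Rows (n, TV distance to f after n steps from (x, y), bound of Prop. 5.3)."""
    dist = [[0.0, 0.0], [0.0, 0.0]]
    dist[x][y] = 1.0
    rows = []
    for n in range(1, n_max + 1):
        dist = step(dist)
        tv = 0.5 * sum(abs(dist[a][b] - F[a][b]) for a in range(2) for b in range(2))
        rows.append((n, tv, F[x][y] ** -0.5 * C ** (2 * n - 1)))
    return rows


TABLE = tv_table(0, 0, 10)
```

In this example both the distance and the bound shrink by the factor c² = 1/6 at each step, so the bound has the right rate and is off only by a constant.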

For general state space Ω, these bounds on total variation are not applicable because P(ω) = 0. However, in many examples the operators P_i, i = 1, 2, will have the property that

P₂P₁δ_ω = ψ_ω

is in L²(f), for any fixed ω. Here P₂P₁δ_ω is obtained as the limit of P₂P₁φ_n where φ_n is a sequence of (non-negative) functions in L²(f) such that φ_n → δ_ω (in total variation). In such cases, with ω = (x, y), one obtains

||Kⁿ_{(x,y)} − f||_TV ≤ ||ψ_{x,y}||₂^{1/2} c²ⁿ⁻³. (11)

From the definition,

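$$\begin{aligned}
\psi_{x,y}(x', y') &= \int k(x', y' ; x'', y'')\, \delta_{(x,y)}(x'', y'')\, \mu(dx'')\, \nu(dy'') \\
&= k(x', y' ; x, y) \\
&= f(y \mid x')\, f(x \mid y)
\end{aligned}$$

and

$$\| \psi_{x,y} \|_2^2 = f(x, y)^2 \int_{\mathcal X} \left( \frac{f(x' \mid y)}{m_X(x')} \right)^2 m_X(x')\, d\mu(x').$$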

Note that the condition for compactness given in [17, Proposition 6.1] is

$$\sup_y \int_{\mathcal X} \left( \frac{f(x' \mid y)}{m_X(x')} \right)^2 m_X(x')\, d\mu(x') < \infty.$$

This shows that (11) applies to most of the examples considered in [17].

Return to the general setting of von Neumann’s theorem, with M₁, M₂ closed subspaces of the Hilbert space H and P_i the orthogonal projection onto M_i. Suppose that M₁∩ M₂ = {0}.

Let π₁ be P₁ restricted to M₂ and π₂ be P₂ restricted to M₁. Then π₁ and π₂ are adjoints.

If π₁ (equivalently π₂, equivalently π₁π₂) is compact (as in all examples in [17]), the singular values of π₁ are the square roots of the eigenvalues of π₁π₂. As shown above in Theorem 5.1, bounds like Aronszajn’s which only use the first singular value can be off. In [17], dozens of examples appear where all the singular values are used to get the right answer. Of course, these examples have special structure, whereas the value of Aronszajn’s bound lies in its great generality. In the compact case, the higher singular values can also be seen as angles between subspaces of M₁ and M₂.

5.5 Gibbs Sampler in Higher Dimensions

With more than two coordinates Deutsch [15, Page 220] gives the following refined bounds. Let M_i, 1 ≤ i ≤ k, be closed subspaces of the Hilbert space H. Let P_i, 1 ≤ i ≤ k, be the associated projections and P_I the projection onto the intersection. If c_i is the cosine of the angle between M_i and ∩_{j=i+1}^k M_j, then, for all x ∈ H,

$$\| (P_k P_{k-1} \cdots P_1)^l (x) - P_I(x) \| \le \Bigl( 1 - \prod_{i=1}^{k-1} (1 - c_i^2) \Bigr)^{l/2} \|x\|.$$

Unfortunately, this may not be useful. For example, consider the probability density

$$f(\theta, j, n) = \binom{n}{j} \theta^j (1 - \theta)^{n-j}\, \frac{e^{-\lambda} \lambda^n}{n!} \quad \text{for } 0 \le \theta \le 1,\ 0 \le j \le n < \infty.$$

Let H = L²(f) and M_i the functions in H which do not depend on the coordinate i, 1 ≤ i ≤ 3. The standard Gibbs sampler for this example may be shown to converge. However, c₁ = 1 − (e^λ + λ − 1)/λ², c₂ = 1. We do not know a useful rate of convergence for this Markov chain.

Acknowledgements We thank Patrizia Berti, Don Burkholder, David Freedman, Henry Landau, Russell Luke, Pierre Mathieu, Luca Pratelli and Pietro Rigo for their enthusiastic help.

References

[1] Aronszajn, N. (1950). Theory of reproducing kernels, Trans. Amer. Math. Soc. 68, 337-404.

[2] Athreya, K., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method, Ann. Statist. 24, 89-100.

[3] Bauschke, H., Combettes, P. and Reich, S. (2005). The asymptotic behaviour of the composition of two resolvents, Nonlinear Analysis: Theory, Methods and Applications 60, 283-301.

[4] Bauschke, H., Combettes, P. and Luke, D. (2006). A strongly convergent reflection method for finding the projection onto the intersection of two closed convex sets in a Hilbert space, Journal of Approximation Theory 127, 178-192.

[5] Bauschke, H. (2001). Projection algorithms: results and open problems, Inherently Parallel Algorithms in Feasibility and Optimization and their applications (Haifa 2000), D. Butnariu, Y. Censor, S. Reich (editors), Elsevier, 11-22.

[6] Berti, P., Pratelli, L. and Rigo, P. (2007). Trivial intersections of σ-fields and Gibbs Sampling, Preprint, Dept. of Mathematics, University of Pavia, Italy.

[7] Bufetov, A. (2001). Markov operators and the Nevo-Stein theorem. Preprint, Erwin Schrodinger Institute.

[8] Burkholder, D.L. and Chow Y.S. (1961). Iterates of conditional expectation operators, Proc. Amer. Math. Soc. 12, 490-495.

[9] Burkholder, D.L. (1961). Sufficiency in the Undominated Case, Ann. Math. Stat. 32, 1191- 1200.

[10] Burkholder, D.L. (1962). Successive conditional expectations of an integrable function, Ann. Math. Stat. 33, 887-893.

[11] Burkholder, D.L. (1962). On the order structure of the set of sufficient subfields, Ann. Math. Stat. 33, 596-599.

[12] Butnariu, D. (2005). The expected projection method: Its behaviour and applications to linear operator equations and convex optimization, Journal of Applied Analysis 1, 93-108.

[13] Casella, G. and George E. (1992). Explaining the Gibbs sampler, Amer. Statistician 46, 167-174.

[14] Delyon, B. and Delyon, F. (1999). Generalization of von Neumann’s spectral sets and integral representation of operators, Bull. Soc. Math. Fr. 127, 25-41.

[15] Deutsch, F. (2001). Best Approximation in Inner Product Spaces, Springer-Verlag, New York.

[16] Deutsch, F. (1992). The method of alternating orthogonal projections, Approximation Theory, Spline, Functions and Applications, 105-121.

[17] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Gibbs sampling, exponential families and orthogonal polynomials, Preprint, Dept. of Statistics, Stanford University.

[18] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Gibbs sampling, exponential families and coupling, Preprint, Dept. of Statistics, Stanford University.

[19] Halperin, I. (1962). The product of projection operators, Acta. Sci. Math. (Szeged) 23, 96-99.

[20] Jones, G. and Hobert, J. (2001). Honest exploration of intractable probability distributions via Markov chain monte carlo, Statist. Sci. 16, 312-334.

[21] Landau, H.J. (1985). An overview of time and frequency limiting, Fourier Techniques and Applications, 201-220.

[22] Liu, J. (2001). Monte Carlo Strategies in Scientific Computing, Springer, New York.

[23] Meyn, S.P. and Tweedie, R. (1993). Markov Chains and Stochastic Stability, Springer-Verlag, New York.

[24] Neveu, J. (1965). Mathematical foundations of the calculus of probability, Holden-Day, Inc., San Francisco, London, Amsterdam.

[25] Netyanun, A. and Solomon, D. (2006). Iterated products of projections on Hilbert spaces, American Mathematical Monthly 113(7), August-September issue.


[26] Ornstein, D. (1968). On the pointwise behaviour of iterates of a self adjoint operator, J. Math. Mech. 18, 473-477.

[27] Rosenthal, J. (1995). Minorization conditions and convergence rates for Markov chain monte carlo, Jour. Amer. Statist. Assoc. 90, 558-566.

[28] Rota, G.C. (1962). An “alternierende Verfahren” for general positive operators, Bull. Amer. Math. Soc. 68, 95-102.

[29] Schwarz, H.A. (1870). Ueber einen Grenzübergang durch alternirendes Verfahren, Vierteljahrsschrift der Naturforschenden Gesellschaft in Zurich 15, 272-286.

[30] Tierney, L. (1994). Markov chains for exploring posterior distributions, (with discussion), Ann. Statist. 22, 1701-1762.

[31] Zaharopol, R. (1990). On products of conditional expectation operators, Canad. Math. Bull. 33, 257-260.