Stochastic Alternating Projections
Persi Diaconis Kshitij Khare
Departments of Mathematics and Statistics Department of Statistics
Stanford University and Stanford University
University of Nice - Sophia Antipolis
Laurent Saloff-Coste Department of Mathematics
Cornell University
March 16, 2007
Abstract
We show that a standard Monte Carlo algorithm - The Gibbs sampler - can be seen as alternating projections into closed subspaces of a Hilbert space. This allows classical convergence theorems of von Neumann and others to be harnessed for proving convergence.
In the other direction, it shows that classical convergence rates involving the angle between subspaces can be substantially refined in several cases.
1 Introduction
This paper gives a sharp connection between two classical areas of applied mathematics.
• Von Neumann’s Alternating Projection Algorithm. Let P1, P2 be the orthogonal projections onto closed subspaces M1, M2 of a Hilbert space H. Let PIbe the orthogonal projection onto the intersection M1∩ M2. If T = P2P1, then Tk → PI as k → ∞. That is,
Tk(h) − PI(h)
→ 0 for each h ∈ H.
• The Gibbs Sampler. Let f (x, y) be a positive probability density on the measureable space (X × Y, µ × ν). Run a Markov chain (X0, Y0), (X1, Y1), ... on (X × Y) starting from (X0, Y0) as follows:
– From (x, y), choose y0 from the conditional density f (y0 | x) – From y0, choose x0 from the conditional density f (x0 | y0)
This Markov chain has stationary density f (x, y) and 1
N
N
X
i=1
δ(Xi,Yi) ⇒ f, almost surely, weak star.
A simple example is given in detail in Section 5.3 below. Both algorithms alternate: The von Neumann algorithm projects first into M1, then into M2, then into M1 and so on. We will show that the Gibbs sampler amounts to computing projections in the Hilbert Space L2(f ), with M1 = L2(σ(Y ), f ) and M2 = L2(σ(X), f ), where X(x, y) = x and Y (x, y) = y are the coordinate mappings. Both algorithms were developed for problems where each step is easy to carry out, but projecting into the intersection is intractable. There is a difference, the mapping Tk converges, while the Markov chain (Xi, Yi), 0 ≤ i < ∞ bobbles around without converging.
Section 2 reviews the Alternating projection method, its extension to many projections, rates of convergence and a stochastic version - The “alternierelde Verfahren” of Burkholder, Chow, Rota and Delyon-Delyon. We present a physical demonstration using a string and paper clips shown to us by Don Burkholder. Section 3 reviews the Gibbs sampler (also known as Glauber dynamics or the heat-bath algorithm) and several variants. Section 4 gives our main result:
In L2(f ), let K be the Markov operator corresponding to the Gibbs sampler. Let P1 be the orthogonal projection onto the subspace M1 = L2(σ(Y ), f ), P2 be the orthogonal projection onto the subspace M2 = L2(σ(X), f ). Then,
Kn = (P2P1)n for all n ≥ 0.
In fact, the result is true for a sequence of k alternating projections corresponding to the Gibbs sampler for a density in k dimensions, for any k ∈ N. The proof is similiar to the k = 2 case and is presented in Section 4. But for ease of exposition, we stick to the k = 2 case in most examples.
Section 5 draws some consequences of this equality. The rate of convergence in von Neu- mann’s algorithm is shown to be related to probabilists’ maximal correlation. The Hibert space notion of strong convergence is shown to be fairly weak in the probabilistic setting where much more demanding topologies are standard fare. This suggests new problems in the Hilbert space setting.
2 Alternating Projection Algorithms
Von Neumann’s alternating projection theorem stated above has many extensions and applica- tions. A splendid textbook account appears in [15, Chapter 1]. While applications are surveyed in [16]. These trace the history back to Schwarz [29], who used the method to solve the Dirichlet
problem on a region given as a union of regions each having a simple to solve Dirichlet problem (eg. A union of disks).
One important extension is due to Halperin [19], who shows that the conclusion holds as stated for n subspaces with T = P1P2· · · Pn. An elegant proof of this using elementary arguments is in [25].
There is some classical work on the rate of convergence in von Neumann’s theorem. If M1, M2 are closed subspaces of the Hilbert Space H, let
c = sup{hv1, v2i | vi ∈ Mi ∩ (M1∩ M2)⊥, ||vi|| ≤ 1}. (1) This is the cosine of the angle between M1 and M2. If Pi is the projection into Mi and PI is the orthogonal projection into M1∩ M2, Aronszajn [1], proved that
||(P2P1)n(x) − PI(x)|| ≤ c2n−1||x|| , for all x ∈ H. (2) This result is best possible, and some extensions to more subspaces are available. See [15, page 220] and Section 5.5 below.
Not all projections into closed subspaces can be realized by computing conditional expec- tations. For example, consider H = L2 (−π, π),dx2π. The subspace M1 of functions which vanish (almost surely) on a subset of (−π, π) is closed, but not the range of conditional ex- pectations given a sub-σ-algebra (the constant functions are not in M1). The subspace M2 of functions with vanishing negative Fourier coefficients is a closed subspace containing the con- stants, but not the range of conditional expectation given a sub-σ-algebra (the projection of a positive function in M2 need not be positive). Alternating projections into these subspaces are a key ingredient of the classical work of Landau, Logan, Pollack and Stepian on band limited functions. See [21] for a readable overview. An elegant necessary and sufficient condition for a subspace of L2(P ) to be the range of a conditional expectation operator (for some sub-σ- algebra) is given in Neveu [24, Exercise IV.3.1, Page 123]. The subspace V must be closed, contain the constants and if f is in V , then max(f, 0) must be in V .
Example For a finite space X = {1, 2, ..., n}, let θi > 0,Pn
i=1θi = 1. A σ-field A on X is specified by a partition {A1, A2, ..., Ak}, Ai 6= ∅, Ai ∩ Aj = ∅, ∪ki=1Ai = X . Let PA be the projection from L2(X , θ) to L2(X , A, θ). The matrix of PA has (i, j) entry ¯θjδA(i, j), 1 ≤ i, j ≤ n, where δA(i, j) is one or zero as i and j are in the same block Al and ¯θj = P θj
i∈Alθi. It follows that the number of subspaces of L2(X , θ) which are the range of a conditional expectation operator is the Bell number B(n) (that is, the number of set partitions on {1, 2, ..., n}). There is a continuum of other subspaces.
Iterated conditional expectation operators have been studied by Burkholder and Chow (see [8]), Burkholder (see [9], [10], [11]) and Rota (see [28]). See [31] and [14] for recent results.
Let (Ω, F , P ) be a probability space with U ∈ L2(P ). Let A1, A2 be sub-σ algebras of F . Set
U0 = U, U2i+1= E(U2i| A1), U2i+2 = E(U2i+1| A2), 0 ≤ i < ∞.
Burkholder and Chow [8] proved that
Un→ E(U | ¯A1∩ ¯A2), almost surely. (3) In (3) ¯Ai is the smallest σ-algebra containing Ai and the P -null sets in F . The following two examples were suggested by David Freedman.
Example Let (Ω, F ) be the Borel unit square with P the uniform probability on the diagonal ∆.
Take Ai = σ(Xi) with Xi(ω) = ωi for ω = (ω1, ω2). Then A1∩ A2 = {φ, Ω}, and ¯A1∩ ¯A2 = F . Here U2i+1(ω1, ω2) = U (ω1, ω1) and U2i+2(ω1, ω2) = U (ω2, ω2). The iterations do not converge.
Example Let Ω be the Borel unit square. Let P be the uniform distribution supported on the upper left and lower right quarter squares. Now P has a density f (x, y) with respect to Lebesgue measure on Ω, but ¯A1 ∩ ¯A2 contains the 4 quarter squares. Thus, the iterations do not converge to constant functions.
A crucial step in connecting the Burkholder-Chow result (3) and other results below to the Gibbs sampler is understanding when ¯A1∩ ¯A2 =A1∩ A2. After all, if Ω = Ω1× Ω2 is a product space and Ai is the σ-algebra generated by the projection on the ith coordinate, A1∩ A2 is the trivial σ-algebra and (3) then says Un converges to E(U ) if ¯A1 ∩ ¯A2 = A1∩ A2. An elegant necessary and sufficient condition has been developed in response to questions raised by an early version of the present paper by Patrizia Berti, Luca Pratelli and Pietro Rigo [6]. Here is one version of their result.
Theorem 2.1 (Berti, Pratelli, Rigo) Let (Ω, F , P ) be a probability space and A1, A2 ⊆ F sub σ-fields. Let N = {F ∈ F : P (F ) ∈ {0, 1}}, ¯G = σ(G ∪ N ) for any subclass G ⊆ F . In order that ¯A1∩ ¯A2 = A1∩ A2 it is necessary and sufficient that
A1 ∈ A1, A2 ∈ A2 and P (A1 ∩ A2) = P (Ac1∩ Ac2) = 0 implies P (A14 B) = 0 or P (A24 B) = 0 for some B ∈ A1∩ A2.
They give many useful corollaries, some of which are down to earth and useful for the Gibbs sampling application. See Sections 3, 5 below.
Burkholder [10] and Ornstein [26] show that convergence in (3) can fail if only X ∈ L1(P ).
These authors show that a necessary and sufficient condition is that Z
|X(ω)| log(1 + |X(ω)|)P (dω) < ∞.
The extension to more than two σ-algebras was open for more than forty years. Consider the case of three σ-algebras A1, A2, A3. If the iterations are taken in order
A1, A2, A3, A2, A1, A2, A3, A2, A1, ...
then the original arguments go through to show almost sure convergence to E(U | ¯A1 ∩ ¯A2 ∩ A¯3).The straightforward extension of Halperin’s Theorem, where the iterations are taken in order
A1, A2, A3, A1, A2, A3, A1, A2, A3, ...
resisted solution. Convergence was finally proved by Delyon and Delyon [14]. Their argument uses a fascinating extension of spectral theory to non-normal operators. It cries out for a more probabilistic proof.
Von Neumann’s theorem has been widely developed and applied. For alternating minimiza- tion procedures, see [3]. For convex optimization using random alternating projections, see [12].
Applications to best approximation are in [4]. Proofs of extensions of the Ergodic theorem are in [7]. A useful survey geared towards projections into the intersections of convex sets is in [5].
We cannot leave this part of the subject without mentioning a charming demonstration of the theorems on alternating projections shown to us by Don Burkholder. Take a piece of string about two feet long. Attach two paper clips at two arbitrary positions. Call these the ”Left”
and ”Right” paper clips.
(a) (b) (c)
Figure 1
At any stage, proceed as follows: Fold the right end of the string over to touch the left paper clip. Holding this clip (fig. 1(b), top) (and the right end) with the left hand fingers, slide the right clip to the right until it hits the right end of the loop formed (fig. 1(b), bottom).
Unfold the string, fold the left end of the string over to touch the right clip (fig. 1(c), top).
Hold it there with the right hand fingers and slide the left-most clip to the left until it hits the left end of the loop formed (fig. 1(c), bottom).
Unfold the string. These two stages constitute one pass of the algorithm. If this is repeated a few times, the position of the clips will converge to 13 and 23 of the total length.
To make the connection with the developements of this section, suppose the clips are origi- nally at distance x and x + y on a string of length x + y + z (fig. 2(a)). Folding the right end over and sliding the right clip results in (fig. 2(b)). Folding the left end over and sliding the left clip results in (fig. 2(c)).
x y z
x y z
x y z
x y z
x y z xxxxx (y+z)/2(y+z)/2(y+z)/2(y+z)/2(y+z)/2 (y+z)/2(y+z)/2(y+z)/2(y+z)/2(y+z)/2 (2x+y+z)/4(2x+y+z)/4(2x+y+z)/4(2x+y+z)/4(2x+y+z)/4 (2x+y+z)/4(2x+y+z)/4(2x+y+z)/4(2x+y+z)/4(2x+y+z)/4 (y+z)/2(y+z)/2(y+z)/2(y+z)/2(y+z)/2
(a) (b) (c)
Figure 2 These transformations correspond to two projections.
P1 =
1 0 0
0 12 12 0 12 12
P2 =
1 2
1
2 0
1 2
1
2 0
0 0 1
with P2P1 =
1 2
1 4
1 1 4
2 1 4
1 4
0 12 12
.
Since it is doubly stochastic with 1
3,13,13T
as a unique stationary vector,
(P2P1)n
x y z
→
x+y+z x+y+z3 x+y+z3
3
.
If one were to demonstrate this, say by making a prediction for the length of the leftmost clip before the initial placement of the clips, it is desirable to know how many iterations are required for convergence. The eigenvalues and right eigenvectors of P2P1 are
1,
1 1 1
1 4,
1 1
−2
0,
0 1
−1
. Then
x y z
= x + y + z 3
1 1 1
+ 2 3x − y
3 − z 3
1 1
−2
+ (y − x)
0 1
−1
. Thus
(P2P1)n
x y z
= x + y + z 3
1 1 1
+ 1 4n
2 3x − y
3 −z 3
1 1
−2
.
Given x + y + z = c, the error is largest when x = c, y = z = 0. This corresponds to both clips staring at the right end of the string. The deviation of the leftmost clip from c3 after n steps
is 3(42cn). For c = 2 feet, if we want an error of 101 inch, we must take n ≥ 4. For the historical record, we note that Burkholder demonstrated this to us by repeatedly folding a 3 × 5 index card instead of a piece of string.
3 The Gibbs Sampler
The Gibbs sampler, also known as Glauber dynamics or the heat-bath algorithm, is an im- portant tool of scientific computing. It gives a way of sampling from an intractable high dimensional probability density (perhaps known up to a normalizing constant) by a sequence of one-dimensional updates. We will not attempt to review the extensive literature on the Gibbs sampler here. See [13] for a gentle introduction, [22] for a textbook treatment and [20]
for the literature on rates of convergence. A more detailed review is in [17, Section 2].
In the present paper we focus on two-component Gibbs samplers. Thus, let f (x, y) be a probability density with respect to the σ-finite measure µ × ν on X × Y. This has marginal densities
m1(x) = Z
f (x, y)ν(dy), m2(y) = Z
f (x, y)µ(dx).
For simplicity, suppose throughout that 0 < m1(x), m2(y) < ∞ for all x, y. Then, the conditional densities
f (x | y) = f (x, y)
m2(y), f (y | x) = f (x, y) m1(x)
are well defined. The Gibbs sampler is a Markov chain which may be described by
• From (x, y) choose y0 from f (y0 | x) and then x0 from f (x0 | y0).
This gives a kernel (w.r.t. µ × ν):
k(x, y ; x0, y0) = f (y0 | x)f (x0 | y0)
Let kn(x, y ; x0, y0) = R kn−1(x, y ; w, z)k(w, z ; x0, y0)µ(dw)ν(dz), with Kn the associated operator on L2(f ). One consequence of von Neumann’s Theorem and the results on iterated conditional expectations that follows from the developements in Section 4 and Section 5 is the following result.
Corollary 3.1 Assume that f (x, y) > 0 for all x, y, then for U ∈ L2(f ), with U =¯
Z
U (x, y)f (x, y)µ(dx)ν(dy),
• R (KnU (x, y) − ¯U )2f (x, y)µ(dx)ν(dy) → 0 as n → ∞
• KnU (x, y) → ¯U a.e. (x, y) as n → ∞
We note that if the density f (x, y) is not assumed positive everyplace, it appears to be a delicate matter to determine when the limiting projection in (3) of Section 2 is almost surely constant (c.f. Examples in Section 2 above). Useful sufficient conditions which handle most cases that arise in practice are given in the paper by Berti-Pratelli-Rigo [6]. Here is one of their theorems followed by some examples.
Theorem 3.1 (Berti, Pratelli, Rigo) Let f (x, y) be the probability density of a measure P on X × Y with respect to µ × ν. Let X and Y be the coordinate projections and N = {F : P (F ) ∈ {0, 1}}. In order that σ(X) ∩ σ(Y ) = N , it is sufficient that
(U × Y) ∪ (X × V ) ⊃ {f > 0} ⊃ U × V, for some U and V with P ((X, Y ) ∈ U × V ) > 0.
Example If f (x, y) > 0 for all (x, y), the hypothesis holds with U = X , V = Y. In the case that X × Y is the unit square and f is positive on a triangle below the main diagonal or any disc or indeed any convex open non-empty set, the hypothesis evidently holds and then the conclusions of Corollary 3.1 are valid. We have worked with densities in this section and in Section 4. With easy modifications, the results of Corollary 3.1 and Theorem 4.1 hold for a general probability on X × Y provided only that regular versions of the two conditional probabilities exist.
4 Gibbs Sampling as Alternating Projections
This section shows that the Gibbs sampler can be regarded as an alternating projection algo- rithm. Let (X , µ(dx)), (Y, ν(dy)) be σ-finite measure spaces. Let f (x, y) be a probability den- sity with respect to µ × ν. This determines a Hilbert space L2(f ). If X(x, y) = x, Y (x, y) = y are the coordinate projections and σ(X), σ(Y ) the associated σ-algebras, let the marginals be
mX(x) = Z
f (x, y)ν(dy), mY(y) = Z
f (x, y)µ(dx).
Let
M1 = L2(σ(Y ), f ) ˜=L2(mY), M2 = L2(σ(X), f ) ˜=L2(mX).
These are closed subspaces of L2(f ) and the orthogonal projections onto M1, M2 are realized by the conditional expectations E( . | σ(Y )), E( . | σ(X)) respectively. See [24, Proposition IV.3.1, Page 122].
Consider the Gibbs Sampling algorithm of Section 3. This has transition density
k(x, y ; x0, y0) = f (x0 | y0)f (y0 | x) (4) We now arrive at our main result.
Theorem 4.1 Let K be the operator on L2(f ) associated to the Gibbs sampling kernel (4). Let P1, P2 be orthogonal projections onto the subspaces M1, M2 defined above. Then,
K = P2P1 and so Kn = (P2P1)n for n = 0, 1, 2, ...
Proof Using (4), for U ∈ L2(f ), K(U )(x, y) =
Z
X
Z
Y
U (x0, y0)f (y0|x)f (x0|y0)ν(dy0)µ(dx0), and,
P2P1(U )(x, y) = E[P1(U ) | X = x]
= E[E[U | σ(Y )] | X = x]
= Z
Y
Z
X
U (x0, y0)f (x0|y0)µ(dx0)
f (y0|x)ν(dy0)
= Z
Y
Z
X
U (x0, y0)f (x0|y0)f (y0|x)µ(dx0)ν(dy0)
= Z
X
Z
Y
U (x0, y0)f (y0|x)f (x0|y0)ν(dy0)µ(dx0).
This proves the result for n = 1. The result for general n now follows by induction.
The argument can be generalized to higher dimensions. Let (Xi, µi(dxi)), i = 1, 2, ..., k, be σ-finite measure spaces. Let f (x1, x2, ..., xk) be a probability density with respect to Qk
i=1µi. In what follows, we will write Qk
i=1dxi for the σ-finite dominating measure Qk
i=1µi(dxi). Let Xi(x1, x2, ..., xk) = xi, i = 1, 2, ..., k, be the coordinate projections. Define the σ-algebras
Aj = σ(X1, X2, ..., Xj−1, Xj+1, ..., Xk), 1 ≤ j ≤ k, and the corresponding Hilbert spaces
Mj = L2(Aj, f ), 1 ≤ j ≤ k.
The orthogonal projection Pj onto Mj is realized by the conditional expectation E(. | Aj), 1 ≤ j ≤ k. Consider the Gibbs sampling algorithm with transition density
k(x1, x2, ..., xk ; x01, x02, ..., x0k) =
k
Y
i=1
f (x0i | x1, x2, ..., xi−1, x0i+1, ..., x0k). (5)
Theorem 4.2 Let K be the operator on L2(f ) associated to the Gibbs sampling chain (5).
Then,
K = PkPk−1...P1 and so Kn = (PkPk−1...P1)n for all 0 ≤ n < ∞.
Proof Let x = (x1, x2, ..., xn) and x0 = (x01, x02, ..., x0n). Using (5), for U ∈ L2(f ), we have,
K(U )(x) = Z
X1
Z
X2
...
Z
Xk
U (x0)
k
Y
i=1
f (x0i | x1, x2, ..., xi−1, x0i+1, ..., x0k)
k
Y
i=1
dx0i. Define U0 = U and
Uki+j = E(Uki+(j−1) | Aj), j = 1, 2, ..., k i = 0, 1, 2, ...
Then,
PkPk−1...P1(U )(x1, x2, ..., xk)
= Uk(x1, x2, ..., xk−1)
= E [Uk−1 | Ak] (x1, x2, ..., xk−1)
= Z
Xk
Uk−1(x1, x2, x3, ..., xk−2, x0k)f (x0k | x1, x2, ..., xk−1)dx0k
= Z
Xk
Z
Xk−1
Uk−2(x1, ..., xk−2, x0k−1, x0k)f (x0k−1 | x1, ..., xk−2, x0k)f (x0k|x1, x2, ..., xk−1)dx0k−1dx0k
The last statement uses Uk−1 = E[Uk−2 | Ak−1]. Continuing like this we get,
PkPk−1...P1(U ) = Z
Xk
Z
Xk−1
...
Z
X1
U (x0)
k
Y
i=1
f (x0i | x1, x2, ..., xi−1, x0i+1, ..., x0k)
k
Y
i=1
dx0i. This proves the result for n = 1. The result for general n now follows by induction.
5 Some Consequences and Relations
In this section we relate the angle between subspaces to the maximal correlation and the second eigenvalue of the Gibbs sampler. We use this to get rates of convergence.
5.1 Angles Between Subspaces
Let M1, M2 be closed subspaces of the Hilbert Space H. Let Pi be orthogonal projection onto Mi and PI be the orthogonal projection onto M1∩ M2. H. Aronszajn’s estimate of the convergence of (P2P1)n to PI involved the cosine of the angle between M1 and M2:
c = sup{hv1, v2i | vi ∈ Mi∩ (M1∩ M2)⊥, ||vi|| ≤ 1, i = 1, 2}.
By definition of the orthogonal projection P1, for any h ∈ H, we have
hP1h, hi1/2 = kP1hk = max{hg, hi} | g ∈ M1, kgk ≤ 1}. (6)
Indeed, for any g ∈ M1, h − P1h is orthogonal to g and thus hg, hi = hg, P1hi. It follows that max{hg, hi} | g ∈ M1, kgk ≤ 1} = kP1hk.
The first equality in (6) follows from the fact that P1 is self-adjoint and is a projection. For any v2 ∈ M2∩ (M1∩ M2)⊥, we have
sup{hv1, v2i | v1 ∈ M1∩ (M1∩ M2)⊥, kv1k ≤ 1} = hP1v2, v2i1/2 and it follows that
c = sup{hP1v2, v2i1/2 | v2 ∈ M2∩ (M1 ∩ M2)⊥, kv2k ≤ 1}. (7) This can also be understood as saying that the norm of P1as an operator from M2∩(M1∩M2)⊥ to M1 is bounded by c. This point is important in deriving Aronszajn’s bound (see below). By symmetry, c is also a bound on the norm of P2 acting from M1∩ (M1∩ M2)⊥ to M2. Several of the results in this section use the hypothesis ¯A1∩ ¯A2 is trivial. We refer to the theorems of Berti-Pratelli-Rigo [6] discussed above for conditions yielding this hypothesis.
5.2 Angles and Eigenvalues
Consider now the special case where H = L2(Ω, F , P ), for P a probability measure and Mi = L2(Ω, Ai, P ) with Ai a sub σ-algebra in F . Then Pi = E(. | Ai). Recall next that the maximal correlation is defined by
γ(A1, A2) = sup{E(X1X2) | Xi ∈ Mi, E(Xi) = 0, E(Xi2) ≤ 1, i = 1, 2}
If M1∩ M2 consists of constant functions, (M1∩ M2)⊥ consists of mean zero functions. Thus, Proposition 5.1 If ¯A1∩ ¯A2 is trivial, then γ(A1, A2) = c.
Next we give an interpretation of this number in terms of the Gibbs sampler of Section 3 which has P1 = E( . | σ(Y )), P2 = E( . | σ(X)), M1 = L2(σ(Y ), f ), M2 = L2(σ(X), f ). The operator Q = P2P1P2 : L2(σ(X), f ) → L2(σ(X), f ) is evidently self-adjoint and corresponds to the marginal x-chain. Its norm
kQk0 = kQkL2
0(σ(X),f )→L20(σ(X),f )
on
L20(σ(X), f ) = {g ∈ L2(σ(X), f ) | E(g) = 0}
can be computed (by self-adjointness) as
kQk0 = sup|hQu, ui| | u ∈ L20(σ(X), f ); E(u2) ≤ 1 .
Observe further that
hQu, ui = hP2P1P2u, ui = hP1u, ui
since u ∈ M2 and P2 is self-adjoint. Hence, using (7), we have established that
kQk0 = c2. (8)
Let β1 be the second largest eigenvalue of Q. β1may be taken as the maximum of the support of the spectral measure of Q−I if eigenvalues do not exist. The classical minimax characterization of eigenvalues shows that
β1 = sup{hQg, gi | g ∈ L2(σ(X), f ), E(g) = 0, E(g2) = 1}
= sup{hP1g, gi | g ∈ L2(σ(X), f ), E(g) = 0, E(g2) = 1}
= c2.
This proves the following proposition.
Proposition 5.2 With notation as above, assuming that A1∩ A2 is trivial, then γ(A1, A2) = c =p
β1
5.3 Rates of Convergence
The weak convergence in von Neumann’s theorem is not suitable for use in the quantitative parts of Markov chain Theory. However, Aronszajn’s inequality is closely related to the basic spectral gap technique used in Markov chain theory. Here is a brief discussion of the technique and connections to standard modes of convergence. For ease of exposition, restrict attention to a compact Markov operator K on a countable state space Ω. Assume further that K is ergodic, with unique stationary distribution π(ω) (thus P
ωπ(ω)k(ω, ω0) = π(ω0), where k is the kernel corresponding to K). One condition on (K, π) that is often used is the contraction estimate
k(K − π)gk ≤ βkgk (9)
where kgk = P
ω∈Ω|g(ω)|2π(ω)1/2
is the norm on L2(Ω, π). This is the same as an op- erator norm estimate on K acting on L20(π) = {g ∈ L2(Ω, π) | π(g) = 0} (where π(g) = P
ω∈Ωg(ω)π(ω)), and is also equivalent to the singular value bound σ1(K) ≤ β
where σ1(K) is the second largest singular value of K on L2(Ω, π) (the largest is σ0(K) = 1).
By definition σ1(K) is also the square root of the second largest eigenvalue of the self-adjoint operator K∗K on L2(Ω, π).
The typical convergence bounds derived from such a condition are Xω2(n) :=X
ω0
kn(ω, ω0) π(ω0) − 1
2
π(ω0) ≤ π(ω)−1β2n (10) and
2kKωn− πkTV =X
ω0
|kn(ω, ω0) − π(ω0)| ≤ π(ω)−1/2βn.
These are easily deduced from (9) by passing to the adjoint, i.e., observing that (9) is equivalent to k(K∗ − π)gk ≤ βkgk, and using the test functions g(ω) = (π(ω)−1)δω where δω(ω0) = 1 if ω = ω0 and equals 0 otherwise. These are crude but useful bounds. As an example, we consider a Gibbs sampling chain from [17]. As explained below, it uses all of the singular values of the projection P2P1.
Example (Poisson-Exponential).
Let X = {0, 1, 2, 3, 4, ...}, Y = (0, ∞), µ(dx) = Counting measure, ν(dy) = Lebesgue measure.
Set
f (x, y) = e−2yyx x!
This example is natural in a Bayesian Statistics setting where f (x | y) = e−yyx
x!
is the Poisson distribution with parameter y. If e−y is taken as the prior density of y, the joint density is f (x, y). The conditional density
f (y | x) = 2x+1e−yyx x!
is the Γ x + 1,12 density. The following theorem is proved in [17].
Theorem 5.1 For any starting state x, the x-chain of the Gibbs Sampler satisfies, Xx2(n) ≤ 2−2c for n = log2(x + 1) + c, c > 0
Xx2(n) ≥ 22c for n = log2(x − 1) − c, c > 0
This shows order log2(x) steps are necessary and sufficient for convergence. On the other hand, the self-adjoint operator Q corresponding to the x-chain has second largest eigenvalue 12. The stationary distribution is given by mX(x) = 12x+1
. Hence, the bound (10) becomes, Xx2(n) ≤
( 1 2
x+1)−1
1 2
2n
= 1 2
x+1−2n
The last inequality implies that Xx2(n) is small after order x+12 steps. This bound is “off”
compared to the correct answer log2(x).
5.4 The Gibbs Sampler and Aronszajn’s Bound
We now return to the Gibbs sampler. The Gibbs sampling chain corresponding to the operator K = P2P1 on L2(f ) has invariant measure π = f . We assume that σ(X) ∩ σ(Y ) is trivial so that the intersection of M1 = L2(σ(Y ), f ) and M2 = L2(σ(X), f ) is the space of constant functions. Strictly speaking, so far we have not derived a bound on Kn− f acting on L2(f ).
Instead, we have a bound on Qn− mX (where Q = P2P1P2) acting on L2(σ(X), f ). However, from our previous discussion we can extract a bound on Kn− f . Namely, we have seen that the parameter c bounds the norm of P1 from L20(σ(X), f ) to L20(σ(Y ), f ). Since P2 sends L20(f ) into L20(σ(X), f ) with norm 1, we find that
kP1P2kL2
0(f )→L20(f ) ≤ c.
and hence
k(P1P2)nkL2
0(f )→L20(f ) ≤ cn, n ≥ 1.
This is the same as
k((P1P2)n− f )gk ≤ cnkgk
Suppose Ω = X × Y is countable. Note that (P1P2)n = (Kn)∗. With ω = (x, y), using g = f (ω)−1δω, we get the total variation bound
kK(x,y)n − f kTV ≤ f (x, y)−1/2cn
We can get Aronszajn’s bound by considering the self-adjoint operator Q = P2P1P2 acting on L20(σ(X), f ). Indeed,
(P1P2)n = P1Qn−1
Recall that c bound the norm of P1 from L20(σ(X), f ) to L20(σ(Y ), f ). Hence by (8), k(P1P2)nkL2
0(f )→L20(f ) ≤ c2n−1
which is exactly Aronszajn’s bound. This implies the following result.
Proposition 5.3 Consider the Gibbs sampling chain as described in Section 3. Let Ω = X × Y be countable. Assume σ(X) ∩ σ(Y ) is trivial and let c be as in Propositions 5.1 and 5.2. Then for any ω = (x, y) ∈ Ω, and all n ≥ 1,
kK(x,y)n − f kTV ≤ f (x, y)−1/2c2n−1.
For general state space Ω, these bounds on total variation are not applicable because P (ω) = 0. However, in many examples the operators Pi, i = 1, 2 will have the property that
P2P1δω = ψω
is in L2(f ), for any fixed ω. Here P2P1δω is obtained as the limit of P2P1φn where φn is a sequence of (non-negative) functions in L2(f ) such that φn → δω (in total variation). In such cases, with ω = (x, y), one obtains
kK(x,y)n − f kTV ≤ kψx,yk1/22 c2n−3. (11)
From the definition,
ψx,y(x0, y0) = Z
k(x0, y0 ; x00, y00)δ(x,y)(x00, y00)µ(dx00)ν(dy00)
= k(x0, y0 ; x, y)
= f (y | x0)f (x | y)
and
kψx,yk22 = f (x, y)2 Z
X
f (x0 | y) mX(x0)
2
mX(x0)dµ(x0).
Note that the condition for compactness given in [17, Proposition 6.1] is sup
y
Z
X
f (x0 | y) mX(x0)
2
mX(x0)dµ(x0) < ∞.
This shows that (11) applies to most of the examples considered in [17].
Return to the general setting of von Neumann’s theorem, with M1, M2 closed subspaces of the Hilbert space H and Pi the orthogonal projection onto Mi. Suppose that M1∩ M2 = {0}.
Let π1 be P1 restricted to M2 and π2 be P2 restricted to M1. Then π1 and π2 are adjoints.
If π1 (equivalently π2, equivalently π1π2) is compact (as in all examples in [17]), the singular values of π1 are the square roots of the eigenvalues of π1π2. As shown above in Thereom 5.1, bounds like Aronszajn’s which only use the first singular value can be off. In [17], dozens of examples appear where all the singular values are used to get the right answer. Of course, these examples have special structure, whereas the value of Aronszajn’s bound lies in its great generality. In the compact case, the higher singular values can also be seen as angles between subspaces of M1 and M2.
5.5 Gibbs Sampler in Higher Dimensions
With more than two coordinates Deutsch [15, Page 220] gives the following refined bounds.
Let Mi, 1 ≤ i ≤ k be closed subspaces of the Hilbert space H. Let Pi, 1 ≤ i ≤ k be the asscoiated projections and PI the projection onto the intersection. If ci is the cosine of the angle between Mi and ∩kj=i+1Mj, then, ∀x ∈ H,
(PkPk−1...P1)l(x) − PI(x)
≤ (1 −
k−1
Y
i=1
(1 − c2i))2l ||x|| .
Unfortunately, this may not be useful. For example, consider the probability density f (θ, j, n) =n
j
θj1 − θn−je−λλn
n! for 0 ≤ θ ≤ 1, 0 ≤ j ≤ n < ∞.
Let H = L2(f ) and Mi the functions in H which do not depend on the coordinate i, 1 ≤ i ≤ 3. The standard Gibbs sampler for this example may be shown to converge. However, c1 = 1 − eλ+λ−1λ2 , c2 = 1. We do not know a useful rate of convergence for this Markov chain.
Acknowledgements We thank Patrizia Berti, Don Burkholder, David Freedman, Henry Landau, Russell Luke, Pierre Mathieu, Luca Pratelli and Pietro Rigo for their enthusiastic help.
References
[1] Aronszajn, N. (1950). Theory of reproducing kernels, Trans. Amer. Math. Soc. 68, 337-404.
[2] Athreya, K., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method, Ann. Statist. 24, 89-100.
[3] Bauschke, H., Combettes, P. and Reich, S. (2005). The asymptotic behaviour of the com- position of two resolvents, emphNonlinear Analysis: Theory, Methods and Applications 60, 283-301.
[4] Bauschke, H., Combettes, P. and Luke, D. (2006). A strongly convergent reflection method for finding the projection onto the intersection of two closed convex sets in a Hilbert space, Journal of Approximation Theory 127, 178-192.
[5] Bauschke, H. (2001). Projection algorithms: results and open problems, Inherently Par- allel Algorithms in Feasibility and Optimization and their applications (Haifa 2000), D.
Butnariu, Y. Censor, S. Reich (editors), Elsevier, 11-22.
[6] Berti, P., Pratelli, L. and Rigo, P. (2007). Trivial intersections of σ-fields and Gibbs Sam- pling, Preprint, Dept. of Mathematics, University of Pavia, Italy.
[7] Bufetov, A. (2001). Markov operators and the Nevo-Stein theorem. Preprint, Erwin Schrodinger Institute.
[8] Burkholder, D.L. and Chow Y.S. (1961). Iterates of conditional expectation operators, Proc. Amer. Math. Soc. 12, 490-495.
[9] Burkholder, D.L. (1961). Sufficiency in the Undominated Case, Ann. Math. Stat. 32, 1191- 1200.
[10] Burkholder, D.L. (1962). Successive conditional expectations of an integrable function, Ann. Math. Stat. 33, 887-893.
[11] Burkholder, D.L. (1962). On the order structure of the set of sufficient subfields, Ann.
Math. Stat. 33, 596-599.
[12] Butnariu, D. (2005). The expected projection method: Its behaviour and applications to linear operator equations and convex optimatization, Journal of Applied Analysis 1, 93-108.
[13] Casella, G. and George E. (1992). Explaining the Gibbs sampler, Amer. Statistician 46, 167-174.
[14] Delyon, B. and Delyon, F. (1999). Generalization of von Neumann’s spectral sets and integral representation of operators, Bull. Soc. Math. Fr. 127, 25-41.
[15] Deutsch, F. (2001). Best Approximation in Inner Product Spaces, Springer-Verlag, New York.
[16] Deutsch, F. (1992). The method of alternating orthogonal projections, Approximation Theory, Spline, Functions and Applications, 105-121.
[17] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Gibbs sampling, exponential families and orthogonal polynomials, Preprint, Dept. of Statistics, Stanford University.
[18] Diaconis, P., Khare, K. and Saloff-Coste, L. (2006). Gibbs sampling, exponential families and coupling, Preprint, Dept. of Statistics, Stanford University.
[19] Halperin, I. (1962). The product of projection operators, Acta. Sci. Math. (Szeged) 23, 96-99.
[20] Jones, G. and Hobert, J. (2001). Honest exploration of intractable probability distributions via Markov chain monte carlo, Statist. Sci. 16, 312-334.
[21] Landau, H.J. (1985). An overview of time and frequency limiting, Fourier Techniques and Applications, 201-220.
[22] Liu, J. (2001). Monte Carlo Strategies in Scientific Computing, Springer, New York.
[23] Meyn, S.P. and Tweedie, R. (1993). Markov Chains and Stochastic Stability, Springer- Verlag, New York.
[24] Neveu, J. (1965). Mathematical foundations of the calculus of probability, Holden-Day, Inc., San Francisco, London, Amstredam.
[25] Netyanun, A. and Solomon, D. (2006). Iterated products of projections on Hilbert spaces, American Mathematical Monthly 113(7), August-September issue.
[26] Ornstein, D. (1968). On the pointwise behaviour of iterates of a self adjoint operator, J.
Math. Mech. 18, 473-477.
[27] Rosenthal, J. (1995). Minorization conditions and convergence rates for Markov chain monte carlo, Jour. Amer. Statist. Assoc. 90, 558-566.
[28] Rota, G.C. (1962). An “alternierende Verfahren” for general positive operators, Bull.
Amer. Math. Soc. 68, 95-102.
[29] Schwarz, H.A. (1870). Ueber einen Grenz˜ubergang durch alternirendes Verfahren, Viertel- jahrsschrift der Naturforschenden Gessellschaft in Zurich 15, 272-286.
[30] Tierney, L. (1994). Markov chains for exploring posterior distributions, (with discussion), Ann. Statist. 22, 1701-1762.
[31] Zaharopol, R. (1990). On products of conditional expectation operators, Canad. Math.
Bull. 33, 257-260.