**The Joy of Sampling**

D.A. FORSYTH, J. HADDON AND S. IOFFE

*Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA*

daf@cs.berkeley.edu haddon@cs.berkeley.edu

ioffe@cs.berkeley.edu

*Received November 12, 1999; Revised August 4, 2000; Accepted January 17, 2001*

**Abstract.** A standard method for handling Bayesian models is to use Markov chain Monte Carlo methods to
draw samples from the posterior. We demonstrate this method on two core problems in computer vision—structure
from motion and colour constancy. These examples illustrate a samplers producing useful representations for very
large problems. We demonstrate that the sampled representations are trustworthy, using consistency checks in the
experimental design. The sampling solution to structure from motion is strictly better than the factorisation approach,
because: it reports uncertainty on structure and position measurements in a direct way; it can identify tracking errors;

and its estimates of covariance in marginal point position are reliable. Our colour constancy solution is strictly better than competing approaches, because: it reports uncertainty on surface colour and illuminant measurements in a direct way; it incorporates all available constraints on surface reflectance and on illumination in a direct way;

and it integrates a spatial model of reflectance and illumination distribution with a rendering model in a natural way.

One advantage of a sampled representation is that it can be resampled to take into account other information. We demonstrate the effect of knowing that, in our colour constancy example, a surface viewed in two different images is in fact the same object. We conclude with a general discussion of the strengths and weaknesses of the sampling paradigm as a tool for computer vision.

**Keywords:** Markov chain Monte Carlo, colour constancy, structure from motion

**1.** **Introduction**

The Bayesian philosophy is that all information about
*a model is captured by a posterior distribution obtained*
using Bayes’ rule:

posterior*= P(world | observations)*

*∝ P(observations | world)P(world)*
*where the prior P(world) is the probability density of*
the state of the world in the absence of observations.

*Many examples suggest that, when computational dif-*
*ficulties can be sidestepped, the Bayesian philosophy*
leads to excellent and effective use of data (e.g. exposi-
tions in Carlin and Louis (1996), Gelman et al. (1995),
Grenander (1983), and Grenander (1993); examples of

the use of Bayesian inference in the vision literature include (Binford and Levitt, 1994; Chou and Brown, 1990; Huang et al., 1994; Jolly et al., 1996; Maybank and Sturm, 1999; Noble and Mundy, 1993; Pavlovic et al., 1999a, 1999b; Sarkar and Boyer, 1992, 1994;

Sullivan et al., 1999; Yuille and Coughlan, 1999; Zhu et al., 2000, among others).

A probability distribution is, in essence, a device for computing expectations. The problems we are in- terested in typically involve an important continuous component, meaning that computing expectations in- volves estimating integrals, usually over high dimen- sional domains. One useful technique is to represent the posterior by drawing a large number of samples from that distribution. These samples can then be used to estimate any expectation with respect to that posterior.

For example, if we wished to decide whether to fight with or flee from an attacker, we would draw samples from the posterior on the outcome and estimate ex- pected utilities for each decision (as averages of the utilities over the samples) and then choose the decision with the best utility.

Sampling algorithms are more general than random
search for MAP interpretations precisely because the
results give an approximate representation of the en-
tire posterior. This means that, for example, we can
estimate the covariance of the posterior; resample the
samples to incorporate new information; engage in
multiple calculations for different decisions using dis-
*tinct utilities, etc. Sampling is in principle simple and*
*general, if samples can be drawn from the posterior*
efficiently.

This paper demonstrates the strengths and weak- nesses of sampling methods using two core vision prob- lems as examples: structure from motion (Section 2);

and colour constancy (Section 3).

**Notation: we write***² for a vector, whose i’th compo-*
nent is*²**i*and*M for a matrix whose i, j’th component*
*is M** _{ij}*. Sampler jargon that may be unfamiliar is shown
in italics when first introduced.

*1.1.* *Simple Sampling Algorithms*

For some probability distributions, direct algorithms exist for drawing samples (e.g. Ripley, 1987); we are seldom lucky enough to have a posterior of this type.

*Rejection sampling is appropriate for some distribu-*
*tions. Assume that we wish to draw samples from p(x)*
*and have a proposal distribution q(x), from which we*
can draw samples easily. Assume also we know some
*constant, k, such that kq(x) ≥ p(x) for all x; we can*
*draw a sample from p(x) by drawing a sample x*0from
*q(x), and then accepting the sample with probability*
*p(x*0*)/(kq(x*0*)). In importance sampling, to draw a*
*sample from p(x), we first draw a large number of*
independent samples*{s*1*, . . . , s**n*} from a proposal dis-
*tribution q(x), and then set s = s**i*with probability pro-
portional to *w**i* = _{q}^{p}_{(x)}^{(x)}*. As n*→ ∞, the distribution
*for the sample s will approach p(x). Both rejection*
sampling and importance sampling methods can be
*wildly inefficient if q(x) approximates p(x) poorly (the*
usual case in high dimensions); in some such cases, a
*collection of different q(x)’s can be pasted together*
to obtain a better approximation (e.g. Gamerman,
1997).

*1.2.* *Markov Chain Monte Carlo—*

*the Metropolis-Hastings Algorithm*

Markov chain Monte Carlo methods (Gamerman, 1997; Gilks et al., 1996c) are the standard methods for sampling complex distributions. In this method, one constructs a Markov chain whose stationary distribu- tion is the target distribution. A new sample can be ob- tained from an old one, by advancing the Markov chain.

The Metropolis-Hastings algorithm is a technique
for constructing a Markov chain that has a particular
desired stationary distribution. Assume that we have a
distribution*π from which we would like to generate*
samples. We would like to build a Markov chain which
has*π as a stationary distribution.*

The algorithm will produce a sequence of samples
*X*1*, . . . , X**n**, by taking a sample X**i* and proposing a
*revised version, X*_{i}^{0}. The next element of the sequence
*X**i*+1 *will be X*_{i}^{0} with probability*α(X**i**, X**i*+1*); other-*
*wise, it will be X**i*. We will give the form of*α below.*

The proposal process is random, too. In particular,
there is a proposal distribution which gives the prob-
*ability of proposing X*_{i}^{0} *from X** _{i}*. This can be written

*P(X*

*i*

*→ X*

^{0}

_{i}*). Note that the proposal distribution is a*

*function of X*

^{0}

_{i}*, and may be a function of X*

*. Now we assume that if*

_{i}*π(u) is non-zero, then there are some*values

*v such that P(v → u) is non-zero, too.*

In this case, we have that

*α = max*
µ

1*,P(X*^{0}*i**→ X**i**)π(X**i*^{0}*)*
*P(X**i**→ X**i*^{0}*)π(X**i**)*

¶

Notice that this expression is qualitatively sensible. If
the chain is at a point where*π has a very low value and*
at the new point*π has a very high value and the forward*
and backward proposal probabilities are about equal,
then the new point will be accepted with high proba-
bility. If the chain is at a point where*π has a very high*
value and the proposal process has a high probability
of suggesting points with a very low value of*π, it is*
likely to stay at that point. Finally, if a point which has
high value of*π is proposed disproportionately often,*
it is less likely to be accepted.

A good way to think about the Metropolis-Hastings algorithm is that it is an improved version of the

“hypothesize and test” process that is common in vi-
sion. Metropolis-Hastings suggests various hypotheses
which, depending on the result of a bookkeeping exer-
cise, are accepted or rejected. This process yields the se-
*quence X*1*, . . . , X**n*. However, for Metropolis-Hastings

the sequence of hypotheses has very significant se- mantics; assuming technical conditions on the proposal process (expounded in, for example, Gamerman, 1997;

Gilks et al., 1996a; Roberts, 1996; Tierney, 1996; these
conditions are usually fairly easily met and all our
samplers meet them), once sufficient iterations have
*completed, all subsequent X**i*are samples drawn from
*π(X).*

*1.3.* *Burn-in and Mixing*

Generally, an MCMC method needs to produce some
number of samples to “forget” its start point. The num-
ber of iterations required to achieve this is often called
*the burn-in time. The burn-in may be extremely long*
for a poorly designed sampler. There are a very small
number of samplers known to have a short burn-in time
(e.g. Jerrum and Sinclair, 1996).

Once a sampler has burnt in, the sequence of sam-
ples it produces may or may not be correlated; if this
*correlation is low, the method is said to mix well. It is*
desirable to have an algorithm that burns in quickly,
and mixes well. Burn-in and mixing are related to the
dynamics of the underlying Markov chain. One way to
show that a sampler mixes quickly is to prove that, for
any decomposition of its domain into two disjoint sets
A and B, the conditional probability that the sampler
goes to B given it is in set A, is high. Such proofs of
fast mixing exist for a small number of cases, but re-
quire substantial art (e.g. Jerrum and Sinclair, 1996).

We are aware of no proof that a sampler used for vision problems is fast mixing. In our examples, as in the vast majority of cases, the algorithm is used without such proofs; we show a variety of consistency checks that suggest that the algorithm has converged.

*1.4.* *The Attractions of MCMC*

It is known how to apply this algorithm when the do- main of support is complicated (for example, samples may be drawn when the domain of support of the poste- rior consists of several different spaces of different di- mensions, Green, 1995; Richardson and Green, 1997).

There are numerous variants to the basic algorithm, some of which combine deterministic dynamics with random search in the hope of better mixing (e.g. see the review in Neal (1993)).

The advantage of viewing Metropolis-Hastings al- gorithms as a souped up hypothesize and test process is that it suggests how to build proposal mechanisms.

A natural strategy is to take current vision algorithms
and make them produce probabilistic outputs. This ap-
proach is illustrated in Section 3.2; Zhu et al. (2000)
have used it successfully for some recognition prob-
lems. A really attractive feature is that we can use
different, possibly incompatible algorithms as distinct
sources of proposals, and the samples we obtain repre-
*sent the posterior incorporating all available measure-*
ments.

Quite often in practice it is easy to come up with a
*function f proportional to the posterior. In this case,*
the posterior is

R *f*

*D* *f(u) du*

but the integral—the normalizing constant—can be very difficult to compute (the best way to do it is to use a sampling method). An attractive feature of the Metropolis Hastings algorithm is that we need not know the normalizing constant for the distribution (because the constant is cancelled by the ratio).

*1.5.* *Techniques for Building Practical*
*MCMC Samplers*

It is easy to build a sampler using the Metropolis-
Hastings algorithm. It seems to be very hard to build a
*good sampler—one that burns in quickly, mixes well,*
and gives a trustworthy picture of the posterior—using
that algorithm. We describe a variety of techniques for
building samplers, and conclude with a discussion of
possible sanity checks.

* 1.5.1. Gibbs Samplers.* It is quite common to en-
counter situations where the target distribution has a
non standard form, but is standard when groups of vari-
ables have fixed values (this occurs in vision problems;

see Sections, 2.3 and 3.2). In this case, it is natural
to adopt a proposal mechanism that fixes one set of
variables and draws a sample from the full conditional
distribution on the other set, and vice versa. This very
*useful technique is known as Gibbs sampling (named*
by Geman and Geman (1984) but apparently due to the
statistical physics literature, where it was known as the
*heat bath algorithm, Gilks et al., 1996b, p. 12). Usu-*
ally, the group of variables to be sampled is chosen at
random, and sufficient samples are drawn so that each
group of variables is visited many times.

Gibbs sampling is very easy to implement. There is one considerable danger, which is often quite dif- ficult to avoid. If the groups of variables are strongly

*Figure 1.* *Correlated variables cause Gibbs samplers to behave badly. The figure on the top left shows 100 samples drawn from a Gibbs sampler*
for two independent normal random variables, one with variance one and the other with variance ten. The stars indicate the samples; the line
segments indicate the order in which the samples were drawn. Note that the sampler makes quite large vertical moves (because the variance in
*this direction is large). The figure on the top right shows 100 samples drawn from this distribution, now rotated by 45*^{◦}, using a Gibbs sampler.

In this case, the sampler can make only relatively small vertical and horizontal moves, and so the position of the samples changes relatively
*slowly; the 100 samples in the graph on the bottom left, which consist of those of the first graph rotated by 45*^{◦}, give a much better picture of
*the distribution. On the bottom right, the x-coordinate for the samples drawn from the second sampler (solid line) and the x-coordinates of the*
third figure (dashed line). The solid curve (correctly) suggests that the samples drawn from the second sampler are quite strongly correlated.

correlated, then a Gibbs sampler can mix very badly indeed. The effect is well known (for a full discussion, see for example, Gilks and Roberts, 1996) and easily illustrated (see Fig. 1).

* 1.5.2. The Hybrid Monte Carlo Method.* A common
difficulty with sampling methods is that the state of the
sampler appears to perform a slightly biased random
walk. The difficulty with random walk is that it takes a
long time to move any distance along a domain, mean-
ing that if the sampler is started at a point a long way
from the mode of the distribution, it will take a long

time before it reaches the mode. From our perspective, it is extremely important to have a representation of the distribution around the mode.

Hybrid Monte Carlo is a method for making propos-
als that causes the state of the sampler to move rather
quickly to the mode, and then explore it. The method
is due to Duane et al. (1987) (and described in detail
**in Neal (1993)). Write the state of the sampler as q.**

The method requires that the target distribution can be written as

**π(q) = exp{−U(q)}**

*Now let us think of U as a potential function; the*
state of the sampler will be the state of a particle
*of mass m subject to this potential function. This*
state can be determined by considering the momen-
**tum of the particle p and writing a Hamiltonian for the**
particle:

*H (q, p) = U(q) +*

**p**

^{T}**p**

*2m*

We now need to integrate Hamilton’s equations

**∂q**

*∂t* = 1
*m***p**

**∂p**

*∂t* = −∇**q***U*

to determine the state of the particle. This temporary excursion into mechanics is actually justified, because we can exponentiate the negative Hamiltonian of the particle to get

*π*^{0}**(q, p) = exp{−H(q, p)}**

**= π(q) exp**

½

−**p**^{T}**p**
*2m*

¾

which is a new target distribution for a larger set of random variables. We now have two proposal moves:

1. Advance time in our particle model by some ran-
domly chosen amount, either forwards or back-
**wards. This updates both q and p. As long as we use**
a symplectic integrator, the extent of the advance is
uniform and random, and the choice of forward or
backward is random, the accept probability is one.

**2. Fix q and draw a sample for p from the full con-**
ditional. This is easy, because the full conditional
**distribution in p is normal and is independent of q.**

*This sampler has very attractive qualitative behaviour.*

*If the state is at a relatively large value of U , then the*
first type of move will travel quickly down the gradient
*of U to smaller values, while building up momentum.*

But the second move then discards this momentum;

so we have a sampler that should move quickly to a
*mode—where U is small—and then move around ex-*
ploring the mode under the influence of the random
choice of momenta. Good values of the particle’s mass
and of the range of time values must be chosen by
experiment.

In practice, the hybrid method seems to be useful for continuous problems. It is very easy to implement for the colour constancy example given above, and has

been successfully used on a variety of other continuous problems (Neal, 1993).

*1.6.* *MCMC and Random Search in Vision*

Markov chain Monte Carlo has appeared in the vi- sion literature in various forms. One common use is to attempt to obtain an MAP estimate by random search, usually using the Metropolis-Hastings algo- rithm (e.g. Geman and Geman, 1984; Geman and Graffigne, 1986). The Markov random field model is a spatial model which gives a posterior on image la- bellings given measurements as a function of measure- ment values and local patterns of pixel labels (so-called

“clique potentials”; the topic is reviewed in Li (1995)).

A standard method for estimating MAP labellings is to use an annealed version of the Metropolis-Hastings algorithm, where the posterior to be sampled is a func- tion of a parameter that changes during the sampling process. This parameter is often thought of as tempera- ture; the intent is that for high values of the parameter, the posterior has only one mode, and as the temperature is reduced the state of the sampler will get stuck in that mode, thereby obtaining a global extremum. It is not possible to guarantee in practice that this occurs, and the algorithm has a rather mixed reputation (Collins et al., 1988; Golden and Skiscim, 1986).

The notion of using a sampling method to perform inference on a generative model of an image pattern appears to be due to Grenander (1983). Few successful examples appear in the literature. In Jolly et al. (1996), an annealing method is used to estimate an MAP so- lution for the configuration and motion of a motor car template in an image. In Zhu (1998), a random search method is used to find a medial axis transform. In Zhu et al. (2000), an MCMC method is used to find simple shapes and road signs. In Green (1996), MCMC is used to perform inference in various vision-like situations, including reconstruction from single photon emission computed tomography data and finding a polygonal template of a duck in heavy spatial noise. In Phillips and Smith (1996), inference is performed on a hierar- chical model to find faces, and a version of MCMC is used to find an unknown number of disks. Templates are used for restoration in Amit et al. (1991). Gibbs sam- plers are quite widely used for reconstruction (Geman and Geman, 1984; Geman and Graffigne, 1986; Zhu et al., 1998).

Random search is now a standard method for es- timating the fundamental matrix in structure from

motion problems; a review appears in Torr and Mur- ray (1997). RANSAC—an algorithm for robust fitting, due to Fischler and Bolles (1981) and appearing in the statistical literature as Rousseeuw (1987)—proposes small sets of correspondences uniformly at random, fits a fundamental matrix to each set, and accepts the set whose fit gives the largest number of correspondences with a sufficiently small residual. The number of sets is chosen to ensure some high probability that a correct set is found. The main advantage of an MCMC method over RANSAC is that an MCMC method can produce a series of hypotheses with meaningful semantics—

indicating, for example, the posterior probability that a particular point is an outlier, or the posterior proba- bility that a pair of measurements come from a single point.

**1.6.1. Particle Filtering (or Condensation, or**

* “Survival of the Fittest”) and Resampling.* The most
substantial impact of sampling algorithms in vision has
been the use of resampling algorithms in tracking. The

*best known algorithm is known as condensation in the*

*vision community (Blake and Isard, 1998), survival of*

*the fittest in the AI community (Kanazawa et al., 1995),*

*and particle filtering in the statistical signal processing*community, where it originated (Carpenter et al., 1999;

Kitagawa, 1987). A wide range of variants and of ap-
plications of particle filtering are described in a forth-
coming book (Doucet et al., 2001). This algorithm is a
modification of factored sampling: one draws samples
from a prior (which represents the state of the world up
*to the k*−1’th measurement), propagates these samples
through a dynamical model, and then weights them us-
*ing the posterior incorporating the k’th measurement.*

This set of weighted samples provides a representa- tion of the prior for the next iteration. The algorithm is fast and efficient, and is now quite widely applied for low-dimensional problems.

The attraction of resampling algorithms is that they can be used to incorporate “new information.” In track- ing applications, new information comes because a new frame, with new measurements, has arrived. New in- formation may come from other sources. In the colour constancy example, we assume that the algorithm is told that two patches in two different images are the same colour (this might occur because a recognition al- gorithm has a good match to the geometry, and knows the patches represent the same object). This informa- tion strongly constrains the inferred colours for other patches in each view (Section 3).

In recognition applications one often encounters some form of hierarchical model, which again suggests resampling. In Ioffe and Forsyth (1999), a sampler is used to label groups of image segments, using their con- sistency with observed human kinematics. The human model used has nine segments. It is foolish to attempt to label all nine segment groups; instead, their algo- rithm uses a sampler to label individual segments with a frequency proportional to the posterior probability of that label given the image data. The set of individual segment labels is resampled to propose pairs of labels for pairs of segments, and so on. In this case, the new information is the use of an enhanced prior; the prior for pairs of labels emphasizes pairs of segments that lie in particular configurations, a property that is meaning- less for single segments.

**2.** **Example: Large Scale Sampling for Bayesian**
**Structure from Motion**

Structure from motion is the problem of inferring some description of geometry from a sequence of images.

The problem has a long history and a huge literature;

space does not allow a comprehensive review, but see Beardsley et al. (1997), Faugeras et al. (1998), Faugeras and Robert (1996), Gool and Zisserman (1997), and Hartley and Zisserman (2000). Accurate solutions to structure from motion are attractive, because the tech- nique can be used to generate models for rendering vir- tual environments (e.g. Debevec et al., 1996; Faugeras et al., 1998; Gool and Zisserman, 1997; Tomasi and Kanade, 1992).

*2.1.* *Structure from Motion by Matrix Factorisation*
*Assume m distinct views of n points are given;*

correspondences are known. In the influential
Tomasi-Kanade formulation of structure from motion
(Tomasi and Kanade, 1992), these data are arranged
*into a 2m× n matrix of measurements D which must*
factor as *D = UV, where U represents the camera*
positions and*V represents point positions. An affine*
transform*A is determined such that UA minimises a*
set of constraints associated with a camera, and*A*^{−1}*V*
then represents Euclidean structure.

In practice, factorisation is achieved using a singu- lar value decomposition. This is a maximum likeli- hood method if an isotropic Gaussian error model is adopted; for an anisotropic Gaussian error model, see Morris and Kanade (1998). The formalism has been applied to various camera models (Poelman, 1993;

Tomasi and Kanade, 1992; Triggs, 1995); missing data points can be interpolated from known points (Jacobs, 1997; Tomasi and Kanade, 1992); methods for mo- tion segmentation exist (Costeira and Kanade, 1998);

and methods for lines and similar primitives are known (Morris and Kanade, 1998). There are noise estimates for recovered structure (Morris and Kanade, 1998).

These assume that errors in the estimates of struc- ture are independent, an assumption that the authors acknowledge is not always sustainable.

The factorisation method has one important weak- ness. Because the algorithm has two separate stages, it does not allow any payoff between model error—

the extent to which the recovered model violates the
required set of camera constraints—and measurement
error—the extent to which model predictions corre-
spond to data observations. This means that the model
cannot be used to identify measurement problems (for
example, tracker errors as in Fig. 5), and so is sub-
ject to reconstruction errors caused by incorporating
erroneous measurements. This is a property of the al-
gorithm, rather than of the problem; because*U and V*
have relatively few degrees of freedom compared with
*D, it should be possible to identify and ignore many*
unreliable measurements if the full force of the model
is employed. Recent work by Dellaert et al. has shown
how strongly the model constrains the data; they use a
sampling method to average over all correspondences,
weighting them by consistency with measured data, and
obtaining a satisfactory reconstruction. Their method
removes the need to compute correspondences from
structure from motion problems (Dellaert et al., 2000).

*2.2.* *The Posterior on Structure and Motion*

It is useful to think of Bayesian models as generative
models (e.g. Grenander, 1983). In a generative structure
from motion model, *U and V are drawn from appro-*
priate priors. Then*D is obtained by adding noise to*
*UV. We assume that noise is obtained from a mixture*
model; with some large probability, Gaussian noise is
used, and with a small probability, the measurement
value is replaced with a uniform random variable.

The priors on*U and V are obtained from constraints*
on camera structure. We do not fix the origin of the co-
ordinate system, and represent points in homogenous
coordinates, so our*U and V have dimensions 2m × 4*
and 4*× n respectively. We assume a scaled ortho-*
graphic viewing model with unknown scale that varies
from frame to frame.

All this yields a vector of constraint equations
**C****(U, V) = 0**

which contains elements of the form X3

*j*=1

*(u**i**, j**)*^{2}−
X3

*j*=1

*(u**i**+ m, j**)*^{2}

(expressing the fact that the camera basis consists of elements of the same length),

X3
*j*=1

*(u**i**, j**u**i**+ m, j**)*

(expressing the fact that the camera basis elements are perpendicular), and

*v**j**,4*− 1

(from the homogenous coordinates). A natural prior to use is proportional to

exp

µ**−C**^{T}* (U, V)C(U, V)*
2

*σ*constraint

^{2}

¶

This prior penalises violations of the constraints quite strongly, but allows constraint violations to be paid off one against the other. This approach is in essence a penalty method. An alternative is to insist that the prior is uniform if the constraints are all satis- fied and zero otherwise—in practice, this would in- volve constructing a parametrisation for the domain on which the prior is non-zero, and working with that parametrisation. This approach is numerically more complex to implement; it also has the disadvantage that one is imposing constraints that may, in fact, be violated (i.e. the scaled orthography model may not be sufficient; the imaging element may be misaligned with respect to the lens, so that the camera basis con- sists of elements of slightly different length, etc.).

We can now write a posterior model. Recall that the
noise process is a mixture of two processes: the first
adds Gaussian noise, and the second replaces the mea-
surement value with a uniform random variable. We
introduce a set of discrete mask bits, one per measure-
ment, in a matrix *M; these mask bits determine by*
which noise model a measurement is affected. A mask
bit will be 1 for a “good” measurement (i.e. one af-
fected by isotropic Gaussian noise), and 0 for a “bad”

measurement (i.e. one which contains no information about the model). These bits should be compared with the mask bits used in fitting mixture models using EM

(see the discussion in McLachlan and Krishnan (1996),
and with the boundary processes used in, among oth-
ers, Blake and Zisserman, 1987; Mumford and Shah,
1989). We introduce a prior on *M, π(M), which is*
*zero for matrices that have fewer than k non-zero ele-*
ments in some row or column, and uniform otherwise;

this prior ensures that we do not attempt inference for situations where we have insufficient measurements.

*The likelihood is then P(D|U, V, M), which is pro-*
portional to the exponential of

−½ X

*i**, j*

*(d**ij*−P

*k**u**ik**v**kj**)*^{2}*m**ij*

2*σ*meas^{2}

+*(1 − m**ij**)*
2*σ*_{bad}^{2}

¾

and the posterior is proportional to:

*P(D | U, V, M) × exp*

µ**−C**^{T}* (U, V)C(U, V)*
2

*σ*constraint

^{2}

¶
*π(M)*
Notice that the maximum of the posterior could well
not occur at the maximum of the likelihood, because
although the factorisation might fit the data well, the*U*
factor may satisfy the camera constraints poorly.

*2.3.* *Sampling the Structure from Motion Model*
This formulation contains both a discrete and a contin-
uous component. It is natural to consider using a Gibbs
sampler, sampling from the full conditional on point
positions given fixed camera positions, and from the
full conditional on camera positions given fixed point
positions. This works poorly, because the variables are
very highly correlated—a tiny shift in a point position
given fixed camera positions tends to result in a large
error. Instead, the continuous variables are sampled us-
ing the hybrid method described in Section 1.2; discrete
variables are sampled from the full conditional using
a strategy that proposes inverting 5% of the bits, ran-
domly chosen, at a time. Hybrid MCMC moves are pro-
posed with probability 0.7 and discrete variable moves
are proposed with probability 0.3.

**3.** **Example: Sampling an Unknown Number of**
**Components for Bayesian Colour Constancy**
The image appearance of a set of surfaces is af-
fected both by the reflectance of the surfaces and by
the spectral radiance of the illuminating light. Re-
covering a representation of the surface reflectance

*from image information is called colour constancy.*

Computational models customarily model surface re- flectances and illuminant spectra by a finite weighted sum of basis functions and use a variety of cues to recover reflectance, including (but not limited to!):

specular reflections (Lee, 1986); constant average re- flectance (Buchsbaum, 1980); illuminant spatial fre- quency (Land and McCann, 1971); low-dimensional families of surfaces (Maloney and Wandell, 1986) and physical constraints on reflectance and illumination coefficients (Forsyth, 1990; Finlayson, 1996). Each cue has well-known strengths and weaknesses. The most complete recent study appears to be Brainard and Freeman (1997), which uses the cues to make Bayesian decisions that maximise expected utility, and compares the quality of the decision; inaccurate decisions con- found recognition (Funt et al., 1998).

*3.1.* *The Probabilistic Model*

We assume that surfaces are flat, so that there is no shading variation due to surface orientation and no in- terreflection. There are four components to our model:

**• A viewing model: we assume a perspective view of**
a flat, frontal surface, with the focal point positioned
above the center of the surface. As spatial resolution
is not a major issue here, we work on a 50× 50 pixel
grid for speed.

**• A spatial model of surface reflectances: because**
spatial statistics is not the primary focus of this paper,
we use a model where reflectances are constant in a
grid of boxes, where the grid edges are not known
in advance. A natural improvement would be the
random polygon tesselation of Green (1996).

**• A spatial model of illumination: for the work de-**
scribed in this paper, we assume that there is a single
point source whose position is uniformly distributed
within a volume around the viewed surface.

**• A rendering model: which determines the recep-**
tor responses resulting from a particular choice of
illuminant and surface reflectance; this follows from
standard considerations.

* 3.1.1. The Rendering Model.* We model surface re-
flectances as a sum of basis functions

*φ*

*j*

*(λ), and as-*sume that reflectances are piecewise constant:

*s(x, y, λ) =*

*n**s*

X

*j*=0

*σ**j**(x, y)φ**j**(λ)*

Here *σ**j**(x, y) are a set of coefficients that vary over*
space according to the spatial model.

Similarly, we model illuminants as a sum of basis
functions *ψ**i* and assume that the spatial variation is
given by the presence of a single point source posi-
**tioned at p. The diffuse component due to the source**

*e*_{d}**(x, p, λ, p) = d(x, y, p)**

*n*_{e}

X

*i*=0

*²**i**ψ**i**(λ)*
where*²**i*are the coefficients of each basis function and
*d (x, y, p) is a gain term that represents the change in*
brightness of the source over the area viewed. The spec-
ular component due to the source is:

*e**m***(x, p, λ, p) = m(x, y, p)**

*n*_{e}

X

*i*=0

*²**i**ψ**i**(λ)*
*where m (x, y, p) is a gain term that represents the*
change in specular component over the area viewed.

*Standard considerations yield a model of the k’th*
receptor response as:

*p*_{k}* (x, y) = d(x, y, p)*X

*i**, j*

*g*_{i j k}*²**i**σ**j**(x, y)*

* + m(x, y, p)*X

*i*

*h**i k**²**i*

where

*g**ijk* =
Z

*ρk(λ)ψ**i**(λ)φ**i**(λ) dλ*

and

*h** _{ik}*=
Z

*ρk(λ)ψ**i**(λ) dλ*

and*ρ**k**(λ)is the sensitivity of the k’th receptor class.*

*The illuminant terms d (x, y, p) and m(x, y, p) follow*

*from the point source model; m*using Phong’s model of specularities.

**(x, y, p) is obtained**We write any prior probability distribution as*π. Our*
model of the process by which an image is generated
is then:

*• sample the number of reflectance steps in x and in y*
*(k**x**and k**y*respectively) from the prior*π(k**x**, k**y**) =*
*π(k**x**)π(k**y**).*

**• now sample the position of the steps (e***x* **and e***y* re-
spectively) from the prior

**π(e***x***, e***y**| k**x**, k**y***) = π(e***x**| k**x***)π(e***y**| k**y**);*

*• for each tile, sample the reflectance (σ*^{(m)}*for the m’th*
tile) for that tile from the prior*π(σ*^{(m)}*);*

*• sample the illuminant coefficients ² from the prior*
*π(²);*

**• sample the illuminant position p from the prior π(p);**

• and rendser the image, adding Gaussian noise of
known standard deviation*σ**cc* to the value of each
pixel.

This gives a likelihood,
*P*¡

image*| k**x**, k**y***, e***x***, e***y**, σ*^{(1)}*, . . . , σ*^{(k}^{x}^{k}^{y}^{)}* , ², p*¢
The posterior is proportional to:

*P* ¡

image*| k**x**, k**y***, e***x***, e***y**, σ*^{(1)}*, . . . , σ*^{(k}^{x}^{k}^{y}^{)}*, ²**i** , p*¢

**× π(e***x**| k**x***)π(e***y**| k**y**)π(k**x**)π(k**y**)*

*× π(²**i** )π(p)* Y

*m*∈ tiles

*π(σ*^{(m)}*)*

**3.1.2. Priors and Practicalities.****The spatial model:**

We specify the spatial model by giving the number of
*edges in the x and y direction separately, the position*
of the edges, and the reflectances within each block.

We assume that there are no more than seven edges (8 patches) within each direction, purely for efficiency.

The prior used is a Poisson distribution, censored to ensure that all values greater than seven have zero prior, and rescaled. Edge positions are chosen using a hard- core model: the first edge position is chosen uniformly;

the second is chosen uniformly, so that the number of pixels between it and the first is never fewer than five;

the third is chosen uniformly so that the number of pixels between it and the second and between it and the first is never fewer than five; and so on. This hard- core model ensures that edge are not so close together that pixel evidence between edges is moot.

**Priors for reflectance and illumination: Surface**
reflectance functions can never be less than zero, nor
greater than one. This means that the coefficients of
these functions lie in a compact convex set. It is easy
to obtain a representative subset of the family of planes
that bounds this set, by sampling the basis functions
at some set of wavelengths. Similarly, illuminant func-
tions can never be less than zero, meaning that the coef-
ficients of these functions lie in a convex cone. Again,
this cone is easily approximated. These constraints on
reflectance and illuminant coefficients are encoded in
the prior. We use a prior that is constant within the con-
straint set and falls off exponentially with an estimate
of distance from the constraint set. Because the con-
straint sets are convex, they can be expressed as a set
of linear inequalities; for surface reflectance we have

*C**s***σ +b > 0 and for illuminant we have C***i** ² > 0. If the*
coefficients in these inequalities are normalised (i.e. the
rows of the matrices are unit vectors), then the largest
negative value of these inequalities is an estimate of
distance to the constraint set.

We use six basis elements for illumination and re-
flectance so that we can have (for example) surfaces that
look different under one light and the same under an-
*other light. This phenomenon, known as metamerism,*
occurs in the real world; our exploration of ambigui-
*ties should represent the possibility. We represent sur-*
face colour by the colour of a surface rendered under a
known, white light.

*3.2.* *Sampling the Colour Constancy Model*

Proposals are made by a mixture of five distinct moves,
chosen at random. The probability of proposing a par-
ticular type of move is uniform, with the exception that
when there are no edges, no deaths are proposed, and
when the number of edges in a particular direction is at
a maximum, no births are proposed. An important ad-
*vantage to this approach is that, within each move, we*
can assume that the values of variables that we are not
changing are correct, and so apply standard algorithms
to estimate other values. Calculations are straightfor-
ward, along the lines of Green (1995).

* Moving the light: Proposals for a new x, y position*
for the light are obtained by filtering the image. We ap-
ply a filter whose kernel has the same shape as a typical

*specularity and a zero mean to the r , g and b compo-*nents separately; the responses are divided by mean intensity, and the sum of squared responses is rescaled to form a proposal distribution. The kernel itself is obtained by averaging a large number of speculari- ties obtained using draws from the prior on illuminant position. Using image data to construct proposal dis- tributions appears to lead to quite efficient samplers; it is also quite generally applicable, as Zhu et al. (2000) (who call it “data driven MCMC”) point out. Proposals

*for a move of the light in z are uniform, within a small*range of the current position. The real dataset has no specularities, and these moves have been demonstrated only for synthetic data.

**Birth of an edge: For each direction, we apply a**
derivative of Gaussian filter to the red, green and blue
components of the image and then divide the response
by a weighted average of the local intensity; the result
is squared and summed along the direction of interest.

This is normalised to 0.8, and 0.2 of a uniform distri-

bution is added. This process produces a proposal dis- tribution that has strong peaks at each edge, and at the specularity, but does not completely exclude any legal edge point (Fig. 2). Again, we are using image infor- mation to construct an appropriate proposal process.

For a given state, this proposal distribution is zeroed
for points close to existing edges (for consistency with
the hard core model), and a proposed new edge position
is chosen from the result. Once the position has been
**chosen, we must choose new reflectances for each of**
the new patches created by the birth of an edge. Gen-
erally, if we give the two new patches reflectances that
are similar to that of the old patch, we expect that there
will be only a small change in the posterior; this is
advantageous, because it encourages exploration. Cur-
rently, we average the receptor responses within each
new patch, and then use the (known) illuminant to es-
timate a reflectance that comes as close as possible
to achieving this average value, while lying within the
constraint set. We then add a Gaussian random variable
to the estimated reflectance value; currently, we use a
vector of independent Gaussian components each of
standard deviation 0.5 (the choice will depend on the
basis fitted).

**Death of an edge: The edge whose death is pro-**
posed is chosen uniformly at random. The death of an
edge causes pairs of surface patches to be fused; the
new reflectance for this fused region is obtained using
the same mechanism as for a birth (i.e. the receptor re-
sponses are averaged, the known illuminant is used to
estimate good reflectances for each patch, and a vector
of independent Gaussian components each of standard
deviation 0.5 is added to the result).

**Moving an edge: An edge to move is chosen uni-**
formly at random. Within the region of available points
(governed by the hard-core model—the edge cannot get
too close to the edges on either side of it) a new position
is proposed uniformly at random. This is somewhat in-
efficient, compared with the use of filter energies as a
proposal distribution. We use this mechanism to avoid a
problem posed by a hard-core model; it can be difficult
for a sampler to move out of the state where two edges
are placed close together and on either side of a real
edge. Neither edge can be moved to the real edge—the
other repels it—and a new edge cannot be proposed in
the right side; furthermore, there may be little advan-
tage in killing either of the two edges. Proposing uni-
form moves alleviates this problem by increasing the
possibility that one of the two edges will move away,
so that the other can move onto the right spot.

*Figure 2.* *The proposal distribution for edge birth in the x direction for the Mondrian image shown. The proposal distribution is obtained*
*by filtering the image, dividing the response by a weighted average of the local intensity, then summing down the y-direction. The result is*
normalised to 0.8, and 0.2 of a uniform distribution is added. Note that the filtering process leads to strong peaks near the edges; this means that
the proposal process is relatively efficient, but does not completely rule out edges away from strong responses, if other evidence can be found
for their presence (the likelihood component of the posterior).

**Change reflectance and illumination: It is tempt-**
ing to use a Gibbs sampler, but the chain moves ex-
tremely slowly if we do this. Instead, we sample re-
flectance and illumination simultaneously using the hy-
brid method of Section 1.2.

Poor behaviour by the Gibbs sampler can be ex- plained as follows. Assume that the sampler has burnt in, which means that the current choice of surface re- flectance and illuminant coefficients yields quite a good approximation to the original picture. Assume that we have fixed the surface reflectance coefficients and wish to change the illuminant coefficients. Now we expect that the normal distribution in illuminant coefficients has a mean somewhere close to the current value and a fairly narrow covariance, because any substantial change in the illuminant coefficients will lead to an image that is different from the original picture. This means that any change in the illuminant coefficients that results will be small. Similarly, if we fix the illu- minant coefficients and sample the surface reflectance coefficients, we expect that the changes that result will be small.

**4.** **Experimental Procedures**

In each case, the sampler can be started at a state chosen at random, or at a state chosen by a start procedure (described in more detail in Section 5.4). The main difference between these methods is that choosing a start point tends to lead to a sampler that appears to burn in more quickly.

*Figure 3.* *Left: a typical synthetic Mondrian, rendered using a linear intensity scale that thresholds the specularity. Center: the proposal*
*distribution for x and y position of the specularity, obtained by image filtering and shown with the highest value white. Right: a rendering of*
a typical sample for this case, using the sample’s illuminant; a successful sampler produces samples that look like the image. Results for real
images are shown in colour in Fig. 8.

*4.1.* *Structure from Motion*

Results are obtained using the hotel dataset, courtesy of the Modeling by Videotaping group in the Robotics Institute, Carnegie Mellon University. We report two types of experiment: in the first, the sampler is run on that dataset; in the second, some small percentage of the points in this dataset are replaced with uniform random numbers in the range of the image coordinates.

This represents large noise effects. Coordinates in this
dataset appear to lie in the range 1–512. The algorithm
appears to be quite well behaved for a rang of choices
of constant. Values for the constants for Figs. 5, 6, 9
and 10 are*σ*meas *= 1/*√

2,*σ*constraint *= 1/*√

5000;*σ*bad

should be slightly larger than *σ*meas (allowing points
to range some distance from the measurement before
the measurement has been disallowed) and we used
*σ*meas = √

5*∗ σ*constraint for these figures. Experience
suggests it is possible to use*σ*constraintvery much smaller
without apparently affecting the freedom with which
the sampler mixes.

*4.2.* *Colour Constancy*

As Fig. 3 indicates, the sampler runs on synthetic im- ages, and makes reasonable estimates of the position of the edges and the specularity and of illuminant and surface colours. In this case the basis and constraints are all known in advance. Applying the sampler to real data is more interesting. The data set shown in Fig. 8 consists of images originally used in Forsyth (1990).

These are images of the same set of patches on a Mondrian of coloured paper patches, photographed under white, blue, yellow, purple, red and cyan light.

There are no specularities, so we used a diffuse model for this data set.

The original data has been lost, so we used ver- sions scanned from the paper; these images were dis- played on a CRT, photographed from that display, sub- jected to four-colour printing and then scanned; it is remarkable that any constancy is possible under the circumstances. A basis was obtained using the bilin- ear fitting procedure of Marimont and Wandell (1992).

Determining appropriate constraint regions is more dif-
ficult; we obtained a natural coordinate system using
principal components, and then constructed a bound-
ing box in this coordinate system. The box was grown
10% along each axis, on the understanding that none
of the colours in the Mondrians of Forsyth (1990) were
very deeply saturated. The red, green and blue receptor
responses are represented by numbers in the range zero
to one; we use*σ**cc* *= 1/64, implying that only the top*
six bits in each receptor response are reliable.

**5.** **Assessing the Experimental Results**

Sections 2 and 3 phrased two standard vision problems as inference problems. These are quite nasty inference problems, with large numbers of both continuous and discrete variables. It is possible, as these sections indi- cated, to extract a representation of the posterior from these problems. Why do we believe that these repre- sentations are helpful? and how well do they compare with representations that other methods might offer?

Some cautions must be observed before making comparisons. Firstly, it is important to apply a real- ity check to the representations that the sampler pro- duces, to determine if there is reason to believe that the sampler has burnt-in. Secondly, comparing a represen- tation of a posterior given some data with the result of a method that reports a minimum error solution offers no more than a perfunctory error check. This is because the nature of the information produced by the two algo- rithms is different. The meaningful comparison is with other possible reports of the properties of the posterior.

Here, no “gold standard” tests are available; there are
no methods that are known to produce more accurate
representations of a posterior density against which we
*can test a sampler. However, we can compare the rep-*
resentation produced by the sampler to methods that
are significantly cheaper computationally.

*5.1.* *Reality Checks: Has the Sampler Burnt in and*
*is it Mixing?*

There are convergence diagnostics for MCMC methods (e.g. see Besag et al., 1995; Roberts, 1992), but these can suggest convergence where none exists; it is easy to produce a chain that can pass these tests without having burnt in. Instead, we rely on general methods. Firstly, we check to ensure that the sampler can move to a near- maximal value of the posterior from any start position within a reasonable number of moves. Secondly, we check that the state of the sampler moves freely about the domain that is represented. Third, we have built various consistency checks into the experiments.

* 5.1.1. Structure from Motion.* Figure 4 shows a se-
ries of samples drawn from the posterior for the struc-
ture from motion problem, with an indication of the
order in which the samples were drawn, indicating that
the sampler is mixing relatively well.

While the sampler’s mixing rate does appear to be sufficient to give a reasonable estimate of structure of the posterior around its mode, it is clear that the sam- pler does not move around the whole domain freely.

This posterior contains a discrete symmetry; for any
fixed value of the mask bits, one can multiply *U by*
a square root of the identity on the left and *V by a*
square root of the identity on the right, and obtain the
same value of the posterior. This creates no particular
difficulty in practice, because these solutions are very
widely isolated from one another. Our sampler does not
move from peak to peak, because the probability that
the hybrid method would obtain sufficient momentum
to cross the very large regions of very low probability is
effectively zero. This is in fact a desirable property; the
symmetry means that accurate estimates of the mean
value of*U and V would be zero.*

**Consistency checks: In general, we expect that a**
sampler that is behaving properly should be able to
identify correspondence errors and produce a stable
representation. There are in fact a number of subtle
tracker errors in the hotel sequence. Figure 5 shows that
the sampler can identify these tracker errors. Figure 6
illustrates that large tracker errors, artificially inserted
into the dataset for this purpose, can be identified, too.

* 5.1.2. Colour Constancy.* The sampler described
here has been run on many synthetic images where

“ground truth” is known, and in each case reaches a small neighbourhood of ground truth from a randomly

*Figure 4.* These plots illustrate the path taken through the state space by the structure from motion sampler. Each plot connects the position of
a given point in every tenth sample, starting at the 100th. The paths have been coded with a grey level for clarity; the early samples are light, and
the path moves through darker grey levels. The fact that these paths repeatedly cross themselves and return to the same regions suggests that the
sampler is mixing rather freely.

selected start point—i.e. “burns in”—within about 1000 samples. The experimental data shown below suggests the sampler mixes well, because of the wide spread on the marginal densities on the reflectances.

**Consistency checks: The sampler is run on six im-**
ages of the same scene, but the fact that these images
are of the same scene is not built into the model. The
spread of samples for surface reflectance coefficients
recovered for a particular surface in a particular image,
is quite wide (see Fig. 8). However, if we compare the
spread of samples for that surface for different images,
the clusters overlap. This means that the representa-
tion is correctly encoding the fact that these surfaces
could be very similar. In fact, as we shall see in Section
5.2, the representation encodes the fact that all surface
patches could be very similar.

*5.2.* *Attractive Properties of Sampled*
*Representations*

There are three attractive properties of the sampled rep- resentations we have derived:

• they provide a covariance estimate for inferred state;

• they can be resampled to incorporate new infor- mation;

• they appear to be stable to perturbations of the input data set.

We describe these properties below.

* 5.2.1. Covariance.* The samplers described produce a
representation of the posterior probability distribution,

*Figure 5.* Two (cropped) frames from the hotel sequence showing a single sample reconstruction. Squares correspond to measurements with
mask bit one (i.e. the measurement of that point in that frame is believed correct); a white cross on a dark background corresponds to a
measurement with mask bit zero (i.e. the measurement of that point in that frame is believed incorrect); grey diamonds correspond to model
predictions. The extent to which a diamond is centered within a square gives the extent to which a model prediction is supported by the data. In
the right frame, at several locations the tracker has skipped to another feature for unknown reasons. In each case the reconstruction identifies the
data point as being erroneous, and reprojects to a point in a significantly different position from the measurement reported by the tracker and
lying where a correct measurement would be as seen by the position relative to the surface texture on the object.

given a data set. A particularly attractive feature is that special datasets require no additional analysis. For ex- ample, if every element in the image has the same colour, we expect the colour constancy sampler to pro- duce a very wide spread of samples for the surface reflectance; similarly, if a structure from motion data set is obtained by a camera translating in its plane, the sampler will return a set of samples with substan- tial variance perpendicular to that plane without further ado. A second attractive feature is that both expecta- tions and marginal probability distributions are easily available: to compute an expectation of a function, we average that function’s value over the samples, and to compute a marginal, we drop irrelevant terms from the state of each sample.

Figure 7 illustrates the kind of information a sam- pler can produce for the structure from motion data; in

particular, the sampler reflects the scatter of possible inferred values for a single point.

Figure 8 show a set of typical results a sampler can produce from real images for the colour constancy problem. The spatial model identifies edges correctly.

Groups of samples drawn for the same surface re- flectance under different lights intersect, as we expect.

Furthermore, groups of samples drawn for different surface reflectances under the same light tend not to in- tersect, meaning that these surfaces are generally seen as different. The figure shows a rendering of samples under white light, to give some impression of the vari- ation in descriptions that results.

**5.2.2. Resampling to Incorporate New Information.**

Assume that we are engaged in colour constancy. We construct a representation of surface colour, and new

*Figure 6.* We perturb the hotel sequence by replacing 5% of the data points with draws from a uniform distribution in the image plane. The
Bayesian method, started as in Section 5.4.1, easily discounts these noise points; the figure shows the same frames in the sequence as in Fig. 5,
uncropped to show the noise but with a sample reconstruction indicated using the same notation as that figure.Squares correspond to measurements
with mask bit one (i.e. the measurement of that point in that frame is believed correct); a white cross on a dark background corresponds to
measurements with mask bit zero (i.e. the measurement of that point in that frame is believed incorrect); grey diamonds correspond to model
predictions. The extent to which a diamond is centered within a square gives the extent to which a model prediction is supported by the data.

*Figure 7.* Black points show an overhead view of a single sample of the 3D reconstruction obtained using 40 frames of 80 points in the hotel
sequence, rotated by hand to show the right-angled structure in the model indicating that the structure is qualitatively correct; the cloud of grey
points are samples of the position of a single point, scaled by 1000 to show the (very small) uncertainty available in a single point measurement.

information arrives—what do we do? If the representa- tion is probabilistic, the answer is (relatively) straight- forward; we adjust our representation to convey the posterior incorporating this new information. For ex-

ample, assume we have a sampled representation of the posterior for two distinct images. We are now told that a patch in one image is the same as a patch in another—

this should have an impact on our interpretation of both

images. The sampled representation is well suited to determining the effect of this information.

In particular, we have samples of
*P(σ**a**, state a | image a)*

and

*P(σ**b**, state b | image b)*

where we have suppressed the details of the rest of
the state in the notation. We interpret “the same” to
mean that each patch is a sample from a Gaussian dis-
tribution with some unknown mean *α and a known*
standard deviation. We would like to obtain samples
of

*P(α, state a, state b | image a, image b)*
(image a will be abbreviated as “im a”, etc.). Now we
have that

*P(im a, im b | state a, state b, α)*
is proportional to

Z µ*P(im a, state a | σ**a**)P(σ**a* *| α)×*

*P(im b, state b | σ**b**)P(σ**b**| α)*

¶

*dσ**a* *dσ**b**π(α)*

Now the term inside the integral is:

*P(state a, σ**a**, image a)*

*π(σ**a**)* × *P(state b,σ**b**, image b)*
*π(σ**b**)*

*× P(σ**b**| α)P(σ**a**| α)*
We have two sets of samples, P*a*

andP*b*

. We en-
sure that these samples are independent and identically
distributed by shuffling them (to remove the corre-
lations introduced by MCMC). This means that, for
*the conditional density for the i ’th sample, we have*

*P(*P*a*

*i* *| i) = P(state a, σ**a**, image a). Now we con-*
struct a new sampler, whose state is*{i, j, α}. We ensure*
this produces samples of the distribution

*5(i, j, α) =*

µ *P(σ**a**(i) | α)×*

*P(σ**a**( j) | α)π(α)*

¶

*π(σ**a**(i))π(σ**b**( j))*

*We now use the i ’s and j ’s as indexes to our previous set*
of samples. We can marginalise with respect to*σ**a*and
*σ**b* by simply dropping their values from the sample.

The result is a set of samples distributed according to the desired distribution:

Z µ*P(im a, state a| σ**a**)P(σ**a* *| α)×*

*P(im b, state b | σ**b**)P(σ**b* *| α)*

¶

*dσ**a* *dσ**b**π(α)*

Building a sampler that obtains samples of *{i, j, α}*

space according to the desired distribution involves technical difficulties beyond the scope of this paper.

The approach essentially chooses pairs consisting of a sample from the set for image a and a sample from the set for image b; these pairs are chosen with a frequency that is higher when the values inferred for a particular patch are similar. Of course, this trick extends to more images.

Figure 8 shows results obtained by assuming that a single surface patch in each of the six images is the same. Typically, a small number of sets of samples have a very much higher probability than all others, so that a sampled representation consists of a large num- ber of copies of these samples, interspersed with one or two others. This results in very much reduced vari- ance in the rendering of the patch that is known to be similar for the six images, because the error balls for this surface patch intersect in a relatively small region.

*However, this does not mean that the variance for the*
inferred reflectances for the other patches must be re-
*duced. It is reduced (Fig. 8), but this is because the*
representations recovered for each separate input im-
age (correctly) captures the possibility that each of the
surface patches is the same. This is another important
reality check that strongly suggests the sampled rep-
resentation is trustworthy: the algorithm has been able
to use information that one patch is the same in each
image to obtain a representation that strongly suggests
the other patches are the same, too.

**5.2.3. Stability of the Recovered Representations.**

Reconstructions cannot be compared on the basis of accuracy, because “ground truth” is not available.

However, we can demonstrate that sampled represen- tations are stable under various perturbations of their input. In structure from motion, small errors in tracker response for some points could lead to significant per- turbations of the reconstruction for all points, because the reconstructed point positions are not independent—

they are coupled by the reconstructed camera configu- rations.

Small errors in tracker response actually occur: in the 40 frames of the hotel sequence that we used, six point measurements in nine frames are affected