The Joy of Sampling

(1)

The Joy of Sampling

D.A. FORSYTH, J. HADDON AND S. IOFFE

Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, USA

daf@cs.berkeley.edu haddon@cs.berkeley.edu

ioffe@cs.berkeley.edu

Received November 12, 1999; Revised August 4, 2000; Accepted January 17, 2001

Abstract. A standard method for handling Bayesian models is to use Markov chain Monte Carlo methods to draw samples from the posterior. We demonstrate this method on two core problems in computer vision—structure from motion and colour constancy. These examples illustrate a samplers producing useful representations for very large problems. We demonstrate that the sampled representations are trustworthy, using consistency checks in the experimental design. The sampling solution to structure from motion is strictly better than the factorisation approach, because: it reports uncertainty on structure and position measurements in a direct way; it can identify tracking errors;

and its estimates of covariance in marginal point position are reliable. Our colour constancy solution is strictly better than competing approaches, because: it reports uncertainty on surface colour and illuminant measurements in a direct way; it incorporates all available constraints on surface reflectance and on illumination in a direct way;

and it integrates a spatial model of reflectance and illumination distribution with a rendering model in a natural way.

One advantage of a sampled representation is that it can be resampled to take into account other information. We demonstrate the effect of knowing that, in our colour constancy example, a surface viewed in two different images is in fact the same object. We conclude with a general discussion of the strengths and weaknesses of the sampling paradigm as a tool for computer vision.

Keywords: Markov chain Monte Carlo, colour constancy, structure from motion

1. Introduction

The Bayesian philosophy is that all information about a model is captured by a posterior distribution obtained using Bayes’ rule:

posterior= P(world | observations)

∝ P(observations | world)P(world) where the prior P(world) is the probability density of the state of the world in the absence of observations.

Many examples suggest that, when computational dif- ficulties can be sidestepped, the Bayesian philosophy leads to excellent and effective use of data (e.g. exposi- tions in Carlin and Louis (1996), Gelman et al. (1995), Grenander (1983), and Grenander (1993); examples of

the use of Bayesian inference in the vision literature include (Binford and Levitt, 1994; Chou and Brown, 1990; Huang et al., 1994; Jolly et al., 1996; Maybank and Sturm, 1999; Noble and Mundy, 1993; Pavlovic et al., 1999a, 1999b; Sarkar and Boyer, 1992, 1994;

Sullivan et al., 1999; Yuille and Coughlan, 1999; Zhu et al., 2000, among others).

A probability distribution is, in essence, a device for computing expectations. The problems we are in- terested in typically involve an important continuous component, meaning that computing expectations involves estimating integrals, usually over high dimensional domains. One useful technique is to represent the posterior by drawing a large number of samples from that distribution. These samples can then be used to estimate any expectation with respect to that posterior.

(2)

For example, if we wished to decide whether to fight with or flee from an attacker, we would draw samples from the posterior on the outcome and estimate expected utilities for each decision (as averages of the utilities over the samples) and then choose the decision with the best utility.

Sampling algorithms are more general than random search for MAP interpretations precisely because the results give an approximate representation of the en- tire posterior. This means that, for example, we can estimate the covariance of the posterior; resample the samples to incorporate new information; engage in multiple calculations for different decisions using dis- tinct utilities, etc. Sampling is in principle simple and general, if samples can be drawn from the posterior efficiently.

This paper demonstrates the strengths and weaknesses of sampling methods using two core vision problems as examples: structure from motion (Section 2);

and colour constancy (Section 3).

Notation: we write² for a vector, whose i’th compo- nent is²iandM for a matrix whose i, j’th component is M_ij. Sampler jargon that may be unfamiliar is shown in italics when first introduced.

1.1. Simple Sampling Algorithms

For some probability distributions, direct algorithms exist for drawing samples (e.g. Ripley, 1987); we are seldom lucky enough to have a posterior of this type.

Rejection sampling is appropriate for some distribu- tions. Assume that we wish to draw samples from p(x) and have a proposal distribution q(x), from which we can draw samples easily. Assume also we know some constant, k, such that kq(x) ≥ p(x) for all x; we can draw a sample from p(x) by drawing a sample x0from q(x), and then accepting the sample with probability p(x0)/(kq(x0)). In importance sampling, to draw a sample from p(x), we first draw a large number of independent samples{s1, . . . , sn} from a proposal dis- tribution q(x), and then set s = siwith probability proportional to wi = _q^p_(x)^(x). As n→ ∞, the distribution for the sample s will approach p(x). Both rejection sampling and importance sampling methods can be wildly inefficient if q(x) approximates p(x) poorly (the usual case in high dimensions); in some such cases, a collection of different q(x)’s can be pasted together to obtain a better approximation (e.g. Gamerman, 1997).

1.2. Markov Chain Monte Carlo—

the Metropolis-Hastings Algorithm

Markov chain Monte Carlo methods (Gamerman, 1997; Gilks et al., 1996c) are the standard methods for sampling complex distributions. In this method, one constructs a Markov chain whose stationary distribution is the target distribution. A new sample can be obtained from an old one, by advancing the Markov chain.

The Metropolis-Hastings algorithm is a technique for constructing a Markov chain that has a particular desired stationary distribution. Assume that we have a distributionπ from which we would like to generate samples. We would like to build a Markov chain which hasπ as a stationary distribution.

The algorithm will produce a sequence of samples X1, . . . , Xn, by taking a sample Xi and proposing a revised version, X_i⁰. The next element of the sequence Xi+1 will be X_i⁰ with probabilityα(Xi, Xi+1); other- wise, it will be Xi. We will give the form ofα below.

The proposal process is random, too. In particular, there is a proposal distribution which gives the prob- ability of proposing X_i⁰ from X_i. This can be written P(Xi → X⁰_i). Note that the proposal distribution is a function of X⁰_i, and may be a function of X_i. Now we assume that ifπ(u) is non-zero, then there are some valuesv such that P(v → u) is non-zero, too.

In this case, we have that

α = max µ

1,P(X⁰i→ Xi)π(Xi⁰) P(Xi→ Xi⁰)π(Xi)

¶

Notice that this expression is qualitatively sensible. If the chain is at a point whereπ has a very low value and at the new pointπ has a very high value and the forward and backward proposal probabilities are about equal, then the new point will be accepted with high probability. If the chain is at a point whereπ has a very high value and the proposal process has a high probability of suggesting points with a very low value ofπ, it is likely to stay at that point. Finally, if a point which has high value ofπ is proposed disproportionately often, it is less likely to be accepted.

A good way to think about the Metropolis-Hastings algorithm is that it is an improved version of the

“hypothesize and test” process that is common in vision. Metropolis-Hastings suggests various hypotheses which, depending on the result of a bookkeeping exer- cise, are accepted or rejected. This process yields the se- quence X1, . . . , Xn. However, for Metropolis-Hastings

(3)

the sequence of hypotheses has very significant semantics; assuming technical conditions on the proposal process (expounded in, for example, Gamerman, 1997;

Gilks et al., 1996a; Roberts, 1996; Tierney, 1996; these conditions are usually fairly easily met and all our samplers meet them), once sufficient iterations have completed, all subsequent Xiare samples drawn from π(X).

1.3. Burn-in and Mixing

Generally, an MCMC method needs to produce some number of samples to “forget” its start point. The number of iterations required to achieve this is often called the burn-in time. The burn-in may be extremely long for a poorly designed sampler. There are a very small number of samplers known to have a short burn-in time (e.g. Jerrum and Sinclair, 1996).

Once a sampler has burnt in, the sequence of samples it produces may or may not be correlated; if this correlation is low, the method is said to mix well. It is desirable to have an algorithm that burns in quickly, and mixes well. Burn-in and mixing are related to the dynamics of the underlying Markov chain. One way to show that a sampler mixes quickly is to prove that, for any decomposition of its domain into two disjoint sets A and B, the conditional probability that the sampler goes to B given it is in set A, is high. Such proofs of fast mixing exist for a small number of cases, but require substantial art (e.g. Jerrum and Sinclair, 1996).

We are aware of no proof that a sampler used for vision problems is fast mixing. In our examples, as in the vast majority of cases, the algorithm is used without such proofs; we show a variety of consistency checks that suggest that the algorithm has converged.

1.4. The Attractions of MCMC

It is known how to apply this algorithm when the domain of support is complicated (for example, samples may be drawn when the domain of support of the posterior consists of several different spaces of different dimensions, Green, 1995; Richardson and Green, 1997).

There are numerous variants to the basic algorithm, some of which combine deterministic dynamics with random search in the hope of better mixing (e.g. see the review in Neal (1993)).

The advantage of viewing Metropolis-Hastings algorithms as a souped up hypothesize and test process is that it suggests how to build proposal mechanisms.

A natural strategy is to take current vision algorithms and make them produce probabilistic outputs. This approach is illustrated in Section 3.2; Zhu et al. (2000) have used it successfully for some recognition problems. A really attractive feature is that we can use different, possibly incompatible algorithms as distinct sources of proposals, and the samples we obtain repre- sent the posterior incorporating all available measure- ments.

Quite often in practice it is easy to come up with a function f proportional to the posterior. In this case, the posterior is

R f

D f(u) du

but the integral—the normalizing constant—can be very difficult to compute (the best way to do it is to use a sampling method). An attractive feature of the Metropolis Hastings algorithm is that we need not know the normalizing constant for the distribution (because the constant is cancelled by the ratio).

1.5. Techniques for Building Practical MCMC Samplers

It is easy to build a sampler using the Metropolis- Hastings algorithm. It seems to be very hard to build a good sampler—one that burns in quickly, mixes well, and gives a trustworthy picture of the posterior—using that algorithm. We describe a variety of techniques for building samplers, and conclude with a discussion of possible sanity checks.

1.5.1. Gibbs Samplers. It is quite common to en- counter situations where the target distribution has a non standard form, but is standard when groups of variables have fixed values (this occurs in vision problems;

see Sections, 2.3 and 3.2). In this case, it is natural to adopt a proposal mechanism that fixes one set of variables and draws a sample from the full conditional distribution on the other set, and vice versa. This very useful technique is known as Gibbs sampling (named by Geman and Geman (1984) but apparently due to the statistical physics literature, where it was known as the heat bath algorithm, Gilks et al., 1996b, p. 12). Usu- ally, the group of variables to be sampled is chosen at random, and sufficient samples are drawn so that each group of variables is visited many times.

Gibbs sampling is very easy to implement. There is one considerable danger, which is often quite difficult to avoid. If the groups of variables are strongly

(4)

Figure 1. Correlated variables cause Gibbs samplers to behave badly. The figure on the top left shows 100 samples drawn from a Gibbs sampler for two independent normal random variables, one with variance one and the other with variance ten. The stars indicate the samples; the line segments indicate the order in which the samples were drawn. Note that the sampler makes quite large vertical moves (because the variance in this direction is large). The figure on the top right shows 100 samples drawn from this distribution, now rotated by 45^◦, using a Gibbs sampler.

In this case, the sampler can make only relatively small vertical and horizontal moves, and so the position of the samples changes relatively slowly; the 100 samples in the graph on the bottom left, which consist of those of the first graph rotated by 45^◦, give a much better picture of the distribution. On the bottom right, the x-coordinate for the samples drawn from the second sampler (solid line) and the x-coordinates of the third figure (dashed line). The solid curve (correctly) suggests that the samples drawn from the second sampler are quite strongly correlated.

correlated, then a Gibbs sampler can mix very badly indeed. The effect is well known (for a full discussion, see for example, Gilks and Roberts, 1996) and easily illustrated (see Fig. 1).

1.5.2. The Hybrid Monte Carlo Method. A common difficulty with sampling methods is that the state of the sampler appears to perform a slightly biased random walk. The difficulty with random walk is that it takes a long time to move any distance along a domain, meaning that if the sampler is started at a point a long way from the mode of the distribution, it will take a long

time before it reaches the mode. From our perspective, it is extremely important to have a representation of the distribution around the mode.

Hybrid Monte Carlo is a method for making proposals that causes the state of the sampler to move rather quickly to the mode, and then explore it. The method is due to Duane et al. (1987) (and described in detail in Neal (1993)). Write the state of the sampler as q.

The method requires that the target distribution can be written as

π(q) = exp{−U(q)}

(5)

Now let us think of U as a potential function; the state of the sampler will be the state of a particle of mass m subject to this potential function. This state can be determined by considering the momen- tum of the particle p and writing a Hamiltonian for the particle:

H(q, p) = U(q) +p^Tp 2m

We now need to integrate Hamilton’s equations

∂q

∂t = 1 mp

∂p

∂t = −∇qU

to determine the state of the particle. This temporary excursion into mechanics is actually justified, because we can exponentiate the negative Hamiltonian of the particle to get

π⁰(q, p) = exp{−H(q, p)}

= π(q) exp

½

−p^Tp 2m

¾

which is a new target distribution for a larger set of random variables. We now have two proposal moves:

1. Advance time in our particle model by some randomly chosen amount, either forwards or back- wards. This updates both q and p. As long as we use a symplectic integrator, the extent of the advance is uniform and random, and the choice of forward or backward is random, the accept probability is one.

2. Fix q and draw a sample for p from the full con- ditional. This is easy, because the full conditional distribution in p is normal and is independent of q.

This sampler has very attractive qualitative behaviour.

If the state is at a relatively large value of U , then the first type of move will travel quickly down the gradient of U to smaller values, while building up momentum.

But the second move then discards this momentum;

so we have a sampler that should move quickly to a mode—where U is small—and then move around ex- ploring the mode under the influence of the random choice of momenta. Good values of the particle’s mass and of the range of time values must be chosen by experiment.

In practice, the hybrid method seems to be useful for continuous problems. It is very easy to implement for the colour constancy example given above, and has

been successfully used on a variety of other continuous problems (Neal, 1993).

1.6. MCMC and Random Search in Vision

Markov chain Monte Carlo has appeared in the vision literature in various forms. One common use is to attempt to obtain an MAP estimate by random search, usually using the Metropolis-Hastings algorithm (e.g. Geman and Geman, 1984; Geman and Graffigne, 1986). The Markov random field model is a spatial model which gives a posterior on image labellings given measurements as a function of measurement values and local patterns of pixel labels (so-called

“clique potentials”; the topic is reviewed in Li (1995)).

A standard method for estimating MAP labellings is to use an annealed version of the Metropolis-Hastings algorithm, where the posterior to be sampled is a function of a parameter that changes during the sampling process. This parameter is often thought of as temperature; the intent is that for high values of the parameter, the posterior has only one mode, and as the temperature is reduced the state of the sampler will get stuck in that mode, thereby obtaining a global extremum. It is not possible to guarantee in practice that this occurs, and the algorithm has a rather mixed reputation (Collins et al., 1988; Golden and Skiscim, 1986).

The notion of using a sampling method to perform inference on a generative model of an image pattern appears to be due to Grenander (1983). Few successful examples appear in the literature. In Jolly et al. (1996), an annealing method is used to estimate an MAP solution for the configuration and motion of a motor car template in an image. In Zhu (1998), a random search method is used to find a medial axis transform. In Zhu et al. (2000), an MCMC method is used to find simple shapes and road signs. In Green (1996), MCMC is used to perform inference in various vision-like situations, including reconstruction from single photon emission computed tomography data and finding a polygonal template of a duck in heavy spatial noise. In Phillips and Smith (1996), inference is performed on a hierarchical model to find faces, and a version of MCMC is used to find an unknown number of disks. Templates are used for restoration in Amit et al. (1991). Gibbs samplers are quite widely used for reconstruction (Geman and Geman, 1984; Geman and Graffigne, 1986; Zhu et al., 1998).

Random search is now a standard method for estimating the fundamental matrix in structure from

(6)

motion problems; a review appears in Torr and Mur- ray (1997). RANSAC—an algorithm for robust fitting, due to Fischler and Bolles (1981) and appearing in the statistical literature as Rousseeuw (1987)—proposes small sets of correspondences uniformly at random, fits a fundamental matrix to each set, and accepts the set whose fit gives the largest number of correspondences with a sufficiently small residual. The number of sets is chosen to ensure some high probability that a correct set is found. The main advantage of an MCMC method over RANSAC is that an MCMC method can produce a series of hypotheses with meaningful semantics—

indicating, for example, the posterior probability that a particular point is an outlier, or the posterior probability that a pair of measurements come from a single point.

1.6.1. Particle Filtering (or Condensation, or

“Survival of the Fittest”) and Resampling. The most substantial impact of sampling algorithms in vision has been the use of resampling algorithms in tracking. The best known algorithm is known as condensation in the vision community (Blake and Isard, 1998), survival of the fittest in the AI community (Kanazawa et al., 1995), and particle filtering in the statistical signal processing community, where it originated (Carpenter et al., 1999;

Kitagawa, 1987). A wide range of variants and of applications of particle filtering are described in a forth- coming book (Doucet et al., 2001). This algorithm is a modification of factored sampling: one draws samples from a prior (which represents the state of the world up to the k−1’th measurement), propagates these samples through a dynamical model, and then weights them us- ing the posterior incorporating the k’th measurement.

This set of weighted samples provides a representation of the prior for the next iteration. The algorithm is fast and efficient, and is now quite widely applied for low-dimensional problems.

The attraction of resampling algorithms is that they can be used to incorporate “new information.” In tracking applications, new information comes because a new frame, with new measurements, has arrived. New information may come from other sources. In the colour constancy example, we assume that the algorithm is told that two patches in two different images are the same colour (this might occur because a recognition algorithm has a good match to the geometry, and knows the patches represent the same object). This information strongly constrains the inferred colours for other patches in each view (Section 3).

In recognition applications one often encounters some form of hierarchical model, which again suggests resampling. In Ioffe and Forsyth (1999), a sampler is used to label groups of image segments, using their consistency with observed human kinematics. The human model used has nine segments. It is foolish to attempt to label all nine segment groups; instead, their algorithm uses a sampler to label individual segments with a frequency proportional to the posterior probability of that label given the image data. The set of individual segment labels is resampled to propose pairs of labels for pairs of segments, and so on. In this case, the new information is the use of an enhanced prior; the prior for pairs of labels emphasizes pairs of segments that lie in particular configurations, a property that is meaning- less for single segments.

2. Example: Large Scale Sampling for Bayesian Structure from Motion

Structure from motion is the problem of inferring some description of geometry from a sequence of images.

The problem has a long history and a huge literature;

space does not allow a comprehensive review, but see Beardsley et al. (1997), Faugeras et al. (1998), Faugeras and Robert (1996), Gool and Zisserman (1997), and Hartley and Zisserman (2000). Accurate solutions to structure from motion are attractive, because the technique can be used to generate models for rendering vir- tual environments (e.g. Debevec et al., 1996; Faugeras et al., 1998; Gool and Zisserman, 1997; Tomasi and Kanade, 1992).

2.1. Structure from Motion by Matrix Factorisation Assume m distinct views of n points are given;

correspondences are known. In the influential Tomasi-Kanade formulation of structure from motion (Tomasi and Kanade, 1992), these data are arranged into a 2m× n matrix of measurements D which must factor as D = UV, where U represents the camera positions andV represents point positions. An affine transformA is determined such that UA minimises a set of constraints associated with a camera, andA⁻¹V then represents Euclidean structure.

In practice, factorisation is achieved using a singu- lar value decomposition. This is a maximum likelihood method if an isotropic Gaussian error model is adopted; for an anisotropic Gaussian error model, see Morris and Kanade (1998). The formalism has been applied to various camera models (Poelman, 1993;

(7)

Tomasi and Kanade, 1992; Triggs, 1995); missing data points can be interpolated from known points (Jacobs, 1997; Tomasi and Kanade, 1992); methods for motion segmentation exist (Costeira and Kanade, 1998);

and methods for lines and similar primitives are known (Morris and Kanade, 1998). There are noise estimates for recovered structure (Morris and Kanade, 1998).

These assume that errors in the estimates of structure are independent, an assumption that the authors acknowledge is not always sustainable.

The factorisation method has one important weak- ness. Because the algorithm has two separate stages, it does not allow any payoff between model error—

the extent to which the recovered model violates the required set of camera constraints—and measurement error—the extent to which model predictions correspond to data observations. This means that the model cannot be used to identify measurement problems (for example, tracker errors as in Fig. 5), and so is subject to reconstruction errors caused by incorporating erroneous measurements. This is a property of the algorithm, rather than of the problem; becauseU and V have relatively few degrees of freedom compared with D, it should be possible to identify and ignore many unreliable measurements if the full force of the model is employed. Recent work by Dellaert et al. has shown how strongly the model constrains the data; they use a sampling method to average over all correspondences, weighting them by consistency with measured data, and obtaining a satisfactory reconstruction. Their method removes the need to compute correspondences from structure from motion problems (Dellaert et al., 2000).

2.2. The Posterior on Structure and Motion

It is useful to think of Bayesian models as generative models (e.g. Grenander, 1983). In a generative structure from motion model, U and V are drawn from appro- priate priors. ThenD is obtained by adding noise to UV. We assume that noise is obtained from a mixture model; with some large probability, Gaussian noise is used, and with a small probability, the measurement value is replaced with a uniform random variable.

The priors onU and V are obtained from constraints on camera structure. We do not fix the origin of the coordinate system, and represent points in homogenous coordinates, so ourU and V have dimensions 2m × 4 and 4× n respectively. We assume a scaled ortho- graphic viewing model with unknown scale that varies from frame to frame.

All this yields a vector of constraint equations C(U, V) = 0

which contains elements of the form X3

j=1

(ui, j)²− X3

j=1

(ui+ m, j)²

(expressing the fact that the camera basis consists of elements of the same length),

X3 j=1

(ui, jui+ m, j)

(expressing the fact that the camera basis elements are perpendicular), and

vj,4− 1

(from the homogenous coordinates). A natural prior to use is proportional to

exp

µ−C^T(U, V)C(U, V) 2σconstraint²

¶

This prior penalises violations of the constraints quite strongly, but allows constraint violations to be paid off one against the other. This approach is in essence a penalty method. An alternative is to insist that the prior is uniform if the constraints are all satis- fied and zero otherwise—in practice, this would involve constructing a parametrisation for the domain on which the prior is non-zero, and working with that parametrisation. This approach is numerically more complex to implement; it also has the disadvantage that one is imposing constraints that may, in fact, be violated (i.e. the scaled orthography model may not be sufficient; the imaging element may be misaligned with respect to the lens, so that the camera basis consists of elements of slightly different length, etc.).

We can now write a posterior model. Recall that the noise process is a mixture of two processes: the first adds Gaussian noise, and the second replaces the measurement value with a uniform random variable. We introduce a set of discrete mask bits, one per measurement, in a matrix M; these mask bits determine by which noise model a measurement is affected. A mask bit will be 1 for a “good” measurement (i.e. one affected by isotropic Gaussian noise), and 0 for a “bad”

measurement (i.e. one which contains no information about the model). These bits should be compared with the mask bits used in fitting mixture models using EM

(8)

(see the discussion in McLachlan and Krishnan (1996), and with the boundary processes used in, among others, Blake and Zisserman, 1987; Mumford and Shah, 1989). We introduce a prior on M, π(M), which is zero for matrices that have fewer than k non-zero ele- ments in some row or column, and uniform otherwise;

this prior ensures that we do not attempt inference for situations where we have insufficient measurements.

The likelihood is then P(D|U, V, M), which is pro- portional to the exponential of

−½ X

i, j

(dij−P

kuikvkj)²mij

2σmeas²

+(1 − mij) 2σ_bad²

¾

and the posterior is proportional to:

P(D | U, V, M) × exp

µ−C^T(U, V)C(U, V) 2σconstraint²

¶ π(M) Notice that the maximum of the posterior could well not occur at the maximum of the likelihood, because although the factorisation might fit the data well, theU factor may satisfy the camera constraints poorly.

2.3. Sampling the Structure from Motion Model This formulation contains both a discrete and a continuous component. It is natural to consider using a Gibbs sampler, sampling from the full conditional on point positions given fixed camera positions, and from the full conditional on camera positions given fixed point positions. This works poorly, because the variables are very highly correlated—a tiny shift in a point position given fixed camera positions tends to result in a large error. Instead, the continuous variables are sampled using the hybrid method described in Section 1.2; discrete variables are sampled from the full conditional using a strategy that proposes inverting 5% of the bits, randomly chosen, at a time. Hybrid MCMC moves are proposed with probability 0.7 and discrete variable moves are proposed with probability 0.3.

3. Example: Sampling an Unknown Number of Components for Bayesian Colour Constancy The image appearance of a set of surfaces is affected both by the reflectance of the surfaces and by the spectral radiance of the illuminating light. Re- covering a representation of the surface reflectance

from image information is called colour constancy.

Computational models customarily model surface reflectances and illuminant spectra by a finite weighted sum of basis functions and use a variety of cues to recover reflectance, including (but not limited to!):

specular reflections (Lee, 1986); constant average reflectance (Buchsbaum, 1980); illuminant spatial frequency (Land and McCann, 1971); low-dimensional families of surfaces (Maloney and Wandell, 1986) and physical constraints on reflectance and illumination coefficients (Forsyth, 1990; Finlayson, 1996). Each cue has well-known strengths and weaknesses. The most complete recent study appears to be Brainard and Freeman (1997), which uses the cues to make Bayesian decisions that maximise expected utility, and compares the quality of the decision; inaccurate decisions con- found recognition (Funt et al., 1998).

3.1. The Probabilistic Model

We assume that surfaces are flat, so that there is no shading variation due to surface orientation and no in- terreflection. There are four components to our model:

• A viewing model: we assume a perspective view of a flat, frontal surface, with the focal point positioned above the center of the surface. As spatial resolution is not a major issue here, we work on a 50× 50 pixel grid for speed.

• A spatial model of surface reflectances: because spatial statistics is not the primary focus of this paper, we use a model where reflectances are constant in a grid of boxes, where the grid edges are not known in advance. A natural improvement would be the random polygon tesselation of Green (1996).

• A spatial model of illumination: for the work de- scribed in this paper, we assume that there is a single point source whose position is uniformly distributed within a volume around the viewed surface.

• A rendering model: which determines the recep- tor responses resulting from a particular choice of illuminant and surface reflectance; this follows from standard considerations.

3.1.1. The Rendering Model. We model surface reflectances as a sum of basis functionsφj(λ), and assume that reflectances are piecewise constant:

s(x, y, λ) =

ns

X

j=0

σj(x, y)φj(λ)

(9)

Here σj(x, y) are a set of coefficients that vary over space according to the spatial model.

Similarly, we model illuminants as a sum of basis functions ψi and assume that the spatial variation is given by the presence of a single point source posi- tioned at p. The diffuse component due to the source

e_d(x, p, λ, p) = d(x, y, p)

n_e

X

i=0

²iψi(λ) where²iare the coefficients of each basis function and d(x, y, p) is a gain term that represents the change in brightness of the source over the area viewed. The specular component due to the source is:

em(x, p, λ, p) = m(x, y, p)

n_e

X

i=0

²iψi(λ) where m(x, y, p) is a gain term that represents the change in specular component over the area viewed.

Standard considerations yield a model of the k’th receptor response as:

p_k(x, y) = d(x, y, p)X

i, j

g_{i j k}²iσj(x, y)

+ m(x, y, p)X

i

hi k²i

where

gijk = Z

ρk(λ)ψi(λ)φi(λ) dλ

and

h_ik= Z

ρk(λ)ψi(λ) dλ

andρk(λ)is the sensitivity of the k’th receptor class.

The illuminant terms d(x, y, p) and m(x, y, p) follow from the point source model; m(x, y, p) is obtained using Phong’s model of specularities.

We write any prior probability distribution asπ. Our model of the process by which an image is generated is then:

• sample the number of reflectance steps in x and in y (kxand kyrespectively) from the priorπ(kx, ky) = π(kx)π(ky).

• now sample the position of the steps (ex and ey respectively) from the prior

π(ex, ey| kx, ky) = π(ex| kx)π(ey| ky);

• for each tile, sample the reflectance (σ^(m)for the m’th tile) for that tile from the priorπ(σ^(m));

• sample the illuminant coefficients ² from the prior π(²);

• sample the illuminant position p from the prior π(p);

• and rendser the image, adding Gaussian noise of known standard deviationσcc to the value of each pixel.

This gives a likelihood, P¡

image| kx, ky, ex, ey, σ⁽¹⁾, . . . , σ^(k^x^k^y⁾, ², p¢ The posterior is proportional to:

P ¡

image| kx, ky, ex, ey, σ⁽¹⁾, . . . , σ^(k^x^k^y⁾, ²i, p¢

× π(ex| kx)π(ey| ky)π(kx)π(ky)

× π(²i)π(p) Y

m∈ tiles

π(σ^(m))

3.1.2. Priors and Practicalities. The spatial model:

We specify the spatial model by giving the number of edges in the x and y direction separately, the position of the edges, and the reflectances within each block.

We assume that there are no more than seven edges (8 patches) within each direction, purely for efficiency.

The prior used is a Poisson distribution, censored to ensure that all values greater than seven have zero prior, and rescaled. Edge positions are chosen using a hard- core model: the first edge position is chosen uniformly;

the second is chosen uniformly, so that the number of pixels between it and the first is never fewer than five;

the third is chosen uniformly so that the number of pixels between it and the second and between it and the first is never fewer than five; and so on. This hard- core model ensures that edge are not so close together that pixel evidence between edges is moot.

Priors for reflectance and illumination: Surface reflectance functions can never be less than zero, nor greater than one. This means that the coefficients of these functions lie in a compact convex set. It is easy to obtain a representative subset of the family of planes that bounds this set, by sampling the basis functions at some set of wavelengths. Similarly, illuminant functions can never be less than zero, meaning that the coefficients of these functions lie in a convex cone. Again, this cone is easily approximated. These constraints on reflectance and illuminant coefficients are encoded in the prior. We use a prior that is constant within the constraint set and falls off exponentially with an estimate of distance from the constraint set. Because the constraint sets are convex, they can be expressed as a set of linear inequalities; for surface reflectance we have

(10)

Csσ +b > 0 and for illuminant we have Ci² > 0. If the coefficients in these inequalities are normalised (i.e. the rows of the matrices are unit vectors), then the largest negative value of these inequalities is an estimate of distance to the constraint set.

We use six basis elements for illumination and reflectance so that we can have (for example) surfaces that look different under one light and the same under an- other light. This phenomenon, known as metamerism, occurs in the real world; our exploration of ambigui- ties should represent the possibility. We represent sur- face colour by the colour of a surface rendered under a known, white light.

3.2. Sampling the Colour Constancy Model

Proposals are made by a mixture of five distinct moves, chosen at random. The probability of proposing a particular type of move is uniform, with the exception that when there are no edges, no deaths are proposed, and when the number of edges in a particular direction is at a maximum, no births are proposed. An important ad- vantage to this approach is that, within each move, we can assume that the values of variables that we are not changing are correct, and so apply standard algorithms to estimate other values. Calculations are straightfor- ward, along the lines of Green (1995).

Moving the light: Proposals for a new x, y position for the light are obtained by filtering the image. We apply a filter whose kernel has the same shape as a typical specularity and a zero mean to the r , g and b compo- nents separately; the responses are divided by mean intensity, and the sum of squared responses is rescaled to form a proposal distribution. The kernel itself is obtained by averaging a large number of specularities obtained using draws from the prior on illuminant position. Using image data to construct proposal distributions appears to lead to quite efficient samplers; it is also quite generally applicable, as Zhu et al. (2000) (who call it “data driven MCMC”) point out. Proposals for a move of the light in z are uniform, within a small range of the current position. The real dataset has no specularities, and these moves have been demonstrated only for synthetic data.

Birth of an edge: For each direction, we apply a derivative of Gaussian filter to the red, green and blue components of the image and then divide the response by a weighted average of the local intensity; the result is squared and summed along the direction of interest.

This is normalised to 0.8, and 0.2 of a uniform distri-

bution is added. This process produces a proposal distribution that has strong peaks at each edge, and at the specularity, but does not completely exclude any legal edge point (Fig. 2). Again, we are using image information to construct an appropriate proposal process.

For a given state, this proposal distribution is zeroed for points close to existing edges (for consistency with the hard core model), and a proposed new edge position is chosen from the result. Once the position has been chosen, we must choose new reflectances for each of the new patches created by the birth of an edge. Gen- erally, if we give the two new patches reflectances that are similar to that of the old patch, we expect that there will be only a small change in the posterior; this is advantageous, because it encourages exploration. Cur- rently, we average the receptor responses within each new patch, and then use the (known) illuminant to estimate a reflectance that comes as close as possible to achieving this average value, while lying within the constraint set. We then add a Gaussian random variable to the estimated reflectance value; currently, we use a vector of independent Gaussian components each of standard deviation 0.5 (the choice will depend on the basis fitted).

Death of an edge: The edge whose death is pro- posed is chosen uniformly at random. The death of an edge causes pairs of surface patches to be fused; the new reflectance for this fused region is obtained using the same mechanism as for a birth (i.e. the receptor responses are averaged, the known illuminant is used to estimate good reflectances for each patch, and a vector of independent Gaussian components each of standard deviation 0.5 is added to the result).

Moving an edge: An edge to move is chosen uni- formly at random. Within the region of available points (governed by the hard-core model—the edge cannot get too close to the edges on either side of it) a new position is proposed uniformly at random. This is somewhat inefficient, compared with the use of filter energies as a proposal distribution. We use this mechanism to avoid a problem posed by a hard-core model; it can be difficult for a sampler to move out of the state where two edges are placed close together and on either side of a real edge. Neither edge can be moved to the real edge—the other repels it—and a new edge cannot be proposed in the right side; furthermore, there may be little advantage in killing either of the two edges. Proposing uniform moves alleviates this problem by increasing the possibility that one of the two edges will move away, so that the other can move onto the right spot.

(11)

Figure 2. The proposal distribution for edge birth in the x direction for the Mondrian image shown. The proposal distribution is obtained by filtering the image, dividing the response by a weighted average of the local intensity, then summing down the y-direction. The result is normalised to 0.8, and 0.2 of a uniform distribution is added. Note that the filtering process leads to strong peaks near the edges; this means that the proposal process is relatively efficient, but does not completely rule out edges away from strong responses, if other evidence can be found for their presence (the likelihood component of the posterior).

(12)

Change reflectance and illumination: It is tempt- ing to use a Gibbs sampler, but the chain moves extremely slowly if we do this. Instead, we sample reflectance and illumination simultaneously using the hybrid method of Section 1.2.

Poor behaviour by the Gibbs sampler can be ex- plained as follows. Assume that the sampler has burnt in, which means that the current choice of surface reflectance and illuminant coefficients yields quite a good approximation to the original picture. Assume that we have fixed the surface reflectance coefficients and wish to change the illuminant coefficients. Now we expect that the normal distribution in illuminant coefficients has a mean somewhere close to the current value and a fairly narrow covariance, because any substantial change in the illuminant coefficients will lead to an image that is different from the original picture. This means that any change in the illuminant coefficients that results will be small. Similarly, if we fix the illuminant coefficients and sample the surface reflectance coefficients, we expect that the changes that result will be small.

4. Experimental Procedures

In each case, the sampler can be started at a state chosen at random, or at a state chosen by a start procedure (described in more detail in Section 5.4). The main difference between these methods is that choosing a start point tends to lead to a sampler that appears to burn in more quickly.

Figure 3. Left: a typical synthetic Mondrian, rendered using a linear intensity scale that thresholds the specularity. Center: the proposal distribution for x and y position of the specularity, obtained by image filtering and shown with the highest value white. Right: a rendering of a typical sample for this case, using the sample’s illuminant; a successful sampler produces samples that look like the image. Results for real images are shown in colour in Fig. 8.

4.1. Structure from Motion

Results are obtained using the hotel dataset, courtesy of the Modeling by Videotaping group in the Robotics Institute, Carnegie Mellon University. We report two types of experiment: in the first, the sampler is run on that dataset; in the second, some small percentage of the points in this dataset are replaced with uniform random numbers in the range of the image coordinates.

This represents large noise effects. Coordinates in this dataset appear to lie in the range 1–512. The algorithm appears to be quite well behaved for a rang of choices of constant. Values for the constants for Figs. 5, 6, 9 and 10 areσmeas = 1/√

2,σconstraint = 1/√

5000;σbad

should be slightly larger than σmeas (allowing points to range some distance from the measurement before the measurement has been disallowed) and we used σmeas = √

5∗ σconstraint for these figures. Experience suggests it is possible to useσconstraintvery much smaller without apparently affecting the freedom with which the sampler mixes.

4.2. Colour Constancy

As Fig. 3 indicates, the sampler runs on synthetic images, and makes reasonable estimates of the position of the edges and the specularity and of illuminant and surface colours. In this case the basis and constraints are all known in advance. Applying the sampler to real data is more interesting. The data set shown in Fig. 8 consists of images originally used in Forsyth (1990).

(13)

These are images of the same set of patches on a Mondrian of coloured paper patches, photographed under white, blue, yellow, purple, red and cyan light.

There are no specularities, so we used a diffuse model for this data set.

The original data has been lost, so we used ver- sions scanned from the paper; these images were dis- played on a CRT, photographed from that display, sub- jected to four-colour printing and then scanned; it is remarkable that any constancy is possible under the circumstances. A basis was obtained using the bilin- ear fitting procedure of Marimont and Wandell (1992).

Determining appropriate constraint regions is more difficult; we obtained a natural coordinate system using principal components, and then constructed a bound- ing box in this coordinate system. The box was grown 10% along each axis, on the understanding that none of the colours in the Mondrians of Forsyth (1990) were very deeply saturated. The red, green and blue receptor responses are represented by numbers in the range zero to one; we useσcc = 1/64, implying that only the top six bits in each receptor response are reliable.

5. Assessing the Experimental Results

Sections 2 and 3 phrased two standard vision problems as inference problems. These are quite nasty inference problems, with large numbers of both continuous and discrete variables. It is possible, as these sections indicated, to extract a representation of the posterior from these problems. Why do we believe that these representations are helpful? and how well do they compare with representations that other methods might offer?

Some cautions must be observed before making comparisons. Firstly, it is important to apply a reality check to the representations that the sampler produces, to determine if there is reason to believe that the sampler has burnt-in. Secondly, comparing a representation of a posterior given some data with the result of a method that reports a minimum error solution offers no more than a perfunctory error check. This is because the nature of the information produced by the two algorithms is different. The meaningful comparison is with other possible reports of the properties of the posterior.

Here, no “gold standard” tests are available; there are no methods that are known to produce more accurate representations of a posterior density against which we can test a sampler. However, we can compare the rep- resentation produced by the sampler to methods that are significantly cheaper computationally.

5.1. Reality Checks: Has the Sampler Burnt in and is it Mixing?

There are convergence diagnostics for MCMC methods (e.g. see Besag et al., 1995; Roberts, 1992), but these can suggest convergence where none exists; it is easy to produce a chain that can pass these tests without having burnt in. Instead, we rely on general methods. Firstly, we check to ensure that the sampler can move to a near- maximal value of the posterior from any start position within a reasonable number of moves. Secondly, we check that the state of the sampler moves freely about the domain that is represented. Third, we have built various consistency checks into the experiments.

5.1.1. Structure from Motion. Figure 4 shows a series of samples drawn from the posterior for the structure from motion problem, with an indication of the order in which the samples were drawn, indicating that the sampler is mixing relatively well.

While the sampler’s mixing rate does appear to be sufficient to give a reasonable estimate of structure of the posterior around its mode, it is clear that the sampler does not move around the whole domain freely.

This posterior contains a discrete symmetry; for any fixed value of the mask bits, one can multiply U by a square root of the identity on the left and V by a square root of the identity on the right, and obtain the same value of the posterior. This creates no particular difficulty in practice, because these solutions are very widely isolated from one another. Our sampler does not move from peak to peak, because the probability that the hybrid method would obtain sufficient momentum to cross the very large regions of very low probability is effectively zero. This is in fact a desirable property; the symmetry means that accurate estimates of the mean value ofU and V would be zero.

Consistency checks: In general, we expect that a sampler that is behaving properly should be able to identify correspondence errors and produce a stable representation. There are in fact a number of subtle tracker errors in the hotel sequence. Figure 5 shows that the sampler can identify these tracker errors. Figure 6 illustrates that large tracker errors, artificially inserted into the dataset for this purpose, can be identified, too.

5.1.2. Colour Constancy. The sampler described here has been run on many synthetic images where

“ground truth” is known, and in each case reaches a small neighbourhood of ground truth from a randomly

(14)

Figure 4. These plots illustrate the path taken through the state space by the structure from motion sampler. Each plot connects the position of a given point in every tenth sample, starting at the 100th. The paths have been coded with a grey level for clarity; the early samples are light, and the path moves through darker grey levels. The fact that these paths repeatedly cross themselves and return to the same regions suggests that the sampler is mixing rather freely.

selected start point—i.e. “burns in”—within about 1000 samples. The experimental data shown below suggests the sampler mixes well, because of the wide spread on the marginal densities on the reflectances.

Consistency checks: The sampler is run on six im- ages of the same scene, but the fact that these images are of the same scene is not built into the model. The spread of samples for surface reflectance coefficients recovered for a particular surface in a particular image, is quite wide (see Fig. 8). However, if we compare the spread of samples for that surface for different images, the clusters overlap. This means that the representation is correctly encoding the fact that these surfaces could be very similar. In fact, as we shall see in Section 5.2, the representation encodes the fact that all surface patches could be very similar.

5.2. Attractive Properties of Sampled Representations

There are three attractive properties of the sampled representations we have derived:

• they provide a covariance estimate for inferred state;

• they can be resampled to incorporate new information;

• they appear to be stable to perturbations of the input data set.

We describe these properties below.

5.2.1. Covariance. The samplers described produce a representation of the posterior probability distribution,

(15)

Figure 5. Two (cropped) frames from the hotel sequence showing a single sample reconstruction. Squares correspond to measurements with mask bit one (i.e. the measurement of that point in that frame is believed correct); a white cross on a dark background corresponds to a measurement with mask bit zero (i.e. the measurement of that point in that frame is believed incorrect); grey diamonds correspond to model predictions. The extent to which a diamond is centered within a square gives the extent to which a model prediction is supported by the data. In the right frame, at several locations the tracker has skipped to another feature for unknown reasons. In each case the reconstruction identifies the data point as being erroneous, and reprojects to a point in a significantly different position from the measurement reported by the tracker and lying where a correct measurement would be as seen by the position relative to the surface texture on the object.

given a data set. A particularly attractive feature is that special datasets require no additional analysis. For example, if every element in the image has the same colour, we expect the colour constancy sampler to produce a very wide spread of samples for the surface reflectance; similarly, if a structure from motion data set is obtained by a camera translating in its plane, the sampler will return a set of samples with substantial variance perpendicular to that plane without further ado. A second attractive feature is that both expectations and marginal probability distributions are easily available: to compute an expectation of a function, we average that function’s value over the samples, and to compute a marginal, we drop irrelevant terms from the state of each sample.

Figure 7 illustrates the kind of information a sampler can produce for the structure from motion data; in

particular, the sampler reflects the scatter of possible inferred values for a single point.

Figure 8 show a set of typical results a sampler can produce from real images for the colour constancy problem. The spatial model identifies edges correctly.

Groups of samples drawn for the same surface reflectance under different lights intersect, as we expect.

Furthermore, groups of samples drawn for different surface reflectances under the same light tend not to intersect, meaning that these surfaces are generally seen as different. The figure shows a rendering of samples under white light, to give some impression of the variation in descriptions that results.

5.2.2. Resampling to Incorporate New Information.

Assume that we are engaged in colour constancy. We construct a representation of surface colour, and new

(16)

Figure 6. We perturb the hotel sequence by replacing 5% of the data points with draws from a uniform distribution in the image plane. The Bayesian method, started as in Section 5.4.1, easily discounts these noise points; the figure shows the same frames in the sequence as in Fig. 5, uncropped to show the noise but with a sample reconstruction indicated using the same notation as that figure.Squares correspond to measurements with mask bit one (i.e. the measurement of that point in that frame is believed correct); a white cross on a dark background corresponds to measurements with mask bit zero (i.e. the measurement of that point in that frame is believed incorrect); grey diamonds correspond to model predictions. The extent to which a diamond is centered within a square gives the extent to which a model prediction is supported by the data.

Figure 7. Black points show an overhead view of a single sample of the 3D reconstruction obtained using 40 frames of 80 points in the hotel sequence, rotated by hand to show the right-angled structure in the model indicating that the structure is qualitatively correct; the cloud of grey points are samples of the position of a single point, scaled by 1000 to show the (very small) uncertainty available in a single point measurement.

information arrives—what do we do? If the representation is probabilistic, the answer is (relatively) straight- forward; we adjust our representation to convey the posterior incorporating this new information. For ex-

ample, assume we have a sampled representation of the posterior for two distinct images. We are now told that a patch in one image is the same as a patch in another—

this should have an impact on our interpretation of both

(17)

images. The sampled representation is well suited to determining the effect of this information.

In particular, we have samples of P(σa, state a | image a)

and

P(σb, state b | image b)

where we have suppressed the details of the rest of the state in the notation. We interpret “the same” to mean that each patch is a sample from a Gaussian distribution with some unknown mean α and a known standard deviation. We would like to obtain samples of

P(α, state a, state b | image a, image b) (image a will be abbreviated as “im a”, etc.). Now we have that

P(im a, im b | state a, state b, α) is proportional to

Z µP(im a, state a | σa)P(σa | α)×

P(im b, state b | σb)P(σb| α)

¶

dσa dσbπ(α)

Now the term inside the integral is:

P(state a, σa, image a)

π(σa) × P(state b,σb, image b) π(σb)

× P(σb| α)P(σa| α) We have two sets of samples, Pa

andPb

. We ensure that these samples are independent and identically distributed by shuffling them (to remove the corre- lations introduced by MCMC). This means that, for the conditional density for the i ’th sample, we have

P(Pa

i | i) = P(state a, σa, image a). Now we construct a new sampler, whose state is{i, j, α}. We ensure this produces samples of the distribution

5(i, j, α) =

µ P(σa(i) | α)×

P(σa( j) | α)π(α)

¶

π(σa(i))π(σb( j))

We now use the i ’s and j ’s as indexes to our previous set of samples. We can marginalise with respect toσaand σb by simply dropping their values from the sample.

The result is a set of samples distributed according to the desired distribution:

Z µP(im a, state a| σa)P(σa | α)×

P(im b, state b | σb)P(σb | α)

¶

dσa dσbπ(α)

Building a sampler that obtains samples of {i, j, α}

space according to the desired distribution involves technical difficulties beyond the scope of this paper.

The approach essentially chooses pairs consisting of a sample from the set for image a and a sample from the set for image b; these pairs are chosen with a frequency that is higher when the values inferred for a particular patch are similar. Of course, this trick extends to more images.

Figure 8 shows results obtained by assuming that a single surface patch in each of the six images is the same. Typically, a small number of sets of samples have a very much higher probability than all others, so that a sampled representation consists of a large number of copies of these samples, interspersed with one or two others. This results in very much reduced variance in the rendering of the patch that is known to be similar for the six images, because the error balls for this surface patch intersect in a relatively small region.

However, this does not mean that the variance for the inferred reflectances for the other patches must be re- duced. It is reduced (Fig. 8), but this is because the representations recovered for each separate input image (correctly) captures the possibility that each of the surface patches is the same. This is another important reality check that strongly suggests the sampled representation is trustworthy: the algorithm has been able to use information that one patch is the same in each image to obtain a representation that strongly suggests the other patches are the same, too.

5.2.3. Stability of the Recovered Representations.

Reconstructions cannot be compared on the basis of accuracy, because “ground truth” is not available.

However, we can demonstrate that sampled representations are stable under various perturbations of their input. In structure from motion, small errors in tracker response for some points could lead to significant perturbations of the reconstruction for all points, because the reconstructed point positions are not independent—

they are coupled by the reconstructed camera configurations.

Small errors in tracker response actually occur: in the 40 frames of the hotel sequence that we used, six point measurements in nine frames are affected

(18)