C
ONDENSATION—Conditional Density Propagation for Visual Tracking
MICHAEL ISARD AND ANDREW BLAKE
Department of Engineering Science, University of Oxford, Oxford OX1 3PJ, UK Received July 16, 1996; Accepted March 3, 1997
Abstract. The problem of tracking curves in dense visual clutter is challenging. Kalman filtering is inadequate because it is based on Gaussian densities which, being unimodal, cannot represent simultaneous alternative hypothe- ses. The Condensation algorithm uses “factored sampling”, previously applied to the interpretation of static images, in which the probability distribution of possible interpretations is represented by a randomly generated set.
Condensation uses learned dynamical models, together with visual observations, to propagate the random set over time. The result is highly robust tracking of agile motion. Notwithstanding the use of stochastic methods, the algorithm runs in near real-time.
1. Tracking Curves in Clutter
The purpose of this paper1 is to establish a stochas- tic framework for tracking curves in visual clutter, us- ing a sampling algorithm. The approach is rooted in ideas from statistics, control theory and computer vi- sion. The problem is to track outlines and features of foreground objects, modelled as curves, as they move in substantial clutter, and to do it at, or close to, video frame-rate. This is challenging because elements in the background clutter may mimic parts of foreground features. In the most severe case of camouflage, the background may consist of objects similar to the fore- ground object, for instance, when a person is moving past a crowd. Our approach aims to dissolve the result- ing ambiguity by applying probabilistic models of ob- ject shape and motion to analyse the video-stream. The degree of generality of these models is pitched care- fully: sufficiently specific for effective disambiguation but sufficiently general to be broadly applicable over entire classes of foreground objects.
1.1. Modelling Shape and Motion
Effective methods have arisen in computer vision for modelling shape and motion. When suitable geometric
models of a moving object are available, they can be matched effectively to image data, though usually at considerable computational cost (Hogg, 1983; Lowe, 1991; Sullivan, 1992; Huttenlocher et al., 1993). Once an object has been located approximately, tracking it in subsequent images becomes more efficient computa- tionally (Lowe, 1992), especially if motion is modelled as well as shape (Gennery, 1992; Harris, 1992). One important facility is the modelling of curve segments which interact with images (Fischler and Elschlager, 1973; Yuille and Hallinan, 1992) or image sequences (Kass et al., 1987; Dickmanns, and Graefe, 1988).
This is more general than modelling entire objects but more clutter-resistant than applying signal-processing to low-level corners or edges. The methods to be dis- cussed here have been applied at this level, to segments of parametric B-spline curves (Bartels et al., 1987) tracking over image sequences (Menet et al., 1990;
Cipolla and Blake, 1990). The B-spline curves could, in theory, be parameterised by their control points. In practice, this allows too many degrees of freedom for stable tracking and it is necessary to restrict the curve to a low-dimensional parameter x, for example, over an affine space (Koenderink and Van Doorn, 1991;
Ullman and Basri, 1991; Blake et al., 1993), or more generally allowing a “shape-space” of non-rigid mo- tion (Cootes et al., 1993).
6 Isard and Blake
Finally, prior probability densities can be defined over the curves (Cootes et al., 1993) represented by appropriate parameter vectors x, and also over their motions (Terzopoulos and Metaxas, 1991; Blake et al., 1993), and this constitutes a powerful facility for tracking. Reasonable defaults can be chosen for those densities. However, it is obviously more satisfac- tory to measure or estimate them from data-sequences (x1, x2, . . .). Algorithms to do this, assuming Gaussian densities, are known in the control-theory literature (Goodwin and Sin, 1984) and have been applied in computer vision (Blake and Isard, 1994; Baumberg and Hogg, 1995). Given the learned prior, and an observa- tion density that characterises the statistical variability of image data z given a curve state x, a posterior distri- bution can, in principle, be estimated for xtgiven ztat successive times t.
1.2. Kalman Filters and Data-Association
Spatio-temporal estimation, the tracking of shape and position over time, has been dealt with thoroughly by Kalman filtering, in the relatively clutter-free case in which p(xt) can satisfactorily be modelled as Gaussian (Dickmanns and Graefe, 1988; Harris, 1992; Gennery, 1992; Rehg and Kanade, 1994; Matthies et al., 1989)
Figure 1. Kalman filter as density propagation: in the case of Gaussian prior, process and observation densities, and assuming linear dynamics, the propagation process of Fig. 2 reduces to a diffusing Gaussian state density, represented completely by its evolving (multivariate) mean and variance—precisely what a Kalman filter computes.
and can be applied to curves (Terzopoulos and Szeliski, 1992; Blake et al., 1993). These solutions work rela- tively poorly in clutter which causes the density for xt to be multi-modal and therefore non-Gaussian. With simple, discrete features such as points or corners com- binatorial data-association methods can be effective with clutter but combinatorial methods to do not ap- ply naturally to curves. There remains a need for an appropriately general probabilistic mechanism to han- dle multi-modal density functions.
1.3. Temporal Propagation of Conditional Densities The Kalman filter as a recursive linear estimator is a special case, applying only to Gaussian densities, of a more general probability density propagation pro- cess. In continuous time this can be described in terms of diffusion, governed by a “Fokker-Planck” equation (Astrom, 1970), in which the density for xt drifts and spreads under the action of a stochastic model of its dynamics. In the simple Gaussian case, the diffusion is purely linear and the density function evolves as a Gaussian pulse that translates, spreads and is rein- forced, remaining Gaussian throughout, as in Fig. 1, a process that is described analytically and exactly by the Kalman filter. The random component of the dynamical
Figure 2. Probability density propagation: propagation is depicted here as it occurs over a discrete time-step. There are three phases: drift due to the deterministic component of object dynamics; diffusion due to the random component; reactive reinforcement due to observations.
model leads to spreading—increasing uncertainty—
while the deterministic component causes the density function to drift bodily. The effect of an external obser- vation ztis to superimpose a reactive effect on the dif- fusion in which the density tends to peak in the vicinity of observations. In clutter, there are typically several competing observations and these tend to encourage a non-Gaussian state-density (Fig. 2).
The Condensation algorithm is designed to ad- dress this more general situation. It has the striking property that, generality notwithstanding, it is a consid- erably simpler algorithm than the Kalman filter. More- over, despite its use of random sampling which is often thought to be computationally inefficient, the Condensation algorithm runs in near real-time.
This is because tracking over time maintains relatively tight distributions for shape at successive time-steps, and particularly so given the availability of accurate, learned models of shape and motion.
2. Discrete-Time Propagation of State Density For computational purposes, the propagation process must be set out in terms of discrete time t. The state of the modelled object at time t is denoted xtand its history
isXt = {x1, . . . , xt}. Similarly, the set of image fea- tures at time t is ztwith historyZt= {z1, . . . , zt}. Note that no functional assumptions (linearity, Gaussianity, unimodality) are made about densities in the general treatment, though particular choices will be made in due course in order to demonstrate the approach.
2.1. Stochastic Dynamics
A somewhat general assumption is made for the prob- abilistic framework that the object dynamics form a temporal Markov chain so that
p(xt| Xt−1) = p(xt| xt−1) (1)
—the new state is conditioned directly only on the im- mediately preceding state, independent of the earlier history. This still allows quite general dynamics, in- cluding stochastic difference equations of arbitrary or- der; we use second order models and details are given later. The dynamics are entirely determined therefore by the form of the conditional density p(xt| xt−1). For instance,
p(xt| xt−1) ∝ exp µ
−1
2(xt− xt−1− 1)2
¶
8 Isard and Blake
represents a one-dimensional random walk (discrete diffusion) whose step length is a standard normal vari- ate, superimposed on a rightward drift at unit speed.
Of course, for realistic problems, the state x is multi- dimensional and the density is more complex (and, in the applications presented later, learned from training sequences).
2.2. Measurement
Observations zt are assumed to be independent, both mutually and with respect to the dynamical process.
This is expressed probabilistically as follows:
p(Zt−1, xt| Xt−1) = p(xt| Xt−1)
t−1
Y
i=1
p(zi| xi). (2)
Note that integrating over xtimplies the mutual condi- tional independence of observations:
p(Zt| Xt) = Yt i=1
p(zi| xi). (3)
The observation process is therefore defined by speci- fying the conditional density p(zt| xt) at each time t, and later, in computational examples, we take this to be a time-independent function p(z | x). Suffice it to say for now that, in clutter, the observation density is multi-modal. Details will be given in Section 6.
2.3. Propagation
Given a continuous-valued Markov chain with inde- pendent observations, the conditional state-density pt
at time t is defined by
pt(xt) ≡ p(xt| Zt).
This represents all information about the state at time t that is deducible from the entire data-stream up to that time. The rule for propagation of state density over time is
p(xt| Zt) = ktp(zt| xt)p(xt| Zt−1), (4) where
p(xt| Zt−1) = Z
xt−1
p(xt| xt−1)p(xt−1| Zt−1) (5)
and ktis a normalisation constant that does not depend on xt. The validity of the rule is proved in the appendix.
The propagation rule (4) should be interpreted sim- ply as the equivalent of the Bayes’ rule (6) for inferring posterior state density from data, for the time-varying case. The effective prior p(xt| Zt−1) is actually a pre- diction taken from the posterior p(xt−1| Zt−1) from the previous time-step, onto which is superimposed one time-step from the dynamical model (Fokker-Planck drift plus diffusion as in Fig. 2), which is expressed in (5). Multiplication in (4) by the observation density p(zt| xt) in the Bayesian manner then applies the re- active effect expected from observations. Because the observation density is non-Gaussian, the evolving state density p(xt| Zt) is also generally non-Gaussian. The problem now is how to apply a nonlinear filter to eval- uate the state density over time, without incurring ex- cessive computational load. Inevitably this means ap- proximating. Numerous approaches, including “multi- ple hypothesis tracking”, have been proposed but prove unsuitable for use with curves as opposed to discrete features—details are given in the appendix. In this pa- per we propose a sampling approach which is described in the following two sections.
3. Factored Sampling
This section describes first the factored sampling algo- rithm dealing with non-Gaussian observations in single images. Then factored sampling is extended in the fol- lowing section to deal with temporal image sequences.
A standard problem in statistical pattern recognition is to find an object parameterised as x with prior p(x), using data z from a single image. The posterior density p(x | z) represents all the knowledge about x that is de- ducible from the data. It can be evaluated, in principle, by applying Bayes’ rule (Papoulis, 1990) to obtain
p(x | z) = kp(z | x)p(x) (6) where k is a normalisation constant that is independent of x. In cases where p(z | x) is sufficiently complex that p(x | z) cannot be evaluated simply in closed form, iterative sampling techniques can be used (Geman and Geman, 1984; Ripley and Sutherland, 1990; Grenander et al., 1991; Storvik, 1994). The factored sampling algorithm (Grenander et al., 1991) generates a random variate x from a distribution ˜p(x) that approximates the posterior p(x | z). First, a sample-set {s(1), . . . , s(N)} is generated from the prior density p(x) and then an index
Figure 3. Factored sampling: a set of points s(n), the centres of the blobs in the figure, is sampled randomly from a prior density p(x). Each sample is assigned a weightπi (depicted by blob area) in proportion to the value of the observation density p(z | x = s(n)). The weighted point-set then serves as a representation of the posterior density p(x | z), suitable for sampling. The one-dimensional case illustrated here extends naturally to the practical case that the density is defined over several position and shape variables.
n∈ {1, . . . , N} is chosen with probability πn, where πn= pz¡
s(n)¢ PN
j=1pz
¡s( j)¢
and
pz(x) = p(z | x),
the conditional observation density. The value x0= xn
chosen in this fashion has a distribution which approx- imates the posterior p(x | z) increasingly accurately as N increases (Fig. 3).
Note that posterior mean propertiesE[g(x) | z] can be generated directly from the samples{s(n)} by weight- ing with pz(x) to give:
E[g(x) | z] ≈ PN
n=1g¡ s(n)¢
pz
¡s(n)¢ PN
n=1 pz
¡s(n)¢ . (7)
For example, the mean can be estimated using g(x) = x (illustrated in Fig. 4) and the variance using g(x) = xxT. In the case that p(x) is a spatial Gauss-Markov process, Gibbs sampling from p(x) has been used to generate the random variates{s(1), . . . , s(N)}. Oth- erwise, for low-dimensional parameterisations as in this paper, standard, direct methods can be used for Gaussians2 (Press et al., 1988). Note that, in the case that the density p(z | x) is normal, the mean obtained by factored sampling is consistent with an estimate ob- tained more conventionally, and efficiently, from linear least squares estimation. For multi-modal distributions which cannot be approximated as normal, so that linear
estimators are unusable, estimates of mean x by fac- tored sampling continue to apply.
4. The CONDENSATIONAlgorithm
The Condensation algorithm is based on factored sampling but extended to apply iteratively to successive images in a sequence. The same sampling strategy has been developed elsewhere (Gordon, et al., 1993;
Kitagawa, 1996), presented as developments of Monte- Carlo methods. Jump-diffusion tracking (Miller et al., 1995) may also be related to the approach described here.
Given that the process at each time-step is a self- contained iteration of factored sampling, the out- put of an iteration will be a weighted, time-stamped sample-set, denoted{s(n)t , n = 1, . . . , N} with weights πt(n), representing approximately the conditional state- density p(xt| Zt) at time t. How is this sample-set obtained? Clearly, the process must begin with a prior density and the effective prior for time-step t should be p(xt| Zt−1). This prior is of course multi-modal in general and no functional representation of it is avail- able. It is derived from the sample set representation {(s(n)t−1, πt(n)−1), n = 1, . . . , N} of p(xt−1| Zt−1), the output from the previous time-step, to which predic- tion (5) must then be applied.
The iterative process as applied to sample-sets, de- picted in Fig. 5, mirrors the continuous diffusion pro- cess in Fig. 2. At the top of the diagram, the out- put from time-step t − 1 is the weighted sample-set {(s(n)t−1, πt(n)−1), n = 1, . . . , N}. The aim is to maintain, at successive time-steps, sample sets of fixed size N ,
10 Isard and Blake
Figure 4. Sample-set representation of shape distributions: the sample-set representation of probability distributions, illustrated in one dimen- sion in Fig. 3, is illustrated here (a) as it applies to the distribution of a multi-dimensional curve parameter x. Each sample s(n)is shown as a curve (of varying position and shape) with a thickness proportional to the weightπn. The weighted mean of the sample set (b) serves as an estimator of the distribution mean.
Figure 5. One time-step in the Condensation algorithm: Each of the three steps—drift-diffuse-measure—of the probabilistic propagation process of Fig. 2 is represented by steps in the Condensation algorithm.
so that the algorithm can be guaranteed to run within a given computational resource. The first operation therefore is to sample (with replacement) N times from the set{s(n)t−1}, choosing a given element with probabil- ityπt(n)−1. Some elements, especially those with high weights, may be chosen several times, leading to iden- tical copies of elements in the new set. Others with relatively low weights may not be chosen at all.
Each element chosen from the new set is now sub- jected to the predictive steps. First, an element un- dergoes drift and, since this is deterministic, identical elements in the new set undergo the same drift. This is apparent in the Fig. 5. The second predictive step, diffusion, is random and identical elements now split because each undergoes its own independent Brownian motion step. At this stage, the sample set{s(n)t } for the new time-step has been generated but, as yet, without its weights; it is approximately a fair random sample from the effective prior density p(xt| Zt−1) for time-step t.
Finally, the observation step from factored sampling is applied, generating weights from the observation den- sity p(zt| xt) to obtain the sample-set representation {(s(n)t , πt(n))} of state-density for time t.
Figure 6 gives a synopsis of the algorithm. Note the use of cumulative weights c( j)t−1(constructed in step 3) to achieve efficient sampling in step 1. After any time- step, it is possible to “report” on the current state, for example, by evaluating some moment of the state den- sity as shown.
One of the striking properties of the Condensa- tion algorithm is its simplicity, compared with the Kalman filter, despite its generality. Largely, this is due to the absence of the Riccati equation which ap- pears in the Kalman filter for the propagation of co- variance. The Riccati equation is relatively complex computationally but is not required in the Conden- sation algorithm which instead deals with variability by sampling, involving the repeated computation of a relatively simple propagation formula.
5. Stochastic Dynamical Models for Curve Motion
In order to apply the Condensation algorithm, which is general, to tracking curves in image-streams, specific probability densities must be established both for the dynamics of the object and for the observation process. In the examples described here, x is a linear parameterisation of the curve and allowed transforma- tions of the curve are represented by linear transfor-
mations of x. The Condensation algorithm itself does not demand necessarily a linear parameterisation though linearity is an attraction for another reason—the availability of algorithms to learn object dynamics. The algorithm could also be used, in principle, with non- linear parameterised kinematics—for instance, repre- senting an articulated hand in terms of joint angles (Rehg and Kanade, 1994).
5.1. Linear Parameterisations of Splines for Tracking
We represent the state of a tracked object following methods established for tracking using a Kalman filter (Blake et al., 1995). Objects are modelled as a curve (or set of curves), typically though not necessarily the occluding contour, and represented at time t by a pa- rameterised image curve r(s, t). The parameterisation is in terms of B-splines, so
r(s, t) = (B(s) · Qx(t), B(s) · Qy(t)),
for 0≤ s ≤ L, (8) where B(s) is a vector (B1(s), . . . , BNB(s))T of B-spline basis functions, Qxand Qyare vectors of B- spline control point coordinates and L is the number of spans. It is usually desirable (Blake et al., 1993) to restrict the configuration of the spline to a shape-space of vectors X defined by
ÃQx Qy
!
= WX + Ã ¯Qx
¯Qy
!
, (9)
where the matrix W is a matrix of rank NXconsiderably lower than the 2NB degrees of freedom of the uncon- strained spline. Typically the shape-space may allow affine deformations of the template shape ¯Q, or more generally a space of rigid and non-rigid deformations.
The space is constructed by applying an appropriate combination of three methods to build a W -matrix:
1. determining analytically combinations of contours derived from one or more views (Ullman and Basri, 1991; Koenderink and Van Doorn, 1991; Blake et al., 1993), a method that is usable both for affine spaces and for certain classes of articulated object;
2. capturing sequences of key frames of the object in different poses (Blake et al., 1995);
3. performing principal components analysis on a set of outlines of the deforming object (Cootes et al.,
12 Isard and Blake
Figure 6. The Condensation algorithm.
1993; Baumberg and Hogg, 1994) to derive a small set of representative contours.
5.2. Dynamical Model
Exploiting earlier work on dynamical modelling (Blake et al., 1993, 1995), object dynamics are modelled as a second order process, conveniently represented in dis- crete time t as a second order linear difference equation:
xt− ¯x = A(xt−1− ¯x) + Bwt (10)
where wtare independent vectors of independent stan- dard normal variables, the state-vector
xt = µXt−1
Xt
¶
, (11)
and where¯x is the mean value of the state and A, B are matrices representing the deterministic and stochastic components of the dynamical model, respectively. The system is a set of damped oscillators, whose modes, natural frequencies and damping constants are deter- mined by A, driven by random accelerations coupled
into the dynamics via B from the noise term Bw. While it is possible to set sensible defaults for A, ¯x and B, it is more satisfactory and effective to estimate them from input data taken while the object performs typical motions. Methods for doing this via Maximum Like- lihood Estimation are essential to the work described here and are described fully elsewhere (Blake et al., 1995; Reynard et al., 1996).
The dynamical model can be re-expressed in such a way as to make quite clear that it is a temporal Markov chain:
p(xt| xt−1)
∝ exp µ
−1
2kB−1((xt− ¯x) − A(xt−1− ¯x))k2
¶ (12)
wherek · · · k is the Euclidean norm. It is therefore clear that the learned dynamical models are appropriate for use in the Condensation algorithm.
5.3. Initial Conditions
Initial conditions for tracking can be determined by specifying the prior density p(x0), and if this is Gaussian, direct sampling can be used to initialise the Condensation algorithm. Alternatively, it is pos- sible simply to allow the density p(xt) to settle to a steady state p(x∞), in the absence of object measure- ments. Provided the learned dynamics are stable (free of undamped oscillations) a unique steady state exists.
Furthermore, if p(x0) is Gaussian, p(x∞) is Gaussian with parameters that can be computed by iterating the Riccati equation (Gelb, 1974). At this point the density function represents an envelope of possible configura- tions of the object, as learned during the training phase.
(Background clutter, if present, will modify and bias this envelope to some extent.) Then, as soon as the foreground object arrives and is measured, the density
p(xt) begins to evolve appropriately.
6. Observation Model
The observation process defined by p(zt| xt) is as- sumed here to be stationary in time (though the Condensation algorithm does not necessarily de- mand this) so a static function p(z | x) needs to be specified. As yet we have no capability to estimate it from data, though that would be ideal, so some reasonable assumptions must be made. First, a mea- surement model for one-dimensional data with clutter
is suggested. Then an extension is proposed for two- dimensional observations that is also used later in com- putational experiments.
6.1. One-Dimensional Observations in Clutter In one dimension, observations reduce to a set of scalar positions {z = (z1, z2, . . . , zM)} and the ob- servation density has the form p(z | x) where x is one-dimensional position. The multiplicity of measure- ments reflects the presence of clutter so either one of the events
φm= {true measurement is zm}, m = 1, . . . , M occurs, or else the target object is not visible with prob- ability q = 1 −P
mP(φm). Such reasoning about clutter and false alarms is commonly used in target tracking (Bar-Shalom and Fortmann, 1988). Now the observation density can be expressed as
p(z | x) = qp(z | clutter) + XM m=1
p(z | x, φm)P(φm).
A reasonable functional form for this can be obtained by making some specific assumptions: that3 P(φm) = p, ∀m, that the clutter is a Poisson process along the line with spatial densityλ and that any true target measure- ment is unbiased and normally distributed with stan- dard deviationσ. This leads to
p(z | x) ∝ 1 + 1
√2πσα X
m
exp µ
−νm2
2σ2
¶ (13) whereα = qλ and νm = zm− x, and is illustrated in Fig. 7. Peaks in the density function correspond to measured features and the state density will tend to be reinforced in the Condensation algorithm at such points. The background level reflects the possibility that the true target has not been detected at all. The effect on tracking behaviour is to provide for the possi- bility of “tunneling”: a good hypothesis should survive a transitory failure of observations due, for example, to occlusion of the tracked object. The parametersσ (units of distance) and α (units of inverse distance) must be chosen, though, in principle, they could be estimated from data by observing measurement error σ and both the density of clutter λ and probability of non-detection q.
Considerable economy can be applied, in practice, in the evaluation of the observation density. Given a hy- pothesised position x in the “observation” step (Fig. 6) it is not necessary to attend to all features z1, . . . , zM.
14 Isard and Blake
Figure 7. One-dimensional observation model: a probabilistic observation model allowing for clutter and the possibility of missing the target altogether is specified here as a conditional density p(z | x).
Anyνmfor which
√ 1
2πσαexp µ
−νm2
2σ2
¶
¿ 1
can be neglected and this sets a search window around the position x outside which measurements can be ig- nored. For practical values of the constants the search window will have a width of a fewσ . In practice, the clutter is sufficiently sparse andσ is sufficiently small that the search window rarely contains more than one feature.
Note that the density p(z | x) represents the informa- tion about x given a fixed number M of measurements.
Potentially, the event ψM that there are M measure- ments, regardless of the actual values of those measure- ments, also constitutes information about x. However, we can reasonably assume here that
P(ψM| x) = P(ψM),
for instance because x is assumed to lie always within the image window. In that case, by Bayes’ theorem,
p(x | ψM) = p(x)
—the event ψM provides no additional informa- tion about the position x. (If x is allowed also to fall outside the image window then the eventψMis infor- mative: a value of M well above the mean value for the background clutter enhances the probability that x lies within the window.)
6.2. Two-Dimensional Observations
In a two-dimensional image, the set of observations z is, in principle, the entire set of features visible in the image. However, an important aspect of earlier sys- tems in achieving real-time performance (Lowe, 1992;
Harris, 1992; Blake et al., 1993) has been the restriction of measurement to a sparse set of lines normal to the tracked curve. These two apparently conflicting ideas can be resolved as follows.
The observation density p(z | x) in two dimensions describes the distribution of a (linearly) parameterised image curve z(s), given a hypothetical shape in the form of a curve r(s), 0 ≤ s ≤ 1, represented by a shape parameter x. The two-dimensional density be derived as an extension of the one-dimensional case. It is assumed that a mapping g(s) is known that associates each point z(s) on the image curve with a point r(g(s)) on the shape. In practice, this mapping is set up by tracing normals from the curve r. Note that g(s) is not necessarily injective because z(s) includes clutter as well as foreground features. Next, the one-dimensional density (13) is approximated in a more amenable form that neglects the possibility of more than one feature lying inside the search interval:
p(z | x) ∝ exp µ
− 1
2σ2 f(ν1; µ)
¶
where f(ν; µ) = min(ν2, µ2), (14)
Figure 8. Observation process: the thick line is a hypothesised shape, represented as a parametric spline curve. The spines are curve normals along which high-contrast features (white crosses) are sought.
µ =√
2σ log(1/√
2πασ ) is a spatial scale constant, andν1is theνmwith smallest magnitude, representing the feature lying closest to the hypothesised position x.
A natural extension to two dimensions is then
p(z | x) = Zexp Ã
−1 2r
Z L
0
f(z1(s) − r(s); µ) ds
!
(15) in which r is a variance constant and z1(s) is the closest associated feature to r(s):
z1(s) = z(s0) where s0= arg min
s0∈g−1(s)|r(s) − z(s0)|.
Note that the constant of proportionality (“partition function”) Z(x) is an unknown function. We make the assumption that the variation of Z with x is slow com- pared with the other term in (15) so that Z can be treated as constant. It remains to establish whether this as- sumption is justified.
The observation density (15) can be computed via a discrete approximation, the simplest being:
p(z | x) ∝ exp Ã
− XM m=1
1
2r M f(z1(sm) − r(sm); µ)
! ,
(16) where sm = m/M. This is simply the product of one-dimensional densities (14) withσ =√
r M, eval- uated independently along M curve normals as in Fig. 8.
7. Applying the CONDENSATIONAlgorithm to Video-Streams
Four examples are shown here of the practical efficacy of the Condensation algorithm. Movie (MPEG) versions of some results are available on the web at http://www.robots.ox.ac.uk/~ab/.
16 Isard and Blake
Figure 9. Tracking three people in a cluttered room: the first frame of a sequence in which one figure moves from right to left in front of two stationary figures.
7.1. Tracking a Multi-Modal Distribution
The ability of the Condensation algorithm to rep- resent multi-modal distributions was tested using a 70 frame (2.8 s) sequence of a cluttered room containing three people each facing the camera (Fig. 9). One of the people moves from right to left, in front of the other two. The shape-space for tracking is built from a hand- drawn template of head and shoulders (Fig. 8) which is then allowed to deform via planar affine transforma- tions. A Kalman filter contour-tracker (Blake et al., 1993) with default motion parameters is able to track a single moving person just well enough to obtain a sequence of outline curves that is usable as training data. Given the high level of clutter, adequate perfor- mance with the Kalman filter is obtained here by means of background modelling (Rowe and Blake, 1996), a statistical form of background subtraction, which ef- fectively removes clutter from the image data before it is tracked. It transpires, for this particular training set, that the learned motions comprise primarily horizontal translation, with vertical translation and horizontal and vertical shear present to a lesser degree.
The learned shape and motion model can now be installed as p(xt| xt−1) in the Condensation algorithm which is run on a test sequence but with- out the benefit of background modelling, so that the
background clutter is now visible to the tracker.
Figure 10 shows how the state-density evolves as track- ing progresses. Initialisation is performed simply by iterating the stochastic model, in the absence of mea- surements, to its steady state and it can be seen that this corresponds, at time 0, to a roughly Gaussian distribu- tion, as expected. The distribution rapidly collapses down to three peaks which are then maintained appro- priately even during temporary occlusion. Although the tracker was designed to track just one person, the Con- densation algorithm allows the tracking of all three, for free; the ability to represent multi-modal distribu- tions effectively provides multiple hypothesis capabil- ity. Tracking is based on frame rate (40 ms) sampling in this experiment and distributions are plotted in the figure for alternate frames. The experiment was run using a distribution of N = 1000 samples per time- step.
7.2. Tracking Rapid Motions Through Clutter The ability to track more agile motion, still against clutter, was tested using a 500 field (10 s) sequence of a girl dancing vigorously to a Scottish reel. The shape-space for tracking was planar affine, based on a hand-drawn template curve for the head outline. The training sequence consisted of dancing against a largely
Figure 10. Tracking with multi-modal state-density: an approximate depiction of the state-density is shown, computed by smoothing the distribution of point masses s(1)t , s(2)t , . . . in the Condensation algorithm. The density is, of course, multi-dimensional; its projection onto the horizontal translation axis is shown here. The initial distribution is roughly Gaussian but this rapidly evolves to acquire peaks corresponding to each of the three people in the scene. The right-most peak drifts leftwards, following the moving person, coalescing with and separating from the other two peaks as it moves. Having specified a tracker for one person we effectively have, for free, a multi-person tracker, owing to the innate ability of the Condensation algorithm to maintain multiple hypotheses.
uncluttered background, tracked by a Kalman filter contour-tracker with default dynamics to record 140 fields (2.8 s) of tracked head positions, the most that could be tracked before losing lock. Those 140 fields were sufficient to learn a bootstrap motion model which then allowed the Kalman filter to track the training data for 800 fields (16 s) before loss of lock. The motion model obtained from these 800 fields was used in exper- iments with the Condensation tracker and applied to the test data, now including clutter.
Figure 11 shows some stills from the test sequence, with a trail of preceding head positions to indicate mo- tion. The motion is primarily translation, with some horizontal shear apparent as the dancer turns her head.
Representing the state density with N = 100 samples at each time-step proves just sufficient for successful tracking. As in the previous example, a prior density can be computed as the steady state of the motion model and, in this case, that yields a prior for position that spreads across most of the image area, as might be
expected given the range of the dance. Such a broad distribution cannot effectively be represented by just N = 100 samples. One alternative is to increase N in the early stages of tracking, and this is done in a later experiment. Alternatively, the prior can be based on a narrower distribution whose centre is positioned by hand over the object at time 0, and that is what was done here. Observation parameters wereµ = 24, σ = 7 with M= 18 normals.
Figure 12 shows the motion of the centroid of the estimated head position as tracked both by the Con- densation algorithm and by a Kalman filter using the same motion model. The Condensation tracker correctly estimated head position throughout the se- quence, but after about 40 fields (0.80 s), the Kalman filter was distracted by clutter, never to recover.
Given that there is only one moving person in this experiment, unlike the previous one in which there were three, it might seem that a unimodal repre- sentation of the state density should suffice. This is
18 Isard and Blake
Figure 11. Tracking agile motion in clutter: the test sequence consists of 500 fields (10 s) of agile dance against a cluttered background. The dancer’s head is tracked through the sequence. Several representative fields are shown here, each with a trail of successive mean tracked head position at intervals of 40 ms. The Condensation algorithm used N = 100 sample per time-step to obtain these results.
emphatically not the case. The facility to represent multiple modes is crucial to robustness as Fig. 13 illus- trates. The figure shows how the distribution becomes misaligned (at 900 ms), reacting to the distracting form of the computer screen. After a further 20 ms the distri- bution splits into two distinct peaks, one correspond- ing to clutter (the screen) and the other to the dancer’s head. At this point the clutter peak actually has the higher posterior probability—a unimodal tracker, for instance, a Kalman filter, would almost certainly dis- card the lower peak, rendering it unable to recover.
The Condensation algorithm however, capable as it is of carrying several hypotheses simultaneously, does recover rapidly as the clutter peak decays for lack of confirmatory observation, leaving just one peak corre- sponding to the dancer at 960 ms.
7.3. Tracking an Articulated Object
The preceding sequences show motion taking place in affine shape-spaces of just six dimensions. High di- mensionality is one of the factors, in addition to agility and clutter, that makes tracking hard (Blake et al., 1993). In order to investigate tracking performance in higher dimensions, we used a 500 field (10 s) test sequence of a hand translating, rotating, and flexing its fingers independently, over a highly cluttered desk scene (Fig. 14). Figure 15 shows just how severe the clutter problem is—the hand is immersed in a dense field of edges.
A model of shape and motion model was learned from training sequences of hand motion against a plain background, tracked by Kalman filter (using signed
Figure 12. The Condensation tracker succeeds where a Kalman filter fails: the estimated centroid for the sequence shown in Fig. 11 is plotted against time for the entire 500 field sequence, as tracked first by the Condensation tracker, then by a comparable Kalman filter tracker.
The Condensation algorithm correctly estimates the head position throughout the sequence. The Kalman filter tracks briefly, but is soon distracted by clutter and never recovers.
edges to help to disambiguate finger boundaries). The procedure comprised several stages, creative assembly of methods from the available “toolkit” for learning (Blake et al., 1995).
1. Shape-space was constructed from six templates drawn around the hand with the palm in a fixed orientation and with the fingers and thumb in var- ious configurations. The six templates combined linearly to form a five-dimensional space of defor- mations which were then added to the space of trans- lations to form a seven-dimensional shape-space.
2. Default dynamics in the shape-space above were adequate to track a clutter-free training sequence of 600 frames in which the palm of the hand main- tained an approximately fixed attitude.
3. Principal components analysis: the sequence of 600 hand outlines was replicated with each hand contour rotated through 90◦and the sequences con- catenated to give a sequence of 1200 deformations.
Projecting out the translational component of mo- tion, the application of Principal Component Anal- ysis (PCA) to the sequence of residual deformations of the 1200 contours established a 10-dimensional space that accounted almost entirely for deforma- tion. This was then combined with the translational space to form a 12-dimensional shape-space that
accounted both for the flexing of fingers and thumb and also for rotations of the palm.
4. Bootstrapping: a Kalman filter with default dy- namics in the 12-dimensional shape-space was suf- ficient to track a training sequence of 800 fields of the hand translating, rotating, and flexing fingers and thumb slowly. This was used to learn a model of motion.
5. Relearning: that motion model was installed in a Kalman filter used to track another, faster training- sequence of 800 fields. This allowed a model for more agile motion to be learned, which was then used in experiments with the Condensation tracker.
Figure 16 shows detail of a series of images from a tracked, 500 field test-sequence. The initial state den- sity was simply the steady state of the motion model, obtained by allowing the filter to iterate in the absence of observations. Tracker initialisation was facilitated by using more samples per time-step (N = 1500) at time t = 0, falling gradually to 500 over the first 4 fields. The rest of the sequence was tracked using N = 500. As with the previous example of the dancer, clutter can distract the tracker but the ability to repre- sent multi-modal state density means that tracking can recover.
20 Isard and Blake
Figure 13. Recovering from tracking failure: detail from four consecutive fields of the sequence illustrated in Fig. 11. Each sample from the distribution is plotted on the image, with intensity scaled to indicate its posterior probability. (Most of the N= 100 samples have too low a probability to be visible in this display.) In field 45, the distribution is misaligned, and has begun to diverge. In fields 46 and 47 it has split into two distinct peaks, the larger attracted to background clutter, but converges back onto the dancer in field 48.
7.4. Tracking a Camouflaged Object
Finally, we tested the ability of the algorithm to track rapid motion against background distraction in the extreme case that background objects actually mim- iced the tracked object. A 12 s (600 field) sequence
showed a bush blowing in the wind, the task being to track one particular leaf. A template was drawn by hand around a still of one chosen leaf and allowed to undergo affine deformations during tracking. Given that a clutter-free training sequence was not available, the motion model was again learned by means of a
Figure 14. A hand moving over a cluttered desk: Field 0 of a 500 field (10 s) sequence in which the hand translates, rotates, and the fingers and thumb flex independently.
Figure 15. Severe clutter: detail of one field (Fig. 14) from the test-sequence shows the high level of potential ambiguity. Output from a directional Gaussian edge detector shows that there are many clutter edges present as potential distractors.
bootstrap procedure. A tracker with default dynamics proved capable of tracking the first 150 fields of a train- ing sequence before losing the leaf, and those tracked positions allowed a first approximation to the model to be learned. Installing that in a Condensation tracker, the entire sequence could be tracked, though with occasional misalignments. Finally, a third learned
model was sufficient to track accurately the entire 12- second training sequence. Despite occasional violent gusts of wind and temporary obscuration by another leaf, the Condensation algorithm successfully fol- lowed the object, as Fig. 17 shows. In fact, tracking is accurate enough using N = 1200 samples to sepa- rate the foreground leaf from the background reliably,
22 Isard and Blake
Figure 16. Tracking a flexing hand across a cluttered desk: representative stills from a 500 field (10 s) sequence show a hand moving over a highly cluttered desk scene. The fingers and thumb flex independently, and the hand translates and rotates. Here the Condensation algorithm uses N= 1500 samples per time-step initially, dropping gradually over 4 fields to N = 500 for the tracking of the remainder of the sequence.
The mean configuration of the contour is displayed.
Figure 17. Tracking with camouflage: the aim is to track a single camouflaged moving leaf in this 12-s sequence of a bush blowing in the wind.
Despite the heavy clutter of distractors which actually mimic the foreground object, and occasional violent gusts of wind, the chosen foreground leaf is successfully tracked throughout the sequence. Representative stills depict mean contour configurations, with preceding tracked leaf positions plotted at 40 ms intervals to indicate motion.
an effect which can otherwise only be achieved using
“blue-screening”. Having obtained the model itera- tively as above, independent test sequences could be tracked without further training. With N = 1200 sam- ples per time-step the tracker runs at 6.5 Hz on a SGI Indy SC4400 200 MHz workstation. Reducing this to N = 200 increases processing speed to video frame- rate (25 Hz), at the cost of occasional misalignments in the mean configuration of the contour. Observation parameters wereµ = 8, σ = 3 with M = 21 normals.
8. Conclusions
Tracking in clutter is hard because of the essential multi-modality of the conditional observation den- sity p(z | x). In the case of curves multiple-hypothesis tracking is inapplicable and a new approach is needed.
The Condensation algorithm is a fusion of the sta- tistical factored sampling algorithm for static, non- Gaussian problems with a stochastic model for object motion. The result is an algorithm for tracking rigid and non-rigid motion which has been demonstrated to be far more effective in clutter than comparable Kalman filters. Performance of the Condensation algorithm improves as the sample size parameter N increases;
formally computational complexity is O(N log N), al- though this can be made O(N) with a minor modifi- cation to the sampling procedure. Impressive results have been demonstrated for models with between 6 and 12 degrees of freedom, even when N is as low as 100–200. Performance in several cases was im- proved still further with an increased value N≈ 1000.
In a six-dimensional shape-space, the system currently runs with N = 100 in real-time (50 Hz) on a desk-top graphics workstation (SGI Indy R4400SC, 200 MHz).
The new approach raises a number of questions.
First, alternative observation models could be explored in order to make greater use of image intensity varia- tions, though without sacrificing too much in the way of photometric invariance. It is to be hoped in the in- terests of efficiency that, as happens with the search window in the edge-based case, computational atten- tion could be concentrated in a band around the hy- pothesised curve without significant loss of accuracy in the model. Such a model would have echoes of cor- relation matching but of course without the exhaustive search characteristic of correlation matchers which is quite infeasible in more than two or three dimensions.
Secondly, the availability of general state densities suggests the need for more general representations of
those densities. When the density is approximately unimodal, first and second moments may be adequate to convey the likely states, but in the multi-modal case, as for example when several people are tracked simul- taneously, the mean configuration is not a particularly useful statistic—it meaninglessly combines the config- urations of the three people. An alternative is to attempt to develop a mode finder capable of pin-pointing sev- eral modes when present. More generally, there is a need for “operators” to interrogate densities: for in- stance, an operator to find a person moving to the right, or to find the tallest person. Perhaps such op- erators could be formulated as hypothesis tests applied to sample sets.
A third question concerns the random sampling scheme and its efficiency. Factored sampling can be inefficient as the modes of p(z | x) become narrow.
One approach is “importance sampling” (Ripley, 1987) in which a heuristically chosen distribution, approxi- mating p(z | x), is used to concentrate random sam- pling around modes. However, this has the drawback that the prior p(x) must be repeatedly evaluated whereas, in temporal propagation, the prior (predic- tion) p(xt| zt−1) cannot be evaluated pointwise, only sampled.
Finally, it is striking that the density propagation equation (4) in the Condensation algorithm is a continuous form of the propagation rule of the “for- ward algorithm” for Hidden Markov Models (HMMs) (Rabiner and Bing-Hwang, 1993). The integral over continuous states in (5) becomes a summation over discrete states in the HMM, with p(xt| xt−1) repre- sented by a transition matrix. This suggests a natu- ral opportunity to combine the two so that mixed dis- crete/continuous states could be propagated over time.
This would allow switching between multiple mod- els, for instance, walk-trot-canter-gallop, each model represented by a stochastic differential equation, with transitions governed by a discrete conditional proba- bility matrix. It seems likely that such a system could be executed as a Condensation tracker. A further challenge is to develop a learning algorithm for mixed dynamical models.
Appendix A: Non-Linear Filtering
There are four distinct probability distributions rep- resented in a non-linear Bayesian filter. Three of them form part of the problem specification and the fourth constitutes the solution. The three specified
24 Isard and Blake
distributions are:
1. the prior density p(x) for the state x
2. the process density p(xt| xt−1) that describes the stochastic dynamics
3. the observation density p(z | x)
and the filter evolves over time to generate, as the so- lution at each time-step, the state-density pt(x) where pt(xt) ≡ p(xt| Zt). Only when all of the three spec- ified distributions are Gaussian is the state-density pt
also Gaussian. Otherwise, for non-Gaussian pt, it is possible to use one of a number of approximate filters, depending on which of the specified densities it is that is non-Gaussian.
A.1. Non-Gaussian Prior Density
The case that the prior density is non-Gaussian is the simplest to deal with provided only that it can ade- quately be represented (or approximated) as an additive Gaussian mixture:
p0(x) = XM m=1
w(m)G¡
x; µ(m), P0(m)¢ .
In that case, provided that other specified densities are Gaussian, the state density can also be represented as a corresponding mixture
pt(x) = XM m=1
w(m)G¡
x; µ(m)t , Pt(m)
¢
in which the meansµ(m)t and variances Pt(m)vary over time but the weightsw(m)are fixed. Each of the M mix- ture components evolves as an independent Gaussian so that, in fact, the state density is just a sum of densities from M independent linear Kalman filters.
A.2. Non-Gaussian Process Density
Non-Gaussian state densities can arise from the nature of the process either because the dynamics are driven by non-Gaussian process noise, or, more generally, be- cause the deterministic dynamics are non-linear. One approach to filtering is then to approximate the dynam- ics by Taylor expansion as a linear process with time- varying coefficients and proceed as for linear Kalman filters. This generates a Gaussian representation of the
evolving state-density which may be a good approx- imation depending on the nature of the non-linearity.
This is the basis of the “Extended Kalman Filter” (EKF) (Gelb, 1974; Bar-Shalom and Fortmann, 1988). Al- ternatively, one can attempt a mixture representation, as earlier, but now allowing the weightsw(m) also to vary over time. Unfortunately, even allowing dynamic re-weighting (Sorenson and Alspach, 1971) does not produce exact solutions for pt(x), because the indi- vidual Gaussian components do not remain Gaussian over time. For example, consider the case in which the process density p(xt| xt−1) is itself an additive mix- ture of k> 1 Gaussian components. According to the Bayesian propagation equation (5) each component of pt splits into k separate components in the transition from time n to time n+ 1; the total number of compo- nents in ptgrows exponentially as kt. Clearly, ptmust be approximated at each time-step to prune back the number of components (Anderson and Moore, 1979) within some resource-limited bound M. Effectively, there are Mk full Kalman filters running at each time- step, each bringing the computational expense of its own Riccati equation step. Clearly, the success of this approach depends on how well the densities pt and p(xt| xt−1) can be approximated with a modest num- ber Mk of components.
A.3. Non-Gaussian Observation Density
In the case of visual tracking in clutter, non-linearity of the tracking filter arises, as we have seen, because the observation density p(z | x) is non-Gaussian and, furthermore, is multi-modal so that it cannot be well approximated by a single Gaussian. Each of the meth- ods just mentioned for handling non-Gaussian process density, the EKF and Gaussian mixtures, are relevant also to non-Gaussian observation density but continue to have the same drawbacks. Note that, in the case of Gaussian mixtures, the number of mixture components again proliferates at each time-step of (4), albeit via a different mechanism involving products of Gaussians rather than convolutions. Even this assumes that the observation density can be approximated as a mixture but in clutter this becomes rather inefficient, requiring at least one component per visible feature.
There is an additional class of techniques which ap- plies to this case when the non-Gaussian state den- sity arises from clutter of a particular sort. In the simplest case, one of a finite set of measurements zt = {zt,1, . . . , zt,k} at time n is to be associated with