Learning Object Intrinsic Structure for Robust Visual Tracking

Qiang Wang, Guangyou Xu, Haizhou Ai

Department of Computer Science and Technology, Tsinghua University,

State Key Laboratory of Intelligent Technology and Systems, Beijing 100084, PR China.

wangq98@mails.tsinghua.edu.cn

Abstract

In this paper, a novel method to learn the intrinsic object structure for robust visual tracking is proposed. The basic assumption is that the parameterized object state lies on a low-dimensional manifold and can be learned from training data. Based on this assumption, we first derive a dimensionality reduction and density estimation algorithm for unsupervised learning of the object intrinsic representation; the resulting non-rigid part of the object state is reduced to as few as 2 dimensions. Second, the dynamical model is derived and trained based on this intrinsic representation. Third, the learned intrinsic object structure is integrated into a particle-filter style tracker. We show that this intrinsic object representation has some interesting properties, and that the dynamical model derived from it makes the particle-filter style tracker more robust and reliable. Experiments show that the learned tracker performs much better than existing trackers on the tracking of complex non-rigid motions such as fish twisting with self-occlusion and large inter-frame lip motion. The proposed method also has the potential to solve other types of tracking problems.

1. Introduction

Visual tracking can be regarded as a probabilistic inference problem: maintaining an accurate representation of the posterior on an object given observations in the image sequence. Efforts to improve tracking performance concentrate on three aspects: the object state representation, the dynamical model and the measurement model. The object state representation is a fundamental problem in computer vision and is very important to the tracker’s performance. Typical object representations include parameterized shapes [2] [7], a combination of shape and color [17] and exemplar-based models [15].

A good object representation should be both specific and compact. Hierarchical linear dimensionality reduction [6] or mixtures of linear models [14] are usually used to improve the specificity of the object representation. The tracker’s performance also depends on the state dimension, and it is extremely difficult to get good results from particle filters in spaces of dimension greater than 10 [3]. The dynamical model is either derived from a priori human knowledge with adjustable free parameters that can be trained [9], or completely learned from a large collection of data [12]. The measurement model, which models the image observation likelihood, is either derived directly from edges or image SSD with ad hoc knowledge [1] or learned probabilistically [11].

The common problem of the existing methods is that the obtained object state is not a globally coordinated low-dimensional vector in a continuous metric space, which makes tracking inefficient or prone to local minima. The basic idea of intrinsic tracking is to first learn a highly compact and specific object representation that is accurate enough to preserve object details and has a density model for probabilistic tracking. We will also show that for an actual deformable object, this representation is global and usually has some semantic interpretation. Then, based on the object density model, the dynamical model is derived and factorized in the form of a mixture of Gaussian diffusions. Finally, the learned intrinsic object structure is integrated into a particle filter framework for visual tracking.

This paper can be viewed as an attempt to apply state-of-the-art nonlinear dimensionality reduction methods [10] [13] to non-rigid object tracking.

Significant improvements are made to fulfill the tracking task, since it is not straightforward to apply a dimensionality reduction method in visual tracking. First, in the dimensionality reduction algorithm that converts the high-dimensional parametric object state to a low-dimensional intrinsic object state, the dimension of the intrinsic state can be much lower than that of the original parametric state and it is difficult to preserve the variance in the low-dimensional space, so the mapping between the high-dimensional and low-dimensional data is not accurate. Second, the dynamical model on the low-dimensional intrinsic state may still be complex, and the probabilistic noise in the dynamical model cannot simply be assumed to be Gaussian; a general multi-modal dynamical model is therefore needed. The contributions of this paper include: 1) An improved non-parametric dimensionality reduction generalization algorithm to establish an accurate mapping between the parametric object state and the intrinsic object state. 2) A maximum likelihood (ML) estimation of the object density parameters that is fast and efficient. 3) A factorized dynamical model in the form of a mixture of Gaussian diffusions.

1063-6919/03 $17.00 © 2003 IEEE

The paper is organized as follows: In section 2 the algorithm of learning a low dimensional object intrinsic representation is presented. In section 3 the mapping between the high-dimensional parametric state and the low-dimensional intrinsic states is introduced. In section 4, the modeling and learning of the dynamical model is described. The probabilistic tracking algorithm is implemented in section 5. In section 6, the experimental results on fish and lip tracking are given and compared to the existing tracker.

2. Learning object intrinsic representation

The object intrinsic representation is a low-dimensional representation with a density model that captures the global structure of a curved manifold. The manifold is the space of all the high-dimensional parametric object states. The object density is modeled by a mixture of factor analyzers (MFA) [5]. In order to estimate the parameters of the MFA model with low-dimensional factors, an existing dimensionality reduction algorithm is used to determine the factors in the MFA, and the k-means algorithm is used to determine the mixture component labels. By converting these latent variables into observable variables, we derive an ML algorithm that estimates the density parameters efficiently.

2.1. Object density modeling using MFA model

In factor analysis (FA) [5], the generative model is given by:

$$x = \Lambda y + u \qquad (1)$$

where $x$ is a $D$-dimensional real-valued data vector, $y$ is a $d$-dimensional vector of real-valued factors, and $d$ is generally much smaller than $D$. $\Lambda$ is known as the factor loading matrix. The factor $y$ is assumed to be $N(v, \Sigma)$ distributed (a Gaussian distribution with mean $v$ and covariance $\Sigma$). The $D$-dimensional random variable $u$ is $N(\mu, \Psi)$ distributed, where $\Psi$ is a diagonal matrix; $y$ and $u$ are assumed to be independent.

Now for a mixture of $M$ factor analyzers indexed by $w$, $w = 1, .., M$, the generative model is:

$$x|w = \Lambda_w \, y|w + u_w \qquad (2)$$

From (2), the overall distribution of the MFA model is given by:

$$P(x, y, w) = P(x|y, w)\,P(y|w)\,P(w) \qquad (3)$$

$$P(x|y, w) = N(\Lambda_w y + \mu_w, \Psi_w) \qquad (4)$$

$$P(y|w) = N(v_w, \Sigma_w) \qquad (5)$$

2.2. Dimensionality reduction

In the framework of unsupervised nonlinear dimensionality reduction [10] [13], we consider a high-dimensional data set $X = \{x_1, x_2, .., x_n\}$ with $x_i \in R^D$ that lies on a $d < D$ dimensional, possibly nonlinear manifold (plus some noise outside the manifold). The goal is to express the data in the intrinsic coordinates of the manifold: $Y = \{y_1, y_2, .., y_n\}$ with $y_i \in R^d$. The method thus aims at obtaining a compact object representation from examples, that is, at discovering a few degrees of freedom that underlie the observed modes of continuous variability without the use of a priori knowledge.

In the experiments, we use the ISOMAP algorithm [13] for dimensionality reduction. The aim of ISOMAP is to preserve the topological structure of the data. As an example, the algorithm is applied to the shape space of a swimming fish, with the results illustrated in Figure 1. In this particular example, ISOMAP discovers 2-dimensional, semantically meaningful coordinates that approximately correspond to the two axes of a polar coordinate system, and the resulting intrinsic space captures more shape details than PCA.

Figure 1. Left: dimensionality reduction using PCA. The first two variation modes are selected; the resulting intrinsic coordinates approximately lie on a circle, and many fish deformation modes such as twisting are lost. Right: dimensionality reduction using ISOMAP. The red data points are visualized. The green line represents the major mode of variability, which accounts for the fish rotation, and the yellow lines represent the mode of variability that accounts for the local twisting.
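The three stages of ISOMAP (a neighborhood graph, graph shortest paths as geodesic estimates, and classical multidimensional scaling) can be sketched as follows. This is a minimal NumPy illustration and not the implementation used in the experiments; the function name `isomap`, the neighborhood size `n_neighbors` and the use of Floyd-Warshall for the shortest paths are assumptions made for brevity.

```python
import numpy as np

def isomap(X, n_neighbors=6, d=2):
    """Embed the rows of X (n x D) into d dimensions (illustrative sketch)."""
    n = X.shape[0]
    # Pairwise Euclidean distances in the original space.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Symmetric k-nearest-neighbour graph; missing edges have infinite length.
    graph = np.full((n, n), np.inf)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(n), n_neighbors)
    graph[rows, nearest.ravel()] = dist[rows, nearest.ravel()]
    graph = np.minimum(graph, graph.T)
    np.fill_diagonal(graph, 0.0)

    # Floyd-Warshall: shortest graph paths approximate geodesic distances.
    for k in range(n):
        graph = np.minimum(graph, graph[:, [k]] + graph[[k], :])
    if not np.isfinite(graph).all():
        raise ValueError("neighbourhood graph is disconnected; increase n_neighbors")

    # Classical MDS on the geodesic distances.
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    gram = -0.5 * centering @ (graph ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```

The cubic-time shortest-path loop is acceptable for training sets of a few hundred shapes, which is the scale reported in the experiments; larger sets would call for a sparse shortest-path routine.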

2.3. Density parameters estimation

In the MFA model specified by equation (2), $y$ and $w$ are the latent variables and the parameters of the MFA model are $\{\Lambda_w, \mu_w, P(w), \Psi_w\}$. We developed an ML-based estimation algorithm that can efficiently estimate the model parameters for the subsequent tracking task. The idea is to make the latent variables “observable” under certain assumptions. Assuming that the original training data lie on a low-dimensional manifold, we use the intrinsic coordinates obtained by the dimensionality reduction algorithm to replace the latent variable $y$. Similarly, assuming that closely gathered training data points are generated by the same mixture component, we use the k-means algorithm to label the training data and replace the latent variable $w$ with the label. Since the dimensionality reduction algorithm and the k-means algorithm are both unsupervised, the whole learning algorithm is also unsupervised.

Consider the training data set $\{x_i, y_i, w_i\}$, $i = 1, .., N$, where $w_i \in [1, M]$ is the label of $\{x_i, y_i\}$ and $M$ is the number of mixture components in the MFA model. From equations (4)-(5), the complete data log-likelihood function is defined as:

$$\Phi(\Theta) = \sum_{i=1}^{N} \Big[ -\tfrac{1}{2}\log|\Psi_{w_i}| - \tfrac{1}{2}\bar{x}_i^T \Psi_{w_i}^{-1} \bar{x}_i - \tfrac{1}{2}\log|\Sigma_{w_i}| - \tfrac{1}{2}\bar{y}_i^T \Sigma_{w_i}^{-1} \bar{y}_i + \log\alpha_{w_i} \Big] \qquad (6)$$

where $\bar{x}_i = x_i - \Lambda_{w_i} y_i - \mu_{w_i}$, $\bar{y}_i = y_i - v_{w_i}$, $\alpha_{w_i} = p(w_i)$ and $\sum_{l=1}^{M} \alpha_l = 1$. The model parameters are now $\Theta = \{\alpha_l, v_l, \Sigma_l, \Lambda_l, \mu_l, \Psi_l\}$, $l = 1, .., M$. In our experiment, we set $\Sigma_l$ to be a full covariance matrix and $\Psi_l$ to be a constrained covariance matrix proportional to the identity matrix. The parameters are estimated by maximizing the objective function in equation (6). The detailed formula for each parameter can be found in Appendix A. The k-means centers and the density model of the fish shape are shown in Figure 2.

3. Mapping between high and low dimensional states

Given the low-dimensional representation with the density models, we now need to establish the mapping between high- and low-dimensional states. Denote the mappings as $\xi : R^D \to R^d$, $y = \xi(x)$ and $\eta : R^d \to R^D$, $x = \eta(y)$, where $x$ and $y$ are the high- and low-dimensional vectors respectively. Since the density model is known, one straightforward method is to take the conditional expectations of $x$ and $y$ as the mapping functions:

$$\eta(y) = E(x|y) = \sum_{w} (\Lambda_w y + \mu_w)\,P(w|y) \qquad (7)$$

$$\xi(x) = E(y|x) = \sum_{w} \big((\Lambda_w^T \Lambda_w)^{-1} \Lambda_w^T (x - \mu_w)\big)\,P(w|x) \qquad (8)$$

One problem with the above parametric mapping is that $\eta \circ \xi$ and $\xi \circ \eta$ are not identity mappings even on the training set. The inaccuracy is due to the simplified density model. In fact, the mapping from $y$ to $x$ is the more difficult one, since there are limited training data and we simply assume $\Psi_l$ to be a constrained matrix proportional to the identity matrix, as described in section 2.3.

Figure 2. Left: cluster centers of the fish shape found by the k-means algorithm. Right: the resulting MFA model. There are 17 mixture components, where ’+’ represents a mixture center and each ellipse represents the equal-probability curve at 2 standard deviations.

An accurate non-parametric mapping is therefore derived. Inspired by the LLE algorithm [10], we use a similar idea of local neighborhood reconstruction (in the training set) to construct accurate mappings between high- and low-dimensional data. The idea rests on two assumptions: 1. Each data point and its neighbors lie on a locally linear patch of the manifold, and the local geometry of these patches is characterized by linear coefficients that reconstruct each data point from its neighbors. 2. The same weights that reconstruct the ith data point in D dimensions should also reconstruct its intrinsic manifold coordinates in d dimensions, and vice versa. Based on these assumptions, computing the output x for an input y has the following steps: 1) Identify the neighbors of y in the training set that can reconstruct it. 2) Compute the linear weights that best reconstruct y from the selected neighbors. This can be implemented by solving a linear system. 3) Since the neighbors of y have known corresponding high-dimensional coordinates, x is obtained by linearly combining these coordinates with the calculated weights.
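The three steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it uses plain K nearest neighbors with a small regularization term rather than the neighbor selection of Appendix B, and the names `reconstruct_high_dim`, `k` and `reg` are hypothetical.

```python
import numpy as np

def reconstruct_high_dim(y, Y_train, X_train, k=4, reg=1e-6):
    """Map a low-dimensional point y to high dimensions (illustrative sketch)."""
    # 1) Neighbours of y among the training coordinates.
    nearest = np.argsort(np.linalg.norm(Y_train - y, axis=1))[:k]

    # 2) Weights that best reconstruct y from its neighbours, subject to
    #    sum(weights) == 1: solve G w = 1 and normalise, as in LLE.
    diff = Y_train[nearest] - y
    gram = diff @ diff.T
    gram += reg * (np.trace(gram) + 1e-12) * np.eye(k)  # assumed regulariser
    weights = np.linalg.solve(gram, np.ones(k))
    weights /= weights.sum()

    # 3) Apply the same weights to the neighbours' high-dimensional states.
    return weights @ X_train[nearest]
```

When the training pairs are related by an affine map, any weights that sum to one and reconstruct y also reconstruct x exactly, which gives a simple sanity check for the sketch.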

One important problem is how to select the neighbors of y for reconstruction in step 1. Usually the K nearest neighbors of y are selected, but the local covariance matrix may suffer from degeneracy when the local difference vectors are linearly dependent. This can result in multiple solutions for the linear weights. Although we can add a regularization term to avoid the degeneracy, the reconstruction weights may still deviate strongly when the local covariance matrix is far from full rank. This will place the reconstructed x far from the position indicated by the training data. We therefore propose an efficient algorithm that selects neighbors so that the local covariance matrix has full rank. The algorithm is a variant of the Gram-Schmidt process. It guarantees an accurate mapping between the high- and low-dimensional states at least on the training set. The detailed algorithm can be found in Appendix B.

4. Modeling and learning of the dynamical model

The dynamical model is constructed based on the object’s intrinsic representation. Assuming the Markov property of the dynamical process, the dynamical model can be represented by $P(y_t|y_{t-1})$. Given the object density model, and assuming statistical independence between $w_t|w_{t-1}$ and $y_{t-1}$, the dynamical model can be factorized as follows:

$$P(y_t|y_{t-1}) = \sum_{w_t} P(y_t|y_{t-1}, w_t)\,P(w_t|y_{t-1}) = \sum_{w_t, w_{t-1}} P(y_t|y_{t-1}, w_t)\,P(w_t|w_{t-1})\,P(w_{t-1}|y_{t-1}) \qquad (9)$$

where $P(w_{t-1}|y_{t-1})$ can be calculated directly; $P(w_t|w_{t-1})$ is an entry of the Markov matrix representing the transition probabilities between different mixtures, denoted $J = [P(w_t|w_{t-1})]_{M \times M}$, which can be learned from the transition histogram of the training data; and $P(y_t|y_{t-1}, w_t)$ is the dynamical model under a fixed mixture label, which can be modeled as a first-order Gaussian Auto-Regressive Process (ARP) defined by:

$$P(y_t|y_{t-1}, w_t) = N(A_{w_t} y_{t-1} + b_{w_t},\; C_{y_{t-1}, w_t}) \qquad (10)$$

Equation (10) has two parts: 1) the deterministic part with parameters $A_{w_t}$ and $b_{w_t}$, and 2) the stochastic part of Gaussian noise with zero mean and covariance matrix $C_{y_{t-1}, w_t}$.

4.1. Learning parameters of the dynamical model

The parameters of the dynamical model are $\Xi = \{A_l, b_l, C_l(y_{t-1}), J\}$, $l = 1, .., M$. The learning algorithm is briefly described as follows:

Learning deterministic parameters: Since the dynamical system is directly observable, the parameters $A_l$, $b_l$ of the first-order ARP can be learned from training data using the MLE algorithm as described in [9].

Learning stochastic parameters: We assume that $C_l(y_{t-1})$ is proportional to the covariance matrix $\Sigma_l$ of the corresponding mixture, with a factor determined by $y_{t-1}$: $C_l(y_{t-1}) = a(y_{t-1})\,\Sigma_l$. In our experiment, $a(y_{t-1})$ is a piecewise constant value that depends on the distance of $y_{t-1}$ from the corresponding mixture center.

Learning Markov matrix: The Markov matrix $J$ can be learned from the transition histogram of temporally continuous training data. We first pick key-frame objects and calculate their intrinsic coordinates. After applying an exponential filter to smooth the transition probability, the transition histogram is constructed. The histogram is then normalized to give the Markov matrix.
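As an illustration of the last step, the transition histogram and its normalization can be sketched as below. The exponential filter is not specified in enough detail to reproduce, so this sketch swaps in simple additive smoothing instead; the function name `learn_markov_matrix`, the `smoothing` parameter and the row convention `J[i][j] = P(w_t = j | w_{t-1} = i)` with 0-based labels are all assumptions.

```python
def learn_markov_matrix(labels, num_mixtures, smoothing=1e-3):
    """Estimate J[i][j] = P(w_t = j | w_{t-1} = i) from a label sequence.

    `labels` is the temporally ordered list of mixture labels (0-based).
    With smoothing == 0, every label must appear as a predecessor at least
    once, otherwise a row sums to zero and cannot be normalised.
    """
    # Transition histogram, initialised with the additive smoothing term.
    counts = [[smoothing] * num_mixtures for _ in range(num_mixtures)]
    for prev, cur in zip(labels[:-1], labels[1:]):
        counts[prev][cur] += 1.0
    # Normalise each row so that it is a probability distribution.
    return [[c / sum(row) for c in row] for row in counts]
```

The smoothing term keeps unseen transitions at a small non-zero probability, so a sample is never permanently locked out of a mixture that the training sequence happened not to visit from its current one.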

4.2. Mixture of Gaussian diffusion noise model

From equations (9) and (10), we get the dynamical model in the form of a mixture of Gaussian diffusion (MGD) noises:

$$P(y_t|y_{t-1}) = \sum_{w_t} N(A_{w_t} y_{t-1} + b_{w_t},\; C_{y_{t-1}, w_t})\,\beta(w_t; y_{t-1}) \qquad (11)$$

where $\beta(w_t; y_{t-1}) = \sum_{w_{t-1}} P(w_t|w_{t-1})\,P(w_{t-1}|y_{t-1})$. Each Gaussian noise is located at the position $A_{w_t} y_{t-1} + b_{w_t}$, with covariance matrix $C_{y_{t-1}, w_t}$ and mixture coefficient $\beta(w_t; y_{t-1})$. The noise parameters change dynamically according to the current state $y_{t-1}$. An illustration of the mixture of Gaussian diffusion noise model in one dimension is shown in Figure 3. Such a noise model can easily be implemented in the particle-filter based tracker described in section 5. It is worth pointing out that the multi-peaked, narrow noise needs fewer particles to fill the high-probability region.

Figure 3. Mixture of Gaussian diffusion noise model
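Sampling from the MGD model of equation (11) amounts to computing the coefficients β, drawing a mixture label, and then drawing from that mixture's Gaussian. The following one-dimensional sketch illustrates this; it assumes scalar states, a constant scale factor `a_scale` in place of the piecewise constant a(y), and the row convention `J[i][j] = P(w_t = j | w_{t-1} = i)`. All function and parameter names are hypothetical.

```python
import math
import random

def responsibilities(y, alpha, v, var):
    """P(w | y) for a 1-D Gaussian mixture (weights alpha, means v, variances var)."""
    logp = [math.log(a) - 0.5 * (math.log(s) + (y - m) ** 2 / s)
            for a, m, s in zip(alpha, v, var)]
    top = max(logp)  # subtract the maximum for numerical stability
    p = [math.exp(lp - top) for lp in logp]
    total = sum(p)
    return [pi / total for pi in p]

def mixture_coefficients(y_prev, alpha, v, var, J):
    """beta(w_t; y_prev) = sum over w_prev of J[w_prev][w_t] * P(w_prev | y_prev)."""
    resp = responsibilities(y_prev, alpha, v, var)
    M = len(alpha)
    return [sum(J[i][j] * resp[i] for i in range(M)) for j in range(M)]

def sample_mgd(y_prev, alpha, v, var, J, A, b, a_scale, rng):
    """Draw (y_t, w_t) from the mixture of Gaussian diffusions."""
    beta = mixture_coefficients(y_prev, alpha, v, var, J)
    w = rng.choices(range(len(beta)), weights=beta)[0]
    mean = A[w] * y_prev + b[w]
    std = math.sqrt(a_scale * var[w])  # C = a * Sigma_w, with a held constant here
    return rng.gauss(mean, std), w
```

Because the label is drawn first, each particle lands inside one narrow Gaussian rather than a single broad one, which is the mechanism behind the reduced particle count noted above.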

5. Probabilistic tracking

The probabilistic tracking algorithm is briefly presented in this section. The tracker adopts the framework of the ICONDENSATION algorithm [8]. Conceptually, there are three components in the tracker: the object representation, the dynamical model and the measurement model. For the detailed measurement model, please refer to [16].

One step of the tracking algorithm is depicted in Figure 4. The conditional object state density at time $t$ is represented by the weighted sample set $\{(s_t^{(n)}, \pi_t^{(n)}), n = 1, .., N\}$, where $s_t^{(n)}$ is a discrete random sample with probability proportional to its weight $\pi_t^{(n)}$.

Generate the sample set $S_t$ at time $t$ from the sample set $S_{t-1}$ at time $t-1$, where $S_t = \{(s_t^{(n)}, \pi_t^{(n)}), n = 1, .., N\}$ and $S_{t-1} = \{(s_{t-1}^{(n)}, \pi_{t-1}^{(n)}), n = 1, .., N\}$. For $i = 1$ to $N$:

1) $s_{t,0}^{(i)}$ = Importance Resampling($S_{t-1}$). A hybrid method is used: the new samples can be generated from the importance prior, the importance function or the posterior density function at time $t-1$. Each option has a fixed probability of being selected.

2) $s_t^{(i)}$ = Prediction($s_{t,0}^{(i)}$).

2.1) If $s_{t,0}^{(i)}$ is drawn from the importance prior or the importance function, a small Gaussian noise is added to $s_{t,0}^{(i)}$ to get $s_t^{(i)}$.

2.2) Otherwise, split the sample $s_{t,0}^{(i)}$ into 2 parts: $s_{t,0}^{(i)} = s_{t,0,a}^{(i)} + s_{t,0,b}^{(i)}$, where $s_{t,0,a}^{(i)}$ is the rigid state part (translation and scale) and $s_{t,0,b}^{(i)}$ is the non-rigid, or intrinsic, state part. The rigid state is predicted using a second-order linear predictor. The intrinsic state is predicted by adding the MGD noise as follows: first the mixture label $w_t$ is selected with a probability proportional to $\beta(w_t; s_{t,0,b}^{(i)})$, then the non-rigid state is predicted using the dynamical model of mixture $w_t$.

3) $\pi_t^{(i)} = \lambda_t^{(i)} \ast$ Measurement($\eta(s_t^{(i)})$), where $\lambda_t^{(i)}$ is the importance sampling correction term of sample $s_{t,0}^{(i)}$ and $\eta$ is the mapping function from low- to high-dimensional data.

Figure 4. Probabilistic tracking algorithm.
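A much simplified version of the step in Figure 4 can be sketched as follows. It covers only the branch in which samples are drawn from the posterior at time t-1, so the importance prior, the importance function and the rigid-state predictor are omitted, and the correction term is taken as 1 for every sample (an assumption of this sketch). The callables `predict`, `measure` and `eta` are hypothetical placeholders for the dynamical model, the measurement model and the low-to-high dimensional mapping.

```python
import random

def tracking_step(samples, weights, predict, measure, eta, rng):
    """One simplified resample-predict-measure step over intrinsic states."""
    n = len(samples)
    # 1) Resample from the previous posterior, proportionally to the weights.
    resampled = rng.choices(samples, weights=weights, k=n)
    # 2) Predict each sample with the dynamical model (e.g. MGD noise).
    predicted = [predict(s, rng) for s in resampled]
    # 3) Weight each sample by the measurement of its mapped
    #    high-dimensional state, then normalise.
    new_weights = [measure(eta(s)) for s in predicted]
    total = sum(new_weights)
    if total <= 0.0:
        raise ValueError("all samples have zero measurement likelihood")
    return predicted, [w / total for w in new_weights]
```

Passing the mapping `eta` in as a function keeps the particle set in the low-dimensional intrinsic space, so only the measurement sees the full parametric shape.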

6. Experiments

We performed two experiments to evaluate the performance of the intrinsic tracker. The first experiment tracks a swimming fish against a cluttered background. The fish swims, rotates and twists, and it is challenging to robustly track such non-rigid motion. In the second experiment, we track the lips of different people under different poses with large inter-frame motion. Without a multi-modal intrinsic dynamical model to make correct predictions, the dense in-mouth clutter is likely to distract the tracker. The intrinsic tracker is compared to the ICondensation tracker.

6.1. Fish tracking

In the fish tracking experiment, the fish shape is first modeled by a cubic B-spline curve with 18 control points. Then the PCA model of the fish shape is trained from about 120 selected frames, and the fish shape state is reduced to 22 dimensions, accounting for 99.9% of the total variance. The intrinsic fish representation and its intrinsic dynamics are then learned. There are 17 mixtures in the MFA model and the resulting intrinsic state of the fish is 2-dimensional. The parameters of the dynamical model are learned according to section 4.1. The first 200 frames of fish swimming are taken as the training sequence, in which the fish swims forward and backward and twists.
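The PCA preprocessing step, which keeps the fewest modes that account for a given fraction of the total variance, can be sketched in NumPy as follows. The function name `pca_reduce` and its return values are assumptions for illustration; the rows of `shapes` would be the flattened control-point coordinates of the training frames.

```python
import numpy as np

def pca_reduce(shapes, variance_kept=0.999):
    """Project rows of `shapes` onto the fewest principal modes that keep
    at least `variance_kept` of the total variance."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Squared singular values of the centred data are proportional to
    # the variance captured by each mode.
    _, sing, modes = np.linalg.svd(centered, full_matrices=False)
    variance = sing ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index whose cumulative share reaches the threshold.
    m = int(np.searchsorted(cumulative, variance_kept) + 1)
    m = min(m, len(sing))
    return centered @ modes[:m].T, mean, modes[:m]
```

A shape is recovered from its reduced coordinates as `coords @ modes + mean`, which is the parametric state that the intrinsic representation is subsequently learned from.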

Since the ICondensation algorithm is unstable without the intrinsic fish representation, we integrate the low-dimensional intrinsic representation into the ICondensation algorithm. The image measurement is also the same for both trackers for comparison. The image measurement model is similar to that used in [16], except that we use gray steerable edges instead of color edges and only the fish and non-fish color histograms are constructed. The intrinsic tracker needs 2000 samples to achieve a stable result while the ICondensation tracker needs 1000 samples (but it loses track when the fish twists). The intrinsic tracker runs at about 1 Hz and the ICondensation tracker at about 2 Hz on a 1.4 GHz Pentium IV computer.

The tracking result for the swimming fish is shown in Figure 5. The original image size is 384×288 and the images shown are cropped to much smaller regions for clarity. Both trackers correctly track the position of the fish. The intrinsic tracker correctly tracks the twisting and rotation of the fish, while the ICondensation tracker loses track when the fish changes direction (as in frames 81, 181, etc.).

To obtain a quantitative result, we manually label some fish shapes as ground truth and compare them to the tracking result. The chamfer distance [4] between two corresponding shapes is calculated as the error residue. Figure 6 shows the error residue for both the intrinsic tracker and the ICondensation tracker.
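For reference, a point-set form of the chamfer distance can be sketched as below. Which variant was used for the error residue (directed or symmetric, and how the contours were sampled) is not stated, so the directed mean-of-nearest-distances definition and the function names here are assumptions.

```python
import math

def chamfer_distance(shape_a, shape_b):
    """Mean distance from each point of shape_a to its nearest point of shape_b."""
    return sum(min(math.dist(p, q) for q in shape_b) for p in shape_a) / len(shape_a)

def symmetric_chamfer(shape_a, shape_b):
    """Average of the two directed chamfer distances."""
    return 0.5 * (chamfer_distance(shape_a, shape_b)
                  + chamfer_distance(shape_b, shape_a))
```

With contours sampled in pixel coordinates, the result is in pixels, matching the unit of the error residue.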

6.2. Lip tracking

In the lip tracking experiment, the lip shape is first modeled by two cubic B-spline curves with 13 and 5 control points respectively, representing the upper and lower outer lip contours. Then the PCA model of the lip shape is trained from about 100 selected frames, and the lip shape state is reduced to 14 dimensions, accounting for 99% of the total variance. The intrinsic lip representation and its intrinsic dynamics are then learned. There are 11 mixtures in the MFA model and the resulting intrinsic state of the lip is 2-dimensional. The parameters of the dynamical model are learned in the same way as in the fish tracking experiment, using the training sequence with large inter-frame motion. Both the ICondensation tracker and the intrinsic tracker use the global lip corner prediction algorithm [16] to predict lip corners. 500 samples are used for both trackers to achieve a stable result.

Figure 5. Fish tracking. Top block (3 rows): ICondensation tracker result. Bottom block (3 rows): intrinsic tracker result. In each block, the first row shows a swimming fish under different poses, corresponding to frames 54, 81, 181, 234. The second row is a sequence of the fish twisting from left to right, corresponding to frames 260, 263, 266, 269, 272. The third row is a sequence in which the fish tilts down and twists, corresponding to frames 531, 535, 539, 543, 547.

The tracking result for large inter-frame motion with dense clutter is shown in Figure 7. The intrinsic tracker can track the rapid mouth-opening motion, while the ICondensation tracker performs poorly in this case. This shows the necessity of the multi-modal dynamical model.

The lip tracking results for different persons at different poses (head rotation of about 20 degrees) and an illustration of the intrinsic lip representation are shown in Figure 8.

Figure 6. Quantified results of the ICondensation tracker and the intrinsic tracker on the fish swimming sequence. The chamfer distance is in pixels.

Figure 7. Tracking large inter-frame motion with dense clutter. The first row is the intrinsic tracker’s result. From left to right: posterior estimation of frames 7, 8, 9, 10, 11, and the multiple samples in frames 9, 10. The second row is the corresponding ICondensation tracker’s result. The third row is the lip region probability map and the fourth row is the color edge in the lip region.

Figure 8. Left: lip tracking results for different people under different poses. Right: the intrinsic representation of the lip for multiple people and different poses. The red data points on the vertical and horizontal lines are visualized at the left and bottom sides. The vertical line can be regarded as a close/open mouth process and the horizontal line as the lip under different poses.

7. Conclusions

A method of learning object intrinsic structure for visual tracking is presented. The object intrinsic structure includes an intrinsic representation and an intrinsic dynamical model. The obtained object intrinsic state is a globally coordinated low-dimensional vector in a continuous metric space. The learning algorithm is general, and experiments on fish and lip tracking show that the learned probabilistic tracker significantly improves performance when tracking objects with variable shape and complex dynamics. Furthermore, interesting shape details of the object can be obtained by applying the learning algorithm to a larger training set, possibly with enlarged intrinsic dimensionality.

We intend to explore several avenues in future work:

a). The current mapping between high- and low-dimensional states is non-parametric, and its computational complexity depends on the size of the training set. An alternative is to establish the mapping using compactly supported Radial Basis Functions (RBF), which are computationally more efficient with similar accuracy.

b). How general is the assumption of a low-dimensional, intrinsically flat manifold? And what if the object state does not lie on such a manifold?

Acknowledgement

We would like to thank Tarence Tay for providing the fish swimming video data. We are grateful to the CVPR reviewers for their constructive comments.

References

[1] A. Blake and M. Isard. Active Contours. Springer-Verlag, 1998.

[2] T. F. Cootes, D. Cooper, C. Taylor and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[3] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002.

[4] D. Gavrila and V. Philomin. Real-time object detection for smart vehicles. Proc. IEEE ICCV, pages 87–93, 1999.

[5] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report, CRG-TR-96-1, 1996.

[6] T. Heap and D. Hogg. Wormholes in Shape Space: Tracking through Discontinuous Changes in Shape. Proc. IEEE ICCV, pages 344–349, 1998.

[7] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. Proc. ECCV, pages 343–356, 1996.

[8] M. Isard and A. Blake. ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework. Proc. ECCV, pages 893–908, 1998.

[9] B. North, A. Blake, M. Isard and J. Rittscher. Learning and Classification of Complex Dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1016–1034, Sept. 2000.

[10] S. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 2000.

[11] J. Sullivan, A. Blake, M. Isard and J. MacCormick. Object Localization by Bayesian Correlation. Proc. IEEE ICCV, pages 1068–1075, 1999.

[12] T. Tay and K.K. Sung. Probabilistic Learning and Modeling of Object Dynamics for Tracking. Proc. IEEE ICCV, vol. II, pages 648–653, 2001.

[13] J.B. Tenenbaum, V. deSilva and J.C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, Dec. 2000.

[14] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443–482, 1999.

[15] K. Toyama and A. Blake. Probabilistic Tracking in a Metric Space. Proc. IEEE ICCV, vol. II, pages 50–57, 2001.

[16] Qiang Wang, Haizhou Ai and Guangyou Xu. A Probabilistic Dynamic Contour Model for Accurate and Robust Lip Tracking. Proc. IEEE Fourth International Conference on Multimodal Interfaces, 2002.

[17] Y. Wu and T. S. Huang. A Co-inference Approach to Robust Visual Tracking. Proc. IEEE ICCV, pages 26–33, 2001.

Appendix A

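The complete data log-likelihood function is specified by equation (6). The ML estimates of the model parameters are as follows:

$$\alpha_l = \frac{C_l}{N}, \quad \text{where } C_l \text{ is the number of samples with } w_i = l$$

$$v_l = \frac{1}{C_l} \sum_{i,\, w_i = l} y_i, \qquad \Sigma_l = \frac{1}{C_l} \sum_{i,\, w_i = l} (y_i - v_l)(y_i - v_l)^T$$

$$\tilde{\Lambda}_l = \Big( \sum_{i,\, w_i = l} x_i \tilde{y}_i^T \Big) \cdot \Big( \sum_{i,\, w_i = l} \tilde{y}_i \tilde{y}_i^T \Big)^{-1}$$

where $\tilde{\Lambda}_l = [\Lambda_l, \mu_l]$ and $\tilde{y}_i = [y_i^T \; 1]^T$. And $\Psi_l = \sigma_l I$ with:

$$\sigma_l = \frac{1}{C_l D} \sum_{i,\, w_i = l} (x_i - \Lambda_l y_i - \mu_l)^T (x_i - \Lambda_l y_i - \mu_l)$$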
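These closed-form estimates translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name `fit_labeled_mfa`, the dictionary layout of the result and the 0-based labels are assumptions, and every label is assumed to occur at least once in the training set.

```python
import numpy as np

def fit_labeled_mfa(X, Y, labels, num_mixtures):
    """Closed-form ML estimates for each mixture, given labelled (x, y) pairs.

    X: (N, D) parametric states, Y: (N, d) intrinsic coordinates,
    labels: (N,) integers in [0, num_mixtures).
    """
    N, D = X.shape
    params = []
    for l in range(num_mixtures):
        Xl, Yl = X[labels == l], Y[labels == l]
        C = len(Xl)
        v = Yl.mean(axis=0)
        Sigma = (Yl - v).T @ (Yl - v) / C
        # Augment y with a constant 1 so that [Lambda, mu] is one regression.
        Yt = np.hstack([Yl, np.ones((C, 1))])
        Lt = np.linalg.solve(Yt.T @ Yt, Yt.T @ Xl).T
        Lam, mu = Lt[:, :-1], Lt[:, -1]
        # Isotropic noise level: mean squared residual per dimension.
        resid = Xl - Yl @ Lam.T - mu
        sigma = (resid ** 2).sum() / (C * D)
        params.append({"alpha": C / N, "v": v, "Sigma": Sigma,
                       "Lambda": Lam, "mu": mu, "sigma": sigma})
    return params
```

No iteration is needed because both latent variables have been replaced by observed values, which is where the speed advantage over EM for a general MFA comes from.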

Appendix B

Denote the neighborhood set of $y$ as $P = \{p_1, .., p_n\}$, sorted by the distance from $y$ ($y \neq p_i$). The following algorithm picks a subset $Q = \{q_1, q_2, .., q_m\}$ from $P$ such that $\{q_i - y, i = 1, .., m\}$ are linearly independent.

The algorithm is as follows:

1. Set $q_1 = p_1$, $r_1 = p_1 - y$ and $k = 1$.

2. For $i = 2, .., n$, let:

$$t = (p_i - y) - \frac{(p_i - y) \cdot r_1}{r_1 \cdot r_1}\, r_1 - \dots - \frac{(p_i - y) \cdot r_k}{r_k \cdot r_k}\, r_k$$

If $t \neq 0$, then $k = k + 1$; $q_k = p_i$; $r_k = t$.

3. Suppose the final $k$ adds up to $m$; then $Q = \{q_1, q_2, .., q_m\}$ is the selected neighborhood set.

The above algorithm assumes that $y \neq p_i$, $i = 1, .., n$. If the neighborhood set has a vector equal to $y$, we can just add it to the final set $Q$.
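A direct transcription of this selection procedure is sketched below in plain Python. The exact test t ≠ 0 is replaced by a relative tolerance `tol` to cope with floating-point rounding; that tolerance, the function name and the choice to return indices rather than points are assumptions of the sketch.

```python
import math

def select_independent_neighbors(y, neighbors, tol=1e-10):
    """Return indices of neighbours whose offsets from y are linearly independent."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    # Visit the neighbours in order of increasing distance from y.
    order = sorted(range(len(neighbors)), key=lambda i: math.dist(neighbors[i], y))
    chosen, basis = [], []
    for i in order:
        offset = [pi - yi for pi, yi in zip(neighbors[i], y)]
        scale = math.sqrt(dot(offset, offset))
        if scale <= tol:
            chosen.append(i)  # a neighbour equal to y is simply kept
            continue
        # Remove the components of the offset along the orthogonal basis so far.
        t = offset[:]
        for r in basis:
            coeff = dot(offset, r) / dot(r, r)
            t = [ti - coeff * ri for ti, ri in zip(t, r)]
        if math.sqrt(dot(t, t)) > tol * scale:
            chosen.append(i)
            basis.append(t)
    return chosen
```

Since at most d offsets can be linearly independent in d dimensions, the selected set has at most d members besides any neighbour that coincides with y.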