Adaptive Discriminative Generative Model and Its Applications


Ruei-Sung Lin David Ross Jongwoo Lim Ming-Hsuan Yang

University of Illinois University of Toronto Honda Research Institute

rlin1@uiuc.edu dross@cs.toronto.edu jlim1@uiuc.edu myang@honda-ri.com

Abstract

This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.

1 Introduction

Tracking moving objects is an important and essential component of visual perception, and has been an active research topic in the computer vision community for decades. Object tracking can be formulated as a continuous state estimation problem where the unobservable states encode the locations or motion parameters of the target objects, and the task is to infer the unobservable states from the observed images over time. At each time step, a tracker first predicts a few possible locations (i.e., hypotheses) of the target in the next frame based on its prior and current knowledge. The prior knowledge includes its previous observations and estimated state transitions. Among these possible locations, the tracker then determines the most likely location of the target object based on the new observation. An attractive and effective prediction mechanism is based on Monte Carlo sampling in which the state dynamics (i.e., transition) can be learned with a Kalman filter or simply modeled as a Gaussian distribution. Such a formulation indicates that the performance of a tracker depends largely on a good observation model for validating all hypotheses. Indeed, learning a robust observation model has been the focus of most recent object tracking research within this framework, and is also the focus of this paper.

Most of the existing approaches utilize static observation models and construct them before a tracking task starts. To account for all possible variation in a static observation model, it is imperative to collect a large set of training examples with the hope that it covers all possible variations of the object's appearance. However, it is well known that the appearance of an object varies significantly under different illumination, viewing angle, and shape deformation. It is a daunting, if not impossible, task to collect a training set that enumerates all possible cases. An alternative approach is to develop an adaptive method that contains a number of trackers that track different features or parts of a target object [3]. Therefore, even though each tracker may fail under certain circumstances, it is unlikely that all of them fail at the same time. The tracking method then adaptively selects the trackers that are robust in the current situation to predict object locations. Although this approach improves the flexibility and robustness of a tracking method, each tracker has a static observation model which has to be trained beforehand and consequently restricts its application domains severely. There are numerous cases, e.g., robotics applications, where the tracker is expected to track a previously unseen target once it is detected. To the best of our knowledge, considerably less attention has been paid to developing adaptive observation models to account for appearance variation of a target object (e.g., pose, deformation) or environmental changes (e.g., lighting conditions and viewing angles) as the tracking task progresses.

Our approach is to learn a model for determining the probability of a predicted image location being generated from the class of the target or the background. That is, we formulate a binary classification problem and develop a discriminative model to distinguish observations from the target class and the background class. While conventional discriminative classifiers simply predict the class of each test sample, a good model within the abovementioned tracking framework needs to select the most likely sample that belongs to the target object class from a set of samples (or hypotheses). In other words, an observation model needs a classifier with a proper probabilistic interpretation.

In this paper, we present an adaptive discriminative generative model and apply it to object tracking. The proposed model aims to best separate the target and the background in the ever-changing environment. The problem is formulated as a density estimation problem, where the goal is, given a set of positive (i.e., belonging to the target object class) and negative examples (i.e., belonging to the background class), to learn a distribution that assigns high probability to the positive examples and low probability to the negative examples. This is done in a two-stage process. First, in the generative stage, we use probabilistic principal component analysis to model the density of the positive examples. The result of this stage is a Gaussian, which assigns high probability to examples lying in the linear subspace which captures the most variance of the positive examples. Second, in the discriminative stage, we use negative examples (specifically, negative examples that are assigned high probability by our generative model) to produce a new distribution which reduces the probability of the negative examples. This is done by learning a linear projection that, when applied to the data and the generative model, increases the distance between the negative examples and the mean. Toward that end, it is formulated as an optimization problem and we show that this is a direct generalization of the conventional Fisher Linear Discriminant algorithm with proper probabilistic interpretation. Our experimental results show that our algorithm can reliably track moving objects whose appearance changes under different poses, illumination, and self-deformation.

2 Probabilistic Tracking Algorithm

We formulate the object tracking problem as a state estimation problem in a way similar to [5] [9]. Denote $o_t$ as an image region observed at time $t$ and $O_t = \{o_1, \ldots, o_t\}$ as the set of image regions observed from the beginning to time $t$. Object tracking is the process of inferring the state $s_t$ from the observations $O_t$, where the state $s_t$ contains a set of parameters referring to the tracked object's 2-D position, orientation, and scale in image $o_t$. Assuming a Markovian state transition, this inference problem is formulated with a recursive equation:

$$p(s_t \mid O_t) = k\, p(o_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid O_{t-1})\, ds_{t-1} \tag{1}$$

where $k$ is a constant, and $p(o_t \mid s_t)$ and $p(s_t \mid s_{t-1})$ correspond to the observation model and dynamic model, respectively.

In (1), $p(s_{t-1} \mid O_{t-1})$ is the state estimate given all the prior observations up to time $t-1$, and $p(o_t \mid s_t)$ is the likelihood of observing image $o_t$ at state $s_t$. Put together, the posterior estimate $p(s_t \mid O_t)$ can be computed efficiently. For object tracking, an ideal distribution of $p(s_t \mid O_t)$ should peak at the state $s_t$ that matches the observed object's location in $o_t$. While the integral in (1) predicts the regions where the object is likely to appear given all the prior observations, the observation model $p(o_t \mid s_t)$ determines the most likely state that matches the observation at time $t$.
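The recursion in (1) can be illustrated on a small discrete state grid, where the integral becomes a matrix-vector product. This is only a sketch under stated assumptions: the five-position grid, the motion model, and the likelihood values are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def bayes_filter_step(prior, transition, likelihood):
    """One step of recursion (1) on a discrete state grid.

    prior[j]         = p(s_{t-1} = j | O_{t-1})
    transition[i, j] = p(s_t = i | s_{t-1} = j)
    likelihood[i]    = p(o_t | s_t = i)
    """
    predicted = transition @ prior             # the integral in (1), as a sum
    unnormalized = likelihood * predicted      # weight by the observation model
    return unnormalized / unnormalized.sum()   # k acts as the normalizer

# Hypothetical 1-D example with five candidate positions.
positions = np.arange(5)
prior = np.array([0.0, 0.1, 0.8, 0.1, 0.0])
# Gaussian-shaped motion model centered on the previous position.
transition = np.exp(-0.5 * (positions[:, None] - positions[None, :]) ** 2)
transition /= transition.sum(axis=0, keepdims=True)
# Made-up observation likelihoods for the new frame.
likelihood = np.array([0.05, 0.1, 0.3, 0.9, 0.2])

posterior = bayes_filter_step(prior, transition, likelihood)
```

In this toy setting the prediction alone favors the previous position, and the observation model is what moves the peak of the posterior to the neighboring position.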

In our formulation, $p(o_t \mid s_t)$ measures the probability of observing $o_t$ as a sample generated by the target object class. Note that $O_t$ is an image sequence, and if the images are acquired at a high frame rate, the difference between $o_t$ and $o_{t-1}$ is expected to be small even though the object's appearance might vary with viewing angle, illumination, and possible self-deformation. Instead of adopting a complex static model to learn $p(o_t \mid s_t)$ for all possible $o_t$, a simpler model suffices if it is adapted to account for the appearance changes. In addition, since $o_t$ and $o_{t-1}$ are most likely similar and computing $p(o_t \mid s_t)$ depends on $p(o_{t-1} \mid s_{t-1})$, the prior information $p(o_{t-1} \mid s_{t-1})$ can be used to enhance the distinctiveness between the object and its background in $p(o_t \mid s_t)$.

The idea of using an adaptive observation model for object tracking and then applying discriminative analysis to better predict object location is the focus of the rest of the paper. The observation model we use is based on probabilistic principal component analysis (PPCA) [10]. Object tracking using PCA models has been well explored in the computer vision community [2]. Nevertheless, most existing tracking methods do not update the observation models as time progresses. In this paper, we follow the work by Tipping and Bishop [10] and propose an adaptive observation model based on PCA within a formal probabilistic framework. Our result is a generalization of the conventional Fisher Linear Discriminant with proper probabilistic interpretation.

3 A Discriminative Generative Observation Model

In this work, we track a target object based on its observations in the videos, i.e., $o_t$. Since the size of the image region $o_t$ might change according to different $s_t$, we first convert $o_t$ to a standard size and use it for tracking. In the following, we denote $y_t$ as the standardized appearance vector of $o_t$.

The dimensionality of the appearance vector $y_t$ is usually high. In our experiments, the standard image size is a $19 \times 19$ patch and thus $y_t$ is a 361-dimensional vector. We thus model the appearance vector with a graphical model of low-dimensional latent variables.

3.1 A Generative Model with Latent Variables

A latent model relates an $n$-dimensional appearance vector $y$ to an $m$-dimensional vector of latent variables $x$:

$$y = Wx + \mu + \epsilon \tag{2}$$

where $W$ is an $n \times m$ projection matrix associating $y$ and $x$, $\mu$ is the mean of $y$, and $\epsilon$ is additive noise. As commonly assumed in factor analysis [1] and other graphical models [6], the latent variables $x$ are independent with unit variance, $x \sim \mathcal{N}(0, I_m)$, where $I_m$ is the $m$-dimensional identity matrix, and $\epsilon$ is zero-mean Gaussian noise, $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Since $x$ and $\epsilon$ are both Gaussian, it follows that $y$ is also Gaussian, $y \sim \mathcal{N}(\mu, C)$, where $C = WW^T + \sigma^2 I_n$ and $I_n$ is the $n$-dimensional identity matrix. Together with (2), we have a generative observation model:

$$p(o_t \mid s_t) = p(y_t \mid W, \mu, \epsilon) \sim \mathcal{N}(y_t \mid \mu, WW^T + \sigma^2 I_n) \tag{3}$$

This latent variable model follows the form of probabilistic principal component analysis, and its parameters can be estimated from a set of examples [10]. Given a set of appearance samples $Y = \{y_1, \ldots, y_N\}$, the covariance matrix of $Y$ is denoted as $S = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu)(y_i - \mu)^T$. Let $\{\lambda_i \mid i = 1, \ldots, n\}$ be the eigenvalues of $S$ arranged in descending order, i.e., $\lambda_i \ge \lambda_j$ if $i < j$. Also, define the diagonal matrix $\Sigma_m = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$, and let $U_m$ be the eigenvectors that correspond to the eigenvalues in $\Sigma_m$. Tipping and Bishop [10] show that the maximum likelihood estimates of $\mu$, $W$, and $\sigma^2$ can be obtained by

$$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad W = U_m(\Sigma_m - \sigma^2 I_m)^{1/2} R, \qquad \sigma^2 = \frac{1}{n-m}\sum_{i=m+1}^{n}\lambda_i \tag{4}$$

where $R$ is an arbitrary $m \times m$ orthogonal rotation matrix.
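The estimates in (4) can be sketched with an eigendecomposition of the sample covariance. The function name `fit_ppca`, the choice $R = I$, and the synthetic data below are assumptions made for illustration; this is a batch sketch, not the online procedure used in the tracker.

```python
import numpy as np

def fit_ppca(Y, m):
    """Maximum likelihood PPCA estimates following (4), taking R = I.

    Y is an (N, n) array of appearance vectors; m is the latent dimension.
    """
    N, n = Y.shape
    mu = Y.mean(axis=0)
    S = (Y - mu).T @ (Y - mu) / N              # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # returned in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[m:].mean()                # mean of the discarded eigenvalues
    U_m = eigvecs[:, :m]
    W = U_m @ np.diag(np.sqrt(eigvals[:m] - sigma2))
    return mu, W, sigma2, U_m, eigvals[:m]

# Synthetic data drawn from model (2) with noise variance 0.01.
rng = np.random.default_rng(0)
n, m, N = 6, 2, 5000
W_true = rng.normal(size=(n, m))
Y = rng.normal(size=(N, m)) @ W_true.T + 0.1 * rng.normal(size=(N, n))

mu, W, sigma2, U_m, lam = fit_ppca(Y, m)
C = W @ W.T + sigma2 * np.eye(n)               # model covariance from (3)
```

By construction, the model covariance `C` has the retained eigenvalues in its leading directions and `sigma2` in every remaining direction.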

To model all possible appearance variations of a target object (due to pose, illumination and view angle change), one could resort to a mixture of PPCA models. However, this not only requires significant computation for estimating the model parameters but also leads to other serious questions such as the number of components as well as under-fitting or over-fitting. On the other hand, at any given time a linear PPCA model suffices to model gradual appearance variation if the model is constantly updated. In this paper, we use a single PPCA, and dynamically adapt the model parameters $W$, $\mu$, and $\sigma^2$ to account for appearance change.

3.1.1 Probability computation with Probabilistic PCA

Once the model parameters are known, we can compute the probability that a vector $y$ is a sample of this generative appearance model. From (4), the log-probability is computed by

$$L(W, \mu, \sigma^2) = -\frac{1}{2}\left( n \log 2\pi + \log |C| + \bar{y}^T C^{-1} \bar{y} \right) \tag{5}$$

where $\bar{y} = y - \mu$. Neglecting the constant terms, the log-probability is determined by $\bar{y}^T C^{-1} \bar{y}$. Together with $C = WW^T + \sigma^2 I_n$ and (4), it follows that

$$\bar{y}^T C^{-1} \bar{y} = \bar{y}^T U_m \Sigma_m^{-1} U_m^T \bar{y} + \frac{1}{\sigma^2}\, \bar{y}^T (I_n - U_m U_m^T)\, \bar{y} \tag{6}$$

Here $\bar{y}^T U_m \Sigma_m^{-1} U_m^T \bar{y}$ is the Mahalanobis distance of $\bar{y}$ in the subspace spanned by $U_m$, and $\bar{y}^T (I_n - U_m U_m^T) \bar{y}$ is the shortest distance from $\bar{y}$ to this subspace. Usually $\sigma$ is set to a small value, and consequently the probability will be determined solely by the distance to the subspace. However, the choice of $\sigma$ is not trivial. From (6), if $\sigma$ is set to a value much smaller than the actual one, the distance to the subspace will dominate and the contribution of the Mahalanobis distance will be ignored, thereby rendering an inaccurate estimate.
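As a numerical check of (6), the sketch below evaluates the quadratic form both directly with $C^{-1}$ and through the subspace decomposition. The basis, eigenvalues, and test vector are arbitrary values chosen for illustration, and `ppca_quadratic_form` is a hypothetical helper name.

```python
import numpy as np

def ppca_quadratic_form(y, mu, U_m, lam, sigma2):
    """Evaluate the right-hand side of (6) without forming C^{-1}."""
    y_bar = y - mu
    coeffs = U_m.T @ y_bar                        # coordinates in the subspace
    in_subspace = np.sum(coeffs ** 2 / lam)       # Mahalanobis term
    residual = y_bar @ y_bar - coeffs @ coeffs    # squared distance to subspace
    return in_subspace + residual / sigma2

rng = np.random.default_rng(1)
n, m = 8, 3
U_m, _ = np.linalg.qr(rng.normal(size=(n, m)))    # orthonormal basis
lam = np.array([5.0, 3.0, 2.0])                   # eigenvalues, all above sigma2
sigma2 = 0.5
mu = rng.normal(size=n)
y = rng.normal(size=n)

# C = W W^T + sigma^2 I with W = U_m (Sigma_m - sigma^2 I)^{1/2}, as in (4).
C = U_m @ np.diag(lam - sigma2) @ U_m.T + sigma2 * np.eye(n)
direct = (y - mu) @ np.linalg.solve(C, y - mu)
fast = ppca_quadratic_form(y, mu, U_m, lam, sigma2)
```

The decomposed form only needs $m$ inner products with the basis vectors, which is what makes per-candidate scoring cheap when $m \ll n$.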

The choice of $\sigma$ is even more critical in situations where the appearance changes dynamically and requires $\sigma$ to be adjusted accordingly. This topic will be further examined in the following section.

3.1.2 Online Learning of Probabilistic PCA

Unlike the analysis in the previous section where model parameters are estimated based on a fixed set of training examples, our generative model has to learn and update its parameters online. Starting with a single example (the appearance of the tracked object in the first video frame), our generative model constantly updates its parameters as new observations arrive.

The equations for updating parameters are derived from (4). The update procedure of $U_m$ and $\Sigma_m$ is complicated since it involves the computation of eigenvalues and eigenvectors. Here we use a forgetting factor $\gamma$ to put more weight on the most recent data. Denote the newly arrived samples at time $t$ as $Y = \{y_1, \ldots, y_M\}$, and assume the mean $\mu$ is fixed. Then $U_{m,t}$ and $\Sigma_{m,t}$ can be obtained by performing singular value decomposition (SVD) on

$$\left[ \sqrt{\gamma}\, U_{m,t-1} (\Sigma_{m,t-1})^{1/2} \;\middle|\; \sqrt{\tfrac{1-\gamma}{M}}\, \bar{Y} \right] \tag{7}$$

where $\bar{Y} = [y_1 - \mu, \ldots, y_M - \mu]$. $\Sigma_{m,t}^{1/2}$ and $U_{m,t}$ will contain the $m$ largest singular values and the corresponding singular vectors, respectively, at time $t$. This update procedure can be efficiently implemented using the R-SVD algorithm, e.g., [4] [7].
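To see why the SVD of the matrix in (7) yields the updated basis, write $G_t$ for that matrix (a symbol introduced here only for illustration) and assume its second block carries the weight $\sqrt{(1-\gamma)/M}$. Then

$$G_t G_t^T = \gamma\, U_{m,t-1} \Sigma_{m,t-1} U_{m,t-1}^T + \frac{1-\gamma}{M}\, \bar{Y} \bar{Y}^T .$$

The first term is the previous rank-$m$ covariance estimate down-weighted by $\gamma$, and the second is the covariance of the new batch weighted by $1-\gamma$. Since the left singular vectors of $G_t$ are the eigenvectors of $G_t G_t^T$ and its squared singular values are the corresponding eigenvalues, keeping the $m$ largest gives $U_{m,t}$ and $\Sigma_{m,t}$.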

If the mean $\mu$ constantly changes, the above update procedure cannot be applied. We recently proposed a method [8] to compute the SVD with a correctly updated mean, in which $\Sigma_{m,t}^{1/2}$ and $U_{m,t}$ can be obtained by computing the SVD on

$$\left[ \sqrt{\gamma}\, U_{m,t-1} (\Sigma_{m,t-1})^{1/2} \;\middle|\; \sqrt{\tfrac{1-\gamma}{M}}\, \bar{Y} \;\middle|\; \sqrt{(1-\gamma)\gamma}\, (\mu_{t-1} - \mu_Y) \right] \tag{8}$$

where $\bar{Y} = [y_1 - \mu_Y, \ldots, y_M - \mu_Y]$ and $\mu_Y = \frac{1}{M}\sum_{i=1}^{M} y_i$. This formulation is similar to the SVD computation in the fixed-mean case, and the same incremental SVD algorithm can be used to compute $\Sigma_{m,t}^{1/2}$ and $U_{m,t}$ with the extra term shown in (8).
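A minimal sketch of the update in (8) follows. It uses a plain `numpy.linalg.svd` on the stacked matrix rather than the R-SVD algorithm cited above, and it assumes the mean is updated as $\mu_t = \gamma \mu_{t-1} + (1-\gamma)\mu_Y$, which is consistent with the weight on the last column but is not stated explicitly here. The function name `update_subspace` and the test data are hypothetical.

```python
import numpy as np

def update_subspace(U_prev, lam_prev, mu_prev, Y_new, gamma, m):
    """Update (basis, eigenvalues, mean) from a new batch via the SVD in (8).

    U_prev: (n, m_prev) basis, lam_prev: (m_prev,) eigenvalues,
    Y_new: (M, n) batch of new appearance vectors.
    """
    M = Y_new.shape[0]
    mu_Y = Y_new.mean(axis=0)
    Y_bar = (Y_new - mu_Y).T                                  # (n, M), centered
    stacked = np.hstack([
        np.sqrt(gamma) * U_prev * np.sqrt(lam_prev),          # old subspace
        np.sqrt((1.0 - gamma) / M) * Y_bar,                   # new batch
        np.sqrt((1.0 - gamma) * gamma) * (mu_prev - mu_Y)[:, None],  # mean shift
    ])
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    mu_new = gamma * mu_prev + (1.0 - gamma) * mu_Y           # assumed mean update
    return U[:, :m], s[:m] ** 2, mu_new

rng = np.random.default_rng(2)
n = 5
U_prev, _ = np.linalg.qr(rng.normal(size=(n, 2)))
lam_prev = np.array([4.0, 1.0])
mu_prev = rng.normal(size=n)
Y_new = rng.normal(size=(3, n))
gamma = 0.85

# Keep every direction (m = n) so the result can be compared exactly.
U_t, lam_t, mu_t = update_subspace(U_prev, lam_prev, mu_prev, Y_new, gamma, m=n)
```

With no truncation, the returned basis and eigenvalues reproduce the weighted sum of the old covariance, the batch covariance, and the rank-one mean-shift term; in the tracker only the $m$ leading directions would be kept.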

Computing and updating $\sigma$ is more difficult than the form in (8). In the previous section, we showed that an inaccurate value of $\sigma$ will severely affect probability estimates. In order to have an accurate estimate of $\sigma$ using (4), a large set of training examples is usually required. Our generative model starts with a single example and gradually adapts the model parameters. If we update $\sigma$ based on (4), we will start with a very small value of $\sigma$ since there are only a few samples at our disposal at the outset, and the algorithm could quickly lose track of the target because of an inaccurate probability estimate. Since the training examples are not permanently stored in memory, $\lambda_i$ in (4) and consequently $\sigma$ may not be accurately updated if the number of drawn samples is insufficient. These constraints lead us to develop a method that adaptively adjusts $\sigma$ according to the newly arrived samples, which will be explained in the next section.

3.2 Discriminative Generative Model

As observed in Section 2, the object's appearance does not change much between $o_{t-1}$ and $o_t$. Therefore, we can use the observation at $o_{t-1}$ to boost the likelihood measurement in $o_t$. That is, we draw a set of samples (i.e., image patches) parameterized by $\{s^i_{t-1} \mid i = 1, \ldots, k\}$ in $o_{t-1}$ that have large $p(o_{t-1} \mid s^i_{t-1})$ but low posterior $p(s^i_{t-1} \mid O_{t-1})$. These are treated as the negative samples (i.e., samples that are not generated from the class of the target object) that the generative model is likely to confuse at $O_t$.

Given a set of samples $Y' = \{y^1, \ldots, y^k\}$, where $y^i$ is the appearance vector collected in $o_{t-1}$ based on the state parameter $s^i_{t-1}$, we want to find a linear projection $V^*$ that projects $Y'$ onto a subspace such that the likelihood of $Y'$ in the subspace is minimized. Let $V$ be a $p \times n$ matrix. Since $p(y \mid W, \mu, \sigma)$ is a Gaussian, $p(Vy \mid V, W, \mu, \sigma) \sim \mathcal{N}(V\mu, VCV^T)$ is also a Gaussian. The log likelihood is computed by

$$L(V, W, \mu, \sigma) = -\frac{k}{2}\left( p \log(2\pi) + \log |VCV^T| + \mathrm{tr}\!\left((VCV^T)^{-1} V S' V^T\right) \right) \tag{9}$$

where $S' = \frac{1}{k}\sum_{i=1}^{k}(y^i - \mu)(y^i - \mu)^T$.

To facilitate the following analysis we first assume $V$ projects $Y'$ to a 1-D space, i.e., $p = 1$ and $V = v^T$, and thus

$$L(V, W, \mu, \sigma) = -\frac{k}{2}\left( \log(2\pi) + \log |v^T C v| + \frac{v^T S' v}{v^T C v} \right) \tag{10}$$

Note that $v^T C v$ is the variance of the object samples in the projected space, and we need to impose a constraint, e.g., $v^T C v = 1$, to ensure that the minimum likelihood solution of $v$ does not increase the variance in the projected space. Letting $v^T C v = 1$, the optimization problem becomes

$$v^* = \arg\max_{\{v \mid v^T C v = 1\}} v^T S' v = \arg\max_{v} \frac{v^T S' v}{v^T C v} \tag{11}$$

Thus, we obtain an equation exactly like the Fisher discriminant analysis for a binary classification problem. In (11), $v$ is a projection that keeps the object's samples in the projected space close to $\mu$ (with variance $v^T C v = 1$), while keeping the negative samples in $Y'$ away from $\mu$. The optimal value of $v$ is the generalized eigenvector of $S'$ and $C$ that corresponds to the largest eigenvalue. In the general case, it follows that

$$V^* = \arg\max_{\{V \mid VCV^T = I\}} |V S' V^T| = \arg\max_{V} \frac{|V S' V^T|}{|V C V^T|} \tag{12}$$

where $V^*$ can be obtained by solving a generalized eigenvalue problem of $S'$ and $C$. By projecting observation samples onto a low-dimensional subspace, we enhance the discriminative power of the generative model. In the meantime, we reduce the time required to compute probabilities, which is also a critical improvement for real-time applications like object tracking.
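The generalized eigenvalue problem behind (11) and (12) can be sketched with NumPy alone by whitening with a Cholesky factor of $C$. This is an illustrative route chosen for brevity, not the generalized SVD procedure described in the next subsection; the function name `discriminative_projection` and the random test matrices are assumptions.

```python
import numpy as np

def discriminative_projection(S_neg, C, p):
    """Top-p generalized eigenvectors of (S_neg, C), scaled so V C V^T = I."""
    L = np.linalg.cholesky(C)                      # C = L L^T
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ S_neg @ L_inv.T             # symmetric whitened scatter
    eigvals, eigvecs = np.linalg.eigh(whitened)    # ascending order
    top = eigvecs[:, ::-1][:, :p]                  # eigenvectors of the p largest
    V = (L_inv.T @ top).T                          # rows are projection vectors
    return V, eigvals[::-1][:p]

rng = np.random.default_rng(3)
n = 6
B0 = rng.normal(size=(n, n))
C = B0 @ B0.T + 0.1 * np.eye(n)                    # stand-in for W W^T + sigma^2 I
mu = np.zeros(n)
neg = rng.normal(size=(20, n)) + 2.0               # hypothetical negative samples
S_neg = (neg - mu).T @ (neg - mu) / len(neg)       # scatter about the target mean

V, ratios = discriminative_projection(S_neg, C, p=2)
```

Each row of `V` has unit variance under the target model, and `ratios[0]` is the largest value the quotient in (11) can attain for these matrices.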

3.2.1 Online Update of Discriminative Analysis

The computation of the projection matrix $V$ depends on the matrices $C$ and $S'$. In Section 3.1.2, we have shown the procedures to update $C$. The same procedures can be used to update $S'$. Let $\mu_Y = \frac{1}{k}\sum_{i=1}^{k} y^i$ and $S_Y = \frac{1}{k}\sum_{i=1}^{k}(y^i - \mu_Y)(y^i - \mu_Y)^T$; then

$$S' = \frac{1}{k}\sum_{i=1}^{k}(y^i - \mu)(y^i - \mu)^T = S_Y + (\mu - \mu_Y)(\mu - \mu_Y)^T \tag{13}$$
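The identity in (13) follows from splitting each deviation as $y^i - \mu = (y^i - \mu_Y) + (\mu_Y - \mu)$. A short derivation, using only the definitions above:

$$\begin{aligned}
\frac{1}{k}\sum_{i=1}^{k}(y^i - \mu)(y^i - \mu)^T
&= \frac{1}{k}\sum_{i=1}^{k}(y^i - \mu_Y)(y^i - \mu_Y)^T + (\mu_Y - \mu)(\mu_Y - \mu)^T \\
&\quad + \frac{1}{k}\sum_{i=1}^{k}(y^i - \mu_Y)(\mu_Y - \mu)^T + \frac{1}{k}(\mu_Y - \mu)\sum_{i=1}^{k}(y^i - \mu_Y)^T \\
&= S_Y + (\mu - \mu_Y)(\mu - \mu_Y)^T .
\end{aligned}$$

The two cross terms vanish because $\sum_{i}(y^i - \mu_Y) = 0$ by the definition of $\mu_Y$, so only the batch scatter and one rank-one mean-shift term need to be maintained.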

Given $S'$ and $C$, $V$ is computed by solving a generalized eigenvalue problem. If we decompose $S' = A^T A$ and $C = B^T B$, then we can find $V$ more efficiently using generalized singular value decomposition. Denote $U_Y$ and $\Sigma_Y$ as the SVD of $S_Y$. By letting $A = [U_Y \Sigma_Y^{1/2} \mid (\mu - \mu_Y)]^T$ and $B = [U_m (\Sigma_m - \sigma^2 I_m)^{1/2} \mid \sigma I_n]^T$, we obtain $S' = A^T A$ and $C = B^T B$.

As is detailed in [4], $V$ can be computed by first performing a QR factorization:

$$\begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} Q_A \\ Q_B \end{bmatrix} R \tag{14}$$

and computing the singular value decomposition of $Q_A$:

$$Q_A = U_A D_A V_A^T \tag{15}$$

We then obtain $V = R^{-1} V_A$. The rank of $A$ is usually small in vision applications, and $V$ can be computed efficiently, thereby facilitating the tracking process.
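The two steps in (14) and (15) can be sketched with `numpy.linalg.qr` and `numpy.linalg.svd`. The factor `B` below is built so that $B^T B = U_m(\Sigma_m - \sigma^2 I)U_m^T + \sigma^2 I$, which is an assumption made for this illustration, and the matrices themselves are random stand-ins rather than tracking data.

```python
import numpy as np

def gsvd_projection(A, B):
    """Generalized eigenvectors of (A^T A, B^T B) via (14) and (15)."""
    Q, R = np.linalg.qr(np.vstack([A, B]))                 # stacked QR, (14)
    Q_A = Q[: A.shape[0]]                                  # rows belonging to A
    _, d, V_A_T = np.linalg.svd(Q_A, full_matrices=True)   # SVD of Q_A, (15)
    V = np.linalg.solve(R, V_A_T.T)                        # V = R^{-1} V_A
    return V, d

rng = np.random.default_rng(5)
n, m = 6, 2
A = rng.normal(size=(3, n))                     # low-rank factor of the negative scatter
U_m, _ = np.linalg.qr(rng.normal(size=(n, m)))
lam, sigma = np.array([4.0, 2.0]), 0.3
# Assumed factor of C: B^T B = U_m (Lambda - sigma^2 I) U_m^T + sigma^2 I.
B = np.vstack([(U_m * np.sqrt(lam - sigma ** 2)).T, sigma * np.eye(n)])

V, d = gsvd_projection(A, B)
S_neg, C = A.T @ A, B.T @ B
```

The columns of `V` simultaneously diagonalize both matrices: $V^T A^T A V = D_A^2$ and $V^T B^T B V = I - D_A^2$, so the leading columns (largest entries of `d`) maximize the ratio in (12). Under this normalization a column would still need rescaling by $1/\sqrt{1 - d_j^2}$ to satisfy the unit-variance constraint exactly.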

4 Proposed Tracking Algorithm

In this section, we summarize the proposed tracking algorithm and demonstrate how the abovementioned learning and inference algorithms are incorporated for object tracking.

Our algorithm localizes the tracked object in each video frame using a rectangular window. A state $s$ is a length-5 vector, $s = (x, y, \theta, w, h)$, that parameterizes the window's position $(x, y)$, orientation ($\theta$), and width and height $(w, h)$. The proposed algorithm is based on the maximum likelihood estimate (i.e., the most probable location of the object) given all the observations up to that time instance, $s_t^* = \arg\max_{s_t} p(s_t \mid O_t)$.

We assume that the state transition is a Gaussian distribution, i.e.,

$$p(s_t \mid s_{t-1}) \sim \mathcal{N}(s_{t-1}, \Sigma_s) \tag{16}$$

where $\Sigma_s$ is a diagonal matrix. According to this distribution, the tracker then draws $N$ samples $S_t = \{c_1, \ldots, c_N\}$ which represent the possible locations of the target. Denote $y_t^i$ as the appearance vector of $o_t$ at state $c_i$, and $Y_t = \{y_t^1, \ldots, y_t^N\}$ as the set of appearance vectors that corresponds to the set of state vectors $S_t$. The posterior probability that the tracked object is at $c_i$ in video frame $o_t$ is then defined as

$$p(s_t = c_i \mid O_t) = \kappa\, p(y_t^i \mid V, W, \mu, \sigma)\, p(s_t = c_i \mid s_{t-1}) \tag{17}$$

where $\kappa$ is a constant. Therefore, $s_t^* = \arg\max_{c_i \in S_t} p(s_t = c_i \mid O_t)$.

Once $s_t^*$ is determined, the corresponding observation $y_t^*$ will be a new example to update $W$ and $\mu$. Appearance vectors $y_t^i$ with large $p(y_t^i \mid V, W, \mu, \sigma)$ but whose corresponding state parameters $c_i$ are away from $s_t^*$ will be used as new examples to update $V$.
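The sampling and selection step described by (16) and (17) can be sketched as follows. The callable `log_likelihood` is a hypothetical stand-in for the appearance model $p(y \mid V, W, \mu, \sigma)$ evaluated on the patch cropped at a candidate state, and the state values, spreads, and toy likelihood are made up for illustration.

```python
import numpy as np

def track_step(s_prev, log_likelihood, sigma_s, num_samples, rng):
    """Draw candidates from (16) and pick the highest-scoring one as in (17).

    log_likelihood(c) is a hypothetical callable returning the log of the
    appearance likelihood for the patch at candidate state c.
    """
    noise = rng.normal(size=(num_samples, s_prev.size)) * sigma_s
    candidates = s_prev + noise                              # samples from (16)
    diffs = (candidates - s_prev) / sigma_s
    log_prior = -0.5 * np.sum(diffs ** 2, axis=1)            # log of (16), up to a constant
    log_like = np.array([log_likelihood(c) for c in candidates])
    scores = log_like + log_prior                            # log of (17), up to log kappa
    best = int(np.argmax(scores))
    return candidates[best], candidates, scores

rng = np.random.default_rng(6)
s_prev = np.array([100.0, 80.0, 0.0, 19.0, 19.0])    # (x, y, theta, w, h)
sigma_s = np.array([5.0, 5.0, 0.05, 1.0, 1.0])       # square roots of diag(Sigma_s)
true_state = np.array([104.0, 78.0, 0.02, 19.5, 19.0])

def toy_log_likelihood(c):
    # Stand-in for the appearance model: peaks at the hidden true state.
    return -0.5 * np.sum(((c - true_state) / (0.5 * sigma_s)) ** 2)

s_best, candidates, scores = track_step(s_prev, toy_log_likelihood, sigma_s, 500, rng)
```

Working in the log domain avoids underflow when the likelihoods are very small, and the constant $\kappa$ can be dropped because it does not affect the argmax.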

Our tracker assumes that $o_1$ and $s_1$ are given (through object detection) and thus obtains the first appearance vector $y_1$, which in turn is used as the initial value of $\mu$, but $V$ and $W$ are unknown at the outset. When $V$ and $W$ are not available, our tracking algorithm is based on template matching (with $\mu$ being the template). The matrix $W$ is computed after a small number of appearance vectors are observed. When $W$ is available, we can then start to compute and update $V$ accordingly.

As mentioned in Section 3.1.1, it is difficult to obtain an accurate estimate of $\sigma$. In our tracking system, we adaptively adjust $\sigma$ according to $\Sigma_m$ in $W$. We set $\sigma$ to be a fixed fraction of the smallest eigenvalue in $\Sigma_m$. This ensures that the distance measurement in (6) is not biased to favor either the Mahalanobis distance in the subspace or the distance to the subspace.

5 Experimental Results

We tested the proposed algorithm with numerous object tracking experiments. To examine whether our model is able to adapt and track objects in a dynamically changing environment, we recorded videos containing appearance deformation, large illumination change, and large pose variations. All the image sequences consist of 320 × 240 pixel grayscale videos, recorded at 30 frames/second and 256 gray-levels per pixel. The forgetting term is empirically selected as 0.85, and the batch size for update is set to 5 as a trade-off between computational efficiency and effectiveness of modeling appearance change due to fast motion. More experimental results and videos can be found at http://www.ifp.uiuc.edu/˜rlin1/adgm.html.

Figure 1: A target undergoes pose and lighting variation.

Figures 1 and 2 show snapshots of some tracking results enclosed with rectangular windows. There are two rows of images below each video frame. The first row shows the sampled images in the current frame that have the largest likelihoods of being the target locations according to our discriminative generative model. The second row shows the sample images in the current video frame that are selected online for updating the discriminative generative model.

The results in Figure 1 show that our method is able to track targets undergoing pose and lighting change. Figure 2 shows tracking results where the object appearances change significantly due to variation in pose and lighting as well as cast shadows. These experiments demonstrate that our tracking algorithm is able to follow objects even when there is a large appearance change due to pose or lighting variation. We have also tested these two sequences with a conventional view-based eigentracker [2] and a template-based method. Empirical results show that such methods do not perform well as they do not update the object representation to account for appearance change.


Figure 2: A target undergoes large lighting and pose variation with cast shadows.

6 Conclusion

We have presented a discriminative generative framework that generalizes the conventional Fisher Linear Discriminant algorithm with a proper probabilistic interpretation. For object tracking, we aim to find a discriminative generative model that best separates the target class from the background. With a computationally efficient algorithm that constantly updates this discriminative model as time progresses, our method adapts the discriminative generative model to account for appearance variation of the target and background, thereby facilitating the tracking task in different situations. Our experiments show that the proposed model is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes. In future work, we also plan to apply the proposed method to other problems that deal with non-stationary data streams.

References

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.

[2] M. J. Black and A. D. Jepson. Eigentracking: Robust matching and tracking of articulated objects using view-based representation. In B. Buxton and R. Cipolla, editors, Proceedings of the Fourth European Conference on Computer Vision, LNCS 1064, pp. 329–342. Springer Verlag, 1996.

[3] R. T. Collins and Y. Liu. On-line selection of discriminative tracking features. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 1, pp. 346–352, 2003.

[4] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

[5] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In B. Buxton and R. Cipolla, editors, Proceedings of the Fourth European Conference on Computer Vision, LNCS 1064, pp. 343–356. Springer Verlag, 1996.

[6] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, 1999.

[7] A. Levy and M. Lindenbaum. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Transactions on Image Processing, 9(8):1371–1374, 2000.

[8] R.-S. Lin, D. Ross, J. Lim, and M.-H. Yang. Incremental subspace update with running mean. Technical report, Beckman Institute, University of Illinois at Urbana-Champaign, 2004. Available at http://www.ifp.uiuc.edu/˜rlin1/isuwrm.pdf.

[9] D. Ross, J. Lim, and M.-H. Yang. Adaptive probabilistic visual tracking with incremental subspace update. In T. Pajdla and J. Matas, editors, Proceedings of the Eighth European Conference on Computer Vision, LNCS 3022, pp. 470–482. Springer Verlag, 2004.

[10] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611–622, 1999.
