
Mixtures of Probabilistic Principal Component Analyzers

Michael E. Tipping

Christopher M. Bishop

Microsoft Research, St. George House, Cambridge CB2 3NH, U.K.

Principal component analysis (PCA) is one of the most popular techniques for processing, compressing, and visualizing data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Therefore, previous attempts to formulate mixture models for PCA have been ad hoc to some extent. In this article, PCA is formulated within a maximum likelihood framework, based on a specific form of gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analyzers, whose parameters can be determined using an expectation-maximization algorithm. We discuss the advantages of this model in the context of clustering, density modeling, and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.

1 Introduction

Principal component analysis (PCA) (Jolliffe, 1986) has proved to be an exceedingly popular technique for dimensionality reduction and is discussed at length in most texts on multivariate analysis. Its many application areas include data compression, image analysis, visualization, pattern recognition, regression, and time-series prediction.

The most common definition of PCA, due to Hotelling (1933), is that for a set of observed d-dimensional data vectors $\{t_n\}$, $n \in \{1, \ldots, N\}$, the q principal axes $w_j$, $j \in \{1, \ldots, q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $w_j$ are given by the q dominant eigenvectors (those with the largest associated eigenvalues) of the sample covariance matrix $S = \sum_n (t_n - \bar{t})(t_n - \bar{t})^T / N$, such that $S w_j = \lambda_j w_j$, where $\bar{t}$ is the sample mean. The vector $x_n = W^T(t_n - \bar{t})$, where $W = (w_1, w_2, \ldots, w_q)$, is thus a q-dimensional reduced representation of the observed vector $t_n$.
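For illustration, a minimal NumPy sketch of this eigendecomposition-based procedure follows; the function name `pca_axes` and the synthetic data used to exercise it are arbitrary choices, not part of the original presentation.

```python
import numpy as np

def pca_axes(T, q):
    """Return (W, t_bar): the q principal axes and the sample mean.

    T is an (N, d) array of observed data vectors t_n; W is (d, q) with
    orthonormal columns given by the q dominant eigenvectors of S.
    """
    t_bar = T.mean(axis=0)
    S = (T - t_bar).T @ (T - t_bar) / T.shape[0]   # sample covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    W = eigvecs[:, order[:q]]                      # q dominant eigenvectors
    return W, t_bar

# Reduced q-dimensional representation x_n = W^T (t_n - t_bar)
rng = np.random.default_rng(0)
T = rng.normal(size=(500, 5))
W, t_bar = pca_axes(T, q=2)
X = (T - t_bar) @ W
```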

A complementary property of PCA, and that most closely related to the original discussions of Pearson (1901), is that the projection onto the principal subspace minimizes the squared reconstruction error $\sum_n \| t_n - \hat{t}_n \|^2$. The optimal linear reconstruction of $t_n$ is given by $\hat{t}_n = W x_n + \bar{t}$, where $x_n = W^T (t_n - \bar{t})$, and the orthogonal columns of W span the space of the leading q eigenvectors of S. In this context, the principal component projection is often known as the Karhunen-Loève transform.

One limiting disadvantage of these definitions of PCA is the absence of an associated probability density or generative model. Deriving PCA from the perspective of density estimation would offer a number of important advantages, including the following:

• The corresponding likelihood would permit comparison with other density-estimation techniques and facilitate statistical testing.

• Bayesian inference methods could be applied (e.g., for model comparison) by combining the likelihood with a prior.

• In classification, PCA could be used to model class-conditional densities, thereby allowing the posterior probabilities of class membership to be computed. This contrasts with the alternative application of PCA for classification of Oja (1983) and Hinton, Dayan, and Revow (1997).

• The value of the probability density function could be used as a measure of the “degree of novelty” of a new data point, an alternative approach to that of Japkowicz, Myers, and Gluck (1995) and Petsche et al. (1996) in autoencoder-based PCA.

• The probability model would offer a methodology for obtaining a principal component projection when data values are missing.

• The single PCA model could be extended to a mixture of such models.

This final advantage is particularly significant. Because PCA defines only a linear projection of the data, the scope of its application is necessarily somewhat limited. This has naturally motivated various developments of nonlinear PCA in an effort to retain a greater proportion of the variance using fewer components. Examples include principal curves (Hastie & Stuetzle, 1989; Tibshirani, 1992), multilayer autoassociative neural networks (Kramer, 1991), the kernel-function approach of Webb (1996), and the generative topographic mapping (GTM) of Bishop, Svensén, and Williams (1998). An alternative paradigm to such global nonlinear approaches is to model nonlinear structure with a collection, or mixture, of local linear submodels. This philosophy is an attractive one, motivating, for example, the mixture-of-experts technique for regression (Jordan & Jacobs, 1994).

A number of implementations of “mixtures of PCA” have been proposed in the literature, each defining a different algorithm or a variation. The variety of proposed approaches is a consequence of ambiguity in the formulation of the overall model. Current methods for local PCA generally necessitate a two-stage procedure: a partitioning of the data space followed by estimation of the principal subspace within each partition. Standard Euclidean distance-based clustering may be performed in the partitioning phase, but more appropriately, the reconstruction error may be used as the criterion for cluster assignments. This conveys the advantage that a common cost measure is used in both stages. However, even recently proposed models that adopt this cost measure still define different algorithms (Hinton et al., 1997; Kambhatla & Leen, 1997), while a variety of alternative approaches for combining local PCA models have also been proposed (Broomhead, Indik, Newell, & Rand, 1991; Bregler & Omohundro, 1995; Hinton, Revow, & Dayan, 1995; Dony & Haykin, 1995). None of these algorithms defines a probability density.

One difficulty in implementation is that when using “hard” clustering in the partitioning phase (Kambhatla & Leen, 1997), the overall cost function is inevitably nondifferentiable. Hinton et al. (1997) finesse this problem by considering the partition assignments as missing data in an expectation-maximization (EM) framework, and thereby propose a “soft” algorithm where instead of any given data point being assigned exclusively to one principal component analyzer, the responsibility for its generation is shared among all of the analyzers. The authors concede that the absence of a probability model for PCA is a limitation to their approach and propose that the responsibility of the jth analyzer for reconstructing data point $t_n$ be given by $r_{nj} = \exp(-E_j^2/2\sigma^2) / \sum_{j'} \exp(-E_{j'}^2/2\sigma^2)$, where $E_j$ is the corresponding reconstruction cost. This allows the model to be determined by the maximization of a pseudo-likelihood function, and an explicit two-stage algorithm is unnecessary. Unfortunately, this also requires the introduction of a variance parameter $\sigma^2$ whose value is somewhat arbitrary, and again, no probability density is defined.
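A minimal sketch of this soft assignment rule, assuming the reconstruction costs $E_{nj}$ have already been computed for every data point and analyzer; the log-domain stabilization is an implementation detail, not part of the rule itself.

```python
import numpy as np

def soft_responsibilities(E, sigma2):
    """r_nj = exp(-E_nj^2 / 2 sigma^2) / sum_j' exp(-E_nj'^2 / 2 sigma^2).

    E is an (N, J) array of reconstruction costs; sigma2 is the
    (arbitrary) variance parameter discussed in the text.
    """
    logits = -0.5 * E**2 / sigma2
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)
```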

Our key result is to derive a probabilistic model for PCA. From this a mixture of local PCA models follows as a natural extension in which all of the model parameters may be estimated through the maximization of a single likelihood function. Not only does this lead to a clearly defined and unique algorithm, but it also conveys the advantage of a probability density function for the final model, with all the associated benefits as outlined above.

In section 2, we describe the concept of latent variable models. We then introduce probabilistic principal component analysis (PPCA) in section 3, showing how the principal subspace of a set of data vectors can be obtained within a maximum likelihood framework. Next, we extend this result to mixture models in section 4, and outline an efficient EM algorithm for estimating all of the model parameters in a mixture of probabilistic principal component analyzers. The partitioning of the data and the estimation of local principal axes are automatically linked. Furthermore, the algorithm implicitly incorporates a soft clustering similar to that implemented by Hinton et al. (1997), in which the parameter $\sigma^2$ appears naturally within the model.


Indeed, $\sigma^2$ has a simple interpretation and is determined by the same EM procedure used to update the other model parameters.

The proposed PPCA mixture model has a wide applicability, and we discuss its advantages from two distinct perspectives. First, in section 5, we consider PPCA for dimensionality reduction and data compression in local linear modeling. We demonstrate the operation of the algorithm on a simple toy problem and compare its performance with that of an explicit reconstruction-based nonprobabilistic modeling method on both synthetic and real-world data sets.

A second perspective is that of general gaussian mixtures. The PPCA mixture model offers a way to control the number of parameters when estimating covariance structures in high dimensions, while not overconstraining the model flexibility. We demonstrate this property in section 6 and apply the approach to the classification of images of handwritten digits.

Proofs of key results and algorithmic details are provided in the appendixes.

2 Latent Variable Models and PCA

2.1 Latent Variable Models. A latent variable model seeks to relate a d-dimensional observed data vector t to a corresponding q-dimensional vector of latent variables x:

$t = y(x; w) + \epsilon$, (2.1)

where $y(\cdot\,;\cdot)$ is a function of the latent variables x with parameters w, and $\epsilon$ is an x-independent noise process. Generally, $q < d$ such that the latent variables offer a more parsimonious description of the data. By defining a prior distribution over x, together with the distribution of $\epsilon$, equation 2.1 induces a corresponding distribution in the data space, and the model parameters may then be determined by maximum likelihood techniques. Such a model may also be termed generative, as data vectors t may be generated by sampling from the x and $\epsilon$ distributions and applying equation 2.1.

2.2 Factor Analysis. Perhaps the most common example of a latent variable model is that of statistical factor analysis (Bartholomew, 1987), in which the mapping y(x; w) is a linear function of x:

$t = Wx + \mu + \epsilon$. (2.2)

Conventionally, the latent variables are defined to be independent and gaussian with unit variance, so $x \sim \mathcal{N}(0, I)$. The noise model is also gaussian such that $\epsilon \sim \mathcal{N}(0, \Psi)$, with Ψ diagonal, and the $(d \times q)$ parameter matrix W contains the factor loadings. The parameter µ permits the data model to have nonzero mean. Given this formulation, the observation vectors are


also normally distributed, $t \sim \mathcal{N}(\mu, C)$, where the model covariance is $C = \Psi + WW^T$. (As a result of this parameterization, C is invariant under postmultiplication of W by an orthogonal matrix, equivalent to a rotation of the x coordinate system.) The key motivation for this model is that because of the diagonality of Ψ, the observed variables t are conditionally independent given the latent variables, or factors, x. The intention is that the dependencies between the data variables t are explained by a smaller number of latent variables x, while $\epsilon$ represents variance unique to each observation variable. This is in contrast to conventional PCA, which effectively treats both variance and covariance identically. There is no closed-form analytic solution for W and Ψ, so their values must be determined by iterative procedures.

2.3 Links from Factor Analysis to PCA. In factor analysis, the subspace defined by the columns of W will generally not correspond to the principal subspace of the data. Nevertheless, certain links between the two methods have been noted. For instance, it has been observed that the factor loadings and the principal axes are quite similar in situations where the estimates of the elements of Ψ turn out to be approximately equal (e.g., Rao, 1955). Indeed, this is an implied result of the fact that if $\Psi = \sigma^2 I$ and an isotropic, rather than diagonal, noise model is assumed, then PCA emerges if the $d - q$ smallest eigenvalues of the sample covariance matrix S are exactly equal. This homoscedastic residuals model is considered by Basilevsky (1994, p. 361), for the case where the model covariance is identical to its data sample counterpart. Given this restriction, the factor loadings W and noise variance $\sigma^2$ are identifiable (assuming correct choice of q) and can be determined analytically through eigendecomposition of S, without resort to iteration (Anderson, 1963).

This established link with PCA requires that the $d - q$ minor eigenvalues of the sample covariance matrix be equal (or, more trivially, be negligible) and thus implies that the covariance model must be exact. Not only is this assumption rarely justified in practice, but when exploiting PCA for dimensionality reduction, we do not require an exact characterization of the covariance structure in the minor subspace, as this information is effectively discarded. In truth, what is of real interest in the homoscedastic residuals model is the form of the maximum likelihood solution when the model covariance is not identical to its data sample counterpart.

Importantly, we show in the following section that PCA still emerges in the case of an approximate model. In fact, this link between factor analysis and PCA had been partially explored in the early factor analysis literature by Lawley (1953) and Anderson and Rubin (1956). Those authors showed that the maximum likelihood solution in the approximate case was related to the eigenvectors of the sample covariance matrix, but they did not show that these were the principal eigenvectors; instead, they made this additional assumption. In the next section (and in appendix A) we extend this earlier


work to give a full characterization of the properties of the model we term probabilistic PCA. Specifically, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the columns of the maximum likelihood estimator $W_{ML}$ are shown to span the principal subspace of the data even when $C \neq S$.

3 Probabilistic PCA

3.1 The Probability Model. For the case of isotropic noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, equation 2.2 implies a probability distribution over t-space for a given x of the form

$p(t|x) = (2\pi\sigma^2)^{-d/2} \exp\left\{ -\frac{1}{2\sigma^2} \| t - Wx - \mu \|^2 \right\}$. (3.1)

With a gaussian prior over the latent variables defined by

$p(x) = (2\pi)^{-q/2} \exp\left\{ -\tfrac{1}{2} x^T x \right\}$, (3.2)

we obtain the marginal distribution of t in the form

$p(t) = \int p(t|x)\, p(x)\, dx$ (3.3)
$= (2\pi)^{-d/2} |C|^{-1/2} \exp\left\{ -\tfrac{1}{2} (t - \mu)^T C^{-1} (t - \mu) \right\}$, (3.4)

where the model covariance is

$C = \sigma^2 I + W W^T$. (3.5)

Using Bayes’ rule, the posterior distribution of the latent variables x given the observed t may be calculated:

$p(x|t) = (2\pi)^{-q/2} |\sigma^{-2} M|^{1/2} \exp\left\{ -\tfrac{1}{2} \left[ x - M^{-1} W^T (t - \mu) \right]^T (\sigma^{-2} M) \left[ x - M^{-1} W^T (t - \mu) \right] \right\}$, (3.6)

where the posterior covariance matrix is given by

$\sigma^2 M^{-1} = \sigma^2 (\sigma^2 I + W^T W)^{-1}$. (3.7)

Note that $M$ is $q \times q$ while $C$ is $d \times d$.


The log-likelihood of observing the data under this model is

$\mathcal{L} = \sum_{n=1}^{N} \ln p(t_n) = -\frac{N}{2} \left\{ d \ln(2\pi) + \ln |C| + \mathrm{tr}\left( C^{-1} S \right) \right\}$, (3.8)

where

$S = \frac{1}{N} \sum_{n=1}^{N} (t_n - \mu)(t_n - \mu)^T$ (3.9)

is the sample covariance matrix of the observed $\{t_n\}$.
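A minimal sketch of evaluating equation 3.8 for given parameters, assuming W, $\sigma^2$, and µ are supplied (the function name is an arbitrary choice):

```python
import numpy as np

def ppca_log_likelihood(T, W, sigma2, mu):
    """Evaluate equation 3.8 for data T (N, d) and parameters W (d, q), sigma2, mu (d,)."""
    N, d = T.shape
    C = sigma2 * np.eye(d) + W @ W.T                  # model covariance, eq. 3.5
    S = (T - mu).T @ (T - mu) / N                     # sample covariance, eq. 3.9
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))
```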

3.2 Properties of the Maximum Likelihood Estimators. The maximum likelihood estimate of the parameter µ is given by the mean of the data:

$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} t_n$. (3.10)

We now consider the maximum likelihood estimators for the parameters W and $\sigma^2$.

3.2.1 The Weight Matrix W. The log-likelihood (see equation 3.8) is maximized when the columns of W span the principal subspace of the data. To show this we consider the derivative of equation 3.8 with respect to W:

$\frac{\partial \mathcal{L}}{\partial W} = N \left( C^{-1} S C^{-1} W - C^{-1} W \right)$. (3.11)

In appendix A it is shown that with C given by equation 3.5, the only nonzero stationary points of equation 3.11 occur for

$W = U_q (\Lambda_q - \sigma^2 I)^{1/2} R$, (3.12)

where the q column vectors in the $d \times q$ matrix $U_q$ are eigenvectors of S, with corresponding eigenvalues in the $q \times q$ diagonal matrix $\Lambda_q$, and R is an arbitrary $q \times q$ orthogonal rotation matrix. Furthermore, it is also shown that the stationary point corresponding to the global maximum of the likelihood occurs when $U_q$ comprises the principal eigenvectors of S, and thus $\Lambda_q$ contains the corresponding eigenvalues $\lambda_1, \ldots, \lambda_q$, where the eigenvalues of S are indexed in order of decreasing magnitude. All other combinations of eigenvectors represent saddle points of the likelihood surface. Thus, from equation 3.12, the latent variable model defined by equation 2.2 effects a


mapping from the latent space into the principal subspace of the observed data.

3.2.2 The Noise Variance $\sigma^2$. It may also be shown that for $W = W_{ML}$, the maximum likelihood estimator for $\sigma^2$ is given by

$\sigma^2_{ML} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j$, (3.13)

where $\lambda_{q+1}, \ldots, \lambda_d$ are the smallest eigenvalues of S, and so $\sigma^2_{ML}$ has a clear interpretation as the average variance “lost” per discarded dimension.
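A minimal sketch of this closed-form fit, taking R = I in equation 3.12 (any orthogonal R yields the same covariance model); the function name is an arbitrary choice:

```python
import numpy as np

def ppca_ml(T, q):
    """Closed-form PPCA fit: returns (W_ml, sigma2_ml, mu_ml).

    Uses eigendecomposition of the sample covariance S, then
    sigma2 = mean of the d-q discarded eigenvalues (eq. 3.13) and
    W = U_q (Lambda_q - sigma2 I)^{1/2} with R = I (eq. 3.12).
    """
    N, d = T.shape
    mu = T.mean(axis=0)                               # eq. 3.10
    S = (T - mu).T @ (T - mu) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[q:].mean()                           # eq. 3.13
    W = U[:, :q] * np.sqrt(lam[:q] - sigma2)          # eq. 3.12 with R = I
    return W, sigma2, mu
```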

3.3 Dimensionality Reduction and Optimal Reconstruction. To implement probabilistic PCA, we would generally first compute the usual eigendecomposition of S (we consider an alternative, iterative approach shortly), after which $\sigma^2_{ML}$ is found from equation 3.13 followed by $W_{ML}$ from equation 3.12. This is then sufficient to define the associated density model for PCA, allowing the advantages listed in section 1 to be exploited.

In conventional PCA, the reduced-dimensionality transformation of a data point $t_n$ is given by $x_n = U_q^T (t_n - \mu)$ and its reconstruction by $\hat{t}_n = U_q x_n + \mu$. This may be similarly achieved within the PPCA formulation. However, we note that in the probabilistic framework, the generative model defined by equation 2.2 represents a mapping from the lower-dimensional latent space to the data space. So in PPCA, the probabilistic analog of the dimensionality reduction process of conventional PCA would be to invert the conditional distribution $p(t|x)$ using Bayes’ rule, in equation 3.6, to give $p(x|t)$. In this case, each data point $t_n$ is represented in the latent space not by a single vector, but by the gaussian posterior distribution defined by equation 3.6. As an alternative to the standard PCA projection, then, a convenient summary of this distribution and representation of $t_n$ would be the posterior mean $\langle x_n \rangle = M^{-1} W_{ML}^T (t_n - \mu)$, a quantity that also arises naturally in (and is computed in) the EM implementation of PPCA considered in section 3.4.

Note also from equation 3.6 that the covariance of the posterior distribution is given by $\sigma^2 M^{-1}$ and is therefore constant for all data points.

However, perhaps counterintuitively given equation 2.2, $W_{ML} \langle x_n \rangle + \mu$ is not the optimal linear reconstruction of $t_n$. This may be seen from the fact that for $\sigma^2 > 0$, $W_{ML} \langle x_n \rangle + \mu$ is not an orthogonal projection of $t_n$, as a consequence of the gaussian prior over x causing the posterior mean projection to become skewed toward the origin. If we consider the limit as $\sigma^2 \to 0$, the projection $W_{ML} \langle x_n \rangle = W_{ML} (W_{ML}^T W_{ML})^{-1} W_{ML}^T (t_n - \mu)$ does become orthogonal and is equivalent to conventional PCA, but then the density model is singular and thus undefined.

Taking this limit is not necessary, however, since the optimal least-squares linear reconstruction of the data from the posterior mean vectors $\langle x_n \rangle$ may


be obtained from (see appendix B)

$\hat{t}_n = W_{ML} \left( W_{ML}^T W_{ML} \right)^{-1} M \langle x_n \rangle + \mu$, (3.14)

with identical reconstruction error to conventional PCA.

For reasons of probabilistic elegance, therefore, we might choose to exploit the posterior mean vectors $\langle x_n \rangle$ as the reduced-dimensionality representation of the data, although there is no material benefit in so doing.

Indeed, we note that in addition to the conventional PCA representation $U_q^T (t_n - \mu)$, the vectors $\hat{x}_n = W_{ML}^T (t_n - \mu)$ could equally be used without loss of information and reconstructed using

$\hat{t}_n = W_{ML} \left( W_{ML}^T W_{ML} \right)^{-1} \hat{x}_n + \mu$.
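A minimal sketch of the posterior-mean representation and the reconstruction of equation 3.14, assuming parameters obtained from a fit such as the closed-form one sketched earlier:

```python
import numpy as np

def posterior_means(T, W, sigma2, mu):
    """<x_n> = M^{-1} W^T (t_n - mu), with M = sigma^2 I + W^T W (eq. 3.7)."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W
    return np.linalg.solve(M, W.T @ (T - mu).T).T     # (N, q) array of posterior means

def optimal_reconstruction(X_mean, W, sigma2, mu):
    """Equation 3.14: t_hat = W (W^T W)^{-1} M <x> + mu."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W
    A = W @ np.linalg.solve(W.T @ W, M)               # W (W^T W)^{-1} M
    return X_mean @ A.T + mu
```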

3.4 An EM Algorithm for PPCA. By a simple extension of the EM formulation for parameter estimation in the standard linear factor analysis model (Rubin & Thayer, 1982), we can obtain a principal component projection by maximizing the likelihood function (see equation 3.8). We are not suggesting that such an approach necessarily be adopted for probabilistic PCA; normally the principal axes would be estimated in the conventional manner, via eigendecomposition of S, and subsequently incorporated in the probability model using equations 3.12 and 3.13 to realize the advantages outlined in the introduction. However, as discussed in appendix A.5, there may be an advantage in the EM approach for large d since the presented algorithm, although iterative, requires neither computation of the $d \times d$ covariance matrix, which is $O(Nd^2)$, nor its explicit eigendecomposition, which is $O(d^3)$. We derive the EM algorithm and consider its properties from the computational perspective in appendix A.5.
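The derivation in appendix A.5 is not reproduced in this excerpt; the following is a minimal sketch using the widely cited PPCA EM updates, which should correspond to that appendix up to implementation detail (initialization and iteration count are arbitrary choices here):

```python
import numpy as np

def ppca_em(T, q, n_iter=100, seed=0):
    """EM for a single PPCA model (sketch of the standard updates).

    E-step: M = sigma^2 I + W^T W, <x_n> = M^{-1} W^T (t_n - mu),
            <x_n x_n^T> = sigma^2 M^{-1} + <x_n><x_n>^T.
    M-step: W_new = [sum_n (t_n - mu)<x_n>^T] [sum_n <x_n x_n^T>]^{-1},
            sigma2 = (1/Nd) sum_n { ||t_n - mu||^2
                                    - 2 <x_n>^T W_new^T (t_n - mu)
                                    + tr(<x_n x_n^T> W_new^T W_new) }.
    """
    rng = np.random.default_rng(seed)
    N, d = T.shape
    mu = T.mean(axis=0)
    Tc = T - mu
    W = rng.normal(size=(d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        M = sigma2 * np.eye(q) + W.T @ W
        Minv = np.linalg.inv(M)
        Xm = Tc @ W @ Minv                            # posterior means, (N, q)
        Sxx = N * sigma2 * Minv + Xm.T @ Xm           # sum_n <x_n x_n^T>
        W_new = (Tc.T @ Xm) @ np.linalg.inv(Sxx)
        sigma2 = (np.sum(Tc**2)
                  - 2.0 * np.sum((Tc @ W_new) * Xm)
                  + np.trace(Sxx @ W_new.T @ W_new)) / (N * d)
        W = W_new
    return W, sigma2, mu
```

Note that no $d \times d$ matrix is formed at any point, which is the computational advantage mentioned above.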

3.5 Factor Analysis Revisited. The probabilistic PCA algorithm was obtained by introducing a constraint into the noise matrix of the factor analysis latent variable model. This apparently minor modification leads to significant differences in the behavior of the two methods. In particular, we now show that the covariance properties of the PPCA model are identical to those of conventional PCA and are quite different from those of standard factor analysis.

Consider a nonsingular linear transformation of the data variables, so that $t \to At$. Using equation 3.10, we see that under such a transformation, the maximum likelihood solution for the mean will be transformed as $\mu_{ML} \to A\mu_{ML}$. From equation 3.9, it then follows that the covariance matrix will transform as $S \to ASA^T$.

The log-likelihood for the latent variable model, from equation 3.8, is


given by

$\mathcal{L}(W, \Psi) = -\frac{N}{2} \left\{ d \ln(2\pi) + \ln |WW^T + \Psi| + \mathrm{tr}\left[ (WW^T + \Psi)^{-1} S \right] \right\}$, (3.15)

where Ψ is a general noise covariance matrix. Thus, using equation 3.15, we see that under the transformation $t \to At$, the log-likelihood will transform as

$\mathcal{L}(W, \Psi) \to \mathcal{L}(A^{-1}W, A^{-1}\Psi A^{-T}) - N \ln |A|$, (3.16)

where $A^{-T} \equiv (A^{-1})^T$. Thus, if $W_{ML}$ and $\Psi_{ML}$ are maximum likelihood solutions for the original data, then $AW_{ML}$ and $A\Psi_{ML}A^T$ will be maximum likelihood solutions for the transformed data set.

In general, the form of the solution will not be preserved under such a transformation. However, we can consider two special cases. First, suppose Ψ is a diagonal matrix, corresponding to the case of factor analysis.

Then Ψ will remain diagonal provided A is also a diagonal matrix. This says that factor analysis is covariant under component-wise rescaling of the data variables: the scale factors simply become absorbed into rescaling of the noise variances, and the rows of W are rescaled by the same factors.

Second, consider the case $\Psi = \sigma^2 I$, corresponding to PPCA. Then the transformed noise covariance $\sigma^2 AA^T$ will be proportional to the unit matrix only if $A^T = A^{-1}$, in other words, if A is an orthogonal matrix. Transformation of the data vectors by multiplication with an orthogonal matrix corresponds to a rotation of the coordinate system. This same covariance property is shared by standard nonprobabilistic PCA since a rotation of the coordinates induces a corresponding rotation of the principal axes. Thus we see that factor analysis is covariant under componentwise rescaling, while PPCA and PCA are covariant under rotations, as illustrated in Figure 1.

4 Mixtures of Probabilistic Principal Component Analyzers

The association of a probability model with PCA offers the tempting prospect of being able to model complex data structures with a combination of local PCA models through the mechanism of a mixture of probabilistic principal component analyzers (Tipping & Bishop, 1997). This formulation would permit all of the model parameters to be determined from maximum likelihood, where both the appropriate partitioning of the data and the determination of the respective principal axes occur automatically as the likelihood is maximized.

Figure 1: Factor analysis is covariant under a componentwise rescaling of the data variables (top plots), while PCA and probabilistic PCA are covariant under rotations of the data space coordinates (bottom plots).

The log-likelihood of observing the data set for such a mixture model is:

$\mathcal{L} = \sum_{n=1}^{N} \ln p(t_n)$ (4.1)
$= \sum_{n=1}^{N} \ln \left\{ \sum_{i=1}^{M} \pi_i\, p(t_n|i) \right\}$, (4.2)

where $p(t|i)$ is a single PPCA model and $\pi_i$ is the corresponding mixing proportion, with $\pi_i \geq 0$ and $\sum_i \pi_i = 1$. Note that a separate mean vector $\mu_i$ is now associated with each of the M mixture components, along with the parameters $W_i$ and $\sigma_i^2$. A related model has recently been exploited for data visualization (Bishop & Tipping, 1998), and a similar approach, based on


the standard factor analysis diagonal (Ψ) noise model, has been employed for handwritten digit recognition (Hinton et al., 1997), although it does not implement PCA.

The corresponding generative model for the mixture case now requires the random choice of a mixture component according to the proportions $\pi_i$, followed by sampling from the x and $\epsilon$ distributions and applying equation 2.2 as in the single model case, taking care to use the appropriate parameters $\mu_i$, $W_i$, and $\sigma_i^2$. Furthermore, for a given data point t, there is now a posterior distribution associated with each latent space, the mean of which for space i is given by $(\sigma_i^2 I + W_i^T W_i)^{-1} W_i^T (t - \mu_i)$.

We can develop an iterative EM algorithm for optimization of all of the model parameters $\pi_i$, $\mu_i$, $W_i$, and $\sigma_i^2$. If $R_{ni} = p(i|t_n)$ is the posterior responsibility of mixture i for generating data point $t_n$, given by

$R_{ni} = \frac{p(t_n|i)\,\pi_i}{p(t_n)}$, (4.3)

then in appendix C it is shown that we obtain the following parameter updates:

$\tilde{\pi}_i = \frac{1}{N} \sum_{n=1}^{N} R_{ni}$, (4.4)

$\tilde{\mu}_i = \frac{\sum_{n=1}^{N} R_{ni}\, t_n}{\sum_{n=1}^{N} R_{ni}}$. (4.5)

Thus the updates for $\tilde{\pi}_i$ and $\tilde{\mu}_i$ correspond exactly to those of a standard gaussian mixture formulation (e.g., see Bishop, 1995). Furthermore, in appendix C, it is also shown that the combination of the E- and M-steps leads to the intuitive result that the axes $W_i$ and the noise variance $\sigma_i^2$ are determined from the local responsibility-weighted covariance matrix:

$S_i = \frac{1}{\tilde{\pi}_i N} \sum_{n=1}^{N} R_{ni} (t_n - \tilde{\mu}_i)(t_n - \tilde{\mu}_i)^T$, (4.6)

by standard eigendecomposition in exactly the same manner as for a single PPCA model. However, as noted in section 3.4 (and also in appendix A.5), for larger values of data dimensionality d, computational advantages can be obtained if $W_i$ and $\sigma_i^2$ are updated iteratively according to an EM schedule.

This is discussed for the mixture model in appendix C.

Iteration of equations 4.3, 4.4, and 4.5 in sequence, followed by computation of $W_i$ and $\sigma_i^2$ from either equation 4.6 using equations 3.12 and 3.13 or using the iterative updates in appendix C, is guaranteed to find a local maximum of the log-likelihood in equation 4.2. At convergence of the algorithm, each weight matrix $W_i$ spans the principal subspace of its respective $S_i$.
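A minimal sketch of one such iteration, using the closed-form route through equation 4.6 with equations 3.12 and 3.13 (R = I); the helper names and the log-domain computation of responsibilities are implementation choices:

```python
import numpy as np

def log_gauss(T, mu, C):
    """Row-wise log N(t_n | mu, C)."""
    d = T.shape[1]
    sign, logdet = np.linalg.slogdet(C)
    diff = T - mu
    maha = np.sum(diff * np.linalg.solve(C, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def ppca_mixture_em_step(T, pis, mus, Ws, sigma2s, q):
    """One EM iteration for the PPCA mixture (eqs. 4.3-4.6, then 3.12-3.13)."""
    N, d = T.shape
    M = len(pis)
    # E-step: responsibilities R_ni (eq. 4.3)
    logp = np.stack([np.log(pis[i]) +
                     log_gauss(T, mus[i], sigma2s[i] * np.eye(d) + Ws[i] @ Ws[i].T)
                     for i in range(M)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)
    R = np.exp(logp)
    R /= R.sum(axis=1, keepdims=True)
    # M-step
    for i in range(M):
        Ni = R[:, i].sum()
        pis[i] = Ni / N                                        # eq. 4.4
        mus[i] = (R[:, i][:, None] * T).sum(axis=0) / Ni       # eq. 4.5
        diff = T - mus[i]
        Si = (R[:, i][:, None] * diff).T @ diff / Ni           # eq. 4.6 (pi_i N = Ni)
        lam, U = np.linalg.eigh(Si)
        order = np.argsort(lam)[::-1]
        lam, U = lam[order], U[:, order]
        sigma2s[i] = lam[q:].mean()                            # eq. 3.13 applied to S_i
        Ws[i] = U[:, :q] * np.sqrt(lam[:q] - sigma2s[i])       # eq. 3.12 with R = I
    return pis, mus, Ws, sigma2s
```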


In the next section we consider applications of this PPCA mixture model, beginning with data compression and reconstruction tasks. We then consider general density modeling in section 6.

5 Local Linear Dimensionality Reduction

In this section we begin by giving an illustration of the application of the PPCA mixture algorithm to a synthetic data set. More realistic examples are then considered, with an emphasis on cases in which a principal component approach is motivated by the objective of deriving a reduced-dimensionality representation of the data, which can be reconstructed with minimum error.

We will therefore contrast the clustering mechanism in the PPCA mixture model with that of a hard clustering approach based explicitly on reconstruction error as used in a typical algorithm.

5.1 Illustration for Synthetic Data. For a demonstration of the mixture of PPCA algorithm, we generated a synthetic data set comprising 500 data points sampled uniformly over the surface of a hemisphere, with additive gaussian noise. Figure 2a shows this data.

A mixture of 12 probabilistic principal component analyzers was then fitted to the data using the EM algorithm outlined in the previous section, with latent space dimensionality q = 2. Because of the probabilistic formalism, a generative model of the data is defined, and we emphasize this by plotting a second set of 500 data points, obtained by sampling from the fitted generative model. These data points are shown in Figure 2b. Histograms of the distances of all the data points from the hemisphere are also given to indicate more clearly the accuracy of the model in capturing the structure of the underlying generator.
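A sketch of how comparable synthetic data might be generated; the hemisphere radius and the noise standard deviation are assumptions, since the text does not specify them:

```python
import numpy as np

def hemisphere_data(n=500, radius=1.0, noise_std=0.05, seed=0):
    """Sample n points uniformly over a hemisphere surface, plus gaussian noise.

    radius and noise_std are illustrative guesses; the text does not state them.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
    v[:, 2] = np.abs(v[:, 2])                       # fold onto the upper hemisphere
    return radius * v + noise_std * rng.normal(size=(n, 3))

T = hemisphere_data()   # could then be fitted with, e.g., the mixture EM sketch above
```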

5.2 Clustering Mechanisms. Generating a local PCA model of the form illustrated above is often prompted by the ultimate goal of accurate data reconstruction. Indeed, this has motivated Kambhatla and Leen (1997) and Hinton et al. (1997) to use squared reconstruction error as the clustering criterion in the partitioning phase. Dony and Haykin (1995) adopt a similar approach to image compression, although their model has no set of independent mean parameters $\mu_i$. Using the reconstruction criterion, a data point is assigned to the component that reconstructs it with lowest error, and the principal axes are then reestimated within each cluster. For the mixture of PPCA model, however, data points are assigned to mixture components (in a soft fashion) according to the responsibility $R_{ni}$ of the mixture component for its generation. Since $R_{ni} = p(t_n|i)\pi_i / p(t_n)$ and $p(t_n)$ is constant for all components, $R_{ni} \propto p(t_n|i)$, and we may gain further insight into the clustering by considering the probability density associated with component i at


Figure 2: Modeling noisy data on a hemisphere. (a) On the left, the synthetic data; on the right, a histogram of the Euclidean distances of each data point to the sphere. (b) Data generated from the fitted PPCA mixture model with the synthetic data on the left and the histogram on the right.

data point $t_n$:

$p(t_n|i) = (2\pi)^{-d/2} |C_i|^{-1/2} \exp\left\{ -E_{ni}^2 / 2 \right\}$, (5.1)

where

$E_{ni}^2 = (t_n - \mu_i)^T C_i^{-1} (t_n - \mu_i)$, (5.2)
$C_i = \sigma_i^2 I + W_i W_i^T$. (5.3)

It is helpful to express the matrix $W_i$ in terms of its singular value decomposition (and although we are considering an individual mixture component i, the i subscript will be omitted for notational clarity):

$W = U_q (K_q - \sigma^2 I)^{1/2} R$, (5.4)


where $U_q$ is a $d \times q$ matrix of orthonormal column vectors and R is an arbitrary $q \times q$ orthogonal matrix. The singular values are parameterized, without loss of generality, in terms of $(K_q - \sigma^2 I)^{1/2}$, where $K_q = \mathrm{diag}(k_1, k_2, \ldots, k_q)$ is a $q \times q$ diagonal matrix. Then

$E_n^2 = (t_n - \mu)^T \left\{ \sigma^2 I + U_q (K_q - \sigma^2 I) U_q^T \right\}^{-1} (t_n - \mu)$. (5.5)

The data point $t_n$ may also be expressed in terms of the basis of vectors $U = (U_q, U_{d-q})$, where $U_{d-q}$ comprises $(d - q)$ vectors perpendicular to $U_q$, which complete an orthonormal set. In this basis, we define $z_n = U^T (t_n - \mu)$ and so $t_n - \mu = U z_n$, from which equation 5.5 may then be written as

$E_n^2 = z_n^T U^T \left\{ \sigma^2 I + U_q (K_q - \sigma^2 I) U_q^T \right\}^{-1} U z_n$ (5.6)
$= z_n^T D^{-1} z_n$, (5.7)

where $D = \mathrm{diag}(k_1, k_2, \ldots, k_q, \sigma^2, \ldots, \sigma^2)$ is a $d \times d$ diagonal matrix. Thus:

$E_n^2 = z_{\mathrm{in}}^T K_q^{-1} z_{\mathrm{in}} + \frac{z_{\mathrm{out}}^T z_{\mathrm{out}}}{\sigma^2}$ (5.8)
$= E_{\mathrm{in}}^2 + E_{\mathrm{rec}}^2 / \sigma^2$, (5.9)

where we have partitioned the elements of z into $z_{\mathrm{in}}$, the projection of $t_n - \mu$ onto the subspace spanned by W, and $z_{\mathrm{out}}$, the projection onto the corresponding perpendicular subspace. Thus, $E_{\mathrm{rec}}^2$ is the squared reconstruction error, and $E_{\mathrm{in}}^2$ may be interpreted as an in-subspace error term. At the maximum likelihood solution, $U_q$ is the matrix of eigenvectors of the local covariance matrix and $K_q = \Lambda_q$.
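A minimal sketch of this decomposition for a single component, assuming W, $\sigma^2$, and µ are given; it recovers $U_q$ and $K_q$ from W via an SVD, as in equation 5.4:

```python
import numpy as np

def error_decomposition(t, mu, W, sigma2):
    """Split E_n^2 into the in-subspace term and E_rec^2 / sigma^2 (eqs. 5.8-5.9)."""
    Uq, svals, _ = np.linalg.svd(W, full_matrices=False)
    Kq = svals**2 + sigma2                 # since W = U_q (K_q - sigma^2 I)^{1/2} R
    diff = t - mu
    z_in = Uq.T @ diff                     # projection onto the subspace spanned by W
    z_out = diff - Uq @ z_in               # perpendicular component
    E_in2 = np.sum(z_in**2 / Kq)
    E_rec2 = np.sum(z_out**2)              # squared reconstruction error
    return E_in2, E_rec2 / sigma2          # their sum equals E_n^2 of eq. 5.5
```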

As $\sigma_i^2 \to 0$, $R_{ni} \propto \pi_i \exp\left( -E_{\mathrm{rec}}^2 / 2 \right)$ and, for equal prior probabilities, cluster assignments are equivalent to a soft reconstruction-based clustering. However, for $\sigma_A^2, \sigma_B^2 > 0$, consider a data point that lies in the subspace of a relatively distant component A, which may be reconstructed with zero error yet lies closer to the mean of a second component B. The effect of the noise variance $\sigma_B^2$ in equation 5.9 is to moderate the contribution of $E_{\mathrm{rec}}^2$ for component B. As a result, the data point may be assigned to the nearer component B even though the reconstruction error is considerably greater, given that it is sufficiently distant from the mean of A such that $E_{\mathrm{in}}^2$ for A is large.

It should be expected, then, that mixture of PPCA clustering would result in more localized clusters, but with the final reconstruction error inferior to that of a clustering model based explicitly on a reconstruction criterion.

Conversely, it should also be clear that clustering the data according to the proximity to the subspace alone will not necessarily result in localized partitions (as noted by Kambhatla, 1995, who also considers the relationship

(16)

Figure 3: Comparison of the partitioning of the hemisphere effected by a VQPCA-based model (left) and a PPCA mixture model (right). The illustrated boundaries delineate regions of the hemisphere that are best reconstructed by a particular local PCA model. One such region is shown shaded to emphasize that clustering according to reconstruction error results in a nonlocalized partitioning. In the VQPCA case, the circular effects occur when principal component planes intersect beneath the surface of the hemisphere.

of such an algorithm to a probabilistic model). That this is so is simply illustrated in Figure 3, in which a collection of 12 conventional PCA models have been fitted to the hemisphere data, according to the VQPCA (vector-quantization PCA) algorithm of Kambhatla and Leen (1997), defined as follows:

1. Select initial cluster centers $\mu_i$ at random from points in the data set, and assign all data points to the nearest (in terms of Euclidean distance) cluster center.

2. Set the $W_i$ vectors to the first two principal axes of the covariance matrix of cluster i.

3. Assign data points to the cluster that best reconstructs them, setting each $\mu_i$ to the mean of those data points assigned to cluster i.

4. Repeat from step 2 until the cluster allocations are constant.
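A minimal sketch of this procedure; the handling of empty or tiny clusters is an added safeguard not specified in the steps above, and q = 2 matches the description of step 2:

```python
import numpy as np

def vqpca(T, n_clusters, q=2, max_iter=100, seed=0):
    """Reconstruction-based local PCA (steps 1-4 above): returns means, axes, assignments."""
    rng = np.random.default_rng(seed)
    N, d = T.shape
    mus = T[rng.choice(N, n_clusters, replace=False)]        # step 1: random centers
    assign = np.argmin(((T[:, None, :] - mus[None])**2).sum(-1), axis=1)
    for _ in range(max_iter):
        # Step 2: principal axes of each cluster's covariance
        Ws = []
        for i in range(n_clusters):
            members = T[assign == i]
            if len(members) < q:                             # safeguard for tiny clusters
                Ws.append(np.eye(d)[:, :q])
                continue
            diff = members - mus[i]
            _, _, Vt = np.linalg.svd(diff, full_matrices=False)
            Ws.append(Vt[:q].T)                              # (d, q) orthonormal axes
        # Step 3: reassign by lowest reconstruction error, then update the means
        errors = np.empty((N, n_clusters))
        for i in range(n_clusters):
            diff = T - mus[i]
            recon = diff @ Ws[i] @ Ws[i].T
            errors[:, i] = np.sum((diff - recon)**2, axis=1)
        new_assign = errors.argmin(axis=1)
        for i in range(n_clusters):
            if np.any(new_assign == i):
                mus[i] = T[new_assign == i].mean(axis=0)
        # Step 4: stop when the allocations no longer change
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return mus, Ws, assign
```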

In Figure 3, data points have been sampled over the hemisphere, without noise, and allocated to the cluster that best reconstructs them. The left plot shows the partitioning associated with the best (i.e., lowest reconstruction error) model obtained from 100 runs of the VQPCA algorithm. The right plot shows a similar partitioning for the best (i.e., greatest likelihood) PPCA mixture model using the same number of components, again from 100 runs.

Note that the boundaries illustrated in this latter plot were obtained using


Table 1: Data Sets Used for Comparison of Clustering Criteria.

Data Set     N    d   M   q   Description
Hemisphere   500  3   12  2   Synthetic data used above
Oil          500  12  12  2   Diagnostic measurements from oil pipeline flows
Digit 1      500  64  10  10  8 × 8 gray-scale images of handwritten digit 1
Digit 2      500  64  10  10  8 × 8 gray-scale images of handwritten digit 2
Image        500  64  8   4   8 × 8 gray-scale blocks from a photographic image
EEG          300  30  8   5   Delay vectors from an electroencephalogram time-series signal

assignments based on reconstruction error for the final model, in identical fashion to the VQPCA case, and not on probabilistic responsibility. We see that the partitions formed when clustering according to reconstruction error alone can be nonlocal, as exemplified by the shaded component. This phenomenon is rather contrary to the philosophy of local dimensionality reduction and is an indirect consequence of the fact that reconstruction-based local PCA does not model the data in a probabilistic sense.

However, we might expect that algorithms such as VQPCA should offer better performance in terms of the reconstruction error of the final solution, having been designed explicitly to optimize that measure. In order to test this, we compared the VQPCA algorithm with the PPCA mixture model on six data sets, detailed in Table 1.

Figure 4 summarizes the reconstruction error of the respective models, and in general, VQPCA performs better, as expected. However, we also note two interesting aspects of the results.

First, in the case of the oil data, the final reconstruction error of the PPCA model on both training and test sets is counterintuitively superior, despite the fact that the partitioning of the data space was based only partially on reconstruction error. This behavior is, we hypothesize, a result of the particular structure of that data set. The oil data are known to comprise a number of disjoint, but locally smooth, two-dimensional cluster structures (see Bishop & Tipping, 1998, for a visualization).

For the oil data set, we observed that many of the models found by the VQPCA algorithm exhibit partitions that are not only often nonconnected (similar to those shown for the hemisphere in Figure 3) but may also span more than one of the disjoint cluster structures. The evidence of Figure 4 suggests that these models represent poor local minima of the reconstruction error cost function. The PPCA mixture algorithm does not find such suboptimal solutions, which would have low likelihood due to the locality implied by the density model. The experiment indicates that by avoiding these poor solutions, the PPCA mixture model is able to find solutions with lower reconstruction error (on average) than VQPCA.

Figure 4: Reconstruction errors for reconstruction-based local PCA (VQPCA) and the PPCA mixture, shown separately for the training set and the test set. Errors for the latter (∗) have been shown relative to the former (∇), and are averaged over 100 runs with random initial configurations.


These observations apply only to the case of the oil data set. For the hemisphere, digit 1, image, and electroencephalogram (EEG) training sets, the data manifolds are less disjoint, and the explicit reconstruction-based algorithm, VQPCA, is superior. For the digit 2 case, the two algorithms appear approximately equivalent.

A second aspect of Figure 4 is the suggestion that the PPCA mixture model algorithm may be less sensitive to overfitting. As would be expected, compared with the training set, errors on the test set increase for both algorithms (although, because the errors have been normalized to allow comparisons between data sets, this is not shown in Figure 4). However, with the exception of the case of the digit 2 data set, for the PPCA mixture model this increase is proportionately smaller than for VQPCA. This effect is most dramatic for the image data set, where PPCA is much superior on the test set. For that data set, the test examples were derived from a separate portion of the image (see below), and as such, the test set statistics can be expected to differ more significantly from the respective training set than for the other examples.

A likely explanation is that because of the soft clustering of the PPCA mixture model, there is an inherent smoothing effect occurring when estimating the local sets of principal axes. Each set of axes is determined from its corresponding local responsibility-weighted covariance matrix, which in general will be influenced by many data points, not just the subset that would be associated with the cluster in a “hard” implementation. Because of this, the parameters in the $W_i$ matrix in cluster i are also constrained by data points in neighboring clusters ($j \neq i$) to some extent. This notion is discussed in the context of regression by Jordan and Jacobs (1994) as motivation for their mixture-of-experts model, where the authors note how soft partitioning can reduce variance (in terms of the bias-variance decomposition). Although it is difficult to draw firm conclusions from this limited set of experiments, the evidence of Figure 4 does point to the presence of such an effect.

5.3 Application: Image Compression. As a practical example, we consider an application of the PPCA mixture model to block transform image coding. Figure 5 shows the original image. This 720 × 360 pixel image was segmented into 8 × 8 nonoverlapping blocks, giving a total data set of 4050 64-dimensional vectors. Half of these data, corresponding to the left half of the picture, were used as training data. The right half was reserved for testing; a magnified portion of the test image is also shown in Figure 5. A reconstruction of the entire image based on the first four principal components of a single PCA model determined from the block-transformed left half of the image is shown in Figure 6.

Figure 5: (Left) The original image. (Right) Detail.

Figure 6: The PCA reconstructed image, at 0.5 bit per pixel. (Left) The original image. (Right) Detail.

Figure 7 shows the reconstruction of the original image when modeled by a mixture of probabilistic principal component analyzers. The model parameters were estimated using only the left half of the image. In this example, 12 components were used, of dimensionality 4; after the model likelihood had been maximized, the image coding was performed in a “hard” fashion, by allocating data to the component with the lowest reconstruction error. The resulting coded image was uniformly quantized, with bits allocated equally to each transform variable, before reconstruction, in order to give a final bit rate of 0.5 bits per pixel (and thus compression of 16 to 1) in both Figures 6 and 7. In the latter case, the cost of encoding the mixture component label was included. For the simple principal subspace reconstruction, the normalized test error was $7.1 \times 10^{-2}$; for the mixture model, it was $5.7 \times 10^{-2}$. The VQPCA algorithm gave a test error of $6.2 \times 10^{-2}$.

Figure 7: The mixture of PPCA reconstructed image, using the same bit rate as Figure 6. (Left) The original image. (Right) Detail.

6 Density Modeling

A popular approach to semiparametric density estimation is the gaussian mixture model (Titterington, Smith, & Makov, 1985). However, such models suffer from the limitation that if each gaussian component is described by a full covariance matrix, then there are d(d + 1)/2 independent covariance parameters to be estimated for each mixture component. Clearly, as the dimensionality of the data space increases, the number of data points required to specify those parameters reliably will become prohibitive. An alternative approach is to reduce the number of parameters by placing a constraint on the form of the covariance matrix. (Another would be to introduce priors over the parameters of the full covariance matrix, as implemented by Ormoneit & Tresp, 1996.) Two common constraints are to restrict the covariance to be isotropic or to be diagonal. The isotropic model is highly constrained as it assigns only a single parameter to describe the entire covariance structure in the full d dimensions. The diagonal model is more flexible, with d parameters, but the principal axes of the elliptical gaussians must be aligned with the data axes, and thus each individual mixture component is unable to capture correlations among the variables.

A mixture of PPCA models, where the covariance of each gaussian is parameterized by the relation $C = \sigma^2 I + WW^T$, comprises $dq + 1 - q(q-1)/2$ free parameters.¹ (Note that the $q(q-1)/2$ term takes account of the number of parameters needed to specify the arbitrary rotation R.) It thus permits the number of parameters to be controlled by the choice of q. When q = 0, the model is equivalent to an isotropic gaussian. With q = d − 1, the general covariance gaussian is recovered.
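To make these counts concrete, a small sketch comparing the per-component covariance parameters for an assumed d = 64 and q = 10 (values chosen only for illustration):

```python
def covariance_parameter_counts(d, q):
    """Per-component covariance parameters for the models discussed above."""
    return {
        "isotropic": 1,
        "diagonal": d,
        "full": d * (d + 1) // 2,
        "ppca": d * q + 1 - q * (q - 1) // 2,
    }

print(covariance_parameter_counts(d=64, q=10))
# {'isotropic': 1, 'diagonal': 64, 'full': 2080, 'ppca': 596}
```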

6.1 A Synthetic Example: Noisy Spiral Data. The utility of the PPCA mixture approach may be demonstrated with the following simple example.

A 500-point data set was generated along a three-dimensional spiral configuration with added gaussian noise. The data were then modeled by both a mixture of PPCA models and a mixture of diagonal covariance gaussians, using eight mixture components. In the mixture of PPCA case, q = 1 for each component, and so there are four variance parameters per component compared with three for the diagonal model. The results are visualized in Figure 8, which illustrates both side and end projections of the data.

¹ An alternative would be a mixture of factor analyzers, implemented by Hinton et al. (1997), although that comprises more parameters.


Figure 8: Comparison of an eight-component diagonal variance gaussian mixture model with a mixture of PPCA model. The upper two plots give a view perpendicular to the major axis of the spiral; the lower two plots show the end elevation. The covariance structure of each mixture component is shown by projection of a unit Mahalanobis distance ellipse, and the log-likelihood per data point is given in parentheses above the figures.

The orientation of the ellipses in the diagonal model can be seen not to coincide with the local data structure, which is a result of the axial alignment constraint. A further consequence of the diagonal parameterization is that the means are also implicitly constrained because they tend to lie where the tangent to the spiral is parallel to either axis of the end elevation. This qualitative superiority of the PPCA approach is underlined quantitatively by the log-likelihood per data point given in parentheses in the figure. Such a result would be expected given that the PPCA model has an extra parameter in each mixture component, but similar results are observed if the spiral is embedded in a space of much higher dimensionality where the extra parameter in PPCA is proportionately less relevant.

It should be intuitive that the axial alignment constraint of the diagonal model is, in general, particularly inappropriate when modeling a smooth and continuous lower dimensional manifold in higher dimensions, regardless of the intrinsic dimensionality. Even with q = 1, the PPCA approach is able to track the spiral manifold successfully.

Table 2: Log-Likelihood per Data Point Measured on Training and Test Sets for Gaussian Mixture Models with Eight Components and a 100-Point Training Set.

           Isotropic   Diagonal   Full    PPCA
Training    −3.14       −2.74     −1.47   −1.65
Test        −3.68       −3.43     −3.09   −2.37

Finally, we demonstrate the importance of the use of an appropriate number of parameters by modeling a three-dimensional spiral data set of 100 data points (the number of data points was reduced to emphasize the overfitting) as above with isotropic, diagonal, and full covariance gaussian mixture models, along with a PPCA mixture model. For each model, the log-likelihood per data point for both the training data set and an unseen test set of 1000 data points is given in Table 2.

As would be expected in this case of limited data, the full covariance model exhibits the best likelihood on the training set, but test set performance is worse than for the PPCA mixture. For this simple example, there is only one intermediate PPCA parameterization with q = 1 (q = 0 and q = 2 are equivalent to the isotropic and full covariance cases respectively).

In realistic applications, where the dimensionality d will be considerably larger, the PPCA model offers the choice of a range of q, and an appropriate value can be determined using standard techniques for model selection.

Finally, note that these advantages are not limited to mixture models, but may equally be exploited for the case of a single gaussian distribution.

6.2 Application: Handwritten Digit Recognition. One potential application for high-dimensionality density models is handwritten digit recognition. Examples of gray-scale pixel images of a given digit will generally lie on a lower-dimensional smooth continuous manifold, the geometry of which is determined by properties of the digit such as rotation, scaling, and thickness of stroke. One approach to the classification of such digits (although not necessarily the best) is to build a model of each digit separately, and classify unseen digits according to the model to which they are most similar.

Hinton et al. (1997) gave an excellent discussion of the handwritten digit problem and applied a mixture of PCA approach, using soft reconstruction-based clustering, to the classification of scaled and smoothed 8 × 8 gray-scale images taken from the CEDAR U.S. Postal Service database (Hull, 1994). The models were constructed using an 11,000-digit subset of the br data set (which was further split into training and validation sets), and the bs test set was classified according to which model best reconstructed each digit (in the squared-error sense). We repeated the experiment with the same data using the PPCA mixture approach using the same choice of parameter values (M = 10 and q = 10). To help visualize the final model, the means of each component $\mu_i$ are illustrated in digit form in Figure 9.

Figure 9: Mean vectors $\mu_i$, illustrated as gray-scale digits, for each of the 10 digit models. The model for a given digit is a mixture of 10 PPCA models, one centered at each of the pixel vectors shown on the corresponding row. Note how different components can capture different styles of digit.

The digits were again classified, using the same method of classification, and the best model on the validation set misclassified 4.64% of the digits in the test set. Hinton et al. (1997) reported an error of 4.91%, and we would expect the improvement to be a result partly of the localized clustering of the PPCA model, but also the use of individually estimated values of $\sigma_i^2$ for each component, rather than a single, arbitrarily chosen, global value.

One of the advantages of the PPCA methodology is that the definition of the density model permits the posterior probabilities of class membership


to be computed for each digit and used for subsequent classification, rather than using reconstruction error as above. Classification according to the largest posterior probability for the M = 10 and q = 10 model resulted in an increase in error, and it was necessary to invest significant effort to optimize the parameters M and q for each model to provide comparable performance.

Using this approach, our best classifier on the validation set misclassified 4.61% of the test set. An additional benefit of the use of posterior probabilities is that it is possible to reject a proportion of the test samples about which the classifier is most “unsure” and thus hopefully improve the classification performance. Using this approach to reject 5% of the test examples resulted in a misclassification rate of 2.50%. (The availability of posteriors can be advantageous in other applications, where they may be used in various forms of follow-on processing.)

7 Conclusions

Modeling complexity in data by a combination of simple linear models is an attractive paradigm offering both computational and algorithmic advantages along with increased ease of interpretability. In this article, we have exploited the definition of a probabilistic model for PCA in order to combine local PCA models within the framework of a probabilistic mixture in which all the parameters are determined from maximum likelihood using an EM algorithm. In addition to the clearly defined nature of the resulting algorithm, the primary advantage of this approach is the definition of an observation density model.

A possible disadvantage of the probabilistic approach to combining local PCA models is that by optimizing a likelihood function, the PPCA mixture model does not directly minimize squared reconstruction error. For applications where this is the salient criterion, algorithms that explicitly minimize reconstruction error should be expected to be superior. Experiments indeed showed this to be generally the case, but two important caveats must be considered before any firm conclusions can be drawn concerning the suitability of a given model. First, and rather surprisingly, for one of the data sets (‘oil’) considered in the article, the final PPCA mixture model was actually superior in the sense of squared reconstruction error, even on the training set.

It was demonstrated that algorithms incorporating reconstruction-based clustering do not necessarily generate local clusters, and it was reasoned that for data sets comprising a number of disjoint data structures, this phenomenon may lead to poor local minima. Such minima are not found by the PPCA density model approach. A second consideration is that there was also evidence that the smoothing implied by the soft clustering inherent in the PPCA mixture model helps to reduce overfitting, particularly in the case of the image compression experiment where the statistics of the test data set differed from the training data much more so than for other examples.

In that instance, the reconstruction test error for the PPCA model was, on average, more than 10% lower.

In terms of a gaussian mixture model, the mixture of probabilistic principal component analyzers enables data to be modeled in high dimensions with relatively few free parameters, while not imposing a generally inappropriate constraint on the covariance structure. The number of free parameters may be controlled through the choice of latent space dimension q, allowing an interpolation in model complexity from isotropic to full covariance structures. The efficacy of this parameterization was demonstrated by performance on a handwritten digit recognition task.

Appendix A: Maximum Likelihood PCA

A.1 The Stationary Points of the Log-Likelihood. The gradient of the log-likelihood (see equation 3.8) with respect to W may be obtained from standard matrix differentiation results (e.g., see Krzanowski & Marriott, 1994, p. 133):

$\frac{\partial \mathcal{L}}{\partial W} = N \left( C^{-1} S C^{-1} W - C^{-1} W \right)$. (A.1)

At the stationary points

$S C^{-1} W = W$, (A.2)

assuming that $\sigma^2 > 0$, and thus that $C^{-1}$ exists. This is a necessary and sufficient condition for the density model to remain nonsingular, and we will restrict ourselves to such cases. It will be seen shortly that $\sigma^2 > 0$ if $q < \mathrm{rank}(S)$, so this assumption implies no loss of practicality.

There are three possible classes of solutions to equation A.2:

1. W = 0. This is shown later to be a minimum of the log-likelihood.

2. C = S, where the covariance model is exact, such as is discussed by Basilevsky (1994, pp. 361–363) and considered in section 2.3. In this unrealistic case of an exact covariance model, where the $d - q$ smallest eigenvalues of S are identical and equal to $\sigma^2$, W is identifiable since

   $\sigma^2 I + W W^T = S \;\Rightarrow\; W = U (\Lambda - \sigma^2 I)^{1/2} R$, (A.3)

   where U is a square matrix whose columns are the eigenvectors of S, with Λ the corresponding diagonal matrix of eigenvalues, and R is an arbitrary orthogonal (i.e., rotation) matrix.

3. $S C^{-1} W = W$, with $W \neq 0$ and $C \neq S$.
