
Rotational Ambiguity. If W is determined by the above algorithm, or any other iterative method that maximizes the likelihood (see equation 3.8), then at convergence, $W_{\mathrm{ML}} = U_q(\Lambda_q - \sigma^2 I)^{1/2}R$. If it is desired to find the true principal axes $U_q$ (and not just the principal subspace), then the arbitrary rotation matrix $R$ presents difficulty. This rotational ambiguity also exists in factor analysis, as well as in certain iterative PCA algorithms, where it is usually not possible to determine the actual principal axes if $R \neq I$ (although there are algorithms where the constraint $R = I$ is imposed and the axes may be found).

However, in probabilistic PCA, R may actually be found since

$$W_{\mathrm{ML}}^T W_{\mathrm{ML}} = R^T(\Lambda_q - \sigma^2 I)R \qquad (A.29)$$

implies that $R^T$ may be computed as the matrix of eigenvectors of the $q \times q$ matrix $W_{\mathrm{ML}}^T W_{\mathrm{ML}}$. Hence, both $U_q$ and $\Lambda_q$ may be found by inverting the rotation followed by normalization of $W_{\mathrm{ML}}$. That the rotational ambiguity may be resolved in PPCA is a consequence of the scaling of the eigenvectors by $(\Lambda_q - \sigma^2 I)^{1/2}$ prior to rotation by $R$. Without this scaling, $W_{\mathrm{ML}}^T W_{\mathrm{ML}} = I$, and the corresponding eigenvectors remain ambiguous. Also, note that while finding the eigenvectors of $S$ directly requires $O(d^3)$ operations, to obtain them from $W_{\mathrm{ML}}$ in this way requires only $O(q^3)$.
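To make this recovery step concrete, here is a minimal NumPy sketch; the function and variable names (W_ml, sigma2) are illustrative, and a maximum likelihood weight matrix of shape (d, q) is assumed to be available.

```python
import numpy as np

def recover_principal_axes(W_ml, sigma2):
    """Recover U_q and Lambda_q from W_ML (equation A.29).

    Since W_ML^T W_ML = R^T (Lambda_q - sigma^2 I) R, the eigenvectors of
    this q x q matrix give R^T, at O(q^3) cost rather than O(d^3).
    """
    evals, R_T = np.linalg.eigh(W_ml.T @ W_ml)   # eigenvalues in ascending order
    scaled_axes = W_ml @ R_T                     # = U_q (Lambda_q - sigma^2 I)^{1/2}
    Lambda_q = evals + sigma2                    # eigenvalues of S
    U_q = scaled_axes / np.sqrt(evals)           # normalize columns to unit length
    return U_q, Lambda_q
```

As with any eigendecomposition, the recovered axes are determined only up to sign and ordering.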

Appendix B: Optimal Least-Squares Reconstruction

One of the motivations for adopting PCA in many applications, notably in data compression, is the property of optimal linear least-squares reconstruction. That is, for all orthogonal projections $x = A^T t$ of the data, the least-squares reconstruction error,

$$E^2_{\mathrm{rec}} = \frac{1}{N}\sum_{n=1}^{N} \|t_n - BA^T t_n\|^2, \qquad (B.1)$$

is minimized when the columns of $A$ span the principal subspace of the data covariance matrix, and $B = A$. (For simplification, and without loss of generality, we assume here that the data has zero mean.)
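As an illustration of this classical property, the following sketch (with illustrative names; zero-mean data points are assumed to sit in the rows of T) evaluates the error of equation B.1 with the columns of A taken as the leading eigenvectors of the sample covariance; this error equals the sum of the d − q smallest eigenvalues of S.

```python
import numpy as np

def pca_reconstruction_error(T, q):
    """Least-squares reconstruction error (equation B.1) with B = A and the
    columns of A spanning the principal subspace.  Assumes zero-mean rows of T."""
    N, d = T.shape
    S = T.T @ T / N                      # sample covariance matrix
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    A = evecs[:, -q:]                    # top-q principal axes
    T_hat = T @ A @ A.T                  # reconstructions B A^T t_n
    return np.mean(np.sum((T - T_hat) ** 2, axis=1))
```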

We can similarly obtain this property from our probabilistic formalism, without the need to determine the exact orthogonal projection $W$, by finding the optimal reconstruction of the posterior mean vectors $\langle x_n \rangle$. To do this we simply minimize

$$E^2_{\mathrm{rec}} = \frac{1}{N}\sum_{n=1}^{N} \|t_n - B\langle x_n \rangle\|^2, \qquad (B.2)$$

over the reconstruction matrix B, which is equivalent to a linear regression problem giving

$$B = SW(W^T SW)^{-1}M, \qquad (B.3)$$

where we have substituted for $\langle x_n \rangle$ from equation A.22. In general, the resulting projection $B\langle x_n \rangle$ of $t_n$ is not orthogonal, except in the maximum likelihood case, where $W = W_{\mathrm{ML}} = U_q(\Lambda_q - \sigma^2 I)^{1/2}R$, and the optimal reconstructing matrix becomes

$$B_{\mathrm{ML}} = W(W^T W)^{-1}M, \qquad (B.4)$$

and so

$$\hat{t}_n = W(W^T W)^{-1}M\langle x_n \rangle \qquad (B.5)$$

$$\;\;\,= W(W^T W)^{-1}W^T t_n, \qquad (B.6)$$

which is the expected orthogonal projection. The implication is thus that in the data compression context, at the maximum likelihood solution, the variables $\langle x_n \rangle$ can be transmitted down the channel and the original data vectors optimally reconstructed using equation B.5, given the parameters $W$ and $\sigma^2$. Substituting for $B$ in equation B.2 gives $E^2_{\mathrm{rec}} = (d - q)\sigma^2$, and the noise term $\sigma^2$ thus represents the expected squared reconstruction error per "lost" dimension.
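The following sketch illustrates the reconstruction of equations B.4 and B.5; the names are illustrative, and W, mu, and sigma2 are assumed to be the maximum likelihood parameters, so that the result coincides with the orthogonal projection of equation B.6.

```python
import numpy as np

def reconstruct_from_posterior_means(T, W, mu, sigma2):
    """Reconstruct data from the posterior means <x_n> (equations B.4-B.5)."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W            # q x q matrix M
    M_inv = np.linalg.inv(M)
    X_post = (T - mu) @ W @ M_inv               # rows are <x_n> = M^{-1} W^T (t_n - mu)
    B_ml = W @ np.linalg.inv(W.T @ W) @ M       # optimal reconstruction matrix (B.4)
    T_hat = X_post @ B_ml.T + mu                # equation B.5; equals B.6 at the ML solution
    return T_hat
```

Only the q-dimensional vectors ⟨x_n⟩, together with W, μ, and σ², need to be transmitted in order to perform this reconstruction.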

Appendix C: EM for Mixtures of Probabilistic PCA

In a mixture of probabilistic principal component analyzers, we must fit a mixture of latent variable models in which the overall model distribution takes the form

$$p(t) = \sum_{i=1}^{M} \pi_i\, p(t|i), \qquad (C.1)$$

where $p(t|i)$ is a single probabilistic PCA model and $\pi_i$ is the corresponding mixing proportion. The parameters for this mixture model can be determined by an extension of the EM algorithm. We begin by considering the standard form that the EM algorithm would take for this model and highlight a number of limitations. We then show that a two-stage form of EM leads to a more efficient algorithm.

We first note that in addition to a set of $x_{ni}$ for each model $i$, the missing data include variables $z_{ni}$ labeling which model is responsible for generating each data point $t_n$. At this point we can derive a standard EM algorithm by considering the corresponding complete-data log-likelihood, which takes the form

$$\mathcal{L}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} z_{ni} \ln\{\pi_i\, p(t_n, x_{ni})\}. \qquad (C.2)$$

Starting with "old" values for the parameters $\pi_i$, $\mu_i$, $W_i$, and $\sigma_i^2$, we first evaluate the posterior probabilities $R_{ni}$ using equation 4.3 and similarly evaluate the expectations $\langle x_{ni} \rangle$ and $\langle x_{ni} x_{ni}^T \rangle$:

$$\langle x_{ni} \rangle = M_i^{-1} W_i^T(t_n - \mu_i), \qquad (C.3)$$

$$\langle x_{ni} x_{ni}^T \rangle = \sigma_i^2 M_i^{-1} + \langle x_{ni} \rangle\langle x_{ni} \rangle^T, \qquad (C.4)$$

with $M_i = \sigma_i^2 I + W_i^T W_i$.
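A compact sketch of these E-step statistics for a single component is given below; the function and variable names are illustrative, with the data points held in the rows of T.

```python
import numpy as np

def posterior_moments(T, W_i, mu_i, sigma2_i):
    """E-step statistics for component i (equations C.3 and C.4)."""
    q = W_i.shape[1]
    M_i = sigma2_i * np.eye(q) + W_i.T @ W_i    # M_i = sigma_i^2 I + W_i^T W_i
    M_inv = np.linalg.inv(M_i)
    X_mean = (T - mu_i) @ W_i @ M_inv           # rows are <x_ni>  (C.3)
    # <x_ni x_ni^T> = sigma_i^2 M_i^{-1} + <x_ni><x_ni>^T  (C.4); the first term
    # is shared by all n, so it is returned once rather than per data point.
    return X_mean, sigma2_i * M_inv
```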

Then we take the expectation of $\mathcal{L}_C$ with respect to these posterior distributions to obtain

$$\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \left\{ \ln \pi_i - \frac{d}{2}\ln\sigma_i^2 - \frac{1}{2}\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\right) - \frac{1}{2\sigma_i^2}\|t_n - \mu_i\|^2 + \frac{1}{\sigma_i^2}\langle x_{ni}\rangle^T W_i^T(t_n - \mu_i) - \frac{1}{2\sigma_i^2}\mathrm{tr}\left(W_i^T W_i \langle x_{ni} x_{ni}^T\rangle\right) \right\}, \qquad (C.5)$$

where $\langle\cdot\rangle$ denotes the expectation with respect to the posterior distributions of both $x_{ni}$ and $z_{ni}$, and terms independent of the model parameters have been omitted. The M-step then involves maximizing equation C.5 with respect to $\pi_i$, $\mu_i$, $\sigma_i^2$, and $W_i$ to obtain "new" values for these parameters. The maximization with respect to $\pi_i$ must take account of the constraint that $\sum_i \pi_i = 1$. This can be achieved with the use of a Lagrange multiplier $\lambda$ (see Bishop, 1995) by maximizing

$$\langle \mathcal{L}_C \rangle + \lambda\left(\sum_{i=1}^{M} \pi_i - 1\right). \qquad (C.6)$$

Together with the results of maximizing equation C.5 with respect to the remaining parameters, this gives the following M-step equations:

$$\tilde{\pi}_i = \frac{1}{N}\sum_{n} R_{ni}, \qquad (C.7)$$

$$\tilde{\mu}_i = \frac{\sum_n R_{ni}\left(t_n - \tilde{W}_i\langle x_{ni}\rangle\right)}{\sum_n R_{ni}}, \qquad (C.8)$$

$$\tilde{W}_i = \left[\sum_n R_{ni}(t_n - \tilde{\mu}_i)\langle x_{ni}\rangle^T\right]\left[\sum_n R_{ni}\langle x_{ni} x_{ni}^T\rangle\right]^{-1}, \qquad (C.9)$$

$$\tilde{\sigma}_i^2 = \frac{1}{d\sum_n R_{ni}}\left\{\sum_n R_{ni}\|t_n - \tilde{\mu}_i\|^2 - 2\sum_n R_{ni}\langle x_{ni}\rangle^T\tilde{W}_i^T(t_n - \tilde{\mu}_i) + \sum_n R_{ni}\,\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\tilde{W}_i^T\tilde{W}_i\right)\right\}, \qquad (C.10)$$

where the tilde denotes "new" quantities that may be adjusted in the M-step. Note that the M-step equations for $\tilde{\mu}_i$ and $\tilde{W}_i$, given by equations C.8 and C.9, are coupled, and so further (albeit straightforward) manipulation is required to obtain explicit solutions.

In fact, simplification of the M-step equations, along with improved speed of convergence, is possible if we adopt a two-stage EM procedure as follows. The likelihood function we wish to maximize is given by

$$\mathcal{L} = \sum_{n=1}^{N} \ln\left\{\sum_{i=1}^{M} \pi_i\, p(t_n|i)\right\}. \qquad (C.11)$$

Regarding the component labels $z_{ni}$ as missing data, and ignoring the presence of the latent $x$ variables for now, we can consider the corresponding expected complete-data log-likelihood given by

$$\hat{\mathcal{L}}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \ln\{\pi_i\, p(t_n|i)\}, \qquad (C.12)$$

where $R_{ni}$ represent the posterior probabilities (corresponding to the expected values of $z_{ni}$) and are given by equation 4.2. Maximization of equation C.12 with respect to $\pi_i$, again using a Lagrange multiplier, gives the M-step equation 4.4. Similarly, maximization of equation C.12 with respect to $\mu_i$ gives equation 4.5. This is the first stage of the combined EM procedure.
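A sketch of this first stage is given below, under the assumption that equations 4.2, 4.4, and 4.5 take the usual mixture forms (responsibilities proportional to π_i p(t_n|i), and responsibility-weighted averages for π_i and μ_i); the function name and the log-density input are illustrative.

```python
import numpy as np

def first_stage_update(T, log_p_t_given_i, pi):
    """First EM stage: responsibilities R_ni and updates of pi_i and mu_i.

    log_p_t_given_i is an (N, M) array of ln p(t_n | i) evaluated with the
    current parameters; pi is the (M,) vector of mixing proportions.
    """
    log_weighted = np.log(pi) + log_p_t_given_i
    log_norm = np.logaddexp.reduce(log_weighted, axis=1, keepdims=True)
    R = np.exp(log_weighted - log_norm)              # responsibilities R_ni
    pi_new = R.mean(axis=0)                          # pi_i <- (1/N) sum_n R_ni
    mu_new = (R.T @ T) / R.sum(axis=0)[:, None]      # mu_i <- sum_n R_ni t_n / sum_n R_ni
    return R, pi_new, mu_new
```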

In order to update $W_i$ and $\sigma_i^2$, we seek only to increase the value of $\hat{\mathcal{L}}_C$, and not actually to maximize it. This corresponds to the generalized EM (or GEM) algorithm. We do this by considering $\hat{\mathcal{L}}_C$ as our likelihood of interest and, introducing the missing $x_{ni}$ variables, performing one cycle of the EM algorithm, now with respect to the parameters $W_i$ and $\sigma_i^2$. This second stage is guaranteed to increase $\hat{\mathcal{L}}_C$, and therefore $\mathcal{L}$, as desired.

The advantages of this approach are twofold. First, the new values $\tilde{\mu}_i$ calculated in the first stage are used to compute the sufficient statistics of the posterior distribution of $x_{ni}$ in the second stage, using equations C.3 and C.4. Using updated values of $\mu_i$ in computing these statistics leads to improved convergence speed.

A second advantage is that for the second stage of the EM algorithm, there is a considerable simplification of the M-step updates, since when equation C.5 is expanded for $\langle x_{ni}\rangle$ and $\langle x_{ni} x_{ni}^T\rangle$, only terms in $\tilde{\mu}_i$ (and not $\mu_i$) appear. By inspection of equation C.5, we see that the expected complete-data log-likelihood now takes the form

$$\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \left\{ \ln \tilde{\pi}_i - \frac{d}{2}\ln\sigma_i^2 - \frac{1}{2}\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\right) - \frac{1}{2\sigma_i^2}\|t_n - \tilde{\mu}_i\|^2 + \frac{1}{\sigma_i^2}\langle x_{ni}\rangle^T W_i^T(t_n - \tilde{\mu}_i) - \frac{1}{2\sigma_i^2}\mathrm{tr}\left(W_i^T W_i \langle x_{ni} x_{ni}^T\rangle\right) \right\}. \qquad (C.13)$$

Now when we maximize equation C.13 with respect to $W_i$ and $\sigma_i^2$ (keeping $\tilde{\mu}_i$ fixed), we obtain the much-simplified M-step equations:

$$\tilde{W}_i = S_i W_i\left(\sigma_i^2 I + M_i^{-1} W_i^T S_i W_i\right)^{-1}, \qquad (C.14)$$

$$\tilde{\sigma}_i^2 = \frac{1}{d}\,\mathrm{tr}\left(S_i - S_i W_i M_i^{-1}\tilde{W}_i^T\right), \qquad (C.15)$$

where

$$S_i = \frac{1}{\tilde{\pi}_i N}\sum_{n=1}^{N} R_{ni}(t_n - \tilde{\mu}_i)(t_n - \tilde{\mu}_i)^T. \qquad (C.16)$$

Iteration of equations 4.3 through 4.5 followed by equations C.14 and C.15 in sequence is guaranteed to find a local maximum of the likelihood (see equation 4.1).
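A direct sketch of one second-stage update for a single component follows; the names are illustrative, mu_i_new and pi_i_new are the first-stage quantities, and S_i is formed explicitly here for clarity (see the discussion below on avoiding this when q is much smaller than d).

```python
import numpy as np

def second_stage_update(T, R_i, W_i, sigma2_i, mu_i_new, pi_i_new):
    """Second EM stage for component i (equations C.14-C.16)."""
    N, d = T.shape
    q = W_i.shape[1]
    Tc = T - mu_i_new
    # Local responsibility-weighted covariance matrix S_i (C.16).
    S_i = (Tc * R_i[:, None]).T @ Tc / (pi_i_new * N)
    M_i = sigma2_i * np.eye(q) + W_i.T @ W_i
    M_inv = np.linalg.inv(M_i)
    SW = S_i @ W_i
    W_new = SW @ np.linalg.inv(sigma2_i * np.eye(q) + M_inv @ W_i.T @ SW)   # (C.14)
    sigma2_new = np.trace(S_i - SW @ M_inv @ W_new.T) / d                   # (C.15)
    return W_new, sigma2_new
```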

Comparison of equations C.14 and C.15 with equations A.26 and A.27 shows that the updates for the mixture case are identical to those of the single PPCA model, given that the local responsibility-weighted covariance matrix $S_i$ is substituted for the global covariance matrix $S$. Thus, at stationary points, each weight matrix $W_i$ contains the (scaled and rotated) eigenvectors of its respective $S_i$, the local covariance matrix. Each submodel is then performing a local PCA, where each data point is weighted by the responsibility of that submodel for its generation, and a soft partitioning, similar to that introduced by Hinton et al. (1997), is automatically effected.

Given the established results for the single PPCA model, there is no need to use the iterative updates (see equations C.14 and C.15), since $W_i$ and $\sigma_i^2$ may be determined by eigendecomposition of $S_i$, and the likelihood must still increase unless at a maximum. However, as discussed in appendix A.5, the iterative EM scheme may offer computational advantages, particularly for $q \ll d$. In such a case, the iterative approach of equations C.14 and C.15 can be used, taking care to evaluate $S_i W_i$ efficiently as $\sum_n R_{ni}(t_n - \tilde{\mu}_i)\left\{(t_n - \tilde{\mu}_i)^T W_i\right\}$. In the mixture case, unlike for the single model, $S_i$ must be recomputed at each iteration of the EM algorithm, as the responsibilities $R_{ni}$ will change.
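The efficient evaluation of S_i W_i mentioned above can be written as in the sketch below (illustrative names; the 1/(π̃_i N) normalization of equation C.16 is included), costing O(N d q) rather than the O(N d²) needed to form S_i explicitly.

```python
import numpy as np

def responsibility_weighted_SW(T, R_i, mu_i, W_i, pi_i):
    """Compute S_i W_i without forming the d x d matrix S_i explicitly."""
    N = T.shape[0]
    Tc = T - mu_i                         # (N, d) centered data
    proj = Tc @ W_i                       # (t_n - mu_i)^T W_i for each n, shape (N, q)
    return (Tc * R_i[:, None]).T @ proj / (pi_i * N)
```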

As a final computational note, it might appear that the necessary calculation of $p(t|i)$ would require inversion of the $d \times d$ matrix $C$, an $O(d^3)$ operation. However, $(\sigma^2 I + WW^T)^{-1} = \{I - W(\sigma^2 I + W^T W)^{-1}W^T\}/\sigma^2$, and so $C^{-1}$ may be computed using the already calculated $q \times q$ matrix $M^{-1}$.
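A sketch of the resulting density evaluation is shown below; it uses the identity above for $C^{-1}$ together with the determinant identity $|C| = (\sigma^2)^{d-q}|M|$, which is not stated in the text but follows from Sylvester's determinant theorem. The names are illustrative.

```python
import numpy as np

def log_density_ppca(T, W, mu, sigma2):
    """Evaluate ln p(t) for one PPCA component without inverting the d x d matrix C."""
    N, d = T.shape
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    M_inv = np.linalg.inv(M)
    proj = Tc @ W                                            # (N, q)
    # (t - mu)^T C^{-1} (t - mu) via C^{-1} = {I - W M^{-1} W^T} / sigma^2
    quad = (np.sum(Tc**2, axis=1) - np.sum((proj @ M_inv) * proj, axis=1)) / sigma2
    log_det_C = (d - q) * np.log(sigma2) + np.linalg.slogdet(M)[1]
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det_C + quad)
```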

Acknowledgments

This work was supported by EPSRC contract GR/K51808: Neural Networks for Visualization of High Dimensional Data, at Aston University. We thank Michael Revow for supplying the handwritten digit data in its processed form.

References

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.

Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5, pp. 111–150). Berkeley: University of California, Berkeley.

Bartholomew, D. J. (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.

Basilevsky, A. (1994). Statistical factor analysis and related methods. New York: Wiley.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.

Bishop, C. M., & Tipping, M. E. (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281–293.

Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using manifold learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 973–980). Cambridge, MA: MIT Press.

Broomhead, D. S., Indik, R., Newell, A. C., & Rand, D. A. (1991). Local adaptive Galerkin bases for large-dimensional dynamical systems. Nonlinearity, 4(1), 159–197.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1–38.

Dony, R. D., & Haykin, S. (1995). Optimally adaptive transform coding. IEEE Transactions on Image Processing, 4(10), 1358–1370.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502–516.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modelling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1), 65–74.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of the Fourteenth International Conference on Artificial Intelligence (pp. 518–523).

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Kambhatla, N. (1995). Local models and gaussian mixture models for statistical data processing. Unpublished doctoral dissertation, Oregon Graduate Institute, Center for Spoken Language Understanding.

Kambhatla, N., & Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural Computation, 9(7), 1493–1516.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243.

Krzanowski, W. J., & Marriott, F. H. C. (1994). Multivariate analysis part 2: Classification, covariance structures and repeated measurements. London: Edward Arnold.

Lawley, D. N. (1953). A modified method of estimation in factor analysis and some large sample results. In Uppsala Symposium on Psychological Factor Analysis. Nordisk Psykologi Monograph Series (pp. 35–42). Uppsala: Almqvist and Wiksell.
