
Rotational Ambiguity. If W is determined by the above algorithm, or any other iterative method that maximizes the likelihood (see equation 3.8), then at convergence, $W_{\mathrm{ML}} = U_q(\Lambda_q - \sigma^2 I)^{1/2}R$. If it is desired to find the true principal axes $U_q$ (and not just the principal subspace), then the arbitrary rotation matrix $R$ presents difficulty. This rotational ambiguity also exists in factor analysis, as well as in certain iterative PCA algorithms, where it is usually not possible to determine the actual principal axes if $R \neq I$ (although there are algorithms where the constraint $R = I$ is imposed and the axes may be found).

However, in probabilistic PCA, R may actually be found since

$$W_{\mathrm{ML}}^T W_{\mathrm{ML}} = R^T(\Lambda_q - \sigma^2 I)R \qquad (A.29)$$

implies that $R^T$ may be computed as the matrix of eigenvectors of the $q \times q$ matrix $W_{\mathrm{ML}}^T W_{\mathrm{ML}}$. Hence, both $U_q$ and $\Lambda_q$ may be found by inverting the rotation followed by normalization of $W_{\mathrm{ML}}$. That the rotational ambiguity may be resolved in PPCA is a consequence of the scaling of the eigenvectors by $(\Lambda_q - \sigma^2 I)^{1/2}$ prior to rotation by $R$. Without this scaling, $W_{\mathrm{ML}}^T W_{\mathrm{ML}} = I$, and the corresponding eigenvectors remain ambiguous. Also, note that while finding the eigenvectors of $S$ directly requires $O(d^3)$ operations, to obtain them from $W_{\mathrm{ML}}$ in this way requires only $O(q^3)$.
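To make this recovery step concrete, here is a minimal NumPy sketch; the function and variable names (W_ml, sigma2) are illustrative, and a maximum likelihood weight matrix of shape (d, q) is assumed to be available.

```python
import numpy as np

def recover_principal_axes(W_ml, sigma2):
    """Recover U_q and Lambda_q from W_ML (equation A.29).

    Since W_ML^T W_ML = R^T (Lambda_q - sigma^2 I) R, the eigenvectors of
    this q x q matrix give R^T, at O(q^3) cost rather than O(d^3).
    """
    evals, R_T = np.linalg.eigh(W_ml.T @ W_ml)   # eigenvalues in ascending order
    scaled_axes = W_ml @ R_T                     # = U_q (Lambda_q - sigma^2 I)^{1/2}
    Lambda_q = evals + sigma2                    # eigenvalues of S
    U_q = scaled_axes / np.sqrt(evals)           # normalize columns to unit length
    return U_q, Lambda_q
```

As with any eigendecomposition, the recovered axes are determined only up to sign and ordering.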

Appendix B: Optimal Least-Squares Reconstruction

One of the motivations for adopting PCA in many applications, notably in data compression, is the property of optimal linear least-squares reconstruction. That is, for all orthogonal projections $x = A^T t$ of the data, the least-squares reconstruction error,

$$E^2_{\mathrm{rec}} = \frac{1}{N}\sum_{n=1}^{N} \|t_n - BA^T t_n\|^2, \qquad (B.1)$$

is minimized when the columns of $A$ span the principal subspace of the data covariance matrix, and $B = A$. (For simplification, and without loss of generality, we assume here that the data has zero mean.)
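As an illustration of this classical property, the following sketch (with illustrative names; zero-mean data points are assumed to sit in the rows of T) evaluates the error of equation B.1 with the columns of A taken as the leading eigenvectors of the sample covariance; this error equals the sum of the d − q smallest eigenvalues of S.

```python
import numpy as np

def pca_reconstruction_error(T, q):
    """Least-squares reconstruction error (equation B.1) with B = A and the
    columns of A spanning the principal subspace.  Assumes zero-mean rows of T."""
    N, d = T.shape
    S = T.T @ T / N                      # sample covariance matrix
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    A = evecs[:, -q:]                    # top-q principal axes
    T_hat = T @ A @ A.T                  # reconstructions B A^T t_n
    return np.mean(np.sum((T - T_hat) ** 2, axis=1))
```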

We can similarly obtain this property from our probabilistic formalism, without the need to determine the exact orthogonal projection $W$, by finding the optimal reconstruction of the posterior mean vectors $\langle x_n \rangle$. To do this we simply minimize

$$E^2_{\mathrm{rec}} = \frac{1}{N}\sum_{n=1}^{N} \|t_n - B\langle x_n \rangle\|^2, \qquad (B.2)$$

over the reconstruction matrix B, which is equivalent to a linear regression problem giving

$$B = SW(W^T SW)^{-1}M, \qquad (B.3)$$

where we have substituted for $\langle x_n \rangle$ from equation A.22. In general, the resulting projection $B\langle x_n \rangle$ of $t_n$ is not orthogonal, except in the maximum likelihood case, where $W = W_{\mathrm{ML}} = U_q(\Lambda_q - \sigma^2 I)^{1/2}R$, and the optimal reconstructing matrix becomes

$$B_{\mathrm{ML}} = W(W^T W)^{-1}M, \qquad (B.4)$$

and so

$$\hat{t}_n = W(W^T W)^{-1}M\langle x_n \rangle \qquad (B.5)$$

$$\;\;\,= W(W^T W)^{-1}W^T t_n, \qquad (B.6)$$

which is the expected orthogonal projection. The implication is thus that in the data compression context, at the maximum likelihood solution, the variables $\langle x_n \rangle$ can be transmitted down the channel and the original data vectors optimally reconstructed using equation B.5, given the parameters $W$ and $\sigma^2$. Substituting for $B$ in equation B.2 gives $E^2_{\mathrm{rec}} = (d - q)\sigma^2$, and the noise term $\sigma^2$ thus represents the expected squared reconstruction error per "lost" dimension.
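The following sketch illustrates the reconstruction of equations B.4 and B.5; the names are illustrative, and W, mu, and sigma2 are assumed to be the maximum likelihood parameters, so that the result coincides with the orthogonal projection of equation B.6.

```python
import numpy as np

def reconstruct_from_posterior_means(T, W, mu, sigma2):
    """Reconstruct data from the posterior means <x_n> (equations B.4-B.5)."""
    q = W.shape[1]
    M = sigma2 * np.eye(q) + W.T @ W            # q x q matrix M
    M_inv = np.linalg.inv(M)
    X_post = (T - mu) @ W @ M_inv               # rows are <x_n> = M^{-1} W^T (t_n - mu)
    B_ml = W @ np.linalg.inv(W.T @ W) @ M       # optimal reconstruction matrix (B.4)
    T_hat = X_post @ B_ml.T + mu                # equation B.5; equals B.6 at the ML solution
    return T_hat
```

Only the q-dimensional vectors ⟨x_n⟩, together with W, μ, and σ², need to be transmitted in order to perform this reconstruction.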

Appendix C: EM for Mixtures of Probabilistic PCA

In a mixture of probabilistic principal component analyzers, we must fit a mixture of latent variable models in which the overall model distribution takes the form

$$p(t) = \sum_{i=1}^{M} \pi_i\, p(t|i), \qquad (C.1)$$

where $p(t|i)$ is a single probabilistic PCA model and $\pi_i$ is the corresponding mixing proportion. The parameters for this mixture model can be determined by an extension of the EM algorithm. We begin by considering the standard form that the EM algorithm would take for this model and highlight a number of limitations. We then show that a two-stage form of EM leads to a more efficient algorithm.

We first note that in addition to a set of $x_{ni}$ for each model $i$, the missing data include variables $z_{ni}$ labeling which model is responsible for generating each data point $t_n$. At this point we can derive a standard EM algorithm by considering the corresponding complete-data log-likelihood, which takes the form

$$\mathcal{L}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} z_{ni} \ln\{\pi_i\, p(t_n, x_{ni})\}. \qquad (C.2)$$

Starting with "old" values for the parameters $\pi_i$, $\mu_i$, $W_i$, and $\sigma_i^2$, we first evaluate the posterior probabilities $R_{ni}$ using equation 4.3 and similarly evaluate the expectations $\langle x_{ni} \rangle$ and $\langle x_{ni} x_{ni}^T \rangle$:

$$\langle x_{ni} \rangle = M_i^{-1} W_i^T(t_n - \mu_i), \qquad (C.3)$$

$$\langle x_{ni} x_{ni}^T \rangle = \sigma_i^2 M_i^{-1} + \langle x_{ni} \rangle\langle x_{ni} \rangle^T, \qquad (C.4)$$

with $M_i = \sigma_i^2 I + W_i^T W_i$.
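A compact sketch of these E-step statistics for a single component is given below; the function and variable names are illustrative, with the data points held in the rows of T.

```python
import numpy as np

def posterior_moments(T, W_i, mu_i, sigma2_i):
    """E-step statistics for component i (equations C.3 and C.4)."""
    q = W_i.shape[1]
    M_i = sigma2_i * np.eye(q) + W_i.T @ W_i    # M_i = sigma_i^2 I + W_i^T W_i
    M_inv = np.linalg.inv(M_i)
    X_mean = (T - mu_i) @ W_i @ M_inv           # rows are <x_ni>  (C.3)
    # <x_ni x_ni^T> = sigma_i^2 M_i^{-1} + <x_ni><x_ni>^T  (C.4); the first term
    # is shared by all n, so it is returned once rather than per data point.
    return X_mean, sigma2_i * M_inv
```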

Then we take the expectation of $\mathcal{L}_C$ with respect to these posterior distributions to obtain

$$\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \left\{ \ln \pi_i - \frac{d}{2}\ln\sigma_i^2 - \frac{1}{2}\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\right) - \frac{1}{2\sigma_i^2}\|t_n - \mu_i\|^2 + \frac{1}{\sigma_i^2}\langle x_{ni}\rangle^T W_i^T(t_n - \mu_i) - \frac{1}{2\sigma_i^2}\mathrm{tr}\left(W_i^T W_i \langle x_{ni} x_{ni}^T\rangle\right) \right\}, \qquad (C.5)$$

where $\langle\cdot\rangle$ denotes the expectation with respect to the posterior distributions of both $x_{ni}$ and $z_{ni}$, and terms independent of the model parameters have been omitted. The M-step then involves maximizing equation C.5 with respect to $\pi_i$, $\mu_i$, $\sigma_i^2$, and $W_i$ to obtain "new" values for these parameters. The maximization with respect to $\pi_i$ must take account of the constraint that $\sum_i \pi_i = 1$. This can be achieved with the use of a Lagrange multiplier $\lambda$ (see Bishop, 1995) by maximizing

$$\langle \mathcal{L}_C \rangle + \lambda\left(\sum_{i=1}^{M} \pi_i - 1\right). \qquad (C.6)$$

Together with the results of maximizing equation C.5 with respect to the remaining parameters, this gives the following M-step equations:

$$\tilde{\pi}_i = \frac{1}{N}\sum_{n} R_{ni}, \qquad (C.7)$$

$$\tilde{\mu}_i = \frac{\sum_n R_{ni}\left(t_n - \tilde{W}_i\langle x_{ni}\rangle\right)}{\sum_n R_{ni}}, \qquad (C.8)$$

$$\tilde{W}_i = \left[\sum_n R_{ni}(t_n - \tilde{\mu}_i)\langle x_{ni}\rangle^T\right]\left[\sum_n R_{ni}\langle x_{ni} x_{ni}^T\rangle\right]^{-1}, \qquad (C.9)$$

$$\tilde{\sigma}_i^2 = \frac{1}{d\sum_n R_{ni}}\left\{\sum_n R_{ni}\|t_n - \tilde{\mu}_i\|^2 - 2\sum_n R_{ni}\langle x_{ni}\rangle^T\tilde{W}_i^T(t_n - \tilde{\mu}_i) + \sum_n R_{ni}\,\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\tilde{W}_i^T\tilde{W}_i\right)\right\}, \qquad (C.10)$$

where the tilde denotes "new" quantities that may be adjusted in the M-step. Note that the M-step equations for $\tilde{\mu}_i$ and $\tilde{W}_i$, given by equations C.8 and C.9, are coupled, and so further (albeit straightforward) manipulation is required to obtain explicit solutions.

In fact, simplification of the M-step equations, along with improved speed of convergence, is possible if we adopt a two-stage EM procedure as follows. The likelihood function we wish to maximize is given by

$$\mathcal{L} = \sum_{n=1}^{N} \ln\left\{\sum_{i=1}^{M} \pi_i\, p(t_n|i)\right\}. \qquad (C.11)$$

Regarding the component labels $z_{ni}$ as missing data, and ignoring the presence of the latent $x$ variables for now, we can consider the corresponding expected complete-data log-likelihood given by

$$\hat{\mathcal{L}}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \ln\{\pi_i\, p(t_n|i)\}, \qquad (C.12)$$

where $R_{ni}$ represent the posterior probabilities (corresponding to the expected values of $z_{ni}$) and are given by equation 4.2. Maximization of equation C.12 with respect to $\pi_i$, again using a Lagrange multiplier, gives the M-step equation 4.4. Similarly, maximization of equation C.12 with respect to $\mu_i$ gives equation 4.5. This is the first stage of the combined EM procedure.
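A sketch of this first stage is given below, under the assumption that equations 4.2, 4.4, and 4.5 take the usual mixture forms (responsibilities proportional to π_i p(t_n|i), and responsibility-weighted averages for π_i and μ_i); the function name and the log-density input are illustrative.

```python
import numpy as np

def first_stage_update(T, log_p_t_given_i, pi):
    """First EM stage: responsibilities R_ni and updates of pi_i and mu_i.

    log_p_t_given_i is an (N, M) array of ln p(t_n | i) evaluated with the
    current parameters; pi is the (M,) vector of mixing proportions.
    """
    log_weighted = np.log(pi) + log_p_t_given_i
    log_norm = np.logaddexp.reduce(log_weighted, axis=1, keepdims=True)
    R = np.exp(log_weighted - log_norm)              # responsibilities R_ni
    pi_new = R.mean(axis=0)                          # pi_i <- (1/N) sum_n R_ni
    mu_new = (R.T @ T) / R.sum(axis=0)[:, None]      # mu_i <- sum_n R_ni t_n / sum_n R_ni
    return R, pi_new, mu_new
```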

In order to update $W_i$ and $\sigma_i^2$, we seek only to increase the value of $\hat{\mathcal{L}}_C$, and not actually to maximize it. This corresponds to the generalized EM (or GEM) algorithm. We do this by considering $\hat{\mathcal{L}}_C$ as our likelihood of interest and, introducing the missing $x_{ni}$ variables, performing one cycle of the EM algorithm, now with respect to the parameters $W_i$ and $\sigma_i^2$. This second stage is guaranteed to increase $\hat{\mathcal{L}}_C$, and therefore $\mathcal{L}$, as desired.

The advantages of this approach are twofold. First, the new values $\tilde{\mu}_i$ calculated in the first stage are used to compute the sufficient statistics of the posterior distribution of $x_{ni}$ in the second stage, using equations C.3 and C.4. Using updated values of $\mu_i$ in computing these statistics leads to improved convergence speed.

A second advantage is that for the second stage of the EM algorithm, there is a considerable simplification of the M-step updates, since when equation C.5 is expanded for $\langle x_{ni}\rangle$ and $\langle x_{ni} x_{ni}^T\rangle$, only terms in $\tilde{\mu}_i$ (and not $\mu_i$) appear. By inspection of equation C.5, we see that the expected complete-data log-likelihood now takes the form

$$\langle \mathcal{L}_C \rangle = \sum_{n=1}^{N} \sum_{i=1}^{M} R_{ni} \left\{ \ln \tilde{\pi}_i - \frac{d}{2}\ln\sigma_i^2 - \frac{1}{2}\mathrm{tr}\left(\langle x_{ni} x_{ni}^T\rangle\right) - \frac{1}{2\sigma_i^2}\|t_n - \tilde{\mu}_i\|^2 + \frac{1}{\sigma_i^2}\langle x_{ni}\rangle^T W_i^T(t_n - \tilde{\mu}_i) - \frac{1}{2\sigma_i^2}\mathrm{tr}\left(W_i^T W_i \langle x_{ni} x_{ni}^T\rangle\right) \right\}. \qquad (C.13)$$

Now when we maximize equation C.13 with respect to $W_i$ and $\sigma_i^2$ (keeping $\tilde{\mu}_i$ fixed), we obtain the much-simplified M-step equations:

$$\tilde{W}_i = S_i W_i\left(\sigma_i^2 I + M_i^{-1} W_i^T S_i W_i\right)^{-1}, \qquad (C.14)$$

$$\tilde{\sigma}_i^2 = \frac{1}{d}\,\mathrm{tr}\left(S_i - S_i W_i M_i^{-1}\tilde{W}_i^T\right), \qquad (C.15)$$

where

$$S_i = \frac{1}{\tilde{\pi}_i N}\sum_{n=1}^{N} R_{ni}(t_n - \tilde{\mu}_i)(t_n - \tilde{\mu}_i)^T. \qquad (C.16)$$

Iteration of equations 4.3 through 4.5 followed by equations C.14 and C.15 in sequence is guaranteed to find a local maximum of the likelihood (see equation 4.1).
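A direct sketch of one second-stage update for a single component follows; the names are illustrative, mu_i_new and pi_i_new are the first-stage quantities, and S_i is formed explicitly here for clarity (see the discussion below on avoiding this when q is much smaller than d).

```python
import numpy as np

def second_stage_update(T, R_i, W_i, sigma2_i, mu_i_new, pi_i_new):
    """Second EM stage for component i (equations C.14-C.16)."""
    N, d = T.shape
    q = W_i.shape[1]
    Tc = T - mu_i_new
    # Local responsibility-weighted covariance matrix S_i (C.16).
    S_i = (Tc * R_i[:, None]).T @ Tc / (pi_i_new * N)
    M_i = sigma2_i * np.eye(q) + W_i.T @ W_i
    M_inv = np.linalg.inv(M_i)
    SW = S_i @ W_i
    W_new = SW @ np.linalg.inv(sigma2_i * np.eye(q) + M_inv @ W_i.T @ SW)   # (C.14)
    sigma2_new = np.trace(S_i - SW @ M_inv @ W_new.T) / d                   # (C.15)
    return W_new, sigma2_new
```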

Comparison of equations C.14 and C.15 with equations A.26 and A.27 shows that the updates for the mixture case are identical to those of the single PPCA model, given that the local responsibility-weighted covariance matrix $S_i$ is substituted for the global covariance matrix $S$. Thus, at stationary points, each weight matrix $W_i$ contains the (scaled and rotated) eigenvectors of its respective $S_i$, the local covariance matrix. Each submodel is then performing a local PCA, where each data point is weighted by the responsibility of that submodel for its generation, and a soft partitioning, similar to that introduced by Hinton et al. (1997), is automatically effected.

Given the established results for the single PPCA model, there is no need to use the iterative updates (see equations C.14 and C.15), since $W_i$ and $\sigma_i^2$ may be determined by eigendecomposition of $S_i$, and the likelihood must still increase unless at a maximum. However, as discussed in appendix A.5, the iterative EM scheme may offer computational advantages, particularly for $q \ll d$. In such a case, the iterative approach of equations C.14 and C.15 can be used, taking care to evaluate $S_i W_i$ efficiently as $\sum_n R_{ni}(t_n - \tilde{\mu}_i)\left\{(t_n - \tilde{\mu}_i)^T W_i\right\}$. In the mixture case, unlike for the single model, $S_i$ must be recomputed at each iteration of the EM algorithm, as the responsibilities $R_{ni}$ will change.
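The efficient evaluation of S_i W_i mentioned above can be written as in the sketch below (illustrative names; the 1/(π̃_i N) normalization of equation C.16 is included), costing O(N d q) rather than the O(N d²) needed to form S_i explicitly.

```python
import numpy as np

def responsibility_weighted_SW(T, R_i, mu_i, W_i, pi_i):
    """Compute S_i W_i without forming the d x d matrix S_i explicitly."""
    N = T.shape[0]
    Tc = T - mu_i                         # (N, d) centered data
    proj = Tc @ W_i                       # (t_n - mu_i)^T W_i for each n, shape (N, q)
    return (Tc * R_i[:, None]).T @ proj / (pi_i * N)
```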

As a final computational note, it might appear that the necessary calculation of $p(t|i)$ would require inversion of the $d \times d$ matrix $C$, an $O(d^3)$ operation. However, $(\sigma^2 I + WW^T)^{-1} = \{I - W(\sigma^2 I + W^T W)^{-1}W^T\}/\sigma^2$, and so $C^{-1}$ may be computed using the already calculated $q \times q$ matrix $M^{-1}$.
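A sketch of the resulting density evaluation is shown below; it uses the identity above for $C^{-1}$ together with the determinant identity $|C| = (\sigma^2)^{d-q}|M|$, which is not stated in the text but follows from Sylvester's determinant theorem. The names are illustrative.

```python
import numpy as np

def log_density_ppca(T, W, mu, sigma2):
    """Evaluate ln p(t) for one PPCA component without inverting the d x d matrix C."""
    N, d = T.shape
    q = W.shape[1]
    Tc = T - mu
    M = sigma2 * np.eye(q) + W.T @ W
    M_inv = np.linalg.inv(M)
    proj = Tc @ W                                            # (N, q)
    # (t - mu)^T C^{-1} (t - mu) via C^{-1} = {I - W M^{-1} W^T} / sigma^2
    quad = (np.sum(Tc**2, axis=1) - np.sum((proj @ M_inv) * proj, axis=1)) / sigma2
    log_det_C = (d - q) * np.log(sigma2) + np.linalg.slogdet(M)[1]
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det_C + quad)
```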

Acknowledgments

This work was supported by EPSRC contract GR/K51808: Neural Networks for Visualization of High Dimensional Data, at Aston University. We thank Michael Revow for supplying the handwritten digit data in its processed form.

References

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.

Anderson, T. W., & Rubin, H. (1956). Statistical inference in factor analysis. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5, pp. 111–150). Berkeley: University of California, Berkeley.

Bartholomew, D. J. (1987). Latent variable models and factor analysis. London: Charles Griffin & Co. Ltd.

Basilevsky, A. (1994). Statistical factor analysis and related methods. New York: Wiley.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Bishop, C. M., Svensén, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.

Bishop, C. M., & Tipping, M. E. (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281–293.

Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using manifold learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 973–980). Cambridge, MA: MIT Press.

Broomhead, D. S., Indik, R., Newell, A. C., & Rand, D. A. (1991). Local adaptive Galerkin bases for large-dimensional dynamical systems. Nonlinearity, 4(1), 159–197.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1–38.

Dony, R. D., & Haykin, S. (1995). Optimally adaptive transform coding. IEEE Transactions on Image Processing, 4(10), 1358–1370.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502–516.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modelling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1), 65–74.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16, 550–554.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of the Fourteenth International Conference on Artificial Intelligence (pp. 518–523).

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Kambhatla, N. (1995). Local models and gaussian mixture models for statistical data processing. Unpublished doctoral dissertation, Oregon Graduate Institute, Center for Spoken Language Understanding.

Kambhatla, N., & Leen, T. K. (1997). Dimension reduction by local principal component analysis. Neural Computation, 9(7), 1493–1516.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243.

Krzanowski, W. J., & Marriott, F. H. C. (1994). Multivariate analysis part 2: Classification, covariance structures and repeated measurements. London: Edward Arnold.

Lawley, D. N. (1953). A modified method of estimation in factor analysis and some large sample results. In Uppsala Symposium on Psychological Factor Analysis. Nordisk Psykologi Monograph Series (pp. 35–42). Uppsala: Almqvist and Wiksell.
