Advanced Topics in Learning and Vision


Academic year: 2022


Ming-Hsuan Yang




• Principal Component Analysis

• Factor Analysis

• EM Algorithm

• Probabilistic Principal Component Analysis

• Mixture of Factor Analyzers

• Mixture of Probabilistic Principal Component Analyzers

• Isometric Mapping

• Local Linear Embedding

• Global coordination of local representation



• Embedding: Isomap [4], LLE [2]

• Global coordination of local generative models: Global coordination [1], Alignment of local representation [3].


Principal Component Analysis

• Curse of dimensionality

• Possibly the most popular dimensionality reduction algorithm

• Widely used in computer vision and other applications

• Two perspectives:

- Minimize reconstruction error (e.g., vision): Karhunen-Loève transform

- Decorrelation (e.g., signal processing): Hotelling transform

• Unsupervised learning

• Linear transform

• Second order statistics


Motivating Example [Black 04]


Principal Component Analysis

• Given a set of N data points $x_n$, map each $x = [x_1, \ldots, x_d]^T$ in a d-dimensional space onto a vector z in an M-dimensional space, where M < d.

• Use a linear combination of basis vectors

$$ x = \sum_{i=1}^{d} z_i u_i \tag{1} $$

where the vectors $u_i$ satisfy the orthonormality relation

$$ u_i^T u_j = \delta_{ij} \tag{2} $$

in which $\delta_{ij}$ is the Kronecker delta. Thus,

$$ z_i = u_i^T x \tag{3} $$

• If only a subset of M < d basis vectors $u_i$ is retained, x can be approximated by

$$ \tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i \tag{4} $$

where the $b_i$ are constants.

• Dimensionality reduction: x has d degrees of freedom and z has M degrees of freedom.

• For each $x_n$, the error introduced by the dimensionality reduction is

$$ x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_{in} - b_i)\, u_i \tag{5} $$

and we want to find the basis vectors $u_i$, the coefficients $b_i$, and the values $z_i$ that minimize this error.

• For the whole data set, using the orthonormality relation,

$$ E_M = \frac{1}{2}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{i=M+1}^{d} (z_{in} - b_i)^2 \tag{6} $$
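Equations (1) through (6) can be checked numerically for a single data point. The following is a minimal sketch, assuming NumPy, a random orthonormal basis obtained from a QR factorization, and arbitrary constants `b`; none of these choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 5, 2

# Orthonormal basis: the columns of U satisfy u_i^T u_j = delta_ij, as in (2).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
z = U.T @ x                      # z_i = u_i^T x, equation (3)
x_full = U @ z                   # full expansion (1) recovers x

# Truncated approximation (4): keep z_1..z_M, replace the rest by constants b_i.
b = rng.standard_normal(d)       # arbitrary constants; only b[M:] is used
x_tilde = U[:, :M] @ z[:M] + U[:, M:] @ b[M:]

# Error vector (5) and its squared norm, i.e. one summand of (6) without the 1/2.
residual = x - x_tilde
squared_error = residual @ residual
predicted = np.sum((z[M:] - b[M:]) ** 2)

print(np.allclose(x_full, x), np.isclose(squared_error, predicted))
```

Because the basis is orthonormal, the squared length of the residual depends only on the discarded coefficients, which is what lets (6) drop the basis vectors.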


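• Take the derivative of $E_M$ with respect to $b_i$ and set it to zero:

$$ b_i = \frac{1}{N}\sum_{n=1}^{N} z_{in} = u_i^T \bar{x} \tag{7} $$

where $\bar{x}$ is the mean of the data,

$$ \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n \tag{8} $$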
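The step from (6) to (7) can be spelled out as follows. This is a sketch of the omitted algebra and uses only the symbols already defined above.

```latex
\begin{align*}
\frac{\partial E_M}{\partial b_i}
  &= -\sum_{n=1}^{N} (z_{in} - b_i) = 0 \\
\Rightarrow\quad b_i
  &= \frac{1}{N}\sum_{n=1}^{N} z_{in}
   = \frac{1}{N}\sum_{n=1}^{N} u_i^T x_n
   = u_i^T \bar{x}
\end{align*}
```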

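• Plugging this back into the sum-of-squares error $E_M$,

$$ E_M = \frac{1}{2}\sum_{i=M+1}^{d}\sum_{n=1}^{N} \left(u_i^T (x_n - \bar{x})\right)^2 = \frac{N}{2}\sum_{i=M+1}^{d} u_i^T C u_i \tag{9} $$

where C is the covariance matrix

$$ C = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \tag{10} $$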
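Identity (9) holds for any orthonormal basis, not only the optimal one. The sketch below checks it on synthetic data; the data, the random basis, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 200, 4, 2

# Rows of X are the data points x_n (correlated synthetic data).
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Covariance with the 1/N normalization of (10).
C = Xc.T @ Xc / N

# Any orthonormal basis works for this identity; it need not be optimal.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
U_disc = U[:, M:]                 # discarded directions u_{M+1}, ..., u_d

# Left-hand side: direct sum of squared residuals with b_i = u_i^T x_bar, as in (7).
Z = X @ U                         # Z[n, i] = u_i^T x_n
b = U.T @ x_bar
X_tilde = Z[:, :M] @ U[:, :M].T + b[M:] @ U_disc.T
E_direct = 0.5 * np.sum((X - X_tilde) ** 2)

# Right-hand side of (9): (N/2) * sum over discarded i of u_i^T C u_i.
E_formula = 0.5 * N * np.trace(U_disc.T @ C @ U_disc)

print(np.isclose(E_direct, E_formula))
```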


• Minimizing $E_M$ with respect to $u_i$, we get

$$ C u_i = \lambda_i u_i \tag{11} $$

i.e., the basis vectors $u_i$ are the eigenvectors of the covariance matrix C.




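• Constraints are needed to solve this optimization.

• Impose orthonormality constraints among the $u_i$.

• Use Lagrange multipliers $\mu_{ij}$ (the constant factor N in $E_M$ does not affect the minimizer and is dropped):

$$ \hat{E}_M = \frac{1}{2}\sum_{i=M+1}^{d} u_i^T C u_i - \frac{1}{2}\sum_{i=M+1}^{d}\sum_{j=M+1}^{d} \mu_{ij}\,(u_i^T u_j - \delta_{ij}) \tag{13} $$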

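• In matrix form,

$$ \hat{E}_M = \frac{1}{2}\,\mathrm{tr}\{U^T C U\} - \frac{1}{2}\,\mathrm{tr}\{\mathbf{M}\,(U^T U - I)\} \tag{14} $$

where $\mathbf{M}$ is the matrix with elements $\mu_{ij}$ (not to be confused with the dimension M), U is the matrix whose columns are the $u_i$, and I is the identity matrix.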

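• Minimizing $\hat{E}_M$ with respect to U,

$$ (C + C^T)\,U - U\,(\mathbf{M} + \mathbf{M}^T) = 0 \tag{15} $$

• C is symmetric, and $\mathbf{M}$ can be taken to be symmetric because $U^T U$ is symmetric (it equals the identity matrix I). Hence

$$ C U = U \mathbf{M} \tag{16} $$

and, multiplying on the left by $U^T$,

$$ U^T C U = \mathbf{M} \tag{17} $$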

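• The eigenvector equation for $\mathbf{M}$ is

$$ \mathbf{M}\Psi = \Psi\Lambda \tag{18} $$

where $\Lambda$ is a diagonal matrix of eigenvalues and $\Psi$ is the matrix of eigenvectors.

• $\mathbf{M}$ is symmetric, so $\Psi$ can be chosen to have orthonormal columns, i.e., $\Psi^T\Psi = I$, and

$$ \Lambda = \Psi^T \mathbf{M} \Psi \tag{19} $$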

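• Putting (17) and (19) together,

$$ \Lambda = \Psi^T U^T C U \Psi = (U\Psi)^T C\,(U\Psi) = \tilde{U}^T C \tilde{U} \tag{20} $$

where $\tilde{U} = U\Psi$, and

$$ U = \tilde{U}\Psi^T \tag{21} $$

• Thus a solution of $U^T C U = \mathbf{M}$ can be obtained from the particular solution $\tilde{U}$, whose columns are eigenvectors of C, by applying the orthogonal transformation given by $\Psi$.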
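The argument in (16) through (21) can be checked numerically: rotating a set of eigenvectors by any orthogonal matrix still satisfies the stationarity condition and leaves the error term unchanged. This is a sketch on a random symmetric matrix; `k` (standing in for d − M), `M_mat`, and the random rotation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3   # k plays the role of d - M, the number of discarded directions

B = rng.standard_normal((d, d))
C = B @ B.T / d                        # symmetric positive semi-definite matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
U_tilde = eigvecs[:, :k]               # particular solution: eigenvectors of C
Lam = np.diag(eigvals[:k])

# Arbitrary orthogonal matrix Psi, from a QR factorization.
Psi, _ = np.linalg.qr(rng.standard_normal((k, k)))

U = U_tilde @ Psi.T                    # equation (21)
M_mat = Psi @ Lam @ Psi.T              # symmetric, generally not diagonal

# C U = U M, equation (16), holds for the rotated solution as well.
stationarity_gap = np.abs(C @ U - U @ M_mat).max()

# The error term tr(U^T C U) is the same as the sum of the chosen eigenvalues.
trace_gap = abs(np.trace(U.T @ C @ U) - eigvals[:k].sum())

print(stationarity_gap, trace_gap)
```

Both gaps should be at the level of floating-point round-off, which is why the eigenvector solution can be used without loss of generality.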


Getting Principal Components from Data

• Minimizing $E_M$ with respect to $u_i$, we get

$$ C u_i = \lambda_i u_i \tag{22} $$

i.e., the basis vectors $u_i$ are the eigenvectors of the covariance matrix C.

• Consequently, the minimum value of $E_M$ is

$$ E_M = \frac{N}{2}\sum_{i=M+1}^{d} \lambda_i \tag{23} $$

where the factor N comes from the 1/N normalization of C in (10). In other words, the minimum error is reached by discarding the eigenvectors corresponding to the d − M smallest eigenvalues.

• Each eigenvector $u_i$ is called a principal component.

• Retain the eigenvectors corresponding to the M largest eigenvalues.

• Projecting $x_n$ onto these eigenvectors gives the components of the transformed vector $z_n$ in the M-dimensional space.

• Example with d = 2 and M = 1: each two-dimensional data point is transformed to a single variable $z_1$, the projection of the data point onto the eigenvector $u_1$.

• Infers the structure (or reduces the redundancy) inherent in high-dimensional data.

• Parsimonious representation

• Linear dimensionality reduction algorithm based on the sum-of-squares error criterion

• Other criteria: covariance measure and population entropy
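The recipe in this section (eigendecompose C, keep the leading M eigenvectors, project) can be sketched as below. The helper name `pca_eig` and the synthetic data are assumptions for illustration. With C normalized by 1/N as in (10), the summed squared error works out to N/2 times the sum of the discarded eigenvalues; with an unnormalized scatter matrix the factor would be 1/2.

```python
import numpy as np

def pca_eig(X, M):
    """Return the mean, the top-M eigenvectors (as columns), and all
    eigenvalues in descending order. Rows of X are data points."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    C = Xc.T @ Xc / X.shape[0]             # covariance with 1/N normalization
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending order
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return x_bar, eigvecs[:, :M], eigvals

rng = np.random.default_rng(1)
N, d, M = 300, 6, 2
X = rng.standard_normal((N, d)) @ rng.standard_normal((d, d))

x_bar, U_M, eigvals = pca_eig(X, M)

# Components z_n in the M-dimensional space (projection of centered data).
Z = (X - x_bar) @ U_M

# Reconstruction: retained components plus the mean along discarded directions.
X_tilde = x_bar + Z @ U_M.T

E_M = 0.5 * np.sum((X - X_tilde) ** 2)
E_pred = 0.5 * N * eigvals[M:].sum()

print(np.isclose(E_M, E_pred))
```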


Intrinsic Dimensionality

• A data set in d dimensions has intrinsic dimensionality equal to d′ if the data lies entirely within a d′-dimensional subspace.

• What is the intrinsic dimensionality of data?

• The intrinsic dimensionality may increase due to noise.

• PCA, as a linear approximation, has its limitations.


• How to determine the number of eigenvectors?

• Empirically determined based on reconstruction error (i.e., energy).

• Will come back to this issue later on (Isomap, LLE).
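One common empirical rule, sketched here as an assumption rather than as the method used in the lecture, is to keep the smallest number of eigenvectors whose eigenvalues account for a chosen fraction of the total energy. The function name and the default threshold are hypothetical.

```python
import numpy as np

def num_components_for_energy(eigvals, energy=0.95):
    """Smallest M such that the M largest eigenvalues hold at least
    the given fraction of the total eigenvalue sum."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # First index whose cumulative fraction reaches the threshold.
    M = int(np.searchsorted(cumulative, energy)) + 1
    return min(M, eigvals.size)

example_eigvals = [4.0, 3.0, 2.0, 1.0]
print(num_components_for_energy(example_eigvals, energy=0.8))
```

The discarded eigenvalues are proportional to the reconstruction error, so fixing the retained energy is equivalent to fixing a relative error budget.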


Singular Value Decomposition (SVD)

• Recall that

$$ x = \sum_{i=1}^{d} z_i u_i, \qquad u_i^T u_j = \delta_{ij} \tag{24} $$

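• Subtract the mean from each data point and stack the results as columns:

$$ A = [\,(x_1 - \bar{x}) \;\cdots\; (x_N - \bar{x})\,] \tag{25} $$

• The covariance matrix is then

$$ C = \frac{1}{N} A A^T \tag{26} $$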

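• Singular value decomposition allows us to write A as (shown here for d ≥ N)

$$ A = U \Sigma V^T = \begin{bmatrix} u_1 & \cdots & u_N \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_N \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_N^T \end{bmatrix} \tag{27} $$

where the $\sigma_i$ are the singular values of A.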


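• Substituting into the covariance matrix and using $V^T V = I$,

$$ C = \frac{1}{N} A A^T = \frac{1}{N}\, U\Sigma V^T (U\Sigma V^T)^T = \frac{1}{N}\, U \Sigma^2 U^T \tag{28} $$

• Therefore,

$$ C u_i = \frac{\sigma_i^2}{N}\, u_i \tag{29} $$

• So the columns of U are the eigenvectors of C, and the eigenvalues are $\lambda_i = \sigma_i^2 / N$.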
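The relation between the singular values of A and the eigenvalues of C can be verified directly. This is a small sketch with synthetic data, assuming NumPy's thin SVD (`full_matrices=False`); data points are stored as columns to match (25).

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 5, 50
X = rng.standard_normal((d, N))            # columns are data points, as in (25)

A = X - X.mean(axis=1, keepdims=True)      # subtract the mean from each column
C = A @ A.T / N                            # covariance with 1/N normalization

# Thin SVD: U is d x min(d, N), s holds singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
eig_from_svd = s ** 2 / N

# Eigenvalues computed directly from C, reordered to descending.
eig_direct = np.linalg.eigvalsh(C)[::-1]

print(np.allclose(eig_from_svd, eig_direct))
print(np.allclose(C @ U, U * eig_from_svd))   # C u_i = (sigma_i^2 / N) u_i
```

Working from the SVD of A avoids forming C explicitly, which matters when d is large (for example, when each x is an image).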


Eigenface [Turk and Pentland 1991]

• Collect a set of face images.

• Normalize for contrast, scale and orientation.

• Apply PCA to compute the first M eigenvectors (dubbed Eigenfaces) that best account for the data variance (i.e., facial structure).

• Compute the distance between the projected points for face recognition or detection
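The steps above can be sketched as a nearest-neighbor classifier in the eigenface subspace. Everything here is an illustrative assumption: the random vectors stand in for normalized face images, the function names are hypothetical, and images are stored as rows, so the principal directions are the rows of `Vt`.

```python
import numpy as np

def fit_eigenfaces(faces, M):
    """faces: (N, d) array, one flattened image per row."""
    mean_face = faces.mean(axis=0)
    A = faces - mean_face
    # Rows of Vt are the principal directions in pixel space.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return mean_face, Vt[:M].T                 # shapes (d,), (d, M)

def project(faces, mean_face, eigenfaces):
    """Coordinates of each image in the eigenface subspace."""
    return (faces - mean_face) @ eigenfaces

def nearest_identity(query, codes, labels, mean_face, eigenfaces):
    """Label of the gallery image closest to the query after projection."""
    code = project(query[None, :], mean_face, eigenfaces)
    dists = np.linalg.norm(codes - code, axis=1)
    return labels[int(np.argmin(dists))]

# Synthetic stand-in for a face gallery: 4 "people", 5 noisy images each.
rng = np.random.default_rng(3)
n_people, per_person, d = 4, 5, 64
prototypes = 3.0 * rng.standard_normal((n_people, d))
labels = np.repeat(np.arange(n_people), per_person)
gallery = prototypes[labels] + 0.1 * rng.standard_normal((labels.size, d))

mean_face, eigenfaces = fit_eigenfaces(gallery, M=3)
codes = project(gallery, mean_face, eigenfaces)

query = prototypes[2] + 0.1 * rng.standard_normal(d)
predicted = nearest_identity(query, codes, labels, mean_face, eigenfaces)
print(predicted)
```

For detection rather than recognition, the distance of an image from the subspace (its reconstruction error) would be used instead of the distance between projected points.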


Appearance Manifolds [Murase and Nayar 95]

• The image variation of an object under different poses is assumed to lie on a manifold.

• For each object, collect images under different pose

• Construct a universal eigenspace from all the images

• For the set of images of the same object, find the smoothly varying manifold in eigenspace, i.e., the parametric eigenspace.

• The manifolds of two objects may intersect; the intersection corresponds to poses of the two objects for which their images are very similar.



Further Study

• Incremental update of PCA

• Independent component analysis (ICA)

• Discrete PCA

• Two-dimensional PCA

• Illumination cone [Belhumeur and Kriegman 97]


Factor Analysis (FA)

• A generative dimensionality reduction algorithm

• A d-dimensional data vector x is modeled using an M-dimensional vector z, dubbed factors, where M is smaller than d.

x = Λz + ε (30)


- Λ is the d × M factor loading matrix.

- z is assumed to be N(0, I) distributed (zero-mean, unit-variance normals).

- The factors z model the correlations between the elements of x.

- ε is a random variable that accounts for noise and is assumed to be N(0, Ψ) distributed, where Ψ is a diagonal matrix.

- ε accounts for independent noise in each element of x.

- x is $N(0, \Lambda\Lambda^T + \Psi)$ distributed.

- The diagonality of Ψ is a key assumption: it constrains the error covariance so that the model can be estimated.

- The observed variables $x_i$ are conditionally independent given the factors z.
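The generative model (30) can be simulated to confirm that the marginal covariance of x is ΛΛᵀ + Ψ. This is a sketch with arbitrary parameter values chosen for illustration; the sample covariance only approximates the model covariance, up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(4)
d, M, N = 4, 2, 200_000

Lam = rng.standard_normal((d, M))          # factor loading matrix (arbitrary)
psi = rng.uniform(0.1, 0.5, size=d)        # diagonal of Psi (arbitrary)

Z = rng.standard_normal((N, M))            # factors z ~ N(0, I), one per row
eps = rng.standard_normal((N, d)) * np.sqrt(psi)   # noise ~ N(0, Psi)
X = Z @ Lam.T + eps                        # x = Lam z + eps, one sample per row

model_cov = Lam @ Lam.T + np.diag(psi)
sample_cov = X.T @ X / N                   # the model has zero mean
max_abs_gap = np.abs(sample_cov - model_cov).max()

print(max_abs_gap)
```

All off-diagonal structure in the covariance comes from ΛΛᵀ, which is the sense in which the factors explain the correlations between the elements of x.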


Properties of FA

Factor analysis: x = Λz + ε

• Latent variables z: explain correlations between x

• εi represents variability unique to a particular xi

• Fundamentally differs from PCA which treats covariance and variance identically.

• Want to infer Λ and Ψ from x

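• Suppose Λ and Ψ are known. Then, by linear projection,

$$ E[z \mid x] = \beta x \tag{31} $$

where $\beta = \Lambda^T(\Psi + \Lambda\Lambda^T)^{-1}$, since the data x and the factors z are jointly Gaussian:

$$ p\!\left(\begin{bmatrix} x \\ z \end{bmatrix}\right) = N\!\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix},\; \begin{bmatrix} \Lambda\Lambda^T + \Psi & \Lambda \\ \Lambda^T & I \end{bmatrix}\right) \tag{32} $$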

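• Since Ψ is diagonal, the inverse can be computed efficiently with the matrix inversion lemma,

$$ (\Psi + \Lambda\Lambda^T)^{-1} = \Psi^{-1} - \Psi^{-1}\Lambda\,(I + \Lambda^T\Psi^{-1}\Lambda)^{-1}\Lambda^T\Psi^{-1} \tag{33} $$

where I is the M × M identity matrix, so only an M × M matrix has to be inverted.

• The second moment of the factors:

$$ E[zz^T \mid x] = \mathrm{Var}(z \mid x) + E[z \mid x]\,E[z \mid x]^T = I - \beta\Lambda + \beta x x^T \beta^T \tag{34} $$

• The first and second moments provide a measure of uncertainty in the factors, which PCA does not have.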
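The posterior moments (31) and (34) and the inversion identity (33) can be computed in a few lines. This is a sketch with arbitrary Λ, Ψ, and x; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, M = 6, 2
Lam = rng.standard_normal((d, M))
psi = rng.uniform(0.2, 1.0, size=d)
Psi_inv = np.diag(1.0 / psi)               # trivial because Psi is diagonal

# Direct inverse of the d x d matrix Psi + Lam Lam^T.
direct_inv = np.linalg.inv(np.diag(psi) + Lam @ Lam.T)

# Matrix inversion lemma (33): only an M x M matrix is inverted.
small = np.linalg.inv(np.eye(M) + Lam.T @ Psi_inv @ Lam)
woodbury_inv = Psi_inv - Psi_inv @ Lam @ small @ Lam.T @ Psi_inv

beta = Lam.T @ woodbury_inv                # M x d projection matrix

x = rng.standard_normal(d)
Ez = beta @ x                              # E[z | x], equation (31)
var_z = np.eye(M) - beta @ Lam             # Var(z | x), independent of x
Ezz = var_z + np.outer(Ez, Ez)             # E[z z^T | x], equation (34)

print(np.allclose(direct_inv, woodbury_inv))
```

The posterior covariance `var_z` coincides with the small M × M inverse used in the lemma, so it is obtained at no extra cost.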


EM Algorithm for Factor Analysis

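• Expectation-Maximization (EM): a useful technique for dealing with missing data (here, the unobserved factors z).

• Start with an initial guess of the parameters and evaluate the expected values of the missing data.

• Optimize the parameters by setting to zero the derivative of the expected log likelihood of the observed and missing data with respect to the parameters.

• Repeat until the data likelihood no longer changes.

• E-step: Given Λ and Ψ, for each data point $x_i$, compute

$$ E[z \mid x_i] = \beta x_i, \qquad E[zz^T \mid x_i] = \mathrm{Var}(z \mid x_i) + E[z \mid x_i]\,E[z \mid x_i]^T = I - \beta\Lambda + \beta x_i x_i^T \beta^T \tag{35} $$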


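• M-step:

$$ \Lambda^{\text{new}} = \left(\sum_{i=1}^{N} x_i\, E[z \mid x_i]^T\right)\left(\sum_{i=1}^{N} E[zz^T \mid x_i]\right)^{-1} $$

$$ \Psi^{\text{new}} = \frac{1}{N}\,\mathrm{diag}\left\{\sum_{i=1}^{N} x_i x_i^T - \Lambda^{\text{new}} E[z \mid x_i]\, x_i^T\right\} \tag{36} $$

where the diag operator sets all off-diagonal elements to zero.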
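The E-step and M-step above translate almost line by line into code. The following is a minimal sketch, assuming zero-mean data stored as rows, a random initialization, and a fixed number of iterations instead of a convergence test; the function name `fa_em` and its defaults are hypothetical.

```python
import numpy as np

def fa_em(X, M, n_iter=500, seed=0):
    """EM for factor analysis. X: (N, d) zero-mean data, rows are x_i.
    Returns Lam (d, M) and psi (d,), the diagonal of Psi."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((d, M))      # arbitrary initial guess
    psi = np.ones(d)
    XtX = X.T @ X                          # sum_i x_i x_i^T
    for _ in range(n_iter):
        # E-step: beta = Lam^T (Psi + Lam Lam^T)^{-1}
        beta = Lam.T @ np.linalg.inv(np.diag(psi) + Lam @ Lam.T)
        Ez = X @ beta.T                    # row i is E[z | x_i]
        # sum_i E[z z^T | x_i] = N (I - beta Lam) + beta XtX beta^T
        sum_Ezz = N * (np.eye(M) - beta @ Lam) + Ez.T @ Ez
        # M-step
        sum_xEz = X.T @ Ez                 # sum_i x_i E[z | x_i]^T
        Lam = sum_xEz @ np.linalg.inv(sum_Ezz)
        psi = np.diag(XtX - Lam @ sum_xEz.T) / N
    return Lam, psi

# Synthetic data drawn from a known factor model (illustrative values).
rng = np.random.default_rng(7)
N, d, M = 5000, 6, 2
Lam_true = rng.standard_normal((d, M))
psi_true = rng.uniform(0.1, 0.5, size=d)
X = rng.standard_normal((N, M)) @ Lam_true.T \
    + rng.standard_normal((N, d)) * np.sqrt(psi_true)
X = X - X.mean(axis=0)

Lam_hat, psi_hat = fa_em(X, M)
fit_cov = Lam_hat @ Lam_hat.T + np.diag(psi_hat)
sample_cov = X.T @ X / N
rel_gap = np.linalg.norm(fit_cov - sample_cov) / np.linalg.norm(sample_cov)
print(rel_gap)
```

The fitted Λ is only identified up to an orthogonal rotation of the factors, so the comparison is made on the implied covariance rather than on Λ itself.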


FA and PCA

• Given a set of data points, would Λ correspond to principal subspace?

• No in most cases

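• But if FA uses an isotropic noise model, i.e., $\psi_i = \sigma^2$ for all i, then the columns of the maximum-likelihood Λ span the principal subspace (they are the leading eigenvectors of the covariance matrix, up to scaling and rotation).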


Probabilistic Principal Component Analysis (PPCA)

• Based on factor analysis, x = Λz + ε, with an isotropic noise model, $\varepsilon \sim N(0, \sigma^2 I)$.

• The conditional distribution of x given z is

$$ x \mid z \sim N(\Lambda z, \sigma^2 I) \tag{37} $$

• Since $z \sim N(0, I)$, the marginal distribution of x is

$$ x \sim N(0, \tilde{C}) \tag{38} $$

where $\tilde{C} = \Lambda\Lambda^T + \sigma^2 I$.

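• The log likelihood of the data is

$$ L = -\frac{N}{2}\left\{ d\ln(2\pi) + \ln|\tilde{C}| + \mathrm{tr}(\tilde{C}^{-1}S) \right\} \tag{39} $$

where S is the sample covariance of the (zero-mean) data,

$$ S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T \tag{40} $$

• Λ and σ² can be estimated by maximizing L with an EM algorithm similar to the one used for factor analysis.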
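The log likelihood (39) is straightforward to evaluate, which is useful for monitoring EM. The sketch below implements it and compares the value under the generating parameters with the value under a deliberately poor model; the function name, the synthetic data, and the parameter values are assumptions.

```python
import numpy as np

def ppca_log_likelihood(X, Lam, sigma2):
    """Log likelihood (39) for rows of X under x ~ N(0, Lam Lam^T + sigma2 I)."""
    N, d = X.shape
    C_model = Lam @ Lam.T + sigma2 * np.eye(d)
    S = X.T @ X / N                                  # sample covariance about zero
    _, logdet = np.linalg.slogdet(C_model)           # stable log-determinant
    trace_term = np.trace(np.linalg.solve(C_model, S))
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet + trace_term)

# Synthetic data drawn from a PPCA model (illustrative values).
rng = np.random.default_rng(6)
N, d, M = 500, 5, 2
Lam_true = rng.standard_normal((d, M))
sigma2_true = 0.3
X = rng.standard_normal((N, M)) @ Lam_true.T \
    + np.sqrt(sigma2_true) * rng.standard_normal((N, d))

ll_true = ppca_log_likelihood(X, Lam_true, sigma2_true)
ll_wrong = ppca_log_likelihood(X, np.zeros((d, M)), sigma2_true)
print(ll_true > ll_wrong)
```

Inside an EM loop, this value should never decrease from one iteration to the next, which gives a simple correctness check and a stopping criterion.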



