(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Overview

• Principal Component Analysis

• Factor Analysis

• EM Algorithm

• Probabilistic Principal Component Analysis

• Mixture of Factor Analyzers

• Mixture of Probabilistic Principal Component Analyzers

• Isometric Mapping

• Locally Linear Embedding

• Global coordination of local representation

(3)

Reading

• Embedding: Isomap [4], LLE [2]

• Global coordination of local generative models: Global coordination [1], Alignment of local representation [3].

References

[1] S. Roweis, L. K. Saul, and G. E. Hinton. Global coordination of local linear models. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 889–896. MIT Press, 2002.

[2] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2000.

[3] Y. W. Teh and S. Roweis. Automatic alignment of local representations. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 841–848. MIT Press, 2003.

[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2000.

(4)

Principal Component Analysis

• Curse of dimensionality

• Possibly the most popular dimensionality reduction algorithm

• Widely used in computer vision and other applications

• Two perspectives:

- Minimize reconstruction error (e.g., vision): Karhunen-Loeve transform

- Decorrelation (e.g., signal processing): Hotelling transform

• Unsupervised learning

• Linear transform

• Second order statistics

(5)

Motivating Example [Black 04]

(6)
(7)

Principal Component Analysis

• Given a set of N data points x_n, map each point x in a d-dimensional space, x = [x_1, \ldots, x_d]^T, onto a vector z in an M-dimensional space, where M < d.

• Use a linear combination

x = \sum_{i=1}^{d} z_i u_i \quad (1)

where the vectors u_i satisfy the orthonormality relation

u_i^T u_j = \delta_{ij} \quad (2)

in which \delta_{ij} is the Kronecker delta. Thus,

z_i = u_i^T x \quad (3)

• Retaining only a subset M < d of the basis vectors u_i, x can be approximated by

\tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i \quad (4)

(8)

where the b_i are constants.

• Dimensionality reduction: x has d degrees of freedom and z has M degrees of freedom.

• For each x_n, the error introduced by the dimensionality reduction is

x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_{in} - b_i) u_i \quad (5)

and we want to find the basis vectors u_i, the coefficients b_i, and the values z_{in} with minimum error.

• For the whole data set, using the orthonormality relation,

E_M = \frac{1}{2} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_{in} - b_i)^2 \quad (6)

(9)

• Taking the derivative of E_M with respect to b_i and setting it to zero,

b_i = \frac{1}{N} \sum_{n=1}^{N} z_{in} = u_i^T \bar{x} \quad (7)

where

\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n \quad (8)

• Plugging back into the sum-of-squares error E_M,

E_M = \frac{1}{2} \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x_n - \bar{x}) \right)^2 = \frac{N}{2} \sum_{i=M+1}^{d} u_i^T C u_i \quad (9)

where C is the covariance matrix

C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \quad (10)

(10)

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (11)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

(11)

Derivation

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (12)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

• Need some constraints to solve this optimization.

• Impose orthonormality constraints among the u_i.

• Use Lagrange multipliers \mu_{ij}:

\hat{E}_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T C u_i - \frac{1}{2} \sum_{i=M+1}^{d} \sum_{j=M+1}^{d} \mu_{ij} \left( u_i^T u_j - \delta_{ij} \right) \quad (13)

• In matrix form,

\hat{E}_M = \frac{1}{2} \mathrm{tr}\{ U^T C U \} - \frac{1}{2} \mathrm{tr}\{ M (U^T U - I) \} \quad (14)

(12)

where M is a matrix with elements \mu_{ij}, U is a matrix whose columns consist of the u_i, and I is the unit matrix.

• Minimizing \hat{E}_M with respect to U,

(C + C^T) U - U (M + M^T) = 0 \quad (15)

• Note that C is symmetric, M is symmetric, and U^T U is symmetric as it is the unit matrix I. Thus,

C U = U M \quad (16)

U^T C U = M \quad (17)

• The eigenvector equation for M is

M \Psi = \Psi \Lambda \quad (18)

where \Lambda is a diagonal matrix of eigenvalues and \Psi is the matrix of eigenvectors.

(13)

• M is symmetric and \Psi can be chosen to have orthonormal columns, i.e., \Psi^T \Psi = I, so

\Lambda = \Psi^T M \Psi \quad (19)

• Putting these together,

\Lambda = \Psi^T U^T C U \Psi = (U \Psi)^T C (U \Psi) = \tilde{U}^T C \tilde{U} \quad (20)

where \tilde{U} = U \Psi, and

U = \tilde{U} \Psi^T \quad (21)

• Thus a solution for U^T C U = M can be obtained from the particular solution \tilde{U} by application of an orthogonal transformation given by \Psi.

(14)

Getting Principal Components from Data

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (22)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

• Consequently, the error E_M is

E_M = \frac{N}{2} \sum_{i=M+1}^{d} \lambda_i \quad (23)

In other words, the minimum error is reached by discarding the eigenvectors corresponding to the d − M smallest eigenvalues.

• Each eigenvector u_i is called a principal component.

• Retain the eigenvectors corresponding to the largest eigenvalues (see the sketch below).
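As a concrete illustration of the procedure above, here is a minimal NumPy sketch (the function and variable names are my own, not from the slides) that centers the data, forms the covariance matrix C, and retains the eigenvectors with the largest eigenvalues:

```python
import numpy as np

def pca(X, M):
    """Minimal PCA sketch. X is N x d (one data point per row); M is the
    number of principal components to retain."""
    x_bar = X.mean(axis=0)                    # sample mean, Eq. (8)
    Xc = X - x_bar                            # centered data
    C = Xc.T @ Xc / X.shape[0]                # covariance matrix, Eq. (10)
    eigvals, eigvecs = np.linalg.eigh(C)      # C u_i = lambda_i u_i, Eq. (22)
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues, largest first
    U = eigvecs[:, order[:M]]                 # d x M matrix of principal components
    Z = Xc @ U                                # coefficients u_i^T (x_n - x_bar) = z_in - b_i
    return U, Z, eigvals[order], x_bar

# Reconstruct and check the discarded-eigenvalue error of Eq. (23)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
U, Z, lam, x_bar = pca(X, M=2)
X_rec = Z @ U.T + x_bar                       # approximation of Eq. (4) with b_i = u_i^T x_bar
err = 0.5 * np.sum((X - X_rec) ** 2)
print(err, 0.5 * X.shape[0] * lam[2:].sum())  # the two values agree
```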

(15)

• Projecting x_n onto these eigenvectors gives the components of the transformed vector z_n in the M-dimensional space.

• In a two-dimensional example, each data point is transformed to a single variable z_1 representing the projection of the data point onto the eigenvector u_1.

• Infer the structure (or reduce redundancy) inherent in high-dimensional data.

• Parsimonious representation.

• Linear dimensionality reduction algorithm based on a sum-of-squares error criterion.

(16)

• Other criteria: covariance measure and population entropy

(17)

Intrinsic Dimensionality

• A data set in d dimensions has intrinsic dimensionality equal to d_0 if the data lies entirely within a d_0-dimensional space.

• What is the intrinsic dimensionality of data?

• The intrinsic dimensionality may increase due to noise.

• PCA, as a linear approximation, has its limitation.

(18)

• How to determine the number of eigenvectors?

• Empirically determined based on reconstruction error (i.e., energy).

• Will come back to this issue later on (Isomap, LLE).
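One common heuristic for the two bullets above is to keep just enough components to retain a fixed fraction of the total eigenvalue energy. A minimal sketch (the 90% threshold is an illustrative choice of mine, not a recommendation from the slides):

```python
import numpy as np

def choose_num_components(eigvals, energy=0.90):
    """Smallest M whose leading eigenvalues capture `energy` of the total
    variance, i.e., bound the reconstruction error of Eq. (23)."""
    lam = np.sort(eigvals)[::-1]
    cumulative = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(cumulative, energy) + 1)

# Eigenvalues [5.0, 3.0, 1.0, 0.5, 0.5]: M = 2 keeps 80% of the energy, M = 3 keeps 90%
print(choose_num_components(np.array([5.0, 3.0, 1.0, 0.5, 0.5])))   # -> 3
```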

(19)

Singular Value Decomposition (SVD)

• Since

x = \sum_{i=1}^{d} z_i u_i, \qquad u_i^T u_j = \delta_{ij} \quad (24)

• Subtract the mean from each column:

A = [(x_1 - \bar{x}) \; \ldots \; (x_N - \bar{x})] \quad (25)

The covariance matrix is

C = \frac{1}{N} A A^T \quad (26)

• Singular value decomposition allows us to write A as

A = U \Sigma V^T = [u_1 \; \ldots \; u_N] \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_N \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_N^T \end{bmatrix} \quad (27)

(20)

C = \frac{1}{N} A A^T = \frac{1}{N} U \Sigma V^T (U \Sigma V^T)^T = \frac{1}{N} U \Sigma V^T V \Sigma U^T = \frac{1}{N} U \Sigma^2 U^T \quad (28)

• Therefore,

C u_i = \frac{\sigma_i^2}{N} u_i \quad (29)

• So the columns of U are the eigenvectors of C, and the eigenvalues are just \lambda_i = \sigma_i^2 / N.
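The same principal directions can therefore be obtained directly from the SVD of the centered data matrix, without forming the d x d covariance matrix explicitly. A small sketch under the notation above (the function name is mine):

```python
import numpy as np

def pca_svd(X, M):
    """PCA via SVD of the centered data matrix A (Eqs. 25-29).
    X is N x d; returns the top-M principal directions and eigenvalues."""
    N = X.shape[0]
    x_bar = X.mean(axis=0)
    A = (X - x_bar).T                         # d x N, columns are x_n - x_bar, Eq. (25)
    U, s, _ = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T, Eq. (27)
    eigvals = s ** 2 / N                      # lambda_i = sigma_i^2 / N, Eq. (29)
    return U[:, :M], eigvals[:M], x_bar
```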

(21)

Eigenface [Turk and Pentland 1991]

• Collect a set of face images.

• Normalize for contrast, scale and orientation.

• Apply PCA to compute the first M eigenvectors (dubbed Eigenfaces) that best account for the data variance (i.e., facial structure).

• Compute the distance between the projected points for face recognition or detection.
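A hedged sketch of this pipeline, assuming the normalized face images have already been flattened into the rows of a hypothetical array `faces` with one label per row; nearest-neighbour matching in eigenface space stands in for the recognition step:

```python
import numpy as np

def train_eigenfaces(faces, M):
    """faces: N x d matrix of flattened, normalized face images."""
    mean_face = faces.mean(axis=0)
    A = (faces - mean_face).T                        # d x N centered data matrix
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    eigenfaces = U[:, :M]                            # first M eigenvectors
    coeffs = (faces - mean_face) @ eigenfaces        # projections of training faces
    return mean_face, eigenfaces, coeffs

def recognize(query, mean_face, eigenfaces, coeffs, labels):
    """Project a query face and return the label of the nearest training face."""
    z = (query - mean_face) @ eigenfaces
    dists = np.linalg.norm(coeffs - z, axis=1)       # distances in eigenface space
    return labels[int(np.argmin(dists))]
```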

(22)

Appearance Manifolds [Murase and Nayar 95]

• The image variation of an object under different pose or illumination is assumed to lie on a manifold.

• For each object, collect images under different pose

• Construct a universal eigenspace from all the images

• For the set of images of the same object, find the smoothly varying manifold in eigenspace, i.e., the parametric eigenspace.

(23)

• The manifolds of two objects may intersect; the intersection corresponds to poses of the two objects for which their images are very similar in appearance.

(24)

Further Study

• Incremental update of PCA

• Independent component analysis (ICA)

• Discrete PCA

• Two-dimensional PCA

• Illumination cone [Belhumeur and Kriegman 97]

(25)

Factor Analysis (FA)

• A generative dimensionality reduction algorithm

• A d-dimensional data vector x is modeled using an M-dimensional vector z, dubbed the factors, where M is smaller than d:

x = \Lambda z + \varepsilon \quad (30)

where

- \Lambda is the factor loading matrix.

- z is assumed to be N(0, I) distributed (zero-mean, unit-variance normals).

- The factors z model the correlations between the elements of x.

- \varepsilon is a random variable that accounts for noise and is assumed to be N(0, \Psi) distributed, where \Psi is a diagonal matrix.

- \varepsilon accounts for independent noise in each element of x.

- x is N(0, \Lambda \Lambda^T + \Psi) distributed.

- The diagonality of \Psi is a key assumption: it constrains the error covariance \Psi for estimation.

- The observed variables x_i are conditionally independent given the factors z.
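To make the generative model concrete, here is a small sampling sketch (the dimensions and names are illustrative choices of mine); the empirical covariance of the samples approaches \Lambda \Lambda^T + \Psi:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, N = 5, 2, 100_000

Lam = rng.normal(size=(d, M))                   # factor loading matrix Lambda
psi = rng.uniform(0.1, 0.5, size=d)             # diagonal of Psi

z = rng.normal(size=(N, M))                     # factors z ~ N(0, I)
eps = rng.normal(size=(N, d)) * np.sqrt(psi)    # noise eps ~ N(0, Psi), Psi diagonal
X = z @ Lam.T + eps                             # x = Lambda z + eps, Eq. (30)

# Sample covariance approaches Lambda Lambda^T + Psi as N grows
print(np.abs(np.cov(X.T, bias=True) - (Lam @ Lam.T + np.diag(psi))).max())
```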

(26)

Properties of FA

Factor analysis: x = Λz + ε

• Latent variables z: explain correlations between x

• εi represents variability unique to a particular xi

• Fundamentally differs from PCA which treats covariance and variance identically.

• Want to infer Λ and Ψ from x

• Suppose \Lambda and \Psi are known; then, by linear projection,

E[z|x] = \beta x \quad (31)

(27)

where \beta = \Lambda^T (\Psi + \Lambda \Lambda^T)^{-1}, since the joint Gaussian of the data x and the factors z is

p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda \Lambda^T + \Psi & \Lambda \\ \Lambda^T & I \end{bmatrix} \right) \quad (32)

• Note that since \Psi is diagonal,

(\Psi + \Lambda \Lambda^T)^{-1} = \Psi^{-1} - \Psi^{-1} \Lambda (I + \Lambda^T \Psi^{-1} \Lambda)^{-1} \Lambda^T \Psi^{-1} \quad (33)

where I is an identity matrix.

• The second moment of the factors is

E[z z^T | x] = \mathrm{Var}(z|x) + E[z|x] E[z|x]^T = I - \beta \Lambda + \beta x x^T \beta^T \quad (34)

• The expectations of the first and second moments provide a measure of uncertainty in the factors, which PCA does not have.
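A sketch of these posterior computations for a single data point, given \Lambda and the diagonal of \Psi, using the inversion identity of Eq. (33) (the function name is mine):

```python
import numpy as np

def factor_posterior(x, Lam, psi_diag):
    """Return E[z|x] and E[zz^T|x] for the factor analysis model (Eqs. 31-34)."""
    M = Lam.shape[1]
    Psi_inv = np.diag(1.0 / psi_diag)
    # (Psi + Lam Lam^T)^{-1} via the identity in Eq. (33)
    inner = np.linalg.inv(np.eye(M) + Lam.T @ Psi_inv @ Lam)
    Sigma_inv = Psi_inv - Psi_inv @ Lam @ inner @ Lam.T @ Psi_inv
    beta = Lam.T @ Sigma_inv                         # beta = Lam^T (Psi + Lam Lam^T)^{-1}
    Ez = beta @ x                                    # E[z|x], Eq. (31)
    Ezz = np.eye(M) - beta @ Lam + np.outer(Ez, Ez)  # E[zz^T|x], Eq. (34)
    return Ez, Ezz
```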

(28)

EM Algorithm for Factor Analysis

• Expectation-Maximization: useful technique in dealing with missing data

• Start with some initial guess of missing data and evaluate them.

• Optimize the missing parameters by taking the derivative of the likelihood of the observed and missing data w.r.t. the parameters.

• Repeat until the data likelihood does not change

• E-step: Given \Lambda and \Psi, for each data point x_i compute

E[z|x_i] = \beta x_i

E[z z^T | x_i] = \mathrm{Var}(z|x_i) + E[z|x_i] E[z|x_i]^T = I - \beta \Lambda + \beta x_i x_i^T \beta^T \quad (35)

• M-step:

\Lambda^{new} = \left( \sum_{i=1}^{N} x_i E[z|x_i]^T \right) \left( \sum_{i=1}^{N} E[z z^T | x_i] \right)^{-1}

\Psi^{new} = \frac{1}{N} \mathrm{diag}\left\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{new} E[z|x_i] x_i^T \right\} \quad (36)

(29)

where the diag operator sets all off-diagonal elements to zero.
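Putting these steps together, a compact EM sketch for factor analysis (it assumes zero-mean data; the initialization and fixed iteration count are my own choices):

```python
import numpy as np

def fa_em(X, M, n_iter=100, seed=0):
    """EM for factor analysis on zero-mean data X (N x d); returns Lambda and diag(Psi)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    Lam = rng.normal(scale=0.1, size=(d, M))
    psi = X.var(axis=0)                                    # diagonal of Psi
    for _ in range(n_iter):
        # E-step: beta = Lam^T (Psi + Lam Lam^T)^{-1}, then the moments of Eq. (35)
        beta = Lam.T @ np.linalg.inv(np.diag(psi) + Lam @ Lam.T)
        Ez = X @ beta.T                                    # row i is E[z|x_i]^T
        sum_Ezz = N * (np.eye(M) - beta @ Lam) + Ez.T @ Ez # sum_i E[zz^T|x_i]
        # M-step: Eq. (36)
        Lam = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
        psi = np.diag(X.T @ X - Lam @ (Ez.T @ X)) / N
    return Lam, psi
```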

(30)

FA and PCA

• Given a set of data points, would Λ correspond to principal subspace?

• No in most cases

• But if FA has an isotropic error model, i.e., \psi_i = \sigma^2, then \Lambda corresponds to the eigenvectors.

(31)

Probabilistic Principal Component Analysis (PPCA)

• Based on factor analysis, x = Λz + ε, with isotropic noise model N (0, σ2I)

• The z conditional probability distribution over x is given by

x|z ∼ N (Λz, σ2I) (37)

• Since x ∼ N (0, I), marginal distribution for x is

x ∼ N (0, ˜C) (38)

where C = ΛΛT + σ2I.

• Log likelihood of data is L = −N

2 {d ln(2π) + ln |C| + tr( ˜C−1S)} (39)

(32)

where

S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T \quad (40)

• Estimates of \Lambda and \sigma^2 can be obtained by maximizing L using an EM algorithm similar to that for factor analysis.
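A direct sketch of the log-likelihood evaluation of Eqs. (38)–(40), which an EM (or gradient-based) routine would maximize over \Lambda and \sigma^2 (the function name is mine):

```python
import numpy as np

def ppca_log_likelihood(X, Lam, sigma2):
    """Log-likelihood L of zero-mean data X (N x d) under the PPCA model,
    with C_tilde = Lam Lam^T + sigma^2 I and S the sample covariance."""
    N, d = X.shape
    C_tilde = Lam @ Lam.T + sigma2 * np.eye(d)      # Eq. (38)
    S = X.T @ X / N                                 # Eq. (40)
    _, logdet = np.linalg.slogdet(C_tilde)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C_tilde, S)))   # Eq. (39)
```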
