Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Overview
• EM Algorithm
• Mixture of Factor Analyzers
• Mixture of Probabilistic Principal Component Analyzers
• Isometric Mapping
• Locally Linear Embedding
• Linear regression
• Logistic regression
• Linear classifier
• Fisher linear discriminant
Announcements
• More course material available on the course web page
• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap
• Reading (due Oct 25):
- Fisher linear discriminant: Fisherface vs. Eigenface [1]
- Support vector machine: [3] or [2]
References
[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[2] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.
[3] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.
Mixture of Gaussians
p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1   (1)
where πk is the mixing parameter describing the contribution of the k-th Gaussian component in explaining x.
• Given data X = {x1, . . . , xN}, we want to determine the model parameters θ = {πk, µk, Σk}.
- X are observable
- The contribution of each data point xi to the j-th Gaussian component, γj(xi), is a hidden variable that can be derived from X and θ.
- θ are unknown
• If we know θ, we can compute γj(xi).
• If we know γj(xi), we can compute θ.
• Chicken and egg problem.
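To make Eq. (1) concrete, here is a minimal sketch (not from the course code) that evaluates the mixture density for a toy one-dimensional, two-component model; the parameter values are made up for illustration.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters of a 2-component, 1-D mixture (illustrative values only)
pi = np.array([0.3, 0.7])       # mixing parameters, sum to 1
mu = np.array([-2.0, 1.0])      # component means
sigma = np.array([0.5, 1.5])    # component standard deviations

def mixture_density(x):
    # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), as in Eq. (1)
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(mixture_density(0.0))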
EM algorithm
• Expectation Maximization
• First take an initial guess of the model parameters and compute the expectation of the hidden variables
• Iterative procedure
• Start with some initial guess and refine it
• Very useful technique
• Variational learning
• Generalized EM algorithm
EM Algorithm for Mixture of Gaussians
• Log likelihood function
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \Big\}   (2)
• No closed-form solution (the sum over components appears inside the logarithm)
• E (Expectation) step: Given all the current model parameters, compute the expectation of the hidden variables
• M (Maximization) step: Optimize the log likelihood with respect to model parameters
EM Algorithm for Mixture of Gaussians: M Step
\ln L = \sum_{i=1}^{N} \ln\Big( \sum_{k=1}^{K} \pi_k N_{ki} \Big)   (3)
where
N_{ki} = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)   (4)
Take the derivative of ln L with respect to µj:
\frac{\partial \ln L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}} \cdot \frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = 0   (5)
Note
\frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = \Sigma_j^{-1} (x_i - \mu_j)   (6)
Let \gamma_j(x_i) = \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}}, i.e., the normalized probability of xi being generated from the j-th Gaussian component. Then
\sum_{i=1}^{N} \gamma_j(x_i)\, \Sigma_j^{-1} (x_i - \mu_j) = 0   (7)
Thus,
\mu_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i)\, x_i}{\sum_{i=1}^{N} \gamma_j(x_i)}   (8)
Likewise, take the partial derivatives with respect to πj and Σj:
\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_j(x_i)   (9)
\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i)(x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} \gamma_j(x_i)}   (10)
Note that γj(xi) plays a weighting role.
EM Algorithm for Mixture of Gaussians: E Step
• Compute the expected value of hidden variable γj
\gamma_j(x_i) = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}   (11)
• Interpret the mixing coefficients as prior probabilities
p(x_i) = \sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j) = \sum_{j=1}^{K} p(j)\, p(x_i \mid j)   (12)
• Thus, γj(xi) corresponds to posterior probabilities (responsibilities)
p(j \mid x_i) = \frac{p(j)\, p(x_i \mid j)}{p(x_i)} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)} = \gamma_j(x_i)   (13)
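Putting the two steps together, a compact NumPy sketch of the EM loop for a mixture of Gaussians, following Eqs. (8)-(11); the initialization, the small regularizer added to the covariances, and the fixed iteration count are my own choices rather than anything prescribed in the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=100):
    """EM for a mixture of Gaussians; X is an N x d data matrix."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]             # K x d, random points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E step: responsibilities gamma_j(x_i), Eq. (11)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])                                                    # N x K
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: Eqs. (8)-(10)
        Nk = gamma.sum(axis=0)                                # effective number of points per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma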
EM for Factor Analysis
• Factor analysis: x = Λz + ε
• Log likelihood: L = \log \prod_i (2\pi)^{-d/2} |\Psi|^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \Lambda z)^T \Psi^{-1} (x_i - \Lambda z)\}
• Hidden variable: z, model parameters: Λ, Ψ.
• E-step:
E[z \mid x] = \beta x, \quad \text{where } \beta \equiv \Lambda^T (\Psi + \Lambda\Lambda^T)^{-1}
E[zz^T \mid x] = \mathrm{Var}(z \mid x) + E[z \mid x]\, E[z \mid x]^T = I - \beta\Lambda + \beta x x^T \beta^T   (14)
• M-step:
\Lambda^{\text{new}} = \Big( \sum_{i=1}^{N} x_i E[z \mid x_i]^T \Big) \Big( \sum_{i=1}^{N} E[zz^T \mid x_i] \Big)^{-1}, \qquad \Psi^{\text{new}} = \frac{1}{N}\, \mathrm{diag}\Big\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{\text{new}} E[z \mid x_i]\, x_i^T \Big\}   (15)
where the diag operator sets all off-diagonal elements to zero.
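A minimal NumPy sketch of this EM loop for a single factor analyzer, following Eqs. (14)-(15); the random initialization, the mean-centering of the data, and the iteration count are illustrative assumptions on my part.

import numpy as np

def em_factor_analysis(X, p, n_iter=100):
    """EM for x = Lambda z + eps with p latent factors; X is an N x d data matrix."""
    N, d = X.shape
    X = X - X.mean(axis=0)                       # the model assumes zero-mean data
    rng = np.random.default_rng(0)
    Lam = rng.standard_normal((d, p))            # factor loading matrix Lambda
    Psi = np.diag(X.var(axis=0))                 # diagonal noise covariance

    for _ in range(n_iter):
        # E step, Eq. (14): beta = Lambda^T (Psi + Lambda Lambda^T)^{-1}
        beta = Lam.T @ np.linalg.inv(Psi + Lam @ Lam.T)          # p x d
        Ez = X @ beta.T                                          # rows are E[z | x_i]
        # sum_i E[z z^T | x_i] = N (I - beta Lambda) + beta (sum_i x_i x_i^T) beta^T
        Ezz_sum = N * (np.eye(p) - beta @ Lam) + beta @ (X.T @ X) @ beta.T

        # M step, Eq. (15)
        Lam = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(np.diag(X.T @ X - Lam @ (Ez.T @ X))) / N   # diag keeps only the diagonal
    return Lam, Psi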
Mixture of Factor Analyzers (MFA)
• Assume that we have K factor analyzers indexed by ωk, k = 1, . . . , K. ωk = 1 when the data point was generated by the k-th factor analyzer.
• The generative mixture model:
p(x) = \sum_{k=1}^{K} \int p(x \mid z, \omega_k)\, p(z \mid \omega_k)\, p(\omega_k)\, dz   (16)
where
p(z|ωk) = p(z) = N (0, I) (17)
• Allow each factor analyzer to model the data covariance structure in a different part of the input space
p(x|z, ωk) = N (µk + Λkz, Ψ) (18)
EM for Mixture of Factor Analyzers
• For the E step, we need to compute the expectations of all the hidden variables:
E[\omega_k \mid x_i] \propto p(x_i, \omega_k) = p(\omega_k)\, p(x_i \mid \omega_k) = \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Lambda_k\Lambda_k^T + \Psi)
E[\omega_k z \mid x_i] = E[\omega_k \mid x_i]\, E[z \mid \omega_k, x_i]
E[\omega_k zz^T \mid x_i] = E[\omega_k \mid x_i]\, E[zz^T \mid \omega_k, x_i]   (19)
• The model parameters are {(µk, Λk, πk)Kk=1, Ψ}.
• For the M step, take derivative of log likelihood with respect to model parameters for new µk, Λk, πk, and Ψ.
• Read “The EM Algorithm for Mixtures of Factor Analyzers,” by Ghahramani and Hinton for details.
• Also read Ghahramani’s lecture notes.
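As a small illustration of the first line of Eq. (19), the responsibilities E[ωk|xi] can be computed from the marginal Gaussians N(µk, ΛkΛkT + Ψ); the function below is a sketch with hypothetical parameter arrays, not the authors' code.

import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(X, pi, mu, Lam, Psi):
    """E[omega_k | x_i] for a mixture of factor analyzers.
    X: N x d data, pi: K mixing weights, mu: K x d means,
    Lam: K x d x p loading matrices, Psi: d x d shared diagonal noise covariance."""
    K = len(pi)
    # Under analyzer k the marginal of x is N(mu_k, Lam_k Lam_k^T + Psi)
    joint = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Lam[k] @ Lam[k].T + Psi)
        for k in range(K)
    ])
    return joint / joint.sum(axis=1, keepdims=True)      # N x K responsibilities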
EM for Mixture of Probabilistic PCA
• Based on factor analyzers
• Read “Mixtures of Probabilistic Principal Component Analyzers,” by Tipping and Bishop.
MFA: Applications
• Modeling the manifolds of images of hand written digits with mixture of factor analyzers [Hinton et al. 97].
• Modeling multimodal density of faces for recognition and detection [Frey et al. 98] [Yang et al. 00].
• Analyze layers of appearance and motion [Frey and Jojic 99]
• Mixture of factor analyzers concurrently performs clustering and dimensionality reduction.
Nonlinear Principal Component Analysis (NLPCA)
• Aim to better model a nonlinear manifold
• Based on a multi-layer (5-layer) perceptron
• The layer in the middle represents the feature space of the NLPCA transform.
• Two additional layers are used for nonlinearity.
• Auto-encoder, auto-associator, bottleneck network.
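One way to realize such a bottleneck network is to train a 5-layer MLP to reproduce its own input; the sketch below uses scikit-learn's MLPRegressor purely for illustration, and the layer sizes and data are arbitrary assumptions of mine.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical data: 500 points in 10-D assumed to lie near a low-dimensional nonlinear manifold.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))

# Input (10) -> 20 -> 2 (bottleneck / NLPCA feature layer) -> 20 -> output (10).
autoencoder = MLPRegressor(hidden_layer_sizes=(20, 2, 20),
                           activation="tanh", max_iter=2000)
autoencoder.fit(X, X)     # train the network to reconstruct its input

Reading out the 2-D bottleneck activations (the NLPCA features) would require a manual forward pass through the fitted coefs_ and intercepts_.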
Recap
• Linear dimensionality reduction:
- Assume data is generated from a subspace
- Determine the subspace with PCA or FA (i.e., the subspace is spanned by the principal components)
• Nonlinear dimensionality reduction:
- Model data with a mixture of locally linear subspaces
- Use a mixture of PCA or a mixture of FA
• Mixture methods have local coordinate systems
• Need to find transformation between coordinate systems
Isometric Mapping (Isomap) [Tenenbaum et al. 00]
• Preserving pairwise distance structure
• Approximate geodesic distance
• Nonlinear dimensionality reduction
• Use a global coordinate system
• Aim to find intrinsic dimensionality
Multidimensional Scaling (MDS)
• Analyze pairwise similarities of entities to gain insight into the underlying structure
• Based on a matrix of pairwise similarities
• Metric or non-metric
• Useful for data visualization
• Can be used for dimensionality reduction
• Preserve the pairwise similarity measure
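A minimal sketch of classical (metric) MDS from an N x N matrix of pairwise distances, using the standard double-centering construction; this is illustrative NumPy code, not part of the course material.

import numpy as np

def classical_mds(D, m):
    """Embed N points in R^m from an N x N matrix of pairwise distances D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]          # top m eigenvalues/eigenvectors
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))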
Isomap: Algorithm
• Isomap:
- Construct neighborhood graph:
Define a graph G of all data points by connecting points i and j if they are neighbors
- Compute the shortest paths:
For each pair of points i and j, compute the length of their shortest path in G, and collect these distances into the matrix DG.
- Construct M-dimensional embedding:
Apply classical MDS to DG to construct an M-dimensional Euclidean space Y. The coordinates yi are obtained by minimizing
E = \| \tau(D_G) - \tau(D_Y) \|_{L^2}   (20)
where τ converts distances into inner products that uniquely characterize the geometry of the data.
• The global minimum of (20) is obtained by setting yi to the top M eigenvectors of τ (DG).
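The three steps can be sketched with SciPy's graph utilities together with the classical-MDS construction; the k-nearest-neighbor rule, the value of k, and the assumption that the graph is connected are illustrative choices, not prescriptions from the slides.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, m, k=10):
    """Isomap embedding of X (N x d) into R^m using a k-NN graph (assumed connected)."""
    N = X.shape[0]
    # 1. Neighborhood graph G: edge weights are Euclidean distances to the k nearest neighbors.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # 2. Geodesic distances D_G: shortest paths on the graph.
    D_G = shortest_path(G, method="D", directed=False)
    # 3. Classical MDS on D_G; tau(D) = -0.5 * J D^2 J converts distances to inner products.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D_G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))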
Isomap: Applications
[Figure panels: "Intrinsic low dimensional ..." and "Interpolation using low ..."]
Isomap: Applications
• Object recognition: memory-based recognition
• Object tracking: trajectory along inferred nonlinear manifold
• Video synthesis: interpolate along trajectory on nonlinear manifold
“Representation analysis and synthesis of lip images using dimensionality reduction,”
Aharon and Kimmel, IJCV 2005.
Isomap: Applications
• States of a moving object move smoothly along a low dimensional manifold.
• Discover the underlying manifold using Isomap
• Learn the mapping between the input data and the corresponding points on the low-dimensional manifold using a mixture of factor analyzers.
• Learn a dynamical model based on the points on the low dimensional manifold.
• Use particle filter for tracking.
Locally Linear Embedding (LLE) [Roweis et al. 00]
• Capture local geometry by linear reconstruction
• Map high dimensional data to global internal coordinates
Locally Linear Embedding: Algorithm
• LLE
- For each point, determine its neighbors
- Reconstruct each point with linear weights on its neighbors: solve for the weights W by minimizing
E(W) = \sum_i \Big| x_i - \sum_j w_{ij} x_j \Big|^2   (21)
- Map to embedded coordinates:
Fix W and project x ∈ R^d to y ∈ R^M (M < d) by minimizing
\phi(Y) = \sum_i \Big| y_i - \sum_j w_{ij} y_j \Big|^2   (22)
• The embedding cost (22) defines an unconstrained optimization problem.
Adding a normalization constraint turns it into an eigenvalue problem.
• The bottom M nonzero eigenvectors provide an ordered set of orthogonal embedding coordinates.
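A compact NumPy sketch of the two stages above, following Eqs. (21)-(22); the regularization of the local Gram matrix is a standard practical detail I have added, not something stated in the slides.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, m, k=10, reg=1e-3):
    """Locally linear embedding of X (N x d) into R^m using k neighbors."""
    N = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]    # drop the point itself

    # Stage 1: reconstruction weights W minimizing Eq. (21), each row summing to 1.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                     # neighbors centered on x_i
        C = Z @ Z.T                              # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)       # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()

    # Stage 2: embedding minimizing Eq. (22); bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:m + 1]                   # discard the bottom (constant) eigenvector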
LLE: Applications
Learn the embedding of facial expression images for synthesis
Learn the embedding of lip images for synthesis
Isomap and LLE
• Embedding rather than mapping function
• No probabilistic interpretation
• No generative model
• Do not take temporal information into consideration
• Unsupervised learning
Further Study
• Kernel PCA.
• Principal Curve.
• Laplacian Eigenmap.
• Hessian Isomap.
• Spectral clustering.
• Unified view of spectral embedding and clustering.
• Global coordination of local generative models:
- Global coordination of local linear representation.
- Automatic alignment of local representation.
Big Picture
“A unifying review of linear Gaussian models,” Roweis and Ghahramani 99
• Deterministic/probabilistic
• Static/dynamic
• Linear/nonlinear
• Mixture, hierarchical