Advanced Topics in Learning and Vision
Ming-Hsuan Yang
mhyang@csie.ntu.edu.tw
Overview
• EM Algorithm
• Mixture of Factor Analyzers
• Mixture of Probabilistic Principal Component Analyzers
• Isometric Mapping
• Locally Linear Embedding
• Linear regression
• Logistic regression
• Linear classifier
• Fisher linear discriminant
Announcements
• More course material available on the course web page
• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap
• Reading (due Oct 25):
- Fisher linear discriminant: Fisherface vs. Eigenface [1]
- Support vector machine: [3] or [2]
References
[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.
[2] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.
[3] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.
Mixture of Gaussians
p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1   (1)
where πk is the mixing parameter describing the contribution of the k-th Gaussian component in explaining x.
• Given data X = {x1, . . . , xN}, we want to determine the model parameters θ = {πk, µk, Σk}.
- X are observable
- The contribution of each data point xi to the j-th Gaussian component, γj(xi), is a hidden variable that can be derived from X and θ.
- θ are unknown
• If we know θ, we can compute γj(xi).
• If we know γj(xi), we can compute θ.
• Chicken and egg problem.
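To make Eq. (1) concrete, here is a minimal sketch (not from the course code) that evaluates the mixture density for a toy one-dimensional, two-component model; the parameter values are made up for illustration.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters of a 2-component, 1-D mixture (illustrative values only)
pi = np.array([0.3, 0.7])       # mixing parameters, sum to 1
mu = np.array([-2.0, 1.0])      # component means
sigma = np.array([0.5, 1.5])    # component standard deviations

def mixture_density(x):
    # p(x) = sum_k pi_k N(x | mu_k, sigma_k^2), as in Eq. (1)
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(mixture_density(0.0))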
EM algorithm
• Expectation Maximization
• First take an initial guess of the model parameters and compute the expectation of the hidden variables
• Iterative procedure
• Start with some initial guess and refine it
• Very useful technique
• Variational learning
• Generalized EM algorithm
EM Algorithm for Mixture of Gaussians
• Log likelihood function
\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \Big\}   (2)
• No closed-form solution (the sum over components appears inside the logarithm)
• E (Expectation) step: Given all the current model parameters, compute the expectation of the hidden variables
• M (Maximization) step: Optimize the log likelihood with respect to model parameters
EM Algorithm for Mixture of Gaussians: M Step
\ln L = \sum_{i=1}^{N} \ln\Big( \sum_{k=1}^{K} \pi_k N_{ki} \Big)   (3)
where
N_{ki} = \mathcal{N}(x_i \mid \mu_k, \Sigma_k)   (4)
Take the derivative of ln L with respect to µj:
\frac{\partial \ln L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}} \cdot \frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = 0   (5)
Note
\frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = \Sigma_j^{-1} (x_i - \mu_j)   (6)
Let \gamma_j(x_i) = \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}}, i.e., the normalized probability of xi being generated from the j-th Gaussian component. Then
\sum_{i=1}^{N} \gamma_j(x_i)\, \Sigma_j^{-1} (x_i - \mu_j) = 0   (7)
Thus,
\mu_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i)\, x_i}{\sum_{i=1}^{N} \gamma_j(x_i)}   (8)
Likewise, take the partial derivatives with respect to πj and Σj:
\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_j(x_i)   (9)
\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i)(x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} \gamma_j(x_i)}   (10)
Note that γj(xi) plays a weighting role.
EM Algorithm for Mixture of Gaussians: E Step
• Compute the expected value of hidden variable γj
\gamma_j(x_i) = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}   (11)
• Interpret the mixing coefficients as prior probabilities
p(x_i) = \sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j) = \sum_{j=1}^{K} p(j)\, p(x_i \mid j)   (12)
• Thus, γj(xi) corresponds to posterior probabilities (responsibilities)
p(j \mid x_i) = \frac{p(j)\, p(x_i \mid j)}{p(x_i)} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)} = \gamma_j(x_i)   (13)
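Putting the two steps together, a compact NumPy sketch of the EM loop for a mixture of Gaussians, following Eqs. (8)-(11); the initialization, the small regularizer added to the covariances, and the fixed iteration count are my own choices rather than anything prescribed in the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=100):
    """EM for a mixture of Gaussians; X is an N x d data matrix."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]             # K x d, random points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E step: responsibilities gamma_j(x_i), Eq. (11)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])                                                    # N x K
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: Eqs. (8)-(10)
        Nk = gamma.sum(axis=0)                                # effective number of points per component
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma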
EM for Factor Analysis
• Factor analysis: x = Λz + ε
• Log likelihood: L = \log \prod_i (2\pi)^{-d/2} |\Psi|^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \Lambda z)^T \Psi^{-1} (x_i - \Lambda z)\}
• Hidden variable: z, model parameters: Λ, Ψ.
• E-step:
E[z \mid x] = \beta x, \quad \text{where } \beta \equiv \Lambda^T (\Psi + \Lambda\Lambda^T)^{-1}
E[zz^T \mid x] = \mathrm{Var}(z \mid x) + E[z \mid x]\, E[z \mid x]^T = I - \beta\Lambda + \beta x x^T \beta^T   (14)
• M-step:
\Lambda^{\text{new}} = \Big( \sum_{i=1}^{N} x_i E[z \mid x_i]^T \Big) \Big( \sum_{i=1}^{N} E[zz^T \mid x_i] \Big)^{-1}, \qquad \Psi^{\text{new}} = \frac{1}{N}\, \mathrm{diag}\Big\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{\text{new}} E[z \mid x_i]\, x_i^T \Big\}   (15)
where the diag operator sets all off-diagonal elements to zero.
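A minimal NumPy sketch of this EM loop for a single factor analyzer, following Eqs. (14)-(15); the random initialization, the mean-centering of the data, and the iteration count are illustrative assumptions on my part.

import numpy as np

def em_factor_analysis(X, p, n_iter=100):
    """EM for x = Lambda z + eps with p latent factors; X is an N x d data matrix."""
    N, d = X.shape
    X = X - X.mean(axis=0)                       # the model assumes zero-mean data
    rng = np.random.default_rng(0)
    Lam = rng.standard_normal((d, p))            # factor loading matrix Lambda
    Psi = np.diag(X.var(axis=0))                 # diagonal noise covariance

    for _ in range(n_iter):
        # E step, Eq. (14): beta = Lambda^T (Psi + Lambda Lambda^T)^{-1}
        beta = Lam.T @ np.linalg.inv(Psi + Lam @ Lam.T)          # p x d
        Ez = X @ beta.T                                          # rows are E[z | x_i]
        # sum_i E[z z^T | x_i] = N (I - beta Lambda) + beta (sum_i x_i x_i^T) beta^T
        Ezz_sum = N * (np.eye(p) - beta @ Lam) + beta @ (X.T @ X) @ beta.T

        # M step, Eq. (15)
        Lam = (X.T @ Ez) @ np.linalg.inv(Ezz_sum)
        Psi = np.diag(np.diag(X.T @ X - Lam @ (Ez.T @ X))) / N   # diag keeps only the diagonal
    return Lam, Psi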
Mixture of Factor Analyzers (MFA)
• Assume that we have K factor analyzers indexed by ωk, k = 1, . . . , K. ωk = 1 when the data point was generated by the k-th factor analyzer.
• The generative mixture model:
p(x) = \sum_{k=1}^{K} \int p(x \mid z, \omega_k)\, p(z \mid \omega_k)\, p(\omega_k)\, dz   (16)
where
p(z|ωk) = p(z) = N (0, I) (17)
• Allow each factor analyzer to model the data covariance structure in a different part of the input space
p(x|z, ωk) = N (µk + Λkz, Ψ) (18)
EM for Mixture of Factor Analyzers
• For the E step, we need to compute the expectations of all the hidden variables:
E[\omega_k \mid x_i] \propto p(x_i, \omega_k) = p(\omega_k)\, p(x_i \mid \omega_k) = \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Lambda_k\Lambda_k^T + \Psi)
E[\omega_k z \mid x_i] = E[\omega_k \mid x_i]\, E[z \mid \omega_k, x_i]
E[\omega_k zz^T \mid x_i] = E[\omega_k \mid x_i]\, E[zz^T \mid \omega_k, x_i]   (19)
• The model parameters are {(µk, Λk, πk)Kk=1, Ψ}.
• For the M step, take derivative of log likelihood with respect to model parameters for new µk, Λk, πk, and Ψ.
• Read “The EM Algorithm for Mixtures of Factor Analyzers,” by Ghahramani and Hinton for details.
• Also read Ghahramani’s lecture notes.
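As a small illustration of the first line of Eq. (19), the responsibilities E[ωk|xi] can be computed from the marginal Gaussians N(µk, ΛkΛkT + Ψ); the function below is a sketch with hypothetical parameter arrays, not the authors' code.

import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(X, pi, mu, Lam, Psi):
    """E[omega_k | x_i] for a mixture of factor analyzers.
    X: N x d data, pi: K mixing weights, mu: K x d means,
    Lam: K x d x p loading matrices, Psi: d x d shared diagonal noise covariance."""
    K = len(pi)
    # Under analyzer k the marginal of x is N(mu_k, Lam_k Lam_k^T + Psi)
    joint = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Lam[k] @ Lam[k].T + Psi)
        for k in range(K)
    ])
    return joint / joint.sum(axis=1, keepdims=True)      # N x K responsibilities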
EM for Mixture of Probabilistic PCA
• Based on factor analyzers
• Read “Mixtures of Probabilistic Principal Component Analyzers,” by Tipping and Bishop.
MFA: Applications
• Modeling the manifolds of images of hand written digits with mixture of factor analyzers [Hinton et al. 97].
• Modeling multimodal density of faces for recognition and detection [Frey et al. 98] [Yang et al. 00].
• Analyze layers of appearance and motion [Frey and Jojic 99]
• Mixture of factor analyzers concurrently performs clustering and dimensionality reduction.
Nonlinear Principal Component Analysis (NLPCA)
• Aim to better model a nonlinear manifold
• Based on a multi-layer (5-layer) perceptron
• The layer in the middle represents the feature space of the NLPCA transform.
• Two additional layers are used for nonlinearity.
• Auto-encoder, auto-associator, bottleneck network.
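One way to realize such a bottleneck network is to train a 5-layer MLP to reproduce its own input; the sketch below uses scikit-learn's MLPRegressor purely for illustration, and the layer sizes and data are arbitrary assumptions of mine.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical data: 500 points in 10-D assumed to lie near a low-dimensional nonlinear manifold.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))

# Input (10) -> 20 -> 2 (bottleneck / NLPCA feature layer) -> 20 -> output (10).
autoencoder = MLPRegressor(hidden_layer_sizes=(20, 2, 20),
                           activation="tanh", max_iter=2000)
autoencoder.fit(X, X)     # train the network to reconstruct its input

Reading out the 2-D bottleneck activations (the NLPCA features) would require a manual forward pass through the fitted coefs_ and intercepts_.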
Recap
• Linear dimensionality reduction:
- Assume data is generated from a subspace
- Determine the subspace with PCA or FA (i.e., the subspace is spanned by the principal components)
• Nonlinear dimensionality reduction:
- Model data with a mixture of locally linear subspaces
- Use a mixture of PCA or a mixture of FA
• Mixture methods have local coordinate systems
• Need to find transformation between coordinate systems
Isometric Mapping (Isomap) [Tenenbaum et al. 00]
• Preserving pairwise distance structure
• Approximate geodesic distance
• Nonlinear dimensionality reduction
• Use a global coordinate system
• Aim to find intrinsic dimensionality
Multidimensional Scaling (MDS)
• Analyze pairwise similarities of entities to gain insight into the underlying structure
• Based on a matrix of pairwise similarities
• Metric or non-metric
• Useful for data visualization
• Can be used for dimensionality reduction
• Preserve the pairwise similarity measure
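A minimal sketch of classical (metric) MDS from an N x N matrix of pairwise distances, using the standard double-centering construction; this is illustrative NumPy code, not part of the course material.

import numpy as np

def classical_mds(D, m):
    """Embed N points in R^m from an N x N matrix of pairwise distances D."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]          # top m eigenvalues/eigenvectors
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))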
Isomap: Algorithm
• Isomap:
- Construct neighborhood graph:
Define a graph G of all data points by connecting points i and j if they are neighbors
- Compute the shortest paths:
For each pair of points i and j, compute the length of their shortest path in G, and collect these distances into the matrix DG.
- Construct M-dimensional embedding:
Apply classical MDS to DG to construct an M-dimensional Euclidean space Y. The coordinates yi are obtained by minimizing
E = \| \tau(D_G) - \tau(D_Y) \|_{L^2}   (20)
where τ converts distances into inner products that uniquely characterize the geometry of the data.
• The global minimum of (20) is obtained by setting yi to the top M eigenvectors of τ (DG).
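The three steps can be sketched with SciPy's graph utilities together with the classical-MDS construction; the k-nearest-neighbor rule, the value of k, and the assumption that the graph is connected are illustrative choices, not prescriptions from the slides.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, m, k=10):
    """Isomap embedding of X (N x d) into R^m using a k-NN graph (assumed connected)."""
    N = X.shape[0]
    # 1. Neighborhood graph G: edge weights are Euclidean distances to the k nearest neighbors.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # 2. Geodesic distances D_G: shortest paths on the graph.
    D_G = shortest_path(G, method="D", directed=False)
    # 3. Classical MDS on D_G; tau(D) = -0.5 * J D^2 J converts distances to inner products.
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (D_G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:m]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))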
Isomap: Applications
[Figure panels: "Intrinsic low dimensional ..." and "Interpolation using low ..."]
Isomap: Applications
• Object recognition: memory-based recognition
• Object tracking: trajectory along inferred nonlinear manifold
• Video synthesis: interpolate along trajectory on nonlinear manifold
“Representation analysis and synthesis of lip images using dimensionality reduction,”
Aharon and Kimmel, IJCV 2005.
Isomap: Applications
• States of a moving object move smoothly along a low dimensional manifold.
• Discover the underlying manifold using Isomap
• Learn the mapping between the input data and the corresponding points on the low-dimensional manifold using a mixture of factor analyzers.
• Learn a dynamical model based on the points on the low dimensional manifold.
• Use particle filter for tracking.
Locally Linear Embedding (LLE) [Roweis et al. 00]
• Capture local geometry by linear reconstruction
• Map high dimensional data to global internal coordinates
Locally Linear Embedding: Algorithm
• LLE
- For each point, determine its neighbors
- Reconstruct each point with linear weights on its neighbors: solve for the weights W by minimizing
E(W) = \sum_i \Big| x_i - \sum_j w_{ij} x_j \Big|^2   (21)
- Map to embedded coordinates:
Fix W and project x ∈ R^d to y ∈ R^M (M < d) by minimizing
\phi(Y) = \sum_i \Big| y_i - \sum_j w_{ij} y_j \Big|^2   (22)
• The embedding cost (22) defines an unconstrained optimization problem.
Adding a normalization constraint turns it into an eigenvalue problem.
• The bottom M nonzero eigenvectors provide an ordered set of orthogonal embedding coordinates.
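A compact NumPy sketch of the two stages above, following Eqs. (21)-(22); the regularization of the local Gram matrix is a standard practical detail I have added, not something stated in the slides.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, m, k=10, reg=1e-3):
    """Locally linear embedding of X (N x d) into R^m using k neighbors."""
    N = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]    # drop the point itself

    # Stage 1: reconstruction weights W minimizing Eq. (21), each row summing to 1.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                     # neighbors centered on x_i
        C = Z @ Z.T                              # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)       # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()

    # Stage 2: embedding minimizing Eq. (22); bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:m + 1]                   # discard the bottom (constant) eigenvector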
LLE: Applications
Learn the embedding of facial expression images for synthesis
Learn the embedding of lip images for synthesis
Isomap and LLE
• Embedding rather than mapping function
• No probabilistic interpretation
• No generative model
• Do not take temporal information into consideration
• Unsupervised learning
Further Study
• Kernel PCA.
• Principal Curve.
• Laplacian Eigenmap.
• Hessian Isomap.
• Spectral clustering.
• Unified view of spectral embedding and clustering.
• Global coordination of local generative models:
- Global coordination of local linear representation.
- Automatic alignment of local representation.
Big Picture
“A unifying review of linear Gaussian models,” Roweis and Ghahramani 99
• Deterministic/probabilistic
• Static/dynamic
• Linear/nonlinear
• Mixture, hierarchical