
(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Overview

• EM Algorithm

• Mixture of Factor Analyzers

• Mixture of Probabilistic Component Analyzers

• Isometric Mapping

• Locally Linear Embedding

• Linear regression

• Logistic regression

• Linear classifier

• Fisher linear discriminant

(3)

Announcements

• More course material available on the course web page

• Code: PCA, FA, MoG, MFA, MPPCA, LLE, and Isomap

• Reading (due Oct 25):

- Fisher linear discriminant: Fisherface vs. Eigenface [1]

- Support vector machine: [3] or [2]

References

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[2] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.

[3] M. Pontil and A. Verri. Support vector machines for 3D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(6):637–646, 1998.

(4)

Mixture of Gaussians

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1 \quad (1)

where πk is the mixing parameter, describing the contribution of the k-th Gaussian component in explaining x.

• Given data X = {x1, . . . , xN}, we want to determine the model parameters θ = {πk, µk, Σk}.

- X are observable

- The contribution of each data point xi to the j-th Gaussian component, γj(xi), is a hidden variable that can be derived from X and θ.

- θ are unknown

• If we know θ, we can compute γj(xi).

• If we know γj(xi), we can compute θ.

• Chicken and egg problem.
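As a concrete reference point, here is a minimal sketch (assuming NumPy/SciPy; the function name mog_log_likelihood is illustrative, not part of the course code) that evaluates the mixture density in Eq. (1) and the log-likelihood that EM maximizes.

```python
# Minimal sketch: evaluate the mixture-of-Gaussians density of Eq. (1)
# and the corresponding data log-likelihood. Names are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def mog_log_likelihood(X, pis, mus, Sigmas):
    """X: (N, d) data; pis: (K,) mixing weights; mus: (K, d); Sigmas: (K, d, d)."""
    # p(x_i) = sum_k pi_k N(x_i | mu_k, Sigma_k)
    dens = sum(pi * multivariate_normal.pdf(X, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))
    return np.sum(np.log(dens))
```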

(5)

EM algorithm

• Expectation Maximization

• First take some initial guess of the model parameters and compute the expectation of the hidden variables

• Iterative procedure

• Start with some initial guess and refine it

• Very useful technique

• Variational learning

• Generalized EM algorithm

(6)

EM Algorithm for Mixture of Gaussians

• Log likelihood function

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \ln \Big\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \Big\} \quad (2)

• No closed-form solution (sum of components inside the log function)

• E (Expectation) step: Given all the current model parameters, compute the expectation of the hidden variables

• M (Maximization) step: Optimize the log likelihood with respect to model parameters

(7)

EM Algorithm for Mixture of Gaussians: M Step

\ln L = \sum_{i=1}^{N} \ln \Big( \sum_{k=1}^{K} \pi_k N_{ki} \Big) \quad (3)

where

N_{ki} = \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \quad (4)

Take the derivative of \ln L w.r.t. \mu_j:

\frac{\partial \ln L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}} \, \frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = 0 \quad (5)

Note

\frac{1}{N_{ji}} \frac{\partial N_{ji}}{\partial \mu_j} = \Sigma_j^{-1} (x_i - \mu_j) \quad (6)

Let \gamma_j(x_i) = \frac{\pi_j N_{ji}}{\sum_{k=1}^{K} \pi_k N_{ki}}, i.e., the normalized probability of x_i being generated

(8)

from the j-th Gaussian component

\sum_{i=1}^{N} \gamma_j(x_i) \, \Sigma_j^{-1} (x_i - \mu_j) = 0 \quad (7)

Thus,

\mu_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i) \, x_i}{\sum_{i=1}^{N} \gamma_j(x_i)} \quad (8)

Likewise, take the partial derivatives w.r.t. \pi_j and \Sigma_j:

\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_j(x_i) \quad (9)

\Sigma_j = \frac{\sum_{i=1}^{N} \gamma_j(x_i) (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} \gamma_j(x_i)} \quad (10)

Note that γj(xi) plays a weighting role.
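A minimal sketch of the M-step updates (8)–(10), assuming the responsibilities γj(xi) are given as an N × K NumPy array; m_step is an illustrative name, not the course code.

```python
# Sketch of the M-step updates of Eqs. (8)-(10), given responsibilities gamma.
import numpy as np

def m_step(X, gamma):
    """X: (N, d) data; gamma: (N, K) responsibilities."""
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective count per component
    pis = Nk / N                                # Eq. (9)
    mus = (gamma.T @ X) / Nk[:, None]           # Eq. (8)
    Sigmas = np.empty((gamma.shape[1], d, d))
    for k in range(gamma.shape[1]):
        Xc = X - mus[k]                         # center on the new mean
        Sigmas[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # Eq. (10)
    return pis, mus, Sigmas
```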

(9)

EM Algorithm for Mixture of Gaussians: E Step

• Compute the expected value of hidden variable γj

\gamma_j(x_i) = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)} \quad (11)

• Interpret the mixing coefficients as prior probabilities

p(x_i) = \sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j) = \sum_{j=1}^{K} p(j) \, p(x_i \mid j) \quad (12)

• Thus, γj(xi) corresponds to posterior probabilities (responsibilities)

p(j \mid x_i) = \frac{p(j) \, p(x_i \mid j)}{p(x_i)} = \frac{\pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)} = \gamma_j(x_i) \quad (13)
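Putting the two steps together, here is a minimal sketch of the E-step (11) and the resulting EM loop; it reuses the m_step sketch above (NumPy/SciPy assumed, names illustrative).

```python
# Sketch of the E-step of Eq. (11) and the alternating EM iterations.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    # unnormalized responsibilities pi_k N(x_i | mu_k, Sigma_k)
    R = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return R / R.sum(axis=1, keepdims=True)     # posterior p(j | x_i), Eq. (11)

def em_mog(X, pis, mus, Sigmas, n_iter=100):
    for _ in range(n_iter):
        gamma = e_step(X, pis, mus, Sigmas)     # E-step
        pis, mus, Sigmas = m_step(X, gamma)     # M-step (sketch above)
    return pis, mus, Sigmas
```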

(10)

EM for Factor Analysis

• Factor analysis: x = Λz + ε

• Log likelihood: L = \log \prod_i (2\pi)^{-d/2} |\Psi|^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \Lambda z)^T \Psi^{-1} (x_i - \Lambda z)\}

• Hidden variable: z, model parameters: Λ, Ψ.

• E-step:

E[z \mid x] = \beta x, \qquad \beta \equiv \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1}

E[zz^T \mid x] = \mathrm{Var}(z \mid x) + E[z \mid x] \, E[z \mid x]^T = I - \beta \Lambda + \beta x x^T \beta^T \quad (14)

• M-step:

\Lambda^{\text{new}} = \Big( \sum_{i=1}^{N} x_i E[z \mid x_i]^T \Big) \Big( \sum_{i=1}^{N} E[zz^T \mid x_i] \Big)^{-1}

\Psi^{\text{new}} = \frac{1}{N} \, \mathrm{diag}\Big\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{\text{new}} E[z \mid x_i] \, x_i^T \Big\} \quad (15)

where the diag operator sets all off-diagonal elements to zero.
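A compact sketch of this EM procedure for a single factor analyzer, assuming zero-mean data and NumPy; the initialization and the name em_factor_analysis are illustrative choices, not Ghahramani and Hinton's reference code.

```python
# Sketch of EM for factor analysis following Eqs. (14)-(15), zero-mean data assumed.
import numpy as np

def em_factor_analysis(X, q, n_iter=100):
    """X: (N, d) zero-mean data; q: number of factors."""
    N, d = X.shape
    Lam = 0.1 * np.random.randn(d, q)           # factor loadings Lambda (illustrative init)
    Psi = np.diag(np.var(X, axis=0))            # diagonal noise covariance
    for _ in range(n_iter):
        # E-step: beta = Lambda^T (Lambda Lambda^T + Psi)^(-1)
        beta = Lam.T @ np.linalg.inv(Lam @ Lam.T + Psi)
        Ez = X @ beta.T                         # E[z | x_i], one row per sample
        Ezz = N * (np.eye(q) - beta @ Lam) + Ez.T @ Ez   # sum_i E[z z^T | x_i]
        # M-step, Eq. (15)
        Lam = (X.T @ Ez) @ np.linalg.inv(Ezz)
        Psi = np.diag(np.diag(X.T @ X - Lam @ Ez.T @ X)) / N
    return Lam, Psi
```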

(11)

Mixture of Factor Analyzers (MFA)

• Assume that we have K factor analyzers indexed by ωk, k = 1, . . . , K, where ωk = 1 when the data point was generated by the k-th factor analyzer.

• The generative mixture model:

p(x) = \sum_{k=1}^{K} \int p(x \mid z, \omega_k) \, p(z \mid \omega_k) \, p(\omega_k) \, dz \quad (16)

where

p(z \mid \omega_k) = p(z) = \mathcal{N}(0, I) \quad (17)

• Allow each factor analyzer to model the data covariance structure in a different part of the input space:

p(x \mid z, \omega_k) = \mathcal{N}(\mu_k + \Lambda_k z, \Psi) \quad (18)
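A small sketch of drawing samples from this generative model (pick a component ωk, draw z ~ N(0, I), then x ~ N(µk + Λkz, Ψ)); NumPy assumed, names and seeding illustrative.

```python
# Sketch of ancestral sampling from the MFA model of Eqs. (16)-(18).
import numpy as np

def sample_mfa(pis, mus, Lams, Psi, n_samples, seed=0):
    """pis: (K,), mus: (K, d), Lams: (K, d, q), Psi: (d, d) diagonal."""
    rng = np.random.default_rng(seed)
    d, q = mus.shape[1], Lams.shape[2]
    ks = rng.choice(len(pis), size=n_samples, p=pis)        # component indicators omega
    Z = rng.standard_normal((n_samples, q))                  # latent factors z ~ N(0, I)
    noise = rng.multivariate_normal(np.zeros(d), Psi, size=n_samples)
    return mus[ks] + np.einsum('ndq,nq->nd', Lams[ks], Z) + noise
```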

(12)

EM for Mixture of Factor Analyzers

• For the E step, we need to compute the expectations of all the hidden variables (a small sketch of the responsibility computation follows at the end of this slide):

E[\omega_k \mid x_i] \propto p(x_i, \omega_k) = p(\omega_k) \, p(x_i \mid \omega_k) = \pi_k \mathcal{N}(x_i \mid \mu_k, \Lambda_k \Lambda_k^T + \Psi)

E[\omega_k z \mid x_i] = E[\omega_k \mid x_i] \, E[z \mid \omega_k, x_i]

E[\omega_k z z^T \mid x_i] = E[\omega_k \mid x_i] \, E[z z^T \mid \omega_k, x_i] \quad (19)

• The model parameters are \{(\mu_k, \Lambda_k, \pi_k)_{k=1}^{K}, \Psi\}.

• For the M step, take the derivative of the log likelihood with respect to the model parameters to obtain the new µk, Λk, πk, and Ψ.

• Read “The EM Algorithm for Mixtures of Factor Analyzers,” by Ghahramani and Hinton for details.

• Also read Ghahramani’s lecture notes.
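As noted above, here is a minimal sketch of the first expectation in (19), the responsibility E[ωk|xi] ∝ πk N(xi | µk, ΛkΛkT + Ψ); SciPy assumed, and mfa_responsibilities is an illustrative name.

```python
# Sketch: posterior mixing responsibilities E[omega_k | x_i] for an MFA.
import numpy as np
from scipy.stats import multivariate_normal

def mfa_responsibilities(X, pis, mus, Lams, Psi):
    """X: (N, d); pis: (K,); mus: (K, d); Lams: (K, d, q); Psi: (d, d) diagonal."""
    H = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=L @ L.T + Psi)
                         for pi, mu, L in zip(pis, mus, Lams)])
    return H / H.sum(axis=1, keepdims=True)     # normalize over the K components
```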

(13)

EM for Mixture of Probabilistic PCA

• Based on factor analyzers

• Read “Mixtures of Probabilistic Principal Component Analyzers,” by Tipping and Bishop.

(14)

MFA: Applications

• Modeling the manifolds of images of handwritten digits with a mixture of factor analyzers [Hinton et al. 97].

• Modeling multimodal density of faces for recognition and detection [Frey et al. 98] [Yang et al. 00].

• Analyze layers of appearance and motion [Frey and Jojic 99]

• Mixture of factor analyzers concurrently performs clustering and dimensionality reduction.

(15)

Nonlinear Principal Component Analysis (NLPCA)

• Aim to better model nonlinear manifold

• Based on a multi-layer (5-layer) perceptron

• The layer in the middle represents the feature space of the NLPCA transform.

• Two additional layers are used for nonlinearity.

• Auto-encoder, auto-associator, bottleneck network.
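A minimal sketch of the bottleneck auto-associator idea, assuming scikit-learn's MLPRegressor as the 5-layer perceptron (input, encoder, bottleneck, decoder, output) trained to reproduce its input; nlpca and its parameters are illustrative, not the course code.

```python
# Sketch of NLPCA via a bottleneck autoencoder trained to reconstruct its input.
import numpy as np
from sklearn.neural_network import MLPRegressor

def nlpca(X, n_components=2, n_hidden=32):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden, n_components, n_hidden),
                       activation="tanh", max_iter=2000)
    net.fit(X, X)                               # auto-association: target equals input
    # forward pass through the first two layers to read off the bottleneck codes
    h = X
    for W, b in list(zip(net.coefs_, net.intercepts_))[:2]:
        h = np.tanh(h @ W + b)
    return h                                    # N x n_components nonlinear features
```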

(16)

Recap

• Linear dimensionality reduction:

- Assume data is generated from a subspace

- Determine the subspace with PCA or FA (i.e., the subspace is spanned by the principal components)

• Nonlinear dimensionality reduction:

- Model data with a mixture of locally linear subspaces
- Use mixture of PCA, mixture of FA

• Mixture methods have local coordinate systems

• Need to find transformation between coordinate systems

(17)

Isometric Mapping (Isomap) [Tenenbaum et al. 00]

• Preserving pairwise distance structure

• Approximate geodesic distance

• Nonlinear dimensionality reduction

• Use a global coordinate system

• Aim to find the intrinsic dimensionality

(18)

Multidimensional Scaling (MDS)

• Analyze pairwise similarities of entities to gain insight into the underlying structure

• Based on a matrix of pairwise similarities

• Metric or non-metric

• Useful for data visualization

• Can be used for dimensionality reduction

• Preserve the pairwise similarity measure
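A minimal sketch of classical (metric) MDS, assuming a matrix of pairwise Euclidean distances and NumPy; classical_mds is an illustrative name.

```python
# Sketch of classical MDS: double-center squared distances, then eigendecompose.
import numpy as np

def classical_mds(D, M=2):
    """D: (n, n) matrix of pairwise distances; M: embedding dimension."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # distances -> inner products
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:M]             # keep the M largest
    scales = np.sqrt(np.clip(vals[idx], 0, None))
    return vecs[:, idx] * scales                 # n x M embedding coordinates
```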

(19)

Isomap: Algorithm

• Isomap:

- Construct neighborhood graph:

Define a graph G of all data points by connecting points i and j if they are neighbors

- Compute the shortest paths:

For any pair of points i and j, compute their shortest path, and obtain DG.

- Construct M-dimensional embedding:

Apply classic MDS to DG to construct M-dimensional Euclidean space Y . The coordinates yi are obtained by minimizing

E = \| \tau(D_G) - \tau(D_Y) \|_{L^2} \quad (20)

where τ converts distances into inner products that uniquely characterize the geometry of the data.

• The global minimum of (20) is obtained by setting yi to the top M eigenvectors of τ (DG).
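A minimal sketch of these three steps, assuming scikit-learn for the k-NN graph and SciPy for the shortest paths, and reusing the classical_mds sketch from the MDS slide; this is not the authors' reference implementation.

```python
# Sketch of Isomap: k-NN graph -> geodesic distances -> classical MDS embedding.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, M=2):
    # 1. neighborhood graph G weighted by Euclidean distances
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. geodesic distances D_G via shortest paths on the graph
    DG = shortest_path(G, method='D', directed=False)
    # 3. classical MDS on D_G gives the M-dimensional embedding Y
    return classical_mds(DG, M)                  # sketch defined on the MDS slide
```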

(20)

Isomap: Applications

(Figures: intrinsic low-dimensional embedding and interpolation along the learned manifold.)

(21)

Isomap: Applications

• Object recognition: memory-based recognition

• Object tracking: trajectory along inferred nonlinear manifold

• Video synthesis: interpolate along trajectory on nonlinear manifold

“Representation analysis and synthesis of lip images using dimensionality reduction,”

Aharon and Kimmel, IJCV 2005.

(22)

Isomap: Applications

• States of a moving object move smoothly along a low dimensional manifold.

• Discover the underlying manifold using Isomap

• Learn the mapping between input data and the corresponding points on the low dimensional manifold using a mixture of factor analyzers.

• Learn a dynamical model based on the points on the low dimensional manifold.

• Use particle filter for tracking.

(23)

Locally Linear Embedding (LLE) [Roweis et al. 00]

• Capture local geometry by linear reconstruction

• Map high dimensional data to global internal coordinates

(24)

Locally Linear Embedding: Algorithm

• LLE

- For each point, determine its neighbors

- Reconstruct each point with linear weights on its neighbors, solving for the weights w:

E(W) = \sum_i \Big| x_i - \sum_j w_{ij} x_j \Big|^2 \quad (21)

- Map to embedded coordinates:

Fix w and project x ∈ R^d to y ∈ R^M (M < d) by minimizing

\phi(y) = \sum_i \Big| y_i - \sum_j w_{ij} y_j \Big|^2 \quad (22)

• The embedding cost (22) defines an unconstrained optimization problem.

Adding a normalization constraint turns it into an eigenvalue problem.

• The bottom M nonzero eigenvectors provide an ordered set of orthogonal embedding coordinates.
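A minimal sketch of the two LLE steps for Eqs. (21)–(22), assuming NumPy/scikit-learn; the regularization constant and the name lle are illustrative choices, not the authors' code.

```python
# Sketch of LLE: local reconstruction weights, then bottom eigenvectors of (I-W)^T(I-W).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, n_neighbors=10, M=2, reg=1e-3):
    N = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    idx = nbrs.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                     # neighbors centered on x_i
        C = Z @ Z.T                              # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)          # regularize for stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, idx[i]] = w / w.sum()               # reconstruction weights, Eq. (21)
    # embedding cost (22): bottom eigenvectors of (I - W)^T (I - W)
    Mmat = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(Mmat)
    return vecs[:, 1:M + 1]                      # skip the constant bottom eigenvector
```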

(25)

LLE: Applications

Learn the embedding of facial expression images for synthesis

Learn the embedding of lip images for synthesis

(26)

Isomap and LLE

• Embedding rather than mapping function

• No probabilistic interpretation

• No generative model

• Do not take temporal information into consideration

• Unsupervised learning

(27)

Further Study

• Kernel PCA.

• Principal Curve.

• Laplacian Eigenmap.

• Hessian Isomap.

• Spectral clustering.

• Unified view of spectral embedding and clustering.

• Global coordination of local generative models:

- Global coordination of local linear representation.

- Automatic alignment of local representation.

(28)

Big Picture

“A unifying review of linear Gaussian models,” Roweis and Ghahramani 99

• Deterministic/probabilistic

• Static/dynamic

• Linear/nonlinear

• Mixture, hierarchical
