(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Overview

• Principal Component Analysis

• Factor Analysis

• EM Algorithm

• Probabilistic Principal Component Analysis

• Mixture of Factor Analyzers

• Mixture of Probabilistic Principal Component Analyzers

• Isometric Mapping

• Locally Linear Embedding

• Global coordination of local representation

(3)

Reading

• Embedding: Isomap [4], LLE [2]

• Global coordination of local generative models: Global coordination [1], Alignment of local representation [3].

References

[1] S. Roweis, L. K. Saul, and G. E. Hinton. Global coordination of local linear models. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 889–896. MIT Press, 2002.

[2] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2000.

[3] Y. W. Teh and S. Roweis. Automatic alignment of local representations. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 841–848. MIT Press, 2003.

[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2000.

(4)

Principal Component Analysis

• Curse of dimensionality

• Possibly the most popular dimensionality reduction algorithm

• Widely used in computer vision and other applications

• Two perspectives:

- Minimize reconstruction error (e.g., vision): Karhunen-Loeve transform

- Decorrelation (e.g., signal processing): Hotelling transform

• Unsupervised learning

• Linear transform

• Second order statistics

(5)

Motivating Example [Black 04]

(6)
(7)

Principal Component Analysis

• Given a set of N data points x_n, map each point x in a d-dimensional space, x = [x_1, \ldots, x_d]^T, onto a vector z in an M-dimensional space, where M < d.

• Use a linear combination

x = \sum_{i=1}^{d} z_i u_i \quad (1)

where the vectors u_i satisfy the orthonormality relation

u_i^T u_j = \delta_{ij} \quad (2)

in which \delta_{ij} is the Kronecker delta. Thus,

z_i = u_i^T x \quad (3)

• Retaining only a subset M < d of the basis vectors u_i, x can be approximated by

\tilde{x} = \sum_{i=1}^{M} z_i u_i + \sum_{i=M+1}^{d} b_i u_i \quad (4)

(8)

where the b_i are constants.

• Dimensionality reduction: x has d degrees of freedom and z has M degrees of freedom.

• For each x_n, the error introduced by the dimensionality reduction is

x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_{in} - b_i) u_i \quad (5)

and we want to find the basis vectors u_i, the coefficients b_i, and the values z_{in} with minimum error.

• For the whole data set, using the orthonormality relation,

E_M = \frac{1}{2} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_{in} - b_i)^2 \quad (6)

(9)

• Taking the derivative of E_M with respect to b_i and setting it to zero,

b_i = \frac{1}{N} \sum_{n=1}^{N} z_{in} = u_i^T \bar{x} \quad (7)

where

\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n \quad (8)

• Plugging back into the sum-of-squares error E_M,

E_M = \frac{1}{2} \sum_{i=M+1}^{d} \sum_{n=1}^{N} \left( u_i^T (x_n - \bar{x}) \right)^2 = \frac{N}{2} \sum_{i=M+1}^{d} u_i^T C u_i \quad (9)

where C is the covariance matrix

C = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \quad (10)

(10)

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (11)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

(11)

Derivation

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (12)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

• Need some constraints to solve this optimization.

• Impose orthonormality constraints among the u_i.

• Use Lagrange multipliers \mu_{ij}:

\hat{E}_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T C u_i - \frac{1}{2} \sum_{i=M+1}^{d} \sum_{j=M+1}^{d} \mu_{ij} \left( u_i^T u_j - \delta_{ij} \right) \quad (13)

• In matrix form,

\hat{E}_M = \frac{1}{2} \mathrm{tr}\{ U^T C U \} - \frac{1}{2} \mathrm{tr}\{ M (U^T U - I) \} \quad (14)

(12)

where M is a matrix with elements \mu_{ij}, U is a matrix whose columns consist of the u_i, and I is the unit matrix.

• Minimizing \hat{E}_M with respect to U,

(C + C^T) U - U (M + M^T) = 0 \quad (15)

• Note that C is symmetric, M is symmetric, and U^T U is symmetric as it is the unit matrix I. Thus,

C U = U M \quad (16)

U^T C U = M \quad (17)

• The eigenvector equation for M is

M \Psi = \Psi \Lambda \quad (18)

where \Lambda is a diagonal matrix of eigenvalues and \Psi is the matrix of eigenvectors.

(13)

• M is symmetric and \Psi can be chosen to have orthonormal columns, i.e., \Psi^T \Psi = I, so

\Lambda = \Psi^T M \Psi \quad (19)

• Putting these together,

\Lambda = \Psi^T U^T C U \Psi = (U \Psi)^T C (U \Psi) = \tilde{U}^T C \tilde{U} \quad (20)

where \tilde{U} = U \Psi, and

U = \tilde{U} \Psi^T \quad (21)

• Thus a solution for U^T C U = M can be obtained from the particular solution \tilde{U} by application of an orthogonal transformation given by \Psi.

(14)

Getting Principal Components from Data

• Minimizing E_M with respect to u_i, we get

C u_i = \lambda_i u_i \quad (22)

i.e., the basis vectors u_i are the eigenvectors of the covariance matrix C.

• Consequently, the error E_M is

E_M = \frac{N}{2} \sum_{i=M+1}^{d} \lambda_i \quad (23)

In other words, the minimum error is reached by discarding the eigenvectors corresponding to the d − M smallest eigenvalues.

• Each eigenvector u_i is called a principal component.

• Retain the eigenvectors corresponding to the largest eigenvalues (see the sketch below).
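As a concrete illustration of the procedure above, here is a minimal NumPy sketch (the function and variable names are my own, not from the slides) that centers the data, forms the covariance matrix C, and retains the eigenvectors with the largest eigenvalues:

```python
import numpy as np

def pca(X, M):
    """Minimal PCA sketch. X is N x d (one data point per row); M is the
    number of principal components to retain."""
    x_bar = X.mean(axis=0)                    # sample mean, Eq. (8)
    Xc = X - x_bar                            # centered data
    C = Xc.T @ Xc / X.shape[0]                # covariance matrix, Eq. (10)
    eigvals, eigvecs = np.linalg.eigh(C)      # C u_i = lambda_i u_i, Eq. (22)
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues, largest first
    U = eigvecs[:, order[:M]]                 # d x M matrix of principal components
    Z = Xc @ U                                # coefficients u_i^T (x_n - x_bar) = z_in - b_i
    return U, Z, eigvals[order], x_bar

# Reconstruct and check the discarded-eigenvalue error of Eq. (23)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
U, Z, lam, x_bar = pca(X, M=2)
X_rec = Z @ U.T + x_bar                       # approximation of Eq. (4) with b_i = u_i^T x_bar
err = 0.5 * np.sum((X - X_rec) ** 2)
print(err, 0.5 * X.shape[0] * lam[2:].sum())  # the two values agree
```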

(15)

• Projecting x_n onto these eigenvectors gives the components of the transformed vector z_n in the M-dimensional space.

• In a two-dimensional example, each data point is transformed to a single variable z_1 representing the projection of the data point onto the eigenvector u_1.

• Infer the structure (or reduce redundancy) inherent in high-dimensional data.

• Parsimonious representation.

• Linear dimensionality reduction algorithm based on a sum-of-squares error criterion.

(16)

• Other criteria: covariance measure and population entropy

(17)

Intrinsic Dimensionality

• A data set in d dimensions has intrinsic dimensionality equal to d_0 if the data lies entirely within a d_0-dimensional space.

• What is the intrinsic dimensionality of data?

• The intrinsic dimensionality may increase due to noise.

• PCA, as a linear approximation, has its limitation.

(18)

• How to determine the number of eigenvectors?

• Empirically determined based on reconstruction error (i.e., energy).

• Will come back to this issue later on (Isomap, LLE).
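One common heuristic for the two bullets above is to keep just enough components to retain a fixed fraction of the total eigenvalue energy. A minimal sketch (the 90% threshold is an illustrative choice of mine, not a recommendation from the slides):

```python
import numpy as np

def choose_num_components(eigvals, energy=0.90):
    """Smallest M whose leading eigenvalues capture `energy` of the total
    variance, i.e., bound the reconstruction error of Eq. (23)."""
    lam = np.sort(eigvals)[::-1]
    cumulative = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(cumulative, energy) + 1)

# Eigenvalues [5.0, 3.0, 1.0, 0.5, 0.5]: M = 2 keeps 80% of the energy, M = 3 keeps 90%
print(choose_num_components(np.array([5.0, 3.0, 1.0, 0.5, 0.5])))   # -> 3
```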

(19)

Singular Value Decomposition (SVD)

• Since

x = \sum_{i=1}^{d} z_i u_i, \qquad u_i^T u_j = \delta_{ij} \quad (24)

• Subtract the mean from each column:

A = [(x_1 - \bar{x}) \; \ldots \; (x_N - \bar{x})] \quad (25)

The covariance matrix is

C = \frac{1}{N} A A^T \quad (26)

• Singular value decomposition allows us to write A as

A = U \Sigma V^T = [u_1 \; \ldots \; u_N] \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_N \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ v_N^T \end{bmatrix} \quad (27)

(20)

C = \frac{1}{N} A A^T = \frac{1}{N} U \Sigma V^T (U \Sigma V^T)^T = \frac{1}{N} U \Sigma V^T V \Sigma U^T = \frac{1}{N} U \Sigma^2 U^T \quad (28)

• Therefore,

C u_i = \frac{\sigma_i^2}{N} u_i \quad (29)

• So the columns of U are the eigenvectors of C, and the eigenvalues are just \lambda_i = \sigma_i^2 / N.
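The same principal directions can therefore be obtained directly from the SVD of the centered data matrix, without forming the d x d covariance matrix explicitly. A small sketch under the notation above (the function name is mine):

```python
import numpy as np

def pca_svd(X, M):
    """PCA via SVD of the centered data matrix A (Eqs. 25-29).
    X is N x d; returns the top-M principal directions and eigenvalues."""
    N = X.shape[0]
    x_bar = X.mean(axis=0)
    A = (X - x_bar).T                         # d x N, columns are x_n - x_bar, Eq. (25)
    U, s, _ = np.linalg.svd(A, full_matrices=False)   # A = U Sigma V^T, Eq. (27)
    eigvals = s ** 2 / N                      # lambda_i = sigma_i^2 / N, Eq. (29)
    return U[:, :M], eigvals[:M], x_bar
```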

(21)

Eigenface [Turk and Pentland 1991]

• Collect a set of face images.

• Normalize for contrast, scale and orientation.

• Apply PCA to compute the first M eigenvectors (dubbed Eigenfaces) that best account for the data variance (i.e., facial structure).

• Compute the distance between the projected points for face recognition or detection.
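A hedged sketch of this pipeline, assuming the normalized face images have already been flattened into the rows of a hypothetical array `faces` with one label per row; nearest-neighbour matching in eigenface space stands in for the recognition step:

```python
import numpy as np

def train_eigenfaces(faces, M):
    """faces: N x d matrix of flattened, normalized face images."""
    mean_face = faces.mean(axis=0)
    A = (faces - mean_face).T                        # d x N centered data matrix
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    eigenfaces = U[:, :M]                            # first M eigenvectors
    coeffs = (faces - mean_face) @ eigenfaces        # projections of training faces
    return mean_face, eigenfaces, coeffs

def recognize(query, mean_face, eigenfaces, coeffs, labels):
    """Project a query face and return the label of the nearest training face."""
    z = (query - mean_face) @ eigenfaces
    dists = np.linalg.norm(coeffs - z, axis=1)       # distances in eigenface space
    return labels[int(np.argmin(dists))]
```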

(22)

Appearance Manifolds [Murase and Nayar 95]

• The image variation of an object under different pose or illumination is assumed to lie on a manifold.

• For each object, collect images under different pose

• Construct a universal eigenspace from all the images

• For the set of images of the same object, find the smoothly varying manifold in eigenspace, i.e., the parametric eigenspace.

(23)

• The manifolds of two objects may intersect; the intersection corresponds to poses of the two objects for which their images are very similar in appearance.

(24)

Further Study

• Incremental update of PCA

• Independent component analysis (ICA)

• Discrete PCA

• Two-dimensional PCA

• Illumination cone [Belhumeur and Kriegman 97]

(25)

Factor Analysis (FA)

• A generative dimensionality reduction algorithm

• A d-dimensional data vector x is modeled using an M-dimensional vector z, dubbed the factors, where M is smaller than d:

x = \Lambda z + \varepsilon \quad (30)

where

- \Lambda is the factor loading matrix.

- z is assumed to be N(0, I) distributed (zero-mean, unit-variance normals).

- The factors z model the correlations between the elements of x.

- \varepsilon is a random variable that accounts for noise and is assumed to be N(0, \Psi) distributed, where \Psi is a diagonal matrix.

- \varepsilon accounts for independent noise in each element of x.

- x is N(0, \Lambda \Lambda^T + \Psi) distributed.

- The diagonality of \Psi is a key assumption: it constrains the error covariance \Psi for estimation.

- The observed variables x_i are conditionally independent given the factors z.
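To make the generative model concrete, here is a small sampling sketch (the dimensions and names are illustrative choices of mine); the empirical covariance of the samples approaches \Lambda \Lambda^T + \Psi:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, N = 5, 2, 100_000

Lam = rng.normal(size=(d, M))                   # factor loading matrix Lambda
psi = rng.uniform(0.1, 0.5, size=d)             # diagonal of Psi

z = rng.normal(size=(N, M))                     # factors z ~ N(0, I)
eps = rng.normal(size=(N, d)) * np.sqrt(psi)    # noise eps ~ N(0, Psi), Psi diagonal
X = z @ Lam.T + eps                             # x = Lambda z + eps, Eq. (30)

# Sample covariance approaches Lambda Lambda^T + Psi as N grows
print(np.abs(np.cov(X.T, bias=True) - (Lam @ Lam.T + np.diag(psi))).max())
```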

(26)

Properties of FA

Factor analysis: x = Λz + ε

• Latent variables z: explain correlations between x

• εi represents variability unique to a particular xi

• Fundamentally differs from PCA which treats covariance and variance identically.

• Want to infer Λ and Ψ from x

• Suppose \Lambda and \Psi are known; then, by linear projection,

E[z|x] = \beta x \quad (31)

(27)

where \beta = \Lambda^T (\Psi + \Lambda \Lambda^T)^{-1}, since the joint Gaussian of the data x and the factors z is

p\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda \Lambda^T + \Psi & \Lambda \\ \Lambda^T & I \end{bmatrix} \right) \quad (32)

• Note that since \Psi is diagonal,

(\Psi + \Lambda \Lambda^T)^{-1} = \Psi^{-1} - \Psi^{-1} \Lambda (I + \Lambda^T \Psi^{-1} \Lambda)^{-1} \Lambda^T \Psi^{-1} \quad (33)

where I is an identity matrix.

• The second moment of the factors is

E[z z^T | x] = \mathrm{Var}(z|x) + E[z|x] E[z|x]^T = I - \beta \Lambda + \beta x x^T \beta^T \quad (34)

• The expectations of the first and second moments provide a measure of uncertainty in the factors, which PCA does not have.
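A sketch of these posterior computations for a single data point, given \Lambda and the diagonal of \Psi, using the inversion identity of Eq. (33) (the function name is mine):

```python
import numpy as np

def factor_posterior(x, Lam, psi_diag):
    """Return E[z|x] and E[zz^T|x] for the factor analysis model (Eqs. 31-34)."""
    M = Lam.shape[1]
    Psi_inv = np.diag(1.0 / psi_diag)
    # (Psi + Lam Lam^T)^{-1} via the identity in Eq. (33)
    inner = np.linalg.inv(np.eye(M) + Lam.T @ Psi_inv @ Lam)
    Sigma_inv = Psi_inv - Psi_inv @ Lam @ inner @ Lam.T @ Psi_inv
    beta = Lam.T @ Sigma_inv                         # beta = Lam^T (Psi + Lam Lam^T)^{-1}
    Ez = beta @ x                                    # E[z|x], Eq. (31)
    Ezz = np.eye(M) - beta @ Lam + np.outer(Ez, Ez)  # E[zz^T|x], Eq. (34)
    return Ez, Ezz
```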

(28)

EM Algorithm for Factor Analysis

• Expectation-Maximization: useful technique in dealing with missing data

• Start with some initial guess of missing data and evaluate them.

• Optimize the missing parameters by taking the derivative of the likelihood of the observed and missing data w.r.t. the parameters.

• Repeat until the data likelihood does not change

• E-step: Given \Lambda and \Psi, for each data point x_i compute

E[z|x_i] = \beta x_i

E[z z^T | x_i] = \mathrm{Var}(z|x_i) + E[z|x_i] E[z|x_i]^T = I - \beta \Lambda + \beta x_i x_i^T \beta^T \quad (35)

• M-step:

\Lambda^{new} = \left( \sum_{i=1}^{N} x_i E[z|x_i]^T \right) \left( \sum_{i=1}^{N} E[z z^T | x_i] \right)^{-1}

\Psi^{new} = \frac{1}{N} \mathrm{diag}\left\{ \sum_{i=1}^{N} x_i x_i^T - \Lambda^{new} E[z|x_i] x_i^T \right\} \quad (36)

(29)

where the diag operator sets all off-diagonal elements to zero.
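Putting these steps together, a compact EM sketch for factor analysis (it assumes zero-mean data; the initialization and fixed iteration count are my own choices):

```python
import numpy as np

def fa_em(X, M, n_iter=100, seed=0):
    """EM for factor analysis on zero-mean data X (N x d); returns Lambda and diag(Psi)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    Lam = rng.normal(scale=0.1, size=(d, M))
    psi = X.var(axis=0)                                    # diagonal of Psi
    for _ in range(n_iter):
        # E-step: beta = Lam^T (Psi + Lam Lam^T)^{-1}, then the moments of Eq. (35)
        beta = Lam.T @ np.linalg.inv(np.diag(psi) + Lam @ Lam.T)
        Ez = X @ beta.T                                    # row i is E[z|x_i]^T
        sum_Ezz = N * (np.eye(M) - beta @ Lam) + Ez.T @ Ez # sum_i E[zz^T|x_i]
        # M-step: Eq. (36)
        Lam = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
        psi = np.diag(X.T @ X - Lam @ (Ez.T @ X)) / N
    return Lam, psi
```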

(30)

FA and PCA

• Given a set of data points, would Λ correspond to principal subspace?

• No in most cases

• But if FA has an isotropic error model, i.e., \psi_i = \sigma^2, then \Lambda corresponds to the eigenvectors.

(31)

Probabilistic Principal Component Analysis (PPCA)

• Based on factor analysis, x = Λz + ε, with isotropic noise model N (0, σ2I)

• The z conditional probability distribution over x is given by

x|z ∼ N (Λz, σ2I) (37)

• Since x ∼ N (0, I), marginal distribution for x is

x ∼ N (0, ˜C) (38)

where C = ΛΛT + σ2I.

• Log likelihood of data is L = −N

2 {d ln(2π) + ln |C| + tr( ˜C−1S)} (39)

(32)

where

S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T \quad (40)

• Estimates of \Lambda and \sigma^2 can be obtained by maximizing L using an EM algorithm similar to that for factor analysis.
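A direct sketch of the log-likelihood evaluation of Eqs. (38)–(40), which an EM (or gradient-based) routine would maximize over \Lambda and \sigma^2 (the function name is mine):

```python
import numpy as np

def ppca_log_likelihood(X, Lam, sigma2):
    """Log-likelihood L of zero-mean data X (N x d) under the PPCA model,
    with C_tilde = Lam Lam^T + sigma^2 I and S the sample covariance."""
    N, d = X.shape
    C_tilde = Lam @ Lam.T + sigma2 * np.eye(d)      # Eq. (38)
    S = X.T @ X / N                                 # Eq. (40)
    _, logdet = np.linalg.slogdet(C_tilde)
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C_tilde, S)))   # Eq. (39)
```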
