Kernel Principal Component Analysis (KPCA)
(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Announcements

• Project proposal:

• Reading (due Nov 8):

- Viola and Jones: Adaboost-based real-time face detector [3].

- Viola et al: Adaboost-based real-time pedestrian detector [4].

- Avidan: Ensemble tracking [1].

S. Avidan. Ensemble tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 494–501, 2005.

P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.

(3)

Overview

• Kernel methods: Kernel principal component analysis, kernel discriminant analysis

• Bagging

• Adaboost

• Ensemble Learning

(4)

Kernel Principal Component Analysis (KPCA)

• Powerful technique for extracting structure from data

• Extend conventional principal component analysis (PCA) to a high dimensional feature space.

• Subsume conventional PCA.

• Able to extract up to N (number of samples) nonlinear principal components without expensive computation.

• While conventional PCA extracts principal components in the input space, KPCA extracts principal components of variables (or features) that are nonlinearly related to the input variables, i.e., nonlinear principal components.

• In the context of image analysis, this is equivalent to finding principal components in the space of products of input pixels.

(5)

• Replace the covariance matrix by a Gram matrix (kernel matrix) computed with a kernel function.

(6)

Principal Component Analysis Revisited

• Conventional PCA: Given a set of N data points x_i, assume they have zero mean, i.e., \sum_{i=1}^{N} x_i = 0. The covariance matrix is

C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T   (1)

and we solve the eigenvalue problem

C v = \lambda v   (2)

Note that

C v = \frac{1}{N} \sum_{i=1}^{N} (x_i \cdot v) x_i   (3)

so all solutions v must lie in the span of x_1, \ldots, x_N. The above eigenvalue problem is therefore equivalent to

\lambda (x_k \cdot v) = (x_k \cdot C v), \quad \forall k = 1, \ldots, N.   (4)
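As an aside (not on the original slides), the conventional PCA above is only a few lines of NumPy; a minimal sketch, assuming the rows of X are the zero-mean samples x_i and using my own function name pca:

```python
import numpy as np

def pca(X, n_components):
    """Conventional PCA on zero-mean data: eigen-decomposition of C = (1/N) sum_i x_i x_i^T."""
    N = X.shape[0]
    C = X.T @ X / N                      # covariance matrix, equation (1)
    lam, V = np.linalg.eigh(C)           # solves C v = lambda v, equation (2)
    order = np.argsort(lam)[::-1]        # largest eigenvalues first
    return lam[order][:n_components], V[:, order][:, :n_components]

# Usage: project data onto the leading principal components.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X -= X.mean(axis=0)                      # enforce the zero-mean assumption
lam, V = pca(X, n_components=2)
Z = X @ V                                # projections (x_i . v_j)
print(lam.shape, Z.shape)                # (2,) (100, 2)
```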

(7)

KPCA: Computation

• Project samples from the input space to a high dimensional feature space:

x \in \mathbb{R}^d \rightarrow \Phi(x) \in \mathbb{R}^D, \quad D \gg d   (5)

• Assume the data has been centered, i.e., \sum_{i=1}^{N} \Phi(x_i) = 0. The covariance matrix in \mathbb{R}^D is

\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T   (6)

and the eigenvalue problem becomes

\bar{C} V = \lambda V   (7)

• See [2] for centering data points in the feature space.
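As a concrete note on that last point (my own addition; the helper name center_kernel is illustrative), the centering in [2] never forms Φ explicitly but operates on the kernel matrix directly:

```python
import numpy as np

def center_kernel(K):
    """Center a kernel (Gram) matrix as if the mapped data had zero mean in feature space.

    Implements K_c = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the
    N x N matrix with every entry equal to 1/N (the centering used in [2]).
    """
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N
```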

(8)

• Similarly, all solutions V lie in the span of \Phi(x_1), \ldots, \Phi(x_N):

V = \sum_{j=1}^{N} \alpha_j \Phi(x_j)   (8)

and

(\Phi(x_k) \cdot \bar{C} V) = \lambda (\Phi(x_k) \cdot V), \quad \forall k = 1, \ldots, N.   (9)

• Remember

\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T

• Combining (8), (9) and (6), we have

\frac{1}{N} \sum_{j=1}^{N} \alpha_j \Big( \Phi(x_k) \cdot \sum_{i=1}^{N} \Phi(x_i) (\Phi(x_i) \cdot \Phi(x_j)) \Big) = \lambda \sum_{j=1}^{N} \alpha_j (\Phi(x_k) \cdot \Phi(x_j)), \quad \forall k = 1, \ldots, N.   (10)

(9)

\frac{1}{N} \sum_{j=1}^{N} \alpha_j \Big( \Phi(x_k) \cdot \sum_{i=1}^{N} \Phi(x_i) (\Phi(x_i) \cdot \Phi(x_j)) \Big) = \lambda \sum_{j=1}^{N} \alpha_j (\Phi(x_k) \cdot \Phi(x_j))

• Define an N × N matrix K by

K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)).   (11)

Then the above equation can be written in matrix form as

K^2 \alpha = N \lambda K \alpha   (12)

where \alpha is the column vector (\alpha_1, \ldots, \alpha_N)^T.

• Solve the following eigenvalue problem to obtain solutions for (12):

K \alpha = N \lambda \alpha   (13)

• See [2] for justification and further details.

(10)

• Let \lambda_1 \le \lambda_2 \le \ldots \le \lambda_N denote the eigenvalues of K, and \alpha^1, \ldots, \alpha^N the corresponding eigenvectors, with \lambda_p the first nonzero eigenvalue. We normalize \alpha^p, \ldots, \alpha^N by requiring

V^k \cdot V^k = 1, \quad \forall k = p, \ldots, N.   (14)

• Compute the projection onto V^k \in \mathbb{R}^D (k = p, \ldots, N). Let x be a test sample with image \Phi(x) \in \mathbb{R}^D:

V^k \cdot \Phi(x) = \sum_{j=1}^{N} \alpha_j^k (\Phi(x_j) \cdot \Phi(x))   (15)

• In summary:

1. Compute kernel matrix K.

2. Compute the eigenvectors and normalize them.

3. Compute projections of a test point onto the eigenvectors.
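The three steps above translate almost directly into NumPy. The sketch below is an illustration rather than the authors' code: the function name kpca_fit_transform and the choice of an RBF kernel are assumptions. It centers the kernel matrix as in [2], solves Kα = Nλα, rescales the α so that V^k · V^k = 1, and returns the projections of the training points.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); one possible kernel choice."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kpca_fit_transform(X, n_components=2, gamma=1.0):
    """Minimal KPCA sketch following steps 1-3 above."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)

    # Step 1: compute the kernel matrix and center it as in [2].
    one_N = np.full((N, N), 1.0 / N)
    K = K - one_N @ K - K @ one_N + one_N @ K @ one_N

    # Step 2: eigh solves K alpha = mu alpha with mu = N * lambda, in ascending order.
    mu, alpha = np.linalg.eigh(K)
    mu, alpha = mu[::-1], alpha[:, ::-1]              # largest eigenvalues first
    keep = mu > 1e-12                                  # drop (numerically) zero eigenvalues
    mu, alpha = mu[keep][:n_components], alpha[:, keep][:, :n_components]

    # Normalize so that V^k . V^k = (alpha^k)^T K alpha^k = mu_k ||alpha^k||^2 = 1.
    alpha = alpha / np.sqrt(mu)

    # Step 3: projection of training point x_i onto V^k is sum_j alpha_j^k K(x_j, x_i).
    return K @ alpha

# Usage: two noisy concentric circles, a classic example where linear PCA struggles.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.standard_normal(200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
Z = kpca_fit_transform(X, n_components=2, gamma=0.5)
print(Z.shape)  # (200, 2)
```

Projecting a new test point would additionally require centering its kernel values against the training data; the sketch handles only the training projections.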

(11)

Example

• Let x = (x_1, x_2) \in \mathbb{R}^2 and let \Phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2). Then

(x \cdot y)^2 = (x_1 y_1 + x_2 y_2)^2 = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)(y_1^2, y_2^2, \sqrt{2} y_1 y_2)^T = (\Phi(x) \cdot \Phi(y))

• In general, the dot product kernel

k(x, y) = (x \cdot y)^m   (16)

corresponds to a dot product in the space of m-th order monomials of the input coordinates.

• In the context of image analysis, we can thus easily work in the space spanned by products of any m pixels, without any explicit use of the mapped pattern \Phi_m(x).

• The dimensionality of \mathbb{R}^D is \frac{(d+m-1)!}{m!\,(d-1)!} and grows like d^m.

(12)

• For instance, a 16 × 16 pixel image with a polynomial kernel of degree 5 yields a dimensionality of about 10^{10}.

• In short, the use of kernel functions is one way to compute higher-order statistics without a combinatorial explosion of time and memory complexity!

• The same idea is exploited in nonlinear SVMs.
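A quick numerical check of the degree-2 example above (illustrative only; the explicit map phi2 is my own naming):

```python
import numpy as np
from math import comb

def phi2(x):
    """Explicit degree-2 feature map for x in R^2: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)

lhs = np.dot(x, y) ** 2          # kernel evaluation (x . y)^2
rhs = np.dot(phi2(x), phi2(y))   # dot product in the monomial feature space
print(np.isclose(lhs, rhs))      # True

# Dimension count (d + m - 1)! / (m! (d - 1)!) for d = 256 pixels, m = 5:
print(comb(256 + 5 - 1, 5))      # 9,525,431,552, i.e. roughly 10^10
```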

(13)

Dimensionality Reduction, Feature Extraction and Pre-Image

• For a data set of N points x ∈ Rd, PCA is able to extract up to d nonzero eigenvalues from the d × d covariance matrix.

• In contrast, KPCA is able to extract up to N nonzero eigenvalues from the N × N kernel matrix. If N > d, KPCA can therefore find more principal components.

• PCA is able to reconstruct an original pattern x_i from its set of principal components (x_i · v_j).

• In contrast, with KPCA we can in general obtain only an approximate reconstruction (the pre-image problem).
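To make the contrast concrete (an added note using the standard formulation, not taken from the slide): PCA reconstructs a pattern from its first q principal components as

\hat{x} = \sum_{j=1}^{q} (x \cdot v_j)\, v_j   (exact when q = d),

whereas KPCA must search the input space for an approximate pre-image,

\hat{x} = \arg\min_{z \in \mathbb{R}^d} \| P_q \Phi(x) - \Phi(z) \|^2,

where P_q denotes the projection onto the first q kernel principal components.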

(14)

Properties of KPCA

• KPCA carries out the standard PCA in the high dimensional feature space

• The statistical properties of PCA carry over to kernel-based PCA

• In the feature space, i.e., RD, PCA is the orthogonal basis transformation with the following properties:

- the first q (q = 1, . . . , N) principal components, i.e., the projections on eigenvectors carry more variance than any other q orthogonal directions.

- the mean squared approximation error in representing the observations by the first q principal components is minimal.

- the principal components are uncorrelated.

- the first q principal components have maximal mutual information with respect to the inputs (under Gaussian assumption).

(15)

Kernel Discriminant Analysis (KDA)

• Kernelize the conventional Fisher linear discriminant method.

• Carry out Fisher linear discriminant analysis in the feature space, but without the time and memory cost of working in that space explicitly.

• Has been shown to outperform SVM in some applications.

• Connections to SVM and RBF.

• Other kernel methods, e.g., kernel independent component analysis (KICA).

(16)

Input Space or Feature Space?

• Images are high-dimensional vectors in the input space.

• Kernel methods conceptually project samples to a high (possibly infinite) dimensional feature space.

• Can the sheer abundance of features be useful or harmful for the classification task?

• The answer is borne out by experiments and empirical comparisons.

(17)

Bagging

• The name of the game: Bootstrap aggregating [Breiman 96b].

• Vote among classifiers generated from different bootstrap samples (with replicates).

A bootstrap sample is generated by uniformly sampling N samples from the training set with replacement.

• T bootstrap samples B_1, \ldots, B_T are generated, and a classifier C_i is built from each bootstrap sample B_i.

• A final classifier C is built from C_1, \ldots, C_T whose output is the class predicted most often by its sub-classifiers, with ties broken arbitrarily.

• For a given bootstrap sample, an instance in the training set has probability 1 - (1 - 1/N)^N of being selected at least once in the N times instances are randomly drawn from the training set.

• For large N, this is about 1 - 1/e ≈ 63.2%, which means each B_i contains only about 63.2% of the unique instances from the training set.
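A quick sanity check of the 63.2% figure (an illustrative script, not from the slides):

```python
import numpy as np

N = 10_000
# Closed form: probability that an instance appears at least once in a bootstrap sample.
print(1 - (1 - 1 / N) ** N)          # ~0.6321, approaching 1 - 1/e

# Empirical check: fraction of unique training instances in one bootstrap sample.
rng = np.random.default_rng(0)
sample = rng.integers(0, N, size=N)  # draw N indices with replacement
print(len(np.unique(sample)) / N)    # also ~0.632
```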

(18)

• Such perturbation causes different C_i to be built if the inducer (learner, classifier) is unstable (e.g., neural networks, decision trees), and performance can be improved.

(19)

The Bagging Algorithm

1. for i = 1 to T {
2.     S' = bootstrap sample from S (i.i.d. sample with replacement)
3.     C_i = I(S')   (build a classifier C_i)
4. }
5. C(x) = arg max_{y \in Y} \sum_{i: C_i(x) = y} 1   (the most often predicted label y)
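A minimal sketch of this procedure (my own illustration, not code from the lecture; the nearest-centroid inducer is chosen only to keep the example self-contained, whereas in practice unstable inducers such as decision trees benefit most):

```python
import numpy as np

class NearestCentroid:
    """A deliberately simple base inducer I(S') used only for illustration."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

def bagging(X, y, T=25, seed=0):
    """Build T classifiers from bootstrap samples and vote them (steps 1-5 above)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)                  # bootstrap sample S'
        classifiers.append(NearestCentroid().fit(X[idx], y[idx]))
    def predict(X_test):
        votes = np.stack([c.predict(X_test) for c in classifiers])   # shape (T, n_test)
        # Majority vote per test point; ties broken arbitrarily by argmax.
        return np.array([np.bincount(col).argmax() for col in votes.T])
    return predict

# Usage on a toy two-class problem with integer labels 0/1.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))]
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
predict = bagging(X, y)
print((predict(X) == y).mean())   # training accuracy of the bagged ensemble
```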

(20)

Boosting

• Boosting was first introduced by Schapire; a further improvement, Adaboost (Adaptive Boosting), was proposed by Freund and Schapire.

• Like Bagging, Adaboost generates a set of classifiers and votes them. Adaboost generates the classifiers sequentially, while Bagging generates them in parallel.

• Adaboost changes the weights of the training instances provided as input to each learner based on classifiers that were previously built.

• The goal is to force the inducer to minimize expected error over different input distributions.

• T weighted training sets S1, . . . , ST are generated in sequence and T classifiers C1, . . . , CT are built.

• A final classifier C is formed using a weighted voting scheme: the weight of each classifier depends on its performance on the training set used to build it.

(21)

Adaboost Algorithm (M1)

1. S' = S, with every instance weight set to 1.
2. For i = 1 to T {
3.     C_i = I(S')
4.     ε_i = (1/N) \sum_{x_j \in S': C_i(x_j) \ne y_j} weight(x_j)   (weighted error on the training set)
5.     If ε_i > 1/2, set S' to a bootstrap sample from S with weight 1 for every instance and go to step 3.
6.     β_i = ε_i / (1 - ε_i)
7.     For each x_j \in S', if C_i(x_j) = y_j then weight(x_j) = weight(x_j) · β_i.
8.     Normalize the weights of the instances so the total weight of S' is N.
   }
9. C(x) = arg max_{y \in Y} \sum_{i: C_i(x) = y} \log(1/β_i).
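A compact sketch of the algorithm above (illustrative Python, not the authors' code; the decision-stump inducer and all function names are my own, and the bootstrap fallback of step 5 is replaced by simply stopping early):

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, polarity) with lowest weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) > 0, 1, 0)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best[1:]

def stump_predict(model, X):
    f, thr, pol = model
    return np.where(pol * (X[:, f] - thr) > 0, 1, 0)

def adaboost_m1(X, y, T=10):
    """Steps 1-9 above for binary labels {0, 1}, with weights kept normalized to total N."""
    N = len(X)
    w = np.ones(N)                          # step 1
    models, betas = [], []
    for _ in range(T):                      # step 2
        model = stump_fit(X, y, w)          # step 3
        pred = stump_predict(model, X)
        eps = w[pred != y].sum() / N        # step 4 (total weight is N)
        if eps > 0.5:                       # step 5, simplified: stop instead of resampling
            break
        eps = max(eps, 1e-10)               # guard against a perfect stump
        beta = eps / (1 - eps)              # step 6
        w[pred == y] *= beta                # step 7
        w *= N / w.sum()                    # step 8
        models.append(model)
        betas.append(beta)
    def predict(X_test):                    # step 9: weighted vote with weight log(1/beta_i)
        score = np.zeros(len(X_test))
        for model, beta in zip(models, betas):
            vote = np.where(stump_predict(model, X_test) == 1, 1, -1)
            score += np.log(1 / beta) * vote
        return (score > 0).astype(int)      # equivalent to arg max over the two labels
    return predict

# Usage on a toy two-class problem.
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
predict = adaboost_m1(X, y, T=20)
print((predict(X) == y).mean())
```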

(22)

• The update rule is mathematically equivalent to the following update rule: for each x_j, divide weight(x_j) by 2ε_i if C_i(x_j) ≠ y_j and by 2(1 - ε_i) otherwise.

• The incorrect instances are reweighted by a factor inversely proportional to the error on the training set, i.e., 1/(2ε_i). A small training-set error, such as 0.1%, will cause the weights to grow by several orders of magnitude.

• The proportion of misclassified instances is ε_i, and these instances get boosted by a factor of 1/(2ε_i), so the total weight of the misclassified instances after updating is half the original training-set weight. Similarly, the correctly classified instances end up with a total weight equal to half the original weight, and thus no further normalization is required.

• The Adaboost algorithm requires a weak learner whose error is bounded by a constant strictly less than 1/2.
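A tiny numerical check (my own, not from the slides) that the two update rules agree and that each group ends up with half the total weight:

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, N); w *= N / w.sum()       # instance weights with total N
correct = np.array([True, True, False, True, False, True, True, False])
eps = w[~correct].sum() / N                          # weighted training error

# Rule from the algorithm: multiply correct weights by beta_i, then renormalize to N.
beta = eps / (1 - eps)
w1 = w.copy(); w1[correct] *= beta; w1 *= N / w1.sum()

# Equivalent rule: divide by 2*eps if misclassified, by 2*(1 - eps) otherwise.
w2 = np.where(correct, w / (2 * (1 - eps)), w / (2 * eps))

print(np.allclose(w1, w2))                           # True
print(w1[~correct].sum() / N, w1[correct].sum() / N) # 0.5 0.5: each group holds half the weight
```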

(23)

More Details on Adaboost

Refer to Schapire’s tutorial on boosting.

(24)

Real-Time Adaboost-based Face Detector

(25)–(33)

[Slides 25–33: figures for the real-time Adaboost-based face detector; no recoverable text.]
(34)

Ensemble Learning

Many ways to combine classifiers

• Randomization algorithm

• Application example: Ensemble tracking by Avidan.

References

[1] S. Avidan. Ensemble tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 494–501, 2005.

[2] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[3] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[4] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.
