Kernel Principal Component Analysis (KPCA)
(1)

Advanced Topics in Learning and Vision

Ming-Hsuan Yang

mhyang@csie.ntu.edu.tw

(2)

Announcements

• Project proposal:

• Reading (due Nov 8):

- Viola and Jones: Adaboost-based real-time face detector [3].

- Viola et al: Adaboost-based real-time pedestrian detector [4].

- Avidan: Ensemble tracking [1].

S. Avidan. Ensemble tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 494–501, 2005.

P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.

(3)

Overview

• Kernel methods: Kernel principal component analysis, kernel discriminant analysis

• Bagging

• Adaboost

• Ensemble Learning

(4)

Kernel Principal Component Analysis (KPCA)

• Powerful technique for extracting structure from data

• Extend conventional principal component analysis (PCA) to a high dimensional feature space.

• Subsume conventional PCA.

• Able to extract up to N (number of samples) nonlinear principal components without expensive computation.

• While conventional PCA extracts principal components in the input space, KPCA extracts principal components of variables (or features) that are nonlinearly related to the input variables, i.e., nonlinear principal components.

• In the context of image analysis, this is equivalent to finding principal components in the space of products of input pixels.

(5)

• Replace the covariance matrix by a Gram matrix (kernel matrix) computed with a kernel function.

(6)

Principal Component Analysis Revisited

• Conventional PCA: Given a set of N data points x_i, assume they have zero mean, i.e., \sum_{i=1}^{N} x_i = 0. The covariance matrix is

C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T   (1)

and we solve the eigenvalue problem

C v = \lambda v   (2)

Note that

C v = \frac{1}{N} \sum_{i=1}^{N} (x_i \cdot v) x_i   (3)

so all solutions v must lie in the span of x_1, \ldots, x_N. The above eigenvalue problem is therefore equivalent to

\lambda (x_k \cdot v) = (x_k \cdot C v), \quad \forall k = 1, \ldots, N.   (4)
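As an aside (not on the original slides), the conventional PCA above is only a few lines of NumPy; a minimal sketch, assuming the rows of X are the zero-mean samples x_i and using my own function name pca:

```python
import numpy as np

def pca(X, n_components):
    """Conventional PCA on zero-mean data: eigen-decomposition of C = (1/N) sum_i x_i x_i^T."""
    N = X.shape[0]
    C = X.T @ X / N                      # covariance matrix, equation (1)
    lam, V = np.linalg.eigh(C)           # solves C v = lambda v, equation (2)
    order = np.argsort(lam)[::-1]        # largest eigenvalues first
    return lam[order][:n_components], V[:, order][:, :n_components]

# Usage: project data onto the leading principal components.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X -= X.mean(axis=0)                      # enforce the zero-mean assumption
lam, V = pca(X, n_components=2)
Z = X @ V                                # projections (x_i . v_j)
print(lam.shape, Z.shape)                # (2,) (100, 2)
```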

(7)

KPCA: Computation

• Project samples from the input space to a high dimensional feature space:

x \in \mathbb{R}^d \rightarrow \Phi(x) \in \mathbb{R}^D, \quad D \gg d   (5)

• Assume the data has been centered, i.e., \sum_{i=1}^{N} \Phi(x_i) = 0. The covariance matrix in \mathbb{R}^D is

\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T   (6)

and the eigenvalue problem becomes

\bar{C} V = \lambda V   (7)

• See [2] for centering data points in the feature space.
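As a concrete note on that last point (my own addition; the helper name center_kernel is illustrative), the centering in [2] never forms Φ explicitly but operates on the kernel matrix directly:

```python
import numpy as np

def center_kernel(K):
    """Center a kernel (Gram) matrix as if the mapped data had zero mean in feature space.

    Implements K_c = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the
    N x N matrix with every entry equal to 1/N (the centering used in [2]).
    """
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N
```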

(8)

• Similarly, all solutions V lie in the span of \Phi(x_1), \ldots, \Phi(x_N):

V = \sum_{j=1}^{N} \alpha_j \Phi(x_j)   (8)

and

(\Phi(x_k) \cdot \bar{C} V) = \lambda (\Phi(x_k) \cdot V), \quad \forall k = 1, \ldots, N.   (9)

• Remember

\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i) \Phi(x_i)^T

• Combining (8), (9) and (6), we have

\frac{1}{N} \sum_{j=1}^{N} \alpha_j \Big( \Phi(x_k) \cdot \sum_{i=1}^{N} \Phi(x_i) (\Phi(x_i) \cdot \Phi(x_j)) \Big) = \lambda \sum_{j=1}^{N} \alpha_j (\Phi(x_k) \cdot \Phi(x_j)), \quad \forall k = 1, \ldots, N.   (10)

(9)

\frac{1}{N} \sum_{j=1}^{N} \alpha_j \Big( \Phi(x_k) \cdot \sum_{i=1}^{N} \Phi(x_i) (\Phi(x_i) \cdot \Phi(x_j)) \Big) = \lambda \sum_{j=1}^{N} \alpha_j (\Phi(x_k) \cdot \Phi(x_j))

• Define an N × N matrix K by

K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)).   (11)

Then the above equation can be written in matrix form as

K^2 \alpha = N \lambda K \alpha   (12)

where \alpha is the column vector (\alpha_1, \ldots, \alpha_N)^T.

• Solve the following eigenvalue problem to obtain solutions for (12):

K \alpha = N \lambda \alpha   (13)

• See [2] for justification and further details.

(10)

• Let \lambda_1 \le \lambda_2 \le \ldots \le \lambda_N denote the eigenvalues of K, and \alpha^1, \ldots, \alpha^N the corresponding eigenvectors, with \lambda_p the first nonzero eigenvalue. We normalize \alpha^p, \ldots, \alpha^N by requiring

V^k \cdot V^k = 1, \quad \forall k = p, \ldots, N.   (14)

• Compute the projection onto V^k \in \mathbb{R}^D (k = p, \ldots, N). Let x be a test sample with image \Phi(x) \in \mathbb{R}^D:

V^k \cdot \Phi(x) = \sum_{j=1}^{N} \alpha_j^k (\Phi(x_j) \cdot \Phi(x))   (15)

• In summary:

1. Compute kernel matrix K.

2. Compute the eigenvectors and normalize them.

3. Compute projections of a test point onto the eigenvectors.
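The three steps above translate almost directly into NumPy. The sketch below is an illustration rather than the authors' code: the function name kpca_fit_transform and the choice of an RBF kernel are assumptions. It centers the kernel matrix as in [2], solves Kα = Nλα, rescales the α so that V^k · V^k = 1, and returns the projections of the training points.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); one possible kernel choice."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kpca_fit_transform(X, n_components=2, gamma=1.0):
    """Minimal KPCA sketch following steps 1-3 above."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)

    # Step 1: compute the kernel matrix and center it as in [2].
    one_N = np.full((N, N), 1.0 / N)
    K = K - one_N @ K - K @ one_N + one_N @ K @ one_N

    # Step 2: eigh solves K alpha = mu alpha with mu = N * lambda, in ascending order.
    mu, alpha = np.linalg.eigh(K)
    mu, alpha = mu[::-1], alpha[:, ::-1]              # largest eigenvalues first
    keep = mu > 1e-12                                  # drop (numerically) zero eigenvalues
    mu, alpha = mu[keep][:n_components], alpha[:, keep][:, :n_components]

    # Normalize so that V^k . V^k = (alpha^k)^T K alpha^k = mu_k ||alpha^k||^2 = 1.
    alpha = alpha / np.sqrt(mu)

    # Step 3: projection of training point x_i onto V^k is sum_j alpha_j^k K(x_j, x_i).
    return K @ alpha

# Usage: two noisy concentric circles, a classic example where linear PCA struggles.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.standard_normal(200)
X = np.c_[r * np.cos(t), r * np.sin(t)]
Z = kpca_fit_transform(X, n_components=2, gamma=0.5)
print(Z.shape)  # (200, 2)
```

Projecting a new test point would additionally require centering its kernel values against the training data; the sketch handles only the training projections.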

(11)

Example

• Let x = (x_1, x_2) \in \mathbb{R}^2 and let \Phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2). Then

(x \cdot y)^2 = (x_1 y_1 + x_2 y_2)^2 = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)(y_1^2, y_2^2, \sqrt{2} y_1 y_2)^T = (\Phi(x) \cdot \Phi(y))

• In general, the dot product kernel

k(x, y) = (x \cdot y)^m   (16)

corresponds to a dot product in the space of m-th order monomials of the input coordinates.

• In the context of image analysis, we can thus easily work in the space spanned by products of any m pixels, without any explicit use of the mapped pattern \Phi_m(x).

• The dimensionality of \mathbb{R}^D is \frac{(d+m-1)!}{m!\,(d-1)!} and grows like d^m.

(12)

• For instance, a 16 × 16 pixel image with a polynomial kernel of degree 5 yields a dimensionality of about 10^{10}.

• In short, the use of kernel functions is one way to compute higher-order statistics without a combinatorial explosion of time and memory complexity!

• The same idea is exploited in nonlinear SVMs.
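A quick numerical check of the degree-2 example above (illustrative only; the explicit map phi2 is my own naming):

```python
import numpy as np
from math import comb

def phi2(x):
    """Explicit degree-2 feature map for x in R^2: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(1)
x, y = rng.standard_normal(2), rng.standard_normal(2)

lhs = np.dot(x, y) ** 2          # kernel evaluation (x . y)^2
rhs = np.dot(phi2(x), phi2(y))   # dot product in the monomial feature space
print(np.isclose(lhs, rhs))      # True

# Dimension count (d + m - 1)! / (m! (d - 1)!) for d = 256 pixels, m = 5:
print(comb(256 + 5 - 1, 5))      # 9,525,431,552, i.e. roughly 10^10
```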

(13)

Dimensionality Reduction, Feature Extraction and Pre-Image

• For a data set of N points x ∈ Rd, PCA is able to extract up to d nonzero eigenvalues from the d × d covariance matrix.

• In contrast, KPCA is able to extract up to N nonzero eigenvalues from the N × N kernel matrix. If N > d, KPCA can therefore find more principal components.

• PCA is able to reconstruct an original pattern x_i from its set of principal components (x_i · v_j).

• In contrast, with KPCA we can in general obtain only an approximate reconstruction (the pre-image problem).
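To make the contrast concrete (an added note using the standard formulation, not taken from the slide): PCA reconstructs a pattern from its first q principal components as

\hat{x} = \sum_{j=1}^{q} (x \cdot v_j)\, v_j   (exact when q = d),

whereas KPCA must search the input space for an approximate pre-image,

\hat{x} = \arg\min_{z \in \mathbb{R}^d} \| P_q \Phi(x) - \Phi(z) \|^2,

where P_q denotes the projection onto the first q kernel principal components.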

(14)

Properties of KPCA

• KPCA carries out the standard PCA in the high dimensional feature space

• The statistical properties of PCA carry over to kernel-based PCA

• In the feature space, i.e., RD, PCA is the orthogonal basis transformation with the following properties:

- the first q (q = 1, . . . , N) principal components, i.e., the projections on eigenvectors carry more variance than any other q orthogonal directions.

- the mean squared approximation error in representing the observations by the first q principal components is minimal.

- the principal components are uncorrelated.

- the first q principal components have maximal mutual information with respect to the inputs (under Gaussian assumption).

(15)

Kernel Discriminant Analysis (KDA)

• Kernelize the conventional Fisher linear discriminant method.

• Carry out Fisher linear discriminant analysis in the feature space, but without the time and memory cost of working in that space explicitly.

• Has been shown to outperform SVM in some applications.

• Connections to SVM and RBF.

• Other kernel methods, e.g., kernel independent component analysis (KICA).

(16)

Input Space or Feature Space?

• Images are high-dimensional vectors in the input space.

• Kernel methods conceptually project samples to a high (possibly infinite) dimensional feature space.

• Can the sheer abundance of features be useful or harmful for the classification task?

• The answer is borne out by experiments and empirical comparisons.

(17)

Bagging

• The name of the game: Bootstrap aggregating [Breiman 96b].

• Vote among classifiers generated from different bootstrap samples (with replicates).

A bootstrap sample is generated by uniformly sampling N samples from the training set with replacement.

• T bootstrap samples B_1, \ldots, B_T are generated, and a classifier C_i is built from each bootstrap sample B_i.

• A final classifier C is built from C_1, \ldots, C_T whose output is the class predicted most often by its sub-classifiers, with ties broken arbitrarily.

• For a given bootstrap sample, an instance in the training set has probability 1 - (1 - 1/N)^N of being selected at least once in the N times instances are randomly drawn from the training set.

• For large N, this is about 1 - 1/e ≈ 63.2%, which means each B_i contains only about 63.2% of the unique instances from the training set.
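A quick sanity check of the 63.2% figure (an illustrative script, not from the slides):

```python
import numpy as np

N = 10_000
# Closed form: probability that an instance appears at least once in a bootstrap sample.
print(1 - (1 - 1 / N) ** N)          # ~0.6321, approaching 1 - 1/e

# Empirical check: fraction of unique training instances in one bootstrap sample.
rng = np.random.default_rng(0)
sample = rng.integers(0, N, size=N)  # draw N indices with replacement
print(len(np.unique(sample)) / N)    # also ~0.632
```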

(18)

• Such perturbation causes different C_i to be built if the inducer (learner, classifier) is unstable (e.g., neural networks, decision trees), and performance can be improved.

(19)

The Bagging Algorithm

1. for i = 1 to T {
2.     S' = bootstrap sample from S (i.i.d. sample with replacement)
3.     C_i = I(S')   (build a classifier C_i)
4. }
5. C(x) = arg max_{y \in Y} \sum_{i: C_i(x) = y} 1   (the most often predicted label y)
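A minimal sketch of this procedure (my own illustration, not code from the lecture; the nearest-centroid inducer is chosen only to keep the example self-contained, whereas in practice unstable inducers such as decision trees benefit most):

```python
import numpy as np

class NearestCentroid:
    """A deliberately simple base inducer I(S') used only for illustration."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

def bagging(X, y, T=25, seed=0):
    """Build T classifiers from bootstrap samples and vote them (steps 1-5 above)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, N, size=N)                  # bootstrap sample S'
        classifiers.append(NearestCentroid().fit(X[idx], y[idx]))
    def predict(X_test):
        votes = np.stack([c.predict(X_test) for c in classifiers])   # shape (T, n_test)
        # Majority vote per test point; ties broken arbitrarily by argmax.
        return np.array([np.bincount(col).argmax() for col in votes.T])
    return predict

# Usage on a toy two-class problem with integer labels 0/1.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))]
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
predict = bagging(X, y)
print((predict(X) == y).mean())   # training accuracy of the bagged ensemble
```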

(20)

Boosting

• Boosting was first introduced by Schapire; a further improvement, Adaboost (Adaptive Boosting), was proposed by Freund and Schapire.

• Like Bagging, Adaboost generates a set of classifiers and votes them. Adaboost generates the classifiers sequentially, while Bagging generates them in parallel.

• Adaboost changes the weights of the training instances provided as input to each learner based on classifiers that were previously built.

• The goal is to force the inducer to minimize expected error over different input distributions.

• T weighted training sets S1, . . . , ST are generated in sequence and T classifiers C1, . . . , CT are built.

• A final classifier C is formed using a weighted voting scheme: the weight of each classifier depends on its performance on the training set used to build it.

(21)

Adaboost Algorithm (M1)

1. S' = S, with every instance weight set to 1.
2. For i = 1 to T {
3.     C_i = I(S')
4.     ε_i = (1/N) \sum_{x_j \in S': C_i(x_j) \ne y_j} weight(x_j)   (weighted error on the training set)
5.     If ε_i > 1/2, set S' to a bootstrap sample from S with weight 1 for every instance and go to step 3.
6.     β_i = ε_i / (1 - ε_i)
7.     For each x_j \in S', if C_i(x_j) = y_j then weight(x_j) = weight(x_j) · β_i.
8.     Normalize the weights of the instances so the total weight of S' is N.
   }
9. C(x) = arg max_{y \in Y} \sum_{i: C_i(x) = y} \log(1/β_i).
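A compact sketch of the algorithm above (illustrative Python, not the authors' code; the decision-stump inducer and all function names are my own, and the bootstrap fallback of step 5 is replaced by simply stopping early):

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, polarity) with lowest weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) > 0, 1, 0)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best[1:]

def stump_predict(model, X):
    f, thr, pol = model
    return np.where(pol * (X[:, f] - thr) > 0, 1, 0)

def adaboost_m1(X, y, T=10):
    """Steps 1-9 above for binary labels {0, 1}, with weights kept normalized to total N."""
    N = len(X)
    w = np.ones(N)                          # step 1
    models, betas = [], []
    for _ in range(T):                      # step 2
        model = stump_fit(X, y, w)          # step 3
        pred = stump_predict(model, X)
        eps = w[pred != y].sum() / N        # step 4 (total weight is N)
        if eps > 0.5:                       # step 5, simplified: stop instead of resampling
            break
        eps = max(eps, 1e-10)               # guard against a perfect stump
        beta = eps / (1 - eps)              # step 6
        w[pred == y] *= beta                # step 7
        w *= N / w.sum()                    # step 8
        models.append(model)
        betas.append(beta)
    def predict(X_test):                    # step 9: weighted vote with weight log(1/beta_i)
        score = np.zeros(len(X_test))
        for model, beta in zip(models, betas):
            vote = np.where(stump_predict(model, X_test) == 1, 1, -1)
            score += np.log(1 / beta) * vote
        return (score > 0).astype(int)      # equivalent to arg max over the two labels
    return predict

# Usage on a toy two-class problem.
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
y = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
predict = adaboost_m1(X, y, T=20)
print((predict(X) == y).mean())
```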

(22)

• The update rule is mathematically equivalent to the following update rule: for each x_j, divide weight(x_j) by 2ε_i if C_i(x_j) ≠ y_j and by 2(1 - ε_i) otherwise.

• The incorrect instances are reweighted by a factor inversely proportional to the error on the training set, i.e., 1/(2ε_i). A small training-set error, such as 0.1%, will cause the weights to grow by several orders of magnitude.

• The proportion of misclassified instances is ε_i, and these instances get boosted by a factor of 1/(2ε_i), so the total weight of the misclassified instances after updating is half the original training-set weight. Similarly, the correctly classified instances end up with a total weight equal to half the original weight, and thus no further normalization is required.

• The Adaboost algorithm requires a weak learner whose error is bounded by a constant strictly less than 1/2.
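A tiny numerical check (my own, not from the slides) that the two update rules agree and that each group ends up with half the total weight:

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, N); w *= N / w.sum()       # instance weights with total N
correct = np.array([True, True, False, True, False, True, True, False])
eps = w[~correct].sum() / N                          # weighted training error

# Rule from the algorithm: multiply correct weights by beta_i, then renormalize to N.
beta = eps / (1 - eps)
w1 = w.copy(); w1[correct] *= beta; w1 *= N / w1.sum()

# Equivalent rule: divide by 2*eps if misclassified, by 2*(1 - eps) otherwise.
w2 = np.where(correct, w / (2 * (1 - eps)), w / (2 * eps))

print(np.allclose(w1, w2))                           # True
print(w1[~correct].sum() / N, w1[correct].sum() / N) # 0.5 0.5: each group holds half the weight
```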

(23)

More Details on Adaboost

Refer to Schapire’s tutorial on boosting.

(24)

Real-Time Adaboost-based Face Detector

(25)–(33)

[Slides 25–33: figures for the real-time Adaboost-based face detector; no recoverable text.]
(34)

Ensemble Learning

Many ways to combine classifiers

• Randomization algorithm

• Application example: Ensemble tracking by Avidan.

References

[1] S. Avidan. Ensemble tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 494–501, 2005.

[2] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.

[3] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

[4] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2, pages 734–741, 2003.
