Han-Ming Wu (吳漢銘), Department of Statistics, National Taipei University
Kernel Method
C06
Chapter Outline
Kernel Methods, Kernel Trick
Kernel Data and Its Properties
PCA/SIR in the Euclidean Space
Kernel PCA, Kernel SIR in a Non-linear Feature Space
Relations Towards Other Methods
KSIR for Nonlinear Dimensional Reduction
Experiments on Classification
Kernel Methods
Aronszajn (1950) and Parzen (1962) were among the first to employ kernel methods in statistics.
Aizerman et al. (1964) used positive definite kernels in a way that was closer to the "kernel trick"; they argued that a positive definite kernel is identical to a dot product in a feature space.
Boser et al. (1992) used kernels to construct SVMs, a generalization of the so-called optimal hyperplane algorithm.
Schölkopf et al. (1998) pointed out that kernels can be used to construct generalizations of any algorithm that can be carried out in terms of dot products.
Over the last 20 years, a large number of algorithms have been kernelized (PCA, LDA, CCA, PLS, ...).
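Prepare Kernel Data
[Figure: the points xi and xj are mapped to Φ(xi) and Φ(xj) in the feature space; the panels are labeled "In theory" and "In practice".]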
4/34
Data Representation
Data are not represented individually anymore, but only through a set of pairwise comparisons.
The representation as a square matrix does not depend on the nature of the objects to be analyzed.
The size of the matrix used to represent a dataset of n objects is always n by n.
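The pairwise-comparison representation can be sketched in base R as follows. This is a minimal illustration, not the slide's own code: the function name gaussian_kernel_matrix is hypothetical, and the parameterization exp(-sigma * ||x - z||^2) is assumed to match kernlab's rbfdot.

```r
# Gaussian (RBF) kernel matrix: K[i, j] = exp(-sigma * ||x_i - x_j||^2).
# Function name and defaults are illustrative assumptions.
gaussian_kernel_matrix <- function(X, sigma = 0.05) {
  X <- as.matrix(X)
  sq_norms <- rowSums(X^2)
  # Squared distances via ||a||^2 + ||b||^2 - 2 * a.b
  D2 <- outer(sq_norms, sq_norms, "+") - 2 * tcrossprod(X)
  D2[D2 < 0] <- 0  # guard against tiny negative values from rounding
  exp(-sigma * D2)
}

# 150 objects with 4 variables become a 150 by 150 matrix of comparisons
X <- unname(scale(as.matrix(iris[, 1:4])))
K <- gaussian_kernel_matrix(X, sigma = 0.05)
dim(K)
isSymmetric(K)
```

The number of columns of X does not affect the size of K; only the number of objects does.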
Kernel as Inner Product
(Aronszajn 1950)
A Hilbert space is a vector space endowed with a dot product that is complete for the induced norm.
R^p with the classical inner product is an example of a finite-dimensional Hilbert space.
David Hilbert (January 23, 1862 to February 14, 1943)
Reproducing Kernel Hilbert Space
Kernel Trick
The kernel trick was first published in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning" (Aizerman et al.).
Any algorithm for vectorial data that can be expressed only in terms of dot products between vectors can be performed implicitly in the feature space
associated with any kernel, by replacing each dot product by a kernel evaluation.
It is a very convenient trick for transforming linear methods, such as LDA or PCA, into nonlinear methods, simply by replacing the classical dot product with a more general kernel.
The kernel trick applies to any algorithm that depends solely on the dot product between two vectors: wherever a dot product is used, it is replaced with the kernel function.
The non-linear algorithm is the linear algorithm operating in the feature space.
Kernelization: the operation that transforms a linear algorithm into a more general kernel method.
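A small worked example may make the replacement of dot products concrete. It assumes the homogeneous polynomial kernel of degree 2 on R^2, whose feature map is known in closed form; the names poly2_kernel and phi are illustrative only.

```r
# Polynomial kernel of degree 2 on R^2: k(x, z) = (x . z)^2
poly2_kernel <- function(x, z) sum(x * z)^2

# Its explicit feature map into R^3
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

x <- c(1, 2)
z <- c(3, -1)

# The kernel evaluation equals the dot product in the feature space
k_val   <- poly2_kernel(x, z)
dot_val <- sum(phi(x) * phi(z))
all.equal(k_val, dot_val)

# A quantity built from dot products only, here the squared distance in
# the feature space, can be computed without ever forming phi()
dist2_kernel   <- poly2_kernel(x, x) - 2 * poly2_kernel(x, z) + poly2_kernel(z, z)
dist2_explicit <- sum((phi(x) - phi(z))^2)
all.equal(dist2_kernel, dist2_explicit)
```

For the Gaussian kernel the feature space is infinite-dimensional, so only the kernel-side computation is available.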
Kernel Data: Properties
A kernel map can bring the data distribution closer to elliptical symmetry. With both empirical and theoretical justification, kernel data are:
more nearly elliptically symmetrically distributed;
more nearly normal (Gaussian).
Raw data on the Euclidean space R^p are mapped to kernel data on an RKHS H_k.
A classical procedure, built on a specific statistical notion on R^p, becomes a kernel approach on H_k, which is exactly the classical procedure applied to the kernel data.
Main goal: Parallel to the classical multivariate statistical analysis, we aim to develop an analysis tool in the Gaussian reproducing kernel Hilbert space.
Main advantage: a nonparametric approach with a "parametric-plus" computing load.
parametric: classical multivariate analysis procedures.
plus: kernel data preparation.
Example: Better Elliptical Symmetry
Kernel map can bring the data distribution to better elliptical symmetry.
[Figure: scatterplot of the raw data (x1, x2) next to a scatterplot of the kernel data.]
A Gaussian kernel with scale = 0.05 is used.
Each column of the raw data is scaled to unit variance before the transformation.
Example: Normal Probability Plot
[Figure: normal probability plots for x2 and for the kernel data (best four, median four, worst four).]
Example: Justification of Gaussianity
For details:
Huang, S.Y., Hwang, C.R. and Lin, M.H. Kernel Fisher's Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Space.
Theoretical Justification of Gaussianity
Empirical Justification of Gaussianity:
Kolmogorov-Smirnov Test: H0: The data follow a normal distribution.
Number of p-values > 0.05: 97; number of p-values > 0.01: 142.
Exercise: prepare your own data and repeat the empirical justification above.
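One possible way to set up such a check in base R is sketched below. Which variables were tested on the slide is not recoverable, so this sketch assumes that each column of the Gaussian kernel matrix is treated as one kernel-data variable; the data set (iris) and sigma = 0.05 are likewise illustrative choices.

```r
# Kernel data: columns of a Gaussian kernel matrix (assumed setup)
X  <- unname(scale(as.matrix(iris[, 1:4])))
sq <- rowSums(X^2)
K  <- exp(-0.05 * pmax(outer(sq, sq, "+") - 2 * tcrossprod(X), 0))

# Kolmogorov-Smirnov test of each standardized column against N(0, 1).
# Warnings about ties are suppressed because iris contains repeated rows.
pvals <- apply(K, 2, function(v) {
  z <- (v - mean(v)) / sd(v)
  suppressWarnings(ks.test(z, "pnorm")$p.value)
})

# Counts in the same form as reported on the slide
c(above_0.05 = sum(pvals > 0.05), above_0.01 = sum(pvals > 0.01))
```

Because the mean and standard deviation are estimated from the same column, these p-values tend to be larger than they should be, so the counts are only a rough guide.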
PCA in the Euclidean Space
Kernel PCA
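As a sketch of what kernel PCA computes, the steps can be written in base R directly on a kernel matrix: center it in the feature space, take its eigendecomposition, and rescale the eigenvectors. The function name kernel_pca is hypothetical, and this is the textbook formulation rather than necessarily the derivation given on the slide.

```r
kernel_pca <- function(K, n_comp = 2) {
  n <- nrow(K)
  J <- diag(n) - matrix(1 / n, n, n)      # centering matrix
  Kc <- J %*% K %*% J                     # kernel matrix centered in feature space
  eig <- eigen(Kc, symmetric = TRUE)
  lambda <- eig$values[seq_len(n_comp)]
  V <- eig$vectors[, seq_len(n_comp), drop = FALSE]
  # Coefficients alpha = v / sqrt(lambda) give unit-norm directions
  alpha <- sweep(V, 2, sqrt(lambda), "/")
  list(values = lambda, alpha = alpha, scores = Kc %*% alpha)
}

# Sanity check: with the linear kernel, kernel PCA reduces to ordinary PCA
X <- unname(as.matrix(iris[, 1:4]))
kp <- kernel_pca(tcrossprod(X), n_comp = 2)
pc <- prcomp(X)$x[, 1:2]
max(abs(abs(kp$scores) - abs(pc)))        # numerically zero (signs may differ)
```

Replacing tcrossprod(X) with a Gaussian kernel matrix gives the nonlinear version with no other change to the code.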
Kernel PCA: kpca {kernlab}
> library(kernlab)
> rbf <- rbfdot(sigma = 0.05) #Radial Basis kernel function
> rbf
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.05
> KX <- kernelMatrix(kernel=rbf, x=as.matrix(iris[,1:4])) # calculate kernel matrix
> dim(KX)
[1] 150 150
• rbfdot (Radial Basis kernel function)
• polydot (Polynomial kernel function)
• vanilladot (Linear kernel function)
• tanhdot (Hyperbolic tangent kernel function)
test <- sample(1:150, 20)
iris.kpca <- kpca(~., data=iris[-test, -5], kernel="rbfdot", kpar=list(sigma=0.2), features=2)
# print the principal component vectors
pcv(iris.kpca)
# plot the data projection on the components
plot(rotated(iris.kpca), col=as.integer(iris[-test, 5]), xlab="1st Principal Component",
ylab="2nd Principal Component", main="KPCA for iris data")
# embed remaining points
emb <- predict(iris.kpca, as.matrix(iris[test, -5]))
points(emb, col=as.integer(iris[test, 5]), pch=17, cex=1.5)
kernlab: Kernel-Based Machine Learning Lab
SIR in the Euclidean Space
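Li (1991) introduced the following model: y = f(β1'x, ..., βK'x, ε), where the βk are unknown p-vectors, ε is independent of x, and f is an arbitrary unknown function.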
SIR: Algorithm
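The usual SIR steps (slice y, average x within slices, then solve an eigenproblem of the between-slice covariance relative to cov(x)) can be sketched in base R as follows. The function name sir and its arguments are illustrative; this follows the standard description of Li (1991) and may differ in detail from the algorithm on the slide.

```r
sir <- function(X, y, H = 8, d = 2) {
  X <- as.matrix(X)
  n <- nrow(X)
  Xc <- sweep(X, 2, colMeans(X))                 # center the predictors
  Sigma <- cov(X)
  # Slice: split the ordered responses into H groups of roughly equal size
  slice <- cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)
  counts <- as.numeric(table(slice))
  M <- rowsum(Xc, slice) / counts                # H x p matrix of slice means
  Sigma_eta <- crossprod(M * sqrt(counts / n))   # weighted covariance of slice means
  # Solve Sigma_eta b = lambda Sigma b through Sigma^(-1/2)
  e <- eigen(Sigma, symmetric = TRUE)
  S_inv_half <- e$vectors %*% diag(1 / sqrt(e$values), nrow = ncol(X)) %*% t(e$vectors)
  g <- eigen(S_inv_half %*% Sigma_eta %*% S_inv_half, symmetric = TRUE)
  B <- S_inv_half %*% g$vectors[, seq_len(d), drop = FALSE]
  list(values = g$values[seq_len(d)], directions = B, scores = Xc %*% B)
}

# Simulated single-index example: y depends on x only through x1
set.seed(42)
n <- 400; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.1 * rnorm(n)
fit <- sir(X, y, H = 8, d = 1)
abs(cor(fit$scores[, 1], X[, 1]))   # close to 1 when the direction is recovered
```

For classification, the class labels can serve directly as slices instead of cutting a continuous response.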
SIR: Theorem
Kernel SIR in a Non-linear Feature Space
Kernel SIR: kernelize the SIR algorithm.
KSIR: Algorithm
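A rough base R sketch of one way to kernelize SIR is given below. It is an assumption-laden illustration, not the algorithm from the slide: the kernel data are first compressed to a few leading kernel principal components (a low-rank approximation used here for numerical stability), and ordinary SIR is then applied to those whitened components. The name ksir_sketch and the parameter n_basis are hypothetical.

```r
ksir_sketch <- function(X, y, sigma = 0.05, H = 8, d = 2, n_basis = 20) {
  X <- as.matrix(X)
  n <- nrow(X)
  sq <- rowSums(X^2)
  K <- exp(-sigma * pmax(outer(sq, sq, "+") - 2 * tcrossprod(X), 0))
  J <- diag(n) - matrix(1 / n, n, n)
  eK <- eigen(J %*% K %*% J, symmetric = TRUE)   # centered kernel matrix
  # Keep at most n_basis components with non-negligible eigenvalues
  r <- min(n_basis, sum(eK$values > 1e-8 * eK$values[1]))
  # Whitened kernel principal components: mean 0, covariance I (divisor n)
  Z <- sqrt(n) * eK$vectors[, seq_len(r), drop = FALSE]
  # Slices: class labels for a factor, otherwise H groups of ordered y
  slice <- if (is.factor(y)) as.integer(y) else
    cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)
  counts <- as.numeric(table(slice))
  M <- rowsum(Z, slice) / counts                 # slice means in kernel space
  g <- eigen(crossprod(M * sqrt(counts / n)), symmetric = TRUE)
  list(values = g$values[seq_len(d)],
       scores = Z %*% g$vectors[, seq_len(d), drop = FALSE])
}

# Illustrative use: three iris species as slices, two KSIR variates
fit_ksir <- ksir_sketch(scale(iris[, 1:4]), iris$Species, sigma = 0.05, d = 2)
dim(fit_ksir$scores)
```

The scores can be plotted with col = as.integer(iris$Species) to inspect class separation; the choice of n_basis acts as a form of regularization.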
KSIR (continued)
Normalization and Projection
Centering in Feature Space
(Centering is carried out separately for the training data and for the testing data.)
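The standard centering formulas can be sketched in base R as below; the slide's own equations are not recoverable, so the function names center_train and center_test are illustrative. Testing data are centered with the training mean in the feature space, which is why the test formula involves the training kernel matrix.

```r
# K: n x n kernel matrix of the training data
center_train <- function(K) {
  n <- nrow(K)
  One <- matrix(1 / n, n, n)
  K - One %*% K - K %*% One + One %*% K %*% One
}

# K_test: m x n kernel matrix between testing (rows) and training (columns) data
center_test <- function(K_test, K_train) {
  n <- nrow(K_train)
  m <- nrow(K_test)
  One_n  <- matrix(1 / n, n, n)
  One_mn <- matrix(1 / n, m, n)
  K_test - One_mn %*% K_train - K_test %*% One_n + One_mn %*% K_train %*% One_n
}

# Check with the linear kernel, where centering can be done on the data directly
X_all <- unname(as.matrix(iris[, 1:4]))
X_tr <- X_all[1:130, ]
X_te <- X_all[131:150, ]
mu <- colMeans(X_tr)
Kc_train <- center_train(tcrossprod(X_tr))
Kc_test  <- center_test(tcrossprod(X_te, X_tr), tcrossprod(X_tr))
all.equal(Kc_train, tcrossprod(sweep(X_tr, 2, mu)))
all.equal(Kc_test, tcrossprod(sweep(X_te, 2, mu), sweep(X_tr, 2, mu)))
```

The same two functions apply unchanged to a Gaussian kernel matrix, where the feature-space mean cannot be formed explicitly.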
Reduced Features
We are not working in the full feature space, but only in a comparatively small linear subspace of it, whose dimension is at most the number of observations.
Working in a space whose dimension equals the number of observations can still pose difficulties.
To deal with these, one can either use only a subset of the extracted features or use some other form of capacity control or regularization.
For Theoretical details:
Lee, Y.J. and Huang, S.Y. (2006), Reduced support vector machines: a statistical theory, IEEE Transactions on Neural Networks, accepted.
http://dmlab1.csie.ntust.edu.tw/downloads
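One simple way to reduce the number of features, in the spirit of a reduced kernel, is to evaluate the kernel only against a random subset of the observations, giving an n by m matrix instead of n by n. The sketch below is an illustration under that assumption; the function name reduced_kernel and its arguments are hypothetical.

```r
reduced_kernel <- function(X, m = 200, sigma = 0.05) {
  X <- as.matrix(X)
  # Random subset of observations used as the reduced basis
  idx <- sort(sample(nrow(X), min(m, nrow(X))))
  B <- X[idx, , drop = FALSE]
  D2 <- outer(rowSums(X^2), rowSums(B^2), "+") - 2 * tcrossprod(X, B)
  list(K = exp(-sigma * pmax(D2, 0)), index = idx)
}

set.seed(1)
rk <- reduced_kernel(unname(scale(as.matrix(iris[, 1:4]))), m = 30)
dim(rk$K)   # 150 rows (observations) by 30 columns (reduced features)
```

Each row of rk$K is then treated as the feature vector of one observation, so classical procedures operate on m columns rather than n.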
Relations Towards Other Methods
KSIR generalizes SIR to a nonlinear one by kernelization of the SIR algorithm.
It finds a nonlinear dimension reduction (d.r.) subspace, namely a central d.r. subspace in H_k.
A semiparametric method.
SIR: spectrum analysis of cov(E[x|y]) with respect to cov(x).
KSIR: spectrum analysis of a generalized association measure.
Relations Towards Other Methods
Kernel Fisher discriminant analysis as a special case of CCA.
Kuss, M. and Graepel, T. (2003). The Geometry of Kernel Canonical Correlation Analysis. Technical Report 108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, May 2003.
Chen, C. H., and Li, K. C. (2001)
Visualization: Square Data (150x2)
[Figure: KPCA and KSIR embeddings of the square data, components V1, V2, V3, for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10; KSIR with H = 8.]
Visualization: Three Clusters Data (220x2)
[Figure: KPCA and KSIR embeddings of the three clusters data, components V1, V2, V3, for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10.]
Visualization: Iris Data (150x4)
The sepal length, sepal width, petal length, and petal width are measured in centimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica.
Fisher (1936)
[Figure: PCA, SIR, KPCA, and KSIR projections of the iris data; Gaussian kernel, s = 0.05.]
Visualization: Wine Data (178x18)
[Figure: PCA, SIR, KPCA, and KSIR projections of the wine data; Gaussian kernel, s = 0.05.]
Wine data (n=178) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.
The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Past usage (z-transformed data, leave-one-out): RDA 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1%.
Visualization: Pendigit Data (7494x16)
Pen-based recognition of handwritten digits.
7494 instances, 16 attributes, 10 classes.
Gaussian kernel with s = 0.05; random sampling: 200.
[Figure: PCA, SIR, KPCA, and KSIR projections of the pendigit data.]
Classification: UCI Data Sets
Gaussian kernel with s = 0.05; random sampling: 200.