Prepare Kernel Data

(1)

Han-Ming Wu (吳漢銘), Department of Statistics, National Taipei University

Kernel Method

C06

(2)

Chapter Outline

• Kernel Methods, Kernel Trick

• Kernel Data and Its Properties

• PCA/SIR in the Euclidean Space

• Kernel PCA, Kernel SIR in a Non-linear Feature Space

• Relations Towards Other Methods

• KSIR for Nonlinear Dimension Reduction

• Experiments on Classification

(3)

Kernel Methods

• Aronszajn (1950) and Parzen (1962) were among the first to employ kernel methods in statistics.

• Aizerman et al. (1964) used positive definite kernels in a way that was closer to the "kernel trick": they argued that a positive definite kernel is identical to a dot product in the feature space.

• Schölkopf et al. (1998) pointed out that kernels can be used to construct generalizations of any algorithm that can be carried out in terms of dot products.

• Over the last 20 years, a large number of algorithms have been kernelized (PCA, LDA, CCA, PLS, …).

• Boser et al. (1992) used kernels to construct SVMs, a generalization of the so-called optimal hyperplane algorithm.

(4)

Prepare Kernel Data

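[Figure: input points x_i, x_j and their feature-space images Φ(x_i), Φ(x_j), with the map between them marked "?". Panel labels: "In theory" and "In practice".]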

(5)

Data Representation

• Data are no longer represented individually, but only through a set of pairwise comparisons.

• The representation as a square matrix does not depend on the nature of the objects being analyzed.

• The size of the matrix used to represent a dataset of n objects is always n by n.
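A minimal sketch of the n-by-n point, assuming a Gaussian kernel of the form exp(-sigma * ||x - y||^2) and simulated data. The function name `gauss_kernel_matrix` and the value of `sigma` are illustrative and not taken from the slides.

```r
# Gaussian kernel matrix computed from pairwise squared distances.
# `gauss_kernel_matrix` and `sigma` are illustrative names (assumptions).
gauss_kernel_matrix <- function(X, sigma = 0.05) {
  D2 <- as.matrix(dist(X))^2      # n x n squared Euclidean distances
  exp(-sigma * D2)
}

set.seed(1)
n <- 6
X_low  <- matrix(rnorm(n * 2),  nrow = n)   # 6 objects described by 2 variables
X_high <- matrix(rnorm(n * 50), nrow = n)   # 6 objects described by 50 variables

K_low  <- gauss_kernel_matrix(X_low)
K_high <- gauss_kernel_matrix(X_high)

dim(K_low)    # 6 6
dim(K_high)   # 6 6
```

Both matrices are 6 by 6 even though the objects are described by 2 and 50 variables respectively: only the number of objects determines the size of the representation.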


(6)

Kernel as Inner Product

(Aronszajn 1950)

A Hilbert space is a vector space endowed with a dot product that is complete for the induced norm. R^p with the classical inner product is an example of a finite-dimensional Hilbert space.

[Photo: David Hilbert (01/23/1862 – 02/14/1943)]
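As a small worked example of a kernel acting as an inner product, the sketch below uses the homogeneous degree-2 polynomial kernel on R^2, for which an explicit feature map is known. The names `phi` and `k_poly2` are illustrative and do not come from the slides.

```r
# Explicit feature map for the homogeneous degree-2 polynomial kernel on R^2.
# `phi` and `k_poly2` are illustrative names (assumptions).
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
k_poly2 <- function(x, y) sum(x * y)^2

x <- c(1, 2)
y <- c(3, -1)

k_poly2(x, y)          # kernel evaluation in the input space: 1
sum(phi(x) * phi(y))   # dot product in the feature space: also 1 (up to rounding)
```

The kernel value is obtained without ever forming phi(x), which is what makes the approach practical when the feature space is very high dimensional.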


(7)

Reproducing Kernel Hilbert Space


(8)

Kernel Trick

• The kernel trick was first published in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning" (Aizerman et al.).

• Any algorithm for vectorial data that can be expressed solely in terms of dot products between vectors can be performed implicitly in the feature space associated with a kernel, by replacing each dot product with a kernel evaluation.

• This is a very convenient way to turn linear methods, such as LDA or PCA, into nonlinear methods: simply replace the classical dot product with a more general kernel.

• The kernel trick applies to any algorithm that depends solely on the dot product between two vectors: wherever a dot product is used, it is replaced with the kernel function.

• The resulting nonlinear algorithm is the linear algorithm operating in the feature space.

• Kernelization: the operation that transforms a linear algorithm into a more general kernel method.
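The replacement rule above can be sketched on the simplest dot-product-only quantity, the squared distance between two points. The helper names `k_gauss` and `feature_dist2` and the value of `sigma` are assumptions made for illustration.

```r
# Gaussian kernel in the exp(-sigma * ||x - y||^2) form; names are illustrative.
k_gauss <- function(x, y, sigma = 0.5) exp(-sigma * sum((x - y)^2))

# Squared distance between phi(x) and phi(y), written with dot products only:
# <phi(x),phi(x)> - 2<phi(x),phi(y)> + <phi(y),phi(y)>, each replaced by k(.,.).
feature_dist2 <- function(x, y, k) k(x, x) - 2 * k(x, y) + k(y, y)

x <- c(0, 0)
y <- c(1, 1)

feature_dist2(x, y, k_gauss)                      # 2 - 2 * exp(-1)
feature_dist2(x, y, function(a, b) sum(a * b))    # linear kernel: ordinary squared distance, 2
```

With the linear kernel the function returns the usual Euclidean squared distance; swapping in another kernel gives the distance in that kernel's feature space without changing the algorithm.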


(9)

Kernel Data: Properties

• The kernel map can bring the data distribution closer to elliptical symmetry. Kernel data are (with empirical and theoretical justification):

  • closer to elliptically symmetric in distribution;

  • closer to normal (Gaussian).

• Raw data live in the Euclidean space R^p; kernel data live in an RKHS H_k.

• A classical approach on R^p, defined via a specific statistical notion, corresponds to a kernel approach on H_k, which is exactly the classical procedure applied to the kernel data.

• Main goal: in parallel to classical multivariate statistical analysis, we aim to develop an analysis tool in the Gaussian reproducing kernel Hilbert space.

• Main advantage: a nonparametric approach with a "parametric-plus" computing load.

  parametric: classical multivariate analysis procedures.

  plus: kernel data preparation.

(10)

Example: Better Elliptical Symmetry

• The kernel map can bring the data distribution closer to elliptical symmetry.

[Figure: scatterplot of the raw data (x1, x2) beside a scatterplot of the kernel data.]

• A Gaussian kernel with scale = 0.05 is used.

• Each column of the raw data is scaled to unit variance before the transformation.
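A sketch of this preparation step in base R. Two things are assumptions: the iris measurements stand in for the (unnamed) data on the slide, and the slide's scale = 0.05 is read as the `sigma` of a kernel of the form exp(-sigma * ||x - y||^2), the parameterization used by kernlab's `rbfdot`.

```r
# Scale each column to unit variance (no centering), then build the kernel data.
raw <- as.matrix(iris[, 1:4])                       # stand-in data set (assumption)
X   <- scale(raw, center = FALSE, scale = apply(raw, 2, sd))

sigma <- 0.05                                       # assumed meaning of "scale = 0.05"
K <- exp(-sigma * as.matrix(dist(X))^2)             # 150 x 150 kernel data

apply(X, 2, var)   # each column now has variance 1
dim(K)             # 150 150
```

Note that `scale(raw, center = FALSE)` with its default `scale = TRUE` would divide by the root mean square rather than the standard deviation, which is why the standard deviations are passed explicitly.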


(11)

Example: Normal Probability Plot

[Figure: normal probability plots for x2 and for kernel-data columns: the best four, the median four, and the worst four.]

(12)

Example: Justification of Gaussianity

Theoretical justification of Gaussianity. For details: Huang, S.Y., Hwang, C.R. and Lin, M.H., Kernel Fisher's Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Space.

Empirical justification of Gaussianity:

Kolmogorov-Smirnov test. H₀: the data follow a normal distribution.

Number of p-values > 0.05: 97; number of p-values > 0.01: 142.

Exercise: prepare your data to carry out the above empirical justification.
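One way to set up this check is sketched below in base R. The iris data, the kernel parameter 0.05, and the helper name `ks_pvalue` are assumptions for illustration; the counts on the slide come from the author's own data and are not expected to be reproduced here.

```r
# Kernel data for the iris measurements (stand-in data set; sigma is an assumption).
X <- scale(as.matrix(iris[, 1:4]))
K <- exp(-0.05 * as.matrix(dist(X))^2)   # 150 x 150, one column per kernel variable

# Kolmogorov-Smirnov test of each standardized column against N(0, 1).
# Ties (duplicate rows in iris) trigger a warning, silenced here.
ks_pvalue <- function(v) {
  z <- (v - mean(v)) / sd(v)
  suppressWarnings(ks.test(z, "pnorm")$p.value)
}
pvals <- apply(K, 2, ks_pvalue)

sum(pvals > 0.05)   # number of columns not rejected at the 5% level
sum(pvals > 0.01)   # number of columns not rejected at the 1% level
```

Because the mean and standard deviation are estimated from the same column that is tested, these p-values tend to be conservative; a Lilliefors-type correction would be a stricter alternative.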


(13)

PCA in the Euclidean Space


(14)

Kernel PCA


(15)

Kernel PCA: kpca {kernlab}

> library(kernlab)
> rbf <- rbfdot(sigma = 0.05)  # Gaussian radial basis kernel function
> rbf
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.05
> KX <- kernelMatrix(kernel = rbf, x = as.matrix(iris[, 1:4]))  # calculate the kernel matrix
> dim(KX)
[1] 150 150

• rbfdot (Radial Basis kernel function)

• polydot (Polynomial kernel function)

• vanilladot (Linear kernel function)

• tanhdot (Hyperbolic tangent kernel function)

# hold out 20 observations as test points
test <- sample(1:150, 20)
iris.kpca <- kpca(~., data = iris[-test, -5], kernel = "rbfdot",
                  kpar = list(sigma = 0.2), features = 2)

# print the principal component vectors
pcv(iris.kpca)

# plot the data projection on the components
plot(rotated(iris.kpca), col = as.integer(iris[-test, 5]),
     xlab = "1st Principal Component", ylab = "2nd Principal Component",
     main = "KPCA for iris data")

# embed remaining points
emb <- predict(iris.kpca, as.matrix(iris[test, -5]))
points(emb, col = as.integer(iris[test, 5]), pch = 17, cex = 1.5)

kernlab: Kernel-Based Machine Learning Lab

(16)

SIR in the Euclidean Space

• Li (1991) introduced the following model:
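The model formula itself did not survive on this slide. As a reminder, and as an assumption about what the slide showed, the model in Li (1991) is usually written as follows, where the $\beta_k$ are unknown direction vectors in $\mathbb{R}^p$, $f$ is an unknown link function, and $\varepsilon$ is noise independent of $x$:

```latex
y = f\left(\beta_1^{\top} x,\ \beta_2^{\top} x,\ \dots,\ \beta_K^{\top} x,\ \varepsilon\right)
```

The response depends on $x$ only through the $K$ projections $\beta_k^{\top} x$, so estimating the span of the $\beta_k$ reduces the dimension from $p$ to $K$ without specifying $f$.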


(17)

SIR: Algorithm


(18)

SIR: Theorem


(19)

Kernel SIR in a Non-linear Feature Space

Kernel SIR: kernelize the SIR algorithm.

(20)

KSIR: Algorithm


(21)

KSIR

(cont.)

(22)

KSIR

(cont.)

(23)

Normalization and Projection


(24)

Centering in Feature Space

[Equations: centering of the kernel matrix, for training data and for testing data.]
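The slide's own formulas were not recoverable. The sketch below uses the standard centering from the kernel PCA literature, which is an assumption about what the slide showed; the helper names `center_train` and `center_test` are illustrative.

```r
# Illustrative helpers (names are not from the slides).
# K:  n x n kernel matrix of the training data.
# Kt: m x n kernel matrix between m test points and the n training points.
center_train <- function(K) {
  n <- nrow(K)
  J <- matrix(1 / n, n, n)
  K - J %*% K - K %*% J + J %*% K %*% J
}
center_test <- function(Kt, K) {
  n <- nrow(K); m <- nrow(Kt)
  J  <- matrix(1 / n, n, n)
  Jt <- matrix(1 / n, m, n)
  Kt - Jt %*% K - Kt %*% J + Jt %*% K %*% J
}

# Check with the linear kernel, where the feature map is the identity and
# centering in feature space must match centering the raw data.
set.seed(1)
X  <- matrix(rnorm(20), 10, 2)
Xt <- matrix(rnorm(6), 3, 2)
mu  <- colMeans(X)
Xc  <- sweep(X, 2, mu)
Xtc <- sweep(Xt, 2, mu)

all.equal(center_train(X %*% t(X)), Xc %*% t(Xc))                 # TRUE
all.equal(center_test(Xt %*% t(X), X %*% t(X)), Xtc %*% t(Xc))    # TRUE
```

Test points are centered with the training mean, not their own, which is why `center_test` needs the training kernel matrix as well.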


(25)

Reduced Features

• We are not working in the full feature space, but only in a comparatively small linear subspace of it, whose dimension is at most the number of observations.

• Working in a space whose dimension equals the number of observations can pose difficulties.

• To deal with these, one can either use only a subset of the extracted features, or use some other form of capacity control or regularization.

For theoretical details:

Lee, Y.J. and Huang, S.Y. (2006), Reduced support vector machines: a statistical theory, IEEE Transactions on Neural Networks, accepted.

http://dmlab1.csie.ntust.edu.tw/downloads

(26)

Relations Towards Other Methods

• KSIR generalizes SIR to a nonlinear method by kernelizing the SIR algorithm.

• It finds a nonlinear dimension reduction (d.r.) subspace, namely a central d.r. subspace in H_k.

• It is a semiparametric method.

• SIR: spectral analysis of cov(E[x|y]) with respect to cov(x).

• KSIR: spectral analysis of a generalized association measure.
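The SIR line above can be sketched in base R as a generalized eigenproblem of the between-slice covariance against cov(x). This is a minimal illustration on simulated data, not the author's implementation; `sir_directions` and the number of slices `H` are illustrative names.

```r
# Minimal SIR sketch in base R; `sir_directions` and `H` are illustrative names.
sir_directions <- function(X, y, H = 8) {
  n  <- nrow(X)
  Xc <- sweep(X, 2, colMeans(X))                 # center the predictors
  Sx <- crossprod(Xc) / n                        # cov(x)

  # Slice the response into H groups of roughly equal size.
  slice  <- cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)
  counts <- as.vector(table(slice))
  means  <- rowsum(Xc, slice) / counts           # slice means, estimate of E[x | y]
  M      <- crossprod(means * sqrt(counts / n))  # weighted cov of the slice means

  # Generalized eigenproblem M b = lambda Sx b, solved by whitening with Sx^(-1/2).
  e  <- eigen(Sx, symmetric = TRUE)
  W  <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  ea <- eigen(W %*% M %*% W, symmetric = TRUE)
  list(values = ea$values, directions = W %*% ea$vectors)
}

set.seed(1)
X <- matrix(rnorm(1000 * 4), ncol = 4)
y <- X[, 1] + 0.1 * rnorm(1000)      # y depends on x only through the first coordinate
fit <- sir_directions(X, y)
b1  <- fit$directions[, 1]
round(abs(b1) / sqrt(sum(b1^2)), 2)  # weight should concentrate on the first coordinate
```

Applying the same function to kernel data in place of `X` is one way to read the kernelization step, although with as many kernel variables as observations cov(x) becomes singular and some form of reduction or regularization would be needed first.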


(27)

Relations Towards Other Methods

Kernel Fisher discriminant analysis as a special case of CCA.

Kuss, M. and Graepel, T., The Geometry of Kernel Canonical Correlation Analysis, Technical Report 108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany (May 2003).

Chen, C. H., and Li, K. C. (2001)

(28)

Visualization: Square Data (150x2)

[Figure: KPCA and KSIR (H = 8) projections V₁, V₂, V₃ of the square data, shown for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10.]

(29)

Visualization: Three Clusters Data (220x2)

[Figure: KPCA and KSIR projections V₁, V₂, V₃ of the three clusters data, shown for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10.]

(30)

Visualization: Iris Data (150x4)

• The sepal length, sepal width, petal length, and petal width are measured in centimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica (Fisher, 1936).

[Figure: projections of the iris data by PCA, SIR, KPCA, and KSIR (Gaussian kernel, s = 0.05).]

(31)

Visualization: Wine Data (178x13)

[Figure: projections of the wine data by PCA, SIR, KPCA, and KSIR (Gaussian kernel, s = 0.05).]

• The wine data (n = 178) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

Past usage (z-transformed data, leave-one-out): RDA 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1%.

(32)

Visualization: Pendigit Data (7494x16)

• Pen-based recognition of handwritten digits

• 7494 instances, 16 attributes

• 10 classes

[Figure: projections of the pendigit data by PCA, SIR, KPCA, and KSIR. Settings: Gaussian 0.05; random sampling 200.]

(33)

Classification: UCI Data Sets

[Table not recovered. Settings: Gaussian 0.05; random sampling 200.]

(34)

Classification: Microarray Data Sets
