
Academic year: 2021


吳漢銘 (Han-Ming Wu), Department of Statistics, National Taipei University

Kernel Method

C06

Chapter Outline

Kernel Methods, Kernel Trick

Kernel Data and Its Properties

PCA/SIR in the Euclidean Space

Kernel PCA, Kernel SIR in a Non-linear Feature Space

Relations Towards Other Methods

KSIR for Nonlinear Dimension Reduction

Experiments on Classification

Kernel Methods

Aronszajn (1950) and Parzen (1962) were among the first to employ kernel methods in statistics.

Aizerman et al. (1964) used positive definite kernels in a way that was closer to the "kernel trick": they argued that a positive definite kernel is identical to a dot product in the feature space.

Schölkopf et al. (1998) pointed out that kernels can be used to construct generalizations of any algorithm that can be carried out in terms of dot products.

Over the last 20 years, a large number of algorithms have been kernelized (PCA, LDA, CCA, PLS, …).

Boser et al. (1992) used kernels to construct SVMs, a generalization of the so-called optimal hyperplane algorithm.


Prepare Kernel Data

[Figure: data points xi and xj are mapped to Φ(xi) and Φ(xj) in the feature space. Panel labels: "In theory" and "In practice".]

Data Representation

Data are not represented individually anymore, but only through a set of pairwise comparisons.

 The representation as a square matrix does not depend on the nature of the objects to be analyzed.

The size of the matrix used to represent a dataset of n objects is always n by n.
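As a rough illustration of this representation, the sketch below builds the n-by-n matrix of pairwise comparisons with a Gaussian kernel in base R. The helper name gaussian_kernel_matrix and the convention k(x, z) = exp(-sigma * ||x - z||^2) are assumptions made for this example, not something stated on the slide.

```r
# Hypothetical helper (not from the slides): Gaussian kernel matrix
# with the assumed convention k(x, z) = exp(-sigma * ||x - z||^2).
gaussian_kernel_matrix <- function(X, sigma = 0.05) {
  X <- as.matrix(X)
  sq_dist <- as.matrix(dist(X))^2   # squared Euclidean distances, n x n
  exp(-sigma * sq_dist)
}

X  <- scale(iris[, 1:4])                 # 150 objects described by 4 variables
K4 <- gaussian_kernel_matrix(X)          # uses all 4 columns
K2 <- gaussian_kernel_matrix(X[, 1:2])   # uses only 2 columns

dim(K4)   # 150 150
dim(K2)   # 150 150: the size depends on n only, not on the number of variables
```

Whatever the objects are (vectors, strings, graphs), only the function that compares two of them changes; the downstream analysis sees an n-by-n matrix.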


Kernel as Inner Product

(Aronszajn 1950)

A Hilbert space is a vector space endowed with an inner product that is complete with respect to the induced norm. Rp with the classical inner product is an example of a finite-dimensional Hilbert space.

David Hilbert (01/23/1862 – 02/14/1943)
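To make the "kernel as inner product" idea concrete, here is a small sketch using a kernel whose feature map can be written down explicitly. The choice of the homogeneous polynomial kernel of degree 2 on R2, and the names poly2_kernel and phi, are assumptions for illustration only.

```r
# Assumed example: homogeneous polynomial kernel of degree 2 on R^2.
poly2_kernel <- function(x, z) sum(x * z)^2

# Explicit feature map into R^3 whose ordinary inner product gives the kernel.
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

x <- c(1, 2)
z <- c(3, -1)

k_value  <- poly2_kernel(x, z)       # kernel evaluated in the input space
ip_value <- sum(phi(x) * phi(z))     # inner product in the feature space
all.equal(k_value, ip_value)         # TRUE
```

For the Gaussian kernel the feature space is infinite-dimensional, so the map cannot be listed like this, but the same identity holds.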


Reproducing Kernel Hilbert Space


Kernel Trick

The kernel trick was first published in the 1964 paper "Theoretical foundations of the potential function method in pattern recognition learning".

Any algorithm for vectorial data that can be expressed only in terms of dot products between vectors can be performed implicitly in the feature space associated with any kernel, by replacing each dot product by a kernel evaluation.

It is a very convenient trick to transform linear methods, such as LDA or PCA, into nonlinear methods, by simply replacing the classic dot product by a more general kernel.

The kernel trick applies to any algorithm that depends solely on the dot product between two vectors: wherever a dot product is used, it is replaced with the kernel function.

The non-linear algorithm is the linear algorithm operating in the feature space.

Kernelization: the operation that transforms a linear algorithm into a more general kernel method.
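A minimal sketch of kernelization, assuming a squared distance as the "algorithm": the distance is first written using dot products only, and then the dot product is swapped for a kernel. The function names and the Gaussian parameterization exp(-sigma * ||x - z||^2) are illustrative assumptions.

```r
# Squared distance between phi(x) and phi(z), written only with dot products:
# ||phi(x) - phi(z)||^2 = <phi(x),phi(x)> - 2 <phi(x),phi(z)> + <phi(z),phi(z)>.
kernel_sq_dist <- function(x, z, kernel) {
  kernel(x, x) - 2 * kernel(x, z) + kernel(z, z)
}

linear_kernel   <- function(x, z) sum(x * z)   # classic dot product
gaussian_kernel <- function(x, z, sigma = 0.05) exp(-sigma * sum((x - z)^2))

x <- c(1, 2)
z <- c(3, -1)

d_linear   <- kernel_sq_dist(x, z, linear_kernel)    # ordinary squared distance
d_gaussian <- kernel_sq_dist(x, z, gaussian_kernel)  # squared distance in the Gaussian feature space
```

With the linear kernel the function reproduces the usual squared Euclidean distance; with the Gaussian kernel the same code computes a distance in the feature space without ever forming Φ(x).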


Kernel Data: Properties

The kernel map can bring the data distribution to better elliptical symmetry. Kernel data are (with empirical and theoretical justification):

more elliptically symmetrically distributed;

closer to normal (Gaussian).

Raw data on the Euclidean space Rp: classical approach on Rp, via a specific statistical notion.

Kernel data on an RKHS Hk: kernel approach on Hk, which is exactly the classical procedure applied to kernel data.

Main goal: Parallel to the classical multivariate statistical analysis, we aim to develop an analysis tool in the Gaussian reproducing kernel Hilbert space.

Main advantage: Nonparametric approach with "parametric-plus" computing load.

parametric: classical multivariate analysis procedures.

plus: kernel data preparation.


Example: Better Elliptical Symmetry

Kernel map can bring the data distribution to better elliptical symmetry.

[Figure: scatterplot of the raw data (x1, x2) next to a scatterplot of the kernel data.]

A Gaussian kernel with scale = 0.05 is used.

The raw data are scaled so that each column has unit variance before the transformation.

Example: Normal Probability Plot

[Figure: normal probability plots for x2 and for the best four, median four, and worst four cases among the kernel data.]

Example: Justification of Gaussianity

For details:

Huang, S.Y., Hwang, C.R. and Lin, M.H. Kernel Fisher's Discriminant Analysis in Gaussian Reproducing Kernel Hilbert Space.

Theoretical Justification of Gaussianity

Empirical Justification of Gaussianity:

Kolmogorov-Smirnov test, H0: the data follow a normal distribution.

Significance levels 0.05 and 0.01: number of p-values > 0.05 = 97; number of p-values > 0.01 = 142.

Exercise: prepare your own data and carry out the empirical justification above.
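One possible way to set up such a check in base R is sketched below. The data set (iris), the scale 0.05, and the choice to test each column of the kernel matrix against a normal distribution with that column's own mean and standard deviation are all assumptions; the slide does not say how its counts were obtained, so the numbers produced here need not match them.

```r
# Sketch of the empirical check, assuming iris as data and sigma = 0.05.
X <- scale(iris[, 1:4])                    # unit variance per column
K <- exp(-0.05 * as.matrix(dist(X))^2)     # Gaussian kernel data, 150 x 150

# Kolmogorov-Smirnov test of one kernel data column against a normal
# distribution with the column's own mean and standard deviation.
# Warnings about ties are suppressed (iris contains duplicated rows).
ks_pvalue <- function(v) {
  suppressWarnings(ks.test(v, "pnorm", mean = mean(v), sd = sd(v))$p.value)
}
p_values <- apply(K, 2, ks_pvalue)

sum(p_values > 0.05)   # number of columns not rejected at the 0.05 level
sum(p_values > 0.01)   # number of columns not rejected at the 0.01 level
```

Because the mean and standard deviation are estimated from the same column, the nominal p-values are conservative; a Lilliefors-type correction would be a stricter alternative.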


PCA in the Euclidean Space


Kernel PCA


Kernel PCA: kpca {kernlab}

kernlab: Kernel-Based Machine Learning Lab

> library(kernlab)
> rbf <- rbfdot(sigma = 0.05)  # Radial Basis kernel function
> rbf
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.05
> KX <- kernelMatrix(kernel = rbf, x = as.matrix(iris[, 1:4]))  # calculate kernel matrix
> dim(KX)
[1] 150 150

Kernel functions available in kernlab include:

rbfdot (Radial Basis kernel function)

polydot (Polynomial kernel function)

vanilladot (Linear kernel function)

tanhdot (Hyperbolic tangent kernel function)

# hold out 20 randomly chosen observations as test points
test <- sample(1:150, 20)
iris.kpca <- kpca(~., data = iris[-test, -5], kernel = "rbfdot", kpar = list(sigma = 0.2), features = 2)

# print the principal component vectors
pcv(iris.kpca)

# plot the data projection on the components
plot(rotated(iris.kpca), col = as.integer(iris[-test, 5]),
     xlab = "1st Principal Component", ylab = "2nd Principal Component",
     main = "KPCA for iris data")

# embed remaining points
emb <- predict(iris.kpca, as.matrix(iris[test, -5]))
points(emb, col = as.integer(iris[test, 5]), pch = 17, cex = 1.5)
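For readers who want to see what happens inside a kernel PCA routine, here is a base-R sketch that does not rely on kernlab. The helper kernel_pca is hypothetical and is not the kernlab implementation; the score scaling (square root of the eigenvalue times the eigenvector) is one common convention and may differ from what rotated() returns.

```r
# Base-R sketch of kernel PCA (hypothetical helper, not the kernlab code).
kernel_pca <- function(K, n_components = 2) {
  n  <- nrow(K)
  J  <- diag(n) - matrix(1 / n, n, n)    # centering matrix
  Kc <- J %*% K %*% J                    # kernel matrix centered in feature space
  eig    <- eigen(Kc, symmetric = TRUE)
  lambda <- eig$values[seq_len(n_components)]
  alpha  <- eig$vectors[, seq_len(n_components), drop = FALSE]
  # Scores of the training points on component k: sqrt(lambda_k) * alpha_k
  scores <- sweep(alpha, 2, sqrt(lambda), `*`)
  list(scores = scores, eigenvalues = lambda)
}

X     <- as.matrix(iris[, 1:4])
K_rbf <- exp(-0.2 * as.matrix(dist(X))^2)   # Gaussian kernel, sigma = 0.2 (assumed convention)
K_lin <- X %*% t(X)                         # linear kernel

kp_rbf <- kernel_pca(K_rbf)
kp_lin <- kernel_pca(K_lin)

# With the linear kernel, the scores agree with ordinary PCA up to sign.
pca_scores <- prcomp(X)$x[, 1:2]
max(abs(abs(kp_lin$scores) - abs(pca_scores)))   # numerically close to 0
```

The linear-kernel comparison is a useful sanity check: kernel PCA with the ordinary dot product is PCA, only computed from the n-by-n matrix instead of the covariance matrix.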


SIR in the Euclidean Space

Li (1991) introduced the following model
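In the notation commonly used for sliced inverse regression (the symbols below are assumed, since the slide's own equation is not reproduced here), the model is usually written as:

```latex
y = f\left(\beta_1^{\top} x,\; \beta_2^{\top} x,\; \dots,\; \beta_d^{\top} x,\; \varepsilon\right),
\qquad x \in \mathbb{R}^{p},\quad d \le p,
```

where f is an unknown link function, the noise ε is independent of x, and the vectors β1, …, βd span the effective dimension reduction (e.d.r.) subspace.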


SIR: Algorithm


SIR: Theorem


Kernel SIR in a Non-linear Feature Space

Kernel SIR: kernelize the SIR algorithm.

KSIR: Algorithm

KSIR (cont.)

Normalization and Projection


Centering in Feature Space

For Testing Data For Training Data
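The two cases can be sketched in base R as follows. The helpers center_train and center_test are hypothetical names, and the formulas are the standard feature-space centering identities rather than a transcription of the slide. K_test is assumed to be the m-by-n matrix of kernel values between m test points (rows) and n training points (columns).

```r
# Center a training kernel matrix: subtract the feature-space mean of the
# training points from every mapped point.
center_train <- function(K) {
  n   <- nrow(K)
  One <- matrix(1 / n, n, n)
  K - One %*% K - K %*% One + One %*% K %*% One
}

# Center a test-versus-training kernel matrix using the TRAINING mean.
center_test <- function(K_test, K_train) {
  n <- nrow(K_train)
  m <- nrow(K_test)
  One_n  <- matrix(1 / n, n, n)
  One_mn <- matrix(1 / n, m, n)
  K_test - One_mn %*% K_train - K_test %*% One_n + One_mn %*% K_train %*% One_n
}

# Check with the linear kernel, where centering can be done explicitly.
X       <- as.matrix(iris[, 1:4])
test_id <- seq(5, 150, by = 10)        # fixed hold-out set for reproducibility
X_train <- X[-test_id, ]
X_test  <- X[test_id, ]
mu      <- colMeans(X_train)

K_train <- X_train %*% t(X_train)
K_test  <- X_test %*% t(X_train)

explicit_train <- sweep(X_train, 2, mu) %*% t(sweep(X_train, 2, mu))
explicit_test  <- sweep(X_test, 2, mu) %*% t(sweep(X_train, 2, mu))

err_train <- max(abs(center_train(K_train) - explicit_train))   # close to 0
err_test  <- max(abs(center_test(K_test, K_train) - explicit_test))  # close to 0
```

The point to note is that test data are centered with the training mean, so the test formula needs the uncentered training kernel matrix as a second input.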


Reduced Features

We are not working in the full feature space, but just in a comparably small linear subspace of it, whose dimension equals at most the number of observations.

Working in a space whose dimension equals the number of observations can pose difficulties.

To deal with these, one can either use only a subset of the extracted features, or use some other form of capacity control or regularization.
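One way to work with a subset of features, sketched here under assumptions, is to evaluate the kernel only against a random subset of the observations, which gives a rectangular n-by-m matrix instead of the full n-by-n one. The helper reduced_kernel, the subset size of 20, and the Gaussian convention exp(-sigma * ||x - z||^2) are illustrative choices, not taken from the slides.

```r
# Rectangular ("reduced") Gaussian kernel between all rows of X and a subset.
reduced_kernel <- function(X, X_sub, sigma = 0.05) {
  sq_norm_X   <- rowSums(X^2)
  sq_norm_sub <- rowSums(X_sub^2)
  # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 <x, z>
  sq_dist <- outer(sq_norm_X, sq_norm_sub, `+`) - 2 * X %*% t(X_sub)
  exp(-sigma * pmax(sq_dist, 0))       # pmax guards against tiny negative values
}

set.seed(1)
X         <- scale(iris[, 1:4])
subset_id <- sample(nrow(X), 20)       # random subset of 20 observations
K_reduced <- reduced_kernel(X, X[subset_id, , drop = FALSE])

dim(K_reduced)   # 150 20
```

Each observation is then described by 20 kernel features rather than 150, which reduces both the computing load and the capacity of the model.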

For theoretical details:

Lee, Y.J. and Huang, S.Y. (2006), Reduced support vector machines: a statistical theory, IEEE Transactions on Neural Networks, accepted.

http://dmlab1.csie.ntust.edu.tw/downloads


Relations Towards Other Methods

KSIR generalizes SIR to a nonlinear one by kernelization of the SIR algorithm.

It finds a nonlinear dimension reduction (d.r.) subspace: a central d.r. subspace in Hk.

A semiparametric method.

SIR: spectrum analysis of cov(E[x|y]) with respect to cov(x).

KSIR: spectrum analysis of a generalized association measure.
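The SIR line above can be sketched in base R as a generalized eigenproblem: estimate cov(E[x|y]) from slice means and analyse it relative to cov(x). The helper sir_directions, the number of slices, and the simulated data are assumptions for illustration; this is a bare-bones version, not a full SIR implementation.

```r
# Minimal SIR sketch: slice y, average x within slices, then solve
# cov(E[x|y]) b = lambda cov(x) b.
sir_directions <- function(X, y, n_slices = 8) {
  X <- as.matrix(X)
  n <- nrow(X)
  Sigma_x <- cov(X)
  x_bar   <- colMeans(X)
  # Slices with nearly equal numbers of observations, ordered by y.
  slice <- cut(rank(y, ties.method = "first"), breaks = n_slices, labels = FALSE)
  Sigma_slice <- matrix(0, ncol(X), ncol(X))
  for (h in unique(slice)) {
    idx    <- slice == h
    diff_h <- colMeans(X[idx, , drop = FALSE]) - x_bar
    Sigma_slice <- Sigma_slice + (sum(idx) / n) * tcrossprod(diff_h)
  }
  # Symmetric form of the generalized eigenproblem via the Cholesky factor.
  R_inv <- solve(chol(Sigma_x))                 # Sigma_x^{-1} = R_inv %*% t(R_inv)
  eig   <- eigen(t(R_inv) %*% Sigma_slice %*% R_inv, symmetric = TRUE)
  list(values = eig$values, directions = R_inv %*% eig$vectors)
}

set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)
y <- (X[, 1] + X[, 2])^3 + 0.1 * rnorm(n)   # depends on x only through x1 + x2

fit <- sir_directions(X, y)
b1  <- fit$directions[, 1]
round(b1 / sqrt(sum(b1^2)), 2)   # roughly proportional to (1, 1, 0, 0, 0), up to sign
```

KSIR follows the same pattern with the columns of the (possibly reduced) kernel matrix playing the role of X, usually with some regularization because the kernel data covariance is close to singular.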


Relations Towards Other Methods

Kernel Fisher discriminant analysis as a special case of CCA.

Kuss, M. and Graepel, T. (2003). The Geometry of Kernel Canonical Correlation Analysis. Technical Report 108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany.

Chen, C. H., and Li, K. C. (2001)

Visualization: Square Data (150x2)

[Figure: KPCA and KSIR (H=8) results for the square data. Panels show the components V1, V2, V3 for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10.]

Visualization: Three Clusters Data (220x2)

[Figure: KPCA and KSIR results for the three clusters data. Panels show the components V1, V2, V3 for d = 1, 2, 3, 4 and for s = 0.01, 0.1, 1, 10.]

Visualization: Iris Data (150x4)

The sepal length, sepal width, petal length, and petal width are measured in centimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica (Fisher, 1936).

[Figure: panels PCA, SIR, KPCA, KSIR; Gaussian kernel with s=0.05.]

Visualization: Wine Data (178x13)

[Figure: panels PCA, SIR, KPCA, KSIR; Gaussian kernel with s=0.05.]

Wine data (n=178) are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

The analysis determined the quantities of 13 constituents found in each of the three types of wines.

Past usage: RDA 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data, leave-one-out).

Visualization: Pendigit Data (7494x16)

Pen-based recognition of handwritten digits: 7494 instances, 16 attributes, 10 classes.

Gaussian kernel, s = 0.05; random sampling: 200.

[Figure: panels PCA, SIR, KPCA, KSIR.]

Classification: UCI Data Sets

Gaussian kernel, s = 0.05; random sampling: 200.

Classification: Microarray Data Sets
