© Deng Cai, College of Computer Science, Zhejiang University

(1)

So Far…

Our goal (supervised learning):

 To learn a set of discriminant functions

Bayesian framework

 We could design an optimal classifier if we knew:

 P(ω_i): priors and P(x | ω_i): class‐conditional densities

 Using training data to estimate P(ω_i) and P(x | ω_i)

Directly learning discriminant functions from the training data

 We only know the form of the discriminant functions

 Linear Regression

 Logistic Regression

 SVM


Linear

(2)

Nonlinear Distributed Data

Impossible to separate with a hyperplane


(3)


Generalized Linear Function & Kernel Methods

Deng Cai (蔡登)

College of Computer Science Zhejiang University

[email protected]

(4)

A Circle from 2D to 3D

Here is an example of mapping a circle (a special case) in 2D to 3D, where the result is linearly separable.
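The idea can be sketched numerically. The slide's own figure and mapping are not reproduced here, so this sketch assumes one common choice, φ(x₁, x₂) = (x₁², √2·x₁x₂, x₂²); the radii, sample count, and function names are illustrative assumptions as well.

```python
import numpy as np

def phi(X):
    """Assumed mapping from 2D to 3D: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Two concentric circles: not separable by a line in 2D.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
unit = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 0.5 * unit   # class 2, radius 0.5 (illustrative)
outer = 2.0 * unit   # class 1, radius 2.0 (illustrative)

Z_inner, Z_outer = phi(inner), phi(outer)

def g(Z):
    """Linear discriminant in 3D: the circle x1^2 + x2^2 = 1 becomes the plane z1 + z3 = 1."""
    return Z[:, 0] + Z[:, 2] - 1.0

print(np.all(g(Z_inner) < 0), np.all(g(Z_outer) > 0))
```

Under this assumed mapping, a single plane in the 3D space plays the role that a circle would have to play in the original 2D space.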

(5)

Generalized Linear Discriminant Functions

Recall the Linear Discriminant Function

$g(x) = w^t x + w_0$

 positive implies class 1

 negative implies class 2

Generalized Linear Discriminant

 Add additional terms involving the products of features

 For example,

 Given: [x₁, x₂, x₃]

 Make it: [ x₁, x₂, x₃, x₁x₂, x₂x₃, x₁x₂x₃ ] by adding products of features.

 Learn a discriminant function that is linear in the new feature space
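A minimal sketch of the feature expansion above, using exactly the product terms listed on the slide. The weights `w`, the bias `w0`, and the sample point are hypothetical values chosen only for illustration.

```python
import numpy as np

def expand(x):
    """Map [x1, x2, x3] to [x1, x2, x3, x1*x2, x2*x3, x1*x2*x3]."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3, x1 * x2 * x3])

def g(x, w, w0):
    """Discriminant that is linear in the expanded features, nonlinear in x."""
    return float(w @ expand(x) + w0)

w = np.array([1.0, -1.0, 0.5, 2.0, 0.0, -1.0])  # hypothetical weights
w0 = -0.5                                        # hypothetical bias
x = np.array([1.0, 2.0, 3.0])

print(expand(x))
print(g(x, w, w0))
```

Any linear learner can be trained on `expand(x)` unchanged; only the feature vector grows.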

(6)

Quadratic Discriminant Function

 Obtained by adding pair‐wise products of features

$g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i}^{d} w_{ij} x_i x_j$

 g(x) positive implies class 1; g(x) negative implies class 2

g(x) = 0 represents a hyperquadric (hyperparaboloid, hyperellipsoid, or hyperhyperboloid), as opposed to a hyperplane in the linear discriminant case.

Adding more terms such as w_ijk x_i x_j x_k results in polynomial discriminant functions.

Linear part: (d+1) parameters

Quadratic part: d(d+1)/2 additional parameters
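The parameter counts can be checked by building the quadratic feature vector explicitly. This is a sketch; the function name and the ordering of the features are assumptions made for illustration.

```python
from itertools import combinations_with_replacement

import numpy as np

def quadratic_features(x):
    """Return [1, x_1..x_d, x_i*x_j for i <= j]; one weight per entry."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(d), 2)]
    return np.concatenate([[1.0], x, pairs])

for d in (2, 10, 50):
    n_linear = d + 1                # bias plus d linear weights
    n_quadratic = d * (d + 1) // 2  # one weight per unordered pair (i, j)
    total = len(quadratic_features(np.zeros(d)))
    print(d, n_linear, n_quadratic, total)
```

The total length equals (d+1) + d(d+1)/2, so the number of weights grows quadratically with the input dimension.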

(7)

Quadratic Discriminant Function

(8)

Quadratic Discriminant Functions

(9)

Generalized Discriminant Function

A generalized linear discriminant function can be written as

$g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x)$

Equivalently,

$g(x) = a^t y$

$\hat{d}$: dimensionality of the augmented feature space.

$a = [a_1, a_2, \ldots, a_{\hat{d}}]^t$: weights in the augmented feature space. Note that the function is linear in a.

$y = [y_1(x), y_2(x), \ldots, y_{\hat{d}}(x)]^t$: also called the augmented feature vector.

Setting $y_i(x)$ to be monomials results in polynomial discriminant functions.
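As a small illustration of a discriminant that is nonlinear in x but linear in a, the sketch below assumes a 1D input with monomial features y(x) = [1, x, x²] and hand-picked weights a = [-1, 0, 1]. Both the feature choice and the weights are hypothetical, not taken from the slides.

```python
import numpy as np

def y(x):
    """Augmented feature vector y(x) = [1, x, x^2] (assumed monomial features)."""
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 0.0, 1.0])  # hypothetical weights, giving g(x) = x^2 - 1

def g(x):
    """g(x) = a^t y(x): linear in a, quadratic in x."""
    return float(a @ y(x))

for x in (-2.0, -0.5, 0.5, 2.0):
    label = 1 if g(x) > 0 else 2
    print(x, g(x), label)
```

Class 1 occupies two disjoint intervals (x < -1 and x > 1), which no single threshold on x could produce, yet the decision rule is a plain dot product in the augmented space.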

(10)

Phi Function

The discriminant function g(x) is not linear in x, but is linear in y.

The mapping $y = [y_1(x), y_2(x), \ldots, y_{\hat{d}}(x)]^t$ takes a d‐dimensional vector x and maps it to a $\hat{d}$‐dimensional space. The mapping y is called the phi‐function.

When the input patterns x are not linearly separable in the input space, mapping them with the right phi‐function takes them to a space where the patterns are linearly separable.

Unfortunately, the curse of dimensionality makes it hard to capitalize on this in practice. A complete QDF involves (d+1)(d+2)/2 terms; even for a modest value such as d = 50, this requires 1326 terms.

(11)

Representer Theorem


(12)

Kernelized Ridge Regression

Woodbury matrix identity

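$w^* = \arg\min_w \|y - Xw\|^2 + \lambda \|w\|^2$

$w^* = (X^t X + \lambda I)^{-1} X^t y = X^t (X X^t + \lambda I)^{-1} y$

$\alpha \triangleq (K + \lambda I)^{-1} y, \quad K = X X^t$

$w^* = X^t \alpha = \sum_{i=1}^{n} \alpha_i x_i$

$f(x) = w^{*t} x = \sum_{i=1}^{n} \alpha_i \langle x_i, x \rangle$

(Here X is the n × d data matrix whose rows are the training samples $x_i^t$, and y is the vector of targets.)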
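A minimal NumPy sketch of kernelized ridge regression with a linear kernel, assuming the standard objective ‖y − Xw‖² + λ‖w‖² with samples stored as the rows of X. The data are random and the names (`lam`, `alpha`, `x_new`) are illustrative assumptions. It checks numerically that the primal solution and the dual (kernel) solution agree, which is what the Woodbury-style identity guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1          # illustrative sizes and regularization
X = rng.normal(size=(n, d))     # rows are training samples
y = rng.normal(size=n)          # regression targets

# Primal solution: w = (X^t X + lam * I_d)^{-1} X^t y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual solution: alpha = (K + lam * I_n)^{-1} y with K = X X^t
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)
w_dual = X.T @ alpha            # w = sum_i alpha_i x_i

# Prediction needs only inner products between x_new and the training samples.
x_new = rng.normal(size=d)
f_primal = w_primal @ x_new
f_dual = alpha @ (X @ x_new)    # sum_i alpha_i <x_i, x_new>

print(np.allclose(w_primal, w_dual), np.isclose(f_primal, f_dual))
```

Because the dual form touches the data only through inner products, `X @ X.T` and `X @ x_new` could be swapped for any kernel evaluation without changing the rest of the code.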

(13)

Support Vector Machine

The hyperplane of maximum margin is supported by the points (vectors) that lie on the margin. These are called support vectors.

Non-support vectors can move freely without affecting the position of the hyperplane, as long as they do not cross the margin.

(14)

Support Vector Machine

The final classifier is

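$f(x) = \mathrm{sgn}(w^{*t} x + b) = \mathrm{sgn}\left(\sum_{i} \alpha_i y_i \langle x_i, x \rangle + b\right)$

Note: for non‐support vectors, the corresponding $\alpha_i$ is zero.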
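The role of the support vectors can be sketched on a toy problem. The three training points, the multipliers `alpha`, and the bias `b` below are hand-picked assumptions that satisfy the hard-margin conditions for this tiny set; they are not the output of a solver.

```python
import numpy as np

def svm_predict(x, X_train, y_train, alpha, b):
    """Evaluate sgn(sum_i alpha_i * y_i * <x_i, x> + b)."""
    return int(np.sign(np.sum(alpha * y_train * (X_train @ x)) + b))

X_train = np.array([[1.0, 0.0], [-1.0, 0.0], [3.0, 0.0]])
y_train = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])  # third point lies beyond the margin: alpha = 0
b = 0.0

sv = alpha > 0                     # mask selecting the support vectors
x = np.array([0.7, -2.0])

full = svm_predict(x, X_train, y_train, alpha, b)
only_sv = svm_predict(x, X_train[sv], y_train[sv], alpha[sv], b)
print(full, only_sv)
```

Dropping the point with a zero multiplier leaves every prediction unchanged, so only the support vectors need to be stored at test time.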

(15)

Kernels

Let $k(x, x') \ge 0$ be some measure of similarity between objects $x, x' \in \mathcal{X}$, where $\mathcal{X}$ is some abstract space; we will call $k$ a kernel function.

 Typically the function is symmetric and non‐negative

Examples

 Linear kernels: $k(x, x') = x^t x'$

 Polynomial kernels: $k(x, x') = (x^t x' + 1)^p$

 RBF kernels: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
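The three example kernels can be written in a few lines of NumPy. This sketch assumes the common textbook forms (inner product, inner product plus one raised to a power, and a Gaussian of the squared distance); the parameter names `degree` and `sigma` and their defaults are illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^t z"""
    return float(x @ z)

def polynomial_kernel(x, z, degree=2):
    """k(x, z) = (x^t z + 1)^degree"""
    return float((x @ z + 1.0) ** degree)

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma**2)))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel):
    # Each kernel is symmetric: k(x, z) == k(z, x).
    print(kernel.__name__, kernel(x, z), kernel(z, x))
```

The RBF kernel equals 1 when its two arguments coincide and decays toward 0 as they move apart, which makes the "measure of similarity" reading especially direct.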

(16)

The advantages of kernel methods

Non‐linear classifiers

 The kernel determines the nonlinearity of the learned function.

The samples cannot be represented as feature vectors

 But we can get the similarity of two samples

 String kernels

 Graph kernels
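As one concrete instance of a string kernel, the sketch below implements a simple k-spectrum kernel, which counts substrings of length k shared by two strings. The choice of this particular kernel, the default k = 2, and the example strings are assumptions made for illustration; the slides do not specify which string kernel is meant.

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """Similarity of two strings: dot product of their k-mer count vectors."""
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Counter returns 0 for k-mers that are missing from counts_t.
    return sum(counts_s[kmer] * counts_t[kmer] for kmer in counts_s)

print(spectrum_kernel("banana", "bandana"))
print(spectrum_kernel("banana", "xyz"))
```

No explicit feature vector is ever built for a string; the kernel value alone is enough for any method that works through pairwise similarities.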
