Introduction to Support Vector Machines
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk in NTU EE/CS Speech Lab, November 16, 2005
Setup
fixed-length example: a D-dimensional vector x, each component is a feature
– raw digital sampling of a 0.5 sec. wave file
– DFT of the raw sampling
– a fixed-length feature vector extracted from the wave file
label: a number y ∈ Y
binary classification: is there a man speaking in the wave file?
(y = +1 if man, y = −1 if not)
multi-class classification: which speaker is speaking?
(y ∈ {1,2, · · · ,K})
regression: how excited is the speaker? (y ∈ R)
Binary Classification Problem
learning problem: given training examples and labels {(x_i, y_i)}_{i=1}^{N}, find a function g(x): X → Y that predicts the label of unseen x well
vowel identification: given training wave files and their vowel labels, find a function g(x) that translates wave files to vowel labels well
we will focus on the binary classification problem: Y = {+1, −1}
the most basic learning problem, but very useful and extensible to other problems
illustrative demo: for examples with two different colors in a D-dimensional space, how can we “separate” the examples?
Hyperplane Classifier
use a hyperplane to separate the two colors:
    g(x) = sign(w^T x + b)

if w^T x + b ≥ 0, the classifier returns +1; otherwise it returns −1
possibly lots of hyperplanes satisfy our needs, which one should we choose?
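As a tiny illustration (not from the original slides, assuming numpy), a minimal sketch of such a hyperplane classifier; the weight vector w and bias b below are arbitrary placeholders:

```python
import numpy as np

def hyperplane_classify(X, w, b):
    """Return +1/-1 predictions of the hyperplane classifier sign(w^T x + b)."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# toy 2-D example with an arbitrary (w, b); any separating hyperplane would do
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0]])
print(hyperplane_classify(X, w, b))   # [ 1 -1]
```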
SVM: Large-Margin Hyperplane Classifier
margin ρ_i = y_i(w^T x_i + b)/‖w‖_2: does y_i agree with w^T x_i + b in sign?
how large is the distance between the example and the separating hyperplane?
large positive margin → clear separation → low-risk classification
idea of SVM: maximize the minimum margin

    max_{w,b}  min_i ρ_i
    s.t.       ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0
Hard-Margin Linear SVM
maximize the minimum margin:

    max_{w,b}  min_i ρ_i
    s.t.       ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0,  i = 1, …, N
equivalent to:

    min_{w,b}  (1/2) w^T w
    s.t.       y_i(w^T x_i + b) ≥ 1,  i = 1, …, N
– hard-margin linear SVM
quadratic programming with D+1 variables: well-studied in optimization
is the hard-margin linear SVM good enough?
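One way to experiment with (approximately) hard-margin behavior, not from the original talk: scikit-learn's SVC with a linear kernel and a very large C, which makes margin violations prohibitively costly on separable toy data:

```python
import numpy as np
from sklearn.svm import SVC

# linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# a very large C approximates the hard-margin linear SVM
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the learned (w, b)
print(clf.predict([[1.9, 2.5]]))   # [1]
```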
Soft-Margin Linear SVM
hard-margin – hard constraints on separation:
    min_{w,b}  (1/2) w^T w
    s.t.       y_i(w^T x_i + b) ≥ 1,  i = 1, …, N

no feasible solution if some noisy outliers exist
soft-margin – soft constraints as cost:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N

allow the noisy examples to have ξ_i > 0 with a cost
is linear SVM good enough?
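A minimal sketch of the soft-margin trade-off, assuming scikit-learn's SVC (not part of the original talk, which uses LIBSVM): the cost parameter C controls how much a noisy example is allowed to violate the margin:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D toy data with one noisy outlier (the example at 0.5 carries a "wrong-side" label)
X = np.array([[0.0], [1.0], [3.0], [4.0], [0.5]])
y = np.array([-1, -1, 1, 1, 1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates the outlier (larger margin); large C pays more to fit it
    print(C, clf.coef_.ravel(), clf.intercept_)
```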
Soft-Margin Nonlinear SVM
what if we want a boundary g(x) = sign(x^T x − 1)?
can never be constructed with a hyperplane classifier sign(w^T x + b)
however, we can have more complex feature transforms:
    φ(x) = [(x)_1, (x)_2, …, (x)_D, (x)_1(x)_1, (x)_1(x)_2, …, (x)_D(x)_D]
there is a classifier sign(w^T φ(x) + b) that describes the boundary
soft-margin nonlinear SVM:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N

– with nonlinear φ(·)
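A sketch of this idea, assuming numpy and scikit-learn (not in the original slides): generate labels from the circular boundary sign(x^T x − 1), apply an explicit degree-2 transform φ, and fit a linear SVM in the transformed space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) - 1 >= 0, 1, -1)   # boundary sign(x^T x - 1)

def phi(X):
    # degree-2 transform: original features plus all pairwise products
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x1, x1 * x2, x2 * x2])

clf = SVC(kernel="linear", C=10.0).fit(phi(X), y)
print(clf.score(phi(X), y))   # close to 1.0: the circle is linear in phi-space
```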
Feature Transformation
what feature transforms φ(·) should we use?
we can only extract a small, finite number of features, but we can use an unlimited number of feature transforms
traditionally:
– use domain knowledge to do feature transformation
– use only “useful” feature transformations
– use a small number of feature transformations
– control the goodness of fitting by a suitable choice of feature transformations
what if we use an “infinite number” of feature transformations, and let the algorithm decide a good w automatically?
would an infinite number of transformations introduce overfitting?
are we able to solve the optimization problem?
Dual Problem
an infinite-dimensional quadratic program if φ(·) is infinite-dimensional:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N
luckily, we can solve its associated dual problem:
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C,
    where   Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)

α: an N-dimensional vector
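To make the dual concrete, a small numpy sketch (not from the slides) that builds Q from a kernel function standing in for φ^T(x_i) φ(x_j); the helper name build_Q is made up for illustration:

```python
import numpy as np

def build_Q(X, y, kernel):
    """Q_ij = y_i y_j K(x_i, x_j): the N x N matrix of the dual QP."""
    K = kernel(X, X)
    return (y[:, None] * y[None, :]) * K

linear_kernel = lambda A, B: A @ B.T   # placeholder for phi^T(x_i) phi(x_j)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
Q = build_Q(X, y, linear_kernel)
print(Q.shape)   # (3, 3): the dual has N variables, independent of dim(phi)
```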
Solution of the Dual Problem
associated dual problem:
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C,
    where   Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)
equivalent solution:
    g(x) = sign(w^T φ(x) + b) = sign( Σ_i y_i α_i φ^T(x_i) φ(x) + b )
no need for w and φ(x) explicitly if we can compute K(x, x′) = φ^T(x) φ(x′) efficiently
Kernel Trick
let the kernel be K(x, x′) = φ^T(x) φ(x′)
revisit: can we compute the kernel of

    φ(x) = [(x)_1, (x)_2, …, (x)_D, (x)_1(x)_1, (x)_1(x)_2, …, (x)_D(x)_D]

efficiently?
well, not really
how about this?

    φ(x) = [√2 (x)_1, √2 (x)_2, …, √2 (x)_D, (x)_1(x)_1, …, (x)_D(x)_D]

    K(x, x′) = (1 + x^T x′)^2 − 1
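A quick numerical check (not in the original slides, assuming numpy) that this φ indeed has the claimed kernel: the explicit O(D^2) inner product matches (1 + x^T x′)^2 − 1 computed in O(D):

```python
import numpy as np

def phi(x):
    # [sqrt(2) x_1, ..., sqrt(2) x_D, x_1 x_1, x_1 x_2, ..., x_D x_D]
    return np.concatenate([np.sqrt(2) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=3), rng.normal(size=3)

explicit = phi(x) @ phi(xp)            # O(D^2) work in the transformed space
kernel   = (1 + x @ xp) ** 2 - 1       # O(D) work via the kernel
print(np.isclose(explicit, kernel))    # True
```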
Different Kernels
types of kernels
– linear: K(x, x′) = x^T x′
– polynomial: K(x, x′) = (a x^T x′ + r)^d
– Gaussian RBF: K(x, x′) = exp(−γ ‖x − x′‖_2^2)
– Laplacian RBF: K(x, x′) = exp(−γ ‖x − x′‖_1)
the last two equivalently have feature transformations in an infinite-dimensional space!
new paradigm for machine learning: use many, many feature transformations; control the goodness of fitting by the large margin (clear separation) and the violation cost (amount of outliers allowed)
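These kernels map onto standard SVM software; the sketch below assumes scikit-learn's SVC (the talk itself points to LIBSVM, which offers the same linear/polynomial/RBF choices), with the Laplacian RBF passed as a callable kernel:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

models = {
    "linear":    SVC(kernel="linear"),
    "poly":      SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0),  # (a x^T x' + r)^d
    "rbf":       SVC(kernel="rbf", gamma=0.5),                        # exp(-gamma ||x - x'||^2)
    "laplacian": SVC(kernel=lambda A, B: laplacian_kernel(A, B, gamma=0.5)),
}
for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))
```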
Support Vectors: Meaningful Representation
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C

equivalent solution:

    g(x) = sign( Σ_i y_i α_i K(x_i, x) + b )
only those with α_i > 0 are needed for classification – support vectors
from the optimality conditions, α_i:
– “= 0”: not needed in constructing the decision function; away from the boundary or on the boundary
– “> 0 and < C”: free support vector, on the boundary
– “= C”: bounded support vector, violates the boundary (ξ_i > 0) or is on the boundary
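A sketch of how these quantities show up in practice, assuming scikit-learn (not part of the original talk): a fitted SVC stores the support vectors, the products y_i α_i (dual_coef_), and b (intercept_), so the decision function can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# rebuild sum_i y_i alpha_i K(x_i, x) + b using only the support vectors
K = rbf_kernel(X, clf.support_vectors_, gamma=1.0)      # K(x, x_i) for SVs x_i
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X)))    # True
print(len(clf.support_vectors_), "support vectors out of", len(X))
```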
Why is SVM Successful?
infinite number of feature transformations: suitable for conquering nonlinear classification tasks
large-margin concept: theoretically promising
soft-margin trade-off: controls regularization well
convex optimization problems: possible for good optimization algorithms (compared to Neural Networks and some other learning algorithms)
support vectors: useful in data analysis and interpretation
Why is SVM Not Successful?
SVM can be sensitive to scaling and parameters
standard SVM is only a “discriminative” classification algorithm
SVM training can be time-consuming when N is large and the solver is not carefully implemented
infinite number of feature transformations ⇔ mysterious classifier
Useful Extensions of SVM
multiclass SVM: use the 1vs1 approach to combine binary SVMs into a multiclass classifier
– the label that gets the most votes from the classifiers is the prediction
probability output: transform the raw output w^T φ(x) + b to a value in [0, 1] that represents P(+1 | x)
– use a sigmoid function to transform from R to [0, 1] (see the sketch after this list)
infinite ensemble learning (Lin and Li 2005):
if the kernel K(x, x′) = −‖x − x′‖_1 is used for the standard SVM, the classifier is equivalently

    g(x) = sign( ∫ w_θ s_θ(x) dθ + b )

where s_θ(x) is a thresholding rule on one feature of x.
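A hedged sketch of the first two extensions, assuming scikit-learn's SVC (not in the original talk): its multiclass mode uses the 1vs1 scheme internally, and probability=True fits a sigmoid (Platt scaling) to the raw outputs:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # a 3-class problem

# multiclass handled by combining binary SVMs with the one-vs-one (1vs1) scheme
clf = SVC(kernel="rbf", gamma=0.1, C=1.0, probability=True).fit(X, y)

print(clf.predict(X[:2]))                  # voted class labels
# probability=True maps raw outputs through a fitted sigmoid into [0, 1]
print(clf.predict_proba(X[:2]).round(3))   # per-class probability estimates
```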
Basic Use of SVM
scale each feature of your data to a suitable range (say, [−1, 1])
use a Gaussian RBF kernel K(x, x′) = exp(−γ ‖x − x′‖_2^2)
use cross validation and grid search to determine a good (γ, C) pair
use the best (γ, C) on your training set
do testing with the SVM classifier
all included in LIBSVM (from the lab of Prof. Chih-Jen Lin); a scripted version of this recipe is sketched below
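A sketch of this recipe, assuming scikit-learn rather than the LIBSVM tools the talk refers to; the dataset and the (γ, C) grid are arbitrary placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# scale each feature to [-1, 1], then a Gaussian RBF kernel SVM
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))

# cross-validated grid search over (gamma, C)
grid = GridSearchCV(pipe, {"svc__gamma": [0.01, 0.1, 1], "svc__C": [1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)        # the chosen (gamma, C)
print(grid.score(X_te, y_te))   # test accuracy of the retrained best model
```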
Advanced Use of SVM
include domain knowledge by specific kernel design (e.g. train a generative model for feature extraction, and use the extracted features in SVM to get discriminative power)
combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)
fine-tune SVM parameters with specific knowledge of your problem (e.g. different costs for different examples?)
interpret the SVM results you get (e.g. are the SVs meaningful?)
Resources
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
Kernel Machines Forum: http://www.kernel-machines.org
Hsu, Chang, and Lin: A Practical Guide to Support Vector Classification
my email: htlin@caltech.edu
acknowledgment: some figures obtained from Prof. Chih-Jen Lin