### Introduction to Support Vector Machines

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk in NTU EE/CS Speech Lab, November 16, 2005

### Setup

fixed-length example: a $D$-dimensional vector $x$, where each component is a feature

- raw digital sampling of a 0.5 sec. wave file
- DFT of the raw sampling
- a fixed-length feature vector extracted from the wave file

label: a number $y \in \mathcal{Y}$

- binary classification: is there a man speaking in the wave file? ($y = +1$ if man, $y = -1$ if not)
- multi-class classification: which speaker is speaking? ($y \in \{1, 2, \cdots, K\}$)
- regression: how excited is the speaker? ($y \in \mathbb{R}$)

### Binary Classification Problem

learning problem: given training examples and labels $\{(x_i, y_i)\}_{i=1}^{N}$, find a function $g(x)\colon \mathcal{X} \to \mathcal{Y}$ that predicts the label of an unseen $x$ well

vowel identification: given training wave files and their vowel labels, find a function $g(x)$ that maps a wave file to its vowel well

we will focus on the binary classification problem: $\mathcal{Y} = \{+1, -1\}$

- the most basic learning problem, but very useful, and can be extended to other problems

illustrative demo: for examples with two different colors in a $D$-dimensional space, how can we "separate" the examples?

### Hyperplane Classifier

use a hyperplane to separate the two colors:

$$g(x) = \operatorname{sign}(w^T x + b)$$

if $w^T x + b \ge 0$, the classifier returns $+1$; otherwise the classifier returns $-1$

possibly lots of hyperplanes satisfy our needs; which one should we choose?
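As a quick illustration, a minimal numpy sketch of this classifier (the values of $w$ and $b$ below are arbitrary, for illustration only):

```python
# minimal sketch of g(x) = sign(w^T x + b); w and b are arbitrary examples
import numpy as np

def g(x, w, b):
    # returns +1 when w^T x + b >= 0, otherwise -1
    return 1 if w @ x + b >= 0 else -1

w, b = np.array([1.0, -2.0]), 0.5
print(g(np.array([3.0, 1.0]), w, b))  # w^T x + b = 1.5 >= 0, so prints 1
```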

### SVM: Large-Margin Hyperplane Classifier

margin $\rho_i = y_i(w^T x_i + b)/\|w\|_2$:

- does $y_i$ agree with $w^T x_i + b$ in sign?
- how large is the distance between the example and the separating hyperplane?

large positive margin → clear separation → low-risk classification

idea of SVM: maximize the minimum margin

$$\max_{w,b} \min_i \rho_i \quad \text{s.t.} \quad \rho_i = y_i(w^T x_i + b)/\|w\|_2 \ge 0$$
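For concreteness, a small sketch (numpy; the inputs are assumed to be given) computing every example's margin $\rho_i$:

```python
# sketch: rho_i = y_i (w^T x_i + b) / ||w||_2 for all examples at once
import numpy as np

def margins(X, y, w, b):
    # X: (N, D) examples, y: (N,) labels in {+1, -1}
    return y * (X @ w + b) / np.linalg.norm(w)

# the minimum entry of margins(...) is what SVM maximizes over (w, b)
```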

### Hard-Margin Linear SVM

maximize the minimum margin:

$$\max_{w,b} \min_i \rho_i \quad \text{s.t.} \quad \rho_i = y_i(w^T x_i + b)/\|w\|_2 \ge 0, \quad i = 1, \dots, N$$

equivalent to:

$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, N$$

– hard-margin linear SVM

quadratic programming with $D+1$ variables: well-studied in optimization
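Since it is a standard quadratic program, any off-the-shelf QP solver can handle it; here is a minimal sketch using cvxpy (my choice of solver, not part of the talk) on hypothetical toy data:

```python
# hard-margin linear SVM as a QP: min (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1
import cvxpy as cp
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])  # toy data
y = np.array([1, 1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```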

is the hard-margin linear SVM good enough?

### Soft-Margin Linear SVM

hard-margin – hard constraints on separation:

$$\min_{w,b} \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \dots, N$$

no feasible solution if some noisy outliers exist

soft-margin – soft constraints as cost:

$$\min_{w,b} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$$

allow the noisy examples to have $\xi_i > 0$, with a cost

is linear SVM good enough?
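Extending the QP sketch above with slack variables gives a soft-margin version (again with cvxpy and hypothetical data; the last point plays a noisy outlier):

```python
# soft-margin linear SVM: min (1/2) w^T w + C * sum(xi)
#                         s.t. y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0
import cvxpy as cp
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.0], [1.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])  # last label disagrees with its neighbors
C = 1.0

N, D = X.shape
w, b, xi = cp.Variable(D), cp.Variable(), cp.Variable(N)
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
                     [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()
print(xi.value)  # nonzero entries mark margin violations (the outlier)
```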

### Soft-Margin Nonlinear SVM

what if we want a boundary $g(x) = \operatorname{sign}(x^T x - 1)$?

- can never be constructed with a hyperplane classifier $\operatorname{sign}(w^T x + b)$

however, we can have more complex feature transforms:

$$\phi(x) = [(x)_1, (x)_2, \cdots, (x)_D, (x)_1(x)_1, (x)_1(x)_2, \cdots, (x)_D(x)_D]$$

there is a classifier $\operatorname{sign}(w^T \phi(x) + b)$ that describes the boundary

soft-margin nonlinear SVM:

$$\min_{w,b} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$$

– with nonlinear $\phi(\cdot)$
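To see the idea concretely, a small numpy sketch (the weight vector below is hand-picked for illustration, not learned) showing that the circular boundary $\operatorname{sign}(x^T x - 1)$ becomes a plain hyperplane after the degree-2 transform $\phi$:

```python
import numpy as np

def phi(x):
    # the slide's transform: original features plus all pairwise products (x)_i (x)_j
    return np.concatenate([x, np.outer(x, x).ravel()])

D = 2
w = np.zeros(D + D * D)
w[D::D + 1] = 1.0   # weight 1 on the squared terms (x)_i (x)_i, 0 elsewhere
b = -1.0            # so w^T phi(x) + b = x^T x - 1

x = np.array([0.5, 0.5])
print(np.sign(w @ phi(x) + b) == np.sign(x @ x - 1))  # True
```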

### Feature Transformation

what feature transforms $\phi(\cdot)$ should we use?

we can only extract a finite, small number of features, but we can use an unlimited number of feature transforms

traditionally:

- use domain knowledge to do feature transformation
- use only "useful" feature transformations
- use a small number of feature transformations
- control the goodness of fit by a suitable choice of feature transformations

what if we use an "infinite number" of feature transformations, and let the algorithm decide a good $w$ automatically?

would an infinite number of transformations introduce overfitting?

are we able to solve the optimization problem?

### Dual Problem

infinite quadratic programming if infinite $\phi(\cdot)$:

$$\min_{w,b} \frac{1}{2} w^T w + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$$

luckily, we can solve its associated dual problem:

$$\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C$$

where $Q_{ij} \equiv y_i y_j \phi^T(x_i)\phi(x_j)$ and $\alpha$ is an $N$-dimensional vector
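A short sketch (numpy, hypothetical data) of how the $N \times N$ matrix $Q$ is assembled once the inner products $\phi^T(x_i)\phi(x_j)$ are available:

```python
# Q_ij = y_i y_j phi(x_i)^T phi(x_j); here phi is the identity (linear case)
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))       # N = 5 examples, D = 3 features
y = np.array([1, -1, 1, 1, -1])

K = X @ X.T                           # Gram matrix of inner products
Q = np.outer(y, y) * K                # entrywise: y_i y_j K_ij
print(Q.shape)                        # (5, 5): one row/column per example
```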

### Solution of the Dual Problem

associated dual problem:

$$\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \quad Q_{ij} \equiv y_i y_j \phi^T(x_i)\phi(x_j)$$

equivalent solution:

$$g(x) = \operatorname{sign}\left(w^T \phi(x) + b\right) = \operatorname{sign}\left(\sum_i y_i \alpha_i \phi^T(x_i)\phi(x) + b\right)$$

no need for $w$ and $\phi(x)$ explicitly if we can compute $K(x, x') = \phi^T(x)\phi(x')$ efficiently
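In code, the solution can be evaluated without ever forming $w$ or $\phi(x)$; a sketch (the solver producing $\alpha$ and $b$ is assumed to be given):

```python
# g(x) = sign(sum_i y_i alpha_i K(x_i, x) + b), using only kernel evaluations
import numpy as np

def predict(x, X_train, y, alpha, b, K):
    s = sum(alpha[i] * y[i] * K(X_train[i], x) for i in range(len(y)))
    return np.sign(s + b)
```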

### Kernel Trick

let kernel $K(x, x') = \phi^T(x)\phi(x')$

revisit: can we compute the kernel of

$$\phi(x) = [(x)_1, (x)_2, \cdots, (x)_D, (x)_1(x)_1, (x)_1(x)_2, \cdots, (x)_D(x)_D]$$

efficiently? well, not really

how about this?

$$\phi(x) = \left[\sqrt{2}(x)_1, \sqrt{2}(x)_2, \cdots, \sqrt{2}(x)_D, (x)_1(x)_1, \cdots, (x)_D(x)_D\right]$$

$$K(x, x') = (1 + x^T x')^2 - 1$$
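This identity is easy to check numerically; a sketch (numpy, random vectors) verifying that the explicit transform and the closed-form kernel agree:

```python
import numpy as np

def phi(x):
    # [sqrt(2) x_1, ..., sqrt(2) x_D, all products (x)_i (x)_j]
    return np.concatenate([np.sqrt(2) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(3), rng.standard_normal(3)

lhs = phi(x) @ phi(xp)            # explicit inner product in feature space
rhs = (1 + x @ xp) ** 2 - 1       # closed-form kernel, O(D) to evaluate
print(np.isclose(lhs, rhs))       # True
```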

### Different Kernels

types of kernels:

- linear: $K(x, x') = x^T x'$
- polynomial: $K(x, x') = (a x^T x' + r)^d$
- Gaussian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
- Laplacian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_1)$

**the last two equivalently have feature transformations in an infinite-dimensional space!**

new paradigm for machine learning: use many, many feature transformations; control the goodness of fit by the large margin (clear separation) and the violation cost (amount of outliers allowed)
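As plain functions, the four kernels above might look like this (a numpy sketch; the parameter defaults are illustrative):

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def polynomial(x, xp, a=1.0, r=1.0, d=2):
    return (a * (x @ xp) + r) ** d

def gaussian_rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def laplacian_rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - xp)))
```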

### Support Vectors: Meaningful Representation

$$\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \quad 0 \le \alpha_i \le C$$

equivalent solution:

$$g(x) = \operatorname{sign}\left(\sum_i y_i \alpha_i K(x_i, x) + b\right)$$

only those with $\alpha_i > 0$ are needed for classification – support vectors

from the optimality conditions, $\alpha_i$:

- "$= 0$": not needed in constructing the decision function; away from the boundary or on the boundary
- "$> 0$ and $< C$": free support vector, on the boundary
- "$= C$": bounded support vector; violates the boundary ($\xi_i > 0$) or on the boundary
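With scikit-learn (whose SVC wraps LIBSVM), these kinds of $\alpha_i$ can be inspected after training; a sketch on hypothetical data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
alpha_y = clf.dual_coef_.ravel()              # stores y_i * alpha_i, SVs only
print(len(clf.support_), "support vectors")   # examples with alpha_i > 0
bounded = np.isclose(np.abs(alpha_y), clf.C)  # alpha_i = C: bounded SVs
print(int(bounded.sum()), "bounded support vectors")
```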

### Why is SVM Successful?

- infinite number of feature transformations: suitable for conquering nonlinear classification tasks
- large-margin concept: theoretically promising
- soft-margin trade-off: controls regularization well
- convex optimization problems: allow good optimization algorithms (compared to neural networks and some other learning algorithms)
- support vectors: useful in data analysis and interpretation

### Why is SVM Not Successful?

- SVM can be sensitive to scaling and parameters
- standard SVM is only a "discriminative" classification algorithm
- SVM training can be time-consuming when $N$ is large and the solver is not carefully implemented
- infinite number of feature transformations ⇔ mysterious classifier

### Useful Extensions of SVM

multiclass SVM: use the 1vs1 (one-versus-one) approach to combine binary SVMs into a multiclass classifier

– the label that gets the most votes from the classifiers is the prediction

probability output: transform the raw output $w^T \phi(x) + b$ to a value in $[0, 1]$ that means $P(+1 \mid x)$

– use a sigmoid function to transform from $\mathbb{R} \to [0, 1]$ (see the sketch after this list)

infinite ensemble learning (Lin and Li 2005):

if the kernel $K(x, x') = -\|x - x'\|_1$ is used for standard SVM, the classifier is equivalently

$$g(x) = \operatorname{sign}\left(\int w_\theta s_\theta(x)\, d\theta + b\right)$$

where $s_\theta(x)$ is a thresholding rule on one feature of $x$.
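For the probability-output extension, scikit-learn's SVC exposes this kind of sigmoid fitting (Platt scaling) via probability=True; a minimal sketch on hypothetical data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2))
y = np.where(X[:, 0] > 0, 1, -1)

clf = SVC(kernel="rbf", probability=True).fit(X, y)  # fits a sigmoid internally
print(clf.predict_proba(X[:3]))  # rows sum to 1; columns follow clf.classes_
```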

### Basic Use of SVM

- scale each feature of your data to a suitable range (say, $[-1, 1]$)
- use a Gaussian RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
- use cross validation and grid search to determine a good $(\gamma, C)$ pair
- use the best $(\gamma, C)$ on your training set
- do testing with the SVM classifier

all included in LIBSVM (from the lab of Prof. Chih-Jen Lin)
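The whole recipe can also be scripted; a sketch with scikit-learn (which wraps LIBSVM), using hypothetical data and an illustrative grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# scale to [-1, 1], then cross-validated grid search over (gamma, C)
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))
grid = {"svc__gamma": [2.0 ** k for k in range(-4, 3)],
        "svc__C": [2.0 ** k for k in range(-2, 5)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_)  # the best (gamma, C) is then refit on all of X
```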

### Advanced Use of SVM

- include domain knowledge by specific kernel design (e.g. train a generative model for feature extraction, and use the extracted features in SVM to get discriminative power)
- combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)
- fine-tune SVM parameters with specific knowledge of your problem (e.g. different costs for different examples?)
- interpret the SVM results you get (e.g. are the SVs meaningful?)

### Resources

- LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
- LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
- Kernel Machines Forum: http://www.kernel-machines.org
- Hsu, Chang, and Lin: A Practical Guide to Support Vector Classification
- my email: htlin@caltech.edu

acknowledgment: some figures obtained from Prof. Chih-Jen Lin