Introduction to Support Vector Machines
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk in NTU EE/CS Speech Lab, November 16, 2005
Setup
fixed-length example: a D-dimensional vector x, each component is a feature
– raw digital sampling of a 0.5 sec. wave file
– DFT of the raw sampling
– a fixed-length feature vector extracted from the wave file
label: a number y ∈ Y
binary classification: is there a man speaking in the wave file?
(y = +1 if man, y = −1 if not)
multi-class classification: which speaker is speaking?
(y ∈ {1,2, · · · ,K})
regression: how excited is the speaker? (y ∈ R)
Binary Classification Problem
learning problem: given training examples and labels {(x_i, y_i)}_{i=1}^{N}, find a function g(x): X → Y that predicts the label of unseen x well
vowel identification: given training wave files and their vowel labels, find a function g(x) that translates wave files to vowel labels well
we will focus on the binary classification problem: Y = {+1, −1}
the most basic learning problem, but very useful and extensible to other problems
illustrative demo: for examples with two different colors in a D-dimensional space, how can we “separate” the examples?
Hyperplane Classifier
use a hyperplane to separate the two colors:
    g(x) = sign(w^T x + b)

if w^T x + b ≥ 0, the classifier returns +1; otherwise it returns −1
possibly lots of hyperplanes satisfy our needs, which one should we choose?
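As a tiny illustration (not from the original slides, assuming numpy), a minimal sketch of such a hyperplane classifier; the weight vector w and bias b below are arbitrary placeholders:

```python
import numpy as np

def hyperplane_classify(X, w, b):
    """Return +1/-1 predictions of the hyperplane classifier sign(w^T x + b)."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# toy 2-D example with an arbitrary (w, b); any separating hyperplane would do
w = np.array([1.0, -1.0])
b = 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0]])
print(hyperplane_classify(X, w, b))   # [ 1 -1]
```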
SVM: Large-Margin Hyperplane Classifier
margin ρ_i = y_i(w^T x_i + b)/‖w‖_2: does y_i agree with w^T x_i + b in sign?
how large is the distance between the example and the separating hyperplane?
large positive margin → clear separation → low-risk classification
idea of SVM: maximize the minimum margin

    max_{w,b}  min_i ρ_i
    s.t.       ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0
Hard-Margin Linear SVM
maximize the minimum margin:

    max_{w,b}  min_i ρ_i
    s.t.       ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0,  i = 1, …, N
equivalent to:

    min_{w,b}  (1/2) w^T w
    s.t.       y_i(w^T x_i + b) ≥ 1,  i = 1, …, N
– hard-margin linear SVM
quadratic programming with D+1 variables: well-studied in optimization
is the hard-margin linear SVM good enough?
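One way to experiment with (approximately) hard-margin behavior, not from the original talk: scikit-learn's SVC with a linear kernel and a very large C, which makes margin violations prohibitively costly on separable toy data:

```python
import numpy as np
from sklearn.svm import SVC

# linearly separable toy data
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# a very large C approximates the hard-margin linear SVM
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the learned (w, b)
print(clf.predict([[1.9, 2.5]]))   # [1]
```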
Soft-Margin Linear SVM
hard-margin – hard constraints on separation:
    min_{w,b}  (1/2) w^T w
    s.t.       y_i(w^T x_i + b) ≥ 1,  i = 1, …, N

no feasible solution if some noisy outliers exist
soft-margin – soft constraints as cost:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N

allow the noisy examples to have ξ_i > 0 with a cost
is linear SVM good enough?
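A minimal sketch of the soft-margin trade-off, assuming scikit-learn's SVC (not part of the original talk, which uses LIBSVM): the cost parameter C controls how much a noisy example is allowed to violate the margin:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D toy data with one noisy outlier (the example at 0.5 carries a "wrong-side" label)
X = np.array([[0.0], [1.0], [3.0], [4.0], [0.5]])
y = np.array([-1, -1, 1, 1, 1])

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates the outlier (larger margin); large C pays more to fit it
    print(C, clf.coef_.ravel(), clf.intercept_)
```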
Soft-Margin Nonlinear SVM
what if we want a boundary g(x) = sign(x^T x − 1)?
can never be constructed with a hyperplane classifier sign(w^T x + b)
however, we can have more complex feature transforms:
    φ(x) = [(x)_1, (x)_2, …, (x)_D, (x)_1(x)_1, (x)_1(x)_2, …, (x)_D(x)_D]
there is a classifier sign(w^T φ(x) + b) that describes the boundary
soft-margin nonlinear SVM:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N

– with nonlinear φ(·)
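A sketch of this idea, assuming numpy and scikit-learn (not in the original slides): generate labels from the circular boundary sign(x^T x − 1), apply an explicit degree-2 transform φ, and fit a linear SVM in the transformed space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) - 1 >= 0, 1, -1)   # boundary sign(x^T x - 1)

def phi(X):
    # degree-2 transform: original features plus all pairwise products
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x1, x1 * x2, x2 * x2])

clf = SVC(kernel="linear", C=10.0).fit(phi(X), y)
print(clf.score(phi(X), y))   # close to 1.0: the circle is linear in phi-space
```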
Feature Transformation
what feature transforms φ(·) should we use?
we can only extract a small, finite number of features, but we can use an unlimited number of feature transforms
traditionally:
– use domain knowledge to do feature transformation
– use only “useful” feature transformations
– use a small number of feature transformations
– control the goodness of fitting by a suitable choice of feature transformations
what if we use an “infinite number” of feature transformations, and let the algorithm decide a good w automatically?
would an infinite number of transformations introduce overfitting?
are we able to solve the optimization problem?
Dual Problem
an infinite-dimensional quadratic program if φ(·) is infinite-dimensional:
    min_{w,b}  (1/2) w^T w + C Σ_i ξ_i
    s.t.       y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, N
luckily, we can solve its associated dual problem:
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C,
    where   Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)

α: an N-dimensional vector
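To make the dual concrete, a small numpy sketch (not from the slides) that builds Q from a kernel function standing in for φ^T(x_i) φ(x_j); the helper name build_Q is made up for illustration:

```python
import numpy as np

def build_Q(X, y, kernel):
    """Q_ij = y_i y_j K(x_i, x_j): the N x N matrix of the dual QP."""
    K = kernel(X, X)
    return (y[:, None] * y[None, :]) * K

linear_kernel = lambda A, B: A @ B.T   # placeholder for phi^T(x_i) phi(x_j)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
Q = build_Q(X, y, linear_kernel)
print(Q.shape)   # (3, 3): the dual has N variables, independent of dim(phi)
```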
Solution of the Dual Problem
associated dual problem:
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C,
    where   Q_ij ≡ y_i y_j φ^T(x_i) φ(x_j)
equivalent solution:
    g(x) = sign(w^T φ(x) + b) = sign( Σ_i y_i α_i φ^T(x_i) φ(x) + b )
no need for w and φ(x) explicitly if we can compute K(x, x′) = φ^T(x) φ(x′) efficiently
Kernel Trick
let the kernel be K(x, x′) = φ^T(x) φ(x′)
revisit: can we compute the kernel of

    φ(x) = [(x)_1, (x)_2, …, (x)_D, (x)_1(x)_1, (x)_1(x)_2, …, (x)_D(x)_D]

efficiently?
well, not really
how about this?

    φ(x) = [√2 (x)_1, √2 (x)_2, …, √2 (x)_D, (x)_1(x)_1, …, (x)_D(x)_D]

    K(x, x′) = (1 + x^T x′)^2 − 1
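A quick numerical check (not in the original slides, assuming numpy) that this φ indeed has the claimed kernel: the explicit O(D^2) inner product matches (1 + x^T x′)^2 − 1 computed in O(D):

```python
import numpy as np

def phi(x):
    # [sqrt(2) x_1, ..., sqrt(2) x_D, x_1 x_1, x_1 x_2, ..., x_D x_D]
    return np.concatenate([np.sqrt(2) * x, np.outer(x, x).ravel()])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=3), rng.normal(size=3)

explicit = phi(x) @ phi(xp)            # O(D^2) work in the transformed space
kernel   = (1 + x @ xp) ** 2 - 1       # O(D) work via the kernel
print(np.isclose(explicit, kernel))    # True
```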
Different Kernels
types of kernels
– linear: K(x, x′) = x^T x′
– polynomial: K(x, x′) = (a x^T x′ + r)^d
– Gaussian RBF: K(x, x′) = exp(−γ ‖x − x′‖_2^2)
– Laplacian RBF: K(x, x′) = exp(−γ ‖x − x′‖_1)
the last two equivalently have feature transformations in an infinite-dimensional space!
new paradigm for machine learning: use many, many feature transformations; control the goodness of fitting by the large margin (clear separation) and the violation cost (amount of outliers allowed)
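These kernels map onto standard SVM software; the sketch below assumes scikit-learn's SVC (the talk itself points to LIBSVM, which offers the same linear/polynomial/RBF choices), with the Laplacian RBF passed as a callable kernel:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

models = {
    "linear":    SVC(kernel="linear"),
    "poly":      SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0),  # (a x^T x' + r)^d
    "rbf":       SVC(kernel="rbf", gamma=0.5),                        # exp(-gamma ||x - x'||^2)
    "laplacian": SVC(kernel=lambda A, B: laplacian_kernel(A, B, gamma=0.5)),
}
for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))
```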
Support Vectors: Meaningful Representation
    min_α   (1/2) α^T Q α − e^T α
    s.t.    y^T α = 0,
            0 ≤ α_i ≤ C

equivalent solution:

    g(x) = sign( Σ_i y_i α_i K(x_i, x) + b )
only those with α_i > 0 are needed for classification – support vectors
from the optimality conditions, α_i:
– “= 0”: not needed in constructing the decision function; away from the boundary or on the boundary
– “> 0 and < C”: free support vector, on the boundary
– “= C”: bounded support vector, violates the boundary (ξ_i > 0) or is on the boundary
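A sketch of how these quantities show up in practice, assuming scikit-learn (not part of the original talk): a fitted SVC stores the support vectors, the products y_i α_i (dual_coef_), and b (intercept_), so the decision function can be rebuilt from the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# rebuild sum_i y_i alpha_i K(x_i, x) + b using only the support vectors
K = rbf_kernel(X, clf.support_vectors_, gamma=1.0)      # K(x, x_i) for SVs x_i
manual = K @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(manual, clf.decision_function(X)))    # True
print(len(clf.support_vectors_), "support vectors out of", len(X))
```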
Why is SVM Successful?
infinite number of feature transformations: suitable for conquering nonlinear classification tasks
large-margin concept: theoretically promising
soft-margin trade-off: controls regularization well
convex optimization problems: possible for good optimization algorithms (compared to Neural Networks and some other learning algorithms)
support vectors: useful in data analysis and interpretation
Why is SVM Not Successful?
SVM can be sensitive to scaling and parameters
standard SVM is only a “discriminative” classification algorithm
SVM training can be time-consuming when N is large and the solver is not carefully implemented
infinite number of feature transformations ⇔ mysterious classifier
Useful Extensions of SVM
multiclass SVM: use the 1vs1 approach to combine binary SVMs into a multiclass classifier
– the label that gets the most votes from the classifiers is the prediction
probability output: transform the raw output w^T φ(x) + b to a value in [0, 1] that represents P(+1 | x)
– use a sigmoid function to transform from R to [0, 1] (see the sketch after this list)
infinite ensemble learning (Lin and Li 2005):
if the kernel K(x, x′) = −‖x − x′‖_1 is used for the standard SVM, the classifier is equivalently

    g(x) = sign( ∫ w_θ s_θ(x) dθ + b )

where s_θ(x) is a thresholding rule on one feature of x.
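A hedged sketch of the first two extensions, assuming scikit-learn's SVC (not in the original talk): its multiclass mode uses the 1vs1 scheme internally, and probability=True fits a sigmoid (Platt scaling) to the raw outputs:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # a 3-class problem

# multiclass handled by combining binary SVMs with the one-vs-one (1vs1) scheme
clf = SVC(kernel="rbf", gamma=0.1, C=1.0, probability=True).fit(X, y)

print(clf.predict(X[:2]))                  # voted class labels
# probability=True maps raw outputs through a fitted sigmoid into [0, 1]
print(clf.predict_proba(X[:2]).round(3))   # per-class probability estimates
```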
Basic Use of SVM
scale each feature of your data to a suitable range (say, [−1, 1])
use a Gaussian RBF kernel K(x, x′) = exp(−γ ‖x − x′‖_2^2)
use cross validation and grid search to determine a good (γ, C) pair
use the best (γ, C) on your training set
do testing with the SVM classifier
all included in LIBSVM (from the lab of Prof. Chih-Jen Lin); a scripted version of this recipe is sketched below
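A sketch of this recipe, assuming scikit-learn rather than the LIBSVM tools the talk refers to; the dataset and the (γ, C) grid are arbitrary placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# scale each feature to [-1, 1], then a Gaussian RBF kernel SVM
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))

# cross-validated grid search over (gamma, C)
grid = GridSearchCV(pipe, {"svc__gamma": [0.01, 0.1, 1], "svc__C": [1, 10, 100]}, cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)        # the chosen (gamma, C)
print(grid.score(X_te, y_te))   # test accuracy of the retrained best model
```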
Advanced Use of SVM
include domain knowledge by specific kernel design (e.g. train a generative model for feature extraction, and use the extracted features in SVM to get discriminative power)
combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)
fine-tune SVM parameters with specific knowledge of your problem (e.g. different costs for different examples?)
interpret the SVM results you get (e.g. are the SVs meaningful?)
Resources
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
Kernel Machines Forum: http://www.kernel-machines.org
Hsu, Chang, and Lin: A Practical Guide to Support Vector Classification
my email: htlin@caltech.edu
acknowledgment: some figures obtained from Prof. Chih-Jen Lin