Introduction to Support Vector Machines
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk in NTU EE/CS Speech Lab, November 16, 2005
fixed-length example: D-dimensional vector x; each component is a feature
– raw digital sampling of a 0.5 sec. wave file
– DFT of the raw sampling
– a fixed-length feature vector extracted from the wave file
label: a number y ∈ Y
binary classification: is there a man speaking in the wave file?
(y = +1 if man, y = −1 if not)
multi-class classification: which speaker is speaking?
(y ∈ {1, 2, …, K})
regression: how excited is the speaker? (y ∈ ℝ)
Binary Classification Problem
learning problem: given training examples and labels $\{(x_i, y_i)\}_{i=1}^{N}$, find a function g(x): X → Y that predicts the label of unseen x well
vowel identification: given training wave files and their vowel labels, find a function g(x) that maps wave files to vowels well
we will focus on the binary classification problem: Y = {+1, −1}
most basic learning problem, but very useful and can be extended to other problems
illustrative demo: for examples with two different colors in a D-dimensional space, how can we “separate” the examples?
Hyperplane Classifier
use a hyperplane to separate the two colors:
$g(x) = \mathrm{sign}(w^T x + b)$
if $w^T x + b \ge 0$, the classifier returns +1; otherwise it returns −1
possibly many hyperplanes satisfy our needs; which one should we choose?
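A minimal sketch of the hyperplane classifier in plain Python, following the rule that $w^T x + b \ge 0$ maps to +1. The helper names and the example hyperplane are hypothetical, chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def hyperplane_classify(w, b, x):
    """Return +1 if w^T x + b >= 0, otherwise -1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D hyperplane: x1 + x2 - 1 = 0
w = [1.0, 1.0]
b = -1.0
points = [[2.0, 2.0], [0.0, 0.0], [0.5, 0.5]]
predictions = [hyperplane_classify(w, b, x) for x in points]
```

The third point lies exactly on the hyperplane and is assigned +1, matching the "≥ 0" convention.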
SVM: Large-Margin Hyperplane Classifier
margin $\rho_i = y_i (w^T x_i + b) / \|w\|_2$:
– does $y_i$ agree with $w^T x_i + b$ in sign?
– how large is the distance between the example and the separating hyperplane?
large positive margin → clear separation → low-risk classification
idea of SVM: maximize the minimum margin
$$\max_{w,b} \ \min_i \ \rho_i \quad \text{s.t.} \quad \rho_i = y_i (w^T x_i + b) / \|w\|_2 \ge 0, \ i = 1, \ldots, N.$$
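To make the margin concrete, here is a small sketch that evaluates $\rho_i = y_i (w^T x_i + b) / \|w\|_2$ for every example and takes the minimum, which is the quantity SVM tries to maximize. The toy data and the candidate hyperplane are made up for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def margins(w, b, X, y):
    """Signed distance of each example to the hyperplane w^T x + b = 0."""
    norm = math.sqrt(dot(w, w))
    return [y_i * (dot(w, x_i) + b) / norm for x_i, y_i in zip(X, y)]


# Hypothetical toy data: two positive and two negative examples
X = [[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], -2.0

rho = margins(w, b, X, y)
min_margin = min(rho)  # all margins positive means every example is separated
```

Here every margin is positive, and the smallest one belongs to the example closest to the hyperplane.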
Hard-Margin Linear SVM
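maximize the minimum margin
$$\max_{w,b} \ \min_i \ \rho_i \quad \text{s.t.} \quad \rho_i = y_i (w^T x_i + b) / \|w\|_2 \ge 0, \ i = 1, \ldots, N.$$
equivalent to
$$\min_{w,b} \ \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \ i = 1, \ldots, N.$$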
– hard-margin linear SVM
quadratic programming with D+1 variables: well-studied in optimization
is the hard-margin linear SVM good enough?
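A sketch of the standard argument for why the max-min margin problem and the $\frac{1}{2} w^T w$ problem share the same solution, assuming the data are linearly separable. The scaling constant $c$ is notation introduced here, not taken from the slides.

```latex
% Step 1: the margin does not change when (w, b) is rescaled by any c > 0.
\rho_i(cw, cb) = \frac{y_i (c\, w^T x_i + c\, b)}{\|c\, w\|_2} = \rho_i(w, b)

% Step 2: so we may fix the scale such that the closest example satisfies
\min_i \; y_i (w^T x_i + b) = 1
\quad\Longrightarrow\quad
\min_i \rho_i = \frac{1}{\|w\|_2}

% Step 3: maximizing 1 / \|w\|_2 is the same as minimizing (1/2) w^T w.
\max_{w,b} \frac{1}{\|w\|_2}
\;\Longleftrightarrow\;
\min_{w,b} \tfrac{1}{2} w^T w
\quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, \ldots, N
```

Relaxing the equality in Step 2 to "≥ 1" does not change the optimum: if no constraint were tight, (w, b) could be scaled down to obtain a smaller $w^T w$.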
Soft-Margin Linear SVM
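hard-margin – hard constraints on separation:
$$\min_{w,b} \ \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \ i = 1, \ldots, N.$$
no feasible solution if some noisy outliers exist
soft-margin – soft constraints as cost:
$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, N.$$
allow the noisy examples to have $\xi_i > 0$ with a cost
is linear SVM good enough?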
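A small sketch of how the slack variables behave for a fixed hyperplane: the smallest feasible $\xi_i$ is $\max(0, 1 - y_i (w^T x_i + b))$. The objective below assumes the usual cost term $C \sum_i \xi_i$, with C the same constant that bounds $\alpha_i$ in the dual; the data and the hyperplane are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, X, y):
    """Smallest xi_i >= 0 that satisfies y_i (w^T x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y_i * (dot(w, x_i) + b)) for x_i, y_i in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """(1/2) w^T w + C * sum of slacks (assumed form of the soft-margin cost)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical data: the last example is a noisy outlier labeled +1
X = [[2.0, 0.0], [-2.0, 0.0], [-0.5, 0.0]]
y = [1, -1, 1]
w, b, C = [1.0, 0.0], 0.0, 10.0

xi_values = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C)
```

Only the outlier receives a nonzero slack, so it alone pays the violation cost; the hard-margin constraints would be infeasible for this hyperplane.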
Soft-Margin Nonlinear SVM
what if we want a boundary $g(x) = \mathrm{sign}(x^T x - 1)$?
it can never be constructed with a hyperplane classifier $\mathrm{sign}(w^T x + b)$
however, we can have more complex feature transforms:
$\phi(x) = [(x)_1, (x)_2, \cdots, (x)_D, (x)_1 (x)_1, (x)_1 (x)_2, \cdots, (x)_D (x)_D]$
there is a classifier $\mathrm{sign}(w^T \phi(x) + b)$ that describes the boundary
soft-margin nonlinear SVM:
$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, N.$$
– with nonlinear $\phi(\cdot)$
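The following sketch illustrates that the circular boundary $\mathrm{sign}(x^T x - 1)$ becomes a hyperplane after the quadratic transform: put weight 1 on the squared terms $(x)_i (x)_i$, weight 0 elsewhere, and set b = −1. The function names are hypothetical.

```python
def phi(x):
    """Quadratic transform: [(x)_1..(x)_D] followed by all D*D products (x)_i (x)_j."""
    return list(x) + [xi * xj for xi in x for xj in x]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


D = 2
# Zero weight on the linear terms, weight 1 on the squared terms (i == j)
w = [0.0] * D + [1.0 if i == j else 0.0 for i in range(D) for j in range(D)]
b = -1.0


def g(x):
    """Linear classifier in the transformed space; equals sign(x^T x - 1)."""
    return 1 if dot(w, phi(x)) + b >= 0 else -1


inside = g([0.5, 0.5])   # x^T x = 0.5, inside the unit circle
outside = g([1.0, 1.0])  # x^T x = 2.0, outside the unit circle
```

Note that the transformed vector has D + D² components, which grows quickly with D.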
Feature Transformation
what feature transforms $\phi(\cdot)$ should we use?
we can only extract a small, finite number of features, but we can use an unlimited number of feature transforms
use domain knowledge to do feature transformation
use only “useful” feature transformations
use a small number of feature transformations
control the goodness of fitting by a suitable choice of feature transformations
what if we use an “infinite number” of feature transformations, and let the algorithm decide a good w automatically?
– would an infinite number of transformations introduce overfitting?
– are we able to solve the optimization problem?
Dual Problem
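quadratic programming with infinitely many variables if $\phi(\cdot)$ is infinite-dimensional:
$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, N.$$
luckily, we can solve its associated dual problem:
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \ 0 \le \alpha_i \le C, \ i = 1, \ldots, N.$$
$Q_{ij} \equiv y_i y_j \phi^T(x_i) \phi(x_j)$
$\alpha$: N-dimensional vector (e: the all-ones vector)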
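A sketch of the ingredients of the dual problem: building $Q_{ij} = y_i y_j \phi^T(x_i) \phi(x_j)$ from an inner-product function, evaluating $\frac{1}{2} \alpha^T Q \alpha - e^T \alpha$, and checking the constraints $y^T \alpha = 0$, $0 \le \alpha_i \le C$. It does not solve the problem; the data, the candidate $\alpha$, and the function names are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def build_Q(X, y, kernel):
    """Q_ij = y_i * y_j * kernel(x_i, x_j); the matrix is N x N, whatever dim(phi) is."""
    n = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, Q):
    """(1/2) alpha^T Q alpha - e^T alpha, with e the all-ones vector."""
    n = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, C, tol=1e-9):
    """Check y^T alpha = 0 and 0 <= alpha_i <= C up to a tolerance."""
    return abs(dot(y, alpha)) <= tol and all(-tol <= a <= C + tol for a in alpha)


# Hypothetical two-example problem with phi(x) = x, so the kernel is a plain dot product
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
Q = build_Q(X, y, dot)
alpha = [0.5, 0.5]  # candidate dual solution for this toy problem
value = dual_objective(alpha, Q)
feasible = is_feasible(alpha, y, C=1.0)
```

The point of the dual is visible in `build_Q`: it has N variables and touches the examples only through inner products.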
Solution of the Dual Problem
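associated dual problem:
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \ 0 \le \alpha_i \le C, \ i = 1, \ldots, N.$$
$Q_{ij} \equiv y_i y_j \phi^T(x_i) \phi(x_j)$
equivalent solution:
$$g(x) = \mathrm{sign}\Big( \sum_{i=1}^{N} y_i \alpha_i \phi^T(x_i) \phi(x) + b \Big)$$
no need for w and $\phi(x)$ explicitly if we can compute $K(x, x') = \phi^T(x) \phi(x')$ efficiently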
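A sketch of the decision function written purely in terms of inner products, $g(x) = \mathrm{sign}(\sum_i y_i \alpha_i K(x_i, x) + b)$. The values of $\alpha$ and b are assumed to come from some dual solver and are hard-coded here for illustration; with the linear map $\phi(x) = x$ the result can be compared against an explicit w.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision(x, X, y, alpha, b, kernel):
    """sign( sum_i y_i alpha_i K(x_i, x) + b ), never forming w or phi(x)."""
    s = sum(y_i * a_i * kernel(x_i, x) for x_i, y_i, a_i in zip(X, y, alpha))
    return 1 if s + b >= 0 else -1


# Hypothetical training set and dual solution (alpha, b)
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alpha = [0.5, 0.5]
b = 0.0

labels = [decision(x, X, y, alpha, b, dot) for x in ([3.0, 1.0], [-0.2, 5.0])]

# With phi(x) = x the explicit weight vector is w = sum_i y_i alpha_i x_i
w = [sum(y_i * a_i * x_i[d] for x_i, y_i, a_i in zip(X, y, alpha)) for d in range(2)]
```

Swapping `dot` for another kernel function changes the classifier without changing this code.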
Kernel Trick
let kernel $K(x, x') = \phi^T(x) \phi(x')$
revisit: can we compute the kernel of
$\phi(x) = [(x)_1, (x)_2, \cdots, (x)_D, (x)_1 (x)_1, (x)_1 (x)_2, \cdots, (x)_D (x)_D]$
well, not really
how about this?
$\phi(x) = [\sqrt{2} (x)_1, \sqrt{2} (x)_2, \cdots, \sqrt{2} (x)_D, (x)_1 (x)_1, \cdots, (x)_D (x)_D]$
$K(x, x') = (1 + x^T x')^2 - 1$
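A quick numerical check of the kernel trick for the $\sqrt{2}$-scaled transform: the explicit inner product over D + D² components should agree with $(1 + x^T x')^2 - 1$, which needs only one D-dimensional inner product. The test vectors are arbitrary.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """[sqrt(2) (x)_1, ..., sqrt(2) (x)_D] followed by all D*D products (x)_i (x)_j."""
    root2 = math.sqrt(2.0)
    return [root2 * xi for xi in x] + [xi * xj for xi in x for xj in x]


def kernel(x, xp):
    """K(x, x') = (1 + x^T x')^2 - 1, computed in O(D) time."""
    return (1.0 + dot(x, xp)) ** 2 - 1.0


x, xp = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
explicit = dot(phi(x), phi(xp))  # O(D^2) work in the transformed space
shortcut = kernel(x, xp)         # same value up to rounding
```

The identity follows from $(1 + x^T x')^2 - 1 = 2 x^T x' + (x^T x')^2$; the two numbers agree up to floating-point rounding.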
Different Kernels
types of kernels
linear: $K(x, x') = x^T x'$
polynomial: $K(x, x') = (a x^T x' + r)^d$
Gaussian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
Laplacian RBF: $K(x, x') = \exp(-\gamma \|x - x'\|_1)$
the last two equivalently have feature transformations in an infinite-dimensional space!
new paradigm for machine learning: use many, many feature transformations; control the goodness of fitting by large margin (clear separation) and violation cost (amount of outliers allowed)
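The four kernels listed above can be sketched directly from their formulas. The default parameter values (a, r, d, γ) are placeholders for illustration, not recommendations.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(x, xp):
    """K(x, x') = x^T x'"""
    return dot(x, xp)


def polynomial_kernel(x, xp, a=1.0, r=1.0, d=2):
    """K(x, x') = (a x^T x' + r)^d"""
    return (a * dot(x, xp) + r) ** d


def gaussian_rbf_kernel(x, xp, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||_2^2)"""
    return math.exp(-gamma * sum((xi - xpi) ** 2 for xi, xpi in zip(x, xp)))


def laplacian_rbf_kernel(x, xp, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||_1)"""
    return math.exp(-gamma * sum(abs(xi - xpi) for xi, xpi in zip(x, xp)))


x, xp = [1.0, 2.0], [2.0, 0.0]
values = {
    "linear": linear_kernel(x, xp),
    "polynomial": polynomial_kernel(x, xp),
    "gaussian": gaussian_rbf_kernel(x, xp, gamma=0.5),
    "laplacian": laplacian_rbf_kernel(x, xp, gamma=0.5),
}
```

Both RBF kernels return 1 when x = x' and decay toward 0 as the points move apart, with γ controlling how fast.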
Support Vectors: Meaningful Representation
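$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad y^T \alpha = 0, \ 0 \le \alpha_i \le C, \ i = 1, \ldots, N.$$
equivalent solution:
$$g(x) = \mathrm{sign}\Big( \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b \Big)$$
only those examples with $\alpha_i > 0$ are needed for classification – support vectors
from optimality conditions, $\alpha_i$:
– “= 0”: not needed in constructing the decision function; away from the boundary or on the boundary
– “> 0 and < C”: free support vector; on the boundary
– “= C”: bounded support vector; violates the boundary ($\xi_i > 0$) or is on the boundary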
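The three cases above can be sketched as a small helper that sorts examples by their $\alpha_i$. The tolerance `tol` is an assumption added here, since numerical solvers rarely return exact 0 or C; the $\alpha$ values are made up.

```python
def categorize(alpha, C, tol=1e-8):
    """Label each alpha_i as a non-SV, free SV, or bounded SV."""
    kinds = []
    for a in alpha:
        if a <= tol:
            kinds.append("non-SV")       # alpha_i = 0
        elif a >= C - tol:
            kinds.append("bounded SV")   # alpha_i = C
        else:
            kinds.append("free SV")      # 0 < alpha_i < C
    return kinds


# Hypothetical dual solution with C = 1
alpha = [0.0, 0.3, 1.0, 0.0, 0.7]
kinds = categorize(alpha, C=1.0)

# Only these examples are needed to evaluate g(x)
support_indices = [i for i, kind in enumerate(kinds) if kind != "non-SV"]
```

Inspecting which examples end up as bounded support vectors is one way to look for noisy or mislabeled data.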
Why is SVM Successful?
infinite number of feature transformations: suitable for conquering nonlinear classification tasks
large-margin concept: theoretically promising
soft-margin trade-off: controls regularization well
convex optimization problems: amenable to good optimization algorithms (compared with neural networks and some other learning algorithms)
support vectors: useful in data analysis and interpretation
Why is SVM Not Successful?
SVM can be sensitive to scaling and parameters
standard SVM is only a “discriminative” classification algorithm
SVM training can be time-consuming when N is large and the solver is not carefully implemented
infinite number of feature transformations ⇔ mysterious classifier
Useful Extensions of SVM
multiclass SVM: use the 1vs1 approach to combine binary SVMs into a multiclass classifier
– the label that gets the most votes from the classifiers is the prediction
probability output: transform the raw output $w^T \phi(x) + b$ to a value in [0, 1] that represents $P(+1 | x)$
– use a sigmoid function to map $\mathbb{R} \to [0, 1]$
infinite ensemble learning (Lin and Li 2005):
if the kernel $K(x, x') = -\|x - x'\|_1$ is used for standard SVM, the classifier is equivalently
$$g(x) = \mathrm{sign}\Big( \int w_\theta \, s_\theta(x) \, d\theta + b \Big)$$
where $s_\theta(x)$ is a thresholding rule on one feature of x.
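A sketch of the first two extensions: 1vs1 voting over pairwise binary classifiers, and a sigmoid that squashes a raw SVM output into [0, 1]. The pairwise classifiers here are hypothetical threshold rules standing in for trained SVMs, and the sigmoid parameters A and B are placeholders that would normally be fitted to data.

```python
import math
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, pairwise, classes):
    """pairwise[(a, b)](x) returns +1 to vote for class a and -1 to vote for class b."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) >= 0 else b
        votes[winner] += 1
    # Ties, if any, are not treated specially in this sketch
    return votes.most_common(1)[0][0]


def sigmoid_probability(raw, A=-1.0, B=0.0):
    """Map a raw output in R to (0, 1) as 1 / (1 + exp(A * raw + B))."""
    return 1.0 / (1.0 + math.exp(A * raw + B))


# Hypothetical 1-D problem with three classes separated by thresholds
classes = [1, 2, 3]
pairwise = {
    (1, 2): lambda x: 1 if x < 1.5 else -1,
    (1, 3): lambda x: 1 if x < 2.0 else -1,
    (2, 3): lambda x: 1 if x < 2.5 else -1,
}
predicted = one_vs_one_predict(2.2, pairwise, classes)
p_positive = sigmoid_probability(2.0)
```

With K classes the 1vs1 approach trains K(K−1)/2 binary classifiers, each on the examples of only two classes.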
Basic Use of SVM
scale each feature of your data to a suitable range (say, [−1, 1])
use a Gaussian RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|_2^2)$
use cross validation and grid search to determine a good (γ, C) pair
use the best (γ, C) to train on your training set
do testing with the SVM classifier
all included in LIBSVM (from Lab of Prof. Chih-Jen Lin)
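A sketch of the first and third steps in plain Python: scaling each feature to [−1, 1] with ranges learned from the training set only, and enumerating candidate (γ, C) pairs. The exponentially spaced grid is an assumption made for illustration, and training plus cross validation are left to an SVM solver such as LIBSVM.

```python
def fit_scaler(X_train):
    """Record the (min, max) of each feature on the training set."""
    return [(min(column), max(column)) for column in zip(*X_train)]


def scale(X, ranges, lo=-1.0, hi=1.0):
    """Linearly map each feature to [lo, hi] using the training-set ranges."""
    scaled = []
    for row in X:
        new_row = []
        for value, (vmin, vmax) in zip(row, ranges):
            if vmax == vmin:
                new_row.append((lo + hi) / 2.0)  # constant feature: use the midpoint
            else:
                new_row.append(lo + (hi - lo) * (value - vmin) / (vmax - vmin))
        scaled.append(new_row)
    return scaled


# Hypothetical data; test points are scaled with the TRAINING ranges
X_train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
X_test = [[2.5, 40.0]]
ranges = fit_scaler(X_train)
train_scaled = scale(X_train, ranges)
test_scaled = scale(X_test, ranges)  # may fall outside [-1, 1]

# Assumed candidate grid: powers of two for gamma and C
gammas = [2.0 ** k for k in range(-15, 4, 2)]
costs = [2.0 ** k for k in range(-5, 16, 2)]
grid = [(gamma, C) for gamma in gammas for C in costs]
```

Each (γ, C) pair in `grid` would be scored by cross validation on the scaled training set, and the best pair used for the final training run.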
Advanced Use of SVM
include domain knowledge by specific kernel design (e.g. train a generative model for feature extraction, and use the extracted feature in SVM to get discriminative power)
combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)
fine-tune SVM parameters with specific knowledge of your problem (e.g. different costs for different examples?)
interpret the SVM results you get (e.g. are the SVs meaningful?)
LIBSVM
LIBSVM Tools
Kernel Machines Forum
Hsu, Chang, and Lin: A Practical Guide to Support Vector
acknowledgment: some figures obtained from Prof. Chih-Jen Lin