Support Vector Machines for Data
Classification and Regression
Chih-Jen Lin
Department of Computer Science National Taiwan University
Outline
Support vector classification
Two practical examples
Support vector regression
Discussion and conclusions
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Examples
Handwritten digit recognition
Spam filtering
Text classification
Prediction of signal peptide in human secretory proteins
Methods:
Nearest neighbor
Neural networks
Decision trees
Support vector machines: a new method, becoming more and more popular
Why Support Vector Machines
Existing methods:
Nearest neighbor, neural networks, decision trees; SVM: a new one
In my opinion, after careful data pre-processing,
appropriately using NN or SVM ⇒ similar accuracy
But users may not use them properly
The chance of SVM
Easier for users to appropriately use it
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Consider a simple case with two classes: define a vector $y$ with
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$
A separating hyperplane: $w^T x + b = 0$
$$w^T x_i + b > 0 \ \text{ if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{ if } y_i = -1$$
Decision function $f(x) = \mathrm{sign}(w^T x + b)$, $x$: test data
Variables $w$ and $b$: we need to determine the coefficients of a plane
Many possible choices of w and b
Select w, b with the maximal margin.
Maximal distance between $w^T x + b = \pm 1$:
$$w^T x_i + b \ge 1 \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{ if } y_i = -1$$
Distance between $w^T x + b = 1$ and $-1$: $2/\|w\| = 2/\sqrt{w^T w}$
$$\max \ 2/\|w\| \ \equiv \ \min \ w^T w/2$$
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1,\ i = 1, \ldots, l.$$
Higher Dimensional Feature Spaces
Earlier we tried to find a linear separating hyperplane
Data may not be linearly separable
Non-separable case: allow training errors
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l$$
Nonlinear case: linearly separable in other spaces?
Higher dimensional ( maybe infinite ) feature space
Example: $x \in R^3$, $\phi(x) \in R^{10}$
$$\phi(x) = (1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3)$$
A standard problem [Cortes and Vapnik, 1995]:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l$$
Finding the Decision Function
$w$: a vector in a high dimensional space ⇒ maybe infinite variables
The dual problem:
$$\min_{\alpha} \ \tfrac{1}{2}\alpha^T Q \alpha - e^T\alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,\ \ y^T\alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$
Primal and dual : optimization theory. Not trivial.
Infinite dimensional programming. A finite problem:
#variables = #training data
$Q_{ij} = y_i y_j \phi(x_i)^T\phi(x_j)$ needs a closed form
Efficient calculation of high dimensional inner products
Kernel trick: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$
$$\phi(x_i) = (1,\ \sqrt{2}(x_i)_1,\ \sqrt{2}(x_i)_2,\ \sqrt{2}(x_i)_3,\ (x_i)_1^2,\ (x_i)_2^2,\ (x_i)_3^2,\ \sqrt{2}(x_i)_1(x_i)_2,\ \sqrt{2}(x_i)_1(x_i)_3,\ \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^T x_j)^2$.
Popular methods: $K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$ (Radial Basis Function)
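A quick numerical check of this identity (a minimal sketch, not from the slides; the vectors below are made up for illustration):

import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^3
    x1, x2, x3 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x3,
                     x1**2, x2**2, x3**2,
                     np.sqrt(2)*x1*x2, np.sqrt(2)*x1*x3, np.sqrt(2)*x2*x3])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print(phi(xi) @ phi(xj))     # inner product in R^10
print((1 + xi @ xj)**2)      # kernel value: same number, no explicit phi needed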
Kernel Tricks
Kernel: $K(x, y) = \phi(x)^T\phi(y)$
No need to explicitly know $\phi(x)$
Common kernels:
$K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$ (Radial Basis Function)
$K(x_i, x_j) = (x_i^T x_j/a + b)^d$ (Polynomial kernel)
They can be inner products in an infinite dimensional space
Assume $x \in R^1$ and $\gamma > 0$:
$$e^{-\gamma\|x_i - x_j\|^2} = e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2}$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots\right)$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1\cdot 1 + \sqrt{\tfrac{2\gamma}{1!}}\,x_i\cdot\sqrt{\tfrac{2\gamma}{1!}}\,x_j + \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x_i^2\cdot\sqrt{\tfrac{(2\gamma)^2}{2!}}\,x_j^2 + \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x_i^3\cdot\sqrt{\tfrac{(2\gamma)^3}{3!}}\,x_j^3 + \cdots\right) = \phi(x_i)^T\phi(x_j),$$
where $\phi(x) = e^{-\gamma x^2}\left[1,\ \sqrt{\tfrac{2\gamma}{1!}}\,x,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x^2,\ \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x^3,\ \ldots\right]^T$
Decision function
w: maybe an infinite vector
At optimum: $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$
Decision function:
$$w^T\phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T\phi(x) + b$$
$> 0$: 1st class, $< 0$: 2nd class
Only $\phi(x_i)$ with $\alpha_i > 0$ are used
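A small sketch of how this decision value could be computed once the $\alpha_i$, $y_i$, $b$, and support vectors are known (illustrative only; the variable names are made up, and LIBSVM does this internally):

import numpy as np

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, support_vectors, alpha, y, b, gamma):
    # w^T phi(x) + b = sum_i alpha_i * y_i * K(x_i, x) + b
    return sum(a * yi * rbf_kernel(sv, x, gamma)
               for sv, a, yi in zip(support_vectors, alpha, y)) + b

# predicted label: +1 if decision_value(...) > 0, else -1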
Support Vectors: More Important Data
Let Us Try An Example
A problem from astroparticle physics
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1.0 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1.0 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1.0 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
Training and testing sets available: 3,089 and 4,000 instances
Data format is an issue
SVM software:
LIBSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Now one of the most widely used SVM software packages
Installation
On Unix:
Download the zip file and make
On Windows:
Download the zip file and use nmake:
nmake -f Makefile.win
Usage of LIBSVM
Training
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
Training and Testing
Training
$./svm-train train.1 ...*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
Testing
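Testing is done with svm-predict; a typical invocation (file names follow the example above and are assumed here):
$./svm-predict test.1 train.1.model test.1.predict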
What does this Output Mean
obj: the optimal objective value of the dual SVM
rho: $-b$ in the decision function
nSV and nBSV: number of support vectors and bounded support vectors (i.e., $\alpha_i = C$)
nu-svm is a somewhat equivalent form of C-SVM where C is replaced by ν.
Why this Fails
After training, nearly 100% of the data are support vectors
Training and testing accuracy are very different
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
Most kernel elements:
$$K_{ij} \begin{cases} = 1 & \text{if } i = j, \\ \to 0 & \text{if } i \ne j. \end{cases}$$
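Why this happens, in a small sketch (not from the slides; the numbers are made up but mimic the unscaled ranges above): with attributes spanning hundreds of units, $\|x_i - x_j\|^2$ is huge for $i \ne j$, so with a default-like $\gamma$ the RBF kernel matrix is essentially the identity.

import numpy as np

X = np.array([[26.2, 58.9, -0.19, 125.1],
              [57.1, 221.4, 0.09, 122.9],
              [17.3, 173.4, -0.13, 125.0]])   # unscaled attributes
gamma = 1.0 / X.shape[1]                      # a default-like gamma value

K = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.round(K, 6))   # approximately the identity: off-diagonals are ~0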
Data Scaling
Without scaling
Attributes in greater numeric ranges may dominate
Example:
      height  sex
x1    150     F
x2    180     M
x3    185     M
and y1 = 0, y2 = 1, y3 = 1.
The separating hyperplane:
(figure: points x1, x2, x3 and the separating hyperplane)
Decision strongly depends on the first attribute
What if the second is more important?
Linearly scale the first attribute to [0, 1] by:
$$\frac{\text{1st attribute} - 150}{185 - 150}$$
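As a sketch (not from the slides), feature-wise min-max scaling of a data matrix could look like this; svm-scale does the same job for LIBSVM-format files:

import numpy as np

def scale_features(X, lower=0.0, upper=1.0):
    # scale each column (feature) linearly to [lower, upper]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lower + (X - mn) * (upper - lower) / (mx - mn)

X = np.array([[150.0, 0.0],
              [180.0, 1.0],
              [185.0, 1.0]])     # height, sex (F=0, M=1)
print(scale_features(X))         # first column becomes 0, 6/7, 1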
New points and separating hyperplane
Transformed to the original space,
After Data Scaling
A common mistake
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Same factor on training and testing
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
We store the scaling factors used for training and apply them to the testing set
More on Training
Train on the scaled data and then predict
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
Training accuracy now is
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 96.439% (2979/3089) (classification)
Default parameters were used (no -c or -g given)
Different Parameters
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089) (classification)
100% training accuracy but
$./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000) (classification)
Very bad test accuracy
Overfitting happens
Overfitting and Underfitting
When training a model and making predictions, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
Overfitting
In theory
You can easily achieve 100% training accuracy
This is useless
Surprisingly
Parameter Selection
Is very important
Now parameters are
C and kernel parameters
Example:
$\gamma$ of $e^{-\gamma\|x_i - x_j\|^2}$
$a, b, d$ of $(x_i^T x_j/a + b)^d$
Performance Evaluation
Training errors not important; only test errors count
$l$ training data, $x_i \in R^n$, $y_i \in \{+1, -1\}$, $i = 1, \ldots, l$, a learning machine:
$$x \to f(x, \alpha), \quad f(x, \alpha) = 1 \text{ or } -1.$$
Different $\alpha$: different machines
The expected test error (generalization error):
$$R(\alpha) = \int \tfrac{1}{2}\,|y - f(x, \alpha)| \, dP(x, y)$$
$P(x, y)$ unknown; empirical risk (training error):
$$R_{\mathrm{emp}}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$$
$\tfrac{1}{2}|y_i - f(x_i, \alpha)|$: the loss. Choose $0 \le \eta \le 1$; with probability at least $1 - \eta$:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \text{another term}$$
Performance Evaluation (Cont.)
In practice
Available data ⇒ split into training and validation
Train on the training set
Test on the validation set
k-fold cross validation:
Data randomly separated into k groups
Each time, train on k − 1 groups and validate on the remaining one
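A minimal sketch of k-fold cross validation for one (C, γ) pair (illustrative; it assumes some train(...) and accuracy(...) helpers rather than a particular library):

import numpy as np

def cross_validation_accuracy(X, y, k, train, accuracy):
    # train(X, y) -> model; accuracy(model, X, y) -> float
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[tr], y[tr])
        scores.append(accuracy(model, X[val], y[val]))
    return np.mean(scores)   # average validation accuracy over the k folds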
CV and Test Accuracy
If we select parameters so that CV accuracy is highest, does CV accuracy represent future test accuracy?
Slightly different
If we have enough parameters, we can achieve 100% CV as well
e.g. more parameters than # of training data
But test accuracy may be different
Using CV on training + validation
A Simple Procedure
1. Conduct simple scaling on the data
2. Consider the RBF kernel $K(x, y) = e^{-\gamma\|x - y\|^2}$
3. Use cross-validation to find the best parameter C and γ
4. Use the best C and γ to train the whole training set
5. Test
Best C and γ from training on k − 1 folds vs. on the whole set? In theory, a minor difference
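For a single (C, γ) pair, LIBSVM reports cross-validation accuracy directly with the -v option; for example (the parameter values here are only illustrative):
$./svm-train -v 5 -c 2 -g 2 train.1.scale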
Parameter Selection Procedure in
LIBSVM
Grid search + CV
$./grid.py train.1 train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Easy parallelization on a cluster
$./grid.py train.1 train.1.scale
[linux1] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[linux7] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Parallel Parameter Selection
Specify machine names in grid.py
telnet_workers = []
ssh_workers = ['linux1','linux1','linux2','linux3']
nr_local_worker = 1
linux1: more powerful or has two CPUs
A simple centralized control
Load balancing is not a problem
We can use other tools
Contour of Parameter Selection
(contour plot of cross-validation accuracy over log2(C) and log2(gamma))
Simple Script in LIBSVM
easy.py: a script for dummies
$python easy.py train.1 test.1
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Example: Engine Misfire Detection
Problem Description
First problem of IJCNN Challenge 2001, data from Ford
Given a time series of length T = 50,000
The kth data
x1(k), x2(k), x3(k), x4(k), x5(k), y(k)
y(k) = ±1: output, affected only by x1(k), . . . , x4(k)
x5(k) = 1: the kth data point is considered for evaluating accuracy
50,000 training data, 100,000 testing data (in two sets)
Past and future information may affect y(k)
x1(k): periodically nine 0s, one 1, nine 0s, one 1, and so on. Example:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
Background: Engine Misfire Detection
How engine works
Air-fuel mixture injected into the cylinder
Intake, compression, combustion, exhaust
Engine misfire: a substantial fraction of a cylinder's air-fuel mixture fails to ignite
Frequent misfires: pollutants and costly replacement
On-board detection:
Engine crankshaft rotational dynamics measured with a position sensor
Encoding Schemes
For SVM: each data is a vector
x1(k): periodically nine 0s, one 1, nine 0s, one 1, ...
Encoding 1: 10 binary attributes x1(k − 5), . . . , x1(k + 4) for the kth data
Encoding 2: x1(k) as a single integer in 1 to 10
Which one is better? We think 10 binaries are better for SVM (a small sketch follows below)
x4(k) more important
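A sketch of the two encodings (made-up helper code, not the authors'): for each k, the phase of x1 can be kept either as a 10-dimensional 0/1 window or as a single integer position within that window.

def encode_binary_window(x1, k):
    # Encoding 1: 10 binary attributes x1(k-5), ..., x1(k+4)
    return [x1[k + j] for j in range(-5, 5)]

def encode_integer_phase(x1, k):
    # Encoding 2: 1-based position of the single 1 within the window
    # (with period 10 there is exactly one 1 in any 10-sample window)
    window = encode_binary_window(x1, k)
    return window.index(1) + 1

x1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(encode_binary_window(x1, 8))   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(encode_integer_phase(x1, 8))   # 1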
Training SVM
Selecting parameters; generating a good model for prediction
RBF kernel: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j) = e^{-\gamma\|x_i - x_j\|^2}$
Two parameters: γ and C
Five-fold cross validation on the 50,000 data
Data randomly separated into five groups
Each time four as training and one as testing
Use $C = 2^4$, $\gamma = 2^2$ and train on all 50,000 data for the final model
(contour plot of cross-validation accuracy over log2(C) and log2(gamma))
Test set 1: 656 errors, Test set 2: 637 errors
About 3,000 support vectors out of 50,000 training data
A good case for SVM
This is just the outline. There are other details.
Machine Learning Is Sometimes An Art
But not a science
For complicated problems, there is no real systematic procedure
An Example: Vehicle Classification
Vehicle classification in distributed sensor networks
http://www.ece.wisc.edu/~sensit and
http://mmsp-2.caenn.wisc.edu/events.zip
Prepared by Duarte and Hu at the University of Wisconsin
Three classes of data: two vehicles and noise
Each instance: acoustic and seismic features
# features of each part: 50 and 50
Distribution of data:
#class 1 #class 2 #class 3
How Data Are Generated
Wireless distributed sensor networks (WDSN)
Several sensors in a field
Event extraction
Only information when the vehicle is close enough to the sensor
Then a time series
FFT-based features
Noise: high-energy factors such as wind and radio chatter.
Sample instances: Acoustic Data
2 1:-1.8893190e-02 2:-7.2501253e-03 3:-9.3349372e-03 4:8.2397278e-02 5:1.0000000e+00 6:2.8431799e-02 7:-3.9595759e-03 8:-2.2467102e-02 9:-2.7549071e-03 10:-2.2973921e-02 ...
Results from the Authors
Paper available from
http://www.ece.wisc.edu/~sensit/publications/
Three-fold CV Accuracy
Method               Acoustic   Seismic
k-nearest neighbor   69.36%     56.24%
Maximum likelihood   68.95%     62.81%
SVM                  69.48%     63.79%
We think more investigation may improve the accuracy
So I decided to let students do a project on this
A report presented in my statistical learning theory course
By C.-C. Chou, S.-T. Wang, R.-E. Fan, C.-W. Lin, and C.-C. Lin
Authors’ Approach
Data split to three folds
Two as training and one as validation
Average of the three validation accuracies reported
Polynomial kernel used: $(1 + x_i^T x_j)^d$, with $C = 1$
My Students’ Approach
Cross-validation is a biased estimate
Too many parameters: CV accuracy overfitted
Practically OK for two/three parameters
We do a more formal way
Kernel/Parameter Selection
RBF kernel
$e^{-\gamma\|x_i - x_j\|^2}$
Parameter selection very important
C and γ
Fewer parameters than the polynomial kernel
Huge training time
Issue: the best (C, γ) for a 10% subset may not be the best for the whole set
In theory C should be decreased a bit
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$
Results
Test accuracy and (log2 C, log2 γ):
Acoustic        Seismic
75.01 (7,-2)    72.03 (18,-10)
Not very good
Try combining the two feature sets
New accuracy: 83.70 (9,-6)
This case:
Data Scaling
Earlier we mentioned the importance of data scaling
How about this data set?
Is each attribute in a suitable range?
First 4 attributes of training/validation:
        X1        X2        X3        X4
Min.   -0.5988   -0.5194   -0.4806   -0.5111
Mean    0.1319    0.2481    0.1512    0.1844
Max.    1.0000    1.0000    1.0000    1.0000
Data Scaling (Cont.)
From the authors' original MATLAB code, for $x \in R^n$:
$$x_i \leftarrow \frac{x_i}{\max_j(|x_j|)}$$
Instance-wise scaling
Earlier: feature-wise scaling
First 4 features scaled to [−1, 1]
X1 X2 X3 X4
Other features similar
max of X1 < 1, as the scaling is over all the features and the above shows only 4 of them
Very different distributions
How attributes are scaled to [−1, 1] (feature-wise):
$$x_i \leftarrow 2\,\frac{x_i - \min}{\max - \min} - 1$$
In the original data, most $x_i$ are close to the min
After instance-wise scaling, may not be that close to the new min
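A sketch contrasting the two scalings (hypothetical code; the toy matrix is made up): instance-wise scaling divides each row by its own largest absolute value, while feature-wise scaling maps each column to [−1, 1].

import numpy as np

def instance_wise(X):
    # each row divided by the max absolute value of that row
    return X / np.abs(X).max(axis=1, keepdims=True)

def feature_wise(X, lower=-1.0, upper=1.0):
    # each column mapped linearly to [lower, upper]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lower + (X - mn) * (upper - lower) / (mx - mn)

X = np.array([[0.02, 0.90, -0.01],
              [0.05, 2.10,  0.03],
              [0.01, 1.50, -0.02]])
print(instance_wise(X))
print(feature_wise(X))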
After Scaling
New results
Acoustic       Seismic         Combined
79.71 (6,-2)   76.68 (6,-2)    87.18 (5,-3)
Compare to the earlier results:
Acoustic       Seismic         Combined
75.01 (7,-2)   72.03 (18,-10)  83.70 (9,-6)
New results consistently better
Feature-wise scaling seems more appropriate
Six data sets available at
Issues not Investigated Yet
If most values are close to the min of the features,
are these values outliers or useful information?
Is 86% enough for practical use?
Originally
Assault Amphibian Vehicle (AAV) Main Battle Tank (M1)
High Mobility Multipurpose Wheeled Vehicle (HMMWV)
So five-class problem
Now we have only AAV, DW, and noise
# of SVs is an issue
Now around 20,000 SVs
Can they be stored in a sensor?
Further improvement
Feature selection
Lesson from This Experiment
There is no systematic way to handle a machine learning task
However, some simple techniques/analysis help
Better understanding of ML methods also helps
Of course, you need good luck
SVM Primal and Dual
Standard SVM:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$
$w$: huge vector variable
Possibly infinite variables
Dual problem:
$$\min_{\alpha} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,\ \ \sum_{i=1}^{l} y_i\alpha_i = 0.$$
$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ available using special $\phi$
Primal Dual Relationship
At optimum:
$$\bar{w} = \sum_{i=1}^{l} \bar{\alpha}_i y_i \phi(x_i) \qquad (1)$$
$$\tfrac{1}{2}\bar{w}^T\bar{w} + C\sum_{i=1}^{l}\bar{\xi}_i = e^T\bar{\alpha} - \tfrac{1}{2}\bar{\alpha}^T Q\bar{\alpha}, \qquad (2)$$
where $e = [1, \ldots, 1]^T$.
Derivation of the Dual
We follow the description in [Bazaraa et al., 1993]
Consider a simpler problem:
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1,\ i = 1, \ldots, l.$$
Its dual:
$$\min_{\alpha} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, l,\ \ \sum_{i=1}^{l} y_i\alpha_i = 0.$$
Lagrangian Dual
Defined as
$$\max_{\alpha \ge 0}\ \Big(\min_{w,b} L(w, b, \alpha)\Big), \qquad (3)$$
where
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i\big(y_i(w^T\phi(x_i) + b) - 1\big). \qquad (4)$$
Minimize with respect to the primal variables $w$ and $b$
Assume $(\bar{w}, \bar{b})$ is optimal for the primal with optimal objective value $\gamma = \tfrac{1}{2}\|\bar{w}\|^2$.
No $(w, b)$ satisfies
$$\tfrac{1}{2}\|w\|^2 < \gamma \ \text{ and }\ y_i(w^T\phi(x_i) + b) \ge 1,\ i = 1, \ldots, l. \qquad (5)$$
There is $\bar{\alpha} \ge 0$ such that for all $w, b$
$$\tfrac{1}{2}\|w\|^2 - \gamma - \sum_{i=1}^{l} \bar{\alpha}_i\big(y_i(w^T\phi(x_i) + b) - 1\big) \ge 0. \qquad (6)$$
Thus
$$\max_{\alpha \ge 0}\ \min_{w,b} L(w, b, \alpha) \ge \gamma. \qquad (7)$$
On the other hand, for any $\alpha$,
$$\min_{w,b} L(w, b, \alpha) \le L(\bar{w}, \bar{b}, \alpha),$$
so
$$\max_{\alpha \ge 0}\ \min_{w,b} L(w, b, \alpha) \le \max_{\alpha \ge 0} L(\bar{w}, \bar{b}, \alpha) = \tfrac{1}{2}\|\bar{w}\|^2 = \gamma. \qquad (8)$$
With $\bar{\alpha}_i \ge 0$ and $y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1 \ge 0$,
$$\bar{\alpha}_i\big[y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1\big] = 0,\ i = 1, \ldots, l,$$
the complementarity condition.
To simplify the dual: when $\alpha$ is fixed,
$$\min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l}\alpha_i y_i \ne 0, \\[4pt] \min_w \ \tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big] & \text{if } \sum_{i=1}^{l}\alpha_i y_i = 0. \end{cases}$$
If $\sum_{i=1}^{l}\alpha_i y_i \ne 0$, we can drive the term $-b\sum_{i=1}^{l}\alpha_i y_i$ in $L(w, b, \alpha)$ to $-\infty$.
If $\sum_{i=1}^{l}\alpha_i y_i = 0$, the optimum of $\tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big]$ happens when $\frac{\partial}{\partial w}L(w, b, \alpha) = 0$. Thus,
More details: assume $w \in R^n$, so
$$\frac{\partial}{\partial w}L(w, b, \alpha) = \begin{bmatrix} \frac{\partial}{\partial w_1}L(w, b, \alpha) \\ \vdots \\ \frac{\partial}{\partial w_n}L(w, b, \alpha) \end{bmatrix}$$
$L(w, b, \alpha)$ can be rewritten as
$$\tfrac{1}{2}\sum_{j=1}^{n} w_j^2 - \sum_{i=1}^{l}\alpha_i\Big[y_i\Big(\sum_{j=1}^{n} w_j\phi(x_i)_j\Big) - 1\Big]$$
So
$$\frac{\partial}{\partial w_j}L(w, b, \alpha) = w_j - \sum_{i=1}^{l}\alpha_i y_i\phi(x_i)_j = 0.$$
Note that
$$w^T w = \Big(\sum_{i=1}^{l}\alpha_i y_i\phi(x_i)\Big)^T\Big(\sum_{j=1}^{l}\alpha_j y_j\phi(x_j)\Big) = \sum_{i,j}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j)$$
The dual is
$$\max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j) & \text{if } \sum_{i=1}^{l}\alpha_i y_i = 0, \\[4pt] -\infty & \text{if } \sum_{i=1}^{l}\alpha_i y_i \ne 0. \end{cases}$$
$-\infty$ is definitely not the maximum of the dual
So the dual optimum does not happen when $\sum_{i=1}^{l}\alpha_i y_i \ne 0$, and the dual simplifies to
$$\max_{\alpha \ge 0,\ \sum_i \alpha_i y_i = 0}\ \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j)$$
Karush-Kuhn-Tucker (KKT) optimality conditions of the primal:
$$\bar{\alpha}_i\big[y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1\big] = 0,\ i = 1, \ldots, l,$$
$$\sum_{i=1}^{l}\alpha_i y_i = 0, \quad \alpha_i \ge 0\ \forall i, \quad w = \sum_{i=1}^{l}\alpha_i y_i\phi(x_i).$$
An Example
Two training data in $R^1$:
(figure: two training points, at x = 0 and x = 1)
Primal Problem
$x_1 = 0$, $x_2 = 1$ with $y = [-1, 1]^T$. Primal problem:
$$\min_{w,b} \ \tfrac{1}{2}w^2 \quad \text{subject to} \quad w\cdot 1 + b \ge 1, \qquad (11)$$
$$-1(w\cdot 0 + b) \ge 1. \qquad (12)$$
So $-b \ge 1$ and $w \ge 1 - b \ge 2$.
For any $(w, b)$ satisfying the two inequality constraints,
$w \ge 2$
We are minimizing $\tfrac{1}{2}w^2$
The smallest possibility is $w = 2$.
$(w, b) = (2, -1)$ is the optimal solution.
The separating hyperplane $2x - 1 = 0$ is in the middle of the two training data.
Dual Problem
Formula derived before
$$\min_{\alpha \in R^l} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, l,\ \text{and}\ \sum_{i=1}^{l}\alpha_i y_i = 0.$$
Get the objective function:
$$\tfrac{1}{2}\alpha_2^2 - (\alpha_1 + \alpha_2) = \tfrac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}.$$
Constraints: $\alpha_1 - \alpha_2 = 0$, $0 \le \alpha_1$, $0 \le \alpha_2$.
Substituting $\alpha_2 = \alpha_1$ into the objective function:
$$\tfrac{1}{2}\alpha_1^2 - 2\alpha_1$$
Smallest value at $\alpha_1 = 2$, and $\alpha_2 = 2$ as well.
If the smallest value occurred at a negative $\alpha_1$, it would be clipped to 0.
Dual Problems for Other Formulas
So we may think that for any optimization problem such a Lagrangian dual exists
This is wrong.
Remember we calculated
$$\min_w \ \tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big]$$
by setting $\frac{\partial}{\partial w}L(w, b, \alpha) = 0$.
Note that
"$f'(x) = 0 \Leftrightarrow x$ is a minimum" is wrong
Example
$f(x) = x^3$: $f'(0) = 0$ but $x = 0$ is not a minimum
The function must satisfy certain conditions (e.g. convexity)
Some papers wrongly derived the duals of their new formulations without checking these conditions
Back to the example: $\alpha = [2, 2]^T$ satisfies the constraints $0 \le \alpha_1$ and $0 \le \alpha_2$, so it is optimal.
Primal-dual relation:
$$w = y_2\alpha_2 x_2 + y_1\alpha_1 x_1 = 1\cdot 2\cdot 1 + (-1)\cdot 2\cdot 0 = 2$$
Multi-class Classification
k classes
One-against-all: Train k binary SVMs:
1st class vs. classes 2 to k
2nd class vs. classes 1, 3 to k
...
k decision functions: $(w^1)^T\phi(x) + b^1, \ldots, (w^k)^T\phi(x) + b^k$
Predict the class with the largest decision value
Multi-class Classification (Cont.)
One-against-one: train k(k − 1)/2 binary SVMs
(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)
Select the one with the largest vote
This is the method used by LIBSVM
Try a 4-class problem
6 binary SVMs
$libsvm-2.5/svm-train bsvm-2.05/vehicle.scale
optimization finished, #iter = 173
obj = -142.552559, rho = 0.748453
nSV = 194, nBSV = 183
optimization finished, #iter = 330
obj = -149.912202, rho = -0.786410
nSV = 227, nBSV = 217
optimization finished, #iter = 169
obj = -139.655613, rho = 0.998277
nSV = 186, nBSV = 177
optimization finished, #iter = 268
obj = -185.161735, rho = -0.674739
nSV = 253, nBSV = 244
optimization finished, #iter = 477
obj = -378.264371, rho = 0.177314
nSV = 405, nBSV = 394
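A sketch of the one-against-one voting rule (illustrative only; LIBSVM implements this internally): each of the k(k − 1)/2 binary models votes for one class, and the class with the most votes wins.

from collections import Counter

def one_vs_one_predict(x, binary_models, k):
    # binary_models[(i, j)](x) returns True if x looks like class i, else class j
    votes = Counter()
    for i in range(k):
        for j in range(i + 1, k):
            votes[i if binary_models[(i, j)](x) else j] += 1
    return votes.most_common(1)[0][0]   # class with the largest vote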
There are many other methods
A comparison in [Hsu and Lin, 2002]
For a software package, we select one method which is generally good, though not always the best
Finally I chose 1 vs. 1
Similar accuracy to other methods
Shortest training time
Why Shorter Training Time
1 vs. 1: k(k − 1)/2 problems, each with about 2l/k data on average
1 vs. all: k problems, each with l data
If solving one optimization problem costs a polynomial of the data size with degree d, their complexities are
$$\frac{k(k-1)}{2}\,O\!\left(\left(\frac{2l}{k}\right)^{d}\right) \quad \text{vs.} \quad k\,O(l^{d})$$
Outline
Support vector regression (SVR)
Practical examples
Support Vector Regression (SVR)
Support vector machines: a new method for data classification and prediction
Given training data (x1, y1), . . . , (xl, yl)
Regression: find a function so that
f (xi) ≈ yi
Least squares regression:
$$\min_{w,b} \ \sum_{i=1}^{l}\big(y_i - (w^T x_i + b)\big)^2$$
(figure: data points and the fitted line $w^T x + b$)
This is equivalent to
$$\min_{w,b,\xi,\xi^*} \ \sum_{i=1}^{l}\xi_i^2 + (\xi_i^*)^2 \quad \text{subject to} \quad -\xi_i^* \le y_i - (w^T x_i + b) \le \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
A quadratic programming problem
L1-norm regression:
$$\min_{w,b} \ \sum_{i=1}^{l}|y_i - (w^T x_i + b)| \quad \text{or} \quad \min_{w,b,\xi,\xi^*} \ \sum_{i=1}^{l}(\xi_i + \xi_i^*)$$
A linear programming problem
This is equivalent to
$$\min_{w,b,\xi,\xi^*} \ C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \quad \text{subject to} \quad -\xi_i^* \le y_i - (w^T x_i + b) \le \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
$C$: a constant
Linear support vector regression:
$$\min_{w,b,\xi,\xi^*} \ \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \quad \text{subject to} \quad -\xi_i^* - \epsilon \le y_i - (w^T x_i + b) \le \epsilon + \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
A tube
$$-\epsilon \le y - (w^T x + b) \le \epsilon$$
Data in the tube are considered to have no error
Most training data are in the tube
$\tfrac{1}{2}w^T w$: regularization, makes $w^T x + b$ smoother
Similar to the classification case
General support vector regression:
Data mapped to a higher dimensional space by $\phi(x)$
The new approximation function: $w^T\phi(x) + b$
(figure: the $\epsilon$-insensitive tube around $w^T\phi(x) + b$, with slack variables $\xi_i$ and $\xi_i^*$)
Standard SVR:
$$\min_{w,b,\xi,\xi^*} \ \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}\xi_i + C\sum_{i=1}^{l}\xi_i^* \quad \text{subject to} \quad -\epsilon - \xi_i^* \le y_i - (w^T\phi(x_i) + b) \le \epsilon + \xi_i,\ \ \xi_i, \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
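Equivalently, each data point contributes the ε-insensitive loss max(0, |y − f(x)| − ε); a minimal sketch of that loss (illustrative, not LIBSVM code):

def epsilon_insensitive_loss(y, fx, eps):
    # zero inside the tube |y - f(x)| <= eps, linear outside
    return max(0.0, abs(y - fx) - eps)

print(epsilon_insensitive_loss(3.0, 3.2, eps=0.5))   # inside the tube -> 0.0
print(epsilon_insensitive_loss(3.0, 4.0, eps=0.5))   # outside the tube -> 0.5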
Data in high dimensional spaces
Possibly $w^T\phi(x_i) + b = y_i$, $i = 1, \ldots, l$ ⇒ overfitting
Good regression methods: balance between overfitting and underfitting
min $\tfrac{1}{2}w^T w$: avoid overfitting
Support vector regression using LIBSVM
Using the option -s 3
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
Check a regression data set:
$head -n 5 svrprob/trans/abalone.scale.shuffle.train.1
13 1:-1 2:0.310811 3:0.193277 4:-0.707965 5:-0.273012 6:-0.441693 7:-0.175958 8:-0.324278
12 1:-1 2:0.797297 3:0.764706 4:-0.637168 5:0.86369 6:0.301118 7:0.634146 8:0.101302
9 1:-1 2:-0.121622 3:-0.159664 4:-0.769912 5:-0.771641 6:-0.848243 7:-0.766551 8:-0.765705
8 1:-1 2:0.189189 3:0.142857 4:-0.761062 5:-0.212691 6:-0.247604 7:-0.132404 8:-0.432937
19 1:1 2:0.202703 3:0.12605 4:-0.752212 5:-0.427732 6:-0.615815 7:-0.5 8:-0.414827
Additional parameters (-p is the ε of the ε-insensitive loss):
$svm-train -s 3 -c 64 -g 0.25 -p 0.5 svrprob/trans/abalone.scale.shuffle.train.1
Test
$svm-predict svrprob/trans/abalone.scale.shuffle.test.1 abalone.scale.shuffle.train.1.model o
Accuracy = 0% (0/200) (classification)
Mean squared error = 5.00931 (regression)
Squared correlation coefficient = 0.58387 (regression)
Test data:
$head -n 5 svrprob/trans/abalone.scale.shuffle.test.1
8 1:1 2:0.0945946 3:0.0420168 4:-0.823009 5:-0.640423 6:-0.649361 7:-0.710801 8:-0.697793
20 2:0.756757 3:0.747899 4:-0.690265 5:0.662358 6:0.220447 7:0.571429 8:0.92077
7 1:1 2:0.175676 3:0.142857 4:-0.769912 5:-0.529573 6:-0.552716 7:-0.503484 8:-0.636672
8 1:-1 2:0.189189 3:0.12605 4:-0.752212 5:-0.470427 6:-0.456869 7:-0.54007 8:-0.734012
Check the predictions:
$head -n 5 o
7.62474
14.7385
8.40741
7.13157
9.74578
SVR Performance Evaluation
Not accuracy any more
MSE (Mean Squared Error):
$$\frac{1}{l}\sum_{i=1}^{l}(y_i - \hat{y}_i)^2$$
Squared correlation coefficient (also called $r^2$):
$$r^2 = \frac{\Big(\sum_{i=1}^{l}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\Big)^2}{\sum_{i=1}^{l}(y_i - \bar{y})^2 \ \sum_{i=1}^{l}(\hat{y}_i - \bar{\hat{y}})^2}$$
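A small sketch computing both measures from true values and predictions (illustrative; the arrays below are made up):

import numpy as np

y_true = np.array([8.0, 20.0, 7.0, 8.0])
y_pred = np.array([7.62, 14.74, 8.41, 7.13])

mse = np.mean((y_true - y_pred) ** 2)
r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2   # squared correlation coefficient
print(mse, r2)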
Conclusions
Dealing with data is interesting
especially if you get good accuracy
Some basic understanding is essential when applying these methods
e.g. the importance of validation
No method is the best for all data