Support Vector Machines and Kernel Methods: Status and Challenges
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at K. U. Leuven Optimization in Engineering Center, January 15, 2013
Outline
Basic concepts: SVM and kernels
Dual problem and SVM variants
Practical use of SVM
Multi-class classification
Large-scale training
Linear SVM
Discussion and conclusions
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Feature vectors. For example, a patient = $[\text{height}, \text{weight}, \ldots]^T$
Consider a simple case with two classes:
Define an indicator vector $y$
\[ y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1}, \\ -1 & \text{if } x_i \text{ in class 2}. \end{cases} \]
A hyperplane which separates all data
[Figure: the parallel hyperplanes $w^T x + b = +1$, $0$, $-1$]
A separating hyperplane: $w^T x + b = 0$
\[ w^T x_i + b \ge 1 \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{ if } y_i = -1 \]
Decision function $f(x) = \operatorname{sgn}(w^T x + b)$, $x$: test data
Many possible choices of $w$ and $b$
Maximal Margin
Distance between $w^T x + b = 1$ and $w^T x + b = -1$:
\[ 2/\|w\| = 2/\sqrt{w^T w} \]
A quadratic programming problem (Boser et al., 1992)
\[ \min_{w,b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1, \ i = 1, \ldots, l. \]
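The margin formula above can be checked numerically. The following is a minimal sketch; the weight vector is a made-up value chosen only so that the arithmetic is easy to follow, not something taken from the talk.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w^T x + b = +1 and w^T x + b = -1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Hypothetical weight vector with ||w|| = 5, so the margin is 2/5.
w = [3.0, 4.0]
width = margin_width(w)
print(width)
```

A smaller $w^T w$ gives a wider margin, which is why the quadratic program minimizes $\frac{1}{2} w^T w$.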
Data May Not Be Linearly Separable
An example:
Allow training errors
Higher dimensional (maybe infinite) feature space
\[ \phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T. \]
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
\[ \min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \]
\[ \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \ldots, l. \]
Example: $x \in R^3$, $\phi(x) \in R^{10}$
\[ \phi(x) = [1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3]^T \]
Finding the Decision Function
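$w$: maybe infinite variables
The dual problem: finite number of variables
\[ \min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \]
\[ \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \qquad y^T \alpha = 0, \]
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$
At optimum
\[ w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \]
A finite problem: #variables = #training data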
Kernel Tricks
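$Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$
\[ \phi(x_i) = [1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3]^T \]
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.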
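The identity $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$ can be verified numerically. This is a small sketch; the two input vectors are arbitrary illustrative values.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for x in R^3 (10 components)."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0,
            r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Closed form (1 + x^T y)^2: three multiplications instead of ten."""
    return (1.0 + dot(x, y)) ** 2

# Arbitrary illustrative inputs.
xi = [1.0, 2.0, 3.0]
xj = [0.5, -1.0, 2.0]
explicit = dot(phi(xi), phi(xj))
closed_form = poly_kernel(xi, xj)
print(explicit, closed_form)
```

Both values agree up to floating-point rounding, while the closed form never builds the 10-dimensional vectors.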
Kernel: $K(x, y) = \phi(x)^T \phi(y)$; common kernels:
$e^{-\gamma \|x_i - x_j\|^2}$ (Radial Basis Function)
$(x_i^T x_j / a + b)^d$ (Polynomial kernel)
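Can be inner product in infinite dimensional space
Assume $x \in R^1$ and $\gamma > 0$.
\begin{align*}
e^{-\gamma \|x_i - x_j\|^2} &= e^{-\gamma (x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2} \Bigl( 1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots \Bigr) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2} \Bigl( 1 \cdot 1 + \sqrt{\tfrac{2\gamma}{1!}} x_i \cdot \sqrt{\tfrac{2\gamma}{1!}} x_j + \sqrt{\tfrac{(2\gamma)^2}{2!}} x_i^2 \cdot \sqrt{\tfrac{(2\gamma)^2}{2!}} x_j^2 + \sqrt{\tfrac{(2\gamma)^3}{3!}} x_i^3 \cdot \sqrt{\tfrac{(2\gamma)^3}{3!}} x_j^3 + \cdots \Bigr) \\
&= \phi(x_i)^T \phi(x_j),
\end{align*}
where
\[ \phi(x) = e^{-\gamma x^2} \Bigl[ 1, \sqrt{\tfrac{2\gamma}{1!}} x, \sqrt{\tfrac{(2\gamma)^2}{2!}} x^2, \sqrt{\tfrac{(2\gamma)^3}{3!}} x^3, \cdots \Bigr]^T. \]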
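To see the infinite expansion at work, one can truncate $\phi(x)$ after a few components and compare the inner product with the exact RBF value. The sketch below does this for one-dimensional inputs; the values of `gamma`, `xi`, and `xj` are illustrative assumptions.

```python
import math

def phi_truncated(x, gamma, terms):
    """First `terms` components of the infinite feature map of the 1-D RBF kernel."""
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

def rbf(xi, xj, gamma):
    return math.exp(-gamma * (xi - xj) ** 2)

def truncated_inner_product(xi, xj, gamma, terms):
    a = phi_truncated(xi, gamma, terms)
    b = phi_truncated(xj, gamma, terms)
    return sum(p * q for p, q in zip(a, b))

# Illustrative values (assumptions, not from the talk).
gamma, xi, xj = 0.5, 0.8, -0.3
exact = rbf(xi, xj, gamma)
for terms in (1, 2, 4, 8, 20):
    approx = truncated_inner_product(xi, xj, gamma, terms)
    print(terms, approx, abs(approx - exact))
```

The error shrinks quickly as more components are kept, because the omitted terms are the tail of an exponential series.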
Issues
So what kind of kernel should I use?
What kind of functions are valid kernels?
How to decide kernel parameters?
Some of these issues will be discussed later
Decision function
At optimum
\[ w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \]
Decision function
\[ w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \]
Only $\phi(x_i)$ of $\alpha_i > 0$ used ⇒ support vectors
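The kernelized decision function can be sketched as follows. The training points, the $\alpha$ values, and $b$ are hypothetical numbers chosen for illustration (they satisfy $y^T \alpha = 0$ but are not the output of an actual solver); `gamma` is likewise an assumption.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision_value(x, train_x, train_y, alpha, b, kernel=rbf_kernel):
    """sum_i alpha_i y_i K(x_i, x) + b; points with alpha_i = 0 contribute nothing."""
    total = b
    for xi, yi, ai in zip(train_x, train_y, alpha):
        if ai > 0:  # only support vectors are evaluated
            total += ai * yi * kernel(xi, x)
    return total

def predict(x, train_x, train_y, alpha, b):
    return 1 if decision_value(x, train_x, train_y, alpha, b) >= 0 else -1

# Hypothetical model: two support vectors and one point with alpha = 0.
train_x = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
train_y = [-1, 1, 1]
alpha = [0.7, 0.7, 0.0]
b = 0.0

support_vectors = [x for x, a in zip(train_x, alpha) if a > 0]
print(len(support_vectors), predict([1.0, 1.0], train_x, train_y, alpha, b))
```

Prediction cost grows with the number of support vectors, not with the dimension of $\phi(x)$.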
Support Vectors: More Important Data
Only φ(xi) of αi > 0 used ⇒ support vectors
Deriving the Dual
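For simplification, consider the problem without $\xi_i$
\[ \min_{w,b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1, \ i = 1, \ldots, l. \]
Its dual is
\[ \min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i, \ i = 1, \ldots, l, \quad y^T \alpha = 0. \]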
Lagrangian Dual
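\[ \max_{\alpha \ge 0} \ \min_{w,b} \ L(w, b, \alpha), \]
where
\[ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \bigl( y_i(w^T \phi(x_i) + b) - 1 \bigr) \]
Strong duality (be careful about this)
\[ \min \text{Primal} = \max_{\alpha \ge 0} \ \min_{w,b} \ L(w, b, \alpha) \]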
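Simplify the dual. When $\alpha$ is fixed,
\[ \min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0, \\ \min_{w} \ \frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i [y_i w^T \phi(x_i) - 1] & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0. \end{cases} \]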
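If $\sum_{i=1}^{l} \alpha_i y_i \ne 0$, we can decrease
\[ -b \sum_{i=1}^{l} \alpha_i y_i \]
in $L(w, b, \alpha)$ to $-\infty$
If $\sum_{i=1}^{l} \alpha_i y_i = 0$, optimum of the strictly convex function
\[ \frac{1}{2} w^T w - \sum_{i=1}^{l} \alpha_i [y_i w^T \phi(x_i) - 1] \]
happens when
\[ \nabla_w L(w, b, \alpha) = 0. \]
Thus,
\[ w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i). \]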
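Note that
\[ w^T w = \Bigl( \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \Bigr)^T \Bigl( \sum_{j=1}^{l} \alpha_j y_j \phi(x_j) \Bigr) = \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \]
The dual is
\[ \max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) & \text{if } \sum_{i=1}^{l} \alpha_i y_i = 0, \\ -\infty & \text{if } \sum_{i=1}^{l} \alpha_i y_i \ne 0. \end{cases} \]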
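Lagrangian dual: $\max_{\alpha \ge 0} \min_{w,b} L(w, b, \alpha)$
$-\infty$ is definitely not the maximum of the dual
So the dual optimal solution does not happen when
\[ \sum_{i=1}^{l} \alpha_i y_i \ne 0. \]
Dual simplified to
\[ \max_{\alpha \in R^l} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \]
\[ \text{subject to} \quad y^T \alpha = 0, \quad \alpha_i \ge 0, \ i = 1, \ldots, l. \]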
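A two-point problem makes the derivation concrete. The sketch below assumes $\phi(x) = x$ with the hypothetical data $x_1 = 1$, $y_1 = 1$ and $x_2 = -1$, $y_2 = -1$. The constraint $y^T \alpha = 0$ forces $\alpha_1 = \alpha_2 = a$, so the dual can be scanned over a single variable; a plain grid scan stands in for a real QP solver here.

```python
def dual_objective(alpha, x, y):
    """sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j x_i x_j (phi(x) = x)."""
    n = len(x)
    linear = sum(alpha)
    quadratic = sum(alpha[i] * alpha[j] * y[i] * y[j] * x[i] * x[j]
                    for i in range(n) for j in range(n))
    return linear - 0.5 * quadratic

# Hypothetical one-dimensional training set.
x = [1.0, -1.0]
y = [1.0, -1.0]

# y^T alpha = 0 means alpha_1 = alpha_2 = a; scan a >= 0 on a grid.
candidates = [k / 1000.0 for k in range(2001)]
best_a = max(candidates, key=lambda a: dual_objective([a, a], x, y))
dual_value = dual_objective([best_a, best_a], x, y)

# Recover w from the dual solution: w = sum_i alpha_i y_i x_i.
w = sum(a * yi * xi for a, yi, xi in zip([best_a, best_a], y, x))
primal_value = 0.5 * w * w  # with b = 0 both constraints hold with equality

print(best_a, w, dual_value, primal_value)
```

Here the dual maximum equals the primal minimum, which is the strong duality statement for this tiny instance.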
More about Dual Problems
After SVM became popular, quite a few people came to think that for any optimization problem
⇒ a Lagrangian dual exists and strong duality holds
Wrong! We usually need
Convex programming; constraint qualification
We have them:
SVM primal is convex; linear constraints
Our problems may be infinite dimensional
We can still use Lagrangian duality
See a rigorous discussion in Lin (2001)
Primal versus Dual
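Recall the dual problem is
\[ \min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \]
\[ \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \qquad y^T \alpha = 0 \]
and at optimum
\[ w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i) \tag{1} \]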
Primal versus Dual (Cont’d)
What if we put (1) into the primal?
\[ \min_{\alpha, \xi} \ \frac{1}{2} \alpha^T Q \alpha + C \sum_{i=1}^{l} \xi_i \]
\[ \text{subject to} \quad (Q\alpha + by)_i \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{2} \]
If $Q$ is positive definite, we can prove that the optimal $\alpha$ of (2) is the same as that of the dual
So the dual is not the only choice to solve when we use kernels
Other Variants
A general form for binary classification
\[ \min_{w} \ r(w) + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \]
$r(w)$: regularization term
$\xi(w; x, y)$: loss function: we hope $y w^T x > 0$
$C$: regularization parameter
Loss Functions
Some commonly used loss functions:
\begin{align}
\xi_{L1}(w; x, y) &\equiv \max(0, 1 - y w^T x), \tag{3} \\
\xi_{L2}(w; x, y) &\equiv \max(0, 1 - y w^T x)^2, \text{ and} \tag{4} \\
\xi_{LR}(w; x, y) &\equiv \log(1 + e^{-y w^T x}). \tag{5}
\end{align}
We omit the bias term $b$ here
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (3)-(4)
Logistic regression (LR): (5)
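The three losses (3)-(5) depend on the data only through the value $y w^T x$, so they can be compared side by side. A minimal sketch, where the argument `margin` stands for $y w^T x$ and the sample values are arbitrary:

```python
import math

def l1_loss(margin):
    """Hinge loss (3): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - margin)

def l2_loss(margin):
    """Squared hinge loss (4): max(0, 1 - y w^T x)^2."""
    return max(0.0, 1.0 - margin) ** 2

def lr_loss(margin):
    """Logistic loss (5): log(1 + exp(-y w^T x))."""
    return math.log(1.0 + math.exp(-margin))

# Arbitrary sample values of y w^T x.
for margin in (-2.0, 0.0, 1.0, 3.0):
    print(margin, l1_loss(margin), l2_loss(margin), lr_loss(margin))
```

The hinge-type losses are exactly zero once $y w^T x \ge 1$, while the logistic loss is small but never zero.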
Loss Functions (Cont’d)
[Figure: the losses $\xi_{L1}$, $\xi_{L2}$, and $\xi_{LR}$; horizontal axis $-y w^T x$, vertical axis $\xi(w; x, y)$]
Indeed SVM and logistic regression are very similar
Loss Functions (Cont’d)
If we use the square loss function
\[ \xi(w; x, y) \equiv (1 - y w^T x)^2 \]
it becomes least-square SVM (Suykens and Vandewalle, 1999) or Gaussian process
Regularization
L1 versus L2: $\|w\|_1$ and $w^T w / 2$
$w^T w / 2$: smooth, easier to optimize
$\|w\|_1$: non-differentiable; sparse solution; possibly many zero elements
Possible advantages of L1 regularization:
Feature selection
Less storage for $w$
Training SVM
The main issue is to solve the dual problem
\[ \min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \]
\[ \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \qquad y^T \alpha = 0 \]
This will be discussed in Thursday’s lecture, which talks about the connection between optimization and machine learning
Let’s Try a Practical Example
A problem from astroparticle physics
1  2.61e+01  5.88e+01  -1.89e-01  1.25e+02
1  5.70e+01  2.21e+02   8.60e-02  1.22e+02
1  1.72e+01  1.73e+02  -1.29e-01  1.25e+02
0  2.39e+01  3.89e+01   4.70e-01  1.25e+02
0  2.23e+01  2.26e+01   2.11e-01  1.01e+02
0  1.64e+01  3.92e+01  -9.91e-02  3.24e+01
Training and testing sets available: 3,089 and 4,000
Data available at LIBSVM Data Sets
Training and Testing
Training the set svmguide1 to obtain svmguide1.model
$./svm-train svmguide1
Testing the set svmguide1.t
$./svm-predict svmguide1.t svmguide1.model out
Accuracy = 66.925% (2677/4000)
We see that training and testing accuracy are very different. Training accuracy is almost 100%
$./svm-predict svmguide1 svmguide1.model out
Accuracy = 99.7734% (3082/3089)
Why this Fails
Gaussian kernel is used here
We see that most kernel elements have
\[ K_{ij} = e^{-\|x_i - x_j\|^2 / 4} \begin{cases} = 1 & \text{if } i = j, \\ \to 0 & \text{if } i \ne j, \end{cases} \]
because some features are in large numeric ranges
For what kind of data is $K \approx I$?
Why this Fails (Cont’d)
If we have training data
\[ \phi(x_1) = [1, 0, \ldots, 0]^T, \quad \ldots, \quad \phi(x_l) = [0, \ldots, 0, 1]^T \]
then
\[ K = I \]
Clearly such training data can be correctly separated, but how about testing data?
So overfitting occurs
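The claim that the unscaled kernel matrix is close to the identity can be checked on the six sample rows shown earlier for the astroparticle data. This sketch uses only those six rows (not the full training set) and the kernel $e^{-\|x_i - x_j\|^2/4}$ from the formula above.

```python
import math

# Feature values of the six sample rows listed in the practical example.
rows = [
    [26.1, 58.8, -0.189, 125.0],
    [57.0, 221.0, 0.086, 122.0],
    [17.2, 173.0, -0.129, 125.0],
    [23.9, 38.9, 0.470, 125.0],
    [22.3, 22.6, 0.211, 101.0],
    [16.4, 39.2, -0.0991, 32.4],
]

def rbf(u, v, gamma=0.25):
    """exp(-gamma * ||u - v||^2); gamma = 1/4 matches the slide's formula."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

n = len(rows)
K = [[rbf(rows[i], rows[j]) for j in range(n)] for i in range(n)]
max_off_diagonal = max(K[i][j] for i in range(n) for j in range(n) if i != j)

print(max_off_diagonal)  # numerically indistinguishable from zero
```

With features in the range of tens to hundreds, squared distances are in the hundreds or thousands, so every off-diagonal entry underflows toward zero.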
Overfitting
See the illustration in the next slide
In theory, you can easily achieve 100% training accuracy
This is useless
When training on and predicting data, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure legend: ● and ▲: training; ○ and △: testing]
Data Scaling
Without scaling, the above overfitting situation may occur
Also, features in greater numeric ranges may dominate
A simple solution is to linearly scale each feature to [0, 1] by:
\[ \frac{\text{feature value} - \min}{\max - \min} \]
There are many other scaling methods
Scaling generally helps, but not always
Data Scaling: Same Factors
A common mistake
$./svm-scale -l -1 -u 1 svmguide1 > svmguide1.scale
$./svm-scale -l -1 -u 1 svmguide1.t > svmguide1.t.scale
-l -1 -u 1: scaling to [−1, 1]
We need to use same factors on training and testing
$./svm-scale -s range1 svmguide1 > svmguide1.scale
$./svm-scale -r range1 svmguide1.t > svmguide1.t.scale
Later we will give a real example
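The difference between the two ways of scaling can be sketched in a few lines. This is an illustration of the idea only, not a reimplementation of svm-scale: the helper names `fit_ranges` and `apply_ranges` and the toy data are hypothetical, and mapping a constant feature to the lower bound is an arbitrary choice made here.

```python
def fit_ranges(rows):
    """Per-feature (min, max), computed on the given rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_ranges(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature using previously saved (min, max) factors."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(lower)  # constant feature: arbitrary choice
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled

# Hypothetical toy data.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[5.0, 40.0], [10.0, 20.0]]

ranges = fit_ranges(train)                 # analogous to saving factors with -s
train_scaled = apply_ranges(train, ranges)
test_scaled = apply_ranges(test, ranges)   # analogous to restoring them with -r

# The common mistake: computing new factors from the test set itself.
test_scaled_wrong = apply_ranges(test, fit_ranges(test))

print(test_scaled, test_scaled_wrong)
```

With shared factors a test value may fall outside [−1, 1], which is expected; with separate factors the same raw value is mapped to a different number in training and testing.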
After Data Scaling
Train scaled data and then predict
$./svm-train svmguide1.scale
$./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.15%
Training accuracy is now similar
$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 96.439%
For this experiment, we use parameters C = 1, γ = 0.25, but sometimes performances are sensitive to parameters
Parameters versus Performances
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 svmguide1.scale
$./svm-predict svmguide1.scale svmguide1.scale.model o
Accuracy = 100% (3089/3089)
100% training accuracy but
$./svm-predict svmguide1.t.scale svmguide1.scale.model o
Accuracy = 82.7% (3308/4000)
Very bad test accuracy
Overfitting happens
Parameter Selection
For SVM, we may need to select suitable parameters
They are $C$ and kernel parameters
Example:
$\gamma$ of $e^{-\gamma \|x_i - x_j\|^2}$
$a, b, d$ of $(x_i^T x_j / a + b)^d$
How to select them so performance is better?
Performance Evaluation
Available data ⇒ training and validation
Train on the training set; test on the validation set to estimate the performance
A common way is k-fold cross validation (CV):
Data randomly separated into k groups
Each time k − 1 groups as training and one as testing
Select parameters/kernels with the best CV result
There are many other methods to evaluate the performance
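The k-fold procedure can be sketched as below. The callable `train_and_count_correct` is a hypothetical placeholder for "train on these indices, then return the number of correct predictions on the held-out indices"; here a dummy stands in for a real SVM trainer.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly separate indices 0..n-1 into k groups of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_accuracy(n, k, train_and_count_correct, seed=0):
    """Each group is held out once; the other k - 1 groups are used for training."""
    folds = k_fold_indices(n, k, seed)
    correct = 0
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f in range(k) if f != held_out for i in folds[f]]
        correct += train_and_count_correct(training, validation)
    return correct / n

# Dummy stand-in for a real trainer: pretends every held-out point is correct.
def always_right(training, validation):
    return len(validation)

folds = k_fold_indices(10, 3)
cv_rate = cross_validation_accuracy(10, 3, always_right)
print([len(f) for f in folds], cv_rate)
```

Every instance is predicted exactly once, so the CV rate is the fraction of all data predicted correctly.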
Contour of CV Accuracy
The good region of parameters is quite large
SVM is sensitive to parameters, but not that sensitive
Sometimes default parameters work,
but it’s good to select them if time allows
Example of Parameter Selection
Direct training and test
$./svm-train svmguide3
$./svm-predict svmguide3.t svmguide3.model o
→ Accuracy = 2.43902%
After data scaling, accuracy is still low
$./svm-scale -s range3 svmguide3 > svmguide3.scale
$./svm-scale -r range3 svmguide3.t > svmguide3.t.scale
$./svm-train svmguide3.scale
$./svm-predict svmguide3.t.scale svmguide3.scale.model o
→ Accuracy = 12.1951%
Example of Parameter Selection (Cont’d)
Select parameters by trying a grid of (C , γ) values
$ python grid.py svmguide3.scale
· · ·
128.0 0.125 84.8753
(Best C =128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)
Train and predict using the obtained parameters
$ ./svm-train -c 128 -g 0.125 svmguide3.scale
$ ./svm-predict svmguide3.t.scale svmguide3.scale.model svmguide3.t.predict
→ Accuracy = 87.8049%
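The grid search above amounts to a double loop over exponentially spaced values. The sketch below is not the code of grid.py: the exponent ranges are assumptions for illustration, and `toy_cv_accuracy` is a made-up score surface standing in for a real cross-validation run.

```python
import math

def grid_search(cv_accuracy, log2c_values, log2g_values):
    """Try every (C, gamma) = (2^i, 2^j) pair and keep the best CV score."""
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            c, g = 2.0 ** log2c, 2.0 ** log2g
            score = cv_accuracy(c, g)
            if best is None or score > best[2]:
                best = (c, g, score)
    return best

# Hypothetical score surface with a single peak at C = 2^7, gamma = 2^-3.
def toy_cv_accuracy(c, g):
    return -((math.log2(c) - 7.0) ** 2) - (math.log2(g) + 3.0) ** 2

# Assumed exponent ranges, chosen for illustration.
best_c, best_g, best_score = grid_search(
    toy_cv_accuracy, range(-5, 16, 2), range(3, -16, -2))
print(best_c, best_g)
```

Exponential spacing covers several orders of magnitude with few evaluations, and each grid point can be evaluated independently.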
Selecting Kernels
RBF, polynomial, or others?
For beginners, use RBF first
Linear kernel: special case of RBF
Accuracy of linear the same as RBF under certain parameters (Keerthi and Lin, 2003)
Polynomial kernel: $(x_i^T x_j / a + b)^d$
Numerical difficulties: $(< 1)^d \to 0$, $(> 1)^d \to \infty$
More parameters than RBF
Selecting Kernels (Cont’d)
Commonly used kernels are Gaussian (RBF), polynomial, and linear
But in different areas, special kernels have been developed. Examples
1. $\chi^2$ kernel is popular in computer vision
2. String kernel is useful in some domains
A Simple Procedure for Beginners
After helping many users, we came up with the following procedure
1. Conduct simple scaling on the data
2. Consider RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$
3. Use cross-validation to find the best parameters $C$ and $\gamma$
4. Use the best $C$ and $\gamma$ to train the whole training set
5. Test
In LIBSVM, we have a python script easy.py implementing this procedure.
A Simple Procedure for Beginners (Cont’d)
We proposed this procedure in an “SVM guide” (Hsu et al., 2003) and implemented it in LIBSVM
From research viewpoints, this procedure is not novel. We never thought about submitting our guide somewhere
But this procedure has been tremendously useful.
It is now almost the standard thing to do for SVM beginners
A Real Example of Wrong Scaling
Separately scale each feature of training and testing data to [0, 1]
$ ../svm-scale -l 0 svmguide4 > svmguide4.scale
$ ../svm-scale -l 0 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 69.2308% (216/312) (classification)
The accuracy is low even after parameter selection
$ ../svm-scale -l 0 -s range4 svmguide4 > svmguide4.scale
$ ../svm-scale -r range4 svmguide4.t > svmguide4.t.scale
$ python easy.py svmguide4.scale svmguide4.t.scale
Accuracy = 89.4231% (279/312) (classification)
A Real Example of Wrong Scaling (Cont’d)
With the correct setting, the 10 features in the test data svmguide4.t.scale have the following maximal values:
0.7402, 0.4421, 0.6291, 0.8583, 0.5385, 0.7407, 0.3982, 1.0000, 0.8218, 0.9874
Scaling the test set to [0, 1] generated an erroneous set.
Multi-class Classification
k classes
One-against-the-rest: train k binary SVMs:
1st class vs. (2, . . . , k)th class
2nd class vs. (1, 3, . . . , k)th class
...
k decision functions
\[ (w^1)^T \phi(x) + b_1, \quad \ldots, \quad (w^k)^T \phi(x) + b_k \]
Prediction:
\[ \arg\max_j \ (w^j)^T \phi(x) + b_j \]
Reason: if $x \in$ 1st class, then we should have
\[ (w^1)^T \phi(x) + b_1 \ge +1, \quad (w^2)^T \phi(x) + b_2 \le -1, \quad \ldots, \quad (w^k)^T \phi(x) + b_k \le -1 \]
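The one-against-the-rest prediction rule is an arg max over the k decision values. A minimal sketch, with hypothetical decision values for a 4-class problem:

```python
def one_vs_rest_predict(decision_values):
    """decision_values[j] holds (w^j)^T phi(x) + b_j for class j + 1."""
    best_index = max(range(len(decision_values)),
                     key=lambda j: decision_values[j])
    return best_index + 1  # classes are numbered from 1

# Hypothetical outputs of four binary SVMs for one test instance.
values = [1.3, -1.1, -0.4, -2.0]
predicted_class = one_vs_rest_predict(values)
print(predicted_class)
```

Taking the largest value also gives an answer when no classifier, or more than one, returns a positive value.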
Multi-class Classification (Cont’d)
One-against-one: train k(k − 1)/2 binary SVMs
(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)
If 4 classes ⇒ 6 binary SVMs

y_i = 1    y_i = −1   Decision function
class 1    class 2    $f^{12}(x) = (w^{12})^T x + b^{12}$
class 1    class 3    $f^{13}(x) = (w^{13})^T x + b^{13}$
class 1    class 4    $f^{14}(x) = (w^{14})^T x + b^{14}$
class 2    class 3    $f^{23}(x) = (w^{23})^T x + b^{23}$
class 2    class 4    $f^{24}(x) = (w^{24})^T x + b^{24}$
class 3    class 4    $f^{34}(x) = (w^{34})^T x + b^{34}$

For a testing instance, predict with all binary SVMs

Classes    winner
1   2      1
1   3      1
1   4      1
2   3      2
2   4      4
3   4      3

Select the one with the largest vote

class      1  2  3  4
# votes    3  1  1  1

May use decision values as well
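The voting step can be reproduced directly from the table of pairwise winners. In this sketch the tie-breaking rule (smallest class label wins) is an assumption; the talk does not specify one.

```python
from collections import Counter

def vote(pairwise_winners):
    """Count one vote per binary SVM and return (votes, predicted class)."""
    votes = Counter(pairwise_winners.values())
    # Assumed tie-break: among equally voted classes, take the smallest label.
    predicted = min(votes, key=lambda c: (-votes[c], c))
    return votes, predicted

# Winners of the six binary SVMs from the 4-class example.
pairwise_winners = {
    (1, 2): 1, (1, 3): 1, (1, 4): 1,
    (2, 3): 2, (2, 4): 4, (3, 4): 3,
}
votes, predicted = vote(pairwise_winners)
print(dict(votes), predicted)
```

Class 1 collects three votes and each other class one vote, matching the vote table.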
More Complicated Forms
Solving a single optimization problem (Weston and Watkins, 1999; Crammer and Singer, 2002; Lee et al., 2004)
There are many other methods
A comparison in Hsu and Lin (2002)
RBF kernel: accuracy similar for different methods
But 1-against-1 is the fastest for training
SVM doesn’t Scale Up
Yes, if using kernels
Training millions of data is time consuming
Cases with many support vectors: quadratic time bottleneck on $Q_{SV,SV}$
For noisy data: # SVs increases linearly in data size (Steinwart, 2003)
Some solutions:
Parallelization
Approximation
Parallelization
Multi-core/Shared Memory/GPU
• One line change of LIBSVM
Multicore Shared-memory
1 80 1 100
2 48 2 57
4 32 4 36
8 27 8 28
50,000 data (kernel evaluations: 80% time)
• GPU (Catanzaro et al., 2008); Cell (Marzolla, 2010)
Distributed Environments
• Chang et al. (2007); Zanni et al. (2006); Zhu et al. (2009).
Approximately Training SVM
Can be done in many aspects
Data level: sub-sampling
Optimization level: approximately solve the quadratic program
Other non-intuitive but effective ways
I will show one today
Many papers have addressed this issue
Approximately Training SVM (Cont’d)
Subsampling
Simple and often effective
More advanced techniques
Incremental training (e.g., Syed et al., 1999):
Data ⇒ 10 parts
train 1st part ⇒ SVs, train SVs + 2nd part, . . .
Select and train good points: KNN or heuristics
For example, Bakır et al. (2005)
Approximately Training SVM (Cont’d)
Approximate the kernel; e.g., Fine and Scheinberg (2001); Williams and Seeger (2001)
Use part of the kernel; e.g., Lee and Mangasarian (2001); Keerthi et al. (2006)
Early stopping of optimization algorithms: Tsang et al. (2005) and others
And many more
Some are simple, but some are sophisticated
Approximately Training SVM (Cont’d)
Sophisticated techniques may not always be useful
Sometimes slower than sub-sampling
covtype: 500k training and 80k testing
rcv1: 550k training and 14k testing

       covtype                        rcv1
Training size   Accuracy     Training size   Accuracy
 50k            92.5%         50k            97.2%
100k            95.3%        100k            97.4%
500k            98.2%        550k            97.8%
Discussion: Large-scale Training
We don’t have many large and well labeled sets
Expensive to obtain true labels
Specific properties of data should be considered
We will illustrate this point using linear SVM
The design of software for very large data sets should differ by application
Linear and Kernel Classification
Methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
x ⇒ φ(x)
$\phi(x_i)^T \phi(x_j)$ easily calculated; little control on φ(·)
Linear classification + feature engineering:
We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)
We refer to them as kernel and linear classifiers
Linear and Kernel Classification
Let’s check the prediction cost
\[ w^T x + b \quad \text{versus} \quad \sum_{i=1}^{l} \alpha_i K(x_i, x) + b \]
If $K(x_i, x_j)$ takes $O(n)$, then
\[ O(n) \quad \text{versus} \quad O(nl) \]
Linear is much cheaper
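For the linear kernel the two prediction formulas give the same value, but the kernel form touches every stored training vector. The sketch below uses hypothetical vectors and coefficients; `coef[i]` stands for the coefficient of $K(x_i, x)$ in the sum, with the label absorbed into it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel_predict(x, vectors, coef, b):
    """sum_i coef_i K(x_i, x) + b with the linear kernel: O(n l) work."""
    return sum(c * dot(v, x) for c, v in zip(coef, vectors)) + b

def collapse_to_w(vectors, coef):
    """Precompute w = sum_i coef_i x_i once, after training."""
    n = len(vectors[0])
    return [sum(c * v[j] for c, v in zip(coef, vectors)) for j in range(n)]

def linear_predict(x, w, b):
    """w^T x + b: O(n) work per prediction."""
    return dot(w, x) + b

# Hypothetical model with l = 3 stored vectors in n = 3 dimensions.
vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
coef = [0.5, -0.25, -0.25]
b = 0.1
x = [2.0, 1.0, 1.0]

w = collapse_to_w(vectors, coef)
print(kernel_predict(x, vectors, coef, b), linear_predict(x, w, b))
```

This collapse into a single $w$ is only possible when $\phi$ is explicit and finite dimensional; with a general kernel the sum over training vectors must be kept.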
Linear and Kernel Classification (Cont’d)
Also, linear is a special case of kernel
Indeed, we can prove that accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)
Therefore, roughly we have
accuracy: kernel ≥ linear
cost: kernel ≫ linear
Speed is the reason to use linear
Linear and Kernel Classification (Cont’d)
For some problems, accuracy by linear is as good as nonlinear
But training and testing are much faster
This particularly happens for document classification
Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)
Recently linear classification is a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008)
Comparison Between Linear and Kernel (Training Time & Testing Accuracy)
                     Linear                RBF Kernel
Data set        Time    Accuracy       Time        Accuracy
MNIST38          0.1     96.82           38.1       99.70
ijcnn1           1.6     91.81           26.8       98.69
covtype          1.4     76.37       46,695.8       96.11
news20           1.1     96.95          383.2       96.90
real-sim         0.3     97.44          938.3       97.82
yahoo-japan      3.1     92.63       20,955.2       93.31
webspam         25.7     93.35       15,681.8       99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Extension: Training Explicit Form of Nonlinear Mappings
Linear-SVM method to train $\phi(x_1), \ldots, \phi(x_l)$
Kernel not used
Applicable only if dimension of $\phi(x)$ not too large
Low-degree Polynomial Mappings
\[ K(x_i, x_j) = (x_i^T x_j + 1)^2 = \phi(x_i)^T \phi(x_j) \]
\[ \phi(x) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_{n-1}x_n]^T \]
When degree is small, train the explicit form of $\phi(x)$
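Why the dimension of $\phi(x)$ limits this approach can be seen by counting components: the degree-2 mapping has $(n+1)(n+2)/2$ of them. The sketch below builds the mapping for general $n$ and evaluates that count; the helper names and the example sizes are illustrative assumptions.

```python
import math

def poly2_map(x):
    """Explicit degree-2 mapping for x in R^n, in the component order shown above."""
    n = len(x)
    r2 = math.sqrt(2.0)
    features = [1.0]
    features += [r2 * v for v in x]                      # sqrt(2) x_i
    features += [v * v for v in x]                       # x_i^2
    features += [r2 * x[i] * x[j]                        # sqrt(2) x_i x_j, i < j
                 for i in range(n) for j in range(i + 1, n)]
    return features

def poly2_dim(n):
    """Number of components: 1 + n + n + n(n-1)/2 = (n+1)(n+2)/2."""
    return (n + 1) * (n + 2) // 2

x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
mapped_inner = sum(a * b for a, b in zip(poly2_map(x), poly2_map(y)))
kernel_value = (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2

print(len(poly2_map(x)), poly2_dim(3), poly2_dim(1000))
```

The count grows quadratically in $n$ for dense data, so the explicit form is mainly attractive when $n$ is small or the data are sparse enough that few product terms are nonzero.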
Testing Accuracy and Training Time
                      Degree-2 Polynomial                   Accuracy diff.
               Training time (s)
Data set       LIBLINEAR     LIBSVM      Accuracy        Linear      RBF
a9a                  1.6       89.8         85.06          0.07      0.02
real-sim            59.8    1,220.5         98.00          0.49      0.10
ijcnn1              10.7       64.2         97.84          5.63     −0.85
MNIST38              8.6       18.4         99.29          2.47     −0.40
covtype          5,211.9         NA         80.09          3.74    −15.98
webspam          3,228.1         NA         98.44          5.29     −0.76

Training $\phi(x_i)$ by linear: faster than kernel, but sometimes competitive accuracy
Discussion: Directly Train $\phi(x_i), \forall i$
See details in our work (Chang et al., 2010)
A related development: Sonnenburg and Franc (2010)
Useful for certain applications
Extensions of SVM
Multiple Kernel Learning (MKL)
Learning to rank
Semi-supervised learning
Active learning
Cost sensitive learning
Structured Learning
Conclusions
SVM and kernel methods are rather mature areas
But still quite a few interesting research issues
Many are extensions of standard classification
It is possible to identify more extensions through real applications
References I
G. H. Bakır, L. Bottou, and J. Weston. Breaking SVM complexity with cross-training. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 81–88. MIT Press, Cambridge, MA, 2005.
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008.
E. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. Parallelizing support vector machines on distributed computers. In NIPS 21, 2007.
Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11:1471–1490, 2010. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf.
C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, (2–3):201–233, 2002.
References II
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification.
Technical report, Department of Computer Science, National Taiwan University, 2003.
URL http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.
Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.
References III
Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.
C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.
M. Marzolla. Optimized training of support vector machines on the cell processor. Technical Report UBLCS-2010-02, Department of Computer Science, University of Bologna, Italy, Feb. 2010. URL http://www.cs.unibo.it/pub/TR/UBLCS/ABSTRACTS/2010.bib?ncstrl.cabernet//BOLOGNA-UBLCS-2010-02.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML), 2007.
S. Sonnenburg and V. Franc. COFFIN : A computational framework for linear SVMs. In Proceedings of the Twenty Seventh International Conference on Machine Learning (ICML), pages 999–1006, 2010.
I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071–1105, 2003.
J. Suykens and J. Vandewalle. Least square support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
References IV
N. A. Syed, H. Liu, and K. K. Sung. Incremental learning with support vector machines. In Workshop on Support Vector Machines, IJCAI99, 1999.
I. Tsang, J. Kwok, and P. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN99, pages 219–224, Brussels, 1999. D. Facto Press.
C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006.
Z. A. Zhu, W. Chen, G. Wang, C. Zhu, and Z. Chen. P-packSVM: Parallel primal gradient descent kernel SVM. In Proceedings of the IEEE International Conference on Data Mining, 2009.