Training Support Vector Machines:
Status and Challenges
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Talk at Microsoft Research Asia
Outline
Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions
Support Vector Classification
Training data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i = ±1
Maximizing the margin [Boser et al., 1992, Cortes and Vapnik, 1995]

min_{w,b}  (1/2) w^T w + C Σ_{i=1}^{l} max(1 − y_i (w^T φ(x_i) + b), 0)

High dimensional (maybe infinite) feature space
φ(x) = (φ_1(x), φ_2(x), ...)
Support Vector Classification (Cont’d)
The dual problem (a finite number of variables):

min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,  y^T α = 0,

where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum:
w = Σ_{i=1}^{l} α_i y_i φ(x_i)
Kernel: K(x_i, x_j) ≡ φ(x_i)^T φ(x_j); available in closed form
Large Dense Quadratic Programming
Q_ij ≠ 0; Q is an l by l fully dense matrix

min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,  y^T α = 0

50,000 training points ⇒ 50,000 variables:
(50,000² × 8 / 2) bytes = 10 GB RAM to store Q
Traditional optimization methods cannot be directly applied
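As a quick check of this figure, a few lines of Python (purely illustrative, not part of any SVM package) reproduce the arithmetic:

```python
# Memory needed to store the kernel matrix Q for l training points:
# Q is symmetric, so only the upper triangle is kept, with 8-byte doubles.
l = 50_000
gigabytes = l * l * 8 / 2 / 1e9
print(f"{gigabytes:.0f} GB")   # 10 GB
```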
Decomposition Methods
Working on some variables each time (e.g., [Osuna et al., 1997, Joachims, 1998, Platt, 1998])
Working set B; N = {1, ..., l} \ B is fixed
Sub-problem at the kth iteration:

min_{α_B}  (1/2) [α_B^T  (α_N^k)^T] [ Q_BB  Q_BN ; Q_NB  Q_NN ] [ α_B ; α_N^k ] − [ e_B^T  e_N^T ] [ α_B ; α_N^k ]
subject to  0 ≤ α_i ≤ C, i ∈ B,  y_B^T α_B = −y_N^T α_N^k
Avoid Memory Problems
The new objective function:
(1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant
Only the |B| columns of Q indexed by B are needed (|B| ≥ 2)
Calculated when used: trade time for space
Popular packages such as SVMlight and LIBSVM are of this type (a minimal sketch of computing kernel columns on demand is given below)
They work well if the data set is not too large (e.g., ≤ 100k instances)
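To make the "calculated when used" idea concrete, below is a minimal Python sketch (not LIBSVM's actual code; the RBF kernel, its gamma value, and the toy data are assumptions) of forming only the kernel columns for a working set B:

```python
import numpy as np

def rbf_kernel(X, Xb, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2), computed only against the points in B
    d2 = ((X[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def Q_columns(X, y, B, gamma=0.5):
    # Columns Q[:, B] with Q_ij = y_i y_j K(x_i, x_j): O(l * |B| * n) time but
    # only O(l * |B|) memory -- trading time for space, as on this slide.
    return y[:, None] * y[B][None, :] * rbf_kernel(X, X[B], gamma)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = np.where(rng.standard_normal(1000) > 0, 1.0, -1.0)
cols = Q_columns(X, y, B=np.array([3, 7]))   # shape (1000, 2), computed on demand

# After solving the sub-problem over alpha_B, the gradient of the full dual,
# grad = Q alpha - e, can be kept up to date using only these columns:
#   grad += Q_columns(X, y, B) @ (alpha_B_new - alpha_B_old)
```

Each iteration then touches O(l·|B|) kernel entries instead of the full l-by-l matrix.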
Outline
Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions
Is It Possible to Train Large SVM?
Accurately solve quadratic programs with millions of variables or more?
General approach: very unlikely
Cases with many support vectors: quadratic-time bottleneck on Q_{SV,SV}
Parallelization: possible, but difficult in distributed environments due to high communication cost
Is It Possible to Train Large SVM? (Cont’d)
For large problems, approximation almost unavoidable
That is, don’t accurately solve the quadratic program of the full training set
Approximately Training SVM
Can be done at several levels
Data level: sub-sampling
Optimization level: approximately solve the quadratic program
Other non-intuitive but effective ways; I will show one today
Many papers have addressed this issue
Approximately Training SVM (Cont’d)
Subsampling: simple and often effective
Many more advanced techniques exist
Incremental training (e.g., [Syed et al., 1999]): data ⇒ 10 parts;
train the 1st part ⇒ SVs; train SVs + 2nd part, ... (a minimal sketch is given below)
Select and train only good points: k-NN or heuristics, e.g., [Bakır et al., 2005]
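A minimal Python sketch of this incremental scheme, using scikit-learn's SVC (a LIBSVM wrapper) as a stand-in; the RBF parameters, the number of chunks, and the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC   # scikit-learn's LIBSVM wrapper, used as a stand-in

def incremental_svm(X, y, n_parts=10, C=1.0, gamma=0.5):
    # Incremental idea of Syed et al. (1999): train on a chunk, keep only the
    # support vectors, add the next chunk, and retrain.
    rng = np.random.default_rng(0)
    parts = np.array_split(rng.permutation(len(y)), n_parts)
    kept = np.array([], dtype=int)            # indices of current support vectors
    model = None
    for part in parts:
        idx = np.concatenate([kept, part])    # SVs so far + next chunk
        model = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        kept = idx[model.support_]            # carry only the SVs forward
    return model

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
clf = incremental_svm(X, y)
```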
Approximately Training SVM (Cont’d)
Approximate the kernel; e.g., [Fine and Scheinberg, 2001, Williams and Seeger, 2001]
Use part of the kernel; e.g.,
[Lee and Mangasarian, 2001, Keerthi et al., 2006]
Early stopping of optimization algorithms; e.g., [Tsang et al., 2005] and most parallel work
And many others
Some are simple, some sophisticated
Approximately Training SVM (Cont’d)
But sophisticated techniques may not always be useful
Sometimes they are slower than sub-sampling
covtype: 500k training and 80k testing; rcv1: 550k training and 14k testing

                covtype                        rcv1
Training size   Accuracy      Training size    Accuracy
50k             92.5%         50k              97.2%
100k            95.3%         100k             97.4%
500k            98.2%         550k             97.8%
Approximately Training SVM (Cont’d)
Personally, I prefer a specialized approach for large-scale scenarios
Figure: distribution of training data by size. General solvers (e.g., LIBSVM, SVMlight) handle medium and small problems; what to use for large problems is an open question
Approximately Training SVM (Cont’d)
We don’t have many large and well-labeled sets; they appear in certain application domains
Specific properties of the data should be considered; this may significantly improve the training speed
We will illustrate this point using linear SVM
The design of software for large and medium/small problems should be different
Outline
Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions
Linear SVM
Data are not mapped to another space
Primal, without the bias term b:

min_w  (1/2) w^T w + C Σ_{i=1}^{l} max(0, 1 − y_i w^T x_i)

Dual:

min_α  f(α) ≡ (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, ∀i,

where Q_ij = y_i y_j x_i^T x_j
Linear SVM (Cont’d)
In theory, an RBF kernel with certain parameters is at least as good as linear [Keerthi and Lin, 2003]
RBF kernel: K(x_i, x_j) = e^{−γ ‖x_i − x_j‖²}
That is,
test accuracy of linear ≤ test accuracy of RBF
Linear SVM is not better than nonlinear; but ...
Linear SVM for Large Document Sets
Bag-of-words model (TF-IDF or others): a large # of features
Accuracy is similar with/without mapping the data to another space
What if training is much faster?
Then linear SVM is a very effective approximation to nonlinear SVM
A Comparison: LIBSVM and LIBLINEAR
rcv1 (TF-IDF): # data > 600k, # features > 40k
Using LIBSVM (linear kernel): > 10 hours
Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds
Same stopping condition
Accuracy similar to nonlinear; more than 100x speedup (a usage sketch is given below)
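For illustration, the two training paths can be tried through scikit-learn, whose SVC wraps LIBSVM and LinearSVC wraps LIBLINEAR; the synthetic data and parameters below are placeholders (LinearSVC also uses a squared hinge loss by default, so the two models are not identical), and timings on such a small problem will not reproduce the slide's numbers:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for a document set (the real experiment used rcv1 TF-IDF).
X, y = make_classification(n_samples=3000, n_features=300, n_informative=50,
                           random_state=0)

kernel_svm = SVC(kernel="linear", C=1.0)   # decomposition method on the dual (LIBSVM)
linear_svm = LinearSVC(C=1.0)              # specialized linear solver (LIBLINEAR)

kernel_svm.fit(X, y)   # slow on large document sets (hours in the slide's experiment)
linear_svm.fit(X, y)   # typically orders of magnitude faster on the same data
```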
Why Training Linear SVM Is Faster?
In optimization, at each iteration we often need
∇_i f(α) = (Qα)_i − 1
Nonlinear SVM:
∇_i f(α) = Σ_{j=1}^{l} y_i y_j K(x_i, x_j) α_j − 1;  cost: O(nl), where n: # features, l: # data
Linear SVM: use
w ≡ Σ_{j=1}^{l} y_j α_j x_j  and  ∇_i f(α) = y_i w^T x_i − 1;  cost: O(n) if w is maintained
Faster overall if the number of iterations is not l times more (a minimal code sketch follows the references below)
For details, see
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. ICML 2008.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871-1874.
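The per-coordinate update that results is sketched below in Python, in the spirit of the dual coordinate descent method of Hsieh et al. (2008); this is a simplified illustration rather than LIBLINEAR's actual code, and the L1 loss, epoch count, random order, and toy data are assumptions:

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=10):
    # X: l x n data, y: labels in {-1, +1}. Solves the L1-loss linear SVM dual
    # by coordinate descent, maintaining w = sum_j y_j alpha_j x_j.
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = (X ** 2).sum(axis=1)              # Q_ii = x_i^T x_i
    for _ in range(n_epochs):
        for i in np.random.permutation(l):
            G = y[i] * w.dot(X[i]) - 1.0    # O(n) gradient via w, not the O(nl) kernel sum
            # Projected gradient for the box constraint 0 <= alpha_i <= C
            if alpha[i] == 0:
                PG = min(G, 0.0)
            elif alpha[i] == C:
                PG = max(G, 0.0)
            else:
                PG = G
            if PG != 0.0 and Qii[i] > 0:
                old = alpha[i]
                alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (alpha[i] - old) * y[i] * X[i]   # O(n) update of w
    return w, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
y = np.where(X @ rng.standard_normal(50) > 0, 1.0, -1.0)
w, alpha = dual_cd_linear_svm(X, y)
```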
Experiments
Problem         l: # data     n: # features
news20          19,996        1,355,191
yahoo-japan     176,203       832,026
rcv1            677,399       47,236
Testing Accuracy versus Training Time
Figure: testing accuracy versus training time curves for news20 and yahoo-japan
Training Linear SVM Always Much Faster?
No
If #data ≫ #features, the algorithm used above may not be very good
Need some other ways
But document data are not of this type
Large-scale SVM training is domain specific
Outline
Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions
Training Nonlinear SVM via Linear SVM
Revisit nonlinear SVM:

min_w  (1/2) w^T w + C Σ_{i=1}^{l} max(1 − y_i w^T φ(x_i), 0)

Dimension of φ(x): large
If not very large, directly train SVM without the kernel
Calculate ∇_i f(α) at each step; with the kernel: O(nl)
Degree-2 Polynomial Mapping
Degree-2 polynomial kernel:
K(x_i, x_j) = (1 + x_i^T x_j)^2
Instead of the kernel, we use the explicit mapping
φ(x) = [1, √2 x_1, ..., √2 x_n, x_1², ..., x_n², √2 x_1 x_2, ..., √2 x_{n−1} x_n]^T
Now we can just consider
φ(x) = [1, x_1, ..., x_n, x_1², ..., x_n², x_1 x_2, ..., x_{n−1} x_n]^T
O(n²) dimensions can cause trouble; some care is needed in the implementation
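A rough sketch of this idea with scikit-learn as an assumed stand-in: PolynomialFeatures(degree=2) produces the simplified mapping above (up to the ordering of terms), and a linear solver is then trained on the expanded data; the data set and parameters are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

phi = PolynomialFeatures(degree=2, include_bias=True)   # O(n^2) new features
X2 = phi.fit_transform(X)
print(X2.shape)          # (2000, 231): 1 + n + n(n+1)/2 dimensions for n = 20

model = LinearSVC(C=1.0).fit(X2, y)   # linear solver on the explicitly mapped data
```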
Accuracy Difference with linear and RBF
             Training time (degree-2 polynomial)    Accuracy diff. (vs. degree-2 polynomial)
Data set     LIBLINEAR       LIBSVM                 linear        RBF
a9a          1.6             89.8                   0.07          0.02
real-sim     59.8            1,220.5                0.49          0.10
ijcnn1       10.7            64.2                   5.63          −0.85
MNIST38      8.6             18.4                   2.47          −0.40
covtype      5,211.9         ≥ 3 × 10^5             3.74          −15.98
webspam      3,228.1         ≥ 3 × 10^5             5.29          −0.76
Some problems: accuracy similar to RBF; but training much faster
That is, a less nonlinear SVM is used to approximate a highly nonlinear one
NLP Applications
In NLP (Natural Language Processing), degree-2 or degree-3 polynomial kernels are very popular
Competitive with RBF; better than linear
No theory yet, but a possible reason: bigram/trigram features are useful
This is different from other areas (e.g., images), which mainly use RBF
Currently people complain that training is slow
SVM with Low-Degree Polynomial Mapping
Dependency Parsing
Figure: dependency parse of "John hit the ball with a bat." (labels: nsubj, ROOT, det, dobj, prep, det, pobj, p; POS tags: NNP VBD DT NN IN DT NN .)

                  LIBSVM                       LIBLINEAR
                  RBF           Poly           Linear        Poly
Training time     3h34m53s      3h21m51s       3m36s         3m43s
Parsing speed     0.7x          1x             1652x         103x
UAS               89.92         91.67          89.11         91.71
LAS               88.55         90.60          88.07         90.71
Dependency Parsing (Cont’d)
Details:
Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Low-degree polynomial mapping of data for SVM, 2009.
Outline
Training support vector machines
Training large-scale SVM
Linear SVM
SVM with Low-Degree Polynomial Mapping
Discussion and Conclusions
What If Data Cannot Fit in Memory?
We can manage to train with data stored on disk
Details not shown here
However, what if the data are too large to store on one machine?
So far, not many such cases with well-labeled data; it’s expensive to label data
We do see very large but low-quality data; dealing with such data is a different problem
L1-regularized Classifiers
Replacing ‖w‖² with ‖w‖₁:
min_w  ‖w‖₁ + C × (sum of losses)
Sparsity: many elements of w are zero ⇒ feature selection
LIBLINEAR supports the L2 loss max(0, 1 − y_i w^T x_i)² and the logistic loss log(1 + e^{−y_i w^T x_i})
If the least-squares loss is used and y ∈ R^l, this is related to L1-regularized problems in signal processing
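A small sketch of an L1-regularized classifier through scikit-learn's LIBLINEAR backend (an assumed stand-in for calling LIBLINEAR directly); the data and C value are placeholders, and the point is only the sparsity of w:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=200, n_informative=10,
                           random_state=0)

# L1-regularized logistic regression; many coefficients end up exactly zero,
# which acts as feature selection.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
w = l1_model.coef_.ravel()
print("nonzero weights:", np.count_nonzero(w), "out of", w.size)
```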
Conclusions
Training large SVM is difficult
The (at least) quadratic time bottleneck
Approximation is often needed, and some effective approaches are non-intuitive
E.g., linear SVM is a good approximation to nonlinear SVM for some applications
Difficult to have a general approach for all large scenarios
Special techniques are needed
Conclusions (Cont’d)
Software design for large and medium/small problems should be different
Medium/small problems: general and simple software
Source code for my past work is available on my page
In particular, LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
and LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear
I will be happy to talk to any machine learning users here