Training Support Vector Machines:
Status and Challenges
Chih-Jen Lin
Department of Computer Science, National Taiwan University
ICML Workshop on Large Scale Learning Challenge
Outline
SVM is popular
But its training isn't easy
We check existing techniques
Large data sets: we show several approaches and discuss various considerations
We will try to partially answer why there are controversial comparisons
Outline
Introduction to SVM
Solving SVM Quadratic Programming Problem
Training large-scale data
Linear SVM
Discussion and Conclusions
Support Vector Classification
Training data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i = ±1
Maximizing the margin
[Boser et al., 1992, Cortes and Vapnik, 1995]
min_{w,b,ξ}   (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
subject to   y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., l.
High dimensional (maybe infinite) feature space: φ(x) = (φ_1(x), φ_2(x), ...).
Support Vector Classification (Cont’d)
w: possibly infinitely many variables
The dual problem (finite number of variables):
min_α   (1/2) α^T Q α − e^T α
subject to   0 ≤ α_i ≤ C, i = 1, ..., l,   y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum:   w = Σ_{i=1}^{l} α_i y_i φ(x_i)
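To make the primal-dual relation concrete, here is a hedged scikit-learn sketch (SVC wraps LIBSVM and solves exactly this dual); the toy data and parameters are assumptions for illustration. With a linear kernel, φ(x) = x, so w can be recovered directly from the dual variables.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data (assumed); with a linear kernel phi(x) = x, so w is finite dimensional
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# clf.dual_coef_ stores y_i * alpha_i for the support vectors, so the
# optimality condition w = sum_i alpha_i y_i x_i can be checked directly:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))                # True: same w as SVC's coef_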
Outline
Introduction to SVM
Solving SVM Quadratic Programming Problem
Training large-scale data
Linear SVM
Discussion and Conclusions
Large Dense Quadratic Programming
min_α   (1/2) α^T Q α − e^T α
subject to   0 ≤ α_i ≤ C, i = 1, ..., l,   y^T α = 0
Q_ij ≠ 0: Q is an l by l fully dense matrix
50,000 training points ⇒ 50,000 variables
(50,000^2 × 8 / 2) bytes ≈ 10 GB RAM to store Q
Traditional optimization methods cannot be directly applied (Q cannot be stored)
Decomposition Methods
Working on some variables each time (e.g., [Osuna et al., 1997, Joachims, 1998, Platt, 1998])
Similar to coordinate-wise minimization
Working set B; N = {1, ..., l} \ B is fixed
Sub-problem at the kth iteration:

min_{α_B}  (1/2) [α_B^T  (α_N^k)^T] [ Q_BB  Q_BN ] [ α_B   ]  −  [e_B^T  e_N^T] [ α_B   ]
                                    [ Q_NB  Q_NN ] [ α_N^k ]                    [ α_N^k ]

subject to   0 ≤ α_t ≤ C, t ∈ B,   y_B^T α_B = −y_N^T α_N^k
Avoid Memory Problems
The new objective function:
(1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant
Only the B columns of Q are needed (|B| ≥ 2)
Calculated when used: trade time for space (see the sketch below)
Popular software such as SVMlight, LIBSVM, and SVMTorch are of this type
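A minimal numpy sketch of the point above (not LIBSVM's actual code): only the B columns of Q are formed, on the fly from the kernel, to build the sub-problem. The RBF kernel, names, and parameters are assumptions.

import numpy as np

def Q_columns(X, y, B, gamma=0.5):
    # Compute Q[:, B] on the fly (RBF kernel) instead of storing the full l x l Q
    sq_dist = ((X[:, None, :] - X[None, B, :]) ** 2).sum(axis=2)        # l x |B|
    return (y[:, None] * y[B][None, :]) * np.exp(-gamma * sq_dist)      # Q_ij = y_i y_j K(x_i, x_j)

def subproblem(X, y, alpha, B):
    # Terms of min_{a_B} 1/2 a_B^T Q_BB a_B + (-e_B + Q_BN alpha_N)^T a_B
    Q_B = Q_columns(X, y, B)                     # the only kernel evaluations needed
    N = np.setdiff1d(np.arange(len(y)), B)
    Q_BB = Q_B[B, :]
    linear = -1.0 + Q_B[N, :].T @ alpha[N]       # uses Q_NB^T = Q_BN (Q is symmetric)
    return Q_BB, linear

The sub-problem itself would still be handed to a small solver; the point is only that the memory cost is l × |B| rather than l × l.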
How Do Decomposition Methods Perform?
Convergence is not very fast
But no need for a very accurate α: prediction is not affected much
In some situations, # support vectors ≪ # training points
With the initial α^1 = 0, some instances are never used
An example of training 50,000 instances using LIBSVM:
$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time: 79.524s on a Xeon 2.0 GHz machine
Calculating Q may have taken most of the time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
Issues of Decomposition Methods
Techniques for faster decomposition methods:
caching recently used kernel elements
working set size/selection
theoretical issues: convergence
and others (details not discussed here)
But training large data is still difficult
The kernel matrix grows quadratically with the number of data points
Training millions of data points is time consuming
Will discuss some possible approaches
Outline
Introduction to SVM
Solving SVM Quadratic Programming Problem
Training large-scale data
Linear SVM
Discussion and Conclusions
Parallel: Multi-core/Shared Memory
Most computation of decomposition methods: kernel evaluations
Easily parallelized via OpenMP; a one-line change to LIBSVM
Each core/CPU calculates part of a kernel column (see the sketch below)

  Multicore            Shared-memory
  #cores  time (s)     #cores  time (s)
  1       80           1       100
  2       48           2       57
  4       32           4       36
  8       27           8       28

Same 50,000 data (kernel evaluations: 80% of the time)
Using GPU: [Catanzaro et al., 2008]
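A toy Python sketch of the idea (LIBSVM itself does this in C with an OpenMP pragma): split one kernel column across workers so each core computes part of it. The data, kernel parameter, and worker count are assumptions.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def rbf_column_chunk(args):
    X_chunk, x_i, gamma = args
    sq_dist = ((X_chunk - x_i) ** 2).sum(axis=1)
    return np.exp(-gamma * sq_dist)              # K(x_j, x_i) for x_j in this chunk

def kernel_column(X, i, gamma=4.0, n_workers=4):
    chunks = np.array_split(X, n_workers)        # each worker gets part of the column
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(rbf_column_chunk, [(c, X[i], gamma) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":
    X = np.random.rand(10000, 22)                # e.g., 22 features as in the demo above
    print(kernel_column(X, i=0).shape)           # (10000,)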
Parallel: Distributed Environments
What if the data cannot fit into memory?
Use distributed environments PSVM: [Chang et al., 2007]
http://code.google.com/p/psvm/
π-SVM: http://pisvm.sourceforge.net
Parallel GPDT [Zanni et al., 2006]
All use MPI
They report good speed-up
But in certain environments, communication cost is a concern
Approximations
Subsampling: simple and often effective
Many more advanced techniques exist
Incremental training (e.g., [Syed et al., 1999]):
data ⇒ 10 parts; train the 1st part ⇒ SVs; train SVs + 2nd part; ... (see the sketch after this list)
Select and train good points: KNN or heuristics, e.g., [Bakır et al., 2005]
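A hedged sketch of the incremental idea above, using scikit-learn's SVC (which wraps LIBSVM); the data split, kernel, and parameters are assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
parts = np.array_split(np.arange(len(y)), 10)             # data => 10 parts

X_keep, y_keep = X[parts[0]], y[parts[0]]                  # start with the 1st part
for p in parts[1:]:
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_keep, y_keep)
    sv = clf.support_                                      # keep only the SVs found so far
    X_keep = np.vstack([X_keep[sv], X[p]])                 # SVs + next part
    y_keep = np.concatenate([y_keep[sv], y[p]])
final_model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_keep, y_keep)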
Approximations (Cont’d)
Approximate the kernel; e.g., [Fine and Scheinberg, 2001, Williams and Seeger, 2001]
Use part of the kernel; e.g.,
[Lee and Mangasarian, 2001, Keerthi et al., 2006]
And many others
Some are simple, some are sophisticated
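As one concrete instance of approximating the kernel, the Nyström method of [Williams and Seeger, 2001] is available in scikit-learn; this hedged sketch (toy data, assumed parameters) maps the data to a low-rank feature space and then trains a linear SVM.

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

# Low-rank approximation of the RBF kernel; n_components trades accuracy for cost
feature_map = Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0)
X_approx = feature_map.fit_transform(X)
clf = LinearSVC(C=1.0).fit(X_approx, y)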
Parallelization or Approximation
Difficult to say
Parallelization: general
Approximation: simpler in some cases
We can do both
For certain problems, approximation doesn’t easily work
Parallelization or Approximation (Cont’d)
covtype: 500k training and 80k testing; rcv1: 550k training and 14k testing

  covtype                       rcv1
  Training size  Accuracy       Training size  Accuracy
  50k            92.5%          50k            97.2%
  100k           95.3%          100k           97.4%
  500k           98.2%          550k           97.8%
For large sets, selecting the right approach is essential
We illustrate this point using linear SVM for document classification
Outline
Introduction to SVM
Solving SVM Quadratic Programming Problem
Training large-scale data
Linear SVM
Discussion and Conclusions
Linear Support Vector Machines
Data not mapped to another space
In theory, an RBF kernel with certain parameters
⇒ generalization performance as good as linear [Keerthi and Lin, 2003]
But sometimes can easily solve much larger linear SVMs
Training of linear/nonlinear SVMs should be separately considered
Linear Support Vector Machines (Cont’d)
Linear SVM is useful if its accuracy is similar to nonlinear
Will discuss an example of linear SVM for document classification
Linear SVM for Large Document Sets
Document classification
Bag-of-words model (TF-IDF or others)
A large # of features
Can solve larger problems than kernelized cases
Recently an active research topic:
SVMperf [Joachims, 2006]
Pegasos [Shalev-Shwartz et al., 2007]
LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]
and others
Primal without the bias term b:
min_w   (1/2) w^T w + C Σ_{i=1}^{l} max(0, 1 − y_i w^T x_i)
Dual:
min_α   f(α) = (1/2) α^T Q α − e^T α
subject to   0 ≤ α_i ≤ C, ∀i
No linear constraint y^T α = 0
Q_ij = y_i y_j x_i^T x_j
A Comparison: LIBSVM and LIBLINEAR
rcv1: # data > 600k, # features > 40k, TF-IDF features
Using LIBSVM (linear kernel): > 10 hours
Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds
Same stopping condition
Accuracy similar to nonlinear
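A rough, hedged way to reproduce this kind of comparison with scikit-learn, whose SVC wraps LIBSVM and LinearSVC wraps LIBLINEAR; the synthetic sparse data below stands in for rcv1, so absolute times will differ, but the gap is typically dramatic.

import time
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import SVC, LinearSVC

rng = np.random.RandomState(0)
X = sparse_random(5000, 10000, density=0.001, random_state=rng, format="csr")
y = rng.choice([-1, 1], size=5000)

t0 = time.time()
LinearSVC(C=1.0, loss="hinge", dual=True).fit(X, y)    # LIBLINEAR: coordinate descent
print("LIBLINEAR-style:", time.time() - t0, "s")

t0 = time.time()
SVC(kernel="linear", C=1.0).fit(X, y)                  # LIBSVM: decomposition method
print("LIBSVM-style:", time.time() - t0, "s")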
Revisit Decomposition Methods
The extreme: update one variable at a time
The update reduces to
α_i ← min( max( α_i − ∇_i f(α) / Q_ii, 0 ), C )
where
∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^{l} Q_ij α_j − 1
O(nl) to calculate the ith row of Q
n: # features, l: # data
For linear SVM, define w = Σ_{j=1}^{l} y_j α_j x_j
Then the gradient is much easier: O(n)
∇_i f(α) = y_i w^T x_i − 1
All we need is to maintain w: if α_i is updated from its old value ᾱ_i, then
w ← w + (α_i − ᾱ_i) y_i x_i
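A minimal numpy sketch of this O(n)-per-update idea (dual coordinate descent for the L1-SVM dual above, in the spirit of [Hsieh et al., 2008]); it is an illustrative reimplementation, not LIBLINEAR's code, and the toy data is assumed.

import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, max_epochs=10):
    # Solves min_a 1/2 a^T Q a - e^T a, 0 <= a_i <= C, with Q_ij = y_i y_j x_i^T x_j
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                              # maintain w = sum_j y_j alpha_j x_j
    Qii = (X ** 2).sum(axis=1)                   # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(max_epochs):
        for i in np.random.permutation(l):
            G = y[i] * w.dot(X[i]) - 1.0         # grad_i f(alpha), O(n) via w
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
            w += (alpha[i] - alpha_old) * y[i] * X[i]   # O(n) update of w
    return w, alpha

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)
y = np.sign(X[:, 0] + 0.1 * rng.randn(1000))
w, alpha = dual_cd_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))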
Testing Accuracy (Time in Seconds)
[Figure: four panels, L1-SVM and L2-SVM on news20 and rcv1]
Analysis
Other implementation details in [Hsieh et al., 2008]
Decomposition method for linear/nonlinear kernels: O(nl) per iteration
New way for linear: O(n) per iteration
Faster if the # of iterations is not l times more
A few seconds for millions of data points; any limitation?
Less effective if:
# features is small: should solve the primal instead
the penalty parameter C is large
Analysis (Cont’d)
One must be careful in making comparisons
Now we have two decomposition methods (nonlinear and linear)
Similar theoretical convergence rates
Very different practical behaviors for certain problems
This partially explains controversial comparisons in some recent work
Analysis (Cont’d)
A lesson: different SVMs
To handle large data ⇒ may need different training strategies
Even just for linear SVM
# data ≫ # features
# data ≪ # features
# data, # features both large
Should use different methods
For example, # data ≫ # features ⇒ a primal-based method (but why not nonlinear?); see the sketch below
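A small, hedged scikit-learn illustration of this rule of thumb: LinearSVC (LIBLINEAR) can solve either the primal or the dual, and the shape of the data suggests which; the helper name and thresholds are assumptions.

from sklearn.svm import LinearSVC

def make_linear_svm(n_samples, n_features, C=1.0):
    # # data >> # features: a primal-based solver is usually preferable
    # # data << # features: the dual (coordinate descent) is usually preferable
    return LinearSVC(C=C, dual=(n_samples < n_features))

doc_model = make_linear_svm(n_samples=20_000, n_features=500_000)   # sparse documents -> dual
dense_model = make_linear_svm(n_samples=500_000, n_features=54)     # low-dimensional data -> primal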
Outline
Introduction to SVM
Solving SVM Quadratic Programming Problem
Training large-scale data
Linear SVM
Discussion and Conclusions
Discussion and Conclusions
Linear versus nonlinear
In this competition, most participants use linear (wild track), even though accuracy may be worse
Recall I mentioned “parallelization” and “approximation”
Linear is essentially an approximation of nonlinear
For large data, selecting the right approach seems to be essential
But finding a suitable one is difficult
Discussion and Conclusions (Cont’d)
This (i.e., “too many approaches”) is indeed bad from the viewpoint of designing machine learning software
The success of LIBSVM and SVMlight: being simple and general
Developments in both directions (general and specific) will help to advance SVM training