Training Large-scale Linear Classifiers
Chih-Jen Lin
Department of Computer Science, National Taiwan University
http://www.csie.ntu.edu.tw/~cjlin
Talk at Hong Kong University of Science and Technology, February 5, 2009
Outline
- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions
Kernel Methods and SVM
- Kernel methods became very popular in the past decade, in particular support vector machines (SVM)
- But training is slow on large data because the nonlinear mapping enlarges the # of features
Example: $x = [x_1, x_2, x_3]^T \in R^3$,

$$\phi(x) = \bigl[1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3\bigr]^T \in R^{10}$$

If data are very large ⇒ often need approximations, e.g., sub-sampling and many other ways
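As a sanity check (not from the slides), a minimal Python/numpy sketch showing that this explicit mapping realizes the degree-2 polynomial kernel $(1 + x^T z)^2$, which is why the 10-dimensional vectors never need to be formed:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping from R^3 to R^10."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# The kernel trick: phi(x)^T phi(z) equals (1 + x^T z)^2
print(phi(x) @ phi(z))    # 30.25
print((1 + x @ z) ** 2)   # 30.25
```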
Linear Classification
- Certain problems: # of features is large
- Often similar accuracy with/without nonlinear mappings
- Linear classification: no mapping; stay in the original input space
- We can efficiently train on very large data
- Document classification is of this type; very important for Internet companies
An Example
- rcv1: # data > 600k, # features > 40k
- Using LIBSVM (linear kernel): > 10 hours
- Using LIBLINEAR: computation < 5 seconds, I/O 60 seconds
- Same stopping condition in solving the SVM optimization problem
- Will show how this is achieved and discuss whether there are any concerns
Support Vector Classification
Training data $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in R^n$, $y_i = \pm 1$

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T \phi(x_i)\bigr)$$

- $C$: regularization parameter
- High dimensional (maybe infinite) feature space: $\phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T$
- We omit the bias term $b$
- $w$: may have infinitely many variables
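To make the objective concrete, a hedged Python/numpy sketch (toy data and $C$ are illustrative only) that evaluates this primal objective for the identity mapping $\phi(x) = x$:

```python
import numpy as np

def primal_objective(w, X, y, C):
    """(1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i), with phi = identity."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w @ w + C * hinge.sum()

# Toy data: two classes in R^2
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(primal_objective(np.array([0.5, 0.5]), X, y, C=1.0))  # 0.25
```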
Support Vector Classification (Cont’d)
The dual problem (finite # of variables):

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,$$

where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$

At optimum: $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$

Kernel: $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$
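Since $w$ lives in the (possibly infinite-dimensional) feature space, predictions go through the kernel instead. A minimal sketch (the RBF kernel, toy data, and names are illustrative assumptions, not from the talk):

```python
import numpy as np

def decision_value(x, X_train, y_train, alpha, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x); w is never formed explicitly."""
    return float(np.sum(alpha * y_train * kernel(X_train, x)))

# Example kernel: RBF, K(x_i, x) = exp(-gamma ||x_i - x||^2)
def rbf(X_all, x, gamma=0.5):
    return np.exp(-gamma * np.sum((X_all - x) ** 2, axis=1))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.7])
print(decision_value(np.array([0.5, 0.5]), X_train, y_train, alpha, rbf))
```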
Large Dense Quadratic Programming
$Q_{ij} \neq 0$: $Q$ is an $l$ by $l$ fully dense matrix

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l$$

50,000 training points ⇒ 50,000 variables: $(50{,}000^2 \times 8/2)$ bytes = 10 GB of RAM to store $Q$

Traditional methods (Newton, quasi-Newton) cannot be directly applied; now most solvers use decomposition methods
[Osuna et al., 1997, Joachims, 1998, Platt, 1998]
Decomposition Methods
We consider a one-variable version, similar to coordinate descent methods. Select the $i$-th component for update:

$$\min_d\ \frac{1}{2}(\alpha + d e_i)^T Q(\alpha + d e_i) - e^T(\alpha + d e_i) \quad \text{subject to}\quad 0 \le \alpha_i + d \le C,$$

where

$$e_i \equiv [\underbrace{0, \ldots, 0}_{i-1},\ 1,\ 0, \ldots, 0]^T$$

$\alpha$: the current solution; only its $i$-th component is changed
Avoid Memory Problems
The new objective function, as a function of $d$:

$$\frac{1}{2} Q_{ii} d^2 + (Q\alpha - e)_i d + \text{constant}$$

To get $(Q\alpha - e)_i$, only $Q$'s $i$-th row is needed:

$$(Q\alpha - e)_i = \sum_{j=1}^{l} Q_{ij}\alpha_j - 1$$

It is calculated when needed: trade time for space
Used by popular software (e.g., SVMlight, LIBSVM), which update 10 and 2 variables at a time, respectively
Decomposition Methods: Algorithm
The optimal $d$:

$$d = \frac{-(Q\alpha - e)_i}{Q_{ii}} = -\frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}}$$

Taking the lower/upper bounds $[0, C]$ into account, the algorithm is:

While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
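A minimal Python/numpy sketch of this algorithm with sequential selection (my simplifications: a fixed number of passes instead of a real stopping condition, and the assumption $Q_{ii} > 0$). The $i$-th row of $Q$ is recomputed on demand, so each update costs $O(ln)$ but the full matrix is never stored:

```python
import numpy as np

def dual_cd_kernel(X, y, C, kernel, n_passes=10):
    """One-variable decomposition on the SVM dual.

    Q's i-th row is recomputed when needed (trade time for space),
    so the full l-by-l matrix is never stored.
    """
    l = X.shape[0]
    alpha = np.zeros(l)
    for _ in range(n_passes):
        for i in range(l):                    # sequential selection
            Qi = y[i] * y * kernel(X[i], X)   # i-th row of Q, O(l n)
            grad_i = Qi @ alpha - 1.0         # (Q alpha - e)_i
            alpha[i] = min(max(alpha[i] - grad_i / Qi[i], 0.0), C)
    return alpha

# A linear kernel for illustration; any K(x_i, x_j) works here
linear_kernel = lambda xi, X_all: X_all @ xi
```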
Select an Element for Update
Many ways:
- Sequential (easiest)
- Permuting $1, \ldots, l$ every $l$ steps
- Random

Existing software checks gradient information $\nabla_1 f(\alpha), \ldots, \nabla_l f(\alpha)$. But is $\nabla f(\alpha)$ available?
Select an Element for Update (Cont’d)
We can easily maintain the gradient:

$$\nabla f(\alpha) = Q\alpha - e, \qquad \nabla_s f(\alpha) = (Q\alpha)_s - 1 = \sum_{j=1}^{l} Q_{sj}\alpha_j - 1$$

Initially $\alpha = 0$, so $\nabla f(0) = -e$

When $\alpha_i$ is updated to $\bar\alpha_i$:

$$\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$$

This costs $O(l)$ if $Q_{si},\ \forall s$ (the $i$-th column) are available
Select an Element for Update (Cont’d)
Whether or not we maintain $\nabla f(\alpha)$, $Q$'s $i$-th row (column) is always needed:

$$\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$$

($Q$ is symmetric, so row and column are the same)

Using $\nabla f(\alpha)$ to select $i$: faster convergence, i.e., fewer iterations
Decomposition Methods: Using Gradient
The new procedure:

$\alpha = 0$, $\nabla f(\alpha) = -e$
While $\alpha$ is not optimal
1. Select the $i$-th element using $\nabla f(\alpha)$
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$

Cost per iteration: $O(ln)$, where $l$: # of instances and $n$: # of features (assume each $Q_{ij} = y_i y_j K(x_i, x_j)$ takes $O(n)$)
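A hedged sketch of this gradient-maintaining procedure. For brevity $Q$ is precomputed here (real solvers compute the needed $i$-th column on demand, at $O(ln)$), and $i$ is chosen by the largest projected-gradient violation, a simplified version of the selection rules in solvers such as LIBSVM:

```python
import numpy as np

def dual_cd_with_gradient(Q, C, max_iters=1000, tol=1e-6):
    """Decomposition maintaining grad = Q alpha - e with an O(l) update."""
    l = Q.shape[0]
    alpha = np.zeros(l)
    grad = -np.ones(l)                            # alpha = 0  =>  grad = -e
    for _ in range(max_iters):
        pg = grad.copy()                          # projected gradient:
        pg[(alpha <= 0.0) & (grad > 0.0)] = 0.0   # cannot decrease alpha_i
        pg[(alpha >= C) & (grad < 0.0)] = 0.0     # cannot increase alpha_i
        i = int(np.argmax(np.abs(pg)))
        if abs(pg[i]) < tol:                      # optimal within tolerance
            break
        new_ai = min(max(alpha[i] - grad[i] / Q[i, i], 0.0), C)
        grad += Q[:, i] * (new_ai - alpha[i])     # O(l) gradient maintenance
        alpha[i] = new_ai
    return alpha
```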
Linear SVM for Large Document Sets
- Document classification: bag-of-words model (TF-IDF or others), a large # of features
- Testing accuracy: linear and nonlinear SVMs are similar (by nonlinear SVM we mean SVM via kernels)
- Recently an active research topic:
  - SVMperf [Joachims, 2006]
  - Pegasos [Shalev-Shwartz et al., 2007]
  - LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]
  - and others
Linear SVM
Primal, without the bias term $b$:

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

Dual:

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ \forall i, \qquad Q_{ij} = y_i y_j x_i^T x_j$$
Revisit Decomposition Methods
While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$

$O(ln)$ per iteration; $n$: # of features, $l$: # of data

For linear SVM, define

$$w \equiv \sum_{j=1}^{l} y_j \alpha_j x_j \in R^n$$

Then each iteration costs only $O(n)$, because

$$\sum_{j=1}^{l} Q_{ij}\alpha_j - 1 = \sum_{j=1}^{l} y_i y_j x_i^T x_j \alpha_j - 1 = y_i w^T x_i - 1$$

All we need is to maintain $w$: if $\alpha_i$ is updated to $\bar\alpha_i$, then

$$w \leftarrow w + (\bar\alpha_i - \alpha_i) y_i x_i$$

costs $O(n)$

- Initial $w$: $\alpha = 0 \Rightarrow w = 0$
- Give up maintaining $\nabla f(\alpha)$
- Select $i$ for update: sequential, random, or permuting $1, \ldots, l$ every $l$ steps (a sketch follows below)
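A minimal Python/numpy sketch of this $w$-maintaining coordinate descent (my simplifications: a fixed number of passes instead of a real stopping condition, dense arithmetic, no shrinking, and the assumption $x_i \neq 0$ so $Q_{ii} > 0$; LIBLINEAR itself works with sparse data):

```python
import numpy as np

def dual_cd_linear(X, y, C, n_passes=10, seed=0):
    """Dual coordinate descent for linear SVM, maintaining w: O(n) per update."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                           # alpha = 0  =>  w = 0
    Qii = np.einsum('ij,ij->i', X, X)         # Q_ii = x_i^T x_i (y_i^2 = 1)
    for _ in range(n_passes):
        for i in rng.permutation(l):          # permute 1..l every pass
            grad_i = y[i] * (w @ X[i]) - 1.0  # (Q alpha - e)_i via w
            new_ai = min(max(alpha[i] - grad_i / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]   # O(n) maintenance of w
            alpha[i] = new_ai
    return w, alpha
```

The point of the design: no kernel row of length $l$ is ever formed, which is exactly where the $O(ln) \to O(n)$ saving comes from.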
Algorithms for Linear and Nonlinear SVM
Linear:

While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{y_i w^T x_i - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $w \leftarrow w + (\bar\alpha_i - \alpha_i) y_i x_i$

Nonlinear:

While $\alpha$ is not optimal
1. Select the $i$-th element using $\nabla f(\alpha)$
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$
Analysis
- Decomposition method for nonlinear (also works for linear): $O(ln)$ per iteration (used in LIBSVM)
- New way for linear: $O(n)$ per iteration (used in LIBLINEAR)
- The new way is faster as long as it does not need $l$ times more iterations

Experiments

Problem       l: # data   n: # features
news20        19,996      1,355,191
yahoo-japan   176,203     832,026
rcv1          677,399     47,236
yahoo-korea   460,554     3,052,939
Testing Accuracy versus Training Time
[Figures: testing accuracy versus training time on news20, yahoo-japan, rcv1, and yahoo-korea]
Limitation
- A few seconds for millions of data points: too good to be true?
- Less effective if $C$ is large (or data are not scaled); the same problem occurs for training nonlinear SVMs
- But there is no need to use a large $C$: the model is the same for all $C \ge \bar{C}$ [Keerthi and Lin, 2003]
- $\bar{C}$ is small for document data (if scaled)
Limitation (Cont’d)
- Less effective if the # of features is small; should then solve the primal (# of variables = # of features)
- Or, why not use kernels with nonlinear mappings?
Comparing Different Training Methods
- $O(ln)$ versus $O(n)$ per iteration
- Generally, the new method for linear is much faster, especially for document data
- But one can always find weird cases where LIBSVM is faster than LIBLINEAR
- Applying the right approach to the right problem is essential
- One must be careful when comparing training algorithms
Software Issue
- Large data ⇒ may need different training strategies for different problems
- But we pay the price of complicating software packages
- The success of LIBSVM and SVMlight: simple and general, covering both linear and nonlinear
- General versus special: always an issue
Other Methods for Linear SVM
$w$ is the key to reducing $O(ln)$ to $O(n)$ per iteration:

$$w = \sum_{j=1}^{l} y_j \alpha_j x_j \in R^n$$

Many optimization methods can be used. We can now also solve the primal, since $w$ is no longer infinite-dimensional:

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

We used the decomposition method as an example because it works for both linear and nonlinear SVMs and makes the striking difference with/without $w$ easy to see. A primal alternative is sketched below.
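As one example of solving the primal directly, a hedged Pegasos-style stochastic subgradient sketch (the step-size schedule and `lr0` are my illustrative assumptions, not from the talk):

```python
import numpy as np

def primal_sgd(X, y, C, n_passes=10, lr0=0.1, seed=0):
    """Stochastic subgradient descent on
    (1/2) w^T w + C sum_i max(0, 1 - y_i w^T x_i)."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    w = np.zeros(n)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(l):
            t += 1
            eta = lr0 / (1.0 + lr0 * t / l)   # decaying step size
            g = w / l                         # regularizer's share of the gradient
            if y[i] * (w @ X[i]) < 1.0:       # hinge term is active
                g = g - C * y[i] * X[i]
            w -= eta * g                      # subgradient step
    return w
```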
Other Linear Classifiers
- Logistic regression, maximum entropy, conditional random fields (CRF) are all linear classifiers
- In the past, SVM training was considered very different from them
- For the linear case, the methods are very related
- Many interesting findings, but no time to show the details
What if Data Are Even Larger?
- We saw that I/O costs more than computation
- Large-scale document classification on a single computer is essentially a solved problem
- Challenges: what if data are larger than the computer's RAM? What if data are stored in a distributed way?
- Document classification in a data-center environment is an interesting research direction
Conclusions
- For certain problems, linear classifiers are as accurate as nonlinear ones, and more efficient for training/testing
- However, we are not claiming that you should no longer use kernels
- For large data, the right approaches are essential; machine learning researchers should clearly tell people when to use which methods
- You are welcome to try our software:
  http://www.csie.ntu.edu.tw/~cjlin/libsvm
  http://www.csie.ntu.edu.tw/~cjlin/liblinear