Large-scale Linear Classification:
Status and Challenges
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Criteo Machine Learning Workshop, November 8, 2017
Outline
1 Introduction
2 Optimization methods
3 Multi-core linear classification
4 Distributed linear classification
5 Conclusions
Introduction
Linear Classification
Although many new and advanced techniques are available (e.g., deep learning), linear classifiers remain useful because of their simplicity
We have fast training/prediction for large-scale data
Training involves solving a large-scale optimization problem
The focus of this talk is on how to solve this optimization problem
The Software LIBLINEAR
My talk is closely related to research done in developing the software LIBLINEAR for linear classification
www.csie.ntu.edu.tw/~cjlin/liblinear
It is now one of the most widely used linear classification tools
Linear and Kernel Classification
Methods such as SVM and logistic regression are often used in two ways
Kernel methods: data mapped to another space x ⇒ φ(x)
φ(x)^T φ(y) is easily calculated; no good control on φ(·)
Feature engineering + linear classification:
Directly use x without mapping. But x may have been carefully generated. Full control on x
Comparison Between Linear and Kernel
For certain problems, the accuracy of linear is as good as that of kernel methods
But training and testing are much faster, especially for document classification
The number of features (bag-of-words model) is very large; the data are large and sparse
Training millions of instances takes just a few seconds
Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)
                        Linear                    RBF kernel
Data set                Time (s)   Accuracy (%)   Time (s)     Accuracy (%)
MNIST38 0.1 96.82 38.1 99.70
ijcnn1 1.6 91.81 26.8 98.69
covtype multiclass 1.4 76.37 46,695.8 96.11
news20 1.1 96.95 383.2 96.90
real-sim 0.3 97.44 938.3 97.82
yahoo-japan 3.1 92.63 20,955.2 93.31
webspam 25.7 93.35 15,681.8 99.26
Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Binary Linear Classification
Training data {(y_i, x_i)}, x_i ∈ R^n, y_i = ±1, i = 1, ..., l
l: # of data, n: # of features
min_w f(w), where
f(w) ≡ C Σ_{i=1}^{l} ξ(w; x_i, y_i) + (1/2) w^T w   (L2 regularization)
or with ‖w‖_1 as the regularization term              (L1 regularization)
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
Loss Functions
Some commonly used loss functions.
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2,      (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)
Logistic regression (LR): (3)
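The following is a minimal NumPy sketch (not LIBLINEAR code) of the three losses and of the L2-regularized primal objective f(w); the function names and the toy data are illustrative only.

import numpy as np

def xi_L1(w, x, y):                     # hinge loss, Eq. (1)
    return max(0.0, 1.0 - y * w.dot(x))

def xi_L2(w, x, y):                     # squared hinge loss, Eq. (2)
    return max(0.0, 1.0 - y * w.dot(x)) ** 2

def xi_LR(w, x, y):                     # logistic loss, Eq. (3)
    return np.log1p(np.exp(-y * w.dot(x)))

def primal_objective(w, X, y, C=1.0, loss=xi_L2):
    # f(w) = C * sum_i loss(w; x_i, y_i) + (1/2) w^T w   (L2 regularization)
    total = sum(loss(w, X[i], y[i]) for i in range(X.shape[0]))
    return C * total + 0.5 * w.dot(w)

# toy usage
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, 0.5]])
y = np.array([1.0, -1.0, -1.0])
print(primal_objective(np.zeros(2), X, y, C=1.0))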
Optimization methods
Optimization Methods
A difference between linear and kernel is that for kernel, optimization must be over a variable α (usually through the dual problem), where
w = Σ_{i=1}^{l} α_i φ(x_i)
We cannot minimize over w, which may be infinite dimensional
However, for linear, minimizing over w or α is ok
Optimization Methods (Cont’d)
Unconstrained optimization methods can be categorized into
Low-order methods: quickly get a model, but slow final convergence
High-order methods: more robust and useful for ill-conditioned situations
We will show both types of optimization methods are useful for linear classification
Further, to handle large problems, the algorithms must take problem structure into account
Let’s discuss a low-order method (coordinate descent) in detail
Coordinate Descent
We consider the L1 loss and the dual SVM problem
min_α f(α)
subject to 0 ≤ α_i ≤ C, ∀i,
where
f(α) ≡ (1/2) α^T Q α − e^T α,
Q_ij = y_i y_j x_i^T x_j,  e = [1, ..., 1]^T
We will apply coordinate descent (CD) methods
The situation for the L2 or LR loss is very similar
Coordinate Descent (Cont’d)
For the current α, change α_i while fixing the others
Let e_i = [0, ..., 0, 1, 0, ..., 0]^T
The sub-problem is
min_d f(α + d e_i) = (1/2) Q_ii d^2 + ∇_i f(α) d + constant
subject to 0 ≤ α_i + d ≤ C
Without the constraint, setting the derivative Q_ii d + ∇_i f(α) to zero gives the optimal
d = −∇_i f(α) / Q_ii
Coordinate Descent (Cont’d)
With the constraint 0 ≤ α_i + d ≤ C, the update becomes
α_i ← min(max(α_i − ∇_i f(α)/Q_ii, 0), C)
Note that
∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^{l} Q_ij α_j − 1 = Σ_{j=1}^{l} y_i y_j x_i^T x_j α_j − 1
Expensive: O(ln), l: # instances, n: # features
Coordinate Descent (Cont’d)
A trick in Hsieh et al. (2008) is to define and maintain
u ≡ Σ_{j=1}^{l} y_j α_j x_j
Easy gradient calculation: the cost is O(n)
∇_i f(α) = y_i u^T x_i − 1
Note that this cannot be done for kernel methods, as the mapped vector φ(x_i) may be very high (even infinite) dimensional
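As a quick sanity check of this trick, the small NumPy sketch below (random toy data, not from the paper) verifies that the O(n) form y_i u^T x_i − 1 matches the O(ln) form Σ_j Q_ij α_j − 1.

import numpy as np

rng = np.random.default_rng(0)
l, n = 6, 4
X = rng.standard_normal((l, n))              # rows are the x_i
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)

u = (y * alpha) @ X                          # u = sum_j y_j alpha_j x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T    # Q_ij = y_i y_j x_i^T x_j

i = 2
grad_cheap = y[i] * u @ X[i] - 1.0           # O(n) per component
grad_full  = Q[i] @ alpha - 1.0              # O(ln) per component
assert np.isclose(grad_cheap, grad_full)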
Coordinate Descent (Cont’d)
The procedure:
While α is not optimal (outer iteration)
  For i = 1, ..., l (inner iteration)
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
    (d) If α_i needs to be changed
          u ← u + (α_i − ᾱ_i) y_i x_i
Maintaining u also costs O(n)
Coordinate Descent (Cont’d)
Having
u ≡ Σ_{j=1}^{l} y_j α_j x_j,
∇_i f(α) = y_i u^T x_i − 1, and
u ← u + (α_i − ᾱ_i) y_i x_i   (ᾱ_i: old value; α_i: new value)
is essential
This isn't the vanilla CD dating back to Hildreth (1957)
We take the problem structure into account
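Below is a minimal single-core sketch of this dual CD procedure for the L1-loss SVM, in the spirit of Hsieh et al. (2008); it is a simplification (dense NumPy arrays, a fixed number of passes, no shrinking), not the LIBLINEAR implementation.

import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, n_epochs=10):
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                           # u = sum_j y_j alpha_j x_j
    Qii = np.einsum('ij,ij->i', X, X)         # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(n_epochs):                 # outer iterations
        for i in range(l):                    # inner iterations
            if Qii[i] == 0:
                continue
            alpha_old = alpha[i]
            G = y[i] * u.dot(X[i]) - 1.0                     # O(n) gradient
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            if alpha[i] != alpha_old:                        # O(n) update of u
                u += (alpha[i] - alpha_old) * y[i] * X[i]
    return u, alpha                           # u equals the primal w

# toy usage: u gives the linear decision function sign(w^T x)
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, alpha = dual_cd_l1_svm(X, y)
print(np.sign(X @ w))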
Comparisons
L2-loss SVM is used
DCDL2: dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: primal coordinate descent
TRON: trust region Newton method
This result is from Hsieh et al. (2008) with C = 1
Objective values (Time in Seconds)
[Figure: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea]
Low- versus High-order Methods
We see that low-order methods are efficient, but high-order methods are useful for difficult situations
CD for dual:
$ time ./train -c 1 news20.scale          2.528s
$ time ./train -c 100 news20.scale        28.589s
Newton for primal:
$ time ./train -c 1 -s 2 news20.scale     8.596s
$ time ./train -c 100 -s 2 news20.scale   11.088s
Training Median-sized Data: Status
Basically a solved problem
However, as data and memory continue to grow, new techniques are needed for large-scale data sets.
Two possible strategies are
1 Multi-core linear classification
2 Distributed linear classification
Multi-core linear classification
Multi-core Linear Classification
Nowadays each CPU has several cores
However, parallelizing algorithms to use multiple cores may not be that easy
In fact, algorithms may need to be redesigned
For the past two years we have been working on multi-core LIBLINEAR
Multi-core Linear Classification (Cont’d)
Three multi-core solvers have been released
1 Newton method for the primal L2-regularized problem (Lee et al., 2015)
2 Coordinate descent method for the dual L2-regularized problem (Chiang et al., 2016)
3 Coordinate descent method for the primal L1-regularized problem (Zhuang et al., 2017)
They are practically useful. For example, one user from USC thanked us because “a job (taking >30 hours using one core) now can finish within 5 hours”
We will briefly discuss the 2nd and the 3rd
Multi-core CD for Dual
Recall the CD algorithm for the dual is
While α is not optimal (outer iteration)
  For i = 1, ..., l (inner iteration)
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
    (d) If α_i needs to be changed
          u ← u + (α_i − ᾱ_i) y_i x_i
Multi-core CD for Dual (Cont’d)
The algorithm is inherently sequential
Suppose α_{i'} is updated after α_i
Then the update of α_{i'} must wait until the latest u is obtained
The parallelization is difficult
Multi-core CD for Dual (Cont’d)
Asynchronous CD is possible (Hsieh et al., 2015), but may diverge
We note that for a given set B̄,
∇_i f(α) = y_i u^T x_i − 1, ∀i ∈ B̄
can be calculated in parallel
We then propose a framework
Multi-core CD for Dual (Cont’d)
While α is not optimal
  (a) Select a set B̄
  (b) Calculate ∇_{B̄} f(α) in parallel
  (c) Select B ⊂ B̄ with |B| ≪ |B̄|
  (d) Sequentially update α_i, i ∈ B
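A rough single-machine sketch of this framework is given below; the vectorized NumPy gradient over B̄ stands in for the parallel computation, and the random choice of B̄ plus a largest-projected-gradient rule for B are only plausible placeholders, not necessarily the selections used in Chiang et al. (2016).

import numpy as np

def cd_with_parallel_gradient(X, y, C=1.0, n_outer=50, bar_size=64, b_size=8):
    l, n = X.shape
    alpha, u = np.zeros(l), np.zeros(n)
    Qii = np.einsum('ij,ij->i', X, X)
    rng = np.random.default_rng(0)
    for _ in range(n_outer):
        B_bar = rng.choice(l, size=min(bar_size, l), replace=False)   # (a) select B-bar
        G = y[B_bar] * (X[B_bar] @ u) - 1.0                           # (b) parallelizable part
        # (c) keep the few indices with the largest projected gradient
        PG = np.where(alpha[B_bar] <= 0, np.minimum(G, 0.0),
             np.where(alpha[B_bar] >= C, np.maximum(G, 0.0), G))
        B = B_bar[np.argsort(-np.abs(PG))[:b_size]]
        for i in B:                                                   # (d) sequential updates
            if Qii[i] == 0:
                continue
            alpha_old = alpha[i]
            Gi = y[i] * u.dot(X[i]) - 1.0          # recompute with the latest u
            alpha[i] = min(max(alpha_old - Gi / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]
    return u, alpha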
Multi-core CD for Dual (Cont’d)
The selection of B ⊂ B̄ with |B| ≪ |B̄| is guided by ∇_{B̄} f(α)
The idea is simple, but it takes effort to obtain a practical setting (details omitted)
Multi-core CD for Dual (Cont’d)
[Figure: results on webspam and url combined, comparing Alg-4, the method in Chiang et al. (2016), with asynchronous CD (Hsieh et al., 2015)]
Multi-core CD for L1 Regularization
Currently, primal CD (Yuan et al., 2010) or its variants (Yuan et al., 2012) is the state-of-the-art for L1
Each CD step involves one feature
Some attempts at parallel CD for L1 include
  Asynchronous CD (Bradley et al., 2011)
  Block CD (Bian et al., 2013)
These methods are not satisfactory because of either divergence issues or poor speedup
Multi-core CD for L1 Regularization (Cont’d)
We struggled for years to find a solution
Recently, in Zhuang et al. (2017) we obtained an effective setting
This work is partially supported by a Criteo Faculty Research Award
Our idea is simple: direct parallelization of CD
But wait... this shouldn't work, because each CD iteration is cheap
Direct Parallelization of CD
Let’s consider a simple setting to decide if one CD step should be parallelized or not
if #non-zeros in an instance/feature ≥ a threshold then
    use multiple cores
else
    use a single core
Idea: a CD step is parallelized if there are enough operations
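A tiny helper, purely illustrative, showing how this rule plays out: given the number of non-zeros per instance (or per feature), it reports what fraction of CD steps would run multi-core and how much of the total work those steps cover.

import numpy as np

def parallelized_fraction(nnz_counts, threshold=500):
    nnz = np.asarray(nnz_counts, dtype=float)
    parallel = nnz >= threshold                  # which CD steps get multiple cores
    frac_steps = parallel.mean()                 # fraction of CD steps parallelized
    frac_work = nnz[parallel].sum() / nnz.sum()  # fraction of operations they cover
    return frac_steps, frac_work

# a very sparse instance-wise pattern: almost no step is parallelized
print(parallelized_fraction([30, 10, 5, 2000, 25], threshold=500))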
Direct Parallelization of CD (Cont’d)
Speedup of CD for dual, L2 regularization

                                         #threads
         Data set                     2      4      8
sparse   avazu-app                   0.4    0.3    0.2
sets     criteo                      0.5    0.3    0.2
dense    epsilon normalized          1.3    1.3    1.1
sets     splice site.t.10%           1.8    2.8    4.1

CD for dual: one instance at a time
Threshold: 0 (sparse), 500 (dense)
If the threshold were 500 for the sparse sets, no instance would be parallelized
The speedup is poor
% of instances/features containing 50% and 80% of the non-zeros

                        Instances            Features
Data set                50%      80%         50%       80%
avazu-app               50%      80%         0.2%      1%
criteo                  50%      80%         0.01%     0.2%
kdd2010-a               40%      73%         0.03%     2%
kdd2012                 50%      80%         0.003%    0.5%
rcv1 test               24%      54%         1%        5%
splice site.t.10%       50%      80%         9%        57%
url combined            44%      76%         0.002%    0.006%
webspam                 29%      55%         0.6%      2%
yahoo-korea             20%      48%         0.07%     0.5%

Features' non-zero distribution is extremely skewed
Non-zeros are concentrated in a few dense (and parallelizable) features
Speedup of CD for L1 Regularization
LR loss used; columns give the number of threads

                        Naive             Block CD          Async. CD
Data set                2    4    8       2    4    8       2    4    8
avazu-app              1.9  3.4  5.6     0.4  0.7  1.0     1.4  2.7  3.4
criteo                 1.8  3.3  5.5     0.7  1.2  1.9     1.5  2.9  4.8
epsilon normalized     2.0  4.0  7.9      x    x    x      1.3  2.1   x
HIGGS                  2.0  3.9  7.5     0.7  0.8  0.9     1.0  1.3   x
kdd2010-a              1.7  2.4  3.1     0.8  1.4  2.4     1.5  2.7  4.8
kdd2012                1.9  2.8  3.9     0.2  0.4  0.6     2.1  4.7  7.0
rcv1 test              1.9  3.4  5.9      x    x    x      1.3  2.5  4.5
splice site.t.10%      1.9  3.6  6.2      x    x    x      1.6  2.7  4.3
url combined           2.0  3.5  6.2     0.5  0.9  1.3     1.0  1.7  1.7
webspam                1.8  3.2  4.8     0.1  0.3  0.5     1.4  2.5  4.1
yahoo-korea            1.9  3.5  5.9     0.2  0.3  0.5     1.3  2.4  4.4
Distributed linear classification
Distributed Linear Classification
It’s even more complicated than multi-core
I don’t have time to discuss this topic in detail, but let me share some lessons
A big mistake was that we worked on distributed before multi-core
Distributed Linear Classification (Cont’d)
A few years ago, big data was hot. So we extended a Newton solver in LIBLINEAR to MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)
We were a bit ahead of our time; Spark MLlib wasn't even available then
Unfortunately, very few people used our code, especially the Spark version
We then moved to multi-core. Immediately, multi-core LIBLINEAR had many users
Distributed Linear Classification (Cont’d)
Why did we fail? There are several possible reasons
Not many people have big data??
System issues are more important than we thought
At that time Spark wasn't easy to use and was being actively changed
System configurations and application scenarios may vary significantly
An algorithm useful for systems with fast networks may be useless for systems with slow communication
Distributed Linear Classification (Cont’d)
Application dependency is stronger in the distributed setting.
L2 and L1 regularization often give similar accuracy.
On a single machine, we may not want to use L1 because training is more difficult and the smaller model size isn’t that important
However, for distributed applications many have told me that they need L1
A lesson is that for people from academia, it’s better to collaborate with industry for research on distributed machine learning
Conclusions
Linear classification is an old topic, but it remains useful for many applications
Efficient training relies on designing optimization algorithms that incorporate the problem structure
Many issues in multi-core and distributed linear classification still need to be studied