Chih-Jen Lin
Department of Computer Science, National Taiwan University
Talk at Criteo, August 1, 2014
Outline
Introduction
Optimization methods
Extension of linear classification
Big-data linear classification
Conclusions and future directions
Introduction
Linear and Nonlinear Classification
(Figure: a linear separator versus a nonlinear separator on the same data)
Linear: a linear function to separate data in the original input space; nonlinear: data mapped to other spaces
Example of a nonlinear mapping:
Original features: [height, weight]
Mapped features: [height, weight, weight/height²]
Linear and Nonlinear Classification (Cont’d)
Methods such as SVM and logistic regression can be used in two ways
• Kernel methods: data mapped to another space, x ⇒ φ(x)
φ(x)^T φ(y) is easily calculated; no good control on φ(·)
• Linear classification + feature engineering:
Directly use x without mapping. But x may have been carefully generated using some nonlinear information. Full control on x
We will focus on the second type of approach in this talk
Why Linear Classification?
• If φ(x) is high dimensional, the decision function sgn(w^T φ(x)) is expensive to evaluate
• Kernel methods:
w ≡ ∑_{i=1}^{l} α_i φ(x_i) for some α, K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)
New decision function: sgn(∑_{i=1}^{l} α_i K(x_i, x))
• Special φ(x) so that calculating K(x_i, x_j) is easy. Example:
K(x_i, x_j) ≡ (x_i^T x_j + 1)² = φ(x_i)^T φ(x_j), φ(x) ∈ R^{O(n²)}
Why Linear Classification? (Cont’d)
Prediction:
w^T x versus ∑_{i=1}^{l} α_i K(x_i, x)
If computing K(x_i, x_j) takes O(n), then
O(n) versus O(nl)
Kernel: cost related to size of training data
Linear: cheaper and simpler
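To make this cost difference concrete, below is a minimal NumPy sketch (an illustration, not LIBLINEAR code; the function names and the degree-2 polynomial kernel are assumptions for the example) comparing the two prediction rules.

    import numpy as np

    def predict_linear(w, x):
        # O(n): a single inner product with the weight vector
        return np.sign(w @ x)

    def predict_kernel(alpha, X_train, x):
        # O(nl): one kernel evaluation per training instance, here the
        # degree-2 polynomial kernel K(x_i, x) = (x_i^T x + 1)^2
        K = (X_train @ x + 1.0) ** 2      # l kernel values, each costing O(n)
        return np.sign(alpha @ K)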
Linear is Useful in Some Places
For certain problems, accuracy by linear is as good as by nonlinear
But training and testing are much faster
Especially document classification:
Number of features (bag-of-words model) is very large
Data are large and sparse
Training millions of instances takes just a few seconds
Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)
                Linear                     RBF kernel
Data set        Time (s)   Accuracy (%)    Time (s)    Accuracy (%)
MNIST38         0.1        96.82           38.1        99.70
ijcnn1          1.6        91.81           26.8        98.69
covtype         1.4        76.37           46,695.8    96.11
news20          1.1        96.95           383.2       96.90
real-sim        0.3        97.44           938.3       97.82
yahoo-japan     3.1        92.63           20,955.2    93.31
webspam         25.7       93.35           15,681.8    99.26

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features
Binary Linear Classification
Training data {(y_i, x_i)}, x_i ∈ R^n, y_i = ±1, i = 1, . . . , l
l: # of data, n: # of features
min_w f(w), f(w) ≡ w^T w / 2 + C ∑_{i=1}^{l} ξ(w; x_i, y_i)
w^T w / 2: regularization term (we have no time to talk about L1 regularization here)
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²,       (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): losses (1)-(2)
Logistic regression (LR): loss (3); no reference because it can be traced back to the 19th century
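As a small illustration, the three losses and the regularized objective f(w) above can be written directly in NumPy (a sketch with illustrative names, assuming dense data; this is not how LIBLINEAR evaluates them internally).

    import numpy as np

    def xi_L1(w, x, y):                   # L1 (hinge) loss, Eq. (1)
        return max(0.0, 1.0 - y * (w @ x))

    def xi_L2(w, x, y):                   # L2 (squared hinge) loss, Eq. (2)
        return max(0.0, 1.0 - y * (w @ x)) ** 2

    def xi_LR(w, x, y):                   # logistic loss, Eq. (3)
        return np.log1p(np.exp(-y * (w @ x)))

    def f(w, X, y, C, loss=xi_L2):
        # f(w) = w^T w / 2 + C * sum_i xi(w; x_i, y_i)
        return 0.5 * (w @ w) + C * sum(loss(w, xi, yi) for xi, yi in zip(X, y))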
Loss Functions (Cont’d)
(Figure: the three losses ξ_L1, ξ_L2, and ξ_LR plotted as functions of −y w^T x)
Their performance is usually similar
Loss Functions (Cont’d)
However,
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Optimization Methods
Many unconstrained optimization methods can be applied
For kernel methods, optimization is over a variable α, where
w = ∑_{i=1}^{l} α_i φ(x_i)
We cannot minimize over w because it may be infinite dimensional
However, for linear, minimizing over w or α is ok
Optimization Methods (Cont’d)
Among unconstrained optimization methods,
Low-order methods: quickly get a model, but slow final convergence
High-order methods: more robust and useful for ill-conditioned situations
We will quickly discuss some examples and show that both types of optimization methods are useful for linear classification
Optimization: 2nd Order Methods
Newton direction
min_s ∇f(w^k)^T s + (1/2) s^T ∇²f(w^k) s
This is the same as solving the Newton linear system
∇²f(w^k) s = −∇f(w^k)
Hessian matrix ∇²f(w^k) is too large to be stored
∇²f(w^k): n × n, n: number of features
But the Hessian has a special form:
∇²f(w) = I + C X^T D X,
Optimization: 2nd Order Methods (Cont’d)
X: data matrix, D: diagonal. For logistic regression,
D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})²
Using CG to solve the linear system, only Hessian-vector products are needed:
∇²f(w) s = s + C · X^T (D (X s))
Therefore, we have a Hessian-free approach
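Below is a rough sketch of one such Newton step for logistic regression, using SciPy's conjugate gradient with a matrix-free operator (dense X and illustrative names are assumed; the real TRON solver in LIBLINEAR adds a trust region, step control, and sparse-data handling).

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def newton_step(w, X, y, C):
        l, n = X.shape
        z = y * (X @ w)                             # y_i w^T x_i for all i
        sigma = 1.0 / (1.0 + np.exp(-z))
        grad = w + C * (X.T @ ((sigma - 1.0) * y))  # gradient of f(w)
        D = sigma * (1.0 - sigma)                   # D_ii = e^{-z_i} / (1 + e^{-z_i})^2

        def hess_vec(s):
            # Hessian-vector product s + C * X^T (D (X s)); the Hessian is never formed
            return s + C * (X.T @ (D * (X @ s)))

        H = LinearOperator((n, n), matvec=hess_vec)
        s, _ = cg(H, -grad, maxiter=50)             # approximately solve the Newton system
        return w + s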
Optimization: 1st Order Methods
We consider the L1 loss and the dual SVM problem
min_α f(α) subject to 0 ≤ α_i ≤ C, ∀i,
where
f(α) ≡ (1/2) α^T Q α − e^T α
and
Q_ij = y_i y_j x_i^T x_j, e = [1, . . . , 1]^T
We will apply coordinate descent (CD) methods
The situation for the L2 or LR loss is very similar
1st Order Methods (Cont’d)
Coordinate descent: a simple and classic technique
Change one variable at a time
Given the current α, let e_i = [0, . . . , 0, 1, 0, . . . , 0]^T
min_d f(α + d e_i) = (1/2) Q_ii d² + ∇_i f(α) d + constant
Without constraints, the optimal d = −∇_i f(α) / Q_ii
With the constraint 0 ≤ α_i + d ≤ C:
α_i ← min( max( α_i − ∇_i f(α) / Q_ii, 0 ), C )
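A minimal sketch of this dual CD procedure for the L1 loss is given below (NumPy, dense data, illustrative names; it maintains w = ∑_i α_i y_i x_i so that ∇_i f(α) = y_i w^T x_i − 1 costs O(n); the shrinking and random permutation of indices used in LIBLINEAR are omitted).

    import numpy as np

    def dual_cd_l1_svm(X, y, C, n_epochs=10):
        l, n = X.shape
        alpha = np.zeros(l)
        w = np.zeros(n)                        # w = sum_i alpha_i y_i x_i
        Qii = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i
        for _ in range(n_epochs):
            for i in range(l):
                if Qii[i] == 0.0:
                    continue
                G = y[i] * (w @ X[i]) - 1.0    # gradient of f along coordinate i
                new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
                w += (new_ai - alpha[i]) * y[i] * X[i]   # keep w consistent with alpha
                alpha[i] = new_ai
        return w, alpha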
Comparisons
L2-loss SVM is used
DCDL2: Dual coordinate descent
DCDL2-S: DCDL2 with shrinking
PCD: Primal coordinate descent
TRON: Trust region Newton method
This result is from Hsieh et al. (2008)
Objective values (Time in Seconds)
(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)
Low- versus High-order Methods
• We saw that low-order methods are efficient to obtain a model. However, high-order methods may be useful for difficult situations
• An example: # instances: 32,561, # features: 123
(Figures: objective value and accuracy versus training time on this set)
# features is small ⇒ solving primal is more suitable
Extension of Linear Classification
Linear classification can be extended in different ways
An important one is to approximate nonlinear classifiers
Goal: the better accuracy of nonlinear methods, but with faster training/testing
Examples
1. Explicit data mappings + linear classification
2. Kernel approximation + linear classification
I will focus on the first
Linear Methods to Explicitly Train φ(x_i)
Example: low-degree polynomial mapping:
φ(x) = [1, x_1, . . . , x_n, x_1², . . . , x_n², x_1x_2, . . . , x_{n−1}x_n]^T
For this mapping, # features = O(n²)
When is it useful?
Recall O(n) for linear versus O(nl) for kernel
Now O(n²) versus O(nl)
Sparse data
n ⇒ n̄, the average # of non-zeros per instance for sparse data
If n̄ ≪ n, then O(n̄²) may be much smaller than O(l n̄)
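The following small sketch (illustrative names; only the pairwise terms of φ(x) are shown) makes the point explicit: for one sparse instance we enumerate pairs of its n̄ nonzero features, so the cost is O(n̄²) rather than O(n²).

    from itertools import combinations_with_replacement

    def degree2_pairs(indices, values):
        # indices/values: nonzero feature ids of one instance and their values.
        # Returns the pairwise features x_i * x_j; cost O(n_bar^2), n_bar = len(indices).
        feats = {}
        for (i, vi), (j, vj) in combinations_with_replacement(list(zip(indices, values)), 2):
            feats[(i, j)] = vi * vj
        return feats

    # Example: an instance with 3 nonzeros out of a huge feature space
    print(degree2_pairs([3, 10, 47], [1.0, 0.5, 2.0]))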
Example: Dependency Parsing
A multi-class problem with sparse data
n        Dim. of φ(x)     l         n̄      w's # nonzeros
46,155   1,065,165,090    204,582   13.3    1,438,456

n̄: average # nonzeros per instance
Degree-2 polynomial is used
Dimensionality of w is very high, but w is sparse
Some training feature columns of x_i x_j are entirely zero
Hashing techniques are used to handle sparse w
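One common way to do this is feature hashing: map each pair (i, j) into a fixed number of buckets so that w needs only one weight per bucket. The sketch below uses illustrative names and an assumed bucket count, and is not necessarily the exact scheme of the cited parsing work.

    import numpy as np

    NUM_BUCKETS = 2 ** 22                      # size of the hashed weight vector

    def pair_bucket(i, j):
        # Map the pair feature (i, j) to a fixed-size index; collisions are accepted
        return hash((i, j)) % NUM_BUCKETS

    def decision_value_hashed(w_hashed, indices, values):
        # Decision value using hashed degree-2 features of one sparse instance
        s = 0.0
        for a in range(len(indices)):
            for b in range(a, len(indices)):
                s += w_hashed[pair_bucket(indices[a], indices[b])] * values[a] * values[b]
        return s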
Example: Dependency Parsing (Cont’d)
               LIBSVM                 LIBLINEAR
               RBF        Poly        Linear     Poly
Training time  3h34m53s   3h21m51s    3m36s      3m43s
Parsing speed  0.7x       1x          1652x      103x
UAS            89.92      91.67       89.11      91.71
LAS            88.55      90.60       88.07      90.71
We get faster training/testing, but maintain good accuracy
See detailed discussion in Chang et al. (2010)
Discussion
In the above example, we use all pairs
This is fine for some applications, but # features may become too large
People have proposed projection or hashing techniques to use fewer features as approximations
Examples: Kar and Karnick (2012); Pham and Pagh (2013)
This has been used in computational advertising (Chapelle et al., 2014)
Big-data Linear Classification
Nowadays data can be easily larger than memory capacity
Disk-level linear classification: Yu et al. (2012) and subsequent developments
Distributed linear classification: recently an active research topic
Example: we can parallelize the 2nd-order method discussed earlier. Recall the Hessian-vector product
∇²f(w) s = s + C · X^T (D (X s))
Parallel Hessian-vector Product
Hessian-vector products are the computational bottleneck
X^T D X s
The data matrix X is now stored in a distributed manner
(Figure: X is partitioned by instances into blocks X_1, X_2, . . . , X_p, stored on nodes 1, 2, . . . , p)
X^T D X s = X_1^T D_1 X_1 s + · · · + X_p^T D_p X_p s
Parallel Hessian-vector Product (Cont’d)
We use allreduce to let every node get X^T D X s
(Figure: node i computes X_i^T D_i X_i s; the allreduce operation sums these vectors and sends the result X^T D X s back to every node)
Allreduce: reducing all vectors (X_i^T D_i X_i s, ∀i) to a single vector (X^T D X s ∈ R^n) and then sending the result to every node
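A rough sketch of this step with mpi4py is shown below (an assumption for illustration; the distributed LIBLINEAR extension itself is implemented in C/C++ on top of MPI). Each node holds only its own block X_i and D_i.

    import numpy as np
    from mpi4py import MPI

    def distributed_hessian_vector(Xi, Di, s, C):
        # Xi, Di: this node's block of the data matrix and of the diagonal D
        comm = MPI.COMM_WORLD
        local = Xi.T @ (Di * (Xi @ s))             # X_i^T D_i X_i s, computed locally
        total = np.empty_like(local)
        comm.Allreduce(local, total, op=MPI.SUM)   # every node receives the summed vector
        return s + C * total                       # s + C * X^T D X s on every node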
Instance-wise and Feature-wise Data Splits
(Figure: instance-wise split X_{iw,1}, X_{iw,2}, X_{iw,3} versus feature-wise split X_{fw,1}, X_{fw,2}, X_{fw,3})
Feature-wise: each machine calculates part of the Hessian-vector product
(∇²f(w) v)_{fw,1} = v_1 + C X_{fw,1}^T D (X_{fw,1} v_1 + · · · + X_{fw,p} v_p)
Instance-wise and Feature-wise Data Splits (Cont’d)
X_{fw,1} v_1 + · · · + X_{fw,p} v_p ∈ R^l must be available on all nodes (by allreduce)
Data moved per Hessian-vector product:
Instance-wise: O(n), Feature-wise: O(l)
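For the feature-wise split, a corresponding sketch (again mpi4py for illustration, with illustrative names; D is assumed to be already available on every node) communicates an l-dimensional vector instead.

    import numpy as np
    from mpi4py import MPI

    def featurewise_hessian_vector(Xi_fw, D, vi, C):
        # Xi_fw: this node's columns of X; vi: this node's part of v
        comm = MPI.COMM_WORLD
        local = Xi_fw @ vi                       # X_{fw,i} v_i, a vector of length l
        z = np.empty_like(local)
        comm.Allreduce(local, z, op=MPI.SUM)     # z = sum_p X_{fw,p} v_p on every node
        return vi + C * (Xi_fw.T @ (D * z))      # this node's part of the Hessian-vector product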
Experiments
Two sets:
Data set   l         n            #nonzeros
epsilon    400,000   2,000        800,000,000
webspam    350,000   16,609,143   1,304,697,446

For results of more sets, see Zhuang et al. (2014)
We use Amazon AWS
We compare
1. TRON: Trust-region Newton method
2. ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
Experiments (Cont’d)
(Figures: relative function value difference versus training time in seconds on epsilon and webspam, comparing ADMM-IW, ADMM-FW, TRON-IW, and TRON-FW)
16 machines are used
Horizontal line: where test accuracy has stabilized
TRON has faster convergence than ADMM
Instance-wise and feature-wise splits are useful for l ≫ n and l ≪ n, respectively
Programming Frameworks
We use MPI for the above experiments
How about others like MapReduce?
MPI is more efficient, but has no fault tolerance
In contrast, MapReduce is slow for iterative algorithms due to heavy disk I/O
Many new frameworks are being actively developed:
1. Spark (Zaharia et al., 2010)
2. REEF (Chun et al., 2013)
Selecting suitable frameworks for distributed classification isn’t that easy!
A Comparison Between MPI and Spark
(Figure: relative function value difference (log scale) versus training time in seconds, comparing Spark LIBLINEAR, Spark LIBLINEAR-m, and MPI LIBLINEAR)
We use the data set epsilon (8 nodes). Spark is slower, but in general competitive
Conclusions and future directions
Resources on Linear Classification
Since 2007, we have been actively developing the software LIBLINEAR for linear classification:
www.csie.ntu.edu.tw/~cjlin/liblinear
It is now widely used in Internet companies
An earlier survey on linear classification is Yuan et al. (2012),
"Recent Advances of Large-scale Linear Classification", Proceedings of the IEEE, 2012
It contains many references on this subject
Distributed LIBLINEAR
We recently released an extension of LIBLINEAR for distributed classification
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI and Spark
The development is still in an early stage. Your comments are very welcome.
Conclusions
Linear classification is an old topic, but recently there are new and interesting applications
Kernel methods are still useful for many applications, but linear classification + feature engineering is suitable for some others
Advantages of linear: easier feature engineering
We expect that linear classification can be widely used in situations ranging from small-model to big-data classification
Acknowledgments
Many students have contributed to our research on large-scale linear classification
We also thank the National Science Council of Taiwan for its partial support