Recent Advances in Large Linear Classification
Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at NEC Labs, August 26, 2011
This talk is based on our recent survey paper, invited by the Proceedings of the IEEE:
G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.
It’s also related to our development of the software LIBLINEAR
www.csie.ntu.edu.tw/~cjlin/liblinear
Due to time constraints, we will give an overview instead of deep technical details.
Outline
Introduction
Binary linear classification
Multi-class linear classification
Applications in non-standard scenarios
Data beyond memory capacity
Discussion and conclusions
Introduction
Linear and Nonlinear Classification
[Figure panels: Linear vs. Nonlinear decision boundaries]
By linear we mean that the data are not mapped to a higher dimensional space
Original: [height, weight]
Nonlinear: [height, weight, weight/height²]
Linear and Nonlinear Classification (Cont’d)
Given training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1
l: # of data, n: # of features
Linear: find (w, b) such that the decision function is sgn(w^T x + b)
Nonlinear: map data to φ(x_i). The decision function becomes sgn(w^T φ(x) + b)
Later b is omitted
Why Linear Classification?
• If φ(x) is high dimensional, w^T φ(x) is expensive
• Kernel methods:
  w ≡ Σ_{i=1}^{l} α_i φ(x_i) for some α, with K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)
  New decision function: sgn(Σ_{i=1}^{l} α_i K(x_i, x))
• Use a special φ(x) so that calculating K(x_i, x_j) is easy
• Example:
  K(x_i, x_j) ≡ (x_i^T x_j + 1)^2 = φ(x_i)^T φ(x_j), φ(x) ∈ R^{O(n^2)}
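As a quick sanity check, the degree-2 example can be verified numerically. The sketch below assumes one standard choice of φ (with √2 scalings on the linear and cross terms) so that the kernel value and the explicit inner product coincide; it is illustrative only.

```python
import itertools
import numpy as np

def phi_degree2(x):
    """Explicit map for K(x, z) = (x^T z + 1)^2 (one standard scaling of phi)."""
    feats = [1.0]                                        # constant term
    feats += [np.sqrt(2) * v for v in x]                 # linear terms
    feats += [v * v for v in x]                          # squared terms
    feats += [np.sqrt(2) * x[i] * x[j]                   # cross terms
              for i, j in itertools.combinations(range(len(x)), 2)]
    return np.array(feats)

x, z = np.random.rand(5), np.random.rand(5)
print((x.dot(z) + 1) ** 2)                   # kernel value
print(phi_degree2(x).dot(phi_degree2(z)))    # same value via the explicit mapping
```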
Why Linear Classification? (Cont’d)
Prediction:
w^T x versus Σ_{i=1}^{l} α_i K(x_i, x)
If evaluating K(x_i, x_j) takes O(n), then prediction costs
O(n) versus O(nl)
Nonlinear: more powerful in separating data
Linear: cheaper and simpler
Linear is Useful in Some Places
For certain problems, the accuracy of linear is as good as nonlinear
But training and testing are much faster, especially for document classification
Number of features (bag-of-words model) is very large
Recently linear classification has become a popular research topic. Sample works in 2005-2008: Joachims (2006); Shalev-Shwartz et al. (2007); Hsieh et al. (2008)
They focus on large sparse data
There are many other recent papers and software
Comparison Between Linear and Nonlinear (Training Time & Testing Accuracy)
                  Linear                   RBF kernel
Data set          Time (s)  Accuracy (%)   Time (s)   Accuracy (%)
MNIST38              0.1       96.82           38.1      99.70
ijcnn1               1.6       91.81           26.8      98.69
covtype              1.4       76.37       46,695.8      96.11
news20               1.1       96.95          383.2      96.90
real-sim             0.3       97.44          938.3      97.82
yahoo-japan          3.1       92.63       20,955.2      93.31
webspam              25.7      93.35       15,681.8      99.26
Size reasonably large: e.g., yahoo-japan has 140k instances and 830k features
Binary linear classification
Binary Linear Classification
Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1; l: # of data, n: # of features
min_w  r(w) + C Σ_{i=1}^{l} ξ(w; x_i, y_i)
r(w): regularization term
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),        (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2, and  (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).      (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)
Logistic regression (LR): (3)
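For concreteness, a direct transcription of (1)-(3); a minimal sketch assuming `w` and `x` are NumPy arrays:

```python
import numpy as np

def losses(w, x, y):
    """Evaluate the three losses (1)-(3) at a single instance (x, y)."""
    m = y * w.dot(x)                              # the margin y * w^T x
    return {"L1": max(0.0, 1.0 - m),
            "L2": max(0.0, 1.0 - m) ** 2,
            "LR": float(np.logaddexp(0.0, -m))}   # stable log(1 + e^{-m})
```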
Loss Functions (Cont’d)
[Figure: ξ_L1, ξ_L2, and ξ_LR plotted as functions of −y w^T x]
They are similar
Regularization
L1 versus L2
‖w‖_1 and w^T w/2
w^T w/2: smooth, easier to optimize
‖w‖_1: non-differentiable; sparse solution, possibly many zero elements
Possible advantages of L1 regularization:
- Feature selection
- Less storage for w
Training Linear Classifiers
Many recent developments; we won't show details here
Why is training linear faster than nonlinear?
Recall the O(n) vs. O(nl) difference in prediction:
w^T x and Σ_{i=1}^{l} α_i K(x_i, x); n: # features, l: # data
A similar situation happens here. During training,
Σ_{t=1}^{l} α_t x_i^T x_t is often needed ⇒ O(nl)    (4)
Training Linear Classifiers (Cont’d)
By maintaining u ≡ Σ_{t=1}^{l} y_t α_t x_t, computing u^T x_i costs only O(n)
u: an intermediate variable during training; it eventually approaches the final weight vector w
Key: we are able to store x_t, ∀t, and maintain u
Nonlinear: we can't store φ(x_t)
For linear, basically any optimization method can be applied
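To make the idea concrete, below is a minimal sketch of dual coordinate descent for the L1-loss linear SVM in the spirit of Hsieh et al. (2008), maintaining u so every update costs O(n). The dense NumPy arrays, fixed epoch count, and lack of shrinking are simplifying assumptions.

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, epochs=10):
    """Sketch: dual coordinate descent for the L1-loss linear SVM."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                        # u = sum_t y_t alpha_t x_t, becomes w
    Q_ii = (X ** 2).sum(axis=1)            # diagonal of the dual Hessian
    for _ in range(epochs):
        for i in np.random.permutation(l):
            if Q_ii[i] <= 0.0:             # skip empty instances
                continue
            G = y[i] * u.dot(X[i]) - 1.0   # O(n) partial gradient, not O(nl)
            d = min(max(alpha[i] - G / Q_ii[i], 0.0), C) - alpha[i]
            if d != 0.0:
                alpha[i] += d
                u += d * y[i] * X[i]       # O(n) update keeps u consistent
    return u                               # the final weight vector w
```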
Choosing a Training Algorithm
Data property
# instances ≫ # features, or the other way around
Primal or dual
First-order or higher-order
Now first-order is slightly preferred, as we seldom need an accurate optimization solution
Cost of operations
exp/log operations are more expensive; avoid them when training LR
Others
L1 Regularization
Non-differentiable: need non-smooth optimization techniques
Difficult to apply sophisticated methods
Currently, coordinate descent and Newton with coordinate descent are among the most efficient (Yuan et al., 2010; Friedman et al., 2010; Yuan et al., 2011)
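To show how coordinate descent copes with the non-differentiable term, here is a sketch for the simpler L1-regularized least-squares problem (not the LR/SVM losses of the cited papers); each coordinate has a closed-form soft-thresholding update.

```python
import numpy as np

def cd_lasso(X, y, lam=1.0, epochs=20):
    """Cyclic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    l, n = X.shape
    w = np.zeros(n)
    r = y - X.dot(w)                         # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j].dot(r) + col_sq[j] * w[j]          # correlation term
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)                    # keep residual in sync
            w[j] = w_new
    return w
```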
Multi-class linear classification
Solving Several Binary Problems
The same methods work for linear and nonlinear classification, but there are some subtle differences
One-vs-rest
w_m: class m positive; others negative
class of x ≡ arg max_{m=1,...,k} w_m^T x
Memory: O(kn); k: # classes
One-vs-one: w_{1,2}, ..., w_{(k−1),k} constructed; O(k^2 n) memory cost
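A minimal one-vs-rest sketch; the helper `train_binary` stands for any binary linear solver (e.g., the dual coordinate descent sketch earlier) and is an assumed name, not a LIBLINEAR API.

```python
import numpy as np

def train_one_vs_rest(X, Y, k, train_binary):
    """Train k binary problems: class m is positive, all other classes negative."""
    return np.vstack([train_binary(X, np.where(Y == m, 1, -1))
                      for m in range(k)])       # W has shape (k, n)

def predict_one_vs_rest(W, x):
    return int(np.argmax(W.dot(x)))             # arg max_m  w_m^T x
```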
Solving Several Binary Problems (Cont’d)
So one-vs-rest is more suitable than one-vs-one for linear classification
This isn't the case for kernelized SVM/LR
Considering All Data at Once
min_{w_1,...,w_k}  (1/2) Σ_{m=1}^{k} ‖w_m‖_2^2 + C Σ_{i=1}^{l} ξ({w_m}_{m=1}^{k}; x_i, y_i)
Multi-class SVM by Crammer and Singer (2001)
loss function: max_{m≠y} max(0, 1 − (w_y − w_m)^T x)
Maximum Entropy (ME)
loss function: −log P(y|x), where P(y|x) ≡ exp(w_y^T x) / Σ_{m=1}^{k} exp(w_m^T x)
Many don't think that ME is close to SVM, but it is
Note that if # classes = 2, ME ⇒ LR
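Evaluating the ME posterior is just a softmax over the k inner products; a small, numerically stabilized sketch, where W stacks w_1, ..., w_k as rows:

```python
import numpy as np

def me_probabilities(W, x):
    """P(y|x) = exp(w_y^T x) / sum_m exp(w_m^T x), computed for all classes."""
    scores = W.dot(x)
    scores -= scores.max()          # subtracting the max avoids overflow in exp
    e = np.exp(scores)
    return e / e.sum()              # probabilities over the k classes
```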
Applications in non-standard scenarios
Applications in Non-standard Scenarios
Linear classification can be applied in many other places
An important one is to approximate nonlinear classifiers
Goal: the better accuracy of nonlinear, but with faster training/testing
Two types of methods here:
- Linear methods on explicit data mappings
- Approximating kernels
Linear Methods to Explicitly Train φ(x_i)
Example: low-degree polynomial mapping:
φ(x) = [1, x_1, ..., x_n, x_1^2, ..., x_n^2, x_1x_2, ..., x_{n−1}x_n]^T
For this mapping, # features = O(n^2)
When is it useful?
Recall O(n) for linear versus O(nl) for kernel; now it is O(n^2) versus O(nl)
Sparse data: n ⇒ n̄, the average # of non-zeros per instance
n̄ ≪ n ⇒ O(n̄^2) may still be smaller than O(l n̄)
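A sketch of the degree-2 expansion for one sparse instance stored as a dict from feature index to value; the index-tuple feature names are purely illustrative (a real implementation would map or hash them to integer indices).

```python
import itertools

def degree2_features(x):
    """Expand a sparse instance {index: value} into its degree-2 features.

    Cost is O(nbar^2) per instance, where nbar = number of nonzeros,
    independent of the full dimension n.
    """
    phi = {(): 1.0}                                   # constant term
    for i, v in x.items():
        phi[(i,)] = v                                 # linear term
        phi[(i, i)] = v * v                           # squared term
    for (i, vi), (j, vj) in itertools.combinations(sorted(x.items()), 2):
        phi[(i, j)] = vi * vj                         # cross term
    return phi

print(degree2_features({3: 0.5, 7: 2.0}))
```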
High Dimensionality of φ(x) and w
• Many new considerations in large scenarios
• For example, w has O(n^2) components if the degree is 2. In our application, n = 46,155, so w takes 20 GB
• See detailed discussion in Chang et al. (2010)
• A related development is the COFFIN framework by Sonnenburg and Franc (2010)
An NLP Application: Dependency Parsing
Construct dependency graph: a multi-class problem
[Figure: dependency parse of "John hit the ball with a bat ." with POS tags NNP VBD DT NN IN DT NN . and arc labels nsubj, det, dobj, prep, det, pobj, p from ROOT]
Very sparse: n̄ = average # nonzeros per instance
n          Dim. of φ(x)      l          n̄      w's # nonzeros
46,155     1,065,165,090     204,582    13.3    1,438,456
An NLP Application (Cont’d)
                 LIBSVM                   LIBLINEAR
                 RBF         Poly         Linear      Poly
Training time    3h34m53s    3h21m51s     3m36s       3m43s
Parsing speed    0.7x        1x           1652x       103x
UAS              89.92       91.67        89.11       91.71
LAS              88.55       90.60        88.07       90.71
Explicitly using φ(x) instead of kernels
⇒ faster training and testing
Some interesting hashing techniques are used to handle the huge, sparse w
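A generic feature-hashing sketch (in the spirit of Weinberger et al., 2009) that maps an astronomically large feature space into a fixed-size weight table; the exact scheme used for the parser above may differ, so treat this as illustrative only.

```python
import numpy as np

def hashed_vector(feature_value_pairs, table_size=2 ** 22):
    """Hash (feature name, value) pairs into a fixed-size dense vector."""
    v = np.zeros(table_size)
    for name, value in feature_value_pairs:
        h = hash(name)                              # built-in hash; a real system
                                                    # would use a stable hash (e.g., MurmurHash)
        sign = 1.0 if (h >> 1) & 1 == 0 else -1.0   # cheap sign hash to reduce collision bias
        v[h % table_size] += sign * value
    return v                                        # w only needs table_size entries
```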
Approximating Kernels
Following Lee and Wright (2010), we consider two categories
Kernel matrix approximation:
Original matrix Q with Q_ij = y_i y_j K(x_i, x_j)
Consider Q̄ = Φ̄^T Φ̄ ≈ Q
Φ̄ ≡ [x̄_1, ..., x̄_l] becomes the new training data ⇒ trained by a linear classifier
Approximating Kernels (Cont’d)
Φ̄ ∈ R^{d×l}, d ≪ l, i.e., # features ≪ # data
Testing is an issue
Feature mapping approximation
A mapping function φ̄: R^n → R^d such that φ̄(x)^T φ̄(t) ≈ K(x, t)
Testing is straightforward because φ̄(·) is available
Many mappings have been proposed; in particular, hashing
φ̄(·) may be dense or sparse
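One well-known dense mapping of this kind is random Fourier features (Rahimi and Recht, 2007), which approximates the RBF kernel exp(−γ‖x − t‖²); the sketch below is illustrative and not necessarily the mapping used in the works cited here.

```python
import numpy as np

def random_fourier_map(X, d=500, gamma=1.0, seed=0):
    """phi_bar(x)^T phi_bar(t) ~= exp(-gamma * ||x - t||^2) for rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n, d))   # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)                 # random phases
    return np.sqrt(2.0 / d) * np.cos(X.dot(W) + b)            # (l, d) mapped data
```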
Data beyond memory capacity
Data Beyond Memory Capacity
Most existing algorithms assume that data fit in memory
They are slow if data are larger than memory: frequent disk access, so CPU time is no longer the main concern
They cannot be run in distributed environments
Many challenging research issues
When Data Cannot Fit In Memory
LIBLINEAR on a machine with 1 GB memory:
Disk swap causes lengthy training time
Disk-level Data Classification
Data larger than memory but smaller than disk
Design algorithms so that disk access is less frequent
An example (Yu et al., 2010): a decomposition method that loads one block at a time but ensures overall convergence
But loading time becomes a big concern
Reading 1 TB from a hard disk takes a very long time
Distributed Linear Classification
An important advantage: each node loads data from its own disk
Parallel data loading, but how about operations?
Issues
- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern
Distributed Linear Classification (Cont’d)
Simple approaches
Subsampling: use a subset that fits in memory
- Simple and useful in some situations
- In a sense, you do a "reduce" operation to collect data to one computer, and then conduct detailed analysis
Bagging: train on several subsets and ensemble the results
- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)
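A minimal sketch of the bagging/averaging baseline used in the next slide: train one linear model per partition (e.g., per node) and average the weight vectors; `train_binary` again stands for any binary linear solver and is an assumed name.

```python
import numpy as np

def average_models(partitions, train_binary):
    """partitions: list of (X_j, y_j) blocks, one per node."""
    models = [train_binary(Xj, yj) for Xj, yj in partitions]  # embarrassingly parallel
    return np.mean(models, axis=0)                            # averaged weight vector
```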
Distributed Linear Classification (Cont’d)
Some results by averaging models
             yahoo-korea   kddcup10   webspam   epsilon
Using all       87.29        89.89      99.51     89.78
Avg. models     86.08        89.64      98.40     88.83
Using all: solves a single linear SVM
Avg. models: each node solves a linear SVM on a subset
Slightly worse but in general OK
Distributed Linear Classification (Cont’d)
Parallel optimization: many possible approaches
If the method involves matrix-vector products, then such operations can be parallelized
Each iteration involves communication
Also, MapReduce is not very suitable for iterative algorithms (I/O for fault tolerance)
Should have as few iterations as possible
Distributed Linear Classification (Cont’d)
ADMM (Boyd et al., 2011)
min_{w_1,...,w_m, z}  (1/2) z^T z + C Σ_{j=1}^{m} Σ_{i∈B_j} ξ_L1(w_j; x_i, y_i) + (ρ/2) Σ_{j=1}^{m} ‖w_j − z‖^2
subject to  w_j − z = 0, ∀j
Each subproblem is updated independently, but the w_j must be collected
Some have tried MapReduce, but no public implementation yet
Convergence may not be very fast (i.e., need some iterations)
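A consensus-ADMM sketch of the formulation above; to keep the local solves to a few gradient steps it substitutes the differentiable L2 (squared hinge) loss for ξ_L1, so it illustrates the update structure rather than the exact method.

```python
import numpy as np

def admm_consensus(parts, C=1.0, rho=1.0, outer=50, inner=20, lr=1e-3):
    """parts: list of (X_j, y_j) blocks B_j, one per node."""
    m, n = len(parts), parts[0][0].shape[1]
    W = np.zeros((m, n))                    # local models w_j
    V = np.zeros((m, n))                    # scaled dual variables
    z = np.zeros(n)                         # consensus model
    for _ in range(outer):
        for j, (Xj, yj) in enumerate(parts):             # independent per node
            w = W[j]
            for _ in range(inner):                       # approximate local solve
                margin = 1.0 - yj * Xj.dot(w)
                act = margin > 0                         # instances with nonzero loss
                grad = (-2.0 * C * Xj[act].T.dot(yj[act] * margin[act])
                        + rho * (w - z + V[j]))
                w = w - lr * grad
            W[j] = w
        z = rho * (W + V).sum(axis=0) / (1.0 + m * rho)  # z-update (from the z^T z / 2 term)
        V += W - z                                       # dual update; requires collecting w_j
    return z
```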
Distributed Linear Classification (Cont’d)
Vowpal Wabbit (Langford et al., 2007)
After version 6.0, Hadoop support has been provided
L-BFGS (quasi-Newton) algorithms
From John’s talk: 2.1T features, 17B samples, 1K nodes ⇒ 70 minutes
Discussion and conclusions
Related Topics
Structured learning
Instead of y_i ∈ {+1, −1}, y_i becomes a vector
Examples: conditional random fields (CRF) and structured SVM
They are linear classifiers
Regression
Document classification has been widely used, but document regression (e.g., L2-regularized SVR) is less frequently applied
Example: y_i is the CTR and x_i is a web page
L1-regularized least-squares regression is another story ⇒ very popular for compressed sensing
Conclusions
Linear classification is an old topic, but new developments for large-scale applications are interesting
Linear classification works on x rather than φ(x)
Easy and flexible for feature engineering
Linear classification + feature engineering useful for many real applications