# Optimization Methods for Large-scale Linear Classification

### Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at University of Rome “La Sapienza,” June 25, 2013


Part of this talk is based on our recent survey paper in Proceedings of IEEE, 2012

G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent Advances of Large-scale Linear Classification.

It’s also related to our development of the software LIBLINEAR

www.csie.ntu.edu.tw/~cjlin/liblinear

Due to time constraints, we will give overviews instead of deep technical details.


### Outline

Introduction

Binary linear classification

Optimization Methods: Second-order Methods

Optimization Methods: First-order Methods

Experiments

Big-data Machine Learning

Conclusions

## Introduction


### Linear and Nonlinear Classification

Some popular methods such as SVM and logistic regression can be used in two ways:

Kernel methods: data are mapped to another space, x ⇒ φ(x). The inner product φ(x)Tφ(y) is easily calculated, but there is no good control on φ(·).

Linear classification + feature engineering: we have x without mapping. Alternatively, we can say that φ(x) is our x; we have full control on x or φ(x).

We refer to them as nonlinear and linear classifiers; we will focus on linear here.


### Linear and Nonlinear Classification

(Figures: example decision boundaries of a linear and a nonlinear classifier)

By linear we mean data not mapped to a higher dimensional space

Original: [height, weight]

Nonlinear: [height, weight, weight/height2]


### Linear and Nonlinear Classification (Cont’d)

Given training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l , yi = ±1

l : # of data, n: # of features

Linear: find (w, b) such that the decision function is

sgn(wTx + b)

Nonlinear: map data to φ(xi). The decision function becomes

sgn(wTφ(x) + b)

Later, b is omitted
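As a tiny illustration (not code from the talk), the linear decision function is a one-liner in numpy:

```python
import numpy as np

def predict(w, b, X):
    """Linear decision function sgn(w^T x + b), applied to each row of X."""
    return np.sign(X @ w + b)

w, b = np.array([1.0, -1.0]), 0.0
X = np.array([[2.0, 1.0], [0.0, 3.0]])
# predict(w, b, X) -> [ 1., -1.]
```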


### Why Linear Classification?

• If φ(x) is high dimensional, wTφ(x) is expensive

• Kernel methods:

w ≡ Σ_{i=1}^l αi φ(xi) for some α, K(xi, xj) ≡ φ(xi)Tφ(xj)

New decision function: sgn( Σ_{i=1}^l αi K(xi, x) )

• Special φ(x) so that calculating K(xi, xj) is easy

• Example:

K(xi, xj) ≡ (xiTxj + 1)2 = φ(xi)Tφ(xj), φ(x) ∈ R^{O(n2)}
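To make the O(n2)-dimensional mapping concrete, here is a small numpy check (an illustration, not part of the talk) that the degree-2 polynomial kernel equals an inner product under an explicit map φ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial map: phi(x).dot(phi(z)) == (x.dot(z) + 1)**2.
    Its dimension, 1 + 2n + n(n-1)/2, is O(n^2)."""
    n = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
xi, xj = rng.standard_normal(5), rng.standard_normal(5)

lhs = (xi.dot(xj) + 1) ** 2   # kernel evaluation: O(n)
rhs = phi(xi).dot(phi(xj))    # explicit mapping: O(n^2)
assert np.isclose(lhs, rhs)
```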


### Why Linear Classification? (Cont’d)

Prediction:

wTx versus Σ_{i=1}^l αi K(xi, x)

If K(xi, xj) takes O(n), the costs are

O(n) versus O(nl)

Nonlinear: more powerful to separate data

Linear: cheaper and simpler


### Linear is Useful in Some Places

For certain problems, the accuracy of linear is as good as that of nonlinear, but training and testing are much faster

This is especially true for document classification: the number of features (bag-of-words model) is very large, and data are large and sparse

Training millions of data takes just a few seconds

Recently, linear classification has become a popular research topic

## Binary linear classification


### Binary Linear Classification

Training data {yi, xi}, xi ∈ Rn, i = 1, . . . , l, yi = ±1; l: # of data, n: # of features

min_w  wTw/2 + C Σ_{i=1}^l ξ(w; xi, yi)

wTw/2: regularization term

ξ(w; x, y): loss function; we hope y wTx > 0

C: regularization parameter
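As a sketch (not code from the talk), the objective with the hinge loss ξL1 defined on the next slide is:

```python
import numpy as np

def objective(w, X, y, C):
    """f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w.dot(w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, -1.0])
# at w = 0 every hinge term is 1, so f(0) = C * l
# objective(np.zeros(2), X, y, C=2.0) -> 4.0
```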


### Loss Functions

Some commonly used ones:

ξL1(w; x, y) ≡ max(0, 1 − y wTx),  (1)

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2,  (2)

ξLR(w; x, y) ≡ log(1 + e−y wTx).  (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)

Logistic regression (LR): (3)
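For reference, the three losses are one-liners as functions of the margin m = y wTx (a simple illustration, not code from the talk):

```python
import numpy as np

def xi_L1(m):
    return np.maximum(0.0, 1.0 - m)          # (1) hinge loss

def xi_L2(m):
    return np.maximum(0.0, 1.0 - m) ** 2     # (2) squared hinge loss

def xi_LR(m):
    return np.log1p(np.exp(-m))              # (3) logistic loss

m = np.array([-1.0, 0.0, 0.5, 2.0])
# xi_L1(m) -> [2., 1., 0.5, 0.]: zero once the margin exceeds 1
```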


### Loss Functions (Cont’d)

(Figure: ξL1, ξL2, and ξLR plotted as functions of −y wTx)

They are similar in terms of performance


### Loss Functions (Cont’d)

However,

ξL1: not differentiable

ξL2: differentiable but not twice differentiable

ξLR: twice differentiable

Many optimization methods can be used

## Optimization Methods: Second-order Methods


### Truncated Newton Method

Newton direction:

min_s ∇f(wk)Ts + (1/2) sT∇2f(wk)s

This is the same as solving the Newton linear system

∇2f(wk) s = −∇f(wk)

The Hessian matrix ∇2f(wk) is too large to be stored:

∇2f(wk): n × n, n: number of features

For document data, n can be millions or more


### Using Special Properties of Data Classification

But the Hessian has a special form:

∇2f(w) = I + C XTDX, D diagonal

For logistic regression,

Dii = e−yiwTxi / (1 + e−yiwTxi)2

X: the data matrix, # instances × # features

X = [x1, . . . , xl]T


### Using Special Properties of Data Classification (Cont’d)

Use CG (conjugate gradient) to solve the linear system; only Hessian-vector products are needed:

∇2f(w) s = s + C · XT(D(X s))

Therefore, we have a Hessian-free approach.

In Lin et al. (2008), we use the trust-region + CG approach by Steihaug (1983)
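A sketch of the Hessian-vector product in numpy (dense arrays for clarity; LIBLINEAR's actual implementation works on sparse data):

```python
import numpy as np

def hess_vec(s, X, y, w, C):
    """Compute (I + C X^T D X) s for logistic regression without ever
    forming the n x n Hessian; D is kept as a length-l vector."""
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # sigma_i = 1/(1+e^{-y_i w^T x_i})
    D = sigma * (1.0 - sigma)                    # D_ii = e^{-t}/(1+e^{-t})^2
    return s + C * (X.T @ (D * (X @ s)))
```

Plugging `hess_vec` into a conjugate-gradient solver then yields the (truncated) Newton step without storing the Hessian.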


### Training L2-loss SVM

The loss function is differentiable but not twice differentiable

ξL2(w; x, y) ≡ max(0, 1 − y wTx)2

We can use the generalized Hessian (Mangasarian, 2002)

It works well in practice, but there is no theoretical quadratic convergence

## Optimization Methods: First-order Methods


First-order methods are popular in data classification. The reason: there is no need to accurately solve the optimization problem.

We consider L1-loss SVM as an example here, though the same methods may be extended to L2 and logistic loss.


### SVM Dual

From the primal-dual relationship, the dual problem is

min_α f(α)
subject to 0 ≤ αi ≤ C, ∀i,

where

f(α) ≡ (1/2) αTQα − eTα

and

Qij = yiyjxiTxj, e = [1, . . . , 1]T


### Dual Coordinate Descent

Very simple: minimize one variable at a time

While α is not optimal
  For i = 1, . . . , l
    min_{αi} f(. . . , αi, . . .)

A classic optimization technique, traced back to Hildreth (1957) if constraints are not considered


### The Procedure

Given the current α, let ei = [0, . . . , 0, 1, 0, . . . , 0]T with the 1 in position i.

min_d f(α + d ei) = (1/2) Qii d2 + ∇if(α) d + constant

Without constraints, the optimal step is

d = −∇if(α) / Qii

With the constraint 0 ≤ αi + d ≤ C, the update becomes

αi ← min( max( αi − ∇if(α)/Qii, 0 ), C )


### The Procedure (Cont’d)

∇if(α) = (Qα)i − 1 = Σ_{j=1}^l Qij αj − 1 = Σ_{j=1}^l yiyjxiTxj αj − 1

Directly calculating gradients costs O(ln); l: # data, n: # features

For linear SVM, define

u ≡ Σ_{j=1}^l yj αj xj

Then the gradient calculation is easy and costs only O(n):

∇if(α) = yi uTxi − 1


### The Procedure (Cont’d)

All we need is to maintain u:

u = Σ_{j=1}^l yj αj xj

If ᾱi is the old value and αi the new one, then

u ← u + (αi − ᾱi) yi xi

This also costs O(n)


### Algorithm

Given initial α, find u = Σ_i yi αi xi.

While α is not optimal (outer iteration)
  For i = 1, . . . , l (inner iteration)
    (a) ᾱi ← αi
    (b) G = yi uTxi − 1
    (c) If αi can be changed
          αi ← min(max(αi − G/Qii, 0), C)
          u ← u + (αi − ᾱi) yi xi
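The whole algorithm fits in a few lines of numpy. This is a bare-bones sketch (dense data, no shrinking), not LIBLINEAR's production code:

```python
import numpy as np

def dual_cd_l1_svm(X, y, C=1.0, n_outer=50, seed=0):
    """Dual coordinate descent for L1-loss linear SVM.  Maintains
    u = sum_j y_j alpha_j x_j so each gradient costs only O(n)."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)
    Qii = (X ** 2).sum(axis=1)               # diagonal of Q: ||x_i||^2
    rng = np.random.default_rng(seed)
    for _ in range(n_outer):                 # outer iterations
        for i in rng.permutation(l):         # random order of sub-problems
            if Qii[i] == 0.0:
                continue
            G = y[i] * u.dot(X[i]) - 1.0     # single-variable gradient, O(n)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha_old - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]   # O(n) update of u
    return u                                 # u equals the primal w

# toy separable data: the learned w classifies all points correctly
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd_l1_svm(X, y, C=1.0)
```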


### Analysis

Linear convergence, from Luo and Tseng (1992):

f(αk+1) − f(α∗) ≤ µ (f(αk) − f(α∗)), ∀k ≥ k0

α∗: an optimal solution

Recently we proved the result with k0 = 1 (Wang and Lin, 2013)

Difficulty: the objective function is only convex rather than strictly convex


### Careful Implementation

Some techniques can improve the running speed:

Shrinking: remove αi if it is likely to be at a bound (0 or C) at the end. It is easier to conduct shrinking than in the kernel case (details not shown)

Order of sub-problems being minimized: instead of the fixed order

α1 → α2 → · · · → αl

we can use any random permutation at each outer iteration:

απ(1) → απ(2) → · · · → απ(l)

This is very effective in practice


### Difference from the Kernel Case

What if coordinate descent methods are applied to kernel classifiers?

if (α) =

l

X

j =1

yiyjxTi xjαj−1 = (yixi)T

l

X

j =1

yjxjαj−1 but we cannot do this for kernel because

K (xi, xj) = φ(xi)Tφ(xj) is not separated

If using kernel, the cost of calculating ∇if (α) must be O(ln)


### Difference from the Kernel Case (Cont’d)

This difference is similar to our earlier discussion on the prediction cost:

wTx versus Σ_{i=1}^l αi K(xi, x)

O(n) versus O(nl)

However, if the O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

In contrast, the setting using u knows ∇if(α) rather than the whole ∇f(α)


### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use ∇f(α) information to select the variable for update

Recall there are two types of coordinate descent methods:

Gauss-Seidel: sequential selection of variables

Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available


### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel

Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

## Experiments


### Comparisons

L2-loss SVM is used

DCDL2: dual coordinate descent

DCDL2-S: DCDL2 with shrinking

PCD: primal coordinate descent

TRON: trust-region Newton method


### Objective values (Time in Seconds)

(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)


### Analysis

Dual coordinate descent is very effective if both # data and # features are large

Useful for document classification: half a million data in a few seconds

However, it is less effective if

# features is small: one should solve the primal instead; or

the penalty parameter C is large: problems are more ill-conditioned


### An Example When # Features Small

# instances: 32,561, # features: 123

(Figures: objective value and accuracy versus training time)

## Big-data Machine Learning


### Big-data Machine Learning

Data are stored in a distributed environment

This is a new topic, and much research is still going on

You may ask how this differs from distributed optimization

They are related, but now the algorithm must avoid expensive data accesses


### Big-data Machine Learning (Cont’d)

Issues for parallelization:

- Many methods (e.g., stochastic gradient descent or coordinate descent) are inherently sequential
- Communication cost is a concern


### Simple Distributed Linear Classification I

Bagging: train several subsets and ensemble the results

- Useful in distributed environments; each node ⇒ a subset
- Example: Zinkevich et al. (2010)

Some results by averaging models:

| | yahoo-korea | kddcup10 | webspam | epsilon |
|---|---|---|---|---|
| Using all | 87.29 | 89.89 | 99.51 | 89.78 |
| Avg. models | 86.08 | 89.64 | 98.40 | 88.83 |

Using all: solves a single linear SVM


### Simple Distributed Linear Classification II

Avg. models: each node solves a linear SVM on a subset

Slightly worse but in general OK
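A toy simulation of the model-averaging scheme. The local solver here is a simple gradient-descent logistic regression, standing in for the linear SVM solvers used in the experiments above:

```python
import numpy as np

def train_local(X, y, lam=0.01, lr=0.1, steps=500):
    """Local solver run by each node on its own subset: regularized
    logistic regression by gradient descent (a stand-in solver)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))   # P(correct label)
        grad = lam * w - X.T @ (y * (1.0 - p)) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
Xpos = rng.normal(loc=2.0, size=(100, 2))        # two well-separated clouds
Xneg = rng.normal(loc=-2.0, size=(100, 2))
X = np.vstack([Xpos, Xneg])
y = np.concatenate([np.ones(100), -np.ones(100)])

# "distribute" the data over 4 nodes and average the 4 local models
parts = np.array_split(rng.permutation(len(y)), 4)
w_avg = np.mean([train_local(X[p], y[p]) for p in parts], axis=0)
accuracy = np.mean(np.sign(X @ w_avg) == y)
```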


### ADMM by Boyd et al. (2011) I

Recall the SVM problem (bias term b omitted):

min_w  (1/2) wTw + C Σ_{i=1}^l max(0, 1 − yiwTxi)

An equivalent optimization problem:

min_{w1,...,wm,z}  (1/2) zTz + C Σ_{j=1}^m Σ_{i∈Bj} max(0, 1 − yiwjTxi) + (ρ/2) Σ_{j=1}^m ‖wj − z‖2

subject to wj − z = 0, ∀j


### ADMM by Boyd et al. (2011) II

The key is that

z = w1 = · · · = wm are all optimal w

This optimization problem was proposed in the 1970s, but is now applied to distributed machine learning

Each node has a subset Bj and updates wj; only w1, . . . , wm must be collected

Data are not moved, so the communication cost is lower. Still, we cannot afford too many iterations because of the communication cost


### Vowpal Wabbit (Langford et al., 2007) I

It started as a linear classification package on a single computer

After version 6.0, Hadoop support has been provided

A hybrid approach: parallel SGD initially, then switch to LBFGS (quasi-Newton)

They argue that AllReduce is a more suitable operation than MapReduce

What is AllReduce? Every node starts with a value and ends up with the sum over all nodes
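A toy simulation of the AllReduce idea (not VW's actual implementation, which runs over a spanning tree of nodes):

```python
import numpy as np

def allreduce_sum(node_values):
    """Simulate AllReduce: every node starts with its own vector and
    ends up holding the elementwise sum across all nodes."""
    total = np.sum(node_values, axis=0)
    return [total.copy() for _ in node_values]

# e.g. each node computes the gradient on its local data shard;
# AllReduce then gives every node the full-data gradient
local_grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
results = allreduce_sum(local_grads)
# every node now holds [9.0, 12.0]
```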


### Vowpal Wabbit (Langford et al., 2007) II

In Agarwal et al. (2012), the authors argue that many machine learning algorithms can be implemented using AllReduce; LBFGS is an example

They train 17B samples with 16M features on 1K nodes in 70 minutes

## Conclusions


### Conclusions

Linear classification is an old topic, but recently there are new applications and large-scale challenges

The optimization problem can be solved by many existing techniques, but some machine-learning aspects must be considered

In particular, data access may become a bottleneck in large-scale scenarios

Overall, linear classification is still an ongoing and exciting research area
