Large-scale Linear and Kernel Classification
Chih-Jen Lin
Department of Computer Science National Taiwan University
MSR India Summer School 2015 on Machine Learning
Data Classification
Given training data in different classes (labels known), predict the labels of test data (labels unknown)
Classic example: medical diagnosis
Measure a patient's blood pressure, weight, etc.
After several years, we know whether he/she recovered
Build a machine learning model from such records
New patient: measure blood pressure, weight, etc., and predict the outcome
This is the training and testing procedure
Data Classification (Cont’d)
Among many classification methods, linear and kernel are two popular ones
They are very related
We will discuss these two topics in detail in this lecture
Talk slides:
http://www.csie.ntu.edu.tw/~cjlin/talks/msri.pdf
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
1 Linear classification
    Maximum margin
    Regularization and losses
    Other derivations
Outline
1 Linear classification
    Maximum margin
    Regularization and losses
    Other derivations
Linear Classification
Training vectors: xi, i = 1, . . . , l
These are feature vectors. For example, a patient = [height, weight, . . .]T
Consider a simple case with two classes:
Define an indicator vector y ∈ Rl:
yi = 1 if xi is in class 1, −1 if xi is in class 2
Goal: a hyperplane that linearly separates all the data
[Figure: two linearly separable classes (circles and triangles) with the hyperplanes wTx + b = +1, 0, −1]
A separating hyperplane: wTx + b = 0
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
Decision function f (x ) = sgn(wTx + b), x : test data
Many possible choices of w and b
Maximal Margin
Maximizing the distance between wTx + b = 1 and −1:
distance = 2/‖w‖ = 2/√(wTw)
A quadratic programming problem:
min_{w,b}  (1/2) wTw
subject to yi(wTxi + b) ≥ 1, i = 1, . . . , l.
This is the basic formulation of support vector machines (Boser et al., 1992)
Data May Not Be Linearly Separable
An example:
[Figure: circles and triangles interleaved so that no hyperplane separates the two classes]
We can never find a linear hyperplane to separate data
Remedy: allow training errors
Data May Not Be Linearly Separable (Cont’d)
Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)
min_{w,b,ξ}  (1/2) wTw + C Σ_{i=1}^{l} ξi
subject to yi(wTxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , l.
We explain later why this method is called support vector machine
The Bias Term b
Recall the decision function is sgn(wTx + b)
Sometimes the bias term b is omitted, i.e.,
sgn(wTx)
That is, the hyperplane always passes through the origin
This is fine if the number of features is not too small
In our discussion, b is used for kernel but omitted for linear (due to some historical reasons)
Outline
1 Linear classification
    Maximum margin
    Regularization and losses
    Other derivations
Equivalent Optimization Problem
• Recall the SVM optimization problem (without b) is
min_{w,ξ}  (1/2) wTw + C Σ_{i=1}^{l} ξi
subject to yi wTxi ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , l.
• It is equivalent to
min_w  (1/2) wTw + C Σ_{i=1}^{l} max(0, 1 − yi wTxi)    (1)
• This reformulation is useful for subsequent discussion
Equivalent Optimization Problem (Cont’d)
That is, at optimum,
ξi = max(0, 1 − yi wTxi)
Reason: the constraints require ξi ≥ 1 − yi wTxi and ξi ≥ 0, but we also want to minimize ξi
Equivalent Optimization Problem (Cont’d)
We now derive the same optimization problem (1) from a different viewpoint
min_w (training errors)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (xi, yi)
Ideally we should use the 0–1 training loss:
ξ(w; x, y) = 1 if y wTx < 0, and 0 otherwise
Equivalent Optimization Problem (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss as a function of −y wTx]
We need continuous approximations
Common Loss Functions
Hinge loss (L1 loss): ξL1(w; x, y) ≡ max(0, 1 − y wTx)    (2)
Squared hinge loss (L2 loss): ξL2(w; x, y) ≡ max(0, 1 − y wTx)²    (3)
Logistic loss: ξLR(w; x, y) ≡ log(1 + e^{−y wTx})    (4)
SVM uses (2) or (3); logistic regression (LR) uses (4)
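To make the three losses concrete, here is a small illustrative sketch (NumPy code written for this note, not part of the original slides) that evaluates ξL1, ξL2, and ξLR for one instance:

import numpy as np

# Illustrative only: the three losses (2)-(4) for a single instance (x, y)
def hinge_loss(w, x, y):              # xi_L1 in (2)
    return max(0.0, 1.0 - y * np.dot(w, x))

def squared_hinge_loss(w, x, y):      # xi_L2 in (3)
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def logistic_loss(w, x, y):           # xi_LR in (4), log(1 + exp(-y w^T x)) computed stably
    return np.logaddexp(0.0, -y * np.dot(w, x))

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
y = -1
print(hinge_loss(w, x, y), squared_hinge_loss(w, x, y), logistic_loss(w, x, y))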
Common Loss Functions (Cont’d)
[Figure: ξL1, ξL2, and ξLR plotted as functions of −y wTx]
Logistic regression is very related to SVM Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy, but this is useless
When training a model, we should
    avoid underfitting: keep the training error small
    avoid overfitting: keep the testing error small
[Figure: overfitting illustration; one pair of symbols marks training points, another pair marks testing points]
Regularization
In training we manipulate the w vector so that it fits the data
So we need a way to make w ’s values less extreme.
One idea is to make the objective function smoother
General Form of Linear Classification
Training data {(yi, xi)}, xi ∈ Rn, i = 1, . . . , l, yi = ±1
l: # of data, n: # of features
min_w f(w),  f(w) ≡ wTw/2 + C Σ_{i=1}^{l} ξ(w; xi, yi)    (5)
wTw/2: regularization term
ξ(w; x, y): loss function
C: regularization parameter
General Form of Linear Classification (Cont’d)
If the hinge loss
ξL1(w; x, y) ≡ max(0, 1 − y wTx)
is used, then (5) goes back to the SVM problem described earlier (b omitted):
min_{w,ξ}  (1/2) wTw + C Σ_{i=1}^{l} ξi
subject to yi wTxi ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , l.
Solving Optimization Problems
We have an unconstrained problem, so many existing unconstrained optimization techniques can be used
However,
    ξL1: not differentiable
    ξL2: differentiable but not twice differentiable
    ξLR: twice differentiable
We may need different types of optimization methods
Details of solving optimization problems will be discussed later
Outline
1 Linear classification
    Maximum margin
    Regularization and losses
    Other derivations
Logistic Regression
Logistic regression can be traced back to the 19th century
It mainly comes from the statistics community, so many people wrongly think that this method is very different from SVM
Indeed, from what we have shown, the two are very related
Let’s see how to derive it from a statistical viewpoint
Logistic Regression (Cont’d)
For a label-feature pair (y, x), assume the probability model
p(y|x) = 1 / (1 + e^{−y wTx})
Note that
p(1|x) + p(−1|x) = 1/(1 + e^{−wTx}) + 1/(1 + e^{wTx})
                 = e^{wTx}/(1 + e^{wTx}) + 1/(1 + e^{wTx}) = 1
w is the parameter to be decided
Logistic Regression (Cont’d)
Idea of this model:
p(1|x) = 1/(1 + e^{−wTx}) → 1 if wTx ≫ 0, → 0 if wTx ≪ 0
Assume training instances are (yi, xi), i = 1, . . . , l
Logistic Regression (Cont’d)
Logistic regression finds w by maximizing the following likelihood
max_w Π_{i=1}^{l} p(yi|xi)    (6)
Negative log-likelihood:
−log Π_{i=1}^{l} p(yi|xi) = −Σ_{i=1}^{l} log p(yi|xi) = Σ_{i=1}^{l} log(1 + e^{−yi wTxi})
Logistic Regression (Cont’d)
Logistic regression:
min_w Σ_{i=1}^{l} log(1 + e^{−yi wTxi})
Regularized logistic regression:
min_w  (1/2) wTw + C Σ_{i=1}^{l} log(1 + e^{−yi wTxi})    (7)
C: regularization parameter decided by users
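As a concrete illustration of (7), the following sketch (dense NumPy code written for this note, an assumption rather than the LIBLINEAR implementation) evaluates the regularized logistic regression objective and its gradient; these two routines are exactly what a general unconstrained optimizer needs:

import numpy as np

# Sketch only: objective (7) and its gradient for dense X (l x n), y in {+1, -1}^l
def lr_objective(w, X, y, C):
    z = y * (X @ w)                                  # z_i = y_i w^T x_i
    return 0.5 * (w @ w) + C * np.logaddexp(0.0, -z).sum()

def lr_gradient(w, X, y, C):
    z = y * (X @ w)
    coef = -y / (1.0 + np.exp(z))                    # -y_i / (1 + e^{y_i w^T x_i})
    return w + C * (X.T @ coef)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0])
print(lr_objective(np.zeros(3), X, y, C=1.0), lr_gradient(np.zeros(3), X, y, C=1.0))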
Discussion
We see that the same method can be derived from different ways
SVM:
    maximal margin
    regularization and training losses
LR:
    regularization and training losses
    maximum likelihood
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
2 Kernel classification
    Nonlinear mapping
    Kernel tricks
Outline
2 Kernel classification
    Nonlinear mapping
    Kernel tricks
Data May Not Be Linearly Separable
This is an earlier example:
[Figure: the earlier non-separable data set of circles and triangles]
In addition to allowing training errors, what else can we do?
For this data set, shouldn’t we use a nonlinear classifier?
Mapping Data to a Higher Dimensional Space
But modeling nonlinear curves is difficult. Instead, we map data to a higher dimensional space
φ(x) = [φ1(x), φ2(x), . . .]T
For example,
weight / height²
is a useful new feature to check whether a person is overweight or not
Kernel Support Vector Machines
Linear SVM:
min_{w,b,ξ}  (1/2) wTw + C Σ_{i=1}^{l} ξi
subject to yi(wTxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , l.
Kernel SVM:
min_{w,b,ξ}  (1/2) wTw + C Σ_{i=1}^{l} ξi
subject to yi(wTφ(xi) + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , l.
Kernel Logistic Regression
min_{w,b}  (1/2) wTw + C Σ_{i=1}^{l} log(1 + e^{−yi(wTφ(xi)+b)})
Difficulties After Mapping Data to a High-dimensional Space
# variables in w = dimensions of φ(x )
Infinite variables if φ(x ) is infinite dimensional Cannot do an infinite-dimensional inner product for predicting a test instance
sgn(wTφ(x ))
Use kernel trick to go back to a finite number of variables
Outline
2 Kernel classification
    Nonlinear mapping
    Kernel tricks
Kernel Tricks
It can be shown that at optimum,
w = Σ_{i=1}^{l} yi αi φ(xi)
Details are not provided here
With special φ(x), the decision function becomes
sgn(wTφ(x)) = sgn(Σ_{i=1}^{l} yi αi φ(xi)Tφ(x)) = sgn(Σ_{i=1}^{l} yi αi K(xi, x))
Kernel Tricks (Cont’d)
φ(xi)Tφ(xj) needs a closed form
Example: xi ∈ R³, φ(xi) ∈ R¹⁰
φ(xi) = [1, √2 (xi)₁, √2 (xi)₂, √2 (xi)₃, (xi)₁², (xi)₂², (xi)₃², √2 (xi)₁(xi)₂, √2 (xi)₁(xi)₃, √2 (xi)₂(xi)₃]T
Then φ(xi)Tφ(xj) = (1 + xiTxj)².
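A quick numerical check of this identity (illustrative code written for this note, not from the slides):

import numpy as np

# phi: R^3 -> R^10 from above; verify phi(a)^T phi(b) = (1 + a^T b)^2
def phi(x):
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

a = np.array([1.0, 2.0, -1.0])
b = np.array([0.5, -0.3, 2.0])
print(phi(a) @ phi(b), (1.0 + a @ b) ** 2)   # the two numbers agree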
Kernel: K(x, y) = φ(x)Tφ(y); common kernels:
e^{−γ‖xi−xj‖²} (radial basis function, RBF)
(xiTxj/a + b)^d (polynomial kernel)
K(x, y) can be an inner product in an infinite-dimensional space. Assume x ∈ R¹ and γ > 0.
e^{−γ‖xi−xj‖²} = e^{−γ(xi−xj)²} = e^{−γxi² + 2γxixj − γxj²}
  = e^{−γxi²−γxj²} (1 + 2γxixj/1! + (2γxixj)²/2! + (2γxixj)³/3! + · · ·)
  = e^{−γxi²−γxj²} (1·1 + √(2γ/1!) xi · √(2γ/1!) xj + √((2γ)²/2!) xi² · √((2γ)²/2!) xj² + √((2γ)³/3!) xi³ · √((2γ)³/3!) xj³ + · · ·)
  = φ(xi)Tφ(xj),
where
φ(x) = e^{−γx²} [1, √(2γ/1!) x, √((2γ)²/2!) x², √((2γ)³/3!) x³, · · ·]T.
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
3 Linear versus kernel classification
    Comparison on the cost
    Numerical comparisons
Outline
3 Linear versus kernel classification
    Comparison on the cost
    Numerical comparisons
Linear and Kernel Classification
Now we see that methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
x ⇒ φ(x)
φ(xi)Tφ(xj) is easily calculated; little control on φ(·)
Linear classification + feature engineering:
we have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)
Linear and Kernel Classification
The cost of using linear and kernel classification is different
Let's check the prediction cost:
wTx versus Σ_{i=1}^{l} yi αi K(xi, x)
If computing K(xi, xj) takes O(n), then the costs are
O(n) versus O(nl)
Linear is much cheaper
A similar difference occurs for training
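The difference is easy to see in code. In the following sketch (hypothetical sizes; the linear kernel is used so that both computations return the same value), the kernel expansion does O(nl) work while the linear model does O(n):

import numpy as np

rng = np.random.default_rng(0)
l, n = 10000, 300                     # hypothetical numbers of instances and features
X = rng.standard_normal((l, n))       # training data x_1, ..., x_l
alpha_y = rng.standard_normal(l)      # y_i * alpha_i (all instances treated as SVs here)
x = rng.standard_normal(n)            # a test instance

w = X.T @ alpha_y                     # w = sum_i y_i alpha_i x_i (computed once)
linear_score = w @ x                  # prediction with the linear model: O(n)
kernel_score = alpha_y @ (X @ x)      # sum_i y_i alpha_i K(x_i, x) with K(a, b) = a^T b: O(nl)
print(np.isclose(linear_score, kernel_score))   # True: same value, very different cost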
Linear and Kernel Classification (Cont’d)
In fact, linear is a special case of kernel
We can prove that accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)
Therefore, roughly we have
accuracy: kernel ≥ linear
cost: kernel ≫ linear
Speed is the reason to use linear
Linear and Kernel Classification (Cont’d)
For some problems, accuracy by linear is as good as nonlinear
But training and testing are much faster
This particularly happens for document classification:
    the number of features (bag-of-words model) is very large
    the data are very sparse (i.e., few non-zeros)
Outline
3 Linear versus kernel classification
    Comparison on the cost
    Numerical comparisons
Comparison Between Linear and Kernel (Training Time & Testing Accuracy)
             Linear             RBF kernel
Data set     Time   Accuracy    Time       Accuracy
MNIST38      0.1    96.82       38.1       99.70
ijcnn1       1.6    91.81       26.8       98.69
covtype      1.4    76.37       46,695.8   96.11
news20       1.1    96.95       383.2      96.90
real-sim     0.3    97.44       938.3      97.82
yahoo-japan  3.1    92.63       20,955.2   93.31
webspam      25.7   93.35       15,681.8   99.26
Sizes are reasonably large: e.g., yahoo-japan has 140k instances and 830k features
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
4 Solving optimization problems
    Kernel: decomposition methods
    Linear: coordinate descent method
    Linear: second-order methods
    Experiments
Outline
4 Solving optimization problems
    Kernel: decomposition methods
    Linear: coordinate descent method
    Linear: second-order methods
    Experiments
Dual Problem
Recall we said that the difficulty after mapping x to φ(x ) is the huge number of variables
We mentioned
w = Σ_{i=1}^{l} αi yi φ(xi)    (8)
and used kernels for prediction
Besides prediction, we must do training via kernels The most common way to train SVM via kernels is through its dual problem
Dual Problem (Cont’d)
The dual problem:
min_α  (1/2) αTQα − eTα
subject to 0 ≤ αi ≤ C, i = 1, . . . , l, and yTα = 0,
where Qij = yi yj φ(xi)Tφ(xj) and e = [1, . . . , 1]T
From the primal-dual relationship, (8) holds at optimum
The dual problem has a finite number of variables
Example: Primal-dual Relationship
Consider the earlier example:
[Figure: the two one-dimensional training points, located at 0 and 1]
Now two data are x1 = 1, x2 = 0 with y = [+1, −1]T The solution is (w , b) = (2, −1)
Example: Primal-dual Relationship (Cont’d)
The dual objective function:
(1/2) [α1 α2] [[1, 0], [0, 0]] [α1; α2] − [1, 1] [α1; α2]
= (1/2) α1² − (α1 + α2)
In optimization, objective function means the function to be optimized
Constraints are
α1 − α2 = 0, 0 ≤ α1, 0 ≤ α2.
Example: Primal-dual Relationship (Cont’d)
Substituting α2 = α1 into the objective function,
(1/2) α1² − 2α1
has the smallest value at α1 = 2.
Because [2, 2]T satisfies constraints 0 ≤ α1 and 0 ≤ α2, it is optimal
Example: Primal-dual Relationship (Cont’d)
Using the primal-dual relation w = y1α1x1 + y2α2x2
= 1 · 2 · 1 + (−1) · 2 · 0
= 2
This is the same as that by solving the primal problem.
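The toy example can also be checked numerically. The sketch below (illustrative only; it uses SciPy's general-purpose solver rather than an SVM package) solves the dual and recovers (w, b) = (2, −1):

import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 0.0])                       # x1 = 1, x2 = 0
y = np.array([1.0, -1.0])                      # y1 = +1, y2 = -1
Q = np.outer(y, y) * np.outer(x, x)            # Q_ij = y_i y_j x_i x_j

dual_obj = lambda a: 0.5 * a @ Q @ a - a.sum()
res = minimize(dual_obj, x0=np.ones(2),
               bounds=[(0.0, None), (0.0, None)],                        # 0 <= alpha_i (hard margin)
               constraints=({'type': 'eq', 'fun': lambda a: y @ a},))    # y^T alpha = 0
alpha = res.x                                  # approximately [2, 2]
w = np.sum(y * alpha * x)                      # w = 2
b = y[1] - w * x[1]                            # from y2 (w x2 + b) = 1 at a support vector
print(alpha, w, b)                             # w = 2, b = -1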
Decision function
At optimum,
w = Σ_{i=1}^{l} αi yi φ(xi)
Decision function:
wTφ(x) + b = Σ_{i=1}^{l} αi yi φ(xi)Tφ(x) + b = Σ_{i=1}^{l} αi yi K(xi, x) + b
Recall 0 ≤ αi ≤ C in the dual problem
Support Vectors
Only the xi with αi > 0 are used ⇒ support vectors
[Figure: a decision boundary determined by the support vectors]
Large Dense Quadratic Programming
min_α  (1/2) αTQα − eTα
subject to 0 ≤ αi ≤ C, i = 1, . . . , l, and yTα = 0
Qij ≠ 0, so Q is an l by l fully dense matrix
50,000 training points ⇒ 50,000 variables:
(50,000² × 8 / 2) bytes = 10 GB of RAM to store Q
Large Dense Quadratic Programming (Cont’d)
Traditional optimization methods cannot be directly applied here because Q cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are what is used in practice
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N = {1, . . . , l}\B is fixed
Sub-problem at the kth iteration:
min_{αB}  (1/2) [αB^T (αN^k)^T] [[QBB, QBN], [QNB, QNN]] [αB; αN^k] − [eB^T eN^T] [αB; αN^k]
subject to 0 ≤ αt ≤ C, t ∈ B, and yB^T αB = −yN^T αN^k
Avoid Memory Problems
The new objective function:
(1/2) αB^T QBB αB + (−eB + QBN αN^k)^T αB + constant
Only |B| columns of Q are needed
In general |B| ≤ 10 is used. We need |B| ≥ 2 because of the linear constraint
yB^T αB = −yN^T αN^k
Columns are calculated when used: trade time for space
But is such an approach practical?
How Do Decomposition Methods Perform?
Convergence is not very fast. This is expected because only first-order information is used
But there is no need to have a very accurate α:
decision function: Σ_{i=1}^{l} yi αi K(xi, x) + b
Prediction may still be correct with a rough α
Further, in some situations,
# support vectors ≪ # training points
With the initial α¹ = 0, some instances are never used
How Do Decomposition Methods Perform? (Cont'd)
An example of training 50,000 instances using the software LIBSVM (|B| = 2)
$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time: 79.524s
This was done on a typical desktop
Calculating the whole Q takes more time
#SVs = 3,370 ≪ 50,000
A good case where some remain at zero all the time
Outline
4 Solving optimization problems
    Kernel: decomposition methods
    Linear: coordinate descent method
    Linear: second-order methods
    Experiments
Coordinate Descent Methods for Linear Classification
We consider L1-loss SVM as an example here The same method can be extended to L2 and logistic loss
More details in Hsieh et al. (2008); Yu et al. (2011)
SVM Dual (Linear without Kernel)
From the primal-dual relationship:
min_α f(α)
subject to 0 ≤ αi ≤ C, ∀i,
where f(α) ≡ (1/2) αTQα − eTα and
Qij = yi yj xiTxj, e = [1, . . . , 1]T
No linear constraint yTα = 0 because of no bias term b
Dual Coordinate Descent
Very simple: minimize one variable at a time
While α is not optimal
    For i = 1, . . . , l
        min_{αi} f(. . . , αi, . . .)
A classic optimization technique
Traced back to Hildreth (1957) if constraints are not considered
The Procedure
Given the current α. Let ei = [0, . . . , 0, 1, 0, . . . , 0]T.
min_d f(α + d ei) = (1/2) Qii d² + ∇if(α) d + constant
Without constraints, the optimal d = −∇if(α)/Qii
With the constraint 0 ≤ αi + d ≤ C:
αi ← min(max(αi − ∇if(α)/Qii, 0), C)
The Procedure (Cont’d)
∇if(α) = (Qα)i − 1 = Σ_{j=1}^{l} Qij αj − 1 = Σ_{j=1}^{l} yi yj xiTxj αj − 1
Directly calculating this gradient costs O(ln); l: # data, n: # features
For linear SVM, define
u ≡ Σ_{j=1}^{l} yj αj xj
Then the gradient calculation is easy and costs only O(n):
∇if(α) = yi uTxi − 1
The Procedure (Cont’d)
All we need is to maintain u = Σ_{j=1}^{l} yj αj xj
If ᾱi is the old value and αi the new one, then
u ← u + (αi − ᾱi) yi xi
This also costs O(n)
Algorithm: Dual Coordinate Descent
Given an initial α, find u = Σ_i yi αi xi.
While α is not optimal (outer iteration)
    For i = 1, . . . , l (inner iteration)
        (a) ᾱi ← αi
        (b) G = yi uTxi − 1
        (c) If αi can be changed
                αi ← min(max(αi − G/Qii, 0), C)
                u ← u + (αi − ᾱi) yi xi
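The algorithm above translates almost line by line into code. Below is a minimal sketch for dense NumPy data (the actual LIBLINEAR implementation additionally uses random permutation of the order, shrinking, and sparse-data handling):

import numpy as np

def dual_cd(X, y, C, n_outer=20):
    """Dual coordinate descent for the L1-loss linear SVM dual (sketch only)."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                        # u = sum_j y_j alpha_j x_j
    Qii = (X ** 2).sum(axis=1)             # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(n_outer):               # outer iterations
        for i in range(l):                 # inner iterations
            if Qii[i] == 0.0:
                continue                   # skip all-zero instances for simplicity
            G = y[i] * (u @ X[i]) - 1.0    # gradient at coordinate i, O(n)
            alpha_old = alpha[i]
            alpha[i] = min(max(alpha[i] - G / Qii[i], 0.0), C)
            u += (alpha[i] - alpha_old) * y[i] * X[i]   # maintain u, O(n)
    return u, alpha                        # u is the primal w

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))
w, alpha = dual_cd(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w) == y))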
Difference from the Kernel Case
• We have seen that coordinate descent is also the main method to train kernel classifiers
• Recall that the i-th element of the gradient costs O(n) by
∇if(α) = Σ_{j=1}^{l} yi yj xiTxj αj − 1 = (yi xi)T (Σ_{j=1}^{l} yj xj αj) − 1 = (yi xi)Tu − 1
but we cannot do this for kernel because
K(xi, xj) = φ(xi)Tφ(xj) cannot be separated
Difference from the Kernel Case (Cont’d)
If using kernel, the cost of calculating ∇if (α) must be O(ln)
However, if O(ln) cost is spent, the whole ∇f (α) can be maintained (details not shown here)
In contrast, the setting of using u knows ∇if (α) rather than the whole ∇f (α)
Difference from the Kernel Case (Cont’d)
In existing coordinate descent methods for kernel classifiers, people also use ∇f(α) information to select variables (i.e., to select the set B) for update
In optimization there are two types of coordinate descent methods:
    sequential or random selection of variables
    greedy selection of variables
To do greedy selection, usually the whole gradient must be available
Difference from the Kernel Case (Cont’d)
Existing coordinate descent methods for linear ⇒ related to sequential or random selection
Existing coordinate descent methods for kernel ⇒ related to greedy selection
Bias Term b and Linear Constraint in Dual
In our discussion, b is used for kernel but not for linear, mainly for historical reasons
For kernel SVM, we can also omit b to get rid of the linear constraint yTα = 0
Then for kernel decomposition method, |B| = 1 can also be possible
Outline
4 Solving optimization problems
    Kernel: decomposition methods
    Linear: coordinate descent method
    Linear: second-order methods
    Experiments
Optimization for Linear and Kernel Cases
Recall that
w = Σ_{i=1}^{l} yi αi φ(xi)
Kernel: can only solve an optimization problem in α
Linear: can solve in either w or α
We will show an example to minimize over w
Newton Method
Let's minimize a twice-differentiable function:
min_w f(w)
For example, logistic regression has
min_w  (1/2) wTw + C Σ_{i=1}^{l} log(1 + e^{−yi wTxi})
Newton direction at iterate w^k:
min_s ∇f(w^k)Ts + (1/2) sT∇²f(w^k)s
Truncated Newton Method
The above sub-problem is equivalent to solving the Newton linear system
∇²f(w^k) s = −∇f(w^k)
Approximately solving the linear system ⇒ truncated Newton
However, the Hessian matrix ∇²f(w^k) is too large to be stored:
∇²f(w^k) is n × n, where n is the number of features
For document data, n can be millions or more
Using Special Properties of Data Classification
But the Hessian has a special form:
∇²f(w) = I + C XTDX, where D is diagonal
For logistic regression,
Dii = e^{−yi wTxi} / (1 + e^{−yi wTxi})²
X: the data matrix, # instances × # features, X = [x1, . . . , xl]T
Using Special Properties of Data Classification (Cont’d)
Use conjugate gradient (CG) to solve the linear system
CG is an iterative procedure. Each CG step mainly needs one Hessian-vector product
∇²f(w) s = s + C · XT(D(X s))
Therefore, we have a Hessian-free approach
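To make the idea concrete, here is a rough sketch of one truncated Newton step for regularized logistic regression using only Hessian-vector products (illustrative code written for this note; TRON in LIBLINEAR additionally uses a trust region and careful stopping rules):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def truncated_newton_step(w, X, y, C):
    z = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(z))               # = e^{-z_i} / (1 + e^{-z_i})
    grad = w + C * (X.T @ (-y * sigma))           # gradient of f(w)
    D = sigma * (1.0 - sigma)                     # D_ii = e^{-z_i} / (1 + e^{-z_i})^2
    hess_vec = lambda s: s + C * (X.T @ (D * (X @ s)))   # s + C X^T (D (X s))
    H = LinearOperator((w.size, w.size), matvec=hess_vec)
    s, _ = cg(H, -grad, maxiter=50)               # approximately solve the Newton system
    return w + s

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = np.sign(X @ rng.standard_normal(20))
w = np.zeros(20)
for _ in range(5):                                # outer Newton iterations
    w = truncated_newton_step(w, X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w) == y))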
Using Special Properties of Data Classification (Cont’d)
Now the procedure has two layers of iterations
Outer: Newton iterations
Inner: CG iterations within each Newton iteration
Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005) and Lin et al. (2008)
Second-order information used: faster convergence than first-order methods
Outline
4 Solving optimization problems
    Kernel: decomposition methods
    Linear: coordinate descent method
    Linear: second-order methods
    Experiments
Comparisons
L2-loss SVM is used
DCDL2: Dual coordinate descent (Hsieh et al., 2008)
DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)
PCD: Primal coordinate descent (Chang et al., 2008)
TRON: Trust region Newton method (Lin et al., 2008)
Objective values (Time in Seconds)
[Figure: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea]
Analysis
Dual coordinate descent is very effective if # data and # features are both large
Useful for document classification: half a million instances can be trained in a few seconds
However, it is less effective if
    # features is small: one should solve the primal instead; or
    the penalty parameter C is large: problems are more ill-conditioned
An Example When # Features Small
# instances: 32,561, # features: 123
[Figure: objective value and accuracy versus training time]
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
5 Big-data linear classification
    Multi-core linear classification
    Distributed linear classification
Big-data Linear Classification
Parallelization in a shared-memory system: use the power of a multi-core CPU if data fit in memory
Distributed linear classification: if data cannot be stored on one computer
Example: we can parallelize the 2nd-order method (i.e., the Newton method) discussed earlier.
Recall the bottleneck is the Hessian-vector product
∇²f(w) s = s + C · XT(D(X s))
See the analysis in the next slide
Matrix-vector Multiplications
Two sets:
Data set   l         n            # nonzeros
epsilon    400,000   2,000        800,000,000
webspam    350,000   16,609,143   1,304,697,446
Matrix-vector multiplications occupy the majority of the running time
Data set   fraction of time in matrix-vector products
epsilon    99.88%
webspam    97.95%
These numbers are for the Newton method using one core
We should parallelize matrix-vector multiplications
Outline
5 Big-data linear classification Multi-core linear classification Distributed linear classification
Parallelization by OpenMP
The Hessian-vector product can be computed as
XTDX s = Σ_{i=1}^{l} xi Dii (xiTs)
We can easily parallelize this loop with OpenMP
Speedups are shown below; details in Lee et al. (2015)
[Figure: speedup of the parallelized matrix-vector multiplications on epsilon and webspam]
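The decomposition behind this parallelization can be illustrated with a small sketch (plain NumPy standing in for the OpenMP threads, written for this note): each block of rows contributes an independent partial sum.

import numpy as np

rng = np.random.default_rng(0)
l, n, p = 1000, 50, 4                        # instances, features, number of "threads"
X = rng.standard_normal((l, n))
D = rng.random(l)                            # diagonal entries D_ii
s = rng.standard_normal(n)

exact = X.T @ (D * (X @ s))                  # single-core X^T D X s

blocks = np.array_split(np.arange(l), p)     # each thread owns a block of rows
partials = [X[idx].T @ (D[idx] * (X[idx] @ s)) for idx in blocks]
print(np.allclose(exact, np.sum(partials, axis=0)))   # True: the partial sums add up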
Outline
5 Big-data linear classification
    Multi-core linear classification
    Distributed linear classification
Parallel Hessian-vector Product
Now data matrix X is distributedly stored
[Figure: X split row-wise into blocks X1, X2, . . . , Xp, stored on nodes 1, . . . , p]
XTDX s = X1TD1X1s + · · · + XpTDpXps
Parallel Hessian-vector Product (Cont’d)
We use allreduce to let every node get XTDX s
[Figure: each node k computes XkTDkXks from its local data and s; an allreduce sums the partial results so that every node obtains XTDX s]
Allreduce: reducing all vectors (XiTDiXis, ∀i) to a single vector (XTDX s ∈ Rn) and then sending the result to every node
Instance-wise and Feature-wise Data Splits
[Figure: instance-wise split (row blocks Xiw,1, Xiw,2, Xiw,3) versus feature-wise split (column blocks Xfw,1, Xfw,2, Xfw,3)]
Feature-wise: each machine calculates part of the Hessian-vector product
(∇²f(w) s)fw,1 = s1 + C Xfw,1T D(Xfw,1 s1 + · · · + Xfw,p sp)
Instance-wise and Feature-wise Data Splits (Cont’d)
Xfw,1s1 + · · · + Xfw,psp ∈ Rl must be available on all nodes (by allreduce)
Data moved per Hessian-vector product:
    instance-wise: O(n)
    feature-wise: O(l)
Experiments
We compare
TRON: Newton method
ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)
Vowpal Wabbit (Langford et al., 2007)
TRON and ADMM are implemented with MPI
Details in Zhuang et al. (2015)
Experiments (Cont’d)
[Figure: relative function value difference versus training time (seconds) for VW, ADMM-FW, ADMM-IW, TRON-FW, and TRON-IW on epsilon and webspam]
32 machines are used
Horizontal line: where the test accuracy has stabilized
Instance-wise and feature-wise splits are useful for l ≫ n and l ≪ n, respectively
Outline
1 Linear classification
2 Kernel classification
3 Linear versus kernel classification
4 Solving optimization problems
5 Big-data linear classification
6 Discussion and conclusions
Outline
6 Discussion and conclusions
    Some resources
    Conclusions
Outline
6 Discussion and conclusions
    Some resources
    Conclusions
Software
• Most materials in this talk are based on our experiences in developing two popular software packages
• Kernel: LIBSVM (Chang and Lin, 2011)
http://www.csie.ntu.edu.tw/~cjlin/libsvm
• Linear: LIBLINEAR (Fan et al., 2008)
http://www.csie.ntu.edu.tw/~cjlin/liblinear
See also a survey on linear classification in Yuan et al. (2012)
Distributed LIBLINEAR
An extension of the software LIBLINEAR
See http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear
We support both MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)
The development is still in an early stage.
Outline
6 Discussion and conclusions
    Some resources
    Conclusions
Conclusions
Linear and kernel classification are old topics
However, novel techniques are still being developed to handle large-scale data and new applications
You are welcome to join this interesting research area
References I
B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.
C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:
79–85, 1957.
References II
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.
T. Joachims. Making large-scale SVM learning practical. In B. Sch¨olkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.
S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.
J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007.
https://github.com/JohnLangford/vowpal_wabbit/wiki.
M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. Technical report, National Taiwan University, 2015.
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.