Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Summer School on Optimization, Big Data and Applications, July, 2017
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
tw/~cjlin/talks/italy_optml.pdf
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
What is Machine Learning?
Extract knowledge from data and make predictions
Representative tasks: classification, clustering, and others
[Figures: examples of classification and clustering]
We will focus on classification
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Classic example
1. Find a patient's blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction
Two main stages: training and testing
Why Is Optimization Used?
Usually the goal of classification is to minimize the test error
Therefore, many classification methods solve optimization problems
Optimization and Machine Learning
Standard optimization packages may be directly applied to machine learning applications
However, efficiency and scalability are issues
Very often machine learning knowledge must be considered in designing suitable optimization methods
Optimization and Machine Learning (Cont’d)
Sometimes optimization researchers fail to make real impact because they do not know the differences between the two areas
I like to talk about the connection between these two areas because I was trained as an optimization researcher but now work on machine learning
Optimization and Machine Learning (Cont’d)
In this talk, I will discuss some lessons learned in developing two popular packages
LIBSVM (2000–now): a library for support vector machines
LIBLINEAR (2007–now): a library for large linear classification
Let me shamelessly say a bit about how they have been widely used
Optimization and Machine Learning (Cont’d)
LIBSVM is probably the most widely used SVM package. Its implementation paper has been cited more than 32,000 times (Google Scholar, 5/2017)
LIBLINEAR is widely used in Internet companies for large-scale linear classification
Minimizing Training Errors
Basically, a classification method starts by minimizing the training errors:
min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a support vector machine, a neural network, or other types
Minimizing Training Errors (Cont’d)
We consider the model to be a vector w
That is, the decision function is
sgn(w^T x)
For any data x, the predicted label is
1 if w^T x ≥ 0, −1 otherwise
Minimizing Training Errors (Cont’d)
The two-dimensional situation
◦ ◦
◦
◦ ◦
◦
◦◦
4 4 4 4
4 4
4
wTx = 0
This seems quite restrictive, but in practice x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:
ξ(w; x, y) = 1 if y w^T x < 0, 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss as a function of −y w^T x]
We need continuous approximations
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),      (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²,     (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).    (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)
Logistic regression (LR): (3)
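To make (1)-(3) concrete, here is a small Python/numpy sketch (my own illustration; the function names are not from any package) that evaluates the three losses for one instance:

import numpy as np

def xi_L1(w, x, y):            # (1) hinge loss
    return max(0.0, 1.0 - y * np.dot(w, x))

def xi_L2(w, x, y):            # (2) squared hinge loss
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def xi_LR(w, x, y):            # (3) logistic loss
    return np.log1p(np.exp(-y * np.dot(w, x)))

w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1
print(xi_L1(w, x, y), xi_L2(w, x, y), xi_LR(w, x, y))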
Loss Functions (Cont’d)
[Figure: ξ_L1, ξ_L2, and ξ_LR as functions of −y w^T x]
Their performance is usually similar
Loss Functions (Cont’d)
These loss functions have different degrees of differentiability
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure: training and testing points illustrating overfitting]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w's values less extreme.
One idea is to make w's values closer to zero
We can add, for example,
w^T w / 2  or  ‖w‖_1
to the function that is minimized
Regularized Linear Classification
Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1
l: # of data, n: # of features
min_w f(w),  f(w) ≡ w^T w / 2 + C Σ_{i=1}^l ξ(w; x_i, y_i)
w^T w / 2: regularization term (we have no time to talk about L1 regularization here)
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
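As a minimal sketch (my own Python/numpy code, not LIBLINEAR), f(w) with the L1 loss can be evaluated as follows, where the rows of X are the x_i:

import numpy as np

def f(w, X, y, C):
    # f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)
    losses = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * np.sum(losses)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
print(f(np.zeros(2), X, y, C=1.0))   # with w = 0, f = C * l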
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Kernel Methods
Kernel methods are a class of classification
techniques where major operations are conducted by kernel evaluations
A representative example is support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995)
Support Vector Classification
• Training data (xi, yi), i = 1, . . . , l , xi ∈ Rn, yi = ±1
• Minimizing training losses with regularization
min_{w,b}  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i (w^T φ(x_i) + b), 0)
• Note that here we add a bias term b so the decision function is
sgn(w^T φ(x) + b)
Support Vector Classification (Cont’d)
• Then the hyperplane does not pass through 0
• If n (# of features) is small, b may be important.
Otherwise, it’s not
• There are also historical reasons
• In our discussion we sometimes include b but sometimes do not
Mapping Data to a Higher Dimensional Space
To better separate the data, we map data to a higher dimensional space
φ(x) = [φ_1(x), φ_2(x), ...]^T. For example,
weight / height²
is a useful new feature to check whether a person is overweight or not
Difficulties After Mapping Data to a High-dimensional Space
# of variables in w = dimension of φ(x)
Infinitely many variables if φ(x) is infinite dimensional
We cannot do an infinite-dimensional inner product to predict a test instance:
sgn(w^T φ(x) + b)
Use the kernel trick to go back to a finite number of variables
Support Vector Classification (Cont’d)
The dual problem (with a finite number of variables):
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
            y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum
w = Σ_{i=1}^l α_i y_i φ(x_i)
Kernel: K(x_i, x_j) ≡ φ(x_i)^T φ(x_j); closed form
Example: Gaussian (RBF) kernel: e^{−γ‖x_i − x_j‖²}
Kernel Tricks
• It can be shown that at optimum, w is a linear combination of training data
w = Σ_{i=1}^l y_i α_i φ(x_i)
Proofs not provided here.
• Special φ(x) such that the decision function becomes
sgn(w^T φ(x) + b) = sgn(Σ_{i=1}^l y_i α_i φ(x_i)^T φ(x) + b)
                  = sgn(Σ_{i=1}^l y_i α_i K(x_i, x) + b)
Kernel Tricks (Cont’d)
φ(x_i)^T φ(x_j) needs a closed form
Example: x_i ∈ R³, φ(x_i) ∈ R¹⁰
φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1², (x_i)_2², (x_i)_3², √2 (x_i)_1(x_i)_2, √2 (x_i)_1(x_i)_3, √2 (x_i)_2(x_i)_3]^T
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)².
Kernel: K(x, y) = φ(x)^T φ(y); common kernels:
e^{−γ‖x_i − x_j‖²}  (Radial Basis Function)
(x_i^T x_j / a + b)^d  (Polynomial kernel)
K(x, y) can be an inner product in an infinite dimensional space. Assume x ∈ R¹ and γ > 0.
e^{−γ‖x_i − x_j‖²} = e^{−γ(x_i − x_j)²} = e^{−γx_i² + 2γx_i x_j − γx_j²}
= e^{−γx_i² − γx_j²} (1 + 2γx_i x_j/1! + (2γx_i x_j)²/2! + (2γx_i x_j)³/3! + ···)
= e^{−γx_i² − γx_j²} (1·1 + √(2γ/1!) x_i · √(2γ/1!) x_j + √((2γ)²/2!) x_i² · √((2γ)²/2!) x_j² + √((2γ)³/3!) x_i³ · √((2γ)³/3!) x_j³ + ···)
= φ(x_i)^T φ(x_j),
where
φ(x) = e^{−γx²} [1, √(2γ/1!) x, √((2γ)²/2!) x², √((2γ)³/3!) x³, ···]^T.
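The two identities above can be checked numerically; the following is a rough sketch of my own (Python/numpy assumed), verifying the degree-2 polynomial mapping and a truncated version of the one-dimensional Gaussian-kernel expansion:

import numpy as np
from math import factorial

# Degree-2 polynomial kernel: phi(x)^T phi(z) = (1 + x^T z)^2 for x, z in R^3
def phi_poly2(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s*x[0], s*x[1], s*x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s*x[0]*x[1], s*x[0]*x[2], s*x[1]*x[2]])

x, z = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print(phi_poly2(x) @ phi_poly2(z), (1.0 + x @ z) ** 2)        # same value

# 1-D Gaussian kernel via the (truncated) feature expansion phi(x)
def phi_rbf(x, gamma, terms=30):
    coef = np.array([np.sqrt((2.0 * gamma) ** k / factorial(k)) for k in range(terms)])
    powers = np.array([x ** k for k in range(terms)])
    return np.exp(-gamma * x ** 2) * coef * powers

gamma, xi_, xj_ = 0.5, 0.7, -0.3
print(phi_rbf(xi_, gamma) @ phi_rbf(xj_, gamma), np.exp(-gamma * (xi_ - xj_) ** 2))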
We don’t discuss the primal-dual relationship here In this lecture we focus more on optimization algorithms rather than optimization theory
Support Vector Classification (Cont’d)
Only the x_i with α_i > 0 are used ⇒ support vectors
[Figure: SVM decision boundary and support vectors]
Large Dense Quadratic Programming
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
            y^T α = 0
Q_ij ≠ 0; Q: an l by l fully dense matrix
50,000 training points: 50,000 variables
(50,000² × 8 / 2) bytes = 10GB RAM to store Q
Large Dense Quadratic Programming (Cont’d)
Traditional optimization methods cannot be directly applied here because Q cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are commonly used in practice
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N ≡ {1, ..., l} \ B is fixed
Let the objective function be
f(α) = (1/2) α^T Q α − e^T α
Decomposition Methods (Cont’d)
Sub-problem on the variables d_B:
min_{d_B}  f([α_B; α_N] + [d_B; 0])
subject to  −α_i ≤ d_i ≤ C − α_i, ∀i ∈ B,
            d_i = 0, ∀i ∉ B,
            y_B^T d_B = 0
The objective function of the sub-problem:
f([α_B; α_N] + [d_B; 0]) = (1/2) d_B^T Q_BB d_B + ∇_B f(α)^T d_B + constant.
Avoid Memory Problems
Q_BB is a sub-matrix of Q:
Q = [Q_BB  Q_BN; Q_NB  Q_NN]
Note that
∇f(α) = Qα − e,  ∇_B f(α) = Q_{B,:} α − e_B
Avoid Memory Problems (Cont’d)
Only |B| columns of Q are needed
In general |B| ≤ 10 is used. We need |B| ≥ 2 because of the linear constraint
y_B^T d_B = 0
Calculated when used: trade time for space
But is such an approach practical?
Algorithm of Decomposition Methods
While α is not optimal
    Select a working set B
    Solve the sub-problem for d_B
    α_B ← α_B + d_B
We will talk about the selection of B later
How Do Decomposition Methods Perform?
Convergence is not very fast. This is expected because only first-order information is used
But there is no need for a very accurate α. The decision function is
sgn(w^T φ(x) + b) = sgn(Σ_{i=1}^l α_i y_i K(x_i, x) + b)
Prediction may still be correct with a rough α
Further, in some situations,
# support vectors ≪ # training points
With the initial α¹ = 0, some instances are never used
How Do Decomposition Methods Perform? (Cont'd)
An example of training 50,000 instances using the software LIBSVM
$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s
This was done on a typical desktop
Calculating the whole Q takes more time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
How Do Decomposition Methods Perform? (Cont'd)
Because many α_i = 0 in the end, we can develop shrinking techniques
Variables are removed during the optimization procedure. Smaller problems are solved
For example, if α_i has stayed at 0 for more than 100 iterations, we can probably remove it
More advanced techniques are possible
Of course, in the end we must check all variables again for the stopping condition
Machine Learning Properties are Useful in Designing Optimization Algorithms
We have seen that special properties of SVM contribute to the viability of decomposition methods
For machine learning applications, there is no need to accurately solve the optimization problem
Because some optimal α_i = 0, decomposition methods may not need to update all the variables
Also, we can use shrinking techniques to reduce the problem size during decomposition methods
Issues Related to Optimization
Working set selection
Asymptotic convergence
Finite termination & stopping conditions
Convergence rate
Numerical issues
Working Set Selection
In general we don’t choose a large |B| because of the following reasons
1 The sub-problem becomes expensive: O(|B|³)
2 # iterations may not be significantly decreased
Currently a popular setting is to choose |B| = 2:
B = {i, j}
This is called SMO (Sequential Minimal Optimization)
Working Set Selection (Cont’d)
• One idea is to use gradient information
• If
α_i < C, y_i = 1, and −∇_i f(α) > 0,
then we can enlarge α_i
• Therefore, one possibility is
i ∈ arg max_t {−y_t ∇_t f(α) | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
j ∈ arg min_t {−y_t ∇_t f(α) | α_t < C, y_t = −1 or α_t > 0, y_t = 1}
• They somewhat correspond to the maximal violation of the optimality condition
Working Set Selection (Cont’d)
This setting was first used in Joachims (1998)
Can we use a more sophisticated one?
It's difficult because cost is a concern
For example, if we check the second-order information (e.g., objective-value reduction) of all {i, j} pairs, then O(l²) cost is needed
Working Set Selection (Cont’d)
Currently in LIBSVM, we use
i ∈ arg max_t {−y_t ∇_t f(α) | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
and select j by second-order information (Fan et al., 2005)
The cost is still O(l)
Complexity of Decomposition Methods
Let’s describe the algorithm again While α is not optimal
Select a working set B
Solve the sub-problem of dB αB ← αB + dB
Complexity of Decomposition Methods (Cont’d)
To construct the sub-problem,
∇_B f(α) = Q_{B,:} α − e_B
needs
O(|B| × l × n) if each kernel evaluation costs O(n)
But for the working set selection, we need the whole
∇f (α)
The cost can be as large as O(l × l × n)
Complexity of Decomposition Methods (Cont’d)
Fortunately, we can use
∇f(α + [d_B; 0]) = Qα − e + Q [d_B; 0] = ∇f(α) + Q_{:,B} d_B
Note that Q_{:,B} is already available from constructing the sub-problem
Therefore, the cost of decomposition methods for kernel SVM is
O(|B| × l × n) × # iterations
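Putting these pieces together, below is a rough Python/numpy sketch of the |B| = 2 decomposition (SMO) procedure: maximal-violating-pair selection, the analytic two-variable sub-problem, and gradient maintenance with only two kernel columns per iteration. This is my own simplified illustration (no shrinking, no kernel caching), not the LIBSVM code.

import numpy as np

def Q_col(X, y, i, gamma):
    # i-th column of Q: Q[t, i] = y_t y_i K(x_t, x_i), computed on demand (Q is never stored)
    diff = X - X[i]
    return y * y[i] * np.exp(-gamma * np.sum(diff * diff, axis=1))

def smo(X, y, C=1.0, gamma=0.5, eps=1e-3, max_iter=10000):
    l = len(y)
    alpha = np.zeros(l)
    grad = -np.ones(l)                       # grad f(alpha) = Q alpha - e, with alpha = 0
    for _ in range(max_iter):
        up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))
        low = ((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0))
        vals = -y * grad
        i = np.where(up)[0][np.argmax(vals[up])]
        j = np.where(low)[0][np.argmin(vals[low])]
        if vals[i] - vals[j] < eps:          # maximal violation below tolerance: stop
            break
        Qi, Qj = Q_col(X, y, i, gamma), Q_col(X, y, j, gamma)
        a = Qi[i] + Qj[j] - 2.0 * y[i] * y[j] * Qi[j]
        b = y[i] * grad[i] - y[j] * grad[j]
        t = -b / max(a, 1e-12)               # unconstrained minimizer along the feasible direction
        lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] == 1 else (alpha[i] - C, alpha[i])
        lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] == 1 else (-alpha[j], C - alpha[j])
        t = min(max(t, max(lo_i, lo_j)), min(hi_i, hi_j))   # clip to the box constraints
        d_i, d_j = y[i] * t, -y[j] * t       # direction satisfying y_i d_i + y_j d_j = 0
        grad += Qi * d_i + Qj * d_j          # O(l) gradient maintenance
        alpha[i] += d_i
        alpha[j] += d_j
    return alpha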
Differences between Optimization and Machine Learning
The two topics may have different focuses. We give the following example
The decomposition method we just discussed converges more slowly when C is large
Using C = 1 on a data set:
# iterations: 508
Using C = 5,000:
# iterations: 35,241
Optimization researchers may rush to solve difficult cases of large C
That’s what I did before
It turns out that large C is less used than small C
Recall that SVM solves
(1/2) w^T w + C (sum of training losses)
A large C means we overfit the training data
This does not give good test accuracy
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Linear and Kernel Classification
We have
Kernel ⇒ map data to a higher dimensional space
Linear ⇒ use the original data; φ(x) = x
Intuitively, kernel should give better accuracy than linear
There are even theoretical proofs. Roughly speaking, from the Taylor expansion of the Gaussian (RBF) kernel
e^{−γ‖x_i − x_j‖²},
linear SVM is a special case of RBF-kernel SVM (Keerthi and Lin, 2003)
Linear and Kernel Classification
Optimization people may think there is no need to specially consider linear SVM
That is, the same optimization algorithms are enough for both the linear and kernel cases
However, this is wrong if we consider their practical use
Linear and Kernel Classification (Cont’d)
Classification methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
x ⇒ φ(x )
φ(x_i)^T φ(x_j) is easily calculated; little control on φ(·)
Feature engineering + linear classification:
We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)
Linear and Kernel Classification (Cont’d)
For some problems, the accuracy of linear is as good as that of nonlinear
But training and testing are much faster
This particularly happens for document classification
Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)
Comparison Between Linear and Kernel (Training Time & Test Accuracy)
                              Linear            RBF kernel
Data set             #data   #features    Time    Acc.      Time      Acc.
MNIST38             11,982         752     0.1   96.82      38.1     99.70
ijcnn1              49,990          22     1.6   91.81      26.8     98.69
covtype multiclass 464,810          54     1.4   76.37  46,695.8     96.11
news20              15,997   1,355,191     1.1   96.95     383.2     96.90
real-sim            57,848      20,958     0.3   97.44     938.3     97.82
yahoo-japan        140,963     832,026     3.1   92.63  20,955.2     93.31
webspam            280,000         254    25.7   93.35  15,681.8     99.26
Therefore, there is a need to develop optimization methods for large linear classification
Why Is Linear Faster in Training and Prediction?
The reason is that
for linear, x_i is available, but
for kernel, φ(x_i) is not
To illustrate this point, let's modify the kernel decomposition methods discussed earlier for the linear case
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Decomposition Method for Linear Classification
• Recall for kernels, we solve the sub-problem:
min_{d_B}  (1/2) [α_B + d_B; α_N]^T [Q_BB  Q_BN; Q_NB  Q_NN] [α_B + d_B; α_N] − [e_B; e_N]^T [α_B + d_B; α_N]
subject to  0 ≤ α_t ≤ C, t ∈ B,
            y_B^T (α_B + d_B) = −y_N^T α_N
• The objective function over d_B is
(1/2) d_B^T Q_BB d_B + (−e_B + Q_{B,:} α)^T d_B
Decomposition Method for Linear Classification (Cont’d)
We need
Q_{B,:} α = [Q_BB  Q_BN] α
The cost is O(|B| × l × n) because for i ∈ B,
Q_{i,:} α = Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j
n: # features, l: # data
Decomposition Method for Linear Classification (Cont’d)
In the linear case,
K(x_i, x_j) = x_i^T x_j
⇒ Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j = y_i x_i^T Σ_{j=1}^l y_j x_j α_j
Assume
u ≡ Σ_{j=1}^l y_j α_j x_j        (4)
is available
Decomposition Method for Linear Classification (Cont’d)
The cost is significantly reduced:
O(|B| × l × n) ⇒ O(|B| × n)
The main difference is that for kernel,
Σ_{j=1}^l y_j α_j φ(x_j)
cannot be written down. But for linear we can!
Decomposition Method for Linear Classification (Cont’d)
All we need is to maintain u:
u = Σ_{j=1}^l y_j α_j x_j
Then the following update rule can be applied:
u ← u + Σ_{i ∈ B} d_i y_i x_i.
The cost is also O(|B| × n)
Decomposition Method for Linear Classification (Cont’d)
Note that eventually
u → primal optimal w
One-variable Procedure
This is what people use now in practice
Let’s consider the optimization problem without the bias term
min_w  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i w^T x_i, 0)
The dual problem now doesn't have a linear constraint:
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l
One-variable Procedure (Cont’d)
Here
f(α) ≡ (1/2) α^T Q α − e^T α
and
Q_ij = y_i y_j x_i^T x_j,  e = [1, ..., 1]^T
Without the linear constraint y^T α = 0, we can choose one variable at a time
One-variable Procedure (Cont’d)
Given the current α, let the working set be B = {i}. Let
e_i = [0, ..., 0, 1, 0, ..., 0]^T
The sub-problem is
min_d  f(α + d e_i) = (1/2) Q_ii d² + ∇_i f(α) d + constant
subject to  0 ≤ α_i + d ≤ C
Without constraints,
optimal d = −∇_i f(α) / Q_ii
One-variable Procedure (Cont’d)
Now with 0 ≤ α_i + d ≤ C:
α_i ← min(max(α_i − ∇_i f(α)/Q_ii, 0), C)
Note that
∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^l Q_ij α_j − 1 = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1
One-variable Procedure (Cont’d)
As before, define
u ≡ Σ_{j=1}^l y_j α_j x_j,
Easy gradient calculation: the cost is O(n)
∇_i f(α) = y_i u^T x_i − 1
Algorithm: Dual Coordinate Descent
Given initial α, find
u = Σ_i y_i α_i x_i.
While α is not optimal (outer iteration)
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
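A minimal Python/numpy sketch of this algorithm (my own illustration, not the LIBLINEAR implementation; it runs a fixed number of cyclic passes and omits shrinking and the stopping condition):

import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, max_outer=100):
    l, n = X.shape
    alpha = np.zeros(l)
    u = X.T @ (y * alpha)                    # u = sum_j y_j alpha_j x_j (zero at the start)
    Qdiag = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i
    for _ in range(max_outer):
        for i in range(l):                   # cyclic order; permuting range(l) each pass is faster
            alpha_old = alpha[i]
            G = y[i] * (u @ X[i]) - 1.0      # gradient at coordinate i, O(n)
            alpha[i] = min(max(alpha[i] - G / max(Qdiag[i], 1e-12), 0.0), C)
            if alpha[i] != alpha_old:
                u += (alpha[i] - alpha_old) * y[i] * X[i]   # maintain u, O(n)
    return alpha, u                          # at the end, u is the primal w

# Usage on toy data
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = dual_cd_linear_svm(X, y, C=1.0)
print(np.sign(X @ w))                        # predicted labels of the training points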
One-variable Procedure (Cont’d)
Maintaining u also costs O(n)
Thus the total cost is
O(n) × # iterations
Recall that the cost for kernel is
O(l × n) × # iterations
if we don't have b and select |B| = 1
We will explain some interesting differences between the two
Recap: Dual Coordinate Descent
It’s very simple: minimizing one variable at a time While α not optimal
For i = 1, . . . , l minαi
f (. . . , αi, . . .) A classic optimization technique
Traced back to Hildreth (1957) if constraints are not considered
So what’s new?
Recap: Dual Coordinate Descent (Cont’d)
Having
u ≡ Σ_{j=1}^l y_j α_j x_j,
∇_i f(α) = y_i u^T x_i − 1
and
u ← u + (α_i − ᾱ_i) y_i x_i   (ᾱ_i: old; α_i: new)
is very essential
This is another example where we take the problem structure into account
Recap: Dual Coordinate Descent (Cont’d)
Such a setting was first developed in Hsieh et al. (2008)
Careful Implementation
Some techniques can improve the running speed
Shrinking: remove αi if it is likely to be bounded at the end
Shrinking is easier to conduct than in the kernel case (details not shown)
Cyclic order to update the elements:
α_1 → α_2 → ··· → α_l
A random order gives faster convergence:
α_{π(1)} → α_{π(2)} → ··· → α_{π(l)}
We will come back to this issue later
Difference from the Kernel Case
• We have seen that coordinate-descent type methods are used for both linear and kernel classifiers
• Recall the i-th element of the gradient costs O(n) by
∇_i f(α) = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1 = (y_i x_i)^T (Σ_{j=1}^l y_j x_j α_j) − 1 = (y_i x_i)^T u − 1
but we cannot do this for kernel because
K(x_i, x_j) = φ(x_i)^T φ(x_j) cannot be separated
Difference from the Kernel Case (Cont’d)
If using kernel, the cost of calculating ∇if (α) must be O(ln)
However, if O(ln) cost is spent, the whole ∇f (α) can be maintained
∇f(α) ← ∇f(α) + Q_{:,i} d_i
In contrast, the linear setting of using u knows only
∇if (α) rather than the whole ∇f (α)
This situation affects the working set selection
Difference from the Kernel Case (Cont’d)
• We have mentioned that for existing decomposition methods (or say coordinate descent methods) for kernel classifiers, ∇f (α) information is used to select variables for update
• Therefore, we have the following situations for two working set selections
1 Greedy: Using ∇f (α)
2 Cyclic
                      Kernel               Linear
                 Greedy    Cyclic      Greedy    Cyclic
Update α_i        O(1)     O(ln)        O(1)      O(n)
Maintain ∇f(α)   O(ln)      NA         O(ln)       NA
Difference from the Kernel Case (Cont’d)
In optimization there are two types of coordinate descent methods
1 Gauss-Seidel: sequential selection of variables
2 Gauss-Southwell: greedy selection of variables
To do greedy selection, usually the whole gradient must be available
Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel
Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell
Difference from the Kernel Case (Cont’d)
In general greedy leads to fewer iterations than cyclic
So is the cyclic setting for linear practically viable?
Working Set Selection
Without gradient information, it looks like we can only do a cyclic update
1, 2, 3, . . .
However, Hsieh et al. (2008, Section 3.1) showed that with random permutation, the convergence is much faster
Let’s try an example by using LIBLINEAR real-sim is a data set with
72,309 instances and 20,958 features
Working Set Selection (Cont’d)
With random permutation
$ ./train ~/Desktop/real-sim
optimization finished, #iter = 13
Objective value = -4396.370629
Here an iteration means to go through all instances once (though shrinking is applied)
Without random permutation (cyclic)
$ ./train ~/Desktop/real-sim
optimization finished, #iter = 326
Objective value = -4391.706239
Working Set Selection (Cont’d)
Here l1 loss is used
Both approaches are under the same stopping condition
Let’s see the algorithm with random permutation
Working Set Selection (Cont’d)
While α is not optimal (outer iteration)
    Randomly permute 1, ..., l
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
Note that it’s not useful if we only randomly permute data once
Working Set Selection (Cont’d)
Randomly permute 1, ..., l
While α is not optimal (outer iteration)
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
On the other hand, random CD works:
Working Set Selection (Cont’d)
While α is not optimal (outer iteration)
    Randomly select i
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
    (d) If α_i needs to be changed
            u ← u + (α_i − ᾱ_i) y_i x_i
Turns out it is easier to analyze the complexity of the random CD; see Shalev-Shwartz and Zhang (2013) and many subsequent works
Working Set Selection (Cont’d)
It’s difficult to analyze the setting of using random permutations. Some recent attempts (for simplified situations) include Lee and Wright (2016)
This is still an ongoing research issue
Another line of research is to cheaply find some important indices.
See Glasmachers and Dogan (2013) and other developments
In other words, we think some distributions are better than uniform
Working Set Selection (Cont’d)
However, it is difficult to beat the uniform setting if shrinking has been applied
Working Set Selection and Bias Term
Recall if we use
sgn(w^T x + b)
as the decision function, then the dual problem has a linear constraint:
y^T α = 0
Then at each iteration, two indices {i, j} must be selected
Working Set Selection and Bias Term (Cont’d)
If we use a cyclic setting, this means
(1, 2), (1, 3), (1, 4), . . . , (2, 3), . . .
The training may be terribly slow because for many pairs we cannot change α at all
Therefore, an interesting situation is as follows
Working set selection     Without bias   With bias   Used for
cyclic                        OK           not OK     linear
greedy (with ∇f(α))           OK             OK       kernel
Working Set Selection and Bias Term (Cont’d)
Fortunately, linear classification is often used to handle large data with many sparse features
In such a high dimensional space, the bias term is often not needed
Even if it is, there is a trick of adding the bias term to the objective function:
min_{w,b}  (1/2) w^T w + (1/2) b² + C Σ_{i=1}^l max(1 − y_i [w; b]^T [x_i; 1], 0)
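In code the trick is just appending a constant feature 1 to every instance, so that b becomes the last component of w and a solver for the bias-free dual can be reused unchanged; a short sketch (assuming the X, y, and dual_cd_linear_svm from the earlier sketch):

import numpy as np

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])      # append the constant feature 1
alpha, w_aug = dual_cd_linear_svm(X_aug, y, C=1.0)    # same solver: no y^T alpha = 0 constraint
w, b = w_aug[:-1], w_aug[-1]
print(np.sign(X @ w + b))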
Working Set Selection and Bias Term (Cont’d)
The dual no longer has the linear constraint y^T α = 0
However, for some problems (e.g., one-class SVM), a linear constraint must be there (details not shown)
Then training by coordinate descent for the linear case can be an issue
Other Losses
Our discussion so far is for l1 loss
All results can be applied to other losses such as l2 loss, logistic loss, etc.
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
More Optimization Methods can be Applied for Linear
Recall that
w = Σ_{i=1}^l y_i α_i φ(x_i)
Kernel: can only solve an optimization problem in α because w is too high dimensional
Linear: can solve either in w or in α
We will show an example of minimizing over w
Newton Method
Let’s minimize a twice-differentiable function minw f (w )
For example, logistic regression has minw
1
2wTw + C
l
X
i =1
log
1 + e−yiwTxi
. Newton direction at iterate wk
mins ∇f (wk)Ts + 1
2sT∇2f (wk)s
Truncated Newton Method
The above sub-problem is equivalent to solving the Newton linear system
∇²f(w^k) s = −∇f(w^k)
Approximately solving the linear system ⇒ truncated Newton
However, the Hessian matrix ∇²f(w^k) is too large to be stored
∇²f(w^k): n × n, n: number of features
For document data, n can be millions or more
Using Properties of Data Classification
But the Hessian has a special form:
∇²f(w) = I + C X^T D X,  D is diagonal
For logistic regression,
D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})²
X: data, # instances × # features
X = [x_1^T; ...; x_l^T] ∈ R^{l×n}
Using Properties of Data Classification (Cont’d)
We use Conjugate Gradient (CG) to solve the linear system.
CG is an iterative procedure. Each CG step mainly needs one Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
Therefore, we have a Hessian-free approach
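A sketch of one Hessian-free Newton step for logistic regression (my own Python code; numpy and scipy assumed; the line search / trust region safeguards of LIBLINEAR are omitted):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction(w, X, y, C):
    l, n = X.shape
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))       # sigma(y_i w^T x_i)
    grad = w + C * (X.T @ ((sigma - 1.0) * y))       # gradient of f(w)
    D = sigma * (1.0 - sigma)                        # diagonal entries D_ii

    def hess_vec(d):                                 # Hessian-vector product, O(ln)
        return d + C * (X.T @ (D * (X @ d)))

    H = LinearOperator((n, n), matvec=hess_vec)      # the Hessian is never formed
    s, _ = cg(H, -grad, maxiter=50)                  # truncated Newton: a few CG steps
    return s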
Using Properties of Data Classification (Cont’d)
Now the procedure has two layers of iterations
Outer: Newton iterations
Inner: CG iterations within each Newton iteration
Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005) and Lin et al. (2008)
Second-order information is used: faster convergence than first-order methods
Training L2-loss SVM
The loss function is differentiable but not twice differentiable
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²
We can use the generalized Hessian (Mangasarian, 2002); details not shown here
Preconditioning
Each Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
costs O(ln), where X ∈ R^{l×n}
Each function/gradient evaluation also costs O(ln);
details omitted
Therefore, the cost of each Newton iteration is roughly
O(ln) × # CG steps
Preconditioning (Cont’d)
Can we reduce the number of CG steps by preconditioning?
This classic technique finds
E E^T ≈ ∇²f(w)
so the linear system
∇²f(w) s = −∇f(w)
becomes
E^{−1} ∇²f(w) E^{−T} ŝ = −E^{−1} ∇f(w)
Preconditioning (Cont’d)
Note that
s = E^{−T} ŝ
If
E^{−1} ∇²f(w^k) E^{−T} ≈ I,
then the # of CG steps can be reduced
For example, we use the diagonal preconditioner so that
E E^T = diag(∇²f(w))
and
E = √(diag(∇²f(w)))
Preconditioning (Cont’d)
Because
∇²f(w) = I + C X^T D X,
we have
∇²_{jj} f(w) = 1 + C Σ_{i=1}^l D_ii X_ij²
The cost of constructing the preconditioner is
O(ln)
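A sketch of constructing this diagonal preconditioner, continuing the notation (X, D, C) of the earlier Newton sketch; handing M ≈ H^{-1} to scipy's cg plays the role of the E^{-1}(·)E^{-T} transformation:

import numpy as np
from scipy.sparse.linalg import LinearOperator

def diag_preconditioner(X, D, C):
    h = 1.0 + C * (D @ (X * X))           # H_jj = 1 + C * sum_i D_ii X_ij^2, one O(ln) pass
    n = X.shape[1]
    return LinearOperator((n, n), matvec=lambda v: v / h)   # applying it is O(n) per CG step

# Usage inside the Newton step: s, _ = cg(H, -grad, M=diag_preconditioner(X, D, C), maxiter=50)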
Preconditioning (Cont’d)
In each CG step, doing
(E^{−1} or E^{−T}) × a vector
costs O(n)
Therefore, the extra cost of diagonal preconditioning at each Newton iteration is
O(n) × # CG steps + O(ln)
This is acceptable compared with what we already need:
O(ln) × # CG steps
Preconditioning (Cont’d)
However, the issue is that the number of CG steps may not be decreased
In the area of numerical analysis, preconditioning is an art rather than a science
Further, because of using a Hessian-free approach, many existing preconditioners are not directly applicable
This is still a research issue
Sub-sampled Newton Methods
The optimization problem can be rewritten as
min_w  w^T w / (2Cl) + (1/l) Σ_{i=1}^l max(1 − y_i w^T x_i, 0)
The second term is indeed the average loss
A subset of the data should give a similar gradient or Hessian!!
This kind of setting has been explored in Byrd et al. (2011); Martens (2010)
Clearly this is an example of exploiting properties of machine learning problems
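A sketch of the sub-sampling idea for the Hessian (my own illustration): because the Hessian is a sum over instances, a random subset S, rescaled by l/|S|, can replace the full data in the Hessian-vector product.

import numpy as np

def subsampled_hess_vec(d, X, D, C, sample_frac=0.1, seed=0):
    # Approximate (I + C X^T D X) d using only a random subset S of the instances
    l = X.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.choice(l, size=max(1, int(sample_frac * l)), replace=False)
    scale = l / len(S)
    return d + C * scale * (X[S].T @ (D[S] * (X[S] @ d)))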
Lesson Learned from Kernel to Linear
We must know the practical machine-learning use in order to decide if new optimization algorithms are needed for certain problems
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Comparisons
L2-loss SVM is used
DCDL2: Dual coordinate descent (Hsieh et al., 2008)
DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)
PCD: Primal coordinate descent (Chang et al., 2008)
TRON: Trust region Newton method (Lin et al., 2008)
Objective values vs. training time (in seconds)
[Figures: results on news20, rcv1, yahoo-japan, and yahoo-korea]
Analysis
Dual coordinate descent is very effective if # data and # features are both large
Useful for document classification: half a million data points in a few seconds
However, it is less effective if
# features is small: we should solve the primal; or
the penalty parameter C is large: problems are more ill-conditioned
An Example When # Features Small
# instances: 32,561, # features: 123
[Figures: objective value and accuracy vs. training time]
Careful Evaluation
This is very very important
Let me give you one real personal experience
Careful Evaluation (Cont’d)
In Lin et al. (2008), to check whether the diagonal preconditioner works in Newton methods, we give the following table of the total number of CG steps

Problem        CG    PCG
a9a           567    263
real-sim      104    160
news20         71    155
citeseer      113    115
yahoo-japan   278    326
rcv1          225    280
yahoo-korea   779    736
Careful Evaluation (Cont’d)
Clearly, PCG may not be better
But this conclusion is misleading
In this experiment, a strict stopping condition is used
Careful Evaluation (Cont’d)
Let’s look at this figure for yahoo-japan
[Figure: (f − f*)/f* versus the number of CG iterations for yahoo-japan, comparing CG and diagonal preconditioning]
You see four horizontal lines
Careful Evaluation (Cont’d)
The second line corresponds to the default LIBLINEAR stopping condition
The 4 lines are obtained by using
10ε, ε, 0.1ε, 0.01ε,
where ε is the default LIBLINEAR stopping tolerance
Things below/above these 4 lines are not useful
They are too loose or too strict for the machine learning task
Careful Evaluation (Cont’d)
Another issue is which regularization parameter C should be used in presenting results
In the table, we use C = 1, the simplest choice
In the figure, we use
C_Best = 0.5,
where C_Best is the value leading to the best validation accuracy.
A reasonable setting is to show figures for C = C_Best × {0.01, 0.1, 1, 10, 100}
Careful Evaluation (Cont’d)
The reason is that in practice we start from a small C to search for C_Best.
Values larger than 100 × C_Best are not important
Lesson: even in comparing optimization algorithms for machine learning, we need to take machine learning properties into account
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Multi-core Implementation
Nowadays each computer has several cores
They can do parallel computation
However, most machine learning algorithms (including those we have discussed) do not take parallel computation into account
Multi-core Implementation (Cont’d)
In fact, algorithms may need to be redesigned
Recently we did two studies
1 Newton method for solving the primal problem (Lee et al., 2015)
2 Coordinate descent method for solving the dual problem (Chiang et al., 2016)
We will discuss the Newton method in detail
Parallel Newton Implementation
Recall the bottleneck is the Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
This is conducted at each CG step
Matrix-vector Multiplications: More Than 90% of the Training Time
Data set #instances #features ratio
kddb 19,264,097 29,890,095 82.11%
url combined 2,396,130 3,231,961 94.83%
webspam 350,000 16,609,143 97.95%
rcv1 binary 677,399 47,236 97.88%
covtype binary 581,012 54 89.20%
epsilon normalized 400,000 2,000 99.88%
rcv1 multiclass 518,571 47,236 97.04%
covtype multiclass 581,012 54 89.06%
Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)
This result is from Newton methods using one core
We should parallelize the matrix-vector multiplications
For ∇²f(w) d we must calculate
u = X d        (5)
u ← D u        (6)
ū = X^T u, where u = D X d        (7)
Because D is diagonal, (6) is easy
Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)
We will discuss the parallelization of (5) and (7)
They are more complicated
For our problems, X is large and sparse
Efficient parallel sparse matrix-vector multiplications are still a research issue
Parallel X d Operation
Assume that X is in a row-oriented sparse format
X = [x_1^T; ...; x_l^T]  and  u = X d = [x_1^T d; ...; x_l^T d],
we have the following simple loop:
for i = 1, ..., l do
    u_i = x_i^T d
end for
Because the l inner products are independent, we can easily parallelize the loop by, for example, OpenMP