Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at Summer School on Optimization, Big Data and Applications, July, 2017

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

tw/~cjlin/talks/italy_optml.pdf

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### What is Machine Learning?

Extract knowledge from data and make predictions

Representative tasks: classification, clustering, and others

[Figure: illustrations of classification and clustering]

We will focus on classification

### Data Classification

Given training data in different classes (labels known), predict test data (labels unknown)

Classic example:

1. Find a patient’s blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction

Two main stages: training and testing

### Why Is Optimization Used?

Usually the goal of classification is to minimize the test error

Therefore, many classification methods solve optimization problems

### Optimization and Machine Learning

Standard optimization packages may be directly applied to machine learning applications

However, efficiency and scalability are issues

Very often machine learning knowledge must be considered in designing suitable optimization methods

### Optimization and Machine Learning (Cont’d)

Sometimes optimization researchers fail to make a real impact because they do not know the differences between the two areas

I like to talk about the connection between these two areas because I was trained as an optimization researcher but now work on machine learning

### Optimization and Machine Learning (Cont’d)

In this talk, I will discuss some lessons learned in developing two popular packages

LIBSVM (2000–now): a library for support vector machines

LIBLINEAR (2007–now): a library for large linear classification

Let me shamelessly say a bit about how they have been widely used

### Optimization and Machine Learning (Cont’d)

LIBSVM is probably the most widely used SVM package. Its implementation paper has been cited more than 32,000 times (Google Scholar, 5/2017)

LIBLINEAR is popularly used in Internet companies for large-scale linear classification

### Minimizing Training Errors

Basically a classification method starts with minimizing the training errors

min_{model} (training errors)

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a support vector machine, a neural network, or other types

### Minimizing Training Errors (Cont’d)

We consider the model to be a vector w

That is, the decision function is

sgn(w^T x)

For any data x, the predicted label is 1 if w^T x ≥ 0 and −1 otherwise
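As a tiny sketch (not from the slides; the function name is mine), this decision rule is just a sign check on an inner product:

```python
import numpy as np

def predict(w, x):
    # decision function sgn(w^T x): +1 if w^T x >= 0, else -1
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([1.0, -2.0])
print(predict(w, np.array([3.0, 1.0])))   # 3 - 2 = 1 >= 0, so +1
```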

### Minimizing Training Errors (Cont’d)

The two-dimensional situation

[Figure: circles (one class) and triangles (the other class) in R^2, separated by the line w^T x = 0]

This seems to be quite restricted, but practically x is in a much higher dimensional space

### Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)

Ideally we should use the 0–1 training loss:

ξ(w; x, y) = 1 if y w^T x < 0, and 0 otherwise

### Minimizing Training Errors (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

[Figure: the 0–1 loss as a function of −y w^T x]

We need continuous approximations

### Loss Functions

Some commonly used ones:

ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),       (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2,     (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).     (3)

SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)–(2)

Logistic regression (LR): (3)
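A minimal sketch of the three losses, written (my shorthand) as functions of the margin m = y w^T x:

```python
import math

# the three losses as functions of the margin m = y * w^T x
def xi_l1(m):
    return max(0.0, 1.0 - m)

def xi_l2(m):
    return max(0.0, 1.0 - m) ** 2

def xi_lr(m):
    return math.log(1.0 + math.exp(-m))
```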

### Loss Functions (Cont’d)

[Figure: the curves of ξ_L1, ξ_L2, and ξ_LR as functions of −y w^T x]

Their performance is usually similar

### Loss Functions (Cont’d)

These loss functions have different differentiability
ξ_{L1}: not differentiable

ξ_{L2}: differentiable but not twice differentiable
ξ_{LR}: twice differentiable

The same optimization method may not be applicable to all these losses

### Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

### Overfitting

See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy. This is useless

When training a data set, we should
   Avoid underfitting: small training error
   Avoid overfitting: small testing error

[Figure: training and testing points illustrating overfitting]

### Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w ’s values less extreme.

One idea is to make the values of w closer to zero

We can add, for example,

w^T w / 2  or  ‖w‖_1

to the function that is minimized

### Regularized Linear Classification

Training data {y_{i}, x_{i}}, x_{i} ∈ R^{n}, i = 1, . . . , l , y_{i} = ±1
l : # of data, n: # of features

min_w f(w),   f(w) ≡ w^T w / 2 + C Σ_{i=1}^l ξ(w; x_i, y_i)

w^T w / 2: the regularization term (we have no time to talk about L1 regularization here)

ξ(w; x, y): the loss function; we hope y w^T x > 0

C: the regularization parameter
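The objective above is easy to evaluate directly. A minimal sketch with the L1 loss (the function name and toy values are mine):

```python
import numpy as np

def f(w, X, y, C):
    # f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)   (L1 loss)
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()
```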

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Kernel Methods

Kernel methods are a class of classification techniques whose major operations are conducted by kernel evaluations

A representative example is support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995)

### Support Vector Classification

• Training data (x_{i}, y_{i}), i = 1, . . . , l , x_{i} ∈ R^{n}, y_{i} = ±1

• Minimizing training losses with regularization

min_{w,b}  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i (w^T φ(x_i) + b), 0)

• Note that here we add a bias term b so the decision function is

sgn(w^{T}φ(x ) + b)

### Support Vector Classification (Cont’d)

• Then the hyperplane does not pass through 0

• If n (# of features) is small, b may be important.

Otherwise, it’s not

• There are also historical reasons

• In our discussion we sometimes include b but sometimes do not

### Mapping Data to a Higher Dimensional Space

To better separate the data, we map data to a higher dimensional space

φ(x ) = [φ1(x ), φ2(x ), . . .]^{T}.
For example,

weight / height^2

is a useful new feature to check whether a person is overweight

### Difficulties After Mapping Data to a High-dimensional Space

# variables in w = dimensions of φ(x )

Infinite variables if φ(x) is infinite dimensional

Cannot do an infinite-dimensional inner product for predicting a test instance

sgn(w^{T}φ(x ) + b)

Use kernel trick to go back to a finite number of variables

### Support Vector Classification (Cont’d)

The dual problem (finite # of variables):

min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, . . . , l,
            y^T α = 0,

where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, . . . , 1]^T

At optimum

w = Σ_{i=1}^l α_i y_i φ(x_i)

Kernel: K(x_i, x_j) ≡ φ(x_i)^T φ(x_j); needs a closed form

Example: Gaussian (RBF) kernel: e^{−γ‖x_i − x_j‖^2}

### Kernel Tricks

• It can be shown that at optimum, w is a linear combination of training data

w = Σ_{i=1}^l y_i α_i φ(x_i)

Proofs not provided here.

• With special φ(x), the decision function becomes

sgn(w^T φ(x) + b) = sgn( Σ_{i=1}^l y_i α_i φ(x_i)^T φ(x) + b )
                  = sgn( Σ_{i=1}^l y_i α_i K(x_i, x) + b )

### Kernel Tricks (Cont’d)

φ(x_i)^T φ(x_j) needs a closed form

Example: x_i ∈ R^3, φ(x_i) ∈ R^10

φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, √2 (x_i)_1 (x_i)_2, √2 (x_i)_1 (x_i)_3, √2 (x_i)_2 (x_i)_3]^T

Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.
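The identity can be checked numerically. A sketch with the explicit mapping above (toy vectors are mine):

```python
import numpy as np

def phi(x):
    # explicit degree-2 polynomial mapping for x in R^3 (phi(x) in R^10)
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print(phi(xi) @ phi(xj))          # equals (1 + xi^T xj)^2 = 5.5^2 = 30.25
```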

Kernel: K(x, y) = φ(x)^T φ(y); common kernels:

e^{−γ‖x_i − x_j‖^2}  (Radial Basis Function)
(x_i^T x_j / a + b)^d  (Polynomial kernel)

K(x, y) can be an inner product in an infinite dimensional space. Assume x ∈ R^1 and γ > 0:

e^{−γ‖x_i − x_j‖^2} = e^{−γ(x_i − x_j)^2} = e^{−γ x_i^2 + 2γ x_i x_j − γ x_j^2}

= e^{−γ x_i^2 − γ x_j^2} ( 1 + (2γ x_i x_j)/1! + (2γ x_i x_j)^2/2! + (2γ x_i x_j)^3/3! + · · · )

= e^{−γ x_i^2 − γ x_j^2} ( 1 · 1 + √(2γ/1!) x_i · √(2γ/1!) x_j + √((2γ)^2/2!) x_i^2 · √((2γ)^2/2!) x_j^2 + √((2γ)^3/3!) x_i^3 · √((2γ)^3/3!) x_j^3 + · · · )

= φ(x_i)^T φ(x_j),

where

φ(x) = e^{−γ x^2} [ 1, √(2γ/1!) x, √((2γ)^2/2!) x^2, √((2γ)^3/3!) x^3, · · · ]^T.
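The infinite expansion can be verified numerically for scalars: truncating the series after enough terms reproduces the kernel value. A sketch under an assumed γ = 0.5:

```python
import math

gamma = 0.5

def rbf(xi, xj):
    return math.exp(-gamma * (xi - xj) ** 2)

def rbf_series(xi, xj, terms):
    # e^{-gamma xi^2 - gamma xj^2} * sum_k (2 gamma xi xj)^k / k!
    s = sum((2.0 * gamma * xi * xj) ** k / math.factorial(k) for k in range(terms))
    return math.exp(-gamma * xi**2 - gamma * xj**2) * s
```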

We don’t discuss the primal-dual relationship here

In this lecture we focus more on optimization algorithms than on optimization theory

### Support Vector Classification (Cont’d)

Only the x_i with α_i > 0 are used ⇒ support vectors

[Figure: a two-class data set in R^2; the support vectors lie near the decision boundary]

### Large Dense Quadratic Programming

min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, . . . , l,
            y^T α = 0

Q_ij ≠ 0; Q is an l by l fully dense matrix

50,000 training points ⇒ 50,000 variables:

(50,000^2 × 8 / 2) bytes = 10GB RAM to store Q
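The memory arithmetic above is easy to check:

```python
l = 50_000
bytes_needed = l * l * 8 // 2     # doubles; store only half of the symmetric Q
print(bytes_needed / 1e9)         # 10.0 (GB)
```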

### Large Dense Quadratic Programming (Cont’d)

Traditional optimization methods cannot be directly applied here because Q cannot even be stored

Currently, decomposition methods (a type of coordinate descent method) are commonly used in practice

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Decomposition Methods

Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)

Similar to coordinate-wise minimization

Working set B; N = {1, . . . , l} \ B is fixed

Let the objective function be

f(α) = (1/2) α^T Q α − e^T α

### Decomposition Methods (Cont’d)

Sub-problem on the variable d_B:

min_{d_B}  f([α_B; α_N] + [d_B; 0])

subject to  −α_i ≤ d_i ≤ C − α_i, ∀i ∈ B,
            d_i = 0, ∀i ∉ B,
            y_B^T d_B = 0

The objective function of the sub-problem:

f([α_B; α_N] + [d_B; 0]) = (1/2) d_B^T Q_BB d_B + ∇_B f(α)^T d_B + constant.

### Avoid Memory Problems

Q_BB is a sub-matrix of Q:

Q = [Q_BB  Q_BN
     Q_NB  Q_NN]

Note that

∇f(α) = Qα − e,   ∇_B f(α) = Q_{B,:} α − e_B
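A toy sketch (values are mine) of why the sub-problem is cheap: the full gradient needs all of Q, while the sub-problem gradient only needs the rows Q_{B,:}:

```python
import numpy as np

# a toy 2x2 problem: full gradient vs. the rows needed for the working set
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
alpha = np.array([0.3, 0.8])
e = np.ones(2)

grad = Q @ alpha - e              # whole gradient: needs all of Q
B = [1]
grad_B = Q[B, :] @ alpha - e[B]   # sub-problem only needs rows Q_{B,:}
```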

### Avoid Memory Problems (Cont’d)

Only the columns of Q corresponding to B are needed

In general |B| ≤ 10 is used. We need |B| ≥ 2 because of the linear constraint

y_B^T d_B = 0

Entries are calculated when used: we trade time for space

But is such an approach practical?

### Algorithm of Decomposition Methods

While α is not optimal
   Select a working set B
   Solve the sub-problem of d_B
   α_B ← α_B + d_B

We will talk about the selection of B later

### How Do Decomposition Methods Perform?

Convergence is not very fast; this is expected because only first-order information is used

But there is no need for a very accurate α. The decision function is

sgn(w^T φ(x) + b) = sgn( Σ_{i=1}^l α_i y_i K(x_i, x) + b )

Prediction may still be correct with a rough α

Further, in some situations,

# support vectors ≪ # training points

With the initial α^1 = 0, some instances are never used

### How Do Decomposition Methods Perform? (Cont’d)

An example of training 50,000 instances using the software LIBSVM:

$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s

This was done on a typical desktop

Calculating the whole Q takes more time

#SVs = 3,370 ≪ 50,000

A good case where some α_i remain at zero all the time

### How Do Decomposition Methods Perform? (Cont’d)

Because many α_i = 0 in the end, we can develop a shrinking technique

Variables are removed during the optimization procedure. Smaller problems are solved

For example, if α_{i} = 0 in more than 100 iterations,
probably we can remove it

More advanced techniques are possible

Of course in the end we must check all variables again for the stopping condition

### Machine Learning Properties are Useful in Designing Optimization Algorithms

We have seen that special properties of SVM contribute to the viability of decomposition methods

For machine learning applications, there is no need to accurately solve the optimization problem

Because some optimal α_i = 0, decomposition methods may not need to update all the variables

Also, we can use shrinking techniques to reduce the problem size during decomposition methods

### Issues Related to Optimization

Working set selection
Asymptotic convergence
Finite termination & stopping conditions
Convergence rate
Numerical issues

### Working Set Selection

In general we don’t choose a large |B| for the following reasons

1 The sub-problem becomes expensive: O(|B|^3)
2 The # of iterations may not be significantly decreased

Currently a popular setting is to choose |B| = 2:

B = {i, j}

This is called SMO (Sequential Minimal Optimization)

### Working Set Selection (Cont’d)

• One idea is to use gradient information

• If

α_i < C, y_i = 1, and −∇_i f(α) > 0,

then we can enlarge α_i

• Therefore, one possibility is

i ∈ arg max_t {−y_t ∇f(α)_t | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
j ∈ arg min_t {−y_t ∇f(α)_t | α_t < C, y_t = −1 or α_t > 0, y_t = 1}

• They somewhat correspond to the maximal violation of the optimality condition

### Working Set Selection (Cont’d)

This setting was first used in Joachims (1998)

Can we use a more sophisticated one?

It’s difficult because cost is a concern

For example, if we check the second-order information (e.g., objective-value reduction) of all {i, j} pairs, then O(l^2) cost is needed

### Working Set Selection (Cont’d)

Currently in LIBSVM, we use

i ∈ arg max_t {−y_t ∇f(α)_t | α_t < C, y_t = 1 or α_t > 0, y_t = −1}

and select j by second-order information (Fan et al., 2005)

The cost is still O(l)

### Complexity of Decomposition Methods

Let’s describe the algorithm again

While α is not optimal
   Select a working set B
   Solve the sub-problem of d_B
   α_B ← α_B + d_B

### Complexity of Decomposition Methods (Cont’d)

To construct the sub-problem,

∇_B f(α) = Q_{B,:} α − e_B

needs O(|B| × l × n) operations if each kernel evaluation costs O(n)

But for the working set selection, we need the whole ∇f(α)

The cost can be as large as O(l × l × n)

### Complexity of Decomposition Methods (Cont’d)

Fortunately, we can use

∇f(α + [d_B; 0]) = Qα − e + Q [d_B; 0] = ∇f(α) + Q_{:,B} d_B

Note that Q_{:,B} is available earlier

Therefore, the cost of decomposition methods for kernel SVM is

O(|B| × l × n) × # iterations

### Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example

The decomposition method we just discussed converges more slowly when C is large

Using C = 1 on a data set: # iterations = 508
Using C = 5,000: # iterations = 35,241

Optimization researchers may rush to solve the difficult cases of large C

That’s what I did before

It turns out that large C is less used than small C

Recall that SVM solves

(1/2) w^T w + C (sum of training losses)

A large C means to overfit the training data, and this does not give good test accuracy

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Linear and Kernel Classification

We have

Kernel ⇒ map data to a higher dimensional space
Linear ⇒ use the original data; φ(x) = x

Intuitively, kernel should give better accuracy than linear

There are even theoretical proofs. Roughly speaking, from the Taylor expansion of the Gaussian (RBF) kernel

e^{−γ‖x_i − x_j‖^2},

linear SVM is a special case of RBF-kernel SVM (Keerthi and Lin, 2003)

### Linear and Kernel Classification

Optimization people may think there is no need to specially consider linear SVM

That is, the same optimization algorithms are enough for both linear and kernel cases

However, this is wrong if we consider their practical use

### Linear and Kernel Classification (Cont’d)

Classification methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

x ⇒ φ(x)

φ(x_i)^T φ(x_j) easily calculated; little control on φ(·)

Feature engineering + linear classification:

We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)

### Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as by nonlinear

But training and testing are much faster

This particularly happens for document classification

Number of features (bag-of-words model) very large

Data very sparse (i.e., few non-zeros)

### Comparison Between Linear and Kernel (Training Time & Test Accuracy)

| Data set           | #data   | #features | Linear time | Linear acc. | RBF time  | RBF acc. |
|--------------------|---------|-----------|-------------|-------------|-----------|----------|
| MNIST38            | 11,982  | 752       | 0.1         | 96.82       | 38.1      | 99.70    |
| ijcnn1             | 49,990  | 22        | 1.6         | 91.81       | 26.8      | 98.69    |
| covtype multiclass | 464,810 | 54        | 1.4         | 76.37       | 46,695.8  | 96.11    |
| news20             | 15,997  | 1,355,191 | 1.1         | 96.95       | 383.2     | 96.90    |
| real-sim           | 57,848  | 20,958    | 0.3         | 97.44       | 938.3     | 97.82    |
| yahoo-japan        | 140,963 | 832,026   | 3.1         | 92.63       | 20,955.2  | 93.31    |
| webspam            | 280,000 | 254       | 25.7        | 93.35       | 15,681.8  | 99.26    |


Therefore, there is a need to develop optimization methods for large linear classification

### Why Linear is Faster in Training and Prediction?

The reason is that for linear, x_i is available, but for kernel, φ(x_i) is not

To illustrate this point, let’s modify the kernel decomposition method discussed earlier for the linear case

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Decomposition Method for Linear Classification

• Recall for kernels, we solve the sub-problem:

min_{d_B}  (1/2) [α_B + d_B; α_N]^T [Q_BB, Q_BN; Q_NB, Q_NN] [α_B + d_B; α_N] − [e_B; e_N]^T [α_B + d_B; α_N]

subject to  0 ≤ α_t + d_t ≤ C, t ∈ B,   y_B^T (α_B + d_B) = −y_N^T α_N

• The objective function over d_B is

(1/2) d_B^T Q_BB d_B + (−e_B + Q_{B,:} α)^T d_B + constant

### Decomposition Method for Linear Classification (Cont’d)

We need

Q_{B,:} α = [Q_BB  Q_BN] α

The cost is O(|B| × l × n) because for i ∈ B,

Q_{i,:} α = Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j

n: # of features, l: # of data

### Decomposition Method for Linear Classification (Cont’d)

In the linear case,

K(x_i, x_j) = x_i^T x_j

⇒ Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j = y_i x_i^T Σ_{j=1}^l y_j x_j α_j

Assume

u ≡ Σ_{j=1}^l y_j α_j x_j    (4)

is available

### Decomposition Method for Linear Classification (Cont’d)

The cost is significantly reduced:

O(|B| × l × n) ⇒ O(|B| × n)

The main difference is that in the kernel case

Σ_{j=1}^l y_j α_j φ(x_j)

cannot be written down. But for linear we can!

### Decomposition Method for Linear Classification (Cont’d)

All we need is to maintain u:

u = Σ_{j=1}^l y_j α_j x_j

Then the following update rule can be applied:

u ← u + Σ_{i ∈ B} d_i y_i x_i

The cost is also O(|B| × n)

### Decomposition Method for Linear Classification (Cont’d)

Note that eventually

u → primal optimal w

### One-variable Procedure

This is what people use now in practice

Let’s consider the optimization problem without the bias term:

min_w  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i w^T x_i, 0)

The dual problem now doesn’t have a linear constraint:

min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, . . . , l

### One-variable Procedure (Cont’d)

Here

f(α) ≡ (1/2) α^T Q α − e^T α

and

Q_ij = y_i y_j x_i^T x_j,   e = [1, . . . , 1]^T

Without the linear constraint

y^T α = 0,

we can choose one variable at a time

### One-variable Procedure (Cont’d)

Given the current α, let the working set be B = {i} and

e_i = [0, . . . , 0, 1, 0, . . . , 0]^T

The sub-problem is

min_d  f(α + d e_i) = (1/2) Q_ii d^2 + ∇_i f(α) d + constant
subject to  0 ≤ α_i + d ≤ C

Without the constraint, the optimal d is

d = −∇_i f(α) / Q_ii

### One-variable Procedure (Cont’d)

With the constraint 0 ≤ α_i + d ≤ C,

α_i ← min( max( α_i − ∇_i f(α)/Q_ii, 0 ), C )

Note that

∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^l Q_ij α_j − 1 = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1

### One-variable Procedure (Cont’d)

As before, define

u ≡ Σ_{j=1}^l y_j α_j x_j

Then the gradient calculation is easy and costs only O(n):

∇_i f(α) = y_i u^T x_i − 1

### Algorithm: Dual Coordinate Descent

Given initial α, find

u = Σ_i y_i α_i x_i

While α is not optimal (outer iteration)
   For i = 1, . . . , l (inner iteration)
      (a) ᾱ_i ← α_i
      (b) G = y_i u^T x_i − 1
      (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
      (d) If α_i changed
            u ← u + (α_i − ᾱ_i) y_i x_i
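A minimal NumPy sketch of this dual coordinate descent loop (dense data, L1 loss, no bias; function and variable names are mine, and the random permutation discussed later is used for the inner order):

```python
import numpy as np

def dual_cd(X, y, C, outer_iters=100):
    # dual coordinate descent for the L1-loss linear SVM without bias
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                      # u = sum_j y_j alpha_j x_j
    Qii = (X * X).sum(axis=1)            # Q_ii = x_i^T x_i
    for _ in range(outer_iters):
        for i in np.random.permutation(l):
            G = y[i] * (u @ X[i]) - 1.0              # grad_i f(alpha), O(n)
            a_old = alpha[i]
            alpha[i] = min(max(a_old - G / Qii[i], 0.0), C)
            if alpha[i] != a_old:
                u += (alpha[i] - a_old) * y[i] * X[i]  # maintain u, O(n)
    return u, alpha
```

On a separable toy set, the returned u plays the role of the primal w.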

### One-variable Procedure (Cont’d)

Maintaining u also costs O(n)

Thus the total cost is

O(n) × # iterations

Recall that the cost for kernel is

O(l × n) × # iterations

if we don’t have b and select |B| = 1

We will explain some interesting differences between the two

### Recap: Dual Coordinate Descent

It’s very simple: minimizing one variable at a time

While α not optimal
   For i = 1, . . . , l
      min_{α_i} f(. . . , α_i, . . .)

A classic optimization technique, traced back to Hildreth (1957) if constraints are not considered
So what’s new?

### Recap: Dual Coordinate Descent (Cont’d)

Maintaining

u ≡ Σ_{j=1}^l y_j α_j x_j

so that

∇_i f(α) = y_i u^T x_i − 1

and updating u by (ᾱ_i: old value; α_i: new value)

u ← u + (α_i − ᾱ_i) y_i x_i

is essential

This is another example where we take the problem structure into account

### Recap: Dual Coordinate Descent (Cont’d)

Such a setting was first developed in Hsieh et al. (2008)

### Careful Implementation

Some techniques can improve the running speed

Shrinking: remove α_i if it is likely to stay at a bound until the end

It is easier to conduct shrinking than in the kernel case (details not shown)

Cyclic order to update elements:

α_1 → α_2 → · · · → α_l

A random permutation gives faster convergence:

α_{π(1)} → α_{π(2)} → · · · → α_{π(l)}
We will go back to this issue later

### Difference from the Kernel Case

• We have seen that coordinate-descent types of methods are used for both linear and kernel classifiers

• Recall the i-th element of the gradient costs O(n) by

∇_i f(α) = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1 = (y_i x_i)^T ( Σ_{j=1}^l y_j x_j α_j ) − 1 = (y_i x_i)^T u − 1

but we cannot do this for kernel because

K(x_i, x_j) = φ(x_i)^T φ(x_j)

cannot be separated

### Difference from the Kernel Case (Cont’d)

If using kernel, the cost of calculating ∇_i f(α) must be O(ln)

However, if O(ln) cost is spent, the whole ∇f(α) can be maintained:

∇f(α) ← ∇f(α) + Q_{:,i} d_i

In contrast, the linear setting of using u knows only ∇_i f(α) rather than the whole ∇f(α)

This situation affects the working set selection

### Difference from the Kernel Case (Cont’d)

• We have mentioned that existing decomposition methods (or say coordinate descent methods) for kernel classifiers use ∇f(α) information to select variables for update

• Therefore, we have the following situations for two working set selections

1 Greedy: using ∇f(α)
2 Cyclic

|                 | Kernel greedy | Kernel cyclic | Linear greedy | Linear cyclic |
|-----------------|---------------|---------------|---------------|---------------|
| Update α_i      | O(1)          | O(ln)         | O(1)          | O(n)          |
| Maintain ∇f(α)  | O(ln)         | NA            | O(ln)         | NA            |

### Difference from the Kernel Case (Cont’d)

In optimization there are two types of coordinate descent methods:

1 Gauss-Seidel: sequential selection of variables
2 Gauss-Southwell: greedy selection of variables

To do greedy selection, usually the whole gradient must be available

Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel
Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell

### Difference from the Kernel Case (Cont’d)

In general greedy leads to fewer iterations than cyclic

So is the cyclic setting for linear practically viable?

### Working Set Selection

Without gradient information, it looks like we can only do a cyclic update:

1, 2, 3, . . .

However, Hsieh et al. (2008, Section 3.1) showed that with random permutation, the convergence is much faster

Let’s try an example by using LIBLINEAR

real-sim is a data set with 72,309 instances and 20,958 features

### Working Set Selection (Cont’d)

With random permutation:

$ ./train ~/Desktop/real-sim
optimization finished, #iter = 13
Objective value = -4396.370629

Here an iteration means to go through all instances once (though shrinking is applied)

Without random permutation (cyclic):

$ ./train ~/Desktop/real-sim
optimization finished, #iter = 326
Objective value = -4391.706239

### Working Set Selection (Cont’d)

Here the L1 loss is used

Both approaches are under the same stopping condition

Let’s see the algorithm with random permutation

### Working Set Selection (Cont’d)

While α is not optimal (outer iteration)
   Randomly permute 1, . . . , l
   For i = 1, . . . , l (inner iteration)
      (a) ᾱ_i ← α_i
      (b) G = y_i u^T x_i − 1
      (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
      (d) If α_i changed
            u ← u + (α_i − ᾱ_i) y_i x_i

Note that it’s not useful if we only randomly permute data once

### Working Set Selection (Cont’d)

Randomly permute 1, . . . , l (only once, before the loop)
While α is not optimal (outer iteration)
   For i = 1, . . . , l (inner iteration)
      (a) ᾱ_i ← α_i
      (b) G = y_i u^T x_i − 1
      (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
      (d) If α_i changed
            u ← u + (α_i − ᾱ_i) y_i x_i

On the other hand, random CD works:

### Working Set Selection (Cont’d)

While α is not optimal (outer iteration)
   Randomly select i
      (a) ᾱ_i ← α_i
      (b) G = y_i u^T x_i − 1
      (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
      (d) If α_i changed
            u ← u + (α_i − ᾱ_i) y_i x_i

It turns out that it is easier to analyze the complexity of random CD; see Shalev-Shwartz and Zhang (2013) and many subsequent works

### Working Set Selection (Cont’d)

It’s difficult to analyze the setting of using random permutations. Some recent attempts (for simplified situations) include Lee and Wright (2016)

This is still an ongoing research issue

Another line of research is to cheaply find some important indices.

See Glasmachers and Dogan (2013) and other developments

In other words, we think some distributions are better than uniform

### Working Set Selection (Cont’d)

However, it is difficult to beat the uniform setting if shrinking has been applied

### Working Set Selection and Bias Term

Recall that if we use

sgn(w^T x + b)

as the decision function, then the dual problem has a linear constraint

y^T α = 0

Then at each iteration, two indices {i, j} must be selected

### Working Set Selection and Bias Term (Cont’d)

If we use a cyclic setting, this means

(1, 2), (1, 3), (1, 4), . . . , (2, 3), . . .

The training may be terribly slow because for many pairs we cannot change α at all

Therefore, an interesting situation is as follows:

| Working set selection | Without bias | With bias | Used for |
|-----------------------|--------------|-----------|----------|
| cyclic                | OK           | not OK    | linear   |
| greedy (with ∇f(α))   | OK           | OK        | kernel   |

### Working Set Selection and Bias Term (Cont’d)

Fortunately, linear classification is often used to handle large data with many sparse features

In such a high dimensional space, the bias term is often not needed

Even if it is, there is a trick of adding the bias term to the objective function:

min_{w,b}  (1/2) w^T w + (1/2) b^2 + C Σ_{i=1}^l max(1 − y_i ([w; b]^T [x_i; 1]), 0)
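Concretely, this trick amounts to appending a constant feature. A tiny sketch (toy matrix is mine):

```python
import numpy as np

# append a constant feature 1 to every instance; training on X_aug then
# learns [w; b] jointly, with b regularized through b^2/2
X = np.array([[0.5, 1.0],
              [1.5, -0.2]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
```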

### Working Set Selection and Bias Term (Cont’d)

The dual then no longer has the linear constraint

y^T α = 0

However, for some problems (e.g., one-class SVM), a linear constraint must be there (details not shown)

Then training by coordinate descent in the linear case can be an issue

### Other Losses

Our discussion so far is for the L1 loss

All results can be applied to other losses such as the L2 loss, the logistic loss, etc.

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines
   Decomposition methods

3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments

4 Multi-core implementation

5 Discussion and conclusions

### More Optimization Methods can be Applied for Linear

Recall that

w = Σ_{i=1}^l y_i α_i φ(x_i)

Kernel: we can only solve an optimization problem over α because w is too high dimensional

Linear: we can solve for either w or α

We will show an example to minimize over w

### Newton Method

Let’s minimize a twice-differentiable function:

min_w f(w)

For example, logistic regression has

min_w  (1/2) w^T w + C Σ_{i=1}^l log(1 + e^{−y_i w^T x_i}).

Newton direction at iterate w^k:

min_s  ∇f(w^k)^T s + (1/2) s^T ∇^2 f(w^k) s

### Truncated Newton Method

The above sub-problem is equivalent to solving the Newton linear system

∇^2 f(w^k) s = −∇f(w^k)

Approximately solving the linear system ⇒ truncated Newton

However, the Hessian matrix ∇^2 f(w^k) is too large to be stored:

∇^2 f(w^k): n × n, n: number of features

For document data, n can be millions or more

### Using Properties of Data Classification

But the Hessian has a special form:

∇^2 f(w) = I + C X^T D X,

where D is diagonal. For logistic regression,

D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})^2

X: the data matrix, # instances × # features:

X = [x_1^T; . . . ; x_l^T] ∈ R^{l×n}

### Using Properties of Data Classification (Cont’d)

We use the conjugate gradient (CG) method to solve the linear system

CG is an iterative procedure. Each CG step mainly needs one Hessian-vector product:

∇^2 f(w) d = d + C · X^T (D (X d))

Therefore, we have a Hessian-free approach
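The Hessian-vector product is a one-liner once X and the diagonal of D are at hand. A sketch (function name is mine) that never forms the n × n Hessian:

```python
import numpy as np

def hess_vec(d, X, D_diag, C):
    # Hessian-vector product d + C * X^T (D (X d)) without forming the Hessian
    return d + C * (X.T @ (D_diag * (X @ d)))
```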

### Using Properties of Data Classification (Cont’d)

Now the procedure has two layers of iterations

Outer: Newton iterations
Inner: CG iterations within each Newton iteration

Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005); Lin et al. (2008)

Second-order information used: faster convergence than first-order methods

### Training L2-loss SVM

The loss function is differentiable but not twice differentiable:

ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)^2

We can use a generalized Hessian (Mangasarian, 2002); details not shown here

### Preconditioning

Each Hessian-vector product

∇^{2}f (w )d = d + C · X^{T}(D(X d ))
costs

O(ln), where X ∈ R^{l ×n}

Each function/gradient evaluation also costs O(ln);

details omitted

Therefore, the cost of each Newton iteration is roughly

O(ln) × # CG steps

### Preconditioning (Cont’d)

Can we reduce the number of CG steps by preconditioning?

This classic technique finds

EE^{T} ≈ ∇^{2}f (w )
so the linear system

∇^{2}f (w )s = −∇f (w )
becomes

E^{−1}∇^{2}f (w )E^{−T}ŝ = −E^{−1}∇f (w )

### Preconditioning (Cont’d)

Note that

s = E^{−T}ŝ
If

E^{−1}∇^{2}f (w^{k})E^{−T} ≈ I
then # CG steps can be reduced

For example, we use the diagonal preconditioner so
EE^{T} = diag(∇^{2}f (w ))

and

E = √(diag(∇^{2}f (w )))
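With a diagonal EEᵀ, applying E^{−1} is just elementwise division, so the preconditioner slots into CG at O(n) extra cost per step. A hedged sketch of standard preconditioned CG, written here for a diagonal preconditioner M = EEᵀ (names are illustrative):

```python
import numpy as np

def pcg_solve(hess_vec, grad, M_diag, max_iter=250, tol=1e-10):
    """Preconditioned CG for  ∇²f(w) s = -∇f(w)  with M = diag(∇²f(w)).

    hess_vec maps s -> ∇²f(w) s; M_diag is the length-n Hessian diagonal.
    """
    s = np.zeros_like(grad)
    r = -grad.copy()
    z = r / M_diag            # apply M^{-1}: O(n) per CG step
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rz / (p @ Hp)
        s += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) < tol:
            break
        z = r / M_diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return s
```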

### Preconditioning (Cont’d)

Because

∇^{2}f (w ) = I + CX^{T}DX ,
we have

∇^{2}_{j,j}f (w ) = 1 + C Σ_{i=1}^{l} D_{ii}X_{ij}^{2}
The cost of constructing the preconditioner is

O(ln)
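The diagonal above can be accumulated in one pass over the data. A minimal sketch (function name is illustrative; assumes the logistic-regression D from the earlier slides):

```python
import numpy as np

def hessian_diagonal(X, w, y, C):
    """diag(I + C XᵀDX): entry j is 1 + C Σ_i D_ii X_ij², computed in O(ln)."""
    z = y * (X @ w)
    e = np.exp(-z)
    D = e / (1.0 + e) ** 2             # logistic-regression D_ii
    return 1.0 + C * ((X ** 2).T @ D)  # length-n vector; Hessian never formed
```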

### Preconditioning (Cont’d)

In each CG step, doing

(E^{−1} or E^{−T}) × a vector
costs

O(n)

Therefore, the extra cost of diagonal preconditioning at each Newton iteration is

O(n) × # CG steps + O(ln)

This is acceptable compared with what we already need:

O(ln) × # CG steps

### Preconditioning (Cont’d)

However, the issue is that the number of CG steps may not decrease

In the area of numerical analysis, preconditioning is an art rather than a science

Further, because of using a Hessian-free approach, many existing preconditioners are not directly applicable

This is still a research issue

### Sub-sampled Newton Methods

The optimization problem can be rewritten as

min_{w} w^{T}w/(2Cl ) + (1/l ) Σ_{i=1}^{l} max(1 − y_{i}w^{T}x_{i}, 0)
The second term is indeed the average loss

A subset of data should give a similar gradient or Hessian!!

This kind of setting has been explored in Byrd et al. (2011) and Martens (2010)

Clearly this is an example of taking advantage of properties of machine learning problems
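A hedged sketch of the sub-sampling idea, written for the logistic-regression Hessian-vector product from the earlier slides (the L1 hinge loss above is not twice differentiable); the function name, sampling scheme, and l/|S| rescaling are assumptions for illustration:

```python
import numpy as np

def subsampled_hvp(X, w, y, C, d, frac=0.1, seed=0):
    """Estimate d + C · Xᵀ(D(Xd)) from a random subset S of instances.

    Because the loss is (up to scaling) an average over instances, the
    subset sum rescaled by l/|S| estimates the full data-dependent term.
    """
    l = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(l, size=max(1, int(frac * l)), replace=False)
    Xs = X[idx]
    z = y[idx] * (Xs @ w)
    e = np.exp(-z)
    D = e / (1.0 + e) ** 2                 # logistic-regression D_ii
    scale = l / len(idx)                   # rescale subset sum to full sum
    return d + C * scale * (Xs.T @ (D * (Xs @ d)))
```

With `frac=1.0` the estimate coincides with the exact product; smaller fractions trade accuracy for an O(|S|n) cost per product.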

### Lesson Learned from Kernel to Linear

We must know the practical machine-learning use in order to decide if new optimization algorithms are needed for certain problems

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines

Decomposition methods

3 Optimization methods for linear classification Decomposition method

Newton methods Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Comparisons

L2-loss SVM is used

DCDL2: Dual coordinate descent (Hsieh et al., 2008)

DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)

PCD: Primal coordinate descent (Chang et al., 2008)

TRON: Trust region Newton method (Lin et al., 2008)

### Objective values (Time in Seconds)

(Figures: objective value vs. training time on news20, rcv1, yahoo-japan, and yahoo-korea; not shown here)

### Analysis

Dual coordinate descent is very effective if # data and # features are both large

Useful for document classification: half a million instances can be trained in a few seconds

However, it is less effective if

# features is small: should solve the primal instead; or

the penalty parameter C is large: problems are more ill-conditioned

### An Example When # Features Small

# instance: 32,561, # features: 123

(Figures: objective value and accuracy vs. training time; not shown here)

### Careful Evaluation

This is very very important

Let me give you one real personal experience

### Careful Evaluation (Cont’d)

In Lin et al. (2008), to check if the diagonal preconditioner works in Newton methods, we give the following table of total numbers of CG steps

Problem      CG   PCG
a9a          567  263
real-sim     104  160
news20        71  155
citeseer     113  115
yahoo-japan  278  326
rcv1         225  280
yahoo-korea  779  736

### Careful Evaluation (Cont’d)

Clearly, PCG may not be better

But this conclusion is misleading

In this experiment, a strict stopping condition is used

### Careful Evaluation (Cont’d)

Let’s look at this figure for yahoo-japan

(Figure: (f − f^{*})/f^{*} vs. number of CG iterations, comparing CG and diagonally preconditioned CG; not shown here)

You see four horizontal lines

### Careful Evaluation (Cont’d)

The second line corresponds to the default LIBLINEAR stopping condition

The 4 lines are by using

10ε, ε, 0.1ε, 0.01ε,

where ε is the default LIBLINEAR stopping tolerance

Things below/above these 4 lines are not useful: they are too loose or too strict for the machine learning task

### Careful Evaluation (Cont’d)

Another issue is which regularization parameter C should be used in presenting results

In the table, we use C = 1, the simplest choice

In the figure, we use

C_{Best} = 0.5,

where C_{Best} is the value leading to the best validation accuracy

A reasonable setting is to show figures of

C = C_{Best} × {0.01, 0.1, 1, 10, 100}

### Careful Evaluation (Cont’d)

The reason is that in practice we start from a small C to search for C_{Best}

Values larger than 100 × C_{Best} are not important

Lesson: even in comparing optimization algorithms for machine learning, we need to take machine learning properties into account

### Outline

1 Introduction: why optimization and machine learning are related?

2 Optimization methods for kernel support vector machines

Decomposition methods

3 Optimization methods for linear classification Decomposition method

Newton methods Experiments

4 Multi-core implementation

5 Discussion and conclusions

### Multi-core Implementation

Nowadays each computer has several cores

They can do parallel computation

However, most machine learning algorithms (including those we have discussed) do not take parallel computation into account

### Multi-core Implementation (Cont’d)

In fact, algorithms may need to be redesigned

Recently we did two studies

1 Newton method for solving the primal problem (Lee et al., 2015)

2 Coordinate descent method for solving the dual problem (Chiang et al., 2016)

We will discuss the Newton method in detail

### Parallel Newton Implementation

Recall the bottleneck is the Hessian-vector product

∇^{2}f (w )d = d + C · X^{T}(D(X d ))
This is conducted at each CG step

### Matrix-vector Multiplications: More Than 90% of the Training Time

Data set             #instances   #features    ratio
kddb                 19,264,097   29,890,095   82.11%
url combined          2,396,130    3,231,961   94.83%
webspam                 350,000   16,609,143   97.95%
rcv1 binary             677,399       47,236   97.88%
covtype binary          581,012           54   89.20%
epsilon normalized      400,000        2,000   99.88%
rcv1 multiclass         518,571       47,236   97.04%
covtype multiclass      581,012           54   89.06%

### Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)

This result is by Newton methods using one core
We should parallelize matrix-vector multiplications
For ∇^{2}f (w )d we must calculate

u = X d (5)

u ← Du (6)

ū = X^{T}u, where u = DX d (7)
Because D is diagonal, (6) is easy

### Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)

We will discuss the parallelization of (5) and (7) They are more complicated

For our problems, X is large and sparse

Efficient parallel sparse matrix-vector multiplications are still a research issue

### Parallel X d Operation

Assume that X is in a row-oriented sparse format

X = [x_{1}, . . . , x_{l}]^{T} and u = X d = [x^{T}_{1}d , . . . , x^{T}_{l}d ]^{T}

We have the following simple loop

for i = 1, . . . , l do
    u_{i} = x^{T}_{i}d
end for

Because the l inner products are independent, we can easily parallelize the loop by, for example, OpenMP
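In the same spirit as the OpenMP loop above, a Python sketch that splits the rows into independent blocks (NumPy releases the GIL inside the dot products, so threads can genuinely overlap; the function name and block scheme are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_Xd(X, d, n_threads=4):
    """u = X d via independent row blocks; each block writes disjoint
    entries of u, so the blocks can run concurrently."""
    u = np.empty(X.shape[0])
    blocks = np.array_split(np.arange(X.shape[0]), n_threads)

    def block_dot(rows):
        u[rows] = X[rows] @ d   # the simple loop above, one block at a time

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(block_dot, blocks))
    return u
```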