Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at Summer School on Optimization, Big Data and Applications, July, 2017
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
tw/~cjlin/talks/italy_optml.pdf
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
What is Machine Learning?
Extract knowledge from data and make predictions
Representative tasks: classification, clustering, and others
[Figures: examples of classification and clustering]
We will focus on classification
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Classic example
1. Find a patient's blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction
Two main stages: training and testing
Why Is Optimization Used?
Usually the goal of classification is to minimize the test error
Therefore, many classification methods solve optimization problems
Optimization and Machine Learning
Standard optimization packages may be directly applied to machine learning applications
However, efficiency and scalability are issues
Very often machine learning knowledge must be considered in designing suitable optimization methods
Optimization and Machine Learning (Cont’d)
Sometimes optimization researchers fail to make real impact because they do not know the differences between the two areas
I like to talk about the connection between these two areas because I was trained as an optimization researcher but now work on machine learning
Optimization and Machine Learning (Cont’d)
In this talk, I will discuss some lessons learned in developing two popular packages
LIBSVM (2000–now): a library for support vector machines
LIBLINEAR (2007–now): a library for large linear classification
Let me shamelessly say a bit about how they have been widely used
Optimization and Machine Learning (Cont’d)
LIBSVM is probably the most widely used SVM package. Its implementation paper has been cited more than 32,000 times (Google Scholar, 5/2017)
LIBLINEAR is widely used in Internet companies for large-scale linear classification
Minimizing Training Errors
Basically, a classification method starts by minimizing the training errors:
min_model (training errors)
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a support vector machine, a neural network, or other types
Minimizing Training Errors (Cont’d)
We consider the model to be a vector w
That is, the decision function is
sgn(w^T x)
For any data x, the predicted label is
1 if w^T x ≥ 0, −1 otherwise
Minimizing Training Errors (Cont’d)
The two-dimensional situation
◦ ◦
◦
◦ ◦
◦
◦◦
4 4 4 4
4 4
4
wTx = 0
This seems quite restrictive, but in practice x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)
Ideally we should use the 0–1 training loss:
ξ(w; x, y) = 1 if y w^T x < 0, 0 otherwise
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss as a function of −y w^T x]
We need continuous approximations
Loss Functions
Some commonly used ones:
ξ_L1(w; x, y) ≡ max(0, 1 − y w^T x),      (1)
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²,     (2)
ξ_LR(w; x, y) ≡ log(1 + e^{−y w^T x}).    (3)
SVM (Boser et al., 1992; Cortes and Vapnik, 1995): (1)-(2)
Logistic regression (LR): (3)
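To make (1)-(3) concrete, here is a small Python/numpy sketch (my own illustration; the function names are not from any package) that evaluates the three losses for one instance:

import numpy as np

def xi_L1(w, x, y):            # (1) hinge loss
    return max(0.0, 1.0 - y * np.dot(w, x))

def xi_L2(w, x, y):            # (2) squared hinge loss
    return max(0.0, 1.0 - y * np.dot(w, x)) ** 2

def xi_LR(w, x, y):            # (3) logistic loss
    return np.log1p(np.exp(-y * np.dot(w, x)))

w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1
print(xi_L1(w, x, y), xi_L2(w, x, y), xi_LR(w, x, y))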
Loss Functions (Cont’d)
[Figure: ξ_L1, ξ_L2, and ξ_LR as functions of −y w^T x]
Their performance is usually similar
Loss Functions (Cont’d)
These loss functions have different degrees of differentiability
ξ_L1: not differentiable
ξ_L2: differentiable but not twice differentiable
ξ_LR: twice differentiable
The same optimization method may not be applicable to all these losses
Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification, you can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
[Figure: training and testing points illustrating overfitting]
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w's values less extreme.
One idea is to make w's values closer to zero
We can add, for example,
w^T w / 2  or  ‖w‖_1
to the function that is minimized
Regularized Linear Classification
Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1
l: # of data, n: # of features
min_w f(w),  f(w) ≡ w^T w / 2 + C Σ_{i=1}^l ξ(w; x_i, y_i)
w^T w / 2: regularization term (we have no time to talk about L1 regularization here)
ξ(w; x, y): loss function; we hope y w^T x > 0
C: regularization parameter
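As a minimal sketch (my own Python/numpy code, not LIBLINEAR), f(w) with the L1 loss can be evaluated as follows, where the rows of X are the x_i:

import numpy as np

def f(w, X, y, C):
    # f(w) = w^T w / 2 + C * sum_i max(0, 1 - y_i w^T x_i)
    losses = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * np.sum(losses)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
print(f(np.zeros(2), X, y, C=1.0))   # with w = 0, f = C * l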
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Kernel Methods
Kernel methods are a class of classification
techniques where major operations are conducted by kernel evaluations
A representative example is support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995)
Support Vector Classification
• Training data (xi, yi), i = 1, . . . , l , xi ∈ Rn, yi = ±1
• Minimizing training losses with regularization
min_{w,b}  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i (w^T φ(x_i) + b), 0)
• Note that here we add a bias term b so the decision function is
sgn(w^T φ(x) + b)
Support Vector Classification (Cont’d)
• Then the hyperplane does not pass through 0
• If n (# of features) is small, b may be important.
Otherwise, it’s not
• There are also historical reasons
• In our discussion we sometimes include b but sometimes do not
Mapping Data to a Higher Dimensional Space
To better separate the data, we map data to a higher dimensional space
φ(x) = [φ_1(x), φ_2(x), ...]^T. For example,
weight / height²
is a useful new feature to check whether a person is overweight or not
Difficulties After Mapping Data to a High-dimensional Space
# of variables in w = dimension of φ(x)
Infinitely many variables if φ(x) is infinite dimensional
We cannot do an infinite-dimensional inner product to predict a test instance:
sgn(w^T φ(x) + b)
Use the kernel trick to go back to a finite number of variables
Support Vector Classification (Cont’d)
The dual problem (with a finite number of variables):
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
            y^T α = 0,
where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At optimum
w = Σ_{i=1}^l α_i y_i φ(x_i)
Kernel: K(x_i, x_j) ≡ φ(x_i)^T φ(x_j); closed form
Example: Gaussian (RBF) kernel: e^{−γ‖x_i − x_j‖²}
Kernel Tricks
• It can be shown that at optimum, w is a linear combination of training data
w = Σ_{i=1}^l y_i α_i φ(x_i)
Proofs not provided here.
• Special φ(x) such that the decision function becomes
sgn(w^T φ(x) + b) = sgn(Σ_{i=1}^l y_i α_i φ(x_i)^T φ(x) + b)
                  = sgn(Σ_{i=1}^l y_i α_i K(x_i, x) + b)
Kernel Tricks (Cont’d)
φ(x_i)^T φ(x_j) needs a closed form
Example: x_i ∈ R³, φ(x_i) ∈ R¹⁰
φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1², (x_i)_2², (x_i)_3², √2 (x_i)_1(x_i)_2, √2 (x_i)_1(x_i)_3, √2 (x_i)_2(x_i)_3]^T
Then φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)².
Kernel: K(x, y) = φ(x)^T φ(y); common kernels:
e^{−γ‖x_i − x_j‖²}  (Radial Basis Function)
(x_i^T x_j / a + b)^d  (Polynomial kernel)
K(x, y) can be an inner product in an infinite dimensional space. Assume x ∈ R¹ and γ > 0.
e^{−γ‖x_i − x_j‖²} = e^{−γ(x_i − x_j)²} = e^{−γx_i² + 2γx_i x_j − γx_j²}
= e^{−γx_i² − γx_j²} (1 + 2γx_i x_j/1! + (2γx_i x_j)²/2! + (2γx_i x_j)³/3! + ···)
= e^{−γx_i² − γx_j²} (1·1 + √(2γ/1!) x_i · √(2γ/1!) x_j + √((2γ)²/2!) x_i² · √((2γ)²/2!) x_j² + √((2γ)³/3!) x_i³ · √((2γ)³/3!) x_j³ + ···)
= φ(x_i)^T φ(x_j),
where
φ(x) = e^{−γx²} [1, √(2γ/1!) x, √((2γ)²/2!) x², √((2γ)³/3!) x³, ···]^T.
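The two identities above can be checked numerically; the following is a rough sketch of my own (Python/numpy assumed), verifying the degree-2 polynomial mapping and a truncated version of the one-dimensional Gaussian-kernel expansion:

import numpy as np
from math import factorial

# Degree-2 polynomial kernel: phi(x)^T phi(z) = (1 + x^T z)^2 for x, z in R^3
def phi_poly2(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s*x[0], s*x[1], s*x[2],
                     x[0]**2, x[1]**2, x[2]**2,
                     s*x[0]*x[1], s*x[0]*x[2], s*x[1]*x[2]])

x, z = np.array([1.0, 2.0, 3.0]), np.array([-1.0, 0.5, 2.0])
print(phi_poly2(x) @ phi_poly2(z), (1.0 + x @ z) ** 2)        # same value

# 1-D Gaussian kernel via the (truncated) feature expansion phi(x)
def phi_rbf(x, gamma, terms=30):
    coef = np.array([np.sqrt((2.0 * gamma) ** k / factorial(k)) for k in range(terms)])
    powers = np.array([x ** k for k in range(terms)])
    return np.exp(-gamma * x ** 2) * coef * powers

gamma, xi_, xj_ = 0.5, 0.7, -0.3
print(phi_rbf(xi_, gamma) @ phi_rbf(xj_, gamma), np.exp(-gamma * (xi_ - xj_) ** 2))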
We don’t discuss the primal-dual relationship here In this lecture we focus more on optimization algorithms rather than optimization theory
Support Vector Classification (Cont’d)
Only the x_i with α_i > 0 are used ⇒ support vectors
[Figure: SVM decision boundary and support vectors]
Large Dense Quadratic Programming
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
            y^T α = 0
Q_ij ≠ 0; Q: an l by l fully dense matrix
50,000 training points: 50,000 variables
(50,000² × 8 / 2) bytes = 10GB RAM to store Q
Large Dense Quadratic Programming (Cont’d)
Traditional optimization methods cannot be directly applied here because Q cannot even be stored
Currently, decomposition methods (a type of coordinate descent method) are commonly used in practice
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Decomposition Methods
Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)
Similar to coordinate-wise minimization
Working set B; N ≡ {1, ..., l} \ B is fixed
Let the objective function be
f(α) = (1/2) α^T Q α − e^T α
Decomposition Methods (Cont’d)
Sub-problem on the variables d_B:
min_{d_B}  f([α_B; α_N] + [d_B; 0])
subject to  −α_i ≤ d_i ≤ C − α_i, ∀i ∈ B,
            d_i = 0, ∀i ∉ B,
            y_B^T d_B = 0
The objective function of the sub-problem:
f([α_B; α_N] + [d_B; 0]) = (1/2) d_B^T Q_BB d_B + ∇_B f(α)^T d_B + constant.
Avoid Memory Problems
Q_BB is a sub-matrix of Q:
Q = [Q_BB  Q_BN; Q_NB  Q_NN]
Note that
∇f(α) = Qα − e,  ∇_B f(α) = Q_{B,:} α − e_B
Avoid Memory Problems (Cont’d)
Only |B| columns of Q are needed
In general |B| ≤ 10 is used. We need |B| ≥ 2 because of the linear constraint
y_B^T d_B = 0
Calculated when used: trade time for space
But is such an approach practical?
Algorithm of Decomposition Methods
While α is not optimal
    Select a working set B
    Solve the sub-problem for d_B
    α_B ← α_B + d_B
We will talk about the selection of B later
How Do Decomposition Methods Perform?
Convergence is not very fast. This is expected because only first-order information is used
But there is no need for a very accurate α. The decision function is
sgn(w^T φ(x) + b) = sgn(Σ_{i=1}^l α_i y_i K(x_i, x) + b)
Prediction may still be correct with a rough α
Further, in some situations,
# support vectors ≪ # training points
With the initial α¹ = 0, some instances are never used
How Do Decomposition Methods Perform? (Cont'd)
An example of training 50,000 instances using the software LIBSVM
$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s
This was done on a typical desktop
Calculating the whole Q takes more time
#SVs = 3,370 ≪ 50,000
A good case where some α_i remain at zero all the time
How Do Decomposition Methods Perform? (Cont'd)
Because many α_i = 0 in the end, we can develop shrinking techniques
Variables are removed during the optimization procedure. Smaller problems are solved
For example, if α_i has stayed at 0 for more than 100 iterations, we can probably remove it
More advanced techniques are possible
Of course, in the end we must check all variables again for the stopping condition
Machine Learning Properties are Useful in Designing Optimization Algorithms
We have seen that special properties of SVM contribute to the viability of decomposition methods
For machine learning applications, there is no need to accurately solve the optimization problem
Because some optimal α_i = 0, decomposition methods may not need to update all the variables
Also, we can use shrinking techniques to reduce the problem size during decomposition methods
Issues Related to Optimization
Working set selection
Asymptotic convergence
Finite termination & stopping conditions
Convergence rate
Numerical issues
Working Set Selection
In general we don’t choose a large |B| because of the following reasons
1 The sub-problem becomes expensive: O(|B|³)
2 # iterations may not be significantly decreased
Currently a popular setting is to choose |B| = 2:
B = {i, j}
This is called SMO (Sequential Minimal Optimization)
Working Set Selection (Cont’d)
• One idea is to use gradient information
• If
α_i < C, y_i = 1, and −∇_i f(α) > 0,
then we can enlarge α_i
• Therefore, one possibility is
i ∈ arg max_t {−y_t ∇_t f(α) | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
j ∈ arg min_t {−y_t ∇_t f(α) | α_t < C, y_t = −1 or α_t > 0, y_t = 1}
• They somewhat correspond to the maximal violation of the optimality condition
Working Set Selection (Cont’d)
This setting was first used in Joachims (1998)
Can we use a more sophisticated one?
It's difficult because cost is a concern
For example, if we check the second-order information (e.g., objective-value reduction) of all {i, j} pairs, then O(l²) cost is needed
Working Set Selection (Cont’d)
Currently in LIBSVM, we use
i ∈ arg max_t {−y_t ∇_t f(α) | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
and select j by second-order information (Fan et al., 2005)
The cost is still O(l)
Complexity of Decomposition Methods
Let’s describe the algorithm again While α is not optimal
Select a working set B
Solve the sub-problem of dB αB ← αB + dB
Complexity of Decomposition Methods (Cont’d)
To construct the sub-problem,
∇_B f(α) = Q_{B,:} α − e_B
needs
O(|B| × l × n) if each kernel evaluation costs O(n)
But for the working set selection, we need the whole
∇f (α)
The cost can be as large as O(l × l × n)
Complexity of Decomposition Methods (Cont’d)
Fortunately, we can use
∇f(α + [d_B; 0]) = Qα − e + Q [d_B; 0] = ∇f(α) + Q_{:,B} d_B
Note that Q_{:,B} is already available from constructing the sub-problem
Therefore, the cost of decomposition methods for kernel SVM is
O(|B| × l × n) × # iterations
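Putting these pieces together, below is a rough Python/numpy sketch of the |B| = 2 decomposition (SMO) procedure: maximal-violating-pair selection, the analytic two-variable sub-problem, and gradient maintenance with only two kernel columns per iteration. This is my own simplified illustration (no shrinking, no kernel caching), not the LIBSVM code.

import numpy as np

def Q_col(X, y, i, gamma):
    # i-th column of Q: Q[t, i] = y_t y_i K(x_t, x_i), computed on demand (Q is never stored)
    diff = X - X[i]
    return y * y[i] * np.exp(-gamma * np.sum(diff * diff, axis=1))

def smo(X, y, C=1.0, gamma=0.5, eps=1e-3, max_iter=10000):
    l = len(y)
    alpha = np.zeros(l)
    grad = -np.ones(l)                       # grad f(alpha) = Q alpha - e, with alpha = 0
    for _ in range(max_iter):
        up = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > 0))
        low = ((y == -1) & (alpha < C)) | ((y == 1) & (alpha > 0))
        vals = -y * grad
        i = np.where(up)[0][np.argmax(vals[up])]
        j = np.where(low)[0][np.argmin(vals[low])]
        if vals[i] - vals[j] < eps:          # maximal violation below tolerance: stop
            break
        Qi, Qj = Q_col(X, y, i, gamma), Q_col(X, y, j, gamma)
        a = Qi[i] + Qj[j] - 2.0 * y[i] * y[j] * Qi[j]
        b = y[i] * grad[i] - y[j] * grad[j]
        t = -b / max(a, 1e-12)               # unconstrained minimizer along the feasible direction
        lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] == 1 else (alpha[i] - C, alpha[i])
        lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] == 1 else (-alpha[j], C - alpha[j])
        t = min(max(t, max(lo_i, lo_j)), min(hi_i, hi_j))   # clip to the box constraints
        d_i, d_j = y[i] * t, -y[j] * t       # direction satisfying y_i d_i + y_j d_j = 0
        grad += Qi * d_i + Qj * d_j          # O(l) gradient maintenance
        alpha[i] += d_i
        alpha[j] += d_j
    return alpha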
Differences between Optimization and Machine Learning
The two topics may have different focuses. We give the following example
The decomposition method we just discussed converges more slowly when C is large
Using C = 1 on a data set:
# iterations: 508
Using C = 5,000:
# iterations: 35,241
Optimization researchers may rush to solve difficult cases of large C
That’s what I did before
It turns out that large C is less used than small C
Recall that SVM solves
(1/2) w^T w + C (sum of training losses)
A large C means we overfit the training data
This does not give good test accuracy
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Linear and Kernel Classification
We have
Kernel ⇒ map data to a higher dimensional space
Linear ⇒ use the original data; φ(x) = x
Intuitively, kernel should give better accuracy than linear
There are even theoretical proofs. Roughly speaking, from the Taylor expansion of the Gaussian (RBF) kernel
e^{−γ‖x_i − x_j‖²},
linear SVM is a special case of RBF-kernel SVM (Keerthi and Lin, 2003)
Linear and Kernel Classification
Optimization people may think there is no need to specially consider linear SVM
That is, the same optimization algorithms are enough for both the linear and kernel cases
However, this is wrong if we consider their practical use
Linear and Kernel Classification (Cont’d)
Classification methods such as SVM and logistic regression can be used in two ways
Kernel methods: data mapped to a higher dimensional space
x ⇒ φ(x )
φ(x_i)^T φ(x_j) is easily calculated; little control on φ(·)
Feature engineering + linear classification:
We have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)
Linear and Kernel Classification (Cont’d)
For some problems, the accuracy of linear is as good as that of nonlinear
But training and testing are much faster
This particularly happens for document classification
Number of features (bag-of-words model) very large
Data very sparse (i.e., few non-zeros)
Comparison Between Linear and Kernel (Training Time & Test Accuracy)
                              Linear            RBF kernel
Data set             #data   #features    Time    Acc.      Time      Acc.
MNIST38             11,982         752     0.1   96.82      38.1     99.70
ijcnn1              49,990          22     1.6   91.81      26.8     98.69
covtype multiclass 464,810          54     1.4   76.37  46,695.8     96.11
news20              15,997   1,355,191     1.1   96.95     383.2     96.90
real-sim            57,848      20,958     0.3   97.44     938.3     97.82
yahoo-japan        140,963     832,026     3.1   92.63  20,955.2     93.31
webspam            280,000         254    25.7   93.35  15,681.8     99.26
Therefore, there is a need to develop optimization methods for large linear classification
Why Is Linear Faster in Training and Prediction?
The reason is that
for linear, x_i is available, but
for kernel, φ(x_i) is not
To illustrate this point, let's modify the kernel decomposition methods discussed earlier for the linear case
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Decomposition Method for Linear Classification
• Recall for kernels, we solve the sub-problem:
min_{d_B}  (1/2) [α_B + d_B; α_N]^T [Q_BB  Q_BN; Q_NB  Q_NN] [α_B + d_B; α_N] − [e_B; e_N]^T [α_B + d_B; α_N]
subject to  0 ≤ α_t ≤ C, t ∈ B,
            y_B^T (α_B + d_B) = −y_N^T α_N
• The objective function over d_B is
(1/2) d_B^T Q_BB d_B + (−e_B + Q_{B,:} α)^T d_B
Decomposition Method for Linear Classification (Cont’d)
We need
Q_{B,:} α = [Q_BB  Q_BN] α
The cost is O(|B| × l × n) because for i ∈ B,
Q_{i,:} α = Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j
n: # features, l: # data
Decomposition Method for Linear Classification (Cont’d)
In the linear case,
K(x_i, x_j) = x_i^T x_j
⇒ Σ_{j=1}^l y_i y_j K(x_i, x_j) α_j = y_i x_i^T Σ_{j=1}^l y_j x_j α_j
Assume
u ≡ Σ_{j=1}^l y_j α_j x_j        (4)
is available
Decomposition Method for Linear Classification (Cont’d)
The cost is significantly reduced:
O(|B| × l × n) ⇒ O(|B| × n)
The main difference is that for kernel,
Σ_{j=1}^l y_j α_j φ(x_j)
cannot be written down. But for linear we can!
Decomposition Method for Linear Classification (Cont’d)
All we need is to maintain u:
u = Σ_{j=1}^l y_j α_j x_j
Then the following update rule can be applied:
u ← u + Σ_{i ∈ B} d_i y_i x_i.
The cost is also O(|B| × n)
Decomposition Method for Linear Classification (Cont’d)
Note that eventually
u → primal optimal w
One-variable Procedure
This is what people use now in practice
Let’s consider the optimization problem without the bias term
min_w  (1/2) w^T w + C Σ_{i=1}^l max(1 − y_i w^T x_i, 0)
The dual problem now doesn't have a linear constraint:
min_α  (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l
One-variable Procedure (Cont’d)
Here
f(α) ≡ (1/2) α^T Q α − e^T α
and
Q_ij = y_i y_j x_i^T x_j,  e = [1, ..., 1]^T
Without the linear constraint y^T α = 0, we can choose one variable at a time
One-variable Procedure (Cont’d)
Given the current α, let the working set be B = {i}. Let
e_i = [0, ..., 0, 1, 0, ..., 0]^T
The sub-problem is
min_d  f(α + d e_i) = (1/2) Q_ii d² + ∇_i f(α) d + constant
subject to  0 ≤ α_i + d ≤ C
Without constraints,
optimal d = −∇_i f(α) / Q_ii
One-variable Procedure (Cont’d)
Now with 0 ≤ α_i + d ≤ C:
α_i ← min(max(α_i − ∇_i f(α)/Q_ii, 0), C)
Note that
∇_i f(α) = (Qα)_i − 1 = Σ_{j=1}^l Q_ij α_j − 1 = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1
One-variable Procedure (Cont’d)
As before, define
u ≡ Σ_{j=1}^l y_j α_j x_j,
Easy gradient calculation: the cost is O(n)
∇_i f(α) = y_i u^T x_i − 1
Algorithm: Dual Coordinate Descent
Given initial α, find
u = Σ_i y_i α_i x_i.
While α is not optimal (outer iteration)
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
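A minimal Python/numpy sketch of this algorithm (my own illustration, not the LIBLINEAR implementation; it runs a fixed number of cyclic passes and omits shrinking and the stopping condition):

import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, max_outer=100):
    l, n = X.shape
    alpha = np.zeros(l)
    u = X.T @ (y * alpha)                    # u = sum_j y_j alpha_j x_j (zero at the start)
    Qdiag = np.einsum('ij,ij->i', X, X)      # Q_ii = x_i^T x_i
    for _ in range(max_outer):
        for i in range(l):                   # cyclic order; permuting range(l) each pass is faster
            alpha_old = alpha[i]
            G = y[i] * (u @ X[i]) - 1.0      # gradient at coordinate i, O(n)
            alpha[i] = min(max(alpha[i] - G / max(Qdiag[i], 1e-12), 0.0), C)
            if alpha[i] != alpha_old:
                u += (alpha[i] - alpha_old) * y[i] * X[i]   # maintain u, O(n)
    return alpha, u                          # at the end, u is the primal w

# Usage on toy data
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = dual_cd_linear_svm(X, y, C=1.0)
print(np.sign(X @ w))                        # predicted labels of the training points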
One-variable Procedure (Cont’d)
Maintaining u also costs O(n)
Thus the total cost is
O(n) × # iterations
Recall that the cost for kernel is
O(l × n) × # iterations
if we don't have b and select |B| = 1
We will explain some interesting differences between the two
Recap: Dual Coordinate Descent
It’s very simple: minimizing one variable at a time While α not optimal
For i = 1, . . . , l minαi
f (. . . , αi, . . .) A classic optimization technique
Traced back to Hildreth (1957) if constraints are not considered
So what’s new?
Recap: Dual Coordinate Descent (Cont’d)
Having
u ≡ Σ_{j=1}^l y_j α_j x_j,
∇_i f(α) = y_i u^T x_i − 1
and
u ← u + (α_i − ᾱ_i) y_i x_i   (ᾱ_i: old; α_i: new)
is very essential
This is another example where we take the problem structure into account
Recap: Dual Coordinate Descent (Cont’d)
Such a setting was first developed in Hsieh et al. (2008)
Careful Implementation
Some techniques can improve the running speed
Shrinking: remove αi if it is likely to be bounded at the end
Shrinking is easier to conduct than in the kernel case (details not shown)
Cyclic order to update the elements:
α_1 → α_2 → ··· → α_l
A random order gives faster convergence:
α_{π(1)} → α_{π(2)} → ··· → α_{π(l)}
We will come back to this issue later
Difference from the Kernel Case
• We have seen that coordinate-descent type methods are used for both linear and kernel classifiers
• Recall the i-th element of the gradient costs O(n) by
∇_i f(α) = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1 = (y_i x_i)^T (Σ_{j=1}^l y_j x_j α_j) − 1 = (y_i x_i)^T u − 1
but we cannot do this for kernel because
K(x_i, x_j) = φ(x_i)^T φ(x_j) cannot be separated
Difference from the Kernel Case (Cont’d)
If using kernel, the cost of calculating ∇if (α) must be O(ln)
However, if O(ln) cost is spent, the whole ∇f (α) can be maintained
∇f(α) ← ∇f(α) + Q_{:,i} d_i
In contrast, the linear setting of using u knows only
∇if (α) rather than the whole ∇f (α)
This situation affects the working set selection
Difference from the Kernel Case (Cont’d)
• We have mentioned that for existing decomposition methods (or say coordinate descent methods) for kernel classifiers, ∇f (α) information is used to select variables for update
• Therefore, we have the following situations for two working set selections
1 Greedy: Using ∇f (α)
2 Cyclic
                      Kernel               Linear
                 Greedy    Cyclic      Greedy    Cyclic
Update α_i        O(1)     O(ln)        O(1)      O(n)
Maintain ∇f(α)   O(ln)      NA         O(ln)       NA
Difference from the Kernel Case (Cont’d)
In optimization there are two types of coordinate descent methods
1 Gauss-Seidel: sequential selection of variables
2 Gauss-Southwell: greedy selection of variables
To do greedy selection, usually the whole gradient must be available
Existing coordinate descent methods for linear ⇒ related to Gauss-Seidel
Existing coordinate descent methods for kernel ⇒ related to Gauss-Southwell
Difference from the Kernel Case (Cont’d)
In general greedy leads to fewer iterations than cyclic
So is the cyclic setting for linear practically viable?
Working Set Selection
Without gradient information, it looks like we can only do a cyclic update
1, 2, 3, . . .
However, Hsieh et al. (2008, Section 3.1) showed that with random permutation, the convergence is much faster
Let’s try an example by using LIBLINEAR real-sim is a data set with
72,309 instances and 20,958 features
Working Set Selection (Cont’d)
With random permutation
$ ./train ~/Desktop/real-sim
optimization finished, #iter = 13
Objective value = -4396.370629
Here an iteration means to go through all instances once (though shrinking is applied)
Without random permutation (cyclic)
$ ./train ~/Desktop/real-sim
optimization finished, #iter = 326
Objective value = -4391.706239
Working Set Selection (Cont’d)
Here l1 loss is used
Both approaches are under the same stopping condition
Let’s see the algorithm with random permutation
Working Set Selection (Cont’d)
While α is not optimal (outer iteration)
    Randomly permute 1, ..., l
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
Note that it’s not useful if we only randomly permute data once
Working Set Selection (Cont’d)
Randomly permute 1, ..., l
While α is not optimal (outer iteration)
    For i = 1, ..., l (inner iteration)
        (a) ᾱ_i ← α_i
        (b) G = y_i u^T x_i − 1
        (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
        (d) If α_i needs to be changed
                u ← u + (α_i − ᾱ_i) y_i x_i
On the other hand, random CD works:
Working Set Selection (Cont’d)
While α is not optimal (outer iteration)
    Randomly select i
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) α_i ← min(max(α_i − G/Q_ii, 0), C)
    (d) If α_i needs to be changed
            u ← u + (α_i − ᾱ_i) y_i x_i
Turns out it is easier to analyze the complexity of the random CD; see Shalev-Shwartz and Zhang (2013) and many subsequent works
Working Set Selection (Cont’d)
It’s difficult to analyze the setting of using random permutations. Some recent attempts (for simplified situations) include Lee and Wright (2016)
This is still an ongoing research issue
Another line of research is to cheaply find some important indices.
See Glasmachers and Dogan (2013) and other developments
In other words, we think some distributions are better than uniform
Working Set Selection (Cont’d)
However, it is difficult to beat the uniform setting if shrinking has been applied
Working Set Selection and Bias Term
Recall if we use
sgn(w^T x + b)
as the decision function, then the dual problem has a linear constraint:
y^T α = 0
Then at each iteration, two indices {i, j} must be selected
Working Set Selection and Bias Term (Cont’d)
If we use a cyclic setting, this means
(1, 2), (1, 3), (1, 4), . . . , (2, 3), . . .
The training may be terribly slow because for many pairs we cannot change α at all
Therefore, an interesting situation is as follows
Working set selection     Without bias   With bias   Used for
cyclic                        OK           not OK     linear
greedy (with ∇f(α))           OK             OK       kernel
Working Set Selection and Bias Term (Cont’d)
Fortunately, linear classification is often used to handle large data with many sparse features
In such a high dimensional space, the bias term is often not needed
Even if it is, there is a trick of adding the bias term to the objective function:
min_{w,b}  (1/2) w^T w + (1/2) b² + C Σ_{i=1}^l max(1 − y_i [w; b]^T [x_i; 1], 0)
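In code the trick is just appending a constant feature 1 to every instance, so that b becomes the last component of w and a solver for the bias-free dual can be reused unchanged; a short sketch (assuming the X, y, and dual_cd_linear_svm from the earlier sketch):

import numpy as np

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])      # append the constant feature 1
alpha, w_aug = dual_cd_linear_svm(X_aug, y, C=1.0)    # same solver: no y^T alpha = 0 constraint
w, b = w_aug[:-1], w_aug[-1]
print(np.sign(X @ w + b))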
Working Set Selection and Bias Term (Cont’d)
The dual no longer has the linear constraint y^T α = 0
However, for some problems (e.g., one-class SVM), a linear constraint must be there (details not shown)
Then training by coordinate descent for the linear case can be an issue
Other Losses
Our discussion so far is for l1 loss
All results can be applied to other losses such as l2 loss, logistic loss, etc.
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
More Optimization Methods can be Applied for Linear
Recall that
w = Σ_{i=1}^l y_i α_i φ(x_i)
Kernel: can only solve an optimization problem in α because w is too high dimensional
Linear: can solve either in w or in α
We will show an example of minimizing over w
Newton Method
Let’s minimize a twice-differentiable function minw f (w )
For example, logistic regression has minw
1
2wTw + C
l
X
i =1
log
1 + e−yiwTxi
. Newton direction at iterate wk
mins ∇f (wk)Ts + 1
2sT∇2f (wk)s
Truncated Newton Method
The above sub-problem is equivalent to solving the Newton linear system
∇²f(w^k) s = −∇f(w^k)
Approximately solving the linear system ⇒ truncated Newton
However, the Hessian matrix ∇²f(w^k) is too large to be stored
∇²f(w^k): n × n, n: number of features
For document data, n can be millions or more
Using Properties of Data Classification
But the Hessian has a special form:
∇²f(w) = I + C X^T D X,  D is diagonal
For logistic regression,
D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})²
X: data, # instances × # features
X = [x_1^T; ...; x_l^T] ∈ R^{l×n}
Using Properties of Data Classification (Cont’d)
We use Conjugate Gradient (CG) to solve the linear system.
CG is an iterative procedure. Each CG step mainly needs one Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
Therefore, we have a Hessian-free approach
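A sketch of one Hessian-free Newton step for logistic regression (my own Python code; numpy and scipy assumed; the line search / trust region safeguards of LIBLINEAR are omitted):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_direction(w, X, y, C):
    l, n = X.shape
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))       # sigma(y_i w^T x_i)
    grad = w + C * (X.T @ ((sigma - 1.0) * y))       # gradient of f(w)
    D = sigma * (1.0 - sigma)                        # diagonal entries D_ii

    def hess_vec(d):                                 # Hessian-vector product, O(ln)
        return d + C * (X.T @ (D * (X @ d)))

    H = LinearOperator((n, n), matvec=hess_vec)      # the Hessian is never formed
    s, _ = cg(H, -grad, maxiter=50)                  # truncated Newton: a few CG steps
    return s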
Using Properties of Data Classification (Cont’d)
Now the procedure has two layers of iterations
Outer: Newton iterations
Inner: CG iterations within each Newton iteration
Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005) and Lin et al. (2008)
Second-order information is used: faster convergence than first-order methods
Training L2-loss SVM
The loss function is differentiable but not twice differentiable
ξ_L2(w; x, y) ≡ max(0, 1 − y w^T x)²
We can use the generalized Hessian (Mangasarian, 2002); details not shown here
Preconditioning
Each Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
costs O(ln), where X ∈ R^{l×n}
Each function/gradient evaluation also costs O(ln);
details omitted
Therefore, the cost of each Newton iteration is roughly
O(ln) × # CG steps
Preconditioning (Cont’d)
Can we reduce the number of CG steps by preconditioning?
This classic technique finds
E E^T ≈ ∇²f(w)
so the linear system
∇²f(w) s = −∇f(w)
becomes
E^{−1} ∇²f(w) E^{−T} ŝ = −E^{−1} ∇f(w)
Preconditioning (Cont’d)
Note that
s = E^{−T} ŝ
If
E^{−1} ∇²f(w^k) E^{−T} ≈ I,
then the # of CG steps can be reduced
For example, we use the diagonal preconditioner so that
E E^T = diag(∇²f(w))
and
E = √(diag(∇²f(w)))
Preconditioning (Cont’d)
Because
∇²f(w) = I + C X^T D X,
we have
∇²_{jj} f(w) = 1 + C Σ_{i=1}^l D_ii X_ij²
The cost of constructing the preconditioner is
O(ln)
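A sketch of constructing this diagonal preconditioner, continuing the notation (X, D, C) of the earlier Newton sketch; handing M ≈ H^{-1} to scipy's cg plays the role of the E^{-1}(·)E^{-T} transformation:

import numpy as np
from scipy.sparse.linalg import LinearOperator

def diag_preconditioner(X, D, C):
    h = 1.0 + C * (D @ (X * X))           # H_jj = 1 + C * sum_i D_ii X_ij^2, one O(ln) pass
    n = X.shape[1]
    return LinearOperator((n, n), matvec=lambda v: v / h)   # applying it is O(n) per CG step

# Usage inside the Newton step: s, _ = cg(H, -grad, M=diag_preconditioner(X, D, C), maxiter=50)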
Preconditioning (Cont’d)
In each CG step, doing
(E^{−1} or E^{−T}) × a vector
costs O(n)
Therefore, the extra cost of diagonal preconditioning at each Newton iteration is
O(n) × # CG steps + O(ln)
This is acceptable compared with what we already need:
O(ln) × # CG steps
Preconditioning (Cont’d)
However, the issue is that the number of CG steps may not be decreased
In the area of numerical analysis, preconditioning is an art rather than a science
Further, because of using a Hessian-free approach, many existing preconditioners are not directly applicable
This is still a research issue
Sub-sampled Newton Methods
The optimization problem can be rewritten as
min_w  w^T w / (2Cl) + (1/l) Σ_{i=1}^l max(1 − y_i w^T x_i, 0)
The second term is indeed the average loss
A subset of the data should give a similar gradient or Hessian!!
This kind of setting has been explored in Byrd et al. (2011); Martens (2010)
Clearly this is an example of exploiting properties of machine learning problems
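A sketch of the sub-sampling idea for the Hessian (my own illustration): because the Hessian is a sum over instances, a random subset S, rescaled by l/|S|, can replace the full data in the Hessian-vector product.

import numpy as np

def subsampled_hess_vec(d, X, D, C, sample_frac=0.1, seed=0):
    # Approximate (I + C X^T D X) d using only a random subset S of the instances
    l = X.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.choice(l, size=max(1, int(sample_frac * l)), replace=False)
    scale = l / len(S)
    return d + C * scale * (X[S].T @ (D[S] * (X[S] @ d)))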
Lesson Learned from Kernel to Linear
We must know the practical machine-learning use in order to decide if new optimization algorithms are needed for certain problems
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Comparisons
L2-loss SVM is used
DCDL2: Dual coordinate descent (Hsieh et al., 2008)
DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)
PCD: Primal coordinate descent (Chang et al., 2008)
TRON: Trust region Newton method (Lin et al., 2008)
Objective values vs. training time (in seconds)
[Figures: results on news20, rcv1, yahoo-japan, and yahoo-korea]
Analysis
Dual coordinate descent is very effective if # data and # features are both large
Useful for document classification: half a million data points in a few seconds
However, it is less effective if
# features is small: we should solve the primal; or
the penalty parameter C is large: problems are more ill-conditioned
An Example When # Features Small
# instances: 32,561, # features: 123
[Figures: objective value and accuracy vs. training time]
Careful Evaluation
This is very very important
Let me give you one real personal experience
Careful Evaluation (Cont’d)
In Lin et al. (2008), to check whether the diagonal preconditioner works in Newton methods, we give the following table of the total number of CG steps

Problem        CG    PCG
a9a           567    263
real-sim      104    160
news20         71    155
citeseer      113    115
yahoo-japan   278    326
rcv1          225    280
yahoo-korea   779    736
Careful Evaluation (Cont’d)
Clearly, PCG may not be better
But this conclusion is misleading
In this experiment, a strict stopping condition is used
Careful Evaluation (Cont’d)
Let’s look at this figure for yahoo-japan
[Figure: (f − f*)/f* versus the number of CG iterations for yahoo-japan, comparing CG and diagonal preconditioning]
You see four horizontal lines
Careful Evaluation (Cont’d)
The second line corresponds to the default LIBLINEAR stopping condition
The 4 lines are obtained by using
10ε, ε, 0.1ε, 0.01ε,
where ε is the default LIBLINEAR stopping tolerance
Things below/above these 4 lines are not useful
They are too loose or too strict for the machine learning task
Careful Evaluation (Cont’d)
Another issue is which regularization parameter C should be used in presenting results
In the table, we use C = 1, the simplest choice
In the figure, we use
C_Best = 0.5,
where C_Best is the value leading to the best validation accuracy.
A reasonable setting is to show figures for C = C_Best × {0.01, 0.1, 1, 10, 100}
Careful Evaluation (Cont’d)
The reason is that in practice we start from a small C to search for C_Best.
Values larger than 100 × C_Best are not important
Lesson: even in comparing optimization algorithms for machine learning, we need to take machine learning properties into account
Outline
1 Introduction: why optimization and machine learning are related?
2 Optimization methods for kernel support vector machines
   Decomposition methods
3 Optimization methods for linear classification
   Decomposition method
   Newton methods
   Experiments
4 Multi-core implementation
5 Discussion and conclusions
Multi-core Implementation
Nowadays each computer has several cores
They can do parallel computation
However, most machine learning algorithms (including those we have discussed) do not take parallel computation into account
Multi-core Implementation (Cont’d)
In fact, algorithms may need to be redesigned
Recently we did two studies
1 Newton method for solving the primal problem (Lee et al., 2015)
2 Coordinate descent method for solving the dual problem (Chiang et al., 2016)
We will discuss the Newton method in detail
Parallel Newton Implementation
Recall the bottleneck is the Hessian-vector product
∇²f(w) d = d + C · X^T (D (X d))
This is conducted at each CG step
Matrix-vector Multiplications: More Than 90% of the Training Time
Data set #instances #features ratio
kddb 19,264,097 29,890,095 82.11%
url combined 2,396,130 3,231,961 94.83%
webspam 350,000 16,609,143 97.95%
rcv1 binary 677,399 47,236 97.88%
covtype binary 581,012 54 89.20%
epsilon normalized 400,000 2,000 99.88%
rcv1 multiclass 518,571 47,236 97.04%
covtype multiclass 581,012 54 89.06%
Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)
This result is from Newton methods using one core
We should parallelize the matrix-vector multiplications
For ∇²f(w) d we must calculate
u = X d        (5)
u ← D u        (6)
ū = X^T u, where u = D X d        (7)
Because D is diagonal, (6) is easy
Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)
We will discuss the parallelization of (5) and (7)
They are more complicated
For our problems, X is large and sparse
Efficient parallel sparse matrix-vector multiplications are still a research issue
Parallel X d Operation
Assume that X is in a row-oriented sparse format
X = [x_1^T; ...; x_l^T]  and  u = X d = [x_1^T d; ...; x_l^T d],
we have the following simple loop:
for i = 1, ..., l do
    u_i = x_i^T d
end for
Because the l inner products are independent, we can easily parallelize the loop by, for example, OpenMP