### Large-scale Linear and Kernel Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

MSR India Summer School 2015 on Machine Learning

Given training data in different classes (labels known), predict test data (labels unknown)

Classic example: medical diagnosis

- Find a patient's blood pressure, weight, etc.
- After several years, know if he/she recovers
- Build a machine learning model
- New patient: find blood pressure, weight, etc. ⇒ prediction

The two phases are called training and testing

Chih-Jen Lin (National Taiwan Univ.) 2 / 121

### Data Classification (Cont’d)

Among many classification methods, linear and kernel are two popular ones

They are closely related

We will discuss these two topics in detail in this lecture

Talk slides: http://www.csie.ntu.edu.tw/~cjlin/talks/msri.pdf

1 Linear classification

2 Kernel classification

3 Linear versus kernel classification

4 Solving optimization problems

5 Big-data linear classification

6 Discussion and conclusions



### Outline

1 Linear classification: maximum margin, regularization and losses, other derivations



### Linear Classification

Training vectors: x_i, i = 1, ..., l

These are feature vectors. For example, a patient = [height, weight, ...]^T

Consider a simple case with two classes. Define an indicator vector y ∈ R^l:

y_i = 1 if x_i in class 1, −1 if x_i in class 2

We seek a hyperplane to linearly separate all data


(Figure: two linearly separable classes of circles and triangles, with the lines w^T x + b = +1, 0, −1)

A separating hyperplane: w^T x + b = 0

w^T x_i + b ≥ 1 if y_i = 1
w^T x_i + b ≤ −1 if y_i = −1

Decision function: f(x) = sgn(w^T x + b), where x is a test instance

Many possible choices of w and b

### Maximal Margin

Maximizing the distance between w^T x + b = 1 and w^T x + b = −1:

2/‖w‖ = 2/√(w^T w)

A quadratic programming problem:

min_{w,b} (1/2) w^T w
subject to y_i(w^T x_i + b) ≥ 1, i = 1, ..., l.

This is the basic formulation of support vector machines (Boser et al., 1992)
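As a concrete illustration (not from the slides), this QP can be handed to a general-purpose solver on a tiny 1-D data set; a sketch assuming scipy's SLSQP is available:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny 1-D data set: x1 = 1 in class +1, x2 = 0 in class -1
X = np.array([[1.0], [0.0]])
y = np.array([1.0, -1.0])

def objective(v):          # v = [w..., b]
    w = v[:-1]
    return 0.5 * (w @ w)   # (1/2) w^T w

# One margin constraint y_i (w^T x_i + b) >= 1 per instance
cons = [{"type": "ineq",
         "fun": lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, np.zeros(2), constraints=cons)
w, b = res.x
```

For this data set the optimum is (w, b) = (2, −1), the same toy problem revisited later in the primal-dual example.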


### Data May Not Be Linearly Separable

An example:

(Figure: interleaved circles and triangles that no hyperplane can separate)

We can never find a linear hyperplane to separate data

Remedy: allow training errors

### Data May Not Be Linearly Separable (Cont’d)

Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)

min_{w,b,ξ} (1/2) w^T w + C ∑_{i=1}^l ξ_i
subject to y_i(w^T x_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, ..., l.

We explain later why this method is called support vector machine


### The Bias Term b

Recall the decision function is sgn(w^T x + b). Sometimes the bias term b is omitted:

sgn(w^T x)

That is, the hyperplane always passes through the origin. This is fine if the number of features is not too small.

In our discussion, b is used for kernel but omitted for linear classification (due to some historical reasons)


### Equivalent Optimization Problem

• Recall the SVM optimization problem (without b) is

min_{w,ξ} (1/2) w^T w + C ∑_{i=1}^l ξ_i
subject to y_i w^T x_i ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, ..., l.

• It is equivalent to

min_w (1/2) w^T w + C ∑_{i=1}^l max(0, 1 − y_i w^T x_i) (1)

• This reformulation is useful for the subsequent discussion

### Equivalent Optimization Problem (Cont’d)

That is, at optimum,

ξ_i = max(0, 1 − y_i w^T x_i)

Reason: the constraints give ξ_i ≥ 1 − y_i w^T x_i and ξ_i ≥ 0, but we also want to minimize ξ_i


### Equivalent Optimization Problem (Cont’d)

We now derive the same optimization problem (1) from a different viewpoint:

min_w (training errors)

To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x_i, y_i). Ideally we should use the 0–1 training loss:

ξ(w; x, y) = 1 if y w^T x < 0, 0 otherwise

### Equivalent Optimization Problem (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult.

(Figure: the 0–1 loss as a function of −y w^T x)

We need continuous approximations


### Common Loss Functions

Hinge loss (l1 loss):

ξ_{L1}(w; x, y) ≡ max(0, 1 − y w^T x) (2)

Squared hinge loss (l2 loss):

ξ_{L2}(w; x, y) ≡ max(0, 1 − y w^T x)^2 (3)

Logistic loss:

ξ_{LR}(w; x, y) ≡ log(1 + e^{−y w^T x}) (4)

SVM: (2)–(3). Logistic regression (LR): (4)
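A small sketch of the three losses (2)–(4) for a single instance; numpy is assumed and the function names are mine:

```python
import numpy as np

def hinge_l1(w, x, y):
    """Hinge loss (2): max(0, 1 - y w^T x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_l2(w, x, y):
    """Squared hinge loss (3)."""
    return hinge_l1(w, x, y) ** 2

def logistic(w, x, y):
    """Logistic loss (4): log(1 + exp(-y w^T x))."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

# An instance on the margin: y w^T x = 1
w, x = np.array([1.0, -1.0]), np.array([0.5, -0.5])
```

At margin y w^T x ≥ 1 both hinge losses vanish, while the logistic loss stays positive.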

### Common Loss Functions (Cont’d)

(Figure: ξ_{L1}, ξ_{L2}, and ξ_{LR} as functions of −y w^T x)

Logistic regression is closely related to SVM; their performance is usually similar


### Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

### Overfitting

See the illustration in the next slide.

For classification, you can easily achieve 100% training accuracy; this is useless.

When training on a data set, we should
- avoid underfitting: small training error
- avoid overfitting: small testing error


(Figure: ● and ■ are training data; ○ and △ are testing data, illustrating overfitting)

### Regularization

In training we manipulate the w vector so that it fits the data. So we need a way to make w's values less extreme. One idea is to make the objective function smoother


### General Form of Linear Classification

Training data {y_i, x_i}, x_i ∈ R^n, i = 1, ..., l, y_i = ±1

l: # of data, n: # of features

min_w f(w), f(w) ≡ w^T w / 2 + C ∑_{i=1}^l ξ(w; x_i, y_i) (5)

w^T w / 2: regularization term
ξ(w; x, y): loss function
C: regularization parameter
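The general form (5) is easy to evaluate in vectorized code; a sketch with my own names, covering the three losses above:

```python
import numpy as np

def f_objective(w, X, y, C, loss="l1"):
    """f(w) in (5): w^T w / 2 + C * sum_i xi(w; x_i, y_i)."""
    margins = y * (X @ w)                       # y_i w^T x_i for all i
    if loss == "l1":                            # hinge
        xi = np.maximum(0.0, 1.0 - margins)
    elif loss == "l2":                          # squared hinge
        xi = np.maximum(0.0, 1.0 - margins) ** 2
    else:                                       # logistic
        xi = np.log1p(np.exp(-margins))
    return 0.5 * (w @ w) + C * xi.sum()

# Both margins equal 1, so the hinge terms vanish
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])
```

Here f(w) reduces to the regularization term w^T w / 2 = 1 for the hinge losses, while the logistic loss adds a positive amount.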

### General Form of Linear Classification (Cont’d)

If the hinge loss

ξ_{L1}(w; x, y) ≡ max(0, 1 − y w^T x)

is used, then (5) goes back to the SVM problem described earlier (b omitted):

min_{w,ξ} (1/2) w^T w + C ∑_{i=1}^l ξ_i
subject to y_i w^T x_i ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, ..., l.


### Solving Optimization Problems

We have an unconstrained problem, so many existing unconstrained optimization techniques can be used. However,

- ξ_{L1}: not differentiable
- ξ_{L2}: differentiable but not twice differentiable
- ξ_{LR}: twice differentiable

so we may need different types of optimization methods. Details of solving the optimization problems will be discussed later


### Logistic Regression

Logistic regression can be traced back to the 19th century

It is mainly from the statistics community, so many people wrongly think that this method is very different from SVM

Indeed, from what we have shown, they are closely related.

Let’s see how to derive it from a statistical viewpoint

### Logistic Regression (Cont’d)

For a label-feature pair (y, x), assume the probability model

p(y | x) = 1 / (1 + e^{−y w^T x}).

Note that

p(1 | x) + p(−1 | x) = 1/(1 + e^{−w^T x}) + 1/(1 + e^{w^T x})
= e^{w^T x}/(1 + e^{w^T x}) + 1/(1 + e^{w^T x}) = 1

w is the parameter to be decided


### Logistic Regression (Cont’d)

Idea of this model:

p(1 | x) = 1/(1 + e^{−w^T x}) → 1 if w^T x ≫ 0, → 0 if w^T x ≪ 0

Assume the training instances are (y_i, x_i), i = 1, ..., l

### Logistic Regression (Cont’d)

Logistic regression finds w by maximizing the following likelihood:

max_w ∏_{i=1}^l p(y_i | x_i) (6)

Negative log-likelihood:

− log ∏_{i=1}^l p(y_i | x_i) = − ∑_{i=1}^l log p(y_i | x_i) = ∑_{i=1}^l log(1 + e^{−y_i w^T x_i})


### Logistic Regression (Cont’d)

Logistic regression:

min_w ∑_{i=1}^l log(1 + e^{−y_i w^T x_i})

Regularized logistic regression:

min_w (1/2) w^T w + C ∑_{i=1}^l log(1 + e^{−y_i w^T x_i}) (7)

C: regularization parameter decided by users
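Problem (7) is smooth and unconstrained, so a generic solver can minimize it directly on toy data; a sketch assuming scipy is available (np.logaddexp(0, z) = log(1 + e^z) is a numerically stable form of the loss):

```python
import numpy as np
from scipy.optimize import minimize

def reg_lr(w, X, y, C):
    """Regularized logistic regression objective (7)."""
    z = -y * (X @ w)
    return 0.5 * (w @ w) + C * np.logaddexp(0.0, z).sum()

# Toy 1-D data: +1 at x = 1, -1 at x = -1
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
res = minimize(reg_lr, np.zeros(1), args=(X, y, 1.0))
w = res.x
```

The learned w is positive, i.e., p(1 | x) > 1/2 for the positive point; the regularization term keeps |w| finite even though these data are separable.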

### Discussion

We see that the same method can be derived in different ways.

SVM:
- maximal margin
- regularization and training losses

LR:
- regularization and training losses
- maximum likelihood



### Outline

2 Kernel classification: nonlinear mapping, kernel tricks



### Data May Not Be Linearly Separable

This is an earlier example:

(Figure: the same interleaved circles and triangles that no hyperplane can separate)

In addition to allowing training errors, what else can we do?

For this data set, shouldn’t we use a nonlinear classifier?


### Mapping Data to a Higher Dimensional Space

But modeling nonlinear curves directly is difficult. Instead, we map data to a higher dimensional space:

φ(x) = [φ_1(x), φ_2(x), ...]^T

For example, weight/height^2 is a useful new feature to check whether a person is overweight or not

### Kernel Support Vector Machines

Linear SVM:

min_{w,b,ξ} (1/2) w^T w + C ∑_{i=1}^l ξ_i
subject to y_i(w^T x_i + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, ..., l.

Kernel SVM:

min_{w,b,ξ} (1/2) w^T w + C ∑_{i=1}^l ξ_i
subject to y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i,
ξ_i ≥ 0, i = 1, ..., l.


### Kernel Logistic Regression

min_{w,b} (1/2) w^T w + C ∑_{i=1}^l log(1 + e^{−y_i(w^T φ(x_i) + b)})

### Difficulties After Mapping Data to a High-dimensional Space

# variables in w = dimension of φ(x), so there are infinitely many variables if φ(x) is infinite dimensional

We cannot do an infinite-dimensional inner product to predict a test instance: sgn(w^T φ(x))

The kernel trick lets us go back to a finite number of variables



### Kernel Tricks

It can be shown that at optimum

w = ∑_{i=1}^l y_i α_i φ(x_i)

(details not provided here)

For special φ(x), the decision function becomes

sgn(w^T φ(x)) = sgn(∑_{i=1}^l y_i α_i φ(x_i)^T φ(x)) = sgn(∑_{i=1}^l y_i α_i K(x_i, x))


### Kernel Tricks (Cont’d)

φ(x_i)^T φ(x_j) needs a closed form. Example: x_i ∈ R^3, φ(x_i) ∈ R^10:

φ(x_i) = [1, √2 (x_i)_1, √2 (x_i)_2, √2 (x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, √2 (x_i)_1 (x_i)_2, √2 (x_i)_1 (x_i)_3, √2 (x_i)_2 (x_i)_3]^T

Then

φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.

Kernel: K(x, y) = φ(x)^T φ(y). Common kernels:

e^{−γ ‖x_i − x_j‖^2} (radial basis function)
(x_i^T x_j / a + b)^d (polynomial kernel)
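The closed form of the degree-2 polynomial map can be checked numerically; a quick sketch:

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial map R^3 -> R^10 from the slide."""
    s = np.sqrt(2.0)
    return np.array([1.0,
                     s * x[0], s * x[1], s * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
lhs = phi(xi) @ phi(xj)           # inner product in R^10
rhs = (1.0 + xi @ xj) ** 2        # kernel value, computed directly in R^3
```

The two numbers agree exactly: the kernel evaluates a 10-dimensional inner product at 3-dimensional cost.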

K(x, y) can be an inner product in an infinite dimensional space. Assume x ∈ R^1 and γ > 0:

e^{−γ (x_i − x_j)^2} = e^{−γ x_i^2 + 2γ x_i x_j − γ x_j^2}
= e^{−γ x_i^2 − γ x_j^2} (1 + (2γ x_i x_j)/1! + (2γ x_i x_j)^2/2! + (2γ x_i x_j)^3/3! + ···)
= e^{−γ x_i^2 − γ x_j^2} (1·1 + √(2γ/1!) x_i · √(2γ/1!) x_j + √((2γ)^2/2!) x_i^2 · √((2γ)^2/2!) x_j^2 + √((2γ)^3/3!) x_i^3 · √((2γ)^3/3!) x_j^3 + ···)
= φ(x_i)^T φ(x_j),

where

φ(x) = e^{−γ x^2} [1, √(2γ/1!) x, √((2γ)^2/2!) x^2, √((2γ)^3/3!) x^3, ···]^T.



### Outline

3 Linear versus kernel classification: comparison on the cost, numerical comparisons



### Linear and Kernel Classification

Now we see that methods such as SVM and logistic regression can be used in two ways:

- Kernel methods: data mapped to a higher dimensional space, x ⇒ φ(x); φ(x_i)^T φ(x_j) easily calculated; little control on φ(·)
- Linear classification + feature engineering: we have x without mapping. Alternatively, we can say that φ(x) is our x; full control on x or φ(x)


### Linear and Kernel Classification

The cost of using linear and kernel classification is different. Let's check the prediction cost:

w^T x versus ∑_{i=1}^l y_i α_i K(x_i, x)

If each K(x_i, x_j) takes O(n), then the costs are O(n) versus O(ln). Linear is much cheaper.

A similar difference occurs for training
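The two prediction rules can be sketched side by side; the O(n) versus O(ln) gap is visible directly in the code shapes (illustrative names, not an official API):

```python
import numpy as np

def predict_linear(w, x):
    """O(n): a single inner product."""
    return np.sign(w @ x)

def predict_kernel_rbf(alpha, y_sv, X_sv, x, gamma=1.0):
    """O(l n): one RBF kernel evaluation per (support) vector."""
    k = np.exp(-gamma * np.sum((X_sv - x) ** 2, axis=1))
    return np.sign((alpha * y_sv) @ k)
```

The linear rule touches x once; the kernel rule must touch every stored training vector for every prediction.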

### Linear and Kernel Classification (Cont’d)

In fact, linear is a special case of kernel: we can prove that the accuracy of linear is the same as that of the Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)

Therefore, roughly we have

accuracy: kernel ≥ linear
cost: kernel ≫ linear

Speed is the reason to use linear


### Linear and Kernel Classification (Cont’d)

For some problems, the accuracy of linear is as good as nonlinear, but training and testing are much faster

This particularly happens for document classification:
- the number of features (bag-of-words model) is very large
- the data are very sparse (i.e., few non-zeros)


### Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

| Data set | Linear time (s) | Linear accuracy | RBF kernel time (s) | RBF kernel accuracy |
| --- | --- | --- | --- | --- |
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Sizes are reasonably large: e.g., yahoo-japan has 140k instances and 830k features



### Outline

4 Solving optimization problems: kernel decomposition methods; linear coordinate descent; linear second-order methods; experiments


### Dual Problem

Recall we said that the difficulty after mapping x to φ(x) is the huge number of variables. We mentioned that

w = ∑_{i=1}^l α_i y_i φ(x_i) (8)

and used kernels for prediction.

Besides prediction, we must also do training via kernels. The most common way to train SVM via kernels is through its dual problem

### Dual Problem (Cont’d)

The dual problem:

min_α (1/2) α^T Q α − e^T α
subject to 0 ≤ α_i ≤ C, i = 1, ..., l,
y^T α = 0,

where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T

From the primal-dual relationship, at optimum (8) holds

The dual problem has a finite number of variables


### Example: Primal-dual Relationship

Consider the earlier example:

(Figure: 1-D data with x_1 = 1 in class +1 and x_2 = 0 in class −1)

The two data points are x_1 = 1, x_2 = 0 with y = [+1, −1]^T

The solution is (w, b) = (2, −1)

### Example: Primal-dual Relationship (Cont’d)

The dual objective function:

(1/2) [α_1 α_2] [1 0; 0 0] [α_1; α_2] − [1 1] [α_1; α_2] = (1/2) α_1^2 − (α_1 + α_2)

In optimization, "objective function" means the function to be optimized.

The constraints are

α_1 − α_2 = 0, 0 ≤ α_1, 0 ≤ α_2.


### Example: Primal-dual Relationship (Cont’d)

Substituting α_2 = α_1 into the objective function,

(1/2) α_1^2 − 2α_1

has the smallest value at α_1 = 2.

Because [2, 2]^T satisfies the constraints 0 ≤ α_1 and 0 ≤ α_2, it is optimal

### Example: Primal-dual Relationship (Cont’d)

Using the primal-dual relation,

w = y_1 α_1 x_1 + y_2 α_2 x_2 = 1 · 2 · 1 + (−1) · 2 · 0 = 2

This is the same as the solution obtained from the primal problem.
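This toy dual can be handed to a generic constrained solver, after which the primal-dual relation recovers w = 2; a sketch assuming scipy is available:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 0.0])                 # x_1 = 1, x_2 = 0
y = np.array([1.0, -1.0])
Q = np.outer(y, y) * np.outer(x, x)      # Q_ij = y_i y_j x_i x_j = [[1,0],[0,0]]

dual = lambda a: 0.5 * (a @ Q @ a) - a.sum()
res = minimize(dual, np.zeros(2),
               bounds=[(0.0, None)] * 2,                    # alpha >= 0
               constraints=[{"type": "eq", "fun": lambda a: y @ a}])
alpha = res.x
w = (y * alpha) @ x                      # primal-dual relation (8)
```

The solver returns alpha ≈ [2, 2] and w = 2, matching the hand calculation above.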


### Decision function

At optimum, w = ∑_{i=1}^l α_i y_i φ(x_i). Decision function:

w^T φ(x) + b = ∑_{i=1}^l α_i y_i φ(x_i)^T φ(x) + b = ∑_{i=1}^l α_i y_i K(x_i, x) + b

Recall 0 ≤ α_i ≤ C in the dual problem

### Support Vectors

Only the x_i with α_i > 0 are used ⇒ support vectors

(Figure: a decision boundary determined only by the support vectors)


### Large Dense Quadratic Programming

min_α (1/2) α^T Q α − e^T α
subject to 0 ≤ α_i ≤ C, i = 1, ..., l,
y^T α = 0

Q_ij ≠ 0, so Q is an l-by-l fully dense matrix

50,000 training points ⇒ 50,000 variables: (50,000^2 × 8 / 2) bytes = 10 GB RAM to store Q

### Large Dense Quadratic Programming (Cont’d)

Traditional optimization methods cannot be directly applied here because Q cannot even be stored

Currently, decomposition methods (a type of coordinate descent method) are what is used in practice


### Decomposition Methods

Work on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998); this is similar to coordinate-wise minimization

Working set B; N = {1, ..., l}\B is fixed. Sub-problem at the kth iteration:

min_{α_B} (1/2) [α_B^T (α_N^k)^T] [Q_BB Q_BN; Q_NB Q_NN] [α_B; α_N^k] − [e_B^T e_N^T] [α_B; α_N^k]

subject to 0 ≤ α_t ≤ C, t ∈ B, y_B^T α_B = −y_N^T α_N^k

### Avoid Memory Problems

The new objective function:

(1/2) α_B^T Q_BB α_B + (−e_B + Q_BN α_N^k)^T α_B + constant

Only B columns of Q are needed. In general |B| ≤ 10 is used. We need |B| ≥ 2 because of the linear constraint

y_B^T α_B = −y_N^T α_N^k

Columns are calculated when used: trade time for space. But is such an approach practical?


### How Decomposition Methods Perform?

Convergence is not very fast; this is expected because only first-order information is used

But there is no need for a very accurate α. The decision function is

∑_{i=1}^l y_i α_i K(x_i, x) + b

and the prediction may still be correct with a rough α

Further, in some situations, # support vectors ≪ # training points. With the initial α^1 = 0, some instances are never used

### How Decomposition Methods Perform? (Cont'd)

An example of training 50,000 instances using the software LIBSVM (|B| = 2):

$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370
Time 79.524s

This was done on a typical desktop. Calculating the whole Q takes more time.

#SVs = 3,370 ≪ 50,000: a good case where some α_i remain at zero all the time



### Coordinate Descent Methods for Linear Classification

We consider L1-loss SVM as an example here; the same method can be extended to the L2 and logistic losses

More details are in Hsieh et al. (2008); Yu et al. (2011)


### SVM Dual (Linear without Kernel)

From the primal-dual relationship:

min_α f(α)
subject to 0 ≤ α_i ≤ C, ∀i,

where

f(α) ≡ (1/2) α^T Q α − e^T α and Q_ij = y_i y_j x_i^T x_j, e = [1, ..., 1]^T

There is no linear constraint y^T α = 0 because there is no bias term b

### Dual Coordinate Descent

Very simple: minimize one variable at a time

While α is not optimal
  For i = 1, ..., l
    min_{α_i} f(..., α_i, ...)

A classic optimization technique, traced back to Hildreth (1957) if constraints are not considered


### The Procedure

Given the current α, let e_i = [0, ..., 0, 1, 0, ..., 0]^T. Then

min_d f(α + d e_i) = (1/2) Q_ii d^2 + ∇_i f(α) d + constant

Without constraints, the optimal d = −∇_i f(α)/Q_ii. With the constraint 0 ≤ α_i + d ≤ C:

α_i ← min(max(α_i − ∇_i f(α)/Q_ii, 0), C)

### The Procedure (Cont’d)

∇_i f(α) = (Qα)_i − 1 = ∑_{j=1}^l Q_ij α_j − 1 = ∑_{j=1}^l y_i y_j x_i^T x_j α_j − 1

Directly calculating the gradient costs O(ln); l: # data, n: # features

For linear SVM, define

u ≡ ∑_{j=1}^l y_j α_j x_j

Then the gradient calculation costs only O(n):

∇_i f(α) = y_i u^T x_i − 1


### The Procedure (Cont’d)

All we need is to maintain

u = ∑_{j=1}^l y_j α_j x_j

If ᾱ_i is the old value and α_i the new one, then

u ← u + (α_i − ᾱ_i) y_i x_i

This also costs O(n)

### Algorithm: Dual Coordinate Descent

Given initial α, find u = ∑_i y_i α_i x_i.

While α is not optimal (outer iteration)
  For i = 1, ..., l (inner iteration)
    (a) ᾱ_i ← α_i
    (b) G = y_i u^T x_i − 1
    (c) If α_i can be changed
        α_i ← min(max(α_i − G/Q_ii, 0), C)
        u ← u + (α_i − ᾱ_i) y_i x_i
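The steps above translate almost line by line into code; a minimal sketch for the L1-loss dual (no shrinking, a fixed number of outer iterations, and the "if α_i can be changed" check omitted for clarity):

```python
import numpy as np

def dual_cd(X, y, C=1.0, outer_iters=50):
    """Dual coordinate descent for L1-loss linear SVM (steps (a)-(c))."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                          # u = sum_j y_j alpha_j x_j
    Qii = (X * X).sum(axis=1)                # Q_ii = x_i^T x_i (y_i^2 = 1)
    for _ in range(outer_iters):             # outer iteration
        for i in range(l):                   # inner iteration
            old = alpha[i]                   # (a)
            G = y[i] * (u @ X[i]) - 1.0      # (b) one gradient component, O(n)
            alpha[i] = min(max(old - G / Qii[i], 0.0), C)   # (c) projected update
            u += (alpha[i] - old) * y[i] * X[i]             # maintain u, O(n)
    return u                                 # u equals the primal w

# Toy separable data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd(X, y)
```

After a few passes the returned w separates the training set (y_i w^T x_i > 0 for all i).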


### Difference from the Kernel Case

• We have seen that coordinate descent is also the main method to train kernel classifiers

• Recall that the i-th element of the gradient costs O(n) by

∇_i f(α) = ∑_{j=1}^l y_i y_j x_i^T x_j α_j − 1 = (y_i x_i)^T (∑_{j=1}^l y_j x_j α_j) − 1 = (y_i x_i)^T u − 1

but we cannot do this for kernel because K(x_i, x_j) = φ(x_i)^T φ(x_j) cannot be separated

### Difference from the Kernel Case (Cont’d)

If a kernel is used, the cost of calculating ∇_i f(α) must be O(ln)

However, if the O(ln) cost is spent, the whole ∇f(α) can be maintained (details not shown here)

In contrast, the setting of using u knows ∇_i f(α) rather than the whole ∇f(α)


### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use ∇f(α) information to select variables (i.e., to select the set B) for update

In optimization there are two types of coordinate descent methods:
- sequential or random selection of variables
- greedy selection of variables

To do greedy selection, usually the whole gradient must be available

### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to sequential or random selection

Existing coordinate descent methods for kernel ⇒ related to greedy selection


### Bias Term b and Linear Constraint in Dual

In our discussion, b is used for kernel but not for linear classification, mainly for historical reasons

For kernel SVM, we can also omit b to get rid of the linear constraint y^T α = 0

Then, for the kernel decomposition method, |B| = 1 also becomes possible


### Optimization for Linear and Kernel Cases

Recall that

w = ∑_{i=1}^l y_i α_i φ(x_i)

Kernel: can only solve an optimization problem of α. Linear: can solve either w or α.

We will show an example of minimizing over w

### Newton Method

Let's minimize a twice-differentiable function: min_w f(w). For example, logistic regression has

min_w (1/2) w^T w + C ∑_{i=1}^l log(1 + e^{−y_i w^T x_i}).

Newton direction at iterate w^k:

min_s ∇f(w^k)^T s + (1/2) s^T ∇^2 f(w^k) s


### Truncated Newton Method

The above sub-problem is equivalent to solving the Newton linear system

∇^2 f(w^k) s = −∇f(w^k)

Approximately solving the linear system ⇒ truncated Newton

However, the Hessian matrix ∇^2 f(w^k) is too large to be stored: it is n × n, where n is the number of features. For document data, n can be millions or more

### Using Special Properties of Data Classification

But the Hessian has a special form:

∇^2 f(w) = I + C X^T D X,

where D is diagonal. For logistic regression,

D_ii = e^{−y_i w^T x_i} / (1 + e^{−y_i w^T x_i})^2

X: data, # instances × # features:

X = [x_1, ..., x_l]^T


### Using Special Properties of Data Classification (Cont’d)

Use conjugate gradient (CG) to solve the linear system. CG is an iterative procedure; each CG step mainly needs one Hessian-vector product:

∇^2 f(w) s = s + C · X^T(D(X s))

Therefore, we have a Hessian-free approach

### Using Special Properties of Data Classification (Cont’d)

Now the procedure has two layers of iterations. Outer: Newton iterations. Inner: CG iterations per Newton iteration.

Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005) and Lin et al. (2008)

Second-order information is used: faster convergence than first-order methods
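A sketch of one truncated Newton step for problem (7): the Hessian is never formed; only the product s + C·X^T(D(Xs)) is supplied to CG (scipy assumed; function names are mine):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_step(X, y, w, C):
    """Solve grad^2 f(w) s = -grad f(w) by CG using Hessian-vector products."""
    l, n = X.shape
    sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))          # sigma_i = p(y_i | x_i)
    grad = w + C * (X.T @ ((sigma - 1.0) * y))          # gradient of (7)
    D = sigma * (1.0 - sigma)                           # diagonal of D
    matvec = lambda s: s + C * (X.T @ (D * (X @ s)))    # (I + C X^T D X) s
    s, _ = cg(LinearOperator((n, n), matvec=matvec), -grad)
    return s

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
w0 = np.zeros(2)
s = newton_step(X, y, w0, C=1.0)
```

Even a single step from w = 0 decreases the objective (7), and no n × n matrix is ever materialized.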



### Comparisons

L2-loss SVM is used

- DCDL2: dual coordinate descent (Hsieh et al., 2008)
- DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)
- PCD: primal coordinate descent (Chang et al., 2008)
- TRON: trust region Newton method (Lin et al., 2008)


### Objective values (Time in Seconds)

(Figures: objective value versus training time on news20, rcv1, yahoo-japan, and yahoo-korea)

### Analysis

Dual coordinate descent is very effective if # data and # features are both large; useful for document classification (half a million data in a few seconds)

However, it is less effective if
- # features is small: should solve the primal instead; or
- the penalty parameter C is large: problems are more ill-conditioned


### An Example When # Features Small

# instance: 32,561, # features: 123

(Figures: objective value and accuracy versus training time)


### Outline

5 Big-data linear classification: multi-core linear classification, distributed linear classification

### Big-data Linear Classification

- Parallelization in a shared-memory system: use the power of a multi-core CPU if data can fit in memory
- Distributed linear classification: if data cannot be stored on one computer

Example: we can parallelize the second-order (i.e., Newton) method discussed earlier. Recall the bottleneck is the Hessian-vector product

∇^2 f(w) s = s + C · X^T(D(X s))

See the analysis in the next slide


### Matrix-vector Multiplications

Two data sets:

| Data set | l | n | #nonzeros |
| --- | --- | --- | --- |
| epsilon | 400,000 | 2,000 | 800,000,000 |
| webspam | 350,000 | 16,609,143 | 1,304,697,446 |

Matrix-vector multiplications occupy the majority of the running time:

| Data set | matrix-vector ratio |
| --- | --- |
| epsilon | 99.88% |
| webspam | 97.95% |

These numbers are for the Newton method using one core. We should therefore parallelize the matrix-vector multiplications


### Parallelization by OpenMP

The Hessian-vector product can be done by

X^T D X s = ∑_{i=1}^l x_i D_ii x_i^T s

We can easily parallelize this loop by OpenMP; speedup details are in Lee et al. (2015)

(Figures: speedup on epsilon and webspam)


### Parallel Hessian-vector Product

Now the data matrix X is stored distributedly:

X = [X_1; X_2; ...; X_p], with row block X_k on node k

X^T D X s = X_1^T D_1 X_1 s + ··· + X_p^T D_p X_p s

### Parallel Hessian-vector Product (Cont’d)

We use allreduce to let every node get X^T D X s

(Figure: node k computes X_k^T D_k X_k s; an allreduce sums the partial results and sends X^T D X s back to every node)

Allreduce: reducing all vectors (X_i^T D_i X_i s, ∀i) to a single vector (X^T D X s ∈ R^n) and then sending the result to every node
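The partial-sum structure can be simulated without MPI: each "node" holds a row block X_k and contributes X_k^T D_k X_k s, and summing the partials plays the role of allreduce (an illustrative sketch, not a real distributed implementation):

```python
import numpy as np

def hv_distributed(X_blocks, D_blocks, s, C=1.0):
    """Hessian-vector product from per-node partials (instance-wise split)."""
    partial = sum(Xk.T @ (Dk * (Xk @ s)) for Xk, Dk in zip(X_blocks, D_blocks))
    return s + C * partial      # every node would receive this via allreduce

# Check against the undistributed product
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
D = rng.random(6)
s = rng.standard_normal(3)
split = hv_distributed([X[:3], X[3:]], [D[:3], D[3:]], s)
full = s + X.T @ (D * (X @ s))
```

The split and undistributed results agree, which is exactly why the allreduce of per-node partial products is correct.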


### Instance-wise and Feature-wise Data Splits

(Figure: the instance-wise split partitions X by rows into X_{iw,1}, X_{iw,2}, X_{iw,3}; the feature-wise split partitions X by columns into X_{fw,1}, X_{fw,2}, X_{fw,3})

Feature-wise: each machine calculates part of the Hessian-vector product

(∇^{2}f (w )s)_{fw,1} = s_{1} + CX_{fw,1}^{T}D(X_{fw,1}s_{1} + · · · + X_{fw,p}s_{p})

### Instance-wise and Feature-wise Data Splits (Cont’d)

X_{fw,1}s_{1} + · · · + X_{fw,p}s_{p} ∈ R^{l} must be available on all
nodes (by allreduce)

Data moved per Hessian-vector product:

Instance-wise: O(n)

Feature-wise: O(l)
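A sketch of the feature-wise computation under the same kind of single-process simulation (column blocks stand in for nodes; the sum over the X_{fw,k}s_{k} terms plays the role of the allreduce of l-vectors):

```python
import numpy as np

# Simulate a feature-wise split: node k holds columns X_{fw,k} and slice s_k of s
rng = np.random.default_rng(3)
l, n, p = 8, 6, 3
X = rng.standard_normal((l, n))
D_diag = rng.random(l)
s = rng.standard_normal(n)
C = 0.5

cols = np.array_split(np.arange(n), p)

# Allreduce of the l-vectors X_{fw,k} s_k gives Xs on every node (O(l) data moved)
Xs = sum(X[:, idx] @ s[idx] for idx in cols)

# Each node then computes its own slice of the Hessian-vector product
pieces = [s[idx] + C * (X[:, idx].T @ (D_diag * Xs)) for idx in cols]
hv = np.concatenate(pieces)

assert np.allclose(hv, s + C * (X.T @ (D_diag * (X @ s))))
```

Here the communicated vectors have length l rather than n, matching the O(l) cost stated above.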

Chih-Jen Lin (National Taiwan Univ.) 108 / 121

Big-data linear classification Distributed linear classification

### Experiments

We compare

TRON: Newton method

ADMM: alternating direction method of multipliers (Boyd et al., 2011; Zhang et al., 2012)

Vowpal Wabbit (Langford et al., 2007)

TRON and ADMM are implemented by MPI

Details in Zhuang et al. (2015)

### Experiments (Cont’d)

(Figure: relative function value difference versus training time (sec.) for VW, ADMM-FW, ADMM-IW, TRON-FW, and TRON-IW on epsilon (left) and webspam (right))

32 machines are used

Horizontal line: test accuracy has stabilized

Instance-wise and feature-wise splits are useful for l ≫ n and l ≪ n, respectively

Chih-Jen Lin (National Taiwan Univ.) 110 / 121

Discussion and conclusions

### Outline

1 Linear classification

2 Kernel classification

3 Linear versus kernel classification

4 Solving optimization problems

5 Big-data linear classification

6 Discussion and conclusions

### Outline

6 Discussion and conclusions
   Some resources
   Conclusions

Chih-Jen Lin (National Taiwan Univ.) 112 / 121

Discussion and conclusions Some resources

### Outline

6 Discussion and conclusions
   Some resources
   Conclusions

### Software

• Most materials in this talk are based on our experience in developing two popular software packages

• Kernel: LIBSVM (Chang and Lin, 2011)

http://www.csie.ntu.edu.tw/~cjlin/libsvm

• Linear: LIBLINEAR (Fan et al., 2008)

http://www.csie.ntu.edu.tw/~cjlin/liblinear

See also a survey on linear classification in Yuan et al. (2012)

Chih-Jen Lin (National Taiwan Univ.) 114 / 121

Discussion and conclusions Some resources

### Distributed LIBLINEAR

An extension of the software LIBLINEAR

See http://www.csie.ntu.edu.tw/~cjlin/

libsvmtools/distributed-liblinear

We support both MPI (Zhuang et al., 2015) and Spark (Lin et al., 2014)

The development is still in an early stage.

### Outline

6 Discussion and conclusions
   Some resources
   Conclusions

Chih-Jen Lin (National Taiwan Univ.) 116 / 121

Discussion and conclusions Conclusions

### Conclusions

Linear and kernel classification are old topics

However, novel techniques are still being developed to handle large-scale data and new applications

You are welcome to join this interesting research area

### References I

B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear SVM. Journal of Machine Learning Research, 9:1369–1398, 2008. URL

http://www.csie.ntu.edu.tw/~cjlin/papers/cdl2.pdf.

C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.

C. Hildreth. A quadratic programming procedure. Naval Research Logistics Quarterly, 4:79–85, 1957.

Chih-Jen Lin (National Taiwan Univ.) 118 / 121

Discussion and conclusions Conclusions

### References II

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML), 2008. URL

http://www.csie.ntu.edu.tw/~cjlin/papers/cddual.pdf.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, Cambridge, MA, 1998. MIT Press.

S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.

S. S. Keerthi and C.-J. Lin. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation, 15(7):1667–1689, 2003.

J. Langford, L. Li, and A. Strehl. Vowpal Wabbit, 2007.

https://github.com/JohnLangford/vowpal_wabbit/wiki.

M.-C. Lee, W.-L. Chiang, and C.-J. Lin. Fast matrix-vector multiplications for large-scale logistic regression on shared-memory systems. Technical report, National Taiwan University, 2015.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627–650, 2008. URL

http://www.csie.ntu.edu.tw/~cjlin/papers/logistic.pdf.