# Large-scale Linear Classification

(1)

### Large-scale Linear Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

International Winter School on Big Data, 2016

(2)

### Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example: medical diagnosis

Find a patient’s blood pressure, weight, etc.

After several years, know whether he/she recovers

Build a machine learning model

New patient: find blood pressure, weight, etc.

Prediction

Training and testing

(3)

### Data Classification (Cont’d)

Among many classification methods, linear and kernel are two popular ones

They are closely related

We will discuss linear classification in detail, along with its connection to kernel classification

Talk slides:

http://www.csie.ntu.edu.tw/~cjlin/talks/course-bilbao.pdf

(4)

### Outline

1 Linear classification

2 Kernel classification

3 Linear versus kernel classification

4 Solving optimization problems

5 Multi-core linear classification

6 Distributed linear classification

7 Discussion and conclusions

(6)

Linear classification

### Outline

1 Linear classification

• Maximum margin

• Regularization and losses

• Other derivations

(8)

Linear classification Maximum margin

### Linear Classification

Training vectors: $x_i$, $i = 1, \ldots, l$

Feature vectors. For example, a patient = $[\text{height}, \text{weight}, \ldots]^T$

Consider a simple case with two classes:

Define an indicator vector $y \in R^l$

$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$

A hyperplane to linearly separate all data

(9)

Linear classification Maximum margin

[Figure: circles and triangles on opposite sides of the parallel lines $w^Tx + b = +1, 0, -1$]

A separating hyperplane: $w^Tx + b = 0$

$$(w^Tx_i) + b \ge 1 \ \text{ if } y_i = 1, \qquad (w^Tx_i) + b \le -1 \ \text{ if } y_i = -1$$

Decision function $f(x) = \mathrm{sgn}(w^Tx + b)$, $x$: test data

Many possible choices of w and b

(10)

Linear classification Maximum margin

### Maximal Margin

Maximizing the distance between $w^Tx + b = 1$ and $-1$:

$$2/\|w\| = 2/\sqrt{w^Tw}$$

A quadratic programming problem

$$\min_{w,b}\ \frac{1}{2}w^Tw \quad \text{subject to } y_i(w^Tx_i + b) \ge 1,\ i = 1, \ldots, l.$$

This is the basic formulation of support vector machines (Boser et al., 1992)

(11)

Linear classification Maximum margin

### Data May Not Be Linearly Separable

An example:

[Figure: circles and triangles that cannot be separated by a line]

We can never find a linear hyperplane to separate data

Remedy: allow training errors

(12)

Linear classification Maximum margin

### Data May Not Be Linearly Separable (Cont’d)

Standard SVM (Boser et al., 1992; Cortes and Vapnik, 1995)

$$\min_{w,b,\xi}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_i(w^Tx_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$

We explain later why this method is called support vector machine

(13)

Linear classification Maximum margin

### The Bias Term b

Recall the decision function is $\mathrm{sgn}(w^Tx + b)$

Sometimes the bias term $b$ is omitted

$$\mathrm{sgn}(w^Tx)$$

That is, the hyperplane always passes through the origin

This is fine if the number of features is not too small

In our discussion, $b$ is used for kernel, but omitted for linear (due to some historical reasons)

(15)

Linear classification Regularization and losses

### Equivalent Optimization Problem

• Recall SVM optimization problem (without $b$) is

$$\min_{w,\xi}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_iw^Tx_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$

• It is equivalent to

$$\min_{w}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\max(0, 1 - y_iw^Tx_i) \tag{1}$$

• This reformulation is useful for subsequent discussion

(16)

Linear classification Regularization and losses

### Equivalent Optimization Problem (Cont’d)

That is, at optimum,

$$\xi_i = \max(0, 1 - y_iw^Tx_i)$$

Reason: from constraint

$$\xi_i \ge 1 - y_iw^Tx_i \ \text{ and } \ \xi_i \ge 0$$

but we also want to minimize $\xi_i$

(17)

Linear classification Regularization and losses

### Equivalent Optimization Problem (Cont’d)

We now derive the same optimization problem (1) from a different viewpoint

We now aim to minimize the training error

$$\min_w\ (\text{training errors})$$

To characterize the training error, we need a loss function $\xi(w; x, y)$ for each instance $(x, y)$

Ideally we should use 0–1 training loss:

$$\xi(w; x, y) = \begin{cases} 1 & \text{if } yw^Tx < 0, \\ 0 & \text{otherwise} \end{cases}$$

(18)

Linear classification Regularization and losses

### Equivalent Optimization Problem (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

[Figure: the 0–1 loss $\xi(w; x, y)$ plotted against $-yw^Tx$]

We need continuous approximations

(19)

Linear classification Regularization and losses

### Common Loss Functions

Hinge loss (l1 loss)

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - yw^Tx) \tag{2}$$

Squared hinge loss (l2 loss)

$$\xi_{L2}(w; x, y) \equiv \max(0, 1 - yw^Tx)^2 \tag{3}$$

Logistic loss

$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-yw^Tx}) \tag{4}$$

SVM: (2)-(3). Logistic regression (LR): (4)
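The three losses (2)-(4) depend on the data only through the margin $yw^Tx$, so they can be sketched as one-line functions. This is an illustrative sketch, not code from the talk; the function names and the toy values of `w`, `x`, and `y` are assumptions.

```python
import numpy as np

def hinge_loss(margin):
    """L1 loss (2): max(0, 1 - y w^T x), written in terms of margin = y w^T x."""
    return np.maximum(0.0, 1.0 - margin)

def squared_hinge_loss(margin):
    """L2 loss (3): max(0, 1 - y w^T x)^2."""
    return np.maximum(0.0, 1.0 - margin) ** 2

def logistic_loss(margin):
    """Logistic loss (4): log(1 + exp(-y w^T x)), evaluated stably with logaddexp."""
    return np.logaddexp(0.0, -margin)

# Hypothetical toy instance; the numbers are made up for illustration.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
y = 1.0
margin = y * w.dot(x)  # y w^T x
print(hinge_loss(margin), squared_hinge_loss(margin), logistic_loss(margin))
```

All three are zero or small when the margin is large and positive, and grow as the margin becomes negative.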

(20)

Linear classification Regularization and losses

### Common Loss Functions (Cont’d)

[Figure: $\xi_{L1}$, $\xi_{L2}$, and $\xi_{LR}$ plotted against $-yw^Tx$]

Logistic regression is closely related to SVM

Their performance is usually similar

(21)

Linear classification Regularization and losses

### Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

(22)

Linear classification Regularization and losses

### Overfitting

See the illustration in the next slide

For classification,

You can easily achieve 100% training accuracy

This is useless

When training a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error

(23)

Linear classification Regularization and losses

### ● and ▲: training; ○ and △: testing

(24)

Linear classification Regularization and losses

### Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme.

One idea is to make the objective function smoother

(25)

Linear classification Regularization and losses

### General Form of Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$

$l$: number of data, $n$: number of features

$$\min_w f(w), \quad f(w) \equiv \frac{w^Tw}{2} + C\sum_{i=1}^{l}\xi(w; x_i, y_i) \tag{5}$$

$w^Tw/2$: regularization term

$\xi(w; x, y)$: loss function

$C$: regularization parameter

(26)

Linear classification Regularization and losses

### General Form of Linear Classification (Cont’d)

If hinge loss

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - yw^Tx)$$

is used, then (5) goes back to the SVM problem described earlier ($b$ omitted):

$$\min_{w,\xi}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_iw^Tx_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$

(27)

Linear classification Regularization and losses

### Solving Optimization Problems

We have an unconstrained problem, so many existing unconstrained optimization techniques can be used

However,

$\xi_{L1}$: not differentiable

$\xi_{L2}$: differentiable but not twice differentiable

$\xi_{LR}$: twice differentiable

We may need different types of optimization methods

Details of solving optimization problems will be discussed later

(29)

Linear classification Other derivations

### Logistic Regression

Logistic regression can be traced back to the 19th century

It mainly comes from the statistics community, so many people wrongly think that this method is very different from SVM

Indeed, from what we have shown, they are closely related.

Let’s see how to derive it from a statistical viewpoint

(30)

Linear classification Other derivations

### Logistic Regression (Cont’d)

For a label-feature pair $(y, x)$, assume the probability model

$$p(y|x) = \frac{1}{1 + e^{-yw^Tx}}.$$

Note that

$$p(1|x) + p(-1|x) = \frac{1}{1 + e^{-w^Tx}} + \frac{1}{1 + e^{w^Tx}} = \frac{e^{w^Tx}}{1 + e^{w^Tx}} + \frac{1}{1 + e^{w^Tx}} = 1$$

$w$ is the parameter to be decided


(31)

Linear classification Other derivations

### Logistic Regression (Cont’d)

Idea of this model

$$p(1|x) = \frac{1}{1 + e^{-w^Tx}} \begin{cases} \to 1 & \text{if } w^Tx \gg 0, \\ \to 0 & \text{if } w^Tx \ll 0 \end{cases}$$

Assume training instances are

$$(y_i, x_i),\ i = 1, \ldots, l$$

(32)

Linear classification Other derivations

### Logistic Regression (Cont’d)

Logistic regression finds $w$ by maximizing the following likelihood

$$\max_w\ \prod_{i=1}^{l} p(y_i|x_i). \tag{6}$$

Negative log-likelihood

$$-\log\prod_{i=1}^{l} p(y_i|x_i) = -\sum_{i=1}^{l}\log p(y_i|x_i) = \sum_{i=1}^{l}\log\left(1 + e^{-y_iw^Tx_i}\right)$$

(33)

Linear classification Other derivations

### Logistic Regression (Cont’d)

Logistic regression

$$\min_w\ \sum_{i=1}^{l}\log\left(1 + e^{-y_iw^Tx_i}\right).$$

Regularized logistic regression

$$\min_w\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\log\left(1 + e^{-y_iw^Tx_i}\right). \tag{7}$$

$C$: regularization parameter decided by users
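A small numerical check can make the link between the likelihood and the loss form concrete. The sketch below, with made-up toy data and hypothetical function names, evaluates the negative log-likelihood from the probability model and compares it with the loss term of the regularized objective (7).

```python
import numpy as np

def prob(y, w, x):
    """Model probability p(y | x) = 1 / (1 + exp(-y w^T x))."""
    return 1.0 / (1.0 + np.exp(-y * w.dot(x)))

def neg_log_likelihood(w, X, y):
    """-sum_i log p(y_i | x_i), computed directly from the probabilities."""
    return -sum(np.log(prob(yi, w, xi)) for xi, yi in zip(X, y))

def regularized_lr_objective(w, X, y, C):
    """Objective (7): w^T w / 2 + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * X.dot(w)
    return 0.5 * w.dot(w) + C * np.sum(np.logaddexp(0.0, -margins))

# Hypothetical toy data (three instances, two features).
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.2, -0.4])

nll = neg_log_likelihood(w, X, y)
obj = regularized_lr_objective(w, X, y, C=1.0)
# With C = 1, subtracting the regularization term leaves the negative log-likelihood.
print(nll, obj - 0.5 * w.dot(w))
```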

(34)

Linear classification Other derivations

### Discussion

We see that the same method can be derived in different ways

SVM

• Maximal margin

• Regularization and training losses

LR

• Regularization and training losses

• Maximum likelihood

(36)

Kernel classification

### Outline

2 Kernel classification

• Nonlinear mapping

• Kernel tricks

(38)

Kernel classification Nonlinear mapping

### Data May Not Be Linearly Separable

This is an earlier example:

[Figure: circles and triangles that cannot be separated by a line]

In addition to allowing training errors, what else can we do?

For this data set, shouldn’t we use a nonlinear classifier?

(39)

Kernel classification Nonlinear mapping

### Mapping Data to a Higher Dimensional Space

But modeling nonlinear curves is difficult. Instead, we map data to a higher dimensional space

$$\phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T.$$

For example,

$$\frac{\text{weight}}{\text{height}^2}$$

is a useful new feature for checking whether a person is overweight

(40)

Kernel classification Nonlinear mapping

### Kernel Support Vector Machines

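Linear SVM:

$$\min_{w,b,\xi}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_i(w^Tx_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$

Kernel SVM:

$$\min_{w,b,\xi}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\xi_i \quad \text{subject to } y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$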

(41)

Kernel classification Nonlinear mapping

### Kernel Logistic Regression

$$\min_{w,b}\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\log\left(1 + e^{-y_i(w^T\phi(x_i) + b)}\right).$$

(42)

Kernel classification Nonlinear mapping

### Difficulties After Mapping Data to a High-dimensional Space

Number of variables in $w$ = dimensions of $\phi(x)$

Infinite variables if $\phi(x)$ is infinite dimensional

Cannot do an infinite-dimensional inner product for predicting a test instance

$$\mathrm{sgn}(w^T\phi(x))$$

Use kernel trick to go back to a finite number of variables

(44)

Kernel classification Kernel tricks

### Kernel Tricks

• It can be shown that at optimum, $w$ is a linear combination of training data

$$w = \sum_{i=1}^{l} y_i\alpha_i\phi(x_i)$$

Proofs not provided here. Later we will show that $\alpha$ is the solution of a dual problem

• Special $\phi(x)$ such that the decision function becomes

$$\mathrm{sgn}(w^T\phi(x)) = \mathrm{sgn}\left(\sum_{i=1}^{l} y_i\alpha_i\phi(x_i)^T\phi(x)\right) = \mathrm{sgn}\left(\sum_{i=1}^{l} y_i\alpha_iK(x_i, x)\right)$$

(45)

Kernel classification Kernel tricks

### Kernel Tricks (Cont’d)

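$\phi(x_i)^T\phi(x_j)$ needs a closed form

Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$

$$\phi(x_i) = [1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3]^T$$

Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$.

Kernel: $K(x, y) = \phi(x)^T\phi(y)$; common kernels:

$$e^{-\gamma\|x_i - x_j\|^2} \quad \text{(Radial Basis Function)}$$

$$(x_i^Tx_j/a + b)^d \quad \text{(Polynomial kernel)}$$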
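The closed form for the degree-2 example can be checked numerically. The sketch below builds the 10-dimensional mapping explicitly and compares the inner product with $(1 + x_i^Tx_j)^2$; the function names and the two test points are assumptions made for illustration.

```python
import numpy as np

def phi(x):
    """Degree-2 feature map from R^3 to R^10, in the order listed above."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x3,
                     x1 ** 2, x2 ** 2, x3 ** 2,
                     s * x1 * x2, s * x1 * x3, s * x2 * x3])

def poly_kernel(x, z):
    """Closed form K(x, z) = (1 + x^T z)^2."""
    return (1.0 + np.dot(x, z)) ** 2

# Hypothetical points in R^3.
xi = np.array([1.0, -2.0, 0.5])
xj = np.array([0.3, 0.1, 2.0])
print(phi(xi).dot(phi(xj)), poly_kernel(xi, xj))  # the two values agree
```

The kernel needs one inner product in $R^3$, while the explicit route needs one in $R^{10}$; the gap widens quickly with the degree and the number of features.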

(46)

Kernel classification Kernel tricks

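$K(x, y)$ can be an inner product in an infinite dimensional space. Assume $x \in R^1$ and $\gamma > 0$.

$$\begin{aligned}
e^{-\gamma\|x_i - x_j\|^2} &= e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_ix_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_ix_j}{1!} + \frac{(2\gamma x_ix_j)^2}{2!} + \frac{(2\gamma x_ix_j)^3}{3!} + \cdots\right) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i \cdot \sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) \\
&= \phi(x_i)^T\phi(x_j),
\end{aligned}$$

where

$$\phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}x, \sqrt{\frac{(2\gamma)^2}{2!}}x^2, \sqrt{\frac{(2\gamma)^3}{3!}}x^3, \cdots\right]^T.$$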
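Truncating the infinite feature vector after a few coordinates already reproduces the RBF kernel value closely in the one-dimensional case. The sketch below is illustrative only; `phi_truncated`, the number of terms, and the values of `gamma`, `xi`, and `xj` are assumptions.

```python
import math

def phi_truncated(x, gamma, terms):
    """First `terms` coordinates of phi(x) = exp(-gamma x^2) [sqrt((2 gamma)^k / k!) x^k], k = 0, 1, ..."""
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(terms)]

def rbf(x, z, gamma):
    """RBF kernel exp(-gamma (x - z)^2) for scalar inputs."""
    return math.exp(-gamma * (x - z) ** 2)

# Hypothetical scalar inputs.
gamma, xi, xj = 0.5, 0.8, -0.3
approx = sum(a * b for a, b in zip(phi_truncated(xi, gamma, 20),
                                   phi_truncated(xj, gamma, 20)))
print(approx, rbf(xi, xj, gamma))  # nearly identical
```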

(48)

Linear versus kernel classification

### Outline

3 Linear versus kernel classification

• Comparison on the cost

• Numerical comparisons

• A real example

(50)

Linear versus kernel classification Comparison on the cost

### Linear and Kernel Classification

Now we see that methods such as SVM and logistic regression can be used in two ways

Kernel methods: data mapped to a higher dimensional space

$$x \Rightarrow \phi(x)$$

$\phi(x_i)^T\phi(x_j)$ easily calculated; little control on $\phi(\cdot)$

Linear classification + feature engineering:

We have $x$ without mapping. Alternatively, we can say that $\phi(x)$ is our $x$; full control on $x$ or $\phi(x)$

(51)

Linear versus kernel classification Comparison on the cost

### Linear and Kernel Classification

The cost of using linear and kernel classification is different

Let’s check the prediction cost

$$w^Tx \quad \text{versus} \quad \sum_{i=1}^{l} y_i\alpha_iK(x_i, x)$$

If $K(x_i, x_j)$ takes $O(n)$, then

$$O(n) \quad \text{versus} \quad O(nl)$$

Linear is much cheaper

A similar difference occurs for training

(52)

Linear versus kernel classification Comparison on the cost

### Linear and Kernel Classification (Cont’d)

In fact, linear is a special case of kernel

We can prove that accuracy of linear is the same as Gaussian (RBF) kernel under certain parameters (Keerthi and Lin, 2003)

Therefore, roughly we have

accuracy: kernel ≥ linear

cost: kernel ≫ linear

Speed is the reason to use linear

(53)

Linear versus kernel classification Comparison on the cost

### Linear and Kernel Classification (Cont’d)

For some problems, accuracy by linear is as good as nonlinear

But training and testing are much faster

This particularly happens for document classification

Number of features (bag-of-words model) very large

Data very sparse (i.e., few non-zeros)

(55)

Linear versus kernel classification Numerical comparisons

### Comparison Between Linear and Kernel (Training Time & Testing Accuracy)

| Data set | Linear: Time | Linear: Accuracy | RBF kernel: Time | RBF kernel: Accuracy |
|---|---|---|---|---|
| MNIST38 | 0.1 | 96.82 | 38.1 | 99.70 |
| ijcnn1 | 1.6 | 91.81 | 26.8 | 98.69 |
| covtype | 1.4 | 76.37 | 46,695.8 | 96.11 |
| news20 | 1.1 | 96.95 | 383.2 | 96.90 |
| real-sim | 0.3 | 97.44 | 938.3 | 97.82 |
| yahoo-japan | 3.1 | 92.63 | 20,955.2 | 93.31 |
| webspam | 25.7 | 93.35 | 15,681.8 | 99.26 |

Size reasonably large: e.g., yahoo-japan: 140k instances and 830k features

(59)

Linear versus kernel classification A real example

### Linear Methods to Explicitly Train $\phi(x_i)$

We may directly train $\phi(x_i), \forall i$ without using kernel

This is possible only if $\phi(x_i)$ is not too high dimensional

Next we show a real example of running a machine learning model in a small sensor hub

(60)

Linear versus kernel classification A real example

### Example: Classifier in a Small Device

In a sensor application (Yang, 2013), the classifier can use less than 16KB of RAM

| Classifiers | Test accuracy | Model Size |
|---|---|---|
| Decision Tree | 77.77 | 76.02KB |
| AdaBoost (10 trees) | 78.84 | 1,500.54KB |
| SVM (RBF kernel) | 85.33 | 1,287.15KB |

Number of features: 5

We consider a degree-3 polynomial mapping

$$\text{dimensionality} = \binom{5 + 3}{3} + \text{bias term} = 57.$$

(61)

Linear versus kernel classification A real example

### Example: Classifier in a Small Device

One-against-one strategy for 5-class classification

$$\binom{5}{2} \times 57 \times 4 \text{ bytes} = 2.28\text{KB}$$

Assume single precision

Results

| SVM method | Test accuracy | Model Size |
|---|---|---|
| RBF kernel | 85.33 | 1,287.15KB |
| Polynomial kernel | 84.79 | 2.28KB |
| Linear kernel | 78.51 | 0.24KB |
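The model sizes in the table follow from simple counting, which the sketch below reproduces. It takes 1KB as 1,000 bytes, as the 2.28KB figure implies. The linear-model line assumes 5 features plus one bias weight per binary classifier; that assumption is not stated in the slides, but it reproduces the 0.24KB entry.

```python
from math import comb

n_features = 5
degree = 3
n_classes = 5
bytes_per_weight = 4  # single precision

# Degree-3 polynomial mapping: C(5 + 3, 3) mapped features plus one bias term.
poly_dim = comb(n_features + degree, degree) + 1

# One-against-one: one binary classifier per pair of classes.
n_pairs = comb(n_classes, 2)

poly_bytes = n_pairs * poly_dim * bytes_per_weight
# Assumption: a linear model stores 5 weights and a bias per binary classifier.
linear_bytes = n_pairs * (n_features + 1) * bytes_per_weight

print(poly_dim, poly_bytes / 1000, linear_bytes / 1000)  # dimensions, KB, KB
```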

(63)

Solving optimization problems

### Outline

4 Solving optimization problems

• Kernel: decomposition methods

• Linear: coordinate descent method

• Linear: second-order methods

• Experiments

(65)

Solving optimization problems Kernel: decomposition methods

### Dual Problem

Recall we said that the difficulty after mapping $x$ to $\phi(x)$ is the huge number of variables

We mentioned

$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i) \tag{8}$$

and used kernels for prediction

Besides prediction, we must do training via kernels

The most common way to train SVM via kernels is through its dual problem

(66)

Solving optimization problems Kernel: decomposition methods

### Dual Problem (Cont’d)

The dual problem

$$\min_{\alpha}\ \frac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to } 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T\alpha = 0,$$

where $Q_{ij} = y_iy_j\phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$

From primal-dual relationship, at optimum (8) holds

Dual problem has a finite number of variables

If no bias term $b$, then $y^T\alpha = 0$ disappears

(67)

Solving optimization problems Kernel: decomposition methods

### Example: Primal-dual Relationship

Consider the earlier example:

[Figure: two training points on a line, at 0 and 1]

Now two data are $x_1 = 1$, $x_2 = 0$ with $y = [+1, -1]^T$

The solution is $(w, b) = (2, -1)$

(68)

Solving optimization problems Kernel: decomposition methods

### Example: Primal-dual Relationship (Cont’d)

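The dual objective function

$$\frac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}1 & 0 \\ 0 & 0\end{bmatrix}\begin{bmatrix}\alpha_1 \\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1 \\ \alpha_2\end{bmatrix} = \frac{1}{2}\alpha_1^2 - (\alpha_1 + \alpha_2)$$

In optimization, objective function means the function to be optimized

Constraints are

$$\alpha_1 - \alpha_2 = 0,\ 0 \le \alpha_1,\ 0 \le \alpha_2.$$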

(69)

Solving optimization problems Kernel: decomposition methods

### Example: Primal-dual Relationship (Cont’d)

Substituting $\alpha_2 = \alpha_1$ into the objective function,

$$\frac{1}{2}\alpha_1^2 - 2\alpha_1$$

has the smallest value at $\alpha_1 = 2$.

Because $[2, 2]^T$ satisfies constraints $0 \le \alpha_1$ and $0 \le \alpha_2$, it is optimal

(70)

Solving optimization problems Kernel: decomposition methods

### Example: Primal-dual Relationship (Cont’d)

Using the primal-dual relation

$$w = y_1\alpha_1x_1 + y_2\alpha_2x_2 = 1 \cdot 2 \cdot 1 + (-1) \cdot 2 \cdot 0 = 2$$

This is the same as the solution obtained by solving the primal problem.
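The two-point example is small enough to verify by hand or in a few lines of code. The sketch below rebuilds $Q$, evaluates the dual objective at $\alpha = [2, 2]^T$, and recovers $(w, b)$. The way $b$ is recovered (from the margin condition $y_1(wx_1 + b) = 1$) is an added step not shown on the slides, and the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 0.0])   # x_1 = 1, x_2 = 0
y = np.array([1.0, -1.0])

# Q_ij = y_i y_j x_i x_j, which gives [[1, 0], [0, 0]] here.
Q = np.outer(y * x, y * x)

def dual_objective(alpha):
    """(1/2) alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha.dot(Q).dot(alpha) - alpha.sum()

# With alpha_2 = alpha_1 = a the objective is a^2 / 2 - 2a, minimized at a = 2.
alpha = np.array([2.0, 2.0])

w = np.sum(y * alpha * x)   # primal-dual relation (8)
b = y[0] - w * x[0]         # assumption: x_1 lies on the margin, y_1 (w x_1 + b) = 1
print(dual_objective(alpha), w, b)
```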

(71)

Solving optimization problems Kernel: decomposition methods

### Decision function

At optimum

$$w = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)$$

Decision function

$$w^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_i\phi(x_i)^T\phi(x) + b = \sum_{i=1}^{l}\alpha_iy_iK(x_i, x) + b$$

Recall $0 \le \alpha_i \le C$ in the dual problem

(72)

Solving optimization problems Kernel: decomposition methods

### Support Vectors

Only $x_i$ of $\alpha_i > 0$ used ⇒ support vectors

[Figure: plot illustrating support vectors]

(73)

Solving optimization problems Kernel: decomposition methods

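### Large Dense Quadratic Programming

$$\min_{\alpha}\ \frac{1}{2}\alpha^TQ\alpha - e^T\alpha \quad \text{subject to } 0 \le \alpha_i \le C,\ i = 1, \ldots, l, \quad y^T\alpha = 0$$

$Q_{ij} \ne 0$, $Q$: an $l$ by $l$ fully dense matrix

50,000 training points: 50,000 variables:

$(50{,}000^2 \times 8/2)$ bytes = 10GB RAM to store $Q$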
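The storage estimate is plain arithmetic and can be reproduced as below. This is only a sketch; the variable names are illustrative, and the division by two reflects storing half of the symmetric matrix, as the $8/2$ factor suggests.

```python
l = 50_000            # number of training points, so Q is l by l
bytes_per_entry = 8   # double precision

# Q is symmetric, so roughly half of its l * l entries need to be stored.
q_bytes = l * l * bytes_per_entry // 2
print(q_bytes, q_bytes / 1e9)  # bytes, and gigabytes with 1GB = 10^9 bytes
```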

(74)

Solving optimization problems Kernel: decomposition methods

### Large Dense Quadratic Programming (Cont’d)

Traditional optimization methods cannot be directly applied here because Q cannot even be stored

Currently, decomposition methods (a type of coordinate descent method) are what is used in practice

(75)

Solving optimization problems Kernel: decomposition methods

### Decomposition Methods

Working on some variables each time (e.g., Osuna et al., 1997; Joachims, 1998; Platt, 1998)

Similar to coordinate-wise minimization

Working set $B$, $N = \{1, \ldots, l\} \backslash B$ fixed

Let the objective function be

$$f(\alpha) = \frac{1}{2}\alpha^TQ\alpha - e^T\alpha$$

(76)

Solving optimization problems Kernel: decomposition methods

### Decomposition Methods (Cont’d)

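Sub-problem on the variable $d_B$

$$\min_{d_B}\ f\left(\begin{bmatrix}\alpha_B \\ \alpha_N\end{bmatrix} + \begin{bmatrix}d_B \\ 0\end{bmatrix}\right) \quad \text{subject to } -\alpha_i \le d_i \le C - \alpha_i,\ \forall i \in B, \quad d_i = 0,\ \forall i \notin B, \quad y_B^Td_B = 0$$

The objective function of the sub-problem

$$f\left(\begin{bmatrix}\alpha_B \\ \alpha_N\end{bmatrix} + \begin{bmatrix}d_B \\ 0\end{bmatrix}\right) = \frac{1}{2}d_B^TQ_{BB}d_B + \nabla_Bf(\alpha)^Td_B + \text{constant}.$$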

(77)

Solving optimization problems Kernel: decomposition methods

### Avoid Memory Problems

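$Q_{BB}$ is a sub-matrix of $Q$

$$\begin{bmatrix}Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN}\end{bmatrix}$$

Note that

$$\nabla f(\alpha) = Q\alpha - e, \quad \nabla_Bf(\alpha) = Q_{B,:}\alpha - e_B$$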

(78)

Solving optimization problems Kernel: decomposition methods

### Avoid Memory Problems (Cont’d)

Only $B$ columns of $Q$ are needed

In general $|B| \le 10$ is used. We need $|B| \ge 2$ because of the linear constraint

$$y_B^Td_B = 0$$

Calculated when used: trade time for space

But is such an approach practical?

(79)

Solving optimization problems Kernel: decomposition methods

### How Decomposition Methods Perform?

Convergence not very fast. This is a known consequence of using only first-order information

But, no need to have very accurate $\alpha$

decision function: $\sum_{i=1}^{l} y_i\alpha_iK(x_i, x) + b$

Prediction may still be correct with a rough $\alpha$

Further, in some situations,

number of support vectors ≪ number of training points

Initial $\alpha^1 = 0$, some instances never used

(80)

Solving optimization problems Kernel: decomposition methods

### How Decomposition Methods Perform? (Cont’d)

An example of training 50,000 instances using the software LIBSVM ($|B| = 2$)

    $ svm-train -c 16 -g 4 -m 400 22features
    Total nSV = 3370
    Time 79.524s

This was done on a typical desktop

Calculating the whole $Q$ takes more time

Number of SVs = 3,370 ≪ 50,000

A good case where some $\alpha_i$ remain at zero all the time

(82)

Solving optimization problems Linear: coordinate descent method

### Coordinate Descent Methods for Linear Classification

We consider L1-loss SVM as an example here

The same method can be extended to L2 and logistic loss

More details in Hsieh et al. (2008); Yu et al. (2011)

(83)

Solving optimization problems Linear: coordinate descent method

### SVM Dual (Linear without Kernel)

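From primal-dual relationship

$$\min_{\alpha}\ f(\alpha) \quad \text{subject to } 0 \le \alpha_i \le C,\ \forall i,$$

where

$$f(\alpha) \equiv \frac{1}{2}\alpha^TQ\alpha - e^T\alpha$$

and

$$Q_{ij} = y_iy_jx_i^Tx_j, \quad e = [1, \ldots, 1]^T$$

No linear constraint $y^T\alpha = 0$ because of no bias term $b$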

(84)

Solving optimization problems Linear: coordinate descent method

### Dual Coordinate Descent

Very simple: minimizing one variable at a time

While $\alpha$ not optimal

For $i = 1, \ldots, l$

$$\min_{\alpha_i}\ f(\ldots, \alpha_i, \ldots)$$

A classic optimization technique

Traced back to Hildreth (1957) if constraints are not considered

(85)

Solving optimization problems Linear: coordinate descent method

### The Procedure

Given current $\alpha$. Let $e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$.

$$\min_d\ f(\alpha + de_i) = \frac{1}{2}Q_{ii}d^2 + \nabla_if(\alpha)d + \text{constant}$$

This sub-problem is a special case of the earlier sub-problem of the decomposition method for kernel classifiers

That is, the working set $B = \{i\}$

Without constraints

$$\text{optimal } d = -\frac{\nabla_if(\alpha)}{Q_{ii}}$$

(86)

Solving optimization problems Linear: coordinate descent method

### The Procedure (Cont’d)

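Now $0 \le \alpha_i + d \le C$

$$\alpha_i \leftarrow \min\left(\max\left(\alpha_i - \frac{\nabla_if(\alpha)}{Q_{ii}}, 0\right), C\right)$$

Note that

$$\nabla_if(\alpha) = (Q\alpha)_i - 1 = \sum_{j=1}^{l}Q_{ij}\alpha_j - 1 = \sum_{j=1}^{l}y_iy_jx_i^Tx_j\alpha_j - 1$$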

(87)

Solving optimization problems Linear: coordinate descent method

### The Procedure (Cont’d)

Directly calculating gradients costs $O(ln)$

$l$: number of data, $n$: number of features

This is the case for kernel classifiers

For linear SVM, define

$$u \equiv \sum_{j=1}^{l}y_j\alpha_jx_j,$$

Easy gradient calculation: costs $O(n)$

$$\nabla_if(\alpha) = y_iu^Tx_i - 1$$

(88)

Solving optimization problems Linear: coordinate descent method

### The Procedure (Cont’d)

All we need is to maintain $u$

$$u = \sum_{j=1}^{l}y_j\alpha_jx_j,$$

If $\bar{\alpha}_i$: old; $\alpha_i$: new

then

$$u \leftarrow u + (\alpha_i - \bar{\alpha}_i)y_ix_i.$$

Also costs $O(n)$

(89)

Solving optimization problems Linear: coordinate descent method

### Algorithm: Dual Coordinate Descent

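Given initial $\alpha$ and find $u = \sum_i y_i\alpha_ix_i$.

While $\alpha$ is not optimal (Outer iteration)

For $i = 1, \ldots, l$ (Inner iteration)

(a) $\bar{\alpha}_i \leftarrow \alpha_i$

(b) $G = y_iu^Tx_i - 1$

(c) If $\alpha_i$ can be changed

$\alpha_i \leftarrow \min(\max(\alpha_i - G/Q_{ii}, 0), C)$

$u \leftarrow u + (\alpha_i - \bar{\alpha}_i)y_ix_i$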
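The algorithm above can be sketched in NumPy as follows. This is a minimal illustration for dense data, not the LIBLINEAR implementation: the stopping rule (largest change in $\alpha$ below `tol`), the skipping of zero rows, the function name, and the toy data are all assumptions.

```python
import numpy as np

def dual_coordinate_descent(X, y, C, n_outer=200, tol=1e-8):
    """Sketch of dual coordinate descent for L1-loss linear SVM without bias."""
    l, n = X.shape
    alpha = np.zeros(l)
    u = np.zeros(n)                       # u = sum_i y_i alpha_i x_i
    Qii = np.einsum("ij,ij->i", X, X)     # Q_ii = x_i^T x_i
    for _ in range(n_outer):              # outer iteration
        max_change = 0.0
        for i in range(l):                # inner iteration
            if Qii[i] == 0.0:             # assumption: skip all-zero instances
                continue
            old = alpha[i]                                 # (a)
            G = y[i] * u.dot(X[i]) - 1.0                   # (b) O(n) gradient
            new = min(max(old - G / Qii[i], 0.0), C)       # (c) projected update
            if new != old:
                alpha[i] = new
                u += (new - old) * y[i] * X[i]             # maintain u in O(n)
                max_change = max(max_change, abs(new - old))
        if max_change < tol:              # assumed stopping rule
            break
    return alpha, u

# Hypothetical toy problem with two instances.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha, u = dual_coordinate_descent(X, y, C=1.0)
print(alpha, u)  # u plays the role of w
```

Each inner step touches only one instance, which is why the cost per update is $O(n)$ rather than $O(ln)$.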

(90)

Solving optimization problems Linear: coordinate descent method

### Difference from the Kernel Case

• We have seen that coordinate-descent type of methods are used for both linear and kernel classifiers

• Recall the $i$-th element of gradient costs $O(n)$ by

$$\nabla_if(\alpha) = \sum_{j=1}^{l}y_iy_jx_i^Tx_j\alpha_j - 1 = (y_ix_i)^T\left(\sum_{j=1}^{l}y_jx_j\alpha_j\right) - 1 = (y_ix_i)^Tu - 1$$

but we cannot do this for kernel because

$$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$$

cannot be separated

(91)

Solving optimization problems Linear: coordinate descent method

### Difference from the Kernel Case (Cont’d)

If using kernel, the cost of calculating $\nabla_if(\alpha)$ must be $O(ln)$

However, if $O(ln)$ cost is spent, the whole $\nabla f(\alpha)$ can be maintained (details not shown here)

In contrast, the setting of using $u$ knows $\nabla_if(\alpha)$ rather than the whole $\nabla f(\alpha)$

(92)

Solving optimization problems Linear: coordinate descent method

### Difference from the Kernel Case (Cont’d)

In existing coordinate descent methods for kernel classifiers, people also use $\nabla f(\alpha)$ information to select variables (i.e., select the set $B$) for update

In optimization there are two types of coordinate descent methods:

• sequential or random selection of variables

• greedy selection of variables

To do greedy selection, usually the whole gradient must be available

(93)

Solving optimization problems Linear: coordinate descent method

### Difference from the Kernel Case (Cont’d)

Existing coordinate descent methods for linear ⇒ related to sequential or random selection

Existing coordinate descent methods for kernel ⇒ related to greedy selection

(94)

Solving optimization problems Linear: coordinate descent method

### Bias Term b and Linear Constraint in Dual

In our discussion, $b$ is used for kernel but not linear

Mainly for historical reasons

For kernel SVM, we can also omit $b$ to get rid of the linear constraint $y^T\alpha = 0$

Then for kernel decomposition method, $|B| = 1$ can also be possible

(96)

Solving optimization problems Linear: second-order methods

### Optimization for Linear and Kernel Cases

Recall that

$$w = \sum_{i=1}^{l}y_i\alpha_i\phi(x_i)$$

Kernel: can only solve an optimization problem of $\alpha$

Linear: can solve either $w$ or $\alpha$

We will show an example to minimize over $w$

(97)

Solving optimization problems Linear: second-order methods

### Newton Method

Let’s minimize a twice-differentiable function

$$\min_w\ f(w)$$

For example, logistic regression has

$$\min_w\ \frac{1}{2}w^Tw + C\sum_{i=1}^{l}\log\left(1 + e^{-y_iw^Tx_i}\right).$$

Newton direction at iterate $w^k$

$$\min_s\ \nabla f(w^k)^Ts + \frac{1}{2}s^T\nabla^2f(w^k)s$$

(98)

Solving optimization problems Linear: second-order methods

### Truncated Newton Method

The above sub-problem is equivalent to solving Newton linear system

$$\nabla^2f(w^k)s = -\nabla f(w^k)$$

Approximately solving the linear system ⇒ truncated Newton

However, Hessian matrix $\nabla^2f(w^k)$ is too large to be stored

$\nabla^2f(w^k)$: $n \times n$, $n$: number of features

For document data, $n$ can be millions or more

(99)

Solving optimization problems Linear: second-order methods

### Using Special Properties of Data Classification

But Hessian has a special form

$$\nabla^2f(w) = I + CX^TDX,$$

$D$ diagonal. For logistic regression,

$$D_{ii} = \frac{e^{-y_iw^Tx_i}}{\left(1 + e^{-y_iw^Tx_i}\right)^2}$$

$X$: data, number of instances × number of features

$$X = [x_1, \ldots, x_l]^T$$

(100)

Solving optimization problems Linear: second-order methods

### Using Special Properties of Data Classification (Cont’d)

Using Conjugate Gradient (CG) to solve the linear system.

CG is an iterative procedure. Each CG step mainly needs one Hessian-vector product

$$\nabla^2f(w)d = d + C \cdot X^T(D(Xd))$$

Therefore, we have a Hessian-free approach

(101)

Solving optimization problems Linear: second-order methods

### Using Special Properties of Data Classification (Cont’d)

Now the procedure has two layers of iterations

Outer: Newton iterations

Inner: CG iterations per Newton iteration

Past machine learning works that used Hessian-free approaches include, for example, Keerthi and DeCoste (2005) and Lin et al. (2008)

Second-order information used: faster convergence than first-order methods
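The inner layer can be sketched as a plain conjugate gradient loop that sees the Hessian only through products $d + C \cdot X^T(D(Xd))$. This is a simplified illustration for regularized logistic regression on dense toy data: it computes one Newton direction and omits the trust-region safeguards of the cited methods. All function names, tolerances, and data are assumptions.

```python
import numpy as np

def lr_gradient(w, X, y, C):
    """Gradient of (7): w + C * sum_i (sigma(y_i w^T x_i) - 1) y_i x_i."""
    sigma = 1.0 / (1.0 + np.exp(-y * X.dot(w)))
    return w + C * X.T.dot((sigma - 1.0) * y)

def hessian_vector_product(w, X, y, C, d):
    """d + C * X^T (D (X d)) without forming the n-by-n Hessian."""
    sigma = 1.0 / (1.0 + np.exp(-y * X.dot(w)))
    D = sigma * (1.0 - sigma)   # equals e^{-z} / (1 + e^{-z})^2 with z = y_i w^T x_i
    return d + C * X.T.dot(D * X.dot(d))

def conjugate_gradient(hv, g, max_iter=50, tol=1e-10):
    """Approximately solve H s = -g using only products hv(d) = H d."""
    s = np.zeros_like(g)
    r = -g.copy()               # residual -g - H s at s = 0
    d = r.copy()
    rr = r.dot(r)
    for _ in range(max_iter):   # inner (CG) iterations
        if np.sqrt(rr) < tol:
            break
        Hd = hv(d)
        step = rr / d.dot(Hd)
        s += step * d
        r -= step * Hd
        rr_new = r.dot(r)
        d = r + (rr_new / rr) * d
        rr = rr_new
    return s

# Hypothetical random data: 20 instances, 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = np.where(rng.standard_normal(20) > 0, 1.0, -1.0)
w, C = np.zeros(5), 1.0

g = lr_gradient(w, X, y, C)
s = conjugate_gradient(lambda d: hessian_vector_product(w, X, y, C, d), g)
print(s)  # Newton direction at w; an outer iteration would move w along it
```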

(103)

Solving optimization problems Experiments

### Comparisons

L2-loss SVM is used

DCDL2: Dual coordinate descent (Hsieh et al., 2008)

DCDL2-S: DCDL2 with shrinking (Hsieh et al., 2008)

PCD: Primal coordinate descent (Chang et al., 2008)

TRON: Trust region Newton method (Lin et al., 2008)

(104)

Solving optimization problems Experiments

### Objective values (Time in Seconds)

[Figure: four panels, one per data set: news20, rcv1, yahoo-japan, yahoo-korea]

(105)

Solving optimization problems Experiments

### Analysis

Dual coordinate descent methods are very effective if the number of data and the number of features are both large

Useful for document classification

Half million data in a few seconds

However, it is less effective if

number of features small: should solve primal; or

large penalty parameter $C$; problems are more ill-conditioned

(106)

Solving optimization problems Experiments

### An Example When # Features Small

Number of instances: 32,561, number of features: 123

[Figure: two panels: objective value and accuracy]

(108)

Multi-core linear classification

### Outline

5 Multi-core linear classification

• Parallel matrix-vector multiplications

• Experiments

(109)

Multi-core linear classification

### Multi-core Linear Classification

Parallelization in shared-memory system: use the power of multi-core CPU if data can fit in memory

Example: we can parallelize the 2nd-order method (i.e., the Newton method) discussed earlier.

We discuss the study in Lee et al. (2015)

Recall the bottleneck is the Hessian-vector product

$$\nabla^2f(w)d = d + C \cdot X^T(D(Xd))$$

See the analysis in the next slide

(110)

Multi-core linear classification

### Matrix-vector Multiplications: More Than 90% of the Training Time

| Data set | #instances | #features | ratio |
|---|---|---|---|
| kddb | 19,264,097 | 29,890,095 | 82.11% |
| url_combined | 2,396,130 | 3,231,961 | 94.83% |
| webspam | 350,000 | 16,609,143 | 97.95% |
| rcv1_binary | 677,399 | 47,236 | 97.88% |
| covtype_binary | 581,012 | 54 | 89.20% |
| epsilon_normalized | 400,000 | 2,000 | 99.88% |
| rcv1 | 518,571 | 47,236 | 97.04% |
| covtype | 581,012 | 54 | 89.06% |

(111)

Multi-core linear classification

### Matrix-vector Multiplications: More Than 90% of the Training Time (Cont’d)

This result is by Newton methods using one core

We should parallelize matrix-vector multiplications

For $\nabla^2 f(w)d$ we must calculate

$u = Xd$ (9)

$u \leftarrow Du$ (10)

$\bar{u} = X^T u$, where $u = DXd$ (11)

Because D is diagonal, (10) is easy

We will discuss the parallelization of (9) and (11)
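As a rough illustration of steps (9) to (11), the following single-threaded sketch computes $d + C \cdot X^T(D(Xd))$ without ever forming the Hessian. The sparse row layout (a list of `(feature_index, value)` pairs per instance), the function name `hessian_vector_product`, and all numeric values are assumptions made for this example; they are not taken from LIBLINEAR or the slides.

```python
# Rows of X in a row-oriented sparse format: each row is a list of
# (feature_index, value) pairs. All values are made up for illustration.
X_rows = [
    [(0, 1.0), (2, 2.0)],
    [(1, 3.0)],
    [(0, 4.0), (1, 5.0), (2, 6.0)],
]
D_diag = [0.5, 1.0, 0.25]  # diagonal entries of D (hypothetical values)
C = 2.0                    # penalty parameter (hypothetical value)


def hessian_vector_product(X_rows, D_diag, C, d):
    """Return d + C * X^T (D (X d)) without forming the Hessian."""
    # Step (9): u = X d, one sparse inner product per row.
    u = [sum(v * d[j] for j, v in row) for row in X_rows]
    # Step (10): u <- D u, an element-wise scaling because D is diagonal.
    u = [D_ii * u_i for D_ii, u_i in zip(D_diag, u)]
    # Step (11): u_bar = X^T u, accumulated as a sum of scaled rows.
    u_bar = [0.0] * len(d)
    for u_i, row in zip(u, X_rows):
        for j, v in row:
            u_bar[j] += u_i * v
    return [d_j + C * ub_j for d_j, ub_j in zip(d, u_bar)]


d = [1.0, 0.0, -1.0]
Hd = hessian_vector_product(X_rows, D_diag, C, d)
print(Hd)
```

Steps (9) and (11) each touch every nonzero of X once, while step (10) costs only one multiplication per instance, which is consistent with the matrix-vector products dominating the running time.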

(112)

Multi-core linear classification Parallel matrix-vector multiplications

### Outline

5 Multi-core linear classification

Parallel matrix-vector multiplications

Experiments

(113)

Multi-core linear classification Parallel matrix-vector multiplications

### Parallel Xd Operation

Assume that X is in a row-oriented sparse format

$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix} \quad \text{and} \quad u = Xd = \begin{bmatrix} x_1^T d \\ \vdots \\ x_l^T d \end{bmatrix}$$

we have the following simple loop

for $i = 1, \ldots, l$ do

  $u_i = x_i^T d$

end for

Because the l inner products are independent, we can easily parallelize the loop by, for example, OpenMP
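The loop above can be sketched in Python as follows. This is only a structural illustration under stated assumptions: the slides use OpenMP, whereas this sketch uses a standard-library thread pool, and CPython threads will not give a real speedup for this kind of arithmetic. The function name `parallel_Xd`, the `(feature_index, value)` row layout, and the data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_Xd(X_rows, d, n_threads=2):
    """Compute u = X d, giving each task a contiguous chunk of rows."""
    l = len(X_rows)
    u = [0.0] * l

    def work(start, stop):
        # Each task writes only its own entries u[i], so no locking is needed.
        for i in range(start, stop):
            u[i] = sum(v * d[j] for j, v in X_rows[i])

    chunk = max(1, (l + n_threads - 1) // n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(work, s, min(s + chunk, l))
                   for s in range(0, l, chunk)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return u


# Made-up data: three instances, three features.
X_rows = [
    [(0, 1.0), (2, 2.0)],
    [(1, 3.0)],
    [(0, 4.0), (1, 5.0), (2, 6.0)],
]
u = parallel_Xd(X_rows, [1.0, 0.0, -1.0])
print(u)
```

Because each $u_i$ is written by exactly one task, the result does not depend on how the rows are divided among threads.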

(114)

Multi-core linear classification Parallel matrix-vector multiplications

### $X^T u$ Operation

For the other matrix-vector multiplication

$\bar{u} = X^T u$, where $u = DXd$,

we have

$$\bar{u} = u_1 x_1 + \cdots + u_l x_l.$$

Because matrix X is row-oriented, accessing columns of $X^T$ is much easier than accessing its rows

We can use the following loop

for $i = 1, \ldots, l$ do

  $\bar{u} \leftarrow \bar{u} + u_i x_i$

end for

(115)

Multi-core linear classification Parallel matrix-vector multiplications

### $X^T u$ Operation (Cont’d)

There is no need to store a separate $X^T$

However, it is possible that the threads working on $u_{i_1} x_{i_1}$ and $u_{i_2} x_{i_2}$ want to update the same component $\bar{u}_s$ at the same time:

for $i = 1, \ldots, l$ do in parallel

  for each $s$ with $(x_i)_s \neq 0$ do

    $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$

  end for

end for

(116)

Multi-core linear classification Parallel matrix-vector multiplications

### $X^T u$ Operation (Cont’d)

An atomic operation can prevent other threads from writing $\bar{u}_s$ at the same time.

for $i = 1, \ldots, l$ do in parallel

  for each $s$ with $(x_i)_s \neq 0$ do

    atomic: $\bar{u}_s \leftarrow \bar{u}_s + u_i (x_i)_s$

  end for

end for

However, waiting time can be a serious problem
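The atomic-update loop can be imitated in Python as sketched below. Python has no direct counterpart of an OpenMP atomic, so a single `threading.Lock` is used here as a coarse stand-in; that substitution, the function name `parallel_XTu_locked`, the row layout, and the data are all assumptions of this example. Every update to `u_bar` must acquire the lock, which shows where the waiting time comes from.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_XTu_locked(X_rows, u, n_features, n_threads=2):
    """Compute u_bar = X^T u with every update guarded by one shared lock."""
    l = len(X_rows)
    u_bar = [0.0] * n_features
    lock = threading.Lock()  # stand-in for an atomic update (assumption)

    def work(start, stop):
        for i in range(start, stop):
            for s, v in X_rows[i]:
                # Threads may target the same component s, so the
                # read-modify-write is serialized by the lock.
                with lock:
                    u_bar[s] += u[i] * v

    chunk = max(1, (l + n_threads - 1) // n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(work, s, min(s + chunk, l))
                   for s in range(0, l, chunk)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return u_bar


# Made-up data: three instances, three features.
X_rows = [
    [(0, 1.0), (2, 2.0)],
    [(1, 3.0)],
    [(0, 4.0), (1, 5.0), (2, 6.0)],
]
u_bar = parallel_XTu_locked(X_rows, [-0.5, 0.0, -0.5], n_features=3)
print(u_bar)
```

The lock is acquired once per nonzero of X, so threads spend much of their time waiting on each other rather than computing.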

(117)

Multi-core linear classification Parallel matrix-vector multiplications

### $X^T u$ Operation (Cont’d)

Another method is to use temporary arrays maintained by each thread and sum them up at the end

That is, store

$$\hat{u}^p = \sum_i \{u_i x_i \mid i \text{ run by thread } p\}$$

and then

$$\bar{u} = \sum_p \hat{u}^p$$
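A minimal sketch of this reduce approach follows, again using a standard-library thread pool in place of OpenMP. Each task accumulates into its own private array $\hat{u}^p$ and the arrays are summed once at the end, so no locking is needed. The function name `parallel_XTu_reduce`, the `(feature_index, value)` row layout, and the data are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_XTu_reduce(X_rows, u, n_features, n_threads=2):
    """Compute u_bar = X^T u using one private array per task, then a sum."""
    l = len(X_rows)
    chunk = max(1, (l + n_threads - 1) // n_threads)
    bounds = [(s, min(s + chunk, l)) for s in range(0, l, chunk)]

    def work(bound):
        start, stop = bound
        u_hat = [0.0] * n_features  # private to this task: no write conflicts
        for i in range(start, stop):
            for s, v in X_rows[i]:
                u_hat[s] += u[i] * v
        return u_hat

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(work, bounds))

    # Reduce step: u_bar is the sum of the per-task arrays.
    u_bar = [0.0] * n_features
    for u_hat in partials:
        for s in range(n_features):
            u_bar[s] += u_hat[s]
    return u_bar


# Made-up data: three instances, three features.
X_rows = [
    [(0, 1.0), (2, 2.0)],
    [(1, 3.0)],
    [(0, 4.0), (1, 5.0), (2, 6.0)],
]
u_bar = parallel_XTu_reduce(X_rows, [-0.5, 0.0, -0.5], n_features=3)
print(u_bar)
```

The price is extra memory (one array of length # features per thread) and a final summation pass, in exchange for removing all synchronization from the inner loop.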

(118)

Multi-core linear classification Parallel matrix-vector multiplications

### Atomic Operation: Almost No Speedup

Reduce operations are superior to atomic operations

[Figure: speedup versus # threads (1 to 12) for OMP-array and OMP-atomic; panels: rcv1_binary, covtype_binary]

Subsequently we use the reduce operations

(119)

Multi-core linear classification Parallel matrix-vector multiplications

### Existing Algorithms for Sparse Matrix-vector Product

This is always an important research issue in numerical analysis

Instead of our direct implementation to parallelize loops, in the next slides we will consider two existing methods

(120)

Multi-core linear classification Parallel matrix-vector multiplications

### Recursive Sparse Blocks (Martone, 2014)

RSB (Recursive Sparse Blocks) is an effective format for fast parallel sparse matrix-vector multiplications

It recursively partitions a matrix into blocks, as illustrated in the figure

Locality of memory references is improved, but the construction time is not negligible

(121)

Multi-core linear classification Parallel matrix-vector multiplications

### Recursive Sparse Blocks (Cont’d)

Parallel, efficient sparse matrix-vector operations

Improves locality of memory references

But the initial construction time is about that of 20 matrix-vector multiplications, which is not negligible in some cases

We will show the result in the experiment part

(122)

Multi-core linear classification Parallel matrix-vector multiplications

### Intel MKL

Intel Math Kernel Library (MKL) is a commercial library including optimized routines for linear algebra (Intel)

It supports fast matrix-vector multiplications for different sparse formats.

We consider the row-oriented format to store X.

(123)

Multi-core linear classification Experiments

### Outline

5 Multi-core linear classification

Parallel matrix-vector multiplications

Experiments

(124)

Multi-core linear classification Experiments

### Experiments

Baseline: single-core version in LIBLINEAR 1.96

OpenMP to parallelize loops

MKL: Intel MKL version 11.2

RSB: librsb version 1.2.0

(125)

Multi-core linear classification Experiments

### Speedup of Xd: All are Excellent

[Figure: speedup of Xd; panels: rcv1_binary, webspam, kddb, url_combined, covtype_binary, rcv1]
