### Training Large-scale Linear Classifiers

Chih-Jen Lin

Department of Computer Science National Taiwan University

http://www.csie.ntu.edu.tw/~cjlin

Talk at Hong Kong University of Science and Technology February 5, 2009

### Outline

- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions


### Kernel Methods and SVM

- Kernel methods became very popular in the past decade
- In particular, support vector machines (SVM)
- But training is slow on large data due to the nonlinear mapping (it enlarges the # features)

Example: x = [x_{1}, x_{2}, x_{3}]^{T} ∈ R^{3}
φ(x) = [1, √2x_{1}, √2x_{2}, √2x_{3}, x_{1}^{2}, x_{2}^{2}, x_{3}^{2}, √2x_{1}x_{2}, √2x_{1}x_{3}, √2x_{2}x_{3}]^{T} ∈ R^{10}
If data are very large ⇒ often need approximation
e.g., sub-sampling and many other ways
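As a concrete check of the mapping above, the explicit degree-2 polynomial features of x ∈ R^{3} reproduce the polynomial kernel K(x, z) = (1 + x^{T}z)^{2} as a plain inner product. A minimal sketch (the test vectors are arbitrary, not from the talk):

```python
import math

def phi(x):
    """Explicit degree-2 polynomial mapping R^3 -> R^10."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0,
            r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Kernel trick: phi(x)^T phi(z) equals (1 + x^T z)^2 without
# ever forming the 10-dimensional vectors in practice.
x, z = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
assert abs(dot(phi(x), phi(z)) - (1.0 + dot(x, z)) ** 2) < 1e-9
```

This is why kernel methods avoid the explicit mapping: the kernel value costs O(n), while φ(x) has O(n²) components already for degree 2.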

### Linear Classification

- Certain problems: # features large
- Often similar accuracy with/without nonlinear mappings
- Linear classification: no mapping; stay in the original input space
- We can efficiently train very large data
- Document classification is of this type; very important for Internet companies

### An Example

- rcv1: # data > 600k, # features > 40k
- Using LIBSVM (linear kernel): > 10 hours
- Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds
- Same stopping condition in solving the SVM optimization problems

Will show how this is achieved and discuss if there are any concerns

### Outline

- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions

### Support Vector Classification

Training data (x_{i}, y_{i}), i = 1, ..., l, x_{i} ∈ R^{n}, y_{i} = ±1

min_{w} (1/2) w^{T}w + C Σ_{i=1}^{l} max(0, 1 − y_{i}w^{T}φ(x_{i}))

- C: regularization parameter
- High dimensional (maybe infinite) feature space: φ(x) = [φ_{1}(x), φ_{2}(x), ...]^{T}
- We omit the bias term b
- w: may have infinitely many components

### Support Vector Classification (Cont’d)

The dual problem (finite # variables):

min_{α} f(α) = (1/2) α^{T}Qα − e^{T}α
subject to 0 ≤ α_{i} ≤ C, i = 1, ..., l,

where Q_{ij} = y_{i}y_{j}φ(x_{i})^{T}φ(x_{j}) and e = [1, ..., 1]^{T}

At optimum: w = Σ_{i=1}^{l} α_{i}y_{i}φ(x_{i})

Kernel: K(x_{i}, x_{j}) ≡ φ(x_{i})^{T}φ(x_{j})

### Large Dense Quadratic Programming

Q_{ij} ≠ 0; Q: an l by l fully dense matrix

min_{α} f(α) = (1/2) α^{T}Qα − e^{T}α
subject to 0 ≤ α_{i} ≤ C, i = 1, ..., l

- 50,000 training points ⇒ 50,000 variables:
  (50,000^{2} × 8 / 2) bytes = 10GB RAM to store Q
- Traditional methods (Newton, quasi-Newton) cannot be directly applied
- Now most use decomposition methods
  [Osuna et al., 1997, Joachims, 1998, Platt, 1998]
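The 10GB figure follows from storing one triangle of the symmetric matrix Q as 8-byte doubles; a one-line arithmetic check:

```python
# Storage for the dense kernel matrix Q: l*l entries of 8 bytes each,
# halved because only one triangle of the symmetric matrix is kept.
l = 50_000
q_bytes = l * l * 8 // 2
assert q_bytes == 10_000_000_000  # 10 GB
```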

### Decomposition Methods

- We consider a one-variable version
- Similar to coordinate descent methods
- Select the i-th component for update:

min_{d} (1/2)(α + d e_{i})^{T}Q(α + d e_{i}) − e^{T}(α + d e_{i})
subject to 0 ≤ α_{i} + d ≤ C

where e_{i} ≡ [0, ..., 0, 1, 0, ..., 0]^{T} (the 1 is in position i, after i − 1 zeros)

α: current solution; only the i-th component is changed

### Avoid Memory Problems

The new objective function:

(1/2) Q_{ii}d^{2} + (Qα − e)_{i}d + constant

To get (Qα − e)_{i}, only Q's i-th row is needed:

(Qα − e)_{i} = Σ_{j=1}^{l} Q_{ij}α_{j} − 1

- Calculated when needed: trade time for space
- Used by popular software (e.g., SVM^{light}, LIBSVM)
- They update 10 (SVM^{light}) and 2 (LIBSVM) variables at a time

### Decomposition Methods: Algorithm

Optimal d:

d = −(Qα − e)_{i} / Q_{ii} = −(Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}

Consider the lower/upper bounds [0, C]. Algorithm:

While α is not optimal
1. Select the i-th element for update
2. α_{i} ← min(max(α_{i} − (Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}, 0), C)
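The loop above can be sketched in a few lines of Python. This is a minimal illustration with sequential selection, recomputing Q's i-th row on demand; the linear kernel and the tiny 1-D data set in the test are assumptions for illustration, not the talk's experiments:

```python
def K(xi, xj):                  # kernel; linear here for simplicity
    return sum(a * b for a, b in zip(xi, xj))

def decomposition(X, y, C, sweeps=50):
    """One-variable decomposition method on the SVM dual."""
    l = len(X)
    alpha = [0.0] * l
    for _ in range(sweeps):
        for i in range(l):      # sequential selection of i
            # i-th row of Q, computed when needed (trade time for space)
            Qi = [y[i] * y[j] * K(X[i], X[j]) for j in range(l)]
            grad_i = sum(Qi[j] * alpha[j] for j in range(l)) - 1.0
            if Qi[i] > 0.0:
                d = -grad_i / Qi[i]
                # exact line minimum, clipped to the box [0, C]
                alpha[i] = min(max(alpha[i] + d, 0.0), C)
    return alpha
```

Each inner step costs O(ln): one kernel row (l kernel evaluations of O(n) each) plus an O(l) inner product.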

### Select an Element for Update

Many ways:

- Sequential (easiest)
- Permuting 1, ..., l every l steps
- Random

Existing software checks gradient information

∇_{1}f(α), ..., ∇_{l}f(α)

But is ∇f(α) available?

### Select an Element for Update (Cont’d)

We can easily maintain the gradient

∇f(α) = Qα − e

∇_{s}f(α) = (Qα)_{s} − 1 = Σ_{j=1}^{l} Q_{sj}α_{j} − 1

- Initial α = 0 ⇒ ∇f(0) = −e
- When α_{i} is updated to ᾱ_{i}:

∇_{s}f(α) ← ∇_{s}f(α) + Q_{si}(ᾱ_{i} − α_{i}), ∀s

O(l) if Q_{si} ∀s (the i-th column) are available

### Select an Element for Update (Cont’d)

Whether we maintain ∇f(α) or not, Q's i-th row (column) is always needed:

ᾱ_{i} ← min(max(α_{i} − (Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}, 0), C)

- Q is symmetric
- Using ∇f(α) to select i: faster convergence, i.e., fewer iterations

### Decomposition Methods: Using Gradient

The new procedure:

α = 0, ∇f(α) = −e
While α is not optimal
1. Select the i-th element using ∇f(α)
2. ᾱ_{i} ← min(max(α_{i} − (Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}, 0), C)
3. ∇_{s}f(α) ← ∇_{s}f(α) + Q_{si}(ᾱ_{i} − α_{i}), ∀s

Cost per iteration: O(ln); l: # instances, n: # features

Assume each Q_{ij} = y_{i}y_{j}K(x_{i}, x_{j}) takes O(n)
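The gradient-maintaining variant can be sketched as follows. The greedy "most violating variable" selection rule via the projected gradient, the linear kernel, and the toy data in the test are illustrative assumptions (real software uses more refined working-set selection):

```python
def K(xi, xj):                           # linear kernel as an example
    return sum(a * b for a, b in zip(xi, xj))

def decomposition_grad(X, y, C, iters=100):
    """Decomposition method keeping grad = Q*alpha - e up to date."""
    l = len(X)
    alpha = [0.0] * l
    grad = [-1.0] * l                    # grad f(0) = -e

    def pg(s):                           # projected gradient at the bounds
        if alpha[s] <= 0.0:
            return min(grad[s], 0.0)
        if alpha[s] >= C:
            return max(grad[s], 0.0)
        return grad[s]

    for _ in range(iters):
        i = max(range(l), key=lambda s: abs(pg(s)))   # most violating i
        if abs(pg(i)) < 1e-12:
            break                        # KKT conditions satisfied
        Qi = [y[i] * y[j] * K(X[i], X[j]) for j in range(l)]  # i-th column
        new_ai = min(max(alpha[i] - grad[i] / Qi[i], 0.0), C)
        delta = new_ai - alpha[i]
        for s in range(l):               # O(l) gradient maintenance
            grad[s] += Qi[s] * delta
        alpha[i] = new_ai
    return alpha
```

Selection is O(l) and the gradient update is O(l), but the kernel column still costs O(ln), so the total per iteration remains O(ln); the payoff is fewer iterations.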

### Outline

- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions

### Linear SVM for Large Document Sets

Document classification:

- Bag of words model (TF-IDF or others)
- A large # of features
- Testing accuracy: linear/nonlinear SVMs similar
  (nonlinear SVM: we mean SVM via kernels)

Recently an active research topic:

- SVM^{perf} [Joachims, 2006]
- Pegasos [Shalev-Shwartz et al., 2007]
- LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]
- and others

### Linear SVM

Primal (without the bias term b):

min_{w} (1/2) w^{T}w + C Σ_{i=1}^{l} max(0, 1 − y_{i}w^{T}x_{i})

Dual:

min_{α} f(α) = (1/2) α^{T}Qα − e^{T}α
subject to 0 ≤ α_{i} ≤ C, ∀i

where Q_{ij} = y_{i}y_{j}x_{i}^{T}x_{j}

### Revisit Decomposition Methods

While α is not optimal
1. Select the i-th element for update
2. α_{i} ← min(max(α_{i} − (Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}, 0), C)

O(ln) per iteration; n: # features, l: # data

For linear SVM, define

w ≡ Σ_{j=1}^{l} y_{j}α_{j}x_{j} ∈ R^{n}

Then O(n) per iteration:

Σ_{j=1}^{l} Q_{ij}α_{j} − 1 = Σ_{j=1}^{l} y_{i}y_{j}x_{i}^{T}x_{j}α_{j} − 1 = y_{i}w^{T}x_{i} − 1

All we need is to maintain w. If α_{i} is updated to ᾱ_{i}, then O(n) for

w ← w + (ᾱ_{i} − α_{i})y_{i}x_{i}

- Initial w: α = 0 ⇒ w = 0
- Give up maintaining ∇f(α)
- Select i for update: sequential, random, or permuting 1, ..., l every l steps
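Putting the pieces together, dual coordinate descent with w maintained explicitly looks roughly like the sketch below (in the spirit of LIBLINEAR's solver; the toy 2-D data in the test and the fixed sweep count are assumptions):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_dcd(X, y, C, sweeps=50):
    """Dual coordinate descent for linear SVM, maintaining
    w = sum_j y_j alpha_j x_j so each update costs O(n), not O(ln)."""
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n                        # alpha = 0  =>  w = 0
    Qii = [dot(x, x) for x in X]         # diagonal of Q, computed once
    for _ in range(sweeps):
        for i in range(l):               # sequential selection
            G = y[i] * dot(w, X[i]) - 1.0        # (Q alpha - e)_i in O(n)
            new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)
            d = new_ai - alpha[i]
            if d != 0.0:
                for t in range(n):       # maintain w, also O(n)
                    w[t] += d * y[i] * X[i][t]
                alpha[i] = new_ai
    return w, alpha
```

Note that neither Q nor ∇f(α) is ever formed: the single vector w replaces both, which is exactly what drops the per-iteration cost from O(ln) to O(n).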

### Algorithms for Linear and Nonlinear SVM

Linear:

While α is not optimal
1. Select the i-th element for update
2. ᾱ_{i} ← min(max(α_{i} − (y_{i}w^{T}x_{i} − 1) / Q_{ii}, 0), C)
3. w ← w + (ᾱ_{i} − α_{i})y_{i}x_{i}

Nonlinear:

While α is not optimal
1. Select the i-th element using ∇f(α)
2. ᾱ_{i} ← min(max(α_{i} − (Σ_{j=1}^{l} Q_{ij}α_{j} − 1) / Q_{ii}, 0), C)
3. ∇_{s}f(α) ← ∇_{s}f(α) + Q_{si}(ᾱ_{i} − α_{i}), ∀s

### Analysis

- Decomposition method for nonlinear (also linear): O(ln) per iteration (used in LIBSVM)
- New way for linear: O(n) per iteration (used in LIBLINEAR)
- The new way is faster if # iterations is not l times more

Experiments:

| Problem     | l: # data | n: # features |
|-------------|-----------|---------------|
| news20      | 19,996    | 1,355,191     |
| yahoo-japan | 176,203   | 832,026       |
| rcv1        | 677,399   | 47,236        |
| yahoo-korea | 460,554   | 3,052,939     |

### Testing Accuracy versus Training Time

(Figure: four panels showing testing accuracy versus training time on news20, yahoo-japan, rcv1, and yahoo-korea)

### Outline

- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions

### Limitation

- A few seconds for millions of data; too good to be true?
- Less effective if C is large (or data not scaled)
- The same problem occurs for training nonlinear SVMs
- But no need to use large C:
  the model is the same for all C ≥ C̄ [Keerthi and Lin, 2003]
- C̄ is small for document data (if scaled)

### Limitation (Cont’d)

- Less effective if # features is small
- Should then solve the primal: # variables = # features
- Or why not use kernels with nonlinear mappings?

### Comparing Different Training Methods

- O(ln) versus O(n) per iteration
- Generally, the new method for linear is much faster, especially for document data
- But one can always find weird cases where LIBSVM is faster than LIBLINEAR
- Applying the right approach to the right problem is essential
- One must be careful when comparing training algorithms

### Software Issue

- Large data ⇒ may need different training strategies for different problems
- But we pay the price of complicating software packages
- The success of LIBSVM and SVM^{light}: simple and general
- They cover both linear/nonlinear
- General versus special: always an issue

### Other Methods for Linear SVM

- w is the key to reducing O(ln) to O(n) per iteration:

w = Σ_{j=1}^{l} y_{j}α_{j}x_{j} ∈ R^{n}

- Many optimization methods can be used
- We can now solve the primal directly: w is not infinite any more

min_{w} (1/2) w^{T}w + C Σ_{i=1}^{l} max(0, 1 − y_{i}w^{T}x_{i})

- We used the decomposition method as an example because it works for both linear and nonlinear
- It is easy to see the striking difference with/without w
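As one example of solving the primal directly, here is a stochastic subgradient sketch in the spirit of Pegasos [Shalev-Shwartz et al., 2007]; with λ = 1/(Cl) it targets an equivalently scaled objective. The step-size schedule follows the Pegasos paper, while the epoch count and the toy data in the test are assumptions:

```python
import random

def pegasos(X, y, lam=0.1, epochs=100, seed=0):
    """Stochastic subgradient descent on
    (lam/2) w'w + (1/l) sum_i max(0, 1 - y_i w'x_i)."""
    rng = random.Random(seed)
    l, n = len(X), len(X[0])
    w = [0.0] * n
    t = 0
    for _ in range(epochs):
        for _ in range(l):
            t += 1
            i = rng.randrange(l)                   # pick a random example
            eta = 1.0 / (lam * t)                  # Pegasos step size
            margin = y[i] * sum(w[k] * X[i][k] for k in range(n))
            for k in range(n):
                w[k] *= (1.0 - eta * lam)          # shrink from regularizer
            if margin < 1.0:                       # hinge subgradient step
                for k in range(n):
                    w[k] += eta * y[i] * X[i][k]
    return w
```

Like the dual coordinate descent, each update touches only one x_{i}, so the cost per step is O(n).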

### Other Linear Classifiers

- Logistic regression, maximum entropy, conditional random fields (CRF)
- All are linear classifiers
- In the past, SVM training was considered very different from them
- For the linear case, things are closely related
- Many interesting findings, but no time to show details

### What if Data Are Even Larger?

- We see that I/O costs more than computing
- Large-scale document classification on a single computer is essentially a solved problem
- Challenges:
  - What if data are larger than the computer's RAM?
  - What if data are stored in a distributed manner?
- Document classification in a data center environment is an interesting research direction

### Conclusions

- For certain problems, linear classifiers are as accurate as nonlinear ones, and more efficient for training/testing
- However, we are not claiming you shouldn't use kernels any more
- For large data, the right approaches are essential
- Machine learning researchers should clearly tell people when to use which methods
- You are welcome to try our software:
  - http://www.csie.ntu.edu.tw/~cjlin/libsvm
  - http://www.csie.ntu.edu.tw/~cjlin/liblinear