Training Large-scale Linear Classifiers
Chih-Jen Lin
Department of Computer Science, National Taiwan University
http://www.csie.ntu.edu.tw/~cjlin
Talk at Hong Kong University of Science and Technology, February 5, 2009
Outline
- Linear versus Nonlinear Classification
- Review of SVM Training
- Large-scale Linear SVM
- Discussion and Conclusions
Kernel Methods and SVM
- Kernel methods became very popular in the past decade, in particular support vector machines (SVM)
- But training is slow on large data because the nonlinear mapping enlarges the # of features
Example: $x = [x_1, x_2, x_3]^T \in R^3$,

$$\phi(x) = \bigl[1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3\bigr]^T \in R^{10}$$

If data are very large ⇒ often need approximations, e.g., sub-sampling and many other ways
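As a sanity check (not from the slides), a minimal Python/numpy sketch showing that this explicit mapping realizes the degree-2 polynomial kernel $(1 + x^T z)^2$, which is why the 10-dimensional vectors never need to be formed:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial mapping from R^3 to R^10."""
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# The kernel trick: phi(x)^T phi(z) equals (1 + x^T z)^2
print(phi(x) @ phi(z))    # 30.25
print((1 + x @ z) ** 2)   # 30.25
```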
Linear Classification
- Certain problems: # of features is large
- Often similar accuracy with/without nonlinear mappings
- Linear classification: no mapping; stay in the original input space
- We can efficiently train on very large data
- Document classification is of this type; very important for Internet companies
An Example
- rcv1: # data > 600k, # features > 40k
- Using LIBSVM (linear kernel): > 10 hours
- Using LIBLINEAR: computation < 5 seconds, I/O 60 seconds
- Same stopping condition in solving the SVM optimization problem
- Will show how this is achieved and discuss whether there are any concerns
Support Vector Classification
Training data $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in R^n$, $y_i = \pm 1$

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T \phi(x_i)\bigr)$$

- $C$: regularization parameter
- High dimensional (maybe infinite) feature space: $\phi(x) = [\phi_1(x), \phi_2(x), \ldots]^T$
- We omit the bias term $b$
- $w$: may have infinitely many variables
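To make the objective concrete, a hedged Python/numpy sketch (toy data and $C$ are illustrative only) that evaluates this primal objective for the identity mapping $\phi(x) = x$:

```python
import numpy as np

def primal_objective(w, X, y, C):
    """(1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i), with phi = identity."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w @ w + C * hinge.sum()

# Toy data: two classes in R^2
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(primal_objective(np.array([0.5, 0.5]), X, y, C=1.0))  # 0.25
```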
Support Vector Classification (Cont’d)
The dual problem (finite # of variables):

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,$$

where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$

At optimum: $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$

Kernel: $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$
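Since $w$ lives in the (possibly infinite-dimensional) feature space, predictions go through the kernel instead. A minimal sketch (the RBF kernel, toy data, and names are illustrative assumptions, not from the talk):

```python
import numpy as np

def decision_value(x, X_train, y_train, alpha, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x); w is never formed explicitly."""
    return float(np.sum(alpha * y_train * kernel(X_train, x)))

# Example kernel: RBF, K(x_i, x) = exp(-gamma ||x_i - x||^2)
def rbf(X_all, x, gamma=0.5):
    return np.exp(-gamma * np.sum((X_all - x) ** 2, axis=1))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.7])
print(decision_value(np.array([0.5, 0.5]), X_train, y_train, alpha, rbf))
```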
Large Dense Quadratic Programming
$Q_{ij} \neq 0$: $Q$ is an $l$ by $l$ fully dense matrix

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l$$

50,000 training points ⇒ 50,000 variables: $(50{,}000^2 \times 8/2)$ bytes = 10 GB of RAM to store $Q$

Traditional methods (Newton, quasi-Newton) cannot be directly applied; now most solvers use decomposition methods
[Osuna et al., 1997, Joachims, 1998, Platt, 1998]
Decomposition Methods
We consider a one-variable version, similar to coordinate descent methods. Select the $i$-th component for update:

$$\min_d\ \frac{1}{2}(\alpha + d e_i)^T Q(\alpha + d e_i) - e^T(\alpha + d e_i) \quad \text{subject to}\quad 0 \le \alpha_i + d \le C,$$

where

$$e_i \equiv [\underbrace{0, \ldots, 0}_{i-1},\ 1,\ 0, \ldots, 0]^T$$

$\alpha$: the current solution; only its $i$-th component is changed
Avoid Memory Problems
The new objective function, as a function of $d$:

$$\frac{1}{2} Q_{ii} d^2 + (Q\alpha - e)_i d + \text{constant}$$

To get $(Q\alpha - e)_i$, only $Q$'s $i$-th row is needed:

$$(Q\alpha - e)_i = \sum_{j=1}^{l} Q_{ij}\alpha_j - 1$$

It is calculated when needed: trade time for space
Used by popular software (e.g., SVMlight, LIBSVM), which update 10 and 2 variables at a time, respectively
Decomposition Methods: Algorithm
The optimal $d$:

$$d = \frac{-(Q\alpha - e)_i}{Q_{ii}} = -\frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}}$$

Taking the lower/upper bounds $[0, C]$ into account, the algorithm is:

While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
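A minimal Python/numpy sketch of this algorithm with sequential selection (my simplifications: a fixed number of passes instead of a real stopping condition, and the assumption $Q_{ii} > 0$). The $i$-th row of $Q$ is recomputed on demand, so each update costs $O(ln)$ but the full matrix is never stored:

```python
import numpy as np

def dual_cd_kernel(X, y, C, kernel, n_passes=10):
    """One-variable decomposition on the SVM dual.

    Q's i-th row is recomputed when needed (trade time for space),
    so the full l-by-l matrix is never stored.
    """
    l = X.shape[0]
    alpha = np.zeros(l)
    for _ in range(n_passes):
        for i in range(l):                    # sequential selection
            Qi = y[i] * y * kernel(X[i], X)   # i-th row of Q, O(l n)
            grad_i = Qi @ alpha - 1.0         # (Q alpha - e)_i
            alpha[i] = min(max(alpha[i] - grad_i / Qi[i], 0.0), C)
    return alpha

# A linear kernel for illustration; any K(x_i, x_j) works here
linear_kernel = lambda xi, X_all: X_all @ xi
```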
Select an Element for Update
Many ways:
- Sequential (easiest)
- Permuting $1, \ldots, l$ every $l$ steps
- Random

Existing software checks gradient information $\nabla_1 f(\alpha), \ldots, \nabla_l f(\alpha)$. But is $\nabla f(\alpha)$ available?
Select an Element for Update (Cont’d)
We can easily maintain the gradient:

$$\nabla f(\alpha) = Q\alpha - e, \qquad \nabla_s f(\alpha) = (Q\alpha)_s - 1 = \sum_{j=1}^{l} Q_{sj}\alpha_j - 1$$

Initially $\alpha = 0$, so $\nabla f(0) = -e$

When $\alpha_i$ is updated to $\bar\alpha_i$:

$$\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$$

This costs $O(l)$ if $Q_{si},\ \forall s$ (the $i$-th column) are available
Select an Element for Update (Cont’d)
Whether or not we maintain $\nabla f(\alpha)$, $Q$'s $i$-th row (column) is always needed:

$$\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$$

($Q$ is symmetric, so row and column are the same)

Using $\nabla f(\alpha)$ to select $i$: faster convergence, i.e., fewer iterations
Decomposition Methods: Using Gradient
The new procedure:

$\alpha = 0$, $\nabla f(\alpha) = -e$
While $\alpha$ is not optimal
1. Select the $i$-th element using $\nabla f(\alpha)$
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$

Cost per iteration: $O(ln)$, where $l$: # of instances and $n$: # of features (assume each $Q_{ij} = y_i y_j K(x_i, x_j)$ takes $O(n)$)
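A hedged sketch of this gradient-maintaining procedure. For brevity $Q$ is precomputed here (real solvers compute the needed $i$-th column on demand, at $O(ln)$), and $i$ is chosen by the largest projected-gradient violation, a simplified version of the selection rules in solvers such as LIBSVM:

```python
import numpy as np

def dual_cd_with_gradient(Q, C, max_iters=1000, tol=1e-6):
    """Decomposition maintaining grad = Q alpha - e with an O(l) update."""
    l = Q.shape[0]
    alpha = np.zeros(l)
    grad = -np.ones(l)                            # alpha = 0  =>  grad = -e
    for _ in range(max_iters):
        pg = grad.copy()                          # projected gradient:
        pg[(alpha <= 0.0) & (grad > 0.0)] = 0.0   # cannot decrease alpha_i
        pg[(alpha >= C) & (grad < 0.0)] = 0.0     # cannot increase alpha_i
        i = int(np.argmax(np.abs(pg)))
        if abs(pg[i]) < tol:                      # optimal within tolerance
            break
        new_ai = min(max(alpha[i] - grad[i] / Q[i, i], 0.0), C)
        grad += Q[:, i] * (new_ai - alpha[i])     # O(l) gradient maintenance
        alpha[i] = new_ai
    return alpha
```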
Linear SVM for Large Document Sets
- Document classification: bag-of-words model (TF-IDF or others), a large # of features
- Testing accuracy: linear and nonlinear SVMs are similar (by nonlinear SVM we mean SVM via kernels)
- Recently an active research topic:
  - SVMperf [Joachims, 2006]
  - Pegasos [Shalev-Shwartz et al., 2007]
  - LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]
  - and others
Linear SVM
Primal, without the bias term $b$:

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

Dual:

$$\min_\alpha\ f(\alpha) = \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \quad \text{subject to}\quad 0 \le \alpha_i \le C,\ \forall i, \qquad Q_{ij} = y_i y_j x_i^T x_j$$
Revisit Decomposition Methods
While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$

$O(ln)$ per iteration; $n$: # of features, $l$: # of data

For linear SVM, define

$$w \equiv \sum_{j=1}^{l} y_j \alpha_j x_j \in R^n$$

Then each iteration costs only $O(n)$, because

$$\sum_{j=1}^{l} Q_{ij}\alpha_j - 1 = \sum_{j=1}^{l} y_i y_j x_i^T x_j \alpha_j - 1 = y_i w^T x_i - 1$$

All we need is to maintain $w$: if $\alpha_i$ is updated to $\bar\alpha_i$, then

$$w \leftarrow w + (\bar\alpha_i - \alpha_i) y_i x_i$$

costs $O(n)$

- Initial $w$: $\alpha = 0 \Rightarrow w = 0$
- Give up maintaining $\nabla f(\alpha)$
- Select $i$ for update: sequential, random, or permuting $1, \ldots, l$ every $l$ steps (a sketch follows below)
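A minimal Python/numpy sketch of this $w$-maintaining coordinate descent (my simplifications: a fixed number of passes instead of a real stopping condition, dense arithmetic, no shrinking, and the assumption $x_i \neq 0$ so $Q_{ii} > 0$; LIBLINEAR itself works with sparse data):

```python
import numpy as np

def dual_cd_linear(X, y, C, n_passes=10, seed=0):
    """Dual coordinate descent for linear SVM, maintaining w: O(n) per update."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                           # alpha = 0  =>  w = 0
    Qii = np.einsum('ij,ij->i', X, X)         # Q_ii = x_i^T x_i (y_i^2 = 1)
    for _ in range(n_passes):
        for i in rng.permutation(l):          # permute 1..l every pass
            grad_i = y[i] * (w @ X[i]) - 1.0  # (Q alpha - e)_i via w
            new_ai = min(max(alpha[i] - grad_i / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]   # O(n) maintenance of w
            alpha[i] = new_ai
    return w, alpha
```

The point of the design: no kernel row of length $l$ is ever formed, which is exactly where the $O(ln) \to O(n)$ saving comes from.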
Algorithms for Linear and Nonlinear SVM
Linear:

While $\alpha$ is not optimal
1. Select the $i$-th element for update
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{y_i w^T x_i - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $w \leftarrow w + (\bar\alpha_i - \alpha_i) y_i x_i$

Nonlinear:

While $\alpha$ is not optimal
1. Select the $i$-th element using $\nabla f(\alpha)$
2. $\bar\alpha_i \leftarrow \min\Bigl(\max\Bigl(\alpha_i - \frac{\sum_{j=1}^{l} Q_{ij}\alpha_j - 1}{Q_{ii}},\ 0\Bigr),\ C\Bigr)$
3. $\nabla_s f(\alpha) \leftarrow \nabla_s f(\alpha) + Q_{si}(\bar\alpha_i - \alpha_i),\ \forall s$
Analysis
- Decomposition method for nonlinear (also works for linear): $O(ln)$ per iteration (used in LIBSVM)
- New way for linear: $O(n)$ per iteration (used in LIBLINEAR)
- The new way is faster as long as it does not need $l$ times more iterations

Experiments

Problem       l: # data   n: # features
news20        19,996      1,355,191
yahoo-japan   176,203     832,026
rcv1          677,399     47,236
yahoo-korea   460,554     3,052,939
Testing Accuracy versus Training Time
[Figures: testing accuracy versus training time on news20, yahoo-japan, rcv1, and yahoo-korea]
Limitation
- A few seconds for millions of data points: too good to be true?
- Less effective if $C$ is large (or data are not scaled); the same problem occurs for training nonlinear SVMs
- But there is no need to use a large $C$: the model is the same for all $C \ge \bar{C}$ [Keerthi and Lin, 2003]
- $\bar{C}$ is small for document data (if scaled)
Limitation (Cont’d)
- Less effective if the # of features is small; should then solve the primal (# of variables = # of features)
- Or, why not use kernels with nonlinear mappings?
Comparing Different Training Methods
- $O(ln)$ versus $O(n)$ per iteration
- Generally, the new method for linear is much faster, especially for document data
- But one can always find weird cases where LIBSVM is faster than LIBLINEAR
- Applying the right approach to the right problem is essential
- One must be careful when comparing training algorithms
Software Issue
- Large data ⇒ may need different training strategies for different problems
- But we pay the price of complicating software packages
- The success of LIBSVM and SVMlight: simple and general, covering both linear and nonlinear
- General versus special: always an issue
Other Methods for Linear SVM
$w$ is the key to reducing $O(ln)$ to $O(n)$ per iteration:

$$w = \sum_{j=1}^{l} y_j \alpha_j x_j \in R^n$$

Many optimization methods can be used. We can now also solve the primal, since $w$ is no longer infinite-dimensional:

$$\min_w\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

We used the decomposition method as an example because it works for both linear and nonlinear SVMs and makes the striking difference with/without $w$ easy to see. A primal alternative is sketched below.
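As one example of solving the primal directly, a hedged Pegasos-style stochastic subgradient sketch (the step-size schedule and `lr0` are my illustrative assumptions, not from the talk):

```python
import numpy as np

def primal_sgd(X, y, C, n_passes=10, lr0=0.1, seed=0):
    """Stochastic subgradient descent on
    (1/2) w^T w + C sum_i max(0, 1 - y_i w^T x_i)."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    w = np.zeros(n)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(l):
            t += 1
            eta = lr0 / (1.0 + lr0 * t / l)   # decaying step size
            g = w / l                         # regularizer's share of the gradient
            if y[i] * (w @ X[i]) < 1.0:       # hinge term is active
                g = g - C * y[i] * X[i]
            w -= eta * g                      # subgradient step
    return w
```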
Other Linear Classifiers
- Logistic regression, maximum entropy, conditional random fields (CRF) are all linear classifiers
- In the past, SVM training was considered very different from them
- For the linear case, the methods are very related
- Many interesting findings, but no time to show the details
What if Data Are Even Larger?
- We saw that I/O costs more than computation
- Large-scale document classification on a single computer is essentially a solved problem
- Challenges: what if data are larger than the computer's RAM? What if data are stored in a distributed way?
- Document classification in a data-center environment is an interesting research direction
Conclusions
- For certain problems, linear classifiers are as accurate as nonlinear ones, and more efficient for training/testing
- However, we are not claiming that you should no longer use kernels
- For large data, the right approaches are essential; machine learning researchers should clearly tell people when to use which methods
- You are welcome to try our software:
  http://www.csie.ntu.edu.tw/~cjlin/libsvm
  http://www.csie.ntu.edu.tw/~cjlin/liblinear