(1)

Training Large-scale Linear Classifiers

Chih-Jen Lin

Department of Computer Science, National Taiwan University

http://www.csie.ntu.edu.tw/~cjlin

Talk at Hong Kong University of Science and Technology, February 5, 2009

(2)

Outline

Linear versus Nonlinear Classification
Review of SVM Training
Large-scale Linear SVM
Discussion and Conclusions

(3)

Outline

Linear versus Nonlinear Classification
Review of SVM Training
Large-scale Linear SVM
Discussion and Conclusions

(4)

Kernel Methods and SVM

Kernel methods became very popular in the past decade

In particular, support vector machines (SVM)
But training is slow on large data because the nonlinear mapping enlarges the # of features

Example: x = [x_1, x_2, x_3]^T ∈ R^3

φ(x) = [1, √2 x_1, √2 x_2, √2 x_3, x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3]^T ∈ R^10

If data are very large ⇒ often need approximation, e.g., sub-sampling and many other ways
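To make the mapping concrete, here is a minimal numpy sketch (the helper phi and the sample vectors are ours, not from the talk) checking that the mapped inner product equals the degree-2 polynomial kernel (1 + x^T z)^2, so a kernel evaluation never needs the 10-dimensional vectors:

```python
import numpy as np

# Degree-2 polynomial mapping from the slide: R^3 -> R^10.
def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x3,
                     x1**2, x2**2, x3**2,
                     s * x1 * x2, s * x1 * x3, s * x2 * x3])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# Both lines print 30.25: phi(x)^T phi(z) == (1 + x^T z)^2.
print(phi(x) @ phi(z))
print((1.0 + x @ z) ** 2)
```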

(5)

Linear Classification

Certain problems: # features large
Often similar accuracy with/without nonlinear mappings
Linear classification: no mapping
Stay in the original input space
We can efficiently train very large data
Document classification is of this type
Very important for Internet companies

(6)

An Example

rcv1: # data > 600k, # features > 40k
Using LIBSVM (linear kernel): > 10 hours
Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds
Same stopping condition in solving the SVM optimization problems
We will show how this is achieved and discuss whether there are any concerns

(7)

Outline

Linear versus Nonlinear Classification
Review of SVM Training
Large-scale Linear SVM
Discussion and Conclusions

(8)

Support Vector Classification

Training data (x_i, y_i), i = 1, ..., l, x_i ∈ R^n, y_i = ±1

min_w  (1/2) w^T w + C Σ_{i=1}^l max(0, 1 − y_i w^T φ(x_i))

C: regularization parameter
High-dimensional (maybe infinite) feature space: φ(x) = [φ_1(x), φ_2(x), ...]^T
We omit the bias term b
w may have infinitely many variables
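As a quick illustration (a sketch with made-up data, not code from the talk), the primal objective can be evaluated directly when the data are already in the mapped space:

```python
import numpy as np

def primal_objective(w, X, y, C):
    """(1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i); bias term b omitted."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * (w @ w) + C * hinge.sum()

# Tiny usage example.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(primal_objective(np.array([0.5, -0.5]), X, y, C=1.0))  # 1.25
```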

(9)

Support Vector Classification (Cont’d)

The dual problem (finite # of variables):

min_α  f(α) = (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l,

where Q_ij = y_i y_j φ(x_i)^T φ(x_j) and e = [1, ..., 1]^T
At the optimum: w = Σ_{i=1}^l α_i y_i φ(x_i)
Kernel: K(x_i, x_j) ≡ φ(x_i)^T φ(x_j)

(10)

Large Dense Quadratic Programming

Q_ij ≠ 0; Q: an l by l fully dense matrix

min_α  f(α) = (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, i = 1, ..., l

50,000 training points ⇒ 50,000 variables:
(50,000^2 × 8 / 2) bytes = 10 GB of RAM to store Q
Traditional methods (Newton, quasi-Newton) cannot be directly applied
Now most use decomposition methods
[Osuna et al., 1997, Joachims, 1998, Platt, 1998]

(11)

Decomposition Methods

We consider a one-variable version
Similar to coordinate descent methods
Select the i-th component for update:

min_d  (1/2)(α + d e_i)^T Q (α + d e_i) − e^T (α + d e_i)
subject to  0 ≤ α_i + d ≤ C,

where e_i ≡ [0, ..., 0, 1, 0, ..., 0]^T with the 1 in the i-th position

α: current solution; only the i-th component is changed

(12)

Avoid Memory Problems

The new objective function:

(1/2) Q_ii d^2 + (Qα − e)_i d + constant

To get (Qα − e)_i, only Q's i-th row is needed:

(Qα − e)_i = Σ_{j=1}^l Q_ij α_j − 1

Calculated when needed; trade time for space
Used by popular software (e.g., SVMlight, LIBSVM); they update 10 and 2 variables at a time, respectively
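A minimal sketch of this row-on-demand computation, assuming an arbitrary kernel function passed in by the caller (names are illustrative):

```python
import numpy as np

def grad_component(i, X, y, alpha, kernel):
    """(Q alpha - e)_i from Q's i-th row only; the row is built on the
    fly from the kernel -- trading time for space."""
    Qi = y[i] * y * np.array([kernel(X[i], X[j]) for j in range(len(y))])
    return Qi @ alpha - 1.0

# e.g., with a linear kernel: grad_component(0, X, y, alpha, lambda a, b: a @ b)
```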

(13)

Decomposition Methods: Algorithm

Optimal d:

d = −(Qα − e)_i / Q_ii = −(Σ_{j=1}^l Q_ij α_j − 1) / Q_ii

Consider the lower/upper bounds [0, C]
Algorithm:

While α is not optimal
  1. Select the i-th element for update
  2. α_i ← min( max( α_i − (Σ_{j=1}^l Q_ij α_j − 1) / Q_ii , 0 ), C )
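Putting the pieces together, a sketch of the sequential one-variable loop (illustrative names; a fixed number of sweeps stands in for a real stopping condition):

```python
import numpy as np

def decomposition_sequential(X, y, C, kernel, n_sweeps=10):
    """One-variable decomposition with sequential selection; each update
    needs only Q's i-th row, computed on demand."""
    l = len(y)
    alpha = np.zeros(l)
    for _ in range(n_sweeps):
        for i in range(l):
            # i-th row of Q: Q_ij = y_i y_j K(x_i, x_j)
            Qi = y[i] * y * np.array([kernel(X[i], X[j]) for j in range(l)])
            if Qi[i] <= 0.0:          # guard against a degenerate Q_ii
                continue
            g = Qi @ alpha - 1.0      # (Q alpha - e)_i
            alpha[i] = min(max(alpha[i] - g / Qi[i], 0.0), C)
    return alpha
```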

(14)

Select an Element for Update

Many ways:

Sequential (easiest)
Permuting 1, ..., l every l steps
Random

Existing software checks gradient information ∇_1 f(α), ..., ∇_l f(α)
But is ∇f(α) available?

(15)

Select an Element for Update (Cont’d)

We can easily maintain the gradient

∇f(α) = Qα − e
∇_s f(α) = (Qα)_s − 1 = Σ_{j=1}^l Q_sj α_j − 1

Initially α = 0, so ∇f(0) = −e
When α_i is updated to ᾱ_i:

∇_s f(α) ← ∇_s f(α) + Q_si (ᾱ_i − α_i), ∀s

This is O(l) if Q_si ∀s (the i-th column) are available
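In code, the O(l) maintenance step might look like this sketch (illustrative names):

```python
import numpy as np

def update_gradient(grad, Q_col_i, alpha_i_old, alpha_i_new):
    """O(l) refresh of all gradient components after alpha_i changes:
    grad_s <- grad_s + Q_si * (new alpha_i - old alpha_i) for all s.
    Initially alpha = 0, so grad starts at -e, i.e., np.full(l, -1.0)."""
    return grad + Q_col_i * (alpha_i_new - alpha_i_old)
```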

(16)

Select an Element for Update (Cont’d)

Whether we maintain ∇f(α) or not, Q's i-th row (column) is always needed:

ᾱ_i ← min( max( α_i − (Σ_{j=1}^l Q_ij α_j − 1) / Q_ii , 0 ), C )

Q is symmetric
Using ∇f(α) to select i: faster convergence, i.e., fewer iterations

(17)

Decomposition Methods: Using Gradient

The new procedure:

α = 0, ∇f(α) = −e
While α is not optimal
  1. Select the i-th element using ∇f(α)
  2. ᾱ_i ← min( max( α_i − (Σ_{j=1}^l Q_ij α_j − 1) / Q_ii , 0 ), C )
  3. ∇_s f(α) ← ∇_s f(α) + Q_si (ᾱ_i − α_i), ∀s

Cost per iteration: O(ln), where l: # instances, n: # features
(assuming each Q_ij = y_i y_j K(x_i, x_j) takes O(n))
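A sketch of the whole procedure on a small dense Q. The talk does not pin down the selection rule, so this uses the largest projected-gradient violation, one common choice:

```python
import numpy as np

def decomposition_with_gradient(Q, C, max_iters=1000, tol=1e-6):
    """Gradient-maintaining decomposition on a small dense Q."""
    l = Q.shape[0]
    alpha = np.zeros(l)
    grad = -np.ones(l)                         # alpha = 0  =>  grad = -e
    for _ in range(max_iters):
        pg = grad.copy()                       # projected gradient on [0, C]
        pg[(alpha <= 0.0) & (pg > 0.0)] = 0.0
        pg[(alpha >= C) & (pg < 0.0)] = 0.0
        i = int(np.argmax(np.abs(pg)))
        if abs(pg[i]) < tol:                   # approximately optimal
            break
        new_ai = min(max(alpha[i] - grad[i] / Q[i, i], 0.0), C)
        grad += Q[:, i] * (new_ai - alpha[i])  # O(l) gradient update
        alpha[i] = new_ai
    return alpha
```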

(18)

Outline

Linear versus Nonlinear Classification
Review of SVM Training
Large-scale Linear SVM
Discussion and Conclusions

(19)

Linear SVM for Large Document Sets

Document classification
Bag-of-words model (TF-IDF or others)
A large # of features
Testing accuracy: linear/nonlinear SVMs similar
(by nonlinear SVM we mean SVM via kernels)
Recently an active research topic:
SVMperf [Joachims, 2006]
Pegasos [Shalev-Shwartz et al., 2007]
LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]
and others

(20)

Linear SVM

Primal, without the bias term b:

min_w  (1/2) w^T w + C Σ_{i=1}^l max(0, 1 − y_i w^T x_i)

Dual:

min_α  f(α) = (1/2) α^T Q α − e^T α
subject to  0 ≤ α_i ≤ C, ∀i

Q_ij = y_i y_j x_i^T x_j

(21)

Revisit Decomposition Methods

While α is not optimal
  1. Select the i-th element for update
  2. α_i ← min( max( α_i − (Σ_{j=1}^l Q_ij α_j − 1) / Q_ii , 0 ), C )

O(ln) per iteration; n: # features, l: # data
For linear SVM, define

w ≡ Σ_{j=1}^l y_j α_j x_j ∈ R^n

Then only O(n) per iteration:

Σ_{j=1}^l Q_ij α_j − 1 = Σ_{j=1}^l y_i y_j x_i^T x_j α_j − 1 = y_i w^T x_i − 1
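A quick numpy check of this identity on random data (illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 5, 3
X = rng.normal(size=(l, n))
y = rng.choice([-1.0, 1.0], size=l)
alpha = rng.uniform(0.0, 1.0, size=l)

Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j x_i^T x_j
w = X.T @ (y * alpha)                      # w = sum_j y_j alpha_j x_j

i = 2
print(Q[i] @ alpha - 1.0)                  # sum_j Q_ij alpha_j - 1
print(y[i] * (w @ X[i]) - 1.0)             # y_i w^T x_i - 1: same value
```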

(22)

All we need is to maintain w. If α_i is updated to ᾱ_i, then w is maintained in O(n):

w ← w + (ᾱ_i − α_i) y_i x_i

Initial w: α = 0 ⇒ w = 0
Give up maintaining ∇f(α)
Select i for update: sequential, random, or permuting 1, ..., l every l steps
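A minimal sketch of the resulting O(n)-per-iteration method (a simplified LIBLINEAR-style dual coordinate descent; the names and the fixed sweep count are ours):

```python
import numpy as np

def linear_dual_cd(X, y, C, n_sweeps=10, seed=0):
    """Maintain w = sum_j y_j alpha_j x_j so that
    (Q alpha - e)_i = y_i w^T x_i - 1 costs O(n)."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)                      # alpha = 0  =>  w = 0
    Qii = (X * X).sum(axis=1)            # Q_ii = x_i^T x_i since y_i^2 = 1
    for _ in range(n_sweeps):
        for i in rng.permutation(l):     # random-permutation selection
            if Qii[i] <= 0.0:
                continue
            g = y[i] * (w @ X[i]) - 1.0  # (Q alpha - e)_i in O(n)
            new_ai = min(max(alpha[i] - g / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i]  # O(n) update of w
            alpha[i] = new_ai
    return w, alpha
```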

(23)

Algorithms for Linear and Nonlinear SVM

Linear:

While α is not optimal
  1. Select the i-th element for update
  2. ᾱ_i ← min( max( α_i − (y_i w^T x_i − 1) / Q_ii , 0 ), C )
  3. w ← w + (ᾱ_i − α_i) y_i x_i

Nonlinear:

While α is not optimal
  1. Select the i-th element using ∇f(α)
  2. ᾱ_i ← min( max( α_i − (Σ_{j=1}^l Q_ij α_j − 1) / Q_ii , 0 ), C )
  3. ∇_s f(α) ← ∇_s f(α) + Q_si (ᾱ_i − α_i), ∀s

(24)

Analysis

Decomposition method for nonlinear (also linear):
O(ln) per iteration (used in LIBSVM)
New way for linear:
O(n) per iteration (used in LIBLINEAR)
Faster overall if the # of iterations is not l times more
Experiments:

Problem       l: # data   n: # features
news20         19,996     1,355,191
yahoo-japan   176,203       832,026
rcv1          677,399        47,236
yahoo-korea   460,554     3,052,939

(25)

Testing Accuracy versus Training Time

[Figure: testing accuracy versus training time on news20, yahoo-japan, rcv1, and yahoo-korea]

(26)

Outline

Linear versus Nonlinear Classification
Review of SVM Training
Large-scale Linear SVM
Discussion and Conclusions

(27)

Limitation

A few seconds for millions of data points; too good to be true?
Less effective if C is large (or data not scaled)
The same problem occurs when training nonlinear SVMs
But there is no need to use a large C:
the model is the same for all C ≥ C̄ [Keerthi and Lin, 2003],
and C̄ is small for document data (if scaled)

(28)

Limitation (Cont’d)

Less effective if the # of features is small
Should then solve the primal: # variables = # features
And why not use kernels with nonlinear mappings in that case?

(29)

Comparing Different Training Methods

O(ln) versus O(n) per iteration
Generally, the new method for linear is much faster, especially for document data
But one can always find weird cases where LIBSVM is faster than LIBLINEAR
Applying the right approach to the right problem is essential
One must be careful when comparing training algorithms

(30)

Software Issue

Large data ⇒ may need different training strategies for different problems
But we pay the price of complicating software packages
The success of LIBSVM and SVMlight: simple and general
They cover both linear and nonlinear
General versus special: always an issue

(31)

Other Methods for Linear SVM

w is the key to reducing O(ln) to O(n) per iteration:

w = Σ_{j=1}^l y_j α_j x_j ∈ R^n

Many optimization methods can be used
We can now solve the primal, as w is no longer infinite-dimensional:

min_w  (1/2) w^T w + C Σ_{i=1}^l max(0, 1 − y_i w^T x_i)

We used the decomposition method as an example because it works for both linear and nonlinear
It is easy to see the striking difference with/without w
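As one illustration of working in the primal (this is not the method from the talk, just a plain subgradient-descent sketch on the objective above, with illustrative step sizes):

```python
import numpy as np

def primal_subgradient(X, y, C, n_iters=200, eta0=0.1):
    """Subgradient descent on (1/2) w^T w + C * sum_i max(0, 1 - y_i w^T x_i)."""
    l, n = X.shape
    w = np.zeros(n)
    for t in range(1, n_iters + 1):
        violated = y * (X @ w) < 1.0               # points with nonzero hinge loss
        sub = w - C * (y[violated] @ X[violated])  # a subgradient of the objective
        w -= (eta0 / np.sqrt(t)) * sub             # decaying step size
    return w
```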

(32)

Other Linear Classifiers

Logistic regression, maximum entropy, conditional random fields (CRF):
all are linear classifiers
In the past, SVM training was considered very different from them
For the linear case, things are closely related
Many interesting findings, but no time to show the details

(33)

What if Data Are Even Larger?

We see that I/O costs more than computation
Large-scale document classification on a single computer is essentially a solved problem
Challenges:
What if the data are larger than the computer's RAM?
What if the data are distributedly stored?
Document classification in a data-center environment is an interesting research direction

(34)

Conclusions

For certain problems, linear classifiers are as accurate as nonlinear ones, and more efficient for training/testing
However, we are not claiming that you should not use kernels any more
For large data, the right approaches are essential
Machine learning researchers should clearly tell people when to use which methods
You are welcome to try our software:

http://www.csie.ntu.edu.tw/~cjlin/libsvm
http://www.csie.ntu.edu.tw/~cjlin/liblinear
