
Training Support Vector Machines: Status and Challenges


(1)

Training Support Vector Machines:

Status and Challenges

Chih-Jen Lin

Department of Computer Science National Taiwan University

ICML Workshop on Large Scale Learning Challenge

(2)

Outline

SVM is popular

But its training isn’t easy

We check existing techniques

Large data sets:

We show several approaches, and discuss various considerations

Will try to partially answer why there are controversial comparisons

(3)

Introduction to SVM

Outline

Introduction to SVM

Solving SVM Quadratic Programming Problem

Training large-scale data

Linear SVM

Discussion and Conclusions

(4)

Support Vector Classification

Training data (xi, yi), i = 1, . . . , l, xi ∈ R^n, yi = ±1

Maximizing the margin [Boser et al., 1992, Cortes and Vapnik, 1995]:

$$\min_{w,b,\xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \ldots, l$$

High dimensional (maybe infinite) feature space: φ(x) = (φ1(x), φ2(x), . . .)

(5)

Introduction to SVM

Support Vector Classification (Cont’d)

w: possibly infinitely many variables

The dual problem (finite number of variables):

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0$$

where Qij = yi yj φ(xi)^T φ(xj) and e = [1, . . . , 1]^T

At the optimum:

$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$
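Though not stated on the slide, a standard consequence of this expansion is that w never needs to be formed explicitly: predictions only require kernel values,

$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$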

(6)

Outline

Introduction to SVM

Solving SVM Quadratic Programming Problem

Training large-scale data

Linear SVM

Discussion and Conclusions

(7)

Solving SVM Quadratic Programming Problem

Large Dense Quadratic Programming

$$\min_{\alpha} \quad \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0$$

Qij ≠ 0: Q is an l by l fully dense matrix

50,000 training points ⇒ 50,000 variables:

(50,000² × 8 / 2) bytes ≈ 10 GB RAM just to store Q

Traditional optimization methods that form the whole Q are impractical

(8)

Decomposition Methods

Working on some variables each time (e.g., [Osuna et al., 1997, Joachims, 1998, Platt, 1998])

Similar to coordinate-wise minimization

Working set B; N = {1, . . . , l} \ B is fixed

Sub-problem at the kth iteration:

$$\min_{\alpha_B} \quad \frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_N^k)^T \end{bmatrix} \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix} - \begin{bmatrix} e_B^T & e_N^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_N^k \end{bmatrix}$$

$$\text{subject to} \quad 0 \le \alpha_t \le C, \ t \in B, \quad y_B^T \alpha_B = -y_N^T \alpha_N^k$$

(9)

Solving SVM Quadratic Programming Problem

Avoid Memory Problems

The new objective function:

$$\frac{1}{2} \alpha_B^T Q_{BB} \alpha_B + \left(-e_B + Q_{BN} \alpha_N^k\right)^T \alpha_B + \text{constant}$$

Only |B| columns of Q are needed (|B| ≥ 2)

Calculated when used: trade time for space

Popular software such as SVMlight, LIBSVM, and SVMTorch are of this type
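To make the "calculated when used" idea concrete, below is a minimal NumPy sketch (not LIBSVM's actual code) of forming only the |B| kernel columns a decomposition step needs, assuming an RBF kernel and dense data.

```python
import numpy as np

def q_columns(X, y, B, gamma):
    """Form the l x |B| block Q[:, B] with Q_ij = y_i y_j exp(-gamma ||x_i - x_j||^2).

    X: (l, n) data, y: (l,) labels in {+1, -1}, B: list of working-set indices.
    The full l x l matrix Q is never stored.
    """
    XB = X[B]                                               # (|B|, n) working-set rows
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(XB ** 2, axis=1)[None, :]
               - 2.0 * X @ XB.T)                            # (l, |B|) squared distances
    K = np.exp(-gamma * sq_dist)                            # RBF kernel values
    return (y[:, None] * K) * y[B][None, :]                 # multiply in the labels

# A decomposition solver would call this each iteration with its chosen B,
# trading repeated kernel evaluations for O(l x |B|) memory.
```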

(10)

How Do Decomposition Methods Perform?

Convergence is not very fast

But there is no need to obtain a very accurate α: prediction is not affected much

In some situations, # support vectors ≪ # training points

With the initial α^1 = 0, some instances are never used

(11)

Solving SVM Quadratic Programming Problem

An example of training 50,000 instances using LIBSVM

$ svm-train -c 16 -g 4 -m 400 22features
Total nSV = 3370

Time: 79.524 s on a 2.0 GHz Xeon machine

Calculating Q (kernel evaluations) may have taken most of this time

#SVs = 3,370 ≪ 50,000

A good case where some αi remain at zero all the time
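For readers who prefer a scripting interface, roughly the same experiment can be run through scikit-learn's SVC, which wraps LIBSVM; the flags -c, -g, and -m map to the C, gamma, and cache_size parameters. The file name 22features is taken from the slide, and the loading step below is only an assumed setup, not part of the original experiment.

```python
from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

# Assumed: the 50,000-instance set is stored in LIBSVM/svmlight format.
X, y = load_svmlight_file("22features")

# Rough equivalent of: svm-train -c 16 -g 4 -m 400 22features
model = SVC(kernel="rbf", C=16, gamma=4, cache_size=400)
model.fit(X, y)

print("Total nSV =", len(model.support_))
```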

(12)

Issues of Decomposition Methods

Techniques for faster decomposition methods:
store recently used kernel elements
working set size/selection
theoretical issues: convergence
and others (details not discussed here)

But training large data is still difficult:
the kernel matrix grows quadratically with the number of data
training millions of instances is time consuming

Will discuss some possible approaches

(13)

Training large-scale data

Outline

Introduction to SVM

Solving SVM Quadratic Programming Problem

Training large-scale data

Linear SVM

Discussion and Conclusions

(14)

Parallel: Multi-core/Shared Memory

Most computation of decomposition methods: kernel evaluations

Easily parallelized via OpenMP: a one-line change to LIBSVM

Each core/CPU calculates part of a kernel column

Training time in seconds on the same 50,000-instance set (kernel evaluations: 80% of the time):

Multicore                  Shared-memory
# cores    time (s)        # CPUs    time (s)
1          80              1         100
2          48              2         57
4          32              4         36
8          27              8         28

GPUs can also be used [Catanzaro et al., 2008]
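LIBSVM's one-line change is an OpenMP pragma around its C++ kernel loop; the Python sketch below only illustrates the same idea of splitting one kernel column across workers. The rbf_column helper is hypothetical and relies on NumPy releasing the GIL inside array operations; it is an illustration, not LIBSVM code.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def rbf_column(X, xi, gamma, n_workers=4):
    """Compute one RBF kernel column K(x_1..x_l, x_i), split across n_workers."""
    chunks = np.array_split(np.arange(X.shape[0]), n_workers)

    def part(rows):
        d = X[rows] - xi                       # this chunk's differences to x_i
        return np.exp(-gamma * np.sum(d * d, axis=1))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pieces = pool.map(part, chunks)        # each worker handles one chunk of rows
    return np.concatenate(list(pieces))
```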

(15)

Training large-scale data

Parallel: Distributed Environments

What if data cannot fit into memory?

Use distributed environments

PSVM [Chang et al., 2007]: http://code.google.com/p/psvm/

π-SVM: http://pisvm.sourceforge.net

Parallel GPDT [Zanni et al., 2006]

All use MPI

They report good speed-up

But in certain environments, communication cost is a concern

(16)

Approximations

Subsampling

Simple and often effective

Many more advanced techniques exist, for example:

Incremental training (e.g., [Syed et al., 1999]): data ⇒ 10 parts; train the 1st part ⇒ SVs, train SVs + 2nd part, . . . (see the sketch below)

Select and train good points: KNN or heuristics, e.g., [Bakır et al., 2005]
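A minimal sketch of that incremental scheme (split the data into parts, train on the first part, keep only its support vectors, add the next part, retrain, and so on), using scikit-learn's SVC as the base trainer; the number of parts and SVM parameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def incremental_svm(X, y, n_parts=10, **svc_params):
    """Data => n_parts; train 1st part => SVs; train SVs + 2nd part; ...

    X: (l, n) dense NumPy array, y: (l,) labels.
    """
    parts = np.array_split(np.arange(X.shape[0]), n_parts)
    cur_X, cur_y = X[parts[0]], y[parts[0]]
    model = SVC(**svc_params).fit(cur_X, cur_y)
    for idx in parts[1:]:
        sv = model.support_                    # keep only the current support vectors
        cur_X = np.vstack([cur_X[sv], X[idx]])
        cur_y = np.concatenate([cur_y[sv], y[idx]])
        model = SVC(**svc_params).fit(cur_X, cur_y)
    return model

# Example (parameters are illustrative): incremental_svm(X, y, C=16, gamma=4)
```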

(17)

Training large-scale data

Approximations (Cont’d)

Approximate the kernel; e.g., [Fine and Scheinberg, 2001, Williams and Seeger, 2001]

Use only part of the kernel; e.g., [Lee and Mangasarian, 2001, Keerthi et al., 2006]

And many others

Some are simple, but some are sophisticated
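One common instantiation of "approximate the kernel" is a low-rank (Nystrom-type) feature map followed by a linear SVM, in the spirit of [Williams and Seeger, 2001]; the sketch below uses scikit-learn's Nystroem transformer, with the rank and RBF parameter chosen arbitrarily for illustration.

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Approximate phi(x) with a rank-500 Nystroem map, then train a linear SVM on it.
approx_svm = make_pipeline(
    Nystroem(kernel="rbf", gamma=4, n_components=500),
    LinearSVC(C=1.0),
)
# Usage: approx_svm.fit(X_train, y_train); approx_svm.score(X_test, y_test)
```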

(18)

Parallelization or Approximation

Difficult to say

Parallelization: general

Approximation: simpler in some cases

We can do both

For certain problems, approximation doesn’t easily work

(19)

Training large-scale data

Parallelization or Approximation (Cont’d)

covtype: 500k training and 80k testing; rcv1: 550k training and 14k testing

covtype                          rcv1
Training size    Accuracy        Training size    Accuracy
50k              92.5%           50k              97.2%
100k             95.3%           100k             97.4%
500k             98.2%           550k             97.8%

For large sets, selecting the right approach is essential

We illustrate this point using linear SVM for document classification

(20)

Outline

Introduction to SVM

Solving SVM Quadratic Programming Problem

Training large-scale data

Linear SVM

Discussion and Conclusions

(21)

Linear SVM

Linear Support Vector Machines

Data are not mapped to another space

In theory, an RBF kernel with certain parameters gives at least as good generalization performance as linear [Keerthi and Lin, 2003]

But sometimes we can easily solve much larger linear SVMs

Training of linear and nonlinear SVMs should be considered separately

(22)

Linear Support Vector Machines (Cont’d)

Linear SVM is useful if its accuracy is similar to that of nonlinear SVM

Will discuss an example of linear SVM for document classification

(23)

Linear SVM

Linear SVM for Large Document Sets

Document classification:
bag-of-words model (TF-IDF or others)
a large # of features

Can solve larger problems than kernelized cases

Recently an active research topic:

SVMperf [Joachims, 2006]

Pegasos [Shalev-Shwartz et al., 2007]

LIBLINEAR [Lin et al., 2007, Hsieh et al., 2008]

and others
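As a concrete (toy) illustration of this setting, the pipeline below builds TF-IDF bag-of-words features and trains a linear SVM with scikit-learn's LinearSVC, which uses LIBLINEAR internally; the documents and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "stocks rallied after the earnings report",
    "the team won the championship game",
    "investors bought bonds and tech stocks",
    "the striker scored twice in the final match",
]
labels = [1, -1, 1, -1]          # made-up labels: +1 = finance, -1 = sports

clf = make_pipeline(
    TfidfVectorizer(),           # bag of words -> sparse TF-IDF features (many features)
    LinearSVC(C=1.0),            # linear SVM trained by LIBLINEAR
)
clf.fit(docs, labels)
print(clf.predict(["bond markets fell today"]))
```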

(24)

Linear SVM

Primal problem, without the bias term b:

$$\min_{w} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\left(0,\ 1 - y_i w^T x_i\right)$$

Dual problem:

$$\min_{\alpha} \quad f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to} \quad 0 \le \alpha_i \le C, \ \forall i$$

No linear constraint y^T α = 0; here Qij = yi yj xi^T xj

(25)

Linear SVM

A Comparison: LIBSVM and LIBLINEAR

rcv1: # data > 600k, # features > 40k, TF-IDF features

Using LIBSVM (linear kernel): > 10 hours

Using LIBLINEAR: computation < 5 seconds; I/O: 60 seconds

Same stopping condition

Accuracy similar to nonlinear

(26)

Revisit Decomposition Methods

The extreme case: update one variable at a time

The sub-problem is reduced to

$$\alpha_i \leftarrow \min\left(\max\left(\alpha_i - \frac{\nabla_i f(\alpha)}{Q_{ii}},\ 0\right),\ C\right)$$

where

$$\nabla_i f(\alpha) = (Q\alpha)_i - 1 = \sum_{j=1}^{l} Q_{ij}\alpha_j - 1$$

O(nl) to calculate the i-th row of Q (n: # features, l: # data)

(27)

Linear SVM

For linear SVM, define

$$w = \sum_{j=1}^{l} y_j \alpha_j x_j$$

Then the gradient is much easier to compute: O(n)

$$\nabla_i f(\alpha) = y_i w^T x_i - 1$$

All we need is to maintain w: if αi is updated from ᾱi, then

$$w \leftarrow w + (\alpha_i - \bar{\alpha}_i)\, y_i x_i$$
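Combining the last two slides, here is a compact NumPy sketch of dual coordinate descent for the bias-free L1-loss linear SVM, maintaining w so that each coordinate update costs O(n); it follows the updates shown above but omits the shrinking and stopping-condition details of [Hsieh et al., 2008].

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=10):
    """min_a 0.5 a^T Q a - e^T a, 0 <= a_i <= C, with Q_ij = y_i y_j x_i^T x_j (no bias).

    X: (l, n) dense NumPy array, y: (l,) labels in {+1, -1}.
    Maintains w = sum_j y_j a_j x_j so that grad_i f(a) = y_i w^T x_i - 1 costs O(n).
    """
    l, n = X.shape
    alpha = np.zeros(l)
    w = np.zeros(n)
    Qii = np.sum(X * X, axis=1)                    # Q_ii = x_i^T x_i (since y_i^2 = 1)
    for _ in range(n_epochs):
        for i in np.random.permutation(l):
            if Qii[i] == 0:
                continue
            grad = y[i] * (w @ X[i]) - 1.0         # O(n) gradient via w
            new_ai = min(max(alpha[i] - grad / Qii[i], 0.0), C)
            w += (new_ai - alpha[i]) * y[i] * X[i] # O(n) update keeps w consistent
            alpha[i] = new_ai
    return w, alpha

# Prediction for a new instance x: np.sign(w @ x)
```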

(28)

Testing Accuracy (Time in Seconds)

[Figures: testing accuracy versus training time (in seconds), four panels: L1-SVM on news20, L2-SVM on news20, L1-SVM on rcv1, L2-SVM on rcv1]

(29)

Linear SVM

Analysis

Other implementation details are in [Hsieh et al., 2008]

Decomposition method for linear/nonlinear kernels: O(nl) per iteration

New way for linear: O(n) per iteration

Faster if the # of iterations is not l times more

A few seconds for millions of data; any limitation?

Less effective if:
# features is small (should solve the primal instead)
the penalty parameter C is large

(30)

Analysis (Cont’d)

One must be careful in comparisons

Now we have two decomposition methods (nonlinear and linear)

Similar theoretical convergence rates

Very different practical behaviors for certain problems

This partially explains the controversial comparisons in some recent work

(31)

Linear SVM

Analysis (Cont’d)

A lesson: different SVMs

To handle large data ⇒ may need different training strategies

Even just for linear SVM:
# data ≫ # features
# data ≪ # features
# data, # features both large
should each use different methods

For example, when # data ≫ # features: a primal-based method (but why not nonlinear?)

(32)

Outline

Introduction to SVM

Solving SVM Quadratic Programming Problem

Training large-scale data

Linear SVM

Discussion and Conclusions

(33)

Discussion and Conclusions

Discussion and Conclusions

Linear versus nonlinear:

In this competition, most participants use linear SVM (wild track), even though accuracy may be worse

Recall I mentioned “parallelization” and “approximation”

Linear is essentially an approximation of nonlinear

For large data, selecting the right approach seems to be essential

But finding a suitable one is difficult

(34)

Discussion and Conclusions (Cont’d)

This (i.e., “too many approaches”) is indeed bad from the viewpoint of designing machine learning software

The success of LIBSVM and SVMlight comes from being simple and general

Developments in both directions (general and specific) will help to advance SVM training
