Can Support Vector Machine be a Major
Classification Method?
Chih-Jen Lin
Department of Computer Science National Taiwan University
Motivation
• SVM: a hot topic in machine learning
• However, not yet a major classification method
KDNuggets 2002 Poll: neural networks and decision trees remain the main tools
The Potential of SVM
• In my opinion, after careful data pre-processing,
appropriately using NN or SVM ⇒ similar accuracy
• But users may not use them properly
• The opportunity for SVM:
it is easier for users to use appropriately
What Many Users are Doing Now
• Convert data to the format of an SVM package
• May not conduct scaling
• Randomly try a few parameters and kernels without validation
• Default parameters are surprisingly important
We Hope Users At Least Do
• The following procedure
1. Simple scaling (training and testing)
2. Consider the RBF kernel
K(x, y) = e^{−γ‖x−y‖²} = e^{−‖x−y‖²/(2σ²)}
and find the best C and γ (or σ²)
• Why RBF:
– Linear kernel: special case of RBF [Keerthi and Lin 2003]
– Polynomial: numerical difficulties
(< 1)^d → 0, (> 1)^d → ∞
– tanh: still a mystery
In a coming paper [Lin and Lin 2003], for certain parameters, it behaves like the RBF kernel
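The contrast above can be checked numerically; a minimal pure-Python sketch (the function name `rbf_kernel` is mine, not from any library):

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2); always in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # 1.0 (identical points)
print(rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5))  # small but positive

# Polynomial kernel values have no such bound: powers of inner products
# smaller or larger than 1 vanish or blow up with the degree d.
print(0.5 ** 50)   # nearly underflows
print(2.0 ** 50)   # blows up
```

Since the RBF kernel always lies in (0, 1], it avoids the overflow/underflow behavior sketched in the last two lines.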
Examples of the Proposed Procedure
• User 1:
I am using libsvm in an astroparticle physics application (AMANDA experiment). First, let me congratulate you on a really easy-to-use and nice package.
Unfortunately, it gives me astonishingly bad results...
• Answer:
What is your procedure ?
• User 1:
I do for example the following steps (here for classification):
./svm-scale ... > TRAINING.SCALE.DAT
./svm-train -s 0 -t 2 -c 10 TRAINING.SCALE.DAT
./svm-predict TESTING_SIGNAL.SCALE.DAT TRAINING.SCALE.DAT.model s_0_2_10.out
Accuracy = 75.2%
• Answer:
OK. Send me the data
• Answer:
First I scale the training and testing data TOGETHER:
/mnt/professor/cjlin/tmp% libsvm-2.36/svm-scale total > total.scale
Then separate them again.
Use the model selection tool (cross validation) to find the best parameters, then sort the results to find the best CV accuracy:
/mnt/professor/cjlin/tmp% sort -k 3 train.out
2 1 96.9569
8 1 96.9569
so c = 4 and g = 2 might be the best. Train on the training data again:
/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-train -m 300 -c 4 -g 2 ../train
Finally, test the independent data:
/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-predict ../testdata train.model o
Accuracy = 97.3%
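Picking the winner from the sorted cross-validation output can be automated; a minimal sketch, assuming the column layout produced by the grid search (log2 C, log2 gamma, CV accuracy). `best_params` is a hypothetical helper, not part of LIBSVM:

```python
def best_params(cv_lines):
    """Pick (C, gamma) with the highest CV accuracy from grid-search output.

    Each line is assumed to be: log2(C) log2(gamma) cv_accuracy
    """
    best = max(cv_lines, key=lambda line: float(line.split()[2]))
    lc, lg, acc = (float(t) for t in best.split())
    return 2.0 ** lc, 2.0 ** lg, acc

# With ties, max() keeps the first line it sees:
C, gamma, acc = best_params(["2 1 96.9569", "8 1 96.9569"])
print(C, gamma, acc)  # 4.0 2.0 96.9569
```

With a tie such as the one above, preferring the smaller C (a simpler model) is a common convention, which the first-match behavior of `max()` happens to respect here.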
• User 1:
You earned a copy of my PhD thesis
• User 2:
I am a developer in a bioinformatics laboratory at ... We would like to use LIBSVM in a project ... The datasets are reasonably unbalanced: there are 221 examples in the first set, 117 in the second set, and 53 in the third set.
But the results are not good
• Answer:
Have you scaled the data ? What is your accuracy ?
• User 2: Yes, to [0,1]. 36%
• Answer:
OK. Send me the data
• Answer:
I am able to get 83.88% CV accuracy. Is that good enough for you?
Model Selection is Important
• In fact, a two-parameter search (C and γ or σ²)
• Via bounds on the leave-one-out (loo) error
• Via two line searches
Bounds on loo
• Many loo bounds exist
• Main reason for using them: saving computational cost
• Bounds for which a search path may be found
[Figure: loo-bound contours over ln(C) and ln(σ²) for several problems]
– Radius margin bound
– Span bound
• A recent paper [Chung et al. 2002] on radius margin bound
– Minima in a good region matter more than tightness
A good bound should avoid minima at the boundary
(i.e., too small or too large C and σ²)
– Modification for L1-SVM
– Differentiability of min_{C,σ²} f(α(C, σ²))
– Reliable implementation
              L1-SVM                    L2-SVM
          #fun  #grad  accuracy    #fun  #grad  accuracy
banana       9      6     88.96       8      5     88.53
image       17     13     96.24      11      6     97.03
splice      13     12     89.84      21     19     89.84
tree         8      8     86.50       8      8     86.54
waveform    16     13     88.57       8      7     89.83
ijcnn1       9      9     97.09       7      7     97.83
• A coming paper [Chang and Lin 2003]: non-smooth optimization techniques for bounds
– Allow us to use more (i.e., non-differentiable) bounds
– Sensitivity analysis
– Bundle (cutting plane) methods
Piecewise diff. ⇒ semi-smooth ⇒ directionally diff.
Two Line Searches
• CV (loo) contour of RBF kernel [Keerthi and Lin 2003]:
[Figure: schematic contour over (log C, log σ²): underfitting and overfitting regions, a good region along the line log σ² = log C − log C̃, and an asymptote at log C_lim]
• When σ² is large,
RBF with (C, σ²) ≡ linear with C̃ = C/σ²
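This limit follows from exp(−z) ≈ 1 − z for small z: for large σ², the RBF kernel becomes an affine function of the squared distance, so only the ratio C/σ² matters. A quick numerical check (my own sketch, not from the paper):

```python
import math

sq_dist = 3.0  # some fixed ||x - y||^2
for sigma2 in (10.0, 100.0, 1000.0):
    exact = math.exp(-sq_dist / (2.0 * sigma2))
    linearized = 1.0 - sq_dist / (2.0 * sigma2)
    # the gap shrinks roughly like 1/sigma2^2 as sigma2 grows
    print(sigma2, exact, exact - linearized)
```

Each tenfold increase in σ² shrinks the linearization error by roughly a factor of 100, which is why the RBF machine behaves like a linear one in this regime.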
• A heuristic for model selection
1. Use the linear kernel and search for the best C̃
2. Fix C̃ and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using RBF
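Generating the candidate set for step 2 is a one-liner; a sketch assuming a base-2 grid (the function name and ranges are mine). Every candidate satisfies log σ² = log C − log C̃, so one line of (C, σ²) pairs replaces a full two-dimensional grid:

```python
def line_candidates(log2_C_tilde, log2_C_values):
    """(C, sigma^2) pairs on the line log sigma^2 = log C - log C_tilde."""
    return [(2.0 ** lc, 2.0 ** (lc - log2_C_tilde)) for lc in log2_C_values]

# e.g. C_tilde = 2^3 fixed by the first (linear-kernel) line search
pairs = line_candidates(3, range(-5, 16, 2))
print(len(pairs))  # 11 candidates instead of a 21 x 21 grid
for C, sigma2 in pairs:
    assert abs(C / sigma2 - 2.0 ** 3) < 1e-9  # C / sigma^2 = C_tilde holds
```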
Problem      n    #test   Test error (grid)   Test error (new)
banana      400    4900   0.1235 (6,-0)       0.1178 (-2,-2)
image      1300    1010   0.02475 (9,4)       0.02475 (1,0.5)
splice     1000    2175   0.09701 (1,4)       0.1011 (0,4)
ringnorm    400    7000   0.01429 (-2,2)      0.018 (-3,2)
twonorm     400    7000   0.031 (1,3)         0.02914 (1,4)
tree        700   11692   0.1132 (8,4)        0.1246 (2,2)
adult      1605   29589   0.1614 (5,6)        0.1614 (5,6)
web        2477   38994   0.02223 (5,5)       0.02223 (5,5)

• 441 versus 54 SVMs trained
[Figure: CV accuracy contour (97–98.8%) over lg(C) and lg(gamma)]
However, I Prefer Simple Grid Search
• Reasons for not using bounds (when there are two parameters)
– Psychologically, users do not feel safe
– In practice, the IJCNN competition:
97.09% and 97.83% using RM bounds for L1- and L2-SVM;
98.59% using a 25-point grid
(2668, 1990, and 1293 testing errors)
– Bounds remain useful with more than two parameters
• About two-line search:
[Figure: iterations versus log(C) for heart_scale]
– A paper [Chung et al. 2003]: efficient decomposition methods for linear SVMs
– The choice of the best C for linear SVMs is sometimes ambiguous
[Figure: CV rate (70–84%) versus log(C)]
– After C ≥ C∗, everything is the same
• We propose that users
– Start from a loose grid
– Identify good regions and use a finer grid
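The coarse-to-fine idea above can be sketched in a few lines of pure Python; the ranges mirror common LIBSVM grid defaults (log2 C in [−5, 15], log2 γ in [−15, 3]), but treat the ranges, step sizes, and the "winner" below as assumptions for illustration:

```python
def frange(lo, hi, step):
    """Inclusive range that also works with fractional steps."""
    vals, v = [], lo
    while v <= hi + 1e-9:
        vals.append(v)
        v += step
    return vals

# 1) loose grid in log2 space
coarse = [(c, g) for c in frange(-5, 15, 2) for g in frange(-15, 3, 2)]

# 2) after CV picks a good region, refine around it with a smaller step
best_c, best_g = 3, -5          # hypothetical winner of the coarse pass
fine = [(c, g)
        for c in frange(best_c - 2, best_c + 2, 0.5)
        for g in frange(best_g - 2, best_g + 2, 0.5)]

print(len(coarse), len(fine))  # 110 and 81 points
```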
• The grid search tool in libsvm
• Easy parallelization
loo bounds: 20 sequential steps ⇒ more time than a 10 × 10 grid on five computers
Automatic load balancing
• No need for α-seeding, passing cache etc.
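Because every (C, γ) point is evaluated independently, parallelizing the grid needs nothing beyond a process pool. A minimal sketch where `evaluate` is a stand-in for one `svm-train -v` cross-validation run (the quadratic "accuracy" is fake, for illustration only):

```python
from multiprocessing import Pool

def evaluate(point):
    """Stand-in for one cross-validation run at (log2 C, log2 gamma)."""
    lc, lg = point
    return lc, lg, 90.0 - (lc - 1) ** 2 - (lg + 1) ** 2  # fake CV accuracy

if __name__ == "__main__":
    grid = [(lc, lg) for lc in range(-5, 16, 2) for lg in range(-15, 4, 2)]
    with Pool(5) as pool:                    # e.g. five machines or cores
        results = pool.map(evaluate, grid)   # points are fully independent
    best = max(results, key=lambda r: r[2])
    print(best)  # (1, -1, 90.0)
```

The pool hands out grid points as workers become free, which is exactly the automatic load balancing mentioned above; no state (α-seeding, kernel cache) needs to be shared between points.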
• This simple tool
– Enough for medium-sized problems
– Advantage of having only one figure for multi-class problems
• Further improvement is possible
Challenges
• If, using this procedure, satisfactory results are obtained for enough problems
⇒ SVM can eventually become a major method
How do we get users to at least do this?
How do we know whether it is working or not?
• If not