(1)

Can Support Vector Machine be a Major

Classification Method ?

Chih-Jen Lin

Department of Computer Science, National Taiwan University

(2)

Motivation

SVM: a hot topic in machine learning

However, not a major classification method yet

KDnuggets 2002 poll: neural networks and decision trees remain the main tools

(3)

The Potential of SVM

In my opinion, after careful data pre-processing,

appropriately using NN or SVM ⇒ similar accuracy

But users may not use them properly

The chance for SVM:

it is easier for users to use it appropriately

(4)

What Many Users are Doing Now

Transfer data to the format of an SVM package

May not conduct scaling

Randomly try a few parameters and kernels without validation

– Default parameters are surprisingly important

(5)

We Hope Users At Least Do

The following procedure:

1. Simple scaling (training and testing)

2. Consider the RBF kernel

K(x, y) = exp(−γ‖x − y‖²) = exp(−‖x − y‖²/(2σ²))

and find the best C and γ (or σ²)

Why RBF:

– Linear kernel: special case of RBF [Keerthi and Lin 2003]

– Polynomial: numerical difficulties

(< 1)^d → 0, (> 1)^d → ∞  (see the sketch below)

– tanh: still a mystery
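To make the numerical-difficulty point concrete, here is a minimal sketch (mine, not part of the original slides) that evaluates a polynomial kernel (x·y)^d and the RBF kernel on the same pair of points: the dot-product power collapses toward 0 or blows up as d grows, while the RBF value always stays in (0, 1].

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)

def poly_kernel(x, y, d):
    # (x . y)^d: goes to 0 if |x . y| < 1, blows up if |x . y| > 1
    return np.dot(x, y) ** d

def rbf_kernel(x, y, gamma):
    # exp(-gamma * ||x - y||^2): always in (0, 1]
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

for d in (2, 8, 32):
    print(f"poly, d={d:2d}: {poly_kernel(x, y, d):.3e}")
print(f"rbf,  gamma=1/50: {rbf_kernel(x, y, 1.0 / 50):.3e}")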

(6)

In a coming paper [Lin and Lin 2003], it is shown that for certain parameters the tanh (sigmoid) kernel behaves like the RBF kernel.

(7)

Examples of the Proposed Procedure

User 1:

I am using libsvm in an astroparticle physics application (AMANDA experiment). First, let me congratulate you on a really easy to use and nice package.

Unfortunately, it gives me astonishingly bad results...

Answer:

What is your procedure ?

User 1:

I do for example the following steps (here for classification):

(8)

>TRAINING.SCALE.DAT

./svm-train -s 0 -t 2 -c 10 TRAINING.SCALE.DAT

./svm-predict TESTING_SIGNAL.SCALE.DAT TRAINING.SCALE.DAT.model s_0_2_10.out

Accuracy = 75.2%

Answer:

OK. Send me the data

Answer:

First I scale the training and testing TOGETHER:

/mnt/professor/cjlin/tmp% libsvm-2.36/svm-scale total > total.scale

Then separate them again.

Use the model selection tool (cross validation) to find the best parameters:

(9)

Sort the results to find the best CV accuracy:

/mnt/professor/cjlin/tmp% sort -k 3 train.out

2 1 96.9569
8 1 96.9569

so c = 4 and g = 2 (i.e., log2 C = 2, log2 γ = 1) might be the best. Train on the training data again:

/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-train -m 300 -c 4 -g 2 ../train

Finally, test on the independent data:

/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-predict ../testdata train.model o

Accuracy = 97.3%

User 1:

You earned a copy of my PhD thesis
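The exchange above is essentially the recommended procedure run end to end: scale consistently, cross-validate over (C, γ), retrain with the best pair, and only then touch the test set. Below is a minimal sketch of the same pipeline in Python with scikit-learn (whose SVC wraps libsvm); the dataset, the grid ranges, and the use of scikit-learn instead of the libsvm command-line tools are illustrative assumptions, not part of the talk.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)          # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training folds and applies the same
# scaling to validation/test data -- the practical point of scaling
# training and testing data consistently.
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Cross-validated grid search over C and gamma on a log2 grid.
param_grid = {
    "svc__C": 2.0 ** np.arange(-2, 9, 2),
    "svc__gamma": 2.0 ** np.arange(-6, 3, 2),
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

print("best CV accuracy:", search.best_score_)
print("best parameters: ", search.best_params_)
print("test accuracy:   ", search.score(X_test, y_test))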

(10)

User 2:

I am a developer in a bioinformatics laboratory at ... We would like to use LIBSVM in a project ... The datasets are reasonably unbalanced: there are 221 examples in the first set, 117 in the second set, and 53 in the third set.

But the results are not good

Answer:

Have you scaled the data ? What is your accuracy ?

User 2: Yes, to [0,1]. 36%

Answer:

OK. Send me the data

Answer:

I am able to get 83.88% CV accuracy. Is that good enough for you ?

(11)

User 2:

(12)

Model Selection is Important

In fact, a two-parameter search (C and γ or σ²)

– By bounds on the leave-one-out (loo) error

– By two line searches

(13)

Bound of loo

Many loo bounds

Main reason: save computational cost

Bounds where a path may be found

[Figure: a grid of contour plots of loo bounds over (ln C, ln σ²), each axis ranging from −4 to 4.]

(14)

– Radius margin bound
– Span bound

A recent paper [Chung et al. 2002] on the radius margin bound

– Minima in a good region are more important than tightness

A good bound should avoid minima at the boundary

(i.e., too small or too large C and σ²)

– Modification for L1-SVM
– Differentiability of min_{C,σ²} f(α(C, σ²))
– Reliable implementation
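For context (my summary, not a slide from the talk), the radius margin bound bounds the leave-one-out error count by a product of the data radius and the inverse margin; the exact constant factor varies across formulations. Model selection by bounds then minimizes this quantity over the parameters, which is why differentiability of f in (C, σ²) matters:

% Radius margin bound, classical form (constant factor varies by reference):
%   R       = radius of the smallest sphere containing the mapped training data
%   1/||w|| = margin of the separating hyperplane
\[
  \#\{\text{loo errors}\} \;\lesssim\; R^{2}\,\lVert w\rVert^{2},
  \qquad
  \min_{C,\,\sigma^{2}} f\bigl(\alpha(C,\sigma^{2})\bigr)
  \;=\;
  \min_{C,\,\sigma^{2}} R^{2}(C,\sigma^{2})\,\lVert w(C,\sigma^{2})\rVert^{2}.
\]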

(15)

             L1-SVM                   L2-SVM
          #fun  #grad  accuracy   #fun  #grad  accuracy
banana       9      6     88.96      8      5     88.53
image       17     13     96.24     11      6     97.03
splice      13     12     89.84     21     19     89.84
tree         8      8     86.50      8      8     86.54
waveform    16     13     88.57      8      7     89.83
ijcnn1       9      9     97.09      7      7     97.83

A coming paper [Chang and Lin 2003]: non-smooth optimization techniques for bounds

– Allow us to use more (i.e., non-differentiable) bounds
– Sensitivity analysis

(16)

– Bundle (cutting plane) methods

Piecewise diff. ⇒ semi-smooth ⇒ directionally diff.

(17)

Two Line Searches

CV (loo) contour of RBF kernel [Keerthi and Lin 2003]:

[Figure: schematic CV (loo) contour of the RBF kernel over (log C, log σ²), showing underfitting and overfitting regions and a good region along the line]

log σ² = log C − log C̃

When σ² is large:

(C, σ²) with the RBF kernel ≡ C/σ² with the linear kernel

A heuristic for model selection:

1. Use the linear kernel and search for the best C̃

(18)

2. Fix C̃ and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using the RBF kernel (a sketch of this two-line-search heuristic follows after the table below)

Problem    n      #test    Test error of     Test error of
                           grid method       new method
banana     400    4900     0.1235  (6,-0)    0.1178  (-2,-2)
image      1300   1010     0.02475 (9,4)     0.02475 (1,0.5)
splice     1000   2175     0.09701 (1,4)     0.1011  (0,4)
ringnorm   400    7000     0.01429 (-2,2)    0.018   (-3,2)
twonorm    400    7000     0.031   (1,3)     0.02914 (1,4)
tree       700    11692    0.1132  (8,4)     0.1246  (2,2)
adult      1605   29589    0.1614  (5,6)     0.1614  (5,6)
web        2477   38994    0.02223 (5,5)     0.02223 (5,5)

441 versus 54 SVMs (grid method vs. two line searches)
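As a rough illustration (my own sketch, not code from the talk or from [Keerthi and Lin 2003]), the two line searches can be written as two one-dimensional cross-validation loops: first over C̃ for the linear kernel, then over C for the RBF kernel with σ² tied to C through σ² = C/C̃. The dataset and grid ranges are assumptions for the example.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import minmax_scale
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
X = minmax_scale(X)                          # simple scaling first

def cv_acc(model):
    return cross_val_score(model, X, y, cv=5).mean()

# Line search 1: best C-tilde for the linear kernel.
C_tilde_grid = 2.0 ** np.arange(-6, 7, 2)
best_C_tilde = max(C_tilde_grid, key=lambda c: cv_acc(SVC(kernel="linear", C=c)))

# Line search 2: fix C-tilde and move along log sigma^2 = log C - log C-tilde,
# i.e. sigma^2 = C / C_tilde, so gamma = 1 / (2 * sigma^2).
C_grid = 2.0 ** np.arange(-2, 11, 2)

def rbf_for(C):
    sigma2 = C / best_C_tilde
    return SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma2))

best_C = max(C_grid, key=lambda c: cv_acc(rbf_for(c)))
print("best C~ (linear):", best_C_tilde, "  best C (RBF):", best_C)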

(19)

[Figure: contour plot of cross-validation accuracy (97 to 98.8) over lg(C) from 1 to 7 and lg(gamma) from −2 to 3.]

(20)

However, I Prefer Simple Grid Search

Reasons for not using bounds (if there are only two parameters)

– Psychologically, users do not feel safe

– In practice, the IJCNN competition:

97.09% and 97.83% using RM bounds for L1- and L2-SVM; 98.59% using a 25-point grid

(2668, 1990, and 1293 testing errors, respectively)

– Useful if more than two parameters

About the two line searches:

(21)

[Figure: two panels plotting iterations (about 10² to 10⁶) against log(C) from −8 to 8 for the heart_scale data.]

– A paper [Chung et al. 2003]: efficient decomposition methods for linear SVMs

Deciding the best C for linear SVMs is sometimes ambiguous

(22)

[Figure: CV rate (70 to 84) versus log(C) from −8 to 2.]

After C ≥ C∗, everything is the same

We propose that users do the following (a sketch follows below):

– Start from a loose grid

– Identify good regions and use a finer grid there

The grid search tool in libsvm

Easy parallelization
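A minimal sketch (mine, not the grid-search tool shipped with libsvm) of the loose-then-fine grid idea, with the grid points evaluated in parallel; the dataset and grid ranges are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import minmax_scale
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
X = minmax_scale(X)

# Stage 1: loose grid (coarse log2 steps); n_jobs=-1 runs the (C, gamma)
# evaluations in parallel, which is the "easy parallelization" point.
coarse = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": 2.0 ** np.arange(-4, 13, 4), "gamma": 2.0 ** np.arange(-10, 3, 4)},
    cv=5, n_jobs=-1,
).fit(X, y)

# Stage 2: finer grid around the best coarse point.
c0 = np.log2(coarse.best_params_["C"])
g0 = np.log2(coarse.best_params_["gamma"])
fine = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": 2.0 ** np.arange(c0 - 2, c0 + 2.5, 1),
     "gamma": 2.0 ** np.arange(g0 - 2, g0 + 2.5, 1)},
    cv=5, n_jobs=-1,
).fit(X, y)

print("coarse best:", coarse.best_params_, coarse.best_score_)
print("fine best:  ", fine.best_params_, fine.best_score_)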

(23)

loo bounds: 20 steps ⇒ more time than a 10 × 10 grid with five computers

Automatic load balancing

No need for α-seeding, passing the kernel cache, etc.

This simple tool

– Enough for medium-sized problems

– Advantage of having only one figure for multi-class problems

Further improvement

(24)

Challenges

Using this procedure, if satisfactory results are obtained for enough problems

⇒ then SVM can eventually become a major method

How do we ask users to at least do this ? How do we know if it is or not ?

If not
