
A Practical Guide to Support Vector Classification

Chih-Jen Lin

Department of Computer Science, National Taiwan University

(2)

Motivation and Outline

SVM: a hot topic in machine learning

However, many beginners get unsatisfactory accuracy at first

Some easy but significant steps are missed

This talk

– Some cookbook approaches based on our experience serving users

– No guarantee of the best accuracy, but usually reasonable accuracy

– We hope beginners get acceptable results quickly and easily

– Challenging cases and further extensions

What we plan to add to LIBSVM

(3)

Basic Concepts of SVM

Separating hyperplane and margins: $w^T x + b = +1, 0, -1$

$$\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

subject to $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, l$.

Kernel: $K(x, y) = \phi(x)^T \phi(y)$
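
As a minimal sketch of the primal problem above, assuming the identity feature map φ(x) = x and a hand-picked w and b (the toy data and the function name `primal_objective` are illustrative, not from the slides), the slack variables and the objective value can be computed like this:

```python
def primal_objective(w, b, X, y, C):
    """Return (objective, slacks) for the soft-margin primal with phi(x) = x."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        # Smallest xi_i satisfying y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0
        slacks.append(max(0.0, 1.0 - margin))
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    return regularizer + C * sum(slacks), slacks


# Toy data (made up for illustration): the last point is on the wrong side
X = [[2.0, 0.0], [1.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [+1, +1, -1, -1]
objective, slacks = primal_objective(w=[1.0, 0.0], b=0.0, X=X, y=y, C=1.0)
print(objective, slacks)
```

Only the misclassified point receives a non-zero slack; a larger C makes such violations more expensive relative to the margin term.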

(4)

What Many Beginners are Doing Now

Convert data to the format of an SVM package

May not conduct scaling

Randomly try a few parameters and kernels without validation

• Default parameters are surprisingly important

(5)

Examples

         training data   testing data   features   classes   accuracy by users   accuracy by us

User 1   3,089           4,000          4          2         75.2%               96.9%

User 2   391             0              20         3         36%                 85.2%

User 3   1,243           41             21         2         4.88%               87.8%

User 1:

I am using libsvm in a astroparticle physics application .. First, let me congratulate you to a really easy to use and nice package. Unfortunately, it gives me astonishingly bad results...

Answer:

(6)

Answer:

I am able to get 97% test accuracy. Is that good enough for you ?

User 1:

You earned a copy of my PhD thesis

User 2:

I am a developer in a bioinformatics laboratory at ... We would like to use LIBSVM in a project ... But results not good. 36% CV accuracy

Answer:

OK. Send me the data

Answer:

I am able to give 83.88% cv accuracy. Is that good enough for you ?

(7)

User 2:

83.88% accuracy would be excellent...

User 3:

I have problems getting the same result with SVM to compared to neural nets. Right now I get a correct of 4.88%, which is very bad (neural net 70-90%).

Answer:

I played a bit with your data. My testing accuracy is 87.8%. Is this good for you?

User 3:

(8)

We Hope Users At Least Do

The following procedure

1. Conduct simple scaling on the data

2. Consider RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$

3. Use cross-validation to find the best parameters C and γ

4. Use the best C and γ to train on the whole training set

(9)

Why RBF

Linear kernel: special case of RBF [Keerthi and Lin 2003]

Polynomial: numerical difficulties, $(< 1)^d \to 0$, $(> 1)^d \to \infty$

More parameters than RBF

tanh: still a mystery

May not be positive semi-definite

In [Lin and Lin 2003], for certain parameters, it behaves like RBF
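
The numerical point about the polynomial kernel can be illustrated with a small sketch. It assumes the common polynomial form (gamma * x^T y + coef0)^degree; the sample vectors and parameter values are made up for illustration.

```python
import math


def poly_kernel(x, y, degree, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * x^T y + coef0)^degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2), always in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


small = ([0.5, 0.5], [0.5, 0.5])   # inner product 0.5 < 1
large = ([2.0, 2.0], [2.0, 2.0])   # inner product 8 > 1

for d in (2, 10, 50):
    # The first value shrinks toward 0, the second grows without bound
    print(d, poly_kernel(*small, degree=d), poly_kernel(*large, degree=d))

# The RBF value stays bounded no matter how far apart the points are
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5))
```

The polynomial kernel here has three parameters (degree, gamma, coef0) against one for RBF, which is the "more parameters" point on the slide.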

(10)

Examples: Using the Proposed Procedure

User 1

Original sets with default parameters

$./svm-train train.1

$./svm-predict test.1 train.1.model test.1.predict

→ Accuracy = 66.925%

Scaled sets with default parameters

$./svm-scale -s range1 train.1 > train.1.scale

$./svm-scale -r range1 test.1 > test.1.scale

$./svm-train train.1.scale

$./svm-predict test.1.scale train.1.scale.model test.1.predict

→ Accuracy = 96.15%

(11)

$python grid.py train.1.scale

· · ·

2.0 2.0 96.8922

(Best C=2.0, γ=2.0 with five-fold cross-validation rate=96.8922%)

$./svm-train -c 2 -g 2 train.1.scale

$./svm-predict test.1.scale train.1.scale.model test.1.predict

→ Accuracy = 96.875%

User 2

Original sets with default parameters

$./svm-train -v 5 train.2

→ Cross Validation Accuracy = 56.5217%

Scaled sets with default parameters

$./svm-scale train.2 > train.2.scale

$./svm-train -v 5 train.2.scale

(12)

→ Cross Validation Accuracy = 78.5166%

Scaled sets with parameter selection

$python grid.py train.2.scale

· · ·

2.0 0.5 85.1662

→ Cross Validation Accuracy = 85.1662%

(Best C=2.0, γ=0.5 with five-fold cross-validation rate=85.1662%)

User 3

Original sets with default parameters

$./svm-train train.3

$./svm-predict test.3 train.3.model test.3.predict

→ Accuracy = 2.43902%

(13)

$./svm-scale -s range3 train.3 > train.3.scale

$./svm-scale -r range3 test.3 > test.3.scale

$./svm-train train.3.scale

$./svm-predict test.3.scale train.3.scale.model test.3.predict

→ Accuracy = 12.1951%

Scaled sets with parameter selection

$python grid.py train.3.scale

· · ·

128.0 0.125 84.8753

(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)

$./svm-train -c 128 -g 0.125 train.3.scale

$./svm-predict test.3.scale train.3.scale.model test.3.predict

(14)

Scaling

Important for Neural Networks (Part 2 of NN FAQ)

Most reasons apply here

Attributes in greater numeric ranges may dominate

$K(x, y) = e^{-\gamma \|x - y\|^2}$

Simply scale each attribute linearly to [−1, +1] or [0, 1].
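
A minimal sketch of linear scaling to [−1, +1] follows. It mirrors the idea behind saving ranges with `-s` and reusing them with `-r` in the `svm-scale` commands shown earlier, but it is not that tool's implementation. The helper names `fit_ranges` and `scale`, the toy data, and the handling of constant attributes are assumptions made for illustration.

```python
def fit_ranges(rows):
    """Record per-attribute (min, max) from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute to [lower, upper] using stored ranges."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)  # constant attribute: assumed convention
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Toy data: the second attribute has a much larger numeric range
train = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
test = [[2.5, 4000.0]]

ranges = fit_ranges(train)          # computed from the training data only
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)   # same ranges reused, so values may leave [-1, 1]
print(train_scaled, test_scaled)
```

Reusing the training ranges on the test set keeps both sets in the same coordinate system; scaling the test set with its own ranges would shift the points relative to the trained model.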

(15)

Model Selection

In fact, two-parameter search: C and γ

We recommend a simple grid search using cross-validation

E.g. 5-fold CV on $C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$, $\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3}$
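
The grid in the example can be enumerated with a few lines of Python. This is a sketch, not the code of `grid.py`; `cv_accuracy` is a hypothetical caller-supplied function standing in for a real cross-validation run.

```python
from itertools import product


def exponent_grid(start, stop, step):
    """Powers of two 2^start, 2^(start+step), ..., 2^stop (inclusive)."""
    return [2.0 ** e for e in range(start, stop + 1, step)]


C_values = exponent_grid(-5, 15, 2)       # 2^-5, 2^-3, ..., 2^15
gamma_values = exponent_grid(-15, 3, 2)   # 2^-15, 2^-13, ..., 2^3

grid = list(product(C_values, gamma_values))
print(len(C_values), len(gamma_values), len(grid))


def grid_search(cv_accuracy, grid):
    """Return the (C, gamma) pair with the highest cross-validation accuracy.

    cv_accuracy is a caller-supplied function (hypothetical here) that runs
    k-fold cross-validation for one (C, gamma) pair and returns a score.
    """
    return max(grid, key=lambda pair: cv_accuracy(*pair))
```

With these ranges there are 11 values of C and 10 values of γ, so one full pass costs 110 cross-validation runs.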

Why not more efficient methods?

leave-one-out error $\le f(C, \gamma)$, so solve $\min_{C, \gamma} f(C, \gamma)$

(16)

[Figure: panels plotted over ln(sigma^2) and ln(C), both axes ranging from -4 to 4]

(17)

Reasons for not using bounds (if two parameters)

– Implementation more complicated

– Psychologically, users do not feel safe

– In practice: IJCNN competition

97.09% and 97.83% using Radius Margin bounds for L1- and L2-SVM

98.59% using 25-point grid

2668, 1990, and 1293 testing errors, respectively

(18)

[Figure: contour plot with levels from 97 to 98.8 over lg(C) from 1 to 7 and lg(gamma) from -2 to 3]

(19)

We propose that users do

– Start from a loose grid

– Identify good regions and use a finer grid
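
The loose-then-fine strategy can be sketched as follows. The helper name `finer_exponents` and the step sizes are illustrative assumptions; the starting point log2(C) = log2(γ) = 1 corresponds to the best C = 2.0, γ = 2.0 reported for User 1.

```python
def finer_exponents(best_exponent, coarse_step, fine_step):
    """Exponents covering one coarse step on each side of the best point."""
    count = int(round(2 * coarse_step / fine_step))
    return [best_exponent - coarse_step + i * fine_step for i in range(count + 1)]


# Suppose the loose grid (step 2 in log2) found log2(C) = 1 and log2(gamma) = 1.
fine_C = [2.0 ** e for e in finer_exponents(1, coarse_step=2, fine_step=0.5)]
fine_gamma = [2.0 ** e for e in finer_exponents(1, coarse_step=2, fine_step=0.5)]
print(fine_C)
print(fine_gamma)
```

The finer grid spans the neighbouring loose grid points on both sides, so the previous best point is still included and the search cannot get worse.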

The grid search tool in libsvm

Easy parallelization

Every problem is independent

loo bounds: 20 steps ⇒ about 10 × 10 grids with five computers

Automatic load balancing

(20)

Example: Automatic Script

User 1

$python easy.py train.1 test.1

Scaling training data...

Cross validation...

Best c=2.0, g=2.0

Training...

Scaling testing data...

Testing...

Accuracy = 96.875% (3875/4000) (classification)

User 3

$python easy.py train.3 test.3

Scaling training data...

(21)

Best c=128.0, g=0.125

Training...

Scaling testing data...

Testing...

(22)

Challenges

Is the procedure good enough?

Good for some medium-sized data sets

Difficult problems: this procedure is not enough

– Too much training time

– Low accuracy

Extension of the procedure?

(23)

Feature Selection

Too many (non-zero) features

Examples here: 4, 20, 21 features $\ll$ #data

RBF kernel

$K(x, y) = e^{-\gamma \|x - y\|^2}$

Irrelevant attributes cause problems

How about

$K(x, y) = e^{-\sum_{i=1}^{n} \gamma_i (x_i - y_i)^2}$

Difficult to choose $\gamma_i$

Possible approaches (e.g. [Chapelle et al. 2002]): leave-one-out error $\le f(C, \gamma_1, \ldots, \gamma_n)$

(24)

Feature selection before training SVM

SVM can help feature selection as well

E.g. linear SVM

$f(x) = w^T x + b$

Choose indices with large $|w_i|$ [Guyon et al. 2002]
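
Ranking features by the magnitude of a linear SVM's weights takes only a sort. The weight vector below is made up; in practice it would come from a trained linear model, and this single-pass ranking is a simplification rather than the full procedure of [Guyon et al. 2002].

```python
def rank_features_by_weight(w):
    """Return feature indices sorted by decreasing |w_i|."""
    return sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)


w = [0.05, -1.8, 0.4, 0.0, 2.3]      # made-up weights for illustration
ranking = rank_features_by_weight(w)
top_two = ranking[:2]
print(ranking, top_two)
```

The sign of a weight is ignored: a large negative weight is as informative for the decision function as a large positive one.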

Overall, a very difficult issue

(25)

Probability Estimates

SVM outputs decision values only

Probability estimates for two-class SVM:

– Platt’s sigmoid approximation

– Isotonic regression

– SVM density estimation ?
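
For the first option, Platt's approach maps a decision value f to a probability through a sigmoid, P(y = 1 | f) = 1 / (1 + exp(A f + B)). The sketch below only evaluates that map; the values of A and B are illustrative assumptions, whereas in practice they are fitted to decision values and labels.

```python
import math


def platt_probability(decision_value, A, B):
    """Sigmoid map P(y = 1 | f) = 1 / (1 + exp(A * f + B))."""
    z = A * decision_value + B
    # Two algebraically equal forms; pick the one that avoids overflow in exp
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


A, B = -2.0, 0.0   # illustrative values, not fitted parameters
for f in (-3.0, 0.0, 3.0):
    print(f, platt_probability(f, A, B))
```

With a negative A, larger decision values give probabilities closer to 1, and a decision value of 0 maps to 0.5 when B = 0.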

We are conducting a serious evaluation

Multi-class probability estimate

Related to multi-class classification

(26)

Currently LIBSVM uses 1vs1 (after an evaluation in [Hsu and Lin, 2002])

10 classes: 45 SVMs, 0vs1, 0vs2, . . . , 8vs9
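
The count of 45 is k(k − 1)/2 for k = 10 classes. A short sketch enumerates the pairs and combines pairwise decisions by majority vote; `pairwise_winner` is a hypothetical stand-in for the trained binary classifiers, and the voting rule shown is a simple illustration rather than LIBSVM's code.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """All unordered class pairs (i, j) with i before j."""
    return list(combinations(classes, 2))


def vote(pairwise_winner, classes):
    """Predict by majority vote over all pairwise classifiers.

    pairwise_winner(i, j) is a caller-supplied function (hypothetical here)
    returning whichever of i or j the binary classifier for that pair picks.
    """
    votes = Counter(pairwise_winner(i, j) for i, j in one_vs_one_pairs(classes))
    return votes.most_common(1)[0][0]


pairs = one_vs_one_pairs(range(10))
print(len(pairs), pairs[0], pairs[1], pairs[-1])
```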

Given $r_{ij} \approx P(y = i \mid y = i \text{ or } j)$, estimate $P(y = i)$

An issue for all binary classification methods

New and stable methods proposed in [Wu et al., 2003]
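
To make the coupling problem concrete, here is a sketch of one simple closed-form rule: p_i proportional to 1 / (Σ_{j≠i} 1/r_ij − (k − 2)). This is a swapped-in illustration and not the method of [Wu et al., 2003]. It assumes every r_ij lies in (0, 1]; the function name `couple_pairwise` and the test probabilities are made up.

```python
def couple_pairwise(r):
    """Combine pairwise estimates r[i][j] ~ P(y=i | y=i or j) into P(y=i).

    Uses p_i proportional to 1 / (sum_{j != i} 1 / r[i][j] - (k - 2)),
    a simple closed-form rule, not the method of [Wu et al., 2003].
    """
    k = len(r)
    raw = []
    for i in range(k):
        total = sum(1.0 / r[i][j] for j in range(k) if j != i)
        raw.append(1.0 / (total - (k - 2)))
    norm = sum(raw)
    return [value / norm for value in raw]


# Build consistent pairwise estimates from known probabilities, then recover them.
p_true = [0.5, 0.3, 0.2]
r = [[p_i / (p_i + p_j) if i != j else 0.0
      for j, p_j in enumerate(p_true)] for i, p_i in enumerate(p_true)]
print(couple_pairwise(r))
```

When the pairwise values are exactly consistent, r_ij = p_i / (p_i + p_j), the sum of 1/r_ij over j ≠ i equals (k − 2) + 1/p_i, so the rule recovers p_i. Real pairwise estimates are noisy and inconsistent, which is why more careful methods are of interest.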

All these are about ready

(27)

Unbalanced Data

Many information retrieval users ask about ROC curves and adjusting precision/recall

Not accuracy any more

Three ways to generate ROC curves

– Adjust b of $f(x) = w^T x + b$

– Unbalanced cost function

$$\min_{w, b, \xi} \quad \frac{1}{2} w^T w + C^{+} \sum_{i: y_i = 1} \xi_i + C^{-} \sum_{i: y_i = -1} \xi_i$$

– Rank by probability output + cross validation (now available)
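
The first of these, adjusting b, amounts to sweeping a threshold over the decision values. A minimal sketch follows, assuming both classes are present and labels are +1/−1; the decision values and the function name `roc_points` are made up for illustration.

```python
def roc_points(decision_values, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    Predicting +1 when f(x) >= t and lowering t step by step is equivalent
    to shifting b in f(x) = w^T x + b.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    ranked = sorted(zip(decision_values, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (value, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this decision value
        is_last_of_tie = index + 1 == len(ranked) or ranked[index + 1][0] != value
        if is_last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points


values = [2.1, 1.3, 0.4, -0.2, -0.9, -1.7]   # made-up decision values
labels = [1, 1, -1, 1, -1, -1]
print(roc_points(values, labels))
```

Each point on the curve corresponds to one choice of b, so a user can pick the trade-off between true and false positives after training instead of retraining.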

(28)

[Figure: ROC curve of heart_scale; x-axis: False Positive Rate, y-axis: True Positive Rate]

Goal: an integrated tool so users can easily adjust cost matrices or the relation of TP, TN, FP, FN

(29)

Conclusions

Still a long way to serve all users’ needs but we are trying

We hope more users can benefit from this research and eventually SVM can be an easy-to-use classification method

Slides based on

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin

A Practical Guide to Support Vector Classification

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

LIBSVM available at

http://www.csie.ntu.edu.tw/~cjlin/libsvm
