A practical guide to support vector classification

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Motivation and Outline

• SVM: a hot topic in machine learning

• However, many beginners get unsatisfactory accuracy at first

Some easy but significant steps are missed

• This talk

– Some cookbook approaches based on our experience serving users

– No guarantee for the best accuracy but usually reasonable accuracy

– Hope beginners get acceptable results fast and easily.

– Challenging cases and further extensions

What we plan to add to LIBSVM

Basic Concepts of SVM

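[Figure: separating hyperplane with the lines $w^T x + b = +1, 0, -1$]

$$\min_{w,b,\xi} \quad \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

$$\text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l.$$

• Kernel: $K(x, y) = \phi(x)^T \phi(y)$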
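To make the primal objective concrete, here is a minimal sketch that evaluates it for a linear feature map (φ(x) = x), taking each slack at its smallest feasible value, ξ_i = max(0, 1 − y_i(wᵀx_i + b)). The function name and the toy data are assumptions made for illustration; they are not part of LIBSVM.

```python
def primal_objective(w, b, C, X, y):
    """Evaluate 1/2 w^T w + C * sum(xi_i) for a linear feature map phi(x) = x.

    Each slack xi_i is set to its smallest feasible value,
    max(0, 1 - y_i (w^T x_i + b)).
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slacks.append(max(0.0, 1.0 - y_i * decision))
    regularizer = 0.5 * sum(w_j * w_j for w_j in w)
    return regularizer + C * sum(slacks), slacks


# Toy data (hypothetical): two points per class in two dimensions.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [-0.5, 0.0]]
y = [1, 1, -1, -1]
objective, slacks = primal_objective(w=[1.0, 0.0], b=0.0, C=1.0, X=X, y=y)
print(objective, slacks)
```

Points lying inside the margin get a positive slack, and C controls how heavily those slacks are penalized relative to the margin term.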

What Many Beginners are Doing Now

• Convert data to the format of an SVM package

• May not conduct scaling

• Randomly try a few parameters and kernels without validation

• Default parameters are surprisingly important

Examples

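| | # training data | # testing data | # features | # classes | Accuracy by users | Accuracy by us |
|---|---|---|---|---|---|---|
| User 1 | 3,089 | 4,000 | 4 | 2 | 75.2% | 96.9% |
| User 2 | 391 | 0 | 20 | 3 | 36% | 85.2% |
| User 3 | 1,243 | 41 | 21 | 2 | 4.88% | 87.8% |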

• User 1:

I am using libsvm in a astroparticle physics application .. First, let me congratulate you to a really easy to use and nice package. Unfortunately, it gives me astonishingly bad results...

• Answer:

I am able to get 97% test accuracy. Is that good enough for you?

• User 1:

You earned a copy of my PhD thesis

• User 2:

I am a developer in a bioinformatics laboratory at ... We would like to use LIBSVM in a project ... But results not good. 36% CV accuracy

• Answer:

OK. Send me the data

• Answer:

I am able to give 83.88% CV accuracy. Is that good enough for you?

• User 2:

83.88% accuracy would be excellent...

• User 3:

I have problems getting the same result with SVM to compared to neural nets.

Right now I get a correct of 4.88%, which is very bad (neural net 70-90%).

• Answer:

I played a bit with your data. My testing accuracy is 87.8%. Is this good enough for you?

• User 3:

We Hope Users At Least Do

• The following procedure

1. Conduct simple scaling on the data

2. Consider the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$

3. Use cross-validation to find the best parameters C and γ

4. Use the best C and γ to train on the whole training set
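The RBF kernel in step 2 can be sketched in a few lines of plain Python. This is only an illustration of the formula; the function name is an assumption, and LIBSVM computes the kernel internally.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points give the maximum kernel value.
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
# Points with squared distance 25 give a value close to zero.
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
print(k_same, k_far)
```

The kernel value always lies in (0, 1], and γ controls how quickly it decays with distance, which is why γ has to be chosen together with C in step 3.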

Why RBF

• Linear kernel: special case of RBF [Keerthi and Lin 2003]

• Polynomial: numerical difficulties $(< 1)^d \to 0$, $(> 1)^d \to \infty$

More parameters than RBF

• tanh: still a mystery

May not be positive semi-definite

In [Lin and Lin 2003], for certain parameters, it behaves like RBF
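The numerical difficulty of the polynomial kernel can be seen by raising a base slightly below or above 1 to a growing degree d. The bases and degrees below are arbitrary values chosen for illustration.

```python
def power_values(base, degrees):
    """Return base**d for each degree d."""
    return [base ** d for d in degrees]


degrees = [1, 10, 100, 1000]
small = power_values(0.5, degrees)   # base < 1: values shrink toward 0
large = power_values(2.0, degrees)   # base > 1: values grow without bound

for d, s, g in zip(degrees, small, large):
    print(d, s, g)
```

An RBF kernel value, by contrast, always stays in (0, 1], so it does not run into this kind of underflow or overflow as parameters change.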

Examples: Using the Proposed Procedure

User 1

• Original sets with default parameters

$ ./svm-train train.1
$ ./svm-predict test.1 train.1.model test.1.predict

→ Accuracy = 66.925%

• Scaled sets with default parameters

$ ./svm-scale -s range1 train.1 > train.1.scale
$ ./svm-scale -r range1 test.1 > test.1.scale
$ ./svm-train train.1.scale
$ ./svm-predict test.1.scale train.1.scale.model test.1.predict

→ Accuracy = 96.15%

• Scaled sets with parameter selection

$ python grid.py train.1.scale

· · ·

2.0 2.0 96.8922

(Best C=2.0, γ=2.0 with five-fold cross-validation rate=96.8922%)

$ ./svm-train -c 2 -g 2 train.1.scale

→ Accuracy = 96.875%

User 2

• Original sets with default parameters

$ ./svm-train -v 5 train.2

→ Cross Validation Accuracy = 56.5217%

• Scaled sets with default parameters

$ ./svm-scale train.2 > train.2.scale
$ ./svm-train -v 5 train.2.scale

• Scaled sets with parameter selection

· · ·

2.0 0.5 85.1662

(Best C=2.0, γ=0.5 with five-fold cross-validation rate=85.1662%)

User 3

• Original sets with default parameters

$ ./svm-train train.3
$ ./svm-predict test.3 train.3.model test.3.predict

→ Accuracy = 2.43902%

• Scaled sets with default parameters

$ ./svm-scale -s range3 train.3 > train.3.scale
$ ./svm-scale -r range3 test.3 > test.3.scale
$ ./svm-train train.3.scale

→ Accuracy = 12.1951%

• Scaled sets with parameter selection

· · ·

128.0 0.125 84.8753

(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)

$ ./svm-train -c 128 -g 0.125 train.3.scale

Scaling

• Important for Neural Networks (Part 2 of NN FAQ)

Most reasons apply here

• Attributes in greater numeric ranges may dominate

$K(x, y) = e^{-\gamma \|x - y\|^2}$

• Simple linear scaling of each attribute to [−1, +1] or [0, 1]
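A minimal sketch of linear scaling follows, assuming the ranges are learned on the training set and then reused unchanged on the test set (the same idea as saving ranges with `svm-scale -s` and restoring them with `svm-scale -r` in the commands shown earlier). The function names and toy data are hypothetical, and this is not the svm-scale implementation.

```python
def fit_ranges(rows):
    """Record the minimum and maximum of each attribute on the training set."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute to [lower, upper] using the stored ranges."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Constant attribute: mapping to 0.0 is an arbitrary choice here.
                new_row.append(0.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


train = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
test = [[2.5, 400.0]]
ranges = fit_ranges(train)            # learned from training data only
train_scaled = scale_rows(train, ranges)
test_scaled = scale_rows(test, ranges)
print(train_scaled, test_scaled)
```

Because the test set reuses the training ranges, a test value outside the training range can land outside [−1, +1]; that is expected and keeps both sets on the same scale.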

Model Selection

• In fact, two-parameter search: C and γ

• We recommend a simple grid search using cross-validation

E.g. 5-fold CV on $C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$, $\gamma = 2^{-15}, 2^{-13}, \ldots, 2^{3}$

• Why not more efficient methods?

E.g. leave-one-out error $\le f(C, \gamma)$, so solve

$$\min_{C,\gamma} f(C, \gamma)$$
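The grid above can be enumerated directly. The sketch below builds the (C, γ) pairs and a simple five-fold index split; the helper names are assumptions, and the fold assignment is a plain round-robin split rather than whatever LIBSVM does internally.

```python
def exponent_grid(start, stop, step=2):
    """Return [2**start, 2**(start + step), ..., 2**stop]."""
    return [2.0 ** e for e in range(start, stop + 1, step)]


def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k folds of nearly equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


C_values = exponent_grid(-5, 15)        # 2^-5, 2^-3, ..., 2^15
gamma_values = exponent_grid(-15, 3)    # 2^-15, 2^-13, ..., 2^3
grid = [(C, gamma) for C in C_values for gamma in gamma_values]
folds = k_fold_indices(12, k=5)

print(len(C_values), len(gamma_values), len(grid))
```

With 11 values of C and 10 values of γ, the grid has 110 points, so five-fold cross-validation trains 550 models in total.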

[Figure: multiple panels plotted over ln(sigma^2) and ln(C), each axis ranging from -4 to 4]

• Reasons for not using bounds (if two parameters)

– Implementation more complicated

– Psychologically, users do not feel safe

– In practice: IJCNN competition:

97.09% and 97.83% using Radius Margin bounds for L1- and L2-SVM

98.59% using a 25-point grid

2668, 1990, and 1293 testing errors, respectively

[Figure: contour plot for data set d2 over lg(C) from 1 to 7 and lg(gamma) from -2 to 3, with contour levels from 97 to 98.8]

• We propose that users do

– Start from a loose grid

– Identify good regions and use a finer grid

• The grid search tool in libsvm

• Easy parallelization

Every (C, γ) problem is independent

Time comparison: 20 sequential steps with loo bounds ≈ a 10 × 10 grid on five computers

Automatic load balancing
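The loose-then-fine strategy can be sketched as two passes of the same search routine. In the sketch below, `toy_score` is a made-up stand-in for cross-validation accuracy, and all function names are hypothetical; in practice the score would come from a cross-validation run such as `svm-train -v 5`.

```python
def grid_search(score, log2c_values, log2g_values):
    """Return (best_score, log2C, log2gamma) over the given exponent grid."""
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            # Each evaluation is independent, so these calls could run in parallel.
            candidate = (score(log2c, log2g), log2c, log2g)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


def frange(center, half_width, step):
    """Exponents from center - half_width to center + half_width in steps of step."""
    count = int(round(2 * half_width / step)) + 1
    return [center - half_width + i * step for i in range(count)]


def toy_score(log2c, log2g):
    """Stand-in for CV accuracy; peaks at log2C = 3.5, log2gamma = -2.25."""
    return 100.0 - (log2c - 3.5) ** 2 - (log2g + 2.25) ** 2


# Pass 1: loose grid with exponent step 2.
coarse = grid_search(toy_score, range(-5, 16, 2), range(-15, 4, 2))
# Pass 2: finer grid (step 0.25) around the best loose-grid point.
fine = grid_search(toy_score, frange(coarse[1], 2, 0.25), frange(coarse[2], 2, 0.25))
print(coarse, fine)
```

Since no grid point depends on another, the inner evaluations can be handed out to several machines, which is the parallelization the slide refers to.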

Example: Automatic Script

• User 1

$ python easy.py train.1 test.1
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Accuracy = 96.875% (3875/4000) (classification)

• User 3

$ python easy.py train.3 test.3
Scaling training data...
Best c=128.0, g=0.125
Training...
Scaling testing data...
Testing...

Challenges

• Is the procedure good enough?

Good for some medium-sized data sets

• Difficult problems: this procedure is not enough

– Too much training time

– Low accuracy

• Extension of the procedure?

Feature Selection

• Too many (non-zero) features

Examples here: 4, 20, 21 features $\ll$ #data

• RBF kernel

$K(x, y) = e^{-\gamma \|x - y\|^2}$

Irrelevant attributes cause problems

• How about

$K(x, y) = e^{-\sum_{i=1}^{n} \gamma_i (x_i - y_i)^2}$

Difficult to choose $\gamma_i$

Possible approaches (e.g. [Chapelle et al. 2002]): leave-one-out error $\le f(C, \gamma_1, \ldots, \gamma_n)$
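A small sketch of the per-attribute kernel shows why it helps with irrelevant attributes: setting γ_i = 0 removes attribute i from the distance. The function name, the γ values, and the assumption that the second attribute is irrelevant are all illustrative.

```python
import math


def weighted_rbf_kernel(x, y, gammas):
    """K(x, y) = exp(-sum_i gamma_i * (x_i - y_i)^2)."""
    exponent = sum(g * (a - b) ** 2 for g, a, b in zip(gammas, x, y))
    return math.exp(-exponent)


x = [1.0, 0.0]
y = [1.0, 5.0]   # differs only in the second (assumed irrelevant) attribute

# A single shared gamma lets the irrelevant attribute dominate the distance.
k_shared = weighted_rbf_kernel(x, y, gammas=[0.5, 0.5])
# gamma_2 = 0 ignores the second attribute entirely.
k_weighted = weighted_rbf_kernel(x, y, gammas=[0.5, 0.0])
print(k_shared, k_weighted)
```

The price is one parameter per attribute, which is what makes choosing the γ_i difficult with a plain grid search.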

• Feature selection before training SVM

SVM can help feature selection as well

E.g. linear SVM

$f(x) = w^T x + b$

Choose indices with large $|w_i|$ [Guyon et al. 2002]

• Overall, a very difficult issue
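Choosing indices with large |w_i| amounts to sorting the weights of a trained linear SVM by absolute value. The sketch below shows only this single ranking step, with hypothetical weights; it is not the full method of [Guyon et al. 2002].

```python
def top_features_by_weight(w, k):
    """Return the indices of the k largest |w_i|, largest first."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    return order[:k]


# Hypothetical weight vector of a trained linear SVM, f(x) = w^T x + b.
w = [0.05, -1.7, 0.3, 2.4, -0.01]
selected = top_features_by_weight(w, k=2)
print(selected)
```

The ranking is only meaningful if the attributes were scaled to comparable ranges first; otherwise the weight magnitudes mostly reflect the units of each attribute.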

Probability Estimates

• SVM outputs decision values only

• Probability estimates for two-class SVM:

– Platt’s sigmoid approximation

– Isotonic regression

– SVM density estimation?

We are conducting a serious evaluation

• Multi-class probability estimate

Related to multi-class classification

Currently LIBSVM uses 1vs1 (after an evaluation in [Hsu and Lin, 2002])

10 classes: 45 SVMs, 0vs1, 0vs2, . . . , 8vs9

Given $r_{ij} \approx P(y = i \mid y = i \text{ or } j)$, estimate $P(y = i)$

An issue for all binary classification methods

New and stable methods proposed in [Wu et al., 2003]

• All these are about ready
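Two pieces of the discussion above are easy to sketch: enumerating the 1vs1 sub-problems, and the sigmoid form used in Platt's approximation, P(y = 1 | f) = 1 / (1 + exp(A f + B)). The values of A and B below are hypothetical; in practice they are fitted to decision values, and the coupling of the pairwise r_ij into P(y = i) (as in [Wu et al., 2003]) is not shown here.

```python
import math
from itertools import combinations


def one_vs_one_pairs(num_classes):
    """List every unordered class pair (i, j) with i < j."""
    return list(combinations(range(num_classes), 2))


def sigmoid_probability(decision_value, A, B):
    """Platt-style estimate P(y = 1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * decision_value + B))


pairs = one_vs_one_pairs(10)
# A and B are hypothetical; A < 0 makes larger decision values more probable.
p_boundary = sigmoid_probability(0.0, A=-2.0, B=0.0)
p_positive = sigmoid_probability(1.0, A=-2.0, B=0.0)
print(len(pairs), pairs[0], pairs[-1], p_boundary, p_positive)
```

For k classes the 1vs1 scheme trains k(k−1)/2 binary SVMs, which gives the 45 models quoted for 10 classes.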

Unbalanced Data

• Many information retrieval users ask about ROC curves and adjusting precision/recall

Accuracy is no longer the criterion

• Three ways to generate ROC curves

– Adjust b of

$f(x) = w^T x + b$

– Unbalanced cost function

$$\min_{w,b,\xi} \quad \frac{1}{2} w^T w + C^{+} \sum_{i: y_i = 1} \xi_i + C^{-} \sum_{i: y_i = -1} \xi_i$$

– Rank by probability output + cross validation (now available)
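The first approach, adjusting b, is equivalent to sweeping a threshold over the decision values f(x). The sketch below computes the resulting (false positive rate, true positive rate) points; the function name and the toy decision values are assumptions, and both classes are assumed to be present.

```python
def roc_points(decision_values, labels):
    """Return (FPR, TPR) pairs obtained by thresholding decision values.

    Shifting the threshold is equivalent to adjusting b in f(x) = w^T x + b.
    Labels are +1 or -1; an instance is predicted positive if f(x) >= threshold.
    """
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every decision value
    for threshold in sorted(set(decision_values), reverse=True):
        tp = sum(1 for f, label in zip(decision_values, labels)
                 if f >= threshold and label == 1)
        fp = sum(1 for f, label in zip(decision_values, labels)
                 if f >= threshold and label == -1)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical decision values and true labels.
decision_values = [2.1, 0.8, 0.3, -0.4, -1.5]
labels = [1, 1, -1, 1, -1]
curve = roc_points(decision_values, labels)
print(curve)
```

Ranking by probability output works the same way, with the probability estimates taking the place of the decision values.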

[Figure: ROC curve of heart_scale; x-axis: False Positive Rate, y-axis: True Positive Rate, both from 0 to 1]

• Goal: an integrated tool so users can easily adjust cost matrices or the relation of TP, TN, FP, FN

Conclusions

• Still a long way to serve all users’ needs but we are trying

• We hope more users can benefit from this research and eventually SVM can be an easy-to-use classification method

• Slides based on

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin

A Practical Guide to Support Vector Classification

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

• LIBSVM available at

http://www.csie.ntu.edu.tw/~cjlin/libsvm