Can Support Vector Machine be a Major
Classification Method?
Chih-Jen Lin
Department of Computer Science National Taiwan University
Motivation
• SVM: a hot topic in machine learning
• However, not yet a major classification method
KDNuggets 2002 Poll: neural networks and decision trees remain the main tools
The Potential of SVM
• In my opinion, after careful data pre-processing,
appropriately using NN or SVM ⇒ similar accuracy
• But users may not use them properly
• The opportunity for SVM:
it is easier for users to use appropriately
What Many Users are Doing Now
• Convert data to the format of an SVM package
• May not conduct scaling
• Randomly try a few parameters and kernels without validation
• Default parameters are surprisingly important
We Hope Users At Least Do
• The following procedure
1. Simple scaling (training and testing)
2. Consider the RBF kernel
K(x, y) = e^{−γ‖x−y‖²} = e^{−‖x−y‖²/(2σ²)}
and find the best C and γ (or σ²)
• Why RBF:
– Linear kernel: special case of RBF [Keerthi and Lin 2003]
– Polynomial: numerical difficulties
(< 1)^d → 0, (> 1)^d → ∞
– tanh: still a mystery
In a coming paper [Lin and Lin 2003], for certain parameters, it behaves like the RBF kernel
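The contrast above can be checked numerically; a minimal pure-Python sketch (the function name `rbf_kernel` is mine, not from any library):

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2); always in (0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # 1.0 (identical points)
print(rbf_kernel([1.0, 2.0], [4.0, 6.0], gamma=0.5))  # small but positive

# Polynomial kernel values have no such bound: powers of inner products
# smaller or larger than 1 vanish or blow up with the degree d.
print(0.5 ** 50)   # nearly underflows
print(2.0 ** 50)   # blows up
```

Since the RBF kernel always lies in (0, 1], it avoids the overflow/underflow behavior sketched in the last two lines.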
Examples of the Proposed Procedure
• User 1:
I am using libsvm in an astroparticle physics application (AMANDA experiment). First, let me congratulate you on a really easy-to-use and nice package.
Unfortunately, it gives me astonishingly bad results...
• Answer:
What is your procedure ?
• User 1:
I do for example the following steps (here for classification):
./svm-scale ... > TRAINING.SCALE.DAT
./svm-train -s 0 -t 2 -c 10 TRAINING.SCALE.DAT
./svm-predict TESTING_SIGNAL.SCALE.DAT TRAINING.SCALE.DAT.model s_0_2_10.out
Accuracy = 75.2%
• Answer:
OK. Send me the data
• Answer:
First I scale the training and testing data TOGETHER:
/mnt/professor/cjlin/tmp% libsvm-2.36/svm-scale total > total.scale
Then separate them again.
Use the model selection tool (cross validation) to find the best parameters, then sort the results to find the best CV accuracy:
/mnt/professor/cjlin/tmp% sort -k 3 train.out
2 1 96.9569
8 1 96.9569
so c = 4 and g = 2 might be the best. Train on the training data again:
/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-train -m 300 -c 4 -g 2 ../train
Finally, test the independent data:
/mnt/professor/cjlin/tmp/libsvm-2.36% ./svm-predict ../testdata train.model o
Accuracy = 97.3%
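Picking the winner from the sorted cross-validation output can be automated; a minimal sketch, assuming the column layout produced by the grid search (log2 C, log2 gamma, CV accuracy). `best_params` is a hypothetical helper, not part of LIBSVM:

```python
def best_params(cv_lines):
    """Pick (C, gamma) with the highest CV accuracy from grid-search output.

    Each line is assumed to be: log2(C) log2(gamma) cv_accuracy
    """
    best = max(cv_lines, key=lambda line: float(line.split()[2]))
    lc, lg, acc = (float(t) for t in best.split())
    return 2.0 ** lc, 2.0 ** lg, acc

# With ties, max() keeps the first line it sees:
C, gamma, acc = best_params(["2 1 96.9569", "8 1 96.9569"])
print(C, gamma, acc)  # 4.0 2.0 96.9569
```

With a tie such as the one above, preferring the smaller C (a simpler model) is a common convention, which the first-match behavior of `max()` happens to respect here.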
• User 1:
You earned a copy of my PhD thesis
• User 2:
I am a developer in a bioinformatics laboratory at ... We would like to use LIBSVM in a project ... The datasets are reasonably unbalanced: there are 221 examples in the first set, 117 in the second set, and 53 in the third set.
But the results are not good
• Answer:
Have you scaled the data ? What is your accuracy ?
• User 2: Yes, to [0,1]. 36%
• Answer:
OK. Send me the data
• Answer:
I am able to get 83.88% CV accuracy. Is that good enough for you?
Model Selection is Important
• In fact, a two-parameter search (C and γ or σ²)
• Via bounds on the leave-one-out (loo) error
• Via two line searches
Bounds on loo
• Many loo bounds exist
• Main reason for using them: saving computational cost
• Bounds for which a search path may be found
[Figure: loo-bound contours over ln(C) and ln(σ²) for several problems]
– Radius margin bound
– Span bound
• A recent paper [Chung et al. 2002] on radius margin bound
– Minima in a good region matter more than tightness
A good bound should avoid minima at the boundary
(i.e., too small or too large C and σ²)
– Modification for L1-SVM
– Differentiability of min_{C,σ²} f(α(C, σ²))
– Reliable implementation
              L1-SVM                    L2-SVM
          #fun  #grad  accuracy    #fun  #grad  accuracy
banana       9      6     88.96       8      5     88.53
image       17     13     96.24      11      6     97.03
splice      13     12     89.84      21     19     89.84
tree         8      8     86.50       8      8     86.54
waveform    16     13     88.57       8      7     89.83
ijcnn1       9      9     97.09       7      7     97.83
• A coming paper [Chang and Lin 2003]: non-smooth optimization techniques for bounds
– Allow us to use more (i.e., non-differentiable) bounds
– Sensitivity analysis
– Bundle (cutting plane) methods
Piecewise diff. ⇒ semi-smooth ⇒ directionally diff.
Two Line Searches
• CV (loo) contour of RBF kernel [Keerthi and Lin 2003]:
[Figure: schematic contour over (log C, log σ²): underfitting and overfitting regions, a good region along the line log σ² = log C − log C̃, and an asymptote at log C_lim]
• When σ² is large,
RBF with (C, σ²) ≡ linear with C̃ = C/σ²
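This limit follows from exp(−z) ≈ 1 − z for small z: for large σ², the RBF kernel becomes an affine function of the squared distance, so only the ratio C/σ² matters. A quick numerical check (my own sketch, not from the paper):

```python
import math

sq_dist = 3.0  # some fixed ||x - y||^2
for sigma2 in (10.0, 100.0, 1000.0):
    exact = math.exp(-sq_dist / (2.0 * sigma2))
    linearized = 1.0 - sq_dist / (2.0 * sigma2)
    # the gap shrinks roughly like 1/sigma2^2 as sigma2 grows
    print(sigma2, exact, exact - linearized)
```

Each tenfold increase in σ² shrinks the linearization error by roughly a factor of 100, which is why the RBF machine behaves like a linear one in this regime.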
• A heuristic for model selection
1. Use the linear kernel and search for the best C̃
2. Fix C̃ and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using RBF
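Generating the candidate set for step 2 is a one-liner; a sketch assuming a base-2 grid (the function name and ranges are mine). Every candidate satisfies log σ² = log C − log C̃, so one line of (C, σ²) pairs replaces a full two-dimensional grid:

```python
def line_candidates(log2_C_tilde, log2_C_values):
    """(C, sigma^2) pairs on the line log sigma^2 = log C - log C_tilde."""
    return [(2.0 ** lc, 2.0 ** (lc - log2_C_tilde)) for lc in log2_C_values]

# e.g. C_tilde = 2^3 fixed by the first (linear-kernel) line search
pairs = line_candidates(3, range(-5, 16, 2))
print(len(pairs))  # 11 candidates instead of a 21 x 21 grid
for C, sigma2 in pairs:
    assert abs(C / sigma2 - 2.0 ** 3) < 1e-9  # C / sigma^2 = C_tilde holds
```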
Problem      n    #test   Test error (grid)   Test error (new)
banana      400    4900   0.1235 (6,-0)       0.1178 (-2,-2)
image      1300    1010   0.02475 (9,4)       0.02475 (1,0.5)
splice     1000    2175   0.09701 (1,4)       0.1011 (0,4)
ringnorm    400    7000   0.01429 (-2,2)      0.018 (-3,2)
twonorm     400    7000   0.031 (1,3)         0.02914 (1,4)
tree        700   11692   0.1132 (8,4)        0.1246 (2,2)
adult      1605   29589   0.1614 (5,6)        0.1614 (5,6)
web        2477   38994   0.02223 (5,5)       0.02223 (5,5)

• 441 versus 54 SVMs trained
[Figure: CV accuracy contour (97–98.8%) over lg(C) and lg(gamma)]
However, I Prefer Simple Grid Search
• Reasons for not using bounds (when there are two parameters)
– Psychologically, users do not feel safe
– In practice, the IJCNN competition:
97.09% and 97.83% using RM bounds for L1- and L2-SVM;
98.59% using a 25-point grid
(2668, 1990, and 1293 testing errors)
– Bounds remain useful with more than two parameters
• About two-line search:
[Figure: iterations versus log(C) for heart_scale]
– A paper [Chung et al. 2003]: efficient decomposition methods for linear SVMs
– The choice of the best C for linear SVMs is sometimes ambiguous
[Figure: CV rate (70–84%) versus log(C)]
– After C ≥ C∗, everything is the same
• We propose that users
– Start from a loose grid
– Identify good regions and use a finer grid
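The coarse-to-fine idea above can be sketched in a few lines of pure Python; the ranges mirror common LIBSVM grid defaults (log2 C in [−5, 15], log2 γ in [−15, 3]), but treat the ranges, step sizes, and the "winner" below as assumptions for illustration:

```python
def frange(lo, hi, step):
    """Inclusive range that also works with fractional steps."""
    vals, v = [], lo
    while v <= hi + 1e-9:
        vals.append(v)
        v += step
    return vals

# 1) loose grid in log2 space
coarse = [(c, g) for c in frange(-5, 15, 2) for g in frange(-15, 3, 2)]

# 2) after CV picks a good region, refine around it with a smaller step
best_c, best_g = 3, -5          # hypothetical winner of the coarse pass
fine = [(c, g)
        for c in frange(best_c - 2, best_c + 2, 0.5)
        for g in frange(best_g - 2, best_g + 2, 0.5)]

print(len(coarse), len(fine))  # 110 and 81 points
```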
• The grid search tool in libsvm
• Easy parallelization
loo bounds: 20 sequential steps ⇒ more time than a 10 × 10 grid on five computers
Automatic load balancing
• No need for α-seeding, passing cache etc.
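Because every (C, γ) point is evaluated independently, parallelizing the grid needs nothing beyond a process pool. A minimal sketch where `evaluate` is a stand-in for one `svm-train -v` cross-validation run (the quadratic "accuracy" is fake, for illustration only):

```python
from multiprocessing import Pool

def evaluate(point):
    """Stand-in for one cross-validation run at (log2 C, log2 gamma)."""
    lc, lg = point
    return lc, lg, 90.0 - (lc - 1) ** 2 - (lg + 1) ** 2  # fake CV accuracy

if __name__ == "__main__":
    grid = [(lc, lg) for lc in range(-5, 16, 2) for lg in range(-15, 4, 2)]
    with Pool(5) as pool:                    # e.g. five machines or cores
        results = pool.map(evaluate, grid)   # points are fully independent
    best = max(results, key=lambda r: r[2])
    print(best)  # (1, -1, 90.0)
```

The pool hands out grid points as workers become free, which is exactly the automatic load balancing mentioned above; no state (α-seeding, kernel cache) needs to be shared between points.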
• This simple tool
– Enough for medium-sized problems
– Advantage of having only one figure for multi-class problems
• Further improvement is possible
Challenges
• If, using this procedure, satisfactory results are obtained for enough problems
⇒ SVM can eventually become a major method
How do we get users to at least do this?
How do we know whether it is working or not?
• If not