Support Vector Machines for Data Classification
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Outline
• Support vector classification
• Example: engine misfire detection
Data Classification
• Given training data in different classes (labels known), predict test data (labels unknown)
• Examples
– Handwritten digit recognition
– Spam filtering
• Training and testing
• Methods:
– Nearest Neighbor
– Neural Networks
• Support vector machines: a new method that is becoming more and more popular; we will discuss its current status
• A good classification method should:
– Avoid underfitting: small training error
– Avoid overfitting: small testing error
Support Vector Classification
• Training vectors: $x_i$, $i = 1, \ldots, l$
• Consider a simple case with two classes. Define an indicator vector $y$:
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$
[Figure: two classes separated by the parallel planes $w^T x + b = +1, 0, -1$]
• A separating hyperplane: $w^T x + b = 0$, satisfying
$$w^T x_i + b > 0 \ \text{if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{if } y_i = -1$$
• Decision function: $f(x) = \mathrm{sign}(w^T x + b)$, where $x$ is a test point
• Variables $w$ and $b$: we need to find the coefficients of the plane
• Select w, b with the maximal margin.
Maximize the distance between $w^T x + b = 1$ and $w^T x + b = -1$ (Vapnik's statistical learning theory):
$$w^T x_i + b \ge 1 \ \text{if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{if } y_i = -1 \tag{1}$$
• Distance between $w^T x + b = 1$ and $w^T x + b = -1$: $2/\|w\| = 2/\sqrt{w^T w}$
• $\max 2/\|w\| \equiv \min w^T w / 2$, so we solve
$$\min_{w,b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \ \text{(from (1))}, \quad i = 1, \ldots, l.$$
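To make the margin concrete, here is a minimal sketch, assuming scikit-learn (not the talk's software, though its SVC is built on LIBSVM): it fits a linear classifier on a toy 2-D problem and reports $w$, $b$, and the margin $2/\|w\|$.

    # A minimal sketch: maximal-margin linear classifier on toy 2-D data.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0],   # class -1
                  [0.0, 3.0], [1.0, 4.0], [2.0, 3.0]])  # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel='linear', C=1e6).fit(X, y)  # large C ~ hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    print('w =', w, ' b =', b, ' margin =', 2 / np.linalg.norm(w))
    print(clf.predict([[1.0, 0.5], [1.0, 3.5]]))  # f(x) = sign(w.x+b) -> [-1, 1]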
Higher Dimensional Feature Spaces
• Earlier we tried to find a linear separating hyperplane, but data may not be linearly separable
• Non-separable case: allow training errors
$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ i = 1, \ldots, l$$
• If $\xi_i > 1$, then $x_i$ is not on the correct side of the separating plane
• To avoid underfitting we may want a nonlinear separating boundary: could the data be linearly separable in some other space?
• Map to a higher dimensional (maybe infinite) feature space: $\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$
• Example: $x \in R^3$, $\phi(x) \in R^{10}$:
$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$
• A standard problem [Cortes and Vapnik, 1995]:
$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ i = 1, \ldots, l$$
Finding the Decision Function
• $w$: a vector in a high dimensional space ⇒ maybe infinitely many variables
• The dual problem
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$. At the optimum,
$$w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$$
• Primal and dual are related by optimization theory; the connection is not trivial.
• A finite problem:
#variables = #training data
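Since the dual is a finite problem, it can in principle be handed to any quadratic-programming solver. Below is a minimal sketch using CVXOPT; this is my own illustration, not LIBSVM's approach (LIBSVM uses a decomposition method that never forms the full $Q$).

    # Sketch: the SVM dual as a generic quadratic program (CVXOPT).
    # Illustrative only: it stores the full l-by-l matrix Q.
    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual(K, y, C):
        """min 1/2 a'Qa - e'a  s.t.  0 <= a_i <= C,  y'a = 0."""
        l = len(y)
        Q = np.outer(y, y) * K                      # Q_ij = y_i y_j K(x_i, x_j)
        P, q = matrix(Q), matrix(-np.ones(l))
        G = matrix(np.vstack([-np.eye(l), np.eye(l)]))   # -a_i <= 0, a_i <= C
        h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
        A, b = matrix(y.reshape(1, l).astype(float)), matrix(0.0)
        sol = solvers.qp(P, q, G, h, A, b)
        return np.ravel(sol['x'])                   # the alpha_i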
• $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form
• Efficient calculation of high dimensional inner products: the kernel trick, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
• Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$:
$$\phi(x_i) = (1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.
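A quick numeric check of this identity (my own illustration):

    # Verify phi(xi)' phi(xj) = (1 + xi' xj)^2 for points in R^3.
    import numpy as np

    def phi(x):
        x1, x2, x3 = x
        r2 = np.sqrt(2)
        return np.array([1, r2*x1, r2*x2, r2*x3, x1**2, x2**2, x3**2,
                         r2*x1*x2, r2*x1*x3, r2*x2*x3])

    xi = np.array([1.0, 2.0, 3.0])
    xj = np.array([0.5, -1.0, 2.0])
    print(phi(xi) @ phi(xj), (1 + xi @ xj) ** 2)   # both print 30.25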
• Popular kernels:
$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \ \text{(radial basis function)}, \qquad K(x_i, x_j) = (x_i^T x_j / a + b)^d \ \text{(polynomial)}$$
• Decision function:
$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b$$
No need to form $w$ explicitly
• $> 0$: 1st class; $< 0$: 2nd class
• Only the $\phi(x_i)$ with $\alpha_i > 0$ are used; $\alpha_i > 0 \Rightarrow x_i$ is a support vector
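Putting the pieces together, a minimal sketch (my own naming) of evaluating the decision function from the dual solution, summing only over the support vectors:

    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, using only alpha_i > 0.
    import numpy as np

    def rbf(u, v, gamma=1.0):                  # radial basis function kernel
        return np.exp(-gamma * np.dot(u - v, u - v))

    def decision(x, X, y, alpha, b, kernel=rbf):
        sv = np.flatnonzero(alpha > 1e-8)      # indices of support vectors
        return sum(alpha[i] * y[i] * kernel(X[i], x) for i in sv) + b
    # sign of the result: > 0 -> 1st class, < 0 -> 2nd class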
Support Vectors: More Important Data
[Figure: a toy two-class problem; the support vectors are the points closest to the decision boundary]
Example: Engine Misfire Detection
• First problem of IJCNN Challenge 2001, data from Ford
• Given a time series of length T = 50,000
• The $k$th data point: $x_1(k), x_2(k), x_3(k), x_4(k), x_5(k), y(k)$
• Example (columns $x_1$ to $x_5$ shown):
    0.000000 -0.999991 0.169769  0.000000 1.000000
    0.000000 -0.659538 0.169769  0.000292 1.000000
    0.000000 -0.660738 0.169128 -0.020372 1.000000
    1.000000 -0.660307 0.169128  0.007305 1.000000
    0.000000 -0.660159 0.169525  0.002519 1.000000
    0.000000 -0.659091 0.169525  0.018198 1.000000
    0.000000 -0.660532 0.169525 -0.024526 1.000000
    0.000000 -0.659798 0.169525  0.012458 1.000000
• $x_5(k)$: not related to the output
• $x_5(k) = 1$ means the $k$th point is considered when evaluating accuracy; points with $x_5(k) = 0$ are not used for testing but can still be used in training
• 50,000 training data, 100,000 testing data (in two sets)
• Past and future information may affect y(k)
• x1(k): periodically nine 0s, one 1, nine 0s, one 1, and so on.
Background: Engine Misfire Detection
• Known after the competition
• Engine misfire: a substantial fraction of a cylinder’s air-fuel mixture fails to ignite
• Frequent misfires: pollutants and costly replacement
• On-board detection:
Engine crankshaft rotational dynamics measured with a position sensor
Encoding Schemes
• For SVM: each data is a vector
• $x_1(k)$: periodically nine 0s, one 1, nine 0s, one 1, ...
– Option 1: 10 binary attributes, $x_1(k-5), \ldots, x_1(k+4)$, for the $k$th data point
– Option 2: $x_1(k)$ as a single integer from 1 to 10
– Which one is better? We think the 10 binary attributes suit SVM better
• $x_4(k)$ is more important, so we also include $x_4(k-5), \ldots, x_4(k+4)$ for the $k$th data point, as sketched below
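A sketch of this windowed encoding under my own naming; the exact preprocessing in the competition entry may differ:

    # Feature vector for the k-th point: the window x1(k-5..k+4)
    # (10 binary attributes) plus the window x4(k-5..k+4).
    import numpy as np

    def encode(x1, x4, k):
        return np.concatenate([x1[k - 5:k + 5],   # 10 binary attributes
                               x4[k - 5:k + 5]])  # the key signal x4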
Training SVM
• Selecting parameters; generating a good model for prediction
• RBF kernel: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = e^{-\gamma \|x_i - x_j\|^2}$
• Two parameters: γ and C
• Five-fold cross validation on the 50,000 training data: data randomly separated into five groups; each time, four groups are used for training and one for validation
[Figure: contour plot of five-fold cross-validation accuracy (97.0%-98.8%) over the grid lg(C) = 1, ..., 7, lg(gamma) = -2, ..., 3]
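A minimal sketch of this parameter search, assuming scikit-learn's GridSearchCV instead of the grid scripts shipped with LIBSVM; toy data stands in for the 50,000 real points:

    # Five-fold CV accuracy over a log-scale grid of (C, gamma).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)
    grid = {'C': [2.0**k for k in range(1, 8)],       # lg(C) = 1..7
            'gamma': [2.0**k for k in range(-2, 4)]}  # lg(gamma) = -2..3
    search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)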
• Test set 1: 656 errors, Test set 2: 637 errors
• About 3,000 support vectors out of 50,000 training data; a good case for SVM
• This is just the outline. There are other details.
A General Procedure
1. Conduct simple scaling on the data (see the sketch after this list)
2. Consider the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$
3. Use cross-validation to find the best parameters $C$ and $\gamma$
4. Use the best $C$ and $\gamma$ to train on the whole training set
5. Test
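A sketch of step 1; this NumPy version mimics what LIBSVM's svm-scale tool does, mapping every attribute linearly to a fixed range:

    # Scale each attribute to [-1, 1]; reuse the training xmin/xmax
    # when scaling the test data.
    import numpy as np

    def scale(X, lo=-1.0, hi=1.0):
        xmin, xmax = X.min(axis=0), X.max(axis=0)
        span = np.where(xmax > xmin, xmax - xmin, 1.0)  # guard constant columns
        return lo + (hi - lo) * (X - xmin) / span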
• The best $C$ and $\gamma$ are found by training on $k-1$ folds but then applied to the whole set. Is this a problem? In theory only a minor difference; no problem in practice
A Software: LIBSVM
• A library for SVM (in both C++ and Java)
http://www.csie.ntu.edu.tw/~cjlin/libsvm
– Classification and regression
– Scripts for procedures mentioned above
• Interfaces:
– Matlab: developed at Ohio State University
– R (and S-Plus): developed at Technische Universität Wien
– Python: developed at HP Labs.
– Perl: developed at Simon Fraser University
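For instance, a minimal usage sketch through the Python interface, following the pattern in LIBSVM's README (assuming its svmutil module; heart_scale is the example data file shipped with the package):

    # Cross-validate, train, and predict with LIBSVM's Python interface.
    from svmutil import *

    y, x = svm_read_problem('heart_scale')
    acc = svm_train(y[:200], x[:200], '-c 4 -g 0.5 -v 5')  # 5-fold CV accuracy
    m = svm_train(y[:200], x[:200], '-c 4 -g 0.5')         # final model
    p_label, p_acc, p_val = svm_predict(y[200:], x[200:], m)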
• Used in many integrated machine learning/data mining packages
Current Status of SVM
• In my opinion, after careful data pre-processing, appropriately using NN or SVM gives similar accuracy
• But, users may not use them properly
• The opportunity for SVM: it is easier for users to apply appropriately
• The ambition: replacing NN in part of its applications
Discussion and Conclusions
• SVM: a simple and effective classification method
• Applications are the key to improving SVM
• All my research results can be found at http://www.csie.ntu.edu.tw/~cjlin