
(1)

Support Vector Machines for Data Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

(2)

Outline

Support vector classification

Example: engine misfire detection

(3)

Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Examples

– Handwritten digits recognition

– Spam filtering

Training and testing

Methods:

– Nearest Neighbor

– Neural Networks

(4)

Support vector machines: a new method

Becoming more and more popular

We will discuss its current status

A good classification method:

– Avoid underfitting: small training error

(5)

Support Vector Classification

Training vectors: $x_i$, $i = 1, \ldots, l$

Consider a simple case with two classes. Define a label vector $y$ by
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ is in class 1} \\ -1 & \text{if } x_i \text{ is in class 2} \end{cases}$$

(6)

[Figure: parallel planes $w^T x + b = +1, 0, -1$ separating the two classes]

A separating hyperplane $w^T x + b = 0$ with
$$w^T x_i + b > 0 \ \text{if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{if } y_i = -1$$

Decision function: $f(x) = \mathrm{sign}(w^T x + b)$, where $x$ is a test point

Variables $w$ and $b$: we need to determine the coefficients of a plane

(7)

Select $w$, $b$ with the maximal margin

Maximal distance between $w^T x + b = \pm 1$ (Vapnik's statistical learning theory)
$$w^T x_i + b \ge 1 \ \text{if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{if } y_i = -1 \tag{1}$$

Distance between $w^T x + b = 1$ and $w^T x + b = -1$: $2/\|w\| = 2/\sqrt{w^T w}$, and $\max 2/\|w\| \equiv \min w^T w/2$, so
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \ \text{(from (1))}, \ i = 1, \ldots, l$$
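As an illustration only (not part of the original slides), this maximal-margin problem can be handed to a general-purpose constrained optimizer on a tiny toy data set; the solver choice (scipy's SLSQP) and the data are assumptions of this sketch.

```python
# A sketch (not from the slides): solve  min 1/2 w^T w
# subject to y_i (w^T x_i + b) >= 1  on a tiny 2-D toy set,
# using scipy's general-purpose SLSQP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],    # class +1
              [0.0, 0.0], [1.0, 0.0]])   # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):          # v = [w1, w2, b]
    w = v[:2]
    return 0.5 * w.dot(w)

constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (X[i].dot(v[:2]) + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=[1.0, 1.0, -2.5],   # a feasible starting point
               constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, ", b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("f(x) = sign(w^T x + b):", np.sign(X.dot(w) + b))
```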

(8)

Higher Dimensional Feature Spaces

Earlier we tried to find a linear separating hyperplane

Data may not be linearly separable

Non-separable case: allow training errors

$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, l$$

If $\xi_i > 1$, then $x_i$ is not on the correct side of the separating plane

(9)

To avoid underfitting: a nonlinear separating hyperplane

Is the data linearly separable in other spaces?

Map to a higher dimensional (maybe infinite dimensional) feature space $\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$. Example: $x \in R^3$, $\phi(x) \in R^{10}$:
$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$

(10)

A standard problem [Cortes and Vapnik, 1995]:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, l$$

(11)

Finding the Decision Function

w: a vector in a high dimensional space ⇒ possibly infinitely many variables

The dual problem

$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$. At an optimum, $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$

• Primal and dual : optimization theory. Not trivial.
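A minimal sketch of the dual on toy data, assuming a linear kernel ($\phi(x) = x$) and a generic solver rather than a dedicated SVM routine; it also recovers $w = \sum_i \alpha_i y_i x_i$ as stated above.

```python
# Sketch (toy data, linear kernel phi(x) = x): build Q_ij = y_i y_j x_i^T x_j,
# solve the dual with a generic solver, then recover w = sum_i alpha_i y_i x_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, l = 10.0, len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^T x_j
e = np.ones(l)

res = minimize(lambda a: 0.5 * a @ Q @ a - e @ a,
               x0=np.zeros(l),
               bounds=[(0.0, C)] * l,                                 # 0 <= alpha_i <= C
               constraints=[{'type': 'eq', 'fun': lambda a: y @ a}])  # y^T alpha = 0
alpha = res.x

w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                 # alpha_i > 0  => support vector
b = np.mean(y[sv] - X[sv] @ w)                    # b from the support vectors
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
```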

(12)

A finite problem:

#variables = #training data

$Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form

Efficient calculation of high dimensional inner products

Kernel trick: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$:
$$\phi(x_i) = (1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.
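A quick numerical check of this identity (a sketch added here, not from the slides):

```python
# Check numerically that phi(x)^T phi(z) == (1 + x^T z)^2 for the map above.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(phi(x).dot(phi(z)))     # inner product in the 10-dimensional feature space
print((1.0 + x.dot(z))**2)    # the same value, computed directly in R^3
```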

(13)

Popular kernels $K(x_i, x_j)$:

$e^{-\gamma \|x_i - x_j\|^2}$ (radial basis function)
$(x_i^T x_j / a + b)^d$ (polynomial kernel)

Decision function:
$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b$$
No need to have $w$ explicitly
$> 0$: 1st class, $< 0$: 2nd class
Only the $\phi(x_i)$ with $\alpha_i > 0$ are used; $\alpha_i > 0 \Rightarrow$ support vectors
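To make the decision function concrete, here is a hedged sketch with scikit-learn (the library choice and the toy data are assumptions of the example, not the talk's setup): after fitting an RBF-kernel classifier, $\sum_i \alpha_i y_i K(x_i, x) + b$ is recomputed by hand from the stored support vectors.

```python
# Sketch: recompute sum_i alpha_i y_i K(x_i, x) + b by hand from a fitted
# scikit-learn SVC with an RBF kernel (dual_coef_ stores alpha_i * y_i).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # a simple two-class toy problem

gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

x_test = np.array([[0.3, -0.1]])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
manual = clf.dual_coef_[0] @ K + clf.intercept_[0]
print(manual, clf.decision_function(x_test)[0])   # the two values should match
```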

(14)

Support Vectors: More Important Data

[Figure: two-dimensional toy example illustrating the support vectors]

(15)

Example: Engine Misfire Detection

First problem of IJCNN Challenge 2001, data from Ford

Given a time series of length T = 50,000

The kth data point: x1(k), x2(k), x3(k), x4(k), x5(k), y(k)

Example (eight consecutive records; the columns shown are x1(k), ..., x5(k)):

    0.000000 -0.999991 0.169769  0.000000 1.000000
    0.000000 -0.659538 0.169769  0.000292 1.000000
    0.000000 -0.660738 0.169128 -0.020372 1.000000
    1.000000 -0.660307 0.169128  0.007305 1.000000
    0.000000 -0.660159 0.169525  0.002519 1.000000
    0.000000 -0.659091 0.169525  0.018198 1.000000
    0.000000 -0.660532 0.169525 -0.024526 1.000000
    0.000000 -0.659798 0.169525  0.012458 1.000000

(16)

x5(k): not related to the output

x5(k) = 1: the kth data point is considered when evaluating accuracy

Otherwise the point is not used for testing, but can still be used in training

50,000 training data, 100,000 testing data (in two sets)

Past and future information may affect y(k)

x1(k): periodically nine 0s, one 1, nine 0s, one 1, and so on.

(17)

Background: Engine Misfire Detection

Known after the competition

Engine misfire: a substantial fraction of a cylinder’s air-fuel mixture fails to ignite

Frequent misfires: pollutants and costly replacement

On-board detection:

Engine crankshaft rotational dynamics measured with a position sensor

(18)

Encoding Schemes

For SVM, each data point must be a vector

x1(k): periodically nine 0s, one 1, nine 0s, one 1, ...

– Encoding as 10 binary attributes: x1(k − 5), . . . , x1(k + 4) for the kth data point

– Encoding as a single integer in 1 to 10

– Which one is better? We think the 10 binary attributes are better for SVM

x4(k) more important

Including x4(k − 5), . . . , x4(k + 4) for the kth data
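A sketch of this encoding with hypothetical array names (the actual competition code is not shown in the slides): each record becomes a 20-dimensional vector built from the two sliding windows.

```python
# Sketch with hypothetical names: encode the k-th record as the 10 surrounding
# values of x1 (binary attributes) plus the 10 surrounding values of x4.
import numpy as np

def encode(x1, x4, k):
    """x1, x4: 1-D arrays over the whole series; valid for 5 <= k <= len(x1) - 5."""
    window = slice(k - 5, k + 5)          # indices k-5, ..., k+4
    return np.concatenate([x1[window], x4[window]])

# Toy series: x1 is periodically nine 0s then one 1; x4 is a small noisy signal.
T = 100
x1 = np.array([1.0 if t % 10 == 9 else 0.0 for t in range(T)])
x4 = np.random.default_rng(0).normal(scale=0.02, size=T)
print(encode(x1, x4, k=50))               # a 20-dimensional feature vector
```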

(19)

Training SVM

Selecting parameters; generating a good model for prediction

RBF kernel: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = e^{-\gamma \|x_i - x_j\|^2}$

Two parameters: γ and C

Five-fold cross-validation on the 50,000 training data: the data are randomly separated into five groups

Each time, four groups are used for training and one for testing
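The parameter search can be sketched as follows, using scikit-learn for brevity (an assumption of this example; the talk itself uses LIBSVM): try a grid of (C, γ) values and keep the pair with the best five-fold cross-validation accuracy.

```python
# Sketch: five-fold cross validation over a grid of (C, gamma) values,
# on synthetic data standing in for the real 50,000 training points.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best = (None, None, 0.0)
for logC in range(1, 8):                  # lg(C) = 1, ..., 7
    for logG in range(-2, 4):             # lg(gamma) = -2, ..., 3
        C, gamma = 2.0 ** logC, 2.0 ** logG
        acc = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma),
                              X, y, cv=5).mean()
        if acc > best[2]:
            best = (C, gamma, acc)
print("best C = %g, gamma = %g, CV accuracy = %.3f" % best)
```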

(20)

[Contour plot: five-fold cross-validation accuracy (about 97% to 98.8%) over the grid lg(C) = 1, ..., 7 and lg(gamma) = -2, ..., 3]

(21)

Test set 1: 656 errors, Test set 2: 637 errors

About 3,000 support vectors out of 50,000 training data

A good case for SVM

This is just the outline. There are other details.

(22)

A General Procedure

1. Conduct simple scaling on the data

2. Consider the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$

3. Use cross-validation to find the best parameter C and γ

4. Use the best C and γ to train the whole training set

5. Test

The best C and γ are selected by training on k − 1 folds but then used on the whole training set: is that a problem?

In theory a minor difference; no problem in practice
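A compact sketch of steps 1-5 with scikit-learn on synthetic data (the library and the data are assumptions of this illustration, not the original workflow):

```python
# Sketch of steps 1-5: scale the data, use the RBF kernel, pick (C, gamma)
# by five-fold cross validation, retrain on the whole training set, then test.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),  # step 1: simple scaling
                     SVC(kernel='rbf'))                    # step 2: RBF kernel
grid = {'svc__C': [2.0 ** k for k in range(-1, 8)],
        'svc__gamma': [2.0 ** k for k in range(-4, 3)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)    # steps 3-4: CV, retrain
print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_te, y_te))          # step 5: test
```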

(23)

A Software: LIBSVM

A library for SVM (in both C++ and Java)

http://www.csie.ntu.edu.tw/~cjlin/libsvm – Classification and regression

– Scripts for procedures mentioned above

Interfaces:

– Matlab: developed at Ohio State University

– R (and S-Plus): developed at Technische Universität Wien

– Python: developed at HP Labs.

– Perl: developed at Simon Fraser University
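For reference, a minimal sketch of driving LIBSVM from Python through the svmutil module; the module path and the file names are assumptions of this example and may differ from the interface listed above.

```python
# Sketch: LIBSVM's Python interface (the module path assumes the `libsvm`
# package on PyPI; 'train.txt' and 'test.txt' are hypothetical files in
# LIBSVM's sparse text format).
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('train.txt')
y_test, x_test = svm_read_problem('test.txt')

# '-t 2': RBF kernel; '-c', '-g': the C and gamma chosen by cross validation
# (passing '-v 5' instead would report five-fold CV accuracy).
model = svm_train(y_train, x_train, '-t 2 -c 4 -g 0.5')
labels, accuracy, values = svm_predict(y_test, x_test, model)
```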

(24)

Used in many integrated machine learning/data mining packages

(25)

Current Status of SVM

In my opinion, after careful data pre-processing, appropriately using NN or SVM gives similar accuracy

But, users may not use them properly

The chance for SVM: it is easier for users to apply it appropriately

The ambition: replacing part of NN

(26)

Discussion and Conclusions

SVM: a simple and effective classification method

Applications: the key to improving SVM

All my research results can be found at
