
(1)

Support Vector Machines for Data Classification

Chih-Jen Lin

Department of Computer Science National Taiwan University

(2)

Outline

Support vector classification

Example: engine misfire detection

(3)

Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Examples

– Handwritten digits recognition

– Spam filtering

Training and testing

Methods:

– Nearest Neighbor

– Neural Networks

(4)

Support vector machines: a new method

Becoming more and more popular

We will discuss its current status

A good classification method:

– Avoid underfitting: small training error

(5)

Support Vector Classification

Training vectors: $x_i$, $i = 1, \ldots, l$

Consider a simple case with two classes. Define a label vector $y$ by
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ is in class 1} \\ -1 & \text{if } x_i \text{ is in class 2} \end{cases}$$

(6)

[Figure: parallel planes $w^T x + b = +1, 0, -1$ separating the two classes]

A separating hyperplane $w^T x + b = 0$ with
$$w^T x_i + b > 0 \ \text{if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{if } y_i = -1$$

Decision function: $f(x) = \mathrm{sign}(w^T x + b)$, where $x$ is a test point

Variables $w$ and $b$: we need to determine the coefficients of a plane

(7)

Select $w$, $b$ with the maximal margin

Maximal distance between $w^T x + b = \pm 1$ (Vapnik's statistical learning theory)
$$w^T x_i + b \ge 1 \ \text{if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{if } y_i = -1 \tag{1}$$

Distance between $w^T x + b = 1$ and $w^T x + b = -1$: $2/\|w\| = 2/\sqrt{w^T w}$, and $\max 2/\|w\| \equiv \min w^T w/2$, so
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 \ \text{(from (1))}, \ i = 1, \ldots, l$$
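As an illustration only (not part of the original slides), this maximal-margin problem can be handed to a general-purpose constrained optimizer on a tiny toy data set; the solver choice (scipy's SLSQP) and the data are assumptions of this sketch.

```python
# A sketch (not from the slides): solve  min 1/2 w^T w
# subject to y_i (w^T x_i + b) >= 1  on a tiny 2-D toy set,
# using scipy's general-purpose SLSQP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0],    # class +1
              [0.0, 0.0], [1.0, 0.0]])   # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):          # v = [w1, w2, b]
    w = v[:2]
    return 0.5 * w.dot(w)

constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (X[i].dot(v[:2]) + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=[1.0, 1.0, -2.5],   # a feasible starting point
               constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, ", b =", b)
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("f(x) = sign(w^T x + b):", np.sign(X.dot(w) + b))
```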

(8)

Higher Dimensional Feature Spaces

Earlier we tried to find a linear separating hyperplane

Data may not be linearly separable

Non-separable case: allow training errors

$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, l$$

If $\xi_i > 1$, then $x_i$ is not on the correct side of the separating plane

(9)

To avoid underfitting: a nonlinear separating hyperplane

Is the data linearly separable in other spaces?

Map to a higher dimensional (maybe infinite dimensional) feature space $\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$. Example: $x \in R^3$, $\phi(x) \in R^{10}$:
$$\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3)$$

(10)

A standard problem [Cortes and Vapnik, 1995]:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, l$$

(11)

Finding the Decision Function

w: a vector in a high dimensional space ⇒ possibly infinitely many variables

The dual problem

$$\min_{\alpha} \ \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \ldots, l, \quad y^T \alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \ldots, 1]^T$. At an optimum, $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$

• Primal and dual : optimization theory. Not trivial.
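A minimal sketch of the dual on toy data, assuming a linear kernel ($\phi(x) = x$) and a generic solver rather than a dedicated SVM routine; it also recovers $w = \sum_i \alpha_i y_i x_i$ as stated above.

```python
# Sketch (toy data, linear kernel phi(x) = x): build Q_ij = y_i y_j x_i^T x_j,
# solve the dual with a generic solver, then recover w = sum_i alpha_i y_i x_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, l = 10.0, len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^T x_j
e = np.ones(l)

res = minimize(lambda a: 0.5 * a @ Q @ a - e @ a,
               x0=np.zeros(l),
               bounds=[(0.0, C)] * l,                                 # 0 <= alpha_i <= C
               constraints=[{'type': 'eq', 'fun': lambda a: y @ a}])  # y^T alpha = 0
alpha = res.x

w = (alpha * y) @ X                               # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                                 # alpha_i > 0  => support vector
b = np.mean(y[sv] - X[sv] @ w)                    # b from the support vectors
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
```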

(12)

A finite problem:

#variables = #training data

$Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ needs a closed form

Efficient calculation of high dimensional inner products

Kernel trick: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$:
$$\phi(x_i) = (1, \sqrt{2}(x_i)_1, \sqrt{2}(x_i)_2, \sqrt{2}(x_i)_3, (x_i)_1^2, (x_i)_2^2, (x_i)_3^2, \sqrt{2}(x_i)_1(x_i)_2, \sqrt{2}(x_i)_1(x_i)_3, \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T \phi(x_j) = (1 + x_i^T x_j)^2$.
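A quick numerical check of this identity (a sketch added here, not from the slides):

```python
# Check numerically that phi(x)^T phi(z) == (1 + x^T z)^2 for the map above.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x3,
                     x1**2, x2**2, x3**2,
                     s*x1*x2, s*x1*x3, s*x2*x3])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)
print(phi(x).dot(phi(z)))     # inner product in the 10-dimensional feature space
print((1.0 + x.dot(z))**2)    # the same value, computed directly in R^3
```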

(13)

Popular kernels $K(x_i, x_j)$:

$e^{-\gamma \|x_i - x_j\|^2}$ (radial basis function)
$(x_i^T x_j / a + b)^d$ (polynomial kernel)

Decision function:
$$w^T \phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T \phi(x) + b$$
No need to have $w$ explicitly
$> 0$: 1st class, $< 0$: 2nd class
Only the $\phi(x_i)$ with $\alpha_i > 0$ are used; $\alpha_i > 0 \Rightarrow$ support vectors
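To make the decision function concrete, here is a hedged sketch with scikit-learn (the library choice and the toy data are assumptions of the example, not the talk's setup): after fitting an RBF-kernel classifier, $\sum_i \alpha_i y_i K(x_i, x) + b$ is recomputed by hand from the stored support vectors.

```python
# Sketch: recompute sum_i alpha_i y_i K(x_i, x) + b by hand from a fitted
# scikit-learn SVC with an RBF kernel (dual_coef_ stores alpha_i * y_i).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # a simple two-class toy problem

gamma = 0.5
clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)

x_test = np.array([[0.3, -0.1]])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
manual = clf.dual_coef_[0] @ K + clf.intercept_[0]
print(manual, clf.decision_function(x_test)[0])   # the two values should match
```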

(14)

Support Vectors: More Important Data

[Figure: two-dimensional toy example illustrating the support vectors]

(15)

Example: Engine Misfire Detection

First problem of IJCNN Challenge 2001, data from Ford

Given a time series of length T = 50,000

The kth data point: x1(k), x2(k), x3(k), x4(k), x5(k), y(k)

Example (eight consecutive records; the columns shown are x1(k), ..., x5(k)):

    0.000000 -0.999991 0.169769  0.000000 1.000000
    0.000000 -0.659538 0.169769  0.000292 1.000000
    0.000000 -0.660738 0.169128 -0.020372 1.000000
    1.000000 -0.660307 0.169128  0.007305 1.000000
    0.000000 -0.660159 0.169525  0.002519 1.000000
    0.000000 -0.659091 0.169525  0.018198 1.000000
    0.000000 -0.660532 0.169525 -0.024526 1.000000
    0.000000 -0.659798 0.169525  0.012458 1.000000

(16)

x5(k): not related to the output

x5(k) = 1: the kth data point is considered when evaluating accuracy

Otherwise the point is not used for testing, but can still be used in training

50,000 training data, 100,000 testing data (in two sets)

Past and future information may affect y(k)

x1(k): periodically nine 0s, one 1, nine 0s, one 1, and so on.

(17)

Background: Engine Misfire Detection

Known after the competition

Engine misfire: a substantial fraction of a cylinder’s air-fuel mixture fails to ignite

Frequent misfires: pollutants and costly replacement

On-board detection:

Engine crankshaft rotational dynamics measured with a position sensor

(18)

Encoding Schemes

For SVM, each data point must be a vector

x1(k): periodically nine 0s, one 1, nine 0s, one 1, ...

– Encoding as 10 binary attributes: x1(k − 5), . . . , x1(k + 4) for the kth data point

– Encoding as a single integer in 1 to 10

– Which one is better? We think the 10 binary attributes are better for SVM

x4(k) more important

Including x4(k − 5), . . . , x4(k + 4) for the kth data
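A sketch of this encoding with hypothetical array names (the actual competition code is not shown in the slides): each record becomes a 20-dimensional vector built from the two sliding windows.

```python
# Sketch with hypothetical names: encode the k-th record as the 10 surrounding
# values of x1 (binary attributes) plus the 10 surrounding values of x4.
import numpy as np

def encode(x1, x4, k):
    """x1, x4: 1-D arrays over the whole series; valid for 5 <= k <= len(x1) - 5."""
    window = slice(k - 5, k + 5)          # indices k-5, ..., k+4
    return np.concatenate([x1[window], x4[window]])

# Toy series: x1 is periodically nine 0s then one 1; x4 is a small noisy signal.
T = 100
x1 = np.array([1.0 if t % 10 == 9 else 0.0 for t in range(T)])
x4 = np.random.default_rng(0).normal(scale=0.02, size=T)
print(encode(x1, x4, k=50))               # a 20-dimensional feature vector
```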

(19)

Training SVM

Selecting parameters; generating a good model for prediction

RBF kernel: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = e^{-\gamma \|x_i - x_j\|^2}$

Two parameters: γ and C

Five-fold cross-validation on the 50,000 training data: the data are randomly separated into five groups

Each time, four groups are used for training and one for testing
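The parameter search can be sketched as follows, using scikit-learn for brevity (an assumption of this example; the talk itself uses LIBSVM): try a grid of (C, γ) values and keep the pair with the best five-fold cross-validation accuracy.

```python
# Sketch: five-fold cross validation over a grid of (C, gamma) values,
# on synthetic data standing in for the real 50,000 training points.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best = (None, None, 0.0)
for logC in range(1, 8):                  # lg(C) = 1, ..., 7
    for logG in range(-2, 4):             # lg(gamma) = -2, ..., 3
        C, gamma = 2.0 ** logC, 2.0 ** logG
        acc = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma),
                              X, y, cv=5).mean()
        if acc > best[2]:
            best = (C, gamma, acc)
print("best C = %g, gamma = %g, CV accuracy = %.3f" % best)
```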

(20)

[Contour plot: five-fold cross-validation accuracy (about 97% to 98.8%) over the grid lg(C) = 1, ..., 7 and lg(gamma) = -2, ..., 3]

(21)

Test set 1: 656 errors, Test set 2: 637 errors

About 3,000 support vectors out of 50,000 training data

A good case for SVM

This is just the outline. There are other details.

(22)

A General Procedure

1. Conduct simple scaling on the data

2. Consider the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$

3. Use cross-validation to find the best parameter C and γ

4. Use the best C and γ to train the whole training set

5. Test

The best C and γ are selected by training on k − 1 folds but then used on the whole training set: is that a problem?

In theory a minor difference; no problem in practice
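A compact sketch of steps 1-5 with scikit-learn on synthetic data (the library and the data are assumptions of this illustration, not the original workflow):

```python
# Sketch of steps 1-5: scale the data, use the RBF kernel, pick (C, gamma)
# by five-fold cross validation, retrain on the whole training set, then test.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),  # step 1: simple scaling
                     SVC(kernel='rbf'))                    # step 2: RBF kernel
grid = {'svc__C': [2.0 ** k for k in range(-1, 8)],
        'svc__gamma': [2.0 ** k for k in range(-4, 3)]}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)    # steps 3-4: CV, retrain
print("best parameters:", search.best_params_)
print("test accuracy:", search.score(X_te, y_te))          # step 5: test
```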

(23)

A Software: LIBSVM

A library for SVM (in both C++ and Java)

http://www.csie.ntu.edu.tw/~cjlin/libsvm – Classification and regression

– Scripts for procedures mentioned above

Interfaces:

– Matlab: developed at Ohio State University

– R (and S-Plus): developed at Technische Universität Wien

– Python: developed at HP Labs.

– Perl: developed at Simon Fraser University
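For reference, a minimal sketch of driving LIBSVM from Python through the svmutil module; the module path and the file names are assumptions of this example and may differ from the interface listed above.

```python
# Sketch: LIBSVM's Python interface (the module path assumes the `libsvm`
# package on PyPI; 'train.txt' and 'test.txt' are hypothetical files in
# LIBSVM's sparse text format).
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

y_train, x_train = svm_read_problem('train.txt')
y_test, x_test = svm_read_problem('test.txt')

# '-t 2': RBF kernel; '-c', '-g': the C and gamma chosen by cross validation
# (passing '-v 5' instead would report five-fold CV accuracy).
model = svm_train(y_train, x_train, '-t 2 -c 4 -g 0.5')
labels, accuracy, values = svm_predict(y_test, x_test, model)
```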

(24)

Used in many integrated machine learning/data mining packages

(25)

Current Status of SVM

In my opinion, after careful data pre-processing, appropriately using NN or SVM gives similar accuracy

But, users may not use them properly

The chance for SVM: it is easier for users to apply it appropriately

The ambition: replacing part of NN

(26)

Discussion and Conclusions

SVM: a simple and effective classification method

Applications: the key to improving SVM

All my research results can be found at
