Support Vector Machines for Data
Classification and Regression
Chih-Jen Lin
Department of Computer Science National Taiwan University
Outline
Support vector classification
Two practical examples
Support vector regression
Discussion and conclusions
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Examples
Handwritten digit recognition
Spam filtering
Text classification
Prediction of signal peptide in human secretory proteins
Methods:
Nearest neighbor
Neural networks
Decision trees
Support vector machines: a new method, becoming more and more popular
Why Support Vector Machines
Existing methods:
Nearest neighbor, neural networks, decision trees; SVM: a new one
In my opinion, after careful data pre-processing,
appropriately using NN or SVM ⇒ similar accuracy
But users may not use them properly
The chance of SVM
Easier for users to appropriately use it
Support Vector Classification
Training vectors: $x_i$, $i = 1, \ldots, l$
Consider a simple case with two classes: define a vector $y$ with
$$y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1} \\ -1 & \text{if } x_i \text{ in class 2} \end{cases}$$
A separating hyperplane: $w^T x + b = 0$
$$w^T x_i + b > 0 \ \text{ if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{ if } y_i = -1$$
Decision function $f(x) = \mathrm{sign}(w^T x + b)$, $x$: test data
Variables $w$ and $b$: we need to determine the coefficients of a plane
Many possible choices of w and b
Select w, b with the maximal margin.
Maximal distance between $w^T x + b = \pm 1$:
$$w^T x_i + b \ge 1 \ \text{ if } y_i = 1, \qquad w^T x_i + b \le -1 \ \text{ if } y_i = -1$$
Distance between $w^T x + b = 1$ and $-1$: $2/\|w\| = 2/\sqrt{w^T w}$
$$\max \ 2/\|w\| \ \equiv \ \min \ w^T w/2$$
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1,\ i = 1, \ldots, l.$$
Higher Dimensional Feature Spaces
Earlier we tried to find a linear separating hyperplane
Data may not be linearly separable
Non-separable case: allow training errors
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l$$
Nonlinear case: linearly separable in other spaces?
Higher dimensional ( maybe infinite ) feature space
Example: $x \in R^3$, $\phi(x) \in R^{10}$
$$\phi(x) = (1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\ \sqrt{2}x_2x_3)$$
A standard problem [Cortes and Vapnik, 1995]:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l$$
Finding the Decision Function
$w$: a vector in a high dimensional space ⇒ maybe infinite variables
The dual problem:
$$\min_{\alpha} \ \tfrac{1}{2}\alpha^T Q \alpha - e^T\alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,\ \ y^T\alpha = 0,$$
where $Q_{ij} = y_i y_j \phi(x_i)^T\phi(x_j)$ and $e = [1, \ldots, 1]^T$
Primal and dual : optimization theory. Not trivial.
Infinite dimensional programming. A finite problem:
#variables = #training data
$Q_{ij} = y_i y_j \phi(x_i)^T\phi(x_j)$ needs a closed form
Efficient calculation of high dimensional inner products
Kernel trick: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$
Example: $x_i \in R^3$, $\phi(x_i) \in R^{10}$
$$\phi(x_i) = (1,\ \sqrt{2}(x_i)_1,\ \sqrt{2}(x_i)_2,\ \sqrt{2}(x_i)_3,\ (x_i)_1^2,\ (x_i)_2^2,\ (x_i)_3^2,\ \sqrt{2}(x_i)_1(x_i)_2,\ \sqrt{2}(x_i)_1(x_i)_3,\ \sqrt{2}(x_i)_2(x_i)_3)$$
Then $\phi(x_i)^T\phi(x_j) = (1 + x_i^T x_j)^2$.
Popular methods: $K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$ (Radial Basis Function)
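A quick numerical check of this identity (a minimal sketch, not from the slides; the vectors below are made up for illustration):

import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^3
    x1, x2, x3 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x3,
                     x1**2, x2**2, x3**2,
                     np.sqrt(2)*x1*x2, np.sqrt(2)*x1*x3, np.sqrt(2)*x2*x3])

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.5, -1.0, 2.0])
print(phi(xi) @ phi(xj))     # inner product in R^10
print((1 + xi @ xj)**2)      # kernel value: same number, no explicit phi needed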
Kernel Tricks
Kernel: $K(x, y) = \phi(x)^T\phi(y)$
No need to explicitly know $\phi(x)$
Common kernels:
$K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}$ (Radial Basis Function)
$K(x_i, x_j) = (x_i^T x_j/a + b)^d$ (Polynomial kernel)
They can be inner products in an infinite dimensional space
Assume $x \in R^1$ and $\gamma > 0$:
$$e^{-\gamma\|x_i - x_j\|^2} = e^{-\gamma(x_i - x_j)^2} = e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2}$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots\right)$$
$$= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1\cdot 1 + \sqrt{\tfrac{2\gamma}{1!}}\,x_i\cdot\sqrt{\tfrac{2\gamma}{1!}}\,x_j + \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x_i^2\cdot\sqrt{\tfrac{(2\gamma)^2}{2!}}\,x_j^2 + \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x_i^3\cdot\sqrt{\tfrac{(2\gamma)^3}{3!}}\,x_j^3 + \cdots\right) = \phi(x_i)^T\phi(x_j),$$
where $\phi(x) = e^{-\gamma x^2}\left[1,\ \sqrt{\tfrac{2\gamma}{1!}}\,x,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}\,x^2,\ \sqrt{\tfrac{(2\gamma)^3}{3!}}\,x^3,\ \ldots\right]^T$
Decision function
w: maybe an infinite vector
At optimum: $w = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)$
Decision function:
$$w^T\phi(x) + b = \sum_{i=1}^{l} \alpha_i y_i \phi(x_i)^T\phi(x) + b$$
$> 0$: 1st class, $< 0$: 2nd class
Only $\phi(x_i)$ with $\alpha_i > 0$ are used
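A small sketch of how this decision value could be computed once the $\alpha_i$, $y_i$, $b$, and support vectors are known (illustrative only; the variable names are made up, and LIBSVM does this internally):

import numpy as np

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, support_vectors, alpha, y, b, gamma):
    # w^T phi(x) + b = sum_i alpha_i * y_i * K(x_i, x) + b
    return sum(a * yi * rbf_kernel(sv, x, gamma)
               for sv, a, yi in zip(support_vectors, alpha, y)) + b

# predicted label: +1 if decision_value(...) > 0, else -1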
Support Vectors: More Important Data
Let Us Try An Example
A problem from astroparticle physics
1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
1.0 1:9.133997e+01 2:2.935699e+02 3:1.423918e-01 4:1.605402e+02
1.0 1:5.537500e+01 2:1.792220e+02 3:1.654953e-01 4:1.112273e+02
1.0 1:2.956200e+01 2:1.913570e+02 3:9.901439e-02 4:1.034076e+02
Training and testing sets available: 3,089 and 4,000 instances
Data format is an issue
SVM software:
LIBSVM
http://www.csie.ntu.edu.tw/~cjlin/libsvm
Now one of the most widely used SVM software packages
Installation
On Unix:
Download the zip file and make
On Windows:
Download the zip file and use nmake:
nmake -f Makefile.win
Usage of LIBSVM
Training
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
Training and Testing
Training
$./svm-train train.1 ...*
optimization finished, #iter = 6131
nu = 0.606144
obj = -1061.528899, rho = -0.495258
nSV = 3053, nBSV = 724
Total nSV = 3053
Testing
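Testing is done with svm-predict; a typical invocation (file names follow the example above and are assumed here):
$./svm-predict test.1 train.1.model test.1.predict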
What does this Output Mean
obj: the optimal objective value of the dual SVM
rho: $-b$ in the decision function
nSV and nBSV: number of support vectors and bounded support vectors (i.e., $\alpha_i = C$)
nu-svm is a somewhat equivalent form of C-SVM where C is replaced by ν.
Why this Fails
After training, nearly 100% of the data are support vectors
Training and testing accuracy are very different
$./svm-predict train.1 train.1.model o
Accuracy = 99.7734% (3082/3089)
Most kernel elements:
$$K_{ij} \begin{cases} = 1 & \text{if } i = j, \\ \to 0 & \text{if } i \ne j. \end{cases}$$
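Why this happens, in a small sketch (not from the slides; the numbers are made up but mimic the unscaled ranges above): with attributes spanning hundreds of units, $\|x_i - x_j\|^2$ is huge for $i \ne j$, so with a default-like $\gamma$ the RBF kernel matrix is essentially the identity.

import numpy as np

X = np.array([[26.2, 58.9, -0.19, 125.1],
              [57.1, 221.4, 0.09, 122.9],
              [17.3, 173.4, -0.13, 125.0]])   # unscaled attributes
gamma = 1.0 / X.shape[1]                      # a default-like gamma value

K = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.round(K, 6))   # approximately the identity: off-diagonals are ~0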
Data Scaling
Without scaling
Attributes in greater numeric ranges may dominate
Example:
      height  sex
x1    150     F
x2    180     M
x3    185     M
and y1 = 0, y2 = 1, y3 = 1.
The separating hyperplane:
(figure: points x1, x2, x3 and the separating hyperplane)
Decision strongly depends on the first attribute
What if the second is more important?
Linearly scale the first attribute to [0, 1] by:
$$\frac{\text{1st attribute} - 150}{185 - 150}$$
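As a sketch (not from the slides), feature-wise min-max scaling of a data matrix could look like this; svm-scale does the same job for LIBSVM-format files:

import numpy as np

def scale_features(X, lower=0.0, upper=1.0):
    # scale each column (feature) linearly to [lower, upper]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lower + (X - mn) * (upper - lower) / (mx - mn)

X = np.array([[150.0, 0.0],
              [180.0, 1.0],
              [185.0, 1.0]])     # height, sex (F=0, M=1)
print(scale_features(X))         # first column becomes 0, 6/7, 1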
New points and separating hyperplane
Transformed to the original space,
After Data Scaling
A common mistake
$./svm-scale -l -1 -u 1 train.1 > train.1.scale
$./svm-scale -l -1 -u 1 test.1 > test.1.scale
Same factor on training and testing
$./svm-scale -s range1 train.1 > train.1.scale
$./svm-scale -r range1 test.1 > test.1.scale
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
We store the scaling factors used for training and apply them to the testing set
More on Training
Train on the scaled data and then predict
$./svm-train train.1.scale
$./svm-predict test.1.scale train.1.scale.model test.1.predict
→ Accuracy = 96.15%
Training accuracy now is
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 96.439% (2979/3089) (classification)
Default parameters were used (no -c or -g given)
Different Parameters
If we use C = 20, γ = 400
$./svm-train -c 20 -g 400 train.1.scale
$./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089) (classification)
100% training accuracy but
$./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000) (classification)
Very bad test accuracy
Overfitting happens
Overfitting and Underfitting
When training a model and making predictions, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
Overfitting
In theory
You can easily achieve 100% training accuracy
This is useless
Surprisingly
Parameter Selection
Is very important
Now parameters are
C and kernel parameters
Example:
$\gamma$ of $e^{-\gamma\|x_i - x_j\|^2}$
$a, b, d$ of $(x_i^T x_j/a + b)^d$
Performance Evaluation
Training errors not important; only test errors count
$l$ training data, $x_i \in R^n$, $y_i \in \{+1, -1\}$, $i = 1, \ldots, l$, a learning machine:
$$x \to f(x, \alpha), \quad f(x, \alpha) = 1 \text{ or } -1.$$
Different $\alpha$: different machines
The expected test error (generalization error):
$$R(\alpha) = \int \tfrac{1}{2}\,|y - f(x, \alpha)| \, dP(x, y)$$
$P(x, y)$ unknown; empirical risk (training error):
$$R_{\mathrm{emp}}(\alpha) = \frac{1}{2l}\sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$$
$\tfrac{1}{2}|y_i - f(x_i, \alpha)|$: the loss. Choose $0 \le \eta \le 1$; with probability at least $1 - \eta$:
$$R(\alpha) \le R_{\mathrm{emp}}(\alpha) + \text{another term}$$
Performance Evaluation (Cont.)
In practice
Available data ⇒ split into training and validation
Train on the training set
Test on the validation set
k-fold cross validation:
Data randomly separated into k groups
Each time, train on k − 1 groups and validate on the remaining one
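A minimal sketch of k-fold cross validation for one (C, γ) pair (illustrative; it assumes some train(...) and accuracy(...) helpers rather than a particular library):

import numpy as np

def cross_validation_accuracy(X, y, k, train, accuracy):
    # train(X, y) -> model; accuracy(model, X, y) -> float
    idx = np.random.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[tr], y[tr])
        scores.append(accuracy(model, X[val], y[val]))
    return np.mean(scores)   # average validation accuracy over the k folds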
CV and Test Accuracy
If we select parameters so that CV accuracy is highest, does CV accuracy represent future test accuracy?
Slightly different
If we have enough parameters, we can achieve 100% CV as well
e.g. more parameters than # of training data
But test accuracy may be different
Using CV on training + validation
A Simple Procedure
1. Conduct simple scaling on the data
2. Consider the RBF kernel $K(x, y) = e^{-\gamma\|x - y\|^2}$
3. Use cross-validation to find the best parameter C and γ
4. Use the best C and γ to train the whole training set
5. Test
Best C and γ from training on k − 1 folds vs. on the whole set? In theory, a minor difference
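For a single (C, γ) pair, LIBSVM reports cross-validation accuracy directly with the -v option; for example (the parameter values here are only illustrative):
$./svm-train -v 5 -c 2 -g 2 train.1.scale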
Parameter Selection Procedure in
LIBSVM
Grid search + CV
$./grid.py train.1 train.1.scale
[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Easy parallelization on a cluster
$./grid.py train.1 train.1.scale
[linux1] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)
[linux7] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)
...
Parallel Parameter Selection
Specify machine names in grid.py
telnet_workers = []
ssh_workers = ['linux1','linux1','linux2','linux3']
nr_local_worker = 1
linux1: more powerful or has two CPUs
A simple centralized control
Load balancing is not a problem
We can use other tools
Contour of Parameter Selection
(contour plot of cross-validation accuracy over log2(C) and log2(gamma))
Simple Script in LIBSVM
easy.py: a script for dummies
$python easy.py train.1 test.1
Scaling training data...
Cross validation...
Best c=2.0, g=2.0
Training...
Scaling testing data...
Testing...
Example: Engine Misfire Detection
Problem Description
First problem of IJCNN Challenge 2001, data from Ford
Given a time series of length T = 50,000
The kth data
x1(k), x2(k), x3(k), x4(k), x5(k), y(k)
y(k) = ±1: output, affected only by x1(k), . . . , x4(k)
x5(k) = 1: the kth data point is considered for evaluating accuracy
50,000 training data, 100,000 testing data (in two sets)
Past and future information may affect y(k)
x1(k): periodically nine 0s, one 1, nine 0s, one 1, and so on. Example:
0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000
Background: Engine Misfire Detection
How engine works
Air-fuel mixture injected into the cylinder
Intake, compression, combustion, exhaust
Engine misfire: a substantial fraction of a cylinder's air-fuel mixture fails to ignite
Frequent misfires: pollutants and costly replacement
On-board detection:
Engine crankshaft rotational dynamics measured with a position sensor
Encoding Schemes
For SVM: each data is a vector
x1(k): periodically nine 0s, one 1, nine 0s, one 1, ...
Encoding 1: 10 binary attributes x1(k − 5), . . . , x1(k + 4) for the kth data
Encoding 2: x1(k) as a single integer in 1 to 10
Which one is better? We think 10 binaries are better for SVM (a small sketch follows below)
x4(k) more important
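A sketch of the two encodings (made-up helper code, not the authors'): for each k, the phase of x1 can be kept either as a 10-dimensional 0/1 window or as a single integer position within that window.

def encode_binary_window(x1, k):
    # Encoding 1: 10 binary attributes x1(k-5), ..., x1(k+4)
    return [x1[k + j] for j in range(-5, 5)]

def encode_integer_phase(x1, k):
    # Encoding 2: 1-based position of the single 1 within the window
    # (with period 10 there is exactly one 1 in any 10-sample window)
    window = encode_binary_window(x1, k)
    return window.index(1) + 1

x1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(encode_binary_window(x1, 8))   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(encode_integer_phase(x1, 8))   # 1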
Training SVM
Selecting parameters; generating a good model for prediction
RBF kernel: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j) = e^{-\gamma\|x_i - x_j\|^2}$
Two parameters: γ and C
Five-fold cross validation on the 50,000 data
Data randomly separated into five groups
Each time four as training and one as testing
Use $C = 2^4$, $\gamma = 2^2$ and train on all 50,000 data for the final model
(contour plot of cross-validation accuracy over log2(C) and log2(gamma))
Test set 1: 656 errors, Test set 2: 637 errors
About 3,000 support vectors out of 50,000 training data
A good case for SVM
This is just the outline. There are other details.
Machine Learning Is Sometimes An Art
But not a science
For complicated problems, there is no real systematic procedure
An Example: Vehicle Classification
Vehicle classification in distributed sensor networks
http://www.ece.wisc.edu/~sensit and
http://mmsp-2.caenn.wisc.edu/events.zip
Prepared by Duarte and Hu at the University of Wisconsin
Three classes of data: two vehicles and noise
Each instance: acoustic and seismic features
# features of each part: 50 and 50
Distribution of data:
#class 1 #class 2 #class 3
How Data Are Generated
Wireless distributed sensor networks (WDSN)
Several sensors in a field
Event extraction
Only information when the vehicle is close enough to the sensor
Then a time series
FFT-based features
Noise: high-energy factors such as wind and radio chatter.
Sample instances: Acoustic Data
2 1:-1.8893190e-02 2:-7.2501253e-03 3:-9.3349372e-03 4:8.2397278e-02 5:1.0000000e+00 6:2.8431799e-02 7:-3.9595759e-03 8:-2.2467102e-02 9:-2.7549071e-03 10:-2.2973921e-02 ...
Results from the Authors
Paper available from
http://www.ece.wisc.edu/~sensit/publications/
Three-fold CV Accuracy
Method               Acoustic   Seismic
k-nearest neighbor   69.36%     56.24%
Maximum likelihood   68.95%     62.81%
SVM                  69.48%     63.79%
We think more investigation may improve the accuracy
So I decided to let students do a project on this
A report presented in my statistical learning theory course
By C.-C. Chou, S.-T. Wang, R.-E. Fan, C.-W. Lin, and C.-C. Lin
Authors’ Approach
Data split to three folds
Two as training and one as validation
Average of the three validation accuracies reported
Polynomial kernel used: $(1 + x_i^T x_j)^d$, with $C = 1$
My Students’ Approach
Cross-validation is a biased estimate
Too many parameters: CV accuracy overfitted
Practically OK for two/three parameters
We do a more formal way
Kernel/Parameter Selection
RBF kernel
$e^{-\gamma\|x_i - x_j\|^2}$
Parameter selection very important
C and γ
Fewer parameters than the polynomial kernel
Huge training time
Issue: the best (C, γ) for a 10% subset may not be the best for the whole set
In theory C should be decreased a bit
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$
Results
Test accuracy and (log2 C, log2 γ):
Acoustic        Seismic
75.01 (7,-2)    72.03 (18,-10)
Not very good
Try combining the two feature sets
New accuracy: 83.70 (9,-6)
This case:
Data Scaling
Earlier we mentioned the importance of data scaling
How about this data set?
Is each attribute in a suitable range?
First 4 attributes of training/validation:
        X1        X2        X3        X4
Min.   -0.5988   -0.5194   -0.4806   -0.5111
Mean    0.1319    0.2481    0.1512    0.1844
Max.    1.0000    1.0000    1.0000    1.0000
Data Scaling (Cont.)
From the authors' original MATLAB code, for $x \in R^n$:
$$x_i \leftarrow \frac{x_i}{\max_j(|x_j|)}$$
Instance-wise scaling
Earlier: feature-wise scaling
First 4 features scaled to [−1, 1]
X1 X2 X3 X4
Other features similar
max of X1 < 1, as the scaling is over all the features and the above shows only 4 of them
Very different distributions
How attributes are scaled to [−1, 1] (feature-wise):
$$x_i \leftarrow 2\,\frac{x_i - \min}{\max - \min} - 1$$
In the original data, most $x_i$ are close to the min
After instance-wise scaling, may not be that close to the new min
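A sketch contrasting the two scalings (hypothetical code; the toy matrix is made up): instance-wise scaling divides each row by its own largest absolute value, while feature-wise scaling maps each column to [−1, 1].

import numpy as np

def instance_wise(X):
    # each row divided by the max absolute value of that row
    return X / np.abs(X).max(axis=1, keepdims=True)

def feature_wise(X, lower=-1.0, upper=1.0):
    # each column mapped linearly to [lower, upper]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lower + (X - mn) * (upper - lower) / (mx - mn)

X = np.array([[0.02, 0.90, -0.01],
              [0.05, 2.10,  0.03],
              [0.01, 1.50, -0.02]])
print(instance_wise(X))
print(feature_wise(X))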
After Scaling
New results
Acoustic       Seismic         Combined
79.71 (6,-2)   76.68 (6,-2)    87.18 (5,-3)
Compare to the earlier results:
Acoustic       Seismic         Combined
75.01 (7,-2)   72.03 (18,-10)  83.70 (9,-6)
New results consistently better
Feature-wise scaling seems more appropriate
Six data sets available at
Issues not Investigated Yet
If most values are close to the min of the features,
are these values outliers or useful information?
Is 86% enough for practical use?
Originally
Assault Amphibian Vehicle (AAV) Main Battle Tank (M1)
High Mobility Multipurpose Wheeled Vehicle (HMMWV)
So five-class problem
Now we have only AAV, DW, and noise
# of SVs is an issue
Now around 20,000 SVs
Can they be stored in a sensor?
Further improvement
Feature selection
Lesson from This Experiment
There is no systematic way to handle a machine learning task
However, some simple techniques/analysis help
Better understanding of ML methods also helps
Of course, you need good luck
SVM Primal and Dual
Standard SVM:
$$\min_{w,b,\xi} \ \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \ldots, l.$$
$w$: huge vector variable
Possibly infinite variables
Dual problem:
$$\min_{\alpha} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ i = 1, \ldots, l,\ \ \sum_{i=1}^{l} y_i\alpha_i = 0.$$
$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ available using special $\phi$
Primal Dual Relationship
At optimum:
$$\bar{w} = \sum_{i=1}^{l} \bar{\alpha}_i y_i \phi(x_i) \qquad (1)$$
$$\tfrac{1}{2}\bar{w}^T\bar{w} + C\sum_{i=1}^{l}\bar{\xi}_i = e^T\bar{\alpha} - \tfrac{1}{2}\bar{\alpha}^T Q\bar{\alpha}, \qquad (2)$$
where $e = [1, \ldots, 1]^T$.
Derivation of the Dual
We follow the description in [Bazaraa et al., 1993]
Consider a simpler problem:
$$\min_{w,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1,\ i = 1, \ldots, l.$$
Its dual:
$$\min_{\alpha} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, l,\ \ \sum_{i=1}^{l} y_i\alpha_i = 0.$$
Lagrangian Dual
Defined as
$$\max_{\alpha \ge 0}\ \Big(\min_{w,b} L(w, b, \alpha)\Big), \qquad (3)$$
where
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i\big(y_i(w^T\phi(x_i) + b) - 1\big). \qquad (4)$$
Minimize with respect to the primal variables $w$ and $b$
Assume $(\bar{w}, \bar{b})$ is optimal for the primal with optimal objective value $\gamma = \tfrac{1}{2}\|\bar{w}\|^2$.
No $(w, b)$ satisfies
$$\tfrac{1}{2}\|w\|^2 < \gamma \ \text{ and }\ y_i(w^T\phi(x_i) + b) \ge 1,\ i = 1, \ldots, l. \qquad (5)$$
There is $\bar{\alpha} \ge 0$ such that for all $w, b$
$$\tfrac{1}{2}\|w\|^2 - \gamma - \sum_{i=1}^{l} \bar{\alpha}_i\big(y_i(w^T\phi(x_i) + b) - 1\big) \ge 0. \qquad (6)$$
Thus
$$\max_{\alpha \ge 0}\ \min_{w,b} L(w, b, \alpha) \ge \gamma. \qquad (7)$$
On the other hand, for any $\alpha$,
$$\min_{w,b} L(w, b, \alpha) \le L(\bar{w}, \bar{b}, \alpha),$$
so
$$\max_{\alpha \ge 0}\ \min_{w,b} L(w, b, \alpha) \le \max_{\alpha \ge 0} L(\bar{w}, \bar{b}, \alpha) = \tfrac{1}{2}\|\bar{w}\|^2 = \gamma. \qquad (8)$$
With $\bar{\alpha}_i \ge 0$ and $y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1 \ge 0$,
$$\bar{\alpha}_i\big[y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1\big] = 0,\ i = 1, \ldots, l,$$
the complementarity condition.
To simplify the dual: when $\alpha$ is fixed,
$$\min_{w,b} L(w, b, \alpha) = \begin{cases} -\infty & \text{if } \sum_{i=1}^{l}\alpha_i y_i \ne 0, \\[4pt] \min_w \ \tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big] & \text{if } \sum_{i=1}^{l}\alpha_i y_i = 0. \end{cases}$$
If $\sum_{i=1}^{l}\alpha_i y_i \ne 0$, we can drive the term $-b\sum_{i=1}^{l}\alpha_i y_i$ in $L(w, b, \alpha)$ to $-\infty$.
If $\sum_{i=1}^{l}\alpha_i y_i = 0$, the optimum of $\tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big]$ happens when $\frac{\partial}{\partial w}L(w, b, \alpha) = 0$. Thus,
More details: assume $w \in R^n$, so
$$\frac{\partial}{\partial w}L(w, b, \alpha) = \begin{bmatrix} \frac{\partial}{\partial w_1}L(w, b, \alpha) \\ \vdots \\ \frac{\partial}{\partial w_n}L(w, b, \alpha) \end{bmatrix}$$
$L(w, b, \alpha)$ can be rewritten as
$$\tfrac{1}{2}\sum_{j=1}^{n} w_j^2 - \sum_{i=1}^{l}\alpha_i\Big[y_i\Big(\sum_{j=1}^{n} w_j\phi(x_i)_j\Big) - 1\Big]$$
So
$$\frac{\partial}{\partial w_j}L(w, b, \alpha) = w_j - \sum_{i=1}^{l}\alpha_i y_i\phi(x_i)_j = 0.$$
Note that
$$w^T w = \Big(\sum_{i=1}^{l}\alpha_i y_i\phi(x_i)\Big)^T\Big(\sum_{j=1}^{l}\alpha_j y_j\phi(x_j)\Big) = \sum_{i,j}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j)$$
The dual is
$$\max_{\alpha \ge 0} \begin{cases} \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j) & \text{if } \sum_{i=1}^{l}\alpha_i y_i = 0, \\[4pt] -\infty & \text{if } \sum_{i=1}^{l}\alpha_i y_i \ne 0. \end{cases}$$
$-\infty$ is definitely not the maximum of the dual
So the dual optimum does not happen when $\sum_{i=1}^{l}\alpha_i y_i \ne 0$, and the dual simplifies to
$$\max_{\alpha \ge 0,\ \sum_i \alpha_i y_i = 0}\ \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j)$$
Karush-Kuhn-Tucker (KKT) optimality conditions of the primal:
$$\bar{\alpha}_i\big[y_i(\bar{w}^T\phi(x_i) + \bar{b}) - 1\big] = 0,\ i = 1, \ldots, l,$$
$$\sum_{i=1}^{l}\alpha_i y_i = 0, \quad \alpha_i \ge 0\ \forall i, \quad w = \sum_{i=1}^{l}\alpha_i y_i\phi(x_i).$$
An Example
Two training data in $R^1$:
(figure: two training points, at x = 0 and x = 1)
Primal Problem
$x_1 = 0$, $x_2 = 1$ with $y = [-1, 1]^T$. Primal problem:
$$\min_{w,b} \ \tfrac{1}{2}w^2 \quad \text{subject to} \quad w\cdot 1 + b \ge 1, \qquad (11)$$
$$-1(w\cdot 0 + b) \ge 1. \qquad (12)$$
So $-b \ge 1$ and $w \ge 1 - b \ge 2$.
For any $(w, b)$ satisfying the two inequality constraints,
$w \ge 2$
We are minimizing $\tfrac{1}{2}w^2$
The smallest possibility is $w = 2$.
$(w, b) = (2, -1)$ is the optimal solution.
The separating hyperplane $2x - 1 = 0$ is in the middle of the two training data.
Dual Problem
Formula derived before
$$\min_{\alpha \in R^l} \ \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i \quad \text{subject to} \quad \alpha_i \ge 0,\ i = 1, \ldots, l,\ \text{and}\ \sum_{i=1}^{l}\alpha_i y_i = 0.$$
Get the objective function:
$$\tfrac{1}{2}\alpha_2^2 - (\alpha_1 + \alpha_2) = \tfrac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}.$$
Constraints: $\alpha_1 - \alpha_2 = 0$, $0 \le \alpha_1$, $0 \le \alpha_2$.
Substituting $\alpha_2 = \alpha_1$ into the objective function:
$$\tfrac{1}{2}\alpha_1^2 - 2\alpha_1$$
Smallest value at $\alpha_1 = 2$, and $\alpha_2 = 2$ as well.
If the smallest value occurred at a negative $\alpha_1$, it would be clipped to 0.
Dual Problems for Other Formulas
So we may think that for any optimization problem such a Lagrangian dual exists
This is wrong.
Remember we calculated
$$\min_w \ \tfrac{1}{2}w^T w - \sum_{i=1}^{l}\alpha_i\big[y_i w^T\phi(x_i) - 1\big]$$
by setting $\frac{\partial}{\partial w}L(w, b, \alpha) = 0$.
Note that
"$f'(x) = 0 \Leftrightarrow x$ is a minimum" is wrong
Example
$f(x) = x^3$: $f'(0) = 0$ but $x = 0$ is not a minimum
The function must satisfy certain conditions (e.g. convexity)
Some papers wrongly derived the duals of their new formulations without checking these conditions
Back to the example: $\alpha = [2, 2]^T$ satisfies the constraints $0 \le \alpha_1$ and $0 \le \alpha_2$, so it is optimal.
Primal-dual relation:
$$w = y_2\alpha_2 x_2 + y_1\alpha_1 x_1 = 1\cdot 2\cdot 1 + (-1)\cdot 2\cdot 0 = 2$$
Multi-class Classification
k classes
One-against-all: Train k binary SVMs:
1st class vs. classes 2 to k
2nd class vs. classes 1, 3 to k
...
k decision functions: $(w^1)^T\phi(x) + b^1, \ldots, (w^k)^T\phi(x) + b^k$
Predict the class with the largest decision value
Multi-class Classification (Cont.)
One-against-one: train k(k − 1)/2 binary SVMs
(1, 2), (1, 3), . . . , (1, k), (2, 3), (2, 4), . . . , (k − 1, k)
Select the one with the largest vote
This is the method used by LIBSVM
Try a 4-class problem
6 binary SVMs
$libsvm-2.5/svm-train bsvm-2.05/vehicle.scale
optimization finished, #iter = 173
obj = -142.552559, rho = 0.748453
nSV = 194, nBSV = 183
optimization finished, #iter = 330
obj = -149.912202, rho = -0.786410
nSV = 227, nBSV = 217
optimization finished, #iter = 169
obj = -139.655613, rho = 0.998277
nSV = 186, nBSV = 177
optimization finished, #iter = 268
obj = -185.161735, rho = -0.674739
nSV = 253, nBSV = 244
optimization finished, #iter = 477
obj = -378.264371, rho = 0.177314
nSV = 405, nBSV = 394
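A sketch of the one-against-one voting rule (illustrative only; LIBSVM implements this internally): each of the k(k − 1)/2 binary models votes for one class, and the class with the most votes wins.

from collections import Counter

def one_vs_one_predict(x, binary_models, k):
    # binary_models[(i, j)](x) returns True if x looks like class i, else class j
    votes = Counter()
    for i in range(k):
        for j in range(i + 1, k):
            votes[i if binary_models[(i, j)](x) else j] += 1
    return votes.most_common(1)[0][0]   # class with the largest vote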
There are many other methods
A comparison in [Hsu and Lin, 2002]
For a software package, we select one method which is generally good, though not always the best
Finally I chose 1 vs. 1
Similar accuracy to other methods
Shortest training time
Why Shorter Training Time
1 vs. 1: k(k − 1)/2 problems, each with about 2l/k data on average
1 vs. all: k problems, each with l data
If solving one optimization problem costs a polynomial of the data size with degree d, their complexities are
$$\frac{k(k-1)}{2}\,O\!\left(\left(\frac{2l}{k}\right)^{d}\right) \quad \text{vs.} \quad k\,O(l^{d})$$
Outline
Support vector regression (SVR)
Practical examples
Support Vector Regression (SVR)
Support vector machines: a new method for data classification and prediction
Given training data (x1, y1), . . . , (xl, yl)
Regression: find a function so that
f (xi) ≈ yi
Least squares regression:
$$\min_{w,b} \ \sum_{i=1}^{l}\big(y_i - (w^T x_i + b)\big)^2$$
(figure: data points and the fitted line $w^T x + b$)
This is equivalent to
$$\min_{w,b,\xi,\xi^*} \ \sum_{i=1}^{l}\xi_i^2 + (\xi_i^*)^2 \quad \text{subject to} \quad -\xi_i^* \le y_i - (w^T x_i + b) \le \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
A quadratic programming problem
L1-norm regression:
$$\min_{w,b} \ \sum_{i=1}^{l}|y_i - (w^T x_i + b)| \quad \text{or} \quad \min_{w,b,\xi,\xi^*} \ \sum_{i=1}^{l}(\xi_i + \xi_i^*)$$
A linear programming problem
This is equivalent to
$$\min_{w,b,\xi,\xi^*} \ C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \quad \text{subject to} \quad -\xi_i^* \le y_i - (w^T x_i + b) \le \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
$C$: a constant
Linear support vector regression:
$$\min_{w,b,\xi,\xi^*} \ \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}(\xi_i + \xi_i^*) \quad \text{subject to} \quad -\xi_i^* - \epsilon \le y_i - (w^T x_i + b) \le \epsilon + \xi_i,\ \xi_i \ge 0,\ \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
A tube
$$-\epsilon \le y - (w^T x + b) \le \epsilon$$
Data in the tube are considered to have no error
Most training data are in the tube
$\tfrac{1}{2}w^T w$: regularization, makes $w^T x + b$ smoother
Similar to the classification case
General support vector regression:
Data mapped to a higher dimensional space by $\phi(x)$
The new approximation function: $w^T\phi(x) + b$
(figure: the $\epsilon$-insensitive tube around $w^T\phi(x) + b$, with slack variables $\xi_i$ and $\xi_i^*$)
Standard SVR:
$$\min_{w,b,\xi,\xi^*} \ \tfrac{1}{2}w^T w + C\sum_{i=1}^{l}\xi_i + C\sum_{i=1}^{l}\xi_i^* \quad \text{subject to} \quad -\epsilon - \xi_i^* \le y_i - (w^T\phi(x_i) + b) \le \epsilon + \xi_i,\ \ \xi_i, \xi_i^* \ge 0,\ i = 1, \ldots, l.$$
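Equivalently, each data point contributes the ε-insensitive loss max(0, |y − f(x)| − ε); a minimal sketch of that loss (illustrative, not LIBSVM code):

def epsilon_insensitive_loss(y, fx, eps):
    # zero inside the tube |y - f(x)| <= eps, linear outside
    return max(0.0, abs(y - fx) - eps)

print(epsilon_insensitive_loss(3.0, 3.2, eps=0.5))   # inside the tube -> 0.0
print(epsilon_insensitive_loss(3.0, 4.0, eps=0.5))   # outside the tube -> 0.5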
Data in high dimensional spaces
Possibly $w^T\phi(x_i) + b = y_i$, $i = 1, \ldots, l$ ⇒ overfitting
Good regression methods: balance between overfitting and underfitting
min $\tfrac{1}{2}w^T w$: avoid overfitting
Support vector regression using LIBSVM
Using the option -s 3
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
0 -- C-SVC
1 -- nu-SVC
2 -- one-class SVM
3 -- epsilon-SVR
4 -- nu-SVR
Check a regression data set:
$head -n 5 svrprob/trans/abalone.scale.shuffle.train.1
13 1:-1 2:0.310811 3:0.193277 4:-0.707965 5:-0.273012 6:-0.441693 7:-0.175958 8:-0.324278
12 1:-1 2:0.797297 3:0.764706 4:-0.637168 5:0.86369 6:0.301118 7:0.634146 8:0.101302
9 1:-1 2:-0.121622 3:-0.159664 4:-0.769912 5:-0.771641 6:-0.848243 7:-0.766551 8:-0.765705
8 1:-1 2:0.189189 3:0.142857 4:-0.761062 5:-0.212691 6:-0.247604 7:-0.132404 8:-0.432937
19 1:1 2:0.202703 3:0.12605 4:-0.752212 5:-0.427732 6:-0.615815 7:-0.5 8:-0.414827
Additional parameters (-p is the ε of the ε-insensitive loss):
$svm-train -s 3 -c 64 -g 0.25 -p 0.5 svrprob/trans/abalone.scale.shuffle.train.1
Test
$svm-predict svrprob/trans/abalone.scale.shuffle.test.1 abalone.scale.shuffle.train.1.model o
Accuracy = 0% (0/200) (classification)
Mean squared error = 5.00931 (regression)
Squared correlation coefficient = 0.58387 (regression)
Test data:
$head -n 5 svrprob/trans/abalone.scale.shuffle.test.1
8 1:1 2:0.0945946 3:0.0420168 4:-0.823009 5:-0.640423 6:-0.649361 7:-0.710801 8:-0.697793
20 2:0.756757 3:0.747899 4:-0.690265 5:0.662358 6:0.220447 7:0.571429 8:0.92077
7 1:1 2:0.175676 3:0.142857 4:-0.769912 5:-0.529573 6:-0.552716 7:-0.503484 8:-0.636672
8 1:-1 2:0.189189 3:0.12605 4:-0.752212 5:-0.470427 6:-0.456869 7:-0.54007 8:-0.734012
Check the predictions:
$head -n 5 o
7.62474
14.7385
8.40741
7.13157
9.74578
SVR Performance Evaluation
Not accuracy any more
MSE (Mean Squared Error):
$$\frac{1}{l}\sum_{i=1}^{l}(y_i - \hat{y}_i)^2$$
Squared correlation coefficient (also called $r^2$):
$$r^2 = \frac{\Big(\sum_{i=1}^{l}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\Big)^2}{\sum_{i=1}^{l}(y_i - \bar{y})^2 \ \sum_{i=1}^{l}(\hat{y}_i - \bar{\hat{y}})^2}$$
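A small sketch computing both measures from true values and predictions (illustrative; the arrays below are made up):

import numpy as np

y_true = np.array([8.0, 20.0, 7.0, 8.0])
y_pred = np.array([7.62, 14.74, 8.41, 7.13])

mse = np.mean((y_true - y_pred) ** 2)
r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2   # squared correlation coefficient
print(mse, r2)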
Conclusions
Dealing with data is interesting
especially if you get good accuracy
Some basic understanding is essential when applying these methods
e.g. the importance of validation
No method is the best for all data