(1)

Introduction to Support Vector Machines

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk in NTU EE/CS Speech Lab, November 16, 2005

(2)

Setup

fixed-length example: a D-dimensional vector x, each component is a feature

raw digital sampling of a 0.5 sec. wave file
DFT of the raw sampling
a fixed-length feature vector extracted from the wave file

label: a number y ∈ Y

binary classification: is there a man speaking in the wave file?

(y = +1 if man, y = −1 if not)

multi-class classification: which speaker is speaking?

(y ∈ {1,2, · · · ,K})

regression: how excited is the speaker? (y∈ R)

(3)

Binary Classification Problem

learning problem: given training examples and labels {(x_i, y_i)}_{i=1}^N, find a function g(x): X → Y that predicts the label of unseen x well

vowel identification: given training wave files and their vowel labels, find a function g(x) that maps wave files to the correct vowel

we will focus on the binary classification problem: Y = {+1, −1}

most basic learning problem, but very useful and can be extended to other problems

illustrative demo: for examples with two different colors in a D-dimensional space, how can we “separate” the examples?

(4)

Hyperplane Classifier

use a hyperplane to separate the two colors:

g(x) = sign(w^T x + b)

if w^T x + b ≥ 0, the classifier returns +1; otherwise it returns −1
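As a concrete illustration (mine, not from the talk), the hyperplane classifier amounts to one line of NumPy; the weight vector w and bias b below are just placeholder numbers:

```python
import numpy as np

def hyperplane_classify(X, w, b):
    """g(x) = sign(w^T x + b), applied to every row of X.

    X: (N, D) array of examples, w: (D,) weight vector, b: scalar bias.
    Following the convention above, w^T x + b = 0 maps to +1.
    """
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# toy usage with made-up numbers
X = np.array([[1.0, 2.0], [-1.0, -0.5]])
w = np.array([0.5, -0.25])
b = 0.1
print(hyperplane_classify(X, w, b))   # [ 1 -1]
```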

possibly lots of hyperplanes satisfy our needs; which one should we choose?

(5)

SVM: Large-Margin Hyperplane Classifier

margin ρ_i = y_i(w^T x_i + b)/‖w‖_2: does y_i agree with w^T x_i + b in sign?

how large is the distance between the example and the separating hyperplane?

large positive margin → clear separation → low-risk classification

idea of SVM: maximize the minimum margin

max_{w,b} min_i ρ_i
s.t. ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0

(6)

Hard-Margin Linear SVM

maximize the minimum margin:

max_{w,b} min_i ρ_i
s.t. ρ_i = y_i(w^T x_i + b)/‖w‖_2 ≥ 0, i = 1, . . . , N.

equivalent to

min_{w,b} (1/2) w^T w
s.t. y_i(w^T x_i + b) ≥ 1, i = 1, . . . , N.

– hard-margin linear SVM

quadratic programming with D+1 variables: well-studied in optimization
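As a sketch of what “well-studied” means in practice (my illustration; the slides do not prescribe a solver), the hard-margin primal can be handed directly to a generic QP solver such as the third-party cvxopt package:

```python
import numpy as np
from cvxopt import matrix, solvers  # generic QP solver, assumed installed

def hard_margin_svm(X, y, eps=1e-8):
    """Solve min_{w,b} (1/2) w^T w  s.t.  y_i (w^T x_i + b) >= 1.

    X: (N, D) float array, y: (N,) array of +1/-1.
    Assumes linearly separable data; otherwise the QP is infeasible.
    """
    N, D = X.shape
    # variables z = [w_1, ..., w_D, b]; tiny eps on b keeps P positive definite
    P = np.diag(np.r_[np.ones(D), eps])
    q = np.zeros(D + 1)
    # y_i (w^T x_i + b) >= 1   <=>   -y_i [x_i, 1] z <= -1
    G = (-y[:, None] * np.c_[X, np.ones(N)]).astype(float)
    h = -np.ones(N)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:D], z[D]   # (w, b)
```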

is the hard-margin linear SVM good enough?

(7)

Soft-Margin Linear SVM

hard-margin – hard constraints on separation:

min_{w,b} (1/2) w^T w
s.t. y_i(w^T x_i + b) ≥ 1, i = 1, . . . , N.

no feasible solution if some noisy outliers exist

soft-margin – soft constraints as cost:

min_{w,b} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , N.

allow the noisy examples to have ξ_i > 0 with a cost

is linear SVM good enough?
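Before turning to that question: the effect of the cost C is easy to see empirically. A hedged sketch using scikit-learn's SVC (which wraps LIBSVM); the data and parameter values are made up:

```python
import numpy as np
from sklearn.svm import SVC

# two noisy, overlapping classes (made-up toy data)
rng = np.random.default_rng(0)
X = np.r_[rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))]
y = np.r_[-np.ones(50), np.ones(50)]

# small C: violations xi_i are cheap, wide margin;
# large C: violations are expensive, behaves more like hard-margin
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.score(X, y), len(clf.support_))
```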

(8)

Soft-Margin Nonlinear SVM

what if we want a boundary g(x) = sign(x^T x − 1)? it can never be constructed with a hyperplane classifier sign(w^T x + b)

however, we can have more complex feature transforms:

φ(x) = [(x)_1, (x)_2, · · · , (x)_D, (x)_1(x)_1, (x)_1(x)_2, · · · , (x)_D(x)_D]

there is a classifier sign(w^T φ(x) + b) that describes the boundary

soft-margin nonlinear SVM:

min_{w,b} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , N.

– with nonlinear φ(·)
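To make the transform concrete, here is a sketch (mine, under the assumption that the quadratic φ above is built explicitly) that learns the circular boundary sign(x^T x − 1) with a linear SVM in φ-space:

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """[(x)_1, ..., (x)_D, (x)_1(x)_1, (x)_1(x)_2, ..., (x)_D(x)_D] for each row of X."""
    N, D = X.shape
    quad = np.einsum('ni,nj->nij', X, X).reshape(N, D * D)  # all products (x)_i (x)_j
    return np.c_[X, quad]

# labels come from the circular boundary sign(x^T x - 1): not linearly separable in x
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sign((X ** 2).sum(axis=1) - 1)

clf = SVC(kernel='linear', C=10.0).fit(phi(X), y)
print(clf.score(phi(X), y))   # near 1.0: linear in phi-space, nonlinear in x-space
```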

(9)

Feature Transformation

what feature transforms φ(·) should we use?

we can only extract a small, finite number of features, but we can use an unlimited number of feature transforms

traditionally:
use domain knowledge to do feature transformation
use only “useful” feature transformations
use a small number of feature transformations
control the goodness of fitting by a suitable choice of feature transformations

what if we use an “infinite number” of feature transforms, and let the algorithm decide a good w automatically?

would an infinite number of transformations introduce overfitting?

are we able to solve the optimization problem?

(10)

Dual Problem

quadratic programming with infinitely many variables if φ(·) is infinite-dimensional:

min_{w,b} (1/2) w^T w + C Σ_i ξ_i
s.t. y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , N.

luckily, we can solve its associated dual problem:

min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C,

where Q_ij = y_i y_j φ^T(x_i) φ(x_j); α is an N-dimensional vector

(11)

Solution of the Dual Problem

associated dual problem:

min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C,

Q_ij = y_i y_j φ^T(x_i) φ(x_j)

equivalent solution:

g(x) = sign(w^T φ(x) + b) = sign(Σ_i y_i α_i φ^T(x_i) φ(x) + b)

no need for w and φ(x) explicitly if we can compute K(x, x′) = φ^T(x) φ(x′) efficiently
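In code, the dual solution only ever touches kernel values against the training examples. A minimal sketch, assuming the coefficients α and the bias b have already been obtained from some dual solver:

```python
import numpy as np

def dual_predict(X_train, y_train, alpha, b, kernel, X_test):
    """g(x) = sign(sum_i y_i alpha_i K(x_i, x) + b) for every row of X_test.

    kernel(A, B) must return the matrix of kernel values between rows of A and B.
    """
    K = kernel(X_train, X_test)            # shape (N_train, N_test)
    scores = (y_train * alpha) @ K + b     # sum_i y_i alpha_i K(x_i, x) + b
    return np.where(scores >= 0, 1, -1)
```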

(12)

Kernel Trick

let kernel K(x, x′) = φ^T(x) φ(x′)

revisit: can we compute the kernel of

φ(x) = [(x)_1, (x)_2, · · · , (x)_D, (x)_1(x)_1, (x)_1(x)_2, · · · , (x)_D(x)_D]

efficiently?

well, not really

how about this?

φ(x) = [√2 (x)_1, √2 (x)_2, · · · , √2 (x)_D, (x)_1(x)_1, · · · , (x)_D(x)_D]

K(x, x′) = (1 + x^T x′)^2 − 1
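A quick numerical check of this identity (my own sketch): the inner product of the √2-scaled transform equals (1 + x^T x′)^2 − 1, so the kernel costs O(D) instead of forming all D + D^2 features.

```python
import numpy as np

def phi(x):
    """[sqrt(2)(x)_1, ..., sqrt(2)(x)_D, (x)_1(x)_1, ..., (x)_D(x)_D]"""
    return np.r_[np.sqrt(2) * x, np.outer(x, x).ravel()]

rng = np.random.default_rng(2)
x, xp = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(xp)          # O(D^2) inner product in feature space
via_kernel = (1 + x @ xp) ** 2 - 1   # O(D) kernel evaluation
print(np.isclose(explicit, via_kernel))   # True
```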

(13)

Different Kernels

types of kernels

linear: K(x, x′) = x^T x′
polynomial: K(x, x′) = (a x^T x′ + r)^d
Gaussian RBF: K(x, x′) = exp(−γ ‖x − x′‖_2^2)
Laplacian RBF: K(x, x′) = exp(−γ ‖x − x′‖_1)

the last two equivalently have feature transformations in an infinite-dimensional space!

new paradigm for machine learning: use many, many feature transformations; control the goodness of fitting by large-margin (clear separation) and violation cost (amount of outliers allowed)
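For reference, a hedged NumPy rendering of the four kernels above, vectorized over the rows of A and B (the parameter names a, r, d, gamma follow the formulas, with made-up defaults):

```python
import numpy as np
from scipy.spatial.distance import cdist   # pairwise distance matrices

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, a=1.0, r=1.0, d=2):
    return (a * (A @ B.T) + r) ** d

def gaussian_rbf_kernel(A, B, gamma=1.0):
    return np.exp(-gamma * cdist(A, B, 'sqeuclidean'))   # ||x - x'||_2^2

def laplacian_rbf_kernel(A, B, gamma=1.0):
    return np.exp(-gamma * cdist(A, B, 'cityblock'))      # ||x - x'||_1
```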

(14)

Support Vectors: Meaningful Representation

min_α (1/2) α^T Q α − e^T α
s.t. y^T α = 0,
0 ≤ α_i ≤ C

equivalent solution:

g(x) =signX

yiαiK(xi,x) +b

only those with α_i > 0 are needed for classification – support vectors

from optimality conditions, α_i:

“α_i = 0”: not needed in constructing the decision function; away from the boundary or on the boundary

“α_i > 0 and α_i < C”: free support vector, on the boundary

“α_i = C”: bounded support vector, violates the boundary (ξ_i > 0) or on the boundary
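LIBSVM-based tools expose these α_i after training; for example, scikit-learn's SVC stores y_i α_i for each support vector in dual_coef_, so the bounded support vectors are exactly those with |y_i α_i| = C. A hedged sketch on made-up data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.r_[rng.normal(-1, 1, size=(60, 2)), rng.normal(+1, 1, size=(60, 2))]
y = np.r_[-np.ones(60), np.ones(60)]

C = 1.0
clf = SVC(kernel='rbf', gamma=0.5, C=C).fit(X, y)

coef = np.abs(clf.dual_coef_).ravel()          # |y_i alpha_i| per support vector
n_bounded = int(np.sum(np.isclose(coef, C)))   # alpha_i = C: bounded SVs
n_free = len(coef) - n_bounded                 # 0 < alpha_i < C: free SVs
print(len(clf.support_), n_free, n_bounded)
```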

(15)

Why is SVM Successful?

infinite number of feature transformation: suitable for conquering nonlinear classification tasks

large-margin concept: theoretically promising

soft-margin trade-off: controls regularization well

convex optimization problems: amenable to good optimization algorithms (compared to Neural Networks and some other learning algorithms)

support vectors: useful in data analysis and interpretation

(16)

Why is SVM Not Successful?

SVM can be sensitive to scaling and parameters

standard SVM is only a “discriminative” classification algorithm

SVM training can be time-consuming when N is large and the solver is not carefully implemented

infinite number of feature transformations ⇔ mysterious classifier

(17)

Useful Extensions of SVM

multiclass SVM: use the 1vs1 approach to combine binary SVMs into a multiclass classifier

– the label that gets the most votes from the classifiers is the prediction

probability output: transform the raw output w^T φ(x) + b to a value in [0, 1] that represents P(+1|x)

– use a sigmoid function to map from R to [0, 1]

infinite ensemble learning (Lin and Li 2005):

if the kernel K(x, x′) = −‖x − x′‖_1 is used for standard SVM, the classifier is equivalently

g(x) = sign(∫ w_θ s_θ(x) dθ + b)

where s_θ(x) is a thresholding rule on one feature of x.
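The first two extensions are built into common LIBSVM-style tools; for instance, scikit-learn's SVC combines binary SVMs with 1vs1 voting for multiclass data and fits a sigmoid (Platt-style scaling) when probability=True. A hedged sketch on a standard 3-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # 3-class toy problem

# 1vs1 voting among the binary classifiers; sigmoid-based probability outputs
clf = SVC(kernel='rbf', gamma=0.5, C=1.0,
          decision_function_shape='ovo', probability=True).fit(X, y)

print(clf.predict(X[:3]))                   # labels chosen by voting
print(clf.predict_proba(X[:3]).round(3))    # estimates of P(class | x)
```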

(18)

Basic Use of SVM

scale each feature of your data to a suitable range (say, [−1, 1])
use a Gaussian RBF kernel K(x, x′) = exp(−γ ‖x − x′‖_2^2)
use cross-validation and grid search to determine a good (γ, C) pair
use the best (γ, C) on your training set
do testing with the SVM classifier

all included in LIBSVM (from Lab of Prof. Chih-Jen Lin)
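The same recipe can be scripted end to end; this is a hedged sketch with scikit-learn (whose SVC wraps LIBSVM; LIBSVM Tools also ships a grid-search script), using a made-up parameter grid and a standard binary dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# scale each feature to [-1, 1], then a Gaussian RBF SVM
pipe = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel='rbf'))

# cross-validated grid search over the (gamma, C) pair
grid = {'svc__gamma': 2.0 ** np.arange(-9, 1, 2),
        'svc__C': 2.0 ** np.arange(-1, 9, 2)}
search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)

print(search.best_params_)         # the chosen (gamma, C)
print(search.score(X_te, y_te))    # test with the tuned SVM classifier
```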

(19)

Advanced Use of SVM

include domain knowledge by specific kernel design (e.g. train a generative model for feature extraction, and use the extracted features in SVM to get discriminative power)

combine SVM with your favorite tools (e.g. HMM + SVM for speech recognition)

fine-tune SVM parameters with specific knowledge of your problem (e.g. different costs for different examples?)

interpret the SVM results you get (e.g. are the SVs meaningful?)

(20)

Resources

LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
LIBSVM Tools: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
Kernel Machines Forum: http://www.kernel-machines.org
Hsu, Chang, and Lin: A Practical Guide to Support Vector Classification

my email: htlin@caltech.edu

acknowledgment: some figures obtained from Prof. Chih-Jen Lin
