(1)

Machine Learning Techniques (機器學習技法)

Lecture 4: Soft-Margin Support Vector Machine

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

(2)

Soft-Margin Support Vector Machine

Roadmap

1 Embedding Numerous Features: Kernel Models

Lecture 3: Kernel Support Vector Machine
kernel as a shortcut to (transform + inner product) to remove dependence on $\tilde{d}$: allowing a spectrum of simple (linear) models to infinite dimensional (Gaussian) ones with margin control

Lecture 4: Soft-Margin Support Vector Machine
Motivation and Primal Problem
Dual Problem
Messages behind Soft-Margin SVM
Model Selection

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

(3)

Soft-Margin Support Vector Machine Motivation and Primal Problem

Cons of Hard-Margin SVM

recall: SVM can still overfit :-(

[figure: decision boundaries under the transforms Φ_1 and Φ_4 on the same data]

part of reasons: Φ; other part: separable

if always insisting on separable (=⇒ shatter), have power to overfit to noise

(4)

Soft-Margin Support Vector Machine Motivation and Primal Problem

Give Up on Some Examples

want: give up on some noisy examples

pocket:
$$\min_{b,w}\ \sum_{n=1}^{N} \llbracket y_n \ne \operatorname{sign}(w^T z_n + b) \rrbracket$$

hard-margin SVM:
$$\min_{b,w}\ \tfrac{1}{2} w^T w \quad \text{s.t.}\ y_n(w^T z_n + b) \ge 1 \text{ for all } n$$

combination:
$$\min_{b,w}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \llbracket y_n \ne \operatorname{sign}(w^T z_n + b) \rrbracket$$
$$\text{s.t.}\ y_n(w^T z_n + b) \ge 1 \text{ for correct } n;\qquad y_n(w^T z_n + b) \ge -\infty \text{ for incorrect } n$$

C: trade-off of large margin & noise tolerance

(5)

Soft-Margin Support Vector Machine Motivation and Primal Problem

Soft-Margin SVM (1/2)

$$\min_{b,w}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \llbracket y_n \ne \operatorname{sign}(w^T z_n + b) \rrbracket$$
$$\text{s.t.}\ y_n(w^T z_n + b) \ge 1 - \infty \cdot \llbracket y_n \ne \operatorname{sign}(w^T z_n + b) \rrbracket$$

• $\llbracket\cdot\rrbracket$: non-linear, not QP anymore :-( —what about dual? kernel?
• cannot distinguish small error (slightly away from fat boundary) or large error (a...w...a...y... from fat boundary)

record ‘margin violation’ by $\xi_n$ —linear constraints
penalize with margin violation instead of error count —quadratic objective

soft-margin SVM:
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t.}\ y_n(w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$
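As a quick numeric illustration (not part of the original slides), here is a minimal Python sketch of the slack values implied by a fixed (b, w): at the optimum each ξ_n equals max(0, 1 − y_n(w^T z_n + b)). The array names Z, y, w, b are illustrative assumptions.

```python
import numpy as np

def margin_violations(Z, y, w, b):
    """Slack xi_n = max(0, 1 - y_n (w^T z_n + b)) for every example n."""
    scores = y * (Z @ w + b)            # y_n (w^T z_n + b)
    return np.maximum(0.0, 1.0 - scores)

# tiny check: an example with y_1 (w^T z_1 + b) = -10 needs xi_1 = 11
Z = np.array([[1.0, 0.0]])
y = np.array([-1.0])
w = np.array([10.0, 0.0])
b = 0.0
print(margin_violations(Z, y, w, b))    # [11.]
```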

(6)

Soft-Margin Support Vector Machine Motivation and Primal Problem

Soft-Margin SVM (2/2)

record ‘margin violation’ by $\xi_n$; penalize with margin violation:

$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t.}\ y_n(w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$

[figure: fat boundary with a violating example; the marked distance is the violation ξ_n]

parameter C: trade-off of large margin & margin violation
• large C: want less margin violation
• small C: want large margin
• QP of $\tilde{d} + 1 + N$ variables, $2N$ constraints

next: remove dependence on $\tilde{d}$ by soft-margin SVM primal ⇒ dual?

(7)

Soft-Margin Support Vector Machine Motivation and Primal Problem

Fun Time

At the optimal solution of
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t.}\ y_n(w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n,$$
assume that $y_1(w^T z_1 + b) = -10$. What is the corresponding $\xi_1$?

1  1
2  11
3  21
4  31

Reference Answer: 2

$\xi_1$ is simply $1 - y_1(w^T z_1 + b)$ when $y_1(w^T z_1 + b) \le 1$, so here $\xi_1 = 1 - (-10) = 11$.

(9)

Soft-Margin Support Vector Machine Dual Problem

Lagrange Dual

primal:
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t.}\ y_n(w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$

Lagrange function with Lagrange multipliers $\alpha_n$ and $\beta_n$:
$$\mathcal{L}(b, w, \xi, \alpha, \beta) = \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \bigl(1 - \xi_n - y_n(w^T z_n + b)\bigr) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n)$$

want: Lagrange dual
$$\max_{\alpha_n \ge 0,\ \beta_n \ge 0}\ \Bigl(\min_{b,w,\xi}\ \mathcal{L}(b, w, \xi, \alpha, \beta)\Bigr)$$

(10)

Soft-Margin Support Vector Machine Dual Problem

Simplify ξ_n and β_n

$$\max_{\alpha_n \ge 0,\ \beta_n \ge 0}\ \Bigl(\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \bigl(1 - \xi_n - y_n(w^T z_n + b)\bigr) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n)\Bigr)$$

• $\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 = C - \alpha_n - \beta_n$
• no loss of optimality if solving with implicit constraint $\beta_n = C - \alpha_n$ and explicit constraint $0 \le \alpha_n \le C$: $\beta_n$ removed
• $\xi_n$ can also be removed :-), like how we removed $b$

$$\max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n}\ \Bigl(\min_{b,w}\ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n(w^T z_n + b)\bigr) + \underbrace{\sum_{n=1}^{N} (C - \alpha_n - \beta_n) \cdot \xi_n}_{\text{crossed out: } =\, 0}\Bigr)$$

(11)

Soft-Margin Support Vector Machine Dual Problem

Other Simplifications

$$\max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n}\ \Bigl(\min_{b,w}\ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n(w^T z_n + b)\bigr)\Bigr)$$

familiar? :-) inner problem same as hard-margin SVM

• $\frac{\partial \mathcal{L}}{\partial b} = 0$: no loss of optimality if solving with constraint $\sum_{n=1}^{N} \alpha_n y_n = 0$
• $\frac{\partial \mathcal{L}}{\partial w_i} = 0$: no loss of optimality if solving with constraint $w = \sum_{n=1}^{N} \alpha_n y_n z_n$

standard dual can be derived using the same steps as Lecture 2

(12)

Soft-Margin Support Vector Machine Dual Problem

Standard Soft-Margin SVM Dual

$$\min_{\alpha}\ \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$$
$$\text{subject to}\ \sum_{n=1}^{N} y_n \alpha_n = 0;\quad 0 \le \alpha_n \le C,\ \text{for } n = 1, 2, \ldots, N;$$
$$\text{implicitly}\ w = \sum_{n=1}^{N} \alpha_n y_n z_n;\quad \beta_n = C - \alpha_n,\ \text{for } n = 1, 2, \ldots, N$$

—only difference to hard-margin: upper bound on $\alpha_n$

another (convex) QP, with $N$ variables & $2N + 1$ constraints
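A minimal sketch (not from the lecture) of how this dual could be handed to a generic QP solver; it assumes the cvxopt package and made-up names such as soft_margin_dual, with the transformed data in the rows of Z and labels y in {−1, +1}.

```python
import numpy as np
from cvxopt import matrix, solvers    # assumed QP solver; any convex QP package works

def soft_margin_dual(Z, y, C):
    """min_a 0.5 sum_{n,m} a_n a_m y_n y_m z_n^T z_m - sum_n a_n
       s.t. sum_n y_n a_n = 0 and 0 <= a_n <= C for all n."""
    N = len(y)
    Q = (np.outer(y, y) * (Z @ Z.T)).astype(float)     # Q[n, m] = y_n y_m z_n^T z_m
    q = -np.ones(N)
    G = np.vstack([-np.eye(N), np.eye(N)])             # stacks -a_n <= 0 and a_n <= C
    h = np.hstack([np.zeros(N), C * np.ones(N)])
    A = y.reshape(1, -1).astype(float)                 # equality: sum_n y_n a_n = 0
    sol = solvers.qp(matrix(Q), matrix(q), matrix(G), matrix(h), matrix(A), matrix(0.0))
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ Z                                # implicit w = sum_n a_n y_n z_n
    return alpha, w
```

The only change from the hard-margin dual of Lecture 2 is the extra upper bound a_n ≤ C carried by G and h.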

(13)

Soft-Margin Support Vector Machine Dual Problem

Fun Time

In the soft-margin SVM, assume that we want to increase the parameter C by 2. How shall the corresponding dual problem be changed?

1  the upper bound of α_n shall be halved
2  the upper bound of α_n shall be decreased by 2
3  the upper bound of α_n shall be increased by 2
4  the upper bound of α_n shall be doubled

Reference Answer: 3

Because C is exactly the upper bound of α_n, increasing C by 2 in the primal problem is equivalent to increasing the upper bound by 2 in the dual problem.

(15)

Soft-Margin Support Vector Machine Messages behind Soft-Margin SVM

Kernel Soft-Margin SVM

Kernel Soft-Margin SVM Algorithm

1  $q_{n,m} = y_n y_m K(x_n, x_m)$; $p = -1_N$; $(A, c)$ for equality/lower-bound/upper-bound constraints
2  $\alpha \leftarrow \mathrm{QP}(Q_D, p, A, c)$
3  $b \leftarrow$ ?
4  return SVs and their $\alpha_n$ as well as $b$ such that for new $x$,
$$g_{\mathrm{SVM}}(x) = \operatorname{sign}\Bigl(\sum_{\text{SV indices } n} \alpha_n y_n K(x_n, x) + b\Bigr)$$

• almost the same as hard-margin
• more flexible than hard-margin —primal/dual always solvable

remaining question: step 3?
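A hedged sketch of steps 1 and 2 with a Gaussian kernel: build Q_D from K(x_n, x_m) instead of z_n^T z_m and pass it to the same kind of QP solver. The helper gaussian_kernel and the cvxopt dependency are illustrative assumptions, not the lecture's code.

```python
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X1, X2, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * sq)

def kernel_soft_margin_dual(X, y, C, gamma):
    """Steps 1-2: q_{n,m} = y_n y_m K(x_n, x_m), then alpha <- QP(Q_D, p, A, c)."""
    N = len(y)
    K = gaussian_kernel(X, X, gamma)
    Q = (np.outer(y, y) * K).astype(float)
    G = np.vstack([-np.eye(N), np.eye(N)])             # encodes 0 <= alpha_n <= C
    h = np.hstack([np.zeros(N), C * np.ones(N)])
    A = y.reshape(1, -1).astype(float)                 # equality: sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)), matrix(G), matrix(h),
                     matrix(A), matrix(0.0))
    return np.ravel(sol['x']), K                       # step 3 (solving b) comes next
```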

(16)

Soft-Margin Support Vector Machine Messages behind Soft-Margin SVM

Solving for b

hard-margin SVM
• complementary slackness: $\alpha_n \bigl(1 - y_n(w^T z_n + b)\bigr) = 0$
• SV ($\alpha_s > 0$) ⇒ $b = y_s - w^T z_s$

soft-margin SVM
• complementary slackness: $\alpha_n \bigl(1 - \xi_n - y_n(w^T z_n + b)\bigr) = 0$ and $(C - \alpha_n)\,\xi_n = 0$
• SV ($\alpha_s > 0$) ⇒ $b = y_s - y_s \xi_s - w^T z_s$
• free SV ($\alpha_s < C$): $\xi_s = 0$

solve unique $b$ with free SV $(x_s, y_s)$:
$$b = y_s - \sum_{\text{SV indices } n} \alpha_n y_n K(x_n, x_s)$$
—range of $b$ otherwise
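Continuing the hypothetical helpers above, a sketch of step 3 and of the returned g_SVM: pick any free SV (0 < α_s < C), solve b from it, and predict with the SVs only. The tolerance tol guards against round-off in the QP solution and is an implementation detail, not from the slides.

```python
import numpy as np

def solve_b_and_predictor(alpha, X, y, C, kernel, tol=1e-6):
    sv = alpha > tol                                  # support vectors: alpha_n > 0
    free = sv & (alpha < C - tol)                     # free SVs: 0 < alpha_n < C, so xi_n = 0
    s = np.flatnonzero(free)[0]                       # any free SV determines the same b
    b = y[s] - np.sum(alpha[sv] * y[sv] * kernel(X[sv], X[[s]]).ravel())

    def g_svm(X_new):
        """g_SVM(x) = sign(sum over SV indices n of alpha_n y_n K(x_n, x) + b)."""
        return np.sign((alpha[sv] * y[sv]) @ kernel(X[sv], X_new) + b)

    return b, g_svm
```

For example, kernel could be `lambda A, B: gaussian_kernel(A, B, gamma)` from the earlier sketch.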

(17)

Soft-Margin Support Vector Machine Messages behind Soft-Margin SVM

Soft-Margin Gaussian SVM in Action

[figure: soft-margin Gaussian SVM boundaries for C = 1, C = 10, C = 100]

large C =⇒ less noise tolerance =⇒ ‘overfit’?

warning: SVM can still overfit :-(

soft-margin Gaussian SVM: need careful selection of (γ, C)

(18)

Soft-Margin Support Vector Machine Messages behind Soft-Margin SVM

Physical Meaning of α_n

complementary slackness:
$$\alpha_n \bigl(1 - \xi_n - y_n(w^T z_n + b)\bigr) = 0, \qquad (C - \alpha_n)\,\xi_n = 0$$

• non-SV ($\alpha_n = 0$): $\xi_n = 0$, ‘away from’/on fat boundary
• free SV ($0 < \alpha_n < C$): $\xi_n = 0$, on fat boundary, locates $b$
• bounded SV ($\alpha_n = C$): $\xi_n$ = violation amount, ‘violate’/on fat boundary

α_n can be used for data analysis
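To make the three cases concrete, a small illustrative helper (an assumption, not from the slides) that splits the training examples by their optimal α_n; the tolerance again absorbs numerical round-off.

```python
import numpy as np

def categorize_alphas(alpha, C, tol=1e-6):
    """Split examples into non-SVs, free SVs, and bounded SVs by alpha_n."""
    non_sv = alpha <= tol          # alpha_n = 0: xi_n = 0, away from / on the fat boundary
    bounded = alpha >= C - tol     # alpha_n = C: xi_n is the violation amount
    free = ~non_sv & ~bounded      # 0 < alpha_n < C: on the fat boundary, locates b
    return np.flatnonzero(non_sv), np.flatnonzero(free), np.flatnonzero(bounded)
```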

(19)

Soft-Margin Support Vector Machine Messages behind Soft-Margin SVM

Fun Time

For a data set of size 10000, after solving SVM, assume that there are 1126 support vectors, and 1000 of those support vectors are bounded. What is the possible range of E_in(g_SVM) in terms of 0/1 error?

1  0.0000 ≤ E_in(g_SVM) ≤ 0.1000
2  0.1000 ≤ E_in(g_SVM) ≤ 0.1126
3  0.1126 ≤ E_in(g_SVM) ≤ 0.5000
4  0.1126 ≤ E_in(g_SVM) ≤ 1.0000

Reference Answer: 1

The bounded support vectors are the only ones that could violate the fat boundary: ξ_n ≥ 0. If ξ_n ≥ 1, then the violation causes a 0/1 error on the example. On the other hand, it is also possible that ξ_n < 1, and in that case the violation does not cause a 0/1 error.

(21)

Soft-Margin Support Vector Machine Model Selection

Practical Need: Model Selection

• complicated even for (C, γ) of Gaussian SVM
• more combinations if including other kernels or parameters

how to select? validation :-)

(22)

Soft-Margin Support Vector Machine Model Selection

Selection by Cross Validation

[figure: E_cv on a 3×3 grid of (C, γ) values:
 0.3500  0.3250  0.3250
 0.2000  0.2250  0.2750
 0.1750  0.2250  0.2000]

• E_cv(C, γ): ‘non-smooth’ function of (C, γ) —difficult to optimize
• proper models can be chosen by V-fold cross validation on a few grid values of (C, γ)

E_cv: very popular criterion for soft-margin SVM
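Not part of the slides: one common way to carry out the suggested V-fold cross validation over a few grid values of (C, γ) is scikit-learn's GridSearchCV around SVC (its soft-margin SVM with an RBF kernel). The grid values and the names X_train, y_train are illustrative assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# V-fold cross validation (V = 5 here) over a small grid of (C, gamma)
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)                 # X_train, y_train: your data (assumed)
print(search.best_params_)                   # chosen (C, gamma)
print(1 - search.best_score_)                # corresponding E_cv estimate
```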

(23)

Soft-Margin Support Vector Machine Model Selection

Leave-One-Out CV Error for SVM

recall: E_loocv = E_cv with N folds

claim:
$$E_{\text{loocv}} \le \frac{\#\mathrm{SV}}{N}$$

for $(x_N, y_N)$: if optimal $\alpha_N = 0$ (non-SV) =⇒ $(\alpha_1, \alpha_2, \ldots, \alpha_{N-1})$ still optimal when leaving out $(x_N, y_N)$
—key: what if there’s better $\alpha_n$?

SVM: $g^{-} = g$ when leaving out non-SV
• $e_{\text{non-SV}} = \mathrm{err}(g^{-}, \text{non-SV}) = \mathrm{err}(g, \text{non-SV}) = 0$
• $e_{\mathrm{SV}} \le 1$

[figure: hard-margin example with boundary x1 − x2 − 1 = 0 and margin 0.707]

motivation from hard-margin SVM: only SVs needed

scaled #SV bounds leave-one-out CV error

(24)

Soft-Margin Support Vector Machine Model Selection

Selection by # SV

[figure: nSV on a 3×3 grid of (C, γ) values:
 38  37  37
 27  21  17
 21  18  19]

• nSV(C, γ): ‘non-smooth’ function of (C, γ) —difficult to optimize
• just an upper bound!
• dangerous models can be ruled out by nSV on a few grid values of (C, γ)

nSV: often used as a safety check if computing E_cv is too time-consuming
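A sketch of the safety check under the same scikit-learn assumption: fit on each grid point, count support vectors, and use #SV/N as the bound on E_loocv to rule out dangerous (C, γ) combinations.

```python
from sklearn.svm import SVC

N = len(y_train)                                  # X_train, y_train as assumed above
for C in [0.1, 1, 10]:
    for gamma in [0.01, 0.1, 1]:
        clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_train, y_train)
        n_sv = len(clf.support_)                  # number of support vectors
        print(f"C={C}, gamma={gamma}: nSV={n_sv}, E_loocv bound = {n_sv / N:.4f}")
```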

(25)

Soft-Margin Support Vector Machine Model Selection

Fun Time

For a data set of size 10000, after solving SVM on some parameters, assume that there are 1126 support vectors, and 1000 of those support vectors are bounded. Which of the following cannot be E_loocv with those parameters?

1  0.0000
2  0.0805
3  0.1111
4  0.5566

Reference Answer: 4

Note that the upper bound of E_loocv is 1126/10000 = 0.1126.

(27)

Soft-Margin Support Vector Machine Model Selection

Summary

1 Embedding Numerous Features: Kernel Models

Lecture 4: Soft-Margin Support Vector Machine
Motivation and Primal Problem: add margin violations ξ_n
Dual Problem: upper-bound α_n by C
Messages behind Soft-Margin SVM: bounded/free SVs for data analysis
Model Selection: cross-validation, or approximately nSV

next: other kernel models for soft binary classification

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models
