Machine Learning Techniques (機器學習技法)
Lecture 4: Soft-Margin Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 Embedding Numerous Features: Kernel Models
  Lecture 3: Kernel Support Vector Machine — kernel as a shortcut to (transform + inner product) to remove dependence on $\tilde{d}$: allowing a spectrum of simple (linear) models to infinite-dimensional (Gaussian) ones with margin control
  Lecture 4: Soft-Margin Support Vector Machine
    Motivation and Primal Problem
    Dual Problem
    Messages behind Soft-Margin SVM
    Model Selection
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
Cons of Hard-Margin SVM
recall: SVM can still overfit :-(
[figure omitted: the same noisy data fit with a Φ1 (linear) transform vs. a Φ4 (fourth-order) transform]
• part of the reason: the powerful transform Φ
• the other part: insisting on separable
if always insisting on separable (=⇒ shatter), the model has the power to overfit to noise
Give Up on Some Examples
want: give up on some noisy examples, minimizing the error count
$$\min_{b,w}\ \sum_{n=1}^{N} [\![\, y_n \ne \mathrm{sign}(w^T z_n + b) \,]\!]$$
hard-margin SVM:
$$\min_{b,w}\ \tfrac{1}{2} w^T w \quad \text{s.t. } y_n (w^T z_n + b) \ge 1 \text{ for all } n$$
combination:
$$\min_{b,w}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [\![\, y_n \ne \mathrm{sign}(w^T z_n + b) \,]\!]$$
$$\text{s.t. } y_n (w^T z_n + b) \ge 1 \text{ for correct } n; \quad y_n (w^T z_n + b) \ge -\infty \text{ for incorrect } n$$
C: trade-off of large margin & noise tolerance
Soft-Margin SVM (1/2)
$$\min_{b,w}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} [\![\, y_n \ne \mathrm{sign}(w^T z_n + b) \,]\!]$$
$$\text{s.t. } y_n (w^T z_n + b) \ge 1 - \infty \cdot [\![\, y_n \ne \mathrm{sign}(w^T z_n + b) \,]\!]$$
• $[\![\cdot]\!]$: non-linear, not QP anymore :-( — what about dual? kernel?
• cannot distinguish small error (slightly away from fat boundary) from large error (a...w...a...y... from fat boundary)
• record 'margin violation' by $\xi_n$ — linear constraints
• penalize with margin violation instead of error count — quadratic objective
soft-margin SVM:
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t. } y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$
Soft-Margin SVM (2/2)
• record 'margin violation' by $\xi_n$
• penalize with margin violation
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t. } y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$
[figure omitted: a fat boundary with one example inside it; the amount by which the example falls short of the boundary is its violation $\xi_n$]
• parameter C: trade-off of large margin & margin violation
  • large C: want less margin violation
  • small C: want a large margin
• QP of $\tilde{d} + 1 + N$ variables and 2N constraints (a solver sketch follows below)
next: remove the dependence on $\tilde{d}$ by soft-margin SVM primal ⇒ dual?
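Before moving to the dual, here is a minimal sketch of the primal QP just stated, assuming the cvxopt package is available; the helper name svm_primal and the tiny diagonal ridge on P are illustrative choices, not part of the lecture.

```python
# Sketch: soft-margin SVM primal as a QP over u = [b; w; xi]
import numpy as np
from cvxopt import matrix, solvers

def svm_primal(Z, y, C=1.0):
    N, d = Z.shape                      # N examples in the d~-dimensional Z-space
    L = 1 + d + N                       # d~ + 1 + N variables, as on the slide
    P = np.zeros((L, L))
    P[1:1 + d, 1:1 + d] = np.eye(d)     # objective: (1/2) w^T w ...
    P += 1e-9 * np.eye(L)               # illustrative ridge; keeps the KKT system solvable
    q = np.hstack([np.zeros(1 + d), C * np.ones(N)])  # ... + C * sum_n xi_n
    # y_n (w^T z_n + b) >= 1 - xi_n  <=>  -y_n b - y_n z_n^T w - xi_n <= -1
    G1 = np.hstack([-y[:, None], -y[:, None] * Z, -np.eye(N)])
    # xi_n >= 0  <=>  -xi_n <= 0
    G2 = np.hstack([np.zeros((N, 1 + d)), -np.eye(N)])
    G, h = np.vstack([G1, G2]), np.hstack([-np.ones(N), np.zeros(N)])  # 2N rows
    u = np.ravel(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x'])
    return u[0], u[1:1 + d], u[1 + d:]  # b, w, xi
```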
Fun Time
At the optimal solution of
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t. } y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n,$$
assume that $y_1 (w^T z_1 + b) = -10$. What is the corresponding $\xi_1$?
1. 1
2. 11
3. 21
4. 31
Reference Answer: 2
$\xi_1$ is simply $1 - y_1 (w^T z_1 + b)$ when $y_1 (w^T z_1 + b) \le 1$.
Lagrange Dual
primal:
$$\min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n \quad \text{s.t. } y_n (w^T z_n + b) \ge 1 - \xi_n \text{ and } \xi_n \ge 0 \text{ for all } n$$
Lagrange function with Lagrange multipliers $\alpha_n$ and $\beta_n$:
$$\mathcal{L}(b, w, \xi, \alpha, \beta) = \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \bigl(1 - \xi_n - y_n (w^T z_n + b)\bigr) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n)$$
want: Lagrange dual
$$\max_{\alpha_n \ge 0,\ \beta_n \ge 0} \Bigl( \min_{b,w,\xi}\ \mathcal{L}(b, w, \xi, \alpha, \beta) \Bigr)$$
Simplify ξ_n and β_n
$$\max_{\alpha_n \ge 0,\ \beta_n \ge 0} \Bigl( \min_{b,w,\xi}\ \tfrac{1}{2} w^T w + C \cdot \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \bigl(1 - \xi_n - y_n (w^T z_n + b)\bigr) + \sum_{n=1}^{N} \beta_n \cdot (-\xi_n) \Bigr)$$
• $\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 = C - \alpha_n - \beta_n$
• no loss of optimality if solving with implicit constraint $\beta_n = C - \alpha_n$ and explicit constraint $0 \le \alpha_n \le C$: $\beta_n$ removed
• ξ can also be removed :-), like how we removed b, because the leftover ξ-term vanishes under $\beta_n = C - \alpha_n$:
$$\max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n} \Bigl( \min_{b,w,\xi}\ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr) + \underbrace{\sum_{n=1}^{N} (C - \alpha_n - \beta_n) \cdot \xi_n}_{=\ 0} \Bigr)$$
Other Simplifications
$$\max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n} \Bigl( \min_{b,w}\ \tfrac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr) \Bigr)$$
familiar? :-)
• inner problem same as hard-margin SVM
• $\frac{\partial \mathcal{L}}{\partial b} = 0$: no loss of optimality if solving with constraint $\sum_{n=1}^{N} \alpha_n y_n = 0$
• $\frac{\partial \mathcal{L}}{\partial w_i} = 0$: no loss of optimality if solving with constraint $w = \sum_{n=1}^{N} \alpha_n y_n z_n$
standard dual can be derived using the same steps as Lecture 2
Standard Soft-Margin SVM Dual
$$\min_{\alpha}\ \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$$
subject to $\sum_{n=1}^{N} y_n \alpha_n = 0$; $0 \le \alpha_n \le C$, for $n = 1, 2, \ldots, N$;
implicitly $w = \sum_{n=1}^{N} \alpha_n y_n z_n$; $\beta_n = C - \alpha_n$, for $n = 1, 2, \ldots, N$
— only difference to hard-margin: upper bound on $\alpha_n$
another (convex) QP, with N variables & 2N + 1 constraints (a solver sketch follows below)
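A minimal sketch of this dual with a Gaussian kernel, again assuming cvxopt; the function name svm_dual and the gamma parameter are illustrative, not from the lecture.

```python
# Sketch: standard soft-margin SVM dual as a QP in alpha
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C=1.0, gamma=1.0):
    N = len(y)
    sq = np.sum(X ** 2, axis=1)                       # Gaussian kernel matrix
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    Q = (y[:, None] * y[None, :]) * K                 # q_{n,m} = y_n y_m K(x_n, x_m)
    p = matrix(-np.ones(N))                           # minimize (1/2) a^T Q a - 1^T a
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))    # -alpha_n <= 0 and alpha_n <= C:
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))  # the only change vs. hard-margin
    A = matrix(y.astype(float).reshape(1, -1))        # equality: sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), p, G, h, A, matrix(0.0))
    return np.ravel(sol['x']), K                      # alpha (K reused for b below)
```

Here cvxopt's solvers.qp plays the role of the QP(·) routine named on the next slide; dropping the upper-bound rows of (G, h) recovers the hard-margin dual.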
Fun Time
In the soft-margin SVM, assume that we want to increase the parameter C by 2. How shall the corresponding dual problem be changed?
1. the upper bound of $\alpha_n$ shall be halved
2. the upper bound of $\alpha_n$ shall be decreased by 2
3. the upper bound of $\alpha_n$ shall be increased by 2
4. the upper bound of $\alpha_n$ shall be doubled
Reference Answer: 3
Because C is exactly the upper bound of $\alpha_n$, increasing C by 2 in the primal problem is equivalent to increasing the upper bound by 2 in the dual problem.
Kernel Soft-Margin SVM
Kernel Soft-Margin SVM Algorithm
1. $q_{n,m} = y_n y_m K(x_n, x_m)$; $p = -1_N$; (A, c) for the equality/lower-bound/upper-bound constraints
2. $\alpha \leftarrow \mathrm{QP}(Q_D, p, A, c)$
3. $b \leftarrow$ ?
4. return SVs and their $\alpha_n$ as well as b such that for new x,
$$g_{\mathrm{SVM}}(x) = \mathrm{sign}\Bigl( \sum_{\text{SV indices } n} \alpha_n y_n K(x_n, x) + b \Bigr)$$
• almost the same as hard-margin
• more flexible than hard-margin — primal/dual always solvable
remaining question: step 3?
Solving for b
hard-margin SVM
complementary slackness: $\alpha_n \bigl(1 - y_n (w^T z_n + b)\bigr) = 0$
• SV ($\alpha_s > 0$) ⇒ $b = y_s - w^T z_s$
soft-margin SVM
complementary slackness: $\alpha_n \bigl(1 - \xi_n - y_n (w^T z_n + b)\bigr) = 0$ and $(C - \alpha_n)\, \xi_n = 0$
• SV ($\alpha_s > 0$) ⇒ $b = y_s - y_s \xi_s - w^T z_s$
• free ($\alpha_s < C$) ⇒ $\xi_s = 0$
solve unique b with a free SV $(x_s, y_s)$:
$$b = y_s - \sum_{\text{SV indices } n} \alpha_n y_n K(x_n, x_s)$$
— only a range of b can be determined otherwise (a sketch continuing the dual QP follows below)
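Continuing the svm_dual sketch above, a free SV pins down b; the tolerance tol is an illustrative numerical choice.

```python
# Sketch: recover b from a free SV (0 < alpha_s < C), continuing svm_dual above
import numpy as np

def solve_b(alpha, y, K, C, tol=1e-6):
    sv = alpha > tol                       # support vectors (alpha_n > 0)
    free = sv & (alpha < C - tol)          # free SVs have xi_s = 0
    s = np.flatnonzero(free)[0]            # any free SV works
    # b = y_s - sum over SV indices n of alpha_n y_n K(x_n, x_s)
    return y[s] - np.sum(alpha[sv] * y[sv] * K[sv, s])
```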
Soft-Margin Gaussian SVM in Action
[figure omitted: decision boundaries for C = 1, C = 10, C = 100]
• large C =⇒ less noise tolerance =⇒ 'overfit'?
• warning: SVM can still overfit :-(
soft-margin Gaussian SVM: needs careful selection of (γ, C) (a small sweep over C follows below)
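A quick way to watch the effect of C, assuming scikit-learn is available; SVC with an RBF kernel is a soft-margin Gaussian SVM, and make_moons is just a stand-in noisy data set.

```python
# Sketch: larger C tolerates less noise, so training error typically shrinks
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)  # stand-in noisy data
for C in (1, 10, 100):
    clf = SVC(kernel='rbf', gamma=1.0, C=C).fit(X, y)
    print(C, 1 - clf.score(X, y))  # 0/1 training error as C grows
```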
Physical Meaning of α_n
complementary slackness: $\alpha_n \bigl(1 - \xi_n - y_n (w^T z_n + b)\bigr) = 0$ and $(C - \alpha_n)\, \xi_n = 0$
• non-SV ($\alpha_n = 0$): $\xi_n = 0$, 'away from' or on the fat boundary
• free SV ($0 < \alpha_n < C$): $\xi_n = 0$, on the fat boundary, locates b
• bounded SV ($\alpha_n = C$): $\xi_n$ = violation amount, 'violating' or on the fat boundary
$\alpha_n$ can be used for data analysis
Fun Time
For a data set of size 10000, after solving SVM, assume that there are 1126 support vectors, and 1000 of those support vectors are bounded. What is the possible range of $E_{\mathrm{in}}(g_{\mathrm{SVM}})$ in terms of 0/1 error?
1. $0.0000 \le E_{\mathrm{in}}(g_{\mathrm{SVM}}) \le 0.1000$
2. $0.1000 \le E_{\mathrm{in}}(g_{\mathrm{SVM}}) \le 0.1126$
3. $0.1126 \le E_{\mathrm{in}}(g_{\mathrm{SVM}}) \le 0.5000$
4. $0.1126 \le E_{\mathrm{in}}(g_{\mathrm{SVM}}) \le 1.0000$
Reference Answer: 1
The bounded support vectors are the only ones that could violate the fat boundary: $\xi_n \ge 0$. If $\xi_n \ge 1$, then the violation causes a 0/1 error on the example. On the other hand, it is also possible that $\xi_n < 1$, and in that case the violation does not cause a 0/1 error.
Practical Need: Model Selection
[figure omitted: a grid of decision boundaries under different (C, γ) combinations]
• complicated even for the (C, γ) of Gaussian SVM
• more combinations if including other kernels or parameters
how to select? validation :-)
Selection by Cross Validation
[figure omitted: $E_{\mathrm{cv}}$ over a 3 × 3 grid of (C, γ) values]
  0.3500  0.3250  0.3250
  0.2000  0.2250  0.2750
  0.1750  0.2250  0.2000
• $E_{\mathrm{cv}}(C, \gamma)$: a 'non-smooth' function of (C, γ) — difficult to optimize
• proper models can be chosen by V-fold cross validation on a few grid values of (C, γ)
$E_{\mathrm{cv}}$: a very popular criterion for soft-margin SVM (a grid-search sketch follows below)
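A minimal grid-search sketch with scikit-learn (assumed available); the grid values mirror the 3 × 3 table above but are otherwise illustrative.

```python
# Sketch: V-fold cross validation over a few (C, gamma) grid values
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
grid = {'C': [1, 10, 100], 'gamma': [1, 10, 100]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5).fit(X, y)  # V = 5 folds
print(search.best_params_, 1 - search.best_score_)  # chosen (C, gamma) and its E_cv
```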
Leave-One-Out CV Error for SVM
recall: $E_{\mathrm{loocv}} = E_{\mathrm{cv}}$ with N folds; claim: $E_{\mathrm{loocv}} \le \frac{\#\mathrm{SV}}{N}$
• for $(x_N, y_N)$: if the optimal $\alpha_N = 0$ (non-SV) =⇒ $(\alpha_1, \alpha_2, \ldots, \alpha_{N-1})$ still optimal when leaving out $(x_N, y_N)$
  — key: what if there were a better $\alpha_n$? (appending $\alpha_N = 0$ would then give a better solution of the full problem, a contradiction)
• SVM: $g^- = g$ when leaving out a non-SV, so $e_{\text{non-SV}} = \mathrm{err}(g^-, \text{non-SV}) = \mathrm{err}(g, \text{non-SV}) = 0$, while $e_{\mathrm{SV}} \le 1$
[figure omitted: separating hyperplane $x_1 - x_2 - 1 = 0$ with margin 0.707]
motivation from hard-margin SVM: only SVs are needed
scaled #SV bounds the leave-one-out CV error (a numeric check follows below)
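A numeric check of the claim on a tiny data set, assuming scikit-learn; LeaveOneOut runs N fits, so N is kept small here.

```python
# Sketch: check E_loocv <= #SV / N empirically
from sklearn.datasets import make_moons
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=60, noise=0.25, random_state=7)
clf = SVC(kernel='rbf', gamma=1.0, C=10)
e_loocv = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
n_sv = clf.fit(X, y).n_support_.sum()   # #SV of the model trained on all N examples
print(e_loocv, n_sv / len(y))           # expect the left number <= the right one
```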
Selection by # SV
[figure omitted: nSV over the same 3 × 3 grid of (C, γ) values]
  38  37  37
  27  21  17
  21  18  19
• nSV(C, γ): a 'non-smooth' function of (C, γ) — difficult to optimize
• just an upper bound!
• dangerous models can be ruled out by nSV on a few grid values of (C, γ)
nSV: often used as a safety check if computing $E_{\mathrm{cv}}$ is too time-consuming (a counting sketch follows below)
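A counting sketch with scikit-learn (assumed available); n_support_ reports the number of SVs per class after fitting.

```python
# Sketch: nSV as a cheap safety check over a (C, gamma) grid
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for C in (1, 10, 100):
    for gamma in (1, 10, 100):
        n_sv = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y).n_support_.sum()
        print(C, gamma, n_sv / len(y))  # a large ratio flags a dangerous model
```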
Fun Time
For a data set of size 10000, after solving SVM on some parameters, assume that there are 1126 support vectors, and 1000 of those support vectors are bounded. Which of the following cannot be $E_{\mathrm{loocv}}$ with those parameters?
1. 0.0000
2. 0.0805
3. 0.1111
4. 0.5566
Reference Answer: 4
Note that the upper bound of $E_{\mathrm{loocv}}$ is $\frac{1126}{10000} = 0.1126$.
Summary
1 Embedding Numerous Features: Kernel Models
  Lecture 4: Soft-Margin Support Vector Machine
    Motivation and Primal Problem: add margin violations $\xi_n$
    Dual Problem: upper-bound $\alpha_n$ by C
    Messages behind Soft-Margin SVM: bounded/free SVs for data analysis
    Model Selection: cross-validation, or approximately nSV
  • next: other kernel models for soft binary classification
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models