Machine Learning Techniques (機器學習技法)
Lecture 2: Dual Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap

1 Embedding Numerous Features: Kernel Models

  Lecture 1: Linear Support Vector Machine
    linear SVM: more robust, and solvable with quadratic programming

  Lecture 2: Dual Support Vector Machine
    Motivation of Dual SVM
    Lagrange Dual SVM
    Solving Dual SVM
    Messages behind Dual SVM

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models
Non-Linear Support Vector Machine Revisited

  min_{b,w} (1/2) w^T w
  s.t. y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N,  with z_n = Φ(x_n)

Non-Linear Hard-Margin SVM
1. Q = [[0, 0_d̃^T], [0_d̃, I_d̃]]; p = 0_{d̃+1}; a_n^T = y_n [1, z_n^T]; c_n = 1
2. [b; w] ← QP(Q, p, A, c)
3. return b ∈ R and w ∈ R^d̃ with g_SVM(x) = sign(w^T Φ(x) + b)

• demanded: not many (large-margin), but sophisticated boundary (feature transform)
• QP with d̃ + 1 variables and N constraints: challenging if d̃ is large, or infinite?! :-)

goal: SVM without dependence on d̃
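To make steps 1-2 concrete, here is a minimal sketch of this primal QP in Python, assuming the cvxopt package as a stand-in for the abstract QP(Q, p, A, c) routine (cvxopt is not part of the lecture; its convention is min (1/2) u^T Q u + p^T u subject to G u ≤ h, so the ≥ constraints are negated):

```python
# A minimal sketch of the primal hard-margin SVM QP, assuming the cvxopt
# package; variable u = [b; w] as in step 2 of the slide.
import numpy as np
from cvxopt import matrix, solvers

def primal_svm(Z, y):
    """Z: (N, d) transformed inputs z_n = Phi(x_n); y: (N,) labels in {-1, +1}."""
    N, d = Z.shape
    Q = np.zeros((d + 1, d + 1))   # objective (1/2) u^T Q u penalizes w only
    Q[1:, 1:] = np.eye(d)
    Q[0, 0] = 1e-9                 # tiny ridge on b; some solvers dislike singular Q
    p = np.zeros(d + 1)
    # y_n (w^T z_n + b) >= 1, rewritten as G u <= h for cvxopt's convention
    G = -y[:, None] * np.hstack([np.ones((N, 1)), Z])
    h = -np.ones(N)
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]             # optimal b and w
```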
Todo: SVM 'without' d̃

Original SVM: (convex) QP of
• d̃ + 1 variables
• N constraints

'Equivalent' SVM: (convex) QP of
• N variables
• N + 1 constraints

Warning: Heavy Math!!!!!!
• introduce some necessary math without rigor, to help understand SVM more deeply
• 'claim' some results if the details are unnecessary, like how we 'claimed' Hoeffding

'Equivalent' SVM: based on some dual problem of the Original SVM
Key Tool: Lagrange Multipliers

Regularization by constrained-minimizing E_in:
  min_w E_in(w)  s.t. w^T w ≤ C

⇔ Regularization by minimizing E_aug:
  min_w E_aug(w) = E_in(w) + (λ/N) w^T w

• C is equivalent to some λ ≥ 0, by checking the optimality condition ∇E_in(w) + (2λ/N) w = 0
• regularization: view λ as a given parameter instead of C, and solve 'easily'
• dual SVM: view the λ's as unknowns given the constraints, and solve them as variables instead

how many λ's as variables? N: one per constraint
Starting Point: Constrained to 'Unconstrained'

  min_{b,w} (1/2) w^T w  s.t. y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N

Lagrange Function: with Lagrange multipliers α_n (the λ_n of the previous slide, renamed),

  L(b, w, α) = (1/2) w^T w  [objective]  +  Σ_{n=1}^N α_n (1 − y_n (w^T z_n + b))  [constraints]

Claim:
  SVM ≡ min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) = min_{b,w} ( ∞ if violating; (1/2) w^T w if feasible )

• any 'violating' (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (some positive) ) → ∞
• any 'feasible' (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (all non-positive) ) = (1/2) w^T w

constraints now hidden in max
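The claim can be sanity-checked numerically. Here is a small sketch on hypothetical two-point data (not from the slides): for a feasible (b, w) the inner max settles at (1/2) w^T w, while for a violating (b, w) it grows without bound as the α's increase.

```python
# Numeric sanity check of the claim, on hypothetical two-point data.
import numpy as np

Z = np.array([[1.0], [-1.0]])   # z_1 = 1, z_2 = -1
y = np.array([+1.0, -1.0])

def L(b, w, alpha):
    """L(b, w, alpha) = (1/2) w^T w + sum_n alpha_n (1 - y_n (w^T z_n + b))."""
    return 0.5 * w @ w + alpha @ (1 - y * (Z @ w + b))

w_ok,  b_ok  = np.array([2.0]), 0.0   # feasible: both margins are 2 >= 1
w_bad, b_bad = np.array([0.5]), 0.0   # violating: both margins are 0.5 < 1
for a in [0.0, 1.0, 10.0, 1000.0]:
    print(a, L(b_ok, w_ok, np.array([a, a])), L(b_bad, w_bad, np.array([a, a])))
# feasible: L = 2 - 2a, so max over alpha >= 0 is (1/2) w^T w = 2, at alpha = 0
# violating: L = 0.125 + a, so max over alpha >= 0 is unbounded (infinity)
```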
Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. What is the Lagrange function L(b, w, α) of hard-margin SVM?

1. (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z + b)
2. (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z + b)
3. (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z − b)
4. (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z − b)

Reference Answer: 2

By definition, L(b, w, α) = (1/2) w^T w + α_1 (1 − y_1 (w^T z_1 + b)) + α_2 (1 − y_2 (w^T z_2 + b)) with (z_1, y_1) = (z, +1) and (z_2, y_2) = (−z, −1).
Lagrange Dual Problem

For any fixed α′ with all α′_n ≥ 0,

  min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) ≥ min_{b,w} L(b, w, α′)

because max ≥ any. For the best α′ ≥ 0 on the RHS,

  min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) ≥ max_{all α′_n ≥ 0} min_{b,w} L(b, w, α′)  [Lagrange dual problem]

because the best is one of any.

Lagrange dual problem: 'outer' maximization of α on a lower bound of the original problem
Strong Duality of Quadratic Programming

  min_{b,w} max_{all α_n ≥ 0} L(b, w, α)  [equiv. to original (primal) SVM]
  ≥ max_{all α_n ≥ 0} min_{b,w} L(b, w, α)  [Lagrange dual]

• '≥': weak duality
• '=': strong duality; true for QP if
  • convex primal
  • feasible primal (true if Φ-separable)
  • linear constraints
  (called constraint qualification)

there exists a primal-dual optimal solution (b, w, α) for both sides
Solving Lagrange Dual: Simplifications (1/2)

  max_{all α_n ≥ 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n (w^T z_n + b)) )
  [inner objective = L(b, w, α)]

• inner problem 'unconstrained'; at optimal: ∂L(b, w, α)/∂b = 0 = −Σ_{n=1}^N α_n y_n
• no loss of optimality if solving with constraint Σ_{n=1}^N α_n y_n = 0
• but wait, b can then be removed:

  max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n w^T z_n) − (Σ_{n=1}^N α_n y_n) · b )

  where the crossed-out last term is zero under the new constraint!
Solving Lagrange Dual: Simplifications (2/2)

  max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n (1 − y_n w^T z_n) )

• inner problem 'unconstrained'; at optimal: ∂L(b, w, α)/∂w_i = 0 = w_i − Σ_{n=1}^N α_n y_n z_{n,i}
• no loss of optimality if solving with constraint w = Σ_{n=1}^N α_n y_n z_n
• but wait!

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n} ( min_{b,w} (1/2) w^T w + Σ_{n=1}^N α_n − w^T w )

  ⇐⇒ max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n
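Spelling out the last substitution (plain algebra, consistent with the slide): with w = Σ_n α_n y_n z_n plugged into the inner objective,

```latex
\tfrac{1}{2} w^T w + \sum_{n=1}^{N}\alpha_n
  - w^T \underbrace{\sum_{n=1}^{N}\alpha_n y_n z_n}_{=\,w}
= \tfrac{1}{2} w^T w + \sum_{n=1}^{N}\alpha_n - w^T w
= -\tfrac{1}{2}\Bigl\|\sum_{n=1}^{N}\alpha_n y_n z_n\Bigr\|^2 + \sum_{n=1}^{N}\alpha_n .
```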
KKT Optimality Conditions

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n

If primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0; w = Σ α_n y_n z_n
• primal-inner optimal (at optimal, all 'Lagrange terms' disappear): α_n (1 − y_n (w^T z_n + b)) = 0

These are called the Karush-Kuhn-Tucker (KKT) conditions: necessary for optimality [and sufficient here].

will use KKT to 'solve' (b, w) from the optimal α
Fun Time

For a single variable w, consider minimizing (1/2) w² subject to two linear constraints w ≥ 1 and w ≤ 3. We know that the Lagrange function L(w, α) = (1/2) w² + α_1 (1 − w) + α_2 (w − 3). Which of the following equations that contain α are among the KKT conditions of the optimization problem?

1. α_1 ≥ 0 and α_2 ≥ 0
2. w = α_1 − α_2
3. α_1 (1 − w) = 0 and α_2 (w − 3) = 0
4. all of the above

Reference Answer: 4

1 contains the dual-feasible constraints; 2 contains the dual-inner-optimal constraints; 3 contains the primal-inner-optimal constraints.
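A quick numeric check of the answer (hand-derived, not from the slides): the optimum of this toy problem is w = 1 with α = (1, 0), and it satisfies all four KKT conditions at once.

```python
# Hand-derived optimum of min (1/2) w^2 s.t. w >= 1, w <= 3, checked against
# the four KKT conditions of the previous slide.
w, a1, a2 = 1.0, 1.0, 0.0
assert 1 <= w <= 3                               # primal feasible
assert a1 >= 0 and a2 >= 0                       # dual feasible
assert w - a1 + a2 == 0                          # dual-inner optimal: dL/dw = 0
assert a1 * (1 - w) == 0 and a2 * (w - 3) == 0   # primal-inner optimal
```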
Dual Formulation of Support Vector Machine

  max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^N α_n y_n z_n‖² + Σ_{n=1}^N α_n

standard hard-margin SVM dual:

  min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
  subject to Σ_{n=1}^N y_n α_n = 0; α_n ≥ 0, for n = 1, 2, ..., N

(convex) QP of N variables and N + 1 constraints, as promised

how to solve? yeah, we know QP! :-)
Dual SVM with QP Solver

optimal α ← QP(Q, p, A, c), where the dual

  min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
  subject to Σ_{n=1}^N y_n α_n = 0; α_n ≥ 0, for n = 1, 2, ..., N

maps to the standard form

  min_α (1/2) α^T Q α + p^T α  subject to a_i^T α ≥ c_i, for i = 1, 2, ...

with
• q_{n,m} = y_n y_m z_n^T z_m
• p = −1_N
• a_≥ = y, a_≤ = −y (the equality constraint written as two inequalities); a_n = n-th unit direction
• c_≥ = 0, c_≤ = 0; c_n = 0

note: many solvers treat the equality (a_≥, a_≤) and bound (a_n) constraints specially for numerical stability
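As with the primal sketch earlier, here is a hedged rendering of the dual QP using cvxopt as a general-purpose solver (cvxopt takes the equality constraint directly rather than as the two inequalities a_≥, a_≤ above); the special large-N solvers of the next slide are what one would prefer in practice.

```python
# A sketch of the dual hard-margin SVM QP, again assuming the cvxopt package.
import numpy as np
from cvxopt import matrix, solvers

def dual_svm(Z, y):
    """Z: (N, d) transformed inputs; y: (N,) labels in {-1, +1}; returns alpha."""
    N = Z.shape[0]
    YZ = y[:, None] * Z                    # n-th row is y_n z_n^T
    Q = YZ @ YZ.T                          # q_{n,m} = y_n y_m z_n^T z_m
    p = -np.ones(N)                        # p = -1_N
    G, h = -np.eye(N), np.zeros(N)         # alpha_n >= 0, written as -alpha_n <= 0
    A, b = y.reshape(1, N), np.zeros(1)    # equality constraint: sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h),
                     matrix(A), matrix(b))
    return np.array(sol['x']).ravel()      # optimal alpha
```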
Dual SVM with Special QP Solver

optimal α ← QP(Q_D, p, A, c):

  min_α (1/2) α^T Q_D α + p^T α
  subject to special equality and bound constraints

• q_{n,m} = y_n y_m z_n^T z_m, often non-zero
• if N = 30,000, the dense Q_D (N by N, symmetric) takes > 3G RAM (about N²/2 doubles ≈ 3.6 GB)
• need a special solver for
  • not storing the whole Q_D
  • utilizing the special constraints properly, to scale up to large N

usually better to use a special solver in practice
Optimal (b, w)

KKT conditions: if primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0; w = Σ α_n y_n z_n
• primal-inner optimal (at optimal, all 'Lagrange terms' disappear): α_n (1 − y_n (w^T z_n + b)) = 0 (complementary slackness)

• optimal α =⇒ optimal w? easy from the above!
• optimal α =⇒ optimal b? a range from primal feasibility, plus an equality from complementary slackness: if one α_n > 0, then b = y_n − w^T z_n

complementary slackness: α_n > 0 ⇒ (z_n, y_n) on the fat boundary (an SV!)
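Continuing the dual_svm sketch from before (an illustration, not the lecture's own code), the two bullets translate directly; a small tolerance is needed because a numerical solver returns α's that are only approximately zero.

```python
# Recovering the optimal (b, w) from the optimal alpha via the KKT conditions.
# Z, y: the transformed data passed to dual_svm in the earlier sketch.
alpha = dual_svm(Z, y)
sv = alpha > 1e-6            # support vectors: alpha_n > 0 (up to tolerance)
w = (alpha * y) @ Z          # w = sum_n alpha_n y_n z_n
n = int(np.argmax(alpha))    # any SV works; take the one with the largest alpha
b = y[n] - w @ Z[n]          # complementary slackness: b = y_n - w^T z_n
```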
Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. After solving the dual problem of hard-margin SVM, assume that the optimal α_1 and α_2 are both strictly positive. What is the optimal b?

1. −1
2. 0
3. 1
4. not certain with the descriptions above

Reference Answer: 2

With the descriptions, at the optimal (b, w), b = +1 − w^T z = −1 + w^T z. That is, w^T z = 1 and b = 0.
Support Vectors Revisited

• examples on the boundary 'locate' the fattest hyperplane; other examples: not needed
• examples with α_n > 0: on the boundary
• call the α_n > 0 examples (z_n, y_n) support vectors (no longer just candidates)
• SV (positive α_n) ⊆ SV candidates (on the boundary)

[figure: fattest hyperplane x_1 − x_2 − 1 = 0 with margin 0.707, support vectors on the boundary]

• only SVs needed to compute w: w = Σ_{n=1}^N α_n y_n z_n = Σ_{SV} α_n y_n z_n
• only SVs needed to compute b: b = y_n − w^T z_n with any SV (z_n, y_n)

SVM: learn the fattest hyperplane by identifying the support vectors with the dual optimal solution
Representation of Fattest Hyperplane

SVM:  w_SVM = Σ_{n=1}^N α_n (y_n z_n), with α_n from the dual solution
PLA:  w_PLA = Σ_{n=1}^N β_n (y_n z_n), with β_n counting mistake corrections

w = linear combination of y_n z_n
• also true for GD/SGD-based LogReg/LinReg when w_0 = 0
• call w 'represented' by the data

SVM: represent w by the SVs only
Summary: Two Forms of Hard-Margin SVM

Primal Hard-Margin SVM
  min_{b,w} (1/2) w^T w
  sub. to y_n (w^T z_n + b) ≥ 1, for n = 1, 2, ..., N
• d̃ + 1 variables, N constraints: suitable when d̃ + 1 is small
• physical meaning: locate the specially-scaled (b, w)

Dual Hard-Margin SVM
  min_α (1/2) α^T Q_D α − 1^T α
  s.t. y^T α = 0; α_n ≥ 0, for n = 1, ..., N
• N variables, N + 1 simple constraints: suitable when N is small
• physical meaning: locate the SVs (z_n, y_n) and their α_n

both eventually result in the optimal (b, w) for the fattest hyperplane g_SVM(x) = sign(w^T Φ(x) + b)
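Putting the two sketches side by side on a tiny hypothetical dataset, both forms should agree on the resulting fattest hyperplane:

```python
# Hypothetical 4-point dataset, separable by the first coordinate; the fattest
# hyperplane is x_1 = 0, i.e., w = (0.5, 0) and b = 0 after the SVM scaling.
import numpy as np
Z = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, -2.0], [-2.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

b_p, w_p = primal_svm(Z, y)   # primal: d + 1 = 3 variables, N = 4 constraints
alpha = dual_svm(Z, y)        # dual: N = 4 variables, N + 1 = 5 constraints
w_d = (alpha * y) @ Z
n = int(np.argmax(alpha))
b_d = y[n] - w_d @ Z[n]
print(w_p, b_p)               # approximately (0.5, 0) and 0
print(w_d, b_d)               # should match the primal solution
```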
Are We Done Yet?

goal: SVM without dependence on d̃

  min_α (1/2) α^T Q_D α − 1^T α
  subject to y^T α = 0; α_n ≥ 0, for n = 1, 2, ..., N

• N variables, N + 1 constraints: no dependence on d̃?
• q_{n,m} = y_n y_m z_n^T z_m: an inner product in R^d̃, which is O(d̃) via naïve computation!

no dependence only if avoiding naïve computation (next lecture :-))
Fun Time

Consider applying dual hard-margin SVM on N = 5566 examples and getting 1126 SVs. Which of the following can be the number of examples that are on the fat boundary, that is, the SV candidates?

1. 0
2. 1024
3. 1234
4. 9999

Reference Answer: 3

Because SVs are always on the fat boundary, #SVs ≤ #SV candidates ≤ N.
Summary

1 Embedding Numerous Features: Kernel Models

  Lecture 2: Dual Support Vector Machine
    Motivation of Dual SVM: want to remove the dependence on d̃
    Lagrange Dual SVM: KKT conditions link primal and dual
    Solving Dual SVM: another QP, better solved with a special solver
    Messages behind Dual SVM: SVs represent the fattest hyperplane

  • next: computing inner products in R^d̃ efficiently

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models