(1)

Machine Learning Techniques (機器學習技法)

Lecture 2: Dual Support Vector Machine

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University (國立台灣大學資訊工程系)


(2)

Dual Support Vector Machine

Roadmap

1 Embedding Numerous Features: Kernel Models

Lecture 1: Linear Support Vector Machine
linear SVM: more robust and solvable with quadratic programming

Lecture 2: Dual Support Vector Machine
Motivation of Dual SVM
Lagrange Dual SVM
Solving Dual SVM
Messages behind Dual SVM

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models


(3)

Dual Support Vector Machine Motivation of Dual SVM

Non-Linear Support Vector Machine Revisited

min_{b,w}  (1/2) w^T w
s.t.  y_n (w^T z_n + b) ≥ 1, for n = 1, 2, . . . , N,  where z_n = Φ(x_n)

Non-Linear Hard-Margin SVM

1  Q = [[0, 0_d̃^T], [0_d̃, I_d̃]];  p = 0_{d̃+1};  a_n^T = y_n [1, z_n^T];  c_n = 1

2  [b; w] ← QP(Q, p, A, c)

3  return b ∈ R and w ∈ R^d̃ with g_SVM(x) = sign(w^T Φ(x) + b)

demanded: not many (large-margin), but sophisticated boundary (feature transform)

QP with d̃ + 1 variables and N constraints: challenging if d̃ is large, or infinite?! :-)

goal: SVM without dependence on d̃
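Before turning to the dual, here is a minimal sketch of how the primal QP above could be handed to a generic solver. It assumes the cvxopt package as the off-the-shelf QP solver and that the data are already transformed (row n of Z is z_n = Φ(x_n)); the helper name primal_hard_margin_svm is my own, not from the lecture.

```python
# Minimal sketch (not from the lecture): feed the primal hard-margin QP
# above to an off-the-shelf QP solver; cvxopt is an assumed choice.
import numpy as np
from cvxopt import matrix, solvers

def primal_hard_margin_svm(Z, y):
    """min (1/2) w'w  s.t.  y_n (w' z_n + b) >= 1; returns (b, w)."""
    N, d = Z.shape
    y = y.astype(float)
    # variable u = [b; w] of length d + 1, as in the box above
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                      # only w enters the objective
    p = np.zeros(d + 1)
    # y_n [1, z_n'] u >= 1  rewritten as  -y_n [1, z_n'] u <= -1
    G = -(y[:, None] * np.hstack([np.ones((N, 1)), Z]))
    h = -np.ones(N)
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]                         # b, w
```

This mirrors steps 1 to 3 of the box above; whether a general-purpose solver is adequate depends on d̃, which is exactly the concern that motivates the dual.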


(4)

Dual Support Vector Machine Motivation of Dual SVM

Todo: SVM ‘without’ d̃

Original SVM: (convex) QP of
• d̃ + 1 variables
• N constraints

‘Equivalent’ SVM: (convex) QP of
• N variables
• N + 1 constraints

Warning: Heavy Math!!!!!!

introduce some necessary math without rigor to help understand SVM deeper
‘claim’ some results if details unnecessary, like how we ‘claimed’ Hoeffding

‘Equivalent’ SVM: based on some dual problem of the Original SVM


(5)

Dual Support Vector Machine Motivation of Dual SVM

Key Tool: Lagrange Multipliers

Regularization by Constrained-Minimizing E_in:
min_w  E_in(w)   s.t.  w^T w ≤ C

⇔ Regularization by Minimizing E_aug:
min_w  E_aug(w) = E_in(w) + (λ/N) w^T w

C equivalent to some λ ≥ 0 by checking the optimality condition ∇E_in(w) + (2λ/N) w = 0

regularization: view λ as a given parameter instead of C, and solve ‘easily’
dual SVM: view the λ’s as unknowns given the constraints, and solve them as variables instead

how many λ’s as variables? N, one per constraint


(6)

Dual Support Vector Machine Motivation of Dual SVM

Starting Point: Constrained to ‘Unconstrained’

min_{b,w}  (1/2) w^T w
s.t.  y_n (w^T z_n + b) ≥ 1, for n = 1, 2, . . . , N

Lagrange Function

with Lagrange multipliers α_n (replacing the λ_n notation),
L(b, w, α) = (1/2) w^T w  [objective]  +  Σ_{n=1}^{N} α_n (1 − y_n (w^T z_n + b))  [constraints]

Claim

SVM ≡ min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) ) = min_{b,w} ( ∞ if violate;  (1/2) w^T w if feasible )

any ‘violating’ (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (some positive) ) = ∞
any ‘feasible’ (b, w): max_{all α_n ≥ 0} ( (1/2) w^T w + Σ_n α_n (all non-positive) ) = (1/2) w^T w

constraints now hidden in max


(7)

Dual Support Vector Machine Motivation of Dual SVM

Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. What is the Lagrange function L(b, w, α) of hard-margin SVM?

1  (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z + b)
2  (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z + b)
3  (1/2) w^T w + α_1 (1 + w^T z + b) + α_2 (1 + w^T z − b)
4  (1/2) w^T w + α_1 (1 − w^T z − b) + α_2 (1 − w^T z − b)

Reference Answer: 2

By definition, L(b, w, α) = (1/2) w^T w + α_1 (1 − y_1 (w^T z_1 + b)) + α_2 (1 − y_2 (w^T z_2 + b)) with (z_1, y_1) = (z, +1) and (z_2, y_2) = (−z, −1).



(9)

Dual Support Vector Machine Lagrange Dual SVM

Lagrange Dual Problem

for any fixed α′ with all α′_n ≥ 0,
min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) )  ≥  min_{b,w} L(b, w, α′)
because max ≥ any

for the best α′ ≥ 0 on the right-hand side,
min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) )  ≥  max_{all α′_n ≥ 0} min_{b,w} L(b, w, α′)   [Lagrange dual problem]
because best is one of any

Lagrange dual problem: ‘outer’ maximization over α on a lower bound of the original problem


(10)

Dual Support Vector Machine Lagrange Dual SVM

Strong Duality of Quadratic Programming

min_{b,w} ( max_{all α_n ≥ 0} L(b, w, α) )   [equiv. to original (primal) SVM]
  ≥  max_{all α_n ≥ 0} ( min_{b,w} L(b, w, α) )   [Lagrange dual]

• ‘≥’: weak duality
• ‘=’: strong duality, true for QP if
  • convex primal
  • feasible primal (true if Φ-separable)
  • linear constraints
  (called constraint qualification)

there exists a primal-dual optimal solution (b, w, α) for both sides


(11)

Dual Support Vector Machine Lagrange Dual SVM

Solving Lagrange Dual: Simplifications (1/2)

max_{all α_n ≥ 0} ( min_{b,w}  (1/2) w^T w + Σ_{n=1}^{N} α_n (1 − y_n (w^T z_n + b)) )   [inner objective = L(b, w, α)]

inner problem ‘unconstrained’, at optimal:
∂L(b, w, α)/∂b = 0 = −Σ_{n=1}^{N} α_n y_n

no loss of optimality if solving with constraint Σ_{n=1}^{N} α_n y_n = 0

but wait, b can be removed:

max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w}  (1/2) w^T w + Σ_{n=1}^{N} α_n (1 − y_n w^T z_n) − (Σ_{n=1}^{N} α_n y_n) · b )

where the crossed-out last term (Σ_{n=1}^{N} α_n y_n) · b vanishes under the new constraint


(12)

Dual Support Vector Machine Lagrange Dual SVM

Solving Lagrange Dual: Simplifications (2/2)

max_{all α_n ≥ 0, Σ y_n α_n = 0} ( min_{b,w}  (1/2) w^T w + Σ_{n=1}^{N} α_n (1 − y_n w^T z_n) )

inner problem ‘unconstrained’, at optimal:
∂L(b, w, α)/∂w_i = 0 = w_i − Σ_{n=1}^{N} α_n y_n z_{n,i}

no loss of optimality if solving with constraint w = Σ_{n=1}^{N} α_n y_n z_n

but wait!

max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n} ( min_{b,w}  (1/2) w^T w + Σ_{n=1}^{N} α_n − w^T w )

⇐⇒ max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^{N} α_n y_n z_n‖² + Σ_{n=1}^{N} α_n
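The last step is simply the substitution w = Σ α_n y_n z_n written out; making it explicit (a short derivation added for clarity, not on the original slide):

```latex
\sum_{n=1}^{N} \alpha_n \bigl( 1 - y_n w^T z_n \bigr)
  = \sum_{n=1}^{N} \alpha_n - w^T \underbrace{\sum_{n=1}^{N} \alpha_n y_n z_n}_{=\,w}
  = \sum_{n=1}^{N} \alpha_n - w^T w,
\qquad\text{so}\qquad
\frac{1}{2} w^T w + \sum_{n=1}^{N} \alpha_n - w^T w
  = -\frac{1}{2} \Bigl\| \sum_{n=1}^{N} \alpha_n y_n z_n \Bigr\|^2 + \sum_{n=1}^{N} \alpha_n .
```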


(13)

Dual Support Vector Machine Lagrange Dual SVM

KKT Optimality Conditions

max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^{N} α_n y_n z_n‖² + Σ_{n=1}^{N} α_n

if primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0;  w = Σ α_n y_n z_n
• primal-inner optimal (at the optimum all ‘Lagrange terms’ disappear): α_n (1 − y_n (w^T z_n + b)) = 0

called the Karush-Kuhn-Tucker (KKT) conditions, necessary for optimality [& sufficient here]

will use KKT to ‘solve’ (b, w) from the optimal α
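As a companion to the list above, a numerical check of the four conditions for a candidate (b, w, α) might look like the sketch below; the helper name check_kkt and the tolerances are my own illustrative assumptions, not from the lecture.

```python
# Illustrative sketch: numerically check the four KKT conditions for a
# candidate primal-dual pair (b, w, alpha) on transformed data Z (rows z_n)
# and labels y in {+1, -1}. Tolerances are a practical assumption.
import numpy as np

def check_kkt(b, w, alpha, Z, y, tol=1e-6):
    margins = y * (Z @ w + b)                         # y_n (w' z_n + b)
    primal_feasible = bool(np.all(margins >= 1 - tol))
    dual_feasible = bool(np.all(alpha >= -tol))
    dual_inner = (abs(alpha @ y) <= tol
                  and np.allclose(w, (alpha * y) @ Z, atol=1e-5))
    comp_slackness = bool(np.all(np.abs(alpha * (1 - margins)) <= tol))
    return primal_feasible and dual_feasible and dual_inner and comp_slackness
```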


(14)

Dual Support Vector Machine Lagrange Dual SVM

Fun Time

For a single variable w, consider minimizing (1/2) w² subject to two linear constraints w ≥ 1 and w ≤ 3. We know that the Lagrange function is L(w, α) = (1/2) w² + α_1 (1 − w) + α_2 (w − 3). Which of the following equations that contain α are among the KKT conditions of the optimization problem?

1  α_1 ≥ 0 and α_2 ≥ 0
2  w = α_1 − α_2
3  α_1 (1 − w) = 0 and α_2 (w − 3) = 0
4  all of the above

Reference Answer: 4

1 contains the dual-feasible constraints; 2 contains the dual-inner-optimal constraints; 3 contains the primal-inner-optimal constraints.



(16)

Dual Support Vector Machine Solving Dual SVM

Dual Formulation of Support Vector Machine

max_{all α_n ≥ 0, Σ y_n α_n = 0, w = Σ α_n y_n z_n}  −(1/2) ‖Σ_{n=1}^{N} α_n y_n z_n‖² + Σ_{n=1}^{N} α_n

standard hard-margin SVM dual

min_α  (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^{N} α_n
subject to  Σ_{n=1}^{N} y_n α_n = 0;  α_n ≥ 0, for n = 1, 2, . . . , N

(convex) QP of N variables & N + 1 constraints, as promised

how to solve? yeah, we know QP! :-)


(17)

Dual Support Vector Machine Solving Dual SVM

Dual SVM with QP Solver

optimal α = ?

min_α  (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^{N} α_n
subject to  Σ_{n=1}^{N} y_n α_n = 0;  α_n ≥ 0, for n = 1, 2, . . . , N

optimal α ← QP(Q, p, A, c)

min_α  (1/2) α^T Q α + p^T α
subject to  a_i^T α ≥ c_i, for i = 1, 2, . . .

• q_{n,m} = y_n y_m z_n^T z_m
• p = −1_N
• a_≥ = y, a_≤ = −y;  a_n^T = n-th unit direction
• c_≥ = 0, c_≤ = 0;  c_n = 0

note: many solvers treat equality (a_≥, a_≤) & bound (a_n) constraints specially for numerical stability
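To make the mapping concrete, here is a minimal sketch of handing this dual to a generic QP solver; cvxopt is an assumed off-the-shelf choice (the lecture does not prescribe one), and the equality constraint is passed directly rather than as the pair (a_≥, a_≤).

```python
# Sketch (assumed solver: cvxopt): hard-margin dual
#   min_alpha (1/2) alpha' Q_D alpha - 1' alpha
#   s.t.  y' alpha = 0,  alpha_n >= 0,
# where q_{n,m} = y_n y_m z_n' z_m on transformed data Z (rows z_n).
import numpy as np
from cvxopt import matrix, solvers

def dual_hard_margin_svm(Z, y):
    N = Z.shape[0]
    y = y.astype(float)
    Q = (y[:, None] * y[None, :]) * (Z @ Z.T)   # q_{n,m} = y_n y_m z_n' z_m
    p = -np.ones(N)                             # p = -1_N
    G, h = -np.eye(N), np.zeros(N)              # -alpha_n <= 0, i.e. alpha_n >= 0
    A, b = y.reshape(1, N), np.zeros(1)         # equality constraint y' alpha = 0
    sol = solvers.qp(matrix(Q), matrix(p), matrix(G), matrix(h),
                     matrix(A), matrix(b))
    return np.array(sol['x']).ravel()           # optimal alpha
```

For large N this naive route hits exactly the memory issue described on the next slide, which is why special-purpose solvers are preferred in practice.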


(18)

Dual Support Vector Machine Solving Dual SVM

Dual SVM with Special QP Solver

optimal α ← QP(Q_D, p, A, c)

min_α  (1/2) α^T Q_D α + p^T α
subject to special equality and bound constraints

• q_{n,m} = y_n y_m z_n^T z_m, often non-zero
• if N = 30,000, the dense Q_D (N by N, symmetric) takes > 3G RAM

need a special solver for
• not storing the whole Q_D
• utilizing the special constraints properly
to scale up to large N

usually better to use a special solver in practice


(19)

Dual Support Vector Machine Solving Dual SVM

Optimal (b, w)

KKT conditions

if primal-dual optimal (b, w, α),
• primal feasible: y_n (w^T z_n + b) ≥ 1
• dual feasible: α_n ≥ 0
• dual-inner optimal: Σ y_n α_n = 0;  w = Σ α_n y_n z_n
• primal-inner optimal (at the optimum all ‘Lagrange terms’ disappear): α_n (1 − y_n (w^T z_n + b)) = 0 (complementary slackness)

optimal α =⇒ optimal w? easy, from above!
optimal α =⇒ optimal b? a range from primal feasibility, and an equality from complementary slackness:
if some α_n > 0  ⇒  b = y_n − w^T z_n

complementary slackness: α_n > 0 ⇒ (z_n, y_n) on the fat boundary (SV!)
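In code, recovering (b, w) from an optimal α is a one-liner each; the sketch below is only illustrative (the positivity threshold is a numerical assumption of mine, since a solver rarely returns exact zeros).

```python
# Sketch: recover (b, w) from an optimal alpha via the KKT conditions.
import numpy as np

def recover_b_w(alpha, Z, y, tol=1e-6):
    w = (alpha * y) @ Z                   # w = sum_n alpha_n y_n z_n
    sv = np.flatnonzero(alpha > tol)      # indices with alpha_n > 0 (SVs)
    n = sv[0]                             # any single SV determines b
    b = y[n] - Z[n] @ w                   # b = y_n - w' z_n
    return b, w
```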


(20)

Dual Support Vector Machine Solving Dual SVM

Fun Time

Consider two transformed examples (z_1, +1) and (z_2, −1) with z_1 = z and z_2 = −z. After solving the dual problem of hard-margin SVM, assume that the optimal α_1 and α_2 are both strictly positive. What is the optimal b?

1  −1
2  0
3  1
4  not certain with the descriptions above

Reference Answer: 2

With the descriptions, at the optimal (b, w), b = +1 − w^T z = −1 + w^T z. That is, w^T z = 1 and b = 0.



(22)

Dual Support Vector Machine Messages behind Dual SVM

Support Vectors Revisited

on boundary: ‘locates’ the fattest hyperplane;  others: not needed

examples with α_n > 0: on boundary
call the α_n > 0 examples (z_n, y_n) support vectors (no longer just candidates)
• SV (positive α_n) ⊆ SV candidates (on boundary)

(figure: separating hyperplane x1 − x2 − 1 = 0 with margin 0.707, support vectors on the boundary)

only SVs needed to compute w:  w = Σ_{n=1}^{N} α_n y_n z_n = Σ_{SV} α_n y_n z_n
only SVs needed to compute b:  b = y_n − w^T z_n with any SV (z_n, y_n)

SVM: learn the fattest hyperplane by identifying the support vectors with the dual optimal solution
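Since only the SVs matter, prediction can be written entirely in terms of them; the following sketch (my own illustration, with phi an assumed feature-transform function) previews g_SVM(x) = sign(Σ_SV α_n y_n z_n^T Φ(x) + b).

```python
# Sketch: once alpha is known, only the support vectors (alpha_n > 0) are
# needed for prediction. The threshold is a numerical assumption.
import numpy as np

def make_g_svm(alpha, Z, y, b, phi, tol=1e-6):
    sv = alpha > tol                          # keep only the support vectors
    a_sv, y_sv, Z_sv = alpha[sv], y[sv], Z[sv]
    def g_svm(x):
        # g_SVM(x) = sign( sum_{SV} alpha_n y_n z_n' Phi(x) + b )
        return np.sign((a_sv * y_sv) @ (Z_sv @ phi(x)) + b)
    return g_svm
```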


(23)

Dual Support Vector Machine Messages behind Dual SVM

Representation of Fattest Hyperplane

SVM
w_SVM = Σ_{n=1}^{N} α_n (y_n z_n),  with α_n from the dual solution

PLA
w_PLA = Σ_{n=1}^{N} β_n (y_n z_n),  with β_n from # mistake corrections

w = linear combination of y_n z_n
• also true for GD/SGD-based LogReg/LinReg when w_0 = 0
• call such w ‘represented’ by the data

SVM: represent w by the SVs only


(24)

Dual Support Vector Machine Messages behind Dual SVM

Summary: Two Forms of Hard-Margin SVM

Primal Hard-Margin SVM

min_{b,w}  (1/2) w^T w
sub. to  y_n (w^T z_n + b) ≥ 1, for n = 1, 2, . . . , N

• d̃ + 1 variables, N constraints: suitable when d̃ + 1 is small
• physical meaning: locate the specially-scaled (b, w)

Dual Hard-Margin SVM

min_α  (1/2) α^T Q_D α − 1^T α
s.t.  y^T α = 0;  α_n ≥ 0 for n = 1, . . . , N

• N variables, N + 1 simple constraints: suitable when N is small
• physical meaning: locate the SVs (z_n, y_n) & their α_n

both eventually result in the optimal (b, w) for the fattest hyperplane
g_SVM(x) = sign(w^T Φ(x) + b)


(25)

Dual Support Vector Machine Messages behind Dual SVM

Are We Done Yet?

goal: SVM without dependence on d̃

min_α  (1/2) α^T Q_D α − 1^T α
subject to  y^T α = 0;  α_n ≥ 0, for n = 1, 2, . . . , N

• N variables, N + 1 constraints: no dependence on d̃?
• q_{n,m} = y_n y_m z_n^T z_m: an inner product in R^d̃, O(d̃) via naïve computation!

no dependence only if we avoid the naïve computation (next lecture :-))


(26)

Dual Support Vector Machine Messages behind Dual SVM

Fun Time

Consider applying dual hard-margin SVM on N = 5566 examples and getting 1126 SVs. Which of the following can be the number of examples that are on the fat boundary, that is, SV candidates?

1  0
2  1024
3  1234
4  9999

Reference Answer: 3

Because SVs are always on the fat boundary, # SVs ≤ # SV candidates ≤ N.



(28)

Dual Support Vector Machine Messages behind Dual SVM

Summary

1 Embedding Numerous Features: Kernel Models

Lecture 2: Dual Support Vector Machine
Motivation of Dual SVM: want to remove the dependence on d̃
Lagrange Dual SVM: KKT conditions link primal and dual
Solving Dual SVM: another QP, better solved with a special solver
Messages behind Dual SVM: SVs represent the fattest hyperplane

next: computing the inner product in R^d̃ efficiently

2 Combining Predictive Features: Aggregation Models

3 Distilling Implicit Features: Extraction Models

