(1)

Machine Learning Techniques (機器學習技巧)

Lecture 1: Large-Margin Linear Classification

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

(2)

Large-Margin Linear Classification

Agenda

Lecture 1: Large-Margin Linear Classification
• Large-Margin Separating Hyperplane
• Standard Large-Margin Problem
• Support Vector Machine
• Reasons behind Large-Margin Hyperplane

(3)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket
[figure: perceptron view: inputs x_0, x_1, x_2, ..., x_d weighted into a score s, then thresholded to h(x) = sign(s)]

• plausible err = 0/1 (small flipping noise)
• minimize it specially (when linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)
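The classifier above is just a thresholded linear score; a minimal numpy sketch (names and data are illustrative, not from the slides):

```python
import numpy as np

def linear_classify(w, X):
    """h(x) = sign(w^T x) for each row of X, with the constant feature x_0 = 1 already prepended."""
    return np.sign(X @ w)   # note: a score of exactly 0 is left as 0 here

# toy usage: two 2-D points with x_0 = 1 added in front
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 1.0]])
w = np.array([0.1, 1.0, -1.0])
print(linear_classify(w, X))   # [-1.  1.]
```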

(4)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Which Line Is Best?

PLA? depending on randomness

VC bound? whichever you like!
E_out(w) ≤ E_in(w) + Ω(H), where E_in(w) = 0 for every separating line and Ω(H) depends only on d_VC = d + 1

You? rightmost one, possibly :-)

(5)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Why Rightmost Hyperplane?

informal argument

if (Gaussian-like) noise on future x ≈ x_n:
x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting

• distance to closest x_n ⇐⇒ amount of noise tolerance ⇐⇒ robustness of separating hyperplane

rightmost one: more robust

(6)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Fat Hyperplane

• robust separating hyperplane: fat, i.e. far from examples on both sides
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane

(7)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Large-Margin Separating Hyperplane

max_w  fatness(w)
subject to  w classifies every (x_n, y_n) correctly
where fatness(w) = min_{n=1,...,N} distance(x_n, w)

renamed:

max_w  margin(w)
subject to  every y_n w^T x_n > 0
where margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane


(9)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Fun Time

(10)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

max_w  margin(w)
subject to  every y_n w^T x_n > 0
where margin(w) = min_{n=1,...,N} distance(x_n, w)

‘shorten’ x and w: distance needs w_0 and (w_1, ..., w_d) treated differently (to be derived)

b = w_0;   w = (w_1, ..., w_d)^T;   drop the constant feature x_0 = 1;   x = (x_1, ..., x_d)^T

next: h(x) = sign(w^T x + b)

(11)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Hyperplane

want: distance(x, w, b), with the hyperplane w^T x + b = 0

consider x' and x'' on the hyperplane:

1. w^T x' = -b
2. w ⊥ hyperplane: w^T (x'' - x') = 0, since (x'' - x') is a vector on the hyperplane
3. distance = projection of (x - x') onto w, the direction ⊥ to the hyperplane

[figure: x, x', x'' with w normal to the hyperplane, dist(x, h) as the projection]

distance(x, w, b) = | (w^T / ||w||) (x - x') | = (1/||w||) |w^T x + b|
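A minimal numpy sketch of the final formula (the test point and hyperplane are illustrative):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """distance(x, w, b) = |w^T x + b| / ||w|| for the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# toy check against the hyperplane x_1 - x_2 - 1 = 0
w, b = np.array([1.0, -1.0]), -1.0
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))   # 1/sqrt(2) ≈ 0.707
```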

(12)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Separating Hyperplane

distance(x, w, b) = (1/||w||) |w^T x + b|

separating hyperplane: for every n, y_n (w^T x_n + b) > 0

distance to separating hyperplane: since every example is on its correct side, |w^T x_n + b| = y_n (w^T x_n + b), so
distance(x_n, w, b) = (1/||w||) y_n (w^T x_n + b)

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
where margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)

(13)

Large-Margin Linear Classification Standard Large-Margin Problem

Margin of Special Separating Hyperplane

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
where margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)

• (w, b) and (1126w, 1126b): same hyperplane, same margin
• special scaling: only consider separating (w, b) such that min_n y_n (w^T x_n + b) = 1
  =⇒ margin(w, b) = 1/||w||

max_{b,w}  1/||w||
subject to  every y_n (w^T x_n + b) > 0
            min_{n=1,...,N} y_n (w^T x_n + b) = 1
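A one-line check of the scale invariance used above (just filling in the algebra): for any κ > 0,

margin(κw, κb) = min_n (1/||κw||) y_n (κw^T x_n + κb) = min_n (1/||w||) y_n (w^T x_n + b) = margin(w, b),

so rescaling (w, b) until min_n y_n (w^T x_n + b) = 1 changes neither the hyperplane nor the margin, and afterwards margin(w, b) = 1/||w||.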

(14)

Large-Margin Linear Classification Standard Large-Margin Problem

Standard Large-Margin Hyperplane Problem

max_{b,w}  1/||w|| = 1/sqrt(w^T w)
subject to  min_{n=1,...,N} y_n (w^T x_n + b) = 1

final changes:
• max =⇒ min, remove the sqrt, add 1/2
• min_n (...) = 1 =⇒ (...) ≥ 1 for all n
  (minimizing 1/2 w^T w means the optimum cannot have all (...) > 1: otherwise (w, b) could be scaled down, still satisfying the constraints while shrinking w^T w)

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1 for n = 1, ..., N

(15)

Large-Margin Linear Classification Standard Large-Margin Problem

Fun Time

(16)

Large-Margin Linear Classification Support Vector Machine

Solving a Particular Standard Problem

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1

X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ -1 ; -1 ; +1 ; +1 ]

constraints:
(i)    -b ≥ 1
(ii)   -2w_1 - 2w_2 - b ≥ 1
(iii)   2w_1 + 0w_2 + b ≥ 1
(iv)    3w_1 + 0w_2 + b ≥ 1

(i) & (iii) =⇒ w_1 ≥ +1;   (ii) & (iii) =⇒ w_2 ≤ -1
=⇒ 1/2 w^T w ≥ 1

(w_1 = 1, w_2 = -1, b = -1) is at the lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 - x_2 - 1):
SVM? :-)
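A quick numpy check of the candidate solution above (a sketch, not part of the slides):

```python
import numpy as np

# toy data from the slide
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

# candidate optimum: w = (1, -1), b = -1
w, b = np.array([1.0, -1.0]), -1.0

print(y * (X @ w + b))           # [1. 1. 1. 2.]  => constraints (i)-(iv) all hold
print(0.5 * (w @ w))             # 1.0            => objective sits at the lower bound
print(1.0 / np.linalg.norm(w))   # 0.707...       => the resulting margin 1/||w||
```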

(17)

Large-Margin Linear Classification Support Vector Machine

Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = -1, b = -1)
margin(w, b) = 1/||w|| = 1/sqrt(2) ≈ 0.707

[figure: the hyperplane x_1 - x_2 - 1 = 0 with a fat boundary of width 0.707 on each side]

• examples on the boundary: ‘locate’ the fattest hyperplane
• other examples: not needed

call boundary examples support vector (candidates)

support vector machine (SVM):
learn fattest hyperplanes (with help of support vectors)

(18)

Large-Margin Linear Classification Support Vector Machine

Solving General SVM

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  —quadratic programming

quadratic programming (QP): ‘easy’ optimization problem

(19)

Large-Margin Linear Classification Support Vector Machine

Quadratic Programming

optimal (b, w) = ?

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1, for n = 1, 2, ..., N

optimal u ← QP(A, c, P, r)

min_u  1/2 u^T A u + c^T u
subject to  p_m^T u ≥ r_m, for m = 1, 2, ..., M

objective function:
u = [ b ; w ];   A = [ 0  0_d^T ; 0_d  I_d ];   c = 0_{d+1}

constraints:
p_n^T = y_n [ 1  x_n^T ];   r_n = 1;   M = N

SVM with general QP solver: easy
if you’ve read the manual :-)
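A small numpy sketch of the coefficient mapping above, assuming X holds the N inputs x_n (without the constant x_0) row-wise with shape (N, d); the function name is illustrative:

```python
import numpy as np

def hard_margin_qp_coefficients(X, y):
    """Map (X, y) to the QP(A, c, P, r) coefficients for u = (b, w)."""
    N, d = X.shape
    A = np.zeros((d + 1, d + 1))
    A[1:, 1:] = np.eye(d)                               # A = [ 0  0_d^T ; 0_d  I_d ]
    c = np.zeros(d + 1)                                 # c = 0_{d+1}
    P = y[:, None] * np.hstack([np.ones((N, 1)), X])    # row n: p_n^T = y_n [ 1  x_n^T ]
    r = np.ones(N)                                      # r_n = 1, with M = N rows
    return A, c, P, r
```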

(20)

Large-Margin Linear Classification Support Vector Machine

SVM with QP Solver

Linear Hard-Margin SVM Algorithm

1. A = [ 0  0_d^T ; 0_d  I_d ];   c = 0_{d+1};   p_n^T = y_n [ 1  x_n^T ];   r_n = 1
2. [ b ; w ] ← QP(A, c, P, r)
3. return b & w as g_SVM

• hard-margin: nothing violates the ‘fat boundary’
• linear: works on the original x_n

want non-linear? z_n = Φ(x_n) —remember? :-)
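An end-to-end sketch with one general QP solver, cvxopt (my choice for illustration; the slides do not prescribe a package). Its qp() minimizes (1/2) u^T P u + q^T u subject to G u ≤ h, so the slide's p_n^T u ≥ r_n is passed as -p_n^T u ≤ -r_n; the tiny ridge added to A is only a numerical convenience, not part of the algorithm:

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_hard_margin_svm(X, y, ridge=1e-8):
    """Steps 1-3 of the algorithm above via cvxopt's QP solver."""
    N, d = X.shape
    A = np.zeros((d + 1, d + 1))
    A[1:, 1:] = np.eye(d)
    A += ridge * np.eye(d + 1)        # keep the semi-definite objective numerically well-behaved
    c = np.zeros(d + 1)
    P = y[:, None] * np.hstack([np.ones((N, 1)), X])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(A), matrix(c), matrix(-P), matrix(-np.ones(N)))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]                # b, w

# toy data from the earlier slide: expect roughly b = -1, w = (1, -1)
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])
print(linear_hard_margin_svm(X, y))
```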

(21)

Large-Margin Linear Classification Support Vector Machine

Fun Time

(22)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Why Large-Margin Hyperplane?

min_{b,w}  1/2 w^T w
subject to  y_n (w^T z_n + b) ≥ 1

                   minimize     constraint
regularization     E_in         w^T w ≤ C
SVM                w^T w        E_in = 0 [and more]

SVM (large-margin hyperplane):
‘weight-decay regularization’ within E_in = 0

(23)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Large-Margin Restricts Dichotomies

consider ‘large-margin algorithm’ A_ρ:
either returns g with margin(g) ≥ ρ (if such g exists), or 0 otherwise

• A_0: like PLA =⇒ shatters ‘general’ 3 inputs
• A_1.126: more strict than SVM =⇒ cannot shatter some sets of 3 inputs

fewer dichotomies =⇒ smaller ‘VC dim.’ =⇒ better generalization

(24)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

VC Dimension of Large-Margin Algorithm

fewer dichotomies =⇒ smaller ‘VC dim.’

consider d_VC(A_ρ) [data-dependent, need more than VC]
instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > sqrt(3)/2: cannot shatter any 3 inputs (d_VC < 3)
  —some pair of inputs must be of distance ≤ sqrt(3)

generally, when X is in a radius-R hyperball:
d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 ≤ d + 1  [= d_VC(perceptrons)]
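A quick arithmetic check connecting the general bound to the unit-circle case above (my own sanity check, not from the slides): with R = 1 and ρ > sqrt(3)/2, we get R^2/ρ^2 < 4/3, so d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 < 4/3 + 1 < 3, consistent with no set of 3 inputs being shatterable.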

(25)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Benefits of Large-Margin Hyperplanes

                 large-margin hyperplanes   hyperplanes   hyperplanes + higher-order transforms
#                even fewer                 not many      many
boundary         simple                     simple        sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM
                 large-margin hyperplanes + higher-order transforms
#                not many
boundary         sophisticated

(26)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Fun Time

(27)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Summary

Lecture 1: Large-Margin Linear Classification
• Large-Margin Separating Hyperplane: intuitively more robust
• Standard Large-Margin Problem: minimize the normal vector w while separating with a fixed scale
• Support Vector Machine: easy via quadratic programming
• Reasons behind Large-Margin Hyperplane: fewer dichotomies and better generalization
