(1)

Machine Learning Techniques (機器學習技巧)

Lecture 1: Large-Margin Linear Classification

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

(2)

Large-Margin Linear Classification

Agenda

Lecture 1: Large-Margin Linear Classification
• Large-Margin Separating Hyperplane
• Standard Large-Margin Problem
• Support Vector Machine
• Reasons behind Large-Margin Hyperplane

(3)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket
[figure: perceptron view: inputs x_0, x_1, x_2, ..., x_d weighted into a score s, then thresholded to h(x) = sign(s)]

• plausible err = 0/1 (small flipping noise)
• minimize it specially (when linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)
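The classifier above is just a thresholded linear score; a minimal numpy sketch (names and data are illustrative, not from the slides):

```python
import numpy as np

def linear_classify(w, X):
    """h(x) = sign(w^T x) for each row of X, with the constant feature x_0 = 1 already prepended."""
    return np.sign(X @ w)   # note: a score of exactly 0 is left as 0 here

# toy usage: two 2-D points with x_0 = 1 added in front
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 1.0]])
w = np.array([0.1, 1.0, -1.0])
print(linear_classify(w, X))   # [-1.  1.]
```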

(4)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Which Line Is Best?

PLA? depending on randomness

VC bound? whichever you like!
E_out(w) ≤ E_in(w) + Ω(H), where E_in(w) = 0 for every separating line and Ω(H) depends only on d_VC = d + 1

You? rightmost one, possibly :-)

(5)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Why Rightmost Hyperplane?

informal argument

if (Gaussian-like) noise on future x ≈ x_n:
x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting

• distance to closest x_n ⇐⇒ amount of noise tolerance ⇐⇒ robustness of separating hyperplane

rightmost one: more robust

(6)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Fat Hyperplane

• robust separating hyperplane: fat, i.e. far from examples on both sides
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane

(7)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Large-Margin Separating Hyperplane

max_w  fatness(w)
subject to  w classifies every (x_n, y_n) correctly
where fatness(w) = min_{n=1,...,N} distance(x_n, w)

renamed:

max_w  margin(w)
subject to  every y_n w^T x_n > 0
where margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane


(9)

Large-Margin Linear Classification Large-Margin Separating Hyperplane

Fun Time

(10)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

max_w  margin(w)
subject to  every y_n w^T x_n > 0
where margin(w) = min_{n=1,...,N} distance(x_n, w)

‘shorten’ x and w: distance needs w_0 and (w_1, ..., w_d) treated differently (to be derived)

b = w_0;   w = (w_1, ..., w_d)^T;   drop the constant feature x_0 = 1;   x = (x_1, ..., x_d)^T

next: h(x) = sign(w^T x + b)

(11)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Hyperplane

want: distance(x, w, b), with the hyperplane w^T x + b = 0

consider x' and x'' on the hyperplane:

1. w^T x' = -b
2. w ⊥ hyperplane: w^T (x'' - x') = 0, since (x'' - x') is a vector on the hyperplane
3. distance = projection of (x - x') onto w, the direction ⊥ to the hyperplane

[figure: x, x', x'' with w normal to the hyperplane, dist(x, h) as the projection]

distance(x, w, b) = | (w^T / ||w||) (x - x') | = (1/||w||) |w^T x + b|
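A minimal numpy sketch of the final formula (the test point and hyperplane are illustrative):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """distance(x, w, b) = |w^T x + b| / ||w|| for the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# toy check against the hyperplane x_1 - x_2 - 1 = 0
w, b = np.array([1.0, -1.0]), -1.0
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))   # 1/sqrt(2) ≈ 0.707
```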

(12)

Large-Margin Linear Classification Standard Large-Margin Problem

Distance to Separating Hyperplane

distance(x, w, b) = (1/||w||) |w^T x + b|

separating hyperplane: for every n, y_n (w^T x_n + b) > 0

distance to separating hyperplane: since every example is on its correct side, |w^T x_n + b| = y_n (w^T x_n + b), so
distance(x_n, w, b) = (1/||w||) y_n (w^T x_n + b)

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
where margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)

(13)

Large-Margin Linear Classification Standard Large-Margin Problem

Margin of Special Separating Hyperplane

max_{b,w}  margin(w, b)
subject to  every y_n (w^T x_n + b) > 0
where margin(w, b) = min_{n=1,...,N} (1/||w||) y_n (w^T x_n + b)

• (w, b) and (1126w, 1126b): same hyperplane, same margin
• special scaling: only consider separating (w, b) such that min_n y_n (w^T x_n + b) = 1
  =⇒ margin(w, b) = 1/||w||

max_{b,w}  1/||w||
subject to  every y_n (w^T x_n + b) > 0
            min_{n=1,...,N} y_n (w^T x_n + b) = 1
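A one-line check of the scale invariance used above (just filling in the algebra): for any κ > 0,

margin(κw, κb) = min_n (1/||κw||) y_n (κw^T x_n + κb) = min_n (1/||w||) y_n (w^T x_n + b) = margin(w, b),

so rescaling (w, b) until min_n y_n (w^T x_n + b) = 1 changes neither the hyperplane nor the margin, and afterwards margin(w, b) = 1/||w||.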

(14)

Large-Margin Linear Classification Standard Large-Margin Problem

Standard Large-Margin Hyperplane Problem

max_{b,w}  1/||w|| = 1/sqrt(w^T w)
subject to  min_{n=1,...,N} y_n (w^T x_n + b) = 1

final changes:
• max =⇒ min, remove the sqrt, add 1/2
• min_n (...) = 1 =⇒ (...) ≥ 1 for all n
  (minimizing 1/2 w^T w means the optimum cannot have all (...) > 1: otherwise (w, b) could be scaled down, still satisfying the constraints while shrinking w^T w)

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1 for n = 1, ..., N

(15)

Large-Margin Linear Classification Standard Large-Margin Problem

Fun Time

(16)

Large-Margin Linear Classification Support Vector Machine

Solving a Particular Standard Problem

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1

X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ -1 ; -1 ; +1 ; +1 ]

constraints:
(i)    -b ≥ 1
(ii)   -2w_1 - 2w_2 - b ≥ 1
(iii)   2w_1 + 0w_2 + b ≥ 1
(iv)    3w_1 + 0w_2 + b ≥ 1

(i) & (iii) =⇒ w_1 ≥ +1;   (ii) & (iii) =⇒ w_2 ≤ -1
=⇒ 1/2 w^T w ≥ 1

(w_1 = 1, w_2 = -1, b = -1) is at the lower bound and satisfies (i)-(iv)

g_SVM(x) = sign(x_1 - x_2 - 1):
SVM? :-)
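A quick numpy check of the candidate solution above (a sketch, not part of the slides):

```python
import numpy as np

# toy data from the slide
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])

# candidate optimum: w = (1, -1), b = -1
w, b = np.array([1.0, -1.0]), -1.0

print(y * (X @ w + b))           # [1. 1. 1. 2.]  => constraints (i)-(iv) all hold
print(0.5 * (w @ w))             # 1.0            => objective sits at the lower bound
print(1.0 / np.linalg.norm(w))   # 0.707...       => the resulting margin 1/||w||
```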

(17)

Large-Margin Linear Classification Support Vector Machine

Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = -1, b = -1)
margin(w, b) = 1/||w|| = 1/sqrt(2) ≈ 0.707

[figure: the hyperplane x_1 - x_2 - 1 = 0 with a fat boundary of width 0.707 on each side]

• examples on the boundary: ‘locate’ the fattest hyperplane
• other examples: not needed

call boundary examples support vector (candidates)

support vector machine (SVM):
learn fattest hyperplanes (with help of support vectors)

(18)

Large-Margin Linear Classification Support Vector Machine

Solving General SVM

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of (b, w)
  • linear constraints of (b, w)
  —quadratic programming

quadratic programming (QP): ‘easy’ optimization problem

(19)

Large-Margin Linear Classification Support Vector Machine

Quadratic Programming

optimal (b, w) = ?

min_{b,w}  1/2 w^T w
subject to  y_n (w^T x_n + b) ≥ 1, for n = 1, 2, ..., N

optimal u ← QP(A, c, P, r)

min_u  1/2 u^T A u + c^T u
subject to  p_m^T u ≥ r_m, for m = 1, 2, ..., M

objective function:
u = [ b ; w ];   A = [ 0  0_d^T ; 0_d  I_d ];   c = 0_{d+1}

constraints:
p_n^T = y_n [ 1  x_n^T ];   r_n = 1;   M = N

SVM with general QP solver: easy
if you’ve read the manual :-)
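A small numpy sketch of the coefficient mapping above, assuming X holds the N inputs x_n (without the constant x_0) row-wise with shape (N, d); the function name is illustrative:

```python
import numpy as np

def hard_margin_qp_coefficients(X, y):
    """Map (X, y) to the QP(A, c, P, r) coefficients for u = (b, w)."""
    N, d = X.shape
    A = np.zeros((d + 1, d + 1))
    A[1:, 1:] = np.eye(d)                               # A = [ 0  0_d^T ; 0_d  I_d ]
    c = np.zeros(d + 1)                                 # c = 0_{d+1}
    P = y[:, None] * np.hstack([np.ones((N, 1)), X])    # row n: p_n^T = y_n [ 1  x_n^T ]
    r = np.ones(N)                                      # r_n = 1, with M = N rows
    return A, c, P, r
```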

(20)

Large-Margin Linear Classification Support Vector Machine

SVM with QP Solver

Linear Hard-Margin SVM Algorithm

1. A = [ 0  0_d^T ; 0_d  I_d ];   c = 0_{d+1};   p_n^T = y_n [ 1  x_n^T ];   r_n = 1
2. [ b ; w ] ← QP(A, c, P, r)
3. return b & w as g_SVM

• hard-margin: nothing violates the ‘fat boundary’
• linear: works on the original x_n

want non-linear? z_n = Φ(x_n) —remember? :-)
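An end-to-end sketch with one general QP solver, cvxopt (my choice for illustration; the slides do not prescribe a package). Its qp() minimizes (1/2) u^T P u + q^T u subject to G u ≤ h, so the slide's p_n^T u ≥ r_n is passed as -p_n^T u ≤ -r_n; the tiny ridge added to A is only a numerical convenience, not part of the algorithm:

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_hard_margin_svm(X, y, ridge=1e-8):
    """Steps 1-3 of the algorithm above via cvxopt's QP solver."""
    N, d = X.shape
    A = np.zeros((d + 1, d + 1))
    A[1:, 1:] = np.eye(d)
    A += ridge * np.eye(d + 1)        # keep the semi-definite objective numerically well-behaved
    c = np.zeros(d + 1)
    P = y[:, None] * np.hstack([np.ones((N, 1)), X])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(A), matrix(c), matrix(-P), matrix(-np.ones(N)))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]                # b, w

# toy data from the earlier slide: expect roughly b = -1, w = (1, -1)
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])
print(linear_hard_margin_svm(X, y))
```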

(21)

Large-Margin Linear Classification Support Vector Machine

Fun Time

(22)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Why Large-Margin Hyperplane?

min_{b,w}  1/2 w^T w
subject to  y_n (w^T z_n + b) ≥ 1

                   minimize     constraint
regularization     E_in         w^T w ≤ C
SVM                w^T w        E_in = 0 [and more]

SVM (large-margin hyperplane):
‘weight-decay regularization’ within E_in = 0

(23)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Large-Margin Restricts Dichotomies

consider ‘large-margin algorithm’ A_ρ:
either returns g with margin(g) ≥ ρ (if such g exists), or 0 otherwise

• A_0: like PLA =⇒ shatters ‘general’ 3 inputs
• A_1.126: more strict than SVM =⇒ cannot shatter some sets of 3 inputs

fewer dichotomies =⇒ smaller ‘VC dim.’ =⇒ better generalization

(24)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

VC Dimension of Large-Margin Algorithm

fewer dichotomies =⇒ smaller ‘VC dim.’

consider d_VC(A_ρ) [data-dependent, need more than VC]
instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > sqrt(3)/2: cannot shatter any 3 inputs (d_VC < 3)
  —some pair of inputs must be of distance ≤ sqrt(3)

generally, when X is in a radius-R hyperball:
d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 ≤ d + 1  [= d_VC(perceptrons)]
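A quick arithmetic check connecting the general bound to the unit-circle case above (my own sanity check, not from the slides): with R = 1 and ρ > sqrt(3)/2, we get R^2/ρ^2 < 4/3, so d_VC(A_ρ) ≤ min(R^2/ρ^2, d) + 1 < 4/3 + 1 < 3, consistent with no set of 3 inputs being shatterable.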

(25)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Benefits of Large-Margin Hyperplanes

                 large-margin hyperplanes   hyperplanes   hyperplanes + higher-order transforms
#                even fewer                 not many      many
boundary         simple                     simple        sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM
                 large-margin hyperplanes + higher-order transforms
#                not many
boundary         sophisticated

(26)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Fun Time

(27)

Large-Margin Linear Classification Reasons behind Large-Margin Hyperplane

Summary

Lecture 1: Large-Margin Linear Classification
• Large-Margin Separating Hyperplane: intuitively more robust
• Standard Large-Margin Problem: minimize the normal vector w while separating with a fixed scale
• Support Vector Machine: easy via quadratic programming
• Reasons behind Large-Margin Hyperplane: fewer dichotomies and better generalization
