(1)

Machine Learning Techniques ( 機器學習技法)

Lecture 1: Linear Support Vector Machine

Hsuan-Tien Lin (林軒田)

htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/28

(2)

Course History

NTU Version
• 15-17 weeks (2+ hours)
• highly-praised, with English and blackboard teaching

Coursera Version
• 8 weeks of ‘foundations’ (previous course) + 8 weeks of ‘techniques’ (this course)
• Mandarin teaching, to reach a wider audience in need
• slides teaching, improved with Coursera’s quiz and homework mechanisms

goal: try making the Coursera version even better than the NTU version

(3)

Linear Support Vector Machine Course Introduction

Course Design

from Foundations to Techniques

mixture of philosophical illustrations, key theory, core algorithms, usage in practice, and hopefully jokes :-)

three major techniques surrounding feature transforms:

• Embedding Numerous Features: how to exploit and regularize numerous features?

—inspires Support Vector Machine (SVM) model

• Combining Predictive Features: how to construct and blend predictive features?

—inspires Adaptive Boosting (AdaBoost) model

• Distilling Implicit Features: how to identify and learn implicit features?

—inspires Deep Learning model

allows students to use ML professionally

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/28

(4)

Linear Support Vector Machine Course Introduction

Fun Time

Which of the following descriptions of this course is true?

1. the course will be taught in Taiwanese
2. the course will tell me the techniques that create the android Lieutenant Commander Data in Star Trek
3. the course will be 16 weeks long
4. the course will focus on three major techniques

Reference Answer: 4

1. no, my Taiwanese is unfortunately not good enough for teaching (yet)
2. no, although what we teach may serve as building blocks
3. no, unless you have also joined the previous course
4. yes, let’s get started!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/28


(6)

Roadmap

1. Embedding Numerous Features: Kernel Models

Lecture 1: Linear Support Vector Machine
• Course Introduction
• Large-Margin Separating Hyperplane
• Standard Large-Margin Problem
• Support Vector Machine
• Reasons behind Large-Margin Hyperplane

2. Combining Predictive Features: Aggregation Models

3. Distilling Implicit Features: Extraction Models

(7)

Linear Support Vector Machine Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket: h(x) = sign(s)
(perceptron diagram omitted: score s is a weighted sum of the inputs x_0, x_1, x_2, . . . , x_d; output h(x))

• plausible err = 0/1 (small flipping noise)
• minimize specially (linear separable)

linear (hyperplane) classifiers: h(x) = sign(w^T x)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/28
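For concreteness, a minimal NumPy sketch of such a linear classifier (the function name and the weights below are illustrative, not from the lecture):

    import numpy as np

    def linear_classify(w, X):
        """h(x) = sign(w^T x) for each row of X (x_0 = 1 already padded into each row)."""
        return np.sign(X @ w)

    X = np.array([[1.0, 2.0, 3.0],       # each row: (x_0 = 1, x_1, x_2)
                  [1.0, -1.0, 0.5]])
    w = np.array([-1.0, 1.0, 1.0])       # hypothetical weights
    print(linear_classify(w, X))          # [ 1. -1.]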

(8)

Which Line Is Best?

• PLA? depending on randomness
• VC bound? whichever you like!

E_out(w) ≤ E_in(w) + Ω(H), where E_in(w) = 0 for every candidate line and the complexity term corresponds to d_VC = d + 1

(figure of three candidate separating lines for the same data omitted)

You? rightmost one, possibly :-)

(9)

Linear Support Vector Machine Large-Margin Separating Hyperplane

Why Rightmost Hyperplane?

informal argument: if (Gaussian-like) noise on future x ≈ x_n:

x_n further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting

distance to closest x_n ⇐⇒ amount of noise tolerance ⇐⇒ robustness of hyperplane

rightmost one: more robust, because of larger distance to closest x_n

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/28

(10)

Fat Hyperplane

• robust separating hyperplane: fat — far from both sides of examples
• robustness ≡ fatness: distance to closest x_n

goal: find fattest separating hyperplane

(11)

Linear Support Vector Machine Large-Margin Separating Hyperplane

Large-Margin Separating Hyperplane

max_w fatness(w)
subject to w classifies every (x_n, y_n) correctly
fatness(w) = min_{n=1,...,N} distance(x_n, w)

max_w margin(w)
subject to every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

• fatness: formally called margin
• correctness: y_n = sign(w^T x_n)

goal: find largest-margin separating hyperplane

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/28


(13)

Linear Support Vector Machine Large-Margin Separating Hyperplane

Fun Time

Consider two examples (v, +1) and (−v, −1) where v ∈ R^2 (without padding the v_0 = 1). Which of the following hyperplanes is the largest-margin separating one for the two examples? You are highly encouraged to visualize by considering, for instance, v = (3, 2).

1. x_1 = 0
2. x_2 = 0
3. v_1 x_1 + v_2 x_2 = 0
4. v_2 x_1 + v_1 x_2 = 0

Reference Answer: 3

Here the largest-margin separating hyperplane (line) must be the perpendicular bisector of the line segment between v and −v. Hence v is a normal vector of the largest-margin line. The result can be extended to the more general case of v ∈ R^d.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/28
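A quick numerical check of this answer (an illustrative Python/NumPy sketch, not part of the lecture): for v = (3, 2), compute the margin of each candidate line through the origin as min_n y_n w^T x_n / ||w||.

    import numpy as np

    # two examples: (v, +1) and (-v, -1), with v = (3, 2)
    v = np.array([3.0, 2.0])
    X = np.array([v, -v])
    y = np.array([+1.0, -1.0])

    def margin(w):
        """Margin of the line w^T x = 0 (through the origin): min_n y_n w^T x_n / ||w||."""
        return np.min(y * (X @ w)) / np.linalg.norm(w)

    candidates = {
        "x_1 = 0":               np.array([1.0, 0.0]),
        "x_2 = 0":               np.array([0.0, 1.0]),
        "v_1 x_1 + v_2 x_2 = 0": np.array([3.0, 2.0]),   # normal vector v itself
        "v_2 x_1 + v_1 x_2 = 0": np.array([2.0, 3.0]),
    }
    for name, w in candidates.items():
        print(f"{name:22s} margin = {margin(w):.3f}")
    # the third candidate wins, with margin ||v|| = sqrt(13) ≈ 3.606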


(15)

Linear Support Vector Machine Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

max_w margin(w)
subject to every y_n w^T x_n > 0
margin(w) = min_{n=1,...,N} distance(x_n, w)

‘shorten’ x and w: distance needs w_0 and (w_1, . . . , w_d) treated differently (to be derived)

b = w_0;   w = (w_1, . . . , w_d)^T   (w_0 no longer inside w)
drop the padded x_0 = 1;   x = (x_1, . . . , x_d)^T

for this part: h(x) = sign(w^T x + b)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/28

(16)

Distance to Hyperplane

want: distance(x, b, w), with hyperplane w^T x' + b = 0

consider x', x'' on the hyperplane:

1. w^T x' = −b,  w^T x'' = −b
2. w ⊥ hyperplane:  w^T (x'' − x') = 0, where (x'' − x') is a vector on the hyperplane
3. distance = length of the projection of (x − x') onto w, the direction ⊥ to the hyperplane

distance(x, b, w) = |(w^T / ||w||)(x − x')| = (1 / ||w||) |w^T x + b|
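This formula translates directly into code; the following is a minimal NumPy sketch (function and variable names are mine, not the lecture's):

    import numpy as np

    def distance_to_hyperplane(x, b, w):
        """Distance from point x to the hyperplane {x : w^T x + b = 0}: |w^T x + b| / ||w||."""
        w = np.asarray(w, dtype=float)
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    # example: hyperplane x_1 - x_2 - 1 = 0, i.e. w = (1, -1), b = -1
    print(distance_to_hyperplane([0.0, 0.0], b=-1.0, w=[1.0, -1.0]))   # 0.7071... = 1/sqrt(2)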

(17)

Linear Support Vector Machine Standard Large-Margin Problem

Distance to Separating Hyperplane

distance(x, b, w) = (1 / ||w||) |w^T x + b|

separating hyperplane: for every n, y_n(w^T x_n + b) > 0

distance to separating hyperplane:
distance(x_n, b, w) = (1 / ||w||) y_n (w^T x_n + b)

max_{b,w} margin(b, w)
subject to every y_n(w^T x_n + b) > 0
margin(b, w) = min_{n=1,...,N} (1 / ||w||) y_n (w^T x_n + b)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/28

(18)

Margin of Special Separating Hyperplane

max_{b,w} margin(b, w)
subject to every y_n(w^T x_n + b) > 0
margin(b, w) = min_{n=1,...,N} (1 / ||w||) y_n (w^T x_n + b)

• w^T x + b = 0 same as 3w^T x + 3b = 0: scaling does not matter
• special scaling: only consider separating (b, w) such that min_{n=1,...,N} y_n(w^T x_n + b) = 1
  =⇒ margin(b, w) = 1 / ||w||

max_{b,w} 1 / ||w||
subject to every y_n(w^T x_n + b) > 0 and min_{n=1,...,N} y_n(w^T x_n + b) = 1
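A tiny sanity check of the scaling claim (illustrative code only): the distance, and hence the margin, is unchanged when (b, w) is rescaled to (3b, 3w).

    import numpy as np

    w, b, x = np.array([1.0, -1.0]), -1.0, np.array([3.0, 0.0])
    d1 = abs(w @ x + b) / np.linalg.norm(w)                # distance under (b, w)
    d3 = abs(3 * w @ x + 3 * b) / np.linalg.norm(3 * w)    # distance under (3b, 3w)
    print(d1, d3)                                          # equal: scaling does not matter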

(19)

Linear Support Vector Machine Standard Large-Margin Problem

Standard Large-Margin Hyperplane Problem

max_{b,w} 1 / ||w||
subject to min_{n=1,...,N} y_n(w^T x_n + b) = 1

• necessary constraints: y_n(w^T x_n + b) ≥ 1 for all n
• original constraint: min_{n=1,...,N} y_n(w^T x_n + b) = 1
• want: optimal (b, w) here (inside the necessary-constraint region)
• if optimal (b, w) outside, e.g. y_n(w^T x_n + b) > 1.126 for all n
  —can scale (b, w) to a “more optimal” (b / 1.126, w / 1.126) (contradiction!)

final change: max =⇒ min, remove the square root (use w^T w instead of ||w||), add 1/2

min_{b,w} (1/2) w^T w
subject to y_n(w^T x_n + b) ≥ 1 for all n

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/28

(20)

Linear Support Vector Machine Standard Large-Margin Problem

Fun Time

Consider three examples (x_1, +1), (x_2, +1), (x_3, −1), where x_1 = (3, 0), x_2 = (0, 4), x_3 = (0, 0). In addition, consider a hyperplane x_1 + x_2 = 1. Which of the following is not true?

1. the hyperplane is a separating one for the three examples
2. the distance from the hyperplane to x_1 is 2
3. the distance from the hyperplane to x_3 is 1/√2
4. the example that is closest to the hyperplane is x_3

Reference Answer: 2

The distance from the hyperplane to x_1 is (1/√2)(3 + 0 − 1) = √2.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/28
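The same distances can be checked numerically (illustrative NumPy sketch, reusing the distance formula derived earlier):

    import numpy as np

    # hyperplane x_1 + x_2 = 1, i.e. w = (1, 1), b = -1
    w, b = np.array([1.0, 1.0]), -1.0
    for name, x in [("x_1", np.array([3.0, 0.0])),
                    ("x_2", np.array([0.0, 4.0])),
                    ("x_3", np.array([0.0, 0.0]))]:
        print(name, round(abs(w @ x + b) / np.linalg.norm(w), 3))
    # x_1 1.414 (= sqrt(2), so option 2 is false); x_2 2.121; x_3 0.707 (= 1/sqrt(2), the closest)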


(22)

Linear Support Vector Machine Support Vector Machine

Solving a Particular Standard Problem

min_{b,w} (1/2) w^T w
subject to y_n(w^T x_n + b) ≥ 1 for all n

X = [ 0 0        y = [ −1
      2 2              −1
      2 0              +1
      3 0 ]            +1 ]

constraints:
(i)    −b ≥ 1
(ii)   −2w_1 − 2w_2 − b ≥ 1
(iii)   2w_1 + 0w_2 + b ≥ 1
(iv)    3w_1 + 0w_2 + b ≥ 1

(i) & (iii) =⇒ w_1 ≥ +1;  (ii) & (iii) =⇒ w_2 ≤ −1  =⇒  (1/2) w^T w ≥ 1

(w_1 = 1, w_2 = −1, b = −1) attains the lower bound and satisfies (i)–(iv)

g_SVM(x) = sign(x_1 − x_2 − 1): SVM? :-)
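A quick numerical verification of this hand-derived optimum (illustrative code; variable names are mine): the four constraint values y_n(w^T x_n + b) and the margin 1/||w||.

    import numpy as np

    X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
    y = np.array([-1, -1, +1, +1], dtype=float)
    w, b = np.array([1.0, -1.0]), -1.0

    print(y * (X @ w + b))               # [1. 1. 1. 2.]: (i)-(iii) tight, (iv) slack
    print(1.0 / np.linalg.norm(w))       # margin 1/||w|| = 0.7071...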

(23)

Linear Support Vector Machine Support Vector Machine

Support Vector Machine (SVM)

optimal solution: (w_1 = 1, w_2 = −1, b = −1)
margin(b, w) = 1 / ||w|| = 1/√2 ≈ 0.707

(figure: the separating line x_1 − x_2 − 1 = 0 with margin 0.707 on each side, omitted)

• examples on the boundary: ‘locate’ the fattest hyperplane
• other examples: not needed
• call each boundary example a support vector (candidate)

support vector machine (SVM): learn fattest hyperplanes (with help of support vectors)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/28

(24)

Solving General SVM

min_{b,w} (1/2) w^T w
subject to y_n(w^T x_n + b) ≥ 1 for all n

• not easy manually, of course :-)
• gradient descent? not easy with constraints

luckily:
• (convex) quadratic objective function of (b, w)
• linear constraints of (b, w)
—quadratic programming

quadratic programming (QP): ‘easy’ optimization problem

(25)

Linear Support Vector Machine Support Vector Machine

Quadratic Programming

optimal (b, w) = ?

min_{b,w} (1/2) w^T w
subject to y_n(w^T x_n + b) ≥ 1, for n = 1, 2, . . . , N

optimal u ← QP(Q, p, A, c)

min_u (1/2) u^T Q u + p^T u
subject to a_m^T u ≥ c_m, for m = 1, 2, . . . , M

objective function:
u = [b; w];  Q = [[0, 0_d^T], [0_d, I_d]];  p = 0_{d+1}

constraints:
a_n^T = y_n [1, x_n^T];  c_n = 1;  M = N

SVM with general QP solver: easy if you’ve read the manual :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/28

(26)

SVM with QP Solver

Linear Hard-Margin SVM Algorithm

1. Q = [[0, 0_d^T], [0_d, I_d]];  p = 0_{d+1};  a_n^T = y_n [1, x_n^T];  c_n = 1
2. [b; w] ← QP(Q, p, A, c)
3. return b & w as g_SVM

• hard-margin: nothing violates the ‘fat boundary’
• linear: x_n

want non-linear? z_n = Φ(x_n) —remember? :-)
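To make the recipe concrete, here is a minimal sketch of feeding this QP into an off-the-shelf solver. It assumes the cvxopt package is available; its solvers.qp routine expects inequality constraints in the form G u ≤ h, so a_n^T u ≥ 1 is passed as −a_n^T u ≤ −1. None of this code comes from the lecture.

    import numpy as np
    from cvxopt import matrix, solvers

    solvers.options['show_progress'] = False   # silence the solver's iteration log

    def linear_hard_margin_svm(X, y):
        """Solve min (1/2) w^T w  s.t.  y_n (w^T x_n + b) >= 1, over u = (b, w)."""
        N, d = X.shape
        Q = np.zeros((d + 1, d + 1))
        Q[1:, 1:] = np.eye(d)                                # Q = diag(0, I_d)
        p = np.zeros(d + 1)                                  # p = 0_{d+1}
        A = y[:, None] * np.hstack([np.ones((N, 1)), X])     # rows a_n^T = y_n [1, x_n^T]
        # cvxopt solves min (1/2) u^T Q u + p^T u  s.t.  G u <= h,
        # so a_n^T u >= 1 becomes -a_n^T u <= -1
        sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-np.ones(N)))
        u = np.array(sol['x']).ravel()
        return u[0], u[1:]                                   # b, w

    X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
    y = np.array([-1, -1, +1, +1], dtype=float)
    b, w = linear_hard_margin_svm(X, y)
    print(np.round(w, 3), round(b, 3))                       # roughly [ 1. -1.] and -1.0

On the toy data set above, this recovers the hand-derived solution (b, w) = (−1, (1, −1)).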

(27)

Linear Support Vector Machine Support Vector Machine

Fun Time

Consider two negative examples with x_1 = (0, 0) and x_2 = (2, 2); two positive examples with x_3 = (2, 0) and x_4 = (3, 0), as shown on page 17 of the slides. Define u, Q, p, c_n as those listed on page 20 of the slides. What are the a_n^T that need to be fed into the QP solver?

1. a_1^T = [−1, 0, 0], a_2^T = [−1, 2, 2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
2. a_1^T = [1, 0, 0], a_2^T = [1, −2, −2], a_3^T = [−1, 2, 0], a_4^T = [−1, 3, 0]
3. a_1^T = [1, 0, 0], a_2^T = [1, 2, 2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]
4. a_1^T = [−1, 0, 0], a_2^T = [−1, −2, −2], a_3^T = [1, 2, 0], a_4^T = [1, 3, 0]

Reference Answer: 4

We need a_n^T = y_n [1, x_n^T].

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/28


(29)

Linear Support Vector Machine Reasons behind Large-Margin Hyperplane

Why Large-Margin Hyperplane?

min_{b,w} (1/2) w^T w
subject to y_n(w^T z_n + b) ≥ 1 for all n

                   minimize    constraint
regularization     E_in        w^T w ≤ C
SVM                w^T w       E_in = 0 [and more]

SVM (large-margin hyperplane): ‘weight-decay regularization’ within E_in = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/28

(30)

Large-Margin Restricts Dichotomies

consider ‘large-margin algorithm’ A_ρ:
either returns g with margin(g) ≥ ρ (if such g exists), or 0 otherwise

• A_0: like PLA =⇒ shatters ‘general’ 3 inputs
• A_1.126: more strict than SVM =⇒ cannot shatter any 3 inputs

fewer dichotomies =⇒ smaller ‘VC dim.’ =⇒ better generalization

(31)

Linear Support Vector Machine Reasons behind Large-Margin Hyperplane

VC Dimension of Large-Margin Algorithm

fewer dichotomies =⇒ smaller ‘VC dim.’

considers d_VC(A_ρ) [data-dependent, need more than VC] instead of d_VC(H) [data-independent, covered by VC]

d_VC(A_ρ) when X = unit circle in R^2:
• ρ = 0: just perceptrons (d_VC = 3)
• ρ > √3/2: cannot shatter any 3 inputs (d_VC < 3)
  —some inputs must be of distance ≤ √3

generally, when X is in a radius-R hyperball:

d_VC(A_ρ) ≤ min(R^2 / ρ^2, d) + 1 ≤ d + 1 = d_VC(perceptrons)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 25/28

(32)

Benefits of Large-Margin Hyperplanes

           large-margin hyperplanes   hyperplanes   hyperplanes + feature transform Φ
#          even fewer                 not many      many
boundary   simple                     simple        sophisticated

• not many: good, for d_VC and generalization
• sophisticated: good, for possibly better E_in

a new possibility: non-linear SVM

           large-margin hyperplanes + numerous feature transform Φ
#          not many
boundary   sophisticated

(33)

Linear Support Vector Machine Reasons behind Large-Margin Hyperplane

Fun Time

Consider running the ‘large-margin algorithm’ A_ρ with ρ = 1/4 on a Z-space such that z = Φ(x) is of 1126 dimensions (excluding z_0) and ||z|| ≤ 1. What is the upper bound of d_VC(A_ρ) when calculated by min(R^2 / ρ^2, d) + 1?

1. 5
2. 17
3. 1126
4. 1127

Reference Answer: 2

By the description, d = 1126 and R = 1. So the upper bound is min(1^2 / (1/4)^2, 1126) + 1 = min(16, 1126) + 1 = 17.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 27/28


(35)

Linear Support Vector Machine Reasons behind Large-Margin Hyperplane

Summary

1. Embedding Numerous Features: Kernel Models

Lecture 1: Linear Support Vector Machine
• Course Introduction: from foundations to techniques
• Large-Margin Separating Hyperplane: intuitively more robust against noise
• Standard Large-Margin Problem: minimize ‘length of w’ at the special separating scale
• Support Vector Machine: ‘easy’ via quadratic programming
• Reasons behind Large-Margin Hyperplane: fewer dichotomies and better generalization
• next: solving non-linear Support Vector Machine

2. Combining Predictive Features: Aggregation Models

3. Distilling Implicit Features: Extraction Models

Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 28/28
