Machine Learning Techniques (機器學習技法)
Lecture 1: Linear Support Vector Machine
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Course Introduction

Course History

NTU Version
• 15–17 weeks (2+ hours each)
• highly praised, with English and blackboard teaching

Coursera Version
• 8 weeks of 'foundations' (previous course) + 8 weeks of 'techniques' (this course)
• Mandarin teaching, to reach a wider audience in need
• slides teaching, improved with Coursera's quiz and homework mechanisms

goal: try making the Coursera version even better than the NTU version
Course Design

from Foundations to Techniques
• a mixture of philosophical illustrations, key theory, core algorithms, usage in practice, and hopefully jokes :-)
• three major techniques surrounding feature transforms:
  • Embedding Numerous Features: how to exploit and regularize numerous features?
    —inspires the Support Vector Machine (SVM) model
  • Combining Predictive Features: how to construct and blend predictive features?
    —inspires the Adaptive Boosting (AdaBoost) model
  • Distilling Implicit Features: how to identify and learn implicit features?
    —inspires the Deep Learning model

allows students to use ML professionally
Fun Time

Which of the following descriptions of this course is true?
1. the course will be taught in Taiwanese
2. the course will tell me the techniques that create the android Lieutenant Commander Data in Star Trek
3. the course will be 16 weeks long
4. the course will focus on three major techniques

Reference Answer: 4
1. no, my Taiwanese is unfortunately not good enough for teaching (yet)
2. no, although what we teach may serve as building blocks
3. no, unless you have also joined the previous course
4. yes, let's get started!
Roadmap

1. Embedding Numerous Features: Kernel Models

   Lecture 1: Linear Support Vector Machine
   • Course Introduction
   • Large-Margin Separating Hyperplane
   • Standard Large-Margin Problem
   • Support Vector Machine
   • Reasons behind Large-Margin Hyperplane

2. Combining Predictive Features: Aggregation Models
3. Distilling Implicit Features: Extraction Models
Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket: $h(x) = \text{sign}(s)$
[figure: perceptron computing the score $s$ from inputs $x_0, x_1, x_2, \ldots, x_d$, then $h(x)$]
• plausible err = 0/1 (small flipping noise)
• minimized specially (linearly separable case)

linear (hyperplane) classifiers: $h(x) = \text{sign}(w^T x)$
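To make the notation concrete before the margin discussion, here is a minimal sketch of such a linear classifier; the function name `predict` and the toy data are invented for illustration, not part of the lecture.

```python
import numpy as np

def predict(w, X):
    """Linear (hyperplane) classifier: h(x) = sign(w^T x).

    w: weight vector of shape (d+1,), including the threshold weight w_0.
    X: data matrix of shape (N, d+1), each row padded with x_0 = 1.
    """
    scores = X @ w                       # s = w^T x for every example
    return np.where(scores >= 0, 1, -1)  # sign(s), mapping s = 0 to +1 by convention

# toy usage: two 2-D points, padded with x_0 = 1
X = np.array([[1.0, 2.0, 3.0],
              [1.0, -1.0, -1.0]])
w = np.array([0.0, 1.0, 1.0])
print(predict(w, X))  # [ 1 -1 ]
```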
Which Line Is Best?

[figure: three different lines, each separating the same linearly separable data]
• PLA? depends on randomness
• VC bound? whichever you like!

$$E_{\text{out}}(w) \le \underbrace{E_{\text{in}}(w)}_{0} + \underbrace{\Omega(\mathcal{H})}_{d_{\text{VC}} = d+1}$$

You? the rightmost one, possibly :-)
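The 'depends on randomness' point is easy to see in code: PLA keeps correcting mistakes until the data are separated, and which separating line it ends on depends on the order of corrections. A hypothetical sketch (the dataset and function name are invented):

```python
import numpy as np

def pla(X, y, rng):
    """Perceptron learning algorithm on padded data (x_0 = 1).
    Repeatedly picks a random misclassified example and corrects on it;
    terminates once every example is classified correctly."""
    w = np.zeros(X.shape[1])
    while True:
        mistakes = np.flatnonzero(np.sign(X @ w) != y)  # sign(0) = 0 also counts as a mistake
        if mistakes.size == 0:
            return w
        n = rng.choice(mistakes)
        w = w + y[n] * X[n]              # the PLA correction rule

# a small linearly separable set: positives above the line x2 = x1
X = np.array([[1., 0., 1.], [1., 1., 2.], [1., 2., 0.], [1., -1., -2.]])
y = np.array([1, 1, -1, -1])
for seed in range(3):
    print(pla(X, y, np.random.default_rng(seed)))  # different orders may end on different separators
```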
Why Rightmost Hyperplane?

informal argument: if there is (Gaussian-like) noise on a future $x \approx x_n$, then
• $x_n$ further from the hyperplane ⟺ tolerates more noise ⟺ more robust to overfitting
• distance to the closest $x_n$ ⟺ amount of noise tolerance ⟺ robustness of the hyperplane

rightmost one: more robust, because of the larger distance to the closest $x_n$
Fat Hyperplane

• robust separating hyperplane: fat
  —far from both sides of examples
• robustness ≡ fatness: distance to the closest $x_n$

goal: find the fattest separating hyperplane
Large-Margin Separating Hyperplane

$$\max_{w} \ \text{fatness}(w) \quad \text{subject to } w \text{ classifies every } (x_n, y_n) \text{ correctly}; \quad \text{fatness}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

restated formally:

$$\max_{w} \ \text{margin}(w) \quad \text{subject to every } y_n w^T x_n > 0; \quad \text{margin}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

• fatness: formally called margin
• correctness: $y_n = \text{sign}(w^T x_n)$

goal: find the largest-margin separating hyperplane
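As a sanity check on the definition, margin(w) is just the smallest distance from the data to the hyperplane, provided the hyperplane separates the data. A rough sketch, using the distance formula $|w^T x + b| / \|w\|$ that the 'Distance to Hyperplane' slide below derives; the helper name and data are invented:

```python
import numpy as np

def margin(b, w, X, y):
    """margin(b, w) = min_n distance(x_n, b, w) for a separating (b, w);
    distance(x, b, w) = |w^T x + b| / ||w|| (derived below in this lecture)."""
    signed = y * (X @ w + b)   # y_n (w^T x_n + b), positive iff x_n is classified correctly
    if np.any(signed <= 0):
        return -np.inf         # not a separating hyperplane: margin undefined here
    return np.min(signed) / np.linalg.norm(w)

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1, 1, -1, -1])
print(margin(0.0, np.array([-1., 1.]), X, y))  # ~0.7071: closest point sits 1/sqrt(2) from x2 = x1
```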
Fun Time

Consider two examples $(v, +1)$ and $(-v, -1)$ where $v \in \mathbb{R}^2$ (without padding the $v_0 = 1$). Which of the following hyperplanes is the largest-margin separating one for the two examples? You are highly encouraged to visualize by considering, for instance, $v = (3, 2)$.
1. $x_1 = 0$
2. $x_2 = 0$
3. $v_1 x_1 + v_2 x_2 = 0$
4. $v_2 x_1 + v_1 x_2 = 0$

Reference Answer: 3
Here the largest-margin separating hyperplane (line) must be the perpendicular bisector of the line segment between $v$ and $-v$. Hence $v$ is a normal vector of the largest-margin line. The result can be extended to the more general case of $v \in \mathbb{R}^d$.
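One can also verify the reference answer numerically with $v = (3, 2)$, computing each candidate's margin as the minimum distance from the two examples; an illustrative check, reusing the margin formula from the sketch above:

```python
import numpy as np

v = np.array([3.0, 2.0])
X = np.stack([v, -v])                # the two examples
y = np.array([1, -1])

# the four candidate hyperplanes w^T x = 0, each written via its normal vector w (b = 0)
candidates = {
    "x1 = 0":            np.array([1.0, 0.0]),
    "x2 = 0":            np.array([0.0, 1.0]),
    "v1 x1 + v2 x2 = 0": np.array([v[0], v[1]]),
    "v2 x1 + v1 x2 = 0": np.array([v[1], v[0]]),
}

for name, w in candidates.items():
    m = np.min(y * (X @ w)) / np.linalg.norm(w)  # margin; negative would mean "not separating"
    print(f"{name}: margin = {m:.3f}")
# the third candidate wins with margin sqrt(13) ~ 3.606, confirming answer 3
```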
Standard Large-Margin Problem
Distance to Hyperplane: Preliminary

$$\max_{w} \ \text{margin}(w) \quad \text{subject to every } y_n w^T x_n > 0; \quad \text{margin}(w) = \min_{n=1,\ldots,N} \text{distance}(x_n, w)$$

'shorten' x and w: the distance treats $w_0$ and $(w_1, \ldots, w_d)$ differently (to be derived), so separate them and drop the $x_0 = 1$ padding:

$$b = w_0, \quad w = \begin{bmatrix} w_1 \\ \vdots \\ w_d \end{bmatrix}; \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$$

for this part: $h(x) = \text{sign}(w^T x + b)$
Distance to Hyperplane

want: distance$(x, b, w)$, with hyperplane $w^T x' + b = 0$

consider $x'$, $x''$ on the hyperplane:
1. $w^T x' = -b$ and $w^T x'' = -b$
2. $w \perp$ hyperplane: $w^T (x'' - x') = 0$, since $(x'' - x')$ is a vector on the hyperplane
3. distance = projection of $(x - x')$ onto the hyperplane's normal $w$

[figure: point $x$, hyperplane through $x'$ and $x''$, normal vector $w$, dist$(x, h)$]

$$\text{distance}(x, b, w) = \left| \frac{w^T}{\|w\|} (x - x') \right| = \frac{1}{\|w\|} \left| w^T x + b \right|$$
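A direct transcription of this distance formula, with a numeric spot check; the function name and example line are invented for illustration:

```python
import numpy as np

def distance(x, b, w):
    """Distance from point x to the hyperplane w^T x + b = 0: |w^T x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

# spot check: the line x1 + x2 - 1 = 0 has w = (1, 1), b = -1;
# the origin should be 1/sqrt(2) ~ 0.7071 away
print(distance(np.array([0.0, 0.0]), -1.0, np.array([1.0, 1.0])))
```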
Distance to Separating Hyperplane

$$\text{distance}(x, b, w) = \frac{1}{\|w\|} \left| w^T x + b \right|$$

• separating hyperplane: for every $n$, $y_n (w^T x_n + b) > 0$
• distance to a separating hyperplane: since a correct sign means $y_n (w^T x_n + b) = |w^T x_n + b|$,
  $$\text{distance}(x_n, b, w) = \frac{1}{\|w\|} y_n (w^T x_n + b)$$

$$\max_{b,w} \ \text{margin}(b, w) \quad \text{subject to every } y_n (w^T x_n + b) > 0; \quad \text{margin}(b, w) = \min_{n=1,\ldots,N} \frac{1}{\|w\|} y_n (w^T x_n + b)$$
Margin of Special Separating Hyperplane

$$\max_{b,w} \ \text{margin}(b, w) \quad \text{subject to every } y_n (w^T x_n + b) > 0; \quad \text{margin}(b, w) = \min_{n=1,\ldots,N} \frac{1}{\|w\|} y_n (w^T x_n + b)$$

• $w^T x + b = 0$ is the same hyperplane as $3 w^T x + 3b = 0$: scaling does not matter
• special scaling: only consider separating $(b, w)$ such that $\min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$
  $\implies \text{margin}(b, w) = \frac{1}{\|w\|}$

$$\max_{b,w} \ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$$
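The special scaling is easy to demonstrate: dividing any separating $(b, w)$ by $c = \min_n y_n (w^T x_n + b)$ leaves the hyperplane (and hence the margin) unchanged while making that minimum exactly 1, so the margin becomes $1/\|w\|$. A hypothetical sketch with invented names:

```python
import numpy as np

def special_scaling(b, w, X, y):
    """Rescale a separating (b, w) by c = min_n y_n (w^T x_n + b) > 0.
    (b/c, w/c) describes the same hyperplane, but now the minimum is exactly 1,
    so the margin equals 1 / ||w/c||."""
    c = np.min(y * (X @ w + b))
    assert c > 0, "(b, w) must separate the data"
    return b / c, w / c

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1, 1, -1, -1])
b, w = special_scaling(0.0, np.array([-3.0, 3.0]), X, y)
print(b, w)                   # 0.0 [-1.  1.]: same hyperplane, specially scaled
print(1 / np.linalg.norm(w))  # ~0.7071, matching the margin computed directly earlier
```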
Standard Large-Margin Hyperplane Problem

$$\max_{b,w} \ \frac{1}{\|w\|} \quad \text{subject to } \min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$$

necessary constraints: $y_n (w^T x_n + b) \ge 1$ for all $n$
original constraint: $\min_{n=1,\ldots,N} y_n (w^T x_n + b) = 1$

want: the optimal $(b, w)$ to stay here (inside the original constraint).
if the optimal $(b, w)$ were outside, e.g. $y_n (w^T x_n + b) > 1.126$ for all $n$
—can scale $(b, w)$ to the "more optimal" $\left(\frac{b}{1.126}, \frac{w}{1.126}\right)$, which still satisfies the constraints with a smaller $\|w\|$ (contradiction!)

final change: max $\Longrightarrow$ min, remove the square root in $\|w\| = \sqrt{w^T w}$, and add $\frac{1}{2}$:

$$\min_{b,w} \ \frac{1}{2} w^T w \quad \text{subject to } y_n (w^T x_n + b) \ge 1 \text{ for all } n$$
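The lecture solves this quadratic program properly in the next section; purely to make the formulation tangible, here is an illustrative sketch that hands it to a general-purpose solver (scipy's SLSQP). This is not the lecture's algorithm, and the function name is invented:

```python
import numpy as np
from scipy.optimize import minimize

def svm_hard_margin(X, y):
    """Solve min_{b,w} (1/2) w^T w  subject to  y_n (w^T x_n + b) >= 1 for all n,
    with the variables packed as z = [b, w_1, ..., w_d]."""
    d = X.shape[1]
    objective = lambda z: 0.5 * z[1:] @ z[1:]
    constraints = [{"type": "ineq",                        # "ineq" means fun(z) >= 0
                    "fun": lambda z: y * (X @ z[1:] + z[0]) - 1}]
    res = minimize(objective, np.zeros(d + 1), method="SLSQP", constraints=constraints)
    return res.x[0], res.x[1:]

X = np.array([[0., 1.], [1., 2.], [2., 0.], [-1., -2.]])
y = np.array([1., 1., -1., -1.])
b, w = svm_hard_margin(X, y)
print(b, w, 1 / np.linalg.norm(w))  # largest-margin separator and its margin
```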