
(1)

Machine Learning Foundations ( 機器學習基石)

Lecture 12: Nonlinear Transformation

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)

(2)

Nonlinear Transformation

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 11: Linear Models for Classification
binary classification via (logistic) regression; multiclass via OVA/OVO decomposition

Lecture 12: Nonlinear Transformation
Quadratic Hypotheses
Nonlinear Transform
Price of Nonlinear Transform
Structured Hypothesis Sets

4 How Can Machines Learn Better?

(3)

Nonlinear Transformation Quadratic Hypotheses

Linear Hypotheses

up to now: linear hypotheses

visually: 'line'-like boundary
mathematically: linear scores s = w̃ᵀx . . . wait, s = wᵀx

but limited . . .

[figure: a data set on [−1, 1] × [−1, 1] that no line separates well]

theoretically: d_VC under control :-)
practically: on some D, large E_in for every line :-(

how to break the limit of linear hypotheses?

(4)

Nonlinear Transformation Quadratic Hypotheses

Circular Separable

[figure: the same data, separable by a circle]

D not linear separable but circular separable by a circle of radius √0.6 centered at the origin:

h_SEP(x) = sign(−x₁² − x₂² + 0.6)

re-derive Circular-PLA, Circular-Regression, blah blah . . . all over again? :-)

(5)

Nonlinear Transformation Quadratic Hypotheses

Circular Separable and Linear Separable

h(x) = sign( 0.6·1 + (−1)·x₁² + (−1)·x₂² )
     = sign( w̃₀·z₀ + w̃₁·z₁ + w̃₂·z₂ )
     = sign( w̃ᵀz )

with z₀ = 1, z₁ = x₁², z₂ = x₂² and w̃₀ = 0.6, w̃₁ = −1, w̃₂ = −1.

[figure: the circle in X-space corresponds to a line in Z-space]

{(xₙ, yₙ)} circular separable =⇒ {(zₙ, yₙ)} linear separable

x ∈ X ⟼ z ∈ Z: (nonlinear) feature transform Φ

circular separable in X =⇒ linear separable in Z; vice versa?
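To make the X-to-Z correspondence concrete, here is a minimal sketch (my own illustration, not part of the lecture): it generates circular separable data and checks that the fixed weights w̃ = (0.6, −1, −1) separate the transformed points {(zₙ, yₙ)} with a line. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))          # points in X = [-1, 1] x [-1, 1]
y = np.sign(0.6 - X[:, 0]**2 - X[:, 1]**2)     # +1 inside the circle of radius sqrt(0.6)

def phi(X):
    """Feature transform: x = (x1, x2) -> z = (1, x1^2, x2^2)."""
    return np.column_stack([np.ones(len(X)), X[:, 0]**2, X[:, 1]**2])

Z = phi(X)
w_tilde = np.array([0.6, -1.0, -1.0])          # the line in Z-space from the derivation above
print(np.all(np.sign(Z @ w_tilde) == y))       # True: the transformed data is linear separable
```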

(6)

Nonlinear Transformation Quadratic Hypotheses

Linear Hypotheses in Z -Space

(z₀, z₁, z₂) = z = Φ(x) = (1, x₁², x₂²)

h(x) = h̃(z) = sign(w̃ᵀΦ(x)) = sign(w̃₀ + w̃₁x₁² + w̃₂x₂²)

w̃ = (w̃₀, w̃₁, w̃₂):
(0.6, −1, −1): circle (◦ inside)
(−0.6, +1, +1): circle (◦ outside)
(0.6, −1, −2): ellipse
(0.6, −1, +2): hyperbola
(0.6, +1, +2): constant :-)

lines in Z-space ⇐⇒ special quadratic curves in X-space
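As a small worked illustration (mine, not from the slides), the curve type implied by each w̃ above can be read off from the signs of the coefficients of x₁² and x₂². The helper below is hypothetical and only covers this restricted transform (1, x₁², x₂²).

```python
def curve_type(w0, w1, w2):
    """Boundary of sign(w0 + w1*x1^2 + w2*x2^2) in X-space."""
    if w1 * w2 > 0:                                   # squared terms share a sign
        if w0 * w1 < 0:
            return "circle" if w1 == w2 else "axis-aligned ellipse"
        return "no boundary (constant hypothesis)"
    if w1 * w2 < 0:
        return "hyperbola"
    return "degenerate (line(s) or constant)"

for w in [(0.6, -1, -1), (-0.6, 1, 1), (0.6, -1, -2), (0.6, -1, 2), (0.6, 1, 2)]:
    print(w, "->", curve_type(*w))    # reproduces the five cases listed above
```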

(7)

Nonlinear Transformation Quadratic Hypotheses

General Quadratic Hypothesis Set

a 'bigger' Z-space with Φ₂(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²)

perceptrons in Z-space ⇐⇒ quadratic hypotheses in X-space

H_Φ₂ = { h(x) : h(x) = h̃(Φ₂(x)) for some linear h̃ on Z }

can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola, . . .

e.g. the ellipse 2(x₁ + x₂ − 3)² + (x₁ − x₂ − 4)² = 1 ⇐= w̃ᵀ = [33, −20, −4, 3, 2, 3]

include lines and constants as degenerate cases

next: learn a good quadratic hypothesis g
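A quick numerical check of the ellipse example (my own sketch): expanding 2(x₁ + x₂ − 3)² + (x₁ − x₂ − 4)² − 1 gives exactly w̃ᵀΦ₂(x) for w̃ = [33, −20, −4, 3, 2, 3], so the two descriptions of the boundary agree at every point.

```python
import numpy as np

def phi2(x1, x2):
    """Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

w = np.array([33.0, -20.0, -4.0, 3.0, 2.0, 3.0])
rng = np.random.default_rng(1)
for x1, x2 in rng.uniform(-5, 5, size=(5, 2)):
    lhs = w @ phi2(x1, x2)
    rhs = 2 * (x1 + x2 - 3)**2 + (x1 - x2 - 4)**2 - 1
    print(np.isclose(lhs, rhs))                       # True for every sampled point
```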

(8)

Nonlinear Transformation Quadratic Hypotheses

Fun Time

Using the transform Φ₂(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²), which of the following weights w̃ᵀ in the Z-space implements the parabola 2x₁² + x₂ = 1?

1 [−1, 2, 1, 0, 0, 0]
2 [0, 2, 1, 0, −1, 0]
3 [−1, 0, 1, 2, 0, 0]
4 [−1, 2, 0, 0, 0, 1]

Reference Answer: 3

Too simple, uh? :-) Flexibility to implement arbitrary quadratic curves opens new possibilities for minimizing E_in!

(10)

Nonlinear Transformation Nonlinear Transform

Good Quadratic Hypothesis

Z-space ⇐⇒ X-space
perceptrons ⇐⇒ quadratic hypotheses
good perceptron ⇐⇒ good quadratic hypothesis
separating perceptron ⇐⇒ separating quadratic hypothesis

[figure: a separating line in Z-space ⇐⇒ a separating circle in X-space]

want: get good perceptron in Z-space
known: get good perceptron in X-space with data {(xₙ, yₙ)}
todo: get good perceptron in Z-space with data {(zₙ = Φ₂(xₙ), yₙ)}

(11)

Nonlinear Transformation Nonlinear Transform

The Nonlinear Transform Steps

[figure: data in X-space → (Φ) → data in Z-space → (A) → line in Z-space → quadratic boundary in X-space]

1 transform original data {(xₙ, yₙ)} to {(zₙ = Φ(xₙ), yₙ)} by Φ
2 get a good perceptron w̃ using {(zₙ, yₙ)} and your favorite linear classification algorithm A
3 return g(x) = sign(w̃ᵀΦ(x))
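The three steps can be coded directly. Below is a minimal sketch (my own, not course code) that uses Φ₂ as the transform and a plain cyclic PLA as a stand-in for "your favorite linear classification algorithm A"; the toy data is filtered to keep a margin around the circle so that plain PLA is guaranteed to halt quickly.

```python
import numpy as np

def phi2(X):
    # step 1 helper: Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2), applied row-wise
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def pla(Z, y, max_iter=10000):
    # step 2: perceptron learning in Z-space (assumes the transformed data is linear separable)
    w = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        mistakes = np.flatnonzero(np.sign(Z @ w) != y)
        if len(mistakes) == 0:
            break
        n = mistakes[0]
        w += y[n] * Z[n]
    return w

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
score = 0.6 - X[:, 0]**2 - X[:, 1]**2
keep = np.abs(score) > 0.1                        # keep a margin so plain PLA converges fast
X, y = X[keep], np.sign(score[keep])

w_tilde = pla(phi2(X), y)                         # steps 1 and 2
g = lambda X_new: np.sign(phi2(X_new) @ w_tilde)  # step 3: g(x) = sign(w~^T Phi(x))
print((g(X) == y).mean())                         # 1.0: E_in(g) = 0 on this data set
```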



(12)

Nonlinear Transformation Nonlinear Transform

Nonlinear Model via Nonlinear Φ + Linear Models

[figure: the same transform-then-learn diagram as the previous slide]

two choices:
feature transform Φ
linear model A, not just binary classification

Pandora's box :-): can now freely do quadratic PLA, quadratic regression, cubic regression, . . ., polynomial regression

(13)

Nonlinear Transformation Nonlinear Transform

Feature Transform Φ

[figure: raw digit images → (Φ) → (average intensity, symmetry) features for the '1' vs. 'not 1' task, then A and Φ back]

not new, not just polynomial:
raw (pixels) −→ domain knowledge −→ concrete (intensity, symmetry)

the force, too good to be true? :-)

(14)

Nonlinear Transformation Nonlinear Transform

Fun Time

Consider the quadratic transform Φ₂(x) for x ∈ ℝᵈ instead of ℝ². The transform should include all different quadratic, linear, and constant terms formed by (x₁, x₂, . . ., x_d). What is the number of dimensions of z = Φ₂(x)?

1 d
2 d²/2 + 3d/2 + 1
3 d² + d + 1
4 2ᵈ

Reference Answer: 2

The number of different quadratic terms is (d choose 2) + d; the number of different linear terms is d; the number of constant terms is 1.
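A tiny check of this count (my own, not from the lecture): enumerate the monomials of degree at most two in d variables and compare against d²/2 + 3d/2 + 1.

```python
from itertools import combinations_with_replacement

def dim_phi2(d):
    # constant (degree 0) + linear (degree 1) + quadratic (degree 2) monomials
    return sum(len(list(combinations_with_replacement(range(d), deg))) for deg in range(3))

for d in (2, 3, 10):
    print(d, dim_phi2(d), d**2 / 2 + 3 * d / 2 + 1)   # the two counts agree: 6, 10, 66
```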

(16)

Nonlinear Transformation Price of Nonlinear Transform

Computation/Storage Price

Q-th order polynomial transform:
Φ_Q(x) = (1, x₁, x₂, . . ., x_d, x₁², x₁x₂, . . ., x_d², . . ., x₁^Q, x₁^{Q−1}x₂, . . ., x_d^Q)

= 1 (for w̃₀) + d̃ (others) dimensions
= # ways of ≤ Q-combination from d kinds with repetitions
= C(Q+d, Q) = C(Q+d, d) = O(Qᵈ)
= efforts needed for computing/storing z = Φ_Q(x) and w̃

Q large =⇒ difficult to compute/store
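To see the growth concretely, here is a small sketch of my own: the number of dimensions of Φ_Q(x) for x ∈ ℝᵈ, including the constant coordinate, is C(Q+d, d), so d̃ = C(Q+d, d) − 1.

```python
from math import comb

def dim_phiQ(Q, d):
    """Number of monomials of degree <= Q in d variables (includes the constant z0 = 1)."""
    return comb(Q + d, d)

for Q in (2, 3, 10, 50):
    print(Q, dim_phiQ(Q, 2), dim_phiQ(Q, 10))
# d = 2 gives 6, 10, 66, 1326 (so d~ = 1325 at Q = 50); d = 10 grows far faster
```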

(17)

Nonlinear Transformation Price of Nonlinear Transform

Model Complexity Price

Q-th order polynomial transform:
Φ_Q(x) = (1, x₁, x₂, . . ., x_d, x₁², x₁x₂, . . ., x_d², . . ., x₁^Q, x₁^{Q−1}x₂, . . ., x_d^Q)

1 (for w̃₀) + d̃ (others) dimensions = O(Qᵈ)

number of free parameters w̃_i = d̃ + 1 ≈ d_VC(H_Φ_Q)

• d_VC(H_Φ_Q) ≤ d̃ + 1, why?
  any d̃ + 2 inputs not shattered in Z =⇒ any d̃ + 2 inputs not shattered in X

Q large =⇒ large d_VC

(18)

Nonlinear Transformation Price of Nonlinear Transform

Generalization Issue

[figure: the same data fit with Φ₁ (original x) and with Φ₄]

which one do you prefer? :-)
Φ₁: 'visually' preferred
Φ₄: E_in(g) = 0 but overkill

1 can we make sure that E_out(g) is close enough to E_in(g)?
2 can we make E_in(g) small enough?

trade-off:
d̃ (Q) higher: (1) :-(  (2) :-D
d̃ (Q) lower:  (1) :-D  (2) :-(

how to pick Q? visually, maybe?

(19)

Nonlinear Transformation Price of Nonlinear Transform

Danger of Visual Choices

first of all, can you really 'visualize' when X = ℝ¹⁰? (well, I can't :-))

Visualize X = ℝ²:
full Φ₂: z = (1, x₁, x₂, x₁², x₁x₂, x₂²), d_VC = 6
or z = (1, x₁², x₂²), d_VC = 3, after visualizing?
or better z = (1, x₁² + x₂²), d_VC = 2?
or even better z = sign(0.6 − x₁² − x₂²)?
—careful about your brain's 'model complexity'

for VC-safety, Φ shall be decided without 'peeking' data

(20)

Nonlinear Transformation Price of Nonlinear Transform

Fun Time

Consider the Q-th order polynomial transform Φ_Q(x) for x ∈ ℝ². Recall that d̃ = C(Q+2, 2) − 1. When Q = 50, what is the value of d̃?

1 1126
2 1325
3 2651
4 6211

Reference Answer: 2

It's just a simple calculation, but it shows you how d̃ becomes hundreds of times larger than d = 2 after the transform.

(22)

Nonlinear Transformation Structured Hypothesis Sets

Polynomial Transform Revisited

Φ₀(x) = (1)
Φ₁(x) = (Φ₀(x), x₁, x₂, . . ., x_d)
Φ₂(x) = (Φ₁(x), x₁², x₁x₂, . . ., x_d²)
Φ₃(x) = (Φ₂(x), x₁³, x₁²x₂, . . ., x_d³)
. . .
Φ_Q(x) = (Φ_{Q−1}(x), x₁^Q, x₁^{Q−1}x₂, . . ., x_d^Q)

H_Φ₀ ⊂ H_Φ₁ ⊂ H_Φ₂ ⊂ H_Φ₃ ⊂ . . . ⊂ H_Φ_Q
  ‖      ‖      ‖      ‖             ‖
  H₀     H₁     H₂     H₃    . . .   H_Q

[figure: nested circles H₀ ⊂ H₁ ⊂ H₂ ⊂ H₃ ⊂ · · ·]

structure: nested H_i
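The recursive description above translates directly into code; a small sketch (mine, not course code) in which each Φ_Q appends all degree-Q monomials to Φ_{Q−1}:

```python
from itertools import combinations_with_replacement
import numpy as np

def phi_Q(x, Q):
    """Nested polynomial transform: Phi_Q(x) = (Phi_{Q-1}(x), all degree-Q monomials of x)."""
    x = np.asarray(x, dtype=float)
    if Q == 0:
        return np.array([1.0])                                    # Phi_0(x) = (1)
    degree_Q = [np.prod(x[list(idx)])                             # e.g. x1^Q, x1^(Q-1)*x2, ...
                for idx in combinations_with_replacement(range(len(x)), Q)]
    return np.concatenate([phi_Q(x, Q - 1), degree_Q])

z = phi_Q([0.5, -2.0], Q=3)
print(len(z))   # 10 = C(3 + 2, 2): the nested dimensions are 1, 3, 6, 10 for Q = 0, 1, 2, 3
```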

(23)

Nonlinear Transformation Structured Hypothesis Sets

Structured Hypothesis Sets

[figure: nested hypothesis sets H₀ ⊂ H₁ ⊂ H₂ ⊂ H₃ ⊂ · · ·]

Let g_i = argmin_{h ∈ H_i} E_in(h):

H₀ ⊂ H₁ ⊂ H₂ ⊂ H₃ ⊂ . . .
d_VC(H₀) ≤ d_VC(H₁) ≤ d_VC(H₂) ≤ d_VC(H₃) ≤ . . .
E_in(g₀) ≥ E_in(g₁) ≥ E_in(g₂) ≥ E_in(g₃) ≥ . . .

[figure: error versus VC dimension d_VC, with in-sample error decreasing, model complexity increasing, and out-of-sample error U-shaped]

use H₁₁₂₆ won't be good! :-(

(24)

Nonlinear Transformation Structured Hypothesis Sets

Linear Model First

[figure: the same error-versus-d_VC curve]

tempting sin: use H₁₁₂₆, low E_in(g₁₁₂₆) to fool your boss
—really? :-( a dangerous path of no return

safe route: H₁ first
• if E_in(g₁) good enough, live happily thereafter :-)
• otherwise, move right of the curve with nothing lost except 'wasted' computation

linear model first: simple, efficient, safe, and workable!
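A minimal sketch of the safe route (my own illustration; phi_Q, the error threshold, and the regression-based fit are all placeholders for "your favorite" choices): start from Q = 1 and move to a richer transform only when the current E_in is not good enough.

```python
import numpy as np
from itertools import combinations_with_replacement

def phi_Q(X, Q):
    # all monomials of degree <= Q (including the constant), applied row-wise
    cols = []
    for deg in range(Q + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1) if idx else np.ones(len(X)))
    return np.column_stack(cols)

def linear_model_first(X, y, good_enough=0.05, Q_max=4):
    for Q in range(1, Q_max + 1):                 # H_1 first, then the richer nested sets
        Z = phi_Q(X, Q)
        w = np.linalg.pinv(Z) @ y                 # linear regression as a stand-in for A
        e_in = np.mean(np.sign(Z @ w) != y)
        if e_in <= good_enough:                   # good enough: live happily thereafter :-)
            return Q, w, e_in
    return Q_max, w, e_in                         # otherwise settle for the last fit

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(0.6 - X[:, 0]**2 - X[:, 1]**2)
Q, w, e_in = linear_model_first(X, y)
print(Q, e_in)    # stops at the first Q whose in-sample error clears the threshold
```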

(25)

Nonlinear Transformation Structured Hypothesis Sets

Fun Time

Consider two hypothesis sets, H₁ and H₁₁₂₆, where H₁ ⊂ H₁₁₂₆. Which of the following relationships between d_VC(H₁) and d_VC(H₁₁₂₆) is not possible?

1 d_VC(H₁) = d_VC(H₁₁₂₆)
2 d_VC(H₁) ≠ d_VC(H₁₁₂₆)
3 d_VC(H₁) < d_VC(H₁₁₂₆)
4 d_VC(H₁) > d_VC(H₁₁₂₆)

Reference Answer: 4

Every input combination that H₁ shatters can be shattered by H₁₁₂₆, so d_VC cannot decrease.

(27)

Nonlinear Transformation Structured Hypothesis Sets

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

Lecture 11: Linear Models for Classification

Lecture 12: Nonlinear Transformation
Quadratic Hypotheses: linear hypotheses on quadratic-transformed data
Nonlinear Transform: happy linear modeling after Z = Φ(X)
Price of Nonlinear Transform: computation/storage/[model complexity]
Structured Hypothesis Sets: linear/simpler model first

next: dark side of the force :-)

4 How Can Machines Learn Better?
