Machine Learning Foundations (機器學習基石)
Lecture 13: Hazard of Overfitting
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
  (Lecture 12: Nonlinear Transform: going nonlinear via a nonlinear feature transform Φ plus a linear model, at the price of model complexity)
4 How Can Machines Learn Better?
  Lecture 13: Hazard of Overfitting
  • What is Overfitting?
  • The Role of Noise and Data Size
  • Deterministic Noise
  • Dealing with Overfitting
What is Overfitting?

Bad Generalization
• regression for x ∈ R with N = 5 examples
• target f(x) = 2nd-order polynomial
• labels y_n = f(x_n) + very small noise
• linear regression in Z-space with Φ = 4th-order polynomial
• unique solution passing through all examples ⟹ E_in(g) = 0
• E_out(g) huge

[figure: Data, Target, Fit]

bad generalization: low E_in, high E_out
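The bad-generalization example above can be sketched numerically. The particular quadratic target, the noise level, and the random seed below are my own illustrative choices, not the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2nd-order target; labels carry very small noise.
def f(x):
    return 2.0 * x**2 - x + 1.0

N = 5
x_train = rng.uniform(-1.0, 1.0, size=N)
y_train = f(x_train) + rng.normal(0.0, 0.01, size=N)

# Linear regression after a 4th-order polynomial transform: 5 free
# parameters and 5 examples give a unique interpolating solution.
g = np.poly1d(np.polyfit(x_train, y_train, deg=4))

E_in = np.mean((g(x_train) - y_train) ** 2)

# Estimate E_out on a dense grid of fresh points.
x_test = np.linspace(-1.0, 1.0, 1000)
E_out = np.mean((g(x_test) - f(x_test)) ** 2)

print(f"E_in  = {E_in:.2e}")   # essentially 0: the fit passes every example
print(f"E_out = {E_out:.2e}")  # typically far larger than E_in
```

Because the 4th-order model has exactly as many parameters as examples, E_in drops to numerical zero while the interpolant wiggles away from the quadratic target between and beyond the samples.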
Bad Generalization and Overfitting
• take d_VC = 1126 for learning: bad generalization, (E_out − E_in) large
• switch from d_VC = d_VC* to d_VC = 1126: overfitting, E_in ↓, E_out ↑
• switch from d_VC = d_VC* to d_VC = 1: underfitting, E_in ↑, E_out ↑

[figure: in-sample error, model complexity, and out-of-sample error versus VC dimension d_VC, with the optimum at d_VC*]

bad generalization: low E_in, high E_out;
overfitting: lower E_in, higher E_out
Cause of Overfitting: A Driving Analogy

[figure: 'good fit' ⟹ overfit (Data, Target, Fit)]

learning                 driving
overfit                  commit a car accident
use excessive d_VC       'drive too fast'
noise                    bumpy road
limited data size N      limited observations about road condition

next: how do noise & data size affect overfitting?
Fun Time
Based on our discussion, for data of fixed size, which of the following situations carries the lowest risk of overfitting?
1 small noise, fitting from small d_VC to medium d_VC
2 small noise, fitting from small d_VC to large d_VC
3 large noise, fitting from small d_VC to medium d_VC
4 large noise, fitting from small d_VC to large d_VC

Reference Answer: 1
Two causes of overfitting are noise and excessive d_VC. If both are relatively 'under control', the risk of overfitting is smaller.
The Role of Noise and Data Size

Case Study (1/2)

10th-order target function + noise        [figure: Data, Target]
        g2 ∈ H2   g10 ∈ H10
E_in    0.050     0.034
E_out   0.127     9.00

50th-order target function, noiseless     [figure: Data, Target]
        g2 ∈ H2   g10 ∈ H10
E_in    0.029     0.00001
E_out   0.120     7680

overfitting from the best g2 ∈ H2 to the best g10 ∈ H10?
Case Study (2/2)

10th-order target function + noise        [figure: Data, 2nd Order Fit, 10th Order Fit]
        g2 ∈ H2   g10 ∈ H10
E_in    0.050     0.034
E_out   0.127     9.00

50th-order target function, noiseless     [figure: Data, 2nd Order Fit, 10th Order Fit]
        g2 ∈ H2   g10 ∈ H10
E_in    0.029     0.00001
E_out   0.120     7680

overfitting from g2 to g10? both yes!
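The noisy half of the case study can be reproduced in miniature. The target coefficients, the sample size, and the noise level below are illustrative stand-ins, so the numbers will not match the table, but the pattern of E_in versus E_out does:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 10th-order target; coefficients, N, and noise level are
# my own assumptions, not the lecture's exact experiment settings.
f = np.poly1d(rng.normal(size=11))

N, sigma = 15, 0.5
x = rng.uniform(-1.0, 1.0, size=N)
y = f(x) + rng.normal(0.0, sigma, size=N)

x_test = np.linspace(-1.0, 1.0, 2000)

results = {}
for deg in (2, 10):
    g = np.poly1d(np.polyfit(x, y, deg=deg))
    E_in = np.mean((g(x) - y) ** 2)
    E_out = np.mean((g(x_test) - f(x_test)) ** 2)  # vs the noiseless target
    results[deg] = (E_in, E_out)
    print(f"H{deg:2d}: E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```

Since H2 is nested inside H10, the degree-10 least-squares fit can never have higher E_in than the degree-2 fit; the interesting question is what happens to E_out.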
Irony of Two Learners

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• learner O (Overfit): picks g10 ∈ H10
• learner R (Restrict): picks g2 ∈ H2
• even when both know that the target is 10th order, R 'gives up' the ability to fit, but R wins in E_out by a lot!

philosophy: concession for advantage? :-)
Learning Curves Revisited

[figure: expected error versus number of data points N, for H2 and H10, with curves for E_in and E_out]

• H10: lower E_out when N → ∞, but much larger generalization error for small N
• gray area: O overfits! (E_in ↓, E_out ↑)

R always wins in E_out if N is small!
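The two learning curves can be estimated by simulation. Everything below (the random 10th-order target, the noise level, the trial count) is an illustrative assumption; the point is that H10's expected E_out starts far above H2's for small N and only wins as N grows:

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.poly1d(rng.normal(size=11))   # hypothetical 10th-order target
sigma = 0.3
x_test = np.linspace(-1.0, 1.0, 1000)

def expected_eout(deg, N, trials=200):
    """Average E_out of the degree-`deg` least-squares fit over many
    freshly generated data sets of size N."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=N)
        y = f(x) + rng.normal(0.0, sigma, size=N)
        g = np.poly1d(np.polyfit(x, y, deg=deg))
        errs.append(np.mean((g(x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

for N in (15, 30, 100):
    print(f"N={N:3d}: E_out(H2)={expected_eout(2, N):.3f}, "
          f"E_out(H10)={expected_eout(10, N):.3f}")
```

Averaging over trials mimics the "expected error" axis of the learning-curve figure: at small N the high-variance H10 fits blow up, while more data shrinks H10's E_out toward its smaller asymptotic bias.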
The 'No Noise' Case

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• learner O (Overfit): picks g10 ∈ H10
• learner R (Restrict): picks g2 ∈ H2
• even when both know that there is no noise, R still wins

is there really no noise?
'target complexity' acts like noise
Fun Time
With limited data, in which of the following cases would learner R perform better than learner O?
1 limited data from a 10th-order target function with some noise
2 limited data from a 1126th-order target function with no noise
3 limited data from a 1126th-order target function with some noise
4 all of the above

Reference Answer: 4
We discussed 1 and 2, but you should be able to 'generalize' :-) that R also wins in the more difficult case 3.
Deterministic Noise

A Detailed Experiment

y = f(x) + ε ∼ Gaussian( f(x), σ² ),  with f(x) = Σ_{q=0}^{Q_f} α_q x^q

• Gaussian i.i.d. noise with level σ²
• some 'uniform' distribution on f(x) with complexity level Q_f
• data size N

[figure: Data, Target]

goal: 'overfit level' for different (N, σ²) and (N, Q_f)?
The Overfit Measure

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• g2 ∈ H2
• g10 ∈ H10
• E_in(g10) ≤ E_in(g2) for sure

overfit measure: E_out(g10) − E_out(g2)
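A small-scale version of this measurement can be coded directly. Drawing the target coefficients as iid Gaussians is a simplifying assumption (the lecture uses a particular 'uniform' normalization over targets), so treat this as a sketch of the experiment, not a reproduction:

```python
import numpy as np

rng = np.random.default_rng(3)

def overfit_measure(N, sigma, Qf=20, trials=100):
    """Average E_out(g10) - E_out(g2) over random targets and data sets.

    Assumption: target coefficients drawn iid Gaussian, which differs
    from the lecture's exact target distribution.
    """
    x_test = np.linspace(-1.0, 1.0, 500)
    total = 0.0
    for _ in range(trials):
        f = np.poly1d(rng.normal(size=Qf + 1))   # random Qf-order target
        x = rng.uniform(-1.0, 1.0, size=N)
        y = f(x) + rng.normal(0.0, sigma, size=N)
        g2 = np.poly1d(np.polyfit(x, y, deg=2))
        g10 = np.poly1d(np.polyfit(x, y, deg=10))
        total += (np.mean((g10(x_test) - f(x_test)) ** 2)
                  - np.mean((g2(x_test) - f(x_test)) ** 2))
    return total / trials

# The measure shrinks (and can even turn negative) as N grows.
print(overfit_measure(N=20, sigma=0.5))
print(overfit_measure(N=120, sigma=0.5))
```

Sweeping this function over a grid of (N, σ²) or (N, Q_f) and plotting the values as a heat map reproduces the qualitative shape of the figures on the next slide.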
The Results

[figure: overfit measure (color scale from −0.2 to 0.2) as a function of the number of data points N (80 to 120);
left: impact of noise level σ² (0 to 2) versus N, with Q_f fixed at 20;
right: impact of target complexity Q_f (0 to 100) versus N, with σ² fixed at 0.1]

ring a bell? :-)
Impact of Noise and Data Size
impact of σ² versus N: stochastic noise
impact of Q_f versus N: deterministic noise

[figure: the same two heat maps as before, relabeled]

four causes of serious overfitting:
  data size N ↓            overfit ↑
  stochastic noise ↑       overfit ↑
  deterministic noise ↑    overfit ↑
  excessive power ↑        overfit ↑

overfitting 'easily' happens
Deterministic Noise
• if f ∉ H: something about f cannot be captured by H
• deterministic noise: the difference between the best h* ∈ H and f
• acts like 'stochastic noise', and is not new to CS: think of a pseudo-random generator
• differences from stochastic noise:
  • depends on H
  • fixed for a given x

[figure: h* versus f]

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
Fun Time
Consider the target function sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from this range and we use all possible linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?
1 |sin(1126x)|
2 |sin(1126x) − x|
3 |sin(1126x) + x|
4 |sin(1126x) − 1126x|

Reference Answer: 1
You can try a few different w and convince yourself that the best hypothesis is h*(x) = 0. The deterministic noise is the difference between f and h*.
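The reference answer can be checked numerically; the Monte-Carlo sample size below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# Among h(x) = w * x, the squared-error-optimal w for x ~ Uniform[0, 2pi]
# has the closed form w* = E[x sin(1126 x)] / E[x^2]. Estimating it by
# Monte Carlo shows w* is essentially 0, so h* is (essentially) the zero
# function and the deterministic noise at each x is about |sin(1126 x)|.
x = rng.uniform(0.0, 2.0 * np.pi, size=1_000_000)
y = np.sin(1126.0 * x)

w_star = np.mean(x * y) / np.mean(x * x)
print(f"best w ~ {w_star:.2e}")  # very close to 0
```

Intuition: sin(1126x) oscillates so rapidly over [0, 2π] that its correlation with any slowly varying linear function averages out to nearly nothing.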
Dealing with Overfitting

Driving Analogy Revisited
learning                     driving
overfit                      commit a car accident
use excessive d_VC           'drive too fast'
noise                        bumpy road
limited data size N          limited observations about road condition
start from a simple model    drive slowly
data cleaning/pruning        use more accurate road information
data hinting                 exploit more road information
regularization               put on the brakes
validation                   monitor the dashboard

all very practical techniques to combat overfitting
Data Cleaning/Pruning
• if we 'detect' the outlier '5' at the top of the figure by
  • being too close to the other ◦, or too far from the other ×
  • being wrong according to the current classifier
  • ...
• possible action 1: correct the label (data cleaning)
• possible action 2: remove the example (data pruning)

possibly helps, but the effect varies
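One concrete way to 'detect' such an outlier can be sketched as below. The k-nearest-neighbor voting rule and the toy data are my own illustration of the "too far from examples of its own class" idea, not the lecture's exact detector:

```python
import numpy as np

def prune_by_neighbors(X, y, k=3):
    """Flag an example as a likely outlier when a majority of its k
    nearest neighbors carry a different label, then drop it (data
    pruning). A heuristic sketch, not the lecture's method."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # ignore the point itself
        nn = np.argsort(d)[:k]            # indices of k nearest neighbors
        if np.sum(y[nn] != y[i]) > k // 2:
            keep[i] = False               # neighbors mostly disagree
    return X[keep], y[keep]

# Toy data: two clusters, plus one point sitting inside the +1 cluster
# but labeled -1 (a plausible mislabeled 'outlier').
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [5.05, 5.05]])
y = np.array([-1, -1, -1, 1, 1, 1, -1])

X_clean, y_clean = prune_by_neighbors(X, y)
print(len(X_clean))  # prints 6: only the suspicious example is removed
```

Data cleaning would instead flip that example's label to +1; either way, the decision of what counts as an outlier is a heuristic, which is exactly why the effect varies.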
Data Hinting
• slightly shifted/rotated digits carry the same meaning
• possible action: add virtual examples by shifting/rotating the given digits (data hinting)

possibly helps, but watch out: the virtual examples are not iid ∼ P(x, y)!
Fun Time
Assume we know that f(x) is symmetric for some 1D regression application; that is, f(x) = f(−x). One possibility for using this knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. Which virtual examples suit your needs best?
1 {(x_n, −y_n)}
2 {(−x_n, −y_n)}
3 {(−x_n, y_n)}
4 {(2x_n, 2y_n)}

Reference Answer: 3
We want the virtual examples to encode the invariance under x → −x.
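The symmetric-hint answer can be exercised in a small sketch. The even target cos(3x), the sample sizes, the noise level, and the degree-4 polynomial model are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.cos(3.0 * x)        # even target: f(x) = f(-x)

N = 10
x = rng.uniform(0.0, 1.0, size=N)             # only x >= 0 observed
y = f(x) + rng.normal(0.0, 0.05, size=N)

# Data hinting: add the virtual examples {(-x_n, y_n)} from choice 3.
x_hint = np.concatenate([x, -x])
y_hint = np.concatenate([y, y])

x_test = np.linspace(-1.0, 1.0, 500)
errs = {}
for name, xs, ys in [("raw", x, y), ("hinted", x_hint, y_hint)]:
    g = np.poly1d(np.polyfit(xs, ys, deg=4))
    errs[name] = np.mean((g(x_test) - f(x_test)) ** 2)
    print(f"{name:6s}: E_out = {errs[name]:.4f}")
```

The raw fit only ever sees x ≥ 0 and extrapolates badly on the negative side; the mirrored virtual examples encode the invariance and pull the fit toward the symmetric target, even though those virtual points are not iid ∼ P(x, y).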