# Machine Learning Foundations (機器學習基石)

### Lecture 13: Hazard of Overfitting

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

### 4. How Can Machines Learn Better?

Lecture 12 recap: nonlinear hypotheses via nonlinear feature transform Φ plus linear models, with the price of model complexity

Lecture 13: Hazard of Overfitting

- What is Overfitting?
- The Role of Noise and Data Size
- Deterministic Noise
- Dealing with Overfitting

## What is Overfitting?

- regression for x ∈ R with N = 5 examples
- label y_n = f(x_n) + very small noise
- linear regression in Z-space with Φ = 4th-order polynomial
- the fit passes through all five examples: E_in(g) = 0, yet high E_out(g)
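The scenario above can be sketched numerically. This is an illustrative reconstruction, not the slide's exact setup: the target f and the noise level are not given, so a hypothetical 2nd-order target with tiny Gaussian noise is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2nd-order target plus very small noise
# (the slide does not specify f; this choice is illustrative).
f = lambda x: x**2
N = 5
x_train = rng.uniform(-1, 1, N)
y_train = f(x_train) + rng.normal(0, 0.05, N)

# Linear regression in Z-space with a 4th-order polynomial transform:
# np.polyfit solves the same least-squares problem.
g = np.poly1d(np.polyfit(x_train, y_train, deg=4))

# E_in: squared error on the 5 training points. Five points determine a
# 4th-order polynomial exactly, so the fit interpolates and E_in ~ 0.
E_in = np.mean((g(x_train) - y_train) ** 2)

# E_out: squared error estimated on fresh points from the same range.
x_test = rng.uniform(-1, 1, 10000)
E_out = np.mean((g(x_test) - f(x_test)) ** 2)

print(f"E_in  = {E_in:.2e}")
print(f"E_out = {E_out:.2e}")
```

The fit matches the data perfectly but generalizes badly: exactly the low-E_in, high-E_out pattern the slide describes.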


## Bad Generalization and Overfitting

- take dVC = 1126 for learning: bad generalization, (E_out − E_in) large
- switch from dVC = dVC* to dVC = 1126: E_in ↓, E_out ↑ (overfitting)
- switch from dVC = dVC* to dVC = 1: E_in ↑, E_out ↑ (underfitting)

On the familiar error-versus-complexity curve (in-sample error, model complexity, and out-of-sample error plotted against the VC dimension dVC):

- bad generalization: low E_in, high E_out
- overfitting: lowering E_in yields higher E_out


## Cause of Overfitting: A Driving Analogy

(figure: a 'good fit' curve versus an overfit one)

| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive dVC | 'drive too fast' |
| noise | bumpy road |
| limited data size N | limited observations about road condition |

next: how do noise & data size affect overfitting?


## Fun Time

Based on our discussion, for data of a fixed size, which of the following situations carries the lowest risk of overfitting?

1. small noise, fitting from small dVC to medium dVC
2. small noise, fitting from small dVC to large dVC
3. large noise, fitting from small dVC to medium dVC
4. large noise, fitting from small dVC to large dVC

Reference answer: 1. Two causes of overfitting are noise and excessive dVC, so if both are relatively 'under control', the risk of overfitting is smaller.


## The Role of Noise and Data Size

## Case Study (1/2)

- fit each data set with both the best g2 ∈ H2 and the best g10 ∈ H10
- even for noiseless data from a high-order target, the best g2 reaches E_out = 0.120 while the best g10 reaches E_out = 7680

overfitting from best g2 ∈ H2 to best g10 ∈ H10?


## Case Study (2/2)

overfitting from best g2 ∈ H2 to best g10 ∈ H10 in both cases?

both yes!


## Irony of Two Learners

Learner O picks H10; learner R picks H2. Even when both know that the target = 10th order, R 'gives up' some ability to fit, yet wins a lot in E_out!

philosophy: concession for advantage


## Learning Curves Revisited

(learning curves: E_out and E_in versus N, for H2 and for H10)

- H10: lower E_out when N → ∞, but much larger generalization error for small N
- gray area: H10 overfits! (E_in ↓, E_out ↑)

learner R always wins in E_out if N small!
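The learning-curve comparison can be simulated. This sketch uses assumed details not given on the slide: a hypothetical 10th-order polynomial target with random coefficients, noise of standard deviation 0.5, and averaging over repeated draws.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 10th-order target with fixed random coefficients
# (illustrative; the slide's exact target is not specified).
f = np.poly1d(rng.normal(0, 1, 11))

def avg_eout(deg, N, trials=200):
    """Average out-of-sample squared error of degree-`deg` least-squares fits
    to N noisy samples of f, over many random data sets."""
    x_test = np.linspace(-1, 1, 1000)
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(0, 0.5, N)      # noisy labels
        g = np.poly1d(np.polyfit(x, y, deg))  # best fit in H_deg
        errs.append(np.mean((g(x_test) - f(x_test)) ** 2))
    return np.mean(errs)

# Small N: the restricted model H2 beats H10 in E_out,
# even though the target itself is 10th order.
e2, e10 = avg_eout(2, 15), avg_eout(10, 15)
print(f"N=15: E_out(H2)={e2:.3f}  E_out(H10)={e10:.3f}")
```

At N = 15 the H10 fits chase the noise and their average E_out blows up, while H2 stays stable: learner R wins when N is small.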


## The ‘No Noise’ Case

- learner O versus learner R, when both know the data are noiseless
- R still wins!
- is there really no noise?

‘target complexity’ acts like noise


## Fun Time

When having limited data, in which of the following cases would learner R perform better than learner O?

1. limited data from a 10th-order target function with some noise
2. limited data from a 1126th-order target function with no noise
3. limited data from a 1126th-order target function with some noise
4. all of the above

Reference answer: 4. We discussed 1 and 2, and you should be able to infer that R also wins in the more difficult case 3.


## Deterministic Noise

## A Detailed Experiment

- y = f(x) + ε, with Gaussian noise ε of level σ²
- some ‘uniform’ distribution on f(x) with complexity level Q_f
- data size N

goal: how does the ‘overfit level’ E_out(g10) − E_out(g2) change for different (N, σ²) and (N, Q_f)?
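A minimal sketch of such an experiment, under assumed details: the ‘uniform’ target distribution is approximated here by a random polynomial of order Q_f, and the noise is Gaussian with standard deviation `sigma` (so level σ² = `sigma**2`); neither construction is spelled out on the slide.

```python
import numpy as np

rng = np.random.default_rng(2)

def overfit_level(N, sigma, Qf, trials=100):
    """Average E_out(g10) - E_out(g2) over random targets and data sets,
    for data y = f(x) + Gaussian noise with f a random order-Qf polynomial."""
    x_test = np.linspace(-1, 1, 500)
    diffs = []
    for _ in range(trials):
        f = np.poly1d(rng.normal(0, 1, Qf + 1))   # illustrative target draw
        x = rng.uniform(-1, 1, N)
        y = f(x) + rng.normal(0, sigma, N)
        g2 = np.poly1d(np.polyfit(x, y, 2))       # best fit in H2
        g10 = np.poly1d(np.polyfit(x, y, 10))     # best fit in H10
        e2 = np.mean((g2(x_test) - f(x_test)) ** 2)
        e10 = np.mean((g10(x_test) - f(x_test)) ** 2)
        diffs.append(e10 - e2)
    return np.mean(diffs)

# Less data at the same noise level -> larger overfit level
# (positive values mean g10 overfits relative to g2).
print(overfit_level(N=20, sigma=1.0, Qf=20))
print(overfit_level(N=100, sigma=1.0, Qf=20))
```

Varying (N, σ²) with Q_f fixed, or (N, Q_f) with σ² fixed, reproduces the color-map pictures summarized in the following slides.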


## The Overfit Measure

g10 fits the data better than g2 for sure; the overfit measure compares E_out(g10) against E_out(g2).


## The Results

(figure: overfit level as a color map over the (N, σ²) plane with Q_f = 20 fixed, and over the (N, Q_f) plane with σ² = 0.1 fixed)

ring a bell? :-)


## Impact of Noise and Data Size

- impact of σ² versus N: stochastic noise (figure: overfit level over Number of Data Points N and Noise Level σ², with Q_f = 20)
- impact of Q_f versus N: deterministic noise (figure: overfit level over Number of Data Points N and Target Complexity Q_f, with σ² = 0.1)

four reasons of serious overfitting:

| cause | effect |
| --- | --- |
| data size N ↓ | overfit ↑ |
| stochastic noise ↑ | overfit ↑ |
| deterministic noise ↑ | overfit ↑ |
| excessive power ↑ | overfit ↑ |

overfitting ‘easily’ happens


## Deterministic Noise

- if f ∉ H: something of f cannot be captured by H
- deterministic noise: the difference between the best h* ∈ H and f
- acts like ‘stochastic noise’ (not new to CS: pseudo-random number generators)
- difference to stochastic noise: depends on H, and fixed for a given x

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
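A small numeric illustration of deterministic noise, with assumed choices that are not from the slide: target f(x) = sin(πx), hypothesis set H = 2nd-order polynomials, x uniform on [−1, 1].

```python
import numpy as np

# Deterministic noise: the part of f that the best h* in H cannot capture.
x = np.linspace(-1, 1, 2001)
f = np.sin(np.pi * x)          # illustrative target, f not in H

# Best h* in H: least-squares 2nd-order fit to NOISELESS samples of f,
# so the residual below is purely deterministic, not stochastic.
h_star = np.poly1d(np.polyfit(x, f, 2))

det_noise = f - h_star(x)      # deterministic noise at each x
rms = np.sqrt(np.mean(det_noise ** 2))
print(f"RMS deterministic noise: {rms:.3f}")
```

Unlike stochastic noise, re-running this with the same H and the same x reproduces the residual exactly: it depends on H and is fixed for each given x.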


## Fun Time

Consider the target function f(x) = sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from the range, and we use all possible linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?

1. |sin(1126x)|
2. |sin(1126x) − x|
3. |sin(1126x) + x|
4. |sin(1126x) − 1126x|

Reference answer: 1. You can try a few different w and convince yourself that the best hypothesis is h*(x) = 0. The deterministic noise is the difference between f and h*.


## Dealing with Overfitting

## Driving Analogy Revisited

| learning | driving |
| --- | --- |
| overfit | commit a car accident |
| use excessive dVC | ‘drive too fast’ |
| start from a simple model | drive slowly |
| data cleaning/pruning | use more accurate road information |
| data hinting | exploit more road information |
| regularization | put on the brakes |
| validation | monitor the dashboard |

all very practical techniques to combat overfitting


## Data Cleaning/Pruning

- if we can ‘detect’ the outlier at the top, say by its distance to the other examples
- possible action 1: correct the label (data cleaning)
- possible action 2: remove the example (data pruning)

possibly helps, but effect varies
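Both actions can be sketched on a toy data set; the data and the neighbor-disagreement detection rule here are illustrative assumptions, not the slide's digit example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 1D two-class data with one mislabeled point.
X = np.concatenate([rng.normal(-2, 0.5, 20), rng.normal(2, 0.5, 20)])
y = np.array([-1] * 20 + [+1] * 20)
y[0] = +1  # inject an outlier: a point on the negative side labeled +1

def outlier_mask(X, y, k=3):
    """Flag examples whose k nearest neighbors ALL carry the other label."""
    mask = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        nn = np.argsort(np.abs(X - X[i]))[1:k + 1]  # k nearest other points
        if np.all(y[nn] != y[i]):
            mask[i] = True
    return mask

bad = outlier_mask(X, y)

# Action 1, data cleaning: correct the suspicious labels.
y_clean = y.copy()
y_clean[bad] = -y_clean[bad]

# Action 2, data pruning: drop the suspicious examples entirely.
X_pruned, y_pruned = X[~bad], y[~bad]
print(f"flagged {bad.sum()} example(s)")
```

Which action helps more depends on whether the flagged labels are truly wrong, matching the slide's caveat that the effect varies.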


## Data Hinting

- slightly shifted/rotated digits carry the same meaning
- add virtual examples by shifting/rotating the given digits (data hinting)

possibly helps, but watch out: virtual examples are not iid ∼ P(x, y)!
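For the symmetric-target setting in the Fun Time below, data hinting can be sketched in one dimension; the target cos(x), the noise level, and the fitting model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# 1D regression whose target is known to be even symmetric, f(x) = f(-x)
# (illustrative target: cos). The original data all lie on one side.
x = rng.uniform(0, 2, 10)
y = np.cos(x) + rng.normal(0, 0.1, 10)

# Virtual examples encode the invariance x -> -x with the SAME labels.
x_hint = np.concatenate([x, -x])
y_hint = np.concatenate([y, y])

# Fitting now uses 20 points, but only the original 10 are iid from P(x, y).
g = np.poly1d(np.polyfit(x_hint, y_hint, 4))
print(f"fit on {len(x_hint)} points (10 real + 10 virtual)")
```

Because the augmented data set is mirror-symmetric, the least-squares fit comes out even symmetric too, which is exactly the hint we wanted to inject.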


## Fun Time

Assume we know that f(x) is symmetric for some 1D regression application; that is, f(x) = f(−x). One possibility of using this knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. Which virtual examples suit your needs best?

1. {(x_n, −y_n)}
2. {(−x_n, −y_n)}
3. {(−x_n, y_n)}
4. {(2x_n, 2y_n)}

Reference answer: 3. We want the virtual examples to encode the invariance when x → −x.


## Summary

### 4. How Can Machines Learn Better?

Lecture 13: Hazard of Overfitting

- What is Overfitting? lower E_in but higher E_out
- The Role of Noise and Data Size: overfitting ‘easily’ happens
- Deterministic Noise: what H cannot capture acts like noise
- Dealing with Overfitting: data cleaning/pruning/hinting, and more