## Machine Learning Foundations (機器學習基石)

### Lecture 13: Hazard of Overfitting

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Hazard of Overfitting

## Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. How Can Machines Learn?
   - Lecture 12: Nonlinear Transform. **nonlinear** via **nonlinear feature transform $\Phi$** plus **linear**, with price of **model complexity**
4. How Can Machines Learn **Better**?
   - Lecture 13: Hazard of Overfitting
     - What is Overfitting?
     - The Role of Noise and Data Size
     - Deterministic Noise
     - Dealing with Overfitting

Hazard of Overfitting What is Overfitting?

## Bad Generalization

- regression for $x \in \mathbb{R}$ with $N = 5$ examples
- target $f(x)$ = 2nd order polynomial
- label $y_n = f(x_n)$ + very small noise
- linear regression in $\mathcal{Z}$-space + $\Phi$ = 4th order polynomial
- unique solution passing all examples $\Longrightarrow E_{\text{in}}(g) = 0$
- $E_{\text{out}}(g)$ **huge**

[Figure: Data, Target, and Fit, plotted as $y$ versus $x$]

bad generalization: low $E_{\text{in}}$, high $E_{\text{out}}$
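The setup on this slide can be tried numerically. The following is a minimal sketch, not the lecture's own experiment: the quadratic target, the input range $[-1, 1]$, the noise level 0.01, and the random seed are all assumptions made for illustration. It fits a 4th order polynomial through $N = 5$ noisy points and compares $E_{\text{in}}$ with $E_{\text{out}}$ estimated on a dense grid.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only


def target(x):
    # Hypothetical 2nd order target; the slide does not give its coefficients.
    return 1.0 - 2.0 * x + 3.0 * x**2


N = 5
x_train = rng.uniform(-1.0, 1.0, size=N)
y_train = target(x_train) + rng.normal(0.0, 0.01, size=N)  # very small noise

# Linear regression after a 4th order polynomial transform:
# with 5 distinct points, the 5 coefficients interpolate the data exactly.
coeffs = np.polyfit(x_train, y_train, deg=4)

# E_in on the training points, E_out against the noiseless target on a dense grid.
x_test = np.linspace(-1.0, 1.0, 1001)
e_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
e_out = np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2)

print(f"E_in  = {e_in:.2e}")
print(f"E_out = {e_out:.2e}")
```

How large $E_{\text{out}}$ becomes depends on where the five points happen to fall; the point of the sketch is the gap between a numerically zero $E_{\text{in}}$ and a nonzero $E_{\text{out}}$.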


## Bad Generalization and Overfitting

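- take $d_{\text{VC}} = 1126$ for learning: bad generalization, $(E_{\text{out}} - E_{\text{in}})$ large
- switch from $d_{\text{VC}} = d_{\text{VC}}^{*}$ to $d_{\text{VC}} = 1126$: **overfitting**, $E_{\text{in}} \downarrow$, $E_{\text{out}} \uparrow$
- switch from $d_{\text{VC}} = d_{\text{VC}}^{*}$ to $d_{\text{VC}} = 1$: **underfitting**, $E_{\text{in}} \uparrow$, $E_{\text{out}} \uparrow$

[Figure: Error versus VC dimension $d_{\text{VC}}$, showing the in-sample error, model complexity, and out-of-sample error curves, with $d_{\text{VC}}^{*}$ marked on the horizontal axis]

bad generalization: low $E_{\text{in}}$, high $E_{\text{out}}$; **overfitting: lower** $E_{\text{in}}$, **higher** $E_{\text{out}}$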
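A small sweep over model complexity illustrates the three regimes. This is only a sketch under stated assumptions: polynomial degree stands in for $d_{\text{VC}}$, and the sinusoidal target, noise level, data size, and seed are hypothetical choices, not values from the lecture.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(1)  # assumed seed


def target(x):
    # Hypothetical smooth target, chosen only for illustration.
    return np.sin(np.pi * x)


N = 20
x_train = rng.uniform(-1.0, 1.0, size=N)
y_train = target(x_train) + rng.normal(0.0, 0.3, size=N)
x_test = np.linspace(-1.0, 1.0, 2001)

degrees = list(range(0, 11))  # polynomial degree as a stand-in for d_VC
e_in, e_out = [], []
for d in degrees:
    # A Legendre basis keeps the least-squares problem well conditioned.
    c = L.legfit(x_train, y_train, d)
    e_in.append(np.mean((L.legval(x_train, c) - y_train) ** 2))
    # E_out is measured against the noiseless target on a dense grid.
    e_out.append(np.mean((L.legval(x_test, c) - target(x_test)) ** 2))

best = degrees[int(np.argmin(e_out))]
for d, a, b in zip(degrees, e_in, e_out):
    print(f"degree {d:2d}: E_in = {a:.4f}  E_out = {b:.4f}")
print("degree with lowest E_out:", best)
```

Because the hypothesis sets are nested, $E_{\text{in}}$ can only go down as the degree grows; moving past the degree with the lowest $E_{\text{out}}$ is the overfitting direction, moving below it is the underfitting direction.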


## Cause of Overfitting: A Driving Analogy

[Figure: two panels of $y$ versus $x$ with Data, Target, and Fit; a ‘good fit’ on the left $\Longrightarrow$ an **overfit** on the right]

| learning | driving |
|---|---|
| overfit | have a car accident |
| use excessive $d_{\text{VC}}$ | ‘drive too fast’ |
| **noise** | bumpy road |
| **limited data size $N$** | limited observations about road condition |

next: how do **noise** & **data size** affect overfitting?


## Fun Time

Based on our discussion, for data of fixed size, which of the following situations has relatively the lowest risk of overfitting?

1. small noise, fitting from small $d_{\text{VC}}$ to medium $d_{\text{VC}}$
2. small noise, fitting from small $d_{\text{VC}}$ to large $d_{\text{VC}}$
3. large noise, fitting from small $d_{\text{VC}}$ to medium $d_{\text{VC}}$
4. large noise, fitting from small $d_{\text{VC}}$ to large $d_{\text{VC}}$

### Reference Answer: 1

Two causes of overfitting are noise and excessive $d_{\text{VC}}$. So if both are relatively ‘under control’, the risk of overfitting is smaller.


Hazard of Overfitting The Role of Noise and Data Size

## Case Study (1/2)

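**10th order target function + noise**

[Figure: Data and Target, plotted as $y$ versus $x$]

| | $g_2 \in \mathcal{H}_2$ | $g_{10} \in \mathcal{H}_{10}$ |
|---|---|---|
| $E_{\text{in}}$ | 0.050 | 0.034 |
| $E_{\text{out}}$ | 0.127 | **9.00** |

**50th order target function, noiseless**

[Figure: Data and Target, plotted as $y$ versus $x$]

| | $g_2 \in \mathcal{H}_2$ | $g_{10} \in \mathcal{H}_{10}$ |
|---|---|---|
| $E_{\text{in}}$ | 0.029 | 0.00001 |
| $E_{\text{out}}$ | 0.120 | **7680** |

overfitting from best $g_2 \in \mathcal{H}_2$ to best $g_{10} \in \mathcal{H}_{10}$?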

## Case Study (2/2)

**10th order target function + noise**

[Figure: Data, 2nd Order Fit, and 10th Order Fit, plotted as $y$ versus $x$]

| | $g_2 \in \mathcal{H}_2$ | $g_{10} \in \mathcal{H}_{10}$ |
|---|---|---|
| $E_{\text{in}}$ | 0.050 | 0.034 |
| $E_{\text{out}}$ | 0.127 | **9.00** |

**50th order target function, noiseless**

[Figure: Data, 2nd Order Fit, and 10th Order Fit, plotted as $y$ versus $x$]

| | $g_2 \in \mathcal{H}_2$ | $g_{10} \in \mathcal{H}_{10}$ |
|---|---|---|
| $E_{\text{in}}$ | 0.029 | 0.00001 |
| $E_{\text{out}}$ | 0.120 | **7680** |

overfitting from $g_2$ to $g_{10}$? **both yes!**
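A comparison of this kind can be set up as follows. This is a hedged sketch rather than a reproduction of the slide: the random Legendre targets, $N = 15$, the noise level 0.3, and the seed are assumptions, so the printed numbers will not match the table above.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(2)  # assumed seed


def random_target(order):
    # Hypothetical target: random Legendre coefficients up to the given order.
    return rng.normal(size=order + 1)


def fit_and_score(coef_f, sigma, n=15):
    """Fit g_2 and g_10 on one data set; return {degree: (E_in, E_out)}."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = L.legval(x, coef_f) + rng.normal(0.0, sigma, size=n)
    x_test = np.linspace(-1.0, 1.0, 5001)
    f_test = L.legval(x_test, coef_f)
    scores = {}
    for deg in (2, 10):
        g = L.legfit(x, y, deg)  # least-squares fit within H_deg
        e_in = np.mean((L.legval(x, g) - y) ** 2)
        e_out = np.mean((L.legval(x_test, g) - f_test) ** 2)
        scores[deg] = (e_in, e_out)
    return scores


noisy_10th = fit_and_score(random_target(10), sigma=0.3)
noiseless_50th = fit_and_score(random_target(50), sigma=0.0)

cases = (("10th order + noise", noisy_10th), ("50th order, noiseless", noiseless_50th))
for name, s in cases:
    for deg in (2, 10):
        print(f"{name:22s} g_{deg:<2d}: E_in = {s[deg][0]:.4g}, E_out = {s[deg][1]:.4g}")
```

Since $\mathcal{H}_2 \subset \mathcal{H}_{10}$, the in-sample error of $g_{10}$ never exceeds that of $g_2$ on the same data; whether $E_{\text{out}}$ gets worse is what the experiment reveals.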


## Irony of Two Learners

[Figure: two panels of $y$ versus $x$; one shows Data, 2nd Order Fit, and 10th Order Fit, the other shows Data and Target]

- learner **O**verfit: pick $g_{10} \in \mathcal{H}_{10}$
- learner **R**estrict: pick $g_2 \in \mathcal{H}_2$
- when both **know that target = 10th order**: R ‘gives up’ the ability to fit, but R **wins in $E_{\text{out}}$** by a lot!

philosophy: **concession for advantage?** **:-)**


## Learning Curves Revisited

[Figure: two panels, one for $\mathcal{H}_2$ and one for $\mathcal{H}_{10}$, each plotting Expected Error against Number of Data Points, $N$, with curves for $E_{\text{out}}$ and $E_{\text{in}}$]

- $\mathcal{H}_{10}$: lower $E_{\text{out}}$ when $N \to \infty$, but much larger generalization error for small $N$
- gray area: O overfits! ($E_{\text{in}} \downarrow$, $E_{\text{out}} \uparrow$)

R always **wins in $E_{\text{out}}$** if $N$ is small!
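Learning curves like these can be estimated by averaging over many random data sets. The sketch below is illustrative only: the 10th order Legendre target, the noise level, the data sizes, the number of trials, and the seed are all assumptions.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(3)  # assumed seed
coef_f = rng.normal(size=11)    # hypothetical 10th order target
sigma = 0.5                     # assumed noise level
x_test = np.linspace(-1.0, 1.0, 2001)
f_test = L.legval(x_test, coef_f)


def expected_errors(deg, n, trials=200):
    """Average E_in and E_out of the degree-`deg` fit over random data sets of size n."""
    e_in = e_out = 0.0
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = L.legval(x, coef_f) + rng.normal(0.0, sigma, size=n)
        g = L.legfit(x, y, deg)
        e_in += np.mean((L.legval(x, g) - y) ** 2)
        # Error on fresh noisy labels: mean squared distance to f plus sigma^2.
        e_out += np.mean((L.legval(x_test, g) - f_test) ** 2) + sigma**2
    return e_in / trials, e_out / trials


sizes = [15, 30, 60, 120]
curves = {deg: [expected_errors(deg, n) for n in sizes] for deg in (2, 10)}
for deg in (2, 10):
    for n, (a, b) in zip(sizes, curves[deg]):
        print(f"H_{deg:<2d} N = {n:3d}: E_in = {a:.3f}, E_out = {b:.3f}")
```

Reading the two printed curves side by side shows at which $N$, if any, the larger hypothesis set starts to pay off for this particular assumed target and noise level.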

## The ‘No Noise’ Case

[Figure: two panels of $y$ versus $x$; one shows Data, 2nd Order Fit, and 10th Order Fit, the other shows Data and Target]

- learner **O**verfit: pick $g_{10} \in \mathcal{H}_{10}$
- learner **R**estrict: pick $g_2 \in \mathcal{H}_2$
- when both **know that there is no noise**: R still wins

is there really **no noise?** ‘target complexity’ acts like noise


## Fun Time

When having limited data, in which of the following cases would learner R perform better than learner O?

1. limited data from a 10th order target function with some noise
2. limited data from a 1126th order target function with no noise
3. limited data from a 1126th order target function with some noise
4. all of the above

### Reference Answer: 4

We discussed 1 and 2, but you should be able to **‘generalize’ :-)** that R also wins in the more difficult case of 3.

Hazard of Overfitting Deterministic Noise

## A Detailed Experiment

$$
y = f(x) + \epsilon \sim \text{Gaussian}\Bigg(\underbrace{\sum_{q=0}^{Q_f} \alpha_q x^q}_{f(x)},\ \sigma^2\Bigg)
$$

- Gaussian iid noise $\epsilon$ with level $\sigma^2$
- some ‘uniform’ distribution on $f(x)$ with complexity level $Q_f$
- data size $N$

[Figure: Data and Target, plotted as $y$ versus $x$]

goal: **‘overfit level’** for different $(N, \sigma^2)$ and $(N, Q_f)$?

## The Overfit Measure

[Figure: Data, 2nd Order Fit, and 10th Order Fit, plotted as $y$ versus $x$]

- $g_2 \in \mathcal{H}_2$
- $g_{10} \in \mathcal{H}_{10}$
- $E_{\text{in}}(g_{10}) \le E_{\text{in}}(g_2)$ for sure

[Figure: Data and Target, plotted as $y$ versus $x$]

**overfit measure**: $E_{\text{out}}(g_{10}) - E_{\text{out}}(g_2)$
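The overfit measure can be estimated by simulation. The following is a rough sketch under assumptions that are not in the slides: the coefficients $\alpha_q$ are drawn from a standard normal distribution, inputs are uniform on $[-1, 1]$, the target is not normalized, and the function name `overfit_measure` and the seed are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)  # assumed seed


def overfit_measure(n, sigma2, q_f, trials=100):
    """Average of E_out(g_10) - E_out(g_2) over random targets and data sets."""
    x_test = np.linspace(-1.0, 1.0, 2001)
    total = 0.0
    for _ in range(trials):
        alpha = rng.normal(size=q_f + 1)  # assumed distribution of alpha_q
        f = Polynomial(alpha)             # f(x) = sum_q alpha_q x^q
        x = rng.uniform(-1.0, 1.0, size=n)
        y = f(x) + rng.normal(0.0, np.sqrt(sigma2), size=n)
        e_out = {}
        for deg in (2, 10):
            g = Polynomial.fit(x, y, deg)  # least-squares fit within H_deg
            e_out[deg] = np.mean((g(x_test) - f(x_test)) ** 2)
        total += e_out[10] - e_out[2]
    return total / trials


for n, sigma2, q_f in [(30, 0.1, 20), (120, 0.1, 20), (30, 1.0, 20)]:
    value = overfit_measure(n, sigma2, q_f)
    print(f"N = {n:3d}, sigma^2 = {sigma2}, Q_f = {q_f}: overfit measure = {value:+.3f}")
```

A positive value means $g_{10}$ does worse out of sample than $g_2$; sweeping $(N, \sigma^2)$ or $(N, Q_f)$ over a grid of values would give maps in the spirit of the plots on the following slides.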


## The Results

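[Figure, first panel: impact of $\sigma^2$ versus $N$. Overfit measure shown by color (scale from -0.2 to 0.2) over Number of Data Points, $N$ (horizontal) and Noise Level, $\sigma^2$ (vertical, 0 to 2); fixed $Q_f = 20$]

[Figure, second panel: impact of $Q_f$ versus $N$. Overfit measure shown by color (scale from -0.2 to 0.2) over Number of Data Points, $N$ (horizontal) and Target Complexity, $Q_f$ (vertical, 0 to 100); fixed $\sigma^2 = 0.1$]

**ring a bell? :-)**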


## Impact of Noise and Data Size

[Figure, first panel: impact of $\sigma^2$ versus $N$: **stochastic noise**. Overfit measure over Number of Data Points, $N$ and Noise Level, $\sigma^2$]

[Figure, second panel: impact of $Q_f$ versus $N$: **deterministic noise**. Overfit measure over Number of Data Points, $N$ and Target Complexity, $Q_f$]

four reasons for serious overfitting:

- data size $N \downarrow$: overfit $\uparrow$
- stochastic noise $\uparrow$: overfit $\uparrow$
- deterministic noise $\uparrow$: overfit $\uparrow$
- excessive power $\uparrow$: overfit $\uparrow$

overfitting ‘easily’ happens

## Deterministic Noise

- if $f \notin \mathcal{H}$: something of $f$ cannot be captured by $\mathcal{H}$
- deterministic noise: difference between best $h^{*} \in \mathcal{H}$ and $f$
- acts like ‘stochastic noise’; not new to CS: pseudo-random generator
- difference to stochastic noise:
  - depends on $\mathcal{H}$
  - fixed for a given $\mathbf{x}$

[Figure: $h^{*}$ and $f$, plotted as $y$ versus $x$]

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? **:-)**
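Deterministic noise can be computed directly once a target and a hypothesis set are fixed. The sketch below assumes a random 10th order Legendre target and approximates the input distribution by a dense uniform grid on $[-1, 1]$; both are illustrative choices, as are the seed and the function name `deterministic_noise`.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(5)    # assumed seed
coef_f = rng.normal(size=11)      # hypothetical 10th order target f
x = np.linspace(-1.0, 1.0, 4001)  # dense grid standing in for P(x)
f = L.legval(x, coef_f)


def deterministic_noise(deg):
    # Best h* in H_deg: least-squares fit to f itself (no stochastic noise involved).
    h_star = L.legfit(x, f, deg)
    return f - L.legval(x, h_star)


noise_h2 = deterministic_noise(2)
noise_h5 = deterministic_noise(5)
noise_h10 = deterministic_noise(10)

print("mean squared deterministic noise, H_2 :", np.mean(noise_h2**2))
print("mean squared deterministic noise, H_5 :", np.mean(noise_h5**2))
print("mean squared deterministic noise, H_10:", np.mean(noise_h10**2))
```

The result changes with the hypothesis set (it vanishes up to rounding once $f \in \mathcal{H}_{10}$), and recomputing it gives the same value at every $x$, which is what separates it from stochastic noise.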


## Fun Time

Consider the target function being $\sin(1126x)$ for $x \in [0, 2\pi]$. When $x$ is uniformly sampled from the range, and we use all possible linear hypotheses $h(x) = w \cdot x$ to approximate the target function with respect to the squared error, what is the level of deterministic noise for each $x$?

1. $|\sin(1126x)|$
2. $|\sin(1126x) - x|$
3. $|\sin(1126x) + x|$
4. $|\sin(1126x) - 1126x|$

### Reference Answer: 1

You can try a few different $w$ and convince yourself that the best hypothesis $h^{*}$ is (very nearly) $h^{*}(x) = 0$. The deterministic noise is the difference between $f$ and $h^{*}$.
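The claim about the best hypothesis can be checked numerically. This is a small sketch that approximates uniform sampling with a dense grid (an assumption) and uses the least-squares slope $w^{*} = \mathbb{E}[x f(x)] / \mathbb{E}[x^2]$; integrating $x \sin(kx)$ and $x^2$ over $[0, 2\pi]$ for integer $k$ gives the closed form $w^{*} = -3 / (k (2\pi)^2)$, roughly $-6.7 \times 10^{-5}$ for $k = 1126$.

```python
import numpy as np

k = 1126
x = np.linspace(0.0, 2.0 * np.pi, 2_000_001)  # dense grid standing in for uniform sampling
f = np.sin(k * x)

# Least-squares slope for h(x) = w * x: w* = E[x f(x)] / E[x^2].
w_star = np.dot(x, f) / np.dot(x, x)

# Closed form from integrating x sin(kx) and x^2 over [0, 2*pi] (k an integer).
w_closed_form = -3.0 / (k * (2.0 * np.pi) ** 2)

print("numerical w*  :", w_star)
print("closed form w*:", w_closed_form)
print("max |h*(x)| on the range:", abs(w_star) * 2.0 * np.pi)
```

The slope is tiny compared with the unit amplitude of the target, so treating $h^{*}$ as the zero function and the deterministic noise as $|\sin(1126x)|$ is an excellent approximation.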

Hazard of Overfitting Dealing with Overfitting

## Driving Analogy Revisited

| learning | driving |
|---|---|
| overfit | have a car accident |
| use excessive $d_{\text{VC}}$ | ‘drive too fast’ |
| noise | bumpy road |
| limited data size $N$ | limited observations about road condition |
| **start from simple model** | drive slowly |
| **data cleaning/pruning** | use more accurate road information |
| **data hinting** | exploit more road information |
| **regularization** | put on the brakes |
| **validation** | monitor the dashboard |

all very **practical** techniques to combat overfitting

## Data Cleaning/Pruning

- if we ‘detect’ the outlier 5 at the top by
  - too close to other ◦, or too far from other ×
  - wrong by current classifier
  - . . .
- possible action 1: correct the label (data cleaning)
- possible action 2: remove the example (data pruning)

possibly helps, but **effect varies**


## Data Hinting

- slightly shifted/rotated digits carry the same meaning
- possible action: add **virtual examples** by shifting/rotating the given digits (data hinting)

possibly helps, but **watch out**: virtual examples are not $\overset{\text{iid}}{\sim} P(\mathbf{x}, y)$!
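Generating virtual examples by shifting can be sketched in a few lines. The helper name `add_shifted_examples`, the one-pixel shifts, and the toy 8×8 "images" below are assumptions for illustration; real digit data and rotations would need more care.

```python
import numpy as np


def add_shifted_examples(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the original examples plus one shifted copy per (dy, dx) in shifts."""
    all_images, all_labels = [images], [labels]
    for dy, dx in shifts:
        # np.roll wraps pixels around the border; acceptable only if the margins are blank.
        all_images.append(np.roll(images, shift=(dy, dx), axis=(1, 2)))
        all_labels.append(labels)  # a shifted digit keeps its label
    return np.concatenate(all_images), np.concatenate(all_labels)


# Toy stand-in for digit data: 3 blank 8x8 images with a single bright pixel each.
images = np.zeros((3, 8, 8))
images[np.arange(3), 4, 4] = 1.0
labels = np.array([1, 5, 7])

aug_images, aug_labels = add_shifted_examples(images, labels)
print(aug_images.shape, aug_labels.shape)
```

The augmented set is five times larger, but the extra examples are deterministic copies of existing ones rather than fresh draws from $P(\mathbf{x}, y)$, which is exactly the caveat on the slide.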


## Fun Time

Assume we know that $f(x)$ is symmetric for some 1D regression application. That is, $f(x) = f(-x)$. One possibility of using the knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data $\{(x_n, y_n)\}$ as hints. What virtual examples suit your needs best?

1. $\{(x_n, -y_n)\}$
2. $\{(-x_n, -y_n)\}$
3. $\{(-x_n, y_n)\}$
4. $\{(2x_n, 2y_n)\}$

### Reference Answer: 3

We want the virtual examples to encode the invariance when $x \to -x$.
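The effect of the mirrored virtual examples can be seen in a small fit. This sketch assumes a hypothetical symmetric target $\cos(2x)$, a cubic hypothesis set, and a fixed seed; it compares the odd coefficients of the fit with and without the hints.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(6)  # assumed seed


def target(x):
    # Hypothetical symmetric target: f(x) = f(-x).
    return np.cos(2.0 * x)


x = rng.uniform(-1.0, 1.0, size=10)
y = target(x) + rng.normal(0.0, 0.1, size=10)

# Virtual examples (-x_n, y_n) encode the symmetry as extra data.
x_aug = np.concatenate([x, -x])
y_aug = np.concatenate([y, y])

# Coefficients are ordered c0, c1, c2, c3 for c0 + c1 x + c2 x^2 + c3 x^3.
coef_plain = P.polyfit(x, y, 3)
coef_hint = P.polyfit(x_aug, y_aug, 3)

print("odd coefficients without hints:", coef_plain[1], coef_plain[3])
print("odd coefficients with hints   :", coef_hint[1], coef_hint[3])
```

With the mirrored data, the least-squares problem is itself symmetric, so the odd coefficients vanish up to rounding and the fitted hypothesis ends up symmetric without restricting the hypothesis set by hand.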
