Machine Learning Foundations (機器學習基石)
Lecture 13: Hazard of Overfitting
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
  (Lecture 12: Nonlinear Transform: going nonlinear via a nonlinear feature transform Φ plus a linear model, at the price of model complexity)
4 How Can Machines Learn Better?
  Lecture 13: Hazard of Overfitting
  • What is Overfitting?
  • The Role of Noise and Data Size
  • Deterministic Noise
  • Dealing with Overfitting
What is Overfitting?

Bad Generalization
• regression for x ∈ R with N = 5 examples
• target f(x) = 2nd-order polynomial
• labels y_n = f(x_n) + very small noise
• linear regression in Z-space with Φ = 4th-order polynomial
• unique solution passing through all examples ⟹ E_in(g) = 0
• E_out(g) huge

[figure: Data, Target, Fit]

bad generalization: low E_in, high E_out
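The bad-generalization example above can be sketched numerically. The particular quadratic target, the noise level, and the random seed below are my own illustrative choices, not the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2nd-order target; labels carry very small noise.
def f(x):
    return 2.0 * x**2 - x + 1.0

N = 5
x_train = rng.uniform(-1.0, 1.0, size=N)
y_train = f(x_train) + rng.normal(0.0, 0.01, size=N)

# Linear regression after a 4th-order polynomial transform: 5 free
# parameters and 5 examples give a unique interpolating solution.
g = np.poly1d(np.polyfit(x_train, y_train, deg=4))

E_in = np.mean((g(x_train) - y_train) ** 2)

# Estimate E_out on a dense grid of fresh points.
x_test = np.linspace(-1.0, 1.0, 1000)
E_out = np.mean((g(x_test) - f(x_test)) ** 2)

print(f"E_in  = {E_in:.2e}")   # essentially 0: the fit passes every example
print(f"E_out = {E_out:.2e}")  # typically far larger than E_in
```

Because the 4th-order model has exactly as many parameters as examples, E_in drops to numerical zero while the interpolant wiggles away from the quadratic target between and beyond the samples.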
Bad Generalization and Overfitting
• take d_VC = 1126 for learning: bad generalization, (E_out − E_in) large
• switch from d_VC = d_VC* to d_VC = 1126: overfitting, E_in ↓, E_out ↑
• switch from d_VC = d_VC* to d_VC = 1: underfitting, E_in ↑, E_out ↑

[figure: in-sample error, model complexity, and out-of-sample error versus VC dimension d_VC, with the optimum at d_VC*]

bad generalization: low E_in, high E_out;
overfitting: lower E_in, higher E_out
Cause of Overfitting: A Driving Analogy

[figure: 'good fit' ⟹ overfit (Data, Target, Fit)]

learning                 driving
overfit                  commit a car accident
use excessive d_VC       'drive too fast'
noise                    bumpy road
limited data size N      limited observations about road condition

next: how do noise & data size affect overfitting?
Fun Time
Based on our discussion, for data of fixed size, which of the following situations carries the lowest risk of overfitting?
1 small noise, fitting from small d_VC to medium d_VC
2 small noise, fitting from small d_VC to large d_VC
3 large noise, fitting from small d_VC to medium d_VC
4 large noise, fitting from small d_VC to large d_VC

Reference Answer: 1
Two causes of overfitting are noise and excessive d_VC. If both are relatively 'under control', the risk of overfitting is smaller.
The Role of Noise and Data Size

Case Study (1/2)

10th-order target function + noise        [figure: Data, Target]
        g2 ∈ H2   g10 ∈ H10
E_in    0.050     0.034
E_out   0.127     9.00

50th-order target function, noiseless     [figure: Data, Target]
        g2 ∈ H2   g10 ∈ H10
E_in    0.029     0.00001
E_out   0.120     7680

overfitting from the best g2 ∈ H2 to the best g10 ∈ H10?
Case Study (2/2)

10th-order target function + noise        [figure: Data, 2nd Order Fit, 10th Order Fit]
        g2 ∈ H2   g10 ∈ H10
E_in    0.050     0.034
E_out   0.127     9.00

50th-order target function, noiseless     [figure: Data, 2nd Order Fit, 10th Order Fit]
        g2 ∈ H2   g10 ∈ H10
E_in    0.029     0.00001
E_out   0.120     7680

overfitting from g2 to g10? both yes!
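The noisy half of the case study can be reproduced in miniature. The target coefficients, the sample size, and the noise level below are illustrative stand-ins, so the numbers will not match the table, but the pattern of E_in versus E_out does:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 10th-order target; coefficients, N, and noise level are
# my own assumptions, not the lecture's exact experiment settings.
f = np.poly1d(rng.normal(size=11))

N, sigma = 15, 0.5
x = rng.uniform(-1.0, 1.0, size=N)
y = f(x) + rng.normal(0.0, sigma, size=N)

x_test = np.linspace(-1.0, 1.0, 2000)

results = {}
for deg in (2, 10):
    g = np.poly1d(np.polyfit(x, y, deg=deg))
    E_in = np.mean((g(x) - y) ** 2)
    E_out = np.mean((g(x_test) - f(x_test)) ** 2)  # vs the noiseless target
    results[deg] = (E_in, E_out)
    print(f"H{deg:2d}: E_in = {E_in:.4f}, E_out = {E_out:.4f}")
```

Since H2 is nested inside H10, the degree-10 least-squares fit can never have higher E_in than the degree-2 fit; the interesting question is what happens to E_out.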
Irony of Two Learners

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• learner O (Overfit): picks g10 ∈ H10
• learner R (Restrict): picks g2 ∈ H2
• even when both know that the target is 10th order, R 'gives up' the ability to fit, but R wins in E_out by a lot!

philosophy: concession for advantage? :-)
Learning Curves Revisited

[figure: expected error versus number of data points N, for H2 and H10, with curves for E_in and E_out]

• H10: lower E_out when N → ∞, but much larger generalization error for small N
• gray area: O overfits! (E_in ↓, E_out ↑)

R always wins in E_out if N is small!
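The two learning curves can be estimated by simulation. Everything below (the random 10th-order target, the noise level, the trial count) is an illustrative assumption; the point is that H10's expected E_out starts far above H2's for small N and only wins as N grows:

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.poly1d(rng.normal(size=11))   # hypothetical 10th-order target
sigma = 0.3
x_test = np.linspace(-1.0, 1.0, 1000)

def expected_eout(deg, N, trials=200):
    """Average E_out of the degree-`deg` least-squares fit over many
    freshly generated data sets of size N."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=N)
        y = f(x) + rng.normal(0.0, sigma, size=N)
        g = np.poly1d(np.polyfit(x, y, deg=deg))
        errs.append(np.mean((g(x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

for N in (15, 30, 100):
    print(f"N={N:3d}: E_out(H2)={expected_eout(2, N):.3f}, "
          f"E_out(H10)={expected_eout(10, N):.3f}")
```

Averaging over trials mimics the "expected error" axis of the learning-curve figure: at small N the high-variance H10 fits blow up, while more data shrinks H10's E_out toward its smaller asymptotic bias.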
The 'No Noise' Case

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• learner O (Overfit): picks g10 ∈ H10
• learner R (Restrict): picks g2 ∈ H2
• even when both know that there is no noise, R still wins

is there really no noise?
'target complexity' acts like noise
Fun Time
With limited data, in which of the following cases would learner R perform better than learner O?
1 limited data from a 10th-order target function with some noise
2 limited data from a 1126th-order target function with no noise
3 limited data from a 1126th-order target function with some noise
4 all of the above

Reference Answer: 4
We discussed 1 and 2, but you should be able to 'generalize' :-) that R also wins in the more difficult case 3.
Deterministic Noise

A Detailed Experiment

y = f(x) + ε ∼ Gaussian( f(x), σ² ),  with f(x) = Σ_{q=0}^{Q_f} α_q x^q

• Gaussian i.i.d. noise with level σ²
• some 'uniform' distribution on f(x) with complexity level Q_f
• data size N

[figure: Data, Target]

goal: 'overfit level' for different (N, σ²) and (N, Q_f)?
The Overfit Measure

[figures: Data, 2nd Order Fit, 10th Order Fit; Data, Target]

• g2 ∈ H2
• g10 ∈ H10
• E_in(g10) ≤ E_in(g2) for sure

overfit measure: E_out(g10) − E_out(g2)
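A small-scale version of this measurement can be coded directly. Drawing the target coefficients as iid Gaussians is a simplifying assumption (the lecture uses a particular 'uniform' normalization over targets), so treat this as a sketch of the experiment, not a reproduction:

```python
import numpy as np

rng = np.random.default_rng(3)

def overfit_measure(N, sigma, Qf=20, trials=100):
    """Average E_out(g10) - E_out(g2) over random targets and data sets.

    Assumption: target coefficients drawn iid Gaussian, which differs
    from the lecture's exact target distribution.
    """
    x_test = np.linspace(-1.0, 1.0, 500)
    total = 0.0
    for _ in range(trials):
        f = np.poly1d(rng.normal(size=Qf + 1))   # random Qf-order target
        x = rng.uniform(-1.0, 1.0, size=N)
        y = f(x) + rng.normal(0.0, sigma, size=N)
        g2 = np.poly1d(np.polyfit(x, y, deg=2))
        g10 = np.poly1d(np.polyfit(x, y, deg=10))
        total += (np.mean((g10(x_test) - f(x_test)) ** 2)
                  - np.mean((g2(x_test) - f(x_test)) ** 2))
    return total / trials

# The measure shrinks (and can even turn negative) as N grows.
print(overfit_measure(N=20, sigma=0.5))
print(overfit_measure(N=120, sigma=0.5))
```

Sweeping this function over a grid of (N, σ²) or (N, Q_f) and plotting the values as a heat map reproduces the qualitative shape of the figures on the next slide.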
The Results

[figure: overfit measure (color scale from −0.2 to 0.2) as a function of the number of data points N (80 to 120);
left: impact of noise level σ² (0 to 2) versus N, with Q_f fixed at 20;
right: impact of target complexity Q_f (0 to 100) versus N, with σ² fixed at 0.1]

ring a bell? :-)
Impact of Noise and Data Size
impact of σ² versus N: stochastic noise
impact of Q_f versus N: deterministic noise

[figure: the same two heat maps as before, relabeled]

four causes of serious overfitting:
  data size N ↓            overfit ↑
  stochastic noise ↑       overfit ↑
  deterministic noise ↑    overfit ↑
  excessive power ↑        overfit ↑

overfitting 'easily' happens
Deterministic Noise
• if f ∉ H: something about f cannot be captured by H
• deterministic noise: the difference between the best h* ∈ H and f
• acts like 'stochastic noise', and is not new to CS: think of a pseudo-random generator
• differences from stochastic noise:
  • depends on H
  • fixed for a given x

[figure: h* versus f]

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
Fun Time
Consider the target function sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from this range and we use all possible linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?
1 |sin(1126x)|
2 |sin(1126x) − x|
3 |sin(1126x) + x|
4 |sin(1126x) − 1126x|

Reference Answer: 1
You can try a few different w and convince yourself that the best hypothesis is h*(x) = 0. The deterministic noise is the difference between f and h*.
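The reference answer can be checked numerically; the Monte-Carlo sample size below is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(4)

# Among h(x) = w * x, the squared-error-optimal w for x ~ Uniform[0, 2pi]
# has the closed form w* = E[x sin(1126 x)] / E[x^2]. Estimating it by
# Monte Carlo shows w* is essentially 0, so h* is (essentially) the zero
# function and the deterministic noise at each x is about |sin(1126 x)|.
x = rng.uniform(0.0, 2.0 * np.pi, size=1_000_000)
y = np.sin(1126.0 * x)

w_star = np.mean(x * y) / np.mean(x * x)
print(f"best w ~ {w_star:.2e}")  # very close to 0
```

Intuition: sin(1126x) oscillates so rapidly over [0, 2π] that its correlation with any slowly varying linear function averages out to nearly nothing.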
Dealing with Overfitting

Driving Analogy Revisited
learning                     driving
overfit                      commit a car accident
use excessive d_VC           'drive too fast'
noise                        bumpy road
limited data size N          limited observations about road condition
start from a simple model    drive slowly
data cleaning/pruning        use more accurate road information
data hinting                 exploit more road information
regularization               put on the brakes
validation                   monitor the dashboard

all very practical techniques to combat overfitting
Data Cleaning/Pruning
• if we 'detect' the outlier '5' at the top of the figure by
  • being too close to the other ◦, or too far from the other ×
  • being wrong according to the current classifier
  • ...
• possible action 1: correct the label (data cleaning)
• possible action 2: remove the example (data pruning)

possibly helps, but the effect varies
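One concrete way to 'detect' such an outlier can be sketched as below. The k-nearest-neighbor voting rule and the toy data are my own illustration of the "too far from examples of its own class" idea, not the lecture's exact detector:

```python
import numpy as np

def prune_by_neighbors(X, y, k=3):
    """Flag an example as a likely outlier when a majority of its k
    nearest neighbors carry a different label, then drop it (data
    pruning). A heuristic sketch, not the lecture's method."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # ignore the point itself
        nn = np.argsort(d)[:k]            # indices of k nearest neighbors
        if np.sum(y[nn] != y[i]) > k // 2:
            keep[i] = False               # neighbors mostly disagree
    return X[keep], y[keep]

# Toy data: two clusters, plus one point sitting inside the +1 cluster
# but labeled -1 (a plausible mislabeled 'outlier').
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [5.05, 5.05]])
y = np.array([-1, -1, -1, 1, 1, 1, -1])

X_clean, y_clean = prune_by_neighbors(X, y)
print(len(X_clean))  # prints 6: only the suspicious example is removed
```

Data cleaning would instead flip that example's label to +1; either way, the decision of what counts as an outlier is a heuristic, which is exactly why the effect varies.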
Data Hinting
• slightly shifted/rotated digits carry the same meaning
• possible action: add virtual examples by shifting/rotating the given digits (data hinting)

possibly helps, but watch out: the virtual examples are not iid ∼ P(x, y)!
Fun Time
Assume we know that f(x) is symmetric for some 1D regression application; that is, f(x) = f(−x). One possibility for using this knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. Which virtual examples suit your needs best?
1 {(x_n, −y_n)}
2 {(−x_n, −y_n)}
3 {(−x_n, y_n)}
4 {(2x_n, 2y_n)}

Reference Answer: 3
We want the virtual examples to encode the invariance under x → −x.
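The symmetric-hint answer can be exercised in a small sketch. The even target cos(3x), the sample sizes, the noise level, and the degree-4 polynomial model are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.cos(3.0 * x)        # even target: f(x) = f(-x)

N = 10
x = rng.uniform(0.0, 1.0, size=N)             # only x >= 0 observed
y = f(x) + rng.normal(0.0, 0.05, size=N)

# Data hinting: add the virtual examples {(-x_n, y_n)} from choice 3.
x_hint = np.concatenate([x, -x])
y_hint = np.concatenate([y, y])

x_test = np.linspace(-1.0, 1.0, 500)
errs = {}
for name, xs, ys in [("raw", x, y), ("hinted", x_hint, y_hint)]:
    g = np.poly1d(np.polyfit(xs, ys, deg=4))
    errs[name] = np.mean((g(x_test) - f(x_test)) ** 2)
    print(f"{name:6s}: E_out = {errs[name]:.4f}")
```

The raw fit only ever sees x ≥ 0 and extrapolates badly on the negative side; the mirrored virtual examples encode the invariance and pull the fit toward the symmetric target, even though those virtual points are not iid ∼ P(x, y).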