Hazard of Overfitting: Deterministic Noise

Impact of Noise and Data Size

[Figure: impact of σ² versus N (stochastic noise). Axes: Number of Data Points, N; Noise Level, σ².]
[Figure: impact of Q_f versus N (deterministic noise). Axes: Number of Data Points, N; Target Complexity, Q_f.]

four reasons of serious overfitting:
• data size N ↓: overfit ↑
• stochastic noise ↑: overfit ↑
• deterministic noise ↑: overfit ↑
• excessive power ↑: overfit ↑

overfitting ‘easily’ happens
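
The two plots above can be reproduced in spirit with a small simulation. The sketch below is only an illustration under stated assumptions: the overfit measure E_out(g10) − E_out(g2), the Legendre-series target with Gaussian coefficients, and the function name overfit_measure are not given on this slide and are chosen here for concreteness.

```python
import numpy as np
from numpy.polynomial import legendre as L

def overfit_measure(N, sigma2, Qf, rng, n_test=2000):
    """Assumed overfit measure: E_out(g10) - E_out(g2) for one random target and data set."""
    # Random target of complexity Qf: a Legendre series with Gaussian coefficients.
    coef = rng.standard_normal(Qf + 1)

    # N training examples, x uniform in [-1, 1], y corrupted by stochastic noise.
    x = rng.uniform(-1, 1, N)
    y = L.legval(x, coef) + np.sqrt(sigma2) * rng.standard_normal(N)

    # Least-squares fits with a simple (degree 2) and a powerful (degree 10) model.
    g2 = L.legfit(x, y, 2)
    g10 = L.legfit(x, y, 10)

    # Estimate the out-of-sample squared error against the noiseless target.
    xt = rng.uniform(-1, 1, n_test)
    ft = L.legval(xt, coef)
    e2 = np.mean((L.legval(xt, g2) - ft) ** 2)
    e10 = np.mean((L.legval(xt, g10) - ft) ** 2)
    return e10 - e2

rng = np.random.default_rng(0)
# Median over repeated experiments, for a small and a large data set.
small = np.median([overfit_measure(15, 1.0, 10, rng) for _ in range(100)])
large = np.median([overfit_measure(120, 1.0, 10, rng) for _ in range(100)])
print("N = 15 :", small)
print("N = 120:", large)
```

Sweeping N together with sigma2 (or Qf) over a grid and plotting the median measure would give a picture comparable to the two panels; the median is used here because a few degree-10 fits on tiny data sets can have very large error.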

Deterministic Noise

• if f ∉ H: something of f cannot be captured by H
• deterministic noise: difference between best h* ∈ H and f
• acts like ‘stochastic noise’; not new to CS: pseudo-random generator
• difference to stochastic noise:
  • depends on H
  • fixed for a given x

[Figure: target f and best hypothesis h*, plotted as y versus x.]

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
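
A minimal sketch of the definition above, with several assumptions made only for illustration: the target sin(4πx) on [-1, 1], polynomial hypothesis sets of degree 2 and 10, and h* approximated by a least-squares fit to noiseless target values on a dense grid. It shows the two listed properties: the deterministic noise changes with H, and it is a fixed function of x.

```python
import numpy as np

def f(x):
    # Hypothetical 'complicated' target, chosen for illustration only.
    return np.sin(4 * np.pi * x)

def best_in_H(degree, x_grid):
    """Approximate h* for H = polynomials of the given degree."""
    return np.polynomial.Polynomial.fit(x_grid, f(x_grid), degree)

x_grid = np.linspace(-1, 1, 2001)
noise_energy = {}
for degree in (2, 10):
    h_star = best_in_H(degree, x_grid)
    det_noise = f(x_grid) - h_star(x_grid)   # no randomness: same value for the same x
    noise_energy[degree] = np.mean(det_noise ** 2)
    print("degree", degree, "mean squared deterministic noise:", noise_energy[degree])
```

Because the degree-2 set is contained in the degree-10 set, the richer H cannot have more deterministic noise on this grid.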

Fun Time

Consider the target function being sin(1126x) for x ∈ [0, 2π]. When x is uniformly sampled from the range, and we use all possible linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?

1. |sin(1126x)|
2. |sin(1126x) − x|
3. |sin(1126x) + x|
4. |sin(1126x) − 1126x|

Reference Answer: 1

You can try a few different w and convince yourself that the best hypothesis h* is h*(x) = 0. The deterministic noise is the difference between f and h*.
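
For readers who prefer a calculation to trying a few w, here is a short worked derivation of the best w under the stated setting (uniform x on [0, 2π], squared error). It is a side computation, not part of the original slide; it suggests that the exact minimizer is a tiny negative number, so h*(x) = 0 is an excellent approximation.

```latex
w^* = \arg\min_w \int_0^{2\pi} \bigl(\sin(1126x) - wx\bigr)^2 \, dx
    = \frac{\int_0^{2\pi} x \sin(1126x)\, dx}{\int_0^{2\pi} x^2 \, dx}
    = \frac{-2\pi/1126}{8\pi^3/3}
    = -\frac{3}{4\pi^2 \cdot 1126}
    \approx -6.7 \times 10^{-5}
```

On [0, 2π] this gives |w* x| below 10⁻³, so the deterministic noise is |sin(1126x)| up to a negligible correction.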

Hazard of Overfitting: Dealing with Overfitting

Driving Analogy Revisited

learning | driving
overfit | have a car accident
use excessive d_VC | ‘drive too fast’
noise | bumpy road
limited data size N | limited observations about road condition
start from simple model | drive slowly
data cleaning/pruning | use more accurate road information
data hinting | exploit more road information
regularization | put on the brakes
validation | monitor the dashboard

all very practical techniques to combat overfitting

Data Cleaning/Pruning

• if we ‘detect’ the outlier 5 at the top by
  • too close to other ◦, or too far from other ×
  • wrong by current classifier
  • . . .
• possible action 1: correct the label (data cleaning)
• possible action 2: remove the example (data pruning)

possibly helps, but effect varies
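
The detect-then-act idea above can be sketched as follows. This is only one possible detector, assumed here for illustration: an example is flagged when its label disagrees with the majority of its k nearest neighbours (a rough stand-in for ‘too close to other ◦’). The helper names flag_suspects, clean and prune are hypothetical, and labels are assumed to be ±1.

```python
import numpy as np

def flag_suspects(X, y, k=3):
    """Flag examples whose label disagrees with the majority of their k nearest neighbours (k odd)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]                           # indices of the k nearest points
    votes = np.sign(y[nn].sum(axis=1))                          # majority label in {-1, +1}
    return votes != y

def clean(X, y, suspects):
    """Possible action 1: correct (flip) the suspicious labels."""
    y_new = y.copy()
    y_new[suspects] *= -1
    return X, y_new

def prune(X, y, suspects):
    """Possible action 2: remove the suspicious examples."""
    return X[~suspects], y[~suspects]

# Toy data: two tight clusters and one +1 example sitting inside the -1 cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [2.0, 2.0], [2.1, 2.0], [2.0, 2.1], [2.1, 2.1],
              [0.05, 0.05]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1])

suspects = flag_suspects(X, y)
X_clean, y_clean = clean(X, y, suspects)
X_pruned, y_pruned = prune(X, y, suspects)
print(suspects)
```

Whether either action helps depends on whether the flagged example is truly mislabeled, which is why the effect varies.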

Data Hinting

• slightly shifted/rotated digits carry the same meaning
• possible action: add virtual examples by shifting/rotating the given digits (data hinting)

possibly helps, but watch out: virtual examples are not i.i.d. ∼ P(x, y)!
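
A minimal sketch of generating virtual examples by shifting, assuming the digits are stored as 2D pixel arrays. The helper names shift and add_virtual_examples, the one-pixel shifts, and the zero fill at the border are illustrative choices, not something the slide specifies; rotation is left out to keep the sketch short.

```python
import numpy as np

def shift(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling vacated pixels with 0."""
    out = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def add_virtual_examples(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Return the original examples followed by one shifted copy per entry in shifts."""
    X, y = [images], [labels]
    for dy, dx in shifts:
        X.append(np.stack([shift(im, dy, dx) for im in images]))
        y.append(labels)          # a shifted digit keeps its label
    return np.concatenate(X), np.concatenate(y)

# Two tiny 5x5 'images' with a single lit pixel each, labeled 1 and 5.
images = np.zeros((2, 5, 5))
images[0, 2, 2] = 1.0
images[1, 1, 3] = 1.0
labels = np.array([1, 5])

X_virtual, y_virtual = add_virtual_examples(images, labels)
print(X_virtual.shape, y_virtual.shape)
```

The enlarged set is no longer an i.i.d. sample from P(x, y), so the number and size of the shifts should be kept modest.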

Fun Time

Assume we know that f(x) is symmetric for some 1D regression application. That is, f(x) = f(−x). One possibility of using the knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. What virtual examples suit your needs best?

1. {(x_n, −y_n)}
2. {(−x_n, −y_n)}
3. {(−x_n, y_n)}
4. {(2x_n, 2y_n)}

Reference Answer: 3

We want the virtual examples to encode the invariance when x → −x.
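
A small sketch of why choice 3 encodes the symmetry, under assumptions made only for illustration: a hypothetical symmetric target f(x) = x² with Gaussian noise, and a cubic polynomial fitted by least squares. After adding the virtual examples (−x_n, y_n), the odd-degree coefficients of the fit should vanish up to rounding, so the learned hypothesis is itself symmetric.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
y = x ** 2 + 0.1 * rng.standard_normal(20)   # hypothetical symmetric target plus noise

# Virtual examples (-x_n, y_n) encode f(x) = f(-x).
x_aug = np.concatenate([x, -x])
y_aug = np.concatenate([y, y])

# Cubic least-squares fits; coefficients are ordered from degree 0 to degree 3.
coef_plain = P.polyfit(x, y, 3)
coef_hint = P.polyfit(x_aug, y_aug, 3)

print("odd coefficients without hints:", coef_plain[1], coef_plain[3])
print("odd coefficients with hints:   ", coef_hint[1], coef_hint[3])
```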