Hazard of Overfitting: Deterministic Noise

Impact of Noise and Data Size

impact of σ² versus N: stochastic noise
[Figure: overfit level (color scale from -0.2 to 0.2) over Number of Data Points, N, and Noise Level, σ²]

impact of Q_f versus N: deterministic noise
[Figure: overfit level (color scale from -0.2 to 0.2) over Number of Data Points, N, and Target Complexity, Q_f]

four reasons of serious overfitting:

• data size N ↓: overfit ↑
• stochastic noise ↑: overfit ↑
• deterministic noise ↑: overfit ↑
• excessive power ↑: overfit ↑

overfitting ‘easily’ happens
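
The trends in the two plots can be explored with a small simulation. The sketch below is only an illustration under stated assumptions, not the exact experiment behind the figures: the helper name `overfit_measure`, the random Legendre-polynomial target, the degree-2 and degree-10 fits, and the use of E_out(g10) − E_out(g2) as the overfit measure are all choices made here.

```python
import numpy as np
from numpy.polynomial import legendre


def overfit_measure(n_points, noise_var, target_degree, n_trials=200, seed=0):
    """Median of E_out(g10) - E_out(g2) over random targets and data sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 400)  # dense grid used to estimate E_out
    gaps = []
    for _ in range(n_trials):
        # Random target of degree Q_f, rescaled to unit average power.
        coef = rng.standard_normal(target_degree + 1)
        coef /= np.sqrt(np.mean(legendre.legval(x_test, coef) ** 2))
        f_test = legendre.legval(x_test, coef)

        # N noisy training examples: y = f(x) + stochastic noise.
        x = rng.uniform(-1.0, 1.0, n_points)
        noise = np.sqrt(noise_var) * rng.standard_normal(n_points)
        y = legendre.legval(x, coef) + noise

        # Least-squares fits with a weak (degree 2) and a powerful (degree 10) model.
        e_out = {}
        for degree in (2, 10):
            g = legendre.legfit(x, y, degree)
            e_out[degree] = np.mean((legendre.legval(x_test, g) - f_test) ** 2)
        gaps.append(e_out[10] - e_out[2])
    return float(np.median(gaps))


hard = overfit_measure(n_points=20, noise_var=1.0, target_degree=20)
easy = overfit_measure(n_points=200, noise_var=0.05, target_degree=20)
print(f"small N, high noise: {hard:+.3f}")
print(f"large N, low noise:  {easy:+.3f}")
```

A positive value means the more powerful degree-10 model ends up with the worse E_out. Varying `noise_var` or `target_degree` while holding the other fixed gives a rough feel for the stochastic-noise and deterministic-noise panels respectively; the exact numbers depend on the seed and on the assumptions above.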

Deterministic Noise

• if f ∉ H: something of f cannot be captured by H
• deterministic noise: difference between best h* ∈ H and f
• acts like ‘stochastic noise’; not new to CS: pseudo-random generator
• difference to stochastic noise:
  • depends on H
  • fixed for a given x

[Figure: target f and hypothesis h plotted as y versus x]

philosophy: when teaching a kid, perhaps better not to use examples from a complicated target function? :-)
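
The two properties above (depends on H, fixed for a given x) can be checked numerically. The following is a minimal sketch under assumptions made here: the target `sin(πx)` on [-1, 1], polynomial hypothesis sets of degree 2 and 5, and the helper name `deterministic_noise` are illustrative choices, and the best h* is approximated by a least-squares fit to the noiseless target on a dense grid.

```python
import numpy as np


def target(x):
    # Assumed target for illustration; any f outside the hypothesis set would do.
    return np.sin(np.pi * x)


def deterministic_noise(degree, x_query, n_grid=2001):
    """f(x) - h*(x), with h* the best polynomial of the given degree under squared error."""
    grid = np.linspace(-1.0, 1.0, n_grid)  # stands in for uniform P(x) on [-1, 1]
    best_h = np.polyfit(grid, target(grid), degree)  # least squares on noiseless f
    return target(x_query) - np.polyval(best_h, x_query)


x_query = np.array([-0.8, -0.3, 0.5])
noise_small_h = deterministic_noise(2, x_query)
noise_large_h = deterministic_noise(5, x_query)

# Fixed for a given x: recomputing yields identical values (no randomness involved).
same_again = deterministic_noise(2, x_query)

# Depends on H: a richer hypothesis set leaves less of f uncaptured.
grid = np.linspace(-1.0, 1.0, 2001)
energy_small_h = np.mean(deterministic_noise(2, grid) ** 2)
energy_large_h = np.mean(deterministic_noise(5, grid) ** 2)
print(energy_small_h, energy_large_h)
```

Unlike stochastic noise, nothing here is resampled between calls, which is why the same x always gives the same noise value for a given H.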

Fun Time

Consider the target function f(x) = sin(1126x) for x ∈ [0, 2π]. When x is sampled uniformly from this range, and we use all possible linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the level of deterministic noise for each x?

1 |sin(1126x)|
2 |sin(1126x) − x|
3 |sin(1126x) + x|
4 |sin(1126x) − 1126x|

Reference Answer: 1

You can try a few different w and convince yourself that the best hypothesis h* is (essentially) h*(x) = 0. The deterministic noise is the difference between f and h*.
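
Instead of trying a few w by hand, the best w can be computed directly. The sketch below approximates the uniform distribution with a dense grid (an assumption of this example) and uses the standard least-squares formula w* = E[x f(x)] / E[x²] for a hypothesis of the form h(x) = w · x.

```python
import numpy as np

K = 1126
x = np.linspace(0.0, 2.0 * np.pi, 1_000_001)  # dense grid standing in for uniform sampling
f = np.sin(K * x)

# For h(x) = w * x, squared error is minimized at w* = E[x f(x)] / E[x^2].
w_star = np.mean(x * f) / np.mean(x ** 2)

# Closed form of the same ratio for integer K: -3 / (4 * pi^2 * K).
w_closed = -3.0 / (4.0 * np.pi ** 2 * K)

# Deterministic noise at each x, compared with answer 1.
noise = np.abs(f - w_star * x)
max_gap = np.max(np.abs(noise - np.abs(f)))
print(w_star, w_closed, max_gap)
```

The optimal w is tiny but not exactly zero, so |sin(1126x)| matches the deterministic noise up to a very small correction, consistent with the reference answer.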

Hazard of Overfitting: Dealing with Overfitting

Driving Analogy Revisited

learning                   driving
overfit                    have a car accident
use excessive d_VC         ‘drive too fast’
noise                      bumpy road
limited data size N        limited observations about road condition

start from simple model    drive slowly
data cleaning/pruning      use more accurate road information
data hinting               exploit more road information
regularization             put on the brakes
validation                 monitor the dashboard

all very practical techniques to combat overfitting

Data Cleaning/Pruning

if we ‘detect’ the outlier 5 at the top by

• too close to other ◦, or too far from other ×
• wrong by current classifier
• . . .

then

• possible action 1: correct the label (data cleaning)
• possible action 2: remove the example (data pruning)

possibly helps, but effect varies
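
The detect-then-act recipe can be sketched on toy data. Everything below is hypothetical: the helper `suspicious_indices`, the class-mean distance rule (one simple way to express "too close to the other class, too far from its own"), and the eight 2D points are assumptions for illustration, not the detection rule used on the slide's digit data.

```python
import numpy as np


def suspicious_indices(X, y):
    """Flag examples closer to the other class's mean than to their own class's mean."""
    flagged = []
    for i in range(len(y)):
        rest = np.arange(len(y)) != i  # leave example i out of its own class mean
        own_mean = X[rest & (y == y[i])].mean(axis=0)
        other_mean = X[y != y[i]].mean(axis=0)
        if np.linalg.norm(X[i] - other_mean) < np.linalg.norm(X[i] - own_mean):
            flagged.append(i)
    return np.array(flagged, dtype=int)


# Toy data: two tight clusters plus one example (index 4) that sits in the wrong cluster.
X = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2], [2.8, 3.2],
              [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]])
y = np.array([+1, +1, +1, +1, +1, -1, -1, -1])

flagged = suspicious_indices(X, y)

# Possible action 1 (data cleaning): correct the label of each flagged example.
y_cleaned = y.copy()
y_cleaned[flagged] = -y_cleaned[flagged]

# Possible action 2 (data pruning): remove each flagged example.
keep = np.setdiff1d(np.arange(len(y)), flagged)
X_pruned, y_pruned = X[keep], y[keep]
print(flagged, y_cleaned, len(y_pruned))
```

On real data such a rule can also flag legitimate hard examples, which is one reason the effect of cleaning or pruning varies.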

Data Hinting

slightly shifted/rotated digits carry the same meaning

possible action: add virtual examples by shifting/rotating the given digits (data hinting)

possibly helps, but watch out: virtual examples are not iid ∼ P(x, y)!
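
A minimal sketch of generating virtual examples by shifting is given below. The helper name `shifted_virtual_examples`, the one-pixel shifts, and the two tiny 6×6 images are assumptions for illustration; rotation is left out to keep the example dependent on NumPy only.

```python
import numpy as np


def shifted_virtual_examples(images, labels, shifts=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Append one-pixel shifted copies of each image, reusing the original label."""
    all_images, all_labels = [images], [labels]
    for dy, dx in shifts:
        # np.roll wraps pixels around the border, which equals a plain shift
        # only when the border pixels are blank (assumed here).
        all_images.append(np.roll(images, shift=(dy, dx), axis=(1, 2)))
        all_labels.append(labels)
    return np.concatenate(all_images), np.concatenate(all_labels)


# Two hypothetical 6x6 "digits" with blank borders: a vertical bar and a square.
images = np.zeros((2, 6, 6))
images[0, 1:5, 3] = 1.0
images[1, 2:4, 2:4] = 1.0
labels = np.array([1, 0])

aug_images, aug_labels = shifted_virtual_examples(images, labels)
print(aug_images.shape, aug_labels)
```

Each virtual example is a deterministic function of an original one, so the augmented set is strongly dependent; this is the non-iid caveat stated above.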

Fun Time

Assume we know that f(x) is symmetric for some 1D regression application. That is, f(x) = f(−x). One possibility of using the knowledge is to consider symmetric hypotheses only. On the other hand, you can also generate virtual examples from the original data {(x_n, y_n)} as hints. What virtual examples suit your needs best?

1 {(x_n, −y_n)}
2 {(−x_n, −y_n)}
3 {(−x_n, y_n)}
4 {(2x_n, 2y_n)}

Reference Answer: 3

We want the virtual examples to encode the invariance when x → −x.
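
The effect of the mirrored virtual examples can be seen in a small fit. This is a sketch under assumptions made here: the symmetric target `cos(2x)`, the noise level, the ten sample points, and the cubic hypothesis set are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a symmetric target, f(x) = f(-x), observed only for x > 0.
x = rng.uniform(0.1, 1.0, 10)
y = np.cos(2.0 * x) + 0.1 * rng.standard_normal(10)

# Virtual examples (-x_n, y_n): mirror the input, keep the output.
x_aug = np.concatenate([x, -x])
y_aug = np.concatenate([y, y])

# Cubic least-squares fits; np.polyfit returns coefficients from x^3 down to x^0.
coef_plain = np.polyfit(x, y, 3)
coef_hinted = np.polyfit(x_aug, y_aug, 3)
odd_plain = coef_plain[[0, 2]]    # coefficients of x^3 and x^1
odd_hinted = coef_hinted[[0, 2]]
print(odd_plain, odd_hinted)
```

With the mirrored data, the least-squares solution has no reason to prefer either sign of x, so the odd-power coefficients vanish up to rounding and the fitted hypothesis is itself symmetric. In this setting the hint reaches the same goal as restricting H to symmetric hypotheses.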

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
  Lecture 12: Nonlinear Transform
4 How Can Machines Learn Better?
  Lecture 13: Hazard of Overfitting
  • What is Overfitting? lower E_in but higher E_out
  • The Role of Noise and Data Size
