Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario: benchmark data $\mathcal{D}$

• paper 1: propose $\mathcal{H}_1$ that works well on $\mathcal{D}$
• paper 2: find room for improvement, propose $\mathcal{H}_2$, and publish only if better than $\mathcal{H}_1$ on $\mathcal{D}$
• paper 3: find room for improvement, propose $\mathcal{H}_3$, and publish only if better than $\mathcal{H}_2$ on $\mathcal{D}$
• . . .
• if all the papers came from the same author as one big paper: bad generalization due to $d_{\mathrm{VC}}(\cup_m \mathcal{H}_m)$
• step-wise: each later author snooped the data by reading the earlier papers; the bad generalization is worsened by "publish only if better"
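The research scenario above can be imitated with a small simulation. The sketch below is only an illustration under strong assumptions that are not in the slides: the benchmark labels are pure coin flips, every "paper" proposes a hypothesis that guesses at random (true accuracy 0.5), and the function name and parameters are hypothetical. It shows how "publish only if better" on one fixed benchmark can push the reported accuracy well above what the same hypothesis achieves on fresh data.

```python
import random


def simulate_publish_only_if_better(n_examples=100, n_papers=200,
                                    n_fresh=10000, seed=0):
    """Simulate repeated reuse of one benchmark whose labels are pure noise.

    Returns (best accuracy reported on the benchmark,
             accuracy of a random guesser on fresh data,
             number of 'published' papers).
    """
    rng = random.Random(seed)
    # Fixed benchmark D: labels are coin flips, so nothing can truly be learned.
    benchmark_labels = [rng.randint(0, 1) for _ in range(n_examples)]

    best_benchmark_acc = 0.0
    n_published = 0
    for _ in range(n_papers):
        # Hypothetical new method: its predictions on D are random guesses.
        predictions = [rng.randint(0, 1) for _ in range(n_examples)]
        hits = sum(p == y for p, y in zip(predictions, benchmark_labels))
        acc = hits / n_examples
        if acc > best_benchmark_acc:  # publish only if better on D
            best_benchmark_acc = acc
            n_published += 1

    # On fresh data the published "winner" is still a random guesser:
    # compare random predictions against random fresh labels.
    fresh_hits = sum(rng.randint(0, 1) == rng.randint(0, 1)
                     for _ in range(n_fresh))
    return best_benchmark_acc, fresh_hits / n_fresh, n_published


if __name__ == "__main__":
    best, fresh, published = simulate_publish_only_if_better()
    print(f"best reported accuracy on D: {best:.2f}")
    print(f"accuracy on fresh data:      {fresh:.2f}")
    print(f"papers published:            {published}")
```

The gap between the two numbers comes only from selecting the maximum over many attempts on the same data, which is the effect the union of hypothesis sets captures.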

if you torture the data long enough, it will confess :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

Dealing with Data Snooping

• truth: very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use it cautiously
• be blind: avoid making modeling decisions by looking at the data
• be suspicious: interpret research results (including your own) with a proper sense of contamination

one secret to winning KDDCups: a careful balance between data-driven modeling (snooping) and validation (no-snooping)
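The "extremely honest" and "less honest" options above can be sketched as a small protocol. Everything in this example is an assumption made for illustration, not something from the slides: the toy data (label 1 when x > 0.6, flipped with probability 0.1), the threshold hypotheses, the two grids, and the hypothetical `LockedTestSet` class, which simply refuses a second look at the test data.

```python
import random


class LockedTestSet:
    """Hypothetical 'safe' that lets the test data be scored exactly once."""

    def __init__(self, examples):
        self._examples = list(examples)
        self._opened = False

    def evaluate_once(self, hypothesis):
        if self._opened:
            raise RuntimeError("test data already used; looking again is snooping")
        self._opened = True
        return accuracy(hypothesis, self._examples)


def accuracy(hypothesis, examples):
    return sum(hypothesis(x) == y for x, y in examples) / len(examples)


def make_threshold(theta):
    # Candidate hypothesis: predict 1 when x exceeds the threshold theta.
    return lambda x: 1 if x > theta else 0


def fit_threshold(train, grid):
    # "Training": pick the threshold in this grid with the best training accuracy.
    return max((make_threshold(t) for t in grid),
               key=lambda h: accuracy(h, train))


def run_protocol(seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(600):
        x = rng.random()
        y = 1 if x > 0.6 else 0
        if rng.random() < 0.1:  # label noise
            y = 1 - y
        data.append((x, y))

    # Split before looking at anything; the test part goes straight into the safe.
    train, val = data[:300], data[300:450]
    safe = LockedTestSet(data[450:])

    # Two hypothesis sets fixed in advance; validation chooses between them.
    grids = {"coarse": [0.25, 0.5, 0.75],
             "fine": [i / 20 for i in range(1, 20)]}
    trained = {name: fit_threshold(train, grid) for name, grid in grids.items()}
    chosen = max(trained, key=lambda name: accuracy(trained[name], val))

    val_acc = accuracy(trained[chosen], val)
    test_acc = safe.evaluate_once(trained[chosen])  # the single look at test data
    return chosen, val_acc, test_acc, safe


if __name__ == "__main__":
    chosen, val_acc, test_acc, _ = run_protocol()
    print(f"chosen: {chosen}, validation: {val_acc:.2f}, test: {test_acc:.2f}")
```

The validation accuracy was used to make a choice, so it is slightly contaminated; the test accuracy was used for nothing, so it stays an honest estimate.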

Fun Time

Which of the following can result in unsatisfactory test performance in machine learning?

1. data snooping
2. overfitting
3. sampling bias
4. all of the above

Reference Answer: 4

A professional like you should be aware of

From the document: Lecture 16: Three Learning Principles (pp. 66-83)