Three Learning Principles: Data Snooping
Data Snooping by Data Reusing
Research Scenario
benchmark data $\mathcal{D}$
• paper 1: propose $\mathcal{H}_1$ that works well on $\mathcal{D}$
• paper 2: find room for improvement, propose $\mathcal{H}_2$, and publish only if better than $\mathcal{H}_1$ on $\mathcal{D}$
• paper 3: find room for improvement, propose $\mathcal{H}_3$, and publish only if better than $\mathcal{H}_2$ on $\mathcal{D}$
• ...
• if all papers were from the same author in one big paper: bad generalization due to $d_{\mathrm{VC}}(\cup_m \mathcal{H}_m)$
• step-wise: later authors snooped the data by reading earlier papers; bad generalization is worsened by "publish only if better"
if you torture the data long enough, it will confess :-)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25
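The research scenario above can be imitated with a toy simulation. This is only a sketch under strong assumptions that are not from the slides: the benchmark labels are pure coin flips, every "paper" is a random guesser, and the function name `simulate_benchmark_reuse` and its parameters are hypothetical. Even though nothing can be learned, "publish only if better" pushes the reported benchmark accuracy well above chance, while accuracy on fresh data stays near 0.5.

```python
import random


def simulate_benchmark_reuse(n_points=100, n_papers=200, n_fresh=10000, seed=0):
    """Toy model of 'publish only if better' on one fixed benchmark.

    Returns (best benchmark accuracy, accuracy on fresh data, papers published).
    """
    rng = random.Random(seed)
    # Assumption: benchmark labels are coin flips, so there is nothing to learn.
    benchmark = [rng.choice((-1, 1)) for _ in range(n_points)]

    best_acc = 0.0
    published = 0
    for _ in range(n_papers):
        # Each "paper" proposes a hypothesis that guesses labels at random.
        guesses = [rng.choice((-1, 1)) for _ in range(n_points)]
        acc = sum(g == y for g, y in zip(guesses, benchmark)) / n_points
        if acc > best_acc:  # publish only if better than the last published paper
            best_acc = acc
            published += 1

    # On fresh data, the "winning" guesser is still just a coin flip.
    fresh_hits = sum(
        rng.choice((-1, 1)) == rng.choice((-1, 1)) for _ in range(n_fresh)
    )
    return best_acc, fresh_hits / n_fresh, published


if __name__ == "__main__":
    best, fresh, count = simulate_benchmark_reuse()
    print(f"published papers: {count}")
    print(f"best accuracy reported on the benchmark: {best:.2f}")
    print(f"accuracy of a random guesser on fresh data: {fresh:.2f}")
```

The gap between the two accuracies is the price of selecting over the whole sequence of papers on the same data, which is what the union of the hypothesis sets in the slide accounts for.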
Dealing with Data Snooping
• truth: very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use it cautiously
• be blind: avoid making modeling decisions by looking at the data
• be suspicious: interpret research results (including your own) with a proper feeling of contamination
one secret to winning KDDCups: careful balance between data-driven modeling (snooping) and validation (no-snooping)
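The "lock your test data" and "reserve validation" advice can be sketched as a three-way split. This is a minimal illustration, not a prescribed procedure: the helper names (`three_way_split`, `error`, `fit`, `select_and_report`), the split fractions, and the one-dimensional threshold classifier are all assumptions made for the example.

```python
import random


def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # locked away until the very end
    val = shuffled[n_test:n_test + n_val]   # reserved for model selection
    train = shuffled[n_test + n_val:]       # free to use for fitting
    return train, val, test


def error(threshold, examples):
    """0/1 error of the rule 'predict +1 iff x > threshold'."""
    wrong = sum((1 if x > threshold else -1) != y for x, y in examples)
    return wrong / len(examples)


def fit(train, grid):
    """Pick the threshold in `grid` with the lowest training error."""
    return min(grid, key=lambda t: error(t, train))


def select_and_report(data, grids):
    """Fit one model per candidate grid, select by validation, test once."""
    train, val, test = three_way_split(data)
    fitted = [fit(train, grid) for grid in grids]
    # Modeling decisions use validation data only.
    best = min(fitted, key=lambda t: error(t, val))
    # The test set is opened exactly once, after every decision is final.
    return best, error(best, val), error(best, test)
```

The validation error returned here is already slightly optimistic, because it was used to choose between models; only the test error is untouched by any decision.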
Fun Time
Which of the following can result in unsatisfactory test performance in machine learning?