Machine Learning Foundations (機器學習基石)
Lecture 16: Three Learning Principles
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering,
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Lecture 15: Validation
(crossly) reserve validation data to simulate the testing procedure for model selection
Lecture 16: Three Learning Principles
Occam's Razor
Sampling Bias
Data Snooping
Power of Three
Occam's Razor

"An explanation of the data should be made as simple as possible, but no simpler."
—attributed to Albert Einstein (1879–1955)

"entia non sunt multiplicanda praeter necessitatem"
(entities must not be multiplied beyond necessity)
—William of Occam (1287–1347)

'Occam's razor' for trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons
Occam's Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:
1 What does it mean for a model to be simple?
2 How do we know that simpler is better?
Simple Model

simple hypothesis h:
• small Ω(h) = 'looks' simple
• specified by few parameters

simple model H:
• small Ω(H) = not many
• contains a small number of hypotheses

connection: h specified by ℓ bits ⇐ |H| of size 2^ℓ
small Ω(h) ⇐ small Ω(H)

simple: small hypothesis/model complexity
Simple is Better

in addition to the math proof that you have seen, philosophically:

simple H
⇒ smaller m_H(N)
⇒ less 'likely' to fit the data perfectly: m_H(N)/2^N small
⇒ more significant when a fit does happen

direct action:
linear first;
always ask whether the data is over-modeled
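The 'less likely to fit perfectly' step can be checked concretely. A minimal sketch (my own illustration, not from the lecture) using the positive-ray model h(x) = sign(x − θ), whose growth function is m_H(N) = N + 1: only a tiny fraction of all 2^N labelings can be fit, so a perfect fit by this simple model is significant.

```python
def positive_ray_dichotomies(xs):
    """Enumerate the distinct labelings that positive rays can produce on xs."""
    xs = sorted(xs)
    # one representative threshold below all points, between each adjacent
    # pair, and above all points
    thresholds = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    return {tuple(1 if x > t else -1 for x in xs) for t in thresholds}

N = 10
dichotomies = positive_ray_dichotomies(list(range(N)))
print(len(dichotomies))          # m_H(N) = N + 1 = 11
print(len(dichotomies) / 2**N)   # fittable fraction of all 2^N labelings: ~1%
```

With N = 10, only 11 of 1024 labelings are fittable, so random labels are almost never fit perfectly.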
Fun Time
Presidential Story

• 1948 US presidential election: Truman versus Dewey
• a newspaper ran a phone poll of how people voted,
  and set the headline 'Dewey Defeats Truman' based on the polling

who is this? :-)

The Big Smile Came from . . .

Truman, and yes, he won

suspects for the mistake:
• editorial bug? —no
• bad luck of polling (δ)? —no

hint: phones were expensive :-)
Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

• technical explanation: data from P1(x, y) but test under P2 ≠ P1: VC fails
• philosophical explanation: study Math hard but get tested on English: no strong test guarantee

'minor' VC assumption: data and testing both iid from the same P
Sampling Bias in Learning

A True Personal Story
• Netflix competition for movie recommender systems: 10% improvement = 1M US dollars
• formed D_val; in my first shot, E_val(g) showed 13% improvement
• why am I still teaching here? :-)

[Figure: matrix factorization—match movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) with viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?), then add the contributions from each factor to get the predicted rating]

validation: random examples within D; test: 'last' user records 'after' D
Dealing with Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

• practical rule of thumb: match the test scenario as much as possible
• e.g. if test uses 'last' user records 'after' D:
  • training: emphasize later examples (KDDCup 2011)
  • validation: use 'late' user records

last puzzle: any danger when learning 'credit card approval' with existing bank records?
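The 'match the test scenario' rule can be sketched in code. A minimal sketch (hypothetical data, not the Netflix set): when the test consists of each user's latest records, a random validation split is optimistically biased, so the honest split reserves each user's latest record for validation instead.

```python
import random

random.seed(0)
# hypothetical (user, time, rating) records; time orders each user's history
records = [(u, t, random.random()) for u in range(100) for t in range(20)]

# scenario-matched split: each user's latest record goes to validation,
# mimicking a test set of 'last' records 'after' D
latest = {}
for u, t, r in records:
    if u not in latest or t > latest[u][1]:
        latest[u] = (u, t, r)
val_keys = {(u, t) for u, t, _ in latest.values()}

train = [rec for rec in records if (rec[0], rec[1]) not in val_keys]
val = [rec for rec in records if (rec[0], rec[1]) in val_keys]
print(len(train), len(val))  # 1900 100
```

A random split of the same records would mix early and late examples into validation, estimating a quantity the test never measures.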
Fun Time
Visual Data Snooping

Visualize X = R^2
• full Φ2: z = (1, x1, x2, x1^2, x1 x2, x2^2), d_VC = 6
• or z = (1, x1^2, x2^2), d_VC = 3, after visualizing?
• or better z = (1, x1^2 + x2^2), d_VC = 2?
• or even better z = sign(0.6 − x1^2 − x2^2)?
—careful about your brain's 'model complexity'

[Figure: scatter plot of the data in [−1, 1] × [−1, 1]]

for VC-safety, Φ shall be decided without 'snooping' the data
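The d_VC = 2 claim for the z = (1, x1^2 + x2^2) transform can be verified mechanically. A minimal sketch (my own brute-force check, not from the lecture): the transformed model h(x) = sign(w0 + w1 (x1^2 + x2^2)) is a 1-D linear model in z = x1^2 + x2^2, so it shatters some 2 points but cannot shatter 3.

```python
def can_shatter(zs):
    """Brute-force: can sign(w0 + w1*z) realize every labeling of zs?"""
    realizable = set()
    zs_sorted = sorted(set(zs))
    # candidate thresholds: below all z values, between adjacent pairs, above all
    cuts = ([zs_sorted[0] - 1]
            + [(a + b) / 2 for a, b in zip(zs_sorted, zs_sorted[1:])]
            + [zs_sorted[-1] + 1])
    for t in cuts:
        for sgn in (1, -1):  # both orientations of the threshold
            realizable.add(tuple(sgn * (1 if z > t else -1) for z in zs))
    return len(realizable) == 2 ** len(zs)

print(can_shatter([0.2, 0.7]))       # True: 2 points shattered
print(can_shatter([0.1, 0.5, 0.9]))  # False: 3 points cannot be shattered
```

The check confirms d_VC = 2 in z-space, but the slide's warning still applies: the transform was chosen after looking at the data, so the effective complexity is higher than 2.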
Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

• 8 years of currency trading data
• first 6 years for training, last 2 years for testing
• x = previous 20 days, y = the 21st day
• snooping versus no snooping: superior profit looks possible

[Figure: cumulative profit (%) over test days 0–500—the 'snooping' curve climbs toward 30% while the 'no snooping' curve stays near 0%]

• snooping: shift-scale all values using training + testing data
• no snooping: shift-scale all values using training data only
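The two shift-scale procedures differ by one line. A minimal sketch (synthetic series, my own illustration of the slide's experiment): the snooping version computes the normalization statistics from training + testing data, leaking test statistics into learning; the honest version uses the training period only.

```python
import random
import statistics

random.seed(2)
series = [random.gauss(0, 1) for _ in range(100)]
train, test = series[:75], series[75:]  # first period trains, last period tests

# snooping: shift-scale statistics from the whole series (training + testing)
mu_all, sd_all = statistics.mean(series), statistics.stdev(series)
test_snooped = [(v - mu_all) / sd_all for v in test]

# no snooping: shift-scale statistics from the training period only
mu_tr, sd_tr = statistics.mean(train), statistics.stdev(train)
test_clean = [(v - mu_tr) / sd_tr for v in test]
```

Any learner downstream of `test_snooped` has already seen a summary of the test set, which is exactly the compromise the principle warns about.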
Data Snooping by Data Reusing

Research Scenario
• benchmark data D
• paper 1: propose H1 that works well on D
• paper 2: find room for improvement, propose H2—and publish only if better than H1 on D
• paper 3: find room for improvement, propose H3—and publish only if better than H2 on D
• . . .
• if all papers were from one author as one big paper: bad generalization due to d_VC(∪_m H_m)
• step-wise: later authors snooped the data by reading earlier papers; bad generalization worsened by publish-only-if-better

if you torture the data long enough, it will confess :-)
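The publish-only-if-better effect can be simulated. A minimal sketch (my own simulation, not from the lecture): repeatedly proposing hypotheses with zero real skill on the same benchmark D and keeping only the best one inflates apparent performance, just as the union d_VC(∪_m H_m) warns.

```python
import random

random.seed(3)
N = 100
# a fixed benchmark D with purely random labels: no hypothesis has real skill
true_labels = [random.choice([-1, 1]) for _ in range(N)]

def random_hypothesis_accuracy():
    """A 'proposed' hypothesis that just guesses randomly on D."""
    return sum(random.choice([-1, 1]) == y for y in true_labels) / N

# 50 'papers' reuse D, each published only if better than the current best
best = 0.0
for paper in range(50):
    best = max(best, random_hypothesis_accuracy())
print(best)  # noticeably above the honest 0.5, despite zero real skill
```

On fresh data, every one of these hypotheses would score about 0.5; the inflated `best` reflects only the reuse of D.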
Dealing with Data Snooping

• truth: very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use it cautiously
• be blind: avoid making modeling decisions by looking at the data
• be suspicious: interpret research results (including your own) with a proper feeling of contamination

one secret to winning KDDCups:
careful balance between data-driven modeling (snooping) and validation (no snooping)
Fun Time
Three Related Fields

Data Mining
• use (huge) data to find interesting properties
• difficult to distinguish ML and DM in reality

Artificial Intelligence
• compute something that shows intelligent behavior
• ML is one possible route to realize AI

Statistics
• use data to make inference about an unknown process
• statistics contains many useful tools for ML
Three Theoretical Bounds

Hoeffding
P[BAD] ≤ 2 exp(−2 ε^2 N)
• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding
P[BAD] ≤ 2 M exp(−2 ε^2 N)
• M hypotheses
• useful for validation

VC
P[BAD] ≤ 4 m_H(2N) exp(. . .)
• all of H
• useful for training
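A minimal numeric sketch of the three bounds (my own illustration; the ε, N, M, and d_VC values are made up, the VC exponent is taken as the −ε²N/8 form from earlier lectures of the course, and m_H(2N) is bounded by (2N)^d_VC):

```python
import math

eps, N, M, dvc = 0.1, 10_000, 100, 3

# Hoeffding: one hypothesis
hoeffding = 2 * math.exp(-2 * eps**2 * N)
# Multi-Bin Hoeffding: union bound over M hypotheses
multibin = 2 * M * math.exp(-2 * eps**2 * N)
# VC: growth function bounded polynomially, with the weaker exponent
vc = 4 * (2 * N) ** dvc * math.exp(-(eps**2) * N / 8)

print(hoeffding, multibin, vc)
```

The ordering hoeffding < multibin < vc shows the price of covering more hypotheses: at this N the VC bound is still vacuous (above 1), which is why training needs far more data than testing or validation.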
Three Linear Models

(each model computes the score s = w^T x from inputs x0, x1, . . . , xd)

PLA/pocket
h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression
h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression
h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively
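The shared structure of the three models can be written out directly. A minimal sketch (standard definitions matching the slide; the weights and inputs are made-up examples): all three share the linear score s = w^T x and differ only in how the score becomes a hypothesis.

```python
import math

def score(w, x):
    """s = w^T x, where x includes the constant feature x0 = 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def pla_hypothesis(w, x):      # PLA/pocket: h(x) = sign(s)
    return 1 if score(w, x) > 0 else -1

def linreg_hypothesis(w, x):   # linear regression: h(x) = s
    return score(w, x)

def logreg_hypothesis(w, x):   # logistic regression: h(x) = theta(s)
    return 1 / (1 + math.exp(-score(w, x)))

w = [0.1, 0.5, -0.3]
x = [1.0, 2.0, 1.0]            # x0 = 1 plus two features; s = 0.8
print(pla_hypothesis(w, x))    # 1
print(linreg_hypothesis(w, x)) # ~0.8
print(logreg_hypothesis(w, x)) # ~0.69
```

Only the final wrapper around `score` changes; that is why tools like feature transforms and regularization apply to all three models uniformly.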
Three Key Tools

Feature Transform
E_in(w) → E_in(w̃)
d_VC(H) → d_VC(H_Φ)
• by using a more complicated Φ
• lower E_in
• higher d_VC

Regularization
E_in(w) → E_in(w_REG)
d_VC(H) → d_EFF(H, A)
• by augmenting a regularizer Ω
• lower d_EFF
• higher E_in

Validation
E_in(h) → E_val(h)
H → {g1−, . . . , gM−}
• by reserving K examples as D_val
• fewer choices
• fewer examples
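The validation tool's trade (fewer choices, fewer examples) can be sketched end to end. A minimal sketch (hypothetical 1-D threshold models of my own; only the structure matches the slide): reserve K examples as D_val, train each of M candidates on the remaining N − K to get g_1−, . . . , g_M−, then pick the finalist with the lowest E_val.

```python
import random

random.seed(4)

def make_example():
    # hypothetical data: label is +1 exactly when x > 0.2
    x = random.uniform(-1, 1)
    return (x, 1 if x > 0.2 else -1)

data = [make_example() for _ in range(200)]
K = 40
train, val = data[:-K], data[-K:]  # reserve K examples as D_val

def fit_threshold(examples, grid):
    """g_m^-: the threshold in this model's grid with lowest training error."""
    def e_in(t):
        return sum(1 for x, y in examples if (1 if x > t else -1) != y)
    return min(grid, key=e_in)

# M = 3 candidate models: coarser-to-finer threshold grids
grids = [[-0.5, 0.0, 0.5],
         [i / 10 - 1 for i in range(21)],
         [i / 100 - 1 for i in range(201)]]
finalists = [fit_threshold(train, g) for g in grids]

def e_val(t):
    return sum(1 for x, y in val if (1 if x > t else -1) != y) / K

best = min(finalists, key=e_val)
print(best, e_val(best))
```

E_val compares only M = 3 finalists rather than every hypothesis, which is why the Multi-Bin Hoeffding bound applies with a small M.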
Three Learning Principles

Occam's Razor: simple is good
Sampling Bias: class matches exam
Data Snooping: honesty is the best policy
Three Future Directions

• More Transform
• More Regularization
• Less Label

[Figure: a 'map of the learning jungle' of topics, including is learning feasible?, perceptrons, types of learning, linear models, linear regression, logistic regression, nonlinear transformation, training versus testing, VC dimension, noisy targets, bias-variance tradeoff, overfitting, deterministic noise, weight decay regularization, soft-order constraint, cross validation, Occam's razor, sampling bias, data snooping, data contamination, error measures, learning curves, no free lunch, SVM, kernel methods, neural networks, RBF, decision trees, ensemble learning, weak learners, Gaussian processes, graphical models, hidden Markov models, Boltzmann machines, mixture of experts, Bayesian prior, semi-supervised learning, active learning, collaborative filtering, clustering, ordinal regression, distribution-free, Q learning, exploration versus exploitation, stochastic gradient descent]

ready for the jungle!

Fun Time