Machine Learning Foundations (機器學習基石)
Lecture 16: Three Learning Principles
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
( 國立台灣大學資訊工程系)
Three Learning Principles
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Lecture 15: Validation
(crossly) reserve validation data
to simulate the testing procedure for model selection
Lecture 16: Three Learning Principles
Occam’s Razor
Sampling Bias
Data Snooping
Power of Three
Three Learning Principles Occam’s Razor
Occam’s Razor
An explanation of the data should be made as simple as possible, but no simpler.
—Albert Einstein? (1879–1955)
entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
—William of Occam (1287–1347)
‘Occam’s razor’ for trimming down unnecessary explanation
figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons
Occam’s Razor for Learning
The simplest model that fits the data is also the most plausible.
which one do you prefer? :-)
two questions:
1 What does it mean for a model to be simple?
2 How do we know that simpler is better?
Simple Model
simple hypothesis h
• small Ω(h) = ‘looks’ simple
• specified by few parameters
simple model H
• small Ω(H) = not many
• contains a small number of hypotheses
connection:
h specified by ℓ bits ⇐ |H| of size 2^ℓ
small Ω(h) ⇐ small Ω(H)
simple: small hypothesis/model complexity
Simple is Better
in addition to the math proof that you have seen, philosophically:
simple H ⟹ smaller m_H(N)
⟹ less ‘likely’ to fit data perfectly: m_H(N)/2^N
⟹ more significant when a fit happens
direct action:
linear first;
always ask whether the data is over-modeled
Fun Time
Consider the decision stumps in R^1 as the hypothesis set H. Recall that m_H(N) = 2N. Consider 10 different inputs x_1, x_2, . . ., x_10 coupled with labels y_n generated iid from a fair coin. What is the probability that the data D = {(x_n, y_n)}_{n=1}^{10} is separable by H?
1 1/1024
2 10/1024
3 20/1024
4 100/1024
Reference Answer: 3
Of all 1024 possible D, only 2N = 20 of them are separable by H.
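The 20/1024 count can be checked by brute force. A minimal sketch (not part of the original slides): enumerate every labeling of 10 distinct points on the line and count those realizable by a decision stump h(x) = s · sign(x − θ).

```python
import itertools

# Of all 2^10 labelings of 10 distinct points in R^1, how many are
# separable by a decision stump h(x) = s * sign(x - theta)?
N = 10
xs = list(range(N))  # any 10 distinct points; only their order matters

# Enumerate the dichotomies a stump can produce: for each threshold
# position (theta between xs[cut-1] and xs[cut]) and each sign s,
# points at index >= cut get label s, the rest get -s.
stump_dichotomies = set()
for cut in range(N + 1):
    for s in (+1, -1):
        labels = tuple(s * (+1 if i >= cut else -1) for i in range(N))
        stump_dichotomies.add(labels)

separable = sum(
    1 for ys in itertools.product((-1, +1), repeat=N)
    if ys in stump_dichotomies
)
print(separable, 2 ** N)  # 20 of 1024 labelings are separable
```

The 22 (cut, sign) combinations collapse to 2N = 20 distinct dichotomies because the all-(+1) and all-(−1) labelings are each produced twice.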
Three Learning Principles Sampling Bias
Presidential Story
• 1948 US presidential election: Truman versus Dewey
• a newspaper phone-polled how people voted, and set the headline ‘Dewey Defeats Truman’ based on the polling
who is this? :-)
The Big Smile Came from . . .
Truman, and yes, he won
suspects for the mistake:
• editorial bug? —no
• bad luck of polling (δ)? —no
hint: phones were expensive :-)
Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
• technical explanation: data from P_1(x, y) but test under P_2 ≠ P_1: VC fails
• philosophical explanation: study Math hard but test English: no strong test guarantee
‘minor’ VC assumption:
data and testing
both iid from P
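A toy simulation (not from the slides; the target, distributions, and learner below are all illustrative assumptions) of why VC fails when training data come from P_1 but testing is under P_2 ≠ P_1: a threshold learner trained on a biased sample can reach E_in = 0 while its test error stays large.

```python
import random

random.seed(0)

# Hypothetical target on [0, 1]: f(x) = sign(x - 0.3).
def f(x):
    return 1 if x > 0.3 else -1

# Biased sample: a "phone poll" that only reaches x > 0.6 (P_1),
# while testing is under the full uniform distribution (P_2 != P_1).
train = [random.uniform(0.6, 1.0) for _ in range(200)]
test = [random.uniform(0.0, 1.0) for _ in range(10000)]

# Pick the threshold stump with the lowest training error.
def fit_threshold(xs):
    candidates = [0.0] + sorted(xs)
    return min(candidates,
               key=lambda t: sum((1 if x > t else -1) != f(x) for x in xs))

theta = fit_threshold(train)  # near 0: every training point is positive
e_in = sum((1 if x > theta else -1) != f(x) for x in train) / len(train)
e_out = sum((1 if x > theta else -1) != f(x) for x in test) / len(test)
print(e_in, e_out)  # E_in = 0, but the test error is about 0.3
```

The learner never sees the region x ≤ 0.3 where f is negative, so the perfect training fit says nothing about test performance — exactly the guarantee the iid assumption was providing.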
Sampling Bias in Learning
A True Personal Story
• Netflix competition for movie recommender systems: 10% improvement = 1M US dollars
• formed D_val; on my first shot, E_val(g) showed 13% improvement
• why am I still teaching here? :-)
[figure: matrix factorization, matching movie and viewer factors (comedy content, action content, blockbuster?, Tom Cruise in it? vs. likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?) and adding contributions from each factor to get the predicted rating]
validation: random examples within D;
test: ‘last’ user records ‘after’ D
Dealing with Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
• practical rule of thumb: match the test scenario as much as possible
• e.g. if test = ‘last’ user records ‘after’ D
• training: emphasize later examples (KDDCup 2011)
• validation: use ‘late’ user records
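The rule of thumb can be sketched in code. A hypothetical layout (the field names and sizes below are illustrative): when the test set is the ‘last’ records after D, hold out the latest records for validation instead of a uniformly random subset.

```python
# Hypothetical time-ordered records, as in the Netflix story where the
# test set consists of the 'last' user records 'after' D.
records = [{"t": t, "x": ..., "y": ...} for t in range(1000)]

# Mismatched validation: a uniformly random subset ignores time,
# so E_val can look far better than the true test error.
# random.sample(records, 200)

# Matched validation: hold out the *latest* records, mimicking the test.
split = int(0.8 * len(records))
train, val = records[:split], records[split:]
assert all(r["t"] < val[0]["t"] for r in train)  # training strictly earlier
```

This is the same "match the test scenario" idea as emphasizing later examples during training.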
last puzzle:
danger when learning ‘credit card approval’
with
existing bank records?
Fun Time
If the data D is an unbiased sample from the underlying distribution P for binary classification, which of the following subsets of D is also an unbiased sample from P?
1 all the positive (y_n > 0) examples
2 half of the examples, randomly and uniformly picked from D without replacement
3 the half of the examples with the smallest ‖x_n‖ values
4 the largest subset that is linearly separable
Reference Answer: 2
That’s how we form the validation set, remember? :-)
Three Learning Principles Data Snooping
Visual Data Snooping
Visualize X = R^2
• full Φ_2: z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), d_VC = 6
• or z = (1, x_1^2, x_2^2), d_VC = 3, after visualizing?
• or better z = (1, x_1^2 + x_2^2), d_VC = 2?
• or even better z = sign(0.6 − x_1^2 − x_2^2)?
—careful about your brain’s ‘model complexity’
[figure: data points forming a ring in [−1, 1] × [−1, 1]]
for VC-safety, Φ shall be decided without ‘snooping’ the data
Data Snooping by Mere Shifting-Scaling
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
• 8 years of currency trading data
• first 6 years for training, last 2 years for testing
• x = previous 20 days, y = 21st day
• snooping versus no snooping: superior profit possible
[figure: cumulative profit (%) over 500 trading days; the ‘snooping’ curve rises well above the ‘no snooping’ curve]
• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
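The contrast can be made concrete. A minimal sketch with synthetic numbers (the distributions below are illustrative, not the actual currency data): snooping computes the normalization statistics over training + testing, while the honest version uses training statistics only and applies them to the test set.

```python
import random
import statistics

random.seed(0)
train = [random.gauss(100.0, 5.0) for _ in range(60)]  # e.g. first 6 years
test = [random.gauss(110.0, 8.0) for _ in range(20)]   # last 2 years: new regime

# Snooping: shift-scale statistics computed over training + testing,
# so the test period has leaked into the preprocessing step.
both = train + test
mu_s, sd_s = statistics.fmean(both), statistics.pstdev(both)
test_snoop = [(x - mu_s) / sd_s for x in test]

# No snooping: statistics from training only, then applied to the test set.
mu, sd = statistics.fmean(train), statistics.pstdev(train)
test_clean = [(x - mu) / sd for x in test]

# The snooped scaling quietly re-centers the upward-shifted test period,
# information a real trader would not have had at training time.
print(statistics.fmean(test_snoop), statistics.fmean(test_clean))
```

The snooped test values look closer to the training regime than they really are, which is where the spurious ‘superior profit’ comes from.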
Data Snooping by Data Reusing
Research Scenario
benchmark data D
• paper 1: propose H_1 that works well on D
• paper 2: find room for improvement, propose H_2 —and publish only if better than H_1 on D
• paper 3: find room for improvement, propose H_3 —and publish only if better than H_2 on D
• . . .
• if all papers were combined into one big paper by the same author: bad generalization due to d_VC(∪_m H_m)
• step-wise: later authors snooped the data by reading earlier papers; bad generalization worsened by publish only if better
if you torture the data long enough, it will confess :-)
Dealing with Data Snooping
• truth—very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation and use it cautiously
• be blind: avoid making modeling decisions by looking at the data
• be suspicious: interpret research results (including your own) with a proper feeling of contamination
one secret to winning KDDCups:
careful balance between
data-driven modeling (snooping)
and validation (no snooping)
Fun Time
Which of the following can result in unsatisfactory test performance in machine learning?
1 data snooping
2 overfitting
3 sampling bias
4 all of the above
Reference Answer: 4
A professional like you should be aware of those! :-)
Three Learning Principles Power of Three
Three Related Fields
Power of Three
Data Mining
• use (huge) data to find properties that are interesting
• difficult to distinguish ML and DM in reality
Artificial Intelligence
• compute something that shows intelligent behavior
• ML is one possible route to realize AI
Statistics
• use data to make inferences about an unknown process
• statistics contains many useful tools for ML
Three Theoretical Bounds
Power of Three
Hoeffding
P[BAD] ≤ 2 exp(−2ε²N)
• one hypothesis
• useful for verifying/testing
Multi-Bin Hoeffding
P[BAD] ≤ 2M exp(−2ε²N)
• M hypotheses
• useful for validation
VC
P[BAD] ≤ 4 m_H(2N) exp(. . .)
• all H
• useful for training
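The three bounds differ mainly in their leading factor. A quick numeric illustration (ε, N, and M below are arbitrary example values, not from the slides) of how the multi-bin bound loosens the single-hypothesis guarantee by exactly the factor M:

```python
import math

# Illustrative numbers: tolerance epsilon = 0.1, sample size N = 10000.
eps, N = 0.1, 10000

def hoeffding(M=1):
    """Multi-bin Hoeffding bound: P[BAD] <= 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * M * math.exp(-2 * eps * eps * N)

single = hoeffding()        # one hypothesis: verifying/testing
multi = hoeffding(M=1000)   # M hypotheses: validation over finite choices

print(single, multi)  # the bound loosens by exactly the factor M
```

The VC bound replaces the factor M by the growth function m_H(2N), which is what makes training over an infinite H tractable at all.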
Three Linear Models
Power of Three
(each model computes the score s = w^T x from inputs x_0, x_1, . . ., x_d)
PLA/pocket: h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially
linear regression: h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically
logistic regression: h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively
Three Key Tools
Power of Three
Feature Transform
E_in(w) → E_in(w̃)
d_VC(H) → d_VC(H_Φ)
• by using a more complicated Φ
• lower E_in
• higher d_VC
Regularization
E_in(w) → E_in(w_REG)
d_VC(H) → d_EFF(H, A)
• by augmenting a regularizer Ω
• lower d_EFF
• higher E_in
Validation
E_in(h) → E_val(h)
H → {g_1^−, . . ., g_M^−}
• by reserving K examples as D_val
• fewer choices
• fewer examples
Three Learning Principles
Power of Three
Occam’s Razor: simple is good
Sampling Bias: class matches exam
Data Snooping: honesty is the best policy
Three Future Directions
Power of Three
More Transform More Regularization Less Label
soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, coordinate descent, bagging, decision tree, support vector machine, neural network, kernel
ready for the jungle!
Fun Time
What are the magic numbers that repeatedly appear in this class?
1 3
2 1126
3 both 3 and 1126
4 neither 3 nor 1126
Reference Answer: 3
3 as illustrated, and you may recall 1126 somewhere :-)