Machine Learning Foundations (機器學習基石)
Lecture 4: Feasibility of Learning
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Feasibility of Learning
Roadmap
1 When Can Machines Learn?
  Lecture 3: Types of Learning
  focus: binary classification or regression from a batch of supervised data with concrete features
  Lecture 4: Feasibility of Learning
  • Learning is Impossible?
  • Probability to the Rescue
  • Connection to Learning
  • Connection to Real Learning
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Feasibility of Learning: Learning is Impossible?
A Learning Puzzle
[figure: six 3×3 black/white patterns given as examples, some labeled y_n = −1 and some labeled y_n = +1, plus a query pattern asking g(x) = ?]

let’s test your ‘human learning’ with 6 examples :-)
Two Controversial Answers
whatever you say about g(x), . . .

[figure: the six labeled patterns (y_n = −1, y_n = +1) and the query pattern g(x) = ?]

truth f(x) = +1 because . . .
• symmetry ⇔ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because . . .
• left-top black ⇔ −1
• middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; your adversarial teacher can always call you ‘didn’t learn’. :-(
A ‘Simple’ Binary Classification Problem
x_n     y_n = f(x_n)
0 0 0   ◦
0 0 1   ×
0 1 0   ×
0 1 1   ◦
1 0 0   ×

• X = {0, 1}^3, Y = {◦, ×}, can enumerate all candidate f as H
• pick g ∈ H with all g(x_n) = y_n (like PLA); does g ≈ f?
No Free Lunch
x       y   g   f_1  f_2  f_3  f_4  f_5  f_6  f_7  f_8
0 0 0   ◦   ◦   ◦    ◦    ◦    ◦    ◦    ◦    ◦    ◦
0 0 1   ×   ×   ×    ×    ×    ×    ×    ×    ×    ×
0 1 0   ×   ×   ×    ×    ×    ×    ×    ×    ×    ×
0 1 1   ◦   ◦   ◦    ◦    ◦    ◦    ◦    ◦    ◦    ◦
1 0 0   ×   ×   ×    ×    ×    ×    ×    ×    ×    ×
1 0 1       ?   ◦    ◦    ◦    ◦    ×    ×    ×    ×
1 1 0       ?   ◦    ◦    ×    ×    ◦    ◦    ×    ×
1 1 1       ?   ◦    ×    ◦    ×    ◦    ×    ◦    ×
(the first five rows form D; f_1, . . . , f_8 enumerate every f consistent with D)

• g ≈ f inside D: sure!
• g ≈ f outside D: No! (but that’s really what we want!)

learning from D (to infer something outside D) is doomed if any ‘unknown’ f can happen. :-(
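To make the table concrete, here is a small Python sketch (my illustration, not part of the slides): enumerate every labeling of X = {0, 1}^3 that agrees with D, and note that the five training examples leave eight candidate targets that no amount of fitting D can distinguish.

```python
from itertools import product

# the five training examples from the slide, with labels 'o' (circle) and 'x' (cross)
D = {(0, 0, 0): 'o', (0, 0, 1): 'x', (0, 1, 0): 'x',
     (0, 1, 1): 'o', (1, 0, 0): 'x'}

X = list(product([0, 1], repeat=3))          # all 8 points of {0,1}^3
outside = [x for x in X if x not in D]       # the 3 points never seen in D

# every f consistent with D: fix the labels on D, try all labelings outside D
candidates = []
for labels in product('ox', repeat=len(outside)):
    f = dict(D)
    f.update(zip(outside, labels))
    candidates.append(f)

print(len(candidates))   # 8, matching f_1 .. f_8 in the table
# any g that fits D perfectly agrees with exactly one candidate outside D,
# so D alone cannot tell us whether g ≈ f outside D
```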
Fun Time
This is a popular ‘brain-storming’ problem, with a claim that 2% of the world’s cleverest population can crack its ‘hidden pattern’.

(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a ‘learning problem’ with N = 1, x_1 = (5, 3, 2), y_1 = 151022. Learn a hypothesis from the one example to predict on x = (7, 2, 5). What is your answer?

1. 151026
2. 143547
3. I need more examples to get the correct answer
4. there is no ‘correct’ answer

Reference Answer: 4

Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this ‘adversarial’ setting. BTW, 2 is the designer’s answer: the first two digits = x_1 · x_2; the next two digits = x_1 · x_3; the last two digits = (x_1 · x_2 + x_1 · x_3 − x_2).

Feasibility of Learning: Probability to the Rescue
Inferring Something Unknown
difficult to infer unknown target f outside D in learning; can we infer something unknown in other scenarios?

[figure: a bin full of orange and green marbles]

• consider a bin of many, many orange and green marbles
• do we know the orange portion (probability)? No!

can you infer the orange probability?
Statistics 101: Inferring Orange Probability
[figure: the bin and a sample drawn from it]

bin: assume orange probability = µ, green probability = 1 − µ, with µ unknown

sample: N marbles sampled independently, with orange fraction = ν, green fraction = 1 − ν, now ν known

does in-sample ν say anything about out-of-sample µ?
Possible versus Probable
does in-sample ν say anything about out-of-sample µ?

No! possibly not: the sample can be mostly green while the bin is mostly orange

Yes! probably yes: in-sample ν is likely close to unknown µ

[figure: the bin and a sample drawn from it]

formally, what does ν say about µ?
Hoeffding’s Inequality (1/2)
[figure: bin and a sample of size N]

µ = orange probability in bin
ν = orange fraction in sample

• in a big sample (N large), ν is probably close to µ (within ε)

    P[|ν − µ| > ε] ≤ 2 exp(−2ε²N)

• called Hoeffding’s Inequality, for marbles, coins, polling, . . .

the statement ‘ν = µ’ is probably approximately correct (PAC)
Hoeffding’s Inequality (2/2)
    P[|ν − µ| > ε] ≤ 2 exp(−2ε²N)

• valid for all N and ε
• does not depend on µ, no need to ‘know’ µ
• larger sample size N or looser gap ε =⇒ higher probability for ‘ν ≈ µ’

[figure: bin and a sample of size N]

if large N, can probably infer unknown µ by known ν
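The inequality is easy to sanity-check numerically. The sketch below (my illustration, with an arbitrary µ) draws many size-N samples from a bin and compares the empirical frequency of |ν − µ| > ε against the bound 2 exp(−2ε²N).

```python
import math
import random

def hoeffding_check(mu=0.6, N=100, eps=0.1, trials=10_000, seed=0):
    """Compare the empirical frequency of |nu - mu| > eps with the Hoeffding bound."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # nu: orange fraction among N marbles drawn i.i.d. from a bin with orange prob. mu
        nu = sum(rng.random() < mu for _ in range(N)) / N
        if abs(nu - mu) > eps:
            bad += 1
    bound = 2 * math.exp(-2 * eps**2 * N)
    return bad / trials, bound

freq, bound = hoeffding_check()
print(f"empirical P[|nu - mu| > eps] ~ {freq:.4f} <= bound {bound:.4f}")
```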
Fun Time
Let µ = 0.4. Use Hoeffding’s Inequality P[|ν − µ| > ε] ≤ 2 exp(−2ε²N) to bound the probability that a sample of 10 marbles will have ν ≤ 0.1. What bound do you get?

1. 0.67
2. 0.40
3. 0.33
4. 0.05

Reference Answer: 3

Set N = 10 and ε = 0.3 and you get the answer. BTW, 4 is the actual probability and Hoeffding gives only an upper bound to that.
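A quick check of both numbers mentioned in the answer (my sketch): the Hoeffding bound with N = 10, ε = 0.3, and the exact binomial probability of ν ≤ 0.1 (at most one orange marble out of ten).

```python
import math

mu, N, eps = 0.4, 10, 0.3

bound = 2 * math.exp(-2 * eps**2 * N)          # Hoeffding bound on P[|nu - mu| > eps]
exact = sum(math.comb(N, k) * mu**k * (1 - mu)**(N - k) for k in range(2))  # P[nu <= 0.1]

print(f"Hoeffding bound ~ {bound:.2f}")        # ~ 0.33 (choice 3)
print(f"exact probability ~ {exact:.2f}")      # ~ 0.05 (choice 4), much smaller than the bound
```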
Feasibility of Learning: Connection to Learning
Connection to Learning
bin ↔ learning analogy:
• unknown orange probability µ  ↔  unknown probability that a fixed hypothesis h(x) differs from the target f(x)
• marble • ∈ bin  ↔  point x ∈ X
• orange •  ↔  h is wrong: h(x) ≠ f(x)
• green •  ↔  h is right: h(x) = f(x)
• size-N sample from bin of i.i.d. marbles  ↔  check h on D = {(x_n, y_n)} with y_n = f(x_n) and i.i.d. x_n

[figure: the input space X with each point colored orange where h(x) ≠ f(x) and green where h(x) = f(x)]

if large N & i.i.d. x_n, can probably infer the unknown ⟦h(x) ≠ f(x)⟧ probability by the known ⟦h(x_n) ≠ y_n⟧ fraction
Added Components
[learning flow diagram, now with the distribution added]
• unknown target function f : X → Y (ideal credit approval formula)
• training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
• learning algorithm A
• final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)
• added: unknown P on X, generating x_1, x_2, · · · , x_N and the x on which a fixed h is checked (h ≈ f?)

for any fixed h, can probably infer unknown E_out(h) = E_{x∼P} ⟦h(x) ≠ f(x)⟧ by known E_in(h) = (1/N) Σ_{n=1}^{N} ⟦h(x_n) ≠ y_n⟧.
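A minimal sketch of the two error measures (my own illustration: the target f, the hypothesis h, and the distribution P below are all made up for demonstration): E_in(h) is computed on the N drawn examples, while E_out(h) is estimated on a large fresh sample from the same P.

```python
import random

random.seed(1)

def f(x):            # hypothetical unknown target: sign of the coordinate sum
    return 1 if sum(x) >= 0 else -1

def h(x):            # one fixed hypothesis to check: looks only at the first coordinate
    return 1 if x[0] >= 0 else -1

def draw():          # hypothetical unknown P on X: uniform over [-1, 1]^3
    return tuple(random.uniform(-1, 1) for _ in range(3))

N = 100
D = [(x, f(x)) for x in (draw() for _ in range(N))]

E_in = sum(h(x) != y for x, y in D) / N                            # (1/N) Σ ⟦h(x_n) ≠ y_n⟧
E_out = sum(h(x) != f(x) for x in (draw() for _ in range(100_000))) / 100_000  # ≈ E_{x~P} ⟦h(x) ≠ f(x)⟧

print(f"E_in(h) = {E_in:.3f}, E_out(h) ~ {E_out:.3f}")   # probably close, by Hoeffding
```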
The Formal Guarantee
for any fixed h, in ‘big’ data (N large), in-sample error E_in(h) is probably close to out-of-sample error E_out(h) (within ε):

    P[|E_in(h) − E_out(h)| > ε] ≤ 2 exp(−2ε²N)

same as the ‘bin’ analogy . . .
• valid for all N and ε
• does not depend on E_out(h), no need to ‘know’ E_out(h): f and P can stay unknown
• ‘E_in(h) = E_out(h)’ is probably approximately correct (PAC)

if ‘E_in(h) ≈ E_out(h)’ and ‘E_in(h) small’ =⇒ E_out(h) small =⇒ h ≈ f with respect to P
Verification of One h
for any fixed h, when data large enough, E_in(h) ≈ E_out(h). Can we claim ‘good learning’ (g ≈ f)?

Yes! if E_in(h) small for the fixed h and A picks that h as g =⇒ ‘g = f’ PAC

No! if A forced to pick THE h as g =⇒ E_in(h) almost always not small =⇒ ‘g ≠ f’ PAC!

real learning: A shall make choices within H (like PLA) rather than being forced to pick one h. :-(
The ‘Verification’ Flow
• unknown target function f : X → Y (ideal credit approval formula)
• verifying examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
• one hypothesis h (one candidate formula)
• final hypothesis g ≈ f (given formula to be verified), with g = h
• unknown P on X, generating x_1, x_2, · · · , x_N and x

can now use ‘historical records’ (data) to verify ‘one candidate formula’ h
Fun Time
Your friend tells you her secret rule in investing in a particular stock: ‘Whenever the stock goes down in the morning, it will go up in the afternoon; vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that you can get from the verification?

1. You’ll definitely be rich by exploiting the rule in the next 100 days.
2. You’ll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.
3. You’ll likely be rich by exploiting the ‘best rule’ from 20 more friends in the next 100 days.
4. You’d definitely have been rich if you had exploited the rule in the past 10 years.

Reference Answer: 2

1: no free lunch; 3: no ‘learning’ guarantee when merely verifying; 4: with only 100 days verified, it is possible that the rule does not hold for most of the whole 10 years.
Feasibility of Learning: Connection to Real Learning
Multiple h
[figure: many bins side by side, one per hypothesis h_1, h_2, . . . , h_M, each with its own E_out(h_m) (bin) and E_in(h_m) (sample)]

real learning (say like PLA): BINGO when getting •••••••••• (an all-green sample)?
Coin Game
[figure: many coins, one per student]

Q: if everyone in a size-150 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin ‘g’, is ‘g’ really magical?

A: No. Even if all coins are fair, the probability that one of the coins results in 5 heads is 1 − (31/32)^150 > 99%.

BAD sample: E_in and E_out far away
(can get worse when involving ‘choice’)
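The arithmetic behind the answer, plus a quick simulation of the classroom experiment (my sketch, for illustration only):

```python
import random

# exact: at least one of 150 fair coins shows 5 heads in 5 flips
exact = 1 - (31 / 32) ** 150
print(f"exact probability ~ {exact:.4f}")      # ~ 0.99, i.e. > 99%

# simulation: repeat the 150-student experiment many times
random.seed(0)
trials = 10_000
hits = sum(
    any(all(random.random() < 0.5 for _ in range(5)) for _ in range(150))
    for _ in range(trials)
)
print(f"simulated probability ~ {hits / trials:.4f}")
```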
BAD Sample and BAD Data
BAD Sample
e.g., E_out = 1/2, but getting all heads (E_in = 0)!

BAD Data for One h
E_out(h) and E_in(h) far away:
e.g., E_out big (far from f), but E_in small (correct on most examples)

[table over all possible datasets D_1, D_2, . . . , D_1126, . . . , D_5678, . . . : only a few of them are BAD for the single h, and the Hoeffding column reads P_D[BAD D for h] ≤ . . .]

Hoeffding: P_D[BAD D] = Σ_{all possible D} P(D) · ⟦BAD D⟧ is small
BAD Data for Many h
BAD data for many h
⇐⇒ no ‘freedom of choice’ by A
⇐⇒ there exists some h such that E_out(h) and E_in(h) are far away

[table over all possible datasets D_1, D_2, . . . , D_1126, . . . , D_5678, . . . :]
h_1:  BAD for some of the D’s;                  P_D[BAD D for h_1] ≤ . . .
h_2:  BAD for some of the D’s;                  P_D[BAD D for h_2] ≤ . . .
h_3:  BAD for some of the D’s;                  P_D[BAD D for h_3] ≤ . . .
. . .
h_M:  BAD for some of the D’s;                  P_D[BAD D for h_M] ≤ . . .
all:  D is BAD whenever it is BAD for any h_m;  bound = ?

for M hypotheses, bound of P_D[BAD D]?
Bound of BAD Data
P_D[BAD D]
  = P_D[BAD D for h_1 or BAD D for h_2 or . . . or BAD D for h_M]
  ≤ P_D[BAD D for h_1] + P_D[BAD D for h_2] + . . . + P_D[BAD D for h_M]   (union bound)
  ≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)
  = 2M exp(−2ε²N)

• finite-bin version of Hoeffding, valid for all M, N and ε
• does not depend on any E_out(h_m), no need to ‘know’ E_out(h_m): f and P can stay unknown
• ‘E_in(g) = E_out(g)’ is PAC, regardless of A

‘most reasonable’ A (like PLA/pocket): pick the h_m with lowest E_in(h_m) as g
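A short sketch (mine, not from the slides) of how the finite-bin bound is typically used: evaluate 2M exp(−2ε²N) for given M, N, ε, or invert it to ask how many examples are needed before choosing among M hypotheses is safe at a tolerance δ.

```python
import math

def bad_data_bound(M, N, eps):
    """Finite-bin Hoeffding: P_D[BAD D] <= 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * M * math.exp(-2 * eps**2 * N)

def samples_needed(M, eps, delta):
    """Smallest N that makes the bound at most delta (solve 2*M*exp(-2*eps^2*N) <= delta)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(bad_data_bound(M=100, N=1000, eps=0.1))      # tiny: choosing among 100 h's is safe here
print(samples_needed(M=100, eps=0.1, delta=0.05))  # 415 examples suffice for this M, eps, delta
```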
The ‘Statistical’ Learning Flow
if |H| = M finite and N large enough, then for whatever g picked by A, E_out(g) ≈ E_in(g);
if A finds one g with E_in(g) ≈ 0, PAC guarantees E_out(g) ≈ 0 =⇒ learning possible :-)

• unknown target function f : X → Y (ideal credit approval formula)
• training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
• learning algorithm A
• final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)
• unknown P on X, generating x_1, x_2, · · · , x_N and x

M = ∞? (like perceptrons): see you in the next lectures
Fun Time
Consider 4 hypotheses: h_1(x) = sign(x_1), h_2(x) = sign(x_2), h_3(x) = sign(−x_1), h_4(x) = sign(−x_2). For any N and ε, which of the following statements is not true?

1. the BAD data of h_1 and the BAD data of h_2 are exactly the same
2. the BAD data of h_1 and the BAD data of h_3 are exactly the same
3. P_D[BAD for some h_k] ≤ 8 exp(−2ε²N)
4. P_D[BAD for some h_k] ≤ 4 exp(−2ε²N)

Reference Answer: 1

The important thing is to note that 2 is true, which implies that 4 is true if you revisit the union bound. Similar ideas will be used to conquer the M = ∞ case.
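Why statement 2 holds: h_3(x) = −h_1(x), so on any dataset E_in(h_3) = 1 − E_in(h_1) and likewise E_out(h_3) = 1 − E_out(h_1), which makes |E_in − E_out| identical for the two hypotheses; a dataset is therefore BAD for h_1 exactly when it is BAD for h_3. A quick numerical sketch (my illustration, with a made-up target and distribution, and E_out estimated on a large held-out sample):

```python
import random

random.seed(2)
sign = lambda v: 1 if v >= 0 else -1

h1 = lambda x: sign(x[0])
h3 = lambda x: -sign(x[0])                 # h3 is the negation of h1
f = lambda x: sign(x[0] + x[1])            # hypothetical target, illustration only
draw = lambda: (random.uniform(-1, 1), random.uniform(-1, 1))   # hypothetical P on X

def gap(h, D, big):
    """|E_in(h) - E_out(h)|, with E_out estimated on a large held-out sample."""
    e_in = sum(h(x) != f(x) for x in D) / len(D)
    e_out = sum(h(x) != f(x) for x in big) / len(big)
    return abs(e_in - e_out)

big = [draw() for _ in range(100_000)]
for _ in range(5):                         # a few datasets D of size 50
    D = [draw() for _ in range(50)]
    # the two gaps coincide exactly: flipping every prediction sends both errors
    # to 1 - error, so the difference |E_in - E_out| is unchanged
    print(round(gap(h1, D, big), 4), round(gap(h3, D, big), 4))
```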