### Machine Learning Foundations (機器學習基石)

### Lecture 4: Feasibility of Learning

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

### Department of Computer Science & Information Engineering

### National Taiwan University (國立台灣大學資訊工程系)

Feasibility of Learning

### Roadmap

1. **When** Can Machines Learn?
   - Lecture 3: Types of Learning. Focus: **binary classification** or **regression** from a **batch** of **supervised** data with **concrete** features.
   - Lecture 4: Feasibility of Learning
     - Learning is Impossible?
     - Probability to the Rescue
     - Connection to Learning
     - Connection to Real Learning
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn Better?

### A Learning Puzzle

[Figure: example patterns labeled $y_n = -1$, example patterns labeled $y_n = +1$, and a new pattern $\mathbf{x}$ with $g(\mathbf{x}) = ?$]

### Two Controversial Answers

Whatever you say about $g(\mathbf{x})$, a valid reason exists for the opposite answer.

[Figure: the same puzzle, with examples labeled $y_n = -1$ and $y_n = +1$ and the query $g(\mathbf{x}) = ?$]

**Truth $f(\mathbf{x}) = +1$ because . . .**

- symmetry $\Leftrightarrow$ $+1$
- (black or white count = 3) or (black count = 4 and middle-top black) $\Leftrightarrow$ $+1$

**Truth $f(\mathbf{x}) = -1$ because . . .**

- left-top black $\Leftrightarrow$ $-1$
- middle column contains at most 1 black and right-top white $\Leftrightarrow$ $-1$

All are valid reasons, so your **adversarial teacher** can always call you ‘didn’t learn’. **:-(**

### A ‘Simple’ Binary Classification Problem

| $\mathbf{x}_n$ | $y_n = f(\mathbf{x}_n)$ |
|---|---|
| 0 0 0 | ◦ |
| 0 0 1 | × |
| 0 1 0 | × |
| 0 1 1 | ◦ |
| 1 0 0 | × |

- $\mathcal{X} = \{0, 1\}^3$, $\mathcal{Y} = \{\circ, \times\}$, so we can enumerate all candidate $f$ as $\mathcal{H}$

Pick $g \in \mathcal{H}$ with all $g(\mathbf{x}_n) = y_n$ (like PLA). **Does $g \approx f$?**

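### No Free Lunch

| | $\mathbf{x}$ | $y$ | $g$ | $f_1$ | $f_2$ | $f_3$ | $f_4$ | $f_5$ | $f_6$ | $f_7$ | $f_8$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $\mathcal{D}$ | 0 0 0 | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ |
| $\mathcal{D}$ | 0 0 1 | × | × | × | × | × | × | × | × | × | × |
| $\mathcal{D}$ | 0 1 0 | × | × | × | × | × | × | × | × | × | × |
| $\mathcal{D}$ | 0 1 1 | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ | ◦ |
| $\mathcal{D}$ | 1 0 0 | × | × | × | × | × | × | × | × | × | × |
| | 1 0 1 | | **?** | ◦ | ◦ | ◦ | ◦ | × | × | × | × |
| | 1 1 0 | | **?** | ◦ | ◦ | × | × | ◦ | ◦ | × | × |
| | 1 1 1 | | **?** | ◦ | × | ◦ | × | ◦ | × | ◦ | × |

- $g \approx f$ inside $\mathcal{D}$: sure!
- $g \approx f$ outside $\mathcal{D}$: **No!** (but that’s really what we want!)

Learning from $\mathcal{D}$ (to infer something outside $\mathcal{D}$) is doomed if **any ‘unknown’ $f$ can happen.** **:-(**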
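The table can be reproduced by brute force. The sketch below enumerates every target consistent with $\mathcal{D}$ and measures how often an arbitrary $g$ that fits $\mathcal{D}$ disagrees with each one outside $\mathcal{D}$. The labels `"o"` and `"x"` stand in for ◦ and ×, and all names in the code are illustrative choices rather than notation from the lecture.

```python
from itertools import product

# Training data D from the table: inputs in {0,1}^3 with labels "o" / "x".
D = {
    (0, 0, 0): "o",
    (0, 0, 1): "x",
    (0, 1, 0): "x",
    (0, 1, 1): "o",
    (1, 0, 0): "x",
}

# The three inputs that D never shows us.
outside = [x for x in product([0, 1], repeat=3) if x not in D]

# Every target f that agrees with D: one per labeling of the outside inputs.
consistent_targets = [
    {**D, **dict(zip(outside, labels))}
    for labels in product("ox", repeat=len(outside))
]


def outside_error(g, f):
    """Fraction of outside-D inputs on which g and f disagree."""
    return sum(g[x] != f[x] for x in outside) / len(outside)


# Any g that fits D works for this argument; the first candidate is an arbitrary pick.
g = consistent_targets[0]
errors = [outside_error(g, f) for f in consistent_targets]
average_error = sum(errors) / len(errors)

print(len(consistent_targets), min(errors), max(errors), average_error)
```

Whichever consistent $g$ is picked, some target matches it perfectly outside $\mathcal{D}$ and some target disagrees with it everywhere, so the data alone cannot rank the candidates.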

### Fun Time

This is a popular ‘brain-storming’ problem, with a claim that 2% of the world’s cleverest population can crack its ‘hidden pattern’.

$$(5, 3, 2) \to 151022, \qquad (7, 2, 5) \to \; ?$$

It is like a ‘learning problem’ with $N = 1$, $\mathbf{x}_1 = (5, 3, 2)$, $y_1 = 151022$. Learn a hypothesis from the one example to predict on $\mathbf{x} = (7, 2, 5)$. What is your answer?

1. 151026
2. 143547
3. I need more examples to get the correct answer
4. there is no ‘correct’ answer

**Reference Answer: 4**

Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this ‘adversarial’ setting. BTW, 2 is the designer’s answer: the first two digits $= x_1 \cdot x_2$; the next two digits $= x_1 \cdot x_3$; the last two digits $= (x_1 \cdot x_2 + x_1 \cdot x_3 - x_2)$.
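To make the point concrete, the sketch below codes the designer’s rule next to a second rule that fits the single example equally well but predicts something else. `rival_rule` is a made-up alternative introduced only for illustration; it does not come from the lecture.

```python
def designer_rule(x1, x2, x3):
    """The designer's rule quoted in the reference answer."""
    return f"{x1 * x2:02d}{x1 * x3:02d}{x1 * x2 + x1 * x3 - x2:02d}"


def rival_rule(x1, x2, x3):
    """Hypothetical alternative: same first four digits, different last two."""
    return f"{x1 * x2:02d}{x1 * x3:02d}{x1 * x2 + x1 * x3 - x3 - 1:02d}"


train_x, train_y = (5, 3, 2), "151022"
test_x = (7, 2, 5)

for rule in (designer_rule, rival_rule):
    fits_example = rule(*train_x) == train_y
    print(rule.__name__, fits_example, rule(*test_x))
```

Both rules reproduce 151022 on the one training example, yet they give 143547 and 143543 on $(7, 2, 5)$. With $N = 1$ nothing in the data favors one over the other.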


### Inferring Something Unknown

It is difficult to infer the **unknown target $f$ outside $\mathcal{D}$** in learning; can we infer **something unknown** in **other scenarios**?

[Figure: a bin of marbles]

- consider a bin of many, many orange and green marbles
- do we **know** the orange portion (probability)? **No!**

Can you **infer** the orange probability?

### Statistics 101: Inferring **Orange** Probability

[Figure: a bin and a sample drawn from it]

**bin**: assume orange probability $= \mu$, green probability $= 1 - \mu$, with $\mu$ **unknown**

**sample**: $N$ marbles sampled independently, with orange fraction $= \nu$, green fraction $= 1 - \nu$, now $\nu$ **known**

Does **in-sample** $\nu$ say anything about out-of-sample $\mu$?

### Possible versus Probable

Does **in-sample** $\nu$ say anything about out-of-sample $\mu$?

**No!** Possibly not: the sample can be mostly green while the bin is mostly orange.

**Yes!** Probably yes: in-sample $\nu$ is likely **close to** unknown $\mu$.

[Figure: a bin and a sample drawn from it]

Formally, **what does $\nu$ say about $\mu$?**

### Hoeffding’s Inequality (1/2)

[Figure: a bin and a sample of size $N$ drawn from it]

$\mu$ = orange probability in bin; $\nu$ = orange fraction in sample

- in a big sample ($N$ large), $\nu$ is probably close to $\mu$ (within $\epsilon$):

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2 \exp\left(-2 \epsilon^2 N\right)$$

- called **Hoeffding’s Inequality**, for marbles, coin, polling, . . .

The statement ‘$\nu = \mu$’ is **probably approximately correct** (PAC).
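A quick simulation can make the inequality tangible. The sketch below repeatedly draws samples from a bin with a known $\mu$ and compares how often $|\nu - \mu| > \epsilon$ actually happens with the bound $2\exp(-2\epsilon^2 N)$. The values of $\mu$, $N$, $\epsilon$, the number of trials, and the random seed are arbitrary illustrative choices, not numbers from the lecture.

```python
import math
import random


def hoeffding_bound(eps, n):
    """Right-hand side of Hoeffding's Inequality: 2 exp(-2 eps^2 N)."""
    return 2.0 * math.exp(-2.0 * eps**2 * n)


def sample_fraction(mu, n, rng):
    """Draw n marbles independently; return the orange fraction nu."""
    return sum(rng.random() < mu for _ in range(n)) / n


def empirical_violation_rate(mu, n, eps, trials, rng):
    """Fraction of repeated samples with |nu - mu| > eps."""
    bad = sum(abs(sample_fraction(mu, n, rng) - mu) > eps for _ in range(trials))
    return bad / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
mu, n, eps = 0.5, 64, 0.125  # illustrative values, exactly representable as floats
rate = empirical_violation_rate(mu, n, eps, trials=2000, rng=rng)
bound = hoeffding_bound(eps, n)

print(f"empirical rate = {rate:.4f}, Hoeffding bound = {bound:.4f}")
```

The simulation needs $\mu$ to generate marbles, but the bound itself is computed from $\epsilon$ and $N$ alone.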

### Hoeffding’s Inequality (2/2)

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2 \exp\left(-2 \epsilon^2 N\right)$$

- valid for all $N$ and $\epsilon$
- does not depend on $\mu$, **no need to ‘know’ $\mu$**
- larger sample size $N$ or looser gap $\epsilon$ $\Rightarrow$ higher probability for ‘$\nu \approx \mu$’

[Figure: a bin and a sample of size $N$ drawn from it]

If **large $N$**, can **probably** infer unknown $\mu$ by known $\nu$.
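The trade-off between $N$ and $\epsilon$ can be read off by solving $2\exp(-2\epsilon^2 N) \le \delta$ for $N$, which gives $N \ge \ln(2/\delta) / (2\epsilon^2)$. The sketch below does this arithmetic; the failure probability $\delta$ and the function names are assumptions introduced here, not symbols from the lecture.

```python
import math


def hoeffding_bound(eps, n):
    """Right-hand side of Hoeffding's Inequality: 2 exp(-2 eps^2 N)."""
    return 2.0 * math.exp(-2.0 * eps**2 * n)


def required_sample_size(eps, delta):
    """Smallest integer N with 2 exp(-2 eps^2 N) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))


# Example: nu within 0.1 of mu, except with probability at most 0.05.
n_needed = required_sample_size(eps=0.1, delta=0.05)
print(n_needed, hoeffding_bound(0.1, n_needed))
```

For $\epsilon = 0.1$ and $\delta = 0.05$ this gives $N = 185$. Halving $\epsilon$ roughly quadruples the required $N$, since $N$ scales with $1/\epsilon^2$.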

### Fun Time

Let $\mu = 0.4$. Use Hoeffding’s Inequality

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2 \exp\left(-2 \epsilon^2 N\right)$$

to bound the probability that a sample of 10 marbles will have $\nu \le 0.1$. What bound do you get?

1. 0.67
2. 0.40
3. 0.33
4. 0.05

**Reference Answer: 3**

Set $N = 10$ and $\epsilon = 0.3$ and you get the answer. BTW, 4 is the actual probability and Hoeffding gives only an upper bound to that.
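Both numbers in the reference answer can be checked directly. The sketch below evaluates the Hoeffding bound at $N = 10$, $\epsilon = 0.3$ and computes the exact binomial probability of seeing at most one orange marble; the variable names are illustrative.

```python
import math

mu, n, eps = 0.4, 10, 0.3

# Hoeffding's upper bound: 2 exp(-2 eps^2 N).
bound = 2.0 * math.exp(-2.0 * eps**2 * n)

# Exact probability that nu <= 0.1, i.e. at most 1 orange marble out of 10.
exact = sum(math.comb(n, k) * mu**k * (1 - mu) ** (n - k) for k in range(2))

print(f"Hoeffding bound = {bound:.2f}, exact probability = {exact:.2f}")
```

The bound comes out near 0.33 while the exact probability is near 0.05, which shows how loose the guarantee can be for small $N$.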


### Connection to Learning

| bin | learning |
|---|---|
| unknown orange prob. $\mu$ | fixed hypothesis $h(\mathbf{x}) \stackrel{?}{=}$ target $f(\mathbf{x})$ |
| marble $\bullet \in$ bin | $\mathbf{x} \in \mathcal{X}$ |
| orange $\bullet$ | $h$ is wrong $\Leftrightarrow$ $h(\mathbf{x}) \ne f(\mathbf{x})$ |
| green $\bullet$ | $h$ is right $\Leftrightarrow$ $h(\mathbf{x}) = f(\mathbf{x})$ |
| size-$N$ sample from bin of i.i.d. marbles | check $h$ on $\mathcal{D} = \{(\mathbf{x}_n, \underbrace{y_n}_{f(\mathbf{x}_n)})\}$ with i.i.d. $\mathbf{x}_n$ |

If **large $N$** & **i.i.d. $\mathbf{x}_n$**, can **probably** infer unknown $[\![h(\mathbf{x}) \ne f(\mathbf{x})]\!]$ probability by known $[\![h(\mathbf{x}_n) \ne y_n]\!]$ fraction.

[Figure: the input space $\mathcal{X}$ drawn as a bin, with points where $h(\mathbf{x}) \ne f(\mathbf{x})$ and points where $h(\mathbf{x}) = f(\mathbf{x})$]

### Added Components

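[Figure: learning flow diagram with the following components]

- unknown target function $f \colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D} \colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- final hypothesis $g \approx f$ (‘learned’ formula to be used)
- hypothesis set $\mathcal{H}$ (set of candidate formulas)
- unknown $P$ on $\mathcal{X}$: $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$, and $\mathbf{x}$
- fixed $h$, with the question $h \stackrel{?}{\approx} f$

For any fixed $h$, can probably infer

$$\text{unknown } E_{\text{out}}(h) = \mathop{\mathcal{E}}_{\mathbf{x} \sim P} [\![h(\mathbf{x}) \ne f(\mathbf{x})]\!]$$

by the known in-sample fraction $E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} [\![h(\mathbf{x}_n) \ne y_n]\!]$.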

### The Formal Guarantee

For any fixed $h$, in ‘big’ data ($N$ large), in-sample error $E_{\text{in}}(h)$ is probably close to out-of-sample error $E_{\text{out}}(h)$ (within $\epsilon$):

$$\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \le 2 \exp\left(-2 \epsilon^2 N\right)$$

Same as the ‘bin’ analogy . . .

- valid for all $N$ and $\epsilon$
- does not depend on $E_{\text{out}}(h)$, **no need to ‘know’ $E_{\text{out}}(h)$**: $f$ and $P$ can stay unknown
- ‘$E_{\text{in}}(h) = E_{\text{out}}(h)$’ is **probably approximately correct (PAC)**

If **‘$E_{\text{in}}(h) \approx E_{\text{out}}(h)$’** and **‘$E_{\text{in}}(h)$ small’** $\Rightarrow$ $E_{\text{out}}(h)$ small $\Rightarrow$ $h \approx f$ with respect to $P$.
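The guarantee can be tried on a toy setting where $E_{\text{out}}(h)$ is known by construction. In the sketch below, the target `f`, the fixed hypothesis `h`, and the uniform distribution on $[-1, 1)$ are all hypothetical choices made for illustration; in real learning $f$ and $P$ stay unknown and only $E_{\text{in}}(h)$ can be computed.

```python
import math
import random


def f(x):
    """Hypothetical target: +1 when x >= 0, else -1."""
    return 1 if x >= 0 else -1


def h(x):
    """A fixed hypothesis that disagrees with f exactly on [0, 0.2)."""
    return 1 if x >= 0.2 else -1


def in_sample_error(hyp, data):
    """E_in: fraction of examples (x_n, y_n) with hyp(x_n) != y_n."""
    return sum(hyp(x) != y for x, y in data) / len(data)


rng = random.Random(1)  # fixed seed so the run is repeatable
N = 1000

# P is assumed uniform on [-1, 1), so E_out(h) = P[0 <= x < 0.2] = 0.1.
xs = [rng.uniform(-1.0, 1.0) for _ in range(N)]
data = [(x, f(x)) for x in xs]

e_in = in_sample_error(h, data)
e_out = 0.1
eps = 0.05
bound = 2.0 * math.exp(-2.0 * eps**2 * N)

print(f"E_in = {e_in:.3f}, E_out = {e_out:.3f}, P[gap > {eps}] <= {bound:.4f}")
```

Note that `h` is fixed before the data is drawn, which is exactly the condition under which the single-hypothesis bound applies.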

### Verification of One h

For any fixed $h$, when data is large enough, $E_{\text{in}}(h) \approx E_{\text{out}}(h)$.

**Can we claim ‘good learning’ ($g \approx f$)?**

**Yes!** If $E_{\text{in}}(h)$ is **small for the fixed $h$** and **$\mathcal{A}$ picks the $h$ as $g$** $\Rightarrow$ ‘$g = f$’ PAC.

**No!** If **$\mathcal{A}$ is forced to pick THE $h$ as $g$** $\Rightarrow$ $E_{\text{in}}(h)$ is **almost always not small** $\Rightarrow$ ‘$g \ne f$’ PAC!

Real learning: $\mathcal{A}$ shall **make choices $\in \mathcal{H}$** (like PLA) rather than **being forced to pick one $h$. :-(**

### The ‘Verification’ Flow

[Figure: verification flow diagram with the following components]

- unknown target function $f \colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- **verifying** examples $\mathcal{D} \colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- final hypothesis $g \approx f$ (given formula to be verified), with $g = h$
- **one hypothesis** $h$ (one candidate formula)
- unknown $P$ on $\mathcal{X}$: $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$, and $\mathbf{x}$

Can now use ‘historical records’ (data) to **verify ‘one candidate formula’ $h$**.

### Fun Time

Your friend tells you her secret rule in investing in a particular stock: ‘Whenever the stock goes down in the morning, it will go up in the afternoon; vice versa.’ **To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule.** What is the best guarantee that you can get from the verification?

1. You’ll definitely be rich by exploiting the rule in the next 100 days.
2. You’ll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.
3. You’ll likely be rich by exploiting the ‘best rule’ from 20 more friends in the next 100 days.
4. You’d definitely have been rich if you had exploited the rule in the past 10 years.

**Reference Answer: 2**

Choice 1: no free lunch; choice 3: no ‘learning’ guarantee in verification; choice 4: verifying with only 100 days, it is possible that the rule is mostly wrong for the whole 10 years.
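Hoeffding’s Inequality puts a number on the ‘likely’ in the reference answer. The sketch below treats the 100 sampled days as the sample and the whole 10 years as the bin; the tolerance $\epsilon = 0.2$ is an arbitrary choice made here for illustration.

```python
import math

N = 100       # days sampled uniformly at random
nu = 80 / N   # fraction of sampled days that satisfy the rule
eps = 0.2     # tolerance chosen here for illustration only

# Probability that the 10-year fraction mu is further than eps from nu.
bound = 2.0 * math.exp(-2.0 * eps**2 * N)

print(f"P[|nu - mu| > {eps}] <= {bound:.5f}")
print(f"so mu is probably at least {nu - eps:.1f}")
```

The statement is about the past 10 years, the population the days were drawn from. It carries over to the next 100 days only under the assumption in choice 2 that the market behaves similarly.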


### Multiple h

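[Figure: $M$ bins, one per hypothesis $h_1, h_2, \ldots, h_M$. Each bin has its own out-of-sample error $E_{\text{out}}(h_1), E_{\text{out}}(h_2), \ldots, E_{\text{out}}(h_M)$, and each sample has its own in-sample error $E_{\text{in}}(h_1), E_{\text{in}}(h_2), \ldots, E_{\text{in}}(h_M)$.]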

### Coin Game

[Figure: many bins, one per coin]

Q: If everyone in the size-150 NTU ML class flips a coin 5 times, and **one of the students gets 5 heads for her coin ‘$g$’, is ‘$g$’ really magical?**

A: No. Even if all coins are fair, the probability that **one of the coins** results in **5 heads** is $1 - \left(\frac{31}{32}\right)^{150} > 99\%$.

**BAD sample:** $E_{\text{in}}$ and $E_{\text{out}}$ far away, which can get **worse** when involving ‘choice’.
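The coin-game probability is a one-line computation: a single fair coin shows 5 heads in 5 flips with probability $1/32$, so at least one of 150 independent coins does so with probability $1 - (31/32)^{150}$. The variable names below are illustrative.

```python
students, flips = 150, 5

# One fair coin gives all heads with probability (1/2)^5 = 1/32.
p_all_heads = 0.5**flips

# At least one of the 150 independent coins gives all heads.
p_some_all_heads = 1 - (1 - p_all_heads) ** students

print(f"{p_some_all_heads:.4f}")
```

Each individual coin is very unlikely to look ‘magical’, but picking the best-looking coin after the fact makes a misleading sample almost certain.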

### BAD Sample and BAD Data

**BAD Sample**

e.g., $E_{\text{out}} = \frac{1}{2}$, but getting all heads ($E_{\text{in}} = 0$)!

**BAD Data for One $h$**

$E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away: e.g., $E_{\text{out}}$ big (far from $f$), but $E_{\text{in}}$ small (correct on most examples)

[Table: columns are possible data sets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_{1126}, \ldots, \mathcal{D}_{5678}, \ldots$; the row for $h$ marks some of them as **BAD**; last column (Hoeffding): $\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h] \le \ldots$]

Hoeffding: small

### BAD Data for Many h

BAD data for many $h$ $\Leftrightarrow$ **no ‘freedom of choice’** by $\mathcal{A}$ $\Leftrightarrow$ **there exists some $h$ such that** $E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away

[Table: columns are possible data sets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_{1126}, \ldots, \mathcal{D}_{5678}$; rows are hypotheses $h_1, h_2, h_3, \ldots, h_M$ plus a final row ‘all’; cells mark which data sets are **BAD** for each hypothesis. Last column (Hoeffding): $\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_m] \le \ldots$ for each $h_m$, and **?** for the ‘all’ row.]

For $M$ hypotheses, bound of $\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D}]$?

### Bound of BAD Data

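$$
\begin{aligned}
\mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D}]
&= \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_1 \text{ or BAD } \mathcal{D} \text{ for } h_2 \text{ or } \ldots \text{ or BAD } \mathcal{D} \text{ for } h_M] \\
&\le \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_1] + \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_2] + \ldots + \mathbb{P}_{\mathcal{D}}[\text{BAD } \mathcal{D} \text{ for } h_M] \quad \text{(union bound)} \\
&\le 2 \exp\left(-2 \epsilon^2 N\right) + 2 \exp\left(-2 \epsilon^2 N\right) + \ldots + 2 \exp\left(-2 \epsilon^2 N\right) \\
&= 2 M \exp\left(-2 \epsilon^2 N\right)
\end{aligned}
$$

- finite-bin version of Hoeffding, valid for all $M$, $N$ and $\epsilon$
- does not depend on any $E_{\text{out}}(h_m)$, **no need to ‘know’ $E_{\text{out}}(h_m)$**: $f$ and $P$ can stay unknown
- ‘$E_{\text{in}}(g) = E_{\text{out}}(g)$’ is **PAC, regardless of $\mathcal{A}$**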
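The finite-bin bound can be evaluated the same way as the single-hypothesis one. The sketch below computes $2M\exp(-2\epsilon^2 N)$ and solves it for the smallest $N$ that pushes the bound below a target $\delta$; the symbol $\delta$, the function names, and the example values are assumptions introduced for illustration.

```python
import math


def finite_hoeffding_bound(m, eps, n):
    """Union-bound version of Hoeffding: 2 M exp(-2 eps^2 N)."""
    return 2.0 * m * math.exp(-2.0 * eps**2 * n)


def required_sample_size(m, eps, delta):
    """Smallest integer N with 2 M exp(-2 eps^2 N) <= delta."""
    return math.ceil(math.log(2.0 * m / delta) / (2.0 * eps**2))


# Example: M = 100 hypotheses, gap at most 0.1, failure probability at most 0.05.
n_needed = required_sample_size(m=100, eps=0.1, delta=0.05)
print(n_needed, finite_hoeffding_bound(100, 0.1, n_needed))
```

With these values $N = 415$ suffices, compared with 185 for a single hypothesis at the same $\epsilon$ and $\delta$: the cost of choice grows only with $\ln M$.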

### The ‘Statistical’ Learning Flow

If $|\mathcal{H}| = M$ finite and $N$ large enough, for whatever $g$ picked by $\mathcal{A}$, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$.

If $\mathcal{A}$ finds one $g$ with $E_{\text{in}}(g) \approx 0$, PAC guarantee for $E_{\text{out}}(g) \approx 0$ $\Rightarrow$ **learning possible :-)**

[Figure: learning flow diagram with the following components]

- unknown target function $f \colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D} \colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- final hypothesis $g \approx f$ (‘learned’ formula to be used)
- hypothesis set $\mathcal{H}$ (set of candidate formulas)
- unknown $P$ on $\mathcal{X}$: $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$, and $\mathbf{x}$

$M = \infty$? (like perceptrons): see you in the next lectures

### Fun Time

Consider 4 hypotheses.

$$h_1(\mathbf{x}) = \operatorname{sign}(x_1), \quad h_2(\mathbf{x}) = \operatorname{sign}(x_2), \quad h_3(\mathbf{x}) = \operatorname{sign}(-x_1), \quad h_4(\mathbf{x}) = \operatorname{sign}(-x_2).$$

For any $N$ and $\epsilon$, which of the following statements is not true?

1. the **BAD** data of $h_1$ and the **BAD** data of $h_2$ are exactly the same
2. the **BAD** data of $h_1$ and the **BAD** data of $h_3$ are exactly the same
3. $\mathbb{P}_{\mathcal{D}}[\text{BAD for some } h_k] \le 8 \exp\left(-2 \epsilon^2 N\right)$
4. $\mathbb{P}_{\mathcal{D}}[\text{BAD for some } h_k] \le 4 \exp\left(-2 \epsilon^2 N\right)$

**Reference Answer: 1**

The important thing is to note that 2 is true, which implies that 4 is true if you revisit the union bound. Similar ideas will be used to conquer the $M = \infty$ case.
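Why statement 2 holds can be seen from a small experiment: $h_3 = -h_1$, so on every example exactly one of the two is wrong, giving $E_{\text{in}}(h_3) = 1 - E_{\text{in}}(h_1)$. The same complement holds for $E_{\text{out}}$, so $|E_{\text{in}} - E_{\text{out}}|$ is identical for both. The sketch below checks the in-sample part on random data; the labels, the sampler `nonzero`, and the seed are hypothetical choices for illustration.

```python
import random


def sign(v):
    """Sign for nonzero inputs: +1 if positive, -1 otherwise."""
    return 1 if v > 0 else -1


def h1(x):
    return sign(x[0])


def h3(x):
    return sign(-x[0])


def error_count(hyp, data):
    """Number of examples (x_n, y_n) with hyp(x_n) != y_n."""
    return sum(hyp(x) != y for x, y in data)


def nonzero(rng):
    """Sample from [-1, -0.01] or [0.01, 1] so that sign() never sees 0."""
    return rng.choice([-1, 1]) * rng.uniform(0.01, 1.0)


rng = random.Random(2)  # fixed seed so the run is repeatable
N = 50
# Labels are arbitrary +1/-1 values; the complement property holds for any labels.
data = [((nonzero(rng), nonzero(rng)), rng.choice([-1, 1])) for _ in range(N)]

errors_h1 = error_count(h1, data)
errors_h3 = error_count(h3, data)

print(errors_h1, errors_h3, errors_h1 + errors_h3 == N)
```

Because $h_1$ and $h_3$ share their BAD data sets, and likewise $h_2$ and $h_4$, the union bound only needs to cover two distinct BAD events, which is how $4\exp(-2\epsilon^2 N)$ arises.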
