(1)

Machine Learning Foundations (機器學習基石)

Lecture 4: Feasibility of Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

(2)

Feasibility of Learning

Roadmap

1 When Can Machines Learn?

Lecture 3: Types of Learning

focus: binary classification or regression from a batch of supervised data with concrete features

Lecture 4: Feasibility of Learning

• Learning is Impossible?

• Probability to the Rescue

• Connection to Learning

• Connection to Real Learning

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

Feasibility of Learning Learning is Impossible?

A Learning Puzzle

[Figure: example patterns labeled $y_n = -1$, example patterns labeled $y_n = +1$, and a new pattern x with g(x) = ?]

(4)

Two Controversial Answers

whatever you say about g(x), . . .

[Figure: the same puzzle, with patterns labeled $y_n = -1$ and $y_n = +1$ and the query g(x) = ?]

truth f(x) = +1 because . . .

• symmetry ⇔ +1

• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because . . .

• left-top black ⇔ −1

• middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; your adversarial teacher can always call you ‘didn’t learn’. :-(

(5)

A ‘Simple’ Binary Classification Problem

x_n      y_n = f(x_n)

0 0 0    ◦

0 0 1    ×

0 1 0    ×

0 1 1    ◦

1 0 0    ×

• X = {0, 1}³, Y = {◦, ×}, can enumerate all candidate f as H

• pick g ∈ H with all g(x_n) = y_n (like PLA); does g ≈ f?

(6)

No Free Lunch

D: the first five rows (those with a known label y)

x       y   g   f1  f2  f3  f4  f5  f6  f7  f8

0 0 0   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦

0 0 1   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×

0 1 0   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×

0 1 1   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦

1 0 0   ×   ×   ×   ×   ×   ×   ×   ×   ×   ×

1 0 1       ?   ◦   ◦   ◦   ◦   ×   ×   ×   ×

1 1 0       ?   ◦   ◦   ×   ×   ◦   ◦   ×   ×

1 1 1       ?   ◦   ×   ◦   ×   ◦   ×   ◦   ×

• g ≈ f inside D: sure!

• g ≈ f outside D: No! (but that’s really what we want!)

learning from D (to infer something outside D) is doomed if any ‘unknown’ f can happen. :-(
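The no-free-lunch table can be reproduced by brute force. The following is a minimal sketch, assuming the five labeled examples from the slide and writing 'o' for ◦ and 'x' for ×; the names `consistent` and `errors_outside` are illustrative, not from the lecture.

```python
from itertools import product

# The five labeled examples from the slide: x_n -> y_n ('o' for the circle, 'x' for the cross).
D = {
    (0, 0, 0): "o",
    (0, 0, 1): "x",
    (0, 1, 0): "x",
    (0, 1, 1): "o",
    (1, 0, 0): "x",
}

inputs = list(product([0, 1], repeat=3))        # all 8 points of X = {0, 1}^3
outside = [x for x in inputs if x not in D]     # the 3 points not covered by D

# Every candidate target f is one labeling of the 8 inputs: 2^8 = 256 of them.
all_targets = [dict(zip(inputs, labels)) for labels in product("ox", repeat=8)]

# Keep only the targets that agree with D on every training example.
consistent = [f for f in all_targets if all(f[x] == y for x, y in D.items())]


def errors_outside(g, f):
    """Number of points outside D on which g and f disagree."""
    return sum(g[x] != f[x] for x in outside)


# Pick any hypothesis g that fits D perfectly and compare it with every
# target that is still possible.
g = consistent[0]
average_errors = sum(errors_outside(g, f) for f in consistent) / len(consistent)

print(len(all_targets), len(consistent), average_errors)
```

Eight targets remain consistent with D (the columns f1 to f8), and averaged over them any g that fits D is wrong on half of the three unseen points, which is no better than guessing.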

(7)

Fun Time

This is a popular ‘brain-storming’ problem, with a claim that 2% of the world’s cleverest population can crack its ‘hidden pattern’.

(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a ‘learning problem’ with N = 1, $\mathbf{x}_1 = (5, 3, 2)$, $y_1 = 151022$. Learn a hypothesis from the one example to predict on $\mathbf{x} = (7, 2, 5)$. What is your answer?

1 151026

2 143547

3 I need more examples to get the correct answer

4 there is no ‘correct’ answer

Reference Answer: 4

Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this ‘adversarial’ setting. BTW, 2 is the designer’s answer. Writing the components of the input as $(x_1, x_2, x_3)$: the first two digits = $x_1 \cdot x_2$; the next two digits = $x_1 \cdot x_3$; the last two digits = $(x_1 \cdot x_2 + x_1 \cdot x_3 - x_2)$.

(9)

Feasibility of Learning Probability to the Rescue

Inferring Something Unknown

difficult to infer unknown target f outside D in learning; can we infer something unknown in other scenarios?

[Figure: a bin of marbles]

• consider a bin of many many orange and green marbles

• do we know the orange portion (probability)? No!

can you infer the orange probability?

(10)

Statistics 101: Inferring Orange Probability

[Figure: a bin of marbles and a sample drawn from it]

bin

assume orange probability = µ, green probability = 1 − µ, with µ unknown

sample

N marbles sampled independently, with orange fraction = ν, green fraction = 1 − ν, now ν known

does in-sample ν say anything about out-of-sample µ?

(11)

Possible versus Probable

does in-sample ν say anything about out-of-sample µ?

No! possibly not: sample can be mostly green while bin is mostly orange

Yes! probably yes: in-sample ν likely close to unknown µ

[Figure: bin and sample]

formally, what does ν say about µ?

(12)

Hoeffding’s Inequality (1/2)

[Figure: bin, and a sample of size N]

µ = orange probability in bin

ν = orange fraction in sample

• in big sample (N large), ν is probably close to µ (within ε)

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2\exp\left(-2\epsilon^2 N\right)$$

• called Hoeffding’s Inequality, for marbles, coin, polling, . . .

the statement ‘ν = µ’ is probably approximately correct (PAC)

(13)

Hoeffding’s Inequality (2/2)

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2\exp\left(-2\epsilon^2 N\right)$$

• valid for all N and ε

• does not depend on µ, no need to ‘know’ µ

• larger sample size N or looser gap ε =⇒ higher probability for ‘ν ≈ µ’

[Figure: bin, and a sample of size N]

if large N, can probably infer unknown µ by known ν
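The bound can be checked empirically by drawing many samples from a simulated bin. This is only a sketch: the values µ = 0.6 and ε = 0.1, the trial count, and the function names are assumptions chosen for illustration.

```python
import math
import random


def hoeffding_bound(n, eps):
    """Right-hand side of Hoeffding's inequality: 2 exp(-2 eps^2 N)."""
    return 2 * math.exp(-2 * eps**2 * n)


def bad_sample_rate(mu, n, eps, trials, rng):
    """Fraction of simulated size-n samples whose orange fraction nu misses mu by more than eps."""
    bad = 0
    for _ in range(trials):
        # Each marble is orange with probability mu, independently of the others.
        orange = sum(rng.random() < mu for _ in range(n))
        nu = orange / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials


rng = random.Random(0)      # fixed seed so the run is repeatable
mu, eps = 0.6, 0.1          # assumed values; in practice mu would be unknown
for n in (10, 100, 1000):
    rate = bad_sample_rate(mu, n, eps, 2000, rng)
    print(f"N={n:5d}  observed rate={rate:.4f}  Hoeffding bound={hoeffding_bound(n, eps):.4f}")
```

For small N the bound exceeds 1 and says nothing; as N grows, both the observed rate and the bound shrink, with the bound staying above the observed rate.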

(14)

Fun Time

Let µ = 0.4. Use Hoeffding’s Inequality

$$\mathbb{P}\big[\,|\nu - \mu| > \epsilon\,\big] \le 2\exp\left(-2\epsilon^2 N\right)$$

to bound the probability that a sample of 10 marbles will have ν ≤ 0.1. What bound do you get?

1 0.67

2 0.40

3 0.33

4 0.05

Reference Answer: 3

Set N = 10 and ε = 0.3 and you get the answer. BTW, 4 is the actual probability and Hoeffding gives only an upper bound to that.
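Both numbers in the reference answer can be reproduced directly. The sketch below computes the Hoeffding bound with N = 10 and ε = 0.3, and the exact binomial probability of drawing at most one orange marble out of 10 when µ = 0.4; the function names are illustrative.

```python
import math


def hoeffding_bound(n, eps):
    """Right-hand side of Hoeffding's inequality: 2 exp(-2 eps^2 N)."""
    return 2 * math.exp(-2 * eps**2 * n)


def prob_at_most_k_orange(n, k, mu):
    """Exact binomial probability of at most k orange marbles in n independent draws."""
    return sum(math.comb(n, i) * mu**i * (1 - mu) ** (n - i) for i in range(k + 1))


bound = hoeffding_bound(10, 0.3)             # choice 3, roughly 0.33
exact = prob_at_most_k_orange(10, 1, 0.4)    # choice 4, roughly 0.05

print(f"Hoeffding bound: {bound:.4f}")
print(f"exact P[nu <= 0.1]: {exact:.4f}")
```

The gap between the two values shows how loose the bound can be for a small sample; its advantage is that it needs no knowledge of µ.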

(16)

Feasibility of Learning Connection to Learning

Connection to Learning

bin

• unknown orange prob. µ

• marble • ∈ bin

• orange •

• green •

• size-N sample from bin of i.i.d. marbles

learning

• fixed hypothesis h(x) =? target f(x)

• x ∈ X

• h is wrong ⇔ h(x) ≠ f(x)

• h is right ⇔ h(x) = f(x)

• check h on D = {(x_n, y_n)}, where y_n = f(x_n), with i.i.d. x_n

if large N & i.i.d. x_n, can probably infer unknown ⟦h(x) ≠ f(x)⟧ probability by known ⟦h(x_n) ≠ y_n⟧ fraction

[Figure: X drawn as a bin; orange points are where h(x) ≠ f(x), green points are where h(x) = f(x)]

(17)

Added Components

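[Diagram: the learning flow, with the probability distribution P added]

• unknown target function f : X → Y (ideal credit approval formula)

• training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

• learning algorithm A

• final hypothesis g ≈ f (‘learned’ formula to be used)

• hypothesis set H (set of candidate formulas)

• added: unknown P on X, generating x_1, x_2, · · · , x_N and the test point x

• added: fixed h, asking h ≈ f ?

for any fixed h, can probably infer unknown

$$E_{\text{out}}(h) = \mathop{\mathbb{E}}_{x \sim P} [\![ h(x) \neq f(x) ]\!]$$

by known

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} [\![ h(x_n) \neq y_n ]\!]$$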

(18)

The Formal Guarantee

for any fixed h, in ‘big’ data (N large), in-sample error $E_{\text{in}}(h)$ is probably close to out-of-sample error $E_{\text{out}}(h)$ (within ε)

$$\mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \le 2\exp\left(-2\epsilon^2 N\right)$$

same as the ‘bin’ analogy . . .

• valid for all N and ε

• does not depend on $E_{\text{out}}(h)$, no need to ‘know’ $E_{\text{out}}(h)$ (f and P can stay unknown)

• ‘$E_{\text{in}}(h) = E_{\text{out}}(h)$’ is probably approximately correct (PAC)

if ‘$E_{\text{in}}(h) \approx E_{\text{out}}(h)$’ and ‘$E_{\text{in}}(h)$ small’ =⇒ $E_{\text{out}}(h)$ small =⇒ h ≈ f with respect to P
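To make the two error measures concrete, here is a small sketch on a toy problem. Everything in it is assumed for illustration: X is the integers 0 to 99 with P uniform, and the target `f` and the fixed hypothesis `h` are made-up threshold rules. Because X is finite and f is known in this toy setting, the out-of-sample error can be computed exactly and compared with the in-sample error on one random sample.

```python
import random

X = range(100)      # toy input space; P is assumed uniform over it


def f(x):
    """Hypothetical target, known here only so that E_out can be computed."""
    return 1 if x >= 50 else -1


def h(x):
    """One fixed hypothesis, chosen before looking at any data."""
    return 1 if x >= 60 else -1


def e_out():
    """Exact out-of-sample error under the uniform P: the fraction of X where h differs from f."""
    return sum(h(x) != f(x) for x in X) / len(X)


def e_in(sample):
    """In-sample error: the fraction of sampled points where h differs from f."""
    return sum(h(x) != f(x) for x in sample) / len(sample)


rng = random.Random(0)
sample = [rng.randrange(100) for _ in range(500)]   # i.i.d. draws from P

print(f"E_out(h) = {e_out():.3f}")
print(f"E_in(h)  = {e_in(sample):.3f}")
```

In real learning f and P are unknown, so only the second number is available; the guarantee above is what justifies using it as a stand-in for the first.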

(19)

Verification of One h

for any fixed h, when data large enough, $E_{\text{in}}(h) \approx E_{\text{out}}(h)$

Can we claim ‘good learning’ (g ≈ f)?

Yes! if $E_{\text{in}}(h)$ small for the fixed h and A picks the h as g =⇒ ‘g = f’ PAC

No! if A forced to pick THE h as g =⇒ $E_{\text{in}}(h)$ almost always not small =⇒ ‘g ≠ f’ PAC!

real learning: A shall make choices ∈ H (like PLA) rather than being forced to pick one h. :-(

(20)

The ‘Verification’ Flow

[Diagram: the ‘verification’ flow]

• unknown target function f : X → Y (ideal credit approval formula)

• unknown P on X, generating x_1, x_2, · · · , x_N and the test point x

• verifying examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

• one hypothesis h (one candidate formula)

• final hypothesis g ≈ f (given formula to be verified), g = h

can now use ‘historical records’ (data) to verify ‘one candidate formula’ h
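Verification can be sketched as a short routine: measure the in-sample error of the one given h, then invert Hoeffding’s bound to get a tolerance ε that holds with probability at least 1 − δ. Setting 2 exp(−2ε²N) = δ gives ε = sqrt(ln(2/δ) / (2N)). The function name `verify`, the confidence parameter `delta`, and the toy records are assumptions for illustration.

```python
import math


def verify(h, examples, delta):
    """Return (E_in, eps) such that |E_in - E_out| <= eps with probability >= 1 - delta.

    Valid only for a hypothesis h fixed before the examples were drawn,
    and for examples drawn i.i.d. from the same P used at test time.
    """
    n = len(examples)
    e_in = sum(h(x) != y for x, y in examples) / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return e_in, eps


# Toy records: 100 examples, of which the constant hypothesis below gets 80 right.
examples = [(x, 1 if x < 80 else -1) for x in range(100)]
always_approve = lambda x: 1

e_in, eps = verify(always_approve, examples, delta=0.05)
print(f"E_in = {e_in:.2f}, so E_out is within {eps:.3f} of it with probability >= 0.95")
```

With only 100 records the tolerance is wide, roughly 0.14, which is why a modest in-sample error still leaves real uncertainty about the out-of-sample error.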

(21)

Fun Time

Your friend tells you her secret rule in investing in a particular stock: ‘Whenever the stock goes down in the morning, it will go up in the afternoon; vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that you can get from the verification?

1 You’ll definitely be rich by exploiting the rule in the next 100 days.

2 You’ll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.

3 You’ll likely be rich by exploiting the ‘best rule’ from 20 more friends in the next 100 days.

4 You’d definitely have been rich if you had exploited the rule in the past 10 years.

Reference Answer: 2

1: no free lunch; 3: no ‘learning’ guarantee in verification; 4: verifying with only 100 days, possible that the rule is mostly wrong for whole 10 years.

(23)

Feasibility of Learning Connection to Real Learning

Multiple h

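[Figure: M bins, one per hypothesis $h_1, h_2, \ldots, h_M$; each bin has its own out-of-sample error $E_{\text{out}}(h_1), E_{\text{out}}(h_2), \ldots, E_{\text{out}}(h_M)$ and each sample its own in-sample error $E_{\text{in}}(h_1), E_{\text{in}}(h_2), \ldots, E_{\text{in}}(h_M)$]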

(24)

Coin Game

[Figure: many bins, one per coin]

Q: if everyone in size-150 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin ‘g’. Is ‘g’ really magical?

A: No. Even if all coins are fair, the probability that one of the coins results in 5 heads is $1 - \left(\frac{31}{32}\right)^{150} > 99\%$.

BAD sample: $E_{\text{in}}$ and $E_{\text{out}}$ far away (can get worse when involving ‘choice’)
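The coin-game probability is a one-line computation. The sketch below assumes fair, independent coins, as in the slide; the function name is illustrative.

```python
def prob_some_coin_all_heads(num_coins, flips):
    """Probability that at least one of num_coins fair coins shows heads on every one of its flips."""
    p_one_coin = 0.5**flips                     # 1/32 for 5 flips
    return 1 - (1 - p_one_coin) ** num_coins    # complement of "no coin is all heads"


p = prob_some_coin_all_heads(150, 5)
print(f"{p:.4f}")
```

The value is above 0.99, so seeing one ‘perfect’ coin among 150 is nearly certain and says nothing about that coin being special.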

(25)

BAD Sample and BAD Data

BAD Sample

e.g., $E_{\text{out}} = \frac{1}{2}$, but getting all heads ($E_{\text{in}} = 0$)!

BAD Data for One h

$E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away: e.g., $E_{\text{out}}$ big (far from f), but $E_{\text{in}}$ small (correct on most examples)

datasets: $D_1, D_2, \ldots, D_{1126}, \ldots, D_{5678}, \ldots$

h    BAD BAD    Hoeffding: $\mathbb{P}_D[\text{BAD } D \text{ for } h] \le \ldots$

Hoeffding: small

BAD Data for Many h

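BAD data for many h

⇐⇒ no ‘freedom of choice’ by A

⇐⇒ there exists some h such that $E_{\text{out}}(h)$ and $E_{\text{in}}(h)$ far away

datasets: $D_1, D_2, \ldots, D_{1126}, \ldots, D_{5678}$

h_1    BAD BAD        Hoeffding: $\mathbb{P}_D[\text{BAD } D \text{ for } h_1] \le \ldots$

h_2    BAD            Hoeffding: $\mathbb{P}_D[\text{BAD } D \text{ for } h_2] \le \ldots$

h_3    BAD BAD BAD    Hoeffding: $\mathbb{P}_D[\text{BAD } D \text{ for } h_3] \le \ldots$

. . .

h_M    BAD BAD        Hoeffding: $\mathbb{P}_D[\text{BAD } D \text{ for } h_M] \le \ldots$

all    BAD BAD BAD    ?

for M hypotheses, bound of $\mathbb{P}_D[\text{BAD } D]$?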

(27)

Bound of BAD Data

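$$
\begin{aligned}
\mathbb{P}_D[\text{BAD } D]
&= \mathbb{P}_D[\text{BAD } D \text{ for } h_1 \text{ or BAD } D \text{ for } h_2 \text{ or } \ldots \text{ or BAD } D \text{ for } h_M] \\
&\le \mathbb{P}_D[\text{BAD } D \text{ for } h_1] + \mathbb{P}_D[\text{BAD } D \text{ for } h_2] + \ldots + \mathbb{P}_D[\text{BAD } D \text{ for } h_M] \quad \text{(union bound)} \\
&\le 2\exp\left(-2\epsilon^2 N\right) + 2\exp\left(-2\epsilon^2 N\right) + \ldots + 2\exp\left(-2\epsilon^2 N\right) \\
&= 2M\exp\left(-2\epsilon^2 N\right)
\end{aligned}
$$

• finite-bin version of Hoeffding, valid for all M, N and ε

• does not depend on any $E_{\text{out}}(h_m)$, no need to ‘know’ $E_{\text{out}}(h_m)$ (f and P can stay unknown)

• ‘$E_{\text{in}}(g) = E_{\text{out}}(g)$’ is PAC, regardless of A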
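The finite-bin bound can be turned into a sample-size estimate: requiring 2M exp(−2ε²N) ≤ δ gives N ≥ ln(2M/δ) / (2ε²). The following sketch assumes a confidence parameter δ, which the slides do not name, and uses illustrative function names.

```python
import math


def finite_bin_bound(m, n, eps):
    """Union-bound version of Hoeffding for M hypotheses: 2 M exp(-2 eps^2 N)."""
    return 2 * m * math.exp(-2 * eps**2 * n)


def samples_needed(m, eps, delta):
    """Smallest N for which the finite-bin bound is at most delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps**2))


for m in (1, 100, 10_000):
    n = samples_needed(m, eps=0.1, delta=0.05)
    print(f"M={m:6d}  N needed={n:4d}  bound at that N={finite_bin_bound(m, n, 0.1):.4f}")
```

The required N grows only logarithmically with M, so a much larger finite hypothesis set costs relatively few extra examples; the bound still becomes vacuous as M goes to infinity.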

(28)

The ‘Statistical’ Learning Flow

if |H| = M finite, N large enough, for whatever g picked by A, $E_{\text{out}}(g) \approx E_{\text{in}}(g)$

if A finds one g with $E_{\text{in}}(g) \approx 0$, PAC guarantee for $E_{\text{out}}(g) \approx 0$ =⇒ learning possible :-)

[Diagram: the ‘statistical’ learning flow]

• unknown target function f : X → Y (ideal credit approval formula)

• unknown P on X, generating x_1, x_2, · · · , x_N and the test point x

• training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

• learning algorithm A

• hypothesis set H (set of candidate formulas)

• final hypothesis g ≈ f (‘learned’ formula to be used)

M = ∞? (like perceptrons): see you in the next lectures

(29)

Fun Time

Consider 4 hypotheses.

$h_1(x) = \text{sign}(x_1)$, $h_2(x) = \text{sign}(x_2)$, $h_3(x) = \text{sign}(-x_1)$, $h_4(x) = \text{sign}(-x_2)$.

For any N and ε, which of the following statements is not true?

1 the BAD data of $h_1$ and the BAD data of $h_2$ are exactly the same

2 the BAD data of $h_1$ and the BAD data of $h_3$ are exactly the same

3 $\mathbb{P}_D[\text{BAD for some } h_k] \le 8\exp\left(-2\epsilon^2 N\right)$

4 $\mathbb{P}_D[\text{BAD for some } h_k] \le 4\exp\left(-2\epsilon^2 N\right)$

Reference Answer: 1

The important thing is to note that 2 is true, which implies that 4 is true if you revisit the union bound. Similar ideas will be used to conquer the M = ∞ case.
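Why h_1 and h_3 share their BAD data can be checked numerically: h_3 always predicts the opposite of h_1 (when x_1 ≠ 0), so both of its error rates are one minus those of h_1 and the gap |E_in − E_out| is identical. The sketch below is illustrative only; the target `f`, the finite population standing in for P, and the sample size are all assumptions.

```python
import random


def sign(v):
    return 1 if v > 0 else -1


def h1(x):
    return sign(x[0])


def h3(x):
    return sign(-x[0])


def f(x):
    """Hypothetical target; any +1/-1 valued function would do."""
    return sign(x[0] + 0.5 * x[1])


def error_rate(h, points):
    """Fraction of points on which h disagrees with f."""
    return sum(h(x) != f(x) for x in points) / len(points)


rng = random.Random(0)


def nonzero_coordinate():
    # Keep coordinates away from 0 so that h3(x) = -h1(x) on every point.
    return rng.choice([-1, 1]) * rng.uniform(0.1, 1.0)


# A finite population stands in for the distribution P, so E_out is computable here.
population = [(nonzero_coordinate(), nonzero_coordinate()) for _ in range(2000)]
sample = rng.choices(population, k=50)      # i.i.d. draws from the population

gap_h1 = abs(error_rate(h1, sample) - error_rate(h1, population))
gap_h3 = abs(error_rate(h3, sample) - error_rate(h3, population))
print(gap_h1, gap_h3)
```

Because the two gaps agree on every dataset, the BAD events of h_1 and h_3 coincide (and likewise for h_2 and h_4), so the union only needs to count two distinct events, which is where the factor 4 = 2 · 2 in choice 4 comes from.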

(31)

Summary

1 When Can Machines Learn?

Machine Learning Foundations (ᘤ9M)

Machine Learning Foundations ( 機器學習基石)

Lecture 4: Feasibility of Learning

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

Roadmap

1 When

Lecture 3: Types of Learning

binary classification

regression

batch

supervised

concrete

Lecture 4: Feasibility of Learning

Learning is Impossible?

Probability to the Rescue Connection to Learning Connection to Real Learning

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

A Learning Puzzle

y

= −1

y

= +1

g(x) = ?

Two Controversial Answers

whatever you say about g(x),

y n = −1

y n = +1

g(x) = ?

truth f (x) = +1 because . . .

•

•

truth f (x) = −1 because . . .

•

•

adversarial teacher

:-(

A ‘Simple’ Binary Classification Problem

x

y

= f (x

)

0 0 0 ◦

0 0 1 ×

0 1 0 ×

0 1 1 ◦

1 0 0 ×

•

3

◦, ×

n

n

does g ≈ f ?

No Free Lunch

D

x y g f

f

f

f

f

f

f

f

0 0 0 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦

0 0 1 × × × × × × × × × ×

0 1 0 × × × × × × × × × ×

0 1 1 ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦

1 0 0 × × × × × × × × × ×

1 0 1 ? ◦ ◦ ◦ ◦ × × × ×

1 1 0 ? ◦ ◦ × × ◦ ◦ × ×

1 1 1 ? ◦ × ◦ × ◦ × ◦ ×

•

•

No!

any ‘unknown’ f can happen. :-(

Fun Time

This is a popular ‘brain-storming’ problem, with a claim that 2%

Machine Learning Foundations (ᘤ9M)

³

_n

x ₁

₁

₁

₁

₁

₁

x ₁

₁

₁

₁

₁

₁