Academic year: 2022

Machine Learning Foundations ( 機器學習基石)

Lecture 4: Feasibility of Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

(2)

Feasibility of Learning

Roadmap

1 When Can Machines Learn?

Lecture 3: Types of Learning
focus: binary classification or regression from a batch of supervised data with concrete features

Lecture 4: Feasibility of Learning
• Learning is Impossible?
• Probability to the Rescue
• Connection to Learning
• Connection to Real Learning

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

Feasibility of Learning Learning is Impossible?

A Learning Puzzle

y_n = −1   vs.   y_n = +1:   g(x) = ?

let's test your 'human learning' with 6 examples :-)

(4)

Feasibility of Learning Learning is Impossible?

Two Controversial Answers

whatever you say about g(x), . . .

truth f(x) = +1 because . . .
• symmetry ⇔ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because . . .
• left-top black ⇔ −1
• (middle column contains at most 1 black) and (right-top white) ⇔ −1

all valid reasons; your adversarial teacher can always call you 'didn't learn'. :-(

(5)

Feasibility of Learning Learning is Impossible?

A ‘Simple’ Binary Classification Problem

x_n      y_n = f(x_n)
0 0 0    ◦
0 0 1    ×
0 1 0    ×
0 1 1    ◦
1 0 0    ×

X = {0, 1}³, Y = {◦, ×}; can enumerate all candidate f as H

pick g ∈ H with all g(x_n) = y_n (like PLA); does g ≈ f?

(6)

Feasibility of Learning Learning is Impossible?

No Free Lunch

x        y    g   f1  f2  f3  f4  f5  f6  f7  f8
0 0 0    ◦    ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
0 0 1    ×    ×   ×   ×   ×   ×   ×   ×   ×   ×
0 1 0    ×    ×   ×   ×   ×   ×   ×   ×   ×   ×
0 1 1    ◦    ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
1 0 0    ×    ×   ×   ×   ×   ×   ×   ×   ×   ×
1 0 1         ?   ◦   ◦   ◦   ◦   ×   ×   ×   ×
1 1 0         ?   ◦   ◦   ×   ×   ◦   ◦   ×   ×
1 1 1         ?   ◦   ×   ◦   ×   ◦   ×   ◦   ×

(the first five rows form D; the last three are outside D)

g ≈ f inside D: sure!
g ≈ f outside D: No! (but that's really what we want!)

learning from D (to infer something outside D) is doomed if any 'unknown' f can happen. :-(
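To see the no-free-lunch table concretely, here is a small Python sketch (not part of the original slides) that enumerates all 2⁸ candidate targets on X = {0, 1}³ and keeps those consistent with the five examples in D; exactly the eight targets f1, . . . , f8 above survive, and together they realize every possible labeling of the three unseen inputs.

```python
from itertools import product

# All 8 inputs in X = {0,1}^3; the first 5 form the dataset D from the slide.
X = list(product([0, 1], repeat=3))
D = dict(zip(X[:5], ['o', 'x', 'x', 'o', 'x']))  # o = circle, x = cross

# Enumerate all 2^8 candidate targets f : X -> {o, x}.
candidates = [dict(zip(X, labels)) for labels in product('ox', repeat=8)]

# Keep only those consistent with D (f(x_n) = y_n on every training example).
consistent = [f for f in candidates if all(f[x] == y for x, y in D.items())]

print(len(consistent))  # 8 targets remain, matching f1..f8 in the table

# The survivors disagree arbitrarily outside D: all 2^3 labelings occur.
outside = {tuple(f[x] for x in X[5:]) for f in consistent}
print(len(outside))  # 8
```

Since every labeling of the unseen inputs is still possible, no choice of g can be guaranteed to match f outside D.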

(7)

Feasibility of Learning Learning is Impossible?

Fun Time

This is a popular ‘brain-storming’ problem, with a claim that 2%

of the world’s cleverest population can crack its ‘hidden pattern’.

(5, 3, 2) → 151022,  (7, 2, 5) → ?

It is like a 'learning problem' with N = 1: x_1 = (5, 3, 2), y_1 = 151022. Learn a hypothesis from the one example to predict on x = (7, 2, 5). What is your answer?

1  151026
2  143547
3  I need more examples to get the correct answer
4  there is no 'correct' answer

Reference Answer: 4

Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this 'adversarial' setting. BTW, 2 is the designer's answer: the first two digits = x_1 · x_2; the next two digits = x_1 · x_3; the last two digits = (x_1 · x_2 + x_1 · x_3 − x_2).
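As a sanity check on the designer's answer, the stated rule can be coded directly; designer_rule is a hypothetical name for this reconstruction, not something from the slides.

```python
def designer_rule(x1, x2, x3):
    """Hypothetical reconstruction of the puzzle designer's 'hidden pattern':
    two digits for x1*x2, two for x1*x3, two for x1*x2 + x1*x3 - x2."""
    return f"{x1 * x2:02d}{x1 * x3:02d}{x1 * x2 + x1 * x3 - x2:02d}"

print(designer_rule(5, 3, 2))  # 151022, the given example
print(designer_rule(7, 2, 5))  # 143547, option 2
```

Of course, the point of the answer is that infinitely many other rules also fit the single example, so this one is in no sense 'correct'.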

(8)

Feasibility of Learning Probability to the Rescue

Inferring Something Unknown

difficult to infer unknown target f outside D in learning; can we infer something unknown in other scenarios?

consider a bin of many, many orange and green marbles

• do we know the orange portion (probability)? No!
• can you infer the orange probability?

(9)

Feasibility of Learning Probability to the Rescue

Statistics 101: Inferring Orange Probability

bin: assume orange probability = µ, green probability = 1 − µ, with µ unknown

sample: N marbles sampled independently, with orange fraction = ν, green fraction = 1 − ν; now ν known

does in-sample ν say anything about out-of-sample µ?

(10)

Feasibility of Learning Probability to the Rescue

Possible versus Probable

does in-sample ν say anything about out-of-sample µ?

No! possibly not: the sample can be mostly green while the bin is mostly orange
Yes! probably yes: in-sample ν is likely close to unknown µ

formally, what does ν say about µ?

(11)

Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (1/2)

µ = orange probability in bin (unknown)
ν = orange fraction in sample of size N (known)

in a big sample (N large), ν is probably close to µ (within ε):

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

called Hoeffding's Inequality; holds for marbles, coins, polling, . . .
the statement 'ν = µ' is probably approximately correct (PAC)

(12)

Feasibility of Learning Probability to the Rescue

Hoeffding’s Inequality (2/2)

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

• valid for all N and ε
• does not depend on µ: no need to 'know' µ
• larger sample size N or looser gap ε =⇒ higher probability for 'ν ≈ µ'

if N is large, can probably infer unknown µ by known ν

(13)

Feasibility of Learning Probability to the Rescue

Fun Time

Let µ = 0.4. Use Hoeffding's Inequality

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

to bound the probability that a sample of 10 marbles will have ν ≤ 0.1. What bound do you get?

1  0.67
2  0.40
3  0.33
4  0.05

Reference Answer: 3

Set N = 10 and ε = 0.3 and you get the answer. BTW, 4 is the actual probability, and Hoeffding gives only an upper bound to that.
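The numbers behind both options can be checked in a few lines of Python (a sketch, not part of the slides): the Hoeffding bound with ε = 0.3 and N = 10, and the exact binomial probability of drawing at most one orange marble (which is what ν ≤ 0.1 means here).

```python
from math import comb, exp

N, mu, eps = 10, 0.4, 0.3  # nu <= 0.1 means |nu - mu| >= 0.3

# Hoeffding upper bound on P[|nu - mu| > eps]
bound = 2 * exp(-2 * eps**2 * N)
print(round(bound, 2))  # 0.33, option 3

# Exact probability that at most 1 of the 10 marbles is orange
exact = sum(comb(N, k) * mu**k * (1 - mu)**(N - k) for k in range(2))
print(round(exact, 2))  # 0.05, option 4: Hoeffding is only an upper bound
```

Note the bound also covers the other tail (ν ≥ 0.7), which is why it is so much looser than the exact one-sided probability.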

(14)

Feasibility of Learning Connection to Learning

Connection to Learning

bin:
• unknown orange probability µ
• marble ∈ bin: orange or green
• size-N sample from bin of i.i.d. marbles

learning:
• fixed hypothesis h(x) =? target f(x)
• for x ∈ X: h is wrong ⇔ h(x) ≠ f(x) (orange); h is right ⇔ h(x) = f(x) (green)
• check h on D = {(x_n, y_n)}, where y_n = f(x_n), with i.i.d. x_n

if large N & i.i.d. x_n, can probably infer unknown ⟦h(x) ≠ f(x)⟧ probability by known ⟦h(x_n) ≠ y_n⟧ fraction

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/26

(15)

Feasibility of Learning Connection to Learning

Added Components

unknown target function f : X → Y (ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A

final hypothesis g ≈ f ('learned' formula to be used)

hypothesis set H (set of candidate formulas)

unknown P on X, generating x_1, x_2, · · · , x_N (and the test point x)

for any fixed h, can probably infer

unknown E_out(h) = E_{x∼P} ⟦h(x) ≠ f(x)⟧

by

known E_in(h) = (1/N) Σ_{n=1}^{N} ⟦h(x_n) ≠ y_n⟧.
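E_in(h) is just the average of pointwise disagreements; a minimal sketch (the toy data and the hypothesis below are hypothetical, for illustration only):

```python
import numpy as np

def in_sample_error(h, X, y):
    """E_in(h) = (1/N) * sum over n of [[ h(x_n) != y_n ]]."""
    return float(np.mean([h(x) != yn for x, yn in zip(X, y)]))

# toy data: labels mostly follow the sign of the first coordinate
X = [np.array(v) for v in [(1.0, 2.0), (-1.0, 0.5), (2.0, -3.0), (-2.0, 1.0)]]
y = [1, -1, 1, 1]                      # the last label breaks the rule
h = lambda x: 1 if x[0] > 0 else -1    # one fixed hypothesis
print(in_sample_error(h, X, y))        # 0.25: h errs on one of four examples
```

The point of the slide is that this easily computed average is probably close to the uncomputable E_out(h).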

(16)

Feasibility of Learning Connection to Learning

The Formal Guarantee

for any fixed h, in 'big' data (N large), in-sample error E_in(h) is probably close to out-of-sample error E_out(h) (within ε):

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 exp(−2ε²N)

same as the 'bin' analogy . . .
• valid for all N and ε
• does not depend on E_out(h): no need to 'know' E_out(h); f and P can stay unknown

'E_in(h) = E_out(h)' is probably approximately correct (PAC)

=⇒ if 'E_in(h) ≈ E_out(h)' and 'E_in(h) small', then E_out(h) small, and hence h ≈ f with respect to P

(17)

Feasibility of Learning Connection to Learning

Verification of One h

for any fixed h, when data is large enough, E_in(h) ≈ E_out(h)

Can we claim 'good learning' (g ≈ f)?

Yes! if E_in(h) is small for the fixed h and A picks that h as g =⇒ 'g = f' PAC

No! if A is forced to pick THE h as g =⇒ E_in(h) almost always not small =⇒ 'g ≠ f' PAC!

real learning: A shall make choices within H (like PLA) rather than being forced to pick one h. :-(

(18)

Feasibility of Learning Connection to Learning

The ‘Verification’ Flow

unknown target function f : X → Y (ideal credit approval formula)

verifying examples D: (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

final hypothesis g ≈ f (given formula to be verified), with g = h

one hypothesis h (one candidate formula)

unknown P on X, generating x_1, x_2, · · · , x_N (and the test point x)

can now use 'historical records' (data) to verify 'one candidate formula' h

(19)

Feasibility of Learning Connection to Learning

Fun Time

Your friend tells you her secret rule in investing in a particular stock:

‘Whenever the stock goes down in the morning, it will go up in the afternoon;

vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that you can get from the verification?

1

You’ll definitely be rich by exploiting the rule in the next 100 days.

2

You’ll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.

3

You’ll likely be rich by exploiting the ‘best rule’ from 20 more friends in the next 100 days.

4

You’d definitely have been rich if you had exploited the rule in the past 10 years.

Reference Answer: 2

1: no free lunch; 3: no 'learning' guarantee in verification; 4: verifying with only 100 days leaves it possible that the rule does not hold for the whole 10 years.

(20)

Feasibility of Learning Connection to Real Learning

Multiple h

h_1, h_2, . . . , h_M: one 'bin' per hypothesis, with

E_out(h_1), E_out(h_2), . . . , E_out(h_M)
E_in(h_1), E_in(h_2), . . . , E_in(h_M)

real learning (say, like PLA): BINGO when getting a sample of all green marbles (E_in = 0)?

(21)

Feasibility of Learning Connection to Real Learning

Coin Game

Q: if everyone in a size-150 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin 'g', is 'g' really magical?

A: No. Even if all coins are fair, the probability that one of the coins results in 5 heads is 1 − (31/32)^150 > 99%.

BAD sample: E_in and E_out far away; can get worse when involving 'choice'
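The claimed '> 99%' can be verified directly (a sketch, not from the slides):

```python
# Probability that at least one of 150 fair coins shows 5 heads in 5 flips
p_one = (1 / 2) ** 5          # a single coin gets 5 heads: 1/32
p_none = (1 - p_one) ** 150   # no coin gets 5 heads: (31/32)^150
print(1 - p_none)             # ~0.991, i.e. > 99%
```

So an 'all heads' coin is almost guaranteed to appear even though every coin is fair: selection, not magic.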

(22)

Feasibility of Learning Connection to Real Learning

BAD Sample and BAD Data

BAD Sample: e.g., E_out = 1/2, but getting all heads (E_in = 0)!

BAD Data for One h: E_out(h) and E_in(h) far away; e.g., E_out big (far from f), but E_in small (correct on most examples)

     D_1   D_2   . . .   D_1126   . . .   D_5678   . . .   Hoeffding
h    BAD                 BAD                               P_D[BAD D for h] ≤ . . .

Hoeffding: small

P_D[BAD D] = Σ_{all possible D} P(D) · ⟦BAD D⟧

(23)

Feasibility of Learning Connection to Real Learning

BAD Data for Many h

BAD data for many h
⇐⇒ no 'freedom of choice' by A
⇐⇒ there exists some h such that E_out(h) and E_in(h) are far away

      D_1   D_2   . . .   D_1126   . . .   D_5678   Hoeffding
h_1   BAD                 BAD                       P_D[BAD D for h_1] ≤ . . .
h_2                       BAD                       P_D[BAD D for h_2] ≤ . . .
h_3   BAD   BAD           BAD                       P_D[BAD D for h_3] ≤ . . .
. . .
h_M   BAD                 BAD                       P_D[BAD D for h_M] ≤ . . .
all   BAD                 BAD                       ?

for M hypotheses, bound of P_D[BAD D]?

(24)

Feasibility of Learning Connection to Real Learning

Bound of BAD Data

P_D[BAD D]
= P_D[BAD D for h_1 or BAD D for h_2 or . . . or BAD D for h_M]
≤ P_D[BAD D for h_1] + P_D[BAD D for h_2] + . . . + P_D[BAD D for h_M]   (union bound)
≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)
= 2M exp(−2ε²N)

• finite-bin version of Hoeffding, valid for all M, N and ε
• does not depend on any E_out(h_m): no need to 'know' E_out(h_m); f and P can stay unknown
• 'E_in(g) = E_out(g)' is PAC, regardless of A

'most reasonable' A (like PLA/pocket): pick the h_m with lowest E_in(h_m) as g
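Reading the finite-bin bound backwards gives a sample-size estimate: how large N must be so that 2M exp(−2ε²N) ≤ δ. A sketch (sample_complexity is a hypothetical helper name, and the numbers are made up for illustration):

```python
from math import ceil, exp, log

def sample_complexity(M, eps, delta):
    """Smallest N with 2*M*exp(-2*eps^2*N) <= delta (finite-bin Hoeffding)."""
    return ceil(log(2 * M / delta) / (2 * eps**2))

# e.g. M = 100 hypotheses, gap eps = 0.1, confidence 95% (delta = 0.05)
N = sample_complexity(100, 0.1, 0.05)
print(N)  # 415 examples suffice
print(2 * 100 * exp(-2 * 0.1**2 * N) <= 0.05)  # True: bound is met
```

Since M enters only through log(2M/δ), the required N grows slowly with the number of hypotheses; this is why a finite H is manageable.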

(25)

Feasibility of Learning Connection to Real Learning

The ‘Statistical’ Learning Flow

if |H| = M is finite and N is large enough: for whatever g picked by A, E_out(g) ≈ E_in(g)

if A finds one g with E_in(g) ≈ 0: PAC guarantee for E_out(g) ≈ 0 =⇒ learning possible :-)

unknown target function f : X → Y (ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A

final hypothesis g ≈ f ('learned' formula to be used)

hypothesis set H (set of candidate formulas)

unknown P on X, generating x_1, x_2, · · · , x_N (and the test point x)

M = ∞? (like perceptrons): see you in the next lectures

(26)

Feasibility of Learning Connection to Real Learning

Fun Time

Consider 4 hypotheses:

h_1(x) = sign(x_1), h_2(x) = sign(x_2), h_3(x) = sign(−x_1), h_4(x) = sign(−x_2).

For any N and ε, which of the following statements is not true?

1  the BAD data of h_1 and the BAD data of h_2 are exactly the same
2  the BAD data of h_1 and the BAD data of h_3 are exactly the same
3  P_D[BAD for some h_k] ≤ 8 exp(−2ε²N)
4  P_D[BAD for some h_k] ≤ 4 exp(−2ε²N)

Reference Answer: 1

The important thing is to note that 2 is true, which implies that 4 is true if you revisit the union bound. Similar ideas will be used to conquer the M =∞ case.

(27)

Feasibility of Learning Connection to Real Learning

Summary

1 When

Can Machines Learn?

Lecture 3: Types of Learning

Lecture 4: Feasibility of Learning
• Learning is Impossible?: absolutely no free lunch outside D
• Probability to the Rescue: probably approximately correct outside D
• Connection to Learning: verification possible if E_in(h) small for fixed h
• Connection to Real Learning: learning possible if |H| finite and E_in(g) small

2 Why Can Machines Learn?

next: what if |H| = ∞?

3 How Can Machines Learn?

4 How Can Machines Learn Better?
