
# Machine Learning Foundations (機器學習基石)


(1)

### Lecture 4: Feasibility of Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### (Department of Computer Science and Information Engineering, National Taiwan University)

(2)

Feasibility of Learning

### 1 When Can Machines Learn?

(focus so far: binary classification or regression from a batch of supervised data with concrete features)

### 3 How Can Machines Learn?

(3)

Feasibility of Learning: Learning is Impossible?

(figure: a learning puzzle with examples labeled yn = −1 and yn = +1, and a test input with g(x) = ?)

### let's test your 'human learning' with 6 examples :-)

(4)

Feasibility of Learning: Learning is Impossible?

### Two Controversial Answers

(figure: the puzzle again, with examples labeled yn = −1 and yn = +1, and the test input g(x) = ?)

truth f(x) = +1 because . . .

- symmetry ⇔ +1
- (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because . . .

- left-top black ⇔ −1
- middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; whatever you answer, your adversarial 'teacher' can always call you 'didn't learn'. :-(

(10)

Feasibility of Learning: Learning is Impossible?

X = {0, 1}³, Y = {◦, ×}: can enumerate all candidate f as H

pick g ∈ H with all g(xn) = yn (like PLA);

### does g ≈ f ?

(11)

Feasibility of Learning: Learning is Impossible?

(table: the candidate targets f1, f2, . . . , f8, all agreeing with g on D but differing outside it)

- g ≈ f inside D: sure!
- g ≈ f outside D: No! (but that's really what we want!)

learning from D (to infer something outside D) is doomed if

### any 'unknown' f can happen. :-(


(14)

Feasibility of Learning: Learning is Impossible?

Fun Time: a popular 'brain-storming' problem, claimed to be crackable by only a small portion of the world's cleverest population.

(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a 'learning problem' with N = 1, x1 = (5, 3, 2), y1 = 151022. Learn a hypothesis from the one example to predict on x = (7, 2, 5).

1. 151026
2. 143547
3. I need more examples to get the correct answer
4. there is no 'correct' answer

### Reference Answer: 4

Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this 'adversarial' setting. BTW, 2 is the designer's answer: the first two digits = x1 · x2; the next two digits = x1 · x3; the last two digits = (x1 · x2 + x1 · x3 − x2).

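The designer's answer above can be checked mechanically; a minimal sketch (the function name `designer_rule` is ours):

```python
def designer_rule(x1, x2, x3):
    # first two digits: x1*x2; next two: x1*x3;
    # last two: x1*x2 + x1*x3 - x2
    a, b = x1 * x2, x1 * x3
    c = a + b - x2
    return int(f"{a:02d}{b:02d}{c:02d}")

print(designer_rule(5, 3, 2))  # → 151022
print(designer_rule(7, 2, 5))  # → 143547
```

Of course, per the reference answer, this rule is only one of infinitely many hypotheses consistent with the single example.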

(16)

Feasibility of Learning: Probability to the Rescue

### Inferring Something Unknown

difficult to infer the unknown target f in learning; can we infer something unknown in other scenarios?

(figure: a bin of orange and green marbles)

- consider a bin of many, many orange and green marbles
- do we know the orange portion (probability)? No!
- can you infer the orange probability?

Feasibility of Learning Probability to the Rescue

### Inferring Something Unknown

difficult to infer

in learning;

can we infer

in

top

bottom

### •

consider a bin of many many

and

marbles

do we

the

### orange

portion (probability)?

can you

the

### orange

probability?

(18)

Feasibility of Learning Probability to the Rescue

### Inferring Something Unknown

difficult to infer

in learning;

can we infer

in

top

bottom

### •

consider a bin of many many

and

marbles

do we

the

### orange

portion (probability)?

can you

the

### orange

probability?

(19)

Feasibility of Learning: Probability to the Rescue

(figure: the bin and a handful of sampled marbles)

bin: assume orange probability = µ, green probability = 1 − µ, with µ unknown

sample: N marbles sampled independently, with orange fraction = ν, green fraction = 1 − ν, now ν known

does in-sample ν say anything about out-of-sample µ?


(24)

Feasibility of Learning: Probability to the Rescue

does in-sample ν say anything about out-of-sample µ?

No! possibly not: the sample can be mostly green while the bin is mostly orange

Yes! probably yes: in-sample ν is likely close to the unknown µ

formally,

### what does ν say about µ?

Feasibility of Learning Probability to the Rescue

does

### in-sample ν

say anything about out-of-sampleµ?

### No!

possibly not: sample can be mostly

### green

while bin is mostly

### Yes!

probably yes: in-sampleν likely

unknownµ

top

bottom top

bottom

formally,

### what doesν say about µ?

(26)

Feasibility of Learning Probability to the Rescue

does

### in-sample ν

say anything about out-of-sampleµ?

### No!

possibly not: sample can be mostly

### green

while bin is mostly

### orange Yes!

probably yes: in-sampleν likely

unknownµ

top

bottom top

bottom

formally,

### what doesν say about µ?

(27)

Feasibility of Learning Probability to the Rescue

does

### in-sample ν

say anything about out-of-sampleµ?

### No!

possibly not: sample can be mostly

### green

while bin is mostly

### orange Yes!

probably yes: in-sampleν likely

unknownµ

top

bottom top

bottom

formally,

### what doesν say about µ?

(28)

Feasibility of Learning: Probability to the Rescue

µ = orange probability in bin; ν = orange fraction in sample

in a big sample (N large), ν is probably close to µ (within ε):

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

called Hoeffding's Inequality, for marbles, coins, polling, . . .

the statement 'ν = µ' is

### probably approximately correct

(PAC)

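The inequality is easy to check numerically; a quick simulation sketch (the values of µ, N, ε, and the trial count are ours):

```python
import math
import random

random.seed(0)
mu, N, eps, trials = 0.4, 100, 0.1, 20000

# empirically estimate P[|nu - mu| > eps] over many samples of N marbles
bad = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(N)) / N
    bad += abs(nu - mu) > eps

empirical = bad / trials
bound = 2 * math.exp(-2 * eps**2 * N)  # Hoeffding: 2 exp(-2 eps^2 N)
print(empirical, bound)  # empirical stays below the bound (bound ≈ 0.27)
```

As the slides note, the bound is often loose: the empirical deviation probability is typically well under 2 exp(−2ε²N).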

(31)

Feasibility of Learning: Probability to the Rescue

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

valid for all N and ε

- does not depend on µ; no need to 'know' µ
- larger sample size N or looser gap ε =⇒ higher probability for 'ν ≈ µ'

if large N, can probably infer the unknown µ by the known ν


(35)

Feasibility of Learning: Probability to the Rescue

Fun Time: let µ = 0.4. Use Hoeffding's Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1.

1. 0.67
2. 0.40
3. 0.33
4. 0.05

### Reference Answer: 3

Set N = 10 and ε = 0.3 and you get the answer: 2 exp(−2 · 0.3² · 10) ≈ 0.33. BTW, 4 is the actual probability, and Hoeffding gives only an upper bound to that.

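Both numbers in this Fun Time can be computed directly, the Hoeffding bound and the exact binomial probability (a sketch):

```python
import math

N, eps, mu = 10, 0.3, 0.4

# Hoeffding bound on P[|nu - mu| > eps]
bound = 2 * math.exp(-2 * eps**2 * N)

# exact P[nu <= 0.1] = P[at most 1 orange marble], Binomial(10, 0.4)
exact = sum(math.comb(N, k) * mu**k * (1 - mu)**(N - k) for k in (0, 1))

print(round(bound, 2), round(exact, 2))  # → 0.33 0.05
```

The gap between 0.33 and 0.05 is exactly the looseness of the bound mentioned in the reference answer.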

(37)

Feasibility of Learning: Connection to Learning

### Connection to Learning

| bin | learning |
| --- | --- |
| unknown orange prob. µ | fixed hypothesis h(x) vs. target f(x) |
| marble ∈ bin | x ∈ X |
| orange | h is wrong: h(x) ≠ f(x) |
| green | h is right: h(x) = f(x) |
| size-N sample from bin of i.i.d. marbles | check h on D = {(xn, yn)} with i.i.d. xn |

if large N and i.i.d. xn, can probably infer the unknown [h(x) ≠ f(x)] probability by the known [h(xn) ≠ yn] fraction

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/26


(46)

Feasibility of Learning: Connection to Learning

for any fixed h, can probably infer

unknown Eout(h) = E_{x∼P} [h(x) ≠ f(x)]

by known Ein(h) = (1/N) Σ_{n=1}^{N} [h(xn) ≠ yn]
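This "verify one fixed h" view is easy to simulate; a toy sketch (the target f, distribution P, and hypothesis h below are illustrative choices of ours):

```python
import random

random.seed(1)

# toy setup: x uniform on {0,...,9}; target f is parity; h is a fixed threshold rule
f = lambda x: x % 2
h = lambda x: 1 if x >= 5 else 0

# E_out(h): exact error probability under the uniform P
E_out = sum(h(x) != f(x) for x in range(10)) / 10

# E_in(h): error fraction on an i.i.d. sample of size N
N = 10000
sample = [random.randrange(10) for _ in range(N)]
E_in = sum(h(x) != f(x) for x in sample) / N

print(E_out, round(E_in, 2))  # E_in is probably close to E_out = 0.4
```

With large N, E_in(h) tracks E_out(h) even though the "learner" never sees f or P directly, which is exactly the point of the slide.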

(47)

Feasibility of Learning: Connection to Learning

### The Formal Guarantee

for any fixed h, in 'big' data (N large), the in-sample error Ein(h) is probably close to the out-of-sample error Eout(h) (within ε):

P[ |Ein(h) − Eout(h)| > ε ] ≤ 2 exp(−2ε²N)

valid for all N and ε

- does not depend on Eout(h); no need to 'know' Eout(h) (f and P can stay unknown)
- 'Ein(h) = Eout(h)' is probably approximately correct (PAC)

=⇒ if 'Ein(h) ≈ Eout(h)' and 'Ein(h) small' =⇒ Eout(h) small =⇒ h ≈ f with respect to P


(53)

Feasibility of Learning: Connection to Learning

### Verification of One h

for any fixed h, when data is large enough, Ein(h) ≈ Eout(h); can we claim g ≈ f?

Yes! if Ein(h) is small for the fixed h and A picks that h as g =⇒ 'g = f' PAC

No! if A is forced to pick THE h as g =⇒ Ein(h) almost always not small =⇒ 'g ≠ f' PAC!

real learning: A shall make choices ∈ H (like PLA) rather than being forced to pick one h. :-(


(60)

Feasibility of Learning: Connection to Learning

(figure: the verification flow: the data set D = {(xn, yn)} and one fixed hypothesis h feed a verifier, which reports whether Ein(h) is small)

can now use 'historical records' (data) to verify one candidate hypothesis h

(61)

Feasibility of Learning: Connection to Learning

Fun Time: your friend tells you her secret rule in investing. To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that can be made by the verification?

1. You'll definitely be rich by exploiting the rule in the next 100 days.
2. You'll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.
3. You'll likely be rich by exploiting the 'best rule' from 20 more friends in the next 100 days.
4. You'd definitely have been rich if you had exploited the rule in the past 10 years.

(note on choice 4: with only 100 days verified, it is possible that the rule is mostly wrong for the whole 10 years)


### Reference Answer: 2

(63)

Feasibility of Learning: Connection to Real Learning

(figure: many bins, one per hypothesis h1, h2, . . . , hM, each with its own Eout(h1), Eout(h2), . . . , Eout(hM) and Ein(h1), Ein(h2), . . . , Ein(hM))

real learning (say like PLA): when getting an all-green sample •••••••••• for some hypothesis, is that hypothesis necessarily good?


(65)

Feasibility of Learning: Connection to Real Learning

Q: if everyone in a size-150 NTU ML class flips a coin 5 times, and one of the students gets 5 heads for her coin 'g', is 'g' really magical?

A: No. Even if all coins are fair, the probability that one of the coins results in 5 heads is 1 − (31/32)¹⁵⁰ > 99%.

can get worse when involving 'choice'

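The coin-game probability is a one-liner to reproduce (a sketch):

```python
# probability that at least one of 150 fair coins, each flipped 5 times,
# comes up all heads: 1 - (31/32)^150
p_one = 1 / 2**5                 # a single coin gives 5 heads
p_any = 1 - (1 - p_one) ** 150   # at least one of 150 students succeeds
print(p_any > 0.99, round(p_any, 3))  # → True 0.991
```

So a '5-heads' coin is almost guaranteed to appear in a class of 150, with no magic involved.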

(68)

Feasibility of Learning: Connection to Real Learning

### BAD Sample and BAD Data

BAD sample: e.g., Eout = 1/2, but getting all heads (Ein = 0)!

BAD data for one h: Eout(h) and Ein(h) far away; e.g., Eout big (far from f), but Ein small (correct on most examples)

(table: possible data sets D1, D2, . . . , D1126, . . . , D5678, with the BAD ones for h marked)

Hoeffding: small P_D[BAD D] = Σ_{all possible D} P(D) · [BAD D]


(72)

Feasibility of Learning: Connection to Real Learning

### BAD Data for Many h

BAD data for many h ⇐⇒ no 'freedom of choice' by A ⇐⇒ there exists some h such that Eout(h) and Ein(h) are far away

(table: data sets D1, D2, . . . , D1126, . . . , D5678 versus hypotheses h1, h2, h3, . . . , hM, with BAD entries marked; a D is BAD overall if it is BAD for some h)

for M hypotheses, bound of P_D[BAD D]?


(78)

Feasibility of Learning: Connection to Real Learning

P_D[BAD D]
= P_D[BAD D for h1 or BAD D for h2 or . . . or BAD D for hM]
≤ P_D[BAD D for h1] + P_D[BAD D for h2] + . . . + P_D[BAD D for hM]   (union bound)
≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)
= 2M exp(−2ε²N)

- finite-bin version of Hoeffding, valid for all M, N and ε
- does not depend on any Eout(hm); no need to 'know' Eout(hm) (f and P can stay unknown)
- 'Ein(g) = Eout(g)' is PAC, regardless of A

=⇒ 'most reasonable' A (like PLA/pocket): pick the hm with the lowest Ein(hm) as g

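The finite-bin bound 2M exp(−2ε²N) can also be checked against simulation; a sketch (the M random 'hypotheses' are modeled as biased coins with chosen error rates, and each hypothesis's errors are simulated independently for simplicity, all choices being ours):

```python
import math
import random

random.seed(2)
M, N, eps, trials = 20, 200, 0.1, 2000

# M "hypotheses": each behaves like a biased coin with its own E_out
E_out = [random.uniform(0.2, 0.8) for _ in range(M)]

bad = 0
for _ in range(trials):
    # a trial is BAD if |E_in(h_m) - E_out(h_m)| > eps for SOME h_m
    for mu in E_out:
        nu = sum(random.random() < mu for _ in range(N)) / N
        if abs(nu - mu) > eps:
            bad += 1
            break

empirical = bad / trials
bound = 2 * M * math.exp(-2 * eps**2 * N)  # finite-bin Hoeffding
print(empirical <= bound)
```

As with the single-bin case, the union bound is loose, but it is enough to make 'Ein(g) = Eout(g)' PAC for any algorithm choosing among finitely many hypotheses.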
