# Machine Learning Foundations (機器學習基石)


(1)

### Lecture 4: Feasibility of Learning

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

### Dept. of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

(2)

Feasibility of Learning

### 1 When Can Machines Learn?

focus: binary classification or regression from a batch of supervised data with concrete features

### Lecture 4: Feasibility of Learning

### 2 Why Can Machines Learn?

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?

(3)

Feasibility of Learning: Learning is Impossible?

[figure: several 3×3 black-and-white patterns, some labeled yn = −1 and some labeled yn = +1, plus a new pattern x to be classified]

### g(x) = ?

(4)

yn = −1 patterns, yn = +1 patterns: g(x) = ?

• symmetry ⇔ +1

• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

• left-top black ⇔ −1

• middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; your ‘teacher’ can always call you ‘didn’t learn’.

### :-(

(5)

X = {0, 1}³, Y = {◦, ×}, can enumerate all candidate f as H

pick g ∈ H with all g(xn) = yn (like PLA);

### does g ≈ f ?

(6)

[table: all eight inputs x ∈ {0, 1}³; the five inputs in D have known labels yn, while any ‘unknown’ f that agrees with D remains possible on the other three inputs]

• g ≈ f inside D: sure!

• g ≈ f outside D:

### No!

(but that’s really what we want!)

learning from D (to infer something outside D) is doomed if

### any ‘unknown’ f can happen. :-(

(7)

a popular ‘brain-storming’ problem, with a claim that only the cleverest
### of the world’s population can crack its ‘hidden pattern’.

(5, 3, 2) → 151022, (7, 2, 5) → ?

It is like a ‘learning problem’ with N = 1, x1 = (5, 3, 2), y1 = 151022.

Learn a hypothesis from the one example to predict on x = (7, 2, 5).

### 1

151026

### 2

143547

### 3

I need more examples to get the correct answer

### 4

there is no ‘correct’ answer

Reference answer: 4. Following the same nature of the no-free-lunch problems discussed, we cannot hope to be correct under this ‘adversarial’ setting. BTW, 2 is the designer’s answer: the first two digits = x1 · x2; the next two digits = x1 · x3; the last two digits = (x1 · x2 + x1 · x3 − x2).


(9)

Feasibility of Learning: Probability to the Rescue

### Inferring Something Unknown

difficult to infer unknown target f in learning;

can we infer something unknown in other scenarios?

[figure: a bin containing many orange and green marbles]

• consider a bin of many many orange and green marbles

• do we know the orange portion (probability)?

can you infer the orange probability?

(10)

[figure: bin (top) and sample (bottom)]

### bin

assume orange probability = µ, green probability = 1 − µ, with µ unknown

### sample

N marbles sampled independently, with orange fraction = ν, green fraction = 1 − ν, now ν known

does

### in-sample ν

say anything about out-of-sample µ?

(11)

does ν say anything about µ?

### No!

possibly not: sample can be mostly green while bin is mostly orange

### Yes!

probably yes: in-sample ν likely close to unknown µ

formally,

### what does ν say about µ?

(12)

µ = orange probability in bin

ν = orange fraction in sample

in big sample

### (N large),

ν is probably close to µ (within ε):

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

called

### Hoeffding’s Inequality, for marbles, coin, polling,

. . . the statement ‘ν = µ’ is

### probably approximately correct

(PAC)
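The bound is easy to sanity-check by simulation (a minimal Python sketch, not part of the original slides; the values µ = 0.6, ε = 0.1, N = 100 are arbitrary illustrative choices):

```python
import math
import random

def hoeffding_bound(eps, n):
    # upper bound on P[|nu - mu| > eps] for a size-n i.i.d. sample
    return 2 * math.exp(-2 * eps ** 2 * n)

def violation_freq(mu, eps, n, trials=20000, seed=0):
    # empirical frequency of |nu - mu| > eps over many simulated samples
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials

mu, eps, n = 0.6, 0.1, 100
print(hoeffding_bound(eps, n))      # 2*exp(-2) ~ 0.271
print(violation_freq(mu, eps, n))   # empirical frequency, below the bound
```

The empirical violation frequency is typically far below the Hoeffding bound, since the bound must hold uniformly for every possible µ.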

(13)

P[ |ν − µ| > ε ] ≤ 2 exp(−2ε²N)

• valid for all N and ε

• does not depend on µ, no need to ‘know’ µ

• larger sample size N or

### looser gap ε

=⇒ higher probability for ‘ν ≈ µ’

if large N, can

### probably

infer unknown µ by known ν

(14)

let µ = 0.4; use Hoeffding’s Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1.

### 1

0.67

### 2

0.40

### 3

0.33

### 4

0.05

Reference answer: 3. Set N = 10 and ε = 0.3 and you get the answer. BTW, 0.05 is roughly the actual probability, and Hoeffding gives only an upper bound to that.
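Both numbers can be reproduced directly, assuming the quiz setup of µ = 0.4 and the event ν ≤ 0.1 on N = 10 marbles (a small sketch, not from the slides):

```python
import math

N, eps, mu = 10, 0.3, 0.4

# Hoeffding upper bound on P[|nu - mu| > eps]
bound = 2 * math.exp(-2 * eps ** 2 * N)

# exact probability of nu <= 0.1: at most 1 orange marble in 10 draws
exact = sum(math.comb(N, k) * mu ** k * (1 - mu) ** (N - k) for k in (0, 1))

print(round(bound, 2))  # 0.33
print(round(exact, 2))  # 0.05
```

Note that Hoeffding bounds the two-sided event |ν − µ| > ε, which contains the one-sided event ν ≤ 0.1, so the bound applies.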


(16)

Feasibility of Learning: Connection to Learning

• unknown orange prob. µ ←→ fixed hypothesis h(x) ≠ target f(x) probability

• marble ∈ bin ←→ x ∈ X

• orange marble ←→ h is wrong: h(x) ≠ f(x)

• green marble ←→ h is right: h(x) = f(x)

• size-N sample from bin of i.i.d. marbles ←→ check h on D = {(xn, yn)} with i.i.d. xn

if large N & i.i.d. xn, can

### probably

infer unknown ⟦h(x) ≠ f(x)⟧ probability

by known ⟦h(xn) ≠ yn⟧ fraction

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/26

(17)

for any fixed h, can probably infer

unknown Eout(h) = E_{x∼P} ⟦h(x) ≠ f(x)⟧

by known Ein(h) = (1/N) Σ_{n=1}^{N} ⟦h(xn) ≠ yn⟧

(18)

### The Formal Guarantee

for any fixed h, in ‘big’ data

### (N large),

in-sample error Ein(h) is probably close to out-of-sample error Eout(h) (within ε):

P[ |Ein(h) − Eout(h)| > ε ] ≤ 2 exp(−2ε²N)

• valid for all N and ε

• does not depend on Eout(h),

### no need to ‘know’ Eout (h)

—f and P can stay unknown

• ‘Ein(h) = Eout(h)’ is probably approximately correct (PAC)

=⇒ if ‘Ein(h) ≈ Eout(h)’ and ‘Ein(h) small’

=⇒ Eout(h) small =⇒ h ≈ f with respect to P
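The guarantee can be illustrated by simulation: fix one h, compute Ein on a modest sample, and approximate the ‘unknown’ Eout with a much larger one. The target f and hypothesis h below are made-up illustrations on the unit square, not anything from the lecture:

```python
import random

def f(x): return 1 if x[0] + x[1] > 1.0 else -1   # 'unknown' target (made up)
def h(x): return 1 if x[0] > 0.5 else -1          # one fixed hypothesis (made up)

def error_rate(points):
    # fraction of points where h disagrees with f
    return sum(h(x) != f(x) for x in points) / len(points)

rng = random.Random(1)
def draw(n): return [(rng.random(), rng.random()) for _ in range(n)]

E_in = error_rate(draw(200))       # known: computed on the sampled data D
E_out = error_rate(draw(100000))   # 'unknown' in practice; approximated here
print(abs(E_in - E_out))           # small with high probability, per Hoeffding
```

Crucially, h was fixed before the data were drawn; the next slides show why this matters once an algorithm gets to choose among many hypotheses.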

(19)

### Verification of One h

for any fixed h, when data large enough, Ein(h) ≈ Eout(h)

can we claim ‘good learning’ (g ≈ f)?

• Yes! if Ein(h) small for the fixed h, and A picks that h as g

=⇒ ‘g = f ’ PAC

• No! if A is forced to pick this one h as g, then

### Ein (h) almost always not small

=⇒ ‘g ≠ f ’ PAC!

real learning: A shall

### make choices ∈ H

(like PLA) rather than

### being forced to pick one h. :-(

(20)

[flow diagram: unknown target f generates data D = {(x1, y1), . . . , (xN, yN)}; one fixed hypothesis h is checked against D]

can now use ‘historical records’ (data) to

### verify ‘one candidate formula’ h

(21)

Your friend tells you her secret rule in investing in a particular stock: ‘whenever the stock goes down in the morning, it will go up in the afternoon; vice versa.’ To verify the rule, you chose 100 days uniformly at random from the past 10 years of stock data, and found that 80 of them satisfy the rule. What is the best guarantee that you can get from the verification?

### 1

You’ll definitely be rich by exploiting the rule in the next 100 days.

### 2

You’ll likely be rich by exploiting the rule in the next 100 days, if the market behaves similarly to the last 10 years.

### 3

You’ll likely be rich by exploiting the ‘best rule’ from 20 more friends in the next 100 days.

### 4

You’d definitely have been rich if you had exploited the rule in the past 10 years.

Reference answer: 2.

### with only 100 days, possible that the rule is mostly wrong for the whole 10 years.


(23)

Feasibility of Learning: Connection to Real Learning

[figure: many bins, one per hypothesis h1, h2, . . . , hM, each with its own Eout(h1), Eout(h2), . . . , Eout(hM) tracked by Ein(h1), Ein(h2), . . . , Ein(hM)]

(24)

Q: if everyone in a size-150 NTU ML class

### flips a coin 5 times, and one of the students gets 5 heads for her coin ‘g’. Is ‘g’ really magical?

A: No. Even if all coins are fair, the probability that one of the 150 coins results in

### 5 heads

is 1 − (31/32)¹⁵⁰ > 99%.

### —can get worse when involving ‘choice’
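The > 99% claim is a one-line computation (sketch):

```python
# probability that at least one of 150 fair coins comes up 5 heads in 5 flips
p_single = (1 / 2) ** 5                       # one coin: 1/32
p_at_least_one = 1 - (1 - p_single) ** 150    # complement of 'no coin does'
print(p_at_least_one)                         # ~0.9915, indeed > 99%
```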

(25)

BAD sample: e.g., Eout = 1/2, but getting all heads (Ein = 0)!

BAD data for one h: Eout(h) and Ein(h) far away

—e.g., Eout

### big

(far from f ), but Ein

### small

(correct on most examples)

[table: possible data sets D1, D2, . . . , D1126, . . . , D5678, each either BAD or not for h]

Hoeffding: P_D

### [BAD D for h] ≤ 2 exp(−2ε²N),

small

(26)

BAD data for many h

⇐⇒ no ‘freedom of choice’ by A

⇐⇒ there exists some h such that Eout(h) and Ein(h) far away

[table: data sets D1, D2, . . . , D1126, . . . , D5678 versus hypotheses h1, h2, h3, . . . , hM, marking which D is BAD for which h]

for M hypotheses, bound of P

### D

[BAD D]?

(27)

P_D[BAD D]

= P_D[BAD D for h1 or BAD D for h2 or . . . or BAD D for hM]

≤ P_D[BAD D for h1] + P_D[BAD D for h2] + . . . + P_D[BAD D for hM] (union bound)

≤ 2 exp(−2ε²N) + 2 exp(−2ε²N) + . . . + 2 exp(−2ε²N)

= 2M exp(−2ε²N)

### •

finite-bin version of Hoeffding, valid for all M, N and ε

### •

does not depend on any Eout(hm),

### no need to ‘know’ Eout (h m )

—f and P can stay unknown

‘Ein(g) = Eout(g)’ is

### PAC, regardless of A
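One practical use of the finite-bin bound: solve 2M exp(−2ε²N) ≤ δ for the sample size N needed to make BAD data unlikely. A small sketch (the values M = 100, ε = 0.1, δ = 0.05 are arbitrary illustrations):

```python
import math

def bad_data_bound(M, N, eps):
    # finite-bin Hoeffding: P[BAD D] <= 2 M exp(-2 eps^2 N)
    return 2 * M * math.exp(-2 * eps ** 2 * N)

def sample_size_needed(M, eps, delta):
    # smallest integer N with 2 M exp(-2 eps^2 N) <= delta
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

N = sample_size_needed(100, 0.1, 0.05)
print(N)                              # 415
print(bad_data_bound(100, N, 0.1))    # <= 0.05 by construction
```

The required N grows only logarithmically in M, which is why a moderately large (finite) hypothesis set is still learnable.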

(28)

### The ‘Statistical’ Learning Flow

if |H| = M finite, N large enough,

for whatever g picked by A, Eout(g) ≈ E

### in

(g)

if A finds one g with E

### in

(g) ≈ 0,

PAC guarantee for Eout(g) ≈ 0 =⇒ learning possible :-)

[flow diagram: unknown target f and distribution P generate D = {(x1, y1), . . . , (xN, yN)}; algorithm A picks g from hypothesis set H]

### M = ∞? (like perceptrons)

—see you in the next lectures

(29)

Consider four hypotheses over x = (x1, x2):

h1(x) = sign(x1), h2(x) = sign(x2), h3(x) = sign(−x1), h4(x) = sign(−x2).

For any N and ε, which of the following statements is not true?

### 1

the BAD data of h1 and the BAD data of h2 are exactly the same

### 2

the BAD data of h1 and the BAD data of h3 are exactly the same

### 3

P_D[BAD D] ≤ 8 exp(−2ε²N)

### 4

P_D[BAD D] ≤ 4 exp(−2ε²N)

Reference answer: 1. The important thing is to note that 2 is true, which implies that 4 is true if you revisit the union bound. Similar ideas will be used to conquer the M = ∞ case.
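A numerical check of why option 2 holds: h1 and h3 = −h1 disagree with any ±1-valued target on complementary points, so Ein(h3) = 1 − Ein(h1) on every data set (and likewise out of sample), making |Ein − Eout| identical for both and their BAD-data sets coincide. The target f below is an arbitrary assumption for the demo:

```python
import random

def f(x): return 1 if x[0] + x[1] > 0 else -1   # some +-1 target (made up)
def h1(x): return 1 if x[0] > 0 else -1         # sign(x1)
def h3(x): return -h1(x)                        # sign(-x1)

rng = random.Random(7)
pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]

# count mistakes of each hypothesis on the same sample
c1 = sum(h1(x) != f(x) for x in pts)
c3 = sum(h3(x) != f(x) for x in pts)
print(c1 + c3)   # 1000: at every point exactly one of h1, h3 errs
```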


(31)

### 1 When Can Machines Learn?

### Lecture 4: Feasibility of Learning

### • next: what if |H| = ∞?
