Lecture 16: Three Learning Principles

Academic year: 2022


Machine Learning Foundations (機器學習基石)

Lecture 16: Three Learning Principles

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 15: Validation

(crossly) reserve validation data to simulate testing procedure for model selection

Lecture 16: Three Learning Principles

• Occam’s Razor
• Sampling Bias
• Data Snooping
• Power of Three

Three Learning Principles Occam’s Razor

Occam’s Razor

An explanation of the data should be made as simple as possible, but no simpler. (Albert Einstein?, 1879-1955)

entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity). (William of Occam, 1287-1347)

‘Occam’s razor’ for trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons

Occam’s Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:

1 What does it mean for a model to be simple?

2 How do we know that simpler is better?

Simple Model

simple hypothesis $h$

• small $\Omega(h)$ = ‘looks’ simple
• specified by few parameters

simple model $\mathcal{H}$

• small $\Omega(\mathcal{H})$ = not many
• contains small number of hypotheses

connection

$h$ specified by $\ell$ bits $\Leftarrow$ $|\mathcal{H}|$ of size $2^{\ell}$

small $\Omega(h)$ $\Leftarrow$ small $\Omega(\mathcal{H})$

simple: small hypothesis/model complexity

Simple is Better

in addition to the math proof that you have seen, philosophically:

simple $\mathcal{H}$
$\Longrightarrow$ smaller $m_{\mathcal{H}}(N)$
$\Longrightarrow$ less ‘likely’ to fit data perfectly: $\frac{m_{\mathcal{H}}(N)}{2^N}$
$\Longrightarrow$ more significant when fit happens

direct action: linear first; always ask whether the data is over-modeled

Fun Time

Consider the decision stumps in $\mathbb{R}^1$ as the hypothesis set $\mathcal{H}$. Recall that $m_{\mathcal{H}}(N) = 2N$. Consider 10 different inputs $x_1, x_2, \ldots, x_{10}$ coupled with labels $y_n$ generated iid from a fair coin. What is the probability that the data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{10}$ is separable by $\mathcal{H}$?

1 $\frac{1}{1024}$

2 $\frac{10}{1024}$

3 $\frac{20}{1024}$

4 $\frac{100}{1024}$

Reference Answer: 3

Of all 1024 possible $\mathcal{D}$, only $2N = 20$ of them are separable by $\mathcal{H}$.
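The count in the reference answer can be checked by brute force. The sketch below enumerates every label pattern that a 1-D decision stump of the form $h(x) = s \cdot \text{sign}(x - \theta)$ can produce on 10 distinct inputs. The function name `stump_dichotomies` and the particular inputs 0, 1, ..., 9 are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction


def stump_dichotomies(xs):
    """Return the set of label patterns that 1-D decision stumps
    h(x) = s * sign(x - theta) can produce on the distinct inputs xs."""
    xs = sorted(xs)
    # One threshold below all points, one between each pair of neighbours,
    # and one above all points cover every distinct behaviour of a stump.
    thresholds = (
        [xs[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        + [xs[-1] + 1.0]
    )
    patterns = set()
    for theta in thresholds:
        for s in (+1, -1):
            patterns.add(tuple(s if x > theta else -s for x in xs))
    return patterns


inputs = list(range(10))                  # any 10 distinct inputs would do
patterns = stump_dichotomies(inputs)
num_separable = len(patterns)             # dichotomies realizable by H
prob_separable = Fraction(num_separable, 2 ** len(inputs))
print(num_separable, prob_separable)
```

Since each of the $2^{10}$ label patterns is equally likely under fair coin flips, the fraction of realizable patterns is the probability of a perfect fit.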

Three Learning Principles Sampling Bias

Presidential Story

1948 US presidential election: Truman versus Dewey

a newspaper ran a phone poll of how people voted, and set the headline ‘Dewey Defeats Truman’ based on the polling

who is this? :-)

The Big Smile Came from . . .

Truman, and yes he won

suspects for the mistake:

• editorial bug? no
• bad luck of polling ($\delta$)? no

hint: phones were expensive :-)

Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

technical explanation: data from $P_1(\mathbf{x}, y)$ but test under $P_2 \neq P_1$: VC fails

philosophical explanation: study Math hard but test English: no strong test guarantee

‘minor’ VC assumption: data and testing both iid from $P$
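To make the $P_1 \neq P_2$ mismatch concrete, here is a small worked calculation in the spirit of the phone-poll story. All three proportions are hypothetical numbers picked only for illustration; they are not historical data from the 1948 election.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration (not historical data).
phone_share = Fraction(3, 10)            # fraction of voters who own a phone
dewey_given_phone = Fraction(6, 10)      # Dewey support among phone owners
dewey_given_no_phone = Fraction(4, 10)   # Dewey support among everyone else

# P1: a phone poll only ever samples phone owners.
poll_estimate = dewey_given_phone

# P2: the election averages over all voters.
true_support = (phone_share * dewey_given_phone
                + (1 - phone_share) * dewey_given_no_phone)

print(float(poll_estimate), float(true_support))
```

Under these assumed proportions the poll sees a Dewey majority while the full electorate does not, and collecting more phone responses would not close the gap, because the gap comes from the sampling distribution rather than from sample size.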

Sampling Bias in Learning

A True Personal Story

Netflix competition for movie recommender system: 10% improvement = 1M US dollars

formed $\mathcal{D}_{\text{val}}$; in my first shot, $E_{\text{val}}(g)$ showed 13% improvement

why am I still teaching here? :-)

[figure: match movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) with viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?), then add contributions from each factor to get the predicted rating]

validation: random examples within $\mathcal{D}$; test: ‘last’ user records ‘after’ $\mathcal{D}$

Dealing with Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

practical rule of thumb: match test scenario as much as possible

e.g. if test: ‘last’ user records ‘after’ $\mathcal{D}$

• training: emphasize later examples (KDDCup 2011)
• validation: use ‘late’ user records

last puzzle: danger when learning ‘credit card approval’ with existing bank records?
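One way to "use ‘late’ user records" for validation is to hold out the most recent records of each user instead of a random subset. The following is a minimal sketch under assumed conventions: the record layout `(user, timestamp, rating)`, the function name `split_late_records`, and the parameter `k` are all hypothetical.

```python
from collections import defaultdict


def split_late_records(records, k):
    """Hold out the k latest records of every user for validation.

    records: iterable of (user, timestamp, rating) tuples (assumed layout).
    Users with at most k records end up entirely in the validation part.
    """
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[0]].append(rec)

    train, val = [], []
    for recs in by_user.values():
        recs.sort(key=lambda r: r[1])       # oldest first
        cut = max(len(recs) - k, 0)
        train.extend(recs[:cut])            # earlier records
        val.extend(recs[cut:])              # the 'late' records
    return train, val


records = [
    ("a", 1, 5), ("a", 3, 2), ("a", 2, 4),
    ("b", 6, 3), ("b", 5, 1),
]
train, val = split_late_records(records, k=1)
print(train)
print(val)
```

The design choice is that validation now mimics the test scenario (records that come later in time), at the price that the validation set is no longer an iid sample of the whole data.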

Fun Time

If the data $\mathcal{D}$ is an unbiased sample from the underlying distribution $P$ for binary classification, which of the following subsets of $\mathcal{D}$ is also an unbiased sample from $P$?

1 all the positive ($y_n > 0$) examples

2 half of the examples that are randomly and uniformly picked from $\mathcal{D}$ without replacement

3 half of the examples with the smallest $\|\mathbf{x}_n\|$ values

4 the largest subset that is linearly separable

Reference Answer: 2

That’s how we form the validation set, remember? :-)

Three Learning Principles Data Snooping

Visual Data Snooping

Visualize $\mathcal{X} = \mathbb{R}^2$

• full $\Phi_2$: $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, $d_{VC} = 6$
• or $\mathbf{z} = (1, x_1^2, x_2^2)$, $d_{VC} = 3$, after visualizing?
• or better $\mathbf{z} = (1, x_1^2 + x_2^2)$, $d_{VC} = 2$?
• or even better $z = \text{sign}(0.6 - x_1^2 - x_2^2)$?

careful about your brain’s ‘model complexity’

[figure: scatter plot of the data, both axes ranging from −1 to 1]

for VC-safety, $\Phi$ shall be decided without ‘snooping’ data
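The first three candidate transforms can be written down directly. This sketch (function names are illustrative assumptions) shows that the number of components of $\mathbf{z}$ shrinks from 6 to 3 to 2, matching the $d_{VC}$ values quoted for each choice.

```python
def phi2_full(x1, x2):
    """Full second-order transform: 6 components."""
    return (1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2)


def phi_squares(x1, x2):
    """Keeps only the squared terms: 3 components."""
    return (1.0, x1 * x1, x2 * x2)


def phi_radius(x1, x2):
    """Keeps only the squared radius: 2 components."""
    return (1.0, x1 * x1 + x2 * x2)


for phi in (phi2_full, phi_squares, phi_radius):
    z = phi(0.5, -0.5)
    print(phi.__name__, len(z), z)
```

The smaller transforms look cheaper, but if they were picked after looking at the data, the human choice is part of the learning process and its complexity is not counted in these numbers.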

Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

8 years of currency trading data

• first 6 years for training, last 2 years for testing
• x = previous 20 days, y = 21st day
• snooping versus no snooping: superior profit possible

[figure: cumulative profit (%) versus day (0 to 500), one curve for snooping and one for no snooping]

• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
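The difference between the two shift-scale options is only which values the statistics are computed from. A minimal sketch, assuming shift-scale means subtracting a mean and dividing by a standard deviation; the helper name `shift_scale` and the toy numbers are hypothetical.

```python
import statistics


def shift_scale(values, reference):
    """Shift and scale values using the mean and (population) standard
    deviation of the reference values."""
    mean = statistics.mean(reference)
    std = statistics.pstdev(reference)
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]     # hypothetical training values
test = [10.0, 12.0]              # hypothetical testing values

# no snooping: statistics come from the training part only
no_snoop_test = shift_scale(test, reference=train)

# snooping: statistics also see the testing part
snoop_test = shift_scale(test, reference=train + test)

print(no_snoop_test)
print(snoop_test)
```

In the snooping variant, information about the test period leaks into every transformed input, which is how a seemingly harmless preprocessing step can inflate the measured profit.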

Data Snooping by Data Reusing

Research Scenario

benchmark data $\mathcal{D}$

• paper 1: propose $\mathcal{H}_1$ that works well on $\mathcal{D}$
• paper 2: find room for improvement, propose $\mathcal{H}_2$, and publish only if better than $\mathcal{H}_1$ on $\mathcal{D}$
• paper 3: find room for improvement, propose $\mathcal{H}_3$, and publish only if better than $\mathcal{H}_2$ on $\mathcal{D}$
• . . .

• if all papers from the same author in one big paper: bad generalization due to $d_{VC}(\cup_m \mathcal{H}_m)$
• step-wise: later author snooped data by reading earlier papers, bad generalization worsened by publish only if better

if you torture the data long enough, it will confess :-)
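The "publish only if better" effect can be simulated on a benchmark that contains no signal at all. In this sketch (all names and sizes are assumptions for illustration) both the labels and every proposed "method" are coin flips, so no method is truly better than 50% on fresh data, yet the published accuracies on the fixed benchmark keep climbing.

```python
import random


def publish_only_if_better(n_examples=50, n_papers=200, seed=0):
    """Simulate papers evaluated on one fixed, pure-noise benchmark.

    Each paper is a random guesser; it is 'published' only if it beats
    the best benchmark accuracy published so far.
    """
    rng = random.Random(seed)
    labels = [rng.choice((-1, 1)) for _ in range(n_examples)]

    published = []
    best = -1.0
    for _ in range(n_papers):
        guesses = [rng.choice((-1, 1)) for _ in range(n_examples)]
        acc = sum(g == y for g, y in zip(guesses, labels)) / n_examples
        if acc > best:              # publish only if better
            best = acc
            published.append(acc)
    return published


published = publish_only_if_better()
print(published)   # benchmark accuracies of the published papers, in order
```

The increasing sequence reflects selection on the reused benchmark only; it says nothing about out-of-sample performance, which stays at chance level by construction.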

Dealing with Data Snooping

truth: very hard to avoid, unless being extremely honest

• extremely honest: lock your test data in safe
• less honest: reserve validation and use cautiously

• be blind: avoid making modeling decision by data
• be suspicious: interpret research results (including your own) by proper feeling of contamination

one secret to winning KDDCups: careful balance between data-driven modeling (snooping) and validation (no-snooping)

Fun Time

Which of the following can result in unsatisfactory test performance in machine learning?

1 data snooping

2 overfitting

3 sampling bias

4 all of the above

Reference Answer: 4

A professional like you should be aware of those! :-)

Three Learning Principles Power of Three

Three Related Fields

Data Mining

• use (huge) data to find property that is interesting
• difficult to distinguish ML and DM in reality

Artificial Intelligence

• compute something that shows intelligent behavior
• ML is one possible route to realize AI

Statistics

• use data to make inference about an unknown process
• statistics contains many useful tools for ML

Three Theoretical Bounds

Hoeffding

$P[\text{BAD}] \le 2 \exp(-2 \epsilon^2 N)$

• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding

$P[\text{BAD}] \le 2 M \exp(-2 \epsilon^2 N)$

• $M$ hypotheses
• useful for validation

VC

$P[\text{BAD}] \le 4 m_{\mathcal{H}}(2N) \exp(\ldots)$

• all $\mathcal{H}$
• useful for training
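The first two bounds are easy to evaluate numerically. A small sketch, assuming the standard forms $2\exp(-2\epsilon^2 N)$ for one hypothesis and $2M\exp(-2\epsilon^2 N)$ for $M$ hypotheses, with $\epsilon$ the tolerance and $N$ the number of examples; the function names and the sample values are illustrative. The VC bound is left out because its exponent is elided above.

```python
import math


def hoeffding_bound(epsilon, n):
    """Upper bound on P[BAD] for a single hypothesis."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n)


def multi_bin_hoeffding_bound(epsilon, n, m):
    """Union bound over m hypotheses (e.g. m validated candidates)."""
    return m * hoeffding_bound(epsilon, n)


single = hoeffding_bound(0.1, 1000)
multi = multi_bin_hoeffding_bound(0.1, 1000, 100)
print(single, multi)
```

With these sample values both bounds are tiny, which illustrates why testing one hypothesis or validating a modest number of candidates on 1000 held-out examples is considered safe.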

Three Linear Models

[figure, repeated for each model: inputs $x_0, x_1, x_2, \ldots, x_d$ feed the score $s$, which feeds $h(\mathbf{x})$]

PLA/pocket

$h(\mathbf{x}) = \text{sign}(s)$

• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression

$h(\mathbf{x}) = s$

• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression

$h(\mathbf{x}) = \theta(s)$

• plausible err = CE (maximum likelihood)
• minimize iteratively
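All three models share one score and differ only in what they do with it. The sketch below assumes the usual linear score $s = \mathbf{w}^T \mathbf{x}$ with $x_0 = 1$ and the logistic function $\theta(s) = 1 / (1 + e^{-s})$; the function names are illustrative, and mapping $\text{sign}(0)$ to $-1$ is an arbitrary convention chosen here.

```python
import math


def score(w, x):
    """Linear score s = w^T x; x is assumed to include x_0 = 1."""
    return sum(wi * xi for wi, xi in zip(w, x))


def h_pla(w, x):
    """PLA/pocket: h(x) = sign(s), with sign(0) taken as -1 here."""
    return 1 if score(w, x) > 0 else -1


def h_linreg(w, x):
    """Linear regression: h(x) = s."""
    return score(w, x)


def h_logreg(w, x):
    """Logistic regression: h(x) = theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-score(w, x)))


w = [0.5, 1.0, -1.0]     # hypothetical weights
x = [1.0, 2.0, 1.0]      # x_0 = 1, then two features
print(h_pla(w, x), h_linreg(w, x), h_logreg(w, x))
```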

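Three Key Tools

Feature Transform

$E_{\text{in}}(\mathbf{w}) \to E_{\text{in}}(\tilde{\mathbf{w}})$
$d_{VC}(\mathcal{H}) \to d_{VC}(\mathcal{H}_{\Phi})$

• by using more complicated $\Phi$
• lower $E_{\text{in}}$
• higher $d_{VC}$

Regularization

$E_{\text{in}}(\mathbf{w}) \to E_{\text{in}}(\mathbf{w}_{\text{REG}})$
$d_{VC}(\mathcal{H}) \to d_{\text{EFF}}(\mathcal{H}, \mathcal{A})$

• by augmenting regularizer $\Omega$
• lower $d_{\text{EFF}}$
• higher $E_{\text{in}}$

Validation

$E_{\text{in}}(h) \to E_{\text{val}}(h)$
$\mathcal{H} \to \{g_1, \ldots, g_M\}$

• by reserving $K$ examples as $\mathcal{D}_{\text{val}}$
• fewer choices
• fewer examples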
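Reserving $K$ examples as $\mathcal{D}_{\text{val}}$ amounts to a uniform random split without replacement. A minimal sketch; the function name `reserve_validation` and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def reserve_validation(examples, k, seed=0):
    """Randomly reserve k examples for validation; the rest stay for training."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)                  # uniform random order
    val_idx = set(indices[:k])            # first k shuffled indices
    d_val = [ex for i, ex in enumerate(examples) if i in val_idx]
    d_train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    return d_train, d_val


examples = list(range(20))                # stand-in for 20 labelled examples
d_train, d_val = reserve_validation(examples, k=5)
print(len(d_train), len(d_val))
```

The "fewer examples" cost is visible directly: every example moved into the validation part is one that training no longer sees.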

Three Learning Principles

• Occam’s Razor: simple is good
• Sampling Bias: class matches exam
• Data Snooping: honesty is best policy

Three Future Directions

More Transform, More Regularization, Less Label

[word cloud: soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin, prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, coordinate descent, bagging, decision tree, support vector machine, neural network, kernel]

ready for the jungle!

Fun Time

What are the magic numbers that repeatedly appear in this class?

1 3

2 1126

3 both 3 and 1126

4 neither 3 nor 1126

Reference Answer: 3

3 as illustrated, and you may recall 1126 somewhere :-)

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 15: Validation

Lecture 16: Three Learning Principles

• Occam’s Razor: simple, simple, simple!
• Sampling Bias: match test scenario as much as possible
• Data Snooping: any use of data is ‘contamination’
• Power of Three: relatives, bounds, models, tools, principles

next: ready for jungle!
