Lecture 16: Three Learning Principles

Academic year: 2022

(1)

Machine Learning Foundations (機器學習基石)

Lecture 16: Three Learning Principles

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

(2)

Three Learning Principles

Roadmap

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 15: Validation

(crossly) reserve validation data to simulate testing procedure for model selection

Lecture 16: Three Learning Principles

• Occam’s Razor
• Sampling Bias
• Data Snooping
• Power of Three

(3)

Three Learning Principles Occam’s Razor

Occam’s Razor

An explanation of the data should be made as simple as possible, but no simpler. (Albert Einstein?, 1879-1955)

entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity) (William of Occam, 1287-1347)

‘Occam’s razor’ for trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons

(4)

Occam’s Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:

1 What does it mean for a model to be simple?

2 How do we know that simpler is better?

(5)

Simple Model

simple hypothesis h

• small Ω(h) = ‘looks’ simple
• specified by few parameters

simple model H

• small Ω(H) = not many
• contains small number of hypotheses

connection

h specified by $\ell$ bits ⇐ |H| of size $2^{\ell}$

small Ω(h) ⇐ small Ω(H)

simple: small hypothesis/model complexity

(6)

Simple is Better

in addition to the math proof that you have seen, philosophically:

simple $\mathcal{H}$

=⇒ smaller $m_{\mathcal{H}}(N)$

=⇒ less ‘likely’ to fit data perfectly: $\frac{m_{\mathcal{H}}(N)}{2^N}$

=⇒ more significant when fit happens

direct action: linear first; always ask whether data over-modeled
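The ratio $m_{\mathcal{H}}(N)/2^N$ can be made concrete with a small sketch. It assumes, purely for illustration, that the labels are N uniformly random ±1 values and that the hypothesis set is the 1-D positive rays, whose growth function is N + 1. Neither choice comes from this slide, and the function names are hypothetical.

```python
def growth_positive_rays(n: int) -> int:
    """Growth function of 1-D positive rays (an assumed example): m_H(N) = N + 1."""
    return n + 1


def perfect_fit_chance(growth: int, n: int) -> float:
    """Upper bound on the chance that uniformly random labels on N points
    are fit perfectly by a hypothesis set that realizes at most `growth`
    of the 2^N possible labelings."""
    return min(1.0, growth / 2 ** n)


for n in (5, 10, 20):
    simple = perfect_fit_chance(growth_positive_rays(n), n)
    # A set that shatters the N points can realize all 2^N labelings.
    complex_ = perfect_fit_chance(2 ** n, n)
    print(f"N={n}: simple model {simple:.6f}, shattering model {complex_:.1f}")
```

Under these assumptions a perfect fit by the simple model is rare on random labels, so observing one is informative, whereas a model that shatters the points always fits.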

(7)

Fun Time

(8)

Three Learning Principles Sampling Bias

Presidential Story

1948 US presidential election: Truman versus Dewey

a newspaper ran a phone poll of how people voted, and set the headline ‘Dewey Defeats Truman’ based on the polling

who is this? :-)

(9)

The Big Smile Came from . . .

Truman, and yes he won

suspect of the mistake:

• editorial bug? no
• bad luck of polling (δ)? no

hint: phones were expensive :-)

(10)

Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

technical explanation: data from $P_1(\mathbf{x}, y)$ but test under $P_2 \neq P_1$: VC fails

philosophical explanation: study Math hard but test English: no strong test guarantee

‘minor’ VC assumption: data and testing both iid from P
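A minimal simulation sketch of the phone-poll story shows how sampling from $P_1$ (phone owners) misleads about $P_2$ (all voters). Every number below is hypothetical and chosen only for illustration, including the phone-ownership rate and the voting preferences. None of it is historical data.

```python
import random


def simulate_poll(n_voters=100_000, n_polled=5_000, seed=0):
    """Compare the true vote share with a poll restricted to phone owners."""
    rng = random.Random(seed)
    # Hypothetical population parameters (assumptions, not historical facts).
    p_phone = 0.30                  # fraction of voters owning a phone
    p_dewey_given_phone = 0.65      # phone owners lean towards Dewey
    p_dewey_given_no_phone = 0.40   # everyone else leans towards Truman

    voters = []
    for _ in range(n_voters):
        has_phone = rng.random() < p_phone
        p = p_dewey_given_phone if has_phone else p_dewey_given_no_phone
        voters.append((has_phone, rng.random() < p))

    # Target distribution: all voters.
    true_share = sum(vote for _, vote in voters) / n_voters
    # Sampling distribution: phone owners only.
    phone_owners = [vote for has_phone, vote in voters if has_phone]
    polled = rng.sample(phone_owners, n_polled)
    poll_share = sum(polled) / n_polled
    return true_share, poll_share


true_share, poll_share = simulate_poll()
print(f"true Dewey share: {true_share:.3f}, phone-poll Dewey share: {poll_share:.3f}")
```

Polling more phone owners does not help here: the poll converges to the wrong quantity, which is why the Hoeffding-style ‘bad luck’ (δ) explanation is ruled out in the story.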

(11)

Sampling Bias in Learning

A True Personal Story

• Netflix competition for movie recommender system: 10% improvement = 1M US dollars
• formed $\mathcal{D}_{\text{val}}$; in my first shot, $E_{\text{val}}(g)$ showed 13% improvement
• why am I still teaching here? :-)

[figure: match movie and viewer factors, then add contributions from each factor to get the predicted rating; movie factors: comedy content, action content, blockbuster?, Tom Cruise in it?; viewer factors: likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?]

validation: random examples within $\mathcal{D}$; test: ‘last’ user records ‘after’ $\mathcal{D}$

(12)

Dealing with Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

practical rule of thumb: match test scenario as much as possible

e.g. if test: ‘last’ user records ‘after’ $\mathcal{D}$

• training: emphasize later examples (KDDCup 2011)
• validation: use ‘late’ user records

last puzzle: danger when learning ‘credit card approval’ with existing bank records?
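The rule ‘validation: use late user records’ can be sketched as a per-user, time-ordered split. The record layout `(user, timestamp, rating)` and the function name are assumptions made for this illustration, not the format of any actual competition data.

```python
from collections import defaultdict


def split_last_records(records, n_val_per_user=1):
    """Hold out each user's latest records for validation.

    `records` is an iterable of (user, timestamp, rating) tuples
    (a hypothetical layout). Unlike a random split, the validation set
    mimics a test set built from the 'last' records of every user.
    """
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[0]].append(rec)

    train, val = [], []
    for user_records in by_user.values():
        user_records.sort(key=lambda r: r[1])  # oldest first
        cut = max(len(user_records) - n_val_per_user, 0)
        train.extend(user_records[:cut])
        val.extend(user_records[cut:])
    return train, val


records = [
    ("ann", 1, 4), ("ann", 3, 2), ("ann", 2, 5),
    ("bob", 5, 1), ("bob", 1, 3),
]
train, val = split_last_records(records)
print("train:", train)
print("val:  ", val)
```

In this sketch a user with a single record contributes only to validation; whether that is desirable depends on the test scenario being matched.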

(13)

Fun Time

(14)

Three Learning Principles Data Snooping

Visual Data Snooping

Visualize $\mathcal{X} = \mathbb{R}^2$

• full $\Phi_2$: $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, $d_{\text{VC}} = 6$
• or $\mathbf{z} = (1, x_1^2, x_2^2)$, $d_{\text{VC}} = 3$, after visualizing?
• or better $\mathbf{z} = (1, x_1^2 + x_2^2)$, $d_{\text{VC}} = 2$?
• or even better $z = \text{sign}(0.6 - x_1^2 - x_2^2)$?

careful about your brain’s ‘model complexity’

[figure: scatter plot of the data, both axes ranging from −1 to 1]

for VC-safety, Φ shall be decided without ‘snooping’ data
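The candidate transforms above can be written out as a rough sketch. The function names are hypothetical, and the code only counts the components of z, which match the $d_{\text{VC}}$ values quoted on the slide for a linear model in the transformed space.

```python
def phi2_full(x1: float, x2: float) -> tuple:
    """Full second-order transform: 6 components."""
    return (1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2)


def phi_squares(x1: float, x2: float) -> tuple:
    """Squares only: a choice made after looking at the plot."""
    return (1.0, x1 * x1, x2 * x2)


def phi_radius(x1: float, x2: float) -> tuple:
    """Squared radius only: an even more data-dependent choice."""
    return (1.0, x1 * x1 + x2 * x2)


for phi in (phi2_full, phi_squares, phi_radius):
    z = phi(0.5, -0.2)
    print(f"{phi.__name__}: {len(z)} components")
```

The smaller counts look attractive, but they were reached by a human inspecting the data, so the complexity actually paid also includes the search done in the designer's head.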

(15)

Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

8 years of currency trading data

• first 6 years for training, last 2 years for testing
• x = previous 20 days, y = 21st day
• snooping versus no snooping: superior profit possible

[figure: cumulative profit (%) versus day, days 0 to 500, comparing ‘snooping’ with ‘no snooping’]

• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
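The difference between the two shift-scale variants can be sketched as follows. The tiny lists stand in for the currency series and are made-up values, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def fit_shift_scale(values):
    """Estimate the shift (mean) and scale (standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)


def apply_shift_scale(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


# Made-up stand-ins for the training and testing periods.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# No snooping: statistics come from the training period only.
mu_clean, sigma_clean = fit_shift_scale(train)
train_clean = apply_shift_scale(train, mu_clean, sigma_clean)
test_clean = apply_shift_scale(test, mu_clean, sigma_clean)

# Snooping: the test period leaks into the statistics.
mu_snoop, sigma_snoop = fit_shift_scale(train + test)

print("no snooping:", mu_clean, sigma_clean)
print("snooping:   ", mu_snoop, sigma_snoop)
```

In the clean variant the test values never influence anything applied to the training data, so the test period keeps its ability to assess the outcome.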

(16)

Data Snooping by Data Reusing

Research Scenario

benchmark data $\mathcal{D}$

• paper 1: propose $\mathcal{H}_1$ that works well on $\mathcal{D}$
• paper 2: find room for improvement, propose $\mathcal{H}_2$, and publish only if better than $\mathcal{H}_1$ on $\mathcal{D}$
• paper 3: find room for improvement, propose $\mathcal{H}_3$, and publish only if better than $\mathcal{H}_2$ on $\mathcal{D}$
• . . .

• if all papers from the same author in one big paper: bad generalization due to $d_{\text{VC}}(\cup_m \mathcal{H}_m)$
• step-wise: later author snooped data by reading earlier papers; bad generalization worsened by publish only if better

if you torture the data long enough, it will confess :-)
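The ‘publish only if better’ effect can be illustrated with a deliberately extreme simulation sketch. It assumes that labels are coin flips and that every proposed hypothesis is a pure random guess, so no hypothesis has any real skill. The sizes and names below are hypothetical.

```python
import random


def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def publish_only_if_better(n_papers=200, n_bench=50, n_fresh=2_000, seed=0):
    """Return (benchmark accuracy, fresh-data accuracy) of the last 'published' hypothesis."""
    rng = random.Random(seed)

    def coin_flips(n):
        return [rng.choice((-1, 1)) for _ in range(n)]

    bench_labels = coin_flips(n_bench)   # the reused benchmark D
    fresh_labels = coin_flips(n_fresh)   # data nobody has looked at

    best_bench, best_fresh = 0.0, 0.0
    for _ in range(n_papers):
        # Each 'paper' proposes a hypothesis with no real signal.
        bench_preds, fresh_preds = coin_flips(n_bench), coin_flips(n_fresh)
        acc = accuracy(bench_preds, bench_labels)
        if acc > best_bench:             # publish only if better on D
            best_bench = acc
            best_fresh = accuracy(fresh_preds, fresh_labels)
    return best_bench, best_fresh


bench_acc, fresh_acc = publish_only_if_better()
print(f"published accuracy on D: {bench_acc:.2f}, on fresh data: {fresh_acc:.2f}")
```

Under these assumptions the published benchmark accuracy climbs well above 50% while accuracy on fresh data stays near chance, since the selection step searches the union of all proposed hypotheses.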

(17)

Dealing with Data Snooping

• truth: very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in safe
• less honest: reserve validation and use cautiously

• be blind: avoid making modeling decision by data
• be suspicious: interpret research results (including your own) by proper feeling of contamination

one secret to winning KDDCups: careful balance between data-driven modeling (snooping) and validation (no-snooping)

(18)

Fun Time

(19)

Three Learning Principles Power of Three

Three Related Fields

Data Mining

• use (huge) data to find property that is interesting
• difficult to distinguish ML and DM in reality

Artificial Intelligence

• compute something that shows intelligent behavior
• ML is one possible route to realize AI

Statistics

• use data to make inference about an unknown process
• statistics contains many useful tools for ML

(20)

Three Theoretical Bounds

Hoeffding

$P[\text{BAD}] \le 2 \exp(-2 \epsilon^2 N)$

• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding

$P[\text{BAD}] \le 2 M \exp(-2 \epsilon^2 N)$

• M hypotheses
• useful for validation

VC

$P[\text{BAD}] \le 4\, m_{\mathcal{H}}(2N) \exp(. . .)$

• all $\mathcal{H}$
• useful for training
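The first two bounds are easy to evaluate numerically. The sketch below uses hypothetical function names and example values of ε, N, and M. The VC bound is left out because its exponent is elided on the slide.

```python
import math


def hoeffding_bound(eps: float, n: int) -> float:
    """Bound on P[BAD] for one fixed hypothesis: 2 exp(-2 eps^2 N)."""
    return 2.0 * math.exp(-2.0 * eps ** 2 * n)


def multi_bin_hoeffding_bound(eps: float, n: int, m: int) -> float:
    """Bound on P[BAD] when choosing among M hypotheses: 2 M exp(-2 eps^2 N)."""
    return 2.0 * m * math.exp(-2.0 * eps ** 2 * n)


eps = 0.1  # example tolerance (assumed)
for n in (100, 1_000, 10_000):
    single = hoeffding_bound(eps, n)
    among_100 = multi_bin_hoeffding_bound(eps, n, m=100)
    print(f"N={n}: one hypothesis {single:.3e}, 100 hypotheses {among_100:.3e}")
```

A bound above 1 is vacuous; with these example values that happens at N = 100, which shows how many examples even a modest ε demands.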

(21)

Three Linear Models

[figure, repeated for each model: inputs $x_0, x_1, x_2, \ldots, x_d$ feed a score s, which gives h(x)]

PLA/pocket

h(x) = sign(s)

• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression

h(x) = s

• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression

h(x) = θ(s)

• plausible err = CE (maximum likelihood)
• minimize iteratively
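The three models share one score and differ only in what they do with it, as the sketch below shows. It assumes the score is the inner product $s = \mathbf{w}^T \mathbf{x}$ with $x_0 = 1$ and that θ is the logistic function $1/(1+e^{-s})$. Neither is spelled out on this slide, and the weights are arbitrary example values.

```python
import math


def score(w, x):
    """Linear score s = w . x, with x[0] = 1 as the constant coordinate (assumed)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def h_pla(w, x):
    """PLA/pocket output: sign(s); a score of exactly 0 is mapped to -1 here."""
    return 1 if score(w, x) > 0 else -1


def h_linear_regression(w, x):
    """Linear regression output: the score itself."""
    return score(w, x)


def h_logistic_regression(w, x):
    """Logistic regression output: theta(s), assumed to be the logistic function."""
    return 1.0 / (1.0 + math.exp(-score(w, x)))


w = [0.5, -1.0, 2.0]   # arbitrary example weights
x = [1.0, 3.0, 1.0]    # x_0 = 1, then two features

print("PLA/pocket:         ", h_pla(w, x))
print("linear regression:  ", h_linear_regression(w, x))
print("logistic regression:", round(h_logistic_regression(w, x), 4))
```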

(22)

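Three Key Tools

Feature Transform

$E_{\text{in}}(\mathbf{w}) \to E_{\text{in}}(\tilde{\mathbf{w}})$

$d_{\text{VC}}(\mathcal{H}) \to d_{\text{VC}}(\mathcal{H}_{\Phi})$

• by using more complicated Φ
• lower $E_{\text{in}}$
• higher $d_{\text{VC}}$

Regularization

$E_{\text{in}}(\mathbf{w}) \to E_{\text{in}}(\mathbf{w}_{\text{REG}})$

$d_{\text{VC}}(\mathcal{H}) \to d_{\text{EFF}}(\mathcal{H}, \mathcal{A})$

• by augmenting regularizer Ω
• lower $d_{\text{EFF}}$
• higher $E_{\text{in}}$

Validation

$E_{\text{in}}(h) \to E_{\text{val}}(h)$

$\mathcal{H} \to \{g_1, \ldots, g_M\}$

• by reserving K examples as $\mathcal{D}_{\text{val}}$
• fewer choices
• fewer examples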

(23)

Three Learning Principles

• Occam’s Razor: simple is good
• Sampling Bias: class matches exam
• Data Snooping: honesty is best policy

(24)

Three Future Directions

More Transform, More Regularization, Less Label

[figure: word cloud of learning topics: stochastic gradient descent, nonlinear transformation, overfitting, data snooping, Occam’s razor, perceptrons, data contamination, error measures, cross validation, linear models, types of learning, kernel methods, logistic regression, training versus testing, VC dimension, linear regression, deterministic noise, noisy targets, bias-variance tradeoff, RBF, SVM, weight decay, regularization, soft-order constraint, sampling bias, neural networks, exploration versus exploitation, weak learners, Gaussian processes, active learning, graphical models, decision trees, ensemble learning, Bayesian prior, collaborative filtering, clustering, hidden Markov models, distribution-free, ordinal regression, Boltzmann machines, no free lunch, mixture of experts, Q learning, learning curves, semi-supervised learning, is learning feasible?]

ready for the jungle!

(25)

Fun Time

(26)

Summary

1 When Can Machines Learn?

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

Lecture 15: Validation

Lecture 16: Three Learning Principles

• Occam’s Razor: simple, simple, simple!
• Sampling Bias: match test scenario as much as possible
• Data Snooping: any use of data is ‘contamination’
• Power of Three: relatives, bounds, models, tools, principles

next: ready for jungle!
