Lecture 16: Three Learning Principles

Academic year: 2022

Machine Learning Foundations
(機器學習基石)

Lecture 16: Three Learning Principles

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/25

Three Learning Principles

Roadmap

1. When Can Machines Learn?
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn Better?

Lecture 15: Validation
(crossly) reserve validation data to simulate the testing procedure for model selection

Lecture 16: Three Learning Principles
Occam's Razor
Sampling Bias
Data Snooping
Power of Three

Occam's Razor

An explanation of the data should be made as simple as possible, but no simpler.
—Albert Einstein? (1879-1955)

entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
—William of Occam (1287-1347)

'Occam's razor': trimming down unnecessary explanation

figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons


Occam's Razor for Learning

The simplest model that fits the data is also the most plausible.

which one do you prefer? :-)

two questions:
1. What does it mean for a model to be simple?
2. How do we know that simpler is better?


Simple Model

simple hypothesis h: small Ω(h) = 'looks' simple, specified by few parameters

simple model H: small Ω(H) = not many, contains a small number of hypotheses

connection: |H| of size 2^ℓ ⟹ each h specified by ℓ = log₂|H| bits; hence small Ω(H) ⟹ small Ω(h)

simple: small hypothesis/model complexity
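The bit-counting connection can be sketched numerically: a hypothesis set with |H| = 2^ℓ members can be indexed by ℓ = log₂|H| bits, so a small model directly yields short hypothesis descriptions. A minimal sketch (the hypothesis set here is hypothetical, just bit-strings acting as names):

```python
import math

# a hypothetical model H with |H| = 2^ell hypotheses
ell = 3
H = [format(i, f"0{ell}b") for i in range(2 ** ell)]  # each h named by ell bits

assert len(H) == 2 ** ell           # small model: few hypotheses
bits_per_h = math.log2(len(H))      # description length of one hypothesis
assert bits_per_h == ell            # small Omega(H) => small Omega(h)
print(H, bits_per_h)
```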


Simple is Better

in addition to the math proof that you have seen, philosophically:

simple H
⟹ smaller m_H(N)
⟹ less 'likely' to fit data perfectly: chance ≈ m_H(N)/2^N
⟹ more significant when a fit does happen

direct action: linear first; always ask whether the data is over-modeled
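The chain above can be made concrete with growth functions from earlier lectures: 1-D decision stumps have m_H(N) = 2N, while a model that shatters all N points has m_H(N) = 2^N. Only for the simple model is a perfect fit on random labels rare, and hence significant. A small sketch under those assumptions:

```python
# chance that random (fair-coin) labels on N points are fit perfectly,
# upper-bounded by m_H(N) / 2^N for each model
N = 10

m_stump = 2 * N                 # decision stumps in R: m_H(N) = 2N
m_complex = 2 ** N              # a model that shatters N points: m_H(N) = 2^N

p_stump = m_stump / 2 ** N      # 20/1024: a perfect fit is rare, hence significant
p_complex = m_complex / 2 ** N  # 1.0: a perfect fit says nothing

assert p_stump == 20 / 1024
assert p_complex == 1.0
```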


Fun Time

Consider decision stumps on ℝ as the hypothesis set H. Recall that m_H(N) = 2N. Consider 10 different inputs x₁, x₂, . . . , x₁₀ coupled with labels y_n generated iid from a fair coin. What is the probability that the data D = {(x_n, y_n)}_{n=1}^{10} is separable by H?

1. 1/1024
2. 10/1024
3. 20/1024
4. 100/1024

Reference Answer: 3

Of all 1024 possible D, only 2N = 20 of them are separable by H.
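The reference answer can be verified by brute force: enumerate all 2^10 labelings of 10 sorted points and count those realizable by some stump h(x) = s · sign(x − θ). A sketch assuming distinct, sorted inputs:

```python
from itertools import product

N = 10
# all dichotomies realizable by s * sign(x - theta) on N sorted points:
# a single sign change at some position k, for each orientation s
stump_patterns = set()
for s in (+1, -1):
    for k in range(N + 1):          # threshold falls after position k
        stump_patterns.add(tuple(s * (1 if i >= k else -1) for i in range(N)))

separable = sum(1 for ys in product((-1, 1), repeat=N) if ys in stump_patterns)
print(separable, 2 ** N)            # 20 of the 1024 labelings are separable
assert separable == 2 * N           # matches m_H(N) = 2N
```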


Presidential Story

1948 US presidential election: Truman versus Dewey

a newspaper phone-polled how people voted, and set the title 'Dewey Defeats Truman' based on the polling

who is this? :-)


The Big Smile Came from . . .

Truman, and yes he won

suspects for the mistake:
editorial bug? —no
bad luck in polling (δ)? —no

hint: phones were expensive :-)


Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

technical explanation: data from P₁(x, y) but tested under P₂ ≠ P₁: VC fails

philosophical explanation: study Math hard but get tested on English: no strong test guarantee

'minor' VC assumption: data and testing both iid from P


Sampling Bias in Learning

A True Personal Story

Netflix competition for movie recommender systems: 10% improvement = 1M US dollars

formed D_val; in my first shot, E_val(g) showed 13% improvement

why am I still teaching here? :-)

[figure: predicted rating matches movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) with viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?), adding contributions from each factor]

validation: random examples within D; test: 'last' user records 'after' D


Dealing with Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

practical rule of thumb: match the test scenario as much as possible

e.g. if test = 'last' user records 'after' D:
• training: emphasize later examples (KDDCup 2011)
• validation: use 'late' user records

last puzzle: danger when learning 'credit card approval' from existing bank records?
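The rule of thumb above can be sketched: when the test set consists of the 'last' records, form the validation set from late records instead of a uniform random draw. The record layout here is hypothetical:

```python
import random

# hypothetical time-ordered records: (timestamp, features, label)
records = [(t, {"x": t % 7}, t % 2) for t in range(100)]  # sorted by time

# bias-prone choice: uniform random validation split ignores the test scenario
random.seed(0)
random_val = random.sample(records, 20)

# bias-matching choice: validate on the *latest* records, train on the rest
train, late_val = records[:80], records[80:]

assert [t for t, _, _ in late_val] == list(range(80, 100))
assert all(t < 80 for t, _, _ in train)
```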


Fun Time

If the data D is an unbiased sample from the underlying distribution P for binary classification, which of the following subsets of D is also an unbiased sample from P?

1. all the positive (y_n > 0) examples
2. half of the examples, randomly and uniformly picked from D without replacement
3. half of the examples with the smallest ‖x_n‖ values
4. the largest subset that is linearly separable

Reference Answer: 2

That's how we form the validation set, remember? :-)
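Choice 2 is exactly the validation-set recipe from Lecture 15: draw half of D uniformly without replacement, so every example has an equal chance of inclusion. A minimal sketch with toy data:

```python
import random

D = [(x, +1 if x % 3 == 0 else -1) for x in range(20)]  # toy labeled data

random.seed(1)
D_val = random.sample(D, len(D) // 2)       # uniform, without replacement
D_train = [ex for ex in D if ex not in D_val]

assert len(D_val) == 10 and len(D_train) == 10
assert set(D_val).isdisjoint(D_train)       # a true split of D
```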


Visual Data Snooping

Visualize X = ℝ²

full Φ₂: z = (1, x₁, x₂, x₁², x₁x₂, x₂²), d_VC = 6

or z = (1, x₁², x₂²), d_VC = 3, after visualizing?

or better z = (1, x₁² + x₂²), d_VC = 2?

or even better z = sign(0.6 − x₁² − x₂²)?

—careful about your brain's 'model complexity'

for VC-safety, Φ shall be decided without 'snooping' the data
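The danger can be illustrated: the last transform z = sign(0.6 − x₁² − x₂²) separates a circular data set with no trained parameters at all, but only because the constant 0.6 was read off by looking at (snooping) the data. A sketch with hypothetical circular data:

```python
import math

# hypothetical data: label +1 inside the circle x1^2 + x2^2 = 0.6, -1 outside
data = []
for r in (0.3, 0.5, 0.9, 1.1):
    for deg in range(0, 360, 45):
        x1 = r * math.cos(math.radians(deg))
        x2 = r * math.sin(math.radians(deg))
        data.append(((x1, x2), +1 if x1**2 + x2**2 < 0.6 else -1))

# the snooped transform: the radius 0.6 came from eyeballing the plot
def h(x1, x2):
    return 1 if 0.6 - x1**2 - x2**2 > 0 else -1

accuracy = sum(h(x1, x2) == y for (x1, x2), y in data) / len(data)
assert accuracy == 1.0   # perfect fit, but its "simplicity" is an illusion
```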


Data Snooping by Mere Shifting-Scaling

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

8 years of currency trading data: first 6 years for training, last 2 years for testing

x = previous 20 days, y = 21st day

• snooping versus no snooping: superior profit possible
• snooping: shift-scale all values by training + testing statistics
• no snooping: shift-scale all values by training statistics only
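The shift-scale trap can be sketched: normalizing with statistics computed over training + testing leaks test information into the learning process; the safe version computes statistics on the training portion only and reuses them on the test portion. A minimal sketch with made-up numbers:

```python
# toy "price" series: a training portion, then a later test portion with a jump
train = [1.0, 2.0, 3.0, 4.0]
test = [8.0, 9.0]

def standardize(values, mean, scale):
    return [(v - mean) / scale for v in values]

# snooping: statistics from train + test (test info leaks into preprocessing)
all_vals = train + test
m_all = sum(all_vals) / len(all_vals)

# no snooping: statistics from train only, then applied unchanged to test
m_train = sum(train) / len(train)
span = max(train) - min(train)
test_clean = standardize(test, m_train, span)

assert m_train == 2.5 and span == 3.0
assert m_all != m_train   # the snooped statistics were influenced by the test set
```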


Data Snooping by Data Reusing

Research Scenario

benchmark data D
paper 1: propose H₁ that works well on D
paper 2: find room for improvement, propose H₂, and publish only if better than H₁ on D
paper 3: find room for improvement, propose H₃, and publish only if better than H₂ on D
. . .

if all papers were merged into one big paper by the same author: bad generalization due to d_VC(∪_m H_m)

step-wise: each later author snooped the data by reading earlier papers; bad generalization worsened by 'publish only if better'

if you torture the data long enough, it will confess :-)
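The 'torture the data' effect can be simulated: generate pure-noise labels once, then keep proposing random hypotheses and 'publishing' only those beating the current best on the same D. Apparent accuracy climbs well above chance even though nothing is learnable:

```python
import random

random.seed(7)
N = 20
y = [random.choice((-1, 1)) for _ in range(N)]   # pure noise: nothing to learn

best_acc = 0.0
for _ in range(500):                             # 500 "papers" reusing the same D
    h = [random.choice((-1, 1)) for _ in range(N)]
    acc = sum(a == b for a, b in zip(h, y)) / N
    best_acc = max(best_acc, acc)                # publish only if better

print(best_acc)       # far above the ~0.5 a single honest try would expect
assert best_acc >= 0.65   # selection on D manufactures apparent performance
```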

(67)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

(68)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

(69)

Three Learning Principles Data Snooping

Data Snooping by Data Reusing

Research Scenario

benchmark dataD

paper 1: proposeH

1

that works well onD

paper 2: find room for improvement, proposeH

2

—and

publish only if better

thanH

1

onD

paper 3: find room for improvement, proposeH

3

—and

publish only if better

thanH

2

onD

. . .

if all papers from the same author in

one big paper:

bad generalization due to dVC(∪

m

H

m

)

step-wise: later author

snooped

data by reading earlier papers, bad generalization worsen by

publish only if better

if you torture the data long enough, it will confess :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/25

Three Learning Principles Data Snooping

Dealing with Data Snooping

truth—very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation data and use it cautiously

be blind: avoid making modeling decisions by looking at the data
be suspicious: interpret research results (including your own) with a proper feeling of contamination

one secret to winning KDDCups:
careful balance between data-driven modeling (snooping) and validation (no-snooping)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/25
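The "lock your test data in a safe" advice can be made concrete with a one-time split (a minimal sketch; function and variable names are illustrative): partition the data once, up front, do all model selection on train/validation, and read the test part exactly once at the very end.

```python
import numpy as np

def locked_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Split once, up front; the test portion is meant to be read exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# toy data, just to show the shapes
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

train, val, test = locked_split(X, y)
# model selection uses train/val only; test stays "in the safe"
print(len(train[0]), len(val[0]), len(test[0]))  # 6 2 2
```

The key design choice is that the split depends only on a fixed seed, never on anything learned from the data, so no modeling decision can leak test information backwards.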

Three Learning Principles Data Snooping

Fun Time
Which of the following can result in unsatisfactory test performance in machine learning?
1 data snooping
2 overfitting
3 sampling bias
4 all of the above

Reference Answer: 4
A professional like you should be aware of those! :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/25

Three Learning Principles Power of Three

Three Related Fields

Data Mining: use (huge) data to find property that is interesting
—difficult to distinguish ML and DM in reality

Artificial Intelligence: compute something that shows intelligent behavior
—ML is one possible route to realize AI

Statistics: use data to make inference about an unknown process
—statistics contains many useful tools for ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/25

Three Learning Principles Power of Three

Three Theoretical Bounds

Hoeffding: P[BAD] ≤ 2 exp(−2ε²N)
• one hypothesis
• useful for verifying/testing

Multi-Bin Hoeffding: P[BAD] ≤ 2M exp(−2ε²N)
• M hypotheses
• useful for validation

VC: P[BAD] ≤ 4 m_H(2N) exp(. . .)
• all H
• useful for training

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25
(89)

Three Learning Principles Power of Three

Three Theoretical Bounds

Power of Three

Hoeffding

P[BAD]

≤ 2 exp(−2

2

N)

one

hypothesis

useful for

verifying/testing

Multi-Bin Hoeffding

P[BAD]

≤ 2 M exp( −2

2

N)

• M

hypotheses

useful for

validation

VC

P[BAD]

≤ 4 m

H

(2N) exp(. . .)

all

H

useful for

training

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/25

(90)

Three Learning Principles Power of Three

Three Theoretical Bounds

Power of Three

Hoeffding

P[BAD]

≤ 2 exp(−2

2

N)

one

hypothesis

useful for

verifying/testing

Multi-Bin Hoeffding

P[BAD]

≤ 2 M exp( −2

2

N)

• M

hypotheses

useful for

validation

VC

P[BAD]

≤ 4 m

H

(2N) exp(. . .)

all

H

useful for

training

(91)

Three Learning Principles Power of Three

Three Linear Models
(each hypothesis computes a linear score s from inputs x_0, x_1, . . . , x_d)

PLA/pocket: h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially

linear regression: h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically

logistic regression: h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/25
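The three hypotheses share the same linear score s = wᵀx and differ only in the output transform, which the following sketch makes explicit (the weight and input values are illustrative; x[0] = 1 plays the bias coordinate x_0):

```python
import numpy as np

def score(w, x):
    return np.dot(w, x)  # s = w^T x, shared by all three models

def pla_pocket(w, x):
    return np.sign(score(w, x))          # h(x) = sign(s), err = 0/1

def linear_regression(w, x):
    return score(w, x)                   # h(x) = s, err = squared

def logistic_regression(w, x):
    return 1.0 / (1.0 + np.exp(-score(w, x)))  # h(x) = theta(s), err = CE

w = np.array([0.5, -1.0, 2.0])  # illustrative weights
x = np.array([1.0, 0.3, 0.4])   # x[0] = 1 is the bias coordinate

print(pla_pocket(w, x), linear_regression(w, x), logistic_regression(w, x))
```

Here s = 0.5 − 0.3 + 0.8 = 1.0, so the three outputs are the sign, the raw score, and the logistic squashing of the very same number.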
