Machine Learning Foundations (機器學習基石)
Lecture 16: Three Learning Principles
Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
( 國立台灣大學資訊工程系)
Three Learning Principles
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Lecture 15: Validation
(crossly) reserve validation data
to simulate the testing procedure for model selection
Lecture 16: Three Learning Principles
Occam’s Razor
Sampling Bias
Data Snooping
Power of Three
Three Learning Principles Occam’s Razor
Occam’s Razor
An explanation of the data should be made as simple as possible, but no simpler.
—Albert Einstein? (1879–1955)
entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
—William of Occam (1287–1347)
‘Occam’s razor’ for trimming down unnecessary explanation
figure by Fred the Oyster (Own work) [CC-BY-SA-3.0], via Wikimedia Commons
Occam’s Razor for Learning
The simplest model that fits the data is also the most plausible.
which one do you prefer? :-)
two questions:
1 What does it mean for a model to be simple?
2 How do we know that simpler is better?
Simple Model
simple hypothesis h
• small Ω(h) = ‘looks’ simple
• specified by few parameters
simple model H
• small Ω(H) = not many
• contains a small number of hypotheses
connection:
h specified by ℓ bits ⇐ |H| of size 2^ℓ
small Ω(h) ⇐ small Ω(H)
simple: small hypothesis/model complexity
Simple is Better
in addition to the math proof that you have seen, philosophically:
simple H ⟹ smaller m_H(N)
⟹ less ‘likely’ to fit data perfectly: m_H(N)/2^N
⟹ more significant when a fit happens
direct action:
linear first;
always ask whether the data is over-modeled
Fun Time
Consider the decision stumps in R^1 as the hypothesis set H. Recall that m_H(N) = 2N. Consider 10 different inputs x_1, x_2, . . ., x_10 coupled with labels y_n generated iid from a fair coin. What is the probability that the data D = {(x_n, y_n)}_{n=1}^{10} is separable by H?
1 1/1024
2 10/1024
3 20/1024
4 100/1024
Reference Answer: 3
Of all 1024 possible D, only 2N = 20 of them are separable by H.
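The 20/1024 count can be checked by brute force. A minimal sketch (not part of the original slides): enumerate every labeling of 10 distinct points on the line and count those realizable by a decision stump h(x) = s · sign(x − θ).

```python
import itertools

# Of all 2^10 labelings of 10 distinct points in R^1, how many are
# separable by a decision stump h(x) = s * sign(x - theta)?
N = 10
xs = list(range(N))  # any 10 distinct points; only their order matters

# Enumerate the dichotomies a stump can produce: for each threshold
# position (theta between xs[cut-1] and xs[cut]) and each sign s,
# points at index >= cut get label s, the rest get -s.
stump_dichotomies = set()
for cut in range(N + 1):
    for s in (+1, -1):
        labels = tuple(s * (+1 if i >= cut else -1) for i in range(N))
        stump_dichotomies.add(labels)

separable = sum(
    1 for ys in itertools.product((-1, +1), repeat=N)
    if ys in stump_dichotomies
)
print(separable, 2 ** N)  # 20 of 1024 labelings are separable
```

The 22 (cut, sign) combinations collapse to 2N = 20 distinct dichotomies because the all-(+1) and all-(−1) labelings are each produced twice.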
Three Learning Principles Sampling Bias
Presidential Story
• 1948 US presidential election: Truman versus Dewey
• a newspaper phone-polled how people voted, and set the headline ‘Dewey Defeats Truman’ based on the polling
who is this? :-)
The Big Smile Came from . . .
Truman, and yes, he won
suspects for the mistake:
• editorial bug? —no
• bad luck of polling (δ)? —no
hint: phones were expensive :-)
Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
• technical explanation: data from P_1(x, y) but test under P_2 ≠ P_1: VC fails
• philosophical explanation: study Math hard but test English: no strong test guarantee
‘minor’ VC assumption:
data and testing
both iid from P
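A toy simulation (not from the slides; the target, distributions, and learner below are all illustrative assumptions) of why VC fails when training data come from P_1 but testing is under P_2 ≠ P_1: a threshold learner trained on a biased sample can reach E_in = 0 while its test error stays large.

```python
import random

random.seed(0)

# Hypothetical target on [0, 1]: f(x) = sign(x - 0.3).
def f(x):
    return 1 if x > 0.3 else -1

# Biased sample: a "phone poll" that only reaches x > 0.6 (P_1),
# while testing is under the full uniform distribution (P_2 != P_1).
train = [random.uniform(0.6, 1.0) for _ in range(200)]
test = [random.uniform(0.0, 1.0) for _ in range(10000)]

# Pick the threshold stump with the lowest training error.
def fit_threshold(xs):
    candidates = [0.0] + sorted(xs)
    return min(candidates,
               key=lambda t: sum((1 if x > t else -1) != f(x) for x in xs))

theta = fit_threshold(train)  # near 0: every training point is positive
e_in = sum((1 if x > theta else -1) != f(x) for x in train) / len(train)
e_out = sum((1 if x > theta else -1) != f(x) for x in test) / len(test)
print(e_in, e_out)  # E_in = 0, but the test error is about 0.3
```

The learner never sees the region x ≤ 0.3 where f is negative, so the perfect training fit says nothing about test performance — exactly the guarantee the iid assumption was providing.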
Sampling Bias in Learning
A True Personal Story
• Netflix competition for movie recommender systems: 10% improvement = 1M US dollars
• formed D_val; on my first shot, E_val(g) showed 13% improvement
• why am I still teaching here? :-)
[figure: matrix factorization, matching movie and viewer factors (comedy content, action content, blockbuster?, Tom Cruise in it? vs. likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?) and adding contributions from each factor to get the predicted rating]
validation: random examples within D;
test: ‘last’ user records ‘after’ D
Dealing with Sampling Bias
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
• practical rule of thumb: match the test scenario as much as possible
• e.g. if test = ‘last’ user records ‘after’ D
• training: emphasize later examples (KDDCup 2011)
• validation: use ‘late’ user records
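The rule of thumb can be sketched in code. A hypothetical layout (the field names and sizes below are illustrative): when the test set is the ‘last’ records after D, hold out the latest records for validation instead of a uniformly random subset.

```python
# Hypothetical time-ordered records, as in the Netflix story where the
# test set consists of the 'last' user records 'after' D.
records = [{"t": t, "x": ..., "y": ...} for t in range(1000)]

# Mismatched validation: a uniformly random subset ignores time,
# so E_val can look far better than the true test error.
# random.sample(records, 200)

# Matched validation: hold out the *latest* records, mimicking the test.
split = int(0.8 * len(records))
train, val = records[:split], records[split:]
assert all(r["t"] < val[0]["t"] for r in train)  # training strictly earlier
```

This is the same "match the test scenario" idea as emphasizing later examples during training.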
last puzzle:
danger when learning ‘credit card approval’
with
existing bank records?
Fun Time
If the data D is an unbiased sample from the underlying distribution P for binary classification, which of the following subsets of D is also an unbiased sample from P?
1 all the positive (y_n > 0) examples
2 half of the examples, randomly and uniformly picked from D without replacement
3 the half of the examples with the smallest ‖x_n‖ values
4 the largest subset that is linearly separable
Reference Answer: 2
That’s how we form the validation set, remember? :-)
Three Learning Principles Data Snooping
Visual Data Snooping
Visualize X = R^2
• full Φ_2: z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), d_VC = 6
• or z = (1, x_1^2, x_2^2), d_VC = 3, after visualizing?
• or better z = (1, x_1^2 + x_2^2), d_VC = 2?
• or even better z = sign(0.6 − x_1^2 − x_2^2)?
—careful about your brain’s ‘model complexity’
[figure: data points forming a ring in [−1, 1] × [−1, 1]]
for VC-safety, Φ shall be decided without ‘snooping’ the data
Data Snooping by Mere Shifting-Scaling
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
• 8 years of currency trading data
• first 6 years for training, last 2 years for testing
• x = previous 20 days, y = 21st day
• snooping versus no snooping: superior profit possible
[figure: cumulative profit (%) over 500 trading days; the ‘snooping’ curve rises well above the ‘no snooping’ curve]
• snooping: shift-scale all values by training + testing
• no snooping: shift-scale all values by training only
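The contrast can be made concrete. A minimal sketch with synthetic numbers (the distributions below are illustrative, not the actual currency data): snooping computes the normalization statistics over training + testing, while the honest version uses training statistics only and applies them to the test set.

```python
import random
import statistics

random.seed(0)
train = [random.gauss(100.0, 5.0) for _ in range(60)]  # e.g. first 6 years
test = [random.gauss(110.0, 8.0) for _ in range(20)]   # last 2 years: new regime

# Snooping: shift-scale statistics computed over training + testing,
# so the test period has leaked into the preprocessing step.
both = train + test
mu_s, sd_s = statistics.fmean(both), statistics.pstdev(both)
test_snoop = [(x - mu_s) / sd_s for x in test]

# No snooping: statistics from training only, then applied to the test set.
mu, sd = statistics.fmean(train), statistics.pstdev(train)
test_clean = [(x - mu) / sd for x in test]

# The snooped scaling quietly re-centers the upward-shifted test period,
# information a real trader would not have had at training time.
print(statistics.fmean(test_snoop), statistics.fmean(test_clean))
```

The snooped test values look closer to the training regime than they really are, which is where the spurious ‘superior profit’ comes from.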
Data Snooping by Data Reusing
Research Scenario
benchmark data D
• paper 1: propose H_1 that works well on D
• paper 2: find room for improvement, propose H_2 —and publish only if better than H_1 on D
• paper 3: find room for improvement, propose H_3 —and publish only if better than H_2 on D
• . . .
• if all papers were combined into one big paper by the same author: bad generalization due to d_VC(∪_m H_m)
• step-wise: later authors snooped the data by reading earlier papers; bad generalization worsened by publish only if better
if you torture the data long enough, it will confess :-)
Dealing with Data Snooping
• truth—very hard to avoid, unless being extremely honest
• extremely honest: lock your test data in a safe
• less honest: reserve validation and use it cautiously
• be blind: avoid making modeling decisions by looking at the data
• be suspicious: interpret research results (including your own) with a proper feeling of contamination
one secret to winning KDDCups:
careful balance between
data-driven modeling (snooping)
and validation (no snooping)
Fun Time
Which of the following can result in unsatisfactory test performance in machine learning?
1 data snooping
2 overfitting
3 sampling bias
4 all of the above
Reference Answer: 4
A professional like you should be aware of those! :-)
Three Learning Principles Power of Three
Three Related Fields
Power of Three
Data Mining
• use (huge) data to find properties that are interesting
• difficult to distinguish ML and DM in reality
Artificial Intelligence
• compute something that shows intelligent behavior
• ML is one possible route to realize AI
Statistics
• use data to make inferences about an unknown process
• statistics contains many useful tools for ML
Three Theoretical Bounds
Power of Three
Hoeffding
P[BAD] ≤ 2 exp(−2ε²N)
• one hypothesis
• useful for verifying/testing
Multi-Bin Hoeffding
P[BAD] ≤ 2M exp(−2ε²N)
• M hypotheses
• useful for validation
VC
P[BAD] ≤ 4 m_H(2N) exp(. . .)
• all H
• useful for training
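The three bounds differ mainly in their leading factor. A quick numeric illustration (ε, N, and M below are arbitrary example values, not from the slides) of how the multi-bin bound loosens the single-hypothesis guarantee by exactly the factor M:

```python
import math

# Illustrative numbers: tolerance epsilon = 0.1, sample size N = 10000.
eps, N = 0.1, 10000

def hoeffding(M=1):
    """Multi-bin Hoeffding bound: P[BAD] <= 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * M * math.exp(-2 * eps * eps * N)

single = hoeffding()        # one hypothesis: verifying/testing
multi = hoeffding(M=1000)   # M hypotheses: validation over finite choices

print(single, multi)  # the bound loosens by exactly the factor M
```

The VC bound replaces the factor M by the growth function m_H(2N), which is what makes training over an infinite H tractable at all.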
Three Linear Models
Power of Three
(each model computes the score s = w^T x from inputs x_0, x_1, . . ., x_d)
PLA/pocket: h(x) = sign(s)
• plausible err = 0/1 (small flipping noise)
• minimize specially
linear regression: h(x) = s
• friendly err = squared (easy to minimize)
• minimize analytically
logistic regression: h(x) = θ(s)
• plausible err = CE (maximum likelihood)
• minimize iteratively
Three Key Tools
Power of Three
Feature Transform
E_in(w) → E_in(w̃)
d_VC(H) → d_VC(H_Φ)
• by using a more complicated Φ
• lower E_in
• higher d_VC
Regularization
E_in(w) → E_in(w_REG)
d_VC(H) → d_EFF(H, A)
• by augmenting a regularizer Ω
• lower d_EFF
• higher E_in
Validation
E_in(h) → E_val(h)
H → {g_1^−, . . ., g_M^−}
• by reserving K examples as D_val
• fewer choices
• fewer examples
Three Learning Principles
Power of Three
Occam’s Razor: simple is good
Sampling Bias: class matches exam
Data Snooping: honesty is the best policy
Three Future Directions
Power of Three
More Transform More Regularization Less Label
soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, coordinate descent, bagging, decision tree, support vector machine, neural network, kernel
ready for the jungle!
Fun Time
What are the magic numbers that repeatedly appear in this class?
1 3
2 1126
3 both 3 and 1126
4 neither 3 nor 1126
Reference Answer: 3
3 as illustrated, and you may recall 1126 somewhere :-)