(1)

Machine Learning Foundations
(機器學習基石)

Lecture 8: Noise and Error

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

(2)

Noise and Error

Roadmap

1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 7: The VC Dimension
learning happens if finite d_VC, large N, and low E_in

Lecture 8: Noise and Error
  Noise and Probabilistic Target
  Error Measure
  Algorithmic Error Measure
  Weighted Classification

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

Noise and Error Noise and Probabilistic Target

Recap: The Learning Flow

unknown target function f : X → Y (+ noise)
(ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

unknown P on X, generating x_1, x_2, · · · , x_N and a future test input x

learning algorithm A → final hypothesis g ≈ f
(‘learned’ formula to be used)

hypothesis set H
(set of candidate formula)

what if there is noise?

(4)

Noise and Error Noise and Probabilistic Target

Noise

briefly introduced noise before (pocket algorithm)

age                 23 years
gender              female
annual salary       NTD 1,000,000
year in residence   1 year
year in job         0.5 year
current debt        200,000
credit?             {no(−1), yes(+1)}

but more!
• noise in y: good customer, ‘mislabeled’ as bad?
• noise in y: same customers, different labels?
• noise in x: inaccurate customer information?

does the VC bound work under noise?

(5)

Noise and Error Noise and Probabilistic Target

Probabilistic Marbles

one key of the VC bound: marbles!

(figure: a bin of marbles and a sample drawn from it)

‘deterministic’ marbles
• marble x ∼ P(x)
• deterministic color ⟦f(x) ≠ h(x)⟧

‘probabilistic’ (noisy) marbles
• marble x ∼ P(x)
• probabilistic color ⟦y ≠ h(x)⟧ with y ∼ P(y|x)

same nature: can estimate P[orange] if i.i.d.

VC holds for x i.i.d. ∼ P(x) and y i.i.d. ∼ P(y|x), i.e. (x, y) i.i.d. ∼ P(x, y)
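To make the ‘probabilistic marble’ picture concrete, here is a minimal simulation sketch (my own illustration, not from the lecture; the distribution P(x), the flipping noise level, and the hypothesis h are all made up): examples are drawn i.i.d. from P(x) and P(y|x), and the in-sample fraction of ‘orange’ marbles for a fixed h stays close to the out-of-sample fraction.

import numpy as np

rng = np.random.default_rng(0)

def sample(N, flip=0.1):
    # draw (x, y) i.i.d.: x ~ uniform on [-1, 1], y = sign(x) flipped with probability `flip`
    x = rng.uniform(-1.0, 1.0, N)
    y = np.sign(x)
    noisy = rng.random(N) < flip
    y[noisy] = -y[noisy]
    return x, y

h = lambda x: np.sign(x - 0.1)        # a fixed hypothesis (the 'color rule')

x_in, y_in = sample(200)              # the sample of marbles
x_out, y_out = sample(200000)         # approximates the whole bin

print(np.mean(h(x_in) != y_in))       # in-sample 'orange' fraction
print(np.mean(h(x_out) != y_out))     # out-of-sample 'orange' fraction, close to the above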

(6)

Noise and Error Noise and Probabilistic Target

Target Distribution P(y|x)

characterizes the behavior of the ‘mini-target’ on one x

can be viewed as ‘ideal mini-target’ + noise, e.g.
• P(◦|x) = 0.7, P(×|x) = 0.3
• ideal mini-target f(x) = ◦
• ‘flipping’ noise level = 0.3

deterministic target f: special case of target distribution
• P(y|x) = 1 for y = f(x)
• P(y|x) = 0 for y ≠ f(x)

goal of learning:
predict the ideal mini-target (w.r.t. P(y|x)) on often-seen inputs (w.r.t. P(x))

(7)

Noise and Error Noise and Probabilistic Target

The New Learning Flow

unknown target distribution P(y|x), containing f(x) + noise
(ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

unknown P on X, generating x_1, x_2, · · · , x_N and a future test x,
with labels y_1, y_2, · · · , y_N and y drawn from P(y|x)

learning algorithm A → final hypothesis g ≈ f
(‘learned’ formula to be used)

hypothesis set H
(set of candidate formula)

VC still works, pocket algorithm explained :-)

(8)

Noise and Error Noise and Probabilistic Target

Fun Time

Let’s revisit PLA/pocket. Which of the following claims is true?

1 In practice, we should try to compute whether D is linearly separable before deciding to use PLA.
2 If we know that D is not linearly separable, then the target function f must not be a linear function.
3 If we know that D is linearly separable, then the target function f must be a linear function.
4 None of the above

Reference Answer: 4

1 After computing whether D is linearly separable, we would already have a separating w, and then there would be no need to run PLA.
2 What about noise?
3 What about ‘sampling luck’? :-)

(9)

Noise and Error Error Measure

Error Measure

final hypothesis g ≈ f: how well?

previously, considered the out-of-sample measure
E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧

more generally, error measure E(g, f)

naturally considered
• out-of-sample: averaged over unknown x
• pointwise: evaluated on one x
• classification: ⟦prediction ≠ target⟧

classification error ⟦. . .⟧: often also called ‘0/1 error’

(10)

Noise and Error Error Measure

Pointwise Error Measure

can often express E(g, f) as an average of a pointwise err(g(x), f(x)), like
E_out(g) = E_{x∼P} ⟦g(x) ≠ f(x)⟧   with err(g(x), f(x)) = ⟦g(x) ≠ f(x)⟧

err: called a pointwise error measure

in-sample:
E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n))

out-of-sample:
E_out(g) = E_{x∼P} err(g(x), f(x))

will mainly consider pointwise err for simplicity

(11)

Noise and Error Error Measure

Two Important Pointwise Error Measures

err(g(x), f(x)) with ỹ = g(x) and y = f(x):

0/1 error: err(ỹ, y) = ⟦ỹ ≠ y⟧
• correct or incorrect?
• often for classification

squared error: err(ỹ, y) = (ỹ − y)^2
• how far is ỹ from y?
• often for regression

how does err ‘guide’ learning?
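As a small sketch of how these two pointwise errors turn into E_in (my own illustration; the data and the hypothesis g below are made up):

import numpy as np

def err_01(y_tilde, y):
    # 0/1 error: 1 if prediction and target differ, else 0
    return (y_tilde != y).astype(float)

def err_sq(y_tilde, y):
    # squared error: squared distance between prediction and target
    return (y_tilde - y) ** 2

def E_in(g, X, Y, err):
    # in-sample error: average of the pointwise err over the N examples
    return np.mean(err(g(X), Y))

# made-up data and hypothesis, for illustration only
X = np.array([[1.0, 2.0], [2.0, 0.5], [0.2, 0.1]])
Y = np.array([+1, -1, +1])
g = lambda X: np.sign(X @ np.array([1.0, -1.0]))

print(E_in(g, X, Y, err_01))   # fraction of misclassified examples
print(E_in(g, X, Y, err_sq))   # average squared error of the same predictions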

(12)

Noise and Error Error Measure

Ideal Mini-Target

interplay between noise and error:
P(y|x) and err define the ideal mini-target f(x)

P(y = 1|x) = 0.2, P(y = 2|x) = 0.7, P(y = 3|x) = 0.1

err(ỹ, y) = ⟦ỹ ≠ y⟧:
• ỹ = 1:   avg. err 0.8
• ỹ = 2:   avg. err 0.3 (∗)
• ỹ = 3:   avg. err 0.9
• ỹ = 1.9: avg. err 1.0 (really? :-))
⇒ f(x) = argmax_{y∈Y} P(y|x)

err(ỹ, y) = (ỹ − y)^2:
• ỹ = 1:   avg. err 1.1
• ỹ = 2:   avg. err 0.3
• ỹ = 3:   avg. err 1.5
• ỹ = 1.9: avg. err 0.29 (∗)
⇒ f(x) = Σ_{y∈Y} y · P(y|x)
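The averages in the table above can be checked in a few lines (a sketch that reuses the P(y|x) from this slide):

import numpy as np

y_vals = np.array([1, 2, 3])
p = np.array([0.2, 0.7, 0.1])                            # P(y|x) from this slide

avg_err_01 = lambda yt: np.sum(p * (yt != y_vals))       # average 0/1 error of predicting yt
avg_err_sq = lambda yt: np.sum(p * (yt - y_vals) ** 2)   # average squared error of predicting yt

for yt in [1, 2, 3, 1.9]:
    print(yt, round(avg_err_01(yt), 2), round(avg_err_sq(yt), 2))

print(y_vals[np.argmax(p)])    # 2: ideal mini-target under 0/1 error
print(np.sum(y_vals * p))      # 1.9: ideal mini-target under squared error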

(13)

Noise and Error Error Measure

Learning Flow with Error Measure

unknown target distribution P(y|x), containing f(x) + noise
(ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

unknown P on X, generating x_1, x_2, · · · , x_N and a future test x,
with labels y_1, y_2, · · · , y_N and y drawn from P(y|x)

learning algorithm A → final hypothesis g ≈ f
(‘learned’ formula to be used)

hypothesis set H
(set of candidate formula)

error measure err

extended VC theory/‘philosophy’ works for most H and err

(14)

Noise and Error Error Measure

Fun Time

Consider the following P(y|x) and err(ỹ, y) = |ỹ − y|. Which of the following is the ideal mini-target f(x)?

P(y = 1|x) = 0.10, P(y = 2|x) = 0.35, P(y = 3|x) = 0.15, P(y = 4|x) = 0.40

1  2 = weighted median from P(y|x)
2  2.5 = average within Y = {1, 2, 3, 4}
3  2.85 = weighted mean from P(y|x)
4  4 = argmax P(y|x)

Reference Answer: 1

For the ‘absolute error’, the weighted median provably results in the minimum average err.

(15)

Noise and Error Algorithmic Error Measure

Choice of Error Measure

Fingerprint Verification

f(x) = +1 if you, −1 if intruder

two types of error: false accept and false reject

              g = +1          g = −1
f = +1        no error        false reject
f = −1        false accept    no error

0/1 error penalizes both types equally

(16)

Noise and Error Algorithmic Error Measure

Fingerprint Verification for Supermarket

Fingerprint Verification

f(x) = +1 if you, −1 if intruder

two types of error: false accept and false reject

              g = +1          g = −1
f = +1        no error        false reject
f = −1        false accept    no error

cost matrix:
              g = +1   g = −1
f = +1           0       10
f = −1           1        0

supermarket: fingerprint for discount
• false reject: very unhappy customer, lose future business
• false accept: give away a minor discount, intruder left fingerprint :-)

(17)

Noise and Error Algorithmic Error Measure

Fingerprint Verification for CIA

Fingerprint Verification

f(x) = +1 if you, −1 if intruder

two types of error: false accept and false reject

              g = +1          g = −1
f = +1        no error        false reject
f = −1        false accept    no error

cost matrix:
              g = +1   g = −1
f = +1           0        1
f = −1         1000       0

CIA: fingerprint for entrance
• false accept: very serious consequences!
• false reject: unhappy employee, but so what? :-)

(18)

Noise and Error Algorithmic Error Measure

Take-home Message for Now

err is application/user-dependent

Algorithmic Error Measures err^

true: just err

plausible:
• 0/1: minimum ‘flipping noise’ (NP-hard to optimize, remember? :-))
• squared: minimum Gaussian noise

friendly: easy to optimize for A
• closed-form solution
• convex objective function

err^: more in next lectures

(19)

Noise and Error Algorithmic Error Measure

Learning Flow with Algorithmic Error Measure

unknown target distribution P(y|x), containing f(x) + noise
(ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

unknown P on X, generating x_1, x_2, · · · , x_N and a future test x,
with labels y_1, y_2, · · · , y_N and y drawn from P(y|x)

learning algorithm A → final hypothesis g ≈ f
(‘learned’ formula to be used)

hypothesis set H
(set of candidate formula)

error measure err, algorithmic error measure err^

err: application goal; err^: a key part of many A

(20)

Noise and Error Algorithmic Error Measure

Fun Time

Consider the err below for CIA. What is E_in(g) when using this err?

            g = +1   g = −1
f = +1         0        1
f = −1       1000       0

1  (1/N) Σ_{n=1}^{N} ⟦y_n ≠ g(x_n)⟧

2  (1/N) ( Σ_{y_n=+1} ⟦y_n ≠ g(x_n)⟧ + 1000 Σ_{y_n=−1} ⟦y_n ≠ g(x_n)⟧ )

3  (1/N) ( Σ_{y_n=+1} ⟦y_n ≠ g(x_n)⟧ − 1000 Σ_{y_n=−1} ⟦y_n ≠ g(x_n)⟧ )

4  (1/N) ( 1000 Σ_{y_n=+1} ⟦y_n ≠ g(x_n)⟧ + Σ_{y_n=−1} ⟦y_n ≠ g(x_n)⟧ )

Reference Answer: 2

When y_n = −1, a false positive made on such an (x_n, y_n) is penalized 1000 times more!

(21)

Noise and Error Weighted Classification

Weighted Classification

CIA Cost (Error, Loss, . . .) Matrix

            h(x) = +1   h(x) = −1
y = +1          0           1
y = −1        1000          0

out-of-sample:
E_out(h) = E_{(x,y)∼P} (1 if y = +1; 1000 if y = −1) · ⟦y ≠ h(x)⟧

in-sample:
E_in(h) = (1/N) Σ_{n=1}^{N} (1 if y_n = +1; 1000 if y_n = −1) · ⟦y_n ≠ h(x_n)⟧

weighted classification:
different ‘weight’ for different (x, y)
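The weighted in-sample error above could be computed as follows (a sketch with made-up, very unbalanced data; the constant +1 hypothesis mirrors the ‘lazy’ prediction discussed later):

import numpy as np

def E_in_w(h, X, Y, w_pos=1, w_neg=1000):
    # weighted in-sample error: each mistake is weighted by the cost of its true label
    costs = np.where(Y == +1, w_pos, w_neg)
    return np.mean(costs * (h(X) != Y))

# made-up, very unbalanced data: 990 examples of 'you' (+1) and 10 intruders (-1)
X = np.random.default_rng(1).normal(size=(1000, 2))
Y = np.where(np.arange(1000) < 990, +1, -1)

always_plus = lambda X: np.ones(len(X))        # lazy constant hypothesis h(x) = +1
print(E_in_w(always_plus, X, Y))               # 1000 * 10 / 1000 = 10.0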

(22)

Noise and Error Weighted Classification

Minimizing E_in^w for Weighted Classification

E_in^w(h) = (1/N) Σ_{n=1}^{N} (1 if y_n = +1; 1000 if y_n = −1) · ⟦y_n ≠ h(x_n)⟧

Naïve Thoughts

PLA: doesn’t matter if linearly separable :-)

pocket: modify the pocket-replacement rule, i.e. if w_{t+1} reaches a smaller E_in^w than ŵ, replace ŵ by w_{t+1}

pocket: some guarantee on E_in^{0/1};
modified pocket: similar guarantee on E_in^w?

(23)

Noise and Error Weighted Classification

Systematic Route: Connect E_in^w and E_in^{0/1}

original problem:

            h(x) = +1   h(x) = −1
y = +1          0           1
y = −1        1000          0

D: (x_1, +1), (x_2, −1), (x_3, −1), . . ., (x_{N−1}, +1), (x_N, +1)

equivalent problem:

            h(x) = +1   h(x) = −1
y = +1          0           1
y = −1          1           0

(x_1, +1),
(x_2, −1), (x_2, −1), . . ., (x_2, −1),
(x_3, −1), (x_3, −1), . . ., (x_3, −1),
. . .,
(x_{N−1}, +1), (x_N, +1)

after copying the −1 examples 1000 times,
E_in^w for the original problem ≡ E_in^{0/1} for the copied problem!
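A quick numerical check of this equivalence (my own sketch; the data and the hypothesis are made up): each weighted mistake on the original data becomes exactly that many 0/1 mistakes on the copied data, so the two mistake counts coincide.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
Y = np.where(rng.random(50) < 0.9, +1, -1)
h = lambda X: np.sign(X[:, 0])                   # some fixed hypothesis

costs = np.where(Y == +1, 1, 1000)
weighted_count = np.sum(costs * (h(X) != Y))     # weighted mistakes on the original data

Xc = np.repeat(X, costs, axis=0)                 # copy every -1 example 1000 times
Yc = np.repeat(Y, costs)
plain_count = np.sum(h(Xc) != Yc)                # plain 0/1 mistakes on the copied data

print(weighted_count == plain_count)             # True: minimizing one minimizes the other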

(24)

Noise and Error Weighted Classification

Weighted Pocket Algorithm

            h(x) = +1   h(x) = −1
y = +1          0           1
y = −1        1000          0

using ‘virtual copying’, the weighted pocket algorithm includes:

weighted PLA:
randomly check −1 example mistakes with 1000 times more probability

weighted pocket replacement:
if w_{t+1} reaches a smaller E_in^w than ŵ, replace ŵ by w_{t+1}

systematic route (called ‘reduction’):
can be applied to many other algorithms!
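Below is one way the ‘virtual copying’ idea could be turned into code (a sketch under my own assumptions, not the course's reference implementation): a mistaken −1 example is sampled 1000 times more often for the PLA update, and the pocket weight ŵ is replaced only when E_in^w improves.

import numpy as np

def E_in_w(w, X, Y, neg_weight=1000):
    # weighted 0/1 error: cost 1 for +1 examples, neg_weight for -1 examples
    mistakes = np.sign(X @ w) != Y
    costs = np.where(Y == +1, 1, neg_weight)
    return np.sum(costs * mistakes) / len(Y)

def weighted_pocket(X, Y, T=2000, neg_weight=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), E_in_w(w, X, Y, neg_weight)
    for _ in range(T):
        wrong = np.flatnonzero(np.sign(X @ w) != Y)
        if len(wrong) == 0:
            break
        # 'virtual copying': pick a mistaken example with probability proportional to its cost
        probs = np.where(Y[wrong] == +1, 1.0, float(neg_weight))
        n = rng.choice(wrong, p=probs / probs.sum())
        w = w + Y[n] * X[n]                      # PLA-style correction
        e = E_in_w(w, X, Y, neg_weight)
        if e < best:                             # weighted pocket replacement
            w_hat, best = w.copy(), e
    return w_hat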

(25)

Noise and Error Weighted Classification

Fun Time

Consider the CIA cost matrix. If there are 10 examples with y_n = −1 (intruder) and 999,990 examples with y_n = +1 (you), what would E_in^w(h) be for a constant h(x) that always returns +1?

            h(x) = +1   h(x) = −1
y = +1          0           1
y = −1        1000          0

1  0.001
2  0.01
3  0.1
4  1

Reference Answer: 2

While the quiz is a simple evaluation, it is not uncommon for the data to be very unbalanced in such an application. Properly ‘setting’ the weights can be used to avoid the lazy constant prediction.
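Spelling out the arithmetic behind the reference answer: the constant +1 hypothesis is wrong only on the 10 intruder examples, each weighted 1000, among N = 1,000,000 examples in total, so

E_in^w(h) = (1000 · 10 + 1 · 0) / 1,000,000 = 0.01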

(26)

Noise and Error Weighted Classification

Summary

1 When Can Machines Learn?
2 Why Can Machines Learn?

Lecture 7: The VC Dimension

Lecture 8: Noise and Error
  Noise and Probabilistic Target: can replace f(x) by P(y|x)
  Error Measure: affects the ‘ideal’ target
  Algorithmic Error Measure: user-dependent ⇒ plausible or friendly
  Weighted Classification: easily done by virtual ‘example copying’

next: more algorithms, please? :-)

3 How Can Machines Learn?

4 How Can Machines Learn Better?
