## Machine Learning Foundations
## (機器學習基石)

### Lecture 8: Noise and Error

Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)


## Roadmap

### 1 When Can Machines Learn?

### 2 Why Can Machines Learn?

Lecture 7: The VC Dimension
learning happens if **finite $d_{\text{VC}}$**, **large $N$**, and **low $E_{\text{in}}$**

### Lecture 8: Noise and Error
- Noise and Probabilistic Target
- Error Measure
- Algorithmic Error Measure
- Weighted Classification

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?


## Recap: The Learning Flow

[learning flow diagram]
- unknown target function $f : \mathcal{X} \to \mathcal{Y}$ **+ noise** (ideal credit approval formula)
- unknown $P$ on $\mathcal{X}$, generating $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ and the test point $\mathbf{x}$
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ ('learned' formula to be used)

what if there is **noise**?


## Noise

**noise** was briefly introduced before the **pocket** algorithm, e.g. in a credit-approval record:

| field | value |
|---|---|
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

credit? {no ($-1$), yes ($+1$)}

but more!
- **noise in $y$**: good customer, 'mislabeled' as bad?
- **noise in $y$**: same customers, different labels?
- **noise in $\mathbf{x}$**: inaccurate customer information?

does the VC bound work under **noise**?


## Probabilistic Marbles

one key of the VC bound: **marbles!**

'deterministic' marbles
- marble $\mathbf{x} \sim P(\mathbf{x})$
- deterministic color $[\![ f(\mathbf{x}) \ne h(\mathbf{x}) ]\!]$

'probabilistic' (noisy) marbles
- marble $\mathbf{x} \sim P(\mathbf{x})$
- probabilistic color $[\![ y \ne h(\mathbf{x}) ]\!]$ with $y \sim P(y|\mathbf{x})$

**same nature**: can estimate $\mathbb{P}[\text{orange}]$ if i.i.d.

VC holds for $\mathbf{x} \overset{\text{i.i.d.}}{\sim} P(\mathbf{x})$, $y \overset{\text{i.i.d.}}{\sim} P(y|\mathbf{x})$, i.e. $(\mathbf{x}, y) \overset{\text{i.i.d.}}{\sim} P(\mathbf{x}, y)$
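To make the probabilistic marbles concrete, here is a minimal Python sketch (my own illustration, not from the lecture): a fixed hypothesis $h$, marbles drawn i.i.d. from an assumed $P(\mathbf{x})$ with a 'flipping'-noise $P(y|\mathbf{x})$, and the sample's orange fraction tracking the bin's orange probability, just as Hoeffding/VC promise.

```python
import numpy as np

rng = np.random.default_rng(1126)

def h(x):
    # a fixed hypothesis: +1 iff the first coordinate is non-negative
    return np.where(x[:, 0] >= 0, 1, -1)

def sample(n):
    # marbles x ~ P(x): here, uniform over [-1, 1]^2 (an assumption)
    x = rng.uniform(-1, 1, size=(n, 2))
    # ideal mini-target f(x) = sign(x1 + x2), plus 'flipping' noise of
    # level 0.1, so the color is probabilistic: y ~ P(y|x)
    y = np.where(x[:, 0] + x[:, 1] >= 0, 1, -1)
    flip = rng.random(n) < 0.1
    y[flip] = -y[flip]
    return x, y

x_in, y_in = sample(100)            # the sample: N = 100 marbles
x_big, y_big = sample(1_000_000)    # a huge draw approximating the bin

nu = np.mean(y_in != h(x_in))       # orange fraction in the sample
mu = np.mean(y_big != h(x_big))     # ~ P[orange] in the bin
print(f"nu = {nu:.3f}, mu ~ {mu:.3f}")  # the two stay close
```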


## Target Distribution P(y|x)

characterizes the behavior of a **'mini-target'** on one $\mathbf{x}$
- can be viewed as 'ideal mini-target' + noise, e.g.
  - $P(\circ|\mathbf{x}) = 0.7$, $P(\times|\mathbf{x}) = 0.3$
  - ideal mini-target $f(\mathbf{x}) = \circ$
  - 'flipping' noise level $= 0.3$
- deterministic target $f$: **special case of target distribution**
  - $P(y|\mathbf{x}) = 1$ for $y = f(\mathbf{x})$
  - $P(y|\mathbf{x}) = 0$ for $y \ne f(\mathbf{x})$

goal of learning: predict the **ideal mini-target (w.r.t. $P(y|\mathbf{x})$)** on **often-seen inputs (w.r.t. $P(\mathbf{x})$)**


## The New Learning Flow

[learning flow diagram, updated]
- unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ + noise (ideal credit approval formula)
- unknown $P$ on $\mathcal{X}$, generating $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ and $\mathbf{x}$; labels $y_1, y_2, \cdots, y_N$ and $y$ drawn from $P(y|\mathbf{x})$
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ ('learned' formula to be used)

VC still works, **pocket algorithm explained :-)**


## Fun Time

Let's revisit PLA/pocket. Which of the following claims is true?

1. In practice, we should try to compute whether $\mathcal{D}$ is linearly separable before deciding to use PLA.
2. If we know that $\mathcal{D}$ is not linearly separable, then the target function $f$ must not be a linear function.
3. If we know that $\mathcal{D}$ is linearly separable, then the target function $f$ must be a linear function.
4. None of the above.

### Reference Answer: 4

(1) Determining whether $\mathcal{D}$ is linearly separable effectively requires finding a separating $\mathbf{w}^{*}$, after which there is no need to run PLA. (2) What about noise? (3) What about 'sampling luck'? :-)


## Error Measure

final hypothesis $g \approx f$: how well?

previously, we considered the out-of-sample measure

$$E_{\text{out}}(g) = \mathbb{E}_{\mathbf{x} \sim P}\, [\![ g(\mathbf{x}) \ne f(\mathbf{x}) ]\!]$$

more generally, an **error measure $E(g, f)$**, naturally considered
- out-of-sample: averaged over unknown $\mathbf{x}$
- pointwise: evaluated on one $\mathbf{x}$
- classification: $[\![ \text{prediction} \ne \text{target} ]\!]$

classification error $[\![ \ldots ]\!]$: often also called the **'0/1 error'**


## Pointwise Error Measure

can often express $E(g, f)$ as an average of $\text{err}(g(\mathbf{x}), f(\mathbf{x}))$, like

$$E_{\text{out}}(g) = \mathbb{E}_{\mathbf{x} \sim P} \underbrace{[\![ g(\mathbf{x}) \ne f(\mathbf{x}) ]\!]}_{\text{err}(g(\mathbf{x}), f(\mathbf{x}))}$$

err: called a **pointwise error measure**

in-sample:
$$E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g(\mathbf{x}_n), f(\mathbf{x}_n))$$

out-of-sample:
$$E_{\text{out}}(g) = \mathbb{E}_{\mathbf{x} \sim P}\, \text{err}(g(\mathbf{x}), f(\mathbf{x}))$$

will mainly consider pointwise err for simplicity

## Two Important Pointwise Error Measures

$$\text{err}\big(\underbrace{g(\mathbf{x})}_{\tilde{y}},\ \underbrace{f(\mathbf{x})}_{y}\big)$$

**0/1 error**: $\text{err}(\tilde{y}, y) = [\![ \tilde{y} \ne y ]\!]$
- correct or incorrect?
- often for **classification**

**squared error**: $\text{err}(\tilde{y}, y) = (\tilde{y} - y)^2$
- how far is $\tilde{y}$ from $y$?
- often for **regression**

how does err **'guide' learning?**


## Ideal Mini-Target

interplay between **noise** and **error**: $P(y|\mathbf{x})$ and err define the **ideal mini-target $f(\mathbf{x})$**

$$P(y=1|\mathbf{x}) = 0.2,\quad P(y=2|\mathbf{x}) = 0.7,\quad P(y=3|\mathbf{x}) = 0.1$$

0/1 error $\text{err}(\tilde{y}, y) = [\![ \tilde{y} \ne y ]\!]$:

| $\tilde{y}$ | avg. err |
|---|---|
| 1 | 0.8 |
| 2 | 0.3 (∗) |
| 3 | 0.9 |
| 1.9 | 1.0 (really? :-)) |

$$f(\mathbf{x}) = \underset{y \in \mathcal{Y}}{\text{argmax}}\ P(y|\mathbf{x})$$

squared error $\text{err}(\tilde{y}, y) = (\tilde{y} - y)^2$:

| $\tilde{y}$ | avg. err |
|---|---|
| 1 | 1.1 |
| 2 | 0.3 |
| 3 | 1.5 |
| 1.9 | 0.29 (∗) |

$$f(\mathbf{x}) = \sum_{y \in \mathcal{Y}} y \cdot P(y|\mathbf{x})$$
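A quick numerical check of the two tables above (my own verification code, not from the lecture):

```python
import numpy as np

# the mini-target distribution from the slide
ys = np.array([1.0, 2.0, 3.0])
ps = np.array([0.2, 0.7, 0.1])

def avg_err(y_tilde, err):
    # expected pointwise error of predicting y_tilde under P(y|x)
    return np.sum(ps * err(y_tilde, ys))

err_01 = lambda yt, y: (yt != y).astype(float)
err_sq = lambda yt, y: (yt - y) ** 2

for yt in [1.0, 2.0, 3.0, 1.9]:
    print(yt, avg_err(yt, err_01), round(avg_err(yt, err_sq), 2))
# 0/1: 0.8, 0.3, 0.9, 1.0 -> argmax of P(y|x), i.e. 2, is best
# sq : 1.1, 0.3, 1.5, 0.29 -> the weighted mean 1.9 is best
```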


## Learning Flow with Error Measure

[learning flow diagram, now with err]
- unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ + noise (ideal credit approval formula)
- unknown $P$ on $\mathcal{X}$, generating $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ and $\mathbf{x}$; labels $y_1, y_2, \cdots, y_N$ and $y$
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ (set of candidate formulas) and **error measure err**
- final hypothesis $g \approx f$ ('learned' formula to be used)

extended VC theory/'philosophy' **works for most $\mathcal{H}$ and err**


## Fun Time

Consider the following $P(y|\mathbf{x})$ and $\text{err}(\tilde{y}, y) = |\tilde{y} - y|$. Which of the following is the ideal mini-target $f(\mathbf{x})$?

$$P(y=1|\mathbf{x}) = 0.10,\quad P(y=2|\mathbf{x}) = 0.35,\quad P(y=3|\mathbf{x}) = 0.15,\quad P(y=4|\mathbf{x}) = 0.40$$

1. 3 = weighted median from $P(y|\mathbf{x})$
2. 2.5 = average within $\mathcal{Y} = \{1, 2, 3, 4\}$
3. 2.85 = weighted mean from $P(y|\mathbf{x})$
4. 4 = argmax $P(y|\mathbf{x})$

### Reference Answer: 1

For the 'absolute error', the weighted median provably results in the minimum average err; here the cumulative probability $(0.10, 0.45, 0.60, 1.00)$ first crosses $0.5$ at $y = 3$.
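A quick check of the answer choices (my own verification code, not from the lecture):

```python
import numpy as np

ys = np.array([1.0, 2.0, 3.0, 4.0])
ps = np.array([0.10, 0.35, 0.15, 0.40])

def avg_abs_err(y_tilde):
    # expected absolute error of predicting y_tilde under P(y|x)
    return np.sum(ps * np.abs(y_tilde - ys))

for yt in [3.0, 2.5, 2.85, 4.0]:
    print(yt, round(avg_abs_err(yt), 3))
# 3.0 -> 0.95 (minimum), 2.5 -> 1.0, 2.85 -> 0.965, 4.0 -> 1.15
```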


## Choice of Error Measure

Fingerprint verification: $f(\mathbf{x}) = +1$ (you), $-1$ (intruder)

two types of error: **false accept** and **false reject**

| | $g = +1$ | $g = -1$ |
|---|---|---|
| $f = +1$ | no error | false reject |
| $f = -1$ | false accept | no error |

0/1 error penalizes both types **equally**


## Fingerprint Verification for Supermarket

same setup, but now with a cost matrix:

| | $g = +1$ | $g = -1$ |
|---|---|---|
| $f = +1$ | 0 | 10 |
| $f = -1$ | 1 | 0 |

- supermarket: fingerprint for discount
- false reject: **very unhappy customer, lose future business**
- false accept: give away a minor discount, but the intruder left a fingerprint :-)


## Fingerprint Verification for CIA

same setup, with a different cost matrix:

| | $g = +1$ | $g = -1$ |
|---|---|---|
| $f = +1$ | 0 | 1 |
| $f = -1$ | 1000 | 0 |

- CIA: fingerprint for entrance
- false accept: **very serious consequences!**
- false reject: unhappy employee, but so what? :-)


## Take-home Message for Now

err is **application/user-dependent**

algorithmic error measures $\widehat{\text{err}}$:
- true: just err
- plausible:
  - 0/1: minimum 'flipping noise' (NP-hard to optimize, **remember? :-)**)
  - squared: minimum Gaussian noise
- friendly: easy to optimize for $\mathcal{A}$
  - closed-form solution
  - convex objective function

$\widehat{\text{err}}$: more in the next lectures


## Learning Flow with Algorithmic Error Measure

[learning flow diagram, now with err and $\widehat{\text{err}}$]
- unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ + noise (ideal credit approval formula)
- unknown $P$ on $\mathcal{X}$, generating $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$ and $\mathbf{x}$; labels $y_1, y_2, \cdots, y_N$ and $y$
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$, using hypothesis set $\mathcal{H}$ and guided by $\widehat{\text{err}}$
- error measure err for evaluating the final hypothesis $g \approx f$ ('learned' formula to be used)

err: application goal; $\widehat{\text{err}}$: a key part of many $\mathcal{A}$

## Fun Time

Consider the err below for CIA. What is $E_{\text{in}}(g)$ when using this err?

| | $g = +1$ | $g = -1$ |
|---|---|---|
| $f = +1$ | 0 | 1 |
| $f = -1$ | 1000 | 0 |

1. $\frac{1}{N} \sum_{n=1}^{N} [\![ y_n \ne g(\mathbf{x}_n) ]\!]$
2. $\frac{1}{N} \Big( \sum_{y_n = +1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] + 1000 \sum_{y_n = -1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] \Big)$
3. $\frac{1}{N} \Big( \sum_{y_n = +1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] - 1000 \sum_{y_n = -1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] \Big)$
4. $\frac{1}{N} \Big( 1000 \sum_{y_n = +1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] + \sum_{y_n = -1} [\![ y_n \ne g(\mathbf{x}_n) ]\!] \Big)$

### Reference Answer: 2

When $y_n = -1$, a **false positive** made on such $(\mathbf{x}_n, y_n)$ is penalized **1000** times more!

## Weighted Classification

CIA cost (error, loss, . . .) matrix:

| | $h(\mathbf{x}) = +1$ | $h(\mathbf{x}) = -1$ |
|---|---|---|
| $y = +1$ | 0 | 1 |
| $y = -1$ | 1000 | 0 |

out-of-sample:
$$E_{\text{out}}(h) = \mathbb{E}_{(\mathbf{x}, y) \sim P} \begin{cases} 1 & \text{if } y = +1 \\ 1000 & \text{if } y = -1 \end{cases} \cdot [\![ y \ne h(\mathbf{x}) ]\!]$$

in-sample:
$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \ne h(\mathbf{x}_n) ]\!]$$

weighted classification: **different 'weight' for different $(\mathbf{x}, y)$**
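As a sketch (the function name and toy data are my own, not the lecture's), the weighted in-sample error is a one-liner:

```python
import numpy as np

def E_in_weighted(h, X, y, w_neg=1000.0):
    # weighted in-sample error: each -1 example costs w_neg per mistake,
    # each +1 example costs 1
    weights = np.where(y == -1, w_neg, 1.0)
    return np.mean(weights * (h(X) != y))

# usage: the 'lazy' constant hypothesis on an unbalanced toy set
h = lambda X: np.ones(len(X))
y = np.concatenate([np.ones(999_990), -np.ones(10)])
X = np.zeros((len(y), 1))      # features, unused by this constant h
print(E_in_weighted(h, X, y))  # 0.01, as in the final quiz below
```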


## Minimizing $E_{\text{in}}^{w}$ for Weighted Classification

$$E_{\text{in}}^{w}(h) = \frac{1}{N} \sum_{n=1}^{N} \begin{cases} 1 & \text{if } y_n = +1 \\ 1000 & \text{if } y_n = -1 \end{cases} \cdot [\![ y_n \ne h(\mathbf{x}_n) ]\!]$$

### Naïve Thoughts

- PLA: weights don't matter if $\mathcal{D}$ is linearly separable :-) (a perfect hypothesis has zero error under any weighting)
- pocket: modify the **pocket-replacement rule**: if $\mathbf{w}_{t+1}$ reaches a smaller $E_{\text{in}}^{w}$ than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$

pocket: some guarantee on $E_{\text{in}}^{0/1}$; modified pocket: similar guarantee on $E_{\text{in}}^{w}$?


## Systematic Route: Connecting $E_{\text{in}}^{w}$ and $E_{\text{in}}^{0/1}$

original problem (cost 1000 per false accept):

| | $h(\mathbf{x}) = +1$ | $h(\mathbf{x}) = -1$ |
|---|---|---|
| $y = +1$ | 0 | 1 |
| $y = -1$ | 1000 | 0 |

$$\mathcal{D}: (\mathbf{x}_1, +1),\ (\mathbf{x}_2, -1),\ (\mathbf{x}_3, -1),\ \ldots,\ (\mathbf{x}_{N-1}, +1),\ (\mathbf{x}_N, +1)$$

equivalent problem (plain 0/1 cost):

| | $h(\mathbf{x}) = +1$ | $h(\mathbf{x}) = -1$ |
|---|---|---|
| $y = +1$ | 0 | 1 |
| $y = -1$ | 1 | 0 |

$$\mathcal{D}': (\mathbf{x}_1, +1),\ \underbrace{(\mathbf{x}_2, -1), \ldots, (\mathbf{x}_2, -1)}_{1000 \text{ copies}},\ \underbrace{(\mathbf{x}_3, -1), \ldots, (\mathbf{x}_3, -1)}_{1000 \text{ copies}},\ \ldots,\ (\mathbf{x}_{N-1}, +1),\ (\mathbf{x}_N, +1)$$

after **copying each $-1$ example 1000 times**, $E_{\text{in}}^{w}$ for the LHS $\equiv$ $E_{\text{in}}^{0/1}$ for the RHS!
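A small sketch of this virtual-copying equivalence (my own illustration; error counts are compared rather than means, since the two data sets have different sizes and hence different $1/N$ normalizations):

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy data set and an arbitrary hypothesis
X = rng.uniform(-1, 1, size=(50, 2))
y = np.where(rng.random(50) < 0.8, 1, -1)
h = lambda X: np.where(X[:, 0] >= 0, 1, -1)

# weighted error count on the original data
w = np.where(y == -1, 1000, 1)
E_w = np.sum(w * (h(X) != y))

# plain 0/1 error count after copying each -1 example 1000 times
reps = np.where(y == -1, 1000, 1)
X2, y2 = np.repeat(X, reps, axis=0), np.repeat(y, reps)
E_01 = np.sum(h(X2) != y2)

print(E_w == E_01)  # True: the two error counts coincide
```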

## Weighted Pocket Algorithm

| | $h(\mathbf{x}) = +1$ | $h(\mathbf{x}) = -1$ |
|---|---|---|
| $y = +1$ | 0 | 1 |
| $y = -1$ | 1000 | 0 |

using 'virtual copying', the **weighted pocket algorithm** includes (see the sketch after this list):
- weighted PLA: randomly check **$-1$ example** mistakes with **1000** times higher probability
- weighted pocket replacement: if $\mathbf{w}_{t+1}$ reaches a smaller $E_{\text{in}}^{w}$ than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$

systematic route (called 'reduction'): **can be applied to many other algorithms!**
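Below is a minimal sketch of a weighted pocket built on the virtual-copying idea; the function name, defaults, and structure are my assumptions, not the lecture's reference implementation:

```python
import numpy as np

def weighted_pocket(X, y, w_neg=1000.0, T=1000, rng=None):
    """A minimal weighted-pocket sketch using 'virtual copying'."""
    rng = rng or np.random.default_rng(0)
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add bias coordinate x_0 = 1
    weights = np.where(y == -1, w_neg, 1.0)    # per-example costs

    def E_w(w):
        # weighted in-sample error count
        return np.sum(weights * (np.sign(Xb @ w) != y))

    w = np.zeros(Xb.shape[1])
    w_hat = w.copy()
    for _ in range(T):
        mistakes = np.flatnonzero(np.sign(Xb @ w) != y)
        if len(mistakes) == 0:
            return w
        # pick a mistake with probability proportional to its weight,
        # i.e. -1 mistakes are checked 1000 times more often
        p = weights[mistakes] / weights[mistakes].sum()
        n = rng.choice(mistakes, p=p)
        w = w + y[n] * Xb[n]          # PLA-style correction
        if E_w(w) < E_w(w_hat):       # weighted pocket replacement
            w_hat = w.copy()
    return w_hat
```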


## Fun Time

Consider the CIA cost matrix. If there are 10 examples with $y_n = -1$ (intruder) and 999,990 examples with $y_n = +1$ (you), what would $E_{\text{in}}^{w}(h)$ be for a constant $h(\mathbf{x})$ that always returns $+1$?

| | $h(\mathbf{x}) = +1$ | $h(\mathbf{x}) = -1$ |
|---|---|---|
| $y = +1$ | 0 | 1 |
| $y = -1$ | 1000 | 0 |

1. 0.001
2. 0.01
3. 0.1
4. 1

### Reference Answer: 2

The constant $h$ errs only on the 10 intruder examples, each costing 1000, so $E_{\text{in}}^{w}(h) = \frac{10 \times 1000}{1{,}000{,}000} = 0.01$. While the quiz is a simple calculation, it is not uncommon for the data to be very **unbalanced** in such an application. Properly 'setting' the weights can be used to avoid the lazy constant prediction.