Machine Learning Foundations
( 機器學習基石)
Lecture 2: Learning to Answer Yes/No
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University
( 國立台灣大學資訊工程系)
Learning to Answer Yes/No
Roadmap
1 When Can Machines Learn?
Lecture 1: The Learning Problem
(A takes D and H to get g)
Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
Perceptron Learning Algorithm (PLA)
Guarantee of PLA
Non-Separable Data
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Learning to Answer Yes/No Perceptron Hypothesis Set
Credit Approval Problem Revisited
Applicant Information
age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: 200,000

unknown target function f : X → Y
(ideal credit approval formula)

training examples D : (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

learning algorithm A

final hypothesis g ≈ f
('learned' formula to be used)

hypothesis set H
(set of candidate formulas)
what hypothesis set can we use?
Learning to Answer Yes/No Perceptron Hypothesis Set
A Simple Hypothesis Set: the ‘Perceptron’
age: 23 years
annual salary: NTD 1,000,000
year in job: 0.5 year
current debt: 200,000

• For x = (x_1, x_2, · · · , x_d), the 'features of customer', compute a weighted 'score' and
    approve credit if Σ_{i=1}^{d} w_i x_i > threshold
    deny credit if Σ_{i=1}^{d} w_i x_i < threshold
• Y: {+1 (good), −1 (bad)}, 0 ignored — the linear formulas h ∈ H are
    h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) − threshold )
called ‘perceptron’ hypothesis historically
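As a rough sketch of this hypothesis in code (NumPy-based; the feature values, weights, and threshold below are made up for illustration, not taken from the lecture):

import numpy as np

def perceptron_h(x, w, threshold):
    """Perceptron hypothesis: compare the weighted score sum_i w_i * x_i with a threshold."""
    score = np.dot(w, x)
    return +1 if score > threshold else -1   # +1: approve credit, -1: deny credit

# hypothetical customer: [age, annual salary, years in current job, current debt]
x = np.array([23.0, 1_000_000.0, 0.5, 200_000.0])
w = np.array([0.1, 1e-5, 2.0, -2e-5])        # made-up weights
print(perceptron_h(x, w, threshold=5.0))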
Learning to Answer Yes/No Perceptron Hypothesis Set
Vector Form of Perceptron Hypothesis
h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) − threshold )
     = sign( ( Σ_{i=1}^{d} w_i x_i ) + (−threshold) · (+1) )      with w_0 = −threshold and x_0 = +1
     = sign( Σ_{i=0}^{d} w_i x_i )
     = sign( w^T x )
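A small sketch of the same equivalence in code, assuming the hypothetical weights and threshold from the previous snippet: prepend x_0 = +1 to x and fold −threshold into w_0, so the hypothesis becomes a single inner product.

import numpy as np

def perceptron_vec(x, w_aug):
    """Vector form h(x) = sign(w^T x), with the bias folded into w_0 and x_0 = 1."""
    x_aug = np.concatenate(([1.0], x))       # 'tall' x with x_0 = +1
    return int(np.sign(w_aug @ x_aug))       # np.sign(0) = 0; the lecture ignores that boundary case

w, threshold = np.array([0.1, 1e-5, 2.0, -2e-5]), 5.0
w_aug = np.concatenate(([-threshold], w))    # 'tall' w with w_0 = -threshold
x = np.array([23.0, 1_000_000.0, 0.5, 200_000.0])
print(perceptron_vec(x, w_aug))              # agrees with the threshold form above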
• each 'tall' w represents a hypothesis h & is multiplied with 'tall' x — will use tall versions to simplify notation
what do perceptrons h 'look like'?
Learning to Answer Yes/No Perceptron Hypothesis Set
Perceptrons in R²
h(x) = sign( w_0 + w_1 x_1 + w_2 x_2 )
• customer features x: points on the plane (or points in R^d)
• labels y: ◦ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in R^d) — positive on one side of a line, negative on the other side
• different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers
Learning to Answer Yes/No Perceptron Hypothesis Set
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword occurrences, and output +1 indicates spam. Which keywords below should have large positive weights in a good perceptron for the task?
1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal
3. machine, learning, statistics, textbook
4. national, Taiwan, university, coursera

Reference Answer: 2
The occurrence of keywords with positive weights increases the 'spam score', and hence those keywords should often appear in spam messages.
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Select g from H
H = all possible perceptrons, g = ?

• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally g(x_n) = f(x_n) = y_n
• difficult: H is of infinite size
• idea: start from some g_0, and 'correct' its mistakes on D

will represent g_0 by its weight vector w_0
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Perceptron Learning Algorithm
start from some w_0 (say, 0), and 'correct' its mistakes on D

For t = 0, 1, . . .
1. find a mistake of w_t, called (x_{n(t)}, y_{n(t)}), i.e. sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
2. (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
. . . until no more mistakes
return last w (called w_PLA) as g
[Figure: geometric view of the correction w ← w + y x, shown for a mistake with y = +1 (w rotates toward x) and a mistake with y = −1 (w rotates away from x)]
That’s it!
—A fault confessed is half redressed.
:-)
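A minimal sketch of PLA in NumPy (not the course's reference code); it assumes X is an N×(d+1) array with x_0 = 1 already prepended, y holds ±1 labels, and it only terminates when the data admit a perfect perceptron:

import numpy as np

def pla(X, y):
    """Perceptron Learning Algorithm: correct one mistake at a time until none remain."""
    w = np.zeros(X.shape[1])                 # w_0 = 0
    while True:
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:
            return w                         # w_PLA, returned as g
        n = mistakes[0]                      # any mistaken example (x_n, y_n)
        w = w + y[n] * X[n]                  # (try to) correct it: w <- w + y_n x_n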
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Practical Implementation of PLA
start from some w_0 (say, 0), and 'correct' its mistakes on D

Cyclic PLA
For t = 0, 1, . . .
1. find the next mistake of w_t, called (x_{n(t)}, y_{n(t)}), i.e. sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
2. correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
. . . until a full cycle of not encountering mistakes

'next' can follow the naïve cycle (1, · · · , N) or a precomputed random cycle
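A sketch of the cyclic variant under the same data assumptions as the previous snippet; the 'precomputed random cycle' is simply a fixed permutation of the indices chosen before the loop starts:

import numpy as np

def cyclic_pla(X, y, rng=None):
    """Cyclic PLA: scan examples in a fixed order; stop after a full pass with no mistakes."""
    N = X.shape[0]
    order = np.arange(N) if rng is None else rng.permutation(N)   # naive or random cycle
    w = np.zeros(X.shape[1])
    clean_pass = False
    while not clean_pass:
        clean_pass = True
        for n in order:                       # one full cycle over D
            if np.sign(w @ X[n]) != y[n]:     # the next mistake
                w = w + y[n] * X[n]
                clean_pass = False
    return w

# cyclic_pla(X, y) follows the naive cycle; pass rng=np.random.default_rng(0) for a random cycle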
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Seeing is Believing

[Figure: a 2-D run of PLA, showing the initial w and nine updates w(t) → w(t+1), each triggered by a mistaken example (x_1, x_9, x_14, x_3, . . .), ending at w_PLA]

finally, worked like a charm with < 20 lines!!
(note: made x_{i,0} = 1 for visual purpose)
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Some Remaining Issues of PLA
‘correct’ mistakes on D
until no mistakes

Algorithmic: halt (with no mistake)?
• naïve cyclic: ??
• random cyclic: ??
• other variant: ??

Learning: g ≈ f ?
• on D, if halt, yes (no mistake)
• outside D: ??
• if not halting: ??

[to be shown] if (. . .), after 'enough' corrections, any PLA variant halts
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
sign(w_t^T x_n) ≠ y_n,   w_{t+1} ← w_t + y_n x_n

1. w_{t+1}^T x_n = y_n
2. sign(w_{t+1}^T x_n) = y_n
3. y_n w_{t+1}^T x_n ≥ y_n w_t^T x_n
4. y_n w_{t+1}^T x_n < y_n w_t^T x_n

Reference Answer: 3
Simply multiply the second part of the rule by y_n x_n. The result shows that the rule somewhat 'tries to correct the mistake.'
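To spell out the missing step: take the inner product of the update rule with y_n x_n and use y_n² = 1,

y_n w_{t+1}^T x_n = y_n (w_t + y_n x_n)^T x_n = y_n w_t^T x_n + y_n² ‖x_n‖² ≥ y_n w_t^T x_n,

so the 'correctness score' y_n w^T x_n on the mistaken example never decreases after the update.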
Learning to Answer Yes/No Guarantee of PLA
Linear Separability
• if PLA halts (i.e. no more mistakes), then D allows some w to make no mistake (a necessary condition)
• call such D linearly separable

[Figure: three toy datasets, one linearly separable and two not linearly separable]

assume a linearly separable D: does PLA always halt?
Learning to Answer Yes/No Guarantee of PLA
PLA Fact: w t Gets More Aligned with w f
linearly separable D ⇔ exists a perfect w_f such that y_n = sign(w_f^T x_n)

• w_f perfect, hence every x_n correctly away from the line:
  y_{n(t)} w_f^T x_{n(t)} ≥ min_n y_n w_f^T x_n > 0
• w_f^T w_t ↑ by updating with any (x_{n(t)}, y_{n(t)}):
  w_f^T w_{t+1} = w_f^T ( w_t + y_{n(t)} x_{n(t)} )
                ≥ w_f^T w_t + min_n y_n w_f^T x_n
                > w_f^T w_t + 0

w_t appears more aligned with w_f after the update (really?)
Learning to Answer Yes/No Guarantee of PLA
PLA Fact: w t Does Not Grow Too Fast
w_t changed only when mistake
⇔ sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
⇔ y_{n(t)} w_t^T x_{n(t)} ≤ 0

• a mistake 'limits' the growth of ‖w_t‖², even when updating with the 'longest' x_n:
  ‖w_{t+1}‖² = ‖w_t + y_{n(t)} x_{n(t)}‖²
             = ‖w_t‖² + 2 y_{n(t)} w_t^T x_{n(t)} + ‖y_{n(t)} x_{n(t)}‖²
             ≤ ‖w_t‖² + 0 + ‖y_{n(t)} x_{n(t)}‖²
             ≤ ‖w_t‖² + max_n ‖y_n x_n‖²

start from w_0 = 0; after T mistake corrections,
  (w_f^T / ‖w_f‖) (w_T / ‖w_T‖) ≥ √T · constant
Learning to Answer Yes/No Guarantee of PLA
Fun Time
Let’s upper-bound T , the number of mistakes that PLA ‘corrects’.
Define R² = max_n ‖x_n‖²  and  ρ = min_n y_n (w_f^T / ‖w_f‖) x_n.
We want to show that T ≤ □. Express the upper bound by the two terms above.
1. R / ρ
2. R² / ρ²
3. R / ρ²
4. ρ² / R²

Reference Answer: 2
The maximum value of (w_f^T / ‖w_f‖) (w_t / ‖w_t‖) is 1. Since T mistake corrections increase the inner product by √T · constant, the maximum number of corrected mistakes is 1/constant².
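Filling in the details behind the answer (combining the two PLA facts above, with 'constant' = ρ/R and w_0 = 0):

w_f^T w_T ≥ T · min_n y_n w_f^T x_n = T · ρ ‖w_f‖        (alignment grows by at least ρ‖w_f‖ per correction)
‖w_T‖² ≤ T · max_n ‖x_n‖² = T · R²                        (squared length grows by at most R² per correction)

therefore 1 ≥ (w_f^T w_T) / (‖w_f‖ ‖w_T‖) ≥ (T ρ ‖w_f‖) / (‖w_f‖ · √T · R) = √T · ρ/R,
hence T ≤ R² / ρ².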
Learning to Answer Yes/No Non-Separable Data
More about PLA
Guarantee
as long as D is linearly separable and PLA corrects by mistake
• inner product of w_f and w_t grows fast; length of w_t grows slowly
• PLA 'lines' are more and more aligned with w_f ⇒ halts

Pros
simple to implement, fast, works in any dimension d

Cons
• 'assumes' a linearly separable D to halt — a property unknown in advance (no need for PLA if we knew w_f)
• not fully sure how long halting takes (ρ depends on w_f) — though practically fast

what if D is not linearly separable?
Learning to Answer Yes/No Non-Separable Data
Learning with Noisy Data
unknown target function f : X → Y + noise
(ideal credit approval formula)

training examples D : (x_1, y_1), · · · , (x_N, y_N)
(historical records in bank)

learning algorithm A

final hypothesis g ≈ f
('learned' formula to be used)

hypothesis set H
(set of candidate formulas)

how to at least get g ≈ f on noisy D?
Learning to Answer Yes/No Non-Separable Data
Line with Noise Tolerance
• assume 'little' noise: y_n = f(x_n) usually
• if so, g ≈ f on D ⇔ y_n = g(x_n) usually
• how about
  w_g ← argmin_w Σ_{n=1}^{N} ⟦ y_n ≠ sign(w^T x_n) ⟧
  — NP-hard to solve, unfortunately
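For concreteness, the quantity inside the argmin is just the in-sample mistake count; a small sketch, using the same data conventions as the earlier PLA snippet:

import numpy as np

def num_mistakes(w, X, y):
    """In-sample 0/1 error of w: the number of examples with y_n != sign(w^T x_n)."""
    return int(np.sum(np.sign(X @ w) != y))

Exactly minimizing this count over all possible w is the NP-hard part; the pocket algorithm below only settles for the best w encountered along PLA-style updates.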
can we modify PLA to get an 'approximately good' g?
Learning to Answer Yes/No Non-Separable Data
to get an ‘approximately good’ g?Learning to Answer Yes/No Non-Separable Data
Pocket Algorithm
modify the PLA algorithm (black lines) by keeping the best weights in a pocket

initialize pocket weights ŵ
For t = 0, 1, · · ·
1. find a (random) mistake of w_t, called (x_{n(t)}, y_{n(t)})
2. (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
3. if w_{t+1} makes fewer mistakes than ŵ, replace ŵ by w_{t+1}
. . . until enough iterations
return ŵ (called w_POCKET) as g
a simple modification of PLA to find (somewhat) ‘best’ weights
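A sketch of the pocket algorithm under the same conventions as the earlier snippets (X carries x_0 = 1, labels are ±1); the iteration budget max_iter is a tunable choice, not something fixed by the lecture:

import numpy as np

def pocket(X, y, max_iter=1000, seed=0):
    """Pocket algorithm: PLA-style corrections, but keep the best weights seen so far."""
    rng = np.random.default_rng(seed)
    err = lambda v: int(np.sum(np.sign(X @ v) != y))   # in-sample mistake count
    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), err(w)                     # pocket weights and their error
    for _ in range(max_iter):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:
            return w                                   # no mistakes left: D was separable
        n = rng.choice(mistakes)                       # a random mistake of w_t
        w = w + y[n] * X[n]                            # (try to) correct it
        if err(w) < best:                              # step 3: update the pocket
            w_hat, best = w.copy(), err(w)
    return w_hat                                       # w_POCKET, returned as g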
Learning to Answer Yes/No Non-Separable Data
Fun Time
Should we use pocket or PLA?
Since we do not know whether D is linearly separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linearly separable, what's the difference between the two?
1. pocket on D is slower than PLA
2. pocket on D is faster than PLA
3. pocket on D returns a better g in approximating f than PLA
4. pocket on D returns a worse g in approximating f than PLA

Reference Answer: 1
Because pocket needs to check whether w_{t+1} is better than ŵ in each iteration, it is slower than PLA. On a linearly separable D, w_POCKET is the same as w_PLA, both making no mistakes.