## Machine Learning Foundations (機器學習基石)

### Lecture 2: Learning to Answer Yes/No

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

1. **When** Can Machines Learn?
   - Lecture 1: The Learning Problem ($\mathcal{A}$ takes $\mathcal{D}$ and $\mathcal{H}$ to get $g$)
   - Lecture 2: Learning to Answer Yes/No
     - Perceptron Hypothesis Set
     - Perceptron Learning Algorithm (PLA)
     - Guarantee of PLA
     - Non-Separable Data
2. Why Can Machines Learn?
3. How Can Machines Learn?
4. How Can Machines Learn Better?

## Credit Approval Problem Revisited

Applicant information:

| feature | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

The learning setup:

- unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D}\colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- final hypothesis $g \approx f$ (‘learned’ formula to be used)
- hypothesis set $\mathcal{H}$ (set of candidate formulas)

What hypothesis set can we use?

## A Simple Hypothesis Set: the ‘Perceptron’

Applicant features: age 23 years, annual salary NTD 1,000,000, year in job 0.5 year, current debt 200,000.

- For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$, the ‘features of customer’, compute a weighted ‘score’ and
  - approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  - deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
- $\mathcal{Y}\colon \{+1 \text{ (good)}, -1 \text{ (bad)}\}$, with $0$ ignored. The linear formulas $h \in \mathcal{H}$ are

$$h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$

This is historically called the ‘perceptron’ hypothesis.
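The scoring rule can be sketched in a few lines of Python. This is only an illustration: the weights, the threshold, and the rescaling of the applicant's features are hypothetical values, not taken from the lecture.

```python
def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


def perceptron(weights, threshold, features):
    """Evaluate h(x) = sign(sum_i w_i * x_i - threshold)."""
    score = sum(w * x for w, x in zip(weights, features))
    return sign(score - threshold)


# Hypothetical weights and rescaled features, chosen only for illustration:
# (age in years, annual salary in millions NTD, years in job, debt in millions NTD)
weights = [0.0, 1.0, 0.5, -2.0]
applicant = [23, 1.0, 0.5, 0.2]

decision = perceptron(weights, threshold=0.5, features=applicant)
print("approve" if decision == 1 else "deny")
```

With these made-up numbers the score exceeds the threshold, so the applicant is approved; raising the threshold far enough flips the decision.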


## Vector Form of Perceptron Hypothesis

$$
\begin{aligned}
h(\mathbf{x}) &= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right) \\
&= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + \underbrace{(-\text{threshold})}_{w_0} \cdot \underbrace{(+1)}_{x_0}\right) \\
&= \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right) \\
&= \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\right)
\end{aligned}
$$

- each ‘tall’ $\mathbf{w}$ represents a hypothesis $h$ and is multiplied with the ‘tall’ $\mathbf{x}$; we will use the tall versions to simplify notation
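As a quick sanity check, the threshold form and the ‘tall’ vector form can be compared numerically. The sketch below uses hypothetical weights and points; `augment` is an illustrative helper name for prepending $x_0 = 1$.

```python
def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def h_threshold(w, threshold, x):
    """Threshold form: sign(sum_{i=1..d} w_i x_i - threshold)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - threshold)


def h_vector(w_tall, x_tall):
    """Vector form: sign(w^T x) with w_0 = -threshold and x_0 = +1."""
    return sign(sum(wi * xi for wi, xi in zip(w_tall, x_tall)))


def augment(x):
    """Prepend the constant coordinate x_0 = 1 to obtain the 'tall' x."""
    return [1.0] + list(x)


# Hypothetical weights, threshold, and sample points.
w = [1.0, 0.5, -2.0]
threshold = 0.5
w_tall = [-threshold] + w

points = [[1.0, 0.5, 0.2], [0.1, 0.2, 0.9], [0.0, 0.0, 0.0]]
agree = all(h_threshold(w, threshold, x) == h_vector(w_tall, augment(x)) for x in points)
print(agree)
```

Folding the threshold into $w_0$ means later code only ever needs one dot product per example.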

what do perceptrons h ‘look like’?

## Perceptrons in $\mathbb{R}^2$

$$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$$

- customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)
- labels $y$: ◦ (+1), × (−1)
- hypothesis $h$: **lines** (or hyperplanes in $\mathbb{R}^d$), positive on one side of a line, negative on the other side
- different lines classify customers differently

perceptrons ⇔ **linear (binary) classifiers**
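To make the ‘line’ picture concrete, the boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$ can be solved for $x_2$ whenever $w_2 \neq 0$. The following sketch assumes a hypothetical weight vector describing the line $x_1 + x_2 = 1$.

```python
def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def classify(w, x1, x2):
    """Evaluate h(x) = sign(w0 + w1*x1 + w2*x2) for a point on the plane."""
    w0, w1, w2 = w
    return sign(w0 + w1 * x1 + w2 * x2)


def boundary_x2(w, x1):
    """Return the x2 coordinate of the decision line at x1 (assumes w2 != 0)."""
    w0, w1, w2 = w
    return -(w0 + w1 * x1) / w2


# Hypothetical weights: the decision line is x1 + x2 = 1.
w = (-1.0, 1.0, 1.0)

print(classify(w, 1.0, 1.0))   # a point on the positive side
print(classify(w, 0.0, 0.0))   # a point on the negative side
print(boundary_x2(w, 0.0))     # where the line crosses the x2 axis
```

Changing `w` moves or rotates the line, which is exactly how different hypotheses classify customers differently.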


## Fun Time

Consider using a perceptron to detect spam messages.

Assume that each email is represented by the frequency of keyword occurrence, and output +1 indicates spam. Which keywords below should have large positive weights in a **good perceptron** for the task?

1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal
3. machine, learning, statistics, textbook
4. national, Taiwan, university, coursera

Reference Answer: 2

The occurrence of keywords with positive weights increases the ‘spam score’, and hence those keywords should appear often in spam.
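A toy version of such a spam perceptron might look as follows. The keyword weights and the zero threshold are invented for illustration only; a real perceptron would learn them from data.

```python
def spam_score(weights, counts):
    """Weighted sum of keyword frequencies; unknown words contribute nothing."""
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())


def is_spam(weights, counts, threshold=0.0):
    """Return +1 (spam) if the score exceeds the threshold, otherwise -1."""
    return 1 if spam_score(weights, counts) > threshold else -1


# Hypothetical weights: spam-like words are positive, course-like words negative.
weights = {
    "free": 2.0, "drug": 2.0, "fantastic": 1.0, "deal": 1.5,
    "machine": -1.0, "learning": -1.0, "textbook": -1.0,
}

print(is_spam(weights, {"free": 2, "deal": 1}))
print(is_spam(weights, {"machine": 1, "learning": 1}))
```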


## Select $g$ from $\mathcal{H}$

$\mathcal{H}$ = all possible perceptrons, $g = ?$

- want: $g \approx f$ (hard when $f$ unknown)
- almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
- difficult: $\mathcal{H}$ is of **infinite** size
- idea: start from some $g_0$, and ‘correct’ its mistakes on $\mathcal{D}$

We will represent $g_0$ by its weight vector $\mathbf{w}_0$.


## Perceptron Learning Algorithm

Start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on $\mathcal{D}$.

For $t = 0, 1, \ldots$

1. find a mistake of $\mathbf{w}_t$, called $(\mathbf{x}_{n(t)}, y_{n(t)})$:

   $$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$$

2. (try to) correct the mistake by

   $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

... until no more mistakes; return the last $\mathbf{w}$ (called $\mathbf{w}_{\text{PLA}}$) as $g$.

[Figure: the update $\mathbf{w} + y\mathbf{x}$ illustrated for $y = +1$ and for $y = -1$.]

That’s it! A fault confessed is half redressed. :-)
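The loop above can be sketched in plain Python. This is a minimal illustration, not the lecture's own code: the tiny dataset is made up, each example is already in ‘tall’ form with $x_0 = 1$, and `max_updates` is an added safety cap (an assumption) so that the sketch cannot run forever on data that is not linearly separable.

```python
def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def pla(X, y, max_updates=10000):
    """Run PLA from w_0 = 0; return the final weights and the number of updates."""
    w = [0.0] * len(X[0])
    for updates in range(max_updates + 1):
        # Step 1: find any example that the current w gets wrong.
        mistake = next(
            ((x, label) for x, label in zip(X, y) if sign(dot(w, x)) != label),
            None,
        )
        if mistake is None:
            return w, updates  # no more mistakes: halt
        # Step 2: correct the mistake with w <- w + y * x.
        x, label = mistake
        w = [wi + label * xi for wi, xi in zip(w, x)]
    raise RuntimeError("no halt within max_updates; data may not be linearly separable")


# Hypothetical, linearly separable toy data (first coordinate is x_0 = 1).
X = [[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 0]]
y = [1, 1, -1, -1]

w_pla, num_updates = pla(X, y)
print(w_pla, num_updates)
```

Note that $\operatorname{sign}(0) = 0$ never equals a label of $\pm 1$, so starting from $\mathbf{w}_0 = \mathbf{0}$ the first example visited always counts as a mistake.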


## Practical Implementation of PLA

Start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on $\mathcal{D}$.

**Cyclic PLA**

For $t = 0, 1, \ldots$

1. find **the next** mistake of $\mathbf{w}_t$, called $(\mathbf{x}_{n(t)}, y_{n(t)})$:

   $$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$$

2. correct the mistake by

   $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

... until **a full cycle of not encountering mistakes**

**next** can follow the naïve cycle $(1, \cdots, N)$ or a precomputed random cycle
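A possible sketch of cyclic PLA is shown below. The `shuffle` and `seed` parameters (for the precomputed random cycle) and the `max_updates` cap are illustrative assumptions, and the toy data is made up.

```python
import random


def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def cyclic_pla(X, y, shuffle=False, seed=0, max_updates=10000):
    """Visit examples in a fixed cycle; halt after a full cycle without mistakes."""
    n = len(X)
    order = list(range(n))           # naive cycle: 0, 1, ..., N-1
    if shuffle:
        random.Random(seed).shuffle(order)  # precomputed random cycle
    w = [0.0] * len(X[0])
    updates = 0
    clean_streak = 0                 # consecutive examples visited without a mistake
    position = 0
    while clean_streak < n:
        i = order[position]
        if sign(dot(w, X[i])) != y[i]:
            if updates >= max_updates:
                raise RuntimeError("no halt within max_updates")
            w = [wi + y[i] * xi for wi, xi in zip(w, X[i])]
            updates += 1
            clean_streak = 0         # w changed, so the cycle check restarts
        else:
            clean_streak += 1
        position = (position + 1) % n
    return w, updates


# Hypothetical, linearly separable toy data (first coordinate is x_0 = 1).
X = [[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 0]]
y = [1, 1, -1, -1]

w_naive, updates_naive = cyclic_pla(X, y)
w_random, updates_random = cyclic_pla(X, y, shuffle=True, seed=1)
print(w_naive, updates_naive)
```

Because the weights do not change during a clean streak, $N$ consecutive correct visits certify that the current $\mathbf{w}$ makes no mistake on $\mathcal{D}$.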


## Seeing is Believing

[Figure: a PLA run shown frame by frame, from ‘initially’ through updates 1 to 9 to ‘finally’. Each frame shows $\mathbf{w}(t)$, the mistaken example, and the updated $\mathbf{w}(t+1)$. The mistaken examples are, in order, $\mathbf{x}_1, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_3, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_9$; the final frame shows $\mathbf{w}_{\text{PLA}}$.]

**worked like a charm with < 20 lines!!**

(note: made $x_i \gg x_0 = 1$ for visual purpose)


## Some Remaining Issues of PLA

‘correct’ mistakes on $\mathcal{D}$ **until no mistakes**

Algorithmic: halt (with no mistake)?

- naïve cyclic: ??
- random cyclic: ??
- other variant: ??

Learning: $g \approx f$?

- on $\mathcal{D}$, if halt, yes (no mistake)
- outside $\mathcal{D}$: ??
- if not halting: ??

[to be shown] if (...), after ‘enough’ corrections, **any PLA variant halts**


## Fun Time

Let’s try to think about why PLA may work.

Let $n = n(t)$. According to the rule of PLA below, which formula is true?

$$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_n\right) \neq y_n, \qquad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n \mathbf{x}_n$$

1. $\mathbf{w}_{t+1}^T \mathbf{x}_n = y_n$
2. $\operatorname{sign}(\mathbf{w}_{t+1}^T \mathbf{x}_n) = y_n$
3. $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n \ge y_n \mathbf{w}_t^T \mathbf{x}_n$
4. $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n < y_n \mathbf{w}_t^T \mathbf{x}_n$

Reference Answer: 3

Simply multiply the second part of the rule by $y_n \mathbf{x}_n$. The result shows that **the rule somewhat ‘tries to correct the mistake.’**
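Written out, the multiplication goes as follows, using only the update rule and the fact that $y_n \in \{+1, -1\}$ implies $y_n^2 = 1$:

$$
\begin{aligned}
y_n \mathbf{w}_{t+1}^T \mathbf{x}_n
&= y_n \left(\mathbf{w}_t + y_n \mathbf{x}_n\right)^T \mathbf{x}_n \\
&= y_n \mathbf{w}_t^T \mathbf{x}_n + y_n^2 \,\|\mathbf{x}_n\|^2 \\
&= y_n \mathbf{w}_t^T \mathbf{x}_n + \|\mathbf{x}_n\|^2 \\
&\ge y_n \mathbf{w}_t^T \mathbf{x}_n
\end{aligned}
$$

The score moves in the direction of the correct label, although a single update need not be enough to flip the sign.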


## Linear Separability

- **if** PLA halts (i.e. no more mistakes), **(necessary condition)** $\mathcal{D}$ allows some $\mathbf{w}$ to make no mistake
- call such $\mathcal{D}$ **linear separable**

[Figure: three example datasets, labeled (linear separable), (not linear separable), (not linear separable).]

Assume linear separable $\mathcal{D}$: does PLA always **halt?**


## PLA Fact: $\mathbf{w}_t$ Gets More Aligned with $\mathbf{w}_f$

linear separable $\mathcal{D}$ ⇔ **exists perfect $\mathbf{w}_f$ such that $y_n = \operatorname{sign}(\mathbf{w}_f^T \mathbf{x}_n)$**

- $\mathbf{w}_f$ perfect, hence every $\mathbf{x}_n$ is correctly away from the line:

  $$y_{n(t)} \mathbf{w}_f^T \mathbf{x}_{n(t)} \ge \min_n y_n \mathbf{w}_f^T \mathbf{x}_n > 0$$

- $\mathbf{w}_f^T \mathbf{w}_t \uparrow$ by updating with any $(\mathbf{x}_{n(t)}, y_{n(t)})$:

  $$
  \begin{aligned}
  \mathbf{w}_f^T \mathbf{w}_{t+1} &= \mathbf{w}_f^T \left(\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\right) \\
  &\ge \mathbf{w}_f^T \mathbf{w}_t + \min_n y_n \mathbf{w}_f^T \mathbf{x}_n \\
  &> \mathbf{w}_f^T \mathbf{w}_t + 0
  \end{aligned}
  $$

$\mathbf{w}_t$ appears more aligned with $\mathbf{w}_f$ after update **(really?)**


## PLA Fact: $\mathbf{w}_t$ Does Not Grow Too Fast

**$\mathbf{w}_t$ changed only when mistake** ⇔ $\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$ ⇔ $y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} \le 0$

- mistake ‘limits’ $\|\mathbf{w}_t\|^2$ growth, even when updating with the ‘longest’ $\mathbf{x}_n$:

  $$
  \begin{aligned}
  \|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
  &= \|\mathbf{w}_t\|^2 + 2 y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
  &\le \|\mathbf{w}_t\|^2 + 0 + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
  &\le \|\mathbf{w}_t\|^2 + \max_n \|y_n \mathbf{x}_n\|^2
  \end{aligned}
  $$

Start from $\mathbf{w}_0 = \mathbf{0}$; after $T$ mistake corrections,

$$\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \sqrt{T} \cdot \text{constant}$$
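One way to see where the $\sqrt{T}$ comes from is to chain the two facts over $T$ updates starting from $\mathbf{w}_0 = \mathbf{0}$. The first fact adds at least $\min_n y_n \mathbf{w}_f^T \mathbf{x}_n$ to the inner product per update, and the second adds at most $\max_n \|\mathbf{x}_n\|^2$ to the squared length per update (note that $\|y_n \mathbf{x}_n\| = \|\mathbf{x}_n\|$):

$$
\mathbf{w}_f^T \mathbf{w}_T \ge T \cdot \min_n y_n \mathbf{w}_f^T \mathbf{x}_n,
\qquad
\|\mathbf{w}_T\|^2 \le T \cdot \max_n \|\mathbf{x}_n\|^2
$$

Dividing the first bound by $\|\mathbf{w}_f\|\,\|\mathbf{w}_T\|$ and using the second to bound $\|\mathbf{w}_T\| \le \sqrt{T} \cdot \max_n \|\mathbf{x}_n\|$ gives

$$
\frac{\mathbf{w}_f^T \mathbf{w}_T}{\|\mathbf{w}_f\|\,\|\mathbf{w}_T\|}
\ge \frac{T \cdot \min_n y_n \mathbf{w}_f^T \mathbf{x}_n}{\|\mathbf{w}_f\| \cdot \sqrt{T} \cdot \max_n \|\mathbf{x}_n\|}
= \sqrt{T} \cdot \frac{\min_n y_n \mathbf{w}_f^T \mathbf{x}_n}{\|\mathbf{w}_f\| \cdot \max_n \|\mathbf{x}_n\|}
$$

so the constant depends only on $\mathbf{w}_f$ and the data, not on $T$.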


## Fun Time

Let’s upper-bound $T$, the number of mistakes that PLA ‘corrects’.

Define

$$R^2 = \max_n \|\mathbf{x}_n\|^2, \qquad \rho = \min_n y_n \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \mathbf{x}_n$$

We want to show that $T \le \square$. Express the upper bound by the two terms above.

1. $R/\rho$
2. $R^2/\rho^2$
3. $R/\rho^2$
4. $\rho^2/R^2$

Reference Answer: 2

The maximum value of $\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_t}{\|\mathbf{w}_t\|}$ is 1. Since $T$ mistake corrections **increase the inner product by $\sqrt{T} \cdot \text{constant}$**, the maximum number of corrected mistakes is $1/\text{constant}^2$.
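The bound $T \le R^2/\rho^2$ can be checked empirically on a small example. The sketch below is only an illustration: the toy data is made up, and `w_f` is a hand-picked separating vector (an assumption, since in practice $\mathbf{w}_f$ is unknown).

```python
import math


def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def count_pla_updates(X, y, max_updates=10000):
    """Run PLA from w_0 = 0 and return T, the number of corrected mistakes."""
    w = [0.0] * len(X[0])
    for updates in range(max_updates + 1):
        wrong = [i for i in range(len(X)) if sign(dot(w, X[i])) != y[i]]
        if not wrong:
            return updates
        i = wrong[0]
        w = [wi + y[i] * xi for wi, xi in zip(w, X[i])]
    raise RuntimeError("no halt within max_updates")


def mistake_bound(X, y, w_f):
    """Return R^2 / rho^2 for a separating w_f (requires rho > 0)."""
    r_squared = max(dot(x, x) for x in X)
    norm_f = math.sqrt(dot(w_f, w_f))
    rho = min(label * dot(w_f, x) / norm_f for x, label in zip(X, y))
    return r_squared / (rho * rho)


# Hypothetical, linearly separable toy data (first coordinate is x_0 = 1).
X = [[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 0]]
y = [1, 1, -1, -1]
w_f = [1, 2, 2]  # a hand-picked perfect separator for this data

T = count_pla_updates(X, y)
bound = mistake_bound(X, y, w_f)
print(T, bound)
```

Any perfect $\mathbf{w}_f$ yields a valid bound; a separator with a larger margin $\rho$ gives a tighter one.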

## More about PLA

Guarantee: as long as $\mathcal{D}$ is linear separable and PLA corrects by mistake,

- the inner product of $\mathbf{w}_f$ and $\mathbf{w}_t$ grows fast; the length of $\mathbf{w}_t$ grows slowly
- PLA ‘lines’ are more and more aligned with $\mathbf{w}_f$ ⇒ halts

Pros: simple to implement, fast, works in any dimension $d$

Cons:

- **‘assumes’ linear separable $\mathcal{D}$** to halt: property unknown in advance (no need for PLA if we know $\mathbf{w}_f$)
- not fully sure **how long halting takes** ($\rho$ depends on $\mathbf{w}_f$), though practically fast

What if $\mathcal{D}$ is not linear separable?


## Learning with **Noisy Data**

- unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$ **+ noise** (ideal credit approval formula)
- training examples $\mathcal{D}\colon (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- final hypothesis $g \approx f$ (‘learned’ formula to be used)
- hypothesis set $\mathcal{H}$ (set of candidate formulas)

How to at least get $g \approx f$ on **noisy** $\mathcal{D}$?

## Line with Noise Tolerance

- assume ‘little’ noise: $y_n = f(\mathbf{x}_n)$ **usually**
- if so, $g \approx f$ on $\mathcal{D}$ ⇔ $y_n = g(\mathbf{x}_n)$ **usually**
- how about

  $$\mathbf{w}_g \leftarrow \operatorname*{argmin}_{\mathbf{w}} \sum_{n=1}^{N} [\![\, y_n \neq \operatorname{sign}(\mathbf{w}^T \mathbf{x}_n) \,]\!]$$

  NP-hard to solve, unfortunately
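Evaluating the objective for one given $\mathbf{w}$ is easy; what is hard is minimizing it over all $\mathbf{w}$. A minimal sketch of the evaluation, on made-up data where one point appears with both labels (so no line can be perfect):

```python
def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def count_mistakes(w, X, y):
    """Sum over n of [[ y_n != sign(w^T x_n) ]], the number of mistakes on D."""
    return sum(1 for x, label in zip(X, y) if sign(dot(w, x)) != label)


# Hypothetical noisy data: the last example repeats the first point with the
# opposite label, so every w makes at least one mistake.
X = [[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 0], [1, 2, 2]]
y = [1, 1, -1, -1, -1]

print(count_mistakes([1, 2, 2], X, y))
print(count_mistakes([0, 0, 0], X, y))
```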

Can we modify PLA to get an ‘approximately good’ $g$?

## Pocket Algorithm

Modify the PLA algorithm by **keeping the best weights in pocket**.

**Initialize pocket weights $\hat{\mathbf{w}}$**

For $t = 0, 1, \cdots$

1. find a (random) mistake of $\mathbf{w}_t$, called $(\mathbf{x}_{n(t)}, y_{n(t)})$
2. (try to) correct the mistake by

   $$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

3. **if $\mathbf{w}_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$**

... until **enough iterations**; return **$\hat{\mathbf{w}}$ (called $\mathbf{w}_{\text{POCKET}}$) as $g$**.

A simple modification of PLA to find (somewhat) ‘best’ weights.
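The three steps might be sketched as follows. The `iterations` budget, the `seed`, and the toy dataset are assumptions made for illustration; the lecture only says ‘enough iterations’.

```python
import random


def sign(value):
    """Return +1 for a positive value, -1 for a negative value, 0 on the boundary."""
    return (value > 0) - (value < 0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def count_mistakes(w, X, y):
    """Number of examples in D that w classifies incorrectly."""
    return sum(1 for x, label in zip(X, y) if sign(dot(w, x)) != label)


def pocket(X, y, iterations=100, seed=0):
    """Run PLA updates on random mistakes while keeping the best w seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    best_w, best_err = list(w), count_mistakes(w, X, y)  # initialize the pocket
    for _ in range(iterations):
        wrong = [i for i in range(len(X)) if sign(dot(w, X[i])) != y[i]]
        if not wrong:
            break  # current w is perfect and is already in the pocket
        i = rng.choice(wrong)                            # 1. a random mistake
        w = [wi + y[i] * xi for wi, xi in zip(w, X[i])]  # 2. PLA correction
        err = count_mistakes(w, X, y)
        if err < best_err:                               # 3. update the pocket
            best_w, best_err = list(w), err
    return best_w, best_err


# Hypothetical noisy data: the last example repeats the first point with the
# opposite label, so the data is not linearly separable.
X = [[1, 2, 2], [1, 1, 3], [1, -1, -1], [1, -2, 0], [1, 2, 2]]
y = [1, 1, -1, -1, -1]

w_pocket, err_pocket = pocket(X, y)
print(w_pocket, err_pocket)
```

Each iteration re-counts mistakes over all of $\mathcal{D}$, which is the extra cost pocket pays compared with plain PLA.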


## Fun Time

Should we use pocket or PLA?

Since we do not know whether $\mathcal{D}$ is linear separable in advance, we may decide to just go with pocket instead of PLA. If $\mathcal{D}$ is actually linear separable, what’s the difference between the two?

1. pocket on $\mathcal{D}$ is slower than PLA
2. pocket on $\mathcal{D}$ is faster than PLA
3. pocket on $\mathcal{D}$ returns a better $g$ in approximating $f$ than PLA
4. pocket on $\mathcal{D}$ returns a worse $g$ in approximating $f$ than PLA

Reference Answer: 1

Because pocket needs to check whether $\mathbf{w}_{t+1}$ is better than $\hat{\mathbf{w}}$ in each iteration, it is slower than PLA. On linear separable $\mathcal{D}$, $\mathbf{w}_{\text{POCKET}}$ is the same as $\mathbf{w}_{\text{PLA}}$, both making no mistakes.
