Machine Learning Foundations
(機器學習基石)
Lecture 2: Learning to Answer Yes/No
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
Lecture 1: The Learning Problem
A takes D and H to get g
Lecture 2: Learning to Answer Yes/No
• Perceptron Hypothesis Set
• Perceptron Learning Algorithm (PLA)
• Guarantee of PLA
• Non-Separable Data
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Learning to Answer Yes/No Perceptron Hypothesis Set
Credit Approval Problem Revisited
Applicant Information
age: 23 years
gender: female
annual salary: NTD 1,000,000
year in residence: 1 year
year in job: 0.5 year
current debt: 200,000

unknown target function f : X → Y (ideal credit approval formula)
training examples D : $(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
learning algorithm A
final hypothesis g ≈ f (‘learned’ formula to be used)
hypothesis set H (set of candidate formulas)
what hypothesis set can we use?
A Simple Hypothesis Set: the ‘Perceptron’
age: 23 years
annual salary: NTD 1,000,000
year in job: 0.5 year
current debt: 200,000
• For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$ ‘features of customer’, compute a weighted ‘score’ and
  approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
• Y: {+1 (good), −1 (bad)}, 0 ignored; linear formulas h ∈ H are
  $h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$
called ‘perceptron’ hypothesis historically
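The score-and-threshold rule above can be sketched in a few lines of Python. This is only an illustration: the feature scaling, the weights, and the threshold below are made-up values, not ones taken from the lecture.

```python
def sign(s):
    """Return +1 for a positive value, -1 for a negative one, 0 for exactly zero."""
    return (s > 0) - (s < 0)

def perceptron_score_form(w, threshold, x):
    """h(x) = sign((sum_i w_i * x_i) - threshold)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return sign(score - threshold)

# Hypothetical applicant: (age, salary in millions NTD, years in job, debt in millions NTD)
applicant = (23, 1.0, 0.5, 0.2)
# Illustrative weights (assumed, not learned): salary and job tenure help, debt hurts.
weights = (0.0, 2.0, 1.0, -3.0)

decision = perceptron_score_form(weights, 1.0, applicant)  # +1 means approve
print(decision)
```

With these assumed numbers the score is about 1.9, so the applicant is approved at threshold 1.0 and denied at a stricter threshold such as 5.0.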
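Vector Form of Perceptron Hypothesis
$h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$
$= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + \underbrace{(-\text{threshold})}_{w_0} \cdot \underbrace{(+1)}_{x_0}\right)$
$= \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right)$
$= \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\right)$
• each ‘tall’ $\mathbf{w}$ represents a hypothesis h & is multiplied with ‘tall’ $\mathbf{x}$; will use tall versions to simplify notation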
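As a quick check of the rewriting above, the following sketch compares the threshold form with the ‘tall’ vector form on a few points. The function names and the sample numbers are assumptions chosen for illustration.

```python
def sign(s):
    return (s > 0) - (s < 0)

def h_threshold(w, threshold, x):
    """Original form: sign((sum_{i=1..d} w_i x_i) - threshold)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - threshold)

def h_vector(w_tall, x):
    """Tall form: sign(w^T x) with w_0 = -threshold and x_0 = +1 prepended to x."""
    x_tall = (1.0,) + tuple(x)
    return sign(sum(wi * xi for wi, xi in zip(w_tall, x_tall)))

w, threshold = (2.0, -1.0), 0.5          # illustrative values
w_tall = (-threshold,) + w               # absorb the threshold as w_0
points = [(1.0, 1.0), (0.0, 2.0), (3.0, -1.0), (0.25, 0.0)]

# Both forms should agree on every point, including the one exactly on the boundary.
agree = all(h_threshold(w, threshold, x) == h_vector(w_tall, x) for x in points)
print(agree)
```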
what do perceptrons h ‘look like’?
Perceptrons in $\mathbb{R}^2$
$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$
• customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)
• labels y: ◦ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in $\mathbb{R}^d$), positive on one side of a line, negative on the other side
• different line classifies customers differently
perceptrons ⇔ linear (binary) classifiers
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword occurrence, and output +1 indicates a spam. Which keywords below shall have large positive weights in a good perceptron for the task?
1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera
Reference Answer: 2
The occurrence of keywords with positive weights increases the ‘spam score’, and hence those keywords should often appear in spam messages.
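To make the reference answer concrete, here is a toy spam perceptron over raw keyword counts. The weights and the bias are invented for illustration; a real perceptron would learn them from data.

```python
def sign(s):
    return (s > 0) - (s < 0)

# Hypothetical weights: positive for spam-like keywords, negative for course-like ones.
WEIGHTS = {"free": 2.0, "drug": 2.0, "fantastic": 1.5, "deal": 1.5,
           "machine": -1.0, "learning": -1.0, "textbook": -1.0}
BIAS = -2.0  # plays the role of w_0 = -threshold (assumed value)

def spam_perceptron(message):
    """Return +1 (spam) or -1 (not spam) from the weighted keyword counts."""
    words = message.lower().split()
    score = BIAS + sum(WEIGHTS.get(word, 0.0) for word in words)
    return sign(score)

print(spam_perceptron("fantastic deal on free drug"))   # score 7.0 - 2.0 = 5.0
print(spam_perceptron("machine learning textbook"))     # score -3.0 - 2.0 = -5.0
```

Each occurrence of a positively weighted keyword pushes the score up, which is why such keywords should be the ones that appear often in spam.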
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
Select g from H
H = all possible perceptrons, g = ?
• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
• difficult: H is of infinite size
• idea: start from some $g_0$, and ‘correct’ its mistakes on D
will represent $g_0$ by its weight vector $\mathbf{w}_0$
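Perceptron Learning Algorithm
start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on D
For t = 0, 1, . . .
1 find a mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$: $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$
2 (try to) correct the mistake by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$
. . . until no more mistakes
return last $\mathbf{w}$ (called $\mathbf{w}_{\text{PLA}}$) as g
[Figure: two panels, y = +1 and y = −1, each showing $\mathbf{w}$, $\mathbf{x}$, and the updated vector $\mathbf{w} + y\mathbf{x}$]
That’s it!
A fault confessed is half redressed. :-)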
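The two steps above translate almost directly into code. The following is a minimal sketch in plain Python, assuming each input already carries x_0 = 1 as its first coordinate; the `max_updates` cap is an addition made here, since the loop only stops by itself when the data can be separated.

```python
def sign(s):
    return (s > 0) - (s < 0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def pla(X, y, max_updates=10_000):
    """X: list of tall inputs (x_0 = 1 first); y: labels in {+1, -1}.

    Returns (w, number_of_updates). max_updates is a safety cap (an assumption
    of this sketch, not part of the algorithm on the slide).
    """
    w = [0.0] * len(X[0])
    updates = 0
    while updates < max_updates:
        # Step 1: find any example where sign(w^T x) differs from its label.
        mistake = next(((x, label) for x, label in zip(X, y)
                        if sign(dot(w, x)) != label), None)
        if mistake is None:
            break  # no more mistakes
        # Step 2: correct it with w <- w + y * x.
        x, label = mistake
        w = [wi + label * xi for wi, xi in zip(w, x)]
        updates += 1
    return w, updates

# Tiny hypothetical dataset in R^2 (with x_0 = 1 prepended).
X = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
y = [+1, +1, -1, -1]
w_pla, n_updates = pla(X, y)
print(w_pla, n_updates)
```

Note that sign(0) = 0 never equals a label, so the all-zero starting vector counts as a mistake on every example and the first update always happens.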
Practical Implementation of PLA
start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on D
Cyclic PLA
For t = 0, 1, . . .
1 find the next mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$: $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$
2 correct the mistake by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$
. . . until a full cycle of not encountering mistakes
next can follow naïve cycle (1, · · · , N) or precomputed random cycle
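One possible way to implement the cyclic variant is to walk through a fixed visiting order and stop after N consecutive examples without a mistake. In this sketch the `order` argument, the seed, and the `max_updates` cap are assumptions made for illustration.

```python
import random

def sign(s):
    return (s > 0) - (s < 0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cyclic_pla(X, y, order=None, max_updates=10_000):
    """Visit examples in a fixed cycle; halt after a full cycle with no mistakes."""
    n = len(X)
    order = list(range(n)) if order is None else list(order)  # naive cycle by default
    w = [0.0] * len(X[0])
    updates, clean_streak, pos = 0, 0, 0
    while clean_streak < n and updates < max_updates:
        i = order[pos]
        if sign(dot(w, X[i])) != y[i]:
            w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]
            updates += 1
            clean_streak = 0          # restart the count of mistake-free visits
        else:
            clean_streak += 1
        pos = (pos + 1) % n           # move on to the next example in the cycle
    return w, updates

X = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
y = [+1, +1, -1, -1]

w_naive, _ = cyclic_pla(X, y)                      # naive cycle 0, 1, ..., N-1

random_order = list(range(len(X)))
random.Random(0).shuffle(random_order)             # precomputed random cycle
w_random, _ = cyclic_pla(X, y, order=random_order)
```

Precomputing the random order once keeps each step as cheap as in the naive cycle, while avoiding a dependence on how the data happen to be sorted.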
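Seeing is Believing
[Figure: a PLA run shown panel by panel, each panel plotting the example used together with the weights w(t) and w(t+1). Panels: initially; update 1 with $\mathbf{x}_1$; update 2 with $\mathbf{x}_9$; update 3 with $\mathbf{x}_{14}$; update 4 with $\mathbf{x}_3$; update 5 with $\mathbf{x}_9$; update 6 with $\mathbf{x}_{14}$; update 7 with $\mathbf{x}_9$; update 8 with $\mathbf{x}_{14}$; update 9 with $\mathbf{x}_9$; finally $\mathbf{w}_{\text{PLA}}$]
worked like a charm with < 20 lines!!
(note: made $x_i \gg x_0 = 1$ for visual purpose)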
Some Remaining Issues of PLA
‘correct’ mistakes on D until no mistakes
Algorithmic: halt (with no mistake)?
• naïve cyclic: ??
• random cyclic: ??
• other variant: ??
Learning: g ≈ f ?
• on D, if halt, yes (no mistake)
• outside D: ??
• if not halting: ??
[to be shown] if (...), after ‘enough’ corrections, any PLA variant halts
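Fun Time
Let’s try to think about why PLA may work.
Let n = n(t). According to the rule of PLA below, which formula is true?
$\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n) \neq y_n, \quad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n \mathbf{x}_n$
1 $\mathbf{w}_{t+1}^T \mathbf{x}_n = y_n$
2 $\operatorname{sign}(\mathbf{w}_{t+1}^T \mathbf{x}_n) = y_n$
3 $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n \ge y_n \mathbf{w}_t^T \mathbf{x}_n$
4 $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n < y_n \mathbf{w}_t^T \mathbf{x}_n$
Reference Answer: 3
Simply multiply the second part of the rule by $y_n \mathbf{x}_n$. The result shows that the rule somewhat ‘tries to correct the mistake.’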
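The hint in the reference answer can be written out step by step. The derivation below uses only the update rule and the fact that $y_n \in \{+1, -1\}$, so $y_n^2 = 1$:

```latex
\begin{align*}
y_n \mathbf{w}_{t+1}^T \mathbf{x}_n
  &= y_n (\mathbf{w}_t + y_n \mathbf{x}_n)^T \mathbf{x}_n \\
  &= y_n \mathbf{w}_t^T \mathbf{x}_n + y_n^2 \|\mathbf{x}_n\|^2 \\
  &= y_n \mathbf{w}_t^T \mathbf{x}_n + \|\mathbf{x}_n\|^2
  \;\ge\; y_n \mathbf{w}_t^T \mathbf{x}_n .
\end{align*}
```

The score moves in the right direction by $\|\mathbf{x}_n\|^2$, but a single update need not flip the sign, which is why choices 1 and 2 do not hold in general.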
Learning to Answer Yes/No Guarantee of PLA
Linear Separability
• if PLA halts (i.e. no more mistakes), (necessary condition) D allows some $\mathbf{w}$ to make no mistake
• call such D linear separable
[Figure: three example datasets, labeled (linear separable), (not linear separable), (not linear separable)]
assume linear separable D, does PLA always halt?
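PLA Fact: $\mathbf{w}_t$ Gets More Aligned with $\mathbf{w}_f$
linear separable D ⇔ exists perfect $\mathbf{w}_f$ such that $y_n = \operatorname{sign}(\mathbf{w}_f^T \mathbf{x}_n)$
• $\mathbf{w}_f$ perfect, hence every $\mathbf{x}_n$ correctly away from line:
  $y_{n(t)} \mathbf{w}_f^T \mathbf{x}_{n(t)} \ge \min_n y_n \mathbf{w}_f^T \mathbf{x}_n > 0$
• $\mathbf{w}_f^T \mathbf{w}_t \uparrow$ by updating with any $(\mathbf{x}_{n(t)}, y_{n(t)})$:
  $\mathbf{w}_f^T \mathbf{w}_{t+1} = \mathbf{w}_f^T (\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)})$
  $\ge \mathbf{w}_f^T \mathbf{w}_t + \min_n y_n \mathbf{w}_f^T \mathbf{x}_n$
  $> \mathbf{w}_f^T \mathbf{w}_t + 0$
$\mathbf{w}_t$ appears more aligned with $\mathbf{w}_f$ after update (really?)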
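PLA Fact: $\mathbf{w}_t$ Does Not Grow Too Fast
$\mathbf{w}_t$ changed only when mistake
⇔ $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$ ⇔ $y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} \le 0$
• mistake ‘limits’ $\|\mathbf{w}_t\|^2$ growth, even when updating with ‘longest’ $\mathbf{x}_n$:
  $\|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\|^2$
  $= \|\mathbf{w}_t\|^2 + 2 y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2$
  $\le \|\mathbf{w}_t\|^2 + 0 + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2$
  $\le \|\mathbf{w}_t\|^2 + \max_n \|y_n \mathbf{x}_n\|^2$
start from $\mathbf{w}_0 = \mathbf{0}$, after T mistake corrections,
$\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \sqrt{T} \cdot \text{constant}$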
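The final inequality follows by chaining the two PLA facts over T corrections. The sketch below starts from $\mathbf{w}_0 = \mathbf{0}$ and uses $\|y_n \mathbf{x}_n\| = \|\mathbf{x}_n\|$:

```latex
\begin{align*}
\mathbf{w}_f^T \mathbf{w}_T
  &\ge \mathbf{w}_f^T \mathbf{w}_0 + T \min_n y_n \mathbf{w}_f^T \mathbf{x}_n
   = T \min_n y_n \mathbf{w}_f^T \mathbf{x}_n, \\
\|\mathbf{w}_T\|^2
  &\le \|\mathbf{w}_0\|^2 + T \max_n \|\mathbf{x}_n\|^2
   = T \max_n \|\mathbf{x}_n\|^2, \\
\frac{\mathbf{w}_f^T \mathbf{w}_T}{\|\mathbf{w}_f\| \, \|\mathbf{w}_T\|}
  &\ge \frac{T \min_n y_n \mathbf{w}_f^T \mathbf{x}_n}
            {\|\mathbf{w}_f\| \sqrt{T} \max_n \|\mathbf{x}_n\|}
   = \sqrt{T} \cdot
     \frac{\min_n y_n \mathbf{w}_f^T \mathbf{x}_n}
          {\|\mathbf{w}_f\| \max_n \|\mathbf{x}_n\|}.
\end{align*}
```

So the ‘constant’ is the smallest normalized margin of $\mathbf{w}_f$ divided by the length of the longest input.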
Fun Time
Let’s upper-bound T, the number of mistakes that PLA ‘corrects’.
Define $R^2 = \max_n \|\mathbf{x}_n\|^2$ and $\rho = \min_n y_n \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \mathbf{x}_n$.
We want to show that $T \le \square$. Express the upper bound by the two terms above.
1 $R/\rho$
2 $R^2/\rho^2$
3 $R/\rho^2$
4 $\rho^2/R^2$
Reference Answer: 2
The maximum value of $\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_t}{\|\mathbf{w}_t\|}$ is 1. Since T mistake corrections raise this inner product to at least $\sqrt{T} \cdot \text{constant}$, the maximum number of corrected mistakes is $1/\text{constant}^2$.
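The bound can be checked numerically on a small example. In the sketch below, the dataset and the separator `w_f` are hypothetical, and norms are taken over the tall vectors (including x_0 = 1), matching the notation used so far.

```python
import math

def sign(s):
    return (s > 0) - (s < 0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def count_pla_updates(X, y, max_updates=10_000):
    """Run PLA from w_0 = 0 and return the number of corrections T."""
    w = [0.0] * len(X[0])
    for T in range(max_updates):
        wrong = [(x, label) for x, label in zip(X, y) if sign(dot(w, x)) != label]
        if not wrong:
            return T
        x, label = wrong[0]
        w = [wi + label * xi for wi, xi in zip(w, x)]
    return max_updates

def mistake_bound(w_f, X, y):
    """Return R^2 / rho^2 for a perfect separator w_f."""
    R2 = max(dot(x, x) for x in X)
    rho = min(label * dot(w_f, x) for x, label in zip(X, y)) / math.sqrt(dot(w_f, w_f))
    return R2 / rho ** 2

X = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
y = [+1, +1, -1, -1]
w_f = (0.0, 1.0, 1.0)   # hypothetical perfect separator, assumed known for this check

T = count_pla_updates(X, y)
bound = mistake_bound(w_f, X, y)
print(T, bound)
```

Here R² = 11 and ρ² = 9/2, so the bound is 22/9 ≈ 2.44, and PLA needs a single correction. In practice $\mathbf{w}_f$ is unknown, so the bound explains why PLA halts rather than predicting when.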
Learning to Answer Yes/No Non-Separable Data
More about PLA
Guarantee
as long as linear separable and correct by mistake
• inner product of $\mathbf{w}_f$ and $\mathbf{w}_t$ grows fast; length of $\mathbf{w}_t$ grows slowly
• PLA ‘lines’ are more and more aligned with $\mathbf{w}_f$ ⇒ halts
Pros
simple to implement, fast, works in any dimension d
Cons
• ‘assumes’ linear separable D to halt: property unknown in advance (no need for PLA if we know $\mathbf{w}_f$)
• not fully sure how long halting takes (ρ depends on $\mathbf{w}_f$), though practically fast
what if D not linear separable?
Learning with Noisy Data
unknown target function f : X → Y + noise (ideal credit approval formula)
training examples D : $(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
learning algorithm A
final hypothesis g ≈ f (‘learned’ formula to be used)
hypothesis set H (set of candidate formulas)
how to at least get g ≈ f on noisy D?
Line with Noise Tolerance
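• assume ‘little’ noise: $y_n = f(\mathbf{x}_n)$ usually
• if so, g ≈ f on D ⇔ $y_n = g(\mathbf{x}_n)$ usually
• how about $\mathbf{w}_g \leftarrow \operatorname{argmin}_{\mathbf{w}} \sum_{n=1}^{N} [\![ y_n \neq \operatorname{sign}(\mathbf{w}^T \mathbf{x}_n) ]\!]$? NP-hard to solve, unfortunately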
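While minimizing the objective above is hard, evaluating it for one given $\mathbf{w}$ is just a count. A minimal sketch, with a made-up dataset in which the last example plays the role of a noisy record:

```python
def sign(s):
    return (s > 0) - (s < 0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def zero_one_errors(w, X, y):
    """Count the examples with y_n != sign(w^T x_n)."""
    return sum(1 for x, label in zip(X, y) if sign(dot(w, x)) != label)

# Hypothetical data (x_0 = 1 first); the last point sits among the positives
# but is labeled -1, so no line classifies all five examples correctly.
X = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0),
     (1.0, -2.0, -1.0), (1.0, 2.5, 1.5)]
y = [+1, +1, -1, -1, -1]

print(zero_one_errors((0.0, 1.0, 1.0), X, y))    # wrong only on the noisy point
print(zero_one_errors((0.0, -1.0, -1.0), X, y))  # wrong on the other four
```

The hardness comes from searching over all $\mathbf{w}$, not from computing this count.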
can we modify PLA to get an ‘approximately good’ g?
Pocket Algorithm
modify PLA algorithm (black lines) by keeping best weights in pocket
initialize pocket weights $\hat{\mathbf{w}}$
For t = 0, 1, · · ·
1 find a (random) mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$
2 (try to) correct the mistake by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$
3 if $\mathbf{w}_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$
... until enough iterations
return $\hat{\mathbf{w}}$ (called $\mathbf{w}_{\text{POCKET}}$) as g
a simple modification of PLA to find (somewhat) ‘best’ weights
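The three steps above might be coded as follows. This is a sketch under assumptions: the iteration budget, the seed, and both toy datasets are chosen here for illustration, and ‘enough iterations’ is simply a fixed count.

```python
import random

def sign(s):
    return (s > 0) - (s < 0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def zero_one_errors(w, X, y):
    return sum(1 for x, label in zip(X, y) if sign(dot(w, x)) != label)

def pocket(X, y, iterations=200, seed=0):
    """PLA updates on random mistakes, keeping the best weights seen so far."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    w_hat, best = list(w), zero_one_errors(w, X, y)   # initialize the pocket
    for _ in range(iterations):
        mistakes = [i for i in range(len(X)) if sign(dot(w, X[i])) != y[i]]
        if not mistakes:
            break                                     # current w is already perfect
        i = rng.choice(mistakes)                      # step 1: a random mistake
        w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]   # step 2: PLA correction
        errs = zero_one_errors(w, X, y)               # step 3: compare with the pocket
        if errs < best:
            w_hat, best = list(w), errs
    return w_hat, best

# Separable toy data: pocket ends with zero mistakes, like PLA.
X_sep = [(1.0, 2.0, 2.0), (1.0, 3.0, 1.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
y_sep = [+1, +1, -1, -1]
w_sep, errs_sep = pocket(X_sep, y_sep)

# Same data plus one noisy record: at least one mistake is unavoidable.
X_noisy = X_sep + [(1.0, 2.5, 1.5)]
y_noisy = y_sep + [-1]
w_noisy, errs_noisy = pocket(X_noisy, y_noisy)
```

Step 3 scans all of D after every update, which is the extra cost compared with plain PLA.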
Fun Time
Should we use pocket or PLA?
Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?
1 pocket on D is slower than PLA
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA
Reference Answer: 1
Because pocket needs to check whether $\mathbf{w}_{t+1}$ is better than $\hat{\mathbf{w}}$ in each iteration, it is slower than PLA. On linear separable D, $\mathbf{w}_{\text{POCKET}}$ is the same as $\mathbf{w}_{\text{PLA}}$, both making no mistakes.