Academic year: 2022


Machine Learning Foundations

( 機器學習基石)

Lecture 2: Learning to Answer Yes/No

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science

& Information Engineering

National Taiwan University

( 國立台灣大學資訊工程系)

(2)

Learning to Answer Yes/No

Roadmap

1 When Can Machines Learn?

Lecture 1: The Learning Problem
A takes D and H to get g

Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
Perceptron Learning Algorithm (PLA)
Guarantee of PLA
Non-Separable Data

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?

(3)

Learning to Answer Yes/No Perceptron Hypothesis Set

Credit Approval Problem Revisited

Applicant Information

age                 23 years
gender              female
annual salary       NTD 1,000,000
year in residence   1 year
year in job         0.5 year
current debt        200,000

unknown target function $f : \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)

training examples $\mathcal{D} : (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)

learning algorithm $\mathcal{A}$

final hypothesis $g \approx f$ (‘learned’ formula to be used)

hypothesis set $\mathcal{H}$ (set of candidate formulas)

what hypothesis set can we use?

(4)

Learning to Answer Yes/No Perceptron Hypothesis Set

A Simple Hypothesis Set: the ‘Perceptron’

age                 23 years
annual salary       NTD 1,000,000
year in job         0.5 year
current debt        200,000

For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$, the ‘features of customer’, compute a weighted ‘score’ and

approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

$\mathcal{Y}$: $\{+1 \text{ (good)}, -1 \text{ (bad)}\}$, 0 ignored. The linear formulas $h \in \mathcal{H}$ are

$$h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$

called ‘perceptron’ hypothesis historically
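The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name, the weights, and the threshold are hypothetical values picked for the example.

```python
def perceptron_hypothesis(weights, features, threshold):
    """Return +1 (approve), -1 (deny), or 0 when the score equals the threshold."""
    score = sum(w * x for w, x in zip(weights, features))
    margin = score - threshold
    if margin > 0:
        return +1
    if margin < 0:
        return -1
    return 0


# Features from the slide: age, annual salary (NTD), year in job, current debt.
applicant = [23, 1_000_000, 0.5, 200_000]

# Hypothetical weights and threshold, chosen only for illustration.
weights = [0.0, 1e-6, 0.5, -2e-6]
threshold = 0.5

decision = perceptron_hypothesis(weights, applicant, threshold)
print("approve" if decision == +1 else "deny or undecided")
```

With these made-up weights, salary and time in job raise the score while debt lowers it; a different weight vector is simply a different hypothesis h from the same set.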

(5)

Learning to Answer Yes/No Perceptron Hypothesis Set

Vector Form of Perceptron Hypothesis

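$$
\begin{aligned}
h(\mathbf{x}) &= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right) \\
&= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + \underbrace{(-\text{threshold})}_{w_0} \cdot \underbrace{(+1)}_{x_0}\right) \\
&= \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right) \\
&= \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\right)
\end{aligned}
$$

each ‘tall’ $\mathbf{w}$ represents a hypothesis h and is multiplied with ‘tall’ $\mathbf{x}$; we will use the tall versions to simplify notation

what do perceptrons h ‘look like’?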
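As a quick check of the rewriting above, the following sketch compares the threshold form with the ‘tall’ form on a few inputs. The helper names and the numbers are assumptions made for this example only.

```python
def sign(value):
    return 1 if value > 0 else (-1 if value < 0 else 0)


def h_threshold(w, x, threshold):
    """Perceptron with an explicit threshold."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - threshold)


def h_tall(w_tall, x_tall):
    """Perceptron in vector form: sign of the inner product of tall vectors."""
    return sign(sum(wi * xi for wi, xi in zip(w_tall, x_tall)))


def make_tall(w, x, threshold):
    """Prepend w_0 = -threshold to w and x_0 = +1 to x."""
    return [-threshold] + list(w), [1.0] + list(x)


w, threshold = [2.0, -1.0], 0.5  # hypothetical values
for x in ([1.0, 0.0], [0.0, 1.0], [0.25, 0.0]):
    w_tall, x_tall = make_tall(w, x, threshold)
    assert h_threshold(w, x, threshold) == h_tall(w_tall, x_tall)
```

Folding the threshold into $w_0$ means later code only has to handle one inner product.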

(6)

Learning to Answer Yes/No Perceptron Hypothesis Set

Perceptrons in $\mathbb{R}^2$

$$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$$

customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)

labels y: ◦ (+1), × (−1)

hypothesis h: lines (or hyperplanes in $\mathbb{R}^d$), positive on one side of a line, negative on the other side

different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers

(7)

Learning to Answer Yes/No Perceptron Hypothesis Set

Fun Time

Consider using a perceptron to detect spam messages.

Assume that each email is represented by the frequency of keyword occurrence, and output +1 indicates a spam. Which keywords below shall have large positive weights in a good perceptron for the task?

1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera

Reference Answer: 2

The occurrence of keywords with positive weights increases the ‘spam score’, and hence those keywords should often appear in spam.

(8)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Select g from $\mathcal{H}$

$\mathcal{H}$ = all possible perceptrons, g = ?

want: $g \approx f$ (hard when f unknown)

almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$

difficult: $\mathcal{H}$ is of infinite size

idea: start from some $g_0$, and ‘correct’ its mistakes on $\mathcal{D}$

will represent $g_0$ by its weight vector $\mathbf{w}_0$

(9)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

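Perceptron Learning Algorithm

start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on $\mathcal{D}$

For t = 0, 1, . . .

1 find a mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$:
$$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$$

2 (try to) correct the mistake by
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

. . . until no more mistakes

return last $\mathbf{w}$ (called $\mathbf{w}_{\text{PLA}}$) as g

[Figure: two panels, y = +1 and y = −1, each showing $\mathbf{w}$, $\mathbf{x}$, and the updated vector $\mathbf{w} + y\mathbf{x}$]

That’s it! A fault confessed is half redressed. :-)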
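The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: inputs are ‘tall’ vectors with $x_0 = 1$, sign(0) is treated as a mistake, the `max_updates` safety cap is an addition not found on the slide, and the toy data set is made up.

```python
def sign(value):
    return 1 if value > 0 else (-1 if value < 0 else 0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pla(X, y, max_updates=10_000):
    """X holds 'tall' inputs (x_0 = 1 first); y holds labels in {+1, -1}."""
    w = [0.0] * len(X[0])  # start from w_0 = 0
    for updates in range(max_updates + 1):
        # Step 1: find any mistake of the current w (here: the first one).
        mistake = next(
            (n for n in range(len(X)) if sign(dot(w, X[n])) != y[n]), None
        )
        if mistake is None:
            return w, updates  # no more mistakes: w is w_PLA
        # Step 2: (try to) correct it with w <- w + y_n x_n.
        w = [wi + y[mistake] * xi for wi, xi in zip(w, X[mistake])]
    raise RuntimeError("no mistake-free w found within max_updates")


# Hypothetical linear separable toy data.
X = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]]
y = [+1, -1, +1, -1]

w_pla, n_updates = pla(X, y)
print(w_pla, n_updates)
```

The cap exists only because, without further assumptions on the data, nothing on this slide says the loop must end.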

(10)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Practical Implementation of PLA

start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and ‘correct’ its mistakes on $\mathcal{D}$

Cyclic PLA

For t = 0, 1, . . .

1 find the next mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$:
$$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$$

2 correct the mistake by
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

. . . until a full cycle of not encountering mistakes

‘next’ can follow naïve cycle (1, · · · , N) or precomputed random cycle
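A sketch of the cyclic variant, assuming the same ‘tall’ input convention: the `order` argument carries either the naïve cycle or a precomputed random one, and the loop stops after N consecutive checks without a mistake. The function name, the `max_updates` cap, the seed, and the toy data are assumptions for illustration.

```python
import random


def cyclic_pla(X, y, order=None, max_updates=10_000):
    """Visit points in a fixed cyclic order; stop after N clean checks in a row."""
    n_points = len(X)
    order = list(range(n_points)) if order is None else list(order)
    w = [0.0] * len(X[0])
    updates, clean_streak, position = 0, 0, 0
    while clean_streak < n_points:
        n = order[position % n_points]
        position += 1
        score = sum(wi * xi for wi, xi in zip(w, X[n]))
        predicted = 1 if score > 0 else (-1 if score < 0 else 0)
        if predicted == y[n]:
            clean_streak += 1
            continue
        if updates >= max_updates:
            raise RuntimeError("gave up: data may not be linear separable")
        w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]  # correct the mistake
        updates += 1
        clean_streak = 0
    return w, updates


# Hypothetical linear separable toy data.
X = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]]
y = [+1, -1, +1, -1]

w_naive, _ = cyclic_pla(X, y)  # naive cycle 1, ..., N

random_order = list(range(len(X)))
random.Random(0).shuffle(random_order)  # precomputed random cycle
w_random, _ = cyclic_pla(X, y, order=random_order)
```

Tracking a streak of clean checks avoids rescanning the whole data set after every update.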

(11)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Seeing is Believing

[Figure: animation frames of PLA, each showing w(t) and w(t+1). Frames: initially; update 1 (with $\mathbf{x}_1$); update 2 ($\mathbf{x}_9$); update 3 ($\mathbf{x}_{14}$); update 4 ($\mathbf{x}_3$); update 5 ($\mathbf{x}_9$); update 6 ($\mathbf{x}_{14}$); update 7 ($\mathbf{x}_9$); update 8 ($\mathbf{x}_{14}$); update 9 ($\mathbf{x}_9$); finally ($\mathbf{w}_{\text{PLA}}$)]

worked like a charm with < 20 lines!!

(note: made $x_i \gg x_0 = 1$ for visual purpose)

(22)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Some Remaining Issues of PLA

‘correct’ mistakes on $\mathcal{D}$ until no mistakes

Algorithmic: halt (with no mistake)?
• naïve cyclic: ??
• random cyclic: ??
• other variant: ??

Learning: $g \approx f$ ?
• on $\mathcal{D}$, if halt, yes (no mistake)
• outside $\mathcal{D}$: ??
• if not halting: ??

[to be shown] if (...), after ‘enough’ corrections, any PLA variant halts

(23)

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Fun Time

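Let’s try to think about why PLA may work.

Let n = n(t). According to the rule of PLA below, which formula is true?

$$\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_n\right) \neq y_n, \quad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n \mathbf{x}_n$$

1 $\mathbf{w}_{t+1}^T \mathbf{x}_n = y_n$
2 $\operatorname{sign}(\mathbf{w}_{t+1}^T \mathbf{x}_n) = y_n$
3 $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n \geq y_n \mathbf{w}_t^T \mathbf{x}_n$
4 $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n < y_n \mathbf{w}_t^T \mathbf{x}_n$

Reference Answer: 3

Simply multiply the second part of the rule by $y_n \mathbf{x}_n$. The result shows that the rule somewhat ‘tries to correct the mistake.’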
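Spelled out as a worked step, using only the update rule and $y_n \in \{+1, -1\}$ (so $y_n^2 = 1$):

```latex
\begin{aligned}
y_n \mathbf{w}_{t+1}^T \mathbf{x}_n
  &= y_n \left(\mathbf{w}_t + y_n \mathbf{x}_n\right)^T \mathbf{x}_n \\
  &= y_n \mathbf{w}_t^T \mathbf{x}_n + y_n^2 \|\mathbf{x}_n\|^2 \\
  &\geq y_n \mathbf{w}_t^T \mathbf{x}_n .
\end{aligned}
```

The score moves in the right direction, but nothing here forces its sign to flip after a single update, which is why choice 2 is not guaranteed.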

(24)

Learning to Answer Yes/No Guarantee of PLA

Linear Separability

if PLA halts (i.e. no more mistakes), then (necessary condition) $\mathcal{D}$ allows some $\mathbf{w}$ to make no mistake

call such $\mathcal{D}$ linear separable

[Figure: three example data sets, captioned (linear separable), (not linear separable), (not linear separable)]

assume linear separable $\mathcal{D}$, does PLA always halt?

(25)

Learning to Answer Yes/No Guarantee of PLA

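PLA Fact: $\mathbf{w}_t$ Gets More Aligned with $\mathbf{w}_f$

linear separable $\mathcal{D}$ ⇔ exists perfect $\mathbf{w}_f$ such that $y_n = \operatorname{sign}(\mathbf{w}_f^T \mathbf{x}_n)$

$\mathbf{w}_f$ perfect, hence every $\mathbf{x}_n$ correctly away from line:
$$y_{n(t)} \mathbf{w}_f^T \mathbf{x}_{n(t)} \geq \min_n y_n \mathbf{w}_f^T \mathbf{x}_n > 0$$

$\mathbf{w}_f^T \mathbf{w}_t$ increases by updating with any $(\mathbf{x}_{n(t)}, y_{n(t)})$:
$$
\begin{aligned}
\mathbf{w}_f^T \mathbf{w}_{t+1} &= \mathbf{w}_f^T \left(\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\right) \\
&\geq \mathbf{w}_f^T \mathbf{w}_t + \min_n y_n \mathbf{w}_f^T \mathbf{x}_n \\
&> \mathbf{w}_f^T \mathbf{w}_t + 0.
\end{aligned}
$$

$\mathbf{w}_t$ appears more aligned with $\mathbf{w}_f$ after update (really?)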

(26)

Learning to Answer Yes/No Guarantee of PLA

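PLA Fact: $\mathbf{w}_t$ Does Not Grow Too Fast

$\mathbf{w}_t$ changed only when mistake
⇔ $\operatorname{sign}\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}$ ⇔ $y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} \leq 0$

mistake ‘limits’ $\|\mathbf{w}_t\|^2$ growth, even when updating with ‘longest’ $\mathbf{x}_n$:
$$
\begin{aligned}
\|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&= \|\mathbf{w}_t\|^2 + 2 y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&\leq \|\mathbf{w}_t\|^2 + 0 + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&\leq \|\mathbf{w}_t\|^2 + \max_n \|y_n \mathbf{x}_n\|^2
\end{aligned}
$$

start from $\mathbf{w}_0 = \mathbf{0}$, after T mistake corrections,
$$\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \geq \sqrt{T} \cdot \text{constant}$$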
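One way to fill in the step from the two facts to this inequality is sketched below. The shorthand $\rho = \min_n y_n \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \mathbf{x}_n$ and $R^2 = \max_n \|\mathbf{x}_n\|^2$ is introduced here for the derivation, and it uses $\|y_n \mathbf{x}_n\| = \|\mathbf{x}_n\|$ because $y_n = \pm 1$.

```latex
\begin{aligned}
\mathbf{w}_f^T \mathbf{w}_T
  &\geq \mathbf{w}_f^T \mathbf{w}_0 + T \min_n y_n \mathbf{w}_f^T \mathbf{x}_n
   = T \rho \|\mathbf{w}_f\| , \\
\|\mathbf{w}_T\|^2
  &\leq \|\mathbf{w}_0\|^2 + T \max_n \|\mathbf{x}_n\|^2
   = T R^2 , \\
\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|}
  &\geq \frac{T \rho \|\mathbf{w}_f\|}{\|\mathbf{w}_f\| \sqrt{T} R}
   = \sqrt{T} \cdot \frac{\rho}{R} .
\end{aligned}
```

Under these definitions the constant is $\rho / R$: the first line applies the alignment fact T times, the second applies the growth fact T times, and both start from $\mathbf{w}_0 = \mathbf{0}$.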

(27)

Learning to Answer Yes/No Guarantee of PLA

Fun Time

Let’s upper-bound T, the number of mistakes that PLA ‘corrects’.

Define
$$R^2 = \max_n \|\mathbf{x}_n\|^2, \qquad \rho = \min_n y_n \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \mathbf{x}_n$$

We want to show that T ≤ □. Express the upper bound □ by the two terms above.

1 $R / \rho$
2 $R^2 / \rho^2$
3 $R / \rho^2$
4 $\rho^2 / R^2$

Reference Answer: 2

The maximum value of $\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_t}{\|\mathbf{w}_t\|}$ is 1. Since T mistake corrections increase the inner product by $\sqrt{T} \cdot$ constant, the maximum number of corrected mistakes is $1/\text{constant}^2$.
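The bound can be checked numerically on a small example. The sketch below is an illustration under assumptions: the toy data, the hand-picked perfect separator `w_f`, and the helper names are all made up for this check, and PLA starts from $\mathbf{w}_0 = \mathbf{0}$ as the derivation requires.

```python
import math


def sign(value):
    return 1 if value > 0 else (-1 if value < 0 else 0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pla_update_count(X, y, max_updates=10_000):
    """Run PLA from w_0 = 0 and return (w, number of corrections T)."""
    w, updates = [0.0] * len(X[0]), 0
    while updates <= max_updates:
        mistake = next(
            (n for n in range(len(X)) if sign(dot(w, X[n])) != y[n]), None
        )
        if mistake is None:
            return w, updates
        w = [wi + y[mistake] * xi for wi, xi in zip(w, X[mistake])]
        updates += 1
    raise RuntimeError("no mistake-free w found within max_updates")


# Hypothetical linear separable toy data and one perfect separator for it.
X = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]]
y = [+1, -1, +1, -1]
w_f = [0.0, 1.0, -1.0]

R2 = max(dot(xn, xn) for xn in X)  # R^2 = max_n ||x_n||^2
rho = min(yn * dot(w_f, xn) for xn, yn in zip(X, y)) / math.sqrt(dot(w_f, w_f))
bound = R2 / rho**2

_, T = pla_update_count(X, y)
assert rho > 0  # w_f really is perfect on this data
assert T <= bound
print(T, bound)
```

Any perfect separator gives a valid bound; a separator with a larger margin $\rho$ gives a tighter one.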

(28)

Learning to Answer Yes/No Non-Separable Data

More about PLA

Guarantee
as long as linear separable and correct by mistake
• inner product of $\mathbf{w}_f$ and $\mathbf{w}_t$ grows fast; length of $\mathbf{w}_t$ grows slowly
• PLA ‘lines’ are more and more aligned with $\mathbf{w}_f$ ⇒ halts

Pros
simple to implement, fast, works in any dimension d

Cons
• ‘assumes’ linear separable $\mathcal{D}$ to halt: property unknown in advance (no need for PLA if we know $\mathbf{w}_f$)
• not fully sure how long halting takes ($\rho$ depends on $\mathbf{w}_f$), though practically fast

what if $\mathcal{D}$ not linear separable?

(29)

Learning to Answer Yes/No Non-Separable Data

Learning with Noisy Data

unknown target function $f : \mathcal{X} \to \mathcal{Y}$ + noise (ideal credit approval formula)

training examples $\mathcal{D} : (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)

learning algorithm $\mathcal{A}$

final hypothesis $g \approx f$ (‘learned’ formula to be used)

hypothesis set $\mathcal{H}$ (set of candidate formulas)

how to at least get $g \approx f$ on noisy $\mathcal{D}$?

(30)

Learning to Answer Yes/No Non-Separable Data

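Line with Noise Tolerance

assume ‘little’ noise: $y_n = f(\mathbf{x}_n)$ usually

if so, $g \approx f$ on $\mathcal{D}$ ⇔ $y_n = g(\mathbf{x}_n)$ usually

how about
$$\mathbf{w}_g \leftarrow \operatorname*{argmin}_{\mathbf{w}} \sum_{n=1}^{N} [\![ y_n \neq \operatorname{sign}(\mathbf{w}^T \mathbf{x}_n) ]\!]$$
NP-hard to solve, unfortunately

can we modify PLA to get an ‘approximately good’ g?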
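The quantity being minimized above is just a count of misclassified examples for a given $\mathbf{w}$; evaluating it is easy even though minimizing it over all $\mathbf{w}$ is hard. A small sketch, with a hypothetical function name and made-up data in which one label is flipped to imitate noise:

```python
def zero_one_errors(w, X, y):
    """Count examples where sign(w^T x_n) differs from y_n."""
    errors = 0
    for xn, yn in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, xn))
        predicted = 1 if score > 0 else (-1 if score < 0 else 0)
        errors += int(predicted != yn)
    return errors


# Hypothetical 'tall' inputs (x_0 = 1); the last label is flipped as noise.
X = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]]
y = [+1, -1, +1, +1]

print(zero_one_errors([0.0, 1.0, -1.0], X, y))  # prints 1
```

A score of exactly zero is counted as an error here, which is a convention chosen for this sketch.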

(31)

Learning to Answer Yes/No Non-Separable Data

Pocket Algorithm

modify PLA algorithm by keeping best weights in pocket

initialize pocket weights $\hat{\mathbf{w}}$

For t = 0, 1, · · ·

1 find a (random) mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$

2 (try to) correct the mistake by
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$$

3 if $\mathbf{w}_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$

. . . until enough iterations

return $\hat{\mathbf{w}}$ (called $\mathbf{w}_{\text{POCKET}}$) as g

a simple modification of PLA to find (somewhat) ‘best’ weights
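The three steps above can be sketched as follows. This is an illustration under assumptions rather than the lecture's code: the iteration budget, the random seed, the early exit when no mistake remains, and the noisy toy data are all choices made for this example.

```python
import random


def count_mistakes(w, X, y):
    """Number of examples where sign(w^T x_n) differs from y_n."""
    total = 0
    for xn, yn in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, xn))
        predicted = 1 if score > 0 else (-1 if score < 0 else 0)
        total += int(predicted != yn)
    return total


def pocket(X, y, iterations=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    w_hat, best = list(w), count_mistakes(w, X, y)  # initialize pocket weights
    for _ in range(iterations):
        wrong = [n for n in range(len(X)) if count_mistakes(w, [X[n]], [y[n]])]
        if not wrong:
            break  # current w is mistake-free and already in the pocket
        n = rng.choice(wrong)  # step 1: a (random) mistake
        w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]  # step 2: correct it
        errors = count_mistakes(w, X, y)
        if errors < best:  # step 3: keep the better weights in the pocket
            w_hat, best = list(w), errors
    return w_hat, best


# Hypothetical data: the last point sits between two positive points but is
# labeled -1, so no line can classify everything correctly.
X = [[1.0, 2.0, 0.0], [1.0, 0.0, 2.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0],
     [1.0, 2.5, 0.5]]
y = [+1, -1, +1, -1, -1]

w_pocket, n_errors = pocket(X, y)
print(w_pocket, n_errors)
```

Step 3 scans all of $\mathcal{D}$ on every iteration, which is the extra cost pocket pays compared with plain PLA.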

(32)

Learning to Answer Yes/No Non-Separable Data

Fun Time

Should we use pocket or PLA?

Since we do not know whether $\mathcal{D}$ is linear separable in advance, we may decide to just go with pocket instead of PLA. If $\mathcal{D}$ is actually linear separable, what’s the difference between the two?

1 pocket on $\mathcal{D}$ is slower than PLA
2 pocket on $\mathcal{D}$ is faster than PLA
3 pocket on $\mathcal{D}$ returns a better g in approximating f than PLA
4 pocket on $\mathcal{D}$ returns a worse g in approximating f than PLA

Reference Answer: 1

Because pocket needs to check whether $\mathbf{w}_{t+1}$ is better than $\hat{\mathbf{w}}$ in each iteration, it is slower than PLA. On linear separable $\mathcal{D}$, $\mathbf{w}_{\text{POCKET}}$ is the same as $\mathbf{w}_{\text{PLA}}$, both making no mistakes.

(33)

Learning to Answer Yes/No Non-Separable Data

Summary

1 When Can Machines Learn?

Lecture 1: The Learning Problem

Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set: hyperplanes/linear classifiers in $\mathbb{R}^d$
Perceptron Learning Algorithm (PLA): correct mistakes and improve iteratively
Guarantee of PLA: no mistake eventually if linear separable
Non-Separable Data: hold somewhat ‘best’ weights in pocket

next: the zoo of learning problems

2 Why Can Machines Learn?

3 How Can Machines Learn?

4 How Can Machines Learn Better?
