# Machine Learning Foundations (機器學習基石)

## Lecture 2: Learning to Answer Yes/No

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science and Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

## Roadmap

### 1 When Can Machines Learn?

Lecture 1: The Learning Problem (A takes D and H to get g)

Lecture 2: Learning to Answer Yes/No

- Perceptron Hypothesis Set
- Perceptron Learning Algorithm (PLA)
- Guarantee of PLA
- Non-Separable Data

### 2 Why Can Machines Learn?

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?

Learning to Answer Yes/No: Perceptron Hypothesis Set

## Credit Approval Problem Revisited

unknown target function f: X → Y (ideal credit approval formula)

training examples D: (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A takes D and hypothesis set H (set of candidate formulas) to produce final hypothesis g ≈ f

what hypothesis set can we use?

## A Simple Hypothesis Set: the ‘Perceptron’

For x = (x_1, x_2, · · · , x_d) ‘features of customer’, compute a weighted ‘score’ and

- approve credit if ∑_{i=1}^{d} w_i x_i > threshold
- deny credit if ∑_{i=1}^{d} w_i x_i < threshold

Y: {+1 (good), −1 (bad)}; the linear formulas h ∈ H are

h(x) = sign( ( ∑_{i=1}^{d} w_i x_i ) − threshold )

called ‘perceptron’ hypothesis historically

## Vector Form of Perceptron Hypothesis

h(x) = sign( ( ∑_{i=1}^{d} w_i x_i ) − threshold )

= sign( ( ∑_{i=1}^{d} w_i x_i ) + (−threshold) · (+1) ), where the last term is w_0 · x_0

= sign( ∑_{i=0}^{d} w_i x_i )

= sign( w^T x )

each ‘tall’ x with x_0 = +1 and ‘tall’ w with w_0 = −threshold; will use tall versions to simplify notation

what do perceptrons h ‘look like’?
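The equivalence of the threshold form and the tall-vector form is easy to verify in code. A small Python check follows (a sketch added here, not from the lecture; the names `h_threshold` and `h_tall` are ad hoc):

```python
import numpy as np

def h_threshold(w, x, threshold):
    """Threshold form: sign(sum_{i=1}^{d} w_i x_i - threshold)."""
    return 1 if np.dot(w, x) - threshold > 0 else -1

def h_tall(w, x, threshold):
    """Tall form: prepend x_0 = +1 and w_0 = -threshold, then sign(w^T x)."""
    w_tall = np.concatenate([[-threshold], w])
    x_tall = np.concatenate([[1.0], x])
    return 1 if np.dot(w_tall, x_tall) > 0 else -1

w = np.array([0.3, -0.8, 0.5])
x = np.array([1.0, 0.2, 2.0])
assert h_threshold(w, x, 0.4) == h_tall(w, x, 0.4)  # identical decisions
```

Folding the threshold into w_0 is purely notational, which is why the rest of the lecture can write every hypothesis as sign(w^T x).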

## Perceptrons in R²

h(x) = sign( w_0 + w_1 x_1 + w_2 x_2 )

- customer features x: points on the plane (or points in R^d)
- labels y: +1 or −1
- hypothesis h: lines (or hyperplanes in R^d), positive on one side of a line, negative on the other side
- different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers
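As a tiny illustration of ‘different lines classify customers differently’ (the two weight vectors and the customer point below are made up for this example, not from the lecture):

```python
def h(w, x1, x2):
    """A perceptron in R^2: sign(w0 + w1*x1 + w2*x2), i.e. which side of a line."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1

line_a = (-1.0, 1.0, 1.0)   # positive on the side x1 + x2 > 1
line_b = (-3.0, 1.0, 1.0)   # positive on the side x1 + x2 > 3

# the same customer gets opposite decisions from the two lines
print(h(line_a, 1.0, 1.0))  # 1  (approve)
print(h(line_b, 1.0, 1.0))  # -1 (deny)
```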

## Fun Time

Consider using a perceptron to detect spam messages. Assume that each email is represented by the frequency of keyword occurrence, and output +1 indicates a spam. Which keywords below shall have large positive weights in a good perceptron for the task?

1. coffee, tea, hamburger, steak
2. free, drug, fantastic, deal
3. machine, learning, statistics, textbook
4. national, Taiwan, university, coursera

Reference Answer: 2

The occurrence of keywords with positive weights increases the ‘spam score’, and hence those keywords should often appear in spam messages.

Learning to Answer Yes/No: Perceptron Learning Algorithm (PLA)

## Select g from H

H = all possible perceptrons; g = ?

- want: g ≈ f (hard when f unknown)
- almost necessary: g ≈ f on D, ideally g(x_n) = f(x_n) = y_n
- difficult: H is of infinite size
- idea: start from some g_0, and ‘correct’ its mistakes on D

will represent g_0 by its weight vector w_0

## Perceptron Learning Algorithm

start from some w_0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .

1. find a mistake of w_t, called (x_{n(t)}, y_{n(t)}), with sign( w_t^T x_{n(t)} ) ≠ y_{n(t)}
2. (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}

. . . until no more mistakes; return last w (called w_PLA) as g

That’s it! A fault confessed is half redressed. :-)
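The two-line rule above can be sketched as a short Python function (an illustration added here, not the lecture’s code; the name `pla` and the `max_updates` safety cap are choices made for this sketch). It uses the tall-vector convention x_0 = +1 and the naïve ‘first mistake found’ rule:

```python
import numpy as np

def pla(X, y, max_updates=10000):
    """Perceptron Learning Algorithm: repeatedly find a mistake and correct it.

    X: (N, d) feature matrix; a constant x_0 = +1 is prepended internally.
    y: (N,) labels in {+1, -1}.
    Returns the 'tall' weight vector w of length d + 1.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # tall x: x_0 = +1
    w = np.zeros(X.shape[1])                      # start from w_0 = 0
    for _ in range(max_updates):
        pred = np.where(X @ w > 0, 1, -1)         # sign(w^T x_n), sign(0) -> -1
        mistakes = np.flatnonzero(pred != y)
        if mistakes.size == 0:                    # no more mistakes: halt
            return w
        n = mistakes[0]                           # pick the first mistake found
        w = w + y[n] * X[n]                       # correct it: w <- w + y_n x_n
    raise RuntimeError("no halt: D may not be linear separable")
```

On a linearly separable D the loop provably halts (shown later in this lecture); the cap only guards against non-separable input.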

## Practical Implementation of PLA

start from some w_0 (say, 0), and ‘correct’ its mistakes on D

Cyclic PLA: for t = 0, 1, . . .

1. find the next mistake of w_t, called (x_{n(t)}, y_{n(t)}), with sign( w_t^T x_{n(t)} ) ≠ y_{n(t)}
2. correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}

. . . until a full cycle of not encountering mistakes

next can follow naïve cycle (1, · · · , N) or precomputed random cycle

## Seeing is Believing

[figure sequence: PLA on a 2-D toy dataset, from the initial weights through successive corrections until all examples are classified correctly; each update w_{t+1} ← w_t + y_{n(t)} x_{n(t)} rotates the boundary toward the mistaken example]

(note: made x_i’s x_0 = 1 for visual purpose)


## Some Remaining Issues of PLA

‘correct’ mistakes on D until no more mistakes

Algorithmic: halt (with no mistake)?

- naïve cyclic: ??
- random cyclic: ??
- other variant: ??

Learning: g ≈ f ?

- on D, if halt, yes (no mistake)
- outside D: ??
- if not halting: ??

[to be shown] if (...), after ‘enough’ corrections, any PLA variant halts

## Fun Time

Let’s try to think about why PLA may work. Let n = n(t). According to the rule of PLA below, which formula is true?

sign( w_t^T x_n ) ≠ y_n , w_{t+1} ← w_t + y_n x_n

1. w_{t+1}^T x_n = y_n
2. sign( w_{t+1}^T x_n ) = y_n
3. y_n w_{t+1}^T x_n ≥ y_n w_t^T x_n
4. y_n w_{t+1}^T x_n < y_n w_t^T x_n

Reference Answer: 3

Simply multiply the second part of the rule by y_n x_n. The result shows that the update somewhat ‘tries to correct the mistake.’
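Spelling the reference answer out (an extra step added here, using w_{t+1} = w_t + y_n x_n and y_n² = 1):

```latex
y_n \mathbf{w}_{t+1}^{\mathsf{T}} \mathbf{x}_n
  = y_n (\mathbf{w}_t + y_n \mathbf{x}_n)^{\mathsf{T}} \mathbf{x}_n
  = y_n \mathbf{w}_t^{\mathsf{T}} \mathbf{x}_n + \|\mathbf{x}_n\|^2
  \ge y_n \mathbf{w}_t^{\mathsf{T}} \mathbf{x}_n .
```

The update never decreases y_n w^T x_n on the mistaken example, which is exactly formula 3.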

Learning to Answer Yes/No: Guarantee of PLA

## Linear Separability

- if PLA halts (i.e. no more mistakes), then (necessary condition) D allows some w to make no mistake
- call such D linear separable

[figures: (linear separable), (not linear separable), (not linear separable)]

assume linear separable D, does PLA always halt?

## PLA Fact: w_t Gets More Aligned with w_f

linear separable D ⇔ exists perfect w_f such that y_n = sign( w_f^T x_n )

- w_f perfect, hence every x_n correctly away from the boundary: y_{n(t)} w_f^T x_{n(t)} ≥ min_n y_n w_f^T x_n > 0
- w_f^T w_t ↑ by updating with any (x_{n(t)}, y_{n(t)}):

w_f^T w_{t+1} = w_f^T ( w_t + y_{n(t)} x_{n(t)} ) ≥ w_f^T w_t + min_n y_n w_f^T x_n > w_f^T w_t + 0

w_t appears more aligned with w_f after update (really?)

## PLA Fact: w_t Does Not Grow Too Fast

w_t changed only when mistake ⇔ sign( w_t^T x_{n(t)} ) ≠ y_{n(t)} ⇔ y_{n(t)} w_t^T x_{n(t)} ≤ 0

mistake ‘limits’ ‖w_t‖² growth, even when updating with the ‘longest’ x_n:

‖w_{t+1}‖² = ‖w_t + y_{n(t)} x_{n(t)}‖²

= ‖w_t‖² + 2 y_{n(t)} w_t^T x_{n(t)} + ‖y_{n(t)} x_{n(t)}‖²

≤ ‖w_t‖² + 0 + ‖y_{n(t)} x_{n(t)}‖²

≤ ‖w_t‖² + max_n ‖x_n‖²

start from w_0 = 0; after T mistake corrections, (w_f^T / ‖w_f‖) (w_T / ‖w_T‖) ≥ √T · constant

## Fun Time

Define R² = max_n ‖x_n‖² and ρ = min_n y_n (w_f^T / ‖w_f‖) x_n. We want to show that T ≤ □. Express the upper bound □ by the two terms above.

1. R / ρ
2. R² / ρ²
3. R / ρ²
4. ρ² / R²

Reference Answer: 2

The maximum value of (w_f^T / ‖w_f‖) (w_T / ‖w_T‖) is 1. Since T mistake corrections grow the normalized inner product by ≥ √T · constant, the maximum number of corrected mistakes is 1 / constant².
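The bound can be checked numerically. The Python sketch below is added here for illustration (the hidden target `w_f`, the 0.2 margin filter, and the seed are assumptions made to guarantee a clear ρ and a quick halt): it runs PLA on a separable toy set, counts corrections T, and confirms T ≤ R²/ρ².

```python
import numpy as np

rng = np.random.default_rng(0)

# hidden 'target' w_f in tall form; keep only points with a clear margin
w_f = np.array([0.5, 1.0, -1.0])
X = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, size=(200, 2))])
X = X[np.abs(X @ w_f) > 0.2][:50]
y = np.where(X @ w_f > 0, 1, -1)

# PLA with an update counter
w, T = np.zeros(3), 0
while True:
    mistakes = np.flatnonzero(np.where(X @ w > 0, 1, -1) != y)
    if mistakes.size == 0:
        break
    w = w + y[mistakes[0]] * X[mistakes[0]]
    T += 1

R2 = np.max(np.sum(X ** 2, axis=1))                # R^2 = max_n ||x_n||^2
rho = np.min(y * (X @ w_f)) / np.linalg.norm(w_f)  # rho = min_n y_n w_f^T x_n / ||w_f||
assert T <= R2 / rho ** 2                          # the lecture's guarantee holds
```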

Learning to Answer Yes/No: Non-Separable Data

## More about PLA

as long as D is linear separable and we correct by mistake:

- inner product of w_f and w_t grows fast; length of w_t grows slowly
- PLA ‘lines’ become more and more aligned with w_f ⇒ PLA halts

Pros: simple to implement, fast, works in any dimension d

Cons:

- ‘assumes’ linear separable D to halt, a property unknown in advance (no need for PLA if we know w_f)
- not fully sure how long halting takes (ρ depends on w_f), though practically fast

what if D not linear separable?

## Learning with Noisy Data

unknown target function f: X → Y, plus noise

training examples D: (x_1, y_1), · · · , (x_N, y_N)

learning algorithm A takes D and hypothesis set H (set of candidate formulas) to produce final hypothesis g

how to at least get g ≈ f on noisy D?

## Line with Noise Tolerance

- assume ‘little’ noise: y_n = f(x_n) usually
- if so, g ≈ f on D ⇔ g(x_n) = y_n usually
- how about

w_g ← argmin_w ∑_{n=1}^{N} [[ y_n ≠ sign( w^T x_n ) ]]

NP-hard to solve, unfortunately

can we modify PLA to get an ‘approximately good’ g?

## Pocket Algorithm

modify PLA by keeping best weights in pocket

initialize pocket weights ŵ

For t = 0, 1, · · ·

1. find a (random) mistake of w_t, called (x_{n(t)}, y_{n(t)})
2. (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
3. if w_{t+1} makes fewer mistakes than ŵ, replace ŵ by w_{t+1}

. . . until enough iterations; return ŵ (called w_POCKET) as g

a simple modification of PLA to find (somewhat) ‘best’ weights
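The steps above can be sketched in Python (again an illustration, not the lecture’s code; `max_iters`, the seed, and the random-mistake choice via `default_rng` are assumptions of this sketch):

```python
import numpy as np

def pocket(X, y, max_iters=200, seed=0):
    """Pocket algorithm sketch: PLA-style updates, but keep in the 'pocket'
    the weights with the fewest mistakes seen so far.

    X: (N, d) features (x_0 = +1 prepended internally); y: labels in {+1, -1}.
    """
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    def n_mistakes(w):
        return int(np.sum(np.where(X @ w > 0, 1, -1) != y))

    w = np.zeros(X.shape[1])
    w_hat, best = w.copy(), n_mistakes(w)         # pocket weights
    for _ in range(max_iters):
        mistakes = np.flatnonzero(np.where(X @ w > 0, 1, -1) != y)
        if mistakes.size == 0:
            return w                              # separable: PLA-style halt
        n = rng.choice(mistakes)                  # a (random) mistake
        w = w + y[n] * X[n]                       # PLA-style correction
        m = n_mistakes(w)
        if m < best:                              # keep best weights in pocket
            w_hat, best = w.copy(), m
    return w_hat
```

By construction the returned weights are never worse on D than the starting weights, which is exactly the point of the pocket.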

## Fun Time

Should we use pocket or PLA? Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?

1. pocket on D is slower than PLA
2. pocket on D is faster than PLA
3. pocket on D returns a better g in approximating f than PLA
4. pocket on D returns a worse g in approximating f than PLA

Reference Answer: 1

Because pocket needs to check whether w_{t+1} is better than ŵ in each iteration, it is slower than PLA. On linear separable D, w_POCKET is the same as w_PLA, both making no mistakes.

## Summary

### 1 When Can Machines Learn?

Lecture 2: Learning to Answer Yes/No

- Perceptron Hypothesis Set: hyperplanes / linear classifiers in R^d
- Perceptron Learning Algorithm (PLA): correct mistakes and improve iteratively
- Guarantee of PLA: no mistake eventually if linear separable
- Non-Separable Data: hold somewhat ‘best’ weights in pocket

### 2 Why Can Machines Learn?

### 3 How Can Machines Learn?

### 4 How Can Machines Learn Better?
