What is Machine Learning Perceptron Learning Algorithm Types of Learning

(1)

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University ( 國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 0/99

(2)

Roadmap

What is Machine Learning
Perceptron Learning Algorithm
Types of Learning
Possibility of Learning
Linear Regression
Logistic Regression
Nonlinear Transform
Overfitting
Principles of Learning

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 1/99

(3)

What is Machine Learning

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 2/99

(4)

The Learning Problem What is Machine Learning

From Learning to Machine Learning

learning: acquiring skill with experience accumulated from observations

observations → learning → skill

machine learning: acquiring skill with experience accumulated/computed from data

data → ML → skill

What is skill?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 3/99

(5)

The Learning Problem What is Machine Learning

A More Concrete Definition

skill ⇔ improve some performance measure (e.g. prediction accuracy)

machine learning: improving some performance measure with experience computed from data

data → ML → improved performance measure

An Application in Computational Finance
stock data → ML → more investment gain

Why use machine learning?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 4/99

(6)

‘define’ trees and hand-program: difficult

learn from data (observations) and recognize: a 3-year-old can do so

‘ML-based tree recognition system’ can be easier to build than a hand-programmed system

ML: an alternative route to build complicated systems

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 5/99

(7)

Some Use Scenarios

when human cannot program the system manually

—navigating on Mars

when human cannot ‘define the solution’ easily

—speech/visual recognition

when needing rapid decisions that humans cannot do

—high-frequency trading

when needing to be user-oriented in a massive scale

—consumer-targeted marketing

Give a computer a fish, you feed it for a day;
teach it how to fish, you feed it for a lifetime. :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 6/99

(8)

The Learning Problem What is Machine Learning

Key Essence of Machine Learning

machine learning: improving some performance measure with experience computed from data

data → ML → improved performance measure

1 exists some ‘underlying pattern’ to be learned
—so ‘performance measure’ can be improved

2 but no programmable (easy) definition
—so ‘ML’ is needed

3 somehow there is data about the pattern
—so ML has some ‘inputs’ to learn from

key essence: help decide whether to use ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 7/99

(9)

A Hot Problem

• data: how many users have rated some movies
• skill: predict how a user would rate an unrated movie

competition held by Netflix in 2006
• 100,480,507 ratings that 480,189 users gave to 17,770 movies
• 10% improvement = 1 million dollar prize

similar competition (movies → songs) held by Yahoo! in KDDCup 2011
• 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines learn our preferences?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 8/99

(10)

The Learning Problem What is Machine Learning

Entertainment: Recommender System (2/2)

A Possible ML Solution

[Figure: match movie and viewer factors — movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) matched against viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); the predicted rating adds contributions from each factor.]

• pattern: rating ← viewer/movie factors
• learning: known rating → learned factors → unknown rating prediction

key part of the world-champion (again!) system from National Taiwan Univ. in KDDCup 2011

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 9/99
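To make ‘add contributions from each factor’ concrete, here is a minimal sketch (not the actual KDDCup system; the factor names and numbers are illustrative assumptions): each viewer and each movie gets a small factor vector, and the predicted rating is their inner product.

```python
import numpy as np

# hypothetical factor order: [comedy, action, blockbuster, Tom Cruise in it]
movie  = np.array([0.9, 0.1, 0.8, 0.0])   # the movie's content along each factor
viewer = np.array([0.7, 0.2, 0.6, 0.3])   # the viewer's preference for each factor

# 'add contributions from each factor': predicted rating ~ inner product
predicted_rating = movie @ viewer
print(predicted_rating)                    # higher means a better predicted match

# learning would adjust these factor vectors so predictions match the known ratings
```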

(11)

age                23 years
gender             female
annual salary      NTD 1,000,000
year in residence  1 year
year in job        0.5 year
current debt       200,000

unknown pattern to be learned:
‘approve credit card good for bank?’

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 10/99

(12)

Formalize the Learning Problem

Basic Notations

• input: x ∈ X (customer application)
• output: y ∈ Y (good/bad after approving credit card)
• unknown pattern to be learned ⇔ target function f : X → Y (ideal credit approval formula)
• data ⇔ training examples D = {(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N)} (historical records in bank)
• hypothesis ⇔ skill with hopefully good performance: g : X → Y (‘learned’ formula to be used)

{(x_n, y_n)} from f → ML → g

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 11/99

(13)

unknown target function f : X → Y (ideal credit approval formula)
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)

• target f unknown (i.e. no programmable definition)
• hypothesis g hopefully ≈ f, but possibly different from f (perfection ‘impossible’ when f unknown)

What does g look like?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 12/99

(14)

training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

assume g ∈ H = {h_k}, i.e. approving if
• h_1: annual salary > NTD 800,000
• h_2: debt > NTD 100,000 (really?)
• h_3: year in job ≤ 2 (really?)

hypothesis set H:
• can contain good or bad hypotheses
• up to A to pick the ‘best’ one as g

learning model = A and H

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 13/99

(15)

unknown target function f : X → Y (ideal credit approval formula)
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

machine learning: use data to compute hypothesis g that approximates target f

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 14/99

(16)

Machine Learning and Data Mining

Machine Learning
use data to compute hypothesis g that approximates target f

Data Mining
use (huge) data to find property that is interesting

• if ‘interesting property’ same as ‘hypothesis that approximates target’
—ML = DM (usually what KDDCup does)
• if ‘interesting property’ related to ‘hypothesis that approximates target’
—DM can help ML, and vice versa (often, but not always)
• traditional DM also focuses on efficient computation in large database

difficult to distinguish ML and DM in reality

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 15/99

(17)

Machine Learning
use data to compute hypothesis g that approximates target f

Artificial Intelligence
compute something that shows intelligent behavior

• g ≈ f is something that shows intelligent behavior
—ML can realize AI, among other routes

e.g. chess playing
• traditional AI: game tree
• ML for AI: ‘learning from board data’

ML is one possible route to realize AI

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 16/99

(18)

Machine Learning and Statistics

Machine Learning
use data to compute hypothesis g that approximates target f

Statistics
use data to make inference about an unknown process

• g is an inference outcome; f is something unknown
—statistics can be used to achieve ML
• traditional statistics also focus on provable results with math assumptions, and care less about computation

statistics: many useful tools for ML

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 17/99

(19)

Perceptron Learning Algorithm

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 18/99

(20)

Credit Approval Problem Revisited

Applicant Information

age                23 years
gender             female
annual salary      NTD 1,000,000
year in residence  1 year
year in job        0.5 year
current debt       200,000

unknown target function f : X → Y (ideal credit approval formula)
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

what hypothesis set can we use?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 19/99

(21)

For x = (x_1, x_2, · · · , x_d) ‘features of customer’, compute a weighted ‘score’ and

approve credit if  ∑_{i=1}^{d} w_i x_i > threshold
deny credit if     ∑_{i=1}^{d} w_i x_i < threshold

Y: {+1 (good), −1 (bad)}, 0 ignored — linear formulas h ∈ H are

h(x) = sign( ( ∑_{i=1}^{d} w_i x_i ) − threshold )

called ‘perceptron’ hypothesis historically

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 20/99

(22)

absorbing the threshold as a zeroth weight, with w_0 = −threshold and x_0 = +1:

h(x) = sign( ( ∑_{i=1}^{d} w_i x_i ) − threshold )
     = sign( ( ∑_{i=1}^{d} w_i x_i ) + (−threshold) · (+1) )
     = sign( ∑_{i=0}^{d} w_i x_i )
     = sign( w^T x )

each ‘tall’ w represents a hypothesis h & is multiplied with ‘tall’ x
—will use tall versions to simplify notation

what do perceptrons h ‘look like’?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 21/99
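As a minimal sketch of the hypothesis above (assuming NumPy, with x_0 = 1 already prepended as in the derivation; the weights and features below are illustrative only):

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x); x[0] = 1 absorbs the threshold."""
    return 1 if w @ x > 0 else -1          # break the tie w @ x == 0 toward -1 for simplicity

w = np.array([-0.8, 1.0, -0.5, 0.1])       # w_0 = -threshold, then feature weights
x = np.array([ 1.0, 1.0,  0.2, 0.5])       # x_0 = 1, then (scaled) customer features
print(perceptron_h(w, x))                  # +1: approve credit, -1: deny credit
```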

(23)

• customer features x: points on the plane (or points in R^d)
• labels y: ◦ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in R^d)
—positive on one side of a line, negative on the other side
• different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 22/99

(24)

Select g from H

H = all possible perceptrons, g = ?

• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally g(x_n) = f(x_n) = y_n
• difficult: H is of infinite size
• idea: start from some g_0, and ‘correct’ its mistakes on D

will represent g_0 by its weight vector w_0

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 23/99

(25)

For t = 0, 1, . . .

1 find a mistake of w_t, called (x_{n(t)}, y_{n(t)}), i.e. sign(w_t^T x_{n(t)}) ≠ y_{n(t)}

2 (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}

. . . until no more mistakes

return last w (called w_PLA) as g

[Figure: the update w + y x rotates w toward x when y = +1, and away from x when y = −1.]

That’s it!
—A fault confessed is half redressed. :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 24/99
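A compact sketch of the algorithm above (assuming NumPy arrays, X with x_0 = 1 in its first column and labels y in {−1, +1}; the cap on updates is only a guard for non-separable data):

```python
import numpy as np

def pla(X, y, max_updates=1000):
    """Perceptron Learning Algorithm: repeatedly find a mistake and correct it."""
    w = np.zeros(X.shape[1])                              # start from w_0 = 0
    for _ in range(max_updates):
        mistakes = [n for n in range(len(y)) if np.sign(X[n] @ w) != y[n]]
        if not mistakes:                                   # no more mistakes: w is w_PLA
            return w
        n = mistakes[0]                                    # a mistake (x_n, y_n)
        w = w + y[n] * X[n]                                # correct it: w <- w + y_n x_n
    return w                                               # data may not be linearly separable

# tiny separable toy set: columns are (x_0 = 1, x_1, x_2)
X = np.array([[1, 2.0, 3.0], [1, 1.0, 1.5], [1, -1.0, -2.0], [1, -2.5, -0.5]])
y = np.array([+1, +1, -1, -1])
print(pla(X, y))
```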

(26)

The Learning Problem Perceptron Learning Algorithm

Seeing is Believing

[Figure sequence, one frame per update: starting from an initial w, nine PLA updates — each triggered by a mistake on a point such as x1, x3, x9, or x14 — rotate w(t) into w(t+1), finally reaching wPLA.]

worked like a charm with < 20 lines!!

(note: made x_i with x_0 = 1 for visual purpose)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 25/99

(37)

Types of Learning

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 26/99

(38)

Credit Approval Problem Revisited

age                23 years
gender             female
annual salary      NTD 1,000,000
year in residence  1 year
year in job        0.5 year
current debt       200,000
credit?            {no (−1), yes (+1)}

unknown target function f : X → Y (ideal credit approval formula)
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

Y = {−1, +1}: binary classification

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 27/99

(39)

• credit ⇒ approve/disapprove
• email ⇒ spam/non-spam
• patient ⇒ sick/not sick
• ad ⇒ profitable/not profitable
• answer ⇒ correct/incorrect (KDDCup 2010)

core and important problem with many tools as building block of other tools

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 28/99

(40)

[Figure: coins plotted by (size, mass), falling into 1c, 5c, 10c, and 25c regions.]

classify US coins (1c, 5c, 10c, 25c) by (size, mass)
Y = {1c, 5c, 10c, 25c}, or Y = {1, 2, · · · , K} (abstractly)
binary classification: special case with K = 2

Other Multiclass Classification Problems
• written digits ⇒ 0, 1, · · · , 9
• pictures ⇒ apple, orange, strawberry
• emails ⇒ spam, primary, social, promotion, update (Google)

many applications in practice, especially for ‘recognition’

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 29/99

(41)

multiclass classification: patient features ⇒ which type of cancer
regression: patient features ⇒ how many days before recovery

• Y = R, or Y = [lower, upper] ⊂ R (bounded regression)
—deeply studied in statistics

Other Regression Problems
• company data ⇒ stock price
• climate data ⇒ temperature

also core and important, with many ‘statistical’ tools as building block of other tools

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 30/99

(42)

Mini Summary

Learning with Different Output Space Y

binary classification: Y = {−1, +1}
multiclass classification: Y = {1, 2, · · · , K}
regression: Y = R
. . . and a lot more!!

unknown target function f : X → Y
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N)
↓
learning algorithm A → final hypothesis g ≈ f
↑
hypothesis set H

core tools: binary classification and regression

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 31/99

(43)

[Figure: coins plotted by (size, mass), every example labeled with its class.]

unknown target function f : X → Y
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N)
↓
learning algorithm A → final hypothesis g ≈ f
↑
hypothesis set H

supervised learning: every x_n comes with corresponding y_n

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 32/99

(44)

Unsupervised: Coin Recognition without y_n

[Figure: left, coins labeled by class — supervised multiclass classification; right, the same coins unlabeled, grouped only by proximity — unsupervised multiclass classification ⇐⇒ ‘clustering’.]

Other Clustering Problems
• articles ⇒ topics
• consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 33/99
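As a small, self-contained illustration of clustering (not part of the original slides; it assumes scikit-learn is installed and uses made-up coin measurements):

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn is available

# unlabeled coin measurements (size in mm, mass in g) -- illustrative numbers, no y_n
X = np.array([[19.0, 2.5], [19.2, 2.4], [19.1, 2.5],
              [21.2, 5.0], [21.3, 5.1], [21.1, 5.0],
              [24.3, 5.7], [24.2, 5.6]])

# clustering: {x_n} => cluster(x), here asking for K = 3 groups
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels)   # group indices only -- the algorithm never sees coin names
```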


(46)

Unsupervised: Learning without y_n

Other Unsupervised Learning Problems
• clustering: {x_n} ⇒ cluster(x) (≈ ‘unsupervised multiclass classification’)
—e.g. articles ⇒ topics
• density estimation: {x_n} ⇒ density(x) (≈ ‘unsupervised bounded regression’)
—e.g. traffic reports with location ⇒ dangerous areas
• outlier detection: {x_n} ⇒ unusual(x) (≈ extreme ‘unsupervised binary classification’)
—e.g. Internet logs ⇒ intrusion alert
• . . . and a lot more!!

unsupervised learning: diverse, with possibly very different performance goals

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 34/99

(47)

[Figure: three coin plots by (size, mass) — all examples labeled (supervised), only a few labeled (semi-supervised), none labeled (unsupervised / clustering).]

Other Semi-supervised Learning Problems
• face images with a few labeled ⇒ face identifier (Facebook)
• medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: leverage unlabeled data to avoid ‘expensive’ labeling

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 35/99

(48)

Reinforcement Learning

a ‘very different’ but natural way of learning

Teach Your Dog: Say ‘Sit Down’
The dog pees on the ground.
BAD DOG. THAT’S A VERY WRONG ACTION.

• cannot easily show the dog that y_n = sit when x_n = ‘sit down’
• but can ‘punish’ to say ỹ_n = pee is wrong

Other Reinforcement Learning Problems Using (x, ỹ, goodness)
• (customer, ad choice, ad click earning) ⇒ ad system
• (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with ‘partial/implicit information’ (often sequentially)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 36/99

(49)

Teach Your Dog: Say ‘Sit Down’
The dog sits down.
Good Dog. Let me give you some cookies.

• still cannot show y_n = sit when x_n = ‘sit down’
• but can ‘reward’ to say ỹ_n = sit is good

Other Reinforcement Learning Problems Using (x, ỹ, goodness)
• (customer, ad choice, ad click earning) ⇒ ad system
• (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with ‘partial/implicit information’ (often sequentially)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 36/99

(50)

Learning with Different Data Label y_n

• supervised: all y_n
• unsupervised: no y_n
• semi-supervised: some y_n
• reinforcement: implicit y_n by goodness(ỹ_n)
• . . . and more!!

unknown target function f : X → Y
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N)
↓
learning algorithm A → final hypothesis g ≈ f
↑
hypothesis set H

core tool: supervised learning

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 37/99

(51)

Possibility of Learning

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 38/99

(52)

[Figure: six example patterns of black and white cells, three labeled y_n = −1 and three labeled y_n = +1, plus a new pattern asking g(x) = ?]

let’s test your ‘human learning’ with 6 examples :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 39/99

(53)

The Learning Problem Possibility of Learning

Two Controversial Answers

whatever you say about g(x) on the new pattern,

truth f(x) = +1 because . . .
• symmetry ⇔ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because . . .
• left-top black ⇔ −1
• middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; your adversarial teacher can always call you ‘didn’t learn’. :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 40/99

(54)

Theoretical Foundation of Statistical Learning

if training and testing come from the same distribution, then with high probability,

E_out(g) ≤ E_in(g) + √( (8/N) · ln( 4 (2N)^{d_VC(H)} / δ ) )

where E_out(g) is the test error, E_in(g) the training error, and the square-root term Ω is the price of using H.

[Figure: error versus VC dimension d_VC — in-sample error decreases, model complexity Ω increases, and out-of-sample error is lowest in between.]

• d_VC(H): VC dimension of H ≈ # of parameters to describe H
• d_VC ↑: E_in ↓ but Ω ↑
• d_VC ↓: Ω ↓ but E_in ↑
• best d_VC in the middle

powerful H not always good!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 41/99
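A small numeric sketch of the bound (purely illustrative numbers; in practice the bound is loose, but the trade-off in d_VC is the point):

```python
import math

def vc_penalty(N, d_vc, delta):
    """Omega: sqrt((8/N) * ln(4 * (2N)^d_vc / delta)), the price of using H."""
    return math.sqrt(8.0 / N * math.log(4.0 * (2 * N) ** d_vc / delta))

E_in = 0.05                                    # illustrative training error
for d_vc in (3, 10, 50):                       # richer H: smaller E_in possible, larger Omega
    bound = E_in + vc_penalty(N=10_000, d_vc=d_vc, delta=0.05)
    print(d_vc, round(bound, 3))
```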

(55)

unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
↓  x_1, x_2, · · · , x_N drawn from P on X, with labels y_1, y_2, · · · , y_N from P(y|x)
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

if we control the complexity of H properly and minimize E_in, learning is possible :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 42/99

(56)

Linear Regression

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 43/99

(57)

year in residence  1 year
year in job        0.5 year
current debt       200,000
credit limit?      100,000

unknown target function f : X → Y (ideal credit limit formula)
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)
↓
learning algorithm A → final hypothesis g ≈ f (‘learned’ formula to be used)
↑
hypothesis set H (set of candidate formulas)

Y = R: regression

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 44/99

(58)

Linear Regression Hypothesis

age            23 years
annual salary  NTD 1,000,000
year in job    0.5 year
current debt   200,000

For x = (x_0, x_1, x_2, · · · , x_d) ‘features of customer’, approximate the desired credit limit with a weighted sum:

y ≈ ∑_{i=0}^{d} w_i x_i

linear regression hypothesis: h(x) = w^T x

h(x): like perceptron, but without the sign

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 45/99

(59)

[Figure: for x ∈ R, a regression line fitted through (x, y) points; for x = (x_1, x_2) ∈ R^2, a regression plane fitted through (x_1, x_2, y) points.]

linear regression: find lines/hyperplanes with small residuals

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 46/99

(60)

The Error Measure

popular/historical error measure: squared error err(ŷ, y) = (ŷ − y)^2

in-sample
E_in(h_w) = (1/N) ∑_{n=1}^{N} ( h(x_n) − y_n )^2 = (1/N) ∑_{n=1}^{N} ( w^T x_n − y_n )^2

out-of-sample
E_out(w) = E_{(x,y)∼P} ( w^T x − y )^2

next: how to minimize E_in(w)?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 47/99

(61)

E_in(w) = (1/N) ∑_{n=1}^{N} ( w^T x_n − y_n )^2 = (1/N) ∑_{n=1}^{N} ( x_n^T w − y_n )^2

        = (1/N) ‖ [ x_1^T w − y_1 ;  x_2^T w − y_2 ;  . . . ;  x_N^T w − y_N ] ‖^2

        = (1/N) ‖ X w − y ‖^2

with X the N×(d+1) matrix whose rows are x_1^T, · · · , x_N^T, w the (d+1)×1 weight vector, and y the N×1 label vector

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 48/99
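A one-line NumPy check of the matrix form (illustrative data; X already carries the x_0 = 1 column):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])   # N x (d+1), first column is x_0 = 1
y = np.array([1.0, 2.0, 3.5])                        # N labels
w = np.array([0.5, 0.6])                             # some candidate weights

E_in = np.mean((X @ w - y) ** 2)                     # (1/N) * ||Xw - y||^2
print(E_in)
```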

(62)

min_w E_in(w) = (1/N) ‖ X w − y ‖^2

E_in(w): continuous, differentiable, convex

necessary condition of ‘best’ w:

∇E_in(w) ≡ ( ∂E_in/∂w_0 (w), ∂E_in/∂w_1 (w), . . . , ∂E_in/∂w_d (w) ) = ( 0, 0, . . . , 0 )

—not possible to ‘roll down’ any further

task: find w_LIN such that ∇E_in(w_LIN) = 0

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 49/99

(63)

E_in(w) = (1/N) ‖ X w − y ‖^2 = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y ), with A = X^T X, b = X^T y, c = y^T y

one w only
E_in(w) = (1/N) ( a w^2 − 2 b w + c )
∇E_in(w) = (1/N) ( 2 a w − 2 b )      simple! :-)

vector w
E_in(w) = (1/N) ( w^T A w − 2 w^T b + c )
∇E_in(w) = (1/N) ( 2 A w − 2 b )      similar (derived by definition)

∇E_in(w) = (2/N) ( X^T X w − X^T y )

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 50/99

(64)

task: find w_LIN such that (2/N) ( X^T X w − X^T y ) = ∇E_in(w) = 0

invertible X^T X (often the case because N ≫ d + 1)
• easy! unique solution w_LIN = ( X^T X )^{−1} X^T y, with X† = ( X^T X )^{−1} X^T the pseudo-inverse of X

singular X^T X
• many optimal solutions
• one of the solutions is w_LIN = X† y, by defining X† in other ways

practical suggestion: use a well-implemented † routine instead of ( X^T X )^{−1} X^T, for numerical stability when X^T X is almost singular

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 51/99

(65)

Linear Regression Algorithm

1 from D, construct the N×(d+1) input matrix X (rows x_1^T, x_2^T, · · · , x_N^T) and the N×1 output vector y (entries y_1, y_2, · · · , y_N)

2 calculate the pseudo-inverse X† ((d+1)×N)

3 return w_LIN = X† y ((d+1)×1)

simple and efficient with a good † routine

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 52/99
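The three steps map directly onto NumPy (a sketch with made-up data; np.linalg.pinv computes the pseudo-inverse, and np.linalg.lstsq is one ‘well-implemented routine’):

```python
import numpy as np

# step 1: construct X (with the x_0 = 1 column) and y from D -- illustrative numbers
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.3],
              [1.0, 2.0, 2.2],
              [1.0, 3.5, 0.9]])
y = np.array([1.0, 2.1, 3.9, 5.2])

# steps 2 + 3: w_LIN = pseudo-inverse(X) @ y
w_lin = np.linalg.pinv(X) @ y

# equivalent, and usually preferred for numerical stability
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_lin, w_lstsq)   # the two agree up to floating-point error
```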

(66)

Is Linear Regression a ‘Learning Algorithm’?

w_LIN = X† y

No!
• analytic (closed-form) solution, ‘instantaneous’
• not improving E_in nor E_out iteratively

Yes!
• good E_in? yes, optimal!
• good E_out? yes, finite d_VC like perceptrons
• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_LIN) is good, learning ‘happened’!

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 53/99

(67)

Logistic Regression

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 54/99

(68)

Heart Attack Prediction Problem (1/2)

age                40 years
gender             male
blood pressure     130/85
cholesterol level  240
weight             70
heart disease?     yes

unknown target distribution P(y|x) containing f(x) + noise
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N)
↓
learning algorithm A → final hypothesis g ≈ f
↑
hypothesis set H, with error measure err, êrr

binary classification: ideal f(x) = sign( P(+1|x) − 1/2 ) ∈ {−1, +1}, because of classification err

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 55/99

(69)

blood pressure     130/85
cholesterol level  240
weight             70
heart attack?      80% risk

unknown target distribution P(y|x) containing f(x) + noise
↓
training examples D : (x_1, y_1), · · · , (x_N, y_N)
↓
learning algorithm A → final hypothesis g ≈ f
↑
hypothesis set H, with error measure err, êrr

‘soft’ binary classification: f(x) = P(+1|x) ∈ [0, 1]

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 56/99

(70)

Soft Binary Classification

target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data
( x_1, y′_1 = 0.9 = P(+1|x_1) )
( x_2, y′_2 = 0.2 = P(+1|x_2) )
. . .
( x_N, y′_N = 0.6 = P(+1|x_N) )

actual (noisy) data
( x_1, y_1 = ◦ ) sampled from P(y|x_1)
( x_2, y_2 = × ) sampled from P(y|x_2)
. . .
( x_N, y_N = × ) sampled from P(y|x_N)

same data as hard binary classification, different target function

target function

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 57/99


(72)

age                40 years
gender             male
blood pressure     130/85
cholesterol level  240

For x = (x_0, x_1, x_2, · · · , x_d) ‘features of patient’, calculate a weighted ‘risk score’:

s = ∑_{i=0}^{d} w_i x_i

convert the score to an estimated probability by the logistic function θ(s)

[Figure: θ(s) is an S-shaped curve rising from 0 to 1 as s goes from −∞ to +∞.]

logistic hypothesis: h(x) = θ(w^T x) = 1 / ( 1 + exp(−w^T x) )

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 58/99
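A minimal sketch of the logistic hypothesis (illustrative weights and features; x includes x_0 = 1):

```python
import numpy as np

def theta(s):
    """Logistic function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_h(w, x):
    """Logistic hypothesis h(x) = theta(w^T x): estimated probability of y = +1."""
    return theta(w @ x)

w = np.array([-1.0, 0.04, 0.01])     # illustrative weights: bias, age, cholesterol
x = np.array([ 1.0, 40.0, 240.0])    # x_0 = 1, age 40, cholesterol 240
print(logistic_h(w, x))              # about 0.95: a high estimated risk
```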

(73)

[Figure: each of the three models computes the score s = w^T x from features x_0, x_1, · · · , x_d, then transforms s differently.]

linear classification
h(x) = sign(s)
plausible err = 0/1 (small flipping noise)

linear regression
h(x) = s
friendly err = squared (easy to minimize)

logistic regression
h(x) = θ(s)
err = ?

how to define E_in(w) for logistic regression?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 59/99

(74)

target function f(x) = P(+1|x)  ⇔  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1

consider D = {(x_1, ◦), (x_2, ×), . . . , (x_N, ×)}

probability that f generates D
P(x_1) f(x_1) × P(x_2) (1 − f(x_2)) × . . . × P(x_N) (1 − f(x_N))

likelihood that h generates D
P(x_1) h(x_1) × P(x_2) (1 − h(x_2)) × . . . × P(x_N) (1 − h(x_N))

if h ≈ f, then likelihood(h) ≈ probability using f, and the probability using f is usually large

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 60/99

(76)

Likelihood of Logistic Hypothesis

likelihood(h) ≈ (probability using f), which is large

g = argmax_h likelihood(h)

when logistic: h(x) = θ(w^T x), and 1 − h(x) = h(−x) by the symmetry of θ

likelihood(h) = P(x_1) h(+x_1) × P(x_2) h(−x_2) × . . . × P(x_N) h(−x_N)

likelihood(logistic h) ∝ ∏_{n=1}^{N} h(y_n x_n)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 61/99

(78)

The Learning Problem Logistic Regression

Cross-Entropy Error

max_h likelihood(logistic h) ∝ ∏_{n=1}^{N} h(y_n x_n)

max_w likelihood(w) ∝ ∏_{n=1}^{N} θ( y_n w^T x_n )

taking ln and using θ(s) = 1 / ( 1 + exp(−s) ):

max_w ln ∏_{n=1}^{N} θ( y_n w^T x_n )   ⇒   min_w (1/N) ∑_{n=1}^{N} ln( 1 + exp( −y_n w^T x_n ) ) = min_w (1/N) ∑_{n=1}^{N} err(w, x_n, y_n) = min_w E_in(w)

err(w, x, y) = ln( 1 + exp( −y w^T x ) ): cross-entropy error

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 62/99
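E_in(w) above translates directly into NumPy (a sketch; X carries the x_0 = 1 column and y is in {−1, +1}):

```python
import numpy as np

def cross_entropy_Ein(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))   # log1p(z) = ln(1 + z)

# illustrative data
X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, 0.5], [1.0, 0.5, -2.0]])
y = np.array([+1, -1, -1])
print(cross_entropy_Ein(np.zeros(3), X, y))          # ln(2) ~ 0.693 at w = 0
```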

(82)

Minimizing E_in(w)

min_w E_in(w) = (1/N) ∑_{n=1}^{N} ln( 1 + exp( −y_n w^T x_n ) )

E_in(w): continuous, differentiable, twice-differentiable, convex

how to minimize? locate the valley: want ∇E_in(w) = 0

first: derive ∇E_in(w)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 63/99

(83)

The Learning Problem Logistic Regression

The Gradient ∇E_in(w)

E_in(w) = (1/N) ∑_{n=1}^{N} ln( 1 + exp( −y_n w^T x_n ) )

applying the chain rule term by term (∂ ln(u)/∂u = 1/u, ∂ exp(v)/∂v = exp(v), ∂(−y_n w^T x_n)/∂w_i = −y_n x_{n,i}):

∂E_in(w)/∂w_i = (1/N) ∑_{n=1}^{N} ( exp(−y_n w^T x_n) / ( 1 + exp(−y_n w^T x_n) ) ) ( −y_n x_{n,i} )
              = (1/N) ∑_{n=1}^{N} θ( −y_n w^T x_n ) ( −y_n x_{n,i} )

∇E_in(w) = (1/N) ∑_{n=1}^{N} θ( −y_n w^T x_n ) ( −y_n x_n )

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 64/99
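The gradient in NumPy (a sketch reusing the conventions of the previous snippet):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_Ein(w, X, y):
    """(1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
    weights = theta(-y * (X @ w))            # theta(-y_n w^T x_n), one weight per example
    return (weights * (-y)) @ X / len(y)     # theta-weighted sum of -y_n x_n, averaged

X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, 0.5], [1.0, 0.5, -2.0]])
y = np.array([+1, -1, -1])
print(grad_Ein(np.zeros(3), X, y))
```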

(85)

want ∇E_in(w) = (1/N) ∑_{n=1}^{N} θ( −y_n w^T x_n ) ( −y_n x_n ) = 0

• a scaled, θ-weighted sum of −y_n x_n
• all θ(·) = 0: only if y_n w^T x_n ≫ 0 — linearly separable D
• weighted sum = 0: non-linear equation of w

closed-form solution? no :-(

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 65/99

(86)

The Learning Problem Logistic Regression

PLA Revisited: Iterative Optimization

PLA: start from some w_0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1 find a mistake of w_t, called (x_{n(t)}, y_{n(t)}): sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
2 (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}

(equivalently) pick some n, and update w_t by

w_{t+1} ← w_t + ⟦ sign(w_t^T x_n) ≠ y_n ⟧ · y_n x_n

viewed generally,

w_{t+1} ← w_t + η · v,   here with η = 1 and v = ⟦ sign(w_t^T x_n) ≠ y_n ⟧ · y_n x_n

when stop, return last w as g

choice of (η, v) and stopping condition defines an iterative optimization approach

Hsuan-Tien Lin (NTU CSIE) Machine Learning Basics 66/99
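One natural choice, not spelled out on this slide, is v = −∇E_in(w_t) with a fixed step size η, i.e. gradient descent for logistic regression. A minimal sketch (η, the iteration count, and the data are illustrative assumptions; theta and grad_Ein are as in the previous snippets):

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_Ein(w, X, y):
    return (theta(-y * (X @ w)) * (-y)) @ X / len(y)

def logistic_regression_gd(X, y, eta=0.1, T=1000):
    """Iterative optimization: w_{t+1} <- w_t + eta * v, with v = -grad E_in(w_t)."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = w - eta * grad_Ein(w, X, y)      # stop simply after T steps
    return w

X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, 0.5], [1.0, 0.5, -2.0], [1.0, 1.5, 2.5]])
y = np.array([+1, -1, -1, +1])
w = logistic_regression_gd(X, y)
print(theta(X @ w))                          # estimated P(+1 | x_n) for each example
```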
