(1)

Quick Tour of Machine Learning ( 機器學習速遊)  

Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering

National Taiwan University (國立台灣大學資訊工程系)

Data Science Enthusiast Annual Conference series event (資料科學愛好者年會系列活動), 2015/12/12

(2)

Learning from Data

Disclaimer

just a super-condensed and shuffled version of

• my co-authored textbook “Learning from Data: A Short Course”

• my two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses

  • “Machine Learning Foundations”: www.coursera.org/course/ntumlone

  • “Machine Learning Techniques”: www.coursera.org/course/ntumltwo

—impossible to be complete, with most math details removed

live interaction is important

goal: help you begin your journey with ML

(3)

Learning from Data

Roadmap

Learning from Data

What is Machine Learning

Components of Machine Learning

Types of Machine Learning

Step-by-step Machine Learning

(4)

Learning from Data What is Machine Learning

Learning from Data ::

What is Machine Learning

(5)

Learning from Data What is Machine Learning

From Learning to Machine Learning

learning: acquiring skill with experience accumulated from observations

observations → learning → skill

machine learning: acquiring skill with experience accumulated/computed from data

data → ML → skill

What is skill?

(6)

Learning from Data What is Machine Learning

A More Concrete Definition

skill ⇔ improve some performance measure (e.g. prediction accuracy)

machine learning: improving some performance measure with experience computed from data

data → ML → improved performance measure

An Application in Computational Finance

stock data → ML → more investment gain

Why use machine learning?

(7)

Learning from Data What is Machine Learning

Yet Another Application: Tree Recognition

• ‘define’ trees and hand-program: difficult

• learn from data (observations) and recognize: a 3-year-old can do so

• ‘ML-based tree recognition system’ can be easier to build than a hand-programmed system

ML: an alternative route to build complicated systems

(8)

Learning from Data What is Machine Learning

The Machine Learning Route

ML: an alternative route to build complicated systems

Some Use Scenarios

when human cannot program the system manually

—navigating on Mars

when human cannot ‘define the solution’ easily

—speech/visual recognition

when needing rapid decisions that humans cannot do

—high-frequency trading

when needing to be user-oriented in a massive scale

—consumer-targeted marketing

Give a computer a fish, you feed it for a day; teach it how to fish, you feed it for a lifetime. :-)

(9)

Learning from Data What is Machine Learning

Machine Learning and Artificial Intelligence

Machine Learning

use data to compute something that improves performance

Artificial Intelligence

compute something that shows intelligent behavior

improving performance is something that shows intelligent behavior

—ML can realize AI, among other routes

e.g. chess playing

• traditional AI: game tree

• ML for AI: ‘learning from board data’

ML is one possible and popular route to realize AI

(10)

Learning from Data Components of Machine Learning

Learning from Data ::

Components of Machine Learning

(11)

Learning from Data Components of Machine Learning

Components of Learning:

Metaphor Using Credit Approval

Applicant Information

age                23 years
gender             female
annual salary      NTD 1,000,000
year in residence  1 year
year in job        0.5 year
current debt       200,000

what to learn? (for improving performance):

‘approve credit card good for bank?’

(12)

Learning from Data Components of Machine Learning

Formalize the Learning Problem

Basic Notations

• input: x ∈ X (customer application)

• output: y ∈ Y (good/bad after approving credit card)

• unknown underlying pattern to be learned ⇔ target function: f : X → Y (ideal credit approval formula)

• data ⇔ training examples: D = {(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N)} (historical records in bank)

• hypothesis ⇔ skill with hopefully good performance: g : X → Y (‘learned’ formula to be used), i.e. approve if

  • h_1: annual salary > NTD 800,000

  • h_2: debt > NTD 100,000 (really?)

  • h_3: year in job ≤ 2 (really?)

—all candidate formulas being considered: hypothesis set H

—procedure to learn the best formula: algorithm A

{(x_n, y_n)} from f → ML (A, H) → g

(13)

Learning from Data Components of Machine Learning

Practical Definition of Machine Learning

unknown target function f : X → Y (ideal credit approval formula)

training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A

final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formulas)

machine learning (A and H): use data to compute hypothesis g that approximates target f

(14)

Learning from Data Components of Machine Learning

Key Essence of Machine Learning

machine learning: use data to compute hypothesis g that approximates target f

data → ML → improved performance measure

1. exists some ‘underlying pattern’ to be learned —so ‘performance measure’ can be improved

2. but no programmable (easy) definition —so ‘ML’ is needed

3. somehow there is data about the pattern —so ML has some ‘inputs’ to learn from

key essence: help decide whether to use ML

(15)

Learning from Data Types of Machine Learning

Learning from Data ::

Types of Machine Learning

(16)

Learning from Data Types of Machine Learning

Visualizing Credit Card Problem

customer features x: points on the plane (or points in R^d)

labels y: ◦ (+1), × (−1)

called binary classification

hypothesis h: lines here, but possibly other curves

different curves classify customers differently

binary classification algorithm: find a good decision boundary curve g

(17)

Learning from Data Types of Machine Learning

More Binary Classification Problems

credit ⇒ approve/disapprove

email ⇒ spam/non-spam

patient ⇒ sick/not sick

ad ⇒ profitable/not profitable

core and important problem with many tools as building block of other tools

(18)

Learning from Data Types of Machine Learning

Binary Classification for Education

data → ML → skill

• data: students’ records on quizzes on a Math tutoring system

• skill: predict whether a student can give a correct answer to another quiz question

A Possible ML Solution

answer correctly ≈ ⟦recent strength of student > difficulty of question⟧

give ML 9 million records from 3000 students

ML determines (reverse-engineers) strength and difficulty automatically

key part of the world-champion system from National Taiwan Univ. in KDDCup 2010

(19)

Learning from Data Types of Machine Learning

Multiclass Classification: Coin Recognition Problem

[figure: US coins (1c, 5c, 10c, 25c) plotted by size and mass]

classify US coins (1c, 5c, 10c, 25c) by (size, mass)

Y = {1c, 5c, 10c, 25c}, or Y = {1, 2, · · · , K} (abstractly)

binary classification: special case with K = 2

Other Multiclass Classification Problems

written digits ⇒ 0, 1, · · · , 9

pictures ⇒ apple, orange, strawberry

emails ⇒ spam, primary, social, promotion, update (Google)

many applications in practice, especially for ‘recognition’

(20)

Learning from Data Types of Machine Learning

Regression: Patient Recovery Prediction Problem

binary classification: patient features ⇒ sick or not

multiclass classification: patient features ⇒ which type of cancer

regression: patient features ⇒ how many days before recovery

• Y = R, or Y = [lower, upper] ⊂ R (bounded regression)

—deeply studied in statistics

Other Regression Problems

company data ⇒ stock price

climate data ⇒ temperature

also core and important with many ‘statistical’ tools as building block of other tools

(21)

Learning from Data Types of Machine Learning

Regression for Recommender System (1/2)

data → ML → skill

• data: how many users have rated some movies

• skill: predict how a user would rate an unrated movie

A Hot Problem

competition held by Netflix in 2006

• 100,480,507 ratings that 480,189 users gave to 17,770 movies

• 10% improvement = 1 million dollar prize

similar competition (movies → songs) held by Yahoo! in KDDCup 2011

• 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines learn our preferences?

(22)

Learning from Data Types of Machine Learning

Regression for Recommender System (2/2)

Match movie and viewer factors

[figure: a movie’s factors (comedy content, action content, blockbuster?, Tom Cruise in it?) matched against a viewer’s factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); add the contributions from each factor to get the predicted rating]

A Possible ML Solution

• pattern: rating ← viewer/movie factors

• learning: known rating → learned factors → unknown rating prediction

key part of the world-champion (again!) system from National Taiwan Univ. in KDDCup 2011
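The ‘add contributions from each factor’ idea can be sketched as an inner product between a movie-factor vector and a viewer-factor vector. A minimal illustration below; the factor names and numbers are made up for this example, not learned from real data.

```python
import numpy as np

# hypothetical movie factors: (comedy content, action content, blockbuster?, Tom Cruise in it?)
movie_factors = np.array([0.1, 0.9, 0.8, 1.0])

# matching hypothetical viewer factors: (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?)
viewer_factors = np.array([0.3, 0.8, 0.5, 0.9])

# predicted rating = add the contribution of each matched factor (an inner product)
predicted_rating = movie_factors @ viewer_factors
print(predicted_rating)
```

Learning then amounts to reverse-engineering both factor vectors from the known ratings.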

(23)

Learning from Data Types of Machine Learning

Supervised versus Unsupervised

coin recognition with y_n

[figure: labeled coins (1c, 5c, 10c, 25c) by mass and size]

supervised multiclass classification

coin recognition without y_n

[figure: unlabeled coins by mass and size]

unsupervised multiclass classification ⇐⇒ ‘clustering’

Other Clustering Problems

articles ⇒ topics

consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem


(25)

Learning from Data Types of Machine Learning

Semi-supervised: Coin Recognition with Some y_n

[figures: supervised (all coins labeled), semi-supervised (only some coins labeled), and unsupervised/clustering (no labels) versions of the mass–size coin plot]

Other Semi-supervised Learning Problems

face images with a few labeled ⇒ face identifier (Facebook)

medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: leverage unlabeled data to avoid ‘expensive’ labeling

(26)

Learning from Data Types of Machine Learning

Reinforcement Learning

a ‘very different’ but natural way of learning

Teach Your Dog: Say ‘Sit Down’

The dog pees on the ground. BAD DOG. THAT’S A VERY WRONG ACTION.

• cannot easily show the dog that y_n = sit when x_n = ‘sit down’

• but can ‘punish’ to say ỹ_n = pee is wrong

Other Reinforcement Learning Problems Using (x, ỹ, goodness)

(customer, ad choice, ad click earning) ⇒ ad system

(cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with ‘partial/implicit information’ (often sequentially)

(27)

Learning from Data Types of Machine Learning

Reinforcement Learning

a ‘very different’ but natural way of learning

Teach Your Dog: Say ‘Sit Down’

The dog sits down. Good Dog. Let me give you some cookies.

• still cannot show y_n = sit when x_n = ‘sit down’

• but can ‘reward’ to say ỹ_n = sit is good

Other Reinforcement Learning Problems Using (x, ỹ, goodness)

(customer, ad choice, ad click earning) ⇒ ad system

(cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with ‘partial/implicit information’ (often sequentially)

(28)

Learning from Data Step-by-step Machine Learning

Learning from Data ::

Step-by-step Machine Learning

(29)

Learning from Data Step-by-step Machine Learning

Step-by-step Machine Learning

unknown target function f : X → Y (ideal credit approval formula)

training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A

final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formulas)

1. choose error measure: how g(x) ≈ f(x)

2. decide hypothesis set H

3. optimize error (and more) on D as A

4. pray for generalization: whether g(x) ≈ f(x) for unseen x

(30)

Learning from Data Step-by-step Machine Learning

Choose Error Measure

g ≈ f can often be evaluated by averaged err(g(x), f(x)), called pointwise error measure

in-sample (within data):

E_in(g) = (1/N) Σ_{n=1}^{N} err(g(x_n), f(x_n)),  where f(x_n) = y_n

out-of-sample (future data):

E_out(g) = E_{future x} err(g(x), f(x))

will start from the 0/1 error err(ỹ, y) = ⟦ỹ ≠ y⟧ for classification
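A minimal sketch of these quantities for classification, assuming NumPy arrays and a hypothesis g given as a function; the names are illustrative only, not part of the lecture.

```python
import numpy as np

def err_01(y_hat, y):
    """Pointwise 0/1 error: 1 when the prediction differs from the label, 0 otherwise."""
    return (y_hat != y).astype(float)

def E_in(g, X, y):
    """In-sample error: average pointwise error of hypothesis g over the data set."""
    return np.mean(err_01(g(X), y))

# toy usage: a fixed hypothesis that predicts the sign of the first feature
X = np.array([[1.0, 2.0], [-0.5, 1.0], [3.0, -1.0]])
y = np.array([1, -1, 1])
g = lambda X: np.sign(X[:, 0])
print(E_in(g, X, y))   # fraction of mistakes on the three examples
```

E_out cannot be computed this way because future x is unseen; it can only be estimated, e.g. on held-out data.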

(31)

Learning from Data Step-by-step Machine Learning

Choose Hypothesis Set (for Credit Approval)

age            23 years
annual salary  NTD 1,000,000
year in job    0.5 year
current debt   200,000

For x = (x_1, x_2, · · · , x_d) ‘features of customer’, compute a weighted ‘score’ and

• approve credit if Σ_{i=1}^{d} w_i x_i > threshold

• deny credit if Σ_{i=1}^{d} w_i x_i < threshold

Y: {+1 (good), −1 (bad)}, 0 ignored—linear formulas h ∈ H are

h(x) = sign( (Σ_{i=1}^{d} w_i x_i) − threshold )

linear (binary) classifier, called ‘perceptron’ historically

(32)

Learning from Data Step-by-step Machine Learning

Optimize Error (and More) on Data

H = all possible perceptrons, g = ?

• want: g ≈ f (hard when f unknown)

• almost necessary: g ≈ f on D, ideally g(x_n) = f(x_n) = y_n

• difficult: H is of infinite size

• idea: start from some g_0, and ‘correct’ its mistakes on D

let’s visualize without math

(33)

Learning from Data Step-by-step Machine Learning

Seeing is Believing

[figure: the mistake-correcting routine in action—starting from an initial guess, the weight vector w(t) is corrected on mistaken examples (x1, x9, x14, x3, x9, x14, . . .) over nine updates until the final w_PLA separates the data]

worked like a charm with < 20 lines!!

—A fault confessed is half redressed. :-)
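A minimal NumPy sketch of such a mistake-correcting routine (the perceptron learning algorithm), assuming the inputs X already include a constant feature x_0 = 1 and labels y are in {−1, +1}; it illustrates the idea and is not the lecture’s actual code.

```python
import numpy as np

def pla(X, y, max_updates=1000):
    """Perceptron Learning Algorithm: start from w = 0 and correct one mistake at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:       # no mistakes left: the data is separated
            break
        n = mistakes[0]              # pick a mistaken example (x_n, y_n)
        w = w + y[n] * X[n]          # correct it: w(t+1) <- w(t) + y_n * x_n
    return w

# toy usage on a linearly separable set (first column is the constant feature x_0 = 1)
X = np.array([[1, 2.0, 3.0], [1, -1.0, -2.0], [1, 3.0, 1.0], [1, -2.0, -1.0]])
y = np.array([1, -1, 1, -1])
w_pla = pla(X, y)
print(np.sign(X @ w_pla))   # should match y on this separable toy set
```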


(44)

Learning from Data Step-by-step Machine Learning

Pray for Generalization

(pictures from Google Image Search)

[figure: a parent shows a kid (picture, label) pairs, and the kid’s brain forms a good hypothesis; analogously, a target f(x) + noise generates examples (picture x_n, label y_n), and a learning algorithm with hypothesis set H forms a good hypothesis g(x) ≈ f(x)]

challenge: see only {(x_n, y_n)} without knowing f nor noise
⇒ generalize to unseen (x, y) w.r.t. f(x)

(45)

Learning from Data Step-by-step Machine Learning

Generalization Is Non-trivial

Bob impresses Alice by memorizing every given (movie, rank), but is too nervous about a new movie and guesses randomly

(pictures from Google Image Search)

memorize ≠ generalize

perfect from Bob’s view ≠ good for Alice

perfect during training ≠ good when testing

take-home message: if H is simple (like lines), generalization is usually possible

(46)

Learning from Data Step-by-step Machine Learning

Mini-Summary

Learning from Data

• What is Machine Learning: use data to approximate target

• Components of Machine Learning: algorithm A takes data D and hypotheses H to get hypothesis g

• Types of Machine Learning: variety of problems almost everywhere

• Step-by-step Machine Learning: error, hypotheses, optimize, generalize

(47)

Fundamental Machine Learning Models

Roadmap

Fundamental Machine Learning Models

Linear Regression

Logistic Regression

Nonlinear Transform

Decision Tree

(48)

Fundamental Machine Learning Models Linear Regression

Fundamental Machine Learning Models ::

Linear Regression

(49)

Fundamental Machine Learning Models Linear Regression

Credit Limit Problem

age                23 years
gender             female
annual salary      NTD 1,000,000
year in residence  1 year
year in job        0.5 year
current debt       200,000

credit limit? 100,000

unknown target function f : X → Y (ideal credit limit formula)

training examples D : (x_1, y_1), · · · , (x_N, y_N) (historical records in bank)

learning algorithm A

final hypothesis g ≈ f (‘learned’ formula to be used)

hypothesis set H (set of candidate formulas)

Y = R: regression

(50)

Fundamental Machine Learning Models Linear Regression

Linear Regression Hypothesis

age            23 years
annual salary  NTD 1,000,000
year in job    0.5 year
current debt   200,000

For x = (x_0, x_1, x_2, · · · , x_d) ‘features of customer’, approximate the desired credit limit with a weighted sum:

y ≈ Σ_{i=0}^{d} w_i x_i

linear regression hypothesis: h(x) = w^T x

h(x): like perceptron, but without the sign

(51)

Fundamental Machine Learning Models Linear Regression

Illustration of Linear Regression

[figures: for x = (x) ∈ R, a line fit through points in the x–y plane; for x = (x_1, x_2) ∈ R^2, a plane fit through points in (x_1, x_2, y) space]

linear regression: find lines/hyperplanes with small residuals

(52)

Fundamental Machine Learning Models Linear Regression

The Error Measure

popular/historical error measure: squared error err(ŷ, y) = (ŷ − y)^2

in-sample:

E_in(h_w) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2,  with h(x_n) = w^T x_n

out-of-sample:

E_out(w) = E_{(x,y)∼P} (w^T x − y)^2

next: how to minimize E_in(w)?

(53)

Fundamental Machine Learning Models Linear Regression

Minimize E_in

min_w E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)^2

E_in(w): continuous, differentiable, convex

necessary condition of ‘best’ w:

∇E_in(w) ≡ ( ∂E_in/∂w_0 (w), ∂E_in/∂w_1 (w), . . . , ∂E_in/∂w_d (w) ) = (0, 0, . . . , 0)

—not possible to ‘roll down’

task: find w_LIN such that ∇E_in(w_LIN) = 0

(54)

Fundamental Machine Learning Models Linear Regression

Linear Regression Algorithm

1. from D, construct input matrix X and output vector y by

   X = [ x_1^T ; x_2^T ; · · · ; x_N^T ]   (N × (d+1))

   y = [ y_1 ; y_2 ; · · · ; y_N ]   (N × 1)

2. calculate pseudo-inverse X†   ((d+1) × N)

3. return w_LIN = X† y   ((d+1) × 1)

simple and efficient with a good † routine
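A minimal sketch of steps 1–3, assuming NumPy; np.linalg.pinv plays the role of the ‘good † routine’, and the toy numbers are made up.

```python
import numpy as np

def linear_regression(X, y):
    """w_LIN = pseudo-inverse(X) @ y, following steps 2-3 above."""
    return np.linalg.pinv(X) @ y

# step 1 on toy data: x_0 = 1 is the constant feature, target is roughly 1 + 2*x_1
X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

w_lin = linear_regression(X, y)
print(w_lin)       # learned weights
print(X @ w_lin)   # in-sample predictions
```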

(55)

Fundamental Machine Learning Models Linear Regression

Is Linear Regression a ‘Learning Algorithm’?

w_LIN = X† y

No!

• analytic (closed-form) solution, ‘instantaneous’

• not improving E_in nor E_out iteratively

Yes!

• good E_in? yes, optimal!

• good E_out? yes, ‘simple’ like perceptrons

• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_LIN) is good, learning ‘happened’!

(56)

Fundamental Machine Learning Models Logistic Regression

Fundamental Machine Learning Models ::

Logistic Regression

(57)

Fundamental Machine Learning Models Logistic Regression

Heart Attack Prediction Problem (1/2)

age                40 years
gender             male
blood pressure     130/85
cholesterol level  240
weight             70

heart disease? yes

unknown target distribution P(y|x) containing f(x) + noise

training examples D : (x_1, y_1), · · · , (x_N, y_N)

learning algorithm A

final hypothesis g ≈ f

hypothesis set H

error measure err, êrr

binary classification: ideal f(x) = sign( P(+1|x) − 1/2 ) ∈ {−1, +1} because of classification err

(58)

Fundamental Machine Learning Models Logistic Regression

Heart Attack Prediction Problem (2/2)

age                40 years
gender             male
blood pressure     130/85
cholesterol level  240
weight             70

heart attack? 80% risk

unknown target distribution P(y|x) containing f(x) + noise

training examples D : (x_1, y_1), · · · , (x_N, y_N)

learning algorithm A

final hypothesis g ≈ f

hypothesis set H

error measure err, êrr

‘soft’ binary classification: f(x) = P(+1|x) ∈ [0, 1]

(59)

Fundamental Machine Learning Models Logistic Regression

Soft Binary Classification

target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data:

(x_1, y′_1 = 0.9 = P(+1|x_1))
(x_2, y′_2 = 0.2 = P(+1|x_2))
...
(x_N, y′_N = 0.6 = P(+1|x_N))

actual (noisy) data:

(x_1, y_1 = ◦ ∼ P(y|x_1))
(x_2, y_2 = × ∼ P(y|x_2))
...
(x_N, y_N = × ∼ P(y|x_N))

same data as hard binary classification, different target function

(60)

Fundamental Machine Learning Models Logistic Regression

Soft Binary Classification

target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data:

(x_1, y′_1 = 0.9 = P(+1|x_1))
(x_2, y′_2 = 0.2 = P(+1|x_2))
...
(x_N, y′_N = 0.6 = P(+1|x_N))

actual (noisy) data:

(x_1, y′_1 = 1, sampled from P(y|x_1))
(x_2, y′_2 = 0, sampled from P(y|x_2))
...
(x_N, y′_N = 0, sampled from P(y|x_N))

same data as hard binary classification, different target function

(61)

Fundamental Machine Learning Models Logistic Regression

Logistic Hypothesis

age                40 years
gender             male
blood pressure     130/85
cholesterol level  240

For x = (x_0, x_1, x_2, · · · , x_d) ‘features of patient’, calculate a weighted ‘risk score’:

s = Σ_{i=0}^{d} w_i x_i

convert the score to an estimated probability by the logistic function θ(s)

[figure: the S-shaped logistic function θ(s), rising from 0 to 1]

logistic hypothesis: h(x) = θ(w^T x) = 1 / (1 + exp(−w^T x))

(62)

Fundamental Machine Learning Models Logistic Regression

Minimizing E_in(w)

a popular error: E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−y_n w^T x_n)), called cross-entropy, derived from maximum likelihood

E_in(w): continuous, differentiable, twice-differentiable, convex

how to minimize? locate the valley: want ∇E_in(w) = 0

most basic algorithm: gradient descent (roll downhill)

(63)

Fundamental Machine Learning Models Logistic Regression

Gradient Descent

For t = 0, 1, . . .

w_{t+1} ← w_t + η v

when stopping, return the last w as g

• PLA: v comes from mistake correction

• smooth E_in(w) for logistic regression: choose v to get the ball to roll ‘downhill’

  • direction v: (assumed) of unit length

  • step size η: (assumed) positive

[figure: in-sample error E_in versus weights w, with a ball rolling down the error surface]

gradient descent: v ∝ −∇E_in(w_t)

(64)

Fundamental Machine Learning Models Logistic Regression

Putting Everything Together

Logistic Regression Algorithm

initialize w_0

For t = 0, 1, · · ·

1. compute ∇E_in(w_t) = (1/N) Σ_{n=1}^{N} θ(−y_n w_t^T x_n) (−y_n x_n)

2. update by w_{t+1} ← w_t − η ∇E_in(w_t)

...until ∇E_in(w_{t+1}) ≈ 0 or enough iterations

return the last w_{t+1} as g

can use more sophisticated tools to speed up
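A minimal NumPy sketch of the loop above with a fixed step size η, under the same conventions as before (constant feature x_0 = 1, labels in {−1, +1}); a toy illustration, not production code.

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression(X, y, eta=0.1, max_iter=10000):
    """Gradient descent on the cross-entropy error, following the update rule above."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # gradient of E_in: (1/N) sum theta(-y_n w^T x_n) (-y_n x_n)
        grad = np.mean((theta(-y * (X @ w)) * -y)[:, None] * X, axis=0)
        if np.linalg.norm(grad) < 1e-6:   # gradient roughly zero: stop
            break
        w = w - eta * grad                # w(t+1) <- w(t) - eta * grad E_in(w(t))
    return w

# toy usage: labels in {-1, +1}, constant feature x_0 = 1 in the first column
X = np.array([[1, 1.0], [1, 2.0], [1, -1.0], [1, -2.0]])
y = np.array([1, 1, -1, -1])
w = logistic_regression(X, y)
print(theta(X @ w))   # estimated P(+1 | x) for each example
```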

(65)

Fundamental Machine Learning Models Logistic Regression

Linear Models Summarized

linear scoring function: s = w^T x

• linear classification: h(x) = sign(s); plausible err = 0/1; discrete E_in(w): solvable in special case

• linear regression: h(x) = s; friendly err = squared; quadratic convex E_in(w): closed-form solution

• logistic regression: h(x) = θ(s); plausible err = cross-entropy; smooth convex E_in(w): gradient descent

my ‘secret’: linear first!!

(66)

Fundamental Machine Learning Models Nonlinear Transform

Fundamental Machine Learning Models ::

Nonlinear Transform

(67)

Fundamental Machine Learning Models Nonlinear Transform

Linear Hypotheses

up to now: linear hypotheses

• visually: ‘line’-like boundary

• mathematically: linear scores s = w^T x

but limited . . .

[figure: a data set on [−1, 1]^2 that no line separates well]

• theoretically: complexity under control :-)

• practically: on some D, large E_in for every line :-(

how to break the limit of linear hypotheses?

(68)

Fundamental Machine Learning Models Nonlinear Transform

Circular Separable

[figure: the same data, not separable by any line but separable by a circle]

D not linearly separable but circular separable by a circle of radius √0.6 centered at the origin:

h_SEP(x) = sign(−x_1^2 − x_2^2 + 0.6)

re-derive Circular-PLA, Circular-Regression, blah blah . . . all over again? :-)

(69)

Fundamental Machine Learning Models Nonlinear Transform

Circular Separable and Linear Separable

h(x) = sign( 0.6 · 1 + (−1) · x_1^2 + (−1) · x_2^2 )
     = sign( w̃_0 · z_0 + w̃_1 · z_1 + w̃_2 · z_2 )
     = sign( w̃^T z )

with z = (z_0, z_1, z_2) = (1, x_1^2, x_2^2) and w̃ = (0.6, −1, −1)

{(x_n, y_n)} circular separable =⇒ {(z_n, y_n)} linear separable

x ∈ X ↦ z ∈ Z: (nonlinear) feature transform Φ

[figures: the circularly separable data in X-space and its linearly separable image in Z-space]

circular separable in X =⇒ linear separable in Z

(70)

Fundamental Machine Learning Models Nonlinear Transform

General Quadratic Hypothesis Set

a ‘bigger’ Z-space with Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)

perceptrons in Z-space ⇐⇒ quadratic hypotheses in X-space

H_{Φ_2} = { h(x) : h(x) = h̃(Φ_2(x)) for some linear h̃ on Z }

can implement all possible quadratic curve boundaries: circle, ellipse, rotated ellipse, hyperbola, parabola, . . .

e.g. the ellipse 2(x_1 + x_2 − 3)^2 + (x_1 − x_2 − 4)^2 = 1 ⇐= w̃^T = [33, −20, −4, 3, 2, 3]

includes lines and constants as degenerate cases

(71)

Fundamental Machine Learning Models Nonlinear Transform

Good Quadratic Hypothesis

Z-space ⟷ X-space

perceptrons ⇐⇒ quadratic hypotheses

good perceptron ⇐⇒ good quadratic hypothesis

separating perceptron ⇐⇒ separating quadratic hypothesis

[figures: a separating line in Z-space corresponds to a separating circle in X-space]

• want: get a good perceptron in Z-space

• known: get a good perceptron in X-space with data {(x_n, y_n)}

• solution: get a good perceptron in Z-space with data {(z_n = Φ_2(x_n), y_n)}

(72)

Fundamental Machine Learning Models Nonlinear Transform

The Nonlinear Transform Steps

[figure: data in X-space → (Φ) → data in Z-space → (A) → linear boundary in Z-space → (Φ^{-1}) → quadratic boundary in X-space]

1. transform original data {(x_n, y_n)} to {(z_n = Φ(x_n), y_n)} by Φ

2. get a good perceptron w̃ using {(z_n, y_n)} and your favorite linear algorithm A

3. return g(x) = sign( w̃^T Φ(x) )

(73)

Fundamental Machine Learning Models Nonlinear Transform

Nonlinear Model via Nonlinear Φ + Linear Models

[figure: the same four-panel transform picture as the previous slide]

two choices:

• feature transform Φ

• linear model A, not just binary classification

Pandora’s box :-): can now freely do quadratic PLA, quadratic regression, cubic regression, . . ., polynomial regression
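A minimal sketch of the three steps with the quadratic transform Φ_2 and an arbitrary linear algorithm A plugged in; phi2 and nonlinear_classify are illustrative names, and the linear algorithm (e.g. the PLA sketch shown earlier) is assumed to be available.

```python
import numpy as np

def phi2(X):
    """Quadratic feature transform: Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

def nonlinear_classify(X, y, linear_algorithm):
    """Steps 1-3: transform to Z-space, run any linear algorithm A there, return g."""
    Z = phi2(X)                        # step 1: z_n = Phi(x_n)
    w_tilde = linear_algorithm(Z, y)   # step 2: good linear weights in Z-space
    return lambda X_new: np.sign(phi2(X_new) @ w_tilde)   # step 3: g(x) = sign(w~^T Phi(x))

# usage (assuming a linear algorithm such as the earlier pla sketch is defined):
# g = nonlinear_classify(X, y, pla)
```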

(74)

Fundamental Machine Learning Models Nonlinear Transform

Feature Transform Φ

[figure: raw digit pixels → (Φ) → (average intensity, symmetry) features → (A) → a ‘1’ versus ‘not 1’ classifier]

more generally, not just polynomial:

raw (pixels) → (domain knowledge) → concrete (intensity, symmetry)

the force, too good to be true? :-)

(75)

Fundamental Machine Learning Models Nonlinear Transform

Computation/Storage Price

Q-th order polynomial transform:

Φ_Q(x) = (1, x_1, x_2, . . . , x_d, x_1^2, x_1 x_2, . . . , x_d^2, . . . , x_1^Q, x_1^{Q−1} x_2, . . . , x_d^Q)

= 1 (for w̃_0) + d̃ (others) dimensions
= # ways of ≤ Q-combination from d kinds with repetitions
= C(Q+d, Q) = C(Q+d, d) = O(Q^d)

= efforts needed for computing/storing z = Φ_Q(x) and w̃

Q large =⇒ difficult to compute/store AND curve too complicated
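A quick check of this growth, using Python’s math.comb for the binomial coefficient; the chosen Q and d values are arbitrary.

```python
from math import comb

def poly_dims(Q, d):
    """Total dimensions 1 + d~ of the Q-th order polynomial transform: C(Q+d, d)."""
    return comb(Q + d, d)

for Q in (2, 4, 10):
    print(Q, poly_dims(Q, d=10))   # grows roughly like O(Q^d) for fixed d = 10
```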

(76)

Fundamental Machine Learning Models Nonlinear Transform

Generalization Issue

[figure: the same data fit with Φ_1 (the original x, a line with a few errors) versus Φ_4 (a wiggly fourth-order boundary)]

which one do you prefer? :-)

• Φ_1: ‘visually’ preferred

• Φ_4: E_in(g) = 0 but overkill

how to pick Q? model selection (to be discussed) important

(77)

Fundamental Machine Learning Models Decision Tree

Fundamental Machine Learning Models ::

Decision Tree

(78)

Fundamental Machine Learning Models Decision Tree

Decision Tree for Watching MOOC Lectures

G(x) = Σ_{t=1}^{T} q_t(x) · g_t(x)

• base hypothesis g_t(x): leaf at end of path t, a constant here

• condition q_t(x): ⟦is x on path t?⟧

• usually with simple internal nodes

[figure: a decision tree on (quitting time, has a date?, deadline?) deciding whether to watch MOOC lectures]

decision tree: arguably one of the most human-mimicking models

(79)

Fundamental Machine Learning Models Decision Tree

Recursive View of Decision Tree

Path View: G(x) = Σ_{t=1}^{T} ⟦x on path t⟧ · leaf_t(x)

[figure: the same (quitting time, date, deadline) decision tree]

Recursive View: G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ · G_c(x)

• G(x): full-tree hypothesis

• b(x): branching criteria

• G_c(x): sub-tree hypothesis at the c-th branch

tree = (root, sub-trees), just like what your data structure instructor would say :-)

(80)

Fundamental Machine Learning Models Decision Tree

A Basic Decision Tree Algorithm

G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ G_c(x)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x)
  else
    1. learn branching criteria b(x)
    2. split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}
    3. build sub-tree G_c ← DecisionTree(D_c)
    4. return G(x) = Σ_{c=1}^{C} ⟦b(x) = c⟧ G_c(x)

four choices: number of branches, branching criteria, termination criteria, & base hypothesis

(81)

Fundamental Machine Learning Models Decision Tree

Classification and Regression Tree (C&RT)

function DecisionTree(data D = {(x_n, y_n)}_{n=1}^{N})
  if termination criteria met
    return base hypothesis g_t(x)
  else ...
    2. split D to C parts D_c = {(x_n, y_n) : b(x_n) = c}

choices

• C = 2 (binary tree)

• g_t(x) = E_in-optimal constant

  • binary/multiclass classification (0/1 error): majority of {y_n}

  • regression (squared error): average of {y_n}

• branching: threshold some selected dimension

• termination: fully-grown, or better pruned

disclaimer: C&RT here is based on selected components of CART™ of California Statistical Software
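A toy sketch in the spirit of the choices above (C = 2, thresholding one selected dimension, E_in-optimal constant leaves for 0/1 error); a simplified illustration with a depth limit as the termination criterion, not the full CART algorithm.

```python
import numpy as np

def majority(y):
    """E_in-optimal constant for 0/1 error: the most common label."""
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def decision_tree(X, y, depth=0, max_depth=3):
    """C&RT-style sketch: binary splits by thresholding one dimension, constant leaves."""
    # termination: pure node or depth limit reached -> return a constant leaf
    if depth == max_depth or len(np.unique(y)) == 1:
        label = majority(y)
        return lambda x: label
    # branching criteria b(x): the (dimension, threshold) pair with fewest 0/1 errors
    best = None
    for i in range(X.shape[1]):
        for thr in np.unique(X[:, i]):
            mask = X[:, i] <= thr
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            errs = np.sum(left != majority(left)) + np.sum(right != majority(right))
            if best is None or errs < best[0]:
                best = (errs, i, thr)
    if best is None:                 # no valid split: fall back to a constant leaf
        label = majority(y)
        return lambda x: label
    _, i, thr = best
    # split D into two parts and build the sub-trees recursively
    mask = X[:, i] <= thr
    left_tree = decision_tree(X[mask], y[mask], depth + 1, max_depth)
    right_tree = decision_tree(X[~mask], y[~mask], depth + 1, max_depth)
    return lambda x: left_tree(x) if x[i] <= thr else right_tree(x)

# toy usage: the label is +1 exactly when the first feature exceeds 0.5
X = np.array([[0.2, 0.1], [0.8, 0.3], [0.9, 0.9], [0.1, 0.7]])
y = np.array([-1, 1, 1, -1])
g = decision_tree(X, y)
print([g(x) for x in X])   # should reproduce y on this tiny set
```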

(82)

Fundamental Machine Learning Models Decision Tree

A Simple Data Set

[figure: C&RT recursively splitting a simple two-dimensional data set]

C&RT: ‘divide-and-conquer’

