### Quick Tour of Machine Learning (機器學習速遊)

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Data Science Enthusiasts Annual Conference (資料科學愛好者年會) series event, 2015/12/12

### Disclaimer

- just a **super-condensed** and **shuffled** version of
  - my co-authored textbook "Learning from Data: A Short Course"
  - my two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
    - "Machine Learning Foundations": www.coursera.org/course/ntumlone
    - "Machine Learning Techniques": www.coursera.org/course/ntumltwo

  impossible to be complete, with most **math details removed**
- live **interaction** is important

goal: help you **begin** your journey with ML

### Roadmap

**Learning from Data**

- What is Machine Learning
- Components of Machine Learning
- Types of Machine Learning
- Step-by-step Machine Learning

### Learning from Data :: What is Machine Learning

### From Learning to Machine Learning

learning: acquiring **skill** with experience accumulated from **observations**

observations → learning → skill

machine learning: acquiring **skill** with experience accumulated/computed from **data**

data → ML → skill

What is **skill**?

### A More Concrete Definition

**skill** ⇔ improve some **performance measure** (e.g. prediction accuracy)

machine learning: improving some **performance measure** with experience **computed** from **data**

data → ML → improved performance measure

An Application in Computational Finance:

stock data → ML → more investment gain

Why use machine learning?

### Yet Another Application: Tree Recognition

- 'define' trees and hand-program: **difficult**
- learn from data (observations) and recognize: a **3-year-old can do so**
- an 'ML-based tree recognition system' can be **easier to build** than a hand-programmed system

ML: an **alternative route** to build complicated systems

### The Machine Learning Route

ML: an **alternative route** to build complicated systems

Some Use Scenarios:

- when humans cannot program the system manually: navigating on Mars
- when humans cannot 'define the solution' easily: speech/visual recognition
- when rapid decisions are needed that humans cannot make: high-frequency trading
- when needing to be user-oriented at a massive scale: consumer-targeted marketing

Give a computer a fish, you feed it for a day; teach it how to fish, you feed it for a lifetime. :-)

### Machine Learning and Artificial Intelligence

Machine Learning: use data to compute **something** that improves performance

Artificial Intelligence: compute **something that shows intelligent behavior**

- **improving performance** is something that shows **intelligent behavior**, so ML can realize AI, among other routes
- e.g. chess playing
  - traditional AI: game tree
  - ML for AI: 'learning from board data'

ML is one possible **and popular** route to realize AI

### Learning from Data :: Components of Machine Learning

### Components of Learning: Metaphor Using Credit Approval

Applicant Information:

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

what to learn (for improving performance): 'approve credit card good for bank?'

### Formalize the Learning Problem

Basic Notations:

- input: $\mathbf{x} \in \mathcal{X}$ (customer application)
- output: $y \in \mathcal{Y}$ (good/bad after approving credit card)
- **unknown** underlying pattern to be learned ⇔ target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- data ⇔ training examples $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$ (historical records in bank)
- hypothesis ⇔ skill with hopefully good performance: $g: \mathcal{X} \to \mathcal{Y}$ ('learned' formula to be used), e.g. approve if
  - $h_1$: annual salary > NTD 800,000
  - $h_2$: debt > NTD 100,000 (really?)
  - $h_3$: year in job ≤ 2 (really?)
- all **candidate formulas** being considered: hypothesis set $\mathcal{H}$
- procedure to **learn** the best formula: algorithm $\mathcal{A}$

$\{(\mathbf{x}_n, y_n)\}$ from $f$ → ML $(\mathcal{A}, \mathcal{H})$ → $g$
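To make the notation concrete, here is a minimal interface sketch (the names and the toy algorithm are ours for illustration, not from the course material):

```python
# Minimal sketch of the formalization: a learning algorithm A takes the
# data D and the hypothesis set H, and returns a final hypothesis g that
# hopefully approximates the unknown target f. All names are illustrative.
def learn(A, D, H):
    """D: list of (x, y) pairs; H: candidate hypotheses; returns g."""
    return A(D, H)

# a toy A: pick the hypothesis in a (finite) H with the fewest mistakes on D
def toy_A(D, H):
    return min(H, key=lambda h: sum(h(x) != y for x, y in D))

# candidate formulas echoing h1/h2 above; x = (annual salary, debt) in millions of NTD
H = [lambda x: +1 if x[0] > 0.8 else -1,   # h1: annual salary > NTD 800,000
     lambda x: +1 if x[1] > 0.1 else -1]   # h2: debt > NTD 100,000 (really?)
D = [((1.0, 0.2), +1), ((0.5, 0.3), -1)]   # two made-up historical records
g = learn(toy_A, D, H)                     # picks h1 here
```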


### Practical Definition of Machine Learning

- unknown target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ ('learned' formula to be used)

machine learning ($\mathcal{A}$ and $\mathcal{H}$): use **data** to compute **hypothesis $g$** that approximates **target $f$**

### Key Essence of Machine Learning

machine learning: use **data** to compute **hypothesis $g$** that approximates **target $f$**

data → ML → improved performance measure

1. there exists some 'underlying pattern' to be learned, so the 'performance measure' can be improved
2. but there is no programmable (easy) definition, so 'ML' is needed
3. somehow there is data about the pattern, so ML has some 'inputs' to learn from

key essence: help decide whether to use ML

### Learning from Data :: Types of Machine Learning

### Visualizing Credit Card Problem

- customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)
- labels $y$: ◦ (+1), × (−1); this setup is called **binary classification**
- hypothesis $h$: **lines** here, but possibly other curves
- different curves classify customers differently

binary classification algorithm: find a **good decision boundary curve** $g$

### More Binary Classification Problems

- credit: approve/disapprove
- email: spam/non-spam
- patient: sick/not sick
- ad: profitable/not profitable

a core and important problem, with many tools serving as **building blocks of other tools**

### Binary Classification for Education

data → ML → skill

- data: students' records on quizzes on a Math tutoring system
- skill: predict whether a student can give a correct answer to another quiz question

A Possible ML Solution:

answer correctly ≈ ⟦recent **strength** of student > **difficulty** of question⟧

- give ML **9 million records** from **3000 students**
- ML determines (reverse-engineers) **strength** and **difficulty** automatically

key part of the **world-champion** system from National Taiwan Univ. in KDDCup 2010
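The slides do not spell out the model, but one simple way to realize the ⟦strength > difficulty⟧ idea is a logistic, item-response-style fit; the sketch below (assuming NumPy; names and data are hypothetical) learns one strength per student and one difficulty per question from (student, question, correct) records:

```python
import numpy as np

# Hypothetical sketch: model P(correct) = sigmoid(strength - difficulty) and
# fit both parameter vectors by stochastic gradient ascent on the log-likelihood.
def fit_strength_difficulty(records, n_students, n_questions, lr=0.1, epochs=20):
    strength = np.zeros(n_students)
    difficulty = np.zeros(n_questions)
    for _ in range(epochs):
        for s, q, correct in records:
            p = 1.0 / (1.0 + np.exp(difficulty[q] - strength[s]))
            grad = correct - p            # gradient of the log-likelihood
            strength[s] += lr * grad
            difficulty[q] -= lr * grad
    return strength, difficulty

# toy usage: student 0 answers question 1 correctly, question 0 incorrectly
records = [(0, 1, 1), (0, 0, 0), (1, 0, 1)]
strength, difficulty = fit_strength_difficulty(records, n_students=2, n_questions=2)
```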
### Multiclass Classification: Coin Recognition Problem

(figure: US coins plotted by size vs. mass, with clusters labeled 1, 5, 10, 25)

- classify US coins (1c, 5c, 10c, 25c) by (size, mass)
- $\mathcal{Y} = \{1c, 5c, 10c, 25c\}$, or $\mathcal{Y} = \{1, 2, \cdots, K\}$ (abstractly)
- binary classification: special case with $K = 2$

Other Multiclass Classification Problems:

- written digits ⇒ 0, 1, ..., 9
- pictures ⇒ apple, orange, strawberry
- emails ⇒ spam, primary, social, promotion, update (Google)

**many applications** in practice, especially for 'recognition'

### Regression: Patient Recovery Prediction Problem

- binary classification: patient features ⇒ sick or not
- multiclass classification: patient features ⇒ which type of cancer
- regression: patient features ⇒ **how many days before recovery**
- $\mathcal{Y} = \mathbb{R}$, or $\mathcal{Y} = [\text{lower}, \text{upper}] \subset \mathbb{R}$ (bounded regression); deeply studied in statistics

Other Regression Problems:

- company data ⇒ stock price
- climate data ⇒ temperature

also core and important, with many 'statistical' tools serving as **building blocks of other tools**

### Regression for Recommender System (1/2)

data → ML → skill

- data: how many users have rated some movies
- skill: predict how a user would rate an unrated movie

A Hot Problem:

- competition held by Netflix in 2006
  - 100,480,507 ratings that 480,189 users gave to 17,770 movies
  - 10% improvement = **1 million dollar prize**
- similar competition (movies → songs) held by Yahoo! in KDDCup 2011
  - 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines **learn our preferences**?

### Regression for Recommender System (2/2)

Match movie and viewer factors to get a predicted rating.

(figure: movie factors such as comedy content, action content, blockbuster?, Tom Cruise in it? matched against viewer factors such as likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?; the predicted rating adds contributions from each factor)

A Possible ML Solution:

- pattern: rating ← viewer/movie factors
- learning: known ratings → learned factors → unknown rating prediction

key part of the **world-champion** (again!) system from National Taiwan Univ. in KDDCup 2011
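As a sketch of the factor idea (made-up numbers, not the actual KDDCup system): the predicted rating is just the sum of per-factor contributions, i.e. an inner product of viewer and movie factor vectors.

```python
import numpy as np

# Hypothetical illustration of 'add contributions from each factor':
# the factor values below are invented for this example only.
viewer = np.array([0.9, 0.1, 0.8])  # likes comedy?, likes action?, likes Tom Cruise?
movie  = np.array([0.7, 0.2, 1.0])  # comedy content, action content, Tom Cruise in it?

predicted_rating = viewer @ movie   # inner product of the two factor vectors
print(predicted_rating)             # 1.45 on this made-up scale
```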

### Supervised versus Unsupervised

coin recognition **with** $y_n$: (figure: size-mass plot with labeled clusters 1, 5, 10, 25) is **supervised** multiclass classification

coin recognition **without** $y_n$: (figure: the same size-mass plot, unlabeled) is **unsupervised** multiclass classification ⇐⇒ 'clustering'

Other Clustering Problems:

- articles ⇒ topics
- consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem


### Semi-supervised: Coin Recognition with Some $y_n$

(figures: fully labeled size-mass plot = supervised; only a few labeled points = **semi-supervised**; no labels = unsupervised, i.e. clustering)

Other Semi-supervised Learning Problems:

- face images with a few labeled ⇒ face identifier (Facebook)
- medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: **leverage** unlabeled data to avoid 'expensive' labeling

### Reinforcement Learning

a 'very different' but natural way of learning

Teach Your Dog: Say 'Sit Down'

The dog pees on the ground. **BAD DOG. THAT'S A VERY WRONG ACTION.**

- cannot easily show the dog that $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'punish' to say $\tilde{y}_n = \text{pee}$ is wrong

The dog sits down. **Good Dog. Let me give you some cookies.**

- still cannot show $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'reward' to say $\tilde{y}_n = \text{sit}$ is good

Other Reinforcement Learning Problems Using $(\mathbf{x}, \tilde{y}, \text{goodness})$:

- (customer, ad choice, ad click earning) ⇒ ad system
- (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with **'partial/implicit information'** (often sequentially)

### Learning from Data :: Step-by-step Machine Learning

### Step-by-step Machine Learning

(recall the setup: unknown target function $f: \mathcal{X} \to \mathcal{Y}$, training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$, learning algorithm $\mathcal{A}$, hypothesis set $\mathcal{H}$, final hypothesis $g \approx f$)

1. choose error measure: how $g(\mathbf{x}) \approx f(\mathbf{x})$
2. decide hypothesis set $\mathcal{H}$
3. optimize error (and more) on $\mathcal{D}$ as $\mathcal{A}$
4. pray for generalization: whether $g(\mathbf{x}) \approx f(\mathbf{x})$ for **unseen** $\mathbf{x}$

### Choose Error Measure

$g \approx f$ can often be evaluated by an averaged $\text{err}(g(\mathbf{x}), f(\mathbf{x}))$, called a **pointwise error measure**

in-sample (within data):

$$E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}\big(g(\mathbf{x}_n), \underbrace{f(\mathbf{x}_n)}_{y_n}\big)$$

out-of-sample (future data):

$$E_{\text{out}}(g) = \mathop{\mathcal{E}}_{\text{future } \mathbf{x}} \text{err}\big(g(\mathbf{x}), f(\mathbf{x})\big)$$

will start from the 0/1 error $\text{err}(\tilde{y}, y) = ⟦\tilde{y} \neq y⟧$ for **classification**
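For concreteness, a minimal sketch (assuming NumPy) of the 0/1 error on a data set; $E_{\text{out}}$ can only be estimated, e.g. on held-out data:

```python
import numpy as np

# The averaged 0/1 pointwise error: fraction of points where the prediction
# disagrees with the label. E_in uses training data; an E_out estimate would
# use future (held-out) data instead.
def zero_one_error(y_pred, y_true):
    return np.mean(y_pred != y_true)

y_true = np.array([+1, -1, +1, +1])
y_pred = np.array([+1, -1, -1, +1])
print(zero_one_error(y_pred, y_true))  # 0.25
```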

### Choose Hypothesis Set (for Credit Approval)

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$ 'features of customer', compute a weighted 'score' and
  - approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  - deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
- $\mathcal{Y}$: +1 (good), −1 (bad), 0 ignored; the linear formulas $h \in \mathcal{H}$ are

$$h(\mathbf{x}) = \text{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$

**linear (binary) classifier**, called 'perceptron' historically
### Optimize Error (and More) on Data

$\mathcal{H}$ = all possible perceptrons, $g = ?$

- want: $g \approx f$ (hard when $f$ unknown)
- almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
- difficult: $\mathcal{H}$ is of **infinite** size
- idea: start from some $g_0$, and **'correct' its mistakes on $\mathcal{D}$**

let's visualize **without math**

### Seeing is Believing

(figure: nine PLA update snapshots; starting from an initial line, each update picks a currently misclassified point such as $\mathbf{x}_1, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_3, \ldots$ and rotates $\mathbf{w}(t)$ into $\mathbf{w}(t+1)$, until the final $\mathbf{w}_{\text{PLA}}$ separates all points)

**worked like a charm with < 20 lines!!**

(A fault confessed is half redressed. :-)
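The '< 20 lines' claim is easy to verify; here is a minimal PLA sketch (assuming NumPy and a linearly separable toy set; the variable names are ours):

```python
import numpy as np

# Minimal perceptron learning algorithm (PLA): start from w = 0 and repeatedly
# correct a misclassified point by w <- w + y_n * x_n.
def pla(X, y, max_updates=1000):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1 (threshold term)
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:                  # no mistakes: data separated
            return w
        n = mistakes[0]
        w += y[n] * X[n]                        # rotate w toward/away from x_n
    return w

# toy usage on a linearly separable set
X = np.array([[2.0, 3.0], [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = pla(X, y)
```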


### Pray for Generalization

(pictures from Google Image Search)

(figure, analogy: a parent shows a kid (picture, label) pairs, and the kid's brain, choosing among alternatives, forms a good hypothesis; likewise, a target $f(\mathbf{x})$ plus noise generates examples $(\mathbf{x}_n, y_n)$, and the learning algorithm, choosing from hypothesis set $\mathcal{H}$, forms a good hypothesis $g(\mathbf{x}) \approx f(\mathbf{x})$)

challenge: see only $\{(\mathbf{x}_n, y_n)\}$ without knowing $f$ or the noise, yet **generalize** to unseen $(\mathbf{x}, y)$ w.r.t. $f(\mathbf{x})$

### Generalization Is Non-trivial

Bob impresses Alice by memorizing every given (movie, rank), but is too nervous about a **new movie** and guesses randomly

(pictures from Google Image Search)

memorize ≠ **generalize**; perfect from Bob's view ≠ **good for Alice**; perfect during training ≠ **good when testing**

take-home message: if $\mathcal{H}$ is **simple** (like lines), generalization is **usually possible**

### Mini-Summary

**Learning from Data**

- What is Machine Learning: **use data to approximate target**
- Components of Machine Learning: **algorithm $\mathcal{A}$ takes data $\mathcal{D}$ and hypotheses $\mathcal{H}$ to get hypothesis $g$**
- Types of Machine Learning: **variety of problems almost everywhere**
- Step-by-step Machine Learning: **error, hypotheses, optimize, generalize**

### Roadmap

**Fundamental Machine Learning Models**

- Linear Regression
- Logistic Regression
- Nonlinear Transform
- Decision Tree

### Fundamental Machine Learning Models :: Linear Regression

### Credit **Limit** Problem

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

credit limit? **100,000**

same setup, new target: the unknown target function $f: \mathcal{X} \to \mathcal{Y}$ is now the ideal credit **limit** formula, learned from training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank) via learning algorithm $\mathcal{A}$ and hypothesis set $\mathcal{H}$, giving final hypothesis $g \approx f$

$\mathcal{Y} = \mathbb{R}$: **regression**

### Linear Regression Hypothesis

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of customer', approximate the desired credit limit with a **weighted sum**: $y \approx \sum_{i=0}^{d} w_i x_i$
- linear regression hypothesis: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

$h(\mathbf{x})$: like the **perceptron, but without the** sign

### Illustration of Linear Regression

(figures: for $\mathbf{x} = (x) \in \mathbb{R}$, a fitted line in the $x$-$y$ plane; for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$, a fitted plane over $(x_1, x_2)$)

linear regression: find **lines/hyperplanes** with small **residuals**

### The Error Measure

popular/historical error measure: squared error $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$

in-sample:

$$E_{\text{in}}(h_{\mathbf{w}}) = \frac{1}{N} \sum_{n=1}^{N} \big(\underbrace{h(\mathbf{x}_n)}_{\mathbf{w}^T \mathbf{x}_n} - y_n\big)^2$$

out-of-sample:

$$E_{\text{out}}(\mathbf{w}) = \mathop{\mathcal{E}}_{(\mathbf{x}, y) \sim P} \big(\mathbf{w}^T \mathbf{x} - y\big)^2$$

next: how to minimize $E_{\text{in}}(\mathbf{w})$?

### Minimize $E_{\text{in}}$

$$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big(\mathbf{w}^T \mathbf{x}_n - y_n\big)^2$$

(figure: the convex, bowl-shaped surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, **convex**
- necessary condition of the 'best' $\mathbf{w}$:

$$\nabla E_{\text{in}}(\mathbf{w}) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(\mathbf{w}) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(\mathbf{w}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

(i.e., at the bottom of the valley, it is not possible to 'roll down' any further)

task: find $\mathbf{w}_{\text{LIN}}$ such that $\nabla E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \mathbf{0}$

### Linear Regression Algorithm

1. from $\mathcal{D}$, construct the input matrix $X$ and output vector $\mathbf{y}$ by

$$\underbrace{X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}}_{N \times (d+1)} \qquad \underbrace{\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N \times 1}$$

2. calculate the pseudo-inverse $\underbrace{X^{\dagger}}_{(d+1) \times N}$

3. return $\underbrace{\mathbf{w}_{\text{LIN}}}_{(d+1) \times 1} = X^{\dagger} \mathbf{y}$

simple and efficient with a **good † routine**
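A minimal sketch of the algorithm (assuming NumPy, whose `np.linalg.pinv` serves as the 'good † routine'):

```python
import numpy as np

# Linear regression via the pseudo-inverse: w_LIN = X† y.
def linear_regression(X, y):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1
    return np.linalg.pinv(X) @ y               # pseudo-inverse times y

# toy usage: data generated roughly as y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
w_lin = linear_regression(X, y)                # approximately [1.0, 2.0]
```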

### Is Linear Regression a 'Learning Algorithm'?

$\mathbf{w}_{\text{LIN}} = X^{\dagger} \mathbf{y}$

No!

- analytic (closed-form) solution, 'instantaneous'
- not improving $E_{\text{in}}$ nor $E_{\text{out}}$ iteratively

Yes!

- good $E_{\text{in}}$? **yes, optimal!**
- good $E_{\text{out}}$? **yes, 'simple' like perceptrons**
- improving iteratively? **somewhat, within an iterative pseudo-inverse routine**

if $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is good, **learning 'happened'!**

### Fundamental Machine Learning Models :: Logistic Regression

### Heart Attack Prediction Problem (1/2)

| attribute | value |
| --- | --- |
| age | 40 years |
| gender | male |
| blood pressure | 130/85 |
| cholesterol level | 240 |
| weight | 70 |

heart disease? **yes**

setup: unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ plus noise; training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$; learning algorithm $\mathcal{A}$; hypothesis set $\mathcal{H}$; error measure err; final hypothesis $g \approx f$

binary classification: ideal $f(\mathbf{x}) = \text{sign}\left(P(+1|\mathbf{x}) - \frac{1}{2}\right) \in \{-1, +1\}$ because of the classification err

### Heart Attack Prediction Problem (2/2)

(same patient features as above)

heart attack? **80% risk**

'soft' binary classification: $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

### Soft Binary Classification

target function $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

ideal (noiseless) data:

- $(\mathbf{x}_1, y_1' = 0.9 = P(+1|\mathbf{x}_1))$
- $(\mathbf{x}_2, y_2' = 0.2 = P(+1|\mathbf{x}_2))$
- ...
- $(\mathbf{x}_N, y_N' = 0.6 = P(+1|\mathbf{x}_N))$

actual (noisy) data:

- $(\mathbf{x}_1, y_1 = ◦ \sim P(y|\mathbf{x}_1))$, viewable as the noisy sample $y_1' = 1$
- $(\mathbf{x}_2, y_2 = × \sim P(y|\mathbf{x}_2))$, viewable as $y_2' = 0$
- ...
- $(\mathbf{x}_N, y_N = × \sim P(y|\mathbf{x}_N))$, viewable as $y_N' = 0$

same data as hard binary classification, different **target function**

### Logistic Hypothesis

(same patient features as above)

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of patient', calculate a **weighted 'risk score'**: $s = \sum_{i=0}^{d} w_i x_i$
- convert the **score** to an **estimated probability** by the logistic function $\theta(s)$

(figure: the S-shaped logistic curve $\theta(s)$, rising from 0 to 1)

logistic hypothesis: $h(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$

### Minimizing $E_{\text{in}}(\mathbf{w})$

a popular error, called **cross-entropy** and derived from **maximum likelihood**:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp(-y_n \mathbf{w}^T \mathbf{x}_n)\right)$$

(figure: the convex error surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, twice-differentiable, **convex**
- how to minimize? locate the **valley**, i.e. want $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

most basic algorithm: **gradient descent** (roll downhill)

### Gradient Descent

For $t = 0, 1, \ldots$

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \mathbf{v}$$

when stopped, return the last $\mathbf{w}$ as $g$

- PLA: $\mathbf{v}$ comes from mistake correction
- smooth $E_{\text{in}}(\mathbf{w})$ for logistic regression: choose $\mathbf{v}$ to get the ball to roll 'downhill'?
  - direction $\mathbf{v}$: (assumed) of unit length
  - step size $\eta$: (assumed) positive

(figure: in-sample error $E_{\text{in}}$ over weights $\mathbf{w}$, a ball rolling down the curve)

gradient descent: $\mathbf{v} \propto -\nabla E_{\text{in}}(\mathbf{w}_t)$

### Putting Everything Together

Logistic Regression Algorithm:

initialize $\mathbf{w}_0$; for $t = 0, 1, \cdots$

1. compute

$$\nabla E_{\text{in}}(\mathbf{w}_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(-y_n \mathbf{x}_n\right)$$

2. update by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \nabla E_{\text{in}}(\mathbf{w}_t)$

...until $\nabla E_{\text{in}}(\mathbf{w}_{t+1}) \approx \mathbf{0}$ or enough iterations; return the last $\mathbf{w}_{t+1}$ as $g$

can use more sophisticated tools to speed up
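A minimal sketch of this loop (assuming NumPy; the step size and iteration budget are arbitrary illustrative choices):

```python
import numpy as np

# Logistic regression by plain gradient descent on the cross-entropy error,
# following the two steps above; labels are in {-1, +1}.
def logistic_regression(X, y, eta=0.1, iters=1000):
    X = np.column_stack([np.ones(len(X)), X])      # prepend x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = 1.0 / (1.0 + np.exp(y * (X @ w)))  # theta(-y_n w^T x_n)
        grad = np.mean((theta * -y)[:, None] * X, axis=0)
        w -= eta * grad                            # roll downhill
    return w

# toy usage
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([+1, +1, -1, -1])
w = logistic_regression(X, y)
```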

### Linear Models Summarized

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$

| model | hypothesis | error | $E_{\text{in}}(\mathbf{w})$ | how to solve |
| --- | --- | --- | --- | --- |
| linear classification | $h(\mathbf{x}) = \text{sign}(s)$ | plausible err = 0/1 | discrete | solvable in special case |
| linear regression | $h(\mathbf{x}) = s$ | friendly err = squared | quadratic convex | closed-form solution |
| logistic regression | $h(\mathbf{x}) = \theta(s)$ | plausible err = cross-entropy | smooth convex | gradient descent |

my 'secret': **linear first!!**

### Fundamental Machine Learning Models :: Nonlinear Transform

### Linear Hypotheses

up to now: linear hypotheses

- visually: **'line'-like** boundary
- mathematically: linear scores $s = \mathbf{w}^T \mathbf{x}$

but limited...

(figure: a data set on $[-1, 1]^2$ that no line separates well)

- theoretically: **complexity under control** :-)
- practically: on some $\mathcal{D}$, **large $E_{\text{in}}$** for every line :-(

how to **break the limit** of linear hypotheses?

### Circular Separable

(figure: the same data set, separated by a circle)

- $\mathcal{D}$ not linear separable
- but **circular separable** by a circle of radius $\sqrt{0.6}$ centered at the origin:

$$h_{\text{SEP}}(\mathbf{x}) = \text{sign}\left(-x_1^2 - x_2^2 + 0.6\right)$$

re-derive **Circular-PLA, Circular-Regression**, blah blah... all over again? :-)

### Circular Separable and Linear Separable

$$h(\mathbf{x}) = \text{sign}\Big(\underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2}\Big) = \text{sign}\left(\tilde{\mathbf{w}}^T \mathbf{z}\right)$$

- $\{(\mathbf{x}_n, y_n)\}$ circular separable $\Longrightarrow$ $\{(\mathbf{z}_n, y_n)\}$ **linear** separable
- $\mathbf{x} \in \mathcal{X} \overset{\Phi}{\longmapsto} \mathbf{z} \in \mathcal{Z}$: **(nonlinear) feature transform $\Phi$**

(figures: circular boundary in $\mathcal{X}$-space; after the transform, a linear boundary in $\mathcal{Z}$-space)

circular separable in $\mathcal{X}$ $\Longrightarrow$ **linear** separable in $\mathcal{Z}$

### General Quadratic Hypothesis Set

a 'bigger' $\mathcal{Z}$-space with $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

perceptrons in $\mathcal{Z}$-space $\Longleftrightarrow$ quadratic hypotheses in $\mathcal{X}$-space

$$\mathcal{H}_{\Phi_2} = \left\{ h(\mathbf{x}) : h(\mathbf{x}) = \tilde{h}(\Phi_2(\mathbf{x})) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z} \right\}$$

- can implement **all possible quadratic curve boundaries**: circle, ellipse, **rotated ellipse, hyperbola, parabola**, ...
  - e.g. the ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$ corresponds to $\tilde{\mathbf{w}}^T = [33, -20, -4, 3, 2, 3]$
- includes **lines and constants as degenerate cases**

### Good Quadratic Hypothesis

| $\mathcal{Z}$-space | | $\mathcal{X}$-space |
| --- | --- | --- |
| perceptrons | ⇐⇒ | quadratic hypotheses |
| **good perceptron** | ⇐⇒ | **good quadratic hypothesis** |
| **separating perceptron** | ⇐⇒ | **separating quadratic hypothesis** |

(figures: a separating line in $\mathcal{Z}$-space ⇐⇒ a separating circle in $\mathcal{X}$-space)

- want: get a **good perceptron** in $\mathcal{Z}$-space
- known: how to get a **good perceptron** in $\mathcal{X}$-space with data $\{(\mathbf{x}_n, y_n)\}$

solution: get a **good perceptron** in $\mathcal{Z}$-space with data $\{(\mathbf{z}_n = \Phi_2(\mathbf{x}_n), y_n)\}$

### The Nonlinear Transform Steps

(figure: $\mathcal{X}$-space data $\overset{\Phi}{\longrightarrow}$ $\mathcal{Z}$-space data $\overset{\mathcal{A}}{\longrightarrow}$ linear boundary in $\mathcal{Z}$-space $\overset{\Phi^{-1}}{\longrightarrow}$ quadratic boundary in $\mathcal{X}$-space)

1. transform original data $\{(\mathbf{x}_n, y_n)\}$ to $\{(\mathbf{z}_n = \Phi(\mathbf{x}_n), y_n)\}$ by $\Phi$
2. get a good perceptron $\tilde{\mathbf{w}}$ using $\{(\mathbf{z}_n, y_n)\}$ and your favorite linear algorithm $\mathcal{A}$
3. return $g(\mathbf{x}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right)$
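A minimal sketch of the three steps (assuming NumPy; here the 'favorite linear algorithm' is the pseudo-inverse linear regression from earlier, applied to ±1 labels, but any linear algorithm would do):

```python
import numpy as np

def phi2(X):
    """Quadratic transform: Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def train_quadratic(X, y):
    Z = phi2(X)                      # step 1: transform data to Z-space
    w_tilde = np.linalg.pinv(Z) @ y  # step 2: favorite linear algorithm A
    return lambda Xq: np.sign(phi2(Xq) @ w_tilde)  # step 3: g(x) = sign(w~^T Phi(x))

# toy usage: points near the origin are +1 (a circular pattern)
X = np.array([[0.1, 0.2], [0.3, -0.1], [0.9, 0.8], [-0.9, 0.7]])
y = np.array([+1, +1, -1, -1])
g = train_quadratic(X, y)
print(g(X))  # should reproduce y on this toy set
```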

### Nonlinear Model via Nonlinear Φ + Linear Models

(figure: the same transform diagram as above)

two choices:

- feature transform $\Phi$
- linear model $\mathcal{A}$, **not just binary classification**

**Pandora's box :-)**: can now freely do **quadratic PLA, quadratic regression, cubic regression, ..., polynomial regression**

### Feature Transform Φ

(figure: handwritten-digit images ('1' vs. 'not 1') mapped by $\Phi$ from raw pixels to two concrete features, average intensity and symmetry, where a linear boundary works)

more generally, not just polynomial:

raw (pixels) $\overset{\text{domain knowledge}}{\longrightarrow}$ **concrete (intensity, symmetry)**

the force, too good to be true? :-)

### Computation/Storage Price

$Q$-th order polynomial transform:

$$\Phi_Q(\mathbf{x}) = \left(1, x_1, x_2, \ldots, x_d, \; x_1^2, x_1 x_2, \ldots, x_d^2, \; \ldots, \; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\right)$$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions, where

$$1 + \tilde{d} = \#\text{ ways of} \le Q\text{-combination from } d \text{ kinds with repetitions} = \binom{Q+d}{Q} = \binom{Q+d}{d} = O\left(Q^d\right)$$

= efforts needed for computing/storing $\mathbf{z} = \Phi_Q(\mathbf{x})$ and $\tilde{\mathbf{w}}$

$Q$ large $\Longrightarrow$ **difficult to compute/store AND curve too complicated**
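A quick check of the count (assuming Python's `math.comb`):

```python
from math import comb

# Number of monomials of degree <= Q in d variables: C(Q+d, d).
def phi_dim(Q, d):
    return comb(Q + d, d)

print(phi_dim(2, 2))    # 6, matching Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
print(phi_dim(10, 10))  # 184756: already large for modest Q and d
```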

### Generalization Issue

(figures: the same data fit by $\Phi_1$, a line with a few mistakes, and by $\Phi_4$, a wiggly quartic curve with $E_{\text{in}}(g) = 0$)

**which one do you prefer? :-)**

- $\Phi_1$ (original $\mathbf{x}$): 'visually' preferred
- $\Phi_4$: $E_{\text{in}}(g) = 0$ but overkill

how to pick $Q$? **model selection** (to be discussed) is important

### Fundamental Machine Learning Models :: Decision Tree

### Decision Tree for Watching MOOC Lectures

$$G(\mathbf{x}) = \sum_{t=1}^{T} q_t(\mathbf{x}) \cdot g_t(\mathbf{x})$$

- **base hypothesis $g_t(\mathbf{x})$**: leaf at the end of path $t$, a **constant** here
- **condition $q_t(\mathbf{x})$**: ⟦is $\mathbf{x}$ on path $t$?⟧
- usually with **simple internal nodes**

(figure: a tree on whether to watch, asking **quitting time?** (<18:30, between, >21:30); the <18:30 branch asks **has a date?** (true → N, false → Y); the middle branch gives Y; the >21:30 branch asks **deadline?** (>2 days → N, between → Y, <−2 days → N))

decision tree: arguably one of the most **human-mimicking models**

### Recursive View of Decision Tree

Path View: $G(\mathbf{x}) = \sum_{t=1}^{T} ⟦\mathbf{x} \text{ on path } t⟧ \cdot \text{leaf}_t(\mathbf{x})$

(figure: the same MOOC-lecture tree as above)

Recursive View:

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \cdot G_c(\mathbf{x})$$

- $G(\mathbf{x})$: full-tree hypothesis
- $b(\mathbf{x})$: branching criteria
- $G_c(\mathbf{x})$: sub-tree hypothesis at the $c$-th branch

tree = (root, sub-trees), just like what **your data structure instructor would say :-)**

### A Basic Decision Tree Algorithm

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$$

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$):

- if **termination criteria met**: return **base hypothesis $g_t(\mathbf{x})$**
- else:
  1. learn **branching criteria $b(\mathbf{x})$**
  2. split $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
  3. build sub-tree $G_c \leftarrow$ DecisionTree($\mathcal{D}_c$)
  4. return $G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis (a sketch of the recursion follows below)
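As mentioned above, here is a minimal sketch of the recursion (assuming NumPy; our illustrative choices, anticipating the C&RT slide: $C = 2$ branches by thresholding one dimension, constant majority-vote leaves, 0/1 error, and depth-based termination):

```python
import numpy as np

# Minimal recursive decision tree for ±1 classification.
def decision_tree(X, y, depth=3):
    majority = 1 if np.sum(y == 1) >= np.sum(y == -1) else -1
    if depth == 0 or len(set(y)) == 1:            # termination criteria met
        return lambda x: majority                 # base hypothesis: a constant
    best = None
    for i in range(X.shape[1]):                   # learn branching criteria b(x)
        for thr in X[:, i]:
            left = X[:, i] <= thr
            if left.all() or (~left).all():
                continue                          # skip degenerate splits
            # 0/1 error if each side were labeled by its best constant
            err = min((y[left] == s).sum() for s in (1, -1)) + \
                  min((y[~left] == s).sum() for s in (1, -1))
            if best is None or err < best[0]:
                best = (err, i, thr)
    if best is None:
        return lambda x: majority
    _, i, thr = best
    left = X[:, i] <= thr
    G1 = decision_tree(X[left], y[left], depth - 1)    # build sub-trees
    G2 = decision_tree(X[~left], y[~left], depth - 1)
    return lambda x: G1(x) if x[i] <= thr else G2(x)   # G(x) = [[b(x)=c]] G_c(x)

# toy usage
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([+1, +1, -1, -1])
G = decision_tree(X, y)
print([G(x) for x in X])   # [1, 1, -1, -1]
```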

### Classification and Regression Tree (C&RT)

recall: function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$) returns a base hypothesis $g_t(\mathbf{x})$ if the termination criteria are met; otherwise it splits $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$ and recurses

C&RT's choices:

- $C = 2$ (binary tree)
- $g_t(\mathbf{x}) = E_{\text{in}}$-optimal **constant**
  - binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  - regression (squared error): average of $\{y_n\}$
- branching: **threshold** some selected dimension
- termination: fully-grown, or better, **pruned**

disclaimer: **C&RT** here is based on **selected components** of **CART™ of California Statistical Software**

### A Simple Data Set

(figure: C&RT recursively splitting a 2-D data set with axis-parallel cuts, one branch at a time, until every region is pure)

**C&RT: 'divide-and-conquer'**