### Quick Tour of Machine Learning (機器學習速遊)

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Data Science Enthusiasts Annual Conference (資料科學愛好者年會) series event, 2015/12/12

### Disclaimer

- just a **super-condensed** and **shuffled** version of
  - my co-authored textbook "Learning from Data: A Short Course"
  - my two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
    - "Machine Learning Foundations": www.coursera.org/course/ntumlone
    - "Machine Learning Techniques": www.coursera.org/course/ntumltwo

  impossible to be complete, with most **math details removed**
- live **interaction** is important

goal: help you **begin** your journey with ML

### Roadmap

**Learning from Data**

- What is Machine Learning
- Components of Machine Learning
- Types of Machine Learning
- Step-by-step Machine Learning

### Learning from Data :: What is Machine Learning

### From Learning to Machine Learning

learning: acquiring **skill** with experience accumulated from **observations**

observations → learning → skill

machine learning: acquiring **skill** with experience accumulated/computed from **data**

data → ML → skill

What is **skill**?

### A More Concrete Definition

**skill** ⇔ improve some **performance measure** (e.g. prediction accuracy)

machine learning: improving some **performance measure** with experience **computed** from **data**

data → ML → improved performance measure

An Application in Computational Finance:

stock data → ML → more investment gain

Why use machine learning?

### Yet Another Application: Tree Recognition

- 'define' trees and hand-program: **difficult**
- learn from data (observations) and recognize: a **3-year-old can do so**
- an 'ML-based tree recognition system' can be **easier to build** than a hand-programmed system

ML: an **alternative route** to build complicated systems

### The Machine Learning Route

ML: an **alternative route** to build complicated systems

Some Use Scenarios:

- when humans cannot program the system manually: navigating on Mars
- when humans cannot 'define the solution' easily: speech/visual recognition
- when rapid decisions are needed that humans cannot make: high-frequency trading
- when needing to be user-oriented at a massive scale: consumer-targeted marketing

Give a computer a fish, you feed it for a day; teach it how to fish, you feed it for a lifetime. :-)

### Machine Learning and Artificial Intelligence

Machine Learning: use data to compute **something** that improves performance

Artificial Intelligence: compute **something that shows intelligent behavior**

- **improving performance** is something that shows **intelligent behavior**, so ML can realize AI, among other routes
- e.g. chess playing
  - traditional AI: game tree
  - ML for AI: 'learning from board data'

ML is one possible **and popular** route to realize AI

### Learning from Data :: Components of Machine Learning

### Components of Learning: Metaphor Using Credit Approval

Applicant Information:

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

what to learn (for improving performance): 'approve credit card good for bank?'

### Formalize the Learning Problem

Basic Notations:

- input: $\mathbf{x} \in \mathcal{X}$ (customer application)
- output: $y \in \mathcal{Y}$ (good/bad after approving credit card)
- **unknown** underlying pattern to be learned ⇔ target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- data ⇔ training examples $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$ (historical records in bank)
- hypothesis ⇔ skill with hopefully good performance: $g: \mathcal{X} \to \mathcal{Y}$ ('learned' formula to be used), e.g. approve if
  - $h_1$: annual salary > NTD 800,000
  - $h_2$: debt > NTD 100,000 (really?)
  - $h_3$: year in job ≤ 2 (really?)
- all **candidate formulas** being considered: hypothesis set $\mathcal{H}$
- procedure to **learn** the best formula: algorithm $\mathcal{A}$

$\{(\mathbf{x}_n, y_n)\}$ from $f$ → ML $(\mathcal{A}, \mathcal{H})$ → $g$
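To make the notation concrete, here is a minimal interface sketch (the names and the toy algorithm are ours for illustration, not from the course material):

```python
# Minimal sketch of the formalization: a learning algorithm A takes the
# data D and the hypothesis set H, and returns a final hypothesis g that
# hopefully approximates the unknown target f. All names are illustrative.
def learn(A, D, H):
    """D: list of (x, y) pairs; H: candidate hypotheses; returns g."""
    return A(D, H)

# a toy A: pick the hypothesis in a (finite) H with the fewest mistakes on D
def toy_A(D, H):
    return min(H, key=lambda h: sum(h(x) != y for x, y in D))

# candidate formulas echoing h1/h2 above; x = (annual salary, debt) in millions of NTD
H = [lambda x: +1 if x[0] > 0.8 else -1,   # h1: annual salary > NTD 800,000
     lambda x: +1 if x[1] > 0.1 else -1]   # h2: debt > NTD 100,000 (really?)
D = [((1.0, 0.2), +1), ((0.5, 0.3), -1)]   # two made-up historical records
g = learn(toy_A, D, H)                     # picks h1 here
```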


### Practical Definition of Machine Learning

- unknown target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ ('learned' formula to be used)

machine learning ($\mathcal{A}$ and $\mathcal{H}$): use **data** to compute **hypothesis $g$** that approximates **target $f$**

### Key Essence of Machine Learning

machine learning: use **data** to compute **hypothesis $g$** that approximates **target $f$**

data → ML → improved performance measure

1. there exists some 'underlying pattern' to be learned, so the 'performance measure' can be improved
2. but there is no programmable (easy) definition, so 'ML' is needed
3. somehow there is data about the pattern, so ML has some 'inputs' to learn from

key essence: help decide whether to use ML

### Learning from Data :: Types of Machine Learning

### Visualizing Credit Card Problem

- customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)
- labels $y$: ◦ (+1), × (−1); this setup is called **binary classification**
- hypothesis $h$: **lines** here, but possibly other curves
- different curves classify customers differently

binary classification algorithm: find a **good decision boundary curve** $g$

### More Binary Classification Problems

- credit: approve/disapprove
- email: spam/non-spam
- patient: sick/not sick
- ad: profitable/not profitable

a core and important problem, with many tools serving as **building blocks of other tools**

### Binary Classification for Education

data → ML → skill

- data: students' records on quizzes on a Math tutoring system
- skill: predict whether a student can give a correct answer to another quiz question

A Possible ML Solution:

answer correctly ≈ ⟦recent **strength** of student > **difficulty** of question⟧

- give ML **9 million records** from **3000 students**
- ML determines (reverse-engineers) **strength** and **difficulty** automatically

key part of the **world-champion** system from National Taiwan Univ. in KDDCup 2010
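The slides do not spell out the model, but one simple way to realize the ⟦strength > difficulty⟧ idea is a logistic, item-response-style fit; the sketch below (assuming NumPy; names and data are hypothetical) learns one strength per student and one difficulty per question from (student, question, correct) records:

```python
import numpy as np

# Hypothetical sketch: model P(correct) = sigmoid(strength - difficulty) and
# fit both parameter vectors by stochastic gradient ascent on the log-likelihood.
def fit_strength_difficulty(records, n_students, n_questions, lr=0.1, epochs=20):
    strength = np.zeros(n_students)
    difficulty = np.zeros(n_questions)
    for _ in range(epochs):
        for s, q, correct in records:
            p = 1.0 / (1.0 + np.exp(difficulty[q] - strength[s]))
            grad = correct - p            # gradient of the log-likelihood
            strength[s] += lr * grad
            difficulty[q] -= lr * grad
    return strength, difficulty

# toy usage: student 0 answers question 1 correctly, question 0 incorrectly
records = [(0, 1, 1), (0, 0, 0), (1, 0, 1)]
strength, difficulty = fit_strength_difficulty(records, n_students=2, n_questions=2)
```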
### Multiclass Classification: Coin Recognition Problem

(figure: US coins plotted by size vs. mass, with clusters labeled 1, 5, 10, 25)

- classify US coins (1c, 5c, 10c, 25c) by (size, mass)
- $\mathcal{Y} = \{1c, 5c, 10c, 25c\}$, or $\mathcal{Y} = \{1, 2, \cdots, K\}$ (abstractly)
- binary classification: special case with $K = 2$

Other Multiclass Classification Problems:

- written digits ⇒ 0, 1, ..., 9
- pictures ⇒ apple, orange, strawberry
- emails ⇒ spam, primary, social, promotion, update (Google)

**many applications** in practice, especially for 'recognition'

### Regression: Patient Recovery Prediction Problem

- binary classification: patient features ⇒ sick or not
- multiclass classification: patient features ⇒ which type of cancer
- regression: patient features ⇒ **how many days before recovery**
- $\mathcal{Y} = \mathbb{R}$, or $\mathcal{Y} = [\text{lower}, \text{upper}] \subset \mathbb{R}$ (bounded regression); deeply studied in statistics

Other Regression Problems:

- company data ⇒ stock price
- climate data ⇒ temperature

also core and important, with many 'statistical' tools serving as **building blocks of other tools**

### Regression for Recommender System (1/2)

data → ML → skill

- data: how many users have rated some movies
- skill: predict how a user would rate an unrated movie

A Hot Problem:

- competition held by Netflix in 2006
  - 100,480,507 ratings that 480,189 users gave to 17,770 movies
  - 10% improvement = **1 million dollar prize**
- similar competition (movies → songs) held by Yahoo! in KDDCup 2011
  - 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines **learn our preferences**?

### Regression for Recommender System (2/2)

Match movie and viewer factors to get a predicted rating.

(figure: movie factors such as comedy content, action content, blockbuster?, Tom Cruise in it? matched against viewer factors such as likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?; the predicted rating adds contributions from each factor)

A Possible ML Solution:

- pattern: rating ← viewer/movie factors
- learning: known ratings → learned factors → unknown rating prediction

key part of the **world-champion** (again!) system from National Taiwan Univ. in KDDCup 2011
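As a sketch of the factor idea (made-up numbers, not the actual KDDCup system): the predicted rating is just the sum of per-factor contributions, i.e. an inner product of viewer and movie factor vectors.

```python
import numpy as np

# Hypothetical illustration of 'add contributions from each factor':
# the factor values below are invented for this example only.
viewer = np.array([0.9, 0.1, 0.8])  # likes comedy?, likes action?, likes Tom Cruise?
movie  = np.array([0.7, 0.2, 1.0])  # comedy content, action content, Tom Cruise in it?

predicted_rating = viewer @ movie   # inner product of the two factor vectors
print(predicted_rating)             # 1.45 on this made-up scale
```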

### Supervised versus Unsupervised

coin recognition **with** $y_n$: (figure: size-mass plot with labeled clusters 1, 5, 10, 25) is **supervised** multiclass classification

coin recognition **without** $y_n$: (figure: the same size-mass plot, unlabeled) is **unsupervised** multiclass classification ⇐⇒ 'clustering'

Other Clustering Problems:

- articles ⇒ topics
- consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem


### Semi-supervised: Coin Recognition with Some $y_n$

(figures: fully labeled size-mass plot = supervised; only a few labeled points = **semi-supervised**; no labels = unsupervised, i.e. clustering)

Other Semi-supervised Learning Problems:

- face images with a few labeled ⇒ face identifier (Facebook)
- medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: **leverage** unlabeled data to avoid 'expensive' labeling

### Reinforcement Learning

a 'very different' but natural way of learning

Teach Your Dog: Say 'Sit Down'

The dog pees on the ground. **BAD DOG. THAT'S A VERY WRONG ACTION.**

- cannot easily show the dog that $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'punish' to say $\tilde{y}_n = \text{pee}$ is wrong

The dog sits down. **Good Dog. Let me give you some cookies.**

- still cannot show $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'reward' to say $\tilde{y}_n = \text{sit}$ is good

Other Reinforcement Learning Problems Using $(\mathbf{x}, \tilde{y}, \text{goodness})$:

- (customer, ad choice, ad click earning) ⇒ ad system
- (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with **'partial/implicit information'** (often sequentially)

### Learning from Data :: Step-by-step Machine Learning

### Step-by-step Machine Learning

(recall the setup: unknown target function $f: \mathcal{X} \to \mathcal{Y}$, training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$, learning algorithm $\mathcal{A}$, hypothesis set $\mathcal{H}$, final hypothesis $g \approx f$)

1. choose error measure: how $g(\mathbf{x}) \approx f(\mathbf{x})$
2. decide hypothesis set $\mathcal{H}$
3. optimize error (and more) on $\mathcal{D}$ as $\mathcal{A}$
4. pray for generalization: whether $g(\mathbf{x}) \approx f(\mathbf{x})$ for **unseen** $\mathbf{x}$

### Choose Error Measure

$g \approx f$ can often be evaluated by an averaged $\text{err}(g(\mathbf{x}), f(\mathbf{x}))$, called a **pointwise error measure**

in-sample (within data):

$$E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}\big(g(\mathbf{x}_n), \underbrace{f(\mathbf{x}_n)}_{y_n}\big)$$

out-of-sample (future data):

$$E_{\text{out}}(g) = \mathop{\mathcal{E}}_{\text{future } \mathbf{x}} \text{err}\big(g(\mathbf{x}), f(\mathbf{x})\big)$$

will start from the 0/1 error $\text{err}(\tilde{y}, y) = ⟦\tilde{y} \neq y⟧$ for **classification**
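For concreteness, a minimal sketch (assuming NumPy) of the 0/1 error on a data set; $E_{\text{out}}$ can only be estimated, e.g. on held-out data:

```python
import numpy as np

# The averaged 0/1 pointwise error: fraction of points where the prediction
# disagrees with the label. E_in uses training data; an E_out estimate would
# use future (held-out) data instead.
def zero_one_error(y_pred, y_true):
    return np.mean(y_pred != y_true)

y_true = np.array([+1, -1, +1, +1])
y_pred = np.array([+1, -1, -1, +1])
print(zero_one_error(y_pred, y_true))  # 0.25
```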

### Choose Hypothesis Set (for Credit Approval)

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$ 'features of customer', compute a weighted 'score' and
  - approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  - deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
- $\mathcal{Y}$: +1 (good), −1 (bad), 0 ignored; the linear formulas $h \in \mathcal{H}$ are

$$h(\mathbf{x}) = \text{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$

**linear (binary) classifier**, called 'perceptron' historically
### Optimize Error (and More) on Data

$\mathcal{H}$ = all possible perceptrons, $g = ?$

- want: $g \approx f$ (hard when $f$ unknown)
- almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
- difficult: $\mathcal{H}$ is of **infinite** size
- idea: start from some $g_0$, and **'correct' its mistakes on $\mathcal{D}$**

let's visualize **without math**

### Seeing is Believing

(figure: nine PLA update snapshots; starting from an initial line, each update picks a currently misclassified point such as $\mathbf{x}_1, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_3, \ldots$ and rotates $\mathbf{w}(t)$ into $\mathbf{w}(t+1)$, until the final $\mathbf{w}_{\text{PLA}}$ separates all points)

**worked like a charm with < 20 lines!!**

(A fault confessed is half redressed. :-)
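The '< 20 lines' claim is easy to verify; here is a minimal PLA sketch (assuming NumPy and a linearly separable toy set; the variable names are ours):

```python
import numpy as np

# Minimal perceptron learning algorithm (PLA): start from w = 0 and repeatedly
# correct a misclassified point by w <- w + y_n * x_n.
def pla(X, y, max_updates=1000):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1 (threshold term)
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:                  # no mistakes: data separated
            return w
        n = mistakes[0]
        w += y[n] * X[n]                        # rotate w toward/away from x_n
    return w

# toy usage on a linearly separable set
X = np.array([[2.0, 3.0], [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = pla(X, y)
```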


### Pray for Generalization

(pictures from Google Image Search)

(figure, analogy: a parent shows a kid (picture, label) pairs, and the kid's brain, choosing among alternatives, forms a good hypothesis; likewise, a target $f(\mathbf{x})$ plus noise generates examples $(\mathbf{x}_n, y_n)$, and the learning algorithm, choosing from hypothesis set $\mathcal{H}$, forms a good hypothesis $g(\mathbf{x}) \approx f(\mathbf{x})$)

challenge: see only $\{(\mathbf{x}_n, y_n)\}$ without knowing $f$ or the noise, yet **generalize** to unseen $(\mathbf{x}, y)$ w.r.t. $f(\mathbf{x})$

### Generalization Is Non-trivial

Bob impresses Alice by memorizing every given (movie, rank), but is too nervous about a **new movie** and guesses randomly

(pictures from Google Image Search)

memorize ≠ **generalize**; perfect from Bob's view ≠ **good for Alice**; perfect during training ≠ **good when testing**

take-home message: if $\mathcal{H}$ is **simple** (like lines), generalization is **usually possible**

### Mini-Summary

**Learning from Data**

- What is Machine Learning: **use data to approximate target**
- Components of Machine Learning: **algorithm $\mathcal{A}$ takes data $\mathcal{D}$ and hypotheses $\mathcal{H}$ to get hypothesis $g$**
- Types of Machine Learning: **variety of problems almost everywhere**
- Step-by-step Machine Learning: **error, hypotheses, optimize, generalize**

### Roadmap

**Fundamental Machine Learning Models**

- Linear Regression
- Logistic Regression
- Nonlinear Transform
- Decision Tree

### Fundamental Machine Learning Models :: Linear Regression

### Credit **Limit** Problem

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

credit limit? **100,000**

same setup, new target: the unknown target function $f: \mathcal{X} \to \mathcal{Y}$ is now the ideal credit **limit** formula, learned from training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank) via learning algorithm $\mathcal{A}$ and hypothesis set $\mathcal{H}$, giving final hypothesis $g \approx f$

$\mathcal{Y} = \mathbb{R}$: **regression**

### Linear Regression Hypothesis

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of customer', approximate the desired credit limit with a **weighted sum**: $y \approx \sum_{i=0}^{d} w_i x_i$
- linear regression hypothesis: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

$h(\mathbf{x})$: like the **perceptron, but without the** sign

### Illustration of Linear Regression

(figures: for $\mathbf{x} = (x) \in \mathbb{R}$, a fitted line in the $x$-$y$ plane; for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$, a fitted plane over $(x_1, x_2)$)

linear regression: find **lines/hyperplanes** with small **residuals**

### The Error Measure

popular/historical error measure: squared error $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$

in-sample:

$$E_{\text{in}}(h_{\mathbf{w}}) = \frac{1}{N} \sum_{n=1}^{N} \big(\underbrace{h(\mathbf{x}_n)}_{\mathbf{w}^T \mathbf{x}_n} - y_n\big)^2$$

out-of-sample:

$$E_{\text{out}}(\mathbf{w}) = \mathop{\mathcal{E}}_{(\mathbf{x}, y) \sim P} \big(\mathbf{w}^T \mathbf{x} - y\big)^2$$

next: how to minimize $E_{\text{in}}(\mathbf{w})$?

### Minimize $E_{\text{in}}$

$$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big(\mathbf{w}^T \mathbf{x}_n - y_n\big)^2$$

(figure: the convex, bowl-shaped surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, **convex**
- necessary condition of the 'best' $\mathbf{w}$:

$$\nabla E_{\text{in}}(\mathbf{w}) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(\mathbf{w}) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(\mathbf{w}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

(i.e., at the bottom of the valley, it is not possible to 'roll down' any further)

task: find $\mathbf{w}_{\text{LIN}}$ such that $\nabla E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \mathbf{0}$

### Linear Regression Algorithm

1. from $\mathcal{D}$, construct the input matrix $X$ and output vector $\mathbf{y}$ by

$$\underbrace{X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}}_{N \times (d+1)} \qquad \underbrace{\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N \times 1}$$

2. calculate the pseudo-inverse $\underbrace{X^{\dagger}}_{(d+1) \times N}$

3. return $\underbrace{\mathbf{w}_{\text{LIN}}}_{(d+1) \times 1} = X^{\dagger} \mathbf{y}$

simple and efficient with a **good † routine**
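A minimal sketch of the algorithm (assuming NumPy, whose `np.linalg.pinv` serves as the 'good † routine'):

```python
import numpy as np

# Linear regression via the pseudo-inverse: w_LIN = X† y.
def linear_regression(X, y):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1
    return np.linalg.pinv(X) @ y               # pseudo-inverse times y

# toy usage: data generated roughly as y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
w_lin = linear_regression(X, y)                # approximately [1.0, 2.0]
```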

### Is Linear Regression a 'Learning Algorithm'?

$\mathbf{w}_{\text{LIN}} = X^{\dagger} \mathbf{y}$

No!

- analytic (closed-form) solution, 'instantaneous'
- not improving $E_{\text{in}}$ nor $E_{\text{out}}$ iteratively

Yes!

- good $E_{\text{in}}$? **yes, optimal!**
- good $E_{\text{out}}$? **yes, 'simple' like perceptrons**
- improving iteratively? **somewhat, within an iterative pseudo-inverse routine**

if $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is good, **learning 'happened'!**

### Fundamental Machine Learning Models :: Logistic Regression

### Heart Attack Prediction Problem (1/2)

| attribute | value |
| --- | --- |
| age | 40 years |
| gender | male |
| blood pressure | 130/85 |
| cholesterol level | 240 |
| weight | 70 |

heart disease? **yes**

setup: unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ plus noise; training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$; learning algorithm $\mathcal{A}$; hypothesis set $\mathcal{H}$; error measure err; final hypothesis $g \approx f$

binary classification: ideal $f(\mathbf{x}) = \text{sign}\left(P(+1|\mathbf{x}) - \frac{1}{2}\right) \in \{-1, +1\}$ because of the classification err

### Heart Attack Prediction Problem (2/2)

(same patient features as above)

heart attack? **80% risk**

'soft' binary classification: $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

### Soft Binary Classification

target function $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

ideal (noiseless) data:

- $(\mathbf{x}_1, y_1' = 0.9 = P(+1|\mathbf{x}_1))$
- $(\mathbf{x}_2, y_2' = 0.2 = P(+1|\mathbf{x}_2))$
- ...
- $(\mathbf{x}_N, y_N' = 0.6 = P(+1|\mathbf{x}_N))$

actual (noisy) data:

- $(\mathbf{x}_1, y_1 = ◦ \sim P(y|\mathbf{x}_1))$, viewable as the noisy sample $y_1' = 1$
- $(\mathbf{x}_2, y_2 = × \sim P(y|\mathbf{x}_2))$, viewable as $y_2' = 0$
- ...
- $(\mathbf{x}_N, y_N = × \sim P(y|\mathbf{x}_N))$, viewable as $y_N' = 0$

same data as hard binary classification, different **target function**

### Logistic Hypothesis

(same patient features as above)

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of patient', calculate a **weighted 'risk score'**: $s = \sum_{i=0}^{d} w_i x_i$
- convert the **score** to an **estimated probability** by the logistic function $\theta(s)$

(figure: the S-shaped logistic curve $\theta(s)$, rising from 0 to 1)

logistic hypothesis: $h(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$

### Minimizing $E_{\text{in}}(\mathbf{w})$

a popular error, called **cross-entropy** and derived from **maximum likelihood**:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp(-y_n \mathbf{w}^T \mathbf{x}_n)\right)$$

(figure: the convex error surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, twice-differentiable, **convex**
- how to minimize? locate the **valley**, i.e. want $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

most basic algorithm: **gradient descent** (roll downhill)

### Gradient Descent

For $t = 0, 1, \ldots$

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \mathbf{v}$$

when stopped, return the last $\mathbf{w}$ as $g$

- PLA: $\mathbf{v}$ comes from mistake correction
- smooth $E_{\text{in}}(\mathbf{w})$ for logistic regression: choose $\mathbf{v}$ to get the ball to roll 'downhill'?
  - direction $\mathbf{v}$: (assumed) of unit length
  - step size $\eta$: (assumed) positive

(figure: in-sample error $E_{\text{in}}$ over weights $\mathbf{w}$, a ball rolling down the curve)

gradient descent: $\mathbf{v} \propto -\nabla E_{\text{in}}(\mathbf{w}_t)$

### Putting Everything Together

Logistic Regression Algorithm:

initialize $\mathbf{w}_0$; for $t = 0, 1, \cdots$

1. compute

$$\nabla E_{\text{in}}(\mathbf{w}_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(-y_n \mathbf{x}_n\right)$$

2. update by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \nabla E_{\text{in}}(\mathbf{w}_t)$

...until $\nabla E_{\text{in}}(\mathbf{w}_{t+1}) \approx \mathbf{0}$ or enough iterations; return the last $\mathbf{w}_{t+1}$ as $g$

can use more sophisticated tools to speed up
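A minimal sketch of this loop (assuming NumPy; the step size and iteration budget are arbitrary illustrative choices):

```python
import numpy as np

# Logistic regression by plain gradient descent on the cross-entropy error,
# following the two steps above; labels are in {-1, +1}.
def logistic_regression(X, y, eta=0.1, iters=1000):
    X = np.column_stack([np.ones(len(X)), X])      # prepend x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = 1.0 / (1.0 + np.exp(y * (X @ w)))  # theta(-y_n w^T x_n)
        grad = np.mean((theta * -y)[:, None] * X, axis=0)
        w -= eta * grad                            # roll downhill
    return w

# toy usage
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([+1, +1, -1, -1])
w = logistic_regression(X, y)
```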

### Linear Models Summarized

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$

| model | hypothesis | error | $E_{\text{in}}(\mathbf{w})$ | how to solve |
| --- | --- | --- | --- | --- |
| linear classification | $h(\mathbf{x}) = \text{sign}(s)$ | plausible err = 0/1 | discrete | solvable in special case |
| linear regression | $h(\mathbf{x}) = s$ | friendly err = squared | quadratic convex | closed-form solution |
| logistic regression | $h(\mathbf{x}) = \theta(s)$ | plausible err = cross-entropy | smooth convex | gradient descent |

my 'secret': **linear first!!**

### Fundamental Machine Learning Models :: Nonlinear Transform

### Linear Hypotheses

up to now: linear hypotheses

- visually: **'line'-like** boundary
- mathematically: linear scores $s = \mathbf{w}^T \mathbf{x}$

but limited...

(figure: a data set on $[-1, 1]^2$ that no line separates well)

- theoretically: **complexity under control** :-)
- practically: on some $\mathcal{D}$, **large $E_{\text{in}}$** for every line :-(

how to **break the limit** of linear hypotheses?

### Circular Separable

(figure: the same data set, separated by a circle)

- $\mathcal{D}$ not linear separable
- but **circular separable** by a circle of radius $\sqrt{0.6}$ centered at the origin:

$$h_{\text{SEP}}(\mathbf{x}) = \text{sign}\left(-x_1^2 - x_2^2 + 0.6\right)$$

re-derive **Circular-PLA, Circular-Regression**, blah blah... all over again? :-)

### Circular Separable and Linear Separable

$$h(\mathbf{x}) = \text{sign}\Big(\underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2}\Big) = \text{sign}\left(\tilde{\mathbf{w}}^T \mathbf{z}\right)$$

- $\{(\mathbf{x}_n, y_n)\}$ circular separable $\Longrightarrow$ $\{(\mathbf{z}_n, y_n)\}$ **linear** separable
- $\mathbf{x} \in \mathcal{X} \overset{\Phi}{\longmapsto} \mathbf{z} \in \mathcal{Z}$: **(nonlinear) feature transform $\Phi$**

(figures: circular boundary in $\mathcal{X}$-space; after the transform, a linear boundary in $\mathcal{Z}$-space)

circular separable in $\mathcal{X}$ $\Longrightarrow$ **linear** separable in $\mathcal{Z}$

### General Quadratic Hypothesis Set

a 'bigger' $\mathcal{Z}$-space with $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

perceptrons in $\mathcal{Z}$-space $\Longleftrightarrow$ quadratic hypotheses in $\mathcal{X}$-space

$$\mathcal{H}_{\Phi_2} = \left\{ h(\mathbf{x}) : h(\mathbf{x}) = \tilde{h}(\Phi_2(\mathbf{x})) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z} \right\}$$

- can implement **all possible quadratic curve boundaries**: circle, ellipse, **rotated ellipse, hyperbola, parabola**, ...
  - e.g. the ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$ corresponds to $\tilde{\mathbf{w}}^T = [33, -20, -4, 3, 2, 3]$
- includes **lines and constants as degenerate cases**

### Good Quadratic Hypothesis

| $\mathcal{Z}$-space | | $\mathcal{X}$-space |
| --- | --- | --- |
| perceptrons | ⇐⇒ | quadratic hypotheses |
| **good perceptron** | ⇐⇒ | **good quadratic hypothesis** |
| **separating perceptron** | ⇐⇒ | **separating quadratic hypothesis** |

(figures: a separating line in $\mathcal{Z}$-space ⇐⇒ a separating circle in $\mathcal{X}$-space)

- want: get a **good perceptron** in $\mathcal{Z}$-space
- known: how to get a **good perceptron** in $\mathcal{X}$-space with data $\{(\mathbf{x}_n, y_n)\}$

solution: get a **good perceptron** in $\mathcal{Z}$-space with data $\{(\mathbf{z}_n = \Phi_2(\mathbf{x}_n), y_n)\}$

### The Nonlinear Transform Steps

(figure: $\mathcal{X}$-space data $\overset{\Phi}{\longrightarrow}$ $\mathcal{Z}$-space data $\overset{\mathcal{A}}{\longrightarrow}$ linear boundary in $\mathcal{Z}$-space $\overset{\Phi^{-1}}{\longrightarrow}$ quadratic boundary in $\mathcal{X}$-space)

1. transform original data $\{(\mathbf{x}_n, y_n)\}$ to $\{(\mathbf{z}_n = \Phi(\mathbf{x}_n), y_n)\}$ by $\Phi$
2. get a good perceptron $\tilde{\mathbf{w}}$ using $\{(\mathbf{z}_n, y_n)\}$ and your favorite linear algorithm $\mathcal{A}$
3. return $g(\mathbf{x}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right)$
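A minimal sketch of the three steps (assuming NumPy; here the 'favorite linear algorithm' is the pseudo-inverse linear regression from earlier, applied to ±1 labels, but any linear algorithm would do):

```python
import numpy as np

def phi2(X):
    """Quadratic transform: Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def train_quadratic(X, y):
    Z = phi2(X)                      # step 1: transform data to Z-space
    w_tilde = np.linalg.pinv(Z) @ y  # step 2: favorite linear algorithm A
    return lambda Xq: np.sign(phi2(Xq) @ w_tilde)  # step 3: g(x) = sign(w~^T Phi(x))

# toy usage: points near the origin are +1 (a circular pattern)
X = np.array([[0.1, 0.2], [0.3, -0.1], [0.9, 0.8], [-0.9, 0.7]])
y = np.array([+1, +1, -1, -1])
g = train_quadratic(X, y)
print(g(X))  # should reproduce y on this toy set
```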

### Nonlinear Model via Nonlinear Φ + Linear Models

(figure: the same transform diagram as above)

two choices:

- feature transform $\Phi$
- linear model $\mathcal{A}$, **not just binary classification**

**Pandora's box :-)**: can now freely do **quadratic PLA, quadratic regression, cubic regression, ..., polynomial regression**

### Feature Transform Φ

(figure: handwritten-digit images ('1' vs. 'not 1') mapped by $\Phi$ from raw pixels to two concrete features, average intensity and symmetry, where a linear boundary works)

more generally, not just polynomial:

raw (pixels) $\overset{\text{domain knowledge}}{\longrightarrow}$ **concrete (intensity, symmetry)**

the force, too good to be true? :-)

### Computation/Storage Price

$Q$-th order polynomial transform:

$$\Phi_Q(\mathbf{x}) = \left(1, x_1, x_2, \ldots, x_d, \; x_1^2, x_1 x_2, \ldots, x_d^2, \; \ldots, \; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\right)$$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions, where

$$1 + \tilde{d} = \#\text{ ways of} \le Q\text{-combination from } d \text{ kinds with repetitions} = \binom{Q+d}{Q} = \binom{Q+d}{d} = O\left(Q^d\right)$$

= efforts needed for computing/storing $\mathbf{z} = \Phi_Q(\mathbf{x})$ and $\tilde{\mathbf{w}}$

$Q$ large $\Longrightarrow$ **difficult to compute/store AND curve too complicated**
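A quick check of the count (assuming Python's `math.comb`):

```python
from math import comb

# Number of monomials of degree <= Q in d variables: C(Q+d, d).
def phi_dim(Q, d):
    return comb(Q + d, d)

print(phi_dim(2, 2))    # 6, matching Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
print(phi_dim(10, 10))  # 184756: already large for modest Q and d
```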

### Generalization Issue

(figures: the same data fit by $\Phi_1$, a line with a few mistakes, and by $\Phi_4$, a wiggly quartic curve with $E_{\text{in}}(g) = 0$)

**which one do you prefer? :-)**

- $\Phi_1$ (original $\mathbf{x}$): 'visually' preferred
- $\Phi_4$: $E_{\text{in}}(g) = 0$ but overkill

how to pick $Q$? **model selection** (to be discussed) is important

### Fundamental Machine Learning Models :: Decision Tree

### Decision Tree for Watching MOOC Lectures

$$G(\mathbf{x}) = \sum_{t=1}^{T} q_t(\mathbf{x}) \cdot g_t(\mathbf{x})$$

- **base hypothesis $g_t(\mathbf{x})$**: leaf at the end of path $t$, a **constant** here
- **condition $q_t(\mathbf{x})$**: ⟦is $\mathbf{x}$ on path $t$?⟧
- usually with **simple internal nodes**

(figure: a tree on whether to watch, asking **quitting time?** (<18:30, between, >21:30); the <18:30 branch asks **has a date?** (true → N, false → Y); the middle branch gives Y; the >21:30 branch asks **deadline?** (>2 days → N, between → Y, <−2 days → N))

decision tree: arguably one of the most **human-mimicking models**

### Recursive View of Decision Tree

Path View: $G(\mathbf{x}) = \sum_{t=1}^{T} ⟦\mathbf{x} \text{ on path } t⟧ \cdot \text{leaf}_t(\mathbf{x})$

(figure: the same MOOC-lecture tree as above)

Recursive View:

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \cdot G_c(\mathbf{x})$$

- $G(\mathbf{x})$: full-tree hypothesis
- $b(\mathbf{x})$: branching criteria
- $G_c(\mathbf{x})$: sub-tree hypothesis at the $c$-th branch

tree = (root, sub-trees), just like what **your data structure instructor would say :-)**

### A Basic Decision Tree Algorithm

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$$

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$):

- if **termination criteria met**: return **base hypothesis $g_t(\mathbf{x})$**
- else:
  1. learn **branching criteria $b(\mathbf{x})$**
  2. split $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
  3. build sub-tree $G_c \leftarrow$ DecisionTree($\mathcal{D}_c$)
  4. return $G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis (a sketch of the recursion follows below)
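As mentioned above, here is a minimal sketch of the recursion (assuming NumPy; our illustrative choices, anticipating the C&RT slide: $C = 2$ branches by thresholding one dimension, constant majority-vote leaves, 0/1 error, and depth-based termination):

```python
import numpy as np

# Minimal recursive decision tree for ±1 classification.
def decision_tree(X, y, depth=3):
    majority = 1 if np.sum(y == 1) >= np.sum(y == -1) else -1
    if depth == 0 or len(set(y)) == 1:            # termination criteria met
        return lambda x: majority                 # base hypothesis: a constant
    best = None
    for i in range(X.shape[1]):                   # learn branching criteria b(x)
        for thr in X[:, i]:
            left = X[:, i] <= thr
            if left.all() or (~left).all():
                continue                          # skip degenerate splits
            # 0/1 error if each side were labeled by its best constant
            err = min((y[left] == s).sum() for s in (1, -1)) + \
                  min((y[~left] == s).sum() for s in (1, -1))
            if best is None or err < best[0]:
                best = (err, i, thr)
    if best is None:
        return lambda x: majority
    _, i, thr = best
    left = X[:, i] <= thr
    G1 = decision_tree(X[left], y[left], depth - 1)    # build sub-trees
    G2 = decision_tree(X[~left], y[~left], depth - 1)
    return lambda x: G1(x) if x[i] <= thr else G2(x)   # G(x) = [[b(x)=c]] G_c(x)

# toy usage
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([+1, +1, -1, -1])
G = decision_tree(X, y)
print([G(x) for x in X])   # [1, 1, -1, -1]
```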

### Classification and Regression Tree (C&RT)

recall: function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$) returns a base hypothesis $g_t(\mathbf{x})$ if the termination criteria are met; otherwise it splits $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$ and recurses

C&RT's choices:

- $C = 2$ (binary tree)
- $g_t(\mathbf{x}) = E_{\text{in}}$-optimal **constant**
  - binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  - regression (squared error): average of $\{y_n\}$
- branching: **threshold** some selected dimension
- termination: fully-grown, or better, **pruned**

disclaimer: **C&RT** here is based on **selected components** of **CART™ of California Statistical Software**

### A Simple Data Set

(figure: C&RT recursively splitting a 2-D data set with axis-parallel cuts, one branch at a time, until every region is pure)

**C&RT: 'divide-and-conquer'**