### Quick Tour of Machine Learning (機器學習速遊)

Hsuan-Tien Lin (林軒田), htlin@csie.ntu.edu.tw

Department of Computer Science & Information Engineering, National Taiwan University (國立台灣大學資訊工程系)

Data Science Enthusiasts Annual Conference (資料科學愛好者年會) series event, 2015/12/12

### Disclaimer

- just a **super-condensed** and **shuffled** version of
  - my co-authored textbook "Learning from Data: A Short Course"
  - my two NTU-Coursera Mandarin-teaching ML Massive Open Online Courses
    - "Machine Learning Foundations": www.coursera.org/course/ntumlone
    - "Machine Learning Techniques": www.coursera.org/course/ntumltwo

  impossible to be complete, with most **math details removed**
- live **interaction** is important

goal: help you **begin** your journey with ML

### Roadmap

**Learning from Data**

- What is Machine Learning
- Components of Machine Learning
- Types of Machine Learning
- Step-by-step Machine Learning

### Learning from Data :: What is Machine Learning

### From Learning to Machine Learning

learning: acquiring **skill** with experience accumulated from **observations**

observations → learning → skill

machine learning: acquiring **skill** with experience accumulated/computed from **data**

data → ML → skill

What is **skill**?

### A More Concrete Definition

**skill** ⇔ improve some **performance measure** (e.g. prediction accuracy)

machine learning: improving some **performance measure** with experience **computed** from **data**

data → ML → improved performance measure

An Application in Computational Finance:

stock data → ML → more investment gain

Why use machine learning?

### Yet Another Application: Tree Recognition

- 'define' trees and hand-program: **difficult**
- learn from data (observations) and recognize: a **3-year-old can do so**
- an 'ML-based tree recognition system' can be **easier to build** than a hand-programmed system

ML: an **alternative route** to build complicated systems

### The Machine Learning Route

ML: an **alternative route** to build complicated systems

Some Use Scenarios:

- when humans cannot program the system manually: navigating on Mars
- when humans cannot 'define the solution' easily: speech/visual recognition
- when rapid decisions are needed that humans cannot make: high-frequency trading
- when needing to be user-oriented at a massive scale: consumer-targeted marketing

Give a computer a fish, you feed it for a day; teach it how to fish, you feed it for a lifetime. :-)

### Machine Learning and Artificial Intelligence

Machine Learning: use data to compute **something** that improves performance

Artificial Intelligence: compute **something that shows intelligent behavior**

- **improving performance** is something that shows **intelligent behavior**, so ML can realize AI, among other routes
- e.g. chess playing
  - traditional AI: game tree
  - ML for AI: 'learning from board data'

ML is one possible **and popular** route to realize AI

### Learning from Data :: Components of Machine Learning

### Components of Learning: Metaphor Using Credit Approval

Applicant Information:

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

what to learn (for improving performance): 'approve credit card good for bank?'

### Formalize the Learning Problem

Basic Notations:

- input: $\mathbf{x} \in \mathcal{X}$ (customer application)
- output: $y \in \mathcal{Y}$ (good/bad after approving credit card)
- **unknown** underlying pattern to be learned ⇔ target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- data ⇔ training examples $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$ (historical records in bank)
- hypothesis ⇔ skill with hopefully good performance: $g: \mathcal{X} \to \mathcal{Y}$ ('learned' formula to be used), e.g. approve if
  - $h_1$: annual salary > NTD 800,000
  - $h_2$: debt > NTD 100,000 (really?)
  - $h_3$: year in job ≤ 2 (really?)
- all **candidate formulas** being considered: hypothesis set $\mathcal{H}$
- procedure to **learn** the best formula: algorithm $\mathcal{A}$

$\{(\mathbf{x}_n, y_n)\}$ from $f$ → ML $(\mathcal{A}, \mathcal{H})$ → $g$
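To make the notation concrete, here is a minimal interface sketch (the names and the toy algorithm are ours for illustration, not from the course material):

```python
# Minimal sketch of the formalization: a learning algorithm A takes the
# data D and the hypothesis set H, and returns a final hypothesis g that
# hopefully approximates the unknown target f. All names are illustrative.
def learn(A, D, H):
    """D: list of (x, y) pairs; H: candidate hypotheses; returns g."""
    return A(D, H)

# a toy A: pick the hypothesis in a (finite) H with the fewest mistakes on D
def toy_A(D, H):
    return min(H, key=lambda h: sum(h(x) != y for x, y in D))

# candidate formulas echoing h1/h2 above; x = (annual salary, debt) in millions of NTD
H = [lambda x: +1 if x[0] > 0.8 else -1,   # h1: annual salary > NTD 800,000
     lambda x: +1 if x[1] > 0.1 else -1]   # h2: debt > NTD 100,000 (really?)
D = [((1.0, 0.2), +1), ((0.5, 0.3), -1)]   # two made-up historical records
g = learn(toy_A, D, H)                     # picks h1 here
```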


### Practical Definition of Machine Learning

- unknown target function $f: \mathcal{X} \to \mathcal{Y}$ (ideal credit approval formula)
- training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank)
- learning algorithm $\mathcal{A}$
- hypothesis set $\mathcal{H}$ (set of candidate formulas)
- final hypothesis $g \approx f$ ('learned' formula to be used)

machine learning ($\mathcal{A}$ and $\mathcal{H}$): use **data** to compute **hypothesis $g$** that approximates **target $f$**

### Key Essence of Machine Learning

machine learning: use **data** to compute **hypothesis $g$** that approximates **target $f$**

data → ML → improved performance measure

1. there exists some 'underlying pattern' to be learned, so the 'performance measure' can be improved
2. but there is no programmable (easy) definition, so 'ML' is needed
3. somehow there is data about the pattern, so ML has some 'inputs' to learn from

key essence: help decide whether to use ML

### Learning from Data :: Types of Machine Learning

### Visualizing Credit Card Problem

- customer features $\mathbf{x}$: points on the plane (or points in $\mathbb{R}^d$)
- labels $y$: ◦ (+1), × (−1); this setup is called **binary classification**
- hypothesis $h$: **lines** here, but possibly other curves
- different curves classify customers differently

binary classification algorithm: find a **good decision boundary curve** $g$

### More Binary Classification Problems

- credit: approve/disapprove
- email: spam/non-spam
- patient: sick/not sick
- ad: profitable/not profitable

a core and important problem, with many tools serving as **building blocks of other tools**

### Binary Classification for Education

data → ML → skill

- data: students' records on quizzes on a Math tutoring system
- skill: predict whether a student can give a correct answer to another quiz question

A Possible ML Solution:

answer correctly ≈ ⟦recent **strength** of student > **difficulty** of question⟧

- give ML **9 million records** from **3000 students**
- ML determines (reverse-engineers) **strength** and **difficulty** automatically

key part of the **world-champion** system from National Taiwan Univ. in KDDCup 2010
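The slides do not spell out the model, but one simple way to realize the ⟦strength > difficulty⟧ idea is a logistic, item-response-style fit; the sketch below (assuming NumPy; names and data are hypothetical) learns one strength per student and one difficulty per question from (student, question, correct) records:

```python
import numpy as np

# Hypothetical sketch: model P(correct) = sigmoid(strength - difficulty) and
# fit both parameter vectors by stochastic gradient ascent on the log-likelihood.
def fit_strength_difficulty(records, n_students, n_questions, lr=0.1, epochs=20):
    strength = np.zeros(n_students)
    difficulty = np.zeros(n_questions)
    for _ in range(epochs):
        for s, q, correct in records:
            p = 1.0 / (1.0 + np.exp(difficulty[q] - strength[s]))
            grad = correct - p            # gradient of the log-likelihood
            strength[s] += lr * grad
            difficulty[q] -= lr * grad
    return strength, difficulty

# toy usage: student 0 answers question 1 correctly, question 0 incorrectly
records = [(0, 1, 1), (0, 0, 0), (1, 0, 1)]
strength, difficulty = fit_strength_difficulty(records, n_students=2, n_questions=2)
```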
### Multiclass Classification: Coin Recognition Problem

(figure: US coins plotted by size vs. mass, with clusters labeled 1, 5, 10, 25)

- classify US coins (1c, 5c, 10c, 25c) by (size, mass)
- $\mathcal{Y} = \{1c, 5c, 10c, 25c\}$, or $\mathcal{Y} = \{1, 2, \cdots, K\}$ (abstractly)
- binary classification: special case with $K = 2$

Other Multiclass Classification Problems:

- written digits ⇒ 0, 1, ..., 9
- pictures ⇒ apple, orange, strawberry
- emails ⇒ spam, primary, social, promotion, update (Google)

**many applications** in practice, especially for 'recognition'

### Regression: Patient Recovery Prediction Problem

- binary classification: patient features ⇒ sick or not
- multiclass classification: patient features ⇒ which type of cancer
- regression: patient features ⇒ **how many days before recovery**
- $\mathcal{Y} = \mathbb{R}$, or $\mathcal{Y} = [\text{lower}, \text{upper}] \subset \mathbb{R}$ (bounded regression); deeply studied in statistics

Other Regression Problems:

- company data ⇒ stock price
- climate data ⇒ temperature

also core and important, with many 'statistical' tools serving as **building blocks of other tools**

### Regression for Recommender System (1/2)

data → ML → skill

- data: how many users have rated some movies
- skill: predict how a user would rate an unrated movie

A Hot Problem:

- competition held by Netflix in 2006
  - 100,480,507 ratings that 480,189 users gave to 17,770 movies
  - 10% improvement = **1 million dollar prize**
- similar competition (movies → songs) held by Yahoo! in KDDCup 2011
  - 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines **learn our preferences**?

### Regression for Recommender System (2/2)

Match movie and viewer factors to get a predicted rating.

(figure: movie factors such as comedy content, action content, blockbuster?, Tom Cruise in it? matched against viewer factors such as likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?; the predicted rating adds contributions from each factor)

A Possible ML Solution:

- pattern: rating ← viewer/movie factors
- learning: known ratings → learned factors → unknown rating prediction

key part of the **world-champion** (again!) system from National Taiwan Univ. in KDDCup 2011
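As a sketch of the factor idea (made-up numbers, not the actual KDDCup system): the predicted rating is just the sum of per-factor contributions, i.e. an inner product of viewer and movie factor vectors.

```python
import numpy as np

# Hypothetical illustration of 'add contributions from each factor':
# the factor values below are invented for this example only.
viewer = np.array([0.9, 0.1, 0.8])  # likes comedy?, likes action?, likes Tom Cruise?
movie  = np.array([0.7, 0.2, 1.0])  # comedy content, action content, Tom Cruise in it?

predicted_rating = viewer @ movie   # inner product of the two factor vectors
print(predicted_rating)             # 1.45 on this made-up scale
```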

### Supervised versus Unsupervised

coin recognition **with** $y_n$: (figure: size-mass plot with labeled clusters 1, 5, 10, 25) is **supervised** multiclass classification

coin recognition **without** $y_n$: (figure: the same size-mass plot, unlabeled) is **unsupervised** multiclass classification ⇐⇒ 'clustering'

Other Clustering Problems:

- articles ⇒ topics
- consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem


### Semi-supervised: Coin Recognition with Some $y_n$

(figures: fully labeled size-mass plot = supervised; only a few labeled points = **semi-supervised**; no labels = unsupervised, i.e. clustering)

Other Semi-supervised Learning Problems:

- face images with a few labeled ⇒ face identifier (Facebook)
- medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: **leverage** unlabeled data to avoid 'expensive' labeling

### Reinforcement Learning

a 'very different' but natural way of learning

Teach Your Dog: Say 'Sit Down'

The dog pees on the ground. **BAD DOG. THAT'S A VERY WRONG ACTION.**

- cannot easily show the dog that $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'punish' to say $\tilde{y}_n = \text{pee}$ is wrong

The dog sits down. **Good Dog. Let me give you some cookies.**

- still cannot show $y_n = \text{sit}$ when $\mathbf{x}_n =$ 'sit down'
- but can 'reward' to say $\tilde{y}_n = \text{sit}$ is good

Other Reinforcement Learning Problems Using $(\mathbf{x}, \tilde{y}, \text{goodness})$:

- (customer, ad choice, ad click earning) ⇒ ad system
- (cards, strategy, winning amount) ⇒ black jack agent

reinforcement: learn with **'partial/implicit information'** (often sequentially)

### Learning from Data :: Step-by-step Machine Learning

### Step-by-step Machine Learning

(recall the setup: unknown target function $f: \mathcal{X} \to \mathcal{Y}$, training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$, learning algorithm $\mathcal{A}$, hypothesis set $\mathcal{H}$, final hypothesis $g \approx f$)

1. choose error measure: how $g(\mathbf{x}) \approx f(\mathbf{x})$
2. decide hypothesis set $\mathcal{H}$
3. optimize error (and more) on $\mathcal{D}$ as $\mathcal{A}$
4. pray for generalization: whether $g(\mathbf{x}) \approx f(\mathbf{x})$ for **unseen** $\mathbf{x}$

### Choose Error Measure

$g \approx f$ can often be evaluated by an averaged $\text{err}(g(\mathbf{x}), f(\mathbf{x}))$, called a **pointwise error measure**

in-sample (within data):

$$E_{\text{in}}(g) = \frac{1}{N} \sum_{n=1}^{N} \text{err}\big(g(\mathbf{x}_n), \underbrace{f(\mathbf{x}_n)}_{y_n}\big)$$

out-of-sample (future data):

$$E_{\text{out}}(g) = \mathop{\mathcal{E}}_{\text{future } \mathbf{x}} \text{err}\big(g(\mathbf{x}), f(\mathbf{x})\big)$$

will start from the 0/1 error $\text{err}(\tilde{y}, y) = ⟦\tilde{y} \neq y⟧$ for **classification**
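For concreteness, a minimal sketch (assuming NumPy) of the 0/1 error on a data set; $E_{\text{out}}$ can only be estimated, e.g. on held-out data:

```python
import numpy as np

# The averaged 0/1 pointwise error: fraction of points where the prediction
# disagrees with the label. E_in uses training data; an E_out estimate would
# use future (held-out) data instead.
def zero_one_error(y_pred, y_true):
    return np.mean(y_pred != y_true)

y_true = np.array([+1, -1, +1, +1])
y_pred = np.array([+1, -1, -1, +1])
print(zero_one_error(y_pred, y_true))  # 0.25
```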

### Choose Hypothesis Set (for Credit Approval)

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_1, x_2, \cdots, x_d)$ 'features of customer', compute a weighted 'score' and
  - approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  - deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$
- $\mathcal{Y}$: +1 (good), −1 (bad), 0 ignored; the linear formulas $h \in \mathcal{H}$ are

$$h(\mathbf{x}) = \text{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$$

**linear (binary) classifier**, called 'perceptron' historically
### Optimize Error (and More) on Data

$\mathcal{H}$ = all possible perceptrons, $g = ?$

- want: $g \approx f$ (hard when $f$ unknown)
- almost necessary: $g \approx f$ on $\mathcal{D}$, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
- difficult: $\mathcal{H}$ is of **infinite** size
- idea: start from some $g_0$, and **'correct' its mistakes on $\mathcal{D}$**

let's visualize **without math**

### Seeing is Believing

(figure: nine PLA update snapshots; starting from an initial line, each update picks a currently misclassified point such as $\mathbf{x}_1, \mathbf{x}_9, \mathbf{x}_{14}, \mathbf{x}_3, \ldots$ and rotates $\mathbf{w}(t)$ into $\mathbf{w}(t+1)$, until the final $\mathbf{w}_{\text{PLA}}$ separates all points)

**worked like a charm with < 20 lines!!**

(A fault confessed is half redressed. :-)
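The '< 20 lines' claim is easy to verify; here is a minimal PLA sketch (assuming NumPy and a linearly separable toy set; the variable names are ours):

```python
import numpy as np

# Minimal perceptron learning algorithm (PLA): start from w = 0 and repeatedly
# correct a misclassified point by w <- w + y_n * x_n.
def pla(X, y, max_updates=1000):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1 (threshold term)
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:                  # no mistakes: data separated
            return w
        n = mistakes[0]
        w += y[n] * X[n]                        # rotate w toward/away from x_n
    return w

# toy usage on a linearly separable set
X = np.array([[2.0, 3.0], [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = pla(X, y)
```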


### Pray for Generalization

(pictures from Google Image Search)

(figure, analogy: a parent shows a kid (picture, label) pairs, and the kid's brain, choosing among alternatives, forms a good hypothesis; likewise, a target $f(\mathbf{x})$ plus noise generates examples $(\mathbf{x}_n, y_n)$, and the learning algorithm, choosing from hypothesis set $\mathcal{H}$, forms a good hypothesis $g(\mathbf{x}) \approx f(\mathbf{x})$)

challenge: see only $\{(\mathbf{x}_n, y_n)\}$ without knowing $f$ or the noise, yet **generalize** to unseen $(\mathbf{x}, y)$ w.r.t. $f(\mathbf{x})$

### Generalization Is Non-trivial

Bob impresses Alice by memorizing every given (movie, rank), but is too nervous about a **new movie** and guesses randomly

(pictures from Google Image Search)

memorize ≠ **generalize**; perfect from Bob's view ≠ **good for Alice**; perfect during training ≠ **good when testing**

take-home message: if $\mathcal{H}$ is **simple** (like lines), generalization is **usually possible**

### Mini-Summary

**Learning from Data**

- What is Machine Learning: **use data to approximate target**
- Components of Machine Learning: **algorithm $\mathcal{A}$ takes data $\mathcal{D}$ and hypotheses $\mathcal{H}$ to get hypothesis $g$**
- Types of Machine Learning: **variety of problems almost everywhere**
- Step-by-step Machine Learning: **error, hypotheses, optimize, generalize**

### Roadmap

**Fundamental Machine Learning Models**

- Linear Regression
- Logistic Regression
- Nonlinear Transform
- Decision Tree

### Fundamental Machine Learning Models :: Linear Regression

### Credit **Limit** Problem

| attribute | value |
| --- | --- |
| age | 23 years |
| gender | female |
| annual salary | NTD 1,000,000 |
| year in residence | 1 year |
| year in job | 0.5 year |
| current debt | 200,000 |

credit limit? **100,000**

same setup, new target: the unknown target function $f: \mathcal{X} \to \mathcal{Y}$ is now the ideal credit **limit** formula, learned from training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$ (historical records in bank) via learning algorithm $\mathcal{A}$ and hypothesis set $\mathcal{H}$, giving final hypothesis $g \approx f$

$\mathcal{Y} = \mathbb{R}$: **regression**

### Linear Regression Hypothesis

| attribute | value |
| --- | --- |
| age | 23 years |
| annual salary | NTD 1,000,000 |
| year in job | 0.5 year |
| current debt | 200,000 |

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of customer', approximate the desired credit limit with a **weighted sum**: $y \approx \sum_{i=0}^{d} w_i x_i$
- linear regression hypothesis: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$

$h(\mathbf{x})$: like the **perceptron, but without the** sign

### Illustration of Linear Regression

(figures: for $\mathbf{x} = (x) \in \mathbb{R}$, a fitted line in the $x$-$y$ plane; for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$, a fitted plane over $(x_1, x_2)$)

linear regression: find **lines/hyperplanes** with small **residuals**

### The Error Measure

popular/historical error measure: squared error $\text{err}(\hat{y}, y) = (\hat{y} - y)^2$

in-sample:

$$E_{\text{in}}(h_{\mathbf{w}}) = \frac{1}{N} \sum_{n=1}^{N} \big(\underbrace{h(\mathbf{x}_n)}_{\mathbf{w}^T \mathbf{x}_n} - y_n\big)^2$$

out-of-sample:

$$E_{\text{out}}(\mathbf{w}) = \mathop{\mathcal{E}}_{(\mathbf{x}, y) \sim P} \big(\mathbf{w}^T \mathbf{x} - y\big)^2$$

next: how to minimize $E_{\text{in}}(\mathbf{w})$?

### Minimize $E_{\text{in}}$

$$\min_{\mathbf{w}} E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big(\mathbf{w}^T \mathbf{x}_n - y_n\big)^2$$

(figure: the convex, bowl-shaped surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, **convex**
- necessary condition of the 'best' $\mathbf{w}$:

$$\nabla E_{\text{in}}(\mathbf{w}) \equiv \begin{bmatrix} \frac{\partial E_{\text{in}}}{\partial w_0}(\mathbf{w}) \\ \frac{\partial E_{\text{in}}}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \frac{\partial E_{\text{in}}}{\partial w_d}(\mathbf{w}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

(i.e., at the bottom of the valley, it is not possible to 'roll down' any further)

task: find $\mathbf{w}_{\text{LIN}}$ such that $\nabla E_{\text{in}}(\mathbf{w}_{\text{LIN}}) = \mathbf{0}$

### Linear Regression Algorithm

1. from $\mathcal{D}$, construct the input matrix $X$ and output vector $\mathbf{y}$ by

$$\underbrace{X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}}_{N \times (d+1)} \qquad \underbrace{\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}}_{N \times 1}$$

2. calculate the pseudo-inverse $\underbrace{X^{\dagger}}_{(d+1) \times N}$

3. return $\underbrace{\mathbf{w}_{\text{LIN}}}_{(d+1) \times 1} = X^{\dagger} \mathbf{y}$

simple and efficient with a **good † routine**
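A minimal sketch of the algorithm (assuming NumPy, whose `np.linalg.pinv` serves as the 'good † routine'):

```python
import numpy as np

# Linear regression via the pseudo-inverse: w_LIN = X† y.
def linear_regression(X, y):
    X = np.column_stack([np.ones(len(X)), X])  # prepend x_0 = 1
    return np.linalg.pinv(X) @ y               # pseudo-inverse times y

# toy usage: data generated roughly as y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.0])
w_lin = linear_regression(X, y)                # approximately [1.0, 2.0]
```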

### Is Linear Regression a 'Learning Algorithm'?

$\mathbf{w}_{\text{LIN}} = X^{\dagger} \mathbf{y}$

No!

- analytic (closed-form) solution, 'instantaneous'
- not improving $E_{\text{in}}$ nor $E_{\text{out}}$ iteratively

Yes!

- good $E_{\text{in}}$? **yes, optimal!**
- good $E_{\text{out}}$? **yes, 'simple' like perceptrons**
- improving iteratively? **somewhat, within an iterative pseudo-inverse routine**

if $E_{\text{out}}(\mathbf{w}_{\text{LIN}})$ is good, **learning 'happened'!**

### Fundamental Machine Learning Models :: Logistic Regression

### Heart Attack Prediction Problem (1/2)

| attribute | value |
| --- | --- |
| age | 40 years |
| gender | male |
| blood pressure | 130/85 |
| cholesterol level | 240 |
| weight | 70 |

heart disease? **yes**

setup: unknown target distribution $P(y|\mathbf{x})$ containing $f(\mathbf{x})$ plus noise; training examples $\mathcal{D}: (\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)$; learning algorithm $\mathcal{A}$; hypothesis set $\mathcal{H}$; error measure err; final hypothesis $g \approx f$

binary classification: ideal $f(\mathbf{x}) = \text{sign}\left(P(+1|\mathbf{x}) - \frac{1}{2}\right) \in \{-1, +1\}$ because of the classification err

### Heart Attack Prediction Problem (2/2)

(same patient features as above)

heart attack? **80% risk**

'soft' binary classification: $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

### Soft Binary Classification

target function $f(\mathbf{x}) = P(+1|\mathbf{x}) \in [0, 1]$

ideal (noiseless) data:

- $(\mathbf{x}_1, y_1' = 0.9 = P(+1|\mathbf{x}_1))$
- $(\mathbf{x}_2, y_2' = 0.2 = P(+1|\mathbf{x}_2))$
- ...
- $(\mathbf{x}_N, y_N' = 0.6 = P(+1|\mathbf{x}_N))$

actual (noisy) data:

- $(\mathbf{x}_1, y_1 = ◦ \sim P(y|\mathbf{x}_1))$, viewable as the noisy sample $y_1' = 1$
- $(\mathbf{x}_2, y_2 = × \sim P(y|\mathbf{x}_2))$, viewable as $y_2' = 0$
- ...
- $(\mathbf{x}_N, y_N = × \sim P(y|\mathbf{x}_N))$, viewable as $y_N' = 0$

same data as hard binary classification, different **target function**

### Logistic Hypothesis

(same patient features as above)

- For $\mathbf{x} = (x_0, x_1, x_2, \cdots, x_d)$ 'features of patient', calculate a **weighted 'risk score'**: $s = \sum_{i=0}^{d} w_i x_i$
- convert the **score** to an **estimated probability** by the logistic function $\theta(s)$

(figure: the S-shaped logistic curve $\theta(s)$, rising from 0 to 1)

logistic hypothesis: $h(\mathbf{x}) = \theta(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$

### Minimizing $E_{\text{in}}(\mathbf{w})$

a popular error, called **cross-entropy** and derived from **maximum likelihood**:

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp(-y_n \mathbf{w}^T \mathbf{x}_n)\right)$$

(figure: the convex error surface of $E_{\text{in}}$ over $\mathbf{w}$)

- $E_{\text{in}}(\mathbf{w})$: continuous, differentiable, twice-differentiable, **convex**
- how to minimize? locate the **valley**, i.e. want $\nabla E_{\text{in}}(\mathbf{w}) = \mathbf{0}$

most basic algorithm: **gradient descent** (roll downhill)

### Gradient Descent

For $t = 0, 1, \ldots$

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \mathbf{v}$$

when stopped, return the last $\mathbf{w}$ as $g$

- PLA: $\mathbf{v}$ comes from mistake correction
- smooth $E_{\text{in}}(\mathbf{w})$ for logistic regression: choose $\mathbf{v}$ to get the ball to roll 'downhill'?
  - direction $\mathbf{v}$: (assumed) of unit length
  - step size $\eta$: (assumed) positive

(figure: in-sample error $E_{\text{in}}$ over weights $\mathbf{w}$, a ball rolling down the curve)

gradient descent: $\mathbf{v} \propto -\nabla E_{\text{in}}(\mathbf{w}_t)$

### Putting Everything Together

Logistic Regression Algorithm:

initialize $\mathbf{w}_0$; for $t = 0, 1, \cdots$

1. compute

$$\nabla E_{\text{in}}(\mathbf{w}_t) = \frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(-y_n \mathbf{x}_n\right)$$

2. update by $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \nabla E_{\text{in}}(\mathbf{w}_t)$

...until $\nabla E_{\text{in}}(\mathbf{w}_{t+1}) \approx \mathbf{0}$ or enough iterations; return the last $\mathbf{w}_{t+1}$ as $g$

can use more sophisticated tools to speed up
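A minimal sketch of this loop (assuming NumPy; the step size and iteration budget are arbitrary illustrative choices):

```python
import numpy as np

# Logistic regression by plain gradient descent on the cross-entropy error,
# following the two steps above; labels are in {-1, +1}.
def logistic_regression(X, y, eta=0.1, iters=1000):
    X = np.column_stack([np.ones(len(X)), X])      # prepend x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = 1.0 / (1.0 + np.exp(y * (X @ w)))  # theta(-y_n w^T x_n)
        grad = np.mean((theta * -y)[:, None] * X, axis=0)
        w -= eta * grad                            # roll downhill
    return w

# toy usage
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([+1, +1, -1, -1])
w = logistic_regression(X, y)
```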

### Linear Models Summarized

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$

| model | hypothesis | error | $E_{\text{in}}(\mathbf{w})$ | how to solve |
| --- | --- | --- | --- | --- |
| linear classification | $h(\mathbf{x}) = \text{sign}(s)$ | plausible err = 0/1 | discrete | solvable in special case |
| linear regression | $h(\mathbf{x}) = s$ | friendly err = squared | quadratic convex | closed-form solution |
| logistic regression | $h(\mathbf{x}) = \theta(s)$ | plausible err = cross-entropy | smooth convex | gradient descent |

my 'secret': **linear first!!**

### Fundamental Machine Learning Models :: Nonlinear Transform

### Linear Hypotheses

up to now: linear hypotheses

- visually: **'line'-like** boundary
- mathematically: linear scores $s = \mathbf{w}^T \mathbf{x}$

but limited...

(figure: a data set on $[-1, 1]^2$ that no line separates well)

- theoretically: **complexity under control** :-)
- practically: on some $\mathcal{D}$, **large $E_{\text{in}}$** for every line :-(

how to **break the limit** of linear hypotheses?

### Circular Separable

(figure: the same data set, separated by a circle)

- $\mathcal{D}$ not linear separable
- but **circular separable** by a circle of radius $\sqrt{0.6}$ centered at the origin:

$$h_{\text{SEP}}(\mathbf{x}) = \text{sign}\left(-x_1^2 - x_2^2 + 0.6\right)$$

re-derive **Circular-PLA, Circular-Regression**, blah blah... all over again? :-)

### Circular Separable and Linear Separable

$$h(\mathbf{x}) = \text{sign}\Big(\underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2}\Big) = \text{sign}\left(\tilde{\mathbf{w}}^T \mathbf{z}\right)$$

- $\{(\mathbf{x}_n, y_n)\}$ circular separable $\Longrightarrow$ $\{(\mathbf{z}_n, y_n)\}$ **linear** separable
- $\mathbf{x} \in \mathcal{X} \overset{\Phi}{\longmapsto} \mathbf{z} \in \mathcal{Z}$: **(nonlinear) feature transform $\Phi$**

(figures: circular boundary in $\mathcal{X}$-space; after the transform, a linear boundary in $\mathcal{Z}$-space)

circular separable in $\mathcal{X}$ $\Longrightarrow$ **linear** separable in $\mathcal{Z}$

### General Quadratic Hypothesis Set

a 'bigger' $\mathcal{Z}$-space with $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

perceptrons in $\mathcal{Z}$-space $\Longleftrightarrow$ quadratic hypotheses in $\mathcal{X}$-space

$$\mathcal{H}_{\Phi_2} = \left\{ h(\mathbf{x}) : h(\mathbf{x}) = \tilde{h}(\Phi_2(\mathbf{x})) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z} \right\}$$

- can implement **all possible quadratic curve boundaries**: circle, ellipse, **rotated ellipse, hyperbola, parabola**, ...
  - e.g. the ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$ corresponds to $\tilde{\mathbf{w}}^T = [33, -20, -4, 3, 2, 3]$
- includes **lines and constants as degenerate cases**

### Good Quadratic Hypothesis

| $\mathcal{Z}$-space | | $\mathcal{X}$-space |
| --- | --- | --- |
| perceptrons | ⇐⇒ | quadratic hypotheses |
| **good perceptron** | ⇐⇒ | **good quadratic hypothesis** |
| **separating perceptron** | ⇐⇒ | **separating quadratic hypothesis** |

(figures: a separating line in $\mathcal{Z}$-space ⇐⇒ a separating circle in $\mathcal{X}$-space)

- want: get a **good perceptron** in $\mathcal{Z}$-space
- known: how to get a **good perceptron** in $\mathcal{X}$-space with data $\{(\mathbf{x}_n, y_n)\}$

solution: get a **good perceptron** in $\mathcal{Z}$-space with data $\{(\mathbf{z}_n = \Phi_2(\mathbf{x}_n), y_n)\}$

### The Nonlinear Transform Steps

(figure: $\mathcal{X}$-space data $\overset{\Phi}{\longrightarrow}$ $\mathcal{Z}$-space data $\overset{\mathcal{A}}{\longrightarrow}$ linear boundary in $\mathcal{Z}$-space $\overset{\Phi^{-1}}{\longrightarrow}$ quadratic boundary in $\mathcal{X}$-space)

1. transform original data $\{(\mathbf{x}_n, y_n)\}$ to $\{(\mathbf{z}_n = \Phi(\mathbf{x}_n), y_n)\}$ by $\Phi$
2. get a good perceptron $\tilde{\mathbf{w}}$ using $\{(\mathbf{z}_n, y_n)\}$ and your favorite linear algorithm $\mathcal{A}$
3. return $g(\mathbf{x}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right)$
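A minimal sketch of the three steps (assuming NumPy; here the 'favorite linear algorithm' is the pseudo-inverse linear regression from earlier, applied to ±1 labels, but any linear algorithm would do):

```python
import numpy as np

def phi2(X):
    """Quadratic transform: Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def train_quadratic(X, y):
    Z = phi2(X)                      # step 1: transform data to Z-space
    w_tilde = np.linalg.pinv(Z) @ y  # step 2: favorite linear algorithm A
    return lambda Xq: np.sign(phi2(Xq) @ w_tilde)  # step 3: g(x) = sign(w~^T Phi(x))

# toy usage: points near the origin are +1 (a circular pattern)
X = np.array([[0.1, 0.2], [0.3, -0.1], [0.9, 0.8], [-0.9, 0.7]])
y = np.array([+1, +1, -1, -1])
g = train_quadratic(X, y)
print(g(X))  # should reproduce y on this toy set
```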

### Nonlinear Model via Nonlinear Φ + Linear Models

(figure: the same transform diagram as above)

two choices:

- feature transform $\Phi$
- linear model $\mathcal{A}$, **not just binary classification**

**Pandora's box :-)**: can now freely do **quadratic PLA, quadratic regression, cubic regression, ..., polynomial regression**

### Feature Transform Φ

(figure: handwritten-digit images ('1' vs. 'not 1') mapped by $\Phi$ from raw pixels to two concrete features, average intensity and symmetry, where a linear boundary works)

more generally, not just polynomial:

raw (pixels) $\overset{\text{domain knowledge}}{\longrightarrow}$ **concrete (intensity, symmetry)**

the force, too good to be true? :-)

### Computation/Storage Price

$Q$-th order polynomial transform:

$$\Phi_Q(\mathbf{x}) = \left(1, x_1, x_2, \ldots, x_d, \; x_1^2, x_1 x_2, \ldots, x_d^2, \; \ldots, \; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\right)$$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions, where

$$1 + \tilde{d} = \#\text{ ways of} \le Q\text{-combination from } d \text{ kinds with repetitions} = \binom{Q+d}{Q} = \binom{Q+d}{d} = O\left(Q^d\right)$$

= efforts needed for computing/storing $\mathbf{z} = \Phi_Q(\mathbf{x})$ and $\tilde{\mathbf{w}}$

$Q$ large $\Longrightarrow$ **difficult to compute/store AND curve too complicated**
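A quick check of the count (assuming Python's `math.comb`):

```python
from math import comb

# Number of monomials of degree <= Q in d variables: C(Q+d, d).
def phi_dim(Q, d):
    return comb(Q + d, d)

print(phi_dim(2, 2))    # 6, matching Phi2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)
print(phi_dim(10, 10))  # 184756: already large for modest Q and d
```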

### Generalization Issue

(figures: the same data fit by $\Phi_1$, a line with a few mistakes, and by $\Phi_4$, a wiggly quartic curve with $E_{\text{in}}(g) = 0$)

**which one do you prefer? :-)**

- $\Phi_1$ (original $\mathbf{x}$): 'visually' preferred
- $\Phi_4$: $E_{\text{in}}(g) = 0$ but overkill

how to pick $Q$? **model selection** (to be discussed) is important

### Fundamental Machine Learning Models :: Decision Tree

### Decision Tree for Watching MOOC Lectures

$$G(\mathbf{x}) = \sum_{t=1}^{T} q_t(\mathbf{x}) \cdot g_t(\mathbf{x})$$

- **base hypothesis $g_t(\mathbf{x})$**: leaf at the end of path $t$, a **constant** here
- **condition $q_t(\mathbf{x})$**: ⟦is $\mathbf{x}$ on path $t$?⟧
- usually with **simple internal nodes**

(figure: a tree on whether to watch, asking **quitting time?** (<18:30, between, >21:30); the <18:30 branch asks **has a date?** (true → N, false → Y); the middle branch gives Y; the >21:30 branch asks **deadline?** (>2 days → N, between → Y, <−2 days → N))

decision tree: arguably one of the most **human-mimicking models**

### Recursive View of Decision Tree

Path View: $G(\mathbf{x}) = \sum_{t=1}^{T} ⟦\mathbf{x} \text{ on path } t⟧ \cdot \text{leaf}_t(\mathbf{x})$

(figure: the same MOOC-lecture tree as above)

Recursive View:

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \cdot G_c(\mathbf{x})$$

- $G(\mathbf{x})$: full-tree hypothesis
- $b(\mathbf{x})$: branching criteria
- $G_c(\mathbf{x})$: sub-tree hypothesis at the $c$-th branch

tree = (root, sub-trees), just like what **your data structure instructor would say :-)**

### A Basic Decision Tree Algorithm

$$G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$$

function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$):

- if **termination criteria met**: return **base hypothesis $g_t(\mathbf{x})$**
- else:
  1. learn **branching criteria $b(\mathbf{x})$**
  2. split $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$
  3. build sub-tree $G_c \leftarrow$ DecisionTree($\mathcal{D}_c$)
  4. return $G(\mathbf{x}) = \sum_{c=1}^{C} ⟦b(\mathbf{x}) = c⟧ \, G_c(\mathbf{x})$

four choices: number of branches, branching criteria, termination criteria, & base hypothesis (a sketch of the recursion follows below)
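As mentioned above, here is a minimal sketch of the recursion (assuming NumPy; our illustrative choices, anticipating the C&RT slide: $C = 2$ branches by thresholding one dimension, constant majority-vote leaves, 0/1 error, and depth-based termination):

```python
import numpy as np

# Minimal recursive decision tree for ±1 classification.
def decision_tree(X, y, depth=3):
    majority = 1 if np.sum(y == 1) >= np.sum(y == -1) else -1
    if depth == 0 or len(set(y)) == 1:            # termination criteria met
        return lambda x: majority                 # base hypothesis: a constant
    best = None
    for i in range(X.shape[1]):                   # learn branching criteria b(x)
        for thr in X[:, i]:
            left = X[:, i] <= thr
            if left.all() or (~left).all():
                continue                          # skip degenerate splits
            # 0/1 error if each side were labeled by its best constant
            err = min((y[left] == s).sum() for s in (1, -1)) + \
                  min((y[~left] == s).sum() for s in (1, -1))
            if best is None or err < best[0]:
                best = (err, i, thr)
    if best is None:
        return lambda x: majority
    _, i, thr = best
    left = X[:, i] <= thr
    G1 = decision_tree(X[left], y[left], depth - 1)    # build sub-trees
    G2 = decision_tree(X[~left], y[~left], depth - 1)
    return lambda x: G1(x) if x[i] <= thr else G2(x)   # G(x) = [[b(x)=c]] G_c(x)

# toy usage
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([+1, +1, -1, -1])
G = decision_tree(X, y)
print([G(x) for x in X])   # [1, 1, -1, -1]
```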

### Classification and Regression Tree (C&RT)

recall: function DecisionTree(data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$) returns a base hypothesis $g_t(\mathbf{x})$ if the termination criteria are met; otherwise it splits $\mathcal{D}$ into $C$ parts $\mathcal{D}_c = \{(\mathbf{x}_n, y_n) : b(\mathbf{x}_n) = c\}$ and recurses

C&RT's choices:

- $C = 2$ (binary tree)
- $g_t(\mathbf{x}) = E_{\text{in}}$-optimal **constant**
  - binary/multiclass classification (0/1 error): majority of $\{y_n\}$
  - regression (squared error): average of $\{y_n\}$
- branching: **threshold** some selected dimension
- termination: fully-grown, or better, **pruned**

disclaimer: **C&RT** here is based on **selected components** of **CART™ of California Statistical Software**

### A Simple Data Set

(figure: C&RT recursively splitting a 2-D data set with axis-parallel cuts, one branch at a time, until every region is pure)

**C&RT: 'divide-and-conquer'**