Hsuan-Tien Lin (林軒田) htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University ( 國立台灣大學資訊工程系)
Roadmap
• What is Machine Learning
• Perceptron Learning Algorithm
• Types of Learning
• Possibility of Learning
• Linear Regression
• Logistic Regression
• Nonlinear Transform
• Overfitting
• Principles of Learning
What is Machine Learning
From Learning to Machine Learning

learning: acquiring skill with experience accumulated from observations
(observations → learning → skill)

machine learning: acquiring skill with experience accumulated/computed from data
(data → ML → skill)

What is skill?
A More Concrete Definition

skill ⇔ improve some performance measure (e.g. prediction accuracy)

machine learning: improving some performance measure with experience computed from data
(data → ML → improved performance measure)

An Application in Computational Finance
stock data → ML → more investment gain

Why use machine learning?
• 'define' trees and hand-program: difficult
• learn from data (observations) and recognize: a 3-year-old can do so
• an 'ML-based tree recognition system' can be easier to build than a hand-programmed system

ML: an alternative route to build complicated systems
Some Use Scenarios
• when human cannot program the system manually: navigating on Mars
• when human cannot 'define the solution' easily: speech/visual recognition
• when needing rapid decisions that humans cannot make: high-frequency trading
• when needing to be user-oriented on a massive scale: consumer-targeted marketing

Give a computer a fish, you feed it for a day;
teach it how to fish, you feed it for a lifetime. :-)
Key Essence of Machine Learning

machine learning: improving some performance measure with experience computed from data
(data → ML → improved performance measure)

1. exists some 'underlying pattern' to be learned (so 'performance measure' can be improved)
2. but no programmable (easy) definition (so 'ML' is needed)
3. somehow there is data about the pattern (so ML has some 'inputs' to learn from)

key essence: help decide whether to use ML
Entertainment: Recommender System (1/2)

• data: how many users have rated some movies
• skill: predict how a user would rate an unrated movie

A Hot Problem
• competition held by Netflix in 2006
• 100,480,507 ratings that 480,189 users gave to 17,770 movies
• 10% improvement = 1 million dollar prize
• similar competition (movies → songs) held by Yahoo! in KDDCup 2011
• 252,800,275 ratings that 1,000,990 users gave to 624,961 songs

How can machines learn our preferences?
Entertainment: Recommender System (2/2)

[Figure: match movie factors (comedy content, action content, blockbuster?, Tom Cruise in it?) with viewer factors (likes comedy?, likes action?, prefers blockbusters?, likes Tom Cruise?); the predicted rating adds contributions from each factor]
A Possible ML Solution
• pattern: rating ← viewer/movie factors
• learning: known ratings → learned factors → unknown rating prediction (see the sketch below)

key part of the world-champion (again!) system from National Taiwan Univ. in KDDCup 2011
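A hedged sketch of the 'add contributions from each factor' idea, in NumPy; the factor names and values below are illustrative assumptions, not the actual KDDCup system:

    import numpy as np

    # hypothetical learned factor vectors (comedy, action, blockbuster, Tom Cruise)
    movie_factors  = np.array([0.1, 0.9, 0.8, 0.7])   # an action blockbuster
    viewer_factors = np.array([0.2, 0.8, 0.6, 0.1])   # a viewer who likes action

    # predicted rating: add the contribution of each matched factor
    predicted_rating = movie_factors @ viewer_factors
    print(predicted_rating)   # higher value = viewer likely rates the movie higher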
Credit Approval Problem

Applicant Information:
  age:               23 years
  gender:            female
  annual salary:     NTD 1,000,000
  year in residence: 1 year
  year in job:       0.5 year
  current debt:      200,000

unknown pattern to be learned:
'approve credit card good for bank?'
Formalize the Learning Problem

Basic Notations
• input: x ∈ X (customer application)
• output: y ∈ Y (good/bad after approving credit card)
• unknown pattern to be learned ⇔ target function: f : X → Y (ideal credit approval formula)
• data ⇔ training examples: D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} (historical records in bank)
• hypothesis ⇔ skill with hopefully good performance: g : X → Y ('learned' formula to be used)

{(x_n, y_n)} from f → ML → g
unknown target function f : X → Y (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)

• target f unknown (i.e. no programmable definition)
• hypothesis g hopefully ≈ f, but possibly different from f (perfection 'impossible' when f unknown)

What does g look like?
unknown target function f : X → Y (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

• assume g ∈ H = {h_k}, i.e. approving if
  • h_1: annual salary > NTD 800,000
  • h_2: debt > NTD 100,000 (really?)
  • h_3: year in job ≤ 2 (really?)
• hypothesis set H:
  • can contain good or bad hypotheses
  • up to A to pick the 'best' one as g

learning model = A and H
unknown target function f : X → Y (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

machine learning: use data to compute hypothesis g that approximates target f
Machine Learning and Data Mining

Machine Learning
use data to compute hypothesis g that approximates target f

Data Mining
use (huge) data to find property that is interesting

• if 'interesting property' is the same as 'hypothesis that approximates target': ML = DM (usually what KDDCup does)
• if 'interesting property' is related to 'hypothesis that approximates target': DM can help ML, and vice versa (often, but not always)
• traditional DM also focuses on efficient computation in large databases

difficult to distinguish ML and DM in reality
Machine Learning and Artificial Intelligence

Machine Learning
use data to compute hypothesis g that approximates target f

Artificial Intelligence
compute something that shows intelligent behavior

• g ≈ f is something that shows intelligent behavior: ML can realize AI, among other routes
• e.g. chess playing
  • traditional AI: game tree
  • ML for AI: 'learning from board data'

ML is one possible route to realize AI
Machine Learning and Statistics

Machine Learning
use data to compute hypothesis g that approximates target f

Statistics
use data to make inference about an unknown process

• g is an inference outcome; f is something unknown: statistics can be used to achieve ML
• traditional statistics also focuses on provable results with math assumptions, and cares less about computation

statistics: many useful tools for ML
Perceptron Learning Algorithm
Credit Approval Problem Revisited

Applicant Information:
  age:               23 years
  gender:            female
  annual salary:     NTD 1,000,000
  year in residence: 1 year
  year in job:       0.5 year
  current debt:      200,000

unknown target function f : X → Y (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

what hypothesis set can we use?
• For x = (x_1, x_2, ..., x_d) 'features of customer', compute a weighted 'score', and
    approve credit if  Σ_{i=1}^d w_i x_i > threshold
    deny credit if     Σ_{i=1}^d w_i x_i < threshold
• Y = {+1 (good), −1 (bad)}; the rare sign(·) = 0 case ignored; the linear formulas h ∈ H are

    h(x) = sign( (Σ_{i=1}^d w_i x_i) − threshold )

called the 'perceptron' hypothesis historically
h(x) = sign( (Σ_{i=1}^d w_i x_i) − threshold )
     = sign( (Σ_{i=1}^d w_i x_i) + (−threshold) · (+1) )
       (−threshold plays the role of w_0; the constant +1 is x_0)
     = sign( Σ_{i=0}^d w_i x_i )
     = sign( w^T x )

• each 'tall' w represents a hypothesis h and is multiplied with the 'tall' x; will use tall versions to simplify notation
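To make the notation concrete, a minimal sketch of the perceptron hypothesis in NumPy, assuming the x_0 = 1 padding above; the sample numbers are made up:

    import numpy as np

    def perceptron_h(w, x):
        """h(x) = sign(w^T x); here sign(0) is treated as -1 for definiteness."""
        return 1 if w @ x > 0 else -1

    w = np.array([-0.8, 1.0, 0.5])   # w[0] plays the role of -threshold
    x = np.array([1.0, 0.6, 0.3])    # x[0] = 1, then the d features
    print(perceptron_h(w, x))        # +1: approve credit; -1: deny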
what do perceptrons h 'look like'?
• customer features x: points on the plane (or points in R^d)
• labels y: ◦ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in R^d), positive on one side of a line, negative on the other side
• different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers
Select g from H

H = all possible perceptrons; g = ?

• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally g(x_n) = f(x_n) = y_n
• difficult: H is of infinite size
• idea: start from some g_0, and 'correct' its mistakes on D

will represent g_0 by its weight vector w_0
Perceptron Learning Algorithm (PLA)

For t = 0, 1, ...
1. find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):
     sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
2. (try to) correct the mistake by
     w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
... until no more mistakes
return last w (called w_PLA) as g

[Figure: update intuition. For a mistake with y = +1, w + yx rotates w toward x; for a mistake with y = −1, w + yx rotates w away from x]
That’s it!
—A fault confessed is half redressed.
:-)
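A minimal PLA sketch of the algorithm above, assuming NumPy arrays X of shape N×(d+1) (with the x_0 = 1 column already added) and y with entries ±1; the loop ends only if D is linearly separable:

    import numpy as np

    def pla(X, y):
        """Repeatedly fix the first mistake found, until no mistakes remain."""
        w = np.zeros(X.shape[1])                 # start from w_0 = 0
        while True:
            mistakes = np.sign(X @ w) != y       # sign(0) = 0 also counts as a mistake
            if not mistakes.any():
                return w                         # w_PLA
            n = int(np.argmax(mistakes))         # index of the first mistake
            w = w + y[n] * X[n]                  # w_{t+1} <- w_t + y_n x_n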
Seeing is Believing

[Figure: PLA in action on a two-dimensional dataset. Starting from an initial w, each update picks a currently misclassified example (x_1, x_9, x_14, x_3, ... in the run shown) and rotates w(t) to w(t+1); after 9 updates, the final w_PLA separates all examples]

worked like a charm with < 20 lines!!
(note: made x_0 = 1 for visual purposes)
Types of Learning
Credit Approval Problem Revisited

  age:               23 years
  gender:            female
  annual salary:     NTD 1,000,000
  year in residence: 1 year
  year in job:       0.5 year
  current debt:      200,000

credit? {no (−1), yes (+1)}

unknown target function f : X → Y (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

Y = {−1, +1}: binary classification
More Binary Classification Problems

• credit: approve/disapprove
• email: spam/non-spam
• patient: sick/not sick
• ad: profitable/not profitable
• answer: correct/incorrect (KDDCup 2010)

core and important problem, with many tools serving as building blocks of other tools
[Figure: US coins (1c, 5c, 10c, 25c) plotted by size and mass]

• classify US coins (1c, 5c, 10c, 25c) by (size, mass)
• Y = {1c, 5c, 10c, 25c}, or Y = {1, 2, ..., K} (abstractly)
• binary classification: special case with K = 2

Other Multiclass Classification Problems
• written digits ⇒ 0, 1, ..., 9
• pictures ⇒ apple, orange, strawberry
• emails ⇒ spam, primary, social, promotion, update (Google)

many applications in practice, especially for 'recognition'
• multiclass classification: patient features ⇒ which type of cancer
• regression: patient features ⇒ how many days before recovery
• Y = R, or Y = [lower, upper] ⊂ R (bounded regression): deeply studied in statistics

Other Regression Problems
• company data ⇒ stock price
• climate data ⇒ temperature

also core and important, with many 'statistical' tools serving as building blocks of other tools
Mini Summary

Learning with Different Output Space Y
• binary classification: Y = {−1, +1}
• multiclass classification: Y = {1, 2, ..., K}
• regression: Y = R
• ... and a lot more!!

core tools: binary classification and regression
Supervised: Coin Recognition with y_n

[Figure: coins plotted by size and mass, every point labeled with its denomination (1, 5, 10, 25)]

supervised learning: every x_n comes with corresponding y_n
Unsupervised: Coin Recognition without y_n

[Figure: supervised multiclass classification (labeled coins) vs. unsupervised multiclass classification (unlabeled points), both plotted by size and mass]

unsupervised multiclass classification ⇐⇒ 'clustering'

Other Clustering Problems
• articles ⇒ topics
• consumer profiles ⇒ consumer groups

clustering: a challenging but useful problem
Unsupervised: Learning without y_n

Other Unsupervised Learning Problems
• clustering: {x_n} ⇒ cluster(x) (≈ 'unsupervised multiclass classification'), e.g. articles ⇒ topics
• density estimation: {x_n} ⇒ density(x) (≈ 'unsupervised bounded regression'), e.g. traffic reports with location ⇒ dangerous areas
• outlier detection: {x_n} ⇒ unusual(x) (≈ extreme 'unsupervised binary classification'), e.g. Internet logs ⇒ intrusion alert
• ... and a lot more!!

unsupervised learning: diverse, with possibly very different performance goals
Semi-supervised: Coin Recognition with Some y_n

[Figure: supervised (all points labeled) vs. semi-supervised (a few points labeled) vs. unsupervised (clustering), coins plotted by size and mass]

Other Semi-supervised Learning Problems
• face images with a few labeled ⇒ face identifier (Facebook)
• medicine data with a few labeled ⇒ medicine effect predictor

semi-supervised learning: leverage unlabeled data to avoid 'expensive' labeling
Reinforcement Learning

a 'very different' but natural way of learning

Teach Your Dog: Say 'Sit Down'

The dog pees on the ground. BAD DOG. THAT'S A VERY WRONG ACTION.
• cannot easily show the dog that y_n = sit when x_n = 'sit down'
• but can 'punish' to say ỹ_n = pee is wrong

The dog sits down. Good Dog. Let me give you some cookies.
• still cannot show y_n = sit when x_n = 'sit down'
• but can 'reward' to say ỹ_n = sit is good

Other Reinforcement Learning Problems Using (x, ỹ, goodness)
• (customer, ad choice, ad click earning) ⇒ ad system
• (cards, strategy, winning amount) ⇒ blackjack agent

reinforcement: learn with 'partial/implicit information' (often sequentially)
Mini Summary

Learning with Different Data Label y_n
• supervised: all y_n
• unsupervised: no y_n
• semi-supervised: some y_n
• reinforcement: implicit y_n by goodness(ỹ_n)
• ... and more!!

core tool: supervised learning
Possibility of Learning
A Learning Puzzle

[Figure: six 3×3 black-and-white patterns, three labeled y_n = −1 and three labeled y_n = +1, plus a new pattern asking g(x) = ?]

let's test your 'human learning' with 6 examples :-)

Two Controversial Answers
whatever you say about g(x):

truth f(x) = +1 because ...
• symmetry ⇔ +1
• (black or white count = 3) or (black count = 4 and middle-top black) ⇔ +1

truth f(x) = −1 because ...
• left-top black ⇔ −1
• middle column contains at most 1 black and right-top white ⇔ −1

all valid reasons; your adversarial teacher can always call you 'didn't learn' :-(
Theoretical Foundation of Statistical Learning

if training and testing come from the same distribution, then with high probability,

    E_out(g) ≤ E_in(g) + √( (8/N) ln( 4 (2N)^{d_VC(H)} / δ ) )
    (test error ≤ training error + Ω, the price of using H)

[Figure: error versus VC dimension d_VC: in-sample error decreases with d_VC, model complexity Ω increases, and out-of-sample error is U-shaped with the best d_VC* in the middle]

• d_VC(H): VC dimension of H ≈ # of parameters to describe H
• d_VC ↑: E_in ↓ but Ω ↑
• d_VC ↓: Ω ↓ but E_in ↑
• best d_VC* in the middle

powerful H not always good!
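To make the trade-off concrete, a hedged numeric sketch of the penalty term Ω = √((8/N) ln(4(2N)^{d_VC}/δ)) from the bound above; the values of N, d_VC, and δ below are made up for illustration:

    import numpy as np

    def omega(N, dvc, delta):
        """Model-complexity price Omega from the VC bound."""
        return np.sqrt(8.0 / N * np.log(4.0 * (2.0 * N) ** dvc / delta))

    print(omega(N=1000, dvc=3, delta=0.1))   # grows with dvc, shrinks with N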
unknown target distribution P(y|x) containing f(x) + noise (ideal credit approval formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

(unknown P on X generates x_1, x_2, ..., x_N and future x)

if control complexity of H properly and minimize E_in, learning possible :-)
Linear Regression
Credit Limit Problem

  year in residence: 1 year
  year in job:       0.5 year
  current debt:      200,000

credit limit? 100,000

unknown target function f : X → Y (ideal credit limit formula)
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
        ↓
learning algorithm A  →  final hypothesis g ≈ f ('learned' formula to be used)
        ↑
hypothesis set H (set of candidate formulas)

Y = R: regression
Linear Regression Hypothesis

  age:           23 years
  annual salary: NTD 1,000,000
  year in job:   0.5 year
  current debt:  200,000

• For x = (x_0, x_1, x_2, ..., x_d) 'features of customer', approximate the desired credit limit with a weighted sum:

    y ≈ Σ_{i=0}^d w_i x_i

• linear regression hypothesis: h(x) = w^T x

h(x): like the perceptron, but without the sign
[Figure: for x ∈ R, a line fit to the (x, y) points; for x = (x_1, x_2) ∈ R², a hyperplane fit to the (x_1, x_2, y) points; residuals are the vertical distances to the line/hyperplane]

linear regression: find lines/hyperplanes with small residuals
The Error Measure

popular/historical error measure: squared error err(ŷ, y) = (ŷ − y)²

in-sample:
  E_in(h_w) = (1/N) Σ_{n=1}^N (h(x_n) − y_n)², with h(x_n) = w^T x_n

out-of-sample:
  E_out(w) = E_{(x,y)∼P} (w^T x − y)²
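A minimal sketch of the in-sample squared error, assuming NumPy arrays X (rows x_n^T, with x_0 = 1) and y:

    import numpy as np

    def squared_Ein(w, X, y):
        """E_in(w) = (1/N) * sum_n (w^T x_n - y_n)^2."""
        return np.mean((X @ w - y) ** 2)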
next: how to minimize E_in(w)?
Matrix Form of E_in(w)

E_in(w) = (1/N) Σ_{n=1}^N (w^T x_n − y_n)² = (1/N) Σ_{n=1}^N (x_n^T w − y_n)²

        = (1/N) ‖ [x_1^T w − y_1; x_2^T w − y_2; ...; x_N^T w − y_N] ‖²

        = (1/N) ‖ X w − y ‖²

with X = [x_1^T; x_2^T; ...; x_N^T] of size N×(d+1), w of size (d+1)×1, and y = [y_1; y_2; ...; y_N] of size N×1
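A small numeric check of the matrix identity above on random data (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 5, 2
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1 column
    y = rng.normal(size=N)
    w = rng.normal(size=d + 1)

    sum_form    = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(N)])
    matrix_form = np.sum((X @ w - y) ** 2) / N
    print(np.isclose(sum_form, matrix_form))                    # True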
min_w E_in(w) = (1/N) ‖ X w − y ‖²

[Figure: E_in(w) as a convex, bowl-shaped surface over w]

• E_in(w): continuous, differentiable, convex
• necessary condition of 'best' w:

    ∇E_in(w) ≡ [ ∂E_in(w)/∂w_0, ∂E_in(w)/∂w_1, ..., ∂E_in(w)/∂w_d ]^T = [0, 0, ..., 0]^T

  (not possible to 'roll down' any further)

task: find w_LIN such that ∇E_in(w_LIN) = 0
E_in(w) = (1/N) ‖ X w − y ‖² = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y )
(write A = X^T X, b = X^T y, c = y^T y)

one w only:
  E_in(w) = (1/N)( a w² − 2 b w + c )
  ∇E_in(w) = (1/N)( 2 a w − 2 b )    simple! :-)

vector w:
  E_in(w) = (1/N)( w^T A w − 2 w^T b + c )
  ∇E_in(w) = (1/N)( 2 A w − 2 b )    similar (derived by definition)

so ∇E_in(w) = (2/N) ( X^T X w − X^T y )
task: find w_LIN such that (2/N)(X^T X w − X^T y) = ∇E_in(w) = 0

invertible X^T X:
• easy! unique solution
    w_LIN = (X^T X)^{−1} X^T y = X† y, with pseudo-inverse X† = (X^T X)^{−1} X^T
• often the case, because N ≫ d + 1

singular X^T X:
• many optimal solutions
• one of the solutions: w_LIN = X† y, by defining X† in other ways

practical suggestion: use a well-implemented † routine instead of (X^T X)^{−1} X^T, for numerical stability when X^T X is almost singular
Linear Regression Algorithm

1. construct X (N×(d+1) matrix) and y (N×1 vector) from D:

     X = [x_1^T; x_2^T; ...; x_N^T],   y = [y_1; y_2; ...; y_N]

2. calculate the pseudo-inverse X†   ((d+1)×N)
3. return w_LIN = X† y   ((d+1)×1)

simple and efficient with a good † routine
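A minimal sketch of the whole algorithm in NumPy; per the practical suggestion above, a well-implemented least-squares routine is preferred over forming (X^T X)^{−1} X^T explicitly:

    import numpy as np

    def linear_regression(X, y):
        """Return w_LIN = X† y; lstsq also handles the (almost-)singular case."""
        w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w_lin

    # equivalent in exact arithmetic: np.linalg.pinv(X) @ y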
Is Linear Regression a 'Learning Algorithm'?

w_LIN = X† y

No!
• analytic (closed-form) solution, 'instantaneous'
• not improving E_in nor E_out iteratively

Yes!
• good E_in? yes, optimal!
• good E_out? yes, finite d_VC like perceptrons
• improving iteratively? somewhat, within an iterative pseudo-inverse routine

if E_out(w_LIN) is good, learning 'happened'!
Logistic Regression
Heart Attack Prediction Problem (1/2)

  age:               40 years
  gender:            male
  blood pressure:    130/85
  cholesterol level: 240
  weight:            70

heart disease? yes

unknown target distribution P(y|x) containing f(x) + noise
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N)
        ↓
learning algorithm A  →  final hypothesis g ≈ f
        ↑
hypothesis set H, error measure err

binary classification: ideal f(x) = sign( P(+1|x) − 1/2 ) ∈ {−1, +1}, because of the classification (0/1) err
Heart Attack Prediction Problem (2/2)

  blood pressure:    130/85
  cholesterol level: 240
  weight:            70

heart attack? 80% risk

unknown target distribution P(y|x) containing f(x) + noise
        ↓
training examples D : (x_1, y_1), ..., (x_N, y_N)
        ↓
learning algorithm A  →  final hypothesis g ≈ f
        ↑
hypothesis set H, error measure err

'soft' binary classification: f(x) = P(+1|x) ∈ [0, 1]
Soft Binary Classification

target function f(x) = P(+1|x) ∈ [0, 1]

ideal (noiseless) data:
  (x_1, y'_1 = 0.9 = P(+1|x_1))
  (x_2, y'_2 = 0.2 = P(+1|x_2))
  ...
  (x_N, y'_N = 0.6 = P(+1|x_N))

actual (noisy) data:
  (x_1, y_1 = ◦), with y_1 ∼ P(y|x_1)
  (x_2, y_2 = ×), with y_2 ∼ P(y|x_2)
  ...
  (x_N, y_N = ×), with y_N ∼ P(y|x_N)

(equivalently, encode ◦ as y' = 1 and × as y' = 0: noisy samples of P(+1|x))

same data as hard binary classification, different target function
  age:               40 years
  gender:            male
  blood pressure:    130/85
  cholesterol level: 240

• For x = (x_0, x_1, x_2, ..., x_d) 'features of patient', calculate a weighted 'risk score':

    s = Σ_{i=0}^d w_i x_i

• convert the score to an estimated probability by the logistic function θ(s)

[Figure: θ(s) rises smoothly from 0 to 1, with θ(0) = 1/2]

logistic hypothesis: h(x) = θ(w^T x) = 1 / (1 + exp(−w^T x))
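A minimal sketch of the logistic hypothesis in NumPy, under the padded-x convention used above:

    import numpy as np

    def theta(s):
        """Logistic function: smooth, monotonic, maps any score into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-s))

    def logistic_h(w, x):
        """h(x) = theta(w^T x): the estimated P(+1 | x)."""
        return theta(w @ x)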
(each of the three models computes the linear score s = w^T x from x = (x_0, x_1, ..., x_d))

linear classification:  h(x) = sign(s);  plausible err = 0/1 (small flipping noise)
linear regression:      h(x) = s;        friendly err = squared (easy to minimize)
logistic regression:    h(x) = θ(s);     err = ?

how to define E_in(w) for logistic regression?
target function f(x) = P(+1|x)  ⇔  P(y|x) = f(x) for y = +1, and 1 − f(x) for y = −1

consider D = {(x_1, ◦), (x_2, ×), ..., (x_N, ×)}

probability that f generates D:
  P(x_1) f(x_1) × P(x_2)(1 − f(x_2)) × ... × P(x_N)(1 − f(x_N))

likelihood that h generates D:
  P(x_1) h(x_1) × P(x_2)(1 − h(x_2)) × ... × P(x_N)(1 − h(x_N))

• if h ≈ f, then likelihood(h) ≈ probability using f
• probability using f usually large
Likelihood of Logistic Hypothesis

likelihood(h) ≈ (probability using f) ≈ large
  ⇒ g = argmax_h likelihood(h)

when logistic: h(x) = θ(w^T x), and 1 − h(x) = h(−x) (by the symmetry of θ)

likelihood(h) = P(x_1) h(+x_1) × P(x_2) h(−x_2) × ... × P(x_N) h(−x_N)

likelihood(logistic h) ∝ Π_{n=1}^N h(y_n x_n)
Cross-Entropy Error

max_h likelihood(logistic h) ∝ Π_{n=1}^N h(y_n x_n)

max_w likelihood(w) ∝ Π_{n=1}^N θ(y_n w^T x_n)

max_w ln Π_{n=1}^N θ(y_n w^T x_n)  ⇔  min_w (1/N) Σ_{n=1}^N −ln θ(y_n w^T x_n)

with θ(s) = 1 / (1 + exp(−s)):

min_w (1/N) Σ_{n=1}^N ln( 1 + exp(−y_n w^T x_n) )  =  min_w (1/N) Σ_{n=1}^N err(w, x_n, y_n), i.e. min_w E_in(w)

err(w, x, y) = ln( 1 + exp(−y w^T x) ): cross-entropy error
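A hedged sketch of the resulting in-sample error, assuming NumPy arrays X (rows x_n^T) and y (entries ±1):

    import numpy as np

    def cross_entropy_Ein(w, X, y):
        """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n w^T x_n))."""
        # np.log1p(np.exp(.)) or np.logaddexp(0, .) are more stable variants
        return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))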
Minimizing E_in(w)

min_w E_in(w) = (1/N) Σ_{n=1}^N ln( 1 + exp(−y_n w^T x_n) )

[Figure: E_in(w) as a convex valley over w]

• E_in(w): continuous, differentiable, twice-differentiable, convex
• how to minimize? locate the valley: want ∇E_in(w) = 0

first: derive ∇E_in(w)
The Gradient ∇E_in(w)

E_in(w) = (1/N) Σ_{n=1}^N ln( 1 + exp(−y_n w^T x_n) )

write s_n = −y_n w^T x_n and apply the chain rule, noting that d/ds ln(1 + exp(s)) = exp(s)/(1 + exp(s)) = θ(s):

∂E_in(w)/∂w_i = (1/N) Σ_{n=1}^N ( exp(s_n) / (1 + exp(s_n)) ) · ( ∂s_n/∂w_i )
              = (1/N) Σ_{n=1}^N θ(s_n) · (−y_n x_{n,i})

∇E_in(w) = (1/N) Σ_{n=1}^N θ( −y_n w^T x_n ) (−y_n x_n)
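The same formula in code, a sketch assuming NumPy arrays X and y as before:

    import numpy as np

    def gradient_Ein(w, X, y):
        """grad E_in(w) = (1/N) * sum_n theta(-y_n w^T x_n) * (-y_n x_n)."""
        s = -y * (X @ w)                       # s_n = -y_n w^T x_n
        theta_s = 1.0 / (1.0 + np.exp(-s))     # theta(s_n)
        return X.T @ (theta_s * -y) / len(y)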
want: ∇E_in(w) = (1/N) Σ_{n=1}^N θ( −y_n w^T x_n ) (−y_n x_n) = 0

∇E_in(w) is a scaled, θ-weighted sum of the −y_n x_n:
• all θ(·) = 0: happens only if y_n w^T x_n ≫ 0 for all n, i.e. a linearly separable D
• otherwise, weighted sum = 0: a non-linear equation of w

closed-form solution? no :-(
PLA Revisited: Iterative Optimization

PLA: start from some w_0 (say, 0), and 'correct' its mistakes on D.

For t = 0, 1, ...
1. find a mistake of w_t, called (x_{n(t)}, y_{n(t)}): sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
2. (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}

(equivalently) pick some n, and update w_t by

  w_{t+1} ← w_t + η · v, with η = 1 and v = [[ sign(w_t^T x_n) ≠ y_n ]] · (y_n x_n)

when stop, return last w as g

choice of (η, v) and stopping condition defines an iterative optimization approach
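A sketch of the generic update with PLA's choice of (η, v); the boolean bracket [[·]] becomes a 0/1 indicator:

    import numpy as np

    def pla_step(w, X, y, n, eta=1.0):
        """One iterative-optimization step: w <- w + eta * v."""
        mistake = float(np.sign(w @ X[n]) != y[n])   # [[ sign(w^T x_n) != y_n ]]
        v = mistake * y[n] * X[n]
        return w + eta * v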