Introduction to Machine Learning (Part 1: Statistical Machine Learning)
Shou‐de Lin
CSIE/GINM, NTU
sdlin@csie.ntu.edu.tw
2009/11/30
Syllabus of an Intro‐ML Course (“Machine Learning”, Andrew Ng, Stanford, Autumn 2009)
• Supervised learning. (7 classes)
– Supervised learning setup. LMS.
– Logistic regression. Perceptron. Exponential family.
– Generative learning algorithms. Gaussian discriminant analysis. Naive Bayes.
– Support vector machines.
– Model selection and feature selection.
– Ensemble methods: Bagging, boosting, ECOC.
– Evaluating and debugging learning algorithms.
• Learning theory. (3 classes)
– Bias/variance tradeoff. Union and Chernoff/Hoeffding bounds.
– VC dimension. Worst‐case (online) learning.
– Practical advice on how to use learning algorithms.
• Unsupervised learning. (5 classes)
– Clustering. K‐means. EM. Mixture of Gaussians.
– Factor analysis. PCA. MDS. pPCA.
– Independent components analysis (ICA).
• Reinforcement learning and control. (4 classes)
– MDPs. Bellman equations. Value iteration and policy iteration.
– Linear quadratic regulation (LQR). LQG.
– Q‐learning. Value function approximation.
– Policy search. Reinforce. POMDPs.
HT has done a great job teaching you “Advanced SL” and “Learning Theory”.
Why Teach “Intro to ML”?
• When you reveal that you have taken an ML course, people will more or less expect you to already know certain things, e.g.
– Naïve Bayes.
• There are some ML methods so commonly applied in research and the real world that you will need to know a little bit about them. E.g.
– K‐means clustering
• There are some ML methods that are too unbelievable and amazing to ignore. E.g.
– EM framework.
To Bring You Back to Earth
• Statistical Machine Learning. (2 hours)
– A Bayesian view of ML
– Generative learning models.
– Gaussian discriminant analysis. Naïve Bayes.
• Unsupervised learning. (3 hours)
– Clustering: K‐means.
– EM.
• Reinforcement learning. (0.5 hour)
– Value iteration and policy iteration.
– Q‐learning & SARSA
Theoretical ML vs. Statistical ML
• What you already know: SL takes many (x, t) pairs as input to train a learner f(x), then applies it to an unseen x_k and predicts f(x_k).
• For example (X is 3‐dimensional):
– Training: { ([1,2,3], 0.1), ([2,3,4], 0.2), ([3,4,5], 0.5), … }
– Testing: [2,4,5] → 0.7
• However, uncertainty exists in the real world, so an error distribution (e.g., Gaussian) is usually added: t = f(x) + error. That is, the same input can generate different results, for example:
– Training: { ([1,2,3], 0.1), ([1,2,3], 0.2), ([1,2,3], 0.1), … }
– Testing: [1,2,3] = ?
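As a toy illustration (my own sketch, not from the slides; the function f and the noise level are made up), the same input produces different targets once Gaussian noise is added:

```python
# Minimal sketch of t = f(x) + error with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical underlying function; the slides leave f unspecified.
    return 0.1 * x[0]

x = np.array([1, 2, 3])
beta = 100.0                     # precision: beta = 1 / sigma^2
sigma = beta ** -0.5

# Three observations of the SAME input give three different targets.
print([f(x) + rng.normal(0.0, sigma) for _ in range(3)])
```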
The Probabilistic Form of t
• The output t is a distribution caused by the (assumed Gaussian) error term:
p(t|x,W,β) = N(t | y(x,W), β^{-1}), where β is called the precision parameter and equals the inverse of the variance, 1/σ^2.
The SL Process Under Probability
• Given training data {X, T}, we want to determine the unknown parameters W and β, so that we know the distribution of the output.
• Assuming we observed N data points, the likelihood function is
p(T|X,W,β) = p(t_1|x_1,W,β) × p(t_2|x_2,W,β) × … × p(t_N|x_N,W,β) = ∏_{n=1}^{N} N(t_n | y(x_n,W), β^{-1})
Taking the logarithm gives
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
This is called the log‐likelihood function.
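As a quick numeric sanity check (my own, not part of the lecture), the reconstructed log‐likelihood formula should agree with directly summing Gaussian log‐densities:

```python
# Verify: -beta/2 * sum{(y_n - t_n)^2} + N/2*ln(beta) - N/2*ln(2*pi)
# equals the sum of per-point Gaussian log-densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 50
y = rng.normal(size=N)               # stand-ins for y(x_n, W)
t = y + rng.normal(0.0, 0.5, N)      # observed targets
beta = 1.0 / 0.5**2                  # precision

ll_formula = (-beta / 2 * np.sum((y - t) ** 2)
              + N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi))
ll_direct = norm.logpdf(t, loc=y, scale=beta ** -0.5).sum()
assert np.isclose(ll_formula, ll_direct)
```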
Maximum Likelihood Estimation (MLE)
• Idea: adjust the unknown parameters (i.e., W and β) to maximize the likelihood function or log‐likelihood function:
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
• Adjusting W to maximize this log‐likelihood function, given a Gaussian error, is equivalent to finding a W_ML that minimizes the mean‐square error function.
Maximum Likelihood Estimation for β
• First, we calculate the W_ML that governs the mean of the distribution.
• Then we use W_ML in the likelihood function to determine the optimal β_ML:
∂ ln p(T|X,W,β) / ∂β = −(1/2) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2 + N/(2β) = 0
⇒ 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2
An SL System Using MLE
1. First determine W as the W_ML that minimizes the error function (1/2) Σ_{n=1}^{N} {y(x_n,w) − t_n}^2. (Tends to overfit.)
2. Use W_ML to find β via 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2.
3. Prediction stage: use W_ML and β_ML to construct the distribution of t: p(t|x,W,β) = N(t | y(x,W_ML), β_ML^{-1}).
4. Predict the value for an input x’ by sampling t from the distribution in (3).
• The MLE approach consistently underestimates the variance of the data and can lead to overfitting. A sketch of this pipeline follows.
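Below is a minimal end‐to‐end sketch of these four steps (my own code, assuming a linear model y(x, W) = x·W; the slides leave y(x, W) abstract):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy training data: linear signal plus Gaussian noise.
X = rng.normal(size=(100, 3))
T = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.3, size=100)

# Step 1: W_ML minimizes the sum-of-squares error (ordinary least squares).
W_ml, *_ = np.linalg.lstsq(X, T, rcond=None)

# Step 2: 1/beta_ML = (1/N) * sum_n {y(x_n, W_ML) - t_n}^2
beta_ml = 1.0 / np.mean((X @ W_ml - T) ** 2)

# Steps 3-4: predictive distribution N(t | y(x', W_ML), 1/beta_ML);
# predict by sampling from it.
x_new = np.array([1.0, 2.0, 3.0])
t_pred = rng.normal(x_new @ W_ml, beta_ml ** -0.5)
```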
Bayesian Approach for Regression
• Why a Bayesian approach: some w’s are preferable to others.
– For example, regularization prefers simple models (i.e., small w’s).
– Consequently, p(w) cannot be treated as uniformly distributed.
Bayes’ Rule Review
P(W|T) = P(T|W) × P(W) / P(T)
P(W|X,T) = P(T|X,W) × P(W|X) / P(T|X)
P(W|X,T) ∝ P(T|X,W) × P(W|X)
• P(W|X): prior probability
• P(T|X,W): likelihood probability (what MLE tries to optimize, argmax_W P(T|X,W))
• P(W|X,T): posterior probability
Bayesian Curve Fitting
P(W|X,T) ∝ P(T|X,W) × P(W|X)
• Likelihood probability (which we have already derived):
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
• Prior: assume W is independent of X and is Gaussian with mean 0 and variance 1/α:
p(W|X) = p(W) = (α/2π)^{(M+1)/2} e^{−(α/2) W^T W}
• Then the log‐probability of the posterior is proportional to:
−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2)(ln β − ln(2π)) − (α/2) W^T W + ((M+1)/2)(ln α − ln(2π))
Maximum a Posteriori Estimation (MAP)
−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2)(ln β − ln(2π)) − (α/2) W^T W + ((M+1)/2)(ln α − ln(2π))
• The best parameter set should maximize the posterior probability instead of the likelihood probability.
• The MAP solution for Gaussian noise and a Gaussian prior is to find the W that minimizes
(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (α/2) W^T W
• Maximizing the posterior distribution is equivalent to minimizing the regularized sum‐of‐squares error function with regularization parameter λ = α/β.
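A minimal sketch of the MAP solution (mine, again assuming a linear y(x, W)): setting the gradient of the regularized objective to zero gives the familiar ridge‐regression normal equations.

```python
import numpy as np

def fit_map(X, T, alpha, beta):
    """W_MAP minimizes (beta/2)*sum{(x_n.W - t_n)^2} + (alpha/2)*W.T @ W.
    Zero gradient => (X^T X + (alpha/beta) I) W = X^T T, i.e. ridge
    regression with lambda = alpha / beta."""
    lam = alpha / beta
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
T = X @ rng.normal(size=5) + rng.normal(0.0, 0.1, size=20)
W_map = fit_map(X, T, alpha=1.0, beta=100.0)  # small alpha/beta => weak prior
```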
What We Have Discussed So Far
1. Learning phase (MLE or MAP):
– Finding the W_ML that maximizes the likelihood function p(T|X,W) ⟺ finding the W that minimizes the sum‐of‐squares loss function, or
– Finding the W_MAP that maximizes the posterior function P(W|T,X) ⟺ finding the W that minimizes the regularized sum‐of‐squares loss function.
2. Inference phase:
– When a new x’ comes in, use the determined W to predict the output y’.
Potential Issues
• The problem of MLE: overfitting.
• The problem of MAP: losing information.
[Figure: three plots of the posterior P(W|X,T), each with the single point W_MAP marked — a point estimate discards the shape of the posterior.]
• Since in MAP we have learned P(W|X,T), why not use the law of total probability:
p(t|x,X,T) = ∫ p(t|x,W,β) × p(W|X,T) dW
where p(t|x,W) = N(t | y(x,W), β^{-1})
The Predictive Distribution of t
p(t|x,X,T) = ∫ p(t|x,W) × p(W|X,T) dW, where p(t|x,W) = N(t | y(x,W), β^{-1})
• It can be proved that when the posterior and p(t|x,W) are Gaussian, the predictive distribution p(t|x,X,T) is also Gaussian, with mean m(x) and variance s^2(x). A sketch of the closed form appears below.
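For the linear‐Gaussian case this closed form is standard; here is a sketch (my own, with an identity feature map and α, β assumed known):

```python
import numpy as np

def predictive(x_new, X, T, alpha, beta):
    """Mean m(x) and variance s^2(x) of the Gaussian predictive p(t|x, X, T)."""
    d = X.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)  # posterior covariance of W
    m = beta * x_new @ S @ X.T @ T                         # predictive mean m(x)
    s2 = 1.0 / beta + x_new @ S @ x_new                    # predictive variance s^2(x)
    return m, s2

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
T = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0.0, 0.2, size=30)
m, s2 = predictive(np.array([0.5, 0.5, 0.5]), X, T, alpha=2.0, beta=25.0)
```

Note that s^2(x) contains both the noise variance 1/β and a term reflecting the remaining uncertainty about W, which the MLE pipeline above ignores.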
Example of Predictive Distribution
[Figure: curve‐fitting example. Green: the true function. Red line: mean of the predicted function. Red zone: one standard deviation from the mean.]
[Figure: y(x,w) curves obtained by sampling from the posterior distribution over w.]
The Benefit of Statistical Learning
• It can produce not only the output, but also the distribution of the outputs.
– The distribution tells us more about the data, including how confident the system is about its prediction.
– It can be used to generate the dataset.
We have talked about regression, so how about classification?
Two Classification Strategies
Strategy 1: two‐stage methods
Classification can be broken down into two stages:
– Inference stage: for each C_k, use its own training data to learn a model for p(C_k|X).
– Decision stage: use p(C_k|X) and the loss matrix to make the optimal class assignment.
Strategy 2: one‐shot methods (or discriminant models)
Use all training data to learn a function that directly maps inputs x into the output class.
Two Models for Strategy 1 (1/2)
• Model 1: Generative Model
– First solve the inference problem of determining p(x|C_k) for each class C_k individually.
– Separately infer the prior class probabilities p(C_k).
– Use Bayes’ theorem to find the posterior class probabilities:
p(C_k|x) = p(x|C_k) p(C_k) / p(x)
– Note that the denominator can be generated as p(x) = Σ_k p(x|C_k) p(C_k).
– Finally, use p(C_k|x) and decision theory to find the best class assignment (a tiny sketch follows).
• This is called a generative model, since we can learn p(x) and p(C_k, x).
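A tiny sketch of that pipeline (mine; the class names and numbers are placeholders):

```python
def posterior(likelihoods, priors):
    """Turn p(x|C_k) and p(C_k) into p(C_k|x) via Bayes' theorem."""
    joint = {k: likelihoods[k] * priors[k] for k in priors}  # p(x|C_k) * p(C_k)
    p_x = sum(joint.values())                                # p(x) = sum_k p(x|C_k) p(C_k)
    return {k: v / p_x for k, v in joint.items()}

print(posterior(likelihoods={"spam": 0.02, "ham": 0.001},
                priors={"spam": 0.3, "ham": 0.7}))
```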
Two Models for Strategy 1 (2/2)
• Model 2: Discriminative Model
– Directly learn p(C_k|x) from data (knowing nothing about p(x|C_k) or p(x)).
– Logistic regression is a typical example.
Classification Models
• Generative model: learn P(C_k|X) using Bayes’ rule.
– First solve the inference problem of determining p(x|C_k) and p(C_k) for each class C_k individually.
– Use Bayes’ rule to find the posterior class probabilities p(C_k|x).
• Discriminative model: learn P(C_k|X) directly from data.
– Then apply decision theory to decide which C_k is the best assignment for x (e.g., logistic regression).
• Discriminant model: learn a function that directly maps inputs x into the output class.
– Linear discriminant functions: learn linear functions to separate the classes.
• Least squares
• Fisher’s linear discriminant
• Perceptron algorithm
Generative vs. Discriminative Model
• Generative model
– Pros: P(x) can be used to generate samples of inputs, which is useful for knowledge discovery and data mining (e.g., outlier detection and novelty detection).
– Cons: very demanding, since it has to find the joint distribution of C_k and x. Needs a lot of training data.
• Discriminative model
– Pros: can be learned with less data.
– Cons: cannot learn the detailed structure of the data.
Generative vs. Discriminant Model (1/3)
• A discriminant approach learns a discriminant function and uses it for decision making. It does not learn P(C_k|x).
• However, P(C_k|x) is useful in many respects:
1. It can be combined with the cost function to produce the final decision. If the cost function changes, we don’t need to re‐train the whole model as a discriminant model would.
2. It can be used to determine the reject region (a tiny sketch follows), e.g.:
• P(C_HT|x) = 0.1, P(C_PJ|x) = 0.05
• P(C_HT|x) = 0.7, P(C_PJ|x) = 0.8
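A toy sketch of a reject rule (mine; the threshold value is arbitrary): abstain whenever even the best class is not confident enough.

```python
def decide(posteriors, threshold=0.6):
    """Return the argmax class, or 'reject' if its posterior is below threshold."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else "reject"

print(decide({"HT": 0.7, "PJ": 0.3}))    # confident -> "HT"
print(decide({"HT": 0.55, "PJ": 0.45}))  # too uncertain -> "reject"
```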
Generative vs. Discriminant Model (2/3)
• A generative model takes care of the class prior P(y) explicitly.
– E.g., in cancer prediction, only a small fraction of the data (e.g., 0.1%) are positive.
– A normal classifier will guess negative and receive 99.9% accuracy.
– Using P(C_k|x) and P(C_k) allows us to separate the inference from the prior during learning.
Generative vs. Discriminant Model (3/3)
• Generative models are better in terms of combining several models:
– Assume in the previous example, we have two types of information for each photo:
• The image features (Xi)
• The social information (Xs)
• It might be more effective and meaningful to build separate models P(C_k|X_i), P(C_k|X_s) for these two sets of features.
• The generative approach allows us to combine these models as (a tiny sketch follows):
p(C_k|x_i, x_s) ∝ P(x_i, x_s|C_k) P(C_k) ∝ P(x_i|C_k) P(x_s|C_k) P(C_k) ← naïve Bayes assumption
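A minimal sketch of that combination (mine; the inputs are per‐model likelihoods, not from the slides):

```python
def combined_posterior(p_xi, p_xs, prior):
    """Combine two separately trained models under the naive Bayes assumption:
    p(C_k|x_i, x_s) proportional to P(x_i|C_k) * P(x_s|C_k) * P(C_k)."""
    score = {c: p_xi[c] * p_xs[c] * prior[c] for c in prior}
    z = sum(score.values())
    return {c: s / z for c, s in score.items()}
```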
Naïve Bayes Assumption
• Recall that in the Bayesian setup, we have p(C_k|x) = p(x|C_k) p(C_k) / p(x)
• If we assume the features of an instance are independent given the class (conditionally independent):
P(X|C) = P(X_1, X_2, …, X_n|C) = ∏_{i=1}^{n} P(X_i|C)
• Therefore, we only need to know P(X_i|C) for each possible feature‐value/class pair.
• If C and all X_i are binary, this requires specifying only 2n parameters:
– P(X_i=true | C=true) and P(X_i=true | C=false) for each X_i
– P(X_i=false | C) = 1 − P(X_i=true | C)
• Compare this to specifying 2^n parameters without any independence assumptions. A minimal sketch follows.
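A minimal Bernoulli naïve Bayes sketch (my own code) that estimates exactly these 2n parameters plus the class prior, with Laplace smoothing added to avoid zero probabilities:

```python
import numpy as np

def fit_nb(X, y, smoothing=1.0):
    """X: (N, n) binary features; y: (N,) binary labels.
    Returns class priors and P(X_i = 1 | C) per class."""
    priors, cond = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = (Xc.sum(axis=0) + smoothing) / (len(Xc) + 2 * smoothing)
    return priors, cond

def predict(x, priors, cond):
    """argmax_c log P(C=c) + sum_i log P(X_i = x_i | C=c)."""
    scores = {c: np.log(priors[c])
                 + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
              for c in (0, 1)}
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0])
print(predict(np.array([1, 0, 0]), *fit_nb(X, y)))
```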
Gaussian Discriminant Analysis (GDA)
• This is another generative model.
• GDA assumes p(x|y) is distributed according to a multivariate normal distribution (MND).
• An MND in n dimensions is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^{n×n}, also written as N(μ, Σ). Its density is:
p(x; μ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) × exp(−(1/2) (x−μ)^T Σ^{-1} (x−μ))
Examples of 2‐D Multivariate Normal Distributions
[Figure: density plots for Σ = I, Σ = 0.6I, and Σ = 2I — a larger covariance spreads the density over a wider area.]
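These densities are easy to evaluate numerically; a small sketch (mine, using SciPy) for the three covariances in the figure:

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -0.5])
for scale in (1.0, 0.6, 2.0):
    mnd = multivariate_normal(mean=np.zeros(2), cov=scale * np.eye(2))
    print(f"Sigma = {scale}*I: p(x) = {mnd.pdf(x):.4f}")  # larger Sigma -> flatter density
```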
The Model for GDA (1/2)
• p(x|y) is MND; p(y=0) = Φ, p(y=1) = 1−Φ (assuming the different values of y share the same Σ).
• The log‐likelihood of the data is:
ℓ(Φ, μ_0, μ_1, Σ) = ln ∏_{i=1}^{m} p(x^{(i)}|y^{(i)}; μ_0, μ_1, Σ) p(y^{(i)}; Φ)
The Model for GDA (2/2)
• Using maximum likelihood estimation (MLE), we can obtain closed‐form estimates: Φ as the fraction of y=0 examples, μ_0 and μ_1 as the per‐class sample means, and Σ as the pooled sample covariance. A sketch follows.
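A minimal fitting sketch (mine, following these standard closed‐form estimates and the slide’s p(y=0) = Φ convention):

```python
import numpy as np

def fit_gda(X, y):
    """X: (m, n) inputs; y: (m,) labels in {0, 1}. Shared covariance."""
    m = len(y)
    phi = np.mean(y == 0)                  # slide convention: p(y=0) = phi
    mu0 = X[y == 0].mean(axis=0)           # class-0 sample mean
    mu1 = X[y == 1].mean(axis=0)           # class-1 sample mean
    centered = X - np.where((y == 0)[:, None], mu0, mu1)
    Sigma = centered.T @ centered / m      # pooled sample covariance
    return phi, mu0, mu1, Sigma
```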
Discussion: GDA vs. Logistic Regression
• In GDA, p(y|x) is of the form 1/(1+exp(−θ^T x)), where θ is a function of Φ, Σ, μ.
– This is exactly the form logistic regression uses to model p(y|x). That is, if p(x|y) is multivariate Gaussian, then p(y|x) follows a logistic function.
– However, the converse is not true. This implies that GDA makes stronger modeling assumptions about the data than LR does.
• Trained on the same dataset, these two algorithms will produce different decision boundaries.
– If p(x|y) is indeed Gaussian, then GDA will get better results. That is, if x is some sort of mean value of something whose size is not small, then by the central limit theorem, GDA should perform very well.
– If p(x|y=1) and p(x|y=0) are both Poisson, then P(y|x) will be logistic. In this case, LR can work better than GDA.
– If we are sure the data is non‐Gaussian, we should use LR rather than GDA.