Introduction to Machine Learning (Part 1: Statistical Machine Learning)
Shou‐de Lin
CSIE/GINM, NTU
sdlin@csie.ntu.edu.tw
2009/11/30
Syllabus of an Intro‐ML Course (“Machine Learning”, Andrew Ng, Stanford, Autumn 2009)
• Supervised learning. (7 classes)
– Supervised learning setup. LMS.
– Logistic regression. Perceptron. Exponential family.
– Generative learning algorithms. Gaussian discriminant analysis. Naive Bayes.
– Support vector machines.
– Model selection and feature selection.
– Ensemble methods: Bagging, boosting, ECOC.
– Evaluating and debugging learning algorithms.
• Learning theory. (3 classes)
– Bias/variance tradeoff. Union and Chernoff/Hoeffding bounds.
– VC dimension. Worst‐case (online) learning.
– Practical advice on how to use learning algorithms.
• Unsupervised learning. (5 classes)
– Clustering. K‐means. EM. Mixture of Gaussians.
– Factor analysis. PCA. MDS. pPCA.
– Independent components analysis (ICA).
• Reinforcement learning and control. (4 classes)
– MDPs. Bellman equations. Value iteration and policy iteration.
– Linear quadratic regulation (LQR). LQG.
– Q‐learning. Value function approximation.
– Policy search. Reinforce. POMDPs.
HT has done a great job teaching you “Advanced SL” and “Learning Theory”.
Why Teach “Intro to ML”?
• When you reveal that you have taken an ML course, people will more or less expect you to already know certain things, e.g.
– Naïve Bayes.
• There are some ML methods so commonly applied in research and the real world that you will need to know a little bit about them. E.g.
– K‐means clustering
• There are some ML methods that are too unbelievable and amazing to ignore. E.g.
– EM framework.
To Bring You Back to Earth
• Statistical Machine Learning. (2 hours)
– A Bayesian view of ML
– Generative learning models.
– Gaussian discriminant analysis. Naïve Bayes.
• Unsupervised learning. (3 hours)
– Clustering: K‐means.
– EM.
• Reinforcement learning. (0.5 hour)
– Value iteration and policy iteration.
– Q‐learning & SARSA
Theoretical ML vs. Statistical ML
• What you already know: SL takes many (x, t) pairs as input to train a learner f(x), then applies it to an unseen x_k and predicts f(x_k).
• For example (X is 3‐dimensional):
– Training: { ([1,2,3], 0.1), ([2,3,4], 0.2), ([3,4,5], 0.5), … }
– Testing: [2,4,5] → 0.7
• However, uncertainty exists in the real world, so an error distribution (e.g., Gaussian) is usually added: t = f(x) + error. That is, the same input can generate different results, for example:
– Training: { ([1,2,3], 0.1), ([1,2,3], 0.2), ([1,2,3], 0.1), … }
– Testing: [1,2,3] = ?
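As a toy illustration (my own sketch, not from the slides; the function f and the noise level are made up), the same input produces different targets once Gaussian noise is added:

```python
# Minimal sketch of t = f(x) + error with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical underlying function; the slides leave f unspecified.
    return 0.1 * x[0]

x = np.array([1, 2, 3])
beta = 100.0                     # precision: beta = 1 / sigma^2
sigma = beta ** -0.5

# Three observations of the SAME input give three different targets.
print([f(x) + rng.normal(0.0, sigma) for _ in range(3)])
```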
The Probabilistic Form of t
• The output t is a distribution caused by the (assumed Gaussian) error term:
p(t|x,W,β) = N(t | y(x,W), β^{-1}), where β is called the precision parameter and equals the inverse of the variance, 1/σ^2.
The SL Process Under Probability
• Given training data {X, T}, we want to determine the unknown parameters W and β, so that we know the distribution of the output.
• Assuming we observed N data points, the likelihood function is
p(T|X,W,β) = p(t_1|x_1,W,β) × p(t_2|x_2,W,β) × … × p(t_N|x_N,W,β) = ∏_{n=1}^{N} N(t_n | y(x_n,W), β^{-1})
Taking the logarithm gives
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
This is called the log‐likelihood function.
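As a quick numeric sanity check (my own, not part of the lecture), the reconstructed log‐likelihood formula should agree with directly summing Gaussian log‐densities:

```python
# Verify: -beta/2 * sum{(y_n - t_n)^2} + N/2*ln(beta) - N/2*ln(2*pi)
# equals the sum of per-point Gaussian log-densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 50
y = rng.normal(size=N)               # stand-ins for y(x_n, W)
t = y + rng.normal(0.0, 0.5, N)      # observed targets
beta = 1.0 / 0.5**2                  # precision

ll_formula = (-beta / 2 * np.sum((y - t) ** 2)
              + N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi))
ll_direct = norm.logpdf(t, loc=y, scale=beta ** -0.5).sum()
assert np.isclose(ll_formula, ll_direct)
```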
Maximum Likelihood Estimation (MLE)
• Idea: adjust the unknown parameters (i.e., W and β) to maximize the likelihood function or log‐likelihood function:
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
• Adjusting W to maximize this log‐likelihood function, given a Gaussian error, is equivalent to finding a W_ML that minimizes the mean‐square error function.
Maximum Likelihood Estimation for β
• First, we calculate the W_ML that governs the mean of the distribution.
• Then we use W_ML in the likelihood function to determine the optimal β_ML:
∂ ln p(T|X,W,β) / ∂β = −(1/2) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2 + N/(2β) = 0
⇒ 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2
An SL System Using MLE
1. First determine W as the W_ML that minimizes the error function (1/2) Σ_{n=1}^{N} {y(x_n,w) − t_n}^2. (Tends to overfit.)
2. Use W_ML to find β via 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}^2.
3. Prediction stage: use W_ML and β_ML to construct the distribution of t: p(t|x,W,β) = N(t | y(x,W_ML), β_ML^{-1}).
4. Predict the value for an input x’ by sampling t from the distribution in (3).
• The MLE approach consistently underestimates the variance of the data and can lead to overfitting. A sketch of this pipeline follows.
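Below is a minimal end‐to‐end sketch of these four steps (my own code, assuming a linear model y(x, W) = x·W; the slides leave y(x, W) abstract):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy training data: linear signal plus Gaussian noise.
X = rng.normal(size=(100, 3))
T = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, 0.3, size=100)

# Step 1: W_ML minimizes the sum-of-squares error (ordinary least squares).
W_ml, *_ = np.linalg.lstsq(X, T, rcond=None)

# Step 2: 1/beta_ML = (1/N) * sum_n {y(x_n, W_ML) - t_n}^2
beta_ml = 1.0 / np.mean((X @ W_ml - T) ** 2)

# Steps 3-4: predictive distribution N(t | y(x', W_ML), 1/beta_ML);
# predict by sampling from it.
x_new = np.array([1.0, 2.0, 3.0])
t_pred = rng.normal(x_new @ W_ml, beta_ml ** -0.5)
```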
Bayesian Approach for Regression
• Why a Bayesian approach: some w’s are preferable to others.
– For example, regularization prefers simple models (i.e., small w’s).
– Consequently, p(w) cannot be treated as uniformly distributed.
Bayes’ Rule Review
P(W|T) = P(T|W) × P(W) / P(T)
P(W|X,T) = P(T|X,W) × P(W|X) / P(T|X)
P(W|X,T) ∝ P(T|X,W) × P(W|X)
• P(W|X): prior probability
• P(T|X,W): likelihood probability (what MLE tries to optimize, argmax_W P(T|X,W))
• P(W|X,T): posterior probability
Bayesian Curve Fitting
P(W|X,T) ∝ P(T|X,W) × P(W|X)
• Likelihood probability (which we have already derived):
ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2) ln β − (N/2) ln(2π)
• Prior: assume W is independent of X and is Gaussian with mean 0 and variance 1/α:
p(W|X) = p(W) = (α/2π)^{(M+1)/2} e^{−(α/2) W^T W}
• Then the log‐probability of the posterior is proportional to:
−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2)(ln β − ln(2π)) − (α/2) W^T W + ((M+1)/2)(ln α − ln(2π))
Maximum a Posteriori Estimation (MAP)
−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (N/2)(ln β − ln(2π)) − (α/2) W^T W + ((M+1)/2)(ln α − ln(2π))
• The best parameter set should maximize the posterior probability instead of the likelihood probability.
• The MAP solution for Gaussian noise and a Gaussian prior is to find the W that minimizes
(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}^2 + (α/2) W^T W
• Maximizing the posterior distribution is equivalent to minimizing the regularized sum‐of‐squares error function with regularization parameter λ = α/β.
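A minimal sketch of the MAP solution (mine, again assuming a linear y(x, W)): setting the gradient of the regularized objective to zero gives the familiar ridge‐regression normal equations.

```python
import numpy as np

def fit_map(X, T, alpha, beta):
    """W_MAP minimizes (beta/2)*sum{(x_n.W - t_n)^2} + (alpha/2)*W.T @ W.
    Zero gradient => (X^T X + (alpha/beta) I) W = X^T T, i.e. ridge
    regression with lambda = alpha / beta."""
    lam = alpha / beta
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
T = X @ rng.normal(size=5) + rng.normal(0.0, 0.1, size=20)
W_map = fit_map(X, T, alpha=1.0, beta=100.0)  # small alpha/beta => weak prior
```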
What We Have Discussed So Far
1. Learning phase (MLE or MAP):
– Finding the W_ML that maximizes the likelihood function p(T|X,W) ⟺ finding the W that minimizes the sum‐of‐squares loss function, or
– Finding the W_MAP that maximizes the posterior function P(W|T,X) ⟺ finding the W that minimizes the regularized sum‐of‐squares loss function.
2. Inference phase:
– When a new x’ comes in, use the determined W to predict the output y’.
Potential Issues
• The problem of MLE: overfitting.
• The problem of MAP: losing information.
[Figure: three plots of the posterior P(W|X,T), each with the single point W_MAP marked — a point estimate discards the shape of the posterior.]
• Since in MAP we have learned P(W|X,T), why not use the law of total probability:
p(t|x,X,T) = ∫ p(t|x,W,β) × p(W|X,T) dW
where p(t|x,W) = N(t | y(x,W), β^{-1})
The Predictive Distribution of t
p(t|x,X,T) = ∫ p(t|x,W) × p(W|X,T) dW, where p(t|x,W) = N(t | y(x,W), β^{-1})
• It can be proved that when the posterior and p(t|x,W) are Gaussian, the predictive distribution p(t|x,X,T) is also Gaussian, with mean m(x) and variance s^2(x). A sketch of the closed form appears below.
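For the linear‐Gaussian case this closed form is standard; here is a sketch (my own, with an identity feature map and α, β assumed known):

```python
import numpy as np

def predictive(x_new, X, T, alpha, beta):
    """Mean m(x) and variance s^2(x) of the Gaussian predictive p(t|x, X, T)."""
    d = X.shape[1]
    S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)  # posterior covariance of W
    m = beta * x_new @ S @ X.T @ T                         # predictive mean m(x)
    s2 = 1.0 / beta + x_new @ S @ x_new                    # predictive variance s^2(x)
    return m, s2

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
T = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0.0, 0.2, size=30)
m, s2 = predictive(np.array([0.5, 0.5, 0.5]), X, T, alpha=2.0, beta=25.0)
```

Note that s^2(x) contains both the noise variance 1/β and a term reflecting the remaining uncertainty about W, which the MLE pipeline above ignores.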
Example of Predictive Distribution
[Figure: curve‐fitting example. Green: the true function. Red line: mean of the predicted function. Red zone: one standard deviation from the mean.]
[Figure: y(x,w) curves obtained by sampling from the posterior distribution over w.]
The Benefit of Statistical Learning
• It can produce not only the output, but also the distribution of the outputs.
– The distribution tells us more about the data, including how confident the system is about its prediction.
– It can be used to generate the dataset.
We have talked about regression, so how about classification?
Two Classification Strategies
Strategy 1: two‐stage methods
Classification can be broken down into two stages:
– Inference stage: for each C_k, use its own training data to learn a model for p(C_k|X).
– Decision stage: use p(C_k|X) and the loss matrix to make the optimal class assignment.
Strategy 2: one‐shot methods (or discriminant models)
Use all training data to learn a function that directly maps inputs x into the output class.
Two Models for Strategy 1 (1/2)
• Model 1: Generative Model
– First solve the inference problem of determining p(x|C_k) for each class C_k individually.
– Separately infer the prior class probabilities p(C_k).
– Use Bayes’ theorem to find the posterior class probabilities:
p(C_k|x) = p(x|C_k) p(C_k) / p(x)
– Note that the denominator can be generated as p(x) = Σ_k p(x|C_k) p(C_k).
– Finally, use p(C_k|x) and decision theory to find the best class assignment (a tiny sketch follows).
• This is called a generative model, since we can learn p(x) and p(C_k, x).
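A tiny sketch of that pipeline (mine; the class names and numbers are placeholders):

```python
def posterior(likelihoods, priors):
    """Turn p(x|C_k) and p(C_k) into p(C_k|x) via Bayes' theorem."""
    joint = {k: likelihoods[k] * priors[k] for k in priors}  # p(x|C_k) * p(C_k)
    p_x = sum(joint.values())                                # p(x) = sum_k p(x|C_k) p(C_k)
    return {k: v / p_x for k, v in joint.items()}

print(posterior(likelihoods={"spam": 0.02, "ham": 0.001},
                priors={"spam": 0.3, "ham": 0.7}))
```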
Two Models for Strategy 1 (2/2)
• Model 2: Discriminative Model
– Directly learn p(C_k|x) from data (knowing nothing about p(x|C_k) or p(x)).
– Logistic regression is a typical example.
Classification Models
• Generative model: learn P(C_k|X) using Bayes’ rule.
– First solve the inference problem of determining p(x|C_k) and p(C_k) for each class C_k individually.
– Use Bayes’ rule to find the posterior class probabilities p(C_k|x).
• Discriminative model: learn P(C_k|X) directly from data.
– Then apply decision theory to decide which C_k is the best assignment for x (e.g., logistic regression).
• Discriminant model: learn a function that directly maps inputs x into the output class.
– Linear discriminant functions: learn linear functions to separate the classes.
• Least squares
• Fisher’s linear discriminant
• Perceptron algorithm
Generative vs. Discriminative Model
• Generative model
– Pros: P(x) can be used to generate samples of inputs, which is useful for knowledge discovery and data mining (e.g., outlier detection and novelty detection).
– Cons: very demanding, since it has to find the joint distribution of C_k and x. Needs a lot of training data.
• Discriminative model
– Pros: can be learned with less data.
– Cons: cannot learn the detailed structure of the data.
Generative vs. Discriminant Model (1/3)
• A discriminant approach learns a discriminant function and uses it for decision making. It does not learn P(C_k|x).
• However, P(C_k|x) is useful in many respects:
1. It can be combined with the cost function to produce the final decision. If the cost function changes, we don’t need to re‐train the whole model as a discriminant model would.
2. It can be used to determine the reject region (a tiny sketch follows), e.g.:
• P(C_HT|x) = 0.1, P(C_PJ|x) = 0.05
• P(C_HT|x) = 0.7, P(C_PJ|x) = 0.8
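A toy sketch of a reject rule (mine; the threshold value is arbitrary): abstain whenever even the best class is not confident enough.

```python
def decide(posteriors, threshold=0.6):
    """Return the argmax class, or 'reject' if its posterior is below threshold."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= threshold else "reject"

print(decide({"HT": 0.7, "PJ": 0.3}))    # confident -> "HT"
print(decide({"HT": 0.55, "PJ": 0.45}))  # too uncertain -> "reject"
```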
Generative vs. Discriminant Model (2/3)
• A generative model takes care of the class prior P(y) explicitly.
– E.g., in cancer prediction, only a small fraction of the data (e.g., 0.1%) are positive.
– A normal classifier will guess negative and receive 99.9% accuracy.
– Using P(C_k|x) and P(C_k) allows us to separate the inference from the prior during learning.
Generative vs. Discriminant Model (3/3)
• Generative models are better in terms of combining several models:
– Assume in the previous example, we have two types of information for each photo:
• The image features (Xi)
• The social information (Xs)
• It might be more effective and meaningful to build separate models P(C_k|X_i), P(C_k|X_s) for these two sets of features.
• The generative approach allows us to combine these models as (a tiny sketch follows):
p(C_k|x_i, x_s) ∝ P(x_i, x_s|C_k) P(C_k) ∝ P(x_i|C_k) P(x_s|C_k) P(C_k) ← naïve Bayes assumption
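A minimal sketch of that combination (mine; the inputs are per‐model likelihoods, not from the slides):

```python
def combined_posterior(p_xi, p_xs, prior):
    """Combine two separately trained models under the naive Bayes assumption:
    p(C_k|x_i, x_s) proportional to P(x_i|C_k) * P(x_s|C_k) * P(C_k)."""
    score = {c: p_xi[c] * p_xs[c] * prior[c] for c in prior}
    z = sum(score.values())
    return {c: s / z for c, s in score.items()}
```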
Naïve Bayes Assumption
• Recall that in the Bayesian setup, we have p(C_k|x) = p(x|C_k) p(C_k) / p(x)
• If we assume the features of an instance are independent given the class (conditionally independent):
P(X|C) = P(X_1, X_2, …, X_n|C) = ∏_{i=1}^{n} P(X_i|C)
• Therefore, we only need to know P(X_i|C) for each possible feature‐value/class pair.
• If C and all X_i are binary, this requires specifying only 2n parameters:
– P(X_i=true | C=true) and P(X_i=true | C=false) for each X_i
– P(X_i=false | C) = 1 − P(X_i=true | C)
• Compare this to specifying 2^n parameters without any independence assumptions. A minimal sketch follows.
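A minimal Bernoulli naïve Bayes sketch (my own code) that estimates exactly these 2n parameters plus the class prior, with Laplace smoothing added to avoid zero probabilities:

```python
import numpy as np

def fit_nb(X, y, smoothing=1.0):
    """X: (N, n) binary features; y: (N,) binary labels.
    Returns class priors and P(X_i = 1 | C) per class."""
    priors, cond = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = (Xc.sum(axis=0) + smoothing) / (len(Xc) + 2 * smoothing)
    return priors, cond

def predict(x, priors, cond):
    """argmax_c log P(C=c) + sum_i log P(X_i = x_i | C=c)."""
    scores = {c: np.log(priors[c])
                 + np.sum(x * np.log(cond[c]) + (1 - x) * np.log(1 - cond[c]))
              for c in (0, 1)}
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0])
print(predict(np.array([1, 0, 0]), *fit_nb(X, y)))
```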
Gaussian Discriminant Analysis (GDA)
• This is another generative model.
• GDA assumes p(x|y) is distributed according to a multivariate normal distribution (MND).
• An MND in n dimensions is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^{n×n}, also written as N(μ, Σ). Its density is:
p(x; μ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) × exp(−(1/2) (x−μ)^T Σ^{-1} (x−μ))
Examples of 2‐D Multivariate Normal Distributions
[Figure: density plots for Σ = I, Σ = 0.6I, and Σ = 2I — a larger covariance spreads the density over a wider area.]
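These densities are easy to evaluate numerically; a small sketch (mine, using SciPy) for the three covariances in the figure:

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.5, -0.5])
for scale in (1.0, 0.6, 2.0):
    mnd = multivariate_normal(mean=np.zeros(2), cov=scale * np.eye(2))
    print(f"Sigma = {scale}*I: p(x) = {mnd.pdf(x):.4f}")  # larger Sigma -> flatter density
```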
The Model for GDA (1/2)
• p(x|y) is MND; p(y=0) = Φ, p(y=1) = 1−Φ (assuming the different values of y share the same Σ).
• The log‐likelihood of the data is:
ℓ(Φ, μ_0, μ_1, Σ) = ln ∏_{i=1}^{m} p(x^{(i)}|y^{(i)}; μ_0, μ_1, Σ) p(y^{(i)}; Φ)
The Model for GDA (2/2)
• Using maximum likelihood estimation (MLE), we can obtain closed‐form estimates: Φ as the fraction of y=0 examples, μ_0 and μ_1 as the per‐class sample means, and Σ as the pooled sample covariance. A sketch follows.
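A minimal fitting sketch (mine, following these standard closed‐form estimates and the slide’s p(y=0) = Φ convention):

```python
import numpy as np

def fit_gda(X, y):
    """X: (m, n) inputs; y: (m,) labels in {0, 1}. Shared covariance."""
    m = len(y)
    phi = np.mean(y == 0)                  # slide convention: p(y=0) = phi
    mu0 = X[y == 0].mean(axis=0)           # class-0 sample mean
    mu1 = X[y == 1].mean(axis=0)           # class-1 sample mean
    centered = X - np.where((y == 0)[:, None], mu0, mu1)
    Sigma = centered.T @ centered / m      # pooled sample covariance
    return phi, mu0, mu1, Sigma
```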
Discussion: GDA vs. Logistic Regression
• In GDA, p(y|x) is of the form 1/(1+exp(−θ^T x)), where θ is a function of Φ, Σ, μ.
– This is exactly the form logistic regression uses to model p(y|x). That is, if p(x|y) is multivariate Gaussian, then p(y|x) follows a logistic function.
– However, the converse is not true. This implies that GDA makes stronger modeling assumptions about the data than LR does.
• Trained on the same dataset, these two algorithms will produce different decision boundaries.
– If p(x|y) is indeed Gaussian, then GDA will get better results. That is, if x is some sort of mean value of something whose size is not small, then by the central limit theorem, GDA should perform very well.
– If p(x|y=1) and p(x|y=0) are both Poisson, then P(y|x) will be logistic. In this case, LR can work better than GDA.
– If we are sure the data is non‐Gaussian, we should use LR rather than GDA.