(1)

Introduction to Machine Learning (Part 1: Statistical Machine Learning)

Shou‐de Lin

CSIE/GINM, NTU

sdlin@csie.ntu.edu.tw

2009/11/30 1

(2)

Syllabus of an Intro‐ML course (“Machine Learning”, Andrew Ng, Stanford, Autumn 2009)

Supervised learning. (7 classes)

Supervised learning setup. LMS.

Logistic regression. Perceptron. Exponential family. 

Generative learning algorithms. Gaussian discriminant analysis. Naive Bayes.

Support vector machines. 

Model selection and feature selection. 

Ensemble methods: Bagging, boosting, ECOC. 

Evaluating and debugging learning algorithms. 

Learning theory. (3 classes) 

Bias/variance tradeoff. Union and Chernoff/Hoeffding bounds.

VC dimension. Worst case (online) learning.

Practical advice on how to use learning algorithms. 

Unsupervised learning. (5 classes) 

Clustering. K‐means.  EM. Mixture of Gaussians. 

Factor analysis. PCA. MDS. pPCA. 

Independent components analysis (ICA). 

Reinforcement learning and control. (4 classes) 

MDPs. Bellman equations. Value iteration and policy iteration.

Linear quadratic regulation (LQR). LQG. 

Q‐learning. Value function approximation. 

Policy search. Reinforce. POMDPs. 

HT has done a great job teaching you “Advanced SL” and “Learning Theory”.

(3)

Why teaching “Intro to ML”?

When revealing that you have taken an ML course, people would more or less expect you to have already known something. E.g.

Naïve Bayes.

There are some ML methods that are so commonly applied in research and the real world that you will need to know a little bit about them. E.g.

K‐means clustering

There are some ML methods that are too unbelievable and amazing to ignore. E.g.

EM framework.

2009/11/30 3

(4)

To Bring you Back to the Earth

Statistical Machine Learning. (2 hours)

A Bayesian view about ML. Generative learning model.

Gaussian discriminant analysis. Naïve Bayes

Unsupervised learning. (3 hours)

Clustering: K‐means. 

EM. 

Reinforcement learning. (0.5 hour)

Value iteration and policy iteration. 

Q‐learning & SARSA

(5)

Theoretical ML vs. Statistical ML

What you have known: SL takes many (x,t) pairs as inputs to train a learner f(x), then applies it to an unseen x_k and predicts it as f(x_k)

For example (X is 3‐dimensional):

Training { ([1,2,3], 0.1), ([2,3,4],0.2), ([3,4,5], 0.5)…}

Testing: [2,4,5] → 0.7

However, uncertainty exists in the real world, therefore an error distribution (e.g. Gaussian) is usually added: t = f(x) + error. That is, it is possible to generate different results for the same inputs, for example:

Training {([1,2,3],0.1), ([1,2,3],0.2),([1,2,3],0.1)…}

Testing: [1,2,3] = ?

2009/11/30 Probability and ML, Shou‐de Lin 5

(6)

The Probabilistic Form of t

• The output t is a distribution caused by the error term (assumed Gaussian):

p(t|x,W,β) = N(t | y(x,W), β⁻¹), where β is called a precision parameter, which equals the inverse of the variance: β = 1/σ².
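As a small illustration (not from the original slides; the values of W, β, and x below are assumed), the following Python sketch draws several t values for the same input x under this noise model, showing why identical inputs can yield different outputs.

import numpy as np

# Noise model p(t|x,W,beta) = N(t | y(x,W), 1/beta); here y(x,W) is assumed linear: y = W.x
rng = np.random.default_rng(0)
W = np.array([0.5, -1.0, 2.0])          # assumed parameters
beta = 4.0                              # precision = 1/variance, so sigma = 1/sqrt(beta) = 0.5
x = np.array([1.0, 2.0, 3.0])

y = W @ x                               # deterministic part y(x,W)
t_samples = rng.normal(y, 1.0 / np.sqrt(beta), size=3)
print(y, t_samples)                     # three different t's for the same input x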

(7)

The SL process under probability

Given training data {X,T}, we want to determine the unknown parameters W and β, so we will know the distribution of y.

• Assuming we observed N data points, then

p(T|X,W,β) = p(t₁|x₁,W,β) · p(t₂|x₂,W,β) · … · p(t_N|x_N,W,β) = ∏_{n=1}^{N} N(t_n | y(x_n,W), β⁻¹)   (the likelihood function)

ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² + (N/2)(ln β − ln(2π))   (this is called the log‐likelihood function)

2009/11/30 Probability and ML, Shou‐de Lin 7

(8)

Maximum Likelihood Estimation (MLE)

• Idea: trying to adjust the unknown parameters (i.e. W and β) to maximize the likelihood function or log‐likelihood function

ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² + (N/2)(ln β − ln(2π))

• Adjusting W to maximize this log‐likelihood function given a Gaussian error is equivalent to finding a W_ML that minimizes the mean‐square error function
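The following sketch (not from the slides; the synthetic data and variable names are assumptions) illustrates this equivalence numerically: the least‐squares solution W_ML attains a Gaussian log‐likelihood at least as high as any perturbed W.

import numpy as np

rng = np.random.default_rng(0)
N = 50
true_w, true_beta = np.array([0.5, -1.0, 2.0]), 4.0
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, 2))])   # design matrix with bias
T = X @ true_w + rng.normal(0.0, 1.0 / np.sqrt(true_beta), size=N)   # t = y(x,W) + Gaussian noise

def log_likelihood(w, beta):
    # ln p(T|X,w,beta) = -(beta/2) * sum (y - t)^2 + (N/2)(ln beta - ln 2*pi)
    sq_err = np.sum((X @ w - T) ** 2)
    return -0.5 * beta * sq_err + 0.5 * N * (np.log(beta) - np.log(2 * np.pi))

w_ml, *_ = np.linalg.lstsq(X, T, rcond=None)   # least-squares solution minimizes the sum of squares
w_other = w_ml + 0.1                           # any other parameter vector
assert log_likelihood(w_ml, true_beta) >= log_likelihood(w_other, true_beta)
print("W_ML (least squares):", w_ml)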

(9)

Maximum Likelihood Estimation for β

• First, we calculate W_ML, which governs the mean of the distribution.

• Then we use W_ML in the likelihood function to determine the optimal β_ML

∂/∂β ln p(T|X,W_ML,β) = −(1/2) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}² + N/(2β) = 0

⇒ 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}²

2009/11/30 Probability and ML, Shou‐de Lin 9

(10)

An SL system using MLE

1. We first determine W as the W_ML that minimizes the error function (1/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}²   (tends to overfit)

2. Using W_ML to find β as 1/β_ML = (1/N) Σ_{n=1}^{N} {y(x_n,W_ML) − t_n}²

3. Prediction stage: Using W_ML and β_ML to construct the distribution of t: p(t|x,W,β) = N(t | y(x,W_ML), β_ML⁻¹)

4. Predict the value of an input x’ by sampling t using the distribution in (3)

The MLE approach consistently underestimates the variance of the data and can lead to overfitting
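A minimal end‐to‐end sketch of the four steps above (assumptions: a polynomial y(x,W), synthetic sin data, and illustrative variable names; this is not the slides' own code):

import numpy as np

rng = np.random.default_rng(1)
N, degree = 20, 3
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)          # noisy targets

Phi = np.vander(x, degree + 1, increasing=True)            # basis expansion [1, x, x^2, x^3]
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)             # step 1: W_ML minimizes the squared error
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)             # step 2: 1/beta_ML = mean squared residual

x_new = 0.35                                               # steps 3-4: build predictive distribution, then sample
mean = (np.vander([x_new], degree + 1, increasing=True) @ w_ml)[0]
sample = rng.normal(mean, 1.0 / np.sqrt(beta_ml))
print("predictive mean:", mean, "sampled t:", sample)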

(11)

Bayesian Approach for Regression

• Why a Bayesian approach: some w’s are preferable to others.

For example, regularization prefers a simple model (i.e. small w’s).

Consequently, p(w) cannot be treated as uniformly distributed.

2009/11/30 Probability and ML, Shou‐de Lin 11

(12)

Bayes’ Rule Review

P(W|T) = P(T|W) · P(W) / P(T)

P(W|X,T) = P(T|X,W) · P(W|X) / P(T|X)

P(W|X,T) ∝ P(T|X,W) · P(W|X)

• P(W|X): prior probability

• P(T|X,W): likelihood probability (what MLE tries to optimize, argmax_W P(T|X,W))

• P(W|X,T): posterior probability

(13)

Bayesian Curve Fitting

P(W|X,T) ∝ P(T|X,W) · P(W|X)

• Likelihood probability (which we have already derived):

ln p(T|X,W,β) = −(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² + (N/2)(ln β − ln(2π))

• Prior: Assuming it is independent of X, and is Gaussian with mean 0 and variance 1/α:

p(W|X) = p(W) = (α/(2π))^((M+1)/2) · exp(−(α/2) WᵀW)

• Then the log probability of the posterior will be proportional to

−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² − (α/2) WᵀW + (N/2)(ln β − ln(2π)) + ((M+1)/2)(ln α − ln(2π))

2009/11/30 Probability and ML, Shou‐de Lin 13

(14)

Maximum Posterior Estimation (MAP)

−(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² − (α/2) WᵀW + (N/2)(ln β − ln(2π)) + ((M+1)/2)(ln α − ln(2π))

The best parameter set should maximize the posterior probability instead of the likelihood probability.

The MAP solution for Gaussian noise and a Gaussian prior is to find a W that minimizes

(β/2) Σ_{n=1}^{N} {y(x_n,W) − t_n}² + (α/2) WᵀW

Maximizing the posterior distribution is equivalent to minimizing the regularized sum‐of‐squares error function with the regularization parameter λ = α/β
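A minimal sketch of this MAP/regularized view (not from the slides; α, β, and the polynomial basis are assumed values): the closed‐form minimizer of (β/2)‖Φw − t‖² + (α/2)‖w‖² is ridge regression with λ = α/β.

import numpy as np

rng = np.random.default_rng(2)
N, degree = 20, 9
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.vander(x, degree + 1, increasing=True)

alpha, beta = 5e-3, 25.0                        # assumed prior and noise precisions
lam = alpha / beta                              # regularization parameter lambda = alpha/beta
A = lam * np.eye(degree + 1) + Phi.T @ Phi
w_map = np.linalg.solve(A, Phi.T @ t)           # W_MAP = (lambda*I + Phi^T Phi)^{-1} Phi^T t
print("W_MAP:", w_map)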

(15)

What we have discussed so far

1. Learning Phase (MLE or MAP):

Finding the W_ML that maximizes the likelihood function p(T|X,W) ⟺ finding the W that minimizes the squared‐error loss function, or

Finding the W_MAP that maximizes the posterior P(W|T,X) ⟺ finding the W that minimizes the regularized sum‐of‐squares loss function

2. Inference Phase:

When a new x’ comes in, use the determined W to predict the output y’

2009/11/30 Probability and ML, Shou‐de Lin 15

(16)

Potential Issues

• The problem of MLE: overfitting

• The problem of MAP: losing information

[Figure: three posterior distributions P(W|X,T), each marked with its MAP estimate of W]

• Since in MAP we have learned P(W|X,T), why not use total probability theory:

p(t|x’, X, T) = ∫ p(t|x’, W) · p(W|X, T) dW

where p(t|x,W) = N(t | y(x,W), β⁻¹)

(17)

The predictive distribution of t

p(t|x, X, T) = ∫ p(t|x, W) · p(W|X, T) dW

where p(t|x,W) = N(t | y(x,W), β⁻¹)

• It can be proved that when the posterior and p(t|x,W) are Gaussian, the predictive distribution p(t|x,X,T) is also Gaussian, with mean m(x) and variance s²(x)

2009/11/30 Probability and ML, Shou‐de Lin 17
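For a linear‐in‐parameters model with basis functions φ(x) and a Gaussian prior of precision α, one standard closed form for this Gaussian predictive distribution (an assumption of this sketch, not stated on the slide) is m(x) = β φ(x)ᵀ S_N Φᵀ t and s²(x) = 1/β + φ(x)ᵀ S_N φ(x), where S_N⁻¹ = α I + β Φᵀ Φ. The sketch below uses a polynomial basis and assumed α, β.

import numpy as np

rng = np.random.default_rng(3)
N, degree = 15, 5
alpha, beta = 2.0, 25.0
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1 / np.sqrt(beta), N)

def phi(values):
    return np.vander(np.atleast_1d(values), degree + 1, increasing=True)

Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)

x_new = 0.5
phi_new = phi(x_new)[0]
m = beta * phi_new @ S_N @ Phi.T @ t            # predictive mean m(x)
s2 = 1.0 / beta + phi_new @ S_N @ phi_new       # predictive variance s^2(x)
print(f"m(x')={m:.3f}, s(x')={np.sqrt(s2):.3f}")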

(18)

Example of predictive distribution

• Green: the true function. Red line: the mean of the predicted function. Red zone: one variance from the mean.

(19)

Y(x,w) from sampling posterior  distributions over w

2009/11/30 Probability and ML, Shou‐de Lin 19

(20)

The benefit of Statistical Learning

• Because it can produce not only the output, but also the distribution of the outputs.

The distribution tells us more about the data, including how confident the system is about its prediction.

It can be used to generate the dataset.

(21)

We have talked about Regression, so how about Classification?

2009/11/30 21

(22)

Two Classification Strategies

Strategy 1: two‐stage methods

Classification can be broken down into two stages

Inference stage: for each Ck, using its own training  data to learn a model for p(Ck|X)

Decision stage: Use p(Ck|X) and the loss matrix to  make optimal class assignment

Strategy 2: One‐shot methods (or Discriminant model)

Using all training data to learn a function that  directly maps inputs x into the output class

(23)

Two Models for Strategy 1 (1/2)

Model 1: Generative Model

First solve the inference problem of determining p(x|Ck) for each class Ck individually.

Separately infer the prior class probabilities p(Ck).

Use Bayes’ theorem to find the posterior class probabilities p(Ck|x):

p(Ck|x) = p(x|Ck) p(Ck) / p(x)

Note that the denominator can be generated as p(x) = Σ_k p(x|Ck) p(Ck)

Finally, use p(Ck|x) and decision theory to find the best class assignment.

This is called a generative model since we can learn p(x) and p(Ck,x)

2009/11/30 23
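A toy sketch of this two‐stage generative recipe (the 1‐D Gaussian class‐conditional densities and the priors below are made‐up numbers, not from the slides):

import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x|Ck) (1-D Gaussians) and priors p(Ck)
classes = {"C1": {"mean": 0.0, "std": 1.0, "prior": 0.7},
           "C2": {"mean": 3.0, "std": 1.5, "prior": 0.3}}

def posterior(x):
    joint = {k: norm.pdf(x, c["mean"], c["std"]) * c["prior"] for k, c in classes.items()}
    p_x = sum(joint.values())                       # p(x) = sum_k p(x|Ck) p(Ck)
    return {k: v / p_x for k, v in joint.items()}   # p(Ck|x) = p(x|Ck) p(Ck) / p(x)

post = posterior(1.8)
print(post, "->", max(post, key=post.get))          # decision stage: pick the most probable class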

(24)

Two Approaches for Strategy 1 (2/2)

• Model 2: Discriminative Model

Directly learn p(Ck|x) from data (knowing nothing about p(x|Ck) and p(x))

Logistic regression is a typical example.

(25)

Classification Models

Generative Model: learning P(Ck|X) using Bayes’ Rule

First solve the inference problem of determining p(x|Ck) and p(Ck) for each class Ck individually.

Use Bayes’ rule to find the posterior class probabilities p(Ck|x)

Then apply decision theory to decide which C is the best assignment for x

Discriminative Model: learning P(Ck|X) directly from data (e.g. Logistic Regression)

Discriminant Model: Learn a function that directly maps inputs x into the output class

Linear discriminant function: learning linear functions to separate the classes

Least Squares

Fisher’s linear discriminant

Perceptron Algorithm

2009/11/30 25

(26)

Generative vs. Discriminative Model

Generative model

Pros: P(x) can be used to generate samples of inputs, which is useful for knowledge discovery & data mining (e.g. outlier detection and novelty detection).

Cons: very demanding since it has to find the joint distribution of Ck and x. Needs a lot of training data.

Discriminative Model

Pros: can be learned with fewer data

Cons: cannot learn the detailed structure of the data

(27)

Generative vs. Discriminant Model (1/3)

• A discriminant approach learns a discriminant function and uses it for decision making. It does not learn P(Ck|x).

• However, P(Ck|x) is useful in many aspects

1. It can be combined with the cost function to produce the final decision. If the cost function changes, we don’t need to re‐train the whole model as a discriminant model does.

2. It can be used to determine the reject region.

P(C_HT|x) = 0.1, P(C_PJ|x) = 0.05

P(C_HT|x) = 0.7, P(C_PJ|x) = 0.8

2009/11/30 27

(28)

Generative vs. Discriminant Model (2/3)

• Generative Model takes care of the class prior P(y) explicitly.

E.g.: in cancer prediction, only a small amount of data (e.g. 0.1%) are positive.

A normal classifier will guess negative and receive 99.9% accuracy.

Using P(Ck|x) and P(Ck) allows us to ignore the inference from the prior during learning.

(29)

Generative vs. Discriminant Model (3/3)

Generative models are better in terms of combining several models:

Assuming in the previous example, we have two types

of information for each photo:

The image features (Xi)

The social information (Xs)

It might be more effective and meaningful to build separate models P(Ck|Xi), P(Ck|Xs) for these two sets of features.

The generative model allows us to combine these models as:

P(Ck|xi,xs) ∝ P(xi,xs|Ck) P(Ck) = P(xi|Ck) P(xs|Ck) P(Ck) ∝ P(Ck|xi) P(Ck|xs) / P(Ck)   (using the Naïve Bayes assumption)

2009/11/30 29
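A tiny numeric sketch of this combination rule (the per‐feature posteriors and the priors below are made‐up numbers for illustration):

import numpy as np

prior = np.array([0.5, 0.5])             # P(Ck) for two classes
p_c_given_xi = np.array([0.8, 0.2])      # image-feature model P(Ck|xi)
p_c_given_xs = np.array([0.6, 0.4])      # social-feature model P(Ck|xs)

unnorm = p_c_given_xi * p_c_given_xs / prior    # proportional to P(Ck|xi,xs) under the Naive Bayes assumption
combined = unnorm / unnorm.sum()
print(combined)                                  # roughly [0.857, 0.143]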

(30)

Naïve Bayes Assumption

Recall that in the Bayesian setup, we have p(Ck|x) = p(x|Ck) p(Ck) / p(x)

If we assume the features of an instance are independent given the class (conditionally independent):

)

| (

)

| ,

, (

)

| (

1 2

1

=

=

= n

i

i

n C P X C

X X

X P C

X

P L

Therefore, we only need to know P(X|C) for each possible pair of a feature value and class.

If C and all Xi are binary, this requires specifying only 2n parameters:

P(Xi=true | C=true) and P(Xi=true | C=false) for each Xi

P(Xi=false | C) = 1 − P(Xi=true | C)

Compared to specifying 2ⁿ parameters without any independence assumption.
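A minimal Naive Bayes sketch with binary features and a binary class (the toy data is made up; Laplace smoothing is added here for robustness, although the slide does not mention it):

import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])             # 4 instances, 3 binary features
y = np.array([1, 1, 0, 0])            # binary class labels

def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {"prior": len(Xc) / len(X),
                     "p_x1": (Xc.sum(axis=0) + 1) / (len(Xc) + 2)}   # P(Xi=true|C=c), smoothed
    return params

def predict(params, x):
    # score each class by P(C) * prod_i P(xi|C), using P(Xi=false|C) = 1 - P(Xi=true|C)
    scores = {c: p["prior"] * np.prod(np.where(x == 1, p["p_x1"], 1 - p["p_x1"]))
              for c, p in params.items()}
    return max(scores, key=scores.get)

model = fit(X, y)
print(predict(model, np.array([1, 0, 0])))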

(31)

Gaussian Discriminant Analysis (GDA)

• This is another generative model.

• GDA assumes p(x|y) is distributed according to a Multivariate Normal Distribution (MND).

• An MND in n dimensions is parameterized by a mean vector μ ∈ Rⁿ and a covariance matrix Σ ∈ Rⁿˣⁿ, also written as N(μ, Σ). Its density is:

p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

2009/11/30 31

(32)

Examples for 2‐D Multivariate Normal  Distribution

• Σ= I       Σ= 0.6I       Σ= 2I

(33)

The Model for GDA (1/2)

• p(x|y) is MND, p(y=0) = Φ, p(y=1) = 1 − Φ (assuming different values of y share the same Σ)

• The log‐likelihood of the data is

ℓ(Φ, μ₀, μ₁, Σ) = log ∏_{i=1}^{N} p(x⁽ⁱ⁾, y⁽ⁱ⁾) = log ∏_{i=1}^{N} p(x⁽ⁱ⁾ | y⁽ⁱ⁾; μ₀, μ₁, Σ) · p(y⁽ⁱ⁾; Φ)

2009/11/30 33

(34)

The Model for GDA (2/2)

• Using maximum likelihood estimation (MLE), we can obtain

Φ = (1/N) Σ_{i=1}^{N} 1{y⁽ⁱ⁾ = 0}

μ_k = (Σ_i 1{y⁽ⁱ⁾ = k} x⁽ⁱ⁾) / (Σ_i 1{y⁽ⁱ⁾ = k})   for k = 0, 1

Σ = (1/N) Σ_{i=1}^{N} (x⁽ⁱ⁾ − μ_{y⁽ⁱ⁾})(x⁽ⁱ⁾ − μ_{y⁽ⁱ⁾})ᵀ
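A minimal GDA sketch on synthetic 2‐D data (the data and the convention phi = p(y=1) used below are assumptions of this example; note the slide instead defines Φ as p(y=0)):

import numpy as np

rng = np.random.default_rng(4)
n0, n1 = 100, 100
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), n0),
               rng.multivariate_normal([2, 2], np.eye(2), n1)])
y = np.concatenate([np.zeros(n0), np.ones(n1)]).astype(int)

phi = np.mean(y == 1)                                    # p(y=1)
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])  # per-class means
diffs = X - mu[y]
Sigma = diffs.T @ diffs / len(X)                         # shared covariance
Sigma_inv = np.linalg.inv(Sigma)

def log_gaussian(x, m):
    d = x - m
    return -0.5 * d @ Sigma_inv @ d - 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma))

def p_y1_given_x(x):
    # Bayes' rule with Gaussian class-conditionals and the class prior
    log_joint = np.array([log_gaussian(x, mu[0]) + np.log(1 - phi),
                          log_gaussian(x, mu[1]) + np.log(phi)])
    joint = np.exp(log_joint - log_joint.max())
    return joint[1] / joint.sum()

print(p_y1_given_x(np.array([1.0, 1.0])))   # about 0.5 midway between the class means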

(35)

Discussion: GDA vs. Logistic Regression

In GDA, p(y|x) is of the form 1/(1 + exp(−θᵀx)), where θ is a function of ϕ, Σ, μ.

This is exactly the form of logistic regression to model p(y|x). 

That is, if p(x|y) is multivariate Gaussian, then p(y|x) follows a logistic function.

However, the converse is not true. This implies that GDA makes  stronger modeling assumptions about the data than LR does. 

Training on the same dataset, these two algorithms will produce different decision boundaries.

If p(x|y) is indeed Gaussian, then GDA will get better results. 

That is, if x is some sort of mean value of something whose size is not small, then based on the central limit theorem, GDA should perform very well.

If p(x|y=1) and p(x|y=0) are both Poisson, then P(y|x) will be  logistic. In this case, LR can work better than GDA.

If we are sure the data is non‐Gaussian, we should use LR rather than GDA.

2009/11/30 35
