© Deng Cai, College of Computer Science, Zhejiang University


So Far…

Our goal (supervised learning):

• To learn a set of discriminant functions

Bayesian framework

• We could design an optimal classifier if we knew:

• $P(\omega_i)$: priors, and $P(\mathbf{x} \mid \omega_i)$: class-conditional densities

• Using training data to estimate $P(\omega_i)$ and $P(\mathbf{x} \mid \omega_i)$

• $P(\omega_i \mid \mathbf{x})$ is computed and used as the discriminant function

Other possible ways?

Linear Methods for Regression

Deng Cai (蔡登)

College of Computer Science, Zhejiang University

[email protected]


Classification vs. Regression

Both are supervised learning methods

• Goal: learn a mapping from inputs $\mathbf{x}$ to outputs $y$

Classification (Categorization, Decision making…)

• $y$ is a categorical variable

Regression

• $y$ is real-valued


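Linear model

Sample: $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{x} = (x_1, x_2, \cdots, x_p)^T$

Find a linear function $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + a_0$, with $\mathbf{a} = (a_1, \cdots, a_p)^T \in \mathbb{R}^p$ and $a_0 \in \mathbb{R}$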


Polynomial Curve Fitting

Figures: polynomial fits of order 0, 1, and 3.

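Sum-of-Squares Error Function

$$E(\mathbf{a}) = \frac{1}{2} \sum_{i=1}^{n} \big( f(x_i, \mathbf{a}) - y_i \big)^2$$

Training data:

• $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$

To learn: the polynomial $f(x, \mathbf{a})$, with coefficients $\mathbf{a}$, which minimizes $E(\mathbf{a})$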

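Polynomial Curve Fitting → Linear Regression

$$f(x, \mathbf{a}) = a_0 + a_1 x + a_2 x^2 + \cdots + a_M x^M$$

$$\mathbf{x} = (1, x, x^2, \cdots, x^M)^T$$

$$\mathbf{a} = (a_0, a_1, a_2, \cdots, a_M)^T$$

$$f(x, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$$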
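The reduction from polynomial curve fitting to linear regression can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the helper name `poly_features`, the sample inputs, and the coefficient values are hypothetical and chosen only for the demonstration.

```python
import numpy as np

def poly_features(x, M):
    """Stack the powers 1, x, ..., x^M of scalar inputs x (shape (n,)).

    Returns an array of shape (M + 1, n): one column per sample.
    """
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(M + 1)])

x = np.array([0.0, 0.5, 1.0])          # illustrative scalar inputs
a = np.array([1.0, -2.0, 0.0, 3.0])    # illustrative coefficients a_0..a_3

Phi = poly_features(x, M=3)            # feature vectors (1, x, x^2, x^3) as columns
f_linear = a @ Phi                     # a^T x for every sample: linear in a
f_direct = 1.0 - 2.0 * x + 3.0 * x**3  # the same polynomial written out directly
```

The model is nonlinear in the scalar input but linear in the coefficients, which is why the linear regression machinery applies unchanged to the expanded feature vector.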

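Linear Regression Model

Training data: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$

$$f(\mathbf{x}) = a_0 + \sum_{j=1}^{p} a_j x_j$$

• $a_1, \cdots, a_p$ and $a_0$: unknown parameters or coefficients

• $\mathbf{x}$: feature vector, the outcome of feature extraction.

Minimize the mean-squared error: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(\mathbf{x}_i) \big)^2$

Minimize the residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} \big( y_i - f(\mathbf{x}_i) \big)^2$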

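The MSE Criterion

MSE Criterion: Minimize the sum of squared differences between $\mathbf{a}^T \mathbf{x}_i$ and $y_i$

$$J(\mathbf{a}) = \sum_{i=1}^{n} (\mathbf{a}^T \mathbf{x}_i - y_i)^2$$

(the intercept $a_0$ is absorbed by appending a constant 1 to each $\mathbf{x}_i$)

Using matrix notation for convenience

$$X = [\mathbf{x}_1, \cdots, \mathbf{x}_n], \quad \mathbf{y} = [y_1, \cdots, y_n]^T$$

$$J(\mathbf{a}) = \| X^T \mathbf{a} - \mathbf{y} \|^2$$

How to optimize it (find the optimal solution)?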

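Optimizing the MSE Criterion

Computing the gradient gives:

$$\nabla J(\mathbf{a}) = 2 X (X^T \mathbf{a} - \mathbf{y})$$

Setting the gradient to zero,

$$X X^T \mathbf{a} = X \mathbf{y} \quad \Rightarrow \quad \mathbf{a} = (X X^T)^{-1} X \mathbf{y}$$

Any problems?

What is the rank of the matrix $X X^T$?

The solution for $\mathbf{a}$ can be obtained uniquely if $X X^T$ is non-singular.

The fitted values at the training inputs are

$$\hat{\mathbf{y}} = X^T \mathbf{a} = X^T (X X^T)^{-1} X \mathbf{y}$$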
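The closed-form solution can be checked numerically. The following is a minimal sketch, assuming NumPy and a data matrix `X` of shape (p, n) whose columns are the samples; the synthetic data, the seed, and the names `a_true`, `a_hat`, and `grad` are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 50
X = rng.normal(size=(p, n))                  # columns are samples x_1..x_n
a_true = np.array([2.0, -1.0, 0.5])          # hypothetical ground-truth coefficients
y = X.T @ a_true + 0.1 * rng.normal(size=n)  # noisy linear outputs

G = X @ X.T                                  # p x p matrix X X^T
rank = np.linalg.matrix_rank(G)              # must equal p for a unique solution

# Solve the normal equations X X^T a = X y (preferred over forming the inverse).
a_hat = np.linalg.solve(G, X @ y)

y_hat = X.T @ a_hat                          # fitted values at the training inputs
grad = 2 * X @ (X.T @ a_hat - y)             # gradient of J at the solution
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; when $X X^T$ is close to singular, `np.linalg.lstsq` on `X.T` is usually the safer call.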

Geometry of least-squares fitting

Figure 2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector $\mathbf{y}$ is orthogonally projected onto the hyperplane spanned by the input vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. The projection $\hat{\mathbf{y}}$ represents the vector of the least squares predictions.
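The projection picture can be verified with a small numerical sketch. It assumes NumPy and stores the two predictor vectors $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^N$ as the rows of a hypothetical 2 x N array `X`; the random data and the name `H` for the projection matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
X = rng.normal(size=(2, N))        # rows are the predictor vectors x1 and x2 in R^N
y = rng.normal(size=N)             # outcome vector

# Orthogonal projector onto span{x1, x2}: H = X^T (X X^T)^{-1} X
H = X.T @ np.linalg.solve(X @ X.T, X)

y_hat = H @ y                      # least squares predictions
residual = y - y_hat               # component of y outside the span

inner_products = X @ residual      # <x1, residual>, <x2, residual>
```

Both inner products are zero up to rounding, which is the orthogonality that the figure depicts; `H` is symmetric and idempotent, as any orthogonal projector must be.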

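Statistical model of regression

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$, where $f(\mathbf{x}, \mathbf{a})$ is a deterministic function

$\epsilon$ is random noise; it represents things we cannot capture with $f(\mathbf{x}, \mathbf{a})$,

e.g. $\epsilon \sim \mathcal{N}(0, \sigma^2)$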

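Statistical model of regression

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$. Assume: $\epsilon \sim \mathcal{N}(0, \sigma^2)$

$$p(y \mid \mathbf{x}, \mathbf{a}, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2\sigma^2} \big( y - f(\mathbf{x}, \mathbf{a}) \big)^2 \right)$$

Likelihood of predictions

• The probability of observing the outputs $y$ in the training set $D$ given $\mathbf{a}$, $\mathbf{x}$, and $\sigma$.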

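Maximum Likelihood Estimation

Likelihood of predictions

$$L(D, \mathbf{a}, \sigma) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

Maximum likelihood estimation of parameters

• Parameters maximizing the likelihood of predictions

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$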

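Maximum Likelihood Estimation

Log-likelihood

$$l(D, \mathbf{a}, \sigma) = \log L(D, \mathbf{a}, \sigma) = \log \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma) = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$l(D, \mathbf{a}, \sigma) = \sum_{i=1}^{n} \left[ -\frac{1}{2\sigma^2} \big( y_i - f(\mathbf{x}_i, \mathbf{a}) \big)^2 - \log\big( \sigma \sqrt{2\pi} \big) \right]$$

$$l(D, \mathbf{a}, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - f(\mathbf{x}_i, \mathbf{a}) \big)^2 - n \log\big( \sigma \sqrt{2\pi} \big)$$

The last term does not depend on $\mathbf{a}$, so maximizing the log-likelihood over $\mathbf{a}$ is equivalent to minimizing the sum-of-squares error.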
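The link between maximum likelihood and least squares can be checked numerically. This is a minimal sketch, assuming NumPy, a linear model $f(\mathbf{x}, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$, and samples stored as the columns of `X`; the function name `gaussian_log_likelihood`, the synthetic data, and the perturbation scale are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(a, sigma, X, y):
    """Log-likelihood of y under y_i = a^T x_i + eps, eps ~ N(0, sigma^2)."""
    r = y - X.T @ a                                  # residuals
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 40))                         # columns are samples
y = X.T @ np.array([1.0, -3.0]) + 0.5 * rng.normal(size=40)
sigma = 0.5                                          # noise level treated as known

a_ls = np.linalg.solve(X @ X.T, X @ y)               # least squares solution
ll_ls = gaussian_log_likelihood(a_ls, sigma, X, y)

# Log-likelihood at 20 randomly perturbed coefficient vectors.
perturbations = 0.1 * rng.normal(size=(20, 2))
ll_perturbed = [gaussian_log_likelihood(a_ls + d, sigma, X, y) for d in perturbations]
```

Every perturbed coefficient vector has a lower log-likelihood than the least squares solution, whatever fixed value of `sigma` is used, because `sigma` only rescales and shifts the sum-of-squares term.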

Sum-of-Squares Error Function

$$E(\mathbf{a}) = \frac{1}{2} \sum_{i=1}^{n} \big( f(x_i, \mathbf{a}) - y_i \big)^2$$

Figures: polynomial fits of order 0, 1, 3, and 9.

Over-fitting

Issues with MSE Criterion

The solution for $\mathbf{a}$ can be obtained uniquely if $X X^T$ is non-singular.

If $X X^T$ is singular, the minimizer is not unique and the fit is prone to overfitting.

Table: polynomial coefficients.
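The singular case is easy to reproduce when there are more features than samples. The sketch below assumes NumPy and samples stored as the columns of a p x n array `X`; the random data and the names `a_min_norm` and `a_other` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 10, 5                         # more features than samples
X = rng.normal(size=(p, n))          # columns are samples
y = rng.normal(size=n)

# rank(X X^T) = rank(X) <= n < p, so the p x p matrix X X^T is singular.
rank = np.linalg.matrix_rank(X)

# One exact solution: the minimum-norm coefficients from the pseudo-inverse.
a_min_norm = np.linalg.pinv(X.T) @ y

# Adding any null-space direction of X^T gives another exact solution.
_, _, Vt = np.linalg.svd(X.T)        # X.T is n x p; the last p - n rows of Vt span its null space
a_other = a_min_norm + 10.0 * Vt[-1]

fit_min_norm = X.T @ a_min_norm
fit_other = X.T @ a_other
```

Both coefficient vectors reproduce the training outputs exactly even though `y` is pure noise here, and they differ by a vector of norm 10. Nothing in the MSE criterion prefers the small coefficients over the large ones.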

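Ridge Regression

How to control the size of the coefficients in regression?

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 + \lambda \sum_{j=1}^{p} a_j^2$$

(local smoothness, weight decay)

Equivalent formulation

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} a_j^2 \le t$$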

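Ridge Regression

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 + \lambda \sum_{j=1}^{p} a_j^2$$

• Matrix notation, with $X = [\mathbf{x}_1, \cdots, \mathbf{x}_n]$ and $\mathbf{y} = [y_1, \cdots, y_n]^T$:

$$J(\mathbf{a}) = \| X^T \mathbf{a} - \mathbf{y} \|^2 + \lambda \| \mathbf{a} \|^2$$

Computing the gradient gives:

$$\nabla J(\mathbf{a}) = 2 X (X^T \mathbf{a} - \mathbf{y}) + 2 \lambda \mathbf{a}$$

Setting the gradient to zero,

$$\mathbf{a}^* = (X X^T + \lambda I)^{-1} X \mathbf{y}$$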
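The ridge solution can be sketched directly from the regularized normal equations. This is a minimal illustration, assuming NumPy and samples stored as the columns of `X`; the function name `ridge_fit`, the random data, and the list of $\lambda$ values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X X^T + lam * I) a = X y for the ridge coefficients."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

rng = np.random.default_rng(4)
p, n = 10, 5                         # X X^T alone would be singular here
X = rng.normal(size=(p, n))
y = rng.normal(size=n)

lams = [0.01, 1.0, 100.0]
coeffs = [ridge_fit(X, y, lam) for lam in lams]

# Coefficient norms for increasing lam, and the gradient of J at each solution.
norms = [float(np.linalg.norm(a)) for a in coeffs]
grads = [2 * X @ (X.T @ a - y) + 2 * lam * a for a, lam in zip(coeffs, lams)]
```

For any $\lambda > 0$ the matrix $X X^T + \lambda I$ is positive definite, so the solution is unique even when $p > n$, and the coefficient norm shrinks as $\lambda$ grows.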

Polynomial Coefficients

9th Order Polynomial

Regularization (figures: 9th-order fits for two settings of $\lambda$)

Regularization

Simple model

Complicated model

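Maximum Likelihood Estimation

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$. Assume: $\epsilon \sim \mathcal{N}(0, \sigma^2)$

$$L(D, \mathbf{a}, \sigma) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} L(D, \mathbf{a}, \sigma) = \arg\max_{\mathbf{a}} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$l(D, \mathbf{a}, \sigma) = \log L(D, \mathbf{a}, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - f(\mathbf{x}_i, \mathbf{a}) \big)^2 - n \log\big( \sigma \sqrt{2\pi} \big)$$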

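Bayesian Linear Regression

Bayes rule

$$p(\mathbf{a} \mid D) = \frac{p(D \mid \mathbf{a}) \, p(\mathbf{a})}{p(D)}$$

where $p(\mathbf{a})$ is the prior, $p(D \mid \mathbf{a})$ is the likelihood, and $p(\mathbf{a} \mid D)$ is the posterior.

Bayesian Linear Regression

A common choice for the prior is a zero-mean isotropic Gaussian,

$$p(\mathbf{a}) = \mathcal{N}(\mathbf{a} \mid \mathbf{0}, \alpha^{-1} I)$$

Posterior ∝ likelihood × prior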
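The connection between this Bayesian view and ridge regression can be made explicit with a short derivation. It is a sketch under stated assumptions: Gaussian noise with variance $\sigma^2$, a linear model $f(\mathbf{x}, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$, and a zero-mean isotropic Gaussian prior whose precision is written $\alpha$ here (the symbol is an assumption for this example).

```latex
\begin{aligned}
-\log p(\mathbf{a} \mid D)
  &= -\log p(D \mid \mathbf{a}) - \log p(\mathbf{a}) + \text{const} \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2
     + \frac{\alpha}{2} \, \|\mathbf{a}\|^2 + \text{const}, \\
\mathbf{a}_{\text{MAP}}
  &= \arg\min_{\mathbf{a}} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2
     + \lambda \, \|\mathbf{a}\|^2,
  \qquad \lambda = \sigma^2 \alpha .
\end{aligned}
```

Under these assumptions the maximum a posteriori estimate coincides with the ridge solution, and a tighter prior (larger $\alpha$) corresponds to stronger regularization.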

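A more general regularizer

Ridge Regression

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j|^2 \le t$$

More generally, for an exponent $q > 0$:

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j|^q \le t$$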

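LASSO

Least Absolute Shrinkage and Selection Operator

Sparse model

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j| \le t$$

Equivalent (Lagrangian) formulation:

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 + \lambda \sum_{j=1}^{p} |a_j|$$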

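LASSO: Sparse Model

Ridge regression vs. LASSO

• Why LASSO → sparse model?

Ridge regression:

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} a_j^2 \le t$$

LASSO:

$$\arg\min_{\mathbf{a}} \frac{1}{2} \sum_{i=1}^{n} (y_i - \mathbf{a}^T \mathbf{x}_i)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j| \le t$$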

LASSO: Sparse Model

The normal cones to the feasible set $\Omega$ at the corner points (the vertices of the $\ell_1$ ball, which lie on the coordinate axes) contain infinitely many rays, while they reduce to a single ray at the other boundary points.

The first-order (necessary and sufficient) condition concludes that a feasible point is the optimum if and only if the negative gradient direction of the objective function falls inside the normal cone to the feasible set at that point.

Thus, the optimum is more likely to fall at the points with “larger” normal cones. At those corner points some coefficients are exactly zero, which is why the LASSO solution tends to be sparse.

This also explains why non‐convex

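LASSO Solution

Convex optimization → Coordinate descent

Single predictor (feature) setting

$$\arg\min_{a} \frac{1}{2n} \sum_{i=1}^{n} (y_i - a x_i)^2 + \lambda |a|$$

Standardized data: $\frac{1}{n} \sum_{i} y_i = 0$, $\frac{1}{n} \sum_{i} x_i = 0$, $\frac{1}{n} \sum_{i} x_i^2 = 1$

If $\lambda = 0$,

$$a^* = \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle, \quad \text{where } \langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} x_i y_i$$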

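LASSO Solution

Single predictor (feature) setting

Standardized data: $\frac{1}{n} \sum_{i} y_i = 0$, $\frac{1}{n} \sum_{i} x_i = 0$, $\frac{1}{n} \sum_{i} x_i^2 = 1$

$$a^* = \arg\min_{a} J(a), \quad J(a) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - a x_i)^2 + \lambda |a|$$

$$J(a) = \frac{1}{2n} \sum_{i=1}^{n} y_i^2 - a \, \frac{1}{n} \sum_{i=1}^{n} x_i y_i + \frac{1}{2} a^2 \, \frac{1}{n} \sum_{i=1}^{n} x_i^2 + \lambda |a|$$

$$J(a) = \frac{1}{2n} \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \, a + \frac{1}{2} a^2 + \lambda |a|$$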

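LASSO Solution

With $J(a) = \frac{1}{2n} \sum_{i} y_i^2 - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \, a + \frac{1}{2} a^2 + \lambda |a|$:

For $a > 0$: $\frac{dJ}{da} = a - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle + \lambda = 0$ gives $a = \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle - \lambda$, which is valid when $\frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle > \lambda$.

For $a < 0$: $\frac{dJ}{da} = a - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle - \lambda = 0$ gives $a = \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle + \lambda$, which is valid when $\frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle < -\lambda$.

Otherwise $a = 0$. Hence

$$a^* = \begin{cases} \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle - \lambda, & \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle > \lambda \\ 0, & \left| \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \right| \le \lambda \\ \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle + \lambda, & \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle < -\lambda \end{cases}$$

Equivalently, $a^* = S_\lambda\big( \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \big)$ with the soft-thresholding operator $S_\lambda(z) = \mathrm{sign}(z) \, (|z| - \lambda)_+$.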
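The single-predictor solution is a soft-thresholding of the least squares coefficient, which can be sketched and cross-checked against a brute-force search. The example assumes NumPy, the objective $\frac{1}{2n} \sum_i (y_i - a x_i)^2 + \lambda |a|$, and standardized data; the function names, the synthetic data, and the grid are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """Closed-form LASSO coefficient; assumes mean(x) = mean(y) = 0 and mean(x**2) = 1."""
    return soft_threshold(x @ y / x.size, lam)

shrunk = soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0)

# Synthetic standardized data (illustrative).
rng = np.random.default_rng(5)
x = rng.normal(size=100)
x -= x.mean()
x /= np.sqrt(np.mean(x**2))
y = 0.8 * x + 0.3 * rng.normal(size=100)
y -= y.mean()

lam = 0.2
a_closed = lasso_single_predictor(x, y, lam)

# Brute-force check: evaluate the objective on a fine grid of candidate coefficients.
grid = np.linspace(-2.0, 2.0, 4001)
objective = 0.5 * np.mean((y[:, None] - grid[None, :] * x[:, None]) ** 2, axis=0) + lam * np.abs(grid)
a_grid = grid[np.argmin(objective)]
```

The grid minimizer agrees with the closed form up to the grid spacing, and inputs whose magnitude is at most `lam` are mapped to exactly zero.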

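LASSO Solution

Convex optimization → Coordinate descent

Single predictor (feature) setting

Multiple predictors (features)

• Cyclic coordinate descent

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} a_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |a_j|$$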
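Cyclic coordinate descent repeats the single-predictor update on one coefficient at a time while holding the others fixed. The following is a minimal sketch, assuming NumPy, samples stored as the columns of `X`, the objective $\frac{1}{2n} \| \mathbf{y} - X^T \mathbf{a} \|^2 + \lambda \| \mathbf{a} \|_1$, and a fixed number of sweeps instead of a convergence test; the function names and the synthetic data are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - X^T a||^2 + lam * ||a||_1."""
    p, n = X.shape
    a = np.zeros(p)
    residual = y.astype(float).copy()        # y - X^T a with a = 0
    scale = np.sum(X**2, axis=1) / n         # (1/n) * sum_i x_ij^2 for each feature j
    for _ in range(n_sweeps):
        for j in range(p):
            residual += a[j] * X[j]          # partial residual without feature j
            z = X[j] @ residual / n          # single-predictor least squares statistic
            a[j] = soft_threshold(z, lam) / scale[j]
            residual -= a[j] * X[j]          # restore the full residual
    return a

rng = np.random.default_rng(6)
p, n = 5, 100
X = rng.normal(size=(p, n))
a_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # hypothetical sparse coefficients
y = X.T @ a_true + 0.1 * rng.normal(size=n)

lam = 0.1
a_hat = lasso_cd(X, y, lam)

# Optimality check: correlation of each feature with the final residual.
corr = X @ (y - X.T @ a_hat) / n
```

At a minimizer, `corr[j]` equals `lam * sign(a_hat[j])` for nonzero coefficients and lies in `[-lam, lam]` for the coefficients that are set to zero, which is what the test conditions below verify. Production code would normally add a convergence criterion and standardize the features first.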

Model Assessment and Selection

The generalization performance of a learning method

Model selection:

• estimating the performance of different models in order to choose the best one.

Model assessment:

• having chosen a final model, estimating its prediction error (generalization error) on new data.
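One common way to keep the two tasks separate is a three-way split: a validation set for selection and an untouched test set for assessment. The sketch below assumes NumPy, ridge regression as the model family, and samples stored as the columns of `X`; the split sizes, the candidate $\lambda$ values, and the helper names are illustrative choices, not a method prescribed by the slides.

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

def mse(X, y, a):
    return float(np.mean((y - X.T @ a) ** 2))

rng = np.random.default_rng(7)
p, n = 8, 150
X = rng.normal(size=(p, n))
a_true = rng.normal(size=p)
y = X.T @ a_true + rng.normal(size=n)

train, val, test = slice(0, 50), slice(50, 100), slice(100, 150)

# Model selection: compare candidate models on the validation set.
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = [mse(X[:, val], y[val], ridge_fit(X[:, train], y[train], lam)) for lam in lams]
best_lam = lams[int(np.argmin(val_errors))]

# Model assessment: estimate the chosen model's error on data not used so far.
a_final = ridge_fit(X[:, train], y[train], best_lam)
test_error = mse(X[:, test], y[test], a_final)
```

The validation error of the selected model is optimistically biased because it was used for the choice, which is why the final estimate is taken from the separate test split.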

Bias & Variance Decomposition

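The Supervised Learning Problem

Given example pairs $(\mathbf{x}, y)$

Learn a function $f(\mathbf{x})$ such that $f(\mathbf{x}) = y$

Loss: $L(y, f(\mathbf{x}))$

Expected Loss:

$$E[L] = \iint L(y, f(\mathbf{x})) \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

Squared loss: $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$

Expected Prediction Error:

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$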

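Squared loss

$$(y - f(\mathbf{x}))^2 = \big( y - E[y \mid \mathbf{x}] + E[y \mid \mathbf{x}] - f(\mathbf{x}) \big)^2$$

$$= (y - E[y \mid \mathbf{x}])^2 + 2 \, (y - E[y \mid \mathbf{x}]) (E[y \mid \mathbf{x}] - f(\mathbf{x})) + (E[y \mid \mathbf{x}] - f(\mathbf{x}))^2$$

The cross term vanishes after integrating over $y$, since $\int (y - E[y \mid \mathbf{x}]) \, p(y \mid \mathbf{x}) \, dy = 0$.

Expected Prediction Error:

$$\mathrm{EPE}(f) = \int (f(\mathbf{x}) - E[y \mid \mathbf{x}])^2 \, p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

The first term is the only one that depends on $f$; it is zero when $f(\mathbf{x}) = E[y \mid \mathbf{x}]$.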

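In Reality

Given a training set $D$; $D$ contains $n$ example pairs $(\mathbf{x}, y)$

Learn a function $f(\mathbf{x}; D)$ such that $f(\mathbf{x}; D) = y$

Expected Prediction Error:

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

$$f(\mathbf{x}) \rightarrow f(\mathbf{x}; D)$$

Average over training sets: $E_D[f(\mathbf{x}; D)]$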

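In Reality

Expected Prediction Error:

$$\mathrm{EPE}(f) = \int (f(\mathbf{x}) - E[y \mid \mathbf{x}])^2 \, p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

With $f(\mathbf{x})$ replaced by $f(\mathbf{x}; D)$:

$$\int (f(\mathbf{x}; D) - E[y \mid \mathbf{x}])^2 \, p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

$$(f(\mathbf{x}; D) - E[y \mid \mathbf{x}])^2 = \big( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] + E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \big)^2$$

$$= (f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)])^2 + 2 \, (f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)]) (E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}]) + (E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}])^2$$

Taking the expectation over $D$, the cross term vanishes:

$$E_D\big[ (f(\mathbf{x}; D) - E[y \mid \mathbf{x}])^2 \big] = (E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}])^2 + E_D\big[ (f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)])^2 \big]$$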

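Bias-variance Decomposition

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

Expected prediction error (expected loss) = (bias)² + variance + noise

(bias)²:

$$\int (E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}])^2 \, p(\mathbf{x}) \, d\mathbf{x}$$

variance:

$$\int E_D\big[ (f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)])^2 \big] \, p(\mathbf{x}) \, d\mathbf{x}$$

noise:

$$\int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$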

The Bias-Variance Decomposition

Example: 25 data sets generated from the sinusoidal function, varying the degree of regularization, λ.
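The experiment can be approximated with a short simulation. This is a sketch under stated assumptions, not a reproduction of the figures: it assumes NumPy, a target $\sin(2\pi x)$, 25 data sets of 25 points on a fixed grid, Gaussian noise with standard deviation 0.3, and a 9th-order polynomial fitted by ridge regression. All of these settings and the function names are illustrative.

```python
import numpy as np

def poly_features(x, M):
    return np.vstack([x**j for j in range(M + 1)])       # shape (M + 1, n)

def ridge_fit(Phi, y, lam):
    return np.linalg.solve(Phi @ Phi.T + lam * np.eye(Phi.shape[0]), Phi @ y)

def bias_variance(lam, n_sets=25, n_points=25, M=9, noise=0.3, seed=0):
    """Estimate (bias^2, variance) of ridge polynomial fits over repeated data sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    Phi = poly_features(x, M)
    h = np.sin(2 * np.pi * x)                            # E[y | x]
    preds = np.empty((n_sets, n_points))
    for d in range(n_sets):
        y = h + noise * rng.normal(size=n_points)        # one training set D
        preds[d] = Phi.T @ ridge_fit(Phi, y, lam)        # f(x; D) on the grid
    mean_pred = preds.mean(axis=0)                       # estimate of E_D[f(x; D)]
    bias2 = float(np.mean((mean_pred - h) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance

bias2_small, var_small = bias_variance(lam=1e-6)         # weak regularization
bias2_large, var_large = bias_variance(lam=100.0)        # strong regularization
```

With these settings the weakly regularized fits show the larger variance and the strongly regularized fits show the larger squared bias. The averages are taken over the training grid only, so they are estimates of the integrals in the decomposition rather than exact values.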

The Bias-Variance Trade-off

Over-regularized model (large λ) → high bias

Under-regularized model (small λ) → high variance
