
© Deng Cai, College of Computer Science, Zhejiang University


Academic year: 2021


So Far…

Our goal (supervised learning):

 To learn a set of discriminant functions

Bayesian framework

• We could design an optimal classifier if we knew:

• $P(\omega_i)$: priors, and $P(x \mid \omega_i)$: class-conditional densities

• Use training data to estimate $P(\omega_i)$ and $P(x \mid \omega_i)$

• $P(\omega_i \mid x)$ is then computed and used as the discriminant function

Other possible ways?


Linear Methods for Regression

Deng Cai (蔡登)

College of Computer Science Zhejiang University


Classification VS Regression

Both are supervised learning methods

• Goal: learn a mapping from inputs $\mathbf{x}$ to outputs $y$

Classification (categorization, decision making, …)

• $y$ is a categorical variable

Regression

• $y$ is real-valued

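Linear model

Sample: $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{x} = (x_1, x_2, \cdots, x_p)^T$

Finds a linear function $f(\mathbf{x}) = a_0 + \sum_{j=1}^{p} a_j x_j$, with $\mathbf{a} = (a_1, a_2, \cdots, a_p)^T \in \mathbb{R}^p$, $a_0 \in \mathbb{R}$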

Polynomial Curve Fitting

0th Order Polynomial [figure]

1st Order Polynomial [figure]

3rd Order Polynomial [figure]

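Sum-of-Squares Error Function

$$E(\mathbf{a}) = \frac{1}{2} \sum_{i=1}^{n} \bigl( f(x_i, \mathbf{a}) - y_i \bigr)^2$$

Training data:

• $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$

To learn the $\mathbf{a}$ which minimizes $E(\mathbf{a})$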

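Polynomial Curve Fitting → Linear Regression

$$f(x, \mathbf{a}) = a_0 + a_1 x + a_2 x^2 + \cdots + a_M x^M$$

$$\mathbf{x} = (1, x, x^2, \cdots, x^M)^T$$

$$\mathbf{a} = (a_0, a_1, a_2, \cdots, a_M)^T$$

$$f(x, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$$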
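The reduction from polynomial curve fitting to linear regression can be sketched in NumPy: each scalar input is expanded into the feature vector (1, x, x^2, ..., x^M), after which the model is linear in the coefficients. The helper name `polynomial_features` and the coefficient values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map scalar inputs x (shape (n,)) to rows (1, x, x^2, ..., x^degree)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0])
Phi = polynomial_features(x, degree=3)   # shape (3, 4), one row per sample
a = np.array([1.0, -2.0, 0.0, 3.0])      # hypothetical coefficients a_0..a_3
f = Phi @ a                              # f(x, a) = a^T x for every sample
```

The model stays nonlinear in x but is linear in a, which is why the least-squares machinery for linear regression applies unchanged.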

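Linear Regression Model

Training data: $(\mathbf{x}_i, y_i)$, $i = 1, \cdots, n$

• $a_1, \cdots, a_p$ and $a_0$: unknown parameters or coefficients of $f(\mathbf{x}) = a_0 + \sum_{j=1}^{p} a_j x_j$

• $\mathbf{x}$: feature vector, the outcome of feature extraction

Minimize the mean-squared error: $\frac{1}{n} \sum_{i=1}^{n} \bigl( f(\mathbf{x}_i) - y_i \bigr)^2$

Equivalently, minimize the residual sum of squares: $\sum_{i=1}^{n} \bigl( y_i - f(\mathbf{x}_i) \bigr)^2$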

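The MSE Criterion

MSE Criterion: minimize the sum of squared differences between $\mathbf{a}^T \mathbf{x}_i$ and $y_i$

$$J(\mathbf{a}) = \sum_{i=1}^{n} \bigl( \mathbf{a}^T \mathbf{x}_i - y_i \bigr)^2$$

Using matrix notation for convenience (the intercept is absorbed into $\mathbf{a}$ by appending a constant 1 to each $\mathbf{x}_i$)

$$X = [\mathbf{x}_1, \cdots, \mathbf{x}_n], \quad \mathbf{y} = (y_1, \cdots, y_n)^T, \quad J(\mathbf{a}) = \lVert X^T \mathbf{a} - \mathbf{y} \rVert^2$$

How do we optimize it (find the optimal solution)?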

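Optimizing the MSE Criterion

Computing the gradient gives:

$$\nabla J(\mathbf{a}) = 2 X (X^T \mathbf{a} - \mathbf{y})$$

Setting the gradient to zero,

$$X X^T \mathbf{a} = X \mathbf{y} \quad \Rightarrow \quad \mathbf{a} = (X X^T)^{-1} X \mathbf{y}$$

Any problems?

What is the rank of the matrix $X X^T$?

The solution for $\mathbf{a}$ can be obtained uniquely if $X X^T$ is non-singular.

The fitted values at the training inputs are

$$\hat{\mathbf{y}} = X^T \mathbf{a} = X^T (X X^T)^{-1} X \mathbf{y}$$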
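A minimal NumPy sketch of the least-squares solution, assuming the data matrix X stores one sample per column (X = [x_1, ..., x_n]) so that the normal equations read X X^T a = X y. The synthetic data and `a_true` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 50
X = rng.normal(size=(p, n))                 # columns are samples (assumed layout)
a_true = np.array([1.5, -2.0, 0.5])         # hypothetical ground-truth coefficients
y = X.T @ a_true + 0.1 * rng.normal(size=n)

G = X @ X.T                                 # p x p matrix X X^T
assert np.linalg.matrix_rank(G) == p        # a unique solution needs full rank
a_hat = np.linalg.solve(G, X @ y)           # solves X X^T a = X y
y_hat = X.T @ a_hat                         # fitted values at the training inputs
residual = y - y_hat
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly. At the solution the residual is orthogonal to every feature row of X, which is the geometric picture of least squares as an orthogonal projection.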

Geometry of least-squares fitting

Figure 2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector $\mathbf{y}$ is orthogonally projected onto the hyperplane spanned by the input vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. The projection $\hat{\mathbf{y}}$ represents the vector of the least squares predictions.

Statistical model of regression

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$

• $f(\mathbf{x}, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$ is a deterministic function

• $\epsilon$ is random noise; it represents things we cannot capture with $f$, e.g. $\epsilon \sim N(0, \sigma^2)$

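Statistical model of regression

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$. Assume: $\epsilon \sim N(0, \sigma^2)$

$p(y \mid \mathbf{x}, \mathbf{a}, \sigma) = ?$

$$p(y \mid \mathbf{x}, \mathbf{a}, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \bigl( y - f(\mathbf{x}, \mathbf{a}) \bigr)^2 \right)$$

Likelihood of predictions

• The probability of observing the outputs $\mathbf{y}$ in $D$ given $\mathbf{a}$, $X$, and $\sigma$.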

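Maximum Likelihood Estimation

Likelihood of predictions

$$L(D, \mathbf{a}, \sigma) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

Maximum likelihood estimation of parameters

• Parameters maximizing the likelihood of predictions

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

Log-likelihood: $l(D, \mathbf{a}, \sigma) = \log L(D, \mathbf{a}, \sigma)$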

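Maximum Likelihood Estimation

Log-likelihood

$$l(D, \mathbf{a}, \sigma) = \log L(D, \mathbf{a}, \sigma) = \log \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma) = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$l(D, \mathbf{a}, \sigma) = \sum_{i=1}^{n} \left[ -\frac{1}{2\sigma^2} \bigl( y_i - f(\mathbf{x}_i, \mathbf{a}) \bigr)^2 - \log\bigl( \sqrt{2\pi}\,\sigma \bigr) \right]$$

$$l(D, \mathbf{a}, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - f(\mathbf{x}_i, \mathbf{a}) \bigr)^2 - n \log\bigl( \sqrt{2\pi}\,\sigma \bigr)$$

The last term does not depend on $\mathbf{a}$, so maximizing the likelihood over $\mathbf{a}$ is the same as minimizing the sum of squared errors.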
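As a numerical check of the link between maximum likelihood and least squares, the sketch below evaluates the Gaussian log-likelihood at the least-squares solution and at a perturbed coefficient vector. It assumes a linear model with one sample per column of X; the data, the true coefficients, and the perturbation are invented for illustration.

```python
import numpy as np

def log_likelihood(a, sigma, X, y):
    """Gaussian log-likelihood of y given X (samples as columns), a, and sigma."""
    r = y - X.T @ a
    n = y.size
    return -0.5 * np.sum(r**2) / sigma**2 - n * np.log(np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200))
y = X.T @ np.array([0.7, -1.2]) + 0.3 * rng.normal(size=200)

a_ls = np.linalg.solve(X @ X.T, X @ y)               # least-squares solution
sigma_ml = np.sqrt(np.mean((y - X.T @ a_ls) ** 2))   # ML estimate of sigma

best = log_likelihood(a_ls, sigma_ml, X, y)
perturbed = log_likelihood(a_ls + np.array([0.05, -0.05]), sigma_ml, X, y)
```

For any fixed sigma, moving away from the least-squares coefficients can only lower the log-likelihood, because the only term that depends on a is the negative sum of squared residuals.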

Sum-of-Squares Error Function

$$E(\mathbf{a}) = \frac{1}{2} \sum_{i=1}^{n} \bigl( f(x_i, \mathbf{a}) - y_i \bigr)^2$$

0th Order Polynomial [figure]

1st Order Polynomial [figure]

3rd Order Polynomial [figure]

9th Order Polynomial [figure]

Over-fitting
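Over-fitting can be reproduced with a small experiment: fit polynomials of order 0, 1, 3, and 9 to ten noisy samples and compare the error on the training points with the error on fresh data. The target sin(2πx), the noise level 0.3, and the sample sizes are assumptions made for this sketch, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=10)
x_test = np.linspace(0.0, 1.0, 200)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=200)

def rms(coeffs, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

train_rms, test_rms = {}, {}
for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_rms[degree] = rms(coeffs, x_train, y_train)
    test_rms[degree] = rms(coeffs, x_test, y_test)
```

The training error can only decrease as the order grows, and with ten points a 9th order polynomial interpolates them. The error on new data does not share that guarantee, which is the gap that over-fitting refers to.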

Issues with MSE Criterion

The solution for $\mathbf{a}$ can be obtained uniquely if $X X^T$ is non-singular.

If $X X^T$ is singular, the solution is not unique and the fit tends to over-fit.

Coefficients [table of fitted polynomial coefficients]

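Ridge Regression

How to control the size of the coefficients in regression?

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 + \lambda \sum_{j=1}^{p} a_j^2$$

Related terms: local smoothness, weight decay

Equivalent formulation

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} a_j^2 \le t$$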

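Ridge Regression

$$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 + \lambda \sum_{j=1}^{p} a_j^2$$

• Matrix notation: $J(\mathbf{a}) = \lVert X^T \mathbf{a} - \mathbf{y} \rVert^2 + \lambda \lVert \mathbf{a} \rVert^2$

Computing the gradient gives:

$$\nabla J(\mathbf{a}) = 2 X (X^T \mathbf{a} - \mathbf{y}) + 2 \lambda \mathbf{a}$$

Setting the gradient to zero,

$$\mathbf{a}^* = (X X^T + \lambda I)^{-1} X \mathbf{y}$$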
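A minimal sketch of the ridge closed form, assuming one sample per column of X so that the solution is (X X^T + λI)^{-1} X y. The function name `ridge_fit` and the random data are hypothetical. The example deliberately uses more features than samples, a case where X X^T is singular and plain least squares has no unique solution.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X X^T + lam I)^{-1} X y; columns of X are samples."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 5))      # p = 10 features, n = 5 samples: X X^T is singular
y = rng.normal(size=5)

rank = np.linalg.matrix_rank(X @ X.T)   # at most n = 5 < p
a_small = ridge_fit(X, y, lam=0.1)
a_large = ridge_fit(X, y, lam=10.0)
```

Adding λI makes the system solvable for any λ > 0, and a larger λ yields coefficients with a smaller norm.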

Polynomial Coefficients [table of fitted coefficients]

9th Order Polynomial [figure]

Regularization (…) [figure]

Regularization (…) [figure]

Regularization

Simple model

Complicated model

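Maximum Likelihood Estimation

A generative model: $y = f(\mathbf{x}, \mathbf{a}) + \epsilon$. Assume: $\epsilon \sim N(0, \sigma^2)$

$$L(D, \mathbf{a}, \sigma) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} L(D, \mathbf{a}, \sigma) = \arg\max_{\mathbf{a}} \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{a}, \sigma)$$

$$l(D, \mathbf{a}, \sigma) = \log L(D, \mathbf{a}, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - f(\mathbf{x}_i, \mathbf{a}) \bigr)^2 - n \log\bigl( \sqrt{2\pi}\,\sigma \bigr)$$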

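Bayesian Linear Regression

Bayes rule

$$p(\mathbf{a} \mid D) = \frac{p(D \mid \mathbf{a}) \, p(\mathbf{a})}{p(D)}$$

where $p(\mathbf{a})$ is the prior, $p(D \mid \mathbf{a})$ is the likelihood, and $p(\mathbf{a} \mid D)$ is the posterior.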

Bayesian Linear Regression

A common choice for the prior is a zero-mean Gaussian on $\mathbf{a}$

Posterior ∝ likelihood × prior
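The connection to ridge regression can be worked out in a few lines. The derivation below is a sketch under stated assumptions: Gaussian noise with variance $\sigma^2$, a linear model $f(\mathbf{x}, \mathbf{a}) = \mathbf{a}^T \mathbf{x}$, and an isotropic prior $p(\mathbf{a}) = N(\mathbf{a} \mid \mathbf{0}, \tau^2 I)$. The symbol $\tau$ is introduced here for illustration and does not come from the slides.

```latex
\begin{aligned}
\log p(\mathbf{a} \mid D)
  &= \log p(D \mid \mathbf{a}) + \log p(\mathbf{a}) + \text{const} \\
  &= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2
     - \frac{1}{2\tau^2} \lVert \mathbf{a} \rVert^2 + \text{const} \\[4pt]
\mathbf{a}_{\text{MAP}}
  &= \arg\max_{\mathbf{a}} \log p(\mathbf{a} \mid D)
   = \arg\min_{\mathbf{a}} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2
     + \frac{\sigma^2}{\tau^2} \lVert \mathbf{a} \rVert^2
\end{aligned}
```

Under these assumptions the posterior mode coincides with the ridge solution for $\lambda = \sigma^2 / \tau^2$: a tighter prior (smaller $\tau$) corresponds to stronger regularization.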

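A more general regularizer

Ridge Regression

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} a_j^2 \le t$$

More general regularizer

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j|^q \le t$$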

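LASSO

Least Absolute Shrinkage and Selection Operator

Sparse model

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j| \le t$$

Equivalent (Lagrangian) form

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 + \lambda \sum_{j=1}^{p} |a_j|$$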

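LASSO: Sparse Model

Ridge regression VS. LASSO

• Why LASSO → sparse model?

Ridge regression:

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} a_j^2 \le t$$

LASSO:

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |a_j| \le t$$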

LASSO: Sparse Model

The normal cones to the feasible set $\Omega$ at the corner points (marked in the figure) contain infinitely many rays, while they reduce to a single ray at the other boundary points.

The first-order (necessary and sufficient) condition states that a feasible point is the optimum if and only if the negative gradient of the objective function falls inside the normal cone to the feasible set at that point.

Thus, the optimum is more likely to fall at the points with "larger" normal cones.

This also explains why non-convex …

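LASSO Solution

Convex optimization → coordinate descent

Single predictor (feature) setting

$$\hat{a} = \arg\min_{a} \; \frac{1}{2n} \sum_{i=1}^{n} (y_i - a x_i)^2 + \lambda |a|$$

with standardized data: $\sum_{i} y_i = 0$, $\sum_{i} x_i = 0$, $\frac{1}{n} \sum_{i} x_i^2 = 1$

LASSO Solution

Expanding the objective and using $\frac{1}{n} \sum_{i} x_i^2 = 1$:

$$J(a) = \frac{1}{2n} \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \, a + \frac{1}{2} a^2 + \lambda |a|$$

LASSO Solution

For $a > 0$: $J'(a) = a - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle + \lambda = 0$ gives $a = \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle - \lambda$, valid when $\frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle > \lambda$.

For $a < 0$: $J'(a) = a - \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle - \lambda = 0$ gives $a = \frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle + \lambda$, valid when $\frac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle < -\lambda$.

Otherwise $a = 0$.

$$\hat{a} = S_\lambda\!\left( \tfrac{1}{n} \langle \mathbf{x}, \mathbf{y} \rangle \right), \qquad S_\lambda(z) = \operatorname{sign}(z) \, (|z| - \lambda)_+$$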
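The single-predictor LASSO solution is a soft-thresholding of the inner product between the feature and the response. The sketch below assumes the objective is scaled as (1/2n)·Σ(y_i − a·x_i)² + λ|a| and that x is standardized to mean 0 with (1/n)·Σx_i² = 1; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_1d(x, y, lam):
    """Minimizer of (1/2n) * sum((y - a*x)^2) + lam*|a| for standardized x."""
    n = x.size
    return soft_threshold(x @ y / n, lam)

rng = np.random.default_rng(3)
x = rng.normal(size=100)
x = (x - x.mean()) / x.std()          # mean 0 and (1/n) * sum(x^2) = 1
y = 0.8 * x + 0.5 * rng.normal(size=100)
y = y - y.mean()

a_hat = lasso_1d(x, y, lam=0.2)
a_zero = lasso_1d(x, y, lam=5.0)      # a large lam shrinks the coefficient to exactly 0
```

The threshold produces exact zeros rather than merely small values, which is the mechanism behind the sparsity of LASSO solutions.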

LASSO Solution

Convex optimization → coordinate descent

Single predictor (feature) setting

Multiple predictors (features)

• Cyclic coordinate descent

$$\arg\min_{\mathbf{a}} \frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \mathbf{a}^T \mathbf{x}_i \bigr)^2 + \lambda \sum_{j=1}^{p} |a_j|$$
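Cyclic coordinate descent applies the single-predictor update to one coefficient at a time while holding the others fixed. The following is a minimal sketch, assuming the objective (1/2n)·‖y − X^T a‖² + λ‖a‖₁ with X stored as one row per feature and one column per sample; the function names, the fixed number of sweeps, and the synthetic data are assumptions for illustration rather than the lecture's implementation.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)*||y - X^T a||^2 + lam*||a||_1.

    X has shape (p, n): one row per feature, one column per sample."""
    p, n = X.shape
    a = np.zeros(p)
    r = y.astype(float).copy()            # residual y - X^T a (a starts at zero)
    scale = np.sum(X**2, axis=1) / n      # (1/n) * ||x_j||^2, equals 1 if standardized
    for _ in range(n_sweeps):
        for j in range(p):
            r += a[j] * X[j]              # remove feature j's contribution
            a[j] = soft_threshold(X[j] @ r / n, lam) / scale[j]
            r -= a[j] * X[j]              # add the updated contribution back
    return a

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 200))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
a_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # hypothetical sparse coefficients
y = X.T @ a_true + 0.1 * rng.normal(size=200)
y = y - y.mean()

a_hat = lasso_cd(X, y, lam=0.3)
```

Keeping the residual up to date makes each coordinate update cost O(n). A practical implementation would stop on a convergence criterion instead of running a fixed number of sweeps.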

Model Assessment and Selection

The generalization performance of a learning method

Model selection:

• estimating the performance of different models in order to choose the best one.

Model assessment:

• having chosen a final model, estimating its prediction error (generalization error) on new data.


Bias & Variance Decomposition

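The Supervised Learning Problem

Given example pairs $(\mathbf{x}, y)$

Learn a function $f(\mathbf{x})$ such that $f(\mathbf{x}) = y$

Loss: $L(y, f(\mathbf{x}))$

Expected Loss:

$$E[L] = \iint L(y, f(\mathbf{x})) \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

Squared loss: $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$

Expected Prediction Error:

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$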

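Squared loss

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

$$(y - f(\mathbf{x}))^2 = \bigl( y - E[y \mid \mathbf{x}] + E[y \mid \mathbf{x}] - f(\mathbf{x}) \bigr)^2$$

$$= \bigl( y - E[y \mid \mathbf{x}] \bigr)^2 + 2 \bigl( y - E[y \mid \mathbf{x}] \bigr) \bigl( E[y \mid \mathbf{x}] - f(\mathbf{x}) \bigr) + \bigl( E[y \mid \mathbf{x}] - f(\mathbf{x}) \bigr)^2$$

The cross term vanishes when averaged over $y$ given $\mathbf{x}$.

Expected Prediction Error:

$$\mathrm{EPE}(f) = \int \bigl( f(\mathbf{x}) - E[y \mid \mathbf{x}] \bigr)^2 p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

The first term is minimized (it becomes zero) when $f(\mathbf{x}) = E[y \mid \mathbf{x}]$.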

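In Reality

Given a training set $D$ that contains $n$ example pairs $(\mathbf{x}, y)$

Learn a function $f(\mathbf{x}; D)$ such that $f(\mathbf{x}; D) = y$

Expected Prediction Error:

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

$f(\mathbf{x}) \rightarrow f(\mathbf{x}; D)$: the learned function depends on the particular training set $D$

$E_D[f(\mathbf{x}; D)]$: the average prediction over training sets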

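In Reality

Expected Prediction Error:

$$\mathrm{EPE}(f) = \int \bigl( f(\mathbf{x}) - E[y \mid \mathbf{x}] \bigr)^2 p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

With $f(\mathbf{x}; D)$ learned from $D$:

$$\int \bigl( f(\mathbf{x}; D) - E[y \mid \mathbf{x}] \bigr)^2 p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$

$$\bigl( f(\mathbf{x}; D) - E[y \mid \mathbf{x}] \bigr)^2 = \bigl( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] + E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \bigr)^2$$

$$= \bigl( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \bigr)^2 + 2 \bigl( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \bigr) \bigl( E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \bigr) + \bigl( E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \bigr)^2$$

Taking the expectation over $D$ (the cross term vanishes):

$$E_D\Bigl[ \bigl( f(\mathbf{x}; D) - E[y \mid \mathbf{x}] \bigr)^2 \Bigr] = \bigl( E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \bigr)^2 + E_D\Bigl[ \bigl( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \bigr)^2 \Bigr]$$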

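Bias-variance Decomposition

$$\mathrm{EPE}(f) = \iint (y - f(\mathbf{x}))^2 \, p(\mathbf{x}, y) \, d\mathbf{x} \, dy$$

Expected prediction error (expected loss) = (bias)² + variance + noise

(bias)²:

$$\int \bigl( E_D[f(\mathbf{x}; D)] - E[y \mid \mathbf{x}] \bigr)^2 p(\mathbf{x}) \, d\mathbf{x}$$

variance:

$$\int E_D\Bigl[ \bigl( f(\mathbf{x}; D) - E_D[f(\mathbf{x}; D)] \bigr)^2 \Bigr] p(\mathbf{x}) \, d\mathbf{x}$$

noise:

$$\int \mathrm{var}[y \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$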

The Bias-Variance Decomposition

Example: 25 data sets generated from a sinusoidal function, varying the degree of regularization, λ. [three figure slides]
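An experiment of this kind can be simulated directly. The sketch below draws 25 data sets from a noisy sinusoid, fits a ridge-regularized model to each, and estimates (bias)² and variance on a grid of test inputs. The 9th order polynomial basis, the noise level, the 25 points per data set, and the two λ values are assumptions of this sketch; the lecture's figures may rely on a different basis and different settings.

```python
import numpy as np

def design(x, degree=9):
    return np.vander(x, N=degree + 1, increasing=True)   # rows (1, x, ..., x^degree)

def bias_variance(lam, n_sets=25, n_points=25, noise=0.3, seed=0):
    """Estimate (bias^2, variance) of ridge polynomial fits to h(x) = sin(2*pi*x)."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 100)
    h = np.sin(2 * np.pi * x_test)                        # E[y | x]
    preds = np.empty((n_sets, x_test.size))
    for d in range(n_sets):
        x = rng.uniform(0.0, 1.0, n_points)
        y = np.sin(2 * np.pi * x) + noise * rng.normal(size=n_points)
        Phi = design(x)                                   # one row per sample
        a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
        preds[d] = design(x_test) @ a
    mean_pred = preds.mean(axis=0)                        # estimate of E_D[f(x; D)]
    bias2 = np.mean((mean_pred - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

b_small, v_small = bias_variance(lam=1e-6)
b_large, v_large = bias_variance(lam=100.0)
```

With a tiny λ the individual fits follow the noise in each data set, which raises the variance. With a large λ the fits are similar to one another but their average is far from the sinusoid, which raises the bias.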

The Bias-Variance Trade-off

Over-regularized model (large λ): high bias

Under-regularized model (small λ): high variance


Cross‐Validation

K‐Fold Cross‐Validation

leave‐one‐out cross‐validation
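K-fold cross-validation can be used to choose the regularization weight λ: the data are split into K folds, each fold is held out once, and the held-out errors are averaged. The sketch below is a minimal illustration with hypothetical helper names, synthetic sinusoidal data, and an arbitrary grid of λ values. Leave-one-out cross-validation is the special case K = n.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Ridge solution for a design matrix Phi with one row per sample."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def k_fold_cv_error(Phi, y, lam, k, seed=0):
    """Average held-out mean squared error of ridge regression over k folds."""
    n = y.size
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)               # all indices not in this fold
        a = ridge_fit(Phi[train], y[train], lam)
        errors.append(np.mean((y[fold] - Phi[fold] @ a) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=40)
Phi = np.vander(x, N=10, increasing=True)             # degree-9 polynomial features

lambdas = [1e-8, 1e-4, 1e-2, 1.0, 100.0]
cv = {lam: k_fold_cv_error(Phi, y, lam, k=5) for lam in lambdas}
best_lam = min(cv, key=cv.get)
loo = k_fold_cv_error(Phi, y, best_lam, k=y.size)     # leave-one-out is K = n
```

Cross-validation used this way performs model selection. Estimating the error of the final chosen model (model assessment) calls for data that played no part in the selection.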
