Deng Cai, College of Computer Science, Zhejiang University
So Far…
Our goal (supervised learning):
To learn a set of discriminant functions
Bayesian framework
We could design an optimal classifier if we knew:
$P(\omega_i)$: priors, and $P(x \mid \omega_i)$: class-conditional densities
Use training data to estimate $P(\omega_i)$ and $P(x \mid \omega_i)$
$P(\omega_i \mid x)$ is computed and used as the discriminant function
Other possible ways?
Linear Methods for Regression
Deng Cai (蔡登)
College of Computer Science Zhejiang University
[email protected]
Classification VS Regression
Both are supervised learning methods
Goal: learn a mapping from inputs to outputs
Classification (categorization, decision making, …)
$y$ is a categorical variable
Regression
$y$ is real-valued
Linear model
Sample: $x \in \mathbb{R}^p$, $x = (x_1, x_2, \cdots, x_p)^T$
Find a linear function $f(x) = a_0 + \sum_{j=1}^{p} a_j x_j$, with $a_0, a_1, \cdots, a_p \in \mathbb{R}$
Polynomial Curve Fitting
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
Sum‐of‐Squares Error Function
$E(w) = \frac{1}{2} \sum_{i=1}^{n} \left( f(x_i, w) - y_i \right)^2$
Training data:
$(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$
To learn: the $w$ which minimizes $E(w)$
Polynomial Curve Fitting as Linear Regression
$f(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$
$\mathbf{x} = (1, x, x^2, \cdots, x^M)^T$
$w = (w_0, w_1, w_2, \cdots, w_M)^T$
$f(x, w) = w^T \mathbf{x}$
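As a minimal sketch of this reduction (assuming NumPy; the data, the noise level, and the helper name `polynomial_features` are illustrative and not from the slides), each scalar input is expanded into the vector (1, x, x², ⋯, x^M), after which the weights come from an ordinary linear least-squares fit:

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x to the row (1, x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)

# Illustrative data: noisy samples of a sinusoid (an assumption for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

Phi = polynomial_features(x, degree=3)        # shape (10, 4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # linear least squares in w
fitted = Phi @ w                              # f(x_i, w) = w^T (1, x_i, x_i^2, x_i^3)
```

The model is nonlinear in x but linear in w, which is why the linear regression machinery applies unchanged.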
Linear Regression Model
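Training data: $(x_i, y_i)$, $i = 1, \cdots, n$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$
$f(x) = \sum_{j=1}^{p} a_j x_j + a_0$
$a_1, \cdots, a_p$ and $a_0$: unknown parameters or coefficients
$x$: feature vector, the outcome of feature extraction.
Minimize the mean-squared error: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$
Minimize the residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$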
The MSE Criterion
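MSE Criterion: Minimize the sum of squared differences between $f(x_i)$ and $y_i$: $J(a) = \sum_{i=1}^{n} (x_i^T a - y_i)^2$
Using matrix notation for convenience (the intercept is absorbed into $a$ by appending a constant 1 to each $x_i$):
$X = [x_1, \cdots, x_n]$, $y = (y_1, \cdots, y_n)^T$, so that $J(a) = \| X^T a - y \|^2$
How to optimize it (find the optimal solution)?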
Optimizing the MSE Criterion
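Computing the gradient of $J(a) = \| X^T a - y \|^2$, with $X = [x_1, \cdots, x_n]$, gives:
$\nabla J(a) = 2 X (X^T a - y)$
Setting the gradient to zero: $X X^T a = X y$, so $a = (X X^T)^{-1} X y$
Any problems?
What is the rank of the matrix $X X^T$?
The solution for $a$ can be obtained uniquely if $X X^T$ is non-singular.
The fitted values at the training inputs are $\hat{y} = X^T a = X^T (X X^T)^{-1} X y$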
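The closed-form solution can be checked numerically. The sketch below assumes NumPy and the convention that the columns of X are the samples (so X is p × n); the synthetic data and the names `a_true`, `a_hat` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 50
X = rng.standard_normal((p, n))            # columns are the samples x_i
a_true = np.array([1.5, -2.0, 0.5])        # hypothetical ground-truth coefficients
y = X.T @ a_true + 0.1 * rng.standard_normal(n)

G = X @ X.T                                # the p x p matrix X X^T
rank = np.linalg.matrix_rank(G)            # equals p when G is non-singular
a_hat = np.linalg.solve(G, X @ y)          # solves X X^T a = X y
y_hat = X.T @ a_hat                        # fitted values at the training inputs
grad = 2 * X @ (X.T @ a_hat - y)           # gradient of J at the solution
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is usually the more stable choice.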
Geometry of least‐squares fitting
Figure 2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.
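The projection picture can be verified directly. In this sketch (assuming NumPy, random illustrative data, and a 2 × N matrix X whose rows are the two predictor vectors), the matrix H below projects y onto the span of x1 and x2, and the residual is orthogonal to both predictors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = rng.standard_normal((2, n))            # rows: predictor vectors x1, x2 in R^n
y = rng.standard_normal(n)                 # outcome vector

H = X.T @ np.linalg.solve(X @ X.T, X)      # projection onto span{x1, x2}
y_hat = H @ y                              # least squares predictions
residual = y - y_hat                       # component of y outside the span
```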
Statistical model of regression
A generative model: $y = f(x, w) + \varepsilon$
$f(x, w)$ is a deterministic function
$\varepsilon$ is random noise; it represents things we cannot capture with $f(x, w)$,
e.g. $\varepsilon \sim N(0, \sigma^2)$
Statistical model of regression
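A generative model: $y = f(x, w) + \varepsilon$. Assume: $\varepsilon \sim N(0, \sigma^2)$
$p(y \mid x, w, \sigma) = ?$
$p(y \mid x, w, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - f(x, w))^2}{2\sigma^2} \right)$
Likelihood of predictions
The probability of observing the outputs $y$ in $D$ given $w$, $x$ and $\sigma$.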
Maximum Likelihood Estimation
Likelihood of predictions
$L(D, w, \sigma) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma)$
Maximum likelihood estimation of parameters
Parameters maximizing the likelihood of predictions
$w^* = \arg\max_w \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma)$
Log-likelihood
Maximum Likelihood Estimation
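Log-likelihood
$l(D, w, \sigma) = \log L(D, w, \sigma) = \log \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma) = \sum_{i=1}^{n} \log p(y_i \mid x_i, w, \sigma)$
$l(D, w, \sigma) = \sum_{i=1}^{n} \left[ -\frac{1}{2\sigma^2} \left( y_i - f(x_i, w) \right)^2 - \log \sqrt{2\pi\sigma^2} \right]$
$l(D, w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i, w) \right)^2 + C(\sigma)$, where $C(\sigma) = -n \log \sqrt{2\pi\sigma^2}$ does not depend on $w$
Maximizing the log-likelihood over $w$ is therefore equivalent to minimizing the sum of squared errors.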
Sum‐of‐Squares Error Function
$E(w) = \frac{1}{2} \sum_{i=1}^{n} \left( f(x_i, w) - y_i \right)^2$
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Issues with MSE Criterion
The solution for $a$ can be obtained uniquely if $X X^T$ is non-singular.
If $X X^T$ is singular: over-fitting
Coefficients
Ridge Regression
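How to control the size of the coefficients in regression?
$a^* = \arg\min_a \sum_{i=1}^{n} (x_i^T a - y_i)^2 + \lambda \sum_{j=1}^{p} a_j^2$
Penalty term $\lambda \sum_j a_j^2$: local smoothness, weight decay
Equivalent formulation
$a^* = \arg\min_a \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} a_j^2 \le t$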
Ridge Regression
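$a^* = \arg\min_a \sum_{i=1}^{n} (x_i^T a - y_i)^2 + \lambda \sum_{j=1}^{p} a_j^2$
Matrix notation: $J(a) = \| X^T a - y \|^2 + \lambda \| a \|^2$, with $X = [x_1, \cdots, x_n]$
Computing the gradient gives:
$\nabla J(a) = 2 X (X^T a - y) + 2 \lambda a$
Setting the gradient to zero: $a^* = (X X^T + \lambda I)^{-1} X y$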
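A short sketch of the ridge solution, assuming NumPy and the same column-per-sample convention for X. The function name `ridge_fit` and the random data are hypothetical. The example deliberately uses more features than samples, so X Xᵀ is singular and the unregularized normal equations have no unique solution, while the ridge system remains solvable:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return a = (X X^T + lam I)^{-1} X y for X of shape (p, n)."""
    p = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(p), X @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))            # p = 5 features, n = 4 samples
y = rng.standard_normal(4)

rank = np.linalg.matrix_rank(X @ X.T)      # less than p: X X^T is singular
a_small = ridge_fit(X, y, lam=0.1)         # weak shrinkage
a_large = ridge_fit(X, y, lam=10.0)        # strong shrinkage, smaller coefficients
```

Increasing λ shrinks the norm of the coefficient vector, which is the sense in which ridge regression controls the size of the coefficients.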
Polynomial Coefficients
9th Order Polynomial
Regularization ( )
Regularization ( )
Regularization
Simple model
Complicated model
Maximum Likelihood Estimation
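A generative model: $y = f(x, w) + \varepsilon$. Assume: $\varepsilon \sim N(0, \sigma^2)$
$L(D, w, \sigma) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma)$
$w^* = \arg\max_w L(D, w, \sigma) = \arg\max_w \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma)$
$l(D, w, \sigma) = \log L(D, w, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i, w) \right)^2 + C(\sigma)$, where $C(\sigma)$ does not depend on $w$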
Bayesian Linear Regression
Bayes rule
$p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}$
posterior: $p(w \mid D)$, likelihood: $p(D \mid w)$, prior: $p(w)$
Bayesian Linear Regression
A common choice for the prior is a zero-mean isotropic Gaussian, $p(w) = N(w \mid 0, \alpha^{-1} I)$
Posterior ∝ likelihood × prior
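The link between this Bayesian view and ridge regression can be written out in a few lines. The derivation below is a sketch under two stated assumptions: Gaussian noise with variance σ², and a zero-mean isotropic Gaussian prior whose precision is denoted α (a symbol chosen here for illustration). Under these assumptions, maximizing the posterior is the same as minimizing a ridge objective with λ = ασ².

```latex
\begin{align*}
p(\mathbf{w} \mid D)
  &\propto \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid f(\mathbf{x}_i, \mathbf{w}), \sigma^2\right)
     \,\mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I\right) \\
-\log p(\mathbf{w} \mid D)
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - f(\mathbf{x}_i, \mathbf{w})\right)^2
     + \frac{\alpha}{2}\, \mathbf{w}^{\top}\mathbf{w} + \text{const} \\
\mathbf{w}_{\text{MAP}}
  &= \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_i - f(\mathbf{x}_i, \mathbf{w})\right)^2
     + \lambda\, \mathbf{w}^{\top}\mathbf{w},
  \qquad \lambda = \alpha \sigma^2
\end{align*}
```

The last line follows from multiplying the negative log-posterior by 2σ², which does not change the minimizer.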
A more general regularizer
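Ridge Regression
$\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} a_j^2 \le t$
More generally, for an exponent $q > 0$:
$\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} |a_j|^q \le t$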
LASSO
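Least Absolute Shrinkage and Selection Operator
Sparse model
$\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} |a_j| \le t$
Equivalent penalized form: $\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2 + \lambda \sum_{j=1}^{p} |a_j|$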
LASSO: Sparse Model
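Ridge regression vs. LASSO
Why does LASSO give a sparse model?
Ridge: $\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} a_j^2 \le t$
LASSO: $\arg\min_a \frac{1}{2} \sum_{i=1}^{n} (x_i^T a - y_i)^2$ subject to $\sum_{j=1}^{p} |a_j| \le t$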
LASSO: Sparse Model
The normal cones to the feasible set at the corner points contain infinitely many rays, while they reduce to a single ray at the other boundary points.
The first-order (necessary and sufficient) condition says that a feasible point is the optimum if and only if the negative gradient of the objective function falls inside the normal cone to the feasible set at that point.
Thus, the optimum is more likely to fall at points with “larger” normal cones.
This also explains why non‐convex
Ω
LASSO Solution
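Convex optimization: coordinate descent
Single predictor (feature) setting
$\arg\min_{a_0, a} \frac{1}{2n} \sum_{i=1}^{n} (y_i - a_0 - x_i a)^2 + \lambda |a|$
Standardize the data: $\frac{1}{n} \sum_{i} y_i = 0$, $\frac{1}{n} \sum_{i} x_i = 0$, $\frac{1}{n} \sum_{i} x_i^2 = 1$
$\arg\min_a \frac{1}{2n} \sum_{i=1}^{n} (y_i - x_i a)^2 + \lambda |a|$
If $\lambda = 0$, $a^* = \frac{1}{n} \langle x, y \rangle$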
LASSO Solution
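Single predictor (feature) setting
$\frac{1}{n} \sum_{i} y_i = 0$, $\frac{1}{n} \sum_{i} x_i = 0$, $\frac{1}{n} \sum_{i} x_i^2 = 1$
$a^* = \arg\min_a J(a)$, with $J(a) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - x_i a)^2 + \lambda |a|$
$J(a) = \frac{1}{2n} \sum_{i} y_i^2 - \frac{1}{n} \langle x, y \rangle\, a + \frac{1}{2n} \left( \sum_{i} x_i^2 \right) a^2 + \lambda |a|$
$J(a) = \frac{1}{2n} \sum_{i} y_i^2 - \frac{1}{n} \langle x, y \rangle\, a + \frac{1}{2} a^2 + \lambda |a|$
The first term does not depend on $a$, so it suffices to minimize $\frac{1}{2} a^2 - \frac{1}{n} \langle x, y \rangle\, a + \lambda |a|$.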
LASSO Solution
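For $a > 0$: $J(a) = \frac{1}{2} a^2 - \left( \frac{1}{n} \langle x, y \rangle - \lambda \right) a + \text{const}$, minimized at $a = \frac{1}{n} \langle x, y \rangle - \lambda$, which is valid only if $\frac{1}{n} \langle x, y \rangle > \lambda$.
For $a < 0$: $J(a) = \frac{1}{2} a^2 - \left( \frac{1}{n} \langle x, y \rangle + \lambda \right) a + \text{const}$, minimized at $a = \frac{1}{n} \langle x, y \rangle + \lambda$, which is valid only if $\frac{1}{n} \langle x, y \rangle < -\lambda$.
Otherwise, $a^* = 0$.
Combining the cases: $a^* = S_\lambda\!\left( \frac{1}{n} \langle x, y \rangle \right)$, where $S_\lambda(z) = \mathrm{sign}(z) \left( |z| - \lambda \right)_+$ is the soft-thresholding operator.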
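The single-predictor solution is small enough to check by brute force. The sketch below assumes NumPy, standardized data as on the slide (zero means, (1/n) Σ xᵢ² = 1), and illustrative synthetic data. The function names `soft_threshold`, `lasso_single`, and `objective` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_single(x, y, lam):
    """Closed-form LASSO coefficient for one standardized predictor."""
    return soft_threshold(x @ y / x.size, lam)

def objective(a, x, y, lam):
    """(1 / 2n) * sum (y_i - x_i a)^2 + lam * |a|."""
    return np.sum((y - a * x) ** 2) / (2 * x.size) + lam * abs(a)

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
x = (x - x.mean()) / x.std()               # mean 0 and (1/n) sum x_i^2 = 1
y = 0.8 * x + 0.3 * rng.standard_normal(n)
y = y - y.mean()                           # mean 0

a_hat = lasso_single(x, y, lam=0.2)        # shrunk toward zero by lam
a_zero = lasso_single(x, y, lam=5.0)       # large lam sets the coefficient to 0
```

A grid search over `objective` should land on the same value as `lasso_single`, up to the grid resolution.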
LASSO Solution
Convex optimization: coordinate descent
Single predictor (feature) setting
Multiple predictors (features)
Cyclic coordinate descent
$\arg\min_a \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} a_j \right)^2 + \lambda \sum_{j=1}^{p} |a_j|$
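With several predictors, cyclic coordinate descent repeats the single-predictor update for one coefficient at a time: remove feature j from the current fit, soft-threshold its correlation with that partial residual, and put it back. The sketch below assumes NumPy, standardized features stored as the rows of a p × n matrix X, and a fixed number of sweeps instead of a convergence test. The name `lasso_cd` and the synthetic data are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)||y - X^T a||^2 + lam * sum |a_j|.

    X has shape (p, n); each row is a feature with mean 0 and mean square 1.
    """
    p, n = X.shape
    a = np.zeros(p)
    residual = y.copy()                    # equals y - X^T a for a = 0
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[j] * a[j]        # partial residual without feature j
            a[j] = soft_threshold(X[j] @ residual / n, lam)
            residual -= X[j] * a[j]        # restore the full residual
    return a

rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.standard_normal((p, n))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
a_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # hypothetical sparse coefficients
y = X.T @ a_true + 0.5 * rng.standard_normal(n)
y = y - y.mean()

lam = 0.3
a_hat = lasso_cd(X, y, lam)
```

Keeping the residual up to date makes each coordinate update cost O(n) rather than O(np).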
Model Assessment and Selection
The generalization performance of a learning method
Model selection:
estimating the performance of different models in order to choose the best one.
Model assessment:
having chosen a final model, estimating its prediction error (generalization error) on new data.
Bias & Variance Decomposition
The Supervised Learning Problem
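Given example pairs $(x, y)$
Learn a function $f(x)$ such that $f(x) = y$
Loss: $L(y, f(x))$
Expected Loss:
$E[L] = \iint L(y, f(x))\, p(x, y)\, dx\, dy$
Squared loss: $L(y, f(x)) = (y - f(x))^2$
Expected Prediction Error:
$\mathrm{EPE}(f) = \iint (y - f(x))^2\, p(x, y)\, dx\, dy$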
Squared loss
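$(y - f(x))^2 = \left( y - E[y \mid x] + E[y \mid x] - f(x) \right)^2$
$= (y - E[y \mid x])^2 + 2 (y - E[y \mid x]) (E[y \mid x] - f(x)) + (E[y \mid x] - f(x))^2$
Expected Prediction Error (the cross term integrates to zero):
$\mathrm{EPE}(f) = \int (f(x) - E[y \mid x])^2\, p(x)\, dx + \int \mathrm{var}[y \mid x]\, p(x)\, dx$
The first term is minimized, and vanishes, when $f(x) = E[y \mid x]$.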
In Reality
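Given a training set $D$ that contains $n$ example pairs $(x_i, y_i)$
Learn a function $f(x; D)$ such that $f(x; D) = y$
Expected Prediction Error:
$\mathrm{EPE}(f) = \iint (y - f(x))^2\, p(x, y)\, dx\, dy$
$f(x) \to f(x; D)$: the learned function depends on the training set $D$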
In Reality
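Expected Prediction Error:
$\mathrm{EPE}(f) = \int (f(x) - E[y \mid x])^2\, p(x)\, dx + \int \mathrm{var}[y \mid x]\, p(x)\, dx$
With $f(x; D)$: $\mathrm{EPE} = \int (f(x; D) - E[y \mid x])^2\, p(x)\, dx + \int \mathrm{var}[y \mid x]\, p(x)\, dx$
$(f(x; D) - E[y \mid x])^2 = \left( f(x; D) - E_D[f(x; D)] + E_D[f(x; D)] - E[y \mid x] \right)^2$
$= (f(x; D) - E_D[f(x; D)])^2 + 2 (f(x; D) - E_D[f(x; D)]) (E_D[f(x; D)] - E[y \mid x]) + (E_D[f(x; D)] - E[y \mid x])^2$
Taking the expectation over $D$, the cross term vanishes:
$E_D\!\left[ (f(x; D) - E[y \mid x])^2 \right] = (E_D[f(x; D)] - E[y \mid x])^2 + E_D\!\left[ (f(x; D) - E_D[f(x; D)])^2 \right]$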
Bias‐variance Decomposition
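$\mathrm{EPE} = \iint (y - f(x; D))^2\, p(x, y)\, dx\, dy$
Expected prediction error (expected loss) = (bias)² + variance + noise
(bias)²: $\int (E_D[f(x; D)] - E[y \mid x])^2\, p(x)\, dx$
variance: $\int E_D\!\left[ (f(x; D) - E_D[f(x; D)])^2 \right] p(x)\, dx$
noise: $\int \mathrm{var}[y \mid x]\, p(x)\, dx$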
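The bias and variance terms can be estimated by simulation: draw many training sets, fit the same model to each, and compare the fits with one another and with E[y | x]. The sketch below assumes NumPy, a sinusoidal target with Gaussian noise, and a ridge-regularized model on a constant plus nine Gaussian bumps. The basis, the noise level, and the names `design` and `bias_variance` are illustrative choices, not taken from the slides.

```python
import numpy as np

def design(x, centers, width=0.1):
    """Design matrix (one row per sample): a constant plus Gaussian bumps."""
    bumps = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((x.size, 1)), bumps])

def bias_variance(lam, n_sets=25, n_points=25, noise=0.3, seed=0):
    """Estimate (bias^2, variance) of ridge fits over n_sets training sets."""
    rng = np.random.default_rng(seed)
    centers = np.linspace(0.0, 1.0, 9)
    x_test = np.linspace(0.0, 1.0, 100)
    h = np.sin(2 * np.pi * x_test)             # E[y | x], known in this simulation
    Phi_test = design(x_test, centers)
    preds = np.empty((n_sets, x_test.size))
    for d in range(n_sets):
        x = rng.uniform(0.0, 1.0, n_points)
        y = np.sin(2 * np.pi * x) + noise * rng.standard_normal(n_points)
        Phi = design(x, centers)
        A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
        w = np.linalg.solve(A, Phi.T @ y)      # ridge solution for this data set
        preds[d] = Phi_test @ w
    mean_pred = preds.mean(axis=0)             # estimate of E_D[f(x; D)]
    bias2 = np.mean((mean_pred - h) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_small, var_small = bias_variance(lam=1e-3)   # weak regularization
bias2_large, var_large = bias_variance(lam=100.0)  # strong regularization
```

Both calls use the same seed, so the two values of λ are compared on identical training sets.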
The Bias‐Variance Decomposition
Example: 25 data sets drawn from the sinusoidal function, varying the degree of regularization, λ.
The Bias‐Variance Trade‐off
Over-regularized model (large λ): high bias
Under-regularized model (small λ): high variance