Chih-Jen Lin
Department of Computer Science National Taiwan University
Talk at TWSIAM Annual Meeting, July 24, 2020
Outline
1 Introduction
2 Empirical risk minimization
3 Optimization techniques for machine learning
4 Discussion and conclusions
What is Machine Learning
Extract knowledge from data
Representative tasks: classification, clustering, and others
[Figure: two panels, Classification and Clustering]
Today we will focus on classification
Data Classification
Given training data in different classes (labels known)
Predict test data (labels unknown)
Classic example
1. Find a patient’s blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction
Two main stages: training and testing
Why Is Optimization Used?
Usually the goal of classification is to minimize the number of errors
Therefore, many classification methods solve optimization problems
We will discuss a topic called empirical risk minimization that can connect many classification methods
2 Empirical risk minimization
Minimizing Training Errors
Basically a classification method starts with minimizing the training errors
$$\min_{\text{model}} \ (\text{training errors})$$
That is, all or most training data with labels should be correctly classified by our model
A model can be a decision tree, a neural network, etc.
Minimizing Training Errors (Cont’d)
For simplicity, let’s consider the model to be a vector w
That is, the decision function is $\operatorname{sgn}(w^T x)$
For any data $x$, the predicted label is
$$\begin{cases} 1 & \text{if } w^T x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
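As a minimal sketch of the decision rule above, the following Python function returns the predicted label for a weight vector and a feature vector. The name `predict` and the numbers are illustrative assumptions, not part of the talk.

```python
def predict(w, x):
    """Return 1 if w^T x >= 0 and -1 otherwise."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1


w = [2.0, -1.0]                 # hypothetical model
print(predict(w, [1.0, 1.0]))   # here w^T x = 1
print(predict(w, [0.0, 3.0]))   # here w^T x = -3
```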
Minimizing Training Errors (Cont’d)
The two-dimensional situation
[Figure: circles and triangles in the plane, separated by the line $w^T x = 0$]
This seems to be quite restricted, but practically x is in a much higher dimensional space
Minimizing Training Errors (Cont’d)
To characterize the training error, we need a loss function $\xi(w; x, y)$ for each instance $(x, y)$
Ideally we should use 0–1 training loss:
$$\xi(w; x, y) = \begin{cases} 1 & \text{if } y w^T x < 0, \\ 0 & \text{otherwise} \end{cases}$$
Minimizing Training Errors (Cont’d)
However, this function is discontinuous. The optimization problem becomes difficult
[Figure: the 0–1 loss $\xi(w; x, y)$ plotted against $-y w^T x$]
We can do continuous approximations
Common Loss Functions
Hinge loss (l1 loss)
$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x) \tag{1}$$
Logistic loss
$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}) \tag{2}$$
Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)
SVM and LR are two very fundamental classification methods
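The three losses discussed so far can be compared numerically. The sketch below is only an illustration: the function names are assumptions, and the logistic loss is written in a rearranged but equivalent form to avoid overflow for large margins.

```python
import math


def margin(w, x, y):
    # y * w^T x, the quantity all three losses depend on
    return y * sum(wj * xj for wj, xj in zip(w, x))


def zero_one_loss(w, x, y):
    return 1.0 if margin(w, x, y) < 0 else 0.0


def hinge_loss(w, x, y):
    # Eq. (1): max(0, 1 - y w^T x)
    return max(0.0, 1.0 - margin(w, x, y))


def logistic_loss(w, x, y):
    # Eq. (2): log(1 + exp(-y w^T x)), computed in a stable form
    m = margin(w, x, y)
    return max(0.0, -m) + math.log1p(math.exp(-abs(m)))


w = [1.0, 0.0]  # hypothetical model
for x, y in [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([0.5, 0.0], -1)]:
    print(zero_one_loss(w, x, y), hinge_loss(w, x, y), logistic_loss(w, x, y))
```

Unlike the 0–1 loss, both approximations stay positive for a correctly classified instance whose margin is small.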
Common Loss Functions (Cont’d)
[Figure: $\xi_{L1}$ and $\xi_{LR}$ plotted against $-y w^T x$]
Logistic regression is closely related to SVM
Their performance is usually similar
Common Loss Functions (Cont’d)
However, minimizing training losses may not give a good model for future prediction
Overfitting occurs
Overfitting
See the illustration in the next slide
For classification,
You can easily achieve 100% training accuracy
This is useless
When training a data set, we should
Avoid underfitting: small training error
Avoid overfitting: small testing error
● and ▲: training; ○ and △: testing
Regularization
To minimize the training error we manipulate the w vector so that it fits the data
To avoid overfitting we need a way to make w’s values less extreme.
One idea is to make w values closer to zero
We can add, for example,
$$\frac{w^T w}{2} \quad \text{or} \quad \|w\|_1$$
to the function that is minimized
General Form of Linear Classification
Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$
$l$: # of data, $n$: # of features
$$\min_w f(w), \quad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; x_i, y_i)$$
$w^T w/2$: regularization term
ξ(w; x, y): loss function
C : regularization parameter (chosen by users)
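To make the general form concrete, here is a small sketch that evaluates $f(w)$ for a given loss function. The helper names and the toy data are assumptions chosen for illustration; the hinge loss is used as the example loss.

```python
def hinge_loss(w, x, y):
    # max(0, 1 - y w^T x)
    return max(0.0, 1.0 - y * sum(wj * xj for wj, xj in zip(w, x)))


def objective(w, X, Y, C, loss):
    """f(w) = w^T w / 2 + C * sum_i loss(w; x_i, y_i)."""
    regularization = 0.5 * sum(wj * wj for wj in w)
    training_loss = sum(loss(w, x, y) for x, y in zip(X, Y))
    return regularization + C * training_loss


# Hypothetical data: l = 2 instances, n = 2 features
X = [[2.0, 0.0], [0.5, 0.0]]
Y = [1, -1]
print(objective([1.0, 0.0], X, Y, C=1.0, loss=hinge_loss))
```

Passing the loss as an argument mirrors the point of the slide: SVM and logistic regression differ only in which $\xi$ is plugged in.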
Neural Networks
Recently deep learning (i.e., deep neural networks) has become very popular.
We will explain neural networks using the same empirical risk minimization framework
Among various types of networks, we consider fully-connected feed-forward networks for multi-class classification
Neural Networks (Cont’d)
Our training set includes $(y_i, x_i)$, $i = 1, \ldots, l$.
$x_i \in R^{n_1}$ is the feature vector.
$y_i \in R^K$ is the label vector.
$K$: # of classes
If $x_i$ is in class $k$, then
$$y_i = [\underbrace{0, \ldots, 0}_{k-1}, 1, 0, \ldots, 0]^T \in R^K$$
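The label vector above is often called a one-hot encoding. A minimal sketch, with the function name `one_hot` as an assumption and classes numbered from 1 as on the slide:

```python
def one_hot(k, K):
    """Label vector in R^K for class k (1-based): k-1 zeros, a one, then zeros."""
    if not 1 <= k <= K:
        raise ValueError("class index k must satisfy 1 <= k <= K")
    return [1.0 if j == k else 0.0 for j in range(1, K + 1)]


print(one_hot(2, 4))
```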
Neural Networks (Cont’d)
A neural network maps each feature vector to one of the class labels by the connection of nodes
Between two layers a weight matrix maps input to output
[Figure: a network with three layers of nodes: A1, B1, C1; A2, B2; A3, B3, C3]
Neural Networks (Cont’d)
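The weight matrix $W^m$ at the $m$th layer is
$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}_{n_{m+1} \times n_m}$$
$n_m$: # input features at layer $m$
$n_{m+1}$: # output features at layer $m$, or # input features at layer $m + 1$
$L$: number of layers
$n_1$ = # of features, $n_{L+1}$ = # of classes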
Neural Networks (Cont’d)
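Let $z^m$ be the input of the $m$th layer, $z^1 = x$ and $z^{L+1}$ be the output
From $m$th layer to $(m + 1)$th layer
$$s^m = W^m z^m, \quad z_j^{m+1} = \sigma(s_j^m), \ j = 1, \ldots, n_{m+1},$$
$\sigma(\cdot)$ is the activation function.
We collect all variables:
$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ \vdots \\ \text{vec}(W^L) \end{bmatrix} \in R^n$$
$n$: total # variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$
(the $+1$ in each term presumably counts one bias per output node; the bias vectors are not written out in the layer formulas above)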
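The layer-to-layer rule can be sketched as a short forward pass. This is an illustration under assumptions: the talk does not fix an activation function, so the sigmoid here is a placeholder choice, bias terms are left out to match $s^m = W^m z^m$, and the weights are made-up numbers.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def matvec(W, z):
    # W is a list of rows, so len(W) = n_{m+1} and len(W[0]) = n_m
    return [sum(wjk * zk for wjk, zk in zip(row, z)) for row in W]


def forward(weights, x, activation=sigmoid):
    """Return z^{L+1} given weights [W^1, ..., W^L] and input z^1 = x."""
    z = list(x)
    for W in weights:
        s = matvec(W, z)                    # s^m = W^m z^m
        z = [activation(sj) for sj in s]    # z_j^{m+1} = sigma(s_j^m)
    return z


# Hypothetical network with n_1 = 3, n_2 = 2, n_3 = 2 (so L = 2)
W1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]
W2 = [[1.0, -1.0], [0.5, 0.5]]
print(forward([W1, W2], [1.0, 2.0, 3.0]))
```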
Neural Networks (Cont’d)
We solve the following optimization problem,
$$\min_\theta f(\theta),$$
where
$$f(\theta) = \frac{1}{2} \theta^T \theta + C \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i).$$
$C$: regularization parameter
$z^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $x$.
$\xi(z^{L+1}; x, y)$: loss function. Example:
$$\xi(z^{L+1}; x, y) = \|z^{L+1} - y\|^2$$
Neural Networks (Cont’d)
The formulation is as before, but the loss function is more complicated
Note that we discussed the simplest type of networks
Nowadays people use much more complicated networks
The optimization problem is non-convex
Discussion
We have seen that many classification methods are under the empirical risk minimization framework
We also see that optimization problems must be solved
3 Optimization techniques for machine learning
Optimization Techniques for Machine Learning
Standard optimization packages may be directly applied to machine learning applications
However, efficiency and scalability are issues
Many optimization researchers want to do machine learning
Some are more successful, but some are not
Very often properties from the machine learning side must be considered
I will illustrate this point with some examples
Differences between Optimization and Machine Learning
The two topics may have different focuses. We give the following example
Recall that the optimization problem for empirical risk minimization is
$$\frac{1}{2} w^T w + C \, (\text{sum of training losses})$$
A large $C$ means to fit training data
The optimization problem becomes more difficult
In contrast, if $C \to 0$, $\min_w \frac{1}{2} w^T w$ is easy
Optimization researchers may rush to solve difficult cases of large C
It turns out that C should not be too large
A large C causes severe overfitting and bad accuracy
Thus knowing what is useful and what is not on the machine learning side is very important
Stochastic Gradient for Deep Learning
In optimization, gradient descent is a basic method
But it has slow convergence
So in many application domains higher-order optimization methods (e.g., Newton, quasi Newton) were developed for faster convergence
However, in deep learning people use an even lower-order method: stochastic gradient
Why?
Estimation of the Gradient
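Let us rewrite the objective function as
$$f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
The gradient is
$$\frac{\theta}{C} + \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
Going over all data is time consuming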
Estimation of the Gradient (Cont’d)
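We may use a subset $S$ of data
$$\frac{\theta}{C} + \frac{1}{|S|} \nabla_\theta \sum_{i: i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
This works if data points are under the same distribution
$$E_{y,x}(\nabla_\theta \xi(z^{L+1}; x, y)) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$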
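The subset estimate can be tried on a small problem. The sketch below makes a simplifying assumption: a linear model with logistic loss stands in for the network output $z^{L+1}$, so that the per-instance gradient has a closed form. The data, the function names, and the subset size are illustrative.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def loss_gradient(theta, x, y):
    # Gradient of log(1 + exp(-y theta^T x)) with respect to theta
    m = y * sum(t * xj for t, xj in zip(theta, x))
    coef = -y * sigmoid(-m)
    return [coef * xj for xj in x]


def estimated_gradient(theta, X, Y, C, S):
    """theta / C + (1/|S|) * sum over i in S of the per-instance loss gradients."""
    g = [t / C for t in theta]
    for i in S:
        gi = loss_gradient(theta, X[i], Y[i])
        g = [gj + gij / len(S) for gj, gij in zip(g, gi)]
    return g


rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
Y = [1 if x[0] + x[1] >= 0 else -1 for x in X]
theta = [0.5, -0.5]

full = estimated_gradient(theta, X, Y, 10.0, range(len(X)))              # all l instances
sub = estimated_gradient(theta, X, Y, 10.0, rng.sample(range(len(X)), 100))  # |S| = 100
print(full)
print(sub)
```

Taking $S = \{1, \ldots, l\}$ recovers the full gradient, while the subset version touches only a tenth of the data here.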
Stochastic Gradient Algorithm
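1: Given an initial learning rate $\eta$.
2: while stopping condition is not met do
3: Choose $S \subset \{1, \ldots, l\}$.
4: Calculate
$$\theta \leftarrow \theta - \eta \Big( \frac{\theta}{C} + \frac{1}{|S|} \nabla_\theta \sum_{i: i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i) \Big)$$
5: May adjust the learning rate $\eta$
6: end while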
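The algorithm can be sketched in a few lines of Python. Several choices below are assumptions made only for illustration: a linear model with logistic loss replaces the network, a fixed number of iterations serves as the stopping condition, the learning rate is kept constant (step 5 is skipped), and the data are synthetic.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def objective(theta, X, Y, C):
    # f(theta) = theta^T theta / (2C) + (1/l) * sum of logistic losses
    reg = sum(t * t for t in theta) / (2.0 * C)
    loss = 0.0
    for x, y in zip(X, Y):
        m = y * sum(t * xj for t, xj in zip(theta, x))
        loss += max(0.0, -m) + math.log1p(math.exp(-abs(m)))
    return reg + loss / len(X)


def stochastic_gradient(X, Y, C, eta, batch_size, iterations, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):                       # step 2: fixed iteration budget
        S = rng.sample(range(len(X)), batch_size)     # step 3: choose S
        g = [t / C for t in theta]                    # theta / C
        for i in S:
            m = Y[i] * sum(t * xj for t, xj in zip(theta, X[i]))
            coef = -Y[i] * sigmoid(-m) / len(S)
            g = [gj + coef * xj for gj, xj in zip(g, X[i])]
        theta = [t - eta * gj for t, gj in zip(theta, g)]   # step 4: update
    return theta


# Hypothetical data: the label is the sign of x_1 + x_2
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
Y = [1 if x[0] + x[1] >= 0 else -1 for x in X]

theta = stochastic_gradient(X, Y, C=10.0, eta=0.1, batch_size=20, iterations=200)
print(objective([0.0, 0.0], X, Y, 10.0), objective(theta, X, Y, 10.0))
```

Each iteration costs 20 gradient evaluations instead of 200, which is the whole appeal of the method.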
Issues of Stochastic Gradient Algorithm
People often use the name SGD (stochastic gradient descent) but it is not a descent algorithm
Note that we didn’t (and cannot) do things like line search to ensure the function-value decrease
It’s known that deciding a suitable learning rate is difficult
Too small learning rate: very slow convergence
Too large learning rate: the procedure may diverge
Despite such drawbacks, SG is widely used in deep learning. Why?
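The sensitivity to the learning rate already shows up without any randomness. As a hedged illustration, the sketch below runs plain gradient descent (not stochastic gradient) on the one-dimensional quadratic $f(\theta) = a\theta^2/2$, an assumed toy problem where each step multiplies $\theta$ by $1 - \eta a$.

```python
def gradient_descent(eta, a=10.0, theta0=1.0, steps=50):
    """Minimize f(theta) = a * theta^2 / 2, whose gradient is a * theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * a * theta
    return theta


# Too small, moderate, and too large learning rates for a = 10
for eta in (0.001, 0.05, 0.25):
    print(eta, gradient_descent(eta))
```

With $a = 10$ the iterates shrink only when $|1 - 10\eta| < 1$, so the largest of the three rates makes them grow in magnitude at every step.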
Why Stochastic Gradient Widely Used? I
In machine learning fast final convergence may not be important
An optimal solution $\theta^*$ may not lead to the best model
Further, we don’t need a point close to $\theta^*$. In prediction we find
$$\arg\max_k \ z_k^{L+1}(\theta)$$
A not-so-accurate $\theta$ may be good enough
An illustration
Why Stochastic Gradient Widely Used? II
[Figure: two plots of distance to optimum against time, titled "Slow final convergence" and "Fast final convergence"]
Why Stochastic Gradient Widely Used? III
The special property of data classification is essential
$$E(\nabla_\theta \xi(z^{L+1}; x, y)) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
We can cheaply get a good approximation of the gradient
Indeed stochastic gradient is less used outside machine learning
Why Stochastic Gradient Widely Used? IV
Easy implementation. It’s simpler than methods using, for example, second derivatives
Now for complicated networks, the (subsampled) gradient is calculated by automatic differentiation
Non-convexity plays a role
For convex problems, other methods may have advantages in finding the global minimum more efficiently
But for non-convex problems, efficiently reaching a stationary point is less useful
Why Stochastic Gradient Widely Used? V
A global minimum usually gives a good model (as loss is minimized), but for a stationary point we are less sure
Some variants of SG have been proposed to improve the robustness or the convergence
All these explain why SG is popular for deep learning
Subsampled 2nd-order Method
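Recall for stochastic gradient method, we use
$$E(\nabla_\theta \xi(z^{L+1}; x, y)) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
Can we extend this idea to 2nd derivative?
Yes, Byrd et al. (2011); Martens (2010)
$$E(\nabla^2_{\theta\theta} \xi(z^{L+1}; x, y)) = \frac{1}{l} \nabla^2_{\theta\theta} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i).$$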
Subsampled 2nd-order Method (Cont’d)
We can consider
$$\frac{1}{|S|} \nabla^2_{\theta\theta} \sum_{i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i)$$
in designing subsampled Newton or quasi Newton methods
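A subsampled Hessian can be formed in the same way as a subsampled gradient. The sketch below is an illustration under an assumption: a linear model with logistic loss replaces the network, for which the per-instance Hessian is $\sigma(m)(1 - \sigma(m))\, x x^T$ with $m = y\,\theta^T x$. Only the loss term is computed; the regularization term and the Newton step itself are left out, and all names and data are made up.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def subsampled_hessian(theta, X, Y, S):
    """(1/|S|) * sum over i in S of the Hessian of log(1 + exp(-y_i theta^T x_i))."""
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for i in S:
        m = Y[i] * sum(t * xj for t, xj in zip(theta, X[i]))
        s = sigmoid(m)
        d = s * (1.0 - s) / len(S)          # curvature weight of instance i
        for a in range(n):
            for b in range(n):
                H[a][b] += d * X[i][a] * X[i][b]
    return H


rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
Y = [1 if x[0] + x[1] >= 0 else -1 for x in X]
theta = [0.5, -0.5]

print(subsampled_hessian(theta, X, Y, range(len(X))))                  # all data
print(subsampled_hessian(theta, X, Y, rng.sample(range(len(X)), 100)))  # |S| = 100
```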
4 Discussion and conclusions
Conclusions
Many machine learning methods involve optimization problems
However, designing useful optimization techniques for these applications may not be easy
Incorporating machine learning knowledge is essential