Optimization and Machine Learning



Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at TWSIAM Annual Meeting, July 24, 2020


1 Introduction

2 Empirical risk minimization

3 Optimization techniques for machine learning

4 Discussion and conclusions



1 Introduction


What is Machine Learning

Extract knowledge from data

Representative tasks: classification, clustering, and others

[Figure: two panels, Classification and Clustering]

Today we will focus on classification


Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example

1. Find a patient’s blood pressure, weight, etc.

2. After several years, know if he/she recovers

3. Build a machine learning model

4. New patient: find blood pressure, weight, etc.

5. Prediction

Two main stages: training and testing


Why Is Optimization Used?

Usually the goal of classification is to minimize the number of errors

Therefore, many classification methods solve optimization problems

We will discuss a topic called empirical risk minimization that can connect many classification methods



2 Empirical risk minimization


Minimizing Training Errors

Basically, a classification method starts with minimizing the training errors

$$\min_{\text{model}} \; (\text{training errors})$$

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a neural network, etc.


Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is $\text{sgn}(w^T x)$

For any data $x$, the predicted label is

$$\begin{cases} 1 & \text{if } w^T x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$


Minimizing Training Errors (Cont’d)

The two-dimensional situation

[Figure: circles (◦) and triangles (△) separated by the line $w^T x = 0$]

This seems to be quite restricted, but practically x is in a much higher dimensional space


Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function $\xi(w; x, y)$ for each instance $(x, y)$

Ideally we should use 0–1 training loss:

$$\xi(w; x, y) = \begin{cases} 1 & \text{if } y w^T x < 0, \\ 0 & \text{otherwise} \end{cases}$$


Minimizing Training Errors (Cont’d)

However, this function is discontinuous, so the optimization problem becomes difficult

[Figure: $\xi(w; x, y)$ plotted against $-y w^T x$]

We can do continuous approximations


Common Loss Functions

Hinge loss (l1 loss)

$$\xi_{L1}(w; x, y) \equiv \max(0, 1 - y w^T x) \qquad (1)$$

Logistic loss

$$\xi_{LR}(w; x, y) \equiv \log(1 + e^{-y w^T x}) \qquad (2)$$

Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)

SVM and LR are two very fundamental classification methods
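The three losses discussed so far can be compared on a few points. The following is a minimal sketch, not part of the original slides; the function names and the sample vectors are illustrative assumptions.

```python
import math

def margin(w, x, y):
    """Return y * w^T x for one instance (x, y), with y in {+1, -1}."""
    return y * sum(wj * xj for wj, xj in zip(w, x))

def zero_one_loss(w, x, y):
    """0-1 loss: 1 if y * w^T x < 0 (misclassified), 0 otherwise."""
    return 1.0 if margin(w, x, y) < 0 else 0.0

def hinge_loss(w, x, y):
    """Hinge (l1) loss of Eq. (1): max(0, 1 - y * w^T x)."""
    return max(0.0, 1.0 - margin(w, x, y))

def logistic_loss(w, x, y):
    """Logistic loss of Eq. (2): log(1 + exp(-y * w^T x))."""
    m = margin(w, x, y)
    # Branch on the sign of m so that exp() never receives a large positive argument.
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

# Hypothetical model and instances, chosen only for illustration.
w = [1.0, -2.0]
instances = [([3.0, 1.0], 1), ([1.0, 1.0], 1), ([0.5, 0.0], 1)]
for x, y in instances:
    print(margin(w, x, y), zero_one_loss(w, x, y),
          hinge_loss(w, x, y), logistic_loss(w, x, y))
```

Both continuous losses stay positive for correctly classified points with a small margin, which is what makes them usable surrogates for the 0–1 loss.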


Common Loss Functions (Cont’d)

[Figure: $\xi_{L1}$ and $\xi_{LR}$ plotted against $-y w^T x$]

Logistic regression is closely related to SVM

Their performance is usually similar


Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs



See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy

This is useless

When training a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error


[Figure] ● and ▲: training; ○ and △: testing



To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme.

One idea is to make w values closer to zero

We can add, for example, $w^T w / 2$ or $\|w\|_1$ to the function that is minimized


General Form of Linear Classification

Training data $\{y_i, x_i\}$, $x_i \in R^n$, $i = 1, \ldots, l$, $y_i = \pm 1$

$l$: # of data, $n$: # of features

$$\min_w f(w), \qquad f(w) \equiv \frac{w^T w}{2} + C \sum_{i=1}^{l} \xi(w; x_i, y_i)$$

$w^T w / 2$: regularization term

$\xi(w; x, y)$: loss function

$C$: regularization parameter (chosen by users)
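To make the role of $C$ concrete, the objective can be evaluated directly. This is a small sketch under stated assumptions: the hinge loss is used as $\xi$, and the two-point data set and candidate vectors are made up for illustration.

```python
def hinge(w, x, y):
    """Hinge loss max(0, 1 - y * w^T x), used here as the loss xi."""
    return max(0.0, 1.0 - y * sum(wj * xj for wj, xj in zip(w, x)))

def objective(w, data, C, loss=hinge):
    """f(w) = w^T w / 2 + C * sum_i loss(w; x_i, y_i)."""
    regularization = 0.5 * sum(wj * wj for wj in w)
    return regularization + C * sum(loss(w, x, y) for x, y in data)

# Hypothetical training set: one instance per class.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
w_fit = [1.0, -1.0]   # classifies both points with margin 1, so the loss is 0
w_zero = [0.0, 0.0]   # no regularization cost, but hinge loss 1 on each point

for C in (1.0, 0.1):
    print(C, objective(w_fit, data, C), objective(w_zero, data, C))
```

With $C = 1$ the fitting vector has the lower objective value; with $C = 0.1$ the zero vector does. A smaller $C$ puts more weight on keeping $w$ close to zero.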


Neural Networks

Recently, deep learning (i.e., deep neural networks) has become very popular.

We will explain neural networks using the same empirical risk minimization framework

Among various types of networks, we consider fully-connected feed-forward networks for multi-class classification


Neural Networks (Cont’d)

Our training set includes $(y_i, x_i)$, $i = 1, \ldots, l$.

$x_i \in R^{n_1}$ is the feature vector.

$y_i \in R^K$ is the label vector.

$K$: # of classes

If $x_i$ is in class $k$, then

$$y_i = [\underbrace{0, \ldots, 0}_{k-1}, 1, 0, \ldots, 0]^T \in R^K$$
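The label vector above is what is commonly called a one-hot encoding. A minimal sketch follows; the helper name `one_hot` is an assumption and does not come from the slides.

```python
def one_hot(k, K):
    """Return the label vector for class k (1-based) among K classes:
    k - 1 zeros, then a single 1, then K - k zeros."""
    if not 1 <= k <= K:
        raise ValueError("class index k must satisfy 1 <= k <= K")
    return [1.0 if j == k else 0.0 for j in range(1, K + 1)]

print(one_hot(2, 4))
```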


Neural Networks (Cont’d)

A neural network maps each feature vector to one of the class labels by the connection of nodes

Between two layers a weight matrix maps input to output


















Neural Networks (Cont’d)

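The weight matrix $W^m$ at the $m$th layer is

$$W^m = \begin{bmatrix} w^m_{11} & w^m_{12} & \cdots & w^m_{1 n_m} \\ w^m_{21} & w^m_{22} & \cdots & w^m_{2 n_m} \\ \vdots & \vdots & \ddots & \vdots \\ w^m_{n_{m+1} 1} & w^m_{n_{m+1} 2} & \cdots & w^m_{n_{m+1} n_m} \end{bmatrix}$$

$n_m$: # input features at layer $m$

$n_{m+1}$: # output features at layer $m$, or # input features at layer $m + 1$

$L$: number of layers

$n_1$ = # of features, $n_{L+1}$ = # of classes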


Neural Networks (Cont’d)

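Let $z^m$ be the input of the $m$th layer, with $z^1 = x$ and $z^{L+1}$ the output

From the $m$th layer to the $(m + 1)$th layer

$$s^m = W^m z^m, \qquad z_j^{m+1} = \sigma(s_j^m), \quad j = 1, \ldots, n_{m+1},$$

where $\sigma(\cdot)$ is the activation function.

We collect all variables:

$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ \vdots \end{bmatrix} \in R^n$$

$n$: total # variables $= (n_1 + 1) n_2 + \cdots + (n_L + 1) n_{L+1}$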
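The layer-by-layer mapping can be sketched in a few lines. This is an illustration under assumptions: the sigmoid is picked as the activation $\sigma$ (the slides do not fix one), and the forward pass follows $s^m = W^m z^m$ literally, without bias terms. The variable count follows the slide's formula, whose $+1$ per layer presumably corresponds to a bias term per output node that this forward pass leaves out.

```python
import math

def sigmoid(s):
    """One possible activation function; assumed here, not specified by the slides."""
    return 1.0 / (1.0 + math.exp(-s))

def forward(weights, x, activation=sigmoid):
    """Compute z^{L+1} from z^1 = x.

    weights[m] is a list of rows, with n_{m+1} rows of length n_m each.
    """
    z = list(x)
    for W in weights:
        s = [sum(w_jk * z_k for w_jk, z_k in zip(row, z)) for row in W]  # s^m = W^m z^m
        z = [activation(s_j) for s_j in s]                               # z^{m+1} = sigma(s^m)
    return z

def num_variables(sizes):
    """Total count (n_1 + 1) n_2 + ... + (n_L + 1) n_{L+1} for sizes = [n_1, ..., n_{L+1}]."""
    return sum((sizes[m] + 1) * sizes[m + 1] for m in range(len(sizes) - 1))

# Hypothetical network with n_1 = 4, n_2 = 3, n_3 = 2 and small arbitrary weights.
W1 = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, 0.1, -0.1], [0.2, 0.0, -0.3, 0.1]]
W2 = [[0.5, -0.5, 0.1], [-0.2, 0.3, 0.4]]
print(forward([W1, W2], [1.0, 2.0, 3.0, 4.0]))
print(num_variables([4, 3, 2]))
```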


Neural Networks (Cont’d)

We solve the following optimization problem,

$$\min_\theta f(\theta), \qquad f(\theta) = \frac{1}{2} \theta^T \theta + C \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i).$$

$C$: regularization parameter

$z^{L+1}(\theta) \in R^{n_{L+1}}$: last-layer output vector of $x$.

$\xi(z^{L+1}; x, y)$: loss function. Example:

$$\xi(z^{L+1}; x, y) = \|z^{L+1} - y\|^2$$


Neural Networks (Cont’d)

The formulation is as before, but the loss function is more complicated

Note that we discussed the simplest type of networks

Nowadays people use much more complicated networks

The optimization problem is non-convex



We have seen that many classification methods are under the empirical risk minimization framework

We also see that optimization problems must be solved



3 Optimization techniques for machine learning


Optimization Techniques for Machine Learning

Standard optimization packages may be directly applied to machine learning applications

However, efficiency and scalability are issues

Many optimization researchers want to do machine learning

Some are more successful, but some are not

Very often properties from the machine learning side must be considered

I will illustrate this point by some examples


Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example

Recall that the optimization problem for empirical risk minimization is

$$\min_w \; \frac{1}{2} w^T w + C \, (\text{sum of training losses})$$

A large $C$ means to fit training data

The optimization problem becomes more difficult

In contrast, if $C \to 0$, $\min_w \frac{1}{2} w^T w$ is easy

Optimization researchers may rush to solve difficult cases of large C

It turns out that C should not be too large

A large C causes severe overfitting and bad accuracy

Thus knowing what is useful and what is not on the machine learning side is very important


Stochastic Gradient for Deep Learning

In optimization, gradient descent is a basic method

But it has slow convergence

So in many application domains higher-order optimization methods (e.g., Newton, quasi-Newton) were developed for faster convergence

However, in deep learning people use an even lower-order method: stochastic gradient

Why?


Estimation of the Gradient

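Let us rewrite the objective function as

$$f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

The gradient is

$$\frac{\theta}{C} + \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

Going over all data is time consuming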


Estimation of the Gradient (Cont’d)

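We may use a subset $S$ of data

$$\frac{\theta}{C} + \frac{1}{|S|} \nabla_\theta \sum_{i: i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

This works if data points are under the same distribution

$$E_{y,x}\left(\nabla_\theta \xi(z^{L+1}; x, y)\right) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$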
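As a rough numerical check of the subsampling idea, the sketch below compares the gradient over all data with the gradient over a random subset. It rests on loud assumptions: a linear model with logistic loss stands in for the network loss $\xi(z^{L+1,i}(\theta); x_i, y_i)$, and the synthetic data, the value of $C$, and all function names are hypothetical.

```python
import math
import random

def grad_logistic(w, x, y):
    """Gradient of log(1 + exp(-y * w^T x)) with respect to w."""
    m = y * sum(wj * xj for wj, xj in zip(w, x))
    coef = -y / (1.0 + math.exp(m))
    return [coef * xj for xj in x]

def subsampled_gradient(w, data, S, C):
    """w / C + (1 / |S|) * sum over i in S of the per-instance loss gradient."""
    g = [wj / C for wj in w]
    for i in S:
        x, y = data[i]
        gi = grad_logistic(w, x, y)
        g = [gj + gij / len(S) for gj, gij in zip(g, gi)]
    return g

def make_data(l, seed=0):
    """Synthetic two-feature data; the label is the sign of x_1 - x_2."""
    rng = random.Random(seed)
    data = []
    for _ in range(l):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        data.append((x, 1 if x[0] - x[1] >= 0 else -1))
    return data

data = make_data(1000)
w = [0.3, -0.2]
full = subsampled_gradient(w, data, range(len(data)), C=10.0)   # S = all data
S = random.Random(1).sample(range(len(data)), 50)               # |S| = 50
estimate = subsampled_gradient(w, data, S, C=10.0)
print(full, estimate)
```

The estimate uses 5% of the data per evaluation. How close it lands to the full gradient depends on the subset drawn, which is the trade-off the slide points to.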



Stochastic Gradient Algorithm

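1: Given an initial learning rate $\eta$.

2: while (stopping condition not met) do

3:   Choose $S \subset \{1, \ldots, l\}$.

4:   Calculate $\theta \leftarrow \theta - \eta \left( \frac{\theta}{C} + \frac{1}{|S|} \nabla_\theta \sum_{i: i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i) \right)$

5:   May adjust the learning rate $\eta$

6: end while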
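The algorithm above can be sketched as runnable code. Several choices below are assumptions made for illustration, not taken from the slides: a linear model with logistic loss replaces the network, a fixed iteration budget serves as the stopping condition, the learning rate is kept constant (step 5 is skipped), and the data are synthetic.

```python
import math
import random

def make_data(l, seed=0):
    """Synthetic two-feature data; the label is the sign of x_1 - x_2."""
    rng = random.Random(seed)
    data = []
    for _ in range(l):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        data.append((x, 1 if x[0] - x[1] >= 0 else -1))
    return data

def sgd(data, C, eta=0.5, batch_size=10, iterations=400, seed=0):
    """Stochastic gradient for w^T w / (2C) + (1/l) * sum of logistic losses."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(iterations):                        # step 2: fixed budget
        S = rng.sample(range(len(data)), batch_size)   # step 3: choose S
        g = [wj / C for wj in w]                       # gradient of the regularizer
        for i in S:
            x, y = data[i]
            m = y * sum(wj * xj for wj, xj in zip(w, x))
            coef = -y / (1.0 + math.exp(m))            # logistic-loss derivative
            g = [gj + coef * xj / batch_size for gj, xj in zip(g, x)]
        w = [wj - eta * gj for wj, gj in zip(w, g)]    # step 4: update
    return w

def accuracy(w, data):
    """Fraction of instances whose predicted sign matches the label."""
    hits = sum(1 for x, y in data
               if (1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1) == y)
    return hits / len(data)

data = make_data(200)
w = sgd(data, C=100.0)
print(w, accuracy(w, data))
```

Each iteration touches only `batch_size` instances, so its cost does not grow with the size of the training set.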


Issues of Stochastic Gradient Algorithm

People often use the name SGD (stochastic gradient descent) but it is not a descent algorithm

Note that we didn’t (and cannot) do things like line search to ensure the function-value decrease

It’s known that deciding a suitable learning rate is difficult

Too small learning rate: very slow convergence

Too large learning rate: the procedure may diverge

Despite such drawbacks, SG is widely used in deep learning. Why?
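The sensitivity to the learning rate can be seen even on a toy problem. The sketch below is an assumption-laden illustration: it uses plain (non-stochastic) gradient steps on the one-dimensional function $\theta^2 / 2$, whose gradient is $\theta$, so each step multiplies $\theta$ by $1 - \eta$.

```python
def run_gradient_steps(eta, steps, theta0=1.0):
    """Apply theta <- theta - eta * gradient to f(theta) = theta^2 / 2."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * theta   # the gradient of theta^2 / 2 is theta
    return theta

for eta in (0.01, 0.5, 2.5):
    print(eta, run_gradient_steps(eta, 50))
```

With $\eta = 0.01$ the iterate is still far from the minimizer 0 after 50 steps. With $\eta = 0.5$ it is essentially there. With $\eta = 2.5$ the factor $1 - \eta = -1.5$ has magnitude above one, so the iterates grow without bound.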


Why Is Stochastic Gradient Widely Used? I

In machine learning, fast final convergence may not be important

An optimal solution $\theta^*$ may not lead to the best model

Further, we don’t need a point close to $\theta^*$. In prediction we find

$$\arg\max_k \; z_k^{L+1}(\theta)$$

A not-so-accurate $\theta$ may be good enough

An illustration


Why Is Stochastic Gradient Widely Used? II

[Figure: two panels, Slow final convergence and Fast final convergence]
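The point that prediction only needs the arg max of the last-layer output can be checked with two made-up output vectors. Both vectors below are hypothetical and chosen only to show that rather different outputs can yield the same predicted class.

```python
def predict(z):
    """Return the 1-based class index k that maximizes z_k."""
    return max(range(len(z)), key=lambda k: z[k]) + 1

z_accurate = [0.05, 0.90, 0.05]   # hypothetical output at a near-optimal theta
z_rough = [0.20, 0.55, 0.25]      # hypothetical output at a less accurate theta

print(predict(z_accurate), predict(z_rough))
```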


Why Is Stochastic Gradient Widely Used? III

The special property of data classification is essential

$$E\left(\nabla_\theta \xi(z^{L+1}; x, y)\right) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

We can cheaply get a good approximation of the gradient

Indeed stochastic gradient is less used outside machine learning


Why Is Stochastic Gradient Widely Used? IV

Easy implementation. It’s simpler than methods using, for example, second derivatives

Now for complicated networks, the (subsampled) gradient is calculated by automatic differentiation

Non-convexity plays a role

For convex problems, other methods may possess advantages to more efficiently find the global minimum

But for non-convex problems, efficiency to reach a stationary point is less useful


Why Is Stochastic Gradient Widely Used? V

A global minimum usually gives a good model (as loss is minimized), but for a stationary point we are less sure

Some variants of SG have been proposed to improve the robustness or the convergence

All these explain why SG is popular for deep learning


Subsampled 2nd-order Method

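Recall that for the stochastic gradient method, we use

$$E\left(\nabla_\theta \xi(z^{L+1}; x, y)\right) = \frac{1}{l} \nabla_\theta \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

Can we extend this idea to the 2nd derivative?

Yes, Byrd et al. (2011); Martens (2010)

$$E\left(\nabla^2_{\theta\theta} \xi(z^{L+1}; x, y)\right) = \frac{1}{l} \nabla^2_{\theta\theta} \sum_{i=1}^{l} \xi(z^{L+1,i}(\theta); x_i, y_i)$$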

Subsampled 2nd-order Method (Cont’d)

We can consider

$$\frac{1}{|S|} \nabla^2_{\theta\theta} \sum_{i \in S} \xi(z^{L+1,i}(\theta); x_i, y_i)$$

in designing subsampled Newton or quasi-Newton methods
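For a concrete picture of a subsampled second-order term, the sketch below builds the Hessian of the rewritten objective on a subset $S$. It is an illustration under assumptions: a linear model with logistic loss replaces the network loss, the regularizer contributes $I / C$ as in the rewritten objective, and the function names and data are hypothetical. It is not the method of the cited papers.

```python
import math

def hessian_logistic(w, x, y):
    """Hessian of log(1 + exp(-y * w^T x)) with respect to w: p (1 - p) x x^T."""
    m = y * sum(wj * xj for wj, xj in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-m))
    d = p * (1.0 - p)
    return [[d * xj * xk for xk in x] for xj in x]

def subsampled_hessian(w, data, S, C):
    """I / C + (1 / |S|) * sum over i in S of the per-instance loss Hessian."""
    n = len(w)
    H = [[(1.0 / C if j == k else 0.0) for k in range(n)] for j in range(n)]
    for i in S:
        x, y = data[i]
        Hi = hessian_logistic(w, x, y)
        for j in range(n):
            for k in range(n):
                H[j][k] += Hi[j][k] / len(S)
    return H

# Hypothetical two-instance data set.
data = [([1.0, 0.0], 1), ([0.0, 2.0], -1)]
print(subsampled_hessian([0.0, 0.0], data, [0, 1], C=1.0))
print(subsampled_hessian([0.0, 0.0], data, [0], C=1.0))
```

A subsampled Newton method would combine such a matrix with a gradient to obtain a search direction. Large models avoid forming the matrix explicitly, a detail this sketch leaves out.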



4 Discussion and conclusions



Many machine learning methods involve optimization problems

However, designing useful optimization techniques for these applications may not be easy

Incorporating machine learning knowledge is essential



