Chih-Jen Lin

Department of Computer Science National Taiwan University

Talk at TWSIAM Annual Meeting, July 24, 2020


### Outline

1 Introduction

2 Empirical risk minimization

3 Optimization techniques for machine learning

4 Discussion and conclusions

### What is Machine Learning

Extract knowledge from data

Representative tasks: classification, clustering, and others

(Illustrations: classification and clustering)

Today we will focus on classification

### Data Classification

Given training data in different classes (labels known)

Predict test data (labels unknown)

Classic example:

1. Find a patient’s blood pressure, weight, etc.
2. After several years, know if he/she recovers
3. Build a machine learning model
4. New patient: find blood pressure, weight, etc.
5. Prediction

Two main stages: training and testing

### Why Is Optimization Used?

Usually the goal of classification is to minimize the number of errors. Therefore, many classification methods solve optimization problems

We will discuss a topic called empirical risk minimization that can connect many classification methods

### Outline

1 Introduction

2 Empirical risk minimization

3 Optimization techniques for machine learning

4 Discussion and conclusions

### Minimizing Training Errors

Basically a classification method starts with minimizing the training errors

min_{model} (training errors)

That is, all or most training data with labels should be correctly classified by our model

A model can be a decision tree, a neural network, etc.

### Minimizing Training Errors (Cont’d)

For simplicity, let’s consider the model to be a vector w

That is, the decision function is
sgn(w^{T}x)

For any data point x, the predicted label is

1 if w^{T}x ≥ 0, and −1 otherwise
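A minimal sketch of this decision function in Python; the weight vector and data points below are made-up numbers for illustration:

```python
# Linear decision function sgn(w^T x): predict +1 if w^T x >= 0, else -1.

def predict(w, x):
    """Predicted label for one data point x under model w."""
    inner = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if inner >= 0 else -1

# Two 2-D points on opposite sides of the hyperplane w^T x = 0.
w = [1.0, -1.0]
print(predict(w, [2.0, 1.0]))   # w^T x = 1  >= 0 -> +1
print(predict(w, [1.0, 3.0]))   # w^T x = -2 <  0 -> -1
```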

### Minimizing Training Errors (Cont’d)

The two-dimensional situation

(Illustration: circles and triangles separated by the hyperplane w^{T}x = 0)

This seems to be quite restricted, but practically x is in a much higher dimensional space

### Minimizing Training Errors (Cont’d)

To characterize the training error, we need a loss function ξ(w; x, y) for each instance (x, y)

Ideally we should use the 0–1 training loss:

ξ(w; x, y) = 1 if yw^{T}x < 0, and 0 otherwise

### Minimizing Training Errors (Cont’d)

However, this function is discontinuous. The optimization problem becomes difficult

(Illustration: the 0–1 loss ξ(w; x, y) as a step function of −yw^{T}x)

We can do continuous approximations

### Common Loss Functions

Hinge loss (l1 loss)

ξ_{L1}(w; x, y) ≡ max(0, 1 − yw^{T}x) (1)

Logistic loss

ξ_{LR}(w; x, y) ≡ log(1 + e^{−yw^{T}x}) (2)

Support vector machines (SVM): Eq. (1). Logistic regression (LR): Eq. (2)

SVM and LR are two very fundamental classification methods
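The three losses above can be sketched for a single instance (x, y); the vectors below are illustrative numbers, not from the slides:

```python
import math

def zero_one_loss(w, x, y):
    # 1 if the point is misclassified (y w^T x < 0), else 0
    ywx = y * sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 if ywx < 0 else 0.0

def hinge_loss(w, x, y):
    # Eq. (1): max(0, 1 - y w^T x), used by SVM
    ywx = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - ywx)

def logistic_loss(w, x, y):
    # Eq. (2): log(1 + exp(-y w^T x)), used by logistic regression
    ywx = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-ywx))

w, x = [0.5, -0.5], [2.0, 1.0]    # w^T x = 0.5
print(hinge_loss(w, x, 1))        # max(0, 1 - 0.5) = 0.5
print(zero_one_loss(w, x, -1))    # y w^T x = -0.5 < 0 -> 1.0
```

Note that both continuous losses upper-bound the 0–1 loss, which is why minimizing them tends to reduce the training error.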

### Common Loss Functions (Cont’d)

(Illustration: the losses ξ_{L1} and ξ_{LR} as functions of −yw^{T}x)

Logistic regression is closely related to SVM. Their performance is usually similar

### Common Loss Functions (Cont’d)

However, minimizing training losses may not give a good model for future prediction

Overfitting occurs

### Overfitting

See the illustration in the next slide

For classification, you can easily achieve 100% training accuracy. This is useless

When training a data set, we should

Avoid underfitting: small training error

Avoid overfitting: small testing error

(Illustration of overfitting: circles and squares are training points; triangles are testing points)

### Regularization

To minimize the training error we manipulate the w vector so that it fits the data

To avoid overfitting we need a way to make w’s values less extreme

One idea is to make w values closer to zero

We can add, for example,

w^{T}w/2 or ‖w‖_{1}

to the function that is minimized

### General Form of Linear Classification

Training data {(y_{i}, x^{i})}, x^{i} ∈ R^{n}, i = 1, . . . , l, y_{i} = ±1

l: # of data, n: # of features

min_{w} f(w), f(w) ≡ w^{T}w/2 + C Σ_{i=1}^{l} ξ(w; x^{i}, y_{i})

w^{T}w/2: regularization term

ξ(w; x, y): loss function

C : regularization parameter (chosen by users)
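The general form above can be sketched directly; here the hinge loss is used as the example ξ, and the data are made-up numbers:

```python
# f(w) = w^T w / 2 + C * sum_i xi(w; x^i, y_i), with the hinge loss as xi.

def f(w, X, Y, C):
    reg = 0.5 * sum(wi * wi for wi in w)          # regularization term
    loss = 0.0
    for x, y in zip(X, Y):                        # sum of training losses
        ywx = y * sum(wi * xi for wi, xi in zip(w, x))
        loss += max(0.0, 1.0 - ywx)
    return reg + C * loss

X = [[1.0, 2.0], [-1.0, -1.0]]
Y = [1, -1]
print(f([0.0, 0.0], X, Y, C=1.0))   # w = 0: reg = 0, each hinge loss = 1 -> 2.0
```

A larger C puts more weight on fitting the training data relative to keeping w small.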

### Neural Networks

We all know that recently deep learning (i.e., deep neural networks) is very hot

We will explain neural networks using the same empirical risk minimization framework

Among various types of networks, we consider fully-connected feed-forward networks for multi-class classification

### Neural Networks (Cont’d)

Our training set includes (y_{i}, x^{i}), i = 1, . . . , l

x^{i} ∈ R^{n_{1}} is the feature vector

y_{i} ∈ R^{K} is the label vector

K: # of classes

If x^{i} is in class k, then

y_{i} = [0, . . . , 0, 1, 0, . . . , 0]^{T} ∈ R^{K}, where the single 1 is in position k

### Neural Networks (Cont’d)

A neural network maps each feature vector to one of the class labels by the connection of nodes

Between two layers, a weight matrix maps input to output

(Illustration: a fully-connected feed-forward network; nodes A_{1}, B_{1}, C_{1} in one layer connect to A_{2}, B_{2} in the next, which connect to A_{3}, B_{3}, C_{3})

### Neural Networks (Cont’d)

The weight matrix W^{m} at the mth layer is

W^{m} =
[ w_{11}^{m}         w_{12}^{m}         · · ·   w_{1n_{m}}^{m}        ]
[ w_{21}^{m}         w_{22}^{m}         · · ·   w_{2n_{m}}^{m}        ]
[ ...                ...                ...     ...                   ]
[ w_{n_{m+1}1}^{m}   w_{n_{m+1}2}^{m}   · · ·   w_{n_{m+1}n_{m}}^{m}  ]
∈ R^{n_{m+1}×n_{m}}

n_{m}: # input features at layer m

n_{m+1}: # output features at layer m, or # input features at layer m + 1

L: number of layers

n_{1} = # of features, n_{L+1} = # of classes

### Neural Networks (Cont’d)

Let z^{m} be the input of the mth layer, with z^{1} = x and z^{L+1} the output

From the mth layer to the (m + 1)th layer:

s^{m} = W^{m}z^{m},

z_{j}^{m+1} = σ(s_{j}^{m}), j = 1, . . . , n_{m+1}

σ(·) is the activation function

We collect all variables:

θ = [vec(W^{1}); . . . ; vec(W^{L})] ∈ R^{n}

n: total # variables = (n_{1} + 1)n_{2} + · · · + (n_{L} + 1)n_{L+1}
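The layer-to-layer mapping above can be sketched as a forward pass; the weight matrices below are made-up numbers, and the sigmoid is used as one example of σ:

```python
import math

def sigma(s):
    # example activation function: sigmoid
    return 1.0 / (1.0 + math.exp(-s))

def forward(weights, x):
    """Forward pass: z^1 = x, s^m = W^m z^m, z^{m+1}_j = sigma(s^m_j)."""
    z = x                                   # z^1 = x
    for W in weights:                       # layers m = 1, ..., L
        s = [sum(wij * zj for wij, zj in zip(row, z)) for row in W]
        z = [sigma(sj) for sj in s]         # componentwise activation
    return z                                # z^{L+1}: last-layer output

W1 = [[0.1, -0.2], [0.3, 0.4], [0.0, 0.5]]  # maps R^2 -> R^3
W2 = [[0.2, 0.2, 0.2]]                      # maps R^3 -> R^1
print(forward([W1, W2], [1.0, -1.0]))       # one output value in (0, 1)
```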

### Neural Networks (Cont’d)

We solve the following optimization problem:

min_{θ} f(θ),

where

f(θ) = (1/2)θ^{T}θ + C Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})

C: regularization parameter

z^{L+1}(θ) ∈ R^{n_{L+1}}: last-layer output vector of x

ξ(z^{L+1}; x, y): loss function. Example:

ξ(z^{L+1}; x, y) = ‖z^{L+1} − y‖^{2}

### Neural Networks (Cont’d)

The formulation is as before, but the loss function is more complicated

Note that we discussed the simplest type of networks

Nowadays people use much more complicated networks

The optimization problem is non-convex

### Discussion

We have seen that many classification methods fall under the empirical risk minimization framework

We have also seen that optimization problems must be solved

### Outline

1 Introduction

2 Empirical risk minimization

3 Optimization techniques for machine learning

4 Discussion and conclusions

### Optimization Techniques for Machine Learning

Standard optimization packages may be directly applied to machine learning applications

However, efficiency and scalability are issues

Many optimization researchers want to do machine learning

Some are more successful, but some are not

Very often, properties from the machine learning side must be considered

I will illustrate this point by some examples

### Differences between Optimization and Machine Learning

The two topics may have different focuses. We give the following example

Recall that the optimization problem for empirical risk minimization is

(1/2)w^{T}w + C(sum of training losses)

A large C means to fit the training data

The optimization problem becomes more difficult

In contrast, if C → 0,

min_{w} (1/2)w^{T}w

is easy

Optimization researchers may rush to solve difficult cases of large C

It turns out that C should not be too large

A large C causes severe overfitting and bad accuracy

Thus, knowing what is useful and what is not on the machine learning side is very important

### Stochastic Gradient for Deep Learning

In optimization, gradient descent is a basic method, but it has slow convergence

So in many application domains, higher-order optimization methods (e.g., Newton, quasi-Newton) were developed for faster convergence

However, in deep learning people use an even lower-order method: stochastic gradient

Why?

### Estimation of the Gradient

Let us rewrite the objective function as

f(θ) = (1/(2C))θ^{T}θ + (1/l) Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})

The gradient is

θ/C + (1/l) ∇_{θ} Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})
Going over all data is time consuming

### Estimation of the Gradient (Cont’d)

We may use a subset S of data:

θ/C + (1/|S|) ∇_{θ} Σ_{i∈S} ξ(z^{L+1,i}(θ); x^{i}, y_{i})

This works if data points are under the same distribution:

E_{y,x}(∇_{θ}ξ(z^{L+1}; x, y)) = (1/l) ∇_{θ} Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})
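This averaging argument can be checked numerically; the sketch below uses the logistic loss on a linear model with made-up random data, and compares the subset average gradient to the full average:

```python
import math, random

def grad_xi(w, x, y):
    # gradient of log(1 + exp(-y w^T x)) with respect to w
    ywx = y * sum(wi * xi for wi, xi in zip(w, x))
    coef = -y / (1.0 + math.exp(ywx))
    return [coef * xi for xi in x]

def avg_grad(w, data):
    # (1/|data|) * sum of per-instance gradients
    g = [0.0] * len(w)
    for x, y in data:
        gi = grad_xi(w, x, y)
        g = [a + b for a, b in zip(g, gi)]
    return [a / len(data) for a in g]

random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], random.choice([-1, 1]))
        for _ in range(1000)]
w = [0.5, -0.5]
full = avg_grad(w, data)                       # average over all l points
subset = avg_grad(w, random.sample(data, 100)) # average over a subset S
print(full, subset)                            # the two averages should be close
```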

### Stochastic Gradient Algorithm

1: Given an initial learning rate η
2: while stopping condition is not satisfied do
3:     Choose S ⊂ {1, . . . , l}
4:     θ ← θ − η(θ/C + (1/|S|) ∇_{θ} Σ_{i∈S} ξ(z^{L+1,i}(θ); x^{i}, y_{i}))
5:     May adjust the learning rate η
6: end while
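The loop above can be sketched on a linear logistic model; the toy data, learning rate, batch size, and fixed iteration count below are illustrative choices, not from the slides:

```python
import math, random

def grad_estimate(theta, batch, C):
    # theta/C + (1/|S|) * sum over the batch of per-instance loss gradients
    g = [t / C for t in theta]
    for x, y in batch:
        ytx = y * sum(ti * xi for ti, xi in zip(theta, x))
        coef = -y / (1.0 + math.exp(ytx))   # d/dtheta of log(1 + e^{-y theta^T x})
        g = [gi + coef * xi / len(batch) for gi, xi in zip(g, x)]
    return g

random.seed(0)
# toy data: label is the sign of the first feature
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(200)]
data = [(x, 1 if x[0] >= 0 else -1) for x, _ in data]

theta, eta, C = [0.0, 0.0], 0.5, 1.0
for step in range(200):
    batch = random.sample(data, 10)          # choose S
    g = grad_estimate(theta, batch, C)
    theta = [t - eta * gi for t, gi in zip(theta, g)]

errors = sum(1 for x, y in data
             if y * sum(ti * xi for ti, xi in zip(theta, x)) < 0)
print(theta, errors)   # theta[0] should be positive, matching the labeling rule
```

Note the iterate never exactly decreases f at every step; each update only follows a noisy estimate of the gradient.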

### Issues of Stochastic Gradient Algorithm

People often use the name SGD (stochastic gradient descent) but it is not a descent algorithm

Note that we didn’t (and cannot) do things like line search to ensure the function-value decrease

It’s known that deciding a suitable learning rate is difficult

Too small a learning rate: very slow convergence

Too large a learning rate: the procedure may diverge

Despite such drawbacks, SG is widely used in deep learning. Why?

### Why Stochastic Gradient Widely Used? I

In machine learning, fast final convergence may not be important

An optimal solution θ^{∗} may not lead to the best model

Further, we don’t need a point close to θ^{∗}. In prediction we find

arg max_{k} z_{k}^{L+1}(θ)

A not-so-accurate θ may be good enough

An illustration follows

### Why Stochastic Gradient Widely Used? II

(Illustration: two plots of distance to optimum versus time; left: slow final convergence, right: fast final convergence)

### Why Stochastic Gradient Widely Used? III

The special property of data classification is essential

E(∇_{θ}ξ(z^{L+1}; x, y)) = (1/l) ∇_{θ} Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})

We can cheaply get a good approximation of the gradient

Indeed stochastic gradient is less used outside machine learning

### Why Stochastic Gradient Widely Used? IV

Easy implementation. It’s simpler than methods using, for example, second derivatives

Now for complicated networks, the (subsampled) gradient is calculated by automatic differentiation

Non-convexity plays a role

For convex problems, other methods may possess advantages in more efficiently finding the global minimum

But for non-convex problems, efficiency in reaching a stationary point is less useful

### Why Stochastic Gradient Widely Used? V

A global minimum usually gives a good model (as loss is minimized), but for a stationary point we are less sure

Some variants of SG have been proposed to improve the robustness or the convergence

All these explain why SG is popular for deep learning

### Subsampled 2nd-order Method

Recall that for the stochastic gradient method, we use

E(∇_{θ}ξ(z^{L+1}; x, y)) = (1/l) ∇_{θ} Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); x^{i}, y_{i})

Can we extend this idea to the 2nd derivative? Yes; see Byrd et al. (2011); Martens (2010)

E(∇^{2}_{θθ}ξ(z^{L+1}; y, x)) = (1/l) ∇^{2}_{θθ} Σ_{i=1}^{l} ξ(z^{L+1,i}(θ); y_{i}, x^{i})

### Subsampled 2nd-order Method (Cont’d)

We can consider

(1/|S|) ∇^{2}_{θθ} Σ_{i∈S} ξ(z^{L+1,i}(θ); y_{i}, x^{i})

in designing subsampled Newton or quasi-Newton methods

### Outline

1 Introduction

2 Empirical risk minimization

3 Optimization techniques for machine learning

4 Discussion and conclusions

### Conclusions

Many machine learning methods involve optimization problems

However, designing useful optimization techniques for these applications may not be easy

Incorporating machine learning knowledge is essential