Neural Networks
Chih-Jen Lin
Department of Computer Science National Taiwan University
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Introduction
Training a neural network involves a difficult optimization problem
SG (stochastic gradient) is the major optimization technique for deep learning.
SG is simple and effective, but sometimes not robust (e.g., selecting the learning rate may be difficult).
Is it possible to consider other methods?
In this work, we investigate Newton methods
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Optimization and Neural Networks
In a typical setting, training a neural network is no more than an empirical risk minimization problem
We will show an example using convolutional neural networks (CNN)
CNN is a type of network useful for image classification
Convolutional Neural Networks (CNN)
Consider a K-class classification problem with training data
(y_i, Z^{1,i}), i = 1, . . . , ℓ.
y_i: label vector; Z^{1,i}: input image.
If Z^{1,i} is in class k, then
$$y_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K.$$
CNN maps each image Z1,i to yi
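For concreteness, here is a minimal NumPy sketch of this one-hot label encoding; the helper name one_hot is ours and not part of the slides.

```python
import numpy as np

def one_hot(k, K):
    """Label vector for class k (1-based) among K classes: k-1 zeros, a 1, then zeros."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

print(one_hot(3, 5))  # [0. 0. 1. 0. 0.]
```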
Convolutional Neural Networks (CNN)
Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.
We discuss only convolutional layers.
Input and output of a convolutional layer are assumed to be images.
Convolutional Layers
For the mth layer, let the input be an image of size a^m × b^m × d^m.
a^m: height, b^m: width, and d^m: # channels.
(Figure: an input image with height a^m, width b^m, and d^m channels.)
Convolutional Layers (Cont'd)
Consider d^{m+1} filters.
Each filter includes weights to extract local information.
Filter j ∈ {1, . . . , d^{m+1}} has dimensions h × h × d^m:
$$\begin{bmatrix} w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{1,h,1} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,1} & \cdots & w^{m,j}_{h,h,1} \end{bmatrix}, \ \dots, \ \begin{bmatrix} w^{m,j}_{1,1,d^m} & \cdots & w^{m,j}_{1,h,d^m} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,d^m} & \cdots & w^{m,j}_{h,h,d^m} \end{bmatrix}$$
h: filter height/width (the superscript m of h^m is omitted).
Convolutional Layers (Cont’d)
(Figure: convolving a small input channel produces output entries s^{m,i}_{1,1,j}, s^{m,i}_{1,2,j}, s^{m,i}_{2,1,j}, s^{m,i}_{2,2,j}.)
To compute the jth channel of the output, we scan the input from top-left to bottom-right to obtain the sub-images of size h × h × d^m.
Then we calculate the inner product between each sub-image and the jth filter.
Convolutional Layers (Cont’d)
It's known that convolutional operations can be done by matrix-matrix and matrix-vector operations.
Let's collect the images of all channels as the input
$$Z^{m,i} = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{2,1,1} & \cdots & z^{m,i}_{a^m,b^m,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{1,1,d^m} & z^{m,i}_{2,1,d^m} & \cdots & z^{m,i}_{a^m,b^m,d^m} \end{bmatrix} \in R^{d^m \times a^m b^m}.$$
Convolutional Layers (Cont’d)
Let all the filters
$$W^m = \begin{bmatrix} w^{m,1}_{1,1,1} & w^{m,1}_{2,1,1} & \cdots & w^{m,1}_{h,h,d^m} \\ \vdots & \vdots & \ddots & \vdots \\ w^{m,d^{m+1}}_{1,1,1} & w^{m,d^{m+1}}_{2,1,1} & \cdots & w^{m,d^{m+1}}_{h,h,d^m} \end{bmatrix} \in R^{d^{m+1} \times hhd^m}$$
be the variables (parameters) of the current layer.
Usually a bias term is considered, but we omit it here.
Convolutional Layers (Cont’d)
Operations at a layer:
$$S^{m,i} = W^m \phi(Z^{m,i}), \quad Z^{m+1,i} = \sigma(S^{m,i}),$$
where φ(Z^{m,i}) collects all sub-images in Z^{m,i} into a matrix:
$$\phi(Z^{m,i}) = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{1+s^m,1,1} & \cdots & z^{m,i}_{1+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ z^{m,i}_{2,1,1} & z^{m,i}_{2+s^m,1,1} & \cdots & z^{m,i}_{2+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{h,h,1} & z^{m,i}_{h+s^m,h,1} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & & \vdots \\ z^{m,i}_{h,h,d^m} & z^{m,i}_{h+s^m,h,d^m} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,d^m} \end{bmatrix}$$
(s^m: stride of the convolution).
Convolutional Layers (Cont’d)
σ is an element-wise activation function.
In the matrix-matrix product
$$S^{m,i} = W^m \phi(Z^{m,i}), \qquad (1)$$
each element is the inner product between a filter and a sub-image.
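As an illustration, the following NumPy sketch implements (1) for one input image. It assumes stride 1, no zero-padding, and a channel-first layout; the helper name phi and these assumptions are ours, not from the slides.

```python
import numpy as np

def phi(Z, h):
    """Collect all h x h sub-images of Z (shape: d x a x b, channel first)
    into the columns of a matrix, as phi(Z^{m,i}) does."""
    d, a, b = Z.shape
    a_out, b_out = a - h + 1, b - h + 1            # stride 1, no padding (assumed)
    cols = []
    for j in range(b_out):                         # scan top-left to bottom-right
        for i in range(a_out):
            cols.append(Z[:, i:i+h, j:j+h].reshape(-1))
    return np.array(cols).T                        # (h*h*d) x (a_out*b_out)

# S = W phi(Z): each entry is the inner product of a filter and a sub-image
d, a, b, h, d_out = 3, 5, 5, 3, 4
Z = np.random.randn(d, a, b)
W = np.random.randn(d_out, h * h * d)              # each row is one flattened filter
S = W @ phi(Z, h)                                  # d_out x (a_out * b_out)
print(S.shape)                                     # (4, 9)
```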
Optimization Problem
We collect all weights into a vector variable θ:
$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ \vdots \\ \text{vec}(W^L) \end{bmatrix} \in R^n, \quad n: \text{total \# variables}.$$
The output of the last fully-connected layer L is a vector z^{L+1,i}(θ).
Consider any loss function, such as the squared loss
$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - y_i\|^2.$$
Optimization Problem (Cont’d)
The optimization problem is min_θ f(θ), where
$$f(\theta) = \text{regularization} + \text{losses} = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi_i(\theta),$$
C: regularization parameter.
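A minimal sketch of evaluating f(θ) with the squared loss; here predict(theta, Z) is a hypothetical placeholder for the CNN forward pass producing z^{L+1,i}(θ), and data is a list of (y_i, Z^{1,i}) pairs.

```python
import numpy as np

def f(theta, data, C, predict):
    """f(theta) = theta^T theta / (2C) + (1/l) * sum_i ||z^{L+1,i}(theta) - y_i||^2.
    `predict` stands for the network's forward pass (placeholder)."""
    l = len(data)
    reg = theta @ theta / (2.0 * C)
    loss = sum(np.sum((predict(theta, Z) - y) ** 2) for (y, Z) in data) / l
    return reg + loss
```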
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Mini-batch Stochastic Gradient
We begin by explaining why stochastic gradient (SG) is popular for deep learning
Recall the function is
$$f(\theta) = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
The gradient is
$$\frac{\theta}{C} + \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Mini-batch Stochastic Gradient (Cont’d)
Going over all data is time consuming.
From
$$E\left(\nabla_\theta\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}),$$
we may just use a subset S (called a batch):
$$\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i \in S} \xi(\theta; y_i, Z^{1,i}).$$
Mini-batch SG: Algorithm
1: Given an initial learning rate η.
2: while stopping condition is not satisfied do
3:   Choose S ⊂ {1, . . . , ℓ}.
4:   Calculate
$$\theta \leftarrow \theta - \eta\left(\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i \in S} \xi(\theta; y_i, Z^{1,i})\right)$$
5:   May adjust the learning rate η.
6: end while
But deciding a suitable learning rate may be tricky
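A minimal sketch of the mini-batch SG loop above; grad_loss(theta, batch) is a hypothetical placeholder returning (1/|S|) Σ_{i∈S} ∇_θ ξ(θ; y_i, Z^{1,i}), and the batch size and epoch count are illustrative choices.

```python
import numpy as np

def sgd(theta, data, C, grad_loss, eta=0.01, batch_size=128, epochs=10, rng=None):
    """Mini-batch stochastic gradient on f(theta); `grad_loss` is a placeholder."""
    rng = rng or np.random.default_rng(0)
    l = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(l), max(1, l // batch_size)):
            batch = [data[i] for i in idx]
            g = theta / C + grad_loss(theta, batch)   # stochastic gradient estimate
            theta = theta - eta * g                   # learning rate eta may need tuning
    return theta
```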
Why SG Popular for Deep Learning?
The special property of data classification is essential:
$$E\left(\nabla_\theta\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Indeed, stochastic gradient is less used outside machine learning.
High-order methods with fast final convergence may not be needed in machine learning
An approximate solution may give similar accuracy to the final solution
Why SG Popular for Deep Learning? (Cont'd)
Easy implementation: it's simpler than methods that use, for example, second derivatives.
Non-convexity plays a role.
For convex problems, a global minimum usually gives a good model (the loss is minimized).
Thus we want to efficiently find the global minimum.
But for non-convex problems, efficiency in reaching a stationary point is less useful.
Drawback of SG
Tuning the learning rate is not easy
Thus if we would like to consider other methods, robustness rather than efficiency may be the main reason
Newton Method
Newton method finds a direction d that minimizes the second-order approximation of f(θ):
$$\min_{d} \ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta)\, d. \qquad (2)$$
If ∇²f(θ) is positive definite, (2) is equivalent to solving
$$\nabla^2 f(\theta)\, d = -\nabla f(\theta).$$
Newton Method (Cont’d)
while stopping condition not satisfied do
  Let G be ∇²f(θ) or its approximation
  Exactly or approximately solve
$$G d = -\nabla f(\theta)$$
  Find a suitable step size α (e.g., line search)
  Update
$$\theta \leftarrow \theta + \alpha d$$
end while
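A minimal NumPy sketch of this Newton loop, solving the linear system by conjugate gradient (using only matrix-vector products) and choosing α by backtracking line search. Here f, grad, and Gv (the product with G or its approximation) are hypothetical placeholders, and the tolerances are illustrative.

```python
import numpy as np

def newton(theta, f, grad, Gv, iters=100, cg_iters=50, tol=1e-4):
    """Newton loop: solve G d = -grad(theta) by CG, then backtracking line search.
    f, grad, Gv are placeholders for the objective, gradient, and G-vector product."""
    for _ in range(iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        # Conjugate gradient for G d = -g, starting from d = 0
        d = np.zeros_like(theta)
        r = -g - Gv(theta, d); p = r.copy(); rs = r @ r
        for _ in range(cg_iters):
            Gp = Gv(theta, p)
            alpha = rs / (p @ Gp)
            d += alpha * p; r -= alpha * Gp
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-8:
                break
            p = r + (rs_new / rs) * p; rs = rs_new
        # Backtracking line search for the step size
        step, f0 = 1.0, f(theta)
        while f(theta + step * d) > f0 + 1e-4 * step * (g @ d) and step > 1e-8:
            step *= 0.5
        theta = theta + step * d
    return theta
```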
Hessian may not be Positive Definite
The Hessian of f(θ) is (derivation omitted)
$$\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i + \text{a non-PSD (positive semi-definite) term},$$
I: identity; B^i: simple PSD matrix; J^i: Jacobian of z^{L+1,i}(θ),
$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in R^{n_{L+1} \times n},$$
n_{L+1}: # classes; n: total # variables.
Positive Definite Modification of Hessian
Several strategies have been proposed.
For example, Schraudolph (2002) considered the Gauss-Newton matrix (which is PD):
$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i \approx \nabla^2 f(\theta).$$
Then the Newton linear system becomes
$$G d = -\nabla f(\theta). \qquad (3)$$
Memory Difficulty
The Gauss-Newton matrix G may be too large to be stored
G : # variables × # variables
Many approaches have been proposed (through approximation)
For example, we may store and use only diagonal blocks of G
Memory Difficulty (Cont’d)
Here we try to use the original Gauss-Newton matrix G without aggressive approximation.
Reason: we should first show that for medium-sized data, standard Newton is more robust than SG.
Otherwise, there is no need to develop techniques for large-scale problems.
Hessian-free Newton Method
If G has certain structures, it's possible to use iterative methods (e.g., conjugate gradient) to solve the Newton linear system by a sequence of matrix-vector products
G v₁, G v₂, . . .
without storing G.
This is called Hessian-free in optimization
Hessian-free Newton Method (Cont’d)
The Gauss-Newton matrix is
$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i.$$
The matrix-vector product without explicitly storing G:
$$G v = \frac{1}{C} v + \frac{1}{\ell}\sum_{i=1}^{\ell} \left((J^i)^T \left(B^i (J^i v)\right)\right).$$
Examples of using this setting for deep learning include Martens (2010), Le et al. (2011), and Wang et al. (2018).
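A minimal sketch of this matrix-vector product; jacobian(theta, i) and B(i) are hypothetical placeholders for J^i and B^i, and each J^i is recomputed on the fly instead of forming or storing G.

```python
import numpy as np

def gauss_newton_vec(v, theta, C, indices, jacobian, B):
    """Compute G v = v / C + (1/l) * sum_i J_i^T (B_i (J_i v)) without forming G.
    `jacobian(theta, i)` (returning J_i) and `B(i)` (returning B_i) are placeholders."""
    out = v / C
    for i in indices:
        Ji = jacobian(theta, i)                       # n_{L+1} x n, computed when needed
        out += Ji.T @ (B(i) @ (Ji @ v)) / len(indices)
    return out
```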
Hessian-free Newton Method (Cont’d)
However, for the conjugate gradient process, J^i ∈ R^{n_{L+1} × n}, i = 1, . . . , ℓ, can be too large to be stored (ℓ is # data).
Total memory usage is
n_{L+1} × n × ℓ = # classes × # variables × # data.
Hessian-free Newton Method (Cont’d)
The product involves
$$\sum_{i=1}^{\ell} \left((J^i)^T \left(B^i (J^i v)\right)\right).$$
We can trade time for space: Ji is calculated when needed (i.e., at every matrix-vector product)
On the other hand, we may not need to use all data points to have Ji, ∀i
We will discuss the subsampled Hessian technique
Subsampled Hessian Newton Method
Similar to the gradient, for the Hessian we have
$$E\left(\nabla^2_{\theta\theta}\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla^2_{\theta\theta} \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Thus we can approximate the Gauss-Newton matrix by a subset of data.
This is the subsampled Hessian Newton method (Byrd et al., 2011; Martens, 2010; Wang et al., 2015)
Subsampled Hessian Newton Method
We select a subset S ⊂ {1, . . . , ℓ} and have
$$G^S = \frac{1}{C} I + \frac{1}{|S|}\sum_{i \in S} (J^i)^T B^i J^i \approx G.$$
The cost of storing J^i is reduced from ∝ ℓ to ∝ |S|.
Subsampled Hessian Newton Method
With enough data, the direction obtained by
$$G^S d = -\nabla f(\theta)$$
may be close to that obtained by
$$G d = -\nabla f(\theta).$$
Computational cost per matrix-vector product is saved
On CPU, we may afford to store J^i, ∀i ∈ S.
On GPU, which has less memory, we calculate J^i, ∀i ∈ S, when needed.
Calculation of Jacobian Matrix
Now we know the subsampled Gauss-Newton matrix-vector product is
$$G^S v = \frac{1}{C} v + \frac{1}{|S|}\sum_{i \in S} (J^i)^T \left(B^i (J^i v)\right). \qquad (4)$$
We briefly discuss how to calculate Ji
Calculation of Jacobian Matrix (Cont’d)
The Jacobian can be partitioned with respect to layers.
$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial z^{L+1,i}}{\partial \text{vec}(W^1)^T} & \cdots & \frac{\partial z^{L+1,i}}{\partial \text{vec}(W^L)^T} \end{bmatrix}$$
We check details of one layer.
It's difficult to calculate the derivative if using the matrix form
$$S^{m,i} = W^m \phi(Z^{m,i}).$$
Calculation of Jacobian Matrix (Cont’d)
We can rewrite it as
$$\text{vec}(S^{m,i}) = \left(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\right)\text{vec}(W^m),$$
where ⊗ is the Kronecker product and I_{d^{m+1}} is the identity matrix.
If
$$y = A x, \quad y \in R^p,\ x \in R^q,$$
then
$$\frac{\partial y}{\partial x^T} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_q} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_p}{\partial x_1} & \cdots & \frac{\partial y_p}{\partial x_q} \end{bmatrix} = A.$$
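A quick NumPy check of the identity vec(S^{m,i}) = (φ(Z^{m,i})^T ⊗ I_{d^{m+1}}) vec(W^m) on random matrices; the sizes are arbitrary choices of ours, and vec stacks columns, hence order='F'.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, hhd, cols = 4, 27, 9                   # d^{m+1}, h*h*d^m, a^{m+1}*b^{m+1}
W = rng.standard_normal((d_out, hhd))
P = rng.standard_normal((hhd, cols))          # stands in for phi(Z^{m,i})

S = W @ P
vecS = S.reshape(-1, order='F')               # vec() stacks columns
vecW = W.reshape(-1, order='F')
lhs = np.kron(P.T, np.eye(d_out)) @ vecW      # (phi(Z)^T kron I) vec(W)
print(np.allclose(lhs, vecS))                 # True
```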
Calculation of Jacobian Matrix (Cont’d)
Therefore,
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\,\frac{\partial \text{vec}(S^{m,i})}{\partial \text{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\left(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\right).$$
Further (detailed derivation omitted),
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(Z^{m+1,i})^T} \odot \left[\mathbf{1}_{n_{L+1}}\, \text{vec}(\sigma'(S^{m,i}))^T\right],$$
where ⊙ is the element-wise product, 1_{n_{L+1}} is the vector of ones in R^{n_{L+1}}, and
Calculation of Jacobian Matrix (Cont’d)
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(Z^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\left(I_{a^{m+1} b^{m+1}} \otimes W^m\right) P^m_\phi,$$
where P^m_φ is a 0/1 matrix such that vec(φ(Z^{m,i})) = P^m_φ vec(Z^{m,i}).
Thus a backward process can calculate all the needed values.
We see that with suitable representation, the derivation is manageable
Major operations can be performed by matrix-based settings (details not shown)
This is why GPUs are useful
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Running Time and Test Accuracy
Four data sets are considered:
MNIST, SVHN, CIFAR10, smallNORB
For each method, best parameters from a validation process are used
We will check parameter sensitivity later.
Two SG implementations are used:
Simple SG shown earlier
SG with momentum (details not explained here)
SG with momentum is a reasonably strong baseline
Running Time and Test Accuracy (Cont’d)
(Figures: test accuracy (%) versus training time in seconds on the four data sets, comparing SG-with-momentum, SG-without-momentum, and Newton.)
Running Time and Test Accuracy (Cont’d)
Clearly, SG has faster initial convergence
This is reasonable as a second-order method is slower in the beginning
But if the cost of parameter selection is considered, Newton may be useful
Experiments on Parameter Sensitivity
Consider a fixed regularization parameter C = 0.01ℓ
For SG with momentum, we consider the following initial learning rates
0.1, 0.05, 0.01, 0.005, 0.001, 0.0003, 0.0001
For Newton, there is no particular parameter to tune. We check the size of the subsampled Hessian:
|S| = 10%, 5%, 1% of data
Results by Using Different Parameters
Each line shows the result of one problem
Newton (sampling rate)        SG (initial learning rate)
10%     5%      1%            0.03    0.01    0.003   0.001   0.0003
99.2%   99.2%   99.1%         9.9%    10.3%   99.1%   99.2%   99.0%
92.7%   92.7%   92.2%         19.5%   92.4%   93.0%   92.7%   92.3%
78.2%   78.3%   75.4%         10.0%   63.1%   79.5%   79.2%   76.9%
94.9%   95.0%   94.6%         64.7%   95.0%   95.0%   95.0%   94.8%
We find that
a learning rate that is too large causes SG to diverge, and one that is too small causes slow convergence
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Conclusions
Stochastic gradient method has been popular for CNN
It is simple and useful, but sometimes not robust.
Newton is more complicated and has slower initial convergence.
However, it may be overall more robust
With careful design, the implementation of Newton isn't too complicated
Conclusions (Cont’d)
Results presented here are based on the paper by Wang et al. (2019)
Ongoing software development is at https://github.com/cjlin1/simpleNN
Both MATLAB and Python are supported
MATLAB: joint work with Chien-Chih Wang and Tan Kent Loong (NTU)
Python: joint work with Pengrui Quan (UCLA)