Neural Networks
Chih-Jen Lin
Department of Computer Science National Taiwan University
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Introduction
Training a neural network involves a difficult optimization problem
SG (stochastic gradient) is the major optimization technique for deep learning.
SG is simple and effective, but sometimes not robust (e.g., selecting the learning rate may be difficult).
Is it possible to consider other methods?
In this work, we investigate Newton methods
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Optimization and Neural Networks
In a typical setting, training a neural network is no more than an empirical risk minimization problem
We will show an example using convolutional neural networks (CNN)
CNN is a type of network useful for image classification
Convolutional Neural Networks (CNN)
Consider a K-class classification problem with training data
(y_i, Z^{1,i}), i = 1, . . . , ℓ.
y_i: label vector; Z^{1,i}: input image.
If Z^{1,i} is in class k, then
$$y_i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in R^K.$$
CNN maps each image Z1,i to yi
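For concreteness, here is a minimal NumPy sketch of this one-hot label encoding; the helper name one_hot is ours and not part of the slides.

```python
import numpy as np

def one_hot(k, K):
    """Label vector for class k (1-based) among K classes: k-1 zeros, a 1, then zeros."""
    y = np.zeros(K)
    y[k - 1] = 1.0
    return y

print(one_hot(3, 5))  # [0. 0. 1. 0. 0.]
```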
Convolutional Neural Networks (CNN)
Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.
We discuss only convolutional layers.
Input and output of a convolutional layer are assumed to be images.
Convolutional Layers
For the mth layer, let the input be an image of size a^m × b^m × d^m.
a^m: height, b^m: width, and d^m: # channels.
(Figure: an input image with height a^m, width b^m, and d^m channels.)
Convolutional Layers (Cont'd)
Consider d^{m+1} filters.
Each filter includes weights to extract local information.
Filter j ∈ {1, . . . , d^{m+1}} has dimensions h × h × d^m:
$$\begin{bmatrix} w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{1,h,1} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,1} & \cdots & w^{m,j}_{h,h,1} \end{bmatrix}, \ \dots, \ \begin{bmatrix} w^{m,j}_{1,1,d^m} & \cdots & w^{m,j}_{1,h,d^m} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,d^m} & \cdots & w^{m,j}_{h,h,d^m} \end{bmatrix}$$
h: filter height/width (the superscript m of h^m is omitted).
Convolutional Layers (Cont’d)
(Figure: convolving a small input channel produces output entries s^{m,i}_{1,1,j}, s^{m,i}_{1,2,j}, s^{m,i}_{2,1,j}, s^{m,i}_{2,2,j}.)
To compute the jth channel of the output, we scan the input from top-left to bottom-right to obtain the sub-images of size h × h × d^m.
Then we calculate the inner product between each sub-image and the jth filter.
Convolutional Layers (Cont’d)
It's known that convolutional operations can be done by matrix-matrix and matrix-vector operations.
Let's collect the images of all channels as the input
$$Z^{m,i} = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{2,1,1} & \cdots & z^{m,i}_{a^m,b^m,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{1,1,d^m} & z^{m,i}_{2,1,d^m} & \cdots & z^{m,i}_{a^m,b^m,d^m} \end{bmatrix} \in R^{d^m \times a^m b^m}.$$
Convolutional Layers (Cont’d)
Let all the filters
$$W^m = \begin{bmatrix} w^{m,1}_{1,1,1} & w^{m,1}_{2,1,1} & \cdots & w^{m,1}_{h,h,d^m} \\ \vdots & \vdots & \ddots & \vdots \\ w^{m,d^{m+1}}_{1,1,1} & w^{m,d^{m+1}}_{2,1,1} & \cdots & w^{m,d^{m+1}}_{h,h,d^m} \end{bmatrix} \in R^{d^{m+1} \times hhd^m}$$
be the variables (parameters) of the current layer.
Usually a bias term is considered, but we omit it here.
Convolutional Layers (Cont’d)
Operations at a layer:
$$S^{m,i} = W^m \phi(Z^{m,i}), \quad Z^{m+1,i} = \sigma(S^{m,i}),$$
where φ(Z^{m,i}) collects all sub-images in Z^{m,i} into a matrix:
$$\phi(Z^{m,i}) = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{1+s^m,1,1} & \cdots & z^{m,i}_{1+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ z^{m,i}_{2,1,1} & z^{m,i}_{2+s^m,1,1} & \cdots & z^{m,i}_{2+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{h,h,1} & z^{m,i}_{h+s^m,h,1} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & & \vdots \\ z^{m,i}_{h,h,d^m} & z^{m,i}_{h+s^m,h,d^m} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,d^m} \end{bmatrix}$$
(s^m: stride of the convolution).
Convolutional Layers (Cont’d)
σ is an element-wise activation function.
In the matrix-matrix product
$$S^{m,i} = W^m \phi(Z^{m,i}), \qquad (1)$$
each element is the inner product between a filter and a sub-image.
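As an illustration, the following NumPy sketch implements (1) for one input image. It assumes stride 1, no zero-padding, and a channel-first layout; the helper name phi and these assumptions are ours, not from the slides.

```python
import numpy as np

def phi(Z, h):
    """Collect all h x h sub-images of Z (shape: d x a x b, channel first)
    into the columns of a matrix, as phi(Z^{m,i}) does."""
    d, a, b = Z.shape
    a_out, b_out = a - h + 1, b - h + 1            # stride 1, no padding (assumed)
    cols = []
    for j in range(b_out):                         # scan top-left to bottom-right
        for i in range(a_out):
            cols.append(Z[:, i:i+h, j:j+h].reshape(-1))
    return np.array(cols).T                        # (h*h*d) x (a_out*b_out)

# S = W phi(Z): each entry is the inner product of a filter and a sub-image
d, a, b, h, d_out = 3, 5, 5, 3, 4
Z = np.random.randn(d, a, b)
W = np.random.randn(d_out, h * h * d)              # each row is one flattened filter
S = W @ phi(Z, h)                                  # d_out x (a_out * b_out)
print(S.shape)                                     # (4, 9)
```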
Optimization Problem
We collect all weights into a vector variable θ:
$$\theta = \begin{bmatrix} \text{vec}(W^1) \\ \vdots \\ \text{vec}(W^L) \end{bmatrix} \in R^n, \quad n: \text{total \# variables}.$$
The output of the last fully-connected layer L is a vector z^{L+1,i}(θ).
Consider any loss function, such as the squared loss
$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - y_i\|^2.$$
Optimization Problem (Cont’d)
The optimization problem is min_θ f(θ), where
$$f(\theta) = \text{regularization} + \text{losses} = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi_i(\theta),$$
C: regularization parameter.
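A minimal sketch of evaluating f(θ) with the squared loss; here predict(theta, Z) is a hypothetical placeholder for the CNN forward pass producing z^{L+1,i}(θ), and data is a list of (y_i, Z^{1,i}) pairs.

```python
import numpy as np

def f(theta, data, C, predict):
    """f(theta) = theta^T theta / (2C) + (1/l) * sum_i ||z^{L+1,i}(theta) - y_i||^2.
    `predict` stands for the network's forward pass (placeholder)."""
    l = len(data)
    reg = theta @ theta / (2.0 * C)
    loss = sum(np.sum((predict(theta, Z) - y) ** 2) for (y, Z) in data) / l
    return reg + loss
```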
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Mini-batch Stochastic Gradient
We begin by explaining why stochastic gradient (SG) is popular for deep learning
Recall the function is
$$f(\theta) = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
The gradient is
$$\frac{\theta}{C} + \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Mini-batch Stochastic Gradient (Cont’d)
Going over all data is time consuming.
From
$$E\left(\nabla_\theta\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}),$$
we may just use a subset S (called a batch):
$$\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i \in S} \xi(\theta; y_i, Z^{1,i}).$$
Mini-batch SG: Algorithm
1: Given an initial learning rate η.
2: while stopping condition is not satisfied do
3:   Choose S ⊂ {1, . . . , ℓ}.
4:   Calculate
$$\theta \leftarrow \theta - \eta\left(\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i \in S} \xi(\theta; y_i, Z^{1,i})\right)$$
5:   May adjust the learning rate η.
6: end while
But deciding a suitable learning rate may be tricky
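A minimal sketch of the mini-batch SG loop above; grad_loss(theta, batch) is a hypothetical placeholder returning (1/|S|) Σ_{i∈S} ∇_θ ξ(θ; y_i, Z^{1,i}), and the batch size and epoch count are illustrative choices.

```python
import numpy as np

def sgd(theta, data, C, grad_loss, eta=0.01, batch_size=128, epochs=10, rng=None):
    """Mini-batch stochastic gradient on f(theta); `grad_loss` is a placeholder."""
    rng = rng or np.random.default_rng(0)
    l = len(data)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(l), max(1, l // batch_size)):
            batch = [data[i] for i in idx]
            g = theta / C + grad_loss(theta, batch)   # stochastic gradient estimate
            theta = theta - eta * g                   # learning rate eta may need tuning
    return theta
```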
Why SG Popular for Deep Learning?
The special property of data classification is essential:
$$E\left(\nabla_\theta\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Indeed, stochastic gradient is less used outside machine learning.
High-order methods with fast final convergence may not be needed in machine learning
An approximate solution may give similar accuracy to the final solution
Why SG Popular for Deep Learning? (Cont'd)
Easy implementation: it's simpler than methods that use, for example, second derivatives.
Non-convexity plays a role.
For convex problems, a global minimum usually gives a good model (the loss is minimized).
Thus we want to efficiently find the global minimum.
But for non-convex problems, efficiency in reaching a stationary point is less useful.
Drawback of SG
Tuning the learning rate is not easy
Thus if we would like to consider other methods, robustness rather than efficiency may be the main reason
Newton Method
Newton method finds a direction d that minimizes the second-order approximation of f(θ):
$$\min_{d} \ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta)\, d. \qquad (2)$$
If ∇²f(θ) is positive definite, (2) is equivalent to solving
$$\nabla^2 f(\theta)\, d = -\nabla f(\theta).$$
Newton Method (Cont’d)
while stopping condition not satisfied do
  Let G be ∇²f(θ) or its approximation
  Exactly or approximately solve
$$G d = -\nabla f(\theta)$$
  Find a suitable step size α (e.g., line search)
  Update
$$\theta \leftarrow \theta + \alpha d$$
end while
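A minimal NumPy sketch of this Newton loop, solving the linear system by conjugate gradient (using only matrix-vector products) and choosing α by backtracking line search. Here f, grad, and Gv (the product with G or its approximation) are hypothetical placeholders, and the tolerances are illustrative.

```python
import numpy as np

def newton(theta, f, grad, Gv, iters=100, cg_iters=50, tol=1e-4):
    """Newton loop: solve G d = -grad(theta) by CG, then backtracking line search.
    f, grad, Gv are placeholders for the objective, gradient, and G-vector product."""
    for _ in range(iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        # Conjugate gradient for G d = -g, starting from d = 0
        d = np.zeros_like(theta)
        r = -g - Gv(theta, d); p = r.copy(); rs = r @ r
        for _ in range(cg_iters):
            Gp = Gv(theta, p)
            alpha = rs / (p @ Gp)
            d += alpha * p; r -= alpha * Gp
            rs_new = r @ r
            if np.sqrt(rs_new) < 1e-8:
                break
            p = r + (rs_new / rs) * p; rs = rs_new
        # Backtracking line search for the step size
        step, f0 = 1.0, f(theta)
        while f(theta + step * d) > f0 + 1e-4 * step * (g @ d) and step > 1e-8:
            step *= 0.5
        theta = theta + step * d
    return theta
```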
Hessian may not be Positive Definite
The Hessian of f(θ) is (derivation omitted)
$$\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i + \text{a non-PSD (positive semi-definite) term},$$
I: identity; B^i: simple PSD matrix; J^i: Jacobian of z^{L+1,i}(θ),
$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in R^{n_{L+1} \times n},$$
n_{L+1}: # classes; n: total # variables.
Positive Definite Modification of Hessian
Several strategies have been proposed.
For example, Schraudolph (2002) considered the Gauss-Newton matrix (which is PD):
$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i \approx \nabla^2 f(\theta).$$
Then the Newton linear system becomes
$$G d = -\nabla f(\theta). \qquad (3)$$
Memory Difficulty
The Gauss-Newton matrix G may be too large to be stored
G : # variables × # variables
Many approaches have been proposed (through approximation)
For example, we may store and use only diagonal blocks of G
Memory Difficulty (Cont’d)
Here we try to use the original Gauss-Newton matrix G without aggressive approximation.
Reason: we should first show that for medium-sized data, standard Newton is more robust than SG.
Otherwise, there is no need to develop techniques for large-scale problems.
Hessian-free Newton Method
If G has certain structures, it's possible to use iterative methods (e.g., conjugate gradient) to solve the Newton linear system by a sequence of matrix-vector products
G v₁, G v₂, . . .
without storing G.
This is called Hessian-free in optimization
Hessian-free Newton Method (Cont’d)
The Gauss-Newton matrix is
$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i.$$
The matrix-vector product without explicitly storing G:
$$G v = \frac{1}{C} v + \frac{1}{\ell}\sum_{i=1}^{\ell} \left((J^i)^T \left(B^i (J^i v)\right)\right).$$
Examples of using this setting for deep learning include Martens (2010), Le et al. (2011), and Wang et al. (2018).
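A minimal sketch of this matrix-vector product; jacobian(theta, i) and B(i) are hypothetical placeholders for J^i and B^i, and each J^i is recomputed on the fly instead of forming or storing G.

```python
import numpy as np

def gauss_newton_vec(v, theta, C, indices, jacobian, B):
    """Compute G v = v / C + (1/l) * sum_i J_i^T (B_i (J_i v)) without forming G.
    `jacobian(theta, i)` (returning J_i) and `B(i)` (returning B_i) are placeholders."""
    out = v / C
    for i in indices:
        Ji = jacobian(theta, i)                       # n_{L+1} x n, computed when needed
        out += Ji.T @ (B(i) @ (Ji @ v)) / len(indices)
    return out
```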
Hessian-free Newton Method (Cont’d)
However, for the conjugate gradient process, J^i ∈ R^{n_{L+1} × n}, i = 1, . . . , ℓ, can be too large to be stored (ℓ is # data).
Total memory usage is
n_{L+1} × n × ℓ = # classes × # variables × # data.
Hessian-free Newton Method (Cont’d)
The product involves
$$\sum_{i=1}^{\ell} \left((J^i)^T \left(B^i (J^i v)\right)\right).$$
We can trade time for space: Ji is calculated when needed (i.e., at every matrix-vector product)
On the other hand, we may not need to use all data points to have Ji, ∀i
We will discuss the subsampled Hessian technique
Subsampled Hessian Newton Method
Similar to the gradient, for the Hessian we have
$$E\left(\nabla^2_{\theta\theta}\, \xi(\theta; y, Z^1)\right) = \frac{1}{\ell}\nabla^2_{\theta\theta} \sum_{i=1}^{\ell} \xi(\theta; y_i, Z^{1,i}).$$
Thus we can approximate the Gauss-Newton matrix by a subset of data.
This is the subsampled Hessian Newton method (Byrd et al., 2011; Martens, 2010; Wang et al., 2015)
Subsampled Hessian Newton Method
We select a subset S ⊂ {1, . . . , ℓ} and have
$$G^S = \frac{1}{C} I + \frac{1}{|S|}\sum_{i \in S} (J^i)^T B^i J^i \approx G.$$
The cost of storing J^i is reduced from ∝ ℓ to ∝ |S|.
Subsampled Hessian Newton Method
With enough data, the direction obtained by
$$G^S d = -\nabla f(\theta)$$
may be close to that obtained by
$$G d = -\nabla f(\theta).$$
Computational cost per matrix-vector product is saved
On CPU, we may afford to store J^i, ∀i ∈ S.
On GPU, which has less memory, we calculate J^i, ∀i ∈ S, when needed.
Calculation of Jacobian Matrix
Now we know the subsampled Gauss-Newton matrix-vector product is
$$G^S v = \frac{1}{C} v + \frac{1}{|S|}\sum_{i \in S} (J^i)^T \left(B^i (J^i v)\right). \qquad (4)$$
We briefly discuss how to calculate Ji
Calculation of Jacobian Matrix (Cont’d)
The Jacobian can be partitioned with respect to layers.
$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial z^{L+1,i}}{\partial \text{vec}(W^1)^T} & \cdots & \frac{\partial z^{L+1,i}}{\partial \text{vec}(W^L)^T} \end{bmatrix}$$
We check details of one layer.
It's difficult to calculate the derivative if using the matrix form
$$S^{m,i} = W^m \phi(Z^{m,i}).$$
Calculation of Jacobian Matrix (Cont’d)
We can rewrite it as
$$\text{vec}(S^{m,i}) = \left(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\right)\text{vec}(W^m),$$
where ⊗ is the Kronecker product and I_{d^{m+1}} is the identity matrix.
If
$$y = A x, \quad y \in R^p,\ x \in R^q,$$
then
$$\frac{\partial y}{\partial x^T} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_q} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_p}{\partial x_1} & \cdots & \frac{\partial y_p}{\partial x_q} \end{bmatrix} = A.$$
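A quick NumPy check of the identity vec(S^{m,i}) = (φ(Z^{m,i})^T ⊗ I_{d^{m+1}}) vec(W^m) on random matrices; the sizes are arbitrary choices of ours, and vec stacks columns, hence order='F'.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, hhd, cols = 4, 27, 9                   # d^{m+1}, h*h*d^m, a^{m+1}*b^{m+1}
W = rng.standard_normal((d_out, hhd))
P = rng.standard_normal((hhd, cols))          # stands in for phi(Z^{m,i})

S = W @ P
vecS = S.reshape(-1, order='F')               # vec() stacks columns
vecW = W.reshape(-1, order='F')
lhs = np.kron(P.T, np.eye(d_out)) @ vecW      # (phi(Z)^T kron I) vec(W)
print(np.allclose(lhs, vecS))                 # True
```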
Calculation of Jacobian Matrix (Cont’d)
Therefore,
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\,\frac{\partial \text{vec}(S^{m,i})}{\partial \text{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\left(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\right).$$
Further (detailed derivation omitted),
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(Z^{m+1,i})^T} \odot \left[\mathbf{1}_{n_{L+1}}\, \text{vec}(\sigma'(S^{m,i}))^T\right],$$
where ⊙ is the element-wise product, 1_{n_{L+1}} is the vector of ones in R^{n_{L+1}}, and
Calculation of Jacobian Matrix (Cont’d)
$$\frac{\partial z^{L+1,i}}{\partial \text{vec}(Z^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \text{vec}(S^{m,i})^T}\left(I_{a^{m+1} b^{m+1}} \otimes W^m\right) P^m_\phi,$$
where P^m_φ is a 0/1 matrix such that vec(φ(Z^{m,i})) = P^m_φ vec(Z^{m,i}).
Thus a backward process can calculate all the needed values.
We see that with suitable representation, the derivation is manageable
Major operations can be performed by matrix-based settings (details not shown)
This is why GPUs are useful
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Running Time and Test Accuracy
Four data sets are considered:
MNIST, SVHN, CIFAR10, smallNORB
For each method, best parameters from a validation process are used
We will check parameter sensitivity later.
Two SG implementations are used:
Simple SG shown earlier
SG with momentum (details not explained here)
SG with momentum is a reasonably strong baseline
Running Time and Test Accuracy (Cont’d)
(Figures: test accuracy (%) versus training time in seconds on the four data sets, comparing SG-with-momentum, SG-without-momentum, and Newton.)
Running Time and Test Accuracy (Cont’d)
Clearly, SG has faster initial convergence
This is reasonable as a second-order method is slower in the beginning
But if the cost of parameter selection is considered, Newton may be useful
Experiments on Parameter Sensitivity
Consider a fixed regularization parameter C = 0.01ℓ
For SG with momentum, we consider the following initial learning rates
0.1, 0.05, 0.01, 0.005, 0.001, 0.0003, 0.0001
For Newton, there is no particular parameter to tune. We check the size of the subsampled Hessian:
|S| = 10%, 5%, 1% of data
Results by Using Different Parameters
Each line shows the result of one problem
Newton (sampling rate)        SG (initial learning rate)
10%     5%      1%            0.03    0.01    0.003   0.001   0.0003
99.2%   99.2%   99.1%         9.9%    10.3%   99.1%   99.2%   99.0%
92.7%   92.7%   92.2%         19.5%   92.4%   93.0%   92.7%   92.3%
78.2%   78.3%   75.4%         10.0%   63.1%   79.5%   79.2%   76.9%
94.9%   95.0%   94.6%         64.7%   95.0%   95.0%   95.0%   94.8%
We find that
a learning rate that is too large causes SG to diverge, and one that is too small causes slow convergence
Outline
1 Introduction
2 Optimization problem for convolutional neural networks (CNN)
3 Newton method for CNN
4 Experiments
5 Discussion and conclusions
Conclusions
Stochastic gradient method has been popular for CNN
It is simple and useful, but sometimes not robust.
Newton is more complicated and has slower initial convergence.
However, it may be overall more robust
With careful design, the implementation of Newton isn't too complicated
Conclusions (Cont’d)
Results presented here are based on the paper by Wang et al. (2019)
Ongoing software development is at https://github.com/cjlin1/simpleNN
Both MATLAB and Python are supported
MATLAB: joint work with Chien-Chih Wang and Tan Kent Loong (NTU)
Python: joint work with Pengrui Quan (UCLA)