Newton Method for Convolutional Neural Networks


(1)

Chih-Jen Lin

Department of Computer Science National Taiwan University

(2)

1 Introduction

2 Optimization problem for convolutional neural networks (CNN)

3 Newton method for CNN

4 Experiments

5 Discussion and conclusions

(4)

Introduction

Training a neural network involves a difficult optimization problem

SG (stochastic gradient) is the major optimization technique for deep learning.

SG is simple and effective, but sometimes not robust (e.g., selecting the learning rate may be difficult)

Is it possible to consider other methods?

In this work, we investigate Newton methods

(5)

2 Optimization problem for convolutional neural networks (CNN)

(6)

Optimization and Neural Networks

In a typical setting, training a neural network is no more than solving an empirical risk minimization problem

We will show an example using convolutional neural networks (CNN)

CNN is a type of network useful for image classification

(7)

Convolutional Neural Networks (CNN)

Consider a $K$-class classification problem with training data

$$(y^i, Z^{1,i}), \quad i = 1, \ldots, \ell.$$

$y^i$: label vector, $Z^{1,i}$: input image

If $Z^{1,i}$ is in class $k$, then

$$y^i = [\underbrace{0, \ldots, 0}_{k-1}, 1, 0, \ldots, 0]^T \in R^K.$$

CNN maps each image $Z^{1,i}$ to $y^i$
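As a small illustration of the label vector defined above, the following sketch builds $y^i$ for a given class. The function name `label_vector` and the 1-based class index are assumptions made for this example only.

```python
import numpy as np

def label_vector(k, K):
    """Return y in R^K with a single 1 at position k (1-based), 0 elsewhere."""
    if not 1 <= k <= K:
        raise ValueError("class index k must satisfy 1 <= k <= K")
    y = np.zeros(K)
    y[k - 1] = 1.0  # k-1 zeros precede the single 1
    return y

# Hypothetical example: class 3 of a 5-class problem
y = label_vector(3, 5)
print(y)
```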

(8)

Convolutional Neural Networks (CNN)

Typically, CNN consists of multiple convolutional layers followed by fully-connected layers.

We discuss only convolutional layers.

Input and output of a convolutional layer are assumed to be images.

(9)

Convolutional Layers

For the $m$th layer, let the input be an image of size $a^m \times b^m \times d^m$.

$a^m$: height, $b^m$: width, and $d^m$: number of channels.

(10)

Convolutional Layers (Cont’d)

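Consider $d^{m+1}$ filters.

Each filter includes weights to extract local information

Filter $j \in \{1, \ldots, d^{m+1}\}$ has dimensions $h \times h \times d^m$.

$$\begin{bmatrix} w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{1,h,1} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,1} & \cdots & w^{m,j}_{h,h,1} \end{bmatrix} \ \cdots \ \begin{bmatrix} w^{m,j}_{1,1,d^m} & \cdots & w^{m,j}_{1,h,d^m} \\ \vdots & \ddots & \vdots \\ w^{m,j}_{h,1,d^m} & \cdots & w^{m,j}_{h,h,d^m} \end{bmatrix}.$$

$h$: filter height/width (the layer index $m$ of $h^m$ is omitted)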

(11)

Convolutional Layers (Cont’d)

[Figure: a 3 × 3 input channel with entries indexed (1,1,1) to (3,3,1), and the resulting 2 × 2 output channel with entries $s^{m,i}_{1,1,j}$, $s^{m,i}_{1,2,j}$, $s^{m,i}_{2,1,j}$, $s^{m,i}_{2,2,j}$]

To compute the $j$th channel of output, we scan the input from top-left to bottom-right to obtain the sub-images of size $h \times h \times d^m$

Then calculate the inner product between each sub-image and the $j$th filter
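The scanning procedure can be sketched directly with loops. This is only an illustration: the function name `conv_channel`, the `stride` argument, and the toy 3 × 3 input with a 2 × 2 all-ones filter are assumptions, not part of the slides.

```python
import numpy as np

def conv_channel(Z, w, stride=1):
    """Compute one output channel.

    Z: input image of shape (a, b, d); w: one filter of shape (h, h, d).
    Each output entry is the inner product of a sub-image and the filter.
    """
    a, b, d = Z.shape
    h = w.shape[0]
    a_out = (a - h) // stride + 1
    b_out = (b - h) // stride + 1
    S = np.empty((a_out, b_out))
    for p in range(a_out):
        for q in range(b_out):
            sub = Z[p * stride:p * stride + h, q * stride:q * stride + h, :]
            S[p, q] = np.sum(sub * w)  # inner product of sub-image and filter
    return S

# Toy data (assumed): one channel holding 0..8, filter of all ones
Z = np.arange(9, dtype=float).reshape(3, 3, 1)
w = np.ones((2, 2, 1))
S = conv_channel(Z, w)
print(S)  # 2 x 2 output channel
```

With a 3 × 3 input and $h = 2$, four sub-images are visited, which gives the 2 × 2 output.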

(12)

Convolutional Layers (Cont’d)

It’s known that convolutional operations can be done by matrix-matrix and matrix-vector operations

Let’s collect images of all channels as the input

$$Z^{m,i} = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{2,1,1} & \cdots & z^{m,i}_{a^m,b^m,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{1,1,d^m} & z^{m,i}_{2,1,d^m} & \cdots & z^{m,i}_{a^m,b^m,d^m} \end{bmatrix} \in R^{d^m \times a^m b^m}.$$

(13)

Convolutional Layers (Cont’d)

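Let all filters

$$W^m = \begin{bmatrix} w^{m,1}_{1,1,1} & w^{m,1}_{2,1,1} & \cdots & w^{m,1}_{h,h,d^m} \\ \vdots & \vdots & \ddots & \vdots \\ w^{m,d^{m+1}}_{1,1,1} & w^{m,d^{m+1}}_{2,1,1} & \cdots & w^{m,d^{m+1}}_{h,h,d^m} \end{bmatrix} \in R^{d^{m+1} \times hhd^m}$$

be variables (parameters) of the current layer

Usually a bias term is considered but we omit it here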

(14)

Convolutional Layers (Cont’d)

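Operations at a layer

$$S^{m,i} = W^m \phi(Z^{m,i}), \qquad Z^{m+1,i} = \sigma(S^{m,i}),$$

$\phi(Z^{m,i})$ collects all sub-images in $Z^{m,i}$ into a matrix

$$\phi(Z^{m,i}) = \begin{bmatrix} z^{m,i}_{1,1,1} & z^{m,i}_{1+s^m,1,1} & \cdots & z^{m,i}_{1+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ z^{m,i}_{2,1,1} & z^{m,i}_{2+s^m,1,1} & \cdots & z^{m,i}_{2+(a^{m+1}-1)s^m,\,1+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m,i}_{h,h,1} & z^{m,i}_{h+s^m,h,1} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,1} \\ \vdots & \vdots & & \vdots \\ z^{m,i}_{h,h,d^m} & z^{m,i}_{h+s^m,h,d^m} & \cdots & z^{m,i}_{h+(a^{m+1}-1)s^m,\,h+(b^{m+1}-1)s^m,\,d^m} \end{bmatrix}$$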

(15)

Convolutional Layers (Cont’d)

$\sigma$ is an element-wise activation function

In the matrix-matrix product

$$S^{m,i} = W^m \phi(Z^{m,i}), \tag{1}$$

each element is the inner product between a filter and a sub-image
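The matrix form (1) can be checked numerically against the direct scan. The sketch below is an illustration under assumptions: the helper `phi`, the random toy sizes, a stride of 1 in the check, and ReLU as $\sigma$ are all choices made here, not taken from the slides. Sub-images and filters are flattened in column-major order so that rows of $W^m$ line up with rows of $\phi(Z^{m,i})$.

```python
import numpy as np

def phi(Z, h, stride=1):
    """Collect all h x h x d sub-images of Z (shape (a, b, d)) as columns.

    Output positions are visited with the row index varying fastest,
    and each sub-image is flattened in column-major (Fortran) order.
    """
    a, b, d = Z.shape
    a_out = (a - h) // stride + 1
    b_out = (b - h) // stride + 1
    cols = []
    for q in range(b_out):
        for p in range(a_out):
            sub = Z[p * stride:p * stride + h, q * stride:q * stride + h, :]
            cols.append(sub.flatten(order="F"))
    return np.stack(cols, axis=1)  # shape (h*h*d, a_out*b_out)

rng = np.random.default_rng(0)
a, b, d, d_next, h = 5, 4, 3, 2, 2           # assumed toy sizes
Z = rng.standard_normal((a, b, d))
filters = rng.standard_normal((d_next, h, h, d))
W = np.stack([f.flatten(order="F") for f in filters])  # (d_next, h*h*d)

S = W @ phi(Z, h)                 # equation (1)
Z_next = np.maximum(S, 0.0)       # sigma assumed to be ReLU here

# Direct scan for comparison: inner product of each sub-image and filter
a_out, b_out = a - h + 1, b - h + 1
S_direct = np.empty_like(S)
for j in range(d_next):
    for q in range(b_out):
        for p in range(a_out):
            S_direct[j, p + q * a_out] = np.sum(Z[p:p + h, q:q + h, :] * filters[j])

print(np.allclose(S, S_direct))
```

Writing the layer as one matrix-matrix product is what lets optimized linear algebra routines do most of the work.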

(16)

Optimization Problem

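We collect all weights to a vector variable $\theta$.

$$\theta = \begin{bmatrix} \mathrm{vec}(W^1) \\ \vdots \\ \mathrm{vec}(W^L) \end{bmatrix} \in R^n, \quad n: \text{total number of variables}$$

The output of the last fully-connected layer $L$ is a vector $z^{L+1,i}(\theta)$.

Consider any loss function such as the squared loss

$$\xi_i(\theta) = \|z^{L+1,i}(\theta) - y^i\|^2.$$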

(17)

Optimization Problem (Cont’d)

The optimization problem is $\min_\theta f(\theta)$, where

$$f(\theta) = \text{regularization} + \text{losses} = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi_i(\theta)$$

$C$: regularization parameter.
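A minimal sketch of evaluating this objective with the squared loss follows. It assumes the network outputs $z^{L+1,i}(\theta)$ have already been computed by a forward pass and are passed in as an array; the function name `objective` and the toy numbers are hypothetical.

```python
import numpy as np

def objective(theta, C, outputs, labels):
    """f(theta) = theta^T theta / (2C) + (1/l) * sum_i ||z_i - y_i||^2.

    outputs: array (l, K) assumed to hold z^{L+1,i}(theta) from a forward pass
    labels:  array (l, K) of label vectors y^i
    """
    reg = theta @ theta / (2.0 * C)
    loss = np.mean(np.sum((outputs - labels) ** 2, axis=1))
    return reg + loss

# Hypothetical toy values
theta = np.array([1.0, -2.0, 0.5])
outputs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
f_val = objective(theta, C=10.0, outputs=outputs, labels=labels)
print(f_val)
```

Here the regularization term is 5.25 / 20 and the average loss is (0.02 + 0.08) / 2, so a larger $C$ puts less weight on regularization.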

(18)

3 Newton method for CNN

(19)

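We begin with explaining why stochastic gradient (SG) is popular for deep learning

Recall the function is

$$f(\theta) = \frac{1}{2C}\theta^T\theta + \frac{1}{\ell}\sum_{i=1}^{\ell} \xi(\theta; y^i, Z^{1,i})$$

The gradient is

$$\frac{\theta}{C} + \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y^i, Z^{1,i})$$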

(20)

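Going over all data is time consuming

From

$$E(\nabla_\theta \xi(\theta; y, Z^1)) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y^i, Z^{1,i})$$

we may just use a subset $S$ (called a batch)

$$\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i: i \in S} \xi(\theta; y^i, Z^{1,i})$$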

(21)

Mini-batch SG: Algorithm

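1: Given an initial learning rate $\eta$.

2: while stopping condition not satisfied do

3: Choose $S \subset \{1, \ldots, \ell\}$.

4: Calculate

$$\theta \leftarrow \theta - \eta\Big(\frac{\theta}{C} + \frac{1}{|S|}\nabla_\theta \sum_{i: i \in S} \xi(\theta; y^i, Z^{1,i})\Big)$$

5: May adjust the learning rate $\eta$

6: end while

But deciding a suitable learning rate may be tricky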
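The algorithm can be sketched on a toy problem. To keep the example short, a linear model with squared loss $\xi_i(\theta) = (x_i^T\theta - y_i)^2$ stands in for the CNN; this substitution, the function names, the fixed number of iterations as the stopping condition, and all parameter values are assumptions for illustration.

```python
import numpy as np

def minibatch_sg(X, y, C, eta, batch_size, n_iters, seed=0):
    """Mini-batch SG for f(theta) = theta^T theta/(2C) + mean_i (x_i^T theta - y_i)^2."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):                      # stopping condition: fixed budget
        S = rng.choice(l, size=batch_size, replace=False)   # choose a batch S
        residual = X[S] @ theta - y[S]
        # theta/C + (1/|S|) * gradient of the batch losses
        grad = theta / C + (2.0 / batch_size) * (X[S].T @ residual)
        theta = theta - eta * grad                # learning rate kept constant here
    return theta

def f(theta, X, y, C):
    return theta @ theta / (2.0 * C) + np.mean((X @ theta - y) ** 2)

# Synthetic data (assumed): 200 points, 5 variables, noise-free targets
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

theta_sg = minibatch_sg(X, y, C=100.0, eta=0.05, batch_size=20, n_iters=500)
print(f(np.zeros(5), X, y, 100.0), f(theta_sg, X, y, 100.0))
```

Rerunning with a much larger `eta` is a quick way to see the sensitivity to the learning rate on this toy problem.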

(22)

Why Is SG Popular for Deep Learning?

The special property of data classification is essential

$$E(\nabla_\theta \xi(\theta; y, Z^1)) = \frac{1}{\ell}\nabla_\theta \sum_{i=1}^{\ell} \xi(\theta; y^i, Z^{1,i})$$

Indeed stochastic gradient is less used outside machine learning

High-order methods with fast final convergence may not be needed in machine learning

An approximate solution may give similar accuracy to the final solution

(23)

Why Is SG Popular for Deep Learning? (Cont’d)

Easy implementation. It’s simpler than methods using, for example, second derivatives

Non-convexity plays a role

For convex problems, a global minimum usually gives a good model (loss is minimized)

Thus we want to efficiently find the global minimum

But for non-convex problems, efficiency in reaching a stationary point is less useful

(24)

Drawback of SG

Tuning the learning rate is not easy

Thus if we would like to consider other methods, robustness rather than efficiency may be the main reason

(25)

Newton Method

Newton method finds a direction $d$ that minimizes the second-order approximation of $f(\theta)$

$$\min_d \ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta) d. \tag{2}$$

If $\nabla^2 f(\theta)$ is positive definite, (2) is equivalent to solving

$$\nabla^2 f(\theta) d = -\nabla f(\theta).$$

(26)

Newton Method (Cont’d)

while stopping condition not satisfied do

Let $G$ be $\nabla^2 f(\theta)$ or its approximation

Exactly or approximately solve

$$G d = -\nabla f(\theta)$$

Find a suitable step size $\alpha$ (e.g., line search)

Update

$$\theta \leftarrow \theta + \alpha d.$$

end while
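The loop above can be sketched as follows. This is an illustration under assumptions: the exact Hessian is used as $G$, the linear system is solved directly, the step size comes from a simple backtracking line search, and the test function $\sum_t (e^{\theta_t} - \theta_t)$ is a convex toy chosen here because its Hessian is positive definite.

```python
import numpy as np

def newton(f, grad, hess, theta, tol=1e-8, max_iter=50):
    """Newton method with backtracking line search (a minimal sketch)."""
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) <= tol:          # stopping condition
            break
        G = hess(theta)                       # here: the exact Hessian
        d = np.linalg.solve(G, -g)            # solve G d = -grad f(theta)
        alpha = 1.0
        # Halve alpha until a sufficient-decrease condition holds
        while alpha > 1e-10 and f(theta + alpha * d) > f(theta) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        theta = theta + alpha * d             # update
    return theta

# Convex toy function (assumed), minimized at theta = 0
f = lambda t: np.sum(np.exp(t) - t)
grad = lambda t: np.exp(t) - 1.0
hess = lambda t: np.diag(np.exp(t))

theta_star = newton(f, grad, hess, np.array([2.0, -1.0, 0.5]))
print(theta_star)
```

No learning rate appears; the line search picks $\alpha$ at every iteration.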

(27)

Hessian may not be Positive Definite

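Hessian of $f(\theta)$ is (derivation omitted)

$$\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i + \text{a non-PSD (positive semi-definite) term}$$

$I$: identity, $B^i$: simple PSD matrix, $J^i$: Jacobian of $z^{L+1,i}(\theta)$

$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in R^{n_{L+1} \times n}$$

$n_{L+1}$: number of classes, $n$: total number of variables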

(28)

Positive Definite Modification of Hessian

Several strategies have been proposed.

For example, Schraudolph (2002) considered the Gauss-Newton matrix (which is PD)

$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i \approx \nabla^2 f(\theta).$$

Then the Newton linear system becomes

$$G d = -\nabla f(\theta). \tag{3}$$

(29)

Memory Difficulty

The Gauss-Newton matrix G may be too large to be stored

G : # variables × # variables

Many approaches have been proposed (through approximation)

For example, we may store and use only diagonal blocks of G

(30)

Memory Difficulty (Cont’d)

Here we try to use the original Gauss-Newton matrix G without aggressive approximation

Reason: we should show first that for medium-sized data, standard Newton is more robust than SG

Otherwise, there is no need to develop techniques for large-scale problems

(31)

Hessian-free Newton Method

If $G$ has certain structures, it’s possible to use iterative methods (e.g., conjugate gradient) to solve the Newton linear system by a sequence of matrix-vector products

$$G v^1, G v^2, \ldots$$

without storing $G$

This is called Hessian-free in optimization

(32)

Hessian-free Newton Method (Cont’d)

The Gauss-Newton matrix is

$$G = \frac{1}{C} I + \frac{1}{\ell}\sum_{i=1}^{\ell} (J^i)^T B^i J^i$$

Matrix-vector product without explicitly storing $G$

$$G v = \frac{1}{C} v + \frac{1}{\ell}\sum_{i=1}^{\ell} \big((J^i)^T (B^i (J^i v))\big).$$

Examples of using this setting for deep learning include Martens (2010), Le et al. (2011), and Wang et al. (2018).
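A sketch of this idea: conjugate gradient only needs a routine that returns $Gv$, so $G$ is never formed while solving. The random matrices standing in for the Jacobians $J^i$, the choice $B^i = 2I$ (which corresponds to the squared loss), and the function names are assumptions for illustration. The explicit $G$ is built at the end only to check the result on this tiny problem.

```python
import numpy as np

def gauss_newton_vec(v, jacobians, C, b_scale=2.0):
    """Return G v = v/C + (1/l) sum_i J_i^T (B_i (J_i v)), with B_i = b_scale * I."""
    Gv = v / C
    for J in jacobians:
        Gv = Gv + J.T @ (b_scale * (J @ v)) / len(jacobians)
    return Gv

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * np.linalg.norm(b):
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy sizes and random stand-ins for J^i and the gradient (assumed)
rng = np.random.default_rng(0)
n, n_out, l, C = 8, 3, 20, 10.0
jacobians = [rng.standard_normal((n_out, n)) for _ in range(l)]
grad = rng.standard_normal(n)

d = conjugate_gradient(lambda v: gauss_newton_vec(v, jacobians, C), -grad)

# Check only: form G explicitly, which a Hessian-free method avoids
G = np.eye(n) / C + sum(2.0 * J.T @ J for J in jacobians) / l
print(np.allclose(G @ d, -grad))
```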

(33)

Hessian-free Newton Method (Cont’d)

However, for the conjugate gradient process, $J^i \in R^{n_{L+1} \times n}$, $i = 1, \ldots, \ell$, can be too large to be stored ($\ell$ is # data)

Total memory usage is

$$n_{L+1} \times n \times \ell = \text{\# classes} \times \text{\# variables} \times \text{\# data}$$
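To get a feeling for the scale, the product can be evaluated for some hypothetical sizes. The numbers below (10 classes, one million variables, 50,000 data points, 8-byte floating-point entries) are assumptions chosen only for this arithmetic, not figures from the experiments.

```python
# Hypothetical sizes, for illustration only
n_classes = 10          # n_{L+1}
n_vars = 1_000_000      # n
n_data = 50_000         # l
bytes_per_entry = 8     # double precision

entries = n_classes * n_vars * n_data
terabytes = entries * bytes_per_entry / 1e12
print(entries, terabytes)
```

Under these assumptions, storing all $J^i$ would take about 4 TB.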

(34)

Hessian-free Newton Method (Cont’d)

The product involves

$$\sum_{i=1}^{\ell} \big((J^i)^T (B^i (J^i v))\big).$$

We can trade time for space: Ji is calculated when needed (i.e., at every matrix-vector product)

On the other hand, we may not need to use all data points to have Ji, ∀i

We will discuss the subsampled Hessian technique

(35)

Subsampled Hessian Newton Method

Similar to gradient, for Hessian we have

$$E(\nabla^2_{\theta,\theta}\, \xi(\theta; y, Z^1)) = \frac{1}{\ell}\nabla^2_{\theta,\theta} \sum_{i=1}^{\ell} \xi(\theta; y^i, Z^{1,i})$$

Thus we can approximate the Gauss-Newton matrix by a subset of data

This is the subsampled Hessian Newton method (Byrd et al., 2011; Martens, 2010; Wang et al., 2015)

(36)

Subsampled Hessian Newton Method

We select a subset $S \subset \{1, \ldots, \ell\}$ and have

$$G^S = \frac{1}{C} I + \frac{1}{|S|}\sum_{i \in S} (J^i)^T B^i J^i \approx G.$$

The cost of storing $J^i$ is reduced from $\propto \ell$ to $\propto |S|$

(37)

Subsampled Hessian Newton Method

With enough data, the direction obtained by

$$G^S d = -\nabla f(\theta)$$

may be close to that by

$$G d = -\nabla f(\theta)$$

Computational cost per matrix-vector product is saved

On CPU we may afford to store $J^i$, $\forall i \in S$

On GPU, which has less memory, we calculate $J^i$, $\forall i \in S$ when needed
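The claim that the two directions may be close can be illustrated on synthetic data. In this sketch the Jacobians are independent random matrices, $B^i = 2I$, a 10% subset is sampled, and the matrices are formed explicitly because the problem is tiny; all of these are assumptions for illustration and say nothing about real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_out, l, C = 6, 3, 2000, 10.0             # assumed toy sizes
jacobians = rng.standard_normal((l, n_out, n))  # random stand-ins for J^i

def gn_matrix(J_subset, C):
    """(1/C) I + (1/|S|) sum_i J_i^T B_i J_i with B_i = 2I (explicit, tiny n only)."""
    m = J_subset.shape[0]
    JtJ = np.einsum("ikp,ikq->pq", J_subset, J_subset)  # sum_i J_i^T J_i
    return np.eye(J_subset.shape[2]) / C + 2.0 * JtJ / m

G = gn_matrix(jacobians, C)                       # all data
S = rng.choice(l, size=l // 10, replace=False)    # 10% subset
G_S = gn_matrix(jacobians[S], C)                  # subsampled matrix

grad = rng.standard_normal(n)
d_full = np.linalg.solve(G, -grad)
d_sub = np.linalg.solve(G_S, -grad)

rel_err = np.linalg.norm(d_sub - d_full) / np.linalg.norm(d_full)
print(rel_err)
```

Shrinking the subset in this toy setting tends to increase the gap between the two directions.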

(38)

Calculation of Jacobian Matrix

Now we know the subsampled Gauss-Newton matrix-vector product is

$$G^S v = \frac{1}{C} v + \frac{1}{|S|}\sum_{i \in S} (J^i)^T \big(B^i (J^i v)\big) \tag{4}$$

We briefly discuss how to calculate Ji

(39)

Calculation of Jacobian Matrix (Cont’d)

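The Jacobian can be partitioned with respect to layers.

$$J^i = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^1)^T} & \cdots & \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^L)^T} \end{bmatrix}$$

We check details of one layer. It’s difficult to calculate the derivative if using a matrix form

$$S^{m,i} = W^m \phi(Z^{m,i})$$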

(40)

Calculation of Jacobian Matrix (Cont’d)

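We can rewrite it to

$$\mathrm{vec}(S^{m,i}) = \big(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\big)\, \mathrm{vec}(W^m),$$

where $\otimes$: Kronecker product, $I_{d^{m+1}}$: identity

If $y = Ax$, with $y \in R^p$ and $x \in R^q$, then

$$\frac{\partial y}{\partial x^T} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_q} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_p}{\partial x_1} & \cdots & \frac{\partial y_p}{\partial x_q} \end{bmatrix} = A$$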

(41)

Calculation of Jacobian Matrix (Cont’d)

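Therefore,

$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} \frac{\partial \mathrm{vec}(S^{m,i})}{\partial \mathrm{vec}(W^m)^T} = \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} \big(\phi(Z^{m,i})^T \otimes I_{d^{m+1}}\big).$$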
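The vectorized form of $S^{m,i} = W^m\phi(Z^{m,i})$ used in this derivation can be verified numerically. In the sketch below, random matrices stand in for $W^m$ and $\phi(Z^{m,i})$, and `vec` stacks columns; the sizes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_next, k, cols = 3, 4, 5                  # assumed toy sizes
W = rng.standard_normal((d_next, k))       # stands in for W^m
Phi = rng.standard_normal((k, cols))       # stands in for phi(Z^{m,i})

def vec(M):
    """Stack the columns of M into one vector."""
    return M.flatten(order="F")

S = W @ Phi
A = np.kron(Phi.T, np.eye(d_next))         # phi^T (Kronecker) I_{d^{m+1}}

# vec(S) = A vec(W), so A is the Jacobian of vec(S) with respect to vec(W)
print(np.allclose(vec(S), A @ vec(W)))
```

Since the map from $\mathrm{vec}(W^m)$ to $\mathrm{vec}(S^{m,i})$ is linear, its Jacobian is exactly the matrix `A`.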

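Further, (detailed derivation omitted)

$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \big(\mathbb{1}_{n_{L+1}} \mathrm{vec}(\sigma'(S^{m,i}))^T\big),$$

where $\odot$ is element-wise product, and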

(42)

Calculation of Jacobian Matrix (Cont’d)

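$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} \big(I_{a^{m+1}b^{m+1}} \otimes W^m\big) P^m_\phi.$$

Thus a backward process can calculate all the needed values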

We see that with suitable representation, the derivation is manageable

Major operations can be performed by matrix-based settings (details not shown)

This is why GPU is useful
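The element-wise product rule for passing a Jacobian backward through the activation can be checked with finite differences. In this sketch a random matrix `M` stands in for $\partial z^{L+1,i} / \partial \mathrm{vec}(Z^{m+1,i})^T$, the rest of the network after the activation is assumed linear, and $\sigma$ is taken to be tanh; these are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, N = 3, 6                          # assumed toy sizes
M = rng.standard_normal((n_out, N))      # stands in for dz / dvec(Z^{m+1,i})^T
s = rng.standard_normal(N)               # stands in for vec(S^{m,i})

sigma = np.tanh                          # assumed activation
sigma_prime = lambda t: 1.0 - np.tanh(t) ** 2

def z(s_vec):
    """Toy output: linear map applied to the activated values."""
    return M @ sigma(s_vec)

# Formula: dz/dvec(S)^T = dz/dvec(Z_next)^T (element-wise) (1 vec(sigma'(S))^T)
J_formula = M * np.outer(np.ones(n_out), sigma_prime(s))

# Central finite differences, one column per entry of s
eps = 1e-6
J_fd = np.empty((n_out, N))
for t in range(N):
    e = np.zeros(N)
    e[t] = eps
    J_fd[:, t] = (z(s + e) - z(s - e)) / (2.0 * eps)

print(np.allclose(J_formula, J_fd, atol=1e-6))
```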

(43)

4 Experiments

(44)

Running Time and Test Accuracy

Four sets are considered

MNIST, SVHN, CIFAR10, smallNORB

For each method, best parameters from a validation process are used

We will check parameter sensitivity later

Two SG implementations are used

Simple SG shown earlier

SG with momentum (details not explained here)

SG with momentum is a reasonably strong baseline

(45)

Running Time and Test Accuracy (Cont’d)

[Figure: four panels, one per data set, plotting test accuracy (%) against time in seconds for SG-with-momentum, SG-without-momentum, and Newton]

(46)

Running Time and Test Accuracy (Cont’d)

Clearly, SG has faster initial convergence

This is reasonable as a second-order method is slower in the beginning

But if cost for parameter selection is considered, Newton may be useful

(47)

Experiments on Parameter Sensitivity

Consider a fixed regularization parameter $C = 0.01\ell$

For SG with momentum, we consider the following initial learning rates

0.1, 0.05, 0.01, 0.005, 0.001, 0.0003, 0.0001

For Newton, there is no particular parameter to tune. We check the size of subsampled Hessian:

$|S|$ = 10%, 5%, 1% of data

(48)

Results by Using Different Parameters

Each row shows the result of one problem

Newton (sampling rate)   | SG (initial learning rate)
10%     5%      1%       | 0.03    0.01    0.003   0.001   0.0003
99.2%   99.2%   99.1%    | 9.9%    10.3%   99.1%   99.2%   99.0%
92.7%   92.7%   92.2%    | 19.5%   92.4%   93.0%   92.7%   92.3%
78.2%   78.3%   75.4%    | 10.0%   63.1%   79.5%   79.2%   76.9%
94.9%   95.0%   94.6%    | 64.7%   95.0%   95.0%   95.0%   94.8%

We find that a learning rate that is too large causes SG to diverge, and one that is too small causes slow convergence

(49)

5 Discussion and conclusions

(50)

Conclusions

Stochastic gradient method has been popular for CNN

It is simple and useful, but sometimes not robust

Newton is more complicated and has slower initial convergence

However, it may be overall more robust

By careful designs, the implementation of Newton isn’t too complicated

(51)

Conclusions (Cont’d)

Results presented here are based on the paper by Wang et al. (2019)

An ongoing software development is at https://github.com/cjlin1/simpleNN

Both MATLAB and Python are supported

MATLAB: joint work with Chien-Chih Wang and Tan Kent Loong (NTU)

Python: joint work with Pengrui Quan (UCLA)
