
國立臺灣大學電機資訊學院資訊網路與多媒體研究所 碩士論文

Graduate Institute of Networking and Multimedia College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

牛頓法於卷積神經網路之應用

Newton Methods For Convolutional Neural Networks

陳勁龍

Tan Kent Loong

指導教授:林智仁 博士

Advisor: Chih-Jen Lin, Ph.D.

中華民國 107 年 7 月

July, 2018


中文摘要 (Abstract in Chinese)

Deep learning involves difficult non-convex optimization problems. Most studies use stochastic gradient (SG) methods to optimize such models. SG is usually effective, but it is sometimes not very robust. Recent studies have investigated Newton methods as an alternative optimization approach, but most of them apply the method only to fully-connected neural networks. They do not investigate more widely used deep learning models such as convolutional neural networks (CNN). One reason is that applying Newton methods to CNN involves several complicated operations, so no careful study has been conducted so far. In this thesis, we give detailed building blocks, including the evaluation of the function, the gradient, and the Jacobian, as well as Gauss-Newton matrix-vector products. These basic building blocks are very important: without them, no recent improvement of Newton methods for fully-connected networks can even be tried on CNN. Our work may therefore enable further developments of Newton methods for CNN. We finish a simple MATLAB implementation, and experimental results show that the method is competitive.

Keywords: convolutional neural networks, multi-class classification, large-scale learning, subsampled Hessian.


ABSTRACT

Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it is sometimes not very robust. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks such as Convolutional Neural Networks (CNN), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNN involve complicated operations, and so far no works have conducted a thorough investigation.

In this thesis, we give details of building blocks including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because without them none of the recent improvements of Newton methods for fully-connected networks can even be tried. Thus we will enable possible further developments of Newton methods for CNN. We finish a simple MATLAB implementation and show that it gives competitive test accuracy.

KEYWORDS: Convolutional neural networks, multi-class classification, large-scale classification, subsampled Hessian.

TABLE OF CONTENTS

口試委員會審定書 (Committee certification)
中文摘要 (Abstract in Chinese)
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER
I. Introduction
II. Optimization Problem of Feedforward CNN
  2.1 Convolutional Layer
    2.1.1 Zero Padding
    2.1.2 Pooling Layer
    2.1.3 Summary of a Convolutional Layer
  2.2 Fully-Connected Layer
  2.3 The Overall Optimization Problem
III. Hessian-free Newton Methods for Training CNN
  3.1 Procedure of the Newton Method
  3.2 Gradient Evaluation
    3.2.1 Padding, Pooling, and the Overall Procedure
  3.3 Jacobian Evaluation
  3.4 Gauss-Newton Matrix-Vector Products
IV. Implementation Details
  4.1 Generation of $\phi(Z^{m-1,i})$
  4.2 Construction of $P^{m-1,i}_{\text{pool}}$
  4.3 Details of Padding Operation
  4.4 Evaluation of $v^T P^{m-1}_\phi$
  4.5 Gauss-Newton Matrix-Vector Products
  4.6 Mini-Batch Function and Gradient Evaluation
V. Analysis of Newton Methods for CNN
  5.1 Memory Requirement
  5.2 Computational Cost
VI. Experiments
  6.1 Comparison Between Newton and Stochastic Gradient Methods
VII. Conclusions

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

2.1 A padding example with $s^m = 1$ in order to set $a^m = a^{m-1}$.
2.2 An illustration of max pooling to extract translational invariance features. The image B is derived from shifting A by 1 pixel in the horizontal direction.

LIST OF TABLES

6.1 Summary of the data sets, where $a^0 \times b^0 \times d^0$ represents the (height, width, channel) of the input image, $l$ is the number of training data, $l_t$ is the number of test data, and $n_L$ is the number of classes.
6.2 Structure of convolutional neural networks. "conv" indicates a convolutional layer, "pool" indicates a pooling layer, and "full" indicates a fully-connected layer.
6.3 Test accuracy for the Newton method and SG. For the Newton method, we trained for 250 iterations; for SG, we trained for 1000 epochs.


CHAPTER I

Introduction

Deep learning is now widely used in many applications. To apply this technique, a difficult non-convex optimization problem must be solved. Currently, stochastic gradient (SG) methods and their variants are the major optimization technique used for deep learning (e.g., Krizhevsky et al., 2012; Simonyan and Zisserman, 2014). This situation is different from some application domains, where other types of optimization methods are more frequently used. One interesting research question is thus to study if other optimization methods can be extended to be viable alternatives for deep learning. In this thesis, we aim to address this issue by developing a practical Newton method for deep learning.

Some past works have studied Newton methods for training deep neural networks (e.g., Botev et al. 2017; He et al. 2016; Kiros 2013; Martens 2010; Vinyals and Povey 2012; Wang et al. 2015, 2018a). Almost all of them consider fully-connected feedforward neural networks and some have shown the potential of Newton methods for being more robust than SG. Unfortunately, these works have not fully established Newton methods as a practical technique for deep learning because other types of networks such as Convolutional Neural Networks (CNN) are more commonly used in deep-learning applications. One important reason why CNN was not considered is because of the very complicated operations in implementing Newton methods. Up to now no works have shown details of all the building blocks such as function, gradient, and Jacobian evaluation, and Hessian-vector products. In particular, because interpreted-type languages such as Python or MATLAB have been popular for deep learning, how to easily implement efficient operations by these languages is an important research issue.

In this thesis, we aim at a thorough investigation on the implementation of Newton methods for CNN. We focus on basic components because without them none of the recent improvements of Newton methods for fully-connected networks can even be tried.

We will enable many further developments of Newton methods for CNN and maybe even other types of networks.

This thesis is organized as follows. In Chapter II, we begin by introducing CNN.

In Chapter III, Newton methods for CNN are introduced and the detailed mathematical formulations of all operations are derived. In Chapter IV, we provide details of an efficient MATLAB implementation. Experiments to demonstrate the viability of Newton methods for CNN are in Chapter VI. In the same chapter, we also investigate the efficiency of our implementation. Chapter VII concludes this work.

A MATLAB package implementing a Newton method for CNN is available at

https://www.csie.ntu.edu.tw/~cjlin/papers/cnn/

Programs used for experiments can be found at the same page.

This thesis is based on the paper by Wang et al. (2018b). We acknowledge the support from the Ministry of Science and Technology of Taiwan.


CHAPTER II

Optimization Problem of Feedforward CNN

Consider a $K$-class problem, where the training data set consists of $l$ input pairs $(Z^{0,i}, y^i)$, $i = 1, \dots, l$. Here $Z^{0,i}$ is the $i$th input image with dimension $a^0 \times b^0 \times d^0$, where $a^0$ denotes the height of the input images, $b^0$ represents the width of the input images, and $d^0$ is the number of color channels. If $Z^{0,i}$ belongs to the $k$th class, then the label vector is
\[
y^i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in \mathbb{R}^K.
\]

A CNN utilizes a stack of convolutional and pooling layers followed by fully-connected layers to predict the target vector. Let $L$ be the number of layers, $L^c$ be the number of convolutional layers, and $L^f$ be the number of fully-connected layers. The images
\[
Z^{0,i}, \quad i = 1, \dots, l,
\]
are the input of layer 1. Subsequently we describe operations of convolutional layers, pooling layers, and fully-connected layers.

2.1 Convolutional Layer

A hallmark of CNN is that both input and output of a convolutional layer are explicitly assumed to be images. We discuss the details between layers $m-1$ and $m$. Let the input be an image of dimensions
\[
a^{m-1} \times b^{m-1} \times d^{m-1},
\]
where $a^{m-1}$ is the height, $b^{m-1}$ is the width, and $d^{m-1}$ is the depth (or the number of channels). Thus for every given channel, we have a matrix of $a^{m-1} \times b^{m-1}$ pixels. Specifically, the input contains the following matrices:
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & & \\
& \ddots & \\
& & z^{m-1,i}_{a^{m-1},b^{m-1},1}
\end{bmatrix},
\ \dots,\
\begin{bmatrix}
z^{m-1,i}_{1,1,d^{m-1}} & & \\
& \ddots & \\
& & z^{m-1,i}_{a^{m-1},b^{m-1},d^{m-1}}
\end{bmatrix}.
\tag{2.1}
\]

For example, at layer 1, usually $d^0 = 3$ because of three color channels (red, green, blue). For each channel, the matrix of size $a^0 \times b^0$ contains raw pixel values of the image.

To generate the output, we consider $d^m$ filters, each of which is a 3-D weight matrix of size
\[
h^m \times h^m \times d^{m-1}.
\]
Specifically, the $j$th filter includes the following matrices
\[
\begin{bmatrix}
w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{1,h^m,1} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,1} & \cdots & w^{m,j}_{h^m,h^m,1}
\end{bmatrix},
\ \dots,\
\begin{bmatrix}
w^{m,j}_{1,1,d^{m-1}} & \cdots & w^{m,j}_{1,h^m,d^{m-1}} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d^{m-1}} & \cdots & w^{m,j}_{h^m,h^m,d^{m-1}}
\end{bmatrix}
\]
and a bias term $b^m_j$. The main idea of CNN is to conduct convolutional operations, each of which is the inner product between a small sub-image and a filter. We now describe the details. Specifically, for the $j$th filter, we scan the entire input image to obtain small regions of size $(h^m, h^m)$ and calculate the inner product between each region and the filter. For example, if we start from the upper left corner of the input image, the first sub-image of channel $d$ is
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,d} & \cdots & z^{m-1,i}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,1,d} & \cdots & z^{m-1,i}_{h^m,h^m,d}
\end{bmatrix}.
\]
We then calculate the following value:
\[
\sum_{d=1}^{d^{m-1}}
\left\langle
\begin{bmatrix}
z^{m-1,i}_{1,1,d} & \cdots & z^{m-1,i}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,1,d} & \cdots & z^{m-1,i}_{h^m,h^m,d}
\end{bmatrix},
\begin{bmatrix}
w^{m,j}_{1,1,d} & \cdots & w^{m,j}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d} & \cdots & w^{m,j}_{h^m,h^m,d}
\end{bmatrix}
\right\rangle
+ b^m_j,
\tag{2.2}
\]

where $\langle \cdot, \cdot \rangle$ means the sum of component-wise products between two matrices. This value becomes the $(1, 1)$ position of channel $j$ of the output image.

Next, we must obtain other sub-images to produce values in other positions of the output image. We specify the stride $s^m$ for sliding the filter. That is, we move $s^m$ pixels vertically or horizontally to get sub-images. For the $(2, 1)$ position of the output image, we move down $s^m$ pixels vertically to obtain the following sub-image:
\[
\begin{bmatrix}
z^{m-1,i}_{1+s^m,1,d} & \cdots & z^{m-1,i}_{1+s^m,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m+s^m,1,d} & \cdots & z^{m-1,i}_{h^m+s^m,h^m,d}
\end{bmatrix},
\quad d = 1, \dots, d^{m-1}.
\]
Then the $(2, 1)$ position of channel $j$ of the output image is
\[
\sum_{d=1}^{d^{m-1}}
\left\langle
\begin{bmatrix}
z^{m-1,i}_{1+s^m,1,d} & \cdots & z^{m-1,i}_{1+s^m,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m+s^m,1,d} & \cdots & z^{m-1,i}_{h^m+s^m,h^m,d}
\end{bmatrix},
\begin{bmatrix}
w^{m,j}_{1,1,d} & \cdots & w^{m,j}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d} & \cdots & w^{m,j}_{h^m,h^m,d}
\end{bmatrix}
\right\rangle
+ b^m_j.
\tag{2.3}
\]
Assume that vertically and horizontally we can move the filter $a^m$ and $b^m$ times, respectively. Therefore,
\[
\begin{aligned}
a^{m-1} &= h^m + (a^m - 1) \times s^m, \\
b^{m-1} &= h^m + (b^m - 1) \times s^m.
\end{aligned}
\tag{2.4}
\]

By our notation, the output image has the size
\[
a^m \times b^m \times d^m.
\]
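As a concrete illustration (an example of ours, not one of the thesis's experimental settings): if $a^{m-1} = b^{m-1} = 28$, $h^m = 5$, and $s^m = 1$, then (2.4) gives $a^m = b^m = (28 - 5)/1 + 1 = 24$, so applying $d^m = 32$ filters produces an output image of size $24 \times 24 \times 32$.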

For efficient implementations, we can conduct all operations including (2.2) and (2.3) by matrix operations. To begin, we concatenate the matrices of the different channels in (2.1) to
\[
Z^{m-1,i} =
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{a^{m-1},1,1} & z^{m-1,i}_{1,2,1} & \cdots & z^{m-1,i}_{a^{m-1},b^{m-1},1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{1,1,d^{m-1}} & \cdots & z^{m-1,i}_{a^{m-1},1,d^{m-1}} & z^{m-1,i}_{1,2,d^{m-1}} & \cdots & z^{m-1,i}_{a^{m-1},b^{m-1},d^{m-1}}
\end{bmatrix},
\quad i = 1, \dots, l.
\tag{2.5}
\]
We note that (2.2) is the inner product between the following two vectors:
\[
\begin{bmatrix}
w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{h^m,1,1} & w^{m,j}_{1,2,1} & \cdots & w^{m,j}_{h^m,h^m,1} & \cdots & w^{m,j}_{h^m,h^m,d^{m-1}} & b^m_j
\end{bmatrix}^T
\]
and
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{h^m,1,1} & z^{m-1,i}_{1,2,1} & \cdots & z^{m-1,i}_{h^m,h^m,1} & \cdots & z^{m-1,i}_{h^m,h^m,d^{m-1}} & 1
\end{bmatrix}^T.
\]

Therefore, based on Vedaldi and Lenc (2015), we define the following two operators
\[
\mathrm{vec}(M) =
\begin{bmatrix}
(M_{:,1})^T & \cdots & (M_{:,n})^T
\end{bmatrix}^T
\in \mathbb{R}^{mn \times 1}, \quad \text{where } M \in \mathbb{R}^{m \times n},
\tag{2.6}
\]
\[
\mathrm{mat}(v)_{m \times n} =
\begin{bmatrix}
v_1 & \cdots & v_{(n-1)m+1} \\
\vdots & \ddots & \vdots \\
v_m & \cdots & v_{nm}
\end{bmatrix}
\in \mathbb{R}^{m \times n}, \quad \text{where } v \in \mathbb{R}^{mn \times 1},
\tag{2.7}
\]
and the operator
\[
\phi : \mathbb{R}^{d^{m-1} \times a^{m-1} b^{m-1}} \to \mathbb{R}^{h^m h^m d^{m-1} \times a^m b^m}
\]
in order to collect all sub-images in $Z^{m-1,i}$:
\[
\phi(Z^{m-1,i}) \equiv \mathrm{mat}\!\left(P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i})\right)_{h^m h^m d^{m-1} \times a^m b^m}, \quad m = 1, \dots, L^c,\ \forall i,
\tag{2.8}
\]
where
\[
P^{m-1}_\phi \in \mathbb{R}^{h^m h^m d^{m-1} a^m b^m \times d^{m-1} a^{m-1} b^{m-1}}.
\]

We discuss the implementation details of (2.8) in Chapter 4.1. Then $\phi(Z^{m-1,i})$ is derived as follows:
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1,1} & z^{m-1,i}_{1,1+s^m,1} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1+(b^m-1)s^m,1} \\
z^{m-1,i}_{2,1,1} & \cdots & z^{m-1,i}_{2+(a^m-1)s^m,1,1} & z^{m-1,i}_{2,1+s^m,1} & \cdots & z^{m-1,i}_{2+(a^m-1)s^m,1+(b^m-1)s^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,h^m,1} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m,1} & z^{m-1,i}_{h^m,h^m+s^m,1} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m+(b^m-1)s^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{1,1,d^{m-1}} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1,d^{m-1}} & z^{m-1,i}_{1,1+s^m,d^{m-1}} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1+(b^m-1)s^m,d^{m-1}} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,h^m,d^{m-1}} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m,d^{m-1}} & z^{m-1,i}_{h^m,h^m+s^m,d^{m-1}} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m+(b^m-1)s^m,d^{m-1}}
\end{bmatrix}.
\tag{2.9}
\]
By considering
\[
W^m =
\begin{bmatrix}
w^{m,1}_{1,1,1} & w^{m,1}_{2,1,1} & \cdots & w^{m,1}_{h^m,h^m,d^{m-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{m,d^m}_{1,1,1} & w^{m,d^m}_{2,1,1} & \cdots & w^{m,d^m}_{h^m,h^m,d^{m-1}}
\end{bmatrix}
\in \mathbb{R}^{d^m \times h^m h^m d^{m-1}}
\quad \text{and} \quad
b^m =
\begin{bmatrix}
b^m_1 \\ \vdots \\ b^m_{d^m}
\end{bmatrix}
\in \mathbb{R}^{d^m \times 1},
\tag{2.10}
\]
the following operations are conducted:
\[
S^{m,i} = W^m \phi(Z^{m-1,i}) + b^m \mathbb{1}^T_{a^m b^m} \in \mathbb{R}^{d^m \times a^m b^m},
\tag{2.11}
\]
where
\[
S^{m,i} =
\begin{bmatrix}
s^{m,i}_{1,1,1} & \cdots & s^{m,i}_{a^m,1,1} & s^{m,i}_{1,2,1} & \cdots & s^{m,i}_{a^m,b^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
s^{m,i}_{1,1,d^m} & \cdots & s^{m,i}_{a^m,1,d^m} & s^{m,i}_{1,2,d^m} & \cdots & s^{m,i}_{a^m,b^m,d^m}
\end{bmatrix}
\quad \text{and} \quad
\mathbb{1}_{a^m b^m} =
\begin{bmatrix}
1 \\ \vdots \\ 1
\end{bmatrix}
\in \mathbb{R}^{a^m b^m \times 1}.
\]

Next, an activation function is applied to scale the value:
\[
z^{m,i}_{a,b,d} = \sigma(s^{m,i}_{a,b,d}),
\tag{2.12}
\]
where $a = 1, \dots, a^m$, $b = 1, \dots, b^m$, and $d = 1, \dots, d^m$. For CNN, commonly the following RELU activation function
\[
\sigma(x) = \max(x, 0)
\tag{2.13}
\]
is used and we consider it in this work. The output becomes the following matrix:
\[
Z^{m,i} =
\begin{bmatrix}
z^{m,i}_{1,1,1} & z^{m,i}_{2,1,1} & \cdots & z^{m,i}_{a^m,b^m,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^{m,i}_{1,1,d^m} & z^{m,i}_{2,1,d^m} & \cdots & z^{m,i}_{a^m,b^m,d^m}
\end{bmatrix}.
\tag{2.14}
\]
We then apply (2.8) to expand the output to form the matrix $\phi(Z^{m,i})$ and then substitute $\phi(Z^{m,i})$ into (2.11), so we can continue the operations between layers $m$ and $m+1$. Note that by the matrix representation, the storage is increased from
\[
a^{m-1} \times b^{m-1} \times d^{m-1}
\]
in (2.1) to
\[
(h^m h^m d^{m-1}) \times a^m \times b^m.
\]
From (2.4), roughly a
\[
\left(\frac{h^m}{s^m}\right)^2
\]
fold increase of the memory occurs. However, we gain efficiency by using fast matrix-matrix multiplications in optimized BLAS (Dongarra et al., 1990).
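To make the matrix form concrete, the following minimal MATLAB sketch (ours, not the thesis's optimized implementation, which instead builds $P^{m-1}_\phi$ once as discussed in Chapter 4.1) constructs $\phi(Z^{m-1,i})$ for one image with a double loop and then applies (2.11). It assumes, as in (2.4), that $a^{m-1} - h^m$ and $b^{m-1} - h^m$ are divisible by $s^m$; all function and variable names are ours.

```matlab
% A minimal sketch of (2.8)-(2.11) for one image.  Z is the d_in x (a_in*b_in)
% matrix of (2.5), W is the d_out x (h*h*d_in) matrix of (2.10), b is d_out x 1.
function [S, phiZ] = conv_forward_sketch(Z, W, b, a_in, b_in, h, s)
  d_in  = size(Z, 1);
  a_out = (a_in - h)/s + 1;              % number of vertical filter positions, see (2.4)
  b_out = (b_in - h)/s + 1;              % number of horizontal filter positions
  phiZ  = zeros(h*h*d_in, a_out*b_out);  % each column is one vectorized sub-image, as in (2.9)
  col = 0;
  for q = 1:b_out                        % horizontal position of the filter
    for p = 1:a_out                      % vertical position of the filter
      col = col + 1;
      rows = (p-1)*s + (1:h);            % image rows covered by the filter
      cols = (q-1)*s + (1:h);            % image columns covered by the filter
      idx  = rows' + (cols - 1)*a_in;    % h x h linear pixel indices (row index varies fastest)
      patch = Z(:, idx(:))';             % (h*h) x d_in: the patch of every channel
      phiZ(:, col) = patch(:);           % stack channels, matching the rows of (2.9)
    end
  end
  S = W*phiZ + b*ones(1, a_out*b_out);   % the matrix form (2.11)
end
```

The columns of phiZ are ordered as in (2.9): the vertical filter position varies fastest, matching the column ordering of $S^{m,i}$ in (2.11).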


2.1.1 Zero Padding

To make (2.4) hold or to make $a^m$ larger, sometimes we enlarge the input image to have zero values around the border. This technique, conducted before the mapping in (2.8), is called zero-padding in CNN training. For example, we may set
\[
a^m = a^{m-1}
\]
in order to prevent the decrease of the image size. When
\[
s^m = 1,
\]
we can pad the input image with
\[
h^m - 1
\]
lines of zeros around every border; see Figure 2.1. For our derivation, we represent the padding operation as the following linear operation:
\[
Z^{m,i} = \mathrm{mat}\!\left(P^{m-1,i}_{\text{padding}} \mathrm{vec}(Z^{m-1,i})\right)_{d^m \times a^m b^m}.
\tag{2.15}
\]
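As a small illustration (ours; the thesis instead forms the linear map $P^{m-1,i}_{\text{padding}}$, with implementation details in Chapter 4.3), the following MATLAB sketch pads an image stored as a $d \times ab$ matrix, as in (2.14), with $p$ lines of zeros on every border; all names are ours.

```matlab
% A minimal sketch of zero padding as in (2.15): the d x (a*b) image matrix Z
% is mapped to a d x ((a+2p)*(b+2p)) matrix with p lines of zeros on every border.
function Zpad = zero_pad_sketch(Z, a, b, p)
  d = size(Z, 1);
  Zpad = zeros(d, (a + 2*p)*(b + 2*p));
  for c = 1:b                             % copy image column c into the padded image
    src = (c-1)*a + (1:a);                % pixels of column c in Z
    dst = (c-1+p)*(a + 2*p) + p + (1:a);  % the same pixels inside the padded image
    Zpad(:, dst) = Z(:, src);
  end
end
```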

2.1.2 Pooling Layer

For CNN, to reduce the computational cost, a dimension reduction is often applied by using a pooling layer after each convolutional layer. Usually we consider an operation that can (approximately) extract rotational or translational invariance features. There are various types of pooling methods such as average pooling, max pooling, and stochastic pooling. We consider max pooling in this chapter because it is the most used setting for CNN. Here we show an example of max pooling by considering two $4 \times 4$ images, A and B, in Figure 2.2. The image B is derived by shifting A by 1 pixel in the horizontal direction. We split the two images into four $2 \times 2$ sub-images and choose the max value from every sub-image. In each sub-image, because only some elements are changed, the maximal value is likely the same or similar. This is called translational invariance, and for our example the two output images from A and B are the same.

Figure 2.1: A padding example with $s^m = 1$ in order to set $a^m = a^{m-1}$ (the original $a^{m-1} \times b^{m-1}$ image is surrounded by $h^m - 1$ lines of zeros on every border).

Figure 2.2: An illustration of max pooling to extract translational invariance features. The image B is derived from shifting A by 1 pixel in the horizontal direction:
\[
A = \begin{bmatrix} 2&3&6&8\\ 5&4&9&7\\ 1&2&6&0\\ 4&3&2&1 \end{bmatrix}
\rightarrow \begin{bmatrix} 5&9\\ 4&6 \end{bmatrix},
\qquad
B = \begin{bmatrix} 3&2&3&6\\ 4&5&4&9\\ 2&1&2&6\\ 3&4&3&2 \end{bmatrix}
\rightarrow \begin{bmatrix} 5&9\\ 4&6 \end{bmatrix}.
\]

Now we discuss the mathematical operation of the pooling layer. They are in fact special cases of convolutional operations. Assume $Z^{m-1,i}$ is the input image (i.e., the output image of the previous convolutional layer). We partition every channel of $Z^{m-1,i}$ into non-overlapping sub-regions by $h^m \times h^m$ filters with the stride $s^m = h^m$ (because of the disjoint sub-regions, the stride $s^m$ for sliding the filters is equal to $h^m$). By the same definition as (2.8) we can generate the matrix
\[
\phi(Z^{m-1,i}) = \mathrm{mat}\!\left(P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i})\right)_{h^m h^m \times d^{m-1} a^m b^m},
\tag{2.16}
\]
where
\[
a^m = \frac{a^{m-1}}{h^m}, \qquad b^m = \frac{b^{m-1}}{h^m}.
\tag{2.17}
\]
To select the largest element of each sub-region, there exists a matrix
\[
W^{m,i} \in \mathbb{R}^{d^m a^m b^m \times h^m h^m d^{m-1} a^m b^m}
\]
so that each row of $W^{m,i}$ selects a single element from $\mathrm{vec}(\phi(Z^{m-1,i}))$. Therefore,
\[
Z^{m,i} = \mathrm{mat}\!\left(W^{m,i} \mathrm{vec}(\phi(Z^{m-1,i}))\right)_{d^m \times a^m b^m}.
\tag{2.18}
\]
Note that, different from (2.11) of the convolutional layer, $W^{m,i}$ is a constant matrix rather than a weight matrix. Further, because from (2.8)
\[
\mathrm{vec}(\phi(Z^{m-1,i})) = P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i}),
\]
we have
\[
Z^{m,i} = \mathrm{mat}\!\left(P^{m-1,i}_{\text{pool}} \mathrm{vec}(Z^{m-1,i})\right)_{d^m \times a^m b^m},
\tag{2.19}
\]
where
\[
P^{m-1,i}_{\text{pool}} = W^{m,i} P^{m-1}_\phi \in \mathbb{R}^{d^m a^m b^m \times d^{m-1} a^{m-1} b^{m-1}}.
\]
We provide implementation details in Chapter 4.2. Note that pooling operations are often considered as an (optional) part of the convolutional layer. Here we treat them as a separate layer for an easier description of the procedure.
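A minimal MATLAB sketch of max pooling over disjoint $h \times h$ regions is given below (ours; the thesis constructs $P^{m-1,i}_{\text{pool}}$ as detailed in Chapter 4.2). Besides the pooled values, it records which input pixel each output element comes from, which is exactly the information encoded by the rows of $W^{m,i}$; all names are illustrative.

```matlab
% A minimal sketch of max pooling (2.16)-(2.19): the input Z is d x (a_in*b_in)
% as in (2.14), and every channel is split into disjoint h x h sub-regions.
% Assumes h divides a_in and b_in, as required by (2.17).
function [Zout, argmax_idx] = maxpool_sketch(Z, a_in, b_in, h)
  d = size(Z, 1);
  a_out = a_in/h;  b_out = b_in/h;                % (2.17)
  Zout = zeros(d, a_out*b_out);
  argmax_idx = zeros(d, a_out*b_out);             % winning pixel of every output element
  col = 0;
  for q = 1:b_out
    for p = 1:a_out
      col = col + 1;
      rows = (p-1)*h + (1:h);
      cols = (q-1)*h + (1:h);
      idx  = rows' + (cols - 1)*a_in;             % linear indices of the h x h sub-region
      [Zout(:, col), k] = max(Z(:, idx(:)), [], 2);
      argmax_idx(:, col) = idx(k);                % the selection made by W^{m,i}
    end
  end
end
```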

2.1.3 Summary of a Convolutional Layer

From the input $Z^{m-1,i}$, we consider the following flow as one convolutional layer:
\[
Z^{m-1,i} \rightarrow \text{padding} \rightarrow \phi(Z^{m-1,i}) \rightarrow \bar{S}^{m,i} \rightarrow \bar{Z}^{m,i} \rightarrow \text{pooling} \rightarrow Z^{m,i},
\]
where $\bar{S}^{m,i}$ and $\bar{Z}^{m,i}$ are $S^{m,i}$ and $Z^{m,i}$ in (2.11) and (2.14), respectively, if pooling is not applied.

2.2 Fully-Connected Layer

After passing through the convolutional and pooling layers, we concatenate columns in the matrix in (2.14) to form the input vector of the first fully-connected layer:
\[
z^{m,i} = \mathrm{vec}(Z^{m,i}), \quad i = 1, \dots, l, \quad m = L^c.
\]
In the fully-connected layers ($L^c < m \le L$), we consider the following weight matrix and bias vector between layers $m-1$ and $m$:
\[
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{21} & \cdots & w^m_{n_{m-1} 1} \\
w^m_{12} & w^m_{22} & \cdots & w^m_{n_{m-1} 2} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{1 n_m} & w^m_{2 n_m} & \cdots & w^m_{n_{m-1} n_m}
\end{bmatrix}_{n_m \times n_{m-1}}
\quad \text{and} \quad
b^m =
\begin{bmatrix}
b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_m}
\end{bmatrix}_{n_m \times 1},
\tag{2.20}
\]
where $n_{m-1}$ and $n_m$ are the numbers of neurons in layers $m-1$ and $m$, respectively ($n_{L^c} = d^{L^c} a^{L^c} b^{L^c}$ and $n_L = K$ is the number of classes). If $z^{m-1,i} \in \mathbb{R}^{n_{m-1}}$ is the input vector, the following operations are applied to generate the output vector $z^{m,i} \in \mathbb{R}^{n_m}$:
\[
s^{m,i} = W^m z^{m-1,i} + b^m,
\tag{2.21}
\]
\[
z^{m,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \dots, n_m.
\tag{2.22}
\]
For the activation function in fully-connected layers, except the last layer, we also consider the RELU function defined in (2.13). For the last layer, we use the following linear function:
\[
\sigma(x) = x.
\tag{2.23}
\]
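A minimal MATLAB sketch of the fully-connected part (2.21)–(2.23) is given below; Wfc and bfc are illustrative cell arrays holding $W^m$ and $b^m$ for $m = L^c+1, \dots, L$, and z is the vectorized output of the last convolutional layer.

```matlab
% A minimal sketch of the fully-connected forward operations for one instance.
function z = fc_forward_sketch(z, Wfc, bfc)
  Lf = numel(Wfc);                    % number of fully-connected layers
  for t = 1:Lf
    s = Wfc{t}*z + bfc{t};            % (2.21)
    if t < Lf
      z = max(s, 0);                  % RELU activation (2.13), applied as in (2.22)
    else
      z = s;                          % linear activation at the last layer (2.23)
    end
  end
end
```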


2.3 The Overall Optimization Problem

At the last layer, the output $z^{L,i}$, $\forall i$, is obtained. We can check how close it is to the label vector $y^i$ and consider the following squared loss in this work:
\[
\xi(z^{L,i}; y^i) = \|z^{L,i} - y^i\|^2.
\tag{2.24}
\]
Furthermore, we can collect all model parameters, such as filters of convolutional layers in (2.10) and weights/biases in (2.20) for fully-connected layers, into a long vector $\theta \in \mathbb{R}^n$, where $n$ becomes the total number of variables from the discussion in this chapter:
\[
n = \sum_{m=1}^{L^c} d^m \times (h^m \times h^m \times d^{m-1} + 1)
  + \sum_{m=L^c+1}^{L} n_m \times (n_{m-1} + 1).
\]
The output $z^{L,i}$ of the last layer is a function of $\theta$. The optimization problem to train a CNN is
\[
\min_\theta\ f(\theta),
\tag{2.25}
\]
where
\[
f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L,i}; y^i)
\tag{2.26}
\]
and the first term with the parameter $C > 0$ is used to avoid overfitting.


CHAPTER III

Hessian-free Newton Methods for Training CNN

To solve an unconstrained minimization problem such as (2.25), a Newton method iteratively finds a search direction $d$ by solving the following second-order approximation:
\[
\min_d\ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta) d,
\tag{3.1}
\]
where $\nabla f(\theta)$ and $\nabla^2 f(\theta)$ are the gradient vector and the Hessian matrix, respectively. In this chapter we present details of applying a Newton method to solve the CNN problem (2.25).

3.1 Procedure of the Newton Method

For CNN, the gradient of $f(\theta)$ is
\[
\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}} \xi(z^{L,i}; y^i),
\tag{3.2}
\]
where
\[
J^i =
\begin{bmatrix}
\frac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L,i}_1}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \frac{\partial z^{L,i}_{n_L}}{\partial \theta_n}
\end{bmatrix}_{n_L \times n},
\quad i = 1, \dots, l,
\tag{3.3}
\]
is the Jacobian of $z^{L,i}$. The Hessian matrix of $f(\theta)$ is
\[
\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i
+ \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_L}
\frac{\partial \xi(z^{L,i}; y^i)}{\partial z^{L,i}_j}
\begin{bmatrix}
\frac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_n}
\end{bmatrix},
\tag{3.4}
\]
where $I$ is the identity matrix and
\[
B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y^i)}{\partial z^{L,i}_t \partial z^{L,i}_s},
\quad t = 1, \dots, n_L, \quad s = 1, \dots, n_L.
\tag{3.5}
\]
From now on, for simplicity we let
\[
\xi_i \equiv \xi(z^{L,i}; y^i).
\]

If $f(\theta)$ is non-convex, as in the case of deep learning, (3.1) is difficult to solve and the resulting direction may not lead to the decrease of the function value. Thus the Gauss-Newton approximation (Schraudolph, 2002)
\[
G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i \approx \nabla^2 f(\theta)
\tag{3.6}
\]
is often used. In particular, if $G$ is positive definite, then (3.1) becomes the same as solving the following linear system:
\[
G d = -\nabla f(\theta).
\tag{3.7}
\]
After a Newton direction $d$ is obtained, to ensure the convergence, we update $\theta$ by
\[
\theta \leftarrow \theta + \alpha d,
\]
where $\alpha$ is the largest element in $\{1, \frac{1}{2}, \frac{1}{4}, \dots\}$ which can satisfy
\[
f(\theta + \alpha d) \le f(\theta) + \eta \alpha \nabla f(\theta)^T d,
\tag{3.8}
\]
where $\eta \in (0, 1)$ is a pre-defined constant. The procedure to find $\alpha$ is called a backtracking line search.

Past works (e.g., Martens, 2010; Wang et al., 2018a) have shown that sometimes (3.7) is too aggressive, so a direction closer to the negative gradient is better. To this end, in recent works of applying Newton methods on fully-connected networks, the Levenberg-Marquardt method (Levenberg, 1944; Marquardt, 1963) is used to solve the following linear system rather than (3.7):
\[
(G + \lambda I) d = -\nabla f(\theta),
\tag{3.9}
\]
where $\lambda$ is a parameter decided by how good the function reduction is. Specifically, we define
\[
\rho = \frac{f(\theta + d) - f(\theta)}{\nabla f(\theta)^T d + \frac{1}{2} d^T G d}
\]
as the ratio between the actual function reduction and the predicted reduction. By using $\rho$, the parameter $\lambda_{\text{next}}$ for the next iteration is decided by
\[
\lambda_{\text{next}} =
\begin{cases}
\lambda \times \text{drop} & \rho > \rho_{\text{upper}}, \\
\lambda & \rho_{\text{lower}} \le \rho \le \rho_{\text{upper}}, \\
\lambda \times \text{boost} & \text{otherwise},
\end{cases}
\tag{3.10}
\]
where (drop, boost) are given constants. From (3.10) we can clearly see that if the function-value reduction is not satisfactory, then $\lambda$ is enlarged and the resulting direction is closer to the negative gradient.
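The following minimal MATLAB sketch shows the two control mechanisms just described: the backtracking condition (3.8) and the $\lambda$ update (3.10). The numerical constants (drop, boost, the $\rho$ thresholds) are illustrative choices of ours, not values fixed by the thesis.

```matlab
% A sketch of the Levenberg-Marquardt adjustment (3.10); constants are examples.
function lambda = lm_update_sketch(lambda, rho)
  drop = 2/3;  boost = 3/2;                % example (drop, boost) constants
  rho_upper = 0.75;  rho_lower = 0.25;     % example thresholds
  if rho > rho_upper
    lambda = lambda * drop;                % good reduction: decrease lambda
  elseif rho < rho_lower
    lambda = lambda * boost;               % poor reduction: move toward negative gradient
  end                                      % otherwise lambda is unchanged
end

% A sketch of the backtracking line search enforcing (3.8).
function [theta, fnew] = backtracking_sketch(theta, d, f, fval, grad_d, eta)
  % f: handle for f(theta); fval = f(theta); grad_d = gradient'*d; eta in (0,1)
  alpha = 1;
  while true
    fnew = f(theta + alpha*d);
    if fnew <= fval + eta*alpha*grad_d     % sufficient-decrease condition (3.8)
      break
    end
    alpha = alpha / 2;
  end
  theta = theta + alpha*d;
end
```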

Next, we discuss how to solve the linear system (3.9). When the number of variables $n$ is large, the matrix $G$ is too large to be stored. For some optimization problems including neural networks, without explicitly storing $G$ it is possible to calculate the product between $G$ and any vector $v$ (Le et al., 2011; Martens, 2010; Wang et al., 2018a). For example, from (3.6),
\[
(G + \lambda I) v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \left(B^i (J^i v)\right).
\tag{3.11}
\]
If the product between $J^i$ and a vector can be easily calculated, then $G$ does not need to be explicitly formed. Therefore, we can apply the conjugate gradient (CG) method to solve (3.7) by a sequence of matrix-vector products. This technique is called Hessian-free methods in optimization. Details of CG methods in a Hessian-free Newton framework can be found in, for example, Algorithm 2 of Lin et al. (2007).

Because the computational cost in (3.11) is proportional to the number of instances, subsampled Hessian Newton methods have been proposed (Byrd et al., 2011; Martens, 2010; Wang et al., 2015) to reduce the cost in solving the linear system (3.9). They observe that the second term in (3.6) is the average training loss. If the large number of data points are assumed to be from the same distribution, (3.6) can be reasonably approximated by selecting a subset $S \subset \{1, \dots, l\}$ and having
\[
G^S = \frac{1}{C} I + \frac{1}{|S|} \sum_{i \in S} (J^i)^T B^i J^i \approx G.
\]
Then (3.11) becomes
\[
(G^S + \lambda I) v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left(B^i (J^i v)\right) \approx (G + \lambda I) v.
\]
A summary of the Newton method is in Algorithm 1.
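As an illustration of step 5 of Algorithm 1 (given next), the following minimal MATLAB sketch forms the subsampled product of (3.11) implicitly and calls a CG solver. The handles Jv_fun and JTu_fun are hypothetical: they are assumed to return the stacked products $J^i v$ for $i \in S$ and $\sum_{i \in S}(J^i)^T u^i$, respectively; for the squared loss (2.24), $B^i = 2I$ by (3.5). MATLAB's pcg and the tolerance values below are only stand-ins for the thesis's own CG procedure.

```matlab
% A sketch, not the thesis's implementation: solve (G_S + lambda*I) d = -grad
% by CG, with the Gauss-Newton product of (3.11) and B^i = 2*I for the squared
% loss (2.24).  Jv_fun and JTu_fun are hypothetical Jacobian-product handles.
function d = newton_direction_sketch(grad, Jv_fun, JTu_fun, C, lambda, S_size)
  Gv = @(v) (1/C + lambda)*v + JTu_fun(2*Jv_fun(v)) / S_size;  % (G_S + lambda*I)*v
  [d, ~] = pcg(Gv, -grad, 1e-2, 250);   % approximate solve; tolerance/limit are examples
end
```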

Algorithm 1 A standard subsampled Hessian Newton method for CNN.
1: Compute $f(\theta^1)$.
2: for $k = 1, 2, \dots$ do
3: Choose a set $S_k \subset \{1, \dots, l\}$.
4: Compute $\nabla f(\theta^k)$ and $J^i$, $\forall i \in S_k$.
5: Approximately solve the linear system in (3.9) by CG to obtain a direction $d^k$.
6: $\alpha = 1$.
7: while true do
8: Update $\theta^{k+1} = \theta^k + \alpha d^k$ and compute $f(\theta^{k+1})$.
9: if (3.8) is satisfied then
10: break
11: end if
12: $\alpha \leftarrow \alpha/2$.
13: end while
14: Calculate $\lambda_{k+1}$ based on (3.10).
15: end for

3.2 Gradient Evaluation

In order to solve (3.7), $\nabla f(\theta)$ is needed. It can be obtained by (3.2) if the Jacobian matrix $J^i$, $i = 1, \dots, l$, is available; see the discussion on Jacobian calculation in Chapter 3.3. However, we have mentioned in Chapter 3.1 that in practice only a subset of the $J^i$ may be calculated. Thus here we present a direct calculation by a backward process.

Consider two layers $m-1$ and $m$. The variables between them are $W^m$ and $b^m$, so we aim to calculate the following gradient components:
\[
\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m},
\tag{3.12}
\]
\[
\frac{\partial f}{\partial b^m} = \frac{1}{C} b^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial b^m}.
\tag{3.13}
\]

Because (3.12) is in a matrix form, following past developments such as Vedaldi and Lenc (2015), it is easier to transform it to a vector form for the derivation. To begin, we list the following properties of the $\mathrm{vec}(\cdot)$ function, in which $\otimes$ is the Kronecker product:
\[
\begin{aligned}
\mathrm{vec}(AB) &= (I \otimes A)\,\mathrm{vec}(B) && (3.14)\\
&= (B^T \otimes I)\,\mathrm{vec}(A), && (3.15)\\
\mathrm{vec}(AB)^T &= \mathrm{vec}(B)^T (I \otimes A^T) && (3.16)\\
&= \mathrm{vec}(A)^T (B \otimes I). && (3.17)
\end{aligned}
\]
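As a quick sanity check (ours, not part of the thesis), the identities (3.14)–(3.15) can be verified numerically in MATLAB:

```matlab
% Numerical check of vec(AB) = (I (x) A) vec(B) = (B^T (x) I) vec(A).
A = rand(3, 4);  B = rand(4, 2);
lhs  = reshape(A*B, [], 1);                  % vec(AB): reshape stacks columns, as in (2.6)
rhs1 = kron(eye(2), A) * reshape(B, [], 1);  % (I (x) A) vec(B)
rhs2 = kron(B', eye(3)) * reshape(A, [], 1); % (B^T (x) I) vec(A)
disp([norm(lhs - rhs1), norm(lhs - rhs2)])   % both differences are at round-off level
```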

We further define
\[
\frac{\partial y}{\partial (x)^T} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_{|x|}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{|y|}}{\partial x_1} & \cdots & \frac{\partial y_{|y|}}{\partial x_{|x|}}
\end{bmatrix},
\]
where $x$ and $y$ are column vectors, and let
\[
\phi(z^{m-1,i}) = I_{n_{m-1}} z^{m-1,i}, \quad L^c < m \le L,
\]
where $I_p$ is the $p \times p$ identity matrix.

For the fully-connected layers, from (2.21), we have
\[
\begin{aligned}
s^{m,i} &= W^m z^{m-1,i} + b^m \\
&= (I_1 \otimes W^m)\, z^{m-1,i} + (\mathbb{1}_1 \otimes I_{n_m})\, b^m && (3.18)\\
&= \left((z^{m-1,i})^T \otimes I_{n_m}\right) \mathrm{vec}(W^m) + (\mathbb{1}_1 \otimes I_{n_m})\, b^m, && (3.19)
\end{aligned}
\]
where (3.18) and (3.19) are from (3.14) and (3.15), respectively. For the convolutional layers, from (2.11), we have
\[
\begin{aligned}
\mathrm{vec}(S^{m,i}) &= \mathrm{vec}(W^m \phi(Z^{m-1,i})) + \mathrm{vec}(b^m \mathbb{1}^T_{a^m b^m}) \\
&= (I_{a^m b^m} \otimes W^m)\, \mathrm{vec}(\phi(Z^{m-1,i})) + (\mathbb{1}_{a^m b^m} \otimes I_{d^m})\, b^m && (3.20)\\
&= \left(\phi(Z^{m-1,i})^T \otimes I_{d^m}\right) \mathrm{vec}(W^m) + (\mathbb{1}_{a^m b^m} \otimes I_{d^m})\, b^m, && (3.21)
\end{aligned}
\]
where (3.20) and (3.21) are from (3.14) and (3.15), respectively.

An advantage of using (3.18) and (3.20) is that they are in the same form, and so are (3.19) and (3.21). Thus we can derive the gradient together. We begin with calculating the gradient for convolutional layers. From (3.21), we derive
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial \mathrm{vec}(W^m)^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(\phi(Z^{m-1,i})^T \otimes I_{d^m}\right) \\
&= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T
\end{aligned}
\tag{3.22}
\]
and
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial (b^m)^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial (b^m)^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(\mathbb{1}_{a^m b^m} \otimes I_{d^m}\right) \\
&= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T,
\end{aligned}
\tag{3.23}
\]
where (3.22) and (3.23) are from (3.17). To calculate (3.22), $\phi(Z^{m-1,i})$ is available in the forward process of calculating the function value.
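In MATLAB, (3.22) and (3.23) therefore reduce to two matrix products per instance. The following minimal sketch (illustrative names, ours) assumes dxi_dS holds $\partial\xi_i/\partial S^{m,i}$ from the backward pass and phiZ holds the saved $\phi(Z^{m-1,i})$ from the forward pass.

```matlab
% A minimal sketch of the convolutional-layer gradient formulas (3.22)-(3.23)
% for one instance; regularization and the 1/l averaging of (3.12)-(3.13) are
% handled elsewhere.
function [gW, gb] = conv_grad_sketch(dxi_dS, phiZ)
  gW = dxi_dS * phiZ';                  % matrix whose vec(...) is (3.22)
  gb = dxi_dS * ones(size(phiZ, 2), 1); % (3.23): sum over all spatial positions
end
```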

For $\partial \xi_i / \partial S^{m,i}$, also needed in (3.22) and (3.23), it can be obtained by a backward process. Here we assume that $\partial \xi_i / \partial S^{m,i}$ is available, and calculate $\partial \xi_i / \partial S^{m-1,i}$ for layer $m-1$:
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial \mathrm{vec}(\phi(Z^{m-1,i}))^T}
   \frac{\partial \mathrm{vec}(\phi(Z^{m-1,i}))}{\partial \mathrm{vec}(Z^{m-1,i})^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(I_{a^m b^m} \otimes W^m\right) P^{m-1}_\phi && (3.24)\\
&= \mathrm{vec}\!\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^{m-1}_\phi, && (3.25)
\end{aligned}
\]
where (3.24) is from (2.8) and (3.20), and (3.25) is from (3.16).

Then, because the RELU activation function is considered for the convolutional layers, we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T} \odot \mathrm{vec}\!\left(I[Z^{m-1,i}]\right)^T,
\tag{3.26}
\]
where $\odot$ is the Hadamard product (i.e., element-wise product) and
\[
I[Z^{m-1,i}]_{(p,q)} =
\begin{cases}
1 & \text{if } z^{m-1,i}_{(p,q)} > 0, \\
0 & \text{otherwise}.
\end{cases}
\]

For fully-connected layers, by the same form in (3.18), (3.19), (3.20) and (3.21), we immediately get from (3.22), (3.23), (3.26) and (3.25) that
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T}
= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial s^{m,i}} (z^{m-1,i})^T\right)^T,
\tag{3.27}
\]
\[
\frac{\partial \xi_i}{\partial (b^m)^T} = \frac{\partial \xi_i}{\partial (s^{m,i})^T},
\tag{3.28}
\]
\[
\frac{\partial \xi_i}{\partial (z^{m-1,i})^T}
= \left((W^m)^T \frac{\partial \xi_i}{\partial s^{m,i}}\right)^T I_{n_{m-1}}
= \frac{\partial \xi_i}{\partial (s^{m,i})^T} W^m,
\tag{3.29}
\]
\[
\frac{\partial \xi_i}{\partial (s^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial (z^{m-1,i})^T} \odot I[z^{m-1,i}]^T.
\tag{3.30}
\]
Finally, we check the initial values of the backward process. From the square loss in (2.24), we have
\[
\frac{\partial \xi_i}{\partial z^{L,i}} = 2(z^{L,i} - y^i),
\qquad
\frac{\partial \xi_i}{\partial s^{L,i}} = \frac{\partial \xi_i}{\partial z^{L,i}},
\]
where the second equality holds because the linear activation (2.23) is used at the last layer.
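The fully-connected formulas (3.27)–(3.30) and the initial values translate directly into a short backward loop. The sketch below is our own, with illustrative names; the $1/C$ regularization term and the averaging in (3.12)–(3.13) are omitted. It assumes zs{t} and ss{t} store the layer input $z^{m-1,i}$ and the value $s^{m,i}$ from the forward pass for the $t$-th fully-connected layer.

```matlab
% A minimal sketch of the backward process (3.27)-(3.30) for the
% fully-connected part, for one instance.
function [gW, gb] = fc_backward_sketch(zs, ss, Wfc, zL, y)
  Lf = numel(Wfc);
  gW = cell(Lf, 1);  gb = cell(Lf, 1);
  dxi_ds = 2*(zL - y);                   % initial values: 2(z^{L,i} - y^i); last layer linear
  for t = Lf:-1:1
    gW{t} = dxi_ds * zs{t}';             % (3.27) in matrix form
    gb{t} = dxi_ds;                      % (3.28)
    if t > 1
      dxi_dz = Wfc{t}' * dxi_ds;         % (3.29), written as a column vector
      dxi_ds = dxi_dz .* (ss{t-1} > 0);  % (3.30): RELU indicator I[z^{m-1,i}] = (s^{m-1,i} > 0)
    end
  end
end
```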

3.2.1 Padding, Pooling, and the Overall Procedure

For the padding operation, from (2.15), we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T}
  \frac{\partial \mathrm{vec}(Z^{m,i})}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} P^{m-1,i}_{\text{padding}}.
\tag{3.31}
\]
Similarly, for the pooling layer, from (2.19), we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} P^{m-1,i}_{\text{pool}}.
\tag{3.32}
\]
Following the explanation in Chapter 2.1.3, to generate $\partial \xi_i / \partial \mathrm{vec}(S^{m-1,i})$ from $\partial \xi_i / \partial \mathrm{vec}(S^{m,i})$, we consider the following cycle:
\[
S^{m-1} \rightarrow Z^{m-1} \rightarrow \text{pooling} \rightarrow \text{padding} \rightarrow \phi(Z^{m-1,i}) \rightarrow S^m.
\]
Therefore, by combining (3.25), (3.26), (3.31) and (3.32), $\partial \xi_i / \partial \mathrm{vec}(Z^{m-1,i})^T$ is obtained by
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \mathrm{vec}\!\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T
  P^{m-1}_\phi\, P^{m-1}_{\text{padding}}\, P^{m-1}_{\text{pool}}.
\tag{3.33}
\]
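Because $P^{m-1,i}_{\text{pool}}$ has exactly one nonzero per row, it can be stored as a sparse matrix and (3.32) becomes a single sparse product. The following minimal sketch (illustrative names, ours; the thesis's actual construction is in Chapter 4.2) builds it from the winning positions recorded during max pooling, e.g. the argmax_idx of the earlier pooling sketch.

```matlab
% A minimal sketch of forming the sparse matrix P_pool of (2.19) from the
% winning positions of a max-pooling pass.  argmax_idx is d x (a_out*b_out)
% and holds linear pixel indices into the a_in x b_in input image.
function P_pool = build_P_pool_sketch(argmax_idx, a_in, b_in)
  [d, nout] = size(argmax_idx);
  rows = (1:d*nout)';                  % one selected entry per output element
  % vec(Z^{m-1,i}) stacks the d x (a_in*b_in) matrix column by column, so the
  % element of channel c at pixel p sits at index (p-1)*d + c.
  chan = repmat((1:d)', nout, 1);
  pix  = argmax_idx(:);
  cols = (pix - 1)*d + chan;
  P_pool = sparse(rows, cols, 1, d*nout, d*a_in*b_in);
end
```

The gradient step (3.32) is then just the row vector–matrix product of $\partial\xi_i/\partial\mathrm{vec}(Z^{m,i})^T$ with P_pool.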

3.3 Jacobian Evaluation

For (3.6), the Jacobian matrix is needed and it can be partitioned into $L$ blocks:
\[
J^i =
\begin{bmatrix}
J^{1,i} & J^{2,i} & \cdots & J^{L,i}
\end{bmatrix},
\quad m = 1, \dots, L, \ i = 1, \dots, l,
\tag{3.34}
\]
where
\[
J^{m,i} =
\begin{bmatrix}
\frac{\partial z^{L,i}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}}{\partial (b^m)^T}
\end{bmatrix}.
\]
The calculation is very similar to that for the gradient. For the convolutional layers, from (3.22) and (3.23), we have
\[
\begin{aligned}
\begin{bmatrix}
\frac{\partial z^{L,i}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}}{\partial (b^m)^T}
\end{bmatrix}
&=
\begin{bmatrix}
\frac{\partial z^{L,i}_1}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}_1}{\partial (b^m)^T} \\
\vdots & \vdots \\
\frac{\partial z^{L,i}_{n_L}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}_{n_L}}{\partial (b^m)^T}
\end{bmatrix} \\
&=
\begin{bmatrix}
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T &
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T \\
\vdots & \vdots \\
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T &
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T
\end{bmatrix} \\
&=
\begin{bmatrix}
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}}
\begin{bmatrix} \phi(Z^{m-1,i})^T & \mathbb{1}_{a^m b^m} \end{bmatrix}\right)^T \\
\vdots \\
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}}
\begin{bmatrix} \phi(Z^{m-1,i})^T & \mathbb{1}_{a^m b^m} \end{bmatrix}\right)^T
\end{bmatrix}.
\end{aligned}
\tag{3.35}
\]
In the backward process, if $\partial z^{L,i}/\partial \mathrm{vec}(S^{m,i})^T$ is available, we calculate $\partial z^{L,i}/\partial \mathrm{vec}(S^{m-1,i})^T$ for convolutional layer $m-1$. From a derivation similar to (3.25) for the gradient, we

