
國立臺灣大學電機資訊學院資訊網路與多媒體研究所 碩士論文

Graduate Institute of Networking and Multimedia College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

牛頓法於卷積神經網路之應用

Newton Methods For Convolutional Neural Networks

陳勁龍

Tan Kent Loong

指導教授:林智仁 博士

Advisor: Chih-Jen Lin, Ph.D.

中華民國 107 年 7 月

July, 2018


中文摘要 (Abstract in Chinese)

Deep learning involves difficult non-convex optimization problems. Most studies use stochastic gradient (SG) methods to optimize such models. SG is usually effective, but it is sometimes not very robust. Recent studies have investigated Newton methods as an alternative optimization approach, but most of them apply the method only to fully-connected neural networks. They do not investigate more widely used deep learning models such as convolutional neural networks (CNN). One reason is that applying Newton methods to CNN involves several complicated operations, so no careful study has been conducted so far. In this thesis, we give detailed building blocks, including the evaluation of the function, the gradient, and the Jacobian, as well as Gauss-Newton matrix-vector products. These basic building blocks are very important: without them, no recent improvement of Newton methods for fully-connected networks can even be tried on CNN. Our work may therefore enable further developments of Newton methods for CNN. We finish a simple MATLAB implementation, and experimental results show that the method is competitive.

Keywords: convolutional neural networks, multi-class classification, large-scale learning, subsampled Hessian.


ABSTRACT

Deep learning involves a difficult non-convex optimization problem, which is often solved by stochastic gradient (SG) methods. While SG is usually effective, it is sometimes not very robust. Recently, Newton methods have been investigated as an alternative optimization technique, but nearly all existing studies consider only fully-connected feedforward neural networks. They do not investigate other types of networks such as Convolutional Neural Networks (CNN), which are more commonly used in deep-learning applications. One reason is that Newton methods for CNN involve complicated operations, and so far no works have conducted a thorough investigation.

In this thesis, we give details of building blocks including function, gradient, and Jacobian evaluation, and Gauss-Newton matrix-vector products. These basic components are very important because without them none of the recent improvements of Newton methods for fully-connected networks can even be tried. Thus we will enable possible further developments of Newton methods for CNN. We finish a simple MATLAB implementation and show that it gives competitive test accuracy.

KEYWORDS: Convolutional neural networks, multi-class classification, large-scale classification, subsampled Hessian.

TABLE OF CONTENTS

口試委員會審定書 (Committee certification)
中文摘要 (Abstract in Chinese)
ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER
I. Introduction
II. Optimization Problem of Feedforward CNN
  2.1 Convolutional Layer
    2.1.1 Zero Padding
    2.1.2 Pooling Layer
    2.1.3 Summary of a Convolutional Layer
  2.2 Fully-Connected Layer
  2.3 The Overall Optimization Problem
III. Hessian-free Newton Methods for Training CNN
  3.1 Procedure of the Newton Method
  3.2 Gradient Evaluation
    3.2.1 Padding, Pooling, and the Overall Procedure
  3.3 Jacobian Evaluation
  3.4 Gauss-Newton Matrix-Vector Products
IV. Implementation Details
  4.1 Generation of $\phi(Z^{m-1,i})$
  4.2 Construction of $P^{m-1,i}_{\text{pool}}$
  4.3 Details of Padding Operation
  4.4 Evaluation of $v^T P^{m-1}_\phi$
  4.5 Gauss-Newton Matrix-Vector Products
  4.6 Mini-Batch Function and Gradient Evaluation
V. Analysis of Newton Methods for CNN
  5.1 Memory Requirement
  5.2 Computational Cost
VI. Experiments
  6.1 Comparison Between Newton and Stochastic Gradient Methods
VII. Conclusions

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

2.1 A padding example with $s^m = 1$ in order to set $a^m = a^{m-1}$.
2.2 An illustration of max pooling to extract translational invariance features. The image B is derived from shifting A by 1 pixel in the horizontal direction.

LIST OF TABLES

6.1 Summary of the data sets, where $a^0 \times b^0 \times d^0$ represents the (height, width, channel) of the input image, $l$ is the number of training data, $l_t$ is the number of test data, and $n_L$ is the number of classes.
6.2 Structure of convolutional neural networks. "conv" indicates a convolutional layer, "pool" indicates a pooling layer, and "full" indicates a fully-connected layer.
6.3 Test accuracy for the Newton method and SG. For the Newton method, we trained for 250 iterations; for SG, we trained for 1000 epochs.


CHAPTER I

Introduction

Deep learning is now widely used in many applications. To apply this technique, a difficult non-convex optimization problem must be solved. Currently, stochastic gradient (SG) methods and their variants are the major optimization technique used for deep learning (e.g., Krizhevsky et al., 2012; Simonyan and Zisserman, 2014). This situation is different from some application domains, where other types of optimization methods are more frequently used. One interesting research question is thus to study if other optimization methods can be extended to be viable alternatives for deep learning. In this thesis, we aim to address this issue by developing a practical Newton method for deep learning.

Some past works have studied Newton methods for training deep neural networks (e.g., Botev et al. 2017; He et al. 2016; Kiros 2013; Martens 2010; Vinyals and Povey 2012; Wang et al. 2015, 2018a). Almost all of them consider fully-connected feedforward neural networks and some have shown the potential of Newton methods for being more robust than SG. Unfortunately, these works have not fully established Newton methods as a practical technique for deep learning because other types of networks such as Convolutional Neural Networks (CNN) are more commonly used in deep-learning applications. One important reason why CNN was not considered is because of the very complicated operations in implementing Newton methods. Up to now no works have shown details of all the building blocks such as function, gradient, and Jacobian evaluation, and Hessian-vector products. In particular, because interpreted-type languages such as Python or MATLAB have been popular for deep learning, how to easily implement efficient operations by these languages is an important research issue.

In this thesis, we aim at a thorough investigation on the implementation of Newton methods for CNN. We focus on basic components because without them none of the recent improvements of Newton methods for fully-connected networks can even be tried.

We will enable many further developments of Newton methods for CNN and maybe even other types of networks.

This thesis is organized as follows. In Chapter II, we begin by introducing CNN.

In Chapter III, Newton methods for CNN are introduced and the detailed mathematical formulations of all operations are derived. In Chapter IV, we provide details of an efficient MATLAB implementation. Experiments to demonstrate the viability of Newton methods for CNN are in Chapter VI. In the same chapter, we also investigate the efficiency of our implementation. Chapter VII concludes this work.

A MATLAB package implementing a Newton method for CNN is available at

https://www.csie.ntu.edu.tw/~cjlin/papers/cnn/

Programs used for experiments can be found at the same page.

This thesis is based on the paper by Wang et al. (2018b). We acknowledge the support from the Ministry of Science and Technology of Taiwan.


CHAPTER II

Optimization Problem of Feedforward CNN

Consider a $K$-class problem, where the training data set consists of $l$ input pairs $(Z^{0,i}, y^i)$, $i = 1, \dots, l$. Here $Z^{0,i}$ is the $i$th input image with dimension $a^0 \times b^0 \times d^0$, where $a^0$ denotes the height of the input images, $b^0$ represents the width of the input images, and $d^0$ is the number of color channels. If $Z^{0,i}$ belongs to the $k$th class, then the label vector is
\[
y^i = [\underbrace{0, \dots, 0}_{k-1}, 1, 0, \dots, 0]^T \in \mathbb{R}^K.
\]

A CNN utilizes a stack of convolutional and pooling layers followed by fully-connected layers to predict the target vector. Let $L$ be the number of layers, $L^c$ be the number of convolutional layers, and $L^f$ be the number of fully-connected layers. The images
\[
Z^{0,i}, \quad i = 1, \dots, l,
\]
are the input of layer 1. Subsequently we describe operations of convolutional layers, pooling layers, and fully-connected layers.

2.1 Convolutional Layer

A hallmark of CNN is that both input and output of a convolutional layer are explicitly assumed to be images. We discuss the details between layers $m-1$ and $m$. Let the input be an image of dimensions
\[
a^{m-1} \times b^{m-1} \times d^{m-1},
\]
where $a^{m-1}$ is the height, $b^{m-1}$ is the width, and $d^{m-1}$ is the depth (or the number of channels). Thus for every given channel, we have a matrix of $a^{m-1} \times b^{m-1}$ pixels. Specifically, the input contains the following matrices:
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & & \\
& \ddots & \\
& & z^{m-1,i}_{a^{m-1},b^{m-1},1}
\end{bmatrix},
\ \dots,\
\begin{bmatrix}
z^{m-1,i}_{1,1,d^{m-1}} & & \\
& \ddots & \\
& & z^{m-1,i}_{a^{m-1},b^{m-1},d^{m-1}}
\end{bmatrix}.
\tag{2.1}
\]

For example, at layer 1, usually $d^0 = 3$ because of three color channels (red, green, blue). For each channel, the matrix of size $a^0 \times b^0$ contains raw pixel values of the image.

To generate the output, we consider $d^m$ filters, each of which is a 3-D weight matrix of size
\[
h^m \times h^m \times d^{m-1}.
\]
Specifically, the $j$th filter includes the following matrices
\[
\begin{bmatrix}
w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{1,h^m,1} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,1} & \cdots & w^{m,j}_{h^m,h^m,1}
\end{bmatrix},
\ \dots,\
\begin{bmatrix}
w^{m,j}_{1,1,d^{m-1}} & \cdots & w^{m,j}_{1,h^m,d^{m-1}} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d^{m-1}} & \cdots & w^{m,j}_{h^m,h^m,d^{m-1}}
\end{bmatrix}
\]
and a bias term $b^m_j$. The main idea of CNN is to conduct convolutional operations, each of which is the inner product between a small sub-image and a filter. We now describe the details. Specifically, for the $j$th filter, we scan the entire input image to obtain small regions of size $(h^m, h^m)$ and calculate the inner product between each region and the filter. For example, if we start from the upper left corner of the input image, the first sub-image of channel $d$ is
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,d} & \cdots & z^{m-1,i}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,1,d} & \cdots & z^{m-1,i}_{h^m,h^m,d}
\end{bmatrix}.
\]
We then calculate the following value:
\[
\sum_{d=1}^{d^{m-1}}
\left\langle
\begin{bmatrix}
z^{m-1,i}_{1,1,d} & \cdots & z^{m-1,i}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,1,d} & \cdots & z^{m-1,i}_{h^m,h^m,d}
\end{bmatrix},
\begin{bmatrix}
w^{m,j}_{1,1,d} & \cdots & w^{m,j}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d} & \cdots & w^{m,j}_{h^m,h^m,d}
\end{bmatrix}
\right\rangle
+ b^m_j,
\tag{2.2}
\]

where $\langle \cdot, \cdot \rangle$ means the sum of component-wise products between two matrices. This value becomes the $(1, 1)$ position of channel $j$ of the output image.

Next, we must obtain other sub-images to produce values in other positions of the output image. We specify the stride $s^m$ for sliding the filter. That is, we move $s^m$ pixels vertically or horizontally to get sub-images. For the $(2, 1)$ position of the output image, we move down $s^m$ pixels vertically to obtain the following sub-image:
\[
\begin{bmatrix}
z^{m-1,i}_{1+s^m,1,d} & \cdots & z^{m-1,i}_{1+s^m,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m+s^m,1,d} & \cdots & z^{m-1,i}_{h^m+s^m,h^m,d}
\end{bmatrix},
\quad d = 1, \dots, d^{m-1}.
\]
Then the $(2, 1)$ position of channel $j$ of the output image is
\[
\sum_{d=1}^{d^{m-1}}
\left\langle
\begin{bmatrix}
z^{m-1,i}_{1+s^m,1,d} & \cdots & z^{m-1,i}_{1+s^m,h^m,d} \\
\vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m+s^m,1,d} & \cdots & z^{m-1,i}_{h^m+s^m,h^m,d}
\end{bmatrix},
\begin{bmatrix}
w^{m,j}_{1,1,d} & \cdots & w^{m,j}_{1,h^m,d} \\
\vdots & \ddots & \vdots \\
w^{m,j}_{h^m,1,d} & \cdots & w^{m,j}_{h^m,h^m,d}
\end{bmatrix}
\right\rangle
+ b^m_j.
\tag{2.3}
\]
Assume that vertically and horizontally we can move the filter $a^m$ and $b^m$ times, respectively. Therefore,
\[
\begin{aligned}
a^{m-1} &= h^m + (a^m - 1) \times s^m, \\
b^{m-1} &= h^m + (b^m - 1) \times s^m.
\end{aligned}
\tag{2.4}
\]

By our notation, the output image has the size
\[
a^m \times b^m \times d^m.
\]
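As a concrete illustration (an example of ours, not one of the thesis's experimental settings): if $a^{m-1} = b^{m-1} = 28$, $h^m = 5$, and $s^m = 1$, then (2.4) gives $a^m = b^m = (28 - 5)/1 + 1 = 24$, so applying $d^m = 32$ filters produces an output image of size $24 \times 24 \times 32$.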

For efficient implementations, we can conduct all operations including (2.2) and (2.3) by matrix operations. To begin, we concatenate the matrices of the different channels in (2.1) to
\[
Z^{m-1,i} =
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{a^{m-1},1,1} & z^{m-1,i}_{1,2,1} & \cdots & z^{m-1,i}_{a^{m-1},b^{m-1},1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{1,1,d^{m-1}} & \cdots & z^{m-1,i}_{a^{m-1},1,d^{m-1}} & z^{m-1,i}_{1,2,d^{m-1}} & \cdots & z^{m-1,i}_{a^{m-1},b^{m-1},d^{m-1}}
\end{bmatrix},
\quad i = 1, \dots, l.
\tag{2.5}
\]
We note that (2.2) is the inner product between the following two vectors:
\[
\begin{bmatrix}
w^{m,j}_{1,1,1} & \cdots & w^{m,j}_{h^m,1,1} & w^{m,j}_{1,2,1} & \cdots & w^{m,j}_{h^m,h^m,1} & \cdots & w^{m,j}_{h^m,h^m,d^{m-1}} & b^m_j
\end{bmatrix}^T
\]
and
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{h^m,1,1} & z^{m-1,i}_{1,2,1} & \cdots & z^{m-1,i}_{h^m,h^m,1} & \cdots & z^{m-1,i}_{h^m,h^m,d^{m-1}} & 1
\end{bmatrix}^T.
\]

Therefore, based on Vedaldi and Lenc (2015), we define the following two operators
\[
\mathrm{vec}(M) =
\begin{bmatrix}
(M_{:,1})^T & \cdots & (M_{:,n})^T
\end{bmatrix}^T
\in \mathbb{R}^{mn \times 1}, \quad \text{where } M \in \mathbb{R}^{m \times n},
\tag{2.6}
\]
\[
\mathrm{mat}(v)_{m \times n} =
\begin{bmatrix}
v_1 & \cdots & v_{(n-1)m+1} \\
\vdots & \ddots & \vdots \\
v_m & \cdots & v_{nm}
\end{bmatrix}
\in \mathbb{R}^{m \times n}, \quad \text{where } v \in \mathbb{R}^{mn \times 1},
\tag{2.7}
\]
and the operator
\[
\phi : \mathbb{R}^{d^{m-1} \times a^{m-1} b^{m-1}} \to \mathbb{R}^{h^m h^m d^{m-1} \times a^m b^m}
\]
in order to collect all sub-images in $Z^{m-1,i}$:
\[
\phi(Z^{m-1,i}) \equiv \mathrm{mat}\!\left(P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i})\right)_{h^m h^m d^{m-1} \times a^m b^m}, \quad m = 1, \dots, L^c,\ \forall i,
\tag{2.8}
\]
where
\[
P^{m-1}_\phi \in \mathbb{R}^{h^m h^m d^{m-1} a^m b^m \times d^{m-1} a^{m-1} b^{m-1}}.
\]

We discuss the implementation details of (2.8) in Chapter 4.1. Then $\phi(Z^{m-1,i})$ is derived as follows:
\[
\begin{bmatrix}
z^{m-1,i}_{1,1,1} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1,1} & z^{m-1,i}_{1,1+s^m,1} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1+(b^m-1)s^m,1} \\
z^{m-1,i}_{2,1,1} & \cdots & z^{m-1,i}_{2+(a^m-1)s^m,1,1} & z^{m-1,i}_{2,1+s^m,1} & \cdots & z^{m-1,i}_{2+(a^m-1)s^m,1+(b^m-1)s^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,h^m,1} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m,1} & z^{m-1,i}_{h^m,h^m+s^m,1} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m+(b^m-1)s^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{1,1,d^{m-1}} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1,d^{m-1}} & z^{m-1,i}_{1,1+s^m,d^{m-1}} & \cdots & z^{m-1,i}_{1+(a^m-1)s^m,1+(b^m-1)s^m,d^{m-1}} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
z^{m-1,i}_{h^m,h^m,d^{m-1}} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m,d^{m-1}} & z^{m-1,i}_{h^m,h^m+s^m,d^{m-1}} & \cdots & z^{m-1,i}_{h^m+(a^m-1)s^m,h^m+(b^m-1)s^m,d^{m-1}}
\end{bmatrix}.
\tag{2.9}
\]
By considering
\[
W^m =
\begin{bmatrix}
w^{m,1}_{1,1,1} & w^{m,1}_{2,1,1} & \cdots & w^{m,1}_{h^m,h^m,d^{m-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{m,d^m}_{1,1,1} & w^{m,d^m}_{2,1,1} & \cdots & w^{m,d^m}_{h^m,h^m,d^{m-1}}
\end{bmatrix}
\in \mathbb{R}^{d^m \times h^m h^m d^{m-1}}
\quad \text{and} \quad
b^m =
\begin{bmatrix}
b^m_1 \\ \vdots \\ b^m_{d^m}
\end{bmatrix}
\in \mathbb{R}^{d^m \times 1},
\tag{2.10}
\]
the following operations are conducted:
\[
S^{m,i} = W^m \phi(Z^{m-1,i}) + b^m \mathbb{1}^T_{a^m b^m} \in \mathbb{R}^{d^m \times a^m b^m},
\tag{2.11}
\]
where
\[
S^{m,i} =
\begin{bmatrix}
s^{m,i}_{1,1,1} & \cdots & s^{m,i}_{a^m,1,1} & s^{m,i}_{1,2,1} & \cdots & s^{m,i}_{a^m,b^m,1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
s^{m,i}_{1,1,d^m} & \cdots & s^{m,i}_{a^m,1,d^m} & s^{m,i}_{1,2,d^m} & \cdots & s^{m,i}_{a^m,b^m,d^m}
\end{bmatrix}
\quad \text{and} \quad
\mathbb{1}_{a^m b^m} =
\begin{bmatrix}
1 \\ \vdots \\ 1
\end{bmatrix}
\in \mathbb{R}^{a^m b^m \times 1}.
\]

Next, an activation function is applied to scale the value:
\[
z^{m,i}_{a,b,d} = \sigma(s^{m,i}_{a,b,d}),
\tag{2.12}
\]
where $a = 1, \dots, a^m$, $b = 1, \dots, b^m$, and $d = 1, \dots, d^m$. For CNN, commonly the following RELU activation function
\[
\sigma(x) = \max(x, 0)
\tag{2.13}
\]
is used and we consider it in this work. The output becomes the following matrix:
\[
Z^{m,i} =
\begin{bmatrix}
z^{m,i}_{1,1,1} & z^{m,i}_{2,1,1} & \cdots & z^{m,i}_{a^m,b^m,1} \\
\vdots & \vdots & \ddots & \vdots \\
z^{m,i}_{1,1,d^m} & z^{m,i}_{2,1,d^m} & \cdots & z^{m,i}_{a^m,b^m,d^m}
\end{bmatrix}.
\tag{2.14}
\]
We then apply (2.8) to expand the output to form the matrix $\phi(Z^{m,i})$ and then substitute $\phi(Z^{m,i})$ into (2.11), so we can continue the operations between layers $m$ and $m+1$. Note that by the matrix representation, the storage is increased from
\[
a^{m-1} \times b^{m-1} \times d^{m-1}
\]
in (2.1) to
\[
(h^m h^m d^{m-1}) \times a^m \times b^m.
\]
From (2.4), roughly a
\[
\left(\frac{h^m}{s^m}\right)^2
\]
fold increase of the memory occurs. However, we gain efficiency by using fast matrix-matrix multiplications in optimized BLAS (Dongarra et al., 1990).
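To make the matrix form concrete, the following minimal MATLAB sketch (ours, not the thesis's optimized implementation, which instead builds $P^{m-1}_\phi$ once as discussed in Chapter 4.1) constructs $\phi(Z^{m-1,i})$ for one image with a double loop and then applies (2.11). It assumes, as in (2.4), that $a^{m-1} - h^m$ and $b^{m-1} - h^m$ are divisible by $s^m$; all function and variable names are ours.

```matlab
% A minimal sketch of (2.8)-(2.11) for one image.  Z is the d_in x (a_in*b_in)
% matrix of (2.5), W is the d_out x (h*h*d_in) matrix of (2.10), b is d_out x 1.
function [S, phiZ] = conv_forward_sketch(Z, W, b, a_in, b_in, h, s)
  d_in  = size(Z, 1);
  a_out = (a_in - h)/s + 1;              % number of vertical filter positions, see (2.4)
  b_out = (b_in - h)/s + 1;              % number of horizontal filter positions
  phiZ  = zeros(h*h*d_in, a_out*b_out);  % each column is one vectorized sub-image, as in (2.9)
  col = 0;
  for q = 1:b_out                        % horizontal position of the filter
    for p = 1:a_out                      % vertical position of the filter
      col = col + 1;
      rows = (p-1)*s + (1:h);            % image rows covered by the filter
      cols = (q-1)*s + (1:h);            % image columns covered by the filter
      idx  = rows' + (cols - 1)*a_in;    % h x h linear pixel indices (row index varies fastest)
      patch = Z(:, idx(:))';             % (h*h) x d_in: the patch of every channel
      phiZ(:, col) = patch(:);           % stack channels, matching the rows of (2.9)
    end
  end
  S = W*phiZ + b*ones(1, a_out*b_out);   % the matrix form (2.11)
end
```

The columns of phiZ are ordered as in (2.9): the vertical filter position varies fastest, matching the column ordering of $S^{m,i}$ in (2.11).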


2.1.1 Zero Padding

To make (2.4) hold or to make $a^m$ larger, sometimes we enlarge the input image to have zero values around the border. This technique, conducted before the mapping in (2.8), is called zero-padding in CNN training. For example, we may set
\[
a^m = a^{m-1}
\]
in order to prevent the decrease of the image size. When
\[
s^m = 1,
\]
we can pad the input image with
\[
h^m - 1
\]
lines of zeros around every border; see Figure 2.1. For our derivation, we represent the padding operation as the following linear operation:
\[
Z^{m,i} = \mathrm{mat}\!\left(P^{m-1,i}_{\text{padding}} \mathrm{vec}(Z^{m-1,i})\right)_{d^m \times a^m b^m}.
\tag{2.15}
\]
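As a small illustration (ours; the thesis instead forms the linear map $P^{m-1,i}_{\text{padding}}$, with implementation details in Chapter 4.3), the following MATLAB sketch pads an image stored as a $d \times ab$ matrix, as in (2.14), with $p$ lines of zeros on every border; all names are ours.

```matlab
% A minimal sketch of zero padding as in (2.15): the d x (a*b) image matrix Z
% is mapped to a d x ((a+2p)*(b+2p)) matrix with p lines of zeros on every border.
function Zpad = zero_pad_sketch(Z, a, b, p)
  d = size(Z, 1);
  Zpad = zeros(d, (a + 2*p)*(b + 2*p));
  for c = 1:b                             % copy image column c into the padded image
    src = (c-1)*a + (1:a);                % pixels of column c in Z
    dst = (c-1+p)*(a + 2*p) + p + (1:a);  % the same pixels inside the padded image
    Zpad(:, dst) = Z(:, src);
  end
end
```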

2.1.2 Pooling Layer

For CNN, to reduce the computational cost, a dimension reduction is often applied by using a pooling layer after each convolutional layer. Usually we consider an operation that can (approximately) extract rotational or translational invariance features. There are various types of pooling methods such as average pooling, max pooling, and stochastic pooling. We consider max pooling in this chapter because it is the most used setting for CNN. Here we show an example of max pooling by considering two $4 \times 4$ images, A and B, in Figure 2.2. The image B is derived by shifting A by 1 pixel in the horizontal direction. We split the two images into four $2 \times 2$ sub-images and choose the max value from every sub-image. In each sub-image, because only some elements are changed, the maximal value is likely the same or similar. This is called translational invariance, and for our example the two output images from A and B are the same.

Figure 2.1: A padding example with $s^m = 1$ in order to set $a^m = a^{m-1}$ (the original $a^{m-1} \times b^{m-1}$ image is surrounded by $h^m - 1$ lines of zeros on every border).

Figure 2.2: An illustration of max pooling to extract translational invariance features. The image B is derived from shifting A by 1 pixel in the horizontal direction:
\[
A = \begin{bmatrix} 2&3&6&8\\ 5&4&9&7\\ 1&2&6&0\\ 4&3&2&1 \end{bmatrix}
\rightarrow \begin{bmatrix} 5&9\\ 4&6 \end{bmatrix},
\qquad
B = \begin{bmatrix} 3&2&3&6\\ 4&5&4&9\\ 2&1&2&6\\ 3&4&3&2 \end{bmatrix}
\rightarrow \begin{bmatrix} 5&9\\ 4&6 \end{bmatrix}.
\]

Now we discuss the mathematical operation of the pooling layer. They are in fact special cases of convolutional operations. Assume $Z^{m-1,i}$ is the input image (i.e., the output image of the previous convolutional layer). We partition every channel of $Z^{m-1,i}$ into non-overlapping sub-regions by $h^m \times h^m$ filters with the stride $s^m = h^m$ (because of the disjoint sub-regions, the stride $s^m$ for sliding the filters is equal to $h^m$). By the same definition as (2.8) we can generate the matrix
\[
\phi(Z^{m-1,i}) = \mathrm{mat}\!\left(P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i})\right)_{h^m h^m \times d^{m-1} a^m b^m},
\tag{2.16}
\]
where
\[
a^m = \frac{a^{m-1}}{h^m}, \qquad b^m = \frac{b^{m-1}}{h^m}.
\tag{2.17}
\]
To select the largest element of each sub-region, there exists a matrix
\[
W^{m,i} \in \mathbb{R}^{d^m a^m b^m \times h^m h^m d^{m-1} a^m b^m}
\]
so that each row of $W^{m,i}$ selects a single element from $\mathrm{vec}(\phi(Z^{m-1,i}))$. Therefore,
\[
Z^{m,i} = \mathrm{mat}\!\left(W^{m,i} \mathrm{vec}(\phi(Z^{m-1,i}))\right)_{d^m \times a^m b^m}.
\tag{2.18}
\]
Note that, different from (2.11) of the convolutional layer, $W^{m,i}$ is a constant matrix rather than a weight matrix. Further, because from (2.8)
\[
\mathrm{vec}(\phi(Z^{m-1,i})) = P^{m-1}_\phi \mathrm{vec}(Z^{m-1,i}),
\]
we have
\[
Z^{m,i} = \mathrm{mat}\!\left(P^{m-1,i}_{\text{pool}} \mathrm{vec}(Z^{m-1,i})\right)_{d^m \times a^m b^m},
\tag{2.19}
\]
where
\[
P^{m-1,i}_{\text{pool}} = W^{m,i} P^{m-1}_\phi \in \mathbb{R}^{d^m a^m b^m \times d^{m-1} a^{m-1} b^{m-1}}.
\]
We provide implementation details in Chapter 4.2. Note that pooling operations are often considered as an (optional) part of the convolutional layer. Here we treat them as a separate layer for an easier description of the procedure.
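A minimal MATLAB sketch of max pooling over disjoint $h \times h$ regions is given below (ours; the thesis constructs $P^{m-1,i}_{\text{pool}}$ as detailed in Chapter 4.2). Besides the pooled values, it records which input pixel each output element comes from, which is exactly the information encoded by the rows of $W^{m,i}$; all names are illustrative.

```matlab
% A minimal sketch of max pooling (2.16)-(2.19): the input Z is d x (a_in*b_in)
% as in (2.14), and every channel is split into disjoint h x h sub-regions.
% Assumes h divides a_in and b_in, as required by (2.17).
function [Zout, argmax_idx] = maxpool_sketch(Z, a_in, b_in, h)
  d = size(Z, 1);
  a_out = a_in/h;  b_out = b_in/h;                % (2.17)
  Zout = zeros(d, a_out*b_out);
  argmax_idx = zeros(d, a_out*b_out);             % winning pixel of every output element
  col = 0;
  for q = 1:b_out
    for p = 1:a_out
      col = col + 1;
      rows = (p-1)*h + (1:h);
      cols = (q-1)*h + (1:h);
      idx  = rows' + (cols - 1)*a_in;             % linear indices of the h x h sub-region
      [Zout(:, col), k] = max(Z(:, idx(:)), [], 2);
      argmax_idx(:, col) = idx(k);                % the selection made by W^{m,i}
    end
  end
end
```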

2.1.3 Summary of a Convolutional Layer

From the input $Z^{m-1,i}$, we consider the following flow as one convolutional layer:
\[
Z^{m-1,i} \rightarrow \text{padding} \rightarrow \phi(Z^{m-1,i}) \rightarrow \bar{S}^{m,i} \rightarrow \bar{Z}^{m,i} \rightarrow \text{pooling} \rightarrow Z^{m,i},
\]
where $\bar{S}^{m,i}$ and $\bar{Z}^{m,i}$ are $S^{m,i}$ and $Z^{m,i}$ in (2.11) and (2.14), respectively, if pooling is not applied.

2.2 Fully-Connected Layer

After passing through the convolutional and pooling layers, we concatenate columns in the matrix in (2.14) to form the input vector of the first fully-connected layer:
\[
z^{m,i} = \mathrm{vec}(Z^{m,i}), \quad i = 1, \dots, l, \quad m = L^c.
\]
In the fully-connected layers ($L^c < m \le L$), we consider the following weight matrix and bias vector between layers $m-1$ and $m$:
\[
W^m =
\begin{bmatrix}
w^m_{11} & w^m_{21} & \cdots & w^m_{n_{m-1} 1} \\
w^m_{12} & w^m_{22} & \cdots & w^m_{n_{m-1} 2} \\
\vdots & \vdots & \ddots & \vdots \\
w^m_{1 n_m} & w^m_{2 n_m} & \cdots & w^m_{n_{m-1} n_m}
\end{bmatrix}_{n_m \times n_{m-1}}
\quad \text{and} \quad
b^m =
\begin{bmatrix}
b^m_1 \\ b^m_2 \\ \vdots \\ b^m_{n_m}
\end{bmatrix}_{n_m \times 1},
\tag{2.20}
\]
where $n_{m-1}$ and $n_m$ are the numbers of neurons in layers $m-1$ and $m$, respectively ($n_{L^c} = d^{L^c} a^{L^c} b^{L^c}$ and $n_L = K$ is the number of classes). If $z^{m-1,i} \in \mathbb{R}^{n_{m-1}}$ is the input vector, the following operations are applied to generate the output vector $z^{m,i} \in \mathbb{R}^{n_m}$:
\[
s^{m,i} = W^m z^{m-1,i} + b^m,
\tag{2.21}
\]
\[
z^{m,i}_j = \sigma(s^{m,i}_j), \quad j = 1, \dots, n_m.
\tag{2.22}
\]
For the activation function in fully-connected layers, except the last layer, we also consider the RELU function defined in (2.13). For the last layer, we use the following linear function:
\[
\sigma(x) = x.
\tag{2.23}
\]
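A minimal MATLAB sketch of the fully-connected part (2.21)–(2.23) is given below; Wfc and bfc are illustrative cell arrays holding $W^m$ and $b^m$ for $m = L^c+1, \dots, L$, and z is the vectorized output of the last convolutional layer.

```matlab
% A minimal sketch of the fully-connected forward operations for one instance.
function z = fc_forward_sketch(z, Wfc, bfc)
  Lf = numel(Wfc);                    % number of fully-connected layers
  for t = 1:Lf
    s = Wfc{t}*z + bfc{t};            % (2.21)
    if t < Lf
      z = max(s, 0);                  % RELU activation (2.13), applied as in (2.22)
    else
      z = s;                          % linear activation at the last layer (2.23)
    end
  end
end
```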


2.3 The Overall Optimization Problem

At the last layer, the output $z^{L,i}$, $\forall i$, is obtained. We can check how close it is to the label vector $y^i$ and consider the following squared loss in this work:
\[
\xi(z^{L,i}; y^i) = \|z^{L,i} - y^i\|^2.
\tag{2.24}
\]
Furthermore, we can collect all model parameters, such as filters of convolutional layers in (2.10) and weights/biases in (2.20) for fully-connected layers, into a long vector $\theta \in \mathbb{R}^n$, where $n$ becomes the total number of variables from the discussion in this chapter:
\[
n = \sum_{m=1}^{L^c} d^m \times (h^m \times h^m \times d^{m-1} + 1)
  + \sum_{m=L^c+1}^{L} n_m \times (n_{m-1} + 1).
\]
The output $z^{L,i}$ of the last layer is a function of $\theta$. The optimization problem to train a CNN is
\[
\min_\theta\ f(\theta),
\tag{2.25}
\]
where
\[
f(\theta) = \frac{1}{2C} \theta^T \theta + \frac{1}{l} \sum_{i=1}^{l} \xi(z^{L,i}; y^i)
\tag{2.26}
\]
and the first term with the parameter $C > 0$ is used to avoid overfitting.


CHAPTER III

Hessian-free Newton Methods for Training CNN

To solve an unconstrained minimization problem such as (2.25), a Newton method iteratively finds a search direction $d$ by solving the following second-order approximation:
\[
\min_d\ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta) d,
\tag{3.1}
\]
where $\nabla f(\theta)$ and $\nabla^2 f(\theta)$ are the gradient vector and the Hessian matrix, respectively. In this chapter we present details of applying a Newton method to solve the CNN problem (2.25).

3.1 Procedure of the Newton Method

For CNN, the gradient of $f(\theta)$ is
\[
\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}} \xi(z^{L,i}; y^i),
\tag{3.2}
\]
where
\[
J^i =
\begin{bmatrix}
\frac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \frac{\partial z^{L,i}_1}{\partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \frac{\partial z^{L,i}_{n_L}}{\partial \theta_n}
\end{bmatrix}_{n_L \times n},
\quad i = 1, \dots, l,
\tag{3.3}
\]
is the Jacobian of $z^{L,i}$. The Hessian matrix of $f(\theta)$ is
\[
\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i
+ \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{n_L}
\frac{\partial \xi(z^{L,i}; y^i)}{\partial z^{L,i}_j}
\begin{bmatrix}
\frac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \frac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \frac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_n}
\end{bmatrix},
\tag{3.4}
\]
where $I$ is the identity matrix and
\[
B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y^i)}{\partial z^{L,i}_t \partial z^{L,i}_s},
\quad t = 1, \dots, n_L, \quad s = 1, \dots, n_L.
\tag{3.5}
\]
From now on, for simplicity we let
\[
\xi_i \equiv \xi(z^{L,i}; y^i).
\]

If $f(\theta)$ is non-convex, as in the case of deep learning, (3.1) is difficult to solve and the resulting direction may not lead to the decrease of the function value. Thus the Gauss-Newton approximation (Schraudolph, 2002)
\[
G = \frac{1}{C} I + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T B^i J^i \approx \nabla^2 f(\theta)
\tag{3.6}
\]
is often used. In particular, if $G$ is positive definite, then (3.1) becomes the same as solving the following linear system:
\[
G d = -\nabla f(\theta).
\tag{3.7}
\]
After a Newton direction $d$ is obtained, to ensure the convergence, we update $\theta$ by
\[
\theta \leftarrow \theta + \alpha d,
\]
where $\alpha$ is the largest element in $\{1, \frac{1}{2}, \frac{1}{4}, \dots\}$ which can satisfy
\[
f(\theta + \alpha d) \le f(\theta) + \eta \alpha \nabla f(\theta)^T d,
\tag{3.8}
\]
where $\eta \in (0, 1)$ is a pre-defined constant. The procedure to find $\alpha$ is called a backtracking line search.

Past works (e.g., Martens, 2010; Wang et al., 2018a) have shown that sometimes (3.7) is too aggressive, so a direction closer to the negative gradient is better. To this end, in recent works of applying Newton methods on fully-connected networks, the Levenberg-Marquardt method (Levenberg, 1944; Marquardt, 1963) is used to solve the following linear system rather than (3.7):
\[
(G + \lambda I) d = -\nabla f(\theta),
\tag{3.9}
\]
where $\lambda$ is a parameter decided by how good the function reduction is. Specifically, we define
\[
\rho = \frac{f(\theta + d) - f(\theta)}{\nabla f(\theta)^T d + \frac{1}{2} d^T G d}
\]
as the ratio between the actual function reduction and the predicted reduction. By using $\rho$, the parameter $\lambda_{\text{next}}$ for the next iteration is decided by
\[
\lambda_{\text{next}} =
\begin{cases}
\lambda \times \text{drop} & \rho > \rho_{\text{upper}}, \\
\lambda & \rho_{\text{lower}} \le \rho \le \rho_{\text{upper}}, \\
\lambda \times \text{boost} & \text{otherwise},
\end{cases}
\tag{3.10}
\]
where (drop, boost) are given constants. From (3.10) we can clearly see that if the function-value reduction is not satisfactory, then $\lambda$ is enlarged and the resulting direction is closer to the negative gradient.
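The following minimal MATLAB sketch shows the two control mechanisms just described: the backtracking condition (3.8) and the $\lambda$ update (3.10). The numerical constants (drop, boost, the $\rho$ thresholds) are illustrative choices of ours, not values fixed by the thesis.

```matlab
% A sketch of the Levenberg-Marquardt adjustment (3.10); constants are examples.
function lambda = lm_update_sketch(lambda, rho)
  drop = 2/3;  boost = 3/2;                % example (drop, boost) constants
  rho_upper = 0.75;  rho_lower = 0.25;     % example thresholds
  if rho > rho_upper
    lambda = lambda * drop;                % good reduction: decrease lambda
  elseif rho < rho_lower
    lambda = lambda * boost;               % poor reduction: move toward negative gradient
  end                                      % otherwise lambda is unchanged
end

% A sketch of the backtracking line search enforcing (3.8).
function [theta, fnew] = backtracking_sketch(theta, d, f, fval, grad_d, eta)
  % f: handle for f(theta); fval = f(theta); grad_d = gradient'*d; eta in (0,1)
  alpha = 1;
  while true
    fnew = f(theta + alpha*d);
    if fnew <= fval + eta*alpha*grad_d     % sufficient-decrease condition (3.8)
      break
    end
    alpha = alpha / 2;
  end
  theta = theta + alpha*d;
end
```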

Next, we discuss how to solve the linear system (3.9). When the number of variables $n$ is large, the matrix $G$ is too large to be stored. For some optimization problems including neural networks, without explicitly storing $G$ it is possible to calculate the product between $G$ and any vector $v$ (Le et al., 2011; Martens, 2010; Wang et al., 2018a). For example, from (3.6),
\[
(G + \lambda I) v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{l} \sum_{i=1}^{l} (J^i)^T \left(B^i (J^i v)\right).
\tag{3.11}
\]
If the product between $J^i$ and a vector can be easily calculated, then $G$ does not need to be explicitly formed. Therefore, we can apply the conjugate gradient (CG) method to solve (3.7) by a sequence of matrix-vector products. This technique is called Hessian-free methods in optimization. Details of CG methods in a Hessian-free Newton framework can be found in, for example, Algorithm 2 of Lin et al. (2007).

Because the computational cost in (3.11) is proportional to the number of instances, subsampled Hessian Newton methods have been proposed (Byrd et al., 2011; Martens, 2010; Wang et al., 2015) to reduce the cost in solving the linear system (3.9). They observe that the second term in (3.6) is the average training loss. If the large number of data points are assumed to be from the same distribution, (3.6) can be reasonably approximated by selecting a subset $S \subset \{1, \dots, l\}$ and having
\[
G^S = \frac{1}{C} I + \frac{1}{|S|} \sum_{i \in S} (J^i)^T B^i J^i \approx G.
\]
Then (3.11) becomes
\[
(G^S + \lambda I) v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left(B^i (J^i v)\right) \approx (G + \lambda I) v.
\]
A summary of the Newton method is in Algorithm 1.
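As an illustration of step 5 of Algorithm 1 (given next), the following minimal MATLAB sketch forms the subsampled product of (3.11) implicitly and calls a CG solver. The handles Jv_fun and JTu_fun are hypothetical: they are assumed to return the stacked products $J^i v$ for $i \in S$ and $\sum_{i \in S}(J^i)^T u^i$, respectively; for the squared loss (2.24), $B^i = 2I$ by (3.5). MATLAB's pcg and the tolerance values below are only stand-ins for the thesis's own CG procedure.

```matlab
% A sketch, not the thesis's implementation: solve (G_S + lambda*I) d = -grad
% by CG, with the Gauss-Newton product of (3.11) and B^i = 2*I for the squared
% loss (2.24).  Jv_fun and JTu_fun are hypothetical Jacobian-product handles.
function d = newton_direction_sketch(grad, Jv_fun, JTu_fun, C, lambda, S_size)
  Gv = @(v) (1/C + lambda)*v + JTu_fun(2*Jv_fun(v)) / S_size;  % (G_S + lambda*I)*v
  [d, ~] = pcg(Gv, -grad, 1e-2, 250);   % approximate solve; tolerance/limit are examples
end
```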

Algorithm 1 A standard subsampled Hessian Newton method for CNN.
1: Compute $f(\theta^1)$.
2: for $k = 1, 2, \dots$ do
3: Choose a set $S_k \subset \{1, \dots, l\}$.
4: Compute $\nabla f(\theta^k)$ and $J^i$, $\forall i \in S_k$.
5: Approximately solve the linear system in (3.9) by CG to obtain a direction $d^k$.
6: $\alpha = 1$.
7: while true do
8: Update $\theta^{k+1} = \theta^k + \alpha d^k$ and compute $f(\theta^{k+1})$.
9: if (3.8) is satisfied then
10: break
11: end if
12: $\alpha \leftarrow \alpha/2$.
13: end while
14: Calculate $\lambda_{k+1}$ based on (3.10).
15: end for

3.2 Gradient Evaluation

In order to solve (3.7), $\nabla f(\theta)$ is needed. It can be obtained by (3.2) if the Jacobian matrix $J^i$, $i = 1, \dots, l$, is available; see the discussion on Jacobian calculation in Chapter 3.3. However, we have mentioned in Chapter 3.1 that in practice only a subset of the $J^i$ may be calculated. Thus here we present a direct calculation by a backward process.

Consider two layers $m-1$ and $m$. The variables between them are $W^m$ and $b^m$, so we aim to calculate the following gradient components:
\[
\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m},
\tag{3.12}
\]
\[
\frac{\partial f}{\partial b^m} = \frac{1}{C} b^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial b^m}.
\tag{3.13}
\]

Because (3.12) is in a matrix form, following past developments such as Vedaldi and Lenc (2015), it is easier to transform it to a vector form for the derivation. To begin, we list the following properties of the $\mathrm{vec}(\cdot)$ function, in which $\otimes$ is the Kronecker product:
\[
\begin{aligned}
\mathrm{vec}(AB) &= (I \otimes A)\,\mathrm{vec}(B) && (3.14)\\
&= (B^T \otimes I)\,\mathrm{vec}(A), && (3.15)\\
\mathrm{vec}(AB)^T &= \mathrm{vec}(B)^T (I \otimes A^T) && (3.16)\\
&= \mathrm{vec}(A)^T (B \otimes I). && (3.17)
\end{aligned}
\]
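As a quick sanity check (ours, not part of the thesis), the identities (3.14)–(3.15) can be verified numerically in MATLAB:

```matlab
% Numerical check of vec(AB) = (I (x) A) vec(B) = (B^T (x) I) vec(A).
A = rand(3, 4);  B = rand(4, 2);
lhs  = reshape(A*B, [], 1);                  % vec(AB): reshape stacks columns, as in (2.6)
rhs1 = kron(eye(2), A) * reshape(B, [], 1);  % (I (x) A) vec(B)
rhs2 = kron(B', eye(3)) * reshape(A, [], 1); % (B^T (x) I) vec(A)
disp([norm(lhs - rhs1), norm(lhs - rhs2)])   % both differences are at round-off level
```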

We further define
\[
\frac{\partial y}{\partial (x)^T} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_{|x|}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{|y|}}{\partial x_1} & \cdots & \frac{\partial y_{|y|}}{\partial x_{|x|}}
\end{bmatrix},
\]
where $x$ and $y$ are column vectors, and let
\[
\phi(z^{m-1,i}) = I_{n_{m-1}} z^{m-1,i}, \quad L^c < m \le L,
\]
where $I_p$ is the $p \times p$ identity matrix.

For the fully-connected layers, from (2.21), we have
\[
\begin{aligned}
s^{m,i} &= W^m z^{m-1,i} + b^m \\
&= (I_1 \otimes W^m)\, z^{m-1,i} + (\mathbb{1}_1 \otimes I_{n_m})\, b^m && (3.18)\\
&= \left((z^{m-1,i})^T \otimes I_{n_m}\right) \mathrm{vec}(W^m) + (\mathbb{1}_1 \otimes I_{n_m})\, b^m, && (3.19)
\end{aligned}
\]
where (3.18) and (3.19) are from (3.14) and (3.15), respectively. For the convolutional layers, from (2.11), we have
\[
\begin{aligned}
\mathrm{vec}(S^{m,i}) &= \mathrm{vec}(W^m \phi(Z^{m-1,i})) + \mathrm{vec}(b^m \mathbb{1}^T_{a^m b^m}) \\
&= (I_{a^m b^m} \otimes W^m)\, \mathrm{vec}(\phi(Z^{m-1,i})) + (\mathbb{1}_{a^m b^m} \otimes I_{d^m})\, b^m && (3.20)\\
&= \left(\phi(Z^{m-1,i})^T \otimes I_{d^m}\right) \mathrm{vec}(W^m) + (\mathbb{1}_{a^m b^m} \otimes I_{d^m})\, b^m, && (3.21)
\end{aligned}
\]
where (3.20) and (3.21) are from (3.14) and (3.15), respectively.

An advantage of using (3.18) and (3.20) is that they are in the same form, and so are (3.19) and (3.21). Thus we can derive the gradient together. We begin with calculating the gradient for convolutional layers. From (3.21), we derive
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial \mathrm{vec}(W^m)^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(\phi(Z^{m-1,i})^T \otimes I_{d^m}\right) \\
&= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T
\end{aligned}
\tag{3.22}
\]
and
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial (b^m)^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial (b^m)^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(\mathbb{1}_{a^m b^m} \otimes I_{d^m}\right) \\
&= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T,
\end{aligned}
\tag{3.23}
\]
where (3.22) and (3.23) are from (3.17). To calculate (3.22), $\phi(Z^{m-1,i})$ is available in the forward process of calculating the function value.
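In MATLAB, (3.22) and (3.23) therefore reduce to two matrix products per instance. The following minimal sketch (illustrative names, ours) assumes dxi_dS holds $\partial\xi_i/\partial S^{m,i}$ from the backward pass and phiZ holds the saved $\phi(Z^{m-1,i})$ from the forward pass.

```matlab
% A minimal sketch of the convolutional-layer gradient formulas (3.22)-(3.23)
% for one instance; regularization and the 1/l averaging of (3.12)-(3.13) are
% handled elsewhere.
function [gW, gb] = conv_grad_sketch(dxi_dS, phiZ)
  gW = dxi_dS * phiZ';                  % matrix whose vec(...) is (3.22)
  gb = dxi_dS * ones(size(phiZ, 2), 1); % (3.23): sum over all spatial positions
end
```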

For $\partial \xi_i / \partial S^{m,i}$, also needed in (3.22) and (3.23), it can be obtained by a backward process. Here we assume that $\partial \xi_i / \partial S^{m,i}$ is available, and calculate $\partial \xi_i / \partial S^{m-1,i}$ for layer $m-1$:
\[
\begin{aligned}
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \frac{\partial \mathrm{vec}(S^{m,i})}{\partial \mathrm{vec}(\phi(Z^{m-1,i}))^T}
   \frac{\partial \mathrm{vec}(\phi(Z^{m-1,i}))}{\partial \mathrm{vec}(Z^{m-1,i})^T} \\
&= \frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T}
   \left(I_{a^m b^m} \otimes W^m\right) P^{m-1}_\phi && (3.24)\\
&= \mathrm{vec}\!\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T P^{m-1}_\phi, && (3.25)
\end{aligned}
\]
where (3.24) is from (2.8) and (3.20), and (3.25) is from (3.16).

Then, because the RELU activation function is considered for the convolutional layers, we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T} \odot \mathrm{vec}\!\left(I[Z^{m-1,i}]\right)^T,
\tag{3.26}
\]
where $\odot$ is the Hadamard product (i.e., element-wise product) and
\[
I[Z^{m-1,i}]_{(p,q)} =
\begin{cases}
1 & \text{if } z^{m-1,i}_{(p,q)} > 0, \\
0 & \text{otherwise}.
\end{cases}
\]

For fully-connected layers, by the same form in (3.18), (3.19), (3.20) and (3.21), we immediately get from (3.22), (3.23), (3.26) and (3.25) that
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T}
= \mathrm{vec}\!\left(\frac{\partial \xi_i}{\partial s^{m,i}} (z^{m-1,i})^T\right)^T,
\tag{3.27}
\]
\[
\frac{\partial \xi_i}{\partial (b^m)^T} = \frac{\partial \xi_i}{\partial (s^{m,i})^T},
\tag{3.28}
\]
\[
\frac{\partial \xi_i}{\partial (z^{m-1,i})^T}
= \left((W^m)^T \frac{\partial \xi_i}{\partial s^{m,i}}\right)^T I_{n_{m-1}}
= \frac{\partial \xi_i}{\partial (s^{m,i})^T} W^m,
\tag{3.29}
\]
\[
\frac{\partial \xi_i}{\partial (s^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial (z^{m-1,i})^T} \odot I[z^{m-1,i}]^T.
\tag{3.30}
\]
Finally, we check the initial values of the backward process. From the square loss in (2.24), we have
\[
\frac{\partial \xi_i}{\partial z^{L,i}} = 2(z^{L,i} - y^i),
\qquad
\frac{\partial \xi_i}{\partial s^{L,i}} = \frac{\partial \xi_i}{\partial z^{L,i}},
\]
where the second equality holds because the linear activation (2.23) is used at the last layer.
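The fully-connected formulas (3.27)–(3.30) and the initial values translate directly into a short backward loop. The sketch below is our own, with illustrative names; the $1/C$ regularization term and the averaging in (3.12)–(3.13) are omitted. It assumes zs{t} and ss{t} store the layer input $z^{m-1,i}$ and the value $s^{m,i}$ from the forward pass for the $t$-th fully-connected layer.

```matlab
% A minimal sketch of the backward process (3.27)-(3.30) for the
% fully-connected part, for one instance.
function [gW, gb] = fc_backward_sketch(zs, ss, Wfc, zL, y)
  Lf = numel(Wfc);
  gW = cell(Lf, 1);  gb = cell(Lf, 1);
  dxi_ds = 2*(zL - y);                   % initial values: 2(z^{L,i} - y^i); last layer linear
  for t = Lf:-1:1
    gW{t} = dxi_ds * zs{t}';             % (3.27) in matrix form
    gb{t} = dxi_ds;                      % (3.28)
    if t > 1
      dxi_dz = Wfc{t}' * dxi_ds;         % (3.29), written as a column vector
      dxi_ds = dxi_dz .* (ss{t-1} > 0);  % (3.30): RELU indicator I[z^{m-1,i}] = (s^{m-1,i} > 0)
    end
  end
end
```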

3.2.1 Padding, Pooling, and the Overall Procedure

For the padding operation, from (2.15), we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T}
  \frac{\partial \mathrm{vec}(Z^{m,i})}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} P^{m-1,i}_{\text{padding}}.
\tag{3.31}
\]
Similarly, for the pooling layer, from (2.19), we have
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} P^{m-1,i}_{\text{pool}}.
\tag{3.32}
\]
Following the explanation in Chapter 2.1.3, to generate $\partial \xi_i / \partial \mathrm{vec}(S^{m-1,i})$ from $\partial \xi_i / \partial \mathrm{vec}(S^{m,i})$, we consider the following cycle:
\[
S^{m-1} \rightarrow Z^{m-1} \rightarrow \text{pooling} \rightarrow \text{padding} \rightarrow \phi(Z^{m-1,i}) \rightarrow S^m.
\]
Therefore, by combining (3.25), (3.26), (3.31) and (3.32), $\partial \xi_i / \partial \mathrm{vec}(Z^{m-1,i})^T$ is obtained by
\[
\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m-1,i})^T}
= \mathrm{vec}\!\left((W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}}\right)^T
  P^{m-1}_\phi\, P^{m-1}_{\text{padding}}\, P^{m-1}_{\text{pool}}.
\tag{3.33}
\]
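Because $P^{m-1,i}_{\text{pool}}$ has exactly one nonzero per row, it can be stored as a sparse matrix and (3.32) becomes a single sparse product. The following minimal sketch (illustrative names, ours; the thesis's actual construction is in Chapter 4.2) builds it from the winning positions recorded during max pooling, e.g. the argmax_idx of the earlier pooling sketch.

```matlab
% A minimal sketch of forming the sparse matrix P_pool of (2.19) from the
% winning positions of a max-pooling pass.  argmax_idx is d x (a_out*b_out)
% and holds linear pixel indices into the a_in x b_in input image.
function P_pool = build_P_pool_sketch(argmax_idx, a_in, b_in)
  [d, nout] = size(argmax_idx);
  rows = (1:d*nout)';                  % one selected entry per output element
  % vec(Z^{m-1,i}) stacks the d x (a_in*b_in) matrix column by column, so the
  % element of channel c at pixel p sits at index (p-1)*d + c.
  chan = repmat((1:d)', nout, 1);
  pix  = argmax_idx(:);
  cols = (pix - 1)*d + chan;
  P_pool = sparse(rows, cols, 1, d*nout, d*a_in*b_in);
end
```

The gradient step (3.32) is then just the row vector–matrix product of $\partial\xi_i/\partial\mathrm{vec}(Z^{m,i})^T$ with P_pool.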

3.3 Jacobian Evaluation

For (3.6), the Jacobian matrix is needed and it can be partitioned into $L$ blocks:
\[
J^i =
\begin{bmatrix}
J^{1,i} & J^{2,i} & \cdots & J^{L,i}
\end{bmatrix},
\quad m = 1, \dots, L, \ i = 1, \dots, l,
\tag{3.34}
\]
where
\[
J^{m,i} =
\begin{bmatrix}
\frac{\partial z^{L,i}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}}{\partial (b^m)^T}
\end{bmatrix}.
\]
The calculation is very similar to that for the gradient. For the convolutional layers, from (3.22) and (3.23), we have
\[
\begin{aligned}
\begin{bmatrix}
\frac{\partial z^{L,i}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}}{\partial (b^m)^T}
\end{bmatrix}
&=
\begin{bmatrix}
\frac{\partial z^{L,i}_1}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}_1}{\partial (b^m)^T} \\
\vdots & \vdots \\
\frac{\partial z^{L,i}_{n_L}}{\partial \mathrm{vec}(W^m)^T} & \frac{\partial z^{L,i}_{n_L}}{\partial (b^m)^T}
\end{bmatrix} \\
&=
\begin{bmatrix}
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T &
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T \\
\vdots & \vdots \\
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}} \phi(Z^{m-1,i})^T\right)^T &
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}} \mathbb{1}_{a^m b^m}\right)^T
\end{bmatrix} \\
&=
\begin{bmatrix}
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_1}{\partial S^{m,i}}
\begin{bmatrix} \phi(Z^{m-1,i})^T & \mathbb{1}_{a^m b^m} \end{bmatrix}\right)^T \\
\vdots \\
\mathrm{vec}\!\left(\frac{\partial z^{L,i}_{n_L}}{\partial S^{m,i}}
\begin{bmatrix} \phi(Z^{m-1,i})^T & \mathbb{1}_{a^m b^m} \end{bmatrix}\right)^T
\end{bmatrix}.
\end{aligned}
\tag{3.35}
\]
In the backward process, if $\partial z^{L,i}/\partial \mathrm{vec}(S^{m,i})^T$ is available, we calculate $\partial z^{L,i}/\partial \mathrm{vec}(S^{m-1,i})^T$ for convolutional layer $m-1$. From a derivation similar to (3.25) for the gradient, we

