Gauss-Newton Matrix-Vector Products - Hessian-free Newton Methods for Training CNN

III. Hessian-free Newton Methods for Training CNN

3.4 Gauss-Newton Matrix-Vector Products

For solving (3.7), conjugate gradient (CG) methods are often applied and the main operation at each CG iteration is the Gauss-Newton matrix-vector product.

From (3.34), we rearrange (3.6) to

G = 1

and the Gauss-Newton matrix vector product becomes

Gv = 1

In this subsection, we focus on the ith instance and give the implementation details of calculating (3.45) for the whole data in Chapter 4.5.

For the convolutional layers, from (3.35) and (3.45), we first calculate

To simplify (3.46), we use the following property

vec(AB)^Tvec(C) = vec(A)^Tvec(CB^T)

to have for example, the first element in (3.46) is

vec ∂z₁^L,i

After deriving (3.47), from (3.45), we sum results of all layers

m=1

J^m,iv^m

and then calculate

qⁱ = Bⁱ(

m=1

J^m,iv^m). (3.48)

From (2.24) and (3.5),

B_tsⁱ = ∂²ξⁱ

and we derive qⁱ by multiplying every element ofPL

m=1J^m,iv^mby two.

After deriving (3.48), from (3.35) and (3.45), we calculate

Similar to the results of the convolutional layers, for the fully-connected layers, we have

CHAPTER IV

Implementation Details

Our goal in this chapter is to show that after some careful derivations, a Newton method for CNN can be implemented by a simple and short program. Here we give a MATLAB implementation though a modification for other languages such as Python should be straightforward.

For the discussion in Chapter III, we consider a single data instance, but for practical implementations, all instances must be taken care of. In out implementation, we stored Z^m−1,i, ∀i = 1, . . . , l as the following matrix.

Z^m−1,1 Z^m−1,2 . . . Z^m−1,l

∈ R^d^m−1^×a^m−1^b^m−1^l. (4.1)

Similarly, we stored ∂z^L,i/∂vec(S^m,i)^T as

∂z₁^L,1

∂vec(S^m,i)^T . . . ^∂z

L,l 1

∂vec(S^m,i)^T . . . ^∂z

L,1 nL

∂vec(S^m,i)^T . . . ^∂z

L,l nL

∂vec(S^m,i)^T

∈ R^ln^L^×d^m^a^m^b^m. (4.2)

4.1 Generation of φ(Z

^m−1,i

)

MATLAB has a built-in function im2col that can generate φ(Z^m−1,i) for s^m = 1 and s^m = h^m. To handle general s^m, we notice that φ(Z^m−1,i) is a sub-matrix of the

output matrix of using MATLAB’s im2col under s^m = 1. In appendix we provide an efficient implementation to extract the sub-matrix. However, in other languages a subroutine like MATLAB’s im2col may not be available. Further, generating a larger matrix under s^m = 1 causes extra time and memory.

Therefore, here we show an efficient implementation to get φ(Z^m−1,i) without re-lying on a subroutine like MATLAB’s im2col. To begin we consider the following linear indices1 (i.e., counting elements in a column-oriented way) of Z^m−1,i:



Because every element in

φ(Z^m−1,i) ∈ R^h^m^h^m^d^m−1^×a^m^b^m,

is extracted from Z^m−1,i, the task is to find the mapping between each element in φ(Z^m−1,i) and a linear index of Z^m−1,i. Consider the following example.

a^m−1 = b^m−1 = 2, d^m−1 = 1, s^m = 1, h^m = 2.

Because d^m−1 = 1, we omit the channel subscript. In addition, we also omit the instance index i. Thus the image and its linear indices are



1Linear indices refer to the sequence of how elements in a matrix are stored. Here we consider a column-oriented setting.

We have

Thus we store the following vector to indicate the mapping between linear indices of Z^m−1,iand φ(Z^m−1,i).

[ 1 2 4 5 2 3 5 6 ]^T . (4.4)

It also corresponds to row indices of non-zero elements in P_φ^m−1.

We begin with checking how linear indices of Z^m−1,i can be mapped to the first column of φ(Z^m−1). For simplicity, we consider only channel j. From (2.9) and (4.3), we have

where the left column gives the linear indices in Z^m−1,i, while the right column shows

the corresponding values. We rewrite linear indices in (4.5) as

Clearly, every linear index in (4.6) can be represented as

(p + qa^m−1)d^m−1 + j, (4.7)

where

p, q ∈ {0, . . . , h^m− 1}

correspond to the pixel position in the convolutional filter.2

Next we consider other columns in φ(Z^m−1,i) by still fixing the channel to be j.

From (2.9), similar to the right column in (4.5), each column contains the following elements from the jth channel of Z^m−1,i.

z_1+p+as^m−1,i m,1+q+bs^m,j , a = 0, 1, . . . , a^m− 1,

b = 0, 1, . . . , b^m− 1, (4.8)

where (1 + as^m, 1 + bs^m) denotes the top-left position of a sub-image in the channel j 2More precisely, p + 1 and q + 1 are the pixel position.

of Z^m−1,i. From (4.3), the linear index of each element in (4.8) is

Now we have known for each element of φ(Z^m−1,i) what the corresponding linear index in Z^m−1,iis. Next we discuss the implementation details, where the code is shown in Listing IV.1. First, we compute elements in (4.6) with j = 1 by applying MATLAB’s bsxfunfunction on the following two arrays.



The results is the following matrix

 whose columns, if concatenated, lead to values in (4.6) with j = 1; see line 2 of the code. To get (4.7) for all channels j = 1, . . . , d^m−1, we apply bsxfun again to plus the vector form of (4.10) and

0 1 . . . d^m−1− 1

and then vectorize the resulting matrix; see line 3.

To obtain other columns in φ(Z^m−1,i), next we calculate a^mand b^mby (2.4) in lines 4-5. To get all linear indices in (4.9), we see that the second term corresponds to indices

of the first column and therefore we must calculate the following column offset

(a + ba^m−1)s^md^m−1, ∀a = 0, 1, . . . , a^m− 1, b = 0, 1, . . . , b^m− 1.

This is by a bsxfun to plus the following two arrays.





 0

... a^m− 1







× s^md^m−1 and

0 . . . b^m− 1

× a^m−1s^md^m−1;

see line 6 in the code. Finally, we use another bsxfun to add the column offset and the values in (4.6); see line 7.

The obtained linear indices are independent of Z’s values. Thus the above procedure only needs to be run once in the beginning. For any Z, we apply the indices to extract φ(Z); see line 20-21 in Listing IV.1.

For φ(Z^m−1,i) in the pooling layer, the same implementation can be used.

4.2 Construction of P

_pool^m−1,i

Following (2.19), we assume that Z^m−1,i and φ(Z^m−1,i) are input and output be-fore/after the pooling operation, respectively. In the beginning of the training proce-dure, we use the proposed procedure in Chapter 4.1 to have φ(Z^m−1,i) that partitions each non-overlapping sub-regions; see (2.16) and (2.17). At each Newton iteration, Listing IV.2 is a MATLAB implementation for constructing P_pool^m−1,i, ∀i. In line 13, we handle the situation if

a^m−1 mod h^m 6= 0 or b^m−1 mod h^m 6= 0. (4.11) That is, we must ensure that (2.17) holds with a^m and b^m being integers. If (4.11) occurs, we simply discard the last several rows or columns in Z^m−1,i to make a^m−1 and

b^m−1 be multiples of h^m. In line 8, we extract the linear indices of Z^m−1,i to appear in vec(φ(Z^m−1,i)), which as we mentioned has been generated in the beginning. The resulting vector P contains

h^mh^md^m−1a^mb^m elements and each element is in the range of

1, . . . , d^m−1a^m−1b^m−1.

In line 9, we get

Z^m−1,i, i = 1, . . . , l, (4.12)

which are stored as a matrix in (4.1). In line 23-24, we use P to generate

vec(φ(Z^m−1,1)) · · · vec(φ(Z^m−1,l))

∈ h^mh^md^m−1a^mb^m× l. (4.13)

Next we rewrite the above matrix so that each column contains a sub-region:







z_1,1,1^m−1,i z_1,1,2^m−1,i . . . z_1+(a^m−1,lm−1)×s^m,1+(b^m−1)×s^m,d^m−1

... ... . .. ...

z_h^m−1,im,h^m,1 z_h^m−1,im,h^m,2 . . . z_h^m−1,lm+(a^m−1)×s^m,h^m+(b^m−1)×s^m,d^m−1







∈ R^h^m^h^m^×d^m−1^a^m^b^m^l.

(4.14) We apply a max function to get the largest value of each column and its index in the range of 1, . . . , h^mh^m. The resulting row vector has d^ma^mb^ml elements, where d^m = d^m−1; see line 25. In line 26, we reformulate it to be

d^m× a^mb^ml

as the output Z^m,i, ∀i.

Next we find linear indices in the matrix (4.13) that corresponds to the largest ele-ments obtained from (4.14).3 First we obtain linear indices of (4.1), which are in the

3Note that we do not really generate a sparse matrix P_pool^m−1,iin (2.19).

range of

1, . . . , d^m−1a^m−1b^m−1l.

We store these indices in a matrix the same size as in (4.1); see line 19. Then we generate

φ(linear indices of (4.1)), (4.15) which has the same size as in (4.13); see line 28-29. Next, if we can find positions in (4.15) that correspond to the largest elements identified earlier, then we have the desired linear indices. We mentioned that in line 25, not only the maximal value in each sub-region is obtained, but also we get the corresponding index in {1, . . . , h^mh^m}.

Therefore, for these max values, their linear indices in (4.14)4 are







row indices of max values in (4.14)





+ h^mh^m







0 ...

d^m−1a^mb^ml − 1







. (4.16)

The reason is that h^mh^mis the column offset in (4.14). See line 30 for the implementa-tion of (4.16). Next in line 31 we use (4.16) to extract values in (4.15) and obtain linear indices in (4.1) for the selected max values. Similar to the situation in Chapter 4.1 for φ(Z^m−1,i), the index finding procedure in Listing IV.2 is conducted only once in the beginning of the training process.

Note that because max pooling depends on Z^m−1,ivalues, linear indices for P_pool^m−1,i must be re-generated in the beginning of each Newton iteration.

4.3 Details of Padding Operation

To implement zero-padding, we first capture the linear indices of the input image in the padded image. For example, if the size of the imput image is 3 × 3 and the output 4Also linear indices in (4.13) because (4.14) is separating each column in (4.13) to several columns.

padded image is 5 × 5, we have

where “1” values indicate positions of the input image. Based on the column-major order, we derive

pad idx one= {7, 8, 9, 11, 12, 13, 16, 17, 18}.

It is obtained in the begining of the training procedure and it will be used in the fol-lowing situations. First, pad idx one contains row indices in P_padding^m−1 of (2.15) that correspond to the input image. We can use it to conduct the padding operation in (2.15).

Second, from (3.33), in gradient and Jacobian evaluations, we need

v^TP_padding^m−1 .

This can be considered as the inverse of the padding operation: we would like to remove zeros and get back the original image.

MATLAB implementations are shown in Listing IV.4 and IV.5, where Listing IV.4 is for generating the indices and Listing IV.5 is for the padding operation (2.15). In line 12 of Listing IV.4, we use the MATLAB bulit-in function padarray to pad zeros around the input image. Because Z^m−1,i, ∀i are stored as a matrix in (4.1), in line 8 of Listing IV.5, we extend pad idx one to conver all instances:





pad idx one

pad idx one + a^mb^m

pad idx one + 2a^mb^m

. . .





. (4.17)

These values indicate the new column indices of original columns in Z^m−1,i, ∀i. In line 9 of Listing IV.5, we create the zeros of the output padding image. By using values in (4.17), in line 10 of Listing IV.5, we put the input Z^m−1,ito the corresponding positions of the output padding image.

4.4 Evaluation of v

P

_φ^m−1

For (3.25) and (3.36), the following operation is applied.

v^TP_φ^m−1, (4.18)

where

v = vec

(W^m)^T ∂ξ_i

∂S^m,i

for (3.25) and

v = vec (W^m)^T ∂z_j^L,i

∂S^m,i

, j = 1, . . . , n_L (4.19) for (3.36).

Consider the same example in Chapter 4.1. We note that

(P_φ^m−1)^Tv = [ v1 v2+ v5 v6 v3 v4+ v7 v8 ]^T , (4.20)

which is a kind of “inverse” operation of φ(Z^m−1): we accumulate elements in φ(Z^m−1) back to their original positions in Z^m−1. In MATLAB, given indices in (4.4), a function accumarraycan directly generate the vector (4.20).

To calculate (3.25) over a batch of instances, we aim to have







(P_φ^m−1)^Tv1

... (P_φ^m−1)^Tvl







. (4.21)

We can manage to apply MATLAB’s accumarray on the vector

by giving the following indices as the input.



We can easily rearrange the resulting vector by reshape and transpose operator to get the matrix (3.25), where each row is corresponding to an instance. We show the program in Listing IV.6, where line 8 generates the indices shown in (4.23) and V(:) in line 9 is the vector (4.22).

Following the same idea, to calculate (3.36) over a batch of instances, we do not have to calculate (4.18) n_Ltimes. Notice that (3.36) is equivalent to

We can calculate v by vectorizing the result of fast matrix-matrix multiplication.

By passing the appropriate input indices to accumarray, we can calculate

(P_φ^m−1)^T

(4.26)

So the calculation is fallen back to the same case as evaluating (3.25). That is, after having (4.26), we calculate (4.18) once instead of calculating (4.18) and (4.19) n_Ltimes.

The input indices for accumarray are easy to be obtained. Before the transpose operation in (4.25), the linear indices for each column, i.e.

(P_φ^m−1)^Tvec

is start from

(u − 1)la^m−1b^m−1d^m−1 + 1.

Therefore, we can apply accumarray on the vector (4.26) with the following input indices.

The program is the same as Listing IV.6, except that we pass the value of the product of n Land S k for the parameter S k.

4.5 Gauss-Newton Matrix-Vector Products

Because (3.45) is the form of summation of every instance, in the view of MATLAB implementation, we extend (3.45) to the whole data set in a matrix form.

Gv = 1

To derive (4.27), we first calculate



From (3.47), for a particular m, we have



and 

A MATLAB implementation for (4.28) is shown in Listing IV.7. We store the model in a structure of different layers. Thus in lines 11 and 13 we respectively obtain

Z^m−1,i (after padding) and ∂z^L,i

∂vec(S^m,i)^T, i = 1, . . . , l.

In line 14, we calculate (4.29) to derive p^m,1, . . . , p^m,l. In line 15, we separately calcu-late

∂z^L,i

∂vec(S^m,i)^Tp^m,i, i = 1, . . . , l.

In line 16, we sum of the results of each layer.

After deriving (4.28), from (3.48) and (4.27), we must calculate



From (3.49), (4.30) can be derived by multiplying every element of (4.28) by two.

When (4.30) is available, from (4.27), the following matrix-vector product is needed.

 For the layer m, from (3.50), the calculation is written as

= vec

A MATLAB implementation for (4.31) is shown in Listing IV.8. In line 12, we separately calculate r^m,1, . . . , r^m,lby (4.32). In line 18, we build

In line 19, we calculate (4.31). In line 20, we derive the result of (4.27) for the mth layer.

4.6 Mini-Batch Function and Gradient Evaluation

Later in Chapter 5.1 we will discuss details of memory usage. One important con-clusion is that in many places of the Newton method, the memory consumption is pro-portional to the number of data. This fact causes difficulties in handling large data sets.

Therefore, here we discuss some implementation techniques to address the memory difficulty.

In subsampled Newton methods discussed in Chapter 3.1, a subset S of the train-ing data is used to derive the subsampled Gauss-Newton matrix for approximattrain-ing the Hessian matrix. While a motivation of this technique is to trade a slightly less accurate

direction for shorter running time per iteration, it is also useful to reduce the memory consumption. For example, at the mth convolutional layer, we only need to store the following matrices

∂z^L,i

∂vec(S^m,i)^T, ∀i ∈ S (4.33)

for the Gauss-Newton matrix-vector products.

However, function and gradient evaluations must use the whole training data. Fortu-nately, both operations involve the summation of independent results over all instances.

Here we follow Wang et al. (2018a) to split the index set {1, . . . , l} of data to, for ex-ample, R equal-sized subsets S1, . . . , S_R. We then calculate the result corresponding to each subset and accumulate them for the final output. Take the function evaluation as an example. For each subset, we must store only

Z^m,i, ∀m, ∀i ∈ S_r.

Thus, the memory usage can be dramatically reduced.

For the Gauss-Newton matrix-vector product, to calculate (4.33), we need Z^m,i, ∀i ∈ S. However, under the mini-batch setting, the needed values may not be kept. Our strat-egy is to let the last subset SR be the same subset used for the sub-sampled Hessian.

Then we can preserve the needed Z^m,ifor subsequent operations.

Listing IV.1: MATLAB implementation for φ(Z^m−1,i) 1 function idx = indicator_im2col(a,b,d,h,s)

2 first_channel_idx = bsxfun(@plus, ([0:h-1]*d+1)', [0:h-1]*a

*d);

3 first_col_idx = bsxfun(@plus, first_channel_idx(:), [0:d -1]);

4 out_a = floor((a - h)/s) + 1;

5 out_b = floor((b - h)/s) + 1;

6 column_offset = bsxfun(@plus, [0:out_a-1]', [0:out_b-1]*a)*

s*d;

7 idx = bsxfun(@plus, column_offset(:)', first_col_idx(:));

8 end

9 model(m).indicator = indicator_im2col(param.wdimages_pad0,param .htimages_pad0,param.chimages0,param.wdfilters(m),param.

strides(m));

10 function phiZ = cal_phiZ(param, model, batch_idx, m) 11 if m > 1

12 if param.padflags(m-1) == 1

13 phiZ = padding(param,model(m-1).Z,m-1,model(m-1).

pad_idx);

14 else

15 phiZ = model(m-1).Z;

16 end

17 else

18 phiZ = model(m).Z0(:,param.batch_idx_current);

19 end

20 phiZ = reshape(phiZ,[],param.sample_inst);

21 phiZ = phiZ(model(m).indicator,:);

23 if m == 1

24 phiZ = reshape(phiZ,param.wdfilters(m)*param.wdfilters(

m)*param.chimages0,[]);

25 else

26 phiZ = reshape(phiZ,param.wdfilters(m)*param.wdfilters(

m)*param.chimages(m-1),[]);

27 end

28 end

Listing IV.2: MATLAB implementation for P_pool^m−1 1 function [param, model] = maxpooling(param, model, m) 2

3 a = param.htimages(m);

4 b = param.wdimages(m);

5 d = param.chimages(m);

6 h = param.wdpool(m);

7 S_k = param.num_sampled_data;

8 P = model(m).idx_phiZ_pool;

9 Z = model(m).Z;

11 rm_idx = [];

12 pool_idx = [1:d*a*b*S_k];

13 if (mod(a,h) > 0 || mod(b,h) > 0)

14 newa = a - mod(a,h); newb = b - mod(b,h);

15 remained_idx = bsxfun(@plus,[1:newa]',[0:newb-1]*a);

16 remained_idx = bsxfun(@plus,remained_idx(:),[0:S_k-1]*a*b);

17 Z = Z(:,remained_idx(:));

19 pool_idx = reshape(pool_idx,d,[]);

20 pool_idx = pool_idx(:,remained_idx(:));

21 end 22

23 Z = reshape(Z,[],S_k);

24 Z = Z(P,:);

25 [Z, WS] = max(reshape(Z,h*h,[]));

26 model(m).Z = reshape(Z,d,[]);

28 pool_idx = reshape(pool_idx,[],S_k);

29 pool_idx = pool_idx(P,:);

30 WS = WS + h*h*([0:floor(a/h)*floor(b/h)*d*S_k-1]);

31 model(m).pool_idx = pool_idx(WS);

Listing IV.3: MATLAB implementation for evaluating (3.32) 1 function output = maxpooling_grad(param,m,input,pool_idx) 2

3 a = param.wdimages(m);

4 b = param.htimages(m);

5 d = param.chimages(m);

6 S_k = param.num_sampled_data;

8 output = zeros(d,a*b*S_k);

9 output(pool_idx) = reshape(input,[],1);

Listing IV.4: MATLAB implementation for the index of zero-padding 1 function [pad_idx] = padding_idx(param,m)

3 if m == 0

4 a = param.wdimages0;

5 b = param.htimages0;

6 else

7 a = param.wdimages_pool(m);

8 b = param.htimages_pool(m);

9 end 10

11 p = (param.wdfilters(m+1)-1)/2;

12 newa = a+2*p; newb = b+2*p;

13 pad_idx = repmat(p+(1:a)', b,1) + repeat_elements(newa*(p+(0:b -1)'), a);

Listing IV.5: MATLAB implementation for zero-padding 1 function output = padding(param,Z,m,pad_idx_one) 2

3 a = param.wdimages_pad(m);

4 b = param.htimages_pad(m);

5 d = param.chimages(m);

6 S_k = param.num_sampled_data;

8 idx = reshape(bsxfun(@plus, pad_idx_one, [0:S_k-1]*a*b), [], 1)

;

9 output = zeros(d,a*b*S_k);

10 output(:,idx) = Z;

Listing IV.6: MATLAB implementation to evaluate v^TP_φ^m−1 1 function vTP = vTP(param, model, S_k, m, V)

2 % V: a matrix with #cols = S_k 3

4 a_prev = param.htimages(m-1);

5 b_prev = param.wdimages(m-1);

6 d_prev = param.chimages(m-1);

8 idx = reshape(bsxfun(@plus, model(m).idx_phiZ(:), [0:S_k-1]*

d_prev*a_prev*b_prev), [], 1);

9 vTP = reshape(accumarray(idx, V(:), [d_prev*a_prev*b_prev*S_k 1]), [], S_k)';

Listing IV.7: MATLAB implementation for J v 1 function Jv = Jv(param, model, v_in, subset_idx) 2

3 n = param.n;

4 nL = param.nL;

5 L = param.L;

6 S_k = param.num_sampled_data;

7 Jv = zeros(nL*S_k, 1);

9 for m = param.L : -1 : param.LC+1 10 n_m = param.neurons(m+1);

11 v = reshape(v_in(n(m)+1:n(m+1)), n_m, []);

12 p = v * [model(m-1).Z; ones(1, S_k)];

13 if m < L

14 p = p';

15 p = repeat_elements(p, nL);

16 p = sum(model(m).dZLdS_T.*p, 2);

17 else

18 p = p(:);

19 end

20 Jv = Jv + p;

21 end 22

23 for m = param.LC : -1 : 1 24 a = param.wdimages(m);

25 b = param.htimages(m);

26 d = param.chimages(m);

27 v = reshape(v_in(n(m)+1:n(m+1)), d, []);

28 phiZ = [cal_phiZ(param, model, subset_idx, m); ones(1, a*b*

S_k)];

29 p = reshape(v * phiZ, [], S_k);

30 p = p';

31 p = repeat_elements(p, nL);

32 p = sum(model(m).dZLdS_T.*p, 2);

33 Jv = Jv + p;

34 end

Listing IV.8: MATLAB implementation for J^Tq 1 nL = param.nL;

2 S_K = param.sample_inst;

3 idx = param.batch_idx_current;

4 lambda = param.lambda;

5 C = param.C;

6 q = model(1).Jv;

7 for m = param.LC : -1 : 1 8 ZsT = model(m).ZsT;

9 Z = model(m-1).Z;

10 d = param.chimages(m);

11 v = cgparam(m).p;

12 r = arrayfun(@(i) ZsT(1+(i-1)*nL:i*nL,:)' * q(1+(i-1)*nL:i*

nL),[1:S_K],'un',0);

13 r = horzcat(r{:});

14 r = reshape(r,d,[]);

15 if m > 1

16 Z_pad = padding(param,Z,m-1,model(m-1).pad_idx);

17 end

18 phiZ = cal_phiZ(param,model,idx,m,Z_pad);

19 r = reshape(r*[phiZ' ones(size(phiZ,2),1)],[],1);

20 model(m).Gv = (lambda + 1/C)*v + r/S_K;

21 end

CHAPTER V

Analysis of Newton Methods for CNN

In this chapter, based on the implementation details in Chapter IV, we analyze the memory and computational cost per iteration. We consider that all training instances are used. If the subsampled Hessian in Chapter III is considered, then in the Jacobian calculation and the Gauss-Newton matrix vector products, the number of instances l should be replaced by the subset size |S|.

In this discussion we exclude the padding operation and the pooling layer because first they are optional steps and second they are not the bottleneck.

5.1 Memory Requirement

(1) Weight matrix and bias vector: For every layer, we must store

W^m and b^m, m = 1, . . . , L.

From (2.10) and (2.20), the memory usage is

L^c

m=1

d^m× (h^mh^md^m−1+ 1) +

m=L^c+1

n_m× (n_m−1+ 1)

! .

(2) To construct φ(Z^m−1,i), in Chapter 4.1, we store each position’s corresponding

linear index in Z^m−1,i.

L^c

m=1

h^mh^md^m−1a^mb^m

! .

(3) Function evaluation: From Chapter II, we store

Z^m,i, m = 0, . . . , L, i = 1, . . . , l.

Therefore, the memory usage is

O l ×

L^c

m=0

d^ma^mb^m+

m=L^c+1

n_m

(4) Gradient evaluation: From Chapter 3.2, because

∂ξ_i

∂vec(S^m−1,i)^T, m = 2, . . . , L, ∀i.

is only used in backward process. We just store this matrix for two adjacent layers.

Therefore, the memory usage is



l × X

{m,m+1}

d^ma^mb^m



, 1 ≤ m < L^c in convolutional layers or



l × X

m∈{m,m+1}

n^m



, L^c < m < L

in fully-connected layers. The following matrix must be stored.

∂ξ_i

∂vec(W^m)^T, and ∂ξ_i

∂(b^m)^T, m = 1, . . . , L, ∀i.

Therefore, the memory usage is

O l ×

L^c

m=1

d^m h^mh^md^m−1+ 1 +

m=L^c+1

n_m(n_m−1+ 1)

(5) Jacobian evaluation and Gauss-Newton matrix-vector products: Besides W^m and Z^m−1,i, from (3.36), (3.47), (3.50), we explicitly store

∂z^L,i

∂vec(S^m,i)^T, m = 1, . . . , L, ∀i.

Thus, the memory usage is

O l × nL×

L^c

m=1

(d^ma^mb^m) +

m=L^c+1

5.2 Computational Cost

To avoid clutter, we show the computational cost for mth conolutional/fully-connected layer.

(1) Function evaluation:

• Convolutional layers: From (2.8), (2.11), and (2.12), the computational cost is

O(l × h^mh^md^m−1d^ma^mb^m).

• Fully-connected layers: From (2.21) and (2.22), the computational cost is

O(l × n_mn_m−1)

(2) Gradient evaluation:

• Convolutional layers: From (3.22) and (3.23), the computational cost is

O(l × h^mh^md^m−1d^ma^mb^m).

From (3.25) and (3.26), the computational cost is

O(l × a^m−1b^m−1d^m−1d^ma^mb^m).

Therefore, the total computational cost for the gradient evaluation is

O(l × a^m−1b^m−1d^m−1d^ma^mb^m).

• Fully-connected layers: For (3.27) and (3.28), the computational cost is

O(l × nmnm−1).

For (3.29) and (3.30), the computational cost is similar. Therefore, the total computational cost is

O(l × n_mn_m−1).

(3) Jacobian evaluation:

• Convolutional layers: From (3.35), the computational cost is

O l × n_L× d^ma^mb^m(h^mh^md^m−1+ 1) .

From (3.36), the computational cost is

O l × n_L× (d^ma^mb^mh^mh^md^m−1+ h^mh^ma^m−1b^m−1d^m−1) .

The computational cost of (3.37) can be omitted. Therefore, the total compu-tational cost is

O(l × n_L× d^ma^mb^mh^mh^md^m−1).

• Fully-connected layers: From (3.40) and (3.42), the computational cost is

O(l × n_L× n_mn_m−1).

(4) Gauss-Newton matrix-vector products:

• Convolutional layers: From (3.47) and (3.50), the computational cost is

O l × (d^mh^mh^md^m−1a^mb^m+ n_Ld^ma^mb^m) .

• Fully-connected layers: From (3.51) and (3.52), the computational cost is

O (l × (nmnm−1+ nLnm)) .

CHAPTER VI

Experiments

We choose the following image data sets for experiments. All the data sets are publicly available1 and the summary is in Table 6.1.

• MNIST: This data set, containing hand-written digits, is a widely used benchmark for data classification (LeCun et al., 1998).

• SVHN: This data set consists of the colored images of house numbers (Netzer et al., 2011).

• CIFAR10: This data set is a famous colored image classification benchmark (Krizhevsky and Hinton, 2009).

• smallNORB: This data set is built for 3D object recognition (LeCun et al., 2004).

The original dimension is 96 × 96 × 2 because every object is taken two 96 × 96 grayscale images from the different angles. These two images are then placed in two channels. For the dimensionality reduction, we downsample each channel of every object with the max pooling (h = 3, s = 3) to the dimension 32 × 32.

All the data sets were pre-processed by the following procedure.

1All data sets used can be found at https://www.csie.ntu.edu.tw/˜cjlin/

libsvmtools/datasets/.

Table 6.1: Summary of the data sets, where a⁰× b⁰× d⁰ represents the (height, width, channel) of the input image, l is the number of training data, ltis the number of test data, and nLis the number of classes.

Data set a⁰× b⁰× d⁰ l l_t n_L

MNIST 28 × 28 × 1 60, 000 10, 000 10 SVHN 32 × 32 × 3 73, 257 26, 032 10 CIFAR10 32 × 32 × 3 50, 000 10, 000 10 smallNORB 32 × 32 × 2 24, 300 24, 300 5

(1) Min-max normalization. That is, for every image Z^0,i, we have Z^0,i ← Z^0,i− min

max − min, where max/min is the maximum/minimum value in Z^0,i.

(2) Zero-centering. This is commonly applied before training CNN (Krizhevsky et al., 2012; Zeiler and Fergus, 2014). That is, for every image Z^0,i, we have

Z^0,i ← Z^0,i− mean(Z^0,i), where mean(Z^0,i) is the mean value in Z^0,i.

We consider two simple CNN structure shown in Table 6.2. The parameters used in our algorithm are given as follows. For the initialization, we follow He et al. (2015) to randomly set the weight values from the N (0, 1) distribution and multiply by

s 2 n^m_in, where

n^m_in =











d^m−1× a^m−1× b^m−1 if m ≤ L^c,

n_m−1 otherwise.

For a CG procedure, we terminate it when the following relative stopping condition sat-isfies or the number of CG iterations reaches a maximal number of iterations (denoted as CG_max).

||(G + λI)d + ∇f (θ)|| ≤ σ||∇f (θ)||, (6.1) where σ = 0.1 and CG_max= 250. For the implementation of the Levenberg-Marquardt method, we set the initial λ₁ = 1 and (drop, boost, ρ_upper, ρ_lower) constants in (3.10) are

在文檔中牛頓法於卷積神經網路之應用 (頁 32-0)