• 沒有找到結果。

Gauss-Newton Matrix-Vector Products

III. Hessian-free Newton Methods for Training CNN

3.4 Gauss-Newton Matrix-Vector Products

For solving (3.7), conjugate gradient (CG) methods are often applied and the main operation at each CG iteration is the Gauss-Newton matrix-vector product.

From (3.34), we rearrange (3.6) to

G = 1

and the Gauss-Newton matrix vector product becomes

Gv = 1

In this subsection, we focus on the ith instance and give the implementation details of calculating (3.45) for the whole data in Chapter 4.5.

For the convolutional layers, from (3.35) and (3.45), we first calculate

To simplify (3.46), we use the following property

vec(AB)Tvec(C) = vec(A)Tvec(CBT)

to have for example, the first element in (3.46) is

vec ∂z1L,i

After deriving (3.47), from (3.45), we sum results of all layers

L

X

m=1

Jm,ivm

and then calculate

qi = Bi(

L

X

m=1

Jm,ivm). (3.48)

From (2.24) and (3.5),

Btsi = ∂2ξi

and we derive qi by multiplying every element ofPL

m=1Jm,ivmby two.

After deriving (3.48), from (3.35) and (3.45), we calculate

Similar to the results of the convolutional layers, for the fully-connected layers, we have

CHAPTER IV

Implementation Details

Our goal in this chapter is to show that after some careful derivations, a Newton method for CNN can be implemented by a simple and short program. Here we give a MATLAB implementation though a modification for other languages such as Python should be straightforward.

For the discussion in Chapter III, we consider a single data instance, but for practical implementations, all instances must be taken care of. In out implementation, we stored Zm−1,i, ∀i = 1, . . . , l as the following matrix.



Zm−1,1 Zm−1,2 . . . Zm−1,l



∈ Rdm−1×am−1bm−1l. (4.1)

Similarly, we stored ∂zL,i/∂vec(Sm,i)T as



∂z1L,1

∂vec(Sm,i)T . . . ∂z

L,l 1

∂vec(Sm,i)T . . . ∂z

L,1 nL

∂vec(Sm,i)T . . . ∂z

L,l nL

∂vec(Sm,i)T

T

∈ RlnL×dmambm. (4.2)

4.1 Generation of φ(Z

m−1,i

)

MATLAB has a built-in function im2col that can generate φ(Zm−1,i) for sm = 1 and sm = hm. To handle general sm, we notice that φ(Zm−1,i) is a sub-matrix of the

output matrix of using MATLAB’s im2col under sm = 1. In appendix we provide an efficient implementation to extract the sub-matrix. However, in other languages a subroutine like MATLAB’s im2col may not be available. Further, generating a larger matrix under sm = 1 causes extra time and memory.

Therefore, here we show an efficient implementation to get φ(Zm−1,i) without re-lying on a subroutine like MATLAB’s im2col. To begin we consider the following linear indices1 (i.e., counting elements in a column-oriented way) of Zm−1,i:

Because every element in

φ(Zm−1,i) ∈ Rhmhmdm−1×ambm,

is extracted from Zm−1,i, the task is to find the mapping between each element in φ(Zm−1,i) and a linear index of Zm−1,i. Consider the following example.

am−1 = bm−1 = 2, dm−1 = 1, sm = 1, hm = 2.

Because dm−1 = 1, we omit the channel subscript. In addition, we also omit the instance index i. Thus the image and its linear indices are

1Linear indices refer to the sequence of how elements in a matrix are stored. Here we consider a column-oriented setting.

We have

Thus we store the following vector to indicate the mapping between linear indices of Zm−1,iand φ(Zm−1,i).

[ 1 2 4 5 2 3 5 6 ]T . (4.4)

It also corresponds to row indices of non-zero elements in Pφm−1.

We begin with checking how linear indices of Zm−1,i can be mapped to the first column of φ(Zm−1). For simplicity, we consider only channel j. From (2.9) and (4.3), we have

where the left column gives the linear indices in Zm−1,i, while the right column shows

the corresponding values. We rewrite linear indices in (4.5) as

Clearly, every linear index in (4.6) can be represented as

(p + qam−1)dm−1 + j, (4.7)

where

p, q ∈ {0, . . . , hm− 1}

correspond to the pixel position in the convolutional filter.2

Next we consider other columns in φ(Zm−1,i) by still fixing the channel to be j.

From (2.9), similar to the right column in (4.5), each column contains the following elements from the jth channel of Zm−1,i.

z1+p+asm−1,i m,1+q+bsm,j , a = 0, 1, . . . , am− 1,

b = 0, 1, . . . , bm− 1, (4.8)

where (1 + asm, 1 + bsm) denotes the top-left position of a sub-image in the channel j 2More precisely, p + 1 and q + 1 are the pixel position.

of Zm−1,i. From (4.3), the linear index of each element in (4.8) is

Now we have known for each element of φ(Zm−1,i) what the corresponding linear index in Zm−1,iis. Next we discuss the implementation details, where the code is shown in Listing IV.1. First, we compute elements in (4.6) with j = 1 by applying MATLAB’s bsxfunfunction on the following two arrays.

The results is the following matrix

 whose columns, if concatenated, lead to values in (4.6) with j = 1; see line 2 of the code. To get (4.7) for all channels j = 1, . . . , dm−1, we apply bsxfun again to plus the vector form of (4.10) and



0 1 . . . dm−1− 1

 ,

and then vectorize the resulting matrix; see line 3.

To obtain other columns in φ(Zm−1,i), next we calculate amand bmby (2.4) in lines 4-5. To get all linear indices in (4.9), we see that the second term corresponds to indices

of the first column and therefore we must calculate the following column offset

(a + bam−1)smdm−1, ∀a = 0, 1, . . . , am− 1, b = 0, 1, . . . , bm− 1.

This is by a bsxfun to plus the following two arrays.

 0

... am− 1

× smdm−1 and



0 . . . bm− 1



× am−1smdm−1;

see line 6 in the code. Finally, we use another bsxfun to add the column offset and the values in (4.6); see line 7.

The obtained linear indices are independent of Z’s values. Thus the above procedure only needs to be run once in the beginning. For any Z, we apply the indices to extract φ(Z); see line 20-21 in Listing IV.1.

For φ(Zm−1,i) in the pooling layer, the same implementation can be used.

4.2 Construction of P

poolm−1,i

Following (2.19), we assume that Zm−1,i and φ(Zm−1,i) are input and output be-fore/after the pooling operation, respectively. In the beginning of the training proce-dure, we use the proposed procedure in Chapter 4.1 to have φ(Zm−1,i) that partitions each non-overlapping sub-regions; see (2.16) and (2.17). At each Newton iteration, Listing IV.2 is a MATLAB implementation for constructing Ppoolm−1,i, ∀i. In line 13, we handle the situation if

am−1 mod hm 6= 0 or bm−1 mod hm 6= 0. (4.11) That is, we must ensure that (2.17) holds with am and bm being integers. If (4.11) occurs, we simply discard the last several rows or columns in Zm−1,i to make am−1 and

bm−1 be multiples of hm. In line 8, we extract the linear indices of Zm−1,i to appear in vec(φ(Zm−1,i)), which as we mentioned has been generated in the beginning. The resulting vector P contains

hmhmdm−1ambm elements and each element is in the range of

1, . . . , dm−1am−1bm−1.

In line 9, we get

Zm−1,i, i = 1, . . . , l, (4.12)

which are stored as a matrix in (4.1). In line 23-24, we use P to generate

vec(φ(Zm−1,1)) · · · vec(φ(Zm−1,l))

∈ hmhmdm−1ambm× l. (4.13)

Next we rewrite the above matrix so that each column contains a sub-region:

z1,1,1m−1,i z1,1,2m−1,i . . . z1+(am−1,lm−1)×sm,1+(bm−1)×sm,dm−1

... ... . .. ...

zhm−1,im,hm,1 zhm−1,im,hm,2 . . . zhm−1,lm+(am−1)×sm,hm+(bm−1)×sm,dm−1

∈ Rhmhm×dm−1ambml.

(4.14) We apply a max function to get the largest value of each column and its index in the range of 1, . . . , hmhm. The resulting row vector has dmambml elements, where dm = dm−1; see line 25. In line 26, we reformulate it to be

dm× ambml

as the output Zm,i, ∀i.

Next we find linear indices in the matrix (4.13) that corresponds to the largest ele-ments obtained from (4.14).3 First we obtain linear indices of (4.1), which are in the

3Note that we do not really generate a sparse matrix Ppoolm−1,iin (2.19).

range of

1, . . . , dm−1am−1bm−1l.

We store these indices in a matrix the same size as in (4.1); see line 19. Then we generate

φ(linear indices of (4.1)), (4.15) which has the same size as in (4.13); see line 28-29. Next, if we can find positions in (4.15) that correspond to the largest elements identified earlier, then we have the desired linear indices. We mentioned that in line 25, not only the maximal value in each sub-region is obtained, but also we get the corresponding index in {1, . . . , hmhm}.

Therefore, for these max values, their linear indices in (4.14)4 are

row indices of max values in (4.14)

+ hmhm

0 ...

dm−1ambml − 1

. (4.16)

The reason is that hmhmis the column offset in (4.14). See line 30 for the implementa-tion of (4.16). Next in line 31 we use (4.16) to extract values in (4.15) and obtain linear indices in (4.1) for the selected max values. Similar to the situation in Chapter 4.1 for φ(Zm−1,i), the index finding procedure in Listing IV.2 is conducted only once in the beginning of the training process.

Note that because max pooling depends on Zm−1,ivalues, linear indices for Ppoolm−1,i must be re-generated in the beginning of each Newton iteration.

4.3 Details of Padding Operation

To implement zero-padding, we first capture the linear indices of the input image in the padded image. For example, if the size of the imput image is 3 × 3 and the output 4Also linear indices in (4.13) because (4.14) is separating each column in (4.13) to several columns.

padded image is 5 × 5, we have

where “1” values indicate positions of the input image. Based on the column-major order, we derive

pad idx one= {7, 8, 9, 11, 12, 13, 16, 17, 18}.

It is obtained in the begining of the training procedure and it will be used in the fol-lowing situations. First, pad idx one contains row indices in Ppaddingm−1 of (2.15) that correspond to the input image. We can use it to conduct the padding operation in (2.15).

Second, from (3.33), in gradient and Jacobian evaluations, we need

vTPpaddingm−1 .

This can be considered as the inverse of the padding operation: we would like to remove zeros and get back the original image.

MATLAB implementations are shown in Listing IV.4 and IV.5, where Listing IV.4 is for generating the indices and Listing IV.5 is for the padding operation (2.15). In line 12 of Listing IV.4, we use the MATLAB bulit-in function padarray to pad zeros around the input image. Because Zm−1,i, ∀i are stored as a matrix in (4.1), in line 8 of Listing IV.5, we extend pad idx one to conver all instances:

pad idx one

pad idx one + ambm

pad idx one + 2ambm

. . .

. (4.17)

These values indicate the new column indices of original columns in Zm−1,i, ∀i. In line 9 of Listing IV.5, we create the zeros of the output padding image. By using values in (4.17), in line 10 of Listing IV.5, we put the input Zm−1,ito the corresponding positions of the output padding image.

4.4 Evaluation of v

T

P

φm−1

For (3.25) and (3.36), the following operation is applied.

vTPφm−1, (4.18)

where

v = vec



(Wm)T ∂ξi

∂Sm,i



for (3.25) and

v = vec (Wm)T ∂zjL,i

∂Sm,i

!

, j = 1, . . . , nL (4.19) for (3.36).

Consider the same example in Chapter 4.1. We note that

(Pφm−1)Tv = [ v1 v2+ v5 v6 v3 v4+ v7 v8 ]T , (4.20)

which is a kind of “inverse” operation of φ(Zm−1): we accumulate elements in φ(Zm−1) back to their original positions in Zm−1. In MATLAB, given indices in (4.4), a function accumarraycan directly generate the vector (4.20).

To calculate (3.25) over a batch of instances, we aim to have

(Pφm−1)Tv1

... (Pφm−1)Tvl

T

. (4.21)

We can manage to apply MATLAB’s accumarray on the vector

by giving the following indices as the input.

We can easily rearrange the resulting vector by reshape and transpose operator to get the matrix (3.25), where each row is corresponding to an instance. We show the program in Listing IV.6, where line 8 generates the indices shown in (4.23) and V(:) in line 9 is the vector (4.22).

Following the same idea, to calculate (3.36) over a batch of instances, we do not have to calculate (4.18) nLtimes. Notice that (3.36) is equivalent to



We can calculate v by vectorizing the result of fast matrix-matrix multiplication.

By passing the appropriate input indices to accumarray, we can calculate

(Pφm−1)T

 (4.26)

 .

So the calculation is fallen back to the same case as evaluating (3.25). That is, after having (4.26), we calculate (4.18) once instead of calculating (4.18) and (4.19) nLtimes.

The input indices for accumarray are easy to be obtained. Before the transpose operation in (4.25), the linear indices for each column, i.e.

(Pφm−1)Tvec

is start from

(u − 1)lam−1bm−1dm−1 + 1.

Therefore, we can apply accumarray on the vector (4.26) with the following input indices.

The program is the same as Listing IV.6, except that we pass the value of the product of n Land S k for the parameter S k.

4.5 Gauss-Newton Matrix-Vector Products

Because (3.45) is the form of summation of every instance, in the view of MATLAB implementation, we extend (3.45) to the whole data set in a matrix form.

Gv = 1

To derive (4.27), we first calculate

From (3.47), for a particular m, we have

and 

A MATLAB implementation for (4.28) is shown in Listing IV.7. We store the model in a structure of different layers. Thus in lines 11 and 13 we respectively obtain

Zm−1,i (after padding) and ∂zL,i

∂vec(Sm,i)T, i = 1, . . . , l.

In line 14, we calculate (4.29) to derive pm,1, . . . , pm,l. In line 15, we separately calcu-late

∂zL,i

∂vec(Sm,i)Tpm,i, i = 1, . . . , l.

In line 16, we sum of the results of each layer.

After deriving (4.28), from (3.48) and (4.27), we must calculate

From (3.49), (4.30) can be derived by multiplying every element of (4.28) by two.

When (4.30) is available, from (4.27), the following matrix-vector product is needed.

 For the layer m, from (3.50), the calculation is written as

l

= vec

A MATLAB implementation for (4.31) is shown in Listing IV.8. In line 12, we separately calculate rm,1, . . . , rm,lby (4.32). In line 18, we build

In line 19, we calculate (4.31). In line 20, we derive the result of (4.27) for the mth layer.

4.6 Mini-Batch Function and Gradient Evaluation

Later in Chapter 5.1 we will discuss details of memory usage. One important con-clusion is that in many places of the Newton method, the memory consumption is pro-portional to the number of data. This fact causes difficulties in handling large data sets.

Therefore, here we discuss some implementation techniques to address the memory difficulty.

In subsampled Newton methods discussed in Chapter 3.1, a subset S of the train-ing data is used to derive the subsampled Gauss-Newton matrix for approximattrain-ing the Hessian matrix. While a motivation of this technique is to trade a slightly less accurate

direction for shorter running time per iteration, it is also useful to reduce the memory consumption. For example, at the mth convolutional layer, we only need to store the following matrices

∂zL,i

∂vec(Sm,i)T, ∀i ∈ S (4.33)

for the Gauss-Newton matrix-vector products.

However, function and gradient evaluations must use the whole training data. Fortu-nately, both operations involve the summation of independent results over all instances.

Here we follow Wang et al. (2018a) to split the index set {1, . . . , l} of data to, for ex-ample, R equal-sized subsets S1, . . . , SR. We then calculate the result corresponding to each subset and accumulate them for the final output. Take the function evaluation as an example. For each subset, we must store only

Zm,i, ∀m, ∀i ∈ Sr.

Thus, the memory usage can be dramatically reduced.

For the Gauss-Newton matrix-vector product, to calculate (4.33), we need Zm,i, ∀i ∈ S. However, under the mini-batch setting, the needed values may not be kept. Our strat-egy is to let the last subset SR be the same subset used for the sub-sampled Hessian.

Then we can preserve the needed Zm,ifor subsequent operations.

Listing IV.1: MATLAB implementation for φ(Zm−1,i) 1 function idx = indicator_im2col(a,b,d,h,s)

2 first_channel_idx = bsxfun(@plus, ([0:h-1]*d+1)', [0:h-1]*a

*d);

3 first_col_idx = bsxfun(@plus, first_channel_idx(:), [0:d -1]);

4 out_a = floor((a - h)/s) + 1;

5 out_b = floor((b - h)/s) + 1;

6 column_offset = bsxfun(@plus, [0:out_a-1]', [0:out_b-1]*a)*

s*d;

7 idx = bsxfun(@plus, column_offset(:)', first_col_idx(:));

8 end

9 model(m).indicator = indicator_im2col(param.wdimages_pad0,param .htimages_pad0,param.chimages0,param.wdfilters(m),param.

strides(m));

10 function phiZ = cal_phiZ(param, model, batch_idx, m) 11 if m > 1

12 if param.padflags(m-1) == 1

13 phiZ = padding(param,model(m-1).Z,m-1,model(m-1).

pad_idx);

14 else

15 phiZ = model(m-1).Z;

16 end

17 else

18 phiZ = model(m).Z0(:,param.batch_idx_current);

19 end

20 phiZ = reshape(phiZ,[],param.sample_inst);

21 phiZ = phiZ(model(m).indicator,:);

22

23 if m == 1

24 phiZ = reshape(phiZ,param.wdfilters(m)*param.wdfilters(

m)*param.chimages0,[]);

25 else

26 phiZ = reshape(phiZ,param.wdfilters(m)*param.wdfilters(

m)*param.chimages(m-1),[]);

27 end

28 end

Listing IV.2: MATLAB implementation for Ppoolm−1 1 function [param, model] = maxpooling(param, model, m) 2

3 a = param.htimages(m);

4 b = param.wdimages(m);

5 d = param.chimages(m);

6 h = param.wdpool(m);

7 S_k = param.num_sampled_data;

8 P = model(m).idx_phiZ_pool;

9 Z = model(m).Z;

10

11 rm_idx = [];

12 pool_idx = [1:d*a*b*S_k];

13 if (mod(a,h) > 0 || mod(b,h) > 0)

14 newa = a - mod(a,h); newb = b - mod(b,h);

15 remained_idx = bsxfun(@plus,[1:newa]',[0:newb-1]*a);

16 remained_idx = bsxfun(@plus,remained_idx(:),[0:S_k-1]*a*b);

17 Z = Z(:,remained_idx(:));

18

19 pool_idx = reshape(pool_idx,d,[]);

20 pool_idx = pool_idx(:,remained_idx(:));

21 end 22

23 Z = reshape(Z,[],S_k);

24 Z = Z(P,:);

25 [Z, WS] = max(reshape(Z,h*h,[]));

26 model(m).Z = reshape(Z,d,[]);

27

28 pool_idx = reshape(pool_idx,[],S_k);

29 pool_idx = pool_idx(P,:);

30 WS = WS + h*h*([0:floor(a/h)*floor(b/h)*d*S_k-1]);

31 model(m).pool_idx = pool_idx(WS);

Listing IV.3: MATLAB implementation for evaluating (3.32) 1 function output = maxpooling_grad(param,m,input,pool_idx) 2

3 a = param.wdimages(m);

4 b = param.htimages(m);

5 d = param.chimages(m);

6 S_k = param.num_sampled_data;

7

8 output = zeros(d,a*b*S_k);

9 output(pool_idx) = reshape(input,[],1);

Listing IV.4: MATLAB implementation for the index of zero-padding 1 function [pad_idx] = padding_idx(param,m)

2

3 if m == 0

4 a = param.wdimages0;

5 b = param.htimages0;

6 else

7 a = param.wdimages_pool(m);

8 b = param.htimages_pool(m);

9 end 10

11 p = (param.wdfilters(m+1)-1)/2;

12 newa = a+2*p; newb = b+2*p;

13 pad_idx = repmat(p+(1:a)', b,1) + repeat_elements(newa*(p+(0:b -1)'), a);

Listing IV.5: MATLAB implementation for zero-padding 1 function output = padding(param,Z,m,pad_idx_one) 2

3 a = param.wdimages_pad(m);

4 b = param.htimages_pad(m);

5 d = param.chimages(m);

6 S_k = param.num_sampled_data;

7

8 idx = reshape(bsxfun(@plus, pad_idx_one, [0:S_k-1]*a*b), [], 1)

;

9 output = zeros(d,a*b*S_k);

10 output(:,idx) = Z;

Listing IV.6: MATLAB implementation to evaluate vTPφm−1 1 function vTP = vTP(param, model, S_k, m, V)

2 % V: a matrix with #cols = S_k 3

4 a_prev = param.htimages(m-1);

5 b_prev = param.wdimages(m-1);

6 d_prev = param.chimages(m-1);

7

8 idx = reshape(bsxfun(@plus, model(m).idx_phiZ(:), [0:S_k-1]*

d_prev*a_prev*b_prev), [], 1);

9 vTP = reshape(accumarray(idx, V(:), [d_prev*a_prev*b_prev*S_k 1]), [], S_k)';

Listing IV.7: MATLAB implementation for J v 1 function Jv = Jv(param, model, v_in, subset_idx) 2

3 n = param.n;

4 nL = param.nL;

5 L = param.L;

6 S_k = param.num_sampled_data;

7 Jv = zeros(nL*S_k, 1);

8

9 for m = param.L : -1 : param.LC+1 10 n_m = param.neurons(m+1);

11 v = reshape(v_in(n(m)+1:n(m+1)), n_m, []);

12 p = v * [model(m-1).Z; ones(1, S_k)];

13 if m < L

14 p = p';

15 p = repeat_elements(p, nL);

16 p = sum(model(m).dZLdS_T.*p, 2);

17 else

18 p = p(:);

19 end

20 Jv = Jv + p;

21 end 22

23 for m = param.LC : -1 : 1 24 a = param.wdimages(m);

25 b = param.htimages(m);

26 d = param.chimages(m);

27 v = reshape(v_in(n(m)+1:n(m+1)), d, []);

28 phiZ = [cal_phiZ(param, model, subset_idx, m); ones(1, a*b*

S_k)];

29 p = reshape(v * phiZ, [], S_k);

30 p = p';

31 p = repeat_elements(p, nL);

32 p = sum(model(m).dZLdS_T.*p, 2);

33 Jv = Jv + p;

34 end

Listing IV.8: MATLAB implementation for JTq 1 nL = param.nL;

2 S_K = param.sample_inst;

3 idx = param.batch_idx_current;

4 lambda = param.lambda;

5 C = param.C;

6 q = model(1).Jv;

7 for m = param.LC : -1 : 1 8 ZsT = model(m).ZsT;

9 Z = model(m-1).Z;

10 d = param.chimages(m);

11 v = cgparam(m).p;

12 r = arrayfun(@(i) ZsT(1+(i-1)*nL:i*nL,:)' * q(1+(i-1)*nL:i*

nL),[1:S_K],'un',0);

13 r = horzcat(r{:});

14 r = reshape(r,d,[]);

15 if m > 1

16 Z_pad = padding(param,Z,m-1,model(m-1).pad_idx);

17 end

18 phiZ = cal_phiZ(param,model,idx,m,Z_pad);

19 r = reshape(r*[phiZ' ones(size(phiZ,2),1)],[],1);

20 model(m).Gv = (lambda + 1/C)*v + r/S_K;

21 end

CHAPTER V

Analysis of Newton Methods for CNN

In this chapter, based on the implementation details in Chapter IV, we analyze the memory and computational cost per iteration. We consider that all training instances are used. If the subsampled Hessian in Chapter III is considered, then in the Jacobian calculation and the Gauss-Newton matrix vector products, the number of instances l should be replaced by the subset size |S|.

In this discussion we exclude the padding operation and the pooling layer because first they are optional steps and second they are not the bottleneck.

5.1 Memory Requirement

(1) Weight matrix and bias vector: For every layer, we must store

Wm and bm, m = 1, . . . , L.

From (2.10) and (2.20), the memory usage is

O

Lc

X

m=1

dm× (hmhmdm−1+ 1) +

L

X

m=Lc+1

nm× (nm−1+ 1)

! .

(2) To construct φ(Zm−1,i), in Chapter 4.1, we store each position’s corresponding

linear index in Zm−1,i.

O

Lc

X

m=1

hmhmdm−1ambm

! .

(3) Function evaluation: From Chapter II, we store

Zm,i, m = 0, . . . , L, i = 1, . . . , l.

Therefore, the memory usage is

O l ×

Lc

X

m=0

dmambm+

L

X

m=Lc+1

nm

!!

.

(4) Gradient evaluation: From Chapter 3.2, because

∂ξi

∂vec(Sm−1,i)T, m = 2, . . . , L, ∀i.

is only used in backward process. We just store this matrix for two adjacent layers.

Therefore, the memory usage is

O

l × X

{m,m+1}

dmambm

, 1 ≤ m < Lc in convolutional layers or

O

l × X

m∈{m,m+1}

nm

, Lc < m < L

in fully-connected layers. The following matrix must be stored.

∂ξi

∂vec(Wm)T, and ∂ξi

∂(bm)T, m = 1, . . . , L, ∀i.

Therefore, the memory usage is

O l ×

Lc

X

m=1

dm hmhmdm−1+ 1 +

L

X

m=Lc+1

nm(nm−1+ 1)

!!

.

(5) Jacobian evaluation and Gauss-Newton matrix-vector products: Besides Wm and Zm−1,i, from (3.36), (3.47), (3.50), we explicitly store

∂zL,i

∂vec(Sm,i)T, m = 1, . . . , L, ∀i.

Thus, the memory usage is

O l × nL×

Lc

X

m=1

(dmambm) +

L

X

m=Lc+1

nm

!!

.

5.2 Computational Cost

To avoid clutter, we show the computational cost for mth conolutional/fully-connected layer.

(1) Function evaluation:

• Convolutional layers: From (2.8), (2.11), and (2.12), the computational cost is

O(l × hmhmdm−1dmambm).

• Fully-connected layers: From (2.21) and (2.22), the computational cost is

O(l × nmnm−1)

(2) Gradient evaluation:

• Convolutional layers: From (3.22) and (3.23), the computational cost is

O(l × hmhmdm−1dmambm).

From (3.25) and (3.26), the computational cost is

O(l × am−1bm−1dm−1dmambm).

Therefore, the total computational cost for the gradient evaluation is

O(l × am−1bm−1dm−1dmambm).

• Fully-connected layers: For (3.27) and (3.28), the computational cost is

O(l × nmnm−1).

For (3.29) and (3.30), the computational cost is similar. Therefore, the total computational cost is

O(l × nmnm−1).

(3) Jacobian evaluation:

• Convolutional layers: From (3.35), the computational cost is

O l × nL× dmambm(hmhmdm−1+ 1) .

From (3.36), the computational cost is

O l × nL× (dmambmhmhmdm−1+ hmhmam−1bm−1dm−1) .

The computational cost of (3.37) can be omitted. Therefore, the total compu-tational cost is

O(l × nL× dmambmhmhmdm−1).

• Fully-connected layers: From (3.40) and (3.42), the computational cost is

O(l × nL× nmnm−1).

(4) Gauss-Newton matrix-vector products:

• Convolutional layers: From (3.47) and (3.50), the computational cost is

O l × (dmhmhmdm−1ambm+ nLdmambm) .

• Fully-connected layers: From (3.51) and (3.52), the computational cost is

O (l × (nmnm−1+ nLnm)) .

CHAPTER VI

Experiments

We choose the following image data sets for experiments. All the data sets are publicly available1 and the summary is in Table 6.1.

• MNIST: This data set, containing hand-written digits, is a widely used benchmark for data classification (LeCun et al., 1998).

• SVHN: This data set consists of the colored images of house numbers (Netzer et al., 2011).

• CIFAR10: This data set is a famous colored image classification benchmark (Krizhevsky and Hinton, 2009).

• smallNORB: This data set is built for 3D object recognition (LeCun et al., 2004).

The original dimension is 96 × 96 × 2 because every object is taken two 96 × 96 grayscale images from the different angles. These two images are then placed in two channels. For the dimensionality reduction, we downsample each channel of every object with the max pooling (h = 3, s = 3) to the dimension 32 × 32.

All the data sets were pre-processed by the following procedure.

1All data sets used can be found at https://www.csie.ntu.edu.tw/˜cjlin/

libsvmtools/datasets/.

Table 6.1: Summary of the data sets, where a0× b0× d0 represents the (height, width, channel) of the input image, l is the number of training data, ltis the number of test data, and nLis the number of classes.

Data set a0× b0× d0 l lt nL

MNIST 28 × 28 × 1 60, 000 10, 000 10 SVHN 32 × 32 × 3 73, 257 26, 032 10 CIFAR10 32 × 32 × 3 50, 000 10, 000 10 smallNORB 32 × 32 × 2 24, 300 24, 300 5

(1) Min-max normalization. That is, for every image Z0,i, we have Z0,i ← Z0,i− min

max − min, where max/min is the maximum/minimum value in Z0,i.

(2) Zero-centering. This is commonly applied before training CNN (Krizhevsky et al., 2012; Zeiler and Fergus, 2014). That is, for every image Z0,i, we have

Z0,i ← Z0,i− mean(Z0,i), where mean(Z0,i) is the mean value in Z0,i.

We consider two simple CNN structure shown in Table 6.2. The parameters used in our algorithm are given as follows. For the initialization, we follow He et al. (2015) to randomly set the weight values from the N (0, 1) distribution and multiply by

s 2 nmin, where

nmin =









dm−1× am−1× bm−1 if m ≤ Lc,

nm−1 otherwise.

For a CG procedure, we terminate it when the following relative stopping condition sat-isfies or the number of CG iterations reaches a maximal number of iterations (denoted as CGmax).

||(G + λI)d + ∇f (θ)|| ≤ σ||∇f (θ)||, (6.1) where σ = 0.1 and CGmax= 250. For the implementation of the Levenberg-Marquardt method, we set the initial λ1 = 1 and (drop, boost, ρupper, ρlower) constants in (3.10) are

||(G + λI)d + ∇f (θ)|| ≤ σ||∇f (θ)||, (6.1) where σ = 0.1 and CGmax= 250. For the implementation of the Levenberg-Marquardt method, we set the initial λ1 = 1 and (drop, boost, ρupper, ρlower) constants in (3.10) are

相關文件