

IV. Implementation Details

4.6 Mini-Batch Function and Gradient Evaluation

Later in Chapter 5.1 we will discuss details of memory usage. One important conclusion is that in many places of the Newton method, the memory consumption is proportional to the number of data instances. This fact causes difficulties in handling large data sets.

Therefore, here we discuss some implementation techniques to address the memory difficulty.

In subsampled Newton methods discussed in Chapter 3.1, a subset S of the training data is used to derive the subsampled Gauss-Newton matrix for approximating the Hessian matrix. While a motivation of this technique is to trade a slightly less accurate

direction for shorter running time per iteration, it is also useful to reduce the memory consumption. For example, at the mth convolutional layer, we only need to store the following matrices

\frac{\partial z^{L,i}}{\partial \mathrm{vec}(S^{m,i})^T}, \quad \forall i \in S \qquad (4.33)

for the Gauss-Newton matrix-vector products.

However, function and gradient evaluations must use the whole training data. Fortunately, both operations involve the summation of independent results over all instances.

Here we follow Wang et al. (2018a) to split the index set {1, . . . , l} of the data into, for example, R equal-sized subsets S_1, . . . , S_R. We then calculate the result corresponding to each subset and accumulate the results for the final output. Take the function evaluation as an example. For each subset, we must store only

Z^{m,i}, ∀m, ∀i ∈ S_r.

Thus, the memory usage can be dramatically reduced.
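As an illustration, the following is a minimal sketch of this accumulation for the objective value. The helpers forward_pass and loss_fn and the variables l, C, theta, Z0, Y are illustrative assumptions and are not part of the package.

% Hedged sketch of mini-batch function evaluation: only the Z^{m,i} of the
% current subset S_r are materialized at any time.
R = 10;                                  % number of subsets
splits = round(linspace(0, l, R+1));     % boundaries of S_1, ..., S_R
f = 0;
for r = 1:R
    Sr = splits(r)+1 : splits(r+1);      % indices of subset S_r
    ZL = forward_pass(theta, Z0(:, Sr)); % assumed forward propagation of Chapter II
    f = f + loss_fn(ZL, Y(:, Sr));       % accumulate the partial loss sum
end
f = f/l + (theta'*theta)/(2*C);          % regularized objective, form assumed from (2.26)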

For the Gauss-Newton matrix-vector product, to calculate (4.33), we need Z^{m,i}, ∀i ∈ S. However, under the mini-batch setting, the needed values may not have been kept. Our strategy is to let the last subset S_R be the same subset used for the subsampled Hessian.

Then we can preserve the needed Z^{m,i} for subsequent operations.

Listing IV.1: MATLAB implementation for φ(Z^{m-1,i})

function idx = indicator_im2col(a,b,d,h,s)
    % Linear indices (in vec(Z^{m-1,i}), stored as a d x ab matrix in
    % column-major order) of the first channel of the top-left h x h patch.
    first_channel_idx = bsxfun(@plus, ([0:h-1]*d+1)', [0:h-1]*a*d);
    % Extend the patch indices to all d channels.
    first_col_idx = bsxfun(@plus, first_channel_idx(:), [0:d-1]);
    % Output spatial dimensions under stride s.
    out_a = floor((a - h)/s) + 1;
    out_b = floor((b - h)/s) + 1;
    % Offset of each patch's top-left position.
    column_offset = bsxfun(@plus, [0:out_a-1]', [0:out_b-1]*a)*s*d;
    % (h*h*d) x (out_a*out_b) index matrix for building phi(Z^{m-1,i}).
    idx = bsxfun(@plus, column_offset(:)', first_col_idx(:));
end

model(m).indicator = indicator_im2col(param.wdimages_pad0, param.htimages_pad0, ...
    param.chimages0, param.wdfilters(m), param.strides(m));

function phiZ = cal_phiZ(param, model, batch_idx, m)
    % Gather the (padded) input of layer m for the current batch.
    if m > 1
        if param.padflags(m-1) == 1
            phiZ = padding(param, model(m-1).Z, m-1, model(m-1).pad_idx);
        else
            phiZ = model(m-1).Z;
        end
    else
        phiZ = model(m).Z0(:, param.batch_idx_current);
    end
    % Select the patch entries via the pre-computed linear indices.
    phiZ = reshape(phiZ, [], param.sample_inst);
    phiZ = phiZ(model(m).indicator, :);
    % Each column of phi(Z^{m-1,i}) has h^m * h^m * d^{m-1} entries.
    if m == 1
        phiZ = reshape(phiZ, param.wdfilters(m)*param.wdfilters(m)*param.chimages0, []);
    else
        phiZ = reshape(phiZ, param.wdfilters(m)*param.wdfilters(m)*param.chimages(m-1), []);
    end
end
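As a toy check of indicator_im2col (a hedged example with made-up sizes, not taken from the thesis), consider a 4 × 4 single-channel image, a 2 × 2 filter, and stride 2:

% Toy check: each column of phiZ is one vectorized 2 x 2 patch.
a = 4; b = 4; d = 1; h = 2; s = 2;
Z = reshape(1:a*b, a, b);               % image, pixels stored column-major
idx = indicator_im2col(a, b, d, h, s);  % (h*h*d) x (out_a*out_b) index matrix
phiZ = Z(idx);                          % 4 x 4 matrix; its columns are the four patches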

Listing IV.2: MATLAB implementation for P_pool^{m-1}

function [param, model] = maxpooling(param, model, m)

a = param.htimages(m);
b = param.wdimages(m);
d = param.chimages(m);
h = param.wdpool(m);
S_k = param.num_sampled_data;
P = model(m).idx_phiZ_pool;
Z = model(m).Z;

rm_idx = [];
pool_idx = [1:d*a*b*S_k];
% Discard boundary pixels that do not fit into a full h x h pooling block.
if (mod(a,h) > 0 || mod(b,h) > 0)
    newa = a - mod(a,h); newb = b - mod(b,h);
    remained_idx = bsxfun(@plus,[1:newa]',[0:newb-1]*a);
    remained_idx = bsxfun(@plus,remained_idx(:),[0:S_k-1]*a*b);
    Z = Z(:,remained_idx(:));

    pool_idx = reshape(pool_idx,d,[]);
    pool_idx = pool_idx(:,remained_idx(:));
end

% Take the maximum within each pooling block; WS records the argmax position.
Z = reshape(Z,[],S_k);
Z = Z(P,:);
[Z, WS] = max(reshape(Z,h*h,[]));
model(m).Z = reshape(Z,d,[]);

% Store the linear indices of the selected (maximal) positions.
pool_idx = reshape(pool_idx,[],S_k);
pool_idx = pool_idx(P,:);
WS = WS + h*h*([0:floor(a/h)*floor(b/h)*d*S_k-1]);
model(m).pool_idx = pool_idx(WS);

Listing IV.3: MATLAB implementation for evaluating (3.32)

function output = maxpooling_grad(param,m,input,pool_idx)

a = param.wdimages(m);
b = param.htimages(m);
d = param.chimages(m);
S_k = param.num_sampled_data;

% Scatter the pooled values back to the positions selected by max pooling;
% all other positions remain zero.
output = zeros(d,a*b*S_k);
output(pool_idx) = reshape(input,[],1);

Listing IV.4: MATLAB implementation for the index of zero-padding

function [pad_idx] = padding_idx(param,m)

if m == 0
    a = param.wdimages0;
    b = param.htimages0;
else
    a = param.wdimages_pool(m);
    b = param.htimages_pool(m);
end

p = (param.wdfilters(m+1)-1)/2;
newa = a+2*p; newb = b+2*p;
pad_idx = repmat(p+(1:a)', b,1) + repeat_elements(newa*(p+(0:b-1)'), a);

Listing IV.5: MATLAB implementation for zero-padding

function output = padding(param,Z,m,pad_idx_one)

a = param.wdimages_pad(m);
b = param.htimages_pad(m);
d = param.chimages(m);
S_k = param.num_sampled_data;

idx = reshape(bsxfun(@plus, pad_idx_one, [0:S_k-1]*a*b), [], 1);
output = zeros(d,a*b*S_k);
output(:,idx) = Z;
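As a small hedged check of the padding indices (toy sizes; MATLAB's repelem stands in for the package's repeat_elements helper), a 3 × 3 single-channel image is placed inside a 5 × 5 zero frame with p = 1:

% Toy check of padding_idx / padding for one channel.
a = 3; b = 3; p = 1; newa = a + 2*p; newb = b + 2*p;
pad_idx = repmat(p+(1:a)', b, 1) + repelem(newa*(p+(0:b-1)'), a, 1);
Z = reshape(1:a*b, 1, a*b);      % one channel, pixels in column-major order
output = zeros(1, newa*newb);    % padded image, one row per channel
output(:, pad_idx) = Z;          % interior filled; the zero border remains
reshape(output, newa, newb)      % displays the original 3 x 3 image framed by zeros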

Listing IV.6: MATLAB implementation to evaluate v^T P_φ^{m-1}

function vTP = vTP(param, model, S_k, m, V)
% V: a matrix with #cols = S_k

a_prev = param.htimages(m-1);
b_prev = param.wdimages(m-1);
d_prev = param.chimages(m-1);

idx = reshape(bsxfun(@plus, model(m).idx_phiZ(:), [0:S_k-1]*d_prev*a_prev*b_prev), [], 1);
vTP = reshape(accumarray(idx, V(:), [d_prev*a_prev*b_prev*S_k 1]), [], S_k)';

Listing IV.7: MATLAB implementation for Jv

function Jv = Jv(param, model, v_in, subset_idx)

n = param.n;
nL = param.nL;
L = param.L;
S_k = param.num_sampled_data;
Jv = zeros(nL*S_k, 1);

for m = param.L : -1 : param.LC+1
    n_m = param.neurons(m+1);
    v = reshape(v_in(n(m)+1:n(m+1)), n_m, []);
    p = v * [model(m-1).Z; ones(1, S_k)];
    if m < L
        p = p';
        p = repeat_elements(p, nL);
        p = sum(model(m).dZLdS_T.*p, 2);
    else
        p = p(:);
    end
    Jv = Jv + p;
end

for m = param.LC : -1 : 1
    a = param.wdimages(m);
    b = param.htimages(m);
    d = param.chimages(m);
    v = reshape(v_in(n(m)+1:n(m+1)), d, []);
    phiZ = [cal_phiZ(param, model, subset_idx, m); ones(1, a*b*S_k)];
    p = reshape(v * phiZ, [], S_k);
    p = p';
    p = repeat_elements(p, nL);
    p = sum(model(m).dZLdS_T.*p, 2);
    Jv = Jv + p;
end

Listing IV.8: MATLAB implementation for J^T q

nL = param.nL;
S_K = param.sample_inst;
idx = param.batch_idx_current;
lambda = param.lambda;
C = param.C;
q = model(1).Jv;
for m = param.LC : -1 : 1
    ZsT = model(m).ZsT;
    Z = model(m-1).Z;
    d = param.chimages(m);
    v = cgparam(m).p;
    r = arrayfun(@(i) ZsT(1+(i-1)*nL:i*nL,:)' * q(1+(i-1)*nL:i*nL), [1:S_K], 'un', 0);
    r = horzcat(r{:});
    r = reshape(r,d,[]);
    if m > 1
        Z_pad = padding(param,Z,m-1,model(m-1).pad_idx);
    end
    phiZ = cal_phiZ(param,model,idx,m,Z_pad);
    r = reshape(r*[phiZ' ones(size(phiZ,2),1)],[],1);
    model(m).Gv = (lambda + 1/C)*v + r/S_K;
end

CHAPTER V

Analysis of Newton Methods for CNN

In this chapter, based on the implementation details in Chapter IV, we analyze the memory and computational cost per iteration. We consider the situation where all training instances are used. If the subsampled Hessian in Chapter III is considered, then in the Jacobian calculation and the Gauss-Newton matrix-vector products, the number of instances l should be replaced by the subset size |S|.

In this discussion we exclude the padding operation and the pooling layer because, first, they are optional steps and, second, they are not the bottleneck.

5.1 Memory Requirement

(1) Weight matrix and bias vector: For every layer, we must store

W^m and b^m, m = 1, . . . , L.

From (2.10) and (2.20), the memory usage is

O\Bigl( \sum_{m=1}^{L^c} d^m \times (h^m h^m d^{m-1} + 1) + \sum_{m=L^c+1}^{L} n_m \times (n_{m-1} + 1) \Bigr).
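As a concrete instance of this count (using the first convolutional layer of model-3-layers in Table 6.2, so h^1 = 5, d^0 = 3, d^1 = 32), that layer alone stores 32 × (5 · 5 · 3 + 1) = 2,432 weight and bias values.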

(2) To construct φ(Z^{m-1,i}), in Chapter 4.1, we store each position's corresponding linear index in Z^{m-1,i}. The memory usage is

O\Bigl( \sum_{m=1}^{L^c} h^m h^m d^{m-1} a^m b^m \Bigr).

(3) Function evaluation: From Chapter II, we store

Z^{m,i}, \quad m = 0, \ldots, L, \ i = 1, \ldots, l.

Therefore, the memory usage is

O\Bigl( l \times \Bigl( \sum_{m=0}^{L^c} d^m a^m b^m + \sum_{m=L^c+1}^{L} n_m \Bigr) \Bigr).

(4) Gradient evaluation: From Chapter 3.2, because

\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m-1,i})^T}, \quad m = 2, \ldots, L, \ \forall i,

is only used in the backward process, we just store this matrix for two adjacent layers. Therefore, the memory usage is

O\Bigl( l \times \sum_{m' \in \{m, m+1\}} d^{m'} a^{m'} b^{m'} \Bigr), \quad 1 \le m < L^c,

in convolutional layers, or

O\Bigl( l \times \sum_{m' \in \{m, m+1\}} n_{m'} \Bigr), \quad L^c < m < L,

in fully-connected layers. In addition, the following matrices must be stored:

\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T} \text{ and } \frac{\partial \xi_i}{\partial (b^m)^T}, \quad m = 1, \ldots, L, \ \forall i.

Therefore, the memory usage is

O\Bigl( l \times \Bigl( \sum_{m=1}^{L^c} d^m (h^m h^m d^{m-1} + 1) + \sum_{m=L^c+1}^{L} n_m (n_{m-1} + 1) \Bigr) \Bigr).

(5) Jacobian evaluation and Gauss-Newton matrix-vector products: Besides W^m and Z^{m-1,i}, from (3.36), (3.47), and (3.50), we explicitly store

\frac{\partial z^{L,i}}{\partial \mathrm{vec}(S^{m,i})^T}, \quad m = 1, \ldots, L, \ \forall i.

Thus, the memory usage is

O\Bigl( l \times n_L \times \Bigl( \sum_{m=1}^{L^c} d^m a^m b^m + \sum_{m=L^c+1}^{L} n_m \Bigr) \Bigr).
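To see why this term motivates subsampling, consider an illustrative (not thesis-reported) setting with n_L = 10 classes, l = 60,000 training images, and a convolutional layer whose output has d^m a^m b^m ≈ 25,000 entries: storing ∂z^{L,i}/∂vec(S^{m,i})^T for that single layer already takes about 60,000 × 10 × 25,000 = 1.5 × 10^{10} values.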

5.2 Computational Cost

To avoid clutter, we show the computational cost for the mth convolutional/fully-connected layer.

(1) Function evaluation:

• Convolutional layers: From (2.8), (2.11), and (2.12), the computational cost is

O(l \times h^m h^m d^{m-1} d^m a^m b^m).

A dimension-counting sketch of this dominant matrix product is given after this list.

• Fully-connected layers: From (2.21) and (2.22), the computational cost is

O(l \times n_m n_{m-1}).

(2) Gradient evaluation:

• Convolutional layers: From (3.22) and (3.23), the computational cost is

O(l \times h^m h^m d^{m-1} d^m a^m b^m).

From (3.25) and (3.26), the computational cost is

O(l \times a^{m-1} b^{m-1} d^{m-1} d^m a^m b^m).

Therefore, the total computational cost for the gradient evaluation is

O(l \times a^{m-1} b^{m-1} d^{m-1} d^m a^m b^m).

• Fully-connected layers: For (3.27) and (3.28), the computational cost is

O(l \times n_m n_{m-1}).

For (3.29) and (3.30), the computational cost is similar. Therefore, the total computational cost is

O(l \times n_m n_{m-1}).

(3) Jacobian evaluation:

• Convolutional layers: From (3.35), the computational cost is

O(l \times n_L \times d^m a^m b^m (h^m h^m d^{m-1} + 1)).

From (3.36), the computational cost is

O(l \times n_L \times (d^m a^m b^m h^m h^m d^{m-1} + h^m h^m a^{m-1} b^{m-1} d^{m-1})).

The computational cost of (3.37) can be omitted. Therefore, the total computational cost is

O(l \times n_L \times d^m a^m b^m h^m h^m d^{m-1}).

• Fully-connected layers: From (3.40) and (3.42), the computational cost is

O(l \times n_L \times n_m n_{m-1}).

(4) Gauss-Newton matrix-vector products:

• Convolutional layers: From (3.47) and (3.50), the computational cost is

O(l \times (d^m h^m h^m d^{m-1} a^m b^m + n_L d^m a^m b^m)).

• Fully-connected layers: From (3.51) and (3.52), the computational cost is

O(l \times (n_m n_{m-1} + n_L n_m)).
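The following is a hedged sketch (toy sizes and generic variable names, not taken from the package) of the matrix product behind the convolutional-layer count in item (1): for one instance, a d^m × (h^m h^m d^{m-1}) weight matrix multiplies an (h^m h^m d^{m-1}) × (a^m b^m) matrix, and repeating this for l instances gives O(l × h^m h^m d^{m-1} d^m a^m b^m).

% Hedged sketch: per-instance function evaluation of one convolutional layer.
h = 3; d_prev = 32; d = 64; a_out = 14; b_out = 14;    % assumed layer sizes
W    = randn(d, h*h*d_prev);            % weight matrix W^m
b    = randn(d, 1);                     % bias vector b^m
phiZ = randn(h*h*d_prev, a_out*b_out);  % phi(Z^{m-1,i}) for one instance
S = W*phiZ + b*ones(1, a_out*b_out);    % about d*(h*h*d_prev)*(a_out*b_out) multiplications
Z = max(S, 0);                          % activation (ReLU assumed only for illustration)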

CHAPTER VI

Experiments

We choose the following image data sets for experiments. All the data sets are publicly available¹ and the summary is in Table 6.1.

• MNIST: This data set, containing hand-written digits, is a widely used benchmark for data classification (LeCun et al., 1998).

• SVHN: This data set consists of the colored images of house numbers (Netzer et al., 2011).

• CIFAR10: This data set is a famous colored image classification benchmark (Krizhevsky and Hinton, 2009).

• smallNORB: This data set is built for 3D object recognition (LeCun et al., 2004).

The original dimension is 96 × 96 × 2 because two 96 × 96 grayscale images of every object are taken from different angles. These two images are then placed in two channels. For dimensionality reduction, we downsample each channel of every object with max pooling (h = 3, s = 3) to the dimension 32 × 32.

All the data sets were pre-processed by the following procedure.

¹ All data sets used can be found at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Table 6.1: Summary of the data sets, where a^0 × b^0 × d^0 represents the (height, width, channel) of the input image, l is the number of training data, l_t is the number of test data, and n_L is the number of classes.

Data set    a^0 × b^0 × d^0   l        l_t      n_L
MNIST       28 × 28 × 1       60,000   10,000   10
SVHN        32 × 32 × 3       73,257   26,032   10
CIFAR10     32 × 32 × 3       50,000   10,000   10
smallNORB   32 × 32 × 2       24,300   24,300   5

(1) Min-max normalization. That is, for every image Z^{0,i}, we have

Z^{0,i} \leftarrow \frac{Z^{0,i} - \min}{\max - \min},

where max/min is the maximum/minimum value in Z^{0,i}.

(2) Zero-centering. This is commonly applied before training CNN (Krizhevsky et al., 2012; Zeiler and Fergus, 2014). That is, for every image Z^{0,i}, we have

Z^{0,i} \leftarrow Z^{0,i} - \mathrm{mean}(Z^{0,i}),

where mean(Z^{0,i}) is the mean value in Z^{0,i}.
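A minimal sketch of these two steps for a single image follows; the variable names and the use of imread are illustrative assumptions, not the package's code.

% Hedged sketch of the pre-processing applied to every image Z^{0,i}.
Z0 = double(imread('example.png'));   % hypothetical input image
% (1) Min-max normalization.
Z0 = (Z0 - min(Z0(:))) / (max(Z0(:)) - min(Z0(:)));
% (2) Zero-centering.
Z0 = Z0 - mean(Z0(:));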

We consider two simple CNN structures shown in Table 6.2. The parameters used in our algorithm are given as follows. For the initialization, we follow He et al. (2015) to randomly set the weight values from the N(0, 1) distribution and multiply them by

\sqrt{\frac{2}{n_{\min}}},

where

n_{\min} = \begin{cases} d^{m-1} \times a^{m-1} \times b^{m-1} & \text{if } m \le L^c, \\ n_{m-1} & \text{otherwise.} \end{cases}
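A minimal sketch of this initialization for one layer is given below; variable names such as d_prev, a_prev, b_prev, h_m, d_m, n_m, and n_prev are illustrative, the orientation of W is not fixed by the text, and the zero bias initialization is an assumption.

% Hedged sketch of the He et al. (2015) initialization for layer m.
if m <= Lc
    n_min = d_prev * a_prev * b_prev;                % d^{m-1} * a^{m-1} * b^{m-1}
    W = randn(h_m*h_m*d_prev, d_m) * sqrt(2/n_min);  % convolutional layer W^m
else
    n_min = n_prev;                                  % n_{m-1}
    W = randn(n_prev, n_m) * sqrt(2/n_min);          % fully-connected layer W^m
end
b = zeros(size(W,2), 1);                             % bias (assumed zero; not stated in the text)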

For a CG procedure, we terminate it when the following relative stopping condition is satisfied or when the number of CG iterations reaches a maximal number (denoted as CG_max):

\| (G + \lambda I) d + \nabla f(\theta) \| \le \sigma \| \nabla f(\theta) \|, \qquad (6.1)

where σ = 0.1 and CG_max = 250. For the implementation of the Levenberg-Marquardt method, we set the initial λ_1 = 1, and the (drop, boost, ρ_upper, ρ_lower) constants in (3.10) are (2/3, 3/2, 0.75, 0.1). In addition, the sampling rate for the Gauss-Newton matrix is set to 1% and the value of C in (2.26) is set to 0.01l.

6.1 Comparison Between Newton and Stochastic Gradient Methods

In this chapter, the goal is to compare SG methods with the proposed subsampled Newton method for CNN. For SG methods, we consider mini-batch SG with momentum. We use the Python deep learning library Keras (Chollet et al., 2015) to implement it. To have a fair comparison between SG and subsampled Newton methods, the following conditions are fixed.

• Initial points.

• Network structures.

• Objective function.

• Regularization parameter.

The training mini-batch size is 128 for all SG experiments. The initial learning rate is selected from {0.003, 0.001, 0.0003, 0.0001} by five-fold cross validation.²

² We split the training data by stratified sampling for the cross validation.

Table 6.2: Structure of convolutional neural networks. "conv" indicates a convolutional layer, "pool" indicates a pooling layer, and "full" indicates a fully-connected layer.

          model-3-layers                  model-5-layers
          filter size   #filters  stride  filter size   #filters  stride
conv 1    5 × 5 × 3     32        1       5 × 5 × 3     32        1
pool 1    2 × 2         -         2       2 × 2         -         2
conv 2    3 × 3 × 32    64        1       3 × 3 × 32    32        1
pool 2    2 × 2         -         2       -             -         -
conv 3    3 × 3 × 32    64        1       3 × 3 × 32    64        1
pool 3    2 × 2         -         2       2 × 2         -         2
conv 4    -             -         -       3 × 3 × 64    64        1
pool 4    -             -         -       -             -         -
conv 5    -             -         -       3 × 3 × 64    128       1
pool 5    -             -         -       2 × 2         -         2

When conducting the cross validation and the training process, the learning rate is adapted to the Keras framework's default scheduling with a decay factor of 10^{-6}, and the momentum coefficient is 0.9.

From the results shown in Table 6.3, we can see that

Table 6.3: Test accuracy for the Newton method and SG. For the Newton method, we trained for 250 iterations; for SG, we trained for 1000 epochs.

            model-3-layers               model-5-layers
            Newton            SG         Newton    SG
MNIST       (99.15, 99.25)%   99.15%     99.46%    99.35%
SVHN        (92.91, 92.99)%   93.21%     93.49%    94.60%
CIFAR10     (77.85, 79.41)%   79.27%     76.7%     79.47%
smallNORB   (98.14, 98.16)%   98.09%     97.68%    98.00%

CHAPTER VII

Conclusions

In this study, we establish all the building blocks of Newton methods for CNN. A simple and elegant MATLAB implementation is developed for public use. Based on our results, it is possible to develop novel techniques to further enhance Newton methods for CNN.

APPENDICES

APPENDIX A. List of Symbols

Notation Description

y^i The label vector of the ith training instance.

Z^{0,i} The input image of the ith training instance.

l The number of training instances.

K The number of classes.

θ The model vector (weights and biases) of the neural network.

ξ The loss function.

ξ_i The training loss of the ith instance.

f The objective function.

C The regularization parameter.

L The number of layers of the neural network.

L^c The number of convolutional layers of the neural network.

L^f The number of fully-connected layers of the neural network.

n_m The number of neurons in the mth layer (L^c < m ≤ L).

n The total number of weights and biases.

a^m The height of the data at the mth layer (0 ≤ m ≤ L^c).

b^m The width of the data at the mth layer (0 ≤ m ≤ L^c).

d^m The depth (or the number of channels) of the data at the mth layer (0 ≤ m ≤ L^c).

h^m The height (width) of the filters at the mth layer.

W^m The weight matrix in the mth layer.

b^m The bias vector in the mth layer.


S^{m,i} The output matrix of the function (W^m)^T φ(Z^{m-1,i}) + b^m 1_{a^m b^m}^T in the mth layer for the ith instance (1 ≤ m ≤ L^c).

Z^{m,i} The output matrix (element-wise application of the activation function on S^{m,i}) in the mth layer for the ith instance (1 ≤ m ≤ L^c).

s^{m,i} The output vector of the function (W^m)^T z^{m-1,i} + b^m in the mth layer for the ith instance (L^c < m ≤ L).

z^{m,i} The output vector (element-wise application of the activation function on s^{m,i}) in the mth layer for the ith instance (L^c < m ≤ L).

σ The activation function.

J^i The Jacobian matrix of z^{L,i} with respect to θ.

I An identity matrix.

α_k A step size at the kth iteration.

ρ_k The ratio between the actual function reduction and the predicted reduction at the kth iteration.

λ_k A parameter in the Levenberg-Marquardt method.

N(μ, σ²) A Gaussian distribution with mean μ and variance σ².

APPENDIX B. Alternative Method for the Generation of φ(Z^{m-1,i})

For the alternative method here, we use MATLAB's im2col with s^m = 1 and extract a sub-matrix as φ(Z^{m-1,i}).

We now explain each line of the program. To find P_φ^{m-1}, from (2.9) what we need is to extract elements in Z^{m-1,i}. Some elements may be extracted multiple times. For the extraction it is more convenient to work on the linear indices of elements in Z^{m-1,i}. Following MATLAB's setting, for an a × b matrix, the linear indices are

[ 1, . . . , a, a + 1, . . . , ab ] ,

where elements are mapped to the above indices in a column-wise setting. In line 2, we start with obtaining the linear indices of the first row of Z^{m-1,i}, which corresponds to the first channel of the image. In line 3, we use im2col to build φ(Z^{m-1,i}) under s^m = d^{m-1} = 1, though the contents of the input matrix are linear indices of Z^{m-1,i} rather than values. For φ(Z^{m-1,i}) under s^m = d^{m-1} = 1, the matrix size is

h^m h^m \times \bar{a}^m \bar{b}^m,

where, from (2.4),

\bar{a}^m = a^{m-1} - h^m + 1, \quad \bar{b}^m = b^{m-1} - h^m + 1.

From (2.9), when a general s^m is considered, we must select some columns, whose column indices are the subset (B.1) of {1, . . . , \bar{a}^m \bar{b}^m}, where a^m and b^m are defined in (2.4). More precisely, (B.1) comes from the mapping between the first row of φ(Z^{m-1,i}) in (2.9) and {1, . . . , \bar{a}^m \bar{b}^m}.

Next we discuss how to extend the linear indices of the first channel to the other channels. From (2.5), each column of Z^{m-1,i} contains values of the same pixel in different channels. Therefore, because we consider a column-major order, indices in Z^{m-1,i} for a given pixel are a continuous segment. Then in (2.9) for φ(Z^{m-1,i}), essentially we have d^{m-1} segments ordered vertically, and elements in two consecutive segments are from two consecutive rows in Z^{m-1,i}. Therefore, the index matrix (B.2), whose blocks contain the linear indices of Z^{m-1,i} for the first channel of φ(Z^{m-1,i}) shifted channel by channel, can be used to extract all needed elements in Z^{m-1,i} for φ(Z^{m-1,i}). The implementation is in line 10, and we use a property of MATLAB (implicit expansion) to add a matrix and a vector; thus the ⊗ operation in the second term of (B.2) is not needed.

Listing B.1: MATLAB implementation for P_φ^{m-1}
1  function indicator = indicator_im2col(a,b,d,h,s)
2  input_idx = reshape(([1:a*b]-1)*d+1,a,b);
3  output_idx = im2col(input_idx,[h,h],'sliding');
4  a_bar = a - h + 1;
5  b_bar = b - h + 1;
6  a_idx = 1:s:a_bar;
7  b_idx = 1:s:b_bar;
8  select_idx = repelem(a_idx,1,length(a_idx)) + a_bar*repmat(b_idx-1,1,length(b_idx));
9  output_idx = output_idx(:,select_idx);
10 output_idx = repmat(output_idx,d,1) + repelem([0:d-1]',h*h,1);
11 end

BIBLIOGRAPHY

A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 557–565, 2017.

R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

F. Chollet et al. Keras. https://keras.io, 2015.

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.

X. He, D. Mudigere, M. Smelyanskiy, and M. Takáč. Large scale distributed Hessian-free optimization for deep neural network, 2016. arXiv preprint arXiv:1606.00511.

R. Kiros. Training neural networks with stochastic Hessian-free optimization, 2013. arXiv preprint arXiv:1301.3641.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 97–104, 2004.

K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.

D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014. arXiv preprint arXiv:1409.1556.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

O. Vinyals and D. Povey. Krylov subspace descent for deep learning. In Proceedings of Artificial Intelligence and Statistics, pages 1261–1268, 2012.

C.-C. Wang, C.-H. Huang, and C.-J. Lin. Subsampled Hessian Newton methods for supervised learning. Neural Computation, 27:1766–1795, 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/sub_hessian/sample_hessian.pdf.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30:1673–1724, 2018a. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.

C.-C. Wang, K.-L. Tan, and C.-J. Lin. Newton methods for convolutional neural networks. Technical report, National Taiwan University, 2018b.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pages 818–833, 2014.
