

VI. Experiments

6.1 Comparison Between Newton and Stochastic Gradient Methods

In this chapter, the goal is to compare SG methods with the proposed subsampled Newton method for CNN. For SG methods, we consider mini-batch SG with momentum. We use the Python deep learning library Keras (Chollet et al., 2015) to implement it. To have a fair comparison between SG and subsampled Newton methods, the following conditions are fixed:

• Initial points.

• Network structures.

• Objective function.

• Regularization parameter.

The training mini-batch size is 128 for all SG experiments. The initial learning rate is selected from {0.003, 0.001, 0.0003, 0.0001} by five-fold cross validation, where the training data are split by stratified sampling.

Table 6.2: Structure of convolutional neural networks. “conv” indicates a convolutional layer, “pool” indicates a pooling layer, and “full” indicates a fully-connected layer.

         | model-3-layers                 | model-5-layers
         | filter size   #filters  stride | filter size   #filters  stride
conv 1   | 5 × 5 × 3     32        1      | 5 × 5 × 3     32        1
pool 1   | 2 × 2         -         2      | 2 × 2         -         2
conv 2   | 3 × 3 × 32    64        1      | 3 × 3 × 32    32        1
pool 2   | 2 × 2         -         2      | -             -         -
conv 3   | 3 × 3 × 32    64        1      | 3 × 3 × 32    64        1
pool 3   | 2 × 2         -         2      | 2 × 2         -         2
conv 4   | -             -         -      | 3 × 3 × 64    64        1
pool 4   | -             -         -      | -             -         -
conv 5   | -             -         -      | 3 × 3 × 64    128       1
pool 5   | -             -         -      | 2 × 2         -         2

When conducting the cross validation and the training process, the learning rate follows the Keras framework's default schedule with a decay factor of $10^{-6}$, and the momentum coefficient is 0.9.
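For reference, the update rule implied by this setting can be sketched as follows; this assumes Keras's standard SGD implementation with momentum and time-based learning-rate decay, and the symbols $\eta_0$, $\eta_t$, $\gamma$, $v_t$, and $g_t$ are introduced here only for illustration:

$$\eta_t = \frac{\eta_0}{1 + \gamma t}, \qquad v_{t+1} = 0.9\, v_t - \eta_t\, g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

where $\eta_0$ is the initial learning rate selected by cross validation, $\gamma = 10^{-6}$ is the decay factor, $t$ counts mini-batch updates, and $g_t$ is the mini-batch gradient of $f$.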

From the results shown in Table 6.3, we can see that the test accuracy of the subsampled Newton method is generally comparable to that of SG on the four data sets.

Table 6.3: Test accuracy of the Newton method and SG. For the Newton method, we trained for 250 iterations; for SG, we trained for 1000 epochs.

          | model-3-layers              | model-5-layers
          | Newton            SG        | Newton    SG
MNIST     | (99.15, 99.25)%   99.15%    | 99.46%    99.35%
SVHN      | (92.91, 92.99)%   93.21%    | 93.49%    94.60%
CIFAR10   | (77.85, 79.41)%   79.27%    | 76.70%    79.47%
smallNORB | (98.14, 98.16)%   98.09%    | 97.68%    98.00%

CHAPTER VII

Conclusions

In this study, we establish all the building blocks of Newton methods for CNN. A simple and elegant MATLAB implementation is developed for public use. Based on our results, it is possible to develop novel techniques to further enhance Newton methods for CNN.

APPENDICES

APPENDIX A. List of Symbols

Notation Description

$y^i$ The label vector of the ith training instance.
$Z^{0,i}$ The input image of the ith training instance.
$l$ The number of training instances.
$K$ The number of classes.
$\theta$ The model vector (weights and biases) of the neural network.
$\xi$ The loss function.
$\xi_i$ The training loss of the ith instance.
$f$ The objective function.
$C$ The regularization parameter.
$L$ The number of layers of the neural network.
$L^c$ The number of convolutional layers of the neural network.
$L^f$ The number of fully-connected layers of the neural network.
$n_m$ The number of neurons in the mth layer ($L^c < m \le L$).
$n$ The total number of weights and biases.
$a^m$ The height of the data at the mth layer ($0 \le m \le L^c$).
$b^m$ The width of the data at the mth layer ($0 \le m \le L^c$).
$d^m$ The depth (or the number of channels) of the data at the mth layer ($0 \le m \le L^c$).
$h^m$ The height (width) of the filters at the mth layer.
$W^m$ The weight matrix in the mth layer.
$b^m$ The bias vector in the mth layer.
$S^{m,i}$ The output matrix of the function $(W^m)^T \phi(Z^{m-1,i}) + b^m 1_{a^m b^m}^T$ in the mth layer for the ith instance ($1 \le m \le L^c$).
$Z^{m,i}$ The output matrix (element-wise application of the activation function on $S^{m,i}$) in the mth layer for the ith instance ($1 \le m \le L^c$).
$s^{m,i}$ The output vector of the function $(W^m)^T z^{m-1,i} + b^m$ in the mth layer for the ith instance ($L^c < m \le L$).
$z^{m,i}$ The output vector (element-wise application of the activation function on $s^{m,i}$) in the mth layer for the ith instance ($L^c < m \le L$).
$\sigma$ The activation function.
$J^i$ The Jacobian matrix of $z^{L,i}$ with respect to $\theta$.
$I$ An identity matrix.
$\alpha_k$ The step size at the kth iteration.
$\rho_k$ The ratio between the actual function reduction and the predicted reduction at the kth iteration.
$\lambda_k$ A parameter in the Levenberg-Marquardt method.
$\mathcal{N}(\mu, \sigma^2)$ A Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
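For reference, the following sketch indicates one common way these symbols fit together; the exact form is an assumption modeled on the subsampled Newton literature rather than a quotation from this thesis:

$$f(\theta) = \frac{1}{2C}\,\theta^{T}\theta + \frac{1}{l}\sum_{i=1}^{l} \xi_i, \qquad \xi_i = \xi\!\left(z^{L,i};\, y^i\right),$$

so that $C$ balances the regularization term against the averaged training loss over the $l$ instances.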

APPENDIX B. Alternative Method for the Generation of $\phi(Z^{m-1,i})$

For the alternative method here, we use MATLAB's im2col with $s^m = 1$ and then extract a sub-matrix as $\phi(Z^{m-1,i})$.

We now explain each line of the program in Listing B.1. To find $P_\phi^{m-1}$, from (2.9) what we need is to extract elements of $Z^{m-1,i}$. Some elements may be extracted multiple times. For the extraction, it is more convenient to work on the linear indices of the elements of $Z^{m-1,i}$. Following MATLAB's setting, for an $a \times b$ matrix, the linear indices are

$$\left[\, 1, \ldots, a,\ a+1, \ldots, ab \,\right],$$

where elements are mapped to the above indices in a column-wise setting. In line 2, we start by obtaining the linear indices of the first row of $Z^{m-1,i}$, which corresponds to the first channel of the image. In line 3, we use im2col to build $\phi(Z^{m-1,i})$ under $s^m = d^{m-1} = 1$, though the contents of the input matrix are linear indices of $Z^{m-1,i}$ rather than values. For $\phi(Z^{m-1,i})$ under $s^m = d^{m-1} = 1$, the matrix size is

$$h^m h^m \times \bar{a}^m \bar{b}^m,$$

where from (2.4),

$$\bar{a}^m = a^{m-1} - h^m + 1, \qquad \bar{b}^m = b^{m-1} - h^m + 1.$$
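To make lines 2–3 concrete, consider a minimal hypothetical example with a single channel ($d^{m-1} = 1$), a $3 \times 3$ input ($a^{m-1} = b^{m-1} = 3$), and $2 \times 2$ filters ($h^m = 2$); the values below follow MATLAB's documented behavior of im2col with the 'sliding' option:

% With d = 1, line 2 of Listing B.1 reduces to reshape(1:9,3,3).
input_idx = reshape(1:9, 3, 3)
% input_idx = [ 1 4 7
%               2 5 8
%               3 6 9 ]
output_idx = im2col(input_idx, [2,2], 'sliding')
% output_idx = [ 1 2 4 5
%                2 3 5 6
%                4 5 7 8
%                5 6 8 9 ]
% Each column collects the linear indices of one 2 x 2 block, so the
% size is (h^m h^m) x (a_bar^m b_bar^m) = 4 x 4, as stated above.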

From (2.9), when a general $s^m$ is considered, we must select some columns, whose column indices are the following subset of $\{1, \ldots, \bar{a}^m\bar{b}^m\}$:

$$\left\{\, (i-1)s^m + 1 + (j-1)s^m\bar{a}^m \;:\; i = 1, \ldots, a^m,\ j = 1, \ldots, b^m \,\right\}, \tag{B.1}$$

where $a^m$ and $b^m$ are defined in (2.4). More precisely, (B.1) comes from the mapping between the first row of $\phi(Z^{m-1,i})$ in (2.9) and $\{1, \ldots, \bar{a}^m\bar{b}^m\}$: the column of $\phi(Z^{m-1,i})$ for the output position $(i, j)$ corresponds to the $\big((i-1)s^m + 1 + (j-1)s^m\bar{a}^m\big)$th of the $\bar{a}^m\bar{b}^m$ sliding positions generated under $s^m = 1$.

Next we discuss how to extend the linear indices of the first channel to the other channels. From (2.5), each column of $Z^{m-1,i}$ contains the values of the same pixel in different channels. Therefore, because we consider a column-major order, the indices in $Z^{m-1,i}$ for a given pixel form a continuous segment. Then in (2.9), $\phi(Z^{m-1,i})$ essentially consists of $d^{m-1}$ segments ordered vertically, and elements in two consecutive segments come from two consecutive rows of $Z^{m-1,i}$. Therefore, the index matrix (B.2), which vertically stacks $d^{m-1}$ copies of the linear indices of $Z^{m-1,i}$ for the 1st channel of $\phi(Z^{m-1,i})$ and adds the channel offsets $0, 1, \ldots, d^{m-1}-1$ (each repeated $h^m h^m$ times) to the copies, can be used to extract all needed elements in $Z^{m-1,i}$ for $\phi(Z^{m-1,i})$. The implementation is in line 10, where we use a property of MATLAB to add a matrix and a vector, so the $\otimes$ operation in the second term of (B.2) is not needed.

Listing B.1: MATLAB implementation for $P_\phi^{m-1}$

1  function output_idx = indicator_im2col(a,b,d,h,s)
2  input_idx = reshape(([1:a*b]-1)*d+1,a,b);
3  output_idx = im2col(input_idx,[h,h],'sliding');
4  a_bar = a - h + 1;
5  b_bar = b - h + 1;
6  a_idx = 1:s:a_bar;
7  b_idx = 1:s:b_bar;
8  select_idx = repelem(a_idx,1,length(a_idx)) + a_bar*repmat(b_idx-1,1,length(b_idx));
9  output_idx = output_idx(:,select_idx);
10 output_idx = repmat(output_idx,d,1) + repelem([0:d-1]',h*h,1);
11 end
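As a usage illustration, the returned index matrix generates $\phi(Z^{m-1,i})$ with a single linear-indexing operation. The sketch below assumes $Z^{m-1,i}$ is stored as a $d^{m-1} \times a^{m-1}b^{m-1}$ matrix as in (2.5); the concrete sizes and the variable name Z are hypothetical:

% Hypothetical sizes: a 4 x 4 image with 3 channels, 2 x 2 filters, stride 2.
a = 4; b = 4; d = 3; h = 2; s = 2;
Z = rand(d, a*b);                       % Z^{m-1,i} stored as in (2.5)
idx = indicator_im2col(a, b, d, h, s);  % (h*h*d) x (a^m b^m) matrix of linear indices
phiZ = Z(idx);                          % phi(Z^{m-1,i}); the result has the same shape as idx

Because MATLAB linear indexing with a matrix of indices returns a result of the same shape, no explicit loop over channels or filter positions is needed.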

BIBLIOGRAPHY

A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of International Conference on Machine Learning (ICML), pages 557–565, 2017.

R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

F. Chollet et al. Keras. https://keras.io, 2015.

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.

X. He, D. Mudigere, M. Smelyanskiy, and M. Takáč. Large scale distributed Hessian-free optimization for deep neural network, 2016. arXiv preprint arXiv:1606.00511.

R. Kiros. Training neural networks with stochastic Hessian-free optimization, 2013. arXiv preprint arXiv:1301.3641.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 97–104, 2004.

K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.

D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition, 2014. arXiv preprint arXiv:1409.1556.

A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 689–692, 2015.

O. Vinyals and D. Povey. Krylov subspace descent for deep learning. In Proceedings of Artificial Intelligence and Statistics, pages 1261–1268, 2012.

C.-C. Wang, C.-H. Huang, and C.-J. Lin. Subsampled Hessian Newton methods for supervised learning. Neural Computation, 27:1766–1795, 2015. URL http://www.csie.ntu.edu.tw/~cjlin/papers/sub_hessian/sample_hessian.pdf.

C.-C. Wang, K.-L. Tan, C.-T. Chen, Y.-H. Lin, S. S. Keerthi, D. Mahajan, S. Sundararajan, and C.-J. Lin. Distributed Newton methods for deep learning. Neural Computation, 30:1673–1724, 2018a. URL http://www.csie.ntu.edu.tw/~cjlin/papers/dnn/dsh.pdf.

C.-C. Wang, K.-L. Tan, and C.-J. Lin. Newton methods for convolutional neural networks. Technical report, National Taiwan University, 2018b.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proceedings of European Conference on Computer Vision, pages 818–833, 2014.
