

2.3 The Overall Optimization Problem

At the last layer, the output $z^{L,i}$, $\forall i$, is obtained. We can check how close it is to the label vector $y^i$ and consider the following squared loss in this work:

$$\xi(z^{L,i}; y^i) = \|z^{L,i} - y^i\|^2. \qquad (2.24)$$

Furthermore, we can collect all model parameters, such as the filters of the convolutional layers in (2.10) and the weights/biases of the fully-connected layers in (2.20), into a long vector $\theta \in R^n$, where from the discussion in this chapter the total number of variables is

$$n = \sum_{m=1}^{L^c} d^m \times (h^m \times h^m \times d^{m-1} + 1) + \sum_{m=L^c+1}^{L} n_m \times (n_{m-1} + 1).$$

The output $z^{L,i}$ of the last layer is a function of $\theta$. The optimization problem to train a CNN is

$$\min_{\theta} \ f(\theta), \qquad (2.25)$$

where

$$f(\theta) = \frac{1}{2C}\theta^T\theta + \frac{1}{l}\sum_{i=1}^{l} \xi(z^{L,i}; y^i) \qquad (2.26)$$

and the first term with the parameter $C > 0$ is used to avoid overfitting.

CHAPTER III

Hessian-free Newton Methods for Training CNN

To solve an unconstrained minimization problem such as (2.25), a Newton method iteratively finds a search direction $d$ by solving the following second-order approximation:

$$\min_{d} \ \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta)\, d, \qquad (3.1)$$

where $\nabla f(\theta)$ and $\nabla^2 f(\theta)$ are the gradient vector and the Hessian matrix, respectively.

In this chapter we present details of applying a Newton method to solve the CNN problem (2.25).

3.1 Procedure of the Newton Method

For CNN, the gradient of $f(\theta)$ is

$$\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \nabla_{z^{L,i}}\, \xi(z^{L,i}; y^i), \qquad (3.2)$$

where

$$J^i = \begin{bmatrix} \dfrac{\partial z^{L,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L,i}_{n_L}}{\partial \theta_n} \end{bmatrix}_{n_L \times n}, \quad i = 1, \ldots, l, \qquad (3.3)$$

is the Jacobian of $z^{L,i}$. The Hessian matrix of $f(\theta)$ is

$$\nabla^2 f(\theta) = \frac{1}{C} I + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T B^i J^i + \frac{1}{l}\sum_{i=1}^{l}\sum_{j=1}^{n_L} \frac{\partial \xi(z^{L,i}; y^i)}{\partial z^{L,i}_j} \begin{bmatrix} \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_1 \partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L,i}_j}{\partial \theta_n \partial \theta_n} \end{bmatrix}, \qquad (3.4)$$

where $I$ is the identity matrix and

$$B^i_{ts} = \frac{\partial^2 \xi(z^{L,i}; y^i)}{\partial z^{L,i}_t\, \partial z^{L,i}_s}, \quad t = 1, \ldots, n_L,\ s = 1, \ldots, n_L. \qquad (3.5)$$

From now on, for simplicity we let

$$\xi_i \equiv \xi(z^{L,i}; y^i).$$

If $f(\theta)$ is non-convex, as in the case of deep learning, (3.1) is difficult to solve and the resulting direction may not lead to a decrease of the function value. Thus the Gauss-Newton approximation (Schraudolph, 2002)

$$G = \frac{1}{C} I + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T B^i J^i \approx \nabla^2 f(\theta) \qquad (3.6)$$

is often used. In particular, if $G$ is positive definite, then (3.1) becomes the same as solving the following linear system:

$$Gd = -\nabla f(\theta). \qquad (3.7)$$

After a Newton direction $d$ is obtained, to ensure convergence, we update $\theta$ by

$$\theta \leftarrow \theta + \alpha d,$$

where $\alpha$ is the largest element in $\{1, \frac{1}{2}, \frac{1}{4}, \ldots\}$ satisfying

$$f(\theta + \alpha d) \le f(\theta) + \eta\, \alpha\, \nabla f(\theta)^T d, \qquad (3.8)$$

and $\eta \in (0, 1)$ is a pre-defined constant. The procedure to find $\alpha$ is called a backtracking line search.
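The following is a minimal MATLAB sketch of this backtracking rule; f_fun is a hypothetical handle evaluating the objective (2.26) at a given point, and theta, d, grad, and eta are assumed to hold the current iterate, direction, gradient, and a constant in (0, 1) (e.g., 1e-4).

```matlab
% Minimal backtracking line-search sketch for condition (3.8).
alpha = 1;
fk = f_fun(theta);
while f_fun(theta + alpha*d) > fk + eta * alpha * (grad' * d)
    alpha = alpha / 2;          % move to the next element of {1, 1/2, 1/4, ...}
end
theta = theta + alpha * d;      % accepted step
```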

Past works (e.g., Martens, 2010; Wang et al., 2018a) have shown that (3.7) is sometimes too aggressive, so a direction closer to the negative gradient is better. To this end, in recent works applying Newton methods to fully-connected networks, the Levenberg-Marquardt method (Levenberg, 1944; Marquardt, 1963) is used to solve the following linear system rather than (3.7):

$$(G + \lambda I)\, d = -\nabla f(\theta), \qquad (3.9)$$

where $\lambda$ is a parameter decided by how good the function reduction is. Specifically, we define

$$\rho = \frac{f(\theta + d) - f(\theta)}{\nabla f(\theta)^T d + \frac{1}{2} d^T G d}$$

as the ratio between the actual function reduction and the predicted reduction. By using $\rho$, the parameter $\lambda_{\text{next}}$ for the next iteration is decided by

$$\lambda_{\text{next}} = \begin{cases} \lambda \times \text{drop} & \rho > \rho_{\text{upper}}, \\ \lambda & \rho_{\text{lower}} \le \rho \le \rho_{\text{upper}}, \\ \lambda \times \text{boost} & \text{otherwise}, \end{cases} \qquad (3.10)$$

where (drop, boost) are given constants. From (3.10) we can clearly see that if the function-value reduction is not satisfactory, then $\lambda$ is enlarged and the resulting direction is closer to the negative gradient.
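A small MATLAB sketch of the update rule (3.10) is given below; the threshold values mentioned in the comment are illustrative choices rather than constants taken from this work, and f_new, f_old, grad, d, and Gd (the product $Gd$) are assumed to be available from the current iteration.

```matlab
% Levenberg-Marquardt parameter update, cf. (3.10).
% Assumptions: rho_upper > rho_lower (e.g., 0.75 and 0.25), drop < 1 < boost.
rho = (f_new - f_old) / (grad' * d + 0.5 * (d' * Gd));   % actual vs. predicted reduction
if rho > rho_upper
    lambda = lambda * drop;       % good reduction: move closer to the Newton direction
elseif rho >= rho_lower
    % keep lambda unchanged
else
    lambda = lambda * boost;      % poor reduction: move closer to the negative gradient
end
```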

Next, we discuss how to solve the linear system (3.9). When the number of variables n is large, the matrix G is too large to be stored. For some optimization problems including neural networks, without explicitly storing G it is possible to calculate the product between G and any vector v (Le et al., 2011; Martens, 2010; Wang et al., 2018a). For example, from (3.6),

$$(G + \lambda I)\, v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \left( B^i (J^i v) \right). \qquad (3.11)$$

If the product between $J^i$ and a vector can be easily calculated, then $G$ does not need to be explicitly formed. Therefore, we can apply the conjugate gradient (CG) method to solve (3.7) by a sequence of matrix-vector products. This technique is called a Hessian-free method in optimization. Details of CG methods in a Hessian-free Newton framework can be found in, for example, Algorithm 2 of Lin et al. (2007).
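As a sketch of the Hessian-free idea, MATLAB's pcg accepts a function handle instead of an explicit matrix, so only the product (3.11) has to be implemented; gauss_newton_vec below is a hypothetical helper returning $\frac{1}{l}\sum_i (J^i)^T(B^i(J^i v))$, C, lambda, and grad are the current values, and the tolerance and iteration limit are illustrative.

```matlab
% Solve (G + lambda*I) d = -grad without forming G, cf. (3.11).
Av = @(v) (1/C + lambda) * v + gauss_newton_vec(v);   % only a matrix-vector product is needed
[d, flag] = pcg(Av, -grad, 1e-2, 250);                % loose tolerance: CG is solved approximately
```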

Because the computational cost of (3.11) is proportional to the number of instances, subsampled Hessian Newton methods have been proposed (Byrd et al., 2011; Martens, 2010; Wang et al., 2015) to reduce the cost of solving the linear system (3.9). They observe that the second term in (3.6) is the average training loss. If the large set of data points is assumed to come from the same distribution, (3.6) can be reasonably approximated by selecting a subset $S \subset \{1, \ldots, l\}$ and having

$$G^S = \frac{1}{C} I + \frac{1}{|S|}\sum_{i \in S} (J^i)^T B^i J^i \approx G.$$

Then (3.11) becomes

$$(G^S + \lambda I)\, v = \left(\frac{1}{C} + \lambda\right) v + \frac{1}{|S|}\sum_{i \in S} (J^i)^T \left( B^i (J^i v) \right) \approx (G + \lambda I)\, v.$$

A summary of the Newton method is in Algorithm 1.

3.2 Gradient Evaluation

In order to solve (3.7), $\nabla f(\theta)$ is needed. It can be obtained by (3.2) if the Jacobian matrices $J^i$, $i = 1, \ldots, l$, are available; see the discussion on Jacobian calculation in Chapter 3.3. However, we have mentioned in Chapter 3.1 that in practice, only a subset of the $J^i$ may be calculated. Thus here we present a direct calculation by a backward process.

Consider two layers $m - 1$ and $m$. The variables between them are $W^m$ and $b^m$, so we aim to calculate the following gradient components:

$$\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l}\sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m}, \qquad (3.12)$$

$$\frac{\partial f}{\partial b^m} = \frac{1}{C} b^m + \frac{1}{l}\sum_{i=1}^{l} \frac{\partial \xi_i}{\partial b^m}. \qquad (3.13)$$

Algorithm 1 A standard subsampled Hessian Newton method for CNN.

1: Compute $f(\theta^1)$.
2: for $k = 1, \ldots,$ do
3: Choose a set $S_k \subset \{1, \ldots, l\}$.
4: Compute $\nabla f(\theta^k)$ and $J^i$, $\forall i \in S_k$.
5: Approximately solve the linear system in (3.9) by CG to obtain a direction $d^k$.
6: $\alpha = 1$.
7: while true do
8: Update $\theta^{k+1} = \theta^k + \alpha d^k$ and compute $f(\theta^{k+1})$.
9: if (3.8) is satisfied then
10: break
11: end if
12: $\alpha \leftarrow \alpha / 2$.
13: end while
14: Calculate $\lambda_{k+1}$ based on (3.10).
15: end for

Because (3.12) is in a matrix form, following past developments such as Vedaldi and Lenc (2015), it is easier to transform it to a vector form for the derivation. To begin, we list the following properties of the $\text{vec}(\cdot)$ function, in which $\otimes$ is the Kronecker product:

$$\text{vec}(AB) = (I \otimes A)\,\text{vec}(B) \qquad (3.14)$$
$$\text{vec}(AB) = (B^T \otimes I)\,\text{vec}(A), \qquad (3.15)$$
$$\text{vec}(AB)^T = \text{vec}(B)^T (I \otimes A^T) \qquad (3.16)$$
$$\text{vec}(AB)^T = \text{vec}(A)^T (B \otimes I). \qquad (3.17)$$
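These identities can be checked numerically; the short MATLAB snippet below, with arbitrary small matrices, verifies (3.14) and (3.15).

```matlab
% Numerical check of vec(AB) = (I (x) A) vec(B) = (B^T (x) I) vec(A).
A = rand(3, 2); B = rand(2, 4);
v  = reshape(A * B, [], 1);                      % vec(AB)
v1 = kron(eye(size(B, 2)), A) * B(:);            % (I (x) A) vec(B), cf. (3.14)
v2 = kron(B', eye(size(A, 1))) * A(:);           % (B^T (x) I) vec(A), cf. (3.15)
disp([norm(v - v1), norm(v - v2)])               % both differences are numerically zero
```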

We further define

$$\frac{\partial y}{\partial (x)^T} = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_{|x|}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{|y|}}{\partial x_1} & \cdots & \dfrac{\partial y_{|y|}}{\partial x_{|x|}} \end{bmatrix},$$

where $x$ and $y$ are column vectors, and let

$$\phi(z^{m-1,i}) = I_{n_{m-1}}\, z^{m-1,i}, \quad L^c < m \le L,$$

where $I_p$ is the $p \times p$ identity matrix.

For the fully-connected layers, from (2.21), we have

$$s^{m,i} = W^m z^{m-1,i} + b^m = (I_1 \otimes W^m)\, z^{m-1,i} + (1_1 \otimes I_{n_m})\, b^m \qquad (3.18)$$
$$= \left( (z^{m-1,i})^T \otimes I_{n_m} \right) \text{vec}(W^m) + (1_1 \otimes I_{n_m})\, b^m, \qquad (3.19)$$

where (3.18) and (3.19) are from (3.14) and (3.15), respectively. For the convolutional layers, from (2.11), we have

$$\text{vec}(S^{m,i}) = \text{vec}\left( W^m \phi(Z^{m-1,i}) \right) + \text{vec}\left( b^m 1_{a^m b^m}^T \right) = (I_{a^m b^m} \otimes W^m)\, \text{vec}(\phi(Z^{m-1,i})) + (1_{a^m b^m} \otimes I_{d^m})\, b^m \qquad (3.20)$$
$$= \left( \phi(Z^{m-1,i})^T \otimes I_{d^m} \right) \text{vec}(W^m) + (1_{a^m b^m} \otimes I_{d^m})\, b^m, \qquad (3.21)$$

where (3.20) and (3.21) are from (3.14) and (3.15), respectively.

An advantage of using (3.18) and (3.20) is that they are in the same form, and so are (3.19) and (3.21). Thus we can derive the gradient together. We begin with calculating the gradient for convolutional layers. From (3.21), we derive

$$\frac{\partial \xi_i}{\partial \text{vec}(W^m)^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T}\, \frac{\partial \text{vec}(S^{m,i})}{\partial \text{vec}(W^m)^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T} \left( \phi(Z^{m-1,i})^T \otimes I_{d^m} \right) = \text{vec}\left( \frac{\partial \xi_i}{\partial S^{m,i}}\, \phi(Z^{m-1,i})^T \right)^T \qquad (3.22)$$

and

$$\frac{\partial \xi_i}{\partial (b^m)^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T}\, \frac{\partial \text{vec}(S^{m,i})}{\partial (b^m)^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T} \left( 1_{a^m b^m} \otimes I_{d^m} \right) = \text{vec}\left( \frac{\partial \xi_i}{\partial S^{m,i}}\, 1_{a^m b^m} \right)^T, \qquad (3.23)$$

where (3.22) and (3.23) are from (3.17). To calculate (3.22), $\phi(Z^{m-1,i})$ is available in the forward process of calculating the function value.

For $\partial \xi_i / \partial S^{m,i}$, also needed in (3.22) and (3.23), it can be obtained by a backward process. Here we assume that $\partial \xi_i / \partial S^{m,i}$ is available, and calculate $\partial \xi_i / \partial S^{m-1,i}$ for layer $m - 1$.

$$\frac{\partial \xi_i}{\partial \text{vec}(Z^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T}\, \frac{\partial \text{vec}(S^{m,i})}{\partial \text{vec}(\phi(Z^{m-1,i}))^T}\, \frac{\partial \text{vec}(\phi(Z^{m-1,i}))}{\partial \text{vec}(Z^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(S^{m,i})^T} \left( I_{a^m b^m} \otimes W^m \right) P^{m-1}_\phi \qquad (3.24)$$
$$= \text{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)^T P^{m-1}_\phi, \qquad (3.25)$$

where (3.24) is from (2.8) and (3.20), and (3.25) is from (3.16).

Then, because the ReLU activation function is considered for the convolutional layers, we have

$$\frac{\partial \xi_i}{\partial \text{vec}(S^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(Z^{m-1,i})^T} \odot \text{vec}\left( I[Z^{m-1,i}] \right)^T, \qquad (3.26)$$

where $\odot$ is the Hadamard product (i.e., the element-wise product) and

$$I[Z^{m-1,i}]_{(p,q)} = \begin{cases} 1 & \text{if } z^{m-1,i}_{(p,q)} > 0, \\ 0 & \text{otherwise}. \end{cases}$$

For fully-connected layers, by the same form of (3.18), (3.19), (3.20) and (3.21), we immediately get from (3.22), (3.23), (3.25) and (3.26) that

$$\frac{\partial \xi_i}{\partial \text{vec}(W^m)^T} = \text{vec}\left( \frac{\partial \xi_i}{\partial s^{m,i}} (z^{m-1,i})^T \right)^T, \qquad (3.27)$$

$$\frac{\partial \xi_i}{\partial (b^m)^T} = \frac{\partial \xi_i}{\partial (s^{m,i})^T}, \qquad (3.28)$$

$$\frac{\partial \xi_i}{\partial (z^{m-1,i})^T} = \text{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial s^{m,i}} \right)^T I_{n_{m-1}} = \frac{\partial \xi_i}{\partial (s^{m,i})^T}\, W^m, \qquad (3.29)$$

$$\frac{\partial \xi_i}{\partial (s^{m-1,i})^T} = \frac{\partial \xi_i}{\partial (z^{m-1,i})^T} \odot I[z^{m-1,i}]^T. \qquad (3.30)$$

Finally, we check the initial values of the backward process. From the squared loss in (2.24), we have

$$\frac{\partial \xi_i}{\partial z^{L,i}} = 2(z^{L,i} - y^i), \qquad \frac{\partial \xi_i}{\partial s^{L,i}} = \frac{\partial \xi_i}{\partial z^{L,i}}.$$
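For concreteness, the fully-connected part of this backward process, (3.27)-(3.30) together with the initial values above, can be sketched in MATLAB as follows; column vectors are used (so the row-vector forms above appear transposed), and zL, y, W, z_prev, and dxi_ds stand for $z^{L,i}$, $y^i$, $W^m$, $z^{m-1,i}$, and the column form of $\partial\xi_i/\partial(s^{m,i})^T$, all assumed to be stored from the forward pass.

```matlab
% Backward step for one instance (a sketch): first the initial values at layer L,
% then the pass over a generic fully-connected layer m.
dxi_ds   = 2 * (zL - y);              % initial values at layer L (linear activation)
dxi_dW   = dxi_ds * z_prev';          % (3.27): gradient with respect to W^m, in matrix form
dxi_db   = dxi_ds;                    % (3.28): gradient with respect to b^m
dxi_dz   = W' * dxi_ds;               % (3.29): back-propagate to z^{m-1,i}
dxi_ds_p = dxi_dz .* (z_prev > 0);    % (3.30): multiply by the ReLU indicator I[z^{m-1,i}]
```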

3.2.1 Padding, Pooling, and the Overall Procedure

For the padding operation, from (2.15), we have

$$\frac{\partial \xi_i}{\partial \text{vec}(Z^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(Z^{m,i})^T}\, \frac{\partial \text{vec}(Z^{m,i})}{\partial \text{vec}(Z^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(Z^{m,i})^T}\, P^{m-1,i}_{\text{padding}}. \qquad (3.31)$$

Similarly, for the pooling layer, from (2.19), we have

$$\frac{\partial \xi_i}{\partial \text{vec}(Z^{m-1,i})^T} = \frac{\partial \xi_i}{\partial \text{vec}(Z^{m,i})^T}\, P^{m-1,i}_{\text{pool}}. \qquad (3.32)$$

Following the explanation in Chapter 2.1.3, to generate $\partial \xi_i / \partial \text{vec}(S^{m-1,i})$ from $\partial \xi_i / \partial \text{vec}(S^{m,i})$, we consider the following cycle:

$$S^{m-1} \rightarrow Z^{m-1} \rightarrow \text{pooling} \rightarrow \text{padding} \rightarrow \phi(Z^{m-1,i}) \rightarrow S^m.$$

Therefore, details of obtaining $\partial \xi_i / \partial \text{vec}(S^{m-1,i})^T$ follow by combining (3.25), (3.26), (3.31) and (3.32), i.e., by going through the operations of this cycle in the reverse order.

3.3 Jacobian Evaluation

For (3.6), the Jacobian matrix is needed, and it can be partitioned into $L$ blocks according to the variables of each layer:

$$J^i = \begin{bmatrix} J^{1,i} & J^{2,i} & \cdots & J^{L,i} \end{bmatrix}, \qquad (3.34)$$

where $J^{m,i}$ contains the derivatives of $z^{L,i}$ with respect to the variables of layer $m$.

The calculation is very similar to that for the gradient. For the convolutional layers, from (3.22) and (3.23), we have

$$\frac{\partial z^{L,i}_j}{\partial \text{vec}(W^m)^T} = \text{vec}\left( \frac{\partial z^{L,i}_j}{\partial S^{m,i}}\, \phi(Z^{m-1,i})^T \right)^T \quad \text{and} \quad \frac{\partial z^{L,i}_j}{\partial (b^m)^T} = \text{vec}\left( \frac{\partial z^{L,i}_j}{\partial S^{m,i}}\, 1_{a^m b^m} \right)^T, \quad j = 1, \ldots, n_L. \qquad (3.35)$$

In the backward process, if $\partial z^{L,i} / \partial \text{vec}(S^{m,i})^T$ is available, we calculate $\partial z^{L,i} / \partial \text{vec}(S^{m-1,i})^T$ for convolutional layer $m - 1$. From a derivation similar to (3.25) for the gradient, we have

$$\frac{\partial z^{L,i}_j}{\partial \text{vec}(Z^{m-1,i})^T} = \text{vec}\left( (W^m)^T \frac{\partial z^{L,i}_j}{\partial S^{m,i}} \right)^T P^{m-1}_\phi, \quad j = 1, \ldots, n_L. \qquad (3.36)$$

Similar to (3.26),

$$\frac{\partial z^{L,i}_j}{\partial \text{vec}(S^{m-1,i})^T} = \frac{\partial z^{L,i}_j}{\partial \text{vec}(Z^{m-1,i})^T} \odot \text{vec}\left( I[Z^{m-1,i}] \right)^T, \quad j = 1, \ldots, n_L.$$

They can be written together as

$$\frac{\partial z^{L,i}}{\partial \text{vec}(S^{m-1,i})^T} = \frac{\partial z^{L,i}}{\partial \text{vec}(Z^{m-1,i})^T} \odot \left( 1_{n_L}\, \text{vec}\left( I[Z^{m-1,i}] \right)^T \right). \qquad (3.37)$$

For the padding operation and pooling layer, similar to (3.31) and (3.32), the same relations hold with $\xi_i$ replaced by the components of $z^{L,i}$.

For the fully-connected layers, we follow the same derivation as for the gradient. Thus, we get, in particular,

$$\frac{\partial z^{L,i}}{\partial (s^{m-1,i})^T} = \frac{\partial z^{L,i}}{\partial (z^{m-1,i})^T} \odot \left( 1_{n_L}\, I[z^{m-1,i}]^T \right). \qquad (3.43)$$

For layer $L$, because of using (2.24) and the linear activation function, we have

$$\frac{\partial z^{L,i}}{\partial (s^{L,i})^T} = I_{n_L}.$$

3.4 Gauss-Newton Matrix-Vector Products

For solving (3.7), conjugate gradient (CG) methods are often applied and the main operation at each CG iteration is the Gauss-Newton matrix-vector product.

From (3.34), we rearrange (3.6) to

$$G = \frac{1}{C} I + \frac{1}{l}\sum_{i=1}^{l} \begin{bmatrix} (J^{1,i})^T \\ \vdots \\ (J^{L,i})^T \end{bmatrix} B^i \begin{bmatrix} J^{1,i} & \cdots & J^{L,i} \end{bmatrix} \qquad (3.44)$$

and the Gauss-Newton matrix-vector product becomes

$$Gv = \frac{1}{C} v + \frac{1}{l}\sum_{i=1}^{l} \begin{bmatrix} (J^{1,i})^T \\ \vdots \\ (J^{L,i})^T \end{bmatrix} B^i \left( \sum_{m=1}^{L} J^{m,i} v^m \right), \qquad (3.45)$$

where $v = \begin{bmatrix} (v^1)^T & \cdots & (v^L)^T \end{bmatrix}^T$ is partitioned in the same way as $\theta$.

In this subsection, we focus on the $i$th instance; implementation details of calculating (3.45) for the whole data set are given in Chapter 4.5.

For the convolutional layers, from (3.35) and (3.45), we first calculate $J^{m,i} v^m$ (3.46). To simplify (3.46), we use the following property:

$$\text{vec}(AB)^T \text{vec}(C) = \text{vec}(A)^T \text{vec}(C B^T),$$

so that, for example, the first element of (3.46), and similarly the others, can be computed without explicitly forming $J^{m,i}$; the resulting simplified expression is (3.47).

After deriving (3.47), from (3.45), we sum the results of all layers,

$$\sum_{m=1}^{L} J^{m,i} v^m,$$

and then calculate

$$q^i = B^i \left( \sum_{m=1}^{L} J^{m,i} v^m \right). \qquad (3.48)$$

From (2.24) and (3.5),

$$B^i_{ts} = \frac{\partial^2 \xi_i}{\partial z^{L,i}_t\, \partial z^{L,i}_s} = \begin{cases} 2 & \text{if } t = s, \\ 0 & \text{otherwise}, \end{cases} \qquad (3.49)$$

and we derive $q^i$ by multiplying every element of $\sum_{m=1}^{L} J^{m,i} v^m$ by two.

After deriving (3.48), from (3.35) and (3.45), we calculate $(J^{m,i})^T q^i$ (3.50) for each layer. Similar to the results for the convolutional layers, analogous formulas hold for the fully-connected layers.

CHAPTER IV

Implementation Details

Our goal in this chapter is to show that after some careful derivations, a Newton method for CNN can be implemented by a simple and short program. Here we give a MATLAB implementation though a modification for other languages such as Python should be straightforward.

For the discussion in Chapter III, we consider a single data instance, but for practical implementations, all instances must be taken care of. In our implementation, we store $Z^{m-1,i}$, $\forall i = 1, \ldots, l$, as the following matrix:



$$\begin{bmatrix} Z^{m-1,1} & Z^{m-1,2} & \cdots & Z^{m-1,l} \end{bmatrix} \in R^{d^{m-1} \times a^{m-1} b^{m-1} l}. \qquad (4.1)$$

Similarly, we store $\partial z^{L,i} / \partial \text{vec}(S^{m,i})^T$, $\forall i$, as

$$\begin{bmatrix} \dfrac{\partial z^{L,1}_1}{\partial \text{vec}(S^{m,1})^T} & \cdots & \dfrac{\partial z^{L,l}_1}{\partial \text{vec}(S^{m,l})^T} & \cdots & \dfrac{\partial z^{L,1}_{n_L}}{\partial \text{vec}(S^{m,1})^T} & \cdots & \dfrac{\partial z^{L,l}_{n_L}}{\partial \text{vec}(S^{m,l})^T} \end{bmatrix}^T \in R^{l n_L \times d^m a^m b^m}. \qquad (4.2)$$

4.1 Generation of $\phi(Z^{m-1,i})$

MATLAB has a built-in function im2col that can generate $\phi(Z^{m-1,i})$ for $s^m = 1$ and $s^m = h^m$. To handle a general $s^m$, we notice that $\phi(Z^{m-1,i})$ is a sub-matrix of the output matrix of MATLAB's im2col under $s^m = 1$. In the appendix we provide an efficient implementation to extract the sub-matrix. However, in other languages a subroutine like MATLAB's im2col may not be available. Further, generating a larger matrix under $s^m = 1$ causes extra time and memory.

Therefore, here we show an efficient implementation to get $\phi(Z^{m-1,i})$ without relying on a subroutine like MATLAB's im2col. To begin, we consider the linear indices¹ (i.e., counting elements in a column-oriented way) of $Z^{m-1,i}$: the element $z^{m-1,i}_{r,c,j}$ has the linear index

$$\left( (r-1) + (c-1)\, a^{m-1} \right) d^{m-1} + j, \quad r = 1, \ldots, a^{m-1},\ c = 1, \ldots, b^{m-1},\ j = 1, \ldots, d^{m-1}. \qquad (4.3)$$

¹Linear indices refer to the sequence of how elements in a matrix are stored. Here we consider a column-oriented setting.

Because every element in

$$\phi(Z^{m-1,i}) \in R^{h^m h^m d^{m-1} \times a^m b^m}$$

is extracted from $Z^{m-1,i}$, the task is to find the mapping between each element in $\phi(Z^{m-1,i})$ and a linear index of $Z^{m-1,i}$. Consider the following example:

$$a^{m-1} = 3,\quad b^{m-1} = 2,\quad d^{m-1} = 1,\quad s^m = 1,\quad h^m = 2.$$

Because $d^{m-1} = 1$, we omit the channel subscript. In addition, we also omit the instance index $i$. Thus the image has $3 \times 2$ pixels, whose linear indices are

$$\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.$$

We have

$$\phi(Z^{m-1,i}) = \begin{bmatrix} z_1 & z_2 \\ z_2 & z_3 \\ z_4 & z_5 \\ z_5 & z_6 \end{bmatrix},$$

where $z_j$ denotes the element with linear index $j$. Thus we store the following vector to indicate the mapping between linear indices of $Z^{m-1,i}$ and the elements of $\phi(Z^{m-1,i})$:

$$\begin{bmatrix} 1 & 2 & 4 & 5 & 2 & 3 & 5 & 6 \end{bmatrix}^T. \qquad (4.4)$$

It also corresponds to the row indices of the non-zero elements in $P^{m-1}_\phi$.
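As a quick check of this example (possible here because $s^m = 1$), MATLAB's im2col mentioned above reproduces exactly these two columns.

```matlab
% im2col on the 3 x 2 example with a 2 x 2 filter (requires the Image Processing Toolbox).
Z = reshape(1:6, 3, 2);               % the values equal their linear indices
phiZ = im2col(Z, [2 2], 'sliding')    % columns [1 2 4 5]' and [2 3 5 6]', matching (4.4)
```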

We begin by checking how linear indices of $Z^{m-1,i}$ are mapped to the first column of $\phi(Z^{m-1,i})$. For simplicity, we consider only channel $j$. From (2.9) and (4.3), we can list, for the first column of $\phi(Z^{m-1,i})$, the linear indices in $Z^{m-1,i}$ (4.5) together with the corresponding values. Rewriting these linear indices (4.6), we see that every one of them can be represented as

$$(p + q\, a^{m-1})\, d^{m-1} + j, \qquad (4.7)$$

where

$$p, q \in \{0, \ldots, h^m - 1\}$$

correspond to the pixel position in the convolutional filter.²

²More precisely, $p + 1$ and $q + 1$ are the pixel position.

Next we consider the other columns in $\phi(Z^{m-1,i})$, still fixing the channel to be $j$. From (2.9), similar to the right column of (4.5), each column contains the following elements from the $j$th channel of $Z^{m-1,i}$:

$$z^{m-1,i}_{1+p+as^m,\ 1+q+bs^m,\ j}, \quad a = 0, 1, \ldots, a^m - 1,\quad b = 0, 1, \ldots, b^m - 1, \qquad (4.8)$$

where $(1 + a s^m,\ 1 + b s^m)$ denotes the top-left position of a sub-image in channel $j$ of $Z^{m-1,i}$. From (4.3), the linear index of each element in (4.8) is

$$\left( a + b\, a^{m-1} \right) s^m d^{m-1} + \left( (p + q\, a^{m-1})\, d^{m-1} + j \right). \qquad (4.9)$$

Now we know, for each element of $\phi(Z^{m-1,i})$, the corresponding linear index in $Z^{m-1,i}$. Next we discuss the implementation details, where the code is shown in Listing IV.1. First, we compute the elements in (4.6) with $j = 1$ by applying MATLAB's bsxfun function to the two arrays

$$\begin{bmatrix} 1 \\ 1 + d^{m-1} \\ \vdots \\ 1 + (h^m - 1)\, d^{m-1} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0 & a^{m-1} d^{m-1} & \cdots & (h^m - 1)\, a^{m-1} d^{m-1} \end{bmatrix}.$$

The result is an $h^m \times h^m$ matrix whose $(p+1, q+1)$ entry is $(p + q\, a^{m-1})\, d^{m-1} + 1$ (4.10), and whose columns, if concatenated, lead to the values in (4.6) with $j = 1$; see line 2 of the code. To get (4.7) for all channels $j = 1, \ldots, d^{m-1}$, we apply bsxfun again to add the vector form of (4.10) and

$$\begin{bmatrix} 0 & 1 & \cdots & d^{m-1} - 1 \end{bmatrix},$$

and then vectorize the resulting matrix; see line 3.

To obtain the other columns in $\phi(Z^{m-1,i})$, we next calculate $a^m$ and $b^m$ by (2.4) in lines 4-5. To get all linear indices in (4.9), we see that the second term corresponds to the indices of the first column, and therefore we must calculate the following column offsets:

$$\left( a + b\, a^{m-1} \right) s^m d^{m-1}, \quad \forall a = 0, 1, \ldots, a^m - 1,\ b = 0, 1, \ldots, b^m - 1.$$

This is done by a bsxfun that adds the following two arrays:

$$\begin{bmatrix} 0 \\ \vdots \\ a^m - 1 \end{bmatrix} \times s^m d^{m-1} \quad \text{and} \quad \begin{bmatrix} 0 & \cdots & b^m - 1 \end{bmatrix} \times a^{m-1} s^m d^{m-1};$$

see line 6 in the code. Finally, we use another bsxfun to add the column offsets and the values in (4.6); see line 7.

The obtained linear indices are independent of $Z$'s values. Thus the above procedure only needs to be run once in the beginning. For any $Z$, we apply the indices to extract $\phi(Z)$; see lines 20-21 in Listing IV.1.
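Because Listing IV.1 is not reproduced here, the following self-contained MATLAB sketch illustrates the same idea for one instance (zero-padding assumed to be already applied); the sizes at the top are arbitrary assumptions.

```matlab
% Sketch: build the linear-index map of phi(Z^{m-1,i}) once, then reuse it.
a_prev = 5; b_prev = 5; d_prev = 3;                 % assumed a^{m-1}, b^{m-1}, d^{m-1}
h = 3; s = 1;                                        % filter size h^m and stride s^m
a_out = floor((a_prev - h)/s) + 1;                   % a^m, cf. (2.4) (padding already done)
b_out = floor((b_prev - h)/s) + 1;                   % b^m
% indices of the first column of phi, channel 1, cf. (4.6)/(4.10)
first = bsxfun(@plus, (0:h-1)'*d_prev + 1, (0:h-1)*a_prev*d_prev);
% extend to all channels j = 1, ..., d^{m-1}, cf. (4.7)
first = reshape(bsxfun(@plus, first(:), 0:d_prev-1), [], 1);
% column offsets for all sub-image positions, cf. (4.9)
offset = bsxfun(@plus, (0:a_out-1)'*s*d_prev, (0:b_out-1)*a_prev*s*d_prev);
idx = bsxfun(@plus, first, reshape(offset, 1, []));  % (h*h*d_prev) x (a_out*b_out)
% the map is value-independent: for any Z stored as d^{m-1} x (a^{m-1} b^{m-1}),
Z = rand(d_prev, a_prev*b_prev);
phiZ = Z(idx);                                       % phi(Z), extracted by one indexing operation
```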

For $\phi(Z^{m-1,i})$ in the pooling layer, the same implementation can be used.

4.2 Construction of $P^{m-1,i}_{\text{pool}}$

Following (2.19), we assume that $Z^{m-1,i}$ and $\phi(Z^{m-1,i})$ are the input and output before/after the pooling operation, respectively. In the beginning of the training procedure, we use the procedure proposed in Chapter 4.1 to obtain a $\phi(Z^{m-1,i})$ that partitions $Z^{m-1,i}$ into non-overlapping sub-regions; see (2.16) and (2.17). Listing IV.2 is a MATLAB implementation for constructing $P^{m-1,i}_{\text{pool}}$, $\forall i$, at each Newton iteration. In line 13, we handle the situation where

$$a^{m-1} \bmod h^m \ne 0 \quad \text{or} \quad b^{m-1} \bmod h^m \ne 0. \qquad (4.11)$$

That is, we must ensure that (2.17) holds with $a^m$ and $b^m$ being integers. If (4.11) occurs, we simply discard the last several rows or columns in $Z^{m-1,i}$ to make $a^{m-1}$ and $b^{m-1}$ multiples of $h^m$. In line 8, we extract the linear indices of $Z^{m-1,i}$ that appear in $\text{vec}(\phi(Z^{m-1,i}))$, which, as mentioned, have been generated in the beginning. The resulting vector P contains

$$h^m h^m d^{m-1} a^m b^m$$

elements, each in the range

$$1, \ldots, d^{m-1} a^{m-1} b^{m-1}.$$

In line 9, we get

$$Z^{m-1,i}, \quad i = 1, \ldots, l, \qquad (4.12)$$

which are stored as a matrix in (4.1). In lines 23-24, we use P to generate

$$\begin{bmatrix} \text{vec}(\phi(Z^{m-1,1})) & \cdots & \text{vec}(\phi(Z^{m-1,l})) \end{bmatrix} \in R^{h^m h^m d^{m-1} a^m b^m \times l}. \qquad (4.13)$$

Next we rewrite the above matrix so that each column contains a sub-region:

$$\begin{bmatrix} z^{m-1,1}_{1,1,1} & z^{m-1,1}_{1,1,2} & \cdots & z^{m-1,l}_{1+(a^m-1)s^m,\ 1+(b^m-1)s^m,\ d^{m-1}} \\ \vdots & \vdots & \ddots & \vdots \\ z^{m-1,1}_{h^m,h^m,1} & z^{m-1,1}_{h^m,h^m,2} & \cdots & z^{m-1,l}_{h^m+(a^m-1)s^m,\ h^m+(b^m-1)s^m,\ d^{m-1}} \end{bmatrix} \in R^{h^m h^m \times d^{m-1} a^m b^m l}. \qquad (4.14)$$

We apply a max function to get the largest value of each column and its index in the range $1, \ldots, h^m h^m$. The resulting row vector has $d^m a^m b^m l$ elements, where $d^m = d^{m-1}$; see line 25. In line 26, we reformulate it to be a

$$d^m \times a^m b^m l$$

matrix as the output $Z^{m,i}$, $\forall i$.

Next we find the linear indices in the matrix (4.13) that correspond to the largest elements obtained from (4.14).³ First we obtain the linear indices of (4.1), which are in the range

$$1, \ldots, d^{m-1} a^{m-1} b^{m-1} l.$$

We store these indices in a matrix of the same size as (4.1); see line 19. Then we generate

$$\phi\left( \text{linear indices of (4.1)} \right), \qquad (4.15)$$

which has the same size as (4.13); see lines 28-29. Next, if we can find the positions in (4.15) that correspond to the largest elements identified earlier, then we have the desired linear indices. We mentioned that in line 25, not only is the maximal value in each sub-region obtained, but we also get the corresponding index in $\{1, \ldots, h^m h^m\}$.

³Note that we do not really generate a sparse matrix $P^{m-1,i}_{\text{pool}}$ in (2.19).

Therefore, for these max values, their linear indices in (4.14)⁴ are

$$\text{row indices of max values in (4.14)} + h^m h^m \begin{bmatrix} 0 \\ \vdots \\ d^{m-1} a^m b^m l - 1 \end{bmatrix}. \qquad (4.16)$$

The reason is that $h^m h^m$ is the column offset in (4.14). See line 30 for the implementation of (4.16). Next, in line 31 we use (4.16) to extract the values in (4.15) and obtain the linear indices in (4.1) of the selected max values. Similar to the situation in Chapter 4.1 for $\phi(Z^{m-1,i})$, the index-finding procedure in Listing IV.2 is conducted only once in the beginning of the training process.

⁴Also the linear indices in (4.13), because (4.14) separates each column of (4.13) into several columns.

Note that because max pooling depends on the values of $Z^{m-1,i}$, the linear indices for $P^{m-1,i}_{\text{pool}}$ must be re-generated in the beginning of each Newton iteration.
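A self-contained MATLAB sketch of this max-pooling bookkeeping for a single instance is given below; the sizes are arbitrary assumptions, and the index map is built with the procedure of Chapter 4.1 using stride $s^m = h^m$ so that the sub-regions do not overlap.

```matlab
% Max pooling over non-overlapping h x h regions with recorded linear indices.
d = 2; a_prev = 4; b_prev = 4; h = 2;          % assumed d^{m-1}, a^{m-1}, b^{m-1}, h^m
a_out = a_prev/h; b_out = b_prev/h;
first  = bsxfun(@plus, (0:h-1)'*d + 1, (0:h-1)*a_prev*d);
first  = reshape(bsxfun(@plus, first(:), 0:d-1), [], 1);
offset = bsxfun(@plus, (0:a_out-1)'*h*d, (0:b_out-1)*a_prev*h*d);
idx    = bsxfun(@plus, first, reshape(offset, 1, []));
pool_idx = reshape(idx, h*h, []);              % one h x h sub-region per column, cf. (4.14)
Z = rand(d, a_prev*b_prev);                    % input Z^{m-1,i}
[vals, rows] = max(Z(pool_idx), [], 1);        % max of each sub-region and its index in {1,...,h*h}
Zout = reshape(vals, d, []);                   % pooled output, d^m x a^m b^m
sel = rows + (0:numel(rows)-1)*h*h;            % add the column offset h*h, cf. (4.16)
max_idx = pool_idx(sel);                       % linear indices in Z of the selected elements
```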

4.3 Details of Padding Operation

To implement zero-padding, we first capture the linear indices of the input image in the padded image. For example, if the size of the input image is $3 \times 3$ and the padded output image is $5 \times 5$, we have a $5 \times 5$ matrix of zeros and ones in which the "1" values indicate the positions of the input image. Based on the column-major order, we derive

$$\text{pad\_idx\_one} = \{7, 8, 9, 11, 12, 13, 16, 17, 18\}.$$

It is obtained in the beginning of the training procedure and is used in the following situations. First, pad_idx_one contains the row indices in $P^{m-1}_{\text{padding}}$ of (2.15) that correspond to the input image. We can use it to conduct the padding operation in (2.15). Second, from (3.33), in gradient and Jacobian evaluations, we need

$$v^T P^{m-1}_{\text{padding}}.$$

This can be considered as the inverse of the padding operation: we would like to remove the zeros and get back the original image.

MATLAB implementations are shown in Listings IV.4 and IV.5, where Listing IV.4 generates the indices and Listing IV.5 performs the padding operation (2.15). In line 12 of Listing IV.4, we use the MATLAB built-in function padarray to pad zeros around the input image. Because $Z^{m-1,i}$, $\forall i$, are stored as a matrix in (4.1), in line 8 of Listing IV.5, we extend pad_idx_one to cover all instances:

$$\begin{bmatrix} \text{pad\_idx\_one} & \text{pad\_idx\_one} + a^m b^m & \text{pad\_idx\_one} + 2 a^m b^m & \cdots \end{bmatrix}. \qquad (4.17)$$

These values indicate the new column indices of the original columns in $Z^{m-1,i}$, $\forall i$. In line 9 of Listing IV.5, we create the zeros of the padded output image. By using the values in (4.17), in line 10 of Listing IV.5, we put the input $Z^{m-1,i}$ into the corresponding positions of the padded output image.
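The following is a minimal sketch of this bookkeeping for one instance and one channel, using symmetric zero-padding of one pixel on each side; the exact index set depends on where the original image is placed inside the padded one, so the values produced below correspond only to this assumed placement.

```matlab
% Zero-padding via stored linear indices (padarray requires the Image Processing Toolbox).
mask = padarray(ones(3, 3), [1 1]);        % 5 x 5 mask: ones mark the input-image positions
pad_idx_one = find(mask);                  % column-major linear indices of those positions
z  = reshape(1:9, 3, 3);                   % an example 3 x 3 input image
zp = zeros(5, 5);
zp(pad_idx_one) = z;                       % the padding operation, cf. (2.15)
orig = reshape(zp(pad_idx_one), 3, 3);     % the inverse: remove zeros and recover the image
```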

4.4 Evaluation of $v^T P^{m-1}_\phi$

For (3.25) and (3.36), the following operation is applied:

$$v^T P^{m-1}_\phi, \qquad (4.18)$$

where

$$v = \text{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)$$

for (3.25) and

$$v = \text{vec}\left( (W^m)^T \frac{\partial z^{L,i}_j}{\partial S^{m,i}} \right), \quad j = 1, \ldots, n_L, \qquad (4.19)$$

for (3.36).

Consider the same example as in Chapter 4.1. We note that

$$(P^{m-1}_\phi)^T v = \begin{bmatrix} v_1 & v_2 + v_5 & v_6 & v_3 & v_4 + v_7 & v_8 \end{bmatrix}^T, \qquad (4.20)$$

which is a kind of "inverse" operation of $\phi(Z^{m-1})$: we accumulate elements of $\phi(Z^{m-1})$ back to their original positions in $Z^{m-1}$. In MATLAB, given the indices in (4.4), the function accumarray can directly generate the vector (4.20).
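For instance, the following lines reproduce (4.20) for the example mapping (4.4), with arbitrary values standing in for $v$.

```matlab
% accumarray sums the entries of v whose target positions (from (4.4)) coincide.
map = [1 2 4 5 2 3 5 6]';          % the mapping vector (4.4)
v   = (1:8)';                      % example values v_1, ..., v_8
PTv = accumarray(map, v)           % [v1; v2+v5; v6; v3; v4+v7; v8], cf. (4.20)
```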

To calculate (3.25) over a batch of instances, we aim to have

$$\begin{bmatrix} (P^{m-1}_\phi)^T v^1 & \cdots & (P^{m-1}_\phi)^T v^l \end{bmatrix}^T. \qquad (4.21)$$

We can manage to apply MATLAB's accumarray to the vector (4.22) that stacks $v^1, \ldots, v^l$, by giving as input the indices (4.23), which are $l$ copies of (4.4) suitably shifted so that each instance is accumulated into its own block. We can then easily rearrange the resulting vector by reshape and transpose operations to get the matrix form of (3.25), where each row corresponds to an instance. We show the program in Listing IV.6, where line 8 generates the indices shown in (4.23) and V(:) in line 9 is the vector (4.22).

Following the same idea, to calculate (3.36) over a batch of instances, we do not have to calculate (4.18) $n_L$ times. Notice that (3.36) is equivalent to applying (4.18) to a single long vector that collects the $n_L$ vectors in (4.19); see (4.24)-(4.25). We can calculate this vector by vectorizing the result of a fast matrix-matrix multiplication. By passing the appropriate input indices to accumarray, we can calculate $(P^{m-1}_\phi)^T$ applied to this collected vector; see (4.26). So the calculation falls back to the same case as evaluating (3.25). That is, after having (4.26), we calculate (4.18) once instead of calculating (4.18) and (4.19) $n_L$ times.

The input indices for accumarray are easy to obtain. Before the transpose operation in (4.25), the linear indices for the $u$th column, i.e., for $(P^{m-1}_\phi)^T$ applied to the corresponding block, start from

$$(u - 1)\, l\, a^{m-1} b^{m-1} d^{m-1} + 1.$$

Therefore, we can apply accumarray to the vector in (4.26) with the corresponding input indices. The program is the same as Listing IV.6, except that we pass the value of the product of $n_L$ and $|S_k|$ for the parameter that otherwise equals $|S_k|$.

4.5 Gauss-Newton Matrix-Vector Products

Because (3.45) is a summation over all instances, in view of the MATLAB implementation we extend (3.45) to the whole data set in a matrix form; see (4.27). To derive (4.27), we first calculate, for all instances, the sums $\sum_{m=1}^{L} J^{m,i} v^m$ (4.28). From (3.47), for a particular $m$, the per-instance quantities needed for this sum can be written in a matrix form over all instances; see (4.29).

A MATLAB implementation for (4.28) is shown in Listing IV.7. We store the model in a structure of different layers. Thus in lines 11 and 13 we respectively obtain

$$Z^{m-1,i}\ \text{(after padding)} \quad \text{and} \quad \frac{\partial z^{L,i}}{\partial \text{vec}(S^{m,i})^T}, \quad i = 1, \ldots, l.$$

In line 14, we calculate (4.29) to derive $p^{m,1}, \ldots, p^{m,l}$. In line 15, we separately calculate

$$\frac{\partial z^{L,i}}{\partial \text{vec}(S^{m,i})^T}\, p^{m,i}, \quad i = 1, \ldots, l.$$

In line 16, we sum the results over all layers.

After deriving (4.28), from (3.48) and (4.27), we must calculate $q^i$, $\forall i$ (4.30). From (3.49), (4.30) can be derived by multiplying every element of (4.28) by two. When (4.30) is available, from (4.27), the matrix-vector products $(J^{m,i})^T q^i$ over all instances are needed (4.31). For layer $m$, from (3.50), this calculation can be written in a matrix form over all instances; see (4.32).

A MATLAB implementation for (4.31) is shown in Listing IV.8. In line 12, we separately calculate $r^{m,1}, \ldots, r^{m,l}$ by (4.32). In line 18, we build the matrix collecting them. In line 19, we calculate (4.31). In line 20, we derive the result of (4.27) for the $m$th layer.

4.6 Mini-Batch Function and Gradient Evaluation

Later, in Chapter 5.1, we will discuss details of memory usage. One important conclusion is that in many places of the Newton method, the memory consumption is proportional to the number of data instances. This fact causes difficulties in handling large data sets. Therefore, here we discuss some implementation techniques to address the memory difficulty.

In the subsampled Newton methods discussed in Chapter 3.1, a subset $S$ of the training data is used to derive the subsampled Gauss-Newton matrix for approximating the Hessian matrix. While a motivation of this technique is to trade a slightly less accurate direction for shorter running time per iteration, it is also useful for reducing the memory consumption. For example, at the $m$th convolutional layer, we only need to store the matrices

$$\frac{\partial z^{L,i}}{\partial \text{vec}(S^{m,i})^T}, \quad \forall i \in S, \qquad (4.33)$$

for the Gauss-Newton matrix-vector products.

However, function and gradient evaluations must use the whole training data. Fortunately, both operations involve the summation of independent results over all instances. Here we follow Wang et al. (2018a) to split the index set $\{1, \ldots, l\}$ of the data into, for example, $R$ equal-sized subsets $S_1, \ldots, S_R$. We then calculate the result corresponding to each subset and accumulate the results for the final output. Take the function evaluation as an example. For each subset, we must store only

$$Z^{m,i}, \quad \forall m,\ \forall i \in S_r.$$

Thus, the memory usage can be dramatically reduced.
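A minimal sketch of this mini-batch function evaluation is shown below; forward_loss is a hypothetical helper that runs the forward pass on the given instances and returns $\sum_i \xi_i$ over them, and X, Y, l, C, R, and theta are assumed to be defined.

```matlab
% Accumulate the training loss subset by subset to bound the memory usage.
perm = randperm(l);                          % split {1, ..., l} into R roughly equal subsets
loss = 0;
for r = 1:R
    Sr = perm(r:R:end);                      % indices of the subset S_r
    loss = loss + forward_loss(theta, X(:, Sr), Y(:, Sr));   % only Z^{m,i}, i in S_r, are stored
end
f = (theta' * theta) / (2*C) + loss / l;     % the objective value (2.26)
```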

For the Gauss-Newton matrix-vector product, to calculate (4.33), we need $Z^{m,i}$, $\forall i \in S$.
