A second-order learning algorithm for multilayer networks based on block Hessian matrix

(1)

Contributed article

A second-order learning algorithm for multilayer networks based on

block Hessian matrix

Yi-Jen Wang*, Chin-Teng Lin

Department of Electrical and Control Engineering, National Chiao-Tung University, Hsinchu, Taiwan, ROC Received 18 November 1996; revised 19 June 1998; accepted 19 June 1998

Abstract

This article proposes a new second-order learning algorithm for training the multilayer perceptron (MLP) networks. The proposed algorithm is a revised Newton’s method. A forward–backward propagation scheme is first proposed for network computation of the Hessian matrix, H, of the output error function of the MLP. A block Hessian matrix, Hb, is then defined to approximate and simplify H. Several lemmas and theorems are proved to uncover the important properties of H and Hb, and verify the good approximation of Hbto H; Hb preserves the major properties of H. The theoretic analysis leads to the development of an efficient way for computing the inverse of Hb recursively. In the proposed second-order learning algorithm, the least squares estimation technique is adopted to further lessen the local minimum problems. The proposed algorithm overcomes not only the drawbacks of the standard backpropagation algorithm (i.e. slow asymptotic convergence rate, bad controllability of convergence accuracy, local minimum problems, and high sensitivity to learning constant), but also the shortcomings of normal Newton’s method used on the MLP, such as the lack of network implementation of H, ill representability of the diagonal terms of H, the heavy computation load of the inverse of H, and the requirement of a good initial estimate of the solution (weights). Several example problems are used to demonstrate the efficiency of the proposed learning algorithm. Extensive performance (convergence rate and accuracy) comparisons of the proposed algorithm with other learning schemes (including the standard backpropagation algorithm) are also made.䉷 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Multilayer perceptrons; Hessian matrix; Forward–backward propagation; Newton’s method; Least squares estimation

1. Introduction

Gradient-descent-based backpropagation (BP) learning algorithm has been widely used for training multilayer per-ceptron (MLP) networks (Haykin, 1994; Lin and Lee, 1996). However, several drawbacks of the BP learning method have been observed; its convergence speed is usually too low, its convergence accuracy is hard to control, it is easily stuck in bad local minima, and the choice of proper learning constant largely depends on trial and error. Further numerical optimization theory (Luenberger, 1984) can be applied to overcome these drawbacks. One common approach is to upgrade the normal BP, which is a first-order learning algorithm, to a second-order one, the so-called Newton’s method. Since Newton’s method is an optimization algorithm with quadratic convergence speed (Luenberger, 1984), it can be used to improve the learning speed and accuracy of the normal BP. Also, since Newton’s

method is less sensitive to the learning constant, the choice of a proper learning constant is not difficult. However, several shortcomings of Newton’s method, as mentioned later, make its use for training the MLP quite limited. This article aims at conquering these shortcomings, and develops an efficient second-order learning algorithm for the MLP.

There are several problems in using Newton’s method to minimize the output error function of the MLP (Haykin, 1994; Lin and Lee, 1996).

1. Newton’s method needs to compute the second-order derivatives of the output error function with respect to the network weights, i.e. the Hessian matrix. Since the computation of the Hessian matrix needs global informa-tion, Newton’s method is not suitable for network com-putation. Besides, the large computation load of the Hessian matrix hinders its practical use in training MLPs. 2. One common strategy to simplifying the computation of Hessian matrix is to approximate the whole matrix by its diagonal terms only (we call it the diagonal Hessian matrix) (Battiti, 1992). This article will show that the * Corresponding author. Tel: +886 3 5712121 ext 54315; Fax: +886 3

5715998; e-mail: [email protected]

Neural Networks 11 (1998) 1607–1622 PERGAMON

Neural Networks

(2)

diagonal Hessian matrix does not maintain the major properties of the true Hessian matrix of a MLP, and thus cannot be used to improve the convergence speed and accuracy of BP learning efficiently.

3. In using Newton’s method for minimizing the output error of a MLP, each iteration requires the computation of the inverse of Hessian matrix, so the method is expensive in terms of both storage and computational requirements.

4. In order to converge, Newton’s method requires a good initial estimate of the solution. This further restricts the practical usability of Newton’s method on the MLP. From these points, we know that standard Newton’s method is not a practical technique for training the MLP. Although several alternatives or revised methods have been studied such as those based on conjugate-direction method and quasi-Newton method (Ricotti et al., 1988; Becker and LeCun, 1989; Makram-Ebeid et al., 1989; Bello, 1992), the aforementioned problems have not been seriously addressed and most problems still exist. Especially, in the existing second-order learning approaches, the computation-expensive nonlinear programming techniques in the numerical optimization theory are usually adopted, and the special properties of the Hessian matrix of the MLP are not taken into account to reduce the computation load.

In this article, we shall propose a novel second-order learning algorithm for the MLP, aimed at solving the four drawbacks of Newton’s method. We first propose an order-based-derivative scheme (Werbos, 1990; Piche, 1994) to derive a network computation method for computing the Hessian matrix of a given MLP network. This can be viewed as a network implementation of the Hessian matrix. With this scheme, the computation of the Hessian matrix of a MLP can be performed as signals flow forward and back-ward in the MLP network. This solves the first drawback of Newton’s method mentioned previously. From the proposed network computation method of the Hessian matrix, we can analyze the important properties of the Hessian matrix of the MLP easily; e.g. the Hessian matrix of a MLP is always a singular matrix. From such analysis, we clearly understand why the diagonal Hessian matrix is not a good approxima-tion of the original Hessian matrix. This theoretically explains why the existing second-order learning algorithms based on the diagonal Hessian matrix do not display the expected advantages over Newton’s method as mentioned in the second point of the last paragraph. To overcome this drawback, and avoid the large computation load of the whole Hessian matrix, we propose a better approximation of the Hessian matrix, called the block-diagonal Hessian matrix or block Hessian matrix, Hb, for short. The block

Hessian matrix keeps the singularity of the original Hessian matrix. Also, we show that the block Hessian matrix of a MLP is linearly dependent with the matrix formed by the gradients of the output error function E of this MLP, ⵜE. This dependency property makes the equation,ⵜE ¼ 0, the

weight space have solutions even though Hb is singular,

when we apply the block Hessian matrix (Hb) for

second-order approximation of the E function. Hence, like the original error function, there also exist extreme points for the error function which is second-order approximated by the block Hessian matrix.

Making use of the singularity and dependency property of the block Hessian matrix, we arrive at an efficient algorithm for solving the equationⵜE ¼ 0. This algorithm does not need the computation of the inverse of the block Hessian matrix explicitly, and thus solves the third drawback of Newton’s method mentioned earlier. In the proposed algo-rithm, we also apply the least squares estimation technique (Goodwin and Sin, 1984; Stefanos and Anastassioy, 1988) to modify the original Newton’s method. This further improves the convergence speed and accuracy of learning. Finally, since Newton’s method only guarantees finding the extreme points of error functions (i.e. the points that result inⵜE ¼ 0), which may be minimum or maximum points, we develop three protection methods in this article to prevent the proposed second-order learning algorithm from conver-ging in wrong directions. Among these three protection methods, two methods try to change the gradient of the error surface in the transient region, and keep the gradients in the steady states unchanged. These protection methods make Newton’s method insensitive to initial states, and solve the fourth drawback of Newton’s method.

The rest of this article is organized as follows. Section 2 derives a forward–backward propagation scheme for computing the Hessian matrix H of the MLP in the form of network operations. This section also defines the block Hessian matrix Hb, and shows the singularity of H and Hb,

and the linear dependency of HbandⵜE. In Section 3, the

special properties of the block Hessian matrix are adopted to develop an efficient algorithm that greatly reduces the com-putation load of the inverse Hessian matrix. This section also uses the least squares estimation technique to increase learning speed and accuracy. Extensive computer simula-tions and performance comparisons with normal BP and diagonal Hessian matrix approaches are made in Section 4. In this section, three protection methods are given to further improve the proposed second-order learning algorithm. Conclusion and discussion are presented in Section 5.

2. Block Hessian matrix of multilayer perceptrons

2.1. Network computation of Hessian matrix based on order-based derivative

The gradient-descent-based BP learning method was widely used for training the MLP. Although the gradient-descent (or steepest-gradient-descent) method is one of the simplest optimization techniques, it is not a very effective one. Numerical optimization theory provides a rich and robust 1608 Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(3)

set of techniques which can be applied to neural networks to improve learning rates. The gradient-descent method con-siders only the first-order derivative of an error function. It is helpful to take into account higher-order derivatives. Using Taylor’s series expansion on the error function E(w) of a MLP around the current point w0in the weight

space, we have E(w)¼E(w0)þ(w ¹ w0) T_ⵜE₍ w0) þ1 2(w ¹ w0) T H(w0)(w ¹ w0)þ…, ð1Þ where H(w0) is called Hessian matrix and is the

second-order derivative evaluated at w0:

H(w0)⬅ ⵜ 2 E(w)l_{w ¼ w} 0or Hij¼ ⳵2 E ⳵wi⳵wj (2) To find the minimum of E(w), we set its gradient to zero: ⵜE(w)¼ⵜE(w0)þH(w0)(w ¹ w0)þ…¼0 (3) If we ignore the third- and higher-order terms, we obtain w ¼ w0¹H

¹1₍

w0)ⵜE(w0) (4)

or using k to indicate the kth updating step, we obtain w(k þ 1)¼w(k)¹H¹1(w(k))ⵜE(w(k)) (5) This is called Newton’s method of weight updating. Newton’s method uses the second-order derivative in addition to the gradient to determine the next updating direction and step size. It can converge quadratically when close to a solution of a convex function. However, there are several drawbacks for Newton’s method as mentioned in the last section. In this subsection, we shall derive a network computation scheme with forward–backward signal propa-gation to simplify the computation of Hessian matrix of the MLP. We shall also study the important properties of Hessian matrix of the MLP, and then propose an approxi-mated matrix that keeps the important properties of the original one.

Consider a MLP network with L layers. For notation clarity, let #l denote the number of nodes in the lth layer of the MLP for l ¼ 1,2,…,L. The input of the jth node in the (l þ 1)th layer is represented by net(_jl þ 1), and the output by z(_jl þ 1)⬅ f(net(_jl þ 1))⬅ f(X

#l

k ¼ 1

w(_jkl)z(_kl)) (6) where f(·) is an activation function, and w(_jkl)is the connection weight between the kth node in layer l and the jth node in layer l þ 1. The output error function of the MLP is defined by E ¼1 2 X]L j ¼ 1 (z(_jL)¹dj)2 (7) where z(_jL)is the jth network output for the current input and djis the corresponding desired output.

We shall then use order-based derivative (Werbos, 1990; Piche, 1994) to derive the Hessian matrix of E with respect to connection weights w(_jil)and w(mnp)according to Eq. (2). We shall see that the use of order-based derivative can lead to a network implementation of the Hessian matrix; i.e. the com-putation of the Hessian matrix can be proceeded through the forward and backward data flow in a MLP network. Order-based derivative is defined as the deviation of a function, E(z1,z2,…,zn), with respect to the deviations of a set of

variables, {z1,z2,…,zn}, where this set of variables form a

order set; i.e. any variable, zj, is a function of variables

{z1,z2,…,zj¹1}. Due to the nest relationship in this set of

variables, the total derivative of E(z1,z2,…,zn) with respect

to zj, j ¼ 1,2,…,n ¹ 1 can be obtained recursively, based on

the order relationship of the set of dependent variables, {z1,z2,…,zn}. The approach to such a recursive computation

is called an order-based-derivative scheme. The reader is referred to (Piche, 1994) for a detailed example. According to the types of ordering, two kinds of order-based deriva-tives, forward and backward, can be distinguished. The con-cept of an order-based derivative is similar to that of a partial derivative formed by chain rule with ordinary deri-vatives. However, the former uses more precise notation to distinguish the value and ordering of a derivative, which are easily mixed up in the notation of partial derivative. Hence, as compared with using chain rule to derive first-order deri-vatives in a backpropagation algorithm, the concept of order-based derivative is more suitable for deriving com-plex and higher-order derivatives, such as the second-order derivatives in the Hessian matrix of a MLP.

At first, we use the backward scheme to derive the first-order first-order-based derivative of E with respect to wji:

⳵þ E ⳵w(l) ji ¼⳵z (l þ 1) j ⳵w(l) ji ⳵þ E ⳵z(l þ 1) j ¼(f⬘(net_j(l þ 1))z(_il)) ⳵ þ E ⳵z(l þ 1) j (8) where l ¼ 1,2,…,L ¹ 1, and(⳵þE)=(⳵z(_jl þ 1))is computed by the following backpropagation rule

⳵þ_E ⳵jz(s) ¼ ⳵E ⳵z(s) j þ X #(s þ 1) k ¼ 1 ⳵z(s þ 1) k ⳵z(s) j ⳵þ_E ⳵z(s þ 1) k (9) where s ¼ L,…,(l þ 1) and j ¼ 1,…,#s. Notice that

(⳵z(s þ 1) k )=(⳵z (s) j )¼0 if s ¼ L, and (⳵E)/(⳵z (s) j ) ¼ 0 if s⫽ L. Next, we find the second-order order-based derivative of E with respect to w(_jil)and another connection weight w(mnp)by computing the order-based derivative of Eq. (8) with respect to w(mnp), assuming pⱕ l: ⳵þ ⳵w(p) mn ⳵þ ⳵w(l) ji E ¼ ⳵ þ ⳵w(p) mn (f⬘(net_j(l þ 1))z(_il)) ⳵ þ E ⳵z(l þ 1) j þ ⳵ þ ⳵w(p) mn ⳵þ E ⳵z(l þ 1) j ! ·f⬘(net(_jl þ 1))·z(_il) ð10Þ where the term[(⳵þ)=(⳵w(mnp))][(⳵

þ

E)=(⳵z(_jl þ 1))]can be com-puted by the order-based derivative of Eq. (9) and has the Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(4)

following backward form ⳵þ ⳵w(p) mn ⳵þ E ⳵z(s) j ! ¼ X #(s þ 1) k ¼ 1 ⳵z(s þ 1) k ⳵z(s) j ⳵þ ⳵w(p) mn ⳵þ E ⳵z(s þ 1) k ! þ X #(s þ 1) k ¼ 1 ⳵þ ⳵w(p) mn (f⬘(net(_ks þ 1)))w(_kjs) ⳵ þ E ⳵z(s þ 1) k ð11Þ where s ¼ L ¹ 1,…,(l þ 1) and j ¼ 1,…,#s. Notice that the initial state of Eq. (11) at s ¼ L is

[(⳵þ₎₌₍_⳵w(p) mn)][(⳵ þ E)=(⳵z(_jL))]¼[(⳵þ)=(⳵w(mnp))](z( L) j ).

The first term of the right-hand side (RHS) of Eq. (11) is the same as the formula in the normal backpropagation algorithm, whereas the second term can be viewed as the bias term of the jth node in the sth layer of a MLP. If the differential of the activation function f with respect to its net input can be expressed as a function of its output z(_js), then calculating the bias term in Eq. (11) needs only computation of the order-based derivative [(⳵þ)=(⳵w(p))]z(_js). For example, if f(net)¼[2=(1 þ exp(¹l·net))]¹1 ¼ zj (i.e. sigmoidal function), then f⬘(net)¼(l=2)(1 ¹ z2j), and

[(⳵þ₎₌₍_⳵w_)]

f⬘(net)¼ ¹lzj[(⳵

þ

zj)=(⳵w)]. Hence to compute Eq. (11) in backward direction, we need to compute

[(⳵þ₎₌₍_⳵w(p) mn)]z(

s)

j , for s ¼ L, L ¹ 1,…, p þ 1. We shall then derive an algorithm for computing the bias term in Eq. (11) in the form of network operations like the backpropagation algorithm. Since this algorithm performs forward propagation on the network, it is called the forward propagation rule. This forward propagation rule can also be used to compute the first term of the RHS of Eq. (10). At first, we have ⳵þ_z(s) j ⳵w(p) mn ¼ X #(s ¹ 1) k ¼ 1 ⳵z(s) j ⳵z(s ¹ 1) k ⳵þ z(_ks ¹ 1) ⳵w(p) mn þ ⳵z (s) j ⳵w(p) mn ¼ X #(s ¹ 1) k ¼ 1 wjkf⬘(net( s) j ) ⳵þ_z(s ¹ 1) k ⳵w(p) mn þ ⳵z (s) j ⳵w(p) mn ð12Þ where p þ 1⬍ s ⱕ L. Notice that if s ¼ p þ 1 then Eq. (12) becomes ⳵þ z(_jp þ 1) ⳵w(p) mn ¼⳵z (p þ 1) j ⳵w(p) mn ¼ z (p) n m ¼ j 0 m⫽ j, ( (13) and if s⫽ p þ 1 then the term,(⳵z(_js))=(⳵w(mnp))in Eq. (12) is zero.

The formula in Eq. (12) can be represented in a network-operation form to simplify the computation. Consider a node in a MLP whose output is z(_js). The order-based deri-vative of this node’s output with respect to the connection weight w(nmp) in layer p, i.e. (⳵þz(

s) j )=(⳵w(

p)

mn), is equal to the output activation value of the network shown in Fig. 1, when we let the inputs, net(mp þ 1)¼1 and net(

p þ 1)

j ¼0,j⫽ m,

forward propagate through the network. The network shown in Fig. 1 has the same topology as the original MLP whose Hessian matrix is to be computed, whereas their node functions are different as indicated in Fig. 1. From Eq. (12) and Fig. 1, we find that each forward propagation starting from the mth node of the ðp þ 1Þth layer can compute all the terms of(⳵z(_js))=(⳵w(mnp)), for n ¼ 1,…,#p, where p þ 1ⱕ s ⱕ L. Moreover, for a specific z(_js), the vector [(⳵þz_j(s))=(⳵w(mnp))]n ¼ 1, …,#p is

proportional to [z(np)]n ¼ 1, …,#p. The forward propagation

rule in Eq. (12) can be used to compute the first term of Fig. 1. Network for computing the order-based derivatives(⳵þz(s)j )=(⳵w(p)mn)using the forward propagation rule, where each node performs the operation,䉺,

which computes the product net·f⬘(net), where net is the net input of the node.

(5)

the RHS of Eq. (10) and the second term of the RHS of Eq. (11).

The computations of Eqs. (10)–(12) are summarized as follows. In computing the element of Hessian matrix

(⳵þ_⳵þ_E₎₌₍_⳵w(p) mn⳵w(

l)

ji), we perform one forward propagation on the network in Fig. 1 starting from the mth node in layer p þ 1 to obtain all the bias terms in Eq. (11) and the first term of the RHS of Eq. (10). Then according to Eq. (11) and then Eq. (10), we perform one backward propagation pro-cess on the original network to find all the terms in the column of the Hessian matrix right below the term of

(⳵þ_⳵þ

E)=(⳵w(mnp)⳵w( p)

ji ); i.e. the terms(⳵

þ_⳵þ

E)=(⳵w(mnp)⳵w( l) ji) for all lⱖ p. Repeat this process until all such terms in all the columns of the Hessian matrix are obtained. Finally, we can apply the symmetric property and the column propor-tional property of the Hessian matrix of the MLP shown in the following lemma to get all the other terms and obtain the complete Hessian matrix. For a MLP with N nodes, its whole Hessian matrix can be obtained after only (N ¹ k1)

cycles of forward and backward propagation, where k1is the

number of input nodes. The forward–backward propagation corresponding to the pth layer will produce #(p þ 1)·#p columns in the Hessian matrix, in which #(p þ 1) columns are linearly independent (as being shown in the next sub-section). Moreover, the network for a forward–backward propagation will be getting smaller; i.e. the computation load is lessening as the forward–backward propagation process proceeds.

As an example, in computing the Hessian matrix of a four-layer fully-connected MLP with structure N5,3,3,1

(meaning that the node number of each layer is 5, 3, 3, 1, respectively, from the input layer), the sizes of the networks to be propagated and the obtained columns of Hessian matrix for each cycle of forward–backward propagation are listed in Table 1. Correspondingly, the form of the

resultant Hessian matrix after the seven cycles of forward–backward propagation is shown in Fig. 2. In Fig. 2, columns a1, a2, a3, b1, b2, b3, and c1correspond to the first, second, …, and seventh cycle of forward– backward propagation, respectively. This correspondence is also shown in Table 1. More clearly, the first cycle of forward–backward propagation produces the five columns of Hessian matrix indicated by a1 in Fig. 2.

These five columns consist of the order-based derivatives, (⳵þ⳵þE)=(⳵w_1n(1)⳵w(_jip)), for n ¼ 1,2,…,5, where {w(_jip)} ¼ {w(₁₁1), …,w(₃₅1),w(₁₁2), …,w(₃₃2),w(₁₁3),w(₁₂3),w(₁₃3)} i.e. those weights shown on the right-hand side of the Hessian matrix in Fig. 2. Similarly, the second and third cycles of forward–backward propagation produce those columns indicated by a2 and a3 in Fig. 2, respectively,

each set including five columns. They consist of the order-based derivatives, (⳵þ⳵þE)=(⳵w(_2n1)⳵w(_jip)) and ⳵þ_⳵þ_E_=⳵w(1)

3n⳵w

(p)

ji , for n ¼ 1,2,…,5. The fourth cycle of forward–backward propagation produces the three columns of Hessian matrix indicated by b1 in Fig. 2. These three

columns consist of the order-based derivatives,

(⳵þ_⳵þ

E)=(⳵w(_1k2)⳵w_ji(p)) , for k ¼ 1,2,3, where {w_ji(p)} ¼ {w₁₁(1), …,w(₃₃2),w(₁₁3),w₁₂(3),w(₁₃3)} ,i.e. those weights shown on the corresponding right-hand side of the Hessian matrix in Fig. 2. Similarly, the fifth and sixth cycles of forward–backward propagation produce those columns indicated by b2 and b3 in Fig. 2, respectively,

each set including three columns. They consist of the order-based derivatives, (⳵þ⳵þE)=(⳵w(_2k2)⳵w(_jip)) and

(⳵þ_⳵þ_E₎₌₍_⳵w(2) 3k⳵w

(p)

ji ), for k ¼ 1,2,3. Finally, the seventh cycle of forward–backward propagation produces the three columns of Hessian matrix indicated by c1in Fig. 2.

These three columns consist of the order-based derivatives,

(⳵þ_⳵þ_E₎₌₍_⳵w(3) 1k⳵w

(p)

ji ), for k ¼ 1,2,3, where

{w(_jip)} ¼ {w(₁₁3),w(₁₂3),w(₁₃3)}. Hence, in Fig. 2, all the elements in the columns indicated by a1, a2, a3, b1, b2, b3, and c1are computed by seven cycles of forward–backward propaga-tion. The rest of the elements of the Hessian matrix (i.e. the Table 1

The sizes of the networks to be propagated and the obtained columns of Hessian matrix for each forward–backward propagation on the MLP, N5,3,3,1

Time Subnetwork size for

forward–backward propagation Columns computed in Hessian matrix 1st time (1→ 3 → 1) þ (3 ← 3 ← 1) a1: 5 columns 2nd time (1→ 3 → 1) þ (3 ← 3 ← 1) a2: 5 columns 3rd time (1→ 3 → 1) þ (3 ← 3 ← 1) a3: 5 columns 4th time (1→ 1) þ (3 ← 1) b1: 3 columns 5th time (1→ 1) þ (3 ← 1) b2: 3 columns 6th time (1→ 1) þ (3 ← 1) b3: 3 columns 7th time (1) þ (1) c1: 3 columns

Here, considering the first forward–backward propagation for example, (1→ 3 → 1) þ (3 ← 3 ← 1) means the signals flow forward from a layer-two node to the three layer-three nodes, and to the single output node in layer four, and then the signals flow backward from the single output node to the three three nodes, and finally to the three layer-two nodes. The second and third cycles of forward–backward propagation are performed on similar subnetworks, but they start at a different node in layer two.

Fig. 2. Form of resultant Hessian matrix after seven times of forward– backward propagation on the MLP, N5,3,3,1, where m ¼ 1,2,3, n ¼ 1,2,…,5, and k ¼ 1,2,3.

(6)

upper-left empty part in the Hessian matrix shown in Fig. 2) can be obtained directly from the computed elements by using the symmetric property of Hessian matrix shown in Section 2.2.

2.2. Block Hessian matrix and its properties

In this subsection, we shall study some properties of the Hessian matrix of the MLP based on the order-based-derivative formulas derived in the last subsection. Consider a MLP network with L layers, each layer containing an extra node (called ‘threshold node’) with fixed activation value ¹ 1 to provide the threshold value for each node in the next layer. At first, we show that the Hessian matrix derived from the order-based-derivative scheme in the last subsection keeps the symmetric property of a Hessian matrix.

Lemma 1. The matrix whose elements are given in Eq. (10) is symmetric, i.e. ⳵þ ⳵w(p) mn ⳵þ ⳵w(l) ji E(z(L))¼ ⳵ þ ⳵w(l) ji ⳵þ ⳵w(p) mn E(z(L)) (14)

where z(L)is the output vector of the MLP.

Proof. Let E˜ (w(1),…,w(L)) represent the error function EðzðLÞÞ expressed in terms of connection weight vectors, w(s)⬅[w(_jis)], for s ¼ 1,…,L. According to the concept of order-based derivative, we have

⳵þ E(z(L)) ⳵w(l) ji ¼ ⳵ ⳵w(l) ji ˜ E(w(1), …,w(L)) (15) and ⳵þ ⳵w(p) mn ⳵þ ⳵w(l) ji E(z(L))¼ ⳵ 2 ⳵w(p) mn⳵w(jil) ˜ E(w(1), …,w(L)) (16) Similarly, we can obtain

⳵þ ⳵w(l) ji ⳵þ ⳵w(p) mn E(z(L))¼ ⳵ 2 ⳵w(l) ji⳵w (p) mn ˜ E(w(1), …,w(L)) (17) Since the derivative ordering is changeable for partial derivative,Eq. (16) is equal to Eq. (17), i.e.

⳵þ ⳵w(p) mn ⳵þ ⳵w(l) ji E(z(L))¼ ⳵ þ ⳵w(l) ji ⳵þ ⳵w(p) mn E(z(L)) (18)

This completes the proof.

Notice that the change of derivative ordering in Lemma 1 (see Eq. (14) might not be allowed by other forms of derivatives, e.g. [(⳵þ=⳵zi)(⳵=zj)]E⫽[(⳵=⳵zj)(⳵þ=⳵zi)]E and

[⳵þ₌₍_⳵w(p) mn)][⳵þ=(⳵z( p) j )]E⫽[⳵ þ₌₍_⳵z(p) j )][⳵ þ₌₍_⳵w(p) mn)]E. The following theorem shows an important property of the Hessian matrix of the MLP.

Theorem 1. Consider a MLP network whose error function

is E ¼(1=2)X#Li ¼ 1(z( L) i ¹di)

2

. Assume the differential of the node activation function of the MLP can be expressed as a function of the node output (e.g. assume the activation func-tion is a sigmoidal funcfunc-tion). Then the Hessian matrix of E with respect to the connections weights of the MLP is a singular matrix.

Proof. We shall first show that for a fixed m, we have

[⳵þ₌₍_⳵w(1) mn)][(⳵þE)=(⳵w( l) ji)]¼D l jiz(n1), where Dljiis a constant determined by the forward–backward propagation compu-tation between layer 2 and layer l þ 1. From the forward propagation rule in Eq. (12), we know that(⳵þz(_js))=(⳵w(mn1))is proportional to z(n1), and the proportional constant, say c(

s) j , is dependent on the node output in concern, i.e. z(_js). Hence we have (⳵þz(_js))=(⳵w(mn1))¼c(

s) j z(

1)

n . Considering Eq. (11) for computing[⳵þ=(⳵w(mn1))][(⳵

þ

E)=(⳵z(_js))], since f⬘(net(_ks þ 1))is a function of z(_ks þ 1), the second term of the RHS of Eq. (11) is proportional to z(n1). Also, since ⳵

þ_=⳵w(1)

mn(⳵

þ

E=⳵z(_jL)) is proportional to z(n1), the first term of the RHS of Eq. (11) is also proportional to z(n1). Hence the term

⳵þ_=⳵w(1)

mn(⳵þE=⳵z( s)

j ) computed in Eq. (11) is proportional to z(n1). This results in the fact that the second term of the RHS of Eq. (10) is proportional to z(n1), when Eq. (10) is used to compute the term[⳵þ=(⳵wmn(1))][(⳵þE)=(⳵w(

l)

ji)]. Since the first term of the RHS of Eq. (10) is also proportional to z(n1), we conclude that ⳵þ ⳵w(1) mn ⳵þ_E ⳵w(l) ji ¼Dljiz( 1) n , (19)

where Dlji is a constant determined by the forward– backward propagation computation between layer 2 and the jth node of layer l þ 1.

With the same reason as in the above analysis, we can prove that the next column of the Hessian matrix associated with the same m has the same property:

⳵þ ⳵w(1) mn þ 1 ⳵þ_E ⳵w(l) ji ¼Dljiz( 1) n þ 1 (20)

for all l ¼ 1,…,L ¹ 1 and i,j ¼ 1,…,#l. From Eqs. (19) and (20), we find that there exist at least two columns in the Hessian matrix which are linearly dependent, so the Hessian matrix is singular. This completes the proof.

In the above theorem, we assume there are at least two input nodes for a MLP. This assumption is always true if a threshold node is added to each layer of the MLP. Many existing literatures point out that the computation of the Hessian matrix of a MLP is quite complex (Becker and LeCun, 1989; LeCun, 1989), partly because of the large size of the Hessian matrix, and to the lack of its network implementation (such as the form we derived earlier). To simplify the computation, the diagonal terms of the Hessian matrix were usually used to approximate the whole Hessian matrix. From the aforementioned analysis, we clearly see that such approximation is improper, since the singular property of the true Hessian matrix of a MLP is not 1612 Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(7)

preserved by the approximating diagonal Hessian matrix. This also explains why Newton’s method based on the diagonal Hessian matrix cannot speed up the weight con-vergence in MLP learning efficiently (Battiti, 1992). The network implementation of the Hessian matrix proposed in the last subsection dose make its computation easier. To further simplify the computation, we shall next derive a block-diagonal Hessian matrix or block Hessian matrix, Hb, for short, to approximate the true Hessian matrix, H. In

the block Hessian matrix, we consider only the second-order order-based derivatives of the error function with respect to the two weights in the same layer, (⳵þ⳵þE)=(⳵w(mnp)⳵w(

p) ji ), and let the other second-order order-based derivatives to zero, i.e.(⳵þ⳵þE)=(⳵w(mnp)⳵w(

l)

ji)¼0 for p⫽ l. This reduces the computation load of the Hessian matrix greatly. More importantly, the block Hessian matrix keeps the singularity property of the true Hessian matrix as shown in the following.

Consider a MLP with a threshold node in each layer. The elements of the block Hessian matrix of the MLP are arranged in the order of layer numbers and indexes of weights such that the derivatives with respect to the weights in the same layer flock together in a block lying along the diagonal of the Hessian matrix, i.e.

H ¼ H(_b1) ⴱ ⴱ ⴱ ⴱ ⴱ ] ⴱ ⴱ ⴱ ⴱ ⴱ H(p) b ⴱ ⴱ ⴱ ⴱ ⴱ ] ⴱ ⴱ ⴱ ⴱ ⴱ H(L ¹ 1) b 2 6 6 6 6 6 6 6 6 6 6 4 3 7 7 7 7 7 7 7 7 7 7 5 , Hb⬅ H(_b1) 0 0 0 0 0 ] 0 0 0 0 0 H(_bp) 0 0 0 0 0 ] 0 0 0 0 0 H(_bL ¹ 1) 2 6 6 6 6 6 6 6 6 6 6 4 3 7 7 7 7 7 7 7 7 7 7 5 ð21Þ

where H(_bp) with dimension (#(p þ 1) ¹ 1)(#p)⫻ (#(p þ 1) ¹1)(#p) is the block corresponding to the pth layer of the MLP. Notice that ‘ ¹ 1’ in the term (#(p þ 1) ¹ 1) is due to the threshold node we assumed in each layer except the output layer. Hence for the output layer, ‘ ¹ 1’ should be taken out; i.e. the dimension of H(_bL ¹ 1)is (#L)(#(L ¹ 1))⫻ (#L)(#(L ¹ 1)). This notation simplification will also be used throughout the following development.

The following theorem shows that the block Hessian matrix of a MLP is singular.

Theorem 2. Under the same assumptions in Theorem 1, each submatrix H(_bp) of the block Hessian matrix Hb is a

singular matrix with rank(H(_bp))ⱕ #(p þ 1) ¹ 1.

Proof. As we noted in the above that H(_bp) is a ð#ðp þ 1Þ ¹ 1Þ(#p) ⫻ (#(p þ 1) ¹ 1)(#p) matrix. Each element of H(_bp) is computed by Eq. (10). To obtain a set of columns of H(_bp) corresponding to some fixed m, for n ¼ 1,2,…,#p, we need to compute the term⳵þ=⳵w(_mnp)(f⬘(net(_ks))), for s ¼ p þ 1,…,L, which requires the computation of the order-based derivative(⳵þz(_js))=(⳵w(mnp))assuming the sigmoi-dal activation function is used [see the first term of the RHS of Eq. (10) and the second term of the RHS of Eq. (11)]. From the forward propagation rule in Eq. (12) with initial value z(np), we have ⳵þ z(_js) ⳵w(p) mn ¼c(_js)z(_np) for all p þ 1ⱕ s ⱕ L (22) where c(_js) is a constant determined by the connection weights between the node with output z(_jp þ 1) and the node with output z(ms), and is thus not a function of z(

p)

n . Hence, when we compute some specific row of H(_bp)[i.e. the indexes i and j are fixed in Eqs. (10)–(12)], the first term of the RHS of Eq. (10) is proportional to z(np).

We shall next show that the second term of Eq. (10) is also proportional to z(np)when we compute some specific row of H(_bp). This is achieved by showing that the first and second terms of the RHS of Eq. (11) are both proportional to z(np). With the same reason in deriving Eq. (22), we see that the bias term [i.e. the second term of the RHS of Eq. (11)] is also proportional to z(_np), and the proportion constant is deter-mined by the forward–backward propagation computation between node m in layer p þ 1 to the specific row position. As to the first term of Eq. (11), it is obtained by the backpropagation rule with initial value

⳵þ ⳵w(p) mn ⳵þ_E ⳵z(L) j ! ¼ X #L k ¼ 1 ⳵þ ⳵w(p) mn (dk¹z( L) k ) (23)

where dk is the kth component of the desired output.

According to Eq. (22), Eq. (23) is also proportional to z(np). Hence the first term of the RHS of Eq. (11) is also proportional to z(np).

This analysis proves that the order-based derivative com-puted by Eq. (11) and thus that comcom-puted by Eq. (10) are proportional to z(np). Hence for a specific m, the adjacent two columns of the H(_bp)matrix are proportional to z(np)and z(

p) n þ 1, respectively. Since the full columns of H(_bp)are spanned by the vectors[⳵þ=(⳵w(mnp))][⳵þ=(⳵w(p))E]and the earlier analy-sis shows that they are also spanned by the vectors

[⳵þ₌₍_⳵w(p) m1)][⳵

þ₌₍_⳵w(p)₎_E_]_{, for m ¼ 1,}…_{,# (p þ 1) ¹ 1,} we can conclude that rank(H(_bp)) ⱕ #(p þ 1)¹1. This completes the proof.

This theorem shows that each submatrix (block) of the block Hessian matrix is singular. This obviously implies that the whole block Hessian matrix is singular. Since the rank of H(_bp)is equal to the number of nodes in the layer next to the layer corresponding to H(_bp),the equality in Theorem 2 usually holds, i.e. rank(H_b(p))¼#(p þ 1)¹1. Further, according to the property in Theorem 2, we can decompose Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(8)

the H(_bp) matrix into #(p þ 1) ¹ 1 submatrices, each with dimension (#(p þ 1) ¹ 1)(#p)⫻ (#p). Matrix H(_bp)can thus be represented by H(_bp)¼ X #(p þ 1)¹1 k ¼ 1 h(_kp) 0k ¹ 1,z(p),0#(p þ 1)¹1 ¹ k (24)

where h(_kp) is the vector derived by Eq. (11) and Eq. (12), 0j

is a zero(row) vector containing j·(#p) zeros, and z(p)is the output (row) vector of layer p.

From numerical optimization theory (Luenberger, 1984), a system usually has no extreme points if its Hessian matrix is singular. However, since there are usually many extreme points in the error surface of a MLP, especially for large networks, we expect that HbandⵜE are linearly dependent.

Otherwise, the Hbmatrix cannot be used for second-order

approximation of the E function in finding the extreme points of E by Newton’s method. The following theorem proves this property.

Theorem 3. Under the same assumptions in Theorem 1 and assuming that the rank of H(_bp)is #(p þ 1) ¹ 1 and there is no zero element inⵜE(p), then the rank of matrix[H(_bp),ⵜE(p)] is also equal to #(p þ 1) ¹ 1, whereⵜE(p)is the gradient of E corresponding to layer p.

Proof. By the backpropagation rule, we can derive ⳵þ E ⳵w(p) mn ¼cmz( p) n , (25)

where nonzero constant cmis not a function of connection

weights of layer p. For convenience, we let s ¼ #ðp þ 1Þ ¹ 1.Then Eq. (24) can be rewritten as

H(_bp)¼ X s k ¼ 1 y(kp) 0k ¹ 1, ⳵þ E ⳵w(p) k !T ,0s ¹ k " # (26) where y(kp)¼h (p) k =ck, and [(⳵ þ E)=(⳵w(_kp))]T¼ [(⳵þ E)=(⳵w(_k1p)), …, (⳵þE)=(⳵w(_kpp))].

By the symmetric property of a Hessian matrix, we have

(H(_bp))T¼H(_bp)¼ X s k ¼ 1 0Tk ¹ 1 ⳵þ_E ⳵w(p) k 0Ts ¹ k 2 6 6 6 6 6 4 3 7 7 7 7 7 5( y(_kp))T (27)

Notice that the matrix

M⬅ (yp₁)T ⯗ (yps) T 2 6 6 4 3 7 7 5

is a s ⫻ (#p)(#(p þ 1) ¹ 1) matrix. Since rank(M) ¼ #ðp þ 1Þ ¹ 1 ¼ s, there exists a series of column operations such that the s ⫻ 1 unit vector, [11,12,…,1s]

T

, can be spanned by the columns of M. Hence, the vector ⵜE(p)

¼{[(⳵þE)=(⳵w(₁p))], …, [(⳵þE)=(⳵w(_sp))]}T can also be spanned by the columns of H(_bp), and thus rank[H(_bp),ⵜE(p)]¼s. This completes the proof.

This theorem shows that H(_bp) and ⵜE(p) are linearly dependent. This implies that the block Hessian matrix Hb

is linearly dependent with

ⵜE ⬅ ⵜE(1) ⯗ ⵜE(L ¹ 1) 2 6 6 4 3 7 7 5 since Hb is composed of H( p) b as defined in Eq. (21). Theorems 2 and 3 show that the block Hessian matrix Hbpreserves the singularity and extreme-point properties

of the true Hessian matrix H. Hence we can approximate the error function of a MLP, E⬅ ¹(1=2)X#Lk ¼ 1(z( L) k ¹dk) 2_, by ˜ E(w)¼E(w0)þ Kw T_ⵜE₍ w0)þ(1=2)Kw T HbKw. In the next section, we shall derive an efficient second-order learning algorithm for the MLP by minimizing such an approximated error function.

3. Second-order learning algorithm based on block Hessian matrix

In this section, we shall develop a second-order learning algorithm for the MLP based on the block Hessian matrix. By making use of the properties of the block Hessian matrix, this algorithm can reduce the computation load of the block Hessian matrix, its inverse, and the process for finding the least squares solution.

3.1. Inverse of block Hessian matrix

In using Newton’s method to minimize an error function E approximated by the block Hessian matrix, we need to solve the linear equations,ⵜE ¼ ¹ HbKw. Since ⵜE and

Hbare linear dependent as shown in the last section, we cannot

compute the inverse of the block Hessian matrix, Hb¹1, directly to solve the linear equations, ⵜE ¼ ¹ HbKw. To

address this problem, we apply the Levenberg–Marquardt method to replace Hb¹1by (lI þ Hb)¹1(LeCun, 1989; Silva

and Almeida, 1990), where l is an arbitrary small positive number. Hence, the problem now is to solve the linear equa-tions, Kw ¼ ¹ (lI þ Hb)¹1ⵜE, i.e. Kw(

p)

¼ ¹(lI þ H(bp))

¹1_ⵜE(p)_,

p ¼ 1,2,…,L ¹ 1. We shall next pro-pose an algorithm to simplify the computation of (lI þ H(_bp))¹1 based on the special property of the block Hessian matrix.

(9)

According to Eq. (24), the inverse matrix (lI þ H(bp)) ¹1 can be represented by (lI þ H(_bp))¹1¼ lI þ X #(p þ 1)¹1 k ¼ 1 h(_kp)0k ¹ 1, z(p),0#(p þ 1)¹1 ¹ k !¹1 ð28Þ for p ¼ 1,2,…,L ¹ 1. The inverse in Eq. (28) can be easily obtained by the matrix inverse lemma (Kailath, 1980). The matrix inverse lemma says if A is an invertible matrix, and u and v are two column vectors then the following equality holds,

(A þ uvT)¹1¼A¹1¹(A

¹1

u)(vTA¹1)

1 þ vT_A¹1_u (29)

Using Eq. (29) recursively, we can find the (#(p þ 1) ¹ 1)(#p) ⫻ (#(p þ 1) ¹ 1)(#p) inverse matrix of Eq. (28) after #(p þ 1) ¹ 1 iterations. It takes (#(p þ 1) ¹ 1)(#p) iterations to compute such an inverse matrix in the normal way. Hence, by making use of the singular property of H(_bp), we reduce the computation load of (lI þ H(bp))

¹1 by #p times.

Another advantage of this method is that the computation of the matrix inverse can be performed separately for dif-ferent layers p, p ¼ 1, 2,…, L ¹ 1, due to its recursive form. This can shorten the duty cycle of computation. Hence, with the proposed computation method based on the matrix inverse lemma, the use of the block Hessian matrix for substituting the diagonal Hessian matrix in Newton’s method can preserve the property of the true Hessian matrix at the expense of only small extra computation load. 3.2. Least squares update method

Consider that there are x(1), x(2),…, x(s) training patterns for training a MLP. The corresponding desired output vectors are d(1), d(2),…, d(s). The total error function to be minimized is Etotal⬅ 1 2 Xs i ¼ 1 kd(i)¹z(L)(i)k2 (30)

where zðLÞð1Þ, z(L)(2),…, z(L)(s) are the network output vectors corresponding to the inputs x(1), x(2),…, x(s), respectively. The concept of standard Newton’s method is to approximate Etotal by a second-order function, E˜total, as

follows ˜

Etotal(w(k)þDw)¼Etotal(w(k))þ(ⵜEtotal)TDw þ1

2Dw T

HtotalKw, ð31Þ

where Htotal is the Hessian matrix of Etotal taking value at

weights w(k). Then the minimum points of Etotal are

obtained approximately by finding the minimum points of

E˜totalaccording to

Dw ¼ ¹ HtotalⵜEtotal¹1 (32)

In using the above update rule, we usually find the average gradient value over s training patterns to getⵜEtotal, and the

average of s Hessian matrices corresponding to the s training patterns to get Htotal. From qualitative analysis,

since this approach smooths the update direction and size corresponding to each training pattern, it cannot speed up the convergence speed efficiently.

We shall now adopt the least squares (LS) estimation technique to derive a fast update rule for Dw. At first, we apply Newton’s method on each individual training pattern and produce s gradient equations:

H(x(i)lw(k))Dw ¼ ¹ⵜE(x(i)lw(k)) (33) for i ¼ 1, 2,…, s. Then we solve the following combination of equations in the LS sense,

H(x(1)l_w(_k)) ⯗ ⯗ H(x(s)lw(k)) 2 6 6 6 6 6 4 3 7 7 7 7 7 5 Dw ¼ ¹ ⵜE(x(1)l_w(_k)) ⯗ ⯗ ⵜE(x(s)lw(k)) 2 6 6 6 6 6 4 3 7 7 7 7 7 5 (34) to obtain Dw ¼ ¹ HT1H1þ…þH T sHs ¹1 HT1ⵜE1þ…þH T sⵜEs (35) where Hi¼H(x(i)lw(k)) andⵜEi¼EðxðiÞlwðkÞÞ. To solve Eq. (34), we need to find lI þ HT1H1þ…þH

T

sHs

¹1

according to the Levenberg–Marquardt method. By apply-ing the matrix inverse lemma [Eq. (29)] recursively, this can be done incrementally for each available Hi, i ¼ 1,2,…,s.

Hence to solve Eq. (34), we need to find each Hessian matrix and the corresponding(HTiHi)¹1, i ¼ 1,2,…,s. For notation simplicity, we omit the subscript i hereafter. Since HTH is singular, (lI þ HTH)¹1 is used to replace (HTH)¹1as mentioned in the last subsection. To reduce the computation load in solving Eq. (34), we use the block Hessian matrix Hbto approximate H as discussed in the previous

sections.

Consider the block Hessian matrix, Hb, defined by

Eqs. (21) and (24), which are repeated here, Hb¼diagonal {H( 1) b ,H (2) b , …,H (L ¹ 1) b } (36) where H(_bp)¼X#_{k ¼ 1}(p þ 1)¹1h_k(p)0_{k ¹ 1},z(p), 0_#₍_{p þ 1}₎_¹_{1 ¹ k}ÿ, p ¼ 1,2, …, L ¹ 1. Due to the block structure of the block Hessian matrix, the inverse of(lI þ HTbHb)can be obtained by finding the inverse of each block (corresponding to one layer of the MLP), (lI þ(H(_bp))TH(_bp))(¹1), as we did in Section 3.1. According to Lemma 1, H(_bp) is a symmetric Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(10)

matrix, i.e.(H(_bp))T¼H(_bp), so we have (H(_bp))TH(_bp)¼H(_bp)(H(_bp))T¼X_{k ¼ 1}#(p þ 1)¹1kz(p)k2h(_kp)(h(_kp))T ¼ kz(p)k2· X #(p þ 1)¹1 k ¼ 1 h(kp)(h (p) k ) T (37) for p ¼ 1,2,…,L ¹ 1. We can then apply the matrix inverse lemma to find (lI þ H(_bp)(H(_bp))T)¹1¼ lI þ kz(p)k2· X #(p þ 1)¹1 k ¼ 1 h(_kp)(h(_kp))T !¹1 (38) recursively.

Let us now analyze the computation complexity of the above approach. Let np¼(#(p þ 1) ¹ Bp)(#p), where Bp¼

1 if p⫽ L ¹ 1 and Bp¼0 if p ¼ L ¹ 1. For general matrix

multiplication, the number of multiplications needed in computing HTb Hbis

X_{L ¹ 1}

i ¼ 1n

3

i, and the number of additions needed isXL ¹ 1i ¼ 1(ni¹1)n

2

i. However, if we take the singular property of HTbHb into account and compute H

T bHb

according to Eq. (37), the number of multiplications needed in computing HTbHbis

X_{L ¹ 1}

p ¼ 1[(#(p þ 1)¹Bp)n

2

pþ#p þ np],

and the number of additions needed is

X

L ¹ 1

p ¼ 1[([(#(p þ 1)¹Bp)n2pþnp¹1]. For example, consider a four-layer MLP with structure N23414 , where n1¼4, n2¼

9, n3¼4. The number of multiplications needed for

com-puting HTbHb using normal matrix multiplication is 64 þ

729 þ 64 ¼ 857, and the number of additions needed is 48 þ 648 þ 48 ¼ 744. For the proposed simplified method, the number of multiplications needed is (2 þ 4 þ 16⫻ 2) þ (3 þ 9 þ 81⫻ 3) þ (4 þ 4 þ 16) ¼ 317, and the number of additions needed is 35 þ 251 þ 19 ¼ 305. Further, if the symmetric property of the Hb matrix is considered, the

computation load analyzed previously can be cut in half. Again, the singular property of block Hessian matrix simplifies the computation of HTbHb greatly in using the

LS method for updating network weights. Also, like the analysis in Section 3.1, the matrix inverse lemma reduces the load in computing (lI þ(H(_bp))TH(_bp))¹1 by np/#p times

as compared with the normal approach.

4. Simulations

In this section, we shall compare the performance of the proposed second-order learning algorithm in Section 3 to that of the pure gradient descent rule (backpropagation algorithm), Newton’s method in which the whole Hessian matrix is computed, and a commonly used simplified Newton’s method in which only the diagonal terms of the Hessian matrix are considered. The performance compari-son indexes are learning speed and convergence accuracy. Since the MLP is a universal approximator of continuous functions, we shall use three continuously differentiable

functions with different input–output dimensions as examples to test the learning speed and convergence accuracy of the proposed second-order learning algorithm. It is noted that in applying Newton’s method to minimize an error function, we simply letⵜE ¼ 0 [see Eq. (32)] and thus the found extreme point could be either a minimum point or a maximum point. To avoid wrong convergent directions in using Newton’s method, we propose three protection methods, denoted by PT1, PT2 and PT3, respec-tively. In the PT1 method, we compare the directions (signs) of the computed Dw and ¹ ⵜE. If they are in the same direction, then we adopt the update Dw. Otherwise, we update the network weights according to ¹ⵜE. With the PT1 method, the network weights are updated by the general gradient descent rule when E(w) is convex, and are updated by the proposed second-order learning algorithm when E(w) is concave. In the PT2 method, we reverse the sign of the bias term in Eq. (11), and keep the sign of the backpropaga-tion term in Eq. (11) unchanged. This makes E(w) become concave when it is convex to push the weights away from the maximum points. When E(w) is concave and near the region ⵜE ⬇ zero, since the backpropagation term in Eq. (11) is dominant (this can be proved by order analysis), the sign change of the bias term will not affect the curvature of E(w). The PT3 method is the simplest one; it ignores the bias term in Eq. (11) and leaves only the backpropagation term. In the three protection methods, only PT1 computes the exact block Hessian matrix. However, the (approxi-mated) block Hessian matrices computed in all the three methods own the properties mentioned in the three theorems in Section 2. Especially, these (approximated) block Hessian matrices converge to the same curvature, which is determined by the backpropagation term in Eq. (11), when ⵜE ⬇ zero.

We shall next use three typical examples concerning single-input-single-output (SISO), multi-input-single-out-put (MISO), and multiinmulti-input-single-out-put-multioutmulti-input-single-out-put (MIMO) systems, respectively, to verify the convergence speed and accuracy of the proposed second-order learning algorithm based on the block Hessian matrix under the protections of PT1, PT2 and PT3, respectively. (In the following, we shall call the proposed second-order learning algorithm with the PT1, PT2, and PT3 protection schemes as PT1, PT2, and PT3 methods, respectively, for short.) The performance will be compared with that of the pure gradient descent rule (i.e. backpropagation (BP) algorithm), Newton’s method computing the full Hessian matrix (H method), and the simplified Newton’s method computing only the diagonal Hessian matrix (Hdmethod). Here, the LS update method

proposed in Section 3.2 and the PT2 protection method are also used in the H method.

Example 1: consider a four-layer MLP with structure

N2,3,4,1. We fix the weights of this network (denoted by

‘teaching MLP’), and use another MLP (denoted by ‘train-ing MLP’) with the same topology to learn the input–output relationship of the teaching MLP. The training data are 1616 Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(11)

collected by feeding random numbers uniformly distributed in [ ¹ 3.0, 3.0] to the teaching MLP and recording its output values. A total of 60 training input–output pairs are collected. The weights of the training MLP are initialized randomly, and then updated by the six first-order and second-order learning algorithms mentioned previously. The learning results are shown in Fig. 3. Fig. 3(a) and (b) show, respectively, the convergence curves (accuracy) of the PT1 and PT2 methods after 40 learning iterations. The

convergence curve of the BP algorithm after 400 iterations is shown in Fig. 3(c). It is observed that the convergence accuracy of BP is smaller than that of PT2 by one order, even after ten cycles of the learning iterations, which takes nearly the same actual computation time as PT1 and PT2. Fig. 3(d) shows the learning curve of the Hdmethod, which

displays poor convergence accuracy. The learning curve of the PT3 method is shown in Fig. 3(e), which is very similar to that of the PT2 method in convergence speed and Fig. 3. Convergence curves of various learning algorithms in Example 1.

Table 2

Range of weight variations and convergence accuracy of the training MLP under six different learning algorithms in Example 1, where w(k)means the weight vectors in layer k, k ¼ 1,2,3

Method k_w(1)¹w(1)_(0)k _k_w(2)_¹_w(2)_(0)k _k_w(3)_¹_w(3)_(0)k _Epochs _Error

PT1 0.4046 0.3313 0.8875 40 2.944⫻ 10^{¹3} PT2 0.3332 0.3255 0.7705 40 1.796⫻ 10^{¹3} BP 0.0603 0.0454 0.1325 400 7.320⫻ 10^{¹3} Hd 0.2081 0.2807 0.0908 40 1.066⫻ 10^{¹2} PT3 0.3868 0.2959 0.8507 40 3.230⫻ 10^{¹3} H 0.1571 0.0812 0.2189 40 8.121⫻ 10¹4

(12)

accuracy. The learning curve of the H method under the PT2 protection is shown in Fig. 3(f). We observe that the pro-posed second-order learning algorithm based on the block Hessian matrix can achieve nearly the same convergence accuracy as that of the Newton’s method that computes the full Hessian matrix, although the former takes much less computation time per iteration than the latter. Table 2 shows the moving distances of the weight vectors of the training MLP (the distance between the final and initial weight vectors) and convergence accuracy under different learning algorithms. To obtain reliable test results, we repeat the above test fifty times; at each time, the weights of the training MLP are re-initialized randomly. In the repeated tests, we use the BP, PT2, and PT3 methods to train the training MLP, and record the respective total error after 40 (for PT2 and PT3) or 400 (for BP) iterations. The results are shown in Fig. 4. It is found that the mean of total errors for the PT2 method is 0.0064, and the standard deviation is 0.0056; the mean of total errors for the PT3 method is 0.0151, and the standard deviation is 0.0183; the mean of total errors for the BP method is 0.0669, and the standard deviation is 0.0437. These statistics numbers are comparable to those in Table 2 in magnitude order. From

Table 2 and Fig. 4, we find that the proposed block-Hessian-matrix-based second-order learning algorithm has much larger search range than the gradient descent method, and preserves high convergence accuracy and speed.

Example 2: consider the Vander Pol’s equation, y(x₁,x₂)¼(1=20)(x₁(1 ¹ x2₂)¹x₂), which is a two-input– single-output system. In this example, we use a MLP,

N3,10,5,1, with random initial weights to learn this system.

The training data are collected by randomly choosing (x1,

x2) values uniformly distributed in [ ¹ 2.5, 2.5]⫻ [ ¹ 2.5,

2.5]. A total of 100 training data are used. The six learning algorithms in Example 1 are used here again to train the MLP. The results are shown in Fig. 5. Fig. 5(a) and (b) show the learning curves of PT1 and PT2, respectively. It appears that the PT2 methods still keeps high convergence accuracy and speed. The learning curve of BP after 800 iterations is shown in Fig. 5(c). The whole actual learning time of BP is larger than that of PT1 and PT2, but the convergence accuracy of BP is much worse than that of PT1 and PT2. In using the Hdmethod, we found that the

same l value in Eq. (28) as that used in the PT1 and PT2 methods could not make the Hdmethod converge. Hence

we reset the parameter l from 0.5 ⫻ 10¹4 to 0.5. The Fig. 4. Repeated tests of convergence accuracy of three learning algorithms: BP (solid line), PT2 (dotted line), and PT3 (broken line) in Example 1.

Table 3

PT1 3.7414 32.1940 2.8925 160 3.922⫻ 10^{¹1} PT2 6.6627 18.9966 3.6575 80 4.819⫻ 10^{¹2} BP 1.9902 1.7518 2.5195 800 4.546⫻ 10^{¹1} Hd 0.0939 0.0640 1.2062 80 1.454 PT3 7.5917 21.1058 1.9849 80 1.230⫻ 10^{¹1} H 4.0045 24.8150 23.047 80 3.058⫻ 10¹2

(13)

corresponding learning curve is shown in Fig. 5(d), Displaying similar convergence accuracy and speed to the BP method. The learning curves of the PT3 and H methods are shown in Fig. 5(e) and (f), respectively, which are both similar to the learning curve of PT2. Table 3 lists the range of weight variations and convergence accuracy under various learning algorithms. Again, the proposed second-order learning algorithm, like the H method, shows the largest search range on the weight space.

Example 3: consider a system described by y(x1,x2)¼ 1 5exp sin(x1)¹cos(x2) 4 þ cos(x2) ,cos(x1)sin(2x2) ,

which is a two-input–two-output system, where the two inputs are x1, x2, and the two outputs are

(1=5)exp[(sin(x1)¹cos(x2))=(4 þ cos(x2))] and cos(x1

)-sin(2x2). In this example, we use a N3,7,5,2MLP, which has

two output nodes, to learn the behavior of this system. The Fig. 5. Convergence curves of various learning algorithms in Example 2.

Table 4

PT1 6.0998 18.0398 5.4619 100 4.863⫻ 10^{¹1} PT2 3.5607 45.6548 7.3597 80 1.539⫻ 10^{¹1} BP 2.3591 1.1166 1.8174 800 9.141⫻ 10^{¹1} Hd 0.2883 0.7666 0.6662 80 6.447 PT3 22.4823 49.1420 49.5137 80 6.979⫻ 10^{¹1} H 8.1261 12.9116 67.3783 80 3.718⫻ 10¹1

(14)

training data are obtained by randomly choosing (x1, x2)

values uniformly distributed in [ ¹ 1,1] ⫻ [ ¹ 1,1]. A total of 64 training patterns are used. Like the previous two examples, six learning algorithms are used to train the MLP, and the results are shown in Fig. 6. The learning curves of PT1 and PT2 are shown in Fig. 6(a) and (b), respectively. Again PT2 is found to be superior in conver-gence accuracy and speed. Fig. 6(c) is the learning curve of BP after 800 iterations, which takes longer actual computa-tion time, but achieves much worse accuracy than PT2. The learning curve of the Hd method, as shown in Fig. 6(d),

converges when the parameter l is reset to be 1.0. The learning curves of the PT3 and H methods are shown in Fig. 6(e) and (f), respectively, which are both similar to the learning curve of PT2 again. The ranges of weight variations and convergence accuracy of various learning algorithms are listed in Table 4.

To see the performance of the combination of different learning algorithms, we did the following extra tests. In the

first test, we train the MLP using BP until the learning curve converges stably after about 100 epochs, and then apply PT2 on the trained MLP for further training. The corresponding learning curve is given in Fig. 7(a) showing that the PT2 method can further reduce the network output error even after the BP learning. In the second test, we train the MLP using PT2 first until converges and then using BP with the same learning constant as that used for Fig. 6(c). The learn-ing curve of this test is shown in Fig. 7(b), indicatlearn-ing that the learning constant is too large to keep the minimum point located by PT2. The third test is similar to the second one, except that the learning constant of BP is reduced by 100 times. The corresponding learning curve is shown in Fig. 7(c), indicating that BP keeps the minimum point found by PT2 in this case. These tests show that the learning constant affects the convergence accuracy of BP greatly. The aforementioned tests use some hybrid learning schemes, but the PT2 method still show the best performance.

Fig. 6. Convergence curves of various learning algorithms in Example 3. 1620 Yi-Jen Wang, Chin-Teng Lin/Neural Networks 11 (1998) 1607–1622

(15)

It is noted that the MLPs used in the earlier examples all have more than two input nodes and use sigmoidal activation functions, so their Hessian and block Hessian matrices are singular according to Theorems 1 and 2. This substantiates the legal usage of the proposed second-order learning algorithm. The previous simulations lead to the following discussions. Since the proposed block-Hessian-matrix approach is a second-order learning algorithm with quadratic convergence speed, it, unlike the normal back-propagation algorithm, can avoid the phenomena of being trapped in the flat region of error surface. This property generates a higher learning speed and probabilities of

locating better local minima than the normal back-propagation algorithm. Simulation results also show that the proposed approach has higher learning speed and convergence accuracy than the diagonal-Hessian-matrix method. This is consistent with our analysis result in Section 2.2 that the diagonal Hessian matrix of a MLP does not own the important properties of the whole Hessian matrix. In other words, the diagonal Hessian matrix of a MLP is an ill approximation of the original Hessian matrix. On the contrary, the block Hessian matrix discussed in this paper keeps the important properties of the original Hessian matrix. Hence, the proposed block-Hessian-matrix approach shows nearly the same convergence accuracy as the one that computes the whole Hessian matrix; however, the former needs much less computation power, and thus has higher learning speed, than the latter. Moreover, the proposed protection methods associated with the block-Hessian-matrix approach can avoid the problem of wrong convergence directions in the normal Hessian-matrix-based learning algorithms.

5. Conclusion

A novel second-order learning algorithm for the MLP has been developed in this article, which represents an attractive alternative to standard backpropagation algorithms. The proposed second-order learning algorithm is a revised Newton’s method, in which the computation of the Hessian matrix of the MLP (H) and its inverse is the kernel part. Several techniques have been presented in developing this algorithm, including the order-based-derivative approach to deriving formulas for H, the network implementation of H, a forward–backward propagation scheme for computing H, the block Hessian matrix of the MLP to approximate H, the use of matrix inverse lemma to find the inverse of block Hessian matrix, the least squares update method for updating the weights of the MLP, and three protection methods to insure correct convergence directions. Further, several lemmas and theorems have been provided to show the important properties of H, and verify that the proposed block Hessian matrix is a good approximation of H. The theoretic analysis not only uncovers the nature of the Hessian matrix of the MLP, but also provides important insight into developing an efficient second-order learning algorithm for the MLP. The proposed algorithm has solved the major drawbacks of the standard backpropagation algorithm and of normal Newton’s method in training the MLP. Extensive computer simulations and performance comparisons with backpropagation algorithm and other revised Newton’s methods have demonstrated the fast learning speed and high convergence accuracy of the proposed learning algorithm. Formal mathematic analysis on the convergence speed and accuracy of the proposed second-order learning algorithm protected by the Fig. 7. Convergence curves of hybrid learning algorithms of BP and PT2 in

Example 3.