Description of First Order Differential Dynamic Programming

5 Learning as a Control Process

5.5 Description of First Order Differential Dynamic Programming

The generalized net representation of the backpropagation algorithm was described in Sect. 4.6, and the learning algorithms based on differential dynamic programming were described in Sect. 5.3 and 5.4. In this section, we consider the process of neural network training as a multistage optimal control problem, and we use the generalized net methodology to describe the neural network learning algorithm based on the first order differential dynamic programming.

The procedure for developing this new representation of neural networks based on the generalized net methodology relies first of all on the books by Atanassov (1991, 1998, 2007), and by Atanassov and Aladjov (2000), as well as the papers by Krawczak et al. (2003), and the works of Krawczak (2003b, 2003e, 2004e, 2004g, 2005b).

As in Sect. 4.6, it is assumed that the basic dynamic components of the neural network are neurons. This assumption is crucial because the signals, as well as the connection weights, are less important in this consideration. In this manner all changes in the neural network structure and states are represented by changes of neuron states. The α-type tokens describing each neuron (or a group of neurons) in the neural network enter the net through the place $X_1$, and have the following initial characteristics

\[ y(\alpha_{i(l)}) = \langle NN1, l, i(l), f_{i(l)} \rangle \tag{5.62} \]

for $i(l) = 1, 2, \ldots, N(l)$, $l = 0, 1, \ldots, L$, where

$NN1$ is the neural network identifier,

$i(l)$ is the number of the token (neuron) associated with the $l$-th layer,

$l$ is the present layer number,

$f_{i(l)}(\cdot)$ is the activation function of the $i$-th neuron associated with the $l$-th layer of the neural network, where $f_{i(0)}(\cdot) = 1.0$ for $i(0) = 1, 2, \ldots, N(0)$.

The generalized net representation of the first order differential dynamic programming algorithm contains seven transitions, see Fig. 5.3. Most of the transitions are the same as for the backpropagation algorithm (therefore, they are only briefly recalled here), while the transition $Z_5$ is divided into two transitions, $Z_{5,1}$ and $Z_{5,2}$.

Fig. 5.3 The generalized net description of the first order differential dynamic programming algorithm

Via the transition $Z_1$ every token $\alpha_{i(l)}$, $i(l) = 1, 2, \ldots, N(l)$, $l = 0, 1, \ldots, L$, is transferred from the place $X_1$ to the place $X_2$ as well as $X_3$. The tokens are transferred sequentially according to the increasing indexes $i(l) = 1, 2, \ldots, N(l)$ for $l = 0, 1, \ldots, L$; in this way the tokens of the same level $l$ are aggregated into a new token $\alpha_{(l)}$, representing the layer $l$, according to the transition conditions of the transition $Z_1$

\[
Z_1 = \left\langle \{X_1, X_2\},\ \{X_2, X_3\},\
\begin{array}{c|cc}
 & X_2 & X_3 \\ \hline
X_1 & V_{1,2} & V_{1,3} \\
X_2 & V_{2,2} & V_{2,3}
\end{array},\
\vee(X_1, X_2) \right\rangle \tag{5.63}
\]

where

$V_{1,2} = V_{1,3} =$ “if there is only one token $\alpha_{i(l)}$ in the place $X_1$”, i.e.

\[ \left( \forall \alpha_{j(l)} \in K_{X_1} \right) \left( pr_2 Y_{\alpha_{i(l)}} \neq pr_2 Y_{\alpha_{j(l)}},\ j(l) \neq i(l) \right) \tag{5.64} \]

(where $K_{X_1}$ is the set of all tokens entering the net from the place $X_1$, and $i(l), j(l) = 1, 2, \ldots, N(l)$),


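$V_{2,2} =$ “if there is more than one token $\alpha_{i(l)}$ and $\alpha_{j(l)}$ associated with the $l$-th layer”, i.e.,

\[ \left( \exists\, \alpha_{i(l)} \in K_{X_1} \ \&\ \alpha_{j(l)} \in K_{X_2} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{j(l)}} \right) \tag{5.65} \]

$V_{2,3} =$ “if all tokens $\alpha_{i(l)}$, $i(l) = 1, 2, \ldots, N(l)$, have been combined into one token”, i.e.,

\[
\begin{array}{l}
\neg \left( \exists\, \alpha_{i(l)} \in K_{X_1} \ \&\ \alpha_{k(l)} \in K_{X_2} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{k(l)}} \right) \\
\neg \left( \exists\, \alpha_{i(l)} \in K_{X_1} \ \&\ \alpha_{j(l)} \in K_{X_1} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{j(l)}},\ i \neq j \right)
\end{array}
\tag{5.66}
\]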

The tokens associated with neurons lying within one layer are aggregated in the transition $Z_1$. For each layer $l = 0, 1, \ldots, L$, the characteristics of the tokens are processed in order to construct a new token $\alpha_{(l)}$ representing the whole $l$-th layer according to the condition (5.66). The aggregated token is transferred from the place $X_2$ to the place $X_3$, and has the following characteristic

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)} \rangle \quad \text{for } l = 0, 1, 2, \ldots, L, \tag{5.67} \]

where

$NN1$ is the neural network identifier,

$l$ is the layer number,

$[1, N(l)]$ denotes $N(l)$ tokens (neurons) arranged in a sequence, starting from the first and ending at $N(l)$, associated with the $l$-th layer,

\[ F_{(0)} = [1, 1, \ldots, 1], \qquad F_{(l)} = \left[ f_{1(l)}(\cdot), f_{2(l)}(\cdot), \ldots, f_{N(l)}(\cdot) \right] \tag{5.68} \]

is a vector of the activation functions of the neurons associated with the $l$-th layer of the neural network.
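The aggregation performed by the transition $Z_1$ can be illustrated with a minimal Python sketch. It is not the generalized net formalism itself: neuron tokens are modelled as plain tuples following (5.62), a single network is assumed, and the names `aggregate_layer_tokens`, `constant_one` and `logistic` are hypothetical, introduced only for this illustration.

```python
import math
from collections import defaultdict


def aggregate_layer_tokens(neuron_tokens):
    """Group neuron tokens (nn_id, l, i, f) into one token per layer.

    Each result mirrors characteristic (5.67):
    (nn_id, l, (1, N_l), [f_1, ..., f_{N_l}]).
    Assumes all tokens belong to the same network.
    """
    members_by_layer = defaultdict(list)
    nn_id = neuron_tokens[0][0]
    for _, layer, index, activation in neuron_tokens:
        members_by_layer[layer].append((index, activation))
    layer_tokens = {}
    for layer, members in members_by_layer.items():
        members.sort(key=lambda pair: pair[0])  # increasing neuron index i(l)
        activations = [f for _, f in members]
        layer_tokens[layer] = (nn_id, layer, (1, len(members)), activations)
    return layer_tokens


def constant_one(_net):
    """Input-layer 'activation': f_{i(0)}(.) = 1.0, as stated in the text."""
    return 1.0


def logistic(net):
    """An ordinary logistic activation, chosen here only as an example."""
    return 1.0 / (1.0 + math.exp(-net))


neuron_tokens = [
    ("NN1", 0, 1, constant_one), ("NN1", 0, 2, constant_one),
    ("NN1", 1, 2, logistic), ("NN1", 1, 1, logistic), ("NN1", 1, 3, logistic),
]
layer_tokens = aggregate_layer_tokens(neuron_tokens)
```

Here `layer_tokens[1]` plays the role of the token $\alpha_{(1)}$: it carries the index range of the layer and the vector of activation functions.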

In the place X we obtain L tokens, producing the neural network output. ₃ The second transition Z has the following form ₂

X4

 m ₂

{

³^, ¹

} {

^, ⁴^, ²

}

2 X m X m

Z =   X true ₃ false ,^∧

(

X3, m1

)

^

m 1 false true (5.69)

where the performance index of the learning process is introduced, and the β-type token, which enters the input place $m_1$, has the following initial characteristic

\[ y(\beta) = \langle NN1, E, E_{\max} \rangle \tag{5.70} \]

where

$NN1$ is the neural network identifier,

$E$ is the performance index of the neural network learning,

$E_{\max}$ is the threshold value of the performance index, which must be reached.

The token $\alpha_{(l)}$, $l = 0, 1, 2, \ldots, L$, related to the $l$-th layer, now has the following characteristic in the place $X_4$

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)}, W_{(l)} \rangle \tag{5.71} \]

for $l = 0, 1, 2, \ldots, L$, where

$NN1$ is the neural network identifier,

$l$ is the layer number,

$[1, N(l)]$ denotes $N(l)$ tokens (neurons) arranged in a sequence, starting from the first and ending at $N(l)$, associated with the $l$-th layer,

\[ F_{(0)} = [1, 1, \ldots, 1], \qquad F_{(l)} = \left[ f_{1(l)}(\cdot), f_{2(l)}(\cdot), \ldots, f_{N(l)}(\cdot) \right]^T \tag{5.72} \]

is a vector of the activation functions of the neurons associated with the $l$-th layer,

$W_{(l)}$ denotes the aggregated initial weights connecting the neurons of the $l$-th layer with the $(l-1)$-st layer neurons.

In the place $m_2$ the β token now obtains the characteristic

\[ y(\beta) = \langle NN1, 0, E_{\max} \rangle. \tag{5.73} \]

In the transition $Z_3$ the tokens $\gamma_p$, $p = 1, 2, \ldots, P$, $p$ being the number of the training pattern, enter the place $n_1$ with the initial characteristic

\[ y(\gamma_p) = \langle X_p(0), D_p, p \rangle, \tag{5.74} \]

where

$X_p(0) = [x_{p1}, x_{p2}, \ldots, x_{pN(0)}]$ is the input vector of the neural network,

$D_p = [d_{p1}, d_{p2}, \ldots, d_{pN(L)}]$ is the vector of desired network outputs,

and for the input $X_p(0)$ the outputs of all layers are calculated.

The transition $Z_3$ describes the process of signal propagation in the neural network (5.76). The tokens $\alpha_{(l)}$, $l = 0, 1, 2, \ldots, L$, in the place $X_5$, obtain the new characteristics

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)}, W_{(l)}, X_{(l)} \rangle, \tag{5.75} \]

where $X_{(l)}$, $W_{(l)}$, $l = 1, 2, \ldots, L$.

\[
Z_3 = \left\langle \{X_4, X_5, X_9, m_2, m_7, n_1\},\ \{X_5, X_6, m_3, n_2\},\
\begin{array}{c|cccc}
 & X_5 & X_6 & m_3 & n_2 \\ \hline
X_4 & V_{4,5} & V_{4,6} & false & false \\
X_5 & V_{5,5} & V_{5,6} & false & false \\
X_9 & V_{9,5} & V_{9,6} & false & false \\
m_2 & false & false & true & false \\
m_7 & false & false & true & false \\
n_1 & false & false & false & true
\end{array},\
\wedge\bigl(\vee(X_4, X_5, X_9), \vee(m_2, m_7), n_1\bigr) \right\rangle \tag{5.76}
\]

where

$V_{4,5} = V_{5,5} = V_{9,5} =$ “previous layer does not have defined outputs”,

$V_{4,6} = V_{5,6} = V_{9,6} = \neg V_{4,5} =$ “all layers’ outputs have assigned values for the current pattern”.

In the place X^^^₆ the tokens α get the following characteristics

( )

( )l =NN l

[

( )

]

F( ) ( )l Wl Xp( )l 

yα 1, , 1, , , , (5.77)

while the token β has the characteristic y

( )

β =^NN1,0,E_max^ (in the place m ), ₃ and the token γ has the characteristic ^y

( )

^γ^p ⁼^^X^p⁽⁰^), ^D^p^,^p^ in the place n . ₂

The transition $Z_4$ describes the estimation and weight adjustment, with the β token described via the characteristic

\[ y(\beta) = \langle NN1, E', E_{\max} \rangle. \]

In distinction from the generalized net representation for the classic backpropagation algorithm, described in Chap. 4, here the transition $Z_5$ is split into two transitions, $Z_{5,1}$ and $Z_{5,2}$.

The transition $Z_{5,1}$ models the evaluation of the partial derivatives described in detail in Chap. 4

and partial derivatives of the return functions with respect to the aggregated

For the given nominal values of the neuron states $X(l)$ and weight connections $W[l, L-1]$, where $l = L-1, \ldots, 0$, the equations (5.82) and (5.83) in a shorter form are as follows

\[ V_X(l) = V_X\bigl( X(l), W[l, L-1] \bigr) \]

which include the results of Equ. 5.83

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)}, W_{(l)}, X_p(l), V_X(l) \rangle. \tag{5.86} \]

The token of β-type does not change its characteristic in the place $m_{5,1}$.

The transition $Z_{5,2}$ is devoted to the partial derivatives of the return functions $V\bigl( X(l), W[l, L-1] \bigr)$ with respect to $W[l, L-1]$, which are described by (5.82), and have the following form

2 changed its characteristic.

The transition $Z_6$ describes the process of weights adjustment during the learning process and has the following form

\[
Z_6 = \left\langle \{X_{8,2}, m_{5,2}\},\ \{X_9, X_{10}, m_6, m_7\},\
\begin{array}{c|cccc}
 & X_9 & X_{10} & m_6 & m_7 \\ \hline
X_{8,2} & V_{8,9} & V_{8,10} & false & false \\
m_{5,2} & false & false & V_{5,6} & V_{5,7}
\end{array},\
\wedge(X_{8,2}, m_{5,2}) \right\rangle \tag{5.89}
\]

where

$V_{8,9} =$ “there are still unused patterns”,

$V_{8,10} = \neg V_{8,9}$,

$V_{5,6} =$ “if the performance index is below the given threshold $E_{\max}$”,

$V_{5,7} = \neg V_{5,6}$.
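Read as ordinary Boolean tests, the predicates of $Z_6$ decide where the α and β tokens go next. The short Python sketch below only restates that routing under this reading; the function name `route_z6` and its arguments are hypothetical and are not part of the generalized net notation.

```python
def route_z6(unused_patterns_left, performance_index, e_max):
    """Return the output places chosen for the alpha and beta tokens in Z6.

    V_{8,9}  ("there are still unused patterns")       -> alpha goes to X9
    V_{8,10} = not V_{8,9}                              -> alpha goes to X10
    V_{5,6}  ("performance index below threshold")      -> beta goes to m6
    V_{5,7}  = not V_{5,6}                              -> beta goes to m7
    """
    alpha_place = "X9" if unused_patterns_left else "X10"
    beta_place = "m6" if performance_index < e_max else "m7"
    return alpha_place, beta_place


still_training = route_z6(True, 0.5, 0.1)
finished = route_z6(False, 0.05, 0.1)
```

With these sample values, `still_training` sends the tokens back into the loop (places $X_9$ and $m_7$), while `finished` sends them to the final places $X_{10}$ and $m_6$.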

The α-type tokens obtain the new characteristics in the place $X_9$

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)}, W'_{(l)} \rangle \tag{5.90} \]

with updated weight connections

\[ W'_{(l)} = \left[ w'_{1(l)}, w'_{2(l)}, \ldots, w'_{N(l)} \right] \tag{5.91} \]

where

\[ w'_{i(l)} = \left[ w'_{i(l-1)1(l)}, w'_{i(l-1)2(l)}, \ldots, w'_{i(l-1)N(l)} \right] \]

for $i(L-1) = 1, 2, \ldots, N(L-1)$, $j(L) = 1, 2, \ldots, N(L)$. The new values of the weights are calculated in the following way (for the learning parameter $\eta > 0$)

\[ W'(l) = W(l) - \eta\, \frac{\partial V\bigl( X(l), W[l, L-1] \bigr)}{\partial W(l)} \tag{5.92} \]

for $l = L, L-1, \ldots, 0$, and replace $W[0, L-1] = W'[0, L-1]$, $X[1, L] = X'[1, L]$.

In the place $m_7$ the β token obtains the characteristic

\[ y(\beta) = \langle NN1, E, E_{\max} \rangle \]

which is not final.

The optimal values of the weights satisfying the stop condition are denoted by

\[ W^*_{(l)} = pr_5 \langle NN1, l, [1, N(l)], F_{(l)}, W'_{(l)} \rangle, \]

where the characteristic of the α-type tokens in the place $X_{10}$ is described by

\[ y(\alpha_{(l)}) = \langle NN1, l, [1, N(l)], F_{(l)}, W'_{(l)} \rangle, \tag{5.93} \]

while the final value of the performance index is equal to $E^* = pr_2 \langle NN1, E', E_{\max} \rangle$, and the β token characteristic in the place $m_6$ is described by

\[ y(\beta) = \langle NN1, E', E_{\max} \rangle. \tag{5.94} \]

M. Krawczak: Multilayer Neural Networks, SCI 478, pp. 123–144.

DOI: 10.1007/978-3-319-00248-4_6 © Springer International Publishing Switzerland 2013

Parameterisation of Learning

6.1 Introduction

Learning of a neural network is meant to adjust connections between layers (connections between neurons) in order to minimize the performance index of learning. For this, the backpropagation algorithm with various modifications is commonly used. At the same time, the learning process of multilayer neural networks can be considered as a particular multistage optimal control problem, described in Chap. 4. This kind of problem can be naturally treated by the dynamic programming approach (Chernousko and Lyubushin 1982, Larson and Korsak 1970) as well as (Krawczak 1994, 1995a, 1999b, 2000a, 2001b, 2002b, 2002c).

In this chapter we introduce a gain parameter into models of neurons. The value of this parameter is tacitly assumed to be 1.0 in almost all learning algorithms used. Note that setting the parameter to a small value makes the neuron model “almost linear”. Thus, the learning process problem can be solved using computational tools specified for linear-quadratic systems optimisation, like the first order differential dynamic programming methodology described in Chap. 5.

Using the continuation methodology (Krawczak 1999b, 2000b, 2000c, 2000e) the value of the gain parameter may be changed gradually in order to reach 1.0. In fact, we can do much more: by considering the gain parameter as an additional control variable, the optimal value of the parameter can be found (Krawczak 2001a, 2002b). The presented methodology is based on the first order differential dynamic programming, and due to the global properties of the methodology we propose to call it the heuristic dynamic programming. In some sense the idea is borrowed from the simulated annealing approach, known in stochastic mechanics (Aarts and Laarhoven 1987, Kirkpatrick et al. 1983, Brooks and Morgan 1995).

In this chapter we will also describe the method of conversion of the multilayer neural networks learning problem into the iterative minimax problem. The method uses the methodology of game theory as well as the tools of multiobjective optimisation introduced by Krawczak (1997b, 1998), and allows one to consider the neural network learning process as a multiobjective optimisation problem, where each pair of the training examples is associated with a partial performance index (as a separate objective function). The methodology can be used for establishing a new paradigm of learning that we will call the updating learning.

6.2 Neuron Models with Parameters

Let us consider an artificial neuron model which is a nonlinear processing unit performing the operation $f(net)$ by means of the activation function, where $net$ is the input to the neuron.

There are two typical activation functions, commonly used, described by the sigmoidal functions:

the unipolar

\[ f(net, \lambda) = \frac{1}{1 + \exp(-\lambda\, net)}, \tag{6.1} \]

the bipolar

\[ f(net, \lambda) = \frac{2}{1 + \exp(-\lambda\, net)} - 1 \tag{6.2} \]

where $\lambda > 0$ is the gain parameter, $1/\lambda$ being sometimes called the temperature of the system. This parameter describes the slope of the activation functions. These functions are shown in Fig. 6.1 and 6.2.

We can observe how a curve changes with respect to the gain parameter λ. Assuming reasonable values of the input, for large values of the gain parameter the sigmoidal function turns into the Heaviside function (the step function), while for small values of the parameter the sigmoidal function becomes an “almost linear” function.
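The effect of the gain can be checked numerically. The following minimal Python sketch implements (6.1) and (6.2) directly; the function names `unipolar` and `bipolar` and the sample values are chosen only for this illustration.

```python
import math


def unipolar(net, lam=1.0):
    """Unipolar sigmoid (6.1) with gain lam > 0."""
    return 1.0 / (1.0 + math.exp(-lam * net))


def bipolar(net, lam=1.0):
    """Bipolar sigmoid (6.2) with gain lam > 0."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0


# Small gain: the unipolar curve stays close to its tangent line at net = 0,
# which has value 0.5 and slope lam / 4.
small_gain_value = unipolar(1.0, lam=0.01)
tangent_value = 0.5 + 0.01 * 1.0 / 4.0

# Large gain: for a positive input the curve is already close to the step value 1.
large_gain_value = unipolar(1.0, lam=50.0)
```

For small λ the sigmoid and its tangent at the origin are practically indistinguishable over moderate inputs, which is the sense in which the neuron becomes “almost linear”. The bipolar function is the same curve rescaled to the range (-1, 1).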

Fig. 6.1 A unipolar sigmoidal function with different gain values

Fig. 6.2 A bipolar sigmoidal function with different gain values

6.3 Continuation Method

In the literature on nonlinear optimisation one can find a little-known methodology called the continuation method (Avila 1974, Richter and de Carlo 1984), which allows finding near optimal solutions. The method is usually presented in terms of finding zeroes of a mapping $G: R^n \to R$, and it can be generalized to finding fixed points (Ortega and Rheinboldt 1970).

Following the papers of the present author (Krawczak 1999b, 2000b, 2000c, 2000e) we will consider the learning of the neural networks problem as the optimisation problem

\[ \min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.3} \]

subject to

\[ X(l+1) = F\bigl( X(l), W(l) \bigr) = F(l) \quad \text{for } l = 0, 1, \ldots, L-1 \tag{6.4} \]

where $F(l)$ is an aggregated function of transition from one layer to another. Let $\tilde{W}[0, L-1]$ be a solution of the problem (6.3)-(6.4). Now, let us consider the following homotopy

\[ \min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.5} \]

subject to

\[ X(l+1) = F\bigl( X(l), W(l), \lambda \bigr) \quad \text{for } l = 0, 1, \ldots, L-1 \tag{6.6} \]

such that for $\lambda = 1$ the problem (6.5)-(6.6) is equivalent to (6.3)-(6.4). It is assumed that the problem (6.5)-(6.6) has some trivial, or easy to compute, solution for $\lambda = \lambda_0$. The value of the parameter $\lambda_0$ is treated as the starting point. Additionally, due to the “almost” linear-quadratic form of the considered problem, it is assumed that the solution obtained for $\lambda = \lambda_0$ is the global one. This assumption is based on the properties of the first order differential dynamic programming methodology described in Chap. 5.

Next, the problem can be embedded into a family of problems with the parameter λ. Usually we do not consider the gain parameter, which means that we assume that the parameter is equal to 1. The basic idea is as follows: if for all $\lambda \in [\lambda_0, 1]$ there exist weights $W(\lambda, [0, L-1])$ that are solutions of (6.5)-(6.6), then the curve $W(\lambda, [0, L-1])$ can be found numerically, starting at the point $W(\lambda_0, [0, L-1])$ and ending at $W(1, [0, L-1]) = \tilde{W}[0, L-1]$. In the case of the linear-quadratic optimisation problem, like the neural networks learning, it is obvious that there does exist a solution of (6.5)-(6.6) for any $\lambda \in [\lambda_0, 1]$, where $0 < \lambda_0 \le 1$. It is assumed that the curve $W(\lambda, [0, L-1])$ is continuous and has first derivatives.

This kind of approach is known in the literature as the continuation method (Avila 1974) or the homotopy method (Richter and de Carlo 1984). Due to the continuous differentiability of the considered sigmoidal activation functions, and the methodology described in the previous chapter, it is obvious that the function $W(\lambda, [0, L-1])$ is continuous and differentiable with respect to the gain parameter $0 < \lambda < +\infty$.

There are two main methods of finding the solution curve $W(\lambda, [0, L-1])$: the first is the discrete method (Avila 1974) and the second is Davidenko’s method (Davidenko 1953).

Discrete Method

The interval $[\lambda_0, 1]$ should be divided into several segments

\[ \lambda_0 < \lambda_1 < \lambda_2 < \ldots < \lambda_N = 1 \]

and the corresponding partial problems

\[ \min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.7} \]

subject to

\[ X(l+1) = F\bigl( X(l), W(l), \lambda_k \bigr), \quad l = 0, 1, \ldots, L-1 \tag{6.8} \]

for $k = 0, 1, \ldots, N$, should be solved.
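The mechanics of the discrete method can be shown on a deliberately simple stand-in problem. The Python sketch below does not train a network: it replaces the stationarity condition of (6.7)-(6.8) by the hypothetical scalar equation $h(w, \lambda) = w^3 + w - 10\lambda = 0$, whose solution is known at $\lambda_0 = 0.2$ ($w = 1$) and at $\lambda = 1$ ($w = 2$), and it uses Newton's iteration as the local solver. All names are assumptions of this illustration.

```python
def h(w, lam):
    """Toy stand-in for a stationarity condition H(W(lam), lam) = 0."""
    return w ** 3 + w - 10.0 * lam


def h_w(w):
    """Derivative of h with respect to w (always positive here)."""
    return 3.0 * w ** 2 + 1.0


def newton_solve(w_start, lam, iterations=30):
    """Local solver for h(w, lam) = 0, started from w_start."""
    w = w_start
    for _ in range(iterations):
        w -= h(w, lam) / h_w(w)
    return w


def discrete_continuation(w0, lam0, n_segments):
    """Follow the solution over the grid lam_0 < lam_1 < ... < lam_N = 1."""
    grid = [lam0 + (1.0 - lam0) * k / n_segments for k in range(n_segments + 1)]
    w = w0
    path = [(grid[0], w)]
    for lam_k in grid[1:]:
        w = newton_solve(w, lam_k)  # warm start from the previous solution
        path.append((lam_k, w))
    return path


# Known starting solution: h(1, 0.2) = 1 + 1 - 2 = 0.
path = discrete_continuation(w0=1.0, lam0=0.2, n_segments=8)
lam_final, w_final = path[-1]
```

Each partial problem is started from the solution of the previous one, so the local solver only ever has to correct a small change in λ. In the learning setting the role of `newton_solve` would be played by the first order differential dynamic programming procedure.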

Starting with the known solution $W(\lambda_0, [0, L-1])$, obtained by applying the first order differential dynamic programming method, described in Sect. 5.6.1, the new weights $W(\lambda_1, [0, L-1])$ are computed. The procedure is performed sequentially for all $\lambda_0 < \lambda_1 < \lambda_2 < \ldots < \lambda_N = 1$ until the point $\lambda_N = 1$ is reached. The main problem is to determine the conditions of existence of the partition $\lambda_0 < \lambda_1 < \lambda_2 < \ldots < \lambda_N = 1$ and the iterative process

[ ]

( )

(

^, ^, ⁰^, ¹

)

1= −

+ I _k W k L

k λ

λ , k=0,1,...,N. (6.9)

According to Ortega and Rheinboldt (1970), for linear-quadratic optimisation problems the relationship (6.9) exists, and the partition of the range $[\lambda_0, 1]$ must ensure that the reachable sets of the obtained solutions of (6.7)-(6.8)

\[ \left\{ W(\lambda_k, [0, L-1]),\ W(\lambda_{k+1}, [0, L-1]) \right\} \quad \text{and} \quad \left\{ W(\lambda_{k+1}, [0, L-1]),\ W(\lambda_{k+2}, [0, L-1]) \right\} \]

for $k = 0, 1, 2, \ldots, N-2$ overlap for each pair $\{\lambda_k, \lambda_{k+1}\}$, $k = 0, 1, \ldots, N-1$.

Davidenko’s Method

The method requires writing the optimisation problem in the following way

\[ H\bigl( W(\lambda, [0, L-1]), \lambda \bigr) = 0. \tag{6.10} \]

By differentiating (6.10) with respect to λ one can get the following Davidenko’s differential equation

\[ H_W\bigl( W(\lambda, [0, L-1]), \lambda \bigr)\, \frac{dW(\lambda, [0, L-1])}{d\lambda} + H_\lambda\bigl( W(\lambda, [0, L-1]), \lambda \bigr) = 0 \tag{6.11} \]

with $W(\lambda_0, [0, L-1])$ as the initial condition. By numerical integration from $\lambda_0$ to 1.0 one obtains the solution $\tilde{W}[0, L-1] = W(1, [0, L-1])$. The main difficulty is the implicit nature of this equation for such complex problems as the neural network learning.
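For a problem in which $H_W$ and $H_\lambda$ are available in closed form, the integration of (6.11) is straightforward. The Python sketch below uses a hypothetical scalar example, $H(w, \lambda) = w^3 + w - 10\lambda$, for which $H_W = 3w^2 + 1$ and $H_\lambda = -10$, with the known starting point $w = 1$ at $\lambda_0 = 0.2$ and the exact end point $w = 2$ at $\lambda = 1$. The classical fourth-order Runge-Kutta scheme is an assumption of this sketch; the text does not prescribe an integrator.

```python
def dw_dlam(w):
    """Right-hand side of the Davidenko equation for H = w**3 + w - 10*lam.

    H_w * dw/dlam + H_lam = 0 with H_w = 3*w**2 + 1 and H_lam = -10.
    """
    return 10.0 / (3.0 * w ** 2 + 1.0)


def integrate_davidenko(w0, lam0, lam_end=1.0, steps=200):
    """Classical fourth-order Runge-Kutta integration from lam0 to lam_end."""
    step = (lam_end - lam0) / steps
    w = w0
    for _ in range(steps):
        k1 = dw_dlam(w)
        k2 = dw_dlam(w + 0.5 * step * k1)
        k3 = dw_dlam(w + 0.5 * step * k2)
        k4 = dw_dlam(w + step * k3)
        w += step * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return w


w_at_one = integrate_davidenko(w0=1.0, lam0=0.2)
residual = w_at_one ** 3 + w_at_one - 10.0  # H(w, 1) at the computed end point
```

In the network setting $H_W$ is a matrix that has to be inverted (or a linear system solved) at every integration step, which is one way to see the difficulty mentioned above.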

The proposed application of the continuation method, combined with the first order differential dynamic optimisation approach, gives the new possibility of finding the global optimal value of the performance index of the learning process for the multilayer feedforward neural networks. In some sense the approach is similar to the idea of simulated annealing methodology (Aarts and Laarhoven 1987).

6.4 Heuristic Dynamic Programming

Let us rewrite the first order differential dynamic programming method, described in Chap. 5, introducing the gain parameter into the neuron models, as follows.

Now, the return function has the form

( ) [ ]

The first order expansion of the return function for the whole network

( ) [ ]

$\delta\lambda = \lambda' - \lambda$ must be small enough to ensure the validity of the expansion. Choosing $\delta W$ and $\delta\lambda$ as

where η is the learning parameter, we obtain that the return function

( ) [ ]

Instead of consideration of the return function for the whole network

[ ]

( )

(

X 0,W0,L−1,λ

)

V , the backward form can be calculated

( ) [ ]

(

X l ,Wl,L−1,λ

)

(

( )

l+1,W

[

l+1,L−1

]

,λ

)

V (6.16)

for 1l,l=1,2,...,L− , with the condition for the last layer

( ) [ ]

( ) ( )

2 , 1 1 , 1 ,



−

p X L

D L

L W L X

V λ (6.17)

The partial derivative of the return function (6.16) takes the form

( )

^l ^F

( ) (

^l ^V ^l ¹^,

) ( )

^F ^l ^.

V_W = _W^T _X + λ _W (6.18)

In order to obtain (6.18), the expression $V_X(l, \lambda)$ must be calculated, using the following sequential relations

\[ V_X(L, \lambda) = -\sum_{p=1}^{P} \bigl( D_p - X_p(L) \bigr) \tag{6.19} \]

\[ V_X(l, \lambda) = F_X(l)\, V_X(l+1, \lambda) \tag{6.20} \]

The derivative of the return function $V\bigl( X(0), W[0, L-1], \lambda \bigr)$ with respect to λ is obtained in a similar manner, as

( )  ( )

−

= ^P

p X L

D L

,λ

λ (6.21)

\[ V_\lambda(l, \lambda) = V_\lambda(l+1, \lambda) + F_\lambda(l)\, V_X(l+1, \lambda) \tag{6.22} \]

where $V_X(l+1, \lambda)$ is given by (6.19) and (6.20).

Using Equ. 6.19 - 6.21, the gradients of the return functions, which are required for Equ. 6.18 and 6.22, can be computed from the sequence of the equations solved from the last layer back to the inputs.
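A backward sweep of this kind can be verified against finite differences on a small example. The Python sketch below is an illustration under several assumptions: a chain of single-neuron layers $x(l+1) = \sigma(\lambda\, w(l)\, x(l))$ with the unipolar sigmoid, one training pattern, the terminal cost $\frac{1}{2}(d - x(L))^2$, and a terminal value of zero for the derivative with respect to λ (the terminal cost has no explicit dependence on λ in this toy setting). The function names are hypothetical.

```python
import math


def sigmoid(net, lam):
    return 1.0 / (1.0 + math.exp(-lam * net))


def forward(x0, weights, lam):
    """Layer outputs x(0), ..., x(L) of the scalar chain."""
    xs = [x0]
    for w in weights:
        xs.append(sigmoid(w * xs[-1], lam))
    return xs


def loss(x0, d, weights, lam):
    return 0.5 * (d - forward(x0, weights, lam)[-1]) ** 2


def backward(x0, d, weights, lam):
    """Backward recursion for the gradients w.r.t. each weight and the gain."""
    xs = forward(x0, weights, lam)
    v_x = -(d - xs[-1])   # terminal condition, cf. (6.19) for one pattern
    v_lam = 0.0           # assumed terminal value for the gain derivative
    v_w = [0.0] * len(weights)
    for l in reversed(range(len(weights))):
        y = xs[l + 1]
        slope = y * (1.0 - y)                        # d sigma / d(lam * net)
        v_w[l] = slope * lam * xs[l] * v_x           # F_W(l) * V_X(l+1)
        v_lam += slope * weights[l] * xs[l] * v_x    # F_lam(l) * V_X(l+1), cf. (6.22)
        v_x = slope * lam * weights[l] * v_x         # F_X(l) * V_X(l+1), cf. (6.20)
    return v_w, v_lam


def numeric_gradient(x0, d, weights, lam, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad_w = []
    for l in range(len(weights)):
        up, down = list(weights), list(weights)
        up[l] += eps
        down[l] -= eps
        grad_w.append((loss(x0, d, up, lam) - loss(x0, d, down, lam)) / (2 * eps))
    grad_lam = (loss(x0, d, weights, lam + eps) - loss(x0, d, weights, lam - eps)) / (2 * eps)
    return grad_w, grad_lam


weights = [0.8, -1.2, 0.5]
v_w, v_lam = backward(0.7, 0.9, weights, 0.6)
num_w, num_lam = numeric_gradient(0.7, 0.9, weights, 0.6)
```

The analytic values in `v_w` and `v_lam` agree with the finite-difference estimates `num_w` and `num_lam`, which confirms that one sweep from the last layer back to the input delivers all the gradients at once.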

The parameter λ can be found at the input layer by minimizing the optimal return function $V\bigl( X'(0), W'[0, L-1], \lambda \bigr)$ with respect to λ.

The heuristic differential dynamic programming algorithm for the learning of multilayer neural networks can be formulated in the following steps:

1. Initialise weights $W(l)$, $l = 0, 1, 2, \ldots, L-1$, and λ as a small value, e.g. 0.1

2. Set: $E = 0$

3. Set: $p = 0$ ($p$ denotes the pattern’s index)

4. Submit a pattern: $\bigl( X_p(0), D_p \bigr)$, $p = p + 1$

5. Compute the layers’ outputs $X(l)$, $l = 1, 2, \ldots, L$, from the system Equ. 6.8

6. Compute the performance index of learning:

\[ E = E + \frac{1}{2} \left\| D_p - X_p(L) \right\|^2 \tag{6.23} \]

7. Compute the partial derivatives: $V_X(l) = V_X\bigl( X(l), W[l, L-1], \lambda \bigr)$

8. Compute the gradient of the return function with respect to the weights for each level

9. Choose the learning parameter $\eta > 0$ and compute the new weight values

11. If $l = 0$ then compute the new value of the gain parameter, for $\eta_1 > 0$

\[ \lambda' = \lambda - \eta_1\, \frac{\partial V\bigl( X(0), W[0, L-1], \lambda \bigr)}{\partial \lambda} \tag{6.29} \]

12. Set: $W[0, L-1] = W'[0, L-1]$, $X[1, L] = X'[1, L]$, $\lambda = \lambda'$

13. If $E < E_{MAX}$ then go to Step 14 else go to Step 3

14. STOP.
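The overall flow of the steps above can be sketched in Python for the smallest possible case. This is a simplified illustration, not the algorithm of the text: it uses a single neuron instead of a multilayer network, plain chain-rule gradients instead of the return-function recursions, and updates after every pattern. The function names, the teacher weights used to generate targets, and the values of `eta`, `eta_gain` and `e_max` are assumptions of this sketch.

```python
import math


def neuron_output(weights, inputs, lam):
    """Unipolar neuron with gain lam; also returns the net input."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-lam * net)), net


def train(patterns, eta=1.0, eta_gain=0.1, lam0=0.1, e_max=1e-4, max_epochs=5000):
    weights = [0.0] * len(patterns[0][0])   # Step 1: initial weights ...
    lam = lam0                              # ... and a small initial gain
    history = []
    for _ in range(max_epochs):
        error = 0.0                         # Step 2
        for inputs, target in patterns:     # Steps 3-4: submit the patterns in turn
            y, net = neuron_output(weights, inputs, lam)   # Step 5
            error += 0.5 * (target - y) ** 2               # Step 6
            delta = -(target - y) * y * (1.0 - y)          # dE_p / d(lam * net)
            grad_w = [delta * lam * x for x in inputs]     # gradient w.r.t. weights
            grad_lam = delta * net                         # gradient w.r.t. the gain
            weights = [w - eta * g for w, g in zip(weights, grad_w)]   # Step 9
            lam = lam - eta_gain * grad_lam                # Step 11, cf. (6.29)
        history.append(error)
        if error < e_max:                   # Step 13: stop condition
            break
    return weights, lam, history


# Targets generated by a hypothetical teacher neuron with gain 1.
teacher = [1.5, -1.0]
inputs_list = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]
patterns = [(x, neuron_output(teacher, x, 1.0)[0]) for x in inputs_list]
weights, lam, history = train(patterns)
```

The list `history` records the performance index per pass over the patterns. Starting from a small gain, both the weights and λ are adjusted, so the gain is treated as an additional control variable rather than being fixed at 1.0.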

The first order differential dynamic programming algorithm can ensure "almost global" optimisation of the learning process for the neural networks with the linearized neurons. By optimisation of the gain parameter the methodology allows for finding the "almost global" optimum of the learning performance index. We will call this method the heuristic dynamic programming for the multilayer neural networks learning.

6.5 Learning as a Multiobjective Problem

The idea of using multiobjective optimisation to solve some classes of nonlinear programming problems was first proposed by Geoffrion (1967). This idea was extended by Li and Haimes (1990, 1991) for the dynamic systems. In the case of the neural networks learning, the idea was further elaborated by Krawczak (1995b, 1997, 1998).

6.5.1 Embedding into Multiobjective Optimisation Problems

It is easy to notice that the performance index of learning, $E$, can be treated as a simple composite function of multiple performance indices $E_p$, corresponding to the training pairs $\{D_p, X_p(0)\}$, $p = 1, 2, \ldots, P$, as follows

\[ \min E = \Phi(E_1, E_2, \ldots, E_P) = \sum_{p=1}^{P} \frac{1}{2} \left\| D_p - X_p(L) \right\|^2 \tag{6.30} \]

subject to

\[ X_p(l+1) = F\bigl( X_p(l), W(l) \bigr), \quad l = 0, 1, \ldots, L-1. \tag{6.31} \]

The minimization is done with respect to the weights $W(l)$, $l = 0, 1, \ldots, L-1$. In the discussed case, the overall performance index $E$ is a strictly increasing function of $E_p$, for $p = 1, \ldots, P$, i.e. $\partial E / \partial E_p = 1$.

Conforming to the notation used, the multiobjective multistage optimisation problem can be formulated in the following way

\[
\min \begin{bmatrix} E_1 \\ \vdots \\ E_P \end{bmatrix}
= \min \begin{bmatrix} \frac{1}{2} \left\| D_1 - X_1(L) \right\|^2 \\ \vdots \\ \frac{1}{2} \left\| D_P - X_P(L) \right\|^2 \end{bmatrix} \tag{6.32}
\]

subject to

\[ X_p(l+1) = F\bigl( X_p(l), W(l) \bigr), \quad l = 0, 1, \ldots, L-1. \tag{6.33} \]

In general, the solution to the multiobjective problem (6.32)-(6.33) is not unique. A solution $\hat{W}$ of this problem is said to be noninferior if there does not exist another feasible $W$ such that

\[ E_p(W) \le E_p(\hat{W}) \tag{6.34} \]

for all $p = 1, 2, \ldots, P$, with strict inequality for at least one $p$.
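The definition (6.34) translates directly into a dominance test on vectors of partial performance indices. The Python sketch below applies it to a finite list of candidate solutions, each represented only by its vector $(E_1, \ldots, E_P)$; the function names and the sample vectors are chosen for this illustration.

```python
def dominates(e_a, e_b):
    """True if a is no worse than b for every p and strictly better for some p."""
    no_worse = all(x <= y for x, y in zip(e_a, e_b))
    strictly_better = any(x < y for x, y in zip(e_a, e_b))
    return no_worse and strictly_better


def noninferior(candidates):
    """Keep the candidates that no other candidate dominates, cf. (6.34)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Vectors (E_1, E_2) of four hypothetical weight settings.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = noninferior(candidates)
```

The vector (3.0, 3.0) is dropped because (2.0, 2.0) is better on both indices; the remaining three are mutually incomparable, which illustrates why the noninferior solution is in general not unique.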

For the optimisation problem (6.32)-(6.33) the following theorems can be proved (Li and Haimes 1990). Here, the respective analysis is developed for the neural networks learning problem.

Theorem 6.1. The optimal solution of problem (6.30)-(6.31) is attained by a non-inferior solution of the multiobjective optimisation problem given through (6.32)-(6.33).

The most common approaches to generation of the set of noninferior solutions are the ε-constraint method and the weighting method (Li and Haimes 1990).

The noninferior solutions to the problem (6.32)-(6.33) can be generated by solving the following ε-constraint form

\[ \min_{W} E_1(W) \tag{6.35} \]

subject to

\[ E_p(W) \le \varepsilon_p, \quad p = 2, 3, \ldots, P \tag{6.36} \]

and

\[ X_p(l+1) = F\bigl( X_p(l), W(l) \bigr), \quad l = 0, 1, \ldots, L-1 \tag{6.37} \]

where $X_p(0)$ is given.

Theorem 6.2. Assume that the set of noninferior solutions of problem (6.32)-(6.33) can be parameterised by $\mu_{1p}$, $p = 2, 3, \ldots, P$, which are the optimal Kuhn-Tucker multipliers associated with the $p$-th constraint in Equ. (6.36). Then the optimal solution of the dynamic optimisation problem (6.30)-(6.31) is reached by the noninferior solution that satisfies the following equalities

\[ \frac{\partial E}{\partial E_p} - \mu_{1p} \frac{\partial E}{\partial E_1} = 0, \quad p = 2, 3, \ldots, P. \tag{6.38} \]

For the case considered in this section, the objective function $E$ is of an additive form, and so the solution to Equ. (6.30)-(6.31) is attained by the noninferior solution with all $\mu_{1p}$ equal to one.

If problem (6.32)-(6.33) is convex, then the noninferior solutions of problem (6.30)-(6.31) can be obtained by solving the following weighting form

\[ \min_{W} \sum_{p=1}^{P} v_p E_p(W) \tag{6.39} \]

subject to

\[ X_p(l+1) = F\bigl( X_p(l), W(l) \bigr), \quad l = 0, 1, \ldots, L-1 \tag{6.40} \]

where $X_p(0)$ is given, and

\[ \sum_{p=1}^{P} v_p = 1; \quad v_p \ge 0, \quad p = 1, 2, \ldots, P. \]

Theorem 6.3. If the set of noninferior solutions of the problem (6.32)-(6.33) can be parameterised by the overall weighting vector $v$, the optimal solution of the nonseparable dynamic optimisation problem given in (6.30)-(6.31) is reached under certain conditions by the noninferior solution that satisfies the following equations

\[ \frac{\partial E / \partial E_1}{v_1} = \frac{\partial E / \partial E_2}{v_2} = \ldots = \frac{\partial E / \partial E_P}{v_P}. \tag{6.41} \]

For the discussed case, due to the additive form of the performance index, the optimal solution of problem (6.30)-(6.31) is attained by the noninferior solution with all $v_p$ equal to $1/P$.

The aim of considering the optimisation problem (6.30)-(6.31) as an embedded problem, meaning the embedding of this optimisation problem in a family of parameterised optimisation problems, is to obtain a new algorithm for the multilayer neural networks learning.

Let us consider the following weighted minimax formulation for problem (6.30)-(6.31):

\[ \min_{W} \max \{ v_1 E_1, v_2 E_2, \ldots, v_P E_P \} \tag{6.42} \]

subject to

\[ X_p(l+1) = F\bigl( X_p(l), W(l) \bigr) \tag{6.43} \]

where $X_p(0)$ is given, the weighting coefficient $v_1$ is always set to one and the weighting coefficients $v_p$, $p = 2, \ldots, P$, are nonnegative. In (6.42) the maximization is performed among $P$ weighted systems indicated by the performance indices (corresponding to any training pair), while the minimization is carried out over the weights $W$.
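The difference between the weighting form and the weighted minimax form can be seen on a one-weight example. The Python sketch below assumes a hypothetical linear model with partial indices $E_p(w) = \frac{1}{2}(d_p - w x_p)^2$, so that the weighted sum has a closed-form minimiser and the minimax objective, being a maximum of convex quadratics, can be minimised by ternary search. The data and all function names are assumptions of this illustration.

```python
def partial_indices(w, xs, ds):
    """E_p(w) = 0.5 * (d_p - w * x_p)**2 for a one-weight linear model."""
    return [0.5 * (d - w * x) ** 2 for x, d in zip(xs, ds)]


def weighted_sum_minimiser(xs, ds, vs):
    """Closed-form minimiser of sum_p v_p * E_p(w), cf. the weighting form (6.39)."""
    numerator = sum(v * x * d for v, x, d in zip(vs, xs, ds))
    denominator = sum(v * x * x for v, x in zip(vs, xs))
    return numerator / denominator


def minimax_minimiser(xs, ds, vs, low=-10.0, high=10.0, iterations=200):
    """Ternary search for min_w max_p v_p * E_p(w), cf. (6.42).

    The maximum of convex quadratics is convex, so ternary search applies.
    """
    def worst(w):
        return max(v * e for v, e in zip(vs, partial_indices(w, xs, ds)))

    for _ in range(iterations):
        left = low + (high - low) / 3.0
        right = high - (high - low) / 3.0
        if worst(left) < worst(right):
            high = right
        else:
            low = left
    return 0.5 * (low + high)


xs = [1.0, 2.0, 3.0]
ds = [1.0, 2.5, 2.0]
w_additive = weighted_sum_minimiser(xs, ds, [1.0, 1.0, 1.0])      # plain sum of E_p
w_equal = weighted_sum_minimiser(xs, ds, [1.0 / 3.0] * 3)         # all v_p = 1/P
w_minimax = minimax_minimiser(xs, ds, [1.0, 1.0, 1.0])
worst_additive = max(partial_indices(w_additive, xs, ds))
worst_minimax = max(partial_indices(w_minimax, xs, ds))
```

Equal weights $v_p = 1/P$ give the same minimiser as the additive index, in line with the remark following Theorem 6.3, whereas the minimax solution trades a larger total error for a smaller worst-case partial index.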

6.5.2 Iterative Minimax Solution

Several theorems strictly related to the conversion of the optimisation problem (6.30)-(6.31) into another optimisation problem, (6.42)-(6.43), can be proven. If we denote by $W^*$ the set of solutions of the problem (6.30)-(6.31) and by $W_v^*$ the union of sets of solutions of the weighted minimax problem (6.42)-(6.43), then it is possible to prove the following

Theorem 6.4. The intersection of $W^*$ and $W_v^*$ is nonempty, i.e.
