5 Learning as a Control Process
5.5 Description of First Order Differential Dynamic Programming
The generalized net representation of the backpropagation algorithm was described in Sect. 4.6, and the learning algorithms based on differential dynamic programming were described in Sect. 5.3 and 5.4. In this section, we consider the process of neural network training as a multistage optimal control problem, and we use the generalized net methodology to describe the neural network learning algorithm based on first order differential dynamic programming.
The procedure for developing this new representation of neural networks with the generalized net methodology is based first of all on the books by Atanassov (1991, 1998, 2007) and by Atanassov and Aladjov (2000), as well as the papers by Krawczak et al. (2003) and the works of Krawczak (2003b, 2003e, 2004e, 2004g, 2005b).
As in Sect. 4.6, it is assumed that the basic dynamic components of the neural network are neurons. This assumption is crucial because the signals, as well as the connection weights, are less important in this consideration. In this way all changes in the neural network structure and states are represented by changes of neuron states. The $\alpha$-type tokens describing each neuron (or a group of neurons) in the neural network enter the net through the place $X_1$, and have the following initial characteristics
$$y(\alpha_{i(l)}) = \langle NN1, l, i(l), f_{i(l)}^{(l)} \rangle \tag{5.62}$$
for $i(l) = 1, 2, \dots, N(l)$, $l = 0, 1, \dots, L$, where
$NN1$ is the neural network identifier,
$i(l)$ is the number of the token (neuron) associated with the $l$-th layer,
$l$ is the present layer number,
$f_{i(l)}^{(l)}(\cdot)$ is the activation function of the $i$-th neuron associated with the $l$-th layer of the neural network, where $f_{i(0)}^{(0)}(\cdot) = 1.0$ for $i(0) = 1, 2, \dots, N(0)$.
The generalized net representation of the first order differential dynamic programming algorithm contains seven transitions, see Fig. 5.3. It is particularly important that most of the transitions are the same as for the backpropagation algorithm (therefore, they are recalled only briefly), and that the transition $Z_5$ is divided into two transitions, $Z_{5,1}$ and $Z_{5,2}$.
Fig. 5.3 The generalized net description of the first order differential dynamic programming algorithm
Via the transition $Z_1$ every token $\alpha_{i(l)}$, $i(l) = 1, 2, \dots, N(l)$, $l = 0, 1, \dots, L$, is transferred from the place $X_1$ to the place $X_2$ as well as $X_3$. The tokens are transferred sequentially according to the increasing indexes $i(l) = 1, 2, \dots, N(l)$ for $l = 0, 1, \dots, L$; in this way the tokens of the same level $l$ are aggregated into a new token $\alpha(l)$, representing the layer $l$, according to the transition conditions in the transition $Z_1$
$$Z_1 = \left\langle \{X_1, X_2\},\ \{X_2, X_3\},\ \begin{array}{c|cc} & X_2 & X_3 \\ \hline X_1 & V_{1,2} & V_{1,3} \\ X_2 & V_{2,2} & V_{2,3} \end{array},\ \vee(X_1, X_2) \right\rangle \tag{5.63}$$
where
$V_{1,2} = \neg V_{1,3}$ = “if there is only one token $\alpha_{i(l)}$ in the place $X_1$”, i.e.
$$\forall \left( \alpha_{j(l)} \in K_{X_1} \right) \left( pr_2 Y_{\alpha_{i(l)}} \neq pr_2 Y_{\alpha_{j(l)}},\ j(l) \neq i(l) \right) \tag{5.64}$$
(where $K_{X_1}$ is the set of all tokens entering the net from the place $X_1$, and $i(l), j(l) = 1, 2, \dots, N(l)$),
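$V_{2,2}$ = “if there is more than one token $\alpha_{i(l)}$ and $\alpha_{j(l)}$ associated with the $l$-th layer”, i.e.
$$\exists \left( \alpha_{i(l)} \in K_{X_1}\ \&\ \alpha_{j(l)} \in K_{X_2} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{j(l)}} \right) \tag{5.65}$$
$V_{2,3}$ = “if all tokens $\alpha_{i(l)}$, $i(l) = 1, 2, \dots, N(l)$, have been combined into one token”, i.e.
$$\neg\exists \left( \alpha_{i(l)} \in K_{X_1}\ \&\ \alpha_{k(l)} \in K_{X_2} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{k(l)}} \right)\ \&\ \neg\exists \left( \alpha_{i(l)} \in K_{X_1}\ \&\ \alpha_{j(l)} \in K_{X_1} \right) \left( pr_2 Y_{\alpha_{i(l)}} = pr_2 Y_{\alpha_{j(l)}},\ i \neq j \right). \tag{5.66}$$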
The tokens associated with neurons lying within one layer are aggregated in the transition $Z_1$. For each layer $l = 0, 1, \dots, L$, the characteristics of the tokens are processed in order to construct a new token $\alpha(l)$ representing the whole $l$-th layer, according to the condition (5.66). The aggregated token is transferred from the place $X_2$ to the place $X_3$, and has the following characteristic
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l) \rangle \quad \text{for } l = 0, 1, 2, \dots, L, \tag{5.67}$$
where
$NN1$ is the neural network identifier,
$l$ is the layer number,
$[1, N(l)]$ denotes $N(l)$ tokens (neurons) arranged in a sequence, starting from the first and ending at $N(l)$, associated with the $l$-th layer,
$$F(0) = [1, 1, \dots, 1]^T, \qquad F(l) = \left[ f_1^{(l)}(\cdot), f_2^{(l)}(\cdot), \dots, f_{N(l)}^{(l)}(\cdot) \right]^T \tag{5.68}$$
is a vector of the activation functions of the neurons associated with the $l$-th layer of the neural network.
In the place $X_3$ we obtain $L$ tokens, producing the neural network output.
The second transition $Z_2$ has the following form
$$Z_2 = \left\langle \{X_3, m_1\},\ \{X_4, m_2\},\ \begin{array}{c|cc} & X_4 & m_2 \\ \hline X_3 & true & false \\ m_1 & false & true \end{array},\ \wedge(X_3, m_1) \right\rangle \tag{5.69}$$
where the performance index of the learning process is introduced, and the $\beta$-type token, which enters the input place $m_1$, has the following initial characteristic
$$y(\beta) = \langle NN1, E, E_{\max} \rangle \tag{5.70}$$
where
$NN1$ is the neural network identifier,
$E$ is the performance index of the neural network learning,
$E_{\max}$ is the threshold value of the performance index, which must be reached.
The token $\alpha(l)$, $l = 0, 1, 2, \dots, L$, related to the $l$-th layer, now has the following characteristic in the place $X_4$
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W(l) \rangle \tag{5.71}$$
for $l = 0, 1, 2, \dots, L$, where
$NN1$ is the neural network identifier,
$l$ is the layer number,
$[1, N(l)]$ denotes $N(l)$ tokens (neurons) arranged in a sequence, starting from the first and ending at $N(l)$, associated with the $l$-th layer,
$$F(0) = [1, 1, \dots, 1]^T, \qquad F(l) = \left[ f_1^{(l)}(\cdot), f_2^{(l)}(\cdot), \dots, f_{N(l)}^{(l)}(\cdot) \right]^T \tag{5.72}$$
is a vector of the activation functions of the neurons associated with the $l$-th layer,
$W(l)$ denotes the aggregated initial weights connecting the neurons of the $l$-th layer with the $(l-1)$-st layer neurons.
In the place $m_2$ the $\beta$ token now obtains the characteristic
$$y(\beta) = \langle NN1, 0, E_{\max} \rangle. \tag{5.73}$$
In the transition $Z_3$ the tokens $\gamma_p$, $p = 1, 2, \dots, P$, with $p$ being the number of the training pattern, enter the place $n_1$ with the initial characteristic
$$y(\gamma_p) = \langle X_p(0), D_p, p \rangle, \tag{5.74}$$
where
$X_p(0) = [x_{p1}, x_{p2}, \dots, x_{pN(0)}]^T$ is the input vector of the neural network,
$D_p = [d_{p1}, d_{p2}, \dots, d_{pN(L)}]^T$ is the vector of desired network outputs,
and for the input $X_p(0)$ the outputs of all layers are calculated.
The transition $Z_3$ describes the process of signal propagation in the neural network (5.76). The tokens $\alpha(l)$, $l = 0, 1, 2, \dots, L$, in the place $X_5$, obtain the new characteristics
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W(l), X(l) \rangle \tag{5.75}$$
where $X(l)$ and $W(l)$, $l = 1, 2, \dots, L$, denote the outputs and the weights of the $l$-th layer.
$$Z_3 = \left\langle \{X_4, X_5, X_9, m_2, m_7, n_1\},\ \{X_5, X_6, m_3, n_2\},\ \begin{array}{c|cccc} & X_5 & X_6 & m_3 & n_2 \\ \hline X_4 & V_{4,5} & V_{4,6} & false & false \\ X_5 & V_{5,5} & V_{5,6} & false & false \\ X_9 & V_{9,5} & V_{9,6} & false & false \\ m_2 & false & false & true & false \\ m_7 & false & false & true & false \\ n_1 & false & false & false & true \end{array},\ \wedge\left( \vee(X_4, X_5, X_9), (m_2, m_7), n_1 \right) \right\rangle \tag{5.76}$$
where
$V_{4,5} = V_{5,5} = V_{9,5}$ = “previous layer does not have defined outputs”,
$V_{4,6} = V_{5,6} = V_{9,6} = \neg V_{4,5}$,
$V_{1,2}$ = “all layers’ outputs have assigned values for the current pattern”.
In the place $X_6$ the tokens $\alpha$ get the following characteristics
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W(l), X_p(l) \rangle \tag{5.77}$$
while the token $\beta$ has the characteristic $y(\beta) = \langle NN1, 0, E_{\max} \rangle$ (in the place $m_3$), and the token $\gamma$ has the characteristic $y(\gamma_p) = \langle X_p(0), D_p, p \rangle$ in the place $n_2$.
The transition $Z_4$ describes the estimation and weight adjustment, and has the form [...] via the characteristic
$$y(\beta) = \langle NN1, E', E_{\max} \rangle.$$
In distinction from the generalized net representation of the classic backpropagation algorithm, described in Chap. 4, here the transition $Z_5$ is split into two transitions, $Z_{5,1}$ and $Z_{5,2}$.
The transition $Z_{5,1}$ models the evaluation of the partial derivatives described in detail in Chap. 4 [...] and partial derivatives of the return functions with respect to the aggregated [...]
For the given nominal values of the neuron states $X(l)$ and weight connections $W[l, L-1]$, where $l = L-1, \dots, 0$, the equations (5.82) and (5.83) in a shorter form are as follows
$$V_X(l) = V_X\left( X(l), W[l, L-1] \right)$$
[...] which include the results of Equ. 5.83
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W(l), X_p(l), V_X(l) \rangle. \tag{5.86}$$
The $\beta$-type token does not change its characteristic in the place $m_{5,1}$.
The transition $Z_{5,2}$ is devoted to the partial derivatives of the return functions $V\left( X(l), W[l, L-1] \right)$ with respect to $W[l, L-1]$, which are described by (5.82), and has the following form [...] changed its characteristic.
The transition $Z_6$ describes the process of weight adjustment during the learning process and has the following form
$$Z_6 = \left\langle \{X_{8,2}, m_{5,2}\},\ \{X_9, X_{10}, m_6, m_7\},\ \begin{array}{c|cccc} & X_9 & X_{10} & m_6 & m_7 \\ \hline X_{8,2} & V_{8,9} & V_{8,10} & false & false \\ m_{5,2} & false & false & V_{5,6} & V_{5,7} \end{array},\ \wedge(X_{8,2}, m_{5,2}) \right\rangle \tag{5.89}$$
where
$V_{8,9}$ = “there are still unused patterns”,
$V_{8,10} = \neg V_{8,9}$,
$V_{5,6}$ = “if the performance index is below the given threshold $E_{\max}$”,
$V_{5,7} = \neg V_{5,6}$.
The $\alpha$-type tokens obtain the new characteristics in the place $X_9$
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W'(l) \rangle \tag{5.90}$$
with updated weight connections
$$W'(l) = \left[ w'_1(l), w'_2(l), \dots, w'_{N(l)}(l) \right]^T \tag{5.91}$$
where
$$w'_{i}(l) = \left[ w'_{i(l)1(l-1)}, w'_{i(l)2(l-1)}, \dots, w'_{i(l)N(l-1)} \right]^T$$
for $i(L-1) = 1, 2, \dots, N(L-1)$, $j(L) = 1, 2, \dots, N(L)$. The new values of the weights are calculated in the following way (for the learning parameter $\eta > 0$)
$$W'(l) = W(l) - \eta \left( \frac{\partial V\left( X(l), W[l, L-1] \right)}{\partial W(l)} \right)^T \tag{5.92}$$
for $l = L, L-1, \dots, 0$, and replace $W[0, L-1] = W'[0, L-1]$, $X[1, L] = X'[1, L]$.
In the place $m_7$ the $\beta$ token obtains the characteristic
$$y(\beta) = \langle NN1, E, E_{\max} \rangle$$
which is not final.
The optimal values of the weights satisfying the stop condition are denoted by $W^*(l) = pr_5 \langle NN1, l, [1, N(l)], F(l), W'(l) \rangle$, where the characteristic of the $\alpha$-type tokens, in the place $X_{10}$, is described by
$$y(\alpha(l)) = \langle NN1, l, [1, N(l)], F(l), W'(l) \rangle \tag{5.93}$$
while the final value of the performance index is equal to $E^* = pr_2 \langle NN1, E', E_{\max} \rangle$, and the $\beta$ token characteristic in the place $m_6$ is described by
$$y(\beta) = \langle NN1, E', E_{\max} \rangle. \tag{5.94}$$
M. Krawczak: Multilayer Neural Networks, SCI 478, pp. 123–144.
DOI: 10.1007/978-3-319-00248-4_6 © Springer International Publishing Switzerland 2013
6 Parameterisation of Learning
6.1 Introduction
Learning of a neural network means adjusting the connections between layers (connections between neurons) in order to minimize the performance index of learning. For this, the backpropagation algorithm with various modifications is commonly used. At the same time, the learning process of multilayer neural networks can be considered as a particular multistage optimal control problem, described in Chap. 4. This kind of problem can be naturally treated by the dynamic programming approach (Chernousko and Lyubushin 1982, Larson and Korsak 1970) as well as (Krawczak 1994, 1995a, 1999b, 2000a, 2001b, 2002b, 2002c).
In this chapter we introduce a gain parameter into models of neurons. The value of this parameter is tacitly assumed to be 1.0 in almost all learning algorithms used. Note that setting the parameter to a small value makes the neuron model “almost linear”. Thus, the learning problem can be solved using computational tools designed for the optimisation of linear-quadratic systems, like the first order differential dynamic programming methodology described in Chap. 5.
Using the continuation methodology (Krawczak 1999b, 2000b, 2000c, 2000e) the value of the gain parameter may be changed in order to reach 1.0. In fact, we can do much more: by considering the gain parameter as an additional control variable, the optimal value of the parameter can be found (Krawczak 2001a, 2002b). The presented methodology is based on the first order differential dynamic programming, and due to the global properties of the methodology we propose to call it the heuristic dynamic programming. In some sense the idea is borrowed from the simulated annealing approach, known in stochastic mechanics (Aarts and Laarhoven 1987, Kirkpatrick et al. 1983, Brooks and Morgan 1995).
In this chapter we will also describe the method of converting the multilayer neural network learning problem into an iterative minimax problem. The method uses the methodology of game theory as well as the tools of multiobjective optimisation introduced by Krawczak (1997b, 1998), and allows the neural network learning process to be considered as a multiobjective optimisation problem, where each pair of the training examples is associated with a partial performance index (as a separate objective function). The methodology can be used for establishing a new paradigm of learning that we will call the updating learning.
6.2 Neuron Models with Parameters
Let us consider an artificial neuron model which is a nonlinear processing unit performing the operation $f(net)$ by means of the activation function, where $net$ is the input to the neuron.
There are two typical, commonly used activation functions, described by the sigmoidal functions:
the unipolar
$$f(net, \lambda) = \frac{1}{1 + \exp(-\lambda\, net)} \tag{6.1}$$
the bipolar
$$f(net, \lambda) = \frac{2}{1 + \exp(-\lambda\, net)} - 1 \tag{6.2}$$
where $\lambda > 0$ is the gain parameter, $1/\lambda$ being sometimes called the temperature of the system. This parameter describes the slope of the activation functions. These functions are shown in Fig. 6.1 and 6.2.
We can observe how a curve changes with respect to the gain parameter $\lambda$. Assuming reasonable values of the input, for large values of the gain parameter the sigmoidal function turns into the Heaviside function (the step function), while for small values of the parameter the sigmoidal function becomes an “almost linear” function.
Fig. 6.1 A unipolar sigmoidal function with different gain values
Fig. 6.2 A bipolar sigmoidal function with different gain values
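The effect of the gain on both sigmoids can be checked numerically. The following is a minimal sketch, assuming plain Python and hypothetical function names (`unipolar`, `bipolar`, `linear_approx`) that do not come from the text; `linear_approx` is the first-order Taylor expansion of the bipolar function around zero, added here only to illustrate the “almost linear” regime.

```python
import math

def unipolar(net: float, gain: float = 1.0) -> float:
    """Unipolar sigmoid, cf. (6.1): 1 / (1 + exp(-gain * net))."""
    return 1.0 / (1.0 + math.exp(-gain * net))

def bipolar(net: float, gain: float = 1.0) -> float:
    """Bipolar sigmoid, cf. (6.2): 2 / (1 + exp(-gain * net)) - 1."""
    return 2.0 / (1.0 + math.exp(-gain * net)) - 1.0

def linear_approx(net: float, gain: float) -> float:
    """First-order expansion of the bipolar sigmoid around net = 0 (illustrative)."""
    return 0.5 * gain * net

if __name__ == "__main__":
    # Small gain: close to a straight line; large gain: close to a step.
    for gain in (0.1, 1.0, 10.0):
        row = [round(bipolar(x, gain), 3) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
        print(f"gain={gain}: {row}")
```

For moderate inputs the small-gain curve is nearly indistinguishable from its tangent at zero, while the large-gain curve saturates almost immediately, which matches the qualitative behaviour described for Fig. 6.1 and 6.2.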
6.3 Continuation Method
In the literature on nonlinear optimisation one can find a little-known methodology called the continuation method (Avila 1974, Richter and de Carlo 1984), which allows finding near optimal solutions. The method is usually presented in terms of finding zeroes of a mapping $G: R^n \to R$, and it can be generalized to finding fixed points (Ortega and Rheinboldt 1970).
Following the papers of the present author (Krawczak 1999b, 2000b, 2000c, 2000e) we will consider the learning of the neural networks as the optimisation problem
$$\min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.3}$$
subject to
$$X(l+1) = F\left( W(l), X(l) \right) = F(l) \quad \text{for } l = 0, 1, \dots, L-1 \tag{6.4}$$
where $F(l)$ is an aggregated function of transition from one layer to another. Let $\tilde{W}[0, L-1]$ be a solution of the problem (6.3)-(6.4). Now, let us consider the following homotopy
$$\min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.5}$$
subject to
$$X(l+1) = F\left( W(l), X(l), \lambda \right) \quad \text{for } l = 0, 1, \dots, L-1 \tag{6.6}$$
such that for $\lambda = 1$ the problem (6.5)-(6.6) is equivalent to (6.3)-(6.4). It is assumed that the problem (6.5)-(6.6) has some trivial, or easy to compute, solution for $\lambda = \lambda_0$. The value of the parameter $\lambda_0$ is treated as the starting point. Additionally, due to the “almost” linear-quadratic form of the considered problem, it is assumed that the solution obtained for $\lambda = \lambda_0$ is the global one. This assumption is based on the properties of the first order differential dynamic programming methodology described in Chap. 5.
Next, the problem can be imbedded into a family of problems with the parameter $\lambda$. Usually we do not consider the gain parameter, which means that we assume that the parameter is equal to 1. The basic idea is as follows: if for all $\lambda \in [\lambda_0, 1]$ there exist weights $W(\lambda, [0, L-1])$ that are solutions of (6.5)-(6.6), then the curve $W(\lambda, [0, L-1])$ can be found numerically, starting at the point $W(\lambda_0, [0, L-1])$ and ending at $W(1, [0, L-1]) = \tilde{W}(1, [0, L-1])$. In the case of the linear-quadratic optimisation problem, like the neural network learning, it is obvious that there exists a solution of (6.5)-(6.6) for any $\lambda \in [\lambda_0, 1]$, where $0 < \lambda_0 \leq 1$. It is assumed that the curve $W(\lambda, [0, L-1])$ is continuous and has first derivatives.
This kind of approach is known in the literature as the continuation method (Avila 1974) or the homotopy method (Richter and de Carlo 1984). Due to the continuous differentiability of the considered sigmoidal activation functions, and the methodology described in the previous chapter, it is obvious that the function $W(\lambda, [0, L-1])$ is continuous and differentiable with respect to the gain parameter $0 < \lambda < +\infty$.
There are two main methods of finding the solution curve $W(\lambda, [0, L-1])$: the first is the discrete method (Avila 1974) and the second is Davidenko’s method (Davidenko 1953).
Discrete Method
The interval $[\lambda_0, 1]$ should be divided into several segments
$$\lambda_0 < \lambda_1 < \lambda_2 < \dots < \lambda_N = 1$$
and the corresponding partial problems
$$\min_{W[0, L-1]} E = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.7}$$
subject to
$$X(l+1) = F\left( W(l), X(l), \lambda_k \right), \quad l = 0, 1, \dots, L-1 \tag{6.8}$$
for $k = 0, 1, \dots, N$, should be solved.
Starting with the known solution $W(\lambda_0, [0, L-1])$, obtained by applying the first order differential dynamic programming method described in Sect. 5.6.1, the new weights $W(\lambda_1, [0, L-1])$ are computed. The procedure is performed sequentially for all $\lambda_0 < \lambda_1 < \lambda_2 < \dots < \lambda_N = 1$ until the point $\lambda_N = 1$ is reached. The main problem is to determine the conditions of existence of the partition $\lambda_0 < \lambda_1 < \lambda_2 < \dots < \lambda_N = 1$ and the iterative process
$$\lambda_{k+1} = I\left( k, W(\lambda_k, [0, L-1]) \right), \quad k = 0, 1, \dots, N. \tag{6.9}$$
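The warm-start idea of the discrete method can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden toy, not the procedure of the text: it uses a single bipolar neuron with one scalar weight, a uniform partition of $[\lambda_0, 1]$ instead of the rule (6.9), and plain gradient descent in place of the first order differential dynamic programming solver. All names (`loss_and_grad`, `solve_partial_problem`, `discrete_continuation`) are hypothetical.

```python
import math

def bipolar(net, gain):
    # Same function as (6.2): 2 / (1 + exp(-gain * net)) - 1 == tanh(gain * net / 2)
    return math.tanh(0.5 * gain * net)

def loss_and_grad(w, gain, xs, ds):
    """E = 1/2 * sum_p (d_p - f(w * x_p, gain))^2 and dE/dw for one neuron."""
    e, g = 0.0, 0.0
    for x, d in zip(xs, ds):
        y = bipolar(w * x, gain)
        dy_dnet = 0.5 * gain * (1.0 - y * y)
        e += 0.5 * (d - y) ** 2
        g += -(d - y) * dy_dnet * x
    return e, g

def solve_partial_problem(w, gain, xs, ds, eta=0.5, iters=2000):
    """Stand-in solver for one partial problem (6.7)-(6.8): gradient descent."""
    for _ in range(iters):
        _, g = loss_and_grad(w, gain, xs, ds)
        w -= eta * g
    return w

def discrete_continuation(xs, ds, gain0=0.1, n_segments=9, w0=0.0):
    """Solve the partial problems for gain0 = l_0 < l_1 < ... < l_N = 1,
    warm-starting each one from the previous solution."""
    gains = [gain0 + k * (1.0 - gain0) / n_segments for k in range(n_segments)]
    gains.append(1.0)  # make the end point exactly 1
    w, path = w0, []
    for gain in gains:
        w = solve_partial_problem(w, gain, xs, ds)
        path.append((gain, w))
    return path

if __name__ == "__main__":
    xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    ds = [bipolar(1.5 * x, 1.0) for x in xs]  # targets from a "teacher" weight 1.5
    for gain, w in discrete_continuation(xs, ds):
        print(f"gain={gain:.1f}  w={w:.4f}")
```

In this toy the solution curve is roughly $w(\lambda) \approx 1.5/\lambda$, so the weight shrinks as the gain grows and ends near the teacher value at $\lambda = 1$; each stage only has to correct the small change caused by the previous gain step.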
According to Ortega and Rheinboldt (1970), for linear-quadratic optimisation problems the relationship (6.9) exists, and the partition of the range $[\lambda_0, 1]$ must ensure that the reachable sets of the obtained solutions of (6.7)-(6.8)
$$\left\{ W(\lambda_k, [0, L-1]),\ W(\lambda_{k+1}, [0, L-1]) \right\}$$
and
$$\left\{ W(\lambda_{k+1}, [0, L-1]),\ W(\lambda_{k+2}, [0, L-1]) \right\}$$
for $k = 0, 1, 2, \dots, N-2$ overlap for each pair $\{\lambda_k, \lambda_{k+1}\}$, $k = 0, 1, \dots, N-1$.
Davidenko’s Method
The method requires writing the optimisation problem in the following way
$$H\left( W(\lambda, [0, L-1]), \lambda \right) = 0. \tag{6.10}$$
By differentiating (6.10) with respect to $\lambda$ one gets the following Davidenko differential equation
$$H_W\left( W(\lambda, [0, L-1]), \lambda \right) \frac{dW(\lambda, [0, L-1])}{d\lambda} + H_\lambda\left( W(\lambda, [0, L-1]), \lambda \right) = 0 \tag{6.11}$$
with $W(\lambda_0, [0, L-1])$ as the initial condition; numerical integration from $\lambda_0$ to 1.0 then gives the solution $\tilde{W}[0, L-1] = W(1, [0, L-1])$. The main difficulty is the implicit nature of this equation for such complex problems as the neural network learning.
The proposed application of the continuation method, combined with the first order differential dynamic optimisation approach, gives a new possibility of finding the global optimal value of the performance index of the learning process for multilayer feedforward neural networks. In some sense the approach is similar to the idea of the simulated annealing methodology (Aarts and Laarhoven 1987).
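To make the structure of Davidenko’s equation concrete, the sketch below integrates $dW/d\lambda = -H_\lambda / H_W$ for a scalar toy condition $H(w, \lambda) = \tanh(\lambda w) - d = 0$. This condition, the target value `TARGET`, the classical Runge-Kutta integrator and all function names are assumptions made for illustration; the text does not prescribe an integration scheme, and a real network would require the matrix form of (6.11).

```python
import math

TARGET = 0.6  # hypothetical desired output d of the toy condition

def h_w(w, lam):
    """Partial derivative of H(w, lam) = tanh(lam * w) - TARGET w.r.t. w."""
    return lam * (1.0 - math.tanh(lam * w) ** 2)

def h_lam(w, lam):
    """Partial derivative of H(w, lam) w.r.t. lam."""
    return w * (1.0 - math.tanh(lam * w) ** 2)

def davidenko_path(h_w, h_lam, w0, lam0, lam1=1.0, steps=400):
    """Integrate dw/dlam = -H_lam / H_w from lam0 to lam1 (classical RK4)."""
    def rhs(w, lam):
        return -h_lam(w, lam) / h_w(w, lam)

    w, lam = w0, lam0
    h = (lam1 - lam0) / steps
    for _ in range(steps):
        k1 = rhs(w, lam)
        k2 = rhs(w + 0.5 * h * k1, lam + 0.5 * h)
        k3 = rhs(w + 0.5 * h * k2, lam + 0.5 * h)
        k4 = rhs(w + h * k3, lam + h)
        w += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        lam += h
    return w

if __name__ == "__main__":
    lam0 = 0.1
    w_start = math.atanh(TARGET) / lam0  # solution of H = 0 at the starting gain
    w_end = davidenko_path(h_w, h_lam, w_start, lam0)
    print(w_end, math.atanh(TARGET))     # the two values should nearly agree
```

Here the initial condition is known in closed form; in the setting of the text it would instead come from solving the almost linear problem at $\lambda_0$.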
6.4 Heuristic Dynamic Programming
Let us rewrite the first order differential dynamic programming method considered here, described in Chap. 5, and introduce the gain parameter into the neuron models, as follows.
Now, the return function has the form [...]
The first order expansion of the return function for the whole network [...] $\delta\lambda = \lambda' - \lambda$ must be small enough to ensure the validity of the expansion. Choosing $\delta W$ and $\delta\lambda$ as [...] where $\eta$ is the learning parameter, we obtain that the return function [...]
Instead of considering the return function for the whole network $V\left( X(0), W[0, L-1], \lambda \right)$, the backward form can be calculated
$$V\left( X(l), W[l, L-1], \lambda \right) = V\left( X(l+1), W[l+1, L-1], \lambda \right) \tag{6.16}$$
for $l = 1, 2, \dots, L-1$, with the condition for the last layer
$$V\left( X(L), W[L-1, L-1], \lambda \right) = \frac{1}{2} \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2. \tag{6.17}$$
The partial derivative of the return function (6.16) takes the form
$$V_W(l) = F_W^T(l)\, V_X(l+1, \lambda)\, F_W(l). \tag{6.18}$$
In order to obtain (6.18), the expression $V_X(l, \lambda)$ must be calculated, which can be done using the following sequential relations
$$V_X(L, \lambda) = -\sum_{p=1}^{P} \left( D_p - X_p(L) \right) \tag{6.19}$$
$$V_X(l, \lambda) = F_X(l)\, V_X(l+1, \lambda). \tag{6.20}$$
The derivative of the return function $V\left( X(0), W[0, L-1], \lambda \right)$ with respect to $\lambda$ is obtained in a similar manner, as
$$V_\lambda(L, \lambda) = -\sum_{p=1}^{P} \left( D_p - X_p(L) \right) \tag{6.21}$$
$$V_\lambda(l, \lambda) = V_\lambda(l+1, \lambda) + F_\lambda(l)\, V_X(l+1, \lambda) \tag{6.22}$$
where $V_X(l+1, \lambda)$ is given by (6.19) and (6.20).
Using Equ. 6.19 - 6.21, the gradients of the return functions, which are required for Equ. 6.18 and 6.22, can be computed from the sequence of equations solved from the last layer back to the inputs.
The parameter $\lambda$ can be found at the input layer by minimizing the optimal return function $V\left( X'(0), W'[0, L-1], \lambda \right)$ with respect to $\lambda$.
The heuristic differential dynamic programming algorithm for the learning of multilayer neural networks can be formulated in the following steps:
1. Initialise weights: $W(l)$, $l = 0, 1, 2, \dots, L-1$; $\lambda$ as a small value, e.g. 0.1
2. Set: $E = 0$
3. Set: $p = 0$ ($p$ denotes the pattern’s index)
4. Submit a pattern: $\left( X_p(0), D_p \right)$, $p = p + 1$
5. Compute the layers’ outputs: $X(l)$, $l = 1, 2, \dots, L$ from the system Equ. 6.8
6. Compute the performance index of learning:
$$E = E + \frac{1}{2} \left\| D_p - X_p(L) \right\|^2 \tag{6.23}$$
7. Compute the partial derivatives: $V_X(l) = V_X\left( X(l), W[l, L-1], \lambda \right)$
8. Compute the gradient of the return function with respect to the weights for each level: [...]
9. Choose the learning parameter $\eta > 0$ and compute the new weight values [...]
11. If $l = 0$ then compute the new value of the gain parameter, for $\eta_1 > 0$
$$\lambda' = \lambda - \eta_1 \left( \frac{\partial V\left( X(0), W[0, L-1], \lambda \right)}{\partial \lambda} \right)^T \tag{6.29}$$
12. Set: $W[0, L-1] = W'[0, L-1]$, $X[1, L] = X'[1, L]$, $\lambda = \lambda'$
13. If $E < E_{MAX}$ then go to Step 14 else go to Step 3
14. STOP.
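The overall flow of the steps above (small initial gain, per-pattern updates of the weights, a separate update of the gain with its own learning parameter, stop when the performance index falls below the threshold) can be sketched for a single bipolar neuron. This is a simplified illustration under stated assumptions, not the algorithm of the text: the gradients are written out directly for one neuron rather than obtained from the return-function recursion, the weights start at zero for reproducibility, $E$ is reset at the start of each pass over the patterns, and the gain is kept above a small positive floor. The name `train_with_gain` and all parameter names are hypothetical.

```python
import math

def train_with_gain(patterns, n_inputs, eta=0.5, eta1=0.5, gain0=0.1,
                    e_max=1e-4, max_epochs=2000):
    """Gradient descent on the weights and on the gain of one bipolar neuron."""
    w = [0.0] * n_inputs            # Step 1: initial weights (zeros, an assumption)
    gain = gain0                    # Step 1: small initial gain
    e_total = float("inf")
    for _ in range(max_epochs):
        e_total = 0.0               # Step 2 (reset per pass, an assumption)
        for x, d in patterns:       # Steps 3-4: submit the patterns one by one
            net = sum(wi * xi for wi, xi in zip(w, x))
            y = math.tanh(0.5 * gain * net)        # Step 5: bipolar sigmoid (6.2)
            err = d - y
            e_total += 0.5 * err * err             # Step 6, cf. (6.23)
            slope = 1.0 - y * y
            grad_w = [-err * 0.5 * gain * slope * xi for xi in x]
            grad_gain = -err * 0.5 * net * slope
            w = [wi - eta * gi for wi, gi in zip(w, grad_w)]   # weight update
            gain = max(1e-3, gain - eta1 * grad_gain)          # gain update, cf. (6.29)
        if e_total < e_max:         # Step 13: stop condition
            break
    return w, gain, e_total

if __name__ == "__main__":
    teacher = (1.0, -0.5, 0.25)     # hypothetical weights generating the targets
    patterns = []
    for x1 in (-1.0, 1.0):
        for x2 in (-1.0, 1.0):
            x = (x1, x2, 1.0)       # last component acts as a bias input
            d = math.tanh(0.5 * sum(t * xi for t, xi in zip(teacher, x)))
            patterns.append((x, d))
    print(train_with_gain(patterns, 3))
```

Because only the product of the gain and the weights enters the neuron output, the learned pair is not unique; what the sketch shows is that the gain moves away from its small starting value together with the weights.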
The first order differential dynamic programming algorithm can ensure "almost global" optimisation of the learning process for neural networks with linearized neurons. By optimising the gain parameter, the methodology allows finding the "almost global" optimum of the performance index of learning. We will call this method the heuristic dynamic programming for the multilayer neural networks learning.
6.5 Learning as a Multiobjective Problem
The idea of using multiobjective optimisation to solve some classes of nonlinear programming problems was first proposed by Geoffrion (1967). This idea was extended by Li and Haimes (1990, 1991) for the dynamic systems. In the case of the neural networks learning, the idea was further elaborated by Krawczak (1995b, 1997, 1998).
6.5.1 Embedding into Multiobjective Optimisation Problems
It is easy to notice that the performance index of learning, $E$, can be treated as a simple composite function of multiple performance indices $E_p$, corresponding to the training pairs $\{D_p, X_p(0)\}$, $p = 1, 2, \dots, P$, as follows
$$\min E = \Phi(E_1, E_2, \dots, E_P) = \sum_{p=1}^{P} \left\| D_p - X_p(L) \right\|^2 \tag{6.30}$$
subject to
$$X_p(l+1) = F\left( X_p(l), W(l) \right), \quad l = 0, 1, \dots, L-1. \tag{6.31}$$
The minimization is done with respect to the weights $W(l)$, $l = 0, 1, \dots, L-1$. In the discussed case, the overall performance index $E$ is a strictly increasing function of $E_p$, for $p = 1, \dots, P$, i.e.
$$\frac{\partial E}{\partial E_p} = 1.$$
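Conforming to the notation used, the multiobjective multistage optimisation problem can be formulated in the following way
$$\min \begin{bmatrix} E_1 \\ \vdots \\ E_P \end{bmatrix} = \min \begin{bmatrix} \left\| D_1 - X_1(L) \right\|^2 \\ \vdots \\ \left\| D_P - X_P(L) \right\|^2 \end{bmatrix} \tag{6.32}$$
subject to
$$X_p(l+1) = F\left( X_p(l), W(l) \right), \quad l = 0, 1, \dots, L-1. \tag{6.33}$$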
In general, the solution to the multiobjective problem (6.32)-(6.33) is not unique. A solution $\hat{W}$ of this problem is said to be noninferior if there does not exist another feasible $W$ such that
$$E_p(W) \leq E_p(\hat{W}) \tag{6.34}$$
for all $p = 1, 2, \dots, P$, with strict inequality for at least one $p$.
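The definition of a noninferior solution can be turned into a direct check over a finite list of candidates. The sketch below is only an illustration: each candidate is represented by its tuple of partial indices $(E_1, \dots, E_P)$, and the comparison is restricted to the listed candidates rather than to all feasible weights, which is an assumption. The function names `dominates` and `noninferior` are hypothetical.

```python
def dominates(e_a, e_b):
    """True if e_a is no worse than e_b for every pattern and strictly better
    for at least one pattern, cf. (6.34)."""
    no_worse = all(a <= b for a, b in zip(e_a, e_b))
    strictly_better = any(a < b for a, b in zip(e_a, e_b))
    return no_worse and strictly_better

def noninferior(candidates):
    """Keep the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

if __name__ == "__main__":
    # Each tuple holds (E_1, E_2) for one hypothetical weight setting.
    candidates = [(1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5)]
    print(noninferior(candidates))
```

In the example the candidate (2.0, 2.0) is dropped because (1.0, 2.0) is at least as good for both patterns and strictly better for the first one; the remaining three candidates trade one index against the other.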
For the optimisation problem (6.32)-(6.33) the following theorems can be proved (Li and Haimes 1990). Here, the respective analysis is developed for the neural networks learning problem.
Theorem 6.1. The optimal solution of problem (6.30)-(6.31) is attained by a noninferior solution of the multiobjective optimisation problem given through (6.32)-(6.33).
The most common approaches to generation of the set of noninferior solutions are the ε-constraint method and the weighting method (Li and Haimes 1990).
The noninferior solutions to the problem (6.32)-(6.33) can be generated by solving the following $\varepsilon$-constraint form
$$\min_{W} E_1(W) \tag{6.35}$$
subject to
$$E_p(W) \leq \varepsilon_p, \quad p = 2, 3, \dots, P \tag{6.36}$$
and
$$X_p(l+1) = F\left( X_p(l), W(l) \right), \quad l = 0, 1, \dots, L-1 \tag{6.37}$$
where $X_p(0)$ is given.
Theorem 6.2. Assume that the set of noninferior solutions of problem (6.32)-(6.33) can be parameterised by $\mu_{1p}$, $p = 2, 3, \dots, P$, which are the optimal Kuhn-Tucker multipliers associated with the $p$-th constraint in Equ. (6.36). Then the optimal solution of the dynamic optimisation problem (6.30)-(6.31) is reached by the noninferior solution that satisfies the following equalities
$$\frac{\partial E}{\partial E_p} - \mu_{1p} \frac{\partial E}{\partial E_1} = 0, \quad p = 2, 3, \dots, P. \tag{6.38}$$
For the case considered in this section, the objective function $E$ is of an additive form, and so the solution to Equ. (6.30)-(6.31) is attained by the noninferior solution with all $\mu_{1p}$ equal to one.
If problem (6.32)-(6.33) is convex, then the noninferior solutions of problem (6.30)-(6.31) can be obtained by solving the following weighting form
$$\min_{W} \sum_{p=1}^{P} v_p E_p(W) \tag{6.39}$$
subject to
$$X_p(l+1) = F\left( X_p(l), W(l) \right), \quad l = 0, 1, \dots, L-1 \tag{6.40}$$
where $X_p(0)$ is given, and
$$v_p \geq 0, \quad p = 1, 2, \dots, P; \qquad \sum_{p=1}^{P} v_p = 1.$$
Theorem 6.3. If the set of noninferior solutions of the problem (6.32)-(6.33) can be parameterised by the overall weighting vector $v$, the optimal solution of the nonseparable dynamic optimisation problem given in (6.30)-(6.31) is reached under certain conditions by the noninferior solution that satisfies the following equations
$$\frac{\partial E / \partial E_1}{v_1} = \frac{\partial E / \partial E_2}{v_2} = \dots = \frac{\partial E / \partial E_P}{v_P}. \tag{6.41}$$
For the discussed case, due to the additive form of the performance index, the optimal solution of problem (6.30)-(6.31) is attained by the noninferior solution with all $v_p$ equal to $1/P$.
The aim of considering the optimisation problem (6.30)-(6.31) as an embedded problem, meaning the embedding of this optimisation problem in a family of parameterised optimisation problems, is to obtain a new algorithm for the multilayer neural networks learning.
Let us consider the following weighted minimax formulation for problem (6.30)-(6.31):
$$\min_{W} \max \left\{ v_1 E_1, v_2 E_2, \dots, v_P E_P \right\} \tag{6.42}$$
subject to
$$X_p(l+1) = F\left( X_p(l), W(l) \right), \tag{6.43}$$
where $X_p(0)$ is given, the weighting coefficient $v_1$ is always set to one and the weighting coefficients $v_p$, $p = 2, \dots, P$, are nonnegative. In (6.42) the maximization is performed among $P$ weighted systems indicated by the performance indices (corresponding to any training pair), while the minimization is carried out over the weights $W$.
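The weighted minimax formulation can be tried out on a small convex stand-in. The sketch below assumes a single linear neuron (so that each $E_p$ is a convex quadratic), and it uses a plain subgradient step on the pattern that currently attains the maximum, with a diminishing step size; neither choice comes from the text, and the name `weighted_minimax` is hypothetical.

```python
def weighted_minimax(patterns, v, w0, eta0=0.5, iters=2000):
    """Minimise max_p v_p * E_p(w) with E_p = (d_p - w . x_p)^2 by stepping
    along the gradient of the currently largest weighted index."""
    w = list(w0)
    best_w, best_val = list(w), float("inf")
    for t in range(iters):
        errs = [d - sum(wi * xi for wi, xi in zip(w, x)) for x, d in patterns]
        scores = [vp * e * e for vp, e in zip(v, errs)]
        val = max(scores)
        if val < best_val:                 # remember the best point seen so far
            best_w, best_val = list(w), val
        p = scores.index(val)              # pattern attaining the maximum
        x, _ = patterns[p]
        step = eta0 / (t + 1)              # diminishing step size (an assumption)
        # Gradient of v_p * e_p^2 w.r.t. w is -2 * v_p * e_p * x_p.
        w = [wi + step * 2.0 * v[p] * errs[p] * xi for wi, xi in zip(w, x)]
    return best_w, best_val

if __name__ == "__main__":
    # Two conflicting patterns for one scalar weight: targets 1 and 3 for input 1.
    patterns = [((1.0,), 1.0), ((1.0,), 3.0)]
    v = [1.0, 1.0]                         # v_1 is fixed to one, as in (6.42)
    print(weighted_minimax(patterns, v, [0.0]))
```

With equal weighting coefficients the two conflicting patterns are balanced at the midpoint between their targets, where both weighted indices are equal; changing $v_2$ shifts that balance point toward one of the patterns.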
6.5.2 Iterative Minimax Solution
Several theorems strictly related to the conversion of the optimisation problem (6.30)-(6.31) into another optimisation problem, (6.42)-(6.43), can be proven. If we denote by $W^*$ the set of solutions of the problem (6.30)-(6.31) and by $W_v^*$ the union of sets of solutions of the weighted minimax problem (6.42)-(6.43), then it is possible to prove the following
Theorem 6.4. The intersection of $W^*$ and $W_v^*$ is nonempty, i.e.