

CHAPTER 2 The Perceptron as A Neural Network

2.4. The Back-Propagation Algorithm (BPA)

The most serious problem of the single-layer perceptron is that we do not have a proper learning algorithm to adjust its synaptic weights. Multi-layer feed-forward perceptrons do not have this problem: their synaptic weights can be trained with the highly popular error back-propagation (EBP) algorithm, which can be viewed as a general form of the least-mean-square (LMS) algorithm. The error back-propagation algorithm (or simply the back-propagation algorithm) consists of two parts, the forward pass and the backward pass. In the forward pass, the effect of the input data propagates through the network layer by layer; during this process, all the synaptic weights of the network are fixed.

In the backward pass, on the other hand, all the synaptic weights of the multi-layer perceptron are adjusted in accordance with the error-correction rule. Before using the error-correction rule, we have to define the error signal of the learning process. Because the goal of the learning algorithm is to find a set of weights that makes the actual outputs of the network equal to the desired outputs, we can define the error signal as the difference between the actual output and the desired output.

Specifically, the desired response is subtracted from the actual response of the network to produce the error signal. In the backward pass, the error signal propagates backward through the network from the output layer, and the synaptic weights of the multi-layer perceptron are adjusted to move the actual outputs of the network closer to the desired outputs. This is why the algorithm is called the "error back-propagation algorithm". Now we consider the learning process of the back-propagation algorithm. First, the error signal at output node j at the tth iteration is defined by

ej(t) = ψ(netmj(t)) - dj(t)   (2.8)

where ψ(.) is the activation function (here the sigmoid function), netmj is the linear combiner output of node j in the output layer, and dj(t) is the desired response. We can rewrite (2.8) in a more general form.

In the same way, we can define the total squared error J of the network as:

J = Σj ζj = (1/2) Σj ej²   (2.12)

The goal of the back-propagation algorithm is to find a set of weights such that the actual outputs are as close as possible to the desired outputs; in other words, its purpose is to reduce the total squared error J, as described in (2.12). In the method of steepest descent, the successive adjustments applied to the weight matrix W are in the direction opposite to the gradient matrix ∂J/∂W. The adjustment can be expressed as:

∆W(t) = -η ∂J/∂W   (2.13)

where η is the learning-rate parameter of the back-propagation algorithm. It decides the step size of the correction in the method of steepest descent [18]. Note that the learning rate is a positive constant. By using the chain rule, the element ∂J/∂wkji of the matrix ∂J/∂W can be represented as (2.14).

For synaptic weights that are not directly connected to the output layer, the error signal must be propagated back through the hidden layers. So, by using the chain rule, we can rewrite ∂J/∂ykj in terms of the quantities of the following layer. Substituting (2.19) and (2.20) into (2.18), we obtain (2.21). From (2.14) to (2.21), we have the following observations:

1. If the synaptic weights are between the hidden layer and the output layer, then

∂J/∂wkji = (ψ(netmj) - dj) ykj(1 - ykj) y(k-1)i   (2.23)

2. If the synaptic weights are between the hidden layers or between the input layer and the hidden layer, then

∂J/∂wkji = [Σl δ(k+1)l w(k+1)lj] ykj(1 - ykj) y(k-1)i   (2.25)

where δ(k+1)l denotes the local error of node l in layer k+1.

Equations (2.23) and (2.25) are the most important formulas of the back-propagation algorithm. The synaptic weights can be adjusted by substituting (2.23) and (2.25) into the following (2.26):

wkji(t + 1) = wkji(t) + ∆wkji(t)   (2.26)

The learning process of the back-propagation learning algorithm can be expressed by the following steps:

Step 1: Decide the structure of the network for the problem.

Step 2: Choose a suitable value between 0 and 1 for the learning rate η.

Step 3: Pick the initial synaptic weights from a uniform distribution whose values are usually small, e.g. between -1 and 1.

Step 4: Calculate the output signal of the network by using (2.9).

Step 5: Calculate the error energy function J by using (2.12).

Step 6: Use (2.26) to update the synaptic weights.

Step 7: Go back to Step 4 and repeat Steps 4 to 6 until the error energy function J is small enough.
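The seven steps above can be sketched in code. The following is a minimal Python/NumPy illustration rather than the thesis's own implementation; the 2-2-1 network, the XOR training data, the learning rate, and the stopping threshold are all chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical training set: the four XOR patterns, inputs as columns.
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
D = np.array([[0., 1., 1., 0.]])

# Step 1: network structure 2-2-1; Step 3: small uniform initial weights.
W1 = rng.uniform(-1, 1, (2, 2))    # input -> hidden
W2 = rng.uniform(-1, 1, (1, 2))    # hidden -> output
eta = 0.9                          # Step 2: learning rate between 0 and 1

history = []
for epoch in range(10000):
    H = sigmoid(W1 @ X)            # Step 4: forward pass, as in (2.9)
    Y = sigmoid(W2 @ H)
    E = Y - D                      # error: actual minus desired, as in (2.8)
    J = 0.5 * np.sum(E ** 2)       # Step 5: error energy, as in (2.12)
    history.append(J)
    if J < 1e-3:                   # Step 7: stop when J is small enough
        break
    # Step 6: steepest-descent updates, following (2.23), (2.25), (2.26).
    d2 = E * Y * (1 - Y)                  # output-layer local error
    d1 = (W2.T @ d2) * H * (1 - H)        # back-propagated hidden error
    W2 -= eta * d2 @ H.T
    W1 -= eta * d1 @ X.T

print(history[0] > history[-1])
```

The loop either stops early once J is small enough or runs out of epochs; either way the error energy should have decreased from its initial value.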

Although a network trained with the BPA can deal with more complex problems that cannot be solved by the single-layer perceptron, the BPA still has the following problems:

1. Number of hidden layers

According to theoretical research, the number of hidden layers never needs to exceed two. Although nearly all problems can be solved with two hidden layers, or even one, there is no systematic way to choose the number of hidden layers. Even the number of neurons in a hidden layer is an open question.

2. Initial synaptic weights

One defect of the method of steepest descent is the "local minimum" problem, which is related to the initial synaptic weights. How to choose the best initial weights is still an open topic in neural network research.

3. The most suitable learning rate

The learning rate decides the step size of the learning process. A smaller learning rate has a better chance of producing convergent results, but the speed of convergence is very slow and more epochs are needed. A larger learning rate can speed up the learning process, but it produces more unpredictable results. So finding a suitable learning rate is also an important problem in neural networks.

CHAPTER 3

Dynamic Optimal Training of A Three-Layer Neural Network with Sigmoid Function

In this chapter, we solve the problem of finding a suitable learning rate for the training of a three-layer neural network by finding the dynamic optimal learning rate at every iteration. A neural network with one hidden layer is sufficient for most classification problems, so the dynamic optimal training algorithm proposed in this chapter targets a three-layer neural network with sigmoid activation functions in the hidden and output layers.

3.1. The Architecture of A Three-Layer Network

Consider the three-layer neural network shown in Figure 3-1, which has only one hidden layer. The architecture of the neural network is an M-H-N network: there are M neurons in the input layer, H neurons in the hidden layer, and N neurons in the output layer.

Figure 3-1. The three layer neural network

There is no activation function in the neurons of the input layer, and the activation function for the neurons in the hidden and output layers is defined by

f(x) = 1 / (1 + exp(-x))   (3.1)

Note that we do not consider the bias in this chapter.
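As a quick numerical check of (3.1), the sigmoid maps 0 to 0.5 and squashes large positive and negative inputs toward 1 and 0 respectively; a tiny Python sketch:

```python
import math

def f(x):
    # The activation function of (3.1): f(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + math.exp(-x))

print(f(0.0))
print(f(10.0) > 0.99, f(-10.0) < 0.01)
```

Note also the symmetry f(x) + f(-x) = 1, which is often exploited when deriving the gradient formulas.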

3.2. The Dynamic Optimal Learning Rate

Suppose we are given the training input matrix X in (3.2) and the desired output matrix D in (3.3). There are P columns in (3.2) and (3.3), which implies that there are P sets of training examples. The weighting matrix between the input layer and the hidden layer is expressed as WH in (3.4).

The weighting matrix between the hidden layer and the output layer is expressed as WY in (3.5). Then we can get the linear combiner output of the hidden layer, due to (3.2) and (3.4), which is expressed in (3.6).

By using (3.1) and (3.6), the output matrix of the hidden layer is expressed in (3.7). The linear combiner output of the output layer, due to (3.5) and (3.7), is expressed in (3.8). The output of the output layer, which is also the actual output Y of the network, is obtained by substituting (3.8) into (3.1), giving (3.9).

The most important quantity of the BP algorithm is the error signal, which is the difference between the actual output of the network and the desired output. So we define the error matrix E in (3.10). Finally, we define the total squared error, the energy function of the network, as J in (3.11). The goal of the learning process is to minimize the total squared error J. Following the back-propagation algorithm, we use the method of steepest descent to adjust the synaptic weights, applying formulas (3.12) and (3.14) to update the weighting matrices WH and WY. Equations (3.12) and (3.14) are expressed as follows.

W(t + 1) = W(t) - β(t) ∂J/∂W

This is the common form of (3.12) and (3.14), with β(t) = βH(t) when updating WH and β(t) = βY(t) when updating WY.

Now, let us consider the derivative term of (3.15) only, which the chain rule expands as (3.16). Accordingly, the use of (3.17), (3.18) and (3.19) in (3.16) yields (3.21). Equation (3.21) is the formula that we use to adjust the synaptic weights WY.

Furthermore, let us consider the weighting matrix WH between the input layer and the hidden layer, as expressed in (3.13). For convenience, we first discuss only the derivative term of (3.13), which the chain rule expands as (3.22). Accordingly, the use of (3.17), (3.23), (3.24) and (3.25) in (3.22) yields (3.27).

Equation (3.27) is the formula that we use to adjust the synaptic weights WH. By using (3.21) and (3.27), we can adjust the synaptic weights of the network.
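As a concrete check, the forward pass (3.6)-(3.9) and the gradients used in (3.21) and (3.27) can be written out and compared against a finite-difference gradient. This is a hedged Python/NumPy sketch: the 3-4-2 network size, the P = 5 random training pairs, and all variable names are hypothetical, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                          # sigmoid activation (3.1)
    return 1.0 / (1.0 + np.exp(-x))

M, H, N, P = 3, 4, 2, 5            # hypothetical M-H-N network, P patterns
X  = rng.standard_normal((M, P))   # training matrix, as in (3.2)
D  = rng.uniform(0, 1, (N, P))     # desired output matrix, as in (3.3)
WH = rng.uniform(-1, 1, (H, M))    # input -> hidden weights, as in (3.4)
WY = rng.uniform(-1, 1, (N, H))    # hidden -> output weights, as in (3.5)

def total_error(WH, WY):
    Yh = f(WH @ X)                 # hidden-layer output, as in (3.7)
    Y  = f(WY @ Yh)                # network output, as in (3.9)
    return 0.5 * np.sum((Y - D) ** 2), Yh, Y

J, Yh, Y = total_error(WH, WY)

# Analytic gradients in the form used by (3.21) and (3.27).
E       = Y - D                          # error matrix, as in (3.10)
dY      = E * Y * (1 - Y)                # output-layer local error
grad_WY = dY @ Yh.T                      # dJ/dWY
dH      = (WY.T @ dY) * Yh * (1 - Yh)    # back-propagated hidden error
grad_WH = dH @ X.T                       # dJ/dWH

# Finite-difference check of one element of dJ/dWH.
eps = 1e-6
WH2 = WH.copy(); WH2[0, 0] += eps
num = (total_error(WH2, WY)[0] - J) / eps
print(abs(num - grad_WH[0, 0]) < 1e-4)
```

If the chain-rule derivation is correct, the numerical and analytic gradient elements agree to within the finite-difference error.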

3.3. Dynamical Optimal Training via Lyapunov’s Method

In control systems, a Lyapunov function can be used to analyze the stability of a system. The basic philosophy of Lyapunov's direct method is the mathematical extension of a fundamental physical observation: if the total energy of a mechanical (or electrical) system is continuously dissipated, then the system, whether linear or nonlinear, must eventually settle down to an equilibrium point [19]. In the same spirit, we define the Lyapunov function as

V = J   (3.28)

where J is the total squared error defined in (3.11). Equation (3.28) is positive definite, which means that V = J > 0. The difference of the Lyapunov function is

∆V = Jt+1 - Jt   (3.29)

where Jt+1 denotes the total squared error at the (t+1)th iteration. If (3.29) is negative, the system is guaranteed to be stable. Then, for ∆V < 0 we need

Jt+1 - Jt < 0   (3.30)

Assuming a single learning rate β(t), the difference Jt+1 - Jt defined through the equations of this chapter becomes a nonlinear function of the one parameter β(t), which we denote G(β(t)), as in (3.31). Equation (3.31) can be rewritten as

Jt+1 - Jt = G(β(t))   (3.32)

If the parameter β(t) satisfies Jt+1 - Jt = G(β(t)) < 0, then the set of such β(t) is the stable range of the learning rate of the system at the tth iteration. Within this stable range, if βopt(t) makes Jt+1 - Jt attain its minimum, we call βopt(t) the optimal learning rate at the tth iteration. The optimal learning rate βopt(t) not only guarantees the stability of the training process but also gives the fastest speed of convergence.

In order to find the optimal learning rate βopt(t) from the function Jt+1 - Jt analytically, we need an explicit form of Jt+1 - Jt, like the simple form derived for a two-layer neural network in [13]. But here Jt+1 - Jt is a very complicated nonlinear algebraic function, and it is nearly impossible to obtain a simple explicit form. However, we have defined Jt+1 - Jt in (3.31) by progressively evolving the equations from the beginning of this chapter, so we can define (3.31) implicitly in Matlab code. In this case, we can apply the Matlab routine fminbnd to find the optimal learning rate βopt(t) from Jt+1 - Jt. The calling sequence of fminbnd is:

FMINBND Scalar bounded nonlinear function minimization.

X = FMINBND(FUN,x1,x2) finds a local minimizer X of the function FUN in the interval x1 <= X <= x2. FUN accepts scalar input X and returns a scalar function value F evaluated at X.

The Matlab routine “fminbnd” finds a local minimizer βopt of the function G(β), which has only one independent variable β, within a given interval, so we must supply an interval when calling it. For the two examples in Chapter 4, we set the interval to [0.01, 100], restricting the allowable learning rates to between 0.01 and 100. Note that for simplicity we assume βH(t) = βY(t) = β(t) in (3.31), so there is only one variable in (3.31) and “fminbnd” suffices. Alternatively, we could find βH(t) and βY(t) separately using the Matlab routine “fminunc”. However, “fminunc” can only search for a minimizer around one specific starting point rather than over an interval, so it is very limited in application and is not appropriate for this case.
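For readers working in Python, scipy.optimize.fminbound provides the same scalar bounded minimization as Matlab's fminbnd. A minimal sketch, with a simple quadratic standing in for the thesis's implicitly defined G(β):

```python
from scipy.optimize import fminbound

def G(beta):
    # Hypothetical stand-in for G(beta) = J_{t+1} - J_t; the real G is
    # defined implicitly by the network equations of Chapter 3.
    return (beta - 7.25) ** 2 - 3.0

# Search the interval [0.01, 100], as the thesis does with fminbnd.
beta_opt = fminbound(G, 0.01, 100)
print(abs(beta_opt - 7.25) < 1e-3, G(beta_opt) < 0)
```

The returned minimizer both lies in the supplied interval and, for this stand-in G, satisfies the stability condition G(βopt) < 0.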

Algorithm 1: Dynamic optimal training algorithm for a three-layer neural network

Step 0: The following WH(t), WY(t), G(β(t)), βopt(t) and Y(t) denote their respective

values at iteration t.

Step 1: Given the initial weighting matrices WH(1) and WY(1), the training input matrix X, and the desired output matrix D, find the actual output matrix of the network Y(1) and the nonlinear function G(β(1)).

Step 2: Use the Matlab routine “fminbnd” with the interval [0.01, 100] on the nonlinear function G(β(1)) to find the optimal learning rate βopt(1).

Step 3: Set the iteration count t = 1 and start the back-propagation training process.

Step 4: Check whether the desired output matrix D and the actual output matrix of the network Y(t) are close enough. If yes, go to Step 9.

Step 5: Update the synaptic weights matrix to yield WH(t+1) and WY(t+1) by using (3.27) and (3.21) respectively.

Step 6: Find the actual output matrix of the network Y(t+1) and the nonlinear function

G(β(t+1)).

Step 7: Use the Matlab routine “fminbnd” to find the optimal learning rate βopt(t+1) for the next iteration.

Step 8: Set t = t + 1 and go to Step 4.

Step 9: End.
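Algorithm 1 can be sketched end to end in Python/NumPy. This is a hedged illustration, not the thesis's Matlab code: the 2-2-1 XOR setup is hypothetical, and a coarse geometric grid over [0.01, 100] stands in for the fminbnd search for βopt(t), with βH = βY = β as assumed in Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(2)

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2-2-1 network and the XOR training data (inputs as columns).
X = np.array([[0., 0., 1., 1.], [0., 1., 0., 1.]])
D = np.array([[0., 1., 1., 0.]])
WH = rng.uniform(-1, 1, (2, 2))    # Step 1: initial WH(1)
WY = rng.uniform(-1, 1, (1, 2))    #         and WY(1)

def J_of(WH, WY):
    # Total squared error J, as in (3.11), for the given weights.
    return 0.5 * np.sum((sig(WY @ sig(WH @ X)) - D) ** 2)

def grads(WH, WY):
    # Gradients dJ/dWH and dJ/dWY, following (3.27) and (3.21).
    Yh = sig(WH @ X)
    Y = sig(WY @ Yh)
    dY = (Y - D) * Y * (1 - Y)
    dH = (WY.T @ dY) * Yh * (1 - Yh)
    return dH @ X.T, dY @ Yh.T

# Steps 2/7: stand-in for fminbnd, a geometric grid over [0.01, 100].
betas = np.geomspace(0.01, 100, 200)

history = [J_of(WH, WY)]
for t in range(500):               # Steps 3-8: the training loop
    if history[-1] < 1e-3:         # Step 4: outputs close enough to D
        break
    gH, gY = grads(WH, WY)
    # G(beta) = J_{t+1} - J_t, evaluated implicitly for each candidate beta.
    Jnext = np.array([J_of(WH - b * gH, WY - b * gY) for b in betas])
    k = int(np.argmin(Jnext))
    if Jnext[k] >= history[-1]:    # no beta in range gives G(beta) < 0
        break
    b = betas[k]                   # beta_opt(t) for this iteration
    WH, WY = WH - b * gH, WY - b * gY   # Step 5: update the weights
    history.append(J_of(WH, WY))

print(history[-1] < history[0])
```

Because each accepted step must make G(β) = Jt+1 - Jt negative, the recorded error sequence is strictly decreasing, which mirrors the stability argument of Section 3.3.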

CHAPTER 4 Experimental Results

In this chapter, the classification problems of XOR and Iris data will be solved via our new dynamical optimal training algorithm in Chapter 3. The training results will be compared with the conventional BP training using fixed learning rate.

4.1. Example 1: The XOR Problem

The task is to train the network to produce the Boolean "Exclusive OR" (XOR) function of two variables. The XOR operator yields true if exactly one (but not both) of two conditions is true; otherwise it yields false. We need only consider the four input patterns (0,0), (0,1), (1,1), and (1,0). The first and third patterns are in class 0, which means the XOR operator yields "false" when the input is (0,0) or (1,1). The distribution of the input data is shown in Figure 4-1. Because the XOR function has two variables, we choose an input layer with two nodes and an output layer with one node, and use one hidden layer with two neurons to solve the XOR problem [14], as shown in Figure 4-2. The architecture of the neural network is a 2-2-1 network.


Figure 4-1. The distribution of XOR input data sets


Figure 4-2. The neural network for solving XOR

First, we use the standard BP algorithm with fixed learning rates (β = 1.5, 0.9, 0.5 and 0.1) to train the XOR network; the training results are shown in Figures 4-3-1 ~ 4-3-4. The result of using the BP algorithm with dynamic optimal learning rates to train the XOR network is shown in Figure 4-4.

Figure 4-3-1. The square error J of the standard BPA with fixed β = 1.5

Figure 4-3-2. The square error J of the standard BPA with fixed β = 0.9

Figure 4-3-3. The square error J of the standard BPA with fixed β = 0.5

Figure 4-3-4. The square error J of the standard BPA with fixed β = 0.1

Figure 4-4. The square error J of the BPA with dynamic optimal training

The following Figure 4-5 shows the plot of (3.32) for -1 < β < 100, which is G(β) = ∆J(β) = Jt+1 - Jt at iteration count t = 1. The Matlab routine fminbnd is invoked to find βopt under the constraint that G(βopt) < 0 with maximum absolute value. This βopt is the learning rate for iteration count t = 2, and it is found to be 7.2572. The dynamic learning rate of every iteration is shown in Figure 4-6.

Figure 4-5. The difference function G(β(t)) and βopt = 7.2572

Figure 4-6. The dynamic learning rates of every iteration

The comparison of these cases is shown in Figure 4-7. In Figure 4-7, it is obvious that our dynamical optimal training yields the best training results in minimum epochs.

Figure 4-7. Training errors of dynamic optimal learning rates and fixed learning rates

Table 4-1 shows the training result via dynamical optimal training for the XOR problem.

Table 4.1. The training result for XOR using dynamical optimal training

Training Results \ Iterations             1000       5000      10000      15000
W1 (after trained)                      6.5500     7.7097     8.8191     8.4681
W2 (after trained)                      6.5652     7.7145     8.1921     8.4703
W3 (after trained)                      0.8591     0.9265     0.9473     0.9573
W4 (after trained)                      0.8592     0.9265     0.9473     0.9573
W5 (after trained)                     14.9536    26.2062    33.0393    37.8155
W6 (after trained)                    -19.0670   -32.9550   -41.3692   -47.2513
Actual Output Y for (x1, x2) = (0,0)    0.1134     0.0331     0.0153     0.0089
Actual Output Y for (x1, x2) = (0,1)    0.8232     0.9300     0.9616     0.9750
Actual Output Y for (x1, x2) = (1,0)    0.8232     0.9300     0.9616     0.9750
Actual Output Y for (x1, x2) = (1,1)    0.2291     0.0925     0.0511     0.0334
J                                       0.0639     0.0097     0.0029     0.0012

Table 4-2 shows the training result via the standard BP with fixed β = 0.9 for the XOR problem.

Table 4.2. The training result for XOR using fixed learning rate β = 0.9

Training Results \ Iterations             1000       5000      10000      15000
W1 (after trained)                      4.7659     7.2154     7.6576     7.8631
W2 (after trained)                      4.8474     7.2228     7.6624     7.8670
W3 (after trained)                      0.7199     0.8996     0.9234     0.9331
W4 (after trained)                      0.7228     0.8996     0.9234     0.9331
W5 (after trained)                      6.2435    20.6467    25.5617    28.2288
W6 (after trained)                     -8.1214   -26.1034   -32.1610   -35.4456
Actual Output Y for (x1, x2) = (0,0)    0.2811     0.0613     0.0356     0.0264
Actual Output Y for (x1, x2) = (0,1)    0.6742     0.8885     0.9263     0.9415
Actual Output Y for (x1, x2) = (1,0)    0.6745     0.8885     0.9263     0.9415
Actual Output Y for (x1, x2) = (1,1)    0.4192     0.1479     0.0982     0.0781
J                                       0.2334     0.0252     0.0109     0.0068

Comparing Table 4.1 with Table 4.2, we can see that dynamical optimal training converges faster and yields better results than the fixed-learning-rate approach.

Now we will use another method, the back-propagation algorithm with momentum, to solve the XOR problem again, and compare its training errors with those of dynamic optimal training to see whether dynamic optimal training is indeed better. The back-propagation algorithm with momentum modifies Equation (2.13) by including a momentum term as follows:

∆W(t) = α ∆W(t - 1) - η ∂J/∂W(t)

where α is a positive number called the momentum constant, usually in the range [0, 1). The training results of the BPA with momentum for the XOR problem are shown in Figures 4-8-1 ~ 4-8-3, and the comparison of these cases is shown in Figure 4-9. In Figure 4-9, we can see that some training results of the BPA with momentum are as good as those of dynamic training, but most of them are unpredictable; the training results still depend on the chosen learning rate and momentum.
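The momentum-modified weight change can be illustrated on a single scalar weight. This Python sketch simply applies ∆W(t) = α∆W(t-1) - η ∂J/∂W(t); the gradient sequence and the values of η and α are hypothetical.

```python
eta, alpha = 0.5, 0.9    # hypothetical learning rate and momentum constant

w, dw = 0.0, 0.0
grad_seq = [1.0, 0.8, 0.6, 0.4, 0.2]   # hypothetical gradient sequence dJ/dw

for g in grad_seq:
    # Momentum: the new change keeps a fraction alpha of the previous one.
    dw = alpha * dw - eta * g
    w += dw

# With alpha = 0 this would be plain steepest descent; since the gradients
# here keep the same sign, the accumulated momentum makes each step larger.
print(w < -sum(eta * g for g in grad_seq))
```

This is why momentum can accelerate convergence along a consistent descent direction, and also why a poorly chosen α makes the results unpredictable.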

Figure 4-8-1. The square error J of the BPA with various momentum constants (β = 0.9)

Figure 4-8-2. The square error J of the BPA with various momentum constants (β = 0.5)

Figure 4-8-3. The square error J of the BPA with various momentum constants (β = 0.1)

Figure 4-9. Total square errors of dynamic training and the BPA with different learning rates and momentum

4.2. Example 2: Classification of Iris Data Set

In this example, we will use a similar neural network to classify the Iris data set [15], [16]. The Iris genus has three subspecies of interest, and the classification depends on the length and width of the petal and the length and width of the sepal. The total Iris data are shown in Figures 4-10-1 and 4-10-2, and the training data set, the first 75 samples of the total data, is shown in Figures 4-11-1 and 4-11-2. The Iris data samples are available in [20]. There are 150 samples of three species of Iris flowers in this data set. We choose 75 samples to train the network and use the other 75 samples to test it. Since there are four input features, we adopt a network with four nodes in the input layer and three nodes in the output layer, with four neurons in the hidden layer. The architecture of the neural network is thus a 4-4-3 network, as shown in Figure 4-12.

Figure 4-10-1. The total Iris data set (Sepal)

Figure 4-10-2. The total Iris data set (Petal)

Figure 4-11-1. The training set of Iris data (Sepal)

Figure 4-11-2. The training set of Iris data (Petal)

Figure 4-12. The neural network for solving the Iris problem

First, we use the standard BPA with fixed learning rates (β = 0.1, 0.01 and 0.001) to solve the classification of Iris data sets, and the training results are shown in Figure 4-13-1 ~ 4-13-3. The result of BPA with dynamic optimal learning rates is shown in Figure 4-14.

Figure 4-13-1. The square error J of the standard BPA with fixed β = 0.1

Figure 4-13-2. The square error J of the standard BPA with fixed β = 0.01

Figure 4-13-3. The square error J of the standard BPA with fixed β = 0.001

Figure 4-14. The square error J of the BPA with dynamic optimal training

Figure 4-15 shows that the convergence of the network with the dynamic learning rate is clearly faster than with fixed learning rates. Because the optimal learning rate at every iteration lies almost entirely in the range [0.01, 0.02], the convergence speed with the fixed learning rate β = 0.01 is similar to that of the dynamic learning rate. Even so, the dynamic learning rate approach still performs better than the fixed learning rates.

Figure 4-15. Training errors of dynamic optimal learning rates and fixed learning rates

After 10000 training iterations, the resulting weights and total squared error J are shown below.

       [  1.2337   -0.5033    1.3225    1.3074 ]
WH  =  [ -0.3751    3.4714   -2.6777   -1.6052 ]
       [  3.7235    5.1603   -4.0019  -10.4289 ]
       [  1.9876   -2.7186    4.6171    4.3400 ]

Total squared error J = 0.1582

The actual output and desired output of 10000 training iteration are shown in Table 4.3 and the testing output and desired output are shown in Table 4.4. After we substitute the above weighting matrices into the network and perform real testing, we find that there is no classification error by using training set (the first 75 data set).

