In the dynamic optimal training algorithm, we use the MATLAB function “fminbnd” to find the optimal learning rate. The calling sequence of “fminbnd” is [27]:
FMINBND Scalar bounded nonlinear function minimization.
X = FMINBND(FUN,x1,x2) finds a local minimizer X of the function
FUN in the interval x1 <= X <= x2. FUN accepts scalar input X and returns a scalar function value F evaluated at X.
The MATLAB function “fminbnd” finds a minimum of a scalar function within a given interval. Usually, a fixed interval is given in which to search for the optimal learning rate, but searching a fixed interval takes too much time. From the experimental results, we can see that the optimal learning rate may range from 0.001 to 1000, which is a very large interval. Further, it can be observed that the optimal learning rate in most iterations does not exceed 100. A fixed range therefore wastes a lot of time in finding the optimal learning rate during the learning process. Besides, most other mathematical methods also need an interval in which to find a minimum. Indeed, the upper bound of the stable learning rate of the three-layer NN shown in Figure 3-3 can be found in the following Theorem 4. This upper bound is updated in each iteration and changes according to the maximum absolute values of the input matrix and the output weighting matrix.
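For illustration, the kind of bounded scalar minimization that “fminbnd” performs can be sketched with a golden-section search. This is a simplified pure-Python stand-in, not MATLAB’s actual implementation (which also uses parabolic interpolation for faster convergence):

```python
import math

def golden_section_min(fun, x1, x2, tol=1e-6):
    """Minimize a scalar function on [x1, x2] by golden-section search.

    Simplified sketch of what MATLAB's fminbnd does; assumes fun is
    unimodal on the interval.
    """
    invphi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi ~ 0.618
    a, b = x1, x2
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    fc, fd = fun(c), fun(d)
    while b - a > tol:
        if fc < fd:              # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = fun(c)
        else:                    # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = fun(d)
    return (a + b) / 2.0

# Example: minimize (x - 2)^2 on [0, 5]; the minimizer is x = 2.
x_opt = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

The search narrows the bracketing interval by the golden ratio at every step, which is why the routine needs only the two interval endpoints and no starting point or derivative.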
Theorem 4:
The upper bound of the stable learning rate of the three-layer NN shown in Figure 3-3 can be shown as follows:

$$\eta < \eta_{up}(k) = \frac{2PN}{\operatorname{tr}\{Q(k)\}}, \qquad Q(k) = \left(\frac{\partial y(k)}{\partial W}\right)\left(\frac{\partial y(k)}{\partial W}\right)'$$

where tr{Q(k)} is evaluated using |WYmax| and |Xmax|, the acceptable maximum absolute values for the second (output) layer weighting matrix and for the input matrix, respectively.

Proof:
Figure 3-3 shows a three-layer NN with a sigmoid activation function in the hidden layer and a linear activation function with slope a in the output layer. M is the number of input neurons, R the number of hidden-layer neurons, P the number of output-layer neurons, and N the length of the input data vector; k is the time instant.
Consider the following cost function to be minimized:
$$J(k) = \frac{1}{2PN}\, e'(k)\, e(k) \tag{3.21}$$

where e(k) = D(k) − y(k) is the error vector, with D ∈ ℝ^{P×1} and y ∈ ℝ^{P×1} denoting the desired output and real output vectors, respectively, and W is the weighting matrix. We define the update rule of the weighting matrix W in the form

$$\Delta W(k) = W(k+1) - W(k) = -\eta \cdot \nabla_W J(k) \tag{3.22}$$

where η is a positive variable called the learning rate of the learning algorithm based on the update rule (3.22), with cost function (3.21). ∇_W J(k) is the gradient of the cost function with respect to the vector of adjustable weights, which can be expressed as

$$\nabla_W J(k) = \frac{\partial J(k)}{\partial W} = -\frac{1}{PN}\left(\frac{\partial y(k)}{\partial W}\right)' e(k) \tag{3.23}$$

Let J0 denote an acceptable tolerance for the process to converge. After convergence, we should have

$$\Delta W(k) = W(k+1) - W(k) = -\eta \cdot \nabla_W J(k) \approx 0$$
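Update rule (3.22) is plain gradient descent: the weights settle where the update vanishes. A minimal numeric sketch with a hypothetical scalar cost (not the network’s J) illustrates this:

```python
def grad_descent(grad, w0, eta, steps):
    """Iterate w(k+1) = w(k) - eta * grad(w(k)), as in update rule (3.22)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Hypothetical scalar cost J(w) = (w - 3)^2 with gradient 2*(w - 3);
# the weight converges to the minimizer w* = 3, where the update ~ 0.
w_star = grad_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1, steps=200)
```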
The difference of the error vector e(k) can be approximated by the first-order term of its Taylor series:

$$\Delta e(k) \approx \left(\frac{\partial e(k)}{\partial W}\right)' \Delta W(k) \tag{3.24}$$
Consider the Lyapunov function

$$V(k) = e'(k)\, e(k) \tag{3.25}$$

where ΔV(k) = V(k+1) − V(k) is the difference between two instants of time of the Lyapunov function during the training process. Applying (3.22) and (3.23) into (3.24), we can rewrite Δe(k) as

$$\Delta e(k) = -\frac{\eta}{PN}\, Q(k)\, e(k) \tag{3.26}$$
If ΔV(k) < 0, then the learning of the NN will be stable and the learning algorithm will converge, i.e., the weighting matrix will converge from its initial value W0 to the optimal W*. In (3.28), we already know that η > 0, P > 0, Q > 0, and N > 0, so the term

$$B = 2I - \frac{\eta}{PN}\, Q$$

must be a positive definite matrix. In other words, all eigenvalues of the matrix B must be positive. Let λmax(Q(k)) be the maximum eigenvalue of Q at the k-th iteration; we need its maximum value over the whole training process. To simplify the calculation, we use the trace of Q at the k-th iteration: the trace of Q is the summation of all eigenvalues of Q, so its value is larger than λ*max. Define the 2-norm by ‖x‖₂² = x′x. The advantage of (3.30) is that we can find a clear and definite expression for the term max{‖∂y(k)/∂W‖₂²}, the maximum value of the Jacobian of the NN’s output with respect to the weighting matrix W, which can be expressed as in (3.31).
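The trace is a safe substitute here because, for a positive semidefinite Q, tr(Q) = Σλᵢ ≥ λmax, so a learning rate chosen from the trace bound automatically keeps B = 2I − (η/PN)Q positive definite. A small numeric check with an illustrative hand-built symmetric matrix (values chosen for the example, not from the thesis):

```python
# Symmetric positive definite Q with eigenvalues 4 and 2 (so trace = 6).
Q = [[3.0, 1.0], [1.0, 3.0]]
P, N = 1, 1

trace_Q = Q[0][0] + Q[1][1]
lam_max = Q[0][0] + Q[0][1]      # for [[a,b],[b,a]]: eigenvalues are a+b, a-b

eta_trace = 2 * P * N / trace_Q  # conservative trace-based bound
eta_eig = 2 * P * N / lam_max    # exact eigenvalue-based bound

# With eta = eta_trace, B = 2I - (eta/PN) Q stays positive definite:
b_diag = 2.0 - (eta_trace / (P * N)) * Q[0][0]
b_off = -(eta_trace / (P * N)) * Q[0][1]
b_min_eig = b_diag - abs(b_off)  # smallest eigenvalue of B
```

Since trace_Q ≥ lam_max, eta_trace ≤ eta_eig, and the smallest eigenvalue of B stays positive.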
From Figure 3-3, the revised algorithm of the learning process can be expressed as in (3.31), where a is the slope of the output-layer linear activation function, X is the input matrix, and WH is the weighting matrix of the hidden layer with dimension R×M. By the chain rule, the partial derivatives of the network output during the learning process are given in (3.32)–(3.34), where |WYmax| and |Xmax| are the acceptable maximum absolute values for the elements of the output-layer weighting matrix and of the input matrix, respectively. From (3.32) to (3.34), we can find the maximum values of ∂Y(k)/∂WY and ∂Y(k)/∂WH, which yield (3.35) and (3.36). Substituting (3.35) and (3.36) into (3.31), we obtain (3.37). Substituting (3.37) into (3.30), we get the upper bound of the stable learning rate stated in Theorem 4.
Example 3-2: Finding the upper-bound of optimal learning rate
Given the following neural network with three layers: four inputs in the input layer, three output neurons in the output layer, and four neurons in the hidden layer. WH is a 4×4 matrix and WY is a 3×4 matrix. The architecture of the neural network is shown in Figure 3-4. The sigmoid function is adopted as the hidden-layer activation function, and a linear function as the output-layer activation function. The input signals are bounded in (-2, 2). The slope parameter of the linear activation function is given as 1. Find an upper bound of the learning rate at the first iteration.
Figure 3-4. Neural network for Example 3-2
Hence, the range of the initial weighting factors is [-0.6931, 0.6931]. Therefore |WYmax| = 0.6931. Using Theorem 4, we can get the upper bound for the learning rate at the first iteration.

From Section 3.1 to Section 3.4, we can summarize the revised dynamic optimal training algorithm for a modified three-layer neural network as shown in Figure 3-3.
Math model in matrix format:
The math model of the revised dynamic optimal training algorithm is as follows:
(1) Feed-forward process.
(2) Back-propagation process, with the update rule of the synaptic weight factors

$$W(t+1) = W(t) - \eta(t)\cdot \nabla_W J(t)$$
Algorithm 2: Revised dynamic optimal training algorithm
Step 0: Give the input matrix X and the desired output matrix D. Find the maximum absolute value of the input matrix, |Xmax|. Compute the bound [wmin, wmax] of the initial weight factors by using (3.20).
Step 1: Set the initial weight factors WH and WY to random values within the bound of Step 0.
Step 2: Iteration count k=1.
Step 3: Start the feed-forward part of the back-propagation training process.
Step 4: Compute the error function E(k) = Y(k) − D(k) and the cost function J(k). If J(k) is smaller than the acceptable value J0, the algorithm goes to Step 9; otherwise, it goes to Step 5.
Step 5: Find the maximum absolute value of WY, |WYmax|. Compute the learning-rate upper bound η_up by using Theorem 4.
Step 6: Use the MATLAB routine “fminbnd” with the search interval [0.01, η_up] to minimize the nonlinear function ΔJ(k) = J(k+1) − J(k) and find the optimal stable learning rate η_opt.
Step 7: Start the back-propagation part of the training process. Update the synaptic weight matrices to yield WH(k+1) and WY(k+1) by using (3.38) and (3.39), respectively.
Step 8: Increase the iteration count, k = k + 1, and go back to Step 3.
Step 9: End the algorithm.
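The steps of Algorithm 2 can be sketched as a toy, runnable loop. This is a Python stand-in for the MATLAB implementation: the 2-2-1 XOR network with bias terms, the numerical gradient, the fixed eta_up, and the geometric grid search replacing “fminbnd” are all simplifying assumptions, not the thesis code.

```python
import math
import random

random.seed(0)

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR training patterns and desired outputs (Step 0)
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
D = [0.0, 1.0, 1.0, 0.0]

def forward(w, x):
    # w = [W1..W6, b1, b2, b3]: 2-2-1 net, sigmoid hidden, linear output
    h1 = sig(w[0] * x[0] + w[1] * x[1] + w[6])
    h2 = sig(w[2] * x[0] + w[3] * x[1] + w[7])
    return w[4] * h1 + w[5] * h2 + w[8]

def cost(w):
    # normalized total square error, as in (3.21) with P = 1, N = 4
    return sum((d - forward(w, x)) ** 2 for x, d in zip(X, D)) / (2.0 * 1 * 4)

def num_grad(w, eps=1e-6):
    # numerical gradient; the thesis uses analytic back-propagation
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((cost(wp) - cost(wm)) / (2.0 * eps))
    return g

def train(iters=300, eta_up=10.0):
    w = [random.uniform(-0.7, 0.7) for _ in range(9)]  # Step 1
    history = [cost(w)]
    for _ in range(iters):                              # Steps 3-8
        g = num_grad(w)
        # Geometric grid over [0.01, eta_up] standing in for fminbnd
        # (Step 6); eta_up is fixed here, whereas Theorem 4 would
        # recompute it at every iteration (Step 5).
        best_eta, best_j = None, None
        eta = 0.01
        while eta <= eta_up:
            j = cost([wi - eta * gi for wi, gi in zip(w, g)])
            if best_j is None or j < best_j:
                best_eta, best_j = eta, j
            eta *= 1.5
        if best_j >= history[-1]:   # no candidate improves J: stop
            break
        w = [wi - best_eta * gi for wi, gi in zip(w, g)]  # Step 7
        history.append(best_j)
    return w, history

w, history = train()
```

Because the line search only accepts a step when it lowers J, the recorded cost history is strictly decreasing, mirroring the stability requirement ΔV(k) < 0 of Theorem 4.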
CHAPTER 4 Experimental Results
In this chapter, we will solve the classification problems of XOR and Iris Data via the
revised dynamic optimal training algorithm discussed in Chapter 3. The training result
will be compared with the dynamic optimal training algorithm and with back-propagation training using a fixed learning rate.
4.1 Example 1: The XOR problem
The task is to train the network to produce the Boolean “Exclusive OR” (XOR) function of two variables. The XOR operator yields true if exactly one (but not both) of its two operands is true; otherwise it yields false. The truth table of the XOR function is shown in Figure 4-1.
A  B  Y
T  T  F
T  F  T
F  T  T
F  F  F

Figure 4-1. The truth table of XOR function: Y = A ⊕ B
4.1.1 System modeling and Problem definition
We need only consider four input data points (0,0), (0,1), (1,1), and (1,0) in this problem. The first and third input patterns are in class 0, which means the XOR operator yields “False” when the input data is (0,0) or (1,1). The second and fourth input patterns are in class 1, which means the XOR operator yields “True” when the input data is (0,1) or (1,0). Since the XOR function has two input variables, we choose an input layer with two neurons and an output layer with one neuron. We then use one hidden layer with two neurons to solve the XOR problem [22], as shown in Figure 4-3. The architecture of the neural network is a 2-2-1 network.
Figure 4-2. The distribution of XOR input data sets
Figure 4-3. The neural network for solving XOR
Hence, we have the weight matrix between the input layer and the hidden layer in the form

$$W_H = \begin{bmatrix} W_H(1,1) & W_H(1,2) \\ W_H(2,1) & W_H(2,2) \end{bmatrix} = \begin{bmatrix} W_1 & W_2 \\ W_3 & W_4 \end{bmatrix}$$
And the weight matrix between the hidden layer and the output layer can be expressed as

$$W_Y = \begin{bmatrix} W_Y(1,1) & W_Y(1,2) \end{bmatrix} = \begin{bmatrix} W_5 & W_6 \end{bmatrix}$$
4.1.2 Experimental result analysis
We first compare the experimental training results obtained by using the Back-propagation Algorithm (BPA), the Dynamic Optimal Training Algorithm (DOA), and the Revised Dynamic Optimal Training Algorithm (RDOA), respectively. Table 4-1 shows the experimental results of the three algorithms. All results are averaged over 20 trials. For BPA, we use a learning rate of 0.9, and the initial weights are drawn from a uniform distribution over the interval [-1, 1]. For DOA, we search for the optimal learning rate by using the MATLAB routine “fminbnd” with search range [0.01, 100], and the initial weights are drawn from a uniform distribution over the interval [-1, 1].
Table 4-1 Training results of XOR problems for a 2-2-1 neural network with three different kinds of algorithm: BPA, DOA, and RDOA

                                           BPA                   DOA        RDOA
J (final normalized total square error)    0.063825 (diverged)   0.000617   0.000994
T (training time, sec)                     Diverged              17.617     22.63
SI (convergence iteration, J(k)<0.005)     Diverged              2684       775
ST (convergence time, sec)                 Diverged              4.72       1.75
C (successful convergences in 20 trials)   0%                    20%        70%
T is the total training time for 10000 training iterations. SI is the first iteration at which the value of J is smaller than 0.005. We assume that when e_k = 0.1, the network can be said to classify the XOR data set; this corresponds to J = 0.005, at which the network achieves our requirement. ST is the actual settling time, calculated as ST = T × SI ÷ (total iterations). C is the number of successful convergences in 20 training trials. For DOA and RDOA, we averaged the values of J, T, and SI only over the convergent trials. The BPA column in Table 4-1 is somewhat different: for BPA, we averaged J and T over all 20 trials, because the network trained by BPA did not produce a correct result in any of the 20 trials. From Table 4-1, we can see that BPA is the worst algorithm for training this neural network; the convergence probability C of BPA is 0%, so BPA has no settling iteration. From row J in Table 4-1, DOA has a better normalized total square error (0.000617) than RDOA (0.000994) when both algorithms converge, but both values of J are small enough to classify the XOR data set. From row C, we see that RDOA has a greater probability (70%) than DOA (20%) of reaching a correct result, which means that RDOA has a higher capability of jumping out of local minima. From row ST, RDOA spends only 1.75 seconds to bring the normalized total square error J below 0.005, i.e., to reach a steady state; RDOA is 2.7 times faster than DOA (4.72 sec).
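The settling times in Table 4-1 follow from the stated formula ST = T × SI ÷ (total iterations); a quick check (last digits may differ slightly because the tabulated T and SI are themselves rounded averages):

```python
def settling_time(total_time, settling_iter, total_iters=10000):
    """ST = T * SI / (total iterations), as defined for Table 4-1."""
    return total_time * settling_iter / total_iters

st_doa = settling_time(17.617, 2684)   # close to the tabulated 4.72 s
st_rdoa = settling_time(22.63, 775)    # close to the tabulated 1.75 s
```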
The best result of using the BP algorithm with fixed learning rate 0.9 to train the XOR is shown in Figure 4-4. The best result of using the DO algorithm, with initial weights drawn from a uniform distribution over the interval [-1, 1], is shown in Figure 4-5. The best result of using the RDO algorithm to train the XOR is shown in Figure 4-6. The comparison of the best results of the three algorithms is shown in Figures 4-7-1 and 4-7-2.
Figure 4-4. Normalized square error J of the standard BPA with fixed η = 0.9

Figure 4-5. The best normalized square error J of the DOA in 20 trials
Figure 4-6. The best normalized square error J of the RDOA in 20 trials
In Figure 4-7-1, RDOA and DOA have almost the same performance at the end of the training process; they differ at the beginning. In Figure 4-7-2, it is obvious that RDOA achieves the minimum acceptable square error of 0.005 around the 800th iteration, while at the same iteration DOA still has a higher square error of 0.02. So RDOA trains on the XOR data set faster than DOA.
Figure 4-7-1. Training error of RDOA, DOA, and BPA
Figure 4-7-2. Close-up view of the training error
Tables 4-2 ~ 4-4 show the complete training results of the three algorithms. In column C, F denotes a failed convergence and S a successful one.
Table 4-2 Detailed Training results of XOR problems for a 2-2-1 neural network with Back-propagation algorithm.
BPA with fixed learning rate 0.9 (tuned initial weights)
J (total square error)
Table 4-3 Detailed Training results of XOR problems for a 2-2-1 neural network with Dynamic Optimal Training algorithm.
XOR — Dynamic Optimal Training (tuned initial weights)
J (total square error)
Average 0.0006174 17.617 2683.75 20%
Table 4-4 Detailed Training results of XOR problems for a 2-2-1 neural network with Revised Dynamic Optimal Training algorithm.
XOR — Revised Dynamic Optimal Training
Average 0.0009945 22.636143 774.78571 70%
Table 4-5 shows the training result for the XOR problem via revised dynamic optimal training (RDOA). Table 4-6 shows the training result via dynamic optimal training (DOA). Comparing Table 4-5 with Table 4-6, we can see that training via revised dynamic optimal training converges faster and yields a better result than the other approaches.
Table 4-5. The training result for XOR using revised dynamic optimal training

Training Results                         Iteration 1   1000      5000      10000
W1 (after training)                      1.2480        5.1970    5.8972    8.4681
W2 (after training)                      0.2963        5.2356    5.9167    8.4703
W3 (after training)                      -0.7454       1.3695    1.8095    0.9573
W4 (after training)                      -0.0388       1.3708    1.8100    0.9573
W5 (after training)                      1.0849        5.6486    8.1003    9.1849
W6 (after training)                      0.7266        -5.8580   -8.2528   -9.3141
Actual output Y for (x1, x2) = (0,0)     0.9058        -0.10308  -0.0759   -0.0644
Actual output Y for (x1, x2) = (0,1)     0.9785        0.9503    0.9871    0.9918
Actual output Y for (x1, x2) = (1,0)     1.0768        0.9503    0.9871    0.9918
Actual output Y for (x1, x2) = (1,1)     1.1218        0.1496    0.0634    0.0484
J                                        0.2606        0.0047    0.0012    0.0008

Table 4-6. The training result for XOR using dynamic optimal training
Training Results                         Iteration 1   1000      5000      10000
W1 (after training)                      0.6428        6.5146    7.6615    8.1360
W2 (after training)                      0.2309        6.5302    7.6665    8.1391
W3 (after training)                      -0.1106       0.8578    0.9246    0.9454
W4 (after training)                      0.5839        0.8579    0.9246    0.9454
W5 (after training)                      0.8436        14.698    25.604    32.201
W6 (after training)                      0.4764        -18.751   -32.213   -40.337
Actual output Y for (x1, x2) = (0,0)     0.6592        0.1167    0.0354    0.0162
Actual output Y for (x1, x2) = (0,1)     0.6848        0.8226    0.9260    0.9589
Actual output Y for (x1, x2) = (1,0)     0.6852        0.8226    0.9260    0.9589
Actual output Y for (x1, x2) = (1,1)     0.7086        0.2381    0.0984    0.0554
J                                        0.1419        0.0166    0.0027    0.0008

4.2 Example 2: Classification of Iris Data Set
In this example, our task is to train the neural network to classify Iris data sets [23], [24].
Generally, Iris has three subspecies: setosa, versicolor, and virginica. The classification depends on the length and width of the petal and the length and width of the sepal.
4.2.1 System modeling and Problem definition
The total Iris data are shown in Figures 4-8-1 and 4-8-2. The first 75 samples are the training data, shown in Figures 4-9-1 and 4-9-2. The Iris data samples are available in [28]. There are 150 samples of the three species of Iris flowers in this data set. We choose 75 samples to train the network and use the other 75 samples to test it. Since there are four kinds of input data, we adopt a network with four nodes in the input layer and three nodes in the output layer, together with four hidden nodes in the hidden layer. The architecture of the neural network is thus a 4-4-3 network, as shown in Figure 4-10.
When the input data set belongs to class setosa, the output of network will be expressed as [1 0 0]. For class versicolor, the output is set to [0 1 0]. For class virginica, the output is set to [0 0 1].
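The class-to-target mapping just described, together with the argmax decoding used when reading the network’s output, can be sketched as:

```python
# One-hot target encoding for the three Iris classes, and the inverse
# mapping (argmax) used to read the network's decision.
CLASSES = ["setosa", "versicolor", "virginica"]

def encode(species):
    """Map a species name to its one-hot target vector."""
    return [1 if species == c else 0 for c in CLASSES]

def decode(output):
    """Pick the class whose output component is largest."""
    return CLASSES[max(range(len(output)), key=lambda i: output[i])]
```

For example, `encode("setosa")` gives [1, 0, 0], and decoding a real network output such as [0.9936, -0.0236, 0.0228] recovers the setosa class.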
Figure 4-8-1. The total Iris data set (Sepal)
Figure 4-8-2. The total Iris data set (Petal)
Figure 4-9-1. The training set of Iris data (Sepal)
Figure 4-9-2. The training set of Iris data (Petal)
Figure 4-10. The neural network for solving Iris problem
We have the weight matrix between the input layer and the hidden layer in the form

$$W_H = \begin{bmatrix} W_H(1,1) & W_H(1,2) & W_H(1,3) & W_H(1,4) \\ \vdots & & & \vdots \\ W_H(4,1) & W_H(4,2) & W_H(4,3) & W_H(4,4) \end{bmatrix}$$

And the weight matrix between the hidden layer and the output layer can be expressed as

$$W_Y = \begin{bmatrix} W_Y(1,1) & W_Y(1,2) & W_Y(1,3) & W_Y(1,4) \\ \vdots & & & \vdots \\ W_Y(3,1) & W_Y(3,2) & W_Y(3,3) & W_Y(3,4) \end{bmatrix}$$
4.2.2 Experimental result analysis
We first compare the training experimental result of network by respectively using
Back-propagation Algorithm (BPA), Dynamic Optimal Training Algorithm (DOA),
and the Revised Dynamic Optimal Training Algorithm (RDOA). Table 4-7 shows the experimental results of the three algorithms. For BPA, we use a learning rate of 0.01 for 10 trials and a learning rate of 0.1 for the other 10 trials; initial weights for BPA are drawn from a uniform distribution over the interval [-1, 1]. For DOA, we search for the optimal learning rate by using the MATLAB routine “fminbnd” with search range [0.01, 1000], and the initial weights are drawn from a uniform distribution over the interval [-1, 1].
Table 4-7 Training results of IRIS problems for a 4-4-3 neural network with three different kinds of algorithm: BPA, DOA, and RDOA

                                               BPA                   DOA        RDOA
J (final normalized total square error)        0.079515 (diverged)   0.000626   0.003836
T (training time, sec)                         Diverged              41.09      22.87
SI (convergence iteration, J(k)<0.0078125)     Diverged              654        529
ST (convergence time, sec)                     Diverged              2.68       1.21
C (successful convergences in 20 trials)       0%                    20%        95%
T is the total training time for 10000 training iterations; SI is the first iteration at which the value of J is smaller than 0.0078125. We assume that when e_k = 0.125, the network can be said to classify the IRIS data set; this corresponds to J = 0.0078125, at which the network achieves our requirement. ST is the actual settling time, calculated as ST = T × SI ÷ (total iterations). C is the number of successful convergences in 20 training trials. From row J in Table 4-7, DOA has a better normalized total square error (0.000626) than RDOA (0.003836) when both algorithms converge, but both values of J are small enough to classify the IRIS data set. From row C, we see that RDOA has a greater probability (95%) than DOA (20%) of reaching a correct result, which means that RDOA has a higher capability of jumping out of local minima. From row ST, RDOA spends only 1.21 seconds to bring the normalized total square error J below 0.0078125, i.e., to reach a steady state; RDOA is 2.21 times faster than DOA (2.68 sec). The result of using the BP algorithm with fixed learning rate 0.1 to train on the IRIS data is shown in Figure 4-11. The result of using the DO algorithm, with initial weights drawn from a uniform distribution over the interval [-1, 1], is shown in Figure 4-12. The result of using the RDO algorithm to train on the IRIS data is shown in Figure 4-13.
The comparison of the results of three algorithms is shown in Figure 4-14-1 and Figure 4-14-2.
Figure 4-11. The square error J of the BP algorithm with fixed learning rate 0.1
Figure 4-12. The normalized square error J of the DOA
Figure 4-13. The normalized square error J of the RDOA.
Figure 4-14-1. Training error of RDOA, DOA, and BPA
Figure 4-14-2. Close-up view of the training error for the Iris problem
In Figure 4-14-1, DOA performs slightly better than RDOA at the end of the training process, but both have acceptable performance there. In Figure 4-14-2, it is obvious that RDOA and DOA both achieve the minimum acceptable square error of 0.0078125 around the 800th iteration. RDOA has a faster convergence rate at the very beginning of the training process, and its convergence rate slows down once the network reaches a steady state. Tables 4-8 ~ 4-10 show the complete training results of the three algorithms for the Iris classification problem. In column C, F denotes a failed convergence and S a successful one.
Table 4-8 Detailed Training results of Iris problems for a 4-4-3 neural network with Back-propagation algorithm.
Iris — BPA with fixed learning rate 0.01 (first 10 trials) and 0.1 (remaining 10 trials)
Table 4-9 Detailed Training results of Iris problems for a 4-4-3 neural network with Dynamic Optimal Training algorithm
Iris_Dynamic Optimal Training Algorithm
J (total square error)
Table 4-10 Detailed Training results of Iris problems for a 4-4-3 neural network with Revised Dynamic Optimal Training algorithm.
Iris_Revised Dynamic Optimal Training Algorithm
J (total square error)
Average 0.0038368 22.873105 528.68421 95%
From Table 4-10, we choose one successful convergent result to test the real classification performance. After 10000 training iterations, the resulting weighting factors and total square error J are shown below. The actual and desired outputs after 10000 training iterations are shown in Table 4-11, and the testing outputs and desired outputs are shown in Table 4-12. After substituting these weighting matrices into the network to perform real testing, we find that there is no classification error on the training set (the first 75 samples). However, there are 4 classification errors on the testing set (the latter 75 samples), namely indices 34, 55, 57, and 59 in Table 4-12. This is better than the result of using DO [19], which generates 5 classification errors.
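Classification errors of this kind are counted by comparing the winning (largest) component of the actual output row against the desired one-hot target. A minimal sketch with illustrative rows (not the actual 75-sample data):

```python
def argmax(v):
    return max(range(len(v)), key=lambda i: v[i])

def count_errors(actual_rows, desired_rows):
    """Number of samples whose winning output class differs from the target."""
    return sum(argmax(a) != argmax(d)
               for a, d in zip(actual_rows, desired_rows))

# Illustrative rows only: the first matches its target, the second does not.
actual = [[0.99, -0.02, 0.02], [0.10, 0.55, 0.40]]
desired = [[1, 0, 0], [0, 0, 1]]
errors = count_errors(actual, desired)  # -> 1
```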
$$W_H = \begin{bmatrix} -2.9855 & -2.5472 & 3.5321 & 4.9206 \\ 0.7139 & 0.8761 & -1.7383 & -1.4481 \\ -0.2861 & 0.4348 & -0.1717 & -0.5248 \\ 0.6269 & 0.1752 & 0.5665 & 0.1341 \end{bmatrix}$$

$$W_Y = \begin{bmatrix} -1.1929 & -1.6309 & 1.3376 & 1.0304 \\ 1.1153 & 0.3748 & -0.8871 & 0.0136 \\ \cdots & \cdots & \cdots & \cdots \end{bmatrix}$$
Normalized total square error J = 0.043
Table 4-11. Actual and desired outputs after 10000 training iterations

Index   Actual Output (Class 1, 2, 3)      Desired Output (Class 1, 2, 3)
17      1.0145  -0.0145  0.0028            1.0000  0.0000  0.0000
54      0.0009   0.0310  0.9709            0.0000  0.0000  1.0000
Table 4-12. Actual and desired outputs in real testing

Index   Actual Output (Class 1, 2, 3)      Desired Output (Class 1, 2, 3)
10      0.9936  -0.0236  0.0228            1.0000  0.0000  0.0000
47      -0.0109  1.0595  -0.0437           0.0000  1.0000  0.0000
CHAPTER 5 Conclusions
In this thesis, a revised dynamic optimal training algorithm (RDOA) for a modified three-layer neural network has been proposed. The RDOA includes three