

CHAPTER 4 Experimental Results

4.2 Example 2: Classification of Iris Data Set

In this example, the task is to train the neural network to classify the Iris data set [23], [24].

The Iris genus has three species: setosa, versicolor, and virginica. Classification is based on four features: the length and width of the petal and the length and width of the sepal.

4.2.1 System modeling and problem definition

The complete Iris data are shown in Figures 4-8-1 and 4-8-2. The first 75 samples are used as training data and are shown in Figures 4-9-1 and 4-9-2. The Iris data samples are available in [28]. The data contain 150 samples of three species of Iris flowers. We choose 75 samples to train the network and use the other 75 samples to test it. Since each sample has four input features and belongs to one of three classes, we adopt a network with four nodes in the input layer and three nodes in the output layer. With four nodes in the single hidden layer, the architecture of the neural network is a 4-4-3 network, as shown in Figure 4-10.

When the input sample belongs to class setosa, the desired output of the network is [1 0 0]. For class versicolor, the desired output is [0 1 0], and for class virginica it is [0 0 1].
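This one-hot target coding can be sketched as follows (an illustrative Python snippet; the names are not from the thesis implementation):

```python
# Class order follows the text: setosa, versicolor, virginica.
CLASSES = ["setosa", "versicolor", "virginica"]

def one_hot(species):
    """Return the 3-element desired output vector for a species name."""
    target = [0, 0, 0]
    target[CLASSES.index(species)] = 1
    return target
```

For example, one_hot("versicolor") yields [0, 1, 0], matching the desired output given above.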

Figure 4-8-1. The total Iris data set (Sepal): sepal length vs. sepal width (in cm) for Class 1 (setosa), Class 2 (versicolor), and Class 3 (virginica)

Figure 4-8-2. The total Iris data set (Petal): petal length vs. petal width (in cm) for Class 1 (setosa), Class 2 (versicolor), and Class 3 (virginica)

Figure 4-9-1. The training set of Iris data (Sepal): sepal length vs. sepal width (in cm) for the three classes

Figure 4-9-2. The training set of Iris data (Petal): petal length vs. petal width (in cm) for the three classes

Figure 4-10. The neural network for solving the Iris problem

The weight matrix between the input layer and the hidden layer has the form

    WH = [ wH(i, j) ],  i = 1, ..., 4,  j = 1, ..., 4,

and the weight matrix between the hidden layer and the output layer can be expressed as

    WO = [ wO(i, j) ],  i = 1, ..., 3,  j = 1, ..., 4.
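For illustration, the forward pass through these two weight matrices can be sketched as follows. This is a minimal sketch, assuming sigmoid hidden units and linear output units (as stated in Chapter 5) and no bias terms, since none are shown in the text:

```python
import math

def sigmoid(x):
    """Sigmoid activation used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_h, w_o):
    """Forward pass of the 4-4-3 network.

    x   : list of 4 input features
    w_h : 4x4 input-to-hidden weight matrix (list of rows)
    w_o : 3x4 hidden-to-output weight matrix (list of rows)
    Returns the 3-element linear output vector.
    """
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_h]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_o]
```

The largest of the three output components then indicates the predicted class under the one-hot coding described above.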

4.2.2 Experimental result analysis

We first compare the training results obtained by three algorithms: the Back-propagation Algorithm (BPA), the Dynamic Optimal Training Algorithm (DOA), and the Revised Dynamic Optimal Training Algorithm (RDOA). Table 4-7 summarizes the results. For BPA, we use a learning rate of 0.01 for 10 trials and a learning rate of 0.1 for another 10 trials; the initial weights of BPA are drawn from a uniform distribution over the interval [-1, 1]. For DOA, we search for the optimal learning rate using the Matlab routine "fminbnd" with search range [0.01, 1000], and the initial weights are drawn from a uniform distribution over the interval [-1, 1].
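The one-dimensional minimization that "fminbnd" performs can be sketched with a simple golden-section search. This is only an illustrative stand-in for the Matlab routine, not the actual implementation used in the thesis:

```python
import math

def fminbnd_sketch(f, lo, hi, tol=1e-6):
    """Golden-section search for the minimum of a unimodal f on [lo, hi].

    A stand-in for Matlab's fminbnd, e.g. for finding the optimal
    learning rate on the search range [0.01, 1000].
    """
    invphi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/golden ratio, ~0.618
    a, b = lo, hi
    while b - a > tol:
        c = b - (b - a) * invphi  # left interior point
        d = a + (b - a) * invphi  # right interior point
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0
```

For a unimodal error curve J(eta), the interval shrinks by a constant factor per iteration, so the search cost grows only logarithmically with the required precision.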

Table 4-7 Training results of the IRIS problem for a 4-4-3 neural network with three different algorithms: BPA, DOA, and RDOA

                                                BPA                   DOA        RDOA
J  (final normalized total square error)        0.079515 (diverged)   0.000626   0.003836
T  (training time, sec)                         Diverged              41.09      22.87
SI (convergence iteration, J(k) < 0.0078125)    Diverged              654        529
ST (convergence time, sec)                      Diverged              2.68       1.21
C  (successful convergences in 20 trials)       Diverged              4 (20%)    19 (95%)

In Table 4-7, J is the final value of the normalized total square error; T is the total training time for 10000 training iterations; SI is the first iteration at which the value of J is smaller than 0.0078125. We assume that an output error of ek = 0.125 is small enough for the network to classify the IRIS data set, so the network achieves our requirement when J = 0.0078125. ST is the actual settling time, calculated as ST = T × SI ÷ (total iterations). C is the number of successful convergences of the training in 20 trials.
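The threshold and the settling-time formula can be checked numerically. The values below are taken from Table 4-7, and the relation J = ek**2 / 2 is one consistent reading of the normalization:

```python
# Acceptance threshold: with every output error at e_k = 0.125, the
# normalized total square error works out to e_k**2 / 2 = 0.0078125.
threshold = 0.125 ** 2 / 2

# Settling time ST = T * SI / (total iterations), for 10000 iterations:
st_doa = 41.09 * 654 / 10000    # about 2.69 s (Table 4-7 lists 2.68)
st_rdoa = 22.87 * 529 / 10000   # about 1.21 s
```

Both reconstructed settling times agree with the tabulated values to within rounding.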

From row J in Table 4-7, DOA achieves a better normalized total square error (0.000626) than RDOA (0.003836) when both algorithms converge correctly, but the values of J for both algorithms are small enough to classify the IRIS data set. From row C, RDOA has a much higher probability of reaching a correct result (95%) than DOA (20%), which means that RDOA has a high capability of jumping out of local minima. From row ST, RDOA spends only 1.21 seconds to bring the normalized total square error J below 0.0078125 and reach a steady state; it is about 2.2 times faster than DOA (2.68 sec). The result of using the BP algorithm with fixed learning rate 0.1 to train on the IRIS data is shown in Figure 4-11. The result of using the DO algorithm with initial weights drawn from a uniform distribution over the interval [-1, 1] is shown in Figure 4-12, and the result of using the RDO algorithm is shown in Figure 4-13.

The comparison of the results of the three algorithms is shown in Figures 4-14-1 and 4-14-2.

Figure 4-11. The square error J of the BP algorithm with fixed learning rate 0.1 (J vs. iteration)

Figure 4-12. The normalized square error J of the DOA with initial weights drawn from [-1, 1] (J vs. iteration)

Figure 4-13. The normalized square error J of the RDOA (J vs. iteration)

Figure 4-14-1. Training error (J vs. iteration) of RDOA, DOA, and BPA

Figure 4-14-2. A close-up of the training error of RDOA and DOA for the Iris problem (first 900 iterations)

In Figure 4-14-1, DOA performs slightly better than RDOA at the end of the training process, but both have acceptable performance. In Figure 4-14-2, it is obvious that RDOA and DOA both reach the minimum acceptable square error of 0.0078125 at around the 800th iteration. RDOA converges faster at the very beginning of the training process, and its convergence rate slows down once the network reaches a steady state. Tables 4-8 to 4-10 show the complete training results of the three algorithms for the Iris classification problem. In column C, F denotes a failed convergence and S denotes a successful convergence.

Table 4-8 Detailed training results of the Iris problem for a 4-4-3 neural network with the Back-propagation algorithm (fixed learning rate 0.01 for the first 10 trials and 0.1 for the remaining 10 trials)

Table 4-9 Detailed training results of the Iris problem for a 4-4-3 neural network with the Dynamic Optimal Training algorithm

Table 4-10 Detailed training results of the Iris problem for a 4-4-3 neural network with the Revised Dynamic Optimal Training algorithm

Average: J = 0.0038368, training time T = 22.873105 sec, SI = 528.68421, C = 95%

From Table 4-10, we choose one successful convergent result to test the real classification performance. After 10000 training iterations, the resulting weighting matrices and total square error J are shown below. The actual and desired outputs after 10000 training iterations are shown in Table 4-11, and the testing outputs and desired outputs are shown in Table 4-12. After substituting the weighting matrices below into the network and performing the real test, we find no classification error on the training set (the first 75 samples). However, there are 4 classification errors on the testing set (the last 75 samples), at indices 34, 55, 57, and 59 in Table 4-12. This is better than the result of using DO [19], which generates 5 classification errors.
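The decision rule implied by Tables 4-11 and 4-12, namely assigning each sample to the class with the largest output component, can be sketched as follows (illustrative Python, not the thesis code):

```python
def predict_class(output):
    """Map a 3-element network output to a class index (0..2) by taking
    the largest component, the natural rule for one-hot target coding."""
    return max(range(len(output)), key=lambda i: output[i])

def count_errors(outputs, targets):
    """Number of samples whose predicted class differs from the target class."""
    return sum(predict_class(o) != predict_class(t)
               for o, t in zip(outputs, targets))
```

For instance, the actual output [1.0145, -0.0145, 0.0028] from Table 4-11 is classified as class 1 (index 0), matching its desired output [1, 0, 0].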

    WH = [ -2.9855  -2.5472   3.5321   4.9206
            0.7139   0.8761  -1.7383  -1.4481
           -0.2861   0.4348  -0.1717  -0.5248
            0.6269   0.1752   0.5665   0.1341 ]

    WO = [ -1.1929  -1.6309   1.3376   1.0304
            1.1153   0.3748  -0.8871   0.0136 ]

Normalized total square error J = 0.043

Table 4-11. Actual and desired outputs after 10000 training iterations

               Actual Output                   Desired Output
Index    Class 1   Class 2   Class 3     Class 1   Class 2   Class 3
17       1.0145    -0.0145   0.0028      1.0000    0.0000    0.0000
54       0.0009    0.0310    0.9709      0.0000    0.0000    1.0000

Table 4-12. Actual and desired outputs in real testing

               Actual Output                   Desired Output
Index    Class 1   Class 2   Class 3     Class 1   Class 2   Class 3
10       0.9936    -0.0236   0.0228      1.0000    0.0000    0.0000
47       -0.0109   1.0595    -0.0437     0.0000    1.0000    0.0000

CHAPTER 5 Conclusions

In this thesis, a revised dynamic optimal training algorithm (RDOA) for a modified three-layer neural network has been proposed. The RDOA includes the three modifications proposed in Chapter 3: "Simplification of the activation function", "Selection of proper initial weighting factors", and "Determination of the upper bound of the learning rate in each iteration". Simplifying the activation function enhances the back-propagated error signal and hence improves the convergence rate of the dynamic optimal training algorithm (DOA). Selecting proper initial weighting factors increases the probability of escaping local minima. Determining the upper bound of the stable learning rate guarantees the convergence of the training process and speeds up the search for the optimal learning rate. The classification problems of XOR and Iris data are presented in Chapter 4. They are solved using the revised dynamic optimal training algorithm for a modified three-layer neural network with sigmoid activation functions in the hidden layer and a linear activation function in the output layer. Excellent results have been obtained for the XOR and Iris problems, which indicate that the RDOA is considerably faster than DOA and BPA. In addition, the RDOA finds globally convergent results more easily than DOA and BPA. In short, the RDOA speeds up the convergence rate and has a higher chance of escaping local minima than the DOA.

REFERENCES

[1] T. Yoshida and S. Omatu, “Neural network approach to land cover mapping,” IEEE Trans. Geoscience and Remote Sensing, Vol. 32, pp. 1103-1109, Sept. 1994.

[2] H. Bischof, W. Schneider, and A. J. Pinz, “Multispectral classification of Landsat-images using neural networks,” IEEE Trans. Geoscience and Remote Sensing, Vol. 30, pp. 482-490, May 1992.

[3] M. Gopal, L. Behera, and S. Choudhury, “On adaptive trajectory tracking of a robot manipulator using inversion of its neural emulator,” IEEE Trans. Neural Networks, 1996.

[4] L. Behera, “Query based model learning and stable tracking of a robot arm using radial basis function network,” Computers and Electrical Engineering, Elsevier Science Ltd., 2003.

[5] F. Amini, H. M. Chen, G. Z. Qi, and J. C. S. Yang, “Generalized neural network based model for structural dynamic identification, analytical and experimental studies,” Intelligent Information Systems, Proceedings, pp. 138-142, Dec. 1997.

[6] K. S. Narendra and S. Mukhopadhyay, “Intelligent control using neural networks,” IEEE Control Systems Magazine, Vol. 12, Issue 2, pp. 11-18, April 1992.

[7] L. Yinghua and G. A. Cunningham, “A new approach to fuzzy-neural system modeling,” IEEE Trans. Fuzzy Systems, Vol. 3, pp. 190-198, May 1995.

[8] L. J. Zhang and W. B. Wang, “Scattering signal extracting using system modeling method based on a back propagation neural network,” Antennas and Propagation Society International Symposium, 1992 Digest (held in conjunction with the URSI Radio Science Meeting and Nuclear EMP Meeting), Vol. 4, pp. 2272, July 1992.

[9] P. Poddar and K. P. Unnikrishnan, “Nonlinear prediction of speech signals using memory neuron networks,” Neural Networks for Signal Processing, Proceedings of the 1991 IEEE Workshop, pp. 395-404, Oct. 1991.

[10] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, Vol. 65, pp. 386-408, 1958.

[11] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, 1962.

[12] R. P. Lippmann, “An introduction to computing with neural networks,” IEEE ASSP Magazine, 1987.

[13] D. E. Rumelhart et al., “Learning representations by back-propagating errors,” Nature, Vol. 323, pp. 533-536, 1986.

[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds., Cambridge, MA: MIT Press, 1986.

[15] K. C. Tan and H. J. Tang, “New dynamical optimal learning for linear multilayer FNN,” IEEE Trans. Neural Networks, Vol. 15, No. 6, pp. 1562-1568, Nov. 2004.

[16] S. C. Ng, S. H. Leung, and A. Luk, “The generalized back-propagation algorithm with convergence analysis,” Proceedings of the 1999 IEEE International Symposium on Circuits and Systems (ISCAS ’99), Vol. 5, pp. 612-615, May-June 1999.

[17] H. Lari-Najafi, M. Nasiruddin, and T. Samad, “Effect of initial weights on back-propagation and its variations,” Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 1, pp. 218-219, Nov. 1989.

[18] C. H. Wang, H. L. Liu, and C. T. Lin, “Dynamic optimal learning rates of a certain class of fuzzy neural networks and its applications with genetic algorithm,” IEEE Trans. Syst., Man, Cybern. Part B, Vol. 31, pp. 467-475, June 2001.

[19] Y. Y. Chi and C. H. Wang, “Dynamic optimal training of a three-layer neural network with sigmoid function,” Master’s thesis, National Chiao Tung University, June 2005.

[20] J. J. F. Cerqueira, A. G. B. Palhares, and M. K. Madrid, “A complement to the back-propagation algorithm: An upper bound for the learning rate,” Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. 4, pp. 517-522, July 2000.

[21] J. J. F. Cerqueira, A. G. B. Palhares, and M. K. Madrid, “A simple adaptive back-propagation algorithm for multilayered feedforward perceptrons,” Proceedings of the 2002 IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3, Oct. 2002.

[22] L. Behera, S. Kumar, and A. Patnaik, “A novel learning algorithm for feedforward networks using Lyapunov function approach,” Proceedings of the International Conference on Intelligent Sensing and Information Processing, pp. 277-282, 2004.

[23] M. A. Al-Alaoui, R. Mouci, M. M. Mansour, and R. Ferzli, “A cloning approach to classifier training,” IEEE Trans. Syst., Man, Cybern. Part A, Vol. 32, pp. 746-752, Nov. 2002.

[24] R. Kozma, M. Kitamura, A. Malinowski, and J. M. Zurada, “On performance measures of artificial neural networks trained by structural learning algorithms,” Proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems, 1995.

[25] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd edition, New York: John Wiley, 1968.

[26] B. Widrow and M. E. Hoff, Jr., “Adaptive switching circuits,” IRE WESCON Convention Record, pp. 96-104, 1960.

[27] G. E. Forsythe, M. A. Malcolm, and C. B. Moler, Computer Methods for Mathematical Computations, Prentice-Hall, 1976.

[28] Iris Data Samples [Online]. Available: ftp.ics.uci.edu/pub/machine-learning-databases/iris/iris.data

APPENDIX A

In Section 3.1, we proposed an unbounded linear activation function for the NN training process. In this appendix, we explain why a bounded linear activation function is not suitable for the NN training process.

Define an unbounded linear activation function as

    φ(x) = a·x                                        (a.1)

and define a saturated (bounded) linear activation function as

    φb(x) = a·u1,   if x < u1
    φb(x) = a·x,    if u1 ≤ x ≤ u2                    (a.2)
    φb(x) = a·u2,   if x > u2

where a is an adjustable parameter that determines the slope of the function; usually we set a = 1. φb is a piecewise continuous linear function, and between u1 and u2 it is identical to the unbounded linear activation function.
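For illustration, both activation functions and the derivative of the saturated one can be written as a short sketch. The bounds u1 = -1 and u2 = 1 are arbitrary choices for this example:

```python
def phi(x, a=1.0):
    """Unbounded linear activation: phi(x) = a*x."""
    return a * x

def phi_b(x, u1=-1.0, u2=1.0, a=1.0):
    """Saturated linear activation: equals a*x on [u1, u2],
    clamped to a*u1 below and a*u2 above."""
    return a * min(max(x, u1), u2)

def dphi_b(x, u1=-1.0, u2=1.0, a=1.0):
    """Derivative of phi_b: the slope a inside [u1, u2] and zero in
    the saturation regions, which is what stalls the weight updates."""
    return a if u1 <= x <= u2 else 0.0
```

The zero derivative outside [u1, u2] is the key fact used in the argument below.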

Consider the update rule of the weighting factors in the output layer. The derivative of the saturated linear function is

    φb′(x) = a,   if u1 ≤ x ≤ u2
    φb′(x) = 0,   otherwise                           (a.3)

From (a.1), (a.2), and (a.3), the update rule of the output layer (2.42) can be rewritten as

    WO(p,:)(k+1) = WO(p,:)(k) + η(k)·e(p)·φb′(netO(p))·YH(k)    (a.4)

and hence the update rule of the hidden layer (2.43) becomes

    WH(q,:)(k+1) = WH(q,:)(k) + η(k)·[Σp e(p)·φb′(netO(p))·wO(p,q)]·φb′(netH(q))·X(k)    (a.5)

From (a.4) and (a.5) we can see that, if we use a bounded linear activation function, the weight matrices are not updated while the network operates in the saturation region of the piecewise linear function. In the saturation region, the output term VO(p,:) is a constant, so its derivative is zero and the whole gradient vanishes.

Hence, the output of the neural network will no longer change, the network enters a steady state, and the training process stops. A narrow linear region makes the system enter a steady state too early, even while the total square error is still high. A wide linear region is better, but it may still cause the system to stop updating the weight matrices. Compared with the saturated linear function, the sigmoid activation function is also saturated, but its derivative is never zero in the saturation region, so the training process remains capable of escaping a local steady state that does not yield an acceptable result.
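This point can be checked numerically with the standard sigmoid derivative s(x)·(1 - s(x)). The snippet is an illustrative sketch, not the thesis code:

```python
import math

def sigmoid(x):
    """Standard sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)).  It approaches zero
    deep in the saturation regions but never reaches it, so the
    back-propagated error signal never vanishes completely."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Even at x = 10, well inside the saturation region, dsigmoid(x) is small but strictly positive, whereas the saturated linear function would give exactly zero there.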

In a word, we cannot use a saturated linear activation function, because its fixed boundaries make the network lose the capability of updating once it gets stuck in the saturation region.

APPENDIX B

Consider the activation function of the first (hidden) layer. Assume that we use the unbounded activation function in both the hidden and the output layer. Without bounds, the output of the activation function can be any real value, and during training this unlimited output may cause the weights to grow to unreasonably large values that are not acceptable in practice. Therefore, a saturated activation function is needed in the NN. Now assume that a saturated activation function has been placed in the NN.

From Appendix A and Section 3.2, we already know that a saturated linear activation function is forbidden and that the second-layer activation function has been changed to an unbounded linear one. Hence, we can only put a saturated nonlinear activation function (e.g., the sigmoid function) in the hidden layer.
