Wilcoxon Neural Networks Training Using Evolutionary Optimization Methods


Yih-Lon Lin

Department of Information Engineering, I-Shou University, Kaohsiung 840, Taiwan. Email:[email protected]

Abstract―In this paper, two evolutionary computation methods, a genetic algorithm (GA) and particle swarm optimization (PSO), are used to train the novel Wilcoxon neural networks (WNNs) for function approximation with outliers. Unlike traditional artificial neural networks (ANNs), the objective function used in the proposed WNNs is the Wilcoxon norm instead of the total sum of squared errors, i.e., the $L_2$-norm. The advantage of using the Wilcoxon norm is that it reduces the influence of outliers on the overall neural network training. Moreover, to overcome the drawbacks of the back-propagation learning algorithm, we use population-based optimization methods, namely the GA and PSO algorithms, to find suitable weights for the WNNs. Finally, numerical examples, compared with traditional ANNs, are provided to verify the robustness of the proposed methods against outliers.

Index Terms―artificial neural networks (ANNs), Wilcoxon neural networks (WNNs), particle swarm optimization (PSO), genetic algorithms (GAs).

I. INTRODUCTION

It is well known that neural networks have been applied successfully in many branches of science and engineering. The typical architecture of an ANN consists of several layers, i.e., the input layer, one or two hidden layers, and the output layer. Each layer includes several neurons, which are usually connected by weights to neurons located in other layers. In the back-propagation algorithm [1, 2], the error signal is fed back layer by layer to the input layer to update the connection weights such that the given objective function is minimized. However, a serious drawback of this kind of algorithm is that the solution is easily trapped at a local minimum near the initial values and has difficulty escaping from it. To accelerate the convergence of the algorithm, many approaches have been presented, such as adding momentum terms to the updating law [3], or using an adaptive learning rate chosen by an appropriate step size selection rule, e.g., line minimization, limited line minimization, Armijo, Goldstein, Wolfe, or diminishing step size rules [4].

The genetic algorithm, initially developed by John Holland [5], is a biologically motivated search technique mimicking natural selection and natural genetics. It is a general search method lying between exhaustive search and traditional search. When the fitness landscape of the problem is unclear or riddled with many local optima, the genetic algorithm usually has good searching capability. The GA starts with a population of possible solutions to the problem, called chromosomes. A prescribed fitness function is first defined for the problem, which evaluates the fitness or goodness of each chromosome. Chromosomes with better fitness are then selected for reproduction. Crossover and mutation operations are subsequently performed on the population to generate a new generation of possible solutions. This process is repeated until some stopping criterion is met. In recent years, GA-related research has been presented and successfully applied in a variety of science and engineering fields [6-10].

Another frequently used evolutionary algorithm is PSO [11-15]. It is an optimization algorithm with origins in evolutionary computation together with principles of social psychology. Essentially, PSO depends on stochastic processes and, like GAs, uses the concept of fitness. In addition, it provides a mechanism by which individuals in the swarm exchange and communicate information with one another, which is similar to the social behavior of insects and human beings. By mimicking the social sharing of information, PSO directs the individuals to search for the optimal solution more efficiently [12, 13]. Another important feature of PSO is that the paradigm requires only primitive mathematical operations, which can easily be implemented in computer programs. Therefore, PSO has been applied to many real-world problems, such as the analysis of human tremor, reactive power and voltage control, state-of-charge estimation for a battery pack, ingredient mix optimization, milling optimization, and improvised music composition [16, 17]. Many of these applications have shown that PSO techniques can perform well. Some efforts have recently been made to combine the PSO algorithm with neural networks. In [11], the authors proposed an evolutionary system for evolving artificial neural networks, which is based on the PSO algorithm. A hybrid of GA and PSO was used for recurrent network design problems in [15].

Robust and non-parametric smoothing is an important idea in statistics that aims to simultaneously estimate and model the underlying structure of given data. The annealing robust back-propagation learning algorithm was proposed in [18] to deal with the modeling problem in the presence of outliers. Based on fuzzy neural network structures, a robust learning algorithm was developed to reduce the effects of outliers [19, 20]. In addition, the weighted error back-propagation algorithm was proposed in [21] to improve the resistance of multi-layer feedforward network training to outliers. The simulation results of the above-mentioned papers demonstrate satisfactory ability in dealing with outliers. One principal method belonging to this category is the Wilcoxon approach, which is usually robust against outliers. The concepts of the Wilcoxon norm and linear Wilcoxon regressors are presented in Hogg [22]. This motivates us to bring the Wilcoxon norm concept into neural networks. The combination is called the Wilcoxon neural network (WNN), and this class of learning machines was briefly described in [23]. The contribution of this paper is to apply the evolutionary computations of PSO and GA, respectively, as training algorithms for WNNs. Simulation results, compared with those of ANNs using the traditional back-propagation and adjustable Armijo learning rate algorithms, respectively, are presented to show the better robustness against outliers.

II. NEURAL NETWORKS

A. Artificial neural networks

ANNs are biologically motivated learning machines that mimic the structure and behavior of biological neurons and the nervous system. The input-output relationship of each neuron can be described by the following equations:

$$net = \sum_{i=1}^{n} w_i x_i + \theta, \qquad f(net) = \frac{1 - e^{-net}}{1 + e^{-net}}, \tag{1}$$

where $x_i$ is the input to the neuron, $w_i$ is the weight with respect to $x_i$, $\theta$ is called a threshold, $n$ is the number of inputs, and $f(net)$ is referred to as a nonlinear transfer function, which is used only in the hidden layers in this study. When a transfer function is used in the output layer, the linear function $f(net) = net$ is chosen. For training ANNs, a performance index or objective function must be defined in advance and is then minimized by updating the weights and biases. Usually, the total sum of squared errors is defined as the objective function and is given by

$$E_{total} := \frac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} (d_{qk} - y_{qk})^2, \tag{2}$$

where $l$ is the number of training data, $p$ is the number of outputs of the neural network, $d_{qk}$ is the $k$th desired output of the $q$th training datum, and $y_{qk}$ is the corresponding network output. The objective function (2) corresponds to the $L_2$ norm. This kind of objective function is not a robust indicator in the presence of outliers. In the following, the concept of the Wilcoxon norm [22] is first introduced, and we then use it as the objective function in training networks to address the outlier problem.
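As a small illustration of (1) and (2), the following Python sketch evaluates the bipolar sigmoidal transfer function and the total sum of squared errors. The function names and the toy residuals are assumptions made for this example only and do not come from the experiments in this paper.

```python
import math

def bipolar_sigmoid(net):
    """Transfer function of (1): f(net) = (1 - e^{-net}) / (1 + e^{-net})."""
    # Algebraically equal to tanh(net / 2), which avoids overflow in exp()
    # when net is a large negative number.
    return math.tanh(net / 2.0)

def neuron_output(weights, inputs, theta):
    """net = sum_i w_i * x_i + theta, passed through the transfer function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + theta
    return bipolar_sigmoid(net)

def total_squared_error(desired, outputs):
    """E_total of (2); desired[q][k] and outputs[q][k] hold d_qk and y_qk."""
    return 0.5 * sum(
        (d - y) ** 2
        for d_row, y_row in zip(desired, outputs)
        for d, y in zip(d_row, y_row)
    )

# Toy data (hypothetical): ten residuals of size 0.1, then the same data
# with one residual replaced by an outlier of size 10.
clean = total_squared_error([[0.1]] * 10, [[0.0]] * 10)
spoiled = total_squared_error([[0.1]] * 9 + [[10.0]], [[0.0]] * 10)
print(clean, spoiled)
```

In this toy case the single outlier contributes 50 to the objective, while the ten small residuals together contribute only 0.05, which shows how strongly one outlier can dominate the squared-error objective.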

To define the Wilcoxon norm of a vector, we first need a score function. A score function is a non-decreasing function $\varphi(u): [0, 1] \to \Re$ such that

$$\int_0^1 \varphi^2(u)\, du < \infty.$$

Usually, the score function is standardized so that

$$\int_0^1 \varphi(u)\, du = 0 \quad \text{and} \quad \int_0^1 \varphi^2(u)\, du = 1.$$

The score associated with the score function $\varphi$ is defined by

$$a_\varphi(i) = \varphi\left(\frac{i}{l+1}\right), \quad i \in \{1, \ldots, l\}.$$

Hence we have $a_\varphi(1) \le a_\varphi(2) \le \ldots \le a_\varphi(l)$.

It can be shown that the following function is a pseudo-norm (semi-norm) on $\Re^l$:

$$\|v\|_W := \sum_{i=1}^{l} a(R(v_i))\, v_i = \sum_{i=1}^{l} a(i)\, v_{(i)}, \quad v := [v_1 \ \ldots \ v_l]^T \in \Re^l, \tag{3}$$

where $R(v_i)$ denotes the rank of $v_i$ among $v_1, \ldots, v_l$, $v_{(1)} \le \ldots \le v_{(l)}$ are the ordered values of $v_1, \ldots, v_l$, $a(i) := \varphi[i/(l+1)]$, and $\varphi(u) := \sqrt{12}\,(u - 0.5)$. We call $\|v\|_W$ defined in (3) the Wilcoxon norm.
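The definition in (3) can be sketched in a few lines of Python. This is a minimal illustration assuming the standardized score function $\varphi(u) = \sqrt{12}(u - 0.5)$; the function names and the sample residuals are hypothetical.

```python
import math

def score(u):
    """Standardized Wilcoxon score function: phi(u) = sqrt(12) * (u - 0.5)."""
    return math.sqrt(12.0) * (u - 0.5)

def wilcoxon_norm(v):
    """Wilcoxon norm (3): sum over i of a(i) * v_(i), a(i) = phi(i / (l + 1))."""
    l = len(v)
    ordered = sorted(v)  # v_(1) <= ... <= v_(l)
    return sum(score(i / (l + 1)) * v_i
               for i, v_i in enumerate(ordered, start=1))

# Hypothetical residuals, and the same residuals with one outlier.
residuals = [0.1, -0.2, 0.05, -0.1, 0.15]
with_outlier = residuals[:-1] + [10.0]
print(wilcoxon_norm(residuals), wilcoxon_norm(with_outlier))
print(sum(r * r for r in residuals), sum(r * r for r in with_outlier))
```

Because the largest residual enters the sum only once, multiplied by the bounded score $a(l)$, an outlier affects the Wilcoxon norm linearly in its size, whereas it affects the sum of squares quadratically. Note also that a constant vector has Wilcoxon norm zero, since the scores sum to zero; this is why (3) is a pseudo-norm rather than a norm.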

The WNN considered here has one input layer with $n+1$ nodes, one hidden layer with $m+1$ nodes, and one output layer with $p$ nodes. We also have $p$ bias terms at the output nodes.

Let the input vector be $x := [x_1 \ \ldots \ x_n]^T \in \Re^n$, or $z := [z_1 \ \ldots \ z_n \ z_{n+1}]^T = [x_1 \ \ldots \ x_n \ 1]^T \in \Re^{n+1}$, i.e., $z_i := x_i$, $i \in \{1, \ldots, n\}$, $z_{n+1} := 1$.

Let $v_{ji}$ denote the connection weight from the $i$th input node to the input of the $j$th hidden node. Then the input $u_j$ and output $r_j$ of the $j$th hidden node are given by, respectively,

$$u_j = \sum_{i=1}^{n+1} v_{ji} z_i, \quad z_{n+1} = 1, \quad r_j = f_{hj}(u_j), \quad j \in \{1, \ldots, m\}, \tag{4}$$

where $f_{hj}$ is the activation function of the $j$th hidden node.

Let $w_{kj}$ denote the connection weight from the output of the $j$th hidden node to the input of the $k$th output node. Then the input $s_k$ and output $t_k$ of the $k$th output node are given by, respectively,

$$s_k = \sum_{j=1}^{m+1} w_{kj} r_j, \quad r_{m+1} := 1, \quad t_k = f_{ok}(s_k), \quad k \in \{1, \ldots, p\}, \tag{5}$$

where $f_{ok}$ is the activation function of the $k$th output node. The final output $y_k$ of the network is given by

$$y_k = t_k + b_k, \quad k \in \{1, \ldots, p\},$$

where $b_k$ is the bias.

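For training WNNs, in this study the Wilcoxon norm of the total residuals is taken as the objective function, which is given by

$$\Psi_{total} = \sum_{k=1}^{p} \Psi_k = \sum_{k=1}^{p} \sum_{q=1}^{l} a(R(\rho_{qk}))\, \rho_{qk} = \sum_{k=1}^{p} \sum_{q=1}^{l} a(q)\, \rho_{(q)k}, \tag{6a}$$

$$\rho_{qk} := d_{qk} - t_{qk}, \quad q \in \{1, \ldots, l\}, \quad k \in \{1, \ldots, p\}, \tag{6b}$$

where $t_{qk}$ is the output $t_k$ of the $k$th output node in (5) for the $q$th training datum. Here $R(\rho_{qk})$ denotes the rank of the residual $\rho_{qk}$ among $\rho_{1k}, \ldots, \rho_{lk}$, $\rho_{(1)k} \le \ldots \le \rho_{(l)k}$ are the ordered values of $\rho_{1k}, \ldots, \rho_{lk}$, and $a(i) := \varphi[i/(l+1)]$ is the score given by the score function $\varphi(u) := \sqrt{12}\,(u - 0.5)$. The bias term $b_k$, $k \in \{1, \ldots, p\}$, is given by the median of the residuals at the $k$th output node, i.e.,

$$b_k = \operatorname{med}_{1 \le q \le l} \{ d_{qk} - t_{qk} \}.$$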
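The forward pass (4)-(5), the objective (6), and the median bias can be sketched together as follows. This is only an illustrative sketch: the nested-list weight layout, the function names, and the choice of a bipolar sigmoidal hidden activation with a linear output activation are assumptions for the example.

```python
import math
import statistics

def wilcoxon_norm(v):
    """Wilcoxon norm (3) with score function phi(u) = sqrt(12) * (u - 0.5)."""
    l = len(v)
    return sum(math.sqrt(12.0) * (i / (l + 1) - 0.5) * v_i
               for i, v_i in enumerate(sorted(v), start=1))

def forward(x, V, W):
    """Outputs t_k of (4)-(5), before the bias b_k is added.

    V has m rows of n+1 weights v_ji; W has p rows of m+1 weights w_kj.
    """
    z = list(x) + [1.0]                                # z_{n+1} := 1
    r = [math.tanh(sum(v * z_i for v, z_i in zip(row, z)) / 2.0)
         for row in V]                                 # bipolar sigmoid
    r.append(1.0)                                      # r_{m+1} := 1
    return [sum(w * r_j for w, r_j in zip(row, r)) for row in W]  # linear

def wilcoxon_objective(X, D, V, W):
    """Return (Psi_total of (6a), list of median biases b_k)."""
    T = [forward(x, V, W) for x in X]
    total, biases = 0.0, []
    for k in range(len(D[0])):
        rho = [d[k] - t[k] for d, t in zip(D, T)]      # residuals (6b)
        total += wilcoxon_norm(rho)
        biases.append(statistics.median(rho))
    return total, biases
```

Since the scores $a(i)$ sum to zero, adding a constant to every residual leaves $\Psi_{total}$ unchanged, so the objective by itself cannot determine a constant output offset. This is consistent with the bias $b_k$ being obtained separately from the median of the residuals.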

Based on the proposed WNNs, the weights $v_{ji}$ and $w_{kj}$ need to be adjusted to minimize the total residuals (6) by evolutionary algorithms, here the PSO and GA approaches, respectively. For convenience, we further let $\Theta$ represent a parameter vector which contains all adjustable weights $v_{ji}$ in (4) and $w_{kj}$ in (5).

III. GAS AND PSO

A. Basic concepts of GAs

The GAs begin by generating a population that contains a number of random chromosomes. Each chromosome of the population represents a possible solution to the optimization problem. When GAs are applied to the WNN training problem, the chromosome is the adjustable parameter vector $\Theta$ of the WNN. The population is evolved to generate better offspring, as measured by the value of the Wilcoxon norm (6), by means of genetic operations. To constrain the search interval, a lower bound $k_{min}$ and an upper bound $k_{max}$ are imposed on each gene in the chromosome during the evolutionary procedure. If any resulting gene falls outside the defined interval, then the original gene is retained. In addition, let $N$ be the population size, i.e., the number of chromosomes in the population, $l$ be the number of genes in a chromosome, and $p_c$ and $p_m$ be the crossover and mutation probabilities, respectively. These variables will be used in the genetic operations.

A.1 Selection

We first evaluate the corresponding Wilcoxon norm for each chromosome. The $N_g$ fittest chromosomes are kept directly in the next generation. The remaining $N - N_g$ chromosomes are then placed into the mating pool to be crossed. This completes the selection operation.

A.2 Crossover

In the mating pool, the chromosomes are randomly divided into pairs. Each pair, i.e., $\Theta_d$ and $\Theta_m$, proceeds to cross. Let $c$ be a random number selected from $[0, 1]$. If $c \le p_c$, then the following crossover operation for $\Theta_d$ and $\Theta_m$ is performed:

$$\Theta_b = \Theta_m - \beta \times (\Theta_m - \Theta_d), \qquad \Theta_s = \Theta_d + \beta \times (\Theta_m - \Theta_d), \tag{7}$$

else $\Theta_b = \Theta_d$, $\Theta_s = \Theta_m$, where $\Theta_b$ and $\Theta_s$ are the offspring chromosomes and $\beta \in [0, 1]$ is a random number.

A.3 Mutation

The number of mutations executed is given by $p_m \times (N - N_g) \times l$. In every mutation, we first randomly select a gene of a chromosome from the $N - N_g$ chromosomes, and this gene is then replaced by a random number from the search interval between $k_{min}$ and $k_{max}$.

One run of the above selection, crossover, and mutation operations is called a generation. The algorithm stops when the desired value of the Wilcoxon norm is attained or a pre-specified number of generations is reached. The overall design steps for training WNNs using GAs can be summarized as follows.

Step 1. Create a population of $N$ chromosomes randomly.

Step 2. Evaluate the value of the Wilcoxon norm (6) for each chromosome.

Step 3. If the pre-specified number of generations $G$ is reached or there is any chromosome with a Wilcoxon norm value less than $\varepsilon$, then stop.

Step 4. Perform the three genetic operations: selection, crossover, and mutation, respectively. If any of the resulting genes during the operations is outside the interval $[k_{min}, k_{max}]$, then the original one is retained.

Step 5. Go back to Step 2.
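The selection, crossover, and mutation operations described in this section can be sketched in Python as follows. This is a hedged sketch, not the implementation used for the experiments: the function names, the use of a single $\beta$ per pair, the rounding of the mutation count, and the generic `fitness` argument (standing in for the Wilcoxon norm (6) of a chromosome) are all assumptions for illustration.

```python
import random

def crossover(theta_d, theta_m, p_c, rng):
    """Crossover of (7): with probability p_c blend the pair, else copy it."""
    if rng.random() > p_c:
        return list(theta_d), list(theta_m)
    beta = rng.random()  # assumption: one beta shared by all genes of the pair
    theta_b = [m - beta * (m - d) for d, m in zip(theta_d, theta_m)]
    theta_s = [d + beta * (m - d) for d, m in zip(theta_d, theta_m)]
    return theta_b, theta_s

def mutate(pool, p_m, k_min, k_max, rng):
    """Replace round(p_m * len(pool) * l) randomly chosen genes, in place."""
    if not pool:
        return
    n_genes = len(pool[0])
    for _ in range(round(p_m * len(pool) * n_genes)):
        chromosome = rng.choice(pool)
        chromosome[rng.randrange(n_genes)] = rng.uniform(k_min, k_max)

def ga_minimize(fitness, n_genes, N=20, N_g=2, p_c=0.8, p_m=0.005,
                k_min=-10.0, k_max=10.0, G=200, eps=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population inside [k_min, k_max].
    pop = [[rng.uniform(k_min, k_max) for _ in range(n_genes)]
           for _ in range(N)]
    for _ in range(G):
        pop.sort(key=fitness)                # Step 2: smaller value = fitter
        if fitness(pop[0]) < eps:            # Step 3: stopping criterion
            break
        elite, pool = pop[:N_g], pop[N_g:]   # selection: keep the N_g fittest
        rng.shuffle(pool)                    # random pairing in the mating pool
        children = []
        for theta_d, theta_m in zip(pool[0::2], pool[1::2]):
            children.extend(crossover(theta_d, theta_m, p_c, rng))
        if len(pool) % 2:                    # odd pool: last one passes through
            children.append(list(pool[-1]))
        mutate(children, p_m, k_min, k_max, rng)
        pop = elite + children               # Step 5: next generation
    return min(pop, key=fitness)
```

In this sketch no explicit bounds check is needed: offspring of (7) lie between their parents because $\beta \in [0, 1]$, and mutated genes are drawn from $[k_{min}, k_{max}]$, so every gene stays inside the search interval. Keeping the $N_g$ fittest chromosomes unchanged also guarantees that the best objective value never gets worse from one generation to the next.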

B. Basic concepts of PSO

PSO is another population-based algorithm for searching for a global optimum. The original idea behind PSO is to simulate a simplified social behavior. It has ties to artificial life, such as bird flocking or fish schooling, and shares some common features with evolutionary computation, such as fitness evaluation. For example, PSO is like GAs in that the population is initialized with random candidate solutions. The adjustments toward the best individual and the best swarm experiences are basically similar to the crossover operation in GAs. The difference between PSO and GAs is that each potential solution, called an individual or particle, is “flying” through hyperspace with a velocity. Moreover, the particles and the swarm in PSO have memory, which does not exist in the population of GAs.

Let $\Theta_{i,j}(k)$ and $V_{i,j}(k)$ denote the $j$th dimensional value of the position vector and velocity vector, respectively, of the $i$th particle in the swarm at step $k$. The PSO updating rules can be expressed as

$$V_{i,j}(k) = w \cdot V_{i,j}(k-1) + c_1 \cdot \varphi_1 \cdot \left(\Theta^*_{i,j} - \Theta_{i,j}(k-1)\right) + c_2 \cdot \varphi_2 \cdot \left(\Theta^\#_j - \Theta_{i,j}(k-1)\right), \tag{8}$$

$$\Theta_{i,j}(k) = \Theta_{i,j}(k-1) + V_{i,j}(k), \tag{9}$$

where $w$ is the inertia weight, which controls the effect of the preceding velocity at step $k-1$ on the present one, $\Theta^*_i$ denotes the best position of the $i$th particle up to step $k-1$, $\Theta^\#$ denotes the best position of the whole swarm up to step $k-1$, $\varphi_1$ and $\varphi_2$ are random numbers selected from $[0, 1]$, and $c_1$ and $c_2$ are positive numbers that represent the individuality and swarm coefficients, respectively.

The PSO algorithm first sets the swarm size and initializes the position and velocity of each particle randomly. Each particle moves according to Eqs. (8) and (9), and the fitness function (6) is then calculated. Meanwhile, the best positions of each particle and of the swarm are recorded. Finally, if the stopping criterion is satisfied, the best position of the swarm is the final solution. The main steps of the PSO algorithm can be outlined as follows.

Step 1. Give the swarm size $N$ and initialize the position and the velocity of each particle randomly.

Step 2. For each particle $i$, compute the corresponding fitness value of $\Theta_i$ and update the individual best position $\Theta^*_i$ if a better fitness is generated.

Step 3. If the pre-specified number of generations $G$ is reached or the fitness value of the best particle $\Theta^\#$ in the swarm is less than $\varepsilon$, then stop.

Step 4. Update the swarm best position $\Theta^\#$ if the fitness of the new best position is better than that of the previous one.

Step 5. For each particle, update the velocity and the position according to (8) and (9). As with GAs, if any resulting position during the operations is outside the set interval $[k_{min}, k_{max}]$, then the original one is retained.

Step 6. Go back to Step 2.

IV. SIMULATION RESULTS

In this section, we compare the performance of ANNs and WNNs using various updating rules on an illustrative nonlinear regression problem under different testing conditions. The updating rules for ANNs are the traditional back-propagation algorithm and the back-propagation algorithm with the Armijo rule [4], respectively. For training WNNs, the GA and PSO algorithms introduced above are employed. To make the comparisons “fair”, the learning machines in the simulations use the same set of parameters. For example, for both ANNs and WNNs, there are one input neuron, two hidden layers with five and ten neurons, respectively, and one output neuron. The activation function used in the hidden nodes is the bipolar sigmoidal function, and the activation function of the output node is the linear function with unit slope. Besides, for the GA and PSO algorithms, we use the same search space and population size (swarm size). The above numerical algorithms are implemented in Visual C++ 6.0 running on Microsoft Windows XP on a Pentium IV 2.4 GHz platform.

The true function to be learned is the following complex function:

$$y = \sin(\pi x) + 0.5 \sin(2\pi x) + 2\cos(3\pi x), \quad x \in [-1, 1].$$

The parameters used in the GA and PSO algorithms are set to $N = 20$, $G = 15000$, $[k_{min}, k_{max}] = [-10, 10]$, $N_g = 2$, $p_c = 0.8$, $p_m = 0.005$, $w = 0.45$, and $c_1 = c_2 = 2$. In the simulations, fifty training data are uniformly generated from the true function, of which three, five, and eight training samples, respectively, are intentionally replaced by artificial outliers. Figures 1(a)(b)(c) show the simulation results of ANNs with the back-propagation algorithm and the back-propagation algorithm with the Armijo rule for three, five, and eight artificial outliers, respectively. The simulation results of WNNs with the GA and PSO algorithms are displayed in Figures 2(a)(b)(c). From these figures, it can easily be seen that WNNs trained with the GA and PSO algorithms perform better than ANNs trained with the back-propagation algorithm and the Armijo rule for the different numbers of outliers. The predictions of WNNs with the GA and PSO algorithms are almost indistinguishable from the true function, and are all robust against outliers.
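To make the PSO updating rules (8) and (9) and the parameter settings above concrete, the following Python sketch implements the true function and a generic PSO minimizer with the stated values $N = 20$, $w = 0.45$, $c_1 = c_2 = 2$, and $[k_{min}, k_{max}] = [-10, 10]$ as defaults. It is a sketch under stated assumptions rather than the Visual C++ program used for the reported results: the function names, the initial velocity range, the default number of generations, the drawing of $\varphi_1$ and $\varphi_2$ per component, and the generic `fitness` argument (which would be the Wilcoxon objective (6) of a weight vector $\Theta$) are all illustrative choices.

```python
import math
import random

def true_function(x):
    """Target of the simulations: sin(pi x) + 0.5 sin(2 pi x) + 2 cos(3 pi x)."""
    return (math.sin(math.pi * x) + 0.5 * math.sin(2.0 * math.pi * x)
            + 2.0 * math.cos(3.0 * math.pi * x))

def pso_minimize(fitness, dim, N=20, G=200, w=0.45, c1=2.0, c2=2.0,
                 k_min=-10.0, k_max=10.0, eps=1e-6, seed=0):
    rng = random.Random(seed)
    # Step 1: random positions; the velocity range is an assumption.
    pos = [[rng.uniform(k_min, k_max) for _ in range(dim)] for _ in range(N)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(N)]
    best_pos = [p[:] for p in pos]                 # individual bests Theta*_i
    best_fit = [fitness(p) for p in pos]
    g = min(range(N), key=lambda i: best_fit[i])
    swarm_pos, swarm_fit = best_pos[g][:], best_fit[g]   # swarm best Theta#
    for _ in range(G):
        if swarm_fit < eps:                        # Step 3: stopping criterion
            break
        for i in range(N):                         # Step 5: rules (8) and (9)
            for j in range(dim):
                phi1, phi2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * phi1 * (best_pos[i][j] - pos[i][j])
                             + c2 * phi2 * (swarm_pos[j] - pos[i][j]))
                candidate = pos[i][j] + vel[i][j]
                if k_min <= candidate <= k_max:    # else keep the original
                    pos[i][j] = candidate
        for i in range(N):                         # Step 2: individual bests
            f = fitness(pos[i])
            if f < best_fit[i]:
                best_fit[i], best_pos[i] = f, pos[i][:]
        g = min(range(N), key=lambda i: best_fit[i])
        if best_fit[g] < swarm_fit:                # Step 4: swarm best
            swarm_pos, swarm_fit = best_pos[g][:], best_fit[g]
    return swarm_pos, swarm_fit
```

A call such as `pso_minimize(lambda p: sum(t * t for t in p), dim=2)` runs the sketch on a simple quadratic test objective. For WNN training, `fitness` would instead evaluate the Wilcoxon norm of the residuals produced by the network whose weights are stored in the particle position.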

V. CONCLUSION

In this paper, we have successfully applied the GA and PSO algorithms to train Wilcoxon neural networks. These population-based optimization methods have advantages over the traditional back-propagation learning algorithm: for example, they escape more easily from local minima near the initial values and are more efficient at finding the optimal solution for complex functions. The main difference between WNNs and ANNs is the use of the Wilcoxon norm in place of the total sum of squared errors. For function approximation with outliers, this change efficiently reduces the effect of the outliers. From simulation results with three, five, and eight artificial outliers for a complex function approximation, it is concluded that WNNs trained with the GA and PSO algorithms proposed in this paper are more robust against outliers than ANNs trained with the traditional back-propagation algorithm and the Armijo rule.

REFERENCES

[1] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.

[2] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. MIT Press, Cambridge, Massachusetts, vol. 1, Foundations, 1986, pp. 318-362.

[3] R. A. Jacobs, “Increased rates of convergence through learning rate adaptation,” Neural Networks, vol. 1, no. 4, pp. 295-308, 1988.

[4] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, New York, 1999.

[5] J. H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, Michigan, 1975.

[6] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, Cambridge, Massachusetts, 1996.

[7] R. L. Haupt and S. E. Haupt, Practical Genetic Algorithms. Wiley, New York, 1998.

[8] D. J. Montana and L. Davis, “Training feedforward neural networks using genetic algorithms,” in Proceedings of the International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1989, pp. 762-767.

[9] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Massachusetts, 1989.

[10] D. A. Coley, An Introduction to Genetic Algorithms for Scientists and Engineers. World Scientific, Singapore, 1999.

[11] C. Zhang, H. Shao, and Y. Li, “Particle swarm optimization for evolving artificial neural network,” in Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, Nashville, Tennessee, 2000, vol. 4, pp. 2487-2490.

[12] J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proceedings of IEEE International Conference on Neural Networks, Perth, Australia, 1995, vol. 4, pp. 1942-1948.

[13] R. C. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in Proceedings of IEEE International Symposium on Micro Machine and Human Science, Nagoya, Japan, 1995, pp. 39-43.

[14] R. C. Eberhart and Y. H. Shi, “Particle swarm optimization: developments, applications and resources,” in Proceedings of IEEE International Conference on Evolutionary Computation, Seoul, Korea, 2001, vol. 1, pp. 81-86.

[15] … and particle swarm optimization for recurrent network design,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 34, no. 2, pp. 997-1006, 2004.

[16] J. Kennedy, R. C. Eberhart, and Y. Shi, Swarm Intelligence. Morgan Kaufmann, 2001.

[17] A. P. Engelbrecht, Fundamentals of Computational Swarm Intelligence. John Wiley & Sons, 2006.

[18] C. C. Chuang, S. F. Su, J. T. Jeng, and C. C. Hsiao, “Annealing robust backpropagation (ARBP) learning algorithm,” IEEE Transactions on Neural Networks, vol. 11, no. 5, pp. 1067-1077, 2000.

[19] H. H. Tsai and P. T. Yu, “On the optimal design of fuzzy neural networks with robust learning for function approximation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 30, no. 1, pp. 217-223, 2000.

[20] W. Y. Wang, T. T. Lee, C. L. Liu, and C. H. Wang, “Function approximation using fuzzy neural networks with robust learning algorithm,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 27, no. 4, pp. 740-747, 1997.

[21] W. Zhao, D. Chen, and S. Hu, “Detection of outlier and a robust BP algorithm against outlier,” Computers and Chemical Engineering, vol. 28, no. 8, pp. 1403-1408, 2004.

[22] R. V. Hogg, J. W. McKean, and A. T. Craig, Introduction to Mathematical Statistics. Prentice-Hall, New Jersey, 2005.

[23] J. G. Hsieh, Y. L. Lin, and J. H. Jeng, “Preliminary study on Wilcoxon learning machines,” IEEE Transactions on Neural Networks, vol. 19, no. 2, pp. 201-211, 2008.

Figure 1(a). Simulation results with three outliers for ANNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, back-propagation, Armijo.]

Figure 1(b). Simulation results with five outliers for ANNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, back-propagation, Armijo.]

Figure 1(c). Simulation results with eight outliers for ANNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, back-propagation, Armijo.]

Figure 2(a). Simulation results with three outliers for WNNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, GAs, PSO.]

Figure 2(b). Simulation results with five outliers for WNNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, GAs, PSO.]

Figure 2(c). Simulation results with eight outliers for WNNs. [Plot of y = sin(πx) + 0.5sin(2πx) + 2cos(3πx) against x on [-1, 1]; legend: data, GAs, PSO.]