
Initializing the weights in multilayer networks with the quadratic sigmoid function

Abstract – A new method of initializing the weights in back propagation networks using the quadratic threshold activation function with one hidden layer is developed.

The weights can be directly obtained even without further learning. The method relies on the general-position assumption about the distribution of the patterns and can be applied to many pattern recognition tasks. Finally, simulation results are presented, showing that this initialization technique generally results in a drastic reduction in training time and in the number of hidden neurons.

1 Introduction

Using neural networks for pattern recognition applications is becoming more and more attractive. The main advantages of neural networks are their massive parallelism, adaptability, and noise tolerance (B. Widrow and R. Winter, 1988; J. J. Hopfield, 1982; T. J. Sejnowski and C. R. Rosenberg, 1986). One of the most popular neural networks is the back propagation (BP) or multilayer perceptron (MLP) neural network.

The most commonly used activation functions in BP networks are the hard-limited threshold function and the sigmoid function.

Using the hard-limited activation function in every neuron, the upper bound on the number of hidden neurons required in a single-hidden-layer network for solving a general-position two-class classification problem is ⌈K/n⌉, where K is the number of patterns and n is the input dimension. Without the general-position constraint, the upper bound on the number of hidden neurons is K − 1.

Recently, a new quadratic threshold activation function was proposed (C. C. Chiang, 1993). By using it in each neuron, it was shown that the upper bound on the number of hidden neurons required for solving a given two-class classification problem can be reduced by one half compared with conventional multilayer perceptrons that use the hard-limited threshold function. The results are given in Table 1. Since the quadratic function is somewhat more complicated than the hard-limited threshold function, learning is much more difficult for a BP network with the quadratic function. Both the learning time and the convergence properties are often not good enough to obtain effective results, as can be observed in typical simulations. To relieve this learning difficulty, a new method for initializing the weights in BP networks using the quadratic threshold activation function with one hidden layer is presented.

The method is based on Gauss elimination and is applicable to many classification tasks. The paper is organized as follows: the basic quadratic threshold function is described in section 2; the new initialization method is addressed in section 3; and finally, simulation results are shown in section 4.

Activation function | General-position problem | Not general-position
Hard-limited        | ⌈K/n⌉                    | K − 1
QTF                 | ⌈K/(2n)⌉                 | ⌈K/2⌉

Table 1. Number of hidden neurons required
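As a quick numerical illustration of Table 1, the following sketch (in Python, not part of the original paper) evaluates the four bounds; the values of K and n below are examples chosen only for illustration.

```python
# A quick numerical check of the bounds in Table 1 (illustration only; K and n
# are example values, not taken from the paper).
from math import ceil

K, n = 100, 4
print("Hard-limited, general position:", ceil(K / n))        # 25
print("Hard-limited, no constraint:   ", K - 1)              # 99
print("QTF, general position:         ", ceil(K / (2 * n)))  # 13
print("QTF, no constraint:            ", ceil(K / 2))        # 50
```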

2 Quadratic Threshold Function

The quadratic threshold function (QTF) is defined as (C. C. Chiang, 1993)

Quadratic Threshold Function:

f(net, θ) = 1,  if net² ≤ θ²
            0,  if net² > θ²

In (C. C. Chiang, 1993), an upper bound on the number of hidden neurons required for implementing an arbitrary dichotomy on a K-element set S in E^n is derived under the constraint that S is in general position.

Definition 1. A K-element set S in E^n is in general position if no (j+1) elements of S lie in a (j−1)-dimensional linear variety, for any j where 2 ≤ j ≤ n.

Proposition 1 (S. C. Huang and Y. F. Huang, 1991, Proposition 1). Let S be a finite set in E^n that is in general position. Then, for any J-element subset S1 of S, where 1 ≤ J ≤ n, there is a hyperplane, which is an (n−1)-dimensional linear variety of E^n, containing S1 and no other elements of S.

In (E. B. Baum), Baum proved that if all the elements of a K-element set S in E^n are in general position, then a single-hidden-layer MLP with ⌈K/n⌉ hidden neurons using the hard-limited threshold function can implement arbitrary dichotomies defined on S. In (C. C. Chiang, 1993), it is proved that a two-layer (one hidden layer) MLP with at most ⌈K/(2n)⌉ hidden neurons using the QTF is capable of implementing arbitrary dichotomies of a K-element set S in E^n if S is in general position.

Since the quadratic threshold function is non-differentiable, we use the quadratic sigmoid function (QSF) to ease the derivation:

Quadratic Sigmoid Function:

f(net, θ) = 1 / (1 + exp(net² − θ²))
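The following short Python sketch (not from the paper; it assumes NumPy and the formulas as reconstructed above) shows the QTF and its smooth QSF surrogate side by side.

```python
# A sketch of the two activation functions as reconstructed above (assumes
# NumPy); "net" is the net input of a neuron and "theta" its threshold.
import numpy as np

def qtf(net, theta):
    """Quadratic threshold function: 1 when net**2 <= theta**2, else 0."""
    return np.where(net ** 2 <= theta ** 2, 1.0, 0.0)

def qsf(net, theta):
    """Quadratic sigmoid function: a differentiable surrogate of the QTF."""
    return 1.0 / (1.0 + np.exp(net ** 2 - theta ** 2))

if __name__ == "__main__":
    beta = 0.09                             # the small threshold used later in the paper
    nets = np.array([0.0, 0.05, 0.5, 2.0])
    print(qtf(nets, beta))                  # hard 0/1 decisions
    print(qsf(nets, beta))                  # smooth values, largest where |net| is smallest
```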

3 Description of this Method

Consider a classification problem consisting of assigning K vectors of R^n to two predetermined classes. Let the given training set H = {x1, x2, …, xK} = H0 ∪ H1 be partitioned into K0 ≤ K training vectors in subset H0 corresponding to class 0 and K1 ≤ K training vectors in subset H1 corresponding to class 1, where K0 + K1 = K, with H0 = {p1, p2, …, pK0} and H1 = {q1, q2, …, qK1}. The classification can be implemented with a two-layer neural network with N0 + 1 input units, N1 + 1 hidden neurons, and N2 output units, as illustrated in Figure 1 (N0 = n, N1 = ⌈K/(2n)⌉, N2 = 1).

According to Proposition 1, for any n-element subset S′1 of S1, where S1 contains the elements belonging to, for example, class 1, there is a hyperplane, which is an (n−1)-dimensional linear variety of E^n, containing S′1 and no other elements of S. We can use Gauss elimination to solve the linear equations of the hyperplane on which these n patterns of class 1 lie.

Let us now describe the initialization scheme more formally. Let the input and hidden layers each have a bias unit with a constant activation value of 1, and let W(1) and W(2) be the matrices of weights between the input and the hidden layer and between the hidden and the output layer, respectively:

W^(1) = [ W_{1,0}^(1)    W_{1,1}^(1)    …   W_{1,n}^(1)
          W_{2,0}^(1)    W_{2,1}^(1)    …   W_{2,n}^(1)
              ⋮              ⋮                  ⋮
          W_{N1,0}^(1)   W_{N1,1}^(1)   …   W_{N1,n}^(1) ]

where W_{j,i}^(1) represents the connection from the i-th input unit to the j-th hidden neuron, for j = 1, 2, …, N1 and i = 0, 1, …, n, and W_{1,j}^(2) represents the connection from the j-th hidden unit to the output neuron, for j = 0, 1, 2, …, N1:

W^(2) = ( W_{1,0}^(2), W_{1,1}^(2), …, W_{1,N1}^(2) )

Let θ^(1) and θ^(2) be the vectors of θ values of the hidden and output layers, respectively; θ_j^(1) and θ_1^(2) represent the θ value of the j-th hidden neuron (j = 1, 2, …, N1) and of the output neuron, respectively:

θ^(1) = ( θ_1^(1), θ_2^(1), …, θ_{N1}^(1) );   θ^(2) = ( θ_1^(2) )

If K0 ≥ K1, we use the patterns in set H1 corresponding to class 1 to obtain the weight values for the ⌈K1/n⌉ hidden neurons, so the number of hidden neurons is N1 = ⌈K1/n⌉ ≤ ⌈K/(2n)⌉. For all j ∈ {1, 2, …, N1} and i ∈ {1, 2, …, n}, let

W_{j,0}^(1) = 1      (1)
W_{1,j}^(2) = 1      (2)
W_{1,0}^(2) = −1     (3)
θ_j^(1) = β          (4)
θ_1^(2) = β          (5)

where 0 < β < 1 is a very small positive number.

In order for each j-th hidden neuron to represent a different set of n input patterns, we solve the following equations for W_{j,i}^(1), for each j = 1, 2, …, ⌊K1/n⌋ and i = 1, 2, …, n, where q_{k,i} denotes the i-th component of pattern q_k:

Σ_{i=1}^n W_{j,i}^(1) q_{(j−1)n+1, i} + W_{j,0}^(1) = 0
Σ_{i=1}^n W_{j,i}^(1) q_{(j−1)n+2, i} + W_{j,0}^(1) = 0
        ⋮
Σ_{i=1}^n W_{j,i}^(1) q_{(j−1)n+n, i} + W_{j,0}^(1) = 0          (6)
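As a concrete illustration of system (6), the sketch below (assuming NumPy; the helper name hidden_weights_for_group and the toy patterns are mine, not the paper's) solves the n × n system for one hidden neuron by Gauss elimination, with the bias W_{j,0}^(1) = 1 moved to the right-hand side.

```python
# A minimal sketch (assuming NumPy) of solving system (6) for one hidden
# neuron: given the n class-1 patterns assigned to neuron j (rows of Q),
# find W_{j,1..n} such that Q @ w + 1 = 0, with the bias fixed to 1 by (1).
# In general position Q is nonsingular, so Gauss elimination (here done by
# np.linalg.solve) yields the hyperplane through those n points.
import numpy as np

def hidden_weights_for_group(Q):
    """Q: (n, n) array whose rows are the n patterns of one group."""
    n = Q.shape[0]
    rhs = -np.ones(n)               # move the bias term W_{j,0} = 1 to the right-hand side
    return np.linalg.solve(Q, rhs)  # W_{j,1..n}

# toy example with n = 2 (hypothetical patterns, not from the paper)
Q = np.array([[1.0, 2.0],
              [3.0, 1.0]])
w = hidden_weights_for_group(Q)
print(w, Q @ w + 1.0)               # the second vector is ~[0, 0]: both points lie on the hyperplane
```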

If K1/n is an integer (i.e., ⌊K1/n⌋ = ⌈K1/n⌉ = N1), we are done. If K1/n is not an integer (i.e., ⌊K1/n⌋ = ⌈K1/n⌉ − 1 = N1 − 1), the remaining K1 − n⌊K1/n⌋ patterns are represented by the N1-th hidden neuron using the following formula, for j = N1 and i = 1, 2, …, K1 − n⌊K1/n⌋:

Σ_{i=1}^n W_{N1,i}^(1) q_{n⌊K1/n⌋+1, i} + W_{N1,0}^(1) = 0
Σ_{i=1}^n W_{N1,i}^(1) q_{n⌊K1/n⌋+2, i} + W_{N1,0}^(1) = 0
        ⋮
Σ_{i=1}^n W_{N1,i}^(1) q_{K1, i} + W_{N1,0}^(1) = 0              (7)

Equations (7) can be solved for W_{N1,i}^(1), i = 1, 2, …, K1 − n⌊K1/n⌋. The other n − K1 + n⌊K1/n⌋ weights connecting to the N1-th hidden neuron are set to zero.
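Putting equations (1)–(7) together, a minimal sketch of the whole K0 ≥ K1 initialization might look as follows. It assumes NumPy, assumes the sub-systems are nonsingular (which the general-position argument presumes), and uses names such as init_qsf_weights that are mine rather than the paper's.

```python
# A sketch of the K0 >= K1 initialization: split the K1 class-1 patterns into
# groups of n, solve system (6) for each full group and system (7) for the
# leftover group, and fix the remaining values by equations (1)-(5).
import numpy as np

def init_qsf_weights(H1, beta=0.09):
    """H1: (K1, n) array of class-1 patterns.  Returns W1, W2, theta1, theta2."""
    K1, n = H1.shape
    N1 = int(np.ceil(K1 / n))                   # number of hidden neurons
    W1 = np.zeros((N1, n + 1))                  # column 0 holds the bias weight W_{j,0}
    W1[:, 0] = 1.0                              # equation (1)
    for j in range(N1):
        group = H1[j * n:(j + 1) * n]           # n patterns (fewer for the last group)
        m = group.shape[0]
        if m == n:                              # system (6): full n x n solve
            W1[j, 1:] = np.linalg.solve(group, -np.ones(n))
        else:                                   # system (7): solve for m weights, rest stay 0
            W1[j, 1:m + 1] = np.linalg.solve(group[:, :m], -np.ones(m))
    W2 = np.concatenate(([-1.0], np.ones(N1)))  # equations (3) and (2)
    theta1 = np.full(N1, beta)                  # equation (4)
    theta2 = beta                               # equation (5)
    return W1, W2, theta1, theta2
```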

If K0 < K1, we use the patterns in set H0 corresponding to class 0 to obtain the weight values for the ⌈K0/n⌉ hidden neurons, so the number of hidden neurons is N1 = ⌈K0/n⌉ < ⌈K/(2n)⌉. Everything is the same except that we let

W_{1,0}^(2) = 0      (8)

and q is replaced by p. We now solve the following equations for W_{j,i}^(1), for each j = 1, 2, …, ⌊K0/n⌋ and i = 1, 2, …, n:

Σ_{i=1}^n W_{j,i}^(1) p_{(j−1)n+1, i} + W_{j,0}^(1) = 0
Σ_{i=1}^n W_{j,i}^(1) p_{(j−1)n+2, i} + W_{j,0}^(1) = 0
        ⋮
Σ_{i=1}^n W_{j,i}^(1) p_{(j−1)n+n, i} + W_{j,0}^(1) = 0          (9)

If K0/n is not an integer, the weights connecting the inputs to the N1-th hidden neuron can be obtained in the same way as in the case K0 ≥ K1.

To test the result, when K0 ≥ K1, with these initial values of the weights, an input x ∈ H1 such that

W_j^(1) · x + W_{j,0}^(1) = 0

lies on the hyperplane that the j-th hidden neuron represents. This input will cause only the j-th hidden unit to have an activation value close to 1, since

net_j^(1) = W_j^(1) · x + W_{j,0}^(1) = 0,

where net_j^(1) represents the net input of the j-th hidden neuron, and

−β < net_j^(1) = 0 < β.

The other hidden neurons t ≠ j will not get activated, because

net_t^(1) = W_t^(1) · x + W_{t,0}^(1) ≠ 0,

and, since β is very small,

if net_t^(1) > 0 then net_t^(1) > β;   if net_t^(1) < 0 then net_t^(1) < −β.

Consequently, provided the activation of the j-th hidden neuron is close to 1 with a QSF and all other hidden neurons are close to 0, the output neuron will also get activated, since

net_1^(2) = W_1^(2) · y + W_{1,0}^(2) = 1 · y_j − 1 ≈ 0,

and

−β < net_1^(2) < β,

where y represents the output vector of the hidden layer and net_1^(2) represents the net input of the output neuron. If K0 < K1, the situation is similar.
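A quick sanity check of this argument, reusing the qtf and init_qsf_weights sketches above (again an illustration under my own naming, with hypothetical patterns): a class-1 pattern should drive exactly one hidden unit's net input to zero and hence activate the output unit under the hard QTF.

```python
# A sanity check of the argument above (reuses the qtf and init_qsf_weights
# sketches from earlier; the patterns are hypothetical, not from the paper).
import numpy as np

def forward_qtf(x, W1, W2, theta1, theta2):
    net1 = W1[:, 0] + W1[:, 1:] @ x          # net inputs of the hidden neurons
    y = qtf(net1, theta1)                    # hard activations: 1 only where |net| <= beta
    net2 = W2[0] + W2[1:] @ y                # net input of the output neuron
    return y, qtf(net2, theta2)

H1 = np.array([[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]])   # hypothetical class-1 patterns (n = 2)
W1, W2, theta1, theta2 = init_qsf_weights(H1)
y, out = forward_qtf(H1[0], W1, W2, theta1, theta2)
print(y, out)    # for this example, exactly one hidden unit is 1 and the output unit is 1
```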

4 Simulations

This section provides the results of applying the above initialization method to three simple problems. The first is the XOR problem, the second is the parity problem, and the third is the gray-zone problem. They are trained with this algorithm using a two-layer neural network. All the simulations have been performed using BP with a momentum term. In this algorithm, the weights and θs are updated after each training pattern, according to the following equations (C. C. Chiang, 1993):

ΔW_{j,i}^(1)(t) = −µ1 ∂E(t−1)/∂W_{j,i}^(1) + α1 ΔW_{j,i}^(1)(t−1)
ΔW_{1,j}^(2)(t) = −µ1 ∂E(t−1)/∂W_{1,j}^(2) + α1 ΔW_{1,j}^(2)(t−1)
Δθ_j^(1)(t) = −µ2 ∂E(t−1)/∂θ_j^(1) + α2 Δθ_j^(1)(t−1)
Δθ_1^(2)(t) = −µ2 ∂E(t−1)/∂θ_1^(2) + α2 Δθ_1^(2)(t−1)

where j = 0, 1, 2, …, N1; i = 0, 1, …, n; E = (1/2)(o − d)², with o the actual output and d the desired output; t is a discrete time index; α1 is the momentum coefficient for the weights; α2 is the momentum coefficient for the θs; µ1 is the learning rate for the weights; and µ2 is the learning rate for the θs.
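One update step of this rule can be sketched as follows (assuming NumPy; applying the increment to the parameter is the standard BP-with-momentum reading of the equations above, and the gradient here is a random placeholder rather than a value actually computed by backpropagation).

```python
# One momentum update step for a generic parameter array (sketch only).
import numpy as np

def momentum_step(param, grad, delta_prev, lr, alpha):
    """delta(t) = -lr * dE/dparam + alpha * delta(t-1); then param += delta(t)."""
    delta = -lr * grad + alpha * delta_prev
    return param + delta, delta

W1 = np.zeros((2, 3))                 # e.g. the input-to-hidden weights W^(1)
dW1_prev = np.zeros_like(W1)          # previous increment, initially zero
grad = np.random.randn(*W1.shape)     # placeholder for dE/dW^(1)
W1, dW1_prev = momentum_step(W1, grad, dW1_prev, lr=0.3, alpha=0.3)  # mu1, alpha1 from the XOR runs
```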

The exclusive-or (XOR) function has been widely used as a benchmark to test a neural network's learning ability. Figure 2 depicts the learning curves of the XOR problem for our method, random QSF BP, and random sigmoid BP. The learning curves of the random QSF BP and the random sigmoid BP are the averages of 50 training runs using 50 different sets of initial random weights, all between −0.1 and +0.1, to relieve the effect of different initial weights. In the simulations, the parameters were chosen as follows: µ1 = 0.3, µ2 = 2.0, α1 = 0.3, α2 = 0.005, β = 0.09. The figure shows the mean squared error decreasing with the epoch number; an "epoch" means one presentation of the whole training set.

As shown in Figure 2, our learning speed is far superior to that of the random QSF BP and the random sigmoid BP with the same learning rate, even though we use fewer hidden neurons.

The parity problem contains 32 8-dimensional binary training patterns. If the sum of the bits of the input pattern is odd, the output is 1; otherwise it is 0. For example, presentation of 00001111 should result in an output of '0', and presentation of 01000110 should result in an output of '1'. Figure 3 shows the learning curves of the parity problem for our method, random QSF BP, and conventional BP. In the same way, the learning curves of the random QSF BP and conventional BP are the averages of 50 training runs using 50 different sets of initial random weights, all between −0.1 and +0.1. In the simulations, the parameters were chosen as follows: µ1 = 0.01, µ2 = 2.0, α1 = 0.001, α2 = 0.005, β = 0.09. With the same learning rate and fewer hidden neurons, our learning speed is far superior to the others.
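A sketch of how such parity patterns can be generated (assuming NumPy; the paper does not say how its 32 patterns were selected, so the random sampling below is only illustrative):

```python
# Generate 32 binary patterns of length 8 and their parity targets
# (illustrative sampling; not the paper's actual pattern selection).
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(32, 8))   # 32 binary patterns of length 8
targets = bits.sum(axis=1) % 2            # 1 if the number of ones is odd, else 0
print(bits[0], targets[0])
```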


Figure 4 shows the distribution of the training patterns of the gray-zone problem on the 2-dimensional plane. Circles and crosses represent class 0 and class 1, respectively.

Figure 5 shows the learning curves for our method, random QSF BP, and random sigmoid BP (µ1 = 0.2, µ2 = 0.9, α1 = 0.02, α2 = 0.005, β = 0.04). Figure 6 shows the result of using our method after 200 epochs. The white zone is class 0 (i.e., the output is smaller than 0.2), while the black zone is class 1 (i.e., the output is larger than 0.8). Some data belongs to neither class 0 nor class 1 (i.e., the output is between 0.2 and 0.8), as shown in the gray zone. Therefore, for training patterns such as those in Figure 4, the ambiguous ones are grouped into the gray zone, so our method can handle ambiguous training patterns.
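The decision rule described above can be written compactly as follows (a sketch assuming NumPy; the value −1 used to mark the gray zone is my own convention):

```python
# Map network outputs to class 0, class 1, or the gray zone.
import numpy as np

def decide(outputs):
    return np.where(outputs < 0.2, 0, np.where(outputs > 0.8, 1, -1))  # -1 marks the gray zone

print(decide(np.array([0.05, 0.5, 0.93])))   # [ 0 -1  1]
```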

5 Conclusion

In this paper, we propose a new method to initialize the weights of QSF BP neural networks. According to the theoretical results, the upper bound on the number of hidden neurons is reduced to half of that of a conventional single-hidden-layer MLP.

According to the simulation results, the initialized neural network is far superior to the conventional BP and the random QSF BP in both learning speed and network size.


Exercises

3.1 Note that MLP networks using the quadratic threshold activation function defined as

f(net, θ) = 1,  if net² ≤ θ²
            0,  if net² > θ²

can implement dichotomy problems. The patterns shown in Figure P1 have to be classified into two categories using a layered network. Figure P2 is the two-layer classifier using the quadratic threshold activation function; the values in the circles are the values of θ corresponding to the neurons. Figure P3 is the logic-equivalent diagram. (1)–(5) Fill in the appropriate weights in Figure P2. (6) Fill in the appropriate value of θ in Figure P2. (7) Write down the correct interpretation of the logic gate in Figure P3. (8) Sketch the partitioning of the pattern space.

Figure P1. The training patterns (0, 1), (0.5, 0.5), (1, 0), (−0.5, 0.5), (−1, 0), (−0.5, −0.5), (0.5, −0.5), (0, −1), each marked as class 1 or class 0.


The truth table:

In  | 0 | 1 | −1 | 0.5 | −0.5
Out | 1 | 0 | 0  | 1   | 1

Figure P2. The two-layer classifier (θ values of 0.9 shown in the circles; weights (1)–(5) and the threshold (6) to be filled in).

Figure P3. The logic-equivalent diagram (gate (7) to be identified).

References

1. B. Widrow, and R. Winter, (1988). “Neural nets for adaptive filtering and adaptive pattern recognition,” IEEE Computer, pp. 25-39, March, 1988.

2. C. C. Chiang, (1993). The study of supervised-learning neural models, Thesis of the Degree of Doctor of Philosophy, Dept. of Computer Science and Information Engineering College of Engineering, National Chiao-Tung Univ., Hsin-Chu, Taiwan.

3. E. B. Baum. "On the capabilities of multilayer perceptrons."

4. J. J. Hopfield, (1982). "Neural networks and physical systems with emergent collective computational abilities," Proc. of the National Academy of Sciences, 79:2554-2558.



5. T. J. Sejnowski, and C. R. Rosenberg, (1986). "NETtalk: a parallel network that learns to read aloud," The Johns Hopkins University Electrical Engineering and Computer Science Tech. Report, JHU/EECS-86/01.

6. S. C. Huang, and Y. F. Huang, (1991). “Bounds on the number of hidden neurons in multilayer perceptrons,” IEEE Trans. on Neural Networks, vol. 2, pp. 47-55.
