Chapter 1. Introduction
2.1 Radial Basis Functions and QuickRBF
Networks based on radial basis functions were developed to address some of the problems encountered when training multilayer perceptrons: radial basis function networks usually converge, and their training is much more rapid. Both are feed-forward networks with similar-looking diagrams and similar applications; however, the operating principles of radial basis function networks and the way they are trained are quite different from those of multilayer perceptrons.
An RBFN (radial basis function network) consists of three layers, namely the input layer, the hidden layer and the output layer. The input layer broadcasts the coordinates of the input vector to each of the nodes in the hidden layer. Each node in the hidden layer then produces an activation based on the associated radial basis function. Finally, each node in the output layer computes a linear combination of the activations of the hidden nodes.
For radial basis function networks, each hidden unit represents the center of a cluster in the data space. The input to a hidden unit in a radial basis function network is not the weighted sum of its inputs but a distance measure: a measure of how far the input vector is from the center of the basis function for that hidden unit. Various distance measures are used, but perhaps the most common is the well-known Euclidean distance.
If $x$ and $\mu_j$ are vectors, the Euclidean distance between them is given by

$$D_j = \sqrt{\sum_{i} (x_i - \mu_{ji})^2}$$

where $x$ is an input vector and $\mu_j$ is the location vector of the basis function for hidden node $j$. The hidden node then computes its output as a function of the distance between the input vector and its center. For the Gaussian radial basis function the hidden unit output is

$$h_j = \exp\left(-\frac{D_j^2}{2\sigma_j^2}\right)$$

where $D_j$ is the Euclidean distance between an input vector and the location vector for hidden unit $j$, $h_j$ is the output of hidden unit $j$, and $\sigma_j^2$ is a measure of the size of cluster $j$ (in statistical terms it is called the variance, or the square of the standard deviation).
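The hidden-layer computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not code from any package: the function name `gaussian_activations` and the example values are hypothetical, and the Gaussian is assumed to have the form $\exp(-D_j^2 / (2\sigma_j^2))$.

```python
import numpy as np

def gaussian_activations(x, centers, sigmas):
    """Return the Gaussian activation of every hidden unit for one input x.

    Assumes the kernel form exp(-D_j^2 / (2 * sigma_j^2)); names are illustrative.
    """
    x = np.asarray(x, dtype=float)              # input vector, shape (d,)
    centers = np.asarray(centers, dtype=float)  # one center per row, shape (k, d)
    sigmas = np.asarray(sigmas, dtype=float)    # one bandwidth per unit, shape (k,)
    # Euclidean distance D_j between x and each center mu_j
    distances = np.sqrt(np.sum((x - centers) ** 2, axis=1))
    # Gaussian radial basis function applied to each distance
    return np.exp(-distances ** 2 / (2.0 * sigmas ** 2))

# Two hypothetical hidden units in a 2-D input space
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
sigmas = np.array([5.0, 5.0])
h = gaussian_activations([0.0, 0.0], centers, sigmas)
```

An input that coincides with a center gives that unit its maximum activation of 1, and the activation decays as the input moves away from the center.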
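How an RBFN reacts to a given input stimulus is completely determined by the activation functions associated with the hidden nodes and the weights associated with the links between the hidden layer and the output layer. The general mathematical form of the output nodes in an RBFN is as follows:

$$c_j(x) = \sum_{i=1}^{k} w_{ji}\, \phi_i(x)$$

where $c_j(x)$ is the output of the $j$-th output node, $k$ is the number of hidden nodes, $\phi_i(x)$ is the activation of the $i$-th hidden node, and $w_{ji}$ is the weight of the link between the $j$-th class and the $i$-th center. The general architecture of an RBFN is shown in Fig. 2.1.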
Fig. 2.1. General Architecture of Radial Basis Function Networks.
We can see that constructing an RBFN involves determining the values of three sets of parameters: the centers ($\mu_i$), the bandwidths ($\sigma_i$) and the weights ($w_{ji}$), in order to minimize a suitable cost function.
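To make the role of the three parameter sets concrete, the following sketch evaluates a complete forward pass for one input. It is a minimal illustration under assumptions: the function name `rbfn_forward`, the example parameter values and the Gaussian form $\exp(-D^2 / (2\sigma^2))$ are not taken from the text.

```python
import numpy as np

def rbfn_forward(x, centers, sigmas, weights):
    """Compute all output-node values of an RBFN for one input vector.

    centers: shape (k, d), sigmas: shape (k,), weights: shape (m, k),
    where weights[j, i] links hidden node i to output node j.
    """
    diff = np.asarray(x, dtype=float) - centers   # broadcast over the k centers
    sq_dist = np.sum(diff ** 2, axis=1)           # squared Euclidean distances
    h = np.exp(-sq_dist / (2.0 * sigmas ** 2))    # hidden-layer activations
    return weights @ h                            # linear combination per output node

# Hypothetical parameters: 2 centers, 2 output nodes
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
sigmas = np.array([5.0, 5.0])
weights = np.array([[1.0, 0.0],
                    [0.0, 2.0]])
outputs = rbfn_forward([3.0, 4.0], centers, sigmas, weights)
```

Only `weights` enters the output linearly; `centers` and `sigmas` sit inside the nonlinear kernel, which is why fixing them in advance turns training into a linear problem.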
In the QuickRBF package, the centers are randomly selected and, in the simplest configuration, the bandwidth is fixed at 5 for every kernel function. The transformation between the inputs and the corresponding outputs of the hidden units is then fixed, so the network can be viewed as an equivalent single-layer network with linear output units. The least mean square error (LMSE) method is then used to determine the weights associated with the links between the hidden layer and the output layer.
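The procedure described above (random centers, a fixed bandwidth, then a linear least-squares fit of the output weights) can be sketched as follows. This is not QuickRBF's actual code: the function names, the toy data, the Gaussian form $\exp(-D^2 / (2\sigma^2))$ and the use of `numpy.linalg.lstsq` as the least-squares solver are all assumptions made for illustration.

```python
import numpy as np

def hidden_outputs(X, centers, sigma):
    """Gaussian activations for all inputs: shape (n_samples, n_centers)."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def train_rbfn(X, labels, n_classes, n_centers, sigma=5.0, seed=0):
    """Pick random training points as centers, then fit the weights by least squares."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_centers, replace=False)
    centers = X[idx]                          # randomly selected centers
    H = hidden_outputs(X, centers, sigma)     # fixed hidden-layer transformation
    targets = np.eye(n_classes)[labels]       # one-hot target row per sample
    # Least-squares fit of the linear output layer, W has shape (n_centers, n_classes)
    W, *_ = np.linalg.lstsq(H, targets, rcond=None)
    return centers, W

def predict(X, centers, W, sigma=5.0):
    """Assign each input to the class with the largest output value."""
    H = hidden_outputs(np.asarray(X, dtype=float), centers, sigma)
    return np.argmax(H @ W, axis=1)

# Toy data (hypothetical): two well-separated clusters
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [20.0, 20.0], [21.0, 20.0], [20.0, 21.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers, W = train_rbfn(X, labels, n_classes=2, n_centers=4)
predicted = predict(X, centers, W)
```

Because the centers and the bandwidth are fixed before the fit, the only unknowns are the entries of `W`, and they are obtained in a single linear solve rather than by iterative training.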
Assume $h$ is the output of the hidden layer:

$$h = [\phi_1(x), \phi_2(x), \ldots, \phi_k(x)]^T \qquad (2.4)$$

where $k$ is the number of centers and $\phi_1(x)$ is the output value of the first kernel function with input $x$. Then the discriminant function $c_j(x)$ of class $j$ can be expressed by the following:

$$c_j(x) = w_j^T h, \quad j = 1, 2, \ldots, m$$

where $m$ is the number of classes and $w_j$ is the weight vector of class $j$. After calculating the discriminant function value of each class, we choose the class with the largest discriminant function value as the classification result.

We now discuss how to obtain the weight vectors by using the least mean square error method. For a classification problem with $m$ classes, let $V_i$ designate the $i$-th column vector of an $m \times m$ identity matrix and $W$ be a $k \times m$ matrix of weights:
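$$W = [w_1, w_2, \ldots, w_m] \qquad (2.7)$$

Then the objective function to be minimized is

$$J(W) = \sum_{i=1}^{m} P_i \, E\{\|W^T h - V_i\|^2\}_i$$

where $P_i$ and $E\{\cdot\}_i$ are the a priori probability and the expected value for class $i$. Setting the gradient of $J(W)$ with respect to $W$ to zero gives

$$\sum_{i=1}^{m} P_i \, E\{hh^T\}_i \, W = \sum_{i=1}^{m} P_i \, E\{h\}_i \, V_i^T$$

Let $K_i$ denote the class-conditional matrix of second-order moments of $h$, i.e.

$$K_i = E\{hh^T\}_i \qquad (2.10)$$

If $K$ denotes the matrix of the second-order moments under the mixture distribution, we have

$$K = \sum_{i=1}^{m} P_i K_i$$

so that the condition above becomes

$$KW = M \qquad (2.12)$$

where $M = \sum_{i=1}^{m} P_i \, E\{h\}_i \, V_i^T$.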
If $K$ is nonsingular, the optimal $W$ can be calculated by

$$W^* = K^{-1} M \qquad (2.14)$$

However, this method has a critical drawback: $K$ may be singular, in which case the whole procedure breaks down. Observe that the matrix $hh^T$ is a symmetric positive semi-definite (PSD) matrix of rank 1. Since $K$ is the sum of $hh^T$ over the training instances, $K$ is also a PSD matrix, with rank at most $n$, the number of training instances. A PSD matrix may be singular, so we add a regularization term to make sure the matrix is invertible. Following regularization theory, the objective function is replaced as follows:
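$$J_\lambda(W) = J(W) + \lambda \sum_{j=1}^{m} \|w_j\|^2$$

where $J(W)$ is the unregularized objective function and $\lambda$ is the regularization parameter. Then Eq. (2.12) becomes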
$$(K + \lambda I) W = M \qquad (2.16)$$

If we set $\lambda > 0$, $(K + \lambda I)$ is a positive definite (PD) matrix and is therefore nonsingular. The optimal $W^*$ can be calculated by

$$W^* = (K + \lambda I)^{-1} M \qquad (2.17)$$

Moreover, a PD matrix has many good properties, one of which is a special and efficient triangular decomposition, the Cholesky decomposition. By using the Cholesky decomposition, we can decompose the matrix $(K + \lambda I)$ as follows:

$$(K + \lambda I) = LL^T \qquad (2.18)$$

where $L$ is a lower triangular matrix. Then Eq. (2.16) becomes