
Data Classification with the Radial Basis Function Network Based on a Novel Kernel Density Estimation Algorithm

Yen-Jen Oyang, Shien-Ching Hwang, Yu-Yen Ou, Chien-Yu Chen, and Zhi-Wei Chen
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan, R. O. C.
{yjoyang, yien, zwchen}@csie.ntu.edu.tw
{schwang, cychen}@mars.csie.ntu.edu.tw

Abstract

This paper proposes a novel learning algorithm for constructing data classifiers with radial basis function (RBF) networks. The RBF networks constructed with the proposed learning algorithm generally are able to deliver the same level of classification accuracy as the support vector machines (SVM). One important advantage of the proposed learning algorithm, in comparison with the support vector machines, is that it normally takes far less time to figure out optimal parameter values with cross validation. A comparison with the SVM is of interest, because it has been shown in a number of recent studies that the SVM generally is able to deliver a higher level of accuracy than other existing data classification algorithms. The proposed learning algorithm works by constructing one RBF network to approximate the probability density function of each class of objects in the training data set. The main distinction of the proposed learning algorithm is how it exploits local distributions of the training samples in determining the optimal parameter values of the basis functions. As the proposed learning algorithm is instance-based, the data reduction issue is also addressed in this paper. One interesting observation is that, for all three data sets used in the data reduction experiments, the number of training samples remaining after a naïve data reduction mechanism is applied is quite close to the number of support vectors identified by the SVM software.

Key terms: Radial basis function network, Data classification, Machine learning.


1. Introduction

The radial basis function (RBF) network is a special type of neural network with several distinctive features [15, 1 ]. Since its first proposal, the RBF network has attracted a high degree of interest in the computer science research communities. One of the main applications of RBF networks is data classification, and several learning algorithms have been proposed for data classification with RBF networks [2, 4, 13]. However, the latest developments in data classification have focused more on support vector machines (SVM) than on RBF networks, because several recent studies have reported that the SVM generally is able to deliver higher classification accuracy than other existing data classification algorithms [9, 12, 14]. Nevertheless, the SVM suffers from one serious problem: the time taken to carry out model selection for the SVM may be unacceptably long for some applications.


In this paper, a novel learning algorithm is proposed for the efficient construction of RBF network based classifiers that can deliver the same level of accuracy as the SVM. The main properties of the proposed learning algorithm are as follows:

(i) the RBF networks constructed are generally able to deliver the same level of classification accuracy as the SVM [8];

(ii) the average time complexity for constructing the RBF network is bounded by O(mn log n + n log n), where m is the number of attributes of the data set and n is the total number of training samples;

(iii) the average time complexity for classifying c objects with unknown class is bounded by O(mn log n + c log n).

As the RBF network based classifier constructed with the proposed learning algorithm is instance-based, the efficiency issue shared by almost all instance-based learning algorithms must be addressed. That is, a data reduction mechanism must be employed to remove redundant samples in the training data set and, therefore, to improve the efficiency of the instance-based classifier. Experimental results reveal that the naïve data reduction mechanism employed in this paper is able to reduce the size of the training data set substantially with a minimum impact on classification accuracy. One interesting observation in this regard is that, for all three data sets used in data reduction experiments, the number of training samples remaining after data reduction is applied and the number of support vectors identified by the SVM software are in the same order.

In fact, in two out of the three cases reported in this paper, the differences are less than 15%.

Since data reduction is a crucial issue for instance-based learning algorithms, further studies on this issue should be conducted.

In the remainder of this paper, Section 2 presents an overview of how data classification is conducted with the proposed learning algorithm. Section 3 elaborates the novel kernel density estimation algorithm that the proposed learning algorithm is based on. Section 4 discusses the practical issues involved in applying the proposed learning algorithm and reports the experiments conducted to evaluate its performance. Finally, concluding remarks are presented in Section 5.

2. Overview of Data Classification with the Proposed Learning Algorithm

This section presents an overview of how data classification is conducted with the proposed learning algorithm. Assume that the objects of concern are distributed in an m-dimensional vector space and let fj denote the probability density function that corresponds to the distribution of class-j objects in the m-dimensional vector space. The proposed learning algorithm constructs one RBF network for approximating the probability density function of one class of objects in the training data set. The general form of the function approximators is as follows:

$$\hat{f}_j(v) = \sum_{s_i \in S_j} w_i \exp\left( -\frac{\|v - s_i\|^2}{2\sigma_i^2} \right), \qquad (1)$$

where

(i) f̂j is the RBF network based function approximator for class-j training samples;

(ii) v is a vector in the m-dimensional vector space;

(iii) Sj is the set of class-j training samples;

(iv) ||v − si|| is the distance between vectors v and si;

(v) wi and σi are, respectively, the weight and the width of the Gaussian function placed at si; how their values are determined is elaborated in section 3.

With the RBF network based function approximators, a new object with unknown class and located at v is predicted to belong to the class that gives the maximum value of the likelihood function defined in the following:

$$L_h(v) = \frac{|S_h|}{|S|}\,\hat{f}_h(v),$$

where Sh is the set of class-h training samples and S is the set of training samples of all classes.
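As a concrete illustration of this decision rule, the following sketch (ours, not code from the paper; the per-class approximators are assumed to be available as Python callables) selects the class with the largest prior-weighted density estimate.

```python
import numpy as np

def classify(v, approximators, class_sizes):
    """Predict the class of vector v by maximizing L_h(v) = (|S_h| / |S|) * f_hat_h(v).

    approximators : dict mapping class label h to a callable f_hat_h(v)
    class_sizes   : dict mapping class label h to |S_h|
    """
    total = sum(class_sizes.values())              # |S|
    best_label, best_score = None, -np.inf
    for h, f_hat in approximators.items():
        score = class_sizes[h] / total * f_hat(v)  # likelihood L_h(v)
        if score > best_score:
            best_label, best_score = h, score
    return best_label
```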

The essential issue of the learning algorithm is to construct the RBF network based function approximators. In the next section, the novel kernel density estimation algorithm that the proposed learning algorithm is based on will be presented. For the time being, let us address how to estimate the values of the probability density functions. Assume that the number of samples is sufficiently large. Then, by the law of large numbers in statistics [20], we can estimate the value of the probability density function fj(·) at a class-j sample si as follows:

$$f_j(s_i) \;\cong\; \frac{k_1 + 1}{|S_j|} \cdot \frac{\Gamma(\frac{m}{2} + 1)}{\pi^{m/2}\, R(s_i)^m}, \qquad (2)$$

where

(i) R(si) is the maximum distance between si and its k1 nearest training samples of the same class;

(ii) π^{m/2} R(si)^m / Γ(m/2 + 1) is the volume of a hypersphere with radius R(si) in an m-dimensional vector space;

(iii) Γ(⋅) is the Gamma function [1];

(iv) k1 is a parameter to be set either through cross validation or by the user.

In equation (2), R(si) is determined by one single training sample and therefore could be unreliable if the data set is noisy. In our implementation, we use R̂(si) defined in the following to replace R(si) in equation (2):

$$\hat{R}(s_i) = \frac{m+1}{m} \cdot \frac{1}{k_1} \sum_{h=1}^{k_1} \|\hat{s}_h - s_i\|,$$

where ŝ1, ŝ2, ..., ŝk1 are the k1 nearest training samples of the same class as si. The basis of employing R̂(si) is elaborated in Appendix A.1.
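The estimate of equation (2) with R̂(si) in place of R(si) can be computed directly. The sketch below is our own illustration (brute-force distance computation, Euclidean metric assumed), not code from the paper.

```python
import numpy as np
from math import gamma, pi

def density_at_samples(S_j, k1):
    """Estimate f_j at every class-j sample via equation (2), with R replaced by R_hat."""
    S_j = np.asarray(S_j, dtype=float)
    n, m = S_j.shape
    # pairwise distances within the class (brute force for clarity; the paper uses a kd-tree)
    d = np.linalg.norm(S_j[:, None, :] - S_j[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude the sample itself
    knn = np.sort(d, axis=1)[:, :k1]            # distances to the k1 nearest same-class samples
    R_hat = (m + 1) / m * knn.mean(axis=1)      # R_hat(s_i) as defined above
    # equation (2): (k1 + 1) / |S_j| * Gamma(m/2 + 1) / (pi^(m/2) * R_hat^m)
    return (k1 + 1) / n * gamma(m / 2 + 1) / (pi ** (m / 2) * R_hat ** m)
```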

3. The Proposed Kernel Density Estimation Algorithm

This section elaborates the novel kernel density estimation algorithm that the proposed learning algorithm is based on. The proposed kernel density estimation algorithm is based on some ideal assumptions. Therefore, some sort of adaptation must be employed if the target data set does not conform to the assumptions. In this section, we will focus on the derivation of the kernel density estimation algorithm based on the ideal assumptions. The adaptation employed in this paper will be addressed in the next section.


Assume that we now want to derive a function approximator for the set of class-j training samples. The proposed kernel density estimation algorithm places one Gaussian function at each sample as shown in equation (1). The challenge now is how to figure out the optimal weight and parameter value of each Gaussian function. For a training sample si, the learning algorithm first conducts a mathematical analysis on a synthesized data set. The synthesized data set is derived from two ideal assumptions. The first assumption is that the sampling density in the proximity of si is sufficiently high and, as a result, the variation of the probability density function in the proximity of si approaches 0. The second assumption is that si and its nearest samples of the same class are evenly spaced by a distance determined by the value of the probability density function at si. Fig. 1 shows an example of the synthesized data set for a training sample in a 2-dimensional vector space. The details of the model are elaborated in the following.

Fig. 1. An example of the synthesized data set for a training sample in a 2-dimensional vector space.

(i) Sample si is located at the origin and the neighboring class-j samples are located at (h1δi, h2δi, …, hmδi), where h1, h2, …, hm are integers and δi is the average distance between two adjacent class-j training samples in the proximity of si. How δi is determined will be elaborated later on.

(ii) All the samples in the synthesized data set, including si, have the same function value equal to fj(si). The value of fj(si) is estimated based on equation (2) in section 2.

The proposed kernel density estimation algorithm begins with an analysis to figure out the values of wi and σi that make function gi(⋅) defined in the following virtually a constant function equal to fj(si),

$$g_i(x_1, \ldots, x_m) = \sum_{h_1=-\infty}^{\infty} \cdots \sum_{h_m=-\infty}^{\infty} w_i \exp\left( -\frac{\|(x_1, \ldots, x_m) - (h_1\delta_i, \ldots, h_m\delta_i)\|^2}{2\sigma_i^2} \right) \;\cong\; f_j(s_i). \qquad (3)$$

Let

$$q(y) = \sum_{h=-\infty}^{\infty} \exp\left( -\frac{(y - h\delta_i)^2}{2\sigma_i^2} \right). \qquad (4)$$

Then, we have

$$\left\{ \min_{y} q(y) \right\}^m \;\le\; \sum_{h_1=-\infty}^{\infty} \cdots \sum_{h_m=-\infty}^{\infty} \exp\left( -\frac{\|(x_1, \ldots, x_m) - (h_1\delta_i, \ldots, h_m\delta_i)\|^2}{2\sigma_i^2} \right) \;\le\; \left\{ \max_{y} q(y) \right\}^m.$$

It is shown in Appendix A.2 that, if σi = δi, then we have

$$q(y) \;\le\; 2.506628288. \qquad (5)$$

Furthermore, we have

$$q(y) \;\ge\; 2.506628261. \qquad (6)$$

That is, q(y) is bounded by 2.5066282745 ± 1.35 × 10^-8. Therefore, with σi = δi, q(y) defined in equation (4) and gi(x) defined in equation (3) are virtually constant functions. In fact, we can apply basically the same procedure presented in Appendix A.2 to find the upper and lower bounds of q(y) with alternative σi/δi ratios. As Table 1 reveals, the variation of q(y) becomes smaller if β = σi/δi is set to a larger value. However, the variation of q(y) is not the only concern with respect to choosing the appropriate β value. We will discuss another effect to consider later.

β = σi/δi    Bounds of q(y)
0.5          1.253314144 ± 1.80 × 10^-2
1.0          2.5066282745 ± 1.34 × 10^-8
1.5          3.759942411933922 ± 2.94 × 10^-11

Tab. 1: The bounds of function q(y) defined in equation (4) with alternative σi/δi ratios.
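The trend in Table 1 can be illustrated numerically. The sketch below (ours) truncates the infinite sum at |h| ≤ 50 and scans q(y) over half a period with δ = 1; for β = 0.5 and β = 1.0 the observed spread agrees with Table 1, while for β = 1.5 the true variation is already below double-precision resolution, so the printed spread is dominated by rounding.

```python
import numpy as np

def q(y, beta, h_max=50):
    """q(y) of equation (4) with delta = 1 and sigma = beta, truncated at |h| <= h_max."""
    h = np.arange(-h_max, h_max + 1)
    return np.exp(-(y[:, None] - h[None, :]) ** 2 / (2 * beta ** 2)).sum(axis=1)

ys = np.linspace(0.0, 0.5, 20001)          # half a period suffices by symmetry
for beta in (0.5, 1.0, 1.5):
    vals = q(ys, beta)
    centre = (vals.max() + vals.min()) / 2
    spread = (vals.max() - vals.min()) / 2
    print(f"beta = {beta}: q(y) = {centre:.10f} +/- {spread:.2e}")
```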

Next, we need to figure out the appropriate value of wi so that equation (3) is satisfied. We have

$$g_i(0, 0, \ldots, 0) = \sum_{h_1=-\infty}^{\infty} \cdots \sum_{h_m=-\infty}^{\infty} w_i \exp\left( -\frac{(h_1^2 + h_2^2 + \cdots + h_m^2)\,\delta_i^2}{2\sigma_i^2} \right) = w_i \left[ \sum_{h=-\infty}^{\infty} \exp\left( -\frac{h^2}{2\beta^2} \right) \right]^m, \quad \text{where } \beta = \frac{\sigma_i}{\delta_i}.$$

Therefore, we need to set wi as follows:

$$w_i = f_j(s_i) \left[ \sum_{h=-\infty}^{\infty} \exp\left( -\frac{h^2}{2\beta^2} \right) \right]^{-m}.$$

If we employ equation (2) to estimate the value of fj(si), then we have

$$w_i = \frac{(k_1 + 1)\,\Gamma(\frac{m}{2} + 1)}{|S_j|\;\pi^{m/2}\;\hat{R}(s_i)^m\;\lambda^m}, \quad \text{where } \lambda = \sum_{h=-\infty}^{\infty} \exp\left( -\frac{h^2}{2\beta^2} \right). \qquad (7)$$

The only remaining issue is to determine δi. In this paper, we set δi to the average distance between two adjacent class-j training samples in the proximity of sample si. In an m-dimensional vector space, the number of uniformly distributed samples, N, in a hypercube with volume V can be computed by N = V/α^m, where α is the spacing between two adjacent samples. Therefore, we can compute δi by

$$\delta_i = \frac{\sqrt{\pi}\;\hat{R}(s_i)}{\left[ (k_1 + 1)\,\Gamma(\frac{m}{2} + 1) \right]^{1/m}}. \qquad (8)$$

Finally, we have the following approximate probability density function for class-j training samples:

$$\hat{f}_j(v) = \sum_{s_i \in S_j} \frac{(k_1 + 1)\,\Gamma(\frac{m}{2} + 1)}{|S_j|\;\pi^{m/2}\;\hat{R}(s_i)^m\;\lambda^m} \exp\left( -\frac{\|v - s_i\|^2}{2\hat{\sigma}_i^2} \right) = \frac{1}{|S_j|} \sum_{s_i \in S_j} \frac{\beta^m}{\lambda^m\,\hat{\sigma}_i^m} \exp\left( -\frac{\|v - s_i\|^2}{2\hat{\sigma}_i^2} \right), \qquad (9)$$

where

(1) v is a vector in the m-dimensional vector space,

(2) $\hat{\sigma}_i = \beta\delta_i = \beta\sqrt{\pi}\,\hat{R}(s_i) \big/ \left[(k_1 + 1)\,\Gamma(\tfrac{m}{2} + 1)\right]^{1/m}$,

(3) $\lambda = \sum_{h=-\infty}^{\infty} \exp\left(-\tfrac{h^2}{2\beta^2}\right)$.

One interesting observation is that, regardless of which β = σi/δi ratio is employed, we have λ ≅ √(2π)·β. If this observation can be proved to be generally correct, then we can further simplify equation (9) and obtain

$$\hat{f}_j(v) = \frac{1}{|S_j|} \sum_{s_i \in S_j} \frac{1}{\left( \sqrt{2\pi}\,\hat{\sigma}_i \right)^m} \exp\left( -\frac{\|v - s_i\|^2}{2\hat{\sigma}_i^2} \right). \qquad (10)$$
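The observation λ ≅ √(2π)·β that turns equation (9) into equation (10) is easy to check numerically; the following short sketch (ours) compares the truncated sum with √(2π)β for a few ratios.

```python
import numpy as np

h = np.arange(-50, 51)
for beta in (0.6, 0.7, 1.0, 1.5, 2.0):
    lam = np.exp(-h ** 2 / (2 * beta ** 2)).sum()     # lambda in equation (9)
    print(f"beta = {beta}: lambda = {lam:.6f}, sqrt(2*pi)*beta = {np.sqrt(2 * np.pi) * beta:.6f}")
```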

In practice, because the value of the Gaussian function decreases exponentially, when computing f̂j(v) according to equation (9) or (10) we only need to include a limited number of the nearest training samples of v. The number of nearest training samples to be included can be determined through cross validation.
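Putting the pieces together, the sketch below builds the approximator of one class from equations (8) and (10) and evaluates it at a query point using only the k2 nearest class samples found with a kd-tree. It is a minimal illustration under the assumptions of this section; the class and function names, and the use of scipy's cKDTree, are our own choices rather than the paper's implementation.

```python
import numpy as np
from math import gamma, pi, sqrt
from scipy.spatial import cKDTree

class ClassDensity:
    """Approximate density of one class, following equations (8) and (10)."""

    def __init__(self, S_j, k1, beta=0.7):
        self.S_j = np.asarray(S_j, dtype=float)
        n, m = self.S_j.shape
        self.m = m                                   # replace by m_hat for the adapted version
        self.tree = cKDTree(self.S_j)
        # distances to the k1 nearest same-class samples (column 0 is the point itself)
        dist, _ = self.tree.query(self.S_j, k=k1 + 1)
        R_hat = (m + 1) / m * dist[:, 1:].mean(axis=1)                         # R_hat(s_i)
        delta = sqrt(pi) * R_hat / ((k1 + 1) * gamma(m / 2 + 1)) ** (1.0 / m)  # equation (8)
        self.sigma = beta * delta                                              # sigma_hat_i

    def __call__(self, v, k2=20):
        # equation (10), restricted to the k2 nearest class samples of v
        dist, idx = self.tree.query(np.asarray(v, dtype=float), k=k2)
        sig = self.sigma[idx]
        terms = np.exp(-dist ** 2 / (2 * sig ** 2)) / (sqrt(2 * pi) * sig) ** self.m
        return terms.sum() / len(self.S_j)
```

One such object per class, combined with the decision rule of section 2, yields the complete classifier.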

In the earlier discussion, we mentioned that there is another issue to consider in selecting the β = σi/δi ratio, in addition to the tightness of the bounds of function q(y) defined in equation (4). If we examine equations (9) and (10), we will find that the value of the approximate function at a sample si, i.e. f̂j(si), is actually a weighted average of the estimated sample densities at si and its nearby samples of the same class. Therefore, a smoothing effect will result. A larger β = σi/δi ratio implies that the smoothing effect will be more significant. In section 4, we will discuss the effect of choosing alternative β values based on experimental results.

As far as the time complexity of the proposed learning algorithm is concerned, for each training sample we need to identify its k1 nearest neighbors of the same class in the learning phase, where k1 is a parameter in equation (2). If the kd-tree structure is employed [3], then the average time complexity of this process is O(mn log n + k1 n log n), where m is the number of attributes of the data set and n is the total number of training samples. In the classifying phase, we need to identify a fixed number of nearest training samples for each incoming object to be classified. Let k2 denote this fixed number. Then, the average time complexity of the classifying phase is O(mn log n + k2 c log n), where c is the number of objects to be classified.

4. Practical Issues and Experimental Results

This section addresses the practical issues concerning applying the proposed learning algorithm to real-world data sets. Since the kernel density estimation algorithm elaborated in Section 3 is based on some ideal assumptions, certain adaptations must be employed for handling real-world data sets. The approach employed in this paper is to incorporate a number of parameters in the learning algorithm, which are to be set with cross validation. Table 2 lists the parameters incorporated in our implementation. The parameter that deserves most attention now is m̂, as the roles of the other two parameters have been addressed earlier. In equations (2), (9), and (10), parameter m is supposed to be set to the number of attributes of the data set, if the ideal assumptions that the derivation of the kernel density estimation algorithm is based on hold. However, because the local distribution of the training samples may not spread in all dimensions and some attributes may even be correlated, we replace m in these equations by another value, denoted by m̂, to be determined through cross validation. In fact, the process conducted to figure out the optimal m̂ also serves to tune wi and σi.

k1    One of the parameters in equation (2).
k2    The number of nearest training samples included in evaluating the right-hand side of equation (9) or (10).
m̂     The parameter that substitutes for m in equations (2), (9), and (10).

Table 2. The parameters to be set through cross validation in the proposed learning algorithm.

In fact, there is another parameter to be set, the ratio β = σi/δi in equations (9) and (10). However, the experimental results reveal that, as long as β is set to a value within [0.6, 2], it has virtually no effect on the level of classification accuracy that can be achieved. Therefore, β is not included in Table 2. In the following experiments, the parameters listed in Table 2 are set with 10-fold cross validation.
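The search over k1, k2, and m̂ is a plain grid search wrapped in 10-fold cross validation. The sketch below (ours) assumes a build_classifier(X, y, k1, k2, m_hat) factory and an accuracy(clf, X, y) helper, which are placeholders rather than code from the paper.

```python
import numpy as np
from itertools import product

def select_parameters(X, y, k1_range, k2_range, m_hat_range,
                      build_classifier, accuracy, folds=10, seed=0):
    """Return the (k1, k2, m_hat) triple with the best mean accuracy over `folds` folds."""
    X, y = np.asarray(X), np.asarray(y)
    splits = np.array_split(np.random.default_rng(seed).permutation(len(X)), folds)
    best_params, best_score = None, -1.0
    for k1, k2, m_hat in product(k1_range, k2_range, m_hat_range):
        scores = []
        for f in range(folds):
            test_idx = splits[f]
            train_idx = np.concatenate([splits[g] for g in range(folds) if g != f])
            clf = build_classifier(X[train_idx], y[train_idx], k1, k2, m_hat)
            scores.append(accuracy(clf, X[test_idx], y[test_idx]))
        if np.mean(scores) > best_score:
            best_params, best_score = (k1, k2, m_hat), np.mean(scores)
    return best_params, best_score
```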

Tables 3 and 4 compare the classification accuracies achieved with the proposed learning algorithm, the support vector machine, and the KNN classifiers over 9 data sets from the UCI repository [6]. The collection of benchmark data sets used is the same as that used in [12], except that DNA is not included, because it contains categorical data and an extension of the proposed learning algorithm for handling categorical data sets is yet to be developed. In these experiments, the SVM software used was LIBSVM [7] with the radial basis function kernel, and the one-against-one practice was adopted.

Table 3 lists the results of the 6 smaller data sets, which contain no separate training and test sets. In Table 3, the number of samples in each of these 6 data sets is listed in parentheses below the name of the data set. For these 6 data sets, we adopted the evaluation practice used in [12]. That is, 10-fold cross validation is conducted on the entire data set and the best score is reported. Therefore, the numbers reported just reveal the maximum accuracy that can be achieved, provided that perfect cross validation algorithms are available to identify the optimal parameter values. As Table 3 shows, data classification based on the proposed learning algorithm and the SVM achieves basically the same level of accuracy for 4 out of these 6 data sets. The two exceptions are glass and vehicle. The benchmark results of these two data sets suggest that both the proposed algorithm and the SVM have some blind spots, and therefore may not be able to perform as well as the other in some cases. It is interesting to observe that the same level of classification accuracy can be achieved with a number of different parameter settings in the proposed learning algorithm.

Table 4 provides a more informative comparison, as the data sets are larger and cross validations are conducted to determine the optimal parameter values for the proposed learning algorithm and the SVM. The two numbers in parentheses below the name of each data set correspond to the numbers of training samples and test samples, respectively. Again, the results show that data classification based on the proposed learning algorithm and the SVM generally achieves the same level of accuracy. Tables 3 and 4 also show that the proposed learning algorithm and the SVM generally are able to deliver a higher level of accuracy than the KNN classifiers.

In the experiments that have been reported so far, no data reduction is performed for the proposed learning algorithm. As the proposed learning algorithm places one spherical Gaussian function (SGF) at each training sample, removal of redundant training samples means that the SGF network constructed will contain fewer nodes and will operate more efficiently. Table 5 presents the effect of applying a naïve data reduction algorithm. The naïve data reduction algorithm examines the training samples one by one in an arbitrary order. If the training sample being examined and all of its 10 nearest neighbors in the remaining training data set belong to the same class, then the training sample being examined is considered redundant and is deleted. As shown in Table 5, the naïve data reduction algorithm is able to reduce the number of training samples in the shuttle data set substantially, with less than 2% of the training samples left. On the other hand, the reduction rates for satimage and letter are not as substantial. It is apparent that the reduction rate is determined by the characteristics of the data set. Table 5 also reveals that applying the naïve data reduction mechanism leads to slightly lower classification accuracy.
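The naïve mechanism can be written in a few lines. The sketch below (ours) follows the description literally: a sample is deleted when all of its 10 nearest neighbors among the samples still kept carry its own class label; rebuilding the kd-tree at every step keeps the sketch simple but is not efficient.

```python
import numpy as np
from scipy.spatial import cKDTree

def naive_reduction(X, y, n_neighbors=10):
    """Drop samples whose n_neighbors nearest remaining neighbors all share their class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):                       # arbitrary (index) order
        others = np.flatnonzero(keep)
        others = others[others != i]
        if len(others) < n_neighbors:
            continue
        _, idx = cKDTree(X[others]).query(X[i], k=n_neighbors)
        if np.all(y[others[idx]] == y[i]):        # sample and all its neighbors agree
            keep[i] = False                       # considered redundant
    return X[keep], y[keep]
```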

Since the data reduction mechanism employed in this paper is a naïve one, there is room for improvement with respect to both reduction rate and impact on classification accuracy.

Table 6 compares the numbers of training samples remaining after data reduction is applied and the numbers of support vectors identified by the SVM software in various data sets. One interesting observation is that, for satimage and letter, the numbers of training samples remaining and the numbers of support vectors identified by the SVM software are almost equal. For shuttle, though the difference is larger, the two numbers are still in the same order. What the numbers in Table 6 suggest is that, regardless of the types of classification algorithms, the numbers of training samples that carry essential information for defining the classification boundaries are in the same order.

Data sets            Proposed algorithm                            SVM      1NN      3NN
1. iris (150)        97.33 (k1 = 24, k2 = 14, m̂ = 5, β = 0.7)      97.33    94.0     94.67
2. wine (178)        99.44 (k1 = 3, k2 = 16, m̂ = 1, β = 0.7)       99.44    96.08    94.97
3. vowel (528)       99.62 (k1 = 15, k2 = 1, m̂ = 1, β = 0.7)       99.05    99.43    97.16
4. segment (2310)    97.27 (k1 = 25, k2 = 1, m̂ = 1, β = 0.7)       97.40    96.84    95.98
Avg. 1-4             98.42                                         98.31    96.59    95.70
5. glass (214)       75.74 (k1 = 9, k2 = 3, m̂ = 2, β = 0.7)        71.50    69.65    72.45
6. vehicle (846)     73.53 (k1 = 13, k2 = 8, m̂ = 2, β = 0.7)       86.64    70.45    71.98
Avg. 1-6             90.49                                         91.89    87.74    87.87

Table 3. Comparison of classification accuracy of the 6 smaller data sets.

Data sets                    Proposed algorithm                            SVM      1NN      3NN
7. satimage (4435, 2000)     92.30 (k1 = 6, k2 = 26, m̂ = 1, β = 0.7)       91.30    89.35    90.6
8. letter (15000, 5000)      97.12 (k1 = 28, k2 = 28, m̂ = 2, β = 0.7)      97.98    95.26    95.46
9. shuttle (43500, 14500)    99.94 (k1 = 18, k2 = 1, m̂ = 3, β = 0.7)       99.92    99.91    99.92
Avg. 7-9                     96.45                                         96.40    94.84    95.33

Table 4. Comparison of classification accuracy of the 3 larger data sets.

                                                            satimage    letter    shuttle
# of training samples in the original data set              4435        15000     43500
# of training samples after data reduction is applied       1815        7794      627
% of training samples remaining                             40.92%      51.96%    1.44%
Classification accuracy after data reduction is applied     92.15       96.18     99.32

Table 5. Effects of applying a naïve data reduction mechanism.

            # of training samples after data reduction is applied    # of support vectors identified by LIBSVM
satimage    1815                                                     1689
letter      7794                                                     8931
shuttle     627                                                      287

Table 6. Comparison of the numbers of training samples remaining after data reduction is applied and the numbers of support vectors identified by the SVM software.

Table 7 compares the execution times of carrying out data classification based on the proposed learning algorithm and the SVM. In Table 7, Cross validation designates the time taken to conduct 10-fold cross validation to figure out optimal parameter values. For the SVM, we followed the practice suggested in [12], in which cross validation is conducted with 225 possible combinations of parameter values. For the proposed learning algorithm, cross validation was conducted over the following ranges for the three parameters listed in Table 2: k1: 1~30; k2: 1~30; m̂: 1~5.

In Table 7, Make classifier designates the time taken to construct a classifier based on the parameter values determined in the cross validation phase. For SVM, this is the time taken to identify support vectors. For the proposed learning algorithm, this is the time taken to construct the SGF networks. Test corresponds to executing the classification algorithm to label all the objects in the test data set.

                               Proposed algorithm          Proposed algorithm          SVM
                               without data reduction      with data reduction
Cross validation   satimage    670                         265                         64622
                   letter      2825                        1724                        386814
                   shuttle     96795                       59.9                        467825
Make classifier    satimage    5.91                        0.85                        21.66
                   letter      17.05                       6.48                        282.05
                   shuttle     1745                        0.69                        129.84
Test               satimage    21.3                        7.4                         11.53
                   letter      128.6                       51.74                       94.91
                   shuttle     996.1                       5.85                        2.13

Table 7. Comparison of execution times in seconds.

As Table 7 reveals, the time taken to conduct cross validation for the SVM classifier is substantially higher than the time taken to conduct cross validation for the proposed learning algorithm. On the other hand, for both SVM and the proposed learning algorithm, once the optimal parameter values are determined, the times taken to construct the classifiers accordingly are insignificant in comparison with the time taken in the cross validation phase. As far as the execution time of the data classification phase is concerned, Table 7 shows that the performance of the SVM and that of the SGF network with data reduction applied are comparable. The main problem of the SGF network based classifier is that, if the training data set contains a high percentage of redundant samples, e.g. the shuttle data set, then the execution time will be substantially longer than that taken by the SVM. The overall conclusion in this part is that, if the training data set does not contain a high percentage of redundant samples, then the SGF network based classifier takes the same order of execution time as the SVM and takes much less time for carrying out cross validation. On the other hand, if the training data set contains a high percentage of redundant samples, then data reduction must be applied or the execution time of the SGF network based classifier will suffer. As the incorporation of the naïve data reduction mechanism may lead to slightly lower classification accuracy, it is of interest to develop a novel data reduction mechanism that does not share the same problem as the naïve mechanism.

5. Conclusion

In this paper, a novel learning algorithm for constructing data classifiers with RBF networks is proposed. The main distinction of the proposed learning algorithm is the way it exploits local distributions of the training samples in determining the optimal parameter values of the basis functions. The experiments presented in this paper reveal that data classification with the proposed learning algorithm generally achieves the same level of accuracy as the support vector machines. One important advantage of the proposed data classification algorithm, in comparison with the support vector machine, is that the process conducted to construct an RBF network with the proposed learning algorithm normally takes much less time than the process conducted to construct an SVM. As far as the efficiency of the classifier is concerned, the naïve data reduction mechanism employed in this paper is able to reduce the size of the training data set substantially with a minimum impact on classification accuracy. One interesting observation in this regard is that, for all three data sets used in the data reduction experiments, the number of training samples remaining after data reduction is applied is quite close to the number of support vectors identified by the SVM software.

As the study presented in this paper looks quite promising, several issues deserve further study.

The first issue is the development of advanced data reduction mechanisms for the proposed learning algorithm. Advanced data reduction mechanisms should be able to deliver higher reduction rates than the naïve mechanism employed in this paper without sacrificing classification accuracy. The second issue is the extension of the proposed learning algorithm for handling categorical data sets. The third issue concerns why the proposed learning algorithm fails to deliver comparable accuracy in the vehicle test case, what the blind spot is, and how improvements can be made.


Appendix A.1

Assume that ŝ1, ŝ2, ..., ŝk1 are the k1 nearest training samples of si that belong to the same class as si. If k1 is sufficiently large and the distribution of these k1 samples in the vector space is uniform, then we have

$$k_1 \;\cong\; \rho \cdot \frac{\pi^{m/2}\, R(s_i)^m}{\Gamma(\frac{m}{2} + 1)},$$

where ρ is the local density of samples ŝ1, ŝ2, ..., ŝk1 in the proximity of si. Furthermore, we have

$$\sum_{h=1}^{k_1} \|\hat{s}_h - s_i\| \;\cong\; \rho \int_0^{R(s_i)} r \cdot \frac{2\pi^{m/2}}{\Gamma(\frac{m}{2})}\, r^{m-1}\, dr \;=\; \rho \cdot \frac{2\pi^{m/2}\, R(s_i)^{m+1}}{(m+1)\,\Gamma(\frac{m}{2})},$$

where 2π^{m/2} r^{m−1} / Γ(m/2) is the surface area of a hypersphere with radius r in an m-dimensional vector space. Therefore, we have

$$\frac{m+1}{m} \cdot \frac{1}{k_1} \sum_{h=1}^{k_1} \|\hat{s}_h - s_i\| \;=\; R(s_i).$$

The right-hand side of the equation above is then employed in this paper to estimate R(si).
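The relation above can also be checked empirically: for points drawn uniformly from an m-dimensional ball of radius R, the mean distance to the centre approaches mR/(m + 1), so scaling the sample mean by (m + 1)/m recovers R. The small simulation below is our own illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, R, k1 = 4, 2.0, 5000
radii = R * rng.random(k1) ** (1.0 / m)               # distances of points uniform in an m-ball of radius R
print((m + 1) / m * radii.mean(), "vs true R =", R)   # prints approximately 2.0
```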

Appendix A.2

Let

$$q(y) = \sum_{h=-\infty}^{\infty} \exp\left( -\frac{(y - h\delta)^2}{2\sigma^2} \right),$$

where δ ∈ R and σ ∈ R are two coefficients and y ∈ R. We have

$$q'(y) = \frac{dq(y)}{dy} = \sum_{h=-\infty}^{\infty} -\frac{y - h\delta}{\sigma^2} \exp\left( -\frac{(y - h\delta)^2}{2\sigma^2} \right).$$

Since q(y) is a symmetric and periodical function, if we want to find the global maximum and minimum values of q(y), we only need to analyze q(y) within the interval [0, δ/2]. Let y0 ∈ [0, δ/2) and y0 = jδ/(2n) + ε, where n ≥ 1 and 0 ≤ j ≤ n − 1 are integers, and 0 ≤ ε < δ/(2n). We have

$$q(y_0) = q\!\left( \frac{j\delta}{2n} \right) + \int_{j\delta/(2n)}^{j\delta/(2n) + \varepsilon} q'(t)\, dt.$$

Consider the special case σ = δ. The integral above can be bounded term by term: the contribution of each h to the integrand, namely −(1/σ²)(t − hδ)exp(−(t − hδ)²/(2σ²)), is an increasing function of t for t ∈ [(h − 1)δ, (h + 1)δ] and a decreasing function of t outside this interval. This yields explicit upper and lower bounds on q(y0) in terms of j, n, and ε. Letting n → ∞ and taking the maximum over j gives the upper bound in equation (5), while taking the minimum over j gives the lower bound in equation (6). If we set n = 100,000, then we have 2.5066282612 ≤ q(y) ≤ 2.5066357067, for y ∈ [0, δ/2).


7. REFERENCES

[1] E. Artin, The Gamma Function, New York, Holt, Rinehart, and Winston, 1964.

[2] F. Belloir, A. Fache, and A. Billat, "A general approach to construct RBF net-based classifier," Proceedings of the 7th European Symposium on Artificial Neural Networks, pp. 399-404, 1999.

[3] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509-517, 1975.

[4] M. Bianchini, P. Frasconi, and M. Gori, "Learning without local minima in radial basis function networks," IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 749-756, 1995.

[5] C. M. Bishop, "Improving the generalization properties of radial basis function neural networks," Neural Computation, vol. 3, no. 4, pp. 579-588, 1991.

[6] C. L. Blake and C. J. Merz, "UCI repository of machine learning databases," Technical report, University of California, Department of Information and Computer Science, Irvine, CA, 1998.

[7] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.

[9] S. Dumais, J. Platt, and D. Heckerman, "Inductive learning algorithms and representations for text categorization," Proceedings of the International Conference on Information and Knowledge Management, pp. 148-154, 1998.

[10] G. W. Flake, "Square unit augmented, radially extended, multilayer perceptrons," Neural Networks: Tricks of the Trade, G. B. Orr and K. Müller, Eds., pp. 145-163, 1998.

[11] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, 1994.

[12] C. W. Hsu and C. J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Transactions on Neural Networks, vol. 13, pp. 415-425, 2002.

[13] Y. S. Hwang and S. Y. Bang, "An efficient method to construct a radial basis function neural network classifier," Neural Networks, vol. 10, no. 8, pp. 1495-1503, 1997.

[14] T. Joachims, "Text categorization with support vector machines: learning with many relevant features," Proceedings of European Conference on Machine Learning, pp. 137-142, 1998.

[15] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, Massachusetts, London, England, 2001.

[16] M. Bianchini, P. Frasconi, and M. Gori, "Learning without local minima in radial basis function networks," IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 749-756, 1995.

[17] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units,"

Neural Computation, vol. 1, pp. 281-294, 1989.

[18] M. J. L. Orr, "Introduction to radial basis function networks," Technical report, Center for Cognitive Science, University of Edinburgh, 1996.

[19] M. J. Orr, J. Hallam, A. Murray, and T. Leonard, "Assessing RBF networks using DELVE," International Journal of Neural Systems, vol. 10, no. 5, pp. 397-415, 2000.

[20] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, 1991.

[21] B. Scholkopf, K. K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 1-8, 1997.
