

CHAPTER 1 INTRODUCTION

1.3 Research Objectives and Organization of this Dissertation

In this dissertation, novel fuzzy neural networks (FNNs) combined with a support vector learning mechanism, called support-vector-based fuzzy neural networks (SVFNNs), are proposed for pattern classification and function approximation. The SVFNNs combine the capability of support vector learning to minimize the empirical risk (training error) and the expected risk (testing error) in high-dimensional data spaces with the efficient human-like reasoning of the FNN in handling uncertainty information. There has been some research on combining SVMs with FNNs [19]-[22]. In [19], a self-organizing map with fuzzy class memberships was used to reduce the training samples and thereby speed up SVM training. The objective of [20]-[22] was to improve the accuracy of SVMs on multi-class pattern recognition problems.

The overall objective of this dissertation is to develop a theoretical foundation for the FNN using the SVM method. We exploit the knowledge representation power and learning ability of the FNN to determine the kernel functions of the SVM adaptively, and propose a novel adaptive fuzzy kernel function, which is proven to be a Mercer kernel. The SVFNNs not only maintain the classification accuracy well, but also reduce the number of support vectors compared with the regular SVM.

The organization and objectives of the dissertation are as follows.

In chapter 2, a novel adaptive fuzzy kernel is proposed for combining the FNN with the SVM. We exploit the knowledge representation power and learning ability of the FNN to determine the kernel functions of the SVM adaptively and develop a novel adaptive fuzzy kernel function. The objective of this chapter is to prove that the adaptive fuzzy kernel conforms to Mercer's theorem.

In chapter 3, a support-vector-based fuzzy neural network (SVFNN) is proposed. This network is developed for solving pattern recognition problems. In contrast to conventional neural fuzzy network approaches, the objective of this chapter is to construct a learning algorithm for the proposed SVFNN that simultaneously minimizes the empirical risk and the expected risk, so that the SVFNN attains good generalization ability and classification performance.

In chapter 4, a support-vector-based fuzzy neural network for function approximation is proposed. This network is developed for solving function approximation problems. The objective of this chapter is to integrate the statistical support vector learning method into the FNN and to endow the proposed SVFNN with good robustness against noise.

The applications and simulation results of the SVFNNs are presented at the ends of Chapters 3 and 4, respectively. Finally, conclusions are drawn in Chapter 5.

CHAPTER 2

SUPPORT-VECTOR BASED FUZZY NEURAL NETWORK AND THE ADAPTIVE FUZZY KERNEL

In this chapter, an adaptive fuzzy kernel is proposed for applying the SVM technique to obtain the optimal parameters of the FNN. The adaptive fuzzy kernel provides the SVM with adaptive local representation power, and thus brings the advantages of the FNN (such as adaptive learning and an economic network structure) into the SVM directly. On the other hand, the SVM provides the FNN with the advantage of global optimization and the ability to minimize the expected risk, whereas the FNN originally works on the principle of minimizing only the training error.

2.1 Structure of the FNN

A four-layered fuzzy neural network (FNN) is shown in Fig. 2.1, which comprises the input, membership function, rule, and output layers. Layer 1 accepts the input variables; its nodes represent input linguistic variables. Layer 2 calculates the membership values; its nodes represent the terms of the respective linguistic variables. Nodes at Layer 3 represent fuzzy rules. The links before Layer 3 represent the preconditions of fuzzy rules, and the links after Layer 3 represent the consequences of fuzzy rules. Layer 4 is the output layer. This four-layered network realizes fuzzy rules of the following form:

Rule $R_j$: IF $x_1$ is $A_{1j}$ and $\dots$ $x_i$ is $A_{ij}$ $\dots$ and $x_M$ is $A_{Mj}$, THEN $y$ is $d_j$, $j = 1, 2, \dots, N$,   (2.1)

where $A_{ij}$ are the fuzzy sets of the input variables $x_i$, $i = 1, 2, \dots, M$, and $d_j$ are the consequent parameters of $y$. For ease of analysis, a fuzzy rule 0 is added as:

Rule 0: IF $x_1$ is $A_{10}$ and $\dots$ and $x_M$ is $A_{M0}$, THEN $y$ is $d_0$,   (2.2)

where $A_{k0}$ is a universal fuzzy set whose fuzzy degree is 1 for any input value $x_i$, $i = 1, 2, \dots, M$, and $d_0$ is the consequent parameter of $y$ in fuzzy rule 0. Define $O^{(P)}$ and $a^{(P)}$ as the output and input variables of a node in layer $P$, respectively. The signal propagation and the basic functions in each layer are described as follows.

Layer 1 - Input layer: No computation is done in this layer. Each node in this layer, which corresponds to one input variable, only transmits input values to the next layer directly. That is,

$O_i^{(1)} = a_i^{(1)} = x_i$.   (2.3)

Fig. 2.1 The structure of the four-layered fuzzy neural network.

Layer 2 – Membership function layer: Each node in this layer is a membership function that corresponds to one linguistic label (e.g., fast, slow, etc.) of one of the input variables in Layer 1. In other words, the membership value, which specifies the degree to which an input value belongs to a fuzzy set, is calculated in Layer 2:

$O_{ij}^{(2)} = u_{ij}\!\left(a_i^{(2)}\right)$,   (2.4)

where $u_{ij}$ is a membership function, $i = 1, 2, \dots, M$, $j = 1, 2, \dots, N$. With the use of the Gaussian membership function, the operation performed in this layer is

$O_{ij}^{(2)} = \exp\!\left(-\frac{\left(a_i^{(2)} - m_{ij}\right)^2}{\sigma_{ij}^2}\right)$,   (2.5)

where mij and σij are, respectively, the center (or mean) and the width (or variance) of the Gaussian membership function of the j-th term of the i-th input variable xi.

Layer 3 – Rule layer: A node in this layer represents one fuzzy logic rule and performs precondition matching of a rule. Here we use the AND operation, applied to the Layer 2 outputs associated with the FNN input vector:

$O_j^{(3)} = \prod_{i=1}^{M} u_{ij}(x_i) = \exp\!\left(-\sum_{i=1}^{M} \frac{\left(x_i - m_{ij}\right)^2}{\sigma_{ij}^2}\right)$.   (2.6)

The output of a Layer-3 node represents the firing strength of the corresponding fuzzy rule.

Layer 4 – Output layer: The single node $O^{(4)}$ in this layer is labeled with Σ; it computes the overall output as the summation of all input signals:

$O^{(4)} = a^{(4)} = \sum_{j=1}^{N} d_j\, a_j^{(4)} + d_0$,   (2.7)

where the connecting weight $d_j$ is the output action strength of the Layer 4 output associated with the $j$-th Layer 3 rule, and the scalar $d_0$ is a bias. Thus the fuzzy neural network mapping can be rewritten in the following input-output form:

$y(\mathbf{x}) = O^{(4)} = \sum_{j=1}^{N} d_j \exp\!\left(-\sum_{i=1}^{M} \frac{\left(x_i - m_{ij}\right)^2}{\sigma_{ij}^2}\right) + d_0$.   (2.8)
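To make the signal propagation of (2.4)-(2.8) concrete, the following minimal NumPy sketch implements the forward pass of the four-layered FNN; the array shapes and variable names (centers, widths, weights) are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def fnn_forward(x, m, sigma, d, d0):
    """Forward pass of the four-layered FNN.

    x     : (M,)   input vector (Layer 1 transmits it unchanged, Eq. 2.3)
    m     : (M, N) Gaussian centers m_ij (Layer 2)
    sigma : (M, N) Gaussian widths sigma_ij (Layer 2)
    d     : (N,)   consequent weights d_j (Layer 4)
    d0    : scalar bias from fuzzy rule 0
    """
    # Layer 2: membership value of each input in each fuzzy set, Eq. (2.5)
    u = np.exp(-((x[:, None] - m) ** 2) / sigma ** 2)   # shape (M, N)
    # Layer 3: rule firing strengths via product (AND) over inputs, Eq. (2.6)
    firing = np.prod(u, axis=0)                          # shape (N,)
    # Layer 4: weighted sum of firing strengths plus bias, Eqs. (2.7)-(2.8)
    return d @ firing + d0

# Usage example: 2 inputs, 3 rules, random illustrative parameters
rng = np.random.default_rng(0)
y = fnn_forward(rng.normal(size=2), rng.normal(size=(2, 3)),
                np.ones((2, 3)), rng.normal(size=3), 0.1)
```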

2.2 Fuzzy Clustering and Input/Output Space Partitioning

To construct the initial fuzzy rules of the FNN, a fuzzy clustering method is used to partition a set of data into a number of overlapping clusters, based on the distance in a metric space between the data points and the cluster prototypes.


Fig. 2.2 The aligned clustering-based partition method, which yields both fewer clusters and fewer membership functions.

Each cluster in the product space of the input-output data represents a rule in the rule base. The goal is to establish the fuzzy preconditions in the rules. The membership functions in Layer 2 of FNN can be obtained by projections onto the various input variables xi spanning the cluster space. In this work, we use an aligned clustering-based approach proposed in [23]. This method produces a partition result as shown in Fig. 2.2.

The input space partitioning is also the first step in constructing the fuzzy kernel function in the SVFNNs. The partitioning has a two-fold objective:

• It should give us a minimum yet sufficient number of clusters or fuzzy rules.

• It must be in keeping with the spirit of the SVM-based classification scheme.

To satisfy the aforementioned conditions, we use a clustering method that takes care of both the input and output values of a data set; that is, the clustering is done based on the premise that the points lying in a cluster also belong to the same class or have an identical value of the output variable. The class information of the input data is used only in the training stage to generate the clustering-based fuzzy rules; in the testing stage, the input data excite the fuzzy rules directly without using class information. In addition, we also allow the existence of overlapping clusters, with no bound on the extent of overlap, if two clusters contain points belonging to the same class. We may then have a clustering like the one shown in Fig. 2.3. Thus a point may be geometrically closer to the center of one cluster, but it can belong only to the nearest cluster that has points belonging to the same class as that point.

Fig. 2.3 The clustering arrangement allowing overlap and selecting the member points according to the labels (or classes) attached to them.
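As a minimal sketch of the class-aware assignment rule described above (a point joins the nearest cluster among those carrying its own class label), consider the following fragment; the cluster bookkeeping and variable names are simplifying assumptions for illustration only.

```python
import numpy as np

def assign_cluster(x, label, centers, cluster_labels):
    """Assign x to the nearest cluster whose members share its class label.

    centers        : (c, M) array of cluster centers
    cluster_labels : length-c list of class labels, one per cluster
    Returns the index of the chosen cluster, or None if no cluster
    of the same class exists yet (a new rule must then be created).
    """
    candidates = [j for j, cl in enumerate(cluster_labels) if cl == label]
    if not candidates:
        return None
    # Only clusters of the same class compete; geometric proximity alone
    # is not enough, matching the arrangement of Fig. 2.3.
    dists = [np.linalg.norm(x - centers[j]) for j in candidates]
    return candidates[int(np.argmin(dists))]
```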

2.3 Fuzzy Rule Generation

A rule corresponds to a cluster in the input space, with $\mathbf{m}_j$ and $\mathbf{D}_j$ representing the center and variance of that cluster. For each incoming pattern $\mathbf{x}$, the strength with which a rule is fired can be interpreted as the degree to which the incoming pattern belongs to the corresponding cluster. It is generally obtained by calculating the degree of membership of the incoming pattern in the cluster [24]. For computational efficiency, we can use the firing strength derived in (2.6) directly as this degree measure:

$F^{\,j}(\mathbf{x}) = \exp\!\left(-\left[\mathbf{D}_j\left(\mathbf{x} - \mathbf{m}_j\right)\right]^{T}\left[\mathbf{D}_j\left(\mathbf{x} - \mathbf{m}_j\right)\right]\right)$,   (2.9)

which decreases with the distance between $\mathbf{x}$ and the center of cluster $j$. Using this measure, we can obtain the following criterion for the generation of a new fuzzy rule. Let $\mathbf{x}$ be the newly incoming pattern. Find

$J = \arg\max_{1 \le j \le c(t)} F^{\,j}(\mathbf{x})$,   (2.10)

where $c(t)$ is the number of existing rules at time $t$. If $F^{\,J} \le \overline{F}(t)$, then a new rule is generated, where $\overline{F}(t) \in (0, 1)$ is a prespecified threshold that decays during the learning process. Once a new rule is generated, the next step is to assign initial centers and widths to the corresponding membership functions. Since our goal is to minimize an objective function, and the centers and widths are all adjustable in the later learning phases, it makes little sense to spend much time on the assignment of centers and widths in search of a perfect cluster. Hence we can simply set

$\mathbf{m}_{[c(t)+1]} = \mathbf{x}$,   (2.11)

$\mathbf{D}_{[c(t)+1]} = \frac{1}{\chi}\,\mathrm{diag}\!\left(\frac{1}{\left|x_1 - m_{J1}\right|}, \dots, \frac{1}{\left|x_M - m_{JM}\right|}\right)$,   (2.12)

according to the first-nearest-neighbor heuristic [25], where $\chi \ge 0$ decides the degree of overlap between two clusters. Similar methods are used in [26], [27] for the allocation of a new radial basis unit. However, in [26] the degree measure does not take the width $\mathbf{D}_j$ into consideration. In [27], the width of each unit is kept at a prespecified constant value, so the allocation result is, in fact, the same as that in [26]. In this dissertation, the width is taken into account in the degree measure, so for a cluster with a larger width (meaning a larger region is covered), fewer rules will be generated in its vicinity than for a cluster with a smaller width. This is a more reasonable result.

Another disadvantage of [26] is that another degree measure (the Euclidean distance) is required, which increases the computation load.
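The following compact sketch illustrates this rule-generation criterion, using the firing-strength measure of (2.9) and a caller-supplied (decaying) threshold; the width initialization mirrors the first-nearest-neighbor heuristic with overlap parameter chi as reconstructed in (2.11)-(2.12), and the unit-width fallback for the very first rule is an added assumption.

```python
import numpy as np

def firing_strength(x, m, D):
    """Degree measure of Eq. (2.9): F_j(x) = exp(-||D_j (x - m_j)||^2)."""
    z = D @ (x - m)
    return np.exp(-z @ z)

def maybe_add_rule(x, centers, widths, thresh, chi=2.0):
    """Generate a new rule when the best firing strength falls below thresh.

    centers : list of (M,) rule centers m_j
    widths  : list of (M, M) diagonal width matrices D_j
    Returns the index of the rule covering x (existing or newly created).
    """
    J = None
    if centers:
        F = [firing_strength(x, m, D) for m, D in zip(centers, widths)]
        J = int(np.argmax(F))
        if F[J] > thresh:
            return J                      # x is covered by existing rule J
    # New rule: center at x (2.11); width from gap to best rule (2.12)
    centers.append(x.copy())
    if J is None:
        D_new = np.eye(len(x))            # first rule: unit widths (assumption)
    else:
        gap = np.maximum(np.abs(x - centers[J]), 1e-6)
        D_new = np.diag(1.0 / (chi * gap))
    widths.append(D_new)
    return len(centers) - 1
```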

After a rule is generated, the next step is to decompose the multidimensional membership function formed in (2.11) and (2.12) into the corresponding 1-D membership functions for each input variable. To reduce the number of fuzzy sets for each input variable and to avoid the existence of highly similar ones, we should check the similarities between the newly projected membership function and the existing ones in each input dimension. Before going into the details of how this overall process works, let us consider the similarity measure first. Since Gaussian membership functions are used in the SVFNNs, we use the formula for the similarity measure of two fuzzy sets with Gaussian membership functions derived previously in [28].

Suppose the fuzzy sets to be measured are fuzzy sets $A$ and $B$ with membership functions $\mu_A(x) = \exp\!\left\{-(x - c_1)^2/\sigma_1^2\right\}$ and $\mu_B(x) = \exp\!\left\{-(x - c_2)^2/\sigma_2^2\right\}$, respectively. The size or cardinality of fuzzy set $A$, $M(A)$, equals the sum of the support values of $A$:

$M(A) = \sum_{x \in U} \mu_A(x)$.

Since the area under a Gaussian membership function with width $\sigma$ is $\sigma\sqrt{\pi}$ [29] and its height is always 1, it can be approximated by an isosceles triangle with unity height and a bottom edge of length $2\sigma\sqrt{\pi}$. We can then compute the fuzzy similarity measure of two fuzzy sets with this kind of membership function.

Assuming $c_1 \ge c_2$ as in [28], we can compute $M(A \cap B)$ accordingly. Using this similarity measure, we can check whether two projected membership functions are close enough to be merged into one single membership function $\mu_C(x) = \exp\!\left\{-(x - c_3)^2/\sigma_3^2\right\}$. The mean and variance of the merged membership function can be calculated by

$c_3 = \frac{c_1 + c_2 + \sqrt{\pi}\left(\sigma_1 - \sigma_2\right)}{2}, \qquad \sigma_3 = \frac{\sigma_1 + \sigma_2}{2} + \frac{c_1 - c_2}{2\sqrt{\pi}}$.
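The following sketch checks the closeness of two Gaussian fuzzy sets via the isosceles-triangle approximation (base $2\sigma\sqrt{\pi}$, unity height) and merges them when an overlap ratio exceeds a user-chosen threshold. The overlap ratio and the merge formulas below simply fit a new triangle over the union of the two bases, matching the reconstruction above; they should be treated as an illustrative assumption, not the exact derivation of [28].

```python
import numpy as np

SQRT_PI = np.sqrt(np.pi)

def triangle_overlap(c1, s1, c2, s2):
    """Rough overlap of two Gaussian fuzzy sets, each approximated by a
    unity-height triangle with base 2*sigma*sqrt(pi): shared base length
    divided by the smaller base (1.0 for identical sets)."""
    left = max(c1 - s1 * SQRT_PI, c2 - s2 * SQRT_PI)
    right = min(c1 + s1 * SQRT_PI, c2 + s2 * SQRT_PI)
    if right <= left:
        return 0.0
    return (right - left) / (2 * SQRT_PI * min(s1, s2))

def merge_if_similar(c1, s1, c2, s2, threshold=0.8):
    """Merge two Gaussian membership functions when highly similar;
    returns (c3, s3) or None. Fits one triangle over the two bases."""
    if triangle_overlap(c1, s1, c2, s2) < threshold:
        return None
    if c1 < c2:                       # enforce c1 >= c2 as assumed in the text
        c1, s1, c2, s2 = c2, s2, c1, s1
    c3 = (c1 + c2 + SQRT_PI * (s1 - s2)) / 2        # new center, Eq. (2.15)
    s3 = (s1 + s2) / 2 + (c1 - c2) / (2 * SQRT_PI)  # new width,  Eq. (2.16)
    return c3, s3
```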

2.4 Adaptive Fuzzy Kernel for the SVM/SVR

The proposed fuzzy kernel $K(\hat{\mathbf{x}}, \hat{\mathbf{z}})$ in this dissertation is defined as

$K(\hat{\mathbf{x}}, \hat{\mathbf{z}}) = \begin{cases} \prod_{i} u_i(\hat{\mathbf{x}})\, u_i(\hat{\mathbf{z}}), & \text{if } \hat{\mathbf{x}} \text{ and } \hat{\mathbf{z}} \text{ are in the same cluster}, \\ 0, & \text{otherwise}, \end{cases}$   (2.17)

where $\hat{\mathbf{x}}$ and $\hat{\mathbf{z}}$ are training samples and the $u_i$ are the projected membership functions of that cluster. Assume the training samples are partitioned into $l$ clusters through the fuzzy clustering of Section 2.2. We can perform the following permutation of the training samples, grouping the samples of each cluster together:

cluster 1: $\left\{\hat{\mathbf{x}}_1^{(1)}, \dots, \hat{\mathbf{x}}_{v_1}^{(1)}\right\}$, …, cluster $l$: $\left\{\hat{\mathbf{x}}_1^{(l)}, \dots, \hat{\mathbf{x}}_{v_l}^{(l)}\right\}$,   (2.18)

so that $v_1 + v_2 + \dots + v_l = v$. Then the fuzzy kernel can be calculated by using the training set in (2.18), and the obtained kernel matrix $\mathbf{K}$ can be rewritten in the following block diagonal form:

$\mathbf{K} = \begin{bmatrix} \mathbf{K}_1 & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{K}_l \end{bmatrix}$,   (2.19)

where $\mathbf{K}_r$, $r = 1, \dots, l$, is the Gram matrix formed by the samples of the $r$-th cluster.
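A minimal sketch of how the block-diagonal fuzzy kernel matrix of (2.19) can be assembled follows, assuming the samples have already been permuted by cluster as in (2.18); the per-cluster membership parameters and their shapes are illustrative assumptions.

```python
import numpy as np

def fuzzy_kernel(x, z, same_cluster, m, sigma):
    """Fuzzy kernel of Eq. (2.17): product of memberships if x and z share
    a cluster, zero otherwise. m and sigma are that cluster's per-dimension
    Gaussian parameters (illustrative shapes: both (M,))."""
    if not same_cluster:
        return 0.0
    u = lambda p: np.exp(-((p - m) ** 2) / sigma ** 2)
    return float(np.prod(u(x)) * np.prod(u(z)))

def kernel_matrix(samples, cluster_ids, params):
    """Assemble K for cluster-sorted samples; K is block diagonal, Eq. (2.19).

    samples     : list of (M,) training vectors, sorted by cluster
    cluster_ids : cluster index of each sample
    params      : mapping cluster index -> (m, sigma)
    """
    v = len(samples)
    K = np.zeros((v, v))
    for i in range(v):
        for j in range(v):
            same = cluster_ids[i] == cluster_ids[j]
            m, sigma = params[cluster_ids[i]]
            K[i, j] = fuzzy_kernel(samples[i], samples[j], same, m, sigma)
    return K
```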

In order for the fuzzy kernel function defined by (2.17) to be suitable for application in an SVM, we must prove that the fuzzy kernel function is symmetric and produces positive-definite Gram matrices [30]. To prove this, we first quote the following theorems.

Theorem 1 (Mercer theorem [30]): Let $X$ be a compact subset of $\mathbf{R}^n$. Suppose $K$ is a continuous symmetric function such that the integral operator $T_K : L_2(X) \to L_2(X)$,

$\left(T_K f\right)(\cdot) = \int_X K(\cdot, \mathbf{x})\, f(\mathbf{x})\, d\mathbf{x}$,   (2.21)

is positive; that is,

$\int_{X \times X} K(\mathbf{x}, \mathbf{z})\, f(\mathbf{x})\, f(\mathbf{z})\, d\mathbf{x}\, d\mathbf{z} \ge 0, \quad \forall f \in L_2(X)$.   (2.22)

Then we can expand $K(\mathbf{x}, \mathbf{z})$ in a uniformly convergent series in terms of the eigenfunctions $\phi_j \in L_2(X)$ of $T_K$, normalized so that $\left\|\phi_j\right\|_{L_2} = 1$, and the associated positive eigenvalues $\lambda_j \ge 0$:

$K(\mathbf{x}, \mathbf{z}) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(\mathbf{x})\, \phi_j(\mathbf{z})$.   (2.23)

A kernel satisfying the conditions of this theorem is referred to as a Mercer kernel.

Proposition 1 [31]: A function $K(\mathbf{x}, \mathbf{z})$ is a valid kernel if and only if, for any finite set of points, it produces symmetric and positive-definite Gram matrices.

Proposition 2 [32]: Let $K_1$ and $K_2$ be kernels over $X \times X$, $X \subseteq \mathbf{R}^n$, and let $a \in \mathbf{R}^{+}$. Then the following functions are also kernels: $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z}) + K_2(\mathbf{x}, \mathbf{z})$, $K(\mathbf{x}, \mathbf{z}) = a K_1(\mathbf{x}, \mathbf{z})$, and $K(\mathbf{x}, \mathbf{z}) = K_1(\mathbf{x}, \mathbf{z})\, K_2(\mathbf{x}, \mathbf{z})$. Moreover, a function $f : \mathbf{R} \to \mathbf{R}$ defines a kernel $K(\mathbf{x}, \mathbf{z}) = f(\mathbf{x} \cdot \mathbf{z})$ if the matrix $\left[f(\mathbf{x}_i \cdot \mathbf{x}_j)\right] \in \mathbf{R}^{n \times n}$ is positive semidefinite for all choices of points $\{\mathbf{x}_1, \dots, \mathbf{x}_n\} \subset \mathbf{R}$ and all $n = 1, 2, \dots$.

Proposition 3 [33]: A block diagonal matrix whose diagonal blocks are positive-definite matrices is itself a positive-definite matrix.

Theorem 2: For the fuzzy kernel defined by (2.17), if the membership functions $u_i(x) : \mathbf{R} \to [0, 1]$, $i = 1, 2, \dots, n$, are positive-definite functions, then the fuzzy kernel is a Mercer kernel.

Proof:

First, we prove that the formed kernel matrix $\mathbf{K} = \left(K\left(\mathbf{x}_i, \mathbf{x}_j\right)\right)_{i,j=1}^{n}$ is a symmetric matrix. According to the definition of the fuzzy kernel in (2.17), if $\mathbf{x}_i$ and $\mathbf{x}_j$ are in the same cluster, then

$K(\mathbf{x}_i, \mathbf{x}_j) = \prod_k u_k(\mathbf{x}_i)\, u_k(\mathbf{x}_j) = \prod_k u_k(\mathbf{x}_j)\, u_k(\mathbf{x}_i) = K(\mathbf{x}_j, \mathbf{x}_i)$;

otherwise $K(\mathbf{x}_i, \mathbf{x}_j) = 0 = K(\mathbf{x}_j, \mathbf{x}_i)$. So the kernel matrix is indeed symmetric. Next, by the closure properties of Proposition 2, the product of two positive-definite functions is also a kernel function, so each diagonal block $\mathbf{K}_r$ of the kernel matrix in (2.19) is a positive-definite Gram matrix.

According to Proposition 3, a block diagonal matrix with positive-definite diagonal blocks is also a positive-definite matrix. So the fuzzy kernel defined by (2.17) is a Mercer kernel. This completes the proof.
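As a numerical companion to the proof, one can sanity-check on data that an assembled fuzzy kernel matrix is symmetric and positive semidefinite (all eigenvalues nonnegative up to round-off); this is only an empirical check, not a substitute for the proof above.

```python
import numpy as np

def check_mercer(K, tol=1e-10):
    """Empirically verify the two Gram-matrix properties used in Theorem 2."""
    symmetric = np.allclose(K, K.T)
    eigvals = np.linalg.eigvalsh(K)   # eigvalsh is appropriate for symmetric K
    psd = bool(eigvals.min() >= -tol)
    return symmetric, psd
```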

Since the proposed fuzzy kernel has been proven to be a Mercer kernel, we can apply the SVM technique to obtain the optimal parameters of the SVFNNs. It is noted that the proposed SVFNN is not a pure SVM, so it does not minimize the empirical risk and the expected risk exactly as SVMs do. However, it can achieve good classification performance with a drastically reduced number of fuzzy kernel functions.

CHAPTER 3

SUPPORT-VECTOR BASED FUZZY NEURAL NETWORK FOR PATTERN CLASSIFICATION

In this chapter, we develop a support-vector-based fuzzy neural network (SVFNN) for pattern classification, which realizes a new idea for the adaptive kernel functions used in the SVM. The use of the proposed fuzzy kernels provides the SVM with adaptive local representation power, and thus brings the advantages of the FNN (such as adaptive learning and an economic network structure) into the SVM directly. On the other hand, the SVM provides the FNN with the advantage of global optimization and the ability to minimize the expected risk, whereas the FNN originally works on the principle of minimizing only the training error. The proposed learning algorithm of the SVFNN consists of three phases. In the first phase, the initial fuzzy rules (clusters) and membership functions of the network structure are automatically established by the fuzzy clustering method. The input space partitioning determines the initial fuzzy rules, which are used to determine the fuzzy kernels. In the second phase, the means of the membership functions and the connecting weights between layer 3 and layer 4 of the SVFNN (see Fig. 2.1) are optimized using the result of SVM learning with the fuzzy kernels. In the third phase, unnecessary fuzzy rules are recognized and eliminated, and the relevant fuzzy rules are retained. Experimental results on five datasets (Iris, Vehicle, Dna, Satimage, Ijcnn1) from the UCI Repository, the Statlog collection, and the IJCNN 2001 challenge show that the proposed SVFNN classification method can automatically generate fuzzy rules, improve the accuracy of classification, reduce the number of required kernel functions, and increase the speed of classification.

3.1 Maximum Margin Algorithm

An SVM constructs a binary classifier from a set of labeled patterns called training examples. Let the training set be $S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_v, y_v)\}$ with explanatory variables $\mathbf{x}_i \in \mathbf{R}^d$ and corresponding binary class labels $y_i \in \{-1, +1\}$, for all $i = 1, \dots, v$, where $v$ denotes the number of data points and $d$ denotes the dimension of the dataset. The SVM generates a maximal margin linear decision rule of the form

$D(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}^T\mathbf{x} + b\right)$,   (3.1)

where $\mathbf{w}$ is the weight vector and $b$ is a bias. The margin $M$ can be calculated as $M = 2/\|\mathbf{w}\|$, as shown in Fig. 3.1. To obtain the largest margin, the weight vector $\mathbf{w}$ must be calculated by

$\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i\left(\mathbf{x}_i^T\mathbf{w} + b\right) - 1 \ge 0, \ \forall i = 1, \dots, v$.   (3.2)

This optimization problem can be converted to a quadratic programming problem, which can be formulated as follows:

$\max_{\boldsymbol{\alpha}} \ L(\boldsymbol{\alpha}) = \sum_{i=1}^{v} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{v} \alpha_i\, \alpha_j\, y_i\, y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{v} y_i \alpha_i = 0, \ \alpha_i \ge 0$,   (3.3)

where $\alpha_i$ denotes a Lagrange multiplier.


Fig. 3.1 Optimal canonical separating hyperplane with the largest margin between the two classes.

In practical applications the data are non-ideal: they contain noise and overlap. The slack variables $\xi_i$, which allow training patterns to be misclassified in the case of linearly non-separable problems, and the regularization parameter $C$, which sets the penalty applied to margin errors and controls the trade-off between the width of the margin and the training set error, are added to the SVM. The equation is altered as follows:

$\min_{\mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{2}\sum_{i=1}^{v} \xi_i^2 \quad \text{s.t.} \quad y_i\left(\mathbf{x}_i^T\mathbf{w} + b\right) \ge 1 - \xi_i, \ \forall i = 1, \dots, v$.   (3.4)

To construct a non-linear decision rule, the kernel method is employed, mapping an input vector $\mathbf{x} \in \mathbf{R}^d$ into a vector $\phi(\mathbf{x})$ of a higher-dimensional feature space $F$, where $\phi$ represents a mapping $\mathbf{R}^d \to \mathbf{R}^q$. The maximal margin linear classifier can then solve both linear and non-linear classification problems in this feature space. Fig. 3.2 shows how the training data are mapped into a higher-dimensional feature space.

However, the input vectors cannot easily be mapped to the feature space explicitly. If the dimension of the transformed training vectors is very large, then the computation of the dot products is prohibitively expensive. Moreover, the transformed function $\phi(\mathbf{x}_i)$ is not known a priori. Mercer's theorem provides a solution to these problems: the dot product $\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$ can be calculated directly by a positive-definite symmetric kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$ that complies with Mercer's theorem.

Popular choices for the kernel function include the Gaussian kernel:

$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2}{2\sigma^2}\right)$.   (3.5)
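For concreteness, the Gaussian kernel of (3.5) can be evaluated directly from the input vectors without ever forming $\phi(\mathbf{x})$, as the following sketch shows; the width parameter sigma is an illustrative assumption.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))
```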

To obtain an optimal hyperplane for any linear or nonlinear space, Eq. (3.4) can be rewritten as the following dual quadratic optimization:

$\max_{\boldsymbol{\alpha}} \ L(\boldsymbol{\alpha}) = \sum_{i=1}^{v} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{v} \alpha_i\, \alpha_j\, y_i\, y_j\, K\left(\mathbf{x}_i, \mathbf{x}_j\right) \quad \text{s.t.} \quad \sum_{i=1}^{v} y_i \alpha_i = 0, \ 0 \le \alpha_i \le C$.   (3.6)

Fig. 3.2 Mapping the training data nonlinearly into a higher-dimensional feature space $\Phi : \mathbf{x} \to \phi(\mathbf{x})$, where $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$.

The dual Lagrangian $L(\boldsymbol{\alpha})$ must be maximized with respect to $\alpha_i \ge 0$. The training patterns with nonzero Lagrange multipliers are called support vectors. The separating function is given as follows:

$f(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=1}^{N_{sv}} \alpha_i^0\, y_i\, K\left(\mathbf{x}_i, \mathbf{x}\right) + b_0\right)$,   (3.7)

where $N_{sv}$ denotes the number of support vectors, $\mathbf{x}_i$ denotes a support vector, $\alpha_i^0$ denotes the corresponding Lagrange coefficient, and $b_0$ denotes the constant given by

$b_0 = -\frac{1}{2}\left[\mathbf{w}^T\mathbf{x}^{*}(1) + \mathbf{w}^T\mathbf{x}^{*}(-1)\right]$,   (3.8)

where $\mathbf{x}^{*}(1)$ denotes some support vector belonging to the first class with $0 \le \alpha_i \le C$, and $\mathbf{x}^{*}(-1)$ denotes some support vector belonging to the second class with $0 \le \alpha_i \le C$. In the next section, we propose the learning algorithm of the SVFNN, which combines the capability of support vector learning to minimize the empirical risk (training error) and the expected risk (testing error) in high-dimensional data spaces with the efficient human-like reasoning of the FNN in handling uncertainty information.
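To illustrate how a custom (e.g., fuzzy) kernel plugs into standard SVM training of the form (3.6)-(3.7), the following sketch uses scikit-learn's precomputed-kernel interface; the dissertation's own implementation is not specified, so this is only one plausible realization, and the value of $C$ is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_with_kernel(K_train, y_train, C=10.0):
    """Train an SVM from a precomputed Gram matrix K_train of shape (v, v),
    e.g., the block-diagonal fuzzy kernel matrix of Eq. (2.19)."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

def predict_with_kernel(clf, K_test):
    """K_test has shape (n_test, v): kernel values of the test points
    against all v training samples, in the same order used for training."""
    return clf.predict(K_test)
```

The support vectors and their coefficients (the nonzero $\alpha_i^0 y_i$ of (3.7)) are then available as `clf.support_` and `clf.dual_coef_`.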

3.2 Learning Algorithm of SVFNN

The learning algorithm of the SVFNN consists of three phases. The details are given below:

Learning Phase 1 – Establishing initial fuzzy rules

The first phase establishes the initial fuzzy rules. Such rules were traditionally derived from human experts as linguistic knowledge; because it is not always easy to derive fuzzy rules from human experts, a method of automatically generating fuzzy rules from numerical data is adopted. The input space partitioning determines the number of fuzzy rules extracted from the training set, and also the number of fuzzy sets. We use the centers and widths of the clusters to represent the rules. To determine the cluster to which a point belongs, we consider the value of the firing strength for the given cluster. The highest value of the firing strength determines the cluster to which the point belongs. The whole algorithm for the generation of new fuzzy rules, as well as of the fuzzy sets in each input variable, is as follows. Suppose no rules exist initially.

IF x is the first incoming input pattern THEN do

PART 1. { Generate a new rule, with center $\mathbf{m}_1 = \mathbf{x}$ and width
