
Executive Yuan National Science Council Research Project Final Report

Static/Dynamic Neuro-Fuzzy Modeling Techniques and Its Application to Speech Recognition (3/3)

Project category: Individual research project

Project number: NSC 91-2213-E-110-020-

Execution period: August 1, 2002 to July 31, 2003

Host institution: Department of Electrical Engineering, National Sun Yat-sen University

Principal investigator: 李錫智

Project participants: 歐陽振森、蔡賢亮、楊欣泰、陳文恭、蔡政勳、李婉瑞、杜世懷、林毅杰、鍾潤世、黃立銘、林安泰、黃偉程

Report type: Complete report

Availability: This report is publicly accessible

October 31, 2003


Static/Dynamic Neuro-Fuzzy Modeling Techniques and Its Application to Speech Recognition

Project numbers: NSC-89-2218-E-110-001, NSC-90-2213-E-110-014, NSC-91-2213-E-110-020

Execution period: August 1, 2000 to July 31, 2003

Principal investigator: 李錫智, Department of Electrical Engineering, National Sun Yat-sen University

Abstract

We took three years to develop more efficient static/dynamic neuro-fuzzy modeling techniques and to apply these methods to speech recognition. In the first year, we mainly focused on static system modeling problems and proposed a static neuro-fuzzy modeling technique based on a self-constructing rule generation method and a hybrid learning algorithm. Furthermore, we developed some improvements on this technique and proposed a better static neuro-fuzzy modeling technique based on merge-based fuzzy clustering and a hybrid TSK-type rule learning algorithm. In the second year, we focused on the dynamic system modeling problem and proposed an improved TSK-type recurrent fuzzy network. Because of the improvements on the clustering method, the addition of adaptive parameters, and the extension to high-order internal dynamics, our system performs more efficiently and flexibly. Finally, in the third year, we combined the improved TSK-type recurrent fuzzy network with the linear predictive coding method for speech recognition. Our research results have been published in many well-known international journals and conferences.

Keywords: static neuro-fuzzy modeling, dynamic neuro-fuzzy modeling, speech recognition, fuzzy clustering, SVD, TSK-type fuzzy rule.

Background and Objectives

System modeling plays a very important role in many areas such as data mining, control, expert systems, communications, etc. The purpose of system modeling is to model the operation of an unknown system from a set of measured input-output data. Problems of system modeling are divided into two categories, i.e., static system modeling and dynamic system modeling. In static system modeling problems, the present system output only depends on the present system input. On the other hand, the present output may depend on past inputs and past outputs in dynamic system modeling.

Many approaches have been proposed for system modeling. Quantitative approaches based on conventional mathematics (e.g., statistical regression, differential equations) tend to be more accurate, but are not suitable when the underlying system is complex, ill-defined, or uncertain. Zadeh proposed fuzzy set theory [52] to deal with such kind of uncertain information, and many other researchers have pursued research on fuzzy modeling [46, 41, 43, 34, 22, 8, 44, 45]. However, this approach lacks a definite method to determine the number of fuzzy rules required and the membership functions associated with each rule. Moreover, it lacks an effective learning algorithm to refine the membership functions to minimize output errors. Another approach using feedforward multi-layered neural networks and recurrent neural networks [35, 13, 7, 30, 31, 36] was proposed, which, like fuzzy modeling, is considered to be a


universal approximator [9, 14, 23, 47, 4] for function approximation. This approach has the advantages of learning capability and high precision. However, it usually encounters problems of slow convergence, local minima, and low understandability of the associated numerical weights.

Recently, static neuro-fuzzy modeling and dynamic neuro-fuzzy system modeling [17, 21, 25, 24, 26, 27, 10, 19, 53, 54, 20, 48, 38, 49, 50, 18, 32] have attracted a lot of attention for solving static system modeling problems and dynamic system modeling problems, respectively. In general, these approaches involve two major phases, structure identification and parameter identification. Fuzzy modeling and neural network techniques are usually employed in the two phases, respectively. Consequently, they possess the advantages of fuzzy modeling and neural networks, i.e., adaptability, quick convergence, and high accuracy.

For static neuro-fuzzy modeling, Lin et al. [26] proposed a method of fuzzy partitioning to extract initial fuzzy rules, but it is hard to decide the locations of the cuts, and too much time is needed to select the best cuts. Wong and Chen [48] proposed a clustering algorithm to obtain fuzzy rules from the given input-output data. However, convergence is very slow, especially when the amount of given data is huge. Thawonmas and Abe [45] also proposed a method for extracting fuzzy rules from a set of training patterns. To improve the approximation accuracy, the method needs to resolve the problem of overlapping among hyperboxes of different classes. Each time two hyperboxes are considered, and new hyperboxes are created for the overlapping area between them. The process iterates until no overlapping occurs between any two classes. This method may result in high complexity, especially when the amount of training data is large and the dimensionality of each training pattern is high. For parameter identification, most systems use backpropagation to refine the parameters of the system. However, backpropagation suffers from the problems of local minima and a low convergence rate. To alleviate these difficulties, different methods of least squares estimation (LSE) [5] have been proposed. Many researchers [7, 49, 50] applied pseudo-inverse techniques to obtain optimal solutions for LSE. However, in most cases, the pseudo-inverse is hard to find. Even when it can be found, it is usually memory/time demanding when the amount of training data is large. Jang [17] proposed sequential formulas and Gomm et al. [12] proposed orthogonal least squares training for LSE. However, these methods suffer from either the necessity of initializing a certain parameter by the user or the restriction of usage to full-rank problems.

For dynamic neuro-fuzzy modeling, Mastorocostas et al. [32] proposed a dynamic fuzzy neural network by using recurrent neural networks in the consequent parts of fuzzy rules. However, the complexity of the whole network structure is high, especially for complex temporal problems. Juang [18] proposed a TSK-type recurrent fuzzy network (TRFN) for dynamic system identification. However, the clustering method used in TRFN cannot describe the data distribution well in the structure learning phase due to the improper definitions of cluster parameters. Besides, the first-order properties of the internal dynamics and rule consequents restrict the representation capability of TRFN for high-order temporal problems. Therefore, TRFN may take more training time to refine the network parameters to meet the desired precision, or it may even fail to achieve the desired precision at all.

To alleviate and overcome the previous deficiencies, we proposed several developments and improvements during this three-year project. In the first year, we developed a static neuro-fuzzy modeling technique for solving static system modeling problems. A self-constructing rule generation method is proposed to generate initial fuzzy rules. Then, a hybrid learning algorithm is derived to refine the obtained rules for higher precision. Furthermore, to make the approach mentioned above more efficient and more general, we developed a merge-based fuzzy clustering technique and extended each rule to be a TSK-type fuzzy IF-THEN rule with a linear function in the conclusion part. Besides, the corresponding fuzzy neural network and hybrid learning algorithm are also constructed and derived, respectively. In the second year, we focused on developing a dynamic neuro-fuzzy modeling technique for solving dynamic system modeling problems. We extended the self-constructing rule generation method proposed in the first year to generate initial recurrent fuzzy rules from a set of time-dependent data. A novel architecture of recurrent fuzzy neural network and the corresponding learning rules are proposed for refining the obtained rules. In the third year, we combined the improved TSK-type recurrent fuzzy network and the linear predictive coding (LPC) method in speech recognition to recognize spoken words.

From the viewpoint of training algorithms, speech recognition can be divided into two types: word recognition and phoneme recognition. The algorithmic complexity of word recognition is lower than that of phoneme recognition, but word recognition is less appropriate for a large vocabulary. From the viewpoint of constraints on users, we can divide speech recognition into speaker-dependent recognition and speaker-independent recognition. From the viewpoint of speech models, we can divide speech recognition into isolated-word recognition and continuous speech recognition. Many algorithms for speech recognition have been proposed, and they can be roughly divided into pattern matching


approaches (e.g., DTW) and statistical pattern recognition (e.g., HMMs [16, 29, 37]). In our project, we focused on the pattern matching approaches. Our approach concerns speaker-independent isolated-word speech recognition. We use LPC cepstrum features to encode the speech signals. It is well known that automatic speech recognition by template matching is impeded by local and global variations along the time axis. Even if the same speaker utters the same word, the duration changes every time with nonlinear expansion and contraction. DTW was proposed to cope with these variations [39, 40]. DTW can nonlinearly expand or contract the time axis to match the same phoneme positions between the input pattern and the reference patterns. However, DTW has some drawbacks: (1) DTW requires storage for every training pattern. For a large set of training patterns, the resulting system is large in size and slow in speed. To obtain better performance, the training patterns need to be pre-filtered by human experts. However, such preselection is hard since the selected training patterns should be as diverse and informative as possible in order to make the trained system powerful. (2) DTW treats each frame of every reference pattern as equally important. This strategy leads to low accuracy in the recognition of similarly uttered words. Recently, neural networks have been applied to speech recognition (e.g., recurrent backpropagation [1] and time-delay neural networks [15]). In this project, we attempted to address the shortcomings of DTW by using the advantages of our TSK-type recurrent fuzzy network, and succeeded in developing a speaker-independent isolated-word speech recognition system.

Research Methods and Results

In this section, we describe the theorems and methodologies developed in each year of our three-year project.

1. Static neuro-fuzzy modeling

We proposed a novel approach for static neuro-fuzzy modeling. A neuro-fuzzy system for the given set of input-output data is obtained in two steps. Firstly, the data set is partitioned into a set of clusters based on input-similarity and output-similarity tests. Membership functions associated with each cluster are defined according to the statistical means and variances of the data points included in the cluster. In this way, the data contained in a cluster have a high degree of similarity, while the data in different clusters have a low degree of similarity. Besides, when new training data are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of generating the whole set of rules from scratch. Then a fuzzy IF-THEN rule with a constant conclusion is extracted from each cluster to form a fuzzy rule base. Secondly, a fuzzy neural network is constructed accordingly and its parameters are refined to increase the precision of the fuzzy rule base. To decrease the size of the search space and speed up the convergence, we develop a hybrid learning algorithm which combines a recursive SVD-based least squares estimator and the gradient descent method.

Furthermore, to make the approach mentioned above more efficient and general, we develop a merge-based fuzzy clustering technique and extend each rule to be a TSK-type fuzzy IF-THEN rule with a linear function in the conclusion part. Besides, the corresponding fuzzy neural network and hybrid learning algorithm are also constructed and derived, respectively.

1.1 A Novel Neuro-Fuzzy Modeling Technique

1.1.1 Self-Constructing Rule Generation

Suppose we are given a set of data, each of which contains n inputs and one output. Initial fuzzy rules are obtained by a self-constructing rule generator in the phase of structure identification. The jth fuzzy rule is defined to be of the following form:

IF x1 IS µ1j(x1) AND x2 IS µ2j(x2) AND . . . AND xn IS µnj(xn) THEN y IS cj.  (1)

Note that each µij(xi) is a Gaussian function with mean mij and deviation σij, i.e.,

$$\mu_{ij}(x_i) = \exp\left[-\left(\frac{x_i - m_{ij}}{\sigma_{ij}}\right)^2\right] \qquad (2)$$

and cj is a constant. The fuzzy operator 'AND' of Eq.(1) is interpreted as the algebraic product, and we use the centroid defuzzification method to calculate the output of a fuzzy rule base. The details are described as follows.

Firstly, the given input-output data set is partitioned into fuzzy clusters. We define a fuzzy cluster j as a pair (Gj(x), cj) where Gj(x) is defined as follows:

$$G_j(\mathbf{x}) = \prod_{i=1}^{n}\exp\left[-\left(\frac{x_i - m_{ij}}{\sigma_{ij}}\right)^2\right] \qquad (3)$$

where x = [x1, . . . , xn], mj = [m1j, . . . , mnj], and σj = [σ1j, . . . , σnj] denote the input vector, the mean vector, and the deviation vector, respectively, and cj denotes the height of cluster j. Let J be the number of existing fuzzy clusters and Sj be the size of cluster j. Apparently, J is 0 initially. For an input-output instance v, (pv, qv), we calculate Gj(pv) for


each existing cluster j, 1 ≤ j ≤ J. We say that instance v passes the input-similarity test on cluster j if Gj(pv) ≥ ρ, where ρ, 0 ≤ ρ ≤ 1, is a predefined threshold. Then we calculate evj = |qv − cj| for each cluster j on which instance v has passed the input-similarity test. Let d = qmax − qmin, where qmax and qmin are the maximum output and the minimum output, respectively, of the given data set. We say that instance v passes the output-similarity test on cluster j if evj ≤ τd, where τ, 0 ≤ τ ≤ 1, is another predefined threshold.

Two cases may occur. First, there are no existing fuzzy clusters on which instance v has passed both the input-similarity test and the output-similarity test. For this case, we assume that instance v is not close enough to any existing cluster, and a new fuzzy cluster k = J + 1 is created with mk = pv, σk = σ0, ck = qv, where σ0 = [σ0, . . . , σ0] is a user-defined constant vector. Note that the new cluster k contains only one member, instance v, at this time. Of course, the number of clusters is increased by 1 and the size of cluster k should be initialized, i.e., J = J + 1, Sk = 1. On the other hand, if there are existing fuzzy clusters on which instance v has passed both the input-similarity test and the output-similarity test, let clusters j1, j2, . . . , and jf be such clusters and let the cluster with the largest membership degree be cluster t, i.e.,

$$G_t(\mathbf{p}_v) = \max\big(G_{j_1}(\mathbf{p}_v), G_{j_2}(\mathbf{p}_v), \ldots, G_{j_f}(\mathbf{p}_v)\big). \qquad (4)$$

In this case, we assume that instance v is closest to cluster t and cluster t should be modified to include instance v as its member. The modification to cluster t is as follows:

$$\sigma_{it} = \left\{\frac{(S_t-1)(\sigma_{it}-\sigma_0)^2 + S_t m_{it}^2 + p_{iv}^2}{S_t} - \frac{S_t+1}{S_t}\left(\frac{S_t m_{it} + p_{iv}}{S_t+1}\right)^2\right\}^{\frac{1}{2}} + \sigma_0, \qquad (5)$$

$$m_{it} = \frac{S_t m_{it} + p_{iv}}{S_t+1} \qquad (6)$$

for 1 ≤ i ≤ n, and

$$c_t = \frac{S_t c_t + q_v}{S_t+1}, \qquad S_t = S_t + 1. \qquad (7)$$

Note that J is not changed in this case.

The above process is iterated until all the input-output instances have been processed. At the end, we have J fuzzy clusters. Note that each cluster j is described as (Gj(x), cj), where Gj(x) contains mean vector mj and deviation vector σj. Alternatively, we can represent cluster j by a fuzzy rule having the form of Eq.(1) with

$$\mu_{ij}(x_i) = \exp\left[-\left(\frac{x_i - m_{ij}}{\sigma_{ij}}\right)^2\right] \qquad (8)$$

for 1 ≤ i ≤ n, and the conclusion is cj. Now we end up with a set of J initial fuzzy rules for the given input-output data set.
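A minimal Python sketch of the rule-generation steps just described (assuming NumPy arrays for the data; the function and variable names are illustrative only, and the update formulas follow Eqs.(3) and (5)-(7) as reconstructed above) might look as follows:

```python
import numpy as np

def scrg(patterns, outputs, rho, tau, sigma0):
    """Self-constructing rule generation: returns cluster means, deviations,
    heights (rule conclusions c_j), and sizes."""
    means, devs, heights, sizes = [], [], [], []
    d = outputs.max() - outputs.min()                     # q_max - q_min
    for p, q in zip(patterns, outputs):
        # input-similarity test G_j(p) >= rho (Eq. 3) and
        # output-similarity test |q - c_j| <= tau * d
        G = [np.exp(-((p - m) / s) ** 2).prod() for m, s in zip(means, devs)]
        passed = [j for j, g in enumerate(G)
                  if g >= rho and abs(q - heights[j]) <= tau * d]
        if not passed:                                    # create a new cluster
            means.append(p.astype(float))
            devs.append(np.full(p.shape, sigma0, dtype=float))
            heights.append(float(q))
            sizes.append(1)
        else:                                             # update the closest cluster t, Eq. (4)
            t = max(passed, key=lambda j: G[j])
            S, m = sizes[t], means[t]
            new_m = (S * m + p) / (S + 1)                 # Eq. (6)
            devs[t] = np.sqrt(((S - 1) * (devs[t] - sigma0) ** 2 + S * m ** 2 + p ** 2) / S
                              - (S + 1) / S * new_m ** 2) + sigma0   # Eq. (5)
            means[t] = new_m
            heights[t] = (S * heights[t] + q) / (S + 1)   # Eq. (7)
            sizes[t] = S + 1
    return means, devs, heights, sizes
```

For instance, the 231 training pairs of Experiment 1.1 below would be passed as scrg(P, Q, 0.010, 0.195, sigma0) for some chosen σ0.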

Figure 1: Fuzzy neural network.

1.1.2 Hybrid Learning Algorithm

In the phase of parameter identification, a four-layer fuzzy neural network, shown in Figure 1, is constructed based on the initial fuzzy rules. Note that links between layer 1 and layer 2 are weighted by (mij, σij), 1 ≤ i ≤ n, 1 ≤ j ≤ J, links between layer 3 and layer 4 are weighted by cj, 1 ≤ j ≤ J, and the other links are weighted by 1. Let (p, q) be an input-output pattern, where p = [p1, . . . , pn] is the input vector and q is the corresponding desired output. The operation of the fuzzy neural network is described as follows.

1. Layer 1. Layer 1 contains n nodes. The output of node i is o(1)i = pi.

2. Layer 2. Layer 2 contains J groups and each group contains n nodes. Node (i, j) of this layer produces its output as

$$o^{(2)}_{ij} = \mu_{ij}\big(o^{(1)}_i\big) = \exp\left[-\left(\frac{o^{(1)}_i - m_{ij}}{\sigma_{ij}}\right)^2\right]. \qquad (9)$$

3. Layer 3. Layer 3 contains J nodes. Node j's output is calculated as

$$o^{(3)}_j = \prod_{i=1}^{n} o^{(2)}_{ij}. \qquad (10)$$

4. Layer 4. Layer 4 contains only one node, whose output is calculated as

$$o^{(4)} = \frac{\sum_{j=1}^{J} o^{(3)}_j\, c_j}{\sum_{j=1}^{J} o^{(3)}_j}. \qquad (11)$$
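A minimal sketch of the corresponding forward pass (Eqs.(9)-(11)), reusing the cluster parameters produced by the earlier sketch, might be:

```python
import numpy as np

def forward(p, means, devs, heights):
    """Output o^(4) of the four-layer network for input vector p."""
    o2 = [np.exp(-((p - m) / s) ** 2) for m, s in zip(means, devs)]   # layer 2, Eq. (9)
    o3 = np.array([g.prod() for g in o2])                             # layer 3, Eq. (10)
    return float(np.dot(o3, heights) / o3.sum())                      # layer 4, Eq. (11)
```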

To train the network mentioned above, we derived a hybrid learning algorithm which combines a recursive SVD-based least squares estimator and the gradient descent method. Suppose we have a total of N training patterns. An iteration of learning involves the presentation of all training patterns. In each iteration of learning, we first treat all mij and σij as fixed and use the recursive SVD-based least squares estimator to optimize cj. Then we treat all cj as fixed and use the batch gradient descent method to refine all mij and σij.


The process is iterated until the desired approximation precision is achieved. Let (pv, qv) be the vth training pattern, where pv = [p1v, . . . , pnv] is the input vector and qv is the desired output. Let v.o(4) and v.o(3)j denote the actual output of layer 4 and the actual output of node j in layer 3, respectively, for the vth training pattern. By Eq.(11), we have

$$v.o^{(4)} = a_{v1} c_1 + a_{v2} c_2 + \ldots + a_{vJ} c_J \qquad (12)$$

where

$$a_{vj} = \frac{v.o^{(3)}_j}{\sum_{k=1}^{J} v.o^{(3)}_k} \qquad (13)$$

for 1 ≤ j ≤ J. Apparently, we would like |qv − v.o(4)| to be as small as possible for the vth training pattern. For all N training patterns, we have N equations of the form of Eq.(12). Clearly, we would like J(X) = ‖B − AX‖ to be as small as possible, where B, A, and X are matrices of size N×1, N×J, and J×1, respectively, and

$$\mathbf{B} = \begin{bmatrix} q_1 & \cdots & q_N \end{bmatrix}^T, \quad \mathbf{A} = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_N \end{bmatrix}^T, \quad \mathbf{a}_i = \begin{bmatrix} a_{i1} & \cdots & a_{iJ} \end{bmatrix}^T, \quad \mathbf{X} = \begin{bmatrix} c_1 & \cdots & c_J \end{bmatrix}^T.$$

Note that for a matrix D, ‖D‖ is defined to be $\sqrt{\mathrm{trace}(\mathbf{D}^T\mathbf{D})}$.

We develop a recursive SVD-based estimator which only requires the decomposition of a small matrix in each iteration, leading to lower time and space requirements. In our method, training patterns are considered one by one, starting with the first pattern, t = 1, until the last pattern, t = N. For each t, we want to find the optimal X(t) such that

$$J(\mathbf{X}(t)) = \|\mathbf{B}(t) - \mathbf{A}(t)\mathbf{X}(t)\| \qquad (14)$$

is minimized. Note that

$$\mathbf{A}(t) = \begin{bmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_t \end{bmatrix}^T, \qquad \mathbf{B}(t) = \begin{bmatrix} q_1 & \cdots & q_t \end{bmatrix}^T$$

for t = 1, 2, . . . , N. Before we proceed, we present a theorem that will be used later.

Theorem. Minimizing Eq.(14) is equivalent to minimizing

$$\hat{J}(\mathbf{X}(t)) = \|\bar{\mathbf{B}}(t) - \bar{\boldsymbol{\Sigma}}(t)\mathbf{V}^T(t)\mathbf{X}(t)\| \qquad (15)$$

where $\bar{\mathbf{B}}(t)$, $\bar{\boldsymbol{\Sigma}}(t)$, $\mathbf{V}^T(t)$, and $\mathbf{X}(t)$ satisfy the following equalities:

$$\begin{cases} \mathbf{A}(1) = \mathbf{U}(1)\boldsymbol{\Sigma}(1)\mathbf{V}^T(1), \\[4pt] \begin{bmatrix} \bar{\boldsymbol{\Sigma}}(t-1)\mathbf{V}^T(t-1) \\ \mathbf{a}_t^T \end{bmatrix} = \mathbf{U}(t)\boldsymbol{\Sigma}(t)\mathbf{V}^T(t), & t \ge 2 \end{cases} \qquad (16)$$

$$\boldsymbol{\Sigma}(t) = \begin{bmatrix} \bar{\boldsymbol{\Sigma}}(t) \\ \mathbf{0} \end{bmatrix}, \quad t \ge 1 \qquad (17)$$

$$\begin{cases} \mathbf{U}^T(1)\mathbf{B}(1) = \begin{bmatrix} \bar{\mathbf{B}}(1) \\ \tilde{\mathbf{B}}(1) \end{bmatrix}, \\[4pt] \mathbf{U}^T(t)\begin{bmatrix} \bar{\mathbf{B}}(t-1) \\ q_t \end{bmatrix} = \begin{bmatrix} \bar{\mathbf{B}}(t) \\ \tilde{\mathbf{B}}(t) \end{bmatrix}, & t \ge 2. \end{cases} \qquad (18)$$

Apparently, what we want is to find the optimal X∗(N) which minimizes

$$J(\mathbf{X}(N)) = \|\mathbf{B}(N) - \mathbf{A}(N)\mathbf{X}(N)\|. \qquad (19)$$

By the theorem, we only need to find the optimal X∗(N) which minimizes

$$\hat{J}(\mathbf{X}(N)) = \|\bar{\mathbf{B}}(N) - \bar{\boldsymbol{\Sigma}}(N)\mathbf{V}^T(N)\mathbf{X}(N)\|. \qquad (20)$$

We obtain $\bar{\mathbf{B}}(N)$, $\bar{\boldsymbol{\Sigma}}(N)$, and $\mathbf{V}^T(N)$ by the following recursive procedure:

Step 1. Set t = 1 and calculate U(1), Σ(1), and V(1) by Eq.(16). Then get $\bar{\boldsymbol{\Sigma}}(1)$ and $\bar{\mathbf{B}}(1)$ by Eqs.(17) and (18).

Step 2. Increase t by 1. Calculate U(t), Σ(t), and V(t) by Eq.(16). Then get $\bar{\boldsymbol{\Sigma}}(t)$ and $\bar{\mathbf{B}}(t)$ by Eqs.(17) and (18).

Step 3. If t = N, then we are done. Otherwise, go to Step 2.

Let Y(N) = V^T(N)X(N). Eq.(20) becomes

$$\hat{J}(\mathbf{Y}(N)) = \|\bar{\mathbf{B}}(N) - \bar{\boldsymbol{\Sigma}}(N)\mathbf{Y}(N)\|. \qquad (21)$$

Apparently, Eq.(21) is minimized by Y∗(N) such that $\bar{\mathbf{B}}(N) - \bar{\boldsymbol{\Sigma}}(N)\mathbf{Y}(N)$ is 0, i.e.,

$$\bar{\boldsymbol{\Sigma}}(N)\mathbf{Y}(N) = \bar{\mathbf{B}}(N). \qquad (22)$$

Suppose the $\bar{\boldsymbol{\Sigma}}(N)$ we got is an h×J diagonal matrix with each component $\bar{\Sigma}(N)_{ij}$ being

$$\bar{\Sigma}(N)_{ij} = \begin{cases} 0 & \text{if } i \ne j, \\ e_i & \text{otherwise} \end{cases} \qquad (23)$$

where h ≤ J. Moreover, let $\bar{\mathbf{B}}(N)$ and Y∗(N) be represented by

$$\bar{\mathbf{B}}(N) = \begin{bmatrix} b_1 & b_2 & \cdots & b_h \end{bmatrix}^T, \qquad \mathbf{Y}^*(N) = \begin{bmatrix} y_1^* & y_2^* & \cdots & y_J^* \end{bmatrix}^T. \qquad (24)$$

Then we have

$$y_i^* = \begin{cases} \dfrac{b_i}{e_i} & \text{if } i \le h, \\[4pt] 0 & \text{if } h < i \le J. \end{cases} \qquad (25)$$

Therefore, the optimal solution X∗(N) which minimizes Eq.(19) is X∗(N) = V(N)Y∗(N).
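A minimal NumPy sketch of this recursive estimator, following the reconstruction above (the function name and the 1e-12 zero tolerance are illustrative choices, not part of the report), might be:

```python
import numpy as np

def recursive_svd_lse(rows, targets):
    """Find X minimizing ||B - A X||, taking the rows a_t of A one at a time."""
    R, bbar, s, Vt = None, None, None, None
    for a, q in zip(rows, targets):
        a = np.asarray(a, dtype=float).reshape(1, -1)
        M = a if R is None else np.vstack([R, a])          # stacked matrix of Eq. (16)
        rhs = np.array([q], dtype=float) if bbar is None else np.append(bbar, q)  # Eq. (18)
        U, s, Vt = np.linalg.svd(M, full_matrices=True)
        h = min(M.shape)                                   # rows kept in the top block
        Sigma_bar = np.zeros((h, M.shape[1]))
        Sigma_bar[np.arange(h), np.arange(h)] = s          # top block of Eq. (17)
        R = Sigma_bar @ Vt                                 # carry Sigma_bar(t) V^T(t)
        bbar = (U.T @ rhs)[:h]                             # carry B_bar(t)
    # back-substitution of Eqs. (22)-(25): y_i = b_i / e_i for nonzero singular values
    y = np.zeros(Vt.shape[1])
    for i in range(len(s)):
        if s[i] > 1e-12:
            y[i] = bbar[i] / s[i]
    return Vt.T @ y                                        # X* = V(N) Y*(N)
```

For the network above, `rows` would hold the vectors [a_v1, . . . , a_vJ] of Eq.(13) and `targets` the desired outputs q_v, so the returned vector contains the optimized heights c_1, . . . , c_J.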

As mentioned, the parameters mij and σij, 1 ≤ i ≤ n and 1 ≤ j ≤ J, are refined by the gradient descent method. The learning rule for mij is

$$m_{ij}^{new} = m_{ij}^{old} - \eta_1 \frac{2}{N}\sum_{v=1}^{N}\left\{\big[v.o^{(4)} - q_v\big]\,\frac{\big[c_j - v.o^{(4)}\big]\big[p_{iv} - m_{ij}\big]\, v.o^{(3)}_j}{\sigma_{ij}^2\sum_{r=1}^{J} v.o^{(3)}_r}\right\}$$


Figure 2: Experiment 1.1: (a) output of the desired function; (b) output of the six rules generated by our SCRG; (c) output of the six rules refined by our HLA.

where η1 is the learning rate. Similarly, we have

$$\sigma_{ij}^{new} = \sigma_{ij}^{old} - \eta_2 \frac{2}{N}\sum_{v=1}^{N}\left\{\big[v.o^{(4)} - q_v\big]\,\frac{\big[c_j - v.o^{(4)}\big]\big[p_{iv} - m_{ij}\big]^2\, v.o^{(3)}_j}{\sigma_{ij}^3\sum_{r=1}^{J} v.o^{(3)}_r}\right\}$$

where η2 is also a learning rate.
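A minimal sketch of one batch update implementing these two learning rules (reusing the representation of the earlier sketches; `P` and `Q` hold the training inputs and outputs, and the learning rates correspond to η1 and η2) might be:

```python
import numpy as np

def gd_update(P, Q, means, devs, heights, eta1, eta2):
    """One batch iteration of the learning rules above for all m_ij and sigma_ij."""
    N = len(P)
    grad_m = [np.zeros_like(m) for m in means]
    grad_s = [np.zeros_like(s) for s in devs]
    for p, q in zip(P, Q):
        o3 = np.array([np.exp(-((p - m) / s) ** 2).prod()
                       for m, s in zip(means, devs)])
        o4 = np.dot(o3, heights) / o3.sum()
        for j, (m, s) in enumerate(zip(means, devs)):
            common = (o4 - q) * (heights[j] - o4) * o3[j] / o3.sum()
            grad_m[j] += common * (p - m) / s ** 2        # matches the m_ij rule
            grad_s[j] += common * (p - m) ** 2 / s ** 3   # matches the sigma_ij rule
    for j in range(len(means)):
        means[j] = means[j] - eta1 * 2.0 / N * grad_m[j]
        devs[j] = devs[j] - eta2 * 2.0 / N * grad_s[j]
    return means, devs
```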

1.1.3 Experimental Results and Discussion

We demonstrate the performance of our approach by showing the results of several experiments. A comparison between our system and two other systems, Lin's system [26] and Wong's system [48], is presented. Lin's system [26] uses fuzzy partitioning (FP) and Wong's system [48] uses a clustering algorithm (CA) for structure identification, and both systems use back-propagation (BP) for parameter identification. In each experiment, we compare the performance of our self-constructing rule generator (SCRG) with that of FP and CA. Then we compare the performance of our hybrid learning algorithm (HLA) with that of BP. Finally, we compare the performance of our system with that of Lin's system and Wong's system on the system level.

Experiment 1.1

The first experiment concerns the modeling of the following nonlinear function [48]:

$$y = \sin(x_1\pi)\cdot\sin(x_2\pi), \qquad (26)$$

where x1 and x2 are inputs and y is the output, with x1 ∈ [−1, 1] and x2 ∈ [0, 1]. The function is shown graphically in Figure 2(a). The input-output training data pairs ([p1, p2], q) are taken by sampling x1 and x2 with a sampling interval of 0.1. As a result, there are 231 input-output training patterns in total. Six fuzzy rules are obtained by our SCRG as follows:

Figure 3: Membership functions of the six rules for Experiment 1.1.

r1: IF x1 IS g(x1; −0.039344, 0.681465) AND x2 IS g(x2; 0.122951, 0.209606) THEN y IS −0.002856;
r2: IF x1 IS g(x1; −0.477551, 0.275006) AND x2 IS g(x2; 0.518367, 0.285825) THEN y IS −0.654764;
r3: IF x1 IS g(x1; 0.532203, 0.302401) AND x2 IS g(x2; 0.532203, 0.298491) THEN y IS 0.595167;
r4: IF x1 IS g(x1; −0.863636, 0.219882) AND x2 IS g(x2; 0.777273, 0.284548) THEN y IS −0.121663;
r5: IF x1 IS g(x1; 0.921429, 0.177496) AND x2 IS g(x2; 0.800000, 0.276116) THEN y IS 0.050626;
r6: IF x1 IS g(x1; 0.019231, 0.323342) AND x2 IS g(x2; 0.880769, 0.212723) THEN y IS −0.034207

where the Gaussian functions associated with each rule are shown in Figure 3. These rules provide an approximation to the original function, as shown in Figure 2(b), with a mean square error (MSE) of 0.0415. For example, when x1 = −0.4 and x2 = 0.5, the desired y value is

y = sin(−0.4π) · sin(0.5π) = −0.951057  (27)

by Eq.(26). Using the six rules above, we can approximate the y value as follows. We compute the firing strength

g(−0.4; −0.039344, 0.681465)×g(0.5; 0.122951, 0.209606) = 0.755715×0.039327 = 0.029720

for r1. Similarly, we have firing strengths 0.919751, 0.000074, 0.004537, 0, and 0.007559 for r2, r3, r4, r5, and r6, respectively. By centroid defuzzification, we have

ŷ = {0.029720×(−0.002856) + 0.919751×(−0.654764) + 0.000074×0.595167 + 0.004537×(−0.121663) + 0×0.050626 + 0.007559×(−0.034207)} / {0.029720 + 0.919751 + 0.000074 + 0.004537 + 0 + 0.007559} = −0.627128  (28)

as the output of the six rules, which is an approximation to Eq.(27). For FP and CA, MSEs of 0.0709 and 0.0604, respectively, are obtained with six fuzzy rules. Using 8 fuzzy rules, our SCRG produces an MSE of 0.0302, while FP and CA produce MSEs of 0.065 and 0.0589, respectively. A comparison on the MSE of structure identification with different numbers of rules is shown in Table 1. For SCRG, we have ρ = 0.010 and τ = 0.195, ρ = 0.010 and τ = 0.190, and ρ = 0.010 and τ = 0.170 with 6, 8, and 10 rules, respectively. The time in this table is in seconds, obtained by running on a PC with an AMD Athlon XP CPU and 256 MB of memory. Apparently, SCRG performs best, i.e., it has the smallest error and runs fastest in each case.

Next, we test the performance of our hybrid learning algorithm (HLA). We build a neural network with the six fuzzy rules obtained by our SCRG and train the neural network with our HLA. An MSE of 0.0037 is obtained after 17 iterations with the following refined fuzzy rules:

r1: IF x1 IS g(x1; −0.031622, 0.768006) AND x2 IS g(x2; −0.011605, 0.322145) THEN y IS −0.002670;
r2: IF x1 IS g(x1; −0.320470, 0.451664) AND x2 IS g(x2; 0.520692, 0.303634) THEN y IS −1.280038;
r3: IF x1 IS g(x1; 0.317044, 0.453049) AND x2 IS g(x2; 0.524339, 0.300619) THEN y IS 1.279548;
r4: IF x1 IS g(x1; −1.008829, 0.238071) AND x2 IS g(x2; 0.716354, 0.447273) THEN y IS 0.034395;
r5: IF x1 IS g(x1; 1.012777, 0.243046) AND x2 IS g(x2; 0.715546, 0.469390) THEN y IS −0.044219;
r6: IF x1 IS g(x1; 0.006815, 0.580475) AND x2 IS g(x2; 0.993415, 0.344548) THEN y IS −0.001240

The output of the refined fuzzy rules is shown in Figure 2(c), which gives a better approximation than Figure 2(b). For example, when x1 = −0.4 and x2 = 0.5, we have firing strengths 0.063789, 0.964979, 0.081145, 0.001143, 0, and 0.078711 for these refined rules, and thus

ŷ = {0.063789×(−0.002670) + 0.964979×(−1.280038) + 0.081145×1.279548 + 0.001143×0.034395 + 0×(−0.044219) + 0.078711×(−0.001240)} / {0.063789 + 0.964979 + 0.081145 + 0.001143 + 0 + 0.078711} = −0.951120  (29)

which is a better approximation to Eq.(27) than Eq.(28). We compare the performance of our HLA with that of BP by running BP on the same network. The result is shown in Table 2. Clearly, HLA performs much better than BP. For example, HLA takes 17 iterations and 10.255 seconds to reduce the MSE to 0.0037 with 6 rules, while BP takes 88 iterations and 44.414 seconds to reduce the MSE to 0.0041. For the case of 8 rules, HLA takes 18 iterations and 13.9 seconds, while BP takes 147 iterations and 95.057 seconds, to reduce the MSE to about 0.0030.

Finally, we compare the performance of the three systems on the system level. The learning performance of each system is shown in Table 3. Our system takes fewer iterations and less CPU time than the other two systems. After learning, the generalization ability of these systems is investigated by applying 200 test patterns which are different from the training patterns, and the results are shown in Table 4, in which the MSEs are listed. We can see that our system approximates the function at the test points equally well as or better than the other two systems.

Figure 4: Experiment 1.2: (a) output of the desired function; (b) output of the nine rules generated by our SCRG; (c) output of the nine rules refined by our HLA.

Experiment 1.2

The second experiment considers the following nonlinear function [26]:

$$y = x_2\sin(x_1) + x_1\cos(x_2), \quad 0 \le x_1, x_2 \le \pi \qquad (30)$$

which is shown graphically in Figure 4(a). The input-output training data pairs ([p1, p2], q) are taken by sampling x1 and x2 at 0, π/20, 2π/20, . . . , 19π/20, and π, resulting in 441 training patterns in total. Nine fuzzy rules are obtained by our SCRG. The approximation due to these rules is shown in Figure 4(b), with an MSE of 0.0796. A comparison on MSE among FP, CA, and SCRG with different numbers of rules is shown in Table 5.


Table 1: Comparison on accuracy of structure identification with different numbers of rules for Experiment 1.1.

No. of Rules | FP MSE | FP Time | CA MSE | CA Time | SCRG MSE | SCRG Time
6  | 0.0709 | 0.018 | 0.0604 | 0.595 | 0.0415 | 0.006
8  | 0.0650 | 0.022 | 0.0589 | 0.383 | 0.0302 | 0.008
10 | 0.0400 | 0.027 | 0.0413 | 0.535 | 0.0247 | 0.009

Table 2: Comparison on performance of different learning methods for Experiment 1.1.

No. of Rules | BP MSE | BP Iters | BP Time | HLA MSE | HLA Iters | HLA Time
6  | 0.0041 | 88  | 44.414 | 0.0037 | 17 | 10.255
8  | 0.0030 | 147 | 95.057 | 0.0029 | 18 | 13.900
10 | 0.0024 | 98  | 71.483 | 0.0023 | 11 | 10.605

Table 3: Comparison on learning performance of the three systems for Experiment 1.1.

No. of Rules | Lin's MSE | Lin's Iters | Lin's Time | Wong's MSE | Wong's Iters | Wong's Time | Our MSE | Our Iters | Our Time
6  | 0.0041 | 218 | 118.981 | 0.0041 | 132 | 63.398  | 0.0037 | 17 | 10.261
8  | 0.0030 | 390 | 264.631 | 0.0029 | 227 | 141.707 | 0.0029 | 18 | 13.908
10 | 0.0024 | 623 | 486.961 | 0.0023 | 332 | 241.339 | 0.0023 | 11 | 10.614

Table 4: Comparison on generalization ability of the three systems for Experiment 1.1.

No. of Rules | Lin's System | Wong's System | Our System
6  | 0.0034 | 0.0042 | 0.0027
8  | 0.0027 | 0.0029 | 0.0022
10 | 0.0025 | 0.0024 | 0.0016

Table 5: Comparison on accuracy of structure identification with different numbers of rules for Experiment 1.2.

No. of Rules | FP MSE | FP Time | CA MSE | CA Time | SCRG MSE | SCRG Time
9  | 0.1600 | 0.040 | 0.1638 | 1.822 | 0.0796 | 0.010
16 | 0.0562 | 0.060 | 0.0898 | 1.850 | 0.0327 | 0.020

Table 6: Comparison on performance of different learning methods for Experiment 1.2.

No. of Rules | BP MSE | BP Iters | BP Time | HLA MSE | HLA Iters | HLA Time
9  | 0.0079 | 164 | 214.298 | 0.0079 | 17 | 27.970
16 | 0.0032 | 114 | 254.796 | 0.0031 | 6  | 16.393


Table 7: Comparison on learning performance of the three systems for Experiment 1.2.

No. of Rules | Lin's MSE | Lin's Iters | Lin's Time | Wong's MSE | Wong's Iters | Wong's Time | Our MSE | Our Iters | Our Time
9  | 0.0079 | 102 | 140.118 | 0.0079 | 97  | 113.262 | 0.0079 | 17 | 27.980
16 | 0.0032 | 145 | 333.107 | 0.0031 | 124 | 256.368 | 0.0031 | 6  | 16.413

Table 8: Comparison on generalization ability of the three systems for Experiment 1.2.

No. of Rules | Lin's System | Wong's System | Our System
9  | 0.0062 | 0.0061 | 0.0058
16 | 0.0088 | 0.0033 | 0.0023

For SCRG, we have ρ = 0.010 and τ = 0.090 with 9 rules, and ρ = 0.050 and τ = 0.090 with 16 rules. A refined set of 9 fuzzy rules can be obtained by our HLA after 17 iterations, with an MSE of 0.0079. The output of these refined rules is shown in Figure 4(c). Table 6 shows a comparison on the performance of different learning methods with different numbers of initial rules. Finally, we compare the performance of the three systems on the system level. The learning performance of each system is shown in Table 7. The generalization ability of each system is investigated by applying 400 test patterns which are different from the training patterns, and the results are shown in Table 8, in which the MSEs are listed.

Experiment 1.3

The third example considers the following function [43]:

$$y = \left(1 + x_1^{-2} + x_2^{-1.5}\right)^2 \qquad (31)$$

where 1 ≤ x1, x2 ≤ 5. We take fifty input-output patterns as training patterns [43] and another fifty patterns as test patterns. A comparison on MSE among FP, CA, and SCRG with different numbers of rules is shown in Table 9. For SCRG, we have ρ = 0.001 and τ = 0.205, ρ = 0.001 and τ = 0.150, and ρ = 0.001 and τ = 0.105 with 6, 8, and 10 rules, respectively. Table 10 shows a comparison on the performance of different learning methods with different numbers of initial rules. Finally, the learning performance and generalization ability of each system are shown in Table 11 and Table 12, respectively.

Experiment 1.4

The last experiment concerns a real-world data set of daily stock prices taken from [43]. The set consists of 100 patterns, each having 10 inputs and one output. As in [43], 80 patterns are used for training and the remaining 20 patterns are used as test data. A comparison on MSE among FP, CA, and SCRG with different numbers of rules is shown in Table 13. For SCRG, we have ρ = 0.0001 and τ = 0.17, ρ = 0.0001 and τ = 0.12, and ρ = 0.0001 and τ = 0.15 with 7, 9, and 10 rules, respectively. Table 14 shows a comparison on the performance of different learning methods with different numbers of initial rules. Finally, the learning performance and generalization ability of each system are shown in Table 15 and Table 16, respectively.

1.2 An Improved Neuro-Fuzzy Modeling Technique

We improved the method described in the previous section using a merge-based fuzzy clustering technique and TSK-type fuzzy IF-THEN rules.

1.2.1 Merge-Based Fuzzy Clustering (MFC)

The task of our clustering method is to partition the given input-output data set into fuzzy clusters, with the degree of association being strong for data within a cluster and weak for data in different clusters. Our method is an incremental one and consists of two stages, data partitioning and cluster merge. In the data partitioning stage, the data set is partitioned automatically into a set of clusters for which membership functions are defined. Like other incremental clustering algorithms, the clusters obtained from the data partitioning stage are sensitive to the input order of the training patterns. This sensitivity is reduced by the second stage, i.e., the cluster merge stage, in which similar clusters are merged together. Therefore, more clusters are located in the local areas with a highly variant output surface and fewer clusters are located in the areas with flat output values.

For convenience, we deal with modeling a system with n input variables x1, x2, . . . , xn and one output variable y; the extension to multiple output variables is obvious.


Table 9: Comparison on accuracy of structure identification with different numbers of rules for Experiment 1.3.

No. of Rules | FP MSE | FP Time | CA MSE | CA Time | SCRG MSE | SCRG Time
6  | 0.3510 | 0.009 | 0.3020 | 0.023 | 0.1367 | 0.002
8  | 0.1759 | 0.013 | 0.2076 | 0.207 | 0.1127 | 0.002
10 | 0.1468 | 0.019 | 0.1725 | 0.184 | 0.0857 | 0.003

Table 10: Comparison on performance of different learning methods for Experiment 1.3.

No. of Rules | BP MSE | BP Iters | BP Time | HLA MSE | HLA Iters | HLA Time
6  | 0.0600 | 93  | 15.552 | 0.0589 | 22 | 5.087
8  | 0.0500 | 288 | 57.600 | 0.0500 | 35 | 8.858
10 | 0.0148 | 40  | 9.324  | 0.0148 | 13 | 3.575

Table 11: Comparison on learning performance of the three systems for Experiment 1.3.

No. of Rules | Lin's MSE | Lin's Iters | Lin's Time | Wong's MSE | Wong's Iters | Wong's Time | Our MSE | Our Iters | Our Time
6  | 0.0599 | 125 | 18.633 | 0.0572 | 103 | 14.818 | 0.0589 | 22 | 5.089
8  | 0.0499 | 115 | 25.797 | 0.0499 | 113 | 24.080 | 0.0500 | 35 | 8.860
10 | 0.0149 | 278 | 53.016 | 0.0149 | 343 | 60.833 | 0.0148 | 13 | 3.578

Table 12: Comparison on generalization ability of the three systems for Experiment 1.3.

No. of Rules | Lin's System | Wong's System | Our System
6  | 0.1124 | 0.0967 | 0.0711
8  | 0.0966 | 0.0898 | 0.0695
10 | 0.0537 | 0.0472 | 0.0407

Table 13: Comparison on accuracy of structure identification with different numbers of rules for Experiment 1.4.

No. of Rules | FP MSE | FP Time | CA MSE | CA Time | SCRG MSE | SCRG Time
7  | 0.0102 | 0.070 | 0.0159 | 0.150 | 0.0075 | 0.010
9  | 0.0106 | 0.098 | 0.0154 | 0.120 | 0.0065 | 0.010
10 | 0.0106 | 0.107 | 0.0155 | 0.110 | 0.0054 | 0.020

Table 14: Comparison on performance of different learning methods for Experiment 1.4.

No. of Rules | BP MSE | BP Iters | BP Time | HLA MSE | HLA Iters | HLA Time
7  | 0.0020 | 145 | 115.596 | 0.0020 | 12 | 10.866
9  | 0.0015 | 235 | 235.909 | 0.0015 | 16 | 18.797
10 | 0.0015 | 197 | 217.403 | 0.0015 | 19 | 24.516


Table 15: Comparison on learning performance of the three systems for Experiment 1.4.

No. of Rules | Lin's MSE | Lin's Iters | Lin's Time | Wong's MSE | Wong's Iters | Wong's Time | Our MSE | Our Iters | Our Time
7  | 0.0020 | 372 | 319.490 | 0.0020 | 393  | 303.784  | 0.0020 | 12 | 10.876
9  | 0.0015 | 351 | 401.027 | 0.0017 | 1346 | 1230.273 | 0.0015 | 16 | 18.807
10 | 0.0015 | 338 | 425.371 | 0.0015 | 1564 | 1692.724 | 0.0015 | 19 | 24.536

Table 16: Comparison on generalization ability of the three systems for Experiment 1.4.

No. of Rules | Lin's System | Wong's System | Our System
7  | 0.0113 | 0.0049 | 0.0033
9  | 0.0159 | 0.0026 | 0.0021
10 | 0.0173 | 0.0038 | 0.0023

Let x be the input vector, i.e., x = [x1, x2, . . . , xn]. A fuzzy cluster Cj is defined as a pair (Ij(x), Oj(y)), where Ij(x) describes the input distribution and Oj(y) describes the output distribution of the training patterns covered by cluster Cj. Both Ij(x) and Oj(y) are Gaussian functions defined as:

$$I_j(\mathbf{x}) = \prod_{i=1}^{n} g(x_i; m_{ij}, \sigma_{ij}) \qquad (32)$$
$$= \prod_{i=1}^{n}\exp\left[-\left(\frac{x_i - m_{ij}}{\sigma_{ij}}\right)^2\right] \qquad (33)$$
$$O_j(y) = g(y; m_{0j}, \sigma_{0j}) \qquad (34)$$
$$= \exp\left[-\left(\frac{y - m_{0j}}{\sigma_{0j}}\right)^2\right] \qquad (35)$$

where mj = [m1j, . . . , mnj] denotes the mean vector and σj = [σ1j, . . . , σnj] denotes the deviation vector for Ij(x), and m0j and σ0j denote the mean and deviation, respectively, for Oj(y). Gaussian functions are adopted for representing clusters because of their superiority over other functions in performance.

Data Partitioning

Assume that we have a set of N training patterns and each pattern tv, 1 ≤ v ≤ N, is represented by (pv, qv), where pv = [p1v, . . . , pnv] denotes the input values and qv denotes the desired output value. We say tv belongs to cluster Cj if tv contributes to the distribution of patterns in Cj, i.e., mj, σj, m0j, and σ0j have to be recalculated due to the addition of tv. The size, Sj, of cluster Cj is defined to be the number of patterns that belong to Cj. Before we proceed, we define several operators to help the description later. The operator comb combines a cluster Cj and a pattern tv and results in a new cluster C'j, i.e.,

$$C_j' = \mathrm{comb}(C_j, t_v) \qquad (36)$$

where C'j = (I'j(x), O'j(y)) and

$$I_j'(\mathbf{x}) = \mathrm{comb\_x}(I_j(\mathbf{x}), \mathbf{p}_v), \qquad (37)$$
$$O_j'(y) = \mathrm{comb\_y}(O_j(y), q_v). \qquad (38)$$

The mean and deviation vectors associated with I'j(x) are computed by:

$$m_{ij}' = \frac{S_j m_{ij} + p_{iv}}{S_j + 1}, \qquad (39)$$

$$\sigma_{ij}' = \left\{\frac{(S_j - 1)(\sigma_{ij} - \sigma_0^i)^2 + S_j m_{ij}^2 + p_{iv}^2}{S_j} - \frac{S_j + 1}{S_j}\left(\frac{S_j m_{ij} + p_{iv}}{S_j + 1}\right)^2\right\}^{\frac{1}{2}} + \sigma_0^i, \qquad (40)$$

for 1 ≤ i ≤ n, while the mean and deviation associated with O'j(y) are computed by:

$$m_{0j}' = \frac{S_j m_{0j} + q_v}{S_j + 1}, \qquad (41)$$

$$\sigma_{0j}' = \left\{\frac{(S_j - 1)(\sigma_{0j} - \sigma_0^o)^2 + S_j m_{0j}^2 + q_v^2}{S_j} - \frac{S_j + 1}{S_j}\left(\frac{S_j m_{0j} + q_v}{S_j + 1}\right)^2\right\}^{\frac{1}{2}} + \sigma_0^o, \qquad (42)$$

with $\sigma_0^i$ and $\sigma_0^o$ being user-defined constants.

Let J be the number of existing fuzzy clusters. Initially, J is 0 since no cluster exists at the beginning. For a training instance tv, we calculate Ij(pv), which measures the degree to which tv is close to Cj in the input subspace. We say that instance tv passes the input-similarity test on cluster Cj if

$$I_j(\mathbf{p}_v) \ge \rho \qquad (43)$$

where ρ, 0 ≤ ρ ≤ 1, is a predefined threshold. Then we check the output variance of tv as follows.


For each cluster Cj on which tv has passed the input-similarity test, we calculate

$$O_j'(y) = \mathrm{comb\_y}(O_j(y), q_v).$$

We say that instance tv passes the output-variance test on cluster Cj if

$$\sigma_{0j}' \le \tau \qquad (44)$$

where τ is a user-defined threshold.

Two cases may occur. First, there are no existing fuzzy clusters on which instance tv has passed both the input-similarity test and the output-variance test. For this case, we assume that instance tv is not close enough to any existing cluster, and a new fuzzy cluster Ck, k = J + 1, is created with

$$\mathbf{m}_k = \mathbf{p}_v, \quad \boldsymbol{\sigma}_k = [\sigma_0^i, \sigma_0^i, \ldots, \sigma_0^i], \quad m_{0k} = q_v, \quad \sigma_{0k} = \sigma_0^o. \qquad (45)$$

Note that the new cluster Ck contains only one member, instance tv. The reason that σk and σ0k are initialized to non-zero values is to avoid the null width of such a singleton cluster. Of course, the number of clusters is increased by 1 and the size of cluster Ck should be initialized, i.e.,

$$J = J + 1, \qquad S_k = 1. \qquad (46)$$

On the other hand, if there are existing fuzzy clusters on which instance tv has passed both the input-similarity test and the output-variance test, let clusters Cm1, Cm2, . . . , and Cmf be such clusters and let the cluster with the largest input-similarity measure be cluster Ca, i.e.,

$$I_a(\mathbf{p}_v) = \max\big(I_{m_1}(\mathbf{p}_v), I_{m_2}(\mathbf{p}_v), \ldots, I_{m_f}(\mathbf{p}_v)\big). \qquad (47)$$

In this case, we assume that instance tv is closest to cluster Ca, and cluster Ca should be modified to include instance tv:

$$C_a = \mathrm{comb}(C_a, t_v), \qquad S_a = S_a + 1. \qquad (48)$$

Note that J is not changed in this case.

The above process is iterated until all the training instances have been processed. At the end, we have J fuzzy clusters. The whole process of data partitioning can be summarized as below.

procedure Data Partitioning
    J = 0;
    for each pattern tv, 1 ≤ v ≤ N
        W1 = {Cj | Ij(pv) ≥ ρ, 1 ≤ j ≤ J};
        Calculate O'j(y) = comb_y(Oj(y), qv) for all Cj ∈ W1;
        W2 = {Cj | σ'0j ≤ τ, Cj ∈ W1};
        if W2 == ∅
            A new cluster Ck, k = J + 1, is created by Eqs.(45) and (46);
        else
            Let Ca ∈ W2 be the cluster with the largest input-similarity measure;
            Incorporate tv into Ca by Eq.(48);
        endif;
    endfor;
    return with J clusters;
end Data Partitioning

Cluster Merge

A good clustering algorithm should not generate an unnecessarily large number of clusters. A fuzzy model with a large number of clusters is likely to encounter the risk of over-fitting, i.e., being capable of fitting the training data well but incapable of generalizing satisfactorily to untrained data. Also, as mentioned earlier, a good incremental clustering algorithm should have a low degree of sensitivity to the input order of the training data. Our clustering method offers a solution to these requirements by a merging facility. The basic idea is to merge together the clusters in the areas where the training patterns present less variable output responses. Before we continue, we define some operators for merging clusters together. The operator comb_g combines k clusters, C1, C2, . . . , Ck, k ≥ 2, into a new cluster C'j, i.e.,

$$C_j' = \mathrm{comb\_g}(C_1, C_2, \ldots, C_k) \qquad (49)$$

where C'j = (I'j(x), O'j(y)) and

$$I_j'(\mathbf{x}) = \mathrm{comb\_g\_x}(I_1(\mathbf{x}), I_2(\mathbf{x}), \ldots, I_k(\mathbf{x})), \qquad (50)$$
$$O_j'(y) = \mathrm{comb\_g\_y}(O_1(y), O_2(y), \ldots, O_k(y)). \qquad (51)$$

The mean and deviation vectors associated with I'j(x) are computed by:

$$m_{ij}' = \frac{\sum_{d=1}^{k} S_d\, m_{id}}{\sum_{d=1}^{k} S_d},$$

$$\sigma_{ij}' = \left\{\frac{\sum_{d=1}^{k}\big[(S_d - 1)(\sigma_{id} - \sigma_0^i)^2 + S_d\, m_{id}^2\big]}{\sum_{d=1}^{k} S_d - 1} - m_{ij}'^{\,2}\,\frac{\sum_{d=1}^{k} S_d}{\sum_{d=1}^{k} S_d - 1}\right\}^{\frac{1}{2}} + \sigma_0^i,$$

for 1 ≤ i ≤ n, while the mean and deviation associated with O'j(y) are computed by:

$$m_{0j}' = \frac{\sum_{d=1}^{k} S_d\, m_{0d}}{\sum_{d=1}^{k} S_d},$$

$$\sigma_{0j}' = \left\{\frac{\sum_{d=1}^{k}\big[(S_d - 1)(\sigma_{0d} - \sigma_0^o)^2 + S_d\, m_{0d}^2\big]}{\sum_{d=1}^{k} S_d - 1} - m_{0j}'^{\,2}\,\frac{\sum_{d=1}^{k} S_d}{\sum_{d=1}^{k} S_d - 1}\right\}^{\frac{1}{2}} + \sigma_0^o.$$
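A minimal sketch of the comb_g merge formulas above (the argument layout, with per-cluster arrays and the two constants `sig0_in` and `sig0_out` standing for σ0^i and σ0^o, is an assumed representation) might be:

```python
import numpy as np

def comb_g(ms, sigmas, m0s, sigma0s, sizes, sig0_in, sig0_out):
    """Merge k clusters into one, following the comb_g formulas above.
    ms, sigmas: k x n arrays; m0s, sigma0s, sizes: length-k arrays."""
    S = np.asarray(sizes, dtype=float)
    M, D = np.asarray(ms, dtype=float), np.asarray(sigmas, dtype=float)
    tot = S.sum()
    m_new = (S[:, None] * M).sum(axis=0) / tot
    num = ((S - 1)[:, None] * (D - sig0_in) ** 2 + S[:, None] * M ** 2).sum(axis=0)
    sigma_new = np.sqrt(num / (tot - 1) - m_new ** 2 * tot / (tot - 1)) + sig0_in
    m0, d0 = np.asarray(m0s, dtype=float), np.asarray(sigma0s, dtype=float)
    m0_new = (S * m0).sum() / tot
    num0 = ((S - 1) * (d0 - sig0_out) ** 2 + S * m0 ** 2).sum()
    sigma0_new = np.sqrt(num0 / (tot - 1) - m0_new ** 2 * tot / (tot - 1)) + sig0_out
    return m_new, sigma_new, m0_new, sigma0_new, tot
```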

Let A contain the J clusters obtained from the data partitioning stage, and let B be an empty set. Firstly, we group the clusters into equivalence


classes. For any two different clusters Ci and Cj in A, we calculate

$$r_{ij}^{i} = \frac{I_i(\mathbf{m}_j) + I_j(\mathbf{m}_i)}{2}, \qquad (52)$$
$$r_{ij}^{o} = \frac{O_i(m_{0j}) + O_j(m_{0i})}{2} \qquad (53)$$

where r^i_ij and r^o_ij are the input-similarity measure and the output-similarity measure, respectively, between Ci and Cj. Ci and Cj are grouped into the same equivalence class if

$$r_{ij}^{i} \ge \rho, \qquad r_{ij}^{o} \ge \varepsilon \qquad (54)$$

where ε is a user-defined threshold. Therefore, the clusters in A are grouped into a set of equivalence classes. If every equivalence class contains only one cluster, we are done and the clusters in A are the desired ones. Otherwise, we check whether the constituent clusters of each equivalence class can be merged together to form a new fuzzy cluster. Let C1, C2, . . . , Ck be the constituent clusters of an equivalence class X. If k = 1, the class has only one cluster and nothing can be merged, so we move the cluster in X from A to B. Otherwise, we calculate

$$O_j'(y) = \mathrm{comb\_g\_y}(O_1(y), O_2(y), \ldots, O_k(y)).$$

If the output-variance test is successful, i.e.,

$$\sigma_{0j}' \le \tau, \qquad (55)$$

then we remove C1, C2, . . . , Ck from A, merge them into a new cluster Cj by

$$C_j = \mathrm{comb\_g}(C_1, C_2, \ldots, C_k), \qquad (56)$$
$$S_j = \sum_{d=1}^{k} S_d, \qquad (57)$$

and put Cj into B. If the output-variance test fails, we do not merge. Instead, we move C1, C2, . . . , Ck of X from A to B. This process iterates until A is empty. Then we move all the clusters in B back to A, increase ρ to (1 + θ)ρ, where θ is a predefined constant rate, and do the whole process again, until every equivalence class has only one cluster. The above procedure can be summarized below.

procedure Cluster Merge
    Let A contain the clusters obtained from data partitioning; B = ∅;
    while (!(every equivalence class in A contains only one cluster))
        for each equivalence class X
            if X has two or more clusters
                if the output-variance test of Eq.(55) succeeds
                    Merge the clusters in X into a new cluster by Eqs.(56) and (57);
                    Put the new cluster into B, and remove the clusters in X from A;
                else remove the clusters in X from A to B;
                endif;
            else remove the clusters in X from A to B;
            endif;
        endfor;
        Increase ρ to (1 + θ)ρ;
    endwhile;
    return with all clusters in A;
end Cluster Merge
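A minimal sketch of the pairwise similarity tests of Eqs.(52)-(54) used to build the equivalence classes (the cluster tuples (m, σ, m0, σ0) are an assumed representation, not the report's notation) might be:

```python
import numpy as np

def input_similarity(mi, si, mj, sj):
    """r^i_ij of Eq. (52) for clusters with input means/deviations (mi, si), (mj, sj)."""
    I = lambda m, s, x: np.exp(-((x - m) / s) ** 2).prod()
    return 0.5 * (I(mi, si, mj) + I(mj, sj, mi))

def output_similarity(m0i, s0i, m0j, s0j):
    """r^o_ij of Eq. (53) for the corresponding output Gaussians."""
    O = lambda m, s, y: np.exp(-((y - m) / s) ** 2)
    return 0.5 * (O(m0i, s0i, m0j) + O(m0j, s0j, m0i))

def same_class(ci, cj, rho, eps):
    """Grouping condition of Eq. (54); ci and cj are (m, sigma, m0, sigma0) tuples."""
    return (input_similarity(ci[0], ci[1], cj[0], cj[1]) >= rho and
            output_similarity(ci[2], ci[3], cj[2], cj[3]) >= eps)
```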

1.2.2 Fuzzy Rules and Neuro-Fuzzy System Modeling

As mentioned earlier, a clustering technique can be used to extract fuzzy rules for creating a neuro-fuzzy system from a given set of input-output data. System outputs can then be inferred from these fuzzy rules for any inputs presented to the system. For the purpose of higher precision, the obtained fuzzy rules can be refined by learning algorithms of neural networks.

Using our merge-based clustering method to extract fuzzy rules is straightforward. Suppose we are given a dataset of an unknown system with n inputs x1, . . . , xn and one output y. We apply our clustering method and obtain a set of J clusters C1, C2, . . . , and CJ. Then we obtain one fuzzy rule from each cluster. For cluster Cj = (Ij(x), Oj(y)), the corresponding fuzzy rule Rj takes the following TSK-based form:

IF x1 IS µ1j(x1) AND x2 IS µ2j(x2) AND . . . AND xn IS µnj(xn)
THEN y IS fj(x) = b0j + b1jx1 + . . . + bnjxn    (58)

where b0j = m0j, b1j = b2j = . . . = bnj = 0 are called consequent parameters, and µij(xi), 1 ≤ i ≤ n, are membership functions defined as

$$\mu_{ij}(x_i) = g(x_i; m_{ij}, \sigma_{ij}) = \exp\left[-\left(\frac{x_i - m_{ij}}{\sigma_{ij}}\right)^2\right] \qquad (59)$$

in which mij and σij are called antecedent parameters. In each rule, the first antecedent corresponds to the first input, the second antecedent corresponds to the second input, etc., and the consequent corresponds to the output. The TSK model is chosen because of its good approximation capability and the simplicity of its operation. Note that b1j to bnj are set temporarily to 0 since the relationship between y and xi cannot be deduced simply from the information kept in Oj(y). The desired values will be learned later. As a result, we have a rule base R = {R1, R2, . . . , RJ} corresponding to the J fuzzy clusters obtained in the previous section.


Figure 5: Architecture of the five-layer neural network.

The obtained rule set R can be used to provide system output values for any given input values through an interpolation of all the relevant individual rules. The degree of relevance of a rule is determined by the degree to which the input data belong to the fuzzy subspace associated with the rule. These degrees of relevance become the weights in the interpolation process. For any input x = [x1, x2, . . . , xn], the system output y is computed by centroid defuzzification as:

$$y = \frac{\sum_{j=1}^{J}\alpha_j(\mathbf{x})\, f_j(\mathbf{x})}{\sum_{j=1}^{J}\alpha_j(\mathbf{x})} = \frac{\sum_{j=1}^{J}\alpha_j(\mathbf{x})\,(b_{0j} + b_{1j}x_1 + \ldots + b_{nj}x_n)}{\sum_{j=1}^{J}\alpha_j(\mathbf{x})} \qquad (60)$$

where the degree to which the input x matches rule Rj is computed using the product operator

$$\alpha_j(\mathbf{x}) = \mu_{1j}(x_1)\times\mu_{2j}(x_2)\times\cdots\times\mu_{nj}(x_n) = I_j(\mathbf{x}) \qquad (61)$$

and is called the firing strength of rule Rj.

To improve the approximation precision of Eq.(60), the learning techniques of neural networks are applied to tune the antecedent and consequent parameters of the whole rule base R. We adopt a hybrid learning algorithm which combines a recursive SVD-based least squares estimator and the gradient descent method to refine these parameters, as described in the next section.

1.2.3 Parameter Refinement by Hybrid Learning

As mentioned, the parameters associated with the obtained rule base R are refined with neural network techniques. Firstly, a five-layer network, with input variables x = [x1, x2, . . . , xn] and output variable y, is constructed from the J fuzzy rules of the obtained rule base, as shown in Figure 5. The five layers are called the fuzzification layer (layer 1), the conjunction layer (layer 2), the normalization layer (layer 3), the inference layer (layer 4), and the output layer (layer 5), respectively. The links connecting inputs x to layer 1 are weighted by (mij, σij), 1 ≤ i ≤ n, 1 ≤ j ≤ J, the links connecting inputs x to layer 4 are weighted by b0j, b1j, . . . , bnj, 1 ≤ j ≤ J, and all the other links are weighted by 1. Note that there are J groups of nodes in Layer 1, each group having n nodes for a rule. Layers 2–4 all have J nodes, one node for a rule. Layer 5 contains only one node, providing the output for the whole system. For any input x, the function of each layer is described as follows:

• Layer 1. Compute the matching degree to a fuzzy condition involving one variable, i.e.,

$$\mu_{ij}(x_i) = g(x_i; m_{ij}, \sigma_{ij}), \quad 1 \le i \le n, \; 1 \le j \le J. \qquad (62)$$

• Layer 2. Compute the firing strength of each rule, i.e.,

$$\alpha_j(\mathbf{x}) = \prod_{i=1}^{n}\mu_{ij}(x_i) = \prod_{i=1}^{n} g(x_i; m_{ij}, \sigma_{ij}) = I_j(\mathbf{x}), \quad 1 \le j \le J. \qquad (63)$$

• Layer 3. Compute the normalized matching degree for each rule, i.e.,

$$r_j(\mathbf{x}) = \frac{\alpha_j(\mathbf{x})}{\sum_{k=1}^{J}\alpha_k(\mathbf{x})}, \quad 1 \le j \le J. \qquad (64)$$

• Layer 4. Compute the conclusion inferred by each fuzzy rule, i.e.,

$$s_j(\mathbf{x}) = r_j(\mathbf{x})\, f_j(\mathbf{x}) \qquad (65)$$
$$= r_j(\mathbf{x})\,(b_{0j} + b_{1j}x_1 + b_{2j}x_2 + \ldots + b_{nj}x_n), \quad 1 \le j \le J. \qquad (66)$$

• Layer 5. Combine the conclusions of all fuzzy rules and obtain the network output:

$$y = \sum_{j=1}^{J} s_j(\mathbf{x}). \qquad (67)$$

It is easy to see that Eq.(67) is identical to Eq.(60) for any input x. Obviously, the constructed network performs the fuzzy rule-based inference. Therefore, we may apply the learning techniques for neural networks to refine the parameters associated with the fuzzy rules.
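A minimal sketch of the five-layer TSK inference of Eqs.(62)-(67) (assuming the rule parameters are stored as J×n arrays m and sigma and a J×(n+1) consequent matrix b, with b[:, 0] holding the constant terms b0j) might be:

```python
import numpy as np

def tsk_output(x, m, sigma, b):
    """Network output of Eqs. (62)-(67) for one input vector x."""
    alpha = np.exp(-((x - m) / sigma) ** 2).prod(axis=1)   # layers 1-2, Eqs. (62)-(63)
    r = alpha / alpha.sum()                                # layer 3, Eq. (64)
    f = b[:, 0] + b[:, 1:] @ x                             # consequents f_j(x)
    return float(np.dot(r, f))                             # layers 4-5, Eqs. (65)-(67)
```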

A hybrid learning algorithm (HLA) is adopted for tuning the parameters associated with the network, and hence the rule base, efficiently. In particular, a recursive SVD-based least squares estimator is used to optimize the consequent parameters, i.e., b0j, b1j, . . . , bnj, 1 ≤ j ≤ J, and the gradient descent method is used to optimize the antecedent parameters, i.e., mij and σij, 1 ≤ i ≤ n, 1 ≤ j ≤ J. Suppose we have a total of N training patterns. An iteration of learning involves the presentation of all training patterns. In each iteration of learning, both the recursive


SVD-based least squares estimator and the gradient descent method are applied. We first treat all the antecedent parameters as fixed, and use the recursive SVD-based least squares estimator to optimize the consequent parameters. Then we treat all the consequent parameters as fixed and use the gradient descent method to refine the antecedent parameters. The process is iterated until the desired approximation precision is achieved.

Recursive SVD-Based Least Squares Estimator

When the antecedent parameters are fixed, the optimization of the consequent parameters can be regarded as a special case of the linear regression model, and SVD-based optimization algorithms can be applied. Let tv = (pv, qv) be the vth training pattern, where pv = [p1v, . . . , pnv] is the input vector and qv is the desired output. From Eqs.(66) and (67), the network output yv is

$$y_v = \sum_{j=1}^{J} r_j(\mathbf{p}_v)\,(b_{0j} + b_{1j}p_{1v} + b_{2j}p_{2v} + \ldots + b_{nj}p_{nv}) \qquad (68)$$

for input pattern tv. For all N training patterns, we have N equations of the form of Eq.(68). We would like the following mean square error (MSE):

$$E = \frac{1}{N}\sum_{v=1}^{N}(q_v - y_v)^2 = \frac{1}{N}\sum_{v=1}^{N}\Big[q_v - \sum_{j=1}^{J} r_j(\mathbf{p}_v) f_j(\mathbf{p}_v)\Big]^2 \qquad (69)$$

to be as small as possible. Let

$$\mathbf{Q} = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix}^T, \qquad \mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{11}p_{n1} & \cdots & a_{1J}p_{n1} \\ a_{21} & \cdots & a_{21}p_{n2} & \cdots & a_{2J}p_{n2} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{N1}p_{nN} & \cdots & a_{NJ}p_{nN} \end{bmatrix},$$

$$\mathbf{X} = \begin{bmatrix} b_{01} & \cdots & b_{n1} & \cdots & b_{0J} & \cdots & b_{nJ} \end{bmatrix}^T,$$

where

$$a_{ij} = r_j(\mathbf{p}_i), \quad 1 \le i \le N, \; 1 \le j \le J. \qquad (70)$$

Then minimizing Eq.(69) is equivalent to minimizing

$$E_1 = \|\mathbf{Q} - \mathbf{A}\mathbf{X}\| \qquad (71)$$

where, for a matrix D, ‖D‖ is defined to be $\sqrt{\mathrm{trace}(\mathbf{D}^T\mathbf{D})}$. Since we treat all the parameters in the antecedent part as fixed at this point, A is fixed and X is the only variable vector in Eq.(71).

Eq.(71) is a special form of the linear regression model, and the optimal solution X which minimizes Eq.(71) can be obtained by techniques based on singular value decomposition (SVD) [11]. When many training patterns are involved in an application, A becomes a very large matrix, and conventional SVD-based methods are time-consuming and memory-demanding in obtaining solutions. We adopt the recursive SVD-based least squares estimator to find the optimal solution X to Eq.(71). With this method, training patterns are considered one by one, starting with the first pattern until the last pattern. In each iteration, only a small matrix has to be decomposed, instead of decomposing a large matrix as in conventional SVD-based algorithms. Therefore, the recursive SVD-based least squares estimator requires less time and space, which is very useful when many training patterns are considered.
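A minimal sketch of how the rows of A in Eq.(70) can be assembled and handed to the recursive estimator sketched earlier (an illustrative layout, not code from the report) might be:

```python
import numpy as np

def consequent_rows(P, m, sigma):
    """Rows of A in Eq. (70): block j of row v is r_j(p_v) * [1, p_1v, ..., p_nv]."""
    rows = []
    for p in np.asarray(P, dtype=float):
        alpha = np.exp(-((p - m) / sigma) ** 2).prod(axis=1)
        r = alpha / alpha.sum()                            # normalized firing strengths
        rows.append(np.concatenate([rj * np.concatenate(([1.0], p)) for rj in r]))
    return rows

# Consequent parameters of Eq. (71), reusing the earlier recursive estimator sketch:
# X = recursive_svd_lse(consequent_rows(P, m, sigma), Q)
# X stacks [b_01, ..., b_n1, ..., b_0J, ..., b_nJ] in the order used in the text.
```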

Gradient Descent Method

Optimization of the antecedent parameters involves a nonlinear formulation, and thus the gradient descent method is adopted. However, since the consequent parameters are taken as fixed, the formulation can be much simplified. Again, let tv = (pv, qv) be the vth training pattern and yv be the corresponding network output. The error function we consider is one half of the MSE E of Eq.(69), i.e.,

$$E_2 = \frac{1}{2N}\sum_{v=1}^{N}(q_v - y_v)^2 \qquad (72)$$

where the factor 2 in the denominator is simply for the convenience of taking derivatives of E2. In order to work properly with the recursive SVD-based estimator, the batch back-propagation (BP) mode is adopted, in which weight updating is performed after the presentation of all the training data in an iteration [13].

The learning rule for mij is

$$m_{ij}^{new} = m_{ij}^{old} - \eta_1\left(\frac{\partial E_2}{\partial m_{ij}}\right), \quad 1 \le i \le n, \; 1 \le j \le J \qquad (73)$$

where η1 is the learning rate and

$$\frac{\partial E_2}{\partial m_{ij}} = \frac{1}{N}\sum_{v=1}^{N}\left\{[y_v - q_v]\,\frac{\partial y_v}{\partial m_{ij}}\right\} = \frac{1}{N}\sum_{v=1}^{N}\left\{[y_v - q_v]\,\frac{f_j(\mathbf{p}_v) - y_v}{\sum_{r=1}^{J}\alpha_r(\mathbf{p}_v)}\,\frac{\partial \alpha_j(\mathbf{p}_v)}{\partial m_{ij}}\right\}$$

$$= \frac{1}{N}\sum_{v=1}^{N}\left\{[y_v - q_v]\,\frac{[f_j(\mathbf{p}_v) - y_v]\,\alpha_j(\mathbf{p}_v)}{\sum_{r=1}^{J}\alpha_r(\mathbf{p}_v)}\,\frac{\partial}{\partial m_{ij}}\left[-\left(\frac{p_{iv} - m_{ij}}{\sigma_{ij}}\right)^2\right]\right\}$$


$$= \frac{2}{N}\sum_{v=1}^{N}\left\{[y_v - q_v]\,\frac{[f_j(\mathbf{p}_v) - y_v][p_{iv} - m_{ij}]\,\alpha_j(\mathbf{p}_v)}{\sigma_{ij}^2\sum_{r=1}^{J}\alpha_r(\mathbf{p}_v)}\right\}. \qquad (74)$$

Similarly, we have

$$\sigma_{ij}^{new} = \sigma_{ij}^{old} - \eta_2\left(\frac{\partial E_2}{\partial \sigma_{ij}}\right), \quad 1 \le i \le n, \; 1 \le j \le J \qquad (75)$$

where η2 is the learning rate and

$$\frac{\partial E_2}{\partial \sigma_{ij}} = \frac{2}{N}\sum_{v=1}^{N}\left\{[y_v - q_v]\,\frac{[f_j(\mathbf{p}_v) - y_v][p_{iv} - m_{ij}]^2\,\alpha_j(\mathbf{p}_v)}{\sigma_{ij}^3\sum_{r=1}^{J}\alpha_r(\mathbf{p}_v)}\right\}. \qquad (76)$$

1.2.4 Experimental Results

We demonstrate the effectiveness of our approach by showing the results of three experiments, which were done on a PC with an AMD Athlon XP CPU and 256 MB of memory. The first experiment shows that our clustering method, MFC, can locate clusters in a reasonable way by revealing the structure and similarity of the training data in both the input and output subspaces. The second experiment shows that MFC can alleviate the problem of order bias. Finally, the performance of our approach on high-dimensional and real datasets is investigated. A comparison between our system and other systems, including Yen's system [51], Juang's system [19, 20], and Lee's system [28], is also given. Among these systems, Lee's system uses a constant in the consequent part of the fuzzy rules, while Yen's system and Juang's system, like our system, adopt the TSK-type fuzzy rule form which has a linear model in the consequent part. To extract fuzzy rules from the given input-output dataset, Yen's system uses an SVD-QR with column pivoting algorithm (SVD-QR-CP), Juang's system uses an aligned clustering-based algorithm (ACA), and Lee's system uses a self-constructing rule generation algorithm (SCRG). Note that SCRG and ACA are incremental clustering methods. When the fuzzy rules have been obtained, different learning techniques are used for parameter refinement. Yen's system applies the conventional SVD technique once to obtain the optimal values of the consequent parameters, without any further learning for the antecedent parameters. Juang's system uses BP to train both the antecedent and consequent parameters, and Lee's system uses the same hybrid learning algorithm.

Experiment 2.1

Figure 6: The original function for Experiment 2.1.

A good clustering algorithm for neuro-fuzzy system modeling should take both the input and output subspaces into account. That is, it should reveal the structure of the training data in the input subspace, but also preserve the homogeneity of the output responses of data belonging to the same cluster. Also, the density of the prototypes produced should be higher in input areas with highly variant outputs than in input areas with flat outputs. In this experiment, we show that MFC meets these requirements better than other methods. Consider the following nonlinear function:

$$y = \frac{1}{(10x-3)^2+1} + \frac{1}{(10x-9)^2+4} + 0.06 \qquad (77)$$

which is drawn in Figure 6. Note that y has two peaks in the range x = [0, 1], where there is a large variation in output values. We take 125 training patterns unevenly from the range x ∈ [−1, 2], with 2/5 of the patterns taken from the range x ∈ [−1, 0], 1/5 from x = [0, 1], and 2/5 from x = [1, 2]. On the other hand, we take 100 points as testing patterns, with 1/4 of the points taken from the range x ∈ [−1, 0], 2/4 from x = [0, 1], and 1/4 from x = [1, 2]. We compare MFC with SVD-QR-CP and ACA on the locations of the obtained clusters, as shown in Figure 7, in which the training data are represented by dots and the centers of the clusters are represented by cross marks. The bottom of each sub-figure shows the location, shape, and size of each cluster in the x direction. Eight rules are obtained from all the methods. Figure 7(a) shows where the rules are located by SVD-QR-CP. Two rules are located in the left flat area, two rules are located in the right flat area, and four rules are located near or in the central area. Obviously, more rules are needed in the central area where there are great variations in y. This deficiency of rules results in a poor approximation to the original function, as shown by the solid curve in Figure 7(a). A similar situation happens with ACA, as shown in Figure 7(b); ACA generates three rules in the right flat area. On the other hand, MFC generates five rules in the central area, as shown in Figure 7(c), which greatly improves the approximation precision in this area without hurting the approximation capability in the other areas. Figure 8 shows the approximation results after the 8 fuzzy rules are refined by the learning techniques of the corresponding systems. Obviously, our system provides the best approximation to the original function.


Table 17: Comparison on efficiency of different clustering methods with different numbers of rules for Experiment 2.1.

Method | Training MSE (7 rules) | Testing MSE (7 rules) | Time (7 rules) | Training MSE (11 rules) | Testing MSE (11 rules) | Time (11 rules)
SVD-QR-CP | 1.18×10⁻² | 2.75×10⁻² | 0.06 | 9.44×10⁻³ | 2.26×10⁻² | 0.11
ACA | 7.37×10⁻³ | 2.14×10⁻² | 0.02 | 7.22×10⁻³ | 1.87×10⁻² | 0.05
MFC | 5.05×10⁻³ | 1.22×10⁻² | 0.05 | 3.13×10⁻³ | 7.34×10⁻³ | 0.06

Table 18: Comparison on learning performance of different systems for Experiment 2.1.

System | Training MSE (7 rules) | Testing MSE (7 rules) | Iters | Time | Training MSE (11 rules) | Testing MSE (11 rules) | Iters | Time
Yen's system | 2.82×10⁻³ | 6.15×10⁻³ | 1 | 0.16 | 8.26×10⁻⁴ | 1.74×10⁻³ | 1 | 0.17
Juang's system | 4.47×10⁻⁴ | 2.12×10⁻³ | 196 | 72.36 | 2.00×10⁻⁴ | 7.21×10⁻⁴ | 325 | 158.32
Our system | 4.32×10⁻⁴ | 1.03×10⁻³ | 2 | 0.17 | 1.98×10⁻⁴ | 4.24×10⁻⁴ | 4 | 0.28

Figure 7: Experiment 2.1: (a) results obtained by SVD-QR-CP; (b) results obtained by ACA; (c) results obtained by MFC.

Figure 8: Experiment 2.1: (a) output of the refined rules by Yen's system; (b) output of the refined rules by Juang's system; (c) output of the refined rules by our system.

