Mining fuzzy association rules for classification problems
Yi-Chung Hu^a, Ruey-Shun Chen^a, Gwo-Hshiung Tzeng^b,*

^a Institute of Information Management, National Chiao Tung University, Hsinchu 300, Taiwan, ROC
^b Institute of Management of Technology, National Chiao Tung University, Hsinchu 300, Taiwan, ROC
Abstract
The effective development of data mining techniques to discover knowledge from training samples for classification problems in industrial engineering, such as group technology, is necessary. This paper proposes a learning algorithm, which can be viewed as a knowledge acquisition tool, to effectively discover fuzzy association rules for classification problems. The consequent part of each rule is one class label. The proposed learning algorithm consists of two phases: one to generate large fuzzy grids from training samples by fuzzy partitioning in each attribute, and the other to generate fuzzy association rules for classification problems from the large fuzzy grids. The proposed learning algorithm is implemented by scanning the training samples stored in a database only once and applying a sequence of Boolean operations to generate fuzzy grids and fuzzy rules; therefore, it can be easily extended to discover other types of fuzzy association rules. The simulation results on the iris data demonstrate that the proposed learning algorithm can effectively derive fuzzy association rules for classification problems. © 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Data mining; Knowledge acquisition; Classification problems; Association rules
1. Introduction
Data mining is a methodology for the extraction of new knowledge from data. This knowledge may
relate to a problem that we want to solve (Myra, 2000). Thus, data mining can ease the knowledge
acquisition bottleneck in building prototype systems (Hong & Chen, 1999; Hong, Wang, Wang, &
Chien, 2000). On the other hand, database-mining problems involving classification can be viewed
within a common framework of rule discovery (Agrawal, Imielinski, & Swami, 1993). These concepts
demonstrate that the effective development of data mining techniques to discover knowledge from training samples for classification problems is necessary, particularly for classification problems in industrial engineering, such as group technology.
0360-8352/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0360-8352(02)00136-5
www.elsevier.com/locate/dsw
* Corresponding author.
Recently, the discovery of association rules from databases has become an important research topic, and association rules have been applied to market basket analysis to help managers determine which items are frequently purchased together by customers (Berry & Linoff, 1997; Han & Kamber, 2001; Yilmaz, Triantaphyllou, Chen, & Liao, 2002). Initially, Agrawal, Mannila, Srikant, Toivonen, and Verkamo (1996) proposed the Apriori algorithm to quickly find association rules. Han, Karypis, and Kumar (2000) also proposed parallel data mining techniques implemented in large databases. Generally, there are two phases for mining association rules. In phase I, we find large itemsets, whose supports are larger than or equal to the user-specified minimum support. If there are k items in a large itemset, then we call it a large k-itemset, and the Apriori property shows that any subset of a large itemset must also be large (Han & Kamber, 2001). In phase II, we use the large itemsets generated in phase I to generate effective association rules. An association rule is effective if its confidence is larger than or equal to the user-specified minimum confidence.
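The two crisp thresholds described above can be sketched as follows; this is a minimal illustration only, and the transaction data and function names are our own, not from the paper:

```python
# Minimal sketch of the support and confidence measures used in
# association-rule mining.  Names and data are illustrative.

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Toy market-basket data: four transactions over three items.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

# {bread, milk} appears in 2 of the 4 transactions.
assert support({"bread", "milk"}, transactions) == 0.5
```

An itemset is "large" when its support reaches the minimum support; a rule is "effective" when its confidence reaches the minimum confidence.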
In this paper, we propose a learning algorithm to discover fuzzy associative classification rules for classification problems. We define a fuzzy associative classification rule as a fuzzy if-then rule whose consequent part is one class label. Since the comprehensibility of fuzzy rules by human users is a criterion in designing a fuzzy rule-based system (Ishibuchi, Nakashima, & Murata, 1999), fuzzy associative classification rules with a linguistic interpretation must be taken into account. To cope with this problem, both quantitative and categorical attributes are divided into fuzzy partitions using the concept of fuzzy grids, which result from fuzzy partitioning of the feature space (Ishibuchi, Nozaki, Yamamoto, & Tanaka, 1995; Ishibuchi et al., 1999). Since each fuzzy partition is a fuzzy number, a linguistic interpretation of each fuzzy partition is easily obtained.
Each fuzzy partition defined on either a quantitative or a categorical attribute is viewed as a candidate one-dimensional (1-dim) fuzzy grid used to generate large k-dim (k ≥ 1) fuzzy grids. We give the definitions of the fuzzy support and the fuzzy confidence to determine which candidate fuzzy grids are large and which fuzzy rules are effective, respectively. The proposed learning algorithm consists of two phases: one to generate large fuzzy grids from training samples by fuzzy partitioning in each attribute, and the other to generate fuzzy associative classification rules from these large fuzzy grids. The proposed learning algorithm is implemented by scanning the training samples stored in a database only once and applying a sequence of Boolean operations to generate fuzzy grids and fuzzy associative classification rules; therefore, it can be easily extended to discover other types of fuzzy association rules, such as those for market basket analysis. The well-known iris data proposed by Fisher (1936) is used to compare the performance of the proposed learning algorithm with other classification methods, such as the genetic-algorithm-based method (Ishibuchi et al., 1995) and the methods whose results are reported by Grabisch and Dispot (1992). The simulation results reported in this paper demonstrate that the proposed learning algorithm compares well with other classification methods. Therefore, the proposed learning algorithm can effectively derive fuzzy associative classification rules; moreover, the goal of knowledge acquisition can also be easily achieved.
This paper is organized as follows. The concepts of fuzzy partitions are introduced in Section 2. In Section 3, we give definitions for the fuzzy support and the fuzzy confidence, and the proposed learning algorithm is also presented in this section. In Section 4, the performance of the proposed learning algorithm is examined by computer simulation on the iris data. Discussions and conclusions are presented in Sections 5 and 6, respectively.
2. Fuzzy partitions
The notation used in this paper is as follows:

C              total number of class labels
d              total number of data attributes, where 1 ≤ d
k              dimension of one fuzzy grid, where 1 ≤ k ≤ d
K0             maximal number of fuzzy partitions in each quantitative attribute
A^{xm}_{K,im}  the im-th of K fuzzy partitions defined on attribute xm, where 1 ≤ m ≤ d, 3 ≤ K ≤ K0, and 1 ≤ im ≤ K
μ^{xm}_{K,im}  membership function of A^{xm}_{K,im}
tp             p-th sample or tuple, where tp = (tp1, tp2, …, tpd) and p ≥ 1
Fuzzy sets were proposed by Zadeh (1965), who also proposed the concept of a linguistic variable and its applications to approximate reasoning (Zadeh, 1975a,b,c, 1976). A linguistic variable is a variable whose values are linguistic words or sentences in a natural language (Chen & Jong, 1997). The division of attributes into fuzzy partitions has also been widely used in pattern recognition and fuzzy reasoning. Examples are the application by Ishibuchi et al. to pattern classification (Ishibuchi, Nozaki, & Tanaka, 1992; Ishibuchi et al., 1995; Ishibuchi, Murata, & Gen, 1998; Ishibuchi et al., 1999) and fuzzy rule generation by Wang and Mendel (1992). In addition, some methods for partitioning an attribute space were discussed by Sun (1994).
Fuzzy partitioning of quantitative and categorical attributes is introduced in Sections 2.1 and 2.2, respectively.
2.1. Fuzzy partitioning in quantitative attributes
The proposed learning algorithm includes two fuzzy-partitioning methods: one is the M-type (multiple-type) division method (MTDM), and the other is the S-type (single-type) division method (STDM). The MTDM divides each quantitative attribute into (3 + 4 + … + K0) fuzzy partitions; that is, we sequentially divide each quantitative attribute into 3, 4, …, K0 fuzzy partitions. In the STDM, only K0 fuzzy partitions are defined. In both methods, K0 is pre-specified before executing the proposed learning algorithm. Triangular membership functions are used for the fuzzy partitions defined on quantitative attributes. Hence, these fuzzy partitions are fuzzy numbers, where a fuzzy number is a fuzzy partition of the universe of discourse that is both convex and normal (Chen & Jong, 1997). For example, using the STDM, Figs. 1 and 2 depict K0 = 3 and K0 = 4 for the attribute 'Width' (denoted by x1), which ranges from 0 to 60. Then μ^{Width}_{K,i1} can be represented as follows:
μ^{Width}_{K,i1}(x) = max{1 − |x − a^K_{i1}| / b^K, 0}        (1)

where

a^K_{i1} = mi + (ma − mi)(i1 − 1)/(K − 1)        (2)

b^K = (ma − mi)/(K − 1)        (3)
where ma is the maximal value of the domain and mi is the minimal value; here, ma = 60 and mi = 0 for Width. Moreover, if we view Width as a linguistic variable, then the linguistic term A^{Width}_{K,i1} can be described in sentences for different i1:

A^{Width}_{K,1}: Width is small, and below 60/(K − 1)        (4)

A^{Width}_{K,K}: Width is large, and above 60 − 60/(K − 1)        (5)

A^{Width}_{K,i1}: Width is close to (i1 − 1)[60/(K − 1)], and between (i1 − 2)[60/(K − 1)] and i1[60/(K − 1)], for 1 < i1 < K        (6)
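Eqs. (1)-(3) can be sketched as follows; this is an illustrative implementation only, reusing the paper's 'Width' example with mi = 0 and ma = 60:

```python
# Sketch of Eqs. (1)-(3): triangular membership functions for K fuzzy
# partitions on a quantitative attribute with domain [mi, ma].

def triangular_membership(x, i1, K, mi=0.0, ma=60.0):
    a = mi + (ma - mi) * (i1 - 1) / (K - 1)  # Eq. (2): centre of partition i1
    b = (ma - mi) / (K - 1)                  # Eq. (3): half-width of support
    return max(1.0 - abs(x - a) / b, 0.0)    # Eq. (1)

# For K = 3 the centres sit at 0, 30 and 60; each centre has full membership,
# and membership decays linearly to 0 at the neighbouring centres.
assert triangular_membership(30.0, i1=2, K=3) == 1.0
assert triangular_membership(15.0, i1=2, K=3) == 0.5
```

Note that adjacent partitions overlap, so a value such as 15 has non-zero membership in both the 'small' and the middle partition.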
Clearly, the set of candidate 1-dim fuzzy grids generated for the same K0 by the STDM is contained in the one generated by the MTDM. For example, when K0 = 4, {A^{Width}_{4,1}, A^{Width}_{4,2}, A^{Width}_{4,3}, A^{Width}_{4,4}} is generated by the STDM, and {A^{Width}_{3,1}, A^{Width}_{3,2}, A^{Width}_{3,3}, A^{Width}_{4,1}, A^{Width}_{4,2}, A^{Width}_{4,3}, A^{Width}_{4,4}} is generated by the MTDM. If we divide both Width and 'Length' (denoted by x2) into three fuzzy partitions, then the feature space is divided into nine 2-dim fuzzy grids, as shown in Fig. 3. The shaded 2-dim fuzzy grid in Fig. 3 can be represented as A^{Width}_{3,1} × A^{Length}_{3,3}.
Fig. 2. K0 = 4 for Width.

Note that Ishibuchi et al. (1999) proposed another method to define fuzzy sets for discrete values of quantitative attributes. For example, if one quantitative attribute has two attribute values {0, 1}, then we may use the fuzzy sets 'small' and 'large' with degrees 0.0 and 1.0, respectively.
2.2. Fuzzy partitioning in categorical attributes
If the number of distinct categorical attribute values equals n0 (where n0 is finite), then this attribute can only be divided into n0 fuzzy partitions. At first, we view a categorical attribute as a quantitative attribute; that is, each value of the categorical attribute corresponds to an integer, which is helpful for dividing the categorical attribute. A linguistic term A^{xm}_{n0,im} (1 ≤ im ≤ n0) is defined on the partition (im − ε, im + ε), where ε → 0. The membership function of A^{xm}_{n0,im} is 1 on this partition.
For example, 'class label' is a linguistic variable, and suppose its values include 'class 1' and 'class 2', which correspond to 1 and 2, respectively. The result of the partitioning is shown in Fig. 4. There are two fuzzy partitions distributed over the class label: one is (1 − ε, 1 + ε) and the other is (2 − ε, 2 + ε), with ε → 0. The linguistic terms A^{class label}_{2,1} and A^{class label}_{2,2} can be interpreted as Eqs. (7) and (8), respectively, and their membership functions are described in Eqs. (9) and (10). Sometimes a quantitative attribute with discrete values (e.g. the number of cars a person owns) can also be divided in this way.

A^{class label}_{2,1}: class label is class 1        (7)

A^{class label}_{2,2}: class label is class 2        (8)

μ^{class label}_{2,1}(x) = 1, 1 − ε ≤ x ≤ 1 + ε, ε → 0        (9)

μ^{class label}_{2,2}(x) = 1, 2 − ε ≤ x ≤ 2 + ε, ε → 0        (10)
Initially, each fuzzy partition distributed in either quantitative or categorical attributes is viewed as a candidate 1-dim fuzzy grid. The next important task is how to use the candidate 1-dim fuzzy grids to generate the other large fuzzy grids and fuzzy associative classification rules. Therefore, we propose the learning algorithm described in Section 3.
3. Mining fuzzy associative classification rules
In this section, we first describe the learning model for generating fuzzy associative classification rules in Fig. 5. From this figure, we can see that the rules are generated in the two phases of the proposed algorithm: large fuzzy grids are generated in phase I, and effective fuzzy associative classification rules in phase II. We describe the two phases of the learning model in Sections 3.1 and 3.2, respectively.
3.1. Phase 1: generate large fuzzy grids
Suppose each attribute xm is divided into K0 fuzzy partitions. Without loss of generality, consider a candidate k-dim fuzzy grid A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{x_{k−1}}_{K,i_{k−1}} × A^{xk}_{K,ik}, where 1 ≤ i1, i2, …, ik ≤ K, with 3 ≤ K ≤ K0 for the MTDM and K = K0 for the STDM. The degree to which tp belongs to this fuzzy grid is computed as μ^{x1}_{K,i1}(tp1) μ^{x2}_{K,i2}(tp2) … μ^{xk}_{K,ik}(tpk). To check whether this fuzzy grid is large or not, we define the fuzzy support FS(A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik}) as follows:

FS(A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik}) = [Σ_{p=1}^{n} μ^{x1}_{K,i1}(tp1) μ^{x2}_{K,i2}(tp2) … μ^{xk}_{K,ik}(tpk)] / n        (11)

where n is the number of training samples. When FS(A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik}) is larger than or equal to the user-specified minimum fuzzy support (min FS), we say that A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik} is a large k-dim fuzzy grid. This is similar to defining a large k-itemset, whose support is larger than or equal to the user-specified minimum support.
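Eq. (11) can be sketched as follows; the function and variable names are illustrative, and the toy membership function stands in for the triangular partitions of Section 2:

```python
# Sketch of Eq. (11): the fuzzy support of a k-dim fuzzy grid is the mean,
# over all n training samples, of the product of the membership degrees of
# the sample's attribute values in the grid's k fuzzy partitions.

def fuzzy_support(memberships, samples):
    """memberships: one membership function per attribute of the grid;
    samples: list of attribute-value tuples (one value per grid attribute)."""
    n = len(samples)
    total = 0.0
    for sample in samples:
        degree = 1.0
        for mu, value in zip(memberships, sample):
            degree *= mu(value)        # product over the k dimensions
        total += degree
    return total / n                   # Eq. (11)

# Toy 2-dim grid ('low' on x1, 'low' on x2) and two samples.
low = lambda x: max(1.0 - x, 0.0)      # illustrative membership function
assert fuzzy_support([low, low], [(0.0, 0.0), (0.5, 0.5)]) == 0.625
```

The grid is reserved as large when this value reaches the user-specified min FS.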
Table FGTTFS_K is implemented to generate the large fuzzy grids for each K. FGTTFS_K consists of the following substructures:

(a) Fuzzy grids table (FG_K): each row represents a fuzzy grid, and each column represents a fuzzy partition A^{xm}_{K,im}.
(b) Transaction table (TT_K): each column represents tp, and each element records the membership degree of the corresponding fuzzy grid.
(c) Column FS_K: stores the fuzzy support corresponding to each fuzzy grid in FG_K.
An initial table FGTTFS3 is shown in Table 1 as an example, in which there are two tuples t1 and t2 and two attributes x1 and x2 in a given database. Both x1 and x2 are divided into three fuzzy partitions (i.e. K0 = 3). In the learning process for K0 = 4, both FGTTFS3 and FGTTFS4 are used for M-type division, and only FGTTFS4 for S-type division. Assume that x2 is the class-label attribute. Since each row of FG_K is a bit string consisting of 0s and 1s, FG_K[u] and FG_K[v] (the u-th and v-th rows of FG_K) can be paired to generate certain desired results by applying Boolean operations. For example, if we apply the OR operation to the two rows FG3[1] = (1, 0, 0, 0, 0, 0) and FG3[4] = (0, 0, 0, 1, 0, 0), then (FG3[1] OR FG3[4]) = (1, 0, 0, 1, 0, 0), corresponding to the candidate 2-dim fuzzy grid A^{x1}_{3,1} × A^{x2}_{3,1}, is generated. Then FS(A^{x1}_{3,1} × A^{x2}_{3,1}) = TT3[1]·TT3[4] = [μ^{x1}_{3,1}(t11) μ^{x2}_{3,1}(t12) + μ^{x1}_{3,1}(t21) μ^{x2}_{3,1}(t22)]/2 is obtained and compared with min FS.
However, two fuzzy partitions defined on the same attribute cannot be contained in the same candidate k-dim fuzzy grid (k ≥ 2); therefore, bit strings such as (1, 1, 0, 0, 0, 0) and (0, 0, 0, 1, 1, 0) are invalid. To solve this problem, we implement a 1-dim array, Group of Fuzzy Grids (GFG_K), from which we can easily distinguish which fuzzy partitions are defined on the same attribute. Each index of GFG_K corresponds to a fuzzy partition, and fuzzy partitions defined on the same attribute are set to the same integer. GFG3 is shown as an example in Table 2. For instance, since GFG3[1] = GFG3[2] = 1, the string (1, 1, 0, 0, 0, 0) generated by FG3[1] OR FG3[2] = (1, 0, 0, 0, 0, 0) OR (0, 1, 0, 0, 0, 0) is invalid.
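The OR-based candidate generation and the GFG validity check can be sketched as follows; the array layout mirrors the two-attribute, three-partition example above, and all names are illustrative:

```python
# Sketch of candidate generation by Boolean OR, with the GFG validity check:
# a candidate is rejected if any two of its set bits fall in the same
# attribute group, since one grid cannot hold two partitions of one attribute.

GFG3 = [1, 1, 1, 2, 2, 2]        # attribute group of each fuzzy partition

def or_rows(u, v):
    return [a | b for a, b in zip(u, v)]

def is_valid(candidate, gfg):
    groups = [g for bit, g in zip(candidate, gfg) if bit]
    return len(groups) == len(set(groups))   # no group appears twice

row1 = [1, 0, 0, 0, 0, 0]        # A^{x1}_{3,1}
row2 = [0, 1, 0, 0, 0, 0]        # A^{x1}_{3,2}
row4 = [0, 0, 0, 1, 0, 0]        # A^{x2}_{3,1}

assert is_valid(or_rows(row1, row4), GFG3)       # A^{x1}_{3,1} x A^{x2}_{3,1}
assert not is_valid(or_rows(row1, row2), GFG3)   # two partitions on x1
```

This reproduces the example above: (1, 1, 0, 0, 0, 0) is rejected because its two set bits share group 1, while (1, 0, 0, 1, 0, 0) is kept.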
In the Apriori algorithm (Agrawal et al., 1996), two large (k − 1)-itemsets that share (k − 2) items are joined into a candidate k-itemset. Similarly, a candidate k-dim (3 ≤ k ≤ d) fuzzy grid is derived by joining two large (k − 1)-dim fuzzy grids that share (k − 2) fuzzy partitions. For example, we can join A^{x1}_{3,2} × A^{x2}_{3,1} and A^{x1}_{3,2} × A^{x3}_{3,3} to generate the candidate 3-dim fuzzy grid A^{x1}_{3,2} × A^{x2}_{3,1} × A^{x3}_{3,3}, because A^{x1}_{3,2} × A^{x2}_{3,1} and A^{x1}_{3,2} × A^{x3}_{3,3} share the linguistic term A^{x1}_{3,2}. However, A^{x1}_{3,2} × A^{x2}_{3,1} × A^{x3}_{3,3} can also be generated by joining A^{x1}_{3,2} × A^{x2}_{3,1} to A^{x2}_{3,1} × A^{x3}_{3,3}. This implies that we must select one of the possible combinations to avoid redundant computation. The method we adopt is the following: if there exist integers 1 ≤ e1 < e2 < … < ek ≤ d such that FG_K[u, e1] = FG_K[u, e2] = … = FG_K[u, e_{k−2}] = FG_K[u, e_{k−1}] = 1 and FG_K[v, e1] = FG_K[v, e2] = … = FG_K[v, e_{k−2}] = FG_K[v, ek] = 1, where FG_K[u] and FG_K[v] correspond to large (k − 1)-dim fuzzy grids, then FG_K[u] and FG_K[v] can be paired to generate a candidate k-dim fuzzy grid.
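The ordered join condition above can be sketched as follows; the nine-column bit-string layout (three attributes with three partitions each) and the grid names are illustrative:

```python
# Sketch of the Apriori-style join on bit strings: two (k-1)-dim grids are
# paired only when they share their first k-2 set positions and the last set
# position of u precedes that of v, so each candidate is generated once.

def can_join(u, v):
    pu = [i for i, bit in enumerate(u) if bit]   # set positions of row u
    pv = [i for i, bit in enumerate(v) if bit]   # set positions of row v
    return pu[:-1] == pv[:-1] and pu[-1] < pv[-1]

grid_x1x2 = [0, 1, 0, 1, 0, 0, 0, 0, 0]   # A^{x1}_{3,2} x A^{x2}_{3,1}
grid_x1x3 = [0, 1, 0, 0, 0, 0, 0, 0, 1]   # A^{x1}_{3,2} x A^{x3}_{3,3}
grid_x2x3 = [0, 0, 0, 1, 0, 0, 0, 0, 1]   # A^{x2}_{3,1} x A^{x3}_{3,3}

# Only one of the possible pairs is allowed to produce the 3-dim candidate.
assert can_join(grid_x1x2, grid_x1x3)
assert not can_join(grid_x1x2, grid_x2x3)
```

Under this rule, the candidate A^{x1}_{3,2} × A^{x2}_{3,1} × A^{x3}_{3,3} is generated exactly once, from the pair sharing the prefix A^{x1}_{3,2}.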
Table 1
Initial table FGTTFS3

Fuzzy grid      FG3 (columns A^{x1}_{3,1}, A^{x1}_{3,2}, A^{x1}_{3,3}, A^{x2}_{3,1}, A^{x2}_{3,2}, A^{x2}_{3,3})    TT3 (columns t1, t2)                        FS3
A^{x1}_{3,1}    1 0 0 0 0 0    μ^{x1}_{3,1}(t11), μ^{x1}_{3,1}(t21)    FS(A^{x1}_{3,1})
A^{x1}_{3,2}    0 1 0 0 0 0    μ^{x1}_{3,2}(t11), μ^{x1}_{3,2}(t21)    FS(A^{x1}_{3,2})
A^{x1}_{3,3}    0 0 1 0 0 0    μ^{x1}_{3,3}(t11), μ^{x1}_{3,3}(t21)    FS(A^{x1}_{3,3})
A^{x2}_{3,1}    0 0 0 1 0 0    μ^{x2}_{3,1}(t12), μ^{x2}_{3,1}(t22)    FS(A^{x2}_{3,1})
A^{x2}_{3,2}    0 0 0 0 1 0    μ^{x2}_{3,2}(t12), μ^{x2}_{3,2}(t22)    FS(A^{x2}_{3,2})
A^{x2}_{3,3}    0 0 0 0 0 1    μ^{x2}_{3,3}(t12), μ^{x2}_{3,3}(t22)    FS(A^{x2}_{3,3})
3.2. Phase 2: generate effective fuzzy associative classification rules
The general type of fuzzy associative classification rule R is stated as Eq. (12):

Rule R: A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{x_{k−1}}_{K,i_{k−1}} × A^{xk}_{K,ik} ⇒ A^{xa}_{C,ia} with FC(R)        (12)

where xa (1 ≤ a ≤ d) is the class label and FC(R) is the fuzzy confidence of the rule 'A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik} ⇒ A^{xa}_{C,ia}'. The above rule represents: if x1 is A^{x1}_{K,i1} and x2 is A^{x2}_{K,i2}, …, and xk is A^{xk}_{K,ik}, then xa is A^{xa}_{C,ia}. The left-hand side of '⇒' is the antecedent part of R, and the right-hand side is the consequent part. FC(R) can be viewed as the grade of certainty of R. R is generated from two large fuzzy grids: one is A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik} × A^{xa}_{C,ia} and the other is A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik}. We define the fuzzy confidence FC(R) of R as follows:

FC(R) = FS(A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik} × A^{xa}_{C,ia}) / FS(A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xk}_{K,ik})        (13)

If FC(R) is larger than or equal to the user-specified minimum fuzzy confidence (min FC), then R is effective and is reserved. This is similar to defining an effective association rule, whose confidence is larger than or equal to the user-specified minimum confidence. We again apply Boolean operations to
obtain the antecedent part and consequent part of each fuzzy rule. For example, suppose FG3[u] = (1, 0, 0, 0, 0, 0) and FG3[v] = (1, 0, 0, 1, 0, 0) correspond to large fuzzy grids Lu and Lv, respectively, where Lv ⊂ Lu. Then FG3[u] AND FG3[v] = (1, 0, 0, 0, 0, 0), corresponding to the large fuzzy grid A^{x1}_{3,1}, is generated as the antecedent part of rule R, and FG3[u] XOR FG3[v] = (0, 0, 0, 1, 0, 0), corresponding to the large fuzzy grid A^{x2}_{3,1}, is generated as the consequent part of rule R. Then FC(R) = FS(A^{x1}_{3,1} × A^{x2}_{3,1}) / FS(A^{x1}_{3,1}) is obtained and compared with min FC to determine whether R is effective.
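The AND/XOR extraction of a rule and its fuzzy confidence, Eq. (13), can be sketched as follows; the fuzzy support values are illustrative numbers, not results from the paper:

```python
# Sketch of Phase II's Boolean rule extraction: for large grids Lu and Lv
# with Lv = Lu extended by one class-label partition, AND recovers the
# antecedent bits, XOR the consequent bit, and FC(R) = FS(Lv)/FS(Lu).

def extract_rule(fg_u, fg_v, fs_u, fs_v):
    antecedent = [a & b for a, b in zip(fg_u, fg_v)]   # partitions of Lu
    consequent = [a ^ b for a, b in zip(fg_u, fg_v)]   # the extra partition
    return antecedent, consequent, fs_v / fs_u         # Eq. (13)

fg_u = [1, 0, 0, 0, 0, 0]            # Lu = A^{x1}_{3,1}
fg_v = [1, 0, 0, 1, 0, 0]            # Lv = A^{x1}_{3,1} x A^{x2}_{3,1}
antecedent, consequent, fc = extract_rule(fg_u, fg_v, fs_u=0.40, fs_v=0.34)

assert antecedent == [1, 0, 0, 0, 0, 0]
assert consequent == [0, 0, 0, 1, 0, 0]
assert abs(fc - 0.85) < 1e-9         # reserved when fc >= min FC
```

With these illustrative supports, the rule 'if x1 is A^{x1}_{3,1} then x2 is A^{x2}_{3,1}' would be reserved for any min FC up to 0.85.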
However, some redundant rules must be eliminated to achieve compactness. If two rules R and S have the same consequent part and the antecedent part of S is contained in that of R, then R is redundant and can be discarded, while S is temporarily reserved. For example, if S is 'A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{x_{k−1}}_{K,i_{k−1}} ⇒ A^{xa}_{C,ia}', then R can be eliminated, because the number of antecedent conditions should be minimized.
The main difference between the MTDM and the STDM is that the MTDM must process every number of fuzzy partitions K = 3, 4, …, K0 in each quantitative attribute, whereas the STDM uses only K = K0; otherwise, the learning algorithms for the MTDM and the STDM are almost the same. We describe the general algorithm as follows.
Algorithm: learning algorithm for mining fuzzy associative classification rules

Input:
a. A set of training samples selected from the specified classification problem
b. The user-specified minimum fuzzy support (min FS) and minimum fuzzy confidence (min FC)
c. K0

Output: effective fuzzy associative classification rules

Method:
Phase I: Generate large fuzzy grids
Phase II: Generate effective fuzzy associative classification rules

Table 2
One-dimensional array GFG3 (group of fuzzy grids)

Index:  [1] [2] [3] [4] [5] [6]
Value:   1   1   1   2   2   2
Phase I. Generate large fuzzy grids

Step 1. Fuzzy partitioning in each attribute
Divide each quantitative attribute into fuzzy partitions by M-type or S-type division; the number n0 is also pre-determined for the categorical attribute class label of the specified classification problem (e.g. n0 = 3 for the iris data).

Step 2. Scan the training samples from the database, and then construct the initial tables FGTTFS_K and GFG_K.

Step 3. Generate large fuzzy grids
For FGTTFS_K (K = 3, 4, …, K0 for M-type division and K = K0 for S-type division) do
3-1. Generate large 1-dim fuzzy grids
Set k = 1 and eliminate the rows of the initial FGTTFS_K corresponding to candidate 1-dim fuzzy grids that are not large.
3-2. Generate large k-dim fuzzy grids
Set k ← k + 1. If there is only one large (k − 1)-dim fuzzy grid, then go to phase II.
For two unpaired rows FGTTFS_K[u] and FGTTFS_K[v] (u ≠ v) corresponding to large (k − 1)-dim fuzzy grids do
Compute (FG_K[u] OR FG_K[v]), corresponding to a candidate k-dim fuzzy grid c.
3-2-1. From the non-zero elements of (FG_K[u] OR FG_K[v]), retrieve all corresponding values from GFG_K. If any two values are the same, then discard c and skip Steps 3-2-2, 3-2-3, and 3-2-4; that is, c is invalid.
3-2-2. If FG_K[u] and FG_K[v] do not share (k − 2) linguistic terms, then discard c and skip Steps 3-2-3 and 3-2-4; that is, c is invalid.
3-2-3. If there exist integers 1 ≤ e1 < e2 < … < ek ≤ d such that (FG_K[u] OR FG_K[v])[e1] = (FG_K[u] OR FG_K[v])[e2] = … = (FG_K[u] OR FG_K[v])[ek] = 1, then compute (TT_K[e1]·TT_K[e2]·…·TT_K[ek]) and the fuzzy support fs of c.
3-2-4. Add (FG_K[u] OR FG_K[v]) to FG_K, (TT_K[e1]·TT_K[e2]·…·TT_K[ek]) to TT_K, and fs to FS_K when fs is larger than or equal to the minimum fuzzy support; otherwise, discard c.
End
3-3. Check whether any large k-dim fuzzy grid is generated or not
If any large k-dim fuzzy grid is generated, then go to Step 3-2. Note that the final FGTTFS_K stores only large fuzzy grids.
End
Phase II: Generate effective fuzzy associative classification rules
For FGTTFS_K (K = 3, 4, …, K0 for M-type division and K = K0 for S-type division) do
Step 1. Generate effective fuzzy rules
For two unpaired rows FG_K[u] and FG_K[v] (u < v) corresponding to large fuzzy grids Lu and Lv, respectively, do
1-1. Generate the antecedent part of fuzzy rule Rv:
Compute the number of non-zero elements of FG_K[u] as temp.
If the number of non-zero elements in (FG_K[u] AND FG_K[v]) is equal to temp, then Lv ⊂ Lu holds, and the antecedent part of Rv is generated as (FG_K[u] AND FG_K[v]) = Lu; else skip Steps 1-2 and 1-3.
1-2. Generate the consequent part of fuzzy rule Rv:
Use (FG_K[u] XOR FG_K[v]) to obtain the consequent part.
If (FG_K[u] XOR FG_K[v]) contains only one fuzzy partition, and that partition is defined on the class label, then generate Rv; else skip Step 1-3.
1-3. Check whether Rv can be reserved or not: FC(Rv) = FS(Lv)/FS(Lu).
If FC(Rv) ≥ min FC, then mark FG_K[v] to represent that Rv and FC(Rv) are reserved; else discard Rv.
End
Step 2. Reduce redundant rules
For any two marked rows FG_K[u] and FG_K[v] (u < v) corresponding to effective fuzzy rules Ru and Rv, respectively, do
If FG_K[u] = (FG_K[u] AND FG_K[v]), then unmark FG_K[v].
End
End
The marked rows of FG_K are used for classification. The performance of the proposed learning algorithm depends mainly on the number of candidate grids in phase I and the number of large grids in phase II. Clearly, the proposed learning algorithm is implemented by scanning the training samples stored in a database only once and applying a sequence of Boolean operations to generate fuzzy grids and fuzzy rules. In Section 4, simulation results are presented to demonstrate the effectiveness of the proposed learning algorithm.
4. Experiments
In this section, the performance of the proposed learning algorithm is examined. We employ the proposed algorithm to discover fuzzy associative classification rules from the well-known iris data
proposed by Fisher (1936). The computer programs were coded in Delphi 5.0 and executed on a personal computer with a Pentium III-500 CPU and 128 MB RAM running Windows 98. The iris data consists of three classes (Class 1: Iris setosa, Class 2: Iris versicolor, and Class 3: Iris virginica), and each class consists of 50 data points (Ishibuchi et al., 1995); moreover, Class 2 overlaps with Class 3. These data are stored in a relational database. Suppose that attribute x1 is the sepal length, x2 the sepal width, x3 the petal length, x4 the petal width, and x5 the class label (i.e. n0 = 3 for x5) to which tp = (tp1, tp2, …, tpd) (1 ≤ p ≤ 150) belongs. The pairs (ma, mi) for x1, x2, x3, and x4 are (79, 43), (44, 20), (69, 10), and (25, 1), respectively. Although the iris data is a crisp data set, fuzzy rules can still be extracted from these 150 data points by the proposed learning algorithm.
Now we determine the class label of tp by applying the generated fuzzy rules to classify the iris data. Without loss of generality, if the antecedent part of a fuzzy associative classification rule Rt is A^{x1}_{K,i1} × A^{x2}_{K,i2} × … × A^{xt}_{K,it}, then the compatibility grade μt(tp) of tp is calculated as μ^{x1}_{K,i1}(tp1) μ^{x2}_{K,i2}(tp2) … μ^{xt}_{K,it}(tpt). Then tp is classified by the consequent part of Rb, where

μb(tp) FC(Rb) = max_j {μj(tp) FC(Rj) | Rj ∈ TR}        (14)

and TR is the set of fuzzy rules generated by the proposed learning algorithm.
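The winner-take-all decision of Eq. (14) can be sketched as follows; the rule base and membership functions below are illustrative toys, not rules learned from the iris data:

```python
# Sketch of Eq. (14): each rule votes with the product of its compatibility
# grade and its fuzzy confidence; the sample takes the class of the winner.

def classify(sample, rules):
    """rules: list of (membership_functions, fuzzy_confidence, class_label)."""
    best_label, best_score = None, -1.0
    for memberships, fc, label in rules:
        grade = 1.0
        for mu, value in zip(memberships, sample):
            grade *= mu(value)             # compatibility grade mu_j(t_p)
        if grade * fc > best_score:        # maximise mu_j(t_p) * FC(R_j)
            best_score, best_label = grade * fc, label
    return best_label

small = lambda x: max(1.0 - x, 0.0)        # toy fuzzy partitions on [0, 1]
large = lambda x: max(x, 0.0) if x <= 1 else 1.0

rules = [([small], 0.9, "class 1"),        # if x1 is small then class 1
         ([large], 0.8, "class 2")]        # if x1 is large then class 2

assert classify((0.2,), rules) == "class 1"   # 0.8*0.9 beats 0.2*0.8
assert classify((0.9,), rules) == "class 2"   # 0.1*0.9 loses to 0.9*0.8
```

Weighting the compatibility grade by FC(R) lets a more certain rule win over a slightly more compatible but less certain one.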
First, we consider K0 = 6 for each attribute except x5. Only three fuzzy partitions can be defined on x5: A^{class label}_{3,1}: 'tp belongs to Class 1', A^{class label}_{3,2}: 'tp belongs to Class 2', and A^{class label}_{3,3}: 'tp belongs to Class 3'. Simulation results with different user-specified minimum fuzzy support and minimum fuzzy confidence are shown in Tables 3 and 4 for the MTDM and the STDM, respectively. From Tables 3 and 4, we can see that the classification accuracy rates are more sensitive to larger min FS; therefore, a smaller min FS could be a better choice. On the other hand, rules generated by the MTDM are more robust than those generated by the STDM with respect to different parameter specifications. For example, rules generated by the MTDM work well for min FS = 0.15 with different min FC, whereas much lower rates are obtained by rules generated by the STDM when min FC is larger than 0.75.
From Tables 3 and 4, we can see that the best classification accuracy rate, 96.67%, is obtained in both tables with (min FS, min FC) = (0.05, 0.85) and (0.10, 0.80); hence, these best parameter specifications are used in subsequent simulations. For different values of K0, we show the simulation results in Tables 5 and 6 with (0.05, 0.85) and (0.10, 0.80), respectively. From Tables 5 and 6, we can see that the classification accuracy rates are not sensitive to larger values of K0 (i.e. K0 = 6, 7, 8). For comparison, we show simulation results with the same best parameter specifications obtained by the STDM in Tables 7 and 8. Comparing Table 7 with Table 5 and Table 8 with Table 6, we can see that the results of the MTDM are more robust than those of the STDM with respect to larger values of K0 (i.e. K0 = 6, 7, 8). For the same parameter specification, the training time of the STDM is shorter than that of the MTDM for all values of K0.
In the above simulations, all 150 data are used as training samples to generate fuzzy rules. To examine the error rate of the proposed learning algorithm on testing samples, we perform the leaving-one-out technique, which is an almost unbiased estimator of the true error rate of a classifier (Weiss & Kulikowski, 1991). In each iteration of the leaving-one-out technique, fuzzy rules are generated from 149 training samples and tested on the single remaining sample; this procedure is iterated until each of the 150 data has been used as a test sample. Classification rates with different parameter specifications are shown in Table 9, from which we can see that the best results obtained by the MTDM and the STDM are both 96.67%. However, some poor results (e.g. 91.33 and 92.67% for K0 = 7) are obtained by the STDM, resulting from overfitting the training samples.

Table 3
Classification accuracy rate (%) by M-type division

Min fuzzy confidence    Min fuzzy support
                        0.05     0.10     0.15     0.20
0.50                    96.67    96.67    94.00    92.67
0.55                    96.67    96.67    94.00    91.33
0.60                    96.67    96.67    94.00    91.33
0.65                    96.67    96.67    94.00    91.33
0.70                    96.67    96.67    94.00    91.33
0.75                    96.67    96.67    94.00    94.00
0.80                    96.67    96.67    96.67    96.67
0.85                    96.67    96.67    96.67    96.67
0.90                    96.00    96.67    96.00    66.00
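The leaving-one-out procedure used above can be sketched as follows; the `train`/`classify` callables are placeholders for the paper's two-phase learner, and the trivial majority-class learner is ours, used only to make the sketch runnable:

```python
# Sketch of leaving-one-out evaluation: train on n-1 samples, test on the
# held-out one, and repeat for every sample (149/1 splits for the iris data).

def leave_one_out_accuracy(samples, labels, train, classify):
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # all but the i-th sample
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        correct += classify(model, samples[i]) == labels[i]
    return correct / len(samples)

# Placeholder learner: always predict the majority class of the training set.
def train(xs, ys):
    return max(set(ys), key=ys.count)

def classify(model, x):
    return model

assert leave_one_out_accuracy([1, 2, 3, 4], ["a", "a", "a", "b"],
                              train, classify) == 0.75
```

Because every sample is held out exactly once, the averaged accuracy is an almost unbiased estimate of the true rate, at the cost of training n times.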
Based on the leaving-one-out technique, we compare the proposed learning algorithm with other fuzzy classification methods. Previously, Ishibuchi et al. (1995) proposed a genetic-algorithm-based method to select fuzzy classification rules for the iris data; they reported a classification rate of 94.67% by the leaving-one-out technique. Several parameters, including the stopping condition (1000 generations), the population size, and the biased mutation probability, also had to be specified. It is clear that the best result (i.e. 96.67%) of the proposed learning algorithm outperforms that of Ishibuchi et al.'s genetic-algorithm-based method.

Previously, error rates of nine fuzzy classification methods for the iris data (fuzzy integral with the perceptron criterion, fuzzy integral with the quadratic criterion, minimum operator, fast heuristic search with the Sugeno integral, simulated annealing with the Sugeno integral, fuzzy k-nearest neighbor, fuzzy c-means, fuzzy c-means for histograms, and hierarchical fuzzy c-means), estimated by the leaving-one-out technique, were reported by Grabisch and Dispot (1992). Their best result (i.e. 96.67%) was obtained by the fuzzy integral with the quadratic criterion and by the fuzzy k-nearest-neighbor method. The best result of the proposed learning algorithm is thus equal to the best result of these nine fuzzy methods. Moreover, because a linguistic interpretation of each fuzzy associative classification rule is easily obtained, the goal of knowledge acquisition can be achieved by inspecting the fuzzy rules generated by the proposed learning algorithm.

Table 4
Classification accuracy rate (%) by S-type division

Min fuzzy confidence    Min fuzzy support
                        0.05     0.10     0.15     0.20
0.50                    95.33    94.67    92.67    88.67
0.55                    95.33    94.67    92.67    88.67
0.60                    95.33    96.00    92.67    88.67
0.65                    95.33    96.00    92.67    88.67
0.70                    95.33    96.00    92.67    88.67
0.75                    95.33    96.00    66.00    66.00
0.80                    96.00    96.67    66.67    66.00
0.85                    96.67    94.67    66.67    66.00
0.90                    94.67    94.67    66.67    66.00

Table 5
Simulation results by the MTDM with min FS = 0.05, min FC = 0.85

K0    Classification accuracy rate (%)
3     66.67
4     92.67
5     94.00
6     96.67
7     96.67
8     96.67
5. Discussions
Fuzzy associative classification rules with linguistic interpretation discovered by data mining techniques are helpful to build a prototype fuzzy knowledge base of the fuzzy classifier system. For this, the generation of fuzzy classification rules with linguistic interpretation from the training data becomes quite necessary. The proposed learning algorithm can also be viewed as a knowledge acquisition tool for classification problems.
The performance of the proposed learning algorithm was tested on the iris data. The simulation results show that the classification rates are more sensitive to larger values of min FS, indicating that a smaller min FS can be a better choice. A smaller min FS preserves a larger number of large fuzzy grids; that is, more valuable information is retained for generating fuzzy rules. Moreover, the MTDM is not sensitive to larger values of K (i.e. K = 6–8) and works more robustly than the STDM. When examining the generalization ability on testing samples, we find that some poor results (e.g. 91.33 and 92.67% for K = 7, 8) are obtained by the STDM with larger values of K. That is, the STDM can suffer from overfitting the training samples, because larger values of K result in finer partitions of the feature space. From the viewpoint of improving classification rates, it therefore seems better for the proposed learning algorithm to use a smaller min FS together with the MTDM to derive fuzzy rules with high classification capability, although the number of rules increases as well. To verify the generality of this observation, the proposed learning algorithm should be further tested on other classification problems. Recently, automating the classification task of group technology has become a significant research topic, e.g. the classification of block-shaped parts by Chuang, Wang, and Wu (1999). After features and data from workpieces are collected, the proposed learning algorithm can be employed to discover fuzzy rules and classify the data.
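Why a larger K invites overfitting can be seen from the shape of the fuzzy partitions themselves. The sketch below assumes evenly spaced symmetric triangular membership functions on a normalized attribute range, a common choice for grid-partition fuzzy classifiers; the function name and normalization are illustrative, not taken from the paper.

```python
# Sketch: partition a normalized attribute range [0, 1] into K triangular
# fuzzy sets with evenly spaced peaks. A larger K narrows every triangle,
# so each fuzzy grid covers fewer training samples -- finer partitions of
# the feature space, and hence a greater risk of overfitting.

def triangular_membership(x, k, K):
    """Membership of x in the k-th (0-based) of K triangular fuzzy sets."""
    center = k / (K - 1)      # peak of the k-th fuzzy set
    width = 1.0 / (K - 1)     # half-base of each triangle
    return max(0.0, 1.0 - abs(x - center) / width)

# With K = 3 the point 0.25 belongs half to the first and half to the
# second fuzzy set; with K = 8 its membership is far more localized.
print(triangular_membership(0.25, 0, 3))  # 0.5
print(triangular_membership(0.25, 1, 3))  # 0.5
print(triangular_membership(0.25, 0, 8))  # 0.0 (first triangle too narrow)
```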
Table 6
Simulation results by the MTDM with min FS = 0.10 and min FC = 0.80

K   Classification accuracy rate (%)
3   66.67
4   88.00
5   95.33
6   96.67
7   96.67
8   96.67

The performance of the proposed learning algorithm in phase I and phase II mainly depends on the number of candidate grids and the number of large grids, respectively. Phase I, which generates the candidate fuzzy grids, consumes much more time. For simplicity, we briefly discuss the time complexity of phase I in terms of operations on FGTTFS_K. For FGTTFS_K, we need (dK)^2, n(dK), and (dK) operations to build FG_K, TT_K, and FS_K, respectively. In addition, the worst cases for operating FG_K, TT_K, and FS_K in generating candidate k-dim (k ≥ 2) fuzzy grids are roughly measured as (dK)·C(s_{k-1}, 2), n·C(s_{k-1}, 2), and C(s_{k-1}, 2), respectively, where s_i (i ≥ 1) denotes the number of large i-dim fuzzy grids and C(s, 2) = s(s-1)/2. Therefore, the worst case of phase I of the proposed learning algorithm can be roughly measured as (dK)^2 + n(dK) + (dK) + Σ_{i=1}^{k-1} (n + dK + 1)·C(s_i, 2).
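This worst-case estimate for phase I is easy to evaluate numerically. The sketch below simply tallies the operation-count formula; the attribute count d, partition number K, sample size n, and the large-grid sizes s_i are all hypothetical values chosen for illustration.

```python
# Rough worst-case operation count for phase I:
#   (dK)^2 + n(dK) + (dK) + sum over i of (n + dK + 1) * C(s_i, 2),
# where s_i is the (assumed) number of large i-dim fuzzy grids.

from math import comb  # C(s, 2) = s * (s - 1) / 2

def phase1_worst_case(d, K, n, large_sizes):
    """large_sizes[i-1] = s_i, the number of large i-dim fuzzy grids."""
    dK = d * K
    ops = dK ** 2 + n * dK + dK          # building FG_K, TT_K, and FS_K
    for s_i in large_sizes:              # candidate (i+1)-dim grids
        ops += (n + dK + 1) * comb(s_i, 2)
    return ops

# Iris-like setting: d = 4 attributes, K = 3 partitions, n = 150 samples,
# with assumed s_1 = 10 and s_2 = 8 large fuzzy grids.
print(phase1_worst_case(4, 3, 150, [10, 8]))  # 13855
```

The pairwise term C(s_i, 2) dominates once the number of large grids grows, which is why keeping min FS small must be balanced against the cost of retaining many grids.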
We stress the feasibility and the problem-solving capability of the proposed method for classification problems, rather than providing formal methods to find general parameter specifications that obtain the best classification accuracy rate. Indeed, it seems relatively difficult to determine appropriate values of K, the minimum fuzzy support, and the minimum fuzzy confidence. Some tuning methods have been proposed previously, for example, ANFIS by Jang (1993) and the design of fuzzy controllers by Homaifar and McCormick (1995). On the other hand, Nozaki, Ishibuchi, and Tanaka (1996) demonstrated that the performance of fuzzy rule-based systems can be improved by adjusting the grade of certainty of each rule. Therefore, developing methods based on machine learning techniques, such as genetic algorithms or neural networks, to determine appropriate membership functions and the fuzzy confidence of each rule and thereby obtain higher classification rates is quite appropriate and is left for future research.
6. Conclusions
In this paper, we propose a learning algorithm to discover effective fuzzy associative classification rules. As explained earlier, the proposed learning algorithm consists of two phases: one to generate large fuzzy grids from training samples by fuzzy partitioning in each attribute, and the other to generate fuzzy associative classification rules from the large fuzzy grids. The proposed learning algorithm is implemented by scanning the training samples stored in a database only once and applying a sequence of table operations to generate fuzzy grids and fuzzy rules. Therefore, it can be easily extended to discover other types of fuzzy association rules, for example for market basket analysis, which can help managers design different store layouts and help retailers plan which items to put on sale (Han & Kamber, 2001). In particular, because each fuzzy partition is a fuzzy number, a linguistic interpretation of each fuzzy partition is easily obtained.

Table 7
Simulation results by the STDM with min FS = 0.05 and min FC = 0.85

K   Classification accuracy rate (%)
3   66.67
4   94.67
5   94.00
6   96.67
7   96.00
8   95.33

Table 8
Simulation results by the STDM with min FS = 0.10 and min FC = 0.80

K   Classification accuracy rate (%)
3   66.67
4   91.33
5   94.00
6   96.67
7   93.33
8   95.33
The performance of the proposed learning algorithm was tested on the iris data. According to the simulation results, the MTDM is not sensitive to larger values of K (i.e. K = 6–8) and works more robustly than the STDM. Based on the leaving-one-out technique, we compared the proposed learning algorithm with other fuzzy classification methods. The best result obtained by the proposed learning algorithm clearly outperforms that of Ishibuchi et al.'s genetic-algorithm-based method, and it equals the best result reported by Grabisch and Dispot (1992) for nine fuzzy classification methods. Moreover, because a linguistic interpretation of each fuzzy associative classification rule is easily obtained, the goal of knowledge acquisition for users can be achieved by checking the fuzzy rules. The simulation results from the iris data thus indicate that the proposed learning algorithm can effectively derive fuzzy associative classification rules. In addition, from the discussions in Section 5, it seems better for classification problems to perform the proposed learning algorithm using the MTDM with a smaller min FS.
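The leaving-one-out protocol behind these comparisons can be sketched generically. The harness below is a minimal stand-in: `train_and_classify` abstracts whatever classifier-building step is plugged in (here a toy majority-vote rule, not the proposed two-phase algorithm), and all names and data are illustrative.

```python
# Minimal leave-one-out harness of the kind used to estimate the accuracy
# rates in Table 9: each sample is held out once, a classifier is built on
# the remaining samples, and the held-out sample is classified.

def leave_one_out_accuracy(samples, labels, train_and_classify):
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i, train on the rest, then classify sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if train_and_classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / len(samples)

# Toy stand-in classifier: predict the majority label of the training fold.
def majority(train_x, train_y, x):
    return max(set(train_y), key=train_y.count)

xs = [[0.1], [0.2], [0.3], [0.9]]
ys = ["a", "a", "a", "b"]
print(leave_one_out_accuracy(xs, ys, majority))  # 75.0
```

For the 150-sample iris data this means 150 train-and-classify runs per parameter setting, which is why the once-only database scan of the proposed algorithm matters for the cost of such an evaluation.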
Acknowledgements
We are very grateful to the anonymous referees for their valuable comments and constructive suggestions. This research was supported by the National Science Council under grant NSC-90-2416-H-009-002.
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914 – 925.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 307 – 328). Menlo Park: AAAI Press.
Berry, M., & Linoff, G. (1997). Data mining techniques: For marketing, sales, and customer support. New York: Wiley.

Table 9
Classification accuracy rates (%) obtained by the leaving-one-out technique

Min FS   Min FC   Division method   K = 6   K = 7   K = 8
0.05     0.85     MTDM              96.67   96.67   96.67
                  STDM              95.33   93.33   91.33
0.10     0.80     MTDM              96.67   96.00   96.00
Chen, S. M., & Jong, W. T. (1997). Fuzzy query translation for relational database systems. IEEE Transactions on Systems, Man, and Cybernetics, 27(4), 714 – 721.
Chuang, J. H., Wang, P. H., & Wu, M. C. (1999). Automatic classification of block-shaped parts based on their 2D projections. Computers and Industrial Engineering, 36(3), 697 – 718.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179 – 188.
Grabisch, M., & Dispot, F. (1992). A comparison of some methods of fuzzy classification on real data. Proceedings of the Second International Conference on Fuzzy Logic and Neural Networks, Japan, pp. 659 – 662.
Han, J. W., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Han, E. H., Karypis, G., & Kumar, V. (2000). Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3), 337 – 352.
Homaifar, A., & McCormick, E. (1995). Simultaneous design of membership functions and rule sets for fuzzy controllers using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(2), 129 – 139.
Hong, T. P., & Chen, J. B. (1999). Finding relevant attributes and membership functions. Fuzzy Sets and Systems, 103(3), 389 – 404.
Hong, T. P., Wang, T. T., Wang, S. L., & Chien, B. C. (2000). Learning a coverage set of maximally general fuzzy rules by rough sets. Expert Systems with Applications, 19(2), 97 – 103.
Ishibuchi, H., Nozaki, K., & Tanaka, H. (1992). Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets and Systems, 52(1), 21 – 32.
Ishibuchi, H., Nozaki, K., Yamamoto, N., & Tanaka, H. (1995). Selecting fuzzy if – then rules for classification problems using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(3), 260 – 270.
Ishibuchi, H., Murata, T., & Gen, M. (1998). Performance evaluation of fuzzy rule-based classification systems obtained by multi-objective genetic algorithms. Computers and Industrial Engineering, 35(3 – 4), 575 – 578.
Ishibuchi, H., Nakashima, T., & Murata, T. (1999). Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Transactions on Systems, Man, and Cybernetics, 29(5), 601 – 618.
Jang, J. S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Transactions on Systems, Man, and Cybernetics, 23(3), 665 – 685.
Myra, S. (2000). Web usage mining for web site evaluation. Communications of the ACM, 43(8), 127 – 134.
Nozaki, K., Ishibuchi, H., & Tanaka, H. (1996). Adaptive fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 4(3), 238 – 250.
Sun, C. T. (1994). Rule-base structure identification in an adaptive-network-based fuzzy inference system. IEEE Transactions on Fuzzy Systems, 2(1), 64 – 73.
Wang, L. X., & Mendel, J. M. (1992). Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man, and Cybernetics, 22(6), 1414 – 1427.
Weiss, S. M., & Kulikowski, C. A. (1991). Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning, and expert systems. CA: Morgan Kaufmann.
Yilmaz, E., Triantaphyllou, E., Chen, J., & Liao, T. W. (2002). A heuristic for mining association rules in polynomial time. Mathematical and Computer Modelling, in press.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338 – 353.
Zadeh, L. A. (1975a). The concept of a linguistic variable and its application to approximate reasoning (part 1). Information Sciences, 8(3), 199 – 249.
Zadeh, L. A. (1975b). The concept of a linguistic variable and its application to approximate reasoning (part 2). Information Sciences, 8(4), 301 – 357.
Zadeh, L. A. (1976). The concept of a linguistic variable and its application to approximate reasoning (part 3). Information Sciences, 9(1), 43 – 80.