
Fuzzy classification trees for data analysis

I-Jen Chiang a,∗, Jane Yung-jen Hsu b

a Department of Medical Informatics, Taipei Medical University, Taipei, Taiwan 105, ROC
b Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 100, ROC

Received 21 March 1997; received in revised form 2 October 2001; accepted 10 October 2001

Abstract

Overly generalized predictions are a serious problem in concept classification. In particular, the boundaries among classes are not always clearly defined. For example, diagnoses based on data from biochemical laboratory examinations usually carry uncertainty. Such uncertainties make prediction more difficult than it is for noise-free data. To address these problems, the idea of fuzzy classification is proposed. This paper presents the basic definitions of fuzzy classification trees along with their construction algorithm. The fuzzy classification tree is a new model that integrates fuzzy classifiers with decision trees and works well in classifying noisy data. Instead of determining a single class for any given instance, fuzzy classification predicts the degree of possibility for every class.

Empirical results on datasets from the UCI Repository are given to compare FCT with C4.5. Generally speaking, FCT obtains better results than C4.5. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Artificial intelligence; Decision making; Classifications; Information theory; Decision trees; Tree classifiers

1. Introduction

Discovering regularities in complex data is an important research topic. Many problems in medical, astronomical, and financial applications involve a large amount of data that need to be classified. Classification can be thought of as the basis of knowledge acquisition [25]. Current classification techniques, e.g. decision trees [13,22–24,30,33], work well for pattern recognition and process control. Unfortunately, when uncertainties and noise are considered, the data become very difficult to classify clearly.

∗ Corresponding author. 3F, 8-1 Tai-An Street, Taipei 100, Taiwan.

E-mail addresses: chiang@robot.csie.ntu.edu.tw (I-J. Chiang), yjhsu@csie.ntu.edu.tw (J.Y.-j. Hsu).

The following example illustrates why those approaches fail to classify such data.

Example 1. Table 1 lists the instances for a classification task to decide whether a patient is at risk of having a stroke. The attributes are the systolic and diastolic arterial blood pressures.

Figs. 1 and 2 show the decision trees generated by C4.5/ID3 [35]. A new instance with blood pressures of systolic = 154 and diastolic = 74 will be classified as a stroke patient.

Such a conclusion may be incorrect. As we all know, the risk of stroke for people with normal diastolic arterial blood pressure is usually not high, so it is most likely that the patient will not suffer from a stroke. However, in rare cases, the abnormal systolic arterial blood pressure may be caused by a chronic condition which eventually leads to a stroke.



Table 1
A training set about stroke patients

No.   Systolic   Diastolic   Class
1     170        75          Normal
2     180        67          Normal
3     170        95          Stroke
4     181        72          Stroke
5     194        56          Normal
6     195        54          Stroke
7     169        82          Stroke
8     144        90          Normal

Fig. 1. A decision tree for Example 1.


As a result, a more reasonable answer from the classification system should present all probable conclusions, each of which is associated with a degree of possibility. For this purpose, fuzzy classifications were proposed by Hsu and Chiang [9,14].

In this paper, we present this approach to classification in domains with such vague conclusions. Some related work is given in Section 2. The definitions of fuzzy classification trees are presented in Section 3. The attribute selection measures are defined in Section 4. Section 5 describes the basic algorithm for constructing an FCT from a data set, and Section 6 describes the clustering used to derive membership functions. Section 7 shows the empirical results comparing FCT with C4.5 on some UCI repository data sets. The advantages and limitations of fuzzy classification are discussed in Section 8, followed by the conclusion.

2. Related work

Fig. 2. A decision tree for Example 1.

When the number of variables used to describe a process is not large, a fuzzy model represents the process by (1) dividing the whole space into several subspaces, (2) representing each subspace by a simple linear function, and (3) interpolating the subspaces continuously. When a system is very complex, it is necessary to extract the relevant variables in the premises of the fuzzy models. Sugeno and Kang [42] proposed using a mathematical programming method to deal with this problem; a large amount of calculation to identify the premise parameters is unavoidable.

Many methods have been developed for constructing decision trees from collections of examples. Although these methods are useful in building knowledge-based expert systems, they often suffer from inadequately or improperly expressing and handling the vagueness and ambiguity associated with human thinking and perception [47]. Even in Quinlan's work [31], the types of uncertainty handled are only probabilistic, appearing as randomness or noise. Pedrycz and Sosnowski [26] pointed out that the concept of fuzzy granulation realized via context-based clustering is aimed at the discretization process. To cope with vagueness, fuzzy decision trees have been introduced. Hunt et al. [16] proposed the Concept Learning System (CLS) to construct a decision tree that attempts to minimize the cost of classifying chess endgames. Quinlan modified CLS and proposed the ID3 algorithm [27,28]. ID3 represents acquired knowledge in the form of decision trees.


An internal node of a tree specifies a test of an attribute, with each outgoing branch corresponding to a possible result of this test. Leaf nodes represent the class to be assigned to an instance. Quinlan replaced the cost-driven lookahead of CLS with an information-driven evaluation function to solve Michie's challenge for chess endgames. The evaluation function, called the entropy measure, is used to decide the pattern-based features from the chess position. In order to classify an instance, ID3 starts at the root of the tree, evaluates the test (attribute), and takes the branch appropriate to the outcome. Only a subset of the attributes may be encountered on a particular path from the root of a decision tree to a leaf.

Quinlan further refined ID3 [29–32,34]. The concept of processing noisy data and unknown attribute values from ASSISTANT [19] is embedded into ID3 and its successor, C4 [31,34]. During induction, the best attribute test, selected by an information-based measure over all possible attributes, is used to "spawn" a leaf node in the tree. A numerical attribute is discretized by a cut point and treated as binary categorical data. This lets numerical attributes be processed as categorical attributes, so all attributes are taken to be categorical. However, the crisp cut partitions the data too rigidly, and the minority of overlapping data is misclassified. Overfitting of decision trees can be avoided by halting tree growth when no more significant information can be gained. Stopping the recursive construction of the tree at irrelevant attribute tests, for example when no tested attribute's information gain exceeds a threshold or when a χ² test indicates stochastic independence, is used in ID3 and C4 for noisy data processing. C4.5 [35] summarizes Quinlan's ID3 family of algorithms.

2.1. Binary fuzzy decision trees

Fuzzy decision trees were first mentioned by Chang and Pavlidis [7]. In their paper, the fuzzy decision tree is defined to be a binary tree in which each nonterminal node contains four fields: one decision attribute and three links. The links of a node are pointers to its parent and to its left and right children, respectively. Instead of using the efficient but less accurate top-down search, or the inefficient bottom-up search, for parsing the tree, they presented a branch-bound-backtrack algorithm.

This tree search method belongs to the family of branch-and-bound methods. Their paper presented the structure of fuzzy decision trees, a search method, and the relationship between decision trees and fuzzy decision trees, but did not address how to construct the fuzzy decision trees.

Wang and Suen [44] used fuzzy regions to cover the Bayes decision regions and applied their work to the recognition of 3200 Chinese characters. Since error accumulation from classification on decision trees can be very harmful when the number of classes is very large, they extended the regions, together with a prior probability, into fuzzy regions. Feature selection is based on the attribute whose minimum Mahalanobis distance is the maximum. The decision is evaluated by a heuristic function based on the membership functions. Fuzzy logic search is used to find all possible correct classes, and similarity measures are used to determine the most probable class. Global training is applied to expand the decision tree in order to enhance the recognition rate, which provides a lot of flexibility and reduces the error accumulation.

2.2. Post-fuzzification

Cios and Sztandera [10] used a continuous ID3 algorithm to convert a decision tree into a layer of a feedforward neural network. A neuron with a sigmoid function can be viewed as a hyperplane with a fuzzy boundary. Kosko's fuzzy entropy is used to measure the fuzziness of the classification by the neuron. Nodes within the hidden layer are generated until the fuzzy entropy is reduced to zero.

Tani et al. used ID3 to obtain IF-THEN rules, and then used multiple regression analysis to fuzzify the premise of each rule. The premise of each rule is a conjunction of linguistic terms with associated membership functions, which are used to determine the boundaries of the fuzzy sets.

Maher and St. Clair [21] presented UR-ID3 to combine uncertain reasoning with the rule sets produced by ID3. UR-ID3 is a post-fuzzified method: after the ID3 decision tree is constructed, triangularly shaped membership functions are applied to each of the decision values on the branches and to the attribute values. The classification of a test sample is determined by the corresponding set of support intervals for each possible classification.


Chi and Yan [8], following their approach, converted the ID3 rules to fuzzy rules. Suitable membership functions are proposed in their method to measure the degree to which each IF-part is satisfied and the degree to which the overall antecedent conditions are satisfied. A defuzzification method to determine the output for each test input is generated from a two-layer perceptron. Using the same training samples that generated the fuzzy rules, the two-layer perceptron is trained to optimize the connection weights by minimizing a cost function.

Hsu et al. [15] used ID3 to generate fuzzy control rules for mobile robot control. The rules are induced by ID3 from a collection of training data, which combines sensor values and robot actions. Like UR-ID3, post-fuzzification is applied to the generated rules. The fuzzy rules are represented by a neural network architecture, and a gradient-descent approach is used to tune the membership function of each linguistic variable during on-line training.

Suárez and Lutsko [41] generated partial membership in the nodes of a CART decision tree by incorporating features of connectionist methods. After a decision tree has been generated, a reformulation of the tree construction algorithm in terms of fuzzy degrees of membership makes it possible to employ analytic tools in the construction of globally optimal decision trees.

Boyen and Wehenkel [2] proposed using neural networks (multilayer perceptrons) to generate a fuzzy decision tree for power system security assessment.

2.3. Pre-fuzzification

Weber presented Fuzzy-ID3 [45,46]. No fuzziness is involved with categorical attributes; numerical attribute values are fuzzified into linguistic terms before induction. The probability of a fuzzy event replaces the probability of a crisp value for numerical attribute values. Fuzzy entropy [20] is used to measure the disorder of the fuzzified data, and the most suitable attribute is selected for branching according to the fuzzy entropy of each attribute. The branches from a decision node overlap somewhat, but they are not treated as a fuzzy partitioning.

Yuan and Shaw [47] categorize uncertainties into two categories: statistical and cognitive. Cognitive uncertainty has two components: fuzziness and ambiguity. As Yuan and Shaw pointed out, all the previous work fails to address the ambiguity uncertainty. Yuan and Shaw proposed an induction learning algorithm for fuzzy decision trees, focusing on incorporating cognitive uncertainties into the knowledge induction process for classification. Each attribute value is first fuzzified into linguistic terms in a set, each specified by a membership value between 0 and 1. Once the fuzzy sets are introduced, the cognitive uncertainties can be measured by fuzziness (vagueness) measures and ambiguity measures: the fuzzy entropy [20] and the nonspecificity measure, respectively. Based on ID3, the nonspecificity measure is used as the goodness of split in constructing the fuzzy decision tree, which is built by reducing the classification ambiguity. From the rules induced by the fuzzy decision algorithm, the most possible class can be found.

Janickow [18] proposed using exemplar-based learning with fuzzy attributes to build the fuzzy decision tree. Special examples are selected or generated from the data to be used with a proximity measure, which is represented as a membership function. Following the ID3 construction algorithm, Janickow's algorithm expands the tree by applying fuzzy operations to the fuzzy set at each node.

However, whatever form the fuzzy decision tree methods take, they all require two-phase processing to generate the decision rules: they either prefuzzify the data according to domain knowledge or postfuzzify the decision rules generated by the decision tree methods using some tuning method. They do not take into account the distribution of the data, which can lead to improper classifications.

3. Definitions

This section introduces the concept of fuzzy classification and its basic definitions [14].

Classification problems are concerned with assigning classes to instances. Each instance can be described in terms of a set of attribute values, which are used as the basis for classification. Therefore, given an arbitrary set of data, the most important issue is to identify their key attributes.


Consider an ordered set of attributes A = (a_1, a_2, ..., a_n) for instance description. For example, deciding whether one should play golf depends on attributes {Outlook, Temperature, ...}. An attribute value vector x = ⟨x_1, x_2, ..., x_n⟩ consists of the value x_i for the corresponding attribute a_i. Each attribute may take either ordered or categorical values. Ordered values are typically numerical, either discrete or continuous, while categorical values are symbolic. The attribute value vector for the golf example may look like ⟨sunny, 85°F, ...⟩, in which attribute Outlook takes a symbol as its value and attribute Temperature is numerical. As another example, in a study on physical examinations performed at National Taiwan University, there were more than 400 attributes associated with the examinations. The first attribute was sex, which had a symbolic value of either male or female; the second attribute, age, had a discrete value ranging from 0 to 120; the third attribute, height, was defined over a continuous range from 0 to 200.

3.1. Classifications

A classification problem is defined as a pair (X, C), where X, called the instance space, is the collection of all possible instances, and C = {C_1, C_2, ..., C_n} is the set of all possible classes. A classifier maps each instance into a class. Formally,

Definition 1. A classifier D for a given classification problem (X, C) defines a total function

D : X → C.

Every instance x ∈ X is classified by the decision function into a single class D(x).

The goal of classification is to find a classifier D that correctly classifies the given set of instances. Let Class(x) denote the actual class of an instance x ∈ X. An instance x is said to be misclassified if

D(x) ≠ Class(x).

In general, a classifier is considered to be better than another if it has a lower misclassification rate, which is defined as the probability that any instance x ∈ X is misclassified.

A class of classifiers called tree classifiers is of particular interest in solving practical classification problems. A tree classifier determines the class of an instance based on a sequence of tests on its attributes. A global decision is reached via a series of local decisions that constitute a path through a tree structure.

Decision trees provide a straightforward implementation of tree classifiers [37,38]. Each terminal node (or leaf) in a decision tree is labeled with a class together with a set of instances, while each nonterminal node is labeled with a test on certain attribute value(s). Each test defines the branches from its associated node, and each branch is labeled with an attribute value (or a range of values). An instance follows a unique choice of branch from a node according to its attribute values.

3.2. Fuzzy classifications

Although tree classifiers are generally efficient, they have serious problems in dealing with elaborate real-valued attributes [12,11]. As Example 1 in the previous section shows, standard decision trees cannot handle multiple instances with overlapping attribute values that belong to different classes. To overcome such difficulties, the idea of fuzzy classification is proposed below.

Definition 2. A fuzzy classifier F for a given classification problem (X, C) defines a total function

F : X → {⟨p_1, ..., p_n⟩ | p_i ∈ [0, 1]},

where p_i is the possibility that a given instance x belongs to class C_i.

For ease of presentation, the function F is sometimes represented as a vector of functions

⟨π_1, π_2, ..., π_n⟩,

where π_i is a possibility function X → [0, 1]. For any given instance x, the relation π_i(x) > π_j(x) indicates that it is more likely for the instance x to be in class C_i than in class C_j.
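To make Definition 2 concrete, here is a minimal Python sketch (ours, not the authors' code) of a fuzzy classifier that returns one possibility value per class for the stroke example; the hand-crafted membership shapes are purely illustrative assumptions.

```python
from typing import Callable, Dict, Sequence

# An instance is a vector of attribute values; a fuzzy classifier maps it to a
# possibility value in [0, 1] for every class (Definition 2), not a single label.
Instance = Sequence[float]
FuzzyClassifier = Callable[[Instance], Dict[str, float]]

def stroke_fuzzy_classifier(x: Instance) -> Dict[str, float]:
    """Toy fuzzy classifier: hand-crafted possibilities, for illustration only."""
    systolic, diastolic = x
    stroke = min(1.0, max(0.0, (systolic - 140) / 60)) * min(1.0, max(0.0, (diastolic - 60) / 40))
    return {"Stroke": stroke, "Normal": 1.0 - stroke}

print(stroke_fuzzy_classifier((154, 74)))  # possibilities for both classes, not one crisp label
```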

A fuzzy classifier can be readily implemented by a tree structure. This section presents the basic definitions of fuzzy classification trees (FCTs). To facilitate discussions in the rest of the paper, we define a labeling scheme that assigns a unique label to each node.


Definition 3. Given an FCT, each node n in the tree T is given a label:

\[
\mathrm{Label}(n) =
\begin{cases}
1 & \text{if } n \text{ is the root},\\
\mathrm{Label}(m).i & \text{if } n \text{ is the } i\text{th child of node } m,
\end{cases}
\]

where "." is the concatenation operator.

Let 𝓛 be the set of all labels, N_L denote the node labeled by L ∈ 𝓛, and B_L denote the branch leading into node N_L. The label of the parent node of N_L can be obtained by removing the last integer from label L; the result is denoted by L̂. Each nonterminal node in the tree is associated with a test, and each resulting branch B_{L.i} is associated with a membership function

\[
\mu_{L.i} : X \to [0, 1].
\]

Intuitively, the membership function defines the degree of possibility that an instance x ∈ X should be propagated down the branch. Without loss of generality, we assume each test to be on a single attribute. Therefore, the membership function is defined over projection(X, a_L), i.e. the domain of the testing attribute a_L ∈ A.

We further assume that each node N_L is associated with a class C_L and a possibility function P_L.
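A minimal sketch of the bookkeeping implied by Definition 3 and the node assumptions above follows; the class and field names are hypothetical choices of ours, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class FCTNode:
    label: str                                  # "1", "1.2", "1.2.1", ... ("." is the concatenation operator)
    klass: Optional[str] = None                 # class C_L associated with the node
    test_attribute: Optional[int] = None        # index of the attribute tested at a nonterminal node
    children: List["FCTNode"] = field(default_factory=list)
    branch_memberships: List[Callable[[float], float]] = field(default_factory=list)  # mu_{L.i} per branch

def add_child(parent: FCTNode, membership: Callable[[float], float],
              klass: Optional[str] = None) -> FCTNode:
    """Create the i-th child of `parent` and label it Label(parent).i."""
    i = len(parent.children) + 1
    child = FCTNode(label=f"{parent.label}.{i}", klass=klass)
    parent.children.append(child)
    parent.branch_memberships.append(membership)
    return child

def parent_label(label: str) -> str:
    """L-hat: remove the last integer from label L."""
    return label.rsplit(".", 1)[0]
```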

Definition 4. The possibility function P_L : X → [0, 1] is defined by composing the membership functions along the path from the root to node N_L. That is,

\[
P_L =
\begin{cases}
1 & \text{if } N_L \text{ is the root node},\\
P_{\hat{L}} \otimes \mu_L & \text{if } N_{\hat{L}} \text{ is the parent of } N_L.
\end{cases}
\]

The composition operator ⊗ is defined in terms of some valid operation for combining two membership functions.

Several composition operators, e.g. fuzzy sum, fuzzy product, and fuzzy max, are supported in our implementation. For example,

\[
P_L(x) = P_{\hat{L}}(x) + \mu_L(x)
\]

when the fuzzy sum operator is applied.
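The composition operator ⊗ can be instantiated in several ways. The sketch below encodes the three operators mentioned above; the exact formulas (in particular the unbounded fuzzy sum matching the example P_L(x) = P_L̂(x) + μ_L(x)) are our reading of the text, not a specification.

```python
# Candidate instantiations of the composition operator from Definition 4.
def fuzzy_sum(p: float, mu: float) -> float:
    return p + mu              # matches the fuzzy-sum example above; values may exceed 1

def fuzzy_product(p: float, mu: float) -> float:
    return p * mu              # a t-norm; keeps possibilities in [0, 1]

def fuzzy_max(p: float, mu: float) -> float:
    return max(p, mu)

def possibility_along_path(memberships, x, compose=fuzzy_product) -> float:
    """P_L(x): start from 1 at the root and compose the branch memberships on the path to N_L."""
    p = 1.0
    for mu in memberships:
        p = compose(p, mu(x))
    return p
```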

Fig. 3 shows a sample FCT that classifies instances into two classes C_1 and C_2.

Fig. 3. A sample FCT with C = {C_1, C_2}.

Given any instance x at a terminal node N_L in an FCT, it is classified into class C_L with possibility P_L(x). As shown in Fig. 3, multiple terminal nodes may be associated with the same class. It follows that an FCT defines a unique fuzzy classifier

\[
F = \langle \pi_1, \ldots, \pi_n \rangle
\]

such that the possibility of an instance belonging to class C_i is the maximum over all the possibility values at terminal nodes classified as C_i. That is, for 1 ≤ i ≤ n,

\[
\pi_i(x) = \max\{P_L(x) \mid N_L \text{ is a leaf} \wedge C_L = C_i\}.
\]
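Putting the pieces together, a hedged sketch of FCT inference: propagate the possibility down every root-to-leaf path and, for each class, keep the maximum possibility over the leaves labelled with that class. It reuses the hypothetical FCTNode structure and composition operators sketched above.

```python
from typing import Dict

def classify(root: "FCTNode", x, compose=None) -> Dict[str, float]:
    """Return pi_i(x) = max{ P_L(x) : N_L is a leaf and C_L = C_i } for every class C_i."""
    compose = compose or (lambda p, mu: p * mu)     # default to the product t-norm
    possibilities: Dict[str, float] = {}

    def walk(node, p: float) -> None:
        if not node.children:                       # terminal node N_L labelled with class C_L
            possibilities[node.klass] = max(possibilities.get(node.klass, 0.0), p)
            return
        value = x[node.test_attribute]
        for child, mu in zip(node.children, node.branch_memberships):
            walk(child, compose(p, mu(value)))      # P_{L.i}(x) = P_L(x) ⊗ mu_{L.i}(x)

    walk(root, 1.0)                                 # the root carries possibility 1
    return possibilities
```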

4. Information-based measure

There are multiple FCTs that implement the same fuzzy classifier. When a classification tree is constructed, choices of attributes to be tested at each node are made. Different attribute selections result in different classification trees. Based on the principle of simplicity [7,20,40], it is desirable to create the smallest tree that can correctly classify the most data in the training set. In general, the most discriminating attribute(s) should be chosen first. This section defines the criteria [9] for attribute selection in terms of an information-based measure of FCT.

4.1. Properties of fuzzy entropy functions

As in the decision tree method, one of the important tasks in building a fuzzy classification tree is to identify which attributes are important for classification. Placing attributes at different positions in the tree yields many different classification trees. By Occam's Razor [17,43], the simple one captures more information than the complex one. It is therefore necessary to determine which attribute to choose, from root to leaf, so as to construct a simple tree.


Fig. 4. Decomposition of a choice from three possibilities.

According to the original probabilistic entropy defined by Shannon [40] and the fuzzy entropy function defined by De Luca and Termini [20], the information-based measure should satisfy the following criteria. Let π_i denote the possibility of an instance for each i, where π_i ∈ [0, 1].

[Property 1] Function H(π_1, π_2, ..., π_n) should be continuous in the π_i. This property prevents a situation in which a very small change in some π_i would produce a large (discontinuous) jump.

[Property 2] Function H must be 0 if and only if all the π_i but one are zero. When only one class is possible, there is no uncertainty in the data.

[Property 3] Function H attains its maximum value if and only if the π_i are all equal, because that is when the data contain the most uncertainty. That is, no matter what the π_i are, the largest uncertainty occurs when all the π_i have the same value.

[Property 4] Function H is a nonnegative valuation on the π_i.

[Property 5] Since the purpose of attribute selection is to reduce the uncertainty in the data, it is necessary that if a choice is broken down into several successive choices, the original H should be no less than the weighted sum of the individual values of H. As illustrated in Fig. 4, in (a) we have three possibilities π_1, π_2, and π_3. In (b), on the right, we first choose between two possibilities π_1 and π_2, and if the second occurs we make another choice with possibilities π_{2.1}, π_{2.2}. The final entropy of (b) should be less than or equal to the entropy of (a). That is,

\[
H(\pi_1, \pi_2, \pi_3) \ge H(\pi_1, \pi_2) + \pi_2 \times H(\pi_{2.1}, \pi_{2.2}).
\]

The coefficient π_2 appears because the second choice is made only with possibility π_2.

4.2. Fuzzy entropy functions

Suppose we have a set of instances S_L at node N_L. Assume there are n classes associated with the possibilities of occurrence π_1, π_2, ..., π_n. To measure how much choice is involved in the selection of an instance in S_L, or how uncertain we are of the outcome, we use the entropy function.

Definition 5. The entropy for the set of instances S_L at node N_L is defined by

\[
\mathrm{Info}(S_L) = - \sum_{\forall c \in C} \frac{P_L^c}{P_L} \times \log_2 \frac{P_L^c}{P_L},
\]

where

\[
P_L = \sum_{x \in S_L} P_L(x)
\]

is the sum of the possibility values P_L(x) of all instances at node N_L, and

\[
P_L^c = \sum_{x \in S_L \wedge \mathrm{Class}(x) = c} P_L(x)
\]

is the sum over instances belonging to class c.

The entropy of a set measures the average amount of information needed to identify the class of an instance in the set. It is minimized when the set of instances is homogeneous, and maximized when the set is perfectly balanced among the classes.

A similar measure can be defined when the set is distributed into b_L subsets, one for each branch based on the test at node N_L. The expected information requirement is the weighted sum over the subsets:

\[
\mathrm{Info}_T(S_L) = \sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L} \times \mathrm{Info}(S_{L.i}).
\]

To assess the "benefits" of a test, we need to consider the reduction in entropy. The quantity

\[
\mathrm{Gain}(\mathrm{Test}_L) = \mathrm{Info}(S_L) - \mathrm{Info}_T(S_L)
\]

measures the information gain due to the test Test_L. This gain criterion is used as the basis for attribute selection.
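A small sketch of Definition 5 and the gain criterion, assuming the instances reaching a node are available as (possibility P_L(x), class) pairs; the helper names are ours.

```python
import math
from typing import Iterable, List, Tuple

def fuzzy_info(instances: Iterable[Tuple[float, str]]) -> float:
    """Info(S_L) = -sum_c (P_L^c / P_L) * log2(P_L^c / P_L), with possibility-weighted counts."""
    total, per_class = 0.0, {}
    for p, c in instances:
        total += p
        per_class[c] = per_class.get(c, 0.0) + p
    info = 0.0
    for pc in per_class.values():
        if pc > 0.0:
            ratio = pc / total
            info -= ratio * math.log2(ratio)
    return info

def fuzzy_gain(parent: List[Tuple[float, str]],
               children: List[List[Tuple[float, str]]]) -> float:
    """Gain(Test_L) = Info(S_L) - sum_i (P_{L.i} / P_L) * Info(S_{L.i})."""
    p_parent = sum(p for p, _ in parent)
    expected = sum((sum(p for p, _ in ch) / p_parent) * fuzzy_info(ch) for ch in children if ch)
    return fuzzy_info(parent) - expected
```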

4.3. The requirements of the fuzzy operations

Since log_2 is a continuous function, the fuzzy entropy defined with log_2 is also a continuous function, so it is easy to see that Info satisfies Property 1. Suppose S_L is a set of instances at N_L that has been purely classified into one class, that is, all the π_i of each instance but one are zero. Let π_i ≠ 0 for some class C_i; then the possibility

\[
P_L = \sum_{x \in S_L} P_L(x) = \sum_{x \in S_L} \pi_i(x),
\]

and the possibilities P_L^c of the other classes are zero, because

\[
P_L^c = \sum_{x \in S_L \wedge \mathrm{Class}(x) = c} P_L(x) = 0 \quad \text{for } c \ne C_i.
\]

Hence the entropy value Info(S_L) is zero when all the possibilities π_i but one are zero, so Property 2 is satisfied.

Property 3 requires that the entropy value is maximum when all the class possibilities are equal. For this to hold, the sum Σ_c P_L^c should be no bigger than P_L; otherwise, this property will not be satisfied. Let |C| be the number of classes and P_L^{C_i} = P_L^{C_j} for i ≠ j. In the FCT algorithm,

\[
\mathrm{Info}(S_L) = - \sum_{\forall c \in C} \frac{P_L^c}{P_L} \times \log_2 \frac{P_L^c}{P_L}
\;\le\; - \sum_{i=1}^{|C|} \frac{P_L}{|C|\,P_L} \log_2 \frac{P_L}{|C|\,P_L}
\;=\; - \sum_{i=1}^{|C|} \frac{1}{|C|} \log_2 \frac{1}{|C|}.
\]

The summation operation used here is defined to be the ordinary sum, as in classical (crisp) sets.

Since 0 ≤ P_L^c ≤ P_L for every class c ∈ C, we have log_2(P_L^c / P_L) ≤ 0 and Info(S_L) ≥ 0. Therefore, the fourth property is also satisfied.

Why is the fifth property required? The purpose of attribute selection in FCTs is to reduce the uncertainty in the data. After the fuzzy classification tree is expanded further, the total entropy of the child nodes should be no greater than the entropy of their parent node. In other words, the total entropy of the child nodes spawned from a node should be less than or equal to the entropy of that node before the tree is expanded. That is,

\[
\mathrm{Info}(S_L) \ge \sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L} \times \mathrm{Info}(S_{L.i}).
\]

This is a strong constraint that restricts the kinds of fuzzy operations and membership functions that may be used. It also limits the clustering methods that can generate the membership functions at a node.

Theorem 1. Let ⊗ be a fuzzy t-norm operator. If Σ_{i=1}^{b_L} μ_{L.i}(x) ≤ 1 for every x ∈ S_L, then the entropy of Definition 5 satisfies the fifth property. That is,

\[
\mathrm{Info}(S_L) \ge \sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L}\, \mathrm{Info}(S_{L.i}).
\]

Proof. Let μ̄ be the maximal membership value over all membership functions. Starting from the right-hand side of the inequality,

\[
\begin{aligned}
\sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L}\, \mathrm{Info}(S_{L.i})
&= - \sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L} \sum_{c \in C} \frac{P_{L.i}^c}{P_{L.i}} \log_2 \frac{P_{L.i}^c}{P_{L.i}} \\
&= - \sum_{i=1}^{b_L} \frac{P_{L.i}}{P_L} \sum_{c \in C} \frac{P_L^c \otimes \mu_{L.i}}{P_L \otimes \mu_{L.i}} \log_2 \frac{P_L^c \otimes \mu_{L.i}}{P_L \otimes \mu_{L.i}} \\
&\le - \sum_{c \in C} \frac{P_L^c \otimes \bar\mu}{P_L \otimes \bar\mu} \log_2 \frac{P_L^c \otimes \bar\mu}{P_L \otimes \bar\mu}
   && \Big(\textstyle\sum_l \mu_l(x) \le 1 \text{ and } \bar\mu \ge \mu_l(x),\ \forall l, x\Big) \\
&\le - \sum_{c \in C} \frac{P_L^c}{P_L} \log_2 \frac{P_L^c}{P_L}
   && \Big(\textstyle\sum_{c \in C} P_L^c \le P_L\Big) \\
&= \mathrm{Info}(S_L). \qquad \square
\end{aligned}
\]
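As a numerical sanity check on Theorem 1, the toy example below (made-up data, using the fuzzy_info helper sketched above, the product t-norm, and branch memberships that sum to at most 1 for every instance) verifies that the parent's entropy dominates the weighted child entropies.

```python
parent = [(1.0, "Stroke"), (1.0, "Normal"), (1.0, "Stroke"), (1.0, "Normal")]

memberships = [
    [0.7, 0.2, 0.6, 0.1],   # mu_{L.1}(x) for the four instances
    [0.3, 0.8, 0.4, 0.9],   # mu_{L.2}(x); the two branches sum to 1.0 per instance
]

# Children possibilities via the product t-norm: P_{L.i}(x) = P_L(x) * mu_{L.i}(x).
children = [[(p * m, c) for (p, c), m in zip(parent, mu)] for mu in memberships]

lhs = fuzzy_info(parent)
p_parent = sum(p for p, _ in parent)
rhs = sum((sum(p for p, _ in ch) / p_parent) * fuzzy_info(ch) for ch in children)
assert lhs >= rhs, (lhs, rhs)
print(f"Info(parent) = {lhs:.3f} >= weighted child entropy = {rhs:.3f}")
```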


4.4. Entropy evaluation algorithm

With the above definitions, we can calculate the entropy at the root of an FCT by propagating the entropy values up the tree.

Algorithm Evaluate Entropy
[Input] An FCT with root node N_L
[Output] The entropy value of T_L

1. ∀l ∈ 𝓛 such that N_l is a node in T_L:
   Info(S_l) ← −1
   /* Initialization: Info(S_l) is nonnegative, so it is first set to a negative value. */
2. ∀l ∈ 𝓛 such that N_l is a leaf node:
   Info(S_l) ← −Σ_{c∈C} (P_l^c / P_l) × log_2 (P_l^c / P_l)
3. loop until Info(S_L) ≥ 0:
   if ∀i, 1 ≤ i ≤ b_l, Info(S_{l.i}) ≥ 0 then
      Info(S_l) ← Σ_{i=1}^{b_l} (P_{l.i} / P_l) × Info(S_{l.i})
   end
4. return Info(S_L).
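A recursive Python reading of Evaluate Entropy is sketched below; this is our paraphrase, in which the paper's explicit loop is replaced by recursion over the hypothetical FCTNode structure, the instances that reach each leaf are assumed to be stored (with their possibilities) in a dictionary keyed by leaf label, and fuzzy_info is the helper from Section 4.

```python
from typing import Dict, List, Tuple

def evaluate_entropy(node: "FCTNode",
                     leaf_instances: Dict[str, List[Tuple[float, str]]]) -> Tuple[float, float]:
    """Return (P_l, Info(S_l)) for the subtree rooted at `node`.
    Leaves use Definition 5; an internal node takes the possibility-weighted
    sum of its children's entropies, as in steps 2-3 of the algorithm."""
    if not node.children:
        instances = leaf_instances[node.label]
        return sum(p for p, _ in instances), fuzzy_info(instances)
    child_results = [evaluate_entropy(child, leaf_instances) for child in node.children]
    p_node = sum(p for p, _ in child_results)
    info = sum((p / p_node) * child_info for p, child_info in child_results)
    return p_node, info
```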

5. Construction

This section presents the learning algorithm for constructing a fuzzy classification tree from a set of training instances containing real-valued attributes. Previous approaches to this problem usually fuzzify the data before they are used to construct a decision tree [47]; the linguistic variables have to be defined ahead of time based on existing domain knowledge.

The main algorithm for FCT construction takes as input a set S_0 of instances, and starts by creating a root node N_1, adding its label to 𝓛, and initializing S_1 to be S_0.

Algorithm Build FCT
[Input] A set of training instances S_0
[Output] An FCT

1. L ← 1
   /* Initialize L to be 1, the label of the root node. */
2. 𝓛 ← {1}
   /* Let 𝓛 be the set of labels representing the nodes that have not been expanded. */
3. S_1 ← S_0
   /* S_1 at the root node is set to be the original set S_0. */
4. loop until 𝓛 = ∅
5.    L ← random(𝓛)
      /* Randomly select one of the labels from 𝓛. */
6.    𝓛 ← 𝓛 \ {L}
7.    ∀a_i: τ_i ← Spawn New Tree(N_L, a_i)
8.    Find k such that Info(τ_k) = min_j Info(τ_j)
9.    Gain ← Info(T_L) − Info(τ_k)
10.   if Gain exceeds the given threshold, then 𝓛 ← 𝓛 ∪ leaf(τ_k) and
      assign the subsets of S_L to S_{L.1}, ..., S_{L.k}.

The procedure Spawn New Tree(N_L, a_i) expands the tree from node N_L according to attribute a_i.
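The construction loop could look roughly like the sketch below; this is our paraphrase of Build FCT over the hypothetical structures used earlier. The `spawn` callable stands in for Spawn New Tree plus the clustering of Section 6 and is assumed to return a candidate expansion's expected information together with its children and their instance subsets; the gain threshold is a free parameter.

```python
import random
from typing import Callable, List, Tuple

# spawn(node, attribute, instances) -> (expected_info, [(child_node, child_instances), ...])
SpawnFn = Callable[["FCTNode", int, list], Tuple[float, List[Tuple["FCTNode", list]]]]

def build_fct(instances: list, attributes: List[int], spawn: SpawnFn,
              gain_threshold: float = 0.01) -> "FCTNode":
    root = FCTNode(label="1")                            # steps 1-3
    nodes, data_at, open_labels = {"1": root}, {"1": instances}, {"1"}

    while open_labels:                                   # step 4
        label = random.choice(sorted(open_labels))       # step 5: pick an unexpanded node
        open_labels.discard(label)                       # step 6
        node, s_l = nodes[label], data_at[label]

        # Steps 7-8: spawn one candidate expansion per attribute and keep the best one.
        candidates = [spawn(node, a, s_l) for a in attributes]
        expected_info, expansion = min(candidates, key=lambda cand: cand[0])

        if fuzzy_info(s_l) - expected_info > gain_threshold:   # steps 9-10
            for child, subset in expansion:
                node.children.append(child)
                nodes[child.label] = child
                data_at[child.label] = subset
                open_labels.add(child.label)
    return root
```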

6. Clustering

The membership function is the kernel of fuzzy classification. To determine the membership functions from a data set, clustering is used. Clustering is a widely used method in pattern recognition and plays a key role in searching for structure in data. Different kinds of models may occur simultaneously in the data, which is called multi-model data [5]. Data can be clustered into different groups according to their distribution models, and the models construct the membership functions of the data.

Algorithm Spawn New Tree

[Input] An unexpanded node N and an attribute a
[Output] An expanded tree rooted at node N

∀i, 1 ≤ i ≤ n, do the following:
1. Project the instances of class C_i at node N onto attribute a.
2. Smooth the resulting histogram using the k-median method.
3. Partition the smoothed histogram into clusters.
4. Create a new branch from N_L for each cluster.
5. Define the membership function for each branch.
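A hedged sketch of the histogram-based part of Spawn New Tree: project the instances onto the attribute, smooth the counts with a running median (our simple stand-in for the k-median smoothing), and cut clusters at empty smoothed bins. Each resulting range would then get a branch and a membership function; the bin count and window width are arbitrary illustration parameters.

```python
import statistics
from typing import List, Sequence, Tuple

def smooth_histogram(counts: Sequence[float], window: int = 3) -> List[float]:
    """Running median over `window` bins -- a stand-in for the k-median smoothing step."""
    half = window // 2
    return [statistics.median(counts[max(0, i - half): i + half + 1]) for i in range(len(counts))]

def cluster_ranges(values: Sequence[float], bins: int = 10) -> List[Tuple[float, float]]:
    """Partition the attribute's range into clusters separated by empty (smoothed) bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(bins - 1, int((v - lo) / width))] += 1
    smoothed = smooth_histogram(counts)

    ranges, start = [], None
    for i, c in enumerate(smoothed):
        if c > 0 and start is None:
            start = i
        if start is not None and (c == 0 or i == bins - 1):
            end = i - 1 if c == 0 else i
            ranges.append((lo + start * width, lo + (end + 1) * width))
            start = None
    return ranges

# Example: well-separated groups of attribute values yield separate cluster ranges.
print(cluster_ranges([144, 150, 152, 169, 170, 181, 190, 194, 195]))
```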

6.1. Clustering on numerical attributes

Clustering is the important operation for deriving membership functions from real-valued attributes, as well as from symbolic attributes. Deriving the memberships of symbolic attributes will be introduced later. For real-valued attributes, clustering can be used to partition the domain of each attribute into several clusters according to its distribution. Given a finite set of data X of a real-valued attribute, clustering in X means finding several cluster centers that properly characterize relevant categories of X. In classical approaches, these categories are required to form a partition of X: each instance in X is uniquely assigned to one category of the partition. However, this requirement is too strong in many practical applications, such as medical diagnosis, financial management, robot control, etc., because uncertainties make the partition boundaries unclear. There exists an uncertain overlapping region at each partition boundary. It is thus desirable to adopt a weaker requirement that describes this overlapping situation.

The fuzzy c-means clustering method, which satisfies this weaker requirement, is used to make a properly vague partition. The membership value of each datum defines how possible it is that the datum is associated with a category, and gives a meaningful account of this vagueness. Therefore, to deal with the unavoidable observation and measurement uncertainties, fuzzy clustering is a very suitable choice for real world applications.

6.1.1. Fuzzy c-means clustering method

No universally optimal clustering criterion exists that can efficiently group a data set into clusters. Bezdek [1] proposed the fuzzy c-means clustering method to solve this optimization problem.

Given a data set X = {x_1, x_2, ..., x_n}, a fuzzy c-partition of X is a family of fuzzy subsets of X, denoted by P = {μ_1, ..., μ_c}, where c ∈ ℕ and

\[
\sum_{k=1}^{c} \mu_k(x_i) = 1
\]

for all i ∈ {1, 2, ..., n}, and

\[
0 < \sum_{i=1}^{n} \mu_k(x_i) < n
\]

for all k ∈ {1, 2, ..., c}. The membership function μ_i ∈ P, 1 ≤ i ≤ c, evaluates the degree of uncertainty of X in class C_i. For instance, given X = {x_1, x_2, x_3} and

\[
\mu_1 = 0.6/x_1 + 0/x_2 + 0.2/x_3, \qquad
\mu_2 = 0.4/x_1 + 1/x_2 + 0.8/x_3,
\]

{μ_1, μ_2} is a fuzzy 2-partition of X.

The fuzzy c-means clustering method requires a criterion under which the association of the data is strong within a cluster and weak between clusters. Let v_1, v_2, ..., v_c be the c cluster centers of P. Each cluster center associated with the partition is calculated by the following formula:

\[
v_k = \frac{\sum_{i=1}^{n} [\mu_k(x_i)]^m \, x_i}{\sum_{i=1}^{n} [\mu_k(x_i)]^m},
\]

where 1 ≤ k ≤ c, and m > 1 is a real number that governs the influence of the membership grades. The weight of a datum x_i is the mth power of the membership grade μ_k(x_i). When m → 1, fuzzy c-means converges to a generalized classical c-means. When m → ∞, all cluster centers tend towards the centroid of the data set X; that is, the partition becomes fuzzier with increasing m.

Definition 6 (Bezdek). The criterion of a fuzzy c-partition is defined in terms of the cluster centers by the formula

\[
\sum_{j=1}^{n} \sum_{k=1}^{c} [\mu_k(x_j)]^m \, \|x_j - v_k\|^2,
\]

where ‖·‖ is the inner-product-induced norm and ‖x_j − v_k‖² represents the distance between x_j and v_k.

The goal of the fuzzy c-means clustering method is to find a fuzzy partition that minimizes this criterion.
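A compact fuzzy c-means sketch for one-dimensional data follows; the alternating update of memberships and cluster centres is the standard scheme implied by the formulas above, while the initialisation, iteration count, and epsilon guard are arbitrary choices of ours.

```python
import random
from typing import List, Tuple

def fuzzy_c_means(xs: List[float], c: int = 2, m: float = 2.0,
                  iters: int = 100, eps: float = 1e-9) -> Tuple[List[float], List[List[float]]]:
    """Return (centres v_k, memberships mu[k][i]) that locally minimise
    sum_j sum_k mu_k(x_j)^m * (x_j - v_k)^2 for one-dimensional data."""
    centres = random.sample(xs, c)
    mu = [[0.0] * len(xs) for _ in range(c)]
    for _ in range(iters):
        # Membership update: mu_k(x_i) is inversely related to the distance to v_k.
        for i, x in enumerate(xs):
            dists = [abs(x - v) + eps for v in centres]
            for k in range(c):
                mu[k][i] = 1.0 / sum((dists[k] / d) ** (2.0 / (m - 1.0)) for d in dists)
        # Centre update: the membership-weighted mean (the v_k formula above).
        centres = [sum((mu[k][i] ** m) * x for i, x in enumerate(xs)) /
                   sum(mu[k][i] ** m for i in range(len(xs)))
                   for k in range(c)]
    return centres, mu

centres, mu = fuzzy_c_means([54.0, 56.0, 67.0, 72.0, 75.0, 82.0, 90.0, 95.0], c=2)
```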

7. Empirical results

We have tested our algorithm on five data sets from the UCI repository (ftp://ftp.ics.uci.edu/machine-learning-databases).

Glass: This sample data set includes 214 instances for determining whether a glass fragment is a type of "float" glass or not, for criminological investigation. There are seven classes with nine numerical attributes, with missing data.


Table 2
Average accuracy of C4.5, fuzzy decision trees (FDT), and FCT over the glass, monks, and ionosphere data sets

                          Glass        Monk1        Monk2        Monk3        Ionosphere
C4.5                      94.5±3.3%    77.1±3.3%    65.3±6.7%    92.6±2.9%    95.5±2.5%
FDT (Hsu et al.)          95.2±3.4%    78.9±3.4%    68.1±6.8%    93.3±2.2%    95.5±2.3%
FDT (Yuan and Shaw)       95.3±3.1%    80.6±3.4%    68.7±7.0%    92.7±1.9%    95.1±2.0%
FDT (Janickow)            95.6±3.3%    85.0±2.1%    71.7±6.3%    93.0±1.9%    95.4±2.4%
FDT (Suárez and Lutsko)   93.4±4.0%    84.6±2.7%    70.8±5.3%    94.5±1.6%    94.7±2.2%
FCT                       96.2±2.8%    86.2±2.9%    73.4±5.8%    93.2±1.7%    95.3±1.8%

Monks' problem: The three Monks' problems are a collection of three binary classification problems over a six-attribute discrete domain. The class is either 0 or 1. There are six categorical attributes and no missing values. There are noisy data in monk1 and monk2.

Ionosphere: This data set is a binary classification task. The radar data was collected by a system in Goose Bay, Labrador, consisting of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kW. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not, their signals passing through the ionosphere. There are 351 instances with 34 numerical attributes and no missing values.

The results of comparing the accuracy of FCT with C4.5 and with the four kinds of fuzzy decision trees proposed by Hsu et al. [15], Yuan and Shaw [47], Janickow [18], and Suárez and Lutsko [41] on these problems are shown in Table 2. Note that the trees used in the comparison are without pruning.

All these data sets were tested with an F-test at the 95% confidence level using 5-fold cross validation.

The clustering method determines the performance of FCT. The clustering method used for classification is the fuzzy c-partition algorithm, which fully satisfies the criteria of possibilistic entropy. In the results on the ionosphere problem, the accuracy of FCT is lower than the accuracy of C4.5 and of the fuzzy decision trees proposed by Hsu et al. Since the values of all the attributes are distributed in [−1, 1], the size of the clusters has an effect on the accuracy of FCT; improving the clustering method is one of our future objectives. The pre-fuzzification method proposed by Yuan and Shaw performs worse than FCTs. Because Yuan and Shaw's algorithm partitions the numerical data first and then uses the linguistic data to construct the decision tree, it is unavoidable that the performance of the clustering method affects the accuracy of their algorithm much more than it affects FCTs. The lack of the dynamic clustering adjustment used in FCTs makes Yuan and Shaw's algorithm less accurate.

8. Discussion

Classification by decision trees has been successfully applied to problems in artificial intelligence, pattern recognition and statistics. However, as Quinlan [31] pointed out, "the results of decision trees are categorical and so do not convey potential uncertainties in classification". Missing or imprecise information may prevent a case from being classified at all. In the presence of uncertainties, it is often desirable to have an estimate of the degree that an instance is in each class, e.g. in medical diagnosis.

Instead of classifying a case as belonging to exactly one class, and ruling out the other possibilities, one can estimate the relative probabilities of its being in each class. Casey and Nagy [6] designed a decision tree classifier using a probabilistic model for the optical character recognition process. Breiman et al. [3] introduced the class probability estimate. Quinlan [31,34] proposed probabilistic decision trees to deal with uncertainties in data. Schuermann and Doster [39] also proposed using a probabilistic model to estimate the probability of each class. In addition, to deal with the search bias introduced in attribute selection and the hypothesis-space bias due to noisy data [4], Buntine [5] suggested averaging over multiple class probability trees.


Probabilistic approaches still assume that there is only one decision node in the tree to which a case can be classified. A test instance falls down a single branch to arrive at a leaf where a probability is associated with each class. Such classifications ignore the information at the other nodes. Several methods, including Buntine's classification trees [5] and Rymon's set enumeration trees [36], have been proposed to address this issue; however, these approaches are inefficient in both time and space.

In a fuzzy classification tree, an instance has a membership value at each leaf node. We can calculate the degree of possibility that the instance belongs to any of the classes. Using information-based measures, there is no need to generate multiple classification trees. Therefore, it requires less time and space than decision forests.

9. Conclusion

This paper has presented an algorithm that integrates fuzzy classifiers with decision trees. The algorithm attempts to expand the FCT while minimizing its entropy at each step.

Unclearly partitioned boundaries between classes strongly confuse the conclusions obtained from C4.5, and overly generalized predictions are a serious problem [42,47]. In contrast to C4.5, fuzzy classification trees give much better predictive conclusions. Fuzzy classification predicts the degree of possibility for every class instead of determining a single class for any given instance. According to these possibilities, a proper conclusion can be made for each instance.

We compared FCT with C4.5 and four kinds of fuzzy decision trees using the empirical results on five data sets in the section above. From noise-free data (Golf) to data with a great amount of noise (Monk2), the accuracy rate of FCT is better. C4.5 classifies an instance into exactly one class: instances with attribute values around class boundaries are forced into a single class, which may result in wrong predictions, especially in noisy domains. Instead of making a rigid classification, it is sometimes necessary to identify more than one possible classification for a given instance.

Although the four fuzzy decision trees provide multiple classifications for an instance, they do not consider the distribution of the data, which can lead to improper classifications. FCTs can properly solve this problem. Through fuzzy clustering, the data can reveal the "context" structure of each attribute, as pointed out by Pedrycz and Sosnowski [26]. Based on the parent node's testing attribute, the cluster structure can give the globally optimal decision for each node.

FCTs allow multiple predictions to be made, each of which is associated with a degree of possibility. In application domains that involve a large amount of data with uncertainty, such as medicine or business, fuzzy classification trees can serve as a useful tool for generating fuzzy rules or discovering knowledge in databases.

References

[1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[2] X. Boyen, L. Wehenkel, Automatic induction of fuzzy decision trees and its application to power system security assessment, Fuzzy Sets and Systems 102 (1999) 3–19.
[3] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Chapman & Hall, London, 1984.
[4] W. Buntine, Myths and legends in learning classification rules, Proc. 8th National Conf. on Artificial Intelligence, Boston, MA, 1990, pp. 736–742.
[5] W. Buntine, Learning classification trees, Statist. Comp. 2 (1992) 63–73.
[6] R.G. Casey, G. Nagy, Decision tree design using a probabilistic model, IEEE Trans. Information Theory 30 (1) (1984) 93–99.
[7] R.L.P. Chang, T. Pavlidis, Fuzzy decision tree algorithms, IEEE Trans. Syst. Man Cybern. 7 (1) (1977) 28–35.
[8] Z. Chi, H. Yan, ID3-derived fuzzy rules and optimized defuzzification for handwritten numeral recognition, IEEE Trans. Fuzzy Syst. 4 (1) (1996) 24–31.
[9] I. Chiang, J. Hsu, Integration of fuzzy classifiers with decision trees, Proc. Asian Fuzzy Syst. Symp., Kenting, Taiwan, 1996, pp. 65–78.
[10] K.J. Cios, L.M. Sztandera, Continuous ID3 algorithm with fuzzy entropy measures, Proc. Int. Conf. on Fuzzy Systems, San Diego, CA, 1992, pp. 469–476.
[11] J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous features, Proc. 12th Int. Conf. on Machine Learning, San Mateo, CA, 1995, pp. 194–202.
[12] U.M. Fayyad, K.B. Irani, On the handling of continuous-valued attributes in decision tree generation, Machine Learning 8 (1992) 87–102.


[13] D. Heath, S. Kasif, S. Salzberg, Learning oblique decision trees, Proc. 13th Int. Joint Conf. on Artificial Intelligence, Chambery, France, 1993, pp. 1002–1007.
[14] J.Y. Hsu, I. Chiang, Fuzzy classification trees, Proc. 9th Int. Symp. on Artificial Intelligence, Cancun, Mexico, 1996, pp. 431–438.
[15] S. Hsu, J.Y. Hsu, I. Chiang, Automatic generation of fuzzy control rules by machine learning methods, Proc. Int. Conf. on Robotics and Automation, Nagoya, Japan, 1995, pp. 287–292.
[16] E.B. Hunt, J. Marin, P.J. Stone, Experiments in Induction, Academic Press, Orlando, FL, 1966.
[17] A. Hyman, J.J. Walsh, Philosophy in the Middle Ages, 2nd ed., Hackett Publishing Co., Indianapolis, 1973.
[18] C.Z. Janickow, Fuzzy decision trees: issues and methods, IEEE Trans. Syst. Man Cybern. B: Cybern. 28 (1) (1998) 1–14.
[19] I. Kononenko, I. Bratko, E. Roskar, Experiments in automatic learning of medical diagnostic rules, Technical Report, Jozef Stefan Institute, Ljubljana, Yugoslavia, 1984.
[20] A. De Luca, S. Termini, A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory, Inf. Control 20 (1972) 301–312.
[21] P.E. Maher, D. St. Clair, Uncertain reasoning in an ID3 machine learning framework, Proc. 2nd IEEE Int. Conf. on Fuzzy Systems, San Francisco, CA, 1993, pp. 7–12.
[22] S.K. Murthy, On growing better decision trees from data, Ph.D. Dissertation, The Johns Hopkins University, Baltimore, Maryland, 1995.
[23] S.K. Murthy, S. Kasif, S. Salzberg, A system for induction of oblique decision trees, J. Artif. Intell. Res. 2 (1994) 1–32.
[24] S.K. Murthy, S. Kasif, S. Salzberg, R. Beigel, OC1: randomized induction of oblique decision trees, Proc. 11th National Conf. on Artificial Intelligence, Washington, DC, 1993, pp. 322–327.
[25] Z. Pawlak, Rough Sets, Kluwer Academic, Dordrecht, 1991.
[26] W. Pedrycz, Z.A. Sosnowski, The design of decision trees in the framework of granular data and their application to software quality models, Fuzzy Sets and Systems 123 (2001) 271–290.
[27] J.R. Quinlan, Discovering rules by induction from large collections of examples, in: D. Michie (Ed.), Expert Systems in the Micro Electronic Age, Edinburgh University Press, Edinburgh, UK, 1979.

[28] J.R. Quinlan, Learning efficient classification procedures and their application to chess endgames, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Tioga, Palo Alto, CA, 1983.
[29] J.R. Quinlan, The effect of noise on concept learning, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning, Morgan Kaufman, Los Altos, CA, 1985.
[30] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[31] J.R. Quinlan, Probabilistic decision trees, in: P. Langley (Ed.), Proc. 4th Int. Workshop on Machine Learning, Los Altos, CA, 1987.
[32] J.R. Quinlan, Simplifying decision trees, Int. J. Man-Machine Studies 27 (1987) 221–234.
[33] J.R. Quinlan, Decision trees and decision making, IEEE Trans. Syst. Man Cybern. 20 (1990) 339–346.
[34] J.R. Quinlan, Learning logical definitions from relations, Machine Learning 5 (1990) 239–266.
[35] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[36] R. Rymon, An SE-tree based characterization of the induction problem, Proc. 10th Int. Conf. on Machine Learning, Amherst, MA, 1993, pp. 268–275.
[37] S.R. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, IEEE Trans. Syst. Man Cybern. 21 (3) (1991) 660–674.
[38] J.C. Schlimmer, R.H. Granger Jr., Incremental learning from noisy data, Machine Learning 1 (1986) 317–354.
[39] J. Schuermann, W. Doster, A decision theoretic approach to hierarchical classifier design, Pattern Recognition 17 (3) (1984) 359–369.
[40] C.E. Shannon, A mathematical theory of communication, The Bell Syst. Technical J. 27 (1948) 379–423, 623–656.
[41] A. Suárez, J.F. Lutsko, Globally optimal fuzzy decision trees for classification and regression, IEEE Trans. Pattern Anal. Machine Intell. 21 (12) (1999) 1297–1311.
[42] M. Sugeno, G.T. Kang, Structure identification of fuzzy model, Fuzzy Sets and Systems 28 (1988) 15–33.
[43] W.M. Thorburn, The myth of Occam's razor, Mind 27 (1918) 345–353.
[44] Q.R. Wang, C.Y. Suen, Large tree classifier with heuristic search and global training, IEEE Trans. Pattern Anal. Machine Intell. 9 (1) (1987) 91–102.
[45] R. Weber, Automatic knowledge acquisition for fuzzy control application, Proc. Int. Symp. Fuzzy Systems, Iizuka, Japan, 1992, pp. 9–12.
[46] R. Weber, Fuzzy-ID3: a class of methods for automatic knowledge acquisition, Proc. 2nd Int. Conf. on Fuzzy Logic and Neural Networks, Iizuka, Japan, 1992, pp. 265–268.
[47] Y. Yuan, M.J. Shaw, Induction of fuzzy decision trees, Fuzzy Sets and Systems 69 (1995) 125–139.
