
Learning Cross-level Certain and Possible Rules

by Rough Sets

Tzung-Pei Hong†**, Chun-E Lin, Jiann-Horng Lin, Shyue-Liang Wang*

Department of Electrical Engineering National University of Kaohsiung

Kaohsiung, 811, Taiwan, R.O.C. tphong@nuk.edu.tw

Institute of Information Management I-Shou University

Kaohsiung, 840, Taiwan, R.O.C.

m9022006@stmail.isu.edu.tw, jhlin@isu.edu.tw

*

Department of Computer Science New York Institute of Technology 1855 Broadway, New York 10023, U.S.A.

slwang@nyit.edu

Abstract

Machine learning can extract desired knowledge and ease the development bottleneck in building expert systems. Among the proposed approaches, deriving rules from training examples is the most common. Given a set of examples, a learning program tries to induce rules that describe each class. Recently, the rough-set theory has been widely used in dealing with data classification problems. Most of the previous studies on rough sets focused on deriving certain rules and possible rules on the single concept level. Data with hierarchical attribute values are, however, commonly seen in real-world applications. This paper thus attempts to propose a new learning algorithm based on rough sets to find cross-level certain and possible rules from training data with hierarchical attribute values. It is more complex than learning rules from training examples with single-level values, but may derive more general knowledge from data. Boundary approximations, instead of upper approximations, are used to find possible rules, thus reducing some subsumption checking. Some pruning heuristics are also adopted in the proposed algorithm to avoid unnecessary search.

Keywords: machine learning, rough set, certain rule, possible rule, hierarchical value.


1. Introduction

Expert systems have been widely used in domains where mathematical models cannot easily be built, human experts are not available or the cost of querying an expert is high. Although a wide variety of expert systems have been built, knowledge acquisition remains a development bottleneck. Usually, a knowledge engineer is needed to establish a dialog with a human expert and to encode the knowledge elicited into a knowledge base to produce an expert system. The process is, however, very time-consuming [1][2]. Building a large-scale expert system involves creating and extending a large knowledge base over the course of many months or years. Hence, shortening the development time is the most important factor for the success of an expert system. Machine-learning techniques have thus been developed to ease the knowledge-acquisition bottleneck. Among the proposed approaches, deriving rules from training examples is the most common [5][6][9][10][11][15]. Given a set of examples, a learning program tries to induce rules that describe each class.

Recently, the rough-set theory has been used in reasoning and knowledge acquisition for expert systems [3]. It was proposed by Pawlak in 1982 [12] with the concept of equivalence classes as its basic principle. Several applications and extensions of the rough-set theory have also been proposed. Examples are Lambert-Torres et al.'s knowledge-base reduction [7], Zhong et al.'s rule discovery [18], Lee et al.'s hierarchical classification structure [8] and Tsumoto's attribute-oriented generalization [16]. Because of the success of the rough-set theory in knowledge acquisition, many researchers in the machine-learning field are very interested in this research topic since it offers opportunities to discover useful information from training examples.


Most of the previous studies on rough sets focused on deriving certain rules and possible rules on the single concept level. Hierarchical attribute values are, however, usually predefined in real-world applications. Deriving rules on multiple concept levels may thus lead to the discovery of more general and important knowledge from data. It is, however, more complex than learning rules from training examples with single-level values. In this paper, we thus propose a new learning algorithm based on rough sets to find cross-level certain and possible rules from training data with hierarchical attribute values. Boundary approximations, instead of upper approximations, are used to find possible rules, thus reducing some subsumption checking. Some pruning heuristics are also used in the proposed algorithm to avoid unnecessary search.

The remainder of this paper is organized as follows. The rough-set theory is briefly reviewed in Section 2. Management of hierarchical attribute values by rough sets is described in Section 3. The notation and definitions used in this paper are given in Section 4. A new learning algorithm based on the rough-set theory to induce cross-level certain and possible rules is proposed in Section 5. An example to illustrate the proposed algorithm is given in Section 6. Some discussion is given in Section 7. Conclusions and future works are finally given in Section 8.

2. Review of the Rough-Set Theory

The rough-set theory, proposed by Pawlak in 1982 [12][14], can serve as a new mathematical tool for dealing with data classification problems. It adopts the concept of equivalence classes to partition training instances according to some criteria. Two kinds of partitions are formed in the mining process: lower approximations and upper approximations, from which certain and possible rules can easily be derived.

Formally, let U be a set of training examples (objects), A be a set of attributes describing the examples, C be a set of classes, and V_j be the value domain of an attribute A_j. Also let v_j^(i) be the value of attribute A_j for the i-th object Obj(i). When two objects Obj(i) and Obj(k) have the same value of attribute A_j (that is, v_j^(i) = v_j^(k)), Obj(i) and Obj(k) are said to have an indiscernibility relation (or an equivalence relation) on attribute A_j. Likewise, if Obj(i) and Obj(k) have the same values for each attribute in a subset B of A, they are said to have an indiscernibility (equivalence) relation on attribute set B. These equivalence relations partition the object set U into disjoint subsets, denoted by U/B, and the partition including Obj(i) is denoted B(Obj(i)).

Example 1: Table 1 shows a data set containing ten objects U = {Obj(1), Obj(2), …, Obj(10)}, two attributes A = {Transport, Residence}, and a class attribute Consumption Style. The class attribute has two possible values: {Low (L), High (H)}.

Table 1: The training data used in this example. Transport Residence Consumption Style

Obj(1) Expensive car Villa High

Obj(2) Cheap car Single house High

Obj(3) Ordinary train Suite Low

Obj(4) Express train Villa Low

Obj(5) Ordinary train Suite Low

Obj(6) Express train Single house Low

Obj(7) Cheap car Single house High

Obj(8) Express train Single house High

Obj(9) Ordinary train Suite Low

Obj(10) Express train Apartment Low

Since Obj(2) and Obj(7) have the same attribute value Cheap Car for attribute

Transport, they share an indiscernibility relation and thus belong to the same

equivalence class for Transport. The equivalence partitions for singleton attributes can be derived as follows:

U/{Transport} = {{(Obj(1))}, {(Obj(3))(Obj(5))(Obj(9))}, {(Obj(4))(Obj(6))(Obj(8))(Obj(10))}, {(Obj(2))(Obj(7))}}, and

U/{Residence} = {{(Obj(1))(Obj(4))}, {(Obj(2))(Obj(6))(Obj(7))(Obj(8))}, {(Obj(3))(Obj(5))(Obj(9))}, {(Obj(10))}}.

The sets of equivalence classes for subset B are referred to as B-elementary sets. Also, {Transport}(Obj(2)) = {Transport}(Obj(7)) = {Obj(2), Obj(7)}.
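As a quick illustration (not part of the original paper), the elementary sets above can be reproduced by grouping objects on their attribute values. The following minimal Python sketch hard-codes Table 1; the helper name elementary_sets and the data/ATTR structures are our own.

```python
from collections import defaultdict

# Training data of Table 1: object id -> (Transport, Residence, Consumption Style)
data = {
    1: ("Expensive car", "Villa", "High"),
    2: ("Cheap car", "Single house", "High"),
    3: ("Ordinary train", "Suite", "Low"),
    4: ("Express train", "Villa", "Low"),
    5: ("Ordinary train", "Suite", "Low"),
    6: ("Express train", "Single house", "Low"),
    7: ("Cheap car", "Single house", "High"),
    8: ("Express train", "Single house", "High"),
    9: ("Ordinary train", "Suite", "Low"),
    10: ("Express train", "Apartment", "Low"),
}
ATTR = {"Transport": 0, "Residence": 1}

def elementary_sets(objects, attrs):
    """Partition the objects into equivalence classes U/B for an attribute subset B."""
    partition = defaultdict(set)
    for obj, row in objects.items():
        key = tuple(row[ATTR[a]] for a in attrs)   # indiscernibility key
        partition[key].add(obj)
    return list(partition.values())

print(elementary_sets(data, ["Transport"]))
# -> [{1}, {2, 7}, {3, 5, 9}, {4, 6, 8, 10}]
```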

The rough-set approach analyzes data according to two basic concepts, namely the lower and the upper approximations of a set. Let X be an arbitrary subset of the universe U, and B be an arbitrary subset of the attribute set A. The lower and the upper approximations of X with respect to B, denoted B_*(X) and B^*(X) respectively, are defined as follows:

B_*(X) = {x | x ∈ U, B(x) ⊆ X}, and

B^*(X) = {x | x ∈ U, B(x) ∩ X ≠ ∅}.

Elements in B_*(X) can be classified as members of the set X with full certainty using attribute set B, so B_*(X) is called the lower approximation of X. Similarly, elements in B^*(X) can be classified as members of the set X with only partial certainty using attribute set B, so B^*(X) is called the upper approximation of X.

Example 2: Continuing from Example 1, assume X = {Obj(1), Obj(2), Obj(7), Obj(8)}. The lower and the upper approximations of attribute Transport with respect to X can be calculated as follows:

Transport_*(X) = {{(Obj(1))}, {(Obj(2))(Obj(7))}}, and

Transport^*(X) = {{(Obj(1))}, {(Obj(2))(Obj(7))}, {(Obj(4))(Obj(6))(Obj(8))(Obj(10))}}.
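The two approximations in Example 2 can be checked with a short sketch (again not from the paper), reusing data and elementary_sets from the previous code block:

```python
def lower_approximation(partition, X):
    """B_*(X): the equivalence classes fully contained in X."""
    return [cls for cls in partition if cls <= X]

def upper_approximation(partition, X):
    """B^*(X): the equivalence classes that intersect X."""
    return [cls for cls in partition if cls & X]

X_high = {1, 2, 7, 8}                       # X = {Obj(1), Obj(2), Obj(7), Obj(8)}
part = elementary_sets(data, ["Transport"])
print(lower_approximation(part, X_high))    # [{1}, {2, 7}]
print(upper_approximation(part, X_high))    # [{1}, {2, 7}, {4, 6, 8, 10}]
```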

After the lower and the upper approximations have been found, the rough-set theory can then be used to derive both certain and uncertain information and induce certain and possible rules from them [3][5].

Lambert-Torres et al. found unimportant attributes from lower and upper approximations and deleted them from a database [7]. Zhong et al. proposed a new incremental learning algorithm based on the generalization distribution table, which maintained the probabilistic relationships between the possible instances and the possible concepts [18]. Two sets of generalizations were formed from the table based on the rough set model. One set consisted of all consistent generalizations and the other consisted of all contradictory generalizations, which were similar to the S and G sets in the version space approach. The generalizations were then gradually adjusted according to new instances. The examples in the database could then be merged since some attributes were removed. The resulting database was thus a compact database. Yao formed a stratified granulation structure with respect to different levels of rough set approximations by incrementally clustering objects with the same characteristics together [17]. Also, Lee et al. simplified classification rules for data mining using rough set theory [8]. The proposed classification method generated minimal classification rules and made the analysis of information systems easy. Tsumoto presented a knowledge discovery system based on rough sets and attribute-oriented generalization [16]. It was used not only to acquire several sets of attributes important for classification, but also to evaluate how precisely the attributes of a database were able to classify data.

The advantage of the rough set theory lies in its simplicity from a mathematical point of view since it requires only finite sets, equivalence relations, and cardinalities [13].

3. Hierarchical Attribute Values

Most of the previous studies on rough sets focused on finding certain rules and possible rules on the single concept level. However, hierarchical attribute values are usually predefined in real-world applications and can be represented by hierarchy trees. Terminal nodes on the trees represent actual attribute values appearing in training examples; internal nodes represent value clusters formed from their lower-level nodes. Deriving rules on multiple concept levels may lead to the discovery of more general and important knowledge from data. A simple example for attribute Transport is given in Figure 1.


Figure 1: An example of predefined hierarchical values for attribute Transport

In Figure 1, the attribute Transport falls into two general values: Train and Car.

Train can be further classified into two more specific values Express Train and Ordinary Train. Similarly, assume Car is divided into Expensive Car and Cheap Car.

Only the terminal attribute values (Express Train, Ordinary Train, Expensive Car,

Cheap Car) can appear in training examples.

The concept of equivalence classes in the rough set theory makes it very suitable for finding cross-level certain and possible rules from training examples with hierarchical values. The equivalence class of a non-terminal-level attribute value for attribute Aj can easily be found by the union of its underlying terminal-level equivalence classes for Aj. Also, the equivalence class of a cross-level attribute value combination of two or more attributes can be derived from the intersection of the equivalence classes of its single attribute values.

Example 3: Continuing from Example 2, assume the hierarchical values for attribute Transport are the same as those in Figure 1. The equivalence class for Transport = Train is then the union of the equivalence classes for Transport = Express Train and Transport = Ordinary Train. Similarly, the equivalence class for Transport = Car is the union of the equivalence classes for Transport = Expensive Car and Transport = Cheap Car. Thus:

U/{Transport^nt} = {{(Obj(1))(Obj(2))(Obj(7))}, {(Obj(3))(Obj(4))(Obj(5))(Obj(6))(Obj(8))(Obj(9))(Obj(10))}},

where U/{Transport^nt} represents the non-terminal-level elementary set for attribute Transport.
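A small sketch of these two operations (union for a non-terminal value, intersection for a cross-level combination), reusing data and ATTR from the earlier code block; the dictionary transport_tree simply transcribes Figure 1:

```python
# Taxonomy of Figure 1: non-terminal value -> underlying terminal values
transport_tree = {
    "Train": ["Express train", "Ordinary train"],
    "Car":   ["Expensive car", "Cheap car"],
}

def nonterminal_class(objects, attr, terminal_values):
    """Equivalence class of a non-terminal value: union of its terminal-level classes."""
    idx = ATTR[attr]
    return {obj for obj, row in objects.items() if row[idx] in terminal_values}

train_class = nonterminal_class(data, "Transport", transport_tree["Train"])
print(train_class)                          # {3, 4, 5, 6, 8, 9, 10}
print(nonterminal_class(data, "Transport", transport_tree["Car"]))   # {1, 2, 7}

# A cross-level combination is the intersection of its single-value classes,
# e.g. (Transport = Train, Residence = Single house):
single_house = {o for o, row in data.items() if row[ATTR["Residence"]] == "Single house"}
print(train_class & single_house)           # {6, 8}
```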

In this paper, we will thus propose a rough-set-based learning algorithm for deriving cross-level certain and possible rules from training examples with hierarchical attribute values.

4. Notation and Definitions

According to the definitions of the lower approximation and the upper approximation, it is easily seen that the upper approximation includes the lower approximation. Thus each certain rule derived from the lower approximation will also be derived from the upper approximation, which causes redundant derivation and wastes computational time. The proposed algorithm therefore uses the boundary approximation, instead of the upper approximation, to derive the pure possible rules, which reduces the subsumption checking needed. For convenience, the symbol B^*(X) is used from here on to represent the boundary approximation, instead of the upper approximation, of attribute subset B on X. The boundary approximation for a subset B is defined as follows:

B^*(X) = {x | x ∈ U, B(x) ∩ X ≠ ∅ and B(x) ⊄ X}.
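In code, the boundary approximation only changes the filter applied to the equivalence classes; a one-function sketch reusing the earlier data and elementary_sets helpers:

```python
def boundary_approximation(partition, X):
    """B^*(X) as used from here on: classes that intersect X but are not contained in it."""
    return [cls for cls in partition if (cls & X) and not (cls <= X)]

print(boundary_approximation(elementary_sets(data, ["Transport"]), {1, 2, 7, 8}))
# -> [{4, 6, 8, 10}]
```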

The notation used in this paper is shown below.

U: the universe of all objects;
n: the total number of objects in U;
Obj(i): the i-th object, 1 ≤ i ≤ n;
C: the set of classes to be determined;
c: the total number of classes in C;
X_l: the l-th class, 1 ≤ l ≤ c;
A: the set of all attributes describing U;
m: the total number of attributes in A;
A_j: the j-th attribute, 1 ≤ j ≤ m;
v_j^(i): the value of A_j for Obj(i);
|A_j^t|: the number of terminal attribute values of A_j;
A_j^ti: the i-th terminal value of A_j, 1 ≤ i ≤ |A_j^t|;
|A_j^ntk|: the number of non-terminal-level attribute values of A_j on the k-th level;
A_j^ntki: the i-th non-terminal-level value of A_j on the k-th level, 1 ≤ i ≤ |A_j^ntk|;
A_j^t_*: the terminal-level lower approximation of a single attribute A_j;
A_j^t*: the terminal-level boundary approximation of a single attribute A_j;
A_j^nt_*: the non-terminal-level lower approximation of a single attribute A_j;
A_j^nt*: the non-terminal-level boundary approximation of a single attribute A_j;
B_j: an arbitrary subset of A;
B_j(Obj(i)): the equivalence class of B_j in which Obj(i) exists.

5. The Algorithm

In this section, a new learning algorithm based on rough sets is proposed to find cross-level certain and possible rules from training data with hierarchical attribute values. The algorithm first finds the terminal-level elementary sets of the single attributes. These equivalence classes can then be used to find the non-terminal-level elementary sets for the single attributes and the cross-level elementary sets for more than one attribute. Lower approximations are used to derive certain rules. Boundary approximations, instead of upper approximations, are used to find possible rules, thus reducing some subsumption checking. The algorithm calculates the lower and the boundary approximations of single attributes from the terminal level to the root level. After that, the lower and the boundary approximations of more than one attribute are derived based on the results of single attributes. Some pruning heuristics are also used to avoid unnecessary search. The rule-derivation process based on these approximations is then performed to find maximally general rules. The proposed algorithm is described as follows.

A rough-set-based learning algorithm for training examples with hierarchical attribute values:

Input: A data set U with n objects, each of which has m decision attributes with hierarchical values and belongs to one of c classes.

Output: A set of multiple-level certain and possible rules.

Step 1: Partition the object set into disjoint subsets according to class labels. Denote each subset of objects belonging to class Cl as Xl.

Step 2: Find the terminal-level elementary sets of single attributes; that is, if an object Obj(i) has a terminal value v_j^(i) for attribute A_j, put Obj(i) in the equivalence class for A_j = v_j^(i).

Step 3: Link each terminal node for A_j = A_j^ti, 1 ≤ i ≤ |A_j^t|, in the taxonomy tree for attribute A_j to the equivalence class for A_j = A_j^ti, where |A_j^t| is the number of terminal attribute values of A_j.

Step 4: Set l to 1, where l is used to represent the number of the class currently being processed.

Step 5: Compute the terminal-level lower approximation of each single attribute A_j for class X_l as:

A_j^t_*(X_l) = {A_j^t(x) | x ∈ U, A_j^t(x) ⊆ X_l},

where A_j^t(x) is the terminal-level equivalence class including object x and derived from attribute A_j.

Step 6: Compute the terminal-level boundary approximation of each single attribute A_j for class X_l as:

A_j^t*(X_l) = {A_j^t(x) | x ∈ U, A_j^t(x) ∩ X_l ≠ ∅, A_j^t(x) ⊄ X_l}.

Step 7: Compute the non-terminal-level lower and boundary approximations of each single attribute A_j for class X_l from the terminal level to the root level in the following substeps:

(a) Derive the equivalence class of the i-th non-terminal-level attribute value A_j^ntki of attribute A_j on level k by the union of its underlying terminal-level equivalence classes.

(b) Put the equivalence class of a non-terminal-level attribute value A_j^ntki in the k-level lower approximation for attribute A_j if all the equivalence classes of the underlying attribute values of A_j^ntki are in the (k+1)-level lower approximation for attribute A_j.

(c) Put the equivalence class of a non-terminal-level attribute value A_j^ntki in the k-level boundary approximation for attribute A_j if at least one equivalence class of the underlying attribute values of A_j^ntki is in the (k+1)-level boundary approximation for attribute A_j.

Step 8: Set q = 2, where q is used to count the number of attributes currently being processed.

Step 9: Compute the lower and the boundary approximations of each attribute set B_j with q attributes (on any levels) for class X_l from the terminal level to the root level by the following substeps:

(a) Skip all the combinations of attribute values in B_j for which the equivalence class of any of their value subsets is already in the lower approximation for X_l.

(b) Derive the equivalence class B_j(x) of each remaining combination of attribute values by the intersection of the equivalence classes of its single attribute values.

(c) Put the equivalence class B_j(x) of each combination in substep (b) into the lower approximation for class X_l if B_j(x) ⊆ X_l.

(d) Put the equivalence class B_j(x) of each combination in substep (b) into the boundary approximation for class X_l if B_j(x) ∩ X_l ≠ ∅ and B_j(x) ⊄ X_l.

Step 10: Set q = q + 1 and repeat Steps 9 and 10 until q > m.

Step 11: Derive the certain rules from the lower approximations.

Step 12: Remove certain rules with condition parts more specific than those of some other certain rules.

Step 13: Derive the possible rules from the boundary approximations and calculate their plausibility values as:

p(B_j(x)) = |B_j(x) ∩ X_l| / |B_j(x)|,

where B_j(x) is the equivalence class including x and derived from attribute set B_j.

Step 14: Set l = l + 1 and repeat Steps 5 to 14 until l > c.

Step 15: Output the certain rules and possible rules.
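To make the control flow of Steps 1 to 13 concrete, the following is a condensed, illustrative Python sketch; it is not part of the original algorithm description. It hard-codes the data of Table 2 and the taxonomies of Figures 2 and 3 from the next section, uses the union/intersection operations on equivalence classes described above, and applies only a simplified form of the Step 9(a) pruning; Step 12 (removing non-maximally-general certain rules) and the removal of zero-plausibility rules are omitted, so the raw output contains a few rules the full algorithm would prune. The names data, taxonomy, cross_level_values, eq_class, and learn are our own.

```python
from itertools import combinations, product

# Training data of Table 2: object id -> (Transport, Residence, class)
data = {
    1: ("Expensive car", "Villa", "High"),
    2: ("Cheap car", "Single house", "High"),
    3: ("Ordinary train", "Suite", "Low"),
    4: ("Express train", "Villa", "Low"),
    5: ("Ordinary train", "Suite", "Low"),
    6: ("Express train", "Single house", "Low"),
    7: ("Cheap car", "Single house", "High"),
    8: ("Express train", "Single house", "High"),
    9: ("Ordinary train", "Suite", "Low"),
    10: ("Express train", "Apartment", "Low"),
}
ATTR, CLASS = {"Transport": 0, "Residence": 1}, 2

# Taxonomies of Figures 2 and 3: non-terminal value -> its terminal leaves
taxonomy = {
    "Transport": {"Train": ["Express train", "Ordinary train"],
                  "Car": ["Expensive car", "Cheap car"]},
    "Residence": {"House": ["Villa", "Single house"],
                  "Building": ["Suite", "Apartment"]},
}

def cross_level_values(attr):
    """All terminal and non-terminal values of `attr`, each mapped to its terminal leaves."""
    values = {}
    for node, leaves in taxonomy[attr].items():
        values[node] = set(leaves)
        for leaf in leaves:
            values[leaf] = {leaf}
    return values

def eq_class(attr, leaves):
    """Equivalence class of attr = value, where `leaves` are the value's terminal leaves."""
    return frozenset(o for o, row in data.items() if row[ATTR[attr]] in leaves)

def learn(target):
    """Return (certain, possible) rules for one class, roughly following Steps 1-13."""
    X = {o for o, row in data.items() if row[CLASS] == target}        # Step 1
    certain, possible, certain_single = [], [], set()
    for q in range(1, len(ATTR) + 1):                # single attributes first, then pairs
        for attrs in combinations(ATTR, q):
            for combo in product(*[cross_level_values(a).items() for a in attrs]):
                conds = tuple((a, v) for a, (v, _) in zip(attrs, combo))
                if any(c in certain_single for c in conds):           # Step 9(a)-style pruning
                    continue
                cls = frozenset(data)
                for a, (_, leaves) in zip(attrs, combo):
                    cls &= eq_class(a, leaves)       # intersection of single-value classes
                if cls and cls <= X:                 # lower approximation -> certain rule
                    certain.append(conds)
                    if q == 1:
                        certain_single.add(conds[0])
                elif cls & X:                        # boundary approximation -> possible rule
                    possible.append((conds, round(len(cls & X) / len(cls), 2)))
    return certain, possible

certain_H, possible_H = learn("High")
print(certain_H)    # includes (('Transport', 'Car'),) and its two terminal-level variants
print(possible_H)   # includes (('Transport', 'Express train'),) with plausibility 0.25, etc.
```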

6. An Example

In this section, an example is given to show how the proposed algorithm can be used to generate certain and possible rules from training data with hierarchical values.


Assume the training data set is shown in Table 2.

Table 2: The training data used in this example Transport Residence Consumption Style

Obj(1) Expensive car Villa High

Obj(2) Cheap car Single house High

Obj(3) Ordinary train Suite Low

Obj(4) Express train Villa Low

Obj(5) Ordinary train Suite Low

Obj(6) Express train Single house Low

Obj(7) Cheap car Single house High

Obj(8) Express train Single house High

Obj(9) Ordinary train Suite Low

Obj(10) Express train Apartment Low

Table 2 contains ten objects U = {Obj(1), Obj(2), …, Obj(10)}, two decision attributes A = {Transport, Residence}, and a class attribute C = {Consumption Style}. The possible values of each decision attribute are organized into taxonomies, as shown in Figures 2 and 3.

Figure 2: Hierarchy of Transport


In Figures 2 and 3, there are three levels of hierarchical attribute values for attributes Transport and Residence. The roots representing the generic names of attributes are located on level 0 (such as “Transport” and “Residence”), the internal nodes representing categories (such as “Train”) are on level 1, and the terminal nodes representing actual values (such as “Express Train”) are on level 2. Only values of terminal nodes can appear in training examples. Assume the class has only two possible values: {High (H), Low (L)}. The proposed algorithm then processes the data in Table 2 as follows.

Step 1: Since two classes exist in the data set, two partitions are found as follows:

XH ={Obj(1), Obj(2), Obj(7), Obj(8)}, and

XL = {Obj(3), Obj(4), Obj(5), Obj(6), Obj(9), Obj(10)}.

Figure 3: Hierarchy of Residence

Step 2: The terminal-level elementary sets are formed for the two attributes as follows:


U/{Transport^t} = {{(Obj(1))}, {(Obj(3))(Obj(5))(Obj(9))}, {(Obj(4))(Obj(6))(Obj(8))(Obj(10))}, {(Obj(2))(Obj(7))}}; and

U/{Residence^t} = {{(Obj(1))(Obj(4))}, {(Obj(2))(Obj(6))(Obj(7))(Obj(8))}, {(Obj(3))(Obj(5))(Obj(9))}, {(Obj(10))}}.

Step 3: Each terminal node is linked to its equivalence class for later usage. Results for the two attributes are respectively shown in Figures 4 and 5.

Figure 4: Linking the equivalence classes to the terminal nodes in the Transport taxonomy


Step 4: l is set at 1, where l is used to represent the number of the class currently being processed. In this example, assume the class XH is processed first.

Step 5: The terminal-level lower approximation of attribute Transport for class XH is first calculated. Since the two terminal-level equivalence classes {(Obj(1))} and {(Obj(2))(Obj(7))} for Transport are completely included in XH, which is {Obj(1), Obj(2), Obj(7), Obj(8)}, the lower approximation of attribute Transport for class XH is thus:

Transport^t_*(XH) = {{(Obj(1))}, {(Obj(2))(Obj(7))}}.

Similarly, the terminal-level lower approximation of attribute Residence for class XH is calculated as:

Residence^t_*(XH) = ∅.

Figure 5: Linking the equivalence classes to the terminal nodes in the Residence taxonomy


Step 6: The terminal-level boundary approximation of attribute Transport for class XH is calculated. Since only the equivalence class {(Obj(4))(Obj(6))(Obj(8))(Obj(10))} has a non-empty intersection with XH and is not covered by XH, the boundary approximation is thus:

Transport^t*(XH) = {(Obj(4))(Obj(6))(Obj(8))(Obj(10))}.

Similarly, the terminal-level boundary approximation of attribute Residence for class XH is calculated as:

Residence^t*(XH) = {{(Obj(1))(Obj(4))}, {(Obj(2))(Obj(6))(Obj(7))(Obj(8))}}.

Step 7: The non-terminal-level lower and boundary approximations of single attributes for class XH are computed by the following substeps.

(a) The equivalence classes for non-terminal-level attribute values are first calculated from their underlying terminal-level equivalence classes. Take the attribute value Train in Transport as an example. Its equivalence class is the union of the two equivalence classes from the terminal nodes Express Train and Ordinary Train. Similarly, the equivalence class for the attribute value Car in Transport is the union of the two equivalence classes from the terminal nodes Expensive Car and Cheap Car. The equivalence classes for attribute Transport on level 1 are then shown as follows:

U/{Transport^nt} = {{(Obj(1))(Obj(2))(Obj(7))}, {(Obj(3))(Obj(4))(Obj(5))(Obj(6))(Obj(8))(Obj(9))(Obj(10))}}.


Similarly, the non-terminal-level equivalence classes for the attribute Residence on level 1 are found as:

U/{Residence^nt} = {{(Obj(1))(Obj(2))(Obj(4))(Obj(6))(Obj(7))(Obj(8))}, {(Obj(3))(Obj(5))(Obj(9))(Obj(10))}}.

(b) The equivalence class of a non-terminal-level attribute value is in the lower approximation if all the equivalence classes of its underlying attribute values on the lower levels are also in the lower approximation. In this example, only the equivalence classes of the underlying attribute values of Transport=Car are in the lower approximation for XH. The lower approximations for XH on level 1 are thus:

Transport^nt_*(XH) = {(Obj(1))(Obj(2))(Obj(7))}, and

Residence^nt_*(XH) = ∅.

(c) The equivalence class of a non-terminal-level attribute value is in the boundary approximation if at least one equivalence class of its underlying attribute values on the lower levels is also in the boundary approximation. The boundary approximations for XH on level 1 are thus found as:

Transport^nt*(XH) = {(Obj(3))(Obj(4))(Obj(5))(Obj(6))(Obj(8))(Obj(9))(Obj(10))}, and

Residence^nt*(XH) = {(Obj(1))(Obj(2))(Obj(4))(Obj(6))(Obj(7))(Obj(8))}.
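Substeps (b) and (c) can be phrased compactly as a "lift one level" operation: a non-terminal value enters the lower approximation only if every child class is already certain, and enters the boundary approximation if at least one child class is. The following sketch (our own, reusing taxonomy and eq_class from the code block after Section 5) reproduces the Transport results above; it assumes every terminal value occurs in the data.

```python
def lift_level(attr, child_lower, child_boundary):
    """One pass of Step 7(b)-(c): lift lower/boundary results one level up the taxonomy."""
    lower, boundary = [], []
    for node, leaves in taxonomy[attr].items():
        child_classes = [eq_class(attr, {leaf}) for leaf in leaves]
        merged = frozenset().union(*child_classes)               # Step 7(a): union of children
        if all(c in child_lower for c in child_classes):         # Step 7(b): all children certain
            lower.append((node, merged))
        elif any(c in child_boundary for c in child_classes):    # Step 7(c): some child in boundary
            boundary.append((node, merged))
    return lower, boundary

# Terminal-level results of Steps 5 and 6 for Transport and class XH:
T_lower = [frozenset({1}), frozenset({2, 7})]
T_boundary = [frozenset({4, 6, 8, 10})]
print(lift_level("Transport", T_lower, T_boundary))
# lower:    Car   -> {1, 2, 7}
# boundary: Train -> {3, 4, 5, 6, 8, 9, 10}
```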

Step 8: q is set at 2, where q is used to count the number of attributes currently being processed.

Step 9: The lower and the boundary approximations of each attribute set with two attributes for class XH on the terminal level are found in the following substeps. In this example, only the attribute set {Transport, Residence} contains two attributes.

(a) The attribute set {Transport, Residence} has the following possible combinations of values on the terminal level:

(Transport=Express Train, Residence=Villa),

(Transport=Express Train, Residence=Single House), (Transport=Express Train, Residence=Suite),

(Transport=Express Train, Residence=Apartment), (Transport=Ordinary Train, Residence=Villa),

(Transport=Ordinary Train, Residence=Single House), (Transport=Ordinary Train, Residence=Suite),

(Transport=Ordinary Train, Residence=Apartment), (Transport=Expensive Car, Residence=Villa),

(Transport=Expensive Car, Residence=Single House), (Transport=Expensive Car, Residence=Suite),

(Transport=Expensive Car, Residence=Apartment), (Transport=Cheap Car, Residence=Villa),

(Transport=Cheap Car, Residence=Single House), (Transport=Cheap Car, Residence=Suite), and (Transport=Cheap Car, Residence=Apartment).

Since the equivalence classes of the single attribute values (Transport = Expensive Car) and (Transport = Cheap Car) are already in the lower approximation for XH, the combinations above that include (Transport = Expensive Car) or (Transport = Cheap Car) are not considered in the later steps. Thus, only eight combinations are considered.

(b) The equivalence class of each remaining value combination for {Transport,

Residence} is then derived by the intersection of the equivalence classes of its single

attribute values. Take the combination (Transport = Express Train, Residence = Villa) as an example. The equivalence class for (Transport = Express Train) is {(Obj(4))(Obj(6))(Obj(8))(Obj(10))} and for (Residence = Villa) is {(Obj(1))(Obj(4))}. The

equivalence class for (Transport = Express Train, Residence = Villa) is thus the intersection of {(Obj(4))(Obj(6))(Obj(8))(Obj(10))} and {(Obj(1))(Obj(4))}, which is

{(Obj(4))}.

The equivalence classes for the other value combinations of {Transport,

Residence} can be similarly derived. Thus:

U/{Transport^t, Residence^t} = {{(Obj(4))}, {(Obj(6))(Obj(8))}, {(Obj(10))}, {(Obj(3))(Obj(5))(Obj(9))}}.

Note that {(Obj(1))} and {(Obj(2))(Obj(7))} won’t be considered since they are in

the lower approximation for XH from the single attribute value (Transport = Expensive

Car) and (Transport = Cheap Car).

(c) The lower approximation of each 2-attribute subset B_j for class XH on the terminal level is first derived. Only {Transport, Residence} is considered in this example. Since no equivalence class in U/{Transport^t, Residence^t} is contained in XH, the lower approximation of {Transport, Residence} for XH on the terminal level is thus:

{Transport^t, Residence^t}_*(XH) = ∅.

(d) The boundary approximation of each 2-attribute subset B_j for class XH on the terminal level is then derived. Take {Transport, Residence} as an example. Since (Obj(8)) in the equivalence class {(Obj(6))(Obj(8))} of U/{Transport^t, Residence^t} is in XH and (Obj(6)) is not, {(Obj(6))(Obj(8))} is thus in the boundary approximation of {Transport, Residence} for XH on the terminal level. The boundary approximation of {Transport, Residence} for XH on the terminal level is shown below:

{Transport^t, Residence^t}*(XH) = {(Obj(6))(Obj(8))}.

The same process is then repeated from the terminal level to the root level to find the lower and the boundary approximations of {Transport, Residence} on the other levels. Results are shown as follows:

{Transport^nt, Residence^t}_*(XH) = ∅,

{Transport^t, Residence^nt}_*(XH) = ∅,

{Transport^nt, Residence^nt}_*(XH) = ∅,

{Transport^nt, Residence^t}*(XH) = {(Obj(6))(Obj(8))},

{Transport^t, Residence^nt}*(XH) = {(Obj(4))(Obj(6))(Obj(8))}, and

{Transport^nt, Residence^nt}*(XH) = {(Obj(4))(Obj(6))(Obj(8))}.

Step 10: q = 2 + 1 = 3. Since q > m (= 2), the next step is executed.

Step 11: All the certain rules are then derived from the lower approximations. Results for this example are shown below:

1. If Transport is Expensive Car then Consumption Style is High;
2. If Transport is Cheap Car then Consumption Style is High;
3. If Transport is Car then Consumption Style is High.

Step 12: Since the condition parts of the first and second certain rules are more specific than that of the third certain rule, the first two rules are removed from the certain rule set.
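Step 12 can be realized as a subsumption check over the taxonomy: a certain rule is dropped when every condition of some other certain rule covers it, i.e. uses the same value or an ancestor of it. A small hypothetical sketch (the ancestors table is read off Figure 2; covers and the rule tuples are our own encoding):

```python
# Ancestor relation taken from the Transport taxonomy of Figure 2
ancestors = {"Expensive car": {"Car"}, "Cheap car": {"Car"},
             "Express train": {"Train"}, "Ordinary train": {"Train"}}

def covers(general, specific):
    """True if every condition of `general` is matched (same or ancestor value) in `specific`."""
    return all(any(ga == sa and (gv == sv or gv in ancestors.get(sv, set()))
                   for sa, sv in specific)
               for ga, gv in general)

rules = [(("Transport", "Expensive car"),),
         (("Transport", "Cheap car"),),
         (("Transport", "Car"),)]
maximal = [r for r in rules
           if not any(other != r and covers(other, r) for other in rules)]
print(maximal)   # only (('Transport', 'Car'),) survives
```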

Step 13: All the possible rules are derived from the boundary approximations. The plausibility measure of each rule is also calculated in this step. For example, the plausibility measure of the equivalence class {(Obj(4))(Obj(6))(Obj(8))(Obj(10))} of Transport = Express Train in the boundary approximation for class XH is calculated as follows:

P(If Transport = Express Train then XH) = |{Obj(4), Obj(6), Obj(8), Obj(10)} ∩ {Obj(1), Obj(2), Obj(7), Obj(8)}| / |{Obj(4), Obj(6), Obj(8), Obj(10)}| = 1/4 = 0.25.
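The same figure can be checked directly from the two object sets involved (a one-off sketch, not from the paper):

```python
express_train = {4, 6, 8, 10}          # equivalence class of Transport = Express Train
X_H = {1, 2, 7, 8}                     # objects of class High
print(len(express_train & X_H) / len(express_train))   # 0.25
```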


All the resulting possible rules with their plausibility values are shown below:

1. If Transport is Express Train then Consumption Style is High, with plausibility = 0.25;
2. If Residence is Single House then Consumption Style is High, with plausibility = 0.75;
3. If Transport is Train then Consumption Style is High, with plausibility = 0.14;
4. If Residence is House then Consumption Style is High, with plausibility = 0.66;
5. If Transport is Express Train and Residence is Single House then Consumption Style is High, with plausibility = 0.5;
6. If Transport is Train and Residence is Single House then Consumption Style is High, with plausibility = 0.5;
7. If Transport is Train and Residence is House then Consumption Style is High, with plausibility = 0.33;
8. If Transport is Express Train and Residence is House then Consumption Style is High, with plausibility = 0.33.

Step 14: l = l + 1 = 2. Steps 5 to 14 are then repeated for the other class XL.

Step 15: All the certain rules and possible rules are then output.

7. Discussion

In the proposed algorithm, when rules are derived from training examples with hierarchical values, only the maximally general certain rules, instead of all certain ones, are kept for classification. Certain rules that are not maximally general are removed since they provide no new information. Take the maximally general rule “If Transport is Car then Consumption Style is High” derived in the above section as an example. All the descendant rules covered by the maximally general rule according to the taxonomy relation in Figure 2 are shown as follows:

1. If Transport is Expensive Car then Consumption Style is High;
2. If Transport is Cheap Car then Consumption Style is High.

It can be easily verified that the above two rules are also certain rules. Besides, any rules generated by adding additional constraints to the maximally general rule or to its descendant rules are also certain. These include the following 18 rules:

1. If Transport is Car and Residence is Villa then Consumption Style is High;
2. If Transport is Car and Residence is Single House then Consumption Style is High;
3. If Transport is Car and Residence is Suite then Consumption Style is High;
4. If Transport is Car and Residence is Apartment then Consumption Style is High;
5. If Transport is Car and Residence is House then Consumption Style is High;
6. If Transport is Car and Residence is Building then Consumption Style is High;
7. If Transport is Expensive Car and Residence is Villa then Consumption Style is High;
8. If Transport is Expensive Car and Residence is Single House then Consumption Style is High;
9. If Transport is Expensive Car and Residence is Suite then Consumption Style is High;
10. If Transport is Expensive Car and Residence is Apartment then Consumption Style is High;
11. If Transport is Expensive Car and Residence is House then Consumption Style is High;
12. If Transport is Expensive Car and Residence is Building then Consumption Style is High;
13. If Transport is Cheap Car and Residence is Villa then Consumption Style is High;
14. If Transport is Cheap Car and Residence is Single House then Consumption Style is High;
15. If Transport is Cheap Car and Residence is Suite then Consumption Style is High;
16. If Transport is Cheap Car and Residence is Apartment then Consumption Style is High;
17. If Transport is Cheap Car and Residence is House then Consumption Style is High;
18. If Transport is Cheap Car and Residence is Building then Consumption Style is High.

The pruning procedure is embedded in the proposed algorithm. The above subsumption relation for certain rules is, however, not valid for possible rules. The plausibility of a parent possible rule always lies between the minimum and the maximum plausibility values of its child rules. Take the possible rule “If Transport is Train then Consumption Style is High, with plausibility = 0.14” derived in the above section as an example. Its two descendant rules according to the taxonomy relation in Figure 2 are shown as follows:

1. If Transport is Express Train then Consumption Style is High, with plausibility = 0.25;

2. If Transport is Ordinary Train then Consumption Style is High, with plausibility = 0.

It can be seen that the plausibility of the parent rule lies between 0.25 and 0. Note that the second child rule is not actually kept since its plausibility is zero; it is shown here only to demonstrate the relationship between the plausibility values of parent and child rules. Child rules with plausibility values lower than those of their parent rules are nevertheless kept by the proposed algorithm since they may provide useful information for classification. When a new event satisfies both a child rule and its parent rule, it is more accurate to derive the plausibility of the consequence from the child rule than from the parent rule. However, if a new event has an unknown terminal attribute value but a known non-terminal value, it can still be inferred using the parent rules. The proposed algorithm thus keeps all the possible rules except those with plausibility = 0. If child rules with plausibility values lower than those of their parent rules are not to be kept, the proposed algorithm can easily be modified by adding a subsumption check after the step that generates the possible rules.
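The bound stated above holds because a parent's equivalence class is the disjoint union of its children's classes, so its plausibility is a size-weighted average of the children's plausibilities; a quick numeric check for the Train example (our own sketch):

```python
children = [{4, 6, 8, 10}, {3, 5, 9}]      # Express Train, Ordinary Train
X_H = {1, 2, 7, 8}
child_p = [len(c & X_H) / len(c) for c in children]      # [0.25, 0.0]
parent = set().union(*children)
parent_p = len(parent & X_H) / len(parent)               # 1/7, about 0.14
assert min(child_p) <= parent_p <= max(child_p)
print(child_p, round(parent_p, 2))
```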

Besides, a plausibility threshold can be used in the proposed algorithm to avoid an overwhelming number of possible rules. The rules with plausibility values below the threshold will thus be pruned. This checking step can easily be embedded in finding the boundary approximations to further reduce the computational time.

8. Conclusions and Future Works

In this paper, we have proposed a new learning algorithm based on rough sets to find cross-level certain and possible rules from training data with hierarchical attribute values. The proposed method adopts the concept of equivalence classes to find the terminal-level elementary sets of single attributes. These equivalence classes are then easily used to find the non-terminal-level elementary sets of single attributes and the cross-level elementary sets of multiple attributes by the union and the intersection operations. Lower and boundary approximations are then derived from the elementary sets from the terminal level to the root level. Boundary approximations, instead of upper approximations, are used in the proposed algorithm to find possible rules, thus reducing some subsumption checking. Lower approximations are used to derive maximally general certain rules. Some pruning heuristics are also used to avoid unnecessary search. The rules derived can be used to infer a new event with both terminal and non-terminal attribute values.

The limitation of the proposed algorithm is that it applies only to symbolic data. If numerical data are fed, they must first be converted into intervals. Currently, we are trying to apply fuzzy concepts to manage numerical data and enhance the algorithm's power.

References

[1] B. G. Buchanan and E. H. Shortliffe, Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Massachusetts: Addison-Wesley, 1984.

[2] M. R. Chmielewski, J. W. Grzymala-Busse, N. W. Peterson and S. Than, “The rule induction system LERS – A version for personal computers,” Foundations of Computing and Decision Sciences, Vol. 18, No. 3, 1993, pp. 181-212.

[3] J. W. Grzymala-Busse, “Knowledge acquisition under uncertainty: A rough set approach,” Journal of Intelligent Robotic Systems, Vol. 1, 1988, pp. 3-16.

[4] S. Hirano, X. Sun, and S. Tsumoto, “Dealing with multiple types of expert knowledge in medical image segmentation: A rough sets style approach,” The 2002 IEEE International Conference on Fuzzy Systems, Vol. 2, 2002, pp. 884-889.

[5] T. P. Hong, T. T. Wang and S. L. Wang, "Knowledge acquisition from quantitative data using the rough-set theory," Intelligent Data Analysis, Vol. 4, 2000, pp. 289-304.

[6] T. P. Hong, T. T. Wang and B. C. Chien, "Mining approximate coverage rules,"

International Journal of Fuzzy Systems, Vol. 3, No. 2, 2001, pp. 409-414.

[7] G. Lambert-Torres, A. P. Alves da Silva, V. H. Quintana and L. E. Borges da Silva, “Knowledge-base reduction based on rough set techniques,” The Canadian

Conference on Electrical and Computer Engineering, 1996, pp. 278-281.

[8] C. H. Lee, S. H. Swo and S. C. Choi, “Rule discovery using hierarchical classification structure with rough sets,” The 9th IFSA World Congress and the 20th NAFIPS International Conference, Vol. 1, 2001, pp. 447-452.

[9] P. J. Lingras and Y. Y. Yao, “Data mining using extensions of the rough set model,” Journal of the American Society for Information Science, Vol. 49, No. 5, 1998, pp. 415-422.


[10]R. S. Michalski, J. G. Carbonell and T. M. Mitchell, Machine Learning: An

Artificial Intelligence Approach, Vol. 1, California: Kaufmann Publishers, 1983.

[11]R. S. Michalski, J. G. Carbonell and T. M. Mitchell, Machine Learning: An

Artificial Intelligence Approach, Vol. 2, 1984.

[12]Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, Vol. 11, No. 5, 1982, pp. 341-356.

[13]Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, 1991.

[14]Z. Pawlak, “Why rough sets?,” The Fifth IEEE International Conference on Fuzzy Systems, Vol. 2, 1996, pp. 738-743.

[15]S. Tsumoto, “Extraction of experts’ decision rules from clinical databases using rough set model,” Intelligent Data Analysis, Vol. 2, 1998, pp. 215-227.

[16]S. Tsumoto, “Knowledge discovery in medical databases based on rough sets and attribute-oriented generalization,” The 1998 IEEE International Conference on Fuzzy Systems, Vol. 2, 1998, pp. 1296-1301.

[17]Y. Y. Yao, “Stratified rough sets and granular computing,” The 18th International Conference of the North American Fuzzy Information Processing Society (NAFIPS), 1999, pp. 800-804.

[18]N. Zhong, J. Z. Dong, S. Ohsuga and T. Y. Lin, “An incremental, probabilistic rough set approach to rule discovery,” The IEEE International Conference on
