Using multi-attribute predicates for mining classification rules

Ming-Syan Chen

Electrical Engineering Department

National Taiwan University

Taipei, Taiwan, ROC

email: mschen@cc.ee.ntu.edu.tw

Abstract

In order to improve the efficiency of deriving classification rules from a large training dataset, we develop in this paper a two-phase method for multi-attribute extraction. A feature that is useful in inferring the group identity of a data tuple is said to have a good inference power to that group identity. Given a large training set of data tuples, the first phase, referred to as the feature extraction phase, is applied to a subset of the training database with the purpose of identifying useful features which have good inference powers to group identities. In the second phase, referred to as the feature combination phase, these extracted features are evaluated together and multi-attribute predicates with strong inference powers are identified. A technique using the match index of attributes is devised to reduce the processing cost.

1 Introduction

Various data mining capabilities have been explored in the literature. Mining association rules has attracted a significant amount of research attention [3, 9, 11, 15]. For example, given a database of sales transactions, it is desirable to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. Another type of data mining is on ordered data, such as stock market and point-of-sales data. Interesting aspects to explore from these ordered data include searching for similar sequences [1, 16], e.g., stocks with similar movement in stock prices, and sequential patterns [4], e.g., grocery items bought over a set of visits in sequence. Mining of Web path traversal patterns was studied in [6].

In addition, one important application of data mining is the ability to perform classification in a huge amount of data. This is referred to as mining classification rules. Explicitly, mining classification rules is an approach of trying to develop rules to group data tuples together based on certain common features. For an example of commercial applications, it is desirable for a car dealer to know the common features of most of its customers, so that its salespersons will know whom to approach and its catalogs of new models can be mailed directly to those customers with the identified features. The business opportunity can thus be maximized. It is noted that due to the increasing use of computing for various applications, the importance of mining classification rules is growing at a rapid pace. The fast growth in the amount of data in those applications has furthermore made the efficient mining of classification rules a very challenging issue. Classification rule mining has been explored both in the AI domain [12, 14] and in the context of databases [2, 5, 7, 8]. In machine learning, the decision-tree classification method developed by Quinlan [13, 14] is one of the most important results, and has been very influential to later studies. It is a supervised learning method that constructs decision trees from a set of examples. The quality of a tree depends on both the classification accuracy and the size of the tree. Other approaches to data classification include statistical approaches [12], the rough sets approach [17], etc. In the context of databases, an interval classifier has been proposed in [2] to reduce the cost of decision tree generation. An attribute-oriented induction method has been developed for mining classification rules in relational databases [8]. The work in [10] explores rule extraction in a database based on neural networks.

It is noted that in mining classification rules for a given database, one would naturally like to have a training dataset large enough so as to have a sufficient confidence on the rules derived. However, with a large training set, the execution time required for rule derivation could be prohibitive, in particular when forming multi-attribute predicates is needed. When a sophisticated predicate is constructed from a combination of features, the execution time required grows exponentially with the size of a training database, which is highly undesirable in many applications. Consequently, we present in this paper a two-phase method for multi-attribute extraction that improves the efficiency of deriving classification rules from a large training dataset. A feature that is useful in inferring the group identity of a data tuple is said to have a good inference power to that group identity. Given a large training set of data tuples, the first phase, referred to as the feature extraction phase, is applied to a subset of the training database with the purpose of identifying useful features which have good inference powers to group identities. Note, however, that in some cases the group identity is not so dependent on the value of a single attribute. Rather, the group identity depends on the combined values of a set of attributes. This is particularly true in a database where attributes have strong dependencies among themselves. Combining several individual features is thus required for constructing multi-attribute predicates with better inference powers. In the second phase, referred to as the feature combination phase, those features extracted from the first phase are evaluated together and multi-attribute predicates with strong inference powers are identified. A technique using the match index of attributes is devised to reduce the processing cost. In essence, a match index is a heuristic indication of the combined inference power of multiple attributes, and can be used to identify uninteresting combined attributes and remove them from later processing. Note that, being performed only on a subset of the training set, the feature extraction phase can be executed efficiently. On the other hand, since the features extracted are applied to the whole training set in the feature combination phase, the confidence of the final classification rules derived can hence be ensured.

This paper is organized as follows. A problem description is given in Section 2. The two-phase method for mining classification rules is described in Section 3. Section 4 contains the summary.

2 Problem Description

In general, the problem of mining classification rules can be stated as follows. We are given a large database W, in which each tuple consists of a set of n attributes (features), {A_1, A_2, ..., A_n}. The terms "attribute" and "feature" are used interchangeably in this paper. For example, attributes could be age, salary range, gender, zip code, etc. Our purpose is to classify all data tuples in this database into different groups according to their attributes. In order to learn proper knowledge on such classification, we are given a small training database E, in which each tuple consists of the same attributes as tuples in W, and additionally has a known group identity associated with it. An example group identity is the type of car owned, say "plain", "good", or "luxury". We want to (1) learn the relationship between attributes and group identity from the training database E, and then (2) apply the learned knowledge to classify data in the large database W into different groups. Note that once the relationship between attributes and group identity is learned in (1), the process in (2) can be performed in a straightforward manner. Hence, we shall focus our discussion on methods for (1) in this paper, i.e., on identifying attributes from {A_1, A_2, ..., A_n} that have strong inference to the group identity.

Label  Gender  Age  Beverage  State  Group id
 1     M       3    water     CA     I
 2     F       4    juice     NY     I
 3     M       4    water     TX     I
 4     F       4    milk      TX     I
 5     M       5    water     NY     I
 6     M       3    juice     CA     I
 7     M       3    water     CA     II
 8     F       5    juice     TX     II
 9     F       5    juice     NY     II
10     F       6    milk      TX     III
11     M       4    milk      NY     III
12     M       5    milk      CA     III
13     F       4    milk      TX     III
14     F       6    water     NY     III
15     F       6    water     CA     III

Table 1. A sample profile for classifying 15 children.

Consider the sample profile for 15 children in Table 1 as an example. In Table 1, each tuple, corresponding to one child, contains the attributes gender, age, beverage preferred and state lived in, and additionally his/her group identity. (For ease of exposition, each tuple is given a label in its first column, which is, however, not part of the attributes.) We now would like to explore the relationship between the attributes (i.e., gender, age, beverage and state in this case) and the group identity. As stated before, an attribute that is useful in inferring the group identity of a data tuple is said to have a good inference power to that group identity. A predicate in this study means a resulting classification rule from step (1) mentioned above, and will be used in step (2) to classify data tuples in the database W. In our discussion, the quality of a predicate refers to the combined inference power of the attributes the predicate is composed of. The proposed method consists of two phases: the feature extraction phase and the feature combination phase. Given a large training set of data tuples, the first phase, the feature extraction phase, is to learn useful features, which have good inference powers to group identities, from a subset D of the training database E.

As mentioned earlier, in some cases the group identity is not so dependent on the value of a single attribute, but instead depends upon the combined values of a set of attributes. This is particularly true in the presence of attributes that have strong dependencies among themselves. Consider the profile in Table 2 as an example. In Table 2, it is found that a male with low income and a female with high income usually drive cars, whereas a male with high income and a female with low income ride bikes. In this case, exploring the relationship between "vehicle" (corresponding to the group id in Table 1) and either the gender or the income attribute alone will yield few results, since neither gender nor income has a good inference power to the vehicle. However, a combination of gender and income (e.g., a male with low income) indeed has a good inference power to the vehicle. It can be seen that the type of vehicle in each tuple can in fact be determined from the combined value of gender and income. In view of this, to exploit the benefit of multi-attribute predicates, we devise the second phase, the feature combination phase, which evaluates individual features extracted in the first phase and produces multi-attribute predicates with strong inference powers.

The two-phase method for mining classification rules can be summarized as follows.

Algorithm M: Mining classification rules

Feature extraction phase: To learn useful features, which have good inference powers to group identities, from a subset of the training database.

Feature combination phase: To evaluate the extracted features based on the entire training database and form multi-attribute predicates with good inference powers.

Label  Gender  Income  Vehicle
1      male    low     car
2      male    low     car
3      female  high    car
4      female  high    car
5      male    high    bike
6      male    high    bike
7      female  low     bike
8      female  low     bike

Table 2. A sample profile for preferred vehicles.

3 Mining Classification Rules

We describe in this section a two-phase method for mining classification rules. The first phase, the feature extraction phase, is presented in Section 3.1, and the second phase, the feature combination phase, is presented in Section 3.2.

As illustrated in Figure 1, the feature extraction phase is applied to a subset of the training database. In this phase, attributes that have good inference powers to group identities are identified. The operations of this phase are explained below. First, tuples in the database D are divided into several groups according to their group id's. The inference power of each attribute is then investigated one by one. Suppose A is an attribute and {a_1, a_2, ..., a_m} are the m possible values of attribute A. Also, the domain of the group identity g is represented by domain(g) = {v_1, v_2, ..., v_|domain(g)|}. The primary group for a value a_i of attribute A, denoted by v_{a_i}, is the group that has the most tuples with their attribute A = a_i. Explicitly, use n_A(a_i, v_k) to denote the number of tuples which are in group v_k and have a value of a_i in their attribute A. Then, we have

    n_A(a_i, v_{a_i}) = max_{v_k ∈ domain(g)} n_A(a_i, v_k).    (1)

The primary group for each value of attribute A can hence be obtained. For the example profile in Table 1, if A is "gender," then domain(A) = {Male, Female}, and n_A(Male, I) = 4, n_A(Male, II) = 1, and n_A(Male, III) = 2. Group I is therefore the primary group for the value "Male" of attribute "gender".

The hit ratio of attribute A, denoted by h(A), is defined as the percentage of tuples which, according to their corresponding attribute values, fall into their primary groups. Let N denote the total number of tuples. Then,

    h(A) = ( Σ_{1 ≤ i ≤ m} n_A(a_i, v_{a_i}) ) / N.    (2)

It can be seen that the stronger the relationship between an attribute and the group identity, the larger the hit ratio of this attribute will be. The hit ratio of an attribute would become one if that attribute could uniquely determine the group identity. The hit ratio is thus a quantitative measurement for the inference power of an attribute. According to the primary groups of the various values of an attribute, the hit ratio of that attribute can be determined. Note that in essence we want to investigate the inference power of some combined features. However, to reduce the processing cost, we would like to restrict our attention to those attributes whose individual hit ratios meet a predetermined threshold. Specifically, we include attribute A into a set SA for future processing only if the hit ratio of A is larger than or equal to a predetermined threshold. Note that this is a greedy approach and does not guarantee providing the optimal solutions. As a consequence, those features with poor inference powers will be removed from later processing, and the processing cost can thus be reduced. The most distinguishing attribute refers to the attribute with the largest hit ratio. Formally, the flow of the feature extraction phase is summarized as follows.

Feature extraction phase:

Step 1: Divide tuples in the database D into several groups according to their group id's.

Step 2: Let A denote the next attribute to process.

Step 3: Determine the primary group for each value of attribute A.

Step 4: According to the primary groups of various values of attribute A, obtain the hit ratio of A.

Step 5: Include attribute A into set SA for future processing if the hit ratio of A is larger than or equal to a predetermined threshold.

Step 6: If there is any more attribute to process, then go to Step 2; otherwise stop.

Gender   I   II   III   (max, group)
Male     4   1    2     (4, I)
Female   2   2    4     (4, III)

hit ratio: 8/15

Table 3. Distribution when the profile is classified by gender.

For illustrative purposes, consider the example profile in Table 1, which can be viewed as a subset of the training database to be used in the feature extraction phase. First, we classify this profile according to gender, and obtain the results in Table 3. As explained earlier, Group I is the primary group for the value "Male" of attribute "gender". Also, it can be obtained that Group III is the primary group for the value "Female" of attribute "gender". As a result, 8 tuples out of 15 fall into their primary groups. The hit ratio of attribute gender is thus 8/15. Similarly, it can be verified that the hit ratios of age, beverage and state are, respectively, 10/15, 9/15 and 6/15. Finally, having the largest hit ratio among the four attributes, age is the most distinguishing attribute in this example. Suppose the predetermined threshold for inclusion into SA is 8/15. Then, attributes gender, age and beverage are included into SA whereas attribute state is not. The attributes collected in SA will be evaluated together in the feature combination phase to form multi-attribute predicates. Moreover, we have the following lemma which specifies the lower bound of the hit ratio of an attribute.

Lemma 1: Let A be an attribute and g be a group identity. Then,

    h(A) ≥ 1 / |domain(g)|,

which is a tight lower bound. (Intuitively, for each value a_i the primary-group count n_A(a_i, v_{a_i}) is at least the average count of a_i over the |domain(g)| groups, so summing over all values yields at least N/|domain(g)| hits; the bound is attained when the tuples of every attribute value are spread evenly over all groups.)
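To make the phase concrete, the following Python sketch recomputes the hit ratios of Table 1 and selects SA. It is a minimal illustration under our own assumptions (dict-encoded tuples; the names hit_ratio and extract_features are ours), not the authors' implementation.

from collections import Counter

def hit_ratio(tuples, attr, group_key="group"):
    # n_A(a_i, v_k): number of tuples whose value of attr is a_i and whose group is v_k.
    counts = Counter((t[attr], t[group_key]) for t in tuples)
    # Each value a_i contributes the size of its primary group, max_k n_A(a_i, v_k).
    values = {t[attr] for t in tuples}
    hits = sum(max(n for (a, _), n in counts.items() if a == v) for v in values)
    return hits / len(tuples)

def extract_features(tuples, attrs, threshold):
    # S_A: attributes whose individual hit ratio reaches the threshold.
    return {a for a in attrs if hit_ratio(tuples, a) >= threshold}

# The 15-children profile of Table 1.
rows = [("M",3,"water","CA","I"),  ("F",4,"juice","NY","I"),  ("M",4,"water","TX","I"),
        ("F",4,"milk","TX","I"),   ("M",5,"water","NY","I"),  ("M",3,"juice","CA","I"),
        ("M",3,"water","CA","II"), ("F",5,"juice","TX","II"), ("F",5,"juice","NY","II"),
        ("F",6,"milk","TX","III"), ("M",4,"milk","NY","III"), ("M",5,"milk","CA","III"),
        ("F",4,"milk","TX","III"), ("F",6,"water","NY","III"),("F",6,"water","CA","III")]
children = [dict(zip(("gender","age","beverage","state","group"), r)) for r in rows]

print(extract_features(children, ("gender","age","beverage","state"), 8/15))
# -> {'gender', 'age', 'beverage'}: the hit ratios are 8/15, 10/15, 9/15 and 6/15,
#    so state falls below the 8/15 threshold, as derived above.

One pass builds the counts n_A(a_i, v_k), and the hit ratio then falls out of a per-value maximum, exactly as in Equations (1) and (2).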

Note that the feature extraction phase explores the relationship between the group identity and individual attributes. However, as explained earlier using the profile in Table 2, in some cases the group identity is not so dependent on the value of a single attribute, but instead depends upon the combined values of a set of attributes. This is the very reason that the feature combination phase is called for.

The feature combination phase is applied to the entire training database with the purpose of evaluating the inference power resulting from combining attributes. As stated before, we shall only examine those attributes in SA so as to decrease the processing cost.

In addition, a technique using the match index of attributes is devised to further reduce the processing cost. Specifically, instead of evaluating every pair of attributes, we shall only deal with those attribute pairs whose match indexes meet another threshold, where a match index is a heuristic indication of how likely a pair of attributes will lead to a strong inference power as a whole. Suppose that A and B are two attributes to be used to form a 2-attribute predicate. Let domain(A) = {a_1, a_2, ..., a_m} and domain(B) = {b_1, b_2, ..., b_p}. Recall that n_A(a_i, v_k) is the number of tuples which are in group v_k and have their attribute A = a_i; n_B(b_j, v_k) is defined similarly. Then, the match index of two attributes A and B, denoted by MI(A, B), is defined as:

    MI(A, B) = Σ_{1 ≤ i ≤ m} Σ_{1 ≤ j ≤ p} max_{v_k ∈ domain(g)} min(n_A(a_i, v_k), n_B(b_j, v_k)).

Essentially, MI(A, B) is a heuristic indication of the combined inference power of A and B. It can be observed that MI(A, B) is in fact an upper bound on the number of tuples falling into the primary groups of the various values of the attribute pair (A, B). Explicitly, MI(A, B)/N is an upper bound on the hit ratio of the attribute pair (A, B).
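The following Python sketch (same dict-encoded tuples as the earlier snippet; match_index and pair_hit_ratio are our names, not the paper's) checks this bound on the gender/income pair of Table 2.

from collections import Counter

def match_index(tuples, attr_a, attr_b, group_key="group"):
    groups = {t[group_key] for t in tuples}
    na = Counter((t[attr_a], t[group_key]) for t in tuples)
    nb = Counter((t[attr_b], t[group_key]) for t in tuples)
    # MI(A, B): for every value pair (a_i, b_j), take the max over groups of
    # min(n_A(a_i, v_k), n_B(b_j, v_k)), then sum over all value pairs.
    return sum(max(min(na[(a, g)], nb[(b, g)]) for g in groups)
               for a in {t[attr_a] for t in tuples}
               for b in {t[attr_b] for t in tuples})

def pair_hit_ratio(tuples, attr_a, attr_b, group_key="group"):
    # Treat the pair (A, B) as a single combined attribute.
    counts = Counter(((t[attr_a], t[attr_b]), t[group_key]) for t in tuples)
    pairs = {(t[attr_a], t[attr_b]) for t in tuples}
    hits = sum(max(n for (p, _), n in counts.items() if p == q) for q in pairs)
    return hits / len(tuples)

# The preferred-vehicle profile of Table 2.
rows = [("male","low","car")]*2 + [("female","high","car")]*2 \
     + [("male","high","bike")]*2 + [("female","low","bike")]*2
vehicles = [dict(zip(("gender","income","group"), r)) for r in rows]

print(match_index(vehicles, "gender", "income") / len(vehicles))  # 1.0
print(pair_hit_ratio(vehicles, "gender", "income"))               # 1.0

Here the bound is attained: gender and income each have an individual hit ratio of only 1/2, yet MI/N = 1 flags the pair as promising, and the pair's hit ratio is indeed 1.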

The use of the match index will significantly reduce the processing cost while causing little compromise on the quality of the resulting predicates. The overall flow of the feature combination phase can be summarized as follows.

Feature combination phase:

Step 1: Let attribute pair (A, B) be the next attribute pair from SA to process.

Step 2: If the match index of (A, B) does not reach the predetermined threshold, go to Step 6.

Step 3: Determine the primary group for each value of attribute pair (A, B).

Step 4: According to the primary groups of various values of attribute pair (A, B), obtain the hit ratio of pair (A, B).

Step 5: Include attribute pair (A, B) into a set SC for future processing if the hit ratio of (A, B) is larger than or equal to a predetermined threshold.

Step 6: If there is any more attribute pair to process, then go to Step 1; otherwise stop.
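A compact rendering of this loop, reusing match_index and pair_hit_ratio from the previous sketch (the threshold parameters and the MI/N normalization in Step 2 are our reading of the pruning test):

from itertools import combinations

def combine_features(tuples, s_a, mi_threshold, hit_threshold):
    s_c = set()
    for a, b in combinations(sorted(s_a), 2):
        # Step 2: prune the pair early when MI/N, an upper bound on its
        # hit ratio, cannot reach the threshold.
        if match_index(tuples, a, b) / len(tuples) < mi_threshold:
            continue
        # Steps 3-5: evaluate the surviving pair on the entire training set.
        if pair_hit_ratio(tuples, a, b) >= hit_threshold:
            s_c.add((a, b))
    return s_c

print(combine_features(vehicles, {"gender", "income"}, 0.9, 0.9))
# -> {('gender', 'income')}: the pair survives the match-index test and is
#    kept in S_C, although neither attribute alone would qualify.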

4 Conclusion

We have developed in this paper a two-phase method for multi-attribute extraction that improves the efficiency of classification rule derivation in a large training dataset. Given a large training set of data tuples, the feature extraction phase was applied to a subset of the training database with the purpose of identifying useful features which have good inference powers to group identities. In the feature combination phase, these extracted features were evaluated systematically and multi-attribute predicates with strong inference powers were identified. A technique using the match index of attributes was utilized to reduce the processing cost.

Acknowledgements

M.-S. Chen is supported in part by the National Science Council, Project No. NSC 87-2213-E-002-009 and Project No. 87-2213-E-002-101, Taiwan, ROC.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient Similarity Search in Sequence Databases. Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, October 1993.

[2] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. Proceedings of the 18th International Conference on Very Large Data Bases, pages 560-573, August 1992.

[3] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages 478-499, September 1994.

[4] R. Agrawal and R. Srikant. Mining Sequential Patterns. Proceedings of the 11th International Conference on Data Engineering, pages 3-14, March 1995.

[5] T. M. Anwar, H. W. Beck, and S. B. Navathe. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. Proceedings of the 8th International Conference on Data Engineering, pages 622-630, February 1992.

[6] M.-S. Chen, J.-S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2), April 1998.

[7] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, August 1992.

[8] J. Han, Y. Cai, and N. Cercone. Data-Driven Discovery of Quantitative Rules in Relational Databases. IEEE Transactions on Knowledge and Data Engineering, pages 29-40, February 1993.

[9] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995.

[10] H. Lu, R. Setiono, and H. Liu. NeuroRule: A Connectionist Approach to Data Mining. Proceedings of the 21st International Conference on Very Large Data Bases, pages 478-489, September 1995.

[11] J.-S. Park, M.-S. Chen, and P. S. Yu. Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, 9(5):813-825, October 1997.

[12] G. Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong Rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 229-238. AAAI/MIT Press, 1991.

[13] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[14] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.

[15] R. Srikant and R. Agrawal. Mining Generalized Association Rules. Proceedings of the 21st International Conference on Very Large Data Bases, pages 407-419, September 1995.

[16] J. T.-L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. Proceedings of ACM SIGMOD, Minneapolis, MN, pages 115-125, May 1994.

[17] W. Ziarko. The Discovery, Analysis, and Representation of Data Dependencies in Databases. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 195-209. AAAI/MIT Press, 1991.
