Adherence clustering: an efficient method for mining market-basket clusters


Information Systems 31 (2006) 170–186


Ching-Huang Yun, Kun-Ta Chuang, Ming-Syan Chen

Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, Taiwan, ROC

Received 7 January 2004; received in revised form 26 September 2004; accepted 3 November 2004

Abstract

We explore in this paper the efficient clustering of market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional and sparse. Without explicitly considering the presence of the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for the result representation of clustering techniques for market-basket data. In view of the features of market-basket data, we devise in this paper a novel measurement, called the category-based adherence, and utilize this measurement to perform the clustering. With this category-based adherence measurement, we develop an efficient clustering algorithm, called algorithm k-todes, for market-basket data with the objective of minimizing the category-based adherence. The distance of an item to a given cluster is defined as the number of links between this item and its nearest tode. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. A validation model based on information gain is also devised to assess the quality of clustering for market-basket data. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm k-todes significantly outperforms the prior works in both execution efficiency and clustering quality as measured by information gain, indicating the usefulness of category-based adherence in market-basket data clustering.

Keywords: Data mining; Clustering market-basket data; Category-based adherence; k-todes

1. Introduction

Data clustering is an important technique for exploratory data analysis [1,2]. Explicitly, data clustering is a well-known capability studied in information retrieval [3], data mining [4], machine learning [5], and statistical pattern recognition [6].


Corresponding author. Tel.: +886 2 2363 5251; fax: +886 2 2367 1597.


In essence, clustering is meant to divide a set of transactions into proper groups in such a way that transactions in the same group have similar features while transactions in different groups are dissimilar. Many data clustering algorithms have been proposed in the literature. These algorithms can be categorized into nearest neighbor clustering [7], fuzzy clustering [8], partitional clustering [9,10], hierarchical clustering [11,12], artificial neural networks for clustering [13], and statistical clustering algorithms [14]. However, finding an optimal clustering result is known to be an NP-hard problem [15], and thus clustering algorithms usually employ heuristic processes to find locally optimal results.

In market-basket data (also called transaction data), each transaction contains a set of items purchased by a customer. Market-basket data has been well studied in mining association rules for discovering the set of frequently purchased items [16–19]. However, mining association rules is generally useful in the cross-selling of items. For marketing strategies, the clusters with representative subjects (consisting of items or categories) are informative for planning a product promotion. Techniques for clustering market-basket data can be used to identify the subjects with similar buying patterns in the same cluster. For the promotion of a cluster, the items are identified as the products to be sold and the transactions can be used to identify the target customers. In this paper, we focus on clustering market-basket data for identifying representative subjects. One of the important features of market-basket data sets is that they are generated at a rapid pace (a million transactions per day) and thus require the data mining algorithms to be scalable and capable of dealing with large data sets.

It is important to note that since customers purchase desired items with the corresponding categorical meanings, the implications from the purchasing supports of items and the taxonomy of items are in fact entangled, and both of them are of great importance in reflecting customer behaviors. Explicitly, the support of item i is defined as the percentage of transactions which contain i. Note that in mining association rules, a large item is basically an item with frequent occurrences in transactions [16]. Thus, item i is called a large item if the support of item i is larger than the pre-given minimum support count. The taxonomy of items defines the categorical relationships of items and it can be represented as a taxonomy tree. In the taxonomy tree, the leaf nodes are called the item nodes and the internal nodes are called the category nodes. For the example shown in Fig. 1, "War and Peace" is an item node and "Novel" is a category node. As formally defined in Section 2, a large item/category (i.e., item or category) is basically an item/category with its occurrence count in transactions exceeding a given threshold. If an item/category is large, its corresponding node in the taxonomy tree is called a tode (standing for taxonomy node). The todes in each cluster can be viewed as the representatives of the cluster. For the example shown in Fig. 1, nodes marked gray are todes. Based on the definition of a large item, if item i is large, the categories containing i are also large. For example, because item "Harry Potter" is large, its ancestors, category "Novel" and category "Book", are also large. In other words, in the taxonomy tree, if a node is a tode, its ancestor nodes are also todes. This characteristic of todes is helpful in efficiently discovering the todes of clusters. As formally defined in Section 3.1, the todes and the taxonomy tree are both used to identify the nearest todes.

[Fig. 1. Example taxonomy tree; node labels: Book, Novel, Computer Science, Harry Potter, War and Peace, Sorting Machine, Parallel Compiler.]

In view of the features of market-basket data, we devise in this paper a novel measurement, called the category-based adherence, and utilize this measurement to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest tode in the taxonomy tree. The adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster.¹ With this category-based adherence measurement, we develop an efficient clustering algorithm, called algorithm k-todes, for market-basket data. Explicitly, algorithm k-todes employs the category-based adherence as the similarity measurement between transactions and clusters, and allocates each transaction to the cluster with the minimum adherence. To the best of our knowledge, without explicitly considering the presence of the taxonomy, previous efforts on clustering market-basket data unavoidably restricted themselves to dealing with the items at the leaf level (also called the item level) of the taxonomy tree. However, clustering transactions across different levels of the taxonomy is of great importance for the efficiency and the quality of the clustering techniques for market-basket data. Note that in real market-basket data, there are many transactions containing only single items, and many items are purchased infrequently. Hence, without considering the taxonomy tree, one may inappropriately treat a transaction (such as the one containing "parallel compiler" in Fig. 1) as an outlier. However, as indicated in Fig. 1, purchasing "parallel compiler" is in fact instrumental for the category node "computer science" to become a tode (i.e., a representative). In contrast, by employing the category-based adherence measurement for clustering, many transactions will not be mistakenly treated as outliers because the taxonomy relationships of items are taken into consideration, thus leading to a better clustering quality. The details of k-todes will be described in Section 3. A validation model based on Information Gain (IG) is also devised in this paper for evaluating the clustering results. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm k-todes significantly outperforms the prior works [20,21] in both execution efficiency and clustering quality as evaluated by IG, indicating the usefulness of category-based adherence in market-basket data clustering.

1.1. Related works

Many data clustering algorithms have been proposed in the literature. Numerical attributes are those with a finite or infinite number of ordered values, such as the height of a customer. On the other hand, categorical attributes are those with finite unordered values, such as the sex of a customer. In market-basket data, the purchase record is unordered and thus non-numeric. In addition, a transaction can be represented as a vector with boolean attributes where each attribute corresponds to a single item [22]. Boolean attributes themselves form a special case of categorical attributes because they are unordered [2].

The k-means algorithm is efficient in the clustering of numerical data [23]. There are other fast algorithms designed for clustering large numerical data sets, such as CLARANS [24], BIRCH [25], DBSCAN [26], CURE [27], and CSM [28]. In addition, several approaches in [29–31] are proposed to solve the high dimensionality and data sparsity problems of numerical data. The k-modes algorithm extends the k-means algorithm for clustering categorical data by replacing means of clusters with modes and using a frequency-based method to update modes. For each attribute, the mode is the value with the highest frequency. However, in clustering market-basket data sets, k-modes will view each item as one boolean attribute. For each item, k-modes chooses True or False as the most frequent value to perform the clustering and suffers from unstable clustering quality in market-basket data. The approach in [32] is an extension of the k-means algorithm to cluster categorical data by converting multiple category attributes into binary attributes which are computed as numerical data. However, it is very time-consuming in matrix computing and needs a large memory to store the matrices for clustering market-basket data. ROCK is an agglomerative hierarchical clustering algorithm that treats market-basket data as categorical data and uses the links between the data points to cluster categorical data [22]. ROCK utilizes the concept of links for clustering, where a link is defined as the number of common "neighbors" between two transactions. Here two transactions are said to be neighbors if their Jaccard coefficient [1] is larger than or equal to the user-defined threshold θ. The time complexity of ROCK could be prohibitive because the number of transactions is very large in market-basket data. Only by properly choosing the value of θ could ROCK generate clustering results of good quality. However, in practice, the threshold θ is difficult for users to determine [33]. CORE [34] is a gravity-based clustering algorithm that uses the ensemble of correlated-forces between two clusters as the similarity measurement to perform subspace categorical clustering. Conceptual-based clustering in machine learning is developed for clustering categorical data [5,35,36]. In general, the clustering techniques proposed in [5,22,34–36] have high time complexity and thus are not suitable for market-basket data.

¹ The formal definitions of these terms will be given in Section 3.1.

The concept of nodes in [37] is a set of distinct categorical values, where the emphasis is on constructing the categorical clusters by both STIRR [37] and CACTUS [38]. However, how to cluster transactions was not addressed. Explicitly, STIRR is an iterative algorithm based on non-linear dynamic systems. In addition, CACTUS is devised by using a summarization procedure based on the assumption that all attributes are independent. COOLCAT in [33], which utilizes entropy analysis for clustering categorical data sets, is also under the attribute independence assumption. However, the items (each of which represents a boolean attribute) in market-basket data sets usually have high associations [16], meaning that the assumption of having independent attributes needs further justification.

The authors in [39] proposed a hypergraph partitioning algorithm to find the clusters of items and transactions based on the large item sets. The work in [40] devised a top-down hierarchical algorithm that uses association rules with high confidences to discover the clusters of customers. BitOp is a greedy grid-based clustering algorithm for clustering association rules where the cause attributes are quantitative and the consequence attribute is categorical [41]. The authors in [42] proposed an EM-based algorithm that uses the maximum likelihood estimation method for clustering transaction data. OPOSSUM is a graph-partitioning approach based on a similarity matrix to cluster transaction data [43]. The work in [21] proposed a k-means based algorithm that uses large items as the similarity measurement to divide the transactions into clusters, with a cost function to minimize the overlap of large items (corresponding to inter-cluster cost) and minimize the union summation of small items (corresponding to intra-cluster cost). In this approach, an item which is not large is called a small item and is used to measure the intra-cluster cost. However, with this disposition, the support difference between a large item and a small one could be as small as one, which could make the clustering quality very data dependent. OAK in [44] combined hierarchical and partitional clustering techniques for transaction data. CLOPE in [45] proposed a heuristic approach that increases the height-to-width ratio for clustering transaction data. CLOPE did not explicitly address the inter-cluster dissimilarity issue. In addition, there is no explicit statement describing the statistical relationship between the repulsion parameter and the intra-cluster similarity.

In market-basket data, the taxonomy of items defines the generalization relationships for the concepts in different abstraction levels [46]. Item taxonomy (i.e., is-a hierarchy) is well addressed with respect to its impact on mining association rules of market-basket data [17,19] and can be represented as a tree, called the taxonomy tree. Similar techniques for extracting synonyms, hypernyms (i.e., a kind of) and holonyms (i.e., a part of) from the lexical database are derived in [47,48].

This paper is organized as follows. Preliminaries are given in Section 2. In Section 3, algorithm k-todes is devised for clustering market-basket data. Experimental studies are conducted in Section 4. This paper concludes with Section 5.

2. Preliminary

The problem description is presented in Section 2.1. In Section 2.2, we describe a new validation model, the IG validation model, for assessing the quality of different clustering algorithms.

2.1. Problem description

In this paper, the market-basket data is represented by a set of transactions. A database of transactions is denoted by $D = \{t_1, t_2, \ldots, t_h\}$, where each transaction $t_j$ is represented by a set of items $\{i_1, i_2, \ldots, i_r\}$. An example database for clustering market-basket data is described in Table 1, where there are twelve transactions, each of which has a transaction identification (abbreviated as TID) and a set of purchased items. For example, TID 40 has item h and item z. A clustering $U = \langle C_1, C_2, \ldots, C_k \rangle$ is a partition of transactions into k clusters, where $C_j$ is a cluster consisting of a set of transactions. Items in the transactions can be generalized to multiple concept levels of the taxonomy. An example taxonomy tree is shown in Fig. 2. In the taxonomy tree, the leaf nodes are called the item nodes and the internal nodes are called the category nodes. The root node in the highest level is a virtual concept of the generalization of all categories. In this taxonomy tree, item g is-a category B, category B is-a category A, and item h is-a category B, etc. In this paper, we use the measurement of the occurrence count to determine which items or categories are the representatives of each cluster.

Definition 1. The support of an item $i_k$ in a cluster $C_j$, denoted by $Sup(i_k, C_j)$, is defined as the number of transactions containing this item $i_k$ in cluster $C_j$. An item $i_k$ in a cluster $C_j$ is called a large item if $Sup(i_k, C_j)$ exceeds the minimum support count.

Definition 2. The support of a category $c_k$ in a cluster $C_j$, denoted by $Sup(c_k, C_j)$, is defined as the number of transactions containing items under this category $c_k$ in cluster $C_j$. A category $c_k$ in a cluster $C_j$ is called a large category if $Sup(c_k, C_j)$ exceeds the minimum support count.

Note that one transaction may include more than one item from the same category, in which case the support contributed by this transaction to that category is still one. In this paper, the minimum support percentage $S_p$ is a given parameter for determining the large items/categories of the taxonomy tree in the cluster. For a cluster $C_j$, the minimum support count $S_c(C_j)$ is defined as follows.

Definition 3. For cluster $C_j$, the minimum support count $S_c(C_j)$ is defined as

$$S_c(C_j) = S_p \times |C_j|,$$

where $|C_j|$ denotes the number of transactions in cluster $C_j$.

Consider the example database in Table 1 as an initial cluster $C_0$ with the corresponding taxonomy tree recording the supports of the items/categories shown in Fig. 2. Then, $Sup(g, C_0) = 5$ and $Sup(E, C_0) = 7$. With $S_p = 50\%$, we have $S_c(C_0) = 6$.

Table 1

An example database D

TID 10 20 30 40 50 60 70 80 90 100 110 120


In this example, all categories are large, whereas none of the items is large.
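Definitions 1 to 3 can be illustrated with a minimal sketch. The `PARENT` map and the `DATABASE` dictionary below are assumptions: they are reconstructed from the taxonomy of Fig. 2 and the worked example of Section 3.3 rather than quoted from Table 1, and the item sets of TIDs 60, 90, and 120 could be permuted among themselves without changing any support. The helper names are hypothetical.

```python
# Child -> parent links of the taxonomy tree in Fig. 2 (reconstructed; an assumption).
PARENT = {
    "g": "B", "h": "B", "k": "B", "m": "R", "n": "R",
    "x": "F", "y": "F", "z": "E",
    "B": "A", "R": "A", "F": "E", "A": "root", "E": "root",
}

# Table 1 as reconstructed from the worked example (an assumption; the item
# sets of TIDs 60, 90 and 120 may be permuted among themselves).
DATABASE = {
    10: {"g", "x"}, 20: {"m", "y"}, 30: {"y", "z"}, 40: {"h", "z"},
    50: {"g", "x", "y"}, 60: {"g", "n"}, 70: {"k", "m", "n"}, 80: {"y"},
    90: {"g", "h", "n"}, 100: {"m", "n"}, 110: {"y", "z"},
    120: {"g", "k", "n"},
}


def covered_nodes(transaction):
    """Return the items of a transaction plus all their ancestor categories."""
    nodes = set()
    for item in transaction:
        node = item
        while node != "root":
            nodes.add(node)
            node = PARENT[node]
    return nodes


def supports(cluster):
    """Definitions 1 and 2: per node, count the transactions containing it.

    Using a set per transaction means a transaction with several items of
    the same category still contributes one to that category.
    """
    counts = {}
    for transaction in cluster:
        for node in covered_nodes(transaction):
            counts[node] = counts.get(node, 0) + 1
    return counts


def min_support_count(cluster, sp):
    """Definition 3: S_c(C_j) = S_p * |C_j|."""
    return sp * len(cluster)


c0 = list(DATABASE.values())
sup = supports(c0)
sc = min_support_count(c0, 0.5)
print(sup["g"], sup["E"], sc)
```

Under these assumptions the sketch reproduces the values quoted in the text: the support of g is 5, the support of E is 7, and the minimum support count is 6.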

2.2. Information gain validation model

To evaluate the quality of clustering results, some experimental models were proposed [49,50]. In general, the square error criterion is widely employed in evaluating the efficiency of numerical data clustering algorithms [49]. In addition, the authors in [51] proposed a novel clustering validation scheme which uses the variance and the density of each cluster to measure the inter-cluster dissimilarity and the intra-cluster similarity. Note that the natural feature of numeric data is quantitative (e.g., weight or length), whereas that of categorical data is qualitative (e.g., color or gender) [2]. Validation schemes using the concept of variance are thus not applicable to assessing the clustering result of categorical data. To remedy this problem, some real data with good classified labels, e.g., mushroom data, congressional votes data, soybean disease [52] and Reuters news collection [53], were taken as the experimental data for categorical clustering algorithms [22,54,21,44]. In view of the features of market-basket data, we propose in this paper a validation model based on Information Gain (IG) to assess the quality of the clustering results. It is noted that information gain is widely used in the classification problem [55,56]. Explicitly, ID3 [55] and C4.5 [56] used the information gain measurement to select the test attribute with the highest information gain for splitting when constructing the decision tree.

The definitions required for deriving the information gain of a clustering result are given below.

Definition 4. The entropy of an attribute $J_a$ in a database $D$ is defined as

$$I(J_a, D) = -\sum_{i=1}^{n} \frac{|J_a^i|}{|D|} \log_2 \frac{|J_a^i|}{|D|},$$

where $|D|$ is the number of transactions in $D$ and $|J_a^i|$ denotes the number of the transactions whose attribute $J_a$ is classified as the value $J_a^i$ in $D$.

Definition 5. The entropy of an attribute $J_a$ in a cluster $C_j$ is defined as

$$I(J_a, C_j) = -\sum_{i=1}^{n} \frac{|J_{a,c_j}^i|}{|C_j|} \log_2 \frac{|J_{a,c_j}^i|}{|C_j|},$$

where $|C_j|$ is the number of transactions in cluster $C_j$, and $|J_{a,c_j}^i|$ denotes the number of the transactions whose attribute $J_a$ is classified as the value $J_a^i$ in $C_j$.

Definition 6. Let a clustering $U$ contain the clusters $C_1, C_2, \ldots, C_m$. The entropy of an attribute $J_a$ in the clustering $U$ is defined as

$$E(J_a, U) = \sum_{C_j \in U} \frac{|C_j|}{|D|} I(J_a, C_j).$$

Definition 7. The information gain obtained by separating $J_a$ into the clusters of the clustering $U$ is defined as

$$Gain(J_a, U) = I(J_a, D) - E(J_a, U).$$

Definition 8. The information gain of the clustering $U$ is defined as

$$IG(U) = \sum_{J_a \in I} Gain(J_a, U),$$

where $I$ is the set of all items purchased in the whole market-basket data records.

A complete numerical example on the use of these definitions will be given in Section 3.3. For clustering market-basket data, the larger the IG value, the better the clustering quality. In market-basket data, with the taxonomy tree, there are three kinds of IG values, i.e., $IG_{item}(U)$, $IG_{cat}(U)$, and $IG_{total}(U)$, for representing the quality of a clustering result. Specifically, $IG_{item}(U)$ is the information gain obtained on items and $IG_{cat}(U)$ is the information gain obtained on categories. $IG_{total}(U)$ is the total information gain, i.e., $IG_{total}(U) = IG_{item}(U) + IG_{cat}(U)$. In general, a market-basket data set is typically represented by a 2-dimensional table, in which each entry is either 1 or 0 to denote purchased or non-purchased items, respectively. In the IG validation model, we treat each item in market-basket data as an attribute $J_a$ with two classified labels, 1 or 0. Explicitly, for an item $i_k$, $I_{i_k}^{Yes}$ and $I_{i_k}^{No}$ are the two classified labels of item $i_k$ to represent purchased and non-purchased values. The meanings of various parameters are shown in Table 2. It will be shown in Section 4 that with the category-based adherence measurement, algorithm k-todes outperforms the prior works [20,21] in the clustering quality based on the IG validation model.

Fig. 2. An illustrative taxonomy example whose transactions are shown in Table 1 ($S_p = 0.5$). [Taxonomy tree with the root, category nodes A, B, R, E, F and item nodes g, h, k, m, n, x, y, z, each annotated with its support count.]

3. Design of algorithm k-todes

In this section, we describe the details of algorithm k-todes. The similarity measurement of k-todes, called category-based adherence, will be described in Section 3.1. The procedure of k-todes is devised in Section 3.2 and an illustrative example is given in Section 3.3. The complexity of k-todes is analyzed in Section 3.4.

3.1. Similarity measurement: category-based adherence

Some terms for the similarity measurement employed by algorithm k-todes are defined as follows.

Definition 9 (Tode). If an item/category is large, its corresponding node in the taxonomy tree is called a tode (standing for taxonomy node). In this paper, the todes in each cluster are the representatives of the cluster. For the example shown in Fig. 3, nodes marked gray are todes.

Definition 10 (Nearest tode of an item to a cluster). In the taxonomy tree, the nearest tode of an item $i_k$ is itself if $i_k$ is a tode. Otherwise, the nearest tode is the category node at the lowest generalized concept level among all ancestor todes of item $i_k$. Note that if an item/category node is identified as a tode, all its higher-level category nodes will also be todes. For the example shown in Fig. 3, the nearest tode of item k to cluster $C_1$ is category B and the nearest tode of item k to cluster $C_2$ is category A.

Definition 11 (Distance of an item to a cluster). For an item $i_k$ of a transaction, the distance of $i_k$ to a cluster $C_j$, denoted by $d(i_k, C_j)$, is defined as the number of links between $i_k$ and the nearest tode of $i_k$ to cluster $C_j$. If $i_k$ is a tode in cluster $C_j$, then $d(i_k, C_j) = 0$. For the example shown in Fig. 3, the distance of item k to cluster $C_1$ is $d(k, C_1) = 1$ and the distance of item k to cluster $C_2$ is $d(k, C_2) = 2$.

Table 2
The meanings of various parameters

Notation            Meaning
$D$                 The database
$Sup(i, C_j)$       The support of $i$ in cluster $C_j$
$IG(U)$             The information gain of clustering $U$
$IG_{item}(U)$      The information gain obtained on items in clustering $U$
$IG_{cat}(U)$       The information gain obtained on categories in clustering $U$
$IG_{total}(U)$     The total information gain in clustering $U$
$d(i_k, C_j)$       The distance of item $i_k$ to cluster $C_j$

Definition 12 (Adherence of a transaction to a cluster). For a transaction $t = \{i_1, i_2, \ldots, i_p\}$, the adherence of $t$ to a cluster $C_j$, denoted by $H(t, C_j)$, is defined as the average of the distances of the items in $t$ to $C_j$:

$$H(t, C_j) = \frac{1}{p} \sum_{k=1}^{p} d(i_k, C_j),$$

where $d(i_k, C_j)$ is the distance of $i_k$ to cluster $C_j$. For the example shown in Fig. 3, the adherence of TID 70 to cluster $C_1$ is $H(70, C_1) = \frac{1}{3}(d(k, C_1) + d(m, C_1) + d(n, C_1)) = \frac{1}{3}(1 + 1 + 0) = \frac{2}{3}$ and the adherence of TID 70 to cluster $C_2$ is $H(70, C_2) = \frac{1}{3}(d(k, C_2) + d(m, C_2) + d(n, C_2)) = \frac{1}{3}(2 + 0 + 1) = 1$.

Note that the todes are the representatives of a cluster. The adherence of a transaction to a cluster is a measurement of the distance between the transaction and the representatives of the cluster. Thus, the smaller the adherence, the higher the similarity. In the example shown in Fig. 3, because $H(70, C_1) = \frac{2}{3} < H(70, C_2) = 1$, TID 70 is more similar to $C_1$ than to $C_2$.
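Definitions 10 to 12 can be sketched in a few lines. The `PARENT` map is a reconstruction of the taxonomy in Fig. 2 and is an assumption, as are the two tode sets, which are taken from the first iteration of the example in Section 3.3.1 because they are consistent with the distances quoted for Fig. 3. Treating the virtual root as a node that is always reachable is also an assumption of this sketch, and the function names are hypothetical.

```python
from fractions import Fraction

# Child -> parent links of the taxonomy tree in Fig. 2 (reconstructed; an assumption).
PARENT = {
    "g": "B", "h": "B", "k": "B", "m": "R", "n": "R",
    "x": "F", "y": "F", "z": "E",
    "B": "A", "R": "A", "F": "E", "A": "root", "E": "root",
}


def distance(item, todes):
    """Definition 11: links from the item up to its nearest tode.

    The walk stops at the virtual root if no tode is met on the way
    (an assumption; the root generalizes all categories).
    """
    node, links = item, 0
    while node not in todes and node != "root":
        node = PARENT[node]
        links += 1
    return links


def adherence(transaction, todes):
    """Definition 12: average distance of the items to the cluster."""
    total = sum(distance(item, todes) for item in transaction)
    return Fraction(total, len(transaction))


# Tode sets of the two clusters (assumed to match the gray nodes of Fig. 3).
TODES_C1 = {"A", "E", "B", "R", "g", "n"}
TODES_C2 = {"A", "E", "R", "F", "m", "y"}

tid70 = ["k", "m", "n"]
h1 = adherence(tid70, TODES_C1)
h2 = adherence(tid70, TODES_C2)
print(h1, h2)
```

Exact fractions are used so that ties between clusters can be detected without floating-point noise.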

3.2. Procedure of algorithm k-todes

The overall procedure of algorithm k-todes is shown in Fig. 4. In Step 1, algorithm k-todes randomly selects k transactions from the database D as the seed transactions of the k clusters. For each cluster, the items and categories of the corresponding seed transaction are counted once in the taxonomy tree. In each cluster, the items and their ancestors are all large at the very beginning because their support percentages are all 100% in the only seed transaction, which is larger than the minimum support percentage. These nodes are therefore the todes of each initial cluster. In Step 2, algorithm k-todes reads each transaction sequentially and allocates it to the cluster with the minimum category-based adherence. After a transaction is allocated to a cluster $C_j$, the supports of its items and their ancestors are increased by one in the corresponding nodes of the taxonomy tree of $C_j$. After all transactions are allocated, the minimum support counts of the clusters are updated. In Step 3, algorithm k-todes updates the todes of each cluster based on the supports of nodes in the taxonomy tree. In Step 4, algorithm k-todes repeats Steps 2 and 3 until no transaction is moved between clusters. In Step 5, algorithm k-todes outputs the taxonomy tree of the final clustering result for each cluster, where the items, categories, and their corresponding counts are presented.

Fig. 3. The adherence represents the similarity measurement. [Taxonomy trees of clusters $C_1$ and $C_2$ with todes marked gray, together with TID 70 = {k, m, n}.]

Procedure of Algorithm k-todes

Step 1. Randomly select k transactions as the seed transactions of the k clusters from the database D.

Step 2. Read each transaction sequentially and allocate it to the cluster with the minimum category-based adherence. For each moved transaction, the supports of items and their ancestors are increased by one.

Step 3. Update the todes of each cluster.

Step 4. Repeat Step 2 and Step 3 until no transaction is moved between clusters.

Step 5. Output the taxonomy tree for each cluster as the visual representation of the clustering result.
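The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the taxonomy and database are reconstructed from Fig. 2 and Section 3.3, the seeds are passed in explicitly instead of being drawn at random, a node is treated as large when its support is at least $S_p \times |C_j|$, the virtual root is always reachable, todes stay fixed within a pass, and a transaction stays in its current cluster on a tie. All function names are hypothetical.

```python
from fractions import Fraction

# Taxonomy of Fig. 2 as child -> parent links (reconstructed; an assumption).
PARENT = {
    "g": "B", "h": "B", "k": "B", "m": "R", "n": "R",
    "x": "F", "y": "F", "z": "E",
    "B": "A", "R": "A", "F": "E", "A": "root", "E": "root",
}
# Table 1 as reconstructed from Section 3.3 (an assumption; the item sets of
# TIDs 60, 90 and 120 could be permuted among themselves).
DATABASE = {
    10: ["g", "x"], 20: ["m", "y"], 30: ["y", "z"], 40: ["h", "z"],
    50: ["g", "x", "y"], 60: ["g", "n"], 70: ["k", "m", "n"], 80: ["y"],
    90: ["g", "h", "n"], 100: ["m", "n"], 110: ["y", "z"],
    120: ["g", "k", "n"],
}


def path_to_root(item):
    """Return the item followed by its ancestors, excluding the virtual root."""
    path, node = [], item
    while node != "root":
        path.append(node)
        node = PARENT[node]
    return path


def find_todes(tids, sp):
    """Step 3: nodes whose support reaches sp * |cluster| (>= is assumed)."""
    counts = {}
    for tid in tids:
        covered = {n for item in DATABASE[tid] for n in path_to_root(item)}
        for node in covered:
            counts[node] = counts.get(node, 0) + 1
    return {n for n, c in counts.items() if c >= sp * len(tids)}


def distance(item, todes):
    """Links from the item up to its nearest tode; the root always qualifies."""
    path = path_to_root(item)
    for links, node in enumerate(path):
        if node in todes:
            return links
    return len(path)


def adherence(items, todes):
    return Fraction(sum(distance(i, todes) for i in items), len(items))


def k_todes(seeds, sp):
    """Run Steps 1-4; return (clusters, TIDs reassigned in each pass)."""
    assignment = {tid: j for j, tid in enumerate(seeds)}      # Step 1
    k, history = len(seeds), []
    while True:
        clusters = [[t for t in DATABASE if assignment.get(t) == j]
                    for j in range(k)]
        todes = [find_todes(tids, sp) for tids in clusters]   # Step 3
        moved = []
        for tid, items in DATABASE.items():                   # Step 2
            scores = [adherence(items, t) for t in todes]
            current = assignment.get(tid)
            # Smallest adherence wins; on a tie the current cluster is kept.
            best = min(range(k), key=lambda j: (scores[j], j != current))
            if best != current:
                assignment[tid] = best
                moved.append(tid)
        history.append(moved)
        if not moved:                                         # Step 4
            return clusters, history


clusters, history = k_todes(seeds=[10, 20], sp=0.5)
print(clusters)
print(history[1:])
```

With TID 10 and TID 20 as seeds, this sketch follows the same course as the example of Section 3.3.1: TIDs 50 and 70 move in the second pass, TID 100 moves in the third, and the fourth pass moves nothing.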


3.3. An illustrative example

An illustrative example is given to describe the execution of k-todes in Section 3.3.1 and an example for describing the measurement of information gain is given in Section 3.3.2.

3.3.1. Execution of k-todes

For the example database D shown in Table 1, we set $k = 2$ and $S_p = 50\%$. In Step 1, algorithm k-todes randomly chooses TID 10 and TID 20 as the seed transactions of clusters $C_1$ and $C_2$, respectively. Then, for cluster $C_1$ shown in Fig. 5a, nodes marked gray are the purchased items of TID 10 and the corresponding categories in the taxonomy tree. The gray nodes are identified as todes. Similarly, for cluster $C_2$, shown in Fig. 5b, nodes marked gray are todes. In Fig. 5, the support of each node is illustrated nearby. For example, $Sup(g, C_1)$ is 1 and $Sup(g, C_2)$ is 0. In Step 2, algorithm k-todes first allocates TID 30 to cluster $C_2$ because $H(30, C_2) = \frac{1}{2}(1 + 0) = \frac{1}{2}$ is smaller than $H(30, C_1) = \frac{1}{2}(1 + 1) = 1$. Similarly, TIDs 40, 50, 60, 90, and 120 are allocated to cluster $C_1$, which is shown in Fig. 6a. TIDs 30, 70, 80, 100, and 110 are allocated to cluster $C_2$, which is shown in Fig. 6b. In Step 3, algorithm k-todes updates the todes in cluster $C_1$ to be {A, E, B, R, g, n} and the todes in cluster $C_2$ to be {A, E, R, F, m, y}. Explicitly, algorithm k-todes derives $S_c(C_1) = 3$ and $S_c(C_2) = 3$ by $S_p \times |C_1| = 0.5 \times 6 = 3$ and $S_p \times |C_2| = 0.5 \times 6 = 3$, respectively. Because $Sup(g, C_1) > S_c(C_1)$, item g is identified as a large node in cluster $C_1$ and marked gray. In Step 4, algorithm k-todes proceeds to iteration 2 by repeating Steps 2 and 3. In iteration 2, two transactions, TID 50 and TID 70, are moved. TID 50 is moved from cluster $C_1$ to cluster $C_2$ because $H(50, C_1) = \frac{1}{3}(0 + 2 + 2) = \frac{4}{3} > H(50, C_2) = \frac{1}{3}(2 + 1 + 0) = 1$, and TID 70 is moved from cluster $C_2$ to cluster $C_1$ because $H(70, C_1) = \frac{1}{3}(1 + 1 + 0) = \frac{2}{3} < H(70, C_2) = \frac{1}{3}(2 + 0 + 1) = 1$. Then, algorithm k-todes updates the todes again. In iteration 3, only one transaction, TID 100, is moved from cluster $C_2$ to cluster $C_1$. In iteration 4, there is no movement and thus algorithm k-todes proceeds to Step 5. The final taxonomy trees of clustering $U_1$ are shown in Fig. 7.

Note that a transaction at the item level may not be similar to any cluster. For example, TID 10 {g, x} and TID 40 {h, z} have no common items, but item g and item h have the common category B, and item x and item z have the common category E. Thus, TID 10 is similar to TID 40 at the high-level concept. By taking the category-based adherence measurement, many transactions need not be taken as outliers because the categorical relationships of items are taken into consideration. In addition, transactions at the item level may have the same similarities in different clusters. However, by summarizing the similarities of all items across their category levels, algorithm k-todes allocates each transaction to a proper cluster. For example, TID 50 has three items: g, x, and y. Item g is large in cluster $C_1$ and item y is large in cluster $C_2$. Thus, at the item level, TID 50 has the same similarities in both $C_1$ and $C_2$. However, item x belongs to category F, which is a tode in $C_2$. Thus, TID 50 is allocated to $C_2$.

Fig. 5. In Step 1, algorithm k-todes randomly chooses the seed transaction for each cluster: (a) taxonomy tree in $C_1$, seeded with TID 10 {g, x}; (b) taxonomy tree in $C_2$, seeded with TID 20 {m, y}.

3.3.2. Measurement by information gains

To provide more insight into the quality of k-todes, we calculate the IG values of the clustering $U_1$ shown in Fig. 7. Note that for an item $i_k$, $I^{Yes}_{i_k}$ and $I^{No}_{i_k}$ are the two classified labels of item $i_k$, representing the purchased and non-purchased values. For item g, the information gain is

$$Gain(g, U_1) = I(g, D) - E(g, U_1) = \left(-\tfrac{5}{12}\log_2\tfrac{5}{12} - \tfrac{7}{12}\log_2\tfrac{7}{12}\right) - \left[\tfrac{7}{12}\left(-\tfrac{4}{7}\log_2\tfrac{4}{7} - \tfrac{3}{7}\log_2\tfrac{3}{7}\right) + \tfrac{5}{12}\left(-\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{4}{5}\log_2\tfrac{4}{5}\right)\right] = 0.10.$$

Similarly, $Gain(h, U_1) = 0.15$, $Gain(k, U_1) = 0.48$, $Gain(m, U_1) = 0.31$, $Gain(n, U_1) = 0.48$, $Gain(x, U_1) = 0$, $Gain(y, U_1) = 0.98$, and $Gain(z, U_1) = 0.39$. Hence, $IG_{item}(U_1) = \sum_{J_a \in I} Gain(J_a, U_1) = 2.89$, where $I$ is the set of items {g, h, k, m, n, x, y, z}. Similarly, $Gain(B, U_1) = 0.33$, $Gain(R, U_1) = 0.2$, $Gain(A, U_1) = 0.41$, $Gain(F, U_1) = 0.65$, $Gain(E, U_1) = 0.48$, and thus $IG_{cat}(U_1) = \sum_{J_a \in C} Gain(J_a, U_1) = 2.07$, where $C$ is the set of categories {A, B, E, F, R}. Then, $IG_{total}(U_1) = IG_{item}(U_1) + IG_{cat}(U_1) = 4.96$.
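The gain of item g can be reproduced with a short sketch that evaluates the entropy of the purchased and non-purchased labels before and after the clustering. The function names and the (purchased, size) input format are illustrative choices, not part of the paper; the counts are the ones that appear in the formula for item g (4 of 7 transactions in $C_1$, 1 of 5 in $C_2$).

```python
from math import log2


def entropy(purchased, total):
    """Binary entropy of the purchased / non-purchased labels, in bits."""
    result = 0.0
    for count in (purchased, total - purchased):
        if 0 < count < total:
            p = count / total
            result -= p * log2(p)
    return result


def information_gain(clusters):
    """Gain of one item for a clustering.

    `clusters` is a list of (purchased, size) pairs, one pair per cluster.
    """
    total = sum(size for _, size in clusters)
    purchased = sum(bought for bought, _ in clusters)
    before = entropy(purchased, total)
    after = sum(size / total * entropy(bought, size) for bought, size in clusters)
    return before - after


# Item g: bought in 4 of the 7 transactions of C1 and in 1 of the 5 of C2.
gain_g = information_gain([(4, 7), (1, 5)])
print(round(gain_g, 2))
```

The same function can be applied to a category by counting the transactions that contain at least one item under that category.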

3.4. Complexity analysis of algorithm k-todes

The time complexity and the space complexity of algorithm k-todes are analyzed by the following two theorems.

Theorem 1. The time complexity of k-todes is $O(rk(|D|vN_L + N_N))$, where $r$ is the number of iterations, $k$ is the given cluster number, $|D|$ is the database size, $v$ is the average transaction length, $N_L$ is the number of taxonomy levels, and $N_N$ is the number of nodes in the taxonomy tree.

Proof. We first define the following symbols to analyze the complexity of each iteration in detail: $I_t$ is the item set in transaction $t$, $1 \le t \le |D|$; $I_t^m$ is the $m$th item in transaction $t$; $cost(I_t^m, C_k)$ is the cost for $I_t^m$ to find the nearest tode in cluster $C_k$; and $x(I_t^m)$ is the number of levels from item $I_t^m$ to its highest ancestor, $1 \le x(I_t^m) \le N_L$. Note that an iteration consists of Steps 2 and 3. For each transaction $t$, the adherence of $t$ to every cluster is obtained in Step 2. Thus, the time complexity of this sub-step is $\sum_{k}\sum_{I_t \in D}\sum_{I_t^m} cost(I_t^m, C_k)$. After obtaining the cluster $C_a$ in which $t$ has the minimum adherence, $t$ is allocated to $C_a$ and the supports of items and related categories in the taxonomy tree of $C_a$ will be increased by one.

Fig. 6. In Step 2, algorithm CBA reads each transaction sequentially and allocates it to the cluster with the minimum category-based adherence. (a) Taxonomy tree in $C_1$, holding TIDs 10, 40, 50, 60, 90, and 120, with $S_c(C_1) = 3$; (b) taxonomy tree in $C_2$, holding TIDs 20, 30, 70, 80, 100, and 110, with $S_c(C_2) = 3$.

Thus, the time complexity of this sub-step is $\sum_{I_t^m} x(I_t^m)$. In Step 3, algorithm k-todes updates the todes of each cluster. Thus, the time complexity of Step 3 is $kN_N$, where $N_N$ is the number of nodes in the taxonomy tree. With $r$ iterations for running Steps 2 and 3, the total time complexity is therefore

$$\sum_{r}\left[\left\{\sum_{k}\sum_{I_t \in D}\sum_{I_t^m} cost(I_t^m, C_k) + \sum_{I_t^m} x(I_t^m)\right\} + \{kN_N\}\right] \le rk|D|vN_L + r|D|vN_L + rkN_N = r(k+1)|D|vN_L + rkN_N = O(rk(|D|vN_L + N_N)),$$

where $r$ is the number of iterations, $k$ is the given cluster number, $|D|$ is the database size, $v$ is the average transaction length, $N_L$ is the number of taxonomy levels, and $N_N$ is the number of nodes in the taxonomy tree. □
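The bound in the proof can be unpacked term by term. The sketch below assumes that finding the nearest tode for one item takes at most one step per taxonomy level, so that $cost(I_t^m, C_k) \le N_L$, and it uses the definition of $v$ as the average transaction length, so that the item occurrences over all transactions number $|D|v$. The last line uses $k \ge 1$.

```latex
\begin{align*}
\sum_{I_t \in D} \sum_{I_t^m} 1 &= |D|\,v, \\
\sum_{k} \sum_{I_t \in D} \sum_{I_t^m} \mathrm{cost}(I_t^m, C_k) &\le k \cdot |D|\,v \cdot N_L, \\
\sum_{I_t \in D} \sum_{I_t^m} x(I_t^m) &\le |D|\,v \cdot N_L, \\
r\left(k|D|vN_L + |D|vN_L + kN_N\right) &= r(k+1)|D|vN_L + rkN_N
  \le 2rk\left(|D|vN_L + N_N\right).
\end{align*}
```

The constant factor 2 is absorbed by the O-notation, which gives the stated complexity.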

Theorem 2. The space complexity of k-todes is $O(|D| + kA)$, where $|D|$ is the database size, $k$ is the given cluster number, and $A$ is the number of nodes, including category nodes and item nodes, in the taxonomy tree.

Proof. First, before k-todes is executed, all data must be loaded and the space requirement is $O(|D|)$. In each cluster, only an array structure is needed to store the supports of all nodes, whose space requirement is $O(A)$. Because the number of clusters is $k$, the space requirement is $O(kA)$ for all clusters. Thus, the overall space complexity of k-todes is $O(|D| + kA)$. □
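The per-cluster storage described in the proof can be pictured as one support array per cluster, indexed by taxonomy node. The following is a minimal sketch on the taxonomy of the running example; the names `new_clusters` and `allocate` are hypothetical, and counting a shared category once per transaction is an assumption made here for illustration.

```python
# Taxonomy of the running example as child -> parent links (root omitted).
PARENT = {
    "B": "A", "R": "A", "F": "E", "z": "E",
    "g": "B", "h": "B", "k": "R", "m": "R", "n": "R",
    "x": "F", "y": "F",
}
# Every category node and item node gets one slot in the support array.
NODES = sorted(set(PARENT) | set(PARENT.values()))
INDEX = {node: i for i, node in enumerate(NODES)}


def new_clusters(k):
    """Allocate one support array per cluster: k arrays of len(NODES) slots."""
    return [[0] * len(NODES) for _ in range(k)]


def allocate(supports, transaction):
    """Add one to the support of every item and ancestor category touched.

    Assumption: a category shared by several items of the same transaction
    is counted once per transaction.
    """
    touched = set()
    for item in transaction:
        node = item
        while node is not None:
            touched.add(node)
            node = PARENT.get(node)   # None once the top category is passed
    for node in touched:
        supports[INDEX[node]] += 1


clusters = new_clusters(k=2)
allocate(clusters[0], ["g", "x"])        # TID 10
allocate(clusters[0], ["g", "x", "y"])   # TID 50, placed here only for illustration
support_c1 = dict(zip(NODES, clusters[0]))
print(support_c1)
```

Each allocation walks at most one path per item from the leaf to its highest ancestor, which is the $x(I_t^m)$ term of the time analysis, while the storage stays at one fixed-size array per cluster regardless of how many transactions are allocated.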

4. Experimental results

To assess the performance of algorithm k-todes, we have conducted a series of experiments. These experiments are performed on a computer with a 1 GHz Intel CPU and 512 MB of memory. We compare k-todes with the k-modes algorithm [20] and the algorithm proposed in [21] (for convenience, the latter is referred to as Basic in this paper). By extending both previous approaches with taxonomy consideration in market-basket data, we also implement algorithm k-modesT (standing for k-modes with Taxonomy) and algorithm BasicT (standing for Basic with Taxonomy) for comparison purposes. The details of data generation are described in Section 4.1. The experimental results are shown in Section 4.2.

[Fig. 7: final taxonomy trees of clustering $U_1$. (a) Taxonomy tree in $C_1$, holding TIDs 10, 40, 60, 70, 90, 100, and 120, with $S_c(C_1) = 3.5$; (b) taxonomy tree in $C_2$, holding TIDs 20, 30, 50, 80, and 110, with $S_c(C_2) = 2.5$.]

4.1. Data generation

The meanings of various parameters used in our experiments are shown in Table 3. We take the real market-basket data from a large bookstore company for performance study. In this real data set, whose item distribution is shown in Fig. 8, there are $|D| = 100$K transactions, $N_I = 58909$ items, and $N_L = 3$ taxonomy levels. Note that in this real data, there are many transactions containing only a single item, and many items are purchased infrequently: $58909 - 31846 = 27063$ items are purchased only once among the 100K transactions. To provide more insight into this study, we also use well-known synthetic market-basket data generated by the IBM Quest Synthetic Data Generation Code [16] for performance evaluation. This code generates volumes of transactions over a large range of data characteristics. These transactions mimic the transactions in a real-world retailing environment. The generation code also assumes that people tend to buy sets of items together, and each such set is potentially a maximal large itemset. An example of such a set might be sheets, pillow case, comforter, and ruffles. However, not all items purchased by customers are large itemsets. The average size of the transactions, denoted by $|T|$, is set to 5 as default. The average size of the maximal potentially large itemsets, denoted by $|I|$, is set to 2 as default. The number of maximal potentially large itemsets, denoted by $|L|$, is set to 2K. The number of items in the database, denoted by $N_I$, is set to 60K as default. The number of roots, denoted by $N_R$, is set to 100, and the number of taxonomy levels, denoted by $N_L$, is set to 3.

4.2. Performance study

We conduct experiments in this section for performance study, and the clustering quality is evaluated by the IG values. For algorithms k-todes, Basic, and BasicT, the minimum support percentage $S_p$ is set to 0.5%. Recall that there are three kinds of IG values, i.e., $IG_{item}$, $IG_{cat}$, and $IG_{total}$, for evaluating the quality of the clustering result. $IG_{item}$ is the information gain obtained on items and $IG_{cat}$ is the information gain obtained on categories, with $IG_{total} = IG_{item} + IG_{cat}$.

4.2.1. Experiment one: Comparison on the clustering results

Fig. 9a shows the relative qualities of the clustering results of k-todes, k-modes, k-modesT, Basic, and BasicT on the real data set where $|D| = 100$K, $N_L = 3$, and $N_I = 58909$. In addition, the number of clusters $k$ is 50. As described in [57], a term with a higher discrimination value will be associated with a longer distance between data points in the database. Because different items may belong to the same categories, the discrimination values of categories are lower than those of items for the transactions in the database. For identifying the large and small terms in Basic and BasicT, the discrimination values of the items and the categories are aggregated in the similarity measurements for clustering market-basket data. Thus, BasicT obtains higher IG values than Basic. Similarly, k-modesT obtains higher IG values than k-modes. By considering the item similarities across their category levels, algorithm k-todes utilizes the category-based adherence measurement to allocate each transaction to a proper cluster so that k-todes in general outperforms other algorithms in the three IG values. To provide more insight into the performance comparisons of the algorithms, we also conduct experiments on synthetic data sets. In the experiments shown in Fig. 9b–d, we set $|T| = 5$, $|L| = 2$K, $|I| = 2$, $N_I = 60$K, $N_R = 100$, and $N_L = 3$ with three different synthetic database sizes ($|D| = 100$K, $|D| = 500$K, and $|D| = 900$K).

Table 3
The meanings of various parameters used in experimental results

Notation    Meaning
$|D|$       The database size
$|T|$       Average size of the transactions
$|I|$       Average size of the maximal potential large itemsets
$|L|$       Number of large itemsets within database
$N_I$       Number of items in database
$N_R$       Number of the roots
$N_L$       Number of the taxonomy levels

Fig. 8. The data distribution of real market-basket data obtained from the large bookstore. The plot shows item occurrence against item ID for $|D| = 100000$, $N_I = 58909$, and $N_L = 3$, with labeled points (1, 722), (483, 74), (3381, 14), (9575, 10), (31846, 1), and (58909, 1).

4.2.2. Experiment two: when the database size $|D|$ varies

As shown in Fig. 10, the scalability of k-todes is evaluated on both the real data and the synthetic data. By varying the real database size $|D|$ from 20K to 100K, it is shown in Fig. 10a that k-todes significantly outperforms other algorithms in execution efficiency. The execution time of k-todes increases linearly as the database size increases, indicating the good scale-up feature of algorithm k-todes. In the experiment shown in Fig. 10b, we set $|T| = 5$, $|L| = 2$K, $|I| = 2$, $N_I = 60$K, $N_R = 100$, and $N_L = 3$, and $|D|$ varies from 100K to 900K.

[Fig. 9: information gain ($IG_{item}$, $IG_{cat}$, and $IG_{total}$) obtained by k-todes, k-modes, k-modesT, Basic, and BasicT on (a) real data, $|D| = 100$K; (b) synthetic data, $|D| = 100$K; (c) synthetic data, $|D| = 500$K; (d) synthetic data, $|D| = 900$K.]

4.2.3. Experiment three: when the number of items $N_I$ varies

In the synthetic data experiment shown in Fig. 11a, we set $|D| = 100$K, $|T| = 5$, $|L| = 2$K, $|I| = 2$, $N_R = 100$, and $N_L = 3$, and $N_I$ varies from 20K to 100K. Similarly, in the synthetic data experiment shown in Fig. 11b, we set $|D| = 500$K, $|T| = 5$, $|L| = 2$K, $|I| = 2$, $N_R = 100$, and $N_L = 3$, and $N_I$ varies from 50K to 250K. Note that each item could be viewed as a boolean attribute and $N_I$ is thus viewed as the number of dimensions in the boolean space. With todes as the representatives, the execution time of algorithm k-todes increases approximately linearly as the number of items increases.

4.2.4. Experiment four: when the average size of maximal potential large itemsets $|I|$ varies

In the synthetic data experiments shown in Fig. 12, we set $|T| = 5$, $|L| = 2$K, $N_I = 60$K, $N_R = 100$, and $N_L = 3$, and $|I|$ varies from 1 to 4. It is shown in Fig. 12a that when $|I|$ increases, the $IG_{total}$ value also increases. This is because when $|I|$ increases, the number of transactions containing co-occurring itemsets increases, and thus most transactions are allocated to the corresponding clusters with smaller adherences to the todes. Explicitly, many of the transactions containing an item $i_k$ are allocated to a cluster $C_j$ because these transactions also contain other frequently purchased items which are purchased together with $i_k$. When $|I|$ increases, the number of such items as $i_k$ also increases so that more transactions containing $i_k$ are allocated to one cluster instead of being allocated to several clusters separately. Therefore, the value of $IG_{total}$ increases. In addition, it is shown in Fig. 12b that the percentage of $IG_{item}$ in $IG_{total}$ also increases when $|I|$ increases.

Fig. 10. Execution time for algorithms when the database size $|D|$ varies. (a) Real data, $|D|$ from 20K to 100K; (b) synthetic data, $|D|$ from 100K to 900K.

[Fig. 11: results for k-todes, k-modes, k-modesT, Basic, and BasicT as the number of items varies. (a) $|D| = 100$K, number of items from 20K to 100K; (b) $|D| = 500$K, number of items from 50K to 250K.]

4.2.5. Experiment five: when the number of taxonomy levels $N_L$ varies

In the synthetic data experiment shown in Fig. 13, we set $|T| = 5$, $|I| = 2$, $|L| = 2$K, $N_I = 60$K, and $N_R = 100$, and $N_L$ varies from 3 to 6. When $N_L$ increases, k-todes has more category levels to distinguish the items by calculating their adherences. Thus, the percentage of $IG_{cat}$ in $IG_{total}$ increases, indicating that k-todes makes effective use of the additional category levels.

5. Conclusion

In this paper, we devised an efficient method to cluster market-basket data by identifying representative subjects. One of the important features of market-basket data sets is that they are generated at a rapid pace and thus require data mining algorithms to be scalable and capable of dealing with large data sets. In view of the features of market-basket data, we devised in this paper a novel measurement, called the category-based adherence, and utilized this measurement to perform the clustering. With this category-based adherence measurement, we developed an efficient clustering algorithm, called algorithm k-todes, for market-basket data with the objective to minimize the category-based adherence. A validation model based on Information Gain (IG) was also devised in this paper to assess the quality of clustering for market-basket data. As validated by both real and synthetic datasets, our experimental results show that, with the taxonomy information, algorithm k-todes significantly outperforms the prior works in both the execution efficiency and the clustering quality for market-basket data.

[Fig. 12: results versus the average size of maximal potential itemsets (1 to 4) for $|D| = 100$K, 300K, 500K, 700K, and 900K. (a) $IG_{total}$; (b) $IG_{item}/IG_{total}$.]

Acknowledgements

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contract NSC93-2752-E-002-006-PAE.

References

[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[2] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surveys 31 (3) (1999).

[3] M. Charikar, C. Chekuri, T. Feder, R. Motwani, Incremental clustering and dynamic information retrieval, Proceedings of the 29th ACM Symposium on Theory of Computing, 1997.

[4] M.-S. Chen, J. Han, P.S. Yu, Data mining: an overview from a database perspective, IEEE Trans. on Knowledge and Data Eng. 8 (6) (1996) 833–866.

[5] D.H. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2 (2) (1987) 139–172.

[6] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. on Pattern Anal. and Machine Intelligence, (2000) pp. 4–37.

[7] S.Y. Lu, K.S. Fu, A sentence-to-sentence clustering procedure for pattern analysis, IEEE Trans. Syst. Man Cybern. 8 (1978) 381–389.

[8] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, NY, 1981.

[9] R.C. Dubes, How many clusters are best? An experiment, Pattern Recognition 20 (6) (1987) 645–663.

[10] C.-R. Lin, M.-S. Chen, On the Optimal Clustering of Sequential Data, in: Proceedings of the second SIAM International Conference on Data Mining, April 2002.

[11] B. King, Step-wise clustering procedures, J. Am. Stat. Assoc. 69 (1967) 86–101.

[12] P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy, Freeman, London, UK, 1973.

[13] J. Hertz, A. Krogh, R.G. Palmer, Introduction to the Theory of Neural Computation, Westview Press, 1991.

[14] J. Tantrum, A. Murua, W. Stuetzle, Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2002.

[15] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman and Company, New York, 1979.

[16] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, Proceedings of the 20th VLDB Conference, September 1994, pp. 478–499.

[17] J. Han, Y. Fu, Discovery of multiple-level association rules from large databases, Proceedings of the 21st VLDB Conference, September 1995, pp. 420–431.

[18] J.-S. Park, M.-S. Chen, P.S. Yu, An effective hash based algorithm for mining association rules, Proceedings of the ACM SIGMOD Conference, May 1995, pp. 175–186.

[19] R. Srikant, R. Agrawal, Mining generalized association rules, Proceedings of the 21st VLDB Conference, September 1995, pp. 407–419.

[20] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (3) (1998) 283–304.

[21] K. Wang, C. Xu, B. Liu, Clustering transactions using large items, Proceedings of the ACM CIKM International Conference on Information and Knowledge Management, 1999.

[22] S. Guha, R. Rastogi, K. Shim, ROCK: A robust clustering algorithm for categorical attributes, Journal of Information Systems 25 (5) (2000) 345–366.

[23] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.

[Fig. 13: $IG_{cat}/IG_{total}$ versus the number of taxonomy levels (3 to 6) for $|D| = 100$K, 300K, 500K, 700K, and 900K.]

[24] R.T. Ng, J. Han, Efﬁcient and effective clustering methods for spatial data mining, Proceedings of the 20th Annual International Conference on Very Large Data Bases, 1994, pp. 144–155.

[25] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, ACM SIGMOD International Conference on Management of Data, vol. 25(2), June 1996, pp. 103–114.

[26] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96), August 1996, pp. 226–231.

[27] S. Guha, R. Rastogi, K. Shim, CURE: an efﬁcient clustering algorithm for large databases, ACM SIGMOD International Conference on Management of Data, vol. 27(2), June 1998, pp. 73–84.

[28] C.-R. Lin, M.-S. Chen, A Robust and Efﬁcient Clustering Algorithm based on Cohesion Self-Merging, Proceedings of the 8th ACM SIGKDD Intern’l Conference on Knowledge Discovery and Data Mining (KDD-2002), July 2002.

[29] C.C. Aggarwal, C.M. Procopiuc, J.L. Wolf, P.S. Yu, J.-S. Park, Fast algorithms for projected clustering, ACM SIGMOD International Conference on Management of Data, June 1999, pp. 61–72.

[30] C.C. Aggarwal, P.S. Yu, Finding generalized projected clusters in dimensional spaces, ACM SIGMOD International Conference on Management of Data, May 2000, pp. 70–81.

[31] A. Hinneburg, D.A. Keim, Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering, Proceedings of the 25th VLDB Conference, September 1999, pp. 506–517.

[32] H. Ralambondrainy, A conceptual version of the k-means algorithm, Pattern Recognition Lett. 16 (1995) 1147–1157.

[33] D. Barbara, Y. Li, J. Couto, COOLCAT: an entropy-based algorithm for categorical clustering, Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, November 2002, pp. 590–599.

[34] K.-T. Chuang, M.-S. Chen, Clustering categorical data by utilizing the correlated-force ensemble, Proceedings of the 4th SIAM Conference on Data Mining, April 2004.

[35] M. Lebowitz, Experiments with incremental concept formation, Machine Learning 2 (2) (1987) 103–138.

[36] R.S. Michalski, R.E. Stepp, Automated construction of classifications: conceptual clustering versus numerical taxonomy, IEEE Trans. Pattern Anal. and Machine Intelligence 5 (4) (1983) 396–410.

[37] D. Gibson, J. Kleinberg, P. Raghavan, Clustering categorical data: an approach based on dynamical systems, Proceedings of the 24th Annual International Conference on Very Large Data Bases, 1998, pp. 311–322.

[38] V. Ganti, J. Gehrke, R. Ramakrishnan, CACTUS-clustering categorical data using summaries, Proceedings of ACM SIGKDD International Conference on Knowledge discovery and data mining, 1999.

[39] E.-H. Han, G. Karypis, V. Kumar, B. Mobasher, Clustering based on association rule hypergraphs, ACM SIGMOD’97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.

[40] W.A. Kosters, E. Marchiori, A.A.J. Oerlemans, Mining Clusters with Association Rules, Lecture Notes in Computer Science, 1642, 1999.

[41] B. Lent, A.N. Swami, J. Widom, Clustering Association Rules, Proceedings of the 13th International Conference on Data Engineering, April 1997, pp. 220–231.

[42] C. Ordonez, E. Omiecinski, FREM: fast and robust EM clustering for large data sets, Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, November 2002, pp. 590–599.

[43] A. Strehl, J. Ghosh, A Scalable approach to balanced, high-dimensional clustering of market-baskets, Proceedings of the 7th International Conference on High Performance Computing, December 2000.

[44] Y. Xiao, M.H. Dunham, Interactive clustering for transaction data, Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), September 2001.

[45] Y. Yang, X. Guan, J. You, CLOPE: a fast and effective clustering algorithm for transactional data, The 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Poster), July 2002.

[46] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Los Altos, CA, 2000.

[47] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.

[48] S. Scott, S. Matwin, Text classiﬁcation using wordNet hypernyms, Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems, 1998, pp. 38–44.

[49] R. Duda, P. Hart, Pattern Classiﬁcation and Scene Analysis, Wiley, New York, 1973.

[50] M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, J. Intelligent Information Systems, 2001.

[51] O.R. Zaiane, A. Foss, C.-H. Lee, W. Wang, On Data Clustering Analysis: Scalability, Constraints and Validation, PAKDD02, 2002.

[52] UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.

[53] Reuters-21578 news collection, http://www.research.att.com/lewis/reuters21578.html.

[54] F.-X. Jollois, M. Nadif, Clustering Large Categorical Data, PAKDD02, 2002.

[55] J.R. Quinlan, Induction of decision trees, Machine Learning, 1986.

[56] J.R. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann, Los Altos, CA, 1993.

[57] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Reading, MA, 1999.