Mining fuzzy frequent itemsets for hierarchical document clustering


Chun-Ling Chen a, Frank S.C. Tseng b,*, Tyne Liang a

a Department of Computer Science, National Chiao Tung University, HsinChu 300, Taiwan, ROC

b Dept. of Information Management, National Kaohsiung 1st University of Science and Technology, YenChao, Kaohsiung 824, Taiwan, ROC

Article info

Article history:
Received 20 October 2008
Received in revised form 28 September 2009
Accepted 29 September 2009
Available online 31 October 2009

Keywords:
Fuzzy association rule mining
Text mining
Hierarchical document clustering
Frequent itemsets

Abstract

As the number of text documents on the Internet increases explosively, hierarchical document clustering has proven useful for grouping similar documents in versatile applications. However, most document clustering methods still face challenges in dealing with high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F²IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only retains the merits of FIHC, but also improves its accuracy.

1. Introduction

In order to browse and organize documents smoothly, hierarchical clustering techniques have been proposed to cluster a collection of documents into a hierarchical tree structure. Despite that, there still exist several challenges for hierarchical document clustering, such as high dimensionality, scalability, accuracy, and meaningful cluster labels (Beil, Ester, & Xu, 2002; Fung, Wang, & Ester, 2002, 2003).

Text mining is much more complex than data mining because text data are inherently unstructured and fuzzy (Tan, 1999). Some studies (Delgado, Martín-Bautista, Sánchez, & Vila, 2002; Feldman & Dagan, 1995; Lin et al., 1998) have therefore applied association rule mining to document management. For example, Feldman and Dagan (1995) presented a Knowledge Discovery in Text (KDT) system, which used the simplest information extraction approach to obtain interesting information and knowledge from unstructured text collections. Lin et al. (1998) proposed a method, namely Mining Term Association, to acquire the semantic relations between terms in documents. Moreover, Delgado et al. (2002) regard association rule mining as the first data mining technique employed in mining text collections, and find it particularly interesting because many text processing applications involve associations and co-occurrence between terms. These works mainly focused on analyzing co-occurring terms for document management.

*Corresponding author. Present address: 1, University Road, YenChao, Kaohsiung County 824, Taiwan, ROC. Tel.: +886 7 6011000x4113; fax: +886 7 6011042.


Furthermore, Fung et al. (2002) proposed a novel method, namely Frequent Itemset-based Hierarchical Clustering (FIHC), to produce a hierarchical topic tree for document clustering. This method addresses several of the challenges above: it reduces dimensionality, treats the number of clusters as an optional input parameter, and supports easy browsing with meaningful cluster labels. They employed the tf-idf (Term Frequency-Inverse Document Frequency) method (Salton, 1971) to replace the actual frequency of a term by a weighted frequency. However, the main limitation of the tf-idf method is that long documents tend to have higher weights than short ones, because it considers only the weighted frequency of the terms in a document and neglects the length of the document. This disadvantage affects the accuracy of the clustering task when the mining algorithm cannot obtain appropriate topic labels for the derived clusters.

In this paper, we propose an approach which stems from prior studies (Hong, Lin, & Wang, 2003; Kaya & Alhajj, 2006; Martín-Bautista, Sánchez, Chamorro-Martínez, Serrano, & Vila, 2004) and integrates fuzzy set concepts (Zadeh, 1965) with association rule mining to find interesting fuzzy association rules from given transactions. Fuzzy association rule mining is a good choice because it is easily understandable and offers a realistic way to integrate linguistic terms with fuzzy sets.

In contrast to the tf-idf weighting used in FIHC, we focus on using fuzzy association rule mining based on term frequency to find the association relationships between terms for clustering documents. Some important terms that express the topics of a document may appear only rarely in the document collection. If plain association rule mining is used, only the terms that occur frequently in the document collection can be retrieved, so these important but sparse terms may be obscured in the process of document clustering. By applying fuzzy association rule mining, we can discover interesting connections between fuzzy frequent itemsets. For example, (t1.Low → t2.High) or even (t1.Low → t2.Low) are rules that can be found to show the association relationships between the frequencies of important terms in the document collection.

In order to apply the frequent itemset-based technique flexibly to more applications in document clustering, we extend our previous study (Chen, Tseng, & Liang, 2008) and propose an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F²IHC) approach based on fuzzy association rule mining to improve the accuracy of FIHC. In contrast with our previous study, we explain our approach in more detail and conduct experiments on more datasets. Our approach consists of the following stages:

1. In the first stage, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. In this stage, a hybrid feature selection method is used to effectively remove the unimportant terms of each document.

2. In the second stage, to discover a set of relevant fuzzy frequent itemsets efficiently, we propose a fuzzy association rule mining algorithm for text. In this algorithm, we revise the method devised by Hong et al. (2003) by regarding a document as a transaction, and the term frequency values in a document as the quantitative values (i.e., the number of purchased items in a transaction). A frequent itemset, as defined by Fung et al. (2003), is a set of words that occur together in some minimum fraction of documents in a cluster. By employing pre-defined membership functions, our algorithm calculates three fuzzy values, i.e., for the Low, Mid, and High regions, for each term based on its frequency, to discriminate the degree of importance of the term within a document in the mining process. The derived fuzzy frequent itemsets contain key terms to be regarded as the labels of candidate clusters.

3. In the final stage, the documents are clustered into a hierarchical cluster tree based on these candidate clusters. The cluster tree is built in a top-down fashion by recursively selecting the parent clusters at level k − 1 and dividing their documents into suitable children clusters at level k. Note that the clusters generated by our algorithm are crisp partitions: each document is assigned to exactly one cluster.

In summary, our approach has the following advantages:

1. It provides a frequent itemset-based clustering algorithm for the analysis of a document set to generate a flexible hierarchical document cluster tree, which can be easily integrated into a document management system to provide flexible browsing and retrieval for various applications.

2. It shows how fuzzy association rule mining can be applied more accurately to document clustering. It extends a fuzzy data representation used in data mining by Hong et al. (2003) to text mining to provide a subtler partitioning of a dataset.

3. Experiments on the Classic4, Hitech, Re0, Reuters, and Wap datasets show that our approach not only retains the merits of FIHC in reducing the high dimensionality of text, in efficiency on these datasets, and in producing meaningful labels for the discovered clusters, but also improves the accuracy of FIHC.

The subsequent sections of this paper are organized as follows: In Section 2, we briefly review related literature on document clustering methods. Section 3 presents our approach in three stages, together with an illustrative example. Experimental results are presented and analyzed in Section 4. Finally, we conclude and propose some future directions in Section 5.

2. Document clustering methods

In general, major clustering algorithms are divided into partitioning methods and hierarchical methods. For document clustering, partitioning methods exclusively partition the set of documents into a number of clusters by moving documents


posed the Hierarchical Frequent Term-based Clustering (HFTC) method by using frequent itemsets and minimizing the overlap of clusters in terms of shared documents. However, the experiments of Fung et al. (2003) showed that HFTC is not scalable. To obtain a scalable algorithm, Fung et al. proposed the FIHC algorithm, which uses frequent itemsets derived from association rule mining to construct a hierarchical topic tree for clusters. They also showed that using frequent itemsets for document clustering can effectively reduce the dimension of the vector space. Therefore, this approach not only reduces dimensionality, but also offers efficient processing of high-volume data, supports ease of browsing, and provides meaningful cluster labels.

Steinbach, Karypis, and Kumar (2000) compared the performance of some influential clustering algorithms, and the results indicated that UPGMA and Bisecting k-means are the most accurate clustering algorithms. Therefore, we will compare the performance of our algorithm with that of the FIHC, UPGMA, and Bisecting k-means algorithms in terms of accuracy in Section 4.

In the following, we will present our approach, which stems from the frequent itemset-based clustering technique and extends the algorithm proposed by Hong et al. (2003) to text processing, to find suitable fuzzy frequent itemsets for constructing a hierarchical cluster tree.

3. The framework of our approach

There are three stages in our framework, as shown in Fig. 1. We explain them as follows:

1. Document pre-processing. The frequency of each term within a document is counted.

2. Candidate clusters extraction. Our fuzzy association rule mining algorithm is used to find fuzzy frequent itemsets, which are then used to form the candidate clusters.

3. The cluster tree construction. The Document-Cluster Matrix (DCM) is built for assigning each document to a fitting cluster. Then, a pruned hierarchical cluster tree is built.


3.1. Stage 1: document pre-processing

This stage describes the transformation processes required to obtain the desired representation of documents. As there are thousands of words in a document set, the purpose of this stage is to reduce dimensionality in order to achieve high clustering accuracy. Several methods, such as itemset pruning (Beil et al., 2002), feature clustering or co-clustering (Mandhani, Joshi, & Kummamuru, 2003), feature selection (Shihab, 2004), and matrix factorization (Shahnaz, Berry, Pauca, & Plemmons, 2006; Xu & Gong, 2004), have been applied to reduce dimensionality. To solve this problem, we have to find the terms that are significant and important for representing the content of each document. Hence, we must remove the terms that are neither meaningful nor discriminative, to increase the clustering accuracy and keep the computing cost small. We describe the details of the pre-processing in the following:

1. Divide the sentences into terms.

2. Remove the stop words. We use a stop word list that contains words to be excluded. The list is applied to remove the terms that have a general meaning but do not discriminate between topics.

3. Conduct word stemming. Use an existing stemming algorithm, such as that of Porter (1980), to reduce a word to its stem or root form.

4. Term selection. The terms whose selection metric weights are all higher than pre-specified thresholds will be selected as key terms. In our approach, three feature selection methods, tf-idf, tf-df, and tf², are used to select representative terms for each document, and these feature selection methods are defined as follows:

(1) tf-idf (term frequency-inverse document frequency): It is denoted as $tfidf_{ij}$ and measures the importance of term $t_j$ within document $d_i$. To prevent a bias toward longer documents, the weighted frequency of each term is normalized by the total frequency of all terms in document $d_i$, and is defined as follows:

$$tfidf_{ij} = \frac{f_{ij}}{\sum_{j=1}^{m} f_{ij}} \times \log \frac{|D|}{|\{d_i \mid t_j \in d_i,\ d_i \in D\}|} \qquad (3.1)$$

where $f_{ij}$ is the frequency of term $t_j$ in document $d_i$, and the denominator $\sum_{j=1}^{m} f_{ij}$ is the total frequency of all terms in document $d_i$. $|D|$ is the total number of documents in the document set $D$, and $|\{d_i \mid t_j \in d_i,\ d_i \in D\}|$ is the number of documents containing term $t_j$.

(2) tf-df (term frequency-document frequency): It is denoted as $tfdf_{ij}$ and computed by Formula (3.2) as the term frequency (TF) divided by the document frequency (DF), where TF is the number of times term $t_j$ appears in document $d_i$ divided by the total frequency of all terms in $d_i$, and DF is the number of documents containing term $t_j$ divided by the total number of documents in the document set $D$:

$$tfdf_{ij} = \frac{TF}{DF}, \quad \text{where } TF = \frac{f_{ij}}{\sum_{j=1}^{m} f_{ij}} \text{ and } DF = \frac{|\{d_i \mid t_j \in d_i,\ d_i \in D\}|}{|D|} \qquad (3.2)$$

(3) tf²: It is the product of $tfidf_{ij}$ and $tfdf_{ij}$, and we denote it as $tf2_{ij}$:

$$tf2_{ij} = tfidf_{ij} \times tfdf_{ij} \qquad (3.3)$$
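The three weights can be sketched in a few lines of Python. This is only an illustration of Formulas (3.1)–(3.3) under stated assumptions: the names `term_weights` and `key_terms` are hypothetical, the natural logarithm is assumed because the text does not fix the base, the toy collection is made up, and a term is kept here as soon as one document gives it weights above all three thresholds.

```python
import math


def term_weights(docs):
    """Return, per document, {term: (tfidf, tfdf, tf2)} as in Formulas (3.1)-(3.3).

    docs is a list of dicts mapping a term to its raw frequency f_ij.
    """
    n_docs = len(docs)
    # Number of documents that contain each term.
    doc_count = {}
    for doc in docs:
        for term, freq in doc.items():
            if freq > 0:
                doc_count[term] = doc_count.get(term, 0) + 1

    weights = []
    for doc in docs:
        total = sum(doc.values())  # total frequency of all terms in d_i
        row = {}
        for term, freq in doc.items():
            if freq == 0:
                continue
            tf = freq / total
            # Assumption: natural logarithm (the base is not stated in the text).
            tfidf = tf * math.log(n_docs / doc_count[term])
            tfdf = tf / (doc_count[term] / n_docs)
            row[term] = (tfidf, tfdf, tfidf * tfdf)
        weights.append(row)
    return weights


def key_terms(weights, alpha, beta, gamma):
    """Keep terms whose three weights all exceed the thresholds in some document."""
    kept = set()
    for row in weights:
        for term, (tfidf, tfdf, tf2) in row.items():
            if tfidf > alpha and tfdf > beta and tf2 > gamma:
                kept.add(term)
    return kept


# Made-up toy collection, not taken from the paper.
toy_docs = [
    {"stock": 2, "record": 1, "profit": 1},
    {"medical": 3, "health": 2},
    {"stock": 1, "medical": 1},
]
toy_weights = term_weights(toy_docs)
```

Raising any of the three thresholds passed to `key_terms` shrinks the key term set, which is how the dimensionality reduction of this stage is controlled.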

After these weights of each term in each document have been calculated, those which have weights all higher than pre-speciﬁed thresholds are retained. Subsequently, these retained terms form a set of key terms for the document set D, and we formally deﬁne them as follows.

Definition 3.1. A document, denoted $d_i = \{(t_1, f_{i1}), (t_2, f_{i2}), \ldots, (t_j, f_{ij}), \ldots, (t_m, f_{im})\}$, is a logical unit of text, characterized by a set of key terms $t_j$ together with their corresponding frequencies $f_{ij}$.

Definition 3.2. A document set, denoted $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, also called a document collection, is a set of documents, where $n$ is the total number of documents in $D$.

Definition 3.3. The term set of a document set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, denoted $T_D = \{t_1, t_2, \ldots, t_j, \ldots, t_s\}$, is the set of terms appearing in $D$, where $s$ is the total number of terms.

Definition 3.4. The key term set of a document set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, denoted $K_D = \{t_1, t_2, \ldots, t_j, \ldots, t_m\}$, is a subset of the term set $T_D$, including only meaningful key terms, which do not appear in a well-defined stop word list and satisfy the pre-defined minimum tf-idf threshold $\alpha$, the minimum tf-df threshold $\beta$, and the minimum tf² threshold $\gamma$.

Based on these definitions, the representation of a document can be derived by Algorithm 3.1, shown in Fig. 2. Consider, for example, a document set $D = \{d_1, d_2, \ldots, d_{10}\}$ containing 10 documents. By Algorithm 3.1, we might obtain the derived representation of $D$ and its key term set $K_D$ = {stock, record, profit, medical, treatment, health}, as shown in Table 1. Note that we use a tabular representation, in which each entry denotes the frequency of a key term (the column heading) in a document $d_i$ (the row heading), to make our presentation more concise. This representation scheme will be employed in the following to illustrate our approach.

3.2. Stage 2: candidate clusters extraction

The objective of this stage is to take a document set $D$, a set of pre-defined membership functions, the minimum support value $\theta$, and the minimum confidence value $\lambda$ as input, and to output a set of candidate clusters. To achieve this goal, we modified the algorithm proposed by Hong et al. (2003) to capture the relationships among different key terms of the document set. Since each discovered fuzzy frequent itemset has an associated fuzzy count value, this value can be regarded as the degree of importance that the itemset contributes to the document set.

In the following, we will define the membership functions, present our algorithm, and finally explain our approach with an illustrative example.

Fig. 2. A detailed illustration of Algorithm 3.1.

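Table 1
Document set.

| Doc ID | Stock | Record | Profit | Medical | Treatment | Health |
|--------|-------|--------|--------|---------|-----------|--------|
| d1     | 2     | 1      | 1      | 0       | 0         | 0      |
| d2     | 1     | 1      | 0      | 0       | 0         | 0      |
| d3     | 1     | 0      | 2      | 0       | 0         | 0      |
| d4     | 0     | 0      | 0      | 3       | 0         | 2      |
| d5     | 0     | 0      | 0      | 11      | 1         | 1      |
| d6     | 0     | 1      | 0      | 4       | 0         | 0      |
| d7     | 0     | 0      | 0      | 8       | 1         | 2      |
| d8     | 3     | 0      | 1      | 0       | 0         | 0      |
| d9     | 0     | 1      | 0      | 3       | 0         | 0      |
| d10    | 0     | 0      | 0      | 8       | 2         | 1      |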

3.2.1. The membership functions

The membership functions are used to convert each term frequency into a fuzzy set. Therefore, we deﬁne the t-f (term frequency) fuzzy set used in this paper as follows.

Definition 3.5. A t-f fuzzy set of document $d_i$ is a pair $(F_{ij}, w_{ij}^{r})$, where $F_{ij}$ is a set equal to $\{w_{ij}^{Low}(f_{ij})/t_j.Low,\ w_{ij}^{Mid}(f_{ij})/t_j.Mid,\ w_{ij}^{High}(f_{ij})/t_j.High\}$, $w_{ij}^{r}: F \to [0, 2]$, and $r$ can be Low, Mid, or High. The notation $t_j.r$ is called a fuzzy region of $t_j$. For each term pair $(t_j, f_{ij})$ of document $d_i$, $w_{ij}^{r}(f_{ij})$ is the grade of membership of $t_j$ in $d_i$ under the Low, Mid, and High membership functions defined by Formulas (3.4)–(3.6), respectively.

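$$w_{ij}^{Low}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1 + \dfrac{f_{ij}}{a_1}, & 0 < f_{ij} < a_1 \\ 2, & f_{ij} = a_1 \\ 1 + \dfrac{a_2 - f_{ij}}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 1, & f_{ij} \geq a_2 \end{cases} \qquad a_1 = \min(f_{ij}),\ a_2 = avg(f_{ij}) \qquad (3.4)$$

$$w_{ij}^{Mid}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1, & f_{ij} = a_1 \\ 1 + \dfrac{f_{ij} - a_1}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 2, & f_{ij} = a_2 \\ 1 + \dfrac{a_3 - f_{ij}}{a_3 - a_2}, & a_2 < f_{ij} < a_3 \\ 1, & f_{ij} = a_3 \end{cases} \qquad a_1 = \min(f_{ij}),\ a_2 = avg(f_{ij}),\ a_3 = \max(f_{ij}) \qquad (3.5)$$

$$w_{ij}^{High}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1, & f_{ij} \leq a_1 \\ 1 + \dfrac{f_{ij} - a_1}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 2, & f_{ij} = a_2 \end{cases} \qquad a_1 = avg(f_{ij}),\ a_2 = \max(f_{ij}) \qquad (3.6)$$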

In Formulas (3.4)–(3.6), $\min(f_{ij})$ is the minimum frequency of terms in $D$, $\max(f_{ij})$ is the maximum frequency of terms in $D$, and $avg(f_{ij}) = \frac{\sum_{i=1}^{n} f_{ij}}{|K|}$, where only frequencies $f_{ij}$ different from $\min(f_{ij})$ and $\max(f_{ij})$ are summed, and $|K|$ is the number of summed key terms. For example, based on the document set in Table 1, the derived membership functions are shown in Fig. 3.
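A minimal sketch of the three membership functions follows. The function names are hypothetical, and two points are assumptions rather than statements from the text: the breakpoints min = 1, avg = 4, and max = 11 are chosen because they reproduce the fuzzy values listed in Table 2 of the example, and the rising segment of the High function is normalised as (f − avg)/(max − avg) so that all grades stay within [0, 2].

```python
def low(f, a_min, a_avg):
    """Low-region grade, Formula (3.4)."""
    if f == 0:
        return 0.0
    if f < a_min:
        return 1 + f / a_min
    if f == a_min:
        return 2.0
    if f < a_avg:
        return 1 + (a_avg - f) / (a_avg - a_min)
    return 1.0


def mid(f, a_min, a_avg, a_max):
    """Mid-region grade, Formula (3.5)."""
    if f == 0:
        return 0.0
    if f <= a_min:
        return 1.0
    if f < a_avg:
        return 1 + (f - a_min) / (a_avg - a_min)
    if f == a_avg:
        return 2.0
    if f < a_max:
        return 1 + (a_max - f) / (a_max - a_avg)
    return 1.0


def high(f, a_avg, a_max):
    """High-region grade, Formula (3.6).

    Assumption: the rising segment is (f - a_avg) / (a_max - a_avg),
    which keeps the grade inside [0, 2].
    """
    if f == 0:
        return 0.0
    if f <= a_avg:
        return 1.0
    if f < a_max:
        return 1 + (f - a_avg) / (a_max - a_avg)
    return 2.0


# Assumed breakpoints for the running example (they reproduce Table 2).
A_MIN, A_AVG, A_MAX = 1, 4, 11

stock_low_d1 = round(low(2, A_MIN, A_AVG), 2)            # stock appears 2 times in d1
medical_mid_d7 = round(mid(8, A_MIN, A_AVG, A_MAX), 2)   # medical appears 8 times in d7
```

With these breakpoints, `stock_low_d1` is 1.67 and `medical_mid_d7` is 1.43, matching the entries for d1 (Stock Low) and d7 (Medical Mid) in Table 2.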

3.2.2. The fuzzy association rule mining algorithm for text

To describe our fuzzy association rule mining algorithm, we need the following definitions.

Definition 3.6. For a document set $D$, a candidate cluster $\tilde{c} = (\tilde{D}_c, \tau)$ is a two-tuple, where $\tilde{D}_c$ is a subset of the document set $D$, such that it includes those documents which contain all the key terms in $\tau = \{t_1, t_2, \ldots, t_q\} \subseteq K_D$, $q \geq 1$, where $K_D$ is the key term set of $D$ and $q$ is the number of key terms included in $\tau$. In fact, $\tau$ is a fuzzy frequent itemset for describing $\tilde{c}$. To illustrate, $\tilde{c}$ can also be denoted as $\tilde{c}^{q}_{(t_1, t_2, \ldots, t_q)}$ or $\tilde{c}^{q}_{(\tau)}$, and these notations will be used interchangeably hereafter. For instance, in Table 1, the candidate cluster $\tilde{c}^{1}_{(stock)} = (\{d_1, d_2, d_3, d_8\}, \{stock\})$, as the term "stock" appears in these documents.

Definition 3.7. The candidate cluster set of a document set $D$, denoted $\tilde{C}_D = \{\tilde{c}^{1}_{1}, \ldots, \tilde{c}^{2}_{l-1}, \tilde{c}^{q}_{l}, \ldots, \tilde{c}^{q}_{k}\}$, is a set of candidate clusters, where $k$ is the total number of candidate clusters. The candidate cluster set $\tilde{C}_D$ for a document set $D$ can be generated by Algorithm 3.2, shown in Fig. 4.
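The document part of a candidate cluster in Definition 3.6 can be illustrated with the data of Table 1. The sketch below only collects the documents that contain every key term of a given itemset; it does not implement the mining of Algorithm 3.2, and the function name `candidate_cluster` is hypothetical.

```python
# Non-zero term frequencies from Table 1.
TABLE_1 = {
    "d1": {"stock": 2, "record": 1, "profit": 1},
    "d2": {"stock": 1, "record": 1},
    "d3": {"stock": 1, "profit": 2},
    "d4": {"medical": 3, "health": 2},
    "d5": {"medical": 11, "treatment": 1, "health": 1},
    "d6": {"record": 1, "medical": 4},
    "d7": {"medical": 8, "treatment": 1, "health": 2},
    "d8": {"stock": 3, "profit": 1},
    "d9": {"record": 1, "medical": 3},
    "d10": {"medical": 8, "treatment": 2, "health": 1},
}


def candidate_cluster(doc_set, itemset):
    """Return (documents containing all terms of the itemset, the itemset)."""
    docs = {
        doc_id
        for doc_id, freqs in doc_set.items()
        if all(freqs.get(term, 0) > 0 for term in itemset)
    }
    return docs, frozenset(itemset)


stock_docs, _ = candidate_cluster(TABLE_1, ["stock"])
stock_profit_docs, _ = candidate_cluster(TABLE_1, ["stock", "profit"])
```

Here `stock_docs` is {d1, d2, d3, d8}, as in the instance given in Definition 3.6, and `stock_profit_docs` is {d1, d3, d8}.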


3.2.3. An illustrative example

Consider using the document set $D$ in Table 1, the membership functions defined in Fig. 3, the minimum support value 40%, and the minimum confidence value 60% as inputs. The fuzzy frequent itemsets discovery procedure is illustrated in Fig. 5.

In the proposed algorithm, we estimate the strength of association among key terms in the document set by using confidence values. Co-occurring keywords carry useful information, because terms that co-occur frequently tend to be used together. Thus, our algorithm computes the confidence values of a rule pair to check the strength of association among the key terms $(t_1, t_2, \ldots, t_q)$ of the fuzzy frequent q-itemsets. Take the candidate cluster $\tilde{c}^{2}_{(stock, profit)}$ as an example. Since the confidence values of the rule pair "If stock = Low, then profit = Low" and "If profit = Low, then stock = Low" are both larger than the minimum confidence value 60%, $\tilde{c}^{2}_{(stock, profit)}$ is put in the candidate cluster set $\tilde{C}_D$. Finally, the candidate cluster set $\tilde{C}_D = \{\tilde{c}^{1}_{(stock)}, \tilde{c}^{1}_{(record)}, \tilde{c}^{1}_{(profit)}, \tilde{c}^{1}_{(medical)}, \tilde{c}^{1}_{(treatment)}, \tilde{c}^{1}_{(health)}, \tilde{c}^{2}_{(stock, profit)}, \tilde{c}^{2}_{(medical, health)}, \tilde{c}^{2}_{(treatment, health)}\}$ will be output.

3.3. Stage 3: the cluster tree construction

The candidate cluster set generated by the previous steps can be viewed as a set of topics with their corresponding sub-topics in the document set. In this stage, we first construct the Document-Term Matrix (DTM) and the Term-Cluster Matrix (TCM) to derive the Document-Cluster Matrix (DCM) for assigning each document to a fitting cluster, such that each $c_i^q$ contains a subset of documents. For the documents in each $c_i^q$, the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. We call each $c_i^q$ a target cluster in the following. Based on the assignment result, we will find the set of target clusters $C_D = \{c_1^1, c_2^1, \ldots, c_i^q, \ldots, c_f^q\}$, and then use these target clusters to form a hierarchical tree for the document set $D$. To prevent the constructed cluster tree from including too many clusters, the methods described in Section 3.3.3 will be used to prune unnecessary clusters.

3.3.1. Building the Document-Cluster Matrix (DCM)

First, consider each candidate cluster $\tilde{c}^{q}_{(\tau)} = \tilde{c}^{q}_{(t_1, t_2, \ldots, t_q)}$ with fuzzy frequent itemset $\tau$. The itemset $\tau$ will be regarded as a reference point for generating a target cluster. Then, to represent the degree of importance of document $d_i$ in a candidate cluster $\tilde{c}^{q}_{l}$, an $n \times k$ Document-Cluster Matrix will be constructed to calculate the similarity of terms in $d_i$ and $\tilde{c}^{q}_{l}$. To achieve this goal, we have to define two matrices, namely the Document-Term Matrix and the Term-Cluster Matrix, as follows.


Definition 3.8. A Document-Term Matrix (DTM), denoted $W = [w_{ij}^{maxR_j}]$, for a document set $D$, is an $n \times p$ matrix, such that $w_{ij}^{maxR_j}$ is the weight (fuzzy value) of term $t_j$ in document $d_i$ and $t_j \in L_1$. A formal illustration of DTM can be found in Fig. 6.

Definition 3.9. A Term-Cluster Matrix (TCM), denoted $G = [g_{jl}^{maxR_j}]$, for a document set $D$ of $n$ documents, is a $p \times k$ matrix, such that for $1 \leq j \leq p$, $1 \leq l \leq k$,

$$g_{jl}^{maxR_j} = \frac{score(\tilde{c}_l^q)}{\sum_{i=1}^{n} w_{ij}^{maxR_j}}, \quad \text{where } score(\tilde{c}_l^q) = \begin{cases} \sum_{d_i \in \tilde{c}_l^1,\ t_j \in L_1} w_{ij}^{maxR_j}, & \text{if } q = 1 \\ \dfrac{\sum_{d_i \in \tilde{c}_l^q,\ t_j \in L_1} w_{ij}^{maxR_j}}{\lambda}, & \text{otherwise} \end{cases} \qquad (3.7)$$

In Formula (3.7), $w_{ij}^{maxR_j}$ is the weight (fuzzy value) of term $t_j$ in document $d_i \in \tilde{c}_l^q$, and $\lambda$ is the minimum confidence value.

Each $g_{jl}^{maxR_j}$ of TCM represents the degree of importance of key term $t_j$ in a candidate cluster $\tilde{c}^{q}_{(\tau)}$ by referring to those documents that include $\tau$. To reduce the dimension, only key terms present in $L_1$ are used in TCM. A formal illustration of TCM can be found in Fig. 7.
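Formula (3.7) can be checked against the running example. The sketch below recomputes a few TCM entries from the DTM values of Table 2, using the candidate clusters stock = {d1, d2, d3, d8} and (stock, profit) = {d1, d3, d8} and the minimum confidence value 0.6. The function name `tcm_entry` and the data layout are hypothetical.

```python
# Table 2 (DTM); column order: Stock Low, Record Low, Profit Low,
# Medical Mid, Treatment Low, Health Low.
DTM = {
    "d1": [1.67, 2.00, 2.00, 0.00, 0.00, 0.00],
    "d2": [2.00, 2.00, 0.00, 0.00, 0.00, 0.00],
    "d3": [2.00, 0.00, 1.67, 0.00, 0.00, 0.00],
    "d4": [0.00, 0.00, 0.00, 1.67, 0.00, 1.67],
    "d5": [0.00, 0.00, 0.00, 1.00, 2.00, 2.00],
    "d6": [0.00, 2.00, 0.00, 2.00, 0.00, 0.00],
    "d7": [0.00, 0.00, 0.00, 1.43, 2.00, 1.67],
    "d8": [1.33, 0.00, 2.00, 0.00, 0.00, 0.00],
    "d9": [0.00, 2.00, 0.00, 1.67, 0.00, 0.00],
    "d10": [0.00, 0.00, 0.00, 1.43, 1.67, 2.00],
}
STOCK, RECORD, PROFIT = 0, 1, 2  # column indices used below


def tcm_entry(dtm, j, cluster_docs, q, min_conf):
    """g_jl of Formula (3.7) for term column j and one candidate cluster."""
    column_total = sum(row[j] for row in dtm.values())
    score = sum(dtm[d][j] for d in cluster_docs)
    if q > 1:
        # For itemsets with more than one term, the score is divided
        # by the minimum confidence value.
        score /= min_conf
    return score / column_total


stock_docs = {"d1", "d2", "d3", "d8"}
stock_profit_docs = {"d1", "d3", "d8"}

g_stock_in_stock = round(tcm_entry(DTM, STOCK, stock_docs, 1, 0.6), 2)
g_record_in_stock = round(tcm_entry(DTM, RECORD, stock_docs, 1, 0.6), 2)
g_stock_in_pair = round(tcm_entry(DTM, STOCK, stock_profit_docs, 2, 0.6), 2)
g_profit_in_pair = round(tcm_entry(DTM, PROFIT, stock_profit_docs, 2, 0.6), 2)
```

The four values are 1.0, 0.5, 1.19, and 1.67, which agree with the corresponding entries of Table 3.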

Finally, based on the previous two deﬁnitions, we can deﬁne the Document-Cluster Matrix (DCM) of a document set D as follows.

Definition 3.10. A Document-Cluster Matrix (DCM) for a document set $D$ of $n$ documents is the inner product of its DTM and TCM. It is an $n \times k$ matrix, and can be defined as $V = [v_{il}]$, where

$$v_{il} = row_i(W) \cdot col_l(G) = \begin{bmatrix} w_{i1}^{maxR_j} & w_{i2}^{maxR_j} & \cdots & w_{ip}^{maxR_j} \end{bmatrix} \begin{bmatrix} g_{1l}^{maxR_j} \\ g_{2l}^{maxR_j} \\ \vdots \\ g_{pl}^{maxR_j} \end{bmatrix} = \sum_{j=1}^{p} w_{ij}\, g_{jl}, \quad 1 \leq i \leq n \text{ and } 1 \leq l \leq k$$

A formal illustration of DCM can be found in Fig. 8.
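As a small illustration of Definition 3.10, the sketch below multiplies two DTM rows from Table 2 by the first-level columns of the TCM in Table 3. The function name `dcm_row` is hypothetical. Because both tables are rounded to two decimals, some recomputed entries can differ from Table 4 in the last digit.

```python
# Two rows of Table 2 (DTM); column order: Stock Low, Record Low,
# Profit Low, Medical Mid, Treatment Low, Health Low.
W = {
    "d1": [1.67, 2.00, 2.00, 0.00, 0.00, 0.00],
    "d4": [0.00, 0.00, 0.00, 1.67, 0.00, 1.67],
}

# Table 3 (TCM), restricted to the six 1-itemset candidate clusters:
# stock, record, profit, medical, treatment, health.
G = [
    [1.00, 0.52, 0.71, 0.00, 0.00, 0.00],  # Stock Low
    [0.50, 1.00, 0.25, 0.50, 0.00, 0.00],  # Record Low
    [1.00, 0.35, 1.00, 0.00, 0.00, 0.00],  # Profit Low
    [0.00, 0.40, 0.00, 1.00, 0.42, 0.60],  # Medical Mid
    [0.00, 0.00, 0.00, 1.00, 1.00, 1.00],  # Treatment Low
    [0.00, 0.00, 0.00, 1.00, 0.77, 1.00],  # Health Low
]


def dcm_row(w_row, g):
    """Row i of V: the inner product of one DTM row with every TCM column."""
    n_clusters = len(g[0])
    return [
        sum(w * g[j][l] for j, w in enumerate(w_row))
        for l in range(n_clusters)
    ]


v_d1 = [round(v, 2) for v in dcm_row(W["d1"], G)]
v_d4 = [round(v, 2) for v in dcm_row(W["d4"], G)]
```

For d1 the first entry is 4.67 (cluster stock), and for d4 the fourth entry is 3.34 (cluster medical), as in Table 4.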

3.3.2. Building the hierarchical cluster tree

Based on the obtained DCM, each document can be assigned to only one target cluster by using the following rules.

Rule 1. Each element $v_{il}$ of the DCM represents the degree of importance of document $d_i$ in a candidate cluster $\tilde{c}^{1}_{l}$. For each document $d_i$ (row $i$ of the DCM), if there exists only one maximum $v_{il}$ in $\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)}$ (columns 1 to $y$ of the DCM), where $1 \leq y \leq k$, then $d_i$ will be assigned to the target cluster $c^{1}_{l}$; otherwise, apply Rule 2.

Fig. 6. A formal illustration of Document-Term Matrix.


$$c^{1}_{l} = \{d_i \mid v_{il} = \max\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)},\ \text{where } 1 \leq y \leq k\} \qquad (3.8)$$

Rule 2. If a document $d_i$ has the same maximum value in $\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)}$ for more than one of the candidate clusters $\{\tilde{c}^{1}_{1}, \tilde{c}^{1}_{2}, \ldots, \tilde{c}^{1}_{y}\}$, then $d_i$ will be assigned to the target cluster $c^{1}_{l}$ whose fuzzy frequent itemset $\tau$ has the highest count value. Notice that when $q = 1$, the count value is $max\text{-}count_j$ (refer to Step 3 in Algorithm 3.2).
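The two rules can be sketched as a single assignment function. The rows below are the first-level columns of Table 4 for three documents. The function name `assign` is hypothetical, and the count values in `COUNTS` are placeholders, not values from the paper: the text does not list the fuzzy counts, so they are chosen here only so that medical outranks record, which is the outcome reported for d9 in the illustrative example.

```python
LABELS = ["stock", "record", "profit", "medical", "treatment", "health"]

# Rows of Table 4 (DCM), restricted to the 1-itemset candidate clusters.
DCM_ROWS = {
    "d1": [4.67, 3.58, 3.69, 1.00, 0.00, 0.00],
    "d2": [3.00, 3.05, 1.93, 1.00, 0.00, 0.00],
    "d9": [1.00, 2.67, 0.50, 2.67, 0.70, 1.00],
}

# Placeholder count values (hypothetical, for illustration only).
COUNTS = {"stock": 1.0, "record": 1.0, "profit": 1.0,
          "medical": 2.0, "treatment": 1.0, "health": 1.0}


def assign(v_row, labels, counts):
    """Assign a document to exactly one first-level target cluster."""
    best = max(v_row)
    tied = [label for label, v in zip(labels, v_row) if v == best]
    if len(tied) == 1:
        return tied[0]  # Rule 1: a unique maximum decides
    # Rule 2: among tied clusters, prefer the itemset with the highest count.
    return max(tied, key=lambda label: counts[label])


assignment = {doc: assign(row, LABELS, COUNTS) for doc, row in DCM_ROWS.items()}
```

With these inputs, d1 goes to stock and d2 to record under Rule 1, while d9 ties between record and medical (both 2.67) and is resolved by Rule 2.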

After assigning each document to the best fitting cluster, the resulting tree can be formed as a foundation for pruning and a natural structure for browsing. The cluster tree built by the F²IHC algorithm has the following eight features:

1. The cluster tree is built in a top-down fashion, which is different from the cluster tree obtained in a bottom-up fashion by FIHC.

2. Each child target cluster has exactly one parent target cluster.

3. The topic of a parent target cluster is more general than the topic of its children target clusters. The nodes become more and more specialized as they get closer to the leaf nodes.

4. A parent target cluster and its children target clusters are ‘‘similar” to a certain degree.

5. Each target cluster employs one fuzzy frequent q-itemset $\tau$ as its cluster label.

6. The root node of the cluster tree appears at level 0, and is tagged with the cluster label "all".

7. Each target cluster with its fuzzy frequent q-itemset appears at level q of the tree.

8. The depth of the cluster tree is the same as the maximum size of the fuzzy frequent itemsets.

3.3.3. Tree pruning

When a low minimum support value and a low minimum confidence value are used, the target cluster tree becomes broad and deep. Documents with the same topic may be spread over several small target clusters, which lowers document clustering accuracy. In order to generate a natural hierarchical cluster tree with higher clustering accuracy and easier browsing, one tree pruning method is used for merging similar target clusters at level 1. This method employs the following definition to compute the inter-cluster similarity between two target clusters.

Definition 3.11. The inter-cluster similarity between two target clusters $c_x^1$ and $c_y^1$, $c_x^1 \neq c_y^1$, is defined by Formula (3.9):

$$Inter\_Sim(c_x^1, c_y^1) = \frac{\sum_{d_i \in c_x^1, c_y^1}^{n} v_{ix} \times v_{iy}}{\sqrt{\sum_{d_i \in c_x^1}^{n} (v_{ix})^2 \times \sum_{d_i \in c_y^1}^{n} (v_{iy})^2}} \qquad (3.9)$$

where $v_{ix}$ and $v_{iy}$ stand for the two entries in the DCM such that $d_i \in c_x^1$ and $d_i \in c_y^1$, respectively. The range of Inter_Sim is [0, 1]. If the Inter_Sim value is close to 1, then both clusters are regarded as nearly the same. In the following, the minimum Inter_Sim will be used as a threshold $\delta$ to decide whether two target clusters should be merged.

The objective of sibling merging is to shrink a tree by merging similar target clusters at level 1 to attain high document clustering accuracy. The inter-cluster similarity is calculated for each pair of target clusters at level 1 of the tree. The target cluster pair with the highest Inter_Sim value is merged, and merging is repeated until the Inter_Sim values of all pairs of target clusters at level 1 are smaller than the minimum Inter_Sim threshold $\delta$.
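The similarity measure can be sketched as a cosine-style ratio over DCM columns. The code below is one reading of Formula (3.9), not a confirmed implementation: it assumes that all three sums range over the documents assigned to either of the two clusters, which keeps the value in [0, 1]. The names are hypothetical; the data are the stock, record, and medical columns of Table 4 and the assignment from Step 4 of the illustrative example. Table 5 is not reproduced in this text, so the printed values are not claimed to match it.

```python
import math
from itertools import combinations

# Stock, record, and medical columns of Table 4 (DCM).
DCM = {
    "d1": {"stock": 4.67, "record": 3.58, "medical": 1.00},
    "d2": {"stock": 3.00, "record": 3.05, "medical": 1.00},
    "d3": {"stock": 3.67, "record": 1.64, "medical": 0.00},
    "d4": {"stock": 0.00, "record": 0.67, "medical": 3.34},
    "d5": {"stock": 0.00, "record": 0.40, "medical": 5.00},
    "d6": {"stock": 1.00, "record": 2.80, "medical": 3.00},
    "d7": {"stock": 0.00, "record": 0.57, "medical": 5.10},
    "d8": {"stock": 3.33, "record": 1.40, "medical": 0.00},
    "d9": {"stock": 1.00, "record": 2.67, "medical": 2.67},
    "d10": {"stock": 0.00, "record": 0.57, "medical": 5.10},
}

# Non-empty target clusters after Step 4 of the example.
TARGETS = {
    "stock": {"d1", "d3", "d8"},
    "record": {"d2"},
    "medical": {"d4", "d5", "d6", "d7", "d9", "d10"},
}


def inter_sim(dcm, targets, x, y):
    """Inter-cluster similarity of target clusters x and y.

    Assumption: every sum runs over the documents of both clusters.
    """
    docs = targets[x] | targets[y]
    numerator = sum(dcm[d][x] * dcm[d][y] for d in docs)
    denominator = math.sqrt(
        sum(dcm[d][x] ** 2 for d in docs) * sum(dcm[d][y] ** 2 for d in docs)
    )
    return numerator / denominator if denominator else 0.0


sims = {
    pair: inter_sim(DCM, TARGETS, *pair)
    for pair in combinations(sorted(TARGETS), 2)
}
best_pair = max(sims, key=sims.get)
```

Under this reading, the pair (record, stock) has the highest similarity and is the only pair above the threshold 0.6 used in the example, which is consistent with the merge described in Step 5 of Section 3.3.4.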

Algorithm 3.3, shown in Fig. 9, is used to assign each document to the best fitting cluster, and finally builds a cluster tree for output.

3.3.4. An illustrative example

For example, consider the document set in Table 1. The key term set $K_D$ = {stock, record, profit, medical, treatment, health} and the candidate cluster set $\tilde{C}_D = \{\tilde{c}^{1}_{(stock)}, \tilde{c}^{1}_{(record)}, \tilde{c}^{1}_{(profit)}, \tilde{c}^{1}_{(medical)}, \tilde{c}^{1}_{(treatment)}, \tilde{c}^{1}_{(health)}, \tilde{c}^{2}_{(stock, profit)}, \tilde{c}^{2}_{(medical, health)}, \tilde{c}^{2}_{(treatment, health)}\}$ were already generated in Sections 3.1 and 3.2.3, respectively. Now, suppose the minimum Inter_Sim value is 0.6. The proposed cluster tree construction algorithm proceeds as follows:

Step 1. Build the 10 × 6 Document-Term Matrix W in Table 2.

Step 2. Build the 6 × 9 Term-Cluster Matrix G in Table 3.

Step 3. Build the 10 × 9 Document-Cluster Matrix V in Table 4.

Step 4. Assign each document to its best target cluster.

$c^{1}_{(stock)} = \{d_1, d_3, d_8\}$, $c^{1}_{(record)} = \{d_2\}$, $c^{1}_{(medical)} = \{d_4, d_5, d_6, d_7, d_9, d_{10}\}$,

$c^{1}_{(profit)} = \{\}$, $c^{1}_{(treatment)} = \{\}$, $c^{1}_{(health)} = \{\}$

Step 5. Sibling merging.

1. Remove the empty nodes $\{c^{1}_{(profit)}, c^{1}_{(treatment)}, c^{1}_{(health)}\}$.

2. The Inter_Sim values of all pairs of target clusters are calculated in Table 5.

3. Keep merging until the Inter_Sim values of all pairs of target clusters are lower than the minimum Inter_Sim value 0.6.

(a) Based on the above result, the cluster pair $(c^{1}_{(stock)}, c^{1}_{(record)})$ has the highest Inter_Sim value.

(b) Since $c^{1}_{(record)}$ contains fewer documents than $c^{1}_{(stock)}$, the document in $c^{1}_{(record)}$ is merged into $c^{1}_{(stock)}$. Thus, $c^{1}_{(stock)} = \{d_1, d_2, d_3, d_8\}$.

(c) Update the inter-cluster similarity matrix. We omit the details here due to space limitations.

Step 6. Tree construction.

1. Sorting all target clusters by the number of key terms, we obtain $\{c^{1}_{(stock)}, c^{1}_{(medical)}, c^{2}_{(stock, profit)}, c^{2}_{(medical, health)}, c^{2}_{(treatment, health)}\}$.

2. Remove the target clusters that have no parent cluster; here this removes $\{c^{2}_{(treatment, health)}\}$.

3. Identify all potential children.

(a) The number of terms in $c^{2}_{(stock, profit)}$ and in $c^{2}_{(medical, health)}$ is 2.

(b) The PotentialChildren of $c^{1}_{(stock)}$ is $c^{2}_{(stock, profit)}$, and the PotentialChildren of $c^{1}_{(medical)}$ is $c^{2}_{(medical, health)}$.

4. The target clusters $c^{2}_{(stock, profit)}$ and $c^{2}_{(medical, health)}$ are set as the child clusters of $c^{1}_{(stock)}$ and $c^{1}_{(medical)}$, respectively.

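Table 2

The DTM of this example.

| Documents/key terms | Stock Low | Record Low | Profit Low | Medical Mid | Treatment Low | Health Low |
|---|---|---|---|---|---|---|
| d1 | 1.67 | 2.00 | 2.00 | 0.00 | 0.00 | 0.00 |
| d2 | 2.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| d3 | 2.00 | 0.00 | 1.67 | 0.00 | 0.00 | 0.00 |
| d4 | 0.00 | 0.00 | 0.00 | 1.67 | 0.00 | 1.67 |
| d5 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | 2.00 |
| d6 | 0.00 | 2.00 | 0.00 | 2.00 | 0.00 | 0.00 |
| d7 | 0.00 | 0.00 | 0.00 | 1.43 | 2.00 | 1.67 |
| d8 | 1.33 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 |
| d9 | 0.00 | 2.00 | 0.00 | 1.67 | 0.00 | 0.00 |
| d10 | 0.00 | 0.00 | 0.00 | 1.43 | 1.67 | 2.00 |

Table 3

The TCM of this example.

| Key terms/clusters | $\tilde{c}^1_{(\text{stock})}$ | $\tilde{c}^1_{(\text{record})}$ | $\tilde{c}^1_{(\text{profit})}$ | $\tilde{c}^1_{(\text{medical})}$ | $\tilde{c}^1_{(\text{treatment})}$ | $\tilde{c}^1_{(\text{health})}$ | $\tilde{c}^2_{(\text{stock, profit})}$ | $\tilde{c}^2_{(\text{medical, health})}$ | $\tilde{c}^2_{(\text{treatment, health})}$ |
|---|---|---|---|---|---|---|---|---|---|
| Stock Low | 1.00 | 0.52 | 0.71 | 0.00 | 0.00 | 0.00 | 1.19 | 0.00 | 0.00 |
| Record Low | 0.50 | 1.00 | 0.25 | 0.50 | 0.00 | 0.00 | 0.42 | 0.00 | 0.00 |
| Profit Low | 1.00 | 0.35 | 1.00 | 0.00 | 0.00 | 0.00 | 1.67 | 0.00 | 0.00 |
| Medical Mid | 0.00 | 0.40 | 0.00 | 1.00 | 0.42 | 0.60 | 0.00 | 1.00 | 0.70 |
| Treatment Low | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.00 | 1.67 | 1.67 |
| Health Low | 0.00 | 0.00 | 0.00 | 1.00 | 0.77 | 1.00 | 0.00 | 1.67 | 1.29 |

Table 4

The DCM of this example.

| Documents/clusters | $\tilde{c}^1_{(\text{stock})}$ | $\tilde{c}^1_{(\text{record})}$ | $\tilde{c}^1_{(\text{profit})}$ | $\tilde{c}^1_{(\text{medical})}$ | $\tilde{c}^1_{(\text{treatment})}$ | $\tilde{c}^1_{(\text{health})}$ | $\tilde{c}^2_{(\text{stock, profit})}$ | $\tilde{c}^2_{(\text{medical, health})}$ | $\tilde{c}^2_{(\text{treatment, health})}$ |
|---|---|---|---|---|---|---|---|---|---|
| d1 | 4.67 | 3.58 | 3.69 | 1.00 | 0.00 | 0.00 | 6.15 | 0.00 | 0.00 |
| d2 | 3.00 | 3.05 | 1.93 | 1.00 | 0.00 | 0.00 | 3.21 | 0.00 | 0.00 |
| d3 | 3.67 | 1.64 | 3.10 | 0.00 | 0.00 | 0.00 | 5.16 | 0.00 | 0.00 |
| d4 | 0.00 | 0.67 | 0.00 | 3.34 | 1.99 | 2.67 | 0.00 | 4.46 | 3.32 |
| d5 | 0.00 | 0.40 | 0.00 | 5.00 | 3.96 | 4.60 | 0.00 | 7.67 | 6.61 |
| d6 | 1.00 | 2.80 | 0.50 | 3.00 | 0.84 | 1.20 | 0.83 | 2.00 | 1.40 |
| d7 | 0.00 | 0.57 | 0.00 | 5.10 | 3.89 | 4.53 | 0.00 | 7.55 | 6.48 |
| d8 | 3.33 | 1.40 | 2.95 | 0.00 | 0.00 | 0.00 | 4.92 | 0.00 | 0.00 |
| d9 | 1.00 | 2.67 | 0.50 | 2.67 | 0.70 | 1.00 | 0.83 | 1.67 | 1.17 |
| d10 | 0.00 | 0.57 | 0.00 | 5.10 | 3.81 | 4.53 | 0.00 | 7.55 | 6.36 |

Numbers in boldface are the largest values of each row of $\tilde{c}^1$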
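As a numerical cross-check of the three tables, the following sketch multiplies the DTM of Table 2 by the TCM of Table 3 and compares the product with the DCM printed in Table 4. It assumes only that the printed values are rounded to two decimals; the helper `matmul` and the variable names are hypothetical, and the check is an observation about the printed numbers rather than a restatement of the paper's definition of the DCM.

```python
# Table 2 (DTM): rows d1..d10; columns stock, record, profit, medical,
# treatment, health (fuzzy regions as labelled in the table header).
dtm = [
    [1.67, 2.00, 2.00, 0.00, 0.00, 0.00],
    [2.00, 2.00, 0.00, 0.00, 0.00, 0.00],
    [2.00, 0.00, 1.67, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.67, 0.00, 1.67],
    [0.00, 0.00, 0.00, 1.00, 2.00, 2.00],
    [0.00, 2.00, 0.00, 2.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.43, 2.00, 1.67],
    [1.33, 0.00, 2.00, 0.00, 0.00, 0.00],
    [0.00, 2.00, 0.00, 1.67, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.43, 1.67, 2.00],
]
# Table 3 (TCM): rows are the six key terms; columns are the nine clusters.
tcm = [
    [1.00, 0.52, 0.71, 0.00, 0.00, 0.00, 1.19, 0.00, 0.00],
    [0.50, 1.00, 0.25, 0.50, 0.00, 0.00, 0.42, 0.00, 0.00],
    [1.00, 0.35, 1.00, 0.00, 0.00, 0.00, 1.67, 0.00, 0.00],
    [0.00, 0.40, 0.00, 1.00, 0.42, 0.60, 0.00, 1.00, 0.70],
    [0.00, 0.00, 0.00, 1.00, 1.00, 1.00, 0.00, 1.67, 1.67],
    [0.00, 0.00, 0.00, 1.00, 0.77, 1.00, 0.00, 1.67, 1.29],
]
# Table 4 (DCM) as printed in the paper.
dcm_table = [
    [4.67, 3.58, 3.69, 1.00, 0.00, 0.00, 6.15, 0.00, 0.00],
    [3.00, 3.05, 1.93, 1.00, 0.00, 0.00, 3.21, 0.00, 0.00],
    [3.67, 1.64, 3.10, 0.00, 0.00, 0.00, 5.16, 0.00, 0.00],
    [0.00, 0.67, 0.00, 3.34, 1.99, 2.67, 0.00, 4.46, 3.32],
    [0.00, 0.40, 0.00, 5.00, 3.96, 4.60, 0.00, 7.67, 6.61],
    [1.00, 2.80, 0.50, 3.00, 0.84, 1.20, 0.83, 2.00, 1.40],
    [0.00, 0.57, 0.00, 5.10, 3.89, 4.53, 0.00, 7.55, 6.48],
    [3.33, 1.40, 2.95, 0.00, 0.00, 0.00, 4.92, 0.00, 0.00],
    [1.00, 2.67, 0.50, 2.67, 0.70, 1.00, 0.83, 1.67, 1.17],
    [0.00, 0.57, 0.00, 5.10, 3.81, 4.53, 0.00, 7.55, 6.36],
]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


dcm = matmul(dtm, tcm)
# Largest absolute gap between the recomputed product and the printed Table 4.
max_dev = max(
    abs(x - y) for r1, r2 in zip(dcm, dcm_table) for x, y in zip(r1, r2)
)
print(f"largest deviation from Table 4: {max_dev:.3f}")
```

The small residual deviation is what one would expect if Table 4 was computed from unrounded DTM and TCM entries.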


5. Children splitting.

(a) Here, we take the documents in the parent cluster $c^1_{(\text{medical})}$ as an example.

(b) Based on the DCM, we compare the value $v_{il}$ of each document in the parent cluster $c^1_{(\text{medical})}$ and in its child cluster $c^2_{(\text{medical, health})}$ to decide whether the document is divided into the child cluster. The result is shown in Table 6.
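The comparison in Step 5 can be illustrated with the DCM values of the six documents in the parent cluster. The sketch below assumes the decision rule is simply "move the document when its value for the child cluster exceeds its value for the parent cluster"; the function name `split_children` is hypothetical, and keeping ties in the parent is an assumption, since the example contains no ties.

```python
# DCM values (from Table 4) of the documents in the parent cluster (medical)
# and in its child cluster (medical, health).
parent_scores = {"d4": 3.34, "d5": 5.00, "d6": 3.00, "d7": 5.10, "d9": 2.67, "d10": 5.10}
child_scores = {"d4": 4.46, "d5": 7.67, "d6": 2.00, "d7": 7.55, "d9": 1.67, "d10": 7.55}


def split_children(parent, child):
    """Return (moved, kept): documents sent to the child vs. left in the parent."""
    moved = [doc for doc in parent if child[doc] > parent[doc]]
    kept = [doc for doc in parent if child[doc] <= parent[doc]]  # ties stay (assumption)
    return moved, kept


moved, kept = split_children(parent_scores, child_scores)
print("moved to child:", moved)
print("kept in parent:", kept)
```

With these values, d4, d5, d7, and d10 move to the child cluster, while d6 and d9 stay in the parent.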

Step 7. Finally, the derived cluster tree CT is shown in Fig. 10.

4. Experimental results

In this section, we experimentally evaluate the performance of the proposed algorithm by comparing it with that of the FIHC method. We used the FIHC 1.0 tool to generate the results of FIHC. The produced results were then fed into the same evaluation program to ensure a fair comparison. All experiments were performed on a P4 3.2 GHz Windows XP machine with 1 GB of memory.

4.1. Datasets

We used the five standard datasets employed by the FIHC experiments. These datasets are widely adopted as standard benchmarks for the text categorization task. To find key terms, stop words were removed and stemming was performed.

Table 6

The comparison results between the parent cluster $\tilde{c}^1_{(\text{medical})}$ and its child cluster $\tilde{c}^2_{(\text{medical, health})}$.

| Documents/clusters | $\tilde{c}^1_{(\text{medical})}$ | $\tilde{c}^2_{(\text{medical, health})}$ | Whether the document is divided into the child cluster |
|---|---|---|---|
| d4 | 3.34 | 4.46 | Yes |
| d5 | 5.00 | 7.67 | Yes |
| d6 | 3.00 | 2.00 | No |
| d7 | 5.10 | 7.55 | Yes |
| d9 | 2.67 | 1.67 | No |
| d10 | 5.10 | 7.55 | Yes |

Numbers in boldface are the largest values of each row.

Fig. 10. The derived hierarchical cluster tree.


Documents were then represented as TF (Term Frequency) vectors, and unimportant terms were discarded. This process implies a significant dimensionality reduction without loss of clustering performance.

The statistics of these datasets, after the document pre-processing described in Section 3.1, are summarized in Table 7. They are heterogeneous in terms of document size, cluster size, number of classes, and document distribution. The smallest document set contains 1504 documents, and the largest one contains 8649 documents. Each document is pre-classified into a single topic, i.e., a natural class. The class information is utilized in the evaluation method for measuring the accuracy of the clustering result. The detailed information (Özgür & Güngör, 2006) on these datasets is as follows:

Classic4³: This document set is a combination of the four classes CACM, CISI, CRAN, and MED abstracts. Classic4 includes 3204 CACM documents, 1460 CISI documents from information retrieval papers, 1398 CRANFIELD documents from aeronautical system papers, and 1033 MEDLINE documents from medical journals.

Hitech: The Hitech dataset was derived from the San Jose Mercury newspaper articles, which are delivered as part of the Text REtrieval Conference⁴ (TREC) collection. The categories of this document set are computers, electronics, health, medical, research, and technology.

Re0: Re0 is a text document dataset derived from the Reuters-21578⁵ text categorization test collection, Distribution 1.0. Re0 includes 1504 documents belonging to 13 different classes.

Reuters⁵: This document set is extracted from newspaper articles. These documents are divided into 135 topics, mostly concerning business and economy. In our test, we discarded documents with multiple category labels; the result consists of approximately 9000 documents, each associated with a single topic, in 50 categories. This dataset is also highly skewed.

Wap: This dataset consists of 1560 Web pages from the Yahoo! subject hierarchy, collected and classified into 20 different classes for the WebACE project (Han et al., 1998). Many categories of Wap are close to each other.

4.2. Evaluation of cluster quality: overall F-measure

The F-measure is often employed to evaluate the accuracy of the generated clustering results. It is a standard evaluation method for both flat and hierarchical clustering structures. More importantly, this measure balances cluster precision and cluster recall. We define a set of document clusters generated from the clustering result, denoted C, and another set, denoted L, consisting of natural classes, such that each document is pre-classified into a single class. Both sets are derived from the same document set D. Let $|D|$ be the number of all documents in the document set D; $|c_i|$ be the number of documents in the cluster $c_i \in C$; $|l_j|$ be the number of documents in the class $l_j \in L$; and $|c_i \cap l_j|$ be the number of documents both in a cluster $c_i$ and a class $l_j$. Fung et al. (2002) measured the quality of a clustering result C using the sum of the maximum F-measures over all natural classes, weighted by class size. This measure is called the overall F-measure of C, denoted F(C), and is defined as follows:

$$F(C) = \sum_{l_j \in L} \frac{|l_j|}{|D|} \max_{c_i \in C} \{F\}, \quad \text{where } F = \frac{2PR}{P + R},\ P = \frac{|c_i \cap l_j|}{|c_i|},\ \text{and } R = \frac{|c_i \cap l_j|}{|l_j|} \qquad (4.1)$$

In general, the higher the F(C) values, the better the clustering solution is.
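Formula (4.1) translates directly into code. The following is a minimal sketch, assuming clusters and classes are given as sets of document identifiers and that every document belongs to exactly one class, so that |D| is the sum of the class sizes; the function name `overall_f_measure` and the toy data are hypothetical.

```python
def overall_f_measure(clusters, classes):
    """Overall F-measure F(C) of Formula (4.1).

    clusters: dict mapping cluster id -> set of document ids (may overlap)
    classes:  dict mapping class label -> set of document ids (a partition of D)
    """
    total_docs = sum(len(docs) for docs in classes.values())  # |D|
    score = 0.0
    for class_docs in classes.values():
        best_f = 0.0
        for cluster_docs in clusters.values():
            overlap = len(cluster_docs & class_docs)
            if overlap == 0:
                continue  # P = R = 0, so F contributes nothing
            precision = overlap / len(cluster_docs)
            recall = overlap / len(class_docs)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        score += len(class_docs) / total_docs * best_f
    return score


# Toy data: six documents, two natural classes, two clusters.
classes = {"A": {1, 2, 3, 4}, "B": {5, 6}}
clusters = {"c1": {1, 2, 3}, "c2": {4, 5, 6}}
print(round(overall_f_measure(clusters, classes), 4))  # 0.8381
```

For the toy data, class A is best matched by c1 (F = 6/7) and class B by c2 (F = 0.8), giving F(C) = (4/6)(6/7) + (2/6)(0.8) ≈ 0.8381. Because the maximum is taken over all clusters, the same function applies to a hierarchical result if every node of the cluster tree is listed in `clusters`.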

To quantify how much improvement our proposed approach, F²IHC, achieves over the FIHC method, we compute the Improvement Ratio (IR), which is the improvement in the F(C) value of F²IHC relative to that of FIHC. The IR is defined as:

$$IR = \frac{F(C)_{F^2IHC} - F(C)_{FIHC}}{F(C)_{FIHC}} \qquad (4.2)$$

Table 7

Statistics for our test datasets.

| Data sets | Number of documents (total) | Number of natural clusters (total) | Class size (max) | Class size (average) | Class size (min) | Length of documents (average) |
|---|---|---|---|---|---|---|
| Classic4 | 7094 | 4 | 3203 | 1774 | 1033 | 43 |
| Hitech | 2301 | 6 | 603 | 384 | 116 | 221 |
| Re0 | 1504 | 13 | 608 | 116 | 11 | 76 |
| Reuters | 8649 | 65 | 3725 | 131 | 1 | 42 |
| Wap | 1560 | 20 | 341 | 78 | 5 | 216 |

³ ftp://ftp.cs.cornell.edu/pub/smart/.
⁴ http://trec.nist.gov/.
⁵ http://www.daviddlewis.com/resources/testcollections/.


where $F(C)_{F^2IHC}$ and $F(C)_{FIHC}$ represent the F(C) values of F²IHC and FIHC, respectively. A higher IR value indicates that the clustering quality of the F²IHC method is better than the clustering quality of FIHC.
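As a small illustration of Formula (4.2), the sketch below computes the IR from two overall F-measure values. The function name `improvement_ratio` is hypothetical; the inputs are the average F(C) values reported for Re0 in Table 9 (0.54 for F²IHC and 0.40 for FIHC).

```python
def improvement_ratio(f_new, f_baseline):
    """Relative improvement of f_new over f_baseline, as in Formula (4.2)."""
    if f_baseline <= 0:
        raise ValueError("baseline F(C) must be positive")
    return (f_new - f_baseline) / f_baseline


# Average overall F-measure on Re0 (Table 9): F2IHC = 0.54, FIHC = 0.40.
re0_ir = improvement_ratio(0.54, 0.40)
print(f"{re0_ir:.0%}")  # 35%
```

A negative value would mean that the baseline performed better on that dataset.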

4.3. The effect of feature selection

In document clustering, feature selection is essential to make the clustering task efficient and more accurate. The most important goal of feature selection is to extract topic-related terms that represent the content of each document.

Before applying F²IHC, we first consider the feature selection strategy. To select the most representative features, we use Formulas (3.1)–(3.3) to obtain three weights and select those terms whose weights are all higher than the pre-defined thresholds. Table 8 shows the keyword statistics of our test datasets and the suggested thresholds for each dataset.

4.4. Evaluation results

We have conducted experiments to compare the accuracy of our algorithm F²IHC with other methods in Section 4.4.1. In Section 4.4.2, we further evaluate the accuracy of F²IHC with respect to different MinSup parameters ranging from 2% to 9%. The efficiency of our algorithm is measured in Section 4.4.3.

4.4.1. Accuracy comparison

Table 9 presents the overall F-measure values obtained for the F²IHC and FIHC algorithms with four different numbers of clusters, namely 3, 15, 30, and 60. We use the same minimum support, ranging from 3% to 6%, to test FIHC and F²IHC on each dataset, and list their average overall F-measure values.

Table 9

Comparison of the overall F-measure.

| Data set (# of natural clusters) | # of clusters | F²IHC | FIHC | UPGMA | Bi. k-means |
|---|---|---|---|---|---|
| Classic4 (4) | 3 | 0.51 | 0.53 | N.A. | 0.59* |
| | 15 | 0.53* | 0.53* | N.A. | 0.46 |
| | 30 | 0.54* | 0.52 | N.A. | 0.43 |
| | 60 | 0.56* | 0.45 | N.A. | 0.27 |
| | Average | 0.54* | 0.51 | N.A. | 0.44 |
| Hitech (6) | 3 | 0.47 | 0.48 | 0.33 | 0.54* |
| | 15 | 0.47* | 0.45 | 0.33 | 0.44 |
| | 30 | 0.48* | 0.46 | 0.47 | 0.29 |
| | 60 | 0.45* | 0.42 | 0.40 | 0.21 |
| | Average | 0.47* | 0.45 | 0.38 | 0.37 |
| Re0 (13) | 3 | 0.55* | 0.40 | 0.36 | 0.34 |
| | 15 | 0.54* | 0.41 | 0.47 | 0.38 |
| | 30 | 0.54* | 0.38 | 0.42 | 0.38 |
| | 60 | 0.54* | 0.40 | 0.34 | 0.28 |
| | Average | 0.54* | 0.40 | 0.40 | 0.34 |
| Reuters (65) | 3 | 0.49* | 0.48 | N.A. | 0.48 |
| | 15 | 0.56* | 0.47 | N.A. | 0.42 |
| | 30 | 0.57* | 0.47 | N.A. | 0.35 |
| | 60 | 0.54* | 0.42 | N.A. | 0.30 |
| | Average | 0.54* | 0.46 | N.A. | 0.39 |
| Wap (20) | 3 | 0.39 | 0.37 | 0.39 | 0.40* |
| | 15 | 0.61* | 0.49 | 0.49 | 0.57 |
| | 30 | 0.62* | 0.56 | 0.58 | 0.44 |
| | 60 | 0.62* | 0.59 | 0.59 | 0.37 |
| | Average | 0.56* | 0.50 | 0.51 | 0.45 |

N.A. means not available for large datasets.

The experimental results of UPGMA and Bi. k-means are the same as those of FIHC.

* The best competitor.


It is apparent that the average accuracy of F²IHC is superior to that of all other algorithms. Although UPGMA, Bisecting k-means, and FIHC perform slightly better than F²IHC in several cases, we argue that the exact number of clusters in a document set is usually unknown in real cases, and F²IHC is robust enough to produce stable, consistent, and high-quality clusters over a wide range of numbers of clusters. This can be seen from the average overall F-measure values of all test cases. Note that UPGMA is not available for large datasets because some experimental results cannot be generated for it; we denote these as N.A.

From the experimental results in Table 9, based on Formula (4.2), our proposed approach has gained (0.54 − 0.4)/0.4 = 35% and (0.54 − 0.42)/0.42 = 28% improvement in F(C) value on average on the Re0 and Reuters datasets, respectively, compared with the FIHC algorithm. For the other datasets, the limited improvement may be because the numbers of clusters were fixed at 3, 15, 30, and 60 for comparison purposes, and these numbers of clusters may not be appropriate for those datasets.

4.4.2. Sensitivity to various parameters

Our algorithm has two main parameters for adjusting accuracy. We now discuss how the default values were chosen, the effects of modifying those parameters, and suggestions for practical use. The first one is mandatory and is denoted MinSup, which means the minimum support for fuzzy frequent itemset generation. The other is optional and is denoted KCluster, which represents the number of clusters at level 1 of the cluster tree. In Table 9, we not only demonstrate the accuracy of the produced solutions, but also show the sensitivity of the accuracy to KCluster.

Fig. 11a depicts the overall F-measure values of F²IHC for different values of the mandatory parameter, with the optional parameter left unspecified. We observe that high clustering accuracies are fairly consistent when MinSup is set between 2% and 9%. As KCluster is not specified in each test case, the sibling merging algorithm has to decide the most appropriate number of output clusters, which are shown in Fig. 11b.

Fig. 11. The accuracy test of F²IHC: (a) overall F-measure versus MinSup (%); (b) the number of clusters versus MinSup (%), with the values listed below.

| Data set / MinSup (%) | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|
| Classic | 164 | 146 | 142 | 88 | 56 | 22 | 12 | 19 |
| Hitech | 25 | 25 | 41 | 25 | 33 | 34 | 35 | 34 |
| Re0 | 17 | 5 | 4 | 4 | 6 | 6 | 25 | 28 |
| Reuters | 41 | 132 | 30 | 56 | 30 | 36 | 32 | 34 |
| Wap | 24 | 22 | 15 | 26 | 24 | 82 | 77 | 16 |

Based on our tests, we observe as a general guideline that MinSup is best set between 3% and 6%. Nevertheless, it cannot be overemphasized that MinSup should not be regarded as the only parameter for finding the optimal accuracy. Users are expected to adjust the shape of the cluster tree through the value of MinSup: the smaller the value of MinSup, the deeper (and broader) the generated cluster tree, and vice versa.

4.4.3. Efﬁciency and scalability

Our algorithm involves three major phases: finding fuzzy frequent itemsets, initial clustering, and cluster merging. Fig. 12 depicts the average execution time of the F²IHC algorithm on five datasets, where five different MinSup values, 5–9%, were used to evaluate the performance. According to the results shown in Fig. 12, the document length dominates the execution time. From Fig. 12, we further found that the average execution time of the fuzzy mining stage is almost identical on the five datasets. The runtime of our algorithm is inversely related to the input parameter MinSup; in other words, runtime increases as MinSup decreases. Due to the longer average length of documents in the Hitech and Wap datasets, their average initial clustering and cluster merging times are higher than those of the other datasets.

To analyze the scalability of our algorithm, we took 100,000 documents from the RCV1 (Reuters Corpus Volume 1) dataset (Lewis, Yang, Rose, & Li, 2004), which contains news from Reuters Ltd. There are three category sets: Topics, Industries, and Regions. In our experiments, we consider the Topics category set, which includes 23,149 training and 781,265 testing documents. Before clustering this dataset, documents were parsed by converting all terms into lower case, removing stop words, and applying the stemming algorithm.

Fig. 12. The detailed time cost analysis of F²IHC on five datasets (average execution time per stage: fuzzy association mining, initial clustering, cluster merging, and entire time cost; MinSup = 5%–9%, KCluster = 60).

Fig. 13. Scalability of F²IHC (runtime in seconds versus number of documents in thousands, for fuzzy association mining, initial clustering, cluster merging, and entire time cost; α = 0.6, β = 0.5, γ = 0.3, MinSup = 10%, KCluster = 60).

Fig. 13 shows the runtimes of the different stages of our algorithm with respect to different sizes of the RCV1 dataset, ranging from 10 K to 100 K documents. The whole process was completed within five minutes. The figure also shows that fuzzy mining and initial clustering are the two most time-consuming stages of our algorithm. In the clustering stage, most of the time is spent on constructing initial clusters, and its runtime is almost linear with respect to the number of documents.

5. Conclusions and future directions

Although numerous document clustering methods have been studied extensively for many years, high computational complexity and space requirements still make these methods inefficient. Hence, reducing the heavy computational load and increasing the precision of unsupervised document clustering are important issues. In this paper, we derived a fuzzy-based hierarchical document clustering approach, based on fuzzy association rule mining, to alleviate these problems. Our approach starts with the document pre-processing stage; employs the fuzzy association data mining method in the second stage; automatically generates a candidate cluster set and merges highly similar clusters; and finally builds a hierarchical cluster tree in a top-down fashion for easy browsing. Our experiments show that the accuracy of our algorithm is higher than that of the FIHC method, UPGMA, and Bisecting k-means when compared on the five standard datasets. Moreover, the experimental results show that fuzzy association rule mining discovers important candidate clusters, which increases the accuracy of document clustering. The approach is therefore worth extending to the management of very large text document collections.

Our future work focuses on the following two aspects:

1. Combining the semantic analysis: To improve the performance of the document clustering algorithm, it can be combined with semantic information, e.g., WordNet, to distinguish the senses of the key terms (Hotho, Staab, & Stumme, 2003). We will further improve the F²IHC algorithm with domain knowledge, such as WordNet, for higher accuracy.

2. Incrementally updating the cluster tree: An efficient incremental clustering algorithm for assigning a new document to the most similar existing cluster is a future direction. Since the number of documents in a document set increases sequentially, it is inefficient to rebuild the cluster tree for each newly arriving document. Incrementally updating the cluster tree allows it to reflect the current state of the whole document set (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003; Pons-Porrata, Berlanga-Llavori, & Ruiz-Shulcloper, 2007). Moreover, recent research on data mining in stream data (Ordonez, 2003) is closely related to incremental clustering.

Acknowledgements

This research was partially supported by National Science Council, Taiwan, ROC, under Contract Nos. NSC 96-2416-H-327-008-MY2 and NSC 96-2221-E-009-168-MY2.

References

Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. In Proceedings of the 8th ACM SIGKDD int’l conf. on knowledge discovery and data mining (pp. 436–442).

Bellot, P., & El-Beze, M. (1999). A clustering method for information retrieval. Technical report IR-0199.

Chen, C. L., Tseng, Frank S. C., & Liang, T. (2008). Hierarchical document clustering using fuzzy association rule mining. In Proceedings of the 3rd international conference of innovative computing information and control (ICICIC2008) (pp. 326–330).

Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. A. (2002). Mining text data: Special features and patterns. In Proceedings of EPS exploratory workshop on pattern detection and discovery in data mining (pp. 140–153).

Feldman, R., & Dagan, I. (1995). Knowledge discovery in textual databases (KDT). In Proceedings of the 1st ACM SIGKDD int’l conf. on knowledge discovery and data mining (pp. 112–117).

Fung, B. C. M., Wang, K., & Ester, M. (2002). Hierarchical document clustering using frequent itemsets. Master thesis, Simon Fraser University.

Fung, B. C. M., Wang, K., & Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of the 3rd SIAM int’l conf. on data mining (pp. 59–70).

Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), 515–528.

Han, E. H., Boley, B., Gini, M., Gross, R., Hastings, K., Karypis, G., et al. (1998). Webace: A web agent for document categorization and exploration. In Proceedings of the 2nd int’l conf. on autonomous agents (pp. 408–415).

Hong, T. P., Lin, K. Y., & Wang, S. L. (2003). Fuzzy data mining for interesting generalized association rules. Fuzzy Sets and Systems, 138(2), 255–269.

Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proceedings of the semantic web workshop of the 26th annual international ACM SIGIR conference.

Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.

Iliopoulos, I., Enright, A. J., & Ouzounis, C. A. (2001). Textquest: Document clustering of Medline abstracts for concept discovery in molecular biology. In Proceedings of the 6th annual Pacific symposium on biocomputing (pp. 384–395).

Kaya, M., & Alhajj, R. (2006). Utilizing genetic algorithms to optimize membership functions for fuzzy weighted association rule mining. Applied Intelligence, 24(1), 7–15.

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.


43(3), 752–768.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Salton, G. (1971). The SMART retrieval system – Experiments in automatic document retrieval, Retrieval. Englewood Cliffs, NJ: Prentice Hall.

Shahnaz, F., Berry, M. W., Pauca, V. P., & Plemmons, R. J. (2006). Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), 373–386.

Shihab, K. (2004). Improving clustering performance by using feature selection and extraction techniques. Journal of Intelligent Systems, 13(3), 135–161.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In Proceedings of the KDD workshop on text mining.

Tan, A. H. (1999). Text mining: The state of the art and the challenges. In Proceedings of the Pacific Asia conf. on knowledge discovery and data mining (pp. 65–70).

Xu, W., & Gong, Y. (2004). Document clustering by concept factorization. In Proceedings of the 27th ACM SIGIR conf. on research and development in information retrieval (pp. 202–209).

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.

Chun-Ling Chen is a Ph.D. student in the Dept. of Computer Science, National Chiao Tung University, Taiwan, ROC. Her research interests include database, object-oriented conceptual modeling, information retrieval, text mining, and fuzzy set theory.

Frank S.C. Tseng received his B.S., M.S. and Ph.D. Degrees, all in computer science and information engineering, from National Chiao Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respectively. He joined the faculty of the Department of Information Management, Yuan-Ze University, Taiwan, ROC, in August 1995. From 1996 to 1997, he was the Chairman of the Department. Currently, he is a professor and the chairman of the Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, ROC. His research interests include database theory and applications, information retrieval, XML technologies for Internet computing, data/document warehousing, and data/text mining. Dr. Tseng is a member of the IEEE Computer Society and the Association for Computing Machinery.

Tyne Liang received her Ph.D. Degree in computer science from National Chiao Tung University, Taiwan, ROC. Currently, she is an associate professor of the Dept. of Computer Science, National Chiao Tung University, Taiwan, ROC. Her research interests include information retrieval, natural language processing, web mining, and interconnection networks.