Mining fuzzy frequent itemsets for hierarchical document clustering
Chun-Ling Chen a, Frank S.C. Tseng b,*, Tyne Liang a
a Department of Computer Science, National Chiao Tung University, HsinChu 300, Taiwan, ROC
b Dept. of Information Management, National Kaohsiung 1st University of Science and Technology, YenChao, Kaohsiung 824, Taiwan, ROC
Article info
Article history:
Received 20 October 2008
Received in revised form 28 September 2009
Accepted 29 September 2009
Available online 31 October 2009
Keywords:
Fuzzy association rule mining
Text mining
Hierarchical document clustering
Frequent itemsets
Abstract
As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents in versatile applications. However, most document clustering methods still suffer from challenges in dealing with high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the following mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly-related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only retains the merits of FIHC, but also improves its accuracy.
Crown Copyright © 2009 Published by Elsevier Ltd. All rights reserved.
1. Introduction
In order to browse and organize documents smoothly, hierarchical clustering techniques have been proposed to cluster a collection of documents into a hierarchical tree structure. Nevertheless, several challenges remain for hierarchical document clustering, such as high dimensionality, scalability, accuracy, and meaningful cluster labels (Beil, Ester, & Xu, 2002; Fung, Wang, & Ester, 2002, 2003).
Because text data are inherently unstructured and fuzzy, text mining is much more complex than data mining (Tan, 1999), and some studies (Delgado, Martín-Bautista, Sánchez, & Vila, 2002; Feldman & Dagan, 1995; Lin et al., 1998) have applied the technique of association rule mining to document management. For example, Feldman and Dagan (1995) presented a Knowledge Discovery in Text (KDT) system, which used the simplest information extraction approach to get interesting information and knowledge from unstructured text collections. Lin et al. (1998) proposed a method, namely Mining Term Association, to acquire the semantic relations between terms in documents. Moreover, Delgado et al. (2002) regard association rule mining as the first data mining technique employed in mining text collections, and note that it is of particular interest because many text processing applications involve associations and co-occurrence between terms. These works mainly focused on analyzing co-occurring terms for document management.
0306-4573/$ - see front matter Crown Copyright © 2009 Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2009.09.009
* Corresponding author. Present address: 1, University Road, YenChao, Kaohsiung County 824, Taiwan, ROC. Tel.: +886 7 6011000x4113; fax: +886 7 6011042.
Furthermore, Fung et al. (2002) proposed a novel method, namely Frequent Itemset-based Hierarchical Clustering (FIHC), to produce a hierarchical topic tree for document clustering. This method offers several merits for resolving the challenges above, such as dimensionality reduction, the number of clusters as an optional input parameter, and easy browsing with meaningful cluster labels. They employed the tf-idf (Term Frequency-Inverse Document Frequency) method (Salton, 1971) to replace the actual term frequency of a term by the weighted frequency. However, the main limitation of the tf-idf method is that long documents tend to have higher weights than short ones. This is because it considers only the weighted frequency of the terms in a document, but neglects the length of the document. This disadvantage affects the accuracy of the clustering task when the mining algorithm cannot obtain appropriate topic labels for the derived clusters.
In this paper, we propose an approach which stems from prior studies (Hong, Lin, & Wang, 2003; Kaya & Alhajj, 2006; Martín-Bautista, Sánchez, Chamorro-Martínez, Serrano, & Vila, 2004) that integrate fuzzy set concepts (Zadeh, 1965) with association rule mining to find interesting fuzzy association rules from given transactions. Fuzzy association rule mining is a good choice because it is easily understandable and realistic for integrating linguistic terms with fuzzy sets.
In contrast to the tf-idf weighting used in FIHC, we focus on using fuzzy association rule mining based on term frequency to find the association relationships between terms for clustering documents. Some important terms that express the topics of a document may rarely appear in the document collection. If we used conventional association rule mining, only the terms which frequently occur in the document collection could be retrieved, which implies that the important sparse terms might be obscured in the process of document clustering. By applying fuzzy association rule mining, we can discover interesting connections between fuzzy frequent itemsets. For example, (t1.Low → t2.High) or even (t1.Low → t2.Low) are rules that can be found to show the association relationships between the frequencies of important terms in the document collection.
In order to flexibly apply the frequent itemset-based technique to more applications in document clustering, we extend our previous study (Chen, Tseng, & Liang, 2008) and propose an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach based on fuzzy association rule mining to improve the accuracy of FIHC. In contrast with our previous study, we explain our approach in more detail and conduct experiments on more datasets. Our approach can be divided into the following stages:
1. In the first stage, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the following mining process. In this stage, a hybrid feature selection method is used to effectively remove the unimportant terms from each document.
2. In the second stage, to discover a set of relevant fuzzy frequent itemsets efficiently, we propose a fuzzy association rule mining algorithm for text. In this algorithm, we revise the method devised by Hong et al. (2003) by regarding a document as a transaction, and the term frequency values in a document as the quantitative values (i.e., the number of purchased items in a transaction). A frequent itemset, as defined by Fung et al. (2003), is a set of words that occur together in some minimum fraction of documents in a cluster. By employing pre-defined membership functions, our algorithm calculates three fuzzy values, i.e., for the Low, Mid, and High regions, for each term based on its frequency to discriminate the degree of importance of the term within a document in the mining process. The derived fuzzy frequent itemsets contain key terms to be regarded as the labels of candidate clusters.
3. In the final stage, the documents are clustered into a hierarchical cluster tree based on these candidate clusters. The cluster tree is built in a top-down fashion by recursively selecting the parent clusters at level k − 1 and dividing their documents into suitable children clusters at level k. Notice that the clusters generated by our algorithm are crisp partitions, i.e., each document is assigned to exactly one cluster.
In summary, our approach has the following advantages:
1. It provides a frequent itemset-based clustering algorithm for the analysis of a document set to generate a flexible hierarchical document cluster tree, which can be easily integrated into a document management system to provide flexible browsing and retrieval for various applications.
2. It shows how fuzzy association rule mining can be applied to document clustering to improve accuracy. It extends a fuzzy data representation used in data mining by Hong et al. (2003) to text mining to provide a subtler partitioning of a dataset.
3. Experiments on the Classic4, Hitech, Re0, Reuters, and Wap datasets show that our approach not only retains the merits of FIHC, namely reducing the high dimensionality of text, efficiency on these datasets, and meaningful labels for the discovered clusters, but also improves the accuracy of FIHC.
The subsequent sections of this paper are organized as follows: In Section 2, we briefly review related literature on document clustering methods. Section 3 presents our approach in three stages, together with an illustrative example. Experimental results are presented and analyzed in Section 4. Finally, we conclude and propose some future directions in Section 5.
2. Document clustering methods
In general, major clustering algorithms are divided into partitioning methods and hierarchical methods. For document clustering, partitioning methods exclusively partition the set of documents into a number of clusters by moving documents
posed the Hierarchical Frequent Term-based Clustering (HFTC) method by using frequent itemsets and minimizing the overlap of clusters in terms of shared documents. However, the experiments of Fung et al. (2003) showed that HFTC is not scalable. To obtain a scalable algorithm, Fung et al. proposed the FIHC algorithm, which uses frequent itemsets derived from association rule mining to construct a hierarchical topic tree for clusters. They also showed that using frequent itemsets for document clustering can reduce the dimension of a vector space effectively. Therefore, this approach not only reduces dimensionality, but also offers efficient processing of high volume data, supports ease of browsing, and provides meaningful cluster labels.
Steinbach, Karypis, and Kumar (2000) compared the performance of some influential clustering algorithms, and the results indicated that UPGMA and Bisecting k-means are the most accurate clustering algorithms. Therefore, we will compare the performance of our algorithm with that of the FIHC, UPGMA, and Bisecting k-means algorithms in terms of accuracy in Section 4.
In the following, we will present our approach, which stems from the frequent itemset-based clustering technique, and extends the algorithm proposed by Hong et al. (2003) to text processing, to find suitable fuzzy frequent itemsets for constructing a hierarchical cluster tree.
3. The framework of our approach
There are three stages in our framework as shown in Fig. 1. We explain them as follows:
1. Document pre-processing. By document pre-processing, the frequency of each term within a document is counted.
2. Candidate clusters extraction. Use our fuzzy association rule mining algorithm to find fuzzy frequent itemsets, which are then used to form the candidate clusters.
3. The cluster tree construction. Build the Document-Cluster Matrix (DCM) for assigning each document to a fitting cluster. Then, a pruned hierarchical cluster tree will be built.
3.1. Stage 1: document pre-processing
This stage describes the transformation processes required to obtain the desired representation of documents. As there are thousands of words in a document set, the purpose of this stage is to reduce dimensionality for high clustering accuracy. Several methods, such as itemset pruning (Beil et al., 2002), feature clustering or co-clustering (Mandhani, Joshi, & Kummamuru, 2003), feature selection (Shihab, 2004), and matrix factorization (Shahnaz, Berry, Pauca, & Plemmons, 2006; Xu & Gong, 2004), have been applied to reduce dimensionality. To solve this problem, we have to find the terms that are significant and important for representing the content of each document. Hence, we must remove the terms that are not meaningful and discriminative, both to increase the clustering accuracy and to keep the computing cost small. We describe the details of the pre-processing in the following:
1. Divide the sentences into terms.
2. Remove the stop words. We use a stop word list that contains words to be excluded. The list is applied to remove the terms that have a general meaning but do not discriminate between topics.
3. Conduct word stemming. Use an existing stemming algorithm, such as that of Porter (1980), to reduce a word to its stem or root form.
4. Term selection. The terms whose selection metric weights are all higher than the pre-specified thresholds will be selected as key terms. In our approach, three feature selection methods, tf-idf, tf-df, and tf2, are used to select representative terms for each document, and these feature selection methods are defined as follows:
(1) tf-idf (term frequency-inverse document frequency): It is denoted as $\mathit{tfidf}_{ij}$ and measures the importance of term $t_j$ within document $d_i$. To prevent a bias toward longer documents, the weighted frequency of each term is normalized by the total frequency of all terms in document $d_i$, and is defined as follows:
\[ \mathit{tfidf}_{ij} = \frac{f_{ij}}{\sum_{j=1}^{m} f_{ij}} \times \log \frac{|D|}{|\{d_i \mid t_j \in d_i,\ d_i \in D\}|} \tag{3.1} \]
where $f_{ij}$ is the frequency of term $t_j$ in document $d_i$, and the denominator $\sum_{j=1}^{m} f_{ij}$ is the total frequency of all terms in document $d_i$. $|D|$ is the total number of documents in the document set $D$, and $|\{d_i \mid t_j \in d_i,\ d_i \in D\}|$ is the number of documents containing term $t_j$.
(2) tf-df (term frequency-document frequency): It is denoted as $\mathit{tfdf}_{ij}$ and is calculated by Formula (3.2) as the term frequency (TF) divided by the document frequency (DF), where TF is the number of times a term $t_j$ appears in a document $d_i$ divided by the total frequency of all terms in $d_i$, and DF is the number of documents containing term $t_j$ divided by the total number of documents in the document set $D$:
\[ \mathit{tfdf}_{ij} = \frac{TF}{DF}, \quad \text{where } TF = \frac{f_{ij}}{\sum_{j=1}^{m} f_{ij}} \text{ and } DF = \frac{|\{d_i \mid t_j \in d_i,\ d_i \in D\}|}{|D|} \tag{3.2} \]
(3) tf2: It is the product of $\mathit{tfidf}_{ij}$ and $\mathit{tfdf}_{ij}$, and we denote it as $\mathit{tf2}_{ij}$:
\[ \mathit{tf2}_{ij} = \mathit{tfidf}_{ij} \times \mathit{tfdf}_{ij} \tag{3.3} \]
After these weights of each term in each document have been calculated, those which have weights all higher than pre-specified thresholds are retained. Subsequently, these retained terms form a set of key terms for the document set D, and we formally define them as follows.
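The three weights in Formulas (3.1)-(3.3) and the threshold test can be sketched in a few lines of Python. This is only an illustrative sketch: the input format (one dictionary of raw term frequencies per document), the function names `term_weights` and `select_key_terms`, and the use of the natural logarithm are assumptions, since the text does not fix the base of the logarithm or an implementation.

```python
import math


def term_weights(docs):
    """Return, per document, {term: (tfidf, tfdf, tf2)} as in Formulas (3.1)-(3.3).

    docs is a list of {term: raw frequency} dictionaries (assumed input format).
    """
    n_docs = len(docs)
    # Number of documents that contain each term.
    doc_freq = {}
    for doc in docs:
        for term, freq in doc.items():
            if freq > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1

    weights = []
    for doc in docs:
        total = sum(doc.values())
        row = {}
        if total > 0:
            for term, freq in doc.items():
                if freq <= 0:
                    continue
                tf = freq / total                    # length-normalized frequency
                df = doc_freq[term] / n_docs         # document frequency ratio
                # Natural logarithm is an assumption; the text only writes "log".
                tfidf = tf * math.log(n_docs / doc_freq[term])
                tfdf = tf / df
                row[term] = (tfidf, tfdf, tfidf * tfdf)
        weights.append(row)
    return weights


def select_key_terms(weights, alpha, beta, gamma):
    """Keep terms whose three weights all exceed the thresholds in some document."""
    keep = set()
    for row in weights:
        for term, (tfidf, tfdf, tf2) in row.items():
            if tfidf > alpha and tfdf > beta and tf2 > gamma:
                keep.add(term)
    return keep


# Toy input, not taken from the paper's datasets.
toy_docs = [
    {"stock": 2, "record": 1, "profit": 1},
    {"stock": 1, "record": 1},
    {"medical": 3, "health": 2},
]
toy_weights = term_weights(toy_docs)
toy_key_terms = select_key_terms(toy_weights, alpha=0.5, beta=1.0, gamma=0.5)
```

Taking the union of the per-document selections as the key term set is one possible reading of the retention rule; other readings (for example, a per-document key term list) would need only a small change to `select_key_terms`.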
Definition 3.1. A document, denoted $d_i = \{(t_1, f_{i1}), (t_2, f_{i2}), \ldots, (t_j, f_{ij}), \ldots, (t_m, f_{im})\}$, is a logical unit of text, characterized by a set of key terms $t_j$ together with their corresponding frequencies $f_{ij}$.
Definition 3.2. A document set, denoted $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, also called a document collection, is a set of documents, where $n$ is the total number of documents in $D$.
Definition 3.3. The term set of a document set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, denoted $T_D = \{t_1, t_2, \ldots, t_j, \ldots, t_s\}$, is the set of terms that appear in $D$, where $s$ is the total number of terms.
Definition 3.4. The key term set of a document set $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, denoted $K_D = \{t_1, t_2, \ldots, t_j, \ldots, t_m\}$, is a subset of the term set $T_D$, including only meaningful key terms, which do not appear in a well-defined stop word list, and satisfy the pre-defined minimum tf-idf threshold $\alpha$, the minimum tf-df threshold $\beta$, and the minimum tf2 threshold $\gamma$.
Based on these definitions, the representation of a document can be derived by Algorithm 3.1 shown in Fig. 2. Let us consider, for example, a document set $D = \{d_1, d_2, \ldots, d_{10}\}$ containing 10 documents. By Algorithm 3.1, we might obtain the derived representation of $D$ and its key term set $K_D$ = {stock, record, profit, medical, treatment, health} as shown in Table 1. Notice that we use a tabular representation, where each entry denotes the frequency of a key term (the column heading) in a document $d_i$ (the row heading), to make our presentation more concise. This representation scheme will be employed in the following to illustrate our approach.
3.2. Stage 2: candidate clusters extraction
The objective of this stage is to take a document set $D$, a set of pre-defined membership functions, the minimum support value $\theta$, and the minimum confidence value $\lambda$ as input, and to output a set of candidate clusters. To achieve this goal, we modified the algorithm proposed by Hong et al. (2003) to capture the relationships among different key terms of the document set. Since each discovered fuzzy frequent itemset has an associated fuzzy count value, this value can be regarded as the degree of importance that the itemset contributes to the document set.
In the following, we will define the membership functions, present our algorithm, and finally explain our approach by an illustrative example.
Fig. 2. A detailed illustration of Algorithm 3.1.
Table 1
Document set.

Docs ID   Key term set
          Stock   Record   Profit   Medical   Treatment   Health
d1        2       1        1        0         0           0
d2        1       1        0        0         0           0
d3        1       0        2        0         0           0
d4        0       0        0        3         0           2
d5        0       0        0        11        1           1
d6        0       1        0        4         0           0
d7        0       0        0        8         1           2
d8        3       0        1        0         0           0
d9        0       1        0        3         0           0
d10       0       0        0        8         2           1
3.2.1. The membership functions
The membership functions are used to convert each term frequency into a fuzzy set. Therefore, we define the t-f (term frequency) fuzzy set used in this paper as follows.
Definition 3.5. A t-f fuzzy set of document $d_i$ is a pair $(F_{ij}, w^{r}_{ij})$, where $F_{ij}$ is a set equal to $\{w^{Low}_{ij}(f_{ij})/t_j.Low,\ w^{Mid}_{ij}(f_{ij})/t_j.Mid,\ w^{High}_{ij}(f_{ij})/t_j.High\}$, $w^{r}_{ij}: F \to [0, 2]$, and $r$ can be Low, Mid, or High. The notation $t_j.r$ is called a fuzzy region of $t_j$.
For each term pair $(t_j, f_{ij})$ of document $d_i$, $w^{r}_{ij}(f_{ij})$ is the grade of membership of $t_j$ in $d_i$ with the Low, Mid, and High membership functions defined by Formulas (3.4)-(3.6), respectively.
\[ w^{Low}_{ij}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1 + f_{ij}/a_1, & 0 < f_{ij} < a_1 \\ 2, & f_{ij} = a_1 \\ 1 + \dfrac{a_2 - f_{ij}}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 1, & f_{ij} \ge a_2 \end{cases} \quad a_1 = \min(f_{ij}),\ a_2 = \mathrm{avg}(f_{ij}) \tag{3.4} \]
\[ w^{Mid}_{ij}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1, & f_{ij} = a_1 \\ 1 + \dfrac{f_{ij} - a_1}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 2, & f_{ij} = a_2 \\ 1 + \dfrac{a_3 - f_{ij}}{a_3 - a_2}, & a_2 < f_{ij} < a_3 \\ 1, & f_{ij} = a_3 \end{cases} \quad a_1 = \min(f_{ij}),\ a_2 = \mathrm{avg}(f_{ij}),\ a_3 = \max(f_{ij}) \tag{3.5} \]
\[ w^{High}_{ij}(f_{ij}) = \begin{cases} 0, & f_{ij} = 0 \\ 1, & f_{ij} \le a_1 \\ 1 + \dfrac{f_{ij} - a_1}{a_2 - a_1}, & a_1 < f_{ij} < a_2 \\ 2, & f_{ij} = a_2 \end{cases} \quad a_1 = \mathrm{avg}(f_{ij}),\ a_2 = \max(f_{ij}) \tag{3.6} \]
In Formulas (3.4)-(3.6), $\min(f_{ij})$ is the minimum frequency of terms in $D$, $\max(f_{ij})$ is the maximum frequency of terms in $D$, and $\mathrm{avg}(f_{ij}) = \sum_{i=1}^{n} f_{ij} / |K|$, where $f_{ij} \neq \min(f_{ij})$ or $\max(f_{ij})$, and $|K|$ is the number of summed key terms. For example, based on the document set in Table 1, the derived membership functions are shown in Fig. 3.
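As a rough illustration, the three membership functions can be written as plain Python functions. The breakpoints used in the example calls are assumptions: 1 and 11 are the minimum and maximum frequencies in Table 1, while the middle breakpoint 4 is inferred from the fuzzy values listed later in Table 2 (the text does not state how the average is rounded). The function names are hypothetical.

```python
def w_low(f, a1, a2):
    """Low membership, following Formula (3.4); a1 = min, a2 = avg."""
    if f == 0:
        return 0.0
    if f < a1:
        return 1 + f / a1
    if f == a1:
        return 2.0
    if f < a2:
        return 1 + (a2 - f) / (a2 - a1)
    return 1.0


def w_mid(f, a1, a2, a3):
    """Mid membership, following Formula (3.5); a1 = min, a2 = avg, a3 = max."""
    if f == 0:
        return 0.0
    # The formula lists only f = a1 and f = a3 here; values outside
    # [a1, a3] cannot occur by construction, so clamping is an assumption.
    if f <= a1 or f >= a3:
        return 1.0
    if f < a2:
        return 1 + (f - a1) / (a2 - a1)
    if f == a2:
        return 2.0
    return 1 + (a3 - f) / (a3 - a2)


def w_high(f, a1, a2):
    """High membership, following Formula (3.6); a1 = avg, a2 = max."""
    if f == 0:
        return 0.0
    if f <= a1:
        return 1.0
    if f < a2:
        return 1 + (f - a1) / (a2 - a1)
    return 2.0  # f = a2 in the formula; larger values cannot occur


# Assumed breakpoints for the Table 1 example (A_AVG is inferred, see above).
A_MIN, A_AVG, A_MAX = 1, 4, 11

stock_low_d1 = round(w_low(2, A_MIN, A_AVG), 2)            # "stock" occurs twice in d1
medical_mid_d7 = round(w_mid(8, A_MIN, A_AVG, A_MAX), 2)   # "medical" occurs 8 times in d7
```

With these assumed breakpoints, the two example values agree with the Stock.Low entry for d1 and the Medical.Mid entry for d7 in Table 2.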
3.2.2. The fuzzy association rule mining algorithm for text
To describe our fuzzy association rule mining algorithm, we need the following definitions.
Definition 3.6. For a document set $D$, a candidate cluster $\tilde{c} = (\tilde{D}_c, \tau)$ is a two-tuple, where $\tilde{D}_c$ is a subset of the document set $D$, such that it includes those documents which contain all the key terms in $\tau = \{t_1, t_2, \ldots, t_q\} \subseteq K_D$, $q \ge 1$, where $K_D$ is the key term set of $D$ and $q$ is the number of key terms included in $\tau$. In fact, $\tau$ is a fuzzy frequent itemset for describing $\tilde{c}$. To illustrate, $\tilde{c}$ can also be denoted as $\tilde{c}^{q}_{(t_1, t_2, \ldots, t_q)}$ or $\tilde{c}^{q}_{(\tau)}$, and these notations will be used interchangeably hereafter. For instance, in Table 1, the candidate cluster $\tilde{c}^{1}_{(stock)} = (\{d_1, d_2, d_3, d_8\}, \{stock\})$, as the term "stock" appears in these documents.
Definition 3.7. The candidate cluster set of a document set $D$, denoted $\tilde{C}_D = \{\tilde{c}^{1}_{1}, \ldots, \tilde{c}^{2}_{l-1}, \tilde{c}^{q}_{l}, \ldots, \tilde{c}^{q}_{k}\}$, is a set of candidate clusters, where $k$ is the total number of candidate clusters. The candidate cluster set $\tilde{C}_D$ for a document set $D$ can be generated by Algorithm 3.2 shown in Fig. 4.
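The document part of a candidate cluster in Definition 3.6 can be illustrated with a short Python sketch over the frequencies of Table 1. It covers only the "documents containing all key terms of the itemset" condition, not the mining procedure of Algorithm 3.2; the names `TABLE_1` and `candidate_docs` are hypothetical.

```python
# Key term frequencies from Table 1 (zero entries omitted).
TABLE_1 = {
    "d1": {"stock": 2, "record": 1, "profit": 1},
    "d2": {"stock": 1, "record": 1},
    "d3": {"stock": 1, "profit": 2},
    "d4": {"medical": 3, "health": 2},
    "d5": {"medical": 11, "treatment": 1, "health": 1},
    "d6": {"record": 1, "medical": 4},
    "d7": {"medical": 8, "treatment": 1, "health": 2},
    "d8": {"stock": 3, "profit": 1},
    "d9": {"record": 1, "medical": 3},
    "d10": {"medical": 8, "treatment": 2, "health": 1},
}


def candidate_docs(itemset, table=TABLE_1):
    """Documents that contain every key term of the itemset (Definition 3.6)."""
    return {
        doc
        for doc, freqs in table.items()
        if all(freqs.get(term, 0) > 0 for term in itemset)
    }


stock_docs = candidate_docs({"stock"})
medical_health_docs = candidate_docs({"medical", "health"})
```

For the itemset {stock} this yields d1, d2, d3, and d8, which matches the instance given in Definition 3.6.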
3.2.3. An illustrative example
Consider using the document set $D$ in Table 1, the membership functions defined in Fig. 3, the minimum support value 40%, and the minimum confidence value 60% as inputs. The fuzzy frequent itemsets discovery procedure is illustrated in Fig. 5.
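The support and confidence checks of this example can be sketched as follows. Several points are assumptions rather than statements from the text: the fuzzy count of an itemset is taken as the sum over documents of the minimum membership value of its terms (a common choice in fuzzy association rule mining), the support is that count divided by the number of documents, and the per-term fuzzy values are copied from Table 2. The names `FUZZY`, `fuzzy_count`, and `passes` are hypothetical.

```python
from itertools import combinations

N_DOCS = 10

# Fuzzy values of the selected region of each key term for d1..d10 (Table 2).
FUZZY = {
    "stock":     [1.67, 2.00, 2.00, 0.00, 0.00, 0.00, 0.00, 1.33, 0.00, 0.00],
    "record":    [2.00, 2.00, 0.00, 0.00, 0.00, 2.00, 0.00, 0.00, 2.00, 0.00],
    "profit":    [2.00, 0.00, 1.67, 0.00, 0.00, 0.00, 0.00, 2.00, 0.00, 0.00],
    "medical":   [0.00, 0.00, 0.00, 1.67, 1.00, 2.00, 1.43, 0.00, 1.67, 1.43],
    "treatment": [0.00, 0.00, 0.00, 0.00, 2.00, 0.00, 2.00, 0.00, 0.00, 1.67],
    "health":    [0.00, 0.00, 0.00, 1.67, 2.00, 0.00, 1.67, 0.00, 0.00, 2.00],
}


def fuzzy_count(terms):
    """Sum over documents of the minimum membership value (assumed operator)."""
    return sum(min(FUZZY[t][i] for t in terms) for i in range(N_DOCS))


def passes(pair, min_sup=0.4, min_conf=0.6):
    """Check the support of a 2-itemset and the confidence of both rules."""
    a, b = pair
    joint = fuzzy_count(pair)
    if joint / N_DOCS < min_sup:
        return False
    conf_a_to_b = joint / fuzzy_count((a,))
    conf_b_to_a = joint / fuzzy_count((b,))
    return conf_a_to_b > min_conf and conf_b_to_a > min_conf


frequent_pairs = [pair for pair in combinations(FUZZY, 2) if passes(pair)]
```

Under these assumptions, exactly the pairs (stock, profit), (medical, health), and (treatment, health) pass both checks, which is consistent with the 2-itemset candidate clusters reported for this example.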
In the proposed algorithm, we estimate the strength of association among key terms in the document set by using confidence values. Co-occurring key terms carry useful information, because terms that co-occur often tend to be used together. Thus, our algorithm computes the confidence values of a rule pair to check the strong association of the key terms $(t_1, t_2, \ldots, t_q)$ of the fuzzy frequent q-itemsets. Take the candidate cluster $\tilde{c}^{2}_{(stock, profit)}$ as an example. Since the confidence values of the rule pair "If stock = Low, then profit = Low" and "If profit = Low, then stock = Low" are both larger than the minimum confidence value 60%, $\tilde{c}^{2}_{(stock, profit)}$ is put in the candidate cluster set $\tilde{C}_D$. Finally, the candidate cluster set $\tilde{C}_D = \{\tilde{c}^{1}_{(stock)}, \tilde{c}^{1}_{(record)}, \tilde{c}^{1}_{(profit)}, \tilde{c}^{1}_{(medical)}, \tilde{c}^{1}_{(treatment)}, \tilde{c}^{1}_{(health)}, \tilde{c}^{2}_{(stock, profit)}, \tilde{c}^{2}_{(medical, health)}, \tilde{c}^{2}_{(treatment, health)}\}$ will be output.
3.3. Stage 3: the cluster tree construction
The candidate cluster set generated by the previous steps can be viewed as a set of topics with their corresponding sub-topics in the document set. In this stage, we first construct the Document-Term Matrix (DTM) and the Term-Cluster Matrix (TCM) to derive the Document-Cluster Matrix (DCM) for assigning each document to a fitting cluster, such that each $c^{q}_{i}$ contains a subset of documents. For the documents in each $c^{q}_{i}$, the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. We call each $c^{q}_{i}$ a target cluster in the following. Based on the assignment result, we will find the set of target clusters $C_D = \{c^{1}_{1}, c^{1}_{2}, \ldots, c^{q}_{i}, \ldots, c^{q}_{f}\}$, and then use these target clusters to form a hierarchical tree for the document set $D$. To keep the constructed cluster tree from including too many clusters, the method described in Section 3.3.3 will be used to prune unnecessary clusters.
3.3.1. Building the Document-Cluster Matrix (DCM)
First, consider each candidate cluster $\tilde{c}^{q}_{(\tau)} = \tilde{c}^{q}_{(t_1, t_2, \ldots, t_q)}$ with fuzzy frequent itemset $\tau$. $\tau$ will be regarded as a reference point for generating a target cluster. Then, to represent the degree of importance of document $d_i$ in a candidate cluster $\tilde{c}^{q}_{l}$, an $n \times k$ Document-Cluster Matrix will be constructed to calculate the similarity of terms in $d_i$ and $\tilde{c}^{q}_{l}$. To achieve this goal, we have to define two matrices, namely the Document-Term Matrix and the Term-Cluster Matrix, as follows.
Definition 3.8. A Document-Term Matrix (DTM), denoted $W = [w^{maxR_j}_{ij}]$, for a document set $D$, is an $n \times p$ matrix, such that $w^{maxR_j}_{ij}$ is the weight (fuzzy value) of term $t_j$ in document $d_i$ and $t_j \in L_1$. A formal illustration of DTM can be found in Fig. 6.
Definition 3.9. A Term-Cluster Matrix (TCM), denoted $G = [g^{maxR_j}_{jl}]$, for a document set $D$ of $n$ documents, is a $p \times k$ matrix, such that for $1 \le j \le p$, $1 \le l \le k$,
\[ g^{maxR_j}_{jl} = \frac{score(\tilde{c}^{q}_{l})}{\sum_{i=1}^{n} w^{maxR_j}_{ij}}, \quad \text{where } score(\tilde{c}^{q}_{l}) = \begin{cases} \sum_{d_i \in \tilde{c}^{1}_{l},\ t_j \in L_1} w^{maxR_j}_{ij}, & \text{if } q = 1 \\ \dfrac{\sum_{d_i \in \tilde{c}^{q}_{l},\ t_j \in L_1} w^{maxR_j}_{ij}}{\lambda}, & \text{otherwise} \end{cases} \tag{3.7} \]
In Formula (3.7), $w^{maxR_j}_{ij}$ is the weight (fuzzy value) of term $t_j$ in document $d_i \in \tilde{c}^{q}_{l}$ and $\lambda$ is the minimum confidence value.
Each $g^{maxR_j}_{jl}$ of TCM represents the degree of importance of key term $t_j$ in a candidate cluster $\tilde{c}^{q}_{(\tau)}$ by referring to those documents including $\tau$. To reduce the dimension, only key terms present in $L_1$ are used in TCM. A formal illustration of TCM can be found in Fig. 7.
Finally, based on the previous two definitions, we can define the Document-Cluster Matrix (DCM) of a document set $D$ as follows.
Definition 3.10. A Document-Cluster Matrix (DCM) for a document set $D$ of $n$ documents is the inner product of its DTM and TCM. It is an $n \times k$ matrix, and can be defined as $V = [v_{il}]$, where
\[ v_{il} = row_i(W) \cdot col_l(G) = \begin{bmatrix} w^{maxR_j}_{i1} & w^{maxR_j}_{i2} & \cdots & w^{maxR_j}_{ip} \end{bmatrix} \begin{bmatrix} g^{maxR_j}_{1l} \\ g^{maxR_j}_{2l} \\ \vdots \\ g^{maxR_j}_{pl} \end{bmatrix} = \sum_{j=1}^{p} w_{ij}\, g_{jl}, \quad 1 \le i \le n \text{ and } 1 \le l \le k \]
A formal illustration of DCM can be found in Fig. 8.
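The construction of TCM and DCM from a given DTM can be sketched in pure Python. The matrix `W` below is copied from Table 2 of the illustrative example; treating a document as a member of a candidate cluster when all of the itemset's fuzzy values are positive is an assumption that coincides with "contains all key terms" for this data. The names `build_tcm` and `build_dcm` are hypothetical.

```python
TERMS = ["stock", "record", "profit", "medical", "treatment", "health"]

# Document-Term Matrix W copied from Table 2 (rows d1..d10, columns as in TERMS).
W = [
    [1.67, 2.00, 2.00, 0.00, 0.00, 0.00],
    [2.00, 2.00, 0.00, 0.00, 0.00, 0.00],
    [2.00, 0.00, 1.67, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.67, 0.00, 1.67],
    [0.00, 0.00, 0.00, 1.00, 2.00, 2.00],
    [0.00, 2.00, 0.00, 2.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.43, 2.00, 1.67],
    [1.33, 0.00, 2.00, 0.00, 0.00, 0.00],
    [0.00, 2.00, 0.00, 1.67, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.43, 1.67, 2.00],
]

# Candidate clusters of the example, identified by their itemsets.
CLUSTERS = [
    ("stock",), ("record",), ("profit",), ("medical",), ("treatment",), ("health",),
    ("stock", "profit"), ("medical", "health"), ("treatment", "health"),
]
MIN_CONF = 0.6  # minimum confidence value of the example


def build_tcm(w, clusters, min_conf):
    """Term-Cluster Matrix following Formula (3.7)."""
    n, p = len(w), len(w[0])
    col_total = [sum(w[i][j] for i in range(n)) for j in range(p)]
    g = [[0.0] * len(clusters) for _ in range(p)]
    for l, itemset in enumerate(clusters):
        cols = [TERMS.index(t) for t in itemset]
        # Assumption: a document belongs to the cluster if all itemset terms occur in it.
        members = [i for i in range(n) if all(w[i][c] > 0 for c in cols)]
        for j in range(p):
            score = sum(w[i][j] for i in members)
            if len(itemset) > 1:
                score /= min_conf
            g[j][l] = score / col_total[j]
    return g


def build_dcm(w, g):
    """Document-Cluster Matrix V = W x G (Definition 3.10)."""
    k = len(g[0])
    return [[sum(x * g[j][l] for j, x in enumerate(row)) for l in range(k)] for row in w]


G = build_tcm(W, CLUSTERS, MIN_CONF)
V = build_dcm(W, G)
```

For example, `G[0][1]` (Stock.Low in the record cluster) evaluates to about 0.52 and `V[0][0]` (d1 in the stock cluster) to about 4.67, in line with Tables 3 and 4.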
3.3.2. Building the hierarchical cluster tree
Based on the obtained DCM, each document can be assigned to only one target cluster by using the following rules.
Rule 1. Each element $v_{il}$ of the DCM matrix represents the degree of importance of document $d_i$ in a candidate cluster $\tilde{c}^{1}_{l}$. For each document $d_i$ (row $i$ of DCM), if there exists only one maximum $v_{il}$ in $\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)}$ (columns 1 to $y$ of DCM), where $1 \le y \le k$, then $d_i$ will be assigned to the target cluster $c^{1}_{l}$; otherwise, apply Rule 2.
Fig. 6. A formal illustration of Document-Term Matrix.
\[ c^{1}_{l} = \{ d_i \mid v_{il} = \max\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)},\ \text{where } 1 \le y \le k \} \tag{3.8} \]
Rule 2. If a document $d_i$ has the same maximum value in $\{v_{i1}, v_{i2}, \ldots, v_{iy}\} \in \tilde{c}^{1}_{(\tau)}$ for more than one of the candidate clusters $\{\tilde{c}^{1}_{1}, \tilde{c}^{1}_{2}, \ldots, \tilde{c}^{1}_{y}\}$, then $d_i$ will be assigned to the target cluster $c^{1}_{l}$ whose fuzzy frequent itemset $\tau$ has the highest count value. Notice that when $q = 1$, the count value is max-count$_j$ (refer to Step 3 in Algorithm 3.2).
After assigning each document to the best fitting cluster, the resulting tree can be formed as a foundation for pruning and
a natural structure for browsing. The cluster tree built by F2IHC algorithm has the following eight features:
1. The cluster tree is built in a top-down fashion, which is different from the cluster tree obtained in a bottom-up fashion by FIHC.
2. Each child target cluster has exactly one parent target cluster.
3. The topic of a parent target cluster is more general than the topic of its children target clusters. The nodes become more and more specialized as they get closer to the leaf nodes.
4. A parent target cluster and its children target clusters are ‘‘similar” to a certain degree.
5. Each target cluster employs one fuzzy frequent q-itemset $\tau$ as its cluster label.
6. The root node of the cluster tree appears at level 0, and is tagged with the cluster label "all".
7. Each target cluster with its fuzzy frequent q-itemset appears in the level q of the tree.
8. The depth of the cluster tree is the same as the maximum size of fuzzy frequent itemsets.
3.3.3. Tree pruning
When a low minimum support value and a low minimum confidence value are used, the target cluster tree becomes broad and deep. Documents with the same topic may then be spread over several small target clusters, which lowers the document clustering accuracy. In order to generate a natural hierarchical cluster tree for higher document clustering accuracy and for easy browsing, one tree pruning method is used for merging similar target clusters at level 1. This method employs the following definition to compute the inter-cluster similarity between two target clusters.
Definition 3.11. The inter-cluster similarity between two target clusters $c^{1}_{x}$ and $c^{1}_{y}$, $c^{1}_{x} \neq c^{1}_{y}$, is defined by Formula (3.9):
\[ \mathit{Inter\_Sim}(c^{1}_{x}, c^{1}_{y}) = \frac{\sum^{n}_{d_i \in c^{1}_{x},\, c^{1}_{y}} v_{ix} \times v_{iy}}{\sqrt{\sum^{n}_{d_i \in c^{1}_{x}} (v_{ix})^{2} \times \sum^{n}_{d_i \in c^{1}_{y}} (v_{iy})^{2}}} \tag{3.9} \]
where $v_{ix}$ and $v_{iy}$ stand for two entries in DCM, such that $d_i \in c^{1}_{x}$ and $d_i \in c^{1}_{y}$, respectively. The range of Inter-Sim is [0, 1]. If the Inter-Sim value is close to 1, then both clusters are regarded as nearly the same. In the following, the minimum Inter-Sim will be used as a threshold $\delta$ to decide whether two target clusters should be merged.
The objective of sibling merging is to shrink a tree by merging similar target clusters at level 1 for attaining high document clustering accuracy. The Inter-Sim value of each pair of target clusters at level 1 of a tree is calculated by using the inter-cluster similarity measure. The target cluster pair with the highest Inter-Sim value is merged, and merging is repeated until the Inter-Sim values of all pairs of target clusters at level 1 become smaller than the minimum Inter-Sim threshold $\delta$.
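A possible reading of the inter-cluster similarity is a cosine between two DCM columns. The index sets of the sums are hard to recover from the formula alone, so the sketch below makes a loud assumption: all three sums run over the documents of both target clusters, which keeps the value in [0, 1] for non-negative entries. The DCM columns and target clusters are taken from the illustrative example (Table 4 and the assignment in Step 4); the resulting numbers are not claimed to match Table 5, which is not reproduced here. The names `DCM`, `TARGET`, and `inter_sim` are hypothetical.

```python
import math
from itertools import combinations

DOCS = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]

# DCM columns of the three non-empty level-1 target clusters (Table 4).
DCM = {
    "stock":   dict(zip(DOCS, [4.67, 3.00, 3.67, 0.00, 0.00, 1.00, 0.00, 3.33, 1.00, 0.00])),
    "record":  dict(zip(DOCS, [3.58, 3.05, 1.64, 0.67, 0.40, 2.80, 0.57, 1.40, 2.67, 0.57])),
    "medical": dict(zip(DOCS, [1.00, 1.00, 0.00, 3.34, 5.00, 3.00, 5.10, 0.00, 2.67, 5.10])),
}

# Documents assigned to each target cluster before sibling merging.
TARGET = {
    "stock": ["d1", "d3", "d8"],
    "record": ["d2"],
    "medical": ["d4", "d5", "d6", "d7", "d9", "d10"],
}


def inter_sim(x, y):
    """Cosine of DCM columns x and y over the documents of both clusters (assumption)."""
    docs = TARGET[x] + TARGET[y]
    dot = sum(DCM[x][d] * DCM[y][d] for d in docs)
    norm_x = math.sqrt(sum(DCM[x][d] ** 2 for d in docs))
    norm_y = math.sqrt(sum(DCM[y][d] ** 2 for d in docs))
    return dot / (norm_x * norm_y)


sims = {pair: inter_sim(*pair) for pair in combinations(TARGET, 2)}
best_pair = max(sims, key=sims.get)
```

Under this reading, the pair (stock, record) has the highest similarity and is the only pair above a threshold of 0.6, so one merge step would combine those two clusters and then stop.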
Algorithm 3.3 shown in Fig. 9 is used to assign each document to the best fitting cluster, and finally builds a cluster tree for output.
3.3.4. An illustrative example
For example, consider the document set in Table 1. The key term set $K_D$ = {stock, record, profit, medical, treatment, health} and the candidate cluster set $\tilde{C}_D = \{\tilde{c}^{1}_{(stock)}, \tilde{c}^{1}_{(record)}, \tilde{c}^{1}_{(profit)}, \tilde{c}^{1}_{(medical)}, \tilde{c}^{1}_{(treatment)}, \tilde{c}^{1}_{(health)}, \tilde{c}^{2}_{(stock, profit)}, \tilde{c}^{2}_{(medical, health)}, \tilde{c}^{2}_{(treatment, health)}\}$ were already generated in Sections 3.1 and 3.2.3, respectively. Now, suppose the minimum Inter-Sim value is 0.6. The proposed cluster tree construction algorithm proceeds as follows:
Step 1. Build the 10 × 6 Document-Term Matrix W in Table 2.
Step 2. Build the 6 × 9 Term-Cluster Matrix G in Table 3.
Step 3. Build the 10 × 9 Document-Cluster Matrix V in Table 4.
Step 4. Assign each document to its best target cluster.
  $c^{1}_{(stock)} = \{d_1, d_3, d_8\}$, $c^{1}_{(record)} = \{d_2\}$, $c^{1}_{(medical)} = \{d_4, d_5, d_6, d_7, d_9, d_{10}\}$,
  $c^{1}_{(profit)} = \{\}$, $c^{1}_{(treatment)} = \{\}$, $c^{1}_{(health)} = \{\}$
Step 5. Sibling merging.
  1. Remove the empty nodes $\{c^{1}_{(profit)}, c^{1}_{(treatment)}, c^{1}_{(health)}\}$.
  2. The Inter_Sim values of all pairs of target clusters are calculated in Table 5.
  3. Keep merging until the Inter_Sim values of all pairs of target clusters are lower than the minimum Inter-Sim value 0.6.
    (a) Based on the above result, the cluster pair ($c^{1}_{(stock)}$, $c^{1}_{(record)}$) has the highest Inter_Sim value.
    (b) Since $c^{1}_{(record)}$ contains fewer documents than $c^{1}_{(stock)}$, the document in $c^{1}_{(record)}$ is merged into $c^{1}_{(stock)}$. Thus, $c^{1}_{(stock)} = \{d_1, d_2, d_3, d_8\}$.
    (c) Update the inter-cluster similarity matrix. We omit the details here due to space limitations.
Step 6. Tree construction.
  1. Sort all target clusters by the number of key terms. We obtain $\{c^{1}_{(stock)}, c^{1}_{(medical)}, c^{2}_{(stock, profit)}, c^{2}_{(medical, health)}, c^{2}_{(treatment, health)}\}$.
  2. Remove the target clusters that have no parent cluster. Here, this removes $\{c^{2}_{(treatment, health)}\}$.
  3. Identify all potential children.
    (a) The number of terms in $c^{2}_{(stock, profit)}$ and $c^{2}_{(medical, health)}$ is 2 in both cases.
    (b) The PotentialChildren of $c^{1}_{(stock)}$ is $c^{2}_{(stock, profit)}$ and the PotentialChildren of $c^{1}_{(medical)}$ is $c^{2}_{(medical, health)}$.
  4. The target clusters $c^{2}_{(stock, profit)}$ and $c^{2}_{(medical, health)}$ are set as the child clusters of $c^{1}_{(stock)}$ and $c^{1}_{(medical)}$, respectively.
Table 2
The DTM of this example.

Documents/key terms   Stock.Low   Record.Low   Profit.Low   Medical.Mid   Treatment.Low   Health.Low
d1                    1.67        2.00         2.00         0.00          0.00            0.00
d2                    2.00        2.00         0.00         0.00          0.00            0.00
d3                    2.00        0.00         1.67         0.00          0.00            0.00
d4                    0.00        0.00         0.00         1.67          0.00            1.67
d5                    0.00        0.00         0.00         1.00          2.00            2.00
d6                    0.00        2.00         0.00         2.00          0.00            0.00
d7                    0.00        0.00         0.00         1.43          2.00            1.67
d8                    1.33        0.00         2.00         0.00          0.00            0.00
d9                    0.00        2.00         0.00         1.67          0.00            0.00
d10                   0.00        0.00         0.00         1.43          1.67            2.00

Table 3
The TCM of this example. Columns are the candidate clusters $\tilde{c}^{q}_{(\tau)}$, labelled by their itemsets $\tau$.

Key terms/clusters   (stock)   (record)   (profit)   (medical)   (treatment)   (health)   (stock, profit)   (medical, health)   (treatment, health)
Stock.Low            1.00      0.52       0.71       0.00        0.00          0.00       1.19              0.00                0.00
Record.Low           0.50      1.00       0.25       0.50        0.00          0.00       0.42              0.00                0.00
Profit.Low           1.00      0.35       1.00       0.00        0.00          0.00       1.67              0.00                0.00
Medical.Mid          0.00      0.40       0.00       1.00        0.42          0.60       0.00              1.00                0.70
Treatment.Low        0.00      0.00       0.00       1.00        1.00          1.00       0.00              1.67                1.67
Health.Low           0.00      0.00       0.00       1.00        0.77          1.00       0.00              1.67                1.29

Table 4
The DCM of this example. Columns are the candidate clusters $\tilde{c}^{q}_{(\tau)}$, labelled by their itemsets $\tau$.

Documents/clusters   (stock)   (record)   (profit)   (medical)   (treatment)   (health)   (stock, profit)   (medical, health)   (treatment, health)
d1                   4.67      3.58       3.69       1.00        0.00          0.00       6.15              0.00                0.00
d2                   3.00      3.05       1.93       1.00        0.00          0.00       3.21              0.00                0.00
d3                   3.67      1.64       3.10       0.00        0.00          0.00       5.16              0.00                0.00
d4                   0.00      0.67       0.00       3.34        1.99          2.67       0.00              4.46                3.32
d5                   0.00      0.40       0.00       5.00        3.96          4.60       0.00              7.67                6.61
d6                   1.00      2.80       0.50       3.00        0.84          1.20       0.83              2.00                1.40
d7                   0.00      0.57       0.00       5.10        3.89          4.53       0.00              7.55                6.48
d8                   3.33      1.40       2.95       0.00        0.00          0.00       4.92              0.00                0.00
d9                   1.00      2.67       0.50       2.67        0.70          1.00       0.83              1.67                1.17
d10                  0.00      0.57       0.00       5.10        3.81          4.53       0.00              7.55                6.36

Numbers in boldface in the original table mark the largest values of each row of $\tilde{c}^{1}$.
5. Children splitting.
(a) Here, we take the documents in the parent cluster $c^{1}_{(\text{medical})}$ as an example.
(b) Based on the DCM, we compare the value $v_{il}$ of each document in the parent cluster $c^{1}_{(\text{medical})}$ with its value in the child cluster $c^{2}_{(\text{medical, health})}$ to decide whether the document is divided into the child cluster. The result is shown in Table 6.
Step 7. Finally, the derived cluster tree CT is shown in Fig. 10.
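The children-splitting step of the worked example can be sketched in a few lines of Python. The sketch rests on two assumptions that the tables are consistent with but that this passage does not state explicitly: each DCM entry is the dot product of a DTM row (Table 2) with a TCM column (Table 3), and a document is divided into the child cluster when its child score is strictly larger than its parent score. The helper name `dot` and the variable names are illustrative only.

```python
# DTM rows (Table 2) for the documents in the parent cluster c1(medical).
# Column order: stock, record, profit, medical, treatment, health.
DTM = {
    "d4":  [0.00, 0.00, 0.00, 1.67, 0.00, 1.67],
    "d5":  [0.00, 0.00, 0.00, 1.00, 2.00, 2.00],
    "d6":  [0.00, 2.00, 0.00, 2.00, 0.00, 0.00],
    "d7":  [0.00, 0.00, 0.00, 1.43, 2.00, 1.67],
    "d9":  [0.00, 2.00, 0.00, 1.67, 0.00, 0.00],
    "d10": [0.00, 0.00, 0.00, 1.43, 1.67, 2.00],
}

# TCM columns (Table 3) for the parent and the child cluster.
TCM_PARENT = [0.00, 0.50, 0.00, 1.00, 1.00, 1.00]  # c1(medical)
TCM_CHILD = [0.00, 0.00, 0.00, 1.00, 1.67, 1.67]   # c2(medical, health)


def dot(u, v):
    """Plain dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


# Assumed rule: DCM entry = DTM row . TCM column.
scores = {doc: (dot(row, TCM_PARENT), dot(row, TCM_CHILD))
          for doc, row in DTM.items()}

# Assumed rule: move a document to the child when the child score is larger.
moved_to_child = [doc for doc, (parent, child) in scores.items()
                  if child > parent]

for doc, (parent, child) in scores.items():
    decision = "Yes" if child > parent else "No"
    print(f"{doc}: parent={parent:.2f} child={child:.2f} -> {decision}")
```

The recomputed scores agree with the DCM values in Table 4 up to rounding of the published two-decimal figures, and the resulting decisions match the Yes/No column of Table 6. How ties are handled is not specified in the text; the strict comparison here is a choice made for the sketch.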
4. Experimental results
In this section, we experimentally evaluate the performance of the proposed algorithm by comparing it with that of the FIHC method. We use the FIHC 1.0 tool to generate the results of FIHC. The produced results are then fed into the same evaluation program to ensure a fair comparison. All experiments were performed on a P4 3.2 GHz Windows XP machine with 1 GB of memory.
4.1. Datasets
We used the five standard datasets employed by the FIHC experiments. These datasets are widely adopted as standard benchmarks for the text categorization task. To find key terms, stop words were removed and stemming was performed.
Table 6
The comparison results between the parent cluster $\tilde{c}^{1}_{(\text{medical})}$ and its child cluster $\tilde{c}^{2}_{(\text{medical, health})}$.

Documents/clusters  ~c^1(medical)  ~c^2(medical, health)  Whether the document is divided into the child cluster
d4                  3.34           4.46                   Yes
d5                  5.00           7.67                   Yes
d6                  3.00           2.00                   No
d7                  5.10           7.55                   Yes
d9                  2.67           1.67                   No
d10                 5.10           7.55                   Yes

Numbers in boldface denote the largest value in each row.
Fig. 10. The derived hierarchical cluster tree.
Documents were then represented as TF (Term Frequency) vectors, and unimportant terms were discarded. This process implies a significant dimensionality reduction without loss of clustering performance.
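As a rough illustration of the TF representation described above, the following Python sketch removes stop words and counts term occurrences over a fixed vocabulary. The stop-word list, the tokenizer, and the function name `tf_vector` are hypothetical stand-ins rather than the authors' pre-processing pipeline, and stemming is omitted to keep the example free of external libraries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tf_vector(text, vocabulary):
    """Return the term-frequency vector of `text` over `vocabulary`.

    Tokens are lower-cased alphabetic runs; stop words are dropped.
    Stemming, which the paper applies, is not performed in this sketch.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    # Terms outside the vocabulary are discarded, which is where the
    # dimensionality reduction comes from.
    return [counts[term] for term in vocabulary]


vocabulary = ["stock", "profit", "health"]
vector = tf_vector("The stock and the profit of the stock", vocabulary)
print(vector)
```

Restricting the vector to a selected vocabulary is what keeps the dimensionality low; the choice of vocabulary is governed by the feature selection step discussed in Section 4.3.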
The statistics of these datasets, after the document pre-processing described in Section 3.1, are summarized in Table 7. They are heterogeneous in terms of document size, cluster size, number of classes, and document distribution. The smallest document set contains 1504 documents, and the largest one contains 8649 documents. Each document is pre-classified into a single topic, i.e., a natural class. The class information is used in the evaluation method to measure the accuracy of the clustering result. Detailed descriptions of these datasets (Özgür & Güngör, 2006) are as follows:
Classic4 (footnote 3): This document set is a combination of the four classes CACM, CISI, CRAN, and MED abstracts. Classic4 includes 3204 CACM documents, 1460 CISI documents from information retrieval papers, 1398 CRANFIELD documents from aeronautical system papers, and 1033 MEDLINE documents from medical journals.
Hitech: The Hitech dataset was derived from the San Jose Mercury newspaper articles, which are delivered as part of the Text REtrieval Conference (TREC; footnote 4) collection. The categories of this document set are computers, electronics, health, medical, research, and technology.
Re0: Re0 is a text document dataset derived from the Reuters-21578 (footnote 5) text categorization test collection, Distribution 1.0. Re0 includes 1504 documents belonging to 13 different classes.
Reuters (footnote 5): This document set is extracted from newspaper articles. These documents are divided into 135 topics, mostly concerning business and economy. In our test, we discarded documents with multiple category labels; the resulting set consists of approximately 9000 documents in 50 categories, each associated with a single topic. This dataset is also highly skewed.
Wap: This dataset consists of 1560 Web pages from Yahoo! Subject hierarchy collected and classified into 20 different
classes for the WebACE project (Han et al., 1998). Many categories of Wap are close to each other.
4.2. Evaluation of cluster quality: overall F-measure
The F-measure is often employed to evaluate the accuracy of the generated clustering results. It is a standard evaluation method for both flat and hierarchical clustering structures. More importantly, this measure balances cluster precision and cluster recall. We define a set of document clusters generated from the clustering result, denoted C, and another set, denoted L, consisting of natural classes, such that each document is pre-classified into a single class. Both sets are derived from the same document set D. Let |D| be the number of documents in the document set D; $|c_i|$ be the number of documents in the cluster $c_i \in C$; $|l_j|$ be the number of documents in the class $l_j \in L$; and $|c_i \cap l_j|$ be the number of documents in both cluster $c_i$ and class $l_j$. Fung et al. (2002) measured the quality of a clustering result C using the weighted sum of the maximum F-measures for all natural classes, weighted according to class size. This measure is called the overall F-measure of C, denoted F(C), and is defined as follows:

$$F(C) = \sum_{l_j \in L} \frac{|l_j|}{|D|} \max_{c_i \in C} \{F\}, \quad \text{where } F = \frac{2PR}{P + R},\; P = \frac{|c_i \cap l_j|}{|c_i|},\; \text{and } R = \frac{|c_i \cap l_j|}{|l_j|} \tag{4.1}$$
In general, the higher the F(C) values, the better the clustering solution is.
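To make Formula (4.1) concrete, the following Python sketch computes the overall F-measure from two mappings: cluster label to document set, and class label to document set. The function name `overall_f_measure` and the toy data are illustrative assumptions; |D| is taken as the sum of the class sizes, which relies on every document belonging to exactly one natural class, as stated above.

```python
def overall_f_measure(clusters, classes):
    """Overall F-measure F(C) of Formula (4.1).

    clusters: dict mapping a cluster label to a set of document ids
              (sets may overlap, e.g. in a hierarchy).
    classes:  dict mapping a class label to a set of document ids
              (each document belongs to exactly one class).
    """
    n_docs = sum(len(docs) for docs in classes.values())  # |D|
    total = 0.0
    for class_docs in classes.values():
        best = 0.0
        for cluster_docs in clusters.values():
            overlap = len(cluster_docs & class_docs)  # |c_i ∩ l_j|
            if overlap == 0:
                continue  # P = R = 0, so F contributes nothing
            precision = overlap / len(cluster_docs)
            recall = overlap / len(class_docs)
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Weight the best F for this class by the class size.
        total += len(class_docs) / n_docs * best
    return total


# Toy data, not taken from the paper's experiments.
classes = {"finance": {1, 2, 3}, "health": {4, 5, 6}}
clusters = {"a": {1, 2, 3, 4}, "b": {5, 6}}
score = overall_f_measure(clusters, classes)
print(f"F(C) = {score:.4f}")
```

For the toy data, the best match for "finance" is cluster "a" (P = 3/4, R = 1, F = 6/7) and the best match for "health" is cluster "b" (P = 1, R = 2/3, F = 4/5), so F(C) = 0.5 · 6/7 + 0.5 · 4/5. A clustering that reproduces the classes exactly yields F(C) = 1.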
To quantify how much improvement our proposed approach, F2IHC, achieves over the FIHC method, we define the Improvement Ratio (IR) as the gain in the F(C) value of F2IHC relative to the F(C) value of FIHC:

$$IR = \frac{F(C)_{F^{2}IHC} - F(C)_{FIHC}}{F(C)_{FIHC}} \tag{4.2}$$
Table 7
Statistics for our test datasets.

Data sets  Number of documents  Number of natural clusters  Class size              Length of documents
           (total)              (total)                     Max    Average  Min     (average)
Classic4   7094                 4                           3203   1774     1033    43
Hitech     2301                 6                           603    384      116     221
Re0        1504                 13                          608    116      11      76
Reuters    8649                 65                          3725   131      1       42
Wap        1560                 20                          341    78       5       216

3 ftp://ftp.cs.cornell.edu/pub/smart/.
4 http://trec.nist.gov/.
5 http://www.daviddlewis.com/resources/testcollections/.
where $F(C)_{F^{2}IHC}$ and $F(C)_{FIHC}$ represent the F(C) values of F2IHC and FIHC, respectively. A positive IR value indicates that the clustering quality of F2IHC is better than that of FIHC, and a higher IR value indicates a larger improvement.
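As a small worked instance of Formula (4.2), the Python sketch below evaluates the IR for the Re0 averages listed in Table 9 (0.54 for F2IHC and 0.40 for FIHC). The function name `improvement_ratio` is an illustrative choice, and the guard against a non-positive baseline is an addition of the sketch, not part of the formula.

```python
def improvement_ratio(f_new, f_base):
    """Relative improvement of f_new over f_base, as in Formula (4.2)."""
    if f_base <= 0:
        # Guard added for the sketch; the formula is undefined here.
        raise ValueError("baseline F(C) must be positive")
    return (f_new - f_base) / f_base


# Average overall F-measure on Re0 from Table 9: F2IHC 0.54, FIHC 0.40.
ir_re0 = improvement_ratio(0.54, 0.40)
print(f"IR on Re0: {ir_re0:.0%}")
```

A negative value would indicate that the baseline performed better, and a value of zero that the two methods tie.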
4.3. The effect of feature selection
In document clustering, feature selection is essential to make the clustering task efficient and more accurate. The most important goal of feature selection is to extract topic-related terms that can represent the content of each document.
Before applying F2IHC, we first consider the feature selection strategy. To select the most representative features, we use Formulas (3.1)–(3.3) to obtain three weights and select those terms whose weights are all higher than the pre-defined thresholds. Table 8 shows the keyword statistics of our test datasets and the suggested thresholds for each dataset.
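The selection rule, keeping a term only when all three of its weights exceed their thresholds, can be sketched as follows. The weights in the example are made-up numbers, not outputs of Formulas (3.1)–(3.3), and the threshold triple (0.6, 0.5, 0.3) is borrowed from the α, β, γ values printed in the Fig. 13 caption purely for illustration; treating those symbols as the three selection thresholds is an assumption.

```python
def select_key_terms(weights, thresholds):
    """Keep terms whose weights all exceed the matching thresholds.

    weights:    dict mapping a term to a tuple of three weights
                (hypothetical stand-ins for Formulas (3.1)-(3.3)).
    thresholds: tuple of three thresholds, one per weight.
    """
    return [term for term, term_weights in weights.items()
            if all(w > t for w, t in zip(term_weights, thresholds))]


# Made-up weights for three terms; not taken from the paper.
weights = {
    "stock":  (0.8, 0.7, 0.5),   # passes all three thresholds
    "the":    (0.1, 0.9, 0.9),   # fails the first threshold
    "health": (0.7, 0.6, 0.2),   # fails the third threshold
}
thresholds = (0.6, 0.5, 0.3)     # assumed alpha, beta, gamma
key_terms = select_key_terms(weights, thresholds)
print(key_terms)
```

Requiring all three conditions at once makes the filter conservative: a term that scores highly on two weights is still dropped if the third falls below its threshold.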
4.4. Evaluation results
We have conducted experiments to compare the accuracy of our algorithm F2IHC with other methods in Section 4.4.1. In Section 4.4.2, we further evaluate the accuracy of F2IHC with respect to different MinSup parameters ranging from 2% to 9%. The efficiency of our algorithm is measured in Section 4.4.3.
4.4.1. Accuracy comparison
Table 9 presents the overall F-measure values obtained for the F2IHC and FIHC algorithms for four different numbers of clusters, namely 3, 15, 30, and 60. We use the same minimum support, ranging from 3% to 6%, to test FIHC and F2IHC on each dataset, and list their average overall F-measure values.
Table 9
Comparison of the overall F-measure.

Data set (# of natural clusters)  # of clusters  F2IHC   FIHC    UPGMA   Bi. k-means
Classic4 (4)                      3              0.51    0.53    N.A.    0.59*
                                  15             0.53*   0.53*   N.A.    0.46
                                  30             0.54*   0.52    N.A.    0.43
                                  60             0.56*   0.45    N.A.    0.27
                                  Average        0.54*   0.51    N.A.    0.44
Hitech (6)                        3              0.47    0.48    0.33    0.54*
                                  15             0.47*   0.45    0.33    0.44
                                  30             0.48*   0.46    0.47    0.29
                                  60             0.45*   0.42    0.40    0.21
                                  Average        0.47*   0.45    0.38    0.37
Re0 (13)                          3              0.55*   0.40    0.36    0.34
                                  15             0.54*   0.41    0.47    0.38
                                  30             0.54*   0.38    0.42    0.38
                                  60             0.54*   0.40    0.34    0.28
                                  Average        0.54*   0.40    0.40    0.34
Reuters (65)                      3              0.49*   0.48    N.A.    0.48
                                  15             0.56*   0.47    N.A.    0.42
                                  30             0.57*   0.47    N.A.    0.35
                                  60             0.54*   0.42    N.A.    0.30
                                  Average        0.54*   0.46    N.A.    0.39
Wap (20)                          3              0.39    0.37    0.39    0.40*
                                  15             0.61*   0.49    0.49    0.57
                                  30             0.62*   0.56    0.58    0.44
                                  60             0.62*   0.59    0.59    0.37
                                  Average        0.56*   0.50    0.51    0.45

N.A. means not available for large datasets.
The experimental results of UPGMA and Bi. k-means are the same as those of FIHC.
* The best competitor.
It is apparent that the average accuracy of F2IHC is superior to that of all other algorithms. Although UPGMA, Bisecting k-means, and FIHC perform slightly better than F2IHC in several cases, we argue that the exact number of clusters in a document set is usually unknown in practice, and F2IHC is robust enough to produce stable, consistent, and high-quality clusters over a wide range of cluster numbers. This can be seen from the average overall F-measure values of all test cases. Note that UPGMA could not generate results for the large datasets; these entries are denoted N.A.
From the experimental results in Table 9, based on Formula (4.2), our proposed approach gains (0.54 - 0.4)/0.4 = 35% and (0.54 - 0.42)/0.42 = 28% improvement in the average F(C) value on the Re0 and Reuters datasets, respectively, compared with the FIHC algorithm. For the other datasets, the limited improvement may be because the numbers of clusters were fixed at 3, 15, 30, and 60 for comparison purposes, and these numbers of clusters may not be appropriate for these datasets.
4.4.2. Sensitivity to various parameters
Our algorithm has two main parameters for adjusting the clustering accuracy. We now discuss how the default values were chosen, the effects of modifying those parameters, and suggestions for practical use. The first one is mandatory and is denoted MinSup, the minimum support for fuzzy frequent itemset generation. The other is optional and is denoted KCluster, the number of clusters at level 1 of the cluster tree. Table 9 not only demonstrates the accuracy of the produced solutions, but also shows the sensitivity of the accuracy to KCluster.
Fig. 11a depicts the overall F-measure values of F2IHC for different values of the mandatory parameter MinSup when the optional parameter KCluster is left unspecified. We observe that high clustering accuracy is fairly consistent when MinSup is set between 2% and 9%. As KCluster is not specified in each test case, the sibling merging algorithm has to decide the most appropriate number of output clusters, which is shown in Fig. 11b.

[Fig. 11a: overall F-measure (y-axis, 0.35 to 0.7) versus MinSup (%) (x-axis, 2 to 9) for Classic, Hitech, Re0, Reuter, and Wap.]
[Fig. 11b: the number of clusters (y-axis) versus MinSup (%) (x-axis), with the following values.]

MinSup (%)  2     3     4     5     6     7     8     9
Classic     164   146   142   88    56    22    12    19
Hitech      25    25    41    25    33    34    35    34
Re0         17    5     4     4     6     6     25    28
Reuter      41    132   30    56    30    36    32    34
Wap         24    22    15    26    24    82    77    16

Fig. 11. The accuracy test of F2IHC.
Based on our tests, we observe as a general guideline that MinSup is best set between 3% and 6%. Nevertheless, it cannot be overemphasized that MinSup should not be regarded as the only parameter for finding the optimal accuracy. Users are expected to adjust the shape of the cluster tree through the value of MinSup: the smaller the value of MinSup, the deeper (and broader) the generated cluster tree, and vice versa.
4.4.3. Efficiency and scalability
Our algorithm involves three major phases: finding fuzzy frequent itemsets, initial clustering, and clusters merging.
Fig. 12 depicts the average execution time of the F2IHC algorithm on the five datasets, where five different MinSup values, from 5% to 9%, were used to evaluate the performance. According to the results shown in Fig. 12, the document length dominates the execution time. From Fig. 12, we further find that the average execution time of the fuzzy mining stage is almost identical across the five datasets. The runtime of our algorithm is inversely related to the input parameter MinSup; in other words, runtime increases as MinSup decreases. Due to the longer average length of documents in the Hitech and Wap datasets, their average initial clustering and cluster merging times are higher than those of the other datasets.
To analyze the scalability of our algorithm, we took 100,000 documents from the RCV1 (Reuters Corpus Volume 1) dataset (Lewis, Yang, Rose, & Li, 2004), which contains news from Reuters Ltd. There are three category sets: Topics, Industries, and Regions. In our experiments, we consider the Topics category set, which includes 23,149 training and 781,265 testing documents. Before clustering this dataset, documents were parsed by converting all terms into lower case, removing stop words, and applying the stemming algorithm.
[Fig. 12 plot: average execution time by stage (Fuzzy Association Mining, Initial Clustering, Cluster Merging, Entire Time Cost); MinSup = 5%-9%, KCluster = 60.]
Fig. 12. The detailed time cost analysis of F2IHC on five datasets.
[Fig. 13 plot: runtime (sec) versus number of documents (in thousands, 10 to 100) for Fuzzy Association Mining, Initial Clustering, Cluster Merging, and Entire Time Cost; α = 0.6, β = 0.5, γ = 0.3, MinSup = 10%, KCluster = 60.]
Fig. 13. Scalability of F2IHC.
Fig. 13 shows the runtimes of the different stages of our algorithm with respect to different sizes of the RCV1 dataset, ranging from 10 K to 100 K documents. The whole process was completed within five minutes. The figure also shows that fuzzy mining and initial clustering are the two most time-consuming stages of our algorithm. In the clustering stage, most of the time is spent on constructing initial clusters, and its runtime is almost linear with respect to the number of documents.
5. Conclusions and future directions
Although numerous document clustering methods have been studied extensively for many years, high computational complexity and space requirements still make clustering methods inefficient. Hence, reducing the heavy computational load and increasing the precision of unsupervised document clustering are important issues. In this paper, we derived a fuzzy-based hierarchical document clustering approach, based on fuzzy association rule mining, to alleviate these problems. Our approach starts with a document pre-processing stage; employs the fuzzy association rule mining method in the second stage; automatically generates a candidate cluster set and merges highly similar clusters; and finally builds a hierarchical cluster tree in a top-down fashion for easy browsing. Our experiments show that the accuracy of our algorithm is higher than that of FIHC, UPGMA, and Bisecting k-means when compared on the five standard datasets. Moreover, the experimental results show that fuzzy association rule mining discovers important candidate clusters, which increases the accuracy of document clustering. Therefore, the approach is worth extending to the management of large real-world text document collections.
Our future work focuses on the following two aspects:
1. Combining the semantic analysis: To improve the performance of the document clustering algorithm, it can be combined with semantic information, e.g., WordNet, to distinguish the senses of the key terms (Hotho, Staab, & Stumme, 2003). We will further improve the F2IHC algorithm with domain knowledge, such as WordNet, for higher accuracy.
2. Incrementally updating the cluster tree: An efficient incremental clustering algorithm for assigning a new document to the most similar existing cluster is a future direction. Since the number of documents in a document set increases over time, it is inefficient to rebuild the cluster tree for each newly arriving document. Incrementally updating the cluster tree instead keeps it reflecting the current state of the whole document set (Guha, Meyerson, Mishra, Motwani, & O'Callaghan, 2003; Pons-Porrata, Berlanga-Llavori, & Ruiz-Shulcloper, 2007). Moreover, some recent research on mining stream data (Ordonez, 2003) is closely related to incremental clustering.
Acknowledgements
This research was partially supported by National Science Council, Taiwan, ROC, under Contract Nos. NSC 96-2416-H-327-008-MY2 and NSC 96-2221-E-009-168-MY2.
References
Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. In Proceedings of the 8th ACM SIGKDD int’l conf. on knowledge discovery and data mining (pp. 436–442).
Bellot, P., & El-Beze, M. (1999). A clustering method for information retrieval. Technical report IR-0199.
Chen, C. L., Tseng, Frank S. C., & Liang, T. (2008). Hierarchical document clustering using fuzzy association rule mining. In Proceedings of the 3rd international conference of innovative computing information and control (ICICIC2008) (pp. 326–330).
Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. A. (2002). Mining text data: Special features and patterns. In Proceedings of EPS exploratory workshop on pattern detection and discovery in data mining (pp. 140–153).
Feldman, R., & Dagan, I. (1995). Knowledge discovery in textual databases (KDT). In Proceedings of the 1st ACM SIGKDD int’l conf. on knowledge discovery and data mining (pp. 112–117).
Fung, B. C. M., Wang, K., & Ester, M. (2002). Hierarchical document clustering using frequent itemsets. Master thesis, Simon Fraser University.
Fung, B. C. M., Wang, K., & Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of the 3rd SIAM int’l conf. on data mining (pp. 59–70).
Guha, S., Meyerson, A., Mishra, N., Motwani, R., & O’Callaghan, L. (2003). Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3), 515–528.
Han, E. H., Boley, B., Gini, M., Gross, R., Hastings, K., Karypis, G., et al. (1998). Webace: A web agent for document categorization and exploration. In Proceedings of the 2nd int’l conf. on autonomous agents (pp. 408–415).
Hong, T. P., Lin, K. Y., & Wang, S. L. (2003). Fuzzy data mining for interesting generalized association rules. Fuzzy Sets and Systems, 138(2), 255–269.
Hotho, A., Staab, S., & Stumme, G. (2003). Wordnet improves text document clustering. In Proceedings of the semantic web workshop of the 26th annual international ACM SIGIR conference.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.
Iliopoulos, I., Enright, A. J., & Ouzounis, C. A. (2001). Textquest: Document clustering of Medline abstracts for concept discovery in molecular biology. In Proceedings of the 6th annual Pacific symposium on biocomputing (pp. 384–395).
Kaya, M., & Alhajj, R. (2006). Utilizing genetic algorithms to optimize membership functions for fuzzy weighted association rule mining. Applied Intelligence, 24(1), 7–15.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
43(3), 752–768.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Salton, G. (1971). The SMART retrieval system – Experiments in automatic document retrieval, Retrieval. Englewood Cliffs, NJ: Prentice Hall.
Shahnaz, F., Berry, M. W., Pauca, V. P., & Plemmons, R. J. (2006). Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), 373–386.
Shihab, K. (2004). Improving clustering performance by using feature selection and extraction techniques. Journal of Intelligent Systems, 13(3), 135–161.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In Proceedings of the KDD workshop on text mining.
Tan, A. H. (1999). Text mining: The state of the art and the challenges. In Proceedings of the Pacific Asia conf. on knowledge discovery and data mining (pp. 65–70).
Xu, W., & Gong, Y. (2004). Document clustering by concept factorization. In Proceedings of the 27th ACM SIGIR conf. on research and development in information retrieval (pp. 202–209).
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.
Chun-Ling Chen is a Ph.D. student in Dept. Computer Science, National Chiao-Tung University, Taiwan, ROC. Her research interests include database, object-oriented conceptual modeling, information retrieval, text mining, and fuzzy set theory.
Frank S.C. Tseng received his B.S., M.S. and Ph.D. Degrees, all in computer science and information engineering, from National Chiao Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respectively. He joined the faculty of the Department of Information Management, Yuan-Ze University, Taiwan, ROC, on August 1995. From 1996 to 1997, he was the Chairman of the Department. Currently, he is a professor and the chairman of the Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, ROC. His research interests include database theory and applications, information retrieval, XML technologies for Internet computing, data/document warehousing, and data/text mining. Dr. Tseng is a member of the IEEE Computer Society and the Association for Computing Machinery.
Tyne Liang received her Ph.D. Degree from National Chiao Tung University, Taiwan, ROC, majored in computer science. Currently, she is an associate professor of the Dept. of Computer Science, National Chiao Tung University, Taiwan, ROC. Her research interests include information retrieval, natural language processing, web mining, and inter-connection network.