
DOI 10.1007/s10115-010-0364-2

REGULAR PAPER

An integration of fuzzy association rules and WordNet for document clustering

Chun-Ling Chen · Frank S. C. Tseng · Tyne Liang

Received: 27 August 2009 / Revised: 21 March 2010 / Accepted: 4 November 2010 / Published online: 27 November 2010

© Springer-Verlag London Limited 2010

Abstract With the rapid growth of text documents, document clustering techniques are emerging for efficient document retrieval and better document browsing. Recently, some methods have been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that our proposed approach indeed provides more accurate clustering results than prior influential clustering methods presented in recent literature.

Keywords Fuzzy association rule mining · Text mining · Document clustering · Frequent itemsets · WordNet

C.-L. Chen · T. Liang
Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC
e-mail: chunling@cs.nctu.edu.tw

T. Liang
e-mail: tliang@cs.nctu.edu.tw

Present Address:
F. S. C. Tseng (✉)
Department of Information Management, National Kaohsiung First University of Science and Technology, 1, University Road, YenChao, Kaohsiung County 824, Taiwan, ROC


1 Introduction

With the rapid growth of text documents, document clustering has become one of the main techniques for managing large document collections [6]. Several effective document clustering algorithms have been proposed, including k-means [16], Bisecting k-means [26], Hierarchical Agglomerative Clustering (HAC) [29], Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [19], etc. However, there still exist some challenges for the clustering quality [2,8,10]: (1) they involve high-dimensional term features, (2) they are not scalable for large document sets (like UPGMA), (3) they require the user to specify the number of clusters as an input parameter, which is usually unknown in advance (like k-means), (4) they do not provide a meaningful label (or description) for a cluster, and (5) they do not embody any external knowledge to extract semantics from texts.

In reply to challenges (1)–(4), a new research area, namely "frequent itemset-based clustering", has been proposed. In [2], Beil et al. developed the first frequent itemset-based algorithm, namely Hierarchical Frequent Term-based Clustering (HFTC), where the frequent itemsets are generated based on association rule mining, e.g., Apriori [1]. They consider only the low-dimensional frequent itemsets as clusters. However, the experiments evaluated by Fung et al. [8] showed that HFTC is not scalable. For a scalable algorithm, Fung et al. proposed a novel approach, namely Frequent Itemset-based Hierarchical Clustering (FIHC), which uses the derived frequent itemsets to construct a hierarchical topic tree for clusters. They also proved that using frequent itemsets for document clustering can effectively reduce the dimension of the vector space. But, our experimental results showed that FIHC is not scalable for documents of long average length. Yu et al. [31] presented another frequent itemset-based algorithm, called TDC, to improve the clustering quality and scalability. This algorithm dynamically generates a topic directory from a document set using only closed frequent itemsets and further reduces dimensionality. TDC uses a complicated tree structure to build the hierarchy by linking each itemset of size k with all of its subsets at level k − 1. This approach may result in high accuracy, but it can degrade the overall clustering quality because of too much node duplication when terms in the document set are highly correlated. Moreover, HFTC, FIHC, and TDC only account for term frequency in the documents and all ignore the important semantic relationships between terms. Therefore, our approach aims to investigate whether WordNet semantic relationships can improve the clustering quality of frequent itemset-based clustering.

Recently, WordNet [20], which is one of the most widely used thesauruses for English, has been used to group documents based on the semantic relations of terms [10,13]. Many existing document clustering algorithms mainly transform text documents into simplistic flat bags for document representation, e.g., term vectors or bag-of-words. Once terms are treated as individual items in such a simplistic representation, their semantic relations are lost. Thus, Dave et al. [13] employed synsets as features for document representation and subsequent clustering. However, synsets decreased the clustering performance in all their experiments because word sense disambiguation was not considered. Accordingly, Hotho et al. [10] used WordNet in document clustering for word sense disambiguation to improve the clustering performance. In order to consider the conceptual similarity of terms that do not actually co-occur, we employ WordNet in our document clustering approach and show where and how it can be fruitfully utilized.

In addition, a term that frequently occurs in a document does not necessarily imply its importance [25], since some important terms that express the topics of a document may rarely appear in the document collection. If we use plain association rule mining in our approach, then only the terms which frequently occur in the document collection can be obtained, which implies that the important sparse terms will be obscured in the process of document clustering. Moreover, association rule mining often suffers from producing too many itemsets, especially when items in the dataset are highly correlated [15]. Considering these two issues, we propose an approach which stems from prior studies [9,12,18], by integrating fuzzy set concepts [32] and association rule mining to provide significant dimensionality reduction over interesting frequent itemsets. To illustrate the usefulness of fuzzy data mining in document clustering, we use the fuzzy set concept to model the term frequency describing the degree of importance of a term in a document. In contrast with the crisp set concept, in which a term is either a member of a document or not, the fuzzy set concept makes it possible for a term to belong to a document to a certain degree. By applying fuzzy association rule mining, we can discover fuzzy frequent itemsets as candidate clusters, like (term1.Low, term2.High) or (term1.Low, term2.Low), and label the terms with a linguistic term, like Low, Mid, or High.

The frequent itemsets found in document collections often reveal hidden relationships and correlations among terms. In this paper, we extend our previous study [3,4] and further propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach based on fuzzy association rule mining in conjunction with WordNet for clustering textual documents. In contrast to our previous study, this paper illustrates how to add hypernyms as term features for the document representation, how to utilize the hypernyms of WordNet in the process of fuzzy association rule mining to obtain conceptual labels for the derived clusters, and how we conducted experiments to evaluate more datasets.

In summary, our approach has the following advantages:

1. It presents a means of dynamically deriving a hierarchical organization of concepts from the WordNet thesaurus based on the content of each document without use of training data or standard clustering techniques;

2. Following prior studies [9,12,18], it extends a fuzzy data representation to text mining, especially with the use of is-a hierarchy of WordNet, for discovering generalized fuzzy frequent itemsets and providing conceptual labels for clusters;

3. By conducting experimental evaluations on the four datasets Classic4, Re0, R8, and WebKB, the results show better accuracy than the FIHC, Bisecting k-means, and UPGMA methods.

The subsequent sections of this paper are organized as follows. In Sect. 2, we briefly review related work on the general process of document clustering. In Sect. 3, a detailed description of our approach with an example is presented. The experimental evaluation is described and the results are shown in Sect. 4. Finally, we conclude in Sect. 5.

2 A generic process of document clustering

The aim of document clustering is to group similar documents together based on the content of a set of documents. According to [28], we divide the general process of document clustering into three main stages, including Document Pre-processing, Document Representation, and Document Clustering (as shown in Fig. 1). These stages are described as follows:

1. Document Pre-processing. There are two steps in this stage, namely Term Extraction and Term Selection, for generating the term set from the document collection.

(1) Term Extraction: The whole extraction process is as follows:



Fig. 1 General process of document clustering

• Remove the stop words. A pre-defined stop-word list¹ is applied to remove commonly used words that do not discriminate between topics.

• Conduct word stemming. Use well-developed stemming algorithms, such as Porter [21], to convert a word to its stem or root form.

(2) Term Selection: After extracting terms, it is crucial to reduce the set of term features, a process referred to as term selection. Several methods, such as itemset pruning [2], feature clustering or co-clustering [17], feature selection techniques [24], and matrix factorization [30], have been applied to reduce the dimensionality for high clustering accuracy.

2. Document Representation. Several document representation methods have been proposed, including binary (which shows the presence or absence of a term in a document) and term frequency (which shows the frequency of a term in a document).

3. Document Clustering. Common approaches for document clustering have been used, including k-means [16], Bisecting k-means [26], Hierarchical Agglomerative Clustering (HAC) [29], Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [19], etc.

3 Fuzzy frequent itemset-based document clustering (F2IDC)

In this section, we will illustrate the overall process and detailed design of the proposed F2IDC approach. As shown in Fig. 2, the process of F2IDC is similar to the general process of document clustering (as depicted in Fig. 1), except for the gray-colored components (i.e., Document Enrichment, Fuzzy Frequent Itemset Mining, and Clustering). In the following, we explain these three stages in our framework:


Fig. 2 The F2IDC framework

1. Document Pre-processing. A set of terms from the document collection is first extracted. Then, we use a feature selection method to find the terms that are significant and important for representing the content of each document.

2. Document Representation and Enrichment. After the steps of document representation and enrichment, the designated document representation is prepared for the later mining algorithm.

3. Document Clustering. Starting from the designated document representation of all documents, we run the fuzzy association rule mining algorithm to discover fuzzy frequent itemsets and then generate the candidate clusters. Furthermore, in order to represent the degree of importance of a document in a candidate cluster, an n × k Document-Cluster Matrix (DCM) will be constructed to calculate the similarity of terms in a document and a candidate cluster. Based on the obtained DCM, each document will be assigned to a target cluster.

3.1 Stage 1: Document pre-processing

As with document clustering techniques, the proposed approach starts with term extraction. For a document set D = {d1, d2, ..., di, ..., dn}, a term set T_D = {t1, t2, ..., tj, ..., ts}, which is the set of terms appearing in D, can be obtained. The details of the term extraction are described in Sect. 2.

The feature description of a document is constituted by terms of the document set to form a term vector. A term vector with high dimensionality easily makes clustering inefficient and difficult in principle. Hence, in this paper, we employ tf-idf as the feature selection method to produce a low-dimensional term vector. A term will be discarded if its weight is less than a tf-idf threshold γ. Formula (1) measures tfidf_ij, the importance of a term t_j within a document d_i. To prevent a bias toward longer documents, the weighted frequency of each term is normalized by the maximum frequency of all terms in d_i, and is defined as follows:

    tfidf_ij = (0.5 + 0.5 × f_ij / max_{t_j ∈ d_i}(f_ij)) × log(1 + |D| / |{d_i | t_j ∈ d_i, d_i ∈ D}|),    (1)

where f_ij is the frequency of t_j in d_i, and the denominator is the maximum frequency of all terms in d_i. |D| is the total number of documents in the document set D, and |{d_i | t_j ∈ d_i, d_i ∈ D}| is the number of documents containing t_j.

After the step of term selection, the key term set of D, denoted K_D = {t1, t2, ..., tj, ..., tp}, is obtained. K_D is a subset of T_D, including only meaningful key terms that satisfy the pre-defined minimum tf-idf threshold γ.
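To make Formula (1) and the γ-thresholding concrete, below is a minimal Python sketch of this selection step; the function names and the docs_containing mapping are our own illustration rather than part of the paper's implementation, and the example threshold follows the γ values reported later in Table 3.

```python
import math
from collections import Counter

def tfidf_weights(doc_terms, docs_containing, num_docs):
    """Compute the normalized tf-idf of Formula (1) for every term of one document.

    doc_terms: list of (stemmed) terms of document d_i
    docs_containing: dict mapping a term to the number of documents containing it
    num_docs: |D|, the total number of documents
    """
    freq = Counter(doc_terms)
    max_freq = max(freq.values())                 # maximum frequency of any term in d_i
    weights = {}
    for term, f_ij in freq.items():
        tf = 0.5 + 0.5 * f_ij / max_freq          # normalized term frequency
        idf = math.log(1 + num_docs / docs_containing[term])
        weights[term] = tf * idf
    return weights

def select_key_terms(weights, gamma=0.60):
    """Keep only terms whose tf-idf weight reaches the threshold gamma (Table 3 uses 0.60)."""
    return {t: w for t, w in weights.items() if w >= gamma}
```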

3.2 Stage 2: Document representation and enrichment

In this stage, each document d_i in D is represented using those terms in K_D. Thus, each document d_i ∈ D, denoted d_i = {(t1, f_i1), (t2, f_i2), ..., (tj, f_ij), ..., (tp, f_ip)}, is represented by a set of (term, frequency) pairs, where the frequency f_ij represents the occurrence of the key term t_j in d_i.

Accordingly, we enrich the document representation by using WordNet, a source repository of semantic meanings. WordNet, developed by Miller et al. [20], consists of so-called synsets, together with a hypernym/hyponym hierarchy.

The basic idea of document enrichment is to add the generality of terms through the corresponding hypernyms of WordNet, based on the key terms appearing in each document. Each key term is linked up to the top 5 levels of hypernyms. For a simple and effective combination, these added hypernyms form a new key term set, denoted K_D = {t1, t2, ..., tp, h1, ..., hd}, where h_j is a hypernym. The enriched document d_i is represented by d_i = {(t1, f_i1), (t2, f_i2), ..., (tp, f_ip), (h1, hf_i1), ..., (hd, hf_id)}, where a weight of 0 is assigned to terms appearing in some of the documents but not in d_i. The frequency f_ij of a key term t_j in d_i is mapped to its hypernyms {h1, ..., hj, ..., hd} and accumulated as the frequency hf_ij of h_j.
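As an illustration of this enrichment step (formalized in Algorithm 1 below), the following sketch walks up to five levels of hypernyms for each key term and accumulates the term's frequency onto them. It assumes the NLTK interface to WordNet and naively picks the first noun synset of a term; the paper does not specify how a synset is chosen, so that choice is an assumption of this sketch.

```python
from collections import Counter
from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be installed

def enrich_document(term_freqs, levels=5):
    """Add hypernym frequencies to a document's (term, frequency) representation.

    term_freqs: dict mapping a key term t_j to its frequency f_ij in document d_i.
    Returns a Counter that also contains the accumulated hypernym frequencies hf_ij.
    """
    enriched = Counter(term_freqs)
    for term, freq in term_freqs.items():
        synsets = wn.synsets(term, pos=wn.NOUN)
        if not synsets:
            continue
        node = synsets[0]                      # naive choice: the first (most common) sense
        for _ in range(levels):                # walk up to the top 5 levels of hypernyms
            hypernyms = node.hypernyms()
            if not hypernyms:
                break
            node = hypernyms[0]
            hypernym_term = node.lemmas()[0].name()
            enriched[hypernym_term] += freq    # hf_ij accumulates the key term's f_ij
    return enriched
```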

The reason for using hypernyms of WordNet is to reveal hidden similarities and identify related topics, which potentially leads to better clustering quality [23]. For example, a document about 'sale' may not be associated with a document about 'trade' by the clustering algorithm if there are only 'sale' and 'trade' in the key term set. But, if the more general term 'commerce' is added to both documents, their semantic relation is revealed. The suitable representation of each document for the later mining can be derived by Algorithm 1.

3.3 Stage 3: Document clustering

The final stage is to group the documents into clusters. In the following, we first define the membership functions and present our fuzzy association rule mining algorithm for texts. Subsequently, based on the mining results, we illustrate the details of the clustering process.

3.3.1 The membership functions

The membership functions are used to convert each term frequency into a fuzzy set.


Algorithm 1 Obtain the designated representation of all documents

Input: A document set D; a well-defined stop word list; WordNet; the minimum tf-idf threshold γ.
Output: The formal representation of all documents in D.

1. Extract the term set T_D = {t1, t2, ..., tj, ..., ts}
2. Remove all stop words from T_D
3. Apply stemming to T_D
4. For each di ∈ D do // key term selection
     For each tj ∈ T_D do
       (1) Evaluate its tfidf_ij weight // defined by Formula (1)
       (2) Retain the term if tfidf_ij ≥ γ
5. Form the key term set K_D = {t1, t2, ..., tj, ..., tp}
6. For each di ∈ D do // document enrichment step
     For each tj ∈ K_D do
       (1) If (hj is a hypernym of tj) then // refer to WordNet
           (a) hf_ij → hf_ij + f_ij
           (b) K_D → K_D ∪ {hj}
7. For each di ∈ D do // to decrease noise from hypernyms, the tf-idf method is executed again
     For each tj ∈ K_D do
       (1) Evaluate its tfidf_ij weight
       (2) Retain the term if tfidf_ij ≥ γ
8. Form the new key term set K_D = {t1, t2, ..., tp, h1, ..., hd} // m (= p + d) is the total number of key terms
9. For each di ∈ D, record the frequency f_ij of tj and the frequency hf_ij of hj in di to obtain the final representation di = {(t1, f_i1), (t2, f_i2), ..., (tp, f_ip), (h1, hf_i1), ..., (hd, hf_id)}

Each term frequency f_ij in a document di is converted into the fuzzy set {w^Low_ij(f_ij)/tj.Low, w^Mid_ij(f_ij)/tj.Mid, w^High_ij(f_ij)/tj.High}, where w^r_ij : F → [0, 2] and r can be Low, Mid, or High. The notation tj.r is called a fuzzy region of tj. For each term pair (tj, f_ij) of document di, w^r_ij(f_ij) is the grade of membership of tj in di under the Low, Mid, and High membership functions defined by Formulas (2), (3), and (4), respectively. The derived membership functions are shown in Fig. 3.

In Formulas (2), (3), and (4), min(f_ij) is the minimum frequency of terms in D, max(f_ij) is the maximum frequency of terms in D, and avg(f_ij) = (Σ_{i=1}^{n} f_ij) / |K|, where f_ij = min(f_ij) or max(f_ij), and |K| is the number of summed key terms.

Fig. 3 The predefined membership functions. [The figure depicts the piecewise-linear definitions of Formulas (2), (3), and (4) for the Low, Mid, and High regions, with breakpoints determined by min(f_ij), avg(f_ij), and max(f_ij).]
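As a rough illustration of this fuzzification step, the sketch below converts a term frequency into Low/Mid/High membership grades using simple triangular shapes anchored at min(f_ij), avg(f_ij), and max(f_ij). The shapes and the [0, 1] range are simplifications assumed for this sketch; the paper's actual Formulas (2)-(4), depicted in Fig. 3, use different piecewise-linear definitions whose grades lie in [0, 2].

```python
def fuzzify(f, f_min, f_avg, f_max):
    """Illustrative fuzzification of a term frequency into Low/Mid/High regions.

    The triangular shapes anchored at f_min, f_avg, f_max are an assumption made
    for this sketch; the paper's Formulas (2)-(4) define different piecewise-linear
    shapes with membership grades in [0, 2].
    """
    def down(x, left, right):   # 1 at `left`, decreasing linearly to 0 at `right`
        if x <= left:
            return 1.0
        if x >= right:
            return 0.0
        return (right - x) / (right - left)

    def up(x, left, right):     # 0 at `left`, increasing linearly to 1 at `right`
        return 1.0 - down(x, left, right)

    low = down(f, f_min, f_avg)
    high = up(f, f_avg, f_max)
    mid = max(0.0, 1.0 - low - high)   # triangular peak around f_avg
    return {"Low": low, "Mid": mid, "High": high}

# Example: a frequency halfway between the average and the maximum
print(fuzzify(9, f_min=1, f_avg=6, f_max=12))  # {'Low': 0.0, 'Mid': 0.5, 'High': 0.5}
```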

3.3.2 The fuzzy association rule mining algorithm for texts

To generate the target cluster set C_D = {c^1_1, c^1_2, ..., c^q_i, ..., c^q_f} for a document set D, a candidate cluster set C̃_D = {c̃^1_1, ..., c̃^2_{l-1}, c̃^q_l, ..., c̃^q_k}, where k is the total number of candidate clusters, will be generated after the mining process. We call each c^q_i a target cluster in the following. A candidate cluster c̃ = (D̃_c, τ) is a two-tuple, where D̃_c is a subset of documents in D, and τ = {t1, t2, ..., tq} ⊆ K_D, q ≥ 1, where K_D is the key term set of D and q is the number of key terms contained in τ. In fact, τ is a fuzzy frequent itemset describing c̃. To illustrate, c̃ can also be denoted as c̃^q_(t1,t2,...,tq) or c̃^q_(τ), and these notations will be used interchangeably hereafter. For instance, the candidate cluster c̃^1_(trade) = ({d2, d3}, {trade}) means that the term "trade" appeared in documents d2 and d3.

In the mining process, documents and key terms are treated as transactions and purchased items, respectively. Algorithm 2 generates fuzzy frequent itemsets from a large textual document set, based on the pre-defined membership functions and the minimum support value θ, and obtains a candidate cluster set according to the minimum confidence value λ. Moreover, each discovered fuzzy frequent itemset has an associated fuzzy count value, which can be regarded as the degree of importance that the itemset contributes to the document set.

In Algorithm 2, the strength of association among key terms in the document set will be estimated by using confidence values. Our algorithm computes the two confidence values of a rule pair to check the strength of association among the key terms (t1, t2, ..., tq) of the fuzzy frequent q-itemsets. Consider the candidate cluster c̃²(sale, trade) as an example. Since the confidence values of the rule pair "If sale = Low, then trade = Mid" and "If trade = Mid, then sale = Low" are both greater than the minimum confidence value λ, c̃²(sale, trade) is put into the candidate cluster set C̃_D. Finally, the candidate cluster set C̃_D will be output. In this study, we set λ = 0.7.
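To illustrate the counting at the heart of Algorithm 2, the sketch below computes the fuzzy count, support, and rule confidences of a candidate itemset from the membership values of each term's maximum region; the dictionary layout (term region → list of per-document membership values) and the toy numbers are our own simplification.

```python
def fuzzy_support(region_weights, itemset, num_docs):
    """Support of a fuzzy itemset: the sum over documents of the minimum membership
    value among the itemset's terms, divided by |D| (steps 5(2) and 5(3) of Algorithm 2)."""
    count = sum(
        min(region_weights[term][i] for term in itemset)
        for i in range(num_docs)
    )
    return count / num_docs

def rule_confidences(region_weights, itemset, num_docs):
    """Confidence of every rule 'itemset - {t} => t'; the itemset becomes a candidate
    cluster only if all confidences reach the minimum confidence value λ."""
    numerator = sum(
        min(region_weights[term][i] for term in itemset)
        for i in range(num_docs)
    )
    confidences = {}
    for consequent in itemset:
        antecedent = [t for t in itemset if t != consequent]
        denominator = sum(
            min(region_weights[term][i] for term in antecedent)
            for i in range(num_docs)
        )
        confidences[consequent] = numerator / denominator if denominator else 0.0
    return confidences

# Hypothetical membership values of each term's maximum region in documents d1..d3
weights = {"sale.Low": [0.8, 0.6, 0.9], "trade.Mid": [0.7, 0.9, 0.5]}
itemset = ["sale.Low", "trade.Mid"]
print(fuzzy_support(weights, itemset, 3))      # ≈ 0.6, compared against θ
print(rule_confidences(weights, itemset, 3))   # each value compared against λ = 0.7
```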

3.3.3 Clustering

For assigning documents to the target clusters, each candidate cluster c̃^q_(τ) = c̃^q_(t1,t2,...,tq) with fuzzy frequent itemset τ is considered in the clustering process. The itemset τ is regarded as a reference point for generating a target cluster. In order to represent the degree of importance of a document d_i in a candidate cluster c̃^q_l, an n × k Document-Cluster Matrix (DCM) will be constructed to calculate the similarity of terms in d_i and c̃^q_l. The DCM is derived from the Document-Term Matrix (DTM) and the Term-Cluster Matrix (TCM). A formal illustration of the DCM can be found in Fig. 4.

Based on the DCM, c^q_i may or may not be assigned a subset of documents. For the documents in each c^q_i, the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.


Algorithm 2 Obtain the fuzzy frequent itemsets as candidate clusters

Input: A set of documents D = {d1, d2, ..., dn}, where di = {(t1, f_i1), (t2, f_i2), ..., (tj, f_ij), ..., (tm, f_im)}; a set of membership functions (as defined in Sect. 3.3.1); the minimum support value θ; the minimum confidence value λ.
Output: A set of candidate clusters C̃_D.

1. For each di ∈ D do
     For each tj ∈ di do
       (1) f_ij → F_ij = w^Low_ij / tj.Low + w^Mid_ij / tj.Mid + w^High_ij / tj.High // using the membership functions
2. For each tj ∈ K_D do
     For each di ∈ D do
       (1) count^Low_j = Σ_{i=1}^{n} w^Low_ij, count^Mid_j = Σ_{i=1}^{n} w^Mid_ij, count^High_j = Σ_{i=1}^{n} w^High_ij
3. For each tj ∈ K_D do
       (1) max-count_j = max(count^Low_j, count^Mid_j, count^High_j)
4. L1 = {max-R_j | support(tj) = max-count_j / |D| ≥ θ, 1 ≤ j ≤ m} // |D| is the number of documents
5. For (q = 2; L_{q-1} ≠ ∅; q++) do // find the fuzzy frequent q-itemsets L_q
       (1) C_q = apriori_gen(L_{q-1}, θ) // similar to the Apriori algorithm [1]
       (2) For each candidate q-itemset τ with key terms (t1, t2, ..., tq) ∈ C_q do
           (a) For each di ∈ D do
                 w_iτ = min{ w^{max-R_j}_ij | j = 1, 2, ..., q } // w^{max-R_j}_ij is the fuzzy membership value of the maximum region of tj in di
           (b) count_τ = Σ_{i=1}^{n} w_iτ
       (3) L_q = {τ ∈ C_q | support(τ) = count_τ / |D| ≥ θ, 1 ≤ j ≤ q}
6. For all fuzzy frequent q-itemsets τ containing key terms (t1, t2, ..., tq), where q ≥ 2, do // construct the strong fuzzy frequent itemsets
       (1) Form all possible association rules τ1 ∧ ··· ∧ τ_{k-1} ∧ τ_{k+1} ∧ ··· ∧ τ_q → τ_k, k = 1 to q
       (2) Calculate the confidence values of all possible association rules:
             confidence(τ) = Σ_{i=1}^{n} w_iτ / Σ_{i=1}^{n} (w_i1 ∧ ··· ∧ w_{i,k-1} ∧ w_{i,k+1} ∧ ··· ∧ w_iq)
       (3) C̃_D = {τ ∈ L_q | confidence(τ) ≥ λ}
7. C̃_D → {L1} ∪ C̃_D

Procedure apriori_gen(L_{q-1}, θ)
1. For each itemset l1 ∈ L_{q-1} do
     For each itemset l2 ∈ L_{q-1} do
       (1) If (l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ ··· ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]) then C_q = {c | c = l1 × l2}


Fig. 4 Document-Cluster Matrix (derived from the Document-Term Matrix and the Term-Cluster Matrix)

Fig. 5 The process of Algorithm 1 for this example

The objective of Algorithm 3 is to assign each document to the best fitting cluster c^q_i and finally obtain the target cluster set for output. To improve the clustering accuracy, the inter-cluster similarity between two target clusters c^q_x and c^q_y, c^q_x ≠ c^q_y, is calculated to merge small target clusters with similar topics. The inter-cluster similarity measurement is defined as follows:

    Inter-Sim(c^q_x, c^q_y) = ( Σ_{i=1, d_i ∈ c^q_x, c^q_y}^{n} v_ix × v_iy ) / sqrt( Σ_{i=1, d_i ∈ c^q_x}^{n} (v_ix)² × Σ_{i=1, d_i ∈ c^q_y}^{n} (v_iy)² )    (5)

where v_ix and v_iy stand for the two DCM entries such that d_i ∈ c^q_x and d_i ∈ c^q_y, respectively. The range of Inter-Sim is [0, 1]. If the Inter-Sim value is close to 1, then both clusters are regarded as nearly the same. In the following, the minimum Inter-Sim will be used as a threshold δ to decide whether two target clusters should be merged. The target cluster pair with the highest Inter-Sim value keeps being merged until the Inter-Sim values of all target clusters are less than the minimum Inter-Sim threshold δ. In this study, we set δ = 0.5.
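A small sketch of Formula (5) applied to two columns of the DCM is given below; the square root in the denominator reflects our cosine-style reading of the reconstructed formula, and the example column vectors are hypothetical.

```python
import math

def inter_sim(v_x, v_y):
    """Inter-cluster similarity (Formula (5)) between two DCM columns, given as the
    entries v_ix and v_iy of the documents assigned to the two target clusters."""
    dot = sum(a * b for a, b in zip(v_x, v_y))
    norm_x = sum(a * a for a in v_x)
    norm_y = sum(b * b for b in v_y)
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / math.sqrt(norm_x * norm_y)

# Hypothetical DCM entries of documents d1..d4 for two target clusters
print(inter_sim([0.9, 0.0, 0.4, 0.7], [0.8, 0.1, 0.5, 0.6]))  # merge the pair if >= delta (0.5)
```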

3.4 An illustrative example of F2IDC method

Suppose we have a document set D = {d1, d2, ..., d5} and its key term set K_D = {sale, trade, medical, health}. Figure 5 illustrates the process of Algorithm 1 to obtain the representation of all documents.

Consider the representation of all documents generated by Algorithm 1 in Fig. 5, the membership functions defined in Fig. 3, the minimum support value 70%, and the minimum confidence value 70% as inputs. The fuzzy frequent itemset discovery procedure is depicted in Fig. 6.


Fig. 6 The process of Algorithm 2 for this example

Fig. 7 The process of Algorithm 3 for this example

Moreover, consider the candidate cluster set C̃_D that was already generated in Fig. 6. Now, suppose the minimum Inter-Sim threshold is 0.5. Figure 7 illustrates the process of Algorithm 3 and shows the final results.

4 Experiments

In this section, we experimentally evaluate the performance of the proposed algorithm by comparing it with the FIHC, Bisecting k-means, and UPGMA algorithms. We make use of the FIHC 1.0 tool² to generate the results of FIHC. Moreover, Steinbach et al. [26] compared the performance of some influential clustering algorithms, and the results indicated that UPGMA and Bisecting k-means are the most accurate clustering algorithms. Therefore, the CLUTO clustering tool³ is applied to generate the results of Bisecting k-means and UPGMA. The produced results are then fed into the same evaluation program to ensure a fair comparison. All the experiments were performed on a P4 3.2 GHz Windows XP machine with 1 GB of memory.

² http://ddm.cs.sfu.ca/dmsoft/Clustering/products/.


Algorithm 3 Obtain the target clusters

Input: A document set D = {d1, d2, ..., di, ..., dn}; the key term set K_D = {t1, t2, ..., tj, ..., tm}; the candidate cluster set C̃_D = {c̃^1_1, ..., c̃^q_l, ..., c̃^q_k}; a minimum Inter-Sim threshold δ.
Output: The target cluster set C_D = {c^1_1, c^1_2, ..., c^q_i, ..., c^q_f}.

1. Build the n × m document-term matrix W = [w^{max-R_j}_ij] // w^{max-R_j}_ij is the weight (fuzzy value) of tj in di and tj ∈ L1
2. Build the m × k term-cluster matrix G = [g^{max-R_j}_jl] // g^{max-R_j}_jl = score(c̃^q_l) / Σ_{i=1}^{n} w^{max-R_j}_ij, 1 ≤ j ≤ m, 1 ≤ l ≤ k, and score(c̃^q_l) = Σ_{di ∈ c̃^q_l, tj ∈ τ} w^{max-R_j}_ij, where w^{max-R_j}_ij is the weight (fuzzy value) of tj in di ∈ c̃^q_l and tj ∈ L1
3. Build the n × k document-cluster matrix V = W · G = [v_il], where v_il = Σ_{j=1}^{m} w_ij g_jl
4. Based on V, assign di to a target cluster c^q_l:
   (1) c^q_l = {di | v_il = max{v_i1, v_i2, ..., v_ik} ∈ c̃^q_l, where the number of such maximal v_il is 1}; otherwise
   (2) c^q_l = {di | v_il = max{v_i1, v_i2, ..., v_ik} ∈ c̃^q_l, where the number of such maximal v_il > 1 and c̃^q_l has the highest fuzzy count value corresponding to its fuzzy frequent itemset}
5. Clusters merging
   (1) For each c^q_l ∈ C_D do
       (a) If (c^q_l = null) then remove this target cluster c^q_l from C_D
   (2) For each pair of target clusters (c^q_x, c^q_y) ∈ C_D do
       (a) Calculate the Inter-Sim // defined by Formula (5)
       (b) Store the results in the Inter-Cluster Similarity matrix I
   (3) If (one of the Inter-Sim values in I ≥ δ) then
       (a) Select (c^q_x, c^q_y) with the highest Inter-Sim
       (b) Merge the smaller target cluster into the larger target cluster
       (c) Repeat Step (2) to update I
6. Output C_D
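A compact NumPy sketch of steps 1–4 of Algorithm 3 follows; the toy W and G matrices are hypothetical, and the tie-breaking rule of step 4(2) as well as the merging loop of step 5 are omitted for brevity.

```python
import numpy as np

# Hypothetical fuzzy weights of 4 documents over 3 key terms (Document-Term Matrix W)
W = np.array([[0.9, 0.1, 0.0],
              [0.7, 0.6, 0.1],
              [0.0, 0.2, 0.8],
              [0.1, 0.0, 0.9]])

# Hypothetical Term-Cluster Matrix G for 2 candidate clusters (step 2 of Algorithm 3)
G = np.array([[0.8, 0.0],
              [0.5, 0.2],
              [0.0, 0.9]])

# Step 3: Document-Cluster Matrix V = W · G
V = W @ G

# Step 4: assign each document to the candidate cluster with the largest entry
assignment = V.argmax(axis=1)
print(V)
print(assignment)   # documents 0-1 go to cluster 0, documents 2-3 to cluster 1
```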

The implementation was written in Java 1.5 to allow reusability of the written code.

4.1 Datasets

To test the proposed approach, we used four different kinds of datasets: Classic4, Re0, R8, and WebKB, which are widely adopted as standard benchmarks for the text categorization task. They are heterogeneous in terms of document size, cluster size, number of classes, and document distribution. Moreover, these datasets are not specially designed to combine with WordNet for facilitating the clustering result.

Table 1 summarizes the statistics of these datasets. Each document is pre-classified into a single topic, i.e., a natural class. The class information is utilized in the evaluation method for measuring the accuracy of the clustering result.


Table 1 Statistics for our test datasets

Datasets | Documents total | Classes total | Class size Max | Class size Average | Class size Min | Document length Average
Classic4 | 7,094 | 4 | 3,203 | 1,774 | 1,033 | 39
Re0 | 1,504 | 13 | 608 | 116 | 11 | 69
R8 | 7,674 | 8 | 3,923 | 959 | 51 | 48
WebKB | 4,199 | 4 | 1,641 | 1,050 | 504 | 124

The detailed information of these datasets is described as follows:

1. Classic4⁴: This document set is a combination of the four classes CACM, CISI, CRAN, and MED abstracts. Classic4 includes 3,204 CACM documents, 1,460 CISI documents from information retrieval papers, 1,398 CRANFIELD documents from aeronautical system papers, and 1,033 MEDLINE documents from medical journals.

2. Re0⁵: Re0 is a text document dataset derived from the Reuters-21578⁶ text categorization test collection Distribution 1.0. Re0 includes 1,504 documents belonging to 13 different classes.

3. R8⁷: R8 is a subset of the Reuters-21578 text categorization collection. It considers only the documents associated with a single topic and the classes which still have at least one train and one test example. R8 includes 7,674 documents in the 8 most frequent classes.

4. WebKB⁸: This dataset consists of web pages collected by the WebKB project of the CMU text learning group [5]. These pages are manually classified into seven categories. In our test, we select the four most popular entity-representing categories: course, faculty, project, and student.

4.2 Evaluation of cluster quality

In these datasets, each document is pre-classified into a single category, i.e., a natural class. The class information is utilized in the evaluation method for measuring the accuracy of the clustering result. In our test, a standard evaluation measure, namely the Overall F-measure [8], is used to evaluate the generated clustering results. More importantly, this measure balances cluster precision and cluster recall.

Document clustering is a process of partitioning a set of documents into a set of meaningful subclasses, called clusters. Hence, we define a set of document clusters generated from the clustering results, denoted C, and another set of natural classes, denoted L, in which each document is pre-classified into a single class. Both sets are derived from the same document set D. Let |D| be the number of all documents in the document set D; |c_i| be the number of documents in the cluster c_i ∈ C; |l_j| be the number of documents in the class l_j ∈ L; and |c_i ∩ l_j| be the number of documents in both a cluster c_i and a class l_j.

⁴ ftp://ftp.cs.cornell.edu/pub/smart/.
⁵ The pre-processed datasets can be downloaded at http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download/.
⁶ http://www.daviddlewis.com/resources/testcollections/.
⁷ The pre-processed datasets can be downloaded at http://web.ist.utl.pt/~acardoso/datasets/.
⁸ The pre-processed datasets can be downloaded at http://www.cs.technion.ac.il/~ronb/thesis.html.


Table 2 List of all parameters for our algorithm and the other three algorithms

Parameter name | F2IDC | FIHC | UPGMA^(a,b) | Bi. k-means^(c)
Datasets | Classic4, Re0, R8, WebKB
Stopword removal | Yes
Stemming | Yes
Length of the smallest term | three
Weight of the term vector | tf | tf-idf | tf-idf | tf-idf
Levels of hypernyms | H1, H2, H3, H4, H5
Cluster count k | 3, 15, 30, 60

H1 represents the addition of direct hypernyms; H2 stands for the addition of hypernyms of the first and second levels, and so on.

(a) The command was: vcluster -clmethod=agglo -crfun=upgma -sim=cos -rowmodel=maxtf -colmodel=idf -clabelfile=<X>.mat.clabel <X>.mat <K>
(b) <X> is the name of the dataset being tested (e.g., R8, WebKB, etc.), and <K> is the number of clusters desired in the final solution. vcluster is the name of the CLUTO clustering program that clusters data from .mat files given as input.
(c) The command was: vcluster -clmethod=rbr -crfun=i2 -sim=cos -cstype=best -rowmodel=maxtf -colmodel=idf -clabelfile=<X>.mat.clabel <X>.mat <K>

Overall F-measure The F-measure is often employed to evaluate the accuracy of clustering results. Fung et al. [8] measured the quality of a clustering result C using the weighted sum of the maximum F-measures for all natural classes according to the class size. This measure is called the overall F-measure of C, denoted F(C), which is defined as follows:

    F(C) = Σ_{l_j ∈ L} (|l_j| / |D|) × max_{c_i ∈ C} {F}, where F = 2PR / (P + R), P = |c_i ∩ l_j| / |c_i|, and R = |c_i ∩ l_j| / |l_j|    (6)

In general, the higher the F(C) values, the better the clustering solution is.
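The following sketch computes the overall F-measure of Formula (6), representing clusters and natural classes as sets of document identifiers; this representation is our own choice for illustration.

```python
def overall_f_measure(clusters, classes, num_docs):
    """Overall F-measure F(C) of Formula (6): for each natural class take the best
    F-measure over all clusters, then weight it by the class size |l_j| / |D|."""
    total = 0.0
    for lj in classes:
        best = 0.0
        for ci in clusters:
            overlap = len(ci & lj)
            if overlap == 0:
                continue
            precision = overlap / len(ci)
            recall = overlap / len(lj)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (len(lj) / num_docs) * best
    return total

# Toy example: 6 documents, two clusters versus two natural classes
clusters = [{1, 2, 3}, {4, 5, 6}]
classes = [{1, 2, 4}, {3, 5, 6}]
print(overall_f_measure(clusters, classes, 6))  # higher values indicate a better clustering
```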

Improvement ratio The improvement ratio (IR) measures the relative improvement of the F(C) value of our proposed approach, F2IDC, over each of the other compared algorithms. In the following, we define the IR:

    IR = (F(C)_{F2IDC} − F(C)_{<Y>}) / F(C)_{<Y>},    (7)

where F(C)_{F2IDC} and F(C)_{<Y>} represent the F(C) values of F2IDC and of the other three algorithms (e.g., <Y> can be FIHC, UPGMA, or Bi. k-means), respectively. A higher IR value indicates that the clustering quality of the F2IDC method is better than the clustering quality of the other algorithms.
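Formula (7) amounts to a one-line relative improvement; the sketch below reproduces, as an example, the Classic4 improvement ratio over FIHC reported in Table 5.

```python
def improvement_ratio(f_f2idc, f_other):
    """Improvement ratio (Formula (7)) of F2IDC over a compared algorithm."""
    return (f_f2idc - f_other) / f_other

print(round(improvement_ratio(0.69, 0.51), 2))  # +0.35 for FIHC on Classic4 (Table 5)
```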

4.3 Parameters selection

Table 2 summarizes the parameters for our proposed method and the other algorithms used to compare the clustering performance.

Before applying F2IDC, we first consider the feature selection strategy. In order to select the most representative features, we use Formula (1) to obtain the key terms with weights higher than the pre-defined thresholds γ. Table 3 shows the keyword statistics of our test datasets and the suggested threshold for each dataset. Documents were then represented as tf (term frequency) vectors, and unimportant terms were discarded.


Table 3 Keyword statistics of our test datasets

Dataset | # of terms | # of terms after pre-processing | # of terms after enriching | γ threshold (F2IDC) | γ threshold (WordNet-based F2IDC)
Classic4 | 40,291 | 40,279 | 41,931 | 0.60 | 0.65
Re0 | 2,886 | 2,678 | 3,507 | 0.60 | 0.65
R8 | 16,810 | 16,790 | 18,692 | 0.60 | 0.65
WebKB | 42,503 | 34,310 | 36,622 | 0.60 | 0.65

This process implies a significant dimensionality reduction without loss of clustering performance.

The two algorithms, F2IDC and FIHC, both have two main parameters for adjusting the clustering accuracy. The first parameter, denoted MinSup, is mandatory and specifies the minimum support for frequent itemset generation. The other parameter, denoted KCluster, is optional and represents the number of clusters. As Bisecting k-means and UPGMA require a predefined number of clusters as input, their KCluster parameters must be provided.

4.4 Experimental results and analysis

The experiments were conducted in the following steps. First, we evaluated our method, F2IDC, on the four datasets mentioned earlier and compared its accuracy with that of FIHC, Bisecting k-means, and UPGMA. Moreover, we verified whether the use of WordNet can generate conceptual labels for the derived clusters. Second, the RCV1 (Reuters Corpus Volume 1) dataset [14] was chosen to evaluate the efficiency and scalability of F2IDC.

4.4.1 Accuracy comparison for F2IDC algorithm

Table 4 presents the obtained overall F-measure values for WordNet-based F2IDC and the other WordNet-based algorithms, comparing four different numbers of clusters, namely 3, 15, 30, and 60, on the four datasets. For each algorithm, we run each dataset enriched with the top 5 levels of hypernyms. We tested each algorithm's clustering results with the value H, the levels of hypernyms, from 1 to 5 and selected the best results. We chose the minimum support in {25%, 28%, 30%, 32%, 35%} to run F2IDC with WordNet for all datasets. Moreover, we set the minimum support values, ranging from 3 to 6%, to obtain the best results for FIHC.

It is apparent that the average accuracy of Bisecting k-means and FIHC is slightly better than that of F2IDC in several cases. We argue that the exact number of clusters in a document set is usually unknown in real cases, and F2IDC is robust enough to produce stable, consistent, and high-quality clusters for a wide range of numbers of clusters. This can be seen by observing the average overall F-measure values of all test cases. Notice that UPGMA is not available for large datasets because some experimental results cannot be generated for UPGMA; we denote them as NA. Since FIHC is not available for documents of long average length, there is no experimental result generated on the WebKB dataset, and we also mark them as NA.


Table 4 Average overall F-measure comparison for four clustering algorithms on the four datasets

Dataset (# of natural classes) | # of clusters | F2IDC (H) | FIHC (H) | UPGMA (H) | Bi. k-means (H)
Classic4 (4) | 3 | 0.68 (3)* | 0.51 (1) | NA | 0.61 (5)
Classic4 (4) | 15 | 0.70 (3)* | 0.51 (1) | NA | 0.59 (5)
Classic4 (4) | 30 | 0.70 (3)* | 0.52 (1) | NA | 0.43 (5)
Classic4 (4) | 60 | 0.69 (3)* | 0.51 (1) | NA | 0.28 (5)
Classic4 (4) | Average | 0.69 (3)* | 0.51 (1) | NA | 0.48 (5)
Re0 (13) | 3 | 0.56 (3)* | 0.43 (1) | 0.40 (3) | 0.40 (3)
Re0 (13) | 15 | 0.53 (3)* | 0.40 (1) | 0.35 (3) | 0.42 (3)
Re0 (13) | 30 | 0.52 (3)* | 0.39 (1) | 0.35 (3) | 0.36 (3)
Re0 (13) | 60 | 0.52 (3)* | 0.34 (1) | 0.35 (3) | 0.30 (3)
Re0 (13) | Average | 0.53 (3)* | 0.39 (1) | 0.36 (3) | 0.37 (3)
R8 (8) | 3 | 0.57 (3) | 0.47 (1) | NA | 0.59 (3)*
R8 (8) | 15 | 0.44 (3)* | 0.43 (1) | NA | 0.42 (3)
R8 (8) | 30 | 0.43 (3)* | 0.43 (1)* | NA | 0.36 (3)
R8 (8) | 60 | 0.44 (3)* | 0.43 (1) | NA | 0.23 (3)
R8 (8) | Average | 0.47 (3)* | 0.44 (1) | NA | 0.40 (3)
WebKB (4) | 3 | 0.48 (1)* | NA | 0.44 (1) | 0.33 (3)
WebKB (4) | 15 | 0.49 (1)* | NA | 0.43 (1) | 0.19 (3)
WebKB (4) | 30 | 0.49 (1)* | NA | 0.42 (1) | 0.13 (3)
WebKB (4) | 60 | 0.49 (1)* | NA | 0.36 (1) | 0.07 (3)
WebKB (4) | Average | 0.49 (1)* | NA | 0.42 (1) | 0.18 (3)

NA means not available for the dataset
* The best competitor

Table 5 Improvement ratio over the other three clustering algorithms on the four datasets

Dataset | F2IDC (H) | FIHC (H) | UPGMA (H) | Bi. k-means | IR vs FIHC | IR vs UPGMA | IR vs Bi. k-means
Classic4 | 0.69 (3) | 0.51 (1) | NA | 0.48 (5) | +0.35 | NA | +0.43
Re0 | 0.54 (3) | 0.39 (1) | 0.36 (3) | 0.37 (3) | +0.39 | +0.50 | +0.46
R8 | 0.47 (3) | 0.44 (1) | NA | 0.40 (3) | +0.07 | NA | +0.18
WebKB | 0.49 (1) | NA | 0.42 (1) | 0.18 (3) | NA | +0.17 | +1.72

From the experimental results in Table 4 and based on Formula (7), our proposed approach achieves an average improvement in F(C) value over the other three algorithms on the four datasets (as shown in Table 5). The improvement ratio ranges from 7 to 172%, based on the increases of the F(C) value.

4.4.2 The effect of enriching the document representation

As described in Sect. 3.2, when enriching the document representation, we utilize WordNet to exploit hypernymy for clustering. We now demonstrate the effect of adding hypernyms into the datasets as follows.


Table 6 The effect of enriching the document representation

Levels | Classic4 F2IDC | Classic4 FIHC | Re0 F2IDC | Re0 FIHC | R8 F2IDC | R8 FIHC | WebKB F2IDC | WebKB FIHC
Baseline | 0.54 | 0.51 | 0.50 | 0.40 | 0.55 | 0.55 | 0.44 | NA
H1 | 0.67 | 0.51 | 0.52 | 0.39 | 0.43 | 0.44 | 0.49 | NA
H2 | 0.65 | 0.50 | 0.51 | 0.38 | 0.43 | 0.44 | 0.48 | NA
H3 | 0.69 | 0.49 | 0.53 | 0.38 | 0.47 | 0.40 | 0.46 | NA
H4 | 0.66 | 0.47 | 0.53 | 0.38 | 0.47 | 0.40 | 0.45 | NA
H5 | 0.67 | 0.47 | 0.52 | 0.38 | 0.47 | 0.40 | 0.43 | NA

Since FIHC obtained the best performance in terms of accuracy among the three compared algorithms, we tested F2IDC and FIHC with the baseline method and with the addition of hypernyms of different levels. Table 6 shows the comparison of clustering results obtained by F2IDC and FIHC, respectively. In Table 6, "Baseline" means that no hypernyms are added; "H1" corresponds to the addition of direct hypernyms; "H2" stands for the addition of hypernyms of the first and second levels, and so on. We chose the minimum support, ranging from 4 to 8%, to run the baseline result of F2IDC for all datasets. The results in Table 6 show that FIHC decreases in clustering accuracy when increasing the levels of hypernyms. WordNet-based FIHC does not provide any improvement with respect to the baseline method. For the obtained results, the reasons could be:

(1) Using hypernyms as additional features in the document enrichment process inevitably introduces a lot of noise into these datasets;

(2) Word sense disambiguation was not performed to determine the proper meaning of each polysemous term in documents [10].

From Table 6, it is obvious that the average overall F-measure values of WordNet-based F2IDC are superior to those of WordNet-based FIHC when adding hypernyms of the first, second, and third levels on almost all datasets, except for the WebKB dataset. The performance of F2IDC with the addition of direct hypernyms is better than that of F2IDC with higher levels of hypernyms on the WebKB dataset. Due to the longer average length of documents in the WebKB dataset, higher levels of hypernyms may add more noise to the clustering process and decrease the clustering accuracy.

In contrast to WordNet-based FIHC, our approach can ameliorate the effect of adding hypernyms by filtering out noise for clustering. The use of WordNet for F2IDC induces better clustering results on the Classic4 dataset, while the improvements on the others are not particularly spectacular. In the case of the Reuters tasks, the limited improvement is not a particular worry: WordNet enrichment is not likely to work well for text, such as the documents in Reuters-21578, that is guaranteed to be written concisely and efficiently [22].

To understand why WordNet enhances F2IDC to perform better, a sample of the cluster labels generated by F2IDC on the Re0 dataset can be found in Table 7. Due to the rich semantic network representation provided by WordNet, F2IDC with WordNet generates more general and meaningful labels for clusters. For example, the label 'commerce' produced by F2IDC with WordNet is a more general concept than the labels 'sell' and 'trade' generated by F2IDC without WordNet.


Table 7 Cluster labels generated by the F2IDC algorithm on the Re0 dataset

F2IDC without WordNet | F2IDC with WordNet
bank, dollar, currency, growth, industry, market, nation, rate, rise, rose, sell, trade | activity, agent, assemblage, commerce, (commodity, good), currency, forecast, growth, merchant, nation, rate, record, (bush, rose, shrub)

Fig. 8 The accuracy test of F2IDC for different MinSup values, with the optimal cluster numbers determined by the clusters merging step

4.4.3 Sensitivity to various parameters

Figure 8a, b, respectively, depict the overall F-measure values of F2IDC and WordNet-based F2IDC when accepting different mandatory parameters, but ignoring the parameter values of the optional ones. We observed that high clustering accuracies are fairly consistent when MinSup is set between 2 and 9% for F2IDC and between 15 and 35% for WordNet-based F2IDC. As KClusters is not specified in each test case, the clusters merging step in Algorithm 3 has to decide the most appropriate number of output clusters, which is shown in Fig. 8b, d for F2IDC and WordNet-based F2IDC, respectively.

Based on our tests, we conclude the general observation that the best choice of MinSup can be set between 4 and 8% for F2IDC, and between 25 and 35% for WordNet-based F2IDC. Nevertheless, it cannot be overemphasized that MinSup should not be regarded as the only parameter for finding the optimal accuracy.


Fig. 9 Scalability of F2IDC

4.4.4 Efficiency and scalability

To analyze the scalability of our algorithm, we took 100,000 documents from the RCV1 (Reuters Corpus Volume 1) dataset [14], which contains news from Reuters Ltd. There are three category sets: Topics, Industries, and Regions. In our experiments, we consider the Topics category set, which includes 23,149 training and 781,265 testing documents. Before clustering this dataset, documents were parsed by converting all terms in documents to lower case, removing stop words, and applying the stemming algorithm.

Figure 9 shows the runtimes with respect to different sizes of the RCV1 dataset, ranging from 10K to 100K documents, for the different stages of our algorithm. The figure also shows that the fuzzy association mining and initial clustering stages are the two most time-consuming stages in our algorithm. In the clustering process, most of the time is spent on constructing initial clusters, and its runtime is almost linear with respect to the number of documents. As the efficiency of the fuzzy association rule mining is very sensitive to the input parameter MinSup, the runtime of F2IDC is inversely related to MinSup. In other words, runtime increases as MinSup decreases.

5 Conclusion

The importance of document clustering emerges from the massive volumes of textual documents created. Although numerous document clustering methods have been extensively studied in recent years, there still exist several challenges for improving the clustering quality. In particular, most of the current document clustering algorithms, including FIHC, do not consider the semantic relationships among terms. In this paper, we derived an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the external knowledge of WordNet for grouping documents. The key advantage conferred by our proposed algorithm is that the generated clusters, labeled with conceptual terms, are easier to understand than clusters annotated by isolated terms. In addition, the extracted cluster labels may help identify the content of individual clusters.

Our experiments reveal that the proposed algorithm achieves better accuracy than the FIHC, Bisecting k-means, and UPGMA methods on our datasets. Our primary findings are as follows:


(1) Our approach facilitates the integration of the rich knowledge of WordNet into textual documents by effectively filtering out noise when adding hypernyms into documents and generating more conceptual labels for clusters.

(2) FIHC performs better for documents of short average length, but worse for documents of long average length.

(3) The other document clustering algorithms, like Bisecting k-means and UPGMA, are sensitive to the number of clusters.

In the future, we will explore some further issues. First, we will extend F2IDC to generate overlapping clusters, providing multiple subjective perspectives on the same document to enhance its practical applicability. Second, we intend to propose an efficient incremental clustering algorithm [7,11] for assigning a new document to the most similar existing cluster. Third, we will consider the abundant structural relations within Wikipedia, such as hyperlinks and hierarchical categories, to improve the performance of clustering [27].

Acknowledgments This research was partially supported by National Science Council, Taiwan, ROC, under Contract No. NSC 98-2410-H-327-020-MY3 and No. NSC 98-2221-E-009-145.

References

1. Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: ACM SIGMOD international conference on management of data, pp 207–216

2. Beil F, Ester M, Xu X (2002) Frequent term-based text clustering. In: International conference on knowledge discovery and data mining (KDD’02), pp 436–442

3. Chen CL, Tseng FSC, Liang T (2008) Hierarchical document clustering using fuzzy association rule mining. In: The 3rd international conference of innovative computing information and control (ICICIC2008), pp 326–330

4. Chen CL, Tseng FSC, Liang T (2010) Mining fuzzy frequent itemsets for hierarchical document clustering. Inf Process Manag 46(2):193–211

5. Craven M, DiPasquo D, McCallum A, Mitchell T, Nigam K, Slattery S (1998) Learning to extract symbolic knowledge from the World Wide Web. In: AAAI-98

6. Cutting DR, Karger DR, Pederson JO, Tukey JW (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: The 15th international ACM SIGIR conference on research and development in information retrieval, pp 318–329

7. Exarchos TP, Tsipouras MG, Papaloukas C, Fotiadis DI (2009) An optimized sequential pattern matching methodology for sequence classification. Knowl Inf Syst 19(2):249–264

8. Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. In: SIAM international conference on data mining (SDM’03), pp 59–70

9. Hong TP, Lin KY, Wang SL (2003) Fuzzy data mining for interesting generalized association rules. Fuzzy Sets Syst 138(2):255–269

10. Hotho A, Staab S, Stumme G (2003) Wordnet improves text document clustering. In: SIGIR international conference on Semantic Web Workshop

11. Huang Z, Sun S, Wang W (2010) Efficient mining of skyline objects in subspaces over data streams. Knowl Inf Syst 22(2):159–183

12. Kaya M, Alhajj R (2006) Utilizing genetic algorithms to optimize membership functions for fuzzy weighted association rule mining. Appl Intell 24(1):7–15

13. Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In: The 12th international conference on World Wide Web (WWW)

14. Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397

15. Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: The ACM SIGKDD conference on knowledge discovery and data mining, pp 125–134

16. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: The 5th Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297

17. Mandhani B, Joshi S, Kummamuru K (2003) A matrix density based algorithm to hierarchically co-cluster documents and words. In: The 12th international conference on World Wide Web (WWW), pp 511–518


18. Martín-Bautista MJ, Sánchez D, Chamorro-Martínez J, Serrano JM, Vila MA (2004) Mining web documents to find additional query terms using fuzzy association rules. Fuzzy Sets Syst 148(1):85–104

19. Michener CD, Sokal RR (1957) A quantitative approach to a problem in classification. Evolution 11:130–162

20. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41

21. Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137

22. Scott S, Matwin S (1998) Text classification using WordNet hypernyms. In: Proceedings of the Workshop on Usage of WordNet in NLP Systems at COLING-98, pp 38–44

23. Sedding J, Kazakov D (2004) WordNet-based text document clustering. In: COLING-2004 workshop on robust methods in analysis of natural language data

24. Shihab K (2004) Improving clustering performance by using feature selection and extraction techniques. J Intell Syst 13(3):135–161

25. Singhal A, Salton G (1993) Automatic text browsing using vector space model. Technical Report, Department of Computer Science, Cornell University

26. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: The 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD)

27. Wang P, Hu J, Zeng H-J, Chen Z (2009) Wikipedia knowledge to improve text classification. Knowl Inf Syst 19(3):265–281

28. Wei C, Hu P, Dong YX (2002) Managing document categories in e-commerce environments: an evolution-based approach. Eur J Inf Syst 11(3):208–222

29. Willett P (1988) Recent trends in hierarchic document clustering: a critical review. Inf Process Manag 24(5):577–597

30. Xu W, Gong Y (2004) Document clustering by concept factorization. In: The 27th ACM SIGIR conference on research and development in information retrieval, pp 202–209

31. Yu H, Searsmith D, Li X, Han J (2004) Scalable construction of topic directory with nonparametric closed termset mining. In: The IEEE international conference on data mining series (ICDM 2004), pp 563–566

32. Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353

Author Biographies

Chun-Ling Chen is a postdoctoral research fellow at the Institute of Statistical Science, Academia Sinica, Taiwan, ROC. She received her Ph.D. degree in computer science from National Chiao Tung University, Taiwan, ROC, in 2010. Her research interests include databases, object-oriented conceptual modeling, information retrieval, text mining, and machine learning.


Frank S. C. Tseng, Ph.D., received his B.S., M.S., and Ph.D. degrees, all in computer science and information engineering, from National Chiao Tung University, Taiwan, ROC, in 1986, 1988, and 1992, respectively. From 1993 to 1995, he served his military obligation in the General Headquarters of the ROC Air Force. Dr. Tseng is one of the winners of the Acer Long Term Ph.D. dissertation prize in 1992. He joined the faculty of the Department of Information Management, Yuan-Ze University, Taiwan, ROC, in August 1995. From 1996 to 1997, he was the chairman of the Department. Currently, he is a professor and the chairperson of the Department of Information Management, National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan, ROC. His research interests include database theory and applications, information retrieval, XML technologies for Internet computing, data/document warehousing, and data/text mining. Dr. Tseng is a member of the IEEE Computer Society and the Association for Computing Machinery, Special Interest Group on Management of Data.

Tyne Liang received her Ph.D. degree from National Chiao Tung University, Taiwan, ROC, with a major in computer science. Currently, she is an associate professor in the Department of Computer Science, National Chiao Tung University, Taiwan, ROC. Her research interests include information retrieval, natural language processing, web mining, and interconnection networking.
