Chapter 3 Fuzzy Frequent Itemset-based Hierarchical Document Clustering
4.4 An Illustrative Example of F2IDC Method
4.5.2 Parameters Selection
Table 4-2 summarizes the parameters used for our proposed method and for the other algorithms in the clustering performance comparison.
Before applying F2IDC, we first consider the feature selection strategy. To select the most representative features, we use Formula (4.1) to obtain the key terms whose weights are higher than the pre-defined threshold γ. Table 4-3 shows the keyword statistics of our test datasets and the suggested threshold for each dataset.
12 The preprocessed datasets can be downloaded from http://www.cs.technion.ac.il/~ronb/thesis.html
Table 4-2: List of all parameters for our algorithms and the other three algorithms.
Parameter Name | F2IDC | FIHC | UPGMA (footnotes 13, 14) | Bi. k-means (footnote 15)
Datasets | Classic, Re0, R8, WebKB
Stopword Removal | Yes
Stemming | Yes
Length of the smallest term | Three
Weight of the term vector | tf | tf-idf | tf-idf | tf-idf
Levels of hypernyms | H1, H2, H3, H4, H5
Cluster count k | 3, 15, 30, 60
H1 represents the addition of direct hypernyms; H2 stands for the addition of hypernyms of the first and second levels, and so on.
Table 4-3: Keyword statistics of our test datasets.
Datasets # of
The two algorithms, F2IDC and FIHC, both have two main parameters for tuning accuracy. The first one is mandatory and is denoted MinSup, the minimum support for frequent itemset generation. The other one is optional and is denoted KCluster, the number of clusters. As Bisecting k-means and UPGMA require a predefined number of clusters as input, their KCluster parameters must be provided.
13 The command was vcluster -clmethod=agglo -crfun=upgma -sim=cos -rowmodel=maxtf -colmodel=idf -clabelfile=<X>.mat.clabel <X>.mat <K>.
14 <X> is the name of the dataset being tested (e.g., R8 or WebKB), and <K> is the number of clusters desired in the final solution. vcluster is the CLUTO clustering program that takes .mat files as input.
15 The command was vcluster -clmethod=rbr -crfun=i2 -sim=cos -cstype=best -rowmodel=maxtf -colmodel=idf
4.5.3 Experimental Results and Analysis
The experiments were conducted in the following steps. First, we evaluated our method, F2IDC, on the four datasets mentioned above and compared its accuracy with that of FIHC, Bisecting k-means, and UPGMA. We also verified whether the use of WordNet can generate conceptual labels for the derived clusters. Second, the RCV1 (Reuters Corpus Volume 1) dataset [31] was chosen to evaluate the efficiency and scalability of F2IDC.
4.5.3.1. Accuracy Comparison for F2IDC Algorithm
Table 4-4 presents the overall F-measure values obtained by WordNet-based F2IDC and the other WordNet-based algorithms for four different numbers of clusters, namely 3, 15, 30, and 60, on each of the four datasets. For each algorithm, we ran each dataset enriched with up to the top 5 levels of hypernyms: we tested each algorithm’s clustering results with H, the number of hypernym levels, ranging from 1 to 5 and selected the best results. We chose the minimum support from {25%, 28%, 30%, 32%, 35%} to run F2IDC with WordNet on all datasets. Moreover, we set the minimum support to values ranging from 3% to 6% to obtain the best results for FIHC.
The accuracy of Bisecting k-means and FIHC is slightly better than that of F2IDC in several cases. However, the exact number of clusters in a document set is usually unknown in real cases, and F2IDC is robust enough to produce stable, consistent, and high-quality clusters for a wide range of cluster numbers. This can be seen from the average overall F-measure values of all test cases. Notice that UPGMA is not applicable to large datasets: some experimental results could not be generated for UPGMA, and we denote them as N.A. Likewise, since FIHC is not applicable to documents of long average length, no experimental result was generated on the WebKB dataset, and we also mark these entries as N.A.
Table 4-4: Average overall F-measure comparison for four clustering algorithms on the four datasets.
Datasets (# of Natural Classes) | # of Clusters | F2IDC(H) | FIHC(H) | UPGMA(H) | Bi. k-means(H)
Classic
Average | 0.53(3)* | 0.39(1) | 0.36(3) | 0.37(3)
R8
The Improvement Ratio (IR) measures the relative improvement in the F(C) value achieved by our proposed approach, F2IDC, over each of the other compared algorithms. We define the IR as follows:
$$\mathrm{IR} = \frac{F(C)_{F^2IDC} - F(C)_{\langle Y \rangle}}{F(C)_{\langle Y \rangle}} \times 100\% \qquad (4.5)$$
where $F(C)_{F^2IDC}$ and $F(C)_{\langle Y \rangle}$ denote the F(C) values of F2IDC and of one of the other three algorithms (e.g., <Y> can be FIHC, UPGMA, or Bi. k-means), respectively. A higher IR value indicates that the clustering quality of the F2IDC method is better than the clustering quality of the other algorithm.
Based on the experimental results in Table 4-4 and Formula (4.5), our proposed approach improves the average F(C) value over the other three algorithms on the four datasets (as shown in Table 4-5). The improvement ratio ranges from 7% to 172%.
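The arithmetic behind the improvement ratio can be sketched as follows. This is a minimal illustration that assumes IR is the relative gain, (F(C) of F2IDC minus F(C) of <Y>) divided by F(C) of <Y>, expressed as a percentage. The function name `improvement_ratio` is hypothetical, and the input values are taken from the "Average" row that survives in Table 4-4.

```python
def improvement_ratio(f_ours: float, f_other: float) -> float:
    """Relative F(C) improvement of F2IDC over another algorithm, in percent.

    Assumption: IR = (F(C)_F2IDC - F(C)_Y) / F(C)_Y * 100.
    """
    if f_other <= 0:
        raise ValueError("the baseline F(C) value must be positive")
    return (f_ours - f_other) / f_other * 100.0


# Values from the "Average" row that survives in Table 4-4.
f2idc_average = 0.53
other_averages = {"FIHC": 0.39, "UPGMA": 0.36, "Bi. k-means": 0.37}

for name, f_value in other_averages.items():
    ratio = improvement_ratio(f2idc_average, f_value)
    print(f"{name}: IR = {ratio:.1f}%")
```

Under this assumption, an IR above zero means F2IDC obtained the higher F(C) value, and an IR of 100% means its F(C) value was twice that of the compared algorithm.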
Table 4-5: Improvement Ratio for other three clustering algorithms on the four datasets.
Datasets Clustering Algorithms Improvement Ratio
F2IDC(w) FIHC(w) UPGMA(w) Bi.
4.5.3.2. The Effect of Enriching the Document Representation
As described in Section 4.2.2, when enriching the document representation, we utilize WordNet to exploit hypernymy for clustering. We now demonstrate the effect of adding hypernyms into the datasets as follows.
Since FIHC obtained the best accuracy among the three compared algorithms, we tested F2IDC and FIHC with the baseline method and with the addition of hypernyms of different levels. Table 4-6 compares the clustering results obtained by F2IDC and FIHC. In Table 4-6, “Baseline” means that no hypernyms are added; “H1” corresponds to the addition of direct hypernyms; “H2” stands for the addition of hypernyms of the first and second levels, and so on. We chose the minimum support from the range 4% to 8% to run the baseline F2IDC on all datasets. The results in Table 4-6 show that the clustering accuracy of FIHC decreases as the number of hypernym levels increases. WordNet-based FIHC does not improve on the baseline method. The possible reasons are:
(1) Using hypernyms as additional features in the document enrichment process inevitably introduces a lot of noise into these datasets;
(2) Word sense disambiguation was not performed to determine the proper meaning of each polysemous term in documents [24].
Table 4-6 shows that the average overall F-measure values of WordNet-based F2IDC are superior to those of WordNet-based FIHC when adding hypernyms of the first, second, and third levels on all datasets except WebKB. On the WebKB dataset, F2IDC with direct hypernyms performs better than F2IDC with higher levels of hypernyms. Because documents in the WebKB dataset have a longer average length, higher levels of hypernyms may add more noise to the clustering process and decrease the clustering accuracy.
In contrast to WordNet-based FIHC, our approach can ameliorate the effect of adding hypernyms by filtering out noise for clustering. The use of WordNet for F2IDC induces better clustering results on Classic dataset, while the improvements of the others are not particularly spectacular. In the case of the Reuters tasks, the limited
such as documents in Reuters-21578, which are guaranteed to be written concisely and efficiently [48].
Table 4-6: The effect of enriching the document representation.
Datasets Classic Re0 R8 WebKB
To understand why WordNet improves the performance of F2IDC, a sample of the cluster labels generated by F2IDC on the Re0 dataset is given in Table 4-7. Owing to the rich semantic network representation provided by WordNet, F2IDC with WordNet generates more general and meaningful labels for clusters. For example, the label ‘commerce’ produced by F2IDC with WordNet is a more general concept than the labels ‘sell’ and ‘trade’ generated by F2IDC without WordNet.
Table 4-7: Cluster Labels generated by F2IDC algorithm on Re0 dataset.
F2IDC without WordNet F2IDC with WordNet bank, dollar, currency, growth,
industry, market, nation, rate, rise, rose, sell, trade
4.5.3.3. Sensitivity to Various Parameters
Figures 4-8(a) and (c) depict the overall F-measure values of F2IDC and WordNet-based F2IDC, respectively, when the mandatory parameter MinSup is varied and the optional parameter KCluster is left unspecified. We observed that clustering accuracy remains consistently high when MinSup is set between 2% and 9% for F2IDC and between 15% and 35% for WordNet-based F2IDC. As KCluster is not specified in any test case, the cluster merging step in Algorithm 3 has to decide the most appropriate number of output clusters, which is shown in Figures 4-8(b) and (d) for F2IDC and WordNet-based F2IDC, respectively.
Based on our tests, we observed that the best choice of MinSup generally lies between 4% and 8% for F2IDC, and between 25% and 35% for WordNet-based F2IDC. Nevertheless, it cannot be overemphasized that MinSup should not be regarded as the only parameter for finding the optimal accuracy.
4.5.3.4. Efficiency and Scalability
To analyze the scalability of our algorithm, we took 100,000 documents from the RCV1 (Reuters Corpus Volume 1) dataset [31], which contains news from Reuters Ltd. There are three category sets: Topics, Industries, and Regions. In our experiments, we consider the Topics category set, which includes 23,149 training and 781,265 testing documents. Before clustering this dataset, the documents were preprocessed by converting all terms to lower case, removing stop words, and applying the stemming algorithm.
Figure 4-8: The accuracy test of F2IDC for different MinSup values with the optimal cluster numbers determined by the clusters merging step algorithm (panels (a) to (d)).
Figure 4-9 shows the runtimes of the different stages of our algorithm for different sizes of the RCV1 dataset, ranging from 10K to 100K documents. The figure also shows that fuzzy association mining and initial clustering are the two most time-consuming stages in our algorithm. In the clustering process, most of the time is spent on constructing initial clusters, and its runtime is almost linear in the number of documents. As the efficiency of fuzzy association rule mining is very sensitive to the input parameter MinSup, the runtime of F2IDC is inversely related to MinSup. In other words, runtime increases as MinSup decreases.
Figure 4-9: Scalability of F2IDC.
4.6 Summary
The importance of document clustering emerges from the massive volumes of textual documents being created. Although numerous document clustering methods have been extensively studied in recent years, several challenges remain in improving clustering quality. In particular, most current document clustering algorithms, including FIHC, do not consider the semantic relationships among terms. In this chapter, we derived an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with external knowledge from WordNet for grouping documents. The key advantage of our proposed algorithm is that the generated clusters, labeled with conceptual terms, are easier to understand than clusters annotated with isolated terms. In addition, the extracted cluster labels may help in identifying the content of individual clusters.
Our experiments reveal that the proposed algorithm achieves better accuracy than the FIHC, Bisecting k-means, and UPGMA methods on our datasets. Our primary findings are as follows:
(1) Our approach facilitates the integration of the rich knowledge of WordNet into textual documents by effectively filtering out noise when adding hypernyms into documents and generating more conceptual labels for clusters.
(2) FIHC performs better for documents of short average length, but worse for documents of long average length.
(3) The other document clustering algorithms, like Bisecting k-means and UPGMA, are sensitive to the number of clusters.
In the next chapter, we extend F2IDC to generate overlapping clusters, which provide multiple subjective perspectives on the same document and thereby enhance its practical applicability.
Chapter 5
Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) Approach
In this chapter, we further propose an effective Fuzzy Frequent Itemset-based Soft Clustering (F2ISC) approach, which extends F2IDC to address the overlapping cluster problem. F2ISC provides an accurate measure of confidence and adopts the α-cut concept (defined in Definition 2.5) to assign each document to one or more target clusters.
Figure 5-1 shows the proposed F2ISC framework, which consists of four modules, namely the Document Analysis Module, the TermOnto Construction Module, the Candidate Cluster Extraction Module, and the Overlapping Cluster Generation Module, as explained in Sections 5.1, 5.2, 5.3, and 5.4, respectively.
In this framework, when receiving a set of textual documents, our first module will extract and select the key term set, and then the second module organizes it into a term forest (defined in Definition 4.2) by referring to WordNet for generating the Document Set D. The third module implements our fuzzy association rule mining procedure to generate the candidate cluster set. Finally, the last module constructs and evaluates the Document-Cluster Matrix (DCM) to produce the target clusters. The whole process will be illustrated by a comprehensive example.
Figure 5-1: The F2ISC framework.
5.1 Document Analysis Module
There are two stages in this module, namely Key Term Extraction and Key Term Selection, for reducing the dimensionality of the source document set:
1. Key Term Extraction: The whole extraction process is as follows:
(1) First of all, each document is broken into sentences. Then, terms in each sentence are extracted as features. In this thesis, a term is regarded as the stem of a single word.
(2) Terms that appear in a pre-defined stop-word list16 are removed.
(3) The remaining terms are converted to their base forms by stemming. Terms with the same stem are combined for frequency counting. Finally, the frequency of each term in each document is recorded.
2. Key Term Selection: Terms with low weights are regarded as noise that is useless for identifying the appropriate cluster. Thus, we apply the tf-idf (term frequency × inverse document frequency) method defined in Formula (4.1) to choose the key terms for the document set. A term is discarded if its weight is less than a fixed tf-idf threshold γ. The retained terms then form the set of key terms for the document set D, as defined in Definitions 3.1 - 3.4.
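The two stages above can be sketched as follows. This is only an illustration under stated assumptions: Formula (4.1) is not reproduced in this section, so the sketch uses a common tf-idf variant (relative term frequency times the natural log of n/df) and keeps a term if its weight reaches γ in at least one document. The stop-word list, the toy stemmer `naive_stem`, and all function names are hypothetical stand-ins, not the components used in the thesis.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the thesis uses a pre-defined list (footnote 16).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text, min_len=3):
    """Key Term Extraction: tokenize, drop stop words and short terms, stem, count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = (t for t in tokens if t not in STOP_WORDS and len(t) >= min_len)
    return Counter(naive_stem(t) for t in kept)


def select_key_terms(docs, gamma):
    """Key Term Selection: keep terms whose tf-idf weight reaches gamma.

    Assumption: a term is kept if its weight is >= gamma in at least one document.
    """
    term_freqs = [extract_terms(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for tf in term_freqs for term in tf)
    key_terms = set()
    for tf in term_freqs:
        total = sum(tf.values())
        for term, count in tf.items():
            weight = (count / total) * math.log(n / doc_freq[term])
            if weight >= gamma:
                key_terms.add(term)
    return key_terms


docs = ["stocks stocks trade", "stocks health"]
print(sorted(select_key_terms(docs, gamma=0.2)))
```

In this toy input, the stem "stock" occurs in every document, so its inverse document frequency is zero and it is discarded regardless of γ.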
5.2 TermOnto Construction Module
The objective of this module is to use WordNet to generate a richer representation of the given document set. As the relationships between relevant terms are predefined in the WordNet ontology, this module uses the hypernyms provided by WordNet as useful features for document clustering. Thus, we use Algorithm 4.1, as shown in Figure 4-2, to generate the extended representation of each document for the later mining process.
5.3 Candidate Cluster Extraction Module
After the above processes, documents are converted into structured term vectors. Then, the fuzzy data mining algorithm is executed to generate fuzzy frequent itemsets and output a candidate cluster set. In this module, we use the membership functions described in Figure 4-3 and the fuzzy association rule mining algorithm for texts shown in Figure 3-4 to generate the candidate cluster set.
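As a rough illustration of how a fuzzy support value can be computed, consider the sketch below. The membership functions of Figure 4-3 and the mining algorithm of Figure 3-4 are not reproduced in this section, so everything here is an assumption: the three regions Low, Mid, and High use hypothetical breakpoints (1, 3, 5), and the support of a fuzzy itemset is taken to be the average over documents of the minimum membership degree among its items, which is a common convention in fuzzy association mining.

```python
def fuzzy_memberships(freq, low=1.0, mid=3.0, high=5.0):
    """Map a term frequency to degrees in the Low, Mid, and High regions.

    The breakpoints are hypothetical; an absent term (freq <= 0) has degree 0 everywhere.
    """
    if freq <= 0:
        return {"Low": 0.0, "Mid": 0.0, "High": 0.0}

    def ramp_up(x, a, b):
        # 0 at or below a, 1 at or above b, linear in between.
        return min(1.0, max(0.0, (x - a) / (b - a)))

    return {
        "Low": 1.0 - ramp_up(freq, low, mid),
        "Mid": min(ramp_up(freq, low, mid), 1.0 - ramp_up(freq, mid, high)),
        "High": ramp_up(freq, mid, high),
    }


def fuzzy_support(docs, itemset):
    """Average, over all documents, of the minimum membership among the itemset's items.

    docs: list of {term: frequency}; itemset: list of (term, region) pairs.
    """
    total = 0.0
    for doc in docs:
        total += min(
            fuzzy_memberships(doc.get(term, 0))[region] for term, region in itemset
        )
    return total / len(docs)


docs = [
    {"sale": 3, "marketing": 3},
    {"sale": 4, "marketing": 3},
    {"sale": 1},
]
itemset = [("sale", "Mid"), ("marketing", "Mid")]
print(fuzzy_support(docs, itemset))
```

An itemset would then be treated as frequent when this value reaches the minimum support, and each frequent itemset would seed one candidate cluster.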
5.4 Overlapping Cluster Generation Module
The objective of this module is to assign each document to one or more target clusters $\{c_1^q, \ldots, c_i^q\}$, where $i \ge 1$ and $q \ge 1$. The assignment process is based on the derived Document-Cluster Matrix (DCM) defined in Definition 3.10. We then apply fuzzy set intersection to compute the membership degree of each document in the intersection of one candidate cluster with each of the other candidate clusters. Hence, we define one matrix, namely the Multiple Clusters Matrix (MCM), in Definition 5.1.
Definition 5.1: A Multiple Clusters Matrix (MCM), denoted $M = [m_{ig}]$, is an $n \times C_2^k$ matrix such that $m_{ig} = \min\{m_{il}, m_{ij}\}$ is the membership degree of document $d_i$ in the intersection of two candidate clusters $c_l^q \cap c_j^q$, where $l, j \in \{1, 2, \ldots, k\}$, $l \ne j$, and $q = 1$. A formal illustration of the MCM can be found in Figure 5-2.
Figure 5-2: A formal illustration of Multiple Clusters Matrix.
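The construction in Definition 5.1 can be sketched in a few lines. The sketch assumes the DCM is given as a list of n rows with k membership degrees each; the function name `multiple_clusters_matrix` and the sample values are hypothetical.

```python
from itertools import combinations


def multiple_clusters_matrix(dcm):
    """Build the MCM from a DCM given as n rows of k membership degrees.

    Returns the list of cluster index pairs (l, j) with l < j, and the
    n x C(k, 2) matrix whose entries are min(row[l], row[j]).
    """
    k = len(dcm[0])
    pairs = list(combinations(range(k), 2))
    mcm = [[min(row[l], row[j]) for (l, j) in pairs] for row in dcm]
    return pairs, mcm


# Hypothetical DCM with n = 2 documents and k = 3 candidate clusters.
dcm = [
    [0.8, 0.6, 0.0],
    [0.2, 0.9, 0.5],
]
pairs, mcm = multiple_clusters_matrix(dcm)
for pair, column in zip(pairs, zip(*mcm)):
    print(pair, column)
```

With k = 3 candidate clusters the MCM has three columns, one per unordered cluster pair, and an entry is nonzero only when the document has a nonzero membership degree in both clusters of the pair.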
Moreover, we apply the α-cut threshold [64][68] determined by Formula (5.1) to evaluate the minimum value which satisfies the restrictive condition, and it can appropriately provide flexibility to overlapping clusters.
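[Formula (5.1): definition of the α-cut threshold; the expression is not recoverable from the source text.]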
Then, based on the obtained DCM, an unassigned document di might belong to more than one target cluster by using Formula (5.2).
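[Formula (5.2): $c_l^q = \{ d_i \mid v_{il} > \ldots \}$, a condition on $v_{il}$ involving $\rho$, $\alpha$, and $\max\{v_{i1}, v_{i2}, \ldots, v_{ik}\}$; the full expression is not recoverable from the source text.] (5.2)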
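The general idea of an α-cut based soft assignment can be sketched as follows. This is not Formula (5.2) itself: it uses the standard α-cut of a fuzzy set (all elements whose membership degree is at least α) with a fixed, user-supplied α, whereas the thesis determines the threshold by Formula (5.1). The function name and the sample DCM are hypothetical.

```python
def alpha_cut_assign(dcm, alpha):
    """Assign each document to every cluster whose membership degree is >= alpha.

    dcm: list of n rows with k membership degrees each.
    Returns {cluster index: [document indices]}; a document may appear in
    several clusters, or in none if all of its degrees fall below alpha.
    """
    clusters = {}
    for doc_index, row in enumerate(dcm):
        for cluster_index, degree in enumerate(row):
            if degree >= alpha:
                clusters.setdefault(cluster_index, []).append(doc_index)
    return clusters


# Hypothetical DCM with 2 documents and 3 candidate clusters.
dcm = [
    [0.8, 0.6, 0.0],
    [0.2, 0.9, 0.5],
]
print(alpha_cut_assign(dcm, alpha=0.5))
```

Lowering α admits more overlap between clusters, while raising it moves the result toward a hard assignment.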
Finally, to avoid low clustering accuracy, the inter-cluster similarity (defined by Formula (3.9) in Chapter 3) between two target clusters is calculated to merge small target clusters with similar topics.
Algorithm 5.1 shown in Figure 5-3 is used to assign each document to the fitting target clusters, and finally builds a target cluster set for output.
5.5 An Illustrative Example of F2ISC Method
Suppose we have a document set D = {d1, d2,…, d5} and its key term set KD = {sale, trade, medical, health}. Figure 5-4 illustrates the process of Algorithm 4.1 to obtain the representation of all documents. Rectangle nodes represent actual key terms appearing in the document set; spheroid nodes represent newly added hypernyms. In this example, the key term ‘sale’ has the parent nodes ‘marketing’ and ‘commerce’. Similarly, ‘trade’ and ‘marketing’ have the same parent node ‘commerce’.
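The hypernym enrichment in this example can be sketched with a small hand-written hierarchy in place of WordNet. The sketch is an assumption, not Algorithm 4.1 itself: the dictionary `HYPERNYMS` encodes only the parent links named in the example, the `levels` argument mirrors the H1, H2, ... settings, and each added hypernym is assumed to accumulate the frequencies of the key terms below it.

```python
# Parent links taken from the example: sale -> marketing -> commerce, trade -> commerce.
HYPERNYMS = {"sale": "marketing", "marketing": "commerce", "trade": "commerce"}


def enrich(term_freqs, levels):
    """Add hypernyms up to `levels` steps above each key term.

    Assumption: a hypernym's frequency is the sum of the frequencies of the
    key terms that reach it within `levels` steps.
    """
    enriched = dict(term_freqs)
    for term, freq in term_freqs.items():
        current = term
        for _ in range(levels):
            parent = HYPERNYMS.get(current)
            if parent is None:
                break  # no further hypernym in this toy hierarchy
            enriched[parent] = enriched.get(parent, 0) + freq
            current = parent
    return enriched


doc = {"sale": 2, "trade": 1}
print(enrich(doc, levels=1))  # direct hypernyms only (H1)
print(enrich(doc, levels=2))  # first and second levels (H2)
```

With two levels, both ‘sale’ and ‘trade’ contribute to ‘commerce’, which is how a more general concept can come to represent documents that share no literal key term.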
Figure 5-3: The detailed description of Algorithm 5.1.
Figure 5-4: The process of Algorithm 4.1 of this example.
Consider the representation of all documents generated from Figure 5-4, the membership functions defined in Figure 4-3, the minimum support value 80%, and the minimum confidence value 80% as inputs. The fuzzy frequent itemsets discovery procedure is depicted in Figure 5-5.
Candidate cluster set $C_D$: $c^1_{(sale)}$, $c^1_{(trade)}$, $c^1_{(health)}$, $c^1_{(marketing)}$, $c^1_{(commerce)}$, $c^2_{(sale,\ marketing)}$
Figure 5-5: The process of Algorithm 4.2 of this example.
Moreover, consider the candidate cluster set CD was already generated in Figure 5-5. Now, suppose the minimum Inter-Sim value is 0.5. Figure 5-6 illustrates the process of Algorithm 5.1, together with the final results.
Figure 5-6: The process of Algorithm 5.1 of this example.
5.6 Experiments
In this section, we experimentally evaluate the performance of the proposed algorithm by comparing it with that of the FIHC, k-means, Bisecting k-means, and UPGMA algorithms. To test the proposed approach, we used four different datasets: Classic, Re0, R8, and WebKB, which are described in Subsection 4.3.1 and whose statistics are summarized in Table 4-1.
Notice that the overall F-measure favors hard assignments generated by clustering algorithms. To demonstrate the performance of our approach, we present experiments in which we generated a hard assignment (this has been called hardening the clusters) [2] and then evaluated the output of our algorithm. The hardening scheme simply assigns each document to the cluster in which it has the maximum membership degree among all the document clusters. This allows our approach to be compared with the other hard clustering methods. We therefore use the overall F-measure to evaluate the clustering quality of F2ISC and the other compared algorithms.
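The hardening step and the subsequent evaluation can be sketched as follows. The hardening function follows the description above directly. The overall F-measure is not defined in this section, so the sketch assumes the commonly used form: for each natural class, take the best F-measure over all clusters, then average these values weighted by class size. Function names and sample data are hypothetical.

```python
def harden(dcm):
    """Assign each document to the cluster with its maximum membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in dcm]


def overall_f_measure(true_labels, cluster_labels):
    """Class-size-weighted average of the best per-class F-measure (assumed definition)."""
    n = len(true_labels)
    total = 0.0
    for cls in set(true_labels):
        class_docs = {i for i, t in enumerate(true_labels) if t == cls}
        best = 0.0
        for cluster in set(cluster_labels):
            cluster_docs = {i for i, c in enumerate(cluster_labels) if c == cluster}
            overlap = len(class_docs & cluster_docs)
            if overlap == 0:
                continue
            precision = overlap / len(cluster_docs)
            recall = overlap / len(class_docs)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(class_docs) / n * best
    return total


# Hypothetical DCM for 4 documents and 2 clusters, with known class labels.
dcm = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
true_labels = ["a", "a", "b", "b"]
hard = harden(dcm)
print(hard, round(overall_f_measure(true_labels, hard), 4))
```

Ties in the maximum membership degree are resolved here in favor of the lowest cluster index, which is an arbitrary choice of this sketch.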
5.6.1 Parameters Selection
Table 5-1 summarizes the parameters used for our proposed method and for the other algorithms in the clustering performance comparison. Since k-means, Bisecting k-means, and UPGMA may generate different clustering results on each run because of randomly chosen initial values, the final result of each of these three algorithms is the average of five runs on a given dataset.
Table 5-1: List of all parameters for our algorithms and the other four algorithms.
Parameter Name | F2ISC | FIHC | k-means | Bi. k-means | UPGMA
Datasets | Classic, Re0, R8, WebKB
Stopword Removal | Yes
Stemming | Yes
Length of the smallest term | Three
Weight of the term vector | tf | tf-idf | tf-idf | tf-idf | tf-idf
Levels of hypernyms | h1, h2, h3, h4, h5
Cluster count k | 5, 10, 15, 30, 45, 60, 80, 100
Before applying F2ISC, we first consider the feature selection strategy. In order to select the most representative features, we use Formula (4.1) to obtain the key terms with weights higher than the pre-defined thresholds γ. Table 4-3 shows the keyword statistics of our test datasets and the suggested thresholds for each dataset.
The two algorithms, F2ISC and FIHC, all have two main parameters for the