An Illustrative Example of F²ISC Method

Chapter 5 Fuzzy Frequent Itemset-based Soft Clustering (F²ISC) Approach


Then, based on the obtained DCM, an unassigned document di might belong to more than one target cluster by using Formula (5.2).

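[Formula (5.2): the set-builder definition of the target cluster c_l^q as the documents d_i whose DCM entry v_il exceeds an α-based threshold relative to max{v_i1, v_i2, ..., v_ik}; the exact expression is not recoverable from the source.] (5.2)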

Finally, to avoid low clustering accuracy, the inter-cluster similarity (defined by Formula (3.9) in Chapter 3) between each pair of target clusters is calculated, and small target clusters with similar topics are merged.

Algorithm 5.1, shown in Figure 5-3, assigns each document to its fitting target clusters and finally builds a target cluster set for output.
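The soft assignment step can be illustrated with a minimal sketch. It does not reproduce Formula (5.2); it assumes a simple relative cut in which a document joins every cluster whose membership is at least `alpha` times its largest membership. The function name `soft_assign`, the parameter `alpha`, and the DCM values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def soft_assign(dcm: Dict[str, Dict[str, float]], alpha: float) -> Dict[str, List[str]]:
    """Assign each document to every cluster close to its best membership.

    Assumption (not Formula (5.2)): document d_i joins cluster l when
    v_il >= alpha * max_l v_il, i.e. a relative cut on its DCM row.
    """
    clusters: Dict[str, List[str]] = {}
    for doc, row in dcm.items():
        best = max(row.values())
        if best <= 0.0:
            continue  # the document has no membership in any cluster
        for cluster, membership in row.items():
            if membership >= alpha * best:
                clusters.setdefault(cluster, []).append(doc)
    return clusters


# Made-up document-cluster matrix (rows: documents, columns: clusters).
dcm = {
    "d1": {"sale": 0.9, "trade": 0.8, "health": 0.0},
    "d2": {"sale": 0.2, "trade": 0.9, "health": 0.1},
    "d3": {"sale": 0.0, "trade": 0.1, "health": 0.7},
}
targets = soft_assign(dcm, alpha=0.8)
print(targets)
```

With these values, d1 is placed in both the sale and trade clusters, which is the overlapping behaviour the soft clustering approach aims for.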

5.5 An Illustrative Example of F²ISC Method

Suppose we have a document set D = {d1, d2,…, d5} and its key term set KD = {sale, trade, medical, health}. Figure 5-4 illustrates how Algorithm 4.1 obtains the representation of all documents. Rectangle nodes represent actual key terms appearing in the document set; spheroid nodes represent newly added hypernyms. In this example, the key term ‘sale’ has the parent nodes ‘marketing’ and ‘commerce’. Similarly, ‘trade’ and ‘marketing’ have the same parent node ‘commerce’.

Figure 5-3: The detailed description of Algorithm 5.1.

Figure 5-4: The process of Algorithm 4.1 of this example.
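The hypernym enrichment in this example can be sketched as follows. The sketch is not Algorithm 4.1; it assumes each term has at most one direct parent and reads the example as the chain sale → marketing → commerce plus trade → commerce. The names `PARENT`, `hypernyms`, and `enrich` are hypothetical.

```python
from typing import Dict, List, Set

# Parent links read from the example (assumed single-parent chain):
# sale -> marketing -> commerce, trade -> commerce.
PARENT: Dict[str, str] = {
    "sale": "marketing",
    "marketing": "commerce",
    "trade": "commerce",
}


def hypernyms(term: str, levels: int) -> List[str]:
    """Return up to `levels` ancestors of `term`, nearest first."""
    chain: List[str] = []
    current = term
    while len(chain) < levels and current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain


def enrich(terms: Set[str], levels: int) -> Set[str]:
    """Add the hypernyms of every term, up to `levels` levels, to the term set."""
    enriched = set(terms)
    for term in terms:
        enriched.update(hypernyms(term, levels))
    return enriched


print(hypernyms("sale", 2))
print(sorted(enrich({"sale", "health"}, 1)))
```

Terms without an entry in `PARENT`, such as health here, are simply left unchanged.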

Consider the representation of all documents generated from Figure 5-4, the membership functions defined in Figure 4-3, the minimum support value 80%, and the minimum confidence value 80% as inputs. The fuzzy frequent itemsets discovery procedure is depicted in Figure 5-5.

C_D = {c¹_(sale), c¹_(trade), c¹_(health), c¹_(marketing), c¹_(commerce), c²_(sale, marketing)}

Figure 5-5: The process of Algorithm 4.2 of this example.
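The role of the minimum support threshold can be illustrated with a short sketch. It is not Algorithm 4.2; it assumes the common fuzzy support definition in which the support of an itemset is the average, over all documents, of the minimum membership degree of its terms. The membership values are made up, and each term is reduced to a single fuzzy region for brevity.

```python
from typing import Dict, Iterable, List


def fuzzy_support(docs: List[Dict[str, float]], itemset: Iterable[str]) -> float:
    """Average over documents of the minimum membership among the itemset's terms."""
    items = list(itemset)
    total = sum(min(doc.get(term, 0.0) for term in items) for doc in docs)
    return total / len(docs)


# Made-up membership degrees for five documents (not the values of Figure 5-5).
docs = [
    {"sale": 1.0, "marketing": 1.0},
    {"sale": 0.75, "marketing": 1.0},
    {"sale": 1.0, "marketing": 0.75},
    {"sale": 0.5, "marketing": 0.75, "medical": 0.5},
    {"sale": 1.0, "marketing": 1.0, "medical": 1.0},
]
MIN_SUP = 0.8  # minimum support of 80%, as in the example

candidates = [("sale",), ("marketing",), ("medical",), ("sale", "marketing")]
frequent = [c for c in candidates if fuzzy_support(docs, c) >= MIN_SUP]
print(frequent)
```

In this toy input, the itemset (sale, marketing) reaches the 80% threshold while (medical,) does not, so only the former would be kept as a candidate cluster.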

Moreover, suppose the candidate cluster set C_D has already been generated as shown in Figure 5-5, and the minimum Inter-Sim value is 0.5. Figure 5-6 illustrates the process of Algorithm 5.1, together with the final results.

Figure 5-6: The process of Algorithm 5.1 of this example.
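The merging step driven by the minimum Inter-Sim value can be sketched as below. Formula (3.9) is not given in this chapter, so the sketch substitutes the Jaccard overlap of the clusters' document sets as a stand-in similarity; the function names and the cluster contents are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in similarity (an assumption, not Formula (3.9))."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(clusters: Dict[str, Set[str]], min_inter_sim: float) -> Dict[str, Set[str]]:
    """Repeatedly merge the most similar pair until no pair reaches the threshold."""
    merged = {name: set(docs) for name, docs in clusters.items()}
    while True:
        best_pair: Optional[Tuple[str, str]] = None
        best_sim = min_inter_sim
        names = sorted(merged)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                sim = jaccard(merged[first], merged[second])
                if sim >= best_sim:
                    best_pair, best_sim = (first, second), sim
        if best_pair is None:
            return merged
        first, second = best_pair
        merged[f"{first}+{second}"] = merged.pop(first) | merged.pop(second)


# Made-up target clusters for illustration.
clusters = {
    "sale": {"d1", "d2", "d3"},
    "marketing": {"d1", "d2", "d3", "d4"},
    "health": {"d5"},
}
merged_clusters = merge_similar(clusters, min_inter_sim=0.5)
print(merged_clusters)
```

Here the sale and marketing clusters overlap in three of four documents (similarity 0.75), so they are merged, while health stays separate.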

5.6 Experiments

In this section, we experimentally evaluate the performance of the proposed algorithm by comparing it with that of the FIHC, k-means, Bisecting k-means, and UPGMA algorithms. To test the proposed approach, we used four different datasets: Classic, Re0, R8, and WebKB, which are described in Subsection 4.3.1 and whose statistics are summarized in Table 4-1.

Notice that the overall F-Measure favors the hard assignments generated by clustering algorithms. To demonstrate the performance of our approach, we therefore generated a hard assignment from the output of our algorithm (this has been called hardening the clusters) [2] and then evaluated it. The hardening scheme simply assigns each document to the cluster in which it has the maximum membership degree among all the document clusters. This allows our approach to be compared with the other hard clustering methods, so we use the overall F-Measure to evaluate the clustering quality of F²ISC and the other compared algorithms.
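The hardening scheme and the subsequent evaluation can be sketched as follows. The sketch assumes the usual definition of the overall F-measure (for each true class, the best F value over all clusters, weighted by class size); the membership degrees, class labels, and function names are made up for illustration.

```python
from collections import Counter
from typing import Dict


def harden(memberships: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Assign each document to the cluster with its maximum membership degree."""
    return {doc: max(row, key=row.get) for doc, row in memberships.items()}


def overall_f_measure(labels: Dict[str, str], assignment: Dict[str, str]) -> float:
    """Class-size-weighted best F value per class (assumed standard definition)."""
    n = len(labels)
    class_sizes = Counter(labels.values())
    cluster_sizes = Counter(assignment[doc] for doc in labels)
    overlap = Counter((labels[doc], assignment[doc]) for doc in labels)
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for cluster, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, cluster)]
            if n_ij:
                precision, recall = n_ij / n_j, n_ij / n_i
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total


# Made-up soft clustering output and ground-truth classes.
memberships = {
    "d1": {"A": 0.9, "B": 0.4},
    "d2": {"A": 0.7, "B": 0.6},
    "d3": {"A": 0.2, "B": 0.8},
    "d4": {"A": 0.5, "B": 0.3},
}
labels = {"d1": "x", "d2": "x", "d3": "y", "d4": "y"}

hard = harden(memberships)
score = overall_f_measure(labels, hard)
print(hard, round(score, 4))
```

When a document has equal membership in several clusters, `max` keeps the first one encountered; a real implementation would need an explicit tie-breaking rule.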

5.6.1 Parameters Selection

Table 5-1 summarizes the parameters of our proposed method and of the other algorithms used to compare the clustering performance. Since k-means, Bisecting k-means, and UPGMA may generate different clustering results in each run because of randomly chosen initial values, the final result of these three algorithms is the average of five runs performed on a given dataset.

Table 5-1: List of all parameters for our algorithms and the other four algorithms.

Parameter Name | F²ISC | FIHC | k-means | Bi. k-means | UPGMA
Datasets | Classic, Re0, R8, WebKB
Stopword Removal | Yes
Stemming | Yes
Length of the smallest term | Three
Weight of the term vector | TF | tf-idf | tf-idf | tf-idf | tf-idf
Levels of hypernyms | h1, h2, h3, h4, h5
Cluster count k | 5, 10, 15, 30, 45, 60, 80, 100

Before applying F²ISC, we first consider the feature selection strategy. In order to select the most representative features, we use Formula (4.1) to obtain the key terms with weights higher than the pre-defined thresholds γ. Table 4-3 shows the keyword statistics of our test datasets and the suggested thresholds for each dataset.

Both F²ISC and FIHC have two main parameters for tuning accuracy. The first one, MinSup, is mandatory and denotes the minimum support for frequent itemset generation. The other one, KCluster, is optional and represents the number of clusters.

5.6.2 Experimental Results and Analysis

The experiments were conducted by the following steps. First, we evaluated our approach, F²ISC, on the four selected datasets described in Section 4.1 and compared its accuracy with that of FIHC, the standard k-means, Bisecting k-means, and UPGMA. Second, we verified if the use of WordNet can improve the clustering accuracy on these compared algorithms and generated conceptual labels for the derived clusters. Third, the dataset Reuters was chosen to evaluate the efficiency and scalability of F²ISC.

5.6.2.1. Comparison of F²ISC with Other Algorithms

Figure 5-7 presents the overall F-Measure values obtained by F²ISC and the other algorithms for eight different numbers of clusters on the four datasets. For each algorithm, we ran each dataset enriched with the top five levels of hypernyms: we tested each algorithm’s clustering results with h, the number of hypernym levels, ranging from 1 to 5 and selected the best results. We chose the MinSup threshold from {25%, 28%, 30%, 32%, 35%} to run F²ISC with WordNet on all datasets, and we used a minimum support ranging from 3% to 6% for FIHC on all datasets. Notice that UPGMA does not scale to large datasets, so some of its experimental results could not be generated. Likewise, since FIHC cannot handle documents of long average length, no experimental result was generated for it on the WebKB dataset.
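The selection of the best setting over the MinSup values and hypernym levels can be sketched as a small grid search. The scoring function `toy_score` is a stand-in that returns made-up values; in an actual experiment it would run the clustering and return the overall F-measure. All names here are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def select_best(
    evaluate: Callable[[float, int], float],
    min_sups: Iterable[float],
    levels: Iterable[int],
) -> Tuple[float, float, int]:
    """Return (score, MinSup, h) for the best-scoring combination."""
    return max(
        ((evaluate(min_sup, h), min_sup, h) for min_sup, h in product(min_sups, levels)),
        key=lambda result: result[0],
    )


def toy_score(min_sup: float, h: int) -> float:
    """Stand-in for a clustering run; peaks at MinSup = 30% and h = 3 by construction."""
    return 1.0 - abs(min_sup - 0.30) - 0.01 * abs(h - 3)


MIN_SUPS = [0.25, 0.28, 0.30, 0.32, 0.35]  # the MinSup grid used for F²ISC
LEVELS = range(1, 6)                       # hypernym levels h1..h5

best = select_best(toy_score, MIN_SUPS, LEVELS)
print(best)
```

Because the best result is chosen after the fact, the reported values should be read as the best achievable setting per algorithm rather than as the outcome of a fixed configuration.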

Table 5-2 shows that the average overall F-measure values of F²ISC with WordNet are superior to those of the other algorithms on all datasets. Although the average accuracy of Bisecting k-means and FIHC shown in Figure 5-7 is slightly better than that of F²ISC in several cases, we argue that the exact number of clusters in a document set is usually unknown in real cases, and F²ISC is robust enough to produce stable, consistent, and high-quality clusters for a wide range of cluster counts. This can be seen from the average overall F-measure values of all test cases. From Figure 5-7, we also observe that the clustering accuracy of k-means, Bisecting k-means, and UPGMA is sensitive to the number of clusters. These algorithms require users to specify the number of clusters as an input parameter, which may lead to poor clustering accuracy when an incorrect value is supplied [17].

Table 5-2: Average overall F-measure comparison for five clustering algorithms on the four datasets.

Datasets | F²ISC(h) | FIHC(h) | k-means(h) | Bi. k-means(h) | UPGMA(h)
Classic | 0.65(3) * | 0.49(1) | 0.47(2) | 0.45(5) | N.A.
Re0 | 0.53(3) * | 0.36(1) | 0.35(2) | 0.34(5) | 0.36(1)
R8 | 0.44(3) * | 0.42(1) | 0.34(3) | 0.33(3) | N.A.
WebKB | 0.48(1) * | N.A. | 0.16(4) | 0.15(1) | 0.38(1)

N.A. means not scalable to run.
* means the best competitor.
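As a quick check of the margins implied by Table 5-2, the following sketch computes, for each dataset, the difference between the F²ISC value and the best value among the other algorithms. The numbers are copied from the table; N.A. entries are represented as `None` and skipped. The identifiers are hypothetical.

```python
from typing import Dict, Optional

# Average overall F-measure values from Table 5-2 (None stands for N.A.).
TABLE_5_2: Dict[str, Dict[str, Optional[float]]] = {
    "Classic": {"F2ISC": 0.65, "FIHC": 0.49, "k-means": 0.47, "Bi. k-means": 0.45, "UPGMA": None},
    "Re0": {"F2ISC": 0.53, "FIHC": 0.36, "k-means": 0.35, "Bi. k-means": 0.34, "UPGMA": 0.36},
    "R8": {"F2ISC": 0.44, "FIHC": 0.42, "k-means": 0.34, "Bi. k-means": 0.33, "UPGMA": None},
    "WebKB": {"F2ISC": 0.48, "FIHC": None, "k-means": 0.16, "Bi. k-means": 0.15, "UPGMA": 0.38},
}


def margin(row: Dict[str, Optional[float]]) -> float:
    """F2ISC value minus the best available value among the other algorithms."""
    others = [v for name, v in row.items() if name != "F2ISC" and v is not None]
    return round(row["F2ISC"] - max(others), 2)


margins = {dataset: margin(row) for dataset, row in TABLE_5_2.items()}
print(margins)
```

The margin is largest on Re0 and Classic and smallest on R8, where FIHC comes within 0.02 of F²ISC.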

5.6.2.2. The Effect of the Enriched Document Representation

As described in the second module of our approach, when enriching the document representation, we use the hypernyms from WordNet as useful features for clustering. We demonstrate the effect of adding hypernyms in our approach. In the following, all algorithms are tested by the baseline method and the addition of hypernyms of various levels.

Figure 5-7: Overall F-measure comparison for five clustering algorithms on the four datasets.

Table 5-3 shows the average overall F-measure results obtained by all algorithms on the Classic and Re0 datasets. The results for the R8 and WebKB datasets are shown in Table 5-4. In Table 5-3 and Table 5-4, “Baseline” means that no hypernyms are added; “h1” corresponds to the addition of direct hypernyms; “h2” stands for the addition of hypernyms of the first and second levels, and so on. We chose minimum support values ranging from 4% to 8% to run the baseline of F²ISC on all datasets.

The evaluation results in Table 5-3 and Table 5-4 confirm that the average overall F-measure values of WordNet-based F²ISC are superior to those of the other algorithms when adding hypernyms of the first, second, and third levels on almost all datasets, the exception being WebKB. On the WebKB dataset, F²ISC with the addition of direct hypernyms performs better than F²ISC with higher levels of hypernyms. Because documents in the WebKB dataset have a longer average length, we think that higher levels of hypernyms may add more noise to the clustering process and decrease the clustering accuracy.

Table 5-3 and Table 5-4 show that, with WordNet, F²ISC achieves clustering results at least 5% higher than the other algorithms on the Classic and WebKB datasets, with the largest improvement on Classic. However, adding hypernyms is not always beneficial for the clustering task, because using hypernyms as additional features in the document enrichment process inevitably introduces considerable noise into these datasets. In contrast to the other WordNet-based algorithms, our approach mitigates the effect of adding hypernyms by filtering out noise during clustering on the Classic and WebKB datasets.

Table 5-3: The effect of enriching the document representation on Classic and Re0 datasets.

Datasets: Classic, Re0
Columns (per dataset): F²ISC | FIHC | k-means | Bi. k-means | UPGMA

N.A. means not scalable to run.
Boldface entries highlight the best competitor in each column from h1 to h5 (the row headings).

Table 5-4: The effect of enriching the document representation on R8 and WebKB datasets.

Dataset | Row | F²ISC | FIHC | k-means | Bi. k-means | UPGMA
R8 | Baseline | 0.53 | 0.52 | 0.35 | 0.34 | N.A.
WebKB | Baseline | 0.43 | N.A. | 0.15 | 0.15 | 0.35

N.A. means not scalable to run.
Boldface entries highlight the best competitor in each column from h1 to h5 (the row headings).

However, compared with the baseline method, the use of WordNet decreases the clustering accuracy on the Re0 and R8 datasets for our approach and the other compared algorithms. The reasons for these results could be:

(1) WordNet-based enrichment is not likely to work well for text, such as the documents in Reuters-21578, that is guaranteed to be written concisely and efficiently [48].

(2) Word sense disambiguation was not performed to determine the proper meaning of each polysemous term in documents [24].

5.6.2.3. Efficiency and Scalability

Our algorithm, F²ISC, involves three major phases: finding fuzzy frequent itemsets, initial clustering, and cluster merging. Figure 5-8 shows the scalability of F²ISC on Reuters datasets of different sizes, ranging from 1K to 8K documents.

Figure 5-8: The detailed time cost analysis of F²ISC on Reuters dataset.

5.7 Summary

In this chapter, we derived a fuzzy-based document clustering approach that combines fuzzy association rule mining with WordNet to take semantic information into account. The overall process begins with document pre-processing and then enriches the initial representation of all documents with hypernyms from WordNet in order to exploit the semantic relations between terms. Next, a fuzzy association rule mining algorithm automatically generates fuzzy frequent itemsets and regards them as candidate clusters. Finally, each document is dispatched to one or more clusters by referring to these candidate clusters, and highly similar clusters are then merged.

Moreover, document clustering methods should provide multiple subjective perspectives onto the same document to enhance their practical applicability. To address this issue, we adopt the α-cut concept in the process of document clustering to assign each document to one or more target clusters. The generated overlapping clusters occur naturally in many applications, such as the Yahoo! directory.

[Figure annotation: KCluster = 60; MinSup = 15%]

Our experiments reveal that the proposed algorithm achieves better cluster quality than the FIHC, k-means, Bisecting k-means, and UPGMA methods on the four datasets Classic, Re0, R8, and WebKB.

Chapter 6 Conclusions and Future Work

6.1 Conclusions

The importance of document clustering emerges from the massive volumes of textual documents created. Although numerous document clustering methods have been extensively studied in recent years, several challenges remain for increasing the clustering quality. In particular, most current document clustering algorithms neither consider the semantic relationships among terms nor organize documents into overlapping clusters. In this thesis, we derived three fuzzy frequent itemset-based document clustering methods, namely F²IHC, F²IDC, and F²ISC, to address these challenges.

The key advantage conferred by our proposed algorithms, F²IDC and F²ISC, is that the generated clusters, labeled with conceptual terms, are easier to understand than clusters annotated by isolated terms. In addition, the extracted cluster labels may help identify the content of individual clusters. Another advantage of the F²ISC method is that it generates overlapping clusters, which occur naturally in many applications such as the Yahoo! directory.

Our experiments reveal that the proposed algorithm achieves better accuracy than the FIHC, k-means, Bisecting k-means, and UPGMA methods on these datasets. Our primary findings are as follows:

(1) The use of fuzzy association rule mining discovers important candidate clusters for document clustering, which increases the accuracy of document clustering.

(2) The F²IDC and F²ISC approaches are successful in avoiding the expansion of terms with noisy features on the Classic and WebKB datasets.

(3) FIHC performs better for documents of short average length, but worse for documents of long average length.

(4) The other document clustering algorithms, like k-means, Bisecting k-means, and UPGMA, are sensitive when the number of clusters changes.

6.2 Future Work

Our future work will focus on the following three aspects:

(1) Combining the syntactic analysis: To find the important terms in a document, terms with different part-of-speech (POS) and syntactic attributes should be assigned different weights according to their relatedness in the document [67]. Many syntactic analysis tools can be used to tag all terms in the document set, e.g., the Qtag¹⁷ parser. We will further study whether combining our proposed algorithm with a syntactic analysis tool can improve the clustering results.

(2) Incrementally updating the cluster tree: When the number of documents in a document set increases sequentially, it is inefficient to rebuild the cluster tree for each new insertion. Instead, it is desirable to reflect the current state of the whole document set by incrementally updating the cluster tree [14][43]. Therefore, we intend to propose an efficient incremental clustering algorithm that assigns a new document to the most similar existing cluster. Some recent research on data mining over data streams [41][18][25] may be applicable to such incremental clustering development.

17 http://www.english.bham.ac.uk/staff/omason/software/qtag.html

(3) Using Wikipedia: we will consider the abundant structural relation within Wikipedia, such as hyperlinks and hierarchical categories, to improve the performance of clustering [57]. In addition, we will further compare our proposed approaches with other new frequent itemset-based document algorithms, such as Clustering based on Frequent Word Sequences (CFWS) [32] and Maximum Capturing (MC) [66].
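The incremental assignment described in item (2) can be illustrated with a minimal sketch. It is only one possible reading of the idea, not a method proposed in this thesis: it assumes that each existing cluster is summarized by a centroid of term weights and that similarity is measured by the cosine of the term-weight vectors. All names and weights are hypothetical.

```python
import math
from typing import Dict

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_new_document(doc: Vector, centroids: Dict[str, Vector]) -> str:
    """Return the name of the existing cluster whose centroid is most similar to `doc`."""
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))


# Made-up cluster centroids and a newly arrived document.
centroids = {
    "commerce": {"sale": 1.0, "trade": 1.0},
    "health": {"medical": 1.0, "health": 1.0},
}
new_doc = {"sale": 2.0, "marketing": 1.0}
print(assign_new_document(new_doc, centroids))
```

Such a scheme avoids rebuilding the cluster tree on every insertion, at the cost of never revising earlier assignments; a full incremental algorithm would also need to update the centroids and the tree structure.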

Bibliography

[1] Agrawal, R., Imielinski, T., Swami, A., Mining association rules between sets of items in large databases, In: Proc. of ACM SIGMOD Int’l Conf. on Management of Data, 1993, pp. 207-216.

[2] Andrews, N. O., Fox, E. A., Recent Developments in Document Clustering, Technical Report TR-07-35, Computer Science, Virginia Tech, 2007.

[3] Beil, F., Ester, M., Xu, X., Frequent term-based text clustering, In: Proc. of Int’l Conf. on Knowledge Discovery and Data Mining (KDD’02), 2002, pp. 436-442.

[4] Bellot, P., El-Beze, M., A Clustering Method for Information Retrieval, Technical Report IR-0199, 1999.

[5] Chen, C. L., Tseng, F. S. C., and Liang, T., An integration of fuzzy association rules and WordNet for document clustering, Knowledge and Information Systems (KAIS), Revision Submitted, 2010/03/20.

[6] Chen, C. L., Tseng, F. S. C., and Liang, T., An integration of fuzzy association rules and WordNet for document clustering, In: Proc. of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-09), 2009, pp. 147-159.

[7] Chen, C. L., Tseng, F. S. C., and Liang, T., An integration of WordNet and fuzzy association rule mining for multi-label document clustering, Data and Knowledge Engineering, to appear.

[8] Chen, C. L., Tseng, F. S. C., and Liang, T., Hierarchical document clustering based on fuzzy association rule mining, In: Proc. of the 3rd International Conference on Innovative Computing Information and Control (ICICIC’08), 2008/06, pp. 326-330.

[9] Chen, C. L., Tseng, F. S. C., and Liang, T., Mining fuzzy frequent itemsets for hierarchical document clustering, Information Processing and Management, Vol. 46, No. 2, March 2010, pp. 193-211.

[10] Craven, M., DiPasquo, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S., Learning to extract symbolic knowledge from the world wide web, In: AAAI-98, 1998.

[11] Dave, K., Lawrence, S., Pennock, D. M., Mining the peanut gallery: opinion extraction and semantic classification of product reviews, In: Proc. of the 12th Int’l Conf. on World Wide Web, 2003.

[12] de Campos, L. M., Moral, S., Learning rules for a fuzzy inference model, Fuzzy Sets and Systems, Vol. 59, 1993, pp. 247-257.

[13] Delgado, M., Martín-Bautista, M. J., Sánchez, D., Vila, M. A., Mining text data: special features and patterns, In: Proc. of EPS Exploratory Workshop on Pattern Detection and Discovery in Data Mining, 2002, pp. 140-153.

[14] Exarchos, T. P., Tsipouras, M. G., Papaloukas, C., Fotiadis, D. I., An optimized sequential pattern matching methodology for sequence classification, Knowledge and Information Systems, Vol. 19, No. 2, 2009, pp. 249-264.

[15] Feldman, R., Dagan, I., Knowledge discovery in textual databases (KDT), In: Proc. of the 1st ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 1995, pp. 112-117.

[16] Fung, B. C. M., Hierarchical Document Clustering Using Frequent Itemset, Master thesis, Simon Fraser University, 2002.

[17] Fung, B., Wang, K., Ester, M., Hierarchical document clustering using frequent itemsets, In: Proc. of SIAM Int’l Conf. on Data Mining (SDM’03), May 2003, pp. 59-70.

[18] Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L., Clustering data streams: theory and practice, IEEE Trans. on Knowledge and Data Eng., Vol. 15, No. 3, 2003, pp. 515–528.

[19] Han, E. H., Boley, B., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., Moore, J., Webace: A web agent for document categorization and exploration, In: Proc. of the 2nd Int’l Conf. on Autonomous Agents, 1998, pp. 408-415.

[20] Hipp, J., Guntzer, U., Nakhaeizadeh, G., Algorithms for association rule mining - a general survey and comparison, ACM SIGKDD Explorations Newsletter, Vol. 2, 2000, pp. 58–64.

[21] Hong, T. P., Lee, Y. C., An overview of mining fuzzy association rules, In: H. Bustince et al. (eds.), Fuzzy Sets and Their Extensions: Representation, Aggregation and Models, 2008, pp. 397-410.

[22] Hong, T. P., Lin, K. Y., Wang, S. L., Fuzzy data mining for interesting generalized association rules, Fuzzy Sets and Systems, Vol. 138, No. 2, 2003, pp. 255-269.

[23] Hotho, A., Maedche, A., Staab, S., Ontology-based textual document clustering, Kunstliche Intelligenz, Vol. 16, No. 4, 2002, pp. 48–54.

[24] Hotho, A., Staab, S., Stumme, G., Wordnet improves textual document clustering, In: Proc. of SIGIR Int’l Conf. on Semantic Web Workshop, 2003.

[25] Huang, Z., Sun, S., Wang, W., Efficient mining of skyline objects in subspaces over data streams, Knowledge and Information Systems, Vol. 22, No. 2, 2010, pp. 159-183.

[26] Jain, A. K., Dubes, R. C., Algorithms for clustering data, Prentice-Hall, Inc., 1988.

[27] Jing, L., Survey of text Clustering, http://www.alphaminer.org/document/downloads/textmining/survey of text clustering.pdf, 2008.

[28] Jing, L., Zhou, L., Ng, M. K., Huang, J. Z., Ontology-based distance measure for text clustering, In: Proc. of SIAM Int’l Conf. on Data Mining, 2006.

[29] Kaufman, L., Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, Inc., 1990.

[30] Kaya, M., Alhajj, R., Utilizing genetic algorithms to optimize membership functions for fuzzy weighted association rule mining, Applied Intelligence, Vol. 24, No. 1, 2006, pp. 7-15.

[31] Lewis, D. D., Yang, Y., Rose, T. G., Li, F., RCV1: a new benchmark collection for text categorization research, Journal of Machine Learning Research, Vol. 5, 2004, pp. 361 - 397.

[32] Li, Y. J., Chung, S. M., Holt, J. D., Text document clustering based on frequent word meaning sequences, Data and Knowledge Engineering, Vol. 64, 2008, pp. 381-404.

[33] Lin, K., Kondadadi, R., A word-based soft clustering algorithm for documents, Computers and Their Applications, 2001, pp. 391-394.

[34] Lin, S. H., Shih, C. S., Chen, M. C., Ho, J. M., Ko, M. T., Huang, Y. M., Extracting classification knowledge of internet documents with mining term association: a semantic approach, In: Proc. of the 21st Annual Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 241-249.

[35] Liu, B., Hsu, W., Ma, Y., Pruning and summarizing the discovered associations, In: Proc. of the ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 1999, pp. 125-134.

[36] MacQueen, J. B., Some methods for classification and analysis of multivariate observations, In: Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.

[37] Mandhani, B., Joshi, S., Kummamuru, K., A matrix density based algorithm to hierarchically co-cluster documents and words, In: Proc. of the 12th Int’l Conf. on World Wide Web, 2003, pp. 511-518.

[38] Martín-Bautista, M. J., Sánchez, D., Chamorro-Martínez, J., Serrano, J. M., Vila, M. A., Mining web documents to find additional query terms using fuzzy
