Background and Motivation - A Study of Document Clustering Based on Fuzzy Theory and Frequent Itemsets

Chapter 1 Introduction

1.1 Background and Motivation

Clustering textual documents into groups is an important step in indexing, retrieving, managing, and mining the abundant text data on the Web and in corporate document repositories [4][27][56][61]. The rapid growth of the Internet has made a vast number of textual documents available online, but it has also left users struggling with information overload. In particular, when users pose queries to Web search engines, they usually receive a small number of relevant pages intermingled with a large number of irrelevant ones. The focus of document clustering has therefore shifted towards reorganizing search results into meaningful cluster hierarchies so that users can browse large document collections efficiently. A good document clustering technique should thus serve as a helpful complement to traditional search engines when a keyword-based search returns too many documents.

The aim of document clustering algorithms is to automatically discover the hidden similarities and key concepts of the clustered documents, helping users comprehend a large number of documents. Over the past decades, several effective document clustering algorithms have been proposed, including k-means [36], Bisecting k-means [53], Hierarchical Agglomerative Clustering (HAC) [26][29][61], and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [39]. Nevertheless, as pointed out in [3][17][24][45][33], several challenges remain in improving clustering quality:

(1) To cope with high dimensionality: As the volume of textual documents increases, the dimensionality of the term features increases as well.

(2) To improve the scalability: Many document clustering algorithms work fine on small document sets, but fail to deal with large document sets efficiently.

(3) To improve accuracy: Many existing document clustering algorithms require users to specify the number of clusters as an input parameter. However, it is difficult to determine the number of clusters in advance, and an incorrect estimate may lead to poor clustering accuracy [17].

(4) To assign meaningful cluster labels: Meaningful cluster labels guide users as they browse the retrieved results. Thus, each cluster should be labeled with an understandable description. However, most traditional clustering algorithms do not provide labels for clusters.

(5) To extract semantics from text: The bag-of-words representation used by clustering algorithms is often unsatisfactory because it ignores the conceptual similarity of terms that do not actually co-occur [24][45].

(6) To enable overlapping clusters: Many well-known clustering algorithms focus on hard clustering, where each document belongs to exactly one cluster. However, a document may cover multiple subjects. With soft clustering algorithms [33], a document can appear in multiple clusters (i.e., overlapping clusters).
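Challenges (1) and (5) can be illustrated with a minimal sketch. The three toy documents, the whitespace tokenizer, and the helper names `bag_of_words` and `cosine` are assumptions made for illustration only; they are not taken from any of the cited algorithms. The sketch shows that the vector dimensionality equals the number of distinct terms, and that two documents on the same topic receive zero similarity when they share no surface terms.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Return a term-frequency Counter for a whitespace-tokenized document."""
    return Counter(doc.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus: documents 0 and 1 are about the same topic
# but use entirely different words.
docs = [
    "car engine repair",
    "automobile motor maintenance",
    "car engine maintenance",
]

vectors = [bag_of_words(d) for d in docs]
vocabulary = set().union(*vectors)

print(len(vocabulary))                  # distinct terms = vector dimensionality
print(cosine(vectors[0], vectors[1]))   # no shared terms, so similarity is zero
print(cosine(vectors[0], vectors[2]))   # two shared terms out of three
```

Every additional document can add new terms to `vocabulary`, which is why the dimensionality grows with the collection, and the zero similarity between the first two documents is the kind of missed conceptual link that motivates the use of background knowledge such as a thesaurus.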

To address the problems of high dimensionality, large data size, and understandable cluster descriptions, Beil et al. [3] developed the first frequent itemset-based algorithm, Hierarchical Frequent Term-based Clustering (HFTC), in which the frequent itemsets are generated by association rule mining [12]. They considered only low-dimensional frequent itemsets as clusters. Moreover, HFTC discovers overlapping clusters, which is useful for search-engine directories such as Yahoo! Directory, where overlapping clusters occur.

However, the experiments of Fung et al. [17] showed that HFTC is not scalable. To obtain a scalable algorithm, Fung et al. proposed Frequent Itemset-based Hierarchical Clustering (FIHC), which uses frequent itemsets derived from association rule mining to construct a hierarchical topic tree of clusters. They also showed that using frequent itemsets for document clustering can effectively reduce the dimensionality of term vectors. Yu et al. [63] presented another frequent itemset-based algorithm, TDC, to improve clustering quality and scalability. TDC dynamically generates a topic directory from a document set using only closed frequent itemsets, which further reduces dimensionality. However, the clusters generated by FIHC and TDC are non-overlapping. The authors of [23] argued that document clustering methods should provide multiple subjective perspectives on the same document to enhance their practical applicability.
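The idea shared by these frequent itemset-based approaches, namely that a frequent set of terms identifies a candidate cluster consisting of the documents that contain all of its terms, can be sketched as follows. This is not an implementation of HFTC, FIHC, or TDC: the function name `frequent_term_sets`, the brute-force enumeration, the `min_support` and `max_size` parameters, and the toy corpus are all assumptions chosen for illustration.

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Return {term set: ids of the documents containing every term in the set}.

    Brute-force enumeration for illustration only; a real miner would prune
    candidates Apriori-style instead of testing every combination.
    """
    doc_terms = [set(d.lower().split()) for d in docs]
    vocabulary = sorted(set().union(*doc_terms))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            cover = frozenset(
                i for i, ts in enumerate(doc_terms) if ts.issuperset(terms)
            )
            # Keep the term set only if enough documents contain all its terms.
            if len(cover) / len(docs) >= min_support:
                result[frozenset(terms)] = cover
    return result

# Hypothetical toy corpus.
docs = [
    "apple fruit juice",
    "apple computer software",
    "fruit juice recipe",
    "computer software update",
]

clusters = frequent_term_sets(docs, min_support=0.5)
for terms, cover in sorted(clusters.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(terms), "->", sorted(cover))
```

In this toy corpus, document 1 is covered by both the term set {apple} and the term set {computer, software}, so the candidate clusters overlap. The term set itself doubles as a readable cluster label, and only a handful of frequent term sets survive out of the full vocabulary, which hints at how this family of methods reduces dimensionality.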

Recently, WordNet [40], one of the most widely adopted thesauri for English, has been used extensively as an ontology for grouping documents according to the semantic relations among terms [24][45][11][28]. Many existing document clustering algorithms transform textual documents into a simplistic flat representation, i.e., term vectors or bags of words. Once terms are treated as individual items in such a representation, the semantic content of a document is fragmented and can no longer be reflected. Dave et al. [11] therefore proposed using synsets as features for document representation and subsequent clustering. However, without word sense disambiguation, synsets decreased the clustering performance in all of their experiments. Meanwhile, Hotho et al. [24] used WordNet for word sense disambiguation in document clustering to improve the clustering results. Jing et al. [28] presented another application of WordNet, describing how to find mutual information between terms by using WordNet as background knowledge. In [45], Recupero proposed a new unsupervised document clustering method that uses WordNet lexical and conceptual relations to allow common clustering algorithms to perform well. In this thesis, the reasons for utilizing hypernyms from WordNet are two-fold:

(1) We intend to obtain more general and conceptual labels for derived clusters.

(2) The experimental results in [11][49] indicate that adding hypernyms yields better performance than adding synonyms.
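The effect of hypernym enrichment can be sketched with a small example. The hand-written `HYPERNYMS` map below merely stands in for WordNet and its entries are illustrative assumptions; the helper names `hypernym_chain` and `enrich` and the `max_levels` parameter are likewise hypothetical and do not describe the procedure used in this thesis.

```python
# Toy hypernym map standing in for WordNet; entries are illustrative only.
HYPERNYMS = {
    "beef": "meat",
    "pork": "meat",
    "meat": "food",
    "apple": "fruit",
    "fruit": "food",
}

def hypernym_chain(term, max_levels=2):
    """Return up to max_levels ancestors of term in the toy map."""
    chain = []
    while term in HYPERNYMS and len(chain) < max_levels:
        term = HYPERNYMS[term]
        chain.append(term)
    return chain

def enrich(doc, max_levels=2):
    """Return the document's terms together with their hypernyms."""
    terms = set(doc.lower().split())
    for term in list(terms):
        terms.update(hypernym_chain(term, max_levels))
    return terms

a = enrich("beef stew")
b = enrich("pork roast")
print(sorted(a & b))  # ['food', 'meat']
```

The two raw documents share no terms, but after enrichment they share "meat" and "food". These shared, more general terms are also natural candidates for conceptual cluster labels, in line with reason (1).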
