Organization of the Thesis - A Study of Document Clustering Based on Fuzzy Theory and Frequent Itemsets

Chapter 1 Introduction

1.3 Organization of the Thesis

The remaining chapters of this thesis are organized as follows. In Chapter 2, we briefly review related work on the general process of document clustering, major document clustering methods, association rules for text mining applications, and fuzzy set theory. In Chapter 3, the Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F²IHC) approach is described, together with an illustrative example. Chapter 4 presents the Fuzzy Frequent Itemset-based Document Clustering (F²IDC) approach, and Chapter 5 describes the Fuzzy Frequent Itemset-based Soft Clustering (F²ISC) approach. Finally, we conclude and propose some future directions in Chapter 6.

Chapter 2 Related Work

First, the general process of document clustering is described in Section 2.1. Then, the literature concerning document clustering methods is surveyed in Section 2.2. In Section 2.3, we discuss how association rules are applied to text mining. Finally, we briefly review some basic knowledge of fuzzy sets in Section 2.4.

2.1 A Generic Process of Document Clustering

The aim of document clustering is to group similar documents together based on the content of a set of documents. According to [59], we divide the general process of document clustering into three main stages, including Document Pre-processing, Document Representation, and Document Clustering (as shown in Figure 2-1). These stages are described as follows.

Figure 2-1: General process of document clustering.

1. Document Pre-processing. The given unstructured documents must be preprocessed before a clustering method can be applied. This stage has two steps, Term Extraction and Term Selection, which generate the term set from the document collection.

(1) Term Extraction: The whole extraction process is as follows:

• Extract terms. Divide the sentences into terms and extract terms as features.

• Remove the stop words. A pre-defined stop-word list¹ is applied to remove commonly used words that do not discriminate between topics.

• Conduct word stemming. Use an established stemming algorithm, such as Porter [44], to convert a word to its stem or root form. The frequencies of the stemmed terms, instead of the original terms, in the document collection are computed.

(2) Term Selection: After extracting terms, it is crucial to reduce the set of term features, a process referred to as term selection. For example, a term should be discarded from the term set if it appears too rarely or too frequently in the document collection. Several methods, such as itemset pruning [3], feature clustering or co-clustering [37], feature selection techniques [51], and matrix factorization [50][62], have been applied to reduce the dimensionality for high clustering accuracy.

2. Document Representation. The most common representation is the so-called “bag-of-words” matrix, where each document is represented as a vector based on the terms that occur in the respective document, and the clustering methods then compute the similarity between the vectors [47]. Several document representation methods have been proposed, including binary (which shows the presence or absence of a term in a document) and term frequency (which shows the frequency of a term in a document).

¹ It contains a list of 571 stop words that was developed by the SMART project.

3. Document Clustering. Common approaches for document clustering include k-means [36], Bisecting k-means [53], Hierarchical Agglomerative Clustering (HAC) [26][29][61], and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [39]. The details of each clustering approach are described in the following section.
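As an illustration of the first two stages, the following minimal Python sketch extracts terms, removes stop words, and builds the term-frequency and binary representations. The stop-word list, the function names, and the two sample documents are assumptions made for this example only. The list is far shorter than the 571-word SMART list, and stemming is omitted.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (not the SMART list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def extract_terms(text):
    """Split text into lowercase terms and drop stop words (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def represent(docs):
    """Return (vocabulary, term-frequency vectors, binary vectors)."""
    counts = [Counter(extract_terms(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Term-frequency representation: one row per document.
    tf = [[c[t] for t in vocab] for c in counts]
    # Binary representation: presence (1) or absence (0) of each term.
    binary = [[1 if v > 0 else 0 for v in row] for row in tf]
    return vocab, tf, binary

# Hypothetical sample documents.
docs = ["The stock record and the stock profit", "Medical treatment is health"]
vocab, tf, binary = represent(docs)
```

Here `tf[0]` keeps the count 2 for "stock", whereas `binary[0]` only records that the term is present.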

2.2 Document Clustering Methods

The basic principle of document classification is to classify or group a set of unlabeled documents into classes or clusters. According to [53], we divide document classification into three subcategories, i.e., supervised or unsupervised, hard or soft, and partitioning, hierarchical, or frequent itemset-based. These subcategories can be shown in a tree structure as Figure 2-2 depicts, which we describe as follows.

Figure 2-2: Types of document classification.

1 Supervised and Unsupervised (Clustering): In supervised document classification, a set of predefined classes is available. In unsupervised document classification, also called document clustering, no pre-determined classes are available. Document clustering is the process of calculating document similarities to form clusters. The documents within a cluster are similar to each other and, simultaneously, dissimilar to the documents in the other clusters.

2 Hard (Disjoint) and Soft (Overlapping): Hard clustering algorithms compute a hard assignment (i.e., each document is assigned to exactly one cluster) and produce a set of disjoint clusters. Soft clustering algorithms compute a soft assignment (i.e., each document may appear in multiple clusters) and generate a set of overlapping clusters. For instance, a document discussing “Natural language and Information Retrieval” should be assigned to both of the clusters “Natural language” and “Information Retrieval”.

3 Partitioning, Hierarchical, and Frequent itemset-based: For document clustering, partitioning-based methods exclusively partition the set of documents into a number of clusters by moving documents from one cluster to another, such as k-means [36] and Bisecting k-means [53].

Compared to partitioning-based methods, hierarchical-based document clustering builds a hierarchical tree of clusters whose leaf nodes represent subsets of a document collection, as in Hierarchical Agglomerative Clustering (HAC) [26][29][61] and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [39]. Moreover, these methods can be further classified into agglomerative and divisive approaches, which work in a bottom-up and top-down fashion, respectively. An agglomerative method iteratively merges the two most similar clusters until a termination condition is satisfied. A divisive method, on the other hand, starts with one cluster, which consists of all documents, and recursively splits a cluster into smaller sub-clusters until some termination criterion is fulfilled.

In addition, a new category of document clustering, namely “frequent itemset-based clustering,” has been extensively developed, including FIHC [17], HFTC [3], and TDC [63]. Frequent itemset-based clustering methods use frequent itemsets generated by association rule mining and then cluster the documents according to these extracted frequent itemsets. These methods efficiently reduce the dimensionality of term features for very large datasets and can thus improve the accuracy and scalability of the clustering algorithms. The clusters generated by frequent itemset-based clustering methods can be organized as a flat set or as a hierarchical tree of clusters.

Moreover, an advantage of frequent itemset-based clustering methods is that each cluster can be labeled by the frequent itemsets shared by the documents in the same cluster. A cluster label can be used not only to describe the main concept of the cluster, but also to differentiate the cluster from its sibling and parent clusters [55][65]. However, most frequent itemset-based clustering methods ignore the semantics of the terms in the process of generating frequent itemsets. In this thesis, the proposed approaches provide more general cluster labels because they take into account the semantics of the terms using WordNet as background knowledge.

Table 2-1 summarizes the characteristics of the proposed approaches and other document clustering algorithms.

Table 2-1: Summary for our approaches and the other document clustering algorithms.

Columns: Hierarchical-based | Partitioning-based | Frequent itemset-based

Hard
• Hierarchical
Frequent itemset-based, A Hierarchical Tree of Clusters:
• Fung et al. (2003) [17]
• The proposed approach (F²IHC) [8][9]
Frequent itemset-based, A Flat Set of Clusters:
• Yu et al. (2004) [63]
• The proposed approach (F²IDC) [5][6]

Soft
• Lin and Kondadadi (2001) [33]
Frequent itemset-based, A Hierarchical Tree of Clusters:
• Beil et al. (2002) [3]
Frequent itemset-based, A Flat Set of Clusters:
• The proposed approach (F²ISC) [7]

means a WordNet-based document clustering approach.

2.3 Association Rules for Text Mining Applications

According to [15], knowledge discovery in databases consists of several interactive and iterative phases for extracting useful knowledge from huge volumes of data. Data mining has been recognized as the most important of these phases, as it offers flexibility for extracting useful patterns from business data.

In data mining, association rule mining [20] is a popular method for discovering interesting association rules in large databases. An association rule has the form X → Y, where X and Y are sets of items and X ∩ Y = ∅. It is usually adopted for market basket analysis with the following meaning: customers who buy product X also buy product Y, subject to a predefined minimum support value and minimum confidence value. In general, each itemset has an associated measure of statistical significance called the support value, which is the fraction of all transactions that contain the itemset. For example, an itemset X with support value supp(X) = 0.5 means that 50% of the transactions in the dataset contain X. An itemset is chosen as a frequent itemset if its support value is larger than or equal to the predefined minimum support value. The confidence value of an association rule, denoted conf(X → Y) = supp(X∪Y)/supp(X), measures how often the items in Y appear in transactions that also contain X. Finally, a rule X → Y is discovered if its confidence value is larger than or equal to the predefined minimum confidence value.
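The support and confidence measures can be sketched in a few lines of Python. The transactions below are hypothetical market-basket data, and the function names are chosen for this example only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X); assumes supp(X) > 0."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

supp_bread = support({"bread"}, transactions)                      # 3 of 4 transactions
supp_bread_milk = support({"bread", "milk"}, transactions)         # 2 of 4 transactions
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)    # 0.5 / 0.75
```

With a minimum support of 0.5 and a minimum confidence of 0.6, for instance, the rule {bread} → {milk} would be kept, since its support is 0.5 and its confidence is 2/3.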

Due to the strong need for analyzing the vast amount of textual documents spread over the Internet, text mining is also growing rapidly. By the definition described in [15][52][60], Text Mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. The main purpose of text mining is to acquire fruitful knowledge from a large document set. It draws on techniques from data mining, computational linguistics, database systems, information retrieval, and artificial intelligence to achieve the goal.

Because text data are inherently unstructured and fuzzy [54], text mining is much more complex than data mining, and some studies [13][15][34] have applied association rule mining to document management. For example, Feldman and Dagan [15] presented a Knowledge Discovery in Text (KDT) system, which used the simplest information extraction approach to obtain interesting information and knowledge from unstructured text collections. Lin et al. [34] proposed a method, namely Mining Term Association, to acquire the semantic relations between terms in documents. Moreover, Delgado et al. [13] regard association rule mining as the first data mining technique employed in mining text collections, which is of particular interest because many applications related to text processing involve associations and co-occurrences between terms. These works mainly focused on analyzing co-occurring terms for document management.

Recently, to apply association rule mining flexibly to more applications, some research works [22][30][38] have integrated fuzzy set theory [64] with association rule mining in order to handle items with quantitative values while discovering fuzzy association rules from given transactions. Basically, the fuzzy association rule mining approach proposed by Hong et al. [22] first uses membership functions to convert quantitative values into fuzzy sets expressed in linguistic terms. Then, the scalar cardinality of each linguistic term over all transactions is calculated. The mining process based on these fuzzy counts is then used to find interesting association rules. In addition, Hong et al. [21] described in detail some fuzzy mining concepts and techniques related to association rule discovery, including mining fuzzy association rules, mining fuzzy generalized association rules, and mining both membership functions and fuzzy association rules.

In conventional association rule mining, each document merely contains binary terms, meaning that a term either appears in a document or not. However, terms in documents may carry quantitative values, such as term frequency or term weight. In this thesis, we thus employ the fuzzy association rule mining devised by Hong et al. [22], regarding a document as a transaction and the term frequency values in a document as the quantitative values (i.e., the number of purchased items in a transaction), to find the association relationships between terms.

To illustrate the usefulness of fuzzy data mining in document clustering, we use fuzzy set concepts to model the term frequency, which describes the degree of importance of a term in a document. In contrast with the crisp set concept, in which a term is either a member of a document or not, fuzzy set concepts allow a term to belong to a document to a certain degree.
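The difference between crisp and fuzzy counting can be illustrated with a small Python sketch. The three linguistic regions (Low, Mid, High), the piecewise-linear membership functions, and the breakpoints 1, 5, and 10 are assumptions made for this example; they are not the membership functions used in this thesis. The fuzzy count of a region is the scalar cardinality, i.e., the sum of the membership degrees over all documents.

```python
def fuzzify(f, lo=1, mid=5, hi=10):
    """Map a term frequency to membership degrees in Low, Mid, and High.

    The breakpoints lo, mid, hi are hypothetical.
    """
    if f <= 0:
        return {"Low": 0.0, "Mid": 0.0, "High": 0.0}
    if f <= lo:
        return {"Low": 1.0, "Mid": 0.0, "High": 0.0}
    if f <= mid:
        up = (f - lo) / (mid - lo)      # rises from 0 at lo to 1 at mid
        return {"Low": 1.0 - up, "Mid": up, "High": 0.0}
    if f < hi:
        up = (f - mid) / (hi - mid)     # rises from 0 at mid to 1 at hi
        return {"Low": 0.0, "Mid": 1.0 - up, "High": up}
    return {"Low": 0.0, "Mid": 0.0, "High": 1.0}

def fuzzy_counts(frequencies):
    """Scalar cardinality of each region over all documents."""
    counts = {"Low": 0.0, "Mid": 0.0, "High": 0.0}
    for f in frequencies:
        for region, degree in fuzzify(f).items():
            counts[region] += degree
    return counts

# Hypothetical frequencies of one term in five documents.
freqs = [0, 1, 3, 5, 10]
counts = fuzzy_counts(freqs)
crisp_count = sum(1 for f in freqs if f > 0)
```

The crisp count only states that four documents contain the term, whereas the fuzzy counts distinguish how strongly the term occurs in them.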

2.4 Fuzzy Set Theory

In this section, we briefly review some basic knowledge of fuzzy sets [64].

According to [68], a fuzzy set is considered as a class with fuzzy boundaries.

Definition 2.1 (Fuzzy set): A fuzzy set A in the universe of discourse U = {u1, u2,…,un} is defined by the membership function μA, denoted as μA(u), where u ∈ U. Each element u of U has a membership value, in the closed interval [0,1], given by μA.

$$A = \{(u_i, \mu_A(u_i)) \mid u_i \in U\} \tag{2.1}$$

Definition 2.2 (Fuzzy Relation): A fuzzy relation R between variables v and w, whose domains are V and W, respectively, is defined by a function that maps an ordered pair (v, w) in V × W to its degree in the relation, which is a value between 0 and 1.

R: V × W → [0, 1]. (2.2)

Let μA and μB be the membership functions of the fuzzy sets A and B, respectively. In the following, we summarize some fuzzy operations used in this thesis.

Definition 2.3 (Fuzzy Set Union): The union of the fuzzy sets A and B is denoted as A ∪ B and is defined by

$$A \cup B = \{(u_i, \mu_{A \cup B}(u_i)) \mid \mu_{A \cup B}(u_i) = \max(\mu_A(u_i), \mu_B(u_i)),\ u_i \in U\} \tag{2.3}$$

Definition 2.4 (Fuzzy Set Intersection): The intersection of the fuzzy sets A and B is denoted as A ∩ B and is defined by

$$A \cap B = \{(u_i, \mu_{A \cap B}(u_i)) \mid \mu_{A \cap B}(u_i) = \min(\mu_A(u_i), \mu_B(u_i)),\ u_i \in U\} \tag{2.4}$$

Definition 2.5 (α-cut): The α-cut of the fuzzy set A is denoted as Aα and is defined by

$$A_\alpha = \{u_i \mid \mu_A(u_i) \ge \alpha,\ u_i \in U\}, \quad \alpha \in [0,1] \tag{2.5}$$

The α-cut is the crisp set that contains all the elements of U whose membership values given by μA are greater than or equal to the specified value of α.
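The three operations in Definitions 2.3 to 2.5 can be sketched in Python as follows. Representing a fuzzy set as a dictionary from elements to membership values is a choice made for this example, as are the sample sets A and B; an element missing from a dictionary is treated as having membership 0.

```python
def fuzzy_union(a, b):
    """Membership in A ∪ B is the maximum of the two memberships."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def fuzzy_intersection(a, b):
    """Membership in A ∩ B is the minimum of the two memberships."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def alpha_cut(a, alpha):
    """Crisp set of elements whose membership is at least alpha."""
    return {u for u, m in a.items() if m >= alpha}

# Hypothetical fuzzy sets over U = {u1, u2, u3}.
A = {"u1": 0.2, "u2": 0.7, "u3": 1.0}
B = {"u1": 0.5, "u2": 0.4, "u3": 0.0}

union = fuzzy_union(A, B)
intersection = fuzzy_intersection(A, B)
cut = alpha_cut(A, 0.7)
```

For these sample sets, the α-cut of A at α = 0.7 contains u2 and u3 but not u1.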

In the following, we will present three fuzzy frequent itemset-based clustering approaches, which employ fuzzy set theory for document representation, to find suitable fuzzy frequent itemsets for clustering documents. Moreover, the mined fuzzy frequent itemsets will be expressed as the cluster labels.

Chapter 3 Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F²IHC) Approach

In order to browse and organize documents smoothly, hierarchical clustering techniques have been proposed to cluster a collection of documents into a hierarchical tree structure. Nevertheless, several challenges remain for hierarchical document clustering, such as high dimensionality, scalability, accuracy, and meaningful cluster labels [3][16][17].

In this chapter, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F²IHC) approach, which uses a fuzzy association rule mining algorithm to construct a hierarchical cluster tree that supports flexible browsing.

There are three stages in our F²IHC framework, as shown in Figure 3-1. We explain them in Sections 3.1 - 3.3.

Figure 3-1: The F²IHC framework.

3.1 Stage 1: Document Pre-processing

This stage describes the transformation processes required to obtain the desired representation of documents. As there are thousands of words in a document set, the purpose of this stage is to reduce dimensionality for high clustering accuracy. Several methods, such as itemset pruning [3], feature clustering or co-clustering [37], feature selection techniques [51], and matrix factorization [50][62], have been applied to reduce dimensionality. To address this problem, we have to find the terms that are significant and important for representing the content of each document.

Hence, we must remove the terms that are neither meaningful nor discriminative, in order to increase the clustering accuracy and keep the computing cost low. The details of the pre-processing are as follows:

1. Divide the sentences into terms.

2. Remove the stop words. We use a stop word list² that contains words to be excluded. The list is applied to remove the terms that have general meaning but do not discriminate for topics.

3. Conduct word stemming: Use the developed stemming algorithms, such as Porter [44], to reduce a word to its stem or root form.

4. Term selection. The terms whose selection metric weights are all higher than the pre-specified thresholds are selected as key terms. In our approach, three feature selection methods [46], tf-idf, tf-df, and tfidf-tfdf, are used to select representative terms for each document. These feature selection methods are defined as follows:

2 It contains a list of 571 stop words that was developed by the SMART project.

(1) tf-idf (term frequency-inverse document frequency): It is denoted as tfidfij and measures the importance of term tj within document di. To prevent a bias toward longer documents, the weighted frequency of each term is usually normalized by the total frequencies of all terms in document di. In the definition of tfidfij, |D| is the total number of documents in the document set D, and |{di | tj ∈ di, di ∈ D}| is the number of documents containing term tj.

(2) tf-df (term frequency-document frequency): It is represented by tfdfij and evaluated by (3.2) as the term frequency (TF) divided by the document frequency (DF), where TF is the number of times a term tj appears in a document di divided by the total frequencies of all terms in di, and DF is the number of documents containing term tj divided by the total number of documents in the document set D:

$$tfdf_{ij} = \frac{TF}{DF}, \quad TF = \frac{f_{ij}}{\sum_{k} f_{ik}}, \quad DF = \frac{|\{d_i \mid t_j \in d_i,\ d_i \in D\}|}{|D|} \tag{3.2}$$

(3) tfidf-tfdf: It is the multiplication of tfidfij and tfdfij, and we denote it as tfidf-tfdfij:

tfidf-tfdfij = tfidfij * tfdfij (3.3)
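The three selection metrics can be sketched in Python as follows. The tf-df weight follows the TF/DF definition given in the text. For tf-idf, the sketch assumes the commonly used form, normalized term frequency multiplied by the natural logarithm of |D| divided by the number of documents containing the term, which may differ in detail from the form used in this thesis. The function name and the sample documents are hypothetical.

```python
import math

def term_weights(docs, i, term):
    """Return tf-idf, tf-df, and their product for `term` in document i.

    Each document is a dict mapping a term to its frequency. The term is
    assumed to occur in at least one document of `docs`.
    """
    doc = docs[i]
    # Frequency of the term normalized by all term frequencies in the document.
    tf = doc.get(term, 0) / sum(doc.values())
    n_containing = sum(1 for d in docs if d.get(term, 0) > 0)
    # Fraction of documents that contain the term.
    df = n_containing / len(docs)
    tfidf = tf * math.log(len(docs) / n_containing)  # assumed idf form
    tfdf = tf / df
    return {"tfidf": tfidf, "tfdf": tfdf, "tfidf_tfdf": tfidf * tfdf}

# Hypothetical document set with raw term frequencies.
docs = [
    {"stock": 2, "record": 1, "profit": 1},
    {"stock": 1, "record": 1},
    {"medical": 3, "health": 2},
    {"medical": 4, "record": 1},
]
w = term_weights(docs, 0, "stock")
```

For "stock" in the first document, TF is 2/4 and DF is 2/4, so the tf-df weight is 1.0. A term would then be kept as a key term only if its weights exceed the chosen thresholds.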

After the weights of each term in each document have been calculated, the terms whose weights are higher than the pre-specified thresholds are retained. These retained terms form a set of key terms for the document set D, and we formally define them as follows.

Definition 3.1: A document, denoted di = {(t1, fi1), (t2, fi2),…, (tj, fij),…, (tm, fim)}, is a logical unit of text, characterized by a set of key terms tj together with their corresponding frequency fij.

Definition 3.2: A document set, denoted D = {d1, d2,…, di,…, dn}, also called a document collection, is a set of documents, where n is the total number of documents in D.

Definition 3.3: The term set of a document set D = {d1, d2,…, di,…, dn}, denoted TD = {t1, t2,…, tj,…, ts}, is the set of terms that appear in D, where s is the total number of terms.

Definition 3.4: The key term set of a document set D = {d1, d2,…, di,…, dn}, denoted KD = {t1, t2,…, tj,…, tm}, is a subset of the term set TD that includes only meaningful key terms, which do not appear in a well-defined stop word list and satisfy the pre-defined minimum thresholds of the term selection methods.

Based on these definitions, the representation of a document can be derived by Algorithm 3.1 shown in Figure 3-2. For example, consider a document set D = {d1, d2,…, d10}, which includes ten documents. Suppose that, by Algorithm 3.1, we obtain the derived representation of D and its key term set KD = {stock, record, profit, medical, treatment, health} as shown in Table 3-1. Notice that we use a tabular representation, where each entry denotes the frequency of a key term (the column heading) in a document di (the row heading), to make our presentation more concise. This representation scheme will be employed in the following to illustrate our approach.

Figure 3-2: A detailed illustration of Algorithm 3.1.

Table 3-1: Document set.

Docs ID   Key Term Set
          stock  record  profit  medical  treatment  health
d1        2      1       1       0        0          0
d2        1      1       0       0        0          0
d3        1      0       2       0        0          0
d4        0      0       0       3        0          2
d5        0      0       0       11       1          1
d6        0      1       0       4        0          0
d7        0      0       0       8        1          2
d8        3      0       1       0        0          0
d9        0      1       0       3        0          0
d10       0      0       0       8        2          1

3.2 Stage 2: Candidate Clusters Extraction

The objective of this stage is to take a document set D, a set of predefined membership functions, the minimum support value θ, and the minimum confidence value λ as input, and to output a set of candidate clusters. To achieve this goal, we modified the algorithm proposed by Hong et al. [22] to capture the relationships among different key terms of the document set. Since each discovered fuzzy frequent itemset has an associated fuzzy count value, this value can be regarded as the degree of importance that the itemset contributes to the document set.

In the following, we will define the membership functions, present our algorithm, and finally explain our approach by an illustrative example.

3.2.1 The Membership Functions

The membership functions are used to convert each term frequency into a fuzzy set. Therefore, we define the t-f (term frequency) fuzzy set in Definition 3.5 used in this thesis.

In formulas (3.4), (3.5), and (3.6), min(fij) is the minimum frequency of terms in D, max(fij) is the maximum frequency of terms in D, and avg(fij) = ⌈ … ⌉. Based on the document set in Table 3-1, the derived membership functions are shown in Figure 3-3.

Figure 3-3: The predefined membership functions of this example.

3.2.2 The Fuzzy Association Rule Mining Algorithm for Text

To describe our fuzzy association rule mining algorithm, we need Definitions 3.6 - 3.7. The candidate cluster set C_D for a document set D can be generated by Algorithm 3.2 shown in Figure 3-4.

Definition 3.6: For a document set D, a candidate cluster c = (D_c, τ) is a two-tuple,

where D_c is a subset of the document set D, such that it includes those documents
