
Chapter 2 Related Work

2.4 Fuzzy Set Theory

In this section, we briefly review some basic knowledge of fuzzy sets [64].

According to [68], a fuzzy set is considered as a class with fuzzy boundaries.

Definition 2.1 (Fuzzy set): A fuzzy set A in the universe of discourse U = {u1, u2,…,un} is defined by the membership function μA, denoted as μA(u), where u ∈ U. Each element u of U has a membership value, in the closed interval [0,1], given by μA.

$A = \{(u_i, \mu_A(u_i)) \mid u_i \in U\}$. (2.1)

Definition 2.2 (Fuzzy Relation): A fuzzy relation R between variables v and w, whose domains are V and W, respectively, is defined by a function that maps an ordered pair (v, w) in V × W to its degree in the relation, which is a value between 0 and 1.

R: V × W → [0, 1]. (2.2)

Let μA and μB be the membership functions of the fuzzy sets A and B, respectively. In the following, we summarize some fuzzy operations used in this thesis.

Definition 2.3 (Fuzzy Set Union): The union of the fuzzy sets A and B is denoted as A ∪ B and is defined by

$A \cup B = \{(u_i, \mu_{A \cup B}(u_i)) \mid \mu_{A \cup B}(u_i) = \mathrm{Max}(\mu_A(u_i), \mu_B(u_i)),\ u_i \in U\}$. (2.3)

Definition 2.4 (Fuzzy Set Intersection): The intersection of the fuzzy sets A and B is denoted as A ∩ B and is defined by

$A \cap B = \{(u_i, \mu_{A \cap B}(u_i)) \mid \mu_{A \cap B}(u_i) = \mathrm{Min}(\mu_A(u_i), \mu_B(u_i)),\ u_i \in U\}$. (2.4)

Definition 2.5 (α-cut): The α-cut of the fuzzy set A is denoted as Aα and is defined by

$A_\alpha = \{u_i \mid \mu_A(u_i) \ge \alpha,\ u_i \in U\},\ \alpha \in [0,1]$. (2.5)

The α-cut is the crisp set that contains all the elements of U whose membership values given by μA are greater than or equal to the specified value of α.
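The operations in Definitions 2.3 to 2.5 can be sketched in a few lines of Python. This is only an illustration: representing a fuzzy set as a dictionary from elements to membership values is an assumption of the sketch, and the element names and values are made up.

```python
# A fuzzy set over a finite universe U is stored as {element: membership value}.
# Element names and membership values below are hypothetical.

def fuzzy_union(a, b):
    """Union per (2.3): pointwise maximum of the two membership values."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def fuzzy_intersection(a, b):
    """Intersection per (2.4): pointwise minimum of the two membership values."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def alpha_cut(a, alpha):
    """Alpha-cut per (2.5): crisp set of elements with membership >= alpha."""
    return {u for u, mu in a.items() if mu >= alpha}

A = {"u1": 0.2, "u2": 0.7, "u3": 1.0}
B = {"u1": 0.5, "u2": 0.4, "u3": 0.0}

print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
print(alpha_cut(A, 0.7))
```

Note that the α-cut returns an ordinary (crisp) Python set, whereas union and intersection return fuzzy sets again.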

In the following, we will present three fuzzy frequent itemset-based clustering approaches, which employ fuzzy set theory for document representation, to find suitable fuzzy frequent itemsets for clustering documents. Moreover, the mined fuzzy frequent itemsets will be expressed as the cluster labels.

Chapter 3

Fuzzy Frequent Itemset-based Hierarchical Document Clustering (F2IHC) Approach

In order to browse and organize documents smoothly, hierarchical clustering techniques have been proposed to cluster a collection of documents into a hierarchical tree structure. Nevertheless, several challenges remain for hierarchical document clustering, such as high dimensionality, scalability, accuracy, and meaningful cluster labels [3][16][17].

In this chapter, we will present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses a fuzzy association rule mining algorithm to construct a hierarchical cluster tree for providing flexible browsing.

There are three stages in our F2IHC framework, as shown in Figure 3-1. We explain them in Sections 3.1 to 3.3.

Figure 3-1: The F2IHC framework.

3.1 Stage 1: Document Pre-processing

This stage describes the required transformation processes of documents to obtain the desired representation of documents. As there are thousands of words in a document set, the purpose of this stage is to reduce dimensionality for high clustering accuracy. Several methods, such as itemset pruning [3], feature clustering or co-clustering [37], feature selection technique [51], and matrix factorization [50][62], have been applied to reduce dimensionality. To solve this problem, we have to find the terms that are significant and important to represent the content of each document.

Hence, we must remove the terms that are not meaningful and discriminative, in order to increase the clustering accuracy and keep the computing cost small. We describe the details of the pre-processing in the following:

1. Divide the sentences into terms.

2. Remove the stop words. We use a stop word list that contains words to be excluded (a list of 571 stop words that was developed by the SMART project). The list is applied to remove the terms that have general meaning but do not discriminate between topics.

3. Conduct word stemming: Use an existing stemming algorithm, such as Porter [44], to reduce each word to its stem or root form.

4. Term selection. The terms whose selection metric weights are all higher than the pre-specified thresholds will be selected as key terms. In our approach, three feature selection methods [46], tf-idf, tf-df, and tfidf-tfdf, are used to select representative terms for each document, and these feature selection methods are defined as follows:

(1) tf-idf (term frequency-inverse document frequency): It is denoted as tfidfij and is used to measure the importance of term tj within document di. To prevent a bias toward longer documents, the weighted frequency of each term is usually normalized by the total frequencies of all terms in document di. tfidfij is defined by (3.1), where the normalizing term is the total frequencies of all terms in document di, |D| is the total number of documents in the document set D, and |{di | tj ∈ di, di ∈ D}| is the number of documents containing term tj.

(2) tf-df (term frequency-document frequency): It is represented by tfdfij and is calculated by (3.2) as the term frequency (TF) divided by the document frequency (DF), where TF is the number of times term tj appears in document di divided by the total frequencies of all terms in di, and DF is the number of documents containing term tj divided by the total number of documents in the document set D:

tfdfij = TF/DF, where $TF = f_{ij} / \sum_{k} f_{ik}$ and $DF = |\{d_i \mid t_j \in d_i,\ d_i \in D\}| / |D|$. (3.2)

(3) tfidf-tfdf: It is the multiplication of tfidfij and tfdfij, and we denote it as tfidf-tfdfij:

tfidf-tfdfij = tfidfij * tfdfij (3.3)
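As a rough illustration of the tf-df weight in (3.2), the following Python sketch computes TF, DF, and tfdf from the key-term frequencies that appear in Table 3-1 of this chapter. One simplification is assumed: the total frequency of a document is taken over the six key terms of that table only, not over every term of the original text, so the numbers are illustrative and are not results reported in this thesis.

```python
# Key-term frequencies copied from Table 3-1 (rows d1..d10).
TERMS = ["stock", "record", "profit", "medical", "treatment", "health"]
FREQ = [
    [2, 1, 1, 0, 0, 0],   # d1
    [1, 1, 0, 0, 0, 0],   # d2
    [1, 0, 2, 0, 0, 0],   # d3
    [0, 0, 0, 3, 0, 2],   # d4
    [0, 0, 0, 11, 1, 1],  # d5
    [0, 1, 0, 4, 0, 0],   # d6
    [0, 0, 0, 8, 1, 2],   # d7
    [3, 0, 1, 0, 0, 0],   # d8
    [0, 1, 0, 3, 0, 0],   # d9
    [0, 0, 0, 8, 2, 1],   # d10
]

def tf(i, j):
    """Frequency of term j in document i, normalized by the document's total."""
    total = sum(FREQ[i])
    return FREQ[i][j] / total if total else 0.0

def df(j):
    """Fraction of documents that contain term j."""
    return sum(1 for row in FREQ if row[j] > 0) / len(FREQ)

def tfdf(i, j):
    """tf-df weight per (3.2): TF divided by DF."""
    d = df(j)
    return tf(i, j) / d if d else 0.0

# Weight of "stock" in d1: TF = 2/4, DF = 4/10.
print(round(tfdf(0, TERMS.index("stock")), 4))
```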

After these weights of each term in each document have been calculated, those terms whose weights are higher than the pre-specified thresholds are retained. These retained terms form a set of key terms for the document set D, and we formally define them as follows.

Definition 3.1: A document, denoted di = {(t1, fi1), (t2, fi2),…, (tj, fij),…, (tm, fim)}, is a logical unit of text, characterized by a set of key terms tj together with their corresponding frequency fij.

Definition 3.2: A document set, denoted D = {d1, d2,…, di,…, dn}, also called a document collection, is a set of documents, where n is the total number of documents in D.

Definition 3.3: The term set of a document set D = {d1, d2,…, di,…, dn}, denoted TD = {t1, t2,…, tj,…, ts}, is the set of terms that appear in D, where s is the total number of terms.

Definition 3.4: The key term set of a document set D = {d1, d2,…, di,…, dn}, denoted KD = {t1, t2,…, tj,…, tm}, is a subset of the term set TD that includes only meaningful key terms, i.e., terms that do not appear in a well-defined stop word list and that satisfy the pre-defined minimum thresholds of the term selection methods.

Based on these definitions, the representation of a document can be derived by Algorithm 3.1 shown in Figure 3-2. For example, consider a document set D = {d1, d2,…, d10} that includes ten documents. Suppose that, by Algorithm 3.1, we obtain the derived representation of D and its key term set KD = {stock, record, profit, medical, treatment, health} as shown in Table 3-1. Notice that we use a tabular representation, where each entry denotes the frequency of a key term (the column heading) in a document di (the row heading), to make our presentation more concise. This representation scheme will be employed in the following to illustrate our approach.

Figure 3-2: A detailed illustration of Algorithm 3.1.

Table 3-1: Document set.

          Key Term Set
Doc ID    stock   record   profit   medical   treatment   health
d1        2       1        1        0         0           0
d2        1       1        0        0         0           0
d3        1       0        2        0         0           0
d4        0       0        0        3         0           2
d5        0       0        0        11        1           1
d6        0       1        0        4         0           0
d7        0       0        0        8         1           2
d8        3       0        1        0         0           0
d9        0       1        0        3         0           0
d10       0       0        0        8         2           1

3.2 Stage 2: Candidate Clusters Extraction

The objective of this stage is to take a document set D, a set of predefined membership functions, the minimum support value θ, and the minimum confidence value λ as input, and to output a set of candidate clusters. To achieve this goal, we modified the algorithm proposed by Hong et al. [22] to capture the relationships among different key terms of the document set. Since each discovered fuzzy frequent itemset has an associated fuzzy count value, it can be regarded as the degree of importance that the itemset contributes to the document set.

In the following, we will define the membership functions, present our algorithm, and finally explain our approach by an illustrative example.

3.2.1 The Membership Functions

The membership functions are used to convert each term frequency into a fuzzy set. Therefore, we define the t-f (term frequency) fuzzy set in Definition 3.5 used in this thesis.

In formulas (3.4), (3.5), and (3.6), min(fij) is the minimum frequency of terms in D, max(fij) is the maximum frequency of terms in D, and avg(fij) is the average frequency of terms in D. Based on the document set in Table 3-1, the derived membership functions are shown in Figure 3-3.

Figure 3-3: The predefined membership functions of this example.

3.2.2 The Fuzzy Association Rule Mining Algorithm for Text

To describe our fuzzy association rule mining algorithm, we need Definitions 3.6 and 3.7. The candidate cluster set $C_D$ for a document set D can be generated by Algorithm 3.2 shown in Figure 3-4.

Definition 3.6: For a document set D, a candidate cluster $c = (D_c, \tau)$ is a two-tuple, where $D_c$ is a subset of the document set D, such that it includes those documents which contain all the key terms in τ = {t1, t2,…, tq} ⊆ KD, q ≥ 1, where KD is the key term set of D and q is the number of key terms included in τ. In fact, τ is a fuzzy frequent itemset for describing c. To illustrate, c can also be denoted as $c^{q}_{(t_1, t_2, \ldots, t_q)}$ or $c^{q}_{(\tau)}$, and these notations will be used interchangeably hereafter. For instance, in Table 3-1, the candidate cluster $c^{1}_{(stock)}$ = ({d1, d2, d3, d8}, {stock}), as the term “stock” appears in these documents.

Definition 3.7: The candidate cluster set of a document set D, denoted $C_D = \{c^{1}_{1}, \ldots, c^{1}_{l-1}, c^{q}_{l}, \ldots, c^{q}_{k}\}$, is a set of candidate clusters, where k is the total number of candidate clusters.

Figure 3-4: A detailed illustration of Algorithm 3.2.

3.2.3 An Illustrative Example of Stage 2

Consider using the document set D in Table 3-1, the membership functions defined in Figure 3-3, the minimum support value 40%, and the minimum confidence value 60% as inputs. The procedure of Algorithm 3.2 is illustrated in the following.

Step 1. The input membership functions are used to convert each term frequency into a fuzzy set. By taking the first key term t1 “stock” in document d1 as an example, its term frequency ‘2’ will be transformed into the fuzzy set F11 = 1.67/stock.Low + 1.33/stock.Mid + 1.0/stock.High based on the given membership functions, where the notation term.region is called a fuzzy region.

This step will be repeated for the other terms, and the results are shown in Table 3-2.

Table 3-2: The fuzzy set in this example.

Doc ID; Level-1 Fuzzy Set, with Low (L), Mid (M), and High (H) columns for each of stock, record, profit, medical, treatment, and health

Step 2. For D, the scalar cardinality of each fuzzy region of each key term is calculated as its count value. For example, the scalar cardinality of the fuzzy region stock.Low is (1.67 + 2.00 + 2.00 + 1.33) = 7.0. Repeating this step for the other regions gives the results shown in Table 3-3.

Table 3-3: The count values of three fuzzy regions for each key term.

Terms         Count    Terms          Count    Terms            Count
stock.Low     7.00     profit.Low     5.67     treatment.Low    5.67
stock.Mid     5.00     profit.Mid     3.33     treatment.Mid    3.33
stock.High    4.00     profit.High    3.00     treatment.High   3.00
record.Low    8.00     medical.Low    6.66     health.Low       7.34
record.Mid    4.00     medical.Mid    9.20     health.Mid       4.66
record.High   4.00     medical.High   8.14     health.High      4.00

Step 3. Then, the region of each key term with the maximum count value is found. Take the key term “stock” as an example. Its count value is 7.0 for Low, 5.0 for Mid, and 4.0 for High. Because the count value for Low is the highest among the three count values, the region Low is used to represent the key term “stock” in the following steps. This step is repeated for the other key terms. Thus, Low is chosen for “stock”, “record”, “profit”, “treatment”, and “health”, and Mid is chosen for “medical”.

Step 4. The maximum count value chosen for each key term in Step 3 is then checked against the predefined minimum support value 40%. Since the support values of stock.Low, record.Low, profit.Low, treatment.Low, medical.Mid, and health.Low are all larger than 40%, these key terms are put in L1 (fuzzy frequent 1-itemsets) as shown in Table 3-4.
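Steps 2 to 4 can be traced with a short Python sketch. It starts from the count values of Table 3-3, picks the region with the maximum count for each key term, and keeps the term if its support (count divided by the number of documents, as in the 3.67/10 example of this section) reaches the minimum support. The function names are hypothetical, and using `>=` for the threshold test is an assumption of the sketch; it makes no difference for these values.

```python
# Count values copied from Table 3-3 (scalar cardinality of each fuzzy region).
COUNTS = {
    "stock":     {"Low": 7.00, "Mid": 5.00, "High": 4.00},
    "record":    {"Low": 8.00, "Mid": 4.00, "High": 4.00},
    "profit":    {"Low": 5.67, "Mid": 3.33, "High": 3.00},
    "medical":   {"Low": 6.66, "Mid": 9.20, "High": 8.14},
    "treatment": {"Low": 5.67, "Mid": 3.33, "High": 3.00},
    "health":    {"Low": 7.34, "Mid": 4.66, "High": 4.00},
}
N_DOCS = 10          # |D| in the running example
MIN_SUPPORT = 0.40   # minimum support value of the example

# Step 2 for one region: the count is the sum of the membership values.
stock_low_count = sum([1.67, 2.00, 2.00, 1.33])

def max_count_region(regions):
    """Step 3: return (region, count) with the largest count for one key term."""
    return max(regions.items(), key=lambda kv: kv[1])

def frequent_1_itemsets(counts, n_docs, min_support):
    """Step 4: keep each term's max-count region if its support is high enough."""
    l1 = {}
    for term, regions in counts.items():
        region, count = max_count_region(regions)
        support = count / n_docs
        if support >= min_support:
            l1[f"{term}.{region}"] = (count, support)
    return l1

L1 = frequent_1_itemsets(COUNTS, N_DOCS, MIN_SUPPORT)
for name, (count, support) in L1.items():
    print(f"{name}: count={count:.2f}, support={support:.0%}")
```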

Table 3-4: The set of fuzzy frequent 1-itemsets in this example.

Terms Count Support Values

Table 3-5: The candidate set C2.

Candidate 2-itemsets         Candidate 2-itemsets          Candidate 2-itemsets
(stock.Low, record.Low)      (record.Low, profit.Low)      (profit.Low, treatment.Low)
(stock.Low, profit.Low)      (record.Low, medical.Mid)     (profit.Low, health.Low)
(stock.Low, medical.Mid)     (record.Low, treatment.Low)   (medical.Mid, treatment.Low)
(stock.Low, treatment.Low)   (record.Low, health.Low)      (medical.Mid, health.Low)
(stock.Low, health.Low)      (profit.Low, medical.Mid)     (treatment.Low, health.Low)

(2) For each candidate 2-itemset in C2, there are three sub-steps to be performed:

(a) The fuzzy value of each document for each candidate 2-itemset is calculated. For instance, the derived fuzzy value of (stock.Low, record.Low) in document d1 can be calculated as: min(1.67, 2.00) = 1.67. The results for the other documents are shown in Table 3-6.

Table 3-6: The fuzzy values of (stock.Low, record.Low) in D.

Doc ID   stock.Low   record.Low   min(stock.Low, record.Low)
d1       1.67        2.00         1.67
d2       2.00        2.00         2.00
d3       2.00        0.00         0.00
d4       0.00        0.00         0.00
d5       0.00        0.00         0.00
d6       0.00        2.00         0.00
d7       0.00        0.00         0.00
d8       1.33        0.00         0.00
d9       0.00        2.00         0.00
d10      0.00        0.00         0.00

(b) Calculate the scalar cardinality for each candidate 2-itemset. Table 3-7 lists the results for all candidate 2-itemsets.

Table 3-7: The count values of candidate 2-itemsets.

Candidate 2-itemsets Count Support Values

(stock.Low, record.Low) 3.67 3.67/10=37%

(3) Because only the support values of (stock.Low, profit.Low), (medical.Mid, health.Low), and (treatment.Low, health.Low) are larger than the predefined minimum support value 40%, they are stored in L2 (fuzzy frequent 2-itemsets).
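The handling of candidate 2-itemsets can be sketched as follows. The sketch enumerates all pairs of fuzzy frequent 1-itemsets (which yields the 15 candidates of Table 3-5), and then computes the count and support of (stock.Low, record.Low) from the two membership columns of Table 3-6, using the minimum operator per document. The helper name `itemset_count` is hypothetical.

```python
from itertools import combinations

# Fuzzy frequent 1-itemsets of the running example, in the order of Table 3-5.
L1 = ["stock.Low", "record.Low", "profit.Low",
      "medical.Mid", "treatment.Low", "health.Low"]

# Candidate 2-itemsets: every unordered pair of distinct 1-itemsets.
C2 = list(combinations(L1, 2))

# Membership values per document (d1..d10), copied from Table 3-6.
STOCK_LOW = [1.67, 2.00, 2.00, 0.00, 0.00, 0.00, 0.00, 1.33, 0.00, 0.00]
RECORD_LOW = [2.00, 2.00, 0.00, 0.00, 0.00, 2.00, 0.00, 0.00, 2.00, 0.00]

def itemset_count(*columns):
    """Scalar cardinality of an itemset: sum over documents of the minimum
    membership value among the itemset's fuzzy regions."""
    return sum(min(values) for values in zip(*columns))

count = itemset_count(STOCK_LOW, RECORD_LOW)
support = count / len(STOCK_LOW)

print(len(C2))                       # number of candidate 2-itemsets
print(f"{count:.2f}", f"{support:.0%}")
print(support >= 0.40)               # is the pair frequent at 40% support?
```

Because `itemset_count` accepts any number of columns, the same function also covers the candidate 3-itemsets of the next step.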

Step 5. Since L2 is not null, repeat the step 5 as follows.

(1) q, a variable used to store the number of key terms kept in the current itemsets, is set as 2.

(2) The candidate 3-itemset (medical.Mid, health.Low, treatment.Low) is generated from L2. The count value of the candidate 3-itemset (medical.Mid, health.Low, treatment.Low) is 3.00.

(3) Then, its support value is 3.00/10 = 0.30. Since its support value is not larger than 40%, it is not put in L3.

Step 6. Since L3 is null, we proceed to step 6. For each fuzzy frequent itemset, the association rules are constructed by accomplishing the following sub-steps.

(a) Based on the fuzzy frequent itemsets, all possible association rules are formed:

If stock = Low, then profit = Low
If profit = Low, then stock = Low
If medical = Mid, then health = Low
If health = Low, then medical = Mid
If treatment = Low, then health = Low
If health = Low, then treatment = Low

(b) Then, we calculate the confidence values of the above possible association rules. Take the first rule pair as an example. Their confidence values are calculated as follows:

If stock = Low, then profit = Low

For the other rule pairs, the results are shown below:

If medical = Mid, then health = Low, with a confidence value of 0.60.

If health = Low, then medical = Mid, with a confidence value of 0.75.

If treatment = Low, then health = Low, with a confidence value of 0.94.

If health = Low, then treatment = Low, with a confidence value of 0.73.
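A confidence computation of this kind can be sketched in Python under one stated assumption: the confidence of a rule is taken to be the count of the whole itemset divided by the count of the antecedent region. This definition is assumed here rather than quoted; it is at least consistent with the reported pairs (for example, 0.60 and 0.75 stand in roughly the ratio of the health.Low and medical.Mid counts, 7.34 to 9.20). Since per-document values are only listed for stock.Low and record.Low (Table 3-6), the sketch uses that pair purely to show the arithmetic, even though it is not a frequent 2-itemset.

```python
# Membership values per document (d1..d10), copied from Table 3-6.
STOCK_LOW = [1.67, 2.00, 2.00, 0.00, 0.00, 0.00, 0.00, 1.33, 0.00, 0.00]
RECORD_LOW = [2.00, 2.00, 0.00, 0.00, 0.00, 2.00, 0.00, 0.00, 2.00, 0.00]

def confidence(antecedent, consequent):
    """Assumed fuzzy confidence of 'if antecedent then consequent':
    sum of per-document minima divided by the antecedent's count."""
    joint = sum(min(a, c) for a, c in zip(antecedent, consequent))
    base = sum(antecedent)
    return joint / base if base else 0.0

def rule_pair_is_strong(x, y, min_conf):
    """A rule pair passes only if both directions reach the minimum confidence."""
    return confidence(x, y) >= min_conf and confidence(y, x) >= min_conf

print(round(confidence(STOCK_LOW, RECORD_LOW), 2))   # stock.Low -> record.Low
print(round(confidence(RECORD_LOW, STOCK_LOW), 2))   # record.Low -> stock.Low
print(rule_pair_is_strong(STOCK_LOW, RECORD_LOW, 0.60))
```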

In the proposed algorithm, we estimate the strength of association among key terms in the document set by using confidence values. Co-occurring key terms carry useful information, because terms that co-occur frequently tend to be used together. Thus, our algorithm computes the confidence values of a rule pair to check the strong association of the key terms (t1, t2,…, tq) of the fuzzy frequent q-itemsets. Take the candidate cluster $c^{2}_{(stock, profit)}$ as an example. Since the confidence values of the rule pair “If stock = Low, then profit = Low” and “If profit = Low, then stock = Low” are both larger than the minimum confidence value 60%, $c^{2}_{(stock, profit)}$ is put in the candidate cluster set $C_D$. Finally, the candidate cluster set $C_D = \{c^{1}_{(stock)}, c^{1}_{(record)}, c^{1}_{(profit)}, c^{1}_{(medical)}, c^{1}_{(treatment)}, c^{1}_{(health)}, c^{2}_{(stock, profit)}, c^{2}_{(medical, health)}, c^{2}_{(treatment, health)}\}$ will be output.

3.3 Stage 3: The Cluster Tree Construction

The candidate cluster set generated by the previous steps can be viewed as a set of topics with their corresponding sub-topics in the document set. In this stage, we first construct the Document-Term Matrix (DTM) and the Term-Cluster Matrix (TCM) to derive the Document-Cluster Matrix (DCM) for assigning each document to a fitting cluster, such that each $c^{q}_{i}$ contains a subset of documents. For the documents in each $c^{q}_{i}$, the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. We call each $c^{q}_{i}$ a target cluster in the following. Based on the assignment result, we will find the set of target clusters $C_D = \{c^{1}_{1}, c^{1}_{2}, \ldots, c^{q}_{i}, \ldots, c^{q}_{f}\}$, and then use these target clusters to form a hierarchical tree for the document set D.

To avoid a cluster tree that includes too many clusters, we use a tree pruning method to prune unnecessary clusters.

3.3.1 Building the Document-Cluster Matrix (DCM)

First, consider each candidate cluster $c^{q}_{(\tau)} = c^{q}_{(t_1, t_2, \ldots, t_q)}$ with fuzzy frequent itemset τ. τ will be regarded as a reference point for generating a target cluster. Then, to represent the degree of importance of document di in a candidate cluster $c^{q}_{l}$, an n × k Document-Cluster Matrix will be constructed to calculate the similarity of terms in di and $c^{q}_{l}$. To achieve this goal, we have to define two matrices in Definition 3.8 and Definition 3.9, namely the Document-Term Matrix and the Term-Cluster Matrix, respectively. Finally, based on Definitions 3.8 and 3.9, we can define the Document-Cluster Matrix (DCM) of a document set D in Definition 3.10.

Definition 3.8: A Document-Term Matrix (DTM), denoted $W = [w_{ij}^{max\text{-}R_j}]$, for a document set D, is an n × p matrix, such that $w_{ij}^{max\text{-}R_j}$ is the weight (fuzzy membership value of the maximum region) of term tj in document di, where tj ∈ L1, and can be calculated from Steps 4 and 5 of Algorithm 3.2. A formal illustration of DTM can be found in Figure 3-5.

Figure 3-5: A formal illustration of Document-Term Matrix.

Definition 3.9: A Term-Cluster Matrix (TCM), denoted $G = [g_{jl}^{max\text{-}R_j}]$, for a document …

In Formula (3.7), $w_{ij}^{max\text{-}R_j}$ is the weight (fuzzy membership value of the maximum region) of term tj in document $d_i \in c^{q}_{l}$, and λ is the minimum confidence value.

Each $g_{jl}^{max\text{-}R_j}$ of the TCM represents the degree of importance of key term tj in a candidate cluster $c^{q}_{(\tau)}$ by referring to those documents that include τ. To reduce the dimension, only key terms that appear in L1 are used in the TCM. A formal illustration of TCM can be found in Figure 3-6.

Figure 3-6: A formal illustration of Term-Cluster Matrix.

Definition 3.10: A Document-Cluster Matrix (DCM) for a document set D of n documents is the inner product of its DTM and TCM. It is an n × k matrix, and can be defined as $V = W \cdot G = [v_{il}]$, where $v_{il} = \sum_{j=1}^{p} w_{ij}^{max\text{-}R_j} \, g_{jl}^{max\text{-}R_j}$.

A formal illustration of DCM can be found in Figure 3-7.


Figure 3-7: A formal illustration of Document-Cluster Matrix.
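Since the DCM is described as the inner product of an n × p DTM and a p × k TCM, its construction amounts to an ordinary matrix product. The Python sketch below shows this with plain lists; the small matrices W and G are hypothetical values chosen for easy checking, not values from the running example.

```python
def document_cluster_matrix(dtm, tcm):
    """Return V = W x G, where v[i][l] = sum over j of w[i][j] * g[j][l].

    dtm is an n x p list of rows (documents by key terms);
    tcm is a p x k list of rows (key terms by candidate clusters).
    """
    p = len(tcm)
    k = len(tcm[0])
    if any(len(row) != p for row in dtm):
        raise ValueError("DTM column count must equal TCM row count")
    return [
        [sum(row[j] * tcm[j][l] for j in range(p)) for l in range(k)]
        for row in dtm
    ]

# Hypothetical weights: 3 documents x 2 key terms, 2 key terms x 2 clusters.
W = [[2.0, 1.0],
     [0.0, 2.0],
     [1.0, 0.0]]
G = [[0.5, 0.0],
     [0.25, 1.0]]

V = document_cluster_matrix(W, G)
for row in V:
    print(row)
```

Each row of V then holds one document's degree of importance in every candidate cluster.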

3.3.2 Building the Hierarchical Cluster Tree

Based on the obtained DCM, each document can be assigned to only one target cluster by using the following rules.

Rule 1. Each element $v_{il}$ of the DCM represents the degree of importance of document di in a candidate cluster $c^{1}_{l}$. For each document di (the row i of the DCM) … assigned to a target cluster $c^{1}_{l}$, such that its fuzzy frequent itemset τ has the highest count value. Notice that when q = 1, the count value is max-countj (refer to Step 3 in Algorithm 3.2).
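One plausible reading of the assignment can be sketched as follows. The sketch assumes that each document goes to the cluster with the largest value in its DCM row, and that ties are broken in favor of the cluster whose fuzzy frequent itemset has the highest count value. The first criterion is an assumption of this sketch; the DCM rows and count values are hypothetical.

```python
def assign_documents(dcm, cluster_counts):
    """Assign each document (a DCM row) to exactly one cluster index.

    Assumed rule: pick the largest v[i][l]; among equal values, prefer the
    cluster whose itemset has the larger count value.
    """
    assignment = []
    for row in dcm:
        best = max(range(len(row)), key=lambda l: (row[l], cluster_counts[l]))
        assignment.append(best)
    return assignment

# Hypothetical DCM (3 documents x 2 clusters) and itemset count values.
V = [[1.25, 1.0],
     [0.5, 2.0],
     [1.0, 1.0]]   # tie in the last row
COUNTS = [7.0, 8.0]

print(assign_documents(V, COUNTS))
```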

After assigning each document to the best fitting cluster, the resulting tree can be formed as a foundation for pruning and a natural structure for browsing. The cluster tree built by F2IHC algorithm has the following eight features:

1. The cluster tree is built in a top-down fashion, which is different from the cluster tree obtained in a bottom-up fashion by FIHC.

2. Each child target cluster has exactly one parent target cluster.

3. The topic of a parent target cluster is more general than the topic of its children target clusters. The nodes become more and more specialized as they get closer to the leaf nodes.

4. A parent target cluster and its children target clusters are “similar” to a certain degree.

5. Each target cluster employs one fuzzy frequent q-itemset τ as its cluster label.

6. The root node of the cluster tree appears at level 0, and is tagged with the cluster label “all”.

7. Each target cluster with its fuzzy frequent q-itemset appears in the level q of the tree.

8. The depth of the cluster tree is the same as the maximum size of fuzzy frequent itemsets.

3.3.3 Tree Pruning

… used, the target cluster tree would become broad and deep. The documents with the same topic may be spread over several small target clusters, which would cause low document clustering accuracy. In order to generate a natural hierarchical cluster tree for higher document clustering accuracy and for easy browsing, a tree pruning method is used for merging similar target clusters at level 1. This method employs Definition 3.11 to compute the inter-cluster similarity between two target clusters. In the following, the minimum Inter-Sim will be used as a threshold δ to decide whether two target clusters should be merged.

Definition 3.11: The inter-cluster similarity between two target clusters $c^{1}_{x}$ and $c^{1}_{y}$, … respectively. The range of Inter-Sim is [0, 1]. If the Inter-Sim value is close to 1, then both clusters are regarded as nearly the same.

The objective of sibling merging is to shrink a tree by merging similar target clusters at level 1 for attaining high document clustering accuracy. The inter-cluster similarity is calculated for each pair of target clusters at level 1 of the tree. The target cluster pair with the highest Inter-Sim value is merged, and this is repeated until the Inter-Sim values of all target cluster pairs at level 1 are less than the minimum Inter-Sim threshold δ.
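The merging loop described above can be sketched in Python. The loop itself follows the text: find the most similar pair at level 1, merge it, and stop once no pair reaches δ. Everything else is an assumption of the sketch: each cluster is represented by a set of descriptive terms, and a Jaccard coefficient stands in for the Inter-Sim measure of Definition 3.11, which it does not reproduce.

```python
from itertools import combinations

def jaccard(a, b):
    """Stand-in similarity in [0, 1]; NOT the Inter-Sim of Definition 3.11."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_siblings(clusters, delta, inter_sim=jaccard):
    """Repeatedly merge the most similar pair until every pair is below delta."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        # Score every pair of sibling clusters and keep the best one.
        (x, y), best = max(
            (((i, j), inter_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < delta:
            break
        clusters[x] |= clusters[y]   # merge y into x (x < y always holds)
        del clusters[y]
    return clusters

# Hypothetical level-1 clusters, each described by a set of terms.
level1 = [{"stock", "profit"},
          {"stock", "profit", "record"},
          {"medical", "health"}]

print(merge_siblings(level1, delta=0.5))
```

Passing the similarity as a parameter keeps the loop independent of the particular measure, so the actual Inter-Sim function can be substituted for the stand-in.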

Algorithm 3.3 shown in Figure 3-8 is used to assign each document to the best
