
CHAPTER 2 Review of Related Works

2.3 The F

Frequent pattern mining is one of the most important data mining problems. The initial solution to the problem of association rule mining was given by Agrawal et al. [2] in the form of the Apriori algorithm, which is based on a level-wise candidate generation-and-test methodology. However, because the database can be very large, it is very costly to scan it repeatedly to count the supports of candidate itemsets. This limitation of the Apriori algorithm was overcome by Han et al. [13], who proposed the frequent pattern (FP) tree structure and the FP-growth algorithm. Their approach mines frequent itemsets efficiently without generating candidate itemsets, and it scans the original transaction database only twice. The mining algorithm consists of two phases: the first constructs an FP-tree structure, and the second recursively mines the frequent itemsets from that structure. The two phases of the FP-growth algorithm [13] are described below.

2.3.1 Construction of FP-tree

The FP-tree is a compact tree structure that stores the frequent items of a database. A Header_Table is also built as an index table; it keeps not only the frequent items but also their occurrence frequencies. Each item in the Header_Table points to its first occurrence in the tree through a node-link, and nodes with the same item in the tree are also connected in a sequence. These links make it fast to trace all nodes of the same item in the tree, which is needed to generate the frequent itemsets efficiently. The construction of the FP tree requires two database scans. The first scan identifies all frequent (or large) items. In the second scan, the frequent items within each transaction are sorted in descending order of their occurrence frequencies. The construction process is then executed tuple by tuple, from the first transaction to the last one. After all transactions in the database have been processed, the FP tree is completely constructed.

Here, an example shown in Table 2.1 is used to illustrate the construction process. The minimum support threshold is set at 50%.

Table 2.1. Five transactions in database

TID   Items
1     a, b, e, f, g, h, m
2     a, c, g, k, m
3     a, b, d, h
4     a, c, e, g, m
5     b, c, k, h, m

The database in Table 2.1 is then scanned to calculate the occurrence frequencies (count) of items and check whether the count of items is larger than or equal to the minimum count, which is 5*50% (= 2.5). The results are shown in Table 2.2, in which the large items are marked in green color.

Table 2.2. All items with their counts

Item  Count   Item  Count
a     4       f     1
b     3       g     3
c     3       h     3
d     1       k     2
e     2       m     4

According to the results in Table 2.2, the infrequent items are eliminated from the original database in Table 2.1. The remaining frequent items in each transaction are sorted by their counts in descending order. The updated transactions are shown in Table 2.3.

Table 2.3. The updated transactions in database

TID   Items
001   a, m, b, g, h
002   a, m, c, g
003   a, b, h
004   a, m, c, g
005   m, b, c, h
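The two scans described above can be reproduced with a short Python sketch. It is only an illustration of the procedure, not the original implementation; the variable names are hypothetical, and the alphabetical tie-break among items with equal counts is an assumption chosen because it reproduces Table 2.3.

```python
from collections import Counter

# Transactions of Table 2.1 (TID -> items).
transactions = {
    1: ["a", "b", "e", "f", "g", "h", "m"],
    2: ["a", "c", "g", "k", "m"],
    3: ["a", "b", "d", "h"],
    4: ["a", "c", "e", "g", "m"],
    5: ["b", "c", "k", "h", "m"],
}
min_count = len(transactions) * 0.5  # 5 * 50% = 2.5

# First scan: count every item (Table 2.2).
counts = Counter(item for items in transactions.values() for item in items)

# Keep the frequent items, ordered by descending count.
# Assumption: ties are broken alphabetically, which matches Table 2.3.
frequent = sorted(
    (item for item, c in counts.items() if c >= min_count),
    key=lambda item: (-counts[item], item),
)
rank = {item: position for position, item in enumerate(frequent)}

# Second scan: drop infrequent items and reorder each transaction (Table 2.3).
updated = {
    tid: sorted((item for item in items if item in rank), key=rank.get)
    for tid, items in transactions.items()
}

for tid, items in updated.items():
    print(tid, ", ".join(items))
```

Under these assumptions the frequent items come out in the order a, m, b, c, g, h, and the five updated transactions agree with Table 2.3.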

The updated transactions in Table 2.3 are used to construct an FP tree tuple by tuple, from the first transaction to the last one. Two cases are checked for each processed transaction. In the first case, the order of the items exactly matches an existing tree path, and the count of each node in that path is incremented by 1. In the second case, the processed transaction does not match an existing tree path, so a new tree path is built and the count of each node in the newly built path is set to 1. After all transactions have been processed, the Header_Table and the FP tree are as shown in Figure 2.1.

Figure 2.1. The constructed FP tree
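The insertion procedure can be sketched in Python as follows. This is a minimal illustration rather than the implementation of [13]: the class `Node` and the helpers `build_fp_tree`, `linked_counts`, and `show` are hypothetical names, and the code handles the general situation in which a transaction shares only a prefix with an existing path (shared nodes are incremented, the remainder becomes a new path), which covers both cases described above.

```python
class Node:
    """One FP-tree node: an item, its count, its children, and a node-link."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.next = None  # next node carrying the same item (node-link)


def build_fp_tree(sorted_transactions):
    """Insert the sorted transactions one by one; return (root, header)."""
    root, header, last = Node(None), {}, {}
    for items in sorted_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:  # no matching node: extend the tree
                child = Node(item, node)
                node.children[item] = child
                if item in last:
                    last[item].next = child  # chain nodes of the same item
                else:
                    header[item] = child  # first occurrence of the item
                last[item] = child
            child.count += 1  # matched or new, the node counts this transaction
            node = child
    return root, header


def linked_counts(header, item):
    """Follow the node-links of an item and collect the node counts."""
    counts, node = [], header.get(item)
    while node is not None:
        counts.append(node.count)
        node = node.next
    return counts


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


# Updated transactions of Table 2.3.
table_2_3 = [list("ambgh"), list("amcg"), list("abh"), list("amcg"), list("mbch")]
root, header = build_fp_tree(table_2_3)
print("\n".join(show(root)))
print(linked_counts(header, "h"))
```

In this sketch the Header_Table is reduced to a dictionary from each item to its first node; the occurrence frequencies kept in the real Header_Table can be recovered by summing `linked_counts(header, item)`.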

2.3.2 Mining of Frequent Itemsets

After the construction of an FP tree, the complete set of frequent itemsets can be discovered by the FP-growth mining approach [11]. The FP-growth algorithm is more efficient and scalable than the Apriori algorithm [2] since candidate itemsets do not need to be generated level by level. The FP-growth algorithm recursively mines frequent itemsets one item at a time, bottom-up from the Header_Table. A conditional FP tree is generated for each frequent item, and the frequent itemsets containing the processed item are recursively derived from that tree. Below, the constructed FP tree in Figure 2.1 is used to illustrate the FP-growth procedure.

The frequent items in the Header_Table in Figure 2.1 are processed one by one from bottom to top. In this example, item {h} is processed first. The three prefix paths for item {h} are {a: 4, m: 3, b: 1, g: 1, h: 1}, {a: 4, b: 1, h: 1} and {m: 1, b: 1, c: 1, h: 1}. The counts of all nodes in the first path are updated to 1 since they appear only once together with item {h: 1} in the branch. The counts of all nodes in the second and third paths are updated in the same way. Thus, the three converted prefix paths are {a: 1, m: 1, b: 1, g: 1, h: 1}, {a: 1, b: 1, h: 1} and {m: 1, b: 1, c: 1, h: 1}. The counts of the remaining items in the three prefix paths are then summed to check whether they are larger than or equal to the minimum count, which is 2.5. In this example, only item {b} is a large item together with item {h}, with a count of 3. The conditional FP tree for item {h} is shown in Figure 2.2, and the frequent itemsets generated for item {h} are {h: 3} and {bh: 3}.

[Figure: prefix paths with {h} (left) and the conditional FP-tree with {h} (right), which contains the single path b: 3, h: 3.]

Figure 2.2. The conditional FP-tree for {h}
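The counting for item {h} can be checked with a few lines of Python. As a simplifying assumption, the sketch reads the prefix paths directly from the sorted transactions of Table 2.3 instead of following node-links in the tree; both give the same converted prefix paths because every transaction that contains {h} contributes exactly one path with count 1. The function name `prefix_paths` is hypothetical.

```python
from collections import Counter

# Updated transactions of Table 2.3.
table_2_3 = [list("ambgh"), list("amcg"), list("abh"), list("amcg"), list("mbch")]
min_count = 2.5


def prefix_paths(transactions, item):
    """Items that precede `item` in every transaction containing it."""
    return [t[: t.index(item)] for t in transactions if item in t]


paths = prefix_paths(table_2_3, "h")

# Sum the counts of the remaining items over the converted prefix paths.
support = Counter(i for path in paths for i in path)
frequent_with_h = {i: c for i, c in support.items() if c >= min_count}

print(paths)
print(dict(support))
print(frequent_with_h)
```

Only item b reaches the minimum count, which agrees with the itemsets {h: 3} and {bh: 3} derived above.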

Next, item {g} is processed. There are two converted prefix paths for item {g}: {a: 1, m: 1, b: 1, g: 1} and {a: 2, m: 2, c: 2, g: 2}. The counts of the items in the paths are summed, and the counts of items {a} and {m} (3 each) are larger than the minimum count. The set of large items for the conditional FP tree of item {g} is thus {a, m}. The conditional FP-tree for item {g} is shown in Figure 2.3.

[Figure: prefix paths with {g} (left) and the conditional FP-tree with {g} (right), which contains the single path a: 3, m: 3, g: 3.]

Figure 2.3. The conditional FP-tree for {g}

The frequent patterns with item {g} are generated as {g: 3}, {mg: 3} and {ag: 3}. A conditional FP tree is then recursively constructed for {mg: 3} and {ag: 3}, in that order. The prefix path for {mg: 3} is {a: 3}. The conditional FP tree for itemset {mg} is shown in Figure 2.4. The large itemsets with {mg} are {mg: 3} and {amg: 3}. Because there are no prefix paths for itemset {amg}, the recursive procedure for itemset {mg} is then completed.

[Figure: the conditional FP-tree with {g} (a: 3, m: 3, g: 3) and the conditional FP-tree with {mg} (a: 3, mg: 3).]

Figure 2.4. The conditional FP tree for {mg}

After itemset {mg} has been processed, the recursive procedure continues with itemset {ag}. Since there are no prefix paths for itemset {ag}, the recursive process for item {g} is then completed. The derived frequent itemsets for item {g} are {g: 3}, {mg: 3}, {ag: 3} and {amg: 3}. The above recursive procedure is repeated for the other items in the Header_Table until all items have been processed.
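The recursion can be summarized in a compact Python sketch. It is a simplification under stated assumptions, not the algorithm as published: instead of materializing each conditional FP tree, the hypothetical function `fp_growth` works on the equivalent conditional pattern base, a list of (prefix path, count) pairs, and it assumes that every path lists its items in Header_Table order.

```python
from collections import Counter


def fp_growth(pattern_base, min_count, suffix=()):
    """Return {itemset: count} for all frequent itemsets ending in `suffix`.

    pattern_base is a list of (path, count) pairs; each path lists its items
    in Header_Table order, so a prefix only holds higher-ranked items.
    """
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count

    results = {}
    for item, total in support.items():
        if total < min_count:
            continue  # infrequent in this conditional base
        itemset = (item,) + suffix
        results[itemset] = total
        # Conditional pattern base of `itemset`: the prefixes before `item`.
        conditional = [
            (path[: path.index(item)], count)
            for path, count in pattern_base
            if item in path
        ]
        results.update(fp_growth(conditional, min_count, itemset))
    return results


# Updated transactions of Table 2.3, each with count 1.
table_2_3 = [list("ambgh"), list("amcg"), list("abh"), list("amcg"), list("mbch")]
itemsets = fp_growth([(t, 1) for t in table_2_3], min_count=2.5)

with_g = {"".join(k): v for k, v in itemsets.items() if "g" in k}
print(with_g)
```

For item {g} the sketch yields the four itemsets g, ag, mg and amg, each with count 3, matching the result derived above. A full implementation would merge equal prefixes into a tree to save memory; the pattern-base form is used here only because it is shorter.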

Several other algorithms based on the FP-tree structure have been proposed. Qiu et al. proposed the QFP-growth mining approach to mine association rules [40]. Ezeife et al. constructed a generalized FP tree, which stores all frequent and infrequent items, for incremental mining without rescanning databases [10]. Much related research is still in progress on efficiently discovering the desired information [1, 11, 23, 28, 35, 39, 42].

CHAPTER 3 Multiple Fuzzy FP-tree Algorithm

In this chapter, the multiple fuzzy FP-tree (abbreviated as MFFP-tree) algorithm is proposed to keep fuzzy frequent regions whether they are generated from the same item or not. The MFFP-tree structure is used to efficiently handle quantitative data with multiple fuzzy regions of an item (term). The notation used in the proposed MFFP-tree algorithm is shown below.

3.1 Notation

D          the original quantitative database;
n          the number of transactions in D;
T_i        the i-th transaction in D, 1 ≤ i ≤ n;
m          the number of items in D;
I_j        the j-th item, 1 ≤ j ≤ m;
h_j        the number of fuzzy regions for I_j;
R_jl       the l-th fuzzy region of I_j, 1 ≤ l ≤ h_j;
v_ij       the quantitative value of I_j in T_i;
f_ijl      the membership value of v_ij in region R_jl;
count_jl   the count of the fuzzy region R_jl in D;
s          the predefined minimum support threshold.

3.2 The MFFP-tree Construction Algorithm

INPUT: A quantitative database consisting of n transactions, a set of membership functions, and a predefined minimum support threshold s.

OUTPUT: A multiple fuzzy FP tree (MFFP tree).

STEP 1: Transform the quantitative value v_ij of each item I_j in the i-th transaction into a fuzzy set f_ij represented as (f_ij1/R_j1 + f_ij2/R_j2 + … + f_ijh/R_jh) using the given membership functions, where h is the number of fuzzy regions for I_j, R_jl is the l-th fuzzy region of I_j, 1 ≤ l ≤ h, and f_ijl is v_ij's fuzzy membership value in region R_jl. Note that f_ijl/R_jl means that the membership value of region R_jl is f_ijl.

STEP 2: Calculate the scalar cardinality count_jl of each fuzzy region R_jl in the transactions as:

$$\mathit{count}_{jl} = \sum_{i=1}^{n} f_{ijl}$$

STEP 3: Check whether the value count_jl of each fuzzy region R_jl is larger than or equal to the predefined minimum count n*s. If the count of a fuzzy region R_jl is equal to or greater than the minimum count, treat it as a fuzzy frequent itemset and put it in the set L₁. That is:

L₁ = {R_jl | count_jl ≥ n*s, 1 ≤ j ≤ m}.

STEP 4: Build the Header_Table by sorting the fuzzy regions (fuzzy frequent itemsets) in L₁ in descending order of their fuzzy values.

STEP 5: Remove the fuzzy regions of the items not existing in L₁ from the transactions of the transformed database.

STEP 6: Sort the remaining fuzzy regions in descending order of their fuzzy values in each transaction.

STEP 7: Initially set the root node of the MFFP tree as {root}.

STEP 8: Insert the transactions of the transformed database into the MFFP tree tuple by tuple. The following two cases may exist.

Substep 8-1: If a fuzzy region R_jl in a transaction is at the corresponding branch of the MFFP tree, add the fuzzy value f_ijl of R_jl in the processed transaction to the node of R_jl in the branch.

Substep 8-2: Otherwise, add a node of R_jl at the end of the corresponding branch, set the count of the node to the fuzzy value f_ijl of R_jl, and connect the node of R_jl in the last branch with the current node as a sequence. If there is no such branch with a node of R_jl, insert a node-link from the entry of R_jl in the Header_Table to the added node.

In STEP 8, a corresponding branch is the branch built in the MFFP tree according to the fuzzy regions of the transformed transaction, sorted in descending order of their fuzzy values. After STEP 8, the final MFFP tree is built.

3.3 An Example of the MFFP-tree Construction Algorithm

Below, an example is given to illustrate how to construct an MFFP tree from the quantitative transaction data shown in Table 3.1. The data consist of 6 transactions and 5 items, denoted A to E. The minimum support threshold s is set to 30%.

Table 3.1. Six transactions with purchased items and their quantitative values

TID   Items
1     (A:5) (C:10) (D:2) (E:9)
2     (A:8) (B:2) (C:3)
3     (B:3) (C:9)
4     (A:7) (C:9) (D:3)
5     (A:5) (B:2) (C:4)
6     (A:3) (C:11) (D:2) (E:2)

Assume that the fuzzy membership functions are the same for all items, as shown in Figure 3.1. In this example, amounts are represented by three fuzzy regions: {Low}, {Middle}, and {High}. Thus, three fuzzy membership values are produced for each item in a transaction according to the predefined membership functions in Figure 3.1.

Note that the proposed approach also works when the membership functions of the amounts for the items are not the same.

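[Figure: membership value (vertical axis, maximum 1) against amount (horizontal axis, marked at 0, 1, 6, and 11) for the three fuzzy regions Low, Middle, and High.]

Figure 3.1. Membership functions used in the example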
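Membership functions of this kind can be written as a small Python function. The exact shapes are an assumption: the sketch uses shoulder and triangular functions with breakpoints at the amounts 1, 6, and 11 marked on the axis of Figure 3.1, so that Low falls linearly from 1 to 0 between 1 and 6, Middle peaks at 6, and High rises from 0 to 1 between 6 and 11. This choice is consistent with the membership values quoted in this example (an amount of 5 has degree 0.2 in Low), but the curves in the original figure may differ. The name `memberships` is hypothetical, and exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction


def memberships(amount):
    """Return {region: degree} for the regions with a non-zero degree.

    Assumed shapes: breakpoints at 1, 6 and 11, linear in between.
    """
    x = Fraction(amount)
    zero, one = Fraction(0), Fraction(1)
    degrees = {
        "Low": min(one, max(zero, (6 - x) / 5)),
        "Middle": max(zero, 1 - abs(x - 6) / 5),
        "High": min(one, max(zero, (x - 6) / 5)),
    }
    return {region: d for region, d in degrees.items() if d > 0}


for amount in (2, 5, 9, 11):
    print(amount, {r: float(d) for r, d in memberships(amount).items()})
```

For example, an amount of 5 maps to Low with degree 0.2 and Middle with degree 0.8 under these assumed functions.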

The MFFP tree for this example is thus constructed using the proposed approach as follows.

STEP 1: The quantitative values of the items in the transactions are represented as fuzzy sets using the membership functions shown in Figure 3.1. Take item {A} in transaction 1 as an example to illustrate the procedure. The amount “5” of {A} is converted into a fuzzy set over the regions {A.Low}, {A.Middle}, and {A.High}. This step is repeated for the other items, and the results are shown in Table 3.2.

Table 3.2. Fuzzy sets transformed from Table 3.1

STEP 2: The scalar cardinality of each fuzzy region in the transactions is calculated as its count value. Take the fuzzy region {A.Low} as an example to explain the procedure. {A.Low} appears in transactions 1, 5, and 6, and its scalar cardinality is calculated as (0.2 + 0.2 + 0.6) (= 1.0). This step is repeated for the other regions; the results are shown in Table 3.3.

Table 3.3. Counts of fuzzy regions

Item Count Item Count Item Count

STEP 3: The count of each fuzzy region is then checked against the predefined minimum count, which is calculated as (6 * 0.3) (= 1.8). For example, the counts for {A.Low}, {A.Middle}, and {A.High} are 1.0, 3.4, and 0.6, respectively. Since the count for {A.Middle} is larger than the minimum count, {A.Middle} is kept for the subsequent mining process. The fuzzy regions that satisfy the threshold are considered fuzzy frequent itemsets and are kept in the set L₁ for later building the MFFP tree. Thus, L₁ = {A.Middle: 3.4, B.Low: 2.2, C.Middle: 2.0, C.High: 3.0, D.Low: 2.2}. The results are shown in Table 3.4.

Table 3.4. Counts of fuzzy frequent regions

Fuzzy regions   Count
A.Middle        3.4
B.Low           2.2
C.Middle        2.0
C.High          3.0
D.Low           2.2
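The counts in Table 3.4 can be re-derived from Table 3.1 with the following Python sketch. It relies on an assumption that is not stated in the text: the membership functions are taken to be linear with breakpoints at the amounts 1, 6, and 11 shown on the axis of Figure 3.1. The helper `memberships` and the variable names are hypothetical. Exact fractions are used because 6 * 0.3 is not exactly 1.8 in binary floating point, which could matter for counts that sit right at the threshold.

```python
from fractions import Fraction

# Quantitative transactions of Table 3.1 (TID -> {item: amount}).
table_3_1 = {
    1: {"A": 5, "C": 10, "D": 2, "E": 9},
    2: {"A": 8, "B": 2, "C": 3},
    3: {"B": 3, "C": 9},
    4: {"A": 7, "C": 9, "D": 3},
    5: {"A": 5, "B": 2, "C": 4},
    6: {"A": 3, "C": 11, "D": 2, "E": 2},
}


def memberships(amount):
    """Assumed Low/Middle/High functions with breakpoints at 1, 6 and 11."""
    x = Fraction(amount)
    zero, one = Fraction(0), Fraction(1)
    degrees = {
        "Low": min(one, max(zero, (6 - x) / 5)),
        "Middle": max(zero, 1 - abs(x - 6) / 5),
        "High": min(one, max(zero, (x - 6) / 5)),
    }
    return {region: d for region, d in degrees.items() if d > 0}


min_count = len(table_3_1) * Fraction(3, 10)  # n * s = 6 * 0.3 = 1.8

# STEP 2: scalar cardinality of every fuzzy region.
counts = {}
for items in table_3_1.values():
    for item, amount in items.items():
        for region, degree in memberships(amount).items():
            key = f"{item}.{region}"
            counts[key] = counts.get(key, 0) + degree

# STEP 3: keep the regions whose count reaches the minimum count.
L1 = {region: c for region, c in counts.items() if c >= min_count}

for region, c in sorted(L1.items(), key=lambda pair: -pair[1]):
    print(region, float(c))
```

With the assumed functions the sketch returns exactly the five regions and counts listed in Table 3.4, and it also gives 1.0 and 0.6 for {A.Low} and {A.High}, as quoted in STEP 3.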

STEP 4: The fuzzy regions in L₁ are then sorted in descending order of their counts to build the Header_Table. The results are shown in Figure 3.2.

Header_Table

Fuzzy region Count

A.Middle 3.4

C.High 3.0

B.Low 2.2

D.Low 2.2

C.Middle 2.0

Figure 3.2. The built Header_Table

STEP 5: The fuzzy regions not existing in L₁ are then removed from each transaction in Table 3.2. The results are shown in Table 3.5.

Table 3.5. The remaining fuzzy regions

STEP 6: The remaining fuzzy regions in each transaction in Table 3.5 are then sorted according to their membership values in descending order. The sorted transactions are shown in Table 3.6.

Table 3.6. The updated transactions for constructing the MFFP tree

The transactions in the transformed database in Table 3.6 are then inserted into the MFFP tree tuple by tuple. For example, the first transaction is inserted as the first branch of the tree, and the count of each node is set to the membership value of the corresponding fuzzy region. Since each node in the branch is the first one for its fuzzy region, a node-link is created to connect the fuzzy region in the Header_Table to its corresponding node. The results after the first transaction has been processed are shown in Figure 3.3.

Figure 3.3. The built MFFP tree after the first transaction has been processed

The second transaction in Table 3.6 is then processed and inserted into the MFFP tree as the second branch since it does not share a prefix path with the first transaction. The two nodes of {A.Middle} in the two branches of the MFFP tree are then connected as a sequence. The results are shown in Figure 3.4.

Figure 3.4. The built MFFP tree after the second transaction has been processed

The process is repeated for the other four transactions. After all transactions have been processed, the final results of the constructed MFFP tree and its Header_Table are then shown in Figure 3.5.

Figure 3.5. The finally constructed MFFP tree and its Header_Table
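STEPs 5 to 8 can be sketched in Python as follows. The sketch rests on assumptions that should be read with care: the membership functions are taken to be linear with breakpoints at 1, 6, and 11 (inferred from the axis of Figure 3.1), ties between equal membership values keep the original item order (Python's sort is stable), and node-links are omitted for brevity. `Node`, `sorted_regions`, `build_mffp_tree`, and `branches` are hypothetical names.

```python
from fractions import Fraction

TABLE_3_1 = [
    {"A": 5, "C": 10, "D": 2, "E": 9},
    {"A": 8, "B": 2, "C": 3},
    {"B": 3, "C": 9},
    {"A": 7, "C": 9, "D": 3},
    {"A": 5, "B": 2, "C": 4},
    {"A": 3, "C": 11, "D": 2, "E": 2},
]
L1 = {"A.Middle", "B.Low", "C.Middle", "C.High", "D.Low"}  # Table 3.4


def memberships(amount):
    """Assumed Low/Middle/High functions with breakpoints at 1, 6 and 11."""
    x = Fraction(amount)
    zero, one = Fraction(0), Fraction(1)
    degrees = {
        "Low": min(one, max(zero, (6 - x) / 5)),
        "Middle": max(zero, 1 - abs(x - 6) / 5),
        "High": min(one, max(zero, (x - 6) / 5)),
    }
    return {region: d for region, d in degrees.items() if d > 0}


def sorted_regions(transaction):
    """STEPs 5-6: keep regions in L1, sort by membership value (descending)."""
    regions = [
        (f"{item}.{region}", degree)
        for item, amount in transaction.items()
        for region, degree in memberships(amount).items()
        if f"{item}.{region}" in L1
    ]
    return sorted(regions, key=lambda pair: -pair[1])  # stable for ties


class Node:
    def __init__(self, region):
        self.region, self.count, self.children = region, Fraction(0), {}


def build_mffp_tree(transactions):
    """STEPs 7-8: insert every sorted transaction below the root."""
    root = Node("root")
    for transaction in transactions:
        node = root
        for region, degree in sorted_regions(transaction):
            node = node.children.setdefault(region, Node(region))
            node.count += degree  # Substep 8-1 (existing) or 8-2 (new node)
    return root


def branches(node, prefix=()):
    """List every root-to-leaf branch as a tuple of 'region:count' strings."""
    if not node.children:
        return [prefix]
    out = []
    for child in node.children.values():
        out += branches(child, prefix + (f"{child.region}:{float(child.count)}",))
    return out


root = build_mffp_tree(TABLE_3_1)
for branch in branches(root):
    print(" -> ".join(branch))
```

Under these assumptions the tree has five branches and four nodes for {C.Middle}. Unlike an ordinary FP tree, the branches do not follow one global item order, because each transaction is sorted by its own membership values.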

3.4 The MFFP-growth Mining Algorithm

After the MFFP tree has been constructed, the complete set of fuzzy frequent itemsets can be found using the proposed MFFP-growth mining approach. The fuzzy regions (fuzzy itemsets) in the Header_Table are processed one by one, bottom-up, to generate fuzzy frequent itemsets. The nodes of the currently processed region are found by following its node-link from the first node to the last one, and fuzzy frequent itemsets are mined recursively using the intersection operation on fuzzy sets, which is the minimum operation here. The MFFP-growth mining algorithm is as follows:

INPUT: The built MFFP tree, its corresponding Header_Table, and the pre-calculated minimum count.

OUTPUT: The desired fuzzy frequent itemsets.

STEP 1: Process the fuzzy regions (fuzzy frequent items) in the Header_Table one by one from bottom to top using the following steps. Let the currently processed fuzzy region be R_jl.

STEP 2: Find all nodes with the fuzzy region Rjl in the MFFP tree through the sequenced connection between nodes.

STEP 3: Trace the prefix and suffix paths of the currently processed fuzzy region R_jl in the MFFP tree. Extract the fuzzy regions on those paths that are at a higher position than R_jl in the Header_Table. Merge the extracted paths to recursively form the conditional MFFP tree for generating fuzzy itemsets with the currently processed fuzzy region R_jl. The minimum operation is used to get the fuzzy values of the derived fuzzy itemsets. Note that fuzzy regions associated with the same item I_j as the currently processed region R_jl cannot be combined with it into a fuzzy itemset, since such an itemset is meaningless.

STEP 4: Check whether the value count_jl of the derived fuzzy itemset is larger than or equal to the pre-calculated minimum count n*s.

STEP 5: Repeat STEPs 2 to 4 for the other fuzzy regions until all regions in the Header_Table have been processed.

After STEP 5, the desired fuzzy frequent itemsets are then derived from the built MFFP tree.

3.5 An Example of the MFFP-growth Mining Algorithm

For the built MFFP tree in Figure 3.5, the proposed MFFP-growth mining algorithm is applied to find the fuzzy frequent itemsets as follows:

STEP 1: The fuzzy regions in the Header_Table are processed one by one from bottom to top. In this example, the processing order of the fuzzy regions is {C.Middle}, {D.Low}, {B.Low}, {C.High}, and {A.Middle}. Here, the fuzzy region {C.Middle} is used as an example to illustrate the following steps.

STEP 2: The nodes with the currently processed fuzzy region {C.Middle} in the MFFP tree are found through the node-links that connect them in sequence. In this example, there are four nodes in the MFFP tree containing the fuzzy region {C.Middle}.

STEPs 3 & 4: The prefix and suffix paths of the currently processed node {C.Middle} are then found for recursively generating fuzzy frequent itemsets. Since {C.Middle} is the last node of each branch in the MFFP tree, no suffix paths can be found for {C.Middle}. The currently processed nodes of {C.Middle} are marked in red, and the prefix paths are marked in blue, in Figure 3.6.

Figure 3.6. The processed nodes {C.Middle} with its prefix paths

In this example, four prefix paths are extracted from the MFFP tree, and their fuzzy values are set to the value of the processed node of {C.Middle} in the path. Thus, the four extracted paths are {A.Middle: 0.6, C.High: 0.6, D.Low: 0.6}, {A.Middle: 0.6, B.Low: 0.6}, {B.Low: 0.4, A.Middle: 0.4} and {B.Low: 0.4, C.High: 0.4}. These paths are then merged to form the conditional MFFP tree of {C.Middle}. In this example, the conditional MFFP tree of {C.Middle} is null since no fuzzy itemset with {C.Middle} satisfies the minimum count.
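That the conditional MFFP tree of {C.Middle} is null can be verified by summing the fuzzy values of the four extracted paths. The Python sketch below is only an illustration under two assumptions: merging paths is taken to mean adding up the values of each region, and regions of the same item as the processed region ({C.High} here) are skipped, following the note in STEP 3 of the mining algorithm. The function name `conditional_counts` is hypothetical, and the path values are entered as strings so that they convert to exact fractions.

```python
from fractions import Fraction

MIN_COUNT = 6 * Fraction(3, 10)  # n * s = 1.8


def conditional_counts(paths, processed_region):
    """Sum the fuzzy values of each region over the extracted paths,
    skipping regions that belong to the same item as the processed one."""
    item = processed_region.split(".")[0]
    totals = {}
    for path in paths:
        for region, value in path.items():
            if region.split(".")[0] != item:
                totals[region] = totals.get(region, 0) + Fraction(value)
    return totals


# The four extracted prefix paths of {C.Middle} listed in the text.
paths_c_middle = [
    {"A.Middle": "0.6", "C.High": "0.6", "D.Low": "0.6"},
    {"A.Middle": "0.6", "B.Low": "0.6"},
    {"B.Low": "0.4", "A.Middle": "0.4"},
    {"B.Low": "0.4", "C.High": "0.4"},
]

totals = conditional_counts(paths_c_middle, "C.Middle")
frequent = {r: c for r, c in totals.items() if c >= MIN_COUNT}

print({r: float(c) for r, c in totals.items()})
print(frequent)
```

The sums are 1.6 for {A.Middle}, 1.4 for {B.Low}, and 0.6 for {D.Low}, all below the minimum count of 1.8, so no region survives.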

STEP 5: Next, the fuzzy region {D.Low} is processed. The currently processed nodes of {D.Low} are marked in red, and the prefix and suffix paths are marked in blue, in Figure 3.7. Note that only {A.Middle}, {C.High} and {B.Low} can be extracted from the MFFP tree since they are at a higher position than {D.Low} in the Header_Table.

Figure 3.7. The processed nodes {D.Low} with its corresponding paths

The two extracted paths for fuzzy region {D.Low} are {A.Middle: 1.4, C.High: 1.4} and {C.High: 0.8, A.Middle: 0.4}, which can be merged to form the conditional MFFP tree of {D.Low}. The results are shown in Figure 3.8.

[Figure: nodes A.Middle: 1.8, C.High: 2.2, and D.Low: 2.2.]

Figure 3.8. The conditional MFFP-tree of {D.Low}
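The merge of the two extracted paths, and the minimum operation that the MFFP-growth algorithm applies to obtain the fuzzy values of derived itemsets, can be sketched in Python. This is an illustration under assumptions, not the original implementation: merging is taken to mean summing the values of each region across paths, and the value of a 2-itemset is taken to be the minimum of the merged count and the count of {D.Low}. Values are entered as strings and converted to exact fractions, because a count of exactly 1.8 must compare as equal to the minimum count.

```python
from fractions import Fraction

MIN_COUNT = 6 * Fraction(3, 10)  # n * s = 1.8
D_LOW_COUNT = Fraction("2.2")    # count of {D.Low} in the Header_Table

# The two extracted paths of {D.Low} listed in the text.
extracted_paths = [
    {"A.Middle": "1.4", "C.High": "1.4"},
    {"C.High": "0.8", "A.Middle": "0.4"},
]

# Merge the paths by summing the fuzzy values of each region.
merged = {}
for path in extracted_paths:
    for region, value in path.items():
        merged[region] = merged.get(region, 0) + Fraction(value)

# Derive the 2-itemsets with {D.Low} using the minimum operation.
itemsets = {}
for region, count in merged.items():
    value = min(count, D_LOW_COUNT)
    if value >= MIN_COUNT:
        itemsets[(region, "D.Low")] = value

print({r: float(c) for r, c in merged.items()})
print({k: float(v) for k, v in itemsets.items()})
```

The merged counts are 1.8 for {A.Middle} and 2.2 for {C.High}, which match the node values of Figure 3.8, and both resulting 2-itemsets reach the minimum count of 1.8.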

The fuzzy frequent 2-itemsets with {D.Low} can then be generated, which are {(C.High, D.Low): min(2.2, 2.2) = 2.2} and {(A.Middle, D.Low): min(2.2, 1.8) = 1.8}. A conditional MFFP tree is recursively constructed in the sequence of {A.Middle, D.Low} and {C.High, D.Low}. The results of the conditional MFFP tree for {A.Middle,
