CHAPTER 2 Review of Related Works
2.3 The F
Frequent pattern mining is one of the most important data mining problems. The initial solution to the problem of association rule mining was given by Agrawal et al. [2] in the form of the Apriori algorithm, which is based on a level-wise candidate generation-and-test methodology. However, because the database can be very large, repeatedly scanning it to count the supports of candidate itemsets is very costly. Han et al. [13] overcame this limitation of the Apriori algorithm by proposing the frequent pattern (FP) tree structure and the FP-growth algorithm. Their approach mines frequent itemsets efficiently without generating candidate itemsets, and it scans the original transaction database only twice. The mining algorithm consists of two phases: the first constructs an FP-tree structure, and the second recursively mines the frequent itemsets from the structure. The two phases of the FP-growth algorithm [13] are described below.
2.3.1 Construction of FP-tree
The FP-tree is a compact tree structure that stores the frequent items of a database. A Header_Table is also built as an index table; it keeps not only the frequent items but also their occurrence frequencies. Each item in the Header_Table points to its first occurrence in the tree through a node-link, and nodes with the same item in the tree are also connected in a sequence. These links make it possible to quickly trace all nodes of the same item in the tree when generating the frequent itemsets. The construction of the FP tree requires two database scans. The first database scan identifies all frequent (or large) items. In the second database scan, the frequent items within each transaction are sorted in descending order of their occurrence frequencies. The construction process is then executed tuple by tuple, from the first transaction to the last one. After all transactions in the database have been processed, the FP tree is completely constructed.
Here, an example shown in Table 2.1 is used to illustrate the construction process. The minimum support threshold is set at 50%.
Table 2.1. Five transactions in database
TID Items
1 a, b, e, f, g, h, m
2 a, c, g, k, m
3 a, b, d, h
4 a, c, e, g, m
5 b, c, k, h, m
The database in Table 2.1 is then scanned to calculate the occurrence frequency (count) of each item and to check whether the count is larger than or equal to the minimum count, which is 5 * 50% (= 2.5). The results are shown in Table 2.2, in which the large items are a, b, c, g, h, and m.
Table 2.2. All items with their counts
Item Count Item Count
a 4 f 1
b 3 g 3
c 3 h 3
d 1 k 2
e 2 m 4
According to the results in Table 2.2, the infrequent items are eliminated from the original database in Table 2.1. The remaining frequent items in each transaction are sorted in descending order of their counts. The updated transactions are shown in Table 2.3.
Table 2.3. The updated transactions in database
TID Items
1 a, m, b, g, h
2 a, m, c, g
3 a, b, h
4 a, m, c, g
5 m, b, c, h
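The two database scans described above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the alphabetical tie-break among items with equal counts is an assumption that happens to reproduce the order used in Table 2.3.

```python
from collections import Counter

# Transactions from Table 2.1 (TID -> items).
transactions = {
    1: ["a", "b", "e", "f", "g", "h", "m"],
    2: ["a", "c", "g", "k", "m"],
    3: ["a", "b", "d", "h"],
    4: ["a", "c", "e", "g", "m"],
    5: ["b", "c", "k", "h", "m"],
}
min_support = 0.5
min_count = len(transactions) * min_support  # 5 * 50% = 2.5

# First scan: count how many transactions contain each item.
counts = Counter(item for items in transactions.values() for item in set(items))

# Keep only the frequent (large) items.
frequent = {item: c for item, c in counts.items() if c >= min_count}


def sort_key(item):
    # Descending count; ties broken alphabetically (an assumption).
    return (-frequent[item], item)


# Second scan: drop infrequent items and reorder each transaction.
updated = {
    tid: sorted((i for i in items if i in frequent), key=sort_key)
    for tid, items in transactions.items()
}
```

Under these assumptions, `updated` holds the same five reordered transactions as Table 2.3.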
The updated transactions in Table 2.3 are used to construct an FP tree tuple by tuple, from the first transaction to the last one. Two cases are checked when a transaction is inserted. In the first case, the ordered items of the transaction match an existing tree path (or a prefix of one); the count of each matched node in the path is then incremented by 1. In the second case, the transaction, or its remaining items, does not match an existing tree path; a new tree path is built, and the count of each node in the newly built path is set to 1. After all transactions are processed, the Header_Table and the FP tree are as shown in Figure 2.1.
Figure 2.1. The constructed FP tree
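The insertion procedure can be illustrated with a minimal Python sketch. The class and function names (`FPNode`, `build_fp_tree`) and the dictionary used for the Header_Table are hypothetical choices made for this illustration, not the implementation of [13].

```python
class FPNode:
    """One FP-tree node: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode
        self.next = None    # next node carrying the same item (node-link)


def build_fp_tree(sorted_transactions):
    root = FPNode(None)
    header = {}  # item -> first node of that item (assumed Header_Table layout)
    last = {}    # item -> most recently created node of that item
    for items in sorted_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                # No matching path: start a new node and chain its node-link.
                child = FPNode(item, parent=node)
                node.children[item] = child
                if item in last:
                    last[item].next = child
                else:
                    header[item] = child
                last[item] = child
            # Matching or newly created node: count this transaction.
            child.count += 1
            node = child
    return root, header


# Reordered transactions from Table 2.3.
table_2_3 = [
    ["a", "m", "b", "g", "h"],
    ["a", "m", "c", "g"],
    ["a", "b", "h"],
    ["a", "m", "c", "g"],
    ["m", "b", "c", "h"],
]
root, header = build_fp_tree(table_2_3)
```

In this sketch the node for {a} under the root ends with count 4 and its child {m} with count 3, which agrees with the prefix paths quoted in Section 2.3.2.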
2.3.2 Mining of Frequent Itemsets
After the construction of an FP tree, the complete set of frequent itemsets can be discovered by the FP-growth mining approach [11]. The FP-growth algorithm is more efficient and scalable than the Apriori algorithm [2] because candidate itemsets do not need to be generated level by level. The FP-growth algorithm recursively mines frequent itemsets one item at a time, bottom-up from the Header_Table. A conditional FP tree is generated for each frequent item, and the frequent itemsets containing the processed item are recursively derived from the tree. Below, the constructed FP tree in Figure 2.1 is used to illustrate the FP-growth procedure.
The frequent items in the Header_Table in Figure 2.1 are processed one by one from bottom to top. In this example, item {h} is processed first. The three prefix paths for item {h} are {a: 4, m: 3, b: 1, g: 1, h: 1}, {a: 4, b: 1, h: 1} and {m: 1, b: 1, c: 1, h: 1}. The counts of all nodes in the first path are then updated to 1, since they appear only once together with {h: 1} in the branch. The counts of all nodes in the second and third paths are updated in the same way. Thus, the three converted prefix paths are {a: 1, m: 1, b: 1, g: 1, h: 1}, {a: 1, b: 1, h: 1} and {m: 1, b: 1, c: 1, h: 1}. The counts of the remaining items in the three prefix paths are then summed to check whether they are larger than or equal to the minimum count, which is 2.5. In this example, only item {b}, with a summed count of 3, is a large item together with {h: 3}. The conditional FP tree for item {h} is shown in Figure 2.2, and the frequent itemsets generated for item {h} are {h: 3} and {bh: 3}.
Figure 2.2. The conditional FP-tree for {h} (panels: prefix path with {h}; conditional FP-tree with {h})
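The summation over the converted prefix paths of {h} can be checked with a short Python sketch. The data layout (a list of path and count pairs) and the variable names are assumptions made for illustration only.

```python
from collections import Counter

min_count = 2.5

# Converted prefix paths of {h}, each paired with the count of its {h} node.
prefix_paths = [
    (["a", "m", "b", "g"], 1),
    (["a", "b"], 1),
    (["m", "b", "c"], 1),
]

# Sum the counts of every item that co-occurs with {h}.
support = Counter()
for path, h_count in prefix_paths:
    for item in path:
        support[item] += h_count

h_total = sum(count for _, count in prefix_paths)

# Items that stay large in the conditional FP tree of {h}.
conditional_items = {i: c for i, c in support.items() if c >= min_count}

# Frequent itemsets that contain {h}.
frequent_itemsets = {("h",): h_total}
for item, count in conditional_items.items():
    frequent_itemsets[(item, "h")] = count
```

Only {b} reaches the minimum count here, so the sketch yields the two itemsets {h: 3} and {bh: 3} stated in the text.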
Next, item {g} is processed. There are two converted prefix paths for item {g}: {a: 1, m: 1, b: 1, g: 1} and {a: 2, m: 2, c: 2, g: 2}. The counts of the items in the paths are summed, and the counts of items {a} and {m} (3 each) are larger than the minimum count. The set of large items for the conditional FP tree of item {g} is thus {a, m}. The conditional FP-tree for item {g} is shown in Figure 2.3.
Figure 2.3. The conditional FP-tree for {g} (panels: prefix path with {g}; conditional FP-tree with {g})
The frequent patterns with item {g} are generated as {g: 3}, {mg: 3} and {ag: 3}. A conditional FP tree is then recursively constructed for {mg: 3} and {ag: 3}, in that order. The prefix path for {mg: 3} is {a: 3}. The conditional FP tree for itemset {mg} is shown in Figure 2.4. The large itemsets with {mg} are {mg: 3} and {amg: 3}. Because itemset {amg} has no prefix paths, the recursive procedure for itemset {mg} is then completed.
Figure 2.4. The conditional FP tree for {mg} (panels: conditional FP-tree with {g}; conditional FP-tree with {mg})
After itemset {mg} has been processed, the recursive procedure is applied to itemset {ag}. Since itemset {ag} has no prefix paths, the recursive process for item {g} is then completed. The derived frequent itemsets for item {g} are {g: 3}, {mg: 3}, {ag: 3} and {amg: 3}. The above recursive procedure is repeated for the other items in the Header_Table until all items are processed.
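The recursion for item {g} can be sketched in Python as follows. For brevity, this sketch recurses directly on conditional pattern bases (lists of prefix paths with counts) instead of materializing each conditional FP tree; that simplification, and the name `mine`, are assumptions of the illustration rather than the procedure of [13].

```python
from collections import Counter


def mine(pattern_base, suffix, min_count, out):
    """Recursively grow itemsets from a conditional pattern base.

    pattern_base: list of (path, count), each path ordered from root to leaf.
    suffix: tuple of items that every generated itemset must contain.
    """
    # Sum the counts of each item over all paths.
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count

    for item, total in support.items():
        if total < min_count:
            continue
        itemset = (item,) + suffix
        out[itemset] = total
        # Conditional base of `item`: whatever precedes it on each path.
        base = [(path[:path.index(item)], count)
                for path, count in pattern_base if item in path]
        base = [(path, count) for path, count in base if path]
        mine(base, itemset, min_count, out)


# Conditional pattern base of {g}: its two converted prefix paths.
g_base = [(["a", "m", "b"], 1), (["a", "m", "c"], 2)]
itemsets = {("g",): 3}
mine(g_base, ("g",), 2.5, itemsets)
```

With the two prefix paths of {g} as input, the sketch produces {g: 3}, {ag: 3}, {mg: 3} and {amg: 3}, the same four itemsets derived above.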
Several other algorithms based on the FP-tree structure have been proposed. Qiu et al. proposed the QFP-growth mining approach to mine association rules [40]. Ezeife et al. constructed a generalized FP tree, which stores all frequent and infrequent items, for incremental mining without rescanning databases [10]. Much related research is still in progress on efficiently discovering the desired information [1, 11, 23, 28, 35, 39, 42].
CHAPTER 3
Multiple Fuzzy FP-tree Algorithm
In this chapter, the multiple fuzzy FP-tree (abbreviated as MFFP-tree) algorithm is proposed to keep fuzzy frequent regions whether they are generated from the same item or not. The MFFP-tree structure is used to efficiently handle quantitative data with multiple fuzzy regions of an item (term). The notation used in the proposed MFFP-tree algorithm is shown below.
3.1 Notation
D the original quantitative database;
n the number of transactions in D;
T(i) the i-th transaction in D, 1 ≤ i ≤ n;
m the number of items in D;
Ij the j-th item, 1 ≤ j ≤ m;
hj the number of fuzzy regions for Ij;
Rjl the l-th fuzzy region of Ij, 1 ≤ l ≤ hj;
vij the quantitative value of Ij in T(i);
fijl the membership value of vij in region Rjl;
countjl the count of the fuzzy region Rjl in D;
s the predefined minimum support threshold.
3.2 The MFFP-tree Construction Algorithm
INPUT: A quantitative database consisting of n transactions, a set of membership
functions, and a predefined minimum support threshold s.
OUTPUT: A multiple fuzzy FP tree (MFFP tree).
STEP 1: Transform the quantitative value vij of each item Ij in the i-th transaction into a fuzzy set fij represented as (fij1/Rj1 + fij2/Rj2 + … + fijh/Rjh) using the given membership functions, where h is the number of fuzzy regions for Ij, Rjl is the l-th fuzzy region of Ij, 1 ≤ l ≤ h, and fijl is vij’s fuzzy membership value in region Rjl. Note that fijl/Rjl means that the membership value of region Rjl is fijl.
STEP 2: Calculate the scalar cardinality countjl of each fuzzy region Rjl in the transactions as:
\[ count_{jl} = \sum_{i=1}^{n} f_{ijl}. \]
STEP 3: Check whether the value countjl of each fuzzy region Rjl is larger than or equal to the predefined minimum count n*s. If it is, the region is treated as a fuzzy frequent itemset and is put in the set L1. That is:
L1 = {Rjl | countjl ≥ n*s, 1 ≤ j ≤ m}.
STEP 4: Build the Header_Table by sorting the fuzzy regions (fuzzy frequent
itemsets) in L1 in descending order of their counts.
STEP 5: Remove the fuzzy regions of the items not existing in L1 from the
transactions of the transformed database.
STEP 6: Sort the remaining fuzzy regions in descending order of their fuzzy values
in each transaction.
STEP 7: Initially set the root node of the MFFP tree as {root}.
STEP 8: Insert the transactions of the transformed database into the MFFP tree tuple
by tuple. The following two cases may exist.
Substep 8-1: If a fuzzy region Rjl in a transaction already has a node on the corresponding branch of the MFFP tree, add the fuzzy value fijl of Rjl in the processed transaction to that node.
Substep 8-2: Otherwise, add a node for Rjl at the end of the corresponding branch, set the count of the node to the fuzzy value fijl of Rjl, and link the most recently added node of Rjl in another branch to the new node so that nodes of the same region form a sequence. If no branch yet contains a node of Rjl, insert a node-link from the entry of Rjl in the Header_Table to the added node.
In STEP 8, a corresponding branch is the branch built in the MFFP tree according to sorted fuzzy regions in descending order of their fuzzy values in the transformed transactions. After STEP 8, the final MFFP tree is thus built.
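STEPs 7 and 8 can be illustrated with a small Python sketch. The names `MFFPNode` and `insert`, the dictionary-based Header_Table, and the three sample transactions (regions X.Low, Y.High, Z.Middle) are all hypothetical and serve only to show how fuzzy values accumulate on shared nodes. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction as F


class MFFPNode:
    """A tree node holding a fuzzy region and its accumulated fuzzy value."""

    def __init__(self, region, parent=None):
        self.region = region
        self.count = F(0)
        self.parent = parent
        self.children = {}  # region -> MFFPNode
        self.next = None    # next node with the same region (node-link)


def insert(root, header, tails, transaction):
    """Insert one sorted transaction given as (region, fuzzy value) pairs."""
    node = root
    for region, value in transaction:
        child = node.children.get(region)
        if child is None:
            # Substep 8-2: extend the branch and maintain the node-links.
            child = MFFPNode(region, parent=node)
            node.children[region] = child
            if region in tails:
                tails[region].next = child
            else:
                header[region] = child
            tails[region] = child
        # Substep 8-1 (and the initial value in 8-2): add the fuzzy value.
        child.count += value
        node = child


# STEP 7: start from an empty tree whose root is {root}.
root, header, tails = MFFPNode("root"), {}, {}

# Hypothetical sorted transactions, not taken from the example in Section 3.3.
insert(root, header, tails, [("X.Low", F(4, 5)), ("Y.High", F(3, 5))])
insert(root, header, tails, [("X.Low", F(1, 2)), ("Z.Middle", F(2, 5))])
insert(root, header, tails, [("Y.High", F(7, 10))])
```

The first two transactions share the X.Low node, whose value becomes 0.8 + 0.5 = 1.3, while the third starts a new branch whose Y.High node is chained to the earlier Y.High node.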
3.3 An Example of the MFFP-tree Construction Algorithm
Below, an example is given to illustrate how to construct an MFFP tree from the quantitative transaction data shown in Table 3.1. The data consist of 6 transactions and 5 items, denoted A to E. The minimum support threshold s is set to 30%.
Table 3.1. Six transactions with purchased items and their quantitative values
TID Items
1 (A:5) (C:10) (D:2) (E:9)
2 (A:8) (B:2) (C:3)
3 (B:3) (C:9)
4 (A:7) (C:9) (D:3)
5 (A:5) (B:2) (C:4)
6 (A:3) (C:11) (D:2) (E:2)
Assume that the fuzzy membership functions are the same for all items, as shown in Figure 3.1. In this example, amounts are represented by three fuzzy regions: {Low}, {Middle}, and {High}. Thus, three fuzzy membership values are produced for each item in a transaction according to the predefined membership functions in Figure 3.1. Note that the proposed approach also works when the membership functions of the amounts are not the same for all items.
Figure 3.1. Membership functions used in the example (horizontal axis: Amount, with marks at 0, 1, 6, and 11; vertical axis: Membership value, up to 1; regions: Low, Middle, High)
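The shape of the membership functions is not recoverable from the figure residue, so the following Python sketch assumes piecewise-linear (triangular) functions with breakpoints at the marked amounts 1, 6, and 11: Low equals 1 up to amount 1 and falls to 0 at 6, Middle rises from 1 to a peak at 6 and falls to 0 at 11, and High rises from 6 to 1 at 11. This assumption is consistent with the membership values quoted in STEP 2 of the example (for instance, 0.2 for {A.Low} at amount 5 and 0.6 at amount 3), but it is not taken from the thesis itself.

```python
from fractions import Fraction


def membership(amount):
    """Return {region: degree} for one amount under the assumed triangles."""
    x = Fraction(amount)
    low = Fraction(1) if x <= 1 else max(Fraction(0), (6 - x) / 5)
    high = Fraction(1) if x >= 11 else max(Fraction(0), (x - 6) / 5)
    if x <= 1 or x >= 11:
        middle = Fraction(0)
    elif x <= 6:
        middle = (x - 1) / 5   # rising edge between 1 and 6
    else:
        middle = (11 - x) / 5  # falling edge between 6 and 11
    return {"Low": low, "Middle": middle, "High": high}


# Amount 5 of item A in transaction 1.
a_in_t1 = membership(5)
```

Under these assumed functions, the amount 5 maps to Low = 1/5 (0.2), Middle = 4/5 (0.8), and High = 0.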
The MFFP tree for this example is thus constructed using the proposed approach as follows.
STEP 1: The quantitative values of the items in the transactions are represented as fuzzy sets using the membership functions shown in Figure 3.1. Take item {A} in transaction 1 as an example to illustrate the procedure. The amount “5” of {A} is converted into the fuzzy set (0.2/A.Low + 0.8/A.Middle + 0.0/A.High).
Table 3.2. Fuzzy sets transformed from Table 3.1
TID Items
STEP 2: The scalar cardinality of each fuzzy region in the transactions is calculated as its count value. Take the fuzzy region {A.Low} as an example to explain the procedure. {A.Low} appears in transactions 1, 5, and 6, and its scalar cardinality is calculated as (0.2 + 0.2 + 0.6) (= 1.0). This step is repeated for the other regions; the results are shown in Table 3.3.
Table 3.3. Counts of fuzzy regions
Item Count Item Count Item Count
STEP 3: The count of each fuzzy region is then checked against the predefined minimum count, which is calculated as (6 * 0.3) (= 1.8). For example, the counts for {A.Low}, {A.Middle}, and {A.High} are 1.0, 3.4, and 0.6, respectively. Since only the count for {A.Middle} is larger than the minimum count, {A.Middle} is kept for the subsequent mining process. The fuzzy regions that satisfy the threshold are considered fuzzy frequent itemsets and are kept in the set L1 for later building the MFFP tree. Thus, L1 = {A.Middle: 3.4, B.Low: 2.2, C.Middle: 2.0, C.High: 3.0, D.Low: 2.2}. The results are shown in Table 3.4.
Table 3.4. Counts of fuzzy frequent regions
Fuzzy regions Count
A.Middle 3.4
B.Low 2.2
C.Middle 2.0
C.High 3.0
D.Low 2.2
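The counts in Table 3.4 can be reproduced with the following Python sketch. It assumes triangular membership functions with breakpoints at 1, 6, and 11 (an assumption inferred from the axis marks of Figure 3.1 and the quoted membership values), and it breaks ties between equal counts alphabetically, which is also an assumption. The function name `fuzzify` and the variable names are hypothetical.

```python
from fractions import Fraction as F


def fuzzify(x):
    """Assumed triangular membership functions with breakpoints 1, 6, 11."""
    x = F(x)
    return {
        "Low": min(F(1), max(F(0), (6 - x) / 5)),
        "Middle": max(F(0), min((x - 1) / 5, (11 - x) / 5)),
        "High": min(F(1), max(F(0), (x - 6) / 5)),
    }


# Quantitative transactions from Table 3.1.
table_3_1 = [
    {"A": 5, "C": 10, "D": 2, "E": 9},
    {"A": 8, "B": 2, "C": 3},
    {"B": 3, "C": 9},
    {"A": 7, "C": 9, "D": 3},
    {"A": 5, "B": 2, "C": 4},
    {"A": 3, "C": 11, "D": 2, "E": 2},
]
min_count = len(table_3_1) * F(3, 10)  # 6 * 0.3 = 1.8

# STEP 2: scalar cardinality of every fuzzy region.
counts = {}
for transaction in table_3_1:
    for item, amount in transaction.items():
        for region, degree in fuzzify(amount).items():
            key = f"{item}.{region}"
            counts[key] = counts.get(key, F(0)) + degree

# STEP 3: keep the regions whose count reaches the minimum count.
L1 = {region: c for region, c in counts.items() if c >= min_count}

# STEP 4: Header_Table order (descending count, ties alphabetical).
header_order = sorted(L1, key=lambda region: (-L1[region], region))
```

Under these assumptions the sketch keeps exactly the five regions of Table 3.4 and orders them A.Middle, C.High, B.Low, D.Low, C.Middle.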
STEP 4: The fuzzy regions in L1 are then sorted in descending order of their counts for building the Header_Table. The results are shown in Figure 3.2.
Header_Table
Fuzzy region Count
A.Middle 3.4
C.High 3.0
B.Low 2.2
D.Low 2.2
C.Middle 2.0
Figure 3.2. The built Header_Table
STEP 5: The fuzzy regions not existing in L1 are then removed from each transaction in Table 3.2. The results are shown in Table 3.5.
Table 3.5. The remaining fuzzy regions
TID Fuzzy regions
STEP 6: The remaining fuzzy regions in each transaction in Table 3.5 are then sorted in descending order of their membership values. The sorted transactions are shown in Table 3.6.
Table 3.6. The updated transactions for constructing the MFFP tree
TID Fuzzy regions
STEPs 7 & 8: The transactions in the transformed database in Table 3.6 are inserted into the MFFP tree tuple by tuple. For example, the first transaction is inserted as the first branch of the tree, and the count of each node is set to the membership value of the corresponding fuzzy region. Since each node in the branch is the first one for its fuzzy region, a node-link is created to connect the fuzzy region in the Header_Table to its corresponding node. The results after the first transaction has been processed are shown in Figure 3.3.
Figure 3.3. The built MFFP tree after the first transaction has been processed
The second transaction in Table 3.6 is then processed and inserted into the MFFP tree as the second branch, since it does not share a prefix path with the first transaction. The two nodes of {A.Middle} in the two branches of the MFFP tree are then connected in sequence. The results are shown in Figure 3.4.
Figure 3.4. The built MFFP tree after the second transaction has been processed
The process is repeated for the other four transactions. After all transactions have been processed, the final results of the constructed MFFP tree and its Header_Table are then shown in Figure 3.5.
Figure 3.5. The finally constructed MFFP tree and its Header_Table
3.4 The MFFP-growth Mining Algorithm
After the MFFP tree has been constructed, the complete set of fuzzy frequent itemsets can be found using the proposed MFFP-growth mining approach. The fuzzy regions (fuzzy itemsets) in the Header_Table are processed one by one, bottom-up, to generate fuzzy frequent itemsets. The nodes of the currently processed region are found by following the node-links from the first node to the last one, and fuzzy frequent itemsets are then mined recursively using the intersection operation on fuzzy sets, which is the minimum operation here. The MFFP-growth mining algorithm is as follows:
INPUT: The built MFFP tree, its corresponding Header_Table, and the
pre-calculated minimum count.
OUTPUT: The desired fuzzy frequent itemsets.
STEP 1: Process the fuzzy regions (fuzzy frequent items) in the Header_Table one
by one from bottom to top using the following steps. The currently processed fuzzy region is set as Rjl.
STEP 2: Find all nodes with the fuzzy region Rjl in the MFFP tree through the sequenced connection between nodes.
STEP 3: Trace the prefix and suffix paths of the currently processed fuzzy region Rjl in the MFFP tree. Extract the fuzzy regions on these paths that are at higher positions than Rjl in the Header_Table. Merge the extracted paths to recursively form the conditional MFFP tree for generating fuzzy itemsets with the currently processed fuzzy region Rjl. The minimum operation is used to obtain the fuzzy values of the derived fuzzy itemsets. Note that fuzzy regions associated with the same item Ij as the currently processed region Rjl cannot be combined with it into a fuzzy itemset, because such an itemset is meaningless.
STEP 4: Check whether the value countjl of the derived fuzzy itemset is larger than or equal to the pre-calculated minimum count n*s.
STEP 5: Repeat STEPs 2 to 4 for the other fuzzy regions until all regions in the
Header_Table have been processed.
After STEP 5, the desired fuzzy frequent itemsets are then derived from the built MFFP tree.
3.5 An Example of the MFFP-growth Mining Algorithm
For the built MFFP tree in Figure 3.5, the proposed MFFP-growth mining algorithm is then processed to find the fuzzy frequent itemsets as follows:
STEP 1: The fuzzy regions in the Header_Table are processed one by one from bottom to top. In this example, the fuzzy regions are processed in the order {C.Middle}, {D.Low}, {B.Low}, {C.High}, and {A.Middle}. Here, the fuzzy region {C.Middle} is used as an example to illustrate the following steps.
STEP 2: The nodes with the currently processed fuzzy region {C.Middle} in the MFFP tree are found by following the node-links that connect them in sequence. In this example, four nodes in the MFFP tree contain the fuzzy region {C.Middle}.
STEPs 3 & 4: The prefix and suffix paths of the currently processed nodes of {C.Middle} are then found for recursively generating fuzzy frequent itemsets. Since {C.Middle} is the last node of each branch in the MFFP tree, it has no suffix paths. In Figure 3.6, the currently processed nodes of {C.Middle} are marked in red and the prefix paths are marked in blue.
Figure 3.6. The processed nodes {C.Middle} with its prefix paths
In this example, four prefix paths are extracted from the MFFP tree, and their fuzzy values are set to the value of the processed {C.Middle} node in the path. Thus, the four extracted paths are {A.Middle: 0.6, C.High: 0.6, D.Low: 0.6}, {A.Middle: 0.6, B.Low: 0.6}, {B.Low: 0.4, A.Middle: 0.4} and {B.Low: 0.4, C.High: 0.4}. These paths are then merged to form the conditional MFFP tree of {C.Middle}. In this example, the conditional MFFP tree of {C.Middle} is null, since no fuzzy itemset containing {C.Middle} satisfies the minimum count.
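Why the conditional MFFP tree of {C.Middle} is null can be checked with a short Python sketch that merges the four extracted paths. The dictionary layout and the helper `item_of` are hypothetical; the path values are those quoted in the text, written as exact fractions.

```python
from fractions import Fraction as F

min_count = F(9, 5)  # 6 * 0.3 = 1.8
processed = "C.Middle"

# The four extracted prefix paths of {C.Middle}.
paths = [
    {"A.Middle": F(3, 5), "C.High": F(3, 5), "D.Low": F(3, 5)},
    {"A.Middle": F(3, 5), "B.Low": F(3, 5)},
    {"B.Low": F(2, 5), "A.Middle": F(2, 5)},
    {"B.Low": F(2, 5), "C.High": F(2, 5)},
]


def item_of(region):
    """Return the item part of a region name such as 'C.High'."""
    return region.split(".")[0]


# Merge the paths by summing the fuzzy values of each region.
merged = {}
for path in paths:
    for region, value in path.items():
        if item_of(region) == item_of(processed):
            continue  # regions of the same item are never combined
        merged[region] = merged.get(region, F(0)) + value

# Regions that survive in the conditional MFFP tree of {C.Middle}.
conditional = {r: v for r, v in merged.items() if v >= min_count}
```

The merged values are 1.6 for A.Middle, 1.4 for B.Low, and 0.6 for D.Low, all below 1.8, and C.High is skipped because it belongs to the same item as C.Middle, so `conditional` is empty.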
STEP 5: Next, the fuzzy region {D.Low} is processed. In Figure 3.7, the currently processed nodes of {D.Low} are marked in red, and the prefix and suffix paths are marked in blue. Note that only {A.Middle}, {C.High} and {B.Low} can be extracted from the MFFP tree, since they are at higher positions than {D.Low} in the Header_Table.
Figure 3.7. The processed nodes {D.Low} with its corresponding paths
The two extracted paths for fuzzy region {D.Low} are {A.Middle: 1.4, C.High:
1.4} and {C.High: 0.8, A.Middle: 0.4}, which can be merged to form the conditional
MFFP tree of {D.Low}. The results are then shown in Figure 3.8.
Figure 3.8. The conditional MFFP-tree of {D.Low} (nodes: A.Middle: 1.8, C.High: 2.2, D.Low: 2.2)
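The merge of the two extracted paths of {D.Low}, followed by the minimum operation, can be sketched in Python as follows. The variable names are hypothetical; the path values and the count of {D.Low} (2.2) are those given in the text, written as exact fractions so that the comparison with the minimum count 1.8 is not affected by floating-point rounding.

```python
from fractions import Fraction as F

min_count = F(9, 5)     # 6 * 0.3 = 1.8
d_low_total = F(11, 5)  # count of {D.Low} in the Header_Table (2.2)

# The two extracted paths of {D.Low}.
extracted = [
    {"A.Middle": F(7, 5), "C.High": F(7, 5)},
    {"C.High": F(4, 5), "A.Middle": F(2, 5)},
]

# Merge the paths into the conditional MFFP tree of {D.Low}.
tree = {}
for path in extracted:
    for region, value in path.items():
        tree[region] = tree.get(region, F(0)) + value

# Fuzzy 2-itemsets: intersect with {D.Low} using the minimum operation.
itemsets = {}
for region, value in tree.items():
    fuzzy_value = min(d_low_total, value)
    if fuzzy_value >= min_count:
        itemsets[(region, "D.Low")] = fuzzy_value
```

The merged tree holds A.Middle = 1.8 and C.High = 2.2, and both 2-itemsets reach the minimum count, with values 1.8 and 2.2 respectively.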
The fuzzy frequent 2-itemsets with {D.Low} can then be generated; they are {(C.High, D.Low): min(2.2, 2.2) = 2.2} and {(A.Middle, D.Low): min(2.2, 1.8) = 1.8}. A conditional MFFP tree is recursively constructed in the sequence of {A.Middle, D.Low}
and {C.High, D.Low}. The results of the conditional MFFP tree for {A.Middle,