CHAPTER 6 Multiple Fuzzy FP-tree Merging Algorithm
6.3 An Example of the iMFFP-tree Algorithm
In this section, an example is given to illustrate the proposed iMFFP-tree algorithm. Assume that there are two quantitative databases, DB1 and DB2, shown in Table 6.1, and that the minimum support threshold s is set to 30%. Each database consists of 4 transactions over 5 items, denoted {A} to {E}.
Table 6.1. Two quantitative databases in the example
TID Items Quantitative database
1 (A:5) (B:2) (C:5) DB1
2 (A:3) (C:10) (D:2) (E:2) DB1
3 (A:5) (B:2) (C:8) (E:6) DB1
4 (C:9) (D:3) DB1
5 (A:5) (C:10) (D:2) (E:9) DB2
6 (A:8) (B:2) (C:3) DB2
7 (B:3) (C:9) DB2
8 (A:7) (C:9) (D:3) DB2
Assume that the fuzzy membership functions, shown in Figure 6.1, are the same for all items. In this example, amounts are represented by three fuzzy regions: {Low}, {Middle}, and {High}. Thus, three fuzzy membership values are produced for each item in a transaction according to the predefined membership functions.
Figure 6.1. Membership functions used in the example (horizontal axis: Amount, marked at 0, 1, 6, and 11; vertical axis: Membership value, up to 1; fuzzy regions: Low, Middle, High)
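The membership functions of Figure 6.1 can be written down compactly. The sketch below assumes triangular shapes with breakpoints at amounts 1, 6, and 11 (Low saturating at 1 for amounts of 1 or less, High saturating at 1 for amounts of 11 or more). This shape is inferred from the axis marks of the figure and agrees with the counts in Table 6.3; the function names memberships and fuzzify are illustrative and are not taken from the original implementation.

```python
def memberships(amount):
    """Membership values of one amount in the Low, Middle, and High regions.

    Assumed shape: triangles with breakpoints at amounts 1, 6, and 11.
    """
    x = min(max(amount, 1), 11)  # amounts outside [1, 11] saturate
    return {
        "Low": max(0, 6 - x) / 5,
        "Middle": (5 - abs(x - 6)) / 5,
        "High": max(0, x - 6) / 5,
    }


def fuzzify(item, amount):
    """Return the fuzzy regions of an item that have non-zero membership."""
    return {
        f"{item}.{region}": value
        for region, value in memberships(amount).items()
        if value > 0
    }


# Item {A} with amount 5 (transaction 1 of Table 6.1).
print(fuzzify("A", 5))
```

Under these assumptions, an amount of 5 belongs to Low with degree 0.2 and to Middle with degree 0.8, and has no membership in High.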
The procedure of the MFFP-tree merging algorithm for this example is described below. Note that the sub-MFFP tree of DB1 is merged into the sub-MFFP tree of DB2 to form the iMFFP tree.
STEP 1: The quantitative values of the items in the transactions are represented as fuzzy sets using the membership functions shown in Figure 6.1. Take item {A} in transaction 1 as an example. The amount “5” of {A} is converted into the fuzzy set (0.2/A.Low + 0.8/A.Middle + 0.0/A.High).
Table 6.2. Fuzzy sets transformed from Table 6.1
TID Items DB
STEPs 2 & 3: The scalar cardinality of each fuzzy region over the transactions of the two databases is calculated as its count and is checked against the specified minimum count, which is 8 * 0.3 = 2.4, to find the fuzzy frequent 1-itemsets. Take the fuzzy region {B.Low} as an example. {B.Low} appears in transactions 1, 3, 6, and 7, and its scalar cardinality is calculated as 0.8 + 0.8 + 0.8 + 0.6 = 3.0. Since the count of {B.Low} is larger than the minimum count, {B.Low} is kept in the set L1. The results are shown in Table 6.3.
Table 6.3. Counts of fuzzy regions (fuzzy frequent itemsets)
Fuzzy region Count
A.Middle 4.2
B.Low 3.0
C.Middle 3.4
C.High 3.8
D.Low 2.8
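The counts in Table 6.3 can be reproduced with a few lines of code. The sketch below assumes triangular membership functions with breakpoints at 1, 6, and 11, and treats a fuzzy region as frequent when its count is at least the minimum count; the names TRANSACTIONS, memberships, and region_counts are illustrative only.

```python
from collections import defaultdict

# Transactions of Table 6.1 (TIDs 1-4 belong to DB1, TIDs 5-8 to DB2).
TRANSACTIONS = [
    {"A": 5, "B": 2, "C": 5},
    {"A": 3, "C": 10, "D": 2, "E": 2},
    {"A": 5, "B": 2, "C": 8, "E": 6},
    {"C": 9, "D": 3},
    {"A": 5, "C": 10, "D": 2, "E": 9},
    {"A": 8, "B": 2, "C": 3},
    {"B": 3, "C": 9},
    {"A": 7, "C": 9, "D": 3},
]


def memberships(amount):
    """Assumed triangular functions with breakpoints at 1, 6, and 11."""
    x = min(max(amount, 1), 11)
    return {
        "Low": max(0, 6 - x) / 5,
        "Middle": (5 - abs(x - 6)) / 5,
        "High": max(0, x - 6) / 5,
    }


def region_counts(transactions):
    """Scalar cardinality (sum of membership values) of every fuzzy region."""
    counts = defaultdict(float)
    for transaction in transactions:
        for item, amount in transaction.items():
            for region, value in memberships(amount).items():
                counts[f"{item}.{region}"] += value
    return counts


MIN_COUNT = len(TRANSACTIONS) * 0.3  # 8 * 0.3 = 2.4

counts = region_counts(TRANSACTIONS)
frequent = {r: round(c, 1) for r, c in counts.items() if c >= MIN_COUNT}
print(frequent)
```

With these assumptions the five frequent fuzzy regions and their counts match Table 6.3; every other fuzzy region has a count of at most 1.6.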
STEP 4: The sub-MFFP trees of the two quantitative databases are built separately. The resulting trees are shown in Figures 6.2 and 6.3.
Figure 6.2. The sub-MFFP tree of DB1 (paths from {root}: A.Middle:1.6 -> B.Low:1.6 -> C.Middle:1.4 -> C.High:0.4; C.High:1.4 -> D.Low:1.4 -> A.Middle:0.4 -> C.Middle:0.2; C.High:1.4 -> D.Low:1.4 -> C.Middle:0.4)
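The construction of the sub-MFFP tree of DB1 can be sketched as follows. This is a simplified illustration, not the thesis implementation: it assumes triangular membership functions with breakpoints at 1, 6, and 11, sorts the frequent fuzzy regions of each transaction by descending membership value with ties broken by region name (an assumption), and stores in each node only the sum of the membership values passing through it. The names Node, ordered_regions, build_tree, and paths are hypothetical.

```python
FREQUENT = {"A.Middle", "B.Low", "C.Middle", "C.High", "D.Low"}  # Table 6.3

DB1 = [  # TIDs 1-4 of Table 6.1
    {"A": 5, "B": 2, "C": 5},
    {"A": 3, "C": 10, "D": 2, "E": 2},
    {"A": 5, "B": 2, "C": 8, "E": 6},
    {"C": 9, "D": 3},
]


class Node:
    def __init__(self, region=None):
        self.region = region  # None marks the root
        self.value = 0.0
        self.children = {}


def memberships(amount):
    """Assumed triangular functions with breakpoints at 1, 6, and 11."""
    x = min(max(amount, 1), 11)
    return {
        "Low": max(0, 6 - x) / 5,
        "Middle": (5 - abs(x - 6)) / 5,
        "High": max(0, x - 6) / 5,
    }


def ordered_regions(transaction):
    """Frequent fuzzy regions of a transaction, highest membership first."""
    regions = []
    for item, amount in transaction.items():
        for region, value in memberships(amount).items():
            name = f"{item}.{region}"
            if value > 0 and name in FREQUENT:
                regions.append((name, value))
    return sorted(regions, key=lambda rv: (-rv[1], rv[0]))


def build_tree(transactions):
    root = Node()
    for transaction in transactions:
        node = root
        for name, value in ordered_regions(transaction):
            node = node.children.setdefault(name, Node(name))
            node.value += value  # accumulate the membership value
    return root


def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of (region, value) pairs."""
    if node.region is not None:
        prefix = prefix + ((node.region, round(node.value, 1)),)
    if not node.children:
        yield prefix
    for child in node.children.values():
        yield from paths(child, prefix)


for path in paths(build_tree(DB1)):
    print(path)
```

Under these assumptions the tree has three root-to-leaf paths, and their node values agree with those of Figure 6.2.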
Figure 6.3. The sub-MFFP tree of DB2
STEP 5: The leaf nodes of the sub-MFFP tree of DB1 are then traced one by one. In this example, the leaf nodes are shown in Figure 6.4.
Figure 6.4. All leaf nodes of the current MFFP-tree
Three branches can then be derived from these leaf nodes. Figure 6.5 shows the resulting three branches.
Branch 1: {A.Middle:1.6, B.Low:1.6, C.Middle:1.4, C.High:0.4}
Branch 2: {C.High:0.4, D.Low:0.4, A.Middle:0.4, C.Middle:0.2}
Branch 3: {C.High:1.0, D.Low:1.0, C.Middle:0.4}
Figure 6.5. Three branches of the current sub-MFFP tree
STEP 6: The three branches of the sub-MFFP tree of DB1 are inserted into the iMFFP tree. Take the first branch as an example. Since the fuzzy region {A.Middle} already exists at the corresponding position of the iMFFP tree, the fuzzy value of {A.Middle} in this branch is added to that node. The remaining fuzzy regions of the branch are inserted into the iMFFP tree as a new branch. The result is shown in Figure 6.6.
Figure 6.6. The iMFFP tree after merging the branch with fuzzy region {C.High} from the sub-MFFP tree of DB1
The above steps are repeated for the other two branches. Figure 6.7 and Figure 6.8 show the remaining steps of merging the sub-MFFP tree of DB1 into the sub-MFFP tree of DB2.
Figure 6.7. After merging the branch {C.High:0.4, D.Low:0.4, A.Middle:0.4, C.Middle:0.2}
Figure 6.8. After merging the branch {C.High:1.0, D.Low:1.0, C.Middle:0.4}
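The branch insertion of STEP 6 can be sketched as a prefix-tree merge: while the next fuzzy region of the branch already exists under the current node, its value is added to that node; otherwise a new node is created. The code below is a minimal illustration with hypothetical names (Node, insert_branch). For brevity the target tree starts empty here, whereas in the example it would be the sub-MFFP tree of DB2.

```python
class Node:
    def __init__(self, region=None):
        self.region = region  # None marks the root
        self.value = 0.0
        self.children = {}


def insert_branch(root, branch):
    """Merge one branch, given as [(region, value), ...], into the tree."""
    node = root
    for region, value in branch:
        child = node.children.get(region)
        if child is None:  # no matching node: start a new branch here
            child = Node(region)
            node.children[region] = child
        child.value += value  # add the fuzzy value to the node
        node = child


# The three branches of Figure 6.5.
BRANCHES = [
    [("A.Middle", 1.6), ("B.Low", 1.6), ("C.Middle", 1.4), ("C.High", 0.4)],
    [("C.High", 0.4), ("D.Low", 0.4), ("A.Middle", 0.4), ("C.Middle", 0.2)],
    [("C.High", 1.0), ("D.Low", 1.0), ("C.Middle", 0.4)],
]

root = Node()
for branch in BRANCHES:
    insert_branch(root, branch)

c_high = root.children["C.High"]
print(round(c_high.value, 1), round(c_high.children["D.Low"].value, 1))
```

As a sanity check, branches 2 and 3 share the prefix C.High, D.Low, so their values add up to 1.4 at both nodes, which are the values these nodes carry in the sub-MFFP tree of DB1.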
STEPs 7 & 8: Since no sub-MFFP tree remains to be merged, the Header_Table is created, and node-links are inserted from the entry of each fuzzy region in the Header_Table to the first node of that fuzzy region in the tree. The final merged MFFP-tree has thus been constructed. The result is shown in Figure 6.9.
Figure 6.9. The final merged MFFP-tree
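The Header_Table and node-links of STEPs 7 and 8 can be sketched as below. This is an illustration under assumptions: the tree is traversed depth-first, each Header_Table entry points to the first node found for a fuzzy region, and every further node of that region is chained through a link field. The three branches of Figure 6.5 serve as a small stand-in for the final merged tree, and all names (Node, insert_branch, build_header_table, linked_values) are hypothetical.

```python
class Node:
    def __init__(self, region=None):
        self.region = region
        self.value = 0.0
        self.children = {}
        self.link = None  # node-link to the next node of the same region


def insert_branch(root, branch):
    node = root
    for region, value in branch:
        node = node.children.setdefault(region, Node(region))
        node.value += value


def build_header_table(root):
    """Map each fuzzy region to its first node and chain the others."""
    header, last = {}, {}
    stack = list(root.children.values())[::-1]
    while stack:  # depth-first, left-to-right traversal
        node = stack.pop()
        if node.region in last:
            last[node.region].link = node  # extend the existing chain
        else:
            header[node.region] = node  # first node of this region
        last[node.region] = node
        stack.extend(list(node.children.values())[::-1])
    return header


def linked_values(header, region):
    """Follow the node-links of one region and collect the node values."""
    values, node = [], header.get(region)
    while node is not None:
        values.append(round(node.value, 1))
        node = node.link
    return values


root = Node()
for branch in [
    [("A.Middle", 1.6), ("B.Low", 1.6), ("C.Middle", 1.4), ("C.High", 0.4)],
    [("C.High", 0.4), ("D.Low", 0.4), ("A.Middle", 0.4), ("C.Middle", 0.2)],
    [("C.High", 1.0), ("D.Low", 1.0), ("C.Middle", 0.4)],
]:
    insert_branch(root, branch)

header = build_header_table(root)
print(linked_values(header, "C.Middle"))
```

Following the links of one entry visits every node of that fuzzy region, which is what the mining phase needs in order to collect the conditional paths of the region.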
CHAPTER 7
Experiments and Discussion
In this chapter, experiments were conducted to show the performance of the proposed approaches. The experiments were implemented in C and run on an AMD Athlon PC with a 3.0 GHz processor and 1 GB of main memory, running the Microsoft Windows XP operating system. Four real datasets were used in the experiments: foodmart, BMS-POS, chess, and mushroom. In the following sections, the execution time and the numbers of tree nodes of the three different approaches are evaluated under different minimum support thresholds on different datasets.
The foodmart dataset is from an anonymous chain store [30] containing quantitative transactions about the products sold in the chain store. There are 21,556 transactions and 1,600 items in the dataset.
The BMS-POS dataset [44] contains several years of point-of-sale data from a large electronics retailer. Each transaction in this dataset consists of all the product categories purchased by a customer at one time. There are 515,597 transactions and 1,657 items in the dataset. The maximum length of a transaction is 164 and the average length of a transaction is 6.5. The BMS-POS dataset records only binary values for all items. To obtain a quantitative database, quantities were randomly assigned to all items from a uniform distribution.
Two further real-world datasets, chess and mushroom, were used in the experiments [12]. The chess dataset contains 3,196 transactions and 75 items, with an average transaction size of 37. The mushroom dataset contains 8,124 transactions and 22 items. Random quantities in the range [1, 11] were assigned to the items in the transactions from a uniform distribution.
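The random assignment of quantities to a binary dataset can be sketched as follows. The sketch assumes integer quantities drawn uniformly from [1, 11], both ends included; the function name assign_quantities, the seed parameter, and the sample item identifiers are illustrative and not part of the original experimental code.

```python
import random


def assign_quantities(transactions, low=1, high=11, seed=0):
    """Attach a uniformly drawn integer quantity to every item.

    transactions: iterable of item collections (a binary dataset).
    Returns a list of {item: quantity} dictionaries.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [
        {item: rng.randint(low, high) for item in transaction}
        for transaction in transactions
    ]


binary_data = [["i1", "i3", "i4"], ["i2", "i3"], ["i1"]]  # made-up sample
quantitative_data = assign_quantities(binary_data)
print(quantitative_data)
```

Fixing the seed keeps the generated quantities identical across runs, so that different algorithms can be compared on the same quantitative database.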
In the experiments, all of the above datasets were transformed into fuzzy 2-regions and fuzzy 3-regions according to the predefined membership functions.
7.1 Experimental Results of the MFFP-tree Algorithm
In this section, two real-world datasets, foodmart and BMS-POS, were used to evaluate the performance of the proposed MFFP-tree algorithm. Experiments were conducted to compare the execution time of the proposed MFFP-tree algorithm and the multiple fuzzy Apriori algorithm (abbreviated as MF-Apriori, which derives the fuzzy frequent itemsets using the Apriori algorithm [2]). Figure 7.1 shows the execution time of the two algorithms on the foodmart dataset. The minimum support threshold was varied from 0.12% to 0.2% in 0.02% increments.
Figure 7.1. Comparison of the execution time obtained using the MF-Apriori and the MFFP-tree algorithm in the foodmart dataset
Figure 7.1 shows that a longer execution time was required for fuzzy 3-regions than for fuzzy 2-regions at all five minimum support thresholds. This is because using three regions produces more fuzzy regions than using two. Since the number of transformed fuzzy regions depends on the predefined membership functions, the quantitative values of the items also affect the number of transformed fuzzy regions. The execution time of the MFFP-tree algorithm was much lower than that of the MF-Apriori algorithm, especially at lower minimum support thresholds. This is because the MFFP-tree algorithm requires fewer database scans than the level-wise MF-Apriori algorithm. Experiments were also conducted to determine the numbers of tree nodes for fuzzy 2-regions and fuzzy 3-regions, as shown in Figure 7.2.
Figure 7.2. Comparison of the numbers of tree nodes obtained using the MFFP-tree algorithm in the foodmart dataset
Figure 7.2 shows a crossover point at the 0.14% minimum support threshold. When the minimum support threshold was set lower than 0.14%, more nodes were generated with two regions than with three regions. This is because an item that produces fewer fuzzy regions through the membership functions generates more fuzzy frequent itemsets at lower minimum support thresholds.
In real-world applications, the proposed MFFP-tree algorithm can produce the complete set of fuzzy frequent itemsets, rather than only those based on the fuzzy region with the maximum cardinality, which helps in making correct decisions. The number of fuzzy frequent itemsets was therefore also compared to evaluate the proposed approach. Figure 7.3 shows the numbers of fuzzy large itemsets obtained by the proposed MFFP-tree algorithm and by the fuzzy FP-tree (abbreviated as FFP-tree) algorithm [24].
Figure 7.3. Comparison of the numbers of large itemsets obtained using the FFP-tree algorithm and the MFFP-tree algorithm in the foodmart dataset
In addition, experiments were conducted to compare the execution time of the MFFP-tree algorithm and the MF-Apriori algorithm on the BMS-POS dataset [44]. Figure 7.4 shows the execution time of the two algorithms in fuzzy 2-regions and fuzzy 3-regions, respectively. The minimum support threshold was varied from 3.5% to 5.5% in 0.5% increments.
Figure 7.4. Comparison of the execution time obtained using the MF-Apriori algorithm and the MFFP-tree algorithm in the BMS-POS dataset
Figure 7.4 shows that the proposed MFFP-tree algorithm is faster than the MF-Apriori algorithm at all five minimum support thresholds. The reason is the same as that explained above for the foodmart dataset. Figure 7.5 shows the numbers of tree nodes for fuzzy 2-regions and fuzzy 3-regions, respectively.
Figure 7.5. Comparison of the numbers of tree nodes obtained using the MFFP-tree algorithm in the BMS-POS dataset
Figure 7.5 shows that when the minimum support threshold was set lower than 5.5%, fuzzy 2-regions produced more tree nodes than fuzzy 3-regions. This is because an item with more fuzzy regions may have more concentrated membership functions, so fewer of its fuzzy regions satisfy the minimum support threshold than when it has fewer fuzzy regions. Figure 7.6 shows the numbers of fuzzy large itemsets for the proposed MFFP-tree algorithm and the FFP-tree algorithm. It also shows that the proposed MFFP-tree algorithm generates a more complete set of fuzzy large itemsets than the FFP-tree algorithm.
Figure 7.6. Comparison of the numbers of large itemsets obtained using the FFP-tree algorithm and the MFFP-tree algorithm in the BMS-POS dataset
7.2 Experimental Results of the CMFFP-tree Algorithm
In this section, two real-world datasets called foodmart and chess were used to evaluate the performance of the proposed CMFFP-tree algorithm. Experiments were conducted to compare the execution time and the number of tree nodes for the proposed CMFFP-tree algorithm, the MFFP-tree algorithm, and the MF-Apriori algorithm [2]. Figure 7.7 and Figure 7.8 show the execution time for three algorithms in fuzzy 2-regions and fuzzy 3-regions, respectively. The minimum support threshold was varied from 0.14% to 0.18% in 0.01% increments.
Figure 7.7. Comparison of the execution time obtained using the MF-Apriori algorithm and the CMFFP-tree algorithm in the foodmart dataset
Figure 7.8. Comparison of the execution time obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the foodmart dataset
Figure 7.7 shows that the proposed CMFFP-tree algorithm is faster than the MF-Apriori algorithm. In Figure 7.8, the proposed CMFFP-tree algorithm is faster than the MFFP-tree algorithm when the minimum support threshold is set higher than 0.145%. This is because the number of attached arrays increases with the length of a branch, which requires more computation to calculate the membership values of the derived fuzzy itemsets. Next, the numbers of tree nodes were compared to determine the effect of the minimum support threshold; the results are shown in Figure 7.9 and Figure 7.10, respectively. Both figures indicate that the CMFFP-tree algorithm has fewer tree nodes than the MFFP-tree algorithm in both fuzzy 2-regions and fuzzy 3-regions.
Figure 7.9. Comparison of the numbers of tree nodes for fuzzy 2-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the foodmart dataset
Figure 7.10. Comparison of the numbers of tree nodes for fuzzy 3-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the foodmart dataset
Another real-world dataset, chess, was also used in the experiments [12]. Figure 7.11 shows the execution time of the MFFP-tree algorithm and the CMFFP-tree algorithm in fuzzy 2-regions. Figure 7.12 shows the numbers of tree nodes of the MFFP-tree algorithm and the CMFFP-tree algorithm in fuzzy 2-regions. The minimum support threshold was varied from 39% to 43% in 1% increments.
Figure 7.11. Comparison of the execution time for fuzzy 2-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the chess dataset
Figure 7.12. Comparison of the numbers of tree nodes for fuzzy 2-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the chess dataset
As can be seen from Figure 7.11, the proposed CMFFP-tree algorithm is faster than the MFFP-tree algorithm. In addition, Figure 7.12 shows that the CMFFP-tree algorithm produces fewer nodes than the MFFP-tree algorithm. This is because the CMFFP-tree algorithm sorts the fuzzy regions in transactions in descending order of their occurrence frequencies during tree construction. The MFFP-tree algorithm, however, follows the descending order of the membership values of the fuzzy regions in each transaction to build the MFFP tree. That is, in the MFFP-tree approach, two records with the same fuzzy regions but in different orders are inserted into the MFFP tree as two different branches, producing more tree nodes than the CMFFP-tree algorithm.
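The effect of the sorting order on the number of tree nodes can be illustrated on the eight transactions of Table 6.1. The sketch below is a simplified comparison under assumptions, not the CMFFP-tree or MFFP-tree implementation: it uses triangular membership functions with breakpoints at 1, 6, and 11, breaks membership ties by region name, and uses the descending counts of Table 6.3 as a stand-in for the fixed occurrence-frequency order. All names are hypothetical.

```python
COUNTS = {"A.Middle": 4.2, "C.High": 3.8, "C.Middle": 3.4,
          "B.Low": 3.0, "D.Low": 2.8}  # frequent regions, Table 6.3

TRANSACTIONS = [  # Table 6.1
    {"A": 5, "B": 2, "C": 5},
    {"A": 3, "C": 10, "D": 2, "E": 2},
    {"A": 5, "B": 2, "C": 8, "E": 6},
    {"C": 9, "D": 3},
    {"A": 5, "C": 10, "D": 2, "E": 9},
    {"A": 8, "B": 2, "C": 3},
    {"B": 3, "C": 9},
    {"A": 7, "C": 9, "D": 3},
]


def memberships(amount):
    """Assumed triangular functions with breakpoints at 1, 6, and 11."""
    x = min(max(amount, 1), 11)
    return {"Low": max(0, 6 - x) / 5,
            "Middle": (5 - abs(x - 6)) / 5,
            "High": max(0, x - 6) / 5}


def frequent_regions(transaction):
    """Frequent fuzzy regions of a transaction with their membership values."""
    found = {}
    for item, amount in transaction.items():
        for region, value in memberships(amount).items():
            name = f"{item}.{region}"
            if value > 0 and name in COUNTS:
                found[name] = value
    return found


def by_membership(regions):
    """Per-transaction order: descending membership value."""
    return sorted(regions, key=lambda r: (-regions[r], r))


def by_global_count(regions):
    """Fixed order shared by all transactions: descending count."""
    return sorted(regions, key=lambda r: -COUNTS[r])


def count_nodes(paths):
    """Nodes of the prefix tree: one per distinct non-empty path prefix."""
    return len({tuple(p[:i]) for p in paths for i in range(1, len(p) + 1)})


regions = [frequent_regions(t) for t in TRANSACTIONS]
print(count_nodes([by_membership(r) for r in regions]),
      count_nodes([by_global_count(r) for r in regions]))
```

Under these assumptions the membership-value order yields 17 nodes and the fixed order yields 11, because transactions that contain the same fuzzy regions always share one branch when the order does not depend on the individual membership values.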
Experiments were also conducted to show the execution time and the numbers of tree nodes of the MFFP-tree algorithm and the CMFFP-tree algorithm in fuzzy 3-regions. The experimental results are shown in Figure 7.13 and Figure 7.14, respectively. For both figures, the minimum support threshold was varied from 28% to 32% in 1% increments.
Figure 7.13. Comparison of the execution time for fuzzy 3-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the chess dataset
Figure 7.14. Comparison of the numbers of tree nodes for fuzzy 3-regions obtained using the MFFP-tree algorithm and the CMFFP-tree algorithm in the chess dataset
Figure 7.13 and Figure 7.14 show that the proposed CMFFP-tree algorithm performs better than the MFFP-tree algorithm in both execution time and the numbers of tree nodes.
7.3 Experimental Results of the UBMFFP-tree Algorithm
In this section, two real-world datasets, foodmart and mushroom, were used to evaluate the performance of the proposed UBMFFP-tree algorithm. Experiments were conducted to compare the execution time of the proposed UBMFFP-tree algorithm, the CMFFP-tree algorithm, the MFFP-tree algorithm, and the multiple fuzzy Apriori algorithm (abbreviated as MF-Apriori). Figure 7.15 and Figure 7.16 show the execution time of the four algorithms for fuzzy 2-regions on the foodmart dataset. The minimum support threshold was varied from 0.14% to 0.18% in 0.01% increments.
Figure 7.15. Comparison of the execution time for fuzzy 2-regions obtained using the MF-Apriori algorithm and the UBMFFP-tree algorithm in the foodmart dataset
Figure 7.16. Comparison of execution time for fuzzy 2-regions obtained using the proposed three fuzzy FP-tree algorithms in the foodmart dataset
Figure 7.15 and Figure 7.16 show that the proposed UBMFFP-tree algorithm ran faster than the other algorithms at all of the tested minimum support thresholds. The numbers of tree nodes were also compared for the three tree-based algorithms in fuzzy 2-regions, as shown in Figure 7.17.
Figure 7.17. Comparison of the numbers of tree nodes for fuzzy 2-regions obtained using the proposed three fuzzy FP-tree algorithms in the foodmart dataset
Figure 7.17 shows that the CMFFP-tree algorithm produces the same number of tree nodes as the UBMFFP-tree algorithm, and fewer than the MFFP-tree algorithm. This is because both the CMFFP-tree algorithm and the UBMFFP-tree algorithm sort the fuzzy regions in transactions in descending order of their occurrence frequencies during tree construction. The MFFP-tree algorithm, however, follows the descending order of the membership values of the fuzzy regions in each transaction to build the MFFP tree. Figure 7.18 and Figure 7.19 show the execution time of the four algorithms for fuzzy 3-regions on the foodmart dataset. The minimum support threshold was varied from 0.15% to 0.19% in 0.01% increments.
Figure 7.18. Comparison of the execution time for fuzzy 3-regions obtained using the MF-Apriori algorithm and the UBMFFP-tree algorithm in the foodmart dataset
Figure 7.19. Comparison of the execution time for fuzzy 3-regions obtained using the proposed three fuzzy FP-tree algorithms in the foodmart dataset
The results show that the proposed UBMFFP-tree algorithm is faster than the other algorithms. Figure 7.20 shows the numbers of tree nodes of the MFFP-tree algorithm, the CMFFP-tree algorithm, and the proposed UBMFFP-tree algorithm in fuzzy 3-regions.
Figure 7.20. Comparison of the numbers of tree nodes for fuzzy 3-regions obtained using the proposed three fuzzy FP-tree algorithms in the foodmart dataset
It can be observed in Figure 7.20 that the proposed UBMFFP-tree algorithm has the same number of tree nodes as the CMFFP-tree algorithm, and fewer than the MFFP-tree algorithm. The reason is the same as that given earlier in Section 7.3.
Another real dataset, mushroom, was also used in the experiments [12]. Figure 7.21 shows the execution time of the MFFP-tree algorithm, the CMFFP-tree algorithm, and the proposed UBMFFP-tree algorithm in fuzzy 2-regions. The minimum support threshold was varied from 29% to 33% in 1% increments.
Figure 7.21. Comparison of the execution time for fuzzy 2-regions obtained using the proposed three fuzzy FP-tree algorithms in the mushroom dataset
Figure 7.21 shows that the proposed UBMFFP-tree algorithm ran faster than the MFFP-tree algorithm and the CMFFP-tree algorithm at the different minimum support thresholds. The numbers of tree nodes were also compared for the three algorithms in fuzzy 2-regions, as shown in Figure 7.22, where the CMFFP-tree algorithm and the UBMFFP-tree algorithm have the same number of tree nodes, and fewer than the MFFP-tree algorithm. The reason is the same as that given earlier in Section 7.3.
Figure 7.22. Comparison of the numbers of tree nodes for fuzzy 2-regions obtained using the proposed three fuzzy FP-tree algorithms in the mushroom dataset
Experiments were then conducted to show the execution time and the numbers of tree nodes for fuzzy 3-regions. The minimum support threshold was varied from 24% to 28% in 1% increments. The execution time results are shown in Figure 7.23.
Figure 7.23. Comparison of the execution time for fuzzy 3-regions obtained using the proposed three fuzzy FP-tree algorithms in the mushroom dataset
Again, the UBMFFP-tree algorithm ran faster than the other algorithms at the different minimum support thresholds. The numbers of tree nodes were also compared for the three algorithms in fuzzy 3-regions, as shown in Figure 7.24.
Figure 7.24. Comparison of the numbers of tree nodes for fuzzy 3-regions obtained using the proposed three fuzzy FP-tree algorithms in the mushroom dataset
It can be observed that the UBMFFP-tree algorithm has the same number of tree nodes as the CMFFP-tree algorithm, and fewer than the MFFP-tree algorithm. The reason is the same as that given earlier in Section 7.3.