
Chapter 5 Parallel FP-tree Algorithm for Frequent Pattern Mining Problem

5.4 Tidset-based Parallel FP-tree (TPFP-tree)

In this dissertation, parallel algorithms for frequent pattern mining on Cluster and Grid systems are proposed. Despite the results of previous research [44, 60], two important issues still need to be considered for a parallel frequent pattern mining algorithm: reducing the communication cost and balancing the workload among the computing nodes. In order to evaluate the execution time of the different computing stages in detail, the PFP-tree [44] algorithm was implemented. Table 5-1 shows the execution time of each stage of the PFP-tree, and it can be observed that the exchange stage dominates the others. The exchange stage was therefore analyzed in depth. In this stage, each processor first extracts the candidate tree paths required by the other processors, then exchanges the extracted paths and inserts the received paths back into its local FP-tree. Consequently, performance deteriorates with larger databases or lower thresholds, and more processors also lead to worse load balancing. Performance can therefore be improved significantly if the execution time of the exchange stage is reduced and the workload is distributed evenly among the processors.

Table 5-1: Execution Time of Each Stage of the PFP-tree on Different Processors (t20.i04.d200k.n100k, Threshold = 0.0005)

Stage          p1      p2      p3      p4      p5      p6      p7      p8
Data Transfer  0.91    0.04    0.06    0.12    0.06    0.05    0.08    0.04
Header Table   0.60    0.59    0.60    0.59    0.60    0.60    0.60    0.59
All-reduce     1.05    0.17    0.18    0.17    0.17    0.17    0.17    0.17
FP-tree        13.62   12.49   12.49   12.40   12.40   12.42   12.43   12.51
Exchange       98.29   157.11  204.78  233.51  241.07  235.06  223.06  197.02
FP-growth      18.06   26.34   27.09   31.10   24.51   22.44   20.07   12.59
Total          132.53  196.74  245.20  277.89  278.81  270.74  256.41  222.92

The goal of our algorithm is to reduce the computation and communication cost of the exchange stage. Extracting the candidate tree paths from an FP-tree requires repeatedly traversing the entire tree, and inserting the exchanged paths back into the destination tree requires further repeated traversals, so the exchange stage is inherently costly. For this reason, the proposed algorithm postpones the FP-tree construction procedure: after the Header Table is created, the information necessary for parallel mining is exchanged at the transaction level of the DB instead of as tree paths of an FP-tree.

However, indexing the necessary transactions is costly when the number of processors increases. For example, when there are n processors, processor p_i needs to process mineSet_i, where mineSet_i is the block of items partitioned from the header table for processor p_i. Processor p_0 would therefore have to scan its database |mineSet_1| + |mineSet_2| + ... + |mineSet_n| times and then transfer the selected transactions to the corresponding processors; an efficient index of which transactions contain a given item can speed up this process considerably. For that reason, the transaction id (TID) is used to index the items. For a transactional database DB = {T_1, T_2, ..., T_n}, where T_i ⊆ I and I = {i_1, i_2, ..., i_m}, the TID of an item i_j is defined as TID(i_j) = { k | T_k ∩ {i_j} ≠ φ, k = 1, ..., n }, that is, the set of identifiers of the transactions containing i_j. After the TIDs are created, transactions can be selected directly while the information for mining frequent patterns is exchanged.
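To make the TID definition concrete, the following sketch computes TID(i_j) over the DB_1 partition of Figure 5-1. Python is used for illustration only; the thesis does not provide an implementation, and all names here are hypothetical.

def tid(item, db):
    # TID(item) = { k | item in T_k, k = 1..n }, with 1-based local TIDs.
    return {k for k, t in enumerate(db, start=1) if item in t}

db1 = [{'F', 'C', 'A', 'M', 'B', 'P'},       # T_1
       {'F', 'C', 'A', 'M', 'B', 'P'},       # T_2
       {'F', 'C', 'A', 'M', 'B', 'H'},       # T_3
       {'F', 'C', 'A', 'M', 'B', 'H', 'G'}]  # T_4

assert tid('F', db1) == {1, 2, 3, 4}   # F appears in transactions 1 to 4
assert tid('H', db1) == {3, 4}         # H appears in transactions 3 and 4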

Since finding all frequent patterns in a transactional database is a computationally intensive problem, a parallel and distributed strategy can reduce the execution time and improve the mining performance. Therefore, a first parallel FP-tree algorithm based on TIDs is developed for Cluster computing. Since a Cluster is a homogeneous computing environment, the proposed algorithm distributes the workload evenly to each processor without considering differences between processors. The main objective is to reduce the execution time of exchanging the mining information and to lower the cost of indexing and extracting transactions. There are five primary stages in the Tidset-based Parallel FP-tree (TPFP-tree) algorithm: (1) Header Table and Tidset creation, (2) mining item set distribution, (3) transaction exchange, (4) FP-tree construction and (5) FP-growth.

Firstly, although creating the header table needs only one database scan, the execution time is still considerable when the database is large. Therefore, the TPFP-tree uses block distribution to partition the database and distributes the partitions to the corresponding computing nodes. Moreover, in order to directly select the transactions containing a given item in subsequent procedures, a local transaction identification set (Tidset) is also created in this stage. After stage 1, the frequent 1-itemsets satisfying the given threshold have been found.
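Because the header table and the Tidset can be produced by the same scan, stage 1 can be sketched as a single pass over the local partition. The helper below is illustrative only, under the same hypothetical naming as above.

def scan_partition(local_db):
    # One scan builds the local header table HT_i (item -> support count)
    # and the local Tidset_i (item -> set of 1-based local TIDs).
    ht, tidset = {}, {}
    for k, t in enumerate(local_db, start=1):
        for item in t:
            ht[item] = ht.get(item, 0) + 1
            tidset.setdefault(item, set()).add(k)
    return ht, tidset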

These frequent 1-itemsets are the mining items of the TPFP-tree algorithm. The mining items are then distributed equally to the participating processors: with n frequent 1-itemsets and p processors, each processor is assigned approximately ⌈n/p⌉ items to mine, as sketched below.
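The thesis does not pin down how a number of items not divisible by p is split; assigning the larger blocks last reproduces the 3/3/3/4 split of the example in Figure 5-2 (13 mining items, p = 4), so that convention is assumed in this sketch.

def block_partition(items, p):
    # Divide a support-sorted item list into p contiguous mining sets.
    base, extra = divmod(len(items), p)
    sizes = [base] * (p - extra) + [base + 1] * extra
    blocks, start = [], 0
    for s in sizes:
        blocks.append(items[start:start + s])
        start += s
    return blocks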

In order to build the tree structure and mine the frequent patterns with FP-growth on each processor independently, a processor must obtain, from the other processors, the transactions containing its assigned mining items. In the transaction exchange stage, processor p_i scans its partial database to gather the transactions containing the mining items required by the other processors. However, this is costly, since p_i must scan its database p − 1 times to gather all of these transactions. Hence, the Tidset is used to improve the transaction selection.

The Tidset is a map between items and transactions, so the transactions containing given items can be chosen directly from the Tidset. Since the Tidset table can be created concurrently with the frequent 1-itemsets, the Tidset of each partial database is built in stage 1. The selected transactions are then transferred to the corresponding processors after they have been gathered. Moreover, with the Tidset table, more processors no longer lead to worse performance.
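A sketch of the Tidset-based selection follows, under the same hypothetical conventions as the helpers above: the transactions containing any item of a remote mining set are found by unioning Tidset entries, avoiding one database scan per mining set.

def select_transactions(local_db, tidset, mining_set):
    # Union the TID sets of the requested items, then pick the
    # corresponding transactions directly (TIDs are 1-based).
    tids = set()
    for item in mining_set:
        tids |= tidset.get(item, set())
    return [local_db[k - 1] for k in sorted(tids)]

With Tidset_1 of Figure 5-1, the mining set {M, H, G} yields the TID union {1, 2, 3, 4}, so p_1 ships transactions 1 to 4 of DB_1 to p_2, matching the exchange example of Figure 5-2.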

After the transactions have been exchanged, a processor holds all the transactions corresponding to its assigned mining items. Therefore, the processor can independently build its FP-tree and mine the frequent patterns with FP-growth. Finally, after the mining processes complete, p_1 collects the frequent patterns from the other processors and merges them into the complete set of frequent patterns. The detailed algorithm of the TPFP-tree is given below.

Algorithm TPFP-tree

Input: a transaction database DB = {T_1, T_2, ..., T_n}, where each transaction T_i ⊆ I, I = {i_1, i_2, ..., i_m}; a given minimum threshold ξ; and p, the number of processors (p_1 is the master node (MN), and p_2, p_3, ..., p_p are slave nodes (SNs)).

Output: a complete set of frequent patterns, where Sup(x_i) ≥ ξ, ∀ x_i.

Method:

Step 1. MN equally divides the database DB into p disjoint partitions (DB_1, DB_2, ..., DB_p, where DB_1 ∪ DB_2 ∪ ... ∪ DB_p = DB) and assigns DB_i to p_i.

Step 2. Each processor p_i receives its database DB_i and scans DB_i to create a local header table (HT_i).

Step 3. Each processor creates the local transaction identification set (Tidset_i) of DB_i.

Step 4. The processors perform an all-reduce of the HT_i to obtain a global header table (GHT).

Step 5. MN sorts the items in the GHT in descending order of support and block-divides those items into the mining sets MS_1, MS_2, ..., MS_p, where MS_1 ∪ MS_2 ∪ ... ∪ MS_p = the items of the GHT.

Step 6. MN broadcasts the mining set information to all processors.

Step 7. In order to create its FP-tree, each processor p_i has to obtain every transaction T_jk on processor p_j (j = 1...p, j ≠ i) such that T_jk ∩ MS_i ≠ φ (k = 1...|DB_j|). Since the mining sets MS_i are partitioned statically, each processor knows the mining sets of the others. Moreover, Tidset_i (i = 1...p) helps select those transactions directly in the local database. After that, each processor exchanges the transactions required for mining, and NewDB_i = DB_i ∪ ReceivedTransactions_i.

Step 8. Each processor p_i performs the FP-tree construction procedure on NewDB_i.

Step 9. Each processor p_i performs the FP-growth procedure to mine its given MS_i from the local FP-tree.

Step 10. MN performs the MPI All-Reduce function to collect the frequent patterns from p_i (i = 1...p).
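To connect the ten steps, the following sketch arranges the helpers above into the TPFP-tree message flow. It assumes the mpi4py bindings for the MPI primitives (the thesis itself only names operations such as broadcast and MPI All-Reduce), and fp_growth stands in for an existing FP-tree construction and FP-growth routine that is not defined here.

from mpi4py import MPI

def tpfp_tree(local_db, xi, fp_growth):
    # local_db is this node's partition DB_i; xi is the minimum support
    # threshold as an absolute count, as in the Figure 5-1 example.
    comm = MPI.COMM_WORLD
    rank, p = comm.Get_rank(), comm.Get_size()

    # Steps 2-3: one scan builds the local header table and Tidset.
    ht, tidset = scan_partition(local_db)

    # Step 4: all-reduce the local counts into a global header table.
    ght = {}
    for h in comm.allgather(ht):
        for item, cnt in h.items():
            ght[item] = ght.get(item, 0) + cnt

    # Steps 5-6: every node already holds the GHT after the allgather, so
    # each derives the same mining sets that the MN would broadcast.
    # Items tied in support keep a deterministic order on every node.
    frequent = sorted((i for i, c in ght.items() if c >= xi),
                      key=lambda i: -ght[i])
    mining_sets = block_partition(frequent, p)

    # Step 7: select via the Tidset the transactions each peer needs and
    # exchange them; NewDB_i = DB_i ∪ ReceivedTransactions_i.
    outgoing = [select_transactions(local_db, tidset, mining_sets[j])
                if j != rank else [] for j in range(p)]
    new_db = local_db + [t for bucket in comm.alltoall(outgoing)
                         for t in bucket]

    # Steps 8-9: build the FP-tree and mine MS_i independently.
    local_patterns = fp_growth(new_db, mining_sets[rank], xi)

    # Step 10: the thesis names MPI All-Reduce; a gather to the MN
    # (rank 0) is used here for simplicity.
    return comm.gather(local_patterns, root=0)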

Figure 5-1 is an example of the header and Tidset tables for four processors. Figure 5-1 (a) shows the database equally partitioned into four parts, with each transaction's local identity (TID). Figure 5-1 (b) depicts the Tidset tables created from the database: from Tidset_1, item F appears in transactions 1 to 4, item H appears in transactions 3 and 4, and so on. Moreover, the local header tables (HT) are created at the same time (Figure 5-1 (c)). Finally, the processors perform an all-reduce to obtain a global header table (GHT). After that, the master node (MN) sorts the GHT in descending order of support and divides the items into mining sets (MS) using block distribution. Then the MN broadcasts the MSs to all processors.

Each processor then scans its database to extract the transactions to be transferred to the others. Figure 5-2 shows the exchange stage: Figure 5-2 (a) lists the MS of each processor, and Figure 5-2 (b) shows that p_1 has to prepare three tables recording the transactions to be sent for exchange. Since it would be costly to scan the database three times to create these tables (Figure 5-2 (b)), the Tidset table is created beforehand (Figure 5-1 (b)). For example, p_1 sends the transactions containing M, H or G to p_2; according to Tidset_1, the union of the TIDs of items M, H and G is 1, 2, 3 and 4, so p_1 sends transactions 1 to 4 to p_2. By the same process, the essential transactions can be exchanged efficiently among the processors. Since each processor then holds the transactions necessary for its mining task, each one can build an FP-tree and use FP-growth to find the frequent patterns independently. Finally, the MN gathers the frequent patterns produced by each processor to form the complete set of frequent patterns.

[Figure 5-1 shows the running example with p = 4 and ξ = 2: (a) the DB block-partitioned into DB1 (p1) to DB4 (p4) with local TIDs; (b) the Tidset tables Tidset1 to Tidset4; (c) the local header tables HT1 to HT4; (d) the GHT (F:16, C:14, A:13, M:11, H:8, G:7, B:7, P:5, K:4, L:3, D:2, E:2, O:2) block-divided into the mining sets MS1 to MS4.]

Figure 5-1: Example of DB Partitioning into 4 Processors with the Given Threshold ξ = 2

[Figure 5-2 shows the exchange stage of the same example: (a) the mining sets MS1 = {F, C, A}, MS2 = {M, H, G}, MS3 = {B, P, K} and MS4 = {L, D, E, O} with their supports; (b) for each processor, the tables of transactions to be sent to each of the other three processors, selected via the Tidsets.]

Figure 5-2: Example of the Exchange Stage of 4 Processors

5.5 Balanced Tidset Parallel FP-tree (BTP-tree) Algorithm for Grid Computing
