An Efficient FUFP-tree Maintenance Algorithm for Record Modification


Tzung-Pei Hong

Department of Computer Science and Information Engineering National University of Kaohsiung

Department of Computer Science and Engineering National Sun Yat-sen University

Kaohsiung, {811, 804}, Taiwan, R.O.C. [email protected]

Chun-Wei Lin

Department of Computer Science and Information Engineering National Cheng Kung University

Tainan, 701, Taiwan, R.O.C. [email protected]

Yu-Lung Wu

Department of Information Management I-Shou University

Kaohsiung, 84008, Taiwan, R.O.C. [email protected]

ABSTRACT. The Frequent-Pattern-tree (FP-tree) is an efficient data structure for association-rule mining without the generation of candidate itemsets. It is used to compress a database into a tree structure, which stores only large items. When the underlying data is updated, the FP-tree, however, needs to process all the transactions in a batch way. In this paper, we thus attempt to extend the FP-tree construction algorithm for the efficient handling of record modification. An expeditious FP-tree (FUFP-tree) structure is used to ease the tree update process. An FUFP-tree maintenance algorithm is also proposed for reducing the execution time in reconstructing the tree when records are modified. Experimental results show that the proposed FUFP-tree maintenance algorithm for record modification runs faster than the batch FP-tree construction algorithm for handling updated records and generates nearly the same tree structure as the FP-tree algorithm. The proposed approach can thus achieve a good trade-off between execution time and tree complexity.

Keywords: data mining, FP-tree, FUFP-tree, record modification, maintenance

1. Introduction. Data mining involves applying specific algorithms to extract patterns, features or rules from data sets in a particular representation. One common type of data mining is to derive association rules from transaction data, such that the presence of certain items in a transaction will imply the presence of some other items. Many mining approaches have been proposed to achieve this purpose [1][2][3][6][7][9][10][11][13][14][16][17][18]. For example, Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules from transaction data [1][2][3]. In their approaches, candidate itemsets had to be first generated to determine large itemsets and association rules.

Cheung et al. proposed a noticeable incremental mining algorithm, called the Fast Updated Algorithm (FUP) [4] for avoiding the shortcomings of batch mining. The FUP algorithm modified the Apriori mining algorithm [2] and adopted the pruning techniques used in the DHP (Direct Hashing and Pruning) algorithm [14]. It first calculated large itemsets mainly from newly inserted transactions, and compared them with the previous large itemsets from the original database. According to the comparison results, FUP determined whether re-scanning the original database was required, thus saving some time in maintaining the association rules.

Han et al. proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without generation of candidate itemsets [8]. The FP-tree [8] was used to compress a database into a tree structure, which stored only large items. It was condensed and complete for finding all the frequent patterns. The construction process was executed tuple by tuple, from the first transaction to the last. After that, a recursive mining procedure called FP-Growth was executed to derive frequent patterns from the FP-tree. They demonstrated that the approach could have a better performance than Apriori. The FP-tree mining approach belongs to batch mining; that is, all transactions must be processed in a batch way.

Many mining methods for finding association rules based on the FP-tree structure have also been proposed. Qiu et al. proposed the QFP-growth mining approach to mine association rules [15]. It could generate frequent patterns without using conditional FP-trees, and its computational time and space were reduced compared to the original FP-tree approach. The QFP-growth and the batch FP-tree mining algorithms cannot, however, deal with the problem of incremental mining. Whenever records are changed by insertion, deletion or modification, the trees must be rebuilt from scratch. Even when the number of changed records is small, the traditional methods must still process the whole database in a batch way. To deal with the incremental mining problem, Ezeife constructed a generalized FP-tree, which stored all the large and non-large items, for incremental mining without rescanning databases [5]. All the non-large items had to be kept, thus requiring a large amount of space. Hong et al. also proposed an efficient mining algorithm based on the FP-tree for handling the insertion of records [12]. In that approach, an expeditious FUFP-tree structure was used to simplify the tree update process [12]. It was similar to the FP-tree structure except that the links between the parent and child nodes were bi-directional. The counts of the sorted frequent items were kept in the Header Table as well. The bi-directional links and the sorted counts in the Header Table together sped up the maintenance process.

In addition to record insertion, record modification is also commonly seen in real-world applications. Although record modification can be handled by a deletion procedure followed by an insertion procedure, doing so requires twice the computational time needed for a single procedure. Therefore, developing an efficient maintenance algorithm for record modification is essential. In this paper, we thus propose a maintenance algorithm based on the FUFP-tree for the efficient handling of record modification. When records in the database are modified, the proposed algorithm will process them to maintain the FUFP-tree and the Header Table. The count difference is first formed by comparing the counts of each updated item before and after record modification. The proposed maintenance algorithm then partitions the items into four sections according to whether they are large in the original database and whether their count difference is positive or negative (including zero). Each section is then processed in its own way. The Header Table and the FUFP-tree are correspondingly updated whenever necessary.

The remainder of this paper is organized as follows. Related works are reviewed in Section 2. The proposed algorithm for record modification is described in Section 3. An example to illustrate the proposed algorithm is given in Section 4. Experimental results for showing the performance of the proposed algorithm are provided in Section 5. Conclusions are given in Section 6.

2. Review of the Frequent Pattern Tree. Han et al. proposed the Frequent-Pattern-tree structure (FP-tree) for efficiently mining association rules without the generation of candidate itemsets [8]. The FP-tree mining algorithm consists of two phases. The first phase focuses on constructing the FP-tree from the database, and the second phase focuses on deriving frequent patterns from the FP-tree. They are described below.

2.1. Construction of an FP-tree. The FP-tree [8] is used to compress a database into a tree structure storing only large items. It is condensed and complete for determining all the frequent patterns. Three steps are involved in FP-tree construction. The database is first scanned to find all items with their frequency. The items with their supports larger than a predefined minimum support are selected as large 1-itemsets (items). Next, the large items are sorted in descending frequency. Finally, the database is scanned once more to construct the FP-tree according to the sorted order of large items. The construction process is executed tuple by tuple, from the first transaction to the last. After all transactions are processed, the FP-tree is completely constructed.

The Header Table, built to facilitate tree traversal, includes the sorted large items and their pointers (called frequency head) linked to their first occurrence nodes in the FP-tree. If more than one node has the same item name, they are also linked in sequence. Note that the links between nodes are uni-directional, from parents to children.

Below, a simple example is given to illustrate the process of the FP-tree construction. Assume there are five transactions shown in Table 1. Each transaction has its transaction identifier (TID) and the items purchased. Each item is denoted by a symbol. Also assume the minimum support is set at 50%. The FP-tree is constructed in the following way [8].

TABLE 1. The five transactions in the example

TID   Items
100   a, c, d, f, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, c, e, f, l, m, n, p

First, the database is scanned to find large items. In this example, the five transactions are scanned to find the items with their counts shown in Table 2, in which the large items are marked.

TABLE 2. All the items with their counts (large items are marked with *)

Item  Frequency    Item  Frequency
a*    3            j     1
b*    3            k     1
c*    4            l     2
d     1            m*    3
e     1            n     1
f*    4            o     2
g     1            p*    3
h     1            s     1
i     1            w     1

From Table 2, it can be observed that the set of large 1-itemsets, named L1, includes {a:3, b:3, c:4, f:4, m:3, p:3}, where the number after an item represents its count. Next, the items in L1 are sorted according to their descending frequency. The sorted L1, named L1’, is {f:4, c:4, a:3, b:3, m:3, p:3}. Finally, the database is scanned again to construct the FP-tree. The transactions with only the sorted large items are shown in Table 3 to make the construction process easier to follow.

TABLE 3. The transactions with only the sorted large items

TID   Sorted frequent items
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

In Table 3, the first transaction is (f, c, a, m, p). The root of the FP-tree is first set to Null. This transaction is then inserted into the FP-tree as the first branch. Each node in the branch is given a count of 1. The results after the first transaction is processed are shown in Figure 1.

[Figure 1: the sorted transactions of Table 3 shown beside the FP-tree, which consists of the single branch {} -> f:1 -> c:1 -> a:1 -> m:1 -> p:1.]

FIGURE 1. The FP-tree after the first transaction is processed.

The second transaction is next processed. It shares the same prefix (f, c, a) as the first branch of the FP-tree. The counts of nodes f, c and a are then incremented by 1, and a new node (b:1) is created and linked to (a:2) as its child. Another new node (m:1) is then created and linked to (b:1). Besides, a link is created between the two nodes of m. The results after the second transaction is processed are shown in Figure 2.

[Figure 2: the sorted transactions of Table 3 shown beside the FP-tree, which has the shared prefix {} -> f:2 -> c:2 -> a:2 and two sub-branches below a:2, namely m:1 -> p:1 and b:1 -> m:1.]

FIGURE 2. The FP-tree after the second transaction is processed.

The same process is then executed for the other transactions. After all the transactions are processed, the resulting Header Table and FP-tree are shown in Figure 3.

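[Figure 3: the Header Table lists the items f, c, a, b, m, p, each with a frequency head pointing to its first node in the tree. The FP-tree has the branches {} -> f:4 -> c:3 -> a:3 -> m:2 -> p:2, a:3 -> b:1 -> m:1, f:4 -> b:1, and {} -> c:1 -> b:1 -> p:1.]

FIGURE 3. The resulting Header Table and FP-tree in the example.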
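The construction steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `Node`, `build_fp_tree` and `tie_break` are hypothetical, and because the paper does not state how equal counts are ordered, the tie order is passed in explicitly so that it matches the sorted list L1’ given above.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count, tie_break):
    # Scan 1: count each item once per transaction and keep the large ones.
    freq = Counter(item for t in transactions for item in set(t))
    large = [i for i in freq if freq[i] >= min_count]
    # Sort by descending count; tie_break is an assumed order for equal counts.
    order = sorted(large, key=lambda i: (-freq[i], tie_break.index(i)))
    root = Node(None)
    header = {item: [] for item in order}  # item -> nodes carrying that item
    # Scan 2: insert each transaction, restricted to large items, as a branch.
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order


# Table 1, with a minimum support of 50% of five transactions.
table1 = ["acdfgimp", "abcflmo", "bfhjo", "bcksp", "acefmlnp"]
root, header, order = build_fp_tree(table1, 0.5 * len(table1), "fcabmp")
```

Walking `root.children` reproduces the branches described for Figure 3; for instance, `root.children["f"].count` is 4 and `header["b"]` holds three separate b nodes.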

2.2. Mining of Large Itemsets. After the FP-tree is constructed from a database, a mining procedure called FP-Growth [8] is executed to find all large itemsets. FP-Growth does not need to generate candidate itemsets for mining, but derives frequent patterns directly from the FP-tree. It is a recursive process, handling the frequent items one by one and bottom-up according to the Header Table. A conditional FP-tree is generated for each frequent item and from the tree the large itemsets with the processed item can be recursively derived. Specifically, a conditional FP-tree is generated in the following way. Let a prefix path of an item I in the FP-tree be the preceding part of a branch above I. The corresponding prefix paths for a large item I are first extracted from the FP-tree. The count of each node in a prefix path is set as the count of I in the same branch. The counts of an item appearing in different prefix paths are then calculated. The items with their counts larger than or equal to the minimum count are selected to build the conditional FP-tree for I. Each prefix path, like a transaction, is used to build the conditional FP-tree as in the FP-tree construction. A conditional FP-tree is thus similar to a sub-FP-tree with the processed item lying at its leaves. An itemset composed of the original item I and each item in the conditional FP-tree is certain to be large. The process is recursively executed until all the items in a conditional FP-tree are processed.
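The prefix-path step can be illustrated on the sorted transactions of Table 3. The sketch below is only an illustration under a simplifying assumption: instead of following the node links of the FP-tree, it groups the sorted transactions by the prefix that precedes the item, which yields the same prefix paths and counts. The function names are hypothetical.

```python
from collections import Counter

# Sorted large items of each transaction (Table 3).
sorted_transactions = [list("fcamp"), list("fcabm"), list("fb"),
                       list("cbp"), list("fcamp")]


def prefix_paths(transactions, item):
    """Prefix paths of `item`, each with the count of `item` on that path."""
    paths = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                paths[prefix] += 1
    return paths


def conditional_items(paths, min_count):
    """Items whose summed count over all prefix paths reaches min_count."""
    counts = Counter()
    for prefix, n in paths.items():
        for i in prefix:
            counts[i] += n
    return {i: c for i, c in counts.items() if c >= min_count}


paths_p = prefix_paths(sorted_transactions, "p")
items_p = conditional_items(paths_p, 3)   # minimum count 3 (50% of 5 records)
items_m = conditional_items(prefix_paths(sorted_transactions, "m"), 3)
```

Here only c survives in the conditional tree for p, so {c, p} is a large itemset, while f, c and a all survive for m.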

3. The Proposed FUFP-tree Maintenance Approach for Record Modification.

3.1. Design Concept. Assume an FUFP-tree has been built in advance from the original database before records are modified. The FUFP-tree construction algorithm is the same as the FP-tree algorithm [8] except that the links between parent and child nodes are bi-directional. Bi-directional linking will help to hasten the process of item modification in the maintenance process. The counts of the sorted frequent items are recorded in the Header Table as well.
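A minimal sketch of what bi-directional linking buys during maintenance is given below. It is an assumption-laden illustration, not the authors' data structure: the class `Node` and the function `splice_out` are hypothetical names, children are kept in a plain list, and siblings that end up carrying the same item after a removal are not merged, since the paper does not spell out that case.

```python
class Node:
    """FUFP-tree node with links in both directions."""

    def __init__(self, item, count, parent=None):
        self.item = item
        self.count = count
        self.parent = parent      # upward link (child -> parent)
        self.children = []        # downward links (parent -> children)
        if parent is not None:
            parent.children.append(self)


def splice_out(node):
    """Remove `node` and connect its parent directly to its children."""
    parent = node.parent
    parent.children.remove(node)
    for child in node.children:
        child.parent = parent
        parent.children.append(child)
    node.parent, node.children = None, []


# A single branch {} -> b:9 -> a:6 -> g:2 -> h:2, then item g is removed.
root = Node(None, 0)
b = Node("b", 9, root)
a = Node("a", 6, b)
g = Node("g", 2, a)
h = Node("h", 2, g)
splice_out(g)
```

Because every node knows its parent, the removal touches only the node, its parent and its children; no traversal from the root is needed.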

When records in the database are modified, the proposed algorithm will process them to maintain the FUFP-tree. The count difference is first formed by comparing the counts of each updated item before and after record modification. The proposed maintenance algorithm then partitions items into four sections according to whether they are large in the original database and whether their count difference is positive or negative (including zero). Each section is then processed in its own way. The Header Table and the FUFP-tree are correspondingly updated whenever necessary.

Considering an original database and some records to be modified, the following four cases (illustrated in Figure 4) may arise.

Case 1: An item is frequent in an original database and has a positive count difference.

Case 2: An item is frequent in an original database and has a negative (including zero) count difference.

Case 3: An item is not frequent in an original database and has a positive count difference.

Case 4: An item is not frequent in an original database and has a negative (including zero) count difference.

[Figure 4: a two-by-two grid. Rows: large items and small items in the original database. Columns: positive count difference and negative (or zero) count difference. Cells: Case 1 (large, positive), Case 2 (large, negative or zero), Case 3 (small, positive), Case 4 (small, negative or zero).]

FIGURE 4. Four cases when records are modified from an existing database.

Since items in Case 1 are large in the original database and have a positive count difference, they will remain large after the database is updated. Similarly, items in Case 4 will remain small after the records are modified. Thus, Cases 1 and 4 will not affect the final large items. Items in Case 2 are large in the original database and have negative (or zero) count difference. Some existing large items may be removed after the database is modified. It is easily decided since the counts of the original large items are kept in the Header Table. At last, items in Case 3 are small in the original database and have a positive count difference. Some large items may thus be added. The original database must be rescanned to detect the original counts of these items. The summary of the four cases and their results is given in Table 4.

TABLE 4. Four cases and their results for record modification

Case     Original   Difference           Result
Case 1   Large      Positive             Always large
Case 2   Large      Negative (or zero)   Determined from the Header Table
Case 3   Small      Positive             Determined by rescanning the original database
Case 4   Small      Negative (or zero)   Always small
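The two "always" entries of Table 4 follow from one observation: modifying records does not change the number of records d, so the minimum count d*Sup is the same before and after the update. Using the symbols of Section 3.2, and assuming (as in the algorithm's Substeps 5-2 and 5-3) that an item is large exactly when its count is at least d*Sup, the argument can be written as:

```latex
\begin{align*}
S^{U}(I) &= S^{D}(I) + S^{M}(I), \\
\text{Case 1: } & S^{D}(I) \ge d \cdot Sup \ \text{and}\ S^{M}(I) > 0
  \;\Longrightarrow\; S^{U}(I) > d \cdot Sup \quad \text{(always large)}, \\
\text{Case 4: } & S^{D}(I) < d \cdot Sup \ \text{and}\ S^{M}(I) \le 0
  \;\Longrightarrow\; S^{U}(I) < d \cdot Sup \quad \text{(always small)}.
\end{align*}
```

In Cases 2 and 3 the sign of the difference works against the original status, so the updated count has to be computed and compared with d*Sup.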

In the maintenance process of the FUFP-tree for record modification, item deletion is completed before item insertion. When an originally large item becomes small, it is directly removed from the FUFP-tree and its parent and child nodes are then linked together. Conversely, when an originally small item becomes large, it is added to the end of the Header Table and then inserted into the leaf nodes of the FUFP-tree. Inserting the item at the end of the Header Table is reasonable because, when an originally small item becomes large due to the modified records, its updated support is usually only a little larger than the minimum support. The FUFP-tree can thus be updated in place rather than rebuilt, which greatly improves the performance of the proposed maintenance algorithm. The entire FUFP-tree can be re-constructed in a batch way once a sufficiently large number of transactions have been modified. The notation used in this paper is first described below.

3.2. Notation.

D: the original database;

T: the set of modified records (after modification);

T’: the set of records to be modified (before modification);

D-: the set of unchanged records, i.e., D - T’;

U: the entire updated database;

M: the set of items appearing in the updated records before and after modification;

d: the number of records in D;

t: the number of records in T;

d-: the number of records in D-;

SD(I): the number of occurrences of I in D;

SM(I): the count difference of I from the updated records, I ∈ M;

SU(I): the number of occurrences of I in U;

Sup: the support threshold for large itemsets;

Decrease_Items: the set of items with which the updated records before modification (i.e. in T’) are reprocessed to decrease the corresponding counts in the FUFP-tree;

Increase_Items: the set of items with which the updated records after modification (i.e. in T) are reprocessed to increase the corresponding counts in the FUFP-tree;

Rescan_Items: the set of items for which the unmodified records in the original database are rescanned;

Rescan_Transactions: the set of unmodified records with at least one item in the set of Rescan_Items.


The details of the proposed algorithm are described below.

The Proposed Algorithm:

INPUT: An old database, its corresponding Header Table storing the frequent items in descending order, its corresponding FUFP-tree, a support threshold Sup and a set of t modified records.

OUTPUT: A new FUFP-tree for the updated database.

STEP 1: Find all the items in the t records before and after modification. Denote them as a set of modified items, M.

STEP 2: Find the count difference (including zero) of each item in M for the modified records.

STEP 3: Check whether the items in M are large or small in the original database.

STEP 4: For each item I in M, which has a positive count difference and is large in the original database (appearing in the Header Table), do the following substeps (Case 1):

Substep 4-1: Set the new count SU(I) of I in the entire updated database as: SU(I) = SD(I) + SM(I), where SD(I) is the count of I in the Header Table (original database) and SM(I) is the count difference of I after record modification.

Substep 4-2: Update the count of I in the Header Table as SU(I).

Substep 4-3: Put I in both the sets of Increase_Items and Decrease_Items, which will be further processed in STEP 7.

STEP 5: For each item I in M, which has a negative (or zero) count difference and is large in the original database (appearing in the Header Table), do the following substeps (Case 2):

Substep 5-1: Set the new count SU(I) of I in the entire updated database as: SU(I)=SD(I)+SM(I).

Substep 5-2: If SU(I) ≥ d*Sup, item I will still be large after the database is updated; update the count of I in the Header Table as SU(I) and add I to both the sets of Increase_Items and Decrease_Items.

Substep 5-3: If SU(I) < d*Sup, item I will become small after the database is updated; remove I from the Header Table, connect each parent node of I directly to the corresponding child node of I, and remove I from the FUFP-tree.

STEP 6: For each item I in M, which has a positive count difference and is small in the original database (not appearing in the Header Table), do the following substeps (Case 3):

Substep 6-1: Rescan the original database to detect the transactions with item I, and calculate the count SD(I) of I in the original database before modification.

Substep 6-2: Set the new count SU(I) of I in the entire updated database as: SU(I) = SD(I) + SM(I).

Substep 6-3: If SU(I) ≥ d*Sup, item I will be large after the database is updated; add item I to the sets of Increase_Items and Rescan_Items, and put the transaction IDs with item I from the unchanged records D- into the set of Rescan_Transactions.

STEP 7: For each updated record before modification (T’) and with an item J existing in the set of Decrease_Items, find the corresponding branch of J in the FUFP-tree for the record, and subtract 1 from the count of the J node in the branch. If the count of the J node becomes zero after subtraction, remove node J from its corresponding branch and connect the parent node of J directly to the child node.

STEP 8: Sort the items in the Rescan_Items in a descending order of their updated counts.

STEP 9: Insert the items in the Rescan_Items to the end of the Header Table according to the sorted order.

STEP 10: For the records in the Rescan_Transactions with an item J existing in the set of Rescan_Items, if J has not been at the corresponding branch of the FUFP-tree for the record, insert J at the end of the branch and set its count as 1. Otherwise, add 1 to the count of the node J.

STEP 11: For the updated records after modification with an item J existing in the Increase_Items, if J has not been at the corresponding branch of the FUFP-tree, insert J at the end of the branch and set its count as 1. Otherwise, add 1 to the count of the J node.

In Step 7, a corresponding branch is the branch which is generated only by the large items in a transaction, and corresponds to the order of items appearing in the Header Table. After Step 11, the final updated FUFP-tree is constructed. The modified records can then be integrated into the original database. Based on the FUFP-tree, the desired association rules can be subsequently established by the FP-Growth mining approach as proposed in [8].

4. An Example. In this section, an example is given to illustrate the proposed algorithm for maintaining an FUFP-tree when the records are modified. Table 5 shows a database to be used in the example. The database contains ten transactions and nine items, denoted from a to i.

TABLE 5. The original database in the example

Transaction No.   Items
1                 a, b, c, d, g, h
2                 b, f, g, i
3                 b, d, e, f, g
4                 a, b, f, h
5                 a, b, f
6                 a, c, d, g, h
7                 a, b, f, i
8                 a, b, e, f, h
9                 a, b, h, g
10                b, c, d, e

Assume the support threshold is set at 50%. For the given database, the large items are a, b, f, g and h, from which the Header Table can be constructed. The FUFP-tree is then formed from the database and the Header Table. The results are shown in Figure 5.

[Figure 5: Header Table (item, frequency): b 9, a 7, f 6, g 5, h 5. The tree drawing is not reproduced; the node labels that survive from it are b:9, a:6, f:4, f:2, g:2, g:2, h:2, a:1, g:1, h:1.]

FIGURE 5. The Header Table and the FUFP-tree constructed.

Assume the last four records (with No. 7 to 10) in the original database are modified as shown in Table 6. The proposed algorithm proceeds as follows.

TABLE 6. The four records after modification

Transaction No.   Items
7                 a, b
8                 a, b, h
9                 a, b, c, d, f
10                a, b, c, d

STEP 1: The items in the four records before and after modification are found as {a, b, c, d, e, f, g, h, i}, which are denoted by M.

STEP 2: The count difference of each item in M is found. For example, the counts of each item in M for the updated records before and after modification are shown in Tables 7 and 8, respectively.

TABLE 7. The counts of the items in M for the updated records before modification

Item   Count
a      3
b      4
c      1
d      1
e      2
f      2
g      1
h      2
i      1

TABLE 8. The counts of the items in M for the updated records after modification

Item   Count
a      4
b      4
c      2
d      2
e      0
f      1
g      0
h      1
i      0

The count difference of each item in M can then be easily calculated. The results are shown in Table 9.

TABLE 9. The count difference of each item in M

Item   Count difference
a       1
b       0
c       1
d       1
e      -2
f      -1
g      -1
h      -1
i      -1
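STEPs 1 and 2 amount to two item counts and a subtraction. The short sketch below recomputes Tables 7 to 9 from records 7 to 10 of Tables 5 and 6; the variable and function names are hypothetical and each record is written as a string of single-letter items for brevity.

```python
from collections import Counter

# Records 7 to 10 before (Table 5) and after (Table 6) modification.
before = {7: "abfi", 8: "abefh", 9: "abhg", 10: "bcde"}
after = {7: "ab", 8: "abh", 9: "abcdf", 10: "abcd"}


def item_counts(records):
    """Number of records in which each item occurs."""
    return Counter(item for items in records.values() for item in items)


count_before = item_counts(before)          # Table 7
count_after = item_counts(after)            # Table 8
# STEP 1: M is every item seen before or after modification.
M = sorted(set(count_before) | set(count_after))
# STEP 2: count difference; a Counter returns 0 for an absent item.
diff = {i: count_after[i] - count_before[i] for i in M}   # Table 9
```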

STEP 3: All the items in M are divided into two parts, {a, b, f, g, h} and {c, d, e, i}, according to whether they are large or small in the original database. Results are shown in Table 10.

TABLE 10. Two partitions of the items in M

Large items for the original database    Small items for the original database
Item   Count difference                  Item   Count difference
a       1                                e      -2
b       0                                c       1
f      -1                                d       1
g      -1                                i      -1
h      -1

STEP 4: The items in M, which have a positive count difference and are large in the original database, are processed. In this example, a is the only item which satisfies the condition. The count of item a in the Header Table is 7 and the count difference for item a is 1. As the new count of item a is 7+1 (= 8), the frequency value of item a in the Header Table is therefore changed to 8. Item a is then put into both sets of Increase_Items and Decrease_Items. After STEP 4, Increase_Items = {a} and Decrease_Items = {a}.

STEP 5: The items in M which have a negative (or zero) count difference and are large in the original database are processed. In this example, items b, f, g and h satisfy the condition and are processed. The minimum count for an item to be large in the updated database is 5. Take item b first as an example to illustrate the substeps. The count of item b in the Header Table is 9, and its count difference is 0; the count of item b is unchanged and is larger than the minimum count. As item b remains large for the updated database, it is placed into both the sets of Increase_Items and Decrease_Items. The frequency value of item b in the Header Table is not changed since its count difference is zero. Similarly, the updated count for item f is 6+(-1) (= 5), equal to the minimum count. Item f is still large for the updated database, and is put into both the sets of Increase_Items and Decrease_Items. The frequency value of item f in the Header Table is changed to 5. Because the new counts of items g and h are both calculated to be 4, which is smaller than the minimum count, items g and h are directly removed from the Header Table. In this case, the FUFP-tree needs to be processed as well: the g and h nodes are removed and their parent and child nodes are linked together. After STEP 5, Increase_Items = {a, b, f} and Decrease_Items = {a, b, f}. The updated FUFP-tree is shown in Figure 6.

[Figure 6: Header Table (item, frequency): b 9, a 8, f 5. The tree drawing is not reproduced; the node labels that survive from it are b:9, a:6, f:4, f:2, a:1.]

FIGURE 6. The Header Table and the FUFP-tree after STEP 5.

STEP 6: The items in M which have a positive count difference and are small in the original database (not appearing in the Header Table) are processed. In this example, items c and d satisfy the condition and are processed. The original database is rescanned to find the transactions with items c and d and their counts. The counts of items c and d in the original database are 3 and 4, respectively. After the database is updated, the count of item c becomes 3+1 (= 4), smaller than the minimum count. Item c remains a small item and will not affect the Header Table or the FUFP-tree. It is thus directly ignored. In contrast, the count of item d becomes 4+1 (= 5), equal to the minimum count. As item d becomes a large item after the database is updated, it will be inserted at the end of the Header Table (in STEP 9) and is placed into both the sets of Increase_Items and Rescan_Items. After STEP 6, Increase_Items = {a, b, f, d}, Rescan_Items = {d}, and Rescan_Transactions = {1, 3, 6}. The corresponding transactions with their IDs in the Rescan_Transactions are shown in Table 11.

TABLE 11. Transactions with their IDs in the Rescan_Transactions

Transaction No.   Items
1                 a, b, c, d, g, h
3                 b, d, e, f, g
6                 a, c, d, g, h
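The bookkeeping of STEPs 4 to 6 can be traced with the short sketch below. It is an illustration under stated assumptions rather than the authors' code: the count differences of Table 9 and the Header Table of Figure 5 are typed in directly, all variable names are hypothetical, Cases 1 and 2 share one branch because a Case 1 item always passes the threshold test, and only the Header Table is updated here (the tree itself is not touched).

```python
# Table 5 (original database) and Table 6 (records after modification).
D = {1: "abcdgh", 2: "bfgi", 3: "bdefg", 4: "abfh", 5: "abf",
     6: "acdgh", 7: "abfi", 8: "abefh", 9: "abhg", 10: "bcde"}
T_after = {7: "ab", 8: "abh", 9: "abcdf", 10: "abcd"}
diff = {"a": 1, "b": 0, "c": 1, "d": 1, "e": -2,
        "f": -1, "g": -1, "h": -1, "i": -1}           # Table 9
header = {"b": 9, "a": 7, "f": 6, "g": 5, "h": 5}      # Figure 5
min_count = 0.5 * len(D)                               # d * Sup = 5

increase, decrease, rescan, removed = set(), set(), set(), set()
for item, delta in diff.items():
    if item in header:                     # Cases 1 and 2: originally large
        new_count = header[item] + delta
        if new_count >= min_count:         # still large
            header[item] = new_count
            increase.add(item)
            decrease.add(item)
        else:                              # becomes small (Substep 5-3)
            del header[item]
            removed.add(item)
    elif delta > 0:                        # Case 3: originally small
        original = sum(item in items for items in D.values())  # rescan of D
        if original + delta >= min_count:
            increase.add(item)
            rescan.add(item)
    # Case 4 (small, non-positive difference): nothing to do.

# Unchanged records that contain at least one item of Rescan_Items.
rescan_transactions = sorted(
    tid for tid, items in D.items()
    if tid not in T_after and any(i in rescan for i in items))
```

The resulting sets agree with the values stated after STEP 6, and the Header Table holds b, a and f until d is appended in STEP 9.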

STEP 7: The FUFP-tree is updated according to the records before modification (T’) with items existing in the set of Decrease_Items. In this example, Decrease_Items = {a, b, f}. The corresponding branches for the records before modification are shown in Table 12.

TABLE 12. The corresponding branches for the records before modification

Transaction No.   Items            Corresponding branch
7                 a, b, f, i       b, a, f
8                 a, b, e, f, h    b, a, f
9                 a, b, h, g       b, a
10                b, c, d, e       b

The first branch shares the same prefix (b, a, f) as the current FUFP-tree. The counts for items b, a and f are then subtracted by 1 since they have been modified. The same process is then executed for the other three branches. The results are shown in Figure 7.

[Figure 7: Header Table (item, frequency): b 9, a 8, f 5. The tree drawing is not reproduced; the node labels that survive from it are b:5, a:3, f:2, f:2, a:1.]

FIGURE 7. The Final FUFP-tree after STEP 7.

STEP 8: The items in the set of Rescan_Items are sorted in the descending order of their updated counts. In this example, Rescan_Items contains only d, and no sorting is needed.

STEP 9: The items in the Rescan_Items are inserted into the end of the Header Table. In this example, only d is inserted. The Header Table after this step is shown in Figure 8.

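[Figure 8: Header Table (item, frequency): b 9, a 8, f 5, d 5. The tree drawing is not reproduced; the node labels that survive from it are b:5, a:3, f:2, f:2, a:1.]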

FIGURE 8. The Header Table and the FUFP-tree after item d is added.

STEP 10: The FUFP-tree is updated according to the records in the set of Rescan_Transactions. In this example, Rescan_Items = {d}. The corresponding branches for the records in Rescan_Transactions with d are shown in Table 13.

TABLE 13. The corresponding branches for the records in Rescan_Transactions

Transaction No.   Items               Corresponding branch
1                 a, b, c, d, g, h    b, a, d
3                 b, d, e, f, g       b, f, d
6                 a, c, d, g, h       a, d

The first branch is then processed. This branch shares the same prefix (b, a) as the current FUFP-tree. A new node (d:1) is created and linked to (a:3) as its child. Note that the counts for items b and a are not increased since they have already been counted in the construction of the FUFP-tree. The same process is then executed for the other two corresponding branches. The results are shown in Figure 9.

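[Figure 9: Header Table (item, frequency): b 9, a 8, f 5, d 5. The tree drawing is not reproduced; the node labels that survive from it are b:5, a:3, f:2, f:2, a:1, d:1, d:1.]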

FIGURE 9. The Header Table and the FUFP-tree after STEP 10.

STEP 11: The FUFP-tree is updated according to the updated records after modification with items existing in the Increase_Items. In this example, Increase_Items = {a, b, f, d}. The corresponding branches for the records after modification are shown in Table 14.

TABLE 14. The corresponding branches for the records after modification

Transaction No.   Items            Corresponding branch
7                 a, b             b, a
8                 a, b, h          b, a
9                 a, b, c, d, f    b, a, f, d
10                a, b, c, d       b, a, d

The first branch shares the same prefix (b, a) as the current FUFP-tree. The counts for items b and a are then increased by 1 since they have not yet been counted in the construction of the FUFP-tree. The same process is then executed for the other three branches. The results are shown in Figure 10.

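[Figure 10: Header Table (item, frequency): b 9, a 8, f 5, d 5. The tree drawing is not reproduced; the node labels that survive from it are b:9, a:7, f:3, f:2, a:1, d:2, d:1.]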

FIGURE 10. The Final FUFP-tree after all the modified records are processed.

Based on the FUFP-tree shown in Figure 10, the desired large itemsets can then be found by the FP-Growth mining approach as proposed in [8].
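The tree updates of STEPs 7 to 11 can be replayed with a compact sketch. It rests on assumptions that should be kept in mind: the FUFP-tree is represented as a dictionary from each root-to-node path to the node's count rather than as linked nodes, the starting tree is the one left after STEP 5 (b:9 with children a:6 and f:2, a:6 with child f:4, and a separate a:1), and the helper names `branch`, `add_path` and `remove_path` are hypothetical.

```python
from collections import defaultdict

D = {1: "abcdgh", 3: "bdefg", 6: "acdgh"}                   # Rescan_Transactions
T_before = {7: "abfi", 8: "abefh", 9: "abhg", 10: "bcde"}   # T'
T_after = {7: "ab", 8: "abh", 9: "abcdf", 10: "abcd"}       # T

# Tree after STEP 5, keyed by the path from the root to each node.
tree = defaultdict(int, {("b",): 9, ("b", "a"): 6, ("b", "a", "f"): 4,
                         ("b", "f"): 2, ("a",): 1})
order = ["b", "a", "f"]                                     # Header Table order


def branch(items, order):
    """Corresponding branch: the record's large items in Header Table order."""
    return [i for i in order if i in items]


def remove_path(tree, path):
    """STEP 7: subtract 1 along the branch and drop nodes that reach zero."""
    for k in range(1, len(path) + 1):
        key = tuple(path[:k])
        tree[key] -= 1
        if tree[key] == 0:
            del tree[key]


def add_path(tree, path, only=None):
    """Add 1 along the branch; with `only`, touch just the nodes of those items."""
    for k in range(1, len(path) + 1):
        if only is None or path[k - 1] in only:
            tree[tuple(path[:k])] += 1


for items in T_before.values():                             # STEP 7
    remove_path(tree, branch(items, order))
after_step7 = dict(tree)

order.append("d")                                           # STEPs 8 and 9
for items in D.values():                                    # STEP 10
    add_path(tree, branch(items, order), only={"d"})
for items in T_after.values():                              # STEP 11
    add_path(tree, branch(items, order))
final_tree = dict(tree)
```

In STEP 10 only the d nodes are counted, since b, a and f were already counted for the unchanged records when the tree was first built; in STEP 11 every node on the branch is counted because the modified records were subtracted in STEP 7.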

5. Experimental Results. Experiments were conducted to compare the performance of the batch FP-tree construction algorithm with that of the FUFP-tree maintenance algorithm for processing modified records. When records were modified, the batch FP-tree construction algorithm created a new FP-tree from the updated database. The process was executed whenever records were modified. The FUFP-tree maintenance algorithm was executed for modification of records as described in Section 3.

The experiments were performed in C++ on an Intel x86 PC with a 2.8 GHz processor and 512 MB of main memory, running Microsoft Windows XP. The dataset, BMS-POS [19], also used in the KDDCUP 2000 competition, was applied here. This real dataset contained several years of point-of-sale data from a large electronics retailer. Each transaction in this dataset consisted of all the product categories purchased by a customer at one time. There were 515,597 transactions with 1657 items in the dataset. The maximum length of a transaction was 164 with an average length of 6.5.

The first 425,000 transactions were extracted from the BMS-POS database to construct an initial FP-tree. For each modification, 5,000 transactions were randomly chosen from the most recently updated database, and 5,000 transactions outside the initial 425,000 were used as the new contents of the modified records. The minimum support was set at 4%. The execution times and the numbers of nodes obtained from both the batch FP-tree construction algorithm and the FUFP-tree maintenance algorithm for record modification were compared. Figure 11 shows the execution times required by the batch FP-tree construction algorithm and by the FUFP-tree maintenance algorithm for processing every 5,000 modified records.

[Plot: execution time (sec., 0 to 250) versus number of modifications (5000 to 25000; each modification includes 5000 transactions) for the FP-tree and the FUFP-tree.]

FIGURE 11. Comparison of the execution times.

Figure 11 shows that the execution time of the proposed maintenance approach was much less than that of the batch FP-tree construction algorithm for the handling of modified records.

The FUFP-tree maintenance algorithm may generate a less concise tree than the FP-tree construction algorithm since the latter completely follows the sorted frequent items to build the tree. As mentioned above, when an originally small item becomes large due to modified records, its updated support is usually only a little larger than the minimum support. It is thus reasonable to place a new large item at the end of the Header Table. Thus, the difference between the FP-tree and the FUFP-tree structures will not be significant. To demonstrate this effect, the numbers of nodes generated by the two algorithms are presented in Figure 12.

[Plot: number of nodes (120000 to 168000) versus number of modifications (5000 to 25000; each modification includes 5000 transactions) for the FP-tree and the FUFP-tree.]

FIGURE 12. Comparison of the numbers of nodes.

It is observed from Figure 12 that the FUFP-tree maintenance algorithm for modified records generated nearly the same number of nodes as the FP-tree construction algorithm. The effectiveness of the FUFP-tree maintenance algorithm for record modification is therefore acceptable.

6. Conclusions. In this paper, we have proposed the FUFP-tree structure and its maintenance algorithm to efficiently and effectively handle the modification of records in data mining. The FUFP-tree structure is the same as the FP-tree structure [8] except that the links between parent and child nodes are bi-directional. In addition, the counts of the sorted frequent items are kept in the Header Table. These modifications simplify the tree update process.

When records in the database are modified, the proposed algorithm processes them to maintain the FUFP-tree. It first computes the count difference of each affected item by comparing the item's counts before and after the record modification. The proposed maintenance algorithm then partitions the items into four sections according to whether they are large in the original database and whether their count difference is positive or negative (including zero). Each section is then processed in its own way. The Header Table and the FUFP-tree are correspondingly updated whenever necessary. It is reasonable to insert a new large item at the end of the Header Table since, when an originally small item becomes large due to the modification of records, its updated support is usually only a little larger than the minimum support.
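The count-difference computation and the four-way partition summarized above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names, the sample records and the original counts are hypothetical, an item is taken to be large when its original count is at least the minimum count, and the case-specific processing that follows the partition is not reproduced.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_difference(old_records: List[List[str]],
                     new_records: List[List[str]]) -> Dict[str, int]:
    """Count of each item after modification minus its count before."""
    before = Counter(item for rec in old_records for item in set(rec))
    after = Counter(item for rec in new_records for item in set(rec))
    return {item: after[item] - before[item] for item in set(before) | set(after)}


def partition_items(diff: Dict[str, int], original_counts: Dict[str, int],
                    min_count: int) -> Dict[Tuple[str, str], List[str]]:
    """Split items by (large or small originally, positive or non-positive difference)."""
    cases: Dict[Tuple[str, str], List[str]] = {
        ("large", "positive"): [],
        ("large", "non-positive"): [],
        ("small", "positive"): [],
        ("small", "non-positive"): [],
    }
    for item in sorted(diff):
        # Assumption: "large" means original count >= min_count.
        size = "large" if original_counts.get(item, 0) >= min_count else "small"
        sign = "positive" if diff[item] > 0 else "non-positive"
        cases[(size, sign)].append(item)
    return cases


# Hypothetical data: two records before and after modification.
old_records = [["a", "c"], ["b", "e"]]
new_records = [["a", "b"], ["a", "b", "h"]]
original_counts = {"a": 6, "b": 7, "c": 2, "e": 5, "h": 1}

diff = count_difference(old_records, new_records)
cases = partition_items(diff, original_counts, min_count=4)
```

Only the modified records are scanned to obtain the differences, which is why the maintenance step can avoid reprocessing the whole database.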

Experimental results also show that the proposed FUFP-tree maintenance algorithm runs faster than the batch FP-tree construction algorithm for handling modification of records and generates nearly the same tree structure as the FP-tree algorithm. The proposed approach can thus achieve a good trade-off between execution time and tree complexity.

The FP-Growth mining procedure, previously used for mining from the FP-tree, can also be applied to the FUFP-tree. Both the FP-tree and the FUFP-tree structures allow the FP-Growth procedure to mine the desired rules for only specified items. In this case, the maintenance of the tree structures is especially important. In the future, we will attempt to discuss other mining-problem issues.

Acknowledgement. This research was supported by the National Science Council of the Republic of China under contract NSC 94-2213-E-390-005.

REFERENCES

[1] R. Agrawal, T. Imielinski and A. Swami, Mining association rules between sets of items in large databases, The ACM SIGMOD Conference, pp.207-216, 1993.

[2] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, The International Conference on Very Large Data Bases, pp.487-499, 1994.

[3] R. Agrawal, R. Srikant and Q. Vu, Mining association rules with item constraints, The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp.67-73, 1997.

[4] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, Maintenance of discovered association rules in large

Engineering, pp.106-114, 1996.

[5] C. I. Ezeife, Mining incremental association rules with generalized FP-tree, The 15th Conference of the Canadian Society for Computational Studies of Intelligence on Advances in Artificial Intelligence, pp.147-160, 2002.

[6] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, Mining optimized association rules for numeric attributes, The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp.182-191, 1996.

[7] J. Han and Y. Fu, Discovery of multiple-level association rules from large databases, The Twenty-first International Conference on Very Large Data Bases, pp.420-431, 1995.

[8] J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, The 2000 ACM SIGMOD International Conference on Management of Data, 2000.

[9] Y. Liu, G. Teng, J. Ma, D. Yang and F. Wang, Double Transductive Inference Algorithm for Text Classification, International Journal of Innovative Computing, Information and Control, vol.3, no.6(a), pp.1463-1469, 2007.

[10] F. V. Nelwamondo, T. Marwala and U. Mahola, Early Classifications of Bearing Faults Using Hidden Markov Models, Gaussian Mixture Models, Mel-frequency Cepstral Coefficients and Fractals, International Journal of Innovative Computing, Information and Control, vol.2, no.6, pp.1281-1299, 2006.

[11] S. Ozawa, S. Pang and N. Kasabov, On-line Feature Selection for Adaptive Evolving Connectionist Systems, International Journal of Innovative Computing, Information and Control, vol.2, no.1, pp.181-192, 2006.

[12] T. P. Hong, J. W. Lin and Y. L. Wu, A Fast Updated Frequent Pattern Tree, The IEEE International Conference on Systems, Man, and Cybernetics, 2006.

[13] H. Mannila, H. Toivonen, and A. I. Verkamo, Efficient algorithms for discovering association rules, The AAAI Workshop on Knowledge Discovery in Databases, pp.181-192, 1994.

[14] J. S. Park, M. S. Chen, P. S. Yu, Using a hash-based method with transaction trimming for mining association rules, The IEEE Transactions on Knowledge and Data Engineering, vol.9, no.5, pp.812-825, 1997.

[15] Y. Qiu, Y. J. Lan and Q. S. Xie, An improved algorithm of mining from FP-tree, The Third International Conference on Machine Learning and Cybernetics, pp.26-29, 2004.

[16] R. Srikant and R. Agrawal, Mining generalized association rules, The Twenty-first International Conference on Very Large Data Bases, pp.407-419, 1995.

[17] R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, The 1996 ACM SIGMOD International Conference on Management of Data, pp.1-12, 1996.

[18] K. Umayahara, S. Miyamoto and Y. Nakamori, Formulations of Fuzzy Clustering for Categorical Data, International Journal of Innovative Computing, Information and Control, vol.1, no.1, pp.83-94, 2005.

[19] Z. Zheng, R. Kohavi and L. Mason, Real world performance of association rule algorithms, The