
The Pre-FUFP Algorithm for Incremental Mining*

Chun-Wei Lin†, Tzung-Pei Hong**, and Wen-Hsiang Lu†

†Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, 701, Taiwan, R.O.C.
{p7895122; whlu}@mail.ncku.edu.tw

**Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, 811, Taiwan, R.O.C.
tphong@nuk.edu.tw

Abstract

The frequent pattern tree (FP-tree) is an efficient data structure for association-rule mining without generation of candidate itemsets. It compresses a database into a tree structure that stores only large items. It must, however, process all transactions in a batch way. In real-world applications, new transactions are usually inserted into databases incrementally. In the past, we proposed the Fast Updated FP-tree (FUFP-tree) structure to handle new transactions efficiently and to simplify the tree update process. In this paper, we modify the FUFP-tree construction based on the concept of pre-large itemsets, which are defined by a lower support threshold and an upper support threshold. The proposed approach does not need to rescan the original database until a certain number of new transactions have been inserted. It can thus achieve a good execution time for tree construction, especially when only a small number of transactions are inserted each time. Experimental results also show that the proposed Pre-FUFP maintenance algorithm performs well when handling new transactions incrementally.

Keywords: data mining, FUFP-tree, Pre-FUFP algorithm, pre-large itemsets, incremental mining, maintenance.

*This is a modified and expanded version of the paper "Using the Pre-FUFP algorithm for handling new transactions in incremental mining," presented at The IEEE Symposium on Computational Intelligence and Data Mining, Hawaii, USA, 2007.

1. Introduction

Years of effort in data mining have produced a variety of efficient techniques.

Depending on the types of databases processed, these mining approaches may be

classified as working on transaction databases, temporal databases, relational

databases and multimedia databases, among others. On the other hand, depending on

the classes of knowledge derived, the mining approaches may be classified as finding

association rules, classification rules, clustering rules and sequential patterns [4],

among others. Among them, finding association rules in transaction databases is most

commonly seen in data mining [1][3][5][10][11][17][18][21][22].

In the past, many algorithms for mining association rules from transactions were proposed, most of which were based on the Apriori algorithm [1], which generated and tested candidate itemsets level by level. This may cause iterative database scans and high computational costs.

Han et al. thus proposed the Frequent-Pattern-tree (FP-tree) structure for

efficiently mining association rules without generation of candidate itemsets [12]. The

FP-tree [12] was used to compress a database into a tree structure which stored only

large items. It was condensed and complete for finding all the frequent patterns. The

construction process was executed tuple by tuple, from the first transaction to the last one. The FP-Growth mining procedure [12] was then used to derive frequent patterns from the FP-tree. They showed the approach could have a

better performance than the Apriori approach.

Both the Apriori and the FP-tree mining approaches belong to batch mining. That

is, they must process all the transactions in a batch way. In real-world applications,

new transactions are usually inserted into databases incrementally. In this case, the

originally desired large itemsets may become invalid, or new large itemsets may

appear in the resulting updated databases [7][8][16][20][24]. Designing an efficient

algorithm that can maintain association rules as a database grows is thus critically

important.

One noticeable incremental mining algorithm was the Fast-Updated Algorithm

(called FUP), which was proposed by Cheung et al. [7] for avoiding the shortcomings

mentioned above. The FUP algorithm modified the Apriori mining algorithm [3] and

adopted the pruning techniques used in the DHP (Direct Hashing and Pruning)

algorithm [18]. It first calculated large itemsets mainly from newly inserted

transactions, and compared them with the previous large itemsets from the original

database. According to the comparison results, FUP determined whether re-scanning

the original database was needed, thus saving some time in maintaining the

association rules.

Although the FUP algorithm could improve mining performance for incrementally growing databases, the original database still needed to be scanned when necessary. A pre-large-itemset algorithm was thus proposed to further reduce the need for rescanning the original database based on two support thresholds [13]. The upper

support threshold is the same as that used in the conventional mining algorithms. The

support ratio of an itemset must be larger than the upper support threshold in order to

be considered large. On the other hand, the lower support threshold defines the lowest

support ratio for an itemset to be treated as pre-large. An itemset with its support ratio

below the lower threshold is thought of as a small itemset. The algorithm did not need to rescan the original database until a certain number of new transactions had been inserted. Since rescanning the database consumes much computation time, the maintenance cost could thus be reduced by the pre-large-itemset algorithm.

In the past, Hong et al. [14] modified the FP-tree structure and designed the fast

updated frequent pattern trees (FUFP-trees) to efficiently handle newly inserted

transactions based on the FUP concept. The FUFP-tree structure was similar to the

FP-tree structure except that the links between parent nodes and their child nodes

were bi-directional. Besides, the counts of the sorted frequent items were also kept in the Header_Table. Experimental results showed that the FUFP-tree maintenance algorithm could achieve good performance when handling newly inserted transactions.

In this paper, we attempt to further modify the FUFP-tree algorithm for

incremental mining based on the pre-large concept [13]. Based on two support

thresholds, the proposed approach can effectively handle cases in which itemsets are

small in an original database but large in newly inserted transactions. The proposed

algorithm does not require rescanning the original databases to construct the FUFP

tree until a number of new transactions have been processed. The number is

determined from the two support thresholds and the size of the database.

Experimental results also show that the proposed maintenance algorithm has a good

performance for incrementally handling new transactions.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The proposed Pre-FUFP maintenance algorithm is described in Section 3. An example illustrating the proposed algorithm is given in Section 4. Experimental results showing the performance of the proposed algorithm are provided in Section 5. Conclusions are given in Section 6.

2. Review of Related Works

In this section, related research is briefly reviewed, namely the FUFP-tree algorithm (which was based on the FP-tree algorithm) and the pre-large-itemset algorithm.

2.1 The FUFP-tree algorithm

The FUFP-tree construction algorithm is the same as the FP-tree algorithm [12]

except that the links between parent nodes and their child nodes are bi-directional.

Bi-directional linking helps speed up item deletion in the maintenance process. Besides, the counts of the sorted frequent items are also kept in the Header_Table.
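As an illustration of this structure, the following minimal Python sketch stores both an upward and a downward link in every node and keeps item counts in the Header_Table. The names `FUFPNode`, `HeaderEntry` and `insert_branch`, as well as the three sample branches, are hypothetical and are not taken from the original algorithm description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class FUFPNode:
    """One tree node; `parent` and `children` form the bi-directional link."""
    item: Optional[str]                       # None marks the root
    count: int = 0
    parent: Optional["FUFPNode"] = None       # upward link
    children: Dict[str, "FUFPNode"] = field(default_factory=dict)  # downward links


@dataclass
class HeaderEntry:
    """Header_Table row: the item's total count plus all of its tree nodes."""
    count: int = 0
    nodes: List[FUFPNode] = field(default_factory=list)


def insert_branch(root, header, branch):
    """Insert one sorted branch of large items, creating nodes as needed."""
    node = root
    for item in branch:
        child = node.children.get(item)
        if child is None:
            child = FUFPNode(item=item, parent=node)
            node.children[item] = child
            header.setdefault(item, HeaderEntry()).nodes.append(child)
        child.count += 1
        header[item].count += 1
        node = child


# Hypothetical branches, already sorted in Header_Table order.
root = FUFPNode(item=None)
header: Dict[str, HeaderEntry] = {}
for branch in (["b", "a", "f"], ["b", "a"], ["b", "f"]):
    insert_branch(root, header, branch)
```

Because each node knows its parent, a node can later be removed without searching the tree from the root.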

An FUFP tree must be built in advance from the original database before new

transactions come. When new transactions are added, the FUFP-tree maintenance

algorithm will process them to maintain the FUFP tree. It first partitions items into

four parts according to whether they are large or small in the original database and in

the new transactions. Each part is then processed in its own way. The Header_Table

and the FUFP-tree are correspondingly updated whenever necessary.

In the process for updating the FUFP tree, item deletion is done before item

insertion. When an originally large item becomes small, it is directly removed from

the FUFP tree and its parent and child nodes are then linked together. On the contrary,

when an originally small item becomes large, it is added to the end of the

Header_Table and then inserted into the leaf nodes of the FUFP tree. It is reasonable to insert the item at the end of the Header_Table since, when an originally small item becomes large due to new transactions, its updated support is usually only a little larger than the minimum support. The FUFP tree thus needs only minimal updating, and the performance of the FUFP-tree maintenance algorithm can be greatly improved. The entire FUFP tree can then be re-constructed in a batch way when a sufficiently large number of transactions have been inserted.
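The deletion of an item can be sketched as follows. This is a minimal illustration that assumes the Header_Table supplies the list of nodes of the item. The `Node` class, the `remove_item` function and the small tree fragment are hypothetical, and merging of sibling nodes that end up carrying the same item is deliberately not handled.

```python
class Node:
    """Tree node with an upward link (parent) and downward links (children)."""

    def __init__(self, item, count=0, parent=None):
        self.item, self.count = item, count
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def remove_item(item_nodes):
    """Splice every node of one item out of the tree.

    Each removed node's children are linked directly to its parent,
    which is cheap because both link directions are stored.
    """
    for node in item_nodes:
        parent = node.parent
        parent.children.remove(node)
        for child in node.children:
            child.parent = parent
            parent.children.append(child)
        node.parent, node.children = None, []


# Hypothetical fragment: {} -> b:9 -> g:2 -> h:2, plus b:9 -> f:2
root = Node(None)
b = Node("b", 9, root)
g = Node("g", 2, b)
h = Node("h", 2, g)
f = Node("f", 2, b)

remove_item([g])   # item g is no longer large
```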

Several other algorithms based on the FP-tree structure have been proposed. For

example, Qiu et al. proposed the QFP-growth mining approach to mine association

rules [19]. Mohammad proposed the COFI-tree structure to replace the conditional

FP-tree [23]. Ezeife constructed a generalized FP-tree, which stored all the large and

non-large items, for incremental mining without rescanning databases [9]. Koh et al.

adjusted FP trees also based on two support thresholds [15], but with a more complex

adjusting procedure and spending more computation time than the one proposed in

this paper. Related research is still in progress.

2.2 The pre-large-itemset algorithm

A pre-large itemset is not truly large, but may be large with a high probability in

the future. Two support thresholds, a lower support threshold and an upper support

threshold, are used to realize this concept. The upper support threshold is the same as

that used in the conventional mining algorithms. The support ratio of an itemset must be larger than the upper support threshold in order to be considered large. On the other hand, the lower support threshold defines the lowest support ratio for an itemset

to be treated as pre-large. An itemset with its support ratio below the lower threshold

is thought of as a small itemset. Pre-large itemsets act like buffers in the incremental

mining process and are used to reduce the movements of itemsets directly from large

to small and vice-versa.
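The two-threshold classification can be sketched as below. The function name `classify` is hypothetical. Treating a ratio exactly equal to a threshold as meeting it is an assumption, chosen because the example in Section 4 treats item h (5 of 10 transactions, upper threshold 50%) as large.

```python
from fractions import Fraction


def classify(count, n_transactions, s_lower, s_upper):
    """Return 'large', 'pre-large' or 'small' for one itemset.

    Exact fractions avoid floating-point surprises at the boundaries.
    Assumption: a ratio equal to a threshold satisfies that threshold.
    """
    ratio = Fraction(count, n_transactions)
    if ratio >= s_upper:
        return "large"
    if ratio >= s_lower:
        return "pre-large"
    return "small"


S_L, S_U = Fraction(3, 10), Fraction(1, 2)   # 30% and 50%, as in Section 4
labels = [classify(n, 10, S_L, S_U) for n in (5, 4, 3, 2)]
```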

Considering an original database and newly inserted transactions under the two support thresholds, an itemset may fall into one of the nine cases illustrated in Figure 1.

| Original database \ New transactions | Large itemsets | Pre-large itemsets | Small itemsets |
|---|---|---|---|
| Large itemsets | Case 1 | Case 2 | Case 3 |
| Pre-large itemsets | Case 4 | Case 5 | Case 6 |
| Small itemsets | Case 7 | Case 8 | Case 9 |

Figure 1: Nine cases arising from adding new transactions to existing databases

Cases 1, 5, 6, 8 and 9 will not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If we retain all large and pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old

transactions is usually very small. This is more apparent when the database is growing

larger. It has been formally shown that an itemset in case 7 cannot possibly be large

for the entire updated database as long as the number of new transactions is smaller than the number f shown below [13]:

$$ f = \left\lfloor \frac{(S_u - S_l)\, d}{1 - S_u} \right\rfloor, $$

where f is the safety number of the new transactions, Su is the upper threshold, Sl is the

lower threshold, and d is the number of original transactions.
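The safety number can be computed as in the following sketch, which assumes the form f = floor((Su - Sl) d / (1 - Su)) given in [13] and used in the example of Section 4. The function name `safety_number` is hypothetical, and thresholds are passed as decimal strings so that the arithmetic stays exact.

```python
from fractions import Fraction
from math import floor


def safety_number(s_upper, s_lower, d):
    """Number of new transactions that can be absorbed before a rescan may be needed."""
    s_u, s_l = Fraction(s_upper), Fraction(s_lower)
    return floor((s_u - s_l) * d / (1 - s_u))


f_example = safety_number("0.5", "0.3", 10)          # thresholds of Section 4
f_experiment = safety_number("0.04", "0.02", 500000)  # thresholds of Section 5
```

With the values of Section 4 the function gives 4, and the bound grows in proportion to the size d of the original database.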

A summary of the nine cases and their results is given in Table 1 [13].

Table 1: Nine cases and their results

| Cases: Original - New | Results |
|---|---|
| Case 1: Large - Large | Always large |
| Case 2: Large - Pre-large | Large or pre-large, determined from existing information |
| Case 3: Large - Small | Large, pre-large or small, determined from existing information |
| Case 4: Pre-large - Large | Pre-large or large, determined from existing information |
| Case 5: Pre-large - Pre-large | Always pre-large |
| Case 6: Pre-large - Small | Pre-large or small, determined from existing information |
| Case 7: Small - Large | Pre-large or small when the number of new transactions is small |
| Case 8: Small - Pre-large | Small or pre-large |
| Case 9: Small - Small | Always small |

3. The Proposed Pre-FUFP Maintenance Approach

The notation used in the proposed Pre-FUFP maintenance approach is first

described below.

3.1 Notation

D: the original database;
T: the set of new transactions;
U: the entire updated database, i.e., D ∪ T;
d: the number of transactions in D;
t: the number of transactions in T;
Sl: the lower support threshold for pre-large itemsets;
Su: the upper support threshold for large itemsets, Su > Sl;
I: an itemset;
SD(I): the number of occurrences of I in D;
ST(I): the number of occurrences of I in T;
SU(I): the number of occurrences of I in U;
Pre_ItemsD: the set of pre-large items from D;
Pre_ItemsT: the set of pre-large items from T;
Lar_ItemsT: the set of large items from T;
Insert_Items: the set of items for which the new transactions have to be reprocessed for updating the FUFP-trees;
Branch_Items: the set of items for which the original database has to be reprocessed for updating the FUFP-trees;
Rescan_Items: the set of items for which the original database has to be rescanned to determine whether the items are large.

3.2 The Proposed Maintenance Algorithm

An FUFP tree must be built in advance from the original database before new transactions come. Its initial construction is similar to that of an FP tree. The database is first scanned to find the items that satisfy the minimum support, which are called large items. Next, the large items are sorted in descending frequency. Finally, the database is scanned again to construct the FUFP tree according to the sorted order of large items. The construction process is executed tuple by tuple, from the first transaction to the last one. After all transactions are processed, the FUFP tree is completely constructed. In addition, a variable c is used to record the number of new transactions since the last re-scan of the original database with d transactions. The details of the proposed algorithm are described below.
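The first scan and the sorting step can be sketched as follows. The function names, the four toy transactions and the thresholds are hypothetical. Breaking ties between equal counts alphabetically is an assumption, and keeping the pre-large items in a separate table at this stage follows the input description of the maintenance algorithm.

```python
from collections import Counter
from fractions import Fraction


def first_scan(transactions, s_lower, s_upper):
    """Count items once, then split them into the Header_Table and the pre-large table."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    header = {i: c for i, c in ordered if Fraction(c, n) >= s_upper}
    pre_large = {i: c for i, c in ordered if s_lower <= Fraction(c, n) < s_upper}
    return header, pre_large


def sorted_branch(transaction, header):
    """Keep only large items, in Header_Table order (used in the second scan)."""
    return [item for item in header if item in transaction]


# Hypothetical toy database
transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "d"}, {"b"}]
header, pre_large = first_scan(transactions, Fraction(1, 4), Fraction(1, 2))
branch = sorted_branch({"a", "b", "d"}, header)
```

In the second scan, each transaction would be reduced to such a branch and inserted into the tree.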

The Pre-FUFP maintenance algorithm:

INPUT: An old database consisting of (d+c) transactions, its corresponding

Header_Table storing the frequent items initially in descending order, its

corresponding FUFP tree, a lower support threshold Sl, an upper support

threshold Su, its corresponding pre-large table storing the set of pre-large

items from the original database, and a set of t new transactions.

OUTPUT: A new FUFP tree for the updated database by using the Pre-FUFP

maintenance algorithm.

STEP 1: Calculate the safety number f of new transactions according to the following formula [13]:

$$ f = \left\lfloor \frac{(S_u - S_l)\, d}{1 - S_u} \right\rfloor. $$

STEP 2: Scan the new transactions to get all the items and their counts.

STEP 3: Divide the items in the new transactions into three parts according to whether

they are large, pre-large or small in the original database.

STEP 4: For each item I from STEP 3, which is large in the original database

(appearing in the Header_Table), do the following substeps (Cases 1, 2 and

3):

Substep 4-1: Set the new count SU(I) of I in the entire updated database as:

SU(I) = SD(I) + ST(I),

where SD(I) is the count of I in the Header_Table (original

database) and ST(I) is the count of I in the new transactions.

Substep 4-2: If SU(I)/(d+c+t) ≥ Su, update the count of I in the Header_Table as SU(I), and put I in the set of Insert_Items, which will be further processed in STEP 10;

Otherwise, if Su > SU(I)/(d+c+t) ≥ Sl, remove I from the Header_Table, connect each parent node of I directly to its child node in the corresponding FUFP tree, set SD(I) = SU(I), and keep I with SD(I) in the pre-large table;

Otherwise, item I is small after the database is updated; remove I from the Header_Table and connect each parent node of I directly to its child node in the corresponding FUFP tree.

STEP 5: For each item I from STEP 3 which is pre-large in the original database, do

the following substeps (Cases 4, 5 and 6):

Substep 5-1: Set the new count SU(I) of I in the entire updated database as:

SU(I) = SD(I) + ST(I).

Substep 5-2: If SU(I)/(d+c+t) ≥ Su, item I will be large after the database is updated; put I in the set of Insert_Items and Branch_Items, which will be further processed in STEP 8;

Otherwise, if Su > SU(I)/(d+c+t) ≥ Sl, set SD(I) = SU(I) and keep I with the new SD(I) in the pre-large table;

Otherwise, remove item I from the pre-large table.

STEP 6: For each item I from STEP 3 which is neither large nor pre-large in the

original database but large or pre-large in the new transactions (Cases 7 and

8), put I in the set of Rescan_Items, which is used when rescanning the database in STEP 7 is necessary.

STEP 7: If t+c ≤ f or the set of Rescan_Items is null, then do nothing;

Otherwise, do the following substeps for each item I in the set of Rescan_Items:

Substep 7-1: Rescan the original database to decide the original count SD(I)

of I.

Substep 7-2: Set the new count SU(I) of I in the entire updated database as:

SU(I) = SD(I) + ST(I).

Substep 7-3: If SU(I)/(d+c+t) ≥ Su, item I will become large after the database is updated; put I in the set of Insert_Items and Branch_Items;

Otherwise, if Su > SU(I)/(d+c+t) ≥ Sl, set SD(I) = SU(I) and keep I with SD(I) in the pre-large table;

Substep 7-4: Otherwise, neglect I.

STEP 8: Insert the items in the Branch_Items to the end of the Header_Table

according to the descending order of their updated counts.

STEP 9: For each original transaction with an item I existing in the Branch_Items, if I

has not been at the corresponding branch of the FUFP tree for the

transaction, insert I at the end of the branch and set its count as 1; Otherwise,

add 1 to the count of the node I.

STEP 10: For each new transaction with an item I existing in the Insert_Items, if I has not been at the corresponding branch of the FUFP tree for the new transaction, insert I at the end of the branch and set its count as 1; Otherwise, add 1 to the count of the node I.

STEP 11: If t+c > f, then set d = d+t+c and set c = 0; otherwise, set c = t+c.

In STEP 9, a corresponding branch is the branch generated from the large items

in a transaction and corresponding to the order of items appearing in the

Header_Table. After STEP 11, the final updated FUFP tree by using the Pre-FUFP

maintenance algorithm is constructed. The new transactions can then be integrated

into the original database. Based on the FUFP tree, the desired association rules can

then be found by the FP-Growth mining approach as proposed in [12].
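The item-level bookkeeping of STEPs 1 to 8 and 11 can be sketched as below, with the tree operations of STEPs 9 and 10 left out. This is an illustration under stated assumptions rather than the authors' implementation: the function `maintain_items` and the optional `rescan` callback are hypothetical, an item promoted in STEP 5 is assumed to leave the pre-large table, and the count of 3 used for item e is inferred from the example of Section 4 rather than read from it. The sample values are those of Section 4.

```python
from fractions import Fraction
from math import floor


def maintain_items(header, pre_large, new_counts, d, c, t, s_l, s_u, rescan=None):
    """Update the Header_Table and pre-large table in place; tree surgery is omitted."""
    f = floor((s_u - s_l) * d / (1 - s_u))                   # STEP 1
    total = d + c + t
    old_large, old_pre = list(header), list(pre_large)
    insert_items, promoted = [], {}

    def status(count):
        ratio = Fraction(count, total)
        return "large" if ratio >= s_u else "pre" if ratio >= s_l else "small"

    for item in old_large:                                   # STEP 4 (Cases 1-3)
        s = header[item] + new_counts.get(item, 0)
        if status(s) == "large":
            header[item] = s
            insert_items.append(item)
        else:
            del header[item]        # its nodes would be spliced out of the tree
            if status(s) == "pre":
                pre_large[item] = s

    for item in old_pre:                                     # STEP 5 (Cases 4-6)
        s = pre_large.pop(item) + new_counts.get(item, 0)
        if status(s) == "large":
            promoted[item] = s
        elif status(s) == "pre":
            pre_large[item] = s

    rescan_items = [i for i, n in new_counts.items()         # STEP 6 (Cases 7-8)
                    if i not in old_large and i not in old_pre
                    and Fraction(n, t) >= s_l]

    if t + c > f and rescan is not None:                     # STEP 7
        for item in rescan_items:
            s = rescan(item) + new_counts[item]
            if status(s) == "large":
                promoted[item] = s
            elif status(s) == "pre":
                pre_large[item] = s

    branch_items = sorted(promoted, key=lambda i: -promoted[i])   # STEP 8
    for item in branch_items:
        header[item] = promoted[item]
    insert_items += branch_items

    d, c = (d + t + c, 0) if t + c > f else (d, t + c)       # STEP 11
    return f, insert_items, branch_items, rescan_items, d, c


header = {"b": 9, "a": 8, "f": 6, "g": 6, "h": 5}
pre_large = {"c": 3, "d": 4, "e": 3}     # count of e is an inferred assumption
new_counts = {"a": 3, "b": 2, "c": 1, "d": 3, "f": 1, "h": 1, "i": 3}
result = maintain_items(header, pre_large, new_counts, d=10, c=0, t=3,
                        s_l=Fraction(3, 10), s_u=Fraction(1, 2))
```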

4. An Example

In this section, an example is given to illustrate the proposed Pre-FUFP algorithm for maintaining an FUFP tree when new transactions are inserted. Table 2 shows a database to be used in the example. It contains 10 transactions and 9 items, denoted a to i.

Table 2: The original database in the example

| Transaction No. | Items |
|---|---|
| 1 | a, b, c, d, e, g, h |
| 2 | a, b, f, g |
| 3 | b, d, e, f, g |
| 4 | a, b, f, h |
| 5 | a, b, f, i |
| 6 | a, c, d, e, g, h |
| 7 | a, b, h, i |
| 8 | b, c, d, f, g |
| 9 | a, b, f |
| 10 | a, b, g, h |

Assume the lower support threshold Sl is set at 30% and the upper one Su at 50%.

For the given database, the large 1-itemsets are a, b, f, g and h, from which the

Header_Table can be constructed. The FUFP tree are then formed from the database

and the Header_Table, with the results shown in Figure 2. Besides, the sets of

pre-large items for the given database are shown in Table 3.

Header Table

| Item | Frequency |
|---|---|
| b | 9 |
| a | 8 |
| f | 6 |
| g | 6 |
| h | 5 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4 → g:1
b:9 → a:7 → f:4 → h:1
b:9 → a:7 → g:2 → h:2
b:9 → a:7 → h:1
b:9 → f:2 → g:2
a:1 → g:1 → h:1

Figure 2: The Header_Table and the FUFP tree constructed

Table 3: The pre-large itemset for the original database

| Item | Count |
|---|---|
| c | 3 |
| d | 4 |
| e | 3 |

Assume the three new transactions shown in Table 4 appear. The proposed

Pre-FUFP maintenance algorithm proceeds as follows. The variable c is initially set at

0.

Table 4: The three new transactions

| Transaction No. | Items |
|---|---|
| 1 | a, b, d, f, i |
| 2 | a, b, d, i |
| 3 | a, c, d, h, i |

STEP 1: The safety number f for new transactions is calculated as:

$$ f = \left\lfloor \frac{(S_u - S_l)\, d}{1 - S_u} \right\rfloor = \left\lfloor \frac{(0.5 - 0.3) \times 10}{1 - 0.5} \right\rfloor = 4. $$

STEP 2: The three new transactions are first scanned to get the items and their

counts. The results are shown in Table 5.

Table 5: The counts of all items in the new transactions

| Item | Count |
|---|---|
| a | 3 |
| b | 2 |
| c | 1 |
| d | 3 |
| e | 0 |
| f | 1 |
| g | 0 |
| h | 1 |
| i | 3 |

STEP 3: All the items a to i in Table 5 are divided into three parts,

{a}{b}{f}{g}{h}, {c}{d}{e}, and {i} according to whether they are large (appearing

in the Header_Table), pre-large (appearing in the pre-large table) or small in the

original database. Results are shown in Table 6, where the counts are only from the

new transactions.

Table 6: Three partitions of the items from the new transactions

| Large items in the original database | Count | Pre-large items in the original database | Count | Small items in the original database | Count |
|---|---|---|---|---|---|
| a | 3 | c | 1 | i | 3 |
| b | 2 | d | 3 | | |
| f | 1 | e | 0 | | |
| g | 0 | | | | |
| h | 1 | | | | |

STEP 4: The items in the new transactions which are large in the original

database are first processed. In this example, items a, b, f, g, and h (the first partition)

satisfy the condition and are processed. The new support ratios of items a, b and f are no less than 0.5. Take item a as an example to illustrate the substeps. The count of item a in the Header_Table is 8, and its count in the new transactions is 3. The new count of item a is thus 8+3 (= 11). The new support ratio of item a is 11/(10+0+3) ≈ 0.85 ≥ 0.5. Item a thus remains large after the database is updated. The count of item a in the Header_Table is thus changed to 11, and item a is then put into the set of Insert_Items. Items b and f are similarly processed.

Next, the new support ratios of items g and h (6/13 ≈ 0.46 for each) are both smaller than 0.5 but larger than 0.3. Items g and h will thus become pre-large after the database is updated. Take item h as an example. Item h is removed from the Header_Table and from the FUFP tree, and is put in the pre-large table with its updated count of 6. The Header_Table and the FUFP tree to be processed are shown in Figure 3, with all nodes for h marked. The results after item h is processed are shown in Figure 4.

[Figure 3: the Header_Table (b 11, a 11, f 7, g 6, h 5) and the FUFP tree of Figure 2, with all nodes for item h marked]

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| g | 6 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4 → g:1
b:9 → a:7 → g:2
b:9 → f:2 → g:2
a:1 → g:1

Figure 4: The Header_Table and the FUFP tree after item h is pruned

Item g is processed in the same way. After STEP 4, Insert_Items = {a, b, f} and

the updated FUFP tree is shown in Figure 5.

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4
b:9 → f:2
a:1

Figure 5: The Header_Table and the FUFP tree after STEP 4

STEP 5: The items in the new transactions which are pre-large in the original database are processed. In this example, items c, d and e satisfy the condition and are processed. Take item d as an example. The count of item d in the pre-large table is 4, and its count in the new transactions is 3. The new count of item d is thus 4+3 (= 7). The new support ratio of item d is 7/(10+0+3) ≈ 0.54 ≥ 0.5. Item d will thus become a large item after the database is updated. Item d is then put into the sets of Insert_Items and Branch_Items.

The new support ratio of item c is 4/(10+0+3) ≈ 0.31, which is between the lower and the upper thresholds. Item c is thus kept in the pre-large table and its count is updated to 4. Finally, the new support ratio of item e is smaller than 0.3. Item e is thus removed from the pre-large table. After STEP 5, Insert_Items = {a, b, f, d} and Branch_Items = {d}.

STEP 6: Since the item i is neither large nor pre-large in the original database but

large in the new transactions, it is put into the set of Rescan_Items, which is used

when rescanning in STEP 7 is required. After STEP 6, Rescan_Items = {i}.

STEP 7: Since t+c = 3+0 < f (= 4), rescanning the original database is

unnecessary. Nothing is done in this step.

STEP 8: The items in the set of Branch_Items are sorted in descending order of

their updated counts and then inserted into the end of the Header_Table. In this

example, the set of Branch_Items contains only d, and no sorting is needed. Item d is

thus inserted into the end of the Header_Table. The Header_Table after this step is shown in Figure 6.

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| d | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4
b:9 → f:2
a:1

Figure 6: The Header_Table and the FUFP tree after item d is added to the Header_Table

STEP 9: The FUFP tree is updated according to the original transactions with items existing in the Branch_Items. In this example, Branch_Items = {d}. The corresponding branches for the original transactions with d are shown in Table 7.

Table 7: The corresponding branches for the original transactions with item d

| Transaction No. | Items | Corresponding branches |
|---|---|---|
| 1 | a, b, c, d, e, g, h | b, a, d |
| 3 | b, d, e, f, g | b, f, d |
| 6 | a, c, d, e, g, h | a, d |
| 8 | b, c, d, f, g | b, f, d |
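The corresponding branches in Table 7 can be reproduced with the short sketch below, which simply keeps the items of a transaction that appear in the Header_Table, in Header_Table order. The function name `corresponding_branch` is hypothetical. Each transaction is written as a string of single-letter items for brevity.

```python
def corresponding_branch(transaction, header_order):
    """Project a transaction onto the large items, in Header_Table order."""
    items = set(transaction)
    return [item for item in header_order if item in items]


header_order = ["b", "a", "f", "d"]          # Header_Table after STEP 8
originals_with_d = {1: "abcdegh", 3: "bdefg", 6: "acdegh", 8: "bcdfg"}
branches = {no: corresponding_branch(t, header_order)
            for no, t in originals_with_d.items()}
```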

The first branch is then processed. This branch shares the same prefix (b, a) as

the current FUFP-tree. A new node (d:1) is thus created and linked to (a:7) as its child.

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| d | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4
b:9 → a:7 → d:1
b:9 → f:2
a:1

Figure 7: The Header_Table and the FUFP tree after the first branch is processed

Note that the counts for items b and a are not increased since they have already

been counted in the construction of the FUFP tree. The same process is then executed

for the other three corresponding branches. The final results are shown in Figure 8.

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| d | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:9 → a:7 → f:4
b:9 → a:7 → d:1
b:9 → f:2 → d:2
a:1 → d:1

Figure 8: The Header_Table and the FUFP tree after STEP 9

STEP 10: The FUFP tree is updated according to the new transactions with items existing in the Insert_Items. In this example, Insert_Items = {a, b, f, d}. The corresponding branches for the new transactions with any of these items are shown in Table 8.

Table 8: The corresponding branches for the new transactions with items b, a, f and d

| Transaction No. | Items | Corresponding branches |
|---|---|---|
| 1 | a, b, d, f, i | b, a, f, d |
| 2 | a, b, d, i | b, a, d |
| 3 | a, c, d, h, i | a, d |

The first branch shares the same prefix (b, a, f) as the current FUFP tree. The counts for items b, a, and f are then increased by 1 since they have not yet been counted in the construction of the previous FUFP tree, and a new node (d:1) is created and linked to (f:5) as its child. The results after the first branch is processed are shown in Figure 9.

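Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| d | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:10 → a:8 → f:5 → d:1
b:10 → a:8 → d:1
b:10 → f:2 → d:2
a:1 → d:1

Figure 9: The Header_Table and the FUFP tree after the first branch is processed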

The same process is then executed for the other two branches. The final results are shown in Figure 10.

Header Table

| Item | Frequency |
|---|---|
| b | 11 |
| a | 11 |
| f | 7 |
| d | 7 |

FUFP tree (root-to-leaf paths starting from the root {}):

b:11 → a:9 → f:5 → d:1
b:11 → a:9 → d:2
b:11 → f:2 → d:2
a:2 → d:2

Figure 10: The Final FUFP tree after all the new transactions are processed
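STEPs 9 and 10 of the example can be replayed with the following sketch. It is a simplification, not the node-and-link structure itself: the tree is represented as a dictionary that maps each root-to-node path to its count, and the function name `add_branch` is hypothetical. The starting tree is the one of Figure 5, and the branches are those of Tables 7 and 8.

```python
def add_branch(tree, branch, items_to_count):
    """Walk one branch; add 1 only to nodes whose item is in `items_to_count`.

    A missing node is created with count 1, mirroring STEPs 9 and 10.
    """
    for k in range(1, len(branch) + 1):
        path = tuple(branch[:k])
        if branch[k - 1] in items_to_count:
            tree[path] = tree.get(path, 0) + 1


# FUFP tree after STEP 4, keyed by root-to-node path
tree = {("b",): 9, ("b", "a"): 7, ("b", "a", "f"): 4, ("b", "f"): 2, ("a",): 1}

# STEP 9: original transactions, only Branch_Items = {d} are counted
for branch in (["b", "a", "d"], ["b", "f", "d"], ["a", "d"], ["b", "f", "d"]):
    add_branch(tree, branch, {"d"})
after_step_9 = dict(tree)

# STEP 10: new transactions, all Insert_Items = {a, b, f, d} are counted
for branch in (["b", "a", "f", "d"], ["b", "a", "d"], ["a", "d"]):
    add_branch(tree, branch, {"a", "b", "f", "d"})
```

The counts of b and a are left untouched in STEP 9 because the original transactions were already counted when the tree was first built.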

STEP 11: Since t (= 3) + c (= 0) < f (= 4), set c = t+c = 3+0 =3.

After STEP 11, the FUFP tree is updated. Note that the final value of c is 3 in this example and f - c = 1. This means that one more new transaction can be added without rescanning the original database for Case 7. Based on the FUFP tree shown in Figure 10, the desired large itemsets can then be found by the FP-Growth mining approach as proposed in [12].
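How the counter c evolves over several successive batches can be sketched as follows. The function name `simulate` is hypothetical. Only the first batch of three transactions comes from the example above, while the later batches of one and two transactions are assumed for illustration.

```python
from fractions import Fraction
from math import floor


def simulate(d, batch_sizes, s_l, s_u):
    """Track (f, rescan possible, d, c) after each batch, following STEPs 1, 7 and 11."""
    c, log = 0, []
    for t in batch_sizes:
        f = floor((s_u - s_l) * d / (1 - s_u))
        over_limit = t + c > f              # a rescan may be required in STEP 7
        d, c = (d + t + c, 0) if over_limit else (d, t + c)
        log.append((f, over_limit, d, c))
    return log


log = simulate(10, [3, 1, 2], Fraction(3, 10), Fraction(1, 2))
```

Under these assumed batches, the first two are absorbed without any rescan and the third exceeds the safety number, after which d grows to 16 and c is reset to 0.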

5. Experimental Results

Experiments were conducted to compare the performance of the batch FP-tree construction algorithm, the FUFP-tree maintenance algorithm and the Pre-FUFP maintenance algorithm. When new transactions came, the batch FP-tree construction algorithm integrated them into the original database and constructed a new FP-tree from the updated database. The process was executed whenever new

transactions came. The incremental FUFP-tree maintenance algorithm and the

Pre-FUFP maintenance algorithm processed new transactions incrementally in the

way mentioned in Sections 2.1 and 3.

The experiments were performed in C++ on an Intel x86 PC with a 3.0 GHz processor and 512 MB of main memory, running the Microsoft Windows XP operating system. A real dataset called BMS-POS [25] was used in the experiments.

This dataset was also used in the KDDCUP 2000 competition. The BMS-POS dataset

contained several years of point-of-sale data from a large electronics retailer. Each

transaction in this dataset consisted of all the product categories purchased by a

customer at one time. There were 515,597 transactions with 1657 items in the

dataset. The maximal length of a transaction was 164 and the average length of the

transactions was 6.5.

The first 500,000 transactions were extracted from the BMS-POS database to

construct an initial FP-tree. The value of the minimum threshold was set at 1% to 5%

for the three algorithms, with 1% increment each time. The next 2,000 transactions

were then used in incremental mining. For the Pre-FUFP maintenance algorithm, the

upper minimum support threshold was set at 1% to 5% (1% increment each time) [...] each time). The execution times and the numbers of nodes obtained from the three

algorithms were compared. Figure 11 shows the execution times of the three

algorithms for different threshold values.

[Line chart: execution time (sec.) versus threshold value (1% to 5%) for the FP-tree, FUFP-tree and Pre-FUFP algorithms]

Figure 11: The comparison of the execution times for different threshold values

It can be observed from Figure 11 that the proposed Pre-FUFP maintenance

algorithm ran faster than the other two. Note that the FUFP-tree maintenance

algorithm and the Pre-FUFP maintenance algorithm may generate a less concise tree

than the FP-tree construction algorithm since the latter completely follows the sorted

frequent items to build the tree. As mentioned above, when an originally small item becomes large due to new transactions, its updated support is usually only a little larger than the minimum support, so it is reasonable to insert the item at the end of the Header_Table. The difference between the FP-tree and the FUFP-tree

structures will thus not be significant. For showing this effect, the comparison of the

numbers of nodes for the three algorithms is given in Figure 12. It can be seen that the

three algorithms generated nearly the same sizes of trees. The effectiveness of the

Pre-FUFP maintenance algorithm is thus acceptable.

[Line chart: number of nodes versus threshold value (1% to 5%) for the FP-tree, FUFP-tree and Pre-FUFP algorithms]

Figure 12: The comparison of the numbers of nodes for different threshold values

Experiments were then conducted to show the execution times and the numbers of nodes of the three algorithms for different numbers of inserted transactions. The minimum support threshold was set at 4% for the batch FP-tree algorithm; the upper and the lower support thresholds were set at 4% and 2%, respectively, for the FUFP [...] extracted from the BMS-POS database to construct an initial FP-tree. The next 2,000 transactions were then sequentially used each time as new transactions for the experiments. Figure 13 shows the execution times required by the three algorithms for processing each batch of 2,000 new transactions.

[Line chart: execution time (sec.) versus number of inserted transactions (2,000 to 10,000) for the FP-tree, FUFP-tree and Pre-FUFP algorithms]

Figure 13: The comparison of the execution times for sequentially inserted new transactions

[Plot omitted: number of nodes against number of transactions (2,000 to 10,000) for the FP-tree, FUFP-tree, and Pre-FUFP algorithms]

Figure 14: The comparison of the numbers of nodes for sequentially inserted new transactions

Again, the Pre-FUFP maintenance algorithm ran faster than the other two and generated nearly the same number of nodes.

6. Conclusion

In this paper, we have proposed the Pre-FUFP maintenance algorithm for incremental mining based on the concept of pre-large itemsets. The FUFP-tree structure is used to handle new transactions efficiently and effectively. Defined by two user-specified support thresholds, an upper one and a lower one, the pre-large itemsets act as a gap that keeps small itemsets from directly becoming large in the updated database when transactions are inserted. When new transactions are added, the proposed Pre-FUFP maintenance [...] transactions into three parts according to whether they are large, pre-large or small in the original database. Each part is then processed in its own way. The Header_Table and the FUFP-tree are correspondingly updated whenever necessary.
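The three-way split by the two thresholds can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy counts, and the use of "greater than or equal to" at each threshold are assumptions. The `safety_bound` helper uses the bound f = floor((Su - Sl) * d / (1 - Su)) as it is usually stated for pre-large itemsets, where d is the number of original transactions; it is included here only to show how the gap between the thresholds translates into a number of insertions that can be absorbed before the original database has to be rescanned.

```python
from fractions import Fraction
import math


def classify_items(counts, num_transactions, upper, lower):
    """Split items into large, pre-large, and small by their support ratio.

    Assumption: an item is large if support >= upper, pre-large if
    lower <= support < upper, and small otherwise.
    """
    parts = {"large": set(), "pre_large": set(), "small": set()}
    for item, count in counts.items():
        support = Fraction(count, num_transactions)
        if support >= upper:
            parts["large"].add(item)
        elif support >= lower:
            parts["pre_large"].add(item)
        else:
            parts["small"].add(item)
    return parts


def safety_bound(num_transactions, upper, lower):
    """Number of new transactions that can be inserted before a rescan,
    using f = floor((upper - lower) * d / (1 - upper))."""
    return math.floor((upper - lower) * num_transactions / (1 - upper))


# Hypothetical example: 1,000 original transactions, thresholds of 4% and 2%.
upper_threshold = Fraction(4, 100)
lower_threshold = Fraction(2, 100)
toy_counts = {"a": 50, "b": 30, "c": 10}  # made-up item counts

parts = classify_items(toy_counts, 1000, upper_threshold, lower_threshold)
bound = safety_bound(1000, upper_threshold, lower_threshold)
print(parts, bound)
```

Exact fractions are used instead of floating-point numbers so that an item whose support sits exactly on a threshold is classified consistently.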

Experimental results also show that the proposed Pre-FUFP maintenance algorithm runs faster than the batch FP-tree and the FUFP-tree construction algorithms when handling new transactions, and that it generates nearly the same tree structure as they do. The proposed approach can thus achieve a good trade-off between execution time and tree complexity.
