Sliding window filtering: an efficient method for incremental mining on a time-variant database

Information Systems 30 (2005) 227–244

Chang-Hung Lee, Cheng-Ru Lin, Ming-Syan Chen*

Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei, Taiwan, ROC

Received 3 April 2003; accepted 4 February 2004

Abstract

Recently, several important database applications have called for the design of efficient techniques for incremental mining of association rules. In response to this need, we explore in this paper an effective sliding-window filtering (abbreviated as SWF) algorithm for incremental mining of association rules. In essence, by partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with the candidate itemset generation. Under SWF, the cumulative information of mining previous partitions is selectively carried over toward the generation of candidate itemsets for the subsequent partitions. Algorithm SWF not only significantly reduces I/O and CPU cost through cumulative filtering and scan reduction techniques but also effectively controls memory utilization through sliding-window partitioning. More importantly, algorithm SWF is particularly powerful for efficient incremental mining of an ongoing time-variant transaction database. By utilizing proper scan reduction techniques, only one scan of the incremented dataset is needed by algorithm SWF. The I/O cost of SWF is orders of magnitude smaller than that required by prior methods, thus resolving the performance bottleneck. Extensive experimental studies are performed to evaluate the performance of algorithm SWF. Sensitivity analysis of various parameters is conducted to provide insights into algorithm SWF. It is noted that the improvement achieved by algorithm SWF is even more prominent as the incremented portion of the dataset increases and also as the size of the database increases.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Association rules; Sliding-window filtering; Incremental mining; Time-variant database

Partial results of this study appeared in the Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM), November 5–10, 2001.

*Corresponding author. Tel.: 2-2363-5251; fax: +886-2-2367-1597.

E-mail addresses: [email protected] (C.-H. Lee), [email protected] (C.-R. Lin), [email protected]. edu.tw (M.-S. Chen).

0306-4379/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.is.2004.02.001

1. Introduction

Due to the increasing use of computing for various applications, the importance of data mining is growing at a rapid pace. It is noted that analysis of past transaction data can provide valuable information on customer buying behavior, and thus improve the quality of business decisions. In essence, it is necessary to collect and analyze a sufficient amount of sales data before any meaningful conclusion can be drawn therefrom. Since the amount of these processed data tends to be huge, it is important to devise efficient algorithms to conduct mining on these data. Various data mining capabilities have been explored in Refs. [1–10]. One that has received a significant amount of research attention is mining association rules over basket data [1,11–19]. For example, given a database of sales transactions, it is desirable to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction, e.g., 90% of customers that purchase milk and bread also purchase eggs at the same time. Mining association rules was first introduced in [1], where it was shown that the problem of mining association rules is composed of the following two subproblems: (1) discovering the frequent itemsets, i.e., all itemsets that have transaction support above a pre-determined minimum support $s$; and (2) using the frequent itemsets to generate the association rules for the database. The overall performance of mining association rules is in fact determined by the first subproblem. After the frequent itemsets are identified, the corresponding association rules can be derived in a straightforward manner [1]. Among others, Apriori [1], DHP [17], and partition-based ones [18,20] are proposed to solve the first subproblem efficiently. In addition, several novel mining techniques, including TreeProjection [21], FP-tree [14,22–24], and constraint-based ones [19,25–29] also received a significant amount of research attention.

In addition, it is noted that recent important applications have called for incremental mining. This is due to the increasing use of record-based databases whose data are being continuously added. Examples of such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few. In many applications, we would like to mine the transaction database for a fixed amount of most recent data (say, data in the last 12 months). That is, in incremental mining, one has not only to include new data (i.e., data in the new month) in the mining process, but also to remove the old data (i.e., data in the most obsolete month) from it.

Consider the example transaction database in Fig. 1. Note that $db^{i,j}$ is the part of the transaction database formed by a continuous region from partition $P_i$ to partition $P_j$. Suppose we have conducted the mining for the transaction database $db^{i,j}$. As time advances, we are given the new data of January of 2001, and are interested in conducting an incremental mining against the new data. Instead of taking all the past data into consideration, our interest is limited to mining the data in the last 12 months. As a result, the mining of the transaction database $db^{i+1,j+1}$ is called for. Note that since the underlying transaction database has been changed as time advances, some algorithms, such as Apriori, may have to resort to the regeneration of candidate itemsets for the determination of new frequent itemsets, which is, however, very costly even if the incremental data subset is small. On the other hand, while FP-tree-based methods [14,22–24] are shown to be efficient for small databases, it is expected that their memory overhead, due to the need of keeping a portion of the database in memory as indicated in [30], could become more severe in the presence of a large database upon which an incremental mining process is usually performed.

To the best of our knowledge, little progress has been made thus far to explicitly address the problem of incremental mining, except as noted below. In [31], the FUP algorithm updates the association rules in a database when new transactions are added to the database. Algorithm FUP is based on the framework of Apriori and is designed to discover the new frequent itemsets iteratively.

Fig. 1. Incremental mining for an ongoing time-variant transaction database. [Figure: monthly partitions $P_i, \ldots, P_j$ holding the data for 1/2000 to 12/2000 form $db^{i,j}$; partitions $P_{i+1}, \ldots, P_{j+1}$, ending with the data for 1/2001, form $db^{i+1,j+1}$.]

The idea is to store the counts of all the frequent itemsets found in a previous mining operation. Using these stored counts and examining the newly added transactions, the overall counts of these candidate itemsets are then obtained by scanning the original database. An extension to the work in [31] was reported in [32], where the authors propose an algorithm FUP2 for updating the existing association rules when transactions are added to and deleted from the database. In essence, FUP2 is equivalent to FUP for the case of insertion, and is, however, a complementary algorithm of FUP for the case of deletion. It is shown in [32] that FUP2 outperforms the Apriori algorithm which, without any provision for incremental mining, has to re-run the association rule mining algorithm on the whole updated database. Another FUP-based algorithm, called FUP2H, was also devised in [32] to utilize the hash technique for performance improvement. Furthermore, the concept of negative borders in [33] and that of UWEP, i.e., update with early pruning, in [34] are utilized to enhance the efficiency of FUP-based algorithms.

However, as will be shown by our experimental results, the above-mentioned FUP-based algorithms tend to suffer from two inherent problems, namely (1) the occurrence of a potentially huge set of candidate itemsets, and (2) the need of multiple scans of the database. First, consider the problem of a potentially huge set of candidate itemsets. Note that the FUP-based algorithms deal with the combination of two sets of candidate itemsets which are independently generated, i.e., from the original data set and the incremental data subset. Since the set of candidate itemsets includes all the possible permutations of the elements, FUP-based algorithms may suffer from a very large set of candidate itemsets, especially from candidate 2-itemsets. As confirmed by our experimental results, this problem becomes even more severe for FUP-based algorithms when the increased portion of the incremental mining is large. More importantly, in many applications, one may encounter new itemsets in the increased dataset. When new products are added to the transaction database, FUP-based algorithms will need to resort to multiple scans of the database. Specifically, in the presence of a new frequent itemset $L_k$ generated in the data subset, $k$ scans of the database are needed by FUP-based algorithms in the worst case. That is, the case of $k = 8$ means that the database has to be scanned 8 times, which is very costly, especially in terms of I/O cost. As will become clear later, the problem of a large set of candidate itemsets will hinder an effective use of the scan reduction technique [17] by an FUP-based algorithm.

To remedy these problems, we shall devise in this paper an algorithm based on sliding-window filtering (abbreviated as SWF) for incremental mining of association rules. In essence, by partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with the candidate itemset generation. For ease of exposition, the processing of a partition is termed a phase of processing. Under SWF, the cumulative information in the prior phases is selectively carried over toward the generation of candidate itemsets in the subsequent phases. After the processing of a phase, algorithm SWF outputs a cumulative filter, denoted by CF, which consists of a progressive candidate set of itemsets, their occurrence counts and the corresponding partial support required. As will be seen, the cumulative filter produced in each processing phase constitutes the key component to realize the incremental mining. An illustrative example for the operations of SWF is presented in Section 3.1, a detailed description of algorithm SWF is given in Section 3.2 and the correctness of algorithm SWF is proved in Section 3.3.

It will be seen that algorithm SWF has several important advantages. First, by employing the prior knowledge in the previous phase, SWF is able to reduce the number of candidate itemsets efficiently, which in turn reduces the CPU and memory overhead. The second advantage of SWF is that owing to the small number of candidate sets generated, the scan reduction technique [17] can be applied efficiently. As a result, only one scan of the ongoing time-variant database is required. As will be validated by our experimental results, this very advantage enables SWF to significantly outperform FUP-based algorithms. The third advantage of SWF over FUP-based algorithms is its capability to avoid the data skew in nature. As mentioned in [18,20], such instances as severe weather conditions may cause the sales of some items to increase rapidly within a short period of time. Data skew may cause FUP-based algorithms to generate many false candidate itemsets. In contrast, the performance of SWF will be less affected by the data skew since SWF employs the cumulative information for pruning false candidate itemsets in the early stage.

Extensive experiments are performed to assess the performance of SWF. As shown in the experimental results, SWF produces a significantly smaller number of candidate 2-itemsets than FUP-based algorithms. In fact, the number of candidate itemsets $C_k$ generated by SWF approaches its theoretical minimum, i.e., the number of frequent $k$-itemsets, as the value of the minimal support increases. It is shown by our experiments that SWF in general significantly outperforms FUP-based algorithms. Explicitly, the execution time of SWF is orders of magnitude smaller than that required by FUP-based algorithms. Sensitivity analysis on various parameters of the database is also conducted to provide insights into algorithm SWF. The advantage of SWF over FUP-based algorithms becomes even more prominent not only as the amount of increased dataset increases but also as the size of the database increases. This is indeed an important feature for SWF to be practically used for the mining of an ongoing transaction database.

The rest of this paper is organized as follows. Preliminaries and related works are given in Section 2. Algorithm SWF is described in Section 3 with its correctness proved. Performance studies on various schemes are conducted in Section 4. This paper concludes with Section 5.

2. Preliminaries and related works

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Note that the quantities of items bought in a transaction are not considered, meaning that each item is a binary variable representing whether an item was bought. Each transaction is associated with an identifier, called TID. Let $X$ be a set of items. A transaction $T$ is said to contain $X$ if and only if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$. For a given pair of confidence and support thresholds, the problem of mining association rules is to find all the association rules that have confidence and support greater than the corresponding thresholds. This problem can be reduced to the problem of finding all frequent itemsets for the same support threshold [1]. Before the description of algorithm SWF in Section 3, some related works are reviewed below.

2.1. Apriori-like algorithms

Most of the previous studies, including those in [1,17,31,32,35–37], belong to Apriori-like approaches. Basically, an Apriori-like approach is based on an anti-monotone Apriori heuristic [1], i.e., if any itemset of length $k$ is not frequent in the database, its length-$(k+1)$ super-itemset will never be frequent. The essential idea is to iteratively generate the set of candidate itemsets of length $(k+1)$ from the set of frequent itemsets of length $k$ (for $k \geq 1$), and to check their corresponding occurrence frequencies in the database. As a result, if the largest frequent itemset is a $j$-itemset, then an Apriori-like algorithm may need to scan the database up to $(j+1)$ times.
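The level-wise idea can be illustrated with a minimal Python sketch. The function name `apriori`, the basket data and the 60% support level are illustrative assumptions and are not taken from the paper. The sketch counts one database scan per level, which makes the $(j+1)$-scan behavior visible, and then reuses the support counts to compute the confidence of one rule as defined at the start of this section.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise mining; returns ({itemset: support count}, number of scans)."""
    items = sorted({i for t in transactions for i in t})
    level = [(i,) for i in items]                    # candidate 1-itemsets
    frequent, scans = {}, 0
    while level:
        scans += 1                                   # one database scan per level
        counts = {c: sum(1 for t in transactions if t.issuperset(c)) for c in level}
        survivors = sorted(c for c in level if counts[c] >= min_count)
        frequent.update((c, counts[c]) for c in survivors)
        level = []
        for a, b in combinations(survivors, 2):      # join L_k with itself
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # prune: every k-subset of a (k+1)-candidate must be frequent
                if all(sub in frequent for sub in combinations(cand, len(a))):
                    level.append(cand)
    return frequent, scans

# Hypothetical basket data: 5 transactions, minimum support 60% (3 transactions).
db = [{"milk", "bread", "eggs"}, {"milk", "bread", "eggs"},
      {"milk", "bread"}, {"bread", "eggs"}, {"milk", "eggs"}]
frequent, scans = apriori(db, min_count=3)
confidence = frequent[("bread", "milk")] / frequent[("milk",)]   # rule milk => bread
print(sorted(frequent), scans, confidence)
```

In this toy run the largest frequent itemsets have length 2, yet a third scan is still spent on the single candidate 3-itemset before the loop can stop.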

In Apriori-like algorithms, $C_3$ is generated from $L_2 * L_2$, where $C_3$ is the set of all candidate itemsets of length 3 and $L_2$ is the set of all frequent itemsets of length 2. In fact, a $C_2$ can be used to generate the candidate 3-itemsets. This technique is referred to as scan reduction in [4]. Clearly, a $C_3'$ generated from $C_2 * C_2$, instead of from $L_2 * L_2$, will have a size greater than $|C_3|$ where $C_3$ is generated from $L_2 * L_2$. However, if $|C_3'|$ is not much larger than $|C_3|$, and both $C_2$ and $C_3$ can be stored in main memory, we can find $L_2$ and $L_3$ together when the next scan of the database is performed, thereby saving one round of database scan. It can be seen that using this concept, one can determine all $L_k$s by as few as two scans of the database (i.e., one initial scan to determine $L_1$ and a final scan to determine all other frequent itemsets), assuming that $C_k'$ for $k \geq 3$ is generated from $C_{k-1}'$ and all $C_k'$ for $k > 2$ can be kept in memory. In [5], the technique of scan reduction was utilized and shown to result in prominent performance improvement.
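A minimal sketch of scan reduction, under the assumption that the join $C_h * C_h$ merges two sorted $h$-itemsets sharing their first $h-1$ items and that no subset pruning is applied. The helper names and the toy data are hypothetical. All higher-level candidates are derived from $C_2$ in memory, and a single pass over the transactions then counts every level at once.

```python
from itertools import combinations

def self_join(cands):
    """C_h * C_h: merge two sorted h-itemsets that agree on their first h-1 items."""
    ordered = sorted(cands)
    return {a + (b[-1],) for a, b in combinations(ordered, 2) if a[:-1] == b[:-1]}

def all_candidates(c2):
    """Expand C_2 into [C_2, C_3', C_4', ...] without touching the database."""
    levels, level = [], set(c2)
    while level:
        levels.append(level)
        level = self_join(level)
    return levels

def count_in_one_scan(levels, transactions):
    """Count every candidate of every level during a single pass over the data."""
    counts = {c: 0 for level in levels for c in level}
    for t in transactions:
        for c in counts:
            if t.issuperset(c):
                counts[c] += 1
    return counts

# Hypothetical candidate 2-itemsets and transactions.
c2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
levels = all_candidates(c2)
db = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "D"}]
counts = count_in_one_scan(levels, db)
print(levels[1], counts[("A", "B", "C")])
```

Here the join also produces BCD even though CD is not a candidate, which is the kind of extra itemset that makes $C_3'$ larger than $C_3$; the trade is a few more counters in memory for fewer database scans.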

2.2. Partition-based algorithms

The works in [18,20,38] are essentially based on a partition-based heuristic, i.e., if $X$ is a frequent itemset in database $D$ which is divided into $n$ partitions $p_1, p_2, \ldots, p_n$, then $X$ must be a frequent itemset in at least one of the $n$ partitions. The partition algorithm in [18] divides $D$ into $n$ partitions, and processes one partition in main memory at a time. The algorithm first scans partition $p_i$, for $i = 1$ to $n$, to find the set of all local frequent itemsets in $p_i$, denoted as $L^{p_i}$. Then, by taking the union of $L^{p_i}$ for $i = 1$ to $n$, a set of candidate itemsets over $D$ is constructed, denoted as $C^G$. Based on the above partition-based heuristic, $C^G$ is a superset of the set of all frequent itemsets in $D$. Finally, the algorithm scans each partition for the second time to calculate the support of each itemset in $C^G$ and to find out which candidate itemsets are really frequent itemsets in $D$.

Instead of constructing $C^G$ by taking the union of $L^{p_i}$, for $i = 1$ to $n$, at the end of the first scan, some variations of the above partition algorithm are proposed in [20,38]. In [38], algorithm SPINC constructs $C^G$ incrementally by adding $L^{p_i}$ to $C^G$ whenever $L^{p_i}$ is available. SPINC starts the counting of occurrences for each candidate itemset $c \in C^G$ as soon as $c$ is added to $C^G$. In [20], algorithm AS-CPA employs prior knowledge collected during the mining process to further reduce the number of candidate itemsets and to overcome the problem of data skew. However, these works were not devised to handle incremental updating of association rules.
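The two-scan structure of the basic partition algorithm can be sketched as follows. This is an illustrative Python sketch rather than the implementation of [18]: local frequent itemsets are found by brute-force enumeration, and the function names and the toy data are assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil

def local_frequent(partition, s):
    """All itemsets whose support inside one partition is at least s."""
    need = ceil(s * len(partition))
    items = sorted({i for t in partition for i in t})
    found = set()
    for size in range(1, len(items) + 1):
        hits = {c for c in combinations(items, size)
                if sum(t.issuperset(c) for t in partition) >= need}
        if not hits:          # anti-monotonicity: no larger itemset can be frequent
            break
        found |= hits
    return found

def partition_mine(partitions, s):
    # First scan: the union of the local frequent itemsets is the global candidate set.
    cg = set().union(*(local_frequent(p, s) for p in partitions))
    # Second scan: count every global candidate over the whole database.
    need = ceil(s * sum(len(p) for p in partitions))
    counts = {c: sum(t.issuperset(c) for p in partitions for t in p) for c in cg}
    return {c for c, n in counts.items() if n >= need}, cg

# Hypothetical database split into two partitions, minimum support 50%.
parts = [[{"A", "B"}, {"A", "B"}, {"C"}],
         [{"A", "C"}, {"C"}, {"B", "C"}]]
frequent, cg = partition_mine(parts, Fraction(1, 2))
print(sorted(frequent), sorted(cg))
```

In this toy run AB is frequent only inside the first partition, so it enters the global candidate set and is rejected by the second scan, a small instance of the false candidates that data skew can produce.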

2.3. FUP-based algorithms

Since it is costly to find the association rules in large databases, incremental updating techniques are desirable in order to avoid redoing data mining on the whole updated database. Basically, similar to that of Apriori, the framework of FUP, which can update the association rules in a database when new transactions are added to the database, contains a number of iterations [31,32]. The candidate sets at each iteration are generated based on the frequent itemsets found in the previous iteration. The key steps of FUP are listed below, where $\Delta^+$ denotes the added portion of an ongoing transaction database.

(1) At each iteration, the supports of the size-$k$ frequent itemsets in $L$ are updated against the increment $\Delta^+$ to filter out those that are no longer frequent in the updated database.

(2) While scanning the increment, a set of candidate sets, $C_k$, is extracted from the transactions in $\Delta^+$, together with their supports in $\Delta^+$ counted. The supports of these sets in $C_k$ are then updated against the original database to find the "new" frequent itemsets.

(3) Many sets in $C_k$ can be pruned away by checking their supports in $\Delta^+$ before the update against the original database starts.

(4) The size of the updated database is reduced at each iteration by pruning away a few items from some transactions in the updated database.

The major idea is to reuse the information of the old frequent itemsets and to integrate the support information of the new frequent itemsets in order to substantially reduce the pool of candidate sets to be re-examined. An extension to FUP was reported in [32] and is referred to as FUP2. In essence, FUP2 is equivalent to FUP for the case of insertion, and is, however, a complementary algorithm of FUP for the case of deletion. Another FUP-based algorithm, called FUP2H, was also devised in [32] to utilize the hash technique for performance improvement. As pointed out earlier, the existing FUP-based algorithms in general suffer from two inherent problems, namely (1) the occurrence of a potentially huge set of candidate itemsets, which is particularly critical for incremental mining since the candidate sets for the original database and the incremental portion are generated separately, and (2) the need of multiple scans of the database.
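Steps (1) to (3) of one FUP-style iteration can be sketched for itemsets of a single size $k$. This is a simplified illustration and not the algorithm of [31]: the function `fup_level`, its arguments and the toy data are hypothetical, the candidate set is passed in rather than extracted, and step (4) is omitted. The pruning rests on the observation that an itemset infrequent in the original database can only become frequent overall if it is frequent within the increment.

```python
from fractions import Fraction

def count(itemset, transactions):
    return sum(1 for t in transactions if t.issuperset(itemset))

def fup_level(old_frequent, candidates, original, increment, s):
    """One FUP-style iteration for itemsets of one size.

    old_frequent: {itemset: support count in the original database}
    candidates:   itemsets of the same size to consider in the updated database
    Returns (updated frequent itemsets with counts, itemsets that needed a rescan).
    """
    need_total = s * (len(original) + len(increment))
    need_inc = s * len(increment)
    updated, rescanned = {}, []
    # (1) Old frequent itemsets: add their count in the increment, drop the losers.
    for itemset, old_count in old_frequent.items():
        new_count = old_count + count(itemset, increment)
        if new_count >= need_total:
            updated[itemset] = new_count
    # (2)+(3) New candidates were infrequent in the original database, so any
    # candidate that is infrequent in the increment is pruned without a rescan.
    for itemset in candidates:
        if itemset in old_frequent:
            continue
        inc_count = count(itemset, increment)
        if inc_count < need_inc:
            continue
        rescanned.append(itemset)
        total = inc_count + count(itemset, original)
        if total >= need_total:
            updated[itemset] = total
    return updated, rescanned

# Hypothetical data, minimum support 50%.
original = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
old_frequent = {("A",): 3, ("B",): 2}
increment = [{"B", "C"}, {"C"}, {"C", "D"}, {"A"}]
candidates = [("A",), ("B",), ("C",), ("D",)]
updated, rescanned = fup_level(old_frequent, candidates, original, increment, Fraction(1, 2))
print(updated, rescanned)
```

Only C triggers a pass over the original data in this run; because such rescans are repeated per itemset size, an FUP-style method may need several scans of the original database.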

3. SWF: incremental mining with sliding-window filtering

In essence, by partitioning a transaction database into several partitions, algorithm SWF employs a filtering threshold in each partition to deal with the candidate itemset generation. As described earlier, under SWF, the cumulative information in the prior phases is selectively carried over toward the generation of candidate itemsets in the subsequent phases. In the processing of a partition, a progressive candidate set of itemsets is generated by SWF. Explicitly, a progressive candidate set of itemsets is composed of the following two types of candidate itemsets, i.e., (1) the candidate itemsets that were carried over from the previous progressive candidate set in the previous phase and remain as candidate itemsets after the current partition is taken into consideration (such candidate itemsets are called type a candidate itemsets); and (2) the candidate itemsets that were not in the progressive candidate set in the previous phase but are newly selected after only taking the current data partition into account (such candidate itemsets are called type b candidate itemsets). As such, after the processing of a phase, algorithm SWF outputs a cumulative filter, denoted by CF, which consists of a progressive candidate set of itemsets, their occurrence counts and the corresponding partial support required. With these design considerations, algorithm SWF is shown to have very good performance for incremental mining. In Section 3.1, an illustrative example of SWF is presented. A detailed description of algorithm SWF is given in Section 3.2. The correctness of SWF is proved in Section 3.3.

3.1. An example of incremental mining by SWF

Algorithm SWF can best be understood through the illustrative transaction database in Figs. 2 and 3, where a scenario of generating frequent itemsets from a transaction database for incremental mining is given. The minimum transaction support is assumed to be $s = 40\%$. Without loss of generality, the incremental mining problem can be decomposed into two procedures:

1. Preprocessing procedure: This procedure deals with mining on the original transaction database.

2. Incremental procedure: The procedure deals with the update of the frequent itemsets for an ongoing time-variant transaction database.

The preprocessing procedure is only utilized for the initial mining of association rules in the original database, e.g., $db^{1,n}$. For the generation of mining association rules in $db^{2,n+1}$, $db^{3,n+2}$, $db^{i,j}$, and so on, the incremental procedure is employed. Consider the database in Fig. 2. Assume that the original transaction database $db^{1,3}$ is segmented into three partitions, i.e., $\{P_1, P_2, P_3\}$, in the preprocessing procedure. Each partition is scanned sequentially for the generation of candidate 2-itemsets in the first scan of the database $db^{1,3}$. After scanning the first segment of three transactions, i.e., partition $P_1$, 2-itemsets $\{AB, AC, AE, AF, BC, BE, CE\}$ are generated as shown in Fig. 3. In addition, each potential candidate itemset $c \in C_2$ has two attributes: (1) c.start, which contains the identity of the starting partition when $c$ was added to $C_2$; and (2) c.count, which contains the number of occurrences of $c$ since $c$ was added to $C_2$. Since there are three transactions in $P_1$, the partial minimal support is $\lceil 3 \times 0.4 \rceil = 2$. Such a partial minimal support is called the filtering threshold in this paper. Itemsets whose occurrence counts are below the filtering threshold are removed. Then, as shown in Fig. 3, only $\{AB, AC, BC\}$, which are marked in the figure, remain as candidate itemsets (of type b in this phase since they are newly generated) whose information is then carried over to the next phase of processing.

Fig. 3. Large itemsets generation for the incremental mining with SWF.

Similarly, after scanning partition $P_2$, the occurrence counts of potential candidate 2-itemsets are recorded (of type a and type b). From Fig. 3, it is noted that since there are also three transactions in $P_2$, the filtering threshold of those itemsets carried over from the previous phase (that become type a candidate itemsets in this phase) is $\lceil (3 + 3) \times 0.4 \rceil = 3$ and that of newly identified candidate itemsets (i.e., type b candidate itemsets) is $\lceil 3 \times 0.4 \rceil = 2$. It can be seen from Fig. 3 that we have five candidate itemsets in $C_2$ after the processing of partition $P_2$, and three of them are type a and two of them are type b.

Finally, partition $P_3$ is processed by algorithm SWF. The resulting candidate 2-itemsets are $C_2 = \{AB, AC, BC, BD, BE\}$ as shown in Fig. 3. Note that though appearing in the previous phase $P_2$, itemset $\{AD\}$ is removed from $C_2$ once $P_3$ is taken into account since its occurrence count does not meet the filtering threshold then, i.e., $2 < 3$. However, we do have one new itemset, i.e., BE, which joins $C_2$ as a type b candidate itemset. Consequently, we have five candidate 2-itemsets generated by SWF, and four of them are of type a and one of them is of type b. Note that instead of 15 candidate itemsets that would be generated if Apriori were used (see footnote 1), only five candidate 2-itemsets are generated by SWF. The correctness of algorithm SWF will be formally proved later.
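The filtering thresholds quoted in this example follow from one formula, the ceiling of the minimum support times the number of transactions in the partitions over which a candidate has been tracked. A small Python check, assuming three transactions per partition as in the example (the function name is illustrative, and exact fractions are used to avoid floating-point rounding):

```python
from fractions import Fraction
from math import ceil

def filtering_threshold(s, partition_sizes):
    """Smallest count a candidate needs over the partitions it has been tracked in."""
    return ceil(s * sum(partition_sizes))

s = Fraction(40, 100)
print(filtering_threshold(s, [3]))        # 2: newly selected (type b) candidate
print(filtering_threshold(s, [3, 3]))     # 3: candidate carried over two partitions
print(filtering_threshold(s, [3, 3, 3]))  # 4: candidate carried over three partitions
```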

After generating $C_2$ from the first scan of database $db^{1,3}$, we employ the scan reduction technique and use $C_2$ to generate $C_k$ ($k = 2, 3, \ldots, n$), where $C_n$ is the candidate last-itemsets. It can be verified that a $C_2$ generated by SWF can be used to generate the candidate 3-itemsets and its sequential $C_{k-1}'$ can be utilized to generate $C_k'$. Clearly, a $C_3'$ generated from $C_2 * C_2$, instead of from $L_2 * L_2$, will have a size greater than $|C_3|$ where $C_3$ is generated from $L_2 * L_2$. However, since the $|C_2|$ generated by SWF is very close to the theoretical minimum, i.e., $|L_2|$, $|C_3'|$ is not much larger than $|C_3|$. Similarly, $|C_k'|$ is close to $|C_k|$. All $C_k'$ can be stored in main memory, and we can find $L_k$ ($k = 1, 2, \ldots, n$) together when the second scan of the database $db^{1,3}$ is performed. Thus, only two scans of the original database $db^{1,3}$ are required in the preprocessing step. In addition, instead of recording all $L_k$s in main memory, we only have to keep $C_2$ in main memory for the subsequent incremental mining of an ongoing time-variant transaction database.

The merit of SWF mainly lies in its incremental procedure. As depicted in Fig. 3, the mining database will be moved from $db^{1,3}$ to $db^{2,4}$. Thus, some transactions, i.e., $t_1$, $t_2$, and $t_3$, are deleted from the mining database and other transactions, i.e., $t_{10}$, $t_{11}$, and $t_{12}$, are added. For ease of exposition, this incremental step can also be divided into three sub-steps: (1) generating $C_2$ in $D^- = db^{1,3} - \Delta^-$; (2) generating $C_2$ in $db^{2,4} = D^- + \Delta^+$; and (3) scanning the database $db^{2,4}$ only once for the generation of all frequent itemsets $L_k$. In the first sub-step, $db^{1,3} - \Delta^- = D^-$, we check out the pruned partition $P_1$, and reduce the value of c.count and set c.start = 2 for those candidate itemsets $c$ where c.start = 1. It can be seen that itemsets $\{AB, AC, BC\}$ were removed. Next, in the second sub-step, we scan the incremental transactions in $P_4$. The process in $D^- + \Delta^+ = db^{2,4}$ is similar to the operation of scanning partitions, e.g., $P_2$, in the preprocessing step. Three new itemsets, i.e., DE, DF, EF, join $C_2$ after the scan of $P_4$ as type b candidate itemsets. Finally, in the third sub-step, we use $C_2$ to generate $C_k'$ as mentioned above. By scanning $db^{2,4}$ only once, SWF obtains frequent itemsets $\{A, B, C, D, E, F, BD, BE, DE\}$ in $db^{2,4}$. As will be shown by experimental results later, the improvement achieved by algorithm SWF is even more prominent as the amount of the incremental portion increases and also as the size of the database $db^{i,j}$ increases.

3.2. Algorithm of SWF

For ease of exposition, the meanings of various symbols used are given in Table 1. The preprocessing procedure and the incremental procedure of algorithm SWF are described in Sections 3.2.1 and 3.2.2, respectively.

3.2.1. Preprocessing procedure of SWF

The preprocessing procedure of Algorithm SWF is outlined below. Initially, the database $db^{1,n}$ is partitioned into $n$ partitions by executing the preprocessing procedure (in Step 2), and CF, i.e., cumulative filter, is empty (in Step 3). Let $C_2^{i,j}$ be the set of progressive candidate 2-itemsets generated by database $db^{i,j}$. It is noted that instead of keeping $L_k$s in the main memory, algorithm SWF only records $C_2^{1,n}$, which is generated by the preprocessing procedure, to be used by the incremental procedure.

Table 1
Meanings of symbols used

Symbol             Meaning
$db^{i,j}$         Partition database ($D$) formed by a continuous region from partition $P_i$ to partition $P_j$
$s$                Minimum support required
$|P_k|$            Number of transactions in partition $P_k$
$N_{p_k}(I)$       Number of transactions in partition $P_k$ that contain itemset $I$
$|db^{1,n}(I)|$    Number of transactions in $db^{1,n}$ that contain itemset $I$
$C^{i,j}$          The set of progressive candidate itemsets generated by database $db^{i,j}$
$\Delta^-$         The deleted portion of an ongoing transaction database
$D^-$              The unchanged portion of an ongoing transaction database
$\Delta^+$         The added portion of an ongoing transaction database

1 The details of the execution procedure by Apriori are

Preprocessing procedure of Algorithm SWF

1. $n$ = Number of partitions;
2. $|db^{1,n}| = \sum_{k=1}^{n} |P_k|$;
3. $CF = \emptyset$;
4. begin for $k = 1$ to $n$ // 1st scan of $db^{1,n}$
5.   begin for each 2-itemset $I \in P_k$
6.     if ($I \notin CF$)
7.       I.count $= N_{p_k}(I)$;
8.       I.start $= k$;
9.       if (I.count $\geq s \times |P_k|$)
10.        $CF = CF \cup I$;
11.    if ($I \in CF$)
12.      I.count $=$ I.count $+ N_{p_k}(I)$;
13.      if (I.count $< \lceil s \times \sum_{m=I.\mathrm{start}}^{k} |P_m| \rceil$)
14.        $CF = CF - I$;
15.  end
16. end
17. select $C_2^{1,n}$ from $I$ where $I \in CF$;
18. keep $C_2^{1,n}$ in main memory;
19. $h = 2$; // $C_1$ is given
20. begin while ($C_h^{1,n} \neq \emptyset$) // Database scan reduction
21.   $C_{h+1}^{1,n} = C_h^{1,n} * C_h^{1,n}$;
22.   $h = h + 1$;
23. end
24. refresh I.count $= 0$ where $I \in C^{1,n}$; // where $C^{1,n} = \bigcup_h C_h^{1,n}$
25. begin for $k = 1$ to $n$ // 2nd scan of $db^{1,n}$
26.   for each itemset $I \in C^{1,n}$
27.     I.count $=$ I.count $+ N_{p_k}(I)$;
28. end
29. for each itemset $I \in C^{1,n}$
30.   if (I.count $\geq \lceil s \times |db^{1,n}| \rceil$)
31.     $L = L \cup I$;
32. end
33. return $L$;
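The first scan of the listing above (Steps 4 to 16) can be sketched in Python as follows. This is one hedged reading of the listing, not the authors' code: the two tests on CF membership are treated as alternative branches, every itemset in CF is re-checked against its filtering threshold after each partition (as the removal of AD in Section 3.1 suggests), and the three-partition database is hypothetical rather than the one in Fig. 2.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import ceil

def build_cumulative_filter(partitions, s):
    """First scan: return CF as {2-itemset: [start partition, count]}."""
    cf, sizes = {}, []
    for k, part in enumerate(partitions, start=1):
        sizes.append(len(part))
        # N_pk(I) for every 2-itemset I occurring in partition k
        local = Counter(p for t in part for p in combinations(sorted(t), 2))
        for pair, n in local.items():
            if pair in cf:
                cf[pair][1] += n            # carried-over candidate
            elif n >= s * len(part):
                cf[pair] = [k, n]           # newly selected candidate
        # drop candidates whose cumulative count misses the filtering threshold
        for pair, (start, count) in list(cf.items()):
            if count < ceil(s * sum(sizes[start - 1:k])):
                del cf[pair]
    return cf

# Hypothetical database: three partitions of three transactions, s = 40%.
partitions = [
    [{"A", "B", "C"}, {"A", "B"}, {"A", "C", "E"}],
    [{"A", "B", "D"}, {"B", "D"}, {"C", "F"}],
    [{"B", "D", "E"}, {"A", "B"}, {"B", "E"}],
]
cf = build_cumulative_filter(partitions, Fraction(2, 5))
print(cf)
```

In this run AC is selected in the first partition and dropped after the second, where its count of 2 falls below the threshold of 3, which mirrors how false candidates are filtered early.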

From Step 4 to Step 16, the algorithm processes one partition at a time for all partitions. When partition $P_i$ is processed, each potential candidate 2-itemset is read and saved to CF. The number of occurrences of an itemset $I$ and its starting partition are recorded in I.count and I.start, respectively. An itemset whose I.count $\geq \lceil s \times \sum_{m=I.\mathrm{start}}^{k} |P_m| \rceil$ will be kept in CF. Next, we select $C_2^{1,n}$ from $I$ where $I \in CF$ and keep $C_2^{1,n}$ in main memory for the subsequent incremental procedure. By employing the scan reduction technique from Step 19 to Step 23, $C_h^{1,n}$s ($h \geq 3$) are generated in main memory. After refreshing I.count $= 0$ where $I \in C^{1,n}$, we begin the last scan of the database for the preprocessing procedure from Step 25 to Step 28. Finally, those itemsets whose I.count $\geq \lceil s \times |db^{1,n}| \rceil$ are the frequent itemsets.

3.2.2. Incremental procedure of SWF

As shown in Table 1, $D^-$ indicates the unchanged portion of an ongoing transaction database. The deleted and added portions of an ongoing transaction database are denoted by $\Delta^-$ and $\Delta^+$, respectively. It is worth mentioning that the sizes of $\Delta^+$ and $\Delta^-$, i.e., $|\Delta^+|$ and $|\Delta^-|$, respectively, are not required to be the same. The incremental procedure of SWF is devised to maintain frequent itemsets efficiently and effectively. This procedure is outlined below.

Incremental procedure of Algorithm SWF

1. Original database = $db^{m,n}$;
2. New database = $db^{i,j}$;
3. Database removed $\Delta^- = \sum_{k=m}^{i-1} P_k$;
4. Database increased $\Delta^+ = \sum_{k=n+1}^{j} P_k$;
5. $D^- = \sum_{k=i}^{n} P_k$;
6. $db^{i,j} = db^{m,n} - \Delta^- + \Delta^+$;
7. loading $C_2^{m,n}$ of $db^{m,n}$ into CF where $I \in C_2^{m,n}$;
8. begin for $k = m$ to $i - 1$ // one scan of $\Delta^-$
9.    begin for each 2-itemset $I \in P_k$
10.      if ($I \in CF$ and I.start $\le k$)
11.         I.count = I.count $- N_{pk}(I)$;
12.         I.start = $k + 1$;
13.         if (I.count $< \lceil s \cdot \sum_{m=I.start}^{n} |P_m| \rceil$)
14.            $CF = CF - I$;
15.   end
16. end
17. begin for $k = n + 1$ to $j$ // one scan of $\Delta^+$
18.    begin for each 2-itemset $I \in P_k$
19.      if ($I \notin CF$)
20.         I.count = $N_{pk}(I)$;
21.         I.start = $k$;
22.         if (I.count $\ge s \cdot |P_k|$)
23.            $CF = CF \cup I$;
24.      if ($I \in CF$)
25.         I.count = I.count + $N_{pk}(I)$;
26.         if (I.count $< \lceil s \cdot \sum_{m=I.start}^{k} |P_m| \rceil$)
27.            $CF = CF - I$;
28.   end
29. end
30. select $C_2^{i,j}$ from $I$ where $I \in CF$;
31. keep $C_2^{i,j}$ in main memory;
32. $h = 2$; // $C_1$ is well known
33. begin while ($C_h^{i,j} \ne \emptyset$) // Database scan reduction
34.    $C_{h+1}^{i,j} = C_h^{i,j} * C_h^{i,j}$;
35.    $h = h + 1$;
36. end
37. refresh I.count = 0 where $I \in C^{i,j}$; // where $C^{i,j} = \bigcup_h C_h^{i,j}$
38. begin for $k = i$ to $j$ // only one scan of $db^{i,j}$
39.    for each itemset $I \in C^{i,j}$
40.       I.count = I.count + $N_{pk}(I)$;
41. end
42. for each itemset $I \in C^{i,j}$
43.    if (I.count $\ge \lceil s \cdot |db^{i,j}| \rceil$)
44.       $L = L \cup I$;
45. end
46. return $L$;
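The two scans over the deleted and the added partitions (Steps 8 to 29 above) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `pair_counts` and `slide_window`, the `{pair: [start, count]}` layout and the in-memory partition dictionary are hypothetical, and the loops visit every tracked candidate rather than only the 2-itemsets present in $P_k$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import ceil


def pair_counts(partition):
    """N_pk(I) for every 2-itemset I occurring in one partition."""
    return Counter(p for t in partition for p in combinations(sorted(t), 2))


def slide_window(cf, parts, m, n, i, j, s):
    """Update the CF of db^{m,n} in place so that it describes db^{i,j}.

    cf: {pair: [start, count]}; parts: {k: list of transactions of P_k}.
    """
    def size(a, b):
        return sum(len(parts[x]) for x in range(a, b + 1))

    # Sub-step 1: one scan of the deleted partitions P_m .. P_{i-1}
    for k in range(m, i):
        local = pair_counts(parts[k])
        for pair in [p for p, (start, _) in cf.items() if start <= k]:
            start, count = cf[pair]
            count -= local.get(pair, 0)
            if count < ceil(s * size(k + 1, n)):
                del cf[pair]
            else:
                cf[pair] = [k + 1, count]
    # Sub-step 2: one scan of the added partitions P_{n+1} .. P_j
    for k in range(n + 1, j + 1):
        local = pair_counts(parts[k])
        for pair in list(cf):
            start, count = cf[pair]
            count += local.get(pair, 0)
            if count < ceil(s * size(start, k)):
                del cf[pair]
            else:
                cf[pair] = [start, count]
        for pair, cnt in local.items():
            if pair not in cf and cnt >= s * len(parts[k]):
                cf[pair] = [k, cnt]
    return cf


demo_parts = {
    1: [{"A", "B"}, {"A", "B"}, {"A", "C"}],
    2: [{"B", "C"}, {"B", "C"}, {"A", "B"}],
    3: [{"A", "C"}, {"A", "C"}, {"B", "C"}],
}
old_cf = {("A", "B"): [1, 3], ("B", "C"): [2, 2]}   # CF of db^{1,2}
new_cf = slide_window(old_cf, demo_parts, 1, 2, 2, 3, Fraction(1, 2))
```

In this toy run, sliding the window from $db^{1,2}$ to $db^{2,3}$ drops (A, B), keeps (B, C) with its original starting partition, and adds (A, C) as a new candidate from $P_3$.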

As mentioned before, this incremental step can also be divided into three sub-steps: (1) generating $C_2$ in $D^-$; (2) generating $C_2$ in $D^- + \Delta^+$; and (3) scanning the database $D^- + \Delta^+$ only once for the generation of all frequent itemsets $L_k$. Initially, after some update activities, old transactions $\Delta^-$ are removed from the database $db^{m,n}$ and new transactions $\Delta^+$ are added (in Step 6). Note that $\Delta^- \subseteq db^{m,n}$. Denote the updated database as $db^{i,j}$, so that $db^{i,j} = db^{m,n} - \Delta^- + \Delta^+$. We denote the unchanged transactions by $D^- = db^{m,n} - \Delta^- = db^{i,j} - \Delta^+$. After loading $C_2^{m,n}$ of $db^{m,n}$ into CF, where $I \in C_2^{m,n}$, we start the first sub-step, i.e., generating $C_2$ in $D^- = db^{m,n} - \Delta^-$. This sub-step reverses the cumulative processing described in the preprocessing procedure. From Step 8 to Step 16, we prune the occurrences of an itemset $I$ that appeared before partition $P_i$ by deducting them from I.count where $I \in CF$ and I.start $< i$. Next, from Step 17 to Step 36, similarly to the cumulative processing in Section 3.2.1, the second sub-step generates the new potential $C_2^{i,j}$ in $db^{i,j} = D^- + \Delta^+$ and employs the scan reduction technique to generate the sets $C_h^{i,j}$ from $C_2^{i,j}$. Finally, to generate the new sets $L_k$ in the updated database, we scan $db^{i,j}$ only once in the incremental procedure to maintain frequent itemsets. Note that $C_2^{i,j}$ is kept in main memory for the next round of incremental mining.

Note that it is easy to extend algorithm SWF so that, after generating $C_2^{i,j}$, it takes one more database scan to obtain $L_2^{i,j}$ and then generates $C_3^{i,j}$ directly from $L_2^{i,j}$. Similarly, we can generate $C_{k+1}^{i,j}$ from $L_k^{i,j}$ by $(k - 1)$ more database scans. This extension is especially useful for some extremely distributed data sets in which $L_k$, where $k > 2$, is much larger than $L_2$.
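To make the scan reduction step concrete, the sketch below builds $C_3, C_4, \ldots$ in memory from $C_2$ alone and then counts all candidates in a single pass. It is a simplified illustration: the join is written as the usual Apriori-style self-join with subset pruning, which is assumed here to be what the $*$ operator denotes, and the names `join`, `all_candidates` and `frequent_itemsets` are hypothetical. Itemsets are represented as sorted tuples.

```python
from itertools import combinations


def join(c_h):
    """Apriori-style join C_h * C_h, pruned against C_h itself."""
    c_h = set(c_h)
    out = set()
    for a in c_h:
        for b in c_h:
            # Join two h-itemsets that share their first h-1 items
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                if all(sub in c_h for sub in combinations(cand, len(a))):
                    out.add(cand)
    return out


def all_candidates(c2):
    """Build C_2, C_3, ... in memory until the next level is empty."""
    levels, current = [], set(c2)
    while current:
        levels.append(current)
        current = join(current)
    return levels


def frequent_itemsets(transactions, levels, min_count):
    """One pass over the database counts every candidate of every size."""
    counts = {c: 0 for level in levels for c in level}
    for t in transactions:
        for c in counts:
            if t.issuperset(c):
                counts[c] += 1
    return {c for c, n in counts.items() if n >= min_count}


demo_c2 = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")}
demo_levels = all_candidates(demo_c2)
demo_db = [{"A", "B", "C"}, {"A", "B", "C"}, {"B", "D"}, {"A", "B"}]
demo_frequent = frequent_itemsets(demo_db, demo_levels, min_count=2)
```

Because every level is derived from the previous candidate level rather than from the previous frequent level, the candidate sets can be somewhat larger than in Apriori, but no database scan is needed between levels.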

Note that SWF is able to filter out false candidate itemsets in $P_i$ with a hash table. As in [17], by using a hash table to prune candidate 2-itemsets, i.e., $C_2$, in each accumulative ongoing partition set $P_i$ of the transaction database, the CPU and memory overhead of SWF can be further reduced. As will be validated by our experimental studies, SWF indeed provides an efficient solution for incremental mining, which is, in our opinion, important for mining record-based databases whose data are frequently and continuously added, such as Web log records, stock market data, grocery sales data, and transactions in electronic commerce, to name a few.
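The idea of hash-based pruning of candidate 2-itemsets can be illustrated with the following sketch. It is not the exact scheme of [17]: the bucket count, the use of Python's built-in `hash` and the function name `hash_filtered_pairs` are assumptions chosen for brevity. The point it shows is that a pair whose bucket total stays below the threshold cannot be frequent in the partition, so it never needs an exact counter.

```python
from collections import Counter
from itertools import combinations


def hash_filtered_pairs(partition, min_count, num_buckets=1024):
    """Exactly count only the 2-itemsets whose hash bucket reaches min_count."""
    def bucket_of(pair):
        return hash(pair) % num_buckets

    # First pass: cheap bucket counters shared by colliding pairs
    buckets = Counter()
    for t in partition:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    # Second pass: exact counters only for pairs in heavy buckets
    exact = Counter()
    for t in partition:
        for pair in combinations(sorted(t), 2):
            if buckets[bucket_of(pair)] >= min_count:
                exact[pair] += 1
    return {p: n for p, n in exact.items() if n >= min_count}


demo_partition = [{"A", "B"}, {"A", "B"}, {"A", "C"}]
demo_pairs = hash_filtered_pairs(demo_partition, min_count=2)
```

A bucket total is always at least the count of any pair hashed into it, so the filter may let some infrequent pairs through but never discards a pair that reaches `min_count`.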

3.3. Correctness of SWF

With the above two procedures described, we now examine the correctness and effectiveness of algorithm SWF. Let $N_{pk}(I)$ be the number of transactions in partition $P_k$ that contain itemset $I$, and let $|P_k|$ be the number of transactions in partition $P_k$. Also, let $db^{i,j}$ denote the part of the transaction database formed by a continuous region from partition $P_i$ to partition $P_j$, with $|db^{i,j}| = \sum_{k=i}^{j} |P_k|$. We can then define the region ratio of an itemset as follows.

Definition. A region ratio of an itemset $I$ for the transaction database $db^{i,j}$, denoted by $r_{i,j}(I)$, is $r_{i,j}(I) = \left( \sum_{k=i}^{j} N_{pk}(I) \right) / |db^{i,j}|$.

In essence, the region ratio of an itemset is the support of that itemset if only the part $db^{i,j}$ of the transaction database is considered.
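The definition translates directly into a few lines of Python. In this small sketch the function name `region_ratio` and the list-based inputs are assumptions for illustration: `occurrences[k-1]` plays the role of $N_{pk}(I)$ and `sizes[k-1]` the role of $|P_k|$, with 1-based partition indices $i$ and $j$.

```python
from fractions import Fraction


def region_ratio(occurrences, sizes, i, j):
    """r_{i,j}(I) for 1-based, inclusive partition indices i <= j."""
    return Fraction(sum(occurrences[i - 1:j]), sum(sizes[i - 1:j]))


demo_occurrences = [2, 1, 0]   # N_p1(I), N_p2(I), N_p3(I)
demo_sizes = [3, 3, 3]         # |P_1|, |P_2|, |P_3|
```

With these inputs, `region_ratio(demo_occurrences, demo_sizes, 1, 2)` evaluates to 1/2, while extending the region to the third partition lowers the ratio to 1/3.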


Lemma 1. An itemset $I$ remains in CF after the processing of partition $P_j$ if and only if there exists an $i$ such that for any integer $k$ in the interval $[i, j]$, $r_{i,k}(I) \ge s$, where $s$ is the minimal support required.

Proof. We shall prove the "if" condition first. Consider the following two cases. First, suppose the itemset $I$ is not in the progressive candidate set before the processing of partition $P_i$. Since $r_{i,i}(I) \ge s$, itemset $I$ will be selected as a type $\beta$ candidate itemset by SWF after the processing of partition $P_i$. On the other hand, if the itemset $I$ is already in the progressive candidate set before the processing of partition $P_i$, itemset $I$ will remain as a type $\alpha$ candidate itemset by SWF. Clearly, for the above two cases, itemset $I$ will remain in CF throughout the processing from $P_i$ to $P_j$ since for any integer $k$ in the interval $[i, j]$, $r_{i,k}(I) \ge s$.

We now prove the "only if" condition, i.e., if $I$ remains in CF after the processing of partition $P_j$ then there exists an $i$ such that for any $k$ in the interval $[i, j]$, $r_{i,k}(I) \ge s$. Note that itemset $I$ can be either a type $\alpha$ or a type $\beta$ candidate itemset in CF after the processing of partition $P_j$. Suppose $I$ is a type $\beta$ candidate itemset there; then this implication follows by setting $i = j$ since $r_{j,j}(I) \ge s$. On the other hand, suppose that $I$ is a type $\alpha$ candidate itemset after the processing of $P_j$, which means itemset $I$ has become a type $\beta$ candidate itemset in a previous phase. Then, we shall trace backward the type of itemset $I$ from partition $P_j$ (i.e., looking over $P_j$, $P_{j-1}$, $P_{j-2}$ and so forth) until the partition that records itemset $I$ as a type $\beta$ candidate itemset is first encountered. (It should be noted that there could be two discontinuous regions that record itemset $I$ in CF, which means that an itemset may get on and off the progressive candidate set through the processing of partitions. This in turn means that an itemset may appear as a type $\beta$ candidate itemset more than once. Such a scenario occurs for the itemset BE in the example in Fig. 3.) By referring to the partition identified above as partition $P_i$, we have, for any $k$ in the interval $[i, j]$, $r_{i,k}(I) \ge s$, completing the proof of this lemma. □

Lemma 1 leads to Lemma 2 below.

Lemma 2. An itemset $I$ remains in CF after the processing of partition $P_j$ if and only if there exists an $i$ such that $r_{i,j}(I) \ge s$, where $s$ is the minimal support required.

Proof. It can be seen that the proof of the "only if" condition follows directly from Lemma 1. We now prove the "if" condition of this lemma. If there exists an $i$ such that $r_{i,j}(I) \ge s$, then we let $t$ be the largest $x$ such that $r_{i,x}(I) < s$. If such a $t$ does not exist, it follows from Lemma 1 that itemset $I$ will remain in CF after the processing of partition $P_j$. If such a $t$ exists, we have $r_{t+1,j}(I) \ge s$ since $r_{i,t}(I) < s$ and $r_{i,j}(I) \ge s$. It again follows from Lemma 1 that itemset $I$ will remain in CF after the processing of partition $P_j$. This lemma follows. □
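Lemma 2 can be sanity-checked by brute force on small inputs. The sketch below replays the cumulative filter for a single itemset and compares the outcome with the existence of a qualifying region ratio. It is only a numerical check under assumptions (equal-sized partitions, exact arithmetic via `Fraction`, hypothetical function names), not a substitute for the proof.

```python
from fractions import Fraction
from itertools import product


def in_cf_after(occ, sizes, s):
    """Replay the cumulative filter for one itemset; True if it ends in CF."""
    start, count = None, 0
    for k, (n, size) in enumerate(zip(occ, sizes)):
        if start is None:
            if n >= s * size:          # newly selected in this partition
                start, count = k, n
        else:
            count += n                 # carried over from an earlier partition
            if count < s * sum(sizes[start:k + 1]):
                start, count = None, 0
    return start is not None


def some_region_qualifies(occ, sizes, s):
    """True if r_{i,j}(I) >= s for some i, where j is the last partition."""
    j = len(occ)
    return any(sum(occ[i:j]) >= s * sum(sizes[i:j]) for i in range(j))


def lemma2_holds(num_partitions, size, s):
    """Exhaustively compare both sides over all occurrence patterns."""
    sizes = [size] * num_partitions
    return all(
        in_cf_after(occ, sizes, s) == some_region_qualifies(occ, sizes, s)
        for occ in product(range(size + 1), repeat=num_partitions)
    )


check_half = lemma2_holds(4, 3, Fraction(1, 2))
check_third = lemma2_holds(3, 3, Fraction(1, 3))
```

For instance, with occurrence counts (3, 0, 0, 3) over four partitions of three transactions and $s = 1/2$, the itemset leaves CF after the third partition and re-enters it in the fourth, matching the qualifying region $r_{4,4}(I) \ge s$.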

Lemma 2 leads to the following theorem which states the correctness of algorithm SWF.

Theorem 1. If an itemset I is a frequent itemset, then I will be in the progressive candidate set of itemsets produced by algorithm SWF.

Proof. Let $n$ be the number of partitions of the transaction database. Since the itemset $I$ is a frequent itemset, we have $r_{1,n}(I) \ge s$, which is in essence a special case of Lemma 2 for $i = 1$ and $j = n$, proving this theorem. □

It follows from Theorem 1 that the frequent itemsets generated by SWF are the same as those produced by existing association rule mining algorithms such as Apriori. Furthermore, we let $C^{i,j}$, where $i \le j$, be the set of progressive candidate itemsets generated by algorithm SWF with respect to database $db^{i,j}$ after the processing of $P_j$. We then have the following lemma.

Lemma 3. For $i \le k \le j$, $C^{k,j} \subseteq C^{i,j}$.

Proof. Assume that there exists an itemset $I \in C^{k,j}$. From the "only if" implication of Lemma 2, it follows that there exists an $h$ such that $r_{h,j}(I) \ge s$, where $k \le h \le j$. Since $i \le k \le j$, we have $i \le h \le j$. Then, according to the "if" implication of Lemma 2, itemset $I$ is also in $C^{i,j}$, i.e., $I \in C^{i,j}$. The fact that $C^{k,j} \subseteq C^{i,j}$ follows. □


Lemma 3 leads to the following theorem which states the effectiveness of SWF for incremental mining.

Theorem 2. If an itemset $I$ is a frequent itemset with respect to the database $db^{i+1,j+1}$, then itemset $I$ is either in $C^{i,j}$ or will be a type $\beta$ candidate itemset after the processing of partition $P_{j+1}$.

Proof. If an itemset $I$ is a frequent itemset with respect to the database $db^{i+1,j+1}$, we then have $r_{i+1,j+1}(I) \ge s$. Three cases for $r_{i+1,j+1}(I) \ge s$ are considered. The first case is $r_{i+1,j}(I) \ge s$ and $r_{j+1,j+1}(I) \ge s$, and the second one is $r_{i+1,j}(I) \ge s$ and $r_{j+1,j+1}(I) \le s$. From Theorem 1, it follows that in the above two cases, $I \in C^{i+1,j}$, which in turn implies that $I \in C^{i,j}$ since we have $C^{i+1,j} \subseteq C^{i,j}$ by Lemma 3.

Consider the third case where $r_{i+1,j}(I) < s$ and $r_{j+1,j+1}(I) \ge s$. If $r_{i+1,j}(I) < s$ and $I \notin C^{i+1,j}$, then itemset $I$ will be a type $\beta$ candidate itemset after the processing of partition $P_{j+1}$ since $r_{j+1,j+1}(I) \ge s$. On the other hand, if $r_{i+1,j}(I) < s$ but $I \in C^{i+1,j}$, we also get $I \in C^{i,j}$ from Lemma 3. This theorem is thus proved. □

Note that any itemset $I$ that is a frequent itemset with respect to $db^{i+1,j+1}$ and has appeared in $C^{i,j}$ will be identified as a type $\alpha$ candidate itemset after the processing of partition $P_{j+1}$. From Theorem 2 and this fact, it follows that the cumulative filters of algorithm SWF can be determined in a progressive manner without missing any possible frequent itemsets, even when an ongoing time-variant transaction database has to be mined.
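The case analysis in the proof of Theorem 2 relies on a step that is left implicit: the region ratio over $db^{i+1,j+1}$ is a size-weighted average of the ratios over $db^{i+1,j}$ and over $P_{j+1}$. Written out from the definition of the region ratio (a worked restatement, not an additional result):

```latex
r_{i+1,j+1}(I)
  = \frac{\sum_{k=i+1}^{j+1} N_{pk}(I)}{|db^{i+1,j+1}|}
  = \frac{|db^{i+1,j}| \, r_{i+1,j}(I) + |P_{j+1}| \, r_{j+1,j+1}(I)}
         {|db^{i+1,j}| + |P_{j+1}|}
```

Since a weighted average cannot exceed both of its terms, $r_{i+1,j+1}(I) \ge s$ implies that at least one of $r_{i+1,j}(I)$ and $r_{j+1,j+1}(I)$ is at least $s$, which is why the three cases listed in the proof cover every possibility.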

4. Experimental studies

To assess the performance of algorithm SWF, we performed several experiments on a computer with a CPU clock rate of 450 MHz and 512 MB of main memory. The transaction data resides in the NTFS file system and is stored on a 30 GB IDE 3.5-inch drive with a measured sequential throughput of 10 MB/s. The simulation program was coded in C++. The methods used to generate synthetic data are described in Section 4.1. The performance comparison of SWF, FUP2 and Apriori is presented in Section 4.2. Section 4.3 compares the I/O cost of SWF, FUP2H and Apriori. We examine CPU and memory overhead in Section 4.4. Results of scaleup experiments are presented in Section 4.5.

4.1. Generation of synthetic workload

To obtain reliable experimental results, the method we employed to generate synthetic transactions is similar to the ones used in prior works [17,31,33,34]. Explicitly, we generated several different transaction databases from a set of potentially frequent itemsets to evaluate the performance of SWF. These transactions mimic the transactions in the retailing environment. Note that the efficiency of algorithm SWF has also been evaluated on some real databases, such as Web log records and grocery sales data. However, we show the experimental results from synthetic transaction data so that the work relevant to data cleaning, which is application dependent and orthogonal to the incremental technique proposed, can be omitted for clarity. Further, more sensitivity analysis can then be conducted by using the synthetic transaction data. Each database consists of $|D|$ transactions and, on average, each transaction has $|T|$ items. Table 2 summarizes the meanings of various parameters used in the experiments. The mean of the correlation level is set to 0.25 for our experiments.

Recall that the sizes $|\Delta^+|$ and $|\Delta^-|$ are not required to be the same for the execution of SWF. Without loss of generality, we set $|d| = |\Delta^+| = |\Delta^-|$ for simplicity. Thus, by denoting the original database as $db^{1,n}$ and the new mining database as $db^{i,j}$, we have $|db^{i,j}| = |db^{1,n} - \Delta^- + \Delta^+| = |D|$, where $\Delta^- = db^{1,i-1}$ and $\Delta^+ = db^{n+1,j}$. In the following, we use the notation Tx-Iy-Dm-dn to represent a database in which $|D| = m$ thousand, $|d| = n$ thousand, $|T| = x$, and $|I| = y$.

We compare the relative performance of three methods, i.e., Apriori, FUP-based algorithms and SWF. As mentioned before, without any provision for incremental mining, Apriori has to re-run the association rule mining algorithm on the whole updated database. As reported in [31,32], by reducing the number of candidate itemsets, FUP-based algorithms outperform Apriori. As will be shown by our experimental results, with the sliding window technique that carries cumulative information selectively, the execution time of SWF is orders of magnitude smaller than those required by FUP-based algorithms. In order to conduct our experiments on a database of size $|db^{i,j}|$ with an increment of $\Delta^+$ and a removal of $\Delta^-$, a database $db^{1,j}$ is first generated, and then $db^{1,i-1}$, $db^{1,n}$, $db^{n+1,j}$, and $db^{i,j}$ are produced separately.
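As a small aid for reading the dataset names used in the experiments, the following sketch decodes a name of the form Tx-Iy-Dm-dn into its four parameters. The function name `parse_dataset_name` and the dictionary keys are hypothetical and chosen only for this illustration.

```python
import re


def parse_dataset_name(name):
    """Split a name such as 'T10-I4-D100-d10' into its four parameters."""
    m = re.fullmatch(r"T(\d+)-I(\d+)-D(\d+)-d(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    t, i, d, delta = map(int, m.groups())
    return {
        "avg_transaction_size": t,        # |T|
        "avg_pattern_size": i,            # |I|
        "num_transactions": d * 1000,     # |D|, given in thousands
        "num_incremental": delta * 1000,  # |d|, given in thousands
    }


demo_params = parse_dataset_name("T10-I4-D100-d10")
```

For T10-I4-D100-d10 this yields 100,000 transactions in the database, 10,000 incremental transactions, an average transaction size of 10 and an average maximal potentially frequent itemset size of 4.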

4.2. Experiment one: relative performance

We first conducted several experiments to evaluate the relative performance of Apriori, FUP2 and SWF. As shown in Fig. 4, the experimental results are consistent with one another for various values of $|L|$ and $N$ on dataset T10-I4-D100-d10. In the interest of space, we only report the results for $|L| = 2000$ and $N = 10000$ in the following experiments. Fig. 5 shows the relative execution times for the three algorithms as the minimum support threshold is decreased from 1% to 0.1%. When the support threshold is high, only a limited number of frequent itemsets are produced. However, as the support threshold decreases, the performance difference becomes prominent in that SWF significantly outperforms both FUP2 and Apriori. As shown in Fig. 5, SWF leads to prominent performance improvement for various sizes of $|T|$, $|I|$ and $|d|$. Explicitly, SWF is orders of magnitude faster than FUP2, and the margin grows as the minimum support threshold decreases. Note that in our experimental results, the difference between FUP2 and Apriori is consistent with that observed in [32]. In fact, SWF outperforms FUP2 and Apriori in both CPU and I/O costs, which are evaluated next.

4.3. Experiment two: evaluation of I/O cost

To evaluate the corresponding I/O cost, as in [24], we assume that each sequential read of a byte of data consumes one unit of I/O cost and each random read of a byte of data consumes two units of I/O cost. Fig. 6 shows the number of database scans and the I/O costs of Apriori, FUP2H, i.e., the hash-type FUP in [32], and SWF over data sets T10-I4-D100-d10 and T10-I4-D200-d20. As shown in Fig. 6, SWF outperforms Apriori and FUP2H, where, without loss of generality, a hash table of 250,000 entries is employed for those methods. Note that the large number of database scans is the performance bottleneck when the database does not fit into main memory. In view of that, SWF is advantageous since only one scan of the updated database is required, which is independent of the variance in minimum supports.
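The cost model stated above is simple enough to write down directly. The sketch below encodes it with two hypothetical helper functions; the byte counts used in any call are illustrative inputs, not measurements from the experiments.

```python
def io_cost(sequential_bytes, random_bytes=0):
    """Cost units under the stated model: 1 per sequential byte, 2 per random byte."""
    return sequential_bytes + 2 * random_bytes


def scan_cost(database_bytes, num_scans):
    """Cost of reading the whole database sequentially num_scans times."""
    return num_scans * io_cost(database_bytes)
```

Under this model the cost of a multi-scan method grows linearly with its number of passes, so a method that needs a single sequential scan of the updated database pays `scan_cost(size, 1)` regardless of the minimum support.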

4.4. Experiment three: reduction of CPU and memory overhead

As explained before, SWF substantially reduces the number of candidate itemsets generated. The effect is particularly important for the candidate 2-itemsets. The experimental results in Fig. 7 show the candidate itemsets generated by Apriori, FUP2H, and SWF across the whole processing on the datasets T10-I4-D100-d10 and T10-I4-D200-d20 with minimum support threshold $s = 0.1\%$. As shown in Fig. 7, SWF leads to a 99% candidate reduction rate in $C_2$ when compared to Apriori, and to a 93% candidate reduction rate in $C_2$ when compared to FUP2H. Similar phenomena were observed when other datasets were used. This feature of SWF enables it to efficiently reduce the CPU and memory overhead. Note that the number of candidate 2-itemsets produced by SWF approaches its theoretical minimum, i.e., the number of frequent 2-itemsets. Recall that the $C_3$ in either Apriori or FUP2H has to be obtained from $L_2$ due to the large size of their $C_2$. As shown in Fig. 7, the value of $|C_k|$ ($k \ge 3$) for SWF is only slightly larger than that of Apriori or FUP2H, even though SWF only employs $C_2$ to generate the sets $C_k$, thus fully exploiting the benefit of scan reduction.

Table 2
Meanings of various parameters

Parameter       Meaning
$|D|$           Number of transactions in the database
$|\Delta^+|$    Number of added transactions
$|\Delta^-|$    Number of deleted transactions
$|d|$           Number of incremental transactions
$|T|$           Average size of the transactions
$|I|$           Average size of the maximal potentially frequent itemsets
$|L|$           Number of maximal potentially frequent itemsets
$N$             Number of items

4.5. Experiment four: scaleup performance

In this experiment, we examine the scaleup performance of algorithm SWF for different selected datasets. Fig. 8 shows the scaleup performance of algorithm SWF as the values of $|D|$ and $|d|$ increase. Three different minimum supports are considered. We obtained the results for the dataset T10-I4-Dm-d10 when the number of customers increases from 100,000 to one million. The execution times are normalized with respect to the times for the 100,000-transaction dataset in Fig. 8a. The second scaleup experiment, with the dataset T10-I4-D1000-dn, shows the performance of SWF when the number of transactions in the increased dataset varies from 50 thousand to 300 thousand. The execution times are normalized with respect to the times for the 50,000-transaction increment in Fig. 8b. Note that, as shown in Fig. 8b, the execution time increases only slightly with the growth of the incremental size, showing good scalability of SWF.

[Fig. 4 plots: execution time (sec) versus minimum support (%) for Apriori, FUP2 and SWF on T10-I4-D100-d10; panels (N20-L2), (N10-L4), (N10-L2) and (N20-L4).]


To further understand the impact of $|D|$ and $|d|$ on the relative performance of SWF and FUP-based algorithms, we conduct the scaleup experiments for both SWF and FUP2 with two minimum support thresholds, 0.2% and 0.4%. The results are shown in Fig. 9, where the value on the y-axis corresponds to the ratio of the execution time of SWF to that of FUP2. Fig. 9a shows the execution-time ratio obtained from an updated database over datasets of the form T10-I4-Dm-d(m/10). With the value $|D|/|d| = 10$, the execution-time ratio of SWF to FUP2 decreases as the size $|D|$ of the updated database grows, meaning that the advantage of SWF over FUP2 increases as the database size increases. Fig. 9b shows the execution-time ratio for different values of $|d|$. It can be seen that, since the size $|d|$ has less influence on the performance of SWF, the execution-time ratio becomes smaller with the growth of the incremental transaction number $|d|$. This also implies that the advantage of SWF over FUP2 becomes even more prominent as the incremental portion increases.

[Fig. 5 plots: execution time (sec) versus minimum support (%) for Apriori, FUP2 and SWF; panels T10-I4-D100-d20, T10-I4-D200-d20, T20-I4-D100-d10, T10-I6-D100-d10, T20-I6-D100-d10 and T10-I4-D100-d10.]

5. Conclusion

We explored in this paper an efficient sliding-window filtering algorithm for incremental mining of association rules. Under SWF, the cumulative information of mining previous partitions is selectively carried over toward the generation of candidate itemsets for the subsequent partitions. Algorithm SWF not only significantly reduces I/O and CPU cost by the concepts of cumulative filtering and scan reduction techniques but also effectively controls memory utilization by the technique of sliding-window partition. More importantly, SWF is particularly powerful for efficient incremental mining for an ongoing time-variant transaction database. The correctness of SWF is proved and some of its theoretical properties are derived. Extensive simulations have been performed to evaluate the performance of algorithm SWF. Sensitivity analysis of various parameters was conducted to provide many insights into SWF. It was noted that the improvement achieved by SWF increases as the increased portion of the dataset increases and also as the size of the database increases.

[Fig. 6 plots: I/O cost (M) versus minimum support (%) for Apriori, FUP2H and SWF; panels (a) T10-I4-D100-d10 and (b) T10-I4-D200-d20.]

Fig. 6. I/O cost performance: (a) I/O cost performance over data set T10-I4-D100-d10; (b) I/O cost performance over data set T10-I4-D200-d20.

(a) T10-I4-D100-d10

Candidates   Apriori    FUP2H     SWF     Freq. itemsets
C2           3399528    104145    7482    L2 = 6656
C3           8353       8353      9241    L3 = 8135
C4           7882       7882      8679    L4 = 7616
C5           6762       6382      7162    L5 = 6077
C6           5437       4709      5578    L6 = 4658
C7           3918       3417      3951    L7 = 3412

(b) T10-I4-D200-d20

Candidates   Apriori    FUP2H     SWF     Freq. itemsets
C2           3430890    105848    7632    L2 = 6641
C3           8332       8332      9468    L3 = 8021
C4           7752       7752      8996    L4 = 7507
C5           6406       6406      7510    L5 = 5819
C6           4643       4643      5852    L6 = 4622
C7           3421       3421      4100    L7 = 3416

Fig. 7. Reduction on candidate itemsets when datasets (a) T10-I4-D100-d10 and (b) T10-I4-D200-d20 were used.
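The $C_2$ reduction rates quoted in Section 4.4 can be recomputed from the $C_2$ row of the first Fig. 7 table (T10-I4-D100-d10). The short sketch below does this arithmetic; the helper name `reduction_rate` is an assumption, and the three counts are copied from that row.

```python
def reduction_rate(baseline, reduced):
    """Fraction of the baseline candidates that the reduced set eliminates."""
    return 1 - reduced / baseline


# C2 sizes for T10-I4-D100-d10, taken from the Fig. 7 data
c2 = {"Apriori": 3399528, "FUP2H": 104145, "SWF": 7482}
vs_apriori = reduction_rate(c2["Apriori"], c2["SWF"])
vs_fup2h = reduction_rate(c2["FUP2H"], c2["SWF"])
print(f"{vs_apriori:.1%} versus Apriori, {vs_fup2h:.1%} versus FUP2H")
```

Both values are in line with the roughly 99% and 93% reduction rates reported in the text.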

[Fig. 8 plots: relative time for minimum supports 0.2%, 0.4% and 0.8%; panel (a) T10-I4-Dm-d10 versus $|D|$, updated transaction number (K); panel (b) T10-I4-D1000-dn versus $|d|$, incremental transaction number (K).]

Fig. 8. Scaleup performance of SWF: (a) various values of $|D|$; (b) various values of $|d|$.

[Fig. 9 plots: execution time ratio (SWF/FUP2) for minimum supports 0.2% and 0.4%; panel (a) T10-I4-Dm-d(m/10) versus $|D|$, updated transaction number (K); panel (b) T10-I4-D1000-dn versus $|d|$, incremental transaction number (K).]

Fig. 9. Scaleup performance with the execution time ratio between SWF and FUP: (a) execution time ratio for various values of $|D|$; (b) execution time ratio for various values of $|d|$.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD, May 1993, pp. 207–216.

[2] R. Agrawal, R. Srikant, Mining sequential patterns, Proceedings of the 11th International Conference on Data Engineering, March 1995, pp. 3–14.

[3] J.M. Ale, G. Rossi, An approach to discovering temporal association rules, ACM Symposium on Applied Computing, 2000.

[4] M.-S. Chen, J. Han, P.S. Yu, Data mining: an overview from database perspective, IEEE Trans. Knowledge Data Eng. 8 (6) (1996) 866–883.

[5] M.-S. Chen, J.-S. Park, P.S. Yu, Efficient data mining for path traversal patterns, IEEE Trans. Knowledge Data Eng. 10 (2) (1998) 209–221.

[6] X. Chen, I. Petr, Discovering temporal association rules: algorithms, language and system, Proceedings of 2000 International Conference on Data Engineering, 2000.

[7] J. Han, G. Dong, Y. Yin, Efficient mining of partial periodic patterns in time series database, Proceedings of the 15th International Conference on Data Engineering, March 1999, pp. 106–115.

[8] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, Proceedings of the 18th International Conference on Very Large Data Bases, September 1994, pp. 144–155.

[9] K. Wang, S.Q. Zhou, S.C. Liew, Building hierarchical classifiers using class proximity, Proceedings of 1999 International Conference on Very Large Data Bases, 1999, pp. 363–374.

[10] C. Yang, U. Fayyad, P. Bradley, Efficient discovery of error-tolerant frequent itemsets in high dimensions, The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.

[11] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases, September 1994, pp. 478–499.

[12] E. Cohen, et al., Finding interesting associations without support pruning, IEEE Trans. Knowledge Data Eng. (2001) 64–78.

[13] J. Han, Y. Fu, Discovery of multiple-level association rules from large databases, Proceedings of the 21st International Conference on Very Large Data Bases, September 1995, pp. 420–431.

[14] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, Proceedings of 2000 ACM-SIGMOD International Conference on Management of Data, May 2000, pp. 486–493.

[15] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, Proceedings of 1999 International Conference on Knowledge Discovery and Data Mining, August 1999.

[16] H. Mannila, H. Toivonen, A. Inkeri Verkamo, Efficient algorithms for discovering association rules, Proceedings of AAAI Workshop on Knowledge Discovery in Databases, July 1994, pp. 181–192.

[17] J.-S. Park, M.-S. Chen, P.S. Yu, Using a hash-based method with transaction trimming for mining association rules, IEEE Trans. Knowledge Data Eng. 9 (5) (1997) 813–825.

[18] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large databases, Proceedings of the 21st International Conference on Very Large Data Bases, September 1995, pp. 432–444.

[19] K. Wang, Y. He, J. Han, Mining frequent itemsets using support constraints, Proceedings of 2000 International Conference on Very Large Data Bases, September 2000.

[20] J.-L. Lin, M.H. Dunham, Mining association rules: anti-skew algorithms, Proceedings of 1998 International Conference on Data Engineering, 1998, pp. 486–493.

[21] R. Agarwal, C. Aggarwal, V.V.V. Prasad, A tree projection algorithm for generation of frequent itemsets, J. Parallel Distributed Comput. (special issue on High Performance Data Mining) (2000).

[22] J. Han, J. Pei, Mining frequent patterns by pattern-growth: methodology and implications, ACM SIGKDD Explorations (special issue on Scalable Data Mining Algorithms), December 2000.

[23] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, August 2000, pp. 355–359.

[24] J. Pei, J. Han, L.V.S. Lakshmanan, Mining frequent itemsets with convertible constraints, Proceedings of 2001 International Conference on Data Engineering, 2001.

[25] J. Han, L.V.S. Lakshmanan, R.T. Ng, Constraint-based, multidimensional data mining, Computer (special issue on Data Mining) (1999) 46–50.

[26] D. Kifer, C. Bucila, J. Gehrke, W. White, DualMiner: a dual-pruning algorithm for itemsets with constraints, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[27] L.V.S. Lakshmanan, R. Ng, J. Han, A. Pang, Optimization of constrained frequent set queries with 2-variable constraints, Proceedings of 1999 ACM-SIGMOD Conference on Management of Data, June 1999, pp. 157–168.

[28] J. Pei, J. Han, Can we push more constraints into frequent pattern mining? Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, August 2000.

[29] A.K.H. Tung, J. Han, L.V.S. Lakshmanan, R.T. Ng, Constraint-based clustering in large databases, Proceedings of 2001 International Conference on Database Theory, January 2001.

[30] J. Hipp, U. Güntzer, G. Nakhaeizadeh, Algorithms for association rule mining - a general survey and comparison, SIGKDD Explorations 2 (1) (2000) 58–64.

[31] D. Cheung, J. Han, V. Ng, C.Y. Wong, Maintenance of discovered association rules in large databases: an incremental updating technique, Proceedings of 1996 International Conference on Data Engineering, February 1996, pp. 106–114.

[32] D. Cheung, S.D. Lee, B. Kao, A general incremental technique for updating discovered association rules, Proceedings of the International Conference on Database Systems for Advanced Applications, April 1997.

[33] S. Thomas, S. Bodagala, K. Alsabti, S. Ranka, An efficient algorithm for the incremental updation of association rules in large databases, Proceedings of 1997 International Conference on Knowledge Discovery and Data Mining, 1997.

[34] N.F. Ayan, A.U. Tansel, E. Arkun, An efficient algorithm to update large itemsets with early pruning, Proceedings of 1999 International Conference on Knowledge Discovery and Data Mining, 1999.

[35] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, ACM SIGMOD Rec. 26 (2) (1997) 255–264.

[36] R. Srikant, R. Agrawal, Mining generalized association rules, Proceedings of the 21st International Conference on Very Large Data Bases, September 1995, pp. 407–419.

[37] H. Toivonen, Sampling large databases for association rules, Proceedings of the 22nd VLDB Conference, September 1996, pp. 134–145.

[38] A. Mueller, Fast sequential and parallel algorithms for association rule mining: a comparison, Technical Report CS-TR-3515, Department of Computer Science, University of Maryland, College Park, MD, 1995.


Fig. 1. Incremental mining for an ongoing time-variant transaction database.
Fig. 2. An illustrative transaction database.
Fig. 3. Large itemsets generation for the incremental mining with SWF.