Incremental update on sequential patterns in large databases by implicit merging and efficient counting


Information Systems 29 (2004) 385–404


Ming-Yen Lin, Suh-Yin Lee*

Department of Computer Science and Information Engineering, National Chiao Tung University, Taiwan 30050, China

Received 6 February 2002; received in revised form 20 December 2002; accepted 2 April 2003

Abstract

Current approaches for sequential pattern mining usually assume that the mining is performed in a static sequence database. However, databases are not static due to update so that the discovered patterns might become invalid and new patterns could be created. In addition to higher complexity, the maintenance of sequential patterns is more challenging than that of association rules owing to sequence merging. Sequence merging, which is unique in sequence databases, requires the appended new sequences to be merged with the existing ones if their customer ids are the same. Re-mining of the whole database appears to be inevitable since the information collected in previous discovery will be corrupted by sequence merging. Instead of re-mining, the proposed IncSP (Incremental Sequential Pattern Update) algorithm solves the maintenance problem through effective implicit merging and efﬁcient separate counting over appended sequences. Patterns found previously are incrementally updated rather than re-mined from scratch. Moreover, the technique of early candidate pruning further speeds up the discovery of new patterns. Empirical evaluation using comprehensive synthetic data shows that IncSP is fast and scalable.

Keywords: Data mining; Sequential patterns; Incremental update; Sequence discovery; Sequence merging

1. Introduction

Sequential pattern discovery, which finds frequent temporal patterns in databases, is an important issue in data mining originated from retailing databases with broad applications [1–8]. The discovery problem is difficult considering the numerous combinations of potential sequences, not to mention the re-mining required when databases are updated or changed. Therefore, it is essential to investigate efficient algorithms for sequential pattern mining and effective approaches for sequential pattern updating.

A sequential pattern is a relatively frequent sequence of transactions, where each transaction is a set of items (called itemset). For example, one might purchase a PC and then purchase a printer later. After some time, he or she could possibly buy some printing software and a scanner. If there exists a sufficient number of customers in the transactional database who have the purchasing sequence of PC, printer, printing software and scanner, then such a frequent sequence is a sequential pattern. In general, each customer record in the transactional database is an itemset associated with the transaction time and a customer-id [1]. Records having the same customer-id are sorted by ascending transaction time into a data sequence before mining. The objective of the discovery is to find out all sequential patterns from these data sequences.

Recommended by Prof. Nick Koudas. *Corresponding author.

E-mail address: [email protected] (S.-Y. Lee).

A sequential pattern is a sequence having support greater than or equal to a minimum threshold, called the minimum support. The support of a sequence is the percentage of data sequences containing the sequence. Note that the support calculation is different in the mining of association rules [9–12] and sequential patterns [1,4,7,13]. The former is transaction-based, while the latter is sequence-based. Suppose that a customer has two transactions buying the same item. In association discovery, the customer "contributes" to the support count of that item by two, whereas it counts only once in the support counting in sequential pattern mining.

The discovery of sequential patterns is more difficult than association discovery because the patterns are formed not only by combinations of items but also by permutations of itemsets. For example, given 50 possible items in a sequence database, the number of potential patterns is $50 \times 50 + C(50,2)$ regarding two items, and $50 \times 50 \times 50 + 50 \times C(50,2) \times 2 + C(50,3)$ regarding three items (formed by 1-1-1, 1-2, 2-1, and 3), etc. Most current approaches assume that the sequence database is static and focus on speeding up the time-consuming mining process. In practice, databases are not static and are usually appended with new data sequences, contributed by either existing or new customers. The appending might invalidate some existing patterns whose supports become insufficient with respect to the currently updated database, or might create some new patterns due to the increased supports. Hence, we need an effective approach for keeping patterns up to date.
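The counts quoted above for 50 items can be checked with a short calculation. The following is only an illustrative sketch; the function name `potential_patterns` is ours and does not appear in the paper.

```python
from math import comb


def potential_patterns(n_items: int) -> tuple[int, int]:
    """Number of potential patterns built from two and from three items."""
    # Two items: two 1-itemsets in order (n * n) or one 2-itemset (C(n, 2)).
    two = n_items * n_items + comb(n_items, 2)
    # Three items: shapes 1-1-1, then 1-2 and 2-1, then a single 3-itemset.
    three = n_items ** 3 + n_items * comb(n_items, 2) * 2 + comb(n_items, 3)
    return two, three


print(potential_patterns(50))  # (3725, 267100)
```

Even for two and three items the candidate space is already in the thousands and hundreds of thousands, which is why re-mining after every update is costly.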

However, not much work has been done on the maintenance of sequential patterns in large databases. Many algorithms deal with the mining of association rules [9,10,12], the mining of sequential patterns [1,3,7,8,14,15], and parallel mining of sequential patterns [6]. Some algorithms discover frequent episodes in a single long sequence [16]. Nevertheless, when there are changes in the database, all these approaches have to re-mine the whole updated database. The re-mining demands more time than the previous mining process since the appending increases the size of the database.

Although there are some incremental techniques for updating association rules [11,17], little research has been done on the updating of sequential patterns, which is quite different. Association discovery is transaction-based; thus, none of the new transactions appended is related to the old transactions in the original database. Sequential pattern mining is sequence-based; thus, the two data sequences, one in the newly appended database and the other in the original database, must be merged into a data sequence if their customer-ids are the same. However, the sequence merging will corrupt previous support count information, so that neither the FUP nor the FUP2 [17] algorithm can be directly extended for the maintenance of sequential patterns.

One work dealing with incremental sequence mining for vertical databases is the Incremental Sequence Mining (ISM) algorithm [5]. Sequence databases of vertical layout comprise a list of (cid, timestamp) pairs for each of all the items. In order to update the supports and enumerate frequent sequences, ISM maintains "maximally frequent sequences" and "minimally infrequent sequences" (called negative border). However, the problem with ISM is that the size of the negative border (i.e. the number of potentially frequent sequences) might be too large to be processed in memory. Besides, the size of extra space for transformed vertical lists might be several times the size of the original sequence database.

This paper presents an efficient incremental updating algorithm for up-to-date maintenance of sequential patterns after a nontrivial number of data sequences are appended to the sequence database. We assume that the minimum support remains the same. Instead of re-mining the whole database for pattern discovery, the proposed algorithm utilizes the knowledge of previously computed frequent sequences. We merge data sequences implicitly, generate fewer but more promising candidates, and separately count supports with respect to the original database and the newly appended database. The supports of old patterns are updated by merging new data sequences implicitly into the original database. Since the data sequences of old customers are processed already, efficient counting over the data sequences of new customers further optimizes the pattern updating process.

The rest of the paper is organized as follows. Section 2 describes the problem of sequential pattern mining and addresses the issue of incremental update. In Section 3, we review some previous algorithms of sequence mining. Section 4 presents our proposed approach for the updating of sequential patterns after databases are changed. Comparative results of the experiments on comprehensive synthetic data sets are depicted in Section 5. Section 6 concludes this paper.

2. Problem formulation

In Section 2.1, we formally describe the problem of sequential pattern mining and the terminology used in this paper. The issue of incremental update is presented in Section 2.2. Section 2.3 demonstrates the changes of sequential patterns due to database update.

2.1. Sequential pattern mining

A sequence s, denoted by $\langle e_1 e_2 \ldots e_n \rangle$, is an ordered set of n elements where each element $e_i$ is an itemset. An itemset, denoted by $(x_1, x_2, \ldots, x_q)$, is a nonempty set of q items, where each item $x_j$ is represented by a literal. Without loss of generality, we assume the items in an element are in lexicographic order. The size of sequence s, written as |s|, is the total number of items in all the elements in s. Sequence s is a k-sequence if |s| = k. For example, <(8)(2)(1)>, <(1,2)(1)>, and <(3)(5,9)> are all 3-sequences. A sequence $s = \langle e_1 e_2 \ldots e_n \rangle$ is a subsequence of another sequence $s' = \langle e'_1 e'_2 \ldots e'_m \rangle$ if there exist $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $e_1 \subseteq e'_{i_1}$, $e_2 \subseteq e'_{i_2}$, $\ldots$, and $e_n \subseteq e'_{i_n}$. Sequence s' contains sequence s if s is a subsequence of s'. For example, <(2)(1,5)> is a subsequence of <(2,4)(3)(1,3,5)>.
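The containment test defined above can be sketched in a few lines. This is an illustrative implementation, not code from the paper: sequences are assumed to be lists of tuples of items, and the helper name `contains` is hypothetical. Matching each pattern element to the earliest possible element of the data sequence is sufficient to decide containment.

```python
def contains(data_seq, pattern):
    """Return True if `pattern` is a subsequence of `data_seq`.

    Both arguments are lists of itemsets (tuples of items). Each pattern
    element must be a subset of a distinct data element, in order.
    """
    pos = 0
    for element in pattern:
        needed = set(element)
        # Advance to the earliest data element that covers this pattern element.
        while pos < len(data_seq) and not needed <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match a later data element
    return True


# The example from the text: <(2)(1,5)> is contained in <(2,4)(3)(1,3,5)>.
print(contains([(2, 4), (3,), (1, 3, 5)], [(2,), (1, 5)]))  # True
```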

Each sequence in the sequence database DB is referred to as a data sequence. Each data sequence is associated with a customer-id (abbreviated as cid). The number of data sequences in DB is denoted by |DB|. The support of sequence s, denoted by s.sup, is the number of data sequences containing s divided by the total number of data sequences in DB. The minsup is the user-specified minimum support threshold. A sequence s is a frequent sequence, also called a sequential pattern, if s.sup ≥ minsup. Given the minsup and the sequence database DB, the problem of sequential pattern mining is to discover the set of all sequential patterns, denoted by $S^{DB}$.

2.2. Incremental update of sequential patterns

In practice, the sequence database will be updated with new transactions after the pattern mining process. Possible updating includes transaction appending, deletions, and modifications. With respect to the same minsup, the incremental update problem aims to find out the new set of all sequential patterns after database updating without re-mining the whole database. First, we describe the issue of incremental updating by taking the transaction appending as an illustrating example. Transaction modification can be accomplished by transaction deletion and appending.

The original database DB is appended with a few data sequences after some time. The increment database db is referred to as the set of these newly appended data sequences. The cids of the data sequences in db may already exist in DB. The whole database combining all the data sequences from the original database DB and the increment database db is referred to as the updated database UD. Let the support count of a sequence s in DB be $s^{DB}_{count}$. A sequence s is a frequent sequence in UD if $s^{UD}_{count} \ge minsup \times |UD|$, where $s^{UD}_{count}$ is the support count of s in UD. Although UD is the union of DB and db, |UD| is not necessarily equal to |DB| plus |db|. If there are |old| cids appearing both in DB and db, then the number of 'new' customers is |new| = |db| − |old|. Thus |UD| = |DB| + |db| − |old| due to sequence merging. When all cids in db are different from those in DB, |old| (the number of 'old' customers) is zero. On the contrary, |old| equals |db| in case all cids in db exist in DB. Let $s^{db}_{count}$ be the increase in support count of sequence s due to db. Whether sequence s in UD is frequent or not depends on $s^{UD}_{count}$, with respect to the same minsup and |UD|.
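The relation |UD| = |DB| + |db| − |old| can be illustrated with a small sketch. The function name `updated_database_size` is ours; the customer ids are those of the example used in Section 2.3 (nine customers in the original database, seven data sequences in the increment database).

```python
def updated_database_size(cids_DB, cids_db):
    """Return (|old|, |new|, |UD|) given the customer ids of DB and db."""
    cids_DB, cids_db = set(cids_DB), set(cids_db)
    old = cids_db & cids_DB   # customers present in both DB and db
    new = cids_db - cids_DB   # customers appearing only in db
    return len(old), len(new), len(cids_DB) + len(cids_db) - len(old)


cids_DB = [f"C{i}" for i in range(1, 10)]            # C1 .. C9
cids_db = ["C2", "C4", "C5", "C8", "C10", "C11", "C12"]
print(updated_database_size(cids_DB, cids_db))       # (4, 3, 12)
```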

Most approaches re-execute mining algorithms over all data sequences in UD to obtain $s^{UD}_{count}$ and discover $S^{UD}$, as shown in Fig. 1a. However, we can effectively calculate $s^{UD}_{count}$ utilizing the support count of each sequential pattern s in $S^{DB}$. Fig. 1b shows that we discover $S^{UD}$ through incremental update on $S^{DB}$ after implicit merging.

Table 1 summarizes the notations used in this paper.

2.3. Changes of sequential patterns due to database update

Consider an example database DB with 6 data sequences as shown in Fig. 2. Assume that minsup = 33%, i.e., the minimum support count is 2. The sequential patterns in DB are <(1)>, <(2)>, <(3)>, <(4)>, <(1,2)>, <(1)(4)>, <(2)(2)>, and <(3)(1)>. Note that <(6)>, though it appears twice in the same data sequence C6, is not frequent because its support count is one.
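The support counts behind this example can be reproduced with a minimal sketch. The six data sequences below are transcribed from rows C1 to C6 of Fig. 3b, which we assume coincide with the original database of Fig. 2; the helper names `contains` and `support_count` are illustrative, not part of the paper.

```python
def contains(data_seq, pattern):
    """True if `pattern` (list of itemsets) is a subsequence of `data_seq`."""
    pos = 0
    for element in pattern:
        while pos < len(data_seq) and not set(element) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True


def support_count(database, pattern):
    """Number of data sequences containing `pattern` (each customer counts once)."""
    return sum(contains(seq, pattern) for seq in database.values())


DB = {
    "C1": [(1,), (4,)],
    "C2": [(2,), (3, 5), (1, 2)],
    "C3": [(1, 2), (2, 4)],
    "C4": [(4,), (3,), (1,)],
    "C5": [(1,)],
    "C6": [(6,), (2, 6, 7)],
}
minsup = 0.33
patterns = [((1,),), ((2,),), ((3,),), ((4,),), ((1, 2),),
            ((1,), (4,)), ((2,), (2,)), ((3,), (1,)), ((6,),)]
frequent = [p for p in patterns if support_count(DB, p) >= minsup * len(DB)]
print(len(frequent), support_count(DB, ((6,),)))  # 8 1
```

Item 6 occurs in two transactions of C6 but in only one data sequence, so its sequence-based support count is one.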

[Fig. 1. (a) Obtain $S^{UD}$ by re-executing the mining algorithm on UD. (b) Obtain $S^{UD}$ by incremental updating with $S^{DB}$.]

Fig. 3a shows the data sequences in the increment database db after some updates from new customers only. The updated database UD is shown in Fig. 3b. With nine data sequences and the same minsup, the support count of a frequent sequence must be three or larger. The support counts of the previous sequential patterns <(3)>, <(1)(4)>, and <(3)(1)> are less than three, so these patterns are no longer frequent after the database updates. Meanwhile, <(5)>, <(2)(5)>, and <(2,4)> become new patterns because they now meet the minimum support.

In the case of updates where the new sequences are from old customers, i.e., the cids of the new sequences appear in the original database, these data sequences must be appended to the old data sequences of the same customers in DB. Assume that two customers, cid=C4 and cid=C8, bought item '8' afterward. The data sequences for cid=C4 and cid=C8 now become <(4)(3)(1)(8)> and <(2,4)(5)(8)>, respectively. Fig. 4 shows the example of an increment database having data sequences from both old and new customers. In this example, |old| = 4, |new| = 3, and |db| = 7, where records in shadow are old customers.
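The merging just described can be sketched as an explicit append of each increment sequence to the sequence with the same cid. This is only an illustration of what sequence merging produces; the function name `merge` and the dictionary representation are our assumptions.

```python
def merge(DB, db):
    """Explicitly merge increment sequences into a copy of the original database."""
    UD = {cid: list(seq) for cid, seq in DB.items()}
    for cid, seq in db.items():
        # Old customers: append to the existing sequence; new customers: start one.
        UD.setdefault(cid, []).extend(seq)
    return UD


DB = {
    "C1": [(1,), (4,)], "C2": [(2,), (3, 5), (1, 2)], "C3": [(1, 2), (2, 4)],
    "C4": [(4,), (3,), (1,)], "C5": [(1,)], "C6": [(6,), (2, 6, 7)],
    "C7": [(2, 4)], "C8": [(2, 4), (5,)], "C9": [(1, 2), (5,), (2, 6)],
}
db = {
    "C2": [(4,)], "C4": [(8,)], "C5": [(1, 4)], "C8": [(8,)],
    "C10": [(2, 4, 6, 8)], "C11": [(1,), (7,)], "C12": [(2, 6), (7,)],
}
UD = merge(DB, db)
print(len(UD), UD["C4"], UD["C8"])
```

The result has twelve data sequences, and C4 and C8 end with the new transaction (8).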

Table 1. Notations used

$x_1, x_2, \ldots, x_q$    Items
$(x_1, x_2, \ldots, x_q)$    A q-itemset, each $x_i$ is an item
$s = \langle e_1 e_2 \ldots e_n \rangle$    A sequence with n elements
$e_1, e_2, \ldots, e_n$    Elements (of a sequence). Each $e_i$ is an itemset
minsup    The minimum support specified by the user
UD    The updated database
DB    The original database
db    The increment database
|UD|, |DB|, |db|    The total number of data sequences in UD, DB, and db, respectively
|old|    The total number of data sequences of 'old' customers in db
|new|    The total number of data sequences of 'new' customers in db
$S^{DB}$, $S^{UD}$    The set of all sequential patterns in DB and UD, respectively
$s^{DB}_{count}$, $s^{UD}_{count}$    The support counts of candidate sequence s in DB and UD, respectively
$s^{db}_{count}$    The increase in support count of candidate sequence s due to db
$S_k$    The set of all frequent k-sequences, see Section 3.1
$X_k$    The set of all candidate k-sequences, see Section 3.1
$X_k'$    The reduced set of candidate k-sequences, see Section 4
$S^{DB}_k$    The set of frequent k-sequences in DB, see Section 4.2
$X_k(DB)$    The set of candidates in $X_k$ that are also in $S^{DB}_k$, see Section 4
$X_k'(DB)$    $X_k'(DB) = X_k - X_k(DB)$, see Section 4
$ds^{UD}$, $ds^{DB}$, $ds^{db}$    A data sequence in UD, DB, and db, respectively, see Section 4.1
$ds^{DB},ds^{db}$    An implicitly merged data sequence, see Section 4.1
$UD^{DB}$    Data sequences in UD whose cids appear in DB only, see Appendix
$UD^{db}$    Data sequences in UD whose cids appear in db only, see Appendix
$UD^{Dd}$    Data sequences in UD whose cids are in both DB and db, see Appendix

Fig. 2. The original database DB example, |DB| = 6.

(a)
Cid    Data sequence ($ds^{db}$)
C7     <(2,4)>
C8     <(2,4)(5)>
C9     <(1,2)(5)(2,6)>

(b)
Cid    Data sequence ($ds^{UD}$)
C1     <(1)(4)>
C2     <(2)(3,5)(1,2)>
C3     <(1,2)(2,4)>
C4     <(4)(3)(1)>
C5     <(1)>
C6     <(6)(2,6,7)>
C7     <(2,4)>
C8     <(2,4)(5)>
C9     <(1,2)(5)(2,6)>

Fig. 3. Data sequences in the increment database and the updated database: (a) db with new customers only; (b) the updated database UD.

Fig. 5 presents the resulting data sequences in UD. After invalidating the patterns <(5)>, <(2)(2)>, <(2)(5)>, and <(1,2)>, the up-to-date sequential patterns are <(1)>, <(2)>, <(4)>, <(6)>, <(2,4)>, <(2,6)> and <(1)(4)>, for the given minsup of 33%.

3. Related work

In Section 3.1, we review some algorithms for discovering sequential patterns. Section 3.2 presents related approaches for incremental pattern updating.

3.1. Algorithms for discovering sequential patterns

The Apriori algorithm discovers association rules [9], while the AprioriAll algorithm deals with the problem of sequential pattern mining [1]. AprioriAll splits sequential pattern mining into three phases: the itemset phase, the transformation phase, and the sequence phase. The itemset phase uses Apriori to find all frequent itemsets. In the transformation phase, the database is transformed, with each transaction being replaced by the set of all frequent itemsets contained in the transaction. In the third phase, AprioriAll makes multiple passes over the database to generate candidates and to count the supports of candidates. In subsequent work, the same authors proposed the Generalized Sequential Pattern (GSP) algorithm that outperforms AprioriAll [7]. Both algorithms use similar techniques for candidate generation and support counting, as described in the following.

The GSP algorithm makes multiple passes over the database and finds the frequent k-sequences in the kth database scan. In each pass, every data sequence is examined to update the support counts of the candidates contained in this sequence. Initially, each item is a candidate 1-sequence for the first pass. Frequent 1-sequences are determined after checking all the data sequences in the database. In succeeding passes, frequent (k−1)-sequences are self-joined to generate candidate k-sequences. Again, the supports of these candidate sequences are counted by examining all data sequences, and then those candidates having minimum supports become frequent sequences. This process terminates when there are no more candidate sequences. In the following, we further depict two essential sub-processes in GSP, the candidate generation and the support counting.

Candidate generation: Let $S_k$ denote the set of all frequent k-sequences, and $X_k$ denote the set of all candidate k-sequences. GSP generates $X_k$ in two steps. The first step joins $S_{k-1}$ with $S_{k-1}$ and obtains a superset of the final $X_k$. Those candidates in the superset having any (k−1)-subsequence which is not in $S_{k-1}$ are deleted in the second step. In the first step, a (k−1)-sequence $s_1 = \langle e_1 e_2 \ldots e_{n-1} e_n \rangle$ is joined with another (k−1)-sequence $s_2 = \langle e'_1 e'_2 \ldots e'_n \rangle$ if $s_1^* = s_2^*$, where $s_1^*$ is the (k−2)-sequence of $s_1$ dropping the first item of $e_1$ and $s_2^*$ is the (k−2)-sequence of $s_2$ dropping the last item of $e'_n$. If the last element $e'_n$ of $s_2$ is a 1-itemset, the generated candidate k-sequence $s_3$ is $s_1$ extended with $e'_n$ as a new last element. Otherwise, $s_3$ is $s_1$ with the last item of $e'_n$ added to its last element $e_n$. For example, the candidate 5-sequence <(1,2)(3,5)(6)> is generated by joining <(1,2)(3,5)> with <(2)(3,5)(6)>, and the candidate <(1,2)(3,5,6)> is generated by joining <(1,2)(3,5)> with <(2)(3,5,6)>. In addition, the $X_k$ produced from this procedure is a superset of $S_k$ as proved in [7]. That is, $X_k \supseteq S_k$.
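The join step described above can be sketched as follows. This is a simplified illustration under our own assumptions: sequences are tuples of sorted item tuples, the function names are hypothetical, and neither the special handling GSP applies when joining 1-sequences nor the second (pruning) step is included.

```python
def drop_first_item(seq):
    """Sequence obtained by removing the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last_item(seq):
    """Sequence obtained by removing the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Join two (k-1)-sequences into a candidate k-sequence, or return None."""
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last element of s2 is a 1-itemset: it becomes a new element.
        return s1 + ((last_item,),)
    # Otherwise the last item of s2 joins the last element of s1.
    return s1[:-1] + (s1[-1] + (last_item,),)


s1 = ((1, 2), (3, 5))
print(join(s1, ((2,), (3, 5), (6,))))  # ((1, 2), (3, 5), (6,))
print(join(s1, ((2,), (3, 5, 6))))     # ((1, 2), (3, 5, 6))
```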

Cid    Data sequence ($ds^{db}$)
C2     <(4)>
C4     <(8)>
C5     <(1,4)>
C8     <(8)>
C10    <(2,4,6,8)>
C11    <(1)(7)>
C12    <(2,6)(7)>

Fig. 4. Data sequences of old and new customers in db.

Cid    Data sequence ($ds^{UD}$)
C1     <(1)(4)>
C2     <(2)(3,5)(1,2)(4)>
C3     <(1,2)(2,4)>
C4     <(4)(3)(1)(8)>
C5     <(1)(1,4)>
C6     <(6)(2,6,7)>
C7     <(2,4)>
C8     <(2,4)(5)(8)>
C9     <(1,2)(5)(2,6)>
C10    <(2,4,6,8)>
C11    <(1)(7)>
C12    <(2,6)(7)>

Support counting: GSP adopts a hash-tree structure [9,7] for storing candidates to reduce the number of candidates that need to be checked for each data sequence. Candidates would be placed in the same leaf if their leading items, starting from the first item, were hashed to the same node. The next item is used for hashing when an interior node, instead of a leaf node, is reached [7]. The candidates required for checking against a data sequence are located in leaves reached by applying the hashing procedure on each item of the data sequence [7]. The support of the candidate is incremented by one if it is contained in the data sequence.

In addition, the SPADE (Sequential PAttern Discovery using Equivalence classes) algorithm finds sequential patterns using vertical database layout and join-operations [8]. Vertical database layout transforms customer sequences into items' id-lists. The id-list of an item is a list of (cid, timestamp) pairs indicating the timestamps at which the item occurs for that customer-id. The list pairs are joined to form a sequence lattice, in which SPADE searches and discovers the patterns [8].
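The vertical layout mentioned above can be illustrated with a small transformation. This sketch is not SPADE itself: the function name `to_id_lists` is ours, and the position of an element within a data sequence is used as a stand-in for its timestamp, which is an assumption made only for illustration.

```python
from collections import defaultdict


def to_id_lists(database):
    """Map each item to its id-list of (cid, timestamp) pairs."""
    id_lists = defaultdict(list)
    for cid, seq in database.items():
        # Assumption: the 1-based element position plays the role of the timestamp.
        for timestamp, element in enumerate(seq, start=1):
            for item in element:
                id_lists[item].append((cid, timestamp))
    return dict(id_lists)


horizontal = {"C1": [(1,), (4,)], "C3": [(1, 2), (2, 4)]}
vertical = to_id_lists(horizontal)
print(vertical[2])  # [('C3', 1), ('C3', 2)]
```

Every occurrence of every item becomes one pair, which is why the transformed lists can take several times the space of the original database.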

Recently, the FreeSpan (Frequent pattern-projected Sequential Pattern Mining) algorithm was proposed to mine sequential patterns by a database projection technique [3]. FreeSpan first finds the frequent items after scanning the database once. The sequence database is then projected, according to the frequent items, into several smaller intermediate databases. Finally, all sequential patterns are found by recursively growing subsequence fragments in each database. Based on a similar projection technique, the authors proposed the PrefixSpan (Prefix-projected Sequential pattern mining) algorithm [14].

Nevertheless, all these algorithms have to re-mine the database after the database is appended with new data sequences. Next, we introduce some approaches for updating patterns without re-mining.

3.2. Approaches for incremental pattern updating

A work for incremental sequential pattern updating was proposed in [18]. The approach uses a dynamic suffix tree structure for incremental mining in a single long sequence. However, the focus of research here is on multiple sequences of itemsets, instead of a single long sequence of items.

Based on the SPADE algorithm, the ISM algorithm was proposed for incremental sequence mining [5]. An Increment Sequence Lattice consisting of both frequent sequences and the nearly frequent ones (called negative border) is built to prune the search space for potential new patterns. However, ISM might run out of memory if the number of potentially frequent patterns is too large [5]. Besides, computation is required to transform the sequence database into vertical layout, which also requires additional storage several times the size of the original database.

In order to avoid re-mining from scratch with respect to database updates with both old and new customers, we propose a pattern updating approach that incrementally mines sequential patterns by utilizing the discovered knowledge. Section 4 gives the details of the proposed algorithm.

4. The proposed algorithm

In sequence mining, frequent patterns are those candidates whose supports are greater than or equal to the minimum support. In order to obtain the supports, every data sequence in the database is examined, and the support of each candidate contained in that data sequence is incremented by one. For pattern updating after database update, the database DB was already mined and the supports of the frequent patterns with respect to DB are known. Intuitively, the number of data sequences that need to be examined in the current updating with database UD seems to be |UD|. However, we can utilize the prior knowledge to improve the overall updating efficiency. Therefore, we propose the IncSP (Incremental Sequential Pattern Update) algorithm to speed up incremental updating. Fig. 6 depicts the architecture of a single pass in the IncSP algorithm. In brief, IncSP incrementally updates and discovers the sequential patterns through effective implicit merging, early candidate pruning, and efficient separate counting.

The data sequence of a customer in DB and the sequence with the same cid in db must be merged into the customer's data sequence in UD. If all such sequences are merged explicitly, we have to re-mine and re-count the supports of the candidates contained in the resultant customer sequences from scratch. Hence, IncSP deals with the required sequence merging implicitly for incremental pattern updating, which is described in Section 4.1.

IncSP further speeds up the support counting by partitioning the candidates into two sets. The candidates which were also frequent patterns in DB before updating are placed into set $X_k(DB)$, and the remaining candidates are placed into set $X_k'(DB)$. After the partitioning, the supports of the candidates in $X_k(DB)$ can be incremented and updated simply by scanning over the increment database db. During the same scanning, we also calculate the increment supports of the candidates in $X_k'(DB)$ with respect to db. Since the supports of the candidates in $X_k'(DB)$ are not available (only the supports of frequent patterns in DB are kept in prior mining over DB), we need to compute their supports against the data sequences in DB. The number of candidates that need to be checked is reduced to the size of set $X_k'(DB)$ instead of the full set $X_k$. Thus, IncSP divides the counting procedure into separate processes to efficiently count the supports of candidates with respect to DB and db. We show in Lemma 1 (Section 4.2) that the support of a candidate is the sum of the two support counts after the two counting processes.

Moreover, some candidates in $X_k'(DB)$ can be pruned before the actual counting over the data sequences in DB. By partitioning the set of candidates into $X_k(DB)$ and $X_k'(DB)$, we know that all the candidates in $X_k'(DB)$ are not frequent patterns with respect to DB. If the support of a candidate in $X_k'(DB)$ with respect to db is smaller than the proportion $minsup \times (|UD| - |DB|)$, the candidate cannot possibly become a frequent pattern in UD. Such unqualifying candidates are pruned and only the more promising candidates go through the actual support counting over DB. Lemma 2 (in Section 4.2) shows this property. This early pruning further reduces the number of candidates required to be counted against the data sequences in DB. The reduced set of candidates is referred to as $X_k'$.
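The pruning bound can be seen from a short derivation. It is our own restatement under the decomposition of the support count into a DB part and a db part (stated as Lemma 1 in Section 4.2); the paper's proofs are in its Appendix.

```latex
\begin{align*}
s^{UD}_{count} = s^{DB}_{count} + s^{db}_{count} &\ge minsup \times |UD|
  && \text{($s$ frequent in $UD$)} \\
s^{DB}_{count} &< minsup \times |DB|
  && \text{($s$ not frequent in $DB$)} \\
\Rightarrow \quad s^{db}_{count} \ge minsup \times |UD| - s^{DB}_{count}
  &> minsup \times (|UD| - |DB|)
\end{align*}
```

For instance, with minsup = 33%, |DB| = 9 and |UD| = 12, the bound is 0.33 × 3 = 0.99, so a candidate that was not frequent in DB and does not gain at least one occurrence from db can be discarded without scanning DB.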

In essence, IncSP generates candidates and examines data sequences to determine frequent patterns in multiple passes. As shown in Fig. 6, IncSP reduces $X_k$ to $X_k'$ and updates the supports of patterns in $S^{DB}$ by simply checking the increment database db, which is usually smaller than the original database DB. In addition, the separate counting technique enables IncSP to accumulate candidates' supports quickly because only the new candidates, whose supports are unavailable from $S^{DB}$, need to be checked against DB. The complete IncSP algorithm and the separate counting are described in Section 4.2. Section 4.3 further illustrates other updating operations such as modifications and deletions.

4.1. Implicit merging of data sequences with same cids

For the discovery of sequential patterns, transactions coming from the same customer, either in DB or in db, are parts of the unique data sequence corresponding to that customer in UD. Given a customer having one data sequence in DB and another sequence in db, the proper data sequence for the customer (in UD) is the merged sequence of the two. Since the transaction times in db are later than those in DB, the merging appends the data sequences in db to the sequences in DB. Nevertheless, such "explicit merging" might invalidate $S^{DB}$ because the data sequence of the customer becomes a longer sequence. Some patterns in $S^{DB}$, which are not contained in the data sequence before merging, might become contained in the now longer data sequence so that the support counts of these patterns become larger. In order to effectively keep the patterns in $S^{DB}$ up-to-date, IncSP implicitly merges data sequences of the same customers and delays the actual action of merging until pattern updating completes.

[Fig. 6. Architecture of a single pass of IncSP. Steps shown: read each k-sequence $s \in S^{DB}$; generate $X_k$; Support Counting (I) for every data sequence $ds^{db} \in db$; filter to $X_k'$; Support Counting (II) for every data sequence $ds^{DB} \in DB$; $S_k = \{s \mid s \in X_k \wedge s^{UD}_{count} \ge minsup \times |UD|\}$. Legend: separate counting; candidate pruning; previous knowledge; (embedded) implicit merging; operation.]

Assume that an explicit merging must merge $ds^{DB}$ with $ds^{db}$ into $ds^{UD}$, where $ds^{DB}$, $ds^{db}$, and $ds^{UD}$ represent the data sequences in DB, db, and UD, respectively. In each pass, the mining process needs to count the supports of candidate sequences against $ds^{UD}$. The "implicit merging" in IncSP employs $ds^{DB}$ and $ds^{db}$ as if $ds^{UD}$ were produced during the mining process. We will describe how "implicit merging" updates the supports of sequential patterns in $S^{DB}$, and how "implicit merging" counts the supports of candidates contained in the implicitly merged data sequence, represented by $ds^{DB},ds^{db}$.

The "implicit merging" updates the supports of sequential patterns in $S^{DB}$ according to $ds^{DB}$ and $ds^{db}$. This updating involves only the newly generated (candidate) k-sequences in the kth pass, which are contained in $ds^{UD}$ but not in $ds^{DB}$, since $ds^{DB}$ had already engaged in the discovery of $S^{DB}$. We refer to these candidate k-sequences as the new k-sequences. As indicated in Fig. 6, when $ds^{db}$ is checked in Support Counting (I), only the supports of such new k-sequences must be counted. If a new k-sequence is also a sequential pattern in $S^{DB}$, we update the support count of the sequence in $S^{DB}$. Otherwise, supports of new k-sequences which are not in $S^{DB}$, being initialized to zero before counting, are incremented by one for this data sequence ($ds^{DB},ds^{db}$). In this way, IncSP correctly maintains $S^{DB}$ with the new k-sequences and counts supports with respect to $ds^{db}$ during Support Counting (I).

Example 1. Implicit merging for support updating in pass-1. Take the customers in Fig. 5 for example, where the DB is shown in Fig. 3b and the db is shown in Fig. 4. The customer with cid=C2 has the two sequences $ds^{DB}$ = <(2)(3,5)(1,2)> and $ds^{db}$ = <(4)>. During pass 1, the support count $\langle(4)\rangle^{DB}_{count}$ is increased by one due to the implicit merging of $ds^{db}$ and $ds^{DB}$ (of C2). Note that implicit merging for the customer with cid=C5, whose $ds^{DB}$ = <(1)> and $ds^{db}$ = <(1,4)>, contains only the new 1-sequence <(4)> because <(1)> was already counted when we examined $ds^{DB}$ to produce $S^{DB}$. Eventually, the support count $\langle(4)\rangle^{DB}_{count}$ is increased by two considering the two implicitly merged sequences of C2 and C5. Similarly, the support count of candidate $\langle(8)\rangle^{DB}_{count}$ is two after the implicit merging on customer sequences whose cids are C4 and C8.
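The increments in Example 1 can be reproduced with a minimal sketch. It is our own illustration of the idea, not the authors' implementation: a candidate is counted for an old customer only when it is contained in the concatenation of $ds^{DB}$ and $ds^{db}$ but not in $ds^{DB}$ alone. The names `contains` and `implicit_increments` are hypothetical, and only the four old customers of Fig. 4 are included.

```python
def contains(data_seq, pattern):
    """True if `pattern` (list of itemsets) is a subsequence of `data_seq`."""
    pos = 0
    for element in pattern:
        while pos < len(data_seq) and not set(element) <= set(data_seq[pos]):
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1
    return True


def implicit_increments(DB, db, candidates):
    """Support increases caused by old customers, without modifying DB."""
    increase = {c: 0 for c in candidates}
    for cid, ds_db in db.items():
        if cid not in DB:
            continue  # new customers are counted directly against ds_db
        ds_DB = DB[cid]
        merged_view = ds_DB + ds_db  # temporary view; DB itself stays unchanged
        for c in candidates:
            # Count only "new" k-sequences: in the merged view, not in ds_DB.
            if contains(merged_view, c) and not contains(ds_DB, c):
                increase[c] += 1
    return increase


DB = {"C2": [(2,), (3, 5), (1, 2)], "C4": [(4,), (3,), (1,)],
      "C5": [(1,)], "C8": [(2, 4), (5,)]}
db = {"C2": [(4,)], "C4": [(8,)], "C5": [(1, 4)], "C8": [(8,)]}
one_sequences = [((1,),), ((4,),), ((8,),)]
print(implicit_increments(DB, db, one_sequences))
```

The counts for <(4)> and <(8)> each increase by two, while <(1)> gains nothing because every old customer already contained it.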

4.2. The IncSP algorithm

The implicit merging technique preserves the correctness of supports of the patterns and enables IncSP to count the supports in DB and db separately for pattern updating. Fig. 7 lists the proposed IncSP algorithm and Fig. 8 depicts the two separate sub-processes of support counting in the IncSP algorithm. Through separate counting, we do not have to check the full candidate set $X_k$ against all data sequences from db and DB. Only the (usually) smaller db must take all the candidates in $X_k$ into consideration for support updating. Furthermore, we can prune previous patterns and leave fewer but more promising candidates in $X_k'$ before applying the data sequences in DB for support counting.

The IncSP algorithm generates candidates and computes the supports for pattern updating in multiple passes. In each pass, we initialize the two support counts of each candidate in UD to zero, and read the support count of each frequent k-sequence s in DB to $s^{DB}_{count}$. We then accumulate the increases in support count of candidates with respect to the sequences in db by Support Counting (I). Before Support Counting (II) starts, candidates that are already frequent in DB, as well as candidates that cannot be frequent in UD, are filtered out according to Lemma 4. The full candidate set $X_k$ is thus reduced into the set $X_k'$. Next, Support Counting (II) calculates the support counts of these promising candidates with respect to the sequences in DB. As indicated in Lemma 1, the support count of any candidate k-sequence is the sum of the two counts obtained after the two counting processes. Consequently, we can discover the set of frequent k-sequences $S_k$ by validating the sum of the two counts of every candidate. $S_k$ is then used to generate the complete candidate set for the next pass, employing a candidate generation procedure similar to that of GSP.

Fig. 7. Algorithm IncSP.


The above process is iterated until no more candidates are generated.
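The pass described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 7 itself: the names `contains`, `incsp_pass`, and `seq` are hypothetical, data sequences are modeled as lists of frozensets keyed by cid, candidates are checked by a naive scan instead of a hash-tree, and the small database at the end is invented for the demonstration.

```python
from collections import Counter


def contains(ds, s):
    """Return True if candidate s is a subsequence of data sequence ds.

    Both are sequences of frozensets. Each element of s must be a subset of
    a distinct element of ds, in the same order; greedy matching suffices.
    """
    i = 0
    for itemset in ds:
        if i < len(s) and s[i] <= itemset:
            i += 1
    return i == len(s)


def incsp_pass(candidates, freq_db_counts, DB, db, minsup):
    """One pass of separate counting over the candidate set X_k.

    freq_db_counts: support counts in DB of the k-sequences frequent in DB.
    DB, db: dicts mapping cid -> list of frozensets (transactions in order).
    Returns the frequent k-sequences in UD with their support counts.
    """
    n_db = len(DB)
    n_ud = len(set(DB) | set(db))  # merged customers are counted once
    count_db = Counter({s: freq_db_counts[s]
                        for s in candidates if s in freq_db_counts})
    count_inc = Counter()

    # Support Counting (I): scan db, merging implicitly with DB on equal cids.
    for cid, ds_new in db.items():
        ds_old = DB.get(cid, [])
        merged = ds_old + ds_new
        for s in candidates:
            # Count only support that was not already present in DB.
            if contains(merged, s) and not contains(ds_old, s):
                count_inc[s] += 1

    # Pruning (Lemma 4): drop sequences already frequent in DB and those
    # whose increase is too small to become frequent in UD.
    threshold = minsup * (n_ud - n_db)
    reduced = [s for s in candidates
               if s not in freq_db_counts and count_inc[s] >= threshold]

    # Support Counting (II): only the reduced set is checked against DB.
    for ds_old in DB.values():
        for s in reduced:
            if contains(ds_old, s):
                count_db[s] += 1

    # Lemma 1: the support count in UD is the sum of the two counts.
    totals = {s: count_db[s] + count_inc[s] for s in candidates}
    return {s: c for s, c in totals.items() if c >= minsup * n_ud}


def seq(*itemsets):
    """Helper (hypothetical) to build a candidate from plain sets."""
    return tuple(frozenset(i) for i in itemsets)


# Invented toy data: C2 is an old customer with appended transactions,
# C4 is a new customer.
DB = {"C1": [frozenset({1}), frozenset({2})],
      "C2": [frozenset({2})],
      "C3": [frozenset({1})]}
db = {"C2": [frozenset({4})],
      "C4": [frozenset({2}), frozenset({4})]}
X1 = [seq({1}), seq({2}), seq({4})]
S1_DB = {seq({1}): 2, seq({2}): 2}
result = incsp_pass(X1, S1_DB, DB, db, minsup=0.5)
```

In this toy run only ⟨(4)⟩ reaches Support Counting (II); the counts of ⟨(1)⟩ and ⟨(2)⟩ are updated from db alone, which is the saving that separate counting is meant to provide.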

We need to show that IncSP updates the supports and discovers frequent patterns correctly. Several properties used in the IncSP algorithm are described as follows. The proofs of the lemmas are given in Appendix A.

Lemma 1. The support count of any candidate k-sequence s in UD is equal to $s^{DB}_{count} + s^{db}_{count}$.

Lemma 2. A candidate sequence s, which is not frequent in DB, is a frequent sequence in UD only if $s^{db}_{count} \geq minsup \times (|UD| - |DB|)$.

Lemma 3. The separate counting procedure (in Fig. 8) completely counts the supports of candidate k-sequences against all data sequences in UD.

Lemma 4. The candidates required for checking against the data sequences in DB in Support Counting (II) form the set $X_k'$, where $X_k' = X_k - \{s \mid s \in S^{DB}_k\} - \{s \mid s^{db}_{count} < minsup \times (|UD| - |DB|)\}$.

Theorem 1. IncSP updates the supports and discovers frequent patterns correctly.

Proof. In IncSP, we use a candidate generation procedure analogous to that of GSP to produce the complete set of candidates in $X_k$. By Lemma 3, the separate counting procedure completely counts the supports of candidate k-sequences against all data sequences in UD. Lemma 1 determines frequent patterns in UD and the updated supports. Therefore, IncSP correctly maintains sequential patterns. □

Example 2. Sequential pattern updating using IncSP. The data sequences in the original database DB are shown in Fig. 3b. The minsup is 33%. $S^{DB}$ is listed in Table 2. The increment database db is shown in Fig. 4. IncSP discovers $S^{UD}$ as follows.

Table 2
Sequences and support counts for Example 2

| Part (a): $S^{DB}$ | $s^{DB}_{count}$ | Part (b): Pass 1, Support Counting (I) | $s^{db}_{count}$ | Part (c): Pass 2, Support Counting (I) | $s^{db}_{count}$ | Part (d): $S^{UD}$ | $s^{UD}_{count}$ |
|---|---|---|---|---|---|---|---|
| ⟨(1)⟩ | 6 | ⟨(1)⟩ | 1 | ⟨(1)(1)⟩ | 1 | ⟨(1)⟩ | 7 |
| ⟨(2)⟩ | 6 | ⟨(2)⟩ | 2 | ⟨(1)(4)⟩ | 2 | ⟨(2)⟩ | 8 |
| ⟨(4)⟩ | 5 | ⟨(4)⟩ | 3 | ⟨(2)(4)⟩ | 1 | ⟨(4)⟩ | 8 |
| ⟨(5)⟩ | 3 | ⟨(6)⟩ | 2 | ⟨(2,4)⟩ | 1 | ⟨(6)⟩ | 4 |
| ⟨(2)(2)⟩ | 3 | ⟨(7)⟩ | 2 | ⟨(2,6)⟩ | 2 | ⟨(1)(4)⟩ | 4 |
| ⟨(2)(5)⟩ | 3 | ⟨(8)⟩ | 3 | ⟨(4,6)⟩ | 1 | ⟨(2,4)⟩ | 4 |
| ⟨(1,2)⟩ | 3 | ⟨(3)⟩ | 0 | ⟨(1,4)⟩ | 1 | ⟨(2,6)⟩ | 4 |
| ⟨(2,4)⟩ | 3 | ⟨(5)⟩ | 0 | Others | 0 | | |

| Part (b): Pass 1, Support Counting (II) | $s^{DB}_{count}$ | Part (c): Pass 2, Support Counting (II) | $s^{DB}_{count}$ |
|---|---|---|---|
| ⟨(6)⟩ | 2 | ⟨(1)(4)⟩ | 2 |
| ⟨(7)⟩ | 1 | ⟨(2)(4)⟩ | 1 |
| ⟨(8)⟩ | 0 | ⟨(2,6)⟩ | 2 |
| | | ⟨(1)(1)⟩ | 0 |
| | | ⟨(1,4)⟩ | 0 |
| | | ⟨(4,6)⟩ | 0 |

Pass 1:

(1) Generate candidates for pass 1, $X_1 = \{\langle(1)\rangle, \langle(2)\rangle, \ldots, \langle(8)\rangle\}$.

(2) Initialize the two counts of each candidate in $X_1$ to zero, and read $S^{DB}_1$.

(3) After Support Counting (I), the increases in support count are listed in Part (b) of Table 2. Note that for the customer with cid = C5, the support count of ⟨(1)⟩ is not increased. Now $|UD| = 12$ and $|DB| = 9$. Since $S^{DB}_1 = \{\langle(1)\rangle, \langle(2)\rangle, \langle(4)\rangle, \langle(5)\rangle\}$ and the increase in support count of ⟨(3)⟩ is less than $33\% \times (|UD| - |DB|)$, the reduced set $X_1'$ is {⟨(6)⟩, ⟨(7)⟩, ⟨(8)⟩}.

(4) After Support Counting (II), the $s^{DB}_{count}$ of ⟨(6)⟩ and ⟨(7)⟩ are 2 and 1, respectively. The minimum support count in UD is 4 (33% of 12 sequences, rounded up). IncSP obtains the updated frequent 1-sequences, which are ⟨(1)⟩, ⟨(2)⟩, ⟨(4)⟩, and ⟨(6)⟩. In total, 22 candidate 2-sequences are generated from the four frequent 1-sequences.

Pass 2:

(5) We read $S^{DB}_2$ after initializing the two support counts of all candidate 2-sequences. Note that the $s^{DB}_{count}$ of ⟨(2)(5)⟩ is useless because ⟨(2)(5)⟩ is not a candidate in UD in this pass.

(6) We list the result of Support Counting (I) in Part (c) of Table 2. The increases in support count of some candidates, such as ⟨(1,6)⟩ or ⟨(4)(6)⟩, are all zero and are not listed.

(7) Again, we compute $X_2'$, so that the candidates that need to be checked against the data sequences in DB are ⟨(1)(1)⟩, ⟨(1)(4)⟩, ⟨(1,4)⟩, ⟨(2)(4)⟩, ⟨(2,6)⟩, and ⟨(4,6)⟩. We filter out 16 candidates (13 candidates with insufficient "support increases" and 3 candidates in $S^{DB}_2$) before Support Counting (II) starts.

(8) The $s^{DB}_{count}$ of ⟨(1)(4)⟩, ⟨(2)(4)⟩, and ⟨(2,6)⟩ are 2, 1, and 2, respectively, after Support Counting (II). IncSP then sums up the counts ($s^{DB}_{count}$ and $s^{db}_{count}$) to obtain the updated frequent 2-sequences. Finally, IncSP terminates since no candidate 3-sequence is generated. Part (d) of Table 2 lists the sequential patterns and their support counts in UD.
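The arithmetic of Example 2 can be replayed with a short script. This is only a sketch: the counts are copied from Table 2, while the string labels for sequences and the function names `reduced_set` and `frequent_in_ud` are hypothetical conveniences. The actual scanning of data sequences is not performed here, since the data of Fig. 3b and Fig. 4 are not repeated in this section.

```python
from itertools import combinations, product

MINSUP, N_UD, N_DB = 0.33, 12, 9


def reduced_set(candidates, freq_db, increase):
    """Lemma 4: drop candidates frequent in DB or with too small an increase."""
    threshold = MINSUP * (N_UD - N_DB)
    return {s for s in candidates
            if s not in freq_db and increase.get(s, 0) >= threshold}


def frequent_in_ud(candidates, freq_db, increase, counted_in_db):
    """Lemma 1: sum the two counts and keep those reaching minsup * |UD|."""
    result = {}
    for s in candidates:
        base = freq_db.get(s, counted_in_db.get(s, 0))
        total = base + increase.get(s, 0)
        if total >= MINSUP * N_UD:
            result[s] = total
    return result


# Pass 1: counts from Parts (a) and (b) of Table 2.
x1 = [f"<({i})>" for i in range(1, 9)]
s1_db = {"<(1)>": 6, "<(2)>": 6, "<(4)>": 5, "<(5)>": 3}
inc1 = {"<(1)>": 1, "<(2)>": 2, "<(4)>": 3,
        "<(6)>": 2, "<(7)>": 2, "<(8)>": 3}          # Support Counting (I)
x1_reduced = reduced_set(x1, s1_db, inc1)
cnt1 = {"<(6)>": 2, "<(7)>": 1, "<(8)>": 0}           # Support Counting (II)
s1_ud = frequent_in_ud(x1, s1_db, inc1, cnt1)

# Pass 2: 16 candidates <(a)(b)> plus 6 candidates <(a,b)> from 4 items.
items = [1, 2, 4, 6]
x2 = [f"<({a})({b})>" for a, b in product(items, repeat=2)]
x2 += [f"<({a},{b})>" for a, b in combinations(items, 2)]
s2_db = {"<(2)(2)>": 3, "<(2)(5)>": 3, "<(1,2)>": 3, "<(2,4)>": 3}
inc2 = {"<(1)(1)>": 1, "<(1)(4)>": 2, "<(2)(4)>": 1, "<(2,4)>": 1,
        "<(2,6)>": 2, "<(4,6)>": 1, "<(1,4)>": 1}     # Part (c), counting (I)
x2_reduced = reduced_set(x2, s2_db, inc2)
cnt2 = {"<(1)(4)>": 2, "<(2)(4)>": 1, "<(2,6)>": 2,
        "<(1)(1)>": 0, "<(1,4)>": 0, "<(4,6)>": 0}    # Part (c), counting (II)
s2_ud = frequent_in_ud(x2, s2_db, inc2, cnt2)

print(sorted(x1_reduced), s1_ud)
print(len(x2), len(x2) - len(x2_reduced), s2_ud)
```

Under these inputs the script reproduces the reduced sets of steps (3) and (7), the 22 candidate 2-sequences of which 16 are filtered out, and the support counts of Part (d).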

In comparison with GSP, IncSP updates the supports of sequential patterns in $S^{DB}$ by scanning data sequences in db only. New sequential patterns, which are not in DB, are generated from fewer candidate sequences compared with previous methods. The support increases of new candidates are checked in advance, leaving only the most promising candidates for Support Counting (II) against data sequences in DB. Every candidate in the reduced set is then checked against DB to see if it is frequent in UD. In contrast, GSP takes every candidate and counts over all data sequences in the updated database. Consequently, IncSP is much faster than GSP, as shown in the experimental results.

4.3. Pattern maintenance on transaction deletion and modification

Common operations on constantly updated databases include not only appending, but also deletions and modifications. Deleting transactions from a data sequence changes the sequence, thereby changing the supports of patterns contained in this sequence. The supports of the discovered patterns might decrease but no new patterns would occur. We check patterns in $S^{DB}$ against these data sequences. Assume that a data sequence ds is changed to $ds'$ due to deletion. The $ds'$ is an empty sequence when all transactions in ds are deleted. If a frequent sequence s is contained in ds but not in $ds'$, $s^{DB}_{count}$ is decreased by one. The resulting sequential patterns in the updated database are those patterns still having minimum supports.
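The deletion rule above can be illustrated with a small Python sketch. The function names `contains` and `update_on_deletion` are hypothetical, sequences are modeled as lists or tuples of frozensets, and the number of data sequences remaining after deletion is passed in by the caller because the text does not specify how emptied sequences affect the database size.

```python
def contains(ds, s):
    """Return True if pattern s is a subsequence of data sequence ds."""
    i = 0
    for itemset in ds:
        if i < len(s) and s[i] <= itemset:
            i += 1
    return i == len(s)


def update_on_deletion(pattern_counts, changes, minsup, n_sequences):
    """Decrease supports of previously frequent patterns after deletions.

    pattern_counts: dict pattern -> support count before the deletions.
    changes: list of (ds, ds_prime) pairs, ds_prime being ds after deletion
             (an empty list when every transaction was deleted).
    n_sequences: number of data sequences in the updated database (assumed
                 to be supplied by the caller).
    """
    counts = dict(pattern_counts)
    for ds, ds_prime in changes:
        for s in counts:
            # The pattern loses this sequence's support only if it was
            # contained before the deletion and is not contained afterwards.
            if contains(ds, s) and not contains(ds_prime, s):
                counts[s] -= 1
    return {s: c for s, c in counts.items() if c >= minsup * n_sequences}


def seq(*itemsets):
    return tuple(frozenset(i) for i in itemsets)


# Invented toy data: transaction (2) is deleted from one of three sequences.
patterns = {seq({1}): 2, seq({2}): 2, seq({1}, {2}): 2}
changes = [([frozenset({1}), frozenset({2})], [frozenset({1})])]
remaining = update_on_deletion(patterns, changes, minsup=0.5, n_sequences=3)
```

No candidate generation is needed in this case, which matches the observation that deletions cannot create new patterns.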

A transaction modiﬁcation can be accomplished by deleting the old transaction and then inserting the new transaction. In IncSP, we delete the original data sequence from the original database, create a new sequence comprising the substituted transaction(s), and then append the new sequence to the increment database.

5. Performance comparisons and experimental results

In order to assess the performance of the IncSP algorithm, we conducted comprehensive experiments using an 866 MHz Pentium-III PC with 1024 MB memory. In these experiments, the databases are composed of synthetic data. The method used to generate these data is described in Section 5.1. Section 5.2 compares the performance and resource consumption of algorithms GSP, ISM and IncSP. Results of scale-up experiments are presented in Section 5.3. Section 5.4 discusses the memory requirements of these algorithms.


5.1. Synthetic data generation

Updating the original database DB with the increment database db was modeled by generating the updated database UD, then partitioning UD into DB and db. Synthetic transactions covering various data characteristics were generated by the well-known method in [1]. Since all sequences were generated from the same statistical patterns, this partitioning might model real updates very well.

At first, a total of $|UD|$ data sequences were created as UD. Three parameters are used to partition UD for simulating different updating scenarios. Parameter $R_{inc}$, called increment ratio, decides the size of db. In total, $|db| = |UD| \times R_{inc}$ sequences were randomly picked from UD into db. The remaining $|UD| - |db|$ sequences would be placed in DB. The comeback ratio $R_{cb}$ determines the number of "old" customers in db. In total, $|old| = |db| \times R_{cb}$ sequences were randomly chosen from these $|db|$ sequences as "old" customer sequences, which were to be split further. The splitting of a data sequence is to simulate that some transactions were conducted formerly (thus in DB), while the remaining transactions were newly appended. The splitting was controlled by the third parameter $R_f$, the former ratio. If a sequence with a total of $|ds^{UD}|$ transactions was to be split, we placed the leading $|ds^{DB}| = |ds^{UD}| \times R_f$ transactions in DB and the remaining $|ds^{UD}| - |ds^{DB}|$ transactions in db. For example, a UD with $R_{inc} = 20\%$, $R_{cb} = 30\%$, and $R_f = 40\%$ means that 20% of sequences in UD come from db, 30% of the sequences in db have cids occurring in DB, and that for each "old" customer, 40% of his/her transactions were conducted before current pattern updating. (Note: The calculation is integer-based with the 'ceiling' function. E.g. $|ds^{UD}| = 4$, $|ds^{DB}| = \lceil 4 \times 40\% \rceil = 2$.)

We now review the details of data sequence generation, as described in [1]. In the modeled retailing environment, each customer purchases a sequence of itemsets. Such a sequence is referred to as a potentially frequent sequence (PFS). Still, some customers might buy only some of the items from a PFS. A customer's data sequence may contain more than one PFS. The PFSs are composed of potentially frequent itemsets (PFIs).
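The partitioning of UD into DB and db by $R_{inc}$, $R_{cb}$, and $R_f$ described earlier in this section can be sketched as follows. The function names `ceil_pct` and `partition_ud` are hypothetical, ratios are passed as integer percentages so that the ceiling is computed exactly, and applying the ceiling to $|db|$ and $|old|$ as well as to $|ds^{DB}|$ is an assumption based on the note about integer-based calculation.

```python
import random


def ceil_pct(n, pct):
    """Ceiling of n * pct / 100 using integer arithmetic only."""
    return (n * pct + 99) // 100


def partition_ud(UD, r_inc, r_cb, r_f, seed=0):
    """Split UD (dict cid -> list of transactions) into DB and db.

    r_inc, r_cb, r_f are integer percentages (e.g. 20 for 20%).
    """
    rng = random.Random(seed)
    cids = sorted(UD)
    db_cids = rng.sample(cids, ceil_pct(len(cids), r_inc))
    old_cids = set(rng.sample(db_cids, ceil_pct(len(db_cids), r_cb)))
    db_cid_set = set(db_cids)

    DB, db = {}, {}
    for cid in cids:
        ds = UD[cid]
        if cid not in db_cid_set:
            DB[cid] = ds                    # untouched customer
        elif cid in old_cids:
            k = ceil_pct(len(ds), r_f)      # leading transactions stay in DB
            DB[cid] = ds[:k]
            if ds[k:]:                      # skip empty remainders (assumption)
                db[cid] = ds[k:]
        else:
            db[cid] = ds                    # brand-new customer
    return DB, db


# Invented toy data: ten customers with four transactions each.
UD = {f"C{i}": [[i], [i, 1], [2], [3]] for i in range(10)}
DB, db = partition_ud(UD, r_inc=20, r_cb=50, r_f=40, seed=1)
```

With these settings two customers are moved to db, one of which is an "old" customer whose first two transactions remain in DB.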

Table 3 summarizes the symbols and the parameters used in the experiments. A database generated with these parameters is described as follows. The updated database has $|UD|$ customer sequences, each customer has $|C|$ transactions on average, and each transaction has $|T|$ items on average. A table of $N_I$ PFIs in total and a table of $N_S$ PFSs in total were generated before picking items for the transactions of customer sequences. On average, a PFS has $|S|$ transactions and a PFI has $|I|$ items. The total number of possible items for all PFIs is N.

The number of transactions for the next customer and the average size of transactions for this customer are determined first. The size of the customer's data sequence is picked from a Poisson distribution with mean equal to $|C|$. The average size of the transactions is picked from a Poisson distribution with mean equal to $|T|$. Items are then assigned to the transactions of the customer. Each customer is assigned a series of PFSs from table $\Gamma_S$, the table of PFSs. Next, we describe the generation of PFS and then the assignment of PFS.

The number of itemsets in a PFS is generated by picking from a Poisson distribution with mean equal to $|S|$. The itemsets in a PFS are picked from table $\Gamma_I$, the table of PFIs. In order to model that there are common itemsets in frequent sequences, subsequent PFSs in $\Gamma_S$ are related. In the subsequent PFS, a fraction of itemsets are chosen from the previous PFS and the other itemsets are picked at random from $\Gamma_I$. The fraction $corr_S$, called correlation level, is decided by an exponentially distributed random variable with mean equal to $\mu_{corr_S}$. Itemsets in the first PFS in $\Gamma_S$ are randomly picked. The generations of PFI and $\Gamma_I$ are analogous to the generations of PFS and $\Gamma_S$, with parameters N items, mean $|I|$, correlation level $corr_I$ and mean $\mu_{corr_I}$ correspondingly.

The assignment of PFSs is based on the weights of PFSs. The weight of a PFS, representing the probability that this PFS will be chosen, is exponentially distributed and then normalized in such a way that the sum of all the weights is equal to one. Since the itemsets in a PFS are not always all bought together, each sequence in $\Gamma_S$ is assigned a corruption level $crup_S$. When selecting itemsets from a PFS to a customer sequence, an itemset is dropped as long as a uniformly distributed random number between 0 and 1 is less than $crup_S$. The $crup_S$ is a normally distributed random variable with mean $\mu_{crup_S}$ and variance $\sigma_{crup_S}$. The assignment of PFIs (from $\Gamma_I$) to a PFS is processed analogously with parameters $crup_I$, mean $\mu_{crup_I}$ and variance $\sigma_{crup_I}$ correspondingly.
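The weighting and corruption steps can be sketched as follows. This is a loose illustration rather than the generator of [1]: the function names are hypothetical, the corruption level is clamped to [0, 1] as an added assumption, the 0.1 is treated as a variance (so its square root is passed as the standard deviation), and the choice of which itemset to drop on each draw is assumed to be uniformly random because the text does not specify it.

```python
import math
import random


def pfs_weights(n, rng):
    """Exponentially distributed weights, normalized to sum to one."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def corruption_level(rng, mean=0.75, variance=0.1):
    """Normally distributed corruption level, clamped to [0, 1] (assumption)."""
    return min(1.0, max(0.0, rng.gauss(mean, math.sqrt(variance))))


def corrupt(pfs, crup, rng):
    """Drop itemsets from a PFS as long as a uniform draw is below crup."""
    kept = list(pfs)
    while kept and rng.random() < crup:
        kept.pop(rng.randrange(len(kept)))  # dropped itemset chosen at random
    return kept


rng = random.Random(42)
weights = pfs_weights(5, rng)
table = ["PFS1", "PFS2", "PFS3", "PFS4", "PFS5"]   # placeholder PFS table
chosen = rng.choices(table, weights=weights, k=1)[0]
pfs = ["I1", "I2", "I3", "I4"]                      # placeholder itemsets
kept = corrupt(pfs, corruption_level(rng), rng)
```

A corruption level of 0 keeps every itemset and a level of 1 drops all of them, which bounds the behavior of the loop.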

All datasets used here were generated by setting $\mu_{crup_S}$ and $\mu_{crup_I}$ to 0.75, $\sigma_{crup_S}$ and $\sigma_{crup_I}$ to 0.1, $\mu_{corr_S}$ and $\mu_{corr_I}$ to 0.25, $N_S = 5000$, and $N_I = 25000$. Two values of N (1000 and 10000) were used. A dataset created with $|C| = a$, $|T| = b$, $|S| = w$, and $|I| = d$ is denoted by the notation Ca.Tb.Sw.Id.

5.2. Performance comparisons of GSP, ISM, and IncSP

To assess the performance improvements of IncSP, we first compare the efficiency of incremental updating with that of re-mining from scratch, and then contrast it with other incremental mining approaches. The well-known GSP algorithm [7], which is a re-mining based algorithm, is used as the basis for comparison. The PrefixSpan algorithm [14] mines patterns by recursively projecting data sequences to smaller intermediate databases. Starting from prefix-items (the frequent items), sequential patterns are found by recursively growing subsequence fragments in each intermediate database. Apart from re-mining, no mechanism for modifying PrefixSpan to handle incremental updating is found in the literature. Since it demands a totally different framework to handle the sequence projection of the original database and the increment database, PrefixSpan is not included in the experiments. The ISM algorithm [5], which is the incremental mining version of the SPADE algorithm [8], deals with database update using databases of vertical layout. We pre-processed the databases for ISM into vertical layout, and the pre-processing time is not counted in the following results.

Extensive experiments were performed to compare the execution times of GSP, ISM, and IncSP with respect to critical factors that reflect the performance of incremental updating, including minsup, increment ratio, comeback ratio, and former ratio. We set $R_{inc} = 10\%$, $R_{cb} = 50\%$, and $R_f = 80\%$ to model common database updating scenarios. The dataset has 20,000 sequences ($|UD| = 20$ K, 3.8 MB), generated with $|C| = 10$, $|T| = 2.5$, $|S| = 4$, $|I| = 1.25$.

Table 3
Parameters used in the experiments

| Parameter | Description | Value |
|---|---|---|
| $\lvert UD \rvert$ | Number of data sequences in database UD | 10 K, 100 K, 250 K, 500 K, 750 K, 1000 K |
| $\lvert C \rvert$ | Average size (number of transactions) per customer | 10, 20 |
| $\lvert T \rvert$ | Average size (number of items) per transaction | 2.5, 5 |
| $\lvert S \rvert$ | Average size of potentially sequential patterns | 4, 8 |
| $\lvert I \rvert$ | Average size of potentially frequent itemsets | 1.25, 2.5 |
| $N$ | Number of possible items | 1000, 10,000 |
| $N_I$ | Number of potentially frequent itemsets | 25,000 |
| $N_S$ | Number of possible sequential patterns | 5000 |
| $\Gamma_S$ | The table of potentially frequent sequences (PFSs) | |
| $\Gamma_I$ | The table of potentially frequent itemsets (PFIs) | |
| $corr_S$ | Correlation level (sequence), exponentially distributed | $\mu_{corr_S} = 0.25$ |
| $crup_S$ | Corruption level (sequence), normally distributed | $\mu_{crup_S} = 0.75$, $\sigma_{crup_S} = 0.1$ |
| $corr_I$ | Correlation level (itemset), exponentially distributed | $\mu_{corr_I} = 0.25$ |
| $crup_I$ | Corruption level (itemset), normally distributed | $\mu_{crup_I} = 0.75$, $\sigma_{crup_I} = 0.1$ |
| $R_{inc}$ | Ratio of increment database db to updated database UD | 1%, 2%, 5%, 8%, 10%, 20%, 30%, …, 90% |
| $R_{cb}$ | Ratio of comeback customers to all customers in increment database db | 0%, 10%, 25%, 50%, 75%, 100% |
| $R_f$ | Ratio of former transactions to all transactions for each "old" customer | |

The effect on performance with various minsups was evaluated first. Re-mining is less efficient than incremental updating, as indicated in Fig. 9. In the experiments, both ISM and IncSP are faster than GSP for all values of minimum supports. Fig. 9a shows that ISM is faster than IncSP when the number of items (N) is 1000 and minsup ≤ 1%. When N is 10,000, IncSP outperforms ISM for all values of minsup, as shown in Fig. 9b. The total execution time is longer for all three algorithms at smaller minsup values, which allow more patterns to pass the frequent threshold. GSP suffers from the explosive growth of the number of candidates and the re-counting of supports for each pattern. For example, when minsup is 1% and N = 10,000, the number of candidate 2-sequences in GSP is 532,526 and that of 'new' candidate 2-sequences in IncSP is 59. Only 59 candidate 2-sequences required counting over the data sequences in UD. The other candidate 2-sequences are updated, rather than re-counted, against the 2000 sequences in db ($|UD| \times 10\%$).

Comparing Fig. 9a with Fig. 9b indicates that ISM is more efficient with a smaller N. ISM keeps all frequent sequences, as well as the maximally potential frequent sequences (negative borders), in memory. Take minsup = 0.75% for example. The number of frequent sequences is 701 for N = 1000 and 1017 for N = 10,000. Accordingly, the size of negative borders of size two is 736,751 and 1,550,925, respectively. Those turn-into-frequent patterns that were in negative borders before database updating must intersect with the complete set of frequent patterns. Consequently, with a smaller minsup like 0.75%, the larger N provides more possible items to pass the frequent threshold so that the total execution is less efficient in ISM. Because IncSP deals with candidates separately instead of performing frequent-pattern intersection, the explosively increased frequent items (because of the larger N) affect the efficiency of its pattern updating less. This also accounts for the performance gaps between IncSP and ISM, no matter how the increment ratio, comeback ratio, or former ratio changes.

The results of varying increment ratio from 1% to 50% are shown in Fig. 10. The minsup is fixed at 2%. In general, IncSP gains less at higher increment ratio because larger increment ratio means more sequences appearing in db and causes more pattern updating. As indicated in Fig. 10, the smaller the increment database db is, the more time on the discovery IncSP could save.

Fig. 9. Total execution times over various minsup: (a) N = 1000 and (b) N = 10,000. [Plots of total execution time (sec.) of GSP, ISM, and IncSP versus minsup (3%, 2%, 1%, 0.75%); C10.T2.5.S4.I1.25, |UD| = 20K, Rinc = 10%, Rcb = 50%, Rf = 80%.]

[Fig. 10: plot of the execution time ratios T(GSP)/T(ISM) and T(GSP)/T(IncSP) versus Rinc (1% to 50%); C10.T2.5.S4.I1.25, |UD| = 20K, Rcb = 50%, Rf = 80%, minsup = 2%.]

IncSP is still faster than GSP even when the increment ratio is 50%. When the increment ratio becomes much larger, say over 60%, IncSP is slower than GSP. Clearly, when most of the frequent sequences in DB turn out to be invalid in UD, the information used by IncSP in pattern updating might become useless. When the size of the increment database becomes larger than the size of the original database, i.e., the database has accumulated dramatic change rather than incremental change, re-mining might be a better choice for the totally new sequence database.

The impact of the comeback ratio is presented in Fig. 11. IncSP updates patterns more efficiently than GSP and ISM for all the comeback ratios. A high comeback ratio means that there are many 'old' customers in the increment database. Consequently, the speedup ratio decreases as the comeback ratio increases because more sequence merging is required. Fig. 11 shows that IncSP was efficient with implicit merging, even when the comeback ratio was increased to 100%, i.e., all the sequences in the increment database must be merged.

Fig. 12 depicts the performance comparisons concerning former ratios. It can be seen from the ﬁgure that IncSP was constantly about 6.5 times faster than GSP over various former ratios, ranging from 10% to 90%.

5.3. Scale-up experiments

To assess the scalability of our algorithm, several experiments on large databases were conducted. Since the basic construct of IncSP is similar to that of GSP, similar scalable results could be expected. In the scale-up experiments, the total number of customers was increased from 100 K (18.8 MB) to 1000 K (187.9 MB), with fixed parameters C10.T2.5.S4.I1.25, N = 10,000, $R_{inc} = 10\%$, $R_{cb} = 50\%$, and $R_f = 80\%$. Again, IncSP is faster than GSP for all the datasets. The execution times were normalized with respect to the execution time for 100 K customers here. Fig. 13 shows that the execution time of IncSP increases linearly as the database size increases, which demonstrates good scalability of IncSP.

5.4. Memory requirements

Although IncSP uses separate counting to speed up mining, it generates candidates and then performs counting by multiple database scanning, like GSP. The pattern updating process in IncSP reads in the previously discovered patterns and stores them into a hash-tree for fast support updating. Therefore, the maximum size of memory required for both GSP and IncSP is determined by the space required to store the candidates. A smaller minsup often generates a large number of candidates, thereby demanding a larger memory space.

Fig. 11. Total execution times over various comeback ratios. [Plot of the execution time ratios T(GSP)/T(ISM) and T(GSP)/T(IncSP) versus Rcb (10%, 25%, 50%, 75%, 100%); C10.T2.5.S4.I1.25, |UD| = 20K, Rinc = 10%, Rf = 80%, minsup = 2%.]

[Fig. 12: plot of the execution time ratios T(GSP)/T(ISM) and T(GSP)/T(IncSP) versus Rf (10%, 30%, 50%, 70%, 90%); C10.T2.5.S4.I1.25, |UD| = 20K, Rinc = 10%, Rcb = 50%, minsup = 2%. Data labels in plotted order: 2.8, 2.8, 2.7, 2.6, 2.5 and 7.1, 6.7, 6.4, 6.3, 6.1.]

In contrast, ISM applies item-intersection in each class for new pattern discovery, assuming that all frequent sequences as well as potentially frequent sequences are stored in a lattice in memory. Storing every possible frequent sequence costs a huge memory space, not to mention those required for lattice links. For instance, the size of negative borders of size two is over 1.5 million with N = 10,000 (minsup = 0.75%) in the experiment of Fig. 9b. As shown in Fig. 14, the required memory for IncSP is smaller than that of ISM. More memory is required in vertical approaches like SPADE, which is also observed in [13].

6. Conclusions

The problem of sequential pattern mining is much more complicated than association discovery due to sequence permutation. Validity of discovered patterns may change and new patterns may emerge after updates on databases. In order to keep the sequential patterns current and up to date, re-execution of the mining algorithm on the whole updated database is required. However, it takes more time than required in prior mining because of the additional data sequences appended. Therefore, we proposed the IncSP algorithm utilizing previously discovered knowledge to solve the maintenance problem efficiently by incremental updating without re-mining from scratch. The performance improvements result from effective implicit merging, early candidate pruning, and efficient separate counting.

Fig. 13. Linear scalability of the database size.

Fig. 14. Maximum required memory with respect to various minsup. [Plot of maximum used memory (MB) of GSP, ISM, and IncSP versus minsup (3%, 2%, 1%, 0.75%); C10.T2.5.S4.I1.25, |UD| = 20K, N = 1000, Rinc = 10%, Rcb = 50%, Rf = 80%.]

Implicit merging ensures that IncSP employs correctly combined data sequences while preserving previous knowledge useful for incremental updating. Candidate pruning after updating pattern supports against the increment database further accelerates the whole process, since fewer but more promising candidates are generated by just checking counts in the increment database. Eventually, efficient support counting of promising candidates over the original database accomplishes the discovery of new patterns. IncSP both updates the supports of existing patterns and finds new patterns for the updated database. The simulation performed shows that the proposed incremental updating mechanism is several times faster than re-mining using the GSP algorithm, with respect to various data characteristics or data combinations. IncSP outperforms GSP with regard to different ratios of the increment database to the original database except when the increment database becomes larger than the original database. Such a situation means that a long time has passed since the last database maintenance and most of the patterns have become obsolete. In such a case, re-mining with a new minsup over the whole database would be more appropriate since the original minsup might not be suitable for the current database any more.

The IncSP algorithm currently solves the pattern updating problems using the previously specified minimum support. Further research could be extended to the problems of dynamically varying minimum supports. Generalized sequential pattern problems [7], such as patterns with is-a hierarchy or with sliding-time window property, are also worthy of further investigation since different constraints induce diversified maintenance difficulties. In addition to the maintenance problem, constantly updated databases generally create a pattern-changing history, indicating changes of sequential patterns at different times. It is challenging to extend the proposed algorithm to explore the pattern-changing history for trend prediction.

Acknowledgements

The authors thank the referees for their valuable comments and suggestions.

Appendix A

As noted in Table 1, $s^{DB}_{count}$ is the support count of candidate sequence s in DB, and $s^{db}_{count}$ denotes the increase in support count of candidate sequence s due to db. The candidate k-sequences in UD are partitioned into $X_{k(DB)}$ and $X_{k(DB)'}$. That is, $X_k = X_{k(DB)} \cup X_{k(DB)'}$, where $X_{k(DB)} = \{s \mid s \in X_k \wedge s \in S^{DB}_k\}$ and $X_{k(DB)'} = X_k - X_{k(DB)}$. The data sequences in UD could be partitioned into three sets: sequences with cids appearing in DB only, sequences with cids appearing in db only, and sequences with cids occurring in both DB and db. The cid of a data sequence ds is represented by ds.cid. Let $UD = UD_{DB} \cup UD_{db} \cup UD_{Dd}$, where $UD_{DB} = \{ds \mid ds \in DB \wedge ds \notin db\}$, $UD_{db} = \{ds \mid ds \in db \wedge ds \notin DB\}$, and $UD_{Dd} = \{ds \mid ds = ds_1 + ds_2,\ ds_1 \in DB \wedge ds_2 \in db \wedge ds_1.cid = ds_2.cid\}$.
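The three-way partition of UD can be made concrete with a short Python sketch. The function name `partition_by_cid` is hypothetical and sequences are modeled as lists of item tuples keyed by cid. The sequences for C2 and C5 follow the earlier discussion of implicit merging, whereas the customers C1 and C10 are invented for illustration.

```python
def partition_by_cid(DB, db):
    """Split the customers of UD into the three sets used in the proofs.

    DB, db: dicts mapping cid -> list of itemsets (transactions in order).
    Returns (ud_DB, ud_db, ud_Dd); in ud_Dd the db part is appended to the
    DB part of the same customer.
    """
    ud_DB = {cid: ds for cid, ds in DB.items() if cid not in db}
    ud_db = {cid: ds for cid, ds in db.items() if cid not in DB}
    ud_Dd = {cid: DB[cid] + db[cid] for cid in DB if cid in db}
    return ud_DB, ud_db, ud_Dd


DB = {"C1": [(1, 2)],                      # invented customer, DB only
      "C2": [(2,), (3, 5), (1, 2)],
      "C5": [(1,)]}
db = {"C2": [(4,)],
      "C5": [(1, 4)],
      "C10": [(8,)]}                       # invented customer, db only
ud_DB, ud_db, ud_Dd = partition_by_cid(DB, db)
```

Every customer of UD falls into exactly one of the three sets, which is the property the proof of Lemma 3 relies on when it treats the cases (i), (ii), and (iii) separately.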

Lemma 1. The support count of any candidate k-sequence s in UD is equal to $s^{DB}_{count} + s^{db}_{count}$.

Proof. The support count of s in UD is the support count of s in DB, plus the count increase due to the data sequences in db. That is, $s^{DB}_{count} + s^{db}_{count}$ by definition. □

Lemma 2. A candidate sequence s, which is not frequent in DB, is a frequent sequence in UD only if $s^{db}_{count} \geq minsup \times (|UD| - |DB|)$.

Proof. Since $s \notin S^{DB}$, we have $s^{DB}_{count} < minsup \times |DB|$. If $s^{db}_{count} < minsup \times (|UD| - |DB|)$, then $s^{DB}_{count} + s^{db}_{count} < minsup \times |UD|$. That is, $s \notin S^{UD}$. □

Lemma 3. The separate counting procedure (in Fig. 8) completely counts the supports of candidate k-sequences against all data sequences in UD.

Proof. Consider a data sequence ds in UD and a candidate k-sequence $s \in X_k$.

(i) For each candidate k-sequence s contained in ds where $ds \in UD_{db}$: The support count increase (due to ds) is accumulated in $s^{db}_{count}$, by line 4 of Support Counting (I) in Fig. 8.

(ii) For each candidate k-sequence s contained in ds where $ds \in UD_{DB}$: (a) If $s \in X_{k(DB)}$, no counting is required since s had been counted while discovering $S^{DB}$. The support count of s in DB is read in $s^{DB}_{count}$ by line 6 in Fig. 7. (b) If $s \in X_{k(DB)'}$, $s^{DB}_{count}$ accumulates the support count of s, by line 3 of Support Counting (II) in Fig. 8. Note that in this counting, we reduce $X_{k(DB)'}$ to $X_k'$ by Lemma 4.

(iii) For each candidate k-sequence s contained in ds where $ds \in UD_{Dd}$: Now ds is formed by appending $ds^{db}$ to $ds^{DB}$. (a) If $s \not\subseteq ds^{DB}$, i.e., the $ds^{DB}$ of ds does not contain s, we accumulate the increase in $s^{db}_{count}$, by line 9 of Support Counting (I) in Fig. 8. (b) If $s \subseteq ds^{DB} \wedge s \in X_{k(DB)}$, similar to (ii)-(a), the support count is already read in $s^{DB}_{count}$ so that no counting is required. (c) If $s \subseteq ds^{DB} \wedge s \in X_{k(DB)'}$, similar to (ii)-(b), we calculate $s^{DB}_{count}$ by line 3 of Support Counting (II) in Fig. 8. Again, $X_{k(DB)'}$ is reduced to $X_k'$ by Lemma 4 here.

The separate counting considers all the data sequences in UD as described here. Next, we show that the supports of all candidates are calculated. By Lemma 1, the support count of s in UD is the sum of $s^{DB}_{count}$ and $s^{db}_{count}$.

(iv) For any candidate s in $X_{k(DB)}$: The $s^{DB}_{count}$ is from (ii)-(a) and (iii)-(b), and the $s^{db}_{count}$ is accumulated by (i) and (iii)-(a).

(v) For any candidate s in $X_{k(DB)'}$: The $s^{DB}_{count}$ is counted by (ii)-(b) and (iii)-(c), and the $s^{db}_{count}$ is counted by (i) and (iii)-(a). The separate counting is complete. □

Lemma 4. The candidates required for checking against the data sequences in DB in Support Counting (II) form the set $X_k'$, where $X_k' = X_k - \{s \mid s \in S^{DB}_k\} - \{s \mid s^{db}_{count} < minsup \times (|UD| - |DB|)\}$.

Proof. Since $UD = UD_{DB} \cup UD_{db} \cup UD_{Dd}$ and $UD_{db}$ contains no data sequence in DB, the data sequences concerned are in $UD_{DB}$ and $UD_{Dd}$. Consider a candidate s.

(i) If $s \in S^{DB}_k$: For any data sequence $ds \in UD_{DB}$ or $ds \in UD_{Dd} \wedge s \subseteq ds^{DB}$, s was counted while discovering $S^{DB}_k$. For $ds \in UD_{Dd} \wedge s \not\subseteq ds^{DB}$, the increase in support count $s^{db}_{count}$ is accumulated by line 9 of Support Counting (I). Therefore, in Support Counting (II), we can exclude any candidate s which is also in $S^{DB}_k$.

(ii) If $s \notin S^{DB}_k$: After Support Counting (I), the $s^{db}_{count}$ now contains the support count counted for data sequence ds, where $ds \in UD_{db}$ or $ds \in UD_{Dd} \wedge s \not\subseteq ds^{DB}$. By Lemma 2, if the $s^{db}_{count}$ is less than $minsup \times (|UD| - |DB|)$, this candidate s cannot be frequent in UD. Therefore, such a candidate s could be filtered out.

By (i) and (ii), we have $X_k' = X_k - \{s \mid s \in S^{DB}_k\} - \{s \mid s^{db}_{count} < minsup \times (|UD| - |DB|)\}$. □

References

[1] R. Agrawal, R. Srikant, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data Engineering, 1995, pp. 3–14.

[2] D. Gunopulos, R. Khardon, H. Mannila, H. Toivonen, Data mining, hypergraph transversals, and machine learning, in: Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1997, pp. 209–216.

[3] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.C. Hsu, FreeSpan: Frequent pattern-projected sequential pattern mining, in: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 355–359.

[4] M.Y. Lin, S.Y. Lee, Incremental update on sequential patterns in large databases, in: Proceedings of the 10th IEEE International Conference on Tools with Artificial Intelligence, 1998, pp. 24–31.

[5] S. Parthasarathy, M.J. Zaki, M. Ogihara, S. Dwarkadas, Incremental and interactive sequence mining, in: Proceedings of the 8th International Conference on Information and Knowledge Management, 1999, pp. 251–258.

[6] T. Shintani, M. Kitsuregawa, Mining algorithms for sequential patterns in parallel: Hash based approach, in: Proceedings of the Second Pacific–Asia Conference on Knowledge Discovery and Data Mining, 1998, pp. 283–294.

[7] R. Srikant, R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, in: Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, 1996, pp. 3–17.

[8] M.J. Zaki, Efficient enumeration of frequent sequences, in: Proceedings of the 7th International Conference on Information and Knowledge Management, 1998, pp. 68–75.

[9] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A.I. Verkamo, Fast discovery of association rules, in: U.M. Fayyad, G. Piatesky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, 1996, pp. 307–328.

[10] S. Brin, R. Motwani, J. Ullman, S. Tsur, Dynamic itemset counting and implication rule for market basket data, in: Proceedings of the 1997 SIGMOD Conference on Management of Data, 1997, pp. 255–264.

[11] D.W.L. Cheung, J. Han, V. Ng, C.Y. Wong, Maintenance of discovered association rules in large databases: An incremental updating technique, in: Proceedings of the 16th International Conference on Data Engineering, 1996, pp. 106–114.


[12] J.S. Park, M.S. Chen, P.S. Yu, Using a hash-based method with transaction trimming for mining association rules, IEEE Trans. Knowledge Data Eng. 9 (5) (1997) 813–825.

[13] I. Tsoukatos, D. Gunopulos, Efficient mining of spatio-temporal patterns, in: Proceedings of the 7th International Symposium of Advances in Spatial and Temporal Databases, 2001, pp. 425–442.

[14] J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, M.C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth, in: Proceedings of 2001 International Conference on Data Engineering, 2001, pp. 215–224.

[15] S.J. Yen, A.L.P. Chen, An efficient approach to discovering knowledge from large databases, in: Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, 1996, pp. 8–18.

[16] H. Mannila, H. Toivonen, A.I. Verkamo, Discovery of frequent episodes in event sequences, Data Mining Knowledge Discovery 1 (3) (1997) 259–289.

[17] D.W.L. Cheung, S.D. Lee, B. Kao, A general incremental technique for maintaining discovered association rules, in: Proceedings of the 5th International Conference on Database Systems for Advanced Applications, 1997, pp. 185–194.

[18] K. Wang, Discovering patterns from large and dynamic sequential data, J. Intell. Inform. Syst. 9 (1) (1997) 33–56.