
Mining sequential patterns across multiple sequence databases


Wen-Chih Peng *, Zhung-Xun Liao

Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, ROC

Article info

Article history: Received 26 April 2008; received in revised form 11 April 2009; accepted 13 April 2009; available online 7 May 2009

Keywords: Data mining; Sequential pattern mining; Multi-domain sequential patterns

Abstract

In this paper, given a set of sequence databases across multiple domains, we aim at mining multi-domain sequential patterns, where a multi-domain sequential pattern is a sequence of events whose occurrence time is within a pre-defined time window. We first propose algorithm Naive in which multiple sequence databases are joined as one sequence database for utilizing traditional sequential pattern mining algorithms (e.g., PrefixSpan). Due to the nature of join operations, algorithm Naive is costly and is developed for comparison purposes. Thus, we propose two algorithms without any join operations for mining multi-domain sequential patterns. Explicitly, algorithm IndividualMine derives sequential patterns in each domain and then iteratively combines sequential patterns among sequence databases of multiple domains to derive candidate multi-domain sequential patterns. However, not all sequential patterns mined in the sequence database of each domain are able to form multi-domain sequential patterns. To avoid the mining cost incurred in algorithm IndividualMine, algorithm PropagatedMine is developed. Algorithm PropagatedMine first performs one sequential pattern mining from one sequence database. In light of sequential patterns mined, algorithm PropagatedMine propagates sequential patterns mined to other sequence databases. Furthermore, sequential patterns mined are represented as a lattice structure for further reducing the number of sequential patterns to be propagated. In addition, we develop some mechanisms to allow some empty sets in multi-domain sequential patterns. Performance of the proposed algorithms is comparatively analyzed and sensitivity analysis is conducted. Experimental results show that by exploring propagation and lattice structures, algorithm PropagatedMine outperforms algorithm IndividualMine in terms of efficiency (i.e., the execution time).

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Sequential pattern mining has attracted a considerable amount of research effort recently [3,4,9,23,24]. Given a sequence database that contains a set of sequences and a user-specified threshold (the minimum support), the main task of sequential pattern mining is to discover frequent subsequences that appear in a sufficient number of sequences. Since sequential pattern mining is able to discover temporal relationships (i.e., the order of events), a significant amount of research work has elaborated on developing novel approaches to discover sequential patterns for a variety of applications [7,10,19,22,25,26].

Note that prior works only mine sequential patterns in one sequence database. This sequence database consists of sequences of events in one domain. For example, given a sequence database of purchases in a supermarket, frequent purchasing behavior is discovered. In many real-world applications, we may have events in multiple domains. Consider payment lists of credit cards, where a user uses a credit card for a variety of services, such as payments for restaurants, food, books and movies. These payments are referred to as events in different domains. For each domain, we could extract these events and build the corresponding sequence database. Then, one could utilize sequential pattern mining to discover frequent sequences. For example, in the movie domain, one sequential pattern is that users watch a series of movies related to Harry Potter. On the other hand, in the book domain, one sequential pattern is that users buy a series of novels related to Harry Potter. From these two sequential patterns of two domains, one could derive a composite sequential pattern across two domains (referred to as a multi-domain sequential pattern) if the events in these two sequential patterns occur closely together (i.e., the events occur within the same time window). For the above example, Fig. 1 shows that these two sequences are sequential patterns in the movie domain and the book domain, respectively. Moreover, the corresponding times of events in these two sequences are within the same time window. In this paper, we claim that discovering sequential patterns from multiple domains provides a unique way to reveal complex relationships across multiple domains. In Fig. 1, one could infer that users are likely to buy Harry Potter novels because they are motivated by the movies. Furthermore, this multi-domain sequential pattern also implies that users who go to watch the movies are likely to be triggered by the books they bought. To reveal more information from sequence databases across multiple domains, multi-domain sequential patterns are very useful. Depending on the requirements of applications, one could decide which domains should be involved in mining multi-domain sequential patterns. In our above example, if a user wants to know the cross-relationship between the book and movie domains, the sequence databases of these two domains are given. Consequently, such a multi-domain sequential pattern captures the cross-relationship among multiple domains, which in turn can yield significant information and reveal more knowledge.

* Corresponding author. E-mail addresses: [email protected] (W.-C. Peng), [email protected] (Z.-X. Liao).

Data & Knowledge Engineering. Contents lists available at ScienceDirect. 0169-023X/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2009.04.009

Given a set of sequence databases across multiple domains, we aim at mining multi-domain sequential patterns, where a multi-domain sequential pattern is a sequence of events whose occurrence time is within a pre-defined time window. With a set of sequence databases, these sequence databases could be joined into one sequence database according to the time information of sequences. Then, by exploring traditional sequential pattern mining algorithms, we could obtain multi-domain sequential patterns as well. This method is referred to as algorithm Naive in our paper and the details of this algorithm are presented later. However, there are three drawbacks in this algorithm: (1) integrating sequence databases of multiple domains into a single sequence database incurs a considerable cost due to the nature of join operations, (2) each sequence becomes longer and the number of items becomes huge after the join operations, and (3) the sequential patterns mined must be further verified to determine whether they qualify as multi-domain sequential patterns. Hence, the above algorithm unavoidably exhibits poor efficiency and scalability, which calls for the design of efficient mining algorithms for multi-domain sequential patterns.

To avoid the above performance issues, in this paper, we first propose algorithm IndividualMine, in which the sequential patterns in each sequence database are mined first and then the sequential patterns in each domain are integrated as candidate multi-domain sequential patterns. Clearly, algorithm IndividualMine is able to avoid join operations among sequence databases. In Fig. 1, it can be seen that in the movie domain (respectively, book domain), we could have one sequential pattern, a series of movies (respectively, books) related to Harry Potter. By checking the corresponding times of events in these two sequential patterns, we could combine these two sequential patterns into one multi-domain sequential pattern since the events in these two sequential patterns have close occurrence times. It is possible that sequential patterns from each sequence database cannot be formed into multi-domain sequential patterns because the occurrence times of their events are not close. Though it avoids join operations, algorithm IndividualMine is likely to suffer from mining cost since the sequential patterns in each sequence database must be discovered. Consequently, we propose algorithm PropagatedMine to further reduce the mining cost in each sequence database. Algorithm PropagatedMine first performs one sequential pattern mining from one sequence database. In light of the sequential patterns mined, algorithm PropagatedMine propagates the time information (referred to as the time-instance set) of the sequential patterns mined to other sequence databases. By utilizing time-instance sets, we are able to extract a subset of sequences from sequence databases, where the subset of sequences has the same time information. As such, only a limited number of sequences that are likely to form multi-domain sequential patterns are extracted. Furthermore, the sequential patterns mined are represented as a lattice structure to further reduce the number of time-instance sets propagated to other sequence databases. In addition, we develop some mechanisms to allow some empty sets in multi-domain sequential patterns. Performance of the proposed algorithms is comparatively analyzed and sensitivity analysis is conducted. Our simulation results show that both algorithms IndividualMine and PropagatedMine perform better than algorithm Naive. By exploring propagation and lattice structures, algorithm PropagatedMine outperforms algorithms IndividualMine and Naive in terms of efficiency (i.e., the execution time).

[Fig. 1 (image not recovered). Recoverable labels: rows Movie and Book; titles Philosopher's Stone, Chamber of Secrets, Prisoner of Azkaban; a time axis marked Time $t_i$, Time $t_j$, Time $t_k$.]

The remainder of the paper is organized as follows. In Section 2, we present existing research works on mining sequential patterns. In Section 3, some notations and the problem definition are given. Our proposed algorithms are described in Section 4. The performance study and experimental results are shown in Section 5. Section 6 concludes this paper.

2. Related works

A significant amount of research effort has been devoted to sequential pattern mining [6,11,16,28–31]. The problem of sequential pattern mining was first formulated in [3], and the authors in [3] proposed mining algorithms based on the Apriori algorithm. Algorithm GSP [27] was developed for mining sequential patterns using a breadth-first search and bottom-up method, whereas algorithm SPADE [33] employed a depth-first search and bottom-up method with an ID-list. The authors in [13,23,24] exploited the projection concept to reduce the amount of data for sequential pattern mining. To prevent candidate generation, DISC-all [8] used a novel sequence comparison strategy. A progressive concept has been explored in mining sequential patterns to capture the dynamic nature of data addition and deletion [15]. The above research works focus on improving the performance of traditional sequential pattern mining.

Some variations and applications of sequential patterns have been proposed recently. We mention in passing that the authors in [25] developed an approach to mine multi-dimensional sequential patterns, in which sequential patterns indicate not only frequent sequences but also some attributes in the category dimensions. In [25], the sequence database consists of category attributes and sequence attributes, and Table 1 shows an example of a multi-dimensional sequence database. Clearly, the problem of mining multi-domain sequential patterns is very different from the problem in [25] in terms of the input and output of the problem definitions. In this paper, the input is the set of sequence databases and the output is the set of multi-domain sequential patterns that consist of sequences of co-occurring events across sequence databases. The authors in [32] explore the concept of mining sequential patterns from multi-dimensional sequences. However, the output of their proposed method is a set of high-dimensional sequential patterns, which consist of sequential patterns from multiple sequence datasets. Moreover, in their proposed algorithm, time information is not associated with each event, and thus the events from multiple dimensions of a high-dimensional sequential pattern are not co-occurring events. As such, a high-dimensional sequential pattern is intrinsically different from our proposed multi-domain sequential pattern. Furthermore, the problem in [5] is to discover events that occur together. In contrast, our problem is that, given a set of sequence databases, we intend to discover sequences consisting of co-occurring events. Moreover, the authors in [12] proposed the problem of distributed sequential pattern mining, where each set of co-occurring events is complete and sequences are separated into different databases. Similarly, the problem in [17] is indeed a distributed sequential pattern mining problem, and the authors in [17] exploited the concepts of approximate patterns and local clustering to avoid noise and a large number of local patterns. As pointed out earlier, given a payment list of credit cards, we could divide payments into several domains according to payment services. Thus, our problem of mining multi-domain sequential patterns is not the same as distributed mining of sequential patterns.

To the best of our knowledge, previous studies have not adequately explored multi-domain sequential patterns, let alone proposed efficient algorithms for mining such sequential patterns. The contributions of this paper are twofold: (1) exploiting novel and useful sequential patterns (i.e., multi-domain sequential patterns), and (2) devising algorithms IndividualMine and PropagatedMine to efficiently mine multi-domain sequential patterns. Our preliminary works were presented in [20,21]. In this paper, more detailed complexity and theoretical analyses are conducted. Also, we develop some mechanisms in each proposed algorithm to allow multi-domain sequential patterns with some empty sets in some domains. In particular, by exploring lattice structures, algorithm PropagatedMine is able to further reduce the number of candidate multi-domain sequential patterns. Furthermore, an extensive performance study is conducted and sensitivity analysis is investigated on several parameters for each algorithm.

Table 1

An example of multi-dimensional sequence database in [25].

| Cid | Cust-grp | City | Age-grp | Sequence |
|---|---|---|---|---|
| 10 | Business | Boston | Middle | $\langle (bd)(c)(b)(a) \rangle$ |
| 20 | Professional | Chicago | Young | $\langle (bf)(ce)(fg) \rangle$ |
| 30 | Business | Chicago | Middle | $\langle (ah)(a)(b)(f) \rangle$ |
| 40 | Education | New York | Retired | $\langle (be)(ce) \rangle$ |

Table 2

An example of sequence databases in two domains.

Domain database $D_1$

| ID | Time sequences | Sequences |
|---|---|---|
| $s_1$ | $\langle (T_1)(T_2)(T_3)(T_4) \rangle$ | $\langle (a)(b,c)(b,c,d)(e) \rangle$ |
| $s_2$ | $\langle (T_5)(T_6)(T_7) \rangle$ | $\langle (a,b)(b,c)(c,e) \rangle$ |
| $s_3$ | $\langle (T_{10})(T_{12})(T_{13}) \rangle$ | $\langle (a,e)(h)(g,j) \rangle$ |
| $s_4$ | $\langle (T_{21})(T_{22})(T_{23})(T_{24}) \rangle$ | $\langle (a,b,f)(d)(b,c)(e,f) \rangle$ |

Domain database $D_2$

| ID | Time sequences | Sequences |
|---|---|---|
| $l_1$ | $\langle (T_{21})(T_{22})(T_{23})(T_{24}) \rangle$ | $\langle (1,2,5)(7)(2,3)(4,5,6) \rangle$ |
| $l_2$ | $\langle (T_{10})(T_{12})(T_{13}) \rangle$ | $\langle (1,6)(5)(9,10) \rangle$ |
| $l_3$ | $\langle (T_5)(T_6)(T_7) \rangle$ | $\langle (1,3)(2,4)(8) \rangle$ |
| $l_4$ | $\langle (T_1)(T_2)(T_3)(T_4) \rangle$ | $\langle (1,2)(2,3)(6)(4,5) \rangle$ |

3. Preliminaries

Assume that each domain has its own set of items and a sequence database. The problem of mining multi-domain sequential patterns is that, given a set of sequence databases, we aim at discovering sequential patterns that consist of co-occurring events across multiple domains. Table 2 shows two domains, each with its own sequence database, where for each sequence the corresponding time sequence indicates the occurrence times of its events. For example, in sequence $s_1$ in $D_1$, it can be seen that event $a$ occurs at $T_1$ and both $b$ and $c$ occur at $T_2$. By joining these two sequence databases via their time sequences, we could have one multi-domain sequence database (referred to as MDB). As such, Table 3 is an example of a multi-domain sequence database.
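The join described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes sequences are joined only when their time sequences are identical (as is the case in Table 2), and the names `D1`, `D2` and `join_by_time` are hypothetical. Row `l4` of $D_2$ is taken from the $D_2$ row of $S_1$ in Table 3.

```python
# Table 2 data: id -> (time sequence, sequence of itemsets).
# Assumption: row l4 is taken from the D2 row of S1 in Table 3.
D1 = {
    "s1": (("T1", "T2", "T3", "T4"), [{"a"}, {"b", "c"}, {"b", "c", "d"}, {"e"}]),
    "s2": (("T5", "T6", "T7"), [{"a", "b"}, {"b", "c"}, {"c", "e"}]),
    "s3": (("T10", "T12", "T13"), [{"a", "e"}, {"h"}, {"g", "j"}]),
    "s4": (("T21", "T22", "T23", "T24"), [{"a", "b", "f"}, {"d"}, {"b", "c"}, {"e", "f"}]),
}
D2 = {
    "l1": (("T21", "T22", "T23", "T24"), [{"1", "2", "5"}, {"7"}, {"2", "3"}, {"4", "5", "6"}]),
    "l2": (("T10", "T12", "T13"), [{"1", "6"}, {"5"}, {"9", "10"}]),
    "l3": (("T5", "T6", "T7"), [{"1", "3"}, {"2", "4"}, {"8"}]),
    "l4": (("T1", "T2", "T3", "T4"), [{"1", "2"}, {"2", "3"}, {"6"}, {"4", "5"}]),
}


def join_by_time(*databases):
    """Join sequences that share an identical time sequence.

    Returns {time_sequence: [row_of_domain_1, row_of_domain_2, ...]}.
    """
    # Index every database by time sequence.
    by_time = [{times: seq for times, seq in db.values()} for db in databases]
    # Keep only the time sequences present in all domains.
    shared = set(by_time[0]).intersection(*by_time[1:])
    return {times: [index[times] for index in by_time] for times in sorted(shared)}


MDB = join_by_time(D1, D2)
for times, rows in MDB.items():
    print(times, rows)
```

Each value of `MDB` is a list with one row per domain, which corresponds to one multi-domain sequence of Table 3.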

To facilitate the presentation of multi-domain sequences, one sequence $s_i$ in domain $D_i$ is expressed by $\langle X_{i1}, X_{i2}, \ldots, X_{il} \rangle$, where $X_{ij}$ is the $j$th element of sequence $s_i$, and $l$ is the number of elements of $s_i$. Therefore, a multi-domain sequence across $k$ domains (abbreviated as $k$-domain sequence) is represented as $M = [s_1, s_2, \ldots, s_k]^T$ and is further denoted as

$$M = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1l} \\ X_{21} & X_{22} & \cdots & X_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ X_{k1} & X_{k2} & \cdots & X_{kl} \end{bmatrix},$$

where each column is a set of itemsets that occur within the same time window, denoted as $T_i$. A time sequence $TS(M)$ is represented as $TS(M) = \langle T_1, T_2, \ldots, T_l \rangle$ to indicate the occurrence time of $M$. The time window, a time interval, is determined in accordance with application requirements.

With the above representation of multi-domain sequences, we further define the length and the number of elements for multi-domain sequential patterns. Since a multi-domain sequence consists of multiple sequences from various domains, the length of a multi-domain sequence across k domains can be defined as follows:

Definition 1 (Length and number of elements). Let $M = [s_1, s_2, \ldots, s_k]^T$ be a $k$-domain sequence. The length of $M$, denoted as $|M|$, is the length of the longest sequence in multi-domain sequence $M$. Furthermore, the number of elements in a multi-domain sequence, expressed by $e(M)$, is the number of itemsets in the multi-domain sequence.

For example, given a 2-domain sequence $M = \begin{bmatrix} (a) & (b,c) & (b) \\ (1) & (2) & (1,2,3) \end{bmatrix}$, the length of $M$ is 5 because the longest sequence in $M$ is $\langle (1)(2)(1,2,3) \rangle$, and the number of elements is 3 (i.e., $e(M) = 3$).
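Definition 1 can be checked on this example with a short sketch. It assumes, consistently with the example, that the length of a single sequence is its total number of items and that $e(M)$ is the number of columns of $M$; the function names are illustrative only.

```python
def length(M):
    """|M|: item count of the longest row (domain sequence) of M."""
    return max(sum(len(itemset) for itemset in row) for row in M)


def num_elements(M):
    """e(M): number of itemsets in each row, i.e. the number of columns."""
    return len(M[0])


# The 2-domain sequence of the example: one list of itemsets per domain.
M = [
    [{"a"}, {"b", "c"}, {"b"}],
    [{"1"}, {"2"}, {"1", "2", "3"}],
]
print(length(M), num_elements(M))  # 5 3
```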

Once we have the definitions of the length and the number of elements of a multi-domain sequence, the containing relation among multi-domain sequences is defined as follows:

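Definition 2 (Containing relation). Suppose that we have two multi-domain sequences

$$M = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1b} \\ X_{21} & X_{22} & \cdots & X_{2b} \\ \vdots & \vdots & \ddots & \vdots \\ X_{a1} & X_{a2} & \cdots & X_{ab} \end{bmatrix} \quad \text{and} \quad N = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1b'} \\ Y_{21} & Y_{22} & \cdots & Y_{2b'} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{a1} & Y_{a2} & \cdots & Y_{ab'} \end{bmatrix},$$

where $e(M) \le e(N)$. $M$ is contained by $N$, denoted as $M \sqsubseteq N$, if and only if there exists an integer list $L(M, N)$, denoted as $\langle l_1, l_2, \ldots, l_b \rangle$, such that $1 \le l_1 < l_2 < \cdots < l_b \le b'$ and $X_{ij} \subseteq Y_{i l_j}$, where $1 \le i \le a$ and $1 \le j \le b$.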

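Table 3

An example of a multi-domain sequence database.

| ID | Time sequences | Multi-domain sequences |
|---|---|---|
| $S_1$ | $\langle (T_1)(T_2)(T_3)(T_4) \rangle$ | $\begin{bmatrix} (a) & (b,c) & (b,c,d) & (e) \\ (1,2) & (2,3) & (6) & (4,5) \end{bmatrix}$ |
| $S_2$ | $\langle (T_5)(T_6)(T_7) \rangle$ | $\begin{bmatrix} (a,b) & (b,c) & (c,e) \\ (1,3) & (2,4) & (8) \end{bmatrix}$ |
| $S_3$ | $\langle (T_{10})(T_{12})(T_{13}) \rangle$ | $\begin{bmatrix} (a,e) & (h) & (g,j) \\ (1,6) & (5) & (9,10) \end{bmatrix}$ |
| $S_4$ | $\langle (T_{21})(T_{22})(T_{23})(T_{24}) \rangle$ | $\begin{bmatrix} (a,b,f) & (d) & (b,c) & (e,f) \\ (1,2,5) & (7) & (2,3) & (4,5,6) \end{bmatrix}$ |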

For example, assume that $M = \begin{bmatrix} (a) & (b,c) \\ (2) & (6) \end{bmatrix}$ and $N = \begin{bmatrix} (a) & (b,c) & (b,c,d) & (e) \\ (1,2) & (2,3) & (6) & (4,5) \end{bmatrix}$. It can be verified that $M$ is contained by $N$ since there exists an integer list $L(M, N) = \langle 1, 3 \rangle$ such that $1 \le 1 < 3 \le 4$, and $(a) \subseteq (a)$, $(2) \subseteq (1,2)$, $(b,c) \subseteq (b,c,d)$ and $(6) \subseteq (6)$.
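The containment check of Definition 2 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `contains` is hypothetical, and it uses greedy leftmost matching, which returns one valid integer list whenever any exists (other valid lists may exist as well).

```python
def contains(M, N):
    """Return an integer list L(M, N) (1-based) if M is contained by N, else None.

    M and N are lists of rows (one per domain); each row is a list of sets.
    """
    if len(M) != len(N):
        return None
    b, b_prime = len(M[0]), len(N[0])
    positions, col = [], 0
    for j in range(b):
        # Advance until every domain's itemset in column j of M fits column `col` of N.
        while col < b_prime and not all(M[i][j] <= N[i][col] for i in range(len(M))):
            col += 1
        if col == b_prime:
            return None  # ran out of columns in N
        positions.append(col + 1)
        col += 1  # the next column of M must match strictly later in N
    return positions


M = [[{"a"}, {"b", "c"}],
     [{"2"}, {"6"}]]
N = [[{"a"}, {"b", "c"}, {"b", "c", "d"}, {"e"}],
     [{"1", "2"}, {"2", "3"}, {"6"}, {"4", "5"}]]
print(contains(M, N))  # [1, 3]
```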

Based on the above definitions, a multi-domain sequence database is a set of multi-domain sequences. Consider the example of a multi-domain sequence database in Table 3, where the number of 2-domain sequences is 4. Given a multi-domain sequence database MDB, the support value of a multi-domain sequence $M$ is the number of multi-domain sequences in MDB that contain the multi-domain sequence $M$.

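Multi-domain sequential pattern mining: Given a set of sequence databases across multiple domains, one could join these sequence databases as one multi-domain sequence database. Then, the task of mining multi-domain sequential patterns is to derive the multi-domain sequences whose supports in MDB are no less than a user-specified minimum support threshold $\delta$. For example, for the multi-domain sequence database MDB in Table 3 and a minimum support $\delta = 3$, the multi-domain sequential patterns are $\begin{bmatrix} (a) \\ (1) \end{bmatrix}$, $\begin{bmatrix} (b) \\ (2) \end{bmatrix}$, $\begin{bmatrix} (b) \\ (3) \end{bmatrix}$, $\begin{bmatrix} (c) \\ (2) \end{bmatrix}$, $\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$, $\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}$, $\begin{bmatrix} (a) & (c) \\ (1) & (2) \end{bmatrix}$, and $\begin{bmatrix} (a) & (b,c) \\ (1) & (2) \end{bmatrix}$.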
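The supports of the patterns listed above can be recomputed directly from Table 3. The sketch below is a brute-force check under the stated definition (support = number of multi-domain sequences in MDB that contain the pattern); the helpers `row`, `is_contained` and `support` are hypothetical names introduced only for this illustration.

```python
def row(*itemsets):
    """Build one domain row, e.g. row("a", "b c") -> [{"a"}, {"b", "c"}]."""
    return [set(s.split()) for s in itemsets]


# Table 3: each multi-domain sequence is [D1 row, D2 row].
MDB = [
    [row("a", "b c", "b c d", "e"), row("1 2", "2 3", "6", "4 5")],        # S1
    [row("a b", "b c", "c e"), row("1 3", "2 4", "8")],                    # S2
    [row("a e", "h", "g j"), row("1 6", "5", "9 10")],                     # S3
    [row("a b f", "d", "b c", "e f"), row("1 2 5", "7", "2 3", "4 5 6")],  # S4
]


def is_contained(M, N):
    """True if M is contained by N; the shared iterator enforces increasing columns."""
    cols = iter(range(len(N[0])))
    return all(
        any(all(M[i][j] <= N[i][c] for i in range(len(M))) for c in cols)
        for j in range(len(M[0]))
    )


def support(M, mdb):
    """Number of multi-domain sequences in mdb that contain M."""
    return sum(is_contained(M, N) for N in mdb)


patterns = [
    [row("a"), row("1")], [row("b"), row("2")], [row("b"), row("3")],
    [row("c"), row("2")], [row("b c"), row("2")],
    [row("a", "b"), row("1", "2")], [row("a", "c"), row("1", "2")],
    [row("a", "b c"), row("1", "2")],
]
for p in patterns:
    print(p, support(p, MDB))
```

Every pattern in the list reaches a support of at least 3 on this data, which agrees with the example.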

Notice that joining these multiple sequence databases is costly due to the nature of join operations. It can be verified that multi-domain sequential patterns contain sequential patterns in each domain. For example, $\begin{bmatrix} (a) & (b,c) \\ (1) & (2) \end{bmatrix}$ is a multi-domain sequential pattern, where $\langle (a)(b,c) \rangle$ (respectively, $\langle (1)(2) \rangle$) is a sequential pattern in domain $D_1$ (respectively, $D_2$) and the events in $\langle (a)(b,c) \rangle$ and $\langle (1)(2) \rangle$ have the same time sequences. Thus, in this paper, we propose algorithms to discover multi-domain sequential patterns without joining.

4. Algorithms of mining multi-domain sequential patterns

In this section, we first describe one Naive method in which multiple sequence databases are joined as one sequence database, and multi-domain sequential patterns are derived by using traditional sequential pattern mining algorithms (e.g., PrefixSpan [23,24]). As pointed out earlier, to avoid the overhead of joining multiple sequence databases, we then propose algorithm IndividualMine, in which the sequential patterns in each sequence database are mined and then merged into possible multi-domain sequential patterns. Furthermore, to reduce the cost of mining sequential patterns in each sequence database, algorithm PropagatedMine is proposed. By propagating sequential patterns to other sequence databases, the number of sequences to be mined in the other sequence databases is reduced. In addition, the above three algorithms could be extended to discover multi-domain sequential patterns with some empty sets in some domains.

4.1. Naive algorithm with one multi-domain sequence database

As mentioned earlier, to mine multi-domain sequential patterns, one Naive method is to join sequence databases into one multi-domain sequence database. Then, this multi-domain sequence database is transformed such that the Naive algorithm could utilize existing sequential pattern mining algorithms. Consequently, the Naive algorithm has two steps: the joining step and the mining step. In the joining step, multiple sequence databases are first joined together by their time sequences and the result is then transformed into a single sequence database. In the mining step, one could utilize existing sequential pattern mining algorithms to derive sequential patterns. In light of the sequential patterns mined, we have to separate the items from different domains and derive multi-domain sequential patterns. The detailed steps are described as follows:

Step 1: joining step: In the beginning, sequence databases are joined by their time sequences to form one multi-domain sequence database. For example, Table 3 is derived by performing the join process on the two sequence databases in Table 2. It can be verified that $s_1$ in sequence database $D_1$ and $l_4$ in sequence database $D_2$ are joined as one sequence $S_1$ in Table 3. With the multi-domain sequence database derived, one should transform this multi-domain sequence database into one sequence database. Explicitly, in Table 3, for each sequence, the time sequences are deleted and each multi-domain sequence is viewed as one sequence. Table 4 is an example of a sequence database transformed from Table 3. It can be seen that in sequence $S_1$ in Table 4, co-occurring events from multiple domains are viewed as one event. For example, $(a, 1, 2)$ comes from $\begin{bmatrix} (a) \\ (1,2) \end{bmatrix}$ in sequence $S_1$ of Table 3.
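The transformation from Table 3 to Table 4 amounts to taking, for each column, the union of the itemsets of all domains. A minimal sketch follows; the function name `flatten` is hypothetical, and the sketch assumes that the item alphabets of the domains are disjoint, as they are in Table 2.

```python
def flatten(multi_domain_sequence):
    """Merge the co-occurring itemsets of all domains into one itemset per column."""
    return [set().union(*column) for column in zip(*multi_domain_sequence)]


# S1 of Table 3: one row of itemsets per domain.
S1 = [
    [{"a"}, {"b", "c"}, {"b", "c", "d"}, {"e"}],   # D1 row
    [{"1", "2"}, {"2", "3"}, {"6"}, {"4", "5"}],   # D2 row
]
print(flatten(S1))
```

The first element of the result is the itemset $(a, 1, 2)$ mentioned in the text.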

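Table 4

An example of a transformed sequence database.

| ID | Sequences |
|---|---|
| $S_1$ | $\langle (a,1,2)(b,c,2,3)(b,c,d,6)(e,4,5) \rangle$ |
| $S_2$ | $\langle (a,b,1,3)(b,c,2,4)(c,e,8) \rangle$ |
| $S_3$ | $\langle (a,e,1,6)(h,5)(g,j,9,10) \rangle$ |
| $S_4$ | $\langle (a,b,f,1,2,5)(d,7)(b,c,2,3)(e,f,4,5,6) \rangle$ |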

Step 2: mining step: According to the sequence database derived in Step 1, by exploiting traditional sequential pattern mining algorithms, we could derive sequential patterns. The second column of Table 5 shows some examples of sequential patterns mined from the sequence database in Table 4 with the minimum support set to 3. However, even if a sequence database is obtained, traditional sequential pattern mining algorithms are not directly able to mine multi-domain sequential patterns, because several of the sequential patterns mined do not contain events from all domains. Thus, each sequential pattern should be represented as a multi-domain sequential pattern. Then, we could verify whether the multi-domain sequential pattern consists of events from all domains or not. For example, the third column of Table 5 shows the multi-domain sequential patterns derived from the second column of Table 5. Since we have all events of all domains, it is straightforward to represent sequential patterns as multi-domain sequential patterns. It can be seen in Table 5 that $P_1$, $P_2$ and $P_6$ have some empty sets; these patterns are referred to as multi-domain sequential patterns with empty sets (abbreviated as relaxed multi-domain sequential patterns). On the other hand, $P_3$, $P_4$ and $P_5$ are called strong multi-domain sequential patterns since all co-occurring events are from all domains required.
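Separating a mined pattern into its per-domain rows, and then telling relaxed patterns from strong ones, can be sketched as below. The item sets `D1_ITEMS` and `D2_ITEMS` and both function names are assumptions made for this illustration; they rely on each domain having its own known set of items.

```python
# Assumed item alphabets of the two domains of Table 2.
D1_ITEMS = set("abcdefghj")
D2_ITEMS = {str(i) for i in range(1, 11)}


def to_multi_domain(pattern, domain_items):
    """Split each element of a flat pattern into one itemset per domain."""
    return [[element & items for element in pattern] for items in domain_items]


def is_strong(multi_domain_pattern):
    """Strong: no itemset of any domain is empty; otherwise the pattern is relaxed."""
    return all(itemset for domain_row in multi_domain_pattern for itemset in domain_row)


P1 = [{"1"}, {"b", "2"}, {"e"}]      # <(1)(b,2)(e)> from Table 5
P3 = [{"a", "1"}, {"c", "2"}]        # <(a,1)(c,2)> from Table 5

for name, p in (("P1", P1), ("P3", P3)):
    m = to_multi_domain(p, [D1_ITEMS, D2_ITEMS])
    print(name, m, "strong" if is_strong(m) else "relaxed")
```

Here `P1` yields an empty itemset in the first column of the $D_1$ row and in the last column of the $D_2$ row, so it is classified as relaxed, while `P3` is strong.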

Algorithm Naive needs to perform join operations among multiple sequence databases, and these join operations make it inefficient. Furthermore, in order to utilize traditional sequential pattern mining algorithms, one sequence database is derived by transforming the multi-domain sequence database joined from the sequence databases. Clearly, with events from all domains, this sequence database contains long sequences, which makes mining sequential patterns inefficient. Given these two drawbacks of algorithm Naive, we develop two efficient algorithms for mining multi-domain sequential patterns without joining sequence databases.

4.2. Algorithm IndividualMine: mining patterns in each domain

In this section, we develop algorithm IndividualMine. Fig. 2 shows the overview of algorithm IndividualMine, which consists of two phases: the mining phase and the checking phase. In the mining phase, sequential patterns in each sequence database are first mined by utilizing sequential pattern mining algorithms (e.g., PrefixSpan [23,24]).

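Fig. 2. An overview of algorithm IndividualMine. [Image not recovered. Recoverable labels: $D_1$, $D_2$, ..., $D_n$; Sequential Pattern Mining; Sequential Patterns; Mining Phase; Compare time-instance sets to check support values; Checking Phase; Results.]

Table 5

An example of a transformed sequence database.

| Pattern ID | Sequential patterns | Multi-domain sequential patterns |
|---|---|---|
| $P_1$ | $\langle (1)(b,2)(e) \rangle$ | $\begin{bmatrix} \emptyset & (b) & (e) \\ (1) & (2) & \emptyset \end{bmatrix}$ |
| $P_2$ | $\langle (a,1)(5) \rangle$ | $\begin{bmatrix} (a) & \emptyset \\ (1) & (5) \end{bmatrix}$ |
| $P_3$ | $\langle (a,1)(c,2) \rangle$ | $\begin{bmatrix} (a) & (c) \\ (1) & (2) \end{bmatrix}$ |
| $P_4$ | $\langle (b,3) \rangle$ | $\begin{bmatrix} (b) \\ (3) \end{bmatrix}$ |
| $P_5$ | $\langle (b,c,2) \rangle$ | $\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$ |
| $P_6$ | $\langle (b,c,2)(e) \rangle$ | $\begin{bmatrix} (b,c) & (e) \\ (2) & \emptyset \end{bmatrix}$ |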

In the checking phase, sequential patterns from all domains are combined to generate candidate multi-domain sequential patterns. If a candidate multi-domain sequential pattern has a support value no less than the minimum support threshold, this candidate multi-domain sequential pattern is a multi-domain sequential pattern. How the support of candidate multi-domain sequential patterns is counted will be described later.

Without loss of generality, given $k$ sequence databases, we intend to derive multi-domain sequential patterns across $k$ domains. Furthermore, we denote the set of $k$ sequence databases as $\{D_1, D_2, \ldots, D_k\}$, and $SP_i$ as the set of $i$-domain sequential patterns across a set of $i$ sequence databases (i.e., $\{D_1, D_2, \ldots, D_i\}$). To derive $k$-domain sequential patterns, we should start with the sequential patterns from one domain and progressively compose them with sequential patterns from other domains until the number of domains is $k$. Hence, the sequential patterns mined in $D_1$ are first put in the set $SP_1$. Then, for each pattern in $SP_1$, candidate 2-domain sequential patterns (across the two domains $D_1$ and $D_2$) are generated by combining it with sequential patterns in domain $D_2$. For example, given a minimum support of 3, in our above example in Table 2, $\langle (a)(b) \rangle$ is a sequential pattern and is put in the set $SP_1$. Also, $\langle (1)(2) \rangle$ is one sequential pattern in $D_2$. Consequently, we could have a candidate 2-domain sequential pattern $\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}$.

After generating candidate multi-domain sequential patterns, their support values should be determined. As can be seen in Table 2, each sequence is associated with its own time sequence. Thus, one could use time sequences to derive support values. Explicitly, the time-instance set of sequence $M$ is defined as follows:

Definition 3 (Time-instance set). Let MDB be a $k$-domain sequence database and $M$ be a $k$-domain sequence. The time-instance set of $M$ is defined as $TIS(M) = \{\langle TS(N) : L(M, N) \rangle \mid N \in MDB \text{ and } M \sqsubseteq N\}$.

Based on the above definition, for a candidate multi-domain sequential pattern, we could determine its support value by evaluating the intersection of the time-instance sets of its sequential patterns. For example, to determine the support of $\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}$, we should check the time-instance sets of both $\langle (a)(b) \rangle$ and $\langle (1)(2) \rangle$ in Table 2. It can be seen in Table 2 that the time-instance set of $\langle (a)(b) \rangle$ is $\{\langle (T_1)(T_2)(T_3)(T_4) : 1, 2 \rangle, \langle (T_1)(T_2)(T_3)(T_4) : 1, 3 \rangle, \langle (T_5)(T_6)(T_7) : 1, 2 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1, 3 \rangle\}$. Moreover, we could have $TIS(\langle (1)(2) \rangle)$ as $\{\langle (T_1)(T_2)(T_3)(T_4) : 1, 2 \rangle, \langle (T_5)(T_6)(T_7) : 1, 2 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1, 3 \rangle\}$. Thus, the support of the candidate 2-domain sequential pattern $\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}$ is represented by

$$TIS\left(\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}\right) = \{\langle (T_1)(T_2)(T_3)(T_4) : 1, 2 \rangle, \langle (T_5)(T_6)(T_7) : 1, 2 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1, 3 \rangle\}.$$

Therefore, $Support\left(\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}\right) = \left| TIS\left(\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}\right) \right| = 3$. Given a minimum support threshold of 3, $\begin{bmatrix} (a) & (b) \\ (1) & (2) \end{bmatrix}$ is a 2-domain sequential pattern, since its support value is not less than the minimum support. Consequently, through the time-instance sets, support values for candidate multi-domain sequential patterns are derived.
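The time-instance sets of this example can be reproduced with a brute-force sketch that enumerates every increasing integer list at which a pattern occurs. The helper names are hypothetical, and the $D_2$ entry with times $T_1, \ldots, T_4$ (sequence $l_4$) is assumed from the $D_2$ row of $S_1$ in Table 3.

```python
from itertools import combinations


def row(*itemsets):
    """Build a sequence, e.g. row("a", "b c") -> [{"a"}, {"b", "c"}]."""
    return [frozenset(s.split()) for s in itemsets]


# Table 2 as (time sequence, sequence) pairs.
T_A = ("T1", "T2", "T3", "T4")
T_B = ("T5", "T6", "T7")
T_C = ("T10", "T12", "T13")
T_D = ("T21", "T22", "T23", "T24")
D1 = [(T_A, row("a", "b c", "b c d", "e")), (T_B, row("a b", "b c", "c e")),
      (T_C, row("a e", "h", "g j")), (T_D, row("a b f", "d", "b c", "e f"))]
# Assumption: the T_A entry of D2 is l4, taken from S1 in Table 3.
D2 = [(T_D, row("1 2 5", "7", "2 3", "4 5 6")), (T_C, row("1 6", "5", "9 10")),
      (T_B, row("1 3", "2 4", "8")), (T_A, row("1 2", "2 3", "6", "4 5"))]


def time_instance_set(pattern, database):
    """Collect every (time sequence, integer list) at which `pattern` occurs."""
    tis = set()
    for times, seq in database:
        # combinations() yields strictly increasing index tuples.
        for idx in combinations(range(len(seq)), len(pattern)):
            if all(p <= seq[i] for p, i in zip(pattern, idx)):
                tis.add((times, tuple(i + 1 for i in idx)))  # 1-based positions
    return tis


tis_ab = time_instance_set(row("a", "b"), D1)
tis_12 = time_instance_set(row("1", "2"), D2)
common = tis_ab & tis_12  # time instances shared by both domains
print(len(tis_ab), len(tis_12), len(common))  # 4 3 3
```

The size of `common` is the support of the candidate 2-domain pattern, matching the value 3 derived in the text.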

Once we have 2-domain sequential patterns, these 2-domain sequential patterns are put in the set $SP_2$. Then, for each pattern in $SP_2$, candidate 3-domain sequential patterns and their corresponding supports are generated by the same procedure. Given sequential patterns in $k$ domains, $k$-domain sequential patterns are derived by iteratively expanding one domain in each round until the number of rounds is $k$.

Algorithm: IndividualMine

Input: Sequence databases across $n$ domains $D_1, D_2, \ldots, D_n$, and minimum support $\delta$.

Output: Multi-domain sequential patterns across $n$ domains.

Begin
  Let $C_k$ be the set of candidate patterns across $k$ domains, where $k = 1, 2, \ldots, n$.
  Apply sequential pattern mining on each domain $D_i$, $i = 1, 2, \ldots, n$.
  Let $SP_1$ be the set of sequential patterns mined in $D_1$.
  For each domain $D_{i+1}$, $i = 1, 2, \ldots, n-1$
    For each $P \in SP_i$
      For each sequential pattern $Q$ of $D_{i+1}$
        If $e(Q) = e(P)$ Then append $\begin{bmatrix} P \\ Q \end{bmatrix}$ to $C_{i+1}$.
    For each candidate $c \in C_{i+1}$
      If $Support(c) \ge \delta$ Then append $c$ to $SP_{i+1}$.
  Output $= SP_n$.
End
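The checking loop of the pseudocode above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the per-domain sequential patterns are taken as given (the mining step with, e.g., PrefixSpan is not reproduced), candidate generation and support checking are merged into one pass, supports are obtained by intersecting time-instance sets, and all names are hypothetical. The $D_2$ entry with times $T_1, \ldots, T_4$ is assumed from Table 3.

```python
from itertools import combinations


def seq(*itemsets):
    """Build a hashable sequence: a tuple of frozensets, one per itemset string."""
    return tuple(frozenset(s.split()) for s in itemsets)


def tis(pattern, database):
    """Time-instance set: all (time sequence, 1-based integer list) occurrences."""
    return {
        (times, tuple(i + 1 for i in idx))
        for times, s in database
        for idx in combinations(range(len(s)), len(pattern))
        if all(p <= s[i] for p, i in zip(pattern, idx))
    }


def individual_mine(databases, patterns_per_domain, min_support):
    """patterns_per_domain[i] holds the patterns already mined in databases[i]."""
    # sp maps an i-domain pattern (tuple of rows) to its time-instance set.
    sp = {(p,): tis(p, databases[0]) for p in patterns_per_domain[0]}
    for db, patterns in zip(databases[1:], patterns_per_domain[1:]):
        next_sp = {}
        for P, tis_P in sp.items():
            for Q in patterns:
                if len(Q) != len(P[0]):      # require e(Q) = e(P)
                    continue
                shared = tis_P & tis(Q, db)  # time instances common to P and Q
                if len(shared) >= min_support:
                    next_sp[P + (Q,)] = shared
        sp = next_sp
    return sp


T_A, T_B = ("T1", "T2", "T3", "T4"), ("T5", "T6", "T7")
T_C, T_D = ("T10", "T12", "T13"), ("T21", "T22", "T23", "T24")
D1 = [(T_A, seq("a", "b c", "b c d", "e")), (T_B, seq("a b", "b c", "c e")),
      (T_C, seq("a e", "h", "g j")), (T_D, seq("a b f", "d", "b c", "e f"))]
D2 = [(T_D, seq("1 2 5", "7", "2 3", "4 5 6")), (T_C, seq("1 6", "5", "9 10")),
      (T_B, seq("1 3", "2 4", "8")), (T_A, seq("1 2", "2 3", "6", "4 5"))]

# A few frequent patterns of each domain, supplied by hand for the example.
mined = [[seq("a"), seq("e"), seq("a", "b")], [seq("1"), seq("2"), seq("1", "2")]]
result = individual_mine([D1, D2], mined, min_support=3)
for pattern, instances in result.items():
    print(pattern, len(instances))
```

With the hand-picked patterns above, only the 2-domain patterns built from $\langle (a) \rangle$ with $\langle (1) \rangle$ and from $\langle (a)(b) \rangle$ with $\langle (1)(2) \rangle$ survive the support check.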


Without joining, algorithm IndividualMine could still discover multi-domain sequential patterns. However, in algorithm IndividualMine, sequential pattern mining must be performed individually in each domain, which incurs a considerable amount of mining cost. Furthermore, the sequential patterns mined from each domain are not necessarily able to become multi-domain sequential patterns. Thus, to further reduce the cost of mining sequential patterns in each domain and the number of candidate multi-domain sequential patterns, we develop algorithm PropagatedMine, in which those sequences that are likely to form multi-domain sequential patterns are extracted from their sequence databases.

4.3. Algorithm PropagatedMine: propagating sequential patterns among domains

Algorithm PropagatedMine is designed to reduce the mining cost in each sequence database. Explicitly, algorithm PropagatedMine first performs sequential pattern mining in one domain (referred to as the starting domain) and then propagates the time-instance sets of the mined sequential patterns to other domains. By propagating time-instance sets, only those sequences that have the same time sequences as the time-instance sets are extracted, thereby reducing the mining space in each sequence database. Algorithm PropagatedMine iteratively propagates the time-instance sets of multi-domain sequential patterns to the next domain until all domains have been mined. Fig. 3 shows an overview of algorithm PropagatedMine, which has two phases: the mining phase and the deriving phase.
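The extraction step described above can be illustrated with a small sketch. It covers only the idea of filtering the next domain by propagated time sequences; how PropagatedMine organizes the extracted sequences afterwards is not modeled here. The function name is hypothetical, the time-instance set is the one given for $\langle (a)(b) \rangle$ in Section 4.2, and row `l4` is assumed from Table 3.

```python
def extract_by_time(tis, database):
    """Keep only the sequences whose time sequence appears in the propagated TIS."""
    wanted = {times for times, _ in tis}
    return {sid: entry for sid, entry in database.items() if entry[0] in wanted}


# TIS of <(a)(b)> in the starting domain D1.
tis_ab = {
    (("T1", "T2", "T3", "T4"), (1, 2)),
    (("T1", "T2", "T3", "T4"), (1, 3)),
    (("T5", "T6", "T7"), (1, 2)),
    (("T21", "T22", "T23", "T24"), (1, 3)),
}

# Domain D2 of Table 2 (l4 is an assumption, see the lead-in).
D2 = {
    "l1": (("T21", "T22", "T23", "T24"), [{"1", "2", "5"}, {"7"}, {"2", "3"}, {"4", "5", "6"}]),
    "l2": (("T10", "T12", "T13"), [{"1", "6"}, {"5"}, {"9", "10"}]),
    "l3": (("T5", "T6", "T7"), [{"1", "3"}, {"2", "4"}, {"8"}]),
    "l4": (("T1", "T2", "T3", "T4"), [{"1", "2"}, {"2", "3"}, {"6"}, {"4", "5"}]),
}

extracted = extract_by_time(tis_ab, D2)
print(sorted(extracted))  # ['l1', 'l3', 'l4']
```

Sequence `l2` is dropped because its time sequence never occurs in the propagated time-instance set, so it cannot contribute to a multi-domain pattern that starts from $\langle (a)(b) \rangle$.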

In the mining phase, PropagatedMine utilizes existing sequential pattern mining algorithms to discover sequential patterns in a starting domain (i.e., $D_1$) and then propagates these patterns to other domains. Note that the mined sequential patterns in the starting domain provide a guideline to extract multi-domain sequential patterns from other domains, and hence for mining multi-domain sequential patterns in sequence databases across multiple domains, the length and the number of elements of multi-domain sequences are constrained by sequential patterns mined in the starting domain. Consequently, sequential patterns mined in the starting domain could be represented as a lattice structure to facilitate the generation of candidate multi-domain sequential patterns across other domains.

For example, assume that the starting domain is set to $D_1$ in Table 2 and that sequential patterns are then found using existing sequential pattern mining algorithms with the same minimum support of 3. The mined sequential patterns are represented as a lattice structure in Fig. 4, where each node represents a sequential pattern, the linkages of nodes (or intradomain links) represent the containment relation, and nodes are ordered by the number of elements. In Fig. 4, nodes having the same number of elements are further arranged level by level according to their sequence lengths; for instance, nodes with one element are placed level by level in increasing order of sequence length. For example, $\langle (b,c) \rangle$ in Fig. 4 is below the nodes whose sequence length is 1 (e.g., $\langle (b) \rangle$). As mentioned above, the lattice structure is used as a guideline for propagating time-instance sets of sequential patterns to other domains. In the deriving phase, algorithm PropagatedMine extracts those sequences with occurrence times equal to those of the time-instance sets propagated. Thus, for each propagated time-instance set, we can build the corresponding propagated table as defined in Definition 4.

[Fig. 3. Overview of algorithm PropagatedMine: in the mining phase, sequential pattern mining is performed on $D_1$; in the deriving phase, the results are propagated through propagated tables for $D_2, \ldots, D_n$ to obtain multi-domain sequential patterns.]

Fig. 4. An example of lattice structures for sequential patterns in a starting domain (i.e., $D_1$ in Table 2). Number of elements = 1: <(a)>, <(b)>, <(c)>, <(b,c)>. Number of elements = 2: <(a)(b)>, <(a)(c)>, <(b)(b)>, <(b)(c)>, <(a)(b,c)>, <(b)(b,c)>.
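The level arrangement of the lattice in Fig. 4 can be sketched as a grouping by number of elements and then by sequence length. The snippet below is only an illustration under assumed conventions: a pattern is a tuple of elements, each element a tuple of items, and `lattice_levels` is a hypothetical helper name. It reproduces the ordering of nodes but not the intradomain links.

```python
from collections import defaultdict


def lattice_levels(patterns):
    """Group patterns by (number of elements, sequence length)."""
    levels = defaultdict(list)
    for pattern in patterns:
        n_elements = len(pattern)                      # e(P)
        length = sum(len(elem) for elem in pattern)    # |P|, total number of items
        levels[(n_elements, length)].append(pattern)
    # Sort levels so that fewer elements, then shorter sequences, come first.
    return dict(sorted(levels.items()))


# The ten sequential patterns listed for Fig. 4.
fig4_patterns = [
    (("a",),), (("b",),), (("c",),), (("b", "c"),),
    (("a",), ("b",)), (("a",), ("c",)), (("b",), ("b",)), (("b",), ("c",)),
    (("a",), ("b", "c")), (("b",), ("b", "c")),
]
levels = lattice_levels(fig4_patterns)
```

With this grouping, the pattern <(b,c)> lands in the level (1, 2), directly after the level (1, 1) that holds the patterns of sequence length 1.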

Definition 4 (Propagated table). Let $M$ be a $k$-domain sequential pattern. The propagated table of $M$ in sequence database $D_{k+1}$ is denoted as $D_{k+1}\|M = \{\langle S_i[l_1], S_i[l_2], \ldots, S_i[l_b] \rangle \mid \langle TS(S_i) : l_1, l_2, \ldots, l_b \rangle \in TIS(M), \text{ where } S_i \in D_{k+1}\}$, which consists of sequences that co-occurred with $M$. Furthermore, $D_{k+1}\|M$ is also a sequence database, and $\begin{bmatrix} M \\ S \end{bmatrix}$ is a $(k+1)$-domain sequential pattern if and only if $S$ is a sequential pattern of $D_{k+1}\|M$ and $e(S) = e(M)$ with the same minimum support threshold.
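As an illustration of Definition 4, the following Python sketch builds a propagated table from a time-instance set. The representation is an assumption: a time instance is a `(time sequence id, positions)` pair with 1-based positions, and the next domain's database is a dictionary from time sequence id to a list of itemsets. The function name `propagated_table` and the toy database `d2` are hypothetical; `d2` was chosen so that the output resembles the rows listed in Table 6, and it is not the actual content of Table 2.

```python
def propagated_table(tis, domain_db):
    """Extract, for each time instance, the elements of the co-occurring sequence."""
    table = []
    for ts_id, positions in sorted(tis):
        sequence = domain_db.get(ts_id)
        if sequence is None:  # no sequence with this time sequence in the domain
            continue
        # Positions are 1-based indices into the time sequence.
        table.append((ts_id, [sequence[l - 1] for l in positions]))
    return table


# Hypothetical domain database: time sequence id -> list of itemsets.
d2 = {
    "s1": [{1, 2}, {2, 3}, {6}, {7}],
    "s2": [{1, 3}, {2, 4}, {8}],
}
# Hypothetical time-instance set of a two-element pattern.
tis_ac = {("s1", (1, 2)), ("s1", (1, 3)), ("s2", (1, 2)), ("s2", (1, 3))}
table = propagated_table(tis_ac, d2)
```

Each row of the result has exactly as many elements as the propagated pattern, which is what allows the table to be mined for patterns with the same number of elements.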

For example, in domain $D_1$ of Table 2, we have $TIS(\langle (a)(c) \rangle) = \{\langle (T_1)(T_2)(T_3)(T_4) : 1, 2 \rangle, \langle (T_1)(T_2)(T_3)(T_4) : 1, 3 \rangle, \langle (T_5)(T_6)(T_7) : 1, 2 \rangle, \langle (T_5)(T_6)(T_7) : 1, 3 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1, 3 \rangle\}$, and propagating $TIS(\langle (a)(c) \rangle)$ to domain $D_2$ yields propagated table $D_2\|\langle (a)(c) \rangle$. Table 6 is the propagated table $D_2\|\langle (a)(c) \rangle$, where each sequence is very likely to form multi-domain sequential patterns with $\langle (a)(c) \rangle$ mined from domain $D_1$. From propagated tables, one could mine sequential patterns having the same number of elements as the propagated sequential pattern, and these sequential patterns could be formed as multi-domain sequential patterns. Consider the above example, where the minimum support is set to 3. We can easily find that $\langle (1)(2) \rangle$ is a sequential pattern of $D_2\|\langle (a)(c) \rangle$, and thus $\begin{bmatrix} (a) & (c) \\ (1) & (2) \end{bmatrix}$ is a 2-domain sequential pattern obtained by composing $\langle (a)(c) \rangle$ and $\langle (1)(2) \rangle$.

Note that even though PropagatedMine avoids mining sequential patterns in each domain, the cost of some redundant mining of propagated tables can be further reduced. For example, some patterns mined in propagated tables $D_2\|\langle (a) \rangle$ and $D_2\|\langle (c) \rangle$ are the same as patterns mined in propagated table $D_2\|\langle (a)(c) \rangle$. This is because the time-instance set of $\langle (a)(c) \rangle$ is contained in the time-instance sets of both $\langle (a) \rangle$ and $\langle (c) \rangle$. Consequently, sequences in propagated table $D_2\|\langle (a)(c) \rangle$ also include some sequences in propagated tables $D_2\|\langle (a) \rangle$ and $D_2\|\langle (c) \rangle$. Therefore, only sequential patterns of length one should be propagated to other domains. In other words, only time-instance sets of the top-level nodes (referred to as atomic patterns) in lattice structures are propagated. Once obtained, propagated tables are viewed as transaction databases. Consequently, given a propagated table, by utilizing the frequent itemset algorithms in [1,2,34,14], we could generate the corresponding multi-domain sequential patterns. We now analyze some important properties of the propagated table. With these properties of propagated tables, the lattice structure in the starting domain is used to determine multi-domain sequential patterns whose length is larger than one. The details of generating multi-domain sequential patterns are described later.

Property of the propagated table of atomic patterns: Suppose that $P$ is a $k$-domain sequential pattern (i.e., $P \in SP_k$) with $|P| = 1$. $\begin{bmatrix} P \\ \beta \end{bmatrix}$ is a multi-domain sequential pattern across $(k+1)$-domain sequence databases (i.e., $D_1, D_2, \ldots$, and $D_{k+1}$) with a minimum support of $\delta$ if and only if $\beta$ is a frequent itemset in propagated table $D_{k+1}\|P$ with the same minimum support $\delta$.

Property of antimonotone with multiple domains: If $M$ is a $k$-domain sequential pattern (i.e., across $D_1, D_2, \ldots$, and $D_k$), $k$-domain sequences contained by $M$ are also $k$-domain sequential patterns.

Based on the antimonotone property, algorithm PropagatedMine generates candidate multi-domain sequential patterns in a level-by-level manner. In the propagated domain, sequential patterns are likewise generated level by level according to the number of sequence elements. The detailed steps for deriving multi-domain sequential patterns are described below.

Step 1: Derive atomic patterns across $(k+1)$ domains

Let $SP_k$ be the set of multi-domain sequential patterns across $k$ domains. When deriving atomic patterns across $(k+1)$ domains, the corresponding frequent itemsets can be derived from the propagated tables of each atomic pattern in $SP_k$. Through the property of the propagated table of atomic patterns, the frequent items mined from propagated tables are merged with atomic patterns in $SP_k$ to derive atomic patterns across $(k+1)$ domains. Consider the sequence databases across two domains in Table 2 as an example, where sequential patterns of domain $D_1$ are represented as a lattice structure. We could derive atomic patterns in domain $D_2$, and thus generate their corresponding multi-domain sequential patterns, by propagating the time-instance sets of atomic patterns in domain $D_1$ (i.e., the top-level nodes) to domain $D_2$. Specifically, in Fig. 5, each atomic pattern in $D_1$ has interdomain links to the patterns in $D_2$ with which it is able to form multi-domain sequential patterns. Consequently, we have $\begin{bmatrix} (a) \\ (1) \end{bmatrix}$, $\begin{bmatrix} (b) \\ (2) \end{bmatrix}$, $\begin{bmatrix} (b) \\ (3) \end{bmatrix}$, and $\begin{bmatrix} (c) \\ (2) \end{bmatrix}$ in the above example, and they are obviously also atomic patterns.

Table 6
An example of propagated table $D_2\|\langle (a)(c) \rangle$.

Time sequences            Sequences
<(T1)(T2)(T3)(T4)>        (1,2)(2,3)
<(T1)(T2)(T3)(T4)>        (1,2)(6)
<(T5)(T6)(T7)>            (1,3)(2,4)
<(T5)(T6)(T7)>            (1,3)(8)
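Step 1 treats the propagated table of an atomic pattern as a transaction database. A minimal sketch of the frequent-item search is given here, assuming each row of the propagated table is a single itemset and counting each item once per row. The helper name `frequent_items` and the rows are hypothetical, and a full frequent itemset miner (such as those cited in the text) would replace this simple counter in practice.

```python
from collections import Counter


def frequent_items(rows, min_support):
    """Return the items whose row count reaches the minimum support."""
    counts = Counter(item for itemset in rows for item in set(itemset))
    return {item: c for item, c in counts.items() if c >= min_support}


# Hypothetical propagated table of an atomic pattern: one itemset per row.
rows = [{1, 2}, {1, 3}, {1}, {2, 3}]
frequent = frequent_items(rows, min_support=3)
```

Each frequent item found this way would be stacked under the atomic pattern to form an atomic pattern across one more domain.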

Step 2: Derive $(k+1)$-domain sequential patterns with one element

This step involves deriving $(k+1)$-domain sequential patterns with one element. Assume that $P$ is a $k$-domain sequential pattern across $k$-domain sequence databases (i.e., $D_1, D_2, \ldots$, and $D_k$) and that there is only one element in $P$ (i.e., $e(P) = 1$). The intradomain links in the lattice structure for domain $k$ can be followed to find two multi-domain sequential patterns (e.g., $X$ and $Y$, which are the components of $P$). The corresponding multi-domain sequential patterns in domain $k+1$ are found by traversing interdomain links of $X$ and $Y$. According to the antimonotone property, if there exist any corresponding sequential patterns of $X$ or $Y$ in domain $k+1$, they must have been discovered because $X \sqsubseteq P$ and $Y \sqsubseteq P$. Hence, the corresponding sequential patterns of $P$ in domain $k+1$ are generated from the union of all the multi-domain sequential patterns found in domain $k+1$. For example, let $P = \langle (b,c) \rangle$ be a sequential pattern with $e(P) = 1$ in $D_1$ of Table 2. The components of $P$ (i.e., $\langle (b) \rangle$ and $\langle (c) \rangle$) can be found from the intradomain links. Following interdomain links of $\langle (b) \rangle$ and $\langle (c) \rangle$ in Fig. 6 yields the multi-domain sequential patterns in domain $D_2$ (i.e., $\begin{bmatrix} (b) \\ (2) \end{bmatrix}$ and $\begin{bmatrix} (b) \\ (3) \end{bmatrix}$ for $\langle (b) \rangle$, and $\begin{bmatrix} (c) \\ (2) \end{bmatrix}$ for $\langle (c) \rangle$). Consequently, two candidates are generated by the union operation: $\begin{bmatrix} (b) \\ (2) \end{bmatrix} \cup \begin{bmatrix} (c) \\ (2) \end{bmatrix} = \begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$ and $\begin{bmatrix} (b) \\ (3) \end{bmatrix} \cup \begin{bmatrix} (c) \\ (2) \end{bmatrix} = \begin{bmatrix} (b,c) \\ (2,3) \end{bmatrix}$.

Once the candidate multi-domain sequential patterns are obtained, support values of these patterns are examined by checking their time-instance sets (i.e., $Support\left(\begin{bmatrix} (\alpha) \\ (\beta) \end{bmatrix}\right) = \left| TIS\left(\begin{bmatrix} (\alpha) \\ (\beta) \end{bmatrix}\right) \right| = |TIS(\langle (\alpha) \rangle) \cap TIS(\langle (\beta) \rangle)|$). Given a minimum support of 3, since the support values of $\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$ and $\begin{bmatrix} (b,c) \\ (2,3) \end{bmatrix}$ are 3 and 2, respectively, only $\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$ is frequent. Thus, the lattice structure in domain $D_2$ contains node $\langle (2) \rangle$, and interdomain links are built between lattice structures in domains $D_1$ and $D_2$.
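The union operation of Step 2 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: a multi-domain pattern with one element is stored as a tuple of item sets (one per domain), and the time-instance set of a union is assumed to be the intersection of the components' time-instance sets. The function name `union_candidate` and the time instances are made up, chosen only so that the two candidates reproduce the support values 3 and 2 quoted in the example.

```python
def union_candidate(x, y):
    """Merge two one-element multi-domain patterns domain by domain."""
    rows_x, tis_x = x
    rows_y, tis_y = y
    rows = tuple(a | b for a, b in zip(rows_x, rows_y))  # per-domain item union
    return rows, tis_x & tis_y  # assumed: instances shared by both components


# Hypothetical time-instance sets: (sequence id, positions) pairs.
b2 = ((frozenset("b"), frozenset({2})),
      {("s1", (2,)), ("s2", (2,)), ("s3", (3,)), ("s4", (1,))})
b3 = ((frozenset("b"), frozenset({3})),
      {("s1", (2,)), ("s2", (2,))})
c2 = ((frozenset("c"), frozenset({2})),
      {("s1", (2,)), ("s2", (2,)), ("s3", (3,))})

min_support = 3
candidates = [union_candidate(b2, c2), union_candidate(b3, c2)]
frequent = [rows for rows, tis in candidates if len(tis) >= min_support]
```

Only the candidate whose second row is the single item 2 survives the support check, mirroring the outcome described in the text.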

Step 3: Derive $(k+1)$-domain sequential patterns with more than one element

After generating atomic patterns and the $(k+1)$-domain sequential patterns with one element in Step 1 and Step 2, respectively, algorithm PropagatedMine can further generate the remaining $(k+1)$-domain sequential patterns in a level-by-level manner by referring to the lattice structure in the last domain propagated (i.e., domain $D_k$). In this step, PropagatedMine starts deriving from those patterns with two elements due to the antimonotone property. The frequent patterns in the upper levels are found from the intradomain links in the lattice structure of $D_k$, and the corresponding upper-level patterns in the lattice structure of domain $D_{k+1}$ are identified from their interdomain links. At this point, the interdomain links of upper-level patterns must have been established due to the antimonotone property. Before deriving $(k+1)$-domain sequential patterns, it should be determined whether or not to merge the sequential patterns identified in the lattice structure based on their time order. This leads to Definition 5.

Fig. 6. An example of generating sequential patterns with one element in domain $D_2$.

Algorithm: PropagatedMine
Input: Sequence databases across $n$ domains $D_1, D_2, \ldots, D_n$, and minimum support $\delta$.
Output: Multi-domain sequential patterns across $n$ domains.
Begin
    Apply sequential pattern mining on $D_1$.
    Let $SP_1$ be the set of sequential patterns mined in $D_1$.
    For each domain $D_i$, $i = 2, 3, \ldots, n$
        For each $P \in SP_{i-1}$
            // Step 1
            If $|P| = 1$ Then
            Begin
                Construct propagated table $D_i\|P$.
                Find frequent items in $D_i\|P$ with minimum support $\delta$.
                Let $FI$ be the set of frequent items in $D_i\|P$.
                For each $Q \in FI$
                    Append $\begin{bmatrix} P \\ Q \end{bmatrix}$ to $SP_i$.
                    Let $TIS\left(\begin{bmatrix} P \\ Q \end{bmatrix}\right) = TIS(P) \cap TIS(Q)$.
            End
            // Step 2
            If $e(P) = 1$ Then
            Begin
                Let $X$ and $Y$ be two patterns pointed to by intradomain links of $P$.
                For each pattern $\alpha$ pointed to by interdomain links of $X$
                    For each pattern $\beta$ pointed to by interdomain links of $Y$
                        If $Support\left(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}\right) \geq \delta$ Then
                        Begin
                            Construct interdomain links from $P$ to $\begin{bmatrix} \alpha \\ \beta \end{bmatrix}$.
                            Construct intradomain links from $\begin{bmatrix} \alpha \\ \beta \end{bmatrix}$ to $\alpha$ and $\beta$.
                            Append $\begin{bmatrix} \alpha \\ \beta \end{bmatrix}$ to $SP_i$.
                        End
            End
            // Step 3
            If $e(P) > 1$ Then
            Begin
                Let $X$ and $Y$ be two patterns pointed to by intradomain links of $P$.
                For each pattern $\alpha$ pointed to by interdomain links of $X$
                    For each pattern $\beta$ pointed to by interdomain links of $Y$
                        If $Support([(\alpha)(\beta)]) \geq \delta$ Then
                        Begin
                            Construct interdomain links from $P$ to $[(\alpha)(\beta)]$.
                            Construct intradomain links from $[(\alpha)(\beta)]$ to $\alpha$ and $\beta$.
                            Append $[(\alpha)(\beta)]$ to $SP_i$.
                        End
            End
    Output $= SP_n$.
End

Definition 5 (Concatenate operation of TIS). Let $M$ and $N$ be two multi-domain sequences, where $TIS(M) = \{\langle TS_1 : l_{11}, l_{12}, \ldots, l_{1e(M)} \rangle, \langle TS_2 : l_{21}, l_{22}, \ldots, l_{2e(M)} \rangle, \ldots, \langle TS_m : l_{m1}, l_{m2}, \ldots, l_{me(M)} \rangle\}$, $TIS(N) = \{\langle TT_1 : k_{11}, k_{12}, \ldots, k_{1e(N)} \rangle, \langle TT_2 : k_{21}, k_{22}, \ldots, k_{2e(N)} \rangle, \ldots, \langle TT_n : k_{n1}, k_{n2}, \ldots, k_{ne(N)} \rangle\}$, and $TS_i$ is the time sequence for $i = 1, 2, \ldots, m$ while $TT_j$ is also a time sequence for $j = 1, 2, \ldots, n$. The concatenation of $TIS(M)$ and $TIS(N)$ is denoted as $TIS(M) \cap_{<} TIS(N) = \{\langle TS_i : l_{i1}, l_{i2}, \ldots, l_{ie(M)}, k_{j1}, k_{j2}, \ldots, k_{je(N)} \rangle\}$, such that $TS_i = TT_j$ and $l_{ie(M)} < k_{j1}$. In other words, $TIS(M) \cap_{<} TIS(N)$ is the time-instance set of the multi-domain sequence obtained by concatenating $M$ and $N$.

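For example, given $M = \begin{bmatrix} (a) \\ (1) \end{bmatrix}$, $N = \begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$, and the sequence database across two domains in Table 2, where $TIS(M) = \{\langle (T_1)(T_2)(T_3)(T_4) : 1 \rangle, \langle (T_5)(T_6)(T_7) : 1 \rangle, \langle (T_{10})(T_{12})(T_{13}) : 1 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1 \rangle\}$, and $TIS(N) = \{\langle (T_1)(T_2)(T_3)(T_4) : 2 \rangle, \langle (T_5)(T_6)(T_7) : 2 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 3 \rangle\}$. It can be verified that $TIS\left(\begin{bmatrix} (a) & (b,c) \\ (1) & (2) \end{bmatrix}\right) = TIS(M) \cap_{<} TIS(N) = \{\langle (T_1)(T_2)(T_3)(T_4) : 1, 2 \rangle, \langle (T_5)(T_6)(T_7) : 1, 2 \rangle, \langle (T_{21})(T_{22})(T_{23})(T_{24}) : 1, 3 \rangle\}$.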
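The concatenate operation of Definition 5 can be checked on this example with a short Python sketch. The encoding is an assumption: a time instance is a pair of a time sequence (a tuple of labels) and a tuple of positions, and `concat_tis` is a hypothetical name. The input sets are the $TIS(M)$ and $TIS(N)$ given above.

```python
def concat_tis(tis_m, tis_n):
    """Join time instances that share a time sequence and respect time order."""
    return {
        (ts_m, pos_m + pos_n)
        for ts_m, pos_m in tis_m
        for ts_n, pos_n in tis_n
        # Same time sequence, and the last position of M precedes the first of N.
        if ts_m == ts_n and pos_m[-1] < pos_n[0]
    }


A = ("T1", "T2", "T3", "T4")
B = ("T5", "T6", "T7")
C = ("T10", "T12", "T13")
D = ("T21", "T22", "T23", "T24")

tis_m = {(A, (1,)), (B, (1,)), (C, (1,)), (D, (1,))}
tis_n = {(A, (2,)), (B, (2,)), (D, (3,))}
tis_mn = concat_tis(tis_m, tis_n)
```

The result contains the three time instances listed in the example, so the support of the concatenated pattern is 3.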

Assume that pattern $P \in SP_k$ and $e(P) > 1$. Similar to Step 2, we can obtain the components of $P$, $X$ and $Y$, by traversing intradomain links among lattice structures across $k$ domains, and the multi-domain sequential patterns pointed to by their interdomain links can be determined. In light of Definition 5, a concatenate operation is used rather than the union operation of Step 2. For example, assume pattern $P = \langle (a)(b,c) \rangle$ in Fig. 7. The intradomain and interdomain links yield $\begin{bmatrix} (a) \\ (1) \end{bmatrix}$ and $\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}$. Therefore, candidate multi-domain sequential pattern $\begin{bmatrix} (a) & (b,c) \\ (1) & (2) \end{bmatrix}$ is generated, and its support value is $Support\left(\begin{bmatrix} (a) & (b,c) \\ (1) & (2) \end{bmatrix}\right) = \left| TIS\left(\begin{bmatrix} (a) \\ (1) \end{bmatrix}\right) \cap_{<} TIS\left(\begin{bmatrix} (b,c) \\ (2) \end{bmatrix}\right) \right| = 3$.

The above steps allow multi-domain sequential patterns across $(k+1)$-domain sequence databases to be derived from $k$-domain sequential patterns. Algorithm PropagatedMine iteratively repeats the above three steps until all sequence databases are propagated.

Theorem 1. Algorithm PropagatedMine is able to mine all multi-domain sequential patterns via lattice structures.

Proof. Mining frequent itemsets in propagated tables reveals multi-domain atomic patterns across other sequence databases. To prove the correctness of Steps 2 and 3, first let $P$ be a $k$-domain sequential pattern and $P'$ be a $(k+1)$-domain sequential pattern derived from $P$, where $e(P') = e(P) = 1$ and $|P'| \geq |P| > 1$. In other words, $P' = \begin{bmatrix} P \\ Z \end{bmatrix}$, where $Z$ is a frequent itemset in the propagated table $D_{k+1}\|\langle (P) \rangle$. Assume that $X$ and $Y$ are parts of $P$, and $X \cup Y = P$. Hence, in the lattice structure, we have intradomain links from $P$ to $X$ and $Y$. In addition, there are interdomain links from $X$ and $Y$ to $Z'$, where $Z'$ is taken from the power set of $Z$ and $Z' \neq \emptyset$. Due to the antimonotone property, all multi-domain sequences contained by $P'$ must also be frequent. In other words, both $\begin{bmatrix} X \\ Z' \end{bmatrix}$ and $\begin{bmatrix} Y \\ Z' \end{bmatrix}$ are frequent. Therefore, the lattice structures can be used to derive all pairs of $P$ and $P'$ while $e(P') = e(P) = 1$. Similarly, when $e(P') = e(P) > 1$, $X$ and $Y$ are parts of $P$ and $TIS(X) \cap_{<} TIS(Y) = TIS(P)$. Moreover, assume that $Z$ is a frequent itemset in propagated table $D_{k+1}\|\langle (P) \rangle$. Clearly, interdomain links exist from $X$ and $Y$ to $Z'$ in domain $D_{k+1}$, where $Z'$ is taken from the power set of $Z$ and $Z' \neq \emptyset$. The antimonotone property means that all multi-domain sequences contained by $P'$ must also be frequent. This results in both $[(X, Z')]$ and $[(Y, Z')]$ being frequent. This proof indicates that algorithm PropagatedMine is able to mine all multi-domain sequential patterns. $\square$

4.4. Mining relaxed multi-domain sequential patterns

The above three algorithms are used for mining strong multi-domain sequential patterns, in which co-occurring events are required from all domains. Strong multi-domain sequential patterns are very restrictive, since users may want to analyze behavior only across the domains that interest them. In this paper, we further develop some mechanisms for mining relaxed multi-domain sequential patterns, in which empty sets are allowed in some time slots. Note that both algorithm Naive and algorithm IndividualMine could be extended for mining relaxed multi-domain sequential patterns. However, due to the feature of propagation, algorithm PropagatedMine is not able to discover relaxed patterns. In the following, we discuss how to mine relaxed multi-domain sequential patterns.

[Fig. 7. Lattice structures in domains $D_1$ and $D_2$, arranged by number of elements, including nodes <(a)(b,c)> and <(1)(2)> at number of elements = 2.]

Naive algorithm

As pointed out earlier, given a set of sequence databases, algorithm Naive joins these sequence databases into one multi-domain sequence database. With a proper transformation of the multi-domain sequence database, one could generate a sequence database whose events are from all domains. Thus, existing sequential pattern mining algorithms could be utilized to discover sequential patterns. Note that the sequential patterns mined are then represented in the form of multi-domain sequential patterns. Hence, those multi-domain sequential patterns that have some empty sets are directly viewed as relaxed patterns.

Algorithm IndividualMine

Algorithm IndividualMine performs sequential pattern mining in each sequence database. After generating all sequential patterns in all domains, in the checking phase, algorithm IndividualMine checks and composes candidate multi-domain sequential patterns with the same number of elements. In order to mine relaxed patterns, all possible compositions of multi-domain sequential patterns from the sequential patterns of each domain should be enumerated. For example, assume that one $i$-domain sequential pattern $P = \langle P_1, P_2, \ldots, P_l \rangle$ is selected from $SP_i$ and $Q = \langle q_1, q_2, \ldots, q_r \rangle$ is a sequential pattern of domain $D_{i+1}$. Candidate $(i+1)$-domain sequential patterns generated from $P$ and $Q$ are $\begin{bmatrix} P_1 & P_2 & \ldots & P_l \\ q_1 & q_2 & \ldots & q_r \end{bmatrix}, \ldots$, and so on, where the remaining candidates align the elements of $P$ and $Q$ differently by placing empty sets in some time slots. Note that the number of candidate patterns is denoted as $f(r, l)$, which is formulated as follows:

$$f(r, l) = \begin{cases} 1, & \text{if } r = 0 \\ f(l, r-1) + 2 \sum_{i=0}^{r-1} f(i, l-1), & \text{otherwise.} \end{cases} \qquad (1)$$

Obviously, $f(r, l)$ could be very large when $r$ and $l$ increase. As expected, we could have a large number of candidate multi-domain sequential patterns, degrading the performance of algorithm IndividualMine.

Algorithm PropagatedMine

By exploring propagation and lattice structures, algorithm PropagatedMine is able to reduce the mining cost. However, algorithm PropagatedMine cannot mine relaxed patterns, since propagation needs to obtain the time-instance sets of sequential patterns. Empty sets mean that events do not occur, and thus no time information is available for them. Thus, it is impossible to derive time-instance sets of empty sets. Consequently, for mining relaxed patterns, algorithm Naive and algorithm IndividualMine should be used.

5. Performance evaluation

To evaluate the performance of our proposed algorithms, we implement a simulation model and conduct extensive experiments. In Section 5.1, the simulation model and synthetic datasets are described. Section 5.2 is devoted to experimental results.

5.1. Simulation model

We modify the well-known data generator in [3] to generate datasets that include multiple domains; this data generator is broadly used in many studies to evaluate proposed mining algorithms [18]. Details of the generation process can be found in [18]. Some parameters are summarized in Table 7. Explicitly, M denotes the number of domains, D is the number of sequences, C is the average number of elements in a sequence, T is the average number of events in an element, and I is the total number of distinct events. The modeling of these parameters is almost the same as in [3]. For example, dataset M5D10kC10T5I100 represents 5 domains, each of which contains 10k sequences, where the average number of elements in a sequence is 10, the average number of items in an element is 5, and the total number of distinct items is 100.

Table 7
Parameters used for the data generator.

Parameter    Description
M            Number of domains
D            Number of sequences
C            Average number of elements within a sequence
T            Average number of items within an element
I            Total number of distinct items
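The dataset naming convention (e.g., M5D10kC10T5I100) can be decoded mechanically. The helper below is a hypothetical convenience, not part of the data generator; it assumes that each parameter letter is followed by an integer and that a trailing `k` means a factor of 1000.

```python
import re


def parse_dataset_name(name):
    """Decode a dataset name such as 'M5D10kC10T5I100' into its parameters."""
    fields = re.findall(r"([MDCTI])(\d+)(k?)", name)
    return {key: int(num) * (1000 if kilo else 1) for key, num, kilo in fields}


params = parse_dataset_name("M5D10kC10T5I100")
```

Names that omit a parameter (such as D1kC2T3I100, used when the number of domains is varied) simply yield a dictionary without that key.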

For traditional sequential pattern mining, we use algorithm PrefixSpan, obtained from the IlliMine project (http://illimine.cs.uiuc.edu/). Algorithm PrefixSpan is used in algorithm Naive and in the mining phases of both algorithms IndividualMine and PropagatedMine. Our programs are executed on a platform with an Intel 2.4-GHz XEON CPU and 3.5 GB of RAM, running FreeBSD 5.0 with GCC 3.2. We use three performance metrics to compare the proposed algorithms: the execution time, the memory consumption, and the number of mined patterns.

5.2. Experimental results

Several experiments were conducted to evaluate the performance and memory consumption of the three algorithms. Sensitivity analysis on some important parameters, such as the minimum support, the number of sequences, and the number of domains, is conducted.

5.2.1. Impact of the minimum support threshold

We first investigate the performance of the three algorithms with the minimum support varied. For the dataset M2D2kC3T4I200, Fig. 8 shows the execution time and the memory consumption of the three algorithms. It can be seen in Fig. 8 that the execution times of algorithms IndividualMine and PropagatedMine decrease as the minimum support increases. This is because, with a larger minimum support, the number of sequential patterns in the sequence databases is smaller. Furthermore, algorithm PropagatedMine significantly outperforms the other two algorithms in terms of execution time, which demonstrates the advantage of exploring propagation and lattice structures in mining multi-domain sequential patterns. On the other hand, when the minimum support is smaller than 1.5%, algorithm IndividualMine is worse than algorithm Naive. The reason is that with a smaller minimum support, a larger number of sequential patterns are mined in each domain. Thus, algorithm IndividualMine needs more time to compose candidate multi-domain sequential patterns and determine their supports. In algorithm Naive, join operations among sequence databases are costly and dominate the execution time. As for memory consumption, algorithm Naive uses less memory than algorithms IndividualMine and PropagatedMine, because both of the latter need additional memory to store the sequential patterns mined. Algorithm PropagatedMine also needs to store lattice structures, and thus uses more memory than algorithm IndividualMine, which needs no memory beyond that for storing sequential patterns. Although algorithm PropagatedMine needs more memory, it is able to quickly derive multi-domain sequential patterns, which strikes a compromise between memory space and execution time.

5.2.2. Impact of the number of domains

We next examine the impact of the number of domains on the performance of the three proposed algorithms. The experiments were conducted on D1kC2T3I100 (referred to as the smaller dataset) and D1kC3T4I200 (referred to as the larger dataset). With the minimum support set to 0.3%, the execution times (in seconds) of the proposed algorithms are shown in Table 8 and Table 9. From both tables, it can be seen that all three algorithms have a larger execution time when the number of domains increases. In particular, the execution time of algorithm Naive increases drastically. Both algorithms IndividualMine and PropagatedMine have smaller execution times than algorithm Naive. Furthermore, algorithm PropagatedMine outperforms the other algorithms in terms of execution time, showing the advantage of utilizing propagation to reduce the mining cost. In addition, given a larger dataset with a larger number of events and larger sequence lengths, the execution time of algorithm Naive is worse. On the other hand, algorithm PropagatedMine incurs a smaller execution time than algorithms Naive and IndividualMine, showing the good scalability of algorithm PropagatedMine.

[Fig. 8. (a) Execution time (sec) and (b) memory consumption (MB) of Naive, IndividualMine, and PropagatedMine with the minimum support (%) varied.]

5.2.3. Impact of the number of sequences

We next examine the effect of varying the number of sequences from 1000 to 6000, with the other parameters set to M2C3T3I200. With a minimum support of 1%, Fig. 9 shows the execution time of all algorithms. As can be seen in Fig. 9, the execution time of all three algorithms increases as the number of sequences increases. Notice that the execution time of algorithm Naive increases significantly when the number of sequences is larger than 2000. Thus, to compare algorithms IndividualMine and PropagatedMine, we show only the execution times of algorithms IndividualMine and PropagatedMine. By exploring lattice structures, PropagatedMine needs to mine only atomic patterns, from which other patterns are derived accordingly. As a result, the execution time of PropagatedMine increases only slightly with the number of sequences. Note that the execution time of algorithm PropagatedMine is much smaller than that of algorithms IndividualMine and Naive. However, both algorithms IndividualMine and PropagatedMine need more memory space for storing the sequential patterns mined. Thus, it can be seen in Fig. 9 that both algorithms IndividualMine and PropagatedMine have a larger memory consumption than algorithm Naive. This also agrees with the observation that algorithm Naive is bounded by execution time, whereas algorithms IndividualMine and PropagatedMine are bounded by memory space.

5.2.4. Impact of the average number of elements within a sequence

In this section, we investigate the performance of Naive, IndividualMine, and PropagatedMine with the average number of elements within a sequence varied. Without loss of generality, the minimum support threshold is set to 1% and the other parameters in the dataset are M2D1kT3I200. Fig. 10 shows the experimental results of Naive, IndividualMine, and PropagatedMine. Clearly, the execution time of mining multi-domain sequential patterns increases with the average number of elements within a sequence. Note that algorithm IndividualMine even performs worse than algorithm Naive when the average number of elements in a sequence is larger than 4.7. The reason is that IndividualMine mines a large number of sequential patterns in each domain and spends more time composing candidate multi-domain sequential patterns. The above observation is also supported by Fig. 11, where algorithm IndividualMine generates a larger number of propagated sequential patterns than algorithm PropagatedMine. Note that the number of patterns propagated in algorithm IndividualMine is the number of patterns discovered in the starting domain. Fig. 10b also indicates that although algorithm PropagatedMine has a smaller execution time, it needs more memory space to store the lattice structure.

Table 8
Execution time (in seconds) of algorithms Naive, IndividualMine, and PropagatedMine with the number of domains varied on D1kC2T3I100.

Number of domains    2         3         4          5
Naive                5.3       206.7     2513.9     21769.7
IndividualMine       126.3     163.9     180.2      181.1
PropagatedMine       0.4       0.6       0.7        0.7

Table 9
Execution time (in seconds) of algorithms Naive, IndividualMine, and PropagatedMine with the number of domains varied on D1kC2T4I200.

Number of domains    2         3         4          5
Naive                57.1      3065.3    53164.9    379118.5
IndividualMine       1052.1    1192.9    1213.9     1214.4
PropagatedMine       2.1       2.4       2.5        2.5

[Fig. 9. (a) Execution time (sec) and (b) memory consumption (MB) of Naive, IndividualMine, and PropagatedMine with the number of sequences varied.]

5.2.5. Impact of the average number of items within an itemset

The average number of items within an itemset generally affects the performance of sequential pattern mining. Thus, we investigate the effect of varying the average number of items within an itemset. The minimum support was set to 1% and we used the dataset M2D1kC3I200. The execution time and memory consumption with the average number of items in an itemset varied are shown in Fig. 12. As can be seen in Fig. 12, PropagatedMine performs the best in terms of execution time. When the average number of items in an itemset is smaller, the execution time of IndividualMine is smaller than that of Naive. However, if there is a large number of items within an itemset, IndividualMine performs worse than Naive, since algorithm IndividualMine has a larger number of patterns mined, which incurs a considerable cost in the checking phase. Fig. 13 demonstrates that PropagatedMine is better than IndividualMine because the number of sequential patterns mined in the starting domain is much smaller than that of algorithm IndividualMine. In algorithm PropagatedMine, only atomic patterns are mined, and thus the number of patterns mined in the starting domain is equal to the number of atomic patterns. Consequently, by exploring lattice structures, algorithm PropagatedMine outperforms the other algorithms in terms of execution time.

Fig. 10. Performance of Naive, IndividualMine, and PropagatedMine with the average number of elements within a sequence varied: (a) execution time (sec); (b) memory consumption (MB).

[Fig. 11. Number of patterns propagated in IndividualMine and PropagatedMine with the average number of elements in a sequence varied.]

Fig. 12. Performance of Naive, IndividualMine, and PropagatedMine with the average number of items within an itemset varied: (a) execution time (sec); (b) memory consumption (MB).

Fig. 13. Number of patterns propagated in IndividualMine and PropagatedMine with the average number of items within an itemset varied.

[Fig. 14. (a) Execution time (sec) and (b) memory consumption (MB) of Naive, IndividualMine, and PropagatedMine with the number of different items varied.]

5.2.6. Impact of the number of items

We next investigate the impact of the total number of items, where the minimum support is set to 1% and the other parameters are set as M2D1kC3T4. Fig. 14 shows the execution times and memory consumption of Naive, IndividualMine, and PropagatedMine. It can be seen in Fig. 14 that both IndividualMine and PropagatedMine have a smaller execution time than Naive as the number of items increases. When the number of items is larger, the probability of each item being frequent is smaller under the same setting of D1kC3T4. Fig. 15 depicts the number of patterns with the number of items varied. As can be seen in Fig. 15, PropagatedMine has a smaller number of patterns derived, which demonstrates the advantage of using lattice structures for discovering multi-domain sequential patterns.

5.2.7. Impact of the propagation order for PropagatedMine

Since algorithm PropagatedMine relies on propagation to mine multi-domain sequential patterns, we now examine the impact of the propagation order on its performance. As pointed out earlier, algorithm PropagatedMine first selects a starting domain and then performs sequential pattern mining on it. Based on the mining results, a lattice structure is built. Clearly, the starting domain should be chosen judiciously. Intuitively, selecting a domain with a smaller number of sequential patterns reduces the size of the lattice structure, thereby improving the performance of algorithm PropagatedMine. In this experiment, we evaluate different propagation orders. Fig. 16 shows the execution time of algorithm PropagatedMine with various propagation orders, where the value on the x-axis is the propagation order used. For example, 12435 indicates that algorithm PropagatedMine starts with D1 and then propagates to D2, D4, D3, and D5. As can be seen in Fig. 16, selecting domain D1 as a starting domain is better

[Plot for Fig. 15: number of patterns propagated versus the number of different items (100 to 800), for IndividualMine and PropagatedMine.]

Fig. 15. Number of patterns propagated in IndividualMine and PropagatedMine with the number of different items varied.

[Plots, apparently for Fig. 16: (a) execution time (sec) and (b) memory consumption (MB) of PropagatedMine for the propagation orders 12345, 51234, 45123, 34512, 23451, 12435, 15342, 13245, and 53421.]
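The propagation-order labels used in this experiment can be decoded mechanically, and the intuition about starting from a domain with few sequential patterns can be written as a simple ordering rule. The sketch below is only an illustration. The helper names and the pattern counts are hypothetical, and sorting all domains by pattern count is an assumption that goes beyond the text, which only discusses the choice of the starting domain.

```python
from typing import Dict, List


def parse_propagation_order(order: str) -> List[str]:
    """Turn an order label such as '12435' into ['D1', 'D2', 'D4', 'D3', 'D5']."""
    return [f"D{digit}" for digit in order]


def order_by_pattern_count(pattern_counts: Dict[str, int]) -> List[str]:
    """Hypothetical heuristic: visit domains from fewest to most patterns.

    Ties are broken by domain name so the result is deterministic.
    """
    return sorted(pattern_counts, key=lambda d: (pattern_counts[d], d))


# Made-up pattern counts per domain, for illustration only (not from the paper).
counts = {"D1": 12, "D2": 40, "D3": 75, "D4": 58, "D5": 90}

print(parse_propagation_order("12435"))
print(order_by_pattern_count(counts))
```

With these made-up counts the heuristic starts from D1, the domain with the fewest patterns, so the lattice built from the starting domain stays as small as possible.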

