

5.2 Related Works

A significant amount of research effort has been devoted to sequential pattern mining [8, 21, 27, 68, 70, 74, 75]. The problem of sequential pattern mining was first formulated in [3], whose authors proposed mining algorithms based on the Apriori algorithm. Algorithm GSP [65] was developed for mining sequential patterns using a breadth-first search and bottom-up method, whereas algorithm SPADE [79] employed a depth-first search and bottom-up method with an ID-list. The authors in [24][54][55] exploited the projection concept to reduce the amount of data for sequential pattern mining. To avoid candidate generation, DISC-all [13] used a novel sequence comparison strategy. A progressive concept has been explored in mining sequential patterns to capture the dynamic nature of data addition and deletion [26]. The above research works focus on improving the performance of traditional sequential pattern mining.

cid   cust-grp       city       age-grp   sequence
10    business       Boston     middle    < (bd)(c)(b)(a) >
20    professional   Chicago    young     < (bf)(ce)(fg) >
30    business       Chicago    middle    < (ah)(a)(b)(f) >
40    education      New York   retired   < (be)(ce) >

Table 5.1: Multi-dimensional sequence database [56].

Several variations and applications of sequential patterns have been proposed recently. We mention in passing that the authors in [56] developed a method for mining multi-dimensional sequential patterns, in which sequential patterns indicate not only frequent sequences but also some attributes in the category dimensions. In [56], the sequence database consists of category attributes and sequence attributes; Table 5.1 shows an example of a multi-dimensional sequence database. Clearly, the problem of mining multi-domain sequential patterns is very different from the problem in [56] in terms of the input and output of the problem definitions. In this chapter, the input is a set of sequence databases and the output is the set of multi-domain sequential patterns, which consist of sequences of co-occurring events across sequence databases.

Another study that mentions multi-dimensional sequences is [78], in which sequence data are divided into different dimensions according to the user's specification. However, there is no time information among the different dimensions; in other words, events in different dimensions are not co-occurring. Therefore, that problem is quite different from ours. Furthermore, the problem in [6] is to discover events that occur together. In contrast, given a set of sequence databases, we intend to discover sequences consisting of co-occurring events. Moreover, the authors in [22] proposed the problem of distributed sequential pattern mining, where each set of co-occurring events is complete and sequences are separated into different databases. Similarly, the problem in [35] is a distributed sequential pattern mining problem, and the authors in [35] exploited the concepts of approximate patterns and local clustering to avoid noise and a large number of local patterns. As pointed out earlier, given a payment list of credit cards, we could divide payments into several domains according to payment services. Thus, our problem of mining multi-domain sequential patterns is not the same as distributed mining of sequential patterns.

To the best of our knowledge, previous studies have not adequately explored multi-domain sequential patterns, let alone proposed efficient algorithms for mining them. The contributions of this chapter are twofold: (1) exploiting novel and useful sequential patterns (i.e., multi-domain sequential patterns), and (2) devising algorithms IndividualMine and PropagatedMine to efficiently mine multi-domain sequential patterns. Our preliminary works were presented in [42] and [43]. In this chapter, more detailed complexity and theoretical analyses are conducted. Also, we develop mechanisms in each proposed algorithm to allow multi-domain sequential patterns with empty sets in some domains. In particular, by exploring lattice structures, algorithm PropagatedMine is able to further reduce the number of candidate multi-domain sequential patterns. Furthermore, an extensive performance study is conducted, and a sensitivity analysis is performed on several parameters for each algorithm.

Table 5.2: Example of sequence databases in two domains.

5.3 Preliminaries

Assume that each domain has its own set of items and its own sequence database. The problem of mining multi-domain sequential patterns is as follows: given a set of sequence databases, we aim to discover sequential patterns that consist of co-occurring events across multiple domains. Table 5.2 shows two domains, each with its own sequence database. For each sequence in a sequence database, the corresponding time sequence indicates the occurrence times of its events. For example, in sequence s1 in D1, event a occurs at T1 and both b and c occur at T2. By joining these two sequence databases via their time sequences, we obtain one Multi-Domain sequence dataBase (referred to as MDB). Table 5.3 is an example of such a multi-domain sequence database.
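The join described above can be sketched in Python. This is only an illustration under stated assumptions, not the chapter's method: the function name `join_domains`, the dictionary layout, the matching of sequences by a shared sequence ID, the use of integers for time windows, and all D2 data are hypothetical. Only the D1 events of s1 (a at T1, b and c at T2) come from the text.

```python
def join_domains(databases):
    """Join per-domain sequence databases on sequence ID and time window.

    Each database maps a sequence ID to a list of (time, itemset) pairs.
    The result maps each ID to a time-ordered list of
    (time, (itemset_in_domain_1, ..., itemset_in_domain_k)).
    A domain with no event in a time window contributes an empty set.
    """
    mdb = {}
    ids = set().union(*(db.keys() for db in databases))
    for sid in sorted(ids):
        # One time -> itemset lookup per domain for this sequence.
        per_domain = [dict(db.get(sid, [])) for db in databases]
        times = sorted(set().union(*(d.keys() for d in per_domain)))
        mdb[sid] = [
            (t, tuple(frozenset(d.get(t, ())) for d in per_domain))
            for t in times
        ]
    return mdb


# D1 follows the s1 example in the text; D2 is made up for illustration.
D1 = {"s1": [(1, {"a"}), (2, {"b", "c"})]}
D2 = {"s1": [(1, {1}), (2, {2}), (3, {3})]}

for time, element in join_domains([D1, D2])["s1"]:
    print(time, [sorted(itemset) for itemset in element])
```

Filling absent windows with an empty set is one possible convention; it matches the chapter's remark that patterns may have empty sets in some domains, but the actual join used to build Table 5.3 may differ.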

To facilitate the presentation of multi-domain sequences, a sequence $s_i$ in domain $D_i$ is expressed by $\langle X_{i1}, X_{i2}, \ldots, X_{il} \rangle$, where $X_{ij}$ is the $j$th element of sequence $s_i$, and $l$ is the number of elements of $s_i$. Therefore, a multi-domain sequence across $k$ domains (abbreviated as $k$-domain sequence) is represented as $M = [s_1, s_2, \ldots, s_k]^T$ and is further denoted as

$$
M = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1l} \\
X_{21} & X_{22} & \cdots & X_{2l} \\
\vdots & \vdots & \ddots & \vdots \\
X_{k1} & X_{k2} & \cdots & X_{kl}
\end{bmatrix},
$$

where the $j$th column $[X_{1j}, X_{2j}, \ldots, X_{kj}]^T$ is a set of itemsets that occur within the same time window, denoted as $T_j$. A time sequence $TS(M)$ is represented as $TS(M) = \langle T_1, T_2, \ldots, T_l \rangle$ to indicate the occurrence time of $M$. The time window, a time interval, is determined in accordance with application requirements.

Table 5.3: An example of a multi-domain sequence database (columns: ID, time sequences, multi-domain sequences).

With the above representation of multi-domain sequences, we further define the length and the number of elements for multi-domain sequential patterns. Since a multi-domain sequence consists of multiple sequences from various domains, the length of a multi-domain sequence across k domains can be defined as follows:

Definition 5. (Length and number of elements) Let $M = [s_1, s_2, \ldots, s_k]^T$ be a $k$-domain sequence. The length of $M$, denoted as $|M|$, is the length of the longest sequence in $M$. Furthermore, the number of elements in $M$, expressed by $e(M)$, is the number of itemsets in the multi-domain sequence.

For example, given a 2-domain sequence M =

Once we have the definitions of the length and the number of elements of a multi-domain sequence, the containing relation among multi-domain sequences is defined as follows:

Definition 6. (Containing relation) Suppose that we have two multi-domain sequences M =

For example, assume that M =

It can be verified that $M$ is contained by $N$ since there exists an integer list $L(M, N) = \langle 1, 3 \rangle$ such that $1 \le 1 < 3 \le 4$, and $(a) \subseteq (a)$, $(2) \subseteq (1, 2)$, $(b, c) \subseteq (b, c, d)$, and $(6) \subseteq (6)$.
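The containment check in this example can be sketched in Python. The formal statement of Definition 6 is incomplete in this copy, so the sketch assumes the reading suggested by the example: M is contained by N if there is a strictly increasing list of positions in N such that, for each element of M, every per-domain itemset is a subset of the corresponding itemset at the matched position of N. The function name `match_positions` and the second and fourth elements of N are hypothetical; the elements of M and the first and third elements of N follow the subset relations listed above.

```python
def match_positions(m, n):
    """Return the 1-based positions of n matched by m, or None.

    A multi-domain sequence is a list of elements; each element is a
    tuple holding one itemset (a Python set) per domain. Matching each
    element of m to the earliest usable element of n is sufficient,
    because no gap constraint is assumed.
    """
    positions = []
    j = 0
    for element in m:
        # Advance until every domain's itemset is a subset at position j.
        while j < len(n) and not all(x <= y for x, y in zip(element, n[j])):
            j += 1
        if j == len(n):
            return None
        positions.append(j + 1)
        j += 1
    return positions


M = [({"a"}, {2}), ({"b", "c"}, {6})]
N = [
    ({"a"}, {1, 2}),
    ({"e"}, {3}),            # hypothetical element
    ({"b", "c", "d"}, {6}),
    ({"f"}, {7}),            # hypothetical element
]

print(match_positions(M, N))  # [1, 3]
```

The returned list plays the role of the integer list L(M, N) in the example.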

Based on the above definitions, a multi-domain sequence database is a set of multi-domain sequences. Consider the example of a multi-domain sequence database in Table 5.3, where the number of 2-domain sequences is 4. Given a multi-domain sequence database MDB, the support value of a multi-domain sequence M is the number of multi-domain sequences in MDB that contain M.
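Support counting as defined here can be sketched as a direct scan over the database. The database below is hypothetical (it is not the content of Table 5.3), the names `is_contained` and `support` are illustrative, and containment is assumed to mean an order-preserving match with per-domain subset tests.

```python
def is_contained(m, n):
    """True if multi-domain sequence m is contained by n (assumed reading)."""
    j = 0
    for element in m:
        while j < len(n) and not all(x <= y for x, y in zip(element, n[j])):
            j += 1
        if j == len(n):
            return False
        j += 1
    return True


def support(mdb, m):
    """Number of multi-domain sequences in mdb that contain m."""
    return sum(1 for n in mdb if is_contained(m, n))


# Hypothetical 2-domain database with four sequences.
MDB = [
    [({"a"}, {1}), ({"b", "c"}, {2})],
    [({"a", "d"}, {1}), ({"b", "c"}, {2, 3})],
    [({"a"}, {1, 4}), ({"e"}, {5}), ({"b", "c"}, {2})],
    [({"b"}, {2}), ({"a"}, {1})],
]
pattern = [({"a"}, {1}), ({"b", "c"}, {2})]

print(support(MDB, pattern))  # 3
```

The fourth sequence holds the same items in the reverse order, so it does not count toward the support.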

Multi-domain Sequential Pattern Mining: Given a set of sequence databases across multiple domains, one could join these sequence databases into one multi-domain sequence database. Then, the task of mining multi-domain sequential patterns is to derive the multi-domain sequences whose supports in MDB are larger than a user-specified minimum support threshold δ. For example, for the multi-domain sequence database MDB in Table 5.3 and a minimum support δ = 3, the multi-domain sequential patterns are [...]. Notice that joining these multiple sequence databases is costly due to the nature of join operations.

It can be verified that multi-domain sequential patterns contain sequential patterns in each domain. For example,

$$
\begin{bmatrix}
(a) & (b, c) \\
(1) & (2)
\end{bmatrix}
$$

is a multi-domain sequential pattern, where $(a)(b, c)$ (respectively, $(1)(2)$) is a sequential pattern in domain $D_1$ (respectively, $D_2$) and the events in $(a)(b, c)$ and $(1)(2)$ have the same time sequence. Thus, in this chapter, we propose algorithms to discover multi-domain sequential patterns without joining.
