Algorithm IndividualMine: Mining Patterns in Each Domain


5.4 Algorithms of Mining Multi-domain Sequential Patterns


Pattern   Sequential pattern   Domain D1    Domain D2
P2        < (a, 1)(5) >        (a) ∅        (1) (5)
P3        < (a, 1)(c, 2) >     (a) (c)      (1) (2)
P4        < (b, 3) >           (b)          (3)
P5        < (b, c, 2) >        (b, c)       (2)
P6        < (b, c, 2)(e) >     (b, c) (e)   (2) ∅

Table 5.5: An example of a transformed sequence database.

domains, it is very straightforward to represent sequential patterns as multi-domain sequential patterns.

As Table 5.5 shows, P1, P2 and P6 contain empty sets; such patterns are referred to as multi-domain sequential patterns with empty sets (abbreviated as relaxed multi-domain sequential patterns). On the other hand, P3, P4 and P5 are called strong multi-domain sequential patterns, since every set of co-occurring events contains events from all required domains.
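The distinction between relaxed and strong patterns can be illustrated with a minimal sketch. The representation below (one row per domain, one itemset per position, an empty tuple standing for the empty set) and the name `is_strong` are assumptions made for illustration, not notation from the text.

```python
# Assumed representation: a multi-domain pattern is a list of rows, one row
# per domain; each row holds one itemset per position, and an empty tuple
# stands for the empty set.
def is_strong(pattern):
    """Return True when every position has events from every domain."""
    return all(len(itemset) > 0 for row in pattern for itemset in row)

# P2 = < (a, 1)(5) >: domain D1 contributes nothing at the second position.
p2 = [[("a",), ()], [("1",), ("5",)]]
# P3 = < (a, 1)(c, 2) >: both domains contribute at both positions.
p3 = [[("a",), ("c",)], [("1",), ("2",)]]

print(is_strong(p2))  # False, so P2 is a relaxed pattern
print(is_strong(p3))  # True, so P3 is a strong pattern
```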

Algorithm Naive has two drawbacks. First, it must perform join operations among multiple sequence databases, and these joins make it inefficient. Second, in order to use traditional sequential pattern mining algorithms, the multi-domain sequence database produced by the join is transformed into a single sequence database. Because that database carries events from all domains, its sequences are long, which makes sequential pattern mining costly. To address these two drawbacks, we develop two efficient algorithms that mine multi-domain sequential patterns without joining sequence databases.

5.4.2 Algorithm IndividualMine: Mining Patterns in Each Domain

In this section, we develop algorithm IndividualMine. Figure 5.2 shows an overview of algorithm IndividualMine, which consists of two phases: the mining phase and the checking phase. In the mining phase, sequential patterns in each sequence database are first mined by sequential pattern mining algorithms (e.g., PrefixSpan [54][55]). In the checking phase, sequential patterns from all domains are combined to generate candidate multi-domain sequential patterns. If the support value of a candidate multi-domain sequential pattern is not less than the minimum support threshold, the candidate is a multi-domain sequential pattern. How the support of a candidate multi-domain sequential pattern is counted will be described later.

[Figure: sequence databases D1, D2, . . . , Dn; Checking Phase (to check support values); Results.]

Figure 5.2: Overview of algorithm IndividualMine.

Without loss of generality, given k sequence databases, we intend to derive multi-domain sequential patterns across k domains. Furthermore, we denote the set of k sequence databases as {D1, D2, . . . , Dk}, and SPi as the set of i-domain sequential patterns across a set of i sequence databases (i.e., {D1, D2, . . . , Di}). To derive k-domain sequential patterns, we start with the sequential patterns from one domain and progressively compose them with sequential patterns from other domains until the number of domains is k. Hence, the sequential patterns mined in D1 are first put in the set SP1. Then, for each pattern in SP1, candidate 2-domain sequential patterns (across the two domains {D1, D2}) are generated by combining it with the sequential patterns in domain D2. For example, given a minimum support of 3 in the example of Table 5.2, < (a)(b) > is a sequential pattern and is put in the set SP1. Also, < (1)(2) > is a sequential pattern in D2. Consequently, we could have a candidate 2-domain sequential pattern that pairs < (a)(b) > from D1 with < (1)(2) > from D2.

After generating candidate multi-domain sequential patterns, their support values should be determined. As can be seen in Table 5.2, each sequence is associated with its own time sequence. Thus, one could use time sequences to derive support values. Explicitly, the time-instance set of a sequence M is defined as follows:

Definition 7. (Time-instance set) Let MDB be a k-domain sequence database¹ and M be a k-domain sequence. The time-instance set of M is defined as TIS(M) = {< TS(N) : L(M, N) > | N ∈ MDB and M ⊑ N}.

¹ To facilitate our presentation, one could imagine that MDB is virtually joined from multiple sequence databases.

Based on the above definition, for a candidate multi-domain sequential pattern, we could determine its support value by intersecting the time-instance sets of its constituent sequential patterns. For example, to determine the support of the candidate 2-domain sequential pattern composed of < (a)(b) > and < (1)(2) >, we should check the time-instance sets of both < (a)(b) > and < (1)(2) > in Table 5.2. It can be seen in Table 5.2 that the time-instance set of < (a)(b) > is {< (T1)(T2)(T3)(T4) : 1, 2 >, < (T1)(T2)(T3)(T4) : 1, 3 >, < (T5)(T6)(T7) : 1, 2 >, < (T21)(T22)(T23)(T24) : 1, 3 >}. Moreover, we have TIS(< (1)(2) >) = {< (T1)(T2)(T3)(T4) : 1, 2 >, < (T5)(T6)(T7) : 1, 2 >, < (T21)(T22)(T23)(T24) : 1, 3 >}. Thus, the support of this candidate 2-domain sequential pattern is derived from TIS(< (a)(b) >) ∩ TIS(< (1)(2) >) = {< (T1)(T2)(T3)(T4) : 1, 2 >, < (T5)(T6)(T7) : 1, 2 >, < (T21)(T22)(T23)(T24) : 1, 3 >}, which contains three entries. Given a minimum support threshold of 3, the candidate is a 2-domain sequential pattern, since its support value is not less than the minimum support. Consequently, through the time-instance sets, support values for candidate multi-domain sequential patterns are derived.
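The support check for this example can be sketched in a few lines. The two time-instance sets are copied from the text; encoding each entry as a `(time_sequence, positions)` tuple and counting distinct time sequences in the intersection are assumptions of this sketch, and the helper name `support` is hypothetical.

```python
# Time-instance sets of < (a)(b) > and < (1)(2) >, as listed in the text.
# Each entry pairs a time sequence with the positions where the pattern occurs.
tis_ab = {
    (("T1", "T2", "T3", "T4"), (1, 2)),
    (("T1", "T2", "T3", "T4"), (1, 3)),
    (("T5", "T6", "T7"), (1, 2)),
    (("T21", "T22", "T23", "T24"), (1, 3)),
}
tis_12 = {
    (("T1", "T2", "T3", "T4"), (1, 2)),
    (("T5", "T6", "T7"), (1, 2)),
    (("T21", "T22", "T23", "T24"), (1, 3)),
}

def support(*tis_sets):
    """Count the distinct sequences left after intersecting the given sets."""
    common = set.intersection(*tis_sets)
    return len({time_seq for time_seq, _ in common})

min_support = 3
s = support(tis_ab, tis_12)
print(s, s >= min_support)  # 3 True
```

The entry < (T1)(T2)(T3)(T4) : 1, 3 > appears only in the first set, so it drops out of the intersection; the three remaining entries come from three different sequences.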

Once the 2-domain sequential patterns have been derived, they are put in the set SP2. Then, for each pattern in SP2, candidate 3-domain sequential patterns and their corresponding supports are generated by the same procedure. Given sequential patterns in k domains, k-domain sequential patterns are derived by iteratively adding one domain in each round until the number of rounds is k.
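The round-by-round expansion can be sketched as follows. This is an illustrative sketch rather than the thesis's implementation: it assumes each domain has already been mined, that each mined pattern comes with its time-instance set, and that support is the number of distinct sequences in the intersection. The names `expand_domains` and `count_sequences` and the sequence identifiers S1 to S4 are hypothetical.

```python
def count_sequences(tis):
    """Number of distinct time sequences in a time-instance set."""
    return len({time_seq for time_seq, _ in tis})

def expand_domains(domain_patterns, min_support):
    """Checking-phase sketch.

    domain_patterns[i] maps each pattern mined in domain i+1 to its
    time-instance set, a set of (time_sequence, positions) pairs.
    Returns the surviving combinations with their shared instances.
    """
    # SP1: the sequential patterns mined in the first domain.
    current = {(p,): tis for p, tis in domain_patterns[0].items()}
    # Add one domain per round.
    for next_domain in domain_patterns[1:]:
        expanded = {}
        for combo, tis in current.items():
            for q, tis_q in next_domain.items():
                common = tis & tis_q  # instances shared by the candidate
                if count_sequences(common) >= min_support:
                    expanded[combo + (q,)] = common
        current = expanded
    return current

# Hypothetical mined patterns; S1..S4 are made-up sequence identifiers.
d1 = {"<(a)(b)>": {("S1", (1, 2)), ("S1", (1, 3)), ("S2", (1, 2)), ("S3", (1, 3))}}
d2 = {
    "<(1)(2)>": {("S1", (1, 2)), ("S2", (1, 2)), ("S3", (1, 3))},
    "<(3)>": {("S1", (1,)), ("S2", (2,)), ("S4", (1,))},
}
result = expand_domains([d1, d2], min_support=3)
print(list(result))  # [('<(a)(b)>', '<(1)(2)>')]
```

In this toy input, `<(3)>` is frequent in its own domain but shares no instance with `<(a)(b)>`, so the candidate that combines them is discarded.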

Algorithm: IndividualMine

Input: Sequence databases across n domains D1, D2, . . . , Dn, and minimum support δ.

Output: Multi-domain sequential patterns across n domains.

Begin
  Let Ck be the set of candidate patterns across k domains, where k = 1, 2, . . . , n.
  Apply sequential pattern mining on each domain Di, i = 1, 2, . . . , n.
  Let SP1 be the set of sequential patterns mined in D1.
  For each domain Di+1, i = 1, 2, . . . , n − 1
    For each P ∈ SPi
      For each sequential pattern Q of Di+1
        If e(Q) = e(P) Then append

Even without joining, algorithm IndividualMine can still discover multi-domain sequential patterns. However, each domain must run a sequential pattern mining algorithm individually, which incurs a considerable mining cost. Furthermore, the sequential patterns mined from each domain do not necessarily become multi-domain sequential patterns.

Thus, to further reduce the cost of mining sequential patterns in each domain and the number of candidate multi-domain sequential patterns, we develop algorithm PropagatedMine in which those sequences that are likely to form multi-domain sequential patterns are extracted from their sequence databases.

[Figure: sequence databases D1, D2, . . . , Dn; labels: Sequential Pattern Mining, propagate, Propagated Table, Multi-domain Sequential Patterns, Results; phases: Mining Phase, Deriving Phase.]

Figure 5.3: Overview of algorithm PropagatedMine.

5.4.3 Algorithm PropagatedMine: Propagating Sequential Patterns among Domains

Algorithm PropagatedMine is designed to reduce the mining cost in each sequence database. Explicitly, algorithm PropagatedMine first performs sequential pattern mining in one domain (referred to as the starting domain) and then propagates the time-instance sets of the mined sequential patterns to other domains. By propagating time-instance sets, only those sequences whose time sequences appear in the time-instance sets are extracted, thereby reducing the mining space in each sequence database.

Algorithm PropagatedMine iteratively propagates the time-instance sets of multi-domain sequential patterns to the next domain until all domains have been mined. Figure 5.3 shows an overview of algorithm PropagatedMine, which has two phases: the mining phase and the deriving phase.
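The extraction step driven by propagation can be sketched as a filter. The sketch assumes a sequence database is stored as a mapping from time sequence to event sequence; the function name `propagate` and the sample data are hypothetical, and the sketch covers only the selection of sequences, not the mining that follows.

```python
def propagate(time_instance_set, database):
    """Keep only the sequences whose time sequence occurs in the
    propagated time-instance set."""
    wanted = {time_seq for time_seq, _ in time_instance_set}
    return {ts: seq for ts, seq in database.items() if ts in wanted}

# Hypothetical time-instance set of a pattern mined in the starting domain.
tis = {(("T1", "T2", "T3"), (1, 2)), (("T5", "T6"), (1, 2))}

# Hypothetical next-domain database: time sequence -> event sequence.
d2 = {
    ("T1", "T2", "T3"): [("1",), ("2",), ("3",)],
    ("T5", "T6"): [("1",), ("2",)],
    ("T8", "T9"): [("4",), ("5",)],
}

reduced = propagate(tis, d2)
print(sorted(reduced))  # [('T1', 'T2', 'T3'), ('T5', 'T6')]
```

Only two of the three sequences reach the next round of mining, which is the reduction in mining space that the propagation is meant to achieve.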

In the mining phase, PropagatedMine utilizes existing sequential pattern mining algorithms to discover sequential patterns in a starting domain (i.e., D1) and then propagates these patterns to other domains. Note that the sequential patterns mined in the starting domain provide a guideline for extracting multi-domain sequential patterns from other domains. Hence, when mining multi-domain sequential patterns in sequence databases across multiple domains, the length and the number of elements of multi-domain sequences are constrained by the sequential patterns mined in the starting domain. Consequently, the sequential patterns mined in the starting domain can be represented as a lattice structure to facilitate the generation of candidate multi-domain sequential patterns across other domains.

For example, assume that the starting domain is set to D1 in Table 5.2 and that sequential patterns are then found using existing sequential pattern mining algorithms with the same minimum support of 3. The mined sequential patterns are represented as a lattice structure in Figure 5.4, where each node represents a sequential pattern, the links between nodes (or intradomain links) represent the containing relation, and nodes are ordered by the number of elements. In Figure 5.4, those nodes having the same number

[Figure 5.4: lattice of sequential patterns; surviving node labels: < (a) >, < (b) >, < (c) >, < (b, c) >.]
