Chapter 5 Proposed Interval-based Event Mining Algorithm: CTMiner
5.2 Phase II: Coincidence Mining
Building on the concepts of the sequential pattern mining algorithm PrefixSpan [4], we propose the CPrefixSpan algorithm as the mining phase of CTMiner; it discovers all frequent coincidence patterns as defined in Def. 6. In the following, we first give a brief description of PrefixSpan and then present the CPrefixSpan algorithm in detail.
PrefixSpan uses a divide-and-conquer strategy to solve the sequential pattern mining problem on time point-based data. First, it scans the database to find all frequent 1-patterns, i.e., L1. Second, supposing there are |L1| patterns in L1, the original database is divided into |L1| partitions, where each partition is the projection of the sequence database with respect to one frequent 1-pattern as a prefix. Third, as in the first step, each partition is treated like the original database and all local frequent 1-patterns are found in it. Appending these frequent 1-patterns to the prefix generates frequent sequential patterns whose length is increased by one. Finally, running steps two and three recursively derives all frequent sequential patterns, until the prefixes cannot be extended any further. From a high-level point of view, CPrefixSpan is similar to PrefixSpan. However, PrefixSpan uses itemsets and items to represent a time point-based event sequence, whereas CPrefixSpan uses coincidences and event slices to represent an interval-based event sequence.
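The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [4]: it assumes that every sequence is a list of single items rather than a list of itemsets, and the function name `prefixspan` and its parameters are chosen here for illustration only.

```python
from collections import Counter


def prefixspan(database, min_supp, prefix=()):
    """Return {pattern: support} for all frequent patterns extending `prefix`.

    Simplifying assumption: each sequence is a list of single items,
    not a list of itemsets as in the full PrefixSpan algorithm.
    """
    patterns = {}
    # Steps 1 and 3: count each item at most once per (projected) sequence.
    counts = Counter()
    for seq in database:
        counts.update(set(seq))
    for item, supp in sorted(counts.items()):
        if supp < min_supp:
            continue
        # Appending a local frequent item lengthens the pattern by one.
        new_prefix = prefix + (item,)
        patterns[new_prefix] = supp
        # Step 2: project each sequence on the first occurrence of `item`.
        projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
        # Recurse on the partition that belongs to the new prefix.
        patterns.update(prefixspan(projected, min_supp, new_prefix))
    return patterns


db = [list("abc"), list("acb"), list("ab")]
print(prefixspan(db, min_supp=2))
```

With the toy database above and a minimum support of 2, the sketch reports the patterns a, ab, ac, b and c; the pattern bc is not frequent because "b followed by c" occurs in only one sequence.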
We now discuss the similarities and differences between the two mining algorithms. Conceptually, the relationship of “event slice versus coincidence” is analogous to “item versus itemset,” since we have cut each event sequence into event slices and reduced the complex relationships among events to relationships among event slices. The possible relationships between event slices are the same as those between items, i.e., “after,” “before,” and “equal.” At a lower level, however, many implementation details of CPrefixSpan differ from those of PrefixSpan. The main reason for these differences is the nature of interval-based events and of the coincidence representation: the optimizations are designed around the pairing of start slices and finish slices. In essence, therefore, PrefixSpan serves as the foundation that the proposed algorithm extends. The pseudocode of CPrefixSpan is shown in Fig. 5-4.
Algorithm 3: CPrefixSpan(D|a, min_supp, F)
Input: A projected database D|a, whose prefix is represented as the Csequence a; the minimum support threshold min_supp; a set of frequent coincidence patterns F
Output: The updated set of frequent coincidence patterns F
Variables: coincidence, E1
1: for each sequence s in D|a do
2:     Count_freq_event_slices(s);
3: put all frequent 1-event slices into E1;
4: for each finish slice e in E1 do
5:     elimination_test(e, a); // an intact slice is formed w.r.t. e
6: for each frequent 1-event slice e in E1 do
7:     a’ ← append event slice e to a;
8:     construct projected database D|a’;
9:     if form_a_pattern_test(a’) then
10:        F ← F ∪ a’;
11:    if |D|a’| ≥ min_supp then
12:        call CPrefixSpan(D|a’, min_supp, F);
Figure 5-4 The CPrefixSpan algorithm.
Algorithm CPrefixSpan scans the projected coincidence database to collect all local frequent 1-event slices, and three pruning strategies are applied at this stage to avoid meaningless further processing. Intuitively, all event slices in the postfix sequences are counted to obtain the frequent 1-event slices. An event can be represented either as a pair of start and finish slices or as an intact slice. Therefore, when a start slice is appended to the prefix, we need to mark its corresponding finish slice in the postfix sequence; that finish slice will be appended to the prefix in later processing. In the example shown in Fig. 5-5, we initially have a temporal database with frequent 1-event slices {A, A+, C, C+, D} and a minimum support of 2. After running CPrefixSpan once on the original database, five projected databases are created, one for each frequent 1-event slice. We use the projected database with prefix 〈(A+)〉 and its potential frequent 1-event slices {A-, C, C+, C-, D} as the example to illustrate the following three pruning strategies.
Temporal DB
Sid1: 〈(A+)(C+)(B)(C-)(A-)(D)〉
Sid2: 〈(A+)(C+)(A-)(DC-)〉
Figure 5-5 Illustrating three pruning strategies.
Pruning strategy 1: Finish slices in a postfix sequence whose corresponding start slices are not in the prefix are not counted. Likewise, once a finish slice has been appended to the prefix, further processing excludes its corresponding start slice. A pattern that contains only the start slice or only the finish slice of an event is meaningless. A finish slice whose corresponding start slice is in the prefix is marked to speed up further processing. For example, both A- and C- occur in two event sequences of the projected database with prefix 〈(A+)〉 in Fig. 5-5, but C- has no corresponding start slice C+ in the prefix. If we treated C- as a frequent 1-event slice and appended it to the prefix, we would obtain the new prefix 〈(A+)(C-)〉 and only one postfix sequence, 〈(A-)(D)〉. The coincidence pattern 〈(A+)(C-)〉 is incomplete and can never become a complete coincidence pattern, because C+ has been permanently discarded by the projection scheme. Thus, C- can be pruned as a frequent 1-event slice in this case.
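The finish-slice part of pruning strategy 1 can be illustrated with a small Python sketch. The encoding is an assumption made here, not taken from the thesis: an event slice is a string such as "A+" (start), "A-" (finish) or "A" (intact), a coincidence is a tuple of slices, and a Csequence is a list of coincidences. The helper names `open_events` and `countable_slices` are hypothetical. The sketch does not derive the intact slice C from the pair C+ and C-, and it does not model the exclusion of start slices after their finish slice has been appended.

```python
def open_events(prefix):
    """Events whose start slice is in the prefix but whose finish slice is not."""
    opened = set()
    for coincidence in prefix:
        for s in coincidence:
            if s.endswith("+"):
                opened.add(s[:-1])
            elif s.endswith("-"):
                opened.discard(s[:-1])
    return opened


def countable_slices(prefix, postfix):
    """Slices of one postfix sequence that survive pruning strategy 1."""
    opened = open_events(prefix)
    kept = set()
    for coincidence in postfix:
        for s in coincidence:
            if s.endswith("-") and s[:-1] not in opened:
                continue  # finish slice without its start slice in the prefix
            kept.add(s)
    return kept


# Projected database with prefix <(A+)> from Fig. 5-5.
prefix = [("A+",)]
sid1 = [("C+",), ("B",), ("C-",), ("A-",), ("D",)]
sid2 = [("C+",), ("A-",), ("D", "C-")]
print(countable_slices(prefix, sid1))
print(countable_slices(prefix, sid2))
```

In both sequences C- is dropped because C+ is not in the prefix, while A- is kept because its start slice A+ is.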
Projected DB with prefix 〈(A+)〉 (Fig. 5-5)
Sid1: 〈(C+)(B)(C-)(A-)(D)〉
Sid2: 〈(C+)(A-)(DC-)〉
Pruned prefixes: 〈(A+)(C-)〉 (pruning 1), 〈(A+)(D)〉 (pruning 2), 〈(A+)(A-)〉 = 〈(A)〉 (pruning 3)
Pruning strategy 2: Only the event slices that occur before the first marked finish slice in a postfix sequence are counted. The first marked finish slice indicates that its corresponding start slice exists in the prefix. If an event slice after the first marked finish slice were appended to the prefix, further processing would skip the first marked finish slice, which violates the pairing of start and finish slices. Further processing with respect to such a prefix is therefore meaningless. In the running example of Fig. 5-5, the intact slice D, which occurs after the first marked finish slice A-, is a frequent 1-event slice. Appending D forms the new prefix 〈(A+)(D)〉, and the only postfix sequence with respect to this prefix is 〈(C-)〉. The marked finish slice A- is thus permanently omitted from further processing, so the pattern 〈(A+)(D)〉 is incomplete as a prefix and meaningless for further processing. The function Count_freq_event_slices(α) implements the above two pruning strategies while counting frequent 1-event slices in the postfix sequence α (Line 2, Algorithm 3).
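A counting routine that combines pruning strategies 1 and 2 might look like the following Python sketch. It is an illustration under assumptions, not the thesis implementation: slices are encoded as strings ("A+", "A-", "A"), coincidences as tuples, and the signature of `count_freq_event_slices` (taking the whole projected database and the threshold) is chosen here for convenience. The sketch also assumes that slices sharing a coincidence with the first marked finish slice are still counted, which the text does not state explicitly.

```python
from collections import Counter


def open_events(prefix):
    """Events whose start slice is in the prefix but whose finish slice is not."""
    opened = set()
    for coincidence in prefix:
        for s in coincidence:
            if s.endswith("+"):
                opened.add(s[:-1])
            elif s.endswith("-"):
                opened.discard(s[:-1])
    return opened


def count_freq_event_slices(prefix, projected_db, min_supp):
    """Return {slice: support} for the local frequent 1-event slices."""
    opened = open_events(prefix)
    support = Counter()
    for postfix in projected_db:
        seen = set()  # each slice counts once per sequence
        for coincidence in postfix:
            hit_marked = False
            for s in coincidence:
                if s.endswith("-"):
                    if s[:-1] not in opened:
                        continue  # strategy 1: unmatched finish slice
                    hit_marked = True  # a marked finish slice
                seen.add(s)
            if hit_marked:
                break  # strategy 2: ignore everything after it
        support.update(seen)
    return {s: n for s, n in support.items() if n >= min_supp}


prefix = [("A+",)]
projected_db = [
    [("C+",), ("B",), ("C-",), ("A-",), ("D",)],  # Sid1
    [("C+",), ("A-",), ("D", "C-")],              # Sid2
]
print(count_freq_event_slices(prefix, projected_db, min_supp=2))
```

For the projected database of Fig. 5-5 this leaves only C+ and A- as frequent: C- is removed by strategy 1, D is removed by strategy 2 because it occurs after A-, and B is infrequent.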
Pruning strategy 3: A new prefix is formed by appending each frequent finish slice to the original prefix. The third pruning strategy tests whether the finish slice and its corresponding start slice properly form an intact slice in the new prefix. If the intact slice forms without eliminating any event slice, further processing of the new prefix can be skipped thanks to the divide-and-conquer strategy, because that processing is already performed in another data partition, namely the one whose prefix contains the intact slice. In the running example of Fig. 5-5, A- is a frequent 1-event slice. The new prefix 〈(A+)(A-)〉 is generated, and this pattern is equivalent to the intact slice 〈(A)〉. By the divide-and-conquer strategy, its further processing is the same as the processing of the projected database with prefix 〈(A)〉 and can therefore be omitted. The third pruning strategy is implemented in the function elimination_test(e, a) (Line 5, Algorithm 3), where e is a frequent finish slice and a is the prefix. For the example shown in Fig. 5-6, if we append A- to the prefix 〈(A+B+C+)〉 to form the new prefix 〈(A+B+C+)(A-)〉, the following conditions are checked to determine whether elimination_test(A-, a) succeeds. Assume that A+ and A- lie in coincidences ci and cj of a, respectively. (1) There is no coincidence between ci and cj in a. If there were coincidences between ci and cj, forming the intact slice A would require eliminating the event slices in those coincidences, i.e., A+ and A- could not be merged properly. (2) There are no other event slices in ci and cj besides start slices occurring after A+ in ci; i.e., the relation between event A and an event with such a start slice can only be “starts” or “equal,” because the two events share the same start time.
After elimination_test succeeds, the new prefix can be represented as 〈(AB+C+)〉, i.e., A+ and A- form A, and further processing of this prefix can be skipped.
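The two conditions can be sketched in Python as follows. This is one possible reading of the conditions, not the thesis code: it assumes the string encoding "A+" / "A-" / "A" for slices, that the finish slice e is appended as a new last coincidence cj containing only e, that each event occurs at most once in the prefix, and that the function returns the merged prefix (or None on failure), which the original pseudocode does not specify.

```python
def elimination_test(e, prefix):
    """Try to merge finish slice `e` (e.g. "A-") with its start slice.

    Returns the merged prefix if the intact slice forms without
    eliminating any other event slice, otherwise None.
    """
    event = e[:-1]
    start = event + "+"
    # Locate the coincidence c_i that holds the start slice.
    i = next((k for k, c in enumerate(prefix) if start in c), None)
    if i is None:
        return None
    # Condition (1): no coincidence lies between c_i and c_j. Because `e`
    # is appended as a new last coincidence, c_i must be the current last one.
    if i != len(prefix) - 1:
        return None
    ci = prefix[i]
    # Condition (2): c_i holds only the start slice followed by other
    # start slices; c_j holds only `e` by construction.
    if ci[0] != start or any(not s.endswith("+") for s in ci[1:]):
        return None
    merged = (event,) + tuple(ci[1:])
    return list(prefix[:i]) + [merged]


print(elimination_test("A-", [("A+", "B+", "C+")]))  # example of Fig. 5-6
print(elimination_test("A-", [("A+",), ("C+",)]))    # a coincidence lies in between
```

The first call reproduces the example of Fig. 5-6 and yields the single coincidence (A, B+, C+); the second call fails condition (1) and returns None.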
After these two functions have run, each frequent 1-event slice can be appended to the original prefix to generate a new pattern whose length is increased by one. In this way, the prefixes are extended.
In the PrefixSpan algorithm, frequent patterns are output directly by appending each frequent item to the prefix. In CPrefixSpan, however, many prefixes are only intermediate coincidence patterns rather than frequent coincidence patterns. The function form_a_pattern_test(p) determines whether the prefix p is an intermediate coincidence pattern or a frequent coincidence pattern (Line 9, Algorithm 3). The prefix is treated as a frequent coincidence pattern if all start slices and finish slices in it are paired correctly.
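The pairing check can be illustrated with a short Python sketch. It assumes the same string encoding as before ("A+" for a start slice, "A-" for a finish slice, "A" for an intact slice) and that no event is started twice before it is finished; the body of `form_a_pattern_test` below is an illustrative guess at the test, not the thesis implementation.

```python
def form_a_pattern_test(prefix):
    """True if every start slice in `prefix` is closed by a later finish slice."""
    opened = set()
    for coincidence in prefix:
        for s in coincidence:
            if s.endswith("+"):
                if s[:-1] in opened:
                    return False  # event started twice without finishing
                opened.add(s[:-1])
            elif s.endswith("-"):
                if s[:-1] not in opened:
                    return False  # finish slice without a start slice
                opened.remove(s[:-1])
            # Intact slices such as "D" are complete on their own.
    return not opened  # no start slice may be left unpaired


print(form_a_pattern_test([("A+",), ("C+",), ("A-",), ("C-",)]))  # all pairs closed
print(form_a_pattern_test([("A+",), ("D",)]))                     # A+ still open
```

Under this sketch, a prefix such as 〈(A+)(D)〉 is only an intermediate pattern, whereas 〈(A+)(C+)(A-)(C-)〉 qualifies as a complete coincidence pattern.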
Finally, if the number of sequences in the projected database is at least min_supp, CPrefixSpan is run recursively on the projected database with respect to the extended prefix, until the prefix cannot be extended any further. At that point, all frequent coincidence patterns have been discovered (Lines 11 and 12, Algorithm 3).
prefix: 〈(A+B+C+)〉
Figure 5-6 Illustration of a successful elimination_test on A- with all possible correlative event slices in the prefix.
We take the database in Table 5-1 with min_supp = 2 as an example. There are 17 event records, which can be regarded as 4 event sequences in the temporal database. After scanning the original temporal database, we find all the frequent 1-patterns: 〈A〉: 3, 〈B〉: 4, 〈D〉: 4, and 〈E〉: 4, where the notation “〈pattern〉: count” denotes a pattern and its associated support count.
For each original event sequence in Table 5-1, the Csequences are constructed and projected with respect to the event slices created from the frequent 1-patterns, as shown in the first two columns of Table 5-2; pictorial examples of the projected databases are shown in Fig. 5-7. We take the coincidence database for event A as an example and discuss it in detail. We have to consider the patterns prefixed with the intact slice 〈A〉 and with the start slice 〈A+〉. Note that when counting the support and constructing the projected database for the intact slice 〈A〉, we must also consider occurrences of the start slice 〈A+〉 in the sequences, since both represent the existence of event A. The projection with respect to 〈A〉 contains 3 sequences: SID 1: 〈(C+)(B-C-)(D+)(E)(D-)〉, SID 3: 〈(B-)(@D+)(E)(D-)〉, and SID 4: 〈(D+)(E)(D-)〉. At the same time, we project the sequences with respect to 〈A+〉 and obtain 2 sequences: 〈(B+C+A-)(B-C-)(D+)(E)(D-)〉 and 〈(B+A-)(B-)(@D+)(E)(D-)〉. We then collect the sequences corresponding to each coincidence prefix into a coincidence database, so that both D|〈(A)〉 and D|〈(A+)〉 are obtained. Continuing the recursive process on D|〈(A)〉 and D|〈(A+)〉, we can discover all frequent coincidence patterns prefixed with 〈(A)〉 and 〈(A+)〉, respectively. The last column of Table 5-2 lists all frequent coincidence patterns.
Original coincidence database with frequent 1-event slices
Prefix | Projected database | Frequent coincidence patterns
〈(A)〉 | 1:〈(B-)(D+)(E)(D-)〉 |
Table 5-2 Projected databases and frequent temporal patterns
___________________________
1. Note that a first coincidence marked with “_” in a projected Csequence indicates that the last coincidence in the prefix is the same as the first coincidence in the projected Csequence. A finish slice set in bold italics has its corresponding start slice in the prefix.
(a) Csequence of SID 1 and projected sequence w.r.t. intact slice B.
(b) Csequence of SID and projected sequence w.r.t. intact slice B.
(c) Csequence of SID 3 and projected sequence w.r.t. intact slice B.
(d) Csequence of SID 4 and projected sequence w.r.t. intact slice B.
Figure 5-7 Database of Csequences and projected databases w.r.t. intact slice B