Chapter 2 An Efficient Algorithm for Mining Temporal Patterns from Interval-based
2.2 Related Work
Sequential pattern mining is one of the most important research themes in data mining.
In recent years, a large body of work has addressed sequential pattern mining [1, 3, 6, 10, 11, 18, 20, 21, 22, 30, 32, 39] and its extensions, including closed pattern mining [4, 5, 15, 34, 38, 40] and incremental pattern mining [4, 5, 7, 9, 12, 14, 19, 23, 26, 28, 42], to name a few. Almost all of these studies focus on time point-based event data, which has no concept of duration. Some recent works have investigated the mining of interval-based events [2, 13, 16, 17, 24, 25, 27, 29, 31, 33, 35, 36, 37, 41].
Villafane et al. [33] proposed a graph mining technique to discover time interval-based sequential patterns by transforming data sequences into containment graphs. However, the containment rules discussed are restricted to “contains” and “during.” Kam et al. [16] proposed a compact encoding method, the hierarchical representation, and designed an algorithm to discover frequent temporal patterns. Although the hierarchical representation uses only k + (k - 1) = 2k - 1 memory space to describe a k-interval pattern (k event indices and k - 1 describers), it may suffer from two kinds of ambiguity. First, the same relationships among event intervals can be mapped to different temporal patterns. As shown in Fig. 2.1(a), the pattern can be expressed as “((A overlaps B) before C) overlaps D” or “(A overlaps B) before (C during D).” Second, the same temporal pattern can represent different relationships among event intervals. For example, Fig. 2.1(b) shows that the pattern “(A overlaps B) overlaps C” can represent two different relations among intervals.
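The second kind of ambiguity can be checked mechanically. The following Python sketch is an illustration only: the function names, the concrete interval coordinates, and the assumption that a nested term such as (A overlaps B) is treated as one composite interval running from the earliest start to the latest finish are ours and are not taken from [16].

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, finish), with start < finish


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds from interval a to interval b."""
    (s1, f1), (s2, f2) = a, b
    if f1 < s2:
        return "before"
    if f2 < s1:
        return "after"
    if f1 == s2:
        return "meets"
    if f2 == s1:
        return "met-by"
    if s1 == s2 and f1 == f2:
        return "equal"
    if s1 == s2:
        return "starts" if f1 < f2 else "started-by"
    if f1 == f2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and f1 < f2:
        return "during"
    if s1 < s2 and f2 < f1:
        return "contains"
    # Only the two partial-overlap cases remain at this point.
    return "overlaps" if s1 < s2 else "overlapped-by"


def span(a: Interval, b: Interval) -> Interval:
    """Composite interval covering both a and b (an assumption of this sketch)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def hierarchical(a: Interval, b: Interval, c: Interval) -> str:
    """Describe three intervals in the nested form '(A r1 B) r2 C'."""
    return f"(A {allen_relation(a, b)} B) {allen_relation(span(a, b), c)} C"


# Two hypothetical arrangements that differ only in where C starts.
first = {"A": (0, 4), "B": (2, 8), "C": (6, 10)}   # C starts after A finishes
second = {"A": (0, 4), "B": (2, 8), "C": (3, 10)}  # C starts before A finishes

for arrangement in (first, second):
    a, b, c = arrangement["A"], arrangement["B"], arrangement["C"]
    print(hierarchical(a, b, c), "| A vs C:", allen_relation(a, c))
```

Both arrangements yield the same nested expression, while the pairwise relation between A and C differs, which is the information the hierarchical form cannot recover.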
Fig. 2.1: Example of two ambiguous problems of hierarchical representation
Rainsford et al. [31] presented an approach that combines temporal semantics with association rules. The algorithm first generates the traditional association rules and then finds all the possible pairings of temporal items in each rule. Hoppner [13] proposed a nonambiguous representation, the relation matrix, which exhaustively lists all binary relationships between event intervals in a pattern. For example, pattern P in Fig. 2.2(a) can be represented as the matrix in Fig. 2.2(b). The relation matrix needs 2k + (k × (k - 1)) = k² + k memory space to describe a k-interval pattern (2k event indices and k² - k describers).
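As a rough illustration of the relation matrix, the Python sketch below lists every pairwise Allen relation for three intervals arranged like pattern P (A before B, B overlaps C). The coordinates, the lookup table keyed by endpoint comparisons, and all function names are assumptions made for this example; they are not taken from [13].

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, finish), with start < finish


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


# Key: signs of (s1 - s2, s1 - f2, f1 - s2, f1 - f2) for intervals a=(s1,f1), b=(s2,f2).
_ALLEN = {
    (-1, -1, -1, -1): "before",
    (-1, -1, 0, -1): "meets",
    (-1, -1, 1, -1): "overlaps",
    (-1, -1, 1, 0): "finished-by",
    (-1, -1, 1, 1): "contains",
    (0, -1, 1, -1): "starts",
    (0, -1, 1, 0): "equal",
    (0, -1, 1, 1): "started-by",
    (1, -1, 1, -1): "during",
    (1, -1, 1, 0): "finishes",
    (1, -1, 1, 1): "overlapped-by",
    (1, 0, 1, 1): "met-by",
    (1, 1, 1, 1): "after",
}


def allen_relation(a: Interval, b: Interval) -> str:
    (s1, f1), (s2, f2) = a, b
    return _ALLEN[(_sign(s1 - s2), _sign(s1 - f2), _sign(f1 - s2), _sign(f1 - f2))]


def relation_matrix(intervals: Dict[str, Interval]) -> List[List[str]]:
    """Entry [r][c] is the relation from the r-th interval to the c-th interval."""
    labels = list(intervals)
    return [[allen_relation(intervals[r], intervals[c]) for c in labels] for r in labels]


# Hypothetical coordinates chosen so that A is before B and B overlaps C.
P = {"A": (0, 1), "B": (2, 4), "C": (3, 5)}
matrix = relation_matrix(P)
for label, row in zip(P, matrix):
    print(label, row)

k = len(P)
# 2k row/column labels plus the k*(k-1) off-diagonal describers.
cells = 2 * k + k * (k - 1)
print("space for k =", k, "is", cells)
```

Every pair is stored in both directions, which removes the ambiguity but makes the size grow quadratically in k.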
H-DFS [27] was proposed to discover frequent arrangements of temporal intervals. This approach transforms an event sequence into a vertical representation using id-lists; the id-list of one event is merged with the id-lists of other events to generate temporal patterns. TSKR [24] expresses the temporal concepts of coincidence and partial order for interval patterns. A pattern in TSKR format is easy to understand, but it may describe the relationships between pairs of event intervals ambiguously. For example, in Fig. 2.2(a), patterns P and Q are represented by the identical TSKR expression “AB(BC)C.”
Laxman et al. [17] extended the original framework of frequent episode discovery in event sequences by incorporating event duration constraints. The authors also presented algorithms based on finite-state automata. Building on the efficient algorithm MEMISP [20], ARMADA [35] was proposed to find frequent temporal patterns in large databases.
DTP [41] partitions the database into disjoint datasets so that scanning the whole database can be avoided when calculating the support of each pattern. However, DTP discusses only two of the Allen relationships: “contains” and “during.”
Fig. 2.2: Example of relation matrix representation. (a) Two example temporal patterns. (b) Relation matrix for P.
Temporal representation [36] uses endpoint arrangements to represent a temporal pattern nonambiguously. For example, in Fig. 2.2(a), pattern P can be represented as the expression “A+ < A- < B+ < C+ < B- < C-”, where “+” and “-” denote the start and finish endpoints of an event interval, respectively. It requires 2k + (2k - 1) = 4k - 1 space to describe a k-interval pattern (2k event indices and 2k - 1 describers). TPrefixSpan [36] uses the temporal representation to discover frequent temporal patterns. TPrefixSpan first generates all the possible candidates and then discovers frequent events and scans the projected databases for support counting.
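The endpoint expression for pattern P can be reproduced with a few lines of Python. This is a minimal sketch under our own assumptions: the coordinates are hypothetical, the function name is ours, and writing "=" between endpoints that share a timestamp is an assumption, since the example in the text shows only "<".

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, finish), with start < finish


def endpoint_representation(intervals: Dict[str, Interval]) -> str:
    """Order all start (+) and finish (-) endpoints by time and join them."""
    points: List[Tuple[float, str]] = []
    for label, (start, finish) in intervals.items():
        points.append((start, label + "+"))
        points.append((finish, label + "-"))
    if not points:
        return ""
    points.sort()  # by time first, then by endpoint name for ties
    parts = [points[0][1]]
    for (prev_time, _), (time, name) in zip(points, points[1:]):
        # "=" for simultaneous endpoints is an assumption of this sketch.
        parts.append("<" if prev_time < time else "=")
        parts.append(name)
    return " ".join(parts)


# Hypothetical coordinates chosen so that A is before B and B overlaps C.
P = {"A": (0, 1), "B": (2, 4), "C": (3, 5)}
expression = endpoint_representation(P)
print(expression)

k = len(P)
symbols = len(expression.split())  # 2k endpoints plus 2k - 1 order symbols
print("space for k =", k, "is", symbols)
```

Because the total order of the endpoints is written out, each pairwise relation can be read back from the expression, at the cost of roughly four symbols per interval.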
Patel et al. [29] used additional counting information to achieve a lossless hierarchical representation, named the augmented representation. Every Allen describer needs extra space to store five counters, i.e., the contain, finish-by, meet, overlap, and start counters, which accumulate the occurrences of the corresponding relations. For example, in Fig. 2.2(a), pattern P can be represented as the expression “(A before[0,0,0,0,0] B) overlaps[0,0,0,1,0] C.” The counter of the overlaps describer is [0,0,0,1,0] since C only overlaps B. The augmented hierarchical representation is not easily comprehensible and needs k + (k - 1) × 6 = 7k - 6 memory space for a k-interval pattern (k event indices and 6 × (k - 1) describers). IEMiner [29] was designed to discover frequent temporal patterns from interval-based events based on the augmented representation.
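The space requirements quoted in this section can be compared side by side. The short Python sketch below simply evaluates the four formulas stated above for a few pattern sizes; the dictionary layout and the chosen values of k are ours.

```python
from typing import Callable, Dict, List

# Space formulas as stated in the text, as functions of the pattern size k.
SPACE: Dict[str, Callable[[int], int]] = {
    "hierarchical [16]": lambda k: k + (k - 1),            # 2k - 1
    "relation matrix [13]": lambda k: 2 * k + k * (k - 1),  # k^2 + k
    "temporal [36]": lambda k: 2 * k + (2 * k - 1),         # 4k - 1
    "augmented [29]": lambda k: k + (k - 1) * 6,            # 7k - 6
}


def space_table(sizes: List[int]) -> Dict[str, List[int]]:
    """Evaluate every formula for each pattern size in sizes."""
    return {name: [cost(k) for k in sizes] for name, cost in SPACE.items()}


sizes = [2, 3, 5, 10]
table = space_table(sizes)
print(f"{'k':22s}", sizes)
for name, row in table.items():
    print(f"{name:22s}", row)
```

The relation matrix is the only representation here whose size grows quadratically in k; the other three grow linearly with different constants.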
HTPM [37] was developed to mine hybrid temporal patterns from event sequences that contain both point-based and interval-based events. The authors modified the temporal representation [36] to also express event points. Moerchen et al. developed a new kind of pattern, SIPO [25], to express Allen relationships. The authors use the boundaries of intervals and further consider noise tolerance. However, SIPO may suffer from ambiguity, and its mining algorithm must discover both closed sequential patterns and closed itemsets and is therefore time consuming.
Our work reported in this chapter makes three contributions. First, we propose an incision strategy to simplify the processing of complex relations when mining temporal patterns. The incision strategy segments all intervals into disjoint slices based on the global information in a pattern. Second, based on the incision strategy, we develop a new representation, the coincidence representation, to express a pattern or sequence nonambiguously. As mentioned above, various existing representations may lead to different kinds of problems. An appropriate representation can facilitate processing and improve the performance of an algorithm. Coincidence representation has several advantages and we will discuss
The final contribution is that we design a new algorithm, CTMiner, which can effectively avoid the cost of candidate generation and testing when mining temporal patterns. We first transform the interval sequences in the database into the coincidence format and then borrow the idea of PrefixSpan [21] (Prefix-projected Sequential pattern mining), an efficient pattern growth-based algorithm for finding sequential patterns in transactional databases, to mine frequent temporal patterns. Furthermore, CTMiner employs the proposed optimization strategies to reduce the search space and avoid non-promising projections. Experiments on both synthetic and real datasets show that CTMiner outperforms state-of-the-art algorithms. Our experimental results also show that the proposed approach consumes much less memory.
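To make the pattern-growth idea borrowed from PrefixSpan [21] concrete, the following Python sketch mines ordinary point-based sequences in which each element is a single item. It is not CTMiner and does not use the coincidence representation; the function name, the pointer-based projection, and the toy database are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def prefixspan(db: Sequence[Sequence[str]], min_support: int) -> Dict[Pattern, int]:
    """Return every frequent sequential pattern with its support count."""
    patterns: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, projected: List[Tuple[int, int]]) -> None:
        # Each entry is (sequence index, position right after the last matched item),
        # so the projected database is a list of pointers rather than copied suffixes.
        occurrences: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            seq = db[seq_id]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:  # count each sequence once, at its first match
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, postfixes in sorted(occurrences.items()):
            if len(postfixes) >= min_support:
                pattern = prefix + (item,)
                patterns[pattern] = len(postfixes)
                grow(pattern, postfixes)  # extend only frequent prefixes

    grow((), [(i, 0) for i in range(len(db))])
    return patterns


# Toy database of four point-based sequences (hypothetical data).
db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
result = prefixspan(db, min_support=2)
for pattern, support in result.items():
    print(" ".join(pattern), support)
```

Patterns are only ever extended by items that are frequent in the projected database of their prefix, so no candidate set is generated and tested separately.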