Chapter 2 An Efficient Algorithm for Mining Temporal Patterns from Interval-based
3.5 Experimental Results
3.5.3 Real-World Dataset Analysis
In addition to the synthetic data sets, we also performed an experiment on a real-world dataset to compare performance and to demonstrate the applicability of closed temporal pattern mining. The database used in the experiment consists of 1,098,142 library records (lending and returning) collected over three years from the National Chiao Tung University (NCTU) Library. The database includes 206,844 books and 28,339 readers. An event interval is constructed from a book ID and the corresponding lending and returning times. The size of the database is the number of sequences it contains (equal to the number of readers, 28,339). The maximal and average sequence lengths are 262 and 38, respectively.
Figure 3.12 shows the performance and mining results. Fig. 3.12(a) shows the running time of the five mining algorithms as the minimum support threshold varies from 0.1% to 0.05%, and Fig. 3.12(b) shows the number of generated patterns under the different thresholds. When the minimum support drops to 0.05%, there are 13,550 closed patterns; CEMiner runs about 1.5 times faster than CTMiner, more than 2 times faster than TPrefixSpan, and about 5 times faster than IEMiner, while H-DFS never terminated.
Fig. 3.12: The performance and mining results on the library data set from NCTU (x-axis: minimum support (%))
3.6 Summary
Previous studies on mining closed sequential patterns have mainly focused on time point-based data. Little attention has been paid to mining closed temporal patterns from time interval-based data. Since processing the complex relations among intervals may require generating and examining a large number of intermediate subsequences, mining closed temporal patterns from time interval-based data is an arduous problem. In this chapter, we develop an efficient algorithm, CEMiner, to discover closed temporal patterns without candidate generation, based on the proposed endpoint representation. The algorithm further employs three pruning methods to reduce the search space effectively. The experimental studies indicate that CEMiner is efficient and scalable, and that it outperforms state-of-the-art algorithms in both running time and memory usage. Furthermore, we apply CEMiner to a real-world dataset to show the efficiency and practicability of mining time interval-based closed patterns.
Chapter 4
Incremental Mining Temporal Patterns from Interval-based Database
4.1 Introduction
Sequential pattern mining is an essential data mining technique with broad applications, such as market and customer analysis, network intrusion detection, analysis of Web access, and finding tandem repeats in DNA sequences, to name a few. Several efficient algorithms exhibit excellent performance in discovering sequential patterns from a static database, i.e., they mine the entire database and acquire the results in a one-stop solution. Nevertheless, the assumption of a static database may not hold in a number of applications. The database usually grows incrementally over time as new data is added. Algorithms designed for a static database consider neither the evolution of the database nor the maintenance of discovered sequential patterns. The result mined from the original database may no longer be valid, since existing sequential patterns may become invalid and new sequential patterns may be introduced as the database evolves. Obviously, re-mining the updated database from scratch each time is inefficient because it wastes computational resources and neglects the previous mining result.
Previous research on incremental mining algorithms [4, 5, 7, 9, 12, 14, 19, 23, 26, 28, 42] mainly focused on sequential patterns discovered from time point-based data. Prior work has claimed that, in reality, mining time interval-based patterns is more practical [8]. Interval-based sequential patterns, also referred to as temporal patterns, can sometimes reveal more precise information. In many real-world applications, some events intrinsically persist for periods of time instead of occurring instantaneously, and therefore cannot be treated as “time points.” In such cases, the data is usually a sequence of interval events with both start and finish times. Examples include library lending, stock fluctuation, patient diseases, and meteorology data.
Table 4.1: Part of the temporal patterns discovered from the NCTU library

PID   temporal pattern   support
1     (pictorial)        163 (0.57%)
2     (pictorial)        43 (0.15%)
3     (pictorial)        92 (0.32%)
4     (pictorial)        35 (0.12%)
Consider an example of mining temporal patterns from the NCTU library lending dataset. Usually, there is a duration between the time a reader borrows a book and the time he or she returns it. Thus, the lending dataset is, in general, time interval-based. By extracting users’ lending patterns, we could develop a recommendation system for the library. This information would be more helpful than conventional time point-based sequential patterns. Table 4.1 illustrates some temporal patterns (part of the mining results) discovered from the NCTU library.
We use patterns 1 and 2 for discussion. Suppose that two readers, Mary and Sue, both check out the books “The Know-It-All” and “The Curious Incident of the Dog in the Night-time.” If Mary checks out the two books simultaneously, the library can send her an e-mail to notify her that the book “The Hitchhiker's Guide to the Galaxy” is still on the shelf, or that the book “The Restaurant at the End of the Universe” will be returned by June 23, 2011. However, if Sue checks out the two books at different times, the library may send her an e-mail to notify her about the availability of the books “Le Cosmicomiche” or “The One Hundred Years of Solitude.” Temporal patterns thus offer a more expressive way to present correlations among data than conventional sequential patterns.
[Pictorial patterns of Table 4.1; the interval diagrams are labeled, in order, with the book titles “The Inheritance of Loss”, “I Served the King of England”, “The End of the Affair”, “The Pearl in the Deep”, “I Served the King of England”, “The End of the Affair”, “Le Cosmicomiche”, “The Curious Incident of the Dog in the Night-time”, “The Know-It-All”, “The One Hundred Years of Solitude”, “The Restaurant at the End of the Universe”, “The Hitchhiker's Guide to the Galaxy”, “The Curious Incident of the Dog in the Night-time”, “The Know-It-All”]
Allen’s 13 temporal logics [2] are usually adopted to describe the complex relations among intervals: “before,” “after,” “overlap,” “overlapped by,” “contain,” “during,” “start,” “started by,” “finish,” “finished by,” “meet,” “met by,” and “equal.” However, Allen’s temporal logics are binary relations and may run into several problems when describing relationships among more than three event intervals. An appropriate representation is crucial in this circumstance. Various representations [8, 13, 16, 24, 25, 29, 36] have been proposed; however, most of them are limited by either ambiguity or poor scalability. In this chapter, we utilize endpoint arrangements to simplify the processing of complex relations, which is the major bottleneck of incremental mining of temporal patterns. Since endpoints are time points and therefore cannot overlap, Allen’s 13 temporal logics can be reduced to 3 relations between endpoints, i.e., “before,” “equal,” and “after.”
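To illustrate this reduction, the following minimal Python sketch classifies two intervals into one of Allen's 13 relations using only the three endpoint relations. It is an illustration, not part of the proposed algorithm: the names `endpoint_relation` and `allen_relation` are hypothetical, and intervals are assumed to be (start, finish) pairs with start < finish.

```python
def endpoint_relation(a, b):
    """Relation between two endpoints: -1 = before, 0 = equal, 1 = after."""
    return (a > b) - (a < b)


# (relation of the two starts, relation of the two finishes) -> Allen relation.
# Only consulted when the two intervals share more than a single time point.
_BY_START_AND_FINISH = {
    (0, 0): "equal",
    (0, -1): "start",
    (0, 1): "started by",
    (1, 0): "finish",
    (-1, 0): "finished by",
    (-1, 1): "contain",
    (1, -1): "during",
    (-1, -1): "overlap",
    (1, 1): "overlapped by",
}


def allen_relation(x, y):
    """Allen relation of interval x with respect to interval y."""
    (xs, xf), (ys, yf) = x, y
    r = endpoint_relation(xf, ys)      # finish of x versus start of y
    if r < 0:
        return "before"
    if r == 0:
        return "meet"
    r = endpoint_relation(xs, yf)      # start of x versus finish of y
    if r > 0:
        return "after"
    if r == 0:
        return "met by"
    key = (endpoint_relation(xs, ys), endpoint_relation(xf, yf))
    return _BY_START_AND_FINISH[key]


print(allen_relation((1, 5), (3, 8)))  # overlap
```

Every branch compares exactly two endpoints, so no interval-level relation has to be stored or enumerated.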
As mentioned earlier, new time interval-based data is continually generated. To truly capture temporal patterns, one would have to re-execute an existing temporal pattern mining algorithm on the updated database whenever new data is appended or a new record is inserted. In this chapter, we aim to design algorithms that mine temporal patterns incrementally. To the best of our knowledge, no method has been proposed for discovering frequent sequential patterns from time interval-based data in an incremental environment. Since the features of time intervals differ considerably from those of time points, the pairwise relationships between any two interval events are intrinsically complex. These complex relations are a crucial problem in the design of an efficient and effective algorithm for maintaining temporal patterns. When an interval is appended to an event sequence, the complex relations may lead to the generation of a larger number of possible candidates and consume more memory.
Two types of incremental updates to an interval sequence database are considered: 1) inserting new sequences into the database, denoted as INSERT; and 2) appending new intervals to existing sequences, denoted as APPEND. A real-world application may include both types of updates. When the database is updated with a combination of INSERT and APPEND, we can regard INSERT as a special case of APPEND, since inserting a new sequence is equivalent to appending it to an empty sequence, as shown in Fig. 4.1. This chapter proposes an efficient algorithm, Inc_CTMiner, which stands for Incremental Coincidence Temporal Miner, to address this crucial problem and incrementally discover temporal patterns based on the coincidence representation. Furthermore, Inc_CTMiner employs pruning strategies to reduce the search space and avoid non-promising database projections. Experimental studies on both synthetic and real datasets indicate that, in the incremental environment, Inc_CTMiner is efficient and outperforms the state-of-the-art algorithms, which are designed for static databases. Our experiments also reveal that the proposed approach is scalable and consumes less memory. We also apply Inc_CTMiner to real-world datasets to demonstrate the practicability of maintaining temporal patterns.
The remainder of this chapter is organized as follows: Section 4.2 presents the related work; Section 4.3 introduces the preliminaries; Section 4.4 provides incremental mining algorithms; Section 4.5 presents the experimental results and performance study; and finally, Section 4.6 summarizes this chapter.
Fig. 4.1: Concept of INSERT and APPEND updates
[Figure labels: interval sequence; original database (DB); increment database (db); updated database (DB’); extended database (EDB); INSERT; APPEND]
4.2 Related Work
A number of studies have investigated the mining of temporal patterns [2, 8, 13, 17, 24, 25, 27, 29, 31, 33, 35, 36, 37, 41] in a static environment. Kam et al. [16] proposed a hierarchical representation and designed an algorithm to discover temporal patterns. Although the hierarchical representation is a compact encoding method, it suffers from two ambiguity problems: 1) the same relationships among event intervals can be mapped to different temporal patterns; and 2) the same temporal pattern can represent different relationships among event intervals. Höppner [13] proposed a nonambiguous representation, the relation matrix, which exhaustively lists all binary relationships between event intervals in a pattern. The mining algorithm needs to scan the database repeatedly, which considerably lowers its efficiency, and the relation matrix does not scale well when numerous intervals appear in a pattern.
H-DFS [27] was proposed to discover frequent arrangements of temporal intervals. This approach transforms an event sequence into a vertical representation using id-lists. However, H-DFS does not scale well as the temporal pattern length increases. TSKR [24] expresses the temporal concepts of coincidence and partial order for interval patterns. A pattern represented in TSKR format is easily understandable and robust; however, it may express the relationship between a pair of event intervals ambiguously. Based on MEMISP [20], ARMADA [35] was proposed to find temporal patterns in large databases. Since it is based on the relation matrix representation, memory usage is a substantial bottleneck when the database is very large.
TPrefixSpan [36] uses a temporal representation to discover temporal patterns nonambiguously, but it does not use any pruning strategy to reduce the search space. The augmented hierarchical representation [29] uses additional counting information to achieve a lossless expression; however, each Allen relation descriptor requires space to store five counters. Based on this representation, IEMiner [29] was proposed, which uses optimization strategies and removes non-promising candidate sequences, but it must scan the database multiple times.
A robust representation, SIPO [25], uses the partial order of intervals and considers noise tolerance to express relationships among intervals. Nevertheless, the proposed algorithm requires discovering both closed sequential patterns and closed itemsets, and is therefore time consuming.
The coincidence representation [8] is a compact representation proposed to facilitate the mining process. It first segments all intervals into disjoint slices based on the global information in a pattern, and subsequently groups all event slices occurring simultaneously to form a coincidence to represent a sequence.
A few prior works [4, 5, 7, 9, 12, 14, 19, 23, 26, 28, 42] have focused on incrementally mining sequential patterns from time point-based data. ISM [28] uses a sequence lattice of the original database for incremental mining of sequential patterns. The sequence lattice includes all of the frequent sequences and all of the sequences in the negative border. Two problems occur when using the negative border. First, the combined number of sequences in the frequent set and the negative border is large. Second, the sequences in the negative border are generated based on the structural relation between sequences, yet these sequences do not necessarily have high support. Therefore, using the negative border is very time and memory consuming. Zhang et al. [42] developed two candidate generate-and-test algorithms, GSP+ and MFS+, for incremental mining of sequential patterns when sequences are inserted into or deleted from the original database. ISE [23] is another incremental mining algorithm based on the candidate generate-and-test approach. The weakness of these three algorithms is that the candidate set may be very large and the level-wise approach requires multiple database scans. When the frequent sequences are long, the testing phase is usually slow and costly.
IncSpan [9] buffers a set of semi-frequent sequences as candidates in the updated database, which accelerates the maintenance process. Two optimization techniques, reverse pattern matching and shared projection, were proposed to improve the performance. However, IncSpan fails to find the complete set of sequential patterns from an updated database because several of its properties are incorrect. Nguyen et al. [26] proved the incompleteness of IncSpan and proposed an algorithm, IncSpan+, to correct the weaknesses of IncSpan. IncSP [12] solved the maintenance problem through effective implicit merging and efficient separate counting over appended sequences. Its early candidate pruning technique further speeds up the discovery of new patterns. PBIncSpan [7] uses a prefix tree to record all frequent sequences and the corresponding projected databases to maintain the discovered sequential patterns; however, such a method requires an extremely large storage space when the database is large. The proposed pruning strategy is based on the Apriori property and is inefficient when the prefix tree has numerous nodes.
Previous studies on incremental mining have mainly focused on time point-based data, which has no concept of duration. Limited attention has been paid to updating temporal patterns from interval-based databases. In this chapter, we design a new algorithm, Inc_CTMiner, which can incrementally discover temporal patterns effectively and efficiently.
4.3 Preliminary
Let $E = \{e_1, e_2, \ldots, e_k\}$ be the set of event symbols. Without loss of generality, we define a set of uniformly spaced time points based on the natural numbers $\mathbb{N}$. We say the triplet $(e_i, s_i, f_i) \in E \times \mathbb{N} \times \mathbb{N}$ is an event interval, where $e_i \in E$, $s_i, f_i \in \mathbb{N}$ and $s_i < f_i$. The two time points $s_i$ and $f_i$ are called event times, where $s_i$ is the starting time and $f_i$ is the finishing time. The set of all event intervals over $E$ is denoted by $I$. An event sequence is a series of event interval triplets $\langle (e_1, s_1, f_1), (e_2, s_2, f_2), \ldots, (e_n, s_n, f_n) \rangle$, where $s_i \le s_{i+1}$ and $s_i < f_i$. A temporal database is a set of tuples $\langle SID, Q \rangle$, where $SID$ is a sequence-id and $Q$ is an event sequence. For example, in Table 4.2, the temporal database $DB$ has 3 event sequences. Given two event sequences $Q$ and $Q'$, $Q'' = Q \diamond Q'$ means that $Q''$ is the concatenation of $Q$ and $Q'$. $Q'$ is called the appended sequence of $Q$, and $Q''$ is called the updated sequence of $Q$ appended with $Q'$.
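The definitions above can be sketched as a small data model. This is a minimal illustration under stated assumptions: event intervals are plain `(symbol, start, finish)` tuples, the helper names `is_event_interval`, `is_event_sequence` and `concat` are hypothetical, and concatenation is taken to be simple list concatenation.

```python
def is_event_interval(interval, symbols):
    """Check that (e, s, f) has a known symbol and integer times with s < f."""
    e, s, f = interval
    return (e in symbols and isinstance(s, int) and isinstance(f, int)
            and s < f)


def is_event_sequence(seq, symbols):
    """Check every interval and that starting times are non-decreasing."""
    return (all(is_event_interval(iv, symbols) for iv in seq)
            and all(a[1] <= b[1] for a, b in zip(seq, seq[1:])))


def concat(q, q_appended):
    """Updated sequence: q followed by the appended sequence q_appended."""
    return list(q) + list(q_appended)


# Sequence 1 of Table 4.2: original part and appended part.
symbols = {"A", "B", "D", "F", "G"}
q = [("A", 1, 3), ("B", 4, 6), ("F", 7, 10), ("D", 8, 10)]
q_appended = [("F", 10, 13), ("G", 14, 18)]
print(is_event_sequence(concat(q, q_appended), symbols))  # True
```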
Definition 4.1 (Increment and updated database)
Given a temporal database DB to which event sequences are appended after some time, DB is called the original database. The increment database db is the set of newly appended data sequences. The SIDs of the data sequences in db may already exist in DB. A database combining all the data sequences from DB and db is referred to as the updated database DB’.
An extended database EDB of an updated temporal database DB’ is the set of event sequences in DB’ that are concatenations of sequences in DB and db. The concept of Definition 4.1 is illustrated in Fig. 4.1.
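Definition 4.1 can be made concrete with a short sketch. It assumes, for illustration only, that a database is a dictionary from SID to a list of `(symbol, start, finish)` tuples; the function name `apply_increment` is hypothetical. INSERT is handled as an APPEND to an empty sequence, and EDB keeps only the sequences whose SID already existed in DB.

```python
def apply_increment(DB, db):
    """Return (updated database DB', extended database EDB)."""
    updated = {sid: list(seq) for sid, seq in DB.items()}
    extended = {}
    for sid, appended in db.items():
        base = DB.get(sid, [])          # INSERT = APPEND to an empty sequence
        updated[sid] = list(base) + list(appended)
        if sid in DB:                   # a true concatenation of DB and db
            extended[sid] = updated[sid]
    return updated, extended


# Event intervals of Table 4.2.
DB = {
    1: [("A", 1, 3), ("B", 4, 6), ("F", 7, 10), ("D", 8, 10)],
    2: [("A", 1, 3), ("D", 4, 6), ("E", 7, 9)],
    3: [("A", 1, 3), ("D", 4, 6), ("E", 7, 10)],
}
db = {
    1: [("F", 10, 13), ("G", 14, 18)],
    4: [("B", 11, 14), ("F", 15, 20), ("D", 16, 18)],
}
updated, extended = apply_increment(DB, db)
print(sorted(updated), sorted(extended))  # [1, 2, 3, 4] [1]
```

In this example sequence 1 is an APPEND and sequence 4 is an INSERT, so only sequence 1 belongs to EDB.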
Table 4.2: An example of temporal database

SID   original database DB (event intervals)          increment database db (event intervals)
1     (A, 1, 3), (B, 4, 6), (F, 7, 10), (D, 8, 10)     (F, 10, 13), (G, 14, 18)
2     (A, 1, 3), (D, 4, 6), (E, 7, 9)                  -
3     (A, 1, 3), (D, 4, 6), (E, 7, 10)                 -
4     -                                                (B, 11, 14), (F, 15, 20), (D, 16, 18)

SID   coincidence representation
1     (A) (B) (F+) (F-D) ◇ (F) (G) → (B) (F+) (D) (F-) (G)
2     (A) (D E) ◇ → (A) (D E)
3     (A) (D E) ◇ → (A) (D E)
4     ◇ (B) (F+) (D) (F-) → (B) (F+) (D) (F-)
4.4 Coincidence Representation
The incremental mining of temporal patterns is more difficult than that of conventional sequential patterns. Since the time periods of two intervals may overlap, the relation among event intervals is more complex than that among event points. An appropriate representation is very important for describing relationships among more than three events. Various representations have been proposed, but most of them are restricted in terms of either ambiguity or space usage. The existing representations are compared in Table 4.3.
Table 4.3: Comparison of existing representations
events complex complex complex simple complex simple
Given an event sequence $Q = \langle (e_1, s_1, f_1), (e_2, s_2, f_2), \ldots, (e_n, s_n, f_n) \rangle$, the set $T = \{s_1, f_1, s_2, f_2, \ldots, s_i, f_i, \ldots, s_n, f_n\}$, where $1 \le i \le n$, is called the time set corresponding to sequence $Q$. If we order all the elements in $T$ and eliminate redundant elements, we can derive a sequence $TS_Q = \langle t_1, t_2, t_3, \ldots, t_k \rangle$, where $t_i \in T$ and $t_i < t_{i+1}$. $TS_Q$ is called the time sequence corresponding to sequence $Q$.
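As a minimal sketch of this construction, the time sequence can be obtained by collecting all event times into a set and sorting it. The function name `time_sequence` is hypothetical, and event intervals are assumed to be `(symbol, start, finish)` tuples.

```python
def time_sequence(seq):
    """Sorted, duplicate-free list of all starting and finishing times."""
    time_set = {t for _, s, f in seq for t in (s, f)}
    return sorted(time_set)


# Sequence 4 of the increment database db in Table 4.2.
q4 = [("B", 11, 14), ("F", 15, 20), ("D", 16, 18)]
print(time_sequence(q4))  # [11, 14, 15, 16, 18, 20]
```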
Definition 4.2 (Incising Function and Event Slice)
Given an event sequence $Q = \langle (e_1, s_1, f_1), (e_2, s_2, f_2), \ldots, (e_i, s_i, f_i), \ldots, (e_n, s_n, f_n) \rangle$ where $(e_i, s_i, f_i) \in I$ [...] that $s_i \le t \le f_i$, and denoted as $e_i$.
Let $S$ and $S'$ be two event slices. We say that $S$ is similar to $S'$, denoted as $S \sim S'$, if the event symbol of $S$ is identical to the event symbol of $S'$.
For example, in db in Table 4.2, sequence 4 has three event intervals, (B, 11, 14), (F, 15, 20) and (D, 16, 18), and its corresponding time sequence is $\langle 11, 14, 15, 16, 18, 20 \rangle$. Event interval F can be incised into three event slices: the start slice $F^+ = \Psi(15, 16, (F, 15, 20))$, the intermediate slice $F^* = \Psi(16, 18, (F, 15, 20))$ and the finish slice $F^- = \Psi(18, 20, (F, 15, 20))$. Event interval B has only one intact slice, $B = \Psi(11, 14, (B, 11, 14))$. $F^+$ and $F^-$ have the same event symbol, F; hence $F^+ \sim F^-$. By Definition 4.2, there are four kinds of event slices: start, intermediate, finish and intact. Obviously, an event interval can have only one start slice and one finish slice but may have many intermediate slices.
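The worked example suggests how an incising function can be implemented. The sketch below is inferred from that example only and is an assumption, not the formal definition: the function name `incise`, the string encoding of slices ("F+", "F*", "F-", "B") and the `None` result for intervals that do not cover the segment are all hypothetical choices.

```python
def incise(t, t2, interval):
    """Slice of `interval` on the time segment [t, t2], or None if not covered."""
    e, s, f = interval
    if not (s <= t < t2 <= f):
        return None          # the interval does not span this segment
    if t == s and t2 == f:
        return e             # intact slice: the whole interval
    if t == s:
        return e + "+"       # start slice
    if t2 == f:
        return e + "-"       # finish slice
    return e + "*"           # intermediate slice


def similar(slice_a, slice_b):
    """Two slices are similar if they carry the same event symbol.

    Assumes event symbols themselves do not end in '+', '-' or '*'.
    """
    return slice_a.rstrip("+-*") == slice_b.rstrip("+-*")


F = ("F", 15, 20)
print(incise(15, 16, F), incise(16, 18, F), incise(18, 20, F))  # F+ F* F-
```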
Definition 4.3 (Grouping Function, Coincidence and Coincidence Sequence)
Given an event sequence $Q = \langle (e_1, s_1, f_1), (e_2, s_2, f_2), \ldots, (e_i, s_i, f_i), \ldots, (e_n, s_n, f_n) \rangle$ where $(e_i, s_i, f_i) \in I$, and $a, b \in TS_Q = \langle t_1, t_2, t_3, \ldots, t_k \rangle$, $1 \le k \le 2n$, a grouping function is defined as
$\Phi(a, b, Q) = \{ \Psi(a, b, (e_1, s_1, f_1)), \Psi(a, b, (e_2, s_2, f_2)), \ldots, \Psi(a, b, (e_n, s_n, f_n)) \}$.
A coincidence $C_i = \Phi(t_i, t_{i+1}, Q) = (S_{i1}, S_{i2}, \ldots, S_{ij}, \ldots)$, where $t_i$ and $t_{i+1}$ are two consecutive event times in $TS_Q$ and $S_{ij}$ is an event slice, $1 \le i \le k-1$, $1 \le j \le n$. $C_i$ is an ordered set of event slices sorted in lexicographic order. A coincidence sequence $Q_c$ is denoted by $\langle C_1, C_2, \ldots, C_{k-1} \rangle$ and is also called the coincidence representation of $Q$. To deal with multiple occurrences of events, we attach an occurrence number to each event slice to distinguish multiple occurrences of the same event type in a coincidence sequence. For example, (A1+)(B1+)(B1-D+)(D-)(A1-B2+)(B2-)(EF)(A2) is a coincidence sequence with occurrence numbers in which both events A and B occur twice.
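Putting slices and grouping together, a coincidence sequence can be sketched as follows. This is an illustrative sketch under assumptions, not the thesis implementation: the names `incise`, `coincidence_sequence` and `show` are hypothetical, the incising rule is inferred from the worked example, and intermediate slices and empty coincidences are dropped because the coincidence sequences listed in Table 4.2 do not show them. Occurrence numbers are not handled.

```python
def incise(t, t2, interval):
    """Slice of `interval` on [t, t2] (rule inferred from the worked example)."""
    e, s, f = interval
    if not (s <= t < t2 <= f):
        return None
    if t == s and t2 == f:
        return e             # intact slice
    if t == s:
        return e + "+"       # start slice
    if t2 == f:
        return e + "-"       # finish slice
    return e + "*"           # intermediate slice


def coincidence_sequence(seq):
    """Group slices between consecutive event times into coincidences."""
    times = sorted({t for _, s, f in seq for t in (s, f)})
    result = []
    for t, t2 in zip(times, times[1:]):
        slices = (incise(t, t2, iv) for iv in seq)
        # Assumption: omit missing and intermediate slices, as in Table 4.2.
        kept = sorted(sl for sl in slices
                      if sl is not None and not sl.endswith("*"))
        if kept:             # skip time segments where nothing starts or ends
            result.append(tuple(kept))
    return result


def show(coincidences):
    """Render a coincidence sequence in the notation used in Table 4.2."""
    return " ".join("(" + " ".join(c) + ")" for c in coincidences)


q4 = [("B", 11, 14), ("F", 15, 20), ("D", 16, 18)]
print(show(coincidence_sequence(q4)))  # (B) (F+) (D) (F-)
```

For sequence 4 of Table 4.2 the result agrees with the coincidence representation listed in the table.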
To facilitate the incremental maintenance of temporal patterns, we also preserve the starting and finishing times of $Q$, $s_Q$ and $f_Q$, respectively. $s_Q$ is the starting time of the first event interval in $Q$ and $f_Q$ is the finishing time of the last event interval in $Q$; i.e., if $Q = \langle (e_1, s_1, f_1), (e_2, s_2, f_2), \ldots, (e_n, s_n, f_n) \rangle$, then $s_Q = s_1$ and $f_Q = f_n$. For a temporal database $DB$, by Definitions 4.2 and 4.3, we can transform it into a set of tuples $\langle SID, Q_c, [s_Q, f_Q] \rangle$, where $SID$ is the sequence-id of each event sequence $Q$ in $DB$, $Q_c$ is the coincidence representation of $Q$, and $s_Q$ and $f_Q$ are the starting and finishing times of $Q$. For example, in Table 4.2, we can transform the three event sequences in $DB$ into their corresponding coincidence sequences. For better readability, in the remainder of this chapter we assume that the temporal database has been transformed into the coincidence representation.
Table 4.4: The coincidence representation of Allen’s relations between two intervals
[Columns: Temporal Relation, Pictorial Example, Coincidence representation; Inversed Relation, Pictorial Example, Coincidence representation]
We adopt the coincidence representation [8] to express a temporal pattern since it can accelerate the process of updating temporal patterns when new intervals are appended to the original interval sequences. The coincidence representation has several benefits, the most important of which is that it effectively simplifies the processing of complex pairwise relationships among all intervals. It utilizes the concept of slice-and-coincidence as defined in Definitions 4.2 and 4.3, and considers the information of an entire event sequence instead of individual event intervals.
Given two different event intervals A and B, the coincidence representation of Allen’s 13 relations between A and B is categorized as in Table 4.4.
4.5 Inc_CTMiner Algorithm
In this section, we develop a new algorithm, named Inc_CTMiner (Incremental Coincidence Temporal Miner), for incremental mining of temporal patterns, by utilizing the