RAN BAI, Department of Computing, The Hong Kong Polytechnic University
WING KAI HON, Department of Computer Science, National Tsing Hua University
ERIC LO, Department of Computer Science and Engineering, Chinese University of Hong Kong
ZHIAN HE, Department of Computer Science, University of Hong Kong
KENNY ZHU, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Many emerging applications are based on finding interesting subsequences in sequence data. Finding “prominent streaks,” a set of the longest contiguous subsequences with values all above (or below) a certain threshold, is one such task that has received much attention. Motivated by real applications, we observe that prominent streaks alone are not insightful enough and need to be accompanied by what we call “historic moments.” In this article, we present an algorithm to efficiently compute historic moments from sequence data. The algorithm is incremental and space optimal, meaning that when new data arrive, it can efficiently refresh the results while keeping minimal information. Case studies show that historic moments can significantly improve the insights offered by prominent streaks alone. Furthermore, experiments show that our algorithm can outperform the baseline in both time and space.
CCS Concepts: • Information systems → Data structures; Data mining; • Theory of computation → Data structures and algorithms for data management;
Additional Key Words and Phrases: Historic moments, space optimal, prominent streaks, sequence data
ACM Reference format:
Ran Bai, Wing Kai Hon, Eric Lo, Zhian He, and Kenny Zhu. 2019. Historic Moments Discovery in Sequence Data. ACM Trans. Database Syst. 44, 1, Article 3 (January 2019), 33 pages.
https://doi.org/10.1145/3276975
1 INTRODUCTION
Finding prominent streaks [8], a set of maximal contiguous subsequences with values all above (or below) a certain threshold, from a sequence dataset has recently found applications in social network analysis, disease outbreak detection, and computational journalism [6, 7]. Take
This work is partly supported by the Research Grants Council of Hong Kong (GRF14200817, 15204116, 15200715, 521012), Research Committee of CUHK and NSFC grants (91646205, 61373031).
Authors’ addresses: R. Bai, Department of Computing, The Hong Kong Polytechnic University; email: csrbai@comp.polyu.edu.hk; W. K. Hon, Department of Computer Science, National Tsing Hua University; email: [email protected]; E. Lo, Department of Computer Science and Engineering, Chinese University of Hong Kong; email: ericlo@cse.cuhk.edu.hk; Z. He, Department of Computer Science, University of Hong Kong; email: [email protected]; K. Zhu, Department of Computer Science and Engineering, Shanghai Jiao Tong University; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
0362-5915/2019/01-ART3 $15.00 https://doi.org/10.1145/3276975
computational journalism as an example; the following news article from January 2011¹ contains a real example of a prominent streak:
“Today is Beijing’s 36th consecutive Blue Sky Day, a day whose Air Pollution Index (API) is 100 or below, indicating “excellent” or “good” air quality. As far as I can tell, this is the longest consecutive streak of Blue Sky Days in Beijing for at least ten years.”
In the excerpt above, the prominent streak refers to a subsequence of 36 days of consecutive measures of Air Pollution Index of 100 or below, and the authors in [8] have developed an efficient algorithm to discover streaks like this. Although useful, a prominent streak only stands for a singular event in the dataset. In fact, the news excerpt above is continued like this:
“ . . . in Beijing for at least ten years. Previously, there were only three streaks of 30 days or longer, one in 2006 and two during 2008 Olympics.”
The last sentence is crucial: it pinpoints the rarity of the 36-day Blue Sky streak; otherwise, readers who are unfamiliar with Beijing’s weather would probably find the news mundane. In contrast, when it is further explained that the last time Beijing had 30 or more consecutive Blue Sky Days was nearly 3 years ago, readers will be impressed by how rare the current streak is and may be motivated to learn more about Beijing’s environment. In other words, the three prominent streaks in 2006 and 2008 are similar “historic moments” that happened before, and they highlight the rarity of the 36-day streak that happened in 2011.
In this article, we formally introduce the concept of the historic moment of a streak $s$, which is a set of prominent streaks that end before $s$ and can be used to highlight the interestingness of $s$. The term “interestingness” can take many forms but is mainly centered around whether a similar event happened before, and if so, how long ago it was. The technical concern is how to efficiently report historic moments from a sequence dataset, given that the data sequences are appended regularly (e.g., hourly updates of the crude oil price,² updates of the seismic magnitude every one-tenth of a second³). To this end, we present a highly efficient incremental algorithm that enables interactive historic moment analysis on a sequence dataset with continuous data updates. The efficiency of the algorithm comes from maintaining an index in minimal space. Space optimality reduces the disk I/O per operation or even makes the index small enough to be memory resident. That property is crucial for online analysis and real-time monitoring, especially when many data sequences may be of interest concurrently. Experiments on five real datasets show that our algorithm, namely, the space-optimal incremental algorithm (SOIA), outperforms the baseline algorithm by 9× to 184× in terms of speed, using 98% to 99.5% less space (e.g., in our case study, our algorithm maintained an index of only 2GB for a 350GB data sequence, whereas the baseline needed to maintain an index of size 500GB). Furthermore, our case studies show that historic moments can indeed help journalists find full news stories, instead of the half-baked ones found by prominent streaks, and help seismologists get a bigger picture when analyzing ground motion data.
The rest of the article is organized as follows. Section 2 reviews the related work. Section 3 gives the formal problem definition. Section 4 presents our space-optimal incremental algorithm together with a few baseline algorithms. Section 5 presents the case studies, and Section 6 presents
¹ Excerpts from V. Wagner’s LiveFromBeijing blog: http://www.livefrombeijing.com/2011/01/beijing-breaks-record-for-longest-streak-of-consecutive-blue-sky-days-best-air-quality-in-years/.
² http://www.pmbull.com/oil-price/.
³ http://ds.iris.edu/ds/nodes/dmc/data/.
the experimental study. Section 7 concludes the article. Appendix A extends the problem and the algorithms from a single data sequence to multiple data sequences.
2 RELATED WORK
In this section, we first review related work on discovering interesting knowledge from sequence datasets. The first work is [8], which studied finding prominent streaks in a data sequence $D_n = v_1, v_2, \ldots, v_n$ with $n$ numeric values. A streak $s = (i, j, v)$ is a contiguous subsequence of $D_n$ containing the numeric values $v_i, v_{i+1}, \ldots, v_j$, with $i \le j$, where $v = \min\{v_k \mid i \le k \le j\}$ denotes the minimum⁴ value of the subsequence. We call $v$ the value of $s$, $|s| = j - i$ the length⁵ of $s$, and $[i, j]$ the interval of $s$.
Prominent streaks are in fact the two-dimensional skyline points [1, 3–5, 11, 12, 15, 18] of all the streaks found in a sequence, based on the length and the value dimensions. That is, a streak is a prominent streak if no other streak has both a larger length and a larger value. The challenge of computing prominent streaks is that a sequence of length $n$ contains $\Theta(n^2)$ streaks in total, so a brute-force method would require $O(n^4)$ comparisons to locate the set of prominent streaks among the $\Theta(n^2)$ candidates. In view of that, the authors in [8] developed an $O(n \log n)$ time algorithm, LLPS, which first computes a set of local prominent streaks ($LPS_n$) from $D_n$ and then locates the set of prominent streaks from $LPS_n$. The size of $LPS_n$ is at most $n$, and it is guaranteed that the set of prominent streaks is a subset of $LPS_n$. According to [8], the definition of a local prominent streak is:
Definition 2.1 (Local Prominent Streak (LPS) [8]). A streak $s = (i, j, v)$ is a local prominent streak if (1) $v_{i-1} < v$ and (2) $v_{j+1} < v$. We use $LPS_n$ to denote the set of local prominent streaks in the sequence dataset $D_n$. When $i = 1$, we can ignore condition (1). Similarly, when $j = n$, we can ignore condition (2).
Figure 1 shows some streaks of a sequence dataset. Streaks $s_1(299, 303, 9)$ and $s_2(306, 309, 8)$ are local prominent streaks, whereas streak $s_3(310, 311, 4)$ is not. The LLPS algorithm has two phases: (1) it first spends $O(n)$ time to obtain the linear-size $LPS_n$ and (2) it then invokes an $O(|LPS_n| \log |LPS_n|)$ skyline algorithm to compute the set of prominent streaks from $LPS_n$, resulting in an overall time complexity of $O(n + |LPS_n| \log |LPS_n|)$, which is at most $O(n \log n)$.⁶
Prominent streaks alone can only tell the continuity of a singular event. Historic moments can tell the other side of the story: how interesting that singular event is. We later show that historic moments and prominent streaks are related, but computing historic moments is a challenging problem in its own right.
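As a sanity check of Definition 2.1 and of the guarantee that prominent streaks form a subset of $LPS_n$, the following Python sketch enumerates all streaks of a tiny sequence by brute force. It is not the LLPS algorithm of [8]. The function names, the example sequence, and the use of the standard skyline dominance test (no worse in both dimensions, strictly better in at least one) are assumptions made for illustration.

```python
def all_streaks(values):
    """Enumerate every streak (i, j, v); positions are 1-based and inclusive."""
    streaks = []
    for i in range(1, len(values) + 1):
        running_min = values[i - 1]
        for j in range(i, len(values) + 1):
            running_min = min(running_min, values[j - 1])
            streaks.append((i, j, running_min))
    return streaks


def is_local_prominent(values, streak):
    """Definition 2.1: both neighbours, when they exist, are strictly below v."""
    i, j, v = streak
    left_ok = i == 1 or values[i - 2] < v          # v_{i-1} < v
    right_ok = j == len(values) or values[j] < v   # v_{j+1} < v
    return left_ok and right_ok


def dominates(a, b):
    """a dominates b on (length, value), with length |s| = j - i."""
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    no_worse = len_a >= len_b and a[2] >= b[2]
    return no_worse and (len_a > len_b or a[2] > b[2])


def prominent_streaks(values):
    """Brute-force 2D skyline over all streaks, for illustration only."""
    streaks = all_streaks(values)
    return [s for s in streaks if not any(dominates(t, s) for t in streaks)]


values = [3, 1, 4, 4, 2]  # hypothetical data, not taken from Figure 1
lps = [s for s in all_streaks(values) if is_local_prominent(values, s)]
prominent = prominent_streaks(values)
print(lps)        # [(1, 1, 3), (1, 5, 1), (3, 4, 4), (3, 5, 2)]
print(prominent)  # [(1, 5, 1), (3, 4, 4), (3, 5, 2)]
```

On this input, the sequence of 5 values yields 15 streaks but only 4 local prominent streaks, and every prominent streak is among them; the streak (1, 1, 3) is locally prominent without being prominent.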
In [17], the authors advocated that rare events are more informative and that comparisons among objects can make a story more complete. As such, given a set $\mathcal{A}$ of concerned attributes on a relational dataset, the authors developed the concept of the top-$\tau$ skyband as “one of the few” objects. Intuitively, the top-$\tau$ skyband consists of (at most) $\tau$ objects that are dominated by the smallest number of other objects. Each of these objects is one of the few most prominent objects in the dataset when the attributes in $\mathcal{A}$ are concerned. [17] proposed an efficient algorithm that finds the top-$\tau$ skyband in $O(2^d(\tau|n| + n \log n))$ time, where $d$ is the number of attributes and $n$ is the number of objects. Our article shares the same vision with [17] in terms of quantifying the interestingness
⁴ Maximum value is an alternate but equivalent definition.
⁵ An alternate definition is $|s| = j - i + 1$.
⁶ [8] also included another algorithm, NLPS, which generates $O(n^2)$ local prominent streaks in the first phase. It is obvious that LLPS is more efficient than NLPS.
Fig. 1. A data sequence (a value is represented by ×).
of a piece of information through its rarity. However, we focus on sequence data, whereas [17] focused on relational data.
In [14], a “fact” is defined as a contextual skyline object that stands out against other objects in a context with regard to a set of measures. Given a relational table with a set $\mathcal{M}$ of measure attributes and a set $\mathcal{A}$ of dimension attributes, for a constraint $c$ defined on $A \subseteq \mathcal{A}$ (known as a context) and a measure subspace $M \subseteq \mathcal{M}$, a tuple $t$ is a contextual skyline object if $t$ satisfies $c$ and no other tuple $t'$ satisfying $c$ dominates $t$. The authors in [14] focused on identifying these “facts” in a timely manner. So when a new tuple $t$ is added to a table, the target is to find which combinations of constraint $c$ and measure subspace $M$ make $t$ a contextual skyline object. All those eligible $(c, M)$ pairs are treated as situational facts, and the prominence of a situational fact is defined based on the size of the skyline under $c$ and $M$. Upon the arrival of a new tuple $t$, a baseline method can compute the topmost prominent situational facts using $O(2^{|\mathcal{M}| + |\mathcal{A}|})$ skyline queries. In case the context is fixed, the number of skyline queries becomes $O(2^{|\mathcal{M}|})$. In this article, we aim to discover certain knowledge in a timely fashion as in [14]. However, other than that, our work is orthogonal to [14] because that work focused on relational data, while ours focuses on sequence data.
All of the related work above relies on skyline computation, whose discussion began with [4], in which an $O(n^2)$ block nested loop (BNL) skyline algorithm was introduced. Then the sort-filter-skyline (SFS) algorithm was introduced [5], whose best-case time complexity is $O(dn + n \log n)$, where $d$ is the number of dimensions. Index-based skyline algorithms were introduced in [10, 12, 15], in which the Branch & Bound (BBS) skyline algorithm in [13] gives the I/O optimal solution because for each query it visits relevant nodes of the R-tree only once, with time complexity $O(n \log n)$. Recently, [1] presented a parallel skyline algorithm. A survey of skyline computation and the variants of skyline can be found in [9].
3 PROBLEM DEFINITION
In this article, we propose the notion of historic moment. Our goal is to identify historic moments in a timely fashion. So we mainly focus on streaks that just happened, and we name those situational streaks. Furthermore, inspired by our motivating example, whether a situational streak is informative or not depends on how long ago a similar event (streak) happened. For some applications (e.g., computational journalism), a situational streak might lead to a news story because a similar streak happened long ago. But there are also applications (e.g., seismology) where a situational streak becomes important because a similar streak just happened. We now formally define historic moments and their related terms based on the intuition above.
Fig. 2. Data sequence $D_{19}$.
Definition 3.1 (Situational Streak (SS)). In a data sequence $D_n$ with $n$ numeric values $v_1, v_2, \ldots, v_n$, a streak $(i, j, v) \in LPS_n$ is a situational streak if $j = n$, i.e., it is a most recent local prominent streak. We use $SS_n$ to denote the set of situational streaks in $D_n$.
Figure 2 shows a data sequence $D_{19}$ with $n = 19$ numeric values. $D_{19}$ has four situational streaks: $s_1(15, 19, 7)$, $s_2(14, 19, 6)$, $s_3(8, 19, 4)$, $s_4(1, 19, 1)$.
Obviously not all situational streaks are interesting, and basically we are interested in those with the highest or lowest values (e.g., high seismic magnitude, low Air Pollution Index). Without loss of generality, we focus on those with the highest values in the discussion. Also, for ease of discussion, we assume that all values are positive. So we define:
Definition 3.2 (Top-$k$ Situational Streak). The top-$k$ situational streaks are the $k$ streaks in $SS_n$ with the highest values.
In Figure 2, among the four situational streaks, $s_1$ and $s_2$ are the top two situational streaks.
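Definitions 3.1 and 3.2 amount to a filter followed by a sort. The short Python sketch below illustrates them on the four situational streaks listed for Figure 2; the function names are hypothetical, and streaks are assumed to be stored as `(i, j, v)` tuples.

```python
def situational_streaks(lps, n):
    """Definition 3.1: local prominent streaks whose interval ends at n."""
    return [s for s in lps if s[1] == n]


def top_k_situational(situational, k):
    """Definition 3.2: the k situational streaks with the highest values."""
    return sorted(situational, key=lambda s: s[2], reverse=True)[:k]


# The four situational streaks of D_19 as listed in the text: s1, s2, s3, s4.
ss19 = [(15, 19, 7), (14, 19, 6), (8, 19, 4), (1, 19, 1)]
print(top_k_situational(ss19, 2))  # [(15, 19, 7), (14, 19, 6)], i.e., s1 and s2
```

Because situational streaks all end at $n$ and their intervals are nested, their values are pairwise distinct, so the sort needs no tie-breaking rule.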
The following news⁷ is a real example of reporting not only the top but also the top two situational streaks:
“ . . . the highest temperatures are above 32 Celsius for five days and 30 Celsius for six days . . . .”
Next, given any situational streak z, we are interested in a subset of local prominent streaks that are “similar” to z. So we define:
Definition 3.3 (Analogous Streaks (AS)). A local prominent streak $s$ in a data sequence $D_n$ is an analogous streak of a situational streak $z$ when:
(1) s.j < z.i (i.e., s ends before z starts),
(2) |s| ≥ |z| · σ (i.e., the length of s is at least σ times that of z, where σ ≥ 0 is a similarity threshold), and
(3) s.v ≥ z.v · σ (i.e., the value of s is at least σ times that of z).
In Figure 2, $s_{11}$, with length 4 and value 8, is the only analogous streak of the top one situational streak $s_1$ (length 4; value 7) when the similarity threshold $\sigma$ is 1. When we relax $\sigma$ to 0.75, both $s_7$ and $s_{11}$ are analogous streaks of $s_1$.
⁷ http://www.chinanews.com/sh/2015/05-21/7291066.shtml.
In this article, we use the same similarity threshold $\sigma$ for both length and value. An alternate definition is to impose different similarity thresholds on length and value, which could be easily supported by a straightforward adaptation of our techniques.
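The three conditions of Definition 3.3 translate directly into a predicate. The Python sketch below is illustrative only: the function name and the optional second threshold (covering the alternate definition with separate thresholds) are assumptions, and the candidate streaks `a`, `b`, and `c` are made-up coordinates, since the positions of the streaks in Figure 2 other than the situational ones are not given in the text.

```python
def is_analogous(s, z, sigma_len, sigma_val=None):
    """Definition 3.3 with streaks as (i, j, v) tuples and |s| = j - i.

    If sigma_val is omitted, the same threshold is used for length and value.
    """
    if sigma_val is None:
        sigma_val = sigma_len
    s_i, s_j, s_v = s
    z_i, z_j, z_v = z
    ends_before = s_j < z_i                               # condition (1)
    long_enough = (s_j - s_i) >= (z_j - z_i) * sigma_len  # condition (2)
    high_enough = s_v >= z_v * sigma_val                  # condition (3)
    return ends_before and long_enough and high_enough


z = (15, 19, 7)   # s1 of Figure 2: length 4, value 7
a = (2, 6, 8)     # hypothetical: length 4, value 8
b = (9, 12, 6)    # hypothetical: length 3, value 6
c = (14, 18, 9)   # hypothetical: overlaps z, so it can never qualify

print(is_analogous(a, z, 1.0))   # True
print(is_analogous(b, z, 1.0))   # False (too short, value too low)
print(is_analogous(b, z, 0.75))  # True (3 >= 3.0 and 6 >= 5.25)
print(is_analogous(c, z, 0.5))   # False (does not end before z starts)
```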
Definition 3.4 (Historic Moments (HM)). Let $AS(z)$ be the set of analogous streaks of a situational streak $z$. Assume that each streak $s$ in $AS(z)$ is represented by a 3D point $(|s|, s.j, s.v)$. The historic moment of $z$, denoted by $HM(z)$, is the 3D skyline of $AS(z)$, and we write that as $HM(z) = \mathrm{skyline}(AS(z))$.
Problem Definition: Given a data sequence $D_n$ with $n$ numeric values, a similarity threshold $\sigma$, and a positive integer $k$, compute the historic moments for each of the top-$k$ situational streaks.
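Definition 3.4 can be sketched as a filter on the analogous conditions followed by a quadratic-time 3D skyline. The Python code below is a minimal illustration under stated assumptions: streaks are `(i, j, v)` tuples, dominance is the usual skyline dominance on the point $(|s|, s.j, s.v)$ (no smaller in every coordinate and not identical), and the candidate list is made-up data rather than the streaks of Figure 2.

```python
def as_point(s):
    """Map a streak (i, j, v) to its 3D point (|s|, s.j, s.v)."""
    i, j, v = s
    return (j - i, j, v)


def dominates(a, b):
    """a dominates b: no worse in every coordinate, better in at least one."""
    pa, pb = as_point(a), as_point(b)
    return pa != pb and all(x >= y for x, y in zip(pa, pb))


def skyline(streaks):
    """Naive skyline: keep the streaks that no other streak dominates."""
    return [s for s in streaks if not any(dominates(t, s) for t in streaks)]


def historic_moments(lps, z, sigma):
    """HM(z) = skyline(AS(z)), with AS(z) filtered per Definition 3.3."""
    z_i, z_j, z_v = z
    analogous = [
        s for s in lps
        if s[1] < z_i
        and (s[1] - s[0]) >= (z_j - z_i) * sigma
        and s[2] >= z_v * sigma
    ]
    return skyline(analogous)


z = (15, 19, 7)  # situational streak of interest (s1 in Figure 2)
candidates = [   # hypothetical local prominent streaks ending before z
    (1, 5, 8),    # longest
    (7, 9, 9),    # highest value
    (2, 4, 9),    # dominated by (7, 9, 9): same length and value, older
    (11, 13, 6),  # most recent
    (10, 14, 3),  # value below 7 * 0.5, so not analogous
]
print(historic_moments(candidates, z, 0.5))
# [(1, 5, 8), (7, 9, 9), (11, 13, 6)]
```

The three survivors play the roles of the longest, the highest-value, and the most recent historic moment, respectively.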
In Figure 2, for the top one situational streak $s_1$, if $\sigma = 0.5$, we have $HM(s_1) = \{s_7, s_8, s_{11}\}$. Streak $s_9$ is not a historic moment of $s_1$ because $s_8$ dominates⁸ $s_9$, denoted as $s_8 \succ s_9$, even though $AS(s_1) = \{s_7, s_8, s_9, s_{11}\}$. Note that $s_7$ is the historic moment of $s_1$ with the largest $j$ (i.e., most recent), $s_8$ is the historic moment of $s_1$ with the highest value, and $s_{11}$ is the historic moment of $s_1$ with the longest interval. In the following, we use computational journalism as an example and present some real news stories that justify the use of the skyline of analogous streaks to formulate historic moments.
Story 1 (Most recent historic moment). The Beijing Blue Sky Days news in the introduction is a real example that illustrates that the most recent historic moment is newsworthy. In that story, the “36 days of Blue Sky” is a situational streak z with length 36 and value 100. The “30 days of Blue Sky that happened in 2008” is then the historic moment of z that occurs most recently.
Story 2 (Highest-value historic moment). In April 2014, the United Kingdom had the following news about smog:
“air pollution levels with more than eight lasts two days,”
which is essentially a situational streak z of length 2 with value 8. When reporting the news, the journalist quoted the “Great Smog” incident, which happened in 1952:
“ . . . (The 1952 Smog) led to the creation of the Clean Air Act 1956, which introduced a number of measures to reduce air pollution. . . . Unfortunately in the modern day, despite the visibility and intensity of smog being much reduced, up to 29,000 people in the UK still die per year because of air pollution, according to the European Commission.”⁹
The “Great Smog” incident is in fact the historic moment of z with the highest value.
Story 3 (Longest-interval historic moment). In January 2014, the United States had the following news about drought in California:
“Meanwhile, today’s scant rainfall was enough for Sacramento to finally end the longest running rainless rainy season streak in its recorded history. The city now has a new all-time record of 52 consecutive days without measurable rainy season rainfall,”
which is essentially a situational streak of length 52 with value close to 0. When reporting the news, the journalist continued with comparisons with a historic moment that had the longest interval:
⁸ Following the literature, an object $x$ is said to be dominated by another object $y$, denoted as $y \succ x$, with respect to a set $\mathcal{A}$ of concerned attributes, if $y.A_i \ge x.A_i$ for all $A_i \in \mathcal{A}$, and there exists at least one attribute $A_j \in \mathcal{A}$ such that $y.A_j > x.A_j$. Here we discuss a streak $s$, and the concerned attribute set $\mathcal{A}$ refers to $\{|s|, s.j, s.v\}$ of $s$.
⁹ “UK Smog: You Thought This Was Bad? Take a Look at the Great Smog of 1952”: http://www.independent.co.uk/news/uk/home-news/uk-smog-you-thought-this-was-bad-take-a-look-at-the-great-smog-of-1952-9238550.html.
Table 1. Major Notations in This Article
Notation        Meaning
$D_n$           Data sequence with $n$ values
$\sigma$        Similarity threshold
$k$             Top-$k$ parameter
$s(i, j, v)$    Streak $s$ with interval $[i, j]$ and value $v$
$LPS_n$         Local prominent streaks of $D_n$
$SS_n$          Situational streaks of $D_n$
$AS(z)$         Analogous streaks of a situational streak $z$
$HM(z)$         Historic moments of a situational streak $z$
$P_n$           Perplexing streaks of $D_n$
$N_n$           Nonperplexing streaks of $D_n$
$U_n$           Minimal subset of $LPS_n$ obtained by SOIA
“ . . . dating back to Dec. 7, 2013. The previous longest streak was Nov. 1 to Dec. 16, 1884.”¹⁰
4 FINDING HISTORIC MOMENTS FROM A DATA SEQUENCE
In this section, we present algorithms to obtain historic moments from a data sequence. Specifically, Section 4.1 first presents a baseline algorithm that computes historic moments from a data sequence $D_n$ offline, given a similarity threshold $\sigma$ and a parameter $k$. In practice, there could be a data update, or a user may want to look for historic moments with different $\sigma$ and $k$ values when operating under an interactive (online) mode [7]. In these cases, re-executing the baseline algorithm for any update of $\sigma$, $k$, or data would be inefficient. Therefore, in Section 4.2, we first present how to refactor the baseline algorithm to be incremental. The baseline incremental algorithm is still inefficient because it needs to keep and maintain a lot of intermediate results. Therefore, in Section 4.3, we present SOIA, an efficient incremental algorithm that returns historic moments by maintaining and accessing minimal information. Appendix A extends the problem and the algorithms for finding historic moments from a single data sequence to multiple data sequences.
Table 1 lists the major notations used in this article.
4.1 Baseline Algorithm (BA)
Given a data sequence $D_n$, a similarity threshold $\sigma$, and a parameter $k$, the problem of finding historic moments for each of the top $k$ situational streaks can be solved naively as follows:
• Step 1. Use the first phase of LLPS in [8] to compute the set of all local prominent streaks $LPS_n$ from $D_n$, and then select the top $k$ situational streaks from there. This step, according to [8], takes $O(n)$ time, and $LPS_n$ takes $O(n)$ space.
• Step 2. For each top $k$ situational streak $z$, scan through $LPS_n$ to identify its analogous streaks $AS(z)$.
• Step 3. Finally, for each $z$, compute the skyline from $AS(z)$ as the resulting $HM(z)$.
The naive method above is inefficient because Step 2 fully scans the large $LPS_n$ $k$ times. Therefore, one simple improvement is to build an index for $LPS_n$ after Step 1. Specifically,
¹⁰ “California’s Devastating Drought Isn’t Going to Get Better Any Time Soon”: http://www.slate.com/blogs/future_tense/2014/01/30/california_s_exceptional_drought_won_t_get_better_any_time_soon.html.
one can regard each streak $s$ in $LPS_n$ as a 3D point $(|s|, s.j, s.v)$ and insert them into an R-tree. Then, the analogous streaks of a situational streak $z$ can be regarded as a 3D range query $Q$:
$[|z| \cdot \sigma, +\infty) \times [0, z.i) \times [z.v \cdot \sigma, +\infty)$.
By building an R-tree, the above query can locate the analogous streaks of a situational streak $z$ without scanning all the streaks in $LPS_n$. Furthermore, Steps 2 and 3 can be combined as a constrained skyline query on the R-tree that returns the skyline within $Q$. This combined step can be implemented using the BBS skyline algorithm in [13]. Algorithm 1 summarizes the above index-based baseline algorithm (BA).
ALGORITHM 1: Baseline Algorithm (BA)
1: procedure BA($D_n$, $\sigma$, $k$)
2:   Use the first phase of LLPS in [8] to compute $LPS_n$;
3:   Rtree = Build-R-Tree($LPS_n$);
4:   $SS_n$ = Get-SS($LPS_n$);
5:   $Z$ = Get-Top-SS($SS_n$, $k$);
6:   for each streak $z$ in $Z$ do
7:     $Q = [|z| \cdot \sigma, +\infty) \times [0, z.i) \times [z.v \cdot \sigma, +\infty)$; // define the analogous region
8:     $HM(z)$ = BBS(Rtree, $Q$); // compute constrained skyline
4.2 Baseline Incremental Algorithm (BIA)
The BA is not incremental. Therefore, we refactor it to become incremental so that it can return results efficiently even when the data sequence $D_n$ is appended with new values, resulting in $D_m$ with $m > n$, or when the parameters are updated to $\sigma'$ or $k'$. The incremental method is composed of two parts: (1) a maintenance procedure that computes $LPS_m$ online when data is appended and (2) a lookup procedure that answers the historic moment queries. Similar to BA, we use an R-tree to store the local prominent streaks.
4.2.1 BIA Maintenance. When the data sequence $D_n$ is appended with new values $v_{n+1}, v_{n+2}, \ldots, v_m$, resulting in $D_m$ with $m > n$, this procedure aims to obtain (1) the updated set of situational streaks $SS_m$ and (2) possibly some new local prominent streaks. The maintenance procedure is similar to the one in [8]. Specifically, each value $v_{n+k} \in \{v_{n+1}, v_{n+2}, \ldots, v_m\}$, where $1 \le k \le m - n$, may:
(1) make some streaks in $SS_{n+k-1}$ stop being situational streaks and turn into local prominent streaks that end at $n+k-1$; those streaks then get inserted into the R-tree;
(2) extend some streaks in $SS_{n+k-1}$ to become longer streaks in $SS_{n+k}$ that end at $n+k$; or
(3) form a new local prominent streak whose value is $v_{n+k}$ and that ends at $n+k$; this happens when none of the streaks in $SS_{n+k-1}$ has value $v_{n+k}$. Note that this is a situational streak for $D_{n+k}$, so it is in $SS_{n+k}$.
For maintenance reasons that will become clear shortly, we separate $SS_n$ from the rest of the streaks in $LPS_n$. Specifically, we insert the streaks from $LPS_n - SS_n$ as 3D points into an R-tree. Streaks from $SS_n \subseteq LPS_n$ are stored separately. Algorithm 2 summarizes the above discussion, and we have the following:
ALGORITHM 2: BIA Maintenance Procedure
1: procedure Maintenance($v_{n+1}, v_{n+2}, \ldots, v_m$)
2:   for $k = 1$ to $m - n$ do
3:     if $n == 0$ and $k == 1$ then
4:       $SS_1 = \{(1, 1, v_1)\}$;
5:       continue;
6:     $SS_{n+k} = \emptyset$;
7:     for each streak $(i, n+k-1, v)$ in $SS_{n+k-1}$ do
8:       if $v_{n+k} \ge v$ then
9:         Insert $(i, n+k, v)$ to $SS_{n+k}$; // case (2)
10:      else
11:        Insert $(i, n+k-1, v)$ to Buffer $B$; // case (1)
12:    if no streak in $SS_{n+k-1}$ has value $v_{n+k}$ then // case (3)
13:      if all streaks in $SS_{n+k-1}$ have value $< v_{n+k}$ then
14:        Insert $(n+k, n+k, v_{n+k})$ to $SS_{n+k}$;
15:      else
16:        Select the streak $(i, n+k-1, v)$ in $SS_{n+k-1}$ whose value $v > v_{n+k}$ and is the smallest;
17:        Extend it to be $(i, n+k, v_{n+k})$;
18:        Insert it into $SS_{n+k}$;
19:  Insert the streaks in $B$ into R-tree;
• The Maintenance procedure takes $SS_n$ and the appended values $v_{n+1}, v_{n+2}, \ldots, v_m$ as the input. Its outputs are $SS_m$ and the R-tree updated with the new local prominent streaks.
• For each value $v_{n+k} \in \{v_{n+1}, v_{n+2}, \ldots, v_m\}$, Lines 2 to 18 compute $SS_{n+k}$ and the new local prominent streaks based on $SS_{n+k-1}$ iteratively. Lines 3 to 5 initialize $SS_1$.
• For each value $v_{n+k} \in \{v_{n+1}, v_{n+2}, \ldots, v_m\}$, we collect the new local prominent streaks in Buffer $B$ (Line 11) and then insert them into the R-tree in a batch (Line 19) to reduce the I/O overhead by avoiding frequent access to the R-tree.
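The three cases of the maintenance procedure can be sketched in Python with a stack of open (situational) streaks. This is an illustrative sketch, not the authors' implementation: the class and attribute names are hypothetical, a plain list stands in for the R-tree, and closed streaks are flushed after every appended value rather than once per batch as in Line 19.

```python
class StreakMaintainer:
    """Keeps SS_n as a stack of (start, value) pairs with increasing values."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.open = []    # situational streaks, stored as (start, value)
        self.closed = []  # finished local prominent streaks (i, j, v)

    def append(self, x):
        self.n += 1
        start = self.n
        buffer = []
        # Case (1): streaks with value above x stop at position n - 1.
        while self.open and self.open[-1][1] > x:
            i, v = self.open.pop()
            buffer.append((i, self.n - 1, v))
            start = i  # ends as the start of the smallest-valued closed streak
        # Case (2): remaining streaks (value <= x) are extended implicitly,
        # because their end position is always the current n.
        # Case (3): no open streak has value x, so a new one is formed.
        if not self.open or self.open[-1][1] < x:
            self.open.append((start, x))
        self.closed.extend(buffer)

    def situational(self):
        return [(i, self.n, v) for i, v in self.open]


m = StreakMaintainer()
for x in [5, 6, 6, 2, 7, 7, 1, 4, 5, 5]:  # hypothetical data
    m.append(x)
print(m.closed)         # [(2, 3, 6), (1, 3, 5), (5, 6, 7), (1, 6, 2)]
print(m.situational())  # [(1, 10, 1), (8, 10, 4), (9, 10, 5)]
```

Each value is pushed and popped at most once, so a batch of appended values costs time linear in its size plus the number of streaks already open.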
4.2.2 BIA Lookup. Given the R-tree of historic moment candidates created by the maintenance step, the historic moments for each of the top $k$ situational streaks, under any similarity parameter $\sigma'$ and $k'$ values, can be obtained by calling the Lookup procedure presented in Algorithm 3.
ALGORITHM 3: BIA Lookup Procedure
1: procedure Lookup($\sigma$, $k$)
2:   $Z$ = Get-Top-SS($SS_n$, $k$);
3:   for each streak $z$ in $Z$ do
4:     $Q = [|z| \cdot \sigma, +\infty) \times [0, z.i) \times [z.v \cdot \sigma, +\infty)$;
5:     $HM(z)$ = BBS(Rtree, $Q$); // compute constrained skyline
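A minimal Python sketch of the lookup step follows. It assumes the closed local prominent streaks and the situational streaks are available as lists of `(i, j, v)` tuples, and it swaps in a linear scan plus a naive skyline for the R-tree and the BBS algorithm, so it shows the semantics of the query rather than its intended performance. All names and the sample data are hypothetical.

```python
def in_query_box(s, z, sigma):
    """Is the point (|s|, s.j, s.v) inside Q for situational streak z?"""
    s_len, z_len = s[1] - s[0], z[1] - z[0]
    return s_len >= z_len * sigma and s[1] < z[0] and s[2] >= z[2] * sigma


def dominates(a, b):
    pa = (a[1] - a[0], a[1], a[2])
    pb = (b[1] - b[0], b[1], b[2])
    return pa != pb and all(x >= y for x, y in zip(pa, pb))


def lookup(closed, situational, sigma, k):
    """Return {z: HM(z)} for the top-k situational streaks."""
    top = sorted(situational, key=lambda s: s[2], reverse=True)[:k]
    result = {}
    for z in top:
        box = [s for s in closed if in_query_box(s, z, sigma)]
        result[z] = [s for s in box if not any(dominates(t, s) for t in box)]
    return result


# Hypothetical state after maintaining the sequence 5, 6, 6, 2, 7, 7, 1, 4, 5, 5.
closed = [(2, 3, 6), (1, 3, 5), (5, 6, 7), (1, 6, 2)]
situational = [(1, 10, 1), (8, 10, 4), (9, 10, 5)]
print(lookup(closed, situational, sigma=1.0, k=2))
# {(9, 10, 5): [(1, 3, 5), (5, 6, 7)], (8, 10, 4): [(1, 3, 5)]}
```

Because the stored streaks do not depend on $\sigma$ or $k$, the same state answers repeated lookups with different parameter values without any recomputation.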
4.2.3 Space Requirement of BIA. Why is it necessary for BIA to keep all streaks in $LPS_n$? Specifically, from what we have defined, the historic moments are the skyline of analogous streaks, which are in turn a subset of $LPS_n$. Therefore, one might question whether BIA can keep only the skyline of $LPS_n$ instead of the full $LPS_n$. Unfortunately, optimizing BIA like that is incorrect. That is because a streak currently not in $\mathrm{skyline}(LPS_n)$ could become a historic moment after a new value $v_{n+1}$ is appended, or under different values of $\sigma$ and $k$. The following are two examples.
Fig. 3. After $v_{20}$ is appended.
Example 4.1. Consider again the data sequence in Figure 2. We have:
• $n = 19$;
• $LPS_{19} = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9, s_{10}, s_{11}\}$; and
• $SS_{19} = \{s_1, s_2, s_3, s_4\} \subset LPS_{19}$.
Consider the skyline of $LPS_{19}$, which is:
• $\mathrm{skyline}(LPS_{19}) = \{s_1, s_2, s_3, s_4, s_5, s_6, s_{11}\}$.
Now consider the lookup of the historic moments of the top 1 situational streak in dataset $D_{19}$ with $\sigma = 0.75$. The top 1 situational streak in $D_{19}$ is $s_1$, whereas the analogous streaks of $s_1$ are $AS(s_1) = \{s_7, s_{11}\}$. The historic moments of $s_1$ are $HM(s_1) = \{s_7, s_{11}\}$, since $s_7$ and $s_{11}$ cannot dominate each other. At this point, we see that $s_7$ is a historic moment of $s_1$ but $s_7 \notin \mathrm{skyline}(LPS_{19})$.
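The same phenomenon can be reproduced on a much smaller sequence. The Python sketch below uses a hypothetical six-value sequence (not the data of Figure 2) and brute-force helpers written for illustration; it shows a local prominent streak that is dominated within $LPS_n$ by the situational streak itself, yet is the only historic moment of that situational streak.

```python
def local_prominent_streaks(values):
    """Brute-force LPS per Definition 2.1, with 1-based (i, j, v) tuples."""
    n = len(values)
    result = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            v = min(values[i - 1:j])
            left_ok = i == 1 or values[i - 2] < v
            right_ok = j == n or values[j] < v
            if left_ok and right_ok:
                result.append((i, j, v))
    return result


def dominates(a, b):
    pa = (a[1] - a[0], a[1], a[2])
    pb = (b[1] - b[0], b[1], b[2])
    return pa != pb and all(x >= y for x, y in zip(pa, pb))


def skyline(streaks):
    return [s for s in streaks if not any(dominates(t, s) for t in streaks)]


def historic_moments(lps, z, sigma):
    analogous = [
        s for s in lps
        if s[1] < z[0]
        and (s[1] - s[0]) >= (z[1] - z[0]) * sigma
        and s[2] >= z[2] * sigma
    ]
    return skyline(analogous)


values = [3, 4, 1, 5, 5, 5]  # hypothetical data
lps = local_prominent_streaks(values)
z = (4, 6, 5)                # top-1 situational streak of this sequence
print(lps)                            # [(1, 2, 3), (1, 6, 1), (2, 2, 4), (4, 6, 5)]
print(skyline(lps))                   # [(1, 6, 1), (4, 6, 5)]
print(historic_moments(lps, z, 0.5))  # [(1, 2, 3)]
```

Here (1, 2, 3) is dominated by (4, 6, 5) in all three dimensions, so an index holding only the skyline of the local prominent streaks would lose the one historic moment of (4, 6, 5).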
Example 4.2. Still consider the data sequence in Figure 2, but with a new value $v_{20} = 5$ appended, resulting in $D_{20}$ of Figure 3. We see that, with $v_{20}$ appended, it has:
(1) made streaks $s_1$ and $s_2$ in $SS_{19}$ stop being situational streaks and turn into local prominent streaks that end at $n = 19$;
(2) extended streaks $s_3$ and $s_4$ in $SS_{19}$ to become longer streaks $s_3'$ and $s_4'$ in $SS_{20}$ that end at $n = 20$; and
(3) formed a new local prominent streak, $s_{new}$, whose value is 5 and ends at $n = 20$.
Now consider the lookup of the historic moments of the top 1 situational streak in the updated dataset $D_{20}$ with $\sigma = 0.5$. The top 1 situational streak in $D_{20}$ is $s_{new}$, whereas the analogous streaks of $s_{new}$ are $AS(s_{new}) = \{s_7, s_{11}\}$. The historic moments of $s_{new}$ are $HM(s_{new}) = \{s_7, s_{11}\}$. Once again, we see that $s_7$ is a historic moment of $s_{new}$ but $s_7 \notin \mathrm{skyline}(LPS_{19})$.
4.3 Space-Optimal Incremental Algorithm (SOIA)
BIA needs to keep and maintain $LPS_n$, which is space inefficient, especially when multiple sequences are of interest (e.g., multiple stocks, multiple seismic monitoring sensors) or when the sequences are very long (e.g., high-frequency trading with a stock tick every millisecond). As shown in the previous section, a straightforward optimization such as keeping $\mathrm{skyline}(LPS_n)$ instead of $LPS_n$ is unfortunately incorrect. When it is not space efficient, BIA is not time efficient either, because the redundancies in the index hurt both the lookup and the maintenance time. In this section, we present a space-optimal incremental algorithm (SOIA) that keeps only a minimal subset $U_n$ of $LPS_n$ that is sufficient to return historic moments online. Formally,
Definition 4.3. Given a data sequence $D_n$, the minimal subset $U_n$ of $LPS_n$ refers to a subset of $LPS_n$ that guarantees
(1) any streak $s$ in $\{LPS_n - U_n\}$ must not be a historic moment of any situational streak $y$ in $D_m$, where $m \ge n$, and
(2) there does NOT exist a proper subset $X$ of $U_n$ that satisfies condition (1).
SOIA strives to maintain $U_n$ under continuous data updates. Space optimality of SOIA significantly reduces the size of the index, thereby reducing the I/O per operation or making the index memory resident. That property is crucial for online analysis and real-time monitoring, especially when there are possibly many data sequences of interest, which demands one index per data sequence.
So the grand challenge of SOIA is how to confine $U_n$, where trivially confining $U_n$ to be the skyline of $LPS_n$ is definitely insufficient, whereas confining $U_n$ to be $LPS_n$ is definitely not space optimal. We first discuss how to obtain $U_n$ properly given the data sequence $D_n$, and Section 4.3.1 studies how to maintain its minimality when the data sequence is appended.
In SOIA, we treat $\sigma$ as a parameter that controls the tradeoff between space and time, in addition to its original role as the similarity threshold. More specifically, when deciding which streaks in $LPS_n$ shall be kept (i.e., belong to $U_n$) and which shall be discarded, a small $\sigma$ value makes it easier for a local prominent streak to be an analogous streak of a situational streak $z$, so the size of $U_n$ naturally increases as $\sigma$ gets smaller. In contrast, a large $\sigma$ value makes it harder for a local prominent streak to be an analogous streak of a situational streak $z$, so the size of $U_n$ decreases as $\sigma$ gets larger. With $\sigma$ as a parameter trading off space and time, the goal of SOIA is to maintain $U_n$ so that a user can query for historic moments with any similarity value $\sigma'$ larger than or equal to $\sigma$.
We now confine what $U_n$ should be, given $\sigma$. We start by showing a simple lemma (Lemma 4.4). Then, we look at the easiest case, with $\sigma = 1$. We can interpret this case as either having the most stringent space requirement or as the users being uninterested in analogous streaks that are shorter, or smaller in value, than the situational streak of interest. Finally, we work on the general case.
Lemma 4.4. The intervals of two local prominent streaks are either disjoint or one contains the other.
Proof. Assume to the contrary that there exist two local prominent streaks (i, j, v) and (i′, j′, v′) whose intervals are neither disjoint nor nested. Without loss of generality, assume that i < i′. Then, we have i < i′ ≤ j < j′. That is, i′ − 1 is within the range of [i, j] and j + 1 is within the range of [i′, j′]. Now, there are two cases:
(1) If v ≥ v′, then (i′, j′, v′) is not a local prominent streak because v_{i′−1} ≥ v ≥ v′.
(2) Otherwise, we have v′ > v. Then (i, j, v) is not a local prominent streak because v_{j+1} ≥ v′ > v.
So both cases lead to a contradiction. The lemma follows.
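Lemma 4.4 can be checked empirically on small inputs. The sketch below is a brute-force illustration, not the paper's algorithm: it assumes that a streak (i, j, v) is a local prominent streak when v is the minimum over [i, j] and the neighbouring values v_{i−1} and v_{j+1}, where they exist, are both smaller than v, which is how the proof above uses the notion. The function names and the sample sequence are hypothetical.

```python
from itertools import combinations


def local_prominent_streaks(values):
    """Brute-force enumeration of local prominent streaks as (i, j, v), 1-based."""
    n = len(values)
    found = []
    for i in range(n):
        lo = values[i]
        for j in range(i, n):
            lo = min(lo, values[j])                          # minimum over [i, j]
            left_ok = i == 0 or values[i - 1] < lo           # cannot grow to the left
            right_ok = j == n - 1 or values[j + 1] < lo      # cannot grow to the right
            if left_ok and right_ok:
                found.append((i + 1, j + 1, lo))
    return sorted(found)


def is_laminar(streaks):
    """True when every pair of intervals is either disjoint or nested."""
    for (i, j, _), (a, b, _) in combinations(streaks, 2):
        disjoint = j < a or b < i
        nested = (i <= a and b <= j) or (a <= i and j <= b)
        if not (disjoint or nested):
            return False
    return True


sample = [3, 1, 4, 1, 5, 9, 2, 6]       # hypothetical data sequence
lps = local_prominent_streaks(sample)
laminar = is_laminar(lps)
```

The quadratic enumeration is only meant for checking the property on toy data; it says nothing about how BIA or SOIA maintain these streaks incrementally.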
The Specific Case: What Streaks in LPSn Should Be in Un When σ = 1?
Let HM(SSn) denote the set that contains the historic moments of all streaks in SSn. When σ = 1, we have the following proposition:
Proposition 4.5. When σ = 1, skyline(LPSn) ∪ HM(SSn) is the minimal subset Un of LPSn that contains the historic moments of a situational streak for dataset Dn or future dataset Dm, where m > n.
Proof. We prove by showing that a streak s is in skyline(LPSn) or HM(SSn) if and only if s can serve as a historic moment of some streak in SSn or SSm. This is done by proving Lemma 4.6 (the if case) and Lemma 4.7 (the only-if case) below.
Lemma 4.6. If a streak s in LPSn can serve as a historic moment of a streak in SSn or SSm, then s is in skyline(LPSn) ∪ HM(SSn).
Proof. It is equivalent to showing that for any streak s, if s ∉ skyline(LPSn) and s ∉ HM(SSn), then s will not be a historic moment of any streak in SSn or SSm.
Since s is not in the skyline of LPSn, there must exist some streak y ∈ LPSn in the skyline with y ≻ s. Now, for any streak z∗ = (i∗, j∗, v∗) in SSn or SSm:
(1) If z∗ ∈ SSn, s cannot be a historic moment of z∗ since s ∉ HM(SSn).
(2) Else, if y does not overlap with z∗, y will be an analogous streak of z∗ whenever s is, so that s cannot be in the skyline of AS(z∗), and thus is not a historic moment of z∗.
(3) Else, y overlaps with z∗ and z∗ ∉ SSn. Then, we must have some z = (i, j, v) in SSn with the same starting position as z∗, i.e., i = i∗, and y must also overlap with z. Also, |z| < |z∗| since z∗ ∉ SSn while both streaks z and z∗ have the same starting position. Now, recall from Lemma 4.4 that the intervals of two local prominent streaks are either disjoint or one contains the other. Thus, we further have |s| ≤ |y| (since y ≻ s) and |y| ≤ |z| (since y overlaps with z and z ∈ SSn). The above inequalities imply that |s| < |z∗|, so that s is not an analogous streak, and thus not a historic moment, of z∗ when σ = 1.
So in all cases, s is not a historic moment of z∗. This completes the proof of the lemma.
Lemma 4.7. If a streak s is in skyline(LPSn) ∪ HM(SSn), then s can serve as a historic moment of some streak in SSn or SSm.
Proof. Since any streak in HM(SSn) is already a historic moment of a current situational streak, it remains to show that any streak s = (i, j, v) in skyline(LPSn) can be a historic moment of some streak in SSn or SSm. To see this, imagine that the data sequence Dn is appended with the following values for its next |s| + 2 days: ϵ, v, v, . . . , v, where ϵ is a positive value less than v. That essentially creates a situational streak z∗ = (i∗, j∗, v∗), with i∗ = n + 2, j∗ = n + |s| + 2, length |z∗| = j∗ − i∗ = |s|, and value v∗ = v.
Note that s is an analogous streak of z∗. Furthermore, s is in the skyline of AS(z∗) because (1) every streak in AS(z∗) must end no later than n + 1, but (2) every streak that ends exactly at n + 1 will have minimum value ϵ, and so cannot be an analogous streak of z∗. This implies that AS(z∗) ⊆ LPSn. So no streak in AS(z∗) dominates s, as s ∈ skyline(LPSn). Thus, s is in the skyline of AS(z∗), and is therefore a historic moment of z∗. This completes the proof of the lemma.
By combining Lemmas 4.6 and 4.7, Proposition 4.5 follows.
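The construction used in the proof of Lemma 4.7 can be replayed on toy data. The sketch below appends ϵ followed by |s| + 1 copies of s.v and then lists the situational streaks of the extended sequence. It assumes |s| = j − i (as in the proof) and the same notion of local prominent streak as in the proof of Lemma 4.4; the helper names, the sample sequence, and the choice ϵ = 0.5 are hypothetical.

```python
def situational_streaks(values):
    """Local prominent streaks that end at the last position, as (i, n, v), 1-based."""
    n = len(values)
    out = []
    lo = float("inf")
    for i in range(n, 0, -1):                 # candidate start positions, right to left
        lo = min(lo, values[i - 1])           # minimum of positions i..n
        if i == 1 or values[i - 2] < lo:      # cannot be extended to the left
            out.append((i, n, lo))
    return out


def append_for_lemma_4_7(values, s, eps):
    """Append eps, then |s| + 1 copies of s.v, where |s| = j - i (assumed convention)."""
    i, j, v = s
    if not 0 < eps < v:
        raise ValueError("eps must be positive and smaller than the value of s")
    return list(values) + [eps] + [v] * (j - i + 1)


values = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical D_n with n = 8
s = (5, 6, 5)                       # a skyline local prominent streak of this sequence
n = len(values)

extended = append_for_lemma_4_7(values, s, eps=0.5)
z_star = (n + 2, n + (s[1] - s[0]) + 2, s[2])     # the streak predicted by the proof
created = z_star in situational_streaks(extended)
```

Here z_star has the same length and value as s, and every earlier streak that could compete with s already lies in the original sequence, which matches the argument that AS(z∗) ⊆ LPSn.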
For Figure 2, as in Example 4.1, we have:
• SS19 = {s1, s2, s3, s4} and
• skyline(LPS19) = {s1, s2, s3, s4, s5, s6, s11}.
When we set σ = 1 and k = 4, the set of historic moments of SS19, denoted as HM(SS19), is {s11}, since s11 is a historic moment (and the only one) of s1, while there are no historic moments for s2, s3, or s4. So Proposition 4.5 states that we only need to keep {s1, s2, s3, s4, s5, s6, s11}. Although currently s5 and s6 are not historic moments of any situational streak (since they overlap with all streaks in SS19), each of them may be a historic moment of some situational streak in the future.
For example, suppose that two new values v20 = v21 = 9 are added to the example data sequence in Figure 2. Then, (20, 21, 9) becomes a situational streak of the data sequence D21, with both s5 and s6 being its historic moments.
Example 4.8. Consider v2 and v6 equal to 7 instead of 8 in Figure 2. In this case, streak s11 = (2, 6, 7) is no longer in skyline(LPS19), since it is dominated by s1. Yet, s11 is a historic moment of s1, and s11 ∈ HM(SS19).
The example above illustrates why Proposition 4.5 has to keep HM(SSn) as well: in that example, s11 ∈ HM(SS19) even though s11 is not in skyline(LPS19). This also justifies our earlier argument that modifying BIA to keep only the skyline of LPSn is insufficient.
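As a rough illustration of Proposition 4.5, the sketch below computes a skyline over (length, value) and then adds back the current historic moments. It assumes that y dominates s when y is at least as long and at least as high in value and strictly better in one of the two, with |s| = j − i; the paper's own domination definition lies outside this section and may treat ties differently. The streak list comes from a hypothetical sequence, and the HM set passed in is a made-up placeholder that mimics the situation of Example 4.8 rather than a computed result.

```python
def dominates(y, s):
    """Assumed domination test on (i, j, v) tuples; ties do not dominate."""
    len_y, len_s = y[1] - y[0], s[1] - s[0]
    return len_y >= len_s and y[2] >= s[2] and (len_y > len_s or y[2] > s[2])


def skyline(streaks):
    """Streaks that no other streak in the list dominates."""
    return [s for s in streaks if not any(dominates(y, s) for y in streaks)]


def index_for_sigma_one(lps, hm_of_ss):
    """Proposition 4.5: keep skyline(LPS_n) together with HM(SS_n)."""
    kept = skyline(lps)
    kept += [s for s in hm_of_ss if s not in kept]
    return kept


# Local prominent streaks of the hypothetical sequence [3, 1, 4, 1, 5, 9, 2, 6].
lps = [(1, 1, 3), (1, 8, 1), (3, 3, 4), (5, 6, 5), (5, 8, 2), (6, 6, 9), (8, 8, 6)]
hm_placeholder = [(3, 3, 4)]     # placeholder only: a dominated streak assumed to be in HM(SS_n)

sky = skyline(lps)
kept = index_for_sigma_one(lps, hm_placeholder)
```

The placeholder streak is dominated and so drops out of the skyline, yet it stays in the index because it is assumed to be a current historic moment, which is the reason the union in Proposition 4.5 is needed.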
The General Case: What Streaks in LPSn Should Be in Un When σ > 0?
Unfortunately, Proposition 4.5 is still insufficient, specifically when there are queries with σ < 1. As we discussed in Example 4.1, in Figure 2, when querying with σ = 0.75, s7 is the historic moment of s1, but s7 ∉ skyline(LPS19) ∪ HM(SS19) according to Proposition 4.5.
So to support a general σ, in addition to skyline(LPSn) ∪ HM(SSn), what else shall we include while maintaining minimality?
We answer the question by continuing Example 4.1. Specifically, when σ = 0.75, s7 shall be included, but unfortunately it is pruned by s5 because s5 ≻ s7. However, while s7 is pruned by s5, the latter does not serve as an analogous streak of s1 in its place. Why can’t s5 serve as s1’s analogous streak? That is because s5 overlaps with s1, which violates the definition of an analogous streak (Definition 3.3, condition 1; i.e., s5 has to end before s1 starts). In other words, since s5 cannot serve as an analogous streak of s1, it shall not be used to prune any analogous streak of s1.
So we classify the streaks in LPSn into two subsets: (1) Perplexing Streaks Pn and (2) Nonperplexing Streaks Nn:
Definition 4.9 (Perplexing Streaks Pn). A streak p ∈ LPSn is a perplexing streak when there exists a situational streak z ∈ SSn such that:
(1) p overlaps with z,
(2) |p| ≥ |z| · σ, and
(3) p.v ≥ z.v · σ.
Note that p can overlap with multiple situational streaks. We use Pn to denote the set of all perplexing streaks in data sequence Dn.
By definition, SSn ⊆ Pn.
Definition 4.10 (Nonperplexing Streaks Nn). Streaks in LPSn that are not in Pn are nonperplexing streaks, denoted as Nn. That is, Nn = LPSn − Pn. So a streak s ∈ LPSn is a nonperplexing streak if, for every situational streak z ∈ SSn that s overlaps with, either |s| < |z| · σ or s.v < z.v · σ.
Example 4.11. In Figure 2, when σ = 0.75, streaks in LPS19 = {s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11} are classified into
• P19 = {s1, s2, s3, s4, s5} and
• N19 = {s6, s7, s8, s9, s10, s11}.
s5 is in P19 because (1) s5 overlaps with s1, (2) |s5| ≥ |s1| · 0.75, and (3) s5.v ≥ s1.v · 0.75. In contrast, s6 ∉ P19 because |s6| < |si| · 0.75 for any si ∈ SS19.
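Definitions 4.9 and 4.10 translate into a short classification routine. The sketch below is a direct, unoptimized reading of the three conditions, assuming |s| = j − i and closed intervals for the overlap test. The function names are illustrative, and the streaks are those of a hypothetical sequence [6, 1, 9, 9, 9, 9, 9, 7] rather than of Figure 2.

```python
def length(s):
    return s[1] - s[0]                      # assumption: |s| = j - i


def overlaps(a, b):
    """Closed intervals [a.i, a.j] and [b.i, b.j] share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_perplexing(p, situational, sigma):
    """Definition 4.9: some situational streak z meets all three conditions."""
    return any(
        overlaps(p, z)
        and length(p) >= length(z) * sigma
        and p[2] >= z[2] * sigma
        for z in situational
    )


def classify(lps, situational, sigma):
    """Split LPS_n into (P_n, N_n) following Definitions 4.9 and 4.10."""
    perplexing = [s for s in lps if is_perplexing(s, situational, sigma)]
    nonperplexing = [s for s in lps if s not in perplexing]
    return perplexing, nonperplexing


# Local prominent and situational streaks of the hypothetical sequence
# [6, 1, 9, 9, 9, 9, 9, 7].
lps = [(1, 1, 6), (1, 8, 1), (3, 7, 9), (3, 8, 7)]
situational = [(3, 8, 7), (1, 8, 1)]

p_loose, n_loose = classify(lps, situational, sigma=0.75)
p_strict, n_strict = classify(lps, situational, sigma=1.0)
```

In this toy input the streak (3, 7, 9) is perplexing at σ = 0.75 but nonperplexing at σ = 1, while both situational streaks are perplexing at either threshold, consistent with SSn ⊆ Pn.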
In the following, we state some important lemmas.
Lemma 4.12. A streak that is dominated by a streak in skyline(Nn) would not be a historic moment of any streak in either SSn for dataset Dn or SSm for future dataset Dm, where m > n.
Proof. Suppose that streak s is dominated by a streak y ∈ skyline(Nn). Then, for any streak z∗ in SSn or SSm, two cases can happen: (1) y does not overlap with z∗ or (2) y overlaps with z∗.
In Case (1), if s is an analogous streak of z∗, so is y, as y ≻ s, so that s is not in the skyline of AS(z∗), and thus not a historic moment of z∗.
In Case (2), let z∗ = (i∗, j∗, v∗). Since y overlaps with z∗, we must have some z = (i, j, v) in SSn with the same starting position as z∗, i.e., i = i∗, and y also overlaps with z. (Note that z = z∗ if z∗ ∈ SSn.) As y ∈ Nn, |y| < |z| · σ ≤ |z∗| · σ holds. As y ≻ s, this further implies |s| ≤ |y| < |z∗| · σ.
Thus, s is not an analogous streak of z∗. Consequently, s cannot be a historic moment of z∗. This completes the proof.
Lemma 4.12 directly leads to the following corollary.
Corollary 4.13. Given a streak s ∈ Nn, if s ∈ HM(z), where z ∈ SSn ∪ SSm, then s ∈ skyline(Nn).
Definition 4.14 (Smallest-Value Extension). Let z = (i, j, v) be a situational streak. Suppose that the sequence is appended with a new value v′ < v. If the streak z′ = (i, j + 1, v′) remains a streak in LPSn+1 (i.e., v_{i−1} < v′), then z′ is called the v′ extension of z. If vsmall is the smallest value v′ such that the v′ extension of z exists, then the streak z+ = (i, j + 1, vsmall) is called the smallest-value extension of z.
So for Figure 2, the corresponding smallest-value extensions of the situational streaks are
s1+ = (15, 20, 6 + ϵ), s2+ = (14, 20, 4 + ϵ),
s3+ = (8, 20, 1 + ϵ), s4+ = (1, 20, 0 + ϵ),
where ϵ is an arbitrarily small positive value. Note that streak (15, 20, 7) is not s1+ because 7 is not the smallest possible value to extend s1 = (15, 19, 7). Appending the data sequence with vsmall = 6 will result in a situational streak (14, 20, 6). But that does not qualify as s1+ because the starting position of s1+ should be the same as the starting position of s1.
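Read operationally, Definition 4.14 says that z+ keeps the start of z, ends one position later, and takes a value just above v_{i−1}. The sketch below encodes that reading. It assumes that the value before position 1 is treated as 0, as s4+ = (1, 20, 0 + ϵ) suggests, and it uses a concrete eps argument in place of the arbitrarily small ϵ. The function name and the sample sequence are hypothetical.

```python
def smallest_value_extension(values, z, eps):
    """Return z+ = (i, j + 1, v_{i-1} + eps) for a situational streak z = (i, j, v)."""
    i, j, v = z
    if j != len(values):
        raise ValueError("z must end at the last position n")
    # Assumption: the value before position 1 is treated as 0.
    left = values[i - 2] if i > 1 else 0
    if not left + eps < v:
        raise ValueError("eps is too large: the new value must stay below v")
    return (i, j + 1, left + eps)


values = [6, 1, 9, 9, 9, 9, 9, 7]    # hypothetical D_8, not the data of Figure 2
z_short = (3, 8, 7)                  # situational streak starting after the value 1
z_long = (1, 8, 1)                   # situational streak covering the whole sequence

ext_short = smallest_value_extension(values, z_short, eps=0.5)
ext_long = smallest_value_extension(values, z_long, eps=0.5)
```

Any appended value at or below v_{i−1} would let the streak absorb position i − 1 and change its start, which is why (14, 20, 6) in the text above does not count as s1+.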
Definition 4.15 (Universal Domination). Let SSn(p) be the set of situational streaks that overlap with a perplexing streak p. A streak s is universally dominated by a perplexing streak p when
(1) p ≻ s,
(2) s is not an analogous streak of any z ∈ SSn(p), and
(3) s is not an analogous streak of z+, for all z ∈ SSn(p).
Conceptually, universal domination is harder to achieve than the ordinary domination, since p universally dominating s implies that p dominates s.
Lemma 4.16. If a streak s ∈ skyline(Nn) ∪ Pn is either (1) universally dominated by some streak in Pn or (2) of length less than σ, then s would not be a historic moment of any streak in SSn for dataset Dn or SSm for future dataset Dm, where m > n.
Proof.
(1) Suppose that a streak s is universally dominated by a perplexing streak p = (i, j, v) ∈ Pn. Then, there are the following cases when considering a current/future situational streak z∗ = (i∗, j∗, v∗):
(a) z∗ does not overlap with p. Then, if s is an analogous streak of z∗, so is p. In this case, s is not in the skyline of AS(z∗), and thus not a historic moment of z∗.
(b) z∗ overlaps with p.
(i) z∗ ∈ SSn: then z∗ ∈ SSn(p), so that by definition s cannot be an analogous streak of z∗.
(ii) z∗ ∉ SSn (this happens when z∗ ∈ SSm and z∗ ∉ SSn): then there exists some current situational streak z = (i, j, v) with the same starting position as z∗, and z ∈ SSn(p). Furthermore, |z∗| ≥ |z+| and z∗.v ≥ z+.v hold, where z+ is the smallest-value extension of z. As s is not an analogous streak of z+, s cannot be an analogous streak of z∗. In other words, s cannot be a historic moment of z∗.
(2) Any streak in SSn or SSm has length at least 1. So streaks in skyline(Nn) ∪ Pn with length less than σ cannot be historic moments, as their length does not qualify.
This completes the proof.
Lemma 4.16 leads to the following corollary.
Corollary 4.17. Given a streak s ∈ skyline(Nn) ∪ Pn, if s ∈ HM(z), where z ∈ SSn ∪ SSm, then s is not universally dominated by any p ∈ Pn, and s has length no less than σ.
Example 4.18. Consider Figure 2 again; recall that when σ = 0.75, P19 = {s1, s2, s3, s4, s5} and N19 = {s6, s7, s8, s9, s10, s11}. Also, skyline(N19) = {s6, s7, s8, s11}. Then, s5 is a perplexing streak. Note that s8 ∈ skyline(N19) and s8 is universally dominated by s5 since:
(1) s5 ≻ s8,
(2) s8 is not an analogous streak of any streak in SS19(s5) = {s1, s2, s3, s4}, and
(3) s8 is not an analogous streak of s1+, s2+, s3+, or s4+.
Therefore, by Lemma 4.16, s8 would not be a historic moment in any case and it is not necessary to keep it. In contrast, s7 ∈ skyline(N19) and s5 ≻ s7 as well. However, since s7 is an analogous streak of s1, s7 is not universally dominated by s5 and it cannot be pruned using Lemma 4.16.
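The contrast between s8 and s7 in Example 4.18 can be reproduced on toy data with a direct reading of Definition 4.15. The sketch below assumes the same conventions as before: |s| = j − i, domination as "at least as long and as high, strictly better in one", and analogy as "ends before z starts, with length and value at least σ times those of z". These are assumptions about Definitions 3.3 and the domination relation, which lie outside this section. The streaks belong to a hypothetical sequence, σ = 0.25 is chosen only to keep the arithmetic exact, and EPS = 0.5 stands in for the arbitrarily small ϵ.

```python
def length(s):
    return s[1] - s[0]                       # assumption: |s| = j - i


def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def dominates(y, s):
    return (length(y) >= length(s) and y[2] >= s[2]
            and (length(y) > length(s) or y[2] > s[2]))


def is_analogous(s, z, sigma):
    # Assumed reading of Definition 3.3.
    return s[1] < z[0] and length(s) >= sigma * length(z) and s[2] >= sigma * z[2]


def universally_dominated(s, p, situational, extension, sigma):
    """Definition 4.15; `extension` maps each situational streak z to z+."""
    if not dominates(p, s):                  # condition (1)
        return False
    for z in situational:
        if overlaps(p, z) and (
            is_analogous(s, z, sigma)                    # violates condition (2)
            or is_analogous(s, extension[z], sigma)      # violates condition (3)
        ):
            return False
    return True


# Streaks of the hypothetical sequence [6, 6, 6, 1, 8, 8, 1, 5, 9, 9, 9, 9, 5].
EPS = 0.5
situational = [(8, 13, 5), (1, 13, 1)]
extension = {(8, 13, 5): (8, 14, 1 + EPS), (1, 13, 1): (1, 14, 0 + EPS)}
p = (9, 12, 9)          # perplexing streak nested inside (8, 13, 5)
s_a = (1, 3, 6)         # dominated by p, but analogous to (8, 13, 5)
s_b = (5, 6, 8)         # dominated by p and analogous to nothing p overlaps
sigma = 0.25

a_pruned = universally_dominated(s_a, p, situational, extension, sigma)
b_pruned = universally_dominated(s_b, p, situational, extension, sigma)
```

Here s_b plays the role of s8 and can be discarded, whereas s_a plays the role of s7 and must be kept even though p dominates it.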
Theorem 4.19. The minimal subset Un of LPSn is the set of streaks in skyline(Nn) ∪ Pn that (1) are not universally dominated by any streak in Pn and (2) have length at least σ.
Proof. We prove the theorem by showing that a streak s is in Un if and only if s can serve as a historic moment of some streak in SSn or SSm, where m > n. This is done by proving Lemma 4.20 (the if case) and Lemma 4.21 (the only-if case) below.
Lemma 4.20. If a streak s can serve as a historic moment of some streak in SSn or SSm, then s is in Un.
Proof. First, LPSn = Nn ∪ Pn contains all local prominent streaks, and thus all historic moments. Then, by Corollary 4.13, we see that skyline(Nn) ∪ Pn contains all historic moments. Consequently, by Corollary 4.17, we see that Un contains all historic moments. The lemma follows.
Lemma 4.21. If a streak s is in Un, then s can serve as a historic moment of some streak in SSn or SSm.
Proof. A streak s in Un is either (1) not dominated by any other streak in LPSn or (2) dominated only by some streak p ∈ Pn but not universally dominated by p.
For Case (1), if we append the data sequence Dn first with an arbitrarily small positive value ϵ, followed by |s|/σ + 1 values of (s.v)/σ, then, after |s|/σ + 2 new data values have arrived, there will be a situational streak z∗ with length |z∗| = |s|/σ and value z∗.v = (s.v)/σ. That streak z∗ will regard s as its historic moment.¹¹
For Case (2), let z be the longest situational streak in SSn such that s is an analogous streak of z∗, where z∗ = z or z+. Note that such a z must exist since s is not universally dominated by some p ∈ Pn (recall Definition 4.15 for the property of universal domination). Also, z∗ ∈ SSn ∪ SSm.
Then, for any streak p ∈ Pn with p ≻ s, p must overlap with z∗.¹² So no streak in Pn that dominates s can be an analogous streak of z∗. Moreover, no streak in Nn dominates s, as s is dominated only by streaks in Pn. The above statements imply that s is not dominated by any analogous streak of z∗, so that it is in the skyline of AS(z∗), and thus a historic moment of z∗. In summary, each streak in Un is a historic moment of some streak in SSn or SSm, and therefore needs to be kept.
Combining the above lemmas, Theorem 4.19 follows.
Example 4.22. Consider Figure 2 when σ = 0.75; as in Example 4.18, we have:
• P19 = {s1, s2, s3, s4, s5};
• skyline(N19) = {s6, s7, s8, s11};
• in skyline(N19) ∪ P19, s8 is universally dominated by s5; and
• each streak in skyline(N19) ∪ P19 has length at least σ.
According to Theorem 4.19, the minimal subset U19 of LPS19 is the set of streaks in skyline(N19) ∪ P19 that (1) are not universally dominated by any streak in P19, which gives s8 ∉ U19, and (2) have length at least σ. So we have:
• U19 = {s1, s2, s3, s4, s5, s6, s7, s11} and
• SS19 = {s1, s2, s3, s4}.
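The set arithmetic of Example 4.22 can be replayed directly. In the sketch below the streak names are plain labels standing in for the streaks of Figure 2, and the inputs (P19, skyline(N19), the universally dominated streak, and the absence of too-short streaks) are copied from Examples 4.18 and 4.22 rather than computed from data. The variable names are illustrative.

```python
# Inputs taken from Examples 4.18 and 4.22 (sigma = 0.75); labels only, no geometry.
P19 = {"s1", "s2", "s3", "s4", "s5"}
skyline_N19 = {"s6", "s7", "s8", "s11"}
universally_dominated = {"s8"}     # s8 is universally dominated by s5
too_short = set()                  # every candidate has length at least sigma

# Theorem 4.19: start from skyline(N_n) ∪ P_n, then apply conditions (1) and (2).
candidates = skyline_N19 | P19
U19 = candidates - universally_dominated - too_short

SS19 = {"s1", "s2", "s3", "s4"}
situational_kept = SS19 <= U19     # every situational streak remains in the index
```

Only s8 is removed from the nine candidates, and the four situational streaks all stay in U19, as expected from SSn ⊆ Pn.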
4.3.1 SOIA Maintenance. Theorem 4.19 shows how to obtain Un for data sequence Dn. Based on that, we now discuss how to obtain Um accordingly when new values vn+1, vn+2, . . . , vm are appended to the data sequence Dn. Without loss of generality, we first focus on the case when a single data value vn+1 is appended, and then generalize the procedure to the case when values vn+1, vn+2, . . . , vm are appended in a batch.
When appending value vn+1 to Dn, it works like BIA (Section 4.2.1) to get SSn+1 and local prominent streaks ending at n based on three different cases. Specifically, our maintenance algorithm aims to do the following:
¹¹ The reason is as follows: clearly, s is an analogous streak of z∗. Next, if to the contrary s is not in the skyline of AS(z∗), then there is some streak y ∈ AS(z∗) that dominates s. Such a y must not overlap with z∗ (since it is an analogous streak of z∗) and must not include the value ϵ that is newly appended to Dn (since the value of y is at least s.v for y ≻ s). In other words, y must be a streak in LPSn. A contradiction occurs, for no streak in LPSn should dominate s.
¹² Else, SSn(p) does not contain any situational streak of which s is an analogous streak, so that s is universally dominated by p, and therefore s ∉ Un. A contradiction occurs.