In this section, we evaluate the performance of SOIA on five real sequence datasets and compare it with BIA. The experimental platform has a 2.8 GHz Intel i5 CPU, 8 GB of RAM, and a 256 GB hard disk.
Table 2 lists the information about the datasets, which are ordered by data size. Dataset D5 was so large that BIA would require more index space than our disk provides; therefore, we used only a fraction of D5 in this section so that BIA could run.
6.1 Overall Comparison
We first look at the overall performance of BIA and SOIA when the full dataset is available. We evaluate the performance of BIA and SOIA in terms of their (1) index building time (maintenance time), (2) query time (lookup time), and (3) index space.
We use σ = 0.5 in building the index structure (Maintenance procedure). We create five user lookup workloads, W1, W2, W3, W4, and W5, where a workload Wi mimics a user exploring the dataset by trying out 10 · i historic moment lookups with different values of k and σ. Therefore, W1 consists of 10 lookups (e.g., a casual user) and W5 consists of 50 lookups (e.g., a serious journalist). When reporting the lookup time, we report the average running time of W1 to W5, which represents the total wait time for a user in one interactive session. In the experiment, each lookup query randomly chose a value between 1 and 10 for k and a value between the σ used for maintenance and 1 for its own σ.
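The workload construction described above can be sketched as follows. This is a minimal illustration rather than the harness used in the experiments: the function name `make_workloads`, its parameters, the fixed seed, and the assumption that k is drawn uniformly from the integers 1 to 10 and the lookup σ uniformly from the reals between the maintenance σ and 1 are all ours.

```python
import random


def make_workloads(sigma_build=0.5, num_workloads=5, k_max=10, seed=0):
    """Return {i: [(k, sigma), ...]} where workload W_i holds 10 * i lookups.

    Hypothetical helper: the sampling distributions are assumptions,
    since the text only says the values are chosen randomly.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    workloads = {}
    for i in range(1, num_workloads + 1):
        workloads[i] = [
            # k in [1, k_max]; lookup sigma in [sigma_build, 1]
            (rng.randint(1, k_max), rng.uniform(sigma_build, 1.0))
            for _ in range(10 * i)
        ]
    return workloads


workloads = make_workloads()
sizes = [len(workloads[i]) for i in sorted(workloads)]
print(sizes)  # [10, 20, 30, 40, 50]
```

Under these assumptions, the reported lookup time would be the mean, over the five workloads, of the total time needed to answer all lookups in a workload.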
Figure 4 shows the results. Since SOIA has to invest some time to identify and prune redundant streaks in order to maintain a minimal-space index for later use, its one-off maintenance time is higher than that of BIA (Figure 4(a)). Nevertheless, this investment is worthwhile because the lookup time of SOIA is 9.58× (D1) to 184.14× (D4) better than that of BIA (Figure 4(b)), which gives users a much shorter waiting time during exploration. Figure 4(c) shows that the space requirement of SOIA is much smaller than that of BIA. On D5 (a largely trimmed version of Taiwan’s ground motion data), monitoring historic moments for just one station already requires BIA to build an index bigger than 200 MB.
Table 2. Summary of Datasets
| Dataset | Size (MB) | Length | Description |
|---|---|---|---|
| D1 | 14.1 | 1,074,637 | household global minute-averaged current intensity from December 2006 to December 2008 (footnote 19) |
| D2 | 21.3 | 1,600,237 | household global minute-averaged active power from December 2006 to December 2009 |
| D3 | 27.5 | 1,802,000 | EEG time-series datasets (footnote 20) |
| D4 | 32.5 | 2,764,800 | number of requests to World Cup 98 website per second from April to May 1998 (footnote 21) |
| D5 | 52.4 | 3,456,000 | Taiwan KMNB station ground motion from Jan. 1, 2017, to Jan. 2, 2017 (footnote 22) |
Fig. 4. BIA versus SOIA.
The minimal index size of SOIA is also the key factor that leads to its excellent historic moment lookup performance (other factors include the size and the value of the streak; see Algorithm 6).
Compared with SOIA, BIA takes 656× (D1) to 2,898× (D3) more index space. Table 3 lists the index space as well as the number of streaks stored using BIA and SOIA, respectively. For SOIA, the number of streaks in Skyline (Nn) and Pn is also reported. In fact, the sizes of the index structures are proportional to the numbers of streaks in LPSn. In the following sections, we only include the index size in our discussions.
6.2 Historic Moment Exploration with Data Update
Next, we look at the performance of SOIA and BIA for maintaining the index structure online. That is, whenever a value arrives, BIA inserts new local prominent streaks into the R-tree immediately, and SOIA maintains its space-optimal property by inserting and pruning the R-tree immediately.
In this experiment, we treat the first 98% of a dataset as the initial dataset, whose index structure has already been built by the Maintenance procedure. Then, we examine the performance of SOIA and BIA when appending the last x% of the dataset, where x = 0.1, 0.5, 1, and 2.
Figure 5 shows the experimental results. In the figures, we report:
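The split into an initial dataset and an appended portion can be sketched as below. This is only an illustration under our own assumptions: `split_for_append` is a hypothetical helper, the appended values are taken to be the ones immediately following the initial 98%, and sizes are rounded to whole elements.

```python
def split_for_append(sequence, initial_fraction=0.98, append_percent=1.0):
    """Split a sequence into the initially indexed part and the appended part.

    Assumption: the appended x% directly follows the initial portion.
    """
    n = len(sequence)
    n_initial = round(n * initial_fraction)     # e.g. first 98% of the data
    n_append = round(n * append_percent / 100)  # x% of the full dataset
    initial = sequence[:n_initial]
    appended = sequence[n_initial:n_initial + n_append]
    return initial, appended


# Toy sequence standing in for a real dataset.
data = list(range(10_000))
append_sizes = {}
for x in (0.1, 0.5, 1, 2):
    initial, appended = split_for_append(data, append_percent=x)
    append_sizes[x] = len(appended)
print(len(initial), append_sizes)
```

In an experiment of this shape, the index would be built once on `initial`, and the values in `appended` would then be fed to the maintenance procedure one at a time while the update time is measured.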
(a) the time of the Maintenance procedure in order to handle data appending,
(b) the lookup time,
(c) the time from getting the new data to the time that a user finishes a session of historic moment exploration, and
(d) the size of the index after a data update.
19 https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.
20 http://alumni.cs.ucr.edu/~mueen/OnlineMotif/index.html.
21 http://ita.ee.lbl.gov/html/contrib/WorldCup.html.
22 http://ds.iris.edu/ds/nodes/dmc/forms/breqfast-request/.
Table 3. Number of Streaks Maintained by BIA and SOIA
The maintenance times of SOIA and BIA are similar (see subfigure (a)) because the advantage of having a smaller index in SOIA is offset by the extra effort spent on maintaining minimality.
Nevertheless, SOIA is superior in terms of lookup performance (see subfigure (b)). So if one considers the time from getting the new data to the time that a user finishes a particular session of historic moment exploration (see subfigure (c)), SOIA is much better than BIA. Unsurprisingly, SOIA maintains a smaller index size throughout (see subfigure (d)).
6.3 Sensitivity Study
Here, we look at the impact of parameters σ and k on the performance of SOIA.
We first look at the influence of σ. In this experiment, given that the full dataset is available, we try different values for σ: 0.1, 0.3, 0.5, 0.7, and 0.9. Figure 6(a) shows that different σ values do not influence the maintenance time of SOIA much. That is because the maintenance of SOIA is dominated by the time of scanning all LPSn to get Pn, which is independent of σ. Figure 6(b) shows that a higher σ value during maintenance makes historic moment lookup more efficient because it results in a smaller index, as shown in Figure 6(c).
Figure 7 shows the time for maintaining space optimality online when data is being appended.
In that experiment, we report the maintenance time when the remaining 0.1%, 1%, and 2% of data are inserted, given that the index has already been built for 98% of the original data.
When σ increases, the maintenance procedure takes less time. That is because a higher σ value would reduce the number of perplexing streaks, and for each perplexing streak p, the maintenance algorithm is required to remove streaks that are universally dominated by p from the R-tree (Algorithm 4, Lines 23 and 24). As a result, the maintenance time decreases when σ increases.
Lastly, we look at the impact of k on the performance of SOIA. Note that k has no impact on the maintenance phase of SOIA, so we only report the average execution time of the lookup phase based on the five workloads, with σ = 0.5.
Figure 8 shows that the lookup time generally increases with the value of k. That is because the lookup procedure looks for the historic moment for each top-k situational streak. Increasing k therefore increases the number of skyline computation calls to the index.
Fig. 5. BIA versus SOIA under data update.
Fig. 6. Varying σ.
Fig. 7. Varying σ when updating.
Fig. 8. Varying k.
7 CONCLUSION
In this article, we introduce the notion of the historic moment, which complements existing work [8,19] and provides holistic insights from sequence data. The computational issue of the historic moment focuses on the incremental and interactive aspects, in which new data are expected to arrive regularly and users are supposed to discover new historic moments right away, by feeding in different input parameters. To this end, we present SOIA, a highly efficient incremental algorithm using minimal space. Space optimality is important for online analysis and real-time monitoring systems because it significantly reduces the index size, thereby reducing the I/O per operation or making the index memory resident even when there are many data sequences. Case studies show that historic moments are helpful in computational journalism as well as in seismology. Experimental studies show that SOIA is both space and time efficient and outperforms the baseline on real data.
APPENDIX