

Figure 4-7 Comparison of error rate between EPLR with 1/2 overlapping and PAA

Our experiments demonstrate that data overlapping is necessary, and that overlapping 1/2 of the previous segment's data is sufficient to outperform Euclidean distance. More overlap is unnecessary because it incurs additional overhead. The comparisons among EPLR, PAA, and PLR also show that EPLR is superior to PAA and PLR on most of the data sets.

4.2 Experiments of EPLR Similarity Retrieval

Since the superiority of EPLR has been established, the following experiments are all based on EPLR. In this section, we use pruning rate, miss rate, and CPU cost to evaluate our similarity retrieval. The pruning rate measures efficiency, i.e., how many time series are pruned by the major trends match. To verify effectiveness, we compare the time series matched by major trends against a ground truth acquired by DTW distance; CPU cost is compared against Euclidean distance and the proposed method without the major trends match. The experimental results are presented for 24 of the 32 benchmark data sets used in [22]. Table 4-3 shows the parameter settings for the 24 data sets. The symbol mix( ) means a mixture of the items in ( ).

Table 4-3 Parameter settings for 24 data sets

Number of data sets: 24
Time series length of query data: 64, 128, 256, mix(64, 128, 256)
Time series length of database: 512, mix(512, 1024)
Size of query data for each length: 400
Size of database for each length: 1000

Each data set contains query data and database data, and the length of the query data is less than the length of the database data. Before experimenting, a few parameters still need to be defined. Parameters a and b in Def. 3 for the major trends match are set to 0.4 and 0.2, respectively. For simplicity, each time series is divided into equal segments of 16 data points. We define a DTW distance threshold ε as the ground truth for similarity.

Each pair of a time series in the query data and a time series in the database has a minimum subsequence distance, called MSSD, and ε is set to 1/5 of the average MSSD. Figures 4-8, 4-9, 4-10, and 4-11 show the results of four combinations of query-data and database lengths, respectively. In Figure 4-8, the length of the query data is 64 and the length of the database is 512. We change the length of the query data to 128 and 256 in Figure 4-9 and Figure 4-10. In Figure 4-11, we mix lengths of 64, 128, and 256 for the query data and lengths of 512 and 1024 for the database. As for CPU cost, two data sets, Fetal ECG and Power Data, are chosen for comparison. The CPU cost of Fetal ECG is shown in Figure 4-12 and the CPU cost of Power Data in Figure 4-13.
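To make the setup above concrete, the following is a minimal sketch of how the threshold ε and the two reported measures could be computed. It assumes a plain O(nm) open-begin/open-end DTW for the minimum subsequence distance and hypothetical array and set inputs; it is a sketch of the evaluation protocol described here, not the exact implementation used in the experiments.

```python
import numpy as np

def subsequence_dtw(query, series):
    """Minimum DTW distance between query and any subsequence of series
    (open beginning and end, squared local cost)."""
    n, m = len(query), len(series)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, :] = 0.0                            # a match may start anywhere in series
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (query[i - 1] - series[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, 1:].min())           # ... and may end anywhere as well

def dtw_threshold(queries, database, fraction=0.2):
    """epsilon = 1/5 of the average minimum subsequence distance (MSSD)."""
    mssd = [subsequence_dtw(q, s) for q in queries for s in database]
    return fraction * float(np.mean(mssd))

def pruning_rate(database_size, surviving):
    """Fraction of database series eliminated by the major trends match."""
    return 1.0 - surviving / database_size

def miss_rate(retrieved_ids, ground_truth_ids):
    """Fraction of true matches (subsequence DTW distance <= epsilon) not retrieved."""
    if not ground_truth_ids:
        return 0.0
    return len(ground_truth_ids - retrieved_ids) / len(ground_truth_ids)
```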

Table 4-4 Summary of pruning rates and miss rates for different data lengths (query length-database length)

Data Set | Pruning (64-512) | Miss (64-512) | Pruning (128-512) | Miss (128-512) | Pruning (256-512) | Miss (256-512) | Pruning (mix-mix) | Miss (mix-mix)

Figure 4-8 Pruning rate and Miss rate with length of query = 64 and length of time series in Database = 512

Figure 4-9 Pruning rate and Miss rate with length of query = 128 and length of time series in Database = 512

Figure 4-10 Pruning rate and Miss rate with length of query = 256 and length of time series in Database = 512

Figure 4-11 Pruning rate and Miss rate with length of query = mix of 64, 128, 256 and length of time series in Database = 512

Figure 4-12 CPU cost of Fetal ECG for Euclidean Distance, Proposed Method, and MDTW

Figure 4-13 CPU cost of Power Data for Euclidean Distance, Proposed Method, and MDTW

Observe that the miss rates of all data sets are low enough for the different query lengths. The miss rate rises slightly only when the pruning rate is very high, as for tide and winding in Figure 4-10. Besides, the results indicate that the pruning rates are quite satisfactory for most data sets. From Figure 4-12 and Figure 4-13, we can further see that our proposed method is much faster than Euclidean distance. Even

Chapter 5

Conclusion and Future Work

5.1 Conclusion of Our Proposed Work

In this thesis, we proposed a subsequence similarity retrieval mechanism to deal with shape-based similarity. A new representation, EPLR, is presented, and a similarity measure for EPLR is proposed for similarity retrieval. EPLR is a segmentation technique that divides a time series of length n into m segments of equal length k. Each segment overlaps part of its data with the previous segment and is represented by the angle of its best-fit line segment. Since the segments are equally split and represented by angles, the representation has several advantages, such as ease of implementation, retention of trend information, and dimensionality reduction. Experimental results show that EPLR not only reduces the dimensionality but also reliably captures the shape or trend of a time series, and that it is superior to PLR and PAA.
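As an illustration of the representation just described, here is a minimal sketch of an EPLR-style encoding, assuming the 16-point segments and 1/2 overlap used in the experiments and a least-squares best-fit line per segment; the function and parameter names are ours, not the thesis' exact implementation.

```python
import numpy as np

def eplr_angles(series, seg_len=16, overlap=0.5):
    """Encode a time series as the angles (degrees) of the best-fit line
    of equal-length segments, each overlapping part of the previous one."""
    step = max(1, int(seg_len * (1 - overlap)))   # 1/2 overlap -> step of 8 points
    x = np.arange(seg_len)
    angles = []
    for start in range(0, len(series) - seg_len + 1, step):
        segment = series[start:start + seg_len]
        slope, _ = np.polyfit(x, segment, 1)      # least-squares fit of a line
        angles.append(np.degrees(np.arctan(slope)))
    return np.array(angles)

# A length-64 series is reduced to a handful of angles instead of 64 raw values.
example = np.sin(np.linspace(0, 4 * np.pi, 64))
print(eplr_angles(example))
```

Reducing each segment to a single angle is what yields the dimensionality reduction and the trend information mentioned above.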

We define a two-level similarity measure based on EPLR. On the first level, the major trends of two subsequences have to match, because if two time series are similar, their shapes should be similar. This lets us prune many non-qualified time series and speed up the retrieval. We assign each segment a segment trend in accordance with its angle.

Then a merge mechanism is applied to the segment trends to form major trends. After that, we can perform the major trends match by specific rules. As for the second level, the distance between two subsequences, which is the sum of all major trend distances, is computed only when the first-level conditions are met. Experiments demonstrate that the pruning rate of our similarity measure is satisfactory with acceptable miss rates, and that the CPU cost of the proposed method is low enough.
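A minimal sketch of this two-level measure is given below. The angle cut-offs for the three segment trends and the second-level distance are placeholders (the thesis defines them through Def. 3 with a = 0.4 and b = 0.2, which is not reproduced here); only the overall structure, pruning by major trends first and summing trend-wise distances afterwards, follows the description above.

```python
def segment_trends(angles, up_deg=10.0, down_deg=-10.0):
    """Label each segment 'U' (up), 'F' (flat) or 'D' (down) by its angle.
    The +/-10 degree cut-offs are illustrative placeholders."""
    return ['U' if a > up_deg else 'D' if a < down_deg else 'F' for a in angles]

def major_trends(trends):
    """Merge runs of consecutive identical segment trends into major trends."""
    merged = []
    for t in trends:
        if merged and merged[-1][0] == t:
            merged[-1][1] += 1                     # extend the current run
        else:
            merged.append([t, 1])                  # start a new major trend
    return merged

def two_level_distance(angles_q, angles_c):
    """Level 1: prune unless the major trend sequences match.
    Level 2: a placeholder distance, the mean absolute angle difference
    over aligned segments, standing in for the sum of major trend distances."""
    tq, tc = segment_trends(angles_q), segment_trends(angles_c)
    if [m[0] for m in major_trends(tq)] != [m[0] for m in major_trends(tc)]:
        return None                                # pruned at the first level
    n = min(len(angles_q), len(angles_c))
    return sum(abs(angles_q[i] - angles_c[i]) for i in range(n)) / n
```

In a retrieval loop, every candidate that returns None is discarded before any distance computation, which is the source of the pruning rates reported in Section 4.2.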

5.2 Future Work

Our work uses angles as the representation of the shape or trend of a time series, and similarity retrieval based on this representation is discussed. An extension of this research is to apply EPLR to other data mining tasks such as anomaly detection and motif discovery. As for EPLR, the number of data points in a segment is highly data dependent. We may analyze the data distribution to decide the number of data points in a segment. Furthermore, only the trends, not the real values of the data, are considered in this work. It may be possible to combine the real values with the angles to make the similarity retrieval more robust and powerful.

Bibliography

[1]. R. Agrawal, C. Faloutsos, and A. Swami, Efficient Similarity Search in Sequence Databases, Proc. of the 4th Int'l Conf. on Foundations of Data Organization and Algorithms (FODO), pp. 69-87, 1993.

[2]. R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. of the 11th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 3-14, 1995.

[3]. D. J. Berndt and J. Clifford, Using Dynamic Time Warping to Find Patterns in Time Series, AAAI-94 Workshop on Knowledge Discovery in Databases, pp. 350-370, 1994.

[4]. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, Fast Subsequence Matching in Time-Series Databases, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 419-429, 1994.

[5]. G. Das, D. Gunopulos, and H. Mannila, Finding Similar Time Series, Proc. of the 1st Principles of Data Mining and Knowledge Discovery (PKDD), pp. 88-100, 1997.

[6]. K.-P. Chan and A. Fu, Efficient Time Series Matching by Wavelets, Proc. of the 15th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 126-133, 1999.

[7]. B.-K. Yi, H. V. Jagadish, and C. Faloutsos, Efficient Retrieval of Similar Time Sequences under Time Warping, Proc. of the 14th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 201-208, 1998.

[8]. K.-P. Chan, A. Fu, and C. Yu, Haar Wavelets for Efficient Similarity Search of Time-series: With and Without Time Warping, IEEE Transactions on Knowledge and Data Engineering, 15(3): 686-705, 2003.

[9]. R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim, Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, Proc. of the 21st Int'l Conf. on Very Large Databases (VLDB), pp. 490-501, 1995.

[10]. J. Gehrke, F. Korn, and D. Srivastava, On Computing Correlated Aggregates over Continual Data Streams, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 126-133, 2001.

[11]. S. Park, W. Chu, J. Yoon, and C. Hsu, Efficient Similarity Searches for Time-Warped Subsequences in Sequence Databases, Proc. of the 16th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 23-32, 2000.

[12]. E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani, Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 151-162, 2001.

[13]. E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra, Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, Journal of Knowledge and Information Systems, 3(3): 263-286, 2001.

[14]. H. Wu, B. Salzberg, and D. Zhang, Online Event-driven Subsequence Matching over Financial Data Streams, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 23-34, 2004.

[15]. H. Wu, B. Salzberg, G. C. Sharp, S. B. Jiang, H. Shirato, and D. Kaeli, Subsequence Matching on Structured Time Series Data, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 682-693, 2005.

[16]. L. Chen and R. Ng, On the Marriage of Lp-norms and Edit Distance, Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB), pp. 792-803, 2004.

[17]. L. Chen, M. T. Ozsu, and V. Oria, Robust and Fast Similarity Search for Moving Object Trajectories, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 491-502, 2005.

[18]. M. Vlachos, G. Kollios, and D. Gunopulos, Discovering Similar Multidimensional Trajectories, Proc. of the 18th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 673-684, 2002.

[19]. F. Korn, H. Jagadish, and C. Faloutsos, Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 289-300, 1997.

[20]. E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani, An Online Algorithm for Segmenting Time Series, Proc. of the IEEE Int'l Conf. on Data Mining (ICDM), pp. 289-296, 2001.

[21]. J. Lin, E.J. Keogh, L. Wei, and S. Lonardi, Experiencing SAX: A Novel Symbolic Representation of Time Series, Journal of Data Mining and Knowledge Discovery, 15(2): 107-144, 2007

[22]. E. J. Keogh and C. A. Ratanamahatana, Exact Indexing of Dynamic Time Warping, Journal of Knowledge and Information Systems, 7(3): 358-386, 2005.

[23]. M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. J. Keogh, Indexing Multi-Dimensional Time-series with Support for Multiple Distance Measures, Proc. of the 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 216-225, 2003.

[24]. S.-W. Kim, S. Park, and W. W. Chu, An Index-based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases, Proc. of the 17th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 607-614, 2001.

[25]. Y. Sakurai, M. Yoshikawa, and C. Faloutsos, FTW: Fast Similarity Search under the Time Warping Distance, Proc. of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 326-337, 2005.

[26]. N. Q. V. Hung and D. T. Anh, Combining SAX and Piecewise Linear

[27]. B. Lkhagva, Y. Suzuki, and K. Kawagoe, New Time Series Data Representation ESAX for Financial Applications, Proc. of the 22nd Int'l Conf. on Data Engineering Workshops (ICDEW), p. 115, 2006.

[28]. Y. Zhu and D. Shasha, Warping Indexes with Envelope Transforms for Query by Humming, Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pp. 181-192, 2003.

[29]. X. Lian and L. Chen, Efficient Similarity Search over Future Stream Time Series, IEEE Transactions on Knowledge and Data Engineering, 20(1): 40-54, 2008.

[30]. Y. Zhu and D. Shasha, StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, Proc. of the 28th Int'l Conf. on Very Large Databases (VLDB), pp. 358-369, 2002.

[31]. E. J. Keogh and S. Kasetty, On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, Journal of Data Mining and Knowledge Discovery, 7(4): 349-371, 2003.

[32]. C. A. Ratanamahatana and E. J. Keogh, Three Myths about Dynamic Time Warping Data Mining, Proc. of the SIAM Int'l Conf. on Data Mining (SDM), pp. 506-510, 2005.

[33]. S. Chu, E. J. Keogh, D. Hart, and M. Pazzani, Iterative Deepening Dynamic Time Warping for Time Series, Proc. of the 2nd SIAM Int'l Conf. on Data Mining (SDM), 2002.

[34]. E. J. Keogh and T. Folias, The UCR Time Series Data Mining Archive [http://www.cs.ucr.edu/~eamonn/time_series_data/], Riverside, CA: University of California, Computer Science and Engineering Department, 2002.

[35]. M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, Mining the Stock Market: Which Measure is Best?, Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 487-496, 2000.

[36]. B. Larsen and C. Aone, Fast and Effective Text Mining Using Linear-Time Document Clustering, Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 16-22, 1999.

[37]. E. J. Keogh and M. Pazzani, An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback, Proc. of the 4th Int'l Conf. on Knowledge Discovery and Data Mining, pp. 239-241, 1998.
