Conclusions and Future Work

In this dissertation, we have proposed three algorithms. The first, CMP, mines closed patterns from a time-series database in which each transaction contains multiple time-series sequences. The second, CFP, allows flexible gaps between the items of a pattern and discovers closed flexible patterns in a time-series database. The third, CNP, mines multi-resolution closed patterns directly from raw numerical data, without the transformation from numerical sequences to symbolic sequences that the CMP and CFP algorithms perform. It aims to give analysts different views of the data at various resolutions.

The CMP algorithm consists of three phases. First, we transform each time-series sequence into a symbolic sequence. Second, we scan the transformed database to find all frequent 1-patterns and build a projected database for each of them. Third, we recursively use a frequent k-pattern and its projected database to generate its frequent super-patterns at the next level of the frequent pattern tree, where k ≥ 1. The CMP algorithm proceeds in a depth-first search (DFS) manner and uses the projected databases to localize support counting, candidate pruning, and closure checking; it can therefore mine closed patterns efficiently. The experimental results show that the CMP algorithm outperforms the modified Apriori and BIDE algorithms by one or two orders of magnitude.
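The second and third phases can be illustrated with a minimal Python sketch. It is not the CMP algorithm itself: it assumes one symbolic sequence per transaction (CMP handles several), treats a pattern as a contiguous run of symbols, and replaces CMP's localized closure checking with a simple lookup over the frequent patterns already found. The function name `mine_closed_contiguous` and the `(tid, pos)` layout of the projected database are hypothetical choices made for this illustration.

```python
from collections import defaultdict

def mine_closed_contiguous(db, min_sup):
    """Return {pattern: support} for closed frequent contiguous patterns.

    db is a list of symbolic strings; support counts distinct transactions.
    """
    frequent = {}

    def count(occurrences):
        # Support = number of distinct transactions among the occurrences.
        return len({tid for tid, _ in occurrences})

    def grow(pattern, projected):
        # projected holds (tid, pos): pos is the index just past an occurrence.
        frequent[pattern] = count(projected)
        extensions = defaultdict(list)
        for tid, pos in projected:
            if pos < len(db[tid]):
                extensions[db[tid][pos]].append((tid, pos + 1))
        # Depth-first growth: only frequent extensions are explored further.
        for symbol, occ in sorted(extensions.items()):
            if count(occ) >= min_sup:
                grow(pattern + symbol, occ)

    # Frequent 1-patterns and their projected databases.
    first = defaultdict(list)
    for tid, seq in enumerate(db):
        for i, symbol in enumerate(seq):
            first[symbol].append((tid, i + 1))
    for symbol, occ in sorted(first.items()):
        if count(occ) >= min_sup:
            grow(symbol, occ)

    # Naive closure check: a pattern is closed if no one-symbol extension
    # (left or right) has the same support. Such an extension would itself
    # be frequent, so a lookup in `frequent` is enough.
    alphabet = sorted(first)
    return {
        p: sup
        for p, sup in frequent.items()
        if not any(frequent.get(a + p) == sup or frequent.get(p + a) == sup
                   for a in alphabet)
    }
```

For the toy database `["abcab", "abc", "bca"]` with a minimum support of 2, the sketch keeps `a` and `bc` (support 3) and `abc` and `bca` (support 2); patterns such as `ab` are dropped because the super-pattern `abc` has the same support.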

The CFP algorithm takes the flexible gaps between the items of a pattern into consideration to mine closed flexible patterns in a time-series database. It first transforms the time-series database into a symbolic database, then identifies the frequent 1-patterns in the transformed database, and finally mines closed flexible patterns recursively in a DFS manner. Since the CFP algorithm localizes pattern extension in a small number of projected databases and eliminates unnecessary patterns with two pruning strategies and a closure checking scheme, it is more efficient and scalable than the modified Apriori algorithm. The experimental results show that the CFP algorithm outperforms the modified Apriori algorithm by an order of magnitude.
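To make the notion of a flexible gap concrete, the following Python sketch tests whether a symbolic sequence contains a pattern when up to `max_gap` symbols may be skipped between consecutive items. It illustrates only the matching condition, under the assumption of a single maximum gap threshold; it does not reproduce the projected databases, pruning strategies, or closure checking of the CFP algorithm. The names `occurs_with_gaps` and `flexible_support` are hypothetical.

```python
def occurs_with_gaps(seq, pattern, max_gap):
    """True if pattern occurs in seq with at most max_gap skipped symbols
    between any two consecutive pattern items."""
    if not pattern:
        return True

    def extend(prev, k):
        # prev: index in seq where item k-1 was matched.
        if k == len(pattern):
            return True
        stop = min(prev + max_gap + 2, len(seq))
        return any(seq[i] == pattern[k] and extend(i, k + 1)
                   for i in range(prev + 1, stop))

    return any(seq[i] == pattern[0] and extend(i, 1)
               for i in range(len(seq)))

def flexible_support(db, pattern, max_gap):
    """Number of sequences in db that contain pattern under the gap limit."""
    return sum(occurs_with_gaps(seq, pattern, max_gap) for seq in db)
```

With `max_gap = 1`, the pattern `abc` is found in `axbyc` but not in `axxbc`, where two symbols separate `a` and `b`; with `max_gap = 0` the test reduces to exact contiguous matching.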

Transforming time-series databases into symbolic databases may change the context in which the values are seen, and some valuable information may remain undiscovered. Moreover, a notable disadvantage of symbolic sequence analysis is that the number of symbols and the breakpoints must be supplied. The CNP algorithm is designed to address these issues. It mines multi-resolution closed numerical patterns in a multi-sequence time-series database. Initially, the Haar wavelet transform is applied to convert each time series in the database into a low-resolution sequence, and all frequent 1-patterns are identified from the transformed database.

Subsequently, each frequent k-pattern is recursively extended to find its frequent super-patterns in a DFS manner. Each frequent k-pattern found in the low resolution is then restored to the high resolution. As a result, all closed numerical patterns can be mined in both the low and high resolutions. Since the CNP algorithm employs projected databases to localize pattern growth and uses pruning strategies to speed up the mining process, it shows a significant runtime improvement over the modified A-Close algorithm.
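The move between resolutions can be sketched with one level of the Haar wavelet transform. The Python example below uses the unnormalized averaging form (pairwise means and half-differences), which is an assumption; the normalization used by the CNP algorithm is not stated here, and CNP may restore a low-resolution pattern differently. The names `haar_step` and `haar_restore` are hypothetical.

```python
def haar_step(series):
    """One Haar level: return (approximation, detail) at half the length."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    pairs = [(series[i], series[i + 1]) for i in range(0, len(series), 2)]
    approx = [(a + b) / 2 for a, b in pairs]   # low-resolution view
    detail = [(a - b) / 2 for a, b in pairs]   # what the view leaves out
    return approx, detail

def haar_restore(approx, detail):
    """Invert haar_step: rebuild the higher-resolution series."""
    restored = []
    for a, d in zip(approx, detail):
        restored.extend((a + d, a - d))
    return restored
```

For the series `[4, 6, 10, 12]`, `haar_step` yields the low-resolution sequence `[5.0, 11.0]` with details `[-1.0, -1.0]`, and `haar_restore` recovers the original values. Mining on the approximation halves the sequence length at each level, while the details allow a pattern found there to be mapped back to the finer resolution.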

At present, our work on the CMP algorithm has been published in Data and Knowledge Engineering [32], and the work on the CFP algorithm has been published in Expert Systems with Applications [64]. The main contribution of this dissertation is the design of three efficient algorithms that are able to solve real-world problems. Specifically, we have presented the novel concept of mining closed multi-sequence patterns in a time-series database and designed the CMP algorithm to mine them. Moreover, we have removed the limitation of exact sequence alignment and incorporated the idea of a flexible range of consecutive gaps to discover patterns; we have proposed the CFP algorithm to mine closed flexible patterns in a time-series database. In addition, we have used the Haar wavelet transform to view a time-series database at multiple resolutions and designed a novel algorithm, CNP, to mine closed numerical multi-sequence patterns in a time-series database. For each proposed algorithm, we have devised effective closure checking schemes and pruning strategies that avoid generating redundant candidates and thus reduce execution time. All the proposed algorithms are evaluated with both synthetic and real datasets. The experimental results show that the CMP algorithm outperforms the modified Apriori and BIDE algorithms by one or two orders of magnitude; the CFP algorithm outperforms the modified Apriori algorithm by an order of magnitude; and the CNP algorithm outperforms the modified A-Close algorithm by one or two orders of magnitude.

The limitations of the proposed algorithms are as follows. First, since the CMP and CFP algorithms use an existing data discretization method, such as the SAX representation, to transform time-series sequences into symbolic sequences in their initial phase, we cannot be sure that the breakpoints are optimally chosen or what effect the discretization has on the mined patterns. Second, the CNP algorithm cannot cope with multiple sequences in a transaction that have different lengths or missing values, since no gap symbol is taken into consideration. Finally, the CNP algorithm is unable to mine closed patterns in a database that contains non-aligned time series, because the concept of gap symbols is not integrated into the mining process.
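The first limitation can be seen in a small SAX-style discretization sketch. It follows the usual SAX recipe (z-normalization, piecewise aggregate approximation, and breakpoints that cut the standard normal distribution into equiprobable regions); the exact settings used for the CMP and CFP experiments are not reproduced here, and the function name `sax` and its parameters are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Discretize a numeric series into a symbolic string of n_segments symbols."""
    if len(series) % n_segments:
        raise ValueError("series length must be a multiple of n_segments")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]
    # Piecewise aggregate approximation: mean of each equal-width segment.
    size = len(z) // n_segments
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(n_segments)]
    # Breakpoints split N(0, 1) into len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)
```

The series `[1, 2, 3, 4, 5, 6, 7, 8]` with four segments becomes `abcd` under a four-symbol alphabet but `aabb` under a two-symbol alphabet. The same data therefore yields different symbolic sequences, and potentially different patterns, depending on parameters that the user has to supply.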

The present dissertation has illustrated that the CMP, CFP, and CNP algorithms can efficiently mine closed patterns from both synthetic and real-world datasets; however, in the future, subsequent studies can be conducted in the following directions:

1. We may modify the CMP algorithm to mine closed patterns in one-sequence databases.

2. We may combine the essence of the CMP and CFP algorithms and develop a novel algorithm to address the problem of mining closed flexible patterns in multi-sequence time-series databases.

3. We may allow a user-specified gap interval, instead of a user-specified maximum gap threshold, in the CFP algorithm to find specific patterns in which a user is interested.

4. It is worth extending the CNP algorithm further to mine closed patterns with some complicated constraints. For instance, a gap constraint may be pushed into the mining process.

5. We may modify the CNP algorithm to mine multi-resolution closed numerical patterns in one-sequence databases.

6. In the CNP algorithm, the similarity of a pair of multi-sequences is measured by calculating the distance between each pair of numerical points in the sequences; other mechanisms may be adopted to improve the efficiency of this task.

7. The CNP algorithm uses the already mined patterns to check if a newly found pattern is closed. This is a time-consuming task; therefore, it is helpful to design effective closure checking schemes in order to speed up the algorithm.

8. Without generalization, too many patterns may be mined and they may be too detailed. By generalizing with a concept hierarchy, we may be able to obtain patterns that are more abstract and meaningful.

9. We have implemented memory-based versions of the CMP, CFP, and CNP algorithms. It would be worthwhile to study disk-based implementations for very large databases.

10. We may further apply the proposed methods to other real-world applications, such as bioinformatics, medical diagnosis, and hurricane forecasting.

References

[1] R. Agrawal, K. Lin, H. S. Sawhney, K. Shim, Fast similarity search in the presence of noise, scaling, and translation in time-series databases, in: Proceedings of the 21st International Conference on Very Large Data Bases, 1995, pp. 490-501.

[2] C. D. Ahrens, Meteorology today: an introduction to weather, climate, and the environment (8th ed.), Thomson Brooks/Cole, Belmont, 2007.

[3] D. Alter, Liver-function testing, MLO: Medical Laboratory Observer 40 (12) (2008) 10-17.

[4] J. Ayres, J. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using a bitmap representation, in: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, 2002, pp. 429-435.

[5] BBC News, <http://news.bbc.co.uk/2/hi/business/7073131.stm>.

[6] D. J. Berndt, J. Clifford, Finding patterns in time series: a dynamic programming approach, Advances in Knowledge Discovery and Data Mining (1st ed.), American Association for Artificial Intelligence, 1996, pp. 229-248.

[7] Central Weather Bureau, <http://www.cwb.gov.tw/>.

[8] L. Chang, T. Wang, D. Yang, H. Luan, SeqStream: mining closed sequential patterns over stream sliding windows, in: Proceedings of the 8th IEEE International Conference on Data Mining, 2008, pp. 83-92.

[9] H. Chen, W. Chung, J. J. Xu, G. Wang, Y. Qin, M. Chau, Crime data mining: a general framework and some examples, IEEE Computer 37 (4) (2004) 50-56.

[10] T. S. Chen, S. C. Hsu, Mining frequent tree-like patterns in large datasets, Data and Knowledge Engineering 62 (1) (2007) 65-83.

[11] Y. L. Chen, T. C. K. Huang, A novel knowledge discovering model for mining fuzzy multi-level sequential patterns in sequence databases, Data and Knowledge Engineering 66 (3) (2008) 349-367.

[12] Y. Chen, S. Mabu, K. Shimada, K. Hirasawa, A genetic network programming with learning approach for enhanced stock trading model, Expert Systems with Applications 36 (10) (2009) 12537-12546.

[13] C. J. Chu, V. S. Tseng, T. Liang, Efficient mining of temporal emerging itemsets from data streams, Expert Systems with Applications 36 (1) (2009) 885-893.

[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to algorithms (2nd ed.), The MIT Press, Cambridge, 2003.

[15] G. Das, K. Lin, H. Mannila, Rule discovery from time series, in: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 16-22.

[16] Data Bank for Atmospheric Research, <http://dbar.as.ntu.edu.tw/>.

[17] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, ACM SIGMOD Record 23 (2) (1994) 419-429.

[18] J. Han, G. Dong, Y. Yin, Efficient mining of partial periodic patterns in time series database, in: Proceedings of the 15th International Conference on Data Engineering, 1999, pp. 106-115.

[19] J. Han, M. Kamber, Data mining: concepts and techniques (2nd ed.), Morgan Kaufmann, San Francisco, 2006.

[20] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, in: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 355-359.

[21] J. Han, J. Wang, Y. Lu, P. Tzvetkov, Mining top-k frequent closed patterns without minimum support, in: Proceedings of the 2002 IEEE International Conference on Data Mining, 2002, pp. 211-218.

[22] J. W. Huang, C. Y. Tseng, J. C. Ou, M. S. Chen, A general model for sequential pattern mining with a progressive database, IEEE Transactions on Knowledge and Data Engineering 20 (9) (2008) 1153-1167.

[23] Y. Huang, L. Zhang, P. Zhang, A framework for mining sequential patterns from spatio-temporal event data sets, IEEE Transactions on Knowledge and Data Engineering 20 (4) (2008) 433-448.

[24] L. Ji, K. L. Tan, K. H. Tung, Compressed hierarchical mining of frequent closed patterns from dense data sets, IEEE Transactions on Knowledge and Data Engineering 19 (9) (2007) 1175-1187.

[25] E. Keogh, Fast similarity search in the presence of longitudinal scaling in time series database, in: Proceedings of the Ninth International Conference on Tools with Artificial Intelligence, 1997, pp. 578-584.

[26] E. Keogh, S. Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 102-111.

[27] C. Kim, J. Lim, R. T. Ng, K. Shim, SQUIRE: sequential pattern mining with quantities, The Journal of Systems and Software 80 (10) (2007) 1726-1745.

[28] M. Kontaki, A. N. Papadopoulos, Y. Manolopoulos, Adaptive similarity search in streaming time series with sliding windows, Data and Knowledge Engineering 63 (2) (2007) 478-502.

[29] R. J. Larsen, M. L. Marx, An introduction to mathematical statistics and its applications (3rd ed.), Prentice Hall, New Jersey, 2001.

[30] A. J. T. Lee, C. S. Wang, W. Y. Weng, Y. A. Chen, H. W. Wu, An efficient algorithm for mining closed inter-transaction itemsets, Data and Knowledge Engineering 66 (1) (2008) 68-91.

[31] A. J. T. Lee, Y. T. Wang, Efficient data mining for calling path patterns in GSM networks, Information Systems 28 (8) (2003) 929-948.

[32] A. J. T. Lee, H. W. Wu, T. Y. Lee, Y. H. Liu, K. T. Chen, Mining closed patterns in multi-sequence time-series databases, Data and Knowledge Engineering 68 (10) (2009) 1071-1090.

[33] C. H. L. Lee, A. Liu, W. S. Chen, Pattern discovery of fuzzy time-series for financial prediction, IEEE Transactions on Knowledge and Data Engineering 18 (5) (2006) 613-625.

[34] T. H. Lee, R. Kim, J. T. Benson, T. M. Therneau, L. J. Melton III, Serum aminotransferase activity and mortality risk in a United States community, Hepatology 47 (3) (2008) 880-887.

[35] Y. S. Lee, S. J. Yen, Incremental and interactive mining of web traversal patterns, Information Sciences 178 (2) (2008) 287-306.

[36] H. F. Li, C. C. Ho, S. Y. Lee, Incremental updates of closed frequent itemsets over continuous data streams, Expert Systems with Applications 36 (2) (2009) 2451-2458.

[37] J. Lin, E. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, in: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, pp. 2-11.

[38] M. Y. Lin, S. C. Hsueh, C. W. Chang, Fast discovery of sequential patterns in large databases using effective time-indexing, Information Sciences 178 (22) (2008) 4228-4245.

[39] F. Masseglia, P. Poncelet, M. Teisseire, Efficient mining of sequential patterns with time constraints: reducing the combinations, Expert Systems with Applications 36 (2) (2009) 2677-2690.

[40] F. Masseglia, P. Poncelet, M. Teisseire, Incremental mining of sequential patterns in large databases, Data and Knowledge Engineering 46 (1) (2003) 97-121.

[41] F. Mörchen, A. Ultsch, Optimizing time series discretization for knowledge discovery, in: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005, pp. 660-665.

[42] Y. Nishi, R. Doering, Handbook of semiconductor manufacturing technology (1st ed.), Marcel Dekker Inc., New York, 2000.

[43] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed itemsets for association rules, in: Proceedings of the 7th International Conference on Database Theory, 1999, pp. 398-416.

[44] J. Pei, J. Han, R. Mao, CLOSET: an efficient algorithm for mining frequent closed itemsets, in: Proceedings of the 5th ACM-SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 21-30.

[45] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, in: Proceedings of the 17th International Conference on Data Engineering, 2001, pp. 215-224.

[46] W. C. Peng, Z. X. Liao, Mining sequential patterns across multiple sequence databases, Data and Knowledge Engineering 68 (10) (2009) 1014-1033.

[47] D. Perera, J. Kay, I. Koprinska, K. Yacef, O. R. Zaiane, Clustering and sequential pattern mining of online collaborative learning data, IEEE Transactions on Knowledge and Data Engineering 21 (6) (2009) 759-772.

[48] P. J. Pockros, E. R. Schiff, M. L. Shiffman, J. G. McHutchison, R. G. Gish, N. H. Afdhal, M. Makhviladze, M. Huyghe, D. Hecht, T. Oltersdorf, D. A. Shapiro, Oral IDN-6556, an antiapoptotic caspase inhibitor, may lower aminotransferase activity in patients with chronic hepatitis C, Hepatology 46 (2) (2007) 324-329.

[49] A. H. Ritchie, D. M. Williscroft, Elevated liver enzymes as a predictor of liver injury in stable blunt abdominal trauma patients: case report and systematic review of the literature, Canadian Journal of Rural Medicine 11 (4) (2006) 283-287.

[50] S. Russell, A. Gangopadhyay, V. Yoon, Assisting decision making in the event-driven enterprise using wavelets, Decision Support Systems 46 (1) (2008) 14-28.

[51] S. R. Song, W. Y. Ku, Y. L. Chen, Y. C. Lin, C. M. Liu, L. W. Kuo, T. F. Yang, H. J. Lo, Groundwater chemical anomaly before and after the Chi-Chi Earthquake in Taiwan, Terrestrial, Atmospheric and Oceanic Sciences 14 (3) (2003) 311-320.

[52] R. Srikant, R. Agrawal, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data Engineering, 1995, pp. 3-14.

[53] R. Srikant, R. Agrawal, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487-499.

[54] R. Srikant, R. Agrawal, Mining sequential patterns: generalizations and performance improvements, in: Proceedings of the 5th International Conference on Extending Database Technology, 1996, pp. 3-17.

[55] Standard and Poor's, <http://www2.standardandpoors.com>.

[56] Stocks on Wall Street, <http://stocksonwallstreet.net/2009/07/31/golden-cross-shows-bullish-technical-indicator/>.

[57] Taiwan Stock Exchange Corporation, <http://www.tse.com.tw/ch/index.php>.

[58] J. I. Takeuchi, K. Yamanishi, A unifying framework for detecting outliers and change points from time series, IEEE Transactions on Knowledge and Data Engineering 18 (4) (2006) 482-492.

[59] H. J. Teoh, C. H. Cheng, H. H. Chu, J. S. Chen, Fuzzy time series model based on probabilistic approach and rough set rule induction for empirical research in stock markets, Data and Knowledge Engineering 67 (1) (2008) 103-117.

[60] C. S. Wang, A. J. T. Lee, Mining inter-sequence patterns, Expert Systems with Applications 36 (4) (2009) 8649-8658.

[61] J. Wang, J. Han, BIDE: efficient mining of frequent closed sequences, in: Proceedings of the 20th International Conference on Data Engineering, 2004, pp. 79-90.

[62] J. Wang, J. Han, J. Pei, CLOSET+: searching for the best strategies for mining frequent closed itemsets, in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, 2003, pp. 236-245.

[63] Y. Wang, E. P. Lim, S. Y. Hwang, Efficient mining of group patterns from user movement data, Data and Knowledge Engineering 57 (3) (2006) 240-282.

[64] H. W. Wu, A. J. T. Lee, Mining closed flexible patterns in time-series databases, Expert Systems with Applications 37 (3) (2010) 2098-2107.

[65] Yahoo Finance, <http://finance.yahoo.com>.

[66] X. Yan, J. Han, R. Afshar, CloSpan: mining closed sequential patterns in large databases, in: Proceedings of the 2003 SIAM International Conference on Data Mining, 2003, pp. 166-177.

[67] T. Q. Yang, A time series data mining based on ARMA and MLFNN model for intrusion detection, Journal of Communication and Computer 3 (7) (2006) 16-22.

[68] D. Yuan, K. Lee, H. Cheng, G. Krishna, Z. Li, X. Ma, Y. Zhou, J. Han, CISpan: comprehensive incremental mining algorithms of closed sequential patterns for multi-versional software mining, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 84-95.

[69] M. J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning 42 (1) (2001) 31-60.

[70] M. J. Zaki, C. Hsiao, Efficient algorithms for mining closed itemsets and their lattice structure, IEEE Transactions on Knowledge and Data Engineering 17 (4) (2005) 462-478.

Curriculum Vitae

Name: 吳惠雯 (H. W. Wu)

Place of birth: Tainan County, Taiwan Province

Date of birth: November 23, 1980

Education:

(1999/9 - 2003/6)

Concordia University, Montréal, P.Q.

Bachelor of Engineering in Computer Engineering

(2003/9 - 2004/6)

University of Chicago, Chicago, IL

Master of Science in Computer Science

(2006/9 - 2010/1)

Ph.D. program, Graduate Institute of Information Management, National Taiwan University

Publications:

A. J. T. Lee, C. S. Wang, W. Y. Weng, Y. A. Chen, H. W. Wu, An efficient algorithm for mining closed inter-transaction itemsets, Data and Knowledge Engineering 66 (1) (2008) 68-91.

A. J. T. Lee, Y. H. Liu, H. M. Tsai, H. H. Lin, H. W. Wu, Mining frequent patterns in image databases with 9D-SPA representation, The Journal of Systems and Software 82 (4) (2009) 603-618.

A. J. T. Lee, H. W. Wu, T. Y. Lee, Y. H. Liu, K. T. Chen, Mining closed patterns in multi-sequence time-series databases, Data and Knowledge Engineering 68 (10) (2009) 1071-1090.

H. W. Wu, A. J. T. Lee, Mining closed flexible patterns in time-series databases, Expert Systems with Applications 37 (3) (2010) 2098-2107.