
Chapter 4 Incremental Mining Temporal Patterns from Interval-based Database


4.6.6 Real Dataset Analysis

In addition to the synthetic data sets, we also performed an experiment on a real-world data set to compare performance and to demonstrate the applicability of temporal pattern mining.

The database used in this experiment consists of 1,098,142 library records, including lending and returning records, collected over three years from the National Chiao Tung University (NCTU) Library. The database covers 206,844 books and 28,339 readers. An event interval is composed of a book ID and the corresponding lending and returning times. The size of the database is the number of sequences it contains, which equals the number of readers (28,339). The maximum and average sequence lengths are 302 and 36, respectively. First, we use the records of the first two and a half years to construct the original database DB and the records of the last half year to build the increment database db. The DB, with 1,053,276 library records, can be viewed as 26,738 user sequences, and the db, with 44,866 library records, can be viewed as 3,514 user sequences.

Fig. 4.17(a) shows the execution time as the minimum support threshold varies from 0.1% to 0.05%. When the minimum support drops to 0.05%, Inc_CTMiner is almost 2 times faster than the Naïve method and more than 2.7 times faster than CTMiner.
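The construction of DB and db described above can be illustrated with a minimal Python sketch. It is not the code used in the experiment: the record layout, the toy dates, the cutoff date, and the function names `split_by_cutoff` and `min_support_count` are assumptions made for illustration, and rounding the support count up is also an assumption. The sketch groups lending intervals into one sequence per reader and converts a percentage threshold into an absolute number of sequences.

```python
from collections import defaultdict
from datetime import date
from math import ceil

# Hypothetical record layout: (reader_id, book_id, lend_date, return_date).
# The dates below are invented toy values, not taken from the library data.
records = [
    ("r1", "b10", date(2008, 1, 5), date(2008, 1, 20)),
    ("r1", "b11", date(2010, 9, 1), date(2010, 9, 15)),
    ("r2", "b10", date(2009, 3, 2), date(2009, 3, 30)),
    ("r3", "b12", date(2010, 10, 7), date(2010, 10, 21)),
]


def split_by_cutoff(records, cutoff):
    """Build per-reader interval sequences for DB (lent before cutoff) and db (the rest)."""
    original, increment = defaultdict(list), defaultdict(list)
    for reader, book, start, end in records:
        target = original if start < cutoff else increment
        target[reader].append((book, start, end))
    return dict(original), dict(increment)


def min_support_count(num_sequences, min_sup_percent):
    """Smallest number of sequences that meets a percentage threshold (assumed rounding up)."""
    return ceil(num_sequences * min_sup_percent / 100)


DB, db = split_by_cutoff(records, cutoff=date(2010, 7, 1))
print(sorted(DB), sorted(db))
print(min_support_count(28339, 0.1), min_support_count(28339, 0.05))
```

In the toy data, reader r1 appears in both DB and db. The same kind of overlap would be consistent with the reported sequence counts (26,738 and 3,514) summing to more than the 28,339 readers. Under the assumed rounding, thresholds of 0.1% and 0.05% over 28,339 sequences correspond to 29 and 15 supporting sequences.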

We again use the records of the first two and a half years to construct DB, and divide the records of the remaining half year by month to build six different increment databases db. Fig. 4.17(b) shows the performance of Inc_CTMiner, with min_sup = 0.1%, when incrementally maintaining multiple database updates (six monthly updates in this case). Each time the database is updated, we also run CTMiner to re-mine from scratch for comparison. The figure shows that, as the increments accumulate, the time for incremental mining also increases, but only slightly. Incremental mining still outperforms re-mining with CTMiner by a factor of 2.5 to 3.5. This experiment shows that Inc_CTMiner remains efficient over multiple database updates.
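The monthly partitioning used for the multiple-update experiment can be sketched as follows. This is only an illustration under assumptions: the record layout, the toy dates, and the function name `monthly_increments` are hypothetical, and records are assigned to a month by their lending date, which the text does not specify.

```python
from collections import defaultdict
from datetime import date


def monthly_increments(records, cutoff):
    """Return the records lent on or after cutoff as a list of per-month batches, oldest first."""
    buckets = defaultdict(list)
    for reader, book, start, end in records:
        if start >= cutoff:
            # Assumption: a record belongs to the month in which the book was lent.
            buckets[(start.year, start.month)].append((reader, book, start, end))
    return [buckets[key] for key in sorted(buckets)]


# Invented toy records: (reader_id, book_id, lend_date, return_date).
toy_records = [
    ("r1", "b1", date(2010, 6, 20), date(2010, 7, 2)),
    ("r2", "b2", date(2010, 7, 3), date(2010, 7, 9)),
    ("r1", "b3", date(2010, 8, 11), date(2010, 8, 30)),
    ("r3", "b4", date(2010, 7, 25), date(2010, 8, 5)),
]

batches = monthly_increments(toy_records, cutoff=date(2010, 7, 1))
for number, batch in enumerate(batches, start=1):
    print(f"update {number}: {len(batch)} records")
```

Each batch would then be handed to the incremental miner in turn, while the static miner is re-run on the accumulated database for comparison.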

4.7 Summary

Previous studies on updating sequential patterns have mainly focused on time point-based data.

[Fig. 4.17: Execution time of three algorithms and multiple updates on library dataset from NCTU. (a) Performance of three algorithms on library dataset from NCTU (x-axis: minimum support (%); y-axis: execution time (secs)). (b) Multiple updates of library dataset from NCTU (x-axis: number of database updates; y-axis: execution time (secs)).]

This chapter investigates the issue of incrementally mining temporal patterns. Inc_CTMiner is proposed to balance efficiency and reusability, based on a suitable representation, the coincidence representation. The algorithm also employs two optimization techniques, sequence-reduction and slice-reduction, to further reduce the search space. The experimental results indicate that Inc_CTMiner outperforms previous algorithms designed for static databases in both execution time and memory usage. We also show that Inc_CTMiner scales gracefully. Furthermore, we apply the algorithm to a real-world dataset to show the efficiency and practicability of maintaining temporal patterns.

Chapter 5 Conclusion

In this dissertation, we propose two new representations, the coincidence representation and the endpoint representation, to simplify the processing of complex relations among event intervals.

Three efficient algorithms are then developed to discover several types of temporal patterns from interval-based data. These algorithms employ pruning techniques to reduce the search space. The experimental studies indicate that all proposed algorithms are efficient and scalable and outperform state-of-the-art algorithms. Furthermore, we apply our algorithms to real-world data to show the efficiency and validate the practicability of interval-based temporal mining.

In Chapter 2, a novel technique, the incision strategy, and a new representation, the coincidence representation, are proposed to address the critical issue of temporal pattern mining: they simplify the processing of complex relations among event intervals. The coincidence representation is nonambiguous and has several advantages over existing representations. Based on the coincidence representation, we develop an efficient algorithm, CTMiner, to discover frequent temporal patterns without candidate generation. The algorithm further employs two pruning techniques, pre-pruning and post-pruning, to reduce the search space. By analyzing the differences between mining sequential patterns and mining temporal patterns, we also propose a new projection technique, multi-projection, to correctly project a database into a set of smaller projected databases. The experimental studies indicate that CTMiner is efficient and scalable, and that it outperforms state-of-the-art algorithms in both running time and memory usage.
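As a rough illustration of the idea of cutting event intervals into slices, the following Python sketch splits the timeline at every interval endpoint and lists the events that are active in each resulting slice. It is an assumption-laden reading of the incision idea, not the definition given in Chapter 2: the function name `to_coincidence_slices`, the tuple layout, and the output format are all hypothetical.

```python
def to_coincidence_slices(intervals):
    """Cut the timeline at every endpoint and list the labels active in each slice.

    intervals: list of (label, start, end) with start < end.
    Returns a list of ((left, right), labels); slices with no active event are skipped.
    """
    cuts = sorted({t for _, start, end in intervals for t in (start, end)})
    slices = []
    for left, right in zip(cuts, cuts[1:]):
        active = tuple(sorted(
            label for label, start, end in intervals
            if start <= left and right <= end
        ))
        if active:
            slices.append(((left, right), active))
    return slices


# Two overlapping event intervals: A over [1, 5] and B over [3, 8].
print(to_coincidence_slices([("A", 1, 5), ("B", 3, 8)]))
```

For the two overlapping intervals, the sketch yields three slices: A alone, A together with B, and B alone. The relation between the intervals can then be read from the order of the slices rather than from pairwise comparisons of endpoints.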

Previous studies on mining closed sequential patterns have mainly focused on time point-based data. Little attention has been paid to mining closed temporal patterns from time interval-based data. Since processing the complex relations among intervals may require generating and examining a large number of intermediate subsequences, mining closed temporal patterns from time interval-based data is an arduous problem. In Chapter 3, we develop an efficient algorithm, CEMiner, to discover closed temporal patterns without candidate generation, based on the proposed endpoint representation. The algorithm further employs three pruning methods, pre-pruning, post-pruning and pair-pruning, to reduce the search space. The experimental studies indicate that CEMiner is efficient and scalable, and that it outperforms state-of-the-art algorithms in both running time and memory usage. Furthermore, we apply CEMiner to a real-world dataset to show the efficiency and practicability of mining time interval-based closed patterns.
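The general idea behind an endpoint-based encoding can be sketched in Python as follows. This is a minimal illustration under assumptions, not the definition from Chapter 3: the function name `to_endpoint_sequence`, the "+" and "-" suffixes for starting and finishing endpoints, and the grouping of simultaneous endpoints into one element are all choices made here for the example.

```python
from itertools import groupby


def to_endpoint_sequence(intervals):
    """Turn (label, start, end) intervals into a time-ordered sequence of endpoint groups.

    Each interval contributes a starting endpoint "label+" and a finishing
    endpoint "label-"; endpoints that share a time point form one group.
    """
    endpoints = []
    for label, start, end in intervals:
        endpoints.append((start, label + "+"))
        endpoints.append((end, label + "-"))
    endpoints.sort()  # by time, then by symbol for a deterministic order
    return [
        tuple(symbol for _, symbol in group)
        for _, group in groupby(endpoints, key=lambda item: item[0])
    ]


print(to_endpoint_sequence([("A", 1, 5), ("B", 3, 8)]))  # A overlaps B
print(to_endpoint_sequence([("A", 1, 5), ("B", 5, 8)]))  # A meets B
```

Under these assumptions, the relation between two intervals is carried by the order of their endpoints alone: an overlap gives four separate groups, while a meeting places the finishing endpoint of A and the starting endpoint of B in the same group.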

Little attention has been paid to the incremental mining of temporal patterns from time interval-based data. Since processing the complex relations among intervals may require generating and examining a large number of intermediate subsequences, maintaining temporal patterns in an interval-based database is a challenging problem. In Chapter 4, we investigate the issue of incrementally mining temporal patterns. Inc_CTMiner is proposed to balance efficiency and reusability, based on a suitable representation, the coincidence representation. The algorithm also employs two optimization techniques, sequence-reduction and slice-reduction, to further reduce the search space. The experimental results indicate that Inc_CTMiner outperforms previous algorithms designed for static databases in both execution time and memory usage. We also show that Inc_CTMiner scales gracefully. Furthermore, we apply the algorithm to a real-world dataset to show the efficiency and practicability of maintaining time interval-based patterns.

