
Chapter 5 Performance Measurement

5.2 Performance Measurement of IncSPAM

5.2.3 Different Number of Customers

IncSPAM can dynamically add a new customer to the summary data structure. In the previous experiments we fixed the number of customers to observe performance conveniently. In this experiment we test the memory usage and the average sliding time for different numbers of customers; the minimum support is again 10 customer-sequences. Figure 5-18 shows the memory usage of IncSPAM with different numbers of customers.

[Figure: memory usage (MB) versus number of customers (1000 to 5000), with curves for T = 2, T = 2.5, and T = 3]

Fig 5-18. Memory usage with different number of customers (IncSPAM)

The relationship between memory usage and the number of customers is linear, so IncSPAM can efficiently handle a great number of customers with reasonable memory. Figure 5-19 shows the average sliding time with different numbers of customers.

[Figure: average sliding time (seconds, 0 to 1.2) versus number of customers (1000 to 5000), with curves for T = 2, T = 2.5, and T = 3]

Fig 5-19. Average sliding time with different number of customers (IncSPAM)

Chapter 6 Conclusion and Future Work

Mining frequent patterns in a data stream is more complicated than in a static database. In this paper we propose two algorithms for the sliding window model in the data stream environment: New-Moment, which mines closed frequent itemsets, and IncSPAM, which mines sequential patterns.

The first algorithm, New-Moment, improves the efficiency of the Moment algorithm. New-Moment utilizes bit-vectors and a smaller lexicographic tree, New-CET, to reduce memory usage. By exploiting the characteristics of bit-vectors, New-Moment also matches Moment in execution time. The second algorithm, IncSPAM, applies the concept of a sliding window to each customer-sequence. IncSPAM uses CBASWs and a lexicographic sequence tree to maintain the sequential patterns in the current window; in each tree node, an index set speeds up support counting. IncSPAM can handle a transaction from the data stream within one second.
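The bit-vector idea behind both algorithms can be pictured with a minimal Python sketch (an illustration of the general technique, not the thesis's actual implementation): each item keeps one bit per transaction in the current window, and the support of an itemset is the popcount of the bitwise AND of its items' bit-vectors.

```python
def build_bitvectors(window_transactions):
    """Map each item to an int whose i-th bit marks presence in transaction i."""
    bv = {}
    for i, txn in enumerate(window_transactions):
        for item in txn:
            bv[item] = bv.get(item, 0) | (1 << i)
    return bv

def support(itemset, bv, window_size):
    """Support of an itemset = popcount of the AND of its items' bit-vectors."""
    acc = (1 << window_size) - 1      # start with all transactions
    for item in itemset:
        acc &= bv.get(item, 0)        # keep transactions containing every item
    return bin(acc).count("1")

txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bv = build_bitvectors(txns)
print(support({"a", "b"}, bv, len(txns)))  # 2 (transactions 0 and 2)
```

One AND per item replaces a scan over the raw transactions, which is why candidate generation and counting stay cheap.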

6.1 Conclusion of New-Moment

New-Moment reduces memory usage by maintaining only the bit-vectors of 1-itemsets and the closed frequent itemsets in New-CET. In the test of different minimum supports, New-Moment outperforms Moment in memory usage by about 100 MB, and the gap widens as the minimum support becomes lower. Due to the efficiency of bit-vectors in window sliding and in the generation of itemset candidates, New-Moment is faster than Moment in the loading time of the first window. Despite maintaining a smaller lexicographic tree, New-Moment still has almost the same performance as Moment in the execution time of window sliding. In the test of different window sizes, New-Moment again outperforms Moment in both memory usage and running time. In the test of different numbers of items, Moment even runs out of its memory bound, while New-Moment remains efficient in memory usage and execution time.
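The efficiency of window sliding with bit-vectors can be sketched as follows (an illustrative Python fragment under assumed conventions, with bit 0 as the oldest transaction, not the thesis's actual code): sliding by one transaction costs a single shift per item plus setting the newest bit.

```python
WINDOW = 4                        # window size in transactions (illustrative)
bv = {"a": 0b0111, "b": 0b1101}   # bit i set = item occurs in transaction i

def slide(bv, new_txn, window_size):
    """Advance the window by one transaction: bit 0 (the oldest
    transaction) falls off, and the newest bit is set for the items
    of the incoming transaction."""
    top = 1 << (window_size - 1)
    for item in list(bv):
        bv[item] >>= 1            # oldest transaction drops out
        if bv[item] == 0:
            del bv[item]          # item left the window entirely
    for item in new_txn:
        bv[item] = bv.get(item, 0) | top

slide(bv, {"c"}, WINDOW)
print(bv)  # {'a': 3, 'b': 6, 'c': 8}
```

No per-transaction data structure is rebuilt; this constant-time-per-item update is what keeps the sliding cost close to Moment's despite the smaller tree.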

6.2 Conclusion of IncSPAM

We test the memory usage and the execution time of handling a transaction in IncSPAM with different minimum supports, different window sizes, and different numbers of customers. In the test of different minimum supports, IncSPAM uses about 300 MB of memory, and the memory usage increases as the minimum support becomes lower. We prove that the lexicographic sequence tree of IncSPAM does not produce redundant tree nodes. The handling time of a transaction is below one second, so IncSPAM can be applied in the data stream environment. In the test of different window sizes, both the memory usage and the transaction-handling time of IncSPAM increase as the window size grows, yet IncSPAM remains efficient for window sizes from 10 to 25. In the test of different numbers of customers, the memory usage and the transaction-handling time are linearly related to the number of customers, which means IncSPAM can perform well even when the number of customers becomes large.
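The per-customer sliding idea can be pictured with a small Python sketch (the names and structure here are illustrative assumptions, not IncSPAM's actual CBASW layout): each customer keeps a bounded window of recent itemsets, and a sequential pattern's support is the number of customers whose window contains it as a subsequence.

```python
from collections import deque

def contains_subsequence(window, pattern):
    """True if pattern (a list of itemsets) occurs in order within window."""
    it = iter(window)
    return all(any(p <= txn for txn in it) for p in pattern)

class CustomerWindows:
    def __init__(self, window_size):
        self.size = window_size
        self.windows = {}             # customer id -> deque of recent itemsets

    def add(self, customer, itemset):
        w = self.windows.setdefault(customer, deque(maxlen=self.size))
        w.append(frozenset(itemset))  # oldest itemset drops automatically

    def support(self, pattern):
        pat = [frozenset(p) for p in pattern]
        return sum(contains_subsequence(w, pat) for w in self.windows.values())

cw = CustomerWindows(window_size=3)
cw.add(1, {"a"}); cw.add(1, {"b", "c"})
cw.add(2, {"a"}); cw.add(2, {"c"})
print(cw.support([{"a"}, {"c"}]))  # 2: both customers contain <(a)(c)>
```

IncSPAM avoids this naive rescanning by keeping an index set in each tree node, but the window-per-customer bookkeeping above is the underlying model.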

6.3 Future Work

The concept of the sliding window in this paper is based on transactions as units. In some applications the unit of a window may be a time point, so the number of transactions in each window is no longer fixed.

[Figure: a data stream after the system starts, divided into time intervals 1 to N; the window covers the latest N intervals, and each interval holds a variable number of transactions]

Fig 6-1. The sliding window model in time units

In this sliding window model, the system keeps the transactions of the latest N time intervals. A time interval may be one day, one week, or one month, and each interval contains a different number of transactions. In Figure 6-1, there are two transactions in the first time interval but three in the second. Since the number of transactions per time interval is variable, bit-vectors that store a fixed number of transactions are difficult to apply. Mining such complicated and flexible patterns in a data stream is a great challenge.
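One possible design for such a time-based window (an assumption sketched in Python, not a method from this thesis) replaces the fixed-width bit-vector with one bucket of transactions per interval, evicting whole intervals once more than N are held.

```python
from collections import deque

class TimeWindow:
    """Keep the transactions of the latest N time intervals."""

    def __init__(self, n_intervals):
        self.buckets = deque(maxlen=n_intervals)  # each bucket: list of txns

    def new_interval(self):
        self.buckets.append([])       # a full deque evicts the oldest interval

    def add_transaction(self, txn):
        self.buckets[-1].append(txn)  # requires at least one open interval

    def item_support(self, item):
        return sum(item in txn for bucket in self.buckets for txn in bucket)

w = TimeWindow(n_intervals=2)
w.new_interval(); w.add_transaction({"a"}); w.add_transaction({"a", "b"})
w.new_interval(); w.add_transaction({"b"})
w.new_interval(); w.add_transaction({"a"})  # first interval is evicted
print(w.item_support("a"))  # 1 (only the newest interval still holds "a")
```

The cost of this flexibility is that supports must be recounted (or incrementally adjusted) per bucket rather than with one bitwise AND, which is exactly the open challenge noted above.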
