Chapter 6 Experimental Results
6.2 Streaming Data
6.2.3 The Accuracy Comparison
Finally, we evaluated the quality of the frequent itemsets and the indirect association patterns by comparing the results obtained by EMIA and EMIA-LM on synthetic and real datasets under different block sizes and mediator support thresholds.
Since MIA-LM and EMIA-LM use the same pruning technique, we evaluate the quality of EMIA-LM only.
In this experiment, we used both synthetic and real datasets with different mediator support thresholds and different error thresholds (= 1/block size). The itempair support threshold (σs) was set equal to the mediator support threshold (σf), and the dependence threshold (σd) was set to 0.1. We analyzed the quality from two aspects: accuracy and recall. The results generated by EMIA were used as the baseline for computing the accuracy.
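The relation between the error threshold and the block size stated above can be illustrated with a short sketch. The helper names below are hypothetical and not taken from the thesis; the only fact used is error = 1/block size, so the error of 0.00004 reported for BMS-POS would correspond to a block of 25000 transactions.

```python
def error_threshold(block_size: int) -> float:
    """Error threshold implied by a block size (error = 1 / block size)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return 1.0 / block_size


def block_size_for(error: float) -> int:
    """Block size implied by an error threshold, rounded to the nearest integer."""
    if not 0.0 < error <= 1.0:
        raise ValueError("error must lie in (0, 1]")
    return round(1.0 / error)


# Error threshold quoted for BMS-POS in Table 6-2.
print(block_size_for(0.00004))
print(error_threshold(25000))
```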
We adopt the following formula for calculating the accuracy:
(4) The second term in (4) denotes the average error rate of itemset frequencies.
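As an illustration of the second term, the following sketch computes an average error rate of itemset frequencies against the EMIA baseline. It assumes that the error of each itemset is measured relative to its baseline support; the function name and the dictionary layout are hypothetical, and the normalization used in (4) may differ.

```python
def average_error_rate(baseline: dict, approx: dict) -> float:
    """Mean relative support error over the itemsets in the baseline.

    baseline: itemset -> support reported by the exact run (assumed: EMIA)
    approx:   itemset -> support reported by the approximate run (assumed: EMIA-LM)
    An itemset missing from `approx` is treated as having support 0.
    """
    if not baseline:
        return 0.0
    total = 0.0
    for itemset, true_support in baseline.items():
        estimate = approx.get(itemset, 0.0)
        total += abs(true_support - estimate) / true_support
    return total / len(baseline)


# Toy supports, invented for illustration only.
exact = {frozenset({"a"}): 0.50, frozenset({"a", "b"}): 0.20}
estimated = {frozenset({"a"}): 0.50, frozenset({"a", "b"}): 0.19}
print(average_error_rate(exact, estimated))
```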
From Figure 6-9, we can observe that the frequent itemsets discovered by EMIA-LM are exactly the same as those discovered by EMIA. This is consistent with Theorem 1 in Section 5.4.1, i.e., EMIA-LM generates no false negative frequent itemsets.
Figure 6-9 Accuracy comparison of frequent itemsets derived by EMIA-LM (σd = 0.1) and EMIA, panels (a) to (d).
From Figure 6-9 and Table 6-2, we can observe that EMIA-LM uses less memory than EMIA under the same mediator support threshold on the same dataset while still achieving very high accuracy.
This indicates that the pruning approach pruned only infrequent itemsets. It also explains why EMIA-LM is faster than EMIA when the mediator support threshold is 0.005 (see Table 6-2): many infrequent itemsets are pruned, which reduces the execution time.
Table 6-2 Execution time and memory usage comparison for dataset BMS-POS (error = 0.00004 for EMIA-LM)
Mediator support threshold      0.005     0.05
Execution time (sec)
  EMIA                          783.6     21.1
  EMIA-LM                       740.8     40.4
Memory usage (MB)
  EMIA                          164.7     145.5
  EMIA-LM                       123.9     104.7
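To make the comparison in Table 6-2 easier to read, the following sketch computes the relative change of EMIA-LM with respect to EMIA for each metric and threshold. The numbers are copied from the table; the dictionary layout and the function name are hypothetical. A negative value means EMIA-LM is lower (better) than EMIA.

```python
# Values copied from Table 6-2 (BMS-POS); keys are mediator support thresholds.
TABLE_6_2 = {
    "execution_time_sec": {
        "EMIA": {0.005: 783.6, 0.05: 21.1},
        "EMIA-LM": {0.005: 740.8, 0.05: 40.4},
    },
    "memory_usage_mb": {
        "EMIA": {0.005: 164.7, 0.05: 145.5},
        "EMIA-LM": {0.005: 123.9, 0.05: 104.7},
    },
}


def relative_change(metric: str, threshold: float) -> float:
    """(EMIA-LM - EMIA) / EMIA for the given metric and threshold."""
    base = TABLE_6_2[metric]["EMIA"][threshold]
    other = TABLE_6_2[metric]["EMIA-LM"][threshold]
    return (other - base) / base


for metric in TABLE_6_2:
    for threshold in (0.005, 0.05):
        print(f"{metric} @ {threshold}: {relative_change(metric, threshold):+.1%}")
```

With these figures, memory usage is lower for EMIA-LM at both thresholds, while execution time is lower only at the threshold of 0.005.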
Finally, we examined what percentage of the indirect association patterns are missed.
We use the well-known recall as the measure.
(5)
The results are shown in Figure 6-10. Notably, no indirect association pattern is missed.
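Recall with EMIA as the baseline can be sketched as follows. This uses the standard definition of recall (the fraction of baseline patterns that are also found by the approximate run); the representation of a pattern as an (item pair, mediator) tuple and the function name are assumptions made for illustration.

```python
def recall(baseline_patterns, found_patterns) -> float:
    """Fraction of baseline patterns that also appear in found_patterns."""
    baseline = set(baseline_patterns)
    if not baseline:
        return 1.0  # nothing to miss
    return len(baseline & set(found_patterns)) / len(baseline)


# Toy patterns, invented for illustration: (item pair, mediator set).
emia = {
    (frozenset({"a", "b"}), frozenset({"m"})),
    (frozenset({"c", "d"}), frozenset({"m", "n"})),
}
emia_lm = set(emia)  # the approximate run found every baseline pattern

print(recall(emia, emia_lm))  # 1.0 means no pattern is missed
```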
Figure 6-10 Comparison results of derived indirect association patterns by using EMIA-LM (σd = 0.1) and EMIA, panels (a) to (d).
Chapter 7 Conclusions and Future Work
7.1 Conclusions
In this thesis, we have proposed three indirect association mining algorithms: EMIA for static data, and MIA-LM and EMIA-LM for streaming data.
Inspired by the success of HI-mine*, the EMIA algorithm adopts a more compact data structure and a simpler but more efficient way of generating indirect associations. The EMIA algorithm consists of three phases: compact transaction table construction, MSSs and IIS construction, and indirect association generation. In the compact transaction table construction phase, the algorithm does not need to reorder items according to the itemset frequency and uses a hash table to speed up the compression of the transaction data. In the MSSs and IIS construction phase, the algorithm adds pointer links to reduce the number of comparisons and thus the execution time. Finally, the indirect association generation phase uses the same concept as HI-mine* and can therefore quickly find indirect associations from the IIS and MSSs.
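The idea of compressing transactions with a hash table can be sketched as below. This is a minimal illustration, assuming that compression means merging identical transactions into one entry with an occurrence count; the actual table layout used by EMIA is not described in this chapter, and the function names are hypothetical.

```python
from collections import Counter


def build_compact_table(transactions) -> Counter:
    """Merge identical transactions into one hashed entry with a count.

    Items are stored as a frozenset, so no item reordering is needed
    for two transactions with the same items to hit the same key.
    """
    table = Counter()
    for transaction in transactions:
        table[frozenset(transaction)] += 1
    return table


def support_count(table: Counter, itemset) -> int:
    """Number of original transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(count for entry, count in table.items() if items <= entry)


# Toy data, invented for illustration.
data = [["a", "b"], ["b", "a"], ["a", "c"], ["a", "b"]]
table = build_compact_table(data)
print(len(table), support_count(table, ["a", "b"]))
```

Four transactions collapse to two table entries here, and support counts are recovered by summing the stored occurrence counts.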
To the best of our knowledge, no research work has been conducted on mining indirect associations over data streams. In this thesis, we have proposed two algorithms, MIA-LM and EMIA-LM. Both algorithms adopt data compression or projection techniques and thus require only one scan of the data, and both use data pruning techniques to reduce memory usage while sustaining very high accuracy. The MIA-LM algorithm hybridizes the data structures used in DSM-FI and HI-mine, resulting in an efficient and effective algorithmic design. The EMIA-LM algorithm is modified from EMIA to fit the streaming data environment. Experimental results on both synthetic and real-world datasets have shown that EMIA-LM is superior to MIA-LM in both performance and memory usage. We have also conducted theoretical analyses of the effectiveness of MIA-LM and EMIA-LM. We have proved that: (1) the support error of the generated frequent itemsets is smaller than the user-specified error threshold; and (2) if the mediator dependence threshold is set smaller than a specific value (= f ), then all possible indirect association patterns can be generated.
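A one-scan counting scheme with a bounded support error can be sketched with Lossy Counting, a well-known technique in which the block (bucket) size is 1/error. Whether the pruning in MIA-LM and EMIA-LM matches this scheme exactly is an assumption; the sketch counts single items rather than itemsets, and the class and method names are hypothetical.

```python
import math


class LossyCounter:
    """Lossy Counting over single items (illustrative sketch only)."""

    def __init__(self, error: float):
        if not 0.0 < error < 1.0:
            raise ValueError("error must lie in (0, 1)")
        self.error = error
        self.block_size = math.ceil(1.0 / error)
        self.n = 0          # number of elements seen so far
        self.entries = {}   # item -> [count, maximum possible undercount]

    def add(self, item) -> None:
        self.n += 1
        block_id = math.ceil(self.n / self.block_size)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, block_id - 1]
        # At each block boundary, prune entries that cannot be frequent.
        if self.n % self.block_size == 0:
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > block_id
            }

    def frequent(self, min_support: float) -> dict:
        """Items whose stored count is at least (min_support - error) * n."""
        threshold = (min_support - self.error) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(error=0.25)  # block size 4
for item in ["a", "b", "a", "c", "a", "d", "a", "e"]:
    counter.add(item)
print(counter.frequent(min_support=0.5))
```

In this scheme a stored count never exceeds the true count and undercounts it by at most error times the number of elements seen, so no item with true support at or above min_support is dropped.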
7.2 Future Work
The study of mining indirect associations from streaming data is in its infancy, and many research issues are worthy of further investigation. First, we will continue to improve the efficiency of the proposed algorithms, seeking more effective ways to reduce memory usage. We will also extend the proposed algorithms to other window models, such as the time-fading model and the sliding-window model. Recently, the design of adaptive data stream mining methods that can perform under constrained resources has emerged as an important and challenging research issue in the data mining community. In the future, we will study how to incorporate adaptive techniques such as load shedding into our approach, especially when resources such as CPU computing power or memory size are very limited, without sacrificing too much of the quality of the discovered indirect associations.