Chapter 5 Indirect Associations Mining from Streaming Data
5.4 Theoretical Analyses
5.4.2 Estimated Dependence Condition Analysis
Recall that an itempair {x, y} is indirectly associated via a mediator M, denoted {x, y} | M, if it satisfies three conditions. One of them is the mediator dependence condition, which guarantees a high dependence between the item x and the mediator M. We will show that if the mediator dependence threshold is set smaller than a specific value that depends on the error threshold ε, then all possible indirect association patterns can be generated.
Theorem 2
Proof: According to Theorem 1, the estimated supports of itemset X, mediator M and itemset (XM) lie in the intervals Tsup(X) − ε ≤ Esup(X) ≤ Tsup(X), Tsup(M) − ε ≤ Esup(M) ≤ Tsup(M) and Tsup(XM) − ε ≤ Esup(XM) ≤ Tsup(XM), respectively. Using these three intervals, the range of the estimated mediator dependence can be derived as follows:
(3)
If itemset X and mediator M can form an indirect association, then by Definition 1 the true supports of itemset X, mediator M and itemset (XM) are all at least the mediator support threshold σf, and the maximal value of the IS measure is 1. Moreover, since we seek a lower bound on the estimated mediator dependence, the denominator in (3) can be set to its maximum value, 1. We conclude that if the mediator dependence threshold σd is set smaller than (σf − ε), then all possible indirect association patterns will be generated by the proposed approach.
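The bound can be checked numerically. The following minimal Python sketch assumes that the dependence measure is the IS measure, IS(X, M) = sup(XM) / sqrt(sup(X) · sup(M)), and that each estimated support undercounts the true support by at most ε, as stated by Theorem 1. The function names and the sample supports are hypothetical and chosen only for illustration.

```python
import math


def is_measure(sup_xm, sup_x, sup_m):
    """IS measure of X and M, computed from relative supports in (0, 1]."""
    return sup_xm / math.sqrt(sup_x * sup_m)


def worst_case_estimated_dependence(tsup_xm, tsup_x, tsup_m, eps):
    """Lower bound on the estimated dependence (assumed worst case):
    the joint support is undercounted by eps, while the supports of
    X and M in the denominator are not undercounted at all."""
    return is_measure(tsup_xm - eps, tsup_x, tsup_m)


# Hypothetical true supports, all at least sigma_f.
sigma_f, eps = 0.05, 0.001
tsup_xm, tsup_x, tsup_m = 0.06, 0.20, 0.30

lower = worst_case_estimated_dependence(tsup_xm, tsup_x, tsup_m, eps)

# The denominator is at most 1 and the numerator is at least sigma_f - eps,
# so the estimated dependence never falls below sigma_f - eps.
assert lower >= sigma_f - eps
print(lower)
```

Under these assumptions, any dependence threshold σd below σf − ε cannot reject a pattern whose true supports meet the mediator support threshold.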
Chapter 6
Experimental Results
In this chapter, experiments on several datasets (shown in Table 6-1) were conducted to show the effectiveness and efficiency of the proposed EMIA, MIA-LM and EMIA-LM algorithms.
Table 6-1 Dataset characteristics
Database    Type    Items    Transactions    Maximum Transaction Size
The synthetic datasets, T10I5N0.25KD20K and T20I10N0.05KD1000K, were generated using a transaction data generator obtained from IBM Almaden [12]. The BMS-POS dataset contains several years of point-of-sale data from a large electronics retailer; each item represents a category rather than an individual product. The BMS-WebView-2 dataset contains several months of clickstream data from an e-commerce web site; each transaction in this dataset is a web session consisting of all the product detail pages viewed in that session. Both the BMS-POS and BMS-WebView-2 datasets were used in the KDD-Cup 2000 competition [17]. All experiments were implemented in C# and conducted on an HP ProLiant DL380 G6 with an Intel Xeon E5530 2.40 GHz processor and 6 GB of RAM.
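The names of the synthetic datasets encode their generation parameters. The sketch below decodes such a name, assuming the usual convention of the IBM generator: T is the average transaction size, I the average size of the maximal potentially frequent itemsets, N the number of items (in thousands) and D the number of transactions (in thousands). The function name and dictionary keys are hypothetical.

```python
import re


def parse_synthetic_name(name):
    """Decode a dataset name such as 'T10I5N0.25KD20K' (assumed convention)."""
    match = re.fullmatch(r"T(\d+)I(\d+)N([\d.]+)KD(\d+)K", name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name}")
    t, i, n, d = match.groups()
    return {
        "avg_transaction_size": int(t),
        "avg_pattern_size": int(i),
        "items": round(float(n) * 1000),   # N is given in thousands
        "transactions": int(d) * 1000,     # D is given in thousands
    }


print(parse_synthetic_name("T10I5N0.25KD20K"))
```

Read this way, T10I5N0.25KD20K has 250 items and 20,000 transactions.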
6.1 Static Data
In this section, we evaluate the proposed algorithm for mining indirect associations from static data. We compared the proposed approach, EMIA, with two algorithms, INDIRECT and HI-mine*, on both synthetic and real datasets over different mediator support thresholds. In our experiments, the itempair support threshold (σs) was set equal to the mediator support threshold (σf), and the dependence threshold (σd) was set to 0.1. Performance was evaluated in terms of execution time and memory usage.
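To make the roles of the three thresholds concrete, the following Python sketch checks the three conditions for a single candidate {x, y} | M by brute force. It assumes the standard definition of indirect association from [23] with the IS measure as the dependence measure. It is not the EMIA algorithm, and all function names and the toy transactions are hypothetical.

```python
import math


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def dependence(a, b, transactions):
    """IS measure between itemsets a and b (0.0 if either never occurs)."""
    denom = math.sqrt(support(a, transactions) * support(b, transactions))
    if denom == 0:
        return 0.0
    return support(set(a) | set(b), transactions) / denom


def is_indirect_association(x, y, mediator, transactions,
                            sigma_s, sigma_f, sigma_d):
    m = set(mediator)
    # Itempair support condition: x and y rarely occur together.
    itempair_ok = support({x, y}, transactions) < sigma_s
    # Mediator support condition: each item occurs often with M.
    mediator_ok = all(support({item} | m, transactions) >= sigma_f
                      for item in (x, y))
    # Mediator dependence condition: each item depends strongly on M.
    dependence_ok = all(dependence({item}, m, transactions) >= sigma_d
                        for item in (x, y))
    return itempair_ok and mediator_ok and dependence_ok


# Toy data: x and y never co-occur, but both co-occur with m.
transactions = [{"x", "m"}] * 4 + [{"y", "m"}] * 4 + [{"z"}] * 2
print(is_indirect_association("x", "y", {"m"}, transactions, 0.1, 0.1, 0.1))
```

With σs = σf, as in the experiments, a pair qualifies only if it is itself infrequent while each of its items is frequent together with the mediator.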
6.1.1 Evaluation on Synthetic Dataset
The performance curves for the datasets T10I5N0.25KD20K and T10I5N0.02KD1000K are depicted in Figure 6-1. The execution times of all algorithms decrease as the mediator support threshold (σf) increases. EMIA is significantly faster than INDIRECT and also faster than HI-mine*, especially when the mediator support threshold is small. In addition, INDIRECT is faster than HI-mine* when the mediator support threshold (σf) is greater than 0.03.
Figure 6-1 Execution time comparison on synthetic dataset, panels (a) and (b).
The memory usage comparison is shown in Figure 6-2. On the smaller dataset (T10I5N0.25KD20K), the INDIRECT algorithm consumes the least memory, but on the larger dataset (T10I5N0.02KD1000K) EMIA consumes the least. There are two possible reasons. First, INDIRECT does not need to keep all transactions in memory. Although EMIA and HI-mine* adopt compression techniques, they still need more memory than INDIRECT to store the compressed transactions. Second, for EMIA and HI-mine*, the dataset T10I5N0.25KD20K is composed of 250 items, resulting in a relatively small ratio of similar transactions, so the transactions are not easy to compress. On the other hand, T10I5N0.02KD1000K contains only 20 items, so it is easier to compress and less memory is required. In addition, EMIA needs less memory than HI-mine* because EMIA adopts a relatively simple and efficient method in the projection phase.
Figure 6-2 Memory usage comparison on synthetic dataset, panels (a) and (b).
6.1.2 Evaluation on Real Dataset
In this experiment, we evaluated the algorithms using the real datasets. The execution time and memory usage comparisons are depicted in Figures 6-3 and 6-4, respectively. In summary, EMIA is the most efficient algorithm; it consumes approximately the same amount of memory as HI-mine* but more than INDIRECT.
Figure 6-3 Execution time comparison on real dataset, panels (a) and (b).
Figure 6-4 Memory usage comparison on real dataset, panels (a) and (b).
6.2 Streaming Data
For streaming data, we compare three algorithms, EMIA-LM, HI-mine* and MIA-LM, using both synthetic and real datasets over different mediator support thresholds. The experimental results on the real datasets, shown in Figure 6-7, are analogous to those on the synthetic datasets.
The itempair support threshold (σs) is set equal to the mediator support threshold (σf), the dependence threshold (σd) is set to 0.1, and the error rate differs for each dataset. The three algorithms are compared in terms of execution time, memory usage, and accuracy.
6.2.1 Evaluation on Synthetic Dataset
Figure 6-5 shows the execution times of the three algorithms on the synthetic datasets. Noticeably, EMIA-LM outperforms MIA-LM and HI-mine*. MIA-LM performs better than HI-mine* at smaller mediator support thresholds. Its performance is stable when the support threshold is greater than a certain value (0.009 in these experiments). This is because MIA-LM spends a lot of computation building the initial IS-tree, whose size grows as the support threshold becomes smaller.
Figure 6-6 shows the memory usage of the three algorithms on the synthetic datasets. Note that EMIA-LM uses the same data structures as EMIA, whose memory usage is highly affected by the number of items in the dataset. Because T10I5N0.25KD20K contains 250 items while T10I5N0.02KD1000K has only 20, the compression ratio for T10I5N0.25KD20K is smaller than that for T10I5N0.02KD1000K. The memory consumed by MIA-LM, on the other hand, depends heavily on the number of items in the database. That is why MIA-LM consumes more memory than HI-mine* and EMIA-LM for T10I5N0.25KD20K but less for T10I5N0.02KD1000K.
Figure 6-5 Execution time comparison on synthetic dataset, panels (a) and (b).
Figure 6-6 Memory usage comparison on synthetic dataset, panels (a) and (b).
6.2.2 Evaluation on Real Dataset
Next, we compare the three algorithms on two real datasets, BMS-WebView-2 and BMS-POS. The execution time comparison is shown in Figure 6-7. Again, EMIA-LM is far superior to HI-mine* and MIA-LM, outperforming both by an order of magnitude. MIA-LM is inferior to HI-mine*. The reason is much the same as in the experiments on synthetic datasets: MIA-LM spends a lot of computation building the initial IS-tree, especially on the real datasets, which are composed of thousands of items.
Figure 6-7 Execution time comparison on real datasets, panels (a) and (b).
The memory usage comparison is depicted in Figure 6-8. Since the two datasets are composed of thousands of items, the ISFI-forest data structure used by the MIA-LM algorithm requires a very large amount of memory, especially for the BMS-POS dataset, to store all item-suffixes of each transaction. EMIA-LM and HI-mine* consume approximately the same amount of memory.
Figure 6-8 Memory usage comparison on real datasets, panels (a) and (b).
6.2.3 The Accuracy Comparison
Finally, experiments were conducted to evaluate the quality of the frequent itemsets and the indirect association patterns by comparing the results obtained by EMIA and EMIA-LM on synthetic and real datasets with different block sizes and various mediator support thresholds.
Since MIA-LM and EMIA-LM use the same pruning technique, we only evaluate the quality of EMIA-LM.
In this experiment, we used both synthetic and real datasets with different mediator support thresholds and different error thresholds (= 1/block size). The itempair support threshold (σs) was set equal to the mediator support threshold (σf), and the dependence threshold (σd) was set to 0.1. We analyzed the accuracy from two aspects: the accuracy rate and the recall. The results generated by EMIA were used as the baseline for computing the accuracy.
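The relation between the error threshold and the block size resembles the bucket scheme of the Lossy Counting algorithm [21]. The sketch below is a simplified, single-item version of that classic algorithm, given only to illustrate how a block width of 1/ε bounds the support error. It is not the EMIA-LM implementation, and the function name and toy stream are hypothetical.

```python
import math
from collections import Counter


def lossy_count(stream, eps):
    """Approximate item counts with undercount at most eps * n."""
    width = math.ceil(1 / eps)          # block (bucket) size = 1 / error
    entries = {}                        # item -> [count, max_undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current block
        if item in entries:
            entries[item][0] += 1
        else:
            # A new item may have been missed in all earlier blocks.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Block boundary: prune entries that cannot be frequent.
            stale = [k for k, (f, d) in entries.items() if f + d <= bucket]
            for key in stale:
                del entries[key]
    return {k: f for k, (f, _) in entries.items()}, n


# Toy stream: three recurring items followed by 20 one-off items.
stream = ["a", "b", "a", "c"] * 50 + ["r%d" % i for i in range(20)]
estimated, n = lossy_count(stream, eps=0.05)
true_counts = Counter(stream)
for item, true_count in true_counts.items():
    est = estimated.get(item, 0)
    assert true_count - 0.05 * n <= est <= true_count
```

The final loop checks the same kind of guarantee as Theorem 1: an estimated count never exceeds the true count and falls short of it by at most ε times the stream length.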
We adopt the following formula for calculating the accuracy:
(4)
The second term in (4) denotes the average error rate of itemset frequencies.
From Figure 6-9, we can observe that the frequent itemsets discovered by EMIA-LM are exactly the same as those discovered by EMIA. This is consistent with Theorem 1 in Section 5.4.1, i.e., EMIA-LM produces no false negatives among the frequent itemsets.
Figure 6-9 Accuracy comparison of frequent itemsets derived by EMIA-LM (σd = 0.1) and EMIA, panels (a) to (d).
From Figure 6-9 and Table 6-2, we can see that EMIA-LM uses less memory than EMIA for the same mediator support threshold on the same dataset while still achieving very high accuracy.
This indicates that the pruning approach pruned only infrequent itemsets. It also explains why EMIA-LM is faster than EMIA when the mediator support threshold is 0.005 (Table 6-2): many infrequent itemsets are pruned, which reduces the execution time.
Table 6-2 Execution time and memory usage comparison for dataset BMS-POS (error = 0.00004 for EMIA-LM)

BMS-POS                   Mediator Support Threshold
                          0.005      0.05
Execution Time (sec)
  EMIA                    783.6      21.1
  EMIA-LM                 740.8      40.4
Memory Usage (MB)
  EMIA                    164.7      145.5
  EMIA-LM                 123.9      104.7
Finally, we examined what percentage of the indirect association patterns is missed, using the well-known recall measure.
(5)
The results are shown in Figure 6-10. Notably, no indirect association is missed.
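As an illustration of the measure, the following Python sketch computes recall in its usual sense: the fraction of baseline patterns (here, those found by EMIA) that are also found by the approximate method (EMIA-LM). The pattern representation and function name are hypothetical, and the sketch is not claimed to reproduce the exact form of (5).

```python
def recall(baseline_patterns, found_patterns):
    """Fraction of baseline patterns that also appear in found_patterns."""
    baseline = set(baseline_patterns)
    if not baseline:
        return 1.0                      # nothing to miss
    return len(baseline & set(found_patterns)) / len(baseline)


# Hypothetical patterns, each encoded as (itempair, mediator).
emia = {
    (frozenset({"x", "y"}), frozenset({"m"})),
    (frozenset({"a", "b"}), frozenset({"c", "d"})),
}
emia_lm = set(emia)                     # the streaming run found the same set
print(recall(emia, emia_lm))
```

A recall of 1.0 corresponds to the situation reported here, in which no indirect association is missed.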
Figure 6-10 Comparison of the indirect association patterns derived by EMIA-LM (σd = 0.1) and EMIA, panels (a) to (d).
Chapter 7
Conclusions and Future Work
7.1 Conclusions
In this thesis, we have proposed three indirect association mining algorithms, EMIA for static data, and MIA-LM and EMIA-LM for streaming data.
Inspired by the success of HI-mine*, the EMIA algorithm adopts a more compact data structure and a simpler but more efficient way of generating indirect associations. The EMIA algorithm consists of three phases: compact transaction table construction, MSSs and IIS construction, and indirect association generation. In the compact transaction table construction phase, the algorithm does not need to reorder items according to their frequency and uses a hash table to speed up transaction compression. In the MSSs and IIS construction phase, the algorithm adds pointer links to reduce the number of comparisons and thus save execution time. Finally, the indirect association generation phase uses the same concept as HI-mine* and can therefore quickly find indirect associations from the IIS and MSSs.
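The idea behind the first phase can be sketched as follows. This minimal Python example assumes that compression means merging identical transactions into one hash-table entry with an occurrence count; the actual table layout used by EMIA may differ, and the function names are hypothetical.

```python
from collections import Counter


def build_compact_table(transactions):
    """Merge identical transactions; the hash key ignores item order."""
    table = Counter()
    for transaction in transactions:
        table[frozenset(transaction)] += 1
    return table


def support_count(itemset, table):
    """Support count of itemset, computed from the compact table only."""
    items = frozenset(itemset)
    return sum(count for t, count in table.items() if items <= t)


transactions = [["a", "b"], ["b", "a"], ["a", "c"], ["a", "b"]]
table = build_compact_table(transactions)
print(len(table), support_count({"a", "b"}, table))
```

Because the key is an unordered set, no item reordering is needed before hashing, and a dataset with few distinct items yields many identical transactions and therefore a smaller table.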
To the best of our knowledge, no research work has been conducted on mining indirect associations over data streams. In this thesis, we have proposed two algorithms for this task, MIA-LM and EMIA-LM. Both algorithms adopt data compression or projection techniques and thus require only one data scan, and both use data pruning techniques to reduce memory usage while sustaining very high accuracy. The MIA-LM algorithm hybridizes the data structures used in DSM-FI and HI-mine, resulting in an efficient and effective algorithmic design. The EMIA-LM algorithm is modified from EMIA to fit the streaming data environment. Experimental results on both synthetic and real-world datasets have shown that EMIA-LM is superior to MIA-LM in both performance and memory usage. We have also conducted theoretical analyses of the effectiveness of MIA-LM and EMIA-LM. We have proved that (1) the support error of the generated frequent itemsets is smaller than the user-specified error threshold ε; and (2) if the mediator dependence threshold is set smaller than a specific value (= σf − ε), then all possible indirect association patterns can be generated.
7.2 Future Work
The study of mining indirect associations from streaming data is in its infancy, and many research issues are worthy of further investigation. First, we will continue to improve the efficiency of the proposed algorithms, seeking more effective ways of reducing memory usage. We will also extend the proposed algorithms to other window models, such as the time-fading model and the sliding-window model. Recently, the design of data stream mining methods that can adapt to constrained resources has emerged as an important and challenging research issue in the data mining community. In the future, we will study how to incorporate adaptive techniques such as load shedding into our approach, especially when resources such as CPU power or memory are very limited, without sacrificing too much of the quality of the discovered indirect associations.
References
[1] C. C. Aggarwal, Data Streams: Models and Algorithms, New York: Springer, 2007.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. of the 20th Intl. Conf. on Very Large Data Bases, pp. 487-499, 1994.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems," in Proc. of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 1-16, 2002.
[4] J. H. Chang and W. S. Lee, "estWin: Online data stream mining of recent frequent itemsets by sliding window method," Journal of Information Science, vol. 31, no. 2, pp. 76-90, 2005.
[5] J. H. Chang and W. S. Lee, "Finding recent frequent itemsets adaptively over online data streams," in Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 487-492, 2003.
[6] L. Chen, S. Bhowmick, and J. Li, "Mining temporal indirect associations," in Proc. of the 10th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 425-434, 2006.
[7] C. Cornells, Y. Peng, Z. Xing, and C. Guoqing, "Mining positive and negative association rules from large databases," in Proc. of IEEE Conf. on Cybernetics and Intelligent Systems, pp. 1-6, 2006.
[8] L. Daesu and L. Wonsuk, "Finding maximal frequent itemsets over online data streams adaptively," in Proc. of the 5th IEEE Intl. Conf. on Data Mining, pp., 2005.
[9] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Towards an adaptive approach for mining data streams in resource constrained environments," in Proc. of the 6th Intl. Conf. on Data Warehousing and Knowledge Discovery, pp. 189-198, 2004.
[10] L. Golab and M. T. Özsu, "Issues in data stream management," SIGMOD Record, vol. 32, no. 2, pp. 5-14, 2003.
[11] K. Gouda and M. J. Zaki, "Efficiently mining maximal frequent itemsets," in Proc. of the 1st Intl. Conf. on Data Mining, pp. 163-170, 2001.
[12] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pp. 1-12, 2000.
[13] N. Jiang and L. Gruenwald, "CFI-Stream: mining closed frequent itemsets in data streams," in Proc. of the 12th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 592-597, 2006.
[14] R. Jin and G. Agrawal, "An algorithm for in-core frequent itemset mining on streaming data," in Proc. of the 5th IEEE Intl. Conf. on Data Mining, pp. 210-217, 2005.
[15] P. Kazienko, "IDARM — Mining of indirect association rules," ed, 2005, pp. 77-86.
[16] P. Kazienko and K. Kuzminska, "The influence of indirect association rules on recommendation ranking lists," in Proc. of the 5th Intl. Conf. on Intelligent Systems Design and Applications, pp. 482-487, 2005.
[17] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason, and Z. Zheng, "KDD-Cup 2000 organizers' report: peeling the onion," SIGKDD Explorations Newsletter, vol. 2, pp. 86-93, 2000.
[18] H.-F. Li, C.-C. Ho, F.-F. Kuo, and S.-Y. Lee, "A new algorithm for maintaining closed frequent itemsets in data streams by incremental updates," in Proc. of the 6th IEEE Intl. Conf. on Data Mining - Workshops, pp. 672-676, 2006.
[19] H. F. Li, S. Y. Lee, and M. K. Shan, "An efficient algorithm for mining frequent itemsets over the entire history of data streams," in Proc. of the 1st Intl. Workshop on Knowledge Discovery in Data Streams, pp. 20-24, 2004.
[20] H. F. Li, S. Y. Lee, and M. K. Shan, "Mining maximal frequent itemsets in data streams," in Proc. of Intl. Computer Symposium, 2004.
[21] G. S. Manku and R. Motwani, "Approximate frequency counts over data streams," in Proc. of 28th Intl. Conf. on Very Large Data Bases, pp. 346-357, 2002.
[22] A. Savasere, E. Omiecinski, and S. Navathe, "Mining for strong negative associations in a large database of customer transactions," in Proc. of 14th Intl. Conf. on Data Engineering, pp. 494-502, 1998.
[23] P.-N. Tan, V. Kumar, and J. Srivastava, "Indirect association: Mining higher order dependencies in data," in Proc. of the 4th European Conf. on Principles of Data Mining and Knowledge Discovery, pp. 632-637, 2000.
[24] P. N. Tan and V. Kumar, "Interestingness measures for association patterns: A perspective," in Proc. of KDD 2000 Workshop on Postprocessing in Machine Learning and Data Mining, 2000.
[25] P. N. Tan and V. Kumar, "Mining indirect associations in Web data," Lecture Notes in Artificial Intelligence, vol. 2356, pp. 145-166, 2002.
[26] V. S. Tseng, Y. C. Liu, and J. W. Shin, "Mining gene expression data with indirect association rules," in Proc. of National Computer Symposium, 2007.
[27] Q. Wan and A. An, "An efficient approach to mining indirect associations," Journal of Intelligent Information Systems, vol. 27, no. 2, pp. 135-158, 2006.
[28] Q. Wan and A. An, "Efficient indirect association discovery using compact transaction databases," in Proc. of 2006 IEEE Intl. Conf. on Granular Computing, pp. 154-159, 2006.
[29] W. G. Teng, M. J. Hsieh, and M. S. Chen, "On the mining of substitution rules for statistically dependent items," in Proc. of the 2nd Intl. Conf. on Data Mining, pp. 442-449, 2002.
[30] X. Wu, C. Zhang, and S. Zhang, "Efficient mining of both positive and negative association rules," ACM Transactions on Information Systems, vol. 22, pp. 381-405, 2004.
[31] C. Yun, W. Haixun, P. S. Yu, and R. R. Muntz, "Moment: maintaining closed frequent itemsets over a stream sliding window," in Proc. of the 4th IEEE Intl. Conf. on Data Mining, pp. 59-66, 2004.
[32] Y. Zhu and D. Shasha, "StatStream: statistical monitoring of thousands of data streams in real time," in Proc. of the 28th Intl. Conf. on Very Large Data Bases, pp. 358-369, 2002.