CHAPTER 4 The Proposed Resource-Aware GIAMS Framework
4.4 Adaption Scheme for Available Memory Awareness
4.4.2 Algorithm Description
We first describe the algorithm DelayInsert, which is responsible for constructing and maintaining Card-Stree* using the transactions in CT. We then detail the victim searching & releasing algorithm, a procedure called by DelayInsert when a memory shortage occurs.
The algorithm DelayInsert is described in Figure 4-6. In summary, for each k-itemset X generated from the input transaction, if X is already in Card-Stree*, we update its information. Otherwise, if the memory space is sufficient, we insert X directly if k < 3, or perform a delayed insert if k ≥ 3, i.e., insert X only if all of its immediate subsets are in STk-1. On the other hand, if the memory is insufficient and k ≥ 3, we simply discard X, whereas for k < 3 we temporarily store X in a buffer. After all itemsets in Ck have been inspected, we call Victim_Searching&Releasing to release memory and insert at most #victim of the newly generated k-itemsets in the buffer into Card-Stree*.
Procedure Name: DelayInsert
Input: CT, Card-Stree*, cbid.
Output: Updated Card-Stree*.
Steps:
1. foreach transaction tr in CT do
2.   foreach itemset X in Ck, the set of k-itemsets Ck
     ...
         update X.bidv, X.countv, and X.tlcount;
6.     else if memory is enough then
7.       if k < 3 then insert X into STk;
8.       else insert X into STk only if all immediate subsets of X are in STk-1;
9.   ...
17.  #victim = Victim_Searching&Releasing(Card-Stree*, n);
18.  insert at most #victim k-itemsets in buffer into STk;
19.  endif
20. endfor
Figure 4-6. Algorithm description of DelayInsert
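The decision logic summarized above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the level-wise dictionary `tree`, the node cap `capacity` used as a stand-in for the memory check, the single `count` kept per itemset (instead of X.bidv, X.countv and X.tlcount), the callback `release_victims`, and the choice of passing the buffer size as n are all assumptions made for illustration.

```python
from itertools import combinations


def delay_insert(transaction, k, tree, capacity, release_victims):
    """Process the k-itemsets of one transaction against a level-wise store.

    tree maps a length k to a dict {frozenset itemset: count}, an assumed
    stand-in for the subtree STk. capacity is an assumed cap on the total
    number of stored nodes. release_victims(tree, n) is a hypothetical
    callback that returns how many nodes it actually freed.
    """
    level = tree.setdefault(k, {})
    buffer = []  # new short itemsets that did not fit into memory

    def node_count():
        return sum(len(nodes) for nodes in tree.values())

    for items in combinations(sorted(transaction), k):
        x = frozenset(items)
        if x in level:
            level[x] += 1                    # already maintained: update it
        elif node_count() < capacity:        # memory is enough
            if k < 3:
                level[x] = 1
            else:
                # Delayed insert: every immediate subset must already be kept.
                parents = tree.get(k - 1, {})
                if all(frozenset(s) in parents
                       for s in combinations(items, k - 1)):
                    level[x] = 1
        elif k < 3:
            buffer.append(x)                 # keep until memory is released
        # k >= 3 without enough memory: x is discarded

    if buffer:
        # Assumption: ask for as many nodes as there are buffered itemsets.
        freed = release_victims(tree, len(buffer))
        for x in buffer[:freed]:
            level[x] = 1
    return tree


tree = {}
for k in (1, 2, 3):
    delay_insert("ABC", k, tree, capacity=10, release_victims=lambda t, n: 0)
print(sorted("".join(sorted(x)) for x in tree[3]))
```

In this toy run the 3-itemset ABC is accepted only because AB, AC and BC were inserted at k = 2 beforehand; calling the function with k = 3 on an empty store leaves ST3 empty.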
Algorithm: Victim searching & releasing
Input: Number of nodes n and Card-Stree*
Output: Number of victim nodes released from Card-Stree*, #victim
1: #victim = 0;
2: k = the largest length of itemsets in Card-Stree*;
3: repeat
   ...
8:   if #victim − victim-list[0].num >= n then delete victim-list[0];
9:   endif
   ...
16:  insert node X into victim-list; #victim++;
17: case 3:
18:  insert node X into victim-list;
19:  #victim++;
20:  if #victim − victim-list[0].num >= n then
21:    delete victim-list[0];
   ...
28: foreach victim X in victim-list do
29:   delete the corresponding nodes pointed by X.addr from Card-Stree*;
30: return #victim;
Figure 4-7. Algorithm for victim searching&releasing
The algorithm for finding victims to release is described in Figure 4-7, which details the steps implementing the idea presented in subsection 4.4.1.
4.4.3 An Example
Suppose the Card-Stree* contains the set of frequent itemsets shown in Figure 4-8. For simplicity, we only show the total count of each itemset. Suppose that we want to insert F and G into Card-Stree* but find that the memory is not enough. The procedure Victim_Searching&Releasing is then activated to perform victim searching and node releasing, with n = 2.
Figure 4-8. The itemsets maintained in an example Card-Stree*.
1. The victim search starts from the itemsets in the subtree of length 4, i.e., ST4. Since the victim-list is empty, itemset ABCD is inserted into the victim-list with count = 3 and a link to its corresponding node in Card-Stree*. The result is shown in Figure 4-9.
Figure 4-9. The Card-Stree* and victim-list after inserting ABCD.
2. Since there are other subtrees with cardinality larger than 2 and #victim < n, the victim search continues to the subtree of cardinality 3, first inspecting the node ABC. Since #victim < 2 and ABC’s count is larger than 3, we insert ABC at the front of the victim-list. The result is shown in Figure 4-10.
Figure 4-10. The Card-Stree* and victim-list after inserting ABC.
3. The search process continues to examine the other nodes in subtree ST3. The next node inspected is ABD. Its count is 4, smaller than that of ABC, so we insert ABD into the victim-list. However, we now find that #victim – victim-list[0].num = 2. Therefore, ABC is deleted from the victim-list. The result is shown in Figure 4-11.
Figure 4-11. The Card-Stree* and victim-list after inserting ABD.
4. The next node is ABE. Its count is 2, smaller than that of ABD. So ABE is inserted into victim-list. Again, since #victim – victim-list[0].num = 2, node ABD is deleted. The result is shown in Figure 4-12.
Figure 4-12. The Card-Stree* and victim-list after inserting ABE.
5. The next node is ACD. Its count is 3, equal to that of the first node ABCD in the victim-list. So, we insert ACD into the same victim-list node in which ABCD is located, as shown in Figure 4-13.
Figure 4-13. The Card-Stree* and victim-list after inserting ACD.
6. The next itemset is ACE. Its count is equal to 1, so we immediately delete the node containing ACE from ST3 and decrease the number of nodes to be released by 1, obtaining n = 1. After this, the victim-list holds more victims than required, i.e., #victim – victim-list[0].num ≥ n, so we delete the first group in the victim-list. The result is depicted in Figure 4-14.
Figure 4-14. The Card-Stree* and victim-list after deleting ABCD and ACD.
7. The next node is BCD. Since its count is larger than that of the first node in victim-list, i.e., 2, we ignore BCD.
8. Finally, all nodes in ST3 have been inspected and #victim = 1, equal to the number required. Hence, we delete from Card-Stree* all of the nodes pointed to by the victims in the victim-list to release memory for inserting F and G.
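The eight steps above can be replayed with a small Python sketch. The selection rules are inferred from this walk-through rather than taken from the full listing in Figure 4-7, so they should be read as an approximation. The function name `select_victims`, the list-of-groups representation of the victim-list, and the counts chosen for ABC (5) and BCD (4) are assumptions; the text only states that ABC's count exceeds 4 and BCD's exceeds 2.

```python
def select_victims(nodes, n):
    """Replay victim selection over (itemset, count) pairs in inspection order.

    n is the number of nodes that must be released.
    Returns (victims, released_now).
    """
    groups = []        # victim-list: [count, [itemsets]], highest count first
    num_victims = 0    # plays the role of #victim
    released_now = []  # nodes with count 1, deleted as soon as they are seen

    for itemset, count in nodes:
        if count == 1:
            released_now.append(itemset)
            n -= 1     # one fewer node has to be found
        elif groups and num_victims >= n and count > groups[0][0]:
            continue   # enough cheaper victims are already listed: ignore
        else:
            for group in groups:
                if group[0] == count:      # share the node of equal count
                    group[1].append(itemset)
                    break
            else:
                groups.append([count, [itemset]])
                groups.sort(key=lambda g: g[0], reverse=True)
            num_victims += 1
        # Drop the highest-count group while the rest still covers n nodes.
        while groups and num_victims - len(groups[0][1]) >= n:
            num_victims -= len(groups[0][1])
            groups.pop(0)

    victims = [itemset for _, members in groups for itemset in members]
    return victims, released_now


# Inspection order of the example; counts for ABC and BCD are assumed values.
nodes = [("ABCD", 3), ("ABC", 5), ("ABD", 4), ("ABE", 2),
         ("ACD", 3), ("ACE", 1), ("BCD", 4)]
victims, released_now = select_victims(nodes, n=2)
print(victims, released_now)
```

Under these assumptions the replay ends with ABE as the only remaining victim and ACE released immediately, which matches steps 6 to 8: two nodes are freed in total for F and G.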
CHAPTER 5
Experimental Results
5.1 Experiment Design
To evaluate the effectiveness and efficiency of RA-GIAMS, we conducted a series of experiments on both synthetic and real datasets. The evaluation covers two aspects: execution time and pattern accuracy.
All experiments were performed on an Intel(R) Core(TM) i5-2400 (3.1 GHz) PC with 4GB of main memory, running the Windows 7 32-bit operating system. All programs were implemented in Visual C++ 2008.
5.2 Evaluation on Real Dataset
In this experiment, we consider the sliding window model, with the detailed settings of the generic model shown in Table 5-1. We use the dataset msnbc [2], which was constructed from the web log of news pages on msn.com for the entire day of September 28, 1999. The characteristics of msnbc are summarized in Table 5-2. A more detailed description of this dataset can be found in [2].
Table 5-1. Parameter settings for generic window model used in this experiment.
s       w       d    σs     σf     σd
10000   80000   1    0.01   0.01   0.1
Table 5-2. Characteristics of the test data msnbc.
Database   Items   Transactions   Maximum transaction size   Average transaction size
msnbc      17      989818         17                         1.71678
To inspect the performance and effectiveness of our memory awareness adaptation scheme, we run RA-GIAMS under five settings of available memory, ranging from 90% down to 50% of the original memory space in 10% decrements, and compare the results with those obtained with sufficient memory space.
First, we evaluate the execution times of RA-GIAMS. The results are depicted in Figure 5-1. Most of the time RA-GIAMS is faster than GIAMS, even though RA-GIAMS incurs overhead for victim searching and node releasing to cope with insufficient memory, and the execution time increases as the available memory increases. This is because, when memory is insufficient, a certain number of itemsets that would otherwise be maintained in Card-Stree* are pruned, which decreases the number of create and update operations without incurring too many node replacement operations, as shown in Figure 5-2, where the cases of 70%, 80% and 90% memory incur the fewest node replacement operations.
Figure 5-1. Execution times of RA-GIAMS running over msnbc with available memory variation.
Figure 5-2. Execution times spent on node replacement running over msnbc with available memory variation.
Next, we inspect the accuracy of the results generated by RA-GIAMS. Because our algorithm introduces a storage shedder that prunes itemsets maintained in the Card-Stree* structure, errors may occur in the discovered frequent itemsets. We therefore measure the missing rate and support error of frequent itemsets, and the precision and recall with respect to indirect association rules.
We first evaluate how many of the frequent itemsets generated by GIAMS without memory limitation are lost when a memory shortage occurs, which is measured as the error rate:
\[ \mathrm{Error} = \frac{|F_{true} \cap F_{est}|}{|F_{true}|} \tag{5.1} \]
where $F_{true}$ denotes the set of frequent itemsets discovered by GIAMS, while $F_{est}$ represents the set discovered by RA-GIAMS with insufficient memory.
Figure 5-3. Error rate of discovered frequent itemsets from msnbc with available memory variation.
We then check the difference between the supports of the discovered frequent itemsets with and without memory limitation, which is measured by the following formula called ASE (Average Support Error):
where Tsup denotes the frequent itemsets that are discovered by GIAMS, and Esup denotes the frequent itemsets discovered by RA-GIAMS under memory limitation. As shown in Figure 5-4, the ASE is nearly zero in all cases, even when only 50% of the memory is available.
Figure 5-4. Average support error of discovered frequent itemsets from msnbc with available memory variation.
We then examine the accuracy of the rules discovered by RA-GIAMS. We consider two measurements, precision and recall. Let $IA_{true}$ denote the set of indirect associations discovered by GIAMS without memory limitation and $IA_{est}$ denote the set discovered by RA-GIAMS with insufficient memory. Precision measures the fraction of indirect associations in $IA_{est}$ that are also in $IA_{true}$, while recall measures the fraction of indirect associations in $IA_{true}$ that are also generated by RA-GIAMS. These two criteria are defined as follows:
\[ \mathrm{Precision} = \frac{|IA_{true} \cap IA_{est}|}{|IA_{est}|} \tag{5.3} \]
\[ \mathrm{Recall} = \frac{|IA_{true} \cap IA_{est}|}{|IA_{true}|} \tag{5.4} \]
As illustrated in Figures 5-5 and 5-6, the memory adaptation scheme of RA-GIAMS performs very well. All of the precisions are larger than 0.9 and all recalls are above 0.8, meaning that fewer than 10% of the indirect associations discovered by RA-GIAMS are false and fewer than 20% of the true indirect associations are missed by RA-GIAMS, even when only 50% of the memory is available.
Figure 5-5. Precisions of RA-GIAMS in terms of discovered indirect associations from msnbc with available memory variation.
Figure 5-6. Recalls of RA-GIAMS in terms of discovered indirect associations from msnbc with available memory variation.
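As a small illustration of how the precision and recall criteria in (5.3) and (5.4) are evaluated, the following Python sketch computes both from two sets of rules. The function name `precision_recall`, the tuple encoding of a rule, and the toy rule sets are hypothetical and are not taken from the experiments.

```python
def precision_recall(ia_true, ia_est):
    """Return (precision, recall) of ia_est with respect to ia_true.

    Both arguments are sets of hashable rule identifiers. An empty
    denominator yields 0.0, which is a convention chosen for this sketch.
    """
    common = ia_true & ia_est
    precision = len(common) / len(ia_est) if ia_est else 0.0
    recall = len(common) / len(ia_true) if ia_true else 0.0
    return precision, recall


# Toy data: each rule is encoded as (item pair, mediator), an assumed format.
ia_true = {("AB", "C"), ("AD", "C"), ("BD", "E"), ("AE", "F"), ("BE", "F")}
ia_est = {("AB", "C"), ("AD", "C"), ("BD", "E"), ("CE", "F")}
precision, recall = precision_recall(ia_true, ia_est)
print(precision, recall)
```

In this toy case three of the four estimated rules are true (precision 0.75) and three of the five true rules are recovered (recall 0.6).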
5.3 Evaluation on Synthetic Dataset
In this experiment, we consider a synthetic dataset T5I5N0.1KD1000K generated by the IBM data generator, whose characteristics are summarized in Table 5-3. The parameter settings for generic window model follow the settings for msnbc.
Table 5-3. Characteristics of the test data T5I5N0.1KD1000K.
Database          Items   Transactions   Maximum transaction size   Average transaction size
T5I5N0.1KD1000K   17      767768         23                         6.23
We first examine the performance of RA-GIAMS over this dataset. As depicted in Figure 5-7, the execution times with 90% and 80% memory are, surprisingly, longer than the execution time without memory limitation. The reason is that both cases incur many more node replacement operations than the other cases, as shown in Figure 5-8.
Figure 5-7. Execution times of RA-GIAMS running over T5I5N0.1KD1000K with available memory variation.
We next inspect the errors of the generated frequent itemsets. As shown in Figures 5-9 and 5-10, the errors are nearly zero in all cases.
Figure 5-8. Execution times spent on node replacement running over T5I5N0.1KD1000K with available memory variation.
Figure 5-9. Error rate of discovered frequent itemsets from T5I5N0.1KD1000K with available memory variation.
Finally, we inspect the precision and recall in terms of indirect association rules. As illustrated in Figures 5-11 and 5-12, all cases exhibit very high precision and recall, nearly one.
Figure 5-10. Average support error of discovered frequent itemsets from T5I5N0.1KD1000K with available memory variation.
Figure 5-11. Precisions of RA-GIAMS running on T5I5N0.1KD1000K with available memory variation.
Figure 5-12. Recalls of RA-GIAMS running on T5I5N0.1KD1000K with available memory variation.
CHAPTER 6
Conclusions and Future Work
6.1 Conclusions
In this thesis, we have considered the problem of resource-aware mining of indirect association rules over data streams, and have proposed a generic framework, RA-GIAMS, to cope with this problem. The framework is based on GIAMS, a mining framework that can accommodate most contemporary stream window models and allows user-defined window models. RA-GIAMS adds three mechanisms to GIAMS, namely a resource monitor, a load shedder, and a storage shedder, which together adapt the computation to the data arrival rate as well as the available resources, including CPU power and memory space. To evaluate the effectiveness and performance of the proposed framework, we have conducted a series of experiments. The experimental results show that the framework can effectively adjust memory consumption during frequent pattern mining with very little overhead. In addition, even with limited memory space, i.e., only 50% of the original space, the framework not only discovers most of the frequent patterns serving as mediators for qualified indirect associations but also maintains the accuracy of the discovered patterns; the support errors are nearly zero in all test settings.
6.2 Future Work
In this thesis, we have only completed the implementation of the proposed adaptation scheme for available memory awareness. In the future, we will implement the scheme for CPU power awareness and integrate it into RA-GIAMS.
The study of resource-aware mining from streaming data is still in its infancy, and many research issues are worthy of further investigation. Some important topics to be explored in the future are listed below:
1. In this study, we confine the resources to the general types, CPU power and memory space. Many applications developed for mobile environments, such as sensor networks and smartphones, however, have to consider additional resource constraints, mainly battery and network bandwidth. We will extend our framework to accommodate these new types of resources.
2. It is interesting to note that although our proposed framework RA-GIAMS aims at discovering indirect association rules, it relies on a kernel procedure for generating frequent itemsets, whose discovery is the most computation-intensive part of many mining tasks, such as association rules, classification, and clustering. In this regard, we believe that our framework can be extended to accomplish these tasks and developed into a more general data stream mining system.
References
[1] B. Babcock, M. Datar, and R. Motwani, “Load Shedding Techniques for Data Stream Systems,” in Proceedings of the 2003 Workshop on Management and Processing of Data Streams, 2003.
[2] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, “Visualization of Navigation Patterns on a Web Site Using Model-Based Clustering,” in Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 280-284, 2000.
[3] J. H. Chang and H. C. Kum, “Frequency-Based Load Shedding over A Data Stream of Tuples,” Information Sciences, vol. 179, no. 21, pp. 3733-3744, 2009.
[4] J. H. Chang and W. S. Lee, “estWin: Adaptively Monitoring the Recent Change of Frequent Itemsets over Online Data Streams,” in Proceedings of 12th International Conference on Information and Knowledge Management, pp. 536-539, 2003.
[5] J. H. Chang and W. S. Lee, “Find Recent Frequent Itemsets Adaptively over Online Data Stream,” in Proceedings of 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 487-492, 2003.
[6] L. Chen, S. S. Bhowmick, and J. Li, “Mining Temporal Indirect Associations,” in Proceedings of 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 425-434, 2006.
[7] X. H. Dang, W. K. Ng, and K. L. Ong, “Adaptive Load Shedding for Mining Frequent Patterns from Data Streams,” in Proceedings of Data Warehousing and Knowledge Discovery, pp. 342-351, 2006.
[8] X. H. Dang, W. K. Ng, K. L. Ong, and V. C. S. Lee, “Discovering Frequent Sets from Data Streams with CPU Constraint,” in Proceedings of 6th Australasian Data Mining Conference, pp. 121-128, 2007.
[9] P. Domingos and G. Hulten, “A General Framework for Mining Massive Data Streams,” Journal of Computational and Graphical Statistics, vol. 12, no. 4, pp. 945-949, 2003.
[10] M. M. Gaber, S. Krishnaswamy, and A. Zaslavsky, “Resource-aware Mining of Data Streams,” Journal of Universal Computer Science, vol. 11, no. 8, pp. 1440-1453, 2005.
[11] M. M. Gaber and P. S. Yu, “A Framework for Resource-aware Knowledge Discovery in Data Streams: A Holistic Approach with Its Application to Clustering,” in Proceedings of ACM Symposium on Applied Computing, pp. 649-656, 2006.
[12] P. Kazienko, “IDRAM—Mining of Indirect Association Rules,” in Proceedings of International Conference on Intelligent Information Processing and Web Mining, pp. 77-86, 2005.
[13] P. Kazienko and K. Kuzminska, “The Influence of Indirect Association Rules on Recommendation Ranking Lists,” in Proceedings of 5th International Conference on Intelligent Systems Design and Applications, pp. 482-487, 2005.
[14] W. Y. Lin, Y. E. Wei, and C. H. Chen, “A Generic Approach for Mining Indirect Association Rules in Data Streams,” in Proceedings of 24th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 95-104, 2011.
[15] S. Guha, A. Meyerson, N. Mishra, and R. Motwani, “Clustering Data Streams: Theory and Practice,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, 2003.
[16] C. Heinz and B. Seeger, “Cluster Kernels: Resource-Aware Kernel Density Estimators over Streaming Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 7, pp. 880-893, 2008.
[17] S. Parthasarathy and R. Subramonian, “An Interactive Resource-Aware Framework for Distributed Data Mining,” IEEE Technical Committee on Distributed Processing Letters, pp. 24-32, 2001.
[18] R. Shah, S. Krishnaswamy, and M. M. Gaber, “Resource-Aware Very Fast K-Means for Ubiquitous Data Stream Mining,” in Proceedings of 2nd International Workshop on Knowledge Discovery in Data Streams, pp. 40-50, 2005.
[19] P. N. Tan, V. Kumar, and J. Srivastava, “Indirect Association: Mining Higher Order Dependencies in Data,” in Proceedings of 4th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 632-637, 2000.
[20] P. N. Tan and V. Kumar, “Mining Indirect Associations in Web Data,” in Proceedings of 3rd International Workshop on Mining Web Log Data Across All Customers Touch Points, pp. 145-166, 2001.
[21] W. G. Teng, M. S. Chen, and P. S. Yu, “Resource-Aware Mining with Variable Granularities in Data Streams,” in Proceedings of SIAM Conference on Data Mining, pp. 22-24, 2004.
[22] J. X. Yu, Z. Chong, H. Lu, Z. Zhang, and A. Zhou, “A False Negative Approach to Mining Frequent Itemsets from High Speed Transactional Data Streams,” Information Sciences, vol. 176, no. 14, pp. 1986-2015, 2006.
[23] Q. Wan and A. An, “An Efficient Approach to Mining Indirect Associations,” Journal of Intelligent Information Systems, vol. 27, no. 2, pp. 135-158, 2006.
[24] Q. Wan and A. An, “Efficient Indirect Association Discovery using Compact Transaction Databases,” in Proceedings of 2006 IEEE International Conference on Granular Computing, pp. 154-159, 2006.
[25] Y. E. Wei, A Generic Framework and Algorithms for Mining Indirect Associations from Data Streams, Master Thesis, National University of Kaohsiung, Taiwan, 2010.