To assess the performance of Twain, we conducted several experiments on generating frequent temporal itemsets from synthetic databases. The simulation program is coded in C++, and the experiments are run on a computer with a Pentium III 866MHz CPU and 512MB of RAM. We use both synthetic datasets and a real dataset to demonstrate the behavior of Twain. As will be shown later, Twain not only generates frequent patterns of better quality, but also outperforms AprioriIP and SPF in terms of execution time, I/O costs, CPU overheads, and scalability. In addition to generating more precise frequent patterns, Twain performs more efficiently than SPF because Twain can generate frequent 2-itemsets directly. We describe the method used to generate synthetic databases and the source of the real dataset in Section 4.1. Section 4.2 compares the quality of the frequent patterns generated by SPF and Twain.
The execution time of Twain is compared with prior algorithms in Section 4.3. Section 4.4 shows the I/O costs and CPU overheads for Twain. Results of scaleup experiments are presented in Section 4.5. Finally, the incremental ability of Twain is examined in Section 4.6.
Table II. Summary of the Parameters Used
|T| Avg. size of the transactions
|I| Avg. size of the potentially frequent itemsets
|D| Number of transactions in the database
N Number of distinct items
|L| Number of potentially frequent itemsets
|P| Number of partitions
4.1 Simulation Model
We use the same scenario as in Chang et al. [2002] to generate synthetic datasets in which the items in transactions are allowed to have different exhibition periods. The method used to generate the synthetic datasets is similar to the ones used in Agrawal and Srikant [1994] and Park et al. [1997]. However, in order to simulate various exhibition periods of the items in a realistic database, we divide the synthetic database equally into n partitions to imitate the required time granularity. Then, an exhibition period [p, q] (1 ≤ p ≤ q ≤ n) is randomly assigned to each item in the synthetic database. Finally, we scan the database once to remove, from each transaction, the items that are not within their exhibition periods. For example, the item X would be removed from the transaction XYZ in partition P1 if the exhibition period of X is [2, 4]. In accordance with the above method, we generate several different synthetic datasets to evaluate the performance of Twain. Each of the generated databases consists of |D| transactions with |T| items on average. The number of different items in each database is N. The average size of the potentially frequent itemsets is set to |I|, and the number of potentially frequent itemsets is set to |L|. The mean correlation level between the potentially frequent itemsets is set to 0.25 in our experiments. Table II provides a summary of the parameters used in our experiments. In addition, for simplicity of presentation, we use the notation Tx-Iy-Dz(Nm-Ln-Po) to represent a database in which |T| = x, |I| = y, |D| = z thousand, N = m thousand, |L| = n thousand, and |P| = o.
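The exhibition-period step described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the way p and q are sampled is an assumption, since the text only states that the period is randomly assigned; this is not the C++ simulation program used in the experiments.

```python
import random

def assign_exhibition_periods(items, n_partitions, rng):
    """Give each item a period (p, q) with 1 <= p <= q <= n_partitions.

    Assumption: p is drawn uniformly, then q uniformly from [p, n].
    """
    periods = {}
    for item in items:
        p = rng.randint(1, n_partitions)
        q = rng.randint(p, n_partitions)
        periods[item] = (p, q)
    return periods

def filter_transaction(transaction, partition, periods):
    """Keep only the items whose exhibition period covers this partition."""
    return [item for item in transaction
            if periods[item][0] <= partition <= periods[item][1]]

# Example from the text: X has exhibition period [2, 4], so it is removed
# from transaction XYZ in partition P1. The periods of Y and Z are made up.
example_periods = {"X": (2, 4), "Y": (1, 4), "Z": (1, 2)}
filtered = filter_transaction(["X", "Y", "Z"], 1, example_periods)
print(filtered)
```

In partition P2 the same call would keep X, since 2 lies within [2, 4].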
The real dataset is from the KDD Cup web site. We use BMS-POS as the testing workload, which contains 515597 transactions and 1633 different items. The dataset is divided into 12 and 24 partitions, denoted BMS-POS-12 and BMS-POS-24 respectively, for the performance evaluation. We thank Zheng and Blue Martini Software for providing the dataset in KDD 2001 [Zheng et al. 2001].
4.2 Quality of Frequent Patterns
First, the quality of the frequent patterns produced by SPF and Twain is investigated. We use both synthetic datasets and real datasets as testing workloads. We generate several synthetic datasets with parameters T10-I4-D100. In Figure 5, the leftmost column lists the different datasets. The synthetic datasets are denoted by Nm-Ln-Po-s, where N, L, and P are defined as in Table II and s represents the support value per thousand. The real datasets are denoted by BMS-POS-P-S, where P is the number of partitions and S is the support value per hundred. The second and third columns give the numbers of frequent patterns generated by SPF and Twain. There are three types of patterns generated by Twain, whose numbers are listed in the fourth, fifth, and sixth columns. The “Same” column gives the number of patterns whose items and frequent exhibition periods are exactly the same as those of the patterns generated by SPF. The “New” column gives the number of patterns whose items are not found by SPF. The “Precise” column gives the number of patterns whose items are the same as those of some pattern discovered by SPF but whose frequent exhibition periods are more precise than those found by SPF. The following three columns give the ratios (%) of the numbers in the fourth, fifth, and sixth columns to the total number of patterns generated by Twain. Finally, the last column gives the ratio (%) of the number of patterns in the “New” and “Precise” columns to the total number of patterns generated by Twain. For interested readers, the patterns obtained by Twain and SPF in this experiment are available at the following web site: http://arbor.ee.ntu.edu.tw/~jwhuang/twain-results/.
Fig. 5. Comparison of frequent patterns generated by SPF and Twain.
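The “Same”, “New”, and “Precise” columns can be read as a three-way classification of Twain's output against SPF's. The sketch below is only an illustration under simplifying assumptions: each pattern is reduced to one itemset with a single exhibition period, and any Twain period that differs from SPF's for the same items is counted as “Precise”. The function name and the toy patterns are hypothetical and are not taken from Figure 5.

```python
def classify_patterns(twain, spf):
    """Count Twain patterns as Same / New / Precise relative to SPF.

    twain, spf: dict mapping frozenset(items) -> exhibition period (p, q).
    """
    counts = {"Same": 0, "New": 0, "Precise": 0}
    for items, period in twain.items():
        if items not in spf:
            counts["New"] += 1          # items never reported by SPF
        elif spf[items] == period:
            counts["Same"] += 1         # identical items and period
        else:
            # Assumption: a differing period from Twain is the more precise one.
            counts["Precise"] += 1
    total = len(twain)
    ratios = {k: 100.0 * v / total for k, v in counts.items()} if total else {}
    # Last column of the figure: share of New plus Precise patterns.
    improved = (100.0 * (counts["New"] + counts["Precise"]) / total) if total else 0.0
    return counts, ratios, improved

# Toy input, for illustration only.
toy_twain = {
    frozenset("AB"): (2, 3),
    frozenset("AC"): (1, 4),
    frozenset("BC"): (1, 2),
    frozenset("CD"): (3, 4),
}
toy_spf = {frozenset("AB"): (1, 4), frozenset("AC"): (1, 4)}
counts, ratios, improved = classify_patterns(toy_twain, toy_spf)
```

For the toy input, one pattern is “Same”, one is “Precise”, and two are “New”, so the last-column ratio would be 75%.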
As shown in Figure 5, Twain can find some frequent patterns that SPF is not able to discover. For some patterns generated by SPF, Twain can obtain more precise frequent exhibition periods. The improvement in quality is 97.40% on average for the synthetic datasets. For the real datasets, the improvement ratio reaches 89.18%. Therefore, Twain overcomes the drawbacks of SPF and discovers frequent patterns of significantly better quality.
Fig. 6. The execution time under various minimum support.
4.3 Execution Time
Twain has shown significant improvement in the quality of the discovered patterns compared to existing methods. Twain also requires fewer computational steps than the other algorithms. In the second experiment, several synthetic datasets are used to investigate the execution time of all algorithms as the minimum support varies. The experimental results on various datasets are shown in Figure 6 and Figure 7. Regardless of the combination of parameters, Twain consistently outperforms AprioriIP and SPF in terms of execution time. Specifically, the execution time of Twain is orders of magnitude smaller than that of AprioriIP, and is also smaller than that of SPF. The margin grows as the minimum support decreases.
The reason is that the number of candidate itemsets generated by AprioriIP increases exponentially as the number of items or the number of partitions increases. In contrast, the number of candidates generated by SPF and Twain is in proportion to the number of items or the number of partitions in the database, which is about constant. Additionally, Twain can generate frequent 2-itemsets directly, and their number is much smaller than the number of candidate 2-itemsets generated by the other two algorithms. Furthermore, AprioriIP needs to scan the database multiple times to determine frequent k-itemsets, whereas Twain, by means of the scan reduction technique, needs to scan the database only twice. Therefore, Twain performs much better than AprioriIP and is more efficient than SPF in terms of execution time.
Fig. 7. The execution time under various minimum support.
Fig. 8. I/O costs under various minimum supports.
4.4 I/O Costs and CPU Overheads
In this experiment, we examine the I/O costs and the CPU overheads of all algorithms. Following the method used in Pei et al. [2001], we assume that each sequential read of a byte of data consumes one unit of I/O cost and each random read of a byte of data consumes two units of I/O cost. The I/O costs under various minimum supports are shown in Figure 8. The I/O costs of SPF and Twain remain the same as the minimum support decreases, with the cost of Twain slightly less than that of SPF. In contrast, the I/O cost of AprioriIP increases as the minimum support decreases. The I/O cost depends mainly on the number of database scans needed. As the minimum support decreases, the maximum value of k (for frequent k-itemsets) increases. Recall that AprioriIP needs one database scan whenever frequent k-itemsets are to be determined. The I/O costs of SPF and Twain, owing to the scan reduction technique, are not affected as the minimum support varies.
Fig. 9. Number of candidates generated.
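The I/O cost convention above is simple enough to write down directly. The following Python sketch assumes that every database scan reads the whole database sequentially and that a level-wise algorithm performs one scan per itemset length; the function names and the example sizes are hypothetical and do not reproduce the measurements in Figure 8.

```python
SEQ_READ_COST = 1   # I/O units per byte read sequentially
RAND_READ_COST = 2  # I/O units per byte read randomly

def io_cost(seq_bytes, rand_bytes=0):
    """Total I/O cost under the one-unit / two-unit convention."""
    return SEQ_READ_COST * seq_bytes + RAND_READ_COST * rand_bytes

def full_scan_cost(db_bytes, num_scans):
    """Cost of num_scans passes, assuming each pass is fully sequential."""
    return num_scans * io_cost(db_bytes)

def levelwise_scans(max_k):
    """Assumption: one scan for each itemset length 1..max_k."""
    return max_k

# With scan reduction, the text states that Twain scans the database twice.
SCAN_REDUCTION_SCANS = 2

db_bytes = 1000  # illustrative size only
levelwise = [full_scan_cost(db_bytes, levelwise_scans(k)) for k in (2, 3, 5)]
reduced = full_scan_cost(db_bytes, SCAN_REDUCTION_SCANS)
```

Under these assumptions the level-wise cost grows with the largest k reached, while the two-scan cost stays fixed, which matches the qualitative behavior described for Figure 8.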
Note that the scan reduction technique can be used only when the number of candidate 2-itemsets is very close to the number of frequent 2-itemsets. As explained in Section 2.4, SPF progressively filters out infrequent candidate 2-itemsets from one partition to another. This feature enables us to apply the scan reduction technique to SPF. In addition, since Twain can generate frequent 2-itemsets directly, the effect of the scan reduction technique becomes more pronounced. To gain more insight into the number of candidates generated by all algorithms, another experiment is conducted, and its results are shown in Figure 9.
As shown in Figure 9, Twain generates exactly as many candidate 2-itemsets as there are frequent 2-itemsets, while SPF achieves a 98% reduction rate in candidate 2-itemsets over AprioriIP. This feature not only enables the scan reduction technique to be used in Twain, but also reduces the CPU and memory overheads of the subsequent procedures. Note that candidate 3-itemsets are generated by joining 2-itemsets. Since the number of candidate 2-itemsets obtained by SPF is larger than the number of frequent 2-itemsets discovered by Twain, the number of candidate 3-itemsets generated by SPF is larger than the number of candidate 3-itemsets generated by Twain. In addition, the number of candidate 3-itemsets of Twain is the same as that of AprioriIP. Assume there are n partitions in a database, that all items are independent and uniformly distributed, and that there are on average m frequent temporal 1-itemsets in each partition. In AprioriIP, there will be $C^m_2 = m(m-1)/2$ candidate 2-itemsets in each partition. According to the results reported in Park et al. [1997], the first two passes of scanning the database account for about 62% of the total execution time. Thus, in each partition, the execution time of AprioriIP is proportional to $n \cdot \left(m + \frac{m(m-1)}{2}\right) \cdot \frac{100}{62}$. In addition, there are $n(n-1)/2$ subdatabases. Therefore, the total execution time of AprioriIP is proportional to $\frac{n(n-1)}{2} \cdot n \cdot \left(m + \frac{m(m-1)}{2}\right) \cdot \frac{100}{62}$. As for SPF, the execution time of the ProcSG phase is proportional to n. The number of candidate 2-itemsets is about $2\% \cdot m(m-1)/2$. With the scan reduction technique, the total number of candidate k-itemsets (k > 2) is about 4 times the number of candidate 2-itemsets. Then, the total execution time of SPF is proportional to $n + n \cdot \left(m + \frac{m(m-1)}{2} \cdot 0.02 \cdot 5\right)$. Finally, Twain generates frequent 2-itemsets directly, and their number is about 85% of the number of candidate 2-itemsets generated by SPF. Therefore, the total execution time of Twain is proportional to $n \cdot \left(m + \frac{m(m-1)}{2} \cdot 0.02 \cdot 0.85 \cdot 5\right)$.
Fig. 10. Normalized execution time under various numbers of transactions.
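To make the three proportionality expressions easier to compare, the sketch below evaluates them for given n and m. The function names are hypothetical, the results are relative units rather than seconds, and the values n = 10 and m = 100 are arbitrary illustrative inputs, not settings from the experiments. The factor 5 is taken as written; it presumably combines the candidate 2-itemsets with the roughly fourfold number of longer candidates.

```python
def pairs(m):
    """Candidate 2-itemsets from m frequent 1-itemsets: m(m-1)/2."""
    return m * (m - 1) / 2

def apriori_ip_cost(n, m):
    """n(n-1)/2 subdatabases, each costing n * (m + m(m-1)/2) * 100/62."""
    per_subdatabase = n * (m + pairs(m)) * 100 / 62
    return n * (n - 1) / 2 * per_subdatabase

def spf_cost(n, m, kept=0.02, factor=5):
    """ProcSG phase (n) plus n * (m + m(m-1)/2 * 0.02 * 5)."""
    return n + n * (m + pairs(m) * kept * factor)

def twain_cost(n, m, kept=0.02, direct=0.85, factor=5):
    """n * (m + m(m-1)/2 * 0.02 * 0.85 * 5)."""
    return n * (m + pairs(m) * kept * direct * factor)

n, m = 10, 100  # arbitrary illustrative values
estimates = {
    "AprioriIP": apriori_ip_cost(n, m),
    "SPF": spf_cost(n, m),
    "Twain": twain_cost(n, m),
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

For these inputs the model orders the algorithms as Twain, then SPF, then AprioriIP, with AprioriIP larger by several orders of magnitude.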
It is noted that although SPF generates more candidate k-itemsets, SPF still has the drawbacks of over-estimating and under-estimating the supports of frequent itemsets. Twain can overcome these two drawbacks successfully.
4.5 Scalability
Next, we conduct experiments with different numbers of transactions in the synthetic dataset (|D|) to investigate the scalability of Twain. We consider three different minimum supports, that is, 0.2%, 0.4%, and 0.8%. Note that the execution time under various numbers of transactions is normalized with respect to the time for T10-I4-D100 in each experiment. As shown in Figure 10, the execution time of SPF and Twain increases linearly as the number of transactions in the synthetic database increases. This shows that both SPF and Twain perform well and do not suffer from the exponential growth of execution time that other Apriori-like algorithms do. The bottleneck of Apriori-like algorithms lies in the huge number of candidate 2-itemsets and candidate 3-itemsets, which makes generating frequent 2-itemsets and frequent 3-itemsets time consuming. However, both SPF and Twain generate few candidate 2-itemsets, as shown in Figure 9, and Twain can even find frequent 2-itemsets directly. In addition, SPF and Twain apply the scan reduction technique to reduce the time spent scanning the database. Therefore, both algorithms overcome this bottleneck. More specifically, Twain scales better than SPF because the slopes of the lines for Twain are all smaller than the slopes of the lines for SPF. This feature shows that Twain is more practicable than SPF.
Fig. 11. The cumulative execution time of the incremental database.
4.6 Incremental Ability
Finally, we investigate the incremental ability of Twain. We divide the T10-I4-D20(N20-L2-P24) dataset into 7 subdatasets. The first subdataset contains the 1st to the 12th partitions of the original dataset. Each of the other subdatasets contains the next 2 partitions of the original dataset. These subdatasets are fed into Twain one by one. Note that the candidate 2-itemsets generated in each partition are examined, and those whose occurrence frequencies are larger than the relative supports are passed to the next partition. Therefore, ProcPM can pass the latest candidate 2-itemsets of the previous subdataset to the new partition in the following subdataset. Then, ProcPM handles the transactions in the new partition and generates new candidate 2-itemsets in this partition. In Figure 11, the X-axis represents each run of the mining process. For example, the first point includes the 1st to the 12th partitions of the original dataset, the second point includes the 1st to the 14th partitions, the third point includes the 1st to the 16th partitions, and so on. The Y-axis represents the execution time of each run. The execution time increases linearly, which means that Twain makes good use of the information carried over from the previous partitions and can incrementally generate frequent itemsets efficiently.
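The partition schedule of this experiment can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping, with a hypothetical function name; it does not implement ProcPM or the mining itself.

```python
def incremental_runs(total_partitions=24, initial=12, step=2):
    """List, for each run, the newly fed partitions and the covered range.

    Returns tuples (new_first, new_last, covered_first, covered_last).
    """
    runs = []
    prev_last = 0
    last = initial
    while last <= total_partitions:
        runs.append((prev_last + 1, last, 1, last))
        prev_last = last
        last += step
    return runs

runs = incremental_runs()
for i, (new_first, new_last, first, last) in enumerate(runs, start=1):
    print(f"run {i}: new partitions {new_first}-{new_last}, "
          f"covers partitions {first}-{last}")
```

With 24 partitions, an initial block of 12, and increments of 2, this yields the 7 runs plotted on the X-axis of Figure 11, covering partitions 1 to 12, 1 to 14, and so on up to 1 to 24.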
5. CONCLUSION
We have presented a general model of mining association rules in a temporal database where the exhibition periods of the items are allowed to differ from one another, as in Chang et al. [2002]. However, some interesting rules may be under-estimated. In addition, the frequent exhibition periods of temporal itemsets may be over-estimated. To address these issues, we introduced the notions of FCP and MFCP to give more precise frequent exhibition periods of frequent temporal itemsets. In addition, we developed an efficient algorithm, referred to as Twain, to discover precise general temporal association rules. Specifically, Twain can generate frequent 2-itemsets directly, which allows us to apply the scan reduction technique to find candidate k-itemsets (k > 2) effectively. Moreover, Twain not only overcomes the drawbacks of SPF in Chang et al. [2002] for mining precise general temporal association rules, but also possesses the incremental mining ability. Some related theoretical properties were derived in this article as well. The experimental results showed that Twain outperforms other algorithms in the quality of frequent patterns, execution time, I/O cost, CPU overhead, and scalability.
REFERENCES
AGARWAL, R., AGGARWAL, C., AND PRASAD, V. 2000. A tree projection algorithm for generation of frequent itemsets. J. Para. Distrib. Comput. (Special Issue on High Performance Data Mining).
AGRAWAL, R., IMIELINSKI, T., AND SWAMI, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ACM, New York, 207–216.
AGRAWAL, R. AND SRIKANT, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, 478–499.
ALE, J. AND ROSSI, G. 2000. An approach to discovering temporal association rules. In Proceedings of the ACM Symposium on Applied Computing 1, ACM, New York, 294–300.
AYAD, A. M., EL-MAKKY, N. M., AND TAHA, Y. 2001. Incremental mining of constrained association rules. In Proceedings of the 1st ACM-SIAM Conference on Data Mining. ACM, New York.
BESEMANN, C. AND DENTON, A. 2005. Integration of profile hidden Markov model output into association rule mining. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, New York, 538–543.
BETTINI, C., WANG, X., AND JAJODIA, S. 1998. Mining temporal relationships with multiple granularities in time sequences. Bulle. IEEE Comput. Soc. Tech. Comm. Data Eng.
BLANCHARD, J., GUILLET, F., GRAS, R., AND BRIAND, H. 2005. Using information-theoretic measures to assess association rule interestingness. In Proceedings of the 5th IEEE International Conference on Data Mining. IEEE Computer Society Press, Los Alamitos, CA.
CHANG, C.-Y., CHEN, M.-S., AND LEE, C.-H. 2002. Mining general temporal association rules for items with different exhibition periods. In Proceedings of the 2nd IEEE International Conference on Data Mining. IEEE Computer Society Press, Los Alamitos, CA.
CHEN, J., HE, H., WILLIAMS, G., AND JIN, H. 2004. Temporal sequence associations for rare events. In Proceedings of the 8th Pacific Asia Conference on Knowledge Discovery and Data Mining.
CHEN, X. AND PETR, I. 2000. Discovering temporal association rules: algorithms, language and system. In Proceedings of the 16th IEEE International Conference on Data Engineering. IEEE Computer Society Press, Los Alamitos, CA.
CHEN, X., PETROUNIAS, I., AND HEATHFIELD, H. 1998. Discovery of association rules in temporal databases. In Proceedings of the Issues and Applications of Database Technology.
COHEN, E., DATAR, M., FUJIWARA, S., GIONIS, A., INDYK, P., MOTWANI, R., ULLMAN, J. D., AND YANG, C. 2001. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 64–78.
HAN, J. AND FU, Y. 1995. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases, 420–431.
HAN, J. AND KAMBER, M. 2000. Data Mining: Concepts and Techniques. Morgan-Kaufmann. San Francisco, CA.
HAN, J. AND PEI, J. 2000. Mining frequent patterns by pattern-growth: Methodology and implications. ACM SIGKDD Explorations (Special Issue on Scalable Data Mining Algorithms). ACM, New York.
HAN, J., PEI, J., MORTAZAVI-ASL, B., CHEN, Q., DAYAL, U., AND HSU, M.-C. 2000a. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York. 355–359.
HAN, J., PEI, J., AND YIN, Y. 2000b. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ACM, New York. 486–493.
HARMS, S. K. AND DEOGUN, J. S. 2004. Sequential association rule mining with time lags. Journal of Intelligent Informatics Systems.
JIANG, N. AND GRUENWALD, L. 2006. An efficient algorithm to mine online data streams. In Proceedings of the 2006 KDD TDM Workshop.
KE, Y., CHENG, J., AND NG, W. 2006. MIC framework: An information-theoretic approach to quantitative association rule mining. In Proceedings of the 22nd IEEE International Conference on Data Engineering. IEEE Computer Society Press, Los Alamitos, CA.
KIFER, D., BUCILA, C., GEHRKE, J., AND WHITE, W. 2002. DualMiner: A dual-pruning algorithm for itemsets with constraints. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York.
LAKSHMANAN, L., NG, R., HAN, J., AND PANG, A. 1998. Exploratory mining and pruning optimization of constrained associations rules. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. ACM, New York.
LAKSHMANAN, L. V. S., NG, R., HAN, J., AND PANG, A. 1999. Optimization of constrained frequent set queries with 2-variable constraints. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, ACM, New York. 157–168.
LEE, C.-H., CHEN, M.-S., AND LIN, C.-R. 2003. Progressive partition miner: An efficient algorithm