
Chapter 5 Proposed Interval-based Event Mining Algorithm: CTMiner

5.4 Handling large databases

The CTMiner algorithm works only if the database fits into memory. If the database is too large to fit into memory, the frequent temporal patterns are discovered by a partition-and-validation technique. First, the database is partitioned so that each partition can be processed in memory by CTMiner. If a temporal pattern is frequent in the database, it must be frequent in at least one partition. Thus, we obtain the set of potentially frequent patterns by collecting the patterns discovered by running CTMiner on each partition. Then, we validate the frequency of each potentially frequent pattern against the whole database. The validation needs one more data scan to obtain all frequent temporal patterns.
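The partition-and-validation procedure can be sketched as follows. This is a minimal illustration, not the thesis implementation: `toy_miner` is a hypothetical stand-in for CTMiner that only finds frequent sets of one or two event labels, and `contains`, `partition_and_validate` and the relative threshold `min_sup` are names assumed for the example. The same relative threshold is applied to every partition, which is what guarantees that a globally frequent pattern is found in at least one partition.

```python
from itertools import combinations
from math import ceil


def toy_miner(sequences, min_sup):
    """Hypothetical stand-in for CTMiner: frequent sets of 1 or 2 event labels."""
    counts = {}
    for seq in sequences:
        labels = sorted(set(seq))
        for k in (1, 2):
            for pattern in combinations(labels, k):
                counts[pattern] = counts.get(pattern, 0) + 1
    threshold = min_sup * len(sequences)
    return {p for p, c in counts.items() if c >= threshold}


def contains(seq, pattern):
    """True if every label of the pattern occurs in the sequence."""
    return set(pattern) <= set(seq)


def partition_and_validate(database, min_sup, n_parts, mine=toy_miner):
    """Mine each partition separately, then validate candidates in one extra scan."""
    if not database:
        return set()
    size = ceil(len(database) / n_parts)
    partitions = [database[i:i + size] for i in range(0, len(database), size)]

    # Step 1: collect patterns that are frequent in at least one partition.
    candidates = set()
    for part in partitions:
        candidates |= mine(part, min_sup)

    # Step 2: one more scan over the whole database to count each candidate.
    counts = dict.fromkeys(candidates, 0)
    for seq in database:
        for pattern in candidates:
            if contains(seq, pattern):
                counts[pattern] += 1

    threshold = min_sup * len(database)
    return {p for p, c in counts.items() if c >= threshold}


db = [["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"], ["a"], ["c"]]
result = partition_and_validate(db, min_sup=0.5, n_parts=2)
```

In this toy run the pattern `("a", "b")` is frequent in the first partition but not in the whole database, so it is a candidate that the validation scan discards.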


Chapter 6

Experimental Result

To evaluate the performance of CTMiner, three temporal pattern mining algorithms, TPrefixSpan [11], H-DFS [10], and IEMiner [13], were implemented for comparison. All algorithms were implemented in C++ and tested on a Pentium D 3.0 GHz machine with 4.0 GB of main memory running Windows XP. A comprehensive performance study was conducted on both synthetic and real datasets. In Section 6.1.1, we compare the runtime of CTMiner, TPrefixSpan, H-DFS and IEMiner on small (10K), medium (100K) and large (200K) synthetic datasets while varying the minimum support threshold. We discuss the memory usage of each algorithm in Section 6.1.2. In Section 6.1.3, we verify the scalability of CTMiner on synthetic datasets of different sizes. Besides the experiments on synthetic datasets, we also apply CTMiner to a real dataset of library lending records, which is described in Section 6.2. The parameters used in the experiments are described in Table 6-1.

Parameter   Description
|D|         Number of customers
|C|         Average number of transactions per customer
|T|         Average number of items per transaction
|S|         Average length of maximal potentially large sequences
|I|         Average size of itemsets in maximal potentially large sequences
Ns          Number of maximal potentially large sequences
Ni          Number of maximal potentially large itemsets
N           Number of items

Table 6-1 Parameters of synthetic data generator

6.1 Experiments on synthetic datasets

6.1.1 Runtime comparisons

Some parameters are fixed in the runtime experiments: |C|=10, |T|=2.5, |S|=4, |I|=1.25, Ns=500, Ni=2,500. Thus the average length of the sequences is 25, i.e., |C|×|T|, and the average length of the temporal patterns is 6, i.e., |S|×|I|. The parameters Ns and Ni are set to small values so that plenty of temporal patterns are generated. Three runtime experiments are conducted, on small, medium and large datasets respectively. The first experiment uses a small dataset of 10K sequences with 500 different event types and varies the minimum support threshold from 4 % to 1 %.

Fig. 6-1 and Fig. 6-2 show the running time and the number of temporal patterns on the small dataset for different minimum support thresholds, respectively. Fig. 6-3 shows the length distribution of the frequent temporal patterns. As expected, the running time of all algorithms increases as the minimum support threshold decreases.

However, the runtimes of IEMiner, H-DFS and TPrefixSpan increase drastically compared with that of CTMiner. When the minimum support is 1 %, the dataset contains a large number of frequent temporal patterns (3,184). CTMiner takes 509 seconds, which is about 5 times faster than TPrefixSpan (2,532 seconds), more than 8 times faster than IEMiner (4,337 seconds) and more than 13 times faster than H-DFS (6,676 seconds).

Figure 6-1 Performance of the four algorithms on data set with D10k – C10 – I1.25 – Ns 500 – Ni 2,500 – N10k

Figure 6-2 The number of generated frequent patterns on dataset D10k – C20 – I2.5 – Ns 500 – Ni 2,500 – N500.

Figure 6-3 The distribution of frequent patterns of dataset with D10k – C20 – I2.5 – Ns 500 – Ni 2,500 – N500

The second experiment uses a medium dataset of 100K sequences with 10K different event types and varies the minimum support threshold from 1 % to 0.5 %. The number of sequences of the medium dataset is 10 times that of the small dataset.

Fig. 6-4 and Fig. 6-5 show the running time and the number of temporal patterns on the medium dataset for different minimum support thresholds, respectively. Fig. 6-6 shows the length distribution of the frequent temporal patterns. The dataset contains a large number of frequent temporal patterns (2,880) when the minimum support is reduced to 0.5 %.

CTMiner takes 4,695 seconds, which is 4 times faster than TPrefixSpan (18,789 seconds), more than 8 times faster than IEMiner (38,678 seconds) and more than 12 times faster than H-DFS (59,489 seconds).

Figure 6-4 Performance of the four algorithms on data set with D100k – C10 – I2.5 – Ns 500 – Ni 2,500 – N10k

Figure 6-5 The number of generated frequent patterns on dataset with D100k – C20 – I2.5 –

Figure 6-6 The pattern length distribution of frequent patterns on dataset with D100k – C20 – I2.5 – Ns 500 – Ni 2,500 – N10k.

The third experiment uses a large dataset of 200K sequences with 10K different event types and varies the minimum support threshold from 1 % to 0.5 %. The large dataset contains twice as many sequences as the medium dataset. As shown in Fig. 6-7, when the minimum support is reduced to 0.5 %, CTMiner takes 6,257 seconds, which is more than 4 times faster than TPrefixSpan (26,751 seconds) and more than 9 times faster than IEMiner (56,972 seconds), while H-DFS never terminates in the experiment. Fig. 6-8 and Fig. 6-9 show the number of temporal patterns and the length distribution of the frequent temporal patterns under different minimum supports. The experiments indicate that, even with extremely low support and a large number of frequent patterns, the CTMiner algorithm is still efficient and outperforms the state-of-the-art algorithms.
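The speedup factors quoted in the three runtime experiments can be rechecked from the reported running times. The short sketch below only restates the numbers given in the text; the dictionary layout and the function name `speedups` are chosen for illustration, and `None` marks the H-DFS run that did not terminate.

```python
# Running times in seconds, as reported in the text for the lowest support
# threshold of each experiment. None marks a run that never terminated.
runtimes = {
    "small (10K), min_sup 1%": {
        "CTMiner": 509, "TPrefixSpan": 2532, "IEMiner": 4337, "H-DFS": 6676,
    },
    "medium (100K), min_sup 0.5%": {
        "CTMiner": 4695, "TPrefixSpan": 18789, "IEMiner": 38678, "H-DFS": 59489,
    },
    "large (200K), min_sup 0.5%": {
        "CTMiner": 6257, "TPrefixSpan": 26751, "IEMiner": 56972, "H-DFS": None,
    },
}


def speedups(times, baseline="CTMiner"):
    """Ratio of each algorithm's runtime to the baseline runtime."""
    base = times[baseline]
    return {
        name: round(t / base, 2)
        for name, t in times.items()
        if name != baseline and t is not None
    }


for experiment, times in runtimes.items():
    print(experiment, speedups(times))
```

The ratios come out at roughly 5, 8.5 and 13 on the small dataset, 4, 8.2 and 12.7 on the medium dataset, and 4.3 and 9.1 on the large dataset, which matches the factors stated above.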


Figure 6-7 Performance of the four algorithms on data set with D200k – C10 – I2.5 – Ns 500 – Ni 2,500 – N10k

Figure 6-8 The number of generated frequent pattern on dataset with D200k – C20 – I2.5– Ns 500 – Ni 2,500 – N10k

Figure 6-9 The pattern length distribution of frequent patterns on dataset with D200k – C20 – I2.5 – Ns 500 – Ni 2,500 – N10k.

6.1.2 Discussion of memory usage

The memory usage of the four algorithms on the medium dataset of 100K sequences is shown in Fig. 6-10. The H-DFS algorithm uses a huge amount of memory, from 712 MB to 2.04 GB, because it stores a large number of records related to frequent 2-patterns and uses a brute-force enumeration mining strategy. The TPrefixSpan algorithm also uses a lot of memory, from 524 MB to 684 MB, because it creates projected databases during mining and has no memory indexing technique. The proposed CTMiner algorithm uses memory indexing to reduce the space needed to store the projected databases generated by the proposed multi-projection scheme. Its memory usage, from 153 MB to 330 MB, is also acceptable. Finally, IEMiner only temporarily preserves candidate patterns and intermediate patterns, i.e., prefixes of candidate patterns, which are generated by its support counting scheme, and it needs only one database scan. IEMiner uses the least memory, from 61 MB to 96 MB, but its complex representation makes its running time in the experiments unacceptable.

Figure 6-10 Memory usage comparison of the four algorithms on data set with D100k – C10 – I2.5 – Ns 500 – Ni 2,500 – N10k

6.1.3 Scalability

In this section, we study the scalability of the CTMiner algorithm. Fig. 6-11 and Fig. 6-12 show the results of the scalability tests, with the database size ranging from 100K to 500K sequences and the minimum support threshold varying from 3% to 1%. The parameters of the dataset are fixed as follows: |C|=10, |T|=2.5, |S|=4, |I|=1.25, Ns=5,000, Ni=25,000 and N=10K. The average length of the sequences is 25 and the number of event types in the database is 10K. As the size of the database increases and the minimum support decreases, the processing time of CTMiner increases, since the number of frequent patterns also increases, as shown in Fig. 6-11 and Fig. 6-12. As can be seen, CTMiner scales linearly under the different minimum support thresholds, owing to the compact and efficient coincidence representation and to the fact that the proposed algorithm does not require candidate generation and testing. Even when the number of frequent patterns is large, the runtime of CTMiner still increases linearly with respect to the database size.

Figure 6-11 Scalability test of the CTMiner algorithm with different database size and minimum supports.

Figure 6-12 The number of generated frequent patterns with different database sizes and minimum supports.


6.2 Experiment on Real world dataset

In addition to using synthetic datasets, we have also performed experiments on a real-world dataset to compare the performance and to validate the applicability of time interval-based pattern mining. The database used in the experiments contains 1,098,142 library records (including borrowing and returning) collected over the last three years from the National Chiao Tung University Library. The experimental database includes 206,844 books, i.e., N=206,844, and 28,339 readers, i.e., |D|=28,339. An event is constructed from a book ID and its associated borrowing and returning times. The size of the database is the number of sequences in the database, i.e., 28,339 readers in total. Fig. 6-13 shows the running time of the four temporal pattern mining algorithms with the minimum support threshold varying from 0.1 % to 0.05 %, and the number of detected patterns under the different thresholds is shown in Fig. 6-14. The length distribution of the frequent temporal patterns is shown in Fig. 6-15. The experiments show that when the minimum support is greater than 0.1 %, most of the generated frequent patterns are of length one or two. As the minimum support drops to 0.05 %, there are 14,549 frequent patterns and CTMiner takes 4,771 seconds, which is about 2 times faster than TPrefixSpan (8,235 seconds) and about 4 times faster than IEMiner (15,424 seconds), while H-DFS never terminates.
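The construction of events from lending records can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(reader_id, book_id, borrow_time, return_time)`, the integer time stamps and the names `build_sequences` and `min_support_count` are hypothetical. The sketch also assumes that a pattern is frequent when the number of supporting sequences is at least the minimum support fraction times |D|, rounded up.

```python
from collections import defaultdict
from fractions import Fraction
from math import ceil


def build_sequences(records):
    """Group lending records into one interval-event sequence per reader.

    records: iterable of (reader_id, book_id, borrow_time, return_time).
    Each event is (book_id, borrow_time, return_time), sorted by start time.
    """
    sequences = defaultdict(list)
    for reader, book, start, end in records:
        if end < start:
            raise ValueError("return time precedes borrowing time")
        sequences[reader].append((book, start, end))
    for events in sequences.values():
        events.sort(key=lambda e: (e[1], e[2], e[0]))
    return dict(sequences)


def min_support_count(min_sup_percent, n_sequences):
    """Smallest number of sequences that reaches the given support in percent."""
    # Fraction(str(...)) avoids floating-point rounding right at the threshold.
    return ceil(Fraction(str(min_sup_percent)) / 100 * n_sequences)


records = [
    ("r1", "B2", 5, 9),
    ("r1", "B1", 1, 6),
    ("r2", "B1", 3, 4),
]
sequences = build_sequences(records)
lowest = min_support_count(0.05, 28339)
highest = min_support_count(0.1, 28339)
```

Under these assumptions, a minimum support of 0.05 % over 28,339 readers corresponds to 15 readers, and 0.1 % corresponds to 29 readers.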

We apply the CTMiner algorithm to the book borrowing dataset to extract the readers' behavior. The experimental results show that many frequent patterns are related to a TV series or a series of books. For instance, the frequent temporal pattern “Friends of season 1 overlaps Friends of season 2” ^ “Friends of season 1 before Friends of season 3” ^ “Friends of season 2 before Friends of season 3” reflects the users' behavior when borrowing a TV series. When a user wants to borrow a TV series, he or she is likely to hold as many videos as possible at the same time.
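The relations named in this pattern can be checked mechanically for a pair of intervals. The sketch below assumes that “overlaps” and “before” are used in the sense of Allen's interval relations [7]; the function name `allen_relation` and the borrowing periods in the example are hypothetical and are not taken from the library data.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a to interval b.

    Intervals are (start, end) pairs with start < end. Inverse relations
    are reported with an "-inverse" suffix.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    # Every remaining case is the inverse of one of the relations above.
    return allen_relation(b, a) + "-inverse"


# Hypothetical borrowing periods (day numbers) for three seasons.
season1, season2, season3 = (1, 10), (5, 14), (15, 20)
relations = (
    allen_relation(season1, season2),
    allen_relation(season1, season3),
    allen_relation(season2, season3),
)
```

With these example periods the three relations are overlaps, before and before, which is exactly the pattern quoted above.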

Figure 6-13 Experimental result of the CTMiner algorithm with varying minimal supports on real dataset.

Figure 6-14 The number of generated frequent patterns with varying minimum support.

Figure 6-15 The pattern length distribution of frequent patterns on the real dataset with varying minimum support.

The experimental results on both synthetic and real data are consistent with our expectations. The CTMiner algorithm exhibits both scalability and efficiency in the experiments.


Chapter 7

Conclusion and Future Work

In this thesis, we observe and analyze the drawbacks of previously proposed temporal representations, which stem from the complex temporal relations among interval-based events. We then propose an unambiguous, scalable and efficient coincidence representation, based on the concept of the coincidence-slice, to address the complex relationships among events that make temporal mining algorithms inefficient. Based on the coincidence representation, we also develop a pattern growth-based algorithm called CTMiner, which borrows the idea of the PrefixSpan algorithm and requires no candidate generation. Exploiting the characteristics of the coincidence representation, we further propose three pruning strategies to reduce the search space and avoid meaningless processing, which improves the performance of CTMiner. To make a coincidence pattern comprehensible, we discover all the relations in the pattern and present them in a relation list representation.

Experiments on synthetic datasets and real world datasets of library lending demonstrate the efficiency and scalability of our proposed algorithm.

Many extensions for interval-based events can be developed on top of the CTMiner algorithm, such as mining partial orders of temporal patterns, closed patterns and maximal patterns, as well as incremental mining and classification. The notion of mining partial orders of temporal patterns was introduced in [24], and an interesting approach for closed sequential patterns was recently proposed in [25]. Many real-life sequence databases grow incrementally. It is undesirable to mine sequential patterns from scratch each time a small set of sequences grows or some new sequences are added to the database. An incremental mining algorithm was proposed in [26]. However, these methods again assume that the events are instantaneous. The proposed algorithm provides the opportunity to design more efficient algorithms in such extended research.

Bibliography

[1] R. Agrawal and R. Srikant. “Mining Sequential Patterns,” Proceedings of 11th International Conference on Data Engineering. (ICDE’95), pp. 3-14, 1995.

[2] F. Masseglia, F. Cathala and P. Poncelet. “The PSP Approach for Mining Sequential Patterns,” European Conference on Principles of Data Mining and Knowledge Discovery (PKDD’98), vol. 1510, pp. 176-184, 1998.

[3] R. Srikant and R. Agrawal. “Mining Sequential patterns: Generalizations and Performance Improvements,” Proceedings of 5th International Conference on Extended Database Technology (EDBT’96), 1996.

[4] M. J. Zaki. “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, vol. 42, numbers 1-2, pp. 31-60, 2001.

[5] J. Pei, J. Han, B. Mortazavi-Asl, H. Pito, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proceedings of 17th International Conference on Data Engineering (ICDE’01), pp. 215-224, 2001.

[6] M.Y. Lin and S.Y. Lee. “Fast discovery of sequential patterns by memory indexing,” Proceedings of 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK’02), pp. 227-237, 2002.

[7] J.F Allen. “Maintaining Knowledge about Temporal Intervals,” Communications of ACM, vol.26, issue 11, pp.832-843, 1983.

[8] P. Kam and W. Fu. “Discovering Temporal Patterns for Interval-based Events,” International Conference on Data Warehousing and Knowledge Discovery (DaWaK’00), vol. 1874, pp. 317-326, 2000.

[9] F. Hoppner. “Discovery of Temporal Patterns: Learning Rules about the Qualitative Behaviour of Time Series,” European Conference on Principles of Data Mining and Knowledge Discovery (PKDD’01), vol. 2168, pp. 192-203, 2001

[10] P. Papapetrou, G. Kollios, S. Sclaroff, and D. Gunopulos, “Discovering frequent arrangements of temporal intervals,” International Conference on Data Mining (ICDM’05), pp. 647-661, 2005.

[11] S. Wu and Y. Chen. “Mining Nonambiguous Temporal Patterns for Interval-Based Events,” IEEE Transactions on Knowledge and Data Engineering (TKDE’07), vol.19, issue 6, pp. 742-758, 2007.

[12] F. Morchen and A. Ultsch. “Efficient Mining of Understandable Patterns from Multivariate Interval Time Series,” Data Mining Knowledge Discovery, vol. 15, number 2, pp.181-215, 2007.

[13] D. Patel, W. Hsu and M. Lee. “Mining Relationships Among Interval-based Events for Classification,” Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 393-404, 2008.

[14] E. Winarko and J.F Roddick. “ARMADA-An algorithm for discovering richer relative temporal association rules from interval-based data,” Data & Knowledge Engineering, vol. 63, issue 1, pp. 76-90, 2007.

[15] M. J. Zaki and C. J. Hsiao. “CHARM: An Efficient algorithm for Closed Itemset Mining,” Proceedings of 2nd SIAM International Conference on Data Mining (SDM’02), pp. 457-478, 2002.

[16] X. Yan, H. Cheng, J. Han and D. Xin, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” Proceedings of 3rd SIAM International Conference on Data Mining (SDM’03), pp 166-177, 2003.

[17] M.-S. Chen, J. Han, and P.S. Yu, “Data Mining: An Overview from a Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, 1996.

[18] J. Han and M. Kamber, “Data Mining: Concepts and Techniques.” Academic Press, 2001.

[19] H. Mannila, H. Toivonen, and I. Verkamo, “Discovery of frequent episodes in event sequences,” ACM Special Interest Group on Knowledge Discovery and Data Mining, SIGKDD, 1995.

[20] T.B. Ho, T.D. Nguyen, S. Kawasaki, S.Q. Le, D.D. Nguyen, H. Yokoi, and K. Takabayashi. “Mining hepatitis data with temporal abstraction,” Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’03), pp. 369-377, 2003.

[21] C. Antunes and A. L. Oliveira, “Generalization of pattern-growth methods for sequential pattern mining with gap constraints,” Machine Learning and Data Mining in Pattern Recognition, vol. 2734, pp. 239-251, 2003.

[22] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. “Sequential pattern mining using a bitmap representation,” Proceedings of the 8th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining (SIGKDD’02), pp. 429-435, 2002.

[23] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining,” Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’00), pp. 355-359, 2000.

[24] Mannila H and Toivonen H, “Discovery generalized episodes using minimal occurrences”, Proceedings of ACM SIGMOD, pp. 146-151, 1996.

[25] Casas-Garriga G, “Summarizing sequential data with closed partial orders,” Proceedings of SDM, 2004.

[26] H. Cheng, X. Yan and J. Han, “IncSpan: incremental mining of sequential patterns in large database,” Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 527-532, 2004.

[27] K. Gouda, M. J. Zaki, “Efficient Mining Maximal frequent itemsets,” International Conference on Data Mining (ICDM’01), pp. 163, 2001.
