
Chapter 4 Experiments and Analysis

4.1 Experimental Environment and Data Sets

We conducted experiments to evaluate the performance of the proposed algorithm. Two other approaches, meta-erasable [7] and fastupdate-erasable [16], were included for comparison.

The experiments were implemented in Java and executed on a notebook with a 2.5 GHz CPU and 12 GB of memory. The dataset was generated as follows.

Assume we want to generate N products, each of which consists of some material items. The number of material items in each product was first randomly generated between 1 and a given maximum length L. After the length of a product was generated, each item ID was generated randomly between 0 and I, where I + 1 is the total number of material items in the whole database. Note that the first item ID in the dataset was 0 and the same item ID could not be generated twice for a product. The profit of a product was then set as the sum of the ID values of the items in the product. For example, six products are generated in Figure 4.1. The first product is (2, 5, 7), in which the first two numbers, 2 and 5, are the item IDs and the last number, 7, is the profit of the product. In this dataset the profit is computed as 2 + 5 = 7.
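The generation procedure described above can be sketched in Java as follows. This is only an illustrative sketch, not the program used in the thesis: the class name `ProductGenerator`, the `Product` type, the `seed` parameter, and the use of shuffling to avoid duplicate item IDs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical generator following the textual description; not the thesis code. */
final class ProductGenerator {

    /** A product: distinct item IDs and a profit equal to the sum of those IDs. */
    static final class Product {
        final List<Integer> items;
        final int profit;

        Product(List<Integer> items, int profit) {
            this.items = items;
            this.profit = profit;
        }
    }

    /**
     * @param n         number of products to generate
     * @param maxItemId largest item ID, I (IDs run from 0 to I, so I + 1 items exist)
     * @param maxLength maximum number of items per product, L
     * @param seed      seed for reproducibility (an assumption; the text names none)
     */
    static List<Product> generate(int n, int maxItemId, int maxLength, long seed) {
        if (maxLength < 1 || maxLength > maxItemId + 1) {
            throw new IllegalArgumentException("need 1 <= L <= I + 1");
        }
        Random rnd = new Random(seed);
        List<Integer> pool = new ArrayList<>();
        for (int id = 0; id <= maxItemId; id++) {
            pool.add(id);
        }
        List<Product> products = new ArrayList<>(n);
        for (int k = 0; k < n; k++) {
            int length = 1 + rnd.nextInt(maxLength);   // uniform in [1, L]
            Collections.shuffle(pool, rnd);            // first `length` entries are distinct IDs
            List<Integer> items = new ArrayList<>(pool.subList(0, length));
            Collections.sort(items);                   // sorted only for readable output
            int profit = 0;
            for (int id : items) {
                profit += id;                          // profit = sum of item IDs
            }
            products.add(new Product(items, profit));
        }
        return products;
    }

    public static void main(String[] args) {
        // Six products, IDs 0..20, at most 4 items each.
        for (Product p : generate(6, 20, 4, 1L)) {
            System.out.println(p.items + " -> profit " + p.profit);
        }
    }
}
```

Drawing the items from a shuffled pool is one simple way to guarantee that no item ID appears twice in a product; rejection sampling would serve equally well.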

Figure 4.1: The way of generating a product database

4.2 Experimental Results

This section presents the experimental results on the generated datasets.

Experiments were conducted on the dataset produced by the random generation procedure of Section 4.1. Table 4.1 shows the parameter settings.

Table 4.1: The parameter setting on the dataset

Parameter   Value        Description
r           0.5          The maximum erasable threshold
ε           0.05         The quasi-erasable parameter
P           100000       The number of original products
N           1000, 5000   The two numbers of newly inserted products
I           20           The number of items
length      4            The maximum number of items in each product

The following three algorithms are compared: ε-quasi-erasable mining, fastupdate-erasable mining [16], and meta-erasable mining [7]. Each time, 1000 new products were inserted and merged into the original product database. Figure 4.2 shows the execution time of the three algorithms. The y-axis represents the execution time in seconds, and the x-axis denotes the inserting runs.

Figure 4.2: The execution time of the three algorithms for N = 1000

In Figure 4.2, the proposed ε-quasi-erasable mining algorithm was the fastest among the three, the fastupdate-erasable was the second fastest, and the meta-erasable was the slowest.

The reason was that the ε-quasi-erasable mining algorithm did not need to scan the product database in the first several runs, because the accumulated gain value of the new products had not yet reached the allowed gain value, which was calculated from the formula in Section 3.1. In the eleventh run, the execution time rose abruptly because the accumulated gain value of the new products had exceeded the allowed gain value and the old product database needed to be rescanned, which took considerable time. Even in this case, its execution time in that run was still no larger than that of the fastupdate-erasable algorithm. After that run, the accumulated gain value was reset and the execution time became small again. In contrast, the fastupdate-erasable algorithm always needed to rescan the product database for the itemsets that were non-erasable in the original product database but erasable in the new products, so it was slower than the ε-quasi-erasable mining algorithm. It could, however, save the rescanning time in the other cases, and thus it was faster than the meta-erasable mining algorithm.
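The rescanning behaviour described above can be illustrated with a small Java sketch. It is a simplified model under stated assumptions rather than the proposed algorithm itself: the class `RescanTrigger` and its methods are hypothetical, the allowed gain is passed in as a plain number instead of being computed from the formula in Section 3.1, a rescan is assumed to fire only when the accumulated gain strictly exceeds the allowed gain, and the allowed gain is kept constant after a rescan.

```java
/** Hypothetical sketch of the rescan trigger; the allowed gain is supplied by the caller. */
final class RescanTrigger {
    private final long allowedGain;    // assumed to come from the formula in Section 3.1
    private long accumulatedGain = 0;  // gain of the new products since the last rescan

    RescanTrigger(long allowedGain) {
        this.allowedGain = allowedGain;
    }

    /**
     * Records one insertion run. Returns true when the accumulated gain exceeds
     * the allowed gain; the original database must then be rescanned and the
     * accumulator is reset.
     */
    boolean insertBatch(long batchGain) {
        accumulatedGain += batchGain;
        if (accumulatedGain > allowedGain) {
            accumulatedGain = 0;
            return true;
        }
        return false;
    }

    /** Returns the 1-based index of the first run that triggers a rescan. */
    static int firstRescanRun(long allowedGain, long gainPerBatch) {
        if (gainPerBatch <= 0) {
            throw new IllegalArgumentException("gainPerBatch must be positive");
        }
        RescanTrigger trigger = new RescanTrigger(allowedGain);
        int run = 1;
        while (!trigger.insertBatch(gainPerBatch)) {
            run++;
        }
        return run;
    }

    public static void main(String[] args) {
        // Hypothetical gain values, chosen only to illustrate the pattern.
        System.out.println(firstRescanRun(10_500L, 1_000L)); // prints 11
        System.out.println(firstRescanRun(10_500L, 5_000L)); // prints 3
    }
}
```

With these illustrative numbers, batches that each contribute one fifth as much gain take roughly five times as many runs to trigger a rescan, which is the qualitative pattern reported for the experiments.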

Figure 4.3 shows the numbers of erasable itemsets and ε-quasi-erasable itemsets. The y-axis represents the number of itemsets and the x-axis denotes the inserting run. Note that all three algorithms produced the same erasable itemsets, and only the proposed algorithm needed to keep the ε-quasi-erasable itemsets, which were the overhead the algorithm paid for fast execution. In this experiment, the number of erasable itemsets was about 250, and the number of ε-quasi-erasable itemsets was about 100, which was small compared to 250. Keeping the additional ε-quasi-erasable itemsets was therefore worthwhile, since it significantly reduced the execution time. Additionally, the numbers of erasable itemsets in the runs were about the same because a small number of new products would not affect the whole product database much.

Figure 4.3: The numbers of erasable and ε-quasi-erasable itemsets obtained by the proposed algorithm for N= 1000

Experiments were then conducted for inserting 5000 new products each time. Except for the number of newly inserted products, the other parameters were set at the same values as those [...] gain value was reached about every three runs.

Figure 4.4: The execution time of the three algorithms for N = 5000

A comparison of Figures 4.2 and 4.4 shows that the experiments for N = 5000 reached peak times more frequently than those for N = 1000. This phenomenon is consistent with the theorem proposed previously. Additionally, the peak time increased with the run number because the size of the original product database grew after each run.

Figure 4.5 then shows the numbers of erasable and ε-quasi-erasable itemsets for inserting 5000 products. Like the results in Figure 4.3, the number of ε-quasi-erasable itemsets was much smaller than that of erasable itemsets if the quasi-erasable parameter ε was small.

Figure 4.5: The numbers of erasable and ε-quasi-erasable itemsets obtained by the proposed algorithm for N = 5000

Next, experiments were conducted for different numbers of new products, including N = 1000, 3000, 5000, 7000 and 9000. The size of the original product database was 100000, with r = 0.5 and ε = 0.05. Figure 4.6 shows the execution time of the first inserting run for different N values. None of these executions needed to scan the original product database.

[Figure 4.6: The execution time of the first inserting run for different N values. x-axis: number of new products; y-axis: execution time (Sec)]

It can be observed from Figure 4.6 that the more products were inserted, the more execution time was needed, and that the relationship between the number of new products and the execution time was nearly linear. The reason was that the ε-quasi-erasable mining algorithm needed to scan the new products, so the execution time was proportional to the number of new products. Figure 4.7 shows the numbers of erasable and ε-quasi-erasable itemsets for different numbers of new products. Nearly the same quantities of erasable and ε-quasi-erasable itemsets were generated for the different numbers of new products.

Figure 4.7: The numbers of erasable and ε-quasi-erasable itemsets for different N values

Experiments were then conducted to show the effect of the original product database size. For inserting 1000 new products, the execution time for different sizes of original product databases is shown in Figure 4.8. The execution time was measured for the first inserting run, in which no rescan was needed.

Figure 4.8: The execution time without rescanning for different sizes of original product databases.

It can be seen from Figure 4.8 that the execution time was nearly the same for different sizes of original product databases. This was because the execution process did not need to rescan the original product databases but only needed to check the erasable and ε-quasi-erasable itemsets. Since the numbers of erasable and ε-quasi-erasable itemsets were nearly the same, as shown in Figure 4.9, the execution time was about equal as well.

[Plot axes: x-axis, number of original products; y-axis, execution time (Sec)]

Figure 4.9: The numbers of erasable and ε-quasi-erasable itemsets without rescanning for different sizes of original product databases.

In Figure 4.9, the number of the ε-quasi-erasable itemsets was much smaller than that of the erasable itemsets when ε was small (ε = 0.05 in this case). Besides, different sizes of original product databases generated nearly the same numbers of erasable (and ε-quasi-erasable) itemsets because the data distribution was about the same.

Next, experiments were conducted to measure the execution time with rescanning for different numbers of original products. When the original product database size was 25000, the rescanning occurred in the third inserting run, so the x-value of the corresponding point was 27000. When the original product database size was 50000, the rescanning occurred in the sixth inserting run, so the x-value of the corresponding point was 55000. Finally, when the original product database size was 100000, the rescanning occurred in the eleventh inserting run, so the x-value of the corresponding point was 110000. The results are shown in Figure 4.10.
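The three x-values quoted above are consistent with taking the size of the product database just before the run in which rescanning occurred, with 1000 products inserted per run. The short Java sketch below reproduces that arithmetic; the class and method names are hypothetical, and the interpretation of the x-value is inferred from the reported numbers.

```java
/** Worked arithmetic for the x-values of the rescanning points (names are illustrative). */
final class RescanPoint {

    /** Number of products in the database just before the given 1-based insertion run. */
    static long sizeBeforeRun(long originalSize, long batchSize, int run) {
        return originalSize + (long) (run - 1) * batchSize;
    }

    public static void main(String[] args) {
        System.out.println(sizeBeforeRun(25_000L, 1_000L, 3));   // prints 27000
        System.out.println(sizeBeforeRun(50_000L, 1_000L, 6));   // prints 55000
        System.out.println(sizeBeforeRun(100_000L, 1_000L, 11)); // prints 110000
    }
}
```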

Figure 4.10: The execution time with rescanning for different sizes of original product databases

Note that in Figure 4.10, the execution time was measured for the runs that needed rescanning. It could be observed from Figure 4.10 that the bigger the original product database size was, the more execution time was needed. The relationship between the size of the original product database and the execution time was nearly linear. The reason was that when rescanning was needed, the ε-quasi-erasable mining algorithm scanned the original databases to get the actual gain values of some candidates. So, when more original products were in the database, more execution time would be spent. The execution time would be proportional to the size of the original product database.

Figure 4.11 shows the numbers of erasable and ε-quasi-erasable itemsets with rescanning for different sizes of original product databases. The results were similar to those in Figure 4.9. The number of ε-quasi-erasable itemsets was much smaller than that of erasable itemsets when ε was small. Besides, different sizes of original product databases generated nearly the same numbers of erasable (and ε-quasi-erasable) itemsets because the data distribution was nearly the same.

Figure 4.11: The number of erasable and ε-quasi-erasable itemsets with rescanning for different sizes of original product databases

Next, experiments were conducted for different maximum erasable thresholds, including r = 10%, 30%, 50%, 70%, and 90%. The other parameter values were the same as those in Table 4.1. Figure 4.12 shows the numbers of erasable itemsets and ε-quasi-erasable itemsets for inserting the first 1000 new products.

Figure 4.12: The numbers of erasable itemsets and ε-quasi-erasable itemsets for different r values and ε = 0.05

It was observed from this figure that the bigger the threshold r was, the more erasable itemsets were generated. This follows directly from the definition of erasable itemsets.
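The dependence on r can be made concrete with a minimal Java sketch. It assumes the definition commonly given for erasable itemsets in [7]: the gain of an itemset is the total profit of the products that contain at least one of its items, and the itemset is erasable when that gain does not exceed r times the total profit of the database. The class `ErasableCheck` and the four-product toy database are hypothetical; only the first product, (2, 5, 7), is taken from Figure 4.1.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical erasability check under the definition assumed in the lead-in. */
final class ErasableCheck {

    static final class Product {
        final Set<Integer> items;
        final long profit;

        Product(Set<Integer> items, long profit) {
            this.items = items;
            this.profit = profit;
        }
    }

    /** Gain of an itemset: total profit of products containing at least one of its items. */
    static long gain(Set<Integer> itemset, List<Product> db) {
        long g = 0;
        for (Product p : db) {
            for (int item : itemset) {
                if (p.items.contains(item)) {
                    g += p.profit;
                    break;              // count each product at most once
                }
            }
        }
        return g;
    }

    /** An itemset is erasable when its gain is at most r times the total profit. */
    static boolean isErasable(Set<Integer> itemset, List<Product> db, double r) {
        long total = 0;
        for (Product p : db) {
            total += p.profit;
        }
        return gain(itemset, db) <= r * total;
    }

    /** Toy database with total profit 20; profits equal the sums of the item IDs. */
    static List<Product> sampleDb() {
        return List.of(
            new Product(Set.of(2, 5), 7),
            new Product(Set.of(1, 3), 4),
            new Product(Set.of(2, 4), 6),
            new Product(Set.of(3), 3));
    }

    public static void main(String[] args) {
        List<Product> db = sampleDb();
        System.out.println(gain(Set.of(2), db));            // prints 13
        System.out.println(isErasable(Set.of(2), db, 0.5)); // prints false (13 > 10)
        System.out.println(isErasable(Set.of(2), db, 0.7)); // prints true  (13 <= 14)
    }
}
```

In this toy database the itemset {2} is not erasable at r = 0.5 but becomes erasable at r = 0.7, which illustrates why a larger threshold can only add erasable itemsets.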

The number of the ε-quasi-erasable itemsets depended on the dataset because the quasi-erasable parameter was fixed in Figure 4.12. Figure 4.13 shows the execution time of the ε-quasi-erasable mining algorithm with different thresholds.

Figure 4.13: The execution time of the proposed algorithm with different thresholds

The execution time increased with the r value. The reason was that more erasable itemsets were generated when the threshold was bigger, which required more execution time.

Experiments were then conducted for different values of the quasi-erasable parameter ε. The numbers of erasable and ε-quasi-erasable itemsets for ε = 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, and 10% are shown in Figure 4.14.

[Plot axes: x-axis, maximum erasable threshold (10% to 90%); y-axis, execution time (Sec)]

Figure 4.14: The numbers of erasable itemsets and ε-quasi-erasable itemsets of the proposed algorithm with different ε values and r = 0.5

In Figure 4.14, the bigger the ε value was, the more ε-quasi-erasable itemsets would be generated in the ε-quasi-erasable mining algorithm. It was reasonable because a bigger ε value would cause a larger range of ε-quasi-erasable itemsets. Note that the numbers of erasable itemsets for different ε values are the same since the threshold r is fixed at 0.5. The experimental results for comparing the execution time of different quasi-erasable parameter values are shown in Figure 4.15.
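The effect of ε on the two counts can be sketched in Java as follows. The precise definition of ε-quasi-erasable itemsets is given earlier in the thesis and is not reproduced in this chapter, so the sketch assumes, by analogy with pre-large itemsets [12], that an itemset is ε-quasi-erasable when its gain is greater than r times the total profit but at most (r + ε) times the total profit. The class `QuasiErasable`, the gain values, and the total profit are hypothetical.

```java
/** Hypothetical classification under the ASSUMED range (r, r + epsilon] for quasi-erasable. */
final class QuasiErasable {

    enum Status { ERASABLE, QUASI_ERASABLE, NON_ERASABLE }

    static Status classify(long gain, long totalProfit, double r, double epsilon) {
        if (gain <= r * totalProfit) {
            return Status.ERASABLE;
        }
        if (gain <= (r + epsilon) * totalProfit) {
            return Status.QUASI_ERASABLE;   // assumed definition, see lead-in
        }
        return Status.NON_ERASABLE;
    }

    /** Counts how many of the given gains fall into the wanted status. */
    static int count(long[] gains, long totalProfit, double r, double epsilon, Status wanted) {
        int n = 0;
        for (long g : gains) {
            if (classify(g, totalProfit, r, epsilon) == wanted) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        long[] gains = {400, 505, 520, 560, 590, 700}; // illustrative gains, total profit 1000
        for (double eps : new double[] {0.01, 0.05, 0.10}) {
            System.out.println("eps=" + eps
                + " erasable=" + count(gains, 1000, 0.5, eps, Status.ERASABLE)
                + " quasi=" + count(gains, 1000, 0.5, eps, Status.QUASI_ERASABLE));
        }
    }
}
```

Under this assumption the erasable count depends only on r, while widening ε can only move itemsets from the non-erasable class into the quasi-erasable class, matching the trend described for Figure 4.14.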

Figure 4.15: The execution time of the proposed algorithm with different ε and r = 0.5

In Figure 4.15, the execution time slightly increased along with the increase of the ε value. This was because when the ε value increased, more ε-quasi-erasable itemsets were generated, causing more checking time.

Experiments were then conducted to compare the execution time for different numbers of items, with I = 10, 15, 20, and 25. The results are shown in Figure 4.16.

Figure 4.16: The execution time for different numbers of items with ε = 0.05 and r = 0.5

It was observed that the more items were used in manufacturing the products, the more execution time was spent. The reason was that the numbers of both erasable itemsets and ε-quasi-erasable itemsets increased as the number of items increased. When more candidate itemsets were generated, the corresponding execution time increased as well. The numbers of erasable and ε-quasi-erasable itemsets for different numbers of items are shown in Figure 4.17.

Figure 4.17: The numbers of itemsets for different numbers of items with ε = 0.05 and r = 0.5

The numbers of erasable itemsets and ε-quasi-erasable itemsets tended to increase with the number of items. The reason was that when more items existed, the number of possible itemsets increased as well.
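To give a sense of how quickly the candidate space grows, a set of n distinct items has 2^n - 1 non-empty subsets. The Java sketch below evaluates this upper bound for the four item counts used in the experiment; the class name is hypothetical, and the bound concerns only the number of possible itemsets, not the number actually examined by any of the algorithms.

```java
/** Upper bound on the number of candidate itemsets over n items (illustrative helper). */
final class ItemsetSpace {

    /** Number of non-empty subsets of a set with numItems elements: 2^n - 1. */
    static long nonEmptyItemsets(int numItems) {
        if (numItems < 0 || numItems > 62) {
            throw new IllegalArgumentException("numItems must be in [0, 62]");
        }
        return (1L << numItems) - 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 15, 20, 25}) {
            System.out.println(n + " items -> " + nonEmptyItemsets(n) + " possible itemsets");
        }
        // 10 -> 1023, 15 -> 32767, 20 -> 1048575, 25 -> 33554431
    }
}
```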

[Plot axes: x-axis, number of items; y-axis, number of itemsets; series: erasable itemsets, ε-quasi-erasable itemsets]

Chapter 5 Conclusion and Future Work

In this thesis, we propose the concept of ε-quasi-erasable itemsets and use it to improve the mining performance of erasable itemsets in dynamic environments. When the previous fastupdate-erasable algorithm handles its case 3, much time is spent because the original product database needs to be rescanned. We therefore define ε-quasi-erasable itemsets to accelerate execution. We also prove that the proposed approach can significantly reduce the number of rescans if the number of new products is small. Experiments were conducted to evaluate the efficiency of the proposed mining algorithm. The results demonstrate that the proposed algorithm executes faster than both the meta-erasable algorithm and the fastupdate-erasable algorithm in incremental mining for certain parameter settings.

In the future, we will extend ε-quasi-erasable itemset mining to handle the deletion and modification of products. In addition, we will use different data structures such as trees [11][22] to improve efficiency.

References

[1] R. Agrawal, T. Imielinski and A. Swami, “Mining association rules between sets of items in large databases,” The ACM SIGMOD International Conference on Management of Data, pp. 207–216, 1993.

[2] R. Agrawal and R. Srikant, “Fast algorithm for mining association rules,” The 20th International Conference on Very Large Data Bases, pp. 487–499, 1994.

[3] K. R. Anil and R. S. Gladston, “N-list based friend recommendation system using pre-rule checking,” International Journal of Computer Science and Information Technologies, Vol. 7, pp. 338–341, 2016.

[4] K. Bharati and D. Kanchan, “Comparative study of frequent itemset mining algorithms: FP growth, FIN, prepost + and study of efficiency in terms of memory consumption, scalability and runtime,” International Journal of Technical Research and Applications, Vol. 4, Issue 6, pp. 72–77, 2016.

[5] D. W. Cheung, J. Han, V. T. Ng and C. Y. Wong, “Maintenance of discovered association rules in large databases: Agrawal approach,” The 12th IEEE International Conference on Data Engineering, pp. 106–114, 1996.

[6] F. Coenen, T. Le and B. Vo, “An efficient algorithm for mining erasable itemsets using the difference of NC-Sets,” The IEEE International Conference on Systems, Man, and Cybernetics Manchester, pp. 2270–2274, 2013.

[7] Z. H. Deng, G. D. Fang, Z. H. Wang and X. R. Xu, “Mining erasable itemsets,” The 8th International Conference on Machine Learning and Cybernetics, pp. 12–15, 2009.

[8] Z. H. Deng and X. R. Xu, “Fast mining erasable itemsets using NC_sets,” Expert Systems with Applications, Vol. 39, pp. 4453–4463, 2012.

[9] Z. H. Deng, “Mining top-rank-k erasable itemsets by PID_lists,” International Journal of Intelligent Systems, Vol. 28, Issue 4, pp. 366–379, 2013.

[10] D. K. Garg and M. Shweta, “Searching for the best strategies of mining erasable itemsets,” International Journal of Scientific and Engineering Research, Vol. 4, Issue 7, pp. 2229–5518, 2013.

[11] J. Han, R. Mao, J. Pei and Y. Yin, “Mining frequent patterns without candidate generation: a frequent-pattern tree approach,” Data Mining and Knowledge Discovery, Vol. 8, Issue 1, pp. 53–87, 2004.

[12] T. P. Hong, Y. H. Tao and C. Y. Wang, “A new incremental data mining algorithm using pre-large itemsets,” Intelligent Data Analysis, Vol. 5, Issue 2, pp. 111–129, 2001.

[13] T. P. Hong, G. C. Lan and W. Lin, “An incremental mining algorithm for high utility itemsets,” Expert Systems with Applications, Vol. 39, Issue 8, pp. 7173–7180, 2012.

[14] T. P. Hong, B. Le and B. Vo, “A dynamic bit-vector approach for fast mining frequent closed itemsets,” Expert Systems with Applications, Vol. 39, Issue 8, pp. 7196–7206, 2012.

[15] T. P. Hong, B. Le, T. Le and B. Vo, “An effective approach for maintenance of pre-large-based frequent-itemset lattice in incremental mining,” Applied Intelligence, Vol. 41, Issue 3, pp. 759–775, 2014.

[16] T. P. Hong and K. Y. Lin, “An incremental mining algorithm for erasable itemsets,” The IEEE International Conference on Innovations in Intelligent Systems and Applications, 2017.

[17] W. Li, M. Ogihara, S. Parthasarathy and J. Zaki, “New algorithms for fast discovery of association rules,” The 3rd International Conference on Knowledge Discovery and Data Mining, 1997.

[18] T. Le and B. Vo, “An efficient algorithm for mining erasable itemsets,” Engineering

[19] T. Le, G. Nguyen and B. Vo, “A survey of erasable itemset mining algorithms,” Wires Data Mining Knowledge Discovery, pp. 356–379, 2014.

[20] G. Lee, U. Yun and K. H. Ryu, “Sliding window based weighted maximal frequent pattern mining over data streams,” Expert Systems with Applications, Vol. 41, Issue 2, pp. 694–708, 2014.

[21] G. Lee and U. Yun, “Sliding window based weighted erasable stream pattern mining for stream data applications,” Future Generation Computer Systems, Vol. 59, pp. 1–20, 2016.

[22] A. Lavanya and S. Rajalakshmi, “An associated mining approach for promoting sales of infrequent items using AIF algorithm,” International Journal of Informative and Futuristic Research, Vol. 4, Issue 6, pp. 2347–1697, 2017.

[23] M. R. Patel, “A parallel and distributed method to mine erasable itemsets from high utility patterns,” International Journal of Advanced Research in Computer Science and Electronics Engineering, Vol. 1, Issue 8, pp. 2277–9043, 2012.

[24] J. Shah and S. Shah, “An incremental approach for mining erasable itemsets,” International Journal of Computer Applications, Vol. 121, pp. 0975–8887, 2015.

From the document: A Study on Incremental Quasi-Erasable Itemset Mining (pp. 36-55)