

that the number of candidate patterns is denoted as f (r, l) which is formulated as follows:

f (r, l) =

Obviously, f(r, l) can become very large as r and l increase. As a result, a large number of candidate multi-domain sequential patterns may be generated, degrading the performance of algorithm IndividualMine.

Algorithm PropagatedMine:

By exploring propagation and lattice structures, algorithm PropagatedMine is able to reduce the mining cost. However, algorithm PropagatedMine cannot mine relaxed patterns, since propagation requires the time-instance sets of sequential patterns. An empty set means that no event occurs, so no time information is available for it, and its time-instance set cannot be derived. Consequently, to mine relaxed patterns, algorithm Naive or algorithm IndividualMine should be used.

5.5 Performance Evaluation

To evaluate the performance of our proposed algorithms, we implement a simulation model and conduct extensive experiments. In Section 5.5.1, the simulation model and synthetic datasets are described. Section 5.5.2 is devoted to experimental results.

5.5.1 Simulation Model

We modify the well-known data generator in [3] to generate datasets that include multiple domains. The data generator is widely used in many studies for evaluating the performance of proposed methods [36], and the detailed generation process can be found in [36]. The parameters are summarized in Table 5.7. Explicitly, M denotes the number of domains, D is the number of sequences, C is the average number of elements in a sequence, T is the average number of items in an element, and I is the total number of distinct items. These parameters are modeled in almost the same way as in [3]. For example, dataset M5D10kC10T5I100 has 5 domains, each of which contains 10k sequences, where the average number of elements in a sequence is 10, the average number of items in an element is 5, and the total number of distinct items is 100.

For traditional sequential pattern mining, we use algorithm PrefixSpan, obtained from the IlliMine project (http://illimine.cs.uiuc.edu/). Algorithm PrefixSpan is used in algorithm Naive and in the mining phases of both algorithms IndividualMine and PropagatedMine. Our programs are executed on a platform with an Intel 2.4-GHz Xeon CPU and 3.5 GB of RAM, running FreeBSD 5.0 with GCC 3.2. We use three performance metrics to compare the proposed algorithms: execution time, memory consumption, and the number of mined patterns.
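The dataset naming convention above (e.g., M5D10kC10T5I100) can be decoded mechanically. The following is a minimal sketch rather than part of the original generator: the helper name `parse_dataset_name` is hypothetical, and it assumes that a trailing "k" means a factor of 1000 and that some parameters may be omitted from a name, as in D1kC2T3I100.

```python
import re
from typing import Dict

# Hypothetical helper (not part of the data generator in [3]).
# Assumption: a trailing "k" after the digits means "times 1000".
_NAME_PATTERN = re.compile(r"(?:[MDCTI]\d+k?)+")
_FIELD_PATTERN = re.compile(r"([MDCTI])(\d+)(k?)")


def parse_dataset_name(name: str) -> Dict[str, int]:
    """Decode a dataset name such as 'M5D10kC10T5I100' into its parameters.

    Keys follow Table 5.7: M (domains), D (sequences), C (average elements
    per sequence), T (average items per element), I (distinct items).
    Parameters that do not appear in the name are simply left out.
    """
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"unrecognized dataset name: {name!r}")
    params: Dict[str, int] = {}
    for letter, digits, kilo in _FIELD_PATTERN.findall(name):
        params[letter] = int(digits) * (1000 if kilo else 1)
    return params


if __name__ == "__main__":
    print(parse_dataset_name("M5D10kC10T5I100"))
    print(parse_dataset_name("D1kC2T3I100"))
```

Returning only the parameters that are present keeps the helper usable for the partial names that appear in the experiments, where the varied parameter is omitted.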

5.5.2 Experimental Results

Several experiments were conducted to evaluate the performance and memory consumption of the three algorithms. Sensitivity analysis on some important parameters, such as the minimum support, the number of sequences, and the number of domains, is conducted.

Impact of the Minimum Support Threshold

We first investigated the performance of the three algorithms with the minimum support varied. For the dataset M2D2kC3T4I200, Figure 5.8 shows the execution time and the memory consumption of the three algorithms. It can be seen in Figure 5.8 that the execution time of algorithms IndividualMine and PropagatedMine decreases as the minimum support increases. This is because, with a larger minimum support, the number of sequential patterns in sequence databases is smaller. Furthermore, algorithm PropagatedMine significantly outperforms the other two algorithms in terms of execution time, which demonstrates the advantage of exploring propagation and lattice structures in mining multi-domain sequential patterns. On the other hand, when the minimum support was smaller than 1.5%, algorithm IndividualMine was worse than algorithm Naive. The reason is that with a smaller minimum support, a larger number of sequential patterns are mined in each domain. Thus, algorithm IndividualMine needs

Parameter   Description
M           number of domains
D           number of sequences
C           average number of elements within a sequence
T           average number of items within an element
I           total number of different items

Table 5.7: Parameters used for the data generator.

Figure 5.8: Execution times of the three algorithms with various minimum support thresholds.

Number of domains   2       3       4        5
Naive               5.3     206.7   2513.9   21769.7
IndividualMine      126.3   163.9   180.2    181.1
PropagatedMine      0.4     0.6     0.7      0.7

Table 5.8: Execution times of algorithms Naive, IndividualMine, and PropagatedMine with the number of domains varied on D1kC2T3I100.

more time to compose candidate multi-domain sequential patterns and determine their supports. In algorithm Naive, joining operations among sequence databases are costly and dominate the execution time. As for memory consumption, algorithm Naive uses less memory than algorithms IndividualMine and PropagatedMine, because both of them need additional memory space for storing the sequential patterns mined. Algorithm PropagatedMine also needs to store lattice structures and therefore uses more memory space than algorithm IndividualMine, which needs no memory space beyond that for storing sequential patterns. Although algorithm PropagatedMine needs more memory space, it is able to quickly derive multi-domain sequential patterns, trading memory space for execution time.

Impact of the Number of Domains

We next examine the impact of the number of domains on the performance of the three proposed algorithms. The experiments were conducted on D1kC2T3I100 (referred to as a smaller dataset) and D1kC3T4I200 (referred

Number of domains   2        3        4         5
Naive               57.1     3065.3   53164.9   379118.5
IndividualMine      1052.1   1192.9   1213.9    1214.4
PropagatedMine      2.1      2.4      2.5       2.5

Table 5.9: Execution times of algorithms Naive, IndividualMine, and PropagatedMine with the number of domains varied on D1kC2T4I200.

Figure 5.9: Performance of Naive, IndividualMine, and PropagatedMine with the number of sequences varied.

to as a larger dataset). With the minimum support set to 0.3%, the execution times of the proposed algorithms, in seconds, are shown in Table 5.8 and Table 5.9. From both tables, it can be seen that all three algorithms have a larger execution time when the number of domains increases. In particular, the execution time of algorithm Naive increases drastically. Algorithm PropagatedMine is always faster than algorithm Naive, and algorithm IndividualMine is faster than algorithm Naive once there are three or more domains. Furthermore, algorithm PropagatedMine outperforms the other algorithms in terms of execution time, showing the advantage of utilizing propagation to reduce the mining cost. In addition, given a larger dataset with more events and longer sequences, the execution time of algorithm Naive becomes even worse. On the other hand, algorithm PropagatedMine incurs a smaller execution time than algorithms Naive and IndividualMine, showing the good scalability of algorithm PropagatedMine.
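The scaling behavior described above can be checked directly from the numbers in Table 5.8. The sketch below is only illustrative arithmetic on the reported execution times; the function names `growth_factors` and `speedup_over` are hypothetical and do not come from the original experiments.

```python
from typing import Dict, List

# Execution times in seconds, copied from Table 5.8 (2, 3, 4, and 5 domains).
DOMAINS: List[int] = [2, 3, 4, 5]
TABLE_5_8: Dict[str, List[float]] = {
    "Naive": [5.3, 206.7, 2513.9, 21769.7],
    "IndividualMine": [126.3, 163.9, 180.2, 181.1],
    "PropagatedMine": [0.4, 0.6, 0.7, 0.7],
}


def growth_factors(times: List[float]) -> List[float]:
    """Ratio of each execution time to the previous one (one more domain)."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


def speedup_over(baseline: List[float], other: List[float]) -> List[float]:
    """How many times faster `other` is than `baseline` at each setting."""
    return [b / o for b, o in zip(baseline, other)]


if __name__ == "__main__":
    for name, times in TABLE_5_8.items():
        factors = ", ".join(f"{g:.1f}x" for g in growth_factors(times))
        print(f"{name:15s} growth per added domain: {factors}")
    speedups = speedup_over(TABLE_5_8["Naive"], TABLE_5_8["PropagatedMine"])
    for m, s in zip(DOMAINS, speedups):
        print(f"{m} domains: PropagatedMine is about {s:.0f}x faster than Naive")
```

On these figures, the execution time of Naive grows by roughly 39 times when going from two to three domains, whereas the execution time of PropagatedMine grows by at most 1.5 times per added domain.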

Impact of the Number of Sequences

Experiments with the number of sequences varied are conducted next, where the number of sequences ranges from 1000 to 6000 and the other parameters are M2C3T3I200. With the minimum support set to 1%, Figure 5.9 shows the execution time of all algorithms. As can be seen in Figure 5.9, the execution time of all three algorithms increases as the number of sequences increases. Notice that the execution time of algorithm Naive increases significantly when the number of sequences is larger than 2000. Thus, to compare algorithms IndividualMine and PropagatedMine, we show only the execution times of algorithms IndividualMine and PropagatedMine. By exploring lattice structures, PropagatedMine needs to mine only atomic patterns, from which other patterns are derived accordingly. As a result, the execution time of PropagatedMine increases only slightly with the number of sequences. Note that the execution time of algorithm PropagatedMine is much smaller than that of algorithms IndividualMine and Naive. However, both algorithms IndividualMine and PropagatedMine need more memory space for storing the sequential patterns mined. Thus, it can be seen in Figure 5.9 that both algorithms IndividualMine and PropagatedMine have

Figure 5.10: Performance of Naive, IndividualMine, and PropagatedMine with the average number of elements within a sequence varied.

a larger memory consumption than algorithm Naive. This agrees with the earlier observation that algorithm Naive is bounded by execution time, whereas algorithms IndividualMine and PropagatedMine are bounded by memory space.

Impact of the Average Number of Elements within a Sequence

In this section, we investigate the performance of Naive, IndividualMine, and PropagatedMine with the average number of elements within a sequence varied. Without loss of generality, the minimum support threshold is set to 1% and the other parameters in the dataset are M2D1kT3I200. Figure 5.10 shows experimental results of Naive, IndividualMine, and PropagatedMine. Clearly, the execution time of mining multi-domain sequential patterns increases with the average number of elements within a sequence.

Note that algorithm IndividualMine even performs worse than algorithm Naive when the average number of elements in a sequence is larger than 4.7. The reason is that IndividualMine mines a large number of sequential patterns in each domain and spends more time composing candidate multi-domain sequential patterns. This observation is supported by Figure 5.11, where algorithm IndividualMine propagates a larger number of sequential patterns than algorithm PropagatedMine. Note that the number of patterns propagated in algorithm IndividualMine is the number of patterns discovered in the starting domain. Figure 5.10(b) also indicates that although algorithm PropagatedMine has a smaller execution time, it needs more memory space to store lattice structures.

Impact of the Average Number of Items within an Itemset

The average number of items within an itemset generally affects the performance of sequential pattern mining. Thus, we investigate the effect of varying the average number of items within an itemset. The minimum support was set to 1% and we used the dataset M2D1kC3I200. The execution time and memory consumption with the average number of items in an itemset varied are shown in Figure 5.12. As can be seen in Figure 5.12, PropagatedMine performs the best in terms of execution time. When the

Figure 5.11: Number of patterns propagated in IndividualMine and PropagatedMine with the average number of elements within a sequence varied.

Figure 5.12: Performance of Naive, IndividualMine, and PropagatedMine with the average number of items within an itemset varied. (a) Execution time (sec); (b) memory consumption (MB).

Figure 5.13: Number of patterns propagated in IndividualMine and PropagatedMine with the average number of items within an itemset varied.

Figure 5.14: Performance of Naive, IndividualMine, and PropagatedMine with the number of different items varied.

average number of items in an itemset is small, the execution time of IndividualMine is smaller than that of Naive. However, if there is a large number of items within an itemset, IndividualMine performs worse than Naive, since algorithm IndividualMine mines a larger number of patterns, which incurs a considerable cost in the checking phase. Figure 5.13 demonstrates that PropagatedMine is better than IndividualMine because the number of sequential patterns it mines in the starting domain is much smaller than that of algorithm IndividualMine. In algorithm PropagatedMine, only atomic patterns are mined, and thus the number of patterns mined in the starting domain is equal to the number of atomic patterns. Consequently, by exploring lattice structures, algorithm PropagatedMine outperforms the other algorithms in terms of execution time.

Impact of the Number of Items

We next investigate the impact of the total number of items, where the minimum support is set to 1% and the other parameters are set as M2D1kC3T4. Figure 5.14 shows the execution times and memory consumption of Naive, IndividualMine, and PropagatedMine. It can be seen in Figure 5.14 that both

Figure 5.15: Number of patterns propagated in IndividualMine and PropagatedMine with the number of different items varied.

IndividualMine and PropagatedMine have a smaller execution time than Naive as the number of items increases. When the number of items is larger, each item is less likely to be frequent under the same setting D1kC3T4. Figure 5.15 depicts the number of patterns with the number of items varied. As can be seen in Figure 5.15, PropagatedMine has a smaller number of patterns derived, which demonstrates the advantage of using lattice structures for discovering multi-domain sequential patterns.

Impact of the Propagation Order for PropagatedMine

Since algorithm PropagatedMine explores propagation in mining multi-domain sequential patterns, we now examine the impact of propagation orders on the performance of algorithm PropagatedMine. As pointed out earlier, algorithm PropagatedMine first selects a starting domain and then performs sequential pattern mining. Based on the mining results, a lattice structure is built. Clearly, one should judiciously determine the starting domain in algorithm PropagatedMine. Intuitively, selecting a domain with a smaller number of sequential patterns reduces the size of lattice structures, thereby improving the performance of algorithm PropagatedMine. We therefore conduct experiments on different propagation orders. Figure 5.16 shows the execution time of algorithm PropagatedMine with various propagation orders, where the value on the x-axis is the propagation order used. For example, 12435 indicates that algorithm PropagatedMine starts with D1 and then propagates to D2, D4, D3, and D5. As can be seen in Figure 5.16, selecting domain D1 as the starting domain is better, since algorithm PropagatedMine then has a smaller execution time and memory consumption. This implies that D1 has the smallest number of sequential patterns. Table 5.10 lists the number of sequential patterns in each domain, and the number of sequential patterns in D1 is indeed the smallest among all domains.

Furthermore, in Figure 5.16, propagation order 12435 incurs the smallest execution time of algorithm PropagatedMine. This observation suggests a guideline: a good propagation order visits the domains in ascending order of the number of sequential patterns in their sequence databases. Note that there are many ways (e.g., sampling) to approximate the number of sequential patterns in each domain. Thus, according

Figure 5.16: Performance of PropagatedMine with varied propagation order. (a) Execution time (sec); (b) memory consumption (MB). Propagation orders evaluated: 12345, 51234, 45123, 34512, 23451, 12435, 15342, 13245, 53421.

Domains                         D1      D2      D3      D4      D5
Number of sequential patterns   24982   25507   28204   27654   28560

Table 5.10: Number of sequential patterns mined in each domain.

to the guideline above, one could determine a good propagation order for algorithm PropagatedMine.
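The guideline above amounts to sorting the domains by their (possibly estimated) number of sequential patterns. The following minimal sketch applies it to the counts in Table 5.10; the function name `propagation_order` is hypothetical, and in practice the counts would be approximations (e.g., obtained by sampling) rather than exact values.

```python
from typing import Dict, List

# Number of sequential patterns per domain, copied from Table 5.10.
TABLE_5_10: Dict[str, int] = {
    "D1": 24982,
    "D2": 25507,
    "D3": 28204,
    "D4": 27654,
    "D5": 28560,
}


def propagation_order(pattern_counts: Dict[str, int]) -> List[str]:
    """Order domains by ascending number of sequential patterns.

    Ties are broken by domain name so that the result is deterministic.
    """
    return sorted(pattern_counts, key=lambda d: (pattern_counts[d], d))


if __name__ == "__main__":
    order = propagation_order(TABLE_5_10)
    # Print in the compact notation used on the x-axis of Figure 5.16.
    print("".join(domain[1:] for domain in order))  # prints 12435
```

For the counts in Table 5.10, the sorted order is D1, D2, D4, D3, D5, which matches the propagation order 12435 that has the smallest execution time in Figure 5.16.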

5.6 Conclusions

This chapter addresses a novel mining task: the multi-domain sequential pattern mining problem. Multi-domain sequential patterns are of practical interest and use, since they clearly reflect the relations among domains hidden in users' behavior. We designed algorithm Naive as a baseline algorithm and two efficient algorithms, IndividualMine and PropagatedMine, to solve this problem. Specifically, in algorithm IndividualMine, sequential pattern mining is performed individually in each domain, and candidate multi-domain sequential patterns are then generated by combining the sequential patterns mined in each domain. Finally, by checking the time-instance sets of candidate multi-domain sequential patterns, the multi-domain sequential patterns are discovered without scanning databases. To reduce the cost of discovering sequential patterns in each domain, algorithm PropagatedMine first mines sequential patterns in a starting domain. Propagated tables are then constructed to discover the candidate multi-domain sequential patterns. Note that by using propagated tables, only sequential patterns that are likely to form multi-domain sequential patterns are extracted. Algorithm PropagatedMine further explores lattice structures to reduce the number of patterns propagated. A comprehensive experimental study shows that both algorithms IndividualMine and PropagatedMine are able to mine multi-domain sequential patterns quickly compared with algorithm Naive. By exploring propagation and lattice structures, algorithm PropagatedMine outperforms the other algorithms in terms of execution time.

Chapter 6

Conclusion

In this dissertation, we develop a series of research works on Apps usage behavior mining and explore patterns mined from multiple categories of Apps. We select useful features from all sensor readings and Apps usage relations to perform Apps usage prediction. Then, two algorithms are proposed to estimate users' dynamic preferences for Apps. Finally, multi-domain sequential patterns are discovered to formulate Apps usage patterns at the category level.

In the first work, we focus on collecting all sensor readings and Apps usage transitions, and on performing kNN classification to predict Apps usage. Two main types of features are proposed. The explicit feature consists of 1) device sensors, 2) environmental sensors, and 3) personalized sensors. The implicit feature models the Apps usage transitions. Two implicit feature discovery algorithms are proposed to explore the implicit features for training and test purposes. Then, a personalized feature selection algorithm is proposed to measure which features are useful for different users' usage behavior.

In the second work, we implement an AppNow widget on Android-based smartphones. We further reduce the features used to only the temporal information, and build a temporal profile for each App. From observing Apps usage behavior, we realized that an App could have specific usage periods. We adopt the Fourier transform to discover the usage periods of each App, and the temporal profile is modelled by the discovered usage periods. The AppNow widget predicts Apps usage by comparing the temporal profile of every App with the current time to see which Apps have a higher probability of being launched.

In the third work, we propose a novel dynamic preference prediction problem, which is to quantize and predict users' preferences according to their Apps usage counts. Two algorithms are proposed. The mode-based prediction (MBP) considers the usage status of each App to calculate its preference. The reference-based prediction (RBP) calculates a reference point for each App; the preference of an App is then estimated by comparing the reference point with the real usage counts.

In the fourth work, a novel data mining task, mining multi-domain sequential patterns, is proposed. A multi-domain sequential pattern represents the usage transition of Apps at the category level. Two algorithms are proposed to solve this problem. The IndividualMine algorithm discovers sequential patterns in each domain and combines those sequential patterns into candidate multi-domain sequential patterns. The PropagatedMine algorithm performs sequential pattern mining only in one starting domain and propagates the discovered sequential patterns to the next domains. We design several operations for propagating patterns.
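As an illustration of the period discovery mentioned for the second work, a dominant usage period can be read off the discrete Fourier transform of a usage-count series. The sketch below is an assumption-laden toy, not the AppNow implementation: the function name `dominant_period`, the hourly sampling, and the synthetic daily signal are all hypothetical, and a plain O(n^2) DFT is used to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Optional


def dominant_period(counts: List[float]) -> Optional[float]:
    """Return the period (in samples) of the strongest DFT component.

    Hypothetical helper: the mean is removed first so that the constant
    (zero-frequency) term does not dominate. Returns None when the series
    has no variation at all.
    """
    n = len(counts)
    if n < 2:
        return None
    mean = sum(counts) / n
    centered = [c - mean for c in counts]
    best_k, best_mag = 0, 0.0
    # Only frequencies up to n/2 are distinct for a real-valued signal.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    if best_k == 0:
        return None
    return n / best_k


if __name__ == "__main__":
    # Synthetic example: one week of hourly usage counts with a daily rhythm.
    hourly = [10 + 5 * math.cos(2 * math.pi * t / 24) for t in range(7 * 24)]
    print(dominant_period(hourly))  # prints 24.0
```

A real usage series is noisier and may contain several periods, so a practical version would likely keep every component whose magnitude exceeds some threshold rather than only the single strongest one.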

Bibliography

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM International Conference on Management of Data (SIGMOD), pages 207–216, 1993.

[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB), pages 487–499, 1994.

[3] Rakesh Agrawal and Ramakrishnan Srikant. Mining Sequential Patterns. In Proceedings of the 1995 IEEE International Conference on Data Engineering (ICDE), pages 3–14, 1995.

[4] Driss Choujaa and Naranker Dulay. Predicting Human Behaviour from Selected Mobile Phone Data Points. In Proc. of UbiComp, pages 105–108, 2010.

[5] Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Sequential Pattern Mining Using A Bitmap Representation. In Proceedings of the 2002 ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 429–435, 2002.

[6] Mete Celik, Shashi Shekhar, James P. Rogers, James A. Shine, and Jin Soung Yoo. Mixed-Drove Spatio-Temporal Co-occurence Pattern Mining: A Summary of Results. In Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM), pages 119–128, 2006.

[7] O. Celma. Music Recommendation and Discovery in the Long Tail. Springer, 2010.

[8] Gong Chen, Xindong Wu, and Xingquan Zhu. Sequential Pattern Mining in Multiple Streams. In Proceedings of the 2005 IEEE International Conference on Data Mining (ICDM), pages 585–588, 2005.

[9] Lei Chen, M. Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving object trajectories. In Proc. of SIGMOD, pages 491–502, 2005.

[10] Shuo Chen, Joshua L. Moore, Douglas Turnbull, and Thorsten Joachims. Playlist prediction via
