

CHAPTER 4 EXPERIMENT

4.2 Sequential Pattern Mining Method

National Chengchi University


In the last section, we applied association rule mining to the Boolean-session data and obtained some results, but we are not satisfied with them. We want more informative rules that capture not only which applications are used together but also the order in which they are used. That is, we are interested in answering the following question: “If a user has used a smartphone application, which application will he or she use next?” Sequential pattern mining, another important problem in data mining, addresses this kind of question; it originated from the goal of discovering patterns in the sequence of items purchased by a customer.

However, we cannot use a sequential pattern mining method directly to generate frequent sequential patterns for our purpose, because a pattern found by a traditional sequential pattern mining algorithm indicates a correlation between transactions, and such “inter-transaction” relations are not what we want. In other words, the classic association rule mining algorithm finds which items from the same transaction frequently occur together, whereas the traditional sequential pattern mining algorithm finds which items from different transactions appear in a certain order. Given these properties, we need a method that finds rules in which the items appear in a certain order and come from the same transaction. In addition, no existing sequential pattern mining algorithm or tool can be applied directly to our data.

So, first of all, we need a data processor that performs a series of processing steps, including purification and partition. This time, however, the session data differs from the Boolean-session data: we create a session that keeps the concept of a “transaction” while recording the sequence of applications used. We then need a data formatter for SPMF (A Sequential Pattern Mining Framework) [27] and a result processor that deals with the output of SPMF.

We prepare the data set from the session data (Table IV). We then introduce the concept of sequence to create Sequence-session data for each session. As a result, we record the applications one by one in timestamp order instead of as Boolean values.

The following example shows a portion of the raw data (Table VIII) and the Sequence-session derived from it (Table IX).

Table VIII. A portion of the raw data.

User ID   Machine ID   Time                 Package Name  Operational Activity
USER_005  MACHINE_018  2010-11-17 16:52:15

Table IX. A sample of Sequence-session.

User ID   Machine ID   App1    App2      App3      App4      App5
USER_002  MACHINE_013  Camera  Facebook  Launcher  Facebook  Launcher

machine. Logs within 10 minutes of each other are considered relevant and combined into the same session, whereas logs more than 10 minutes apart are divided into different sessions (since the user is likely to have finished his or her usage). We assume that two application usages are only weakly related if their timestamps are more than 10 minutes apart. Consequently, the last log in Table VIII does not join the session because of its 30-minute time gap. Likewise, because sessions are built per user, the first log in Table VIII is not included in the session: it was generated by USER_005, not USER_002.
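The session partition described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the data processor used in this work: the function name split_sessions, the tuple layout (user, machine, time, package), and the sample logs, including their package names, are hypothetical. Only the 10-minute threshold is taken from the text.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)  # threshold stated in the text


def parse_time(text):
    """Parse a timestamp in the 'YYYY-MM-DD HH:MM:SS' layout."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def split_sessions(logs, gap=SESSION_GAP):
    """Group (user, machine, time, package) logs into ordered sessions.

    A new session is opened for every new (user, machine) pair and
    whenever the gap to the previous log of that pair exceeds `gap`.
    """
    sessions = {}   # (user, machine) -> list of sessions (lists of packages)
    last_seen = {}  # (user, machine) -> time of the previous log
    for user, machine, stamp, package in sorted(logs, key=lambda r: r[:3]):
        key = (user, machine)
        if key not in last_seen or stamp - last_seen[key] > gap:
            sessions.setdefault(key, []).append([])  # open a new session
        sessions[key][-1].append(package)
        last_seen[key] = stamp
    return sessions


# Hypothetical logs; package names are made up for illustration.
sample_logs = [
    ("USER_005", "MACHINE_018", parse_time("2010-11-17 16:52:15"), "Gmail"),
    ("USER_002", "MACHINE_013", parse_time("2010-11-17 16:53:00"), "Camera"),
    ("USER_002", "MACHINE_013", parse_time("2010-11-17 16:55:30"), "Facebook"),
    ("USER_002", "MACHINE_013", parse_time("2010-11-17 17:25:30"), "Launcher"),
]
sample_sessions = split_sessions(sample_logs)
```

In this sample, the USER_005 log ends up in its own session, and the last USER_002 log, 30 minutes after the previous one, opens a second session for that user.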

We then discovered that successive duplicate Package Name values (the applications the user used) exist within sessions. This may simply mean that the user kept using one application with different activities during the session, or it may result from filtering out infrequent applications when we constructed the sessions. In either case, it is better to eliminate these duplicates and keep only one appearance of the application, since we focus on usage patterns rather than activity patterns. Otherwise, if such duplicates are abundant, PrefixSpan would generate sequential patterns that contain the same application several times in a row. Such patterns would be uninformative for our research and also erroneous, because the repeats should be counted as a single usage of that application.
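Removing successive duplicates can be sketched as below. The function name collapse_repeats and the sample session are hypothetical; the sketch only assumes that a session is an ordered list of package names.

```python
from itertools import groupby


def collapse_repeats(session):
    """Keep one appearance of each run of identical consecutive apps.

    Non-adjacent repeats are preserved, because returning to an
    application after using another one is a separate usage.
    """
    return [app for app, _run in groupby(session)]


# Hypothetical session with one run of repeated "Facebook" logs.
cleaned = collapse_repeats(
    ["Camera", "Facebook", "Facebook", "Launcher", "Facebook"]
)
```

Only adjacent repeats are merged, so the later return to Facebook after Launcher is kept as its own usage.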

Moreover, we do not want the traditional sequential pattern mining algorithm to group our sessions by User_ID and find relationships between sessions. The reason is as follows: if a user uses Facebook in the morning and Camera in the afternoon, it would be unusual to consider these two usages part of the same session. The two usages may well be related, but we think it makes more sense to treat them as two usages in two separate sessions. The following tables contrast our Sequence-session data, which divides items by time gap (Table X), with the traditional sequence, which groups the sessions of each user (Table XI).

Table X. A sample of our Sequence-session data.

User ID   Machine ID   App1      App2      App3      App4     App5
USER_002  MACHINE_013  Gmail     Facebook  Gmail     -        -
USER_013  MACHINE_012  Busplus   Gmail     Facebook  Browser  -
USER_013  MACHINE_012  Facebook  Browser   Gmail     -        -
USER_002  MACHINE_013  Camera    Facebook  Facebook  -        -
USER_002  MACHINE_013  Camera    Facebook  -         -        -

Table XI. A sample of traditional sequence.

User ID   Sequence
USER_002  (Gmail, Facebook, Gmail), (Camera, Facebook, Facebook), (Camera, Facebook)
USER_013  (Busplus, Gmail, Facebook, Browser), (Facebook, Browser, Gmail)
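The difference between the two representations can be illustrated with a short Python sketch. The function names and the row layout are hypothetical; the sample rows are those of Table X.

```python
from collections import defaultdict

# Rows of Table X as (user, ordered applications of one session).
table_x_rows = [
    ("USER_002", ["Gmail", "Facebook", "Gmail"]),
    ("USER_013", ["Busplus", "Gmail", "Facebook", "Browser"]),
    ("USER_013", ["Facebook", "Browser", "Gmail"]),
    ("USER_002", ["Camera", "Facebook", "Facebook"]),
    ("USER_002", ["Camera", "Facebook"]),
]


def as_session_sequences(rows):
    """Our view: every session is an independent ordered sequence."""
    return [list(apps) for _user, apps in rows]


def as_traditional_sequences(rows):
    """Traditional view: one sequence per user, one element per session."""
    per_user = defaultdict(list)
    for user, apps in rows:
        per_user[user].append(tuple(apps))
    return dict(per_user)


session_view = as_session_sequences(table_x_rows)
traditional_view = as_traditional_sequences(table_x_rows)
```

In the session view, order is mined inside each of the five sessions. In the traditional view, a miner would look for order between the sessions of the two users, which is the inter-transaction relation we want to avoid.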

4.2.2 Sequential Pattern Mining

In the sequential pattern mining experiment, we feed our Sequence-session data (already processed by the data processor) to SPMF [27] via a data formatter. SPMF is an open-source data mining platform written in Java that offers implementations of data mining algorithms for association rule mining, sequential pattern mining, clustering, etc. SPMF is a project founded by Philippe Fournier-Viger, and it has been cited by or used in various studies since 2010. In our work, we use the PrefixSpan algorithm and its implementation in the open-source SPMF toolkit.

After running the data processor and the data formatter on the raw data, we make minor changes to the source code of the PrefixSpan module provided by SPMF to make the algorithm suitable for our data set.

The data processor was illustrated with an example in Section 4.2.1. The data formatter is a program that converts application names into integers so that SPMF can process the data.
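A data formatter of this kind might look like the sketch below. It is not the program used in this work: the function names are hypothetical, and the output layout assumes SPMF's sequence database text format, in which each item is followed by -1 (end of itemset) and each sequence ends with -2.

```python
def build_item_ids(sessions):
    """Assign a positive integer to each application name, in first-seen order."""
    ids = {}
    for session in sessions:
        for app in session:
            ids.setdefault(app, len(ids) + 1)
    return ids


def to_spmf_lines(sessions, ids):
    """Encode each non-empty session as one line of the assumed SPMF format.

    Every application is written as a single-item itemset ("<id> -1"),
    and the line is terminated by "-2".
    """
    lines = []
    for session in sessions:
        if not session:
            continue  # an empty session carries no pattern information
        body = " ".join(f"{ids[app]} -1" for app in session)
        lines.append(f"{body} -2")
    return lines


# Hypothetical sessions for illustration.
demo_sessions = [["Gmail", "Facebook", "Gmail"], ["Camera", "Facebook"]]
demo_ids = build_item_ids(demo_sessions)
demo_lines = to_spmf_lines(demo_sessions, demo_ids)
```

The mapping in demo_ids has to be kept, since the result processor needs it to translate the mined integers back into application names.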

After that, we need a result processor that takes the output of SPMF and transforms it into readable and meaningful results. PrefixSpan is based on pattern growth, one of the major techniques for sequential pattern mining. The key idea is to avoid the candidate generation step altogether and to focus the search on a restricted portion of the initial database.
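The result processor can be sketched along the same lines. The sketch assumes that each SPMF output line has the form "1 -1 2 -1 #SUP: 2", that is, the pattern followed by its support count; the function name and the sample mapping are hypothetical.

```python
def parse_spmf_pattern(line, names):
    """Turn one assumed SPMF output line into (application names, support).

    `names` maps the integer ids written by the data formatter back to
    application names.
    """
    pattern_part, _sep, support_part = line.partition("#SUP:")
    item_ids = [int(tok) for tok in pattern_part.split() if tok != "-1"]
    return [names[i] for i in item_ids], int(support_part.strip())


# Hypothetical id-to-name mapping and output line.
demo_names = {1: "Gmail", 2: "Facebook"}
demo_pattern = parse_spmf_pattern("1 -1 2 -1 #SUP: 2", demo_names)
```

Each parsed pattern reads directly as "Gmail, then Facebook, observed in 2 sessions".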

Three major properties make the PrefixSpan algorithm suitable for our research.

First of all, PrefixSpan is designed for dense databases, which matches our data exactly (composed of the 30 most frequently used applications, whose supports all exceed 0.5%). Second, PrefixSpan needs no candidate generation; instead, it recursively builds projected databases for frequent prefixes from the corresponding suffixes. It finds new prefixes starting from length 1, and each pattern grows one item at a time. Therefore, it can be extended with additional constraints or conditions between items, such as a time constraint, when forming frequent patterns.

Moreover, since PrefixSpan constructs patterns recursively by growing the prefix, it has several practical advantages. It is capable of dealing with very large databases, and the search space is reduced at each step, allowing better performance in the presence of small support thresholds.
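The prefix-growth idea can be illustrated with a simplified Python sketch. It is not the SPMF implementation and not the modified module used in this work: it assumes that every element of a sequence is a single application, uses physical projected databases for readability, and takes an absolute support count. The function name and the sample database are hypothetical.

```python
def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for sequences of single items.

    Patterns are grown one item at a time; each frequent prefix is
    searched only within its own projected database of suffixes.
    """
    results = []

    def grow(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, support))
            # Project: keep what follows the first occurrence of `item`.
            next_db = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, next_db)

    grow([], [list(s) for s in sequences])
    return results


# Hypothetical database of three sessions.
demo_db = [
    ["Gmail", "Facebook", "Gmail"],
    ["Camera", "Facebook", "Facebook"],
    ["Camera", "Facebook"],
]
demo_patterns = prefixspan(demo_db, min_support=2)
```

No candidate sequences are generated and tested against the whole database; each recursive call only inspects the suffixes that remain after the current prefix, which is why the search space shrinks at every step.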

In addition, three different projection methods have been developed for PrefixSpan: level-by-level projection, bi-level projection, and pseudo projection. In SPMF, the PrefixSpan algorithm is implemented with pseudo projection, which is the most efficient of the three.


Pseudo projection avoids constructing a physical projected database by representing each suffix as a pair of a pointer and an offset value. When a projected database fits in main memory, the cost of projection is reduced significantly.
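The pointer-and-offset representation can be sketched as follows. This is an illustration only, assuming single-application elements; the function name and the sample database are hypothetical, and the pointer is simplified to an index into the in-memory database.

```python
def pseudo_project(database, pointers, item):
    """Project on `item` without copying any suffix.

    `pointers` is a list of (sequence index, offset) pairs. The result
    holds, for every sequence whose current suffix contains `item`, the
    same sequence index with the offset moved just past the first match.
    """
    projected = []
    for seq_id, offset in pointers:
        try:
            position = database[seq_id].index(item, offset)
        except ValueError:
            continue  # `item` does not occur in this suffix
        projected.append((seq_id, position + 1))
    return projected


# Hypothetical in-memory database of three sessions.
memory_db = [
    ["Gmail", "Facebook", "Gmail"],
    ["Camera", "Facebook", "Facebook"],
    ["Camera", "Facebook"],
]
whole_db = [(i, 0) for i in range(len(memory_db))]
after_camera = pseudo_project(memory_db, whole_db, "Camera")
after_camera_facebook = pseudo_project(memory_db, after_camera, "Facebook")
```

Each projection step produces only a short list of integer pairs, whereas a physical projection would copy every remaining suffix.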
