

CHAPTER 2 BACKGROUND

2.4 Data Mining Methods


National Chengchi University

point of view. Similar to this kind of web mining, which automatically discovers user access patterns from the Web, our research's goal is to discover the usage patterns of smartphone users.

2.4.4 PrefixSpan by Pattern-Growth Approach

Sequential pattern mining is an important data mining problem and also a practical method with broad applications, from the analysis of customer purchase behavior to web browsing patterns, and from stock trend prediction in economics to biological gene sequence discovery. As a result, after sequential pattern mining was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1995, many algorithms related to sequential pattern mining have been proposed.

In the paper [20], the authors classify current sequential pattern mining algorithms, including Apriori, AprioriALL, GSP, SPADE, FreeSpan, WAP-Mine, PrefixSpan, SPAM, PLWAP, DISC-all, FS-Miner, Apriori-GST, HVSM, and LAPIN, into four categories based on the key features supported by the techniques: Apriori-based, pattern-growth, early-pruning, and hybrid algorithms. Meanwhile, many papers choose some of these algorithms, run experiments on specific data sets to compare performance in execution time and memory, and discuss which algorithm is better. However, to the best of our knowledge, no single algorithm is the best for all applications. Which algorithm is the best? The answer depends on the data characteristics and the application. In this thesis, we do not put emphasis on the running time or memory cost of different algorithms; our focus is on finding meaningful and interesting patterns. Therefore, we implement a "suitable" approach (one that considers the characteristics of the data we use). We refer to an existing algorithm for the sequential pattern mining problem, implement methods for data processing and data format transformation, and also implement a modified version of the algorithm that considers time constraints between


items.

PrefixSpan is based on pattern growth, one of the major techniques for sequential pattern mining. Although it was proposed by J. Pei et al. [21] in the early 2000s, PrefixSpan remains of interest even considering all the novel algorithms, as noted in a 2013 paper [22]. In the paper [23], the author states: "Among the various approaches, PrefixSpan was one of the most influential and efficient ones in terms of both time and space. Some approaches may achieve better performance under special circumstances; however, the overall performance of PrefixSpan is among the best. For example, LAPIN is more efficient for dense data sets with long patterns but less efficient in other cases. Besides, it consumes much more memory than PrefixSpan. SPAM outperforms the basic PrefixSpan but is much slower than PrefixSpan with the pseudoprojection technique."

As one of the efficient sequential pattern mining algorithms, PrefixSpan [21] partitions each sequence into a "prefix" and a "suffix" during the pattern generation process, and then only the projected databases of frequent prefixes are generated based on their suffixes. For example, given a sequence < a (abc) (ac) d (ef) >, where a, b, c, d, e, and f represent items, < a >, < aa >, < a (ab) >, and < a (abc) > are prefixes of this sequence, and the corresponding suffixes are shown in Figure 2 below.

Prefix        Suffix
< a >         < (abc) (ac) d (ef) >
< aa >        < (_bc) (ac) d (ef) >
< a (ab) >    < (_c) (ac) d (ef) >
< a (abc) >   < (ac) d (ef) >

Figure 2. An example of prefixes and suffixes.

The process runs recursively: each call extends the current prefix with frequent items of length 1, so a pattern grows one item longer at a time. The separation of prefix and suffix prunes the candidate sequences greatly, allowing a much smaller search space and thus less running time.

PrefixSpan also employs a pseudo projection mechanism, which avoids physically copying suffixes by representing the corresponding sequence with an index and the starting position of the projected suffix in the sequence. When the sequence database or the projected database can be loaded into memory, pseudo projection can further improve the efficiency.

Let a set of items be denoted as I = {i1, i2, … , im} and a set of sequences be denoted as S = {s1, s2, … , sn}. Each sequence in S contains a set of items. D is a database that consists of a set of sequences.

PrefixSpan-Recursive:

Input: Database Dα, Sequence α, Integer min_supp, Set P
Output: Set P
 1: P1 ← {frequent items in Dα}
    ⋮
13: PrefixSpan-Recursive(D′, γ, min_supp, P)
14: end if
15: end for

Figure 3. Pseudocode of the PrefixSpan algorithm [21].
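To make the pattern-growth idea concrete, the following is a minimal Python sketch of PrefixSpan, not the implementation used in [21] or in this thesis: it treats each sequence as a list of single items, ignores itemset elements and the pseudo-projection optimization, and takes an absolute minimum support count.

```python
def prefixspan(db, min_supp):
    """Mine frequent sequential patterns from `db`, a list of sequences
    (each a list of items), with absolute minimum support `min_supp`."""
    results = []

    def project(db, item):
        # Keep, for each sequence, the suffix after the first occurrence of `item`.
        projected = []
        for seq in db:
            for i, x in enumerate(seq):
                if x == item:
                    projected.append(seq[i + 1:])
                    break
        return projected

    def mine(prefix, db):
        # Count in how many sequences of the projected database each item occurs.
        counts = {}
        for seq in db:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        # Grow the prefix by one frequent item at a time, then recurse.
        for item in sorted(counts):
            if counts[item] >= min_supp:
                pattern = prefix + [item]
                results.append((pattern, counts[item]))
                mine(pattern, project(db, item))

    mine([], db)
    return results
```

For example, `prefixspan([["a","b","c"], ["a","c"], ["a","b"]], 2)` yields the patterns < a >:3, < a b >:2, < a c >:2, < b >:2, and < c >:2; no candidate outside these prefixes is ever generated.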


Overall, PrefixSpan has the following advantages: No candidate sequence needs to be generated, and projected databases are shrinking as the algorithm proceeds.


CHAPTER 3 DATA

3.1 Raw Data Description

We use a real data set similar to the one used by Chen et al. in [24], which reflects the daily application usage of several smartphone users over a long period of time. More precisely, we have over half a year of log records covering 25 different users using 26 machines; the period of the raw data is from September 2010 to March 2011. The raw data set contains logs for each user. A sample portion of the data set is shown in Table I. In the raw data set, every record includes Log_ID, a serial number serving as the identifier of the record. Every record also includes User_ID and Machine_ID, which together identify a unique operating unit.

Moreover, anonymity of users is protected as no personal identification is stored. In addition, every record is created with a timestamp and a geolocation, which is represented by latitude and longitude.

To store user behavior data, we keep track of what application was used by a user, and it is recorded as “Package Name”; and we also keep track of what activity was done by a user, and it is recorded as “Operational Activity” as shown in Table I. In short, every record stores data about who used what application, where and when.

Although it is a rich data set, in its raw format it is not suitable for our analysis. First of all, every log record in the raw data set appears independent of the others. It is difficult to identify a "usage session" of a user directly from the raw data set, where a usage session means the sequence of applications used by a user on a machine in a period of time. That is, a usage session is "a series of user behaviors" of a particular user, but it may be separated by tens or even hundreds of log records in the raw data set. Therefore, we have to perform a series of data processing operations, including cleaning and transformation.

Table I. Smartphone user behavior log data.

Log ID | User ID  | Machine ID  | Time                | Latitude  | Longitude  | Package Name        | Operational Activity
15553  | USER_002 | MACHINE_013 | 2010-11-17 17:47:07 | 25.057679 | 121.617137 | com.android.camera  | .CameraEntry
15554  | USER_002 | MACHINE_013 | 2010-11-17 17:47:20 | 25.057679 | 121.617137 | com.facebook.katana | .UploadPhotoActivity
15555  | USER_002 | MACHINE_013 | 2010-11-17 17:47:40 | 25.057679 | 121.617137 | com.htc.launcher    | .Launcher

3.2 Data Processing

We are given a database D of smartphone user behavior log data as our raw data, as we introduced previously. Each record in the database D includes: Log_ID, User_ID, Machine_ID, Time, Latitude, Longitude, Package Name, and Operational Activity. In order to assist us in our analysis of application usage patterns of smartphone users, we would like to partition the database into a set of “sessions”, each of which corresponds to a series of application usages made by a user on a machine in a period of time.

Figure 4. Data Processing flow chart.


There are three main functions in our program of data processing, shown in Figure 4. We first perform a series of data purification steps, including scanning the database, checking every log record, and examining the values of every attribute. We remove incomplete data records, such as records with "UnknownUse" in the User_ID attribute, which cannot be identified. Then we carry out data partitioning. To begin the partitioning, we order the raw data by User_ID and define a session S as follows: S contains all the log records corresponding to the same user using the same machine, and the difference in time between two consecutive log records in S does not exceed 10 minutes. As a result, each session is generated by combining time-relevant logs to represent a single usage period performed by a particular user. The fixed 10-minute time gap follows [25], in which the authors gave a case study on whether users "instant share" photos on Facebook after using the camera; they defined "instant sharing" as the activity of a user uploading pictures to Facebook within 10 minutes after they are taken.

Table II shows a sample portion of an extraction of our raw data, including Log_ID, User_ID, Machine_ID, Time, and Package Name. We can then collect and separate log records to generate sessions by the aforementioned definition, as shown in Table III. After we partition the raw data into session data, we compute some statistics about the session data, including temporal and spatial information and how applications are used. We give detailed charts with descriptions of those statistics below.

Table II. A fragment of the raw data.

Log ID | User ID  | Machine ID  | Time                | Package Name
1      | USER_002 | MACHINE_013 | 2010-11-17 00:36:26 | com.google.android.gm
2      | USER_002 | MACHINE_013 | 2010-11-17 00:37:21 | com.facebook.katana
3      | USER_002 | MACHINE_013 | 2010-11-17 00:38:00 | com.google.android.gm
4      | USER_013 | MACHINE_012 | 2010-11-17 08:22:24 | com.mywoo.busplus
5      | USER_013 | MACHINE_012 | 2010-11-17 08:24:25 | com.google.android.gm
6      | USER_013 | MACHINE_012 | 2010-11-17 08:26:13 | com.facebook.katana
7      | USER_013 | MACHINE_012 | 2010-11-17 08:27:34 | com.android.browser
8      | USER_013 | MACHINE_012 | 2010-11-17 15:02:59 | com.facebook.katana
9      | USER_013 | MACHINE_012 | 2010-11-17 15:04:19 | com.android.browser
10     | USER_013 | MACHINE_012 | 2010-11-17 15:04:44 | com.google.android.gm
11     | USER_002 | MACHINE_013 | 2010-11-17 16:58:03 | com.android.camera
12     | USER_002 | MACHINE_013 | 2010-11-17 16:58:14 | com.facebook.katana
13     | USER_002 | MACHINE_013 | 2010-11-17 16:59:07 | com.facebook.katana
14     | USER_002 | MACHINE_013 | 2010-11-17 17:47:07 | com.android.camera
15     | USER_002 | MACHINE_013 | 2010-11-17 17:47:20 | com.facebook.katana

Table III. Session data.

User ID  | Machine ID  | Applications Used                  | Duration (sec)
USER_002 | MACHINE_013 | Gmail, Facebook, Gmail             | 94
USER_013 | MACHINE_012 | Busplus, Gmail, Facebook, Browser  | 310
USER_013 | MACHINE_012 | Facebook, Browser, Gmail           | 105
USER_002 | MACHINE_013 | Camera, Facebook, Facebook         | 64
USER_002 | MACHINE_013 | Camera, Facebook                   | 13

3.3 Summary Statistics of Data

Initially, there are 262,858 log records in our raw data. After we perform data processing, there are 25,880 sessions. We then give summary statistics and charts to illustrate some properties of our session data.

Figure 5 shows the number of sessions to which each user contributes. There are 25 users in the data we use; the User_IDs run from User_001 to User_030, but User_001, User_008, User_012, User_014, and User_026 have no records.

Figure 5. Users and their corresponding number of sessions.

Package Name is the name of an application used by a user. Operational Activity is the type of operation that a user performed when using an application. There are 1,132 distinct Package Names and 3,099 distinct Operational Activities in our session data. Hence, users installed and used over one thousand applications and performed over three thousand kinds of operations. Clearly, an application may be associated with several operations or activities. For example, the application "Facebook" may contain activities such as "Login", "Home", "Feedback", and "Upload Photo".

We also gathered statistics for groups of applications using category data (not from our raw data).


In our raw data, there is no category information on the applications. We took a detailed view of these statistics and found that many commonly used applications, such as Album, Camera, and Gtalk, are without a category. Moreover, we discovered that some applications cannot be categorized accurately. For example, Gmail has been categorized as "Communication"; however, it can also be considered "Tools". Due to the incomplete information and ambiguous definition of categories, we do not use the category information in our research.

Figure 6. Statistics of application categories.

The large number of applications installed by the users does not mean that they use them all equally. We find that users devote most of their time to a few specific applications. Take the most popular application as an example: Facebook appears in over 20,000 of the 262,858 log records.

Since our focus in this thesis is on applications, we show the distribution of the top 30 applications in Figure 9,



omitting three launchers. The reason these launchers are deleted is that, rather than actual applications, they are automatic triggers recorded whenever the user presses the Home key. Therefore, we can in fact generate rule sets with clearer meanings without these launchers in between. In the remainder of this thesis, all related discussions will focus on only the top 30 applications.

We calculate the difference in time between the first and the last records in a session as the duration of the session, and show the distribution of duration in Figure 7. We can see that a large concentration of users' one-time usage periods (defined by the 10-minute gap rule given previously) last under 20 minutes. This phenomenon supports our definition.

We also show the distribution of the number of log records in a session in Figure 8. We calculate the number of locations in a session (the number of locations where a user was during his or her use of the smartphone in that period of time) by counting the number of distinct latitude-longitude pairs in the session. Here is our finding: although a smartphone is a mobile device, most of the users (who contribute their data to us) stayed at one location (or stayed in a place where accurate latitude and longitude data are not available) and did not move.

Figure 7. Distribution of duration time.

Figure 8. Distribution of the number of logs in a session.

Considering the top 30 applications shown in Figure 9, we calculate the number of applications (among the top 30) that are used in a session, as shown in Figure 11. Please note



that we consider only the applications among the top 30, regardless of how many times an application outside the top 30 is used in a session. There are 1,603 (out of 25,880) sessions that do not use any of the top 30 applications; this portion is small.

Figure 9. Top 30 application names and their numbers of sessions.

The top 30 applications, ranked by number of sessions in Figure 9, are: Facebook, Browser, Settings, Phone, Gmail, WhatsApp Messenger, Privacy Blocker, GoogleTalk, mms, plurk, htcdialer, htccontacts, Calendar, Panda Home, World Clock, packageinstaller, Album, Superuser, Mail, Handcent SMS, DocumentsToGo, Camera, Alarm Clock, Google Maps, Contacts, BlackMarketApp, google.android.gsf, joelapenna.foursquared, rechild.advancedtaskkiller, and urbandroid.sleep.

Figure 10. Distribution of the number of unique locations in a session.

Figure 11. Distribution of the number of top 30 applications used in a session.

Finally, we calculate the length of the sequence, which indicates how many applications are used sequentially in a session. As shown in Figure 12, we can see that most users switch



between 10 or fewer applications during a one-time usage period.

Figure 12. Distribution of the length of sequence.

In summary, our session data contain the following basic information and statistics: user-id, machine-id, start time of the session, end time of the session, duration of the session, the number of log records in the session, and the number of locations in the session.

The data underlying Figure 12 (length of sequence : number of sessions) are: 0~5: 20,938; 6~10: 2,811; 11~15: 1,000; 16~20: 465; 21~25: 237; 26~30: 129; 31~: 300.


CHAPTER 4 EXPERIMENT

We have a database that stores smartphone users' behavior log records, which we introduced and described in detail in the previous chapter, where we also defined what a session is and explained the characteristics of the session data. Now, we demonstrate experiments using the traditional association rule mining method, a sequential pattern mining method, and an advanced sequential pattern mining method that considers time constraints.

Figure 13. Sequential pattern mining flow chart.

4.1 Association Rule Mining Method

4.1.1 Data Processor

We use the session data (Table IV), which reflects the daily usage of multiple users over a long period of time.

Table IV. Session data.

User ID  | Machine ID  | Applications Used                  | Duration (sec)
USER_002 | MACHINE_013 | Gmail, Facebook, Gmail             | 94
USER_013 | MACHINE_012 | Busplus, Gmail, Facebook, Browser  | 310
USER_013 | MACHINE_012 | Facebook, Browser, Gmail           | 105
USER_002 | MACHINE_013 | Camera, Facebook, Facebook         | 64
USER_002 | MACHINE_013 | Camera, Facebook                   | 13

In order to find relations between applications, we began with association rule mining. Considering the data, we apply the transaction concept used in market basket analysis and create a set of session records with Boolean fields, called Boolean-sessions. Since we have gathered statistics to select the top 30 most frequently used applications, we add corresponding fields to the Boolean-session records. These fields are filled with Boolean values to create a data set of Boolean-session records, as shown in Table V. As a result, each Boolean-session record includes not only basic user information (User-id, Machine-id, and the applications used, as Boolean fields) but also session statistics such as the duration of the session, the number of locations in the session, etc.

It is worth emphasizing that in each session, we focus on what application is used. No matter how many times an application is used in the session, the corresponding field is filled with "1", and the fields follow the fixed order of the top 30 applications rather than the order in which applications are used in the session.
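A minimal sketch of this transformation follows; the truncated list of top applications is an assumption for illustration, standing in for the full top 30 list.

```python
TOP_APPS = ["Facebook", "Browser", "Gmail", "Camera"]  # truncated stand-in for the top 30 list

def to_boolean_session(apps_used):
    """Map a session's application sequence to Boolean fields in the fixed
    top-application order: 1 if the application is used at least once in
    the session (regardless of how many times), else 0."""
    used = set(apps_used)
    return [1 if app in used else 0 for app in TOP_APPS]
```

For example, the session (Gmail, Facebook, Gmail) maps to the fields [1, 0, 1, 0]: Gmail's repeated use still yields a single "1", and the field order stays fixed.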

Table V. Boolean-session data set.

User ID | Machine ID | Facebook | Browser | Gmail | Camera

From this Boolean-session data set, we can generate association rules with the Apriori-based algorithm in Weka.

Weka [26] is open source data mining software written in Java, comprising a collection of machine learning algorithms for data mining tasks. Weka contains implementations of algorithms for classification, clustering, and association rule mining, with graphical user interfaces and visualization utilities for data exploration. The algorithms can either be applied directly to a data set in a particular format (such as CSV or ARFF) or be called from the user's own Java code. We use Weka 3.6.6 in our experiments.

We give a definition to our problem as follows: An item i in the data set is defined as an application that a user used on a smartphone. Let a set of items I = {i1, i2, … , in} be the itemset, and let a set of Boolean-session records T = {t1, t2, … , tm} be a data set that keeps the usage records of transactions. An association rule containing two sets of items, X and Y, can be generated in the form of X ⇒ Y, where X, Y ⊆ I and X∩Y = ∅.
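Under these definitions, the support and confidence of a rule X ⇒ Y can be computed as in this sketch, where each Boolean-session record is represented as the set of applications it marks with "1":

```python
def support(records, items):
    """Fraction of records (each a set of applications) containing all of `items`."""
    return sum(1 for r in records if items <= r) / len(records)

def confidence(records, x, y):
    """Confidence of the rule X => Y: support(X union Y) / support(X),
    i.e., the conditional probability of Y given X."""
    return support(records, x | y) / support(records, x)
```

With the five sessions of Table III, for instance, support({Camera}) = 0.4 and confidence({Camera} ⇒ {Facebook}) = 1.0, since Facebook is used in both sessions that use Camera.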

In the association rule mining experiment, we apply our Boolean-session data, produced by our Java data processor (which creates the "applications used" Boolean attributes), directly to Weka in CSV format. After the Apriori-based method in Weka is

executed, we can generate a number of rules with the above method. We extract some sample rules, which include Facebook, Camera, and Album, and report them in Table VI.

Table VI. Rules generated by Apriori-based algorithm.

Set of items X | support | Set of items Y | support | confidence (conditional probability)

The minimum confidence is 0.03.
