
CHAPTER 3 DATA


Table II. A fragment of the raw data.

Log ID   User ID    Machine ID    Time                  Package Name

1        USER_002   MACHINE_013   2010-11-17 00:36:26   com.google.android.gm
2        USER_002   MACHINE_013   2010-11-17 00:37:21   com.facebook.katana
3        USER_002   MACHINE_013   2010-11-17 00:38:00   com.google.android.gm
4        USER_013   MACHINE_012   2010-11-17 08:22:24   com.mywoo.busplus
5        USER_013   MACHINE_012   2010-11-17 08:24:25   com.google.android.gm
6        USER_013   MACHINE_012   2010-11-17 08:26:13   com.facebook.katana
7        USER_013   MACHINE_012   2010-11-17 08:27:34   com.android.browser
8        USER_013   MACHINE_012   2010-11-17 15:02:59   com.facebook.katana
9        USER_013   MACHINE_012   2010-11-17 15:04:19   com.android.browser
10       USER_013   MACHINE_012   2010-11-17 15:04:44   com.google.android.gm
11       USER_002   MACHINE_013   2010-11-17 16:58:03   com.android.camera
12       USER_002   MACHINE_013   2010-11-17 16:58:14   com.facebook.katana
13       USER_002   MACHINE_013   2010-11-17 16:59:07   com.facebook.katana
14       USER_002   MACHINE_013   2010-11-17 17:47:07   com.android.camera
15       USER_002   MACHINE_013   2010-11-17 17:47:20   com.facebook.katana

Table III. Session data.

User ID     Machine ID    Applications Used                    Duration (s)

USER_002    MACHINE_013   Gmail, Facebook, Gmail               94
USER_013    MACHINE_012   Busplus, Gmail, Facebook, Browser    310
USER_013    MACHINE_012   Facebook, Browser, Gmail             105
USER_002    MACHINE_013   Camera, Facebook, Facebook           64
USER_002    MACHINE_013   Camera, Facebook                     13

3.3 Summary Statistics of Data

Initially, there are 262,858 log records in our raw data. After data processing, there are 25,880 sessions. We then present summary statistics and charts to illustrate some properties of our session data.

Figure 5 shows the number of sessions to which each user contributes. There are 25 users in the data we use; the user IDs run from USER_001 to USER_030, but USER_001, USER_008, USER_012, USER_014, and USER_026 have no records.

Figure 5. Users and their corresponding number of sessions.

Package Name is the name of an application used by a user. Operational Activity is the type of operation that a user performed when using an application. There are 1,132 Package Names and 3,099 Operational Activities in our session data. Hence, there are over one thousand applications that users installed and used, and over three thousand operations that users performed. It is clear that an application may be associated with several operations or activities. For example, the application “Facebook” may contain activities such as “Login”, “Home”, “Feedback”, and “Upload Photo”.

We also gathered statistics for groups of applications, based on category data that does not come from our raw data.


However, for many applications there is no category information. We took a detailed look at these statistics and found that many commonly used applications, such as Album, Camera, and Gtalk, have no category. Moreover, we discovered that some applications cannot be categorized accurately. For example, Gmail has been categorized as “Communication”; however, it could also be considered a “Tool”. Due to the incomplete information and the ambiguous definition of categories, we do not use the category information in our research.

Figure 6. Statistics of application categories.

The large number of applications installed by the users does not mean that they use them all equally. We find that users devote most of their time to a few specific applications.

Let us take the most popular application as an example: Facebook appears in over 20,000 log records out of the total of 262,858 log records.

Since our focus in this thesis is on applications, we show the distribution of the top 30 applications in Figure 9,

Arcade & Action Books & Reference Brain & Puzzle Business Cards & Casino Casual Communication Education Entertainment Finance Health & Fitness Libraries & Demo Lifestyle Media & Video Music & Audio News & Magazines Personalization Photography Productivity Racing Shopping Social Sports Games Tools Transportation Travel & Local Weather NULL

Number of Application(s)

Category


omitting three launchers. The reason these launchers are removed is that, rather than being actual applications, they are automatic triggers that are recorded whenever the user presses the Home key. Therefore, we can in fact generate rule sets with clearer meanings without these launchers in between. In the remainder of this thesis, all related discussions focus only on the top 30 applications.

We calculate the difference in time between the first and the last record in a session as the duration of the session, and we show the distribution of duration in Figure 7. We can see that a large proportion of users’ one-time usage periods (as defined by the 10-minute rule given earlier) last under 20 minutes. This phenomenon is evidence that supports our definition.
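As a simple illustration, the following sketch computes a session’s duration in this way; it is not our original data-processor code, and the class and method names are illustrative.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

public class SessionDuration {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Duration (in seconds) = time of the last log record minus time of the first log record.
    // The list is assumed to be sorted by time, as in Table II.
    static long durationSeconds(List<String> timestamps) {
        LocalDateTime first = LocalDateTime.parse(timestamps.get(0), FMT);
        LocalDateTime last  = LocalDateTime.parse(timestamps.get(timestamps.size() - 1), FMT);
        return Duration.between(first, last).getSeconds();
    }

    public static void main(String[] args) {
        // The first session of USER_002 in Table II: 00:36:26 to 00:38:00 -> 94 seconds.
        System.out.println(durationSeconds(List.of(
                "2010-11-17 00:36:26", "2010-11-17 00:37:21", "2010-11-17 00:38:00")));
    }
}
```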

We also show the distribution of the number of log records in a session in Figure 8. We calculate the number of locations in a session (that is, the number of distinct places where a user was while using the smartphone during that period of time) by counting the number of different latitude-longitude pairs in the session. Our finding is as follows: although a smartphone is a mobile device, most of the users (who contributed their data to us) stayed at one location (or stayed in a place where accurate latitude and longitude data are not available) and did not move.
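Counting the locations of a session can be sketched similarly; the log-record fields for latitude and longitude shown here are illustrative, since Table II omits these columns.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocationCount {
    // One log record with its (possibly missing) GPS reading; field names are hypothetical.
    record LogRecord(String packageName, Double latitude, Double longitude) {}

    // Number of locations in a session = number of distinct latitude-longitude pairs.
    static int uniqueLocations(List<LogRecord> session) {
        Set<String> coords = new HashSet<>();
        for (LogRecord r : session) {
            if (r.latitude() != null && r.longitude() != null) {
                coords.add(r.latitude() + "," + r.longitude());
            }
        }
        return coords.size();
    }
}
```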

Figure 7. Distribution of duration time.

Figure 8. Distribution of the number of logs in a session.

Considering the top 30 applications shown in Figure 9, we calculate the number of applications (among the top 30) that are used in a session, as shown in Figure 11. Please note


that we consider only the applications among the top 30, regardless of how many times an application outside the top 30 is used in a session. There are 1,603 (out of 25,880) sessions that do not use any of the top 30 applications; this portion is small.

Figure 9. Top 30 application names and their numbers of sessions.

(The 30 applications shown, each with its number of sessions, are Facebook, Browser, Settings, Phone, Gmail, WhatsApp Messenger, Privacy Blocker, GoogleTalk, mms, plurk, htcdialer, htccontacts, Calendar, Panda Home, World Clock, packageinstaller, Album, Superuser, Mail, Handcent SMS, DocumentsToGo, Camera, Alarm Clock, Google Maps, Contacts, BlackMarketApp, google.android.gsf, joelapenna.foursquared, rechild.advancedtaskkiller, and urbandroid.sleep.)

Figure 10. Distribution of the number of unique locations in a session.

Figure 11. Distribution of the number of top 30 applications used in a session.

Finally, we calculate the length of the sequence, which indicates how many applications are used sequentially in a session. As shown in Figure 12, we can see that most users switch


among 10 or fewer applications during their one-time usage period.

Figure 12. Distribution of the length of sequence. (Number of sessions by sequence length: 0~5: 20,938; 6~10: 2,811; 11~15: 1,000; 16~20: 465; 21~25: 237; 26~30: 129; 31 and above: 300.)

Consequently, our session data contains the following basic information and statistics for each session: user ID, machine ID, start time, end time, duration, number of log records, and number of locations.


CHAPTER 4 EXPERIMENT

We have a database that stores smartphone users’ behavior log records, as introduced and described in detail in the previous chapter. We have also defined what a session is and explained the characteristics of the session data. Now, we demonstrate experiments using the traditional association rule mining method, the sequential pattern mining method, and an advanced sequential pattern mining method that considers a time constraint.

Figure 13. Sequential pattern mining flow chart.

4.1 Association Rule Mining Method

4.1.1 Data Processor

We use the session data (Table IV), which reflects the daily usage of multiple users over a long period of time.

Table IV. Session data.

User ID     Machine ID    Applications Used                    Duration (s)

USER_002    MACHINE_013   Gmail, Facebook, Gmail               94
USER_013    MACHINE_012   Busplus, Gmail, Facebook, Browser    310
USER_013    MACHINE_012   Facebook, Browser, Gmail             105
USER_002    MACHINE_013   Camera, Facebook, Facebook           64
USER_002    MACHINE_013   Camera, Facebook                     13

In order to find relations between applications, we begin with association rule mining. We apply the transaction concept used in market basket analysis and create sets of session data with Boolean records, called Boolean-sessions. Since we have already gathered the statistics used to select the 30 most frequently used applications, we add one corresponding field per top-30 application to each Boolean-session record. These fields are filled with Boolean values to create a data set of Boolean-session records, as shown in Table V. As a result, each Boolean-session includes not only basic user information (user ID, machine ID, and the applications used, in Boolean form) but also session statistics such as the duration of the session, the number of locations in the session, etc.

It is worth emphasizing that in each session we focus only on whether an application is used. No matter how many times an application is used in the session, its corresponding field is set to ”1”, and the fields follow the fixed order of the top 30 applications instead of the order in which applications are used in the session.
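A minimal sketch of this Boolean-session construction follows; it is not our actual data-processor code, and the array of top-30 names is abbreviated here for illustration.

```java
import java.util.List;

public class BooleanSession {
    // Fixed order of the top-30 applications (abbreviated here for illustration).
    static final String[] TOP_APPS = {"Facebook", "Browser", "Settings", "Phone", "Gmail", "Camera"};

    // Turns the list of applications used in one session into a fixed-order Boolean record:
    // a field is 1 if the application was used at least once in the session, 0 otherwise.
    static int[] toBooleanRecord(List<String> appsUsedInSession) {
        int[] record = new int[TOP_APPS.length];
        for (int i = 0; i < TOP_APPS.length; i++) {
            record[i] = appsUsedInSession.contains(TOP_APPS[i]) ? 1 : 0;
        }
        return record;
    }

    public static void main(String[] args) {
        // First session of USER_002 in Table IV: Gmail, Facebook, Gmail.
        System.out.println(java.util.Arrays.toString(
                toBooleanRecord(List.of("Gmail", "Facebook", "Gmail"))));
        // Prints [1, 0, 0, 0, 1, 0]; Gmail appears twice but its field is still just 1.
    }
}
```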

Table V. Boolean-session data set.

User ID    Machine ID    Facebook   Browser   Gmail   Camera

USER_002   MACHINE_013   1          0         1       0
USER_013   MACHINE_012   1          1         1       0
USER_013   MACHINE_012   1          1         1       0
USER_002   MACHINE_013   1          0         0       1
USER_002   MACHINE_013   1          0         0       1

Since each record marks which of the most frequently used applications appear in a session, we can generate association rules with the Apriori-based algorithm in Weka.

Weka [26] is open source data mining software written in Java; it is a collection of machine learning algorithms for data mining tasks. Weka contains implementations of algorithms for classification, clustering, and association rule mining, together with graphical user interfaces and visualization utilities for data exploration. The algorithms can either be applied directly to a data set in a particular format (such as CSV or ARFF) or be called from the user’s own Java code. We use Weka 3.6.6 in our experiments.
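For illustration, the sketch below shows one plausible way to drive Weka’s Apriori implementation from Java code; the file name is hypothetical, and the Boolean fields are assumed to be declared as nominal attributes (e.g., {0, 1} in an ARFF header), since Apriori works on nominal data.

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        // Load the Boolean-session data set (hypothetical file name);
        // the top-30 application fields are nominal attributes with values {0, 1}.
        Instances data = new DataSource("boolean_sessions.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.01); // itemsets must occur in at least 1% of sessions
        apriori.setMinMetric(0.03);            // minimum confidence, as used in our experiment
        apriori.setNumRules(50);               // report up to 50 rules
        apriori.buildAssociations(data);

        System.out.println(apriori);           // prints the generated association rules
    }
}
```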

We define our problem as follows. An item i in the data set is an application that a user used on a smartphone. Let the set of items I = {i1, i2, …, in} be the itemset, and let the set of Boolean-session records T = {t1, t2, …, tm} be the data set that keeps the usage records as transactions. An association rule over two sets of items X and Y can be generated in the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.
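The support and confidence reported for our rules follow the standard definitions:

```latex
\mathrm{supp}(X) \;=\; \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert},
\qquad
\mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} .
```

In our tables, the support of a set of items is also reported as the raw number of sessions that contain it, i.e., the fraction above multiplied by the total number of sessions.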

In the association rule mining experiment, we apply our Boolean-session data, produced by our data-processor code in Java that creates the Boolean “applications used” fields, directly to Weka in CSV format. After the Apriori-based method in Weka is executed, we can generate a number of rules with the above method. We extract some sample rules that involve Facebook, Camera, and Album, and report them in Table VI.

Table VI. Rules generated by Apriori-based algorithm.

Set of items X (support)    Set of items Y (support)    Confidence (conditional probability)

The minimum confidence is set to 0.03.

Association rule mining aims at finding “intra-transaction” relationships. Although these relationships represent some kind of usage pattern, we do not know the sequence of the application usages. Take the following result as an example:

Rule 9: Album (1088) ⇒ Camera (586) 0.54

The above result means that Album and Camera are each used in more than one percent of all sessions; the number of sessions that used Album is 1,088, and the number of sessions that used both Album and Camera is 586. Moreover, under the condition of using Album, the probability of using Camera is 0.54. A similar corresponding rule can be seen in Rule 15.
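In other words, the reported confidence is simply the ratio of the two session counts:

```latex
\mathrm{conf}(\text{Album} \Rightarrow \text{Camera})
  \;=\; \frac{\lvert \{\text{sessions using both Album and Camera}\} \rvert}
             {\lvert \{\text{sessions using Album}\} \rvert}
  \;=\; \frac{586}{1088} \;\approx\; 0.54 .
```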

However, we cannot learn from this rule which application is used first.

Therefore, we intend to apply sequential pattern mining to our data set.

Table VII. Rules generated by Apriori-based algorithm.

Set of items X Set of items Y confidence


4.2 Sequential Pattern Mining Method

In the previous section, we applied association rule mining to the Boolean-session data and obtained some results. However, we are not satisfied with these results: we want more informative rules that tell us not only which applications may be used together but also the order in which they are used. That is, we are interested in answering the following question: “If a user has used a smartphone application, what will be the next application that he or she uses?” Another important problem in data mining is the sequential pattern mining problem, whose goal originated from discovering patterns indicating the sequence of items purchased by a customer.

However, we cannot use the sequential pattern mining method directly to generate frequent sequential patterns for our purpose, because a pattern found by the traditional sequential pattern mining algorithm indicates correlations between transactions, and such “inter-transaction” relations are not what we want. In other words, results found by the classic association rule mining algorithm concern which items from the same transaction occur together frequently, whereas results found by the traditional sequential pattern mining algorithm show which items from different transactions appear in a certain order. Considering these properties, we need to design a method that can find rules in which items appear in a certain order and come from the same transaction. In addition, no existing sequential pattern mining algorithm or tool can be used directly on our data.

4.2.1 Data Processor

So, first of all, we need a data processor to perform a series of processing steps, including purification and partition. This time, however, the data is different from the Boolean-session data: we create a session representation that keeps the concept of a “transaction” while recording the sequence of applications used. We then need a data formatter for the SPMF tool (a Sequential Pattern Mining Framework) [27] and a result processor that deals with the output of SPMF.

We prepare the data set from the session data (Table IV) and bring in the concept of a sequence to create a Sequence-session record for each session. As a result, we record the applications one by one according to their timestamps instead of using Boolean values.

The following is an example: Table VIII shows a portion of the raw data, and Table IX shows the Sequence-session transformed from Table VIII.

Table VIII. A portion of the raw data.

User ID    Machine ID    Time                  Package Name    Operational Activity

USER_005   MACHINE_018   2010-11-17 16:52:15

Table IX. A sample of Sequence-session.

User ID Machine ID App1 App2 App3 App4 App5

USER_002 MACHINE_013 Camera Facebook Launcher Facebook Launcher

We partition the log records into sessions according to each user and machine. Logs within 10 minutes of each other are considered relevant and are combined into the same session, whereas logs that are more than 10 minutes apart are divided into different sessions (since the user is likely to have finished his or her usage). We assume that there is only a weak relationship between two application usages if their timestamps are more than 10 minutes apart. Consequently, the last log in Table VIII does not join the session because of its 30-minute time gap, and the first log in Table VIII is also excluded from the session because it was produced by USER_005, not USER_002.
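A minimal sketch of this partitioning step is given below; it is a simplified stand-in for our Java data processor, and the record and field names are illustrative.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class SessionSplitter {
    // One raw log line; field names are illustrative.
    record Log(String userId, String machineId, LocalDateTime time, String packageName) {}

    static final Duration GAP = Duration.ofMinutes(10);

    // Splits the logs of ONE user-machine pair (already sorted by time) into sessions:
    // a gap of more than 10 minutes between consecutive logs starts a new session.
    static List<List<Log>> split(List<Log> logsOfOneUserMachine) {
        List<List<Log>> sessions = new ArrayList<>();
        List<Log> current = new ArrayList<>();
        for (Log log : logsOfOneUserMachine) {
            if (!current.isEmpty()
                    && Duration.between(current.get(current.size() - 1).time(), log.time())
                               .compareTo(GAP) > 0) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(log);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }
}
```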

Then we discovered that successive duplicate Package Names (applications the user used) exist within sessions. This can simply mean that the user keeps using an application with different activities during the session, or it can come from the fact that we filter out infrequent applications when we construct the sessions. In either case, it is better to eliminate these duplicates and keep only one appearance of the application, since we focus on usage patterns instead of activity patterns. Otherwise, if this situation were abundant, PrefixSpan would generate sequential patterns containing identical applications in sequence. Such patterns would be uninformative for our research and also erroneous, because each such run should be considered a single usage of that application.
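The duplicate elimination can be sketched as follows, assuming a session is already represented as an ordered list of application names:

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateRemover {
    // Collapses runs of the same application into a single appearance,
    // e.g. [Camera, Facebook, Facebook] -> [Camera, Facebook].
    static List<String> collapseConsecutive(List<String> apps) {
        List<String> result = new ArrayList<>();
        for (String app : apps) {
            if (result.isEmpty() || !result.get(result.size() - 1).equals(app)) {
                result.add(app);
            }
        }
        return result;
    }
}
```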

Moreover, we do not want the traditional sequential pattern mining algorithm to group our sessions by User ID and find relationships between sessions. The reason is as follows: it is not reasonable, for example, to use Facebook in the morning and Camera in the afternoon and still consider these two usages to be in the same session. The usages of these two applications may be related, but we think it makes more sense to treat them as two usages in two different sessions. The following tables show our Sequence-session data, in which items are divided by the time gap (Table X), and the traditional sequence representation, which groups all sessions of the same user into one sequence (Table XI).

Table X. A sample of our Sequence-session data.

User ID    Machine ID    App1       App2       App3       App4      App5

USER_002   MACHINE_013   Gmail      Facebook   Gmail      -         -
USER_013   MACHINE_012   Busplus    Gmail      Facebook   Browser   -
USER_013   MACHINE_012   Facebook   Browser    Gmail      -         -
USER_002   MACHINE_013   Camera     Facebook   Facebook   -         -
USER_002   MACHINE_013   Camera     Facebook   -          -         -

Table XI. A sample of traditional sequence.

User ID    Sequence

USER_002   (Gmail, Facebook, Gmail), (Camera, Facebook, Facebook), (Camera, Facebook)
USER_013   (Busplus, Gmail, Facebook, Browser), (Facebook, Browser, Gmail)

4.2.2 Sequential Pattern Mining

In the sequential pattern mining experiment, we apply our Sequence-session data (already processed by the data processor) to SPMF [27] via a data formatter. SPMF is an open source data mining platform written in Java that offers implementations of data mining algorithms for association rule mining, sequential pattern mining, clustering, etc. SPMF is a project founded by Philippe Fournier-Viger and has been cited by or used in various studies since 2010. In our work, we use the PrefixSpan algorithm and its implementation in the open source SPMF toolkit.

After applying the data processor and the data formatter to the raw data, we make minor changes to the source code of the PrefixSpan module provided by SPMF in order to make the algorithm suitable for our data set.

The data processor has already been illustrated with an example in Section 4.2.1. The data formatter is a program that converts application names into integers so that the SPMF tool can run normally.
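A sketch of such a formatter is shown below (illustrative rather than our exact code). It produces SPMF’s sequence-database text format, in which each application is a single-item itemset, itemsets are separated by -1, and each sequence line ends with -2.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpmfFormatter {
    // Maps every application name to a unique integer identifier, in order of first appearance.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    private int idOf(String app) {
        return ids.computeIfAbsent(app, a -> ids.size() + 1);
    }

    // Converts one Sequence-session into one line of SPMF input:
    // each application is a single-item itemset (-1 after it), and -2 closes the sequence.
    String toSpmfLine(List<String> sessionApps) {
        StringBuilder line = new StringBuilder();
        for (String app : sessionApps) {
            line.append(idOf(app)).append(" -1 ");
        }
        return line.append("-2").toString();
    }

    public static void main(String[] args) {
        SpmfFormatter f = new SpmfFormatter();
        // Second session in Table X: Busplus, Gmail, Facebook, Browser.
        System.out.println(f.toSpmfLine(List.of("Busplus", "Gmail", "Facebook", "Browser")));
        // Prints: 1 -1 2 -1 3 -1 4 -1 -2
    }
}
```

The resulting file can then be handed to SPMF’s PrefixSpan implementation, and the result processor later inverts the name-to-integer mapping so that the mined patterns become readable again.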

After that, we need a result processor that deals with the output of SPMF and transforms it into readable and meaningful results. PrefixSpan is based on pattern growth, which is one of the major techniques for sequential pattern mining. The key idea is to avoid the candidate generation step altogether and to focus the search on a restricted portion of the initial database.

There are three major properties that make the PrefixSpan algorithm suitable for our research.

First of all, PrefixSpan is designed for dense databases, which exactly matches our data (composed of the top 30 most frequently used applications, whose supports all exceed 0.5%). Second, PrefixSpan needs no candidate generation: it recursively builds the projected databases of frequent prefixes, which are generated based on their suffixes. It finds new prefixes of length 1 and grows each pattern one item longer at a time. Therefore, it can be extended with additional constraints or conditions between items, such as a time constraint, when forming frequent patterns.

Moreover, since PrefixSpan constructs patterns recursively by growing on the prefix, it has several practical advantages: it is capable of dealing with very large databases, and the search space is reduced at each step, allowing good performance even with small support thresholds.

In addition, three different projection methods have been developed for PrefixSpan: level-by-level projection, bi-level projection, and pseudo projection. In SPMF, the PrefixSpan algorithm is implemented with pseudo projection, which is the most efficient of the three.


Pseudo projection avoids constructing a physical projected database by representing each suffix as a pair consisting of a pointer (to the original sequence) and an offset value. As long as a projected database fits in main memory, the cost of projection can be reduced significantly.
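The idea can be illustrated with a hypothetical simplification (these are not SPMF’s actual classes): a pseudo-projected entry keeps only a pointer to the original sequence and the offset at which the suffix starts, instead of copying the suffix itself.

```java
import java.util.List;

public class PseudoProjection {
    // A pseudo-projected suffix: no items are copied, only a pointer to the
    // sequence in the original database and the offset where the suffix begins.
    record ProjectedSuffix(int sequenceIndex, int offset) {}

    // Reads the suffix lazily from the original database when it is needed.
    static List<Integer> suffixOf(List<List<Integer>> database, ProjectedSuffix p) {
        List<Integer> sequence = database.get(p.sequenceIndex());
        return sequence.subList(p.offset(), sequence.size());
    }
}
```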

4.3 PrefixSpan Algorithm with Time Constraint

The rule sets we extracted for some examples in the previous section are already useful,

