Organization of the Thesis - Mining Smartphone Usage Patterns

CHAPTER 1 INTRODUCTION

National Chengchi University (國立政治大學)

smartphone users. Third, it provides an analysis of the discovered rules. This thesis contributes a better understanding of smartphone users’ behaviors and a detailed analysis of the results. It uses a real data set to discover rules that could benefit the designers of smartphone applications or user interfaces.

1.4 Organization of the Thesis

The rest of the thesis is organized as follows. Chapter 2 gives the background of the work presented in this thesis. It starts with the MIT Reality Mining project, one of the first research projects to study mobile phone user behavior. It then discusses recent studies on user behavior analysis for smartphones and gives data mining preliminaries, including association rule mining and sequential pattern mining.

Chapter 3 gives a detailed description of the data we use, including how the raw data were collected and processed, and presents summary statistics of the data. Chapter 4 describes how we apply data mining techniques to the data and reports the experimental results. Chapter 5 takes a closer look at the results and discusses the patterns observed in them. Finally, Chapter 6 concludes the thesis and outlines future work.

CHAPTER 2 BACKGROUND

2.1 Traditional Research Methods

In the paper [1], the authors describe several traditional research methods, including questionnaire surveys, interviews, laboratory tests, and road tests, and give a detailed description of each. In the following, we briefly introduce these methods.

A questionnaire survey is one of the most common methods. It can use fixed or open answers, and it can be administered on paper or online. With the development of the Internet and communication technologies, questionnaire surveys provide a quick way to collect data from a large group of users. However, a questionnaire survey is based on pre-specified questions, and the answers are subjective, which leaves less flexibility in how a study can be deployed.

An interview is an interactive method in which an interviewer asks questions directly to the respondent. Moreover, interviewers can guide the discussion according to their own interests. This method can be improved by technology in many ways. For example, interviews can take respondents’ emotional characteristics into consideration by using special equipment to capture a respondent’s expressions or physiological reactions as feedback. However, conducting interviews is usually slow and costly. In addition, just like answers to questionnaires, answers in an interview are subjective because they are controlled by interviewers and affected by respondents’ conditions.

Laboratory tests and road tests are both prepared tests. A laboratory test takes place in a fixed context, while a road test takes place in a more natural context. As a result, both kinds of tests are conducted in controlled environments.

2.2 Research on Mobile Phones

2.2.1 Reality Mining

The Reality Mining project is one of the pioneering projects in computer science dedicated to the study of mobile phone user behavior. The goal of the project is to use the data collected on mobile phones to answer questions about user behavior for a wide range of applications.

The Reality Mining data set consists of data collected from one hundred Nokia 6600 smartphones, pre-installed with several logging software packages and used by students or faculty members at MIT. The information collected includes call logs, the applications used, the phone status, the devices in proximity, etc. Given the phone usage statistics, interesting observations can be made. For example, we can learn the share of communication among all users, or the distribution of applications used in a given context.

Furthermore, the data set can be used to perform user behavior modeling and prediction. An entropy-related metric is used to quantify the amount of predictable structure in a user’s life, and users can be categorized as living a low-entropy or high-entropy life by tracking their usage patterns. In addition, the information enables a customized log application that allows users to query their own lives and provides predictions about their behavior in the immediate future.
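
As a rough illustration of how an entropy-related metric can separate routine from varied behavior, the sketch below computes the Shannon entropy of a user's observed contexts. It assumes plain Shannon entropy over discrete labels, which may differ from the exact metric used in the Reality Mining project; the function name and the sample logs are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Shannon entropy (in bits) of the empirical distribution of events."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical logs: one observed context label per hour of a day.
routine_user = ["home"] * 12 + ["work"] * 12
varied_user = ["home", "work", "gym", "cafe", "bar", "park"] * 4

print(shannon_entropy(routine_user))  # two equally likely contexts
print(shannon_entropy(varied_user))   # six equally likely contexts
```

Under this assumption, a user who alternates between two contexts scores lower than a user who spreads time evenly over six, which matches the low-entropy versus high-entropy distinction described above.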

The development of mobile technologies creates many interesting research areas, and one of those areas is the study of trajectory data streams. Studies in this area utilize the location data provided by GPS-equipped mobile devices. Users’ locations are transformed into trajectories. These trajectory data streams can be used to predict the route of a mobile phone user, or they can be used to discover groups of users who travel together. The applications include location-based recommendation systems, mobile commerce, and customized navigation systems.

In the paper [2], the authors focused on not only the spatial but also the temporal information in the trajectory data. They used a probabilistic suffix tree (PST) to hold both spatial and temporal information, which together represent sequential patterns of movement, and they proposed an algorithm that traverses the tree and generates predictions for locations. Their idea is to discover sequential patterns in the tree based on support values. Since temporal information is included, the sequential patterns also indicate the points in time when users will be at the locations.

Some studies aim at discovering groups of users who travel together, and they combine trajectories in the groups or clusters to provide more accurate results. The authors of the paper [3] proposed an algorithm to discover user clusters, or communities in their term, by mining and comparing similarities between trajectories stored in PST. Once the clusters are discovered, the information can be used to improve route prediction in various ways. For example, a trajectory pattern mining framework called Clustering and Aggregating Clues of Trajectories (CACT) was proposed in the paper [4]. The approach utilizes clues such as patterns of movement discovered from previously observed trajectories or the trajectories of other users in the same cluster to infer a user’s route during the “silent duration” in which no data points are recorded for that user.

Other research focuses on the process of collecting trajectories. L. Wei et al. [5] addressed the trajectory search problem with a pattern-aware trajectory search (PATS) framework. They used a measurement to evaluate the “attractiveness” of trajectories when conducting trajectory searches. The measurement, called potential regions, is based on popularity and sequential travel records from a given set of trajectories. Using this evaluation of attractiveness, PATS returns the top-K trajectories as results. The framework allows researchers to efficiently filter trajectories and collect the information in which they are interested.

2.3 Analysis of Smartphone User Behavior

Given the rapid growth of the smartphone industry, there is an urgent need for studies in this field. Among the studies that have been conducted to address related issues, user behavior analysis has remained an important topic. The abundant information that accompanies smartphone usage data has made this area even more valuable.

Some studies aim at categorizing the users into certain types. As an example, Chittaranjan, Blom, and Gatica-Perez proposed a method to discover personality traits of a user [6]. Using data sources such as GPS, call logs, and Bluetooth, they used a machine learning method along with their Big-Five and gender-specific models to categorize users [6]. Their results would allow the design of mobile applications to better fit users’ individual needs [6].

Usage data can also be used in studies that address the problem of malware detection. By analyzing process state transitions and user operational patterns, one can distinguish the operations performed by human users from the ones performed by devices infected by malware [7]. Moreover, in the paper [8], a crowdsourcing system was used to collect the traces of applications’ behaviors, and the traces were used to identify applications

Given their computing power, mobility, and downloadable applications, smartphones attract research on user behavior not only from computer science researchers but also from researchers in other fields. The paper [9], for example, points out that the smartphone has become commonplace within the medical field as both a personal and a professional tool. A digital survey examining smartphone and associated app usage was administered via email to all ACGME training programs. Data regarding respondent specialty, level of training, use of smartphones, use of smartphone apps, desired apps, and commonly used apps were collected and analyzed in that study.

In the paper [10], the authors collected data from Finnish smartphone users with a handset-based measurement platform developed for Nokia S60 and presented statistical results. They illustrate the time use of Finnish people, showing where the users spend their time and in what contexts. Moreover, they give the distribution of active usage time across applications (including voice calls, messaging, browsing, multimedia, and business and productivity) and hours of the day. The paper [11] also proposed a framework for mobile audience measurement that collects data directly from mobile devices rather than through user surveys, providing a number of statistics and analytics. Some of the analytics concern finding the most important application categories and presenting the distribution of their usage on a weekly or monthly basis. The paper [12] presents results on app usage at a national level using anonymized network measurements in the U.S. and investigates how, where, and when smartphone apps are used from spatial, temporal, and user perspectives at large scale.

Falaki et al., as another example, focused on relating intentional user activities to network traffic and power consumption [13]. They analyzed the relative time spent with each application during each hour of the day for a sample user’s top applications, and found that the differences between user behaviors have a significant impact on the performance of mobile devices [13]. With knowledge of how a user interacts with his or her smartphone, user experience can be improved and future energy drain can be predicted more accurately [13].

Another paper [14] addresses a topic similar to that of Falaki et al. It discusses the relationship between usage patterns and battery consumption. The authors collected real usage log data from real smartphone users over a two-month period and showed that every user has his or her own usage pattern. They then present a case study to show how usage pattern information can be applied to power management, mobile device management, and network management of smartphones.

2.4 Data Mining Methods

Data mining is a collection of techniques to discover meaningful knowledge hidden in large databases. Chen, Han, and Yu defined it as “a process of extracting nontrivial, implicit, previously unknown and potentially useful information such as regularities or rules from databases” [15]. There are various types of data mining techniques, designed for different tasks. For example, classification is the task of generalizing known structure to apply to new data; clustering is the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data; and association rule mining is the task of finding frequent patterns, associations, and correlations among sets of items or objects in transaction databases, relational databases, and other information repositories.

Because our goal is to find smartphone users’ usage patterns (that is, we want to extract relationships between the applications on a smartphone), we start our research from the viewpoint of the association rule mining problem.

2.4.1 Association Rule Mining Problem

Among the broad applications of data mining, some are connected to an important problem called the association rule mining problem, which was first introduced by Agrawal in the early 1990’s [16]. Given a transaction database in which each transaction consists of items purchased by a customer, the association rule mining problem is how to efficiently generate significant rules, each of which associates the purchase of some items with the purchase of others.

Below are the notations used by Agrawal [16]: Let a set of items be denoted as I = {i1, i2, … , im} and a set of transactions be denoted as T = {t1, t2, … , tn}. Each transaction in T contains a set of items. D is a database that consists of a set of transactions. An association rule is in the form of X ⇒ Y, where X and Y are sets of items (or itemsets) that satisfy X, Y ⊆ I and X∩Y = ∅. X and Y are called antecedent and consequent of an association rule, respectively. Such a rule means that X implies Y.

To select interesting rules out of all possible rules, constraints are usually applied to measure a rule’s significance. The most well-known constraints are minimum support and minimum confidence [17]. For an association rule X ⇒ Y, support is defined as the percentage of transactions in D that contain the itemset X, and confidence is defined as the percentage of transactions in D’ that contain both X and Y, where D’ is the set of transactions containing X. Their definitions are given as follows:

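$$\mathrm{Support}(X) = \frac{|\{\, t \in D : X \subseteq t \,\}|}{|D|}$$

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$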

A higher value of Support(X) indicates a higher percentage of the transactions in which customers purchase X, and a higher value of Confidence(X ⇒ Y) means that more transactions containing X also contain Y. We can specify a minimum support as a threshold to find the association rules in which we are interested; the association rules are generated from itemsets whose supports exceed the threshold. Moreover, we can also specify a minimum confidence as a threshold to find association rules that tell us which itemsets are frequently purchased together.
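
The two measures can be made concrete with a small worked example. The sketch below is only an illustration of the definitions of support and confidence given in this section; the function names and the four sample transactions are hypothetical and are not taken from the data set used in this thesis.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of X union Y divided by support of X.

    Assumes the antecedent occurs in at least one transaction.
    """
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)


# Hypothetical transaction database D with four transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread"}, transactions))               # 3 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 with bread
```

Here the rule {bread} ⇒ {milk} has a support of 2/4 for the combined itemset and a confidence of 2/3, because two of the three transactions containing bread also contain milk.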

2.4.2 Apriori-based Approach

The Apriori algorithm, introduced by Agrawal and Srikant, is an efficient algorithm proposed to handle the association rule mining problem [17]. It is efficient in its candidate generation process and its adoption of a new pruning technique. Apriori finds all the frequent itemsets in a database in two phases. First, it generates the candidate itemsets and checks their support values. Only the itemsets whose support values exceed the pre-specified minimum support remain and are called frequent itemsets. Second, the candidate k-itemsets are generated from the frequent (k-1)-itemsets. The most important property of frequent itemsets used by the Apriori algorithm is that every (k-1)-subset of a frequent k-itemset must itself be frequent.
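
The level-wise procedure can be sketched in a few lines. The code below is a minimal, unoptimized illustration of the candidate generation and pruning idea, not the implementation from [17] and not the one used in this thesis. The function name and the sample data are hypothetical, and the sketch treats an itemset as frequent when its support is at least the threshold, which is an assumption about how ties are handled.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # Count each candidate and keep those that meet the threshold.
        kept = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # Level 1: every single item is a candidate.
    current = frequent_among({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = frequent_among(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


# Hypothetical database of five transactions over items A, B and C.
db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(db, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With this data, all single items and all pairs are frequent, while {A, B, C} appears in only two of five transactions and is rejected after being counted.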

2.4.3 Sequential Pattern Mining Problem

Although association rule mining is practical, it does not consider the points in time when items were purchased. For example, the rule A ⇒ B tells us that if a customer purchased A, then he or she would also purchase B in the same transaction. However, it does not tell us whether the customer purchased A before B, or whether the customer would purchase B simply because he or she had purchased A. This limitation makes association rule mining algorithms inappropriate in situations where the points in time when items were purchased are important. We are interested in answering the following question: if a user has used a smartphone application, what is the next application that he or she would use?

Another important problem in data mining is the sequential pattern mining problem, whose goal is to discover patterns indicating the sequence of items that were purchased by a customer.

Sequential pattern mining is a data mining task that extracts frequent subsequences as patterns from a sequence database; it aims at finding sets of data items that occur together frequently in some sequences.

It was first introduced by Agrawal and Srikant in the mid 1990’s [18]. The sequential pattern mining problem can be described as follows. There is a transaction database D, where each transaction consists of the following fields or columns: customer-id, transaction-time, and the items purchased in the transaction. An itemset is a non-empty set of items, and a sequence is an ordered list of itemsets. All transactions made by a customer are sorted by time in ascending order and together can be viewed as a sequence. Each transaction corresponds to a set of items, and the ordered list of transactions corresponds to a sequence. The purpose of sequential pattern mining is to discover the order in which items were purchased by a customer in successive transactions. For example, a sequential pattern indicates that if a customer purchased an item A in a transaction, then he or she would purchase an item B in a subsequent transaction.
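
The construction described above can be sketched as follows. The code groups transaction records by customer, sorts them by time to obtain one sequence per customer, and tests whether a pattern is contained in a sequence. All names and records are hypothetical, and the sketch assumes the usual containment definition, in which the itemsets of a pattern must appear in order but other transactions may occur in between.

```python
from collections import defaultdict


def build_sequences(records):
    """Turn (customer_id, time, items) records into one time-ordered sequence per customer."""
    by_customer = defaultdict(list)
    for customer_id, time, items in records:
        by_customer[customer_id].append((time, frozenset(items)))
    return {
        cid: [items for _, items in sorted(rows, key=lambda r: r[0])]
        for cid, rows in by_customer.items()
    }


def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, as subsets of itemsets in sequence."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of customers whose sequence contains the pattern."""
    pattern = [frozenset(p) for p in pattern]
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)


# Hypothetical records, deliberately not sorted by time.
records = [
    (1, 20, {"B"}), (1, 10, {"A"}),
    (2, 5, {"A", "C"}), (2, 9, {"B"}),
    (3, 7, {"B"}), (3, 8, {"A"}),
]
sequences = build_sequences(records)
print(sequences[1])                                 # A first, then B
print(sequence_support(sequences, [{"A"}, {"B"}]))  # customers 1 and 2 only
```

Customer 3 purchased B before A, so the pattern "A, then B" is supported by two of the three customers even though all three bought both items.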

Sequential pattern mining is a deep and broad research area in the field of data mining with a wide range of applications. One application is to apply frequent pattern discovery methods to Web log data in order to obtain information about users’ navigation behavior. Web mining and sequential pattern mining are both well researched, and many papers apply sequential pattern mining methods to Web log data as one kind of Web mining. In the paper [19], three pattern mining approaches are investigated from the Web usage mining point of view. Similar to this kind of Web mining, which automatically discovers user access patterns from the Web, our research aims to discover the usage patterns of smartphones.

2.4.4 PrefixSpan by Pattern-Growth Approach

Sequential pattern mining is an important data mining problem and a practical method with broad applications, from the analysis of customer purchase behavior to web browsing patterns, and from stock trend prediction to biological gene sequence discovery. As a result, since sequential pattern mining was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1995, many algorithms for it have been proposed.

In the paper [20], the authors classify current sequential pattern mining algorithms, including Apriori, AprioriALL, GSP, SPADE, FreeSpan, WAP-Mine, PrefixSpan, SPAM, PLWAP, DISC-all, FS-Miner, Apriori-GST, HVSM and LAPIN, into four categories (Apriori-based, pattern-growth, early-pruning, and hybrid algorithms) based on the key features supported by the techniques. Meanwhile, many papers select some algorithms and run experiments on specific data to compare their performance in execution time and memory, and discuss which algorithm is better. However, to the best of our knowledge, no algorithm is the best for all applications; which algorithm is best depends on the data characteristics and the application. In this thesis, we do not emphasize the running time or memory cost of different algorithms; our focus is on meaningful and interesting patterns. We therefore implement a “suitable” approach, meaning one that considers the characteristics of the data we use. We refer to an existing algorithm for the sequential pattern mining problem, implement methods for data processing and data format transformation, and also implement a modified version of the algorithm that considers time constraints between items.
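
To give a feel for the pattern-growth idea behind PrefixSpan, the algorithm named in this section's title, the sketch below mines frequent sequential patterns from sequences of single items by repeatedly extending a prefix and projecting the database onto what follows it. It is a simplified illustration under stated assumptions: each sequence element is a single item rather than an itemset, no time constraint is applied, and it is not the modified algorithm implemented in this thesis. The function name and the application names are hypothetical.

```python
from collections import defaultdict


def prefixspan(sequences, min_count):
    """Return (pattern, count) pairs for patterns found in at least min_count sequences."""
    results = []

    def grow(prefix, projected):
        # projected holds (sequence index, offset) pairs; the part of each
        # sequence from offset onward is the postfix that follows the prefix.
        counts = defaultdict(int)
        for seq_idx, start in projected:
            for item in set(sequences[seq_idx][start:]):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + [item]
            results.append((pattern, count))
            # Project on the first occurrence of item in each postfix.
            new_projected = []
            for seq_idx, start in projected:
                try:
                    pos = sequences[seq_idx].index(item, start)
                except ValueError:
                    continue
                new_projected.append((seq_idx, pos + 1))
            grow(pattern, new_projected)

    grow([], [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical application-launch sequences for three users.
logs = [
    ["mail", "browser", "maps"],
    ["mail", "maps"],
    ["browser", "mail", "maps"],
]
for pattern, count in prefixspan(logs, min_count=2):
    print(pattern, count)
```

Because each recursive call only scans the postfixes of sequences that already contain the current prefix, no candidate patterns are generated and then discarded, which is the main contrast with the Apriori-based approach.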

PrefixSpan is based on pattern growth, which is one of the major techniques for sequential pattern mining. Although it was proposed by J. Pei et al. [21] in the early 2000’s, PrefixSpan is still considered alongside more recent algorithms in a 2013 paper [22]. In the paper [23], the author states that “Among the various approaches, PrefixSpan was one of the most influential and efficient ones in terms of both time and space. Some approaches may achieve better performance under special circumstances; however the overall performance of PrefixSpan is among the best. For example LAPIN is more efficient for dense data sets with long patterns but less efficient in other cases. Besides, it consumes much memory
