A regression-based approach for mining user movement patterns from random sample data


Chih-Chieh Hung, Wen-Chih Peng ⁎

Department of Computer Science, National Chiao Tung University, Taiwan, ROC

Article info

Article history: Received 6 July 2009; received in revised form 13 July 2010; accepted 13 July 2010; available online 5 August 2010

Abstract

Mobile computing systems usually express a user movement trajectory as a sequence of areas that capture the user movement trace. Given a set of user movement trajectories, user movement patterns refer to the sequences of areas through which a user frequently travels. In an attempt to obtain user movement patterns for mobile applications, prior studies explore the problem of mining user movement patterns from the movement logs of mobile users. These movement logs generate a data record whenever a mobile user crosses base station coverage areas. However, this type of movement log does not exist in the system and thus generates extra overheads. By exploiting an existing log, namely, call detail records, this article proposes a Regression-based approach for mining User Movement Patterns (abbreviated as RUMP). This approach views call detail records as random sample trajectory data, and thus, user movement patterns are represented as movement functions in this article. We propose algorithm LS (standing for Large Sequence) to extract the call detail records that capture frequent user movement behaviors. By exploring the spatio-temporal locality of continuous movements (i.e., a mobile user is likely to be in nearby areas if the time interval between consecutive calls is small), we develop algorithm TC (standing for Time Clustering) to cluster call detail records. Then, by utilizing regression analysis, we develop algorithm MF (standing for Movement Function) to derive movement functions. Experimental studies involving both synthetic and real datasets show that RUMP is able to derive user movement functions close to the frequent movement behaviors of mobile users.

Keywords: User movement patterns; Data mining; Mobile data management

1. Introduction

Mobile services, such as navigation services, mobile search and location-aware services, are becoming very popular. These wireless communication systems enable users to access various kinds of information from anywhere at any time. A mobile computing system usually expresses a user movement trajectory as a sequence of areas in which the mobile user moves.¹ In this article, we aim at mining user movement patterns for a mobile user. Thus, given a user's set of movement trajectories, user movement patterns refer to the sequences of areas through which this user frequently travels. Analysis of user trajectory data can support the understanding and management of moving objects [1,2]. User movement patterns can be used to improve system performance, for example by designing personal paging areas [3] and by developing data allocation strategies [4–6], querying strategies [7], and navigation services [8,9].

To discover user movement patterns in a mobile computing system, the methods proposed in previous studies require movement logs to record the movements of mobile users. For example, in [4,5], when a mobile user moves from the coverage area of base station i to the coverage area of base station j, a handoff procedure is performed to smoothly switch communication channels between base stations. Meanwhile, the movement log generates a movement pair (i, j). However, the movement log is not an existing log of mobile systems, and generating it during handoff procedures incurs overhead. Hence, generating movement logs for all mobile users increases storage costs and decreases the performance of mobile computing systems. Therefore, prior works are not practical for mobile computing systems due to the overhead of generating movement logs.

In fact, mobile computing systems generate call detail records (abbreviated as CDRs) when a mobile user makes or receives phone calls [10]. Table 1 shows an example of call detail records, where Uid represents the identification of a user making or receiving calls, and Cellid represents the corresponding base station that serves that user. Time information (i.e., date and time) is recorded in the CDR.² Table 1 shows that the CDRs of a mobile user contain both spatial information (i.e., base station identification) and temporal information (i.e., date and time). Since CDRs reflect the movement behaviors of users, this article addresses the problem of mining user movement patterns from an existing log of CDRs, thereby reducing the overhead of generating a movement log.

⁎ Corresponding author. No. 1001 University Road, Hsinchu, Taiwan 300, ROC. Tel.: +886 3 5731478. E-mail address: wcpeng@cs.nctu.edu.tw (W.-C. Peng).

¹ This article defines a unit of area as the coverage area of one base station. For ease of presentation, we simply use a base station identification to represent the corresponding coverage area.

Contents lists available at ScienceDirect. Data & Knowledge Engineering. Journal homepage: www.elsevier.com/locate/datak

Fig. 1 shows some trajectories of one user, where the dashed line represents one real trajectory of this user and the regions with mobile phones indicate that the user is receiving or making phone calls. This user's calling behavior is captured in the log of CDRs, and Table 1 shows the CDR log. Fig. 1 shows that CDRs are data points that are randomly sampled from trajectories and that the corresponding locations of these CDRs are scattered over the mobile computing environment. As a result, mining user movement behaviors from CDRs is a challenging task. Given these random sample data points, we aim to derive movement functions that are close to real user trajectories. We refer to these movement functions as movement patterns because they reflect the frequent movement behavior of users. Fig. 2 shows the movement function of a user for the example above.

Table 1
An example of call detail records.

Uid   Date    Time    Cellid
1     Day 1   07:30   A
1     Day 1   09:32   D
1     Day 1   09:49   E
1     Day 1   13:50   H
1     Day 2   08:50   C
1     Day 2   09:50   E
1     Day 2   14:00   H
1     Day 3   07:15   A
1     Day 3   09:02   C
1     Day 3   09:30   D
1     Day 4   12:30   W
1     Day 4   12:52   X
1     Day 4   13:30   Y

² The real call detail records analyzed in this study were provided by Taiwan mobile service providers, and we only extracted some useful attributes of call detail records to mine user movement patterns.

Fig. 1. (a) Day 1. (b) Day 2. (c) Day 3. (d) Day 4.

This article proposes a novel approach, called RUMP (standing for Regression-based approach for User Movement Patterns), to mine user movement patterns from CDRs. Given a set of data points, the main objective of regression analysis is to derive a regression function that minimizes the sum of distances between the function derived and data points. In this approach, call detail records are viewed as data points, while the regression functions derived are regarded as movement functions. However, not all call detail records should be involved in mining user movement patterns. Without carefully selecting CDRs, user movement patterns cannot reflect the frequent movement behaviors of mobile users. On the other hand, CDRs should be fully utilized for mining user movement patterns since only limited information is available in the CDR logs. Thus, several issues remain to be addressed to efficiently utilize CDRs for mining user movement patterns.

1.1. Extracting frequent movement behaviors from CDRs

As mentioned before, user movement patterns refer to the frequent movement behaviors of mobile users. However, the CDR logs contain not only frequent user movement behaviors but also infrequent ones. For example, a user usually goes to his office and returns home every weekday (as Fig. 1(a), (b) and (c) show), and occasionally takes a trip (as Fig. 1(d) shows). The frequent movement behavior is the trajectory from his home to his office, whereas a trip is an infrequent movement behavior. Since regression analysis is sensitive to these infrequent CDRs, they should be eliminated. In other words, the call detail records that capture the frequent movement behaviors of users should be extracted. To extract the frequent movement behaviors of mobile users, we develop algorithm LS (standing for Large Sequence) to extract base stations whose coverage areas are frequently visited by users.

1.2. Determining the number of regression functions

Once CDRs that capture the frequent movement behaviors have been extracted, it is necessary to determine how many regression functions are needed. If only one regression function is derived, it may not be very close to the frequent user movement behavior. Thus, given a set of call detail records of the frequent movement behavior, clustering techniques can be used to divide call detail records into several groups. The number of groups is viewed as the number of regression functions. The movement trajectories of mobile users generally follow spatio-temporal locality (i.e., if the time interval between two consecutive calls of a mobile user is small, the mobile user is likely to have moved nearby). Therefore, the feature of spatio-temporal locality in algorithm TC (standing for Time Clustering) can be used to group the call detail records with a close occurrence time.

1.3. Deriving movement functions

Location identification techniques typically use one of two location models: the geometric model and the symbolic model [11]. The geometric model specifies the location in n-dimensional coordinates (typically n = 2 or 3). The symbolic model, however, uses logical entities to describe the location. This article represents the location of mobile users in CDRs using the symbolic model (i.e., the base station identification). To derive movement functions of a mobile user, the location of the call detail records in the symbolic model must be transformed into the geometric model. Then, with the cluster results obtained, we develop algorithm MF (standing for Movement Function) for each cluster. This algorithm utilizes weighted regression analysis to derive the corresponding movement functions of a user.

The RUMP approach consists of a series of algorithms that tackle the various issues described above. This study evaluates RUMP performance using both synthetic and real datasets. Sensitivity analysis is conducted on several design parameters. Experimental results show that RUMP is able to efficiently and effectively derive user movement patterns that capture the frequent movement behaviors of mobile users.


The rest of this article is organized as follows. Section 2 presents some related works. Section 3 then devises algorithms for mining user movement patterns. Section 4 presents performance results. Finally, Section 5 draws conclusions.

2. Related works

The problem of mining user movement patterns has attracted a considerable amount of research effort. Prior studies are generally classified into two categories based on their definitions of user movement patterns: spatial movement patterns and spatio-temporal movement patterns. In the first category, a user movement pattern refers to a sequence consisting of base station identifications or pre-defined regions. In the second category, user movement patterns represent the spatio-temporal associated relationships among base station identifications or pre-defined regions.

In the first category, the authors in [12] proposed an information-theoretical method to mine user movement patterns and represented them in a trie data structure. Moreover, the authors in [3] proposed a statistical approach to mine user movement patterns. The authors of [13] and [4] proposed a data mining approach for mining user movement patterns based on the movement logs of mobile users.

In the second category, user movement patterns are usually extracted from user trajectories, where trajectories are detailed user movements. A considerable amount of research effort focuses on mining spatio-temporal association rules [14–19]. The authors in [20] explored the fuzziness of locations in patterns and developed algorithms to discover spatio-temporal sequential patterns. Furthermore, the authors in [21] proposed a clustering-based approach to discover movement regions within time intervals. In [22], the authors developed a hybrid prediction model, consisting of vector-based and pattern-based models, to predict user movements. In [23] and [24], the authors exploited temporal annotated sequences in which sequences are associated with time information (i.e., transition times between two movements).

To the best of our knowledge, this study proposes a new way of mining user movement patterns from random sample data points. Though the main theme of this article is to mine user movement patterns from call detail records, the proposed approach can be used for other log data with randomly sampled data features. Existing studies neither fully utilize fragmented spatio-temporal information (e.g., call detail records) for mining user movement patterns nor explore regression techniques for deriving movement functions. These features distinguish this article from others. Our preliminary work was presented in [25]. The current article extends this preliminary work with more detailed complexity and theoretical analysis. Furthermore, we conduct an extensive performance analysis on both synthetic and real datasets. Finally, this study investigates the sensitivity of several parameters, such as the calling behavior and thresholds for each algorithm.

3. A regression-based approach for mining user movement patterns

This section develops a regression-based approach (i.e., RUMP) consisting of a sequence of algorithms to mine user movement patterns. First, Section 3.1 provides an overview of RUMP, and the following sections present details of algorithms LS, TC, and MF.

3.1. An overview

Given a log of CDRs, the goal of this article is to derive movement functions that closely reflect the frequent movement behaviors of mobile users. Because CDRs are random samples, the timestamps of CDRs are not likely to be the same even if a user follows the same movement behavior. Consequently, a basic time slot is defined as a time interval, and call detail records whose occurrence times fall within the interval of one time slot are associated with that time slot. These CDRs are then put in a movement record, defined as follows:

3.1.1. Definition: movement record

A movement record is defined as a set of pairs $(BS_i : N_i)$, where $BS_i$ is a base station and $N_i$ is the number of occurrences of $BS_i$ in the call detail records whose occurrence times are within the same time slot.

Table 2
Notations used in our algorithms.

Definition                                       Notation
Number of movement sequences                     $w$
Movement sequence $i$                            $MS_i$
Movement record at time slot $j$ in $MS_i$       $MR_{i,j}$
A large movement sequence                        LMS
Large movement record at time slot $i$           $LMR_i$
An aggregation movement sequence                 AMS
Aggregated movement record at time slot $i$      $AMR_i$
A time projection sequence of AMS                $TP_{AMS}$

Assume that one time slot covers the interval from 6:00 am to 10:00 am. From Table 1, on Day 1 we obtain one movement record {A:1, D:1, E:1}, since the occurrence times of three call detail records (i.e., those at A, D, and E) are within this interval. With the definition of movement records, a movement sequence is defined as follows:

3.1.2. Definition: movement sequence

A movement sequence $MS_i$, denoted by $\langle MR_{i,1}, MR_{i,2}, MR_{i,3}, \ldots, MR_{i,\varepsilon} \rangle$, is an ordered sequence of $\varepsilon$ movement records, where $MR_{i,j}$ is the movement record at time slot $j$ in $MS_i$ and $\varepsilon$ is an adjustable parameter.
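As an illustration of the two definitions above, the following minimal Python sketch converts the CDRs of Table 1 into movement sequences. The slot layout is an assumption made only for this example: slots are 4 hours long, slot 1 starts at 06:00 (matching the 6:00 am to 10:00 am interval used in the text), and one movement sequence covers one day, so ε = 6. The names `time_slot` and `movement_sequences` are hypothetical helpers, not part of RUMP.

```python
from collections import Counter, defaultdict

# CDRs of user 1 from Table 1: (day, "HH:MM", cell id).
CDRS = [
    (1, "07:30", "A"), (1, "09:32", "D"), (1, "09:49", "E"), (1, "13:50", "H"),
    (2, "08:50", "C"), (2, "09:50", "E"), (2, "14:00", "H"),
    (3, "07:15", "A"), (3, "09:02", "C"), (3, "09:30", "D"),
    (4, "12:30", "W"), (4, "12:52", "X"), (4, "13:30", "Y"),
]

SLOT_START_MIN = 6 * 60  # assumption: slot 1 starts at 06:00
SLOT_LEN_MIN = 4 * 60    # assumption: each slot lasts 4 hours
EPSILON = 6              # assumption: one sequence per day, 6 slots


def time_slot(hhmm):
    """Map a clock time to a 1-based slot index (wraps around midnight)."""
    hours, minutes = map(int, hhmm.split(":"))
    offset = (hours * 60 + minutes - SLOT_START_MIN) % (24 * 60)
    return offset // SLOT_LEN_MIN + 1


def movement_sequences(cdrs):
    """Return {day: [MR_1, ..., MR_epsilon]}; each MR maps cell id -> count."""
    sequences = defaultdict(lambda: [Counter() for _ in range(EPSILON)])
    for day, hhmm, cell in cdrs:
        sequences[day][time_slot(hhmm) - 1][cell] += 1
    return dict(sequences)


seqs = movement_sequences(CDRS)
print(dict(seqs[1][0]))  # {'A': 1, 'D': 1, 'E': 1}
```

Under these assumptions, the first movement record of Day 1 is {A:1, D:1, E:1}, which agrees with the example given before the definition.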

The length of a time slot determines the granularity of user movement patterns in terms of time. As in [22], the value of ε reflects the period after which a movement pattern may reappear. Thus, the value of ε depends on the periodicity of a user. Table 2 lists the notations used in this article. The overall procedure for mining movement patterns is outlined as follows:

3.1.2.1. Execution steps in RUMP

Step 1. (Extracting the aggregation movement sequence) In this step, call detail records are converted into w movement sequences, where w is an adjustable window size for the recent movement sequences being considered. Algorithm LS discovers an aggregation movement sequence, in which each movement record contains the areas in which the user frequently appears.

Step 2. (Clustering movement records) According to the aggregation movement sequence derived, we further develop algorithm TC to cluster movement records whose time slots are close.

Step 3. (Deriving movement functions) We then use regression techniques to derive the corresponding movement functions for each group in Step 2.

CDRs only reflect the fragmented movement behaviors of mobile users. Thus, the RUMP approach uses regression techniques to derive movement functions that are close to the frequent movement behaviors of mobile users. Due to the nature of regression techniques, if call detail records are not properly selected, the derived movement functions cannot capture the frequent movement behaviors of mobile users. On the other hand, call detail records should be fully utilized to mine user movement patterns since only limited information is available in CDRs. In the following subsections, each algorithm is presented in detail.

3.2. Algorithm LS: extracting the aggregation movement sequence

In this article, a user movement trajectory is represented as a sequence of base station identifications (hereafter, we use “base station” for short). Hence, call detail records are converted into movement sequences. With a set of movement sequences, algorithm LS determines an Aggregation Movement Sequence (abbreviated as AMS) and uses it to represent the frequent movement behaviors of a user. Intuitively, AMS is a sequence of movement records that contain the frequent base stations and their corresponding counts at each time slot. At each time slot, a frequent base station refers to a base station at which the user appears at least min_freq times among the movement sequences, where min_freq is a given threshold that quantifies frequent base stations. As pointed out earlier, the counts of frequent base stations should also be determined. Thus, before deriving AMS, we first build a large movement sequence (abbreviated as LMS), which is a sequence of frequent base stations, and compute the similarity between LMS and each movement sequence. In light of the similarity measurements obtained, we identify the movement sequences that capture the frequent movement behavior of the user and aggregate them as AMS.

3.2.1. Definition: large movement record

Given a set of movement sequences $MS_1, MS_2, \ldots, MS_w$ and a threshold min_freq, a large movement record at time slot $t$ is denoted as $LMR_t$, and $LMR_t$ contains the set of base stations whose occurrence count in the set of movement records at time slot $t$ (i.e., $MR_{1,t}, MR_{2,t}, \ldots, MR_{w,t}$) is greater than or equal to min_freq.

Given the five movement sequences in Table 3, if min_freq is set to 2, $LMR_4$ is {D, F} since both D and F have an occurrence count equal to min_freq. Large movement records demonstrate the frequent movement behavior of a user at each specific time slot. After obtaining large movement records at each time slot, a large movement sequence LMS is thus a sequence of large movement records, which is denoted as $LMS = \langle LMR_1, LMR_2, \ldots, LMR_\varepsilon \rangle$. Consequently, LMS indicates the frequent movement behavior of users.

Table 3

An example of algorithm LS.

1 2 3 4 5

MS1 A:14 A:2 F:1 I:2

MS2 C:8 C:1, D:1, F:1 H:1, G:4

MS3 A:1 C:1 D:1 H:1

MS4 A:1, B:1 A:1 F:9

MS5 B:4 D:4 H:1 A:1, B:2

LMS {A, B} {A} {D, F} {H, I}


Once a large movement sequence LMS is determined, we further formulate the similarity between movement sequences and LMS to identify whether a movement sequence represents the frequent movement behavior of a user. Conventional similarity measurements, such as cosine similarity [26] and the extended Jaccard coefficient [27], cannot be applied because they can only deal with scalar vectors with no missing values. Movement sequences and LMS are sequences of sets of base stations, not sequences of scalar values. Moreover, empty sets are allowed in movement sequences and LMS. As such, we formulate the similarity between a movement sequence (e.g., $MS_i$) and LMS in terms of the closeness between the movement records $MR_{i,j}$ and $LMR_j$, denoted by $C(MR_{i,j}, LMR_j)$, which compares the set of base stations in $MR_{i,j}$ with the frequent base stations in $LMR_j$. It is formulated as

\[ C(MR_{i,j}, LMR_j) = \frac{|\{x \mid x \in MR_{i,j} \cap LMR_j\}|}{|\{y \mid y \in MR_{i,j} \cup LMR_j\}|}, \]

and returns a normalized value in $[0, 1]$. The larger the value of $C(MR_{i,j}, LMR_j)$, the more closely $MR_{i,j}$ resembles $LMR_j$. For example, assume that $LMR_j = \{a,b,c,d\}$, $MR_{x,j} = \{b,e\}$ and $MR_{y,j} = \{a,b,c,d,e\}$. It can be verified that $C(MR_{x,j}, LMR_j) = \frac{1}{5}$ and $C(MR_{y,j}, LMR_j) = \frac{4}{5}$. Therefore, $MR_{y,j}$ is more similar to $LMR_j$. Based on the similarity between movement records and large movement records, the similarity measure of a movement sequence $MS_i$ and LMS is formulated as

\[ sim(MS_i, LMS) = \sum_{j=1}^{\varepsilon} |MR_{i,j}| \, C(MR_{i,j}, LMR_j). \]

Given a threshold value min_sim, for each movement sequence $MS_i$, if $sim(MS_i, LMS) \geq$ min_sim, the movement sequence $MS_i$ is identified as a similar movement sequence. Consider the example in Table 3. It can be verified that

\[ sim(MS_1, LMS) = 1 \cdot \frac{1}{2} + 1 \cdot \frac{1}{1} + 0 + 1 \cdot \frac{1}{2} + 1 \cdot \frac{0}{1} = 2. \]

Further, $sim(MS_2, LMS) = 3$, $sim(MS_3, LMS) = 2$, $sim(MS_4, LMS) = 3$, and $sim(MS_5, LMS) = \frac{1}{2}$. Assuming that min_sim is 2, $MS_1$, $MS_2$, $MS_3$ and $MS_4$ are recognized as similar movement sequences.
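The closeness and similarity measures can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: movement records are modeled as dictionaries from base station to count, $|MR_{i,j}|$ is taken to be the number of distinct base stations in the record, and the case in which both sets are empty is assumed to yield 0. The sequence `ms` and the sequence `lms` below are hypothetical inputs chosen for the example.

```python
def closeness(mr, lmr):
    """C(MR, LMR) = |MR ∩ LMR| / |MR ∪ LMR| over sets of base stations."""
    stations = set(mr)            # works for a set or a {station: count} dict
    union = stations | set(lmr)
    if not union:                 # assumption: two empty records score 0
        return 0.0
    return len(stations & set(lmr)) / len(union)


def similarity(ms, lms):
    """sim(MS, LMS) = sum over slots of |MR_j| * C(MR_j, LMR_j)."""
    return sum(len(mr) * closeness(mr, lmr) for mr, lmr in zip(ms, lms))


# Worked closeness example from the text.
lmr_j = {"a", "b", "c", "d"}
print(closeness({"b", "e"}, lmr_j))                 # 0.2
print(closeness({"a", "b", "c", "d", "e"}, lmr_j))  # 0.8

# Hypothetical five-slot movement sequence and large movement sequence.
ms = [{"A": 14}, {"A": 2}, {}, {"F": 1}, {}]
lms = [{"A", "B"}, {"A"}, set(), {"D", "F"}, {"H"}]
print(similarity(ms, lms))                          # 2.0
```

The first two printed values correspond to the values 1/5 and 4/5 derived in the text.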

After the similar movement sequences are identified, they are aggregated into one AMS, in which frequent base stations and their associated counts are determined. An aggregation movement sequence is defined as follows:

3.2.2. Definition: aggregation movement sequence

The aggregation movement sequence is denoted as $AMS = \langle AMR_1, AMR_2, \ldots, AMR_\varepsilon \rangle$, where $AMR_j$ is an aggregated movement record that contains the frequent base stations of the large movement record $LMR_j$ together with their occurrence counts accumulated from the movement records at time slot $j$ of the similar movement sequences.

Consider $AMR_1$ of the AMS in Table 3 as an example. From the similar movement sequences, the occurrence count of A in $AMR_1$ is calculated as the sum of the counts of A in $MR_{1,1}$, $MR_{3,1}$ and $MR_{4,1}$ (i.e., 14 + 1 + 1 = 16). Following the same procedure, we obtain $AMS = \langle \{A:16, B:1\}, \{A:3\}, \emptyset, \{D:2, F:3\}, \{H:2\} \rangle$, as shown in Table 3.

3.2.3. Time complexity analysis

Given w movement sequences with ε time slots, the complexity of algorithm LS can be expressed as O(εw). The complexity involved in calculating large movement records is O(εw), while that of extracting frequent movement sequences is ε · w · O(1) = O(εw). As a result, the overall time complexity of algorithm LS is O(εw). Thus, algorithm LS is of polynomial time complexity.

Algorithm 1. Algorithm LS

Input: w movement sequences with length ε; two thresholds: min_freq and min_sim
Output: aggregation movement sequence AMS

begin
  for j ← 1 to ε do
    for i ← 1 to w do
    begin
      LMR_j ← frequent 1-itemset of MR_{i,j};  // by min_freq
    end
  for i ← 1 to w do
  begin
    match ← 0;
    for j ← 1 to ε do
    begin
      C(MR_{i,j}, LMR_j) ← |x ∈ MR_{i,j} ∩ LMR_j| / |y ∈ MR_{i,j} ∪ LMR_j|;
      match ← match + |MR_{i,j}| · C(MR_{i,j}, LMR_j);
    end
    if match ≥ min_sim then
      accumulate the occurring counts of items in the aggregation movement sequence;
  end
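To make the three stages of Algorithm 1 concrete, the following Python sketch runs them end to end on a small made-up input. It rests on several assumptions: movement records are dictionaries from base station to count, the "occurrence count" used for min_freq is read as the number of movement sequences in which a station appears at a slot, and the three sequences in `seqs` are hypothetical (they are not the data of Table 3). The function names are illustrative only.

```python
from collections import Counter


def large_movement_sequence(seqs, min_freq):
    """LMR_j = stations appearing in at least min_freq records at slot j."""
    lms = []
    for j in range(len(seqs[0])):
        presence = Counter(bs for ms in seqs for bs in ms[j])
        lms.append({bs for bs, n in presence.items() if n >= min_freq})
    return lms


def closeness(mr, lmr):
    """|MR ∩ LMR| / |MR ∪ LMR|; two empty sets score 0 (assumption)."""
    union = set(mr) | lmr
    return len(set(mr) & lmr) / len(union) if union else 0.0


def algorithm_ls(seqs, min_freq, min_sim):
    """Return (LMS, AMS) for a list of equally long movement sequences."""
    lms = large_movement_sequence(seqs, min_freq)
    ams = [Counter() for _ in lms]
    for ms in seqs:
        match = sum(len(mr) * closeness(mr, lmr) for mr, lmr in zip(ms, lms))
        if match >= min_sim:
            # Accumulate counts of frequent stations from similar sequences.
            for amr, mr, lmr in zip(ams, ms, lms):
                for bs, count in mr.items():
                    if bs in lmr:
                        amr[bs] += count
    return lms, ams


# Hypothetical input: three movement sequences with three time slots each.
seqs = [
    [{"A": 2}, {"B": 1}, {}],
    [{"A": 1, "C": 1}, {"B": 3}, {"D": 1}],
    [{"X": 5}, {}, {"Y": 1}],
]
lms, ams = algorithm_ls(seqs, min_freq=2, min_sim=2)
print(lms)                     # [{'A'}, {'B'}, set()]
print([dict(a) for a in ams])  # [{'A': 3}, {'B': 4}, {}]
```

In this toy input the first two sequences each score 2 and are aggregated, while the third scores 0 and is discarded, so its stations X and Y do not appear in the AMS.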

3.3. Algorithm TC: clustering aggregation movement records

As pointed out earlier, the movement trajectories of mobile users generally follow spatio-temporal locality (i.e., if the time interval between two consecutive calls of a mobile user is small, the mobile user is likely to have moved nearby). Accordingly, aggregation movement records in AMS can be clustered into several groups if their corresponding time slots are close. To facilitate the presentation, only time information (i.e., time slots) is extracted from AMS. Thus, the time projection sequence of AMS is defined as follows:

3.3.1. Definition: time projection sequence

A time projection sequence of AMS is expressed as $TP_{AMS} = \langle \alpha_1, \ldots, \alpha_n \rangle$, where $AMR_{\alpha_j} \neq \{\}$ and $\alpha_1 < \cdots < \alpha_n$.
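As a small illustration of this definition, the sketch below computes the time projection sequence of the AMS given earlier for Table 3. Representing the AMS as a Python list of dictionaries with 1-based slot numbering is an assumption of the example, and `time_projection` is a hypothetical helper name.

```python
# AMS from the running example: one dict per time slot, empty if no record.
ams = [{"A": 16, "B": 1}, {"A": 3}, {}, {"D": 2, "F": 3}, {"H": 2}]


def time_projection(ams):
    """Return the increasing list of 1-based slots whose record is non-empty."""
    return [slot for slot, amr in enumerate(ams, start=1) if amr]


print(time_projection(ams))  # [1, 2, 4, 5]
```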

A time projection sequence is a sequence of time slots in which the corresponding movement records are not empty. Algorithm TC then uses the time projection sequence to cluster close time slots. The cluster result of algorithm TC is represented as a clustered time projection sequence defined as follows:

3.3.2. Definition: clustered time projection sequence

A clustered time projection sequence of $TP_{AMS}$, denoted by $CTP_{AMS}$, is represented as $\langle CL_1, CL_2, \ldots, CL_k \rangle$, where the $i$-th group $CL_i$ contains the time slots of the clustered movement records, and $k$ is an integer such that $1 \leq k \leq \varepsilon$.

Given the AMS obtained in Step 1, $TP_{AMS}$ is easily determined. By exploring the feature of spatio-temporal locality, algorithm TC generates a clustered time projection sequence of AMS (i.e., $CTP_{AMS}$). Each cluster in $CTP_{AMS}$ contains close time slots, so the movement records whose time slots are clustered together preserve the feature of spatio-temporal locality. Therefore, the objective of clustering is to bound the variance of the time slots in each group by a given threshold (i.e., min_var).

The variance of a group $CL_i$ is defined as

\[ Var(CL_i) = \frac{1}{m} \sum_{k=1}^{m} \left( n_{i,k} - \frac{1}{m} \sum_{j=1}^{m} n_{i,j} \right)^2, \]

where $n_{i,j}$ represents the $j$-th time slot of the movement records in $CL_i$ and $m$ is the total number of movement records in $CL_i$. Algorithm TC generates a clustered time projection sequence $CTP_{AMS}$ such that $Var(CL_i) \leq$ min_var for all clusters $CLi$.
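The variance above is the population variance of the time slots in a cluster. The following short Python sketch evaluates it for the clusters that appear in the execution scenario of Table 4; the function name `cluster_variance` is an assumption of this example.

```python
def cluster_variance(slots):
    """Population variance of the time slots in one cluster."""
    m = len(slots)
    mean = sum(slots) / m
    return sum((n - mean) ** 2 for n in slots) / m


for cluster in ([1, 2, 3, 4, 5], [14, 17, 18, 20], [17, 18, 20], [1, 2, 3]):
    print(cluster, round(cluster_variance(cluster), 2))
```

Rounded to two decimals, the four values are 2.0, 4.69, 1.56 and 0.67, which are the variances quoted in the discussion of Table 4.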

To achieve the objective of clustering, algorithm TC first coarsely clusters $TP_{AMS}$ into several marked clusters using a value δ. The initial value of δ is set to ε, and δ then decreases by one in each round. Thus, in the beginning, there is only one cluster. Dividing clusters with a variance larger than min_var increases the number of clusters. In algorithm TC, unmarked clusters refer to clusters that do not need to be refined, whereas marked clusters refer to clusters that should be further partitioned. For each cluster $CL_i$, if $Var(CL_i)$ is no larger than min_var, the cluster $CL_i$ is unmarked. Otherwise, δ decreases by 1 and algorithm TC re-clusters the time slots in the marked clusters with the updated value of δ. Algorithm TC partitions $TP_{AMS}$ iteratively until no marked clusters remain or until δ = 1. If there are no marked clusters, $CTP_{AMS}$ is generated. Otherwise, if there are still marked clusters with variance values larger than min_var, algorithm TC continues to finely partition these marked clusters so that the variance of every marked cluster is bounded by the threshold min_var.

When the value of δ is 1, the time slots of movement records in a marked cluster generally form a sequence of consecutive integers whose variance is still larger than min_var. This situation results in a loss of spatio-temporal locality. For example, given movement records with the consecutive time slots 1, 2, 3, 4, 5, 6, and 7, although the differences between consecutive time slots are small, the location of a user at time slot 1 and that at time slot 7 are probably far from each other. To deal with this problem, this cluster must be further partitioned into smaller clusters. The variance of each refined cluster should be no larger than min_var. Moreover, to guarantee that the time slots in each refined cluster are as close as possible, the total variance of the refined clusters should be minimized. To derive the optimal method for further partitioning, the following lemma is derived:

Lemma. Given a cluster that consists of a sequence of consecutive integers 1, 2, 3, ..., n and a positive integer k, the optimal way to divide this cluster into k sub-clusters while minimizing the sum of the variances of the sub-clusters is to partition it into k sub-clusters, each of size $n/k$.

Proof. Suppose that $\langle 1,2,3,\ldots,n \rangle$ is divided into k sub-clusters: $\langle 1,\ldots,t_1 \rangle, \langle t_1+1,\ldots,t_2 \rangle, \ldots, \langle t_{k-1}+1,\ldots,n \rangle$. Let $t_0 = 0$, $t_k = n$, and $Var_i = Var(\langle t_{i-1}+1,\ldots,t_i \rangle)$. Our goal is to find the cutting points (i.e., $t_1, t_2, \ldots, t_{k-1}$) that minimize $f = \sum_{i=1}^{k} Var_i$.

The variance is the same for all sequences of consecutive integers with the same length. For example, consider two clusters with two sequences of consecutive time slots: $\langle 1,2,3,4,5 \rangle$ and $\langle 7,8,9,10,11 \rangle$. It can be verified that $Var(\langle 1,2,3,4,5 \rangle) = Var(\langle 7,8,9,10,11 \rangle)$. Since $Var(\langle 1,2,\ldots,n \rangle) = \frac{1}{12}(n^2 - 1)$, we have

\[ f = \sum_{i=1}^{k} Var_i = \frac{1}{12} \sum_{i=1}^{k} \left( (t_i - t_{i-1})^2 - 1 \right). \]

To minimize $f$, the cutting points $t_1, t_2, \ldots, t_{k-1}$ are derived by setting the first derivatives to zero:

\[
\begin{cases}
\dfrac{\partial f}{\partial t_1} = \dfrac{1}{12}\left(4t_1 - 2t_2 - 2t_0\right) = 0 \\
\dfrac{\partial f}{\partial t_2} = \dfrac{1}{12}\left(4t_2 - 2t_3 - 2t_1\right) = 0 \\
\quad\vdots \\
\dfrac{\partial f}{\partial t_{k-1}} = \dfrac{1}{12}\left(4t_{k-1} - 2t_k - 2t_{k-2}\right) = 0
\end{cases}
\]

Thus, we obtain the following terms:

\[
\begin{cases}
t_1 = \dfrac{t_0 + t_2}{2} \\
t_2 = \dfrac{t_1 + t_3}{2} \\
\quad\vdots \\
t_{k-1} = \dfrac{t_{k-2} + t_k}{2}
\end{cases}
\]

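Table 4
An execution scenario of algorithm TC (clusters followed by ⁎ are marked).

Run   δ     min_var   Clusters
0     20    1.6       $\langle 1,2,3,4,5,9,10,14,17,18,20 \rangle$⁎
...   ...   ...       ...
1     3     1.6       $\langle 1,2,3,4,5 \rangle$⁎, $\langle 9,10 \rangle$, $\langle 14,17,18,20 \rangle$⁎
2     2     1.6       $\langle 1,2,3,4,5 \rangle$⁎, $\langle 9,10 \rangle$, $\langle 14 \rangle$, $\langle 17,18,20 \rangle$
3     1     1.6       $\langle 1,2,3,4,5 \rangle$⁎, $\langle 9,10 \rangle$, $\langle 14 \rangle$, $\langle 17,18,20 \rangle$
4     0     1.6       $\langle 1,2,3,4,5 \rangle$⁎, $\langle 9,10 \rangle$, $\langle 14 \rangle$, $\langle 17,18,20 \rangle$
5     0     1.6       $\langle 1,2,3 \rangle$, $\langle 4,5 \rangle$⁎, $\langle 9,10 \rangle$, $\langle 14 \rangle$, $\langle 17,18,20 \rangle$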

Using substitution (with $t_0 = 0$), we have

\[
\begin{cases}
t_1 = \dfrac{1}{2} t_2 \\
t_2 = \dfrac{2}{3} t_3 \\
\quad\vdots \\
t_{k-1} = \dfrac{k-1}{k} t_k
\end{cases}
\]

Therefore, since $t_k = n$, we get:

\[
\begin{cases}
t_1 = \dfrac{1}{k} n \\
t_2 = \dfrac{2}{k} n \\
\quad\vdots \\
t_{k-1} = \dfrac{k-1}{k} n
\end{cases}
\]

From the derivation above, the optimal way to divide $\langle 1,2,3,\ldots,n \rangle$ into k clusters is to divide it into k sub-clusters, each of size $n/k$. □
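The lemma can be checked numerically on small cases. The sketch below is only a sanity check, not part of the proof: it enumerates every choice of cutting points for a run of n consecutive integers, uses the closed form $\frac{1}{12}(\ell^2 - 1)$ for the variance of a run of length $\ell$, and reports the partition with the smallest total variance. The function names are hypothetical.

```python
from itertools import combinations


def var_consecutive(length):
    """Variance of any run of `length` consecutive integers."""
    return (length ** 2 - 1) / 12


def best_partition(n, k):
    """Brute-force the cut points t_1 < ... < t_{k-1} that minimize f."""
    best = None
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
        total = sum(var_consecutive(size) for size in sizes)
        if best is None or total < best[0]:
            best = (total, sizes)
    return best


print(best_partition(12, 3))  # (3.75, [4, 4, 4])
```

For n = 12 and k = 3 the minimum is reached by three sub-clusters of size 4, as the lemma predicts. When k does not divide n, the minimum is attained by sizes that differ by at most one, and ties between orderings are possible.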

This lemma provides a guideline for partitioning a marked cluster that has a sequence of consecutive time slots into smaller clusters. Since the value of k is not known in advance, the marked cluster is first divided with k = 2, and k then increases by 1 in each iteration. In each iteration, the marked cluster is evenly divided into k sub-clusters, each of size $n/k$, and the variance of each sub-cluster is tested. If the variance of every sub-cluster is no larger than min_var, the procedure terminates. Otherwise, the value of k is increased by 1 and the marked cluster is further refined into smaller sub-clusters.

Consider the execution scenario in Table 4, where the time projection sequence is $TP_{AMS} = \langle 1,2,3,4,5,9,10,14,17,18,20 \rangle$. The initial cluster is $\langle 1,2,3,4,5,9,10,14,17,18,20 \rangle$. Given min_var = 1.6, algorithm TC first roughly partitions $TP_{AMS}$ into three clusters. Table 4 shows that two marked clusters (i.e., $\langle 1,2,3,4,5 \rangle$ with $Var(\langle 1,2,3,4,5 \rangle) = 2$ and $\langle 14,17,18,20 \rangle$ with $Var(\langle 14,17,18,20 \rangle) = 4.69$) are determined because the variances of these two clusters are larger than 1.6. Then, δ is reduced to 2, and these two marked clusters are re-examined. In this run, the cluster $\langle 14,17,18,20 \rangle$ is divided into two clusters, i.e., $\langle 14 \rangle$ and $\langle 17,18,20 \rangle$. Since $Var(\langle 14 \rangle) = 0 < 1.6$ and $Var(\langle 17,18,20 \rangle) = 1.56 < 1.6$, these two clusters remain unmarked. Following the same procedure, algorithm TC partitions marked clusters until δ equals 1. Run 4 in Table 4 shows that $\langle 1,2,3,4,5 \rangle$ is still a marked cluster with $Var(\langle 1,2,3,4,5 \rangle) = 2$. Therefore, algorithm TC finely partitions $\langle 1,2,3,4,5 \rangle$. The value of k is initially set at 1. Since $Var(\langle 1,2,3,4,5 \rangle) = 2$ is larger than min_var (i.e., 1.6), k increases to 2. Then, $\langle 1,2,3,4,5 \rangle$ is divided into $\langle 1,2,3 \rangle$ and $\langle 4,5 \rangle$. Of these two clusters, $\langle 1,2,3 \rangle$ has the larger variance and thus it is compared with min_var. Since $Var(\langle 1,2,3 \rangle) = 0.67 < 1.6$, algorithm TC stops the clustering process. Finally, $CTP_{AMS}$ is generated as $\langle 1,2,3 \rangle, \langle 4,5 \rangle, \langle 9,10 \rangle, \langle 14 \rangle, \langle 17,18,20 \rangle$.
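The two phases of this example can be traced with a short sketch. It is only an illustration under stated assumptions: the function names and the parameter `delta_start` are hypothetical, "within δ" is read as a gap of at most δ between neighbouring time slots, Var is taken as the population variance (which matches the values quoted above), and an uneven cluster is cut so that the leading groups receive the extra element.

```python
from typing import List


def variance(values: List[int]) -> float:
    """Population variance of a list of time slots."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_by_gap(cluster: List[int], delta: int) -> List[List[int]]:
    """Start a new group whenever neighbouring slots differ by more than delta."""
    groups = [[cluster[0]]]
    for prev, cur in zip(cluster, cluster[1:]):
        if cur - prev <= delta:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def split_evenly(cluster: List[int], k: int) -> List[List[int]]:
    """Cut a cluster into k contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(cluster), k)
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        groups.append(cluster[start:start + size])
        start += size
    return groups


def time_clustering(tp: List[int], min_var: float, delta_start: int) -> List[List[int]]:
    """Sketch of the coarse and fine phases; assumes min_var >= 0."""
    clusters = [list(tp)]
    # Coarse phase: clusters that are still too spread out stay "marked"
    # and are regrouped with a progressively smaller gap threshold.
    for delta in range(delta_start, 0, -1):
        refined = []
        for cluster in clusters:
            if variance(cluster) <= min_var:
                refined.append(cluster)
            else:
                refined.extend(split_by_gap(cluster, delta))
        clusters = refined
    # Fine phase: a cluster that is still marked is cut into k equal parts,
    # with k growing until every part satisfies the variance threshold.
    result = []
    for cluster in clusters:
        k, groups = 1, [cluster]
        while any(variance(g) > min_var for g in groups):
            k += 1
            groups = split_evenly(cluster, k)
        result.extend(groups)
    return result


TP = [1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20]
CLUSTERS = time_clustering(TP, min_var=1.6, delta_start=3)
```

With `delta_start=3` the first pass yields the three rough clusters of the example, and the fine phase stops at k = 2 for the first cluster. The loop always terminates because single-element groups have zero variance.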

Algorithm TC is of polynomial time complexity. Let $TP_{AMS}$ have n numbers. Algorithm TC needs O(n) time to divide $TP_{AMS}$ into clusters in lines 5 to 15. For lines 17 to 25, assume that there are still s clusters with m numbers to be refined.

Algorithm 2. Algorithm TC

Input: time projection sequence TP_AMS; threshold min_var

Output: clustered time projection sequence CTP_AMS

1: begin
2:   δ ← ε;
3:   CL_1 ← TP_AMS;
4:   mark CL_1;
5:   while δ > 1 do
6:   begin
7:     for each marked cluster CL_i do
8:       if Var(CL_i) ≤ min_var then
9:       begin
10:        unmark CL_i;
11:      end
12:    δ ← δ − 1;
13:    for all marked clusters CL_i do
14:      group the numbers whose differences are within δ in CL_i;
15:  end
16:
17:  if there are marked clusters then
18:  begin
19:    for each marked cluster CL_i do
20:      k = 2;
21:      repeat
22:        k ← k + 1;
23:        divide CL_i into k groups with equal sizes;
24:      until the variance of each group ≤ min_var
25:  end
26: end

Since k is at most m, the refinement takes s·O(m) time. The time complexity of algorithm TC is estimated for the worst case: when m = n, the overall time complexity of algorithm TC is at most O(n).

3.4. Algorithm MF: deriving movement functions

Given the aggregation movement sequence AMS derived by algorithm LS and its clustered time projection sequence $CTP_{AMS}$ generated by algorithm TC, algorithm MF derives a sequence of movement functions that estimate the frequent movement behaviors of mobile users. For each cluster, we need to derive confidence movement functions. Then, linkage movement functions are determined to link confidence movement functions among clusters. Finally, a movement function F(t) is derived and represented as $\langle U_0(t), E_1(t), U_1(t), E_2(t), \ldots, E_k(t), U_k(t) \rangle$, where $E_i(t)$ is the confidence movement function in cluster $CL_i$ of $CTP_{AMS}$ and $U_i(t)$ is the linkage movement function from $E_i(t)$ to $E_{i+1}(t)$.

3.4.1. Deriving conﬁdence movement functions

For each cluster $CL_i$ of $CTP_{AMS}$, the confidence movement function of a mobile user, expressed as $E_i(t) = (\hat{x}_i(t), \hat{y}_i(t), TI_i)$, is derived. In this case, $\hat{x}_i(t)$ (respectively, $\hat{y}_i(t)$) is a movement function in the x-coordinate axis (respectively, in the y-coordinate axis) and the confidence movement function is valid for the time interval indicated in $TI_i$.

Without loss of generality, let $CL_i$ be $\langle t_1, t_2, \ldots, t_n \rangle$, where $t_j$ denotes one of the time slots in $CL_i$ for j = 1, 2, ..., n. $AMR_i$ contains frequent base stations with their corresponding counts in the i-th time slot of AMS. To derive movement functions, the location of base stations should be converted from the symbolic model into the geometric model through a map table that indicates the coordinates of base stations and is provided by telecom companies. Hence, given AMS and $CTP_{AMS}$, for each cluster of $CTP_{AMS}$, the geometric coordinates of frequent base stations can be derived along with their corresponding counts and represented as $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$, where $t_i$ is the corresponding time slot, $x_i$ (respectively, $y_i$) is the x-coordinate (respectively, y-coordinate) of the base station, and $w_i$ is the number of phone calls a mobile user has made at this base station. Accordingly, for each cluster of $CTP_{AMS}$, a weighted regression analysis can derive the corresponding confidence movement function.
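As a minimal sketch of this conversion, the snippet below expands the time slots of one cluster into $(t, x, y, w)$ tuples. The function name `weighted_points`, the dictionary-based representation of AMS, and the in-memory map table `COORDS` are assumptions made for illustration; the sample values are those of the worked example given later in this section.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float, float, int]  # (time slot, x, y, weight)


def weighted_points(ams: List[Dict[str, int]],
                    cluster: List[int],
                    coords: Dict[str, Tuple[float, float]]) -> List[Point]:
    """Expand the AMS entries of one cluster into (t, x, y, w) tuples."""
    points = []
    for t in cluster:
        # Time slots are 1-based; an empty dict stands for an empty slot.
        for station, count in ams[t - 1].items():
            x, y = coords[station]
            points.append((t, x, y, count))
    return points


# Hypothetical in-memory form of the AMS and the map table of the example.
AMS = [{"A": 16, "B": 1}, {"A": 1}, {}, {"D": 2, "F": 3}, {"H": 2}]
COORDS = {"A": (1, 1), "B": (1, 2), "D": (4, 2), "F": (3, 3), "H": (5, 3)}
POINTS = weighted_points(AMS, [1, 2, 4, 5], COORDS)
```

Each resulting tuple corresponds to one row of the data points used as input to the weighted regression.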

Given a set of data points, the goal of regression analysis is to derive the best estimated curve with the minimal sum of squared errors [28]. One aggregation movement sequence is generated in Step 1, which calculates the appearance counts of base stations. Thus, based on the appearance counts of base stations, we can derive curves closer to those base stations with larger appearance counts. This is because the more calls a user makes at a base station, the more confidence we have that this mobile user frequently appears in the coverage area of this base station. Another advantage of utilizing weighted regression analysis is that in a real mobile computing system, the base station that serves a user is not always the nearest base station, because nearby base stations take over for the nearest base station when it becomes overloaded. However, this scenario does not always happen, so the appearance counts of the other base stations will be lower than that of the nearest base station. Therefore, weighted regression analysis makes it possible to derive curves close to base stations with higher appearance counts.

Given data points within a cluster, this article considers the derivation of $\hat{x}(t)$. An m-degree polynomial function $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$ is derived to approximate the movement behavior along the x-coordinate axis. Given the data points $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$, the regression coefficients $\{a_0, a_1, \ldots, a_m\}$ are selected to minimize the weighted residual sum of squares $\sum_{i=1}^{n} w_i e_i^2$, where $e_i = x_i - (a_0 + a_1 t_i + a_2 t_i^2 + \cdots + a_m t_i^m)$. The value of m is application dependent, and must be smaller than the number of data points. The precision of the fitted curve increases with m. Since $\hat{x}(t)$ is obtained by matrix operations, the matrix size is the dominant factor in regression performance. However, the impact of weighted regression analysis on execution time is not significant in this article since the maximal value of m is usually small. When the value of m is small, the execution time of regression analysis is acceptable. Therefore, according to the number of data points available, the value of m should be set as large as possible.

For ease of presentation, the following terms are deﬁned:

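$$H = \begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^m \end{bmatrix},\quad a^* = \begin{bmatrix} a_0 \\ \vdots \\ a_m \end{bmatrix},\quad \tilde{b}_x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix},\quad e = \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix},\quad W = \mathrm{diag}(w_1, \ldots, w_n).$$

By solving the equation $(\sqrt{W}H)^T \sqrt{W}H a^* = (\sqrt{W}H)^T \sqrt{W}\tilde{b}_x$, $a^*$ can be derived such that the weighted residual sum of squares is minimized (see Appendix A). This leads to $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$. $\hat{y}(t)$ can be derived following the same procedure. As a result, for each cluster of $CTP_{AMS}$, the confidence movement function $E_i(t) = (\hat{x}(t), \hat{y}(t), [t_1, t_n])$ of a mobile user can be derived.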

For example, let $AMS = \langle \{A{:}16, B{:}1\}, \{A{:}1\}, \phi, \{D{:}2, F{:}3\}, \{H{:}2\} \rangle$ and the coordinates of A, B, D, F and H be (1, 1), (1, 2), (4, 2), (3, 3) and (5, 3), respectively. Given AMS and $CTP_{AMS} = \langle 1,2,4,5 \rangle$, it is possible to obtain data points with their weights, as Table 5 shows. By setting m to 3, the 3-degree polynomial $\hat{x}(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3$ is derived. The coefficients $a_0$, $a_1$, $a_2$ and $a_3$ are determined by a regression curve that minimizes the residual sum of squares. That is, $a^* = (a_0\ a_1\ a_2\ a_3)^T$ must be determined. Since there are six data points with their corresponding time slots of 1, 1, 2, 4, 4 and 5,

$$H = \begin{bmatrix} 1 & 1 & 1^2 & 1^3 \\ 1 & 1 & 1^2 & 1^3 \\ 1 & 2 & 2^2 & 2^3 \\ 1 & 4 & 4^2 & 4^3 \\ 1 & 4 & 4^2 & 4^3 \\ 1 & 5 & 5^2 & 5^3 \end{bmatrix}$$

is then calculated. The weights of the data points are 16, 1, 1, 2, 3 and 2, respectively. Hence, $\sqrt{W}$ is a diagonal matrix with the diagonal entries $[\sqrt{16}, \sqrt{1}, \sqrt{1}, \sqrt{2}, \sqrt{3}, \sqrt{2}]$. Table 5 shows that $\tilde{b}_x = (1\ 1\ 1\ 4\ 3\ 5)^T$. By solving the equation $(\sqrt{W}H)^T \sqrt{W}H a^* = (\sqrt{W}H)^T \sqrt{W}\tilde{b}_x$, we can get $a^* = (2.333\ \ {-2.133}\ \ 0.867\ \ {-0.066})^T$. Therefore, $\hat{x}(t) = 2.333 - 2.133t + 0.867t^2 - 0.066t^3$ is derived to predict the x-coordinate of the mobile user from t = 1 to t = 5. Similarly, $\tilde{b}_y = (1\ 2\ 1\ 2\ 3\ 3)^T$ is determined from Table 5. By solving the normal equation $(\sqrt{W}H)^T \sqrt{W}H a^* = (\sqrt{W}H)^T \sqrt{W}\tilde{b}_y$, we can get $a^* = (2.529\ \ {-2.386}\ \ 1.021\ \ {-0.105})^T$ and thus obtain $\hat{y}(t) = 2.529 - 2.386t + 1.021t^2 - 0.105t^3$. Fig. 3 shows the confidence movement functions, where each circle indicates the location of a base station with its corresponding weight and the solid line is the curve derived by algorithm MF. The confidence movement function closely resembles actual movement behavior, demonstrating the advantage of utilizing regression analysis to mine user movement patterns.
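The coefficients of this example can be checked with a small weighted least-squares sketch. It forms the normal equations as $H^T W H a^* = H^T W \tilde{b}$, which is algebraically the same system as the one above, and solves them by plain Gaussian elimination. The helper names `solve_linear` and `weighted_polyfit` are hypothetical, and the solver is a simple stand-in rather than the method used in the experiments.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_polyfit(t: Sequence[float], v: Sequence[float],
                     w: Sequence[float], degree: int) -> List[float]:
    """Return [a_0, ..., a_m] minimizing sum_i w_i * (v_i - p(t_i))**2."""
    size = degree + 1
    # Entry (r, c) of H^T W H is sum_i w_i * t_i**(r + c).
    lhs = [[sum(wi * ti ** (r + c) for ti, wi in zip(t, w))
            for c in range(size)] for r in range(size)]
    # Entry r of H^T W b is sum_i w_i * v_i * t_i**r.
    rhs = [sum(wi * vi * ti ** r for ti, vi, wi in zip(t, v, w))
           for r in range(size)]
    return solve_linear(lhs, rhs)


# Data points of the first cluster in Table 5.
T = [1, 1, 2, 4, 4, 5]
X = [1, 1, 1, 4, 3, 5]
Y = [1, 2, 1, 2, 3, 3]
W = [16, 1, 1, 2, 3, 2]
COEF_X = weighted_polyfit(T, X, W, 3)
COEF_Y = weighted_polyfit(T, Y, W, 3)
```

Because the cluster contains only four distinct time slots, the cubic passes through the weighted mean of the observations at each slot, which is why the heavily weighted point at A dominates the fit at t = 1.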

Algorithm 3. Algorithm MF

Input: AMS and clustered time projection sequence CTP_AMS

Output: a list of movement functions F(t) = ⟨E_1(t), U_1(t), E_2(t), ..., E_k(t), U_k(t)⟩

1: begin
2:   F(t) = ⟨⟩;
3:   for i = 1 to k−1 do
4:   begin
5:     perform regression on CL_i to generate E_i(t);
6:     perform regression on CL_{i+1} to generate E_{i+1}(t);
7:     t_1 = the last time slot in CL_i;
8:     t_2 = the first time slot in CL_{i+1};
9:     use inner interpolation to generate U_i(t) = (x̂_i(t), ŷ_i(t), (t_1, t_2));
10:    insert E_i(t), U_i(t) and E_{i+1}(t) in F(t);
11:  end
12:  if 1 ∉ CL_1 then
13:    generate U_0(t) and insert U_0(t) into the head of F(t);
14:  if ε ∉ CL_k then
15:    generate U_k(t) and insert U_k(t) into the tail of F(t);
16: end

3.4.2. Deriving linkage movement functions

Given the AMS and a clustered sequence $CTP_{AMS} = \langle CL_1, CL_2, \ldots, CL_k \rangle$, algorithm MF generates the whole movement function, denoted as F(t). F(t) is represented as $\langle U_0(t), E_1(t), U_1(t), E_2(t), \ldots, E_k(t), U_k(t) \rangle$, where $E_i(t)$ is the confidence movement function in cluster $CL_i$ of $CTP_{AMS}$ and $U_i(t)$ is the linkage movement function from $E_i(t)$ to $E_{i+1}(t)$. Algorithm MF (lines 5 to 6) shows that for each cluster of $CTP_{AMS}$, the corresponding confidence movement functions are derived using the regression method above. However, the first time slot may not be included in $CL_1$. If $t_0$ is the first time slot of $CL_1$ and $t_0 \neq 1$, then $U_0(t) = \{E_1(t_0), [1, t_0)\}$ is generated for the boundary condition. Otherwise, $U_0(t)$ will not be valid in F(t). The same is true for $U_k(t)$. The linkage movement function is calculated by interpolation (line 9 of algorithm MF).


For example, assume that $CTP_{AMS} = \langle 1,2,4,5 \rangle, \langle 7,9,10 \rangle$, $E_1(t) = (2.333 - 2.133t + 0.867t^2 - 0.066t^3,\ 2.529 - 2.386t + 1.021t^2 - 0.105t^3,\ [1, 5])$ and $E_2(t) = (6 + 1.17t - 0.16t^2,\ 3 + 0t + 0t^2,\ [7, 10])$. It can be verified that the first time slot of cluster $\langle 1,2,4,5 \rangle$ is 1. The last time slot of $\langle 1,2,4,5 \rangle$ is 5 and the first time slot of cluster $\langle 7,9,10 \rangle$ is 7. Thus, a linkage movement function should be generated by inner interpolation. From $E_1(t)$, at the 5th time slot, we have a data point (x = 5.09, y = 3). At the 7th time slot, a data point (x = 6.35, y = 3) is generated by applying $E_2(7)$. By inner interpolation, we have $U_1(t) = \left(1.94 + \frac{6.35 - 5.09}{7 - 5}t,\ 3 + \frac{3 - 3}{7 - 5}t,\ (5, 7)\right)$. Similarly, $U_2(t)$ can be determined. After obtaining the confidence and linkage functions, $F(t) = \langle E_1(t), U_1(t), E_2(t), U_2(t) \rangle$ can be derived. Fig. 4 shows the snapshot of F(t). When using F(t) to predict the location of mobile users, we only use the confidence movement function whose time interval includes the given time t. For $F(t) = \langle E_1(t), U_1(t), E_2(t), U_2(t) \rangle$, when the time is 4, only $E_1(t)$ is used to predict the location since the given time 4 is within the time interval of $E_1(t)$.
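The linkage step of this example can be sketched for the x-coordinate as follows. The helper names (`poly`, `linkage`, `predict_x`) and the segment table are hypothetical, and the sketch assumes that a time falling in the gap between two clusters is served by the linkage function, read here as a straight line between the two boundary points.

```python
from typing import Callable, Sequence

Func = Callable[[float], float]


def poly(coef: Sequence[float]) -> Func:
    """Polynomial with coefficients [a_0, a_1, ...] in increasing degree."""
    return lambda t: sum(a * t ** j for j, a in enumerate(coef))


def linkage(e_prev: Func, t1: float, e_next: Func, t2: float) -> Func:
    """Straight line from (t1, e_prev(t1)) to (t2, e_next(t2))."""
    v1, v2 = e_prev(t1), e_next(t2)
    slope = (v2 - v1) / (t2 - t1)
    return lambda t: v1 + slope * (t - t1)


# x-components of E1(t) and E2(t) from the example above.
E1_X = poly([2.333, -2.133, 0.867, -0.066])
E2_X = poly([6, 1.17, -0.16])
U1_X = linkage(E1_X, 5, E2_X, 7)

# Confidence functions are listed first, so the linkage function only
# answers for times strictly inside the gap (5, 7).
SEGMENTS = [(1, 5, E1_X), (7, 10, E2_X), (5, 7, U1_X)]


def predict_x(t: float) -> float:
    """Evaluate the segment of F(t) whose time interval contains t."""
    for start, end, func in SEGMENTS:
        if start <= t <= end:
            return func(t)
    raise ValueError("t lies outside every interval of F(t)")
```

Calling `predict_x(4)` evaluates only the first confidence function, while `predict_x(6)` falls on the line that joins the two clusters. The y-coordinate can be handled in the same way.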

[Figure: three-dimensional plot over Time, X coordinate and Y coordinate showing the weighted data points for t = 1 to 5 and the fitted curve]

Fig. 3. An illustrative example of deriving confident movement functions.

Table 5
Data points with their corresponding weights.

t_i   ID   x_i   y_i   w_i
1     A    1     1     16
1     B    1     2     1
2     A    1     1     1
4     D    4     2     2
4     F    3     3     3
5     H    5     3     2
7     K    6     3     4
9     F    3     3     10
10    E    4     3     1

[Figure: plot over Time, x coordinate and y coordinate showing E1(t), U1(t), E2(t) and U2(t)]

Algorithm MF is of polynomial time complexity. When the maximal matrix size in rows/columns is n, the time complexity of solving the normal equation by Strassen's algorithm is $\Theta(n^{\lg 7})$ [29]. Moreover, the interpolation by Lagrange's formula requires $\Theta(m^2)$, where m represents the number of points involved in the interpolation [29]. Since n is usually larger than m, $\Theta(n^{\lg 7})$ dominates the complexity of algorithm MF.

3.5. Estimating a user's location based on a movement function

For many applications, it is necessary to estimate a user's location in the symbolic model. In this case, F(t) represents the movement behavior of mobile users. Thus, once the movement functions F(t) have been obtained, the location of a mobile user can be predicted as $(x_t, y_t)$, the coordinates obtained by applying the movement function at time t. The estimated coordinate $(x_t, y_t)$ can then be transformed into the symbol whose area contains $(x_t, y_t)$. In our example, since each base station is aware of its location and coverage area, it is easy to transform the geometric location $(x_t, y_t)$ into the identification of the base station in the symbolic model.
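A minimal sketch of this transformation is given below. It assumes, purely for illustration, that a coordinate belongs to the base station whose location is nearest; real coverage areas may be shaped differently, and the function name `nearest_station` and the sample map table are hypothetical.

```python
import math
from typing import Dict, Tuple


def nearest_station(x: float, y: float,
                    coords: Dict[str, Tuple[float, float]]) -> str:
    """Return the ID of the base station closest to (x, y)."""
    return min(coords,
               key=lambda s: math.hypot(coords[s][0] - x, coords[s][1] - y))


# Hypothetical map table, reusing the coordinates of the earlier example.
STATIONS = {"A": (1, 1), "B": (1, 2), "D": (4, 2), "F": (3, 3), "H": (5, 3)}
SYMBOL = nearest_station(3.4, 2.6, STATIONS)
```

Here the estimated coordinate (3.4, 2.6) is mapped to station F, which is closer to it than station D.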

4. Performance evaluation

This section evaluates the effectiveness and efficiency of mining user movement patterns from call detail records. Section 4.1 presents the models for user behaviors, including movement behavior and calling behavior. Section 4.1 also describes both the synthetic dataset and the real dataset. Section 4.2 presents experimental results. Finally, the RUMP sensitivity analysis is given in Section 4.3.

4.1. Modeling user behaviors

User behaviors in a mobile computing environment include movement behaviors and calling behaviors. This section first describes the synthetic dataset used in this study, in which user movement behaviors are derived according to pre-defined parameters. To simulate a mobile computing environment, we use a 16 × 16 mesh network, in which each node represents a base station. Thus, the simulation model contains 256 base stations [4,30]. Moreover, our simulation considers 10,000 users. As in [31], this simulation considers three movement trajectories. For each user, we randomly select one movement trajectory as his/her own movement pattern. A user mostly follows his/her own movement pattern, but may also make some movements that do not follow it. These movements are viewed as biased movements. To prevent users from diverging too far from their movement patterns due to biased movements, we borrowed the concept in [17] that allows users to move back to their movement patterns. The number of movements made by a mobile user in one time slot is modeled as a uniform distribution between mf − 2 and mf + 2. The larger the value of mf is, the more frequently mobile users move. We used the design above to generate user movements.

However, for a real dataset, it is difficult to obtain real call detail records from mobile service providers due to customer privacy issues. Moreover, the RUMP approach requires the location information of base stations, which is business-related information for mobile service providers. Thus, for real datasets, we use real movement logs from a GPS-based testbed, CarWeb [32], and generate simulated CDRs along with real movements. In the CarWeb platform, users can obtain their locations from a GPS device every five seconds and upload their locations to CarWeb servers. Fig. 5 shows one frequent movement behavior, where every red flag represents a user-uploaded location. By collecting user movement behaviors for four months, we produce roughly 200 movement trajectories for each user. In the CarWeb dataset, the ground truth is known, which is useful to validate our mining results.⁴ In the CarWeb dataset, a user has frequent and infrequent movements. To simulate the coverage area of a base station, we divided the whole space into grids and viewed each grid as the coverage area of one base station. Fig. 5 shows the grids in the CarWeb dataset, where the frequent movement behaviors of this user occurred within or around 16 grids. Furthermore, since the traveling times of movement sequences in the CarWeb dataset are not exactly the same, the traveling time for each trajectory is normalized to 24 hours. In both datasets, the time slot is set to 2 hours and the value of ε is 12.

Once user movements have been determined, calling behaviors can be modeled for each user's movements. According to [30], calling behavior can be modeled as a Poisson distribution. Moreover, a Zeta distribution is used to model burst calling behavior in this article. In a Poisson distribution, the probability that a user has x calls in a time slot is determined by $P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$, where x is the number of calls and λ is the expected number of calls in a time slot. Three time slots are grouped and then each time slot is divided into three subsections, producing a total of 9 subsections in each group. For each user, the probability of having phone calls in the x-th subsection of a group is $Z(x) = \frac{x^{-\lambda}}{\sum_{n=1}^{\infty} 1/n^{\lambda}}$, where x indicates the subsection order in a group (i.e., the x-th subsection in a group) and λ is the value of the exponent of the Zeta distribution. In the first subsections of a group, a user will have more phone calls, but the number of phone calls decays as $x^{-\lambda}$ in the remaining subsections of the group. The speed of decay is determined by the parameter λ; the larger this parameter is, the faster the decay is. For brevity, CDR(ρ, λ) indicates that the

⁴ Due to customer privacy issues, it is impossible to get the ground truth of user movement behaviors even if mobile service providers were to release call

calling behavior is modeled as a ρ distribution with parameter λ, where the value of ρ is set to P (respectively, Z) if a Poisson (respectively, Zeta) distribution is used. For example, CDR(P, 2) represents the calling behavior of a user under a Poisson distribution with λ = 2.
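The two calling models can be evaluated numerically with a short sketch. The function names are hypothetical, and the infinite normalizing sum of the Zeta distribution is approximated by a finite number of terms (the `terms` parameter is an assumption of this sketch, not a parameter of the experiments).

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """P(x) = exp(-lam) * lam**x / x!, the probability of x calls in a time slot."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def zeta_pmf(x: int, lam: float, terms: int = 100_000) -> float:
    """Z(x) = x**(-lam) / sum_{n>=1} n**(-lam), with the sum truncated to `terms` terms."""
    norm = sum(n ** -lam for n in range(1, terms + 1))
    return x ** -lam / norm


# CDR(P, 2): probability of exactly three calls in one time slot.
P3 = poisson_pmf(3, 2)
# CDR(Z, 2): call probability in the first and in the ninth subsection of a group.
Z1, Z9 = zeta_pmf(1, 2), zeta_pmf(9, 2)
```

For λ = 2 the first subsection of a group receives most of the probability mass and the ninth subsection receives very little, which is the burst pattern described above.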

For comparison purposes, we also implemented the method of mining movement patterns in [4], denoted by UMP. To validate the quality of movement patterns mined by UMP and RUMP, we use the movement patterns to predict the next movements of users. The accuracy of prediction indicates the quality of the movement patterns mined. The hop count (referred to as hn) represents the number of base stations between the predicted location and the actual location of the mobile user. Intuitively, the smaller the hop count, the closer the derived location is to the current location. The expected value of hop counts per call, E(hn/call), is defined as $\frac{total\_hop\_counts}{number\_of\_calls}$, where total_hop_counts is the sum of hop counts over all calls and number_of_calls is the total number of calls per user. To evaluate the quality of user movement patterns mined by UMP and RUMP, the precision ratio is defined as $1 - \frac{E(hn/call)}{2n - 1}$, where the size of the network is n × n and E(hn/call) is the expected value of hop counts per call. The precision ratio thus measures how small the average hop count from the derived cell to the current cell of a mobile user is relative to the network size. Table 6 summarizes the definitions of some primary simulation parameters. In this table, the default values are the optimal values found in the following experiments in our experimental environment. Each experimental result was obtained as an average of twenty experiments.
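The precision ratio can be computed with a few lines. In this sketch the function names are hypothetical, and the hop count between two cells of the mesh is taken to be their Manhattan distance, which is an assumption about how hops are counted.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column) of a base station in the n x n mesh


def hop_count(predicted: Cell, actual: Cell) -> int:
    """Hops between two mesh cells, assumed here to be the Manhattan distance."""
    return abs(predicted[0] - actual[0]) + abs(predicted[1] - actual[1])


def precision_ratio(hop_counts: List[int], n: int) -> float:
    """1 - E(hn/call) / (2n - 1) for an n x n network."""
    expected_hops = sum(hop_counts) / len(hop_counts)
    return 1 - expected_hops / (2 * n - 1)


# Four calls on the 16 x 16 mesh: predicted cell versus actual cell.
PAIRS = [((0, 0), (0, 0)), ((3, 4), (3, 5)), ((7, 7), (8, 8)), ((2, 9), (2, 8))]
HOPS = [hop_count(p, a) for p, a in PAIRS]
RATIO = precision_ratio(HOPS, 16)
```

With an average of one hop per call on the 16 × 16 mesh, the ratio is 1 − 1/31, so a perfect prediction on every call would give exactly 1.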

4.2. Experiments of UMP and RUMP

We first evaluated UMP and RUMP in terms of the data amount, the precision ratio, and the execution time. The data amount is the number of records stored in a movement log and a CDR log. Fig. 6(a) shows that the data amount of UMP increases with the value of mf. This is because with a larger mf, a user tends to move frequently, producing a greater amount of data in the movement log. In contrast, the data amount in RUMP remains almost constant. Fig. 6(b) shows that the precision ratio of RUMP is smaller than that of UMP. However, with CDR(P, 4), the precision ratio of RUMP is not far below that of UMP. Note, however,

Fig. 5. The frequent movement behavior in CarWeb dataset.

Table 6
The parameters used in experiments.

Notation   Definition                       Default value
w          Number of movement sequences     50
mf         Movement frequency               3
min_freq   Threshold used in algorithm LS   0.3
min_sim    Threshold used in algorithm LS   0.5

that though UMP performs better than RUMP in terms of the precision ratio, it also incurs a larger amount of data in a movement log. To investigate the precision ratio gained by having the additional amount of log data, this study defines data utilization as the ratio between the precision ratio and the amount of log data. Fig. 7 shows the data utilization of UMP and RUMP. With a

[Figure: (a) Data Amount, data amount versus mf; (b) Precision Ratio, precision ratio versus mf; curves for UMP and for RUMP with CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4)]

Fig. 6. Performance comparisons of UMP and RUMP on the synthetic dataset.

[Figure: data utilization versus mf for UMP and for RUMP with CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4)]

Fig. 7. Data utilization of UMP and RUMP on the synthetic dataset.

[Figure: (a) Data Amount and (b) Precision Ratio of UMP, RUMP/CDR(P,2), RUMP/CDR(P,4), RUMP/CDR(Z,2) and RUMP/CDR(Z,4)]

higher mf, the data utilization of UMP drastically decreases. This is because the amount of data in the movement log increases dramatically as users move frequently. If the value of mf is smaller, the data utilization of RUMP with a Zeta distribution is larger than that of RUMP with a Poisson distribution. On the other hand, when the value of mf increases, the data utilization of RUMP with a Poisson distribution is larger than that of RUMP with a Zeta distribution. This is primarily because when mf is large, it is better to have more uniform calling behaviors so that the call detail records fully reflect user movement behaviors. These experimental results show that RUMP has a higher data utilization than UMP. By exploiting CDRs, RUMP is more cost-effective in mining user movement patterns.

Fig. 8(a) shows the data amount of UMP and RUMP with various calling behaviors under the CarWeb dataset; the data amount of RUMP is much smaller than that of UMP. Furthermore, Fig. 8(b) shows the precision ratios of UMP and RUMP, indicating that the difference between UMP and RUMP is not large. This suggests that RUMP is able to achieve acceptable precision ratios when using a smaller amount of data. Though it performs better than RUMP in terms of the precision ratio, UMP incurs a larger amount of data in the movement log. In Fig. 9, the data utilization of UMP is much smaller than that of RUMP, showing that with a smaller amount of log data, RUMP can still achieve an acceptable precision ratio.

Fig. 10(a) shows the execution time of UMP and RUMP under the synthetic dataset. Fig. 10 shows that the RUMP execution time is smaller than that of UMP in both the synthetic dataset and the CarWeb dataset. With a larger number of movement sequences, the UMP execution time significantly increases. With a higher mf, the execution time of UMP becomes much larger than that of RUMP. Further, RUMP has better scalability than UMP. In addition, Fig. 10(b) shows the execution time of UMP and RUMP on the CarWeb dataset. Similar to the results on the synthetic dataset, the RUMP execution time is much smaller than that of UMP. As the number of movement sequences increases, UMP takes longer to discover user movement patterns. On the other hand, the RUMP performance is determined by the data amount generated by calling. Since the data amount generated by calling is usually smaller than that generated by movements, RUMP incurs a smaller execution time.

[Figure: data utilization of UMP, RUMP/CDR(P,2), RUMP/CDR(P,4), RUMP/CDR(Z,2) and RUMP/CDR(Z,4)]

Fig. 9. Data utilization of UMP and RUMP on the CarWeb dataset.

[Figure: execution time versus number of moving sequences for UMP and for RUMP with CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4); (a) Synthetic Dataset, (b) CarWeb Dataset]

4.3. Sensitivity analysis of RUMP

This section further investigates the parameters used in RUMP. First, the impact of w is presented. Then, we examine the impact of thresholds on the mining results.

4.3.1. Impact of w

Fig. 11 shows the experiments of varying w values for RUMP under both the synthetic dataset and the CarWeb dataset. This figure indicates that the RUMP precision ratio increases as the value of w increases in both datasets. This is because the number of movement sequences considered in RUMP increases with w, so RUMP can use more calls to discover user movement patterns. The RUMP precision ratio with a Poisson distribution is larger than that of RUMP with a Zeta distribution. This is because the calls under a Poisson distribution are spread much more evenly across user movements. Thus, RUMP is able to fully capture user movement behaviors when the calling behavior follows a Poisson distribution. In a Poisson distribution, with a larger value of λ, the precision ratio of RUMP is larger. For a larger value of λ, the amount of call detail records tends to increase, thereby reflecting user movement behaviors more completely. For users with a larger number of calls and non-burst calling behavior, the value of w can be set smaller to quickly obtain movement patterns. In contrast, for users with a smaller number of calls or burst calling behavior, the value of w should be set larger to improve the precision ratio of the movement patterns mined by RUMP.

4.3.2. Impact of thresholds in algorithm LS

This section examines the impact of min_freq and min_sim on the RUMP performance. Algorithm LS uses the min_freq and min_sim thresholds to extract CDRs representing frequent movement behaviors. Figs. 12 and 13 show RUMP experiments with

[Figure: precision ratio versus min_freq for CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4); (a) Synthetic Dataset, (b) CarWeb Dataset]

Fig. 12. Precision ratio of RUMP with min_freq varied.

[Figure: precision ratio versus w for RUMP with CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4); (a) Synthetic Dataset, (b) CarWeb Dataset]

varying values of min_freq and min_sim. Fig. 12(a) shows the result of using RUMP on the synthetic dataset. This figure indicates that the RUMP precision ratio tends to increase when the value of min_freq increases from 0.1 to 0.3. This figure also shows that the RUMP precision ratio decreases when min_freq exceeds 0.3. This is because increasing min_freq filters out areas through which users do not frequently move. However, a larger min_freq is too strict for identifying which areas are frequent and decreases the precision ratio. Fig. 12(b) shows the same phenomenon for the CarWeb dataset. The value of min_freq should be determined empirically. For example, in this experiment, we set min_freq to 0.3.

Fig. 13 shows the RUMP precision ratio with various values of min_sim. In both datasets, the RUMP precision ratio tends to increase when min_sim increases from 0.1 to 0.5. However, when the value of min_sim exceeds 0.5, the RUMP precision ratio decreases. The min_sim threshold is set to identify whether or not a movement sequence is similar to the frequent movement behavior. With a larger value of min_sim, only a few movement sequences are identified as being similar to frequent user movement behaviors. This, in turn, decreases the RUMP precision ratio. Therefore, the value of min_sim should be carefully set. Experimental results show that min_freq should be set to 0.3 and min_sim should be set to 0.5 to achieve the best precision ratio.

4.3.3. Impact of thresholds in algorithm TC

As described above, the value of min_var for algorithm TC affects the accuracy of the RUMP time clustering results. We conducted experiments to examine the impact of min_var. For the synthetic dataset, Fig. 14(a) shows the precision ratio of RUMP with the value of the threshold min_var varied. This figure indicates that the RUMP precision ratio significantly increases when min_var is 0.25. However, the precision ratio of RUMP decreases when min_var exceeds 0.75. This is because excessively

[Figure: precision ratio versus min_var; (a) Synthetic Dataset, with curves for mf = 1, mf = 3 and mf = 5; (b) CarWeb Dataset]

Fig. 14. Precision ratio with min_var varied.

[Figure: precision ratio versus min_sim for CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4); (a) Synthetic Dataset, (b) CarWeb Dataset]

large values of min_var result in most of the call detail records being grouped in the same cluster. Hence, the number of movement functions is not enough to capture user movement behaviors. Furthermore, with a larger mf, the RUMP precision ratio is smaller and significantly decreases when min_var is larger. For the CarWeb dataset, Fig. 14(b) shows similar experimental results. These results indicate that min_var should be set to a smaller value for users who move frequently. The value of min_var, which can be determined empirically, should not be set too large. For example, in Fig. 14(a), min_var should be set to 0.75 because the RUMP precision ratio is the highest there.

Fig. 15 depicts the RUMP precision ratio with various calling behaviors. In Fig. 15(a), the results of CDR(P,2) and CDR(P,4) are similar to the results above. However, it is interesting to note that the precision ratios of CDR(Z,2) and CDR(Z,4) do not decrease when the value of min_var exceeds 0.75. Since burst calls happen at the beginning of every three time slots, most of the call detail records in these three time slots can be grouped into one cluster. Fig. 15 also shows similar results for the CarWeb dataset. Thus, we can set min_var to 0.75 to obtain the highest RUMP precision ratio.

5. Conclusions

User movement patterns can provide many benefits in mobile design schemes and applications, including designing a paging area, developing data allocation schemes, conducting querying strategies, and offering navigation services. This article proposes a regression-based approach called RUMP for mining user movement patterns from call detail records. To fully exploit the fragmented spatio-temporal information hidden in such trajectories, the proposed regression-based solution discovers user movement patterns. The RUMP approach uses three algorithms. First, algorithm LS extracts CDRs that reflect the frequent movement behaviors of mobile users. By capturing similar movement sequences from call detail records, an aggregation movement sequence is computed to represent the frequent movement behaviors of mobile users in each time slot. The feature of spatio-temporal locality states that if the time interval between consecutive calls is small, the mobile user is likely to have moved nearby. By exploring this feature, algorithm TC determines the number of regression functions properly by clustering those movement records of an aggregation movement sequence whose times of occurrence are very close. For each cluster of the aggregation movement sequence, algorithm MF generates the movement functions representing the movement patterns of mobile users. This article evaluates the performance of the proposed algorithms and conducts sensitivity analysis on several design parameters. Experimental results indicate that RUMP can efficiently and effectively derive user movement patterns that capture the frequent movement behaviors of mobile users.

Appendix A. Proof of minimizing residual error sum

Following the same notation as in Section 3.4, the residual sum of squares (i.e., $\sum_{i=1}^{n} w_i e_i^2$) can be expressed as $e^T W e$ in a linear algebra manner. All elements in W are positive and e can be formulated as $\tilde{b}_x - Ha^*$. Thus, we have:

$$e^T W e = (\tilde{b}_x - Ha^*)^T W (\tilde{b}_x - Ha^*) = (\tilde{b}_x - Ha^*)^T \sqrt{W}\sqrt{W}\,(\tilde{b}_x - Ha^*) = (\sqrt{W}\tilde{b}_x - \sqrt{W}Ha^*)^T (\sqrt{W}\tilde{b}_x - \sqrt{W}Ha^*) = \left\| \sqrt{W}\tilde{b}_x - \sqrt{W}Ha^* \right\|^2$$

[Figure: precision ratio versus min_var for RUMP with CDR(P,2), CDR(P,4), CDR(Z,2) and CDR(Z,4)]

(a) Synthetic Dataset