A regression-based approach for mining user movement patterns from
random sample data
Chih-Chieh Hung, Wen-Chih Peng⁎
Department of Computer Science, National Chiao Tung University, Taiwan, ROC
Article info
Article history:
Received 6 July 2009
Received in revised form 13 July 2010
Accepted 13 July 2010
Available online 5 August 2010
Abstract
Mobile computing systems usually express a user movement trajectory as a sequence of areas that capture the user movement trace. Given a set of user movement trajectories, user movement patterns refer to the sequences of areas through which a user frequently travels. In an attempt to obtain user movement patterns for mobile applications, prior studies explore the problem of mining user movement patterns from the movement logs of mobile users. These movement logs generate a data record whenever a mobile user crosses base station coverage areas. However, this type of movement log does not exist in the system and thus generates extra overheads. By exploiting an existing log, namely, call detail records, this article proposes a Regression-based approach for mining User Movement Patterns (abbreviated as RUMP). This approach views call detail records as random sample trajectory data, and thus, user movement patterns are represented as movement functions in this article. We propose algorithm LS (standing for Large Sequence) to extract the call detail records that capture frequent user movement behaviors. By exploring the spatio-temporal locality of continuous movements (i.e., a mobile user is likely to be in nearby areas if the time interval between consecutive calls is small), we develop algorithm TC (standing for Time Clustering) to cluster call detail records. Then, by utilizing regression analysis, we develop algorithm MF (standing for Movement Function) to derive movement functions. Experimental studies involving both synthetic and real datasets show that RUMP is able to derive user movement functions close to the frequent movement behaviors of mobile users.
© 2010 Elsevier B.V. All rights reserved.
Keywords:
User movement patterns
Data mining
Mobile data management
1. Introduction
Mobile services, such as navigation services, mobile search and location-aware services, are becoming very popular. These wireless communication systems enable users to access various kinds of information from anywhere at any time. A mobile computing system usually expresses a user movement trajectory as a sequence of areas in which the mobile user moves.¹ In this article, we aim at mining user movement patterns for a mobile user. Thus, given a user's set of movement trajectories, user movement patterns refer to the sequences of areas through which this user frequently travels. Analysis of user trajectory data can support the understanding and management of moving objects [1,2]. User movement patterns can be used to improve system performance, for example by designing personal paging areas [3] and by developing data allocation strategies [4–6], querying strategies [7], and navigation services [8,9].
To discover user movement patterns in a mobile computing system, the methods proposed in previous studies require movement logs to record the movements of mobile users. For example, in [4,5], when a mobile user moves from the coverage area of base station i to the coverage area of base station j, a handoff procedure is performed to smoothly switch communication channels between base stations. Meanwhile, the movement log generates a movement pair (i,j). However, the movement log is not an existing log of mobile systems, and generating it during handoff procedures incurs extra overhead. Hence, generating movement logs for all mobile users increases storage costs and decreases the performance of mobile computing systems. Prior works are therefore not practical for mobile computing systems. In fact, mobile computing systems already generate call detail records (abbreviated as CDR) when a mobile user makes or receives phone calls [10]. Table 1 shows an example of call detail records, where Uid represents the identification of a user making or receiving calls, and Cellid represents the corresponding base station that serves that user. Time information (i.e., date and time) is recorded in the CDR.² Table 1 shows that the CDRs of a mobile user contain both spatial information (i.e., base station identification) and temporal information (i.e., date and time). Since CDRs reflect the movement behaviors of users, this article addresses the problem of mining user movement patterns from an existing log of CDRs, thereby avoiding the overhead of generating a movement log.
⁎ Corresponding author. No. 1001 University Road, Hsinchu, Taiwan 300, ROC. Tel.: +886 3 5731478. E-mail address: wcpeng@cs.nctu.edu.tw (W.-C. Peng).
¹ This article defines a unit of area as the coverage area of one base station. For ease of presentation, we simply use a base station identification to represent the corresponding coverage area.
0169-023X/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2010.07.010
Contents lists available at ScienceDirect
Data & Knowledge Engineering
journal homepage: www.elsevier.com/locate/datak
Fig. 1 shows some trajectories of one user, where the dashed line represents one real trajectory of this user and the regions with mobile phones indicate that the user is receiving or making phone calls. This user's calling behavior is captured in the log of CDRs, and Table 1 shows the CDR log. Fig. 1 shows that CDRs are data points that are randomly sampled from trajectories and that the corresponding locations of these CDRs are scattered over the mobile computing environment. As a result, mining user movement behaviors from CDRs is a challenging task. Given these random sample data points, we aim to derive movement functions that are close to real user trajectories. We refer to these movement functions as movement patterns because they reflect the frequent movement behavior of users. Fig. 2 shows the movement function of a user for the example above.
Table 1
An example of call detail records.
Uid  Date   Time   Cellid
1    Day 1  07:30  A
1    Day 1  09:32  D
1    Day 1  09:49  E
1    Day 1  13:50  H
1    Day 2  08:50  C
1    Day 2  09:50  E
1    Day 2  14:00  H
1    Day 3  07:15  A
1    Day 3  09:02  C
1    Day 3  09:30  D
1    Day 4  12:30  W
1    Day 4  12:52  X
1    Day 4  13:30  Y
² The real call detail records analyzed in this study were provided by Taiwan mobile service providers, and we only extracted some useful attributes of call detail records to mine user movement patterns.
[Fig. 1. Panels: (a) Day 1, (b) Day 2, (c) Day 3, (d) Day 4.]
This article proposes a novel approach, called RUMP (standing for Regression-based approach for User Movement Patterns), to mine user movement patterns from CDRs. Given a set of data points, the main objective of regression analysis is to derive a regression function that minimizes the sum of distances between the function derived and data points. In this approach, call detail records are viewed as data points, while the regression functions derived are regarded as movement functions. However, not all call detail records should be involved in mining user movement patterns. Without carefully selecting CDRs, user movement
patterns cannot reflect the frequent movement behaviors of mobile users. On the other hand, CDRs should be fully utilized for
mining user movement patterns since only limited information is available in the CDR logs. Thus, several issues remain to be
addressed to efficiently utilize CDRs for mining user movement patterns.
1.1. Extracting frequent movement behaviors from CDRs
As mentioned before, user movement patterns refer to the frequent movement behaviors of mobile users. However, the CDR logs contain not only frequent user movement behaviors but also infrequent ones. For example, a user usually goes to his office and returns home every weekday (as Fig. 1(a), (b) and (c) show), and occasionally takes a trip (as Fig. 1(d) shows). The frequent movement behavior is the trajectory from his home to his office, whereas a trip is an infrequent movement behavior. Since regression analysis is sensitive to these infrequent CDRs, they should be eliminated. In other words, the call detail records that capture the frequent movement behaviors of users should be extracted. To extract the frequent movement behaviors of mobile users, we develop algorithm LS (standing for Large Sequence) to extract base stations whose coverage areas are frequently visited by users.
1.2. Determining the number of regression functions
Once CDRs that capture the frequent movement behaviors have been extracted, it is necessary to determine how many regression functions are needed. If only one regression function is derived, it may not be very close to the frequent user movement behavior. Thus, given a set of call detail records of the frequent movement behavior, clustering techniques can be used to divide call detail records into several groups. The number of groups is viewed as the number of regression functions. The movement trajectories of mobile users generally follow spatio-temporal locality (i.e., if the time interval between two consecutive calls of a mobile user is small, the mobile user is likely to have moved nearby). Therefore, the feature of spatio-temporal locality in algorithm TC (standing for Time Clustering) can be used to group the call detail records with a close occurrence time.
1.3. Deriving movement functions
Location identification techniques typically use one of two location models: the geometric model and the symbolic model [11].
The geometric model specifies the location in n-dimensional coordinates (typically n=2 or 3). The symbolic model, however, uses
logical entities to describe the location. This article represents the location of mobile users in CDRs using the symbolic model (i.e.,
the base station identification). To derive movement functions of a mobile user, the location of the call detail records in the
symbolic model must be transformed into the geometric model. Then, with the cluster results obtained, we develop algorithm MF (standing for Movement Function) for each cluster. This algorithm utilizes weighted regression analysis to derive the corresponding movement functions of a user.
The RUMP approach consists of a series of algorithms that tackle the various issues described above. This study evaluates RUMP performance using both synthetic and real datasets. Sensitivity analysis is conducted on several design parameters. Experimental
results show that RUMP is able to efficiently and effectively derive user movement patterns that capture the frequent movement
behaviors of mobile users.
The rest of this article is organized as follows. Section 2 presents some related works. Section 3 then devises algorithms for
mining user movement patterns. Section 4 presents performance results. Finally, Section 5 draws conclusions.
2. Related works
The problem of mining user movement patterns has attracted a considerable amount of research effort. Prior studies are generally classified into two categories based on their definitions of user movement patterns: spatial movement patterns and spatio-temporal movement patterns. In the first category, a user movement pattern refers to a sequence consisting of base station identifications or pre-defined regions. In the second category, user movement patterns represent the spatio-temporal associated relationships among base station identifications or pre-defined regions.
In the first category, the authors in [12] proposed an information-theoretical method to mine user movement patterns and represented them in a trie data structure. Moreover, the authors in [3] proposed a statistical approach to mine user movement patterns. The authors of [13] and [4] proposed a data mining approach for mining user movement patterns based on the movement logs of mobile users.
In the second category, user movement patterns are usually extracted from user trajectories, where trajectories are detailed user movements. A considerable amount of research effort focuses on mining spatio-temporal association rules [14–19]. The authors in [20] explored the fuzziness of locations in patterns and developed algorithms to discover spatio-temporal sequential patterns. Furthermore, the authors in [21] proposed a clustering-based approach to discover movement regions within time intervals. In [22], the authors developed a hybrid prediction model, consisting of vector-based and pattern-based models, to predict user movements. In [23] and [24], the authors exploited temporal annotated sequences in which sequences are associated with time information (i.e., transition times between two movements).
To the best of our knowledge, this study proposes a new way of mining user movement patterns from random sample data points. Though the main theme of this article is to mine user movement patterns from call detail records, the proposed approach can be used for other log data with randomly sampled data features. Existing studies neither fully utilize fragmented spatio-temporal information (e.g., call detail records) for mining user movement patterns nor explore regression techniques for deriving
movement functions. These features distinguish this article from others. Our preliminary work was presented in [25]. The current
article extends this preliminary work with more detailed complexity and theoretical analysis. Furthermore, we conduct an extensive performance analysis on both synthetic and real datasets. Finally, this study investigates the sensitivity of several parameters, such as the calling behavior and thresholds for each algorithm.
3. A regression-based approach for mining user movement patterns
This section develops a regression-based approach (i.e., RUMP) consisting of a sequence of algorithms to mine user movement
patterns. First, Section 3.1 provides an overview of RUMP, and the following sections present details of algorithms LS, TC, and MF.
3.1. An overview
Given a log of CDRs, the goal of this article is to derive movement functions that closely reflect the frequent movement behaviors of mobile users. Because CDRs are random samples, the timestamps of CDRs are unlikely to be the same even if a user follows the same movement behavior. Consequently, a basic time slot is defined as a time interval: call detail records whose occurrence times fall within the interval of one time slot are associated with that time slot. These CDRs are then put in a movement record, defined as follows:
3.1.1. Definition: movement record
A movement record is defined as a set of pairs $(BS_i : N_i)$, where $BS_i$ is a base station and $N_i$ is the occurrence count of $BS_i$ in the call detail records whose occurrence times are within the same time slot.
Table 2
Notations used in our algorithms.
Definition                                   Notation
Number of movement sequences                 w
Movement sequence i                          MS_i
Movement record at time slot j in MS_i       MR_{i,j}
A large movement sequence                    LMS
Large movement record at time slot i         LMR_i
An aggregation movement sequence             AMS
Aggregated movement record at time slot i    AMR_i
A time projection sequence of AMS            TP_AMS
Assume that one time slot covers the time interval from 6:00 am to 10:00 am. From Table 1, on Day 1, we have one movement record {A:1, D:1, E:1} since the occurrence times of three call detail records (i.e., those at A, D, and E) are within this interval. With the definition of movement records, a movement sequence is defined as follows:
3.1.2. Definition: movement sequence
A movement sequence $MS_i$, denoted by $\langle MR_{i,1}, MR_{i,2}, MR_{i,3}, \ldots, MR_{i,\varepsilon} \rangle$, is an ordered sequence of $\varepsilon$ movement records, where $MR_{i,j}$ is the movement record at time slot $j$ in $MS_i$ and $\varepsilon$ is an adjustable parameter.
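The two definitions above can be illustrated with a minimal sketch that turns the Day 1 rows of Table 1 into a movement sequence. The function names, the half-open slot convention (start ≤ t < end) and the second time slot are assumptions made for illustration; only the 6:00 am to 10:00 am slot comes from the text.

```python
from collections import Counter

# Day 1 rows of Table 1 as (time "HH:MM", base station) pairs.
DAY1_CDRS = [("07:30", "A"), ("09:32", "D"), ("09:49", "E"), ("13:50", "H")]


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def movement_sequence(cdrs, slot_bounds):
    """Return one movement record (base station -> count) per time slot.

    slot_bounds is a list of (start, end) "HH:MM" pairs. Treating each slot
    as the half-open interval start <= t < end is an assumption of this sketch.
    """
    records = [Counter() for _ in slot_bounds]
    for time, cell in cdrs:
        t = to_minutes(time)
        for j, (start, end) in enumerate(slot_bounds):
            if to_minutes(start) <= t < to_minutes(end):
                records[j][cell] += 1
                break  # a CDR belongs to at most one time slot
    return records


# The first slot is the one used in the text; the second is hypothetical.
SLOTS = [("06:00", "10:00"), ("10:00", "14:00")]
ms_day1 = movement_sequence(DAY1_CDRS, SLOTS)
print([dict(record) for record in ms_day1])
```

With these inputs the first movement record is {A:1, D:1, E:1}, matching the example given in the text.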
The length of a time slot determines the granularity of user movement patterns in terms of time. As in [22], the value of ε reflects the fact that a movement pattern may reappear. Thus, the value of ε depends on the periodicity of a user. Table 2 lists the notations used in this article. The overall procedure for mining movement patterns is outlined as follows:
3.1.2.1. Execution steps in RUMP
Step 1. (Extracting the aggregation movement sequence) In this step, call detail records are converted into w movement sequences, where w is an adjustable window size for the recent movement sequences being considered. Algorithm LS discovers an aggregation movement sequence, in which each movement record contains the areas in which a user frequently appears.
Step 2. (Clustering movement records) According to the aggregation movement sequence derived, we further develop algorithm TC to cluster movement records whose time slots are close.
Step 3. (Deriving movement functions) We then use regression techniques to derive the corresponding movement functions for each group in Step 2.
CDRs only reflect the fragmented movement behaviors of mobile users. Thus, the RUMP approach uses regression techniques to derive movement functions which are close to the frequent movement behaviors of mobile users. Due to the nature of regression techniques, without the proper selection of call detail records, the user movement functions derived cannot capture the frequent movement behaviors of mobile users. On the other hand, call detail records should be fully utilized to mine user movement patterns since only limited information is available in CDRs. In the following subsections, each algorithm is presented in detail.
3.2. Algorithm LS: extracting the aggregation movement sequence
In this article, a user movement trajectory is represented as a sequence of base station identifications (hereafter, “base station” for short). Hence, call detail records are converted into movement sequences. With a set of movement sequences, algorithm LS determines an Aggregation Movement Sequence (abbreviated as AMS) and uses it to represent the frequent movement behaviors of a user. Intuitively, AMS is a sequence of movement records that contain frequent base stations and their corresponding counts at each time slot. At each time slot, a frequent base station refers to a base station at which a user appears at least min_freq times among the movement sequences, where the threshold min_freq quantifies what counts as frequent. As pointed out earlier, counts for frequent base stations should also be determined. Thus, before deriving AMS, we first build a large movement sequence (abbreviated as LMS), which is a sequence of sets of frequent base stations, and compute the similarity between LMS and each movement sequence. In light of the similarity measurements obtained, we are able to identify those movement sequences that capture the frequent moving behavior of users and aggregate them as AMS.
3.2.1. Definition: large movement record
Given a set of movement sequences $MS_1, MS_2, \ldots, MS_w$ and a threshold min_freq, a large movement record at time slot $t$ is denoted as $LMR_t$, and $LMR_t$ contains the set of base stations whose occurrence count in the set of movement records at time slot $t$ (i.e., $MR_{1,t}, MR_{2,t}, \ldots, MR_{w,t}$) is larger than or equal to min_freq.
Given the five movement sequences in Table 3, if min_freq is set to 2, $LMR_4$ is {D, F} since both D and F have their occurrence count equal to min_freq. Large movement records demonstrate the frequent movement behavior of a user at each specific time slot. After obtaining large movement records at each time slot, a large movement sequence LMS is thus a sequence of large movement records, which is denoted as $LMS = \langle LMR_1, LMR_2, \ldots, LMR_\varepsilon \rangle$. Consequently, LMS indicates the frequent moving behavior of users.
Table 3
An example of algorithm LS.
1 2 3 4 5
MS1 A:14 A:2 F:1 I:2
MS2 C:8 C:1, D:1, F:1 H:1, G:4
MS3 A:1 C:1 D:1 H:1
MS4 A:1, B:1 A:1 F:9
MS5 B:4 D:4 H:1 A:1, B:2
LMS {A, B} {A} {D, F} {H, I}
Once a large movement sequence LMS is determined, we further formulate the similarity between movement sequences and LMS to identify whether a movement sequence reflects the frequent movement behavior of a user. Conventional similarity measurements, such as cosine similarity [26] and the extended Jaccard coefficient [27], cannot be applied because they can only deal with scalar vectors with no missing values. Movement sequences and LMS are sequences of sets of base stations, not sequences of scalar values. Moreover, empty sets are allowed in movement sequences and LMS. As such, we formulate the similarity between a movement sequence (e.g., $MS_i$) and LMS in terms of the closeness between movement records $MR_{i,j}$ and $LMR_j$, denoted by $C(MR_{i,j}, LMR_j)$, which compares the set of base stations in $MR_{i,j}$ with the frequent base stations in $LMR_j$. It is formulated as

$$C(MR_{i,j}, LMR_j) = \frac{|\{x \mid x \in MR_{i,j} \cap LMR_j\}|}{|\{y \mid y \in MR_{i,j} \cup LMR_j\}|},$$

and returns a normalized value in $[0, 1]$. The larger the value of $C(MR_{i,j}, LMR_j)$, the more closely $MR_{i,j}$ resembles $LMR_j$. For example, assume that $LMR_j = \{a, b, c, d\}$, $MR_{x,j} = \{b, e\}$ and $MR_{y,j} = \{a, b, c, d, e\}$. It can be verified that the value of $C(MR_{x,j}, LMR_j)$ is $\frac{1}{5}$ and the value of $C(MR_{y,j}, LMR_j)$ is $\frac{4}{5}$. Therefore, $MR_{y,j}$ is more similar to $LMR_j$. Based on the similarity between movement records and large movement records, the similarity measure of a movement sequence $MS_i$ and LMS is formulated as

$$sim(MS_i, LMS) = \sum_{j=1}^{\varepsilon} |MR_{i,j}| \cdot C(MR_{i,j}, LMR_j).$$

Given a threshold value min_sim, for each movement sequence $MS_i$, if $sim(MS_i, LMS) \geq$ min_sim, the movement sequence $MS_i$ is identified as a similar movement sequence. Consider the example in Table 3. It can be verified that

$$sim(MS_1, LMS) = 1 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{1}{1} + 0 + 1 \cdot \tfrac{1}{2} + 1 \cdot \tfrac{0}{1} = 2.$$

Further, $sim(MS_2, LMS) = 3$, $sim(MS_3, LMS) = 2$, $sim(MS_4, LMS) = 3$, and $sim(MS_5, LMS) = \frac{1}{2}$. Assuming that min_sim is 2, $MS_1$, $MS_2$, $MS_3$ and $MS_4$ are recognized as similar movement sequences.
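The closeness and similarity measures can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names are assumptions, and returning 0 when both sets are empty is a convention chosen here so that empty time slots contribute nothing to the sum.

```python
def closeness(mr, lmr):
    """C(MR, LMR) = |MR ∩ LMR| / |MR ∪ LMR| over sets of base stations.

    Returning 0.0 when both sets are empty is an assumption of this sketch.
    """
    mr, lmr = set(mr), set(lmr)
    union = mr | lmr
    return len(mr & lmr) / len(union) if union else 0.0


def similarity(ms, lms):
    """sim(MS, LMS) = sum over time slots j of |MR_j| * C(MR_j, LMR_j)."""
    return sum(len(set(mr)) * closeness(mr, lmr) for mr, lmr in zip(ms, lms))


# The closeness example from the text.
lmr_j = {"a", "b", "c", "d"}
print(closeness({"b", "e"}, lmr_j))                 # 1/5
print(closeness({"a", "b", "c", "d", "e"}, lmr_j))  # 4/5

# A hypothetical four-slot sequence: 1*(1/2) + 1*(1/1) + 0 + 1*(1/2).
ms = [{"A"}, {"A"}, set(), {"F"}]
lms = [{"A", "B"}, {"A"}, set(), {"D", "F"}]
print(similarity(ms, lms))
```

Because only the keys matter for closeness, a movement record stored as a dictionary of counts can be passed in directly.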
After identifying similar movement sequences, these similar movement sequences are aggregated as one AMS in which
frequent base stations and their associated counts are determined. An aggregation movement sequence is defined as follows:
3.2.2. Definition: aggregation movement sequence
The aggregation movement sequence is denoted as $AMS = \langle AMR_1, AMR_2, \ldots, AMR_\varepsilon \rangle$, where $AMR_j$ is an aggregated movement record that contains the frequent base stations of the large movement record $LMR_j$ together with their occurrence counts accumulated from the movement records at time slot $j$ of the similar movement sequences.
Consider $AMR_1$ of AMS in Table 3 as an example. Among the similar movement sequences, the occurrence count of A in $AMR_1$ is calculated as the sum of the counts of A in $MR_{1,1}$, $MR_{3,1}$ and $MR_{4,1}$ (i.e., 14 + 1 + 1 = 16). Following the same procedure, we have AMS = ⟨{A:16, B:1}, {A:3}, ∅, {D:2, F:3}, {H:2}⟩, as shown in Table 3.
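The construction of LMS and AMS can be sketched as follows. The data below is a hypothetical two-slot toy example, not a transcription of Table 3, and the function names are illustrative. Counting the number of movement sequences in which a base station appears (rather than summing call counts) is an assumption of this sketch, since the definition of the occurrence count admits both readings.

```python
from collections import Counter


def large_movement_sequence(sequences, min_freq):
    """LMR_t = base stations that appear in at least min_freq sequences at slot t.

    Counting sequences instead of summing call counts is an assumption.
    """
    lms = []
    for t in range(len(sequences[0])):
        seen = Counter()
        for ms in sequences:
            seen.update(set(ms[t]))  # one vote per sequence and base station
        lms.append({bs for bs, votes in seen.items() if votes >= min_freq})
    return lms


def aggregate(similar_sequences, lms):
    """AMR_t keeps only the stations in LMR_t and sums their counts."""
    ams = []
    for t, lmr in enumerate(lms):
        amr = Counter()
        for ms in similar_sequences:
            for bs, count in ms[t].items():
                if bs in lmr:
                    amr[bs] += count
        ams.append(dict(amr))
    return ams


# Hypothetical movement sequences with two time slots each.
sequences = [
    [{"A": 14}, {"A": 2}],
    [{"A": 1}, {"C": 1}],
    [{"A": 1, "B": 1}, {"A": 1}],
    [{"B": 4}, {}],
]
lms = large_movement_sequence(sequences, min_freq=2)
# Suppose the first three sequences passed the min_sim test.
ams = aggregate(sequences[:3], lms)
print(lms, ams)
```

In this toy input the count of A in the first aggregated record is 14 + 1 + 1 = 16, the same accumulation pattern as in the example above.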
3.2.3. Time complexity analysis
Given w movement sequences with ε time slots, the complexity of algorithm LS can be expressed as O(εw). The complexity involved in calculating large movement records is O(εw), while that of extracting frequent movement sequences is ε · w · O(1) = O(εw). As a result, the overall time complexity of algorithm LS is O(εw). Thus, algorithm LS is of polynomial time complexity.
Algorithm 1. Algorithm LS
Input: w movement sequences of length ε; two thresholds: min_freq and min_sim
Output: aggregation movement sequence AMS
begin
  for j ← 1 to ε do
    for i ← 1 to w do
    begin
      LMR_j ← frequent 1-itemset of MR_{i,j};  // by min_freq
    end
  for i ← 1 to w do
  begin
    match ← 0;
    for j ← 1 to ε do
    begin
      C(MR_{i,j}, LMR_j) ← |MR_{i,j} ∩ LMR_j| / |MR_{i,j} ∪ LMR_j|;
      match ← match + |MR_{i,j}| · C(MR_{i,j}, LMR_j);
    end
    if match ≥ min_sim then
      accumulate the occurrence counts of items in the aggregation movement sequence;
  end
end
3.3. Algorithm TC: clustering aggregation movement records
As pointed out earlier, the movement trajectories of mobile users generally follow spatio-temporal locality (i.e., if the time interval between two consecutive calls of a mobile user is small, the mobile user is likely to have moved nearby). Accordingly, aggregation movement records in AMS can be clustered into several groups if their corresponding time slots are close. To simplify the presentation, only time information (i.e., time slots) is extracted from AMS. Thus, the time projection sequence of AMS is defined as follows:
3.3.1. Definition: time projection sequence
A time projection sequence of AMS is expressed as $TP_{AMS} = \langle \alpha_1, \ldots, \alpha_n \rangle$, where $AMR_{\alpha_j} \neq \{\}$ and $\alpha_1 < \ldots < \alpha_n$.
A time projection sequence is a sequence of time slots in which the corresponding movement records are not empty. Algorithm TC then uses the time projection sequence to cluster close time slots. The cluster result of algorithm TC is represented as a clustered time projection sequence defined as follows:
3.3.2. Definition: clustered time projection sequence
A clustered time projection sequence of $TP_{AMS}$, denoted by $CTP_{AMS}$, is represented as $\langle CL_1, CL_2, \ldots, CL_k \rangle$, where the i-th group $CL_i$ contains the time slots of the clustered movement records, and k is an integer such that $1 \leq k \leq \varepsilon$.
Given the AMS obtained in Step 1, $TP_{AMS}$ is easily determined. By exploring the feature of spatio-temporal locality, algorithm TC generates a clustered time projection sequence of AMS (i.e., $CTP_{AMS}$). Each cluster in $CTP_{AMS}$ contains close time slots. Those movement records with their time slots being clustered preserve the feature of spatio-temporal locality. Therefore, the objective of clustering is to bound the variance of time slots in each group with a given threshold (i.e., min_var).
The variance of a group $CL_i$ is defined as

$$Var(CL_i) = \frac{1}{m} \sum_{k=1}^{m} \left( n_{i,k} - \frac{1}{m} \sum_{j=1}^{m} n_{i,j} \right)^2,$$

where $n_{i,j}$ represents the j-th time slot of the movement records in $CL_i$ and the total number of movement records in $CL_i$ is $m$. Algorithm TC generates a clustered time projection sequence $CTP_{AMS}$ such that $Var(CL_i) \leq$ min_var for all clusters $CL_i$.
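The variance above is the population variance of the time slots in a cluster. A minimal sketch (the function name is illustrative) reproduces the variance values quoted in the execution scenario of algorithm TC:

```python
def cluster_variance(slots):
    """Population variance of the time slots in one cluster (divides by m)."""
    m = len(slots)
    mean = sum(slots) / m
    return sum((n - mean) ** 2 for n in slots) / m


print(cluster_variance([1, 2, 3, 4, 5]))              # 2.0
print(round(cluster_variance([14, 17, 18, 20]), 2))   # 4.69
print(round(cluster_variance([17, 18, 20]), 2))       # 1.56
print(round(cluster_variance([1, 2, 3]), 2))          # 0.67
```

Note that dividing by m rather than m − 1 matters: the sample variance of ⟨1,2,3,4,5⟩ would be 2.5, not 2.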
To achieve the objective of clustering, algorithm TC first coarsely clusters $TP_{AMS}$ into several marked clusters using a value δ. The initial value of δ is set to ε, and δ then decreases by one in each round. Thus, in the beginning, there is only one cluster. Dividing clusters with a variance larger than min_var increases the number of clusters. In algorithm TC, unmarked clusters refer to clusters that do not need to be refined, whereas marked clusters refer to clusters that should be further partitioned. For each cluster $CL_i$, if $Var(CL_i)$ is no larger than min_var, the cluster $CL_i$ is unmarked. Otherwise, δ decreases by 1 and algorithm TC re-clusters the time slots in marked clusters with the updated value of δ. Algorithm TC partitions $TP_{AMS}$ iteratively until no marked clusters remain or until δ = 1. If there are no marked clusters, $CTP_{AMS}$ is generated. Otherwise, if there are still marked clusters with variance values larger than min_var, algorithm TC continues to finely partition these marked clusters so that the variance of every marked cluster is constrained by the threshold value of min_var.
When the value of δ is 1, the time slots of movement records in a marked cluster generally follow a sequence of consecutive integers such that the variance of the marked cluster is still larger than min_var. This situation results in a loss of spatio-temporal locality. For example, given movement records with a sequence of consecutive time slots 1, 2, 3, 4, 5, 6, and 7, although the differences between consecutive time slots are small, the location of a user at time slot 1 and that at time slot 7 are probably far from each other. To deal with this problem, this cluster must be further partitioned into smaller clusters. The variance of each refined cluster should be no larger than min_var. Moreover, to guarantee that the time slots of each refined cluster are as close as possible, the total variance of the refined clusters should be minimized. To derive the optimal method for further partitioning, the following Lemma is derived:
Lemma. Given a cluster that has a sequence of consecutive integers 1, 2, 3, ..., n and a positive integer k, the optimal method to divide this cluster into k clusters while minimizing the sum of the variances of the clusters is to partition it into k sub-clusters, each with a size of $\frac{n}{k}$.

Proof. Suppose that ⟨1,2,3,...,n⟩ is divided into k sub-clusters: $\langle 1, \ldots, t_1 \rangle, \langle t_1 + 1, \ldots, t_2 \rangle, \ldots, \langle t_{k-1} + 1, \ldots, n \rangle$. Let $t_0 = 0$, $t_k = n$, and $Var_i = Var(\langle t_{i-1} + 1, \ldots, t_i \rangle)$. Our goal is to find the cutting points (i.e., $t_1, t_2, \ldots, t_{k-1}$) that minimize $f = \sum_{i=1}^{k} Var_i$.
The variance is the same constant for any sequence of consecutive integers with the same length. For example, consider two clusters with two sequences of consecutive time slots: ⟨1,2,3,4,5⟩ and ⟨7,8,9,10,11⟩. It can be verified that Var(⟨1,2,3,4,5⟩) = Var(⟨7,8,9,10,11⟩). Since $Var(\langle 1, 2, \ldots, n \rangle) = \frac{1}{12}(n^2 - 1)$, we have

$$f = \sum_{i=1}^{k} Var_i = \frac{1}{12} \sum_{i=1}^{k} \left[ (t_i - t_{i-1})^2 - 1 \right].$$

To minimize $f$, the cutting points $t_1, t_2, \ldots, t_{k-1}$ are derived by letting the first derivatives be zero:

$$\begin{cases}
\dfrac{\partial f}{\partial t_1} = \dfrac{1}{12}\left(4t_1 - 2t_2 - 2t_0\right) = 0 \\
\dfrac{\partial f}{\partial t_2} = \dfrac{1}{12}\left(4t_2 - 2t_3 - 2t_1\right) = 0 \\
\quad\vdots \\
\dfrac{\partial f}{\partial t_{k-1}} = \dfrac{1}{12}\left(4t_{k-1} - 2t_k - 2t_{k-2}\right) = 0
\end{cases}$$

Thus, we have the following terms:

$$\begin{cases}
t_1 = \dfrac{t_0 + t_2}{2} \\
t_2 = \dfrac{t_1 + t_3}{2} \\
\quad\vdots \\
t_{k-1} = \dfrac{t_{k-2} + t_k}{2}
\end{cases}$$

Using substitution (with $t_0 = 0$), we have

$$\begin{cases}
t_1 = \dfrac{1}{2} t_2 \\
t_2 = \dfrac{2}{3} t_3 \\
\quad\vdots \\
t_{k-1} = \dfrac{k-1}{k} t_k
\end{cases}$$

Therefore, since $t_k = n$, we get:

$$\begin{cases}
t_1 = \dfrac{1}{k} n \\
t_2 = \dfrac{2}{k} n \\
\quad\vdots \\
t_{k-1} = \dfrac{k-1}{k} n
\end{cases}$$

From the derivation above, the optimal way to divide ⟨1,2,3,...,n⟩ into k clusters is to divide it into k sub-clusters, each with a size of $\frac{n}{k}$. □

Table 4
An execution scenario of algorithm TC.
Run  δ    min_var  Clusters
0    20   1.6      ⟨1,2,3,4,5,9,10,14,17,18,20⟩⁎
...  ...  ...      ...
1    3    1.6      ⟨1,2,3,4,5⟩⁎, ⟨9,10⟩, ⟨14,17,18,20⟩⁎
2    2    1.6      ⟨1,2,3,4,5⟩⁎, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩
3    1    1.6      ⟨1,2,3,4,5⟩⁎, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩
4    0    1.6      ⟨1,2,3,4,5⟩⁎, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩
5    0    1.6      ⟨1,2,3⟩, ⟨4,5⟩⁎, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩
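The Lemma can be checked numerically for small cases. The following sketch is a brute-force check rather than part of the proof: it enumerates every choice of k − 1 cutting points and returns the one that minimizes the sum of variances. The function names are illustrative.

```python
from itertools import combinations


def variance(slots):
    """Population variance of a non-empty sequence of time slots."""
    m = len(slots)
    mean = sum(slots) / m
    return sum((s - mean) ** 2 for s in slots) / m


def total_variance(n, cuts):
    """Sum of variances when <1..n> is cut after positions t_1 < ... < t_{k-1}."""
    bounds = (0, *cuts, n)
    return sum(variance(range(a + 1, b + 1)) for a, b in zip(bounds, bounds[1:]))


def best_cuts(n, k):
    """Exhaustively search all cutting points and return the best ones."""
    return min(combinations(range(1, n), k - 1),
               key=lambda cuts: total_variance(n, cuts))


print(best_cuts(12, 3))  # (4, 8), i.e. t_i = i * n / k
print(best_cuts(10, 2))  # (5,)
```

When k does not divide n, the even split is not exactly attainable, and the search returns a partition whose sub-cluster sizes differ by at most one.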
This Lemma provides a guideline for partitioning a marked cluster that has a sequence of consecutive time slots into smaller clusters. Since the value of k is not known in advance, k is initially set to 2 and then increases in each iteration. In each iteration, a marked cluster is evenly divided into k sub-clusters, each with a size of $\frac{n}{k}$, and the variance of each sub-cluster is tested. If the variance of every sub-cluster is no larger than min_var, the procedure terminates. Otherwise, the value of k is increased by 1 and the marked cluster is further refined into smaller sub-clusters.
Consider the execution scenario in Table 4, where the time projection sequence is $TP_{AMS}$ = ⟨1,2,3,4,5,9,10,14,17,18,20⟩. The initial cluster is ⟨1,2,3,4,5,9,10,14,17,18,20⟩. Given min_var = 1.6, algorithm TC first roughly partitions $TP_{AMS}$ into three clusters. Table 4 shows that two marked clusters (i.e., ⟨1,2,3,4,5⟩ with Var(⟨1,2,3,4,5⟩) = 2 and ⟨14,17,18,20⟩ with Var(⟨14,17,18,20⟩) = 4.69) are determined because the variance values of these two clusters are larger than 1.6. Then, δ is reduced to 2, and these two marked clusters are re-examined. In the following run, the previous cluster ⟨14,17,18,20⟩ is divided into two clusters, i.e., ⟨14⟩ and ⟨17,18,20⟩. Since Var(⟨14⟩) = 0 < 1.6 and Var(⟨17,18,20⟩) = 1.56 < 1.6, these two clusters are unmarked. Following the same procedure, algorithm TC partitions marked clusters until δ equals 1. Run 4 in Table 4 shows that ⟨1,2,3,4,5⟩ is still a marked cluster with Var(⟨1,2,3,4,5⟩) = 2. Therefore, algorithm TC finely partitions ⟨1,2,3,4,5⟩. The value of k is initially set at 1. Since Var(⟨1,2,3,4,5⟩) = 2 is larger than min_var (i.e., 1.6), k increases to 2. Then, ⟨1,2,3,4,5⟩ is divided into ⟨1,2,3⟩ and ⟨4,5⟩. Of these two clusters, ⟨1,2,3⟩ has the larger variance, and thus Var(⟨1,2,3⟩) is compared with the value of min_var. Since Var(⟨1,2,3⟩) = 0.67 < 1.6, algorithm TC stops the clustering process. Finally, $CTP_{AMS}$ is generated as ⟨1,2,3⟩, ⟨4,5⟩, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩.
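The scenario above can be reproduced with a compact sketch of algorithm TC. Several details are assumptions made for illustration: "grouping the numbers whose differences are within δ" is read as splitting a cluster wherever the gap between consecutive time slots exceeds δ, an uneven split into k groups gives the extra elements to the leading groups, the input is assumed to be non-empty and sorted, and min_var is assumed to be non-negative.

```python
def variance(slots):
    """Population variance of a non-empty list of time slots."""
    m = len(slots)
    mean = sum(slots) / m
    return sum((s - mean) ** 2 for s in slots) / m


def split_by_gap(slots, delta):
    """Coarse step: start a new group whenever consecutive slots differ by more than delta."""
    groups = [[slots[0]]]
    for prev, cur in zip(slots, slots[1:]):
        if cur - prev <= delta:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def split_evenly(slots, k):
    """Fine step: k contiguous groups whose sizes differ by at most one."""
    quotient, remainder = divmod(len(slots), k)
    groups, start = [], 0
    for i in range(k):
        size = quotient + (1 if i < remainder else 0)
        groups.append(slots[start:start + size])
        start += size
    return groups


def time_clustering(tp, epsilon, min_var):
    clusters = [list(tp)]
    # Coarse phase: delta runs from epsilon - 1 down to 1.
    for delta in range(epsilon - 1, 0, -1):
        if all(variance(c) <= min_var for c in clusters):
            break  # no marked cluster remains
        refined = []
        for c in clusters:
            if variance(c) <= min_var:
                refined.append(c)  # unmarked clusters are left untouched
            else:
                refined.extend(split_by_gap(c, delta))
        clusters = refined
    # Fine phase: evenly divide each remaining marked cluster, as the Lemma suggests.
    result = []
    for c in clusters:
        k, parts = 1, [c]
        while any(variance(p) > min_var for p in parts):
            k += 1
            parts = split_evenly(c, k)
        result.extend(parts)
    return result


print(time_clustering([1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20], 20, 1.6))
```

For the time projection sequence of Table 4 with min_var = 1.6, this sketch yields ⟨1,2,3⟩, ⟨4,5⟩, ⟨9,10⟩, ⟨14⟩, ⟨17,18,20⟩, the same clustered time projection sequence as in the text.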
3.3.3. Time complexity analysis
Algorithm TC is of polynomial time complexity. Let $TP_{AMS}$ contain n numbers. Algorithm TC needs O(n) time to divide $TP_{AMS}$ into clusters (lines 5 to 15). For lines 17 to 25, assume that there are still s clusters with m numbers to be refined.
Algorithm 2. Algorithm TC
Input: time projection sequence TP_AMS; threshold min_var
Output: clustered time projection sequence CTP_AMS
1: begin
2:   δ ← ε;
3:   CL_1 ← TP_AMS;
4:   mark CL_1;
6:   begin
7:     for each marked cluster CL_i do
8:       if Var(CL_i) ≤ min_var then
9:       begin
10:        unmark CL_i;
11:      end
12:    δ ← δ − 1;
13:    for all marked clusters CL_i do
14:      group the numbers whose differences are within δ in CL_i;
15:  end
16:
17:  if there are marked clusters then
18:  begin
19:    for each marked cluster CL_i do
20:      k ← 1;
21:      repeat
22:        k ← k + 1;
23:        divide CL_i into k groups with equal sizes;
24:      until the variance of each group ≤ min_var
25:  end
26: end
Since k is at most m, the clustering process takes s·O(m) time. In the worst case (i.e., m = n), the overall time complexity of algorithm TC is at most O(n).
3.4. Algorithm MF: deriving movement functions
Given the aggregation movement sequence AMS devised by algorithm LS and its clustered time projection sequence $CTP_{AMS}$ generated by algorithm TC, algorithm MF derives a sequence of movement functions that estimate the frequent movement behaviors of mobile users. For each cluster, a confidence movement function is derived first. Then, linkage movement functions are determined to link the confidence movement functions among clusters. Finally, a movement function F(t) is derived and represented as ⟨$U_0(t)$, $E_1(t)$, $U_1(t)$, $E_2(t)$, ..., $E_k(t)$, $U_k(t)$⟩, where $E_i(t)$ is the confidence movement function in cluster $CL_i$ of $CTP_{AMS}$ and $U_i(t)$ is the linkage movement function from $E_i(t)$ to $E_{i+1}(t)$.
3.4.1. Deriving confidence movement functions
For each cluster $CL_i$ of $CTP_{AMS}$, the confidence movement function of a mobile user, expressed as $E_i(t) = (\hat{x}_i(t), \hat{y}_i(t), TI_i)$, is derived. In this case, $\hat{x}_i(t)$ (respectively, $\hat{y}_i(t)$) is a movement function along the x-coordinate axis (respectively, the y-coordinate axis), and the confidence movement function is valid for the time interval indicated in $TI_i$.
Without loss of generality, let $CL_i$ be ⟨$t_1, t_2, \ldots, t_n$⟩, where $t_j$ denotes one of the time slots in $CL_i$ for j = 1, 2, ..., n. $AMR_i$ contains frequent base stations with their corresponding counts in the i-th time slot of AMS. To derive movement functions, the locations of base stations should be converted from the symbolic model into the geometric model through a map table that indicates the coordinates of base stations and is provided by telecom companies. Hence, given AMS and $CTP_{AMS}$, for each cluster of $CTP_{AMS}$, the geometric coordinates of frequent base stations can be derived along with their corresponding counts and represented as $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$, where $t_i$ is the corresponding time slot, $x_i$ (respectively, $y_i$) is the x-coordinate (respectively, y-coordinate) of the base station, and $w_i$ is the number of phone calls a mobile user has made at this base station. Accordingly, for each cluster of $CTP_{AMS}$, a weighted regression analysis is able to derive the corresponding confidence movement function.
Given a set of data points, the goal of regression analysis is to derive the best estimated curve, i.e., the curve with the minimal sum of squared errors [28]. The aggregation movement sequence generated in Step 1 records the appearance counts of base stations. Based on these appearance counts, we can derive curves that lie closer to the base stations with larger appearance counts. This is because the more calls a user makes at a base station, the more confidence we have that this mobile user frequently appears in the coverage area of this base station. Another advantage of weighted regression analysis is that, in a real mobile computing system, the base station that serves a user is not always the nearest base station, because nearby base stations take over for the nearest base station when it becomes overloaded. However, this scenario does not always happen, so the appearance counts of the other base stations will be smaller than that of the nearest base station. Therefore, weighted regression analysis makes it possible to derive curves close to the base stations with higher appearance counts.
Given the data points within a cluster, this article considers the derivation of $\hat{x}(t)$. An m-degree polynomial function $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$ is derived to approximate the movement behavior along the x-coordinate axis. Given the data points $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$, the regression coefficients $\{a_0, a_1, \ldots, a_m\}$ are selected to minimize the residual sum of squares $\epsilon_x = \sum_{i=1}^{n} w_i e_i^2$, where $e_i = x_i - (a_0 + a_1 t_i + a_2 t_i^2 + \cdots + a_m t_i^m)$. The value of m is application dependent and must be smaller than the number of data points. A larger value of m yields a more precise fitting curve. Since $\hat{x}(t)$ is obtained by matrix operations, the matrix size is the dominant factor in regression performance. However, the impact of weighted regression analysis on execution time is not significant in this article since the maximal value of m is usually small, and for a small m the execution time of regression analysis is acceptable. Therefore, the value of m should be set as large as the number of available data points allows.
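For ease of presentation, the following terms are defined:
\[
H = \begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^m \end{bmatrix},\quad
a^* = \begin{bmatrix} a_0 \\ \vdots \\ a_m \end{bmatrix},\quad
\tilde{b}_x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix},\quad
e = \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix},\quad
W = \begin{bmatrix} w_1 & & \\ & \ddots & \\ & & w_n \end{bmatrix}.
\]
By solving the normal equation $(\sqrt{W}H)^T(\sqrt{W}H)\,a^* = (\sqrt{W}H)^T\sqrt{W}\,\tilde{b}_x$, $a^*$ can be derived such that the value of $\epsilon_x$ is minimized (see Appendix A for the proof). This leads to $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$. $\hat{y}(t)$ can be derived following the same procedure. As a result, for each cluster of $CTP_{AMS}$, the confidence movement function $E_i(t) = (\hat{x}(t), \hat{y}(t), [t_1, t_n])$ of a mobile user can be devised.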
For example, let AMS = ⟨{A:16, B:1}, {A:1}, ∅, {D:2, F:3}, {H:2}⟩ and the coordinates of A, B, D, F and H be (1, 1), (1, 2), (4, 2), (3, 3) and (5, 3), respectively. Given AMS and $CTP_{AMS}$ = ⟨1,2,4,5⟩, it is possible to obtain the data points with their weights, as Table 5 shows. By setting m to 3, the 3-degree polynomial $\hat{x}(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3$ is derived. The coefficients $a_0$, $a_1$, $a_2$ and $a_3$ are determined by the regression curve that minimizes the residual sum of squares. That is, $a^* = (a_0\ a_1\ a_2\ a_3)^T$ must be determined. Since there are six data points with their corresponding time slots of 1, 1, 2, 4, 4 and 5,
\[
H = \begin{bmatrix}
1 & 1 & 1^2 & 1^3 \\
1 & 1 & 1^2 & 1^3 \\
1 & 2 & 2^2 & 2^3 \\
1 & 4 & 4^2 & 4^3 \\
1 & 4 & 4^2 & 4^3 \\
1 & 5 & 5^2 & 5^3
\end{bmatrix}
\]
is then calculated. The weights of the data points are 16, 1, 1, 2, 3 and 2, respectively. Hence, $\sqrt{W}$ is a diagonal matrix with the diagonal entries $[\sqrt{16}, \sqrt{1}, \sqrt{1}, \sqrt{2}, \sqrt{3}, \sqrt{2}]$. Table 5 shows that $\tilde{b}_x = (1\ 1\ 1\ 4\ 3\ 5)^T$. By solving the normal equation $(\sqrt{W}H)^T(\sqrt{W}H)\,a^* = (\sqrt{W}H)^T\sqrt{W}\,\tilde{b}_x$, we get $a^* = (2.333\ \ {-2.133}\ \ 0.867\ \ {-0.066})^T$. Therefore, $\hat{x}(t) = 2.333 - 2.133t + 0.867t^2 - 0.066t^3$ is devised to predict the x-coordinate of the mobile user from t = 1 to t = 5. Similarly, $\tilde{b}_y = (1\ 2\ 1\ 2\ 3\ 3)^T$ is determined from Table 5. By solving the normal equation $(\sqrt{W}H)^T(\sqrt{W}H)\,a^* = (\sqrt{W}H)^T\sqrt{W}\,\tilde{b}_y$, we get $a^* = (2.529\ \ {-2.386}\ \ 1.021\ \ {-0.105})^T$, and thus $\hat{y}(t) = 2.529 - 2.386t + 1.021t^2 - 0.105t^3$. Fig. 3 shows the confidence movement functions, where each circle indicates the location of a base station with its corresponding weight and the solid line is the curve derived by algorithm MF. The confidence movement function closely resembles the actual movement behavior, demonstrating the advantage of utilizing regression analysis to mine user movement patterns.
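The worked example can be checked numerically with a short NumPy sketch. It is an illustration only: the helper name `weighted_polyfit` is hypothetical, and instead of forming the normal equation explicitly it hands the scaled system $\sqrt{W}H\,a^* \approx \sqrt{W}\,\tilde{b}$ to `numpy.linalg.lstsq`, which returns the same minimizer when H has full column rank.

```python
import numpy as np


def weighted_polyfit(t, values, weights, degree):
    """Weighted least-squares polynomial fit; returns (a_0, ..., a_degree)."""
    t = np.asarray(t, dtype=float)
    b = np.asarray(values, dtype=float)
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    # H has rows (1, t_i, t_i^2, ..., t_i^degree).
    H = np.vander(t, degree + 1, increasing=True)
    # Minimise || sqrt(W) b - sqrt(W) H a ||^2, i.e. sum_i w_i * e_i^2.
    coeffs, *_ = np.linalg.lstsq(sqrt_w[:, None] * H, sqrt_w * b, rcond=None)
    return coeffs


# Data points of the first cluster in Table 5.
t = [1, 1, 2, 4, 4, 5]
x = [1, 1, 1, 4, 3, 5]
y = [1, 2, 1, 2, 3, 3]
w = [16, 1, 1, 2, 3, 2]

a_x = weighted_polyfit(t, x, w, degree=3)
a_y = weighted_polyfit(t, y, w, degree=3)
print(np.round(a_x, 3))
print(np.round(a_y, 3))
```

Because the cluster has only four distinct time slots and the polynomial has four coefficients, the fitted curve passes through the weighted mean of the coordinates at each time slot, which is why heavily weighted stations such as A at t = 1 dominate the result.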
Algorithm 3. Algorithm MF
Input: AMS and clustered time projection sequence CTP_AMS
Output: a list of movement functions F(t) = ⟨E_1(t), U_1(t), E_2(t), ..., E_k(t), U_k(t)⟩
1: begin
2:   F(t) = ⟨⟩;
3:   for i = 1 to k − 1 do
4:   begin
5:     perform regression on CL_i to generate E_i(t);
6:     perform regression on CL_{i+1} to generate E_{i+1}(t);
7:     t_1 = the last time slot in CL_i;
8:     t_2 = the first time slot in CL_{i+1};
9:     use inner interpolation to generate U_i(t) = (x̂_i(t), ŷ_i(t), (t_1, t_2));
10:    insert E_i(t), U_i(t) and E_{i+1}(t) in F(t);
11:  end
12:  if 1 ∉ CL_1 then
13:    generate U_0(t) and insert U_0(t) into the head of F(t);
14:  if ε ∉ CL_k then
15:    generate U_k(t) and insert U_k(t) into the tail of F(t);
16: end
3.4.2. Deriving linkage movement functions
Given the AMS and $CTP_{AMS}$ = ⟨$CL_1, CL_2, \ldots, CL_k$⟩, algorithm MF generates the whole movement function, denoted as F(t). F(t) is represented as ⟨$U_0(t)$, $E_1(t)$, $U_1(t)$, $E_2(t)$, ..., $E_k(t)$, $U_k(t)$⟩, where $E_i(t)$ is the confidence movement function in cluster $CL_i$ of $CTP_{AMS}$ and $U_i(t)$ is the linkage movement function from $E_i(t)$ to $E_{i+1}(t)$. Algorithm MF (lines 5 to 6) shows that for each cluster of $CTP_{AMS}$, the corresponding confidence movement function is derived using the regression method above. However, the first time slot may not be included in $CL_1$. If $t_0$ is the first time slot of $CL_1$ and $t_0 \neq 1$, then $U_0(t) = \{E_1(t_0), [1, t_0)\}$ is generated for the boundary condition. Otherwise, $U_0(t)$ is not valid in F(t). The same is true for $U_k(t)$. The linkage movement function is calculated by interpolation (line 9 of algorithm MF).
For example, assume that $CTP_{AMS}$ = ⟨1,2,4,5⟩, ⟨7,9,10⟩, $E_1(t) = (2.333 - 2.133t + 0.867t^2 - 0.066t^3,\ 2.529 - 2.386t + 1.021t^2 - 0.105t^3,\ [1, 5])$ and $E_2(t) = (6 + 1.17t - 0.16t^2,\ 3 + 0t + 0t^2,\ [7, 10])$. It can be verified that the first time slot of cluster ⟨1,2,4,5⟩ is 1. The last time slot of ⟨1,2,4,5⟩ is 5 and the first time slot of cluster ⟨7,9,10⟩ is 7. Thus, a linkage movement function should be generated by inner interpolation. From $E_1(t)$, at the 5th time slot, we have the data point (x = 5.09, y = 3). At the 7th time slot, the data point (x = 6.35, y = 3) is generated by applying $E_2(7)$. By inner interpolation, we obtain
\[
U_1(t) = \left(1.94 + \frac{6.35 - 5.09}{7 - 5}\,t,\ \ 3 + \frac{3 - 3}{7 - 5}\,t,\ \ (5, 7)\right).
\]
Similarly, $U_2(t)$ can be determined. After obtaining the confidence and linkage functions, F(t) = ⟨$E_1(t)$, $U_1(t)$, $E_2(t)$, $U_2(t)$⟩ can be derived. Fig. 4 shows a snapshot of F(t). When using F(t) to predict the location of mobile users, we only use the confidence movement function whose time interval includes the given time t. For F(t) = ⟨$E_1(t)$, $U_1(t)$, $E_2(t)$, $U_2(t)$⟩, when the time is 4, only $E_1(t)$ is used to predict the location since the given time 4 is within the time interval of $E_1(t)$.
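The construction of the linkage function and the interval-based lookup in F(t) can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `make_linkage` and `movement_function` are hypothetical, the linkage is assumed to be the straight line through the two end points (as in the example), and $U_2(t)$ is omitted because its end points are not given in the text.

```python
def e1(t):
    """Confidence movement function E1(t) from the example, valid on [1, 5]."""
    x = 2.333 - 2.133 * t + 0.867 * t**2 - 0.066 * t**3
    y = 2.529 - 2.386 * t + 1.021 * t**2 - 0.105 * t**3
    return x, y


def e2(t):
    """Confidence movement function E2(t) from the example, valid on [7, 10]."""
    return 6 + 1.17 * t - 0.16 * t**2, 3.0


def make_linkage(left, t1, right, t2):
    """Build the linear linkage between left(t1) and right(t2) on (t1, t2)."""
    x1, y1 = left(t1)
    x2, y2 = right(t2)

    def linkage(t):
        ratio = (t - t1) / (t2 - t1)
        return x1 + ratio * (x2 - x1), y1 + ratio * (y2 - y1)

    return linkage


u1 = make_linkage(e1, 5, e2, 7)


def movement_function(t):
    """Evaluate F(t) = <E1, U1, E2> using the piece whose interval contains t."""
    if 1 <= t <= 5:
        return e1(t)
    if 5 < t < 7:
        return u1(t)
    if 7 <= t <= 10:
        return e2(t)
    raise ValueError("t is outside the intervals covered by this sketch")


print(movement_function(4), movement_function(6))
```

At t = 4 the lookup falls inside [1, 5] and therefore uses $E_1(t)$, while at t = 6 it falls in the gap (5, 7) and uses the interpolated $U_1(t)$.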
[Fig. 3 plot: weighted data points (t, x, y) = (1,1,1) W=16, (1,1,2) W=1, (2,1,1) W=1, (4,4,2) W=2, (4,3,3) W=3 and (5,5,3) W=2; axes: Time, X coordinate, Y coordinate.]
Fig. 3. An illustrative example of deriving confident movement functions.

Table 5
Data points with their corresponding weights.
t_i   ID   x_i   y_i   w_i
1     A    1     1     16
1     B    1     2     1
2     A    1     1     1
4     D    4     2     2
4     F    3     3     3
5     H    5     3     2
7     K    6     3     4
9     F    3     3     10
10    E    4     3     1

[Fig. 4 plot: E1(t), E2(t), U1(t) and U2(t); axes: Time, x coordinate, y coordinate.]
3.4.3. Time complexity analysis
Algorithm MF is of polynomial time complexity. When the maximal row/column size of the matrix is n, the time complexity of solving the normal equation by Strassen's algorithm is $\Theta(n^{\lg 7})$ [29]. Moreover, interpolation by Lagrange's formula requires $\Theta(m^2)$, where m represents the number of points involved in the interpolation [29]. Since n is usually larger than m, $\Theta(n^{\lg 7})$ dominates the complexity of algorithm MF.
3.5. Estimating a user's location based on a movement function
For many applications, it is necessary to estimate a user's location in the symbolic model. In this case, F(t) represents the movement behavior of mobile users. Thus, once the movement functions F(t) have been obtained, the location of a mobile user at time t can be predicted as $(x_t, y_t)$, the coordinates obtained by applying the movement function at time t. The estimated coordinate $(x_t, y_t)$ can then be transformed into the symbol whose area contains $(x_t, y_t)$. In our example, since each base station is aware of its location and coverage area, it is easy to transform the geometric location $(x_t, y_t)$ into the identification of the base station in the symbolic model.
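The transformation from a predicted coordinate back to a base-station symbol could look like the sketch below. It rests on an assumption that is not part of the paper: in place of the real coverage areas, a coordinate is mapped to the base station whose coordinates are nearest. The function name `to_symbol` is hypothetical, and the station coordinates and the polynomials are those of the running example.

```python
import math

# Base-station coordinates from the running example; a real system would use
# the operator's map table and coverage areas instead of this dictionary.
BASE_STATIONS = {"A": (1, 1), "B": (1, 2), "D": (4, 2), "F": (3, 3), "H": (5, 3)}


def to_symbol(x, y, stations=BASE_STATIONS):
    """Return the ID of the station closest to (x, y) (nearest-centre assumption)."""
    return min(stations, key=lambda sid: math.dist((x, y), stations[sid]))


def x_hat(t):
    """x-component of E1(t) from the example."""
    return 2.333 - 2.133 * t + 0.867 * t**2 - 0.066 * t**3


def y_hat(t):
    """y-component of E1(t) from the example."""
    return 2.529 - 2.386 * t + 1.021 * t**2 - 0.105 * t**3


for t in (1, 5):
    print(t, to_symbol(x_hat(t), y_hat(t)))
```

For the example, the predicted coordinate at t = 1 is closest to station A and the one at t = 5 is closest to station H, which agrees with the dominant stations of those time slots in AMS.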
4. Performance evaluation
This section evaluates the effectiveness and efficiency of mining user movement patterns from call detail records. Section 4.1 presents the models for user behaviors, including movement behavior and calling behavior. Section 4.1 also describes both the synthetic dataset and the real dataset. Section 4.2 presents experimental results. Finally, the RUMP sensitivity analysis is given in Section 4.3.
4.1. Modeling user behaviors
User behaviors in a mobile computing environment include movement behaviors and calling behaviors. This section first describes the synthetic dataset used in this study, in which user movement behaviors are derived according to pre-defined parameters. To simulate a mobile computing environment, we use a 16 × 16 mesh network, in which each node represents a base station. Thus, the simulation model contains 256 base stations [4,30]. Moreover, our simulation considers 10,000 users. As in [31], this simulation considers three movement trajectories. For each user, we randomly select one movement trajectory as his/her own movement pattern, which the user mostly follows. However, users may make some movements that do not follow their movement patterns. These movements are viewed as biased movements. To prevent users from diverging too far from their movement patterns due to biased movements, we borrowed the concept in [17] that allows users to move back to their movement patterns. The number of movements made by a mobile user in one time slot is modeled as a uniform distribution between mf − 2 and mf + 2. The larger the value of mf is, the more frequently mobile users move. We used the design above to generate user movements.
However, for a real dataset, it is difficult to obtain real call detail records from mobile service providers due to customer privacy issues. Moreover, the RUMP approach requires the location information of base stations, which is business-related information for mobile service providers. Thus, for real datasets, we use real movement logs from a GPS-based testbed, CarWeb [32], and generate simulated CDRs along the real movements. In the CarWeb platform, users can obtain their locations from a GPS device every five seconds and upload their locations to CarWeb servers. Fig. 5 shows one frequent movement behavior, where every red flag represents a user-uploaded location. By collecting user movement behaviors for four months, we obtain roughly 200 movement trajectories for each user. In the CarWeb dataset, the ground truth is known, which is useful for validating our mining results.⁴ In the CarWeb dataset, a user has frequent and infrequent movements. To simulate the coverage area of a base station, we divided the whole space into grids and viewed each grid as the coverage area of one base station. Fig. 5 shows the grids in the CarWeb dataset, where the frequent movement behaviors of this user occurred within or around 16 grids. Furthermore, since the traveling times of movement sequences in the CarWeb dataset are not exactly the same, the traveling time of each trajectory is normalized to 24 hours. In both datasets, the time slot is set to 2 hours and the value of ε is 12.
⁴ Due to customer privacy issues, it is impossible to get the ground truth of user movement behaviors even if mobile service providers were to release call detail records.

Once user movements have been determined, calling behaviors can be modeled for each user's movements. According to [30], calling behavior can be modeled as a Poisson distribution. Moreover, a Zeta distribution is used to model burst calling behavior in this article. In a Poisson distribution, the probability that a user has x calls in a time slot is $P(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$, where x is the number of calls and λ is the expected number of calls in a time slot. For the Zeta distribution, three time slots are grouped and each time slot is divided into three subsections, producing a total of 9 subsections in each group. For each user, the probability of having phone calls in the x-th subsection of a group is $Z(x) = \frac{x^{-\lambda}}{\sum_{n=1}^{\infty} \frac{1}{n^{\lambda}}}$, where x indicates the subsection order in a group (i.e., the x-th subsection in a group) and λ is the value of the exponent of the Zeta distribution. In the first subsections of a group, a user has more phone calls, and the number of phone calls decays rapidly (as $x^{-\lambda}$) in the remaining subsections of the group. The speed of decay is determined by the parameter λ; the larger this parameter is, the faster the decay. For brevity, CDR(ρ, λ) indicates that the calling behavior is modeled as a ρ distribution with parameter λ, where the value of ρ is set to P (respectively, Z) if a Poisson (respectively, Zeta) distribution is used. For example, CDR(P, 2) represents the calling behavior of a user under a Poisson distribution with λ = 2.
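The two calling-behavior models can be evaluated with a few lines of Python. This is a sketch for intuition only: the function names are hypothetical, and the infinite normalizing sum of the Zeta distribution is approximated by a truncated series (the `terms` parameter is an assumption of this sketch, not a parameter of the paper).

```python
import math


def poisson_pmf(x, lam):
    """P(x) = exp(-lam) * lam**x / x!, the probability of x calls in a time slot."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def zeta_pmf(x, lam, terms=100_000):
    """Z(x) = x**(-lam) / sum_{n>=1} n**(-lam), with the sum truncated at `terms`."""
    norm = sum(n ** -lam for n in range(1, terms + 1))
    return x ** -lam / norm


# CDR(P, 2): probability of 0..4 calls in one time slot.
print([round(poisson_pmf(x, 2), 3) for x in range(5)])

# CDR(Z, 2): call probability for the 9 subsections of a group.
print([round(zeta_pmf(x, 2), 3) for x in range(1, 10)])
```

The second list decreases monotonically, which reflects the burst behavior described above: most calls fall into the first subsections of each group.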
For comparison purposes, we also implemented the method of mining movement patterns in [4], denoted by UMP. To validate the quality of the movement patterns mined by UMP and RUMP, we use the movement patterns to predict the next movements of users; the accuracy of prediction indicates the quality of the mined movement patterns. The hop count (referred to as hn) represents the number of base stations between the predicted location and the actual location of the mobile user. Intuitively, the smaller the hop count, the closer the derived location is to the current location. The expected value of hop counts per call, E(hn/call), is defined as
\[
E(hn/call) = \frac{total\_hop\_counts}{number\_of\_calls},
\]
where total_hop_counts is the sum of hop counts over all calls and number_of_calls is the total number of calls per user. To evaluate the quality of the user movement patterns mined by UMP and RUMP, the precision ratio is defined as
\[
1 - \frac{E(hn/call)}{2n - 1},
\]
where the size of the network is n × n and E(hn/call) is the expected value of hop counts per call. The precision ratio thus relates the average hop count from the derived cell to the current cell of a mobile user to the network size. Table 6 summarizes the definitions of some primary simulation parameters. In this table, the default values are the optimal values found in the following experiments in our experimental environment. Each experimental result is the average of twenty experiments.
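As a small worked illustration of this metric, the sketch below computes E(hn/call) and the precision ratio from a list of per-call hop counts. It assumes that the precision ratio has the form 1 − E(hn/call)/(2n − 1) for an n × n network; the function names and the sample hop counts are hypothetical.

```python
def expected_hops_per_call(hop_counts):
    """E(hn/call): total hop count divided by the number of calls."""
    if not hop_counts:
        raise ValueError("at least one call is required")
    return sum(hop_counts) / len(hop_counts)


def precision_ratio(hop_counts, n):
    """1 - E(hn/call) / (2n - 1) for an n x n network (assumed form)."""
    return 1 - expected_hops_per_call(hop_counts) / (2 * n - 1)


# Four calls on the 16 x 16 mesh: the prediction was exact once,
# one hop away twice, and two hops away once.
print(precision_ratio([0, 1, 2, 1], n=16))
```

Here E(hn/call) = 1, so the precision ratio is 1 − 1/31; a perfect predictor, whose hop count is 0 for every call, gives a precision ratio of 1.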
4.2. Experiments of UMP and RUMP
We first evaluated UMP and RUMP in terms of the data amount, the precision ratio, and the execution time. The data amount is the number of records stored in a movement log and a CDR log. Fig. 6(a) shows that the data amount of UMP increases with the value of mf. This is because with a larger mf, a user tends to move more frequently, producing a greater amount of data in the movement log. In contrast, the data amount in RUMP remains almost constant. Fig. 6(b) shows that the precision ratio of RUMP is smaller than that of UMP. However, with CDR(P, 4), the precision ratio of RUMP is not far below that of UMP. Note, however, that though UMP performs better than RUMP in terms of the precision ratio, it also incurs a larger amount of data in the movement log. To investigate the precision ratio gained from the additional log data, this study defines data utilization as the ratio between the precision ratio and the amount of log data. Fig. 7 shows the data utilization of UMP and RUMP. With a higher mf, the data utilization of UMP drastically decreases. This is because the amount of data in the movement log increases dramatically as users move frequently. If the value of mf is smaller, the data utilization of RUMP with a Zeta distribution is larger than that of RUMP with a Poisson distribution. On the other hand, when the value of mf increases, the data utilization of RUMP with a Poisson distribution is larger than that of RUMP with a Zeta distribution. This is primarily because when mf is large, it is better to have more uniform calling behaviors so that the call detail records fully reflect user movement behaviors. These experimental results show that RUMP has a higher data utilization than UMP. By exploiting CDRs, RUMP is more cost-effective in mining user movement patterns.

Fig. 5. The frequent movement behavior in CarWeb dataset.

Table 6
The parameters used in experiments.
Notation   Definition                      Default value
w          Number of movement sequences    50
mf         Movement frequency              3
min_freq   Threshold used in algorithm LS  0.3
min_sim    Threshold used in algorithm LS  0.5

[Fig. 6 panels: (a) Data Amount, data amount versus mf; (b) Precision Ratio, precision ratio versus mf. Curves: UMP, RUMP with CDR(P,2), RUMP with CDR(P,4), RUMP with CDR(Z,2), RUMP with CDR(Z,4).]
Fig. 6. Performance comparisons of UMP and RUMP on the synthetic dataset.

[Fig. 7 plot: data utilization versus mf. Curves: UMP, RUMP with CDR(P,2), RUMP with CDR(P,4), RUMP with CDR(Z,2), RUMP with CDR(Z,4).]
Fig. 7. Data utilization of UMP and RUMP on the synthetic dataset.

[Fig. 8 panels: (a) Data Amount; (b) Precision Ratio. Bars: UMP, RUMP/CDR(P,2), RUMP/CDR(P,4), RUMP/CDR(Z,2), RUMP/CDR(Z,4).]
Fig. 8(a) shows the data amount of UMP and RUMP with various calling behaviors under the CarWeb dataset; the data amount of RUMP is much smaller than that of UMP. Furthermore, Fig. 8(b) shows the precision ratios of UMP and RUMP, indicating that the difference between UMP and RUMP is not large. This suggests that RUMP is able to achieve acceptable precision ratios while using a smaller amount of data. Although UMP performs better than RUMP in terms of the precision ratio, it incurs a larger amount of data in the movement log. In Fig. 9, the data utilization of UMP is much smaller than that of RUMP, showing that with a smaller amount of log data, RUMP can still achieve an acceptable precision ratio.
Fig. 10(a) shows the execution time of UMP and RUMP under the synthetic dataset. The RUMP execution time is smaller than that of UMP in both the synthetic dataset and the CarWeb dataset. With a larger number of movement sequences, the UMP execution time significantly increases. With a higher mf, the execution time of UMP becomes much longer than that of RUMP. Further, RUMP has better scalability than UMP. In addition, Fig. 10(b) shows the execution time of UMP and RUMP on the CarWeb dataset. Similar to the results on the synthetic dataset, the RUMP execution time is much smaller than that of UMP. As the number of movement sequences increases, UMP takes longer to discover user movement patterns. On the other hand, the RUMP performance is determined by the data amount generated by calling. Since the data amount generated by calling is usually smaller than that generated by movements, RUMP incurs a smaller execution time.
[Fig. 9 plot: data utilization of UMP, RUMP/CDR(P,2), RUMP/CDR(P,4), RUMP/CDR(Z,2) and RUMP/CDR(Z,4).]
Fig. 9. Data utilization of UMP and RUMP on the CarWeb dataset.

[Fig. 10 panels: (a) Synthetic Dataset, execution time versus number of moving sequences, curves: UMP with mf=5, UMP with mf=3, and RUMP with CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4), all with mf=3; (b) CarWeb Dataset, execution time versus number of moving sequences, curves: UMP and RUMP with CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4).]
4.3. Sensitivity analysis of RUMP
This section further investigates the parameters used in RUMP. First, the impact of w is presented. Then, we examine the impact of thresholds on the mining results.
4.3.1. Impact of w
Fig. 11 shows the experiments with varying values of w for RUMP under both the synthetic dataset and the CarWeb dataset. This figure indicates that the RUMP precision ratio increases as the value of w increases in both datasets. This is because the number of movement sequences considered in RUMP increases with w, so RUMP can use more calls to discover user movement patterns. The RUMP precision ratio with a Poisson distribution is larger than that of RUMP with a Zeta distribution. This is because calls under a Poisson distribution are spread much more evenly across user movements. Thus, RUMP is able to fully capture user movement behaviors when the calling behavior follows a Poisson distribution. Under a Poisson distribution, a larger value of λ yields a larger RUMP precision ratio: for a larger value of λ, the amount of call detail records tends to increase, thereby reflecting the movement behaviors of users more completely. For users with a larger number of calls and non-burst calling behavior, the value of w can be set smaller to obtain movement patterns quickly. In contrast, for users with a smaller number of calls or burst calling behavior, the value of w should be set larger to improve the precision ratio of the movement patterns mined by RUMP.
4.3.2. Impact of thresholds in algorithm LS
This section examines the impact of min_freq and min_sim on the RUMP performance. Algorithm LS uses the min_freq and min_sim thresholds to extract CDRs representing frequent movement behaviors. Figs. 12 and 13 show RUMP experiments with varying values of min_freq and min_sim. Fig. 12(a) shows the result of using RUMP on the synthetic dataset. This figure indicates that the RUMP precision ratio tends to increase when the value of min_freq increases from 0.1 to 0.3, and that it decreases when min_freq exceeds 0.3. This is because increasing min_freq filters out areas through which users do not frequently move. However, a larger min_freq is too strict for identifying which areas are frequent, and thus decreases the precision ratio. Fig. 12(b) shows the same phenomenon for the CarWeb dataset. The value of min_freq should be determined empirically. For example, in this experiment, we set min_freq to 0.3.

[Fig. 12 panels: (a) Synthetic Dataset; (b) CarWeb Dataset. Precision ratio versus min_freq. Curves: CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4).]
Fig. 12. Precision ratio of RUMP with min_freq varied.

[Fig. 11 panels: (a) Synthetic Dataset; (b) CarWeb Dataset. Precision ratio versus w. Curves: RUMP with CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4).]
Fig. 13 shows the RUMP precision ratio with various values of min_sim. In both datasets, the RUMP precision ratio tends to increase when min_sim increases from 0.1 to 0.5. However, when the value of min_sim exceeds 0.5, the RUMP precision ratio decreases. The min_sim threshold is used to identify whether or not a movement sequence is similar to the frequent movement behavior. With a larger value of min_sim, only a few movement sequences are identified as being similar to frequent user movement behaviors. This, in turn, decreases the RUMP precision ratio. Therefore, the value of min_sim should be carefully set. Experimental results show that min_freq should be set to 0.3 and min_sim should be set to 0.5 to achieve the best precision ratio.
4.3.3. Impact of thresholds in algorithm TC
As described above, the value of min_var for algorithm TC affects the accuracy of the RUMP time clustering results. We conducted experiments to examine the impact of min_var. For the synthetic dataset, Fig. 14(a) shows the precision ratio of RUMP with the value of the threshold min_var varied. This figure indicates that the RUMP precision ratio significantly increases when min_var is 0.25. However, the precision ratio of RUMP decreases when min_var exceeds 0.75. This is because excessively large values of min_var result in most of the call detail records being grouped into the same cluster. Hence, the number of movement functions is not enough to capture user movement behaviors. Furthermore, with a larger mf, the RUMP precision ratio is smaller and decreases significantly when min_var is larger. For the CarWeb dataset, Fig. 14(b) shows similar experimental results. These results indicate that min_var should be set to a smaller value for users who move frequently. The value of min_var, which can be determined empirically, should not be set too large. For example, in Fig. 14(a), min_var should be set to 0.75 because the RUMP precision ratio is the highest there.

[Fig. 14 panels: (a) Synthetic Dataset, precision ratio versus min_var, curves: mf=1, mf=3, mf=5; (b) CarWeb Dataset, precision ratio versus min_var.]
Fig. 14. Precision ratio with min_var varied.

[Fig. 13 panels: (a) Synthetic Dataset; (b) CarWeb Dataset. Precision ratio versus min_sim. Curves: CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4).]
Fig. 15 depicts the RUMP precision ratio with various calling behaviors. In Fig. 15(a), the results of CDR(P,2) and CDR(P,4) are similar to the results above. However, it is interesting to note that the precision ratios of CDR(Z,2) and CDR(Z,4) do not decrease when the value of min_var exceeds 0.75. Since burst calls happen at the beginning of every three time slots, most of the call detail records in these three time slots can be grouped into one cluster. Fig. 15(b) shows similar results for the CarWeb dataset. Thus, we can set min_var to 0.75 to obtain the highest RUMP precision ratio.
5. Conclusions
User movement patterns benefit many mobile design schemes and applications, including designing a paging area, developing data allocation schemes, conducting querying strategies, and offering navigation services. This article proposes a regression-based approach called RUMP for mining user movement patterns from call detail records. To fully exploit the fragmented spatio-temporal information hidden in such trajectories, the proposed regression-based solution discovers user movement patterns. The RUMP approach uses three algorithms. First, algorithm LS extracts CDRs that reflect the frequent movement behaviors of mobile users. By capturing similar movement sequences from call detail records, an aggregation movement sequence is computed to represent the frequent movement behaviors of mobile users in each time slot. The feature of spatio-temporal locality states that if the time interval between consecutive calls is small, the mobile user is likely to have moved nearby. By exploring this feature, algorithm TC is able to determine the number of regression functions properly by clustering those movement records in an aggregation movement sequence whose times of occurrence are very close. For each cluster of the aggregation movement sequence, algorithm MF generates the movement functions representing the movement patterns of mobile users. This article evaluates the performance of the proposed algorithms and conducts sensitivity analysis on several design parameters. Experimental results indicate that RUMP can efficiently and effectively derive user movement patterns that capture the frequent movement behaviors of mobile users.
Appendix A. Proof of minimizing residual error sum
Following the same notation as in Section 3.4, the residual sum of squares (i.e., $\epsilon_x = \sum_{i=1}^{n} w_i e_i^2$) can be expressed as $\epsilon_x = e^T W e$ in linear-algebra form. All diagonal elements of W are positive and e can be formulated as $\tilde{b}_x - Ha^*$. Thus, we have:
\[
\begin{aligned}
\epsilon_x &= e^T W e = (\tilde{b}_x - Ha^*)^T W (\tilde{b}_x - Ha^*) \\
&= (\tilde{b}_x - Ha^*)^T \sqrt{W}\sqrt{W}\,(\tilde{b}_x - Ha^*) \\
&= (\sqrt{W}\tilde{b}_x - \sqrt{W}Ha^*)^T (\sqrt{W}\tilde{b}_x - \sqrt{W}Ha^*) \\
&= \left\| \sqrt{W}\tilde{b}_x - \sqrt{W}Ha^* \right\|^2
\end{aligned}
\]
[Fig. 15 panels: (a) Synthetic Dataset; a second panel whose title is not recoverable. Precision ratio versus min_var. Curves: RUMP with CDR(P,2), CDR(P,4), CDR(Z,2), CDR(Z,4).]