Exploring Regression for Mining User Moving Patterns in a Mobile Computing System


Chih-Chieh Hung, Wen-Chih Peng, and Jiun-Long Huang

Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, ROC

{hungcc, wcpeng, jlhuang}@csie.nctu.edu.tw

Abstract. In this paper, by exploiting the log of call detail records, we present a solution procedure for mining user moving patterns in a mobile computing system. Specifically, we propose algorithm LS to accurately determine similar moving sequences from the log of call detail records so as to obtain the moving behaviors of users. By exploring the feature of spatial-temporal locality, we develop algorithm TC to group call detail records into clusters. In light of the concept of regression, we devise algorithm MF to derive moving functions of moving behaviors. The performance of the proposed solution procedure is analyzed, and a sensitivity analysis on several design parameters is conducted. Our simulation results show that the user moving patterns obtained by our solution procedure are of very high quality and in fact very close to real user moving behaviors.

1 Introduction

User moving patterns refer to the areas where users frequently travel in a mobile computing environment. User moving patterns are particularly important and provide many benefits in mobile applications. A significant amount of research effort has been devoted to utilizing user moving patterns in developing location tracking schemes and data allocation methods [3][5][6]. Clearly, developing algorithms to mine user moving patterns has been recognized as an important issue for improving the performance of mobile computing systems.

The study in [5] explored the problem of mining user moving patterns given the moving log of mobile users. Specifically, in order to capture user moving patterns, a moving log recording each movement of mobile users is needed. In practice, generating the moving log of all mobile users unavoidably increases the storage cost and degrades the performance of mobile computing systems. Consequently, in this paper, we address the problem of mining user moving patterns from the existing log of call detail records (referred to as CDRs) of mobile computing systems. Generally, a mobile computing system generates one call detail record when a mobile user makes or receives a phone call. Table 1 shows an example of selected real call detail records, where Uid identifies the individual user who makes or receives a phone call and Cellid indicates the base station that serves that mobile user. Thus, a mobile computing system produces a large amount of call detail records daily, and these records contain valuable hidden information about the moving behaviors of mobile users.

The corresponding author of this paper.

L.T. Yang et al. (Eds.): HPCC 2005, LNCS 3726, pp. 878–887, 2005.

© Springer-Verlag Berlin Heidelberg 2005

Table 1. An example of selected call detail records

Uid  Date        Time      Cellid
1    01/03/2004  03:30:21  A
1    01/03/2004  09:12:02  D
1    01/03/2004  20:30:21  G
1    01/03/2004  21:50:31  I

Unlike the moving log, which keeps track of entire moving paths, the log of call detail records only reflects the fragmented moving behaviors of mobile users. However, such fragmented moving behaviors are of little interest in a mobile computing environment, where one would naturally like to know the complete moving behaviors of users. Thus, in this paper, we devise a solution procedure to mine user moving patterns from the fragmented moving behaviors hidden in the log of call detail records. The problem we study is best understood through the illustrative example in Fig. 1, where the log of call detail records is given in Table 1. The dotted line in Fig. 1 represents the real moving path of the mobile user, and the cells marked with a mobile phone symbol are the areas where the mobile user made or received phone calls. Explicitly, four call detail records are generated in the log of CDRs while the mobile user travels. Given these fragmented moving behaviors, we explore the technique of regression analysis to generate user moving patterns (i.e., the solid line in Fig. 1). If the user moving patterns derived are close to the real moving paths, one can utilize them to predict the real moving behaviors of mobile users.

Fig. 1. A moving path and an approximate user moving pattern of a mobile user

In this paper, we propose a solution procedure to mine user moving patterns from call detail records. Specifically, we first determine similar moving sequences from the log of call detail records, and these similar moving sequences are then merged into one moving sequence (referred to as the aggregate moving sequence). To fully explore the feature of periodicity and utilize the limited amount of call detail records, algorithm LS (standing for Large Sequence) accurately extracts similar moving sequences, in the sense that the similar moving sequences are determined by two adjustable threshold values when deriving the aggregate moving sequence. Spatial-temporal locality refers to the feature that if the time interval between two consecutive calls of a mobile user is small, the mobile user is likely to have moved nearby. By exploring this feature, algorithm TC (standing for Time Clustering) groups such call detail records into a cluster. For each cluster of call detail records, algorithm MF (standing for Moving Function), a regression-based method, is employed to derive moving functions of users so as to generate approximate user moving patterns. The performance of the proposed solution procedure is analyzed, and a sensitivity analysis on several design parameters is conducted. Our simulation results show that the approximate user moving patterns obtained by our proposed algorithms are of very high quality and in fact very close to the real moving behaviors of users.

The rest of the paper is organized as follows. Some preliminaries and definitions are presented in Section 2. Algorithms for mining moving patterns are devised in Section 3. Performance results are presented in Section 4. This paper concludes with Section 5.

2 Preliminary

In this paper, we assume that the moving behavior of mobile users is periodic and that consecutive movements of a mobile user are not too far apart. Therefore, if the time interval between two consecutive CDRs is not too large, the mobile user is likely to have moved only a short distance.

To facilitate the presentation of this paper, a moving section is defined as a basic time unit. A moving record is a data structure that accumulates the counts of base station identifications (henceforth referred to as items) appearing in call detail records whose occurrence times fall within the same moving section. Given a log of call detail records, we first convert these CDR data into multiple moving sequences, where a moving sequence is an ordered list of moving records and the length of a moving sequence is ε. The value of ε depends on the periodicity of mobile users and can be obtained by the method proposed in [1]. As a result, moving sequence i is denoted by $MS_i = \langle MR_i^1, MR_i^2, MR_i^3, \ldots, MR_i^{\varepsilon} \rangle$, where $MR_i^j$ is the jth moving record of moving sequence i. We consider four hours as one basic unit of a moving section, and the value of ε is six. Given the log data in Table 1, we have the moving sequence $MS_1 = \langle \{A:1\}, \{\}, \{D:1\}, \{\}, \{\}, \{G:1, I:1\} \rangle$.

The time projection sequence of moving sequence $MS_i$ is denoted as $TP_{MS_i}$ and is formulated as $TP_{MS_i} = \langle \alpha_1, \ldots, \alpha_n \rangle$, where $MR_i^{\alpha_j} \neq \{\}$ and $\alpha_1 < \cdots < \alpha_n$. Explicitly, $TP_{MS_i}$ is the sequence of identifications of those moving sections whose moving records are not empty. Given $MS_1 = \langle \{A:1\}, \{\}, \{D:1\}, \{\}, \{\}, \{G:1, I:1\} \rangle$, one can verify that $TP_{MS_1} = \langle 1, 3, 6 \rangle$. By utilizing the technique of sequential clustering, a time projection sequence $TP_{MS_i}$ is divided into several groups in which the time intervals among moving sections are close. For brevity, we define the clustered time projection sequence of $TP_{MS_i}$, denoted by $CTP(TP_{MS_i})$, which is represented as $\langle CL_1, CL_2, \ldots, CL_x \rangle$, where $CL_i$ is the ith group and $1 \le i \le x$. Note that the value of x is determined by our proposed method.
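The conversion from call detail records to a moving sequence and its time projection sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_moving_sequence` and `time_projection` are hypothetical, the records are assumed to belong to one user and one period, and the period is assumed to be one day split into ε = 6 sections of four hours, as in the example above.

```python
from collections import Counter
from datetime import datetime


def to_moving_sequence(cdrs, section_hours=4, epsilon=6):
    """Turn (time, cell id) pairs of one user and one day into a moving sequence.

    Assumes section_hours * epsilon == 24, so every hour maps to a section.
    """
    sequence = [Counter() for _ in range(epsilon)]
    for time_str, cell_id in cdrs:
        hour = datetime.strptime(time_str, "%H:%M:%S").hour
        # Section index 0 covers 00:00-03:59, index 1 covers 04:00-07:59, ...
        sequence[hour // section_hours][cell_id] += 1
    return sequence


def time_projection(sequence):
    """Return the 1-based identifications of the non-empty moving records."""
    return [j for j, record in enumerate(sequence, start=1) if record]


# The four records of Table 1 (Uid 1, 01/03/2004).
cdrs = [("03:30:21", "A"), ("09:12:02", "D"),
        ("20:30:21", "G"), ("21:50:31", "I")]
ms1 = to_moving_sequence(cdrs)
tp_ms1 = time_projection(ms1)
print([dict(record) for record in ms1])
print(tp_ms1)
```

Run on the records of Table 1, the sketch yields the moving sequence and the time projection sequence given in the text.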

3 Mining User Moving Patterns

The overall procedure for mining moving patterns comprises three phases: the data collection phase, the time clustering phase and the regression phase. The details of the algorithms in each phase are described in the following subsections.


3.1 Data Collection Phase

As mentioned earlier, in this phase we identify similar moving sequences from a set of w moving sequences and then merge these similar moving sequences into one aggregate moving sequence (referred to as AMS). Algorithm LS is applied to the moving sequences of each mobile user to determine the aggregate moving sequence, which comprises a sequence of large moving records denoted as $LMR^j$, where $1 \le j \le \varepsilon$. Specifically, large moving record $LMR^j$ is a set of items with their corresponding counts, where an item is included if a sufficient number of the moving records $MR_i^j$ of the moving sequences contain it. This threshold number is called vertical_min_sup in this paper. Once the aggregate moving sequence is generated from the w most recent moving sequences, we compare it with these w moving sequences so as to further accumulate the occurrence counts of the items appearing in each large moving record. The threshold used to identify the similarity between a moving sequence and the aggregate moving sequence is named match_min_sup. The algorithmic form is given below.

Algorithm LS

input: w moving sequences with their lengths being ε, two thresholds: vertical_min_sup and match_min_sup

output: Aggregate moving sequence AMS

1  begin
2    for j = 1 to ε
3      for i = 1 to w
4        LMR^j = large 1-itemset of MR_i^j; (by vertical_min_sup)
5    for i = 1 to w
6    begin
7      match = 0;
8      for j = 1 to ε
9      begin
10       C(MR_i^j, LMR^j) = |{x ∈ MR_i^j ∩ LMR^j}| / |{y ∈ MR_i^j ∪ LMR^j}|;
11       match = match + |MR_i^j| * C(MR_i^j, LMR^j);
12     end
13     if match ≥ match_min_sup then
14       accumulate the occurring counts of items in the aggregate moving sequence;
15   end
16 end

In algorithm LS (from line 2 to line 4), we first calculate the appearing counts of items in each moving section of the w moving sequences. If the count of an item among the w moving sequences is larger than the value of vertical_min_sup, this item is placed into the corresponding large moving record. After all large moving records are obtained, AMS is generated and is represented as $\langle LMR^1, LMR^2, \ldots, LMR^{\varepsilon} \rangle$, where the length of the aggregate moving sequence is ε. As mentioned before, large moving records contain frequent items with their corresponding counts.

Once the aggregate moving sequence is obtained, algorithm LS (from line 5 to line 12) compares it with the w moving sequences in order to identify the similar moving sequences and then calculate the count of each item in each large moving record. Note that a moving sequence (respectively, AMS) consists of a sequence of moving records (respectively, large moving records). Thus, in order to quantify how similar a moving sequence (e.g., $MS_i$) is to AMS, we first measure the closeness between moving record $MR_i^j$ and $LMR^j$, denoted by $C(MR_i^j, LMR^j)$, which is formulated as

$$C(MR_i^j, LMR^j) = \frac{|\{x \in MR_i^j \cap LMR^j\}|}{|\{y \in MR_i^j \cup LMR^j\}|}$$

and returns a normalized value in [0, 1]. The larger the value of $C(MR_i^j, LMR^j)$ is, the more closely $MR_i^j$ resembles $LMR^j$. Accordingly, the similarity measure of moving sequence $MS_i$ and AMS is formulated as

$$sim(MS_i, AMS) = \sum_{j=1}^{\varepsilon} |MR_i^j| \cdot C(MR_i^j, LMR^j).$$

Given a threshold value match_min_sup, for each moving sequence $MS_i$, if $sim(MS_i, AMS) \ge$ match_min_sup, moving sequence $MS_i$ is identified as a similar moving sequence containing sufficient moving behaviors of mobile users. In algorithm LS (from line 13 to line 14), for each item in the large moving records, the occurring count is accumulated from the corresponding moving records of those similar moving sequences.
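The two passes of algorithm LS can be illustrated with the following Python sketch. It is not the authors' implementation, and several choices are assumptions made for the example: moving records are plain dictionaries from item to count, $|MR_i^j|$ is taken to be the number of distinct items in the record, both thresholds are absolute numbers compared with `>=`, and the closeness of two empty records is defined as 0. The function names are hypothetical.

```python
from collections import Counter


def closeness(mr, lmr_items):
    """C(MR, LMR): shared items divided by all items of the two records."""
    union = set(mr) | set(lmr_items)
    if not union:  # assumption: two empty records have closeness 0
        return 0.0
    return len(set(mr) & set(lmr_items)) / len(union)


def algorithm_ls(sequences, vertical_min_sup, match_min_sup):
    """Build an aggregate moving sequence from equally long moving sequences."""
    epsilon = len(sequences[0])

    # Lines 2-4: an item is large in section j if enough sequences contain it.
    large = []
    for j in range(epsilon):
        support = Counter()
        for seq in sequences:
            support.update(set(seq[j]))
        large.append({item for item, n in support.items()
                      if n >= vertical_min_sup})

    # Lines 5-14: keep only similar sequences and accumulate their counts.
    ams = [Counter() for _ in range(epsilon)]
    for seq in sequences:
        match = sum(len(seq[j]) * closeness(seq[j], large[j])
                    for j in range(epsilon))
        if match >= match_min_sup:
            for j in range(epsilon):
                for item, count in seq[j].items():
                    if item in large[j]:
                        ams[j][item] += count
    return ams


sequences = [
    [{"A": 1}, {}, {"D": 1}],
    [{"A": 2}, {}, {"D": 1}],
    [{"B": 1}, {}, {}],
]
ams = algorithm_ls(sequences, vertical_min_sup=2, match_min_sup=2)
print([dict(record) for record in ams])
```

In this toy input the third sequence shares no item with the large moving records, so its similarity is 0 and it contributes nothing to the aggregate moving sequence.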

3.2 Time Clustering Phase

In this phase, two threshold values (i.e., δ and σ²) are given for clustering a time projection sequence. Explicitly, the value of δ is used to determine the density of clusters, and σ² is used to ensure that the spread of the time within a cluster is bounded by σ². Algorithm TC is able to dynamically determine the number of groups in a time projection sequence.

Algorithm TC

input: Time projection sequence TP_AMS, thresholds δ and σ²

output: Clustered time projection sequence CTP(TP_AMS)

1  begin
2    group the numbers whose differences are within δ;
3    mark all clusters;
4    while there exist marked clusters and δ ≠ 1
5      for each marked cluster CL_i
6        if Sd(CL_i) ≤ σ²
7          unmark CL_i;
8      δ = δ − 1;
9      for all marked clusters CL_i
10       group the numbers whose differences are within δ in CL_i;
11   end while
12   if there exist marked clusters
13     for each marked cluster CL_i
14       k = 1;
15       repeat
16         k++;
17         divide evenly CL_i into k groups;
18       until the spread degree of each group ≤ σ²;
19 end

Algorithm TC (from line 2 to line 3) first clusters $TP_{AMS}$ coarsely into several marked clusters by grouping successive numbers whose difference is within the threshold value δ. As pointed out before, $CL_i$ denotes the ith marked cluster. In order to guarantee the quality of clusters, a spread degree of $CL_i$, denoted as $Sd(CL_i)$, is defined to measure the distribution of numbers in cluster $CL_i$. Specifically, $Sd(CL_i)$ is modelled by the variance of a sequence of numbers. Hence, $Sd(CL_i)$ is formulated as

$$Sd(CL_i) = \frac{1}{m} \sum_{k=1}^{m} \Bigl( n_k - \frac{1}{m} \sum_{j=1}^{m} n_j \Bigr)^2,$$

where $n_k$ is the kth number in $CL_i$ and m is the number of elements in $CL_i$. As can be seen from line 5 to line 7 in algorithm TC, for each cluster $CL_i$, if $Sd(CL_i)$ is no larger than σ², we unmark the cluster $CL_i$. Otherwise, we decrease δ by 1 and, with the new value of δ, algorithm TC (from line 8 to line 10) re-clusters the numbers in the clusters that are still marked. Algorithm TC partitions the numbers of $TP_{AMS}$ iteratively with the objective of satisfying the two threshold values, i.e., δ and σ², until there is no marked cluster or δ = 1. If there are no marked clusters, $CTP(TP_{AMS})$ is thus generated. Note, however, that if there are still marked clusters whose spread degree values are larger than σ², algorithm TC (from line 12 to line 18) further partitions these marked clusters finely so that the spread degree of each resulting group is constrained by the threshold value σ². If the threshold value of δ is 1, a marked cluster is usually a sequence of consecutive numbers whose spread degree is still larger than σ². Given marked cluster $CL_i$, algorithm TC initially sets k to be 1. Then, marked cluster $CL_i$ is evenly divided into k groups, each of size $\frac{n}{k}$. By increasing the value of k in each run, algorithm TC partitions the marked cluster until the spread degree of each partition of $CL_i$ satisfies σ².
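A compact Python sketch of the time clustering phase is given below. It is one possible reading of algorithm TC rather than a definitive implementation, and it rests on several assumptions: "within δ" is read as a gap of at most δ, the coarse loop stops decreasing δ once it reaches 1 so that the even partition step still has work to do, the repeat-until of the fine partition is written as a check-first loop, and an even split uses groups of size ⌈m/k⌉. All function names are hypothetical.

```python
def spread(cluster):
    """Sd(CL): population variance of the numbers in the cluster."""
    m = len(cluster)
    mean = sum(cluster) / m
    return sum((n - mean) ** 2 for n in cluster) / m


def group_by_gap(numbers, delta):
    """Split a sorted list wherever two successive numbers differ by more than delta."""
    groups = [[numbers[0]]]
    for n in numbers[1:]:
        if n - groups[-1][-1] <= delta:
            groups[-1].append(n)
        else:
            groups.append([n])
    return groups


def split_evenly(cluster, k):
    """Cut a cluster into consecutive groups of size ceil(len / k)."""
    size = -(-len(cluster) // k)
    return [cluster[i:i + size] for i in range(0, len(cluster), size)]


def algorithm_tc(tp, delta, sigma2):
    """Cluster a sorted time projection sequence under thresholds delta and sigma2."""
    if not tp:
        return []
    done = []
    marked = group_by_gap(tp, delta)
    # Coarse phase: unmark tight clusters, shrink delta, regroup the rest.
    while marked and delta > 1:
        done += [c for c in marked if spread(c) <= sigma2]
        still_marked = [c for c in marked if spread(c) > sigma2]
        delta -= 1
        marked = [g for c in still_marked for g in group_by_gap(c, delta)]
    # Fine phase: split each remaining cluster into more and more groups.
    for cluster in marked:
        k, parts = 1, [cluster]
        while any(spread(p) > sigma2 for p in parts):
            k += 1
            parts = split_evenly(cluster, k)
        done += parts
    return sorted(done)


clusters = algorithm_tc([1, 2, 3, 7, 8, 12], delta=3, sigma2=0.25)
print(clusters)
```

With σ² = 0.25, a pair of consecutive moving sections (variance 0.25) is accepted while a run of three (variance 2/3) is not, so the run 1, 2, 3 survives the coarse phase as a marked cluster and is divided by the fine phase.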

3.3 Regression Phase

Assume that AMS is $\langle LMR^1, LMR^2, \ldots, LMR^{\varepsilon} \rangle$ with its clustered time projection sequence $CTP(TP_{AMS}) = \langle CL_1, CL_2, \ldots, CL_k \rangle$, where $CL_i$ represents the ith cluster. For each cluster $CL_i$ of $CTP(TP_{AMS})$, we derive the estimated moving function of mobile users, expressed as $E_i(t) = (\hat{x}_i(t), \hat{y}_i(t), \text{valid time interval})$, where $\hat{x}_i(t)$ (respectively, $\hat{y}_i(t)$) is a moving function along the x-coordinate axis (respectively, the y-coordinate axis) and the valid time interval indicates the time interval in which the moving function is valid.

Without loss of generality, let $CL_i$ be $\{t_1, t_2, \ldots, t_n\}$, where each $t_j$ is one of the moving sections in $CL_i$. As described before, a moving record holds a set of items with their corresponding counts. Therefore, we can extract the large moving records from AMS to derive the estimated moving function for each cluster. In order to derive moving functions, the locations of base stations should be represented in a geometric model through a map table provided by telecommunication companies. Hence, given AMS and a cluster of $CTP(TP_{AMS})$, we obtain the geometric coordinates of the frequent items with their corresponding counts, which can be represented as $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$. Accordingly, for each cluster of $CTP(TP_{AMS})$, regression analysis is able to derive the corresponding estimated moving function.

Given a cluster of data points (e.g., $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$), we first consider the derivation of $\hat{x}(t)$. If the number of distinct time points in a given cluster is m + 1, an m-degree polynomial function $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$ is derived to approximate the moving behavior along the x-coordinate axis. Specifically, the regression coefficients $\{a_0, a_1, \ldots, a_m\}$ are chosen to minimize the residual sum of squares $S_x = \sum_{i=1}^{n} w_i e_i^2$, where $w_i$ is the weight of the data point $(x_i, y_i)$ and $e_i = x_i - (a_0 + a_1 t_i + a_2 t_i^2 + \cdots + a_m t_i^m)$. To facilitate the presentation of our paper, we define the following terms:

$$T = \begin{bmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^m \\ 1 & t_2 & t_2^2 & \cdots & t_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_n & t_n^2 & \cdots & t_n^m \end{bmatrix}, \quad a^* = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix}, \quad b_x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad e = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}^T, \quad W = \begin{bmatrix} w_1 & & & \\ & w_2 & & \\ & & \ddots & \\ & & & w_n \end{bmatrix}.$$

The residual sum of squares can be expressed as $S_x = e^T W e$. Since $w_i$ is positive for all i, W can be written as $W = \sqrt{W}\sqrt{W}$, where $\sqrt{W}$ is a diagonal matrix with diagonal entries $[\sqrt{w_1}, \sqrt{w_2}, \ldots, \sqrt{w_n}]$. Thus,

$$S_x = e^T W e = (b_x - T a^*)^T \sqrt{W}\sqrt{W} (b_x - T a^*) = (\sqrt{W} b_x - \sqrt{W} T a^*)^T (\sqrt{W} b_x - \sqrt{W} T a^*).$$

Clearly, $S_x$ is minimized with respect to $a^*$ by the normal equation [2]

$$(\sqrt{W} T)^T (\sqrt{W} T) a^* = (\sqrt{W} T)^T \sqrt{W} b_x.$$

The coefficients $\{a_0, a_1, \ldots, a_m\}$ can hence be obtained by solving the normal equation:

$$a^* = \bigl[ (\sqrt{W} T)^T (\sqrt{W} T) \bigr]^{-1} (\sqrt{W} T)^T \sqrt{W} b_x.$$

Therefore, $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$ is obtained. Following the same procedure, we can derive $\hat{y}(t)$. As a result, for each cluster of $CTP(TP_{AMS})$, the estimated moving function $E_i(t) = (\hat{x}(t), \hat{y}(t), [t_1, t_n])$ of a mobile user is derived.
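The weighted least-squares fit described above can be tried out with a short, dependency-free Python sketch. It builds the normal equation in the equivalent form $T^T W T\, a^* = T^T W b_x$ (the product of the two $\sqrt{W}$ factors is W) and solves it by Gaussian elimination. The function names `solve_linear` and `weighted_polyfit` are hypothetical, and the sketch assumes the system is non-singular, i.e., that there are at least degree + 1 distinct time points.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def weighted_polyfit(ts, xs, ws, degree):
    """Return [a_0, ..., a_m] minimising sum_i w_i * (x_i - poly(t_i))**2."""
    cols = degree + 1
    # Entry (p, q) of T^T W T is sum_i w_i * t_i**(p + q).
    lhs = [[sum(w * t ** (p + q) for t, w in zip(ts, ws))
            for q in range(cols)] for p in range(cols)]
    # Entry p of T^T W b_x is sum_i w_i * x_i * t_i**p.
    rhs = [sum(w * x * t ** p for t, x, w in zip(ts, xs, ws))
           for p in range(cols)]
    return solve_linear(lhs, rhs)


# Three weighted points that lie exactly on x = 1 + 2t.
coeffs = weighted_polyfit([1, 2, 3], [3, 5, 7], [1, 2, 1], degree=1)
print(coeffs)
```

Calling the same function on the y-coordinates gives $\hat{y}(t)$. For larger clusters a numerical library would normally be preferable to hand-written elimination, since normal equations for high-degree polynomials can be ill-conditioned.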

Algorithm MF

input: AMS and clustered time projection sequence CTP(TP_AMS)

output: A set of moving functions F(t) = {E_1(t), U_1(t), E_2(t), ..., E_k(t), U_k(t)}

1  begin
2    initialize F(t) = empty;
3    for i = 1 to k−1
4    begin
5      doing regression on CL_i to generate E_i(t);
6      doing regression on CL_{i+1} to generate E_{i+1}(t);
7      t1 = the last number in CL_i;
8      t2 = the first number in CL_{i+1};
9      using inner interpolation to generate U_i(t) = (x̂_i(t), ŷ_i(t), (t1, t2));
10     insert E_i(t), U_i(t) and E_{i+1}(t) in F(t);
11   end
12   if (1 ∉ CL_1)
13     generate U_0(t) and insert U_0(t) into the head of F(t);
14   if (ε ∉ CL_k)
15     generate U_k(t) and insert U_k(t) into the tail of F(t);
16   return F(t);
17 end
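How the per-cluster functions and the connecting functions fit together can be sketched as follows. The paper does not spell out the inner interpolation, so this example assumes a straight line between the position estimated at the end of one cluster and the position estimated at the start of the next. It also leaves out the boundary functions $U_0(t)$ and $U_k(t)$ of lines 12 to 15, and it appends each $E_i(t)$ only once. The names `linear_bridge`, `assemble` and `locate` are hypothetical, and each moving function is modelled as a tuple of two callables and a valid time interval.

```python
def linear_bridge(seg_a, seg_b):
    """U_i(t): straight line from the end of seg_a to the start of seg_b."""
    xa, ya, (_, t1) = seg_a
    xb, yb, (t2, _) = seg_b
    x1, y1 = xa(t1), ya(t1)
    x2, y2 = xb(t2), yb(t2)

    def x_fn(t):
        return x1 + (x2 - x1) * (t - t1) / (t2 - t1)

    def y_fn(t):
        return y1 + (y2 - y1) * (t - t1) / (t2 - t1)

    return (x_fn, y_fn, (t1, t2))


def assemble(segments):
    """Interleave the cluster functions E_i with the bridges U_i."""
    functions = []
    for i, seg in enumerate(segments):
        functions.append(seg)
        if i + 1 < len(segments):
            functions.append(linear_bridge(seg, segments[i + 1]))
    return functions


def locate(functions, t):
    """Evaluate the first function whose valid interval contains t."""
    for x_fn, y_fn, (start, end) in functions:
        if start <= t <= end:
            return (x_fn(t), y_fn(t))
    return None


# Two toy cluster functions, valid on sections [1, 2] and [4, 6].
e1 = (lambda t: float(t), lambda t: 0.0, (1, 2))
e2 = (lambda t: 10.0, lambda t: float(t), (4, 6))
moving_functions = assemble([e1, e2])
print(locate(moving_functions, 3))
```

At t = 3, halfway through the gap, the bridge returns the midpoint between the end of the first function, (2, 0), and the start of the second, (10, 4).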

4 Performance Study

In this section, the effectiveness of mining user moving patterns from call detail records is evaluated empirically. The simulation model for the mobile system considered is described in Section 4.1. Section 4.2 is devoted to experimental results and a comparison with the original algorithm for mining moving patterns [5].

4.1 Simulation Model for a Mobile System

To simulate the base stations in a mobile computing system, we use an eight by eight mesh network, where each node represents one base station; there are hence 64 base stations in this model [4][5]. A moving path is a sequence of base stations travelled by a mobile user. The number of movements made by a mobile user during one moving section is modeled as a uniform distribution between mf − 2 and mf + 2. Explicitly, the larger the value of mf is, the more frequently a mobile user moves. To model user calling behavior, the calling frequency (denoted as cf) is employed to determine the number of calls during one moving section. The larger the value of cf is, the more calls a mobile user makes. Similar to [5], the mobile user moves to one of its neighboring base stations according to a probabilistic model. To ensure the periodicity of moving behaviors, the probability that a mobile user moves back to the base station it came from is modeled by Pback, and the probability that the mobile user moves to any one of the other base stations is (1 − Pback)/(n − 1), where n is the number of possible base stations this mobile user can move to. The method of mining moving patterns in [5], denoted as UMP, is implemented for comparison purposes. For brevity, our proposed solution procedure of mining user moving patterns is denoted as AUMP (standing for approximate user moving patterns). A location is represented by the identification of a base station. To measure the accuracy of user moving patterns, we use the hop count (denoted as hn), measured in number of base stations, to represent the distance from the location predicted by the derived moving functions to the actual location of the mobile user. Intuitively, a smaller value of hn implies a more accurate prediction.
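The movement model above can be sketched in Python as a random walk on the mesh. The sketch is illustrative only and makes assumptions the text leaves open: base stations are addressed as (row, column) pairs, neighbors are the four horizontally and vertically adjacent cells, the very first move (which has no previous cell) is uniform over all neighbors, and mf is at least 2. The function names are hypothetical.

```python
import random


def neighbors(cell, size=8):
    """Adjacent cells of a (row, column) cell in a size x size mesh."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < size and 0 <= b < size]


def next_cell(current, previous, p_back, rng):
    """Go back with probability p_back, else pick another neighbor uniformly."""
    options = neighbors(current)
    if previous in options and len(options) > 1:
        if rng.random() < p_back:
            return previous
        # Each of the other n - 1 neighbors gets (1 - p_back) / (n - 1).
        options = [o for o in options if o != previous]
    return rng.choice(options)


def simulate_section(start, mf, p_back, rng):
    """Moving path of one moving section: uniform(mf - 2, mf + 2) movements."""
    moves = rng.randint(mf - 2, mf + 2)
    path = [start]
    previous = None
    for _ in range(moves):
        step = next_cell(path[-1], previous, p_back, rng)
        previous = path[-1]
        path.append(step)
    return path


rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate_section((3, 3), mf=5, p_back=0.3, rng=rng)
print(path)
```

Each run returns the sequence of base stations visited during one moving section; call detail records could then be sampled from such a path according to the calling frequency cf.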

4.2 Experiments of UMP and AUMP

To conduct the experiments to evaluate UMP and AUMP, we set the value of w to be 10, the value of cf to be 3 and the value of ε to be 12. In order to reduce the amount of data used in mining user moving patterns, AUMP explores the log of call detail records. The cost ratio for a user, i.e., $\frac{1/hn}{\text{amount of log data}}$, measures the prediction accuracy gained per unit of log data. Fig. 2 shows the cost ratios of UMP and AUMP. Notice that AUMP has larger cost ratios than UMP, showing that AUMP uses the log data more cost-efficiently to increase the prediction accuracy.

[Plot: cost ratio (y-axis) versus moving frequency (x-axis) for AUMP and UMP]

Fig. 2. The cost ratios of AUMP and UMP with the moving frequency varied

[Plot: hop counts (y-axis) versus w (x-axis) for users with cf=1, cf=3 and cf=5]

Fig. 3. The performance of AUMP with the value of w varied

The impact of varying the value of w on mining moving patterns is investigated next. Without loss of generality, we set the value of ε to be 12, that of mf to be 3 and the values of cf to be 1, 3 and 5. Both vertical_min_sup and match_min_sup are set to 20%, the value of δ is set to 3, and σ² is set to 0.25. With this setting, the experimental results are shown in Fig. 3.

As can be seen from Fig. 3, the hop count of AUMP decreases as the value of w increases. The reason is that as the value of w increases, the number of moving sequences considered in AUMP increases, and AUMP is able to extract more information from the log of call detail records. Note that for a given value of w, the hop count of AUMP is smaller with a larger value of cf, showing that the log contains more information when the value of cf increases. Clearly, for mobile users having high call frequencies, the value of w can be set smaller in order to obtain moving patterns quickly. However, for mobile users having low call frequencies, the value of w should be set larger so as to increase the accuracy of the moving patterns mined by AUMP.

5 Conclusions

In this paper, without increasing the overhead of generating the moving log, we presented a new method to mine user moving patterns from the existing log of call detail records of mobile computing systems. Specifically, we proposed algorithm LS to capture similar moving sequences from the log of call detail records and merge them into the aggregate moving sequence. By exploring the feature of spatial-temporal locality, the proposed algorithm TC groups call detail records into several clusters. For each cluster of the aggregate moving sequence, algorithm MF is employed to derive the estimated moving function, from which user moving patterns are generated. Our simulation results show that the user moving patterns obtained by our proposed algorithms are of very high quality and in fact very close to real user moving behaviors.

Acknowledgment

The authors are supported in part by the National Science Council, Project No. NSC 92-2211-E-009-068 and NSC 93-2213-E-009-121, Taiwan, Republic of China.

References

1. J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proceedings of the 15th International Conference on Data Engineering, March 1999.

2. R. V. Hogg and E. A. Tanis. Probability and Statistical Inference. Prentice-Hall International Inc., 1997.

3. D. L. Lee, J. Xu, B. Zheng, and W.-C. Lee. Data Management in Location-Dependent Information Services. IEEE Pervasive Computing, pages 65–72, 2002.

4. Y.-B. Lin. Modeling Techniques for Large-Scale PCS Networks. IEEE Communications Magazine, 35(2):102–107, February 1997.

5. W.-C. Peng and M.-S. Chen. Developing Data Allocation Schemes by Incremental Mining of User Moving Patterns in a Mobile Computing System. IEEE Transactions on Knowledge and Data Engineering, 15:70–85, 2003.

6. H.-K. Wu, M.-H. Jin, J.-T. Horng, and C.-Y. Ke. Personal Paging Area Design Based On Mobile's Moving Behaviors. In Proceedings of IEEE INFOCOM, 2001.