Regression Phase - Procedure for Mining User Moving Patterns

3.2 Procedure for Mining User Moving Patterns

3.2.4 Regression Phase

Given the aggregate moving sequence $AMS$ devised by algorithm LS and its clustered time projection sequence $CTP(TP_{AMS})$ generated by algorithm TC, algorithm MF derives in this phase a sequence of moving functions that estimate the moving behaviors of mobile users.

Assume that $AMS$ is $\langle LMR^1, LMR^2, \ldots, LMR^{\varepsilon} \rangle$ with its clustered time projection sequence $CTP(TP_{AMS}) = \langle CL_1, CL_2, \ldots, CL_k \rangle$, where $CL_i$ represents the $i$th cluster. For each cluster $CL_i$ of $CTP(TP_{AMS})$, we derive the estimated moving function of mobile users, expressed as $E_i(t) = (\hat{x}_i(t), \hat{y}_i(t), valid\_time\_interval)$, where $\hat{x}_i(t)$ (respectively, $\hat{y}_i(t)$) is a moving function along the x-coordinate axis (respectively, the y-coordinate axis) and the moving function is valid for the time interval indicated in valid_time_interval.

Without loss of generality, let $CL_i$ be $\{t_1, t_2, \ldots, t_n\}$, where each $t_j$ is one of the moving sections in $CL_i$. As described before, a moving record consists of a set of items with their corresponding counts. Therefore, we can extract the large moving records from $AMS$ to derive the estimated moving function for each cluster. To derive moving functions, the locations of base stations should be represented in the geometric model through a map table provided by telecommunication companies.

Hence, given $AMS$ and $CTP(TP_{AMS})$, for each cluster of $CTP(TP_{AMS})$ we have the geometric coordinates of frequent items with their corresponding counts, which can be represented as $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$. Accordingly, for each cluster of $CTP(TP_{AMS})$, regression analysis derives the corresponding estimated moving function. The moving functions devised by regression analysis generate curves close to the data points and thus can be used to estimate users' moving behaviors.

Regression analysis fits equations of approximating curves to the raw field data [23]. For a given set of data, the fitting curves are generally not unique, and a curve with minimal deviation from all data points is desired. Let $e_i$ be the error between the $i$th data point and the estimated fitting curve. Given a set of data points, the best estimated curve is the one that has the minimal sum of squared errors (i.e., the minimal value of $\chi$, where $\chi = \sum_{i=1}^{n} e_i^2$) [23].
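The least-squares criterion above can be illustrated with a minimal Python sketch. The sample points and the two candidate curves are hypothetical values chosen for illustration only; they do not come from the data used in this work.

```python
def sum_squared_errors(points, curve):
    """Return the sum of e_i^2, where e_i = x_i - curve(t_i)."""
    return sum((x - curve(t)) ** 2 for t, x in points)


# Hypothetical (t, x) samples, for illustration only.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2)]


def line_a(t):
    return t            # candidate curve x(t) = t


def line_b(t):
    return 0.5 * t + 1  # candidate curve x(t) = 0.5 t + 1


# The preferred curve is the candidate with the smaller sum of squared errors.
best = min([line_a, line_b], key=lambda c: sum_squared_errors(points, c))
```

Here `line_a` is selected because its squared errors sum to a much smaller value than those of `line_b`.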

Since the number of calls may vary for each distinct $(t_i, x_i, y_i)$, it is reasonable to take the weights into consideration when deriving moving functions. Therefore, regression analysis with weighted least squares is applied.

Given a cluster of data points (e.g., $(t_1, x_1, y_1, w_1), (t_2, x_2, y_2, w_2), \ldots, (t_n, x_n, y_n, w_n)$), we first consider the derivation of $\hat{x}(t)$. An $m$-degree polynomial function $\hat{x}(t) = a_0 + a_1 t + \cdots + a_m t^m$ is derived to approximate moving behaviors along the x-coordinate axis. Specifically, the regression coefficients $\{a_0, a_1, \ldots, a_m\}$ are chosen to make the residual sum of squares $\chi = \sum_{i=1}^{n} w_i e_i^2$ minimal, where $w_i$ is the weight of the data point $(x_i, y_i)$ and $e_i = x_i - (a_0 + a_1 t_i + a_2 t_i^2 + \cdots + a_m t_i^m)$. The value of $m$ is determined in accordance with the requirements of the application, but $m$ is usually smaller than the number of data points. To facilitate the presentation, we define the following terms:

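$H$ is the $n \times (m+1)$ matrix whose $i$th row is $(1, t_i, t_i^2, \ldots, t_i^m)$; $a^* = (a_0\ a_1\ \cdots\ a_m)^T$ is the vector of regression coefficients; $\tilde{b}_x = (x_1\ x_2\ \cdots\ x_n)^T$ is the vector of x-coordinates; $W$ is the $n \times n$ diagonal matrix with diagonal entries $w_1, \ldots, w_n$; and $A = \sqrt{W} H$, $B = \sqrt{W}\,\tilde{b}_x$.

1 Since $W$ is diagonal and all of its diagonal elements are positive, we can decompose $W$ into $\sqrt{W}\sqrt{W}$.

It can be verified that the residual sum of squares (i.e., $\chi = \sum_{i=1}^{n} w_i e_i^2$) can be expressed as $\chi = \|\sqrt{W}\,\tilde{b}_x - \sqrt{W} H a^*\|^2$. The main objective is therefore to minimize $\chi = \|B - Aa^*\|^2$. According to the theorem of least squares, $B - Aa^*$ must be orthogonal to the column space of $A$, and hence to $Aa^*$, so as to minimize $\chi$ [8]. For brevity, the theorem of least squares is omitted here. Consequently, we have:

$B - Aa^* \in R(A)^{\perp} = N(A^T)$,

where $R(A)^{\perp}$ represents the orthogonal complement of the column space of $A$ and $N(A^T)$ represents the null space (kernel) of $A^T$. Therefore $A^T(B - Aa^*) = 0$, which yields the normal equation $A^T A a^* = A^T B$. By solving the normal equation, $a^*$, and hence $\hat{x}(t)$, is obtained. Following the same procedure, we can derive $\hat{y}(t)$. As a result, for each cluster of $CTP(TP_{AMS})$, the estimated moving function $E_i(t) = (\hat{x}(t), \hat{y}(t), [t_1, t_n])$ of a mobile user is a regression curve that minimizes the residual sum of squares.

Table 3.3: Data points with their corresponding weights (columns: $t_i$, item, $x_i$, $y_i$, $w_i$).

As an example, consider the data points in Table 3.3 with $m = 3$. In other words, $a^* = (a_0\ a_1\ a_2\ a_3)^T$ should be determined. Since there are five data points whose corresponding moving sections are 1, 2, 4, 4 and 5,

$H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 4 & 16 & 64 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \end{bmatrix}$.

The weights of the data points are 17, 1, 2, 3 and 2, respectively, where the first weight is the sum of 16 and 1. Hence, $\sqrt{W}$ is a diagonal matrix with diagonal entries $[\sqrt{17}, \sqrt{1}, \sqrt{2}, \sqrt{3}, \sqrt{2}]$. By solving the normal equation, $\hat{x}(t) = 2.333 - 2.133t + 0.867t^2 - 0.066t^3$ is devised to predict the x-coordinate of the mobile user from $t = 1$ to $t = 5$. Similarly, $\tilde{b}_y = (1\ 2\ 1\ 2\ 3\ 3)^T$ is determined from Table 3.3. By solving the normal equation, $\hat{y}(t) = 2.529 - 2.386t + 1.021t^2 - 0.105t^3$ is obtained. The estimated moving function is shown in Figure 3.1. It can be seen that the estimated moving function is very close to the real moving path, showing the advantage of utilizing regression in mining user moving patterns.

Figure 3.1: An illustrative example, where the arrow line is the real moving path and the solid line is estimated by moving functions obtained by algorithm MF.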
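The weighted least-squares step can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this work: it forms the normal equation in the equivalent form $H^T W H a^* = H^T W \tilde{b}_x$ and solves it by Gaussian elimination. The helper names (`solve_linear`, `weighted_polyfit`, `evaluate`) are hypothetical. The moving sections and weights follow the example above, whereas the x-values are an assumption, since the contents of Table 3.3 are not reproduced here.

```python
def solve_linear(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular normal equation; lower the degree m")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def weighted_polyfit(ts, xs, ws, m):
    """Return [a_0, ..., a_m] minimizing sum_i w_i (x_i - poly(t_i))^2."""
    # Row i of H is (1, t_i, t_i^2, ..., t_i^m).
    H = [[t ** j for j in range(m + 1)] for t in ts]
    # Normal equation: (H^T W H) a = H^T W x.
    HtWH = [[sum(w * row[p] * row[q] for w, row in zip(ws, H))
             for q in range(m + 1)] for p in range(m + 1)]
    HtWx = [sum(w * row[p] * x for w, row, x in zip(ws, H, xs))
            for p in range(m + 1)]
    return solve_linear(HtWH, HtWx)


def evaluate(coeffs, t):
    """Evaluate a_0 + a_1 t + ... + a_m t^m."""
    return sum(a * t ** j for j, a in enumerate(coeffs))


# Moving sections and weights as in the example; x-values are assumed.
ts = [1, 2, 4, 4, 5]
xs = [1, 1, 4, 3, 5]
ws = [17, 1, 2, 3, 2]
coeffs = weighted_polyfit(ts, xs, ws, m=3)
```

With these assumed x-values the fitted coefficients come out close to those of $\hat{x}(t)$ in the example. At $t = 4$, where two data points share the same moving section, the curve passes through their weighted mean.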

Algorithm MF

input: $AMS$ and clustered time projection sequence $CTP(TP_{AMS})$

output: A set of moving functions $F(t) = \{E_1(t), U_1(t), E_2(t), \ldots, E_k(t), U_k(t)\}$

1 begin

2 initialize F(t) = empty;

3 for i = 1 to k-1

4 begin

5 do regression on CL_i to generate E_i(t);

6 do regression on CL_{i+1} to generate E_{i+1}(t);

7 t1 = the last number in CL_i;

8 t2 = the first number in CL_{i+1};

9 use inner interpolation to generate U_i(t) = (x̂_i(t), ŷ_i(t), (t1, t2));

10 add the generated functions to F(t);

11 end

12 return F(t);

13 end

Algorithm MF generates the whole estimated moving function, denoted as $F(t)$. $F(t)$ is represented as $\{U_0(t), E_1(t), U_1(t), E_2(t), \ldots, E_k(t), U_k(t)\}$, where $E_i(t)$ is the estimated moving function in cluster $i$ of $CTP(TP_{AMS})$ and $U_i(t)$ is the linkage moving function from $E_i(t)$ to $E_{i+1}(t)$. As shown in algorithm MF (lines 5 and 6), for each cluster of $CTP(TP_{AMS})$ we derive the corresponding estimated moving function by the regression method described above.

Note, however, that the first moving section may not be in $CL_1$. If $t_0$ is the first number of $CL_1$ and $t_0 \neq 1$, then $U_0(t) = (E_1(t_0), [1, t_0))$ is generated for the boundary condition. Otherwise, $U_0(t)$ is not valid in $F(t)$. The situation of $U_k(t)$ is similar. The linkage moving function is calculated by interpolation (line 9 of algorithm MF). For example, assume that $CTP(TP_{AMS}) = \langle \{1, 2, 4, 5\}, \{7, 9, 10\} \rangle$, $E_1(t) = (2.333 - 2.133t + 0.867t^2 - 0.066t^3,\ 2.529 - 2.386t + 1.021t^2 - 0.105t^3,\ [1, 5])$ and $E_2(t) = (10 - 2.17t + 0.17t^2,\ 32 - 6.33t + 0.33t^2,\ [7, 10])$.

It can be verified that the first number of cluster {1, 2, 4, 5} is 1. Thus $U_0(t)$ is invalid in $F(t)$.

Figure 3.2: A snapshot of the complete moving function $F(t)$.

The last number of {1, 2, 4, 5} is 5 and the first number of cluster {7, 9, 10} is 7. Thus, a linkage moving function should be generated by inner interpolation. From $E_1(t)$, at time 5, we have the data point $(x = 5.09, y = 3)$. At time 7, the data point $(x = 3.14, y = 3.86)$ is generated by applying $E_2(7)$. By inner interpolation, we have $U_1(t) = \left(9.965 + \frac{3.14 - 5.09}{7 - 5}\,t,\ 0.85 + \frac{3.86 - 3}{7 - 5}\,t,\ (5, 7)\right)$. Similarly, $U_2(t)$ can be produced. Thus, we have $F(t) = \{E_1(t), U_1(t), E_2(t), U_2(t)\}$.

The snapshot of F(t) is shown in Figure 3.2.
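The linkage step of the example can be sketched as follows. This is a minimal illustration assuming straight-line interpolation between the end of one cluster and the start of the next; the helper names `poly` and `linkage` are hypothetical. The coefficients are those of $E_1(t)$ and $E_2(t)$ in the example.

```python
def poly(coeffs):
    """Return t -> a_0 + a_1 t + ... for coefficients [a_0, a_1, ...]."""
    return lambda t: sum(a * t ** j for j, a in enumerate(coeffs))


def linkage(e_prev, e_next, t1, t2):
    """Straight segment joining e_prev at t1 to e_next at t2.

    e_prev and e_next map a time t to an (x, y) pair.
    """
    x1, y1 = e_prev(t1)
    x2, y2 = e_next(t2)

    def u(t):
        s = (t - t1) / (t2 - t1)  # 0 at t1, 1 at t2
        return (x1 + s * (x2 - x1), y1 + s * (y2 - y1))

    return u


# Coefficients of E_1(t) and E_2(t) from the example.
x1f = poly([2.333, -2.133, 0.867, -0.066])
y1f = poly([2.529, -2.386, 1.021, -0.105])
x2f = poly([10, -2.17, 0.17])
y2f = poly([32, -6.33, 0.33])


def E1(t):
    return (x1f(t), y1f(t))


def E2(t):
    return (x2f(t), y2f(t))


# Linkage from the last section of {1, 2, 4, 5} to the first of {7, 9, 10}.
U1 = linkage(E1, E2, 5, 7)
```

Evaluating `U1` at the two endpoints reproduces, up to rounding, the data points (5.09, 3) and (3.14, 3.86) used in the example.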

When using $F(t)$ to predict the location of a mobile user, we use only the estimated moving function whose time interval includes the given time $t$. For the above example $F(t) = \{E_1(t), U_1(t), E_2(t), U_2(t)\}$, given the time 4, only $E_1(t)$ is used to predict the location, since 4 is within the time interval of $E_1(t)$. Once the estimated moving function is obtained, it is straightforward to generate the approximate moving patterns in the symbolic model.
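The interval-based selection can be sketched as below. The `Piece` container and the `predict` helper are hypothetical names introduced for illustration; the pieces use the functions of the example, and $U_2(t)$ is left out because its form is not given in the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Piece:
    func: Callable[[float], Tuple[float, float]]
    start: float
    end: float
    closed: bool  # True for [start, end] (E_i), False for (start, end) (U_i)

    def covers(self, t: float) -> bool:
        if self.closed:
            return self.start <= t <= self.end
        return self.start < t < self.end


def predict(F: List[Piece], t: float) -> Optional[Tuple[float, float]]:
    """Return (x, y) from the piece whose interval contains t, else None."""
    for piece in F:
        if piece.covers(t):
            return piece.func(t)
    return None


def E1(t):
    return (2.333 - 2.133 * t + 0.867 * t ** 2 - 0.066 * t ** 3,
            2.529 - 2.386 * t + 1.021 * t ** 2 - 0.105 * t ** 3)


def U1(t):
    return (9.965 + (3.14 - 5.09) / (7 - 5) * t,
            0.85 + (3.86 - 3) / (7 - 5) * t)


def E2(t):
    return (10 - 2.17 * t + 0.17 * t ** 2,
            32 - 6.33 * t + 0.33 * t ** 2)


F = [Piece(E1, 1, 5, True), Piece(U1, 5, 7, False), Piece(E2, 7, 10, True)]
location_at_4 = predict(F, 4)  # falls in [1, 5], so E1 is used
```

Times outside every interval, such as 11 here, yield `None`, which mirrors the statement that only a function whose interval includes $t$ is used.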

By utilizing the derived estimated moving function, the location of a mobile user is predicted as $(x_t, y_t)$. Since each base station is aware of its location and coverage area, as shown in Figure 3.3, we can transform the geometric location $(x_t, y_t)$ into base station D in the symbolic model.
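One simple way to realize this transformation is sketched below. It assumes that the coverage area of a base station consists of the points closer to it than to any other station, which the text does not state; the map table, the station identifiers and the helper name `nearest_base_station` are hypothetical.

```python
import math


def nearest_base_station(x, y, stations):
    """Return the identifier of the station closest to (x, y).

    stations maps an identifier to its (x, y) location; using the nearest
    station as the covering one is an assumption of this sketch.
    """
    return min(stations,
               key=lambda s: math.hypot(stations[s][0] - x, stations[s][1] - y))


# Hypothetical map table with four base stations on a unit grid.
stations = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1)}
symbol = nearest_base_station(0.8, 0.9, stations)
```

For the point (0.8, 0.9), the closest station in this hypothetical table is D.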

Figure 3.3: An illustrative example of the transformation from the geometric model to the symbolic model.

Note that algorithm MF runs in polynomial time. Specifically, with the maximal size in row/column being $n$, the time complexity of solving the normal equation by Strassen's algorithm is $\Theta(n^{\lg 7})$ [19]. Moreover, the interpolation by Lagrange's formula requires $\Theta(m^2)$, where $m$ represents the number of points involved in the interpolation [19]. Since $n$ is usually larger than $m$, $\Theta(n^{\lg 7})$ is the dominating factor of the complexity of algorithm MF.

Chapter 4 Performance Study

In this chapter, the effectiveness of mining approximate user moving patterns from call detail records is evaluated empirically. The simulation model for the mobile system considered is described in Section 4.1. Section 4.2 is devoted to experimental results and a comparison with the original algorithm for mining moving patterns [15]. Finally, a sensitivity analysis of mining approximate user moving patterns is presented in Section 4.3.

4.1 Simulation Model for a Mobile System

To simulate base stations in a mobile computing system, we use an eight by eight mesh network, where each node represents one base station; there are hence 64 base stations in this model [13][15]. A moving path is a sequence of base stations travelled by a mobile user. The number of movements made by a mobile user during one moving section is modeled as a uniform distribution between $mf - 2$ and $mf + 2$. Explicitly, the larger the value of $mf$ is, the more frequently a mobile user moves. There are 10,000 users considered in our simulation model. In accordance with the law of large numbers [23], we repeat each experiment 20 times, and every result presented in the figures is the average of the 20 experimental results.

To model user calling behavior, the calling frequency $cf$ is employed to determine the number of calls during one moving section. If the value of $cf$ is large, the number of calls for a mobile user increases. Similar to [15], the mobile user moves to one of its neighboring base stations according to a probabilistic model. To ensure the periodicity of moving behaviors, the probability that a mobile user moves back to the base station it came from is modeled by $P_{back}$, and the probability that the mobile user routes to each of the other base stations is $(1 - P_{back})/(n - 1)$, where $n$ is the number of possible base stations this mobile user can move to. We assign two values of $P_{back}$ to each user to represent the major and minor moving behaviors, respectively.

As mentioned before, the method of mining moving patterns in [15], denoted as UMP, is implemented for comparison purposes. For brevity, our proposed procedure for mining user moving patterns is denoted as AUMP (standing for approximate user moving patterns). The location is represented by the identifications of base stations. To measure the prediction accuracy, we use the hop count (denoted as $hn$), measured in the number of base stations, to represent the distance from the predicted location to the actual location of the mobile user. Intuitively, a smaller value of $hn$ implies a more accurate prediction. It is worth mentioning that the expected value of the hop count per call, denoted by $E(hn/call)$, is $\frac{hn}{w \cdot \varepsilon \cdot cf/2}$, where $cf/2$ is the expected number of call detail records (CDRs) in a time unit. The precise ratio is then defined as $1 - \frac{E(hn/call) - 1}{2n}$. The precise ratio is a measurement that considers not only the distance between the moving patterns and the real paths but also the ratio of this distance to the whole network size. Table 4.1 summarizes the definitions of the primary simulation parameters and the performance measurements.
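The movement model described above can be sketched as follows. This is an illustrative sketch, not the simulator used in the experiments: it assumes that the neighbors of a node are its four mesh neighbors, that the hop count equals the Manhattan distance on the mesh, and that a user with no previous station picks a neighbor uniformly. All function names are hypothetical.

```python
import random


def neighbors(node, size=8):
    """Four-connected neighbors of (row, col) on a size x size mesh."""
    r, c = node
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < size and 0 <= b < size]


def next_station(current, previous, p_back, rng, size=8):
    """Move back with probability p_back, else uniformly to another neighbor."""
    options = neighbors(current, size)
    if previous not in options:
        return rng.choice(options)  # no station to move back to
    others = [s for s in options if s != previous]
    if not others or rng.random() < p_back:
        return previous
    # Each remaining neighbor is chosen with probability (1 - p_back)/(n - 1).
    return rng.choice(others)


def moving_section(start, mf, p_back, rng, size=8):
    """One moving section with between mf-2 and mf+2 movements (uniform)."""
    steps = rng.randint(mf - 2, mf + 2)
    path = [start]
    previous = None
    for _ in range(steps):
        nxt = next_station(path[-1], previous, p_back, rng, size)
        previous = path[-1]
        path.append(nxt)
    return path


def hop_count(a, b):
    """Number of mesh hops between two base stations (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


rng = random.Random(42)  # fixed seed so the sketch is repeatable
path = moving_section((3, 3), mf=5, p_back=0.7, rng=rng)
```

A larger `p_back` makes the generated path oscillate between a few stations, which is one way to obtain the periodic behavior the model aims for.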

Notation            Definition                                            Value
w                   retrospective factor                                  various values used
Ms                  the number of moving sections in a moving sequence    various values used
Mf                  moving frequency                                      various values used
Cf                  call frequency                                        various values used
σ²                  variance threshold                                    various values used
vertical_min_sup    threshold of vertical minimal support                 various values used
match_min_sup       threshold of match minimal support                    various values used

Table 4.1: The parameters and measurements used in the simulation


Figure 4.1: The precise ratio of UMP and AUMP with the value of mf varied.

From the document: Using Regression Analysis in Mining User Moving Patterns (pages 36-46)