
3.2.3 Time Clustering Phase

Recall that the time projection sequence of moving sequence MS_i is denoted as TP_{MS_i}, which is represented as TP_{MS_i} = <α_1, ..., α_n>, where MR^i_{α_j} ≠ {} and α_1 < ... < α_n. Once AMS is obtained, TP_{AMS} can easily be determined. By exploiting the feature of spatial-temporal locality, in this phase we develop algorithm TC to generate a clustered time projection sequence of AMS (i.e., CTP(TP_{AMS})).

In algorithm TC, two threshold values (i.e., δ and σ²) are given for clustering a time projection sequence. Explicitly, the value of δ is used to determine the density of clusters, and σ² is utilized to make sure that the spread of the time points in each cluster is bounded by σ². Algorithm TC is able to dynamically determine the number of groups in a time projection sequence.

Algorithm TC

input: time projection sequence TP_{AMS}, thresholds δ and σ²

output: clustered time projection sequence CTP(TP_{AMS})

1  begin
2    group the numbers whose differences are within δ;
3    mark all clusters;
4    while there exist marked clusters and δ >= 1
5      foreach marked cluster CL_i
6        if Var(CL_i) ≤ σ²
7          unmark CL_i;
8      δ = δ − 1;
9      forall marked clusters CL_i
10       group the numbers whose differences are within δ in CL_i;
11   end while
12   if there exist marked clusters
13     for each marked cluster CL_i
14       k = 1;
15       repeat
16         k++;
17         divide CL_i evenly into k groups;
18       until the spread degree of each group ≤ σ²;
19 end
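To make the control flow of algorithm TC concrete, the following is a minimal Python sketch of the procedure above. The function names (coarse_group, spread, divide_evenly, tc_cluster) are illustrative and not part of the original work; the spread degree of a group is taken to be its population variance, and re-grouping is skipped once δ reaches 0, matching the execution scenario in Table 3.2.

import math
from statistics import pvariance

def coarse_group(numbers, delta):
    # line 2: group numbers whose successive differences are within delta
    groups = [[numbers[0]]]
    for x in numbers[1:]:
        if x - groups[-1][-1] <= delta:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

def spread(group):
    # spread degree of a group, taken here to be its population variance
    return pvariance(group) if len(group) > 1 else 0.0

def divide_evenly(group, k):
    # line 17: divide a group evenly into k sub-groups of size ceil(len(group)/k)
    size = math.ceil(len(group) / k)
    return [group[i:i + size] for i in range(0, len(group), size)]

def tc_cluster(tp_ams, delta, sigma2):
    # a sketch of algorithm TC; returns the clustered time projection sequence
    clusters = coarse_group(sorted(tp_ams), delta)                             # lines 2-3
    marked = [spread(c) > sigma2 for c in clusters]

    # lines 4-11: lower delta step by step and re-cluster the clusters that stay marked
    while any(marked) and delta >= 1:
        marked = [m and spread(c) > sigma2 for c, m in zip(clusters, marked)]  # lines 5-7
        delta -= 1                                                             # line 8
        new_clusters, new_marked = [], []
        for c, m in zip(clusters, marked):                                     # lines 9-10
            if m and delta >= 1:
                for sub in coarse_group(c, delta):
                    new_clusters.append(sub)
                    new_marked.append(spread(sub) > sigma2)
            else:
                new_clusters.append(c)
                new_marked.append(m)
        clusters, marked = new_clusters, new_marked

    # lines 12-18: finely partition every cluster that still violates sigma2
    result = []
    for c, m in zip(clusters, marked):
        if m and spread(c) > sigma2:
            k = 1
            parts = [c]
            while not all(spread(p) <= sigma2 for p in parts):
                k += 1                                                         # line 16
                parts = divide_evenly(c, k)                                    # line 17
            result.extend(parts)
        else:
            result.append(c)
    return result

if __name__ == "__main__":
    # the time projection sequence and thresholds of Table 3.2
    tp_ams = [1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20]
    print(tc_cluster(tp_ams, delta=3, sigma2=1.6))
    # expected, per Table 3.2: [[1, 2, 3], [4, 5], [9, 10], [14], [17, 18, 20]]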

By grouping numbers together whenever the difference between two successive numbers is within the threshold value δ, algorithm TC (from line 2 to line 3) first coarsely clusters TP_{AMS} into several marked clusters. As pointed out before, CL_i denotes the i-th marked cluster. In order to ensure the quality of clusters, the variance of CL_i, denoted as Var(CL_i), is defined to measure the distribution of the numbers in cluster CL_i. Specifically, Var(CL_i) is the variance of a sequence of numbers; for a cluster CL_i = {x_1, ..., x_m} with mean x̄, Var(CL_i) is formulated as (1/m) Σ_{j=1}^{m} (x_j − x̄)².

From line 5 to line 7 in algorithm TC, for each cluster CL_i, if Var(CL_i) is not larger than σ², we unmark the cluster CL_i. Otherwise, we decrease δ by 1, and with the new value of δ, algorithm TC (from line 8 to line 10) re-clusters the numbers in the clusters that remain marked. Algorithm TC partitions the numbers of TP_{AMS} iteratively, with the objective of satisfying the two threshold values δ and σ², until there is no marked cluster or δ = 0. If no marked cluster remains, CTP(TP_{AMS}) is thus generated. Note, however, that if there are still marked clusters whose variance values are larger than σ², algorithm TC (from line 12 to line 18) will further finely partition these marked clusters so that the variance of every marked cluster is bounded by the threshold value σ². When the threshold value δ reaches 1, a marked cluster is usually a sequence of consecutive numbers whose variance is still larger than σ². To deal with this case, we derive the following lemma:
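For instance, for the cluster CL_i = {14, 17, 18, 20} (which arises in the execution scenario below), the mean is 17.25 and Var(CL_i) = ((14 − 17.25)² + (17 − 17.25)² + (18 − 17.25)² + (20 − 17.25)²)/4 = 18.75/4 ≈ 4.69.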

Lemma 1: Given a sequence of consecutive integers S_n with length n, the variance of S_n is (1/12)(n² − 1).

Proof:

Note that all sequences of consecutive integers with the same length have the same variance. For example, consider the two sequences of consecutive integers {1, 2, 3, 4, 5} and {7, 8, 9, 10, 11}; it can be verified that Var({1, 2, 3, 4, 5}) = Var({7, 8, 9, 10, 11}). Without loss of generality, consider S_n = {1, 2, ..., n}. The mean of S_n is (n + 1)/2, and (1/n) Σ_{i=1}^{n} i² = (n+1)(2n+1)/6. Hence,

Var(S_n) = (1/n) Σ_{i=1}^{n} i² − ((n+1)/2)² = (n+1)(2n+1)/6 − (n+1)²/4 = (n+1)(n−1)/12 = (1/12)(n² − 1).
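For instance, for n = 5, Lemma 1 gives (1/12)(5² − 1) = 2, which matches the value Var({1, 2, 3, 4, 5}) = 2 used in the execution scenario below.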

Property: Given a sequence of consecutive integers {1, 2, 3, ..., n} and a positive integer k, the optimal way of dividing {1, 2, 3, ..., n} into k clusters is to partition {1, 2, 3, ..., n} evenly into k clusters with each cluster size being ⌈n/k⌉.

Proof: By Lemma 1, the variance of a group of m consecutive integers is (1/12)(m² − 1), which is strictly increasing in m. For any partition of {1, 2, 3, ..., n} into k clusters of consecutive integers, the largest cluster contains at least ⌈n/k⌉ integers, and hence its variance is at least (1/12)(⌈n/k⌉² − 1). Dividing the sequence evenly, so that every cluster has size at most ⌈n/k⌉, attains this lower bound on the largest cluster variance. From the derivation above, the optimal way to divide {1, 2, 3, ..., n} into k clusters is therefore to divide it evenly into k clusters with each cluster size being ⌈n/k⌉.
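For example, dividing {1, 2, ..., 10} into k = 3 clusters yields clusters of size ⌈10/3⌉ = 4, 4, and 2 (i.e., {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}), whose variances by Lemma 1 are 1.25, 1.25, and 0.25, respectively.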

By the above property, given a marked cluster CL_i with n numbers, algorithm TC initially sets k to 1. Then, marked cluster CL_i is evenly divided into k groups, each of size ⌈n/k⌉. By increasing the value of k in each run, algorithm TC partitions the marked cluster until the variance of every group in the marked cluster CL_i is not larger than σ².

Consider the execution scenario in Table 3.2, where the time projection sequence is TP_{AMS} = <1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20>. Given σ² = 1.6 and δ = 3, algorithm TC first roughly partitions TP_{AMS} into three clusters.

Run  δ   σ²   Clusters of the time projection sequence
0    3   1.6  <{1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20}>
1    3   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14, 17, 18, 20}>
2    2   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
3    1   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
4    0   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
5    0   1.6  <{1, 2, 3}, {4, 5}, {9, 10}, {14}, {17, 18, 20}>

Table 3.2: An execution scenario under algorithm TC.

It can be verified in Table 3.2 that two marked clusters (i.e., {1, 2, 3, 4, 5} with Var({1, 2, 3, 4, 5}) = 2 and {14, 17, 18, 20} with Var({14, 17, 18, 20}) = 4.69) are determined, because the variance values of these two clusters are larger than 1.6. Then, δ is reduced to 2, and these two marked clusters are examined again. Following the same procedure, algorithm TC partitions the marked clusters until δ reaches 1. As can be seen in Run 4 of Table 3.2, {1, 2, 3, 4, 5} is still a marked cluster with Var({1, 2, 3, 4, 5}) = 2. Therefore, algorithm TC finely partitions {1, 2, 3, 4, 5}. The value of k is initially set to 1. Since Var({1, 2, 3, 4, 5}) = 2 is larger than σ² (i.e., 1.6), k is increased to 2. Then, {1, 2, 3, 4, 5} is divided into {1, 2, 3} and {4, 5}.

Among these two clusters (i.e., {1, 2, 3} and {4, 5}), {1, 2, 3} has the larger variance and is thus compared with the value of σ². Since Var({1, 2, 3}) = 0.67 < 1.6, algorithm TC stops clustering. After the execution of algorithm TC, CTP(TP_{AMS}) is generated as <{1, 2, 3}, {4, 5}, {9, 10}, {14}, {17, 18, 20}>.
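As a quick check of the variance values quoted in this walk-through, a few lines of Python (a standalone snippet using the standard library's population variance) reproduce them:

from statistics import pvariance

print(pvariance([1, 2, 3, 4, 5]))   # 2.0
print(pvariance([14, 17, 18, 20]))  # 4.6875, i.e., about 4.69
print(pvariance([17, 18, 20]))      # about 1.56, not larger than 1.6, so {17, 18, 20} is unmarked
print(pvariance([1, 2, 3]))         # about 0.67 < 1.6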

The time complexity of algorithm TC is polynomial. Explicitly, let TP_{AMS} have n numbers. In line 2 of algorithm TC, it takes O(n) to roughly divide the sequence into t clusters. From line 4 to line 11 of algorithm TC, it takes O(δmt) to re-group the numbers, where m is the maximal number of elements in a cluster. From line 13 to line 19, assuming that there are still t clusters with m numbers each to be refined, it takes t · m · m = O(m²t) to run the refinement process. Since algorithm TC is a heuristic algorithm, we consider the worst case when estimating its time complexity. In the worst case, t = m = n, and thus the overall time complexity of algorithm TC is at most O(n³).
