
3.2.3 Time Clustering Phase

Recall that the time projection sequence of moving sequence MS_i is denoted as TP_{MS_i}, which is represented as TP_{MS_i} = <α_1, ..., α_n>, where MR^i_{α_j} ≠ {} and α_1 < ... < α_n. Once AMS is obtained, TP_{AMS} can easily be determined. By exploiting the feature of spatial-temporal locality, in this phase we develop algorithm TC to generate a clustered time projection sequence of AMS (i.e., CTP(TP_{AMS})).

In algorithm TC, two threshold values (i.e., δ and σ²) are given for clustering a time projection sequence. Explicitly, the value of δ is used to determine the density of clusters, and σ² is utilized to make sure that the spread of the time points in each cluster is bounded by σ². Algorithm TC is able to dynamically determine the number of groups in a time projection sequence.

Algorithm TC

input: time projection sequence TP_{AMS}, thresholds δ and σ²

output: clustered time projection sequence CTP(TP_{AMS})

1  begin
2    group the numbers whose differences are within δ;
3    mark all clusters;
4    while there exist marked clusters and δ >= 1
5      foreach marked cluster CL_i
6        if Var(CL_i) ≤ σ²
7          unmark CL_i;
8      δ = δ − 1;
9      forall marked clusters CL_i
10       group the numbers whose differences are within δ in CL_i;
11   end while
12   if there exist marked clusters
13     for each marked cluster CL_i
14       k = 1;
15       repeat
16         k++;
17         divide CL_i evenly into k groups;
18       until the spread degree of each group ≤ σ²;
19 end
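To make the control flow of algorithm TC concrete, the following is a minimal Python sketch of the procedure above. The function names (coarse_group, spread, divide_evenly, tc_cluster) are illustrative and not part of the original work; the spread degree of a group is taken to be its population variance, and re-grouping is skipped once δ reaches 0, matching the execution scenario in Table 3.2.

import math
from statistics import pvariance

def coarse_group(numbers, delta):
    # line 2: group numbers whose successive differences are within delta
    groups = [[numbers[0]]]
    for x in numbers[1:]:
        if x - groups[-1][-1] <= delta:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

def spread(group):
    # spread degree of a group, taken here to be its population variance
    return pvariance(group) if len(group) > 1 else 0.0

def divide_evenly(group, k):
    # line 17: divide a group evenly into k sub-groups of size ceil(len(group)/k)
    size = math.ceil(len(group) / k)
    return [group[i:i + size] for i in range(0, len(group), size)]

def tc_cluster(tp_ams, delta, sigma2):
    # a sketch of algorithm TC; returns the clustered time projection sequence
    clusters = coarse_group(sorted(tp_ams), delta)                             # lines 2-3
    marked = [spread(c) > sigma2 for c in clusters]

    # lines 4-11: lower delta step by step and re-cluster the clusters that stay marked
    while any(marked) and delta >= 1:
        marked = [m and spread(c) > sigma2 for c, m in zip(clusters, marked)]  # lines 5-7
        delta -= 1                                                             # line 8
        new_clusters, new_marked = [], []
        for c, m in zip(clusters, marked):                                     # lines 9-10
            if m and delta >= 1:
                for sub in coarse_group(c, delta):
                    new_clusters.append(sub)
                    new_marked.append(spread(sub) > sigma2)
            else:
                new_clusters.append(c)
                new_marked.append(m)
        clusters, marked = new_clusters, new_marked

    # lines 12-18: finely partition every cluster that still violates sigma2
    result = []
    for c, m in zip(clusters, marked):
        if m and spread(c) > sigma2:
            k = 1
            parts = [c]
            while not all(spread(p) <= sigma2 for p in parts):
                k += 1                                                         # line 16
                parts = divide_evenly(c, k)                                    # line 17
            result.extend(parts)
        else:
            result.append(c)
    return result

if __name__ == "__main__":
    # the time projection sequence and thresholds of Table 3.2
    tp_ams = [1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20]
    print(tc_cluster(tp_ams, delta=3, sigma2=1.6))
    # expected, per Table 3.2: [[1, 2, 3], [4, 5], [9, 10], [14], [17, 18, 20]]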

By grouping numbers together whenever the difference between two successive numbers is within the threshold value δ, algorithm TC (from line 2 to line 3) first coarsely clusters TP_{AMS} into several marked clusters. As pointed out before, CL_i denotes the i-th marked cluster. In order to ensure the quality of clusters, the variance of CL_i, denoted as Var(CL_i), is defined to measure the distribution of the numbers in cluster CL_i. Specifically, Var(CL_i) is the variance of a sequence of numbers; for a cluster CL_i = {x_1, ..., x_m} with mean x̄, Var(CL_i) is formulated as (1/m) Σ_{j=1}^{m} (x_j − x̄)².

From line 5 to line 7 in algorithm TC, for each cluster CL_i, if Var(CL_i) is not larger than σ², we unmark the cluster CL_i. Otherwise, we decrease δ by 1, and with the new value of δ, algorithm TC (from line 8 to line 10) re-clusters the numbers in the clusters that remain marked. Algorithm TC partitions the numbers of TP_{AMS} iteratively, with the objective of satisfying the two threshold values δ and σ², until there is no marked cluster or δ = 0. If no marked cluster remains, CTP(TP_{AMS}) is thus generated. Note, however, that if there are still marked clusters whose variance values are larger than σ², algorithm TC (from line 12 to line 18) will further finely partition these marked clusters so that the variance of every marked cluster is bounded by the threshold value σ². When the threshold value δ reaches 1, a marked cluster is usually a sequence of consecutive numbers whose variance is still larger than σ². To deal with this case, we derive the following lemma:
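For instance, for the cluster CL_i = {14, 17, 18, 20} (which arises in the execution scenario below), the mean is 17.25 and Var(CL_i) = ((14 − 17.25)² + (17 − 17.25)² + (18 − 17.25)² + (20 − 17.25)²)/4 = 18.75/4 ≈ 4.69.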

Lemma 1: Given a sequence of consecutive integers S_n with length n, the variance of S_n is (1/12)(n² − 1).

Proof:

Note that all sequences of consecutive integers with the same length have the same variance. For example, consider the two sequences of consecutive integers {1, 2, 3, 4, 5} and {7, 8, 9, 10, 11}; it can be verified that Var({1, 2, 3, 4, 5}) = Var({7, 8, 9, 10, 11}). Without loss of generality, consider S_n = {1, 2, ..., n}. The mean of S_n is (n + 1)/2, and (1/n) Σ_{i=1}^{n} i² = (n+1)(2n+1)/6. Hence,

Var(S_n) = (1/n) Σ_{i=1}^{n} i² − ((n+1)/2)² = (n+1)(2n+1)/6 − (n+1)²/4 = (n+1)(n−1)/12 = (1/12)(n² − 1).
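For instance, for n = 5, Lemma 1 gives (1/12)(5² − 1) = 2, which matches the value Var({1, 2, 3, 4, 5}) = 2 used in the execution scenario below.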

Property: Given a sequence of consecutive integers {1, 2, 3, ..., n} and a positive integer k, the optimal way of dividing {1, 2, 3, ..., n} into k clusters is to partition {1, 2, 3, ..., n} evenly into k clusters with each cluster size being ⌈n/k⌉.

Proof: By Lemma 1, the variance of a group of m consecutive integers is (1/12)(m² − 1), which is strictly increasing in m. For any partition of {1, 2, 3, ..., n} into k clusters of consecutive integers, the largest cluster contains at least ⌈n/k⌉ integers, and hence its variance is at least (1/12)(⌈n/k⌉² − 1). Dividing the sequence evenly, so that every cluster has size at most ⌈n/k⌉, attains this lower bound on the largest cluster variance. From the derivation above, the optimal way to divide {1, 2, 3, ..., n} into k clusters is therefore to divide it evenly into k clusters with each cluster size being ⌈n/k⌉.
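For example, dividing {1, 2, ..., 10} into k = 3 clusters yields clusters of size ⌈10/3⌉ = 4, 4, and 2 (i.e., {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}), whose variances by Lemma 1 are 1.25, 1.25, and 0.25, respectively.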

By the above property, given a marked cluster CL_i with n numbers, algorithm TC initially sets k to 1. Then, marked cluster CL_i is evenly divided into k groups, each of size ⌈n/k⌉. By increasing the value of k in each run, algorithm TC partitions the marked cluster until the variance of every group in the marked cluster CL_i is not larger than σ².

Consider the execution scenario in Table 3.2, where the time projection sequence is TP_{AMS} = <1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20>. Given σ² = 1.6 and δ = 3, algorithm TC first roughly partitions TP_{AMS} into three clusters.

Run  δ   σ²   Clusters of the time projection sequence
0    3   1.6  <{1, 2, 3, 4, 5, 9, 10, 14, 17, 18, 20}>
1    3   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14, 17, 18, 20}>
2    2   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
3    1   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
4    0   1.6  <{1, 2, 3, 4, 5}, {9, 10}, {14}, {17, 18, 20}>
5    0   1.6  <{1, 2, 3}, {4, 5}, {9, 10}, {14}, {17, 18, 20}>

Table 3.2: An execution scenario under algorithm TC.

It can be verified in Table 3.2 that two marked clusters (i.e., {1, 2, 3, 4, 5} with Var({1, 2, 3, 4, 5}) = 2 and {14, 17, 18, 20} with Var({14, 17, 18, 20}) = 4.69) are determined, because the variance values of these two clusters are larger than 1.6. Then, δ is reduced to 2, and these two marked clusters are examined again. Following the same procedure, algorithm TC partitions the marked clusters until δ reaches 1. As can be seen in Run 4 of Table 3.2, {1, 2, 3, 4, 5} is still a marked cluster with Var({1, 2, 3, 4, 5}) = 2. Therefore, algorithm TC finely partitions {1, 2, 3, 4, 5}. The value of k is initially set to 1. Since Var({1, 2, 3, 4, 5}) = 2 is larger than σ² (i.e., 1.6), k is increased to 2. Then, {1, 2, 3, 4, 5} is divided into {1, 2, 3} and {4, 5}.

Among these two clusters (i.e., {1, 2, 3} and {4, 5}), {1, 2, 3} has the larger variance and is thus compared with the value of σ². Since Var({1, 2, 3}) = 0.67 < 1.6, algorithm TC stops clustering. After the execution of algorithm TC, CTP(TP_{AMS}) is generated as <{1, 2, 3}, {4, 5}, {9, 10}, {14}, {17, 18, 20}>.
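As a quick check of the variance values quoted in this walk-through, a few lines of Python (a standalone snippet using the standard library's population variance) reproduce them:

from statistics import pvariance

print(pvariance([1, 2, 3, 4, 5]))   # 2.0
print(pvariance([14, 17, 18, 20]))  # 4.6875, i.e., about 4.69
print(pvariance([17, 18, 20]))      # about 1.56, not larger than 1.6, so {17, 18, 20} is unmarked
print(pvariance([1, 2, 3]))         # about 0.67 < 1.6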

The time complexity of algorithm TC is polynomial. Explicitly, let TP_{AMS} have n numbers. In line 2 of algorithm TC, it takes O(n) to roughly divide the sequence into t clusters. From line 4 to line 11 of algorithm TC, it takes O(δmt) to re-group the numbers, where m is the maximal number of elements in a cluster. From line 13 to line 19, assuming that there are still t clusters with m numbers each to be refined, it takes t · m · m = O(m²t) to run the refinement process. Since algorithm TC is a heuristic algorithm, we consider the worst case when estimating its time complexity. In the worst case, t = m = n, and thus the overall time complexity of algorithm TC is at most O(n³).
