An incremental algorithm for clustering spatial data streams: exploring temporal locality


DOI 10.1007/s10115-013-0636-8

REGULAR PAPER


Ling-Yin Wei · Wen-Chih Peng

Received: 8 September 2011 / Revised: 17 February 2013 / Accepted: 5 April 2013 / Published online: 26 April 2013

Abstract Clustering sensor data discovers useful information hidden in sensor networks. In sensor networks, a sensor has two types of attributes: a geographic attribute (i.e., its spatial location) and non-geographic attributes (e.g., sensed readings). Sensor data are periodically collected and viewed as spatial data streams, where a spatial data stream consists of a sequence of data points exhibiting attributes in both the geographic and non-geographic domains. Previous studies have developed a dual clustering problem for spatial data by considering similarity-connected relationships in both geographic and non-geographic domains. However, the clustering processes in stream environments are time-sensitive because of frequently updated sensor data. For sensor data, the readings from one sensor are similar for a period, a property referred to as temporal locality. Using the temporal locality features of the sensor data, this study proposes an incremental clustering (IC) algorithm to discover clusters efficiently. The IC algorithm comprises two phases: cluster prediction and cluster refinement. The first phase estimates the probability of two sensors belonging to a cluster from the previous clustering results. According to the estimation, a coarse clustering result is derived. The cluster refinement phase then refines the coarse result. This study evaluates the performance of the IC algorithm using synthetic and real datasets. Experimental results show that the IC algorithm outperforms existing approaches, confirming the scalability of the IC algorithm. In addition, the effect of temporal locality features on the IC algorithm is analyzed and thoroughly examined in the experiments.

Keywords Data mining · Dual clustering · Spatial data streams

L.-Y. Wei · W.-C. Peng (✉)

Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan

e-mail: [email protected] L.-Y. Wei


1 Introduction

With the growth of sensor network applications, such as traffic surveillance and weather monitoring, clustering sensor data can reveal valuable insights hidden in the collected data. A sensor generally has two types of attributes: a geographic attribute (i.e., the spatial location of a sensor) and non-geographic attributes (e.g., sensing readings). Solving a traditional clustering problem involves partitioning sensors into clusters according to their readings [21,39]. To explore more informative clusters, researchers have widely investigated constrained clustering problems in recent years [6,13,14,16,17,24,40]. Furthermore, several studies have investigated dual clustering problems by considering constraints in both geographic and non-geographic domains [12,22,25,26,35,37,45]. A reading detected by a sensor is updated regularly, and a series of sensing readings from a sensor can be viewed as a spatial data stream. However, previous studies have focused on clustering sensors without considering data stream environments. Researchers have proposed several clustering algorithms for data stream environments [1,5,7,9,18,19,23,27,30,33,34,44,47], but they only consider non-geographic attributes. The authors of [41,46] investigated spatial data stream clustering while considering only the geographic attribute. This study addresses a dual clustering problem for spatial data streams in which the attributes of both geographic and non-geographic domains are considered.

Previous research has presented a general dual clustering problem for non-stream environments [25,37] in which the number of clusters is pre-specified. However, for dual clustering problems in spatial data streams, the values in the non-geographic domain vary over time. General dual clustering problems are not suitable for data stream environments because the number of clusters usually changes over time according to the variation of data in the non-geographic domain. Hence, previous research has presented a dual clustering problem in spatial data streams [43]. That study presented a hierarchical-based clustering (HBC) algorithm for a dual clustering problem in spatial data streams without specifying the number of clusters.

The sensors discussed in this paper are fixed and therefore have no mobility. For example, to monitor the traffic status along a freeway, sensors are deployed and utilized to collect readings, such as vehicle speed and volume of traffic. Given a set of sensors with their locations, readings, and geographic and similarity constraints in both domains, the dual clustering problem in [43] clusters sensors into groups based on similarity-connected relationships, in which sensors have similar readings in a non-geographic domain under a geographic constraint. For example, considering the sensor data from a time stamp (i.e., with only one reading in the non-geographic domain) in Fig. 1a, the data of each sensor include a two-dimensional coordinate in the geographic domain and a reading in the non-geographic domain. Figure 1b shows the corresponding clustering result from the time stamp: with given geographic and similarity constraints in both domains, the sensors in the same cluster are connected by similar readings under the geographic constraint. Although the readings of the sensors in Cluster 3 and Cluster 4 are similar, these two clusters are not grouped together. This is because the sensors in these two clusters are far away from each other, failing to satisfy the geographic constraint.

Figure 2 shows an example of eight sensors, each of which has a two-dimensional coordinate in the geographic domain and detects a series of speed readings in the non-geographic domain. Given a time window size (e.g., $W = 5$), the readings of the sensors are divided into four non-overlapping time windows (i.e., $w_1$, $w_2$, $w_3$, and $w_4$). Exploring these clusters yields substantial benefits. First, the readings of the sensors in a cluster have similar speeds, and the clusters reveal traffic status. For example, sensors in a cluster with lower speeds


[Figure 1: panel (a) plots the sensors on the x and y axes of the geographic domain against a non-geographic domain axis; panel (b) shows the same data grouped into Clusters 1 to 4]

Fig. 1 An illustrative example of the dual clustering problem. a Sensor data distribution from a time stamp; b a clustering result from time stamp

[Figure 2: panel (a) plots sensors S1 to S8 in the geographic domain (x and y axes from 0 to 4); panel (b) plots Value (0 to 30) against Time stamp (5 to 20) for S1 to S8, with time windows w1 to w4 marked]

Fig. 2 An illustrative example of the dual clustering problem in spatial data streams. a Geographic domain; b non-geographic domain

typically indicate traffic jams near the sensors’ locations. Second, information derived from clusters facilitates data recovery [29,42]. Because sensors deployed in an outdoor environment can easily malfunction, they may not report their sensing readings if they fail to work. If several sensors are frequently clustered in a group, they are likely to detect similar traffic statuses. Once a particular sensor S fails to work, its missing traffic status can be inferred from the readings of other sensors that are usually clustered with sensor S. Third, the cluster results can be used for outlier detection [2,36]. If a traffic status sensed by sensor S is different from the traffic status sensed by other sensors that are usually clustered with sensor S, the sensor may be reporting an abnormal event or the detected data may be an outlier. As such, utilizing dual clustering in spatial data streams could improve the aforementioned scenarios, justifying the motivation of this paper.

The authors in [43] proposed the HBC algorithm to solve the dual clustering problem in spatial data streams, and the algorithm is performed in each time window. Because the whole algorithm is rerun for every window, the runtime overhead might be high. To deal with dynamic data environments efficiently, researchers have developed incremental techniques for different clustering problems in [3,4,11,15,20,27,28,31,34,37]. The authors in [11] devised an incremental clustering approach to the reconciliation of textual entities, and their method can efficiently de-duplicate volumes of data. Another study [31] proposes an incremental hierarchical co-clustering algorithm for high-dimensional text datasets, whereas the study [4] introduces an incremental algorithm to cluster XML documents sharing similar structures. However, the incremental clustering algorithms of these three studies are developed for textual data or documents and do not cope with the data in dual domains.

On the other hand, researchers have developed some incremental algorithms for clustering data streams in the non-geographic domain or clustering data streams in the geographic domain. For a dynamic environment in which objects are inserted and deleted over time, previous studies [15,28] presented incremental algorithms for density-based clustering problems. The authors in [20] proposed a semi-supervised incremental algorithm for the density-based clustering problem by exploiting available background knowledge. However, the algorithms proposed in [15,20,28] only cope with geographic data without considering non-geographic attributes. Given time series data, the authors in [27] developed an incremental technique for traditional k-means clustering algorithms and EM clustering algorithms. Another study [34] presents an incremental system to discover clusters where time series in the same cluster behave similarly. The authors of [3] introduced an incremental fuzzy clustering for bank customers’ transactions. However, these incremental clustering methods only deal with the data in the non-geographic domain without considering geographic attributes. Although the incremental algorithm in [37] is designed for a general dual clustering problem, it only deals with the data in non-stream environments. These incremental techniques were developed for clustering problems that are different from the dual clustering in spatial data streams addressed in this paper.

In the real world, the values of a sensor in the non-geographic domain usually have temporal locality features, meaning that the sensors’ readings are similar for a period. For example, Fig. 2b shows that the readings of sensor $S_4$ in time window $w_1$ are similar to those in time window $w_2$. Therefore, based on the temporal locality feature, this study proposes an incremental clustering (IC) algorithm that clusters objects roughly by inducing clustering results from previous time windows. To explore the temporal locality features, this study proposes a probability matrix in which the values indicate the probability that two sensors will belong to the same cluster. Sensors are clustered roughly according to the estimation, and the coarse clustering results are refined to satisfy the required constraints. This study presents numerous experiments conducted on both a real dataset and a synthetic dataset. To evaluate the effects of temporal locality features on the proposed algorithm, this study proposes a simulator framework to generate synthetic datasets by effectively simulating real world data. Using the proposed simulation, synthetic datasets can be generated by controlling the degree of temporal locality. To use the synthetic dataset generated by the proposed simulation effectively, this study adopts a statistical approach to estimate accurately the constraint in the non-geographic domain for dual clustering problems in spatial data streams under a user-specified tolerance. Experimental results confirm the effectiveness of this approach. This study also proposes an approach to estimate the degree of temporal locality of sensor data and demonstrates its effectiveness by assessing the results of the experiments. Based on these results, it is possible to evaluate the performance of the proposed algorithm and compare its efficiency to that of existing algorithms using synthetic and real datasets. Experimental results show that the proposed algorithm outperforms existing approaches, revealing the scalability of the algorithm. This study also analyzes the effects of temporal locality features on the algorithm. Based on the experimental results, this study presents guidelines for setting parameters in the proposed algorithm.

The remainder of this paper is organized as follows. Section 2 formally defines the dual clustering problem in spatial data streams. Section 3 presents the IC algorithm and analyzes its complexity. Section 4 presents the performance of the IC algorithm. Section 5 concludes the paper.


Table 1 Non-geographic and geographic attributes for objects in Fig. 2

ID  Location   Data points                                                    Time
S1  (1.3,2.5)  (23,25,22,21,23,22,21,23,24,25,21,23,23,20,22,21,23,23,21,25)  [1, 20]
S2  (2.0,2.0)  (22,24,23,20,24,23,22,24,23,23,24,22,21,24,20,24,21,23,22,23)  [1, 20]
S3  (2.8,1.5)  (22,25,23,20,25,22,21,25,24,21,20,22,20,23,23,20,21,24,20,22)  [1, 20]
S4  (3.5,0.8)  (21,20,21,19,19,21,21,19,20,20,22,23,23,20,22,24,23,23,23,20)  [1, 20]
S5  (1.5,1.5)  (15,12,12,10,13,13,14,14,9,10,5,5,9,7,6,7,9,7,9,10)            [1, 20]
S6  (1.7,2.7)  (6,7,5,6,3,3,6,5,4,7,7,6,3,8,4,7,8,5,5,6)                      [1, 20]
S7  (3.0,3.0)  (4,7,5,5,3,5,5,7,4,5,6,3,8,5,4,6,7,6,7,7)                      [1, 20]
S8  (1.0,2.0)  (5,7,4,4,4,4,6,6,3,3,2,4,7,4,4,7,7,8,7,4)                      [1, 20]

2 Preliminaries

This section presents the notation and formulates the dual clustering problem in spatial data streams. An object consists of non-geographic attributes and a geographic attribute and is denoted as $S_i$. Throughout the rest of this study, an object refers to a sensor. Given a particular time interval, the values of the non-geographic attributes of object $S_i$ are represented as a vector of data points $S_i.V_t$, where $S_i.V_t$ is the data point at time $t$ in the time interval. In addition, the location of $S_i$ is denoted as $(S_i.L_x, S_i.L_y)$, which represents the object’s position in a two-dimensional space. Without loss of generality, an object’s spatial position can be generalized to a high-dimensional space. In this paper, the locations of objects are fixed. Table 1 shows the values of the attributes in the geographic and the non-geographic domains based on the example in Fig. 2.

To describe the constraints of the dual clustering problem, this study first defines the dissimilarity between two objects in the non-geographic domain, and the physical distance between two objects in the geographic domain.

Definition 2.1 (Dissimilarity in the non-geographic domain) Given a time window $w = [t + 1, t + W]$ and two objects $S_i$ and $S_j$, the dissimilarity between $S_i$ and $S_j$ in time window $w$ is defined as

$$diss(S_i, S_j, w) = \sqrt{\frac{1}{W} \cdot \sum_{k=1}^{W} (S_i.V_{t+k} - S_j.V_{t+k})^2}.$$

Definition 2.2 (Physical distance in the geographic domain) Given two objects $S_i$ and $S_j$, the physical distance between $S_i$ and $S_j$ is defined as

$$ED(S_i, S_j) = \sqrt{(S_i.L_x - S_j.L_x)^2 + (S_i.L_y - S_j.L_y)^2}.$$
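As a quick check of Definitions 2.1 and 2.2, the following Python sketch evaluates both measures for three sensors of Table 1 over the first five time stamps. It assumes that the dissimilarity is the square root of the mean squared difference, which is the reading that reproduces the values quoted for these pairs in Sect. 3.1. The names `diss`, `ed`, `readings_w1`, and `locations` are illustrative and do not come from the paper.

```python
import math

# Readings in time window w1 = [1, 5] and locations, copied from Table 1.
readings_w1 = {
    "S1": [23, 25, 22, 21, 23],
    "S2": [22, 24, 23, 20, 24],
    "S3": [22, 25, 23, 20, 25],
}
locations = {"S1": (1.3, 2.5), "S2": (2.0, 2.0), "S3": (2.8, 1.5)}


def diss(v_i, v_j):
    """Dissimilarity of Definition 2.1 (assumed: root of the mean squared difference)."""
    w = len(v_i)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)) / w)


def ed(loc_i, loc_j):
    """Physical (Euclidean) distance of Definition 2.2."""
    return math.hypot(loc_i[0] - loc_j[0], loc_i[1] - loc_j[1])


d12 = diss(readings_w1["S1"], readings_w1["S2"])
d13 = diss(readings_w1["S1"], readings_w1["S3"])
e12 = ed(locations["S1"], locations["S2"])
e13 = ed(locations["S1"], locations["S3"])
print(round(d12, 2), round(d13, 2), round(e12, 2), round(e13, 2))
```

The rounded results (1.0, 1.18, 0.86, and 1.8) agree with the approximate values the paper reports for the pairs $(S_1, S_2)$ and $(S_1, S_3)$.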

This study employs the most common dissimilarity measure (i.e., the average Euclidean distance) to focus on the dual clustering problem in spatial data streams. Depending on an application’s requirements, other dissimilarity measures can be used. Based on these definitions, this study presents a concept called the directly similarity-connected relationship (i.e., directly SC-relationship), which indicates both the similarity and the connectivity between two objects.

Definition 2.3 (Directly similarity-connected relationship) Given a geographic constraint $R$ and a similarity constraint $\varepsilon$, two objects $S_i$ and $S_j$ have a directly SC-relationship in time window $w$ if $diss(S_i, S_j, w) \le \varepsilon$ and $ED(S_i, S_j) \le R$.


[Figure 3: sensors S1 to S8 plotted on x and y axes (0 to 4), with the geographic constraint R marked]

Fig. 3 A similarity-connected relationship between $S_1$ and $S_4$

Clearly, if two objects have a directly SC-relationship, they should have similar non-geographic attributes and the physical distance between them should satisfy the given geographic constraint. In the example of sensor networks used to monitor freeway traffic, sensors in proximity that have similar readings are likely to have a directly SC-relationship. Thus, based on the directly SC-relationship, this study defines a similarity-connected relationship (abbreviated as SC-relationship) as follows.

Definition 2.4 (Similarity-connected relationship) Given a geographic constraint $R$ and a similarity constraint $\varepsilon$, two objects $S_i$ and $S_j$ have a similarity-connected relationship in time window $w$ if there exists a chain of objects $S_i = S_{l_1}, S_{l_2}, \ldots, S_{l_{q-1}}, S_{l_q} = S_j$ such that, in time window $w$, the following conditions are satisfied: (1) for $1 \le k, h \le q$, $diss(S_{l_k}, S_{l_h}, w) \le \varepsilon$; and (2) for $1 \le k < q$, $S_{l_k}$ and $S_{l_{k+1}}$ have a directly SC-relationship.

Given the objects in Table 1, geographic constraint $R = 1$, similarity constraint $\varepsilon = 10$, and time window size $W = 5$, objects $S_1$ and $S_4$ in Fig. 3 have an SC-relationship in time window $w_1 = [1, 5]$, because their dissimilarity is within the given similarity constraint and a chain (i.e., $S_1, S_2, S_3, S_4$) satisfies the SC-relationship. Although the physical distance between $S_1$ and $S_4$ is larger than the geographic constraint $R$, $S_1$ and $S_4$ are similarity-connected. This reveals that two objects might have an SC-relationship even if the physical distance between them does not satisfy the geographic constraint. In this example, the clusters in $w_1$ are $\{S_1, S_2, S_3, S_4\}$, $\{S_5\}$, $\{S_6, S_8\}$, and $\{S_7\}$.
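The chain argument for $S_1$ and $S_4$ can be checked mechanically. The sketch below tests the two conditions of Definition 2.4 for a given chain, using the Table 1 data for $w_1$. It assumes the dissimilarity is the root of the mean squared difference, and the helper name `chain_is_sc` is hypothetical.

```python
import math
from itertools import combinations

# (location, readings in w1 = [1, 5]) for the four sensors, copied from Table 1.
objects = {
    "S1": ((1.3, 2.5), [23, 25, 22, 21, 23]),
    "S2": ((2.0, 2.0), [22, 24, 23, 20, 24]),
    "S3": ((2.8, 1.5), [22, 25, 23, 20, 25]),
    "S4": ((3.5, 0.8), [21, 20, 21, 19, 19]),
}


def diss(v_i, v_j):
    # Assumed form of Definition 2.1: root of the mean squared difference.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)) / len(v_i))


def ed(loc_i, loc_j):
    return math.hypot(loc_i[0] - loc_j[0], loc_i[1] - loc_j[1])


def chain_is_sc(chain, eps, R):
    """Check both conditions of Definition 2.4 for one candidate chain."""
    # Condition (1): every pair of objects on the chain is similar.
    pairwise_ok = all(
        diss(objects[a][1], objects[b][1]) <= eps for a, b in combinations(chain, 2)
    )
    # Condition (2): consecutive objects are within R of each other
    # (their similarity is already covered by condition (1)).
    links_ok = all(
        ed(objects[a][0], objects[b][0]) <= R for a, b in zip(chain, chain[1:])
    )
    return pairwise_ok and links_ok


via_chain = chain_is_sc(["S1", "S2", "S3", "S4"], eps=10, R=1)
direct = chain_is_sc(["S1", "S4"], eps=10, R=1)
print(via_chain, direct)
```

Under these assumptions the four-object chain passes, while the two-object chain fails because $ED(S_1, S_4)$ is roughly 2.78, which exceeds $R$.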

Based on the SC-relationship, the dual clustering problem for spatial data streams is formulated as follows.

A dual clustering problem in spatial data streams: Given a time window size $W$, a similarity constraint $\varepsilon$, and a geographic constraint $R$, the goal is to cluster the objects into several groups in each time window such that objects in the same group should have SC-relationships. Note that a series of values of an object’s non-geographic attributes are partitioned into consecutive non-overlapping time windows, and the objects in each time window are clustered. A cluster in which the objects have SC-relationships is called an SC-cluster.
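The partitioning into consecutive non-overlapping time windows can be sketched as follows for the readings of $S_1$ in Table 1. The function name `split_into_windows` is illustrative, and dropping an incomplete trailing window is an assumption, since the paper does not say how a partial window is handled.

```python
def split_into_windows(values, window_size):
    """Partition a reading sequence into consecutive non-overlapping windows."""
    # Assumption: readings that do not fill a whole window are ignored.
    usable = len(values) - len(values) % window_size
    return [values[k:k + window_size] for k in range(0, usable, window_size)]


# Readings of S1 over time stamps 1 to 20, copied from Table 1.
s1 = [23, 25, 22, 21, 23, 22, 21, 23, 24, 25,
      21, 23, 23, 20, 22, 21, 23, 23, 21, 25]

windows = split_into_windows(s1, window_size=5)
for index, window in enumerate(windows, start=1):
    print(f"w{index}: {window}")
```

With $W = 5$ this yields the four windows $w_1$ to $w_4$ used in the running example.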

3 Dual clustering for spatial data streams

This section presents a graph structure to capture similarity relationships among objects. After exploring the temporal locality of attributes in the non-geographic domain, this study proposes an incremental clustering (IC) algorithm to improve the efficiency of dual clustering in spatial data streams. This section also presents the derivation of the time complexity of the IC algorithm.

Fig. 4 An illustrative example of a graph representation. a Graph representation; b clustering result

3.1 Graph representation

Given a set of objects, a Similarity-Connected Graph (abbreviated as SC-graph) is utilized to capture the directly SC-relationships and SC-relationships among objects in each time window. Each vertex in an SC-graph represents an object, and there are two types of edges: explicit and implicit edges. If two objects, $S_i$ and $S_j$, have a directly SC-relationship, an explicit edge, denoted as $e_e(S_i, S_j)$, exists between them. Conversely, if $S_i$ and $S_j$ have similar non-geographic attributes but the physical distance between them is larger than the geographic constraint (i.e., $ED(S_i, S_j) > R$), an implicit edge, denoted as $e_i(S_i, S_j)$, exists between the objects. For example, given the objects’ attributes in Table 1, if $R = 1$, $\varepsilon = 10$, and $W = 5$, the SC-graph in the first time window (i.e., $w_1$) is shown in Fig. 4a, in which the solid line and the dotted line represent an explicit edge and an implicit edge, respectively. The edge between $S_1$ and $S_2$ is an explicit edge, meaning that $S_1$ and $S_2$ have a directly SC-relationship, because $diss(S_1, S_2, w_1) \approx 1 < \varepsilon$ and $ED(S_1, S_2) \approx 0.86 < R$. Conversely, the edge between $S_1$ and $S_3$ is an implicit edge because $ED(S_1, S_3) \approx 1.80 > R$ though $diss(S_1, S_3, w_1) \approx 1.2 < \varepsilon$. The weight of an edge represents the dissimilarity between two objects in the non-geographic domain.
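The edge rules of the SC-graph can be expressed as a small classifier. The sketch below is an illustration rather than the authors' implementation: `edge_type` is a hypothetical helper, the dissimilarity is assumed to be the root of the mean squared difference, and the data are the $w_1$ readings from Table 1.

```python
import math

# (location, readings in w1 = [1, 5]) copied from Table 1.
objects = {
    "S1": ((1.3, 2.5), [23, 25, 22, 21, 23]),
    "S2": ((2.0, 2.0), [22, 24, 23, 20, 24]),
    "S3": ((2.8, 1.5), [22, 25, 23, 20, 25]),
    "S5": ((1.5, 1.5), [15, 12, 12, 10, 13]),
}


def diss(v_i, v_j):
    # Assumed form of Definition 2.1: root of the mean squared difference.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)) / len(v_i))


def ed(loc_i, loc_j):
    return math.hypot(loc_i[0] - loc_j[0], loc_i[1] - loc_j[1])


def edge_type(a, b, eps, R):
    """Return 'explicit', 'implicit', or None (no edge) for objects a and b."""
    (loc_a, val_a), (loc_b, val_b) = objects[a], objects[b]
    if diss(val_a, val_b) > eps:
        return None  # dissimilar readings: no edge of either type
    return "explicit" if ed(loc_a, loc_b) <= R else "implicit"


print(edge_type("S1", "S2", eps=10, R=1))
print(edge_type("S1", "S3", eps=10, R=1))
print(edge_type("S1", "S5", eps=10, R=1))
```

For the two pairs discussed in the text, the classifier returns an explicit edge for $(S_1, S_2)$ and an implicit edge for $(S_1, S_3)$; the pair $(S_1, S_5)$ is included only to show the no-edge case.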

According to an SC-graph, a clustering result consists of subgraphs that must satisfy two requirements: 1) the vertices of a subgraph must be connected through explicit edges; and 2) the subgraph must be complete through both explicit and implicit edges. Note that two vertices connected by an implicit edge do not imply an SC-relationship. However, when the two vertices are connected by a simple path comprised of explicit edges, they have an SC-relationship. Thus, the first requirement is that each subgraph must guarantee the connectivity of vertices through explicit edges. For example, Fig. 4b shows the clustering result for the graph in Fig. 4a. In Fig. 4b, the clustering result includes four subgraphs, $\{S_1, S_2, S_3, S_4\}$, $\{S_5\}$, $\{S_6, S_8\}$, and $\{S_7\}$, all of which fulfill the aforementioned two requirements. Note that a subgraph refers to an SC-cluster in this paper.
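The two requirements on a subgraph can be tested directly once the edge sets are known. The following sketch is one possible check, with a hypothetical function name (`is_sc_cluster`) and hypothetical toy edge sets over vertices A, B, and C that are not taken from the paper.

```python
from itertools import combinations


def is_sc_cluster(vertices, explicit, implicit):
    """Check the two SC-cluster requirements on a vertex set.

    explicit and implicit are sets of frozenset({u, v}) edges.
    """
    vertices = list(vertices)
    if not vertices:
        return False
    # Requirement 2: complete through explicit and implicit edges together.
    all_edges = explicit | implicit
    if any(frozenset(pair) not in all_edges for pair in combinations(vertices, 2)):
        return False
    # Requirement 1: connected through explicit edges only (depth-first search).
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        u = stack.pop()
        for v in vertices:
            if v not in seen and frozenset((u, v)) in explicit:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(vertices)


# Hypothetical toy graphs for illustration.
explicit_1 = {frozenset("AB"), frozenset("BC")}
implicit_1 = {frozenset("AC")}
explicit_2 = {frozenset("AB")}
implicit_2 = {frozenset("AC"), frozenset("BC")}

valid = is_sc_cluster("ABC", explicit_1, implicit_1)
invalid = is_sc_cluster("ABC", explicit_2, implicit_2)
print(valid, invalid)
```

In the second toy graph the subgraph is complete, but C is reachable only through implicit edges, so it fails the first requirement.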

3.2 Incremental clustering algorithm

Because of the feature of streams (i.e., dynamic and rapid generation of data records), the time of clustering procedures should be as short as possible in stream environments. A previous study [43] proposed a hierarchical-based clustering (HBC) algorithm for dual clustering in spatial data streams. That study executed the HBC algorithm in each time window. Appendix A presents the HBC algorithm. This study proposes the IC algorithm, an incremental algorithm, to efficiently derive cluster results. To reiterate, the non-geographic attributes of objects usually have temporal locality features, meaning that their values are similar for a period. For example, Fig. 5 shows the speed readings of the sensors along a freeway [38], where the X-axis represents the time, the Y-axis represents the locations of the linearly deployed sensors, and the Z-axis represents the sensors’ speed readings. The speeds of a sensor are similar within a period, demonstrating the temporal locality features of the non-geographic attributes of objects. Consequently, the clustering results of adjacent time windows are similar. The derived clustering results reveal which objects are frequently clustered together. As Fig. 2 shows, a cluster $\{S_6, S_8\}$ is discovered in time windows $w_1$ and $w_2$, and a cluster $\{S_5, S_6, S_8\}$ is discovered in time windows $w_3$ and $w_4$. This phenomenon indicates that, among the objects $S_5$, $S_6$, and $S_8$, the pair $S_6$ and $S_8$ is the most likely to be in the same cluster. Therefore, based on the temporal locality features, the IC algorithm uses the clustering results in the previous time windows to improve the efficiency of the clustering procedures.

[Figure 5: three-dimensional plot of Speed (in km/hr) against Time (02:00 to 08:00) and Location (in km)]

Fig. 5 The temporal locality feature in the real dataset

Algorithm 1: Incremental Clustering (IC) Algorithm

input : A set of objects $S_{TD}$, a similarity constraint $\varepsilon$, a geographic constraint $R$, a window size $W$, a correlation factor $\alpha$, a probability threshold $\theta$, and the time interval $[t_s, t_e]$

output : A set of SC-clusters $C_{w_k}$ with respect to time window $w_k$

1 Do algorithm HBC in $w_1$ and then generate $R_{w_1}$ and $P_{w_1}$;
2 for each time window $w_k = [t_s + k \cdot W, t_s + (k + 1) \cdot W]$ where $1 \le k \le \frac{t_e - t_s}{W}$ do
3     Generate coarse clusters $C_{w_k}$ by $P_{w_{k-1}}$ and $\theta$;
4     Split each cluster of $C_{w_k}$ into SC-clusters if it is not an SC-cluster;
5     Use the neighbor list to merge SC-clusters until the number of SC-clusters does not decrease;
6     Generate $R_{w_k}$ by $C_{w_k}$;
7     Generate $P_{w_{k+1}}$ by $P_{w_{k+1}} = (1 - \alpha)P_{w_k} + \alpha R_{w_k}$;
8 end
9 return $R_{w_k}$;

The IC algorithm has two main phases: cluster prediction and cluster refinement. Specifically, the IC algorithm predicts which objects to cluster together using the previous clustering results and then refines the coarse clusters to derive SC-clusters by verifying the relationships among objects. To predict the clustering results of the subsequent time windows, two matrices record the clustering results and estimate the probability of being clustered together, respectively.

Let $N$ denote the number of objects. For time window $w_h$, the clustering result of objects is recorded using an upper triangular matrix $R_{w_h}$, where $R_{w_h}$ is an $N \times N$ matrix. The value of each element in the matrix is as follows:

$$r_{i,j,h} = \begin{cases} 1 & \text{if } i \le j \text{, and objects } S_i \text{ and } S_j \text{ are in the same cluster} \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$
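As an illustration of Formula 1, the sketch below builds the upper triangular result matrix for the clustering of $w_1$ reported in Sect. 2. The function name `result_matrix` and the use of nested Python lists are choices made here, not details from the paper; objects are numbered from 1 as in the text, while list indices start at 0.

```python
def result_matrix(n, clusters):
    """Build the N x N upper triangular matrix R of Formula 1.

    clusters is a list of sets of 1-based object numbers.
    """
    r = [[0] * n for _ in range(n)]
    for cluster in clusters:
        for i in cluster:
            for j in cluster:
                if i <= j:  # only the upper triangle (and diagonal) is filled
                    r[i - 1][j - 1] = 1
    return r


# Clusters in w1 as stated in the text: {S1,S2,S3,S4}, {S5}, {S6,S8}, {S7}.
clusters_w1 = [{1, 2, 3, 4}, {5}, {6, 8}, {7}]
r_w1 = result_matrix(8, clusters_w1)
for row in r_w1:
    print(row)
```

The entry for $(S_6, S_8)$ is 1, its mirror entry below the diagonal stays 0, and every diagonal entry is 1 because each object shares a cluster with itself.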

Because the non-geographic attributes of objects have temporal locality features, the objects that are clustered in previous time windows are likely to be in the same cluster in subsequent time windows. Therefore, based on the previous clustering results, a probability matrix can be used to predict future clustering results. The probability matrix for time window $w_h$ is expressed by an upper triangular matrix $P_{w_h}$, where $P_{w_h}$ is an $N \times N$ matrix. For time window $w_h$, an element $p_{i,j,h}$ in $P_{w_h}$ represents the probability that $S_i$ and $S_j$ will be in the same cluster. The element $p_{i,j,h}$ can be estimated from the previous clustering results. The probability matrix $P_{w_h}$ is defined as $P_{w_h} = (1 - \alpha)P_{w_{h-1}} + \alpha R_{w_{h-1}}$, where $\alpha$ represents a temporal correlation and $0 \le \alpha \le 1$. Thus, the values of elements in $P_{w_h}$ can be determined as follows:

$$p_{i,j,h} = \begin{cases} (1 - \alpha)p_{i,j,h-1} + \alpha r_{i,j,h-1} & \text{if } i < j \\ 1 & \text{if } i = j \\ 0 & \text{if } i > j. \end{cases} \qquad (2)$$

Initially, in $P_{w_1}$, $p_{i,j,1} = 1$ for $i = j$; otherwise, $p_{i,j,1} = 0$. The value of $\alpha$ determines the weight of the clustering result in the most recent time window.
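The update of Formula 2 is an exponentially weighted average of past clustering results. A minimal sketch follows, using the $w_1$ clusters stated in Sect. 2 and $\alpha = 0.5$ as in the running example. The function name `next_probability_matrix` and the list-of-lists representation are assumptions made for illustration.

```python
def next_probability_matrix(p_prev, r_prev, alpha):
    """Apply Formula 2 element by element."""
    n = len(p_prev)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < j:
                p[i][j] = (1 - alpha) * p_prev[i][j] + alpha * r_prev[i][j]
            elif i == j:
                p[i][j] = 1.0
            # entries with i > j stay 0
    return p


n = 8
# R for w1, built from the clusters {S1,S2,S3,S4}, {S5}, {S6,S8}, {S7}.
clusters_w1 = [{1, 2, 3, 4}, {5}, {6, 8}, {7}]
r_w1 = [[0] * n for _ in range(n)]
for cluster in clusters_w1:
    for i in cluster:
        for j in cluster:
            if i <= j:
                r_w1[i - 1][j - 1] = 1

# P for w1: 1 on the diagonal, 0 elsewhere.
p_w1 = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

p_w2 = next_probability_matrix(p_w1, r_w1, alpha=0.5)
print(p_w2[5][7])  # entry for (S6, S8)
```

The entry for $(S_6, S_8)$ becomes 0.5, which is the value $p_{6,8,2}$ used in the running example of the paper.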

After defining the two matrices, this study presents the IC algorithm in detail. Initially, this algorithm applies the HBC algorithm to derive SC-clusters in the first time window. The probability matrix in time window $w_2$ can be generated by $R_{w_1}$ and $P_{w_1}$. As mentioned previously, in a subsequent time window $w_h$, the IC algorithm coarsely clusters objects based on the probability matrix $P_{w_h}$ and then refines the coarse clusters to derive SC-clusters. In the matrix $P_{w_h}$, if $p_{i,j,h}$ is large, the corresponding objects (for example, $S_i$ and $S_j$) are likely to have similar values in the non-geographic domain. Thus, if $p_{i,j,h}$ is larger than a given probability threshold $\theta$, the objects form a cluster. These clusters are coarse clusters because the SC-relationships among objects in the same coarse cluster must be verified. As in hierarchical clustering techniques, if the probability value is greater than $\theta$, iteratively merge the two objects with the maximal probability value. This procedure clusters objects in a bottom-up manner. The merging order is recorded and will be used to refine coarse clusters later.
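One way to realize the cluster prediction phase is sketched below. This is an interpretation, not the authors' code: it merges pairs in decreasing order of probability with a union-find structure (a single-linkage style choice the paper does not spell out), and it accepts pairs with probability at least $\theta$, following the running example in which $p_{6,8,2} = 0.5 \ge \theta$. The function name `coarse_clusters` is hypothetical, and objects are indexed from 0.

```python
def coarse_clusters(p, theta):
    """Group objects whose pairwise probability reaches theta.

    Returns (clusters, order), where order lists the merged pairs in the
    sequence in which they were merged.
    """
    n = len(p)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Candidate pairs from the upper triangle, highest probability first.
    pairs = sorted(
        ((p[i][j], i, j) for i in range(n) for j in range(i + 1, n) if p[i][j] >= theta),
        key=lambda item: -item[0],
    )
    order = []
    for _, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i
            order.append((i, j))

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values()), order


# P for w2 with alpha = 0.5, derived from the w1 clusters
# {S1,S2,S3,S4}, {S5}, {S6,S8}, {S7} (indices 0 to 7).
n = 8
p_w2 = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for i, j in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (5, 7)]:
    p_w2[i][j] = 0.5

clusters, order = coarse_clusters(p_w2, theta=0.5)
print(clusters)
print(order)
```

With these inputs the coarse clusters are the index groups for $\{S_1, S_2, S_3, S_4\}$, $\{S_5\}$, $\{S_6, S_8\}$, and $\{S_7\}$, and the recorded order is what a later refinement step would follow.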

To refine coarse clusters, the IC algorithm verifies whether the objects in each cluster have SC-relationships by assessing the non-geographic attributes of each pair of objects. It is only necessary to compute the similarity of the objects’ non-geographic attributes because the location of each object does not change over time. The order for computing the similarity is the same as that used to derive coarse clusters, indicating the similarity degree of objects in the probability matrix. A higher value in the probability matrix means that corresponding objects are likely to have similar values in the non-geographic domain. Thus, the similarity of the non-geographic attributes of objects can be derived by following the order used to derive coarse clusters. An SC-graph is built for each coarse cluster in a similar manner. After computing the similarity between objects, it is easy to identify the corresponding edge type. Recall that SC-clusters must satisfy two requirements: 1) the vertices of a subgraph must be connected through explicit edges; and 2) a subgraph must be complete through explicit and implicit edges. These two criteria are used to derive SC-clusters when computing the similarity of objects.

Fig. 6 A running example of the IC algorithm from $w_1$ to $w_2$ in Fig. 2

Given the data in Fig. 2, $W = 5$, $\alpha = 0.5$ and $\theta = 0.5$, Fig. 6 shows a running example of the IC algorithm. This algorithm applies the HBC algorithm to generate SC-clusters in the first time window. According to the clustering result in time window $w_1$, the matrix $R_{w_1}$ is derived, and $P_{w_2}$ is then calculated using Formula 2. In time window $w_2$, the algorithm predicts clusters using $P_{w_2}$ and refines the coarse clusters to derive SC-clusters using a split-and-merge process. For example, in the cluster prediction phase, $S_6$ and $S_8$ form a coarse cluster because $p_{6,8,2} = 0.5 \ge \theta$.

After generating coarse clusters, verify the SC-relationships among objects in each coarse cluster and split unqualified clusters into several SC-clusters. This example only assesses clusters $\{S_1, S_2, S_3, S_4\}$ and $\{S_6, S_8\}$ for the split steps, because other clusters only have one member. After splitting unqualified coarse clusters, perform bottom-up merging to derive a final clustering result. As Fig. 6 shows, except for cluster $\{S_7\}$, any two clusters are possibly merged in time window $w_2$. This is because the object $S_7$ is far away from other objects without satisfying the geographic constraint; therefore, the cluster $\{S_7\}$ does not form an SC-cluster with other objects. Because the geographic distances between any two objects were calculated in the first time window, cluster refinement in the following time windows can use the geographic information without an additional computation. The SC-clusters in time window $w_2$ are generated after merging clusters. Next, generate $R_{w_2}$ and $P_{w_3}$. Similarly, the IC algorithm performs cluster prediction and cluster refinement in the remaining time windows (Fig. 6).

As a detailed example of refining a coarse cluster, assume that a coarse cluster in time window $w_h$ has a set of objects $\{S_1, S_2, S_3, S_4, S_5, S_6\}$ and the order of deriving the coarse cluster is $\langle \{S_2, S_4\}, \{S_2, S_5\}, \{S_3, S_4\}, \{S_5, S_6\}, \{S_1, S_2\} \rangle$. Based on this order, calculate the similarity of the objects’ non-geographic attributes. Figure 7 lists the rounds necessary for computing the similarity among objects in the same coarse cluster and shows the corresponding ground truth of the SC-graph. In Fig. 7, each object is initially viewed as an individual cluster before calculating the similarity between object $S_2$ and object $S_4$ (that is, $diss(S_2, S_4, w_h)$). Based on the ground truth, an explicit edge (that is, $e_e(S_2, S_4)$) is generated and the SC-cluster $\{S_2, S_4\}$ is derived. Next, compute the similarity of objects $S_2$ and $S_5$. Although the explicit edge $e_e(S_2, S_5)$ can be identified, the similarity between $S_5$ and $S_4$ should still be verified. However, as shown by the ground truth, no edge exists between $S_5$ and $S_4$. Thus, $S_5$ cannot be merged with the SC-cluster that contains $S_2$ and $S_4$. In Round 3, choose and mark object $S_3$ and assess the relationships between objects $S_3$ and $S_4$. Because there is an edge $e_e(S_2, S_4)$ and the SC-cluster satisfies the aforementioned requirements, $\{S_2, S_3, S_4\}$ is an SC-cluster. This procedure generates SC-clusters similar to the example in Fig. 7, where a coarse cluster is split into two SC-clusters, $\{S_1, S_5, S_6\}$ and $\{S_2, S_3, S_4\}$.

Fig. 7 An example of refining a coarse cluster

After splitting unqualified coarse clusters, the IC algorithm performs merging processes to derive final clustering results by assessing the SC-relationships among objects in different clusters. The physical distances between objects were calculated in the first time window, and each object has a list of objects whose physical distances from it are less than the given geographic constraint. Therefore, each object in an SC-cluster can identify any explicit edges between itself and objects in different SC-clusters. Note that if two SC-clusters are split from the same coarse cluster, it is not necessary to assess them for this type of merging.
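The per-object neighbor lists mentioned above depend only on the fixed locations, so they can be computed once. The sketch below builds them for the Table 1 locations with $R = 1$. The function name `neighbor_lists` is hypothetical, and the comparison uses $ED \le R$ to match Definition 2.3 (the sentence above says "less than", so the boundary case is an assumption).

```python
import math

# Locations copied from Table 1.
locations = {
    "S1": (1.3, 2.5), "S2": (2.0, 2.0), "S3": (2.8, 1.5), "S4": (3.5, 0.8),
    "S5": (1.5, 1.5), "S6": (1.7, 2.7), "S7": (3.0, 3.0), "S8": (1.0, 2.0),
}


def ed(loc_i, loc_j):
    return math.hypot(loc_i[0] - loc_j[0], loc_i[1] - loc_j[1])


def neighbor_lists(locations, R):
    """For each object, list the other objects within physical distance R."""
    return {
        a: [b for b in locations if b != a and ed(locations[a], locations[b]) <= R]
        for a in locations
    }


neighbors = neighbor_lists(locations, R=1)
print(neighbors["S1"])
print(neighbors["S7"])
```

The list for $S_7$ is empty, which is consistent with the observation in the running example that $S_7$ cannot join an SC-cluster with any other object.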

3.3 Analysis of the IC algorithm

This section first proves in Theorem 3.1 that the value of every element of a probability matrix is between 0 and 1 and then derives the time complexity of the IC algorithm. This section also derives the time complexity of the HBC algorithm for comparison purposes.

Theorem 3.1 The value of every element of a probability matrix P_w_h is between 0 and 1. Proof If h= 1, it is trivial to show that the value of every element of the probability matrix

is between 0 and 1. For h> 1, divide elements in the probability matrix into three parts and prove that the value of every element in each case is between 0 and 1, respectively. Case 1: for each i> j, the value of the element pi, j,hof the probability matrix Pwh is 0 by Formula 2. Case 2: for each i = j, the value of the element pi, j,hof the probability matrix P_w_h is 1 by Formula2. Case 3: for each i < j, Formula2shows that the value of the element in the probability matrix P_w_his pi, j,h= (1 − α)pi, j,h−1+ αri, j,h−1for i< j. It is then possible to derive that pi, j,h= α((1 − α)h−2ri, j,1+ (1 − α)h−3ri, j,2+ · · · +ri, j,h−1) ≤ 1 − (1 − α)h−1 because 0≤ ri, j,k ≤ 1 for each 1 ≤ k ≤ h −1 by Formula1. Because 0≤ α ≤ 1, pi, j,h≤ 1. By contrast, pi, j,h ≥ 0 by Formula2. Hence, the value of every element of a probability

matrix is between 0 and 1.
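The bound in Case 3 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it iterates the recurrence quoted in the proof for a single pair $i < j$. The function names and the sample values of $r_{i,j,k}$ are hypothetical, and the starting value $p_{i,j,1} = 0$ is an assumption chosen to match the closed form in the proof.

```python
def update_probability(p_prev: float, r_prev: float, alpha: float) -> float:
    """One step of p_h = (1 - alpha) * p_{h-1} + alpha * r_{h-1} for a pair i < j."""
    return (1.0 - alpha) * p_prev + alpha * r_prev


def probability_sequence(r_values, alpha, p_initial=0.0):
    """Return [p_1, ..., p_h] given r_1..r_{h-1}, with each r_k in [0, 1].

    p_initial = 0 is an assumption that matches the closed form in the proof.
    """
    probs = [p_initial]
    for r in r_values:
        probs.append(update_probability(probs[-1], r, alpha))
    return probs


if __name__ == "__main__":
    alpha = 0.5
    r_values = [1.0, 1.0, 0.0, 1.0]  # hypothetical r_{i,j,1..4}
    for h, p in enumerate(probability_sequence(r_values, alpha), start=1):
        bound = 1.0 - (1.0 - alpha) ** (h - 1)  # upper bound from Theorem 3.1
        print(f"h={h}  p={p:.4f}  bound={bound:.4f}")
```

In this run every value stays within $[0, 1 - (1-\alpha)^{h-1}]$, as the theorem states.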

We next analyze the time complexity of the IC algorithm and that of the HBC algorithm. The time complexity of the HBC algorithm is $O(N^2 T_{ED} + F(N^2 W T_{diss} + |E|\log|E| + |E|c_{max}^2))$, where $N$ is the number of objects; $F$ is the number of time windows; $W$ is the window size; $T_{diss}$ represents the time required to compute the dissimilarity between two objects; $T_{ED}$ represents the time required to compute the physical distance between two objects; $c_{max}$ denotes the number of objects in the cluster that has the maximal number of objects; and $|E|$ is the number of explicit edges among $N$ objects. The cost of computing the dissimilarity and the physical distance between any two objects is $O(N^2(W T_{diss} + T_{ED}))$, the cost of sorting the explicit edges in Q according to their dissimilarity values is $O(|E|\log|E|)$, and the cost of merging clusters is $O(|E|c_{max}^2)$.

The IC algorithm derives SC-clusters by refining each coarse cluster, and $R_{w_h}$ is updated. The IC algorithm utilizes the probability matrix to derive coarse clusters; therefore, the cost of refining each cluster is lower because the number of objects in coarse clusters is bounded. The best case of the IC algorithm involves the predicted $M$ coarse clusters fully satisfying the requirements of SC-clusters in every time window after the first window. Thus, the time complexity of the IC algorithm is $O(T_{HBC} + (F-1)(N^2 + (Mc_{max}^2 + M^2 + M\tau + \tau^2)W T_{diss}))$, where $T_{HBC}$ is the execution time of one round of the HBC algorithm; $F$ is the number of time windows; $M$ is the number of predicted coarse clusters; $c_{max}$ represents the number of objects in the cluster that has the maximal number of objects; $\tau$ is the number of objects that do not belong to any predicted coarse cluster ($\tau$ approximates 0); $W$ is the window size; and $T_{diss}$ represents the time required to compute the dissimilarity of two objects. The cost of generating coarse clusters and computing matrices $R$ and $P$ is $O(N^2)$, and the cost of computing the dissimilarity values and assessing the SC-relationships between objects is $O((Mc_{max}^2 + M^2 + M\tau + \tau^2)W T_{diss})$.

This study presents a comparison of the complexity of the HBC algorithm with that of the IC algorithm after the first time window. For the HBC algorithm, the complexity in the time window $w_h$, where $h \ge 2$, is $O(N^2 W T_{diss} + |E|\log|E| + |E|c_{max}^2)$. For the IC algorithm, the complexity of the best case in the time window $w_h$, where $h \ge 2$, is $O(N^2 + (Mc_{max}^2 + M^2 + M\tau + \tau^2)W T_{diss})$. If all objects belong to coarse clusters, then $\tau \approx 0$. To prove that the time complexity of the IC algorithm is less than that of the HBC algorithm, we prove that $O(N^2 W T_{diss}) > O((Mc_{max}^2 + M^2 + M\tau + \tau^2)W T_{diss})$. Because $\tau \approx 0$, $W > 0$ and $T_{diss} > 0$, we show that $O(N^2) > O(Mc_{max}^2 + M^2)$. Let $N = \sum_{i=1}^{M} c_i$, where $c_i$ is the number of objects in the $i$-th cluster of $M$ clusters and $c_i \le c_{i+1}$ for each $1 \le i < M$. The detailed time complexity of $O(Mc_{max}^2)$ is $\sum_{i=1}^{M} c_i^2$ and the detailed time complexity of $O(M^2)$ is $C_2^M$. Then, if $M > 1$,

$$\sum_{i=1}^{M} c_i^2 + C_2^M < \sum_{i=1}^{M} c_i^2 + C_2^M c_1^2 < (c_1^2 + \cdots + c_M^2) + 2\sum_{1 \le i < j \le M} c_i c_j = (c_1 + \cdots + c_M)^2 = N^2.$$

Thus, we derive that $O(N^2 W T_{diss} + |E|\log|E| + |E|c_{max}^2) > O(N^2 + (Mc_{max}^2 + M^2 + M\tau + \tau^2)W T_{diss})$ if $M > 1$ and $\tau \approx 0$. Consequently, when $M > 1$ and $\tau \approx 0$, the IC algorithm is more efficient than the HBC algorithm in the best case.
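As a quick sanity check of the inequality used in this argument, the sketch below compares $\sum_i c_i^2 + C_2^M$ with $N^2$ for a given list of cluster sizes. It is illustrative only: the function name and the example sizes are hypothetical, and $C_2^M$ is read here as the binomial coefficient "M choose 2", which is an assumption.

```python
from math import comb


def best_case_terms(cluster_sizes):
    """Return (sum of c_i^2 + C(M, 2), N^2) for the given cluster sizes."""
    m = len(cluster_sizes)
    n = sum(cluster_sizes)
    refine_cost = sum(c * c for c in cluster_sizes)  # sum of c_i^2
    merge_cost = comb(m, 2)                          # number of cluster pairs
    return refine_cost + merge_cost, n * n


if __name__ == "__main__":
    lhs, rhs = best_case_terms([2, 3, 5])  # hypothetical sizes, M = 3, N = 10
    print(lhs, rhs, lhs < rhs)
```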

4 Performance study and analysis

Because the IC algorithm is based on temporal locality, a framework is designed to generate a synthetic dataset from the observations of real dataset characteristics in Sect. 4.1 and then the effect of temporal locality on performance can be analyzed. Extensive experiments were conducted using the synthetic datasets generated by different temporal localities. Based on experiments using synthetic datasets, one could have some guidelines to set the parameters of the IC algorithm for experiments in a real dataset. These experiments compared the IC algorithm with existing approaches, the HBC algorithm [43], the CLS algorithm [25], and the CTS-ARMA algorithm [5]. Sections 4.2 and 4.3 present an analysis of the scalability of the algorithms using both synthetic and real datasets, respectively. Table 2 shows the notations used in the experiments. All experiments were performed on a computer with a 2.80 GHz Intel CPU and 2 GB of memory.

Table 2 The notations used in the experiments

Symbol   Description
W        Time window size
R        Geographic constraint
ε        Similarity constraint
α        Temporal correlation
θ        Probability threshold
N        Number of objects
L        Number of time stamps
F        Number of time windows
TL       Degree of temporal locality
a        Positive constant
Δt       Time gap

Several previous studies [5,18,34] have presented different algorithms for clustering data streams. However, the method in [18] focuses on the k-median problem in stream environments. Although other methods in [5,34] can discover clusters where the data streams in the same cluster behave similarly, the non-geographic values of data streams from the same cluster are often quite different, and the clustering results do not meet the clustering criteria in this paper. Therefore, the approach in [5] was modified to derive SC-clusters from spatial data streams, creating the CTS-ARMA algorithm. At the beginning of the CTS-ARMA algorithm, the number of clusters should be specified, and we set it as the number of clusters derived in the IC algorithm. For the ARMA(p, q) model used in the CTS-ARMA algorithm, the experiments in this study chose parameters p = 1 and q = 1. On the other hand, for existing dual clustering algorithms, only the HBC and CLS algorithms were considered as competitors. This is because the performance of the incremental clustering algorithm in [37], which was developed for clustering data in dual domains, is worse than that of the CLS algorithm. In addition, the two traditional clustering algorithms (i.e., the k-means algorithm and the Jarvis-Patrick algorithm) modified in [25] also have worse performance than the CLS algorithm. As mentioned previously, the CLS algorithm was designed to solve general dual clustering problems from a single time stamp, and its constraints are different from the proposed algorithm. That is, the CLS algorithm requires users to pre-specify the number of clusters and has no similarity constraint for clusters. Therefore, the CLS algorithm was modified to be suitable for discovering SC-clusters from spatial data streams given a particular number of clusters. Similarly, for the CLS algorithm, the number of clusters was set as the number of clusters derived in the IC algorithm.

For this framework, the degree of temporal locality TL varies between 0 and 1, and synthetic datasets can be generated for different degrees of temporal locality. For the clustering problem in this paper, the window size and two constraints R and ε can be specified according to different application needs and domain knowledge. To use synthetic datasets for performance evaluation, Sect. 4.1.2 presents a guideline for setting similarity constraints and Sect. 4.2.3 shows how to evaluate their effectiveness. Sections 4.2.1 and 4.2.2 use synthetic datasets to investigate the effect of temporal locality on the efficiency and the scalability of algorithms. Section 4.1.3 presents an approach to estimate the temporal locality of spatial data streams and evaluate their effectiveness using synthetic datasets. If temporal locality of spatial data streams can be estimated, it is possible to determine the parameters of the IC algorithm. Explicitly, to use the IC algorithm, temporal correlation α and probability threshold θ should be specified. Because these two parameters are related to temporal locality, Sect. 4.2.4 investigates the influence among the two parameters and temporal locality and provides guidelines for setting these two parameters for the IC algorithm.

Fig. 8 The framework of the proposed simulation

4.1 Synthetic dataset generation and analysis

4.1.1 Framework for synthetic data generation

To simulate real data, this study presents a framework to generate synthetic datasets by controlling the features of objects. As Fig. 8 shows, this framework comprises two stages. The first stage generates the objects' geographic information, and the second stage generates the values of the objects' non-geographic attributes.

The left-hand side of Fig. 8 shows the flow of generating geographic information of objects. Assume that the number of objects is N and the objects are deployed in a specified range, Range(x, y). The locations of objects are then uniformly distributed over Range(x, y).

Given a geographic constraint R and the desired number of clusters M, first determine the number of objects in each cluster. To generate a cluster, begin by selecting an object as the seed of the cluster. Note that each object has at least one neighbor from the same cluster in the range R. In other words, for each object $S_i$ in a cluster C, an object $S_j \in C$ exists such that $ED(S_i, S_j) \le R$. Therefore, use the location of the seed as the center and the geographic constraint R as the radius. For each cluster, choose objects within the radius R of the seed of the cluster at random as the members of the cluster. Next, choose the furthest object in the range, with respect to the current center, as the new center. Repeat these steps until the locations of the selected objects meet the boundary of the given geographic range Range(x, y) and generate other clusters accordingly. Finally, mark the objects selected as cluster members and regard the unmarked objects as noisy objects.

This study uses a real dataset (that is, sensor data collected by monitoring traffic speeds on a freeway) to generate the values of the objects' non-geographic attributes. As Fig. 9 shows, most drivers drive close to the speed limit on the freeway. During peak periods, such as rush hour, driving speeds are drastically reduced. Traffic accidents also affect driving speeds. Figure 9 shows that from approximately 7 a.m. to 9 a.m. (and from 5 p.m. to 7 p.m.), driving speeds are reduced because of high commuter volume. At other times, driving speeds are closer to the speed limit. Based on this observation, the characteristics of the real data indicate that: (1) the values of a sensor's non-geographic attributes are usually close to a certain value, and (2) the values sometimes decrease but increase again after a period. Thus, the proposed simulation model for the non-geographic domain adopts the concept of mean reverting jump diffusion [8,10].

Fig. 9 Real freeway traffic data on a particular day (x-axis: time, 0 to 24; y-axis: speed in km/hr)

The right-hand side of Fig. 8 shows the detailed flow generating the values of the non-geographic attributes of the objects. First, a value $V_{ini}$ and a gap $V_{gap}$ must be determined to generate the initial means of the non-geographic attributes of the objects. Based on $V_{ini}$ and $V_{gap}$, generate M distinct initial means of the values in the non-geographic domain for different clusters. Note that the initial means are separated into different clusters. Thus, the initial means of clusters are defined as $\mu_i = V_{ini} + (i - 1)V_{gap}$ for $i \in \mathbb{N}$ and $1 \le i \le M$, and the initial mean of the noisy set is defined as $V_{ini}$.

After generating the initial means of M clusters and the noisy set, iteratively generate the values of the non-geographic attributes of the objects with respect to the initial means and the temporal locality (TL) for each time window. To simulate jump diffusion for the non-geographic attributes of the objects in each cluster, determine whether the values decrease in time window $w_h$ using the following indicator function:

$$I_{w_h} = \begin{cases} 0 & \text{w.p. } TL \\ 1 & \text{w.p. } 1 - TL. \end{cases}$$

In time window $w_h$, for object $S_i$ in a cluster with initial mean $\mu$, the value of the non-geographic attribute of $S_i$ from time stamp $t \in w_h$, denoted as $S_i.V_t$, can be formulated as follows:

$$S_i.V_t = S_i.V_{t-1} - a(S_i.V_{t-1} - \mu)\Delta t + \omega_t - I_{w_h} \cdot V_{w_h}, \quad (3)$$

where $a$ is a positive constant; $\Delta t$ represents the time gap between $S_i.V_{t-1}$ and $S_i.V_t$; $\omega_t$ follows a normal distribution with a mean of zero and variance $\sigma^2$; and $V_{w_h}$ follows a uniform distribution within the range $(0, V_{drop})$.
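Formula 3 and the indicator function can be turned into a small generator for one object's stream. The sketch below is illustrative rather than the authors' simulator: the function name `generate_stream` and its parameter names are hypothetical, the stream is assumed to start at the mean $\mu$, and the drop is assumed to be drawn once per time window (as the subscript $w_h$ suggests) and subtracted at every time stamp of that window, following the formula literally.

```python
import random


def generate_stream(mu, tl, num_windows, window_size,
                    a=5.0, dt=0.1, sigma=1.0, v_drop=20.0, seed=None):
    """Generate num_windows * window_size values for one object via Formula 3."""
    rng = random.Random(seed)
    values = []
    v_prev = mu  # assumption: the stream starts at its initial mean
    for _ in range(num_windows):
        # Indicator I_{w_h}: 0 with probability TL, 1 with probability 1 - TL.
        indicator = 0 if rng.random() < tl else 1
        # V_{w_h} ~ Uniform(0, V_drop); drawn once per window (assumption).
        drop = rng.uniform(0.0, v_drop)
        for _ in range(window_size):
            noise = rng.gauss(0.0, sigma)  # omega_t ~ N(0, sigma^2)
            v = v_prev - a * (v_prev - mu) * dt + noise - indicator * drop
            values.append(v)
            v_prev = v
    return values


if __name__ == "__main__":
    stream = generate_stream(mu=60.0, tl=0.8, num_windows=24, window_size=12, seed=7)
    print(len(stream), round(min(stream), 2), round(max(stream), 2))
```

With `tl=1.0` and `sigma=0.0` the stream stays at `mu`, which is a convenient way to confirm the mean-reverting term before adding noise and drops.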

Fig. 10 Two synthetic spatial data streams in a cluster (x-axis: time; y-axis: value)

The number of values for the non-geographic attribute of an object in each time window is W, with a total of F time windows. The proposed framework can generate a synthetic dataset. For example, Fig. 10 shows the values in the non-geographic domain of two objects in the same cluster with the parameters $\mu = 60$, $TL = 0.8$, $F = 24$, $a = 5$, $\sigma = 1$, $V_{drop} = 20$, $\Delta t = 0.1$, and $W = 12$. Compared with Fig. 9, the curves of the synthetic datasets capture the behaviors of the real datasets effectively.

4.1.2 Analysis of the similarity constraint

The previous section proposed a model to simulate real data. However, the clustering problem in this paper requires a time window size of W, a geographic constraint of R, and similarity constraint of ε. As mentioned previously, these parameters can be determined with respect to domain knowledge for real data. Although this model simulates real data based on user-specified parameters (for example, a geographic constraint), we do not specify a similarity constraint for simulation. To use the synthetic datasets generated by the model, this study presents guidelines for setting similarity constraint ε using a user-specified time window size. Based on this model, we first formulate the dissimilarity between two objects from the same cluster in Theorem 4.1 as follows. According to Theorem 4.1, it is then possible to statistically estimate the proper value of similarity constraint ε.

Theorem 4.1 Given two objects $S_i$ and $S_j$ in the same cluster and a time window $w_h = [t + 1, t + W]$ without a decrease (that is, $I_{w_h} = 0$), the dissimilarity between $S_i$ and $S_j$ in time window $w_h$ can be represented as

$$\frac{1}{W}\sum_{k=1}^{W}(S_i.V_{t+k} - S_j.V_{t+k})^2,$$

where, for $k \in \mathbb{N}$ and $1 \le k \le W$,

$$(S_i.V_{t+k} - S_j.V_{t+k}) \sim N\left(0, \frac{2\sigma^2\left(1 - (1 - a\Delta t)^{2k}\right)}{1 - (1 - a\Delta t)^2}\right).$$

Proof Given a time window $w_h = [t + 1, t + W]$, where $W$ is the size of the time window, the values of objects $S_i$ and $S_j$ in the non-geographic domain are based on Formula 3 in the proposed framework: for $1 \le k \le W$, $S_i.V_{t+k} = (1 - a\Delta t)S_i.V_{t+k-1} + a\mu\Delta t + \omega_{t+k}$ because $I_{w_h} = 0$. Similarly, $S_j.V_{t+k} = (1 - a\Delta t)S_j.V_{t+k-1} + a\mu'\Delta t + \omega'_{t+k}$ for $1 \le k \le W$. Because $S_i$ and $S_j$ are in the same cluster, they have the same initial mean (that is, $\mu = \mu'$). To simplify the derivation below, use $\mu$ instead of $\mu'$ for $S_j$. According to Definition 2.1, the dissimilarity between $S_i$ and $S_j$ in time window $w_h$ can be derived by

$$diss(S_i, S_j, w_h) = \frac{1}{W}\sum_{k=1}^{W}(S_i.V_{t+k} - S_j.V_{t+k})^2.$$

Next, consider how to derive the distribution of the value $(S_i.V_{t+k} - S_j.V_{t+k})$ for $k \in \mathbb{N}$ and $1 \le k \le W$. Initially, when $k = 1$, $S_i.V_{t+1} - S_j.V_{t+1} = (1 - a\Delta t)(S_i.V_t - S_j.V_t) + (\omega_{t+1} - \omega'_{t+1}) = \omega_{t+1} - \omega'_{t+1}$. Thus, $(S_i.V_{t+1} - S_j.V_{t+1}) \sim N(0, 2\sigma^2)$ because $\omega_{t+1}$ and $\omega'_{t+1}$ are i.i.d. $N(0, \sigma^2)$ according to Formula 3. When $k = 2$, we have $S_i.V_{t+2} - S_j.V_{t+2} = (1 - a\Delta t)(S_i.V_{t+1} - S_j.V_{t+1}) + (\omega_{t+2} - \omega'_{t+2}) = (1 - a\Delta t)(\omega_{t+1} - \omega'_{t+1}) + (\omega_{t+2} - \omega'_{t+2})$. Hence, $(S_i.V_{t+2} - S_j.V_{t+2}) \sim N(0, 2\sigma^2(1 + (1 - a\Delta t)^2))$. Consequently, for $k = W$, we have $S_i.V_{t+W} - S_j.V_{t+W} = (1 - a\Delta t)^{W-1}(\omega_{t+1} - \omega'_{t+1}) + (1 - a\Delta t)^{W-2}(\omega_{t+2} - \omega'_{t+2}) + \cdots + (\omega_{t+W} - \omega'_{t+W})$. Thus,

$$(S_i.V_{t+W} - S_j.V_{t+W}) \sim N\left(0, \frac{2\sigma^2\left(1 - (1 - a\Delta t)^{2W}\right)}{1 - (1 - a\Delta t)^2}\right).$$

Fig. 11 The probability density function of the dissimilarity between two objects (x-axis: dissimilarity; y-axis: probability density function; marked value $\varepsilon_{1-\beta} = 3.58$)

Theorem 4.1 shows that there is no closed form for the probability density function of the dissimilarity between two objects in the same cluster. Given a set of objects in the same cluster, assume that a tolerance β represents the proportion of objects that are not in the same cluster. Therefore, clustering errors that occur because objects are not in their ground-truth clusters are bounded by the tolerance. Given the tolerance, the similarity constraint can be determined using the Monte Carlo method [32]. For example, assume that the synthetic dataset is generated using the parameter settings $a = 5$, $\sigma = 1$, $\Delta t = 0.1$ and $W = 10$. Using the Monte Carlo method, take 100,000 random samples to calculate the dissimilarity according to Theorem 4.1 and then derive the probability density function of the dissimilarity (Fig. 11). When β = 0.05, derive the similarity constraint as ε = 3.58. Table 3 shows the proper settings of the similarity constraint ε when the time window size W varies between 10 and 50 under different values of the tolerance β with $a = 5$, $\sigma = 1$ and $\Delta t = 0.1$. As Table 3 shows, given a fixed time window size W, the similarity constraint ε decreases slightly as the tolerance β increases. This is reasonable because random noise prevents the objects in a cluster from being grouped with a smaller similarity constraint. Table 3 also shows that, given a fixed tolerance β, the similarity constraint ε decreases slightly as the time window size W increases. This is because a larger time window increases the dissimilarity; therefore, the similarity constraint ε should be smaller as W increases to satisfy the fixed tolerance β. Section 4.2.3 validates the above derivation of the similarity constraint.

Table 3 Similarity constraint ε with various time window size W settings and tolerance β settings

W    β = 0.01   β = 0.025   β = 0.05   β = 0.1   β = 0.15   β = 0.2
10   4.03       3.79        3.58       3.35      3.19       3.02
20   3.65       3.47        3.33       3.17      3.06       2.97
30   3.46       3.33        3.21       3.08      2.99       2.92
40   3.36       3.24        3.14       3.03      2.95       2.89
50   3.29       3.18        3.10       2.99      2.92       2.87
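The Monte Carlo procedure can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the dissimilarity is assumed to be the mean of squared differences over the window, both objects are assumed to start the window at the same value, and ε is taken as the empirical $(1 - \beta)$ quantile of the sampled dissimilarities. Under these assumptions the numbers need not reproduce Table 3.

```python
import random


def sample_dissimilarity(window_size, a, dt, sigma, rng):
    """Draw one dissimilarity between two same-cluster objects (no drop)."""
    diff = 0.0   # assumption: both objects start the window at the same value
    total = 0.0
    for _ in range(window_size):
        # The difference follows d_k = (1 - a*dt) * d_{k-1} + (omega - omega'),
        # where omega and omega' are i.i.d. N(0, sigma^2).
        diff = (1.0 - a * dt) * diff + rng.gauss(0.0, sigma) - rng.gauss(0.0, sigma)
        total += diff * diff
    return total / window_size  # assumed form: mean squared difference


def estimate_epsilon(beta, window_size=10, a=5.0, dt=0.1, sigma=1.0,
                     num_samples=100_000, seed=0):
    """Return the empirical (1 - beta) quantile of the sampled dissimilarities."""
    rng = random.Random(seed)
    samples = sorted(sample_dissimilarity(window_size, a, dt, sigma, rng)
                     for _ in range(num_samples))
    index = min(num_samples - 1, int((1.0 - beta) * num_samples))
    return samples[index]


if __name__ == "__main__":
    for beta in (0.01, 0.05, 0.2):
        print(beta, round(estimate_epsilon(beta, num_samples=20_000), 3))
```

By construction, a larger tolerance β selects a lower quantile, so the estimated ε is non-increasing in β, which matches the trend along each row of Table 3.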

4.1.3 Temporal locality estimation

This subsection presents a method for estimating the temporal locality of objects in the non-geographic domain. Given an object and the similarity constraint ε determined by Theorem 4.1 with time window size W and tolerance β, the values of the object in the non-geographic domain are divided into multiple sequences based on the size of W. The number of sequences is denoted by $F_{all}$, and the multiple sequences represent a set of spatial data streams. The proposed method uses the HBC algorithm to cluster the data streams with the given similarity constraint ε, geographic constraint R and window size W. The mean of each cluster is computed by averaging the means of the spatial data streams in the cluster. For different applications, the relative interval of the general values of the spatial data stream in the non-geographic domain can be determined based on observations of the given values of a spatial data stream in the non-geographic domain. This approach uses the maximum likelihood estimation [8,10] to derive the estimated mean of a data stream and then selects the cluster whose mean is closest to the relative interval or the estimated mean. The term $F_{reg}$ represents the number of objects in the selected cluster. Intuitively, the temporal locality is formulated as $TL = F_{reg}/F_{all}$. For example, Fig. 12a shows the values of a sensor in the non-geographic domain, where the ground truth for its temporal locality is 0.8. The values of this object in the non-geographic domain are partitioned into 12 sequences (i.e., $F_{all} = 12$). Assume that the clustering result for the 12 sequences is {S1, S2, S3, S4, S6, S7, S8, S9, S12}, {S5} and {S10, S11}. According to the proposed synthetic dataset generation framework (that is, the general values of data streams are close to the maximal speed limits), the general values of data streams are larger. Thus, the mean of cluster {S1, S2, S3, S4, S6, S7, S8, S9, S12} is larger than that of other clusters, and $F_{reg} = 9$. As a result, the estimated temporal locality is $TL = 9/12 = 0.75$.
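The final step of this estimation, choosing the "regular" cluster and taking the ratio $F_{reg}/F_{all}$, can be sketched as below. The function name, the `reference_mean` parameter, and the per-sequence means are hypothetical; the clustering itself (done with the HBC algorithm in the text) is assumed to have been computed already.

```python
def estimate_temporal_locality(clusters, sequence_means, reference_mean):
    """Return F_reg / F_all for sequences already grouped into clusters.

    clusters: list of lists of sequence ids.
    sequence_means: dict mapping each sequence id to its mean value.
    reference_mean: the estimated mean (or interval centre) of regular values.
    """
    f_all = sum(len(cluster) for cluster in clusters)

    def cluster_mean(cluster):
        # Average of the means of the sequences in the cluster.
        return sum(sequence_means[s] for s in cluster) / len(cluster)

    # Select the cluster whose mean is closest to the reference mean.
    regular = min(clusters, key=lambda c: abs(cluster_mean(c) - reference_mean))
    return len(regular) / f_all


if __name__ == "__main__":
    clusters = [[1, 2, 3, 4, 6, 7, 8, 9, 12], [5], [10, 11]]
    # Hypothetical sequence means: regular sequences near 60, dropped ones lower.
    means = {s: 60.0 for s in clusters[0]}
    means.update({5: 40.0, 10: 45.0, 11: 46.0})
    print(estimate_temporal_locality(clusters, means, reference_mean=60.0))
```

With the clustering result from the example above, the sketch yields 9/12 = 0.75.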

To verify the effectiveness of this temporal locality estimation scheme, simulate the values of an object in the non-geographic domain with its parameters $\mu = 60$, $F = 50$, $a = 5$, $\sigma = 1$, $V_{drop} = 20$, $\Delta t = 0.1$, and $W = 10$. The temporal locality TL varies from 0.1 to 1. Figure 13 shows the effectiveness of this temporal locality estimation scheme. In this figure, the estimated temporal locality is less than or equal to the temporal locality of the ground truth, but the estimated temporal locality is close to the temporal locality of the ground truth.

Fig. 12 An illustrative example of temporal locality estimation (panels a and b; x-axis: time; y-axis: value; sequences S1 to S12)

Fig. 13 Temporal locality estimation with various temporal locality TL settings (x-axis: temporal locality TL; y-axis: estimated temporal locality)

4.2 Experimental results on synthetic datasets

Synthetic datasets were generated to evaluate the proposed algorithm and the existing algorithms. The default parameter settings for generating the synthetic datasets are that $N = 200$, Range(8, 8), $M = 4$, $R = 1.3$, $V_{ini} = 60$, $V_{gap} = 40$, $a = 5$, $\Delta t = 0.1$, $\sigma = 1$, $V_{drop} = 20$, $F = 50$, $W = 10$. The default settings of parameters in the experiments are $\varepsilon = 110$, $R = 1.7$, $\alpha = 0.5$, and $\theta = 0.5$. The following subsections describe the experiments and present the experimental results.

4.2.1 Effect of the temporal locality

This section first investigates the effect of temporal locality on the efficiency of the IC algorithm and three existing approaches: the HBC algorithm [43], the CLS algorithm [25], and the CTS-ARMA algorithm [5]. In this experiment, to investigate the effect of TL, let N = 130. Figure 14 shows the ground truth of clusters and the distribution of objects in the geographic domain, where each point represents an object and x (resp. y) refers to Lx (resp. Ly) of an object. Objects in the same cluster are drawn with the same symbol. Figure 15 shows the efficiency of the four algorithms when the temporal locality TL varies. This figure shows that, with a larger temporal locality, the IC and HBC algorithms outperform the CTS-ARMA and CLS algorithms. The IC algorithm has the shortest runtime because it exploits the temporal locality feature. In particular, when the temporal locality is larger than 0.7, the runtime of the IC algorithm is reduced substantially. In summary, the IC and HBC algorithms are more efficient than the CTS-ARMA and CLS algorithms in stream environments. The remaining experiments focus on presenting a performance comparison of the IC and HBC algorithms.

Fig. 14 The ground truth of clusters for the experiments in Sect. 4.2.1 (axes: x and y; 4 clusters)

Fig. 15 The performance of the IC, HBC, CLS, and CTS-ARMA algorithms with various TL values (x-axis: temporal locality TL; y-axis: runtime in sec.)

To compare the clustering results of different algorithms, this study considers clustering results from the same time window. Figure 16 shows the clustering results of different algorithms with varied TL. This figure shows that the number of clusters of these algorithms increases as TL decreases. That is, for the four algorithms, the clustering result has 12–14 clusters when TL = 0.8 and has 24–26 clusters when TL = 0.5. This is because, when the value of TL is lower, the non-geographic values of objects vary frequently over time and the dissimilarity of objects in the non-geographic domain increases in the same time window. In other words, for a lower value of TL, only a few objects are clustered together and the number of clusters increases.

Fig. 16 The clustering results of the IC, HBC, CLS, and CTS-ARMA algorithms with various TL values. a CTS-ARMA-TL0.8; b CLS-TL0.8; c HBC-TL0.8; d IC-TL0.8; e CTS-ARMA-TL0.5; f CLS-TL0.5; g HBC-TL0.5; h IC-TL0.5

Fig. 17 Scalability analysis of the HBC and IC algorithms with various representations of N and L. a Runtime (in sec.) versus number of objects N; b runtime (in sec.) versus number of time stamps L

4.2.2 Comparison of the scalability of algorithms

This experiment was designed to assess the scalability of the IC and HBC algorithms by increasing the number of objects and the number of time stamps using different temporal locality settings (that is, TL = 0.2 and TL = 0.8). Figure 17a shows that the runtime of the IC algorithm increases slightly as the number of objects increases. However, the algorithm's runtime becomes shorter when using a larger temporal locality value. In contrast, the runtime of the HBC algorithm increases drastically as N increases. The runtime of 'HBC-TL0.8' and the runtime of 'HBC-TL0.2' are similar when N varies from 100 to 500. Figure 17b shows the effect of the number of time stamps on the performance of the IC and HBC algorithms. With a larger L, the runtimes of both algorithms tend to increase. However, the runtime of the IC algorithm is still shorter than that of the HBC algorithm. These experimental results demonstrate that the IC algorithm achieves favorable scalability for both a large number of objects and a large number of time stamps.

Fig. 18 Quality of the clustering results with various values of tolerance β. a Ground truth; b β = 0: ε = 5; c β = 0.2: ε = 3

4.2.3 Validation of the similarity constraint

Given a tolerance β, Sect. 4.1.2 shows the derivation of an optimal setting for the similarity constraint based on Theorem 4.1. Because a synthetic dataset was generated for this experiment, Fig. 18a shows the ground truth for comparison. Note that Fig. 18 shows the distribution of objects in the geographic domain, where each point represents an object and x (resp. y) refers to the Lx (resp. Ly) of an object. The ground truth refers to the clusters generated by the proposed simulator. Clusters that have more than one member were used to evaluate the effectiveness of the proposed guidelines in Sect. 4.1.2, and the clusters that have only one member were regarded as noise. By setting different values of tolerance β (that is, β = 0 and β = 0.2), the similarity constraints were derived from Theorem 4.1. For β = 0 (resp. β = 0.2), the similarity constraint was set at 5 (resp. 3). The clustering results can be derived by applying the proposed algorithm. Figure 18b shows that with a tolerance β = 0, the clustering result is extremely close to the ground truth in Fig. 18a. However, when β is set at 0.2, the clustering result is dissimilar to the ground truth. These experimental results validate the derivation in Theorem 4.1 and show that users can set their own tolerance values. These results also demonstrate that the similarity constraint is properly determined using Theorem 4.1.

4.2.4 Sensitivity analysis of the IC algorithm

Section 4.1.3 presented a method for estimating the temporal locality of the non-geographic attributes of objects. After estimating the temporal locality, set the temporal correlation α and probability threshold θ for the IC algorithm as follows. The proposed simulation framework generated synthetic datasets with temporal locality TL varying between 0.1 and 1. For each synthetic dataset, the IC algorithm was then executed with different combinations of α and θ settings, where the range of the two parameters varies between 0.1 and 1. Each combination of parameters was run ten times, and the runtimes were averaged. Figure 19 shows the experimental results. For each temporal locality, we have a range of runtimes with different combination settings for α and θ, and a lower bound of runtimes exists for different temporal locality values. The minimal runtime for each temporal locality decreases as the temporal locality increases. Based on the shorter runtimes under different temporal localities, Fig. 20a shows the corresponding settings for α and θ.

Fig. 19 The runtimes for different values of temporal localities TL with various settings of parameter α and probability threshold θ (x-axis: temporal locality TL; y-axis: runtime in sec.)

Fig. 20 A guideline for setting the temporal correlation α and probability threshold θ with various values of temporal locality TL. a TL = 0.1 ∼ 1; b TL = 0.3 ∼ 1

To set the proper combinations of temporal correlation α and probability threshold θ for each temporal locality, choose the minimal runtime $T_{min}$ and then select the combinations of parameters whose runtimes are within $[T_{min}, (1 + \lambda)T_{min}]$, where $\lambda = 0.05$. Figure 20a shows a plot of these combinations with respect to different temporal locality values. This figure shows that the best combination of temporal correlation α and probability threshold θ is 0.1 and 0.1, respectively, when TL = 0.1 and TL = 0.2. In these cases, the clustering result is not closely related to that of the previous time window because the temporal locality is smaller. Therefore, the temporal correlation α should be set at a smaller value. The probability threshold should also be set at a smaller value, because the elements of a probability matrix do not easily exceed a larger probability threshold, in which case the IC algorithm does not generate coarse clusters to reduce the clustering runtime. For the temporal locality between 0.3 and 1, Fig. 20b shows the overlap of the combinations of the temporal correlation α and probability threshold θ. Both the temporal correlation and probability threshold can be selected from the range shown in Fig. 20b. Because the IC algorithm exploits the temporal locality for efficiency purposes, the temporal correlation α and probability threshold θ significantly influence the clustering results of the algorithm. Thus, these observations provide guidelines for determining the settings for temporal correlation α and probability threshold θ.

Fig. 21 Sensitivity analysis of the IC algorithm regarding TL with varying R and ε. a TL0.8; b TL0.2 (x-axis: similarity constraint ε; y-axis: runtime in sec.; curves: R1.2, R1.7, R2.1)
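The selection rule for (α, θ) combinations is simple enough to state in code. The sketch below is illustrative only: the function name and the runtime values are hypothetical, and the input is assumed to be a mapping from each (α, θ) pair to its averaged runtime for one temporal locality.

```python
def near_optimal_settings(runtimes, lam=0.05):
    """Return the (alpha, theta) pairs whose runtime lies in [T_min, (1 + lam) * T_min].

    runtimes: dict mapping (alpha, theta) to the averaged runtime in seconds.
    """
    t_min = min(runtimes.values())
    return sorted(pair for pair, t in runtimes.items() if t <= (1.0 + lam) * t_min)


if __name__ == "__main__":
    # Hypothetical averaged runtimes for one temporal locality value.
    runtimes = {(0.1, 0.1): 10.0, (0.5, 0.5): 10.4, (0.9, 0.9): 12.0}
    print(near_optimal_settings(runtimes))
```

Repeating this for every temporal locality value and intersecting the resulting sets would give the kind of overlap region described for Fig. 20b.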

4.2.5 Effect of the constraints

This section presents the effect of the constraints on the IC algorithm at different TL values, and Fig. 21 shows the results. Given a higher value of TL (e.g., TL = 0.8), Fig. 21a shows that the runtime of the IC algorithm decreases dramatically as ε varies from 60 to 90 and becomes stable when ε > 90. This is because, when the similarity constraint is loosened (i.e., a higher value of ε), most objects are grouped together. For a higher value of TL, these objects still belong to the same cluster over time. Thus, the IC algorithm can effectively predict the clusters according to the previous clustering results. In addition, a higher value of R loosens the geographic constraint. Thus, Fig. 21a shows that, given a fixed ε, the runtime of the IC algorithm with a higher value of R (e.g., R = 2.1) is shorter than that with a lower value of R (e.g., R = 1.2). When ε > 90, the effect of the geographic constraint is not obvious, because most objects can be clustered based on the similarity-constrained relationships given a higher value of ε.

Given a lower value of TL (e.g., TL = 0.2), Fig. 21b shows that the runtime of the IC algorithm decreases slightly as ε varies from 60 to 90, but increases slightly when ε > 90. This is because, for a lower value of TL, the non-geographic values of objects do not remain similar

Fig. 22 The efficiency comparison of the IC, HBC, CLS, and CTS-ARMA algorithms using real datasets. The plot shows the runtime (in msec., logarithmic scale) against the number of sensors N.

over time, and the predictions of the IC algorithm become unreliable. In addition, a looser constraint (e.g., a higher value of ε or a higher value of R) causes most of the objects to be grouped together, but these objects may not belong to the same cluster in the next time window given a lower TL. Similarly, Fig. 21b shows that, for the same ε, the runtime of the IC algorithm with a higher value of R (e.g., R = 2.1) is shorter than that with a lower value of R (e.g., R = 1.2).

4.3 Experimental results on real datasets

The following subsections present comparisons of the performance of the IC algorithm with that of existing approaches and analyze their scalability using a real dataset. This study also investigates the sensitivity of a geographic constraint and a similarity constraint.

4.3.1 Real dataset

The real dataset was obtained from the Taiwan Area National Freeway Bureau. We compiled a traffic database for Freeway No. 1, which runs the length of the island (a distance of 372.7 km). We collected data from 100 sensors positioned along the freeway. Each sensor has a specific location and reports the speeds of vehicles on the monitored segments every five minutes. The default settings for the real dataset include the number of objects (that is, sensors), N = 79; the number of time stamps, L = 434; the geographic constraint, R = 10 km; the time window size, W = 10; and the similarity constraint, ε = 5. According to Sect. 4.2.4, we set the temporal correlation α = 0.5 and the probability threshold θ = 0.5. Note that F = _WL.

4.3.2 Efficiency comparison

As mentioned in Sect. 4.2.1, the HBC, CLS, and CTS-ARMA algorithms are regarded as competitors of the proposed algorithm. This experiment was designed to compare the IC algorithm with the HBC, CLS, and CTS-ARMA algorithms. Similarly, for the CLS and CTS-ARMA algorithms, the number of clusters was set as the number of clusters derived from the IC algorithm. We evaluate the efficiency of the four algorithms for different values of N. Figure 22 shows that the runtimes of the CLS and CTS-ARMA algorithms are longer than those of the HBC and IC algorithms, revealing that the IC and HBC algorithms are more efficient than the CLS and CTS-ARMA algorithms when using real datasets. Hence, the remainder of the experiments only compare the performance of the IC algorithm and the competitor algorithm, HBC.