Adaptive clustering for multiple evolving streams


Bi-Ru Dai, Jen-Wei Huang, Mi-Yen Yeh, and Ming-Syan Chen, Fellow, IEEE

Abstract—In the data stream environment, the patterns generated at different time instances are different due to data evolution. As time progresses, the behavior and members of clusters usually change. Hence, clustering continuous data streams allows us to observe the changes of group behavior. In order to support flexible clustering requirements, we devise in this paper a Clustering on Demand framework, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two advantageous features, namely, one data scan for online statistics collection and compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. The COD framework consists of two phases, i.e., the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient mechanism to maintain summary hierarchies of data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve approximations of desired substreams from summary hierarchies according to clustering queries. We propose two summarization techniques, based on wavelet and regression analyses, to construct the summary hierarchies. The regression-based summary hierarchy approximates the data stream more precisely and provides better clustering results, at the cost of slightly longer time and twice the storage space compared with the wavelet-based one. An adaptive version of the COD framework is designed to select between a wavelet-based model and a regression-based model for building the summary hierarchy. By the adaptive COD, we can obtain clustering results with almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and also validated by our empirical studies, the COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.

Index Terms—Data mining, clustering of multiple data streams, time-series clustering.


1 INTRODUCTION

In recent years, several query problems and mining capabilities have been explored for the data stream environment [1], [2], including those on summarization and statistics [3], [4], [5], data selection [6], change detection [7], [8], sampling [9], data clustering [10], [11], [12], [13], and data classification [14], [15], [16], to name a few. For data stream applications, the volume of data is usually too large to be stored on permanent devices or to be scanned thoroughly more than once. It is hence recognized that both approximation and adaptivity are key ingredients for executing queries and performing mining tasks over rapid data streams.

Clustering problems have recently been studied for the data stream environment [10], [11], [12], [13]. In this paper, the problem of clustering multiple data streams is addressed. It is assumed that at each time stamp, data points from individual streams arrive simultaneously, and the data points are highly correlated with previous ones in the same stream. Unlike prior studies, the objective in this work is to partition these data streams, rather than their data points, into clusters. Note that the data streams are not of a fixed length. Instead, they are still evolving when the clustering results are observed at users’ requests. For the sake of simplicity, the model of one-dimensional data streams is explored, which can also be regarded as multiple streaming time series.

In the data stream environment, the patterns generated at different time instances are different due to data evolution. The frequency of change for rules and patterns is different for different applications. For example, the streams gathered from adjacent sensors may always be in the same cluster and rarely change as time progresses. On the other hand, for stock prices, some companies are probably within the same cluster during several months but in different clusters afterward. The clusters obtained hence change frequently. In order to deal with various types of multiple data streams in the same stream mining system, an important question arises: “Can we design a scheme for modeling both fast and slow evolving patterns adaptively?”

Consider the stock market data again. Assume that the clustering request is unknown when the data is collected and processed. After the time range of the clustering request is given, we want to offer precise recommendations for short-term or long-term investments. For example, some users are interested in the short-term behavior of clusters, such as the daily or hourly behavior. Therefore, we would like to provide daily clusters for one week or hourly clusters for a few hours. On the other hand, for users interested in the long-term behavior of clusters, such as the monthly or annual behavior, we would like to provide

The authors are with the Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC. E-mail: {brdai, jwhuang, miyen}@arbor.ee.ntu.edu.tw,

[email protected].

Manuscript received 5 June 2005; revised 18 Dec. 2005; accepted 13 Apr. 2006; published online 19 July 2006.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0226-0605.


monthly clusters for this year or annual clusters for the last five years. This leads to another important issue: “Can we provide a system to support various clustering requirements at the same time?”

Clustering multiple data streams is very useful in various applications. As time progresses, the behavior and members of clusters usually change. Hence, clustering continuous data streams allows us to observe the changes of group behavior. For example, oceanographic and surface meteorological data consist of measures such as temperature, humidity, surface winds, and so on. These values are recorded continuously and probably change with the ocean currents, rainfall, solar radiation, and other factors. Therefore, by clustering these continuous streams, we are able to observe that some streams are similar in a period and become dissimilar afterward. For example, we can query monthly temperature clusters or annual humidity clusters. Then, the clustering query can be set as follows: the window size is set as one month and several months are observed, or the window size is set as one year and several years are observed, respectively. Then, according to the changes of cluster members between windows, oceanographers can obtain useful knowledge for further analysis such as building prediction models for future trends.

Consequently, we devise in this paper a framework of Clustering on Demand, abbreviated as COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the proposed COD framework has the following two advantageous features. The first one is one data scan for online statistics collection, and the second one is compact multiresolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multiresolution approximations of data streams, flexible clustering demands can be supported. Note that since the clustering algorithms are only applied to the statistics maintained rather than to the original data streams, the proposed COD is able to work very efficiently in practice.

The proposed COD framework consists of two phases, namely, the online maintenance phase and the offline clustering phase. The online maintenance phase provides an efficient algorithm to maintain the summary hierarchies of the data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve the approximations of the desired substreams from the summary hierarchies as precisely as possible according to the clustering queries specified by the users. Note that the data streams are generated continuously in a dynamic environment with a huge volume and infinite flow. To maintain the summaries of the evolving streams, we propose two summarization techniques based on 1) wavelets and 2) regression lines. In general, we keep finer approximations for more recent data and coarser approximations for older data. In the wavelet-based hierarchy, a flat line is used to approximate an interval of a stream. In contrast, a straight fitting line is applied in the regression-based hierarchy. Since a flat line approximation can be calculated more efficiently and maintained with fewer parameters, the wavelet-based hierarchy can be built faster while using less storage space than the regression-based one. However, the straight fitting line utilized in the regression-based hierarchy can approximate the stream more precisely, at the cost of slightly longer time and twice the storage space to build the summary hierarchy. Therefore, we are able to trade precision for storage space in practical applications. If the values of the data streams vary slowly, the wavelet-based models may approximate these streams with acceptable precision. In contrast, the regression-based models will provide better approximations for fast-changing streams. An adaptive version of the COD framework is designed to address this issue by selecting between a wavelet-based model and a regression-based model for building the summary hierarchy. By the adaptive COD, we can obtain clustering results with almost the same quality as the regression-based COD while using much less storage space for the summary hierarchy. We also conduct a series of experiments on these algorithms to investigate their properties. As shown in the complexity analyses and also validated by our empirical studies, the algorithms of the COD framework perform very efficiently in the data stream environment while producing clustering results of very high quality.

The problem studied in this paper is different from the one discussed in [17], which focuses on clustering the windows of a single streaming time series. On the other hand, the clustering of evolving streams is also discussed in [18]. However, the objective in [18] is to continuously report clusters satisfying the specified distance threshold. The clusters are generated according to the values from the beginning of the stream to the current time. Therefore, it does not have the ability to observe clusters in a time range of interest. To further enhance these techniques, clustering requests of flexible time ranges are supported in our framework.

The rest of this paper is organized as follows: Preliminaries and advantages of the COD framework are described in Section 2. The online maintenance phase and the offline clustering phase of the COD framework are presented in Section 3. Then, empirical studies are conducted in Section 4. This paper concludes with Section 5.

2 CLUSTERING ON DEMAND FOR MULTIPLE DATA STREAMS

The definitions used in the COD framework are provided in Section 2.1. In addition, the advantages of the COD framework, namely, clustering with flexible window sizes and trend identification, are described in Section 2.2.

2.1 Framework Definitions

The COD framework has two phases, i.e., the online maintenance phase and the offline clustering phase. In the online maintenance phase, data streams are processed and only very brief summaries are maintained. The offline clustering phase deals with the clustering queries. The COD framework supports clustering queries with a flexible window size and a desired number of windows to observe. For example, a clustering query could be 12 windows with the window size of 30 days to observe the clusters of each month during a year. Because space is limited in the data stream environment, the raw data streams are parsed only once and then discarded. Therefore, the clustering algorithm in our framework is applied to the statistics maintained by the online phase rather than to the original streams, as illustrated in Fig. 1.

At any time stamp, each stream receives a new value simultaneously, and there are n data streams in total. More specifically, we have the n streams $\{S_1, S_2, \ldots, S_n\}$ at time stamp $m$, where $S_i = \{f_i^1, f_i^2, \ldots, f_i^m\}$ for $1 \le i \le n$, and $f_i^t$ is the value of stream $S_i$ that has arrived at time $t$.

2.1.1 The Offline Clustering Phase

Let $k$ denote the number of clusters, and let $w$ be the window size of the clustering query submitted at time stamp $t_{now}$. The proposed algorithm will generate at most $p$ windows of k-clustering results, where

$$Cl(w_i) = \{C_1(w_i), C_2(w_i), \ldots, C_k(w_i)\},$$

for $1 \le i \le p$, which minimizes each clustering cost $Cost(Cl(w_i))$ of the substreams in the interval

$$(t_{now} - w \times i,\; t_{now} - w \times (i - 1)].$$

Note that $C_j(w_i)$ is the jth cluster of window $w_i$ with the properties $\bigcap_{j=1}^{k} C_j(w_i) = \emptyset$ and $\bigcup_{j=1}^{k} C_j(w_i) = \{S_1(w_i), S_2(w_i), \ldots, S_n(w_i)\}$, where $S_q(w_i) = \{f_q^{(t_{now} - w \times i) + 1}, f_q^{(t_{now} - w \times i) + 2}, \ldots, f_q^{(t_{now} - w \times i) + w}\}$ for $1 \le q \le n$.

Example 1. Consider the first three data streams $\{S_1, S_2, S_3\}$ in Fig. 1 at time stamp $t_{now} = 15$,

$S_1 = \{64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40\}$,

$S_2 = \{24, 38, 46, 52, 54, 56, 40, 16, 24, 26, 34, 28, 20, 14, 8\}$,

and

$S_3 = \{32, 46, 54, 60, 62, 64, 66, 64, 58, 50, 42, 36, 28, 22, 16\}$.

Assume that the clustering query is $k = 2$, $w = 5$, and $p = 2$. The algorithm will generate at most 2 windows of 2-clustering results, $Cl(w_1) = \{C_1(w_1), C_2(w_1)\}$ and $Cl(w_2) = \{C_1(w_2), C_2(w_2)\}$. Note that the resulting clusters of window $w_1$ are $C_1(w_1) = \{S_1\}$ and $C_2(w_1) = \{S_2, S_3\}$ since $S_2$ is more similar to $S_3$ in the interval $(15 - 5 \times 1,\; 15 - 5 \times (1 - 1)] = (10, 15]$. In window $w_2$, the clusters are $C_1(w_2) = \{S_1, S_2\}$ and $C_2(w_2) = \{S_3\}$ since $S_2$ is more similar to $S_1$ in the interval $(15 - 5 \times 2,\; 15 - 5 \times (2 - 1)] = (5, 10]$.
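The window intervals in Example 1 can be checked numerically. The following is a minimal sketch, not the paper's algorithm: it slices each stream into the interval $(t_{now} - w \times i,\; t_{now} - w \times (i - 1)]$ and reports the closest pair of substreams. The squared Euclidean distance and the helper names `window`, `sq_dist`, and `closest_pair` are assumptions made for illustration.

```python
from itertools import combinations

# Streams S1, S2, S3 from Example 1 (time stamps 1..15).
streams = {
    "S1": [64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40],
    "S2": [24, 38, 46, 52, 54, 56, 40, 16, 24, 26, 34, 28, 20, 14, 8],
    "S3": [32, 46, 54, 60, 62, 64, 66, 64, 58, 50, 42, 36, 28, 22, 16],
}

def window(values, t_now, w, i):
    """Values with time stamps in (t_now - w*i, t_now - w*(i-1)]."""
    start = t_now - w * i        # exclusive lower bound (1-based time stamps)
    end = t_now - w * (i - 1)    # inclusive upper bound
    return values[start:end]     # 0-based slice covers stamps start+1..end

def sq_dist(a, b):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_pair(t_now, w, i):
    """Names of the two streams whose substreams in window i are closest."""
    subs = {name: window(v, t_now, w, i) for name, v in streams.items()}
    return min(combinations(sorted(subs), 2),
               key=lambda p: sq_dist(subs[p[0]], subs[p[1]]))

print(closest_pair(15, 5, 1))  # window w1 = (10, 15]
print(closest_pair(15, 5, 2))  # window w2 = (5, 10]
```

With three streams and $k = 2$, grouping the closest pair and leaving the remaining stream alone gives the clusters stated in the example.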

2.1.2 The Online Maintenance Phase

To support various clustering queries in the offline clustering phase, adequate information has to be preserved during the online maintenance phase for discovering fast and slowly evolving patterns. Note that the resolution of statistics maintained could affect the patterns obtained. With a small interval used for summarization, short-term patterns can be observed, but long-term patterns are likely to be neglected. Also, the summaries are possibly affected by noises and oscillations, and the summaries should be updated or reconstructed frequently. On the other hand, if a large interval is used, long-term patterns can be observed while neglecting short-term patterns. Also, the patterns may not catch up with the changes in data streams. Moreover, the patterns could be very rough because of the generalization/summarization of streams. Therefore, we devise a hierarchical structure to store data summaries at different resolutions in the online maintenance phase. Whenever a clustering query is submitted, the offline clustering algorithm will select the approximations at the appropriate levels of the summary hierarchies to support the requirements of the clustering query. The most recent data points are in the first (latest) window and are always approximated by the most accurate fitting models, and new windows are in general more accurate than older ones. Therefore, recent data will not suffer the problem of roughness. Details of the online and offline phases will be described in the following sections.

2.2 Advantages of the COD Framework

In this section, we present two advantages of the COD framework, namely, clustering with flexible window sizes and trend detection, which are, in our view, not only of practical importance, but also absent in conventional methods.

2.2.1 Clustering with Flexible Window Sizes

Since the summaries maintained in the summary hierarchies are at multiple resolutions, the COD framework is able to support the clusters for variable window sizes. Applying the conventional time series clustering algorithms to the raw streams with the desired window sizes is not practical in the streaming environment because there is not enough space for the storage of continuous streams. On the other hand, the summarization techniques which maintain the streams with a fixed resolution do not perform well for various window sizes. Although the prior work [18] for clustering of evolving streams continuously reports clusters within the given distance threshold, it does not allow the user to specify a desired window size to be observed. In our COD framework, even though the data streams are collected and summarized into the summary hierarchies before the clustering queries are submitted, the clusters with various window sizes can be obtained directly from the existing summary hierarchies without parsing the data streams again to construct the summaries for the desired resolutions. For example, suppose that the summary hierarchies of the stock prices have been kept for the prices in 10 years. Then, clusters in one day, one month, or even one year can be observed very efficiently without resorting to the old prices again.

Fig. 1. Clustering of multiple data streams by COD at $t_{now} = 15$, for a

2.2.2 Trend Identification and Change Detection

From the definition of the clustering query, at most p windows of clustering results can be obtained for a query. It is very efficient to observe the trends and changes of clusters at one time. The behavior of the clusters, such as the moving paths of clusters, the merges and splits of clusters, and the streams jumping between clusters, can be investigated from the results of a clustering query. Consider the example of stock prices again. Assume that the window size is one month and 12 windows are inspected. We might find out that stock A and stock B are in the same cluster for one month and then stock A jumps to another cluster for the following several months. Such trends and changes within one year can be extended from the results of this clustering query.

3 THE FRAMEWORK OF COD

We present the details of the online maintenance phase and the offline clustering phase of COD in Sections 3.1 and 3.2, respectively. Then, we analyze the complexity of the proposed approaches in Section 3.3. Finally, we make some remarks on the adaptive use of COD in Section 3.4.

3.1 Online Maintenance Phase

The main objective of this online maintenance phase is to provide a one-scan algorithm of the incoming multiple data streams for statistics collection. A summary hierarchy is maintained incrementally to provide multiresolution approximations for a stream. Multiple levels in the hierarchy correspond to various resolutions.

Definition 1. A fitting model $F(h)$ is defined as an approximation of a raw substream in the interval $h = [t_s, t_e]$ by some summarization technique. The fitting model on level $L$ is generated by the aggregation of $B_h$ models on level $(L - 1)$. A level, say level $L$, also keeps the time stamp of the latest data point that has been summarized to that level as the end time $t_e^L$ of the level.

Fig. 2 illustrates an example of the summary hierarchy with the parameters $B_t = T$, $B_h = 2$, and at most six models kept in each level. The procedure of generating and updating the summary hierarchy is described in Fig. 3. According to the procedure of the online maintenance phase, each incoming value is put in the temporary bucket first. A new model of level 0 will be generated after $B_t$ values have arrived. At the same time, the end time $t_e^0$ of level 0 should also be updated. Then, these $B_t$ values are moved from the temporary bucket to the raw bucket, in which the latest values of the raw stream are maintained. The values in the raw bucket are maintained for queries with very short window sizes, which are smaller than the span of the fitting models in the lowest level. Since they are the newest values in data streams, we would like them to be as precise as possible, i.e., without approximation, when answering clustering queries. When $B_h$ new models are accumulated in level $L$, a new model of level $(L + 1)$ will be generated by aggregating the latest $B_h$ models in level $L$, and the end time $t_e^{L+1}$ of level $(L + 1)$ is also updated.

A summary hierarchy is established for each data stream. Therefore, n hierarchies in total are maintained for n streams. As shown in the procedure, to satisfy the space limitation in the streaming environment, only a bounded number of the latest models are maintained in each level. Note that this bound should be set to a number not smaller than $B_h$ in order to have enough fitting models for the model generation in a higher level. To capture approximate historical details of data streams with limited resources, various existing techniques for data compression are examined. Among several approaches, the wavelet-based transform [19], [20] and regression-based analysis [21], [22], which are selected as our fitting models, are found to be advantageous for their linear computation complexity and their graceful approximation with only a few coefficients.

3.1.1 Wavelet-Based Fitting Model

While the theory of wavelets is extensive, we confine ourselves to the rudimentary wavelet transform in this paper. Namely, the Haar wavelet basis function, which is the simplest wavelet, is explored and utilized for our purposes. In the rest of the paper, we will assume that Haar wavelets are used and a single coefficient representing the average is maintained as a fitting model. This is similar to the strategy of coefficient selection in [3]. To generalize the idea to averaging an arbitrary number of values, we have the following definition.

Fig. 2. The illustration of summary hierarchy, where $F(h)$ represents the fitting model in the time interval $h = [t_s, t_e]$. The parameters are set as $B_t = T$, $B_h = 2$, and at most six models per level.

Definition 2. A wavelet-based fitting model in an interval $h$ is represented by

$$W(h) = \frac{\sum_{h} f}{|h|}.$$

Lemma 1. The lossless aggregation of q wavelet-based fitting models is denoted by

$$W_q = \frac{\sum_{i=1}^{q} W(h_i) \times |h_i|}{\sum_{i=1}^{q} |h_i|},$$

where q successive models are merged together.
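Definition 2 and Lemma 1 can be combined with the maintenance procedure of Section 3.1 in a short sketch. This is an illustration under stated assumptions rather than the procedure of Fig. 3: the class name `WaveletHierarchy`, the tuple layout `(average, span, end_time)`, and the parameter name `max_models` (the per-level bound, assumed to be at least `b_h`) are hypothetical, and the raw bucket is omitted for brevity.

```python
from collections import deque

class WaveletHierarchy:
    """Summary hierarchy of one stream; each model is (average, span, end_time)."""

    def __init__(self, b_t, b_h, max_models):
        self.b_t, self.b_h, self.max_models = b_t, b_h, max_models
        self.temp = []       # temporary bucket of values not yet summarized
        self.levels = []     # levels[L] holds the latest models on level L
        self.pending = []    # pending[L] counts models not yet aggregated upward
        self.t = 0           # current time stamp

    @staticmethod
    def aggregate(models):
        """Lemma 1: span-weighted average of successive models."""
        span = sum(m[1] for m in models)
        avg = sum(m[0] * m[1] for m in models) / span
        return (avg, span, models[-1][2])

    def _insert(self, level, model):
        if level == len(self.levels):
            self.levels.append(deque(maxlen=self.max_models))
            self.pending.append(0)
        self.levels[level].append(model)  # the deque drops the oldest model when full
        self.pending[level] += 1
        if self.pending[level] == self.b_h:
            self.pending[level] = 0
            latest = list(self.levels[level])[-self.b_h:]
            self._insert(level + 1, self.aggregate(latest))

    def add(self, value):
        self.t += 1
        self.temp.append(value)
        if len(self.temp) == self.b_t:
            avg = sum(self.temp) / self.b_t  # Definition 2
            self.temp = []
            self._insert(0, (avg, self.b_t, self.t))

# Stream S1 of Example 1 with B_t = 2, B_h = 2, and at most six models per level.
S1 = [64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40]
hierarchy = WaveletHierarchy(b_t=2, b_h=2, max_models=6)
for value in S1:
    hierarchy.add(value)
for level, models in enumerate(hierarchy.levels):
    print(level, [m[0] for m in models], "end time", models[-1][2])
```

Using a bounded `deque` per level is one simple way to keep only the latest models; the counter `pending` tracks when $B_h$ new models have accumulated so that aggregation is triggered exactly once per group.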

The concept of a wavelet-based model can be best understood by the following example.

Example 2. The scenario of $B_t = T$, $B_h = 2$, and at most six models per level is shown in Fig. 2. A model of level 0 is generated from T arrivals, and two new models accumulated in level $L$ will be aggregated into a new model of level $(L + 1)$. Also, at most six latest models can be maintained in each level of the hierarchy. Let $B_t = 2$ and consider stream $S_1$ of Example 1 again. We have $\langle \frac{64 + 48}{2} = 56 \rangle$ at time stamp 2 and $\langle \frac{16 + 32}{2} = 24 \rangle$ at time stamp 4 for level 0. A fitting model of level 1 is produced by averaging two models of level 0 and, hence, $\langle \frac{56 + 24}{2} = 40 \rangle$ is also inserted into level 1 at time stamp 4. The first fitting model of level 2, $\langle \frac{40 + 46}{2} = 43 \rangle$, is generated at time stamp 8. By repeating this process, the wavelet-based summary hierarchy can be obtained as in Fig. 4a. Remember that at most six models can be kept in each level. Therefore, when the seventh model $\langle \frac{24 + 32}{2} = 28 \rangle$ is generated in level 0 at time stamp 14, the oldest model of that level, which is $\langle 56 \rangle$, will be removed. The latest value arriving at time stamp 15 is still in the temporary bucket waiting to be merged with the future values. The end time $t_e^L$ of each level will be the $t_e$ of the latest model in that level, which is 14, 12, and 8 for levels 0, 1, and 2, respectively.

3.1.2 Regression-Based Fitting Model

To meet the space constraint in a data stream environment, we devise a compact stream representation to comprise all the information required for the regression-based approximations. Consequently, only required synopses rather than historical details of data streams are maintained during our discovering process.

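A straight-line fit for a stream is a linear estimation function $\hat{f} = \hat{\alpha} + \hat{\beta} t$ that conforms to the principle of least squares. Specifically, the regression parameters $\hat{\alpha}$ and $\hat{\beta}$ are chosen to make the residual sum of squares

$$D = \sum_{t=1}^{n} (f_t - \hat{\alpha} - \hat{\beta} t)^2$$

minimal, where $f_t$ is the actual value of the tth data point. To perform the calculations for getting the best estimates of $\hat{\alpha}$ and $\hat{\beta}$, the following quantities are maintained:

$$\bar{t} = \frac{1}{n} \sum t, \quad \bar{f} = \frac{1}{n} \sum f, \quad S_{tt} = \sum (t - \bar{t})^2 = \sum t^2 - \frac{(\sum t)^2}{n},$$

and

$$S_{tf} = \sum (t - \bar{t})(f - \bar{f}) = \sum tf - \frac{(\sum t)(\sum f)}{n}.$$

Then, the least square estimates of $\hat{\alpha}$ and $\hat{\beta}$ are computed by

$$\hat{\beta} = \frac{S_{tf}}{S_{tt}}, \quad \text{and} \quad \hat{\alpha} = \bar{f} - \hat{\beta} \bar{t} = \frac{\sum f}{n} - \hat{\beta} \frac{\sum t}{n}.$$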

Specifically, only three measures need to be maintained during the regression process, i.e., the ending time ($t_e$), the accumulative product of time and value ($\sum tf$), and the accumulative sum of value ($\sum f$). Consequently, we have the following definition for the regression-based fitting model.

Definition 3. A regression-based fitting model in an interval $h$ is represented by

$$R(h) = \left\langle \sum_{h} tf, \sum_{h} f \right\rangle.$$

Based on the following lemma, we will prove in Theorem 1 that the least square error linear fit for a substream can be obtained losslessly from the accumulative sum $\langle \sum tf, \sum f \rangle$ and $t_e^L$ of that level. In the interest of space, proofs of theorems and lemmas are given in the Appendix.

Lemma 2. 1) $\sum_{t=t_s}^{t_e} t = \frac{(t_s + t_e)(t_e - t_s + 1)}{2}$ and 2) $\sum_{t=t_s}^{t_e} t^2 = \sum_{t=1}^{t_e} t^2 - \sum_{t=1}^{t_s - 1} t^2 = \frac{t_e (t_e + 1)(2 t_e + 1)}{6} - \frac{(t_s - 1)(t_s)(2 t_s - 1)}{6}$.

Theorem 1. The least square error linear fit for a substream can be obtained losslessly from the accumulative sum $\langle \sum tf, \sum f \rangle$ and $t_e^L$ of that level.
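The idea behind Theorem 1 can be illustrated as follows: since the time stamps in an interval are consecutive integers, $\sum t$ and $\sum t^2$ follow from Lemma 2, so the least squares line is determined by $\langle \sum tf, \sum f \rangle$ and the interval alone. The sketch below is not the Appendix proof; the function names are hypothetical, the interval $[t_s, t_e]$ is assumed to be known (for example, from the level's end time and the model span), and the interval must contain at least two points.

```python
from fractions import Fraction

def sum_t(ts, te):
    """Lemma 2, part 1: sum of t for t = ts..te."""
    return Fraction((ts + te) * (te - ts + 1), 2)

def sum_t2(ts, te):
    """Lemma 2, part 2: sum of t^2 for t = ts..te."""
    return Fraction(te * (te + 1) * (2 * te + 1) - (ts - 1) * ts * (2 * ts - 1), 6)

def fit_from_summary(sum_tf, sum_f, ts, te):
    """Return (intercept, slope) of the least squares line from <sum tf, sum f>."""
    n = te - ts + 1
    st, st2 = sum_t(ts, te), sum_t2(ts, te)
    s_tt = st2 - st * st / n          # S_tt = sum t^2 - (sum t)^2 / n
    s_tf = sum_tf - st * sum_f / n    # S_tf = sum tf - (sum t)(sum f) / n
    slope = s_tf / s_tt
    intercept = (sum_f - slope * st) / n
    return intercept, slope

# The first four values of S1 in Example 1 (64, 48, 16, 32 at time stamps 1..4)
# give sum tf = 336 and sum f = 160.
intercept, slope = fit_from_summary(336, 160, 1, 4)
print(intercept, slope)
```

Exact rational arithmetic via `fractions.Fraction` is used here only to keep the check free of rounding; floating point would work equally well in practice.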

Lemma 3. The lossless aggregation of q regression-based fitting models is denoted by

$$R_q = \left\langle \sum_{i=1}^{q} \left( \sum_{h_i} tf \right), \sum_{i=1}^{q} \left( \sum_{h_i} f \right) \right\rangle,$$

where q successive models are merged together.

Example 3. Consider the stream in Example 2 again. Note that the procedure of generating the regression-based fitting model is the same as that of the wavelet-based one, but the values stored are different. Therefore, we have $\{\langle 160, 112 \rangle, \langle 176, 48 \rangle\}$ at time stamp 4, where the first two values in the stream, i.e., 64 and 48, generate the accumulative sum

$$\left\langle \sum tf, \sum f \right\rangle = \langle 1 \times 64 + 2 \times 48,\; 64 + 48 \rangle = \langle 160, 112 \rangle$$

at time stamp 2, and the second two values 16 and 32 generate the accumulative sum

$$\langle 3 \times 16 + 4 \times 32,\; 16 + 32 \rangle = \langle 176, 48 \rangle$$

at time stamp 4. In addition, $\langle 160 + 176,\; 112 + 48 \rangle = \langle 336, 160 \rangle$ is also inserted into level 1 at time stamp 4. The first fitting model $\langle 1{,}480, 344 \rangle$ of level 2, which is the summation of $\langle 336, 160 \rangle$ and $\langle 1{,}144, 184 \rangle$ of level 1, is generated at time stamp 8. By repeating this process, the regression-based summary hierarchy can be obtained as in Fig. 4b. The seventh model $\langle 760, 56 \rangle$ generated in level 0 at time stamp 14 also causes the oldest model of that level, i.e., $\langle 160, 112 \rangle$, to be removed. The end time $t_e^L$ of each level and the value in the temporary bucket are the same as those of the wavelet-based model.
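The numbers in Example 3 can be reproduced with a few lines. This is a minimal sketch assuming $B_t = 2$ and $B_h = 2$ as in Example 2; the helper names `level0_models` and `aggregate` are hypothetical, and the per-level bound on stored models is ignored so that all level-0 models remain visible.

```python
S1 = [64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40]

def level0_models(values, b_t):
    """Regression-based models <sum tf, sum f> over consecutive blocks of b_t values."""
    models = []
    for start in range(0, len(values) - b_t + 1, b_t):
        block = values[start:start + b_t]
        # Time stamps are 1-based, so the k-th value of the block has stamp start + k + 1.
        sum_tf = sum((start + k + 1) * f for k, f in enumerate(block))
        models.append((sum_tf, sum(block)))
    return models

def aggregate(models):
    """Lemma 3: component-wise sum of successive models."""
    return (sum(m[0] for m in models), sum(m[1] for m in models))

level0 = level0_models(S1, 2)
level1 = [aggregate(level0[0:2]), aggregate(level0[2:4])]
level2 = aggregate(level1)
print(level0)
print(level1, level2)
```

Because aggregation is a plain sum of both components, merging models never loses information about $\sum tf$ and $\sum f$ over the merged interval.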

There are still other possible solutions for estimating data streams, such as adaptive piecewise constant approximation. However, unlike our multiresolution structure, the adaptive piecewise constant approximation does not provide finer approximation for more recent data and coarser approximation for older data. Incorporating the adaptive piecewise constant approximation into the summary hierarchy is an interesting direction for future work.

3.2 Offline Clustering Phase

As described for the clustering query in Section 2.1, the users want to inspect clusters with window size w, and at most p windows will be observed. Note that the window size desired could be different from those maintained in the summary hierarchy. In this situation, we have to select the fitting models from appropriate levels of the hierarchy to approximate the desired windows. We have the following theorems for adaptive level selection.

Theorem 2. The highest level to approximate a substream with window size $w$ is

$$L_{max} = \left\lfloor \log_{B_h} \frac{w}{B_t} \right\rfloor.$$

Theorem 3. The lowest level to approximate a substream with window size $w$ is

$$L_{min} = \min \{ L \mid w \le (t_{now} - t_e^L) + \lambda_L \times h_L \},$$

where $\lambda_L$ is the exact number of fitting models in level $L$, and $h_L = B_t \times (B_h)^L$ is the window size of the fitting models in level $L$.
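One reading of Theorems 2 and 3 can be sketched in code. The sketch assumes that the highest level is the largest $L$ whose model span $h_L = B_t \times (B_h)^L$ does not exceed $w$ (a floor of the logarithm), and that the lowest level is the first one whose stored models, together with the not-yet-summarized tail after its end time, cover at least $w$ time stamps. Both readings, and all function and parameter names, are assumptions made for illustration.

```python
def span(level, b_t, b_h):
    """Window size h_L of a fitting model on the given level."""
    return b_t * b_h ** level

def highest_level(w, b_t, b_h):
    """Largest L with span(L) <= w, computed with integers (assumes w >= b_t)."""
    level = 0
    while span(level + 1, b_t, b_h) <= w:
        level += 1
    return level

def lowest_level(w, t_now, end_times, counts, b_t, b_h):
    """First level whose coverage (t_now - t_e^L) + count_L * h_L reaches w."""
    for level, (t_e, count) in enumerate(zip(end_times, counts)):
        if w <= (t_now - t_e) + count * span(level, b_t, b_h):
            return level
    return None  # no level covers a window of size w

# Hierarchy of Example 2 at t_now = 15: end times 14, 12, 8 and 6, 3, 1 models.
print(highest_level(5, 2, 2))
print(lowest_level(5, 15, [14, 12, 8], [6, 3, 1], 2, 2))
print(lowest_level(14, 15, [14, 12, 8], [6, 3, 1], 2, 2))
```

The integer loop in `highest_level` avoids floating-point rounding in the logarithm when $w / B_t$ is an exact power of $B_h$.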

From the above theorems, fitting models in the levels between $L_{min}$ and $L_{max}$ are able to approximate the substreams in the windows of clustering queries. The fitting models in higher levels span longer intervals and, thus, provide more generalized fitting to the original streams. In contrast, the fitting models in lower levels possess shorter spans and can thus provide more specific and accurate fits to the original streams. However, only a bounded number of models are maintained in each level. Therefore, we devise an adaptive clustering algorithm for the offline clustering phase to approximate each window of data streams with the fitting models as accurately as possible. The statistics that characterize a window of a stream are encapsulated as an entry, which is described in Definition 4. Note that each entry obtained from the hierarchy may include more than one fitting model. As shown in the procedure of the adaptive clustering algorithm, the clustering algorithm is executed according to the entries obtained from the fitting models in the hierarchy.

Definition 4. An entry $E_A(w_i)$ that encapsulates j fitting models to represent window $w_i$ of stream $S_A$ is expressed as:

a. $\{\langle W_A(h_1), |h_1| \rangle, \langle W_A(h_2), |h_2| \rangle, \ldots, \langle W_A(h_j), |h_j| \rangle\}$ for a wavelet-based model, where $h_q$ represents $[t_s, t_e]$ of the qth fitting model in the entry and

b. $\{\langle \hat{\alpha}_A(h_1), \hat{\beta}_A(h_1), h_1 \rangle, \langle \hat{\alpha}_A(h_2), \hat{\beta}_A(h_2), h_2 \rangle, \ldots, \langle \hat{\alpha}_A(h_j), \hat{\beta}_A(h_j), h_j \rangle\}$ for a regression-based model, where $\hat{\alpha}_A(h_d)$ and $\hat{\beta}_A(h_d)$ for $1 \le d \le j$ are calculated from the corresponding $R_A(h_d) = \langle \sum_{h_d} tf, \sum_{h_d} f \rangle$ by the equations derived in Section 3.1.2.

The outline of the adaptive clustering algorithm is shown in Fig. 5. In the first step of the adaptive clustering algorithm, the values of $L_{min}$ and $L_{max}$ are calculated according to the window size $w$ specified by the clustering request. Then, for each data stream, the fitting models, which are covered by the ranges of the desired windows, are retrieved in Step 2 and Step 3. Finally, for each window, the clustering algorithm is executed on the retrieved statistics. A typical cost function for a window of size $w$ is

$$Cost(Cl(w_i)) = \frac{\sum_{C_j(w_i)} \sum_{S_q(w_i) \in C_j(w_i)} dist(S_q(w_i), C_j(w_i).center)}{n \times w},$$

where

$$dist(S_q(w_i), C_j(w_i).center) = \sum_{f_q^t \in S_q(w_i),\; f_{C_j}^t \in C_j(w_i).center} \left( f_q^t - f_{C_j}^t \right)^2,$$

$C_j(w_i).center$ is the center of cluster $C_j(w_i)$, and $f_{C_j}^t$ are the values of this cluster center in the range of $w_i$. If more than one window is observed, the total cost will be the average clustering cost of the retrieved windows, as shown in the following equation:

$$Cost(Cl) = \frac{\sum_{i=1}^{p_r} \omega_i \, Cost(Cl(w_i))}{p_r}, \quad (1)$$

where $p_r$ is the number of windows actually retrieved and $\omega_i$ is the weight of the clustering cost for window $w_i$. In some applications, users are more interested in the recent data and give higher weights for the costs of more recent windows. Without loss of generality, this cost function will be used to evaluate the clustering results in our performance studies in later sections. Based on the definitions of the entry, the distance measurement between two substreams can be derived by Lemma 4.
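The cost function above can be evaluated on the clusters of Example 1 as a quick check. This is a minimal sketch on raw substreams rather than on summaries: the cluster center is assumed to be the pointwise mean of the member substreams, all weights are set to 1, and the function names are hypothetical.

```python
from fractions import Fraction

def center(members):
    """Pointwise mean of the member substreams (an assumed choice of center)."""
    return [Fraction(sum(col), len(members)) for col in zip(*members)]

def dist(sub, ctr):
    """Sum of squared differences between a substream and a cluster center."""
    return sum((f - c) ** 2 for f, c in zip(sub, ctr))

def window_cost(clusters):
    """Cost(Cl(w_i)); clusters is a list of clusters, each a list of substreams."""
    n = sum(len(c) for c in clusters)   # number of streams
    w = len(clusters[0][0])             # window size
    total = sum(dist(s, center(c)) for c in clusters for s in c)
    return total / (n * w)

def total_cost(window_costs, weights):
    """Equation (1): weighted costs averaged over the windows retrieved."""
    return sum(wt * c for wt, c in zip(weights, window_costs)) / len(window_costs)

# Windows w1 = (10, 15] and w2 = (5, 10] of Example 1 with the clusters stated there.
w1 = window_cost([[[16, 16, 24, 32, 40]],
                  [[34, 28, 20, 14, 8], [42, 36, 28, 22, 16]]])
w2 = window_cost([[[56, 48, 24, 32, 24], [56, 40, 16, 24, 26]],
                  [[64, 66, 64, 58, 50]]])
overall = total_cost([w1, w2], weights=[1, 1])
print(w1, w2, overall)
```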

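Lemma 4. The distance measurement $D_i$ between two substreams $S_A(w_i)$ and $S_B(w_i)$ of window $w_i$, i.e., between entries $E_A(w_i)$ and $E_B(w_i)$, is:

a. $$\sum_{d=1}^{j} \left( W_A(h_d) - W_B(h_d) \right)^2 \times |h_d|$$

for a wavelet-based model and

b. $$\sum_{d=1}^{j} \left[ \left( \hat{\alpha}_A(h_d) - \hat{\alpha}_B(h_d) \right)^2 |h_d| + 2 \left( \hat{\alpha}_A(h_d) - \hat{\alpha}_B(h_d) \right) \left( \hat{\beta}_A(h_d) - \hat{\beta}_B(h_d) \right) \sum_{h_d} t + \left( \hat{\beta}_A(h_d) - \hat{\beta}_B(h_d) \right)^2 \sum_{h_d} t^2 \right]$$

for a regression-based model, where $\hat{f}_A(h_d) = \hat{\alpha}_A(h_d) + \hat{\beta}_A(h_d) t$ and $\hat{f}_B(h_d) = \hat{\beta}_B(h_d) t + \hat{\alpha}_B(h_d)$.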
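Both parts of Lemma 4 are the sum of squared differences between the two approximating functions over the time stamps of each interval, expanded in closed form. The sketch below illustrates this under assumed data layouts: a wavelet-based entry is a list of `(average, span)` pairs, a regression-based entry is a list of `(intercept, slope, (ts, te))` triples, and the two entries being compared are assumed to cover identical intervals. The function names are hypothetical.

```python
def wavelet_distance(entry_a, entry_b):
    """Lemma 4a: squared difference of averages weighted by the interval length."""
    return sum((wa - wb) ** 2 * span
               for (wa, span), (wb, _) in zip(entry_a, entry_b))

def regression_distance(entry_a, entry_b):
    """Lemma 4b: closed form of the summed squared difference of two lines."""
    total = 0.0
    for (a_a, b_a, (ts, te)), (a_b, b_b, _) in zip(entry_a, entry_b):
        d_alpha, d_beta = a_a - a_b, b_a - b_b
        n = te - ts + 1
        s_t = (ts + te) * n / 2                     # Lemma 2, part 1
        s_t2 = (te * (te + 1) * (2 * te + 1)
                - (ts - 1) * ts * (2 * ts - 1)) / 6  # Lemma 2, part 2
        total += d_alpha ** 2 * n + 2 * d_alpha * d_beta * s_t + d_beta ** 2 * s_t2
    return total

# Level-0 wavelet models of S1 and S2 from Example 1 over time stamps 1..4.
print(wavelet_distance([(56, 2), (24, 2)], [(31, 2), (49, 2)]))

# Two lines, 1 + 2t and 0 + 1t, over time stamps 1..4; the pointwise check is
# sum((1 + t) ** 2 for t in 1..4).
print(regression_distance([(1, 2, (1, 4))], [(0, 1, (1, 4))]))
print(sum((1 + t) ** 2 for t in range(1, 5)))
```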

Let us consider Step 2 of the procedure of the adaptive clustering algorithm. Assume that level $L_{min} = 2$, as shown in Fig. 2. However, since the end time $t_e^2$ of level 2 is not equal to the current time, we aggregate the models in the lower levels (from level $2 - 1$ to 0) and the temporary bucket to generate a temporary model characterizing the interval between $t_e^2$ and the current time. Then, it is aggregated into the latest model in level 2.

The following lemma describes the interpolation method to obtain a fraction of a fitting model.

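Lemma 5. Let $t_i$ be a time stamp within the interval $h = [t_s, t_e]$ of a fitting model $W(h)$ (or $R(h)$, respectively). The interpolation of the fitting model over $h' = [t_i, t_e]$ is:

a. $W(h') = W(h)$ for a wavelet-based model and

b. $$R(h') = \left\langle \tfrac{1}{2}\hat{\alpha}\left(t_e^2 - t_i^2\right) + \tfrac{1}{3}\hat{\beta}\left(t_e^3 - t_i^3\right),\; \hat{\alpha}\left(t_e - t_i\right) + \tfrac{1}{2}\hat{\beta}\left(t_e^2 - t_i^2\right) \right\rangle$$

for a regression-based model, where $\hat{f} = \hat{\alpha} + \hat{\beta} t$ is the estimation function of $R(h)$.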
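As a rough illustration of Lemma 5, the following Python sketch evaluates the interpolated sums by integrating the estimation function over $[t_i, t_e]$. The function names and the tuple layout `(sum_tf, sum_f)` are illustrative assumptions, not part of the paper; the cubic term uses the factor 1/3 that results from integrating $\hat{\beta} t^2$.

```python
def interpolate_wavelet(w_value):
    """Lemma 5(a): a wavelet model keeps its average over any sub-interval."""
    return w_value


def interpolate_regression(alpha_hat, beta_hat, t_i, t_e):
    """Lemma 5(b): approximate (sum of t*f, sum of f) over [t_i, t_e].

    The sums are approximated by integrating f(t) = alpha_hat + beta_hat * t.
    The returned pair is an assumed representation of R(h').
    """
    sum_tf = 0.5 * alpha_hat * (t_e ** 2 - t_i ** 2) \
        + beta_hat * (t_e ** 3 - t_i ** 3) / 3.0
    sum_f = alpha_hat * (t_e - t_i) \
        + 0.5 * beta_hat * (t_e ** 2 - t_i ** 2)
    return sum_tf, sum_f


if __name__ == "__main__":
    # Hypothetical model f(t) = 1 + 2t restricted to [0, 1].
    print(interpolate_regression(1.0, 2.0, 0.0, 1.0))
```

Because the sums are replaced by integrals, the result is an approximation of the discrete sums, which is one reason interpolation can distort a fitting model.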

The details of Step 3 are described as follows. We can improve the clustering results by approximating a window with more fitting models from lower levels of the hierarchy. Note that interpolation may introduce some distortion into the fitting model. Therefore, we use interpolation only at the highest level $L_{\max}$ for each clustering query. An entry that encapsulates several whole fitting models from a level below $L_{\max}$ only approximates the window size $w$, which avoids the distortion from interpolation. More specifically, a fitting model is inserted into an entry if at least half of its interval is in the range of that window. Note that due to the space limitation of the data stream environment, at most $\alpha$ models are maintained in each level. Therefore, the number of models in lower levels is usually not enough for all the windows queried. In this situation, some fitting models in level $L_{\max}$ are aggregated and interpolated for an entry. Each entry generated from the models in level $L_{\max}$ has exactly window size $w$.

Fig. 5. The outlines of the adaptive clustering algorithm.

Fig. 6. An illustration of adaptive clustering. Assume that $\alpha = 12$, $B_t = B_h = 2$, $w = 180$, $p = 6$, and $t_{now} = 2{,}110$.

Example 4. Consider the summary hierarchies in Fig. 6 as an example. Assume that $t_{now} = 2{,}110$, $\alpha = 12$, $B_t = B_h = 2$, and the clustering query indicates that $w = 180$ and $p = 6$. According to Theorem 2 and Theorem 3, $L_{\min} = 3$ because $(t_{now} - t_e^L) + \alpha_L \times h_L = (2{,}110 - 2{,}096) + 12 \times 16 = 206 \ge 180$ for $L = 3$, and $(t_{now} - t_e^L) + \alpha_L \times h_L = (2{,}110 - 2{,}104) + 12 \times 8 = 102 < 180$ for $L = 2$. Moreover, $L_{\max} = \lfloor \log_{B_h}(w / B_t) \rfloor = \lfloor \log_2(180 / 2) \rfloor = 6$. Since the end time $t_e^{L_{\min}} = 2{,}096$ of level $L_{\min} = 3$ is not equal to the current time $t_{now} = 2{,}110$, the models of lower levels (from level 2 to 0) are aggregated to the latest model in level $L_{\min}$. Then, 11 fitting models in level 3 are encapsulated in the entry for window $w_1$, five fitting models in level 4 are encapsulated in the entry for window $w_2$, three fitting models in level 5 are encapsulated in the entry for window $w_3$, and two fitting models in level 5 are encapsulated in the entry for window $w_4$. Note that a fitting model is inserted into an entry if at least half of its interval is in the range of that window. As shown in Fig. 6, the entries representing windows $w_1$ to $w_4$ are approximated to the window size $w$ by several whole fitting models (without interpolation) from level $L_{\min}$ to $L_{\min} + 2$. Older windows such as $w_5$ and $w_6$ cannot be approximated by several small models. Therefore, we have to aggregate and interpolate fitting models in level $L_{\max} = 6$ to generate entries with exactly window size $w$.
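The level selection used in Example 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names and the `levels` dictionary (mapping a level to its end time and its current number of models) are hypothetical, and only the two levels quoted in the example are supplied.

```python
def highest_level(w, b_t, b_h):
    """Largest L with b_t * b_h**L <= w, i.e. floor(log_{b_h}(w / b_t)).

    Integer arithmetic is used to avoid floating-point rounding in the log.
    """
    if w < b_t:
        raise ValueError("window is smaller than a level-0 model")
    level = 0
    while b_t * b_h ** (level + 1) <= w:
        level += 1
    return level


def lowest_level(w, t_now, b_t, b_h, levels):
    """Lowest level whose models plus the temporary model cover the newest window.

    levels: dict {level: (end_time, model_count)}, an assumed data layout.
    Returns None when no supplied level covers the window.
    """
    for level in sorted(levels):
        end_time, model_count = levels[level]
        h_level = b_t * b_h ** level  # interval of one model on this level
        if (t_now - end_time) + model_count * h_level >= w:
            return level
    return None


if __name__ == "__main__":
    # Values quoted in Example 4 for levels 2 and 3.
    levels = {2: (2104, 12), 3: (2096, 12)}
    print(lowest_level(180, 2110, 2, 2, levels))  # 3
    print(highest_level(180, 2, 2))               # 6
```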

Assume that the interval covered by an entry is $w'$. We observe that the difference between $w'$ and the window size $w$ of the clustering query is bounded by $\frac{w}{B_h}$, which can be verified by the following lemma.

Lemma 6. The difference between the interval of the retrieved entry, denoted as $w'$, and the window size $w$ of the clustering query is bounded by $\frac{w}{B_h}$, i.e., $|w' - w| \le \frac{w}{B_h}$.

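Fig. 7. Example of applying the adaptive clustering algorithm on summary hierarchies of two streams $S_1$ and $S_2$, where $\alpha = 5$, $B_t = B_h = 2$, $w = 10$,

Example 5. Consider the summary hierarchies in Fig. 7 as an example. Assume that $t_{now} = 31$, $\alpha = 5$, $B_t = B_h = 2$, and the clustering query indicates that $w = 10$ and $p = 3$. According to Theorem 2 and Theorem 3, $L_{\min} = 0$ because $(t_{now} - t_e^L) + \alpha_L \times h_L = (31 - 30) + 5 \times 2 = 11 \ge 10$ for $L = 0$, and $L_{\max} = \lfloor \log_{B_h}(w / B_t) \rfloor = \lfloor \log_2(10 / 2) \rfloor = 2$. As shown below, these three windows of streams $S_1$ and $S_2$ can be approximated by entries $E_1(w_1)$, $E_1(w_2)$, $E_1(w_3)$ and $E_2(w_1)$, $E_2(w_2)$, $E_2(w_3)$, respectively, according to Definition 4.

a. Wavelet-based: Entry

$$E_1(w_1) = \left\{ \langle 56, 2\rangle, \langle 36, 2\rangle, \langle 28, 2\rangle, \langle 16, 2\rangle, \left\langle \tfrac{28 \times 2 + 40 \times 1}{2 + 1}, (2 + 1) \right\rangle \right\} = \{ \langle 56, 2\rangle, \langle 36, 2\rangle, \langle 28, 2\rangle, \langle 16, 2\rangle, \langle 32, 3\rangle \}$$

is obtained from the temporary bucket and level 0, entry $E_1(w_2) = \{ \langle 36, 4\rangle, \langle 40, 4\rangle \}$ is obtained from level 1, and entry $E_1(w_3) = \left\{ \left\langle \tfrac{43 \times 7 + 29 \times 3}{10}, 10 \right\rangle \right\} = \{ \langle 38.8, 10\rangle \}$ is interpolated and aggregated from fitting models in level 2. Then, entries

$$E_2(w_1) = \{ \langle 55, 2\rangle, \langle 28, 2\rangle, \langle 25, 2\rangle, \langle 31, 2\rangle, \langle 14, 3\rangle \}, \quad E_2(w_2) = \{ \langle 14.5, 4\rangle, \langle 40, 4\rangle \},$$

and $E_2(w_3) = \{ \langle 34.95, 10\rangle \}$ of stream $S_2$ can be obtained similarly. The distance between $w_1$ of streams $S_1$ and $S_2$ will be

$$D_1 = \sum_{d=1}^{5} \left(W_1(h_d) - W_2(h_d)\right)^2 \times |h_d| = (56 - 55)^2 \times 2 + (36 - 28)^2 \times 2 + (28 - 25)^2 \times 2 + (16 - 31)^2 \times 2 + (32 - 14)^2 \times 3 = 1{,}570.$$

$D_2$ and $D_3$ can be calculated similarly.

b. Regression-based: Entry

$$E_1(w_1) = \{ \langle 56, 0, [21, 22]\rangle, \langle 600, -24, [23, 24]\rangle, \langle 232, -8, [25, 26]\rangle, \langle 16, 0, [27, 28]\rangle, \langle -208, 8, [29, 31]\rangle \}.$$

Note that $\hat{\beta}_1(h_1) = \frac{S_{tf}}{S_{tt}} = \frac{0}{0.5} = 0$ and $\hat{\alpha}_1(h_1) = \bar{f} - \hat{\beta}_1(h_1)\,\bar{t} = 56 - 0 \times 21.5 = 56$ can be calculated from

$$S_{tf} = \sum tf - \frac{(\sum t)(\sum f)}{n} = 2{,}408 - \frac{(21 + 22) \times 112}{2} = 0$$

and

$$S_{tt} = \sum t^2 - \frac{(\sum t)^2}{n} = (21^2 + 22^2) - \frac{(21 + 22)^2}{2} = 925 - 924.5 = 0.5.$$

Similarly, $\hat{\beta}_1(h_2) = \frac{-12}{0.5} = -24$, $\hat{\alpha}_1(h_2) = 36 - (-24) \times 23.5 = 600$, and so on. Entry $E_1(w_2) = \{ \langle -80, 8, [13, 16]\rangle, \langle 276.8, -12.8, [17, 20]\rangle \}$ is obtained from level 1, and entry $E_1(w_3) = \{ \langle 59.492, -3.41049, [2, 11]\rangle \}$ is interpolated and aggregated from fitting models in level 2. Then, entries

$$E_2(w_1) = \{ \langle 12, 2, [21, 22]\rangle, \langle 592, -24, [23, 24]\rangle, \langle -26, 2, [25, 26]\rangle, \langle 196, -6, [27, 28]\rangle, \langle 194, -6, [29, 31]\rangle \},$$

$$E_2(w_2) = \{ \langle 40.6, -1.8, [13, 16]\rangle, \langle -130.2, 9.2, [17, 20]\rangle \}, \quad E_2(w_3) = \{ \langle 59.0467, -3.5189, [2, 11]\rangle \}$$

of stream $S_2$ can be obtained similarly. The distance between $w_1$ of streams $S_1$ and $S_2$ will be $D_1 = 2{,}032$ according to Lemma 4(b). $D_2$ and $D_3$ can be calculated similarly.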
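The two distances $D_1$ of Example 5 can be reproduced with a short Python sketch of Lemma 4. The function names and tuple layouts are illustrative assumptions. The signs of the regression coefficients below are the ones that agree with the window averages of the wavelet entries and with the stated result $D_1 = 2{,}032$.

```python
def wavelet_distance(entry_a, entry_b):
    """Lemma 4(a): entries are lists of (average, interval_length)."""
    return sum((wa - wb) ** 2 * length
               for (wa, length), (wb, _) in zip(entry_a, entry_b))


def regression_distance(entry_a, entry_b):
    """Lemma 4(b): entries are lists of (intercept, slope, (t_start, t_end))."""
    total = 0
    for (a_a, b_a, (ts, te)), (a_b, b_b, _) in zip(entry_a, entry_b):
        times = range(ts, te + 1)
        d_alpha, d_beta = a_a - a_b, b_a - b_b
        total += (d_alpha ** 2 * len(times)
                  + 2 * d_alpha * d_beta * sum(times)
                  + d_beta ** 2 * sum(t * t for t in times))
    return total


# Entries for window w1 of streams S1 and S2, taken from Example 5.
e1_wavelet = [(56, 2), (36, 2), (28, 2), (16, 2), (32, 3)]
e2_wavelet = [(55, 2), (28, 2), (25, 2), (31, 2), (14, 3)]
e1_regression = [(56, 0, (21, 22)), (600, -24, (23, 24)), (232, -8, (25, 26)),
                 (16, 0, (27, 28)), (-208, 8, (29, 31))]
e2_regression = [(12, 2, (21, 22)), (592, -24, (23, 24)), (-26, 2, (25, 26)),
                 (196, -6, (27, 28)), (194, -6, (29, 31))]

d1_wavelet = wavelet_distance(e1_wavelet, e2_wavelet)              # 1570
d1_regression = regression_distance(e1_regression, e2_regression)  # 2032

if __name__ == "__main__":
    print(d1_wavelet, d1_regression)
```

The regression distance only needs the coefficient differences and the closed-form sums of $t$ and $t^2$ per interval, so no raw data points are touched.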

Based on the adaptive clustering algorithm, we can also observe clustering patterns in intervals prior to the current time with a slight modification of the algorithm. In other words, we are able to observe $p$ windows of clustering results starting from $t_{before}$ rather than from $t_{now}$, i.e., window $w_i$ is in the interval $(t_{before} - w \times i,\; t_{before} - w \times (i - 1)]$ rather than in $(t_{now} - w \times i,\; t_{now} - w \times (i - 1)]$. Because the newest window to be observed is in the interval $(t_{before} - w,\; t_{before}]$, the lowest level to approximate a substream with window size $w$ will become

$$L_{\min} = \min_{L} \left\{ L \mid (t_{now} - t_{before}) + w \le (t_{now} - t_e^L) + \alpha_L \times h_L \right\},$$

which is the lowest level whose fitting models are still enough to illustrate the pattern within the interval $(t_{before} - w,\; t_{before}]$. In the original design of the adaptive clustering algorithm, the range of the clustering query starts from $t_{now}$. However, in order to support clustering queries for arbitrary historical ranges, Step 2 of the procedure of the adaptive clustering algorithm needs a small modification to handle the newest window $w_1$. Step 2 is modified as follows:

2. If the end time $t_e^{L_{\min}}$ of level $L_{\min}$ is smaller than $t_{before}$, aggregate the models of lower levels and the temporary bucket (if needed) to generate a temporary model characterizing the interval between $t_e^{L_{\min}}$ and $t_{before}$. Then, aggregate this temporary model into the latest model in level $L_{\min}$.

With the modifications mentioned above, we can observe clustering patterns in any interval. Note that the offline clustering phase is not designed for a specific clustering algorithm. Therefore, users can adopt any traditional clustering algorithm, with minor modification if necessary. Without loss of generality, the k-means clustering algorithm [23] is applied in the offline clustering phase for the complexity analysis and performance evaluation.

3.3 Complexity Analyses

The complexities of the online and offline phases of the COD framework are shown in the following theorems, where n is the number of streams and m is the number of data points in each stream.

Theorem 4. The time complexity of the online maintenance phase is $O\!\left(n\left(m + \frac{m}{B_t}\right)\right)$.

Theorem 5. The space complexity of the online maintenance phase is $O\!\left(n \alpha \log_{B_h}\!\left(\frac{m}{B_t}\right)\right)$.

Theorem 6. The time complexity of the offline clustering phase is $O(\alpha k n)$ for a window.

According to Theorem 5, the total number of fitting models maintained is relatively modest. For example, suppose that $B_t = B_h = 2$ and $\alpha = 20$. For a data stream running for 100 years with a clock time granularity of 1 second, the total number of fitting models maintained for a stream is at most $20 \times \log_2(100 \times 365 \times 24 \times 60 \times 60 / 2) \approx 600$. This is quite a modest storage requirement. Furthermore, the dimension of the data for clustering is usually highly reduced by the summarization techniques. Thus, the COD framework is very efficient in the clustering phase.
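The arithmetic behind this estimate can be checked with a few lines of Python. The function name is an illustrative assumption; the bound itself is the per-stream term $\alpha \log_{B_h}(m / B_t)$ of Theorem 5.

```python
import math


def max_models_per_stream(alpha, b_t, b_h, m):
    """Upper bound alpha * log_{b_h}(m / b_t) on the models kept for one stream."""
    return alpha * math.log(m / b_t, b_h)


if __name__ == "__main__":
    # 100 years of data at one point per second, as in the text.
    points = 100 * 365 * 24 * 60 * 60
    print(round(max_models_per_stream(20, 2, 2, points)))
```

The value comes out slightly above 600 (about 611), which matches the order of magnitude quoted in the text.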

3.4 Adaptive Use of COD

The proposed COD framework can be further extended to an adaptive version by integrating the wavelet-based and the regression-based models. Although the wavelet-based COD and the regression-based COD have the same complexities in time and space, we can still trade precision for storage space in practical applications. If the values of the data streams change slowly, a wavelet-based model may approximate these streams with acceptable precision while requiring shorter maintenance time and less storage space. Conversely, the regression-based model provides better approximations for faster changing data streams, at the cost of slightly longer maintenance time and twice the storage space. Therefore, it is space efficient to apply a wavelet-based model when the stream is stable and to employ a regression-based model when the stream changes dramatically. Based on temporal locality, we may expect that the behavior of a stream is similar in a nearby interval. The following criteria provide guidelines for selecting fitting models adaptively according to the current behavior of a stream.

Criterion 1. If the slope of a regression-based model does not exceed a threshold $\delta$, i.e., $|\hat{\beta}| \le \delta$, start to apply a wavelet-based model.

Criterion 2. If the variation of a wavelet-based model is larger than a threshold $\delta$, i.e., $\frac{W(h_i) - W(h_{i-1})}{|h|} > \delta$, where $|h| = |h_i| = |h_{i-1}|$, start to employ a regression-based model. Note that if the previous one is a regression-based model, the value of $W(h_{i-1})$ can be calculated from the second component of

$$R(h_{i-1}) = \left\langle \sum_{h_{i-1}} tf,\; \sum_{h_{i-1}} f \right\rangle \quad \text{by} \quad W(h_{i-1}) = \frac{\sum_{h_{i-1}} f}{|h_{i-1}|}.$$

Since a wavelet-based model can be obtained from a regression-based model by

$$W(h) = \frac{\sum_{h} f}{|h|},$$

and a regression-based model can be approximated from a wavelet-based model by

$$R(h) = \left\langle \sum_{h} tf,\; \sum_{h} f \right\rangle = \left\langle W(h) \times \sum_{h} t,\; W(h) \times |h| \right\rangle,$$

all the theorems and lemmas are still applicable to the adaptive COD. The fitting model in a higher level of the summary hierarchy is selected adaptively according to the following criterion.

Criterion 3. If the slope of the aggregated model is larger than a threshold $\delta$, a regression-based model is maintained. Otherwise, a wavelet-based model is produced.

Although a wavelet model can be constructed exactly from a regression model, a regression model requires twice the space used by a wavelet model. Therefore, if the storage space is very limited, users can take advantage of the adaptive COD to store steady data in wavelet models, which consume less space, without much influence on the clustering quality. Consequently, we can trade precision for storage space adaptively according to the current behavior of a stream.
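The conversions between the two model types and the selection rule of Criterion 3 can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names and return formats are hypothetical, the slope is recovered from the accumulated sums with the usual least-squares formulas, and the absolute value of the slope is compared with the threshold (as in Criterion 1), which is an assumption for Criterion 3.

```python
def wavelet_from_regression(sum_f, length):
    """W(h) = (sum of f over h) / |h|."""
    return sum_f / length


def regression_from_wavelet(w_value, ts, te):
    """Approximate R(h) = <W(h) * sum of t, W(h) * |h|> over h = [ts, te]."""
    times = range(ts, te + 1)
    return w_value * sum(times), w_value * len(times)


def intercept_and_slope(sum_tf, sum_f, ts, te):
    """Least-squares line from the accumulated sums of a regression model."""
    times = range(ts, te + 1)
    n = len(times)
    if n < 2:
        raise ValueError("at least two time stamps are needed for a slope")
    sum_t = sum(times)
    sum_tt = sum(t * t for t in times)
    s_tf = sum_tf - sum_t * sum_f / n
    s_tt = sum_tt - sum_t ** 2 / n
    beta_hat = s_tf / s_tt
    alpha_hat = sum_f / n - beta_hat * sum_t / n
    return alpha_hat, beta_hat


def choose_model(sum_tf, sum_f, ts, te, delta):
    """Keep a regression model for steep intervals, a wavelet model otherwise."""
    _, beta_hat = intercept_and_slope(sum_tf, sum_f, ts, te)
    if abs(beta_hat) > delta:
        return "regression", (sum_tf, sum_f)
    return "wavelet", wavelet_from_regression(sum_f, te - ts + 1)


if __name__ == "__main__":
    # Interval [21, 22] of stream S1 in Example 5: sum tf = 2408, sum f = 112.
    print(choose_model(2408, 112, 21, 22, delta=0.5))
    # Interval [23, 24] with values 48 and 24: sum tf = 1680, sum f = 72.
    print(choose_model(1680, 72, 23, 24, delta=0.5))
```

Storing one number instead of two for flat intervals is what lets the adaptive variant approach the storage cost of the wavelet-based hierarchy.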

4 EXPERIMENTAL RESULTS

To assess the performance of our framework, we have conducted a series of experiments. In Section 4.1, we introduce the simulation model. Next, we investigate the sensitivity analyses of parameters in Section 4.2. Then, we perform the adaptivity analyses of COD framework in Section 4.3. After that, the performance of the offline clustering phase of COD framework on the real data set is evaluated in Section 4.4. Finally, we examine the scalability of the online maintenance phase of the COD framework in Section 4.5.

4.1 Simulation Model

All of our experiments are conducted on a PC with an Intel Pentium III processor and 512 MB of memory, running the Windows XP Professional operating system. In order to evaluate our framework, both a synthetic data set and a real data set are used.

1. Fast/slow changing data: This data set is designed to provide different fractions of fast/slow changing data for examining the adaptive use of the COD framework. For a given fraction of fast data, $\rho$, each series is the concatenation of several segments of length LEN. Samples of the jth segment are generated as follows:

$$g_j(t) = (S + \varepsilon) \times \theta_j(t) + \eta(t),$$

where $\varepsilon$ and $\eta(t)$ are drawn from a standard normal distribution $N(0, 1)$, $t$ is in the range $[0, LEN - 1]$, and $\theta_j(t)$ is decided according to the following rules.

The slope of the jth segment is determined by $r_j$ and $state_j$, where $r_j$ is a value drawn uniformly from $[0, 1]$, and $state_j$ is either 0 or 1. Initially, $state_0$ is set to 0. Then, each state value is set according to its value in the previous segment and the current $r_j$ value. More specifically, $state_j = state_{j-1}$ if $r_j > \rho$, and $state_j = 1 - state_{j-1}$ if $r_j \le \rho$. By this setting, about a $(1 - \rho)$ fraction of the segments in a series are slowly changing segments with slopes close to 0 (those segments with $r_j > \rho$). On the other hand, about a $\rho$ fraction of the segments in a series are fast changing segments with slopes close to $\frac{S}{LEN}$ (those segments with $r_j \le \rho$). In our experiments, S and LEN are set to 30 and 50, respectively. Therefore, the slopes of fast changing segments are about 0.6. This data is used in our adaptivity analyses, and it also allows us to vary the number of data points and the number of data streams for the scalability analyses.

2. Real weather data: We obtain average daily temperatures of 290 cities around the world from the Temperature Data Archive (see footnote 1) of the University of Dayton. The daily average temperatures of each city are recorded from 1 January 1995 to the present. Each city is regarded as a data stream and each stream has 3,416 points.

This work provides a framework to support clustering requests on multiple data streams with flexible time ranges. Unlike prior studies [10], [11], [12], [13], the objective in this work is to partition the data streams, rather than their data points, into clusters. On the other hand, although the objective of the study [18] is to partition data streams, it does not support clustering with flexible time ranges. As a result, the problem studied in this paper is intrinsically different from those of prior works. In the following experiments, we use the average squared error as the evaluation function for clustering results, as shown in (1). The costs of COD are reported as ratios to the costs obtained by running the k-means algorithm on the raw data streams. Because the choice of initial cluster centers in a k-means algorithm affects the quality of results, the best set of clusters from multiple restarts is employed to alleviate this effect [23]. Note that parameter $B_t$ represents the number of values to be merged into a fitting model, while parameter $B_h$ represents the number of fitting models to be merged into a new fitting model in a higher level. Both parameters imply approximating several values (or fitting models) with a coarser summary. Therefore, they are set to the same value in our experiments to avoid presenting two experiments with similar effects. Without loss of generality, we assume that the bucket size $B_t = B_h = B$, and all the windows observed are regarded as equally important in the following experiments. That is, the weight $\omega_i$ of window $w_i$ is assumed to be one for each window.
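The evaluation function described above (the per-window average squared error, averaged over the retrieved windows with unit weights by default) can be sketched in Python as follows. The data layout, a list of `(center, members)` pairs per window, and the function names are assumptions made for illustration only.

```python
def window_cost(clusters, n, w):
    """Average squared error of one window.

    clusters: list of (center, members); center and each member are
    equal-length sequences of values in the window (an assumed layout).
    n: number of streams, w: window size.
    """
    total = 0.0
    for center, members in clusters:
        for stream in members:
            total += sum((f - c) ** 2 for f, c in zip(stream, center))
    return total / (n * w)


def total_cost(window_costs, weights=None):
    """Weighted average of the per-window costs; weights default to one."""
    if weights is None:
        weights = [1.0] * len(window_costs)
    weighted = sum(wt * cost for wt, cost in zip(weights, window_costs))
    return weighted / len(window_costs)


if __name__ == "__main__":
    # Hypothetical window with one cluster of two streams of length 2.
    clusters = [([1.0, 1.0], [[0.0, 0.0], [2.0, 2.0]])]
    print(window_cost(clusters, n=2, w=2))
    print(total_cost([1.0, 3.0]))
```

Dividing the cost obtained on the summary hierarchy by the cost obtained on the raw streams would then give the kind of ratio reported in the experiments.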

4.2 Sensitivity Analyses

In this section, two parameters of the summary hierarchy, namely the number of fitting models maintained in each level ($\alpha$) and the bucket size ($B$), are investigated on the weather data with a fixed window size $w = 50$. The complexity analysis discusses the worst case, which retrieves at most $\alpha$ fitting models for clustering. Note, however, that the worst case is rarely encountered in practice. Therefore, the clustering time is mostly dominated by the execution time of the k-means clustering algorithm. As shown in Figs. 8a and 8b, $\alpha$ does not have much influence on the costs and execution time of clustering. Note that since the clustering algorithms are only applied to the statistics maintained, both the wavelet-based and the regression-based COD are much more efficient than clustering on the raw data streams, though with slight additional time for calculation from the summary hierarchy. Therefore, when the bucket size is too small, the execution time of the COD framework will be larger. As shown in Figs. 8c and 8d, clustering on the summary hierarchies of the COD framework efficiently achieves almost the same quality as clustering on the raw data streams. Besides, the regression-based COD, with slightly longer execution time, is more accurate than the wavelet-based COD.

4.3 Adaptivity Analyses

In this section, the fast/slow changing data is used to investigate the performance of the proposed methods when varying the fraction of fast data. As described in the simulation model, the fast/slow changing data contain data sets with different fractions of fast data $\rho$. Nine data sets are generated, with $\rho$ ranging from 0.1 to 0.9. Each data set contains 100 streams, and each stream contains 20 segments. In the data set with $\rho = 0.1$, each stream contains 10 percent of fast changing segments, in the data set with $\rho = 0.2$, each stream contains 20 percent of fast changing segments, and so on. The threshold $\delta$ is set to 0.5 in this experiment (because the slope of fast data is about 0.6 in the simulation model). As shown in Fig. 9a, the clustering quality of the regression-based COD is better than that of the wavelet-based COD. The adaptive COD, which adaptively selects a wavelet-based model or a regression-based model for building the summary hierarchy, achieves clustering quality similar to the regression-based COD for data sets with less fast data, and its clustering cost increases as the fraction of fast changing segments increases. We observed that when the data stream changes too fast, some changes of a segment may be smoothed (regressed) to a line with a small slope and stored by a wavelet-based fitting model. In this extreme case, the clustering quality of the adaptive COD tends to be similar to that of the wavelet-based COD. On the other hand, we also investigate the storage space needed to store the summary hierarchies. Note that the wavelet-based hierarchy keeps one value in each model, while the regression-based hierarchy keeps two values in each model. We use the number of values stored in a summary hierarchy to approximate the storage space of the COD framework. As shown in Fig. 9b, the space needed for the regression-based COD is twice that needed for the wavelet-based COD. Note that the storage space of the adaptive COD increases with the amount of fast changing segments. Therefore, when the data sets do not contain too much fast changing data, the adaptive use of the COD framework can achieve clustering quality similar to the regression-based COD while using much less storage space for the summary hierarchy.

Fig. 8. The execution time and clustering costs of the clustering phase of COD framework. (a) Cost versus alpha. (b) Time versus alpha. (c) Cost versus bucket size. (d) Time versus bucket size.

Fig. 9. The clustering costs and storage units of summary hierarchies while varying the fraction of fast data. (a) Cost versus fraction of fast data. (b) Storage versus fraction of fast data.

1. The real data can be obtained from the Web site: http://www.engr.udayton.edu/weather/.

4.4 COD on Real Data Set

The performance of the COD framework on the real data set is next evaluated with a fixed bucket size $B = 12$. As shown in Fig. 10a and Fig. 10c, while varying the window size and the number of windows observed, the COD framework (both the regression-based COD and the adaptive COD) attains the same good clustering quality as clustering on the raw data streams, with significantly shorter execution time, as shown in Fig. 10b and Fig. 10d. Note that by maintaining more parameters, the regression-based COD approximates the streams more precisely than the wavelet-based one, and achieves the same quality as clustering on the raw data streams. In this experiment, we tested several settings of the threshold $\delta$ and observed that the results are consistent with one another: The regression-based COD is better than the wavelet-based COD, and the adaptive COD achieves quality similar to the regression-based COD. For these settings, there are segments with slopes larger than the threshold and segments with slopes smaller than the threshold. Without loss of generality, $\delta$ is set to 1, and the storage space for the regression-based COD, the wavelet-based COD, and the adaptive COD is 42,920, 21,460, and 22,344, respectively.

It is interesting to see that the adaptive COD attains the same good clustering quality as the regression-based COD while requiring almost as little storage space as the wavelet-based COD. The execution time of the three versions of the COD framework does not vary significantly with the window size, because the total execution time is dominated by the iterations that the k-means algorithm performs until it reaches a stable clustering result. However, the wavelet-based COD is faster than the regression-based COD and the adaptive COD when the number of windows observed (parameter $p$) is smaller, but is slower when $p$ is larger. In our observation, although the calculation of distances in the wavelet-based COD is faster than that in the regression-based COD and the adaptive COD, the wavelet-based COD has to run more iterations to reach a stable clustering result when $p$ is larger. The reason is that the wavelet-based COD approximates data streams by average values, and these average values are closer to one another in older windows (larger $p$). In other words, the distances between streams are smaller in older windows. Therefore, more streams will move between clusters when the cluster centers change. Consequently, more iterations, and thus more execution time, of the k-means algorithm are required for the wavelet-based COD when a larger value of $p$ is specified. These experiments show that the summarization techniques in our framework are accurate and efficient for various clustering queries in a data stream environment. Furthermore, the adaptive COD can achieve the same good quality as the regression-based COD with fewer resources.

4.5 Scalability

To evaluate the scalability of the online maintenance phase, scale-up experiments on both the number of data points and the number of data streams are conducted. The fast/slow changing data with 50 percent of fast data is used to illustrate how much space can be saved by the adaptive COD. As shown in Fig. 11a, as the number of data points in each stream increases from 1,000 to 10,000, the execution time grows linearly. This property also holds when the number of data streams varies from 100 to 1,000, as shown in Fig. 11c. These results conform to our analyses in Section 3.3, which state that the time complexity of the online maintenance phase of the COD framework is linear in both the number of streams and the number of data points in each stream. Note that the wavelet-based COD is more efficient in the online phase because fewer parameters are calculated and maintained, while the adaptive COD needs more time to decide which model should be kept. On the other hand, as shown in Fig. 11b and Fig. 11d, the storage space is proportional to $O(\log m)$ and $O(n)$, where $m$ is the number of points in each stream and $n$ is the number of streams. These results conform to the space complexity analyzed in Theorem 5. In addition, although the adaptive use of the COD framework spends more time building the summary hierarchy, it reduces the storage space significantly. Therefore, the adaptive COD is able to generate clusters with the same quality as the regression-based COD, while using almost as little storage space as the wavelet-based COD, as shown in Fig. 11b.

Fig. 10. Performance of the offline clustering phase of COD framework on the real data set. (a) Cost versus window size. (b) Time versus window size. (c) Cost versus windows observed. (d) Time versus windows observed.

5 CONCLUSIONS

In order to deal with various types of multiple data streams and to support flexible mining requirements, we devised a COD framework to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major advantages, namely, one data scan for online statistics collection and compact multiresolution approximations, which address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multiresolution approximations of data streams, flexible clustering demands can be supported. The online maintenance phase of COD provides an efficient algorithm to maintain the summaries of the data streams with multiple resolutions in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm was devised for the offline phase of COD to retrieve the approximations of the desired substreams from the summary hierarchies as precisely as possible according to the clustering queries. We proposed two summarization techniques to maintain summaries of the evolving streams, i.e., 1) wavelets and 2) regression lines. An adaptive version of the COD framework was also designed to obtain clustering results with the same quality as the regression-based COD while using much less storage space for the summary hierarchy. As shown in the complexity analyses and validated by our empirical studies, the algorithms of the COD framework perform very efficiently in the data stream environment while producing clustering results of very high quality.

APPENDIX

PROOFS OF THEOREMS AND LEMMAS

Proof of Theorem 1. From the equations derived in Section 3.1.2, $\hat{\alpha}$ and $\hat{\beta}$ can be calculated from $\sum t$, $\sum t^2$, $\sum f$, $\sum tf$, and $n$. Note that $n$ is the number of data points represented by a fitting model, and can be calculated by $B_t \times (B_h)^L$. The interval $[t_s, t_e]$ of the ith fitting model on level $L$ is equal to $[t_e^L - n \times i + 1,\; t_e^L - n \times (i - 1)]$, where the first fitting model is the latest one. According to the above lemma, $\sum t$ and $\sum t^2$ are both available. Therefore, the least square error linear fit for a substream can be obtained losslessly from the accumulative sums $\left\langle \sum tf, \sum f \right\rangle$ and $t_e^L$ of that level. □

Proof of Theorem 2. Let $h_{\max}$ be the largest window size on the hierarchy which is not larger than the desired clustering window size $w$. We have the following equations:

$$h_{\max} = B_t \times (B_h)^{L_{\max}}, \text{ and } h_{\max} \le w < h_{\max} \times B_h. \qquad (2)$$

As shown in the above equations, the window size of the fitting models in level $L_{\max}$ is $h_{\max}$ and the window size of the fitting models in level $(L_{\max} + 1)$ is $(h_{\max} \times B_h)$. The desired window size $w$ is between these two levels. A model in level $(L_{\max} + 1)$ merges more than $w$ data points into a fitting model. Thus, level $L_{\max}$ will be the highest level whose fitting models indicate the pattern changes with window size $w$. □

Proof of Theorem 3. Let $h_{\min}$ be the window size of the fitting models in level $L_{\min}$. We have the following equations:

$$h_{\min} = B_t \times (B_h)^{L_{\min}}, \text{ and}$$

$$\left(t_{now} - t_e^{(L_{\min} - 1)}\right) + \alpha_{(L_{\min} - 1)} \times \frac{h_{\min}}{B_h} < w \le \left(t_{now} - t_e^{L_{\min}}\right) + \alpha_{L_{\min}} \times h_{\min}.$$

As shown in the above equations, the total interval covered by the fitting models in level $L_{\min}$ is $\alpha_{L_{\min}} \times h_{\min}$. The data points in the interval $(t_{now} - t_e^{L_{\min}})$ can be described by a temporary model which will be described later. Then, the models in level $L_{\min}$ and the temporary model can cover the whole range of the latest window $w$ queried by a user. However, the total interval in level $(L_{\min} - 1)$ and its corresponding temporary model $(t_{now} - t_e^{(L_{\min} - 1)})$ cannot cover the whole range of the latest window $w$. Thus, level $L_{\min}$ will be the lowest level whose fitting models are still enough to illustrate the pattern with window size $w$. Note that only the latest window can be approximated by the fitting models from level $L_{\min}$; older windows should be approximated by the models from higher levels. □

Fig. 11. The scalability analyses of the online maintenance phase. (a) Time versus data number. (b) Storage versus data number. (c) Time versus stream number. (d) Storage versus stream number.

Proof of Lemma 4. (a) follows from the definition. The distance between two fitting models $\hat{f}_A(h_d) = \hat{\alpha}_A(h_d) + \hat{\beta}_A(h_d)\,t$ and $\hat{f}_B(h_d) = \hat{\alpha}_B(h_d) + \hat{\beta}_B(h_d)\,t$ is

$$\sum_{t \in h_d} \left(\hat{f}_A^t(h_d) - \hat{f}_B^t(h_d)\right)^2 = \sum_{t \in h_d} \left[\left(\hat{\alpha}_A(h_d) + \hat{\beta}_A(h_d)\,t\right) - \left(\hat{\alpha}_B(h_d) + \hat{\beta}_B(h_d)\,t\right)\right]^2$$

$$= \left(\hat{\alpha}_A(h_d) - \hat{\alpha}_B(h_d)\right)^2 |h_d| + 2\left(\hat{\alpha}_A(h_d) - \hat{\alpha}_B(h_d)\right)\left(\hat{\beta}_A(h_d) - \hat{\beta}_B(h_d)\right) \sum_{t \in h_d} t + \left(\hat{\beta}_A(h_d) - \hat{\beta}_B(h_d)\right)^2 \sum_{t \in h_d} t^2,$$

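where $\sum t$ and $\sum t^2$ can be calculated efficiently by Lemma 2. Consequently, for (b), the total distance of $j$ fitting models is the summation over the models, i.e., $D_i = \sum_{d=1}^{j} \sum_{t \in h_d} \left(\hat{f}_A^t(h_d) - \hat{f}_B^t(h_d)\right)^2$. □

Proof of Lemma 5. (a) follows from the definition. By the integration of $tf$ and $f$ in the interval $[t_i, t_e]$, we have

$$\int_{t_i}^{t_e} tf\,dt = \int_{t_i}^{t_e} t\left(\hat{\alpha} + \hat{\beta} t\right) dt = \int_{t_i}^{t_e} \left(\hat{\alpha} t + \hat{\beta} t^2\right) dt = \left[\tfrac{1}{2}\hat{\alpha} t^2 + \tfrac{1}{3}\hat{\beta} t^3\right]_{t_i}^{t_e},$$

$$\int_{t_i}^{t_e} f\,dt = \int_{t_i}^{t_e} \left(\hat{\alpha} + \hat{\beta} t\right) dt = \left[\hat{\alpha} t + \tfrac{1}{2}\hat{\beta} t^2\right]_{t_i}^{t_e}.$$

Consequently, for (b), $R(h') = \left\langle \sum_{t_i}^{t_e} tf,\; \sum_{t_i}^{t_e} f \right\rangle$ can be calculated by the following equations:

$$\sum_{t_i}^{t_e} tf = \tfrac{1}{2}\hat{\alpha}\left(t_e^2 - t_i^2\right) + \tfrac{1}{3}\hat{\beta}\left(t_e^3 - t_i^3\right), \qquad \sum_{t_i}^{t_e} f = \hat{\alpha}\left(t_e - t_i\right) + \tfrac{1}{2}\hat{\beta}\left(t_e^2 - t_i^2\right).$$ □

Proof of Lemma 6. The proof is divided into two cases, namely entries from (a) level $L_{\max}$ and (b) levels below $L_{\max}$. For case (a), each entry generated has exactly window size $w$, which means that $w' = w$. For case (b), let $h_i$ be the interval of a fitting model on the ith level of the summary hierarchy. We have the following equation: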

$$h_i = B_t \times (B_h)^i.$$

As mentioned above, a fitting model is inserted into an entry if at least half of its interval is in the range of that window. Therefore, the worst case occurs when the first and the last fitting models of the entry are entirely inserted (or removed) because exactly half of their intervals are in (or just more than half of their intervals are not in, respectively) the range of a queried window. That is,

$$|w' - w| \le \frac{h_i}{2} + \frac{h_i}{2} = h_i.$$

Note that $h_i$ decreases exponentially if we retrieve fitting models from lower levels, and the largest difference appears at level $(L_{\max} - 1)$. According to (2),

$$|w' - w| \le h_{(L_{\max} - 1)} = \frac{h_{L_{\max}}}{B_h} = \frac{h_{\max}}{B_h} \le \frac{w}{B_h}.$$ □

Proof of Theorem 4. For each data stream, every $B_t$ data

points generate a fitting model. Therefore, m

Bt models of

level 0 have been created in time OðmÞ after m points arrive. Every Bhmodels are aggregated to a model in a higher level in time OðBhÞ because the average of Bh values is calculated for a wavelet-based fitting model (and, respectively, two summations of Bh values are calculated for a regression-based fitting model). Accord-ingly, totally m Bt 1 Bh þ 1 Bh2 þ . . . þ 1 B log_Bhðm BtÞ h 0 B B @ 1 C C A Bmt 1 Bh 1 1 Bh ! ¼ O m BtðBh 1Þ

fitting models, which are in the levels larger than 0, have been created for a stream. The time required for building the Oð m

BtðBh1ÞÞ models will be OððBh

m

BtðBh1ÞÞ ¼ Oð

m BtÞ.

Consequently, the time complexity of the online maintenance phase for $n$ streams is $O\!\left( n \left( m + \frac{m}{B_t} \right) \right)$. □

Proof of Theorem 5. For each data stream, every $B_t$ data points generate a fitting model. Therefore, $\frac{m}{B_t}$ models of level 0 have been created after $m$ points have arrived. Every $B_h$ models are aggregated into one model at a higher level. Accordingly, the height of the summary hierarchy is $\log_{B_h}\!\left( \frac{m}{B_t} \right)$, and each level maintains at most a constant number of fitting models. Note that each fitting model contains a constant number of parameters, which is one for the wavelet-based model and two for the regression-based model. Consequently, on the order of $n \log_{B_h}\!\left( \frac{m}{B_t} \right)$ fitting models are maintained in total for $n$ streams, which means that the space complexity of the online maintenance phase is $O\!\left( n \log_{B_h}\!\left( \frac{m}{B_t} \right) \right)$. □
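The counting arguments in the proofs of Theorems 4 and 5 can be checked numerically. The following is a minimal sketch rather than the authors' implementation: the function name `build_counts` and the parameter names `b_t` and `b_h` (standing for $B_t$ and $B_h$) are hypothetical, and it assumes that leftover models which do not fill a complete group of $B_h$ are simply not promoted.

```python
def build_counts(m, b_t, b_h):
    """Count the fitting models created per level for one stream of m points.

    Assumption for illustration: level 0 receives one model per b_t points,
    and each higher level receives one model per b_h models of the level
    below. Incomplete groups are not promoted.
    """
    counts = [m // b_t]
    while counts[-1] >= b_h:
        counts.append(counts[-1] // b_h)
    return counts


m, b_t, b_h = 4096, 4, 2
counts = build_counts(m, b_t, b_h)

higher = sum(counts[1:])            # models created above level 0
bound = m / (b_t * (b_h - 1))       # geometric-series bound from the proof
height = len(counts) - 1            # number of levels above level 0

print(counts)                       # models created at each level
print(higher, "<=", bound)          # the bound of Theorem 4 holds
print(b_h ** height == m // b_t)    # height equals log_{B_h}(m / B_t)
```

With these values, level 0 holds 1,024 models, the levels above it hold 1,023 models in total (against a bound of 1,024), and the hierarchy has 10 levels above level 0, which matches $\log_2(4096 / 4)$.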

Proof of Theorem 6. The complexity of the k-means clustering algorithm is $O(knd)$, where $d$ is the number of dimensions. Since each window of the substream is represented by at most a constant number of fitting models, each with a constant number of parameters, the time complexity of the offline clustering phase is $O(kn)$ for a window. □
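To make the $O(knd)$ term concrete, the sketch below runs a single k-means assignment pass and counts the coordinate-level operations. It is an illustration under stated assumptions, not the clustering code of the COD framework: the function `assign`, the toy vectors, and the choice of squared Euclidean distance are all hypothetical, and each stream is assumed to be summarized by a vector of $d$ model parameters for the queried window.

```python
def assign(points, centers):
    """Run one k-means assignment pass.

    Returns the label of the nearest center for each point, together with
    the number of coordinate-level operations performed, which is k * n * d.
    """
    ops = 0
    labels = []
    for p in points:
        best, best_dist = None, float("inf")
        for j, c in enumerate(centers):
            dist = 0.0
            for x, y in zip(p, c):
                dist += (x - y) ** 2   # squared Euclidean distance
                ops += 1
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels, ops


# n = 4 streams, each summarized by d = 3 parameters; k = 2 centers.
points = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1], [5.0, 5.1, 4.9], [5.2, 5.0, 5.1]]
centers = [[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]
labels, ops = assign(points, centers)
print(labels, ops)   # [0, 0, 1, 1] 24
```

The operation count is $k \cdot n \cdot d = 2 \cdot 4 \cdot 3 = 24$. When $d$ is bounded by a constant, as the proof argues for a window represented by a constant number of fitting models, the cost per pass grows as $O(kn)$.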

R

EFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” Proc. ACM Symp. Principles of Database Systems (PODS), June 2002.

[2] M.R. Henzinger, P. Raghavan, and S. Rajagopalan, “Computing on Data Streams,” Dimacs Series in Discrete Mathematics and Theoretical Computer Science, vol. 50, pp. 107-118, 1999.

[3] A. Bulut and A.K. Singh, “Swat: Hierarchical Stream Summariza-tion in Large Networks,” Proc. Int’l Conf. Data Eng., pp. 303-314, Mar. 2003.

[4] A. Bulut and A.K. Singh, “A Unified Framework for Monitoring Data Streams in Real Time,” Proc. Int’l Conf. Data Eng., pp. 44-55, 2005.

[5] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining Stream Statistics over Sliding Windows,” Proc. SIAM Symp. Discrete Algorithms, pp. 635-644, Jan. 2002.

[6] W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” Proc. ACM SIGKDD, pp. 128-137, 2004.

[7] C.C. Aggarwal, “On Change Diagnosis in Evolving Data Streams,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 5, pp. 587-600, May 2005.

[8] C.C. Aggarwal and P.S. Yu, “Online Analysis of Community Evolution in Data Streams,” Proc. ACM SIAM on Data Mining (SDM ’05), 2005.

[9] T. Johnson, S. Muthukrishnan, and I. Rozenbaum, “Sampling Algorithms in a Stream Operator,” Proc. ACM SIGMOD Conf., pp. 1-12, 2005.

[10] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “A Framework for Clustering Evolving Data Streams,” Proc. Very Large Data Bases Conf., Sept. 2003.

[11] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “A Framework for Projected Clustering of High Dimensional Data Streams,” Proc. Very Large Data Bases Conf., pp. 852-863, 2004.

[12] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, “Clustering Data Streams,” Proc. Symp. Foundations of Computer Science, pp. 359-366, Nov. 2000.

[13] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-Data Algorithms for High-Quality Cluster-ing,” Proc. Int’l Conf. Data Eng., 2002.

[14] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “On Demand Classification of Data Streams,” Proc. ACM SIGKDD, pp. 503-508, 2004.

[15] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” Proc. ACM SIGKDD, pp. 71-80, Aug. 2000.

[16] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” Proc. ACM SIGKDD, pp. 97-106, Aug. 2001. [17] E. Keogh, J. Lin, and W. Truppel, “Clustering of Time Series

Subsequences is Meaningless: Implications for Past and Future Research,” Proc. IEEE Int’l Conf. Data Mining, Nov. 2003. [18] J. Yang, “Dynamic Clustering of Evolving Streams with a Single

Pass,” Proc. IEEE Int’l Conf. Data Mining (ICDE ’03), pp. 695-697, Mar. 2003.

[19] T. Li, Q. Li, S. Zhu, and M. Ogihara, “A Survey on Wavelet Applications in Data Mining,” SIGKDD Explorations, vol. 4, no. 2, pp. 49-68, 2003.

[20] W.-G. Teng, M.-S. Chen, and P.S. Yu, “Using Wavelet-Based Resource-Aware Mining to Explore Temporal and Support Count Granularities in Data Streams,” Proc. SIAM Int’l Conf. Data Mining, Apr. 2004.

[21] Y. Chen, G. Dong, J. Han, B.W. Wah, and J. Wang, “Multi-Dimensional Regression Analysis of Time-Series Data Streams,” Proc. Very Large Data Bases Conf., pp. 323-334, 2002.

[22] W.-G. Teng, M.-S. Chen, and P.S. Yu, “A Regression-Based Temporal Pattern Mining Scheme for Data Streams,” Proc. Very Large Data Bases Conf., pp. 93-104, Sept. 2003.

[23] P.S. Bradley and U.M. Fayyad, “Refining Initial Points for k-Means Clustering,” Proc. Int’l Conf. Machine Learning, pp. 91-99, July 1998.

Bi-Ru Dai received the BS and PhD degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 2000 and 2006, respectively. Her research interests include data mining, data clustering, data stream management, and bioinformatics.

Jen-Wei Huang received the BS degree in electrical engineering from the National Taiwan University, Taiwan, in 2002, and is currently working toward the PhD degree. He is majoring in computer science and is familiar with the data mining area. His research interests include data mining, mobile computing, and bioinformatics. Among these, Web mining, incremental mining, mining of data streams and time series, and sequential pattern mining are his special interests. In addition, some of his research is on mining general temporal association rules, sequential clustering, data broadcasting, progressive sequential patterns, and bioinformatics.

Mi-Yen Yeh received the BS degree in electrical engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 2002. She is currently a PhD candidate in the Electrical Engineering Department, National Taiwan University, Taipei, Taiwan, R.O.C. Her research interests include data mining, data clustering, and data stream management.

Ming-Syan Chen received the BS degree in electrical engineering from National Taiwan University, Taipei, Taiwan, and the MS and PhD degrees in computer, information, and control engineering from The University of Michigan, Ann Arbor, in 1985 and 1988, respectively. Dr. Chen is currently a professor and the chairman of the Graduate Institute of Communication Engineering, a professor in the Electrical Engineering Department, and also a professor in the Computer Science and Information Engineering Department, National Taiwan University, Taipei, Taiwan. He was a research staff member at IBM Thomas J. Watson Research Center, Yorktown Heights, New York, from 1988 to 1996. His research interests include database systems, data mining, mobile computing systems, and multimedia networking, and he has published more than 200 papers in his research areas. In addition to serving as a program committee member in many conferences, he is currently on the editorial board of the Very Large Data Base (VLDB) Journal, the Knowledge and Information Systems (KAIS) Journal, the Journal of Information Science and Engineering, and the International Journal of Electrical Engineering, and has been a Distinguished Visitor of IEEE Computer Society for Asia-Pacific from 1998 to 2000 and again from 2005 to 2007 (invited twice). He has served as an international vice chair, program chair, program cochair, and program vice chair for numerous conferences. He holds, or has applied for, 18 US patents and seven ROC patents in the areas of data mining, Web applications, interactive video playout, video server design, and concurrency and coherency control protocols. He is a recipient of the NSC (National Science Council) Distinguished Research Award, Pan Wen Yuan Distinguished Research Award, and K.-T. Li Research Penetration Award for his research work, and also the Outstanding Innovation Award from IBM Corporate for his contribution to a major database product.
He has also received numerous awards for his research, teaching, inventions, and patent applications. He coauthored, with his students, works that received the ACM SIGMOD Research Student Award and Acer Long-Term Thesis Awards. Dr. Chen is a fellow of the IEEE and a member of the ACM.