
Clustering on Demand for Multiple Data Streams

Bi-Ru Dai, Jen-Wei Huang, Mi-Yen Yeh, and Ming-Syan Chen

Department of Electrical Engineering

National Taiwan University

Taipei, Taiwan, ROC

E-mail: mschen@cc.ee.ntu.edu.tw, {brdai, jwhuang, miyen}@arbor.ee.ntu.edu.tw

Abstract

In the data stream environment, the patterns generated by mining techniques are usually distinct at different times because of the evolution of data. In order to deal with various types of multiple data streams and to support flexible mining requirements, we devise in this paper a Clustering on Demand framework, abbreviated as the COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major features, namely one data scan for online statistics collection and compact multi-resolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multi-resolution approximations of data streams, flexible clustering demands can be supported.

1 Introduction

In recent years, several query problems and mining capabilities have been explored for the data stream environment [2], including those on statistics [3], aggregate queries [4], association rules [8], frequent patterns [10], data clustering [1][5][9], and data classification [6], to name a few. For data stream applications, the volume of data is usually too huge to be stored on permanent devices or to be scanned thoroughly more than once. It is hence recognized that both approximation and adaptivity are key ingredients for executing queries and performing mining tasks over rapid data streams.

In this paper, the problem of clustering multiple data streams is addressed. It is assumed that at each time stamp, data points from individual streams arrive simultaneously, and the data points are highly correlated with previous ones in the same stream. Unlike prior studies, the objective in this work is to partition these data streams, rather than their data points, into clusters. Note that the data streams are not of a fixed length. Instead, they are still evolving when the clustering results are observed at users' requests. The problem studied in this paper is different from the one discussed in [7], which focuses on clustering the windows of a single streaming time series. On the other hand, clustering of evolving streams is discussed in [11]. However, the objective in [11] is to continuously report clusters satisfying a specified distance threshold. To further enhance these techniques, clustering requests of flexible time ranges are supported in our framework.

In the data stream environment, the patterns generated by mining techniques are usually distinct at different times because of the evolution of data. Depending on the application, the frequency with which patterns in data streams change varies. For example, the streams gathered from adjacent sensors may always be in the same cluster. However, for stock prices, some companies are probably within the same cluster during several months but in different clusters afterward. The clusters obtained hence change frequently. Therefore, an important question arises: "Can we design a scheme for modeling both fast and slowly evolving patterns adaptively?" Furthermore, the clustering request is unknown when the data is collected and processed. After the time range of the clustering request is revealed, recommendations for short-term or long-term investments should be offered precisely. This leads to another important issue: "Can we provide a system to support various clustering requirements at the same time?"

Consequently, we devise in this paper a framework of Clustering on Demand, abbreviated as the COD framework, to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the proposed COD framework has two advantageous features: (1) one data scan for online statistics collection, and (2) compact multi-resolution approximations, which are designed to address, respectively, the time and the space constraints in a data stream environment. Furthermore, with the multi-resolution approximations of data streams, flexible clustering demands can be supported. Note that since the clustering algorithms are applied only to the maintained statistics rather than to the original data streams, the proposed COD framework is very efficient in practice.

The proposed COD framework consists of two phases, namely the online maintaining phase and the offline clustering phase. The online maintaining phase provides an efficient algorithm to maintain the summary hierarchies of the data streams with multiple resolutions, in time linear in both the number of streams and the number of data points in each stream. On the other hand, an adaptive clustering algorithm is devised for the offline phase to retrieve the approximations of the desired sub-streams from the summary hierarchies as precisely as possible according to the clustering queries specified by the users. In general, we keep finer approximations for more recent data and coarser approximations for older data. The COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.

The rest of this paper is organized as follows. Preliminaries and advantages of the COD framework are described in Section 2. The online maintaining phase and the offline clustering phase of the COD framework are presented in Section 3. This paper concludes with Section 4.

2 Clustering on Demand Framework

2.1 Framework Definitions

The COD framework has two phases, i.e., the online maintaining phase and the offline clustering phase. In the online maintaining phase, the arriving data streams are processed and only very brief summaries are maintained. The offline clustering phase deals with the clustering queries. The COD framework supports clustering queries with a flexible window size and a desired number of windows to observe. For example, a clustering query could be 12 windows with a window size of 30 days, to observe the clusters of each month during a year. Owing to the limited space in the data stream environment, the raw data streams are parsed only once and then discarded. Therefore, the clustering algorithm in our framework is applied to the statistics maintained by the online phase rather than to the original streams, as illustrated in Figure 1.
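For concreteness, a clustering query of this form can be held in a small record. The sketch below is illustrative only; the class and field names are hypothetical and not part of the COD framework's code.

```python
from dataclasses import dataclass

@dataclass
class ClusteringQuery:
    """Hypothetical container for a COD clustering query."""
    k: int   # desired number of clusters
    w: int   # window size (number of time stamps per window)
    p: int   # number of windows to observe, counted back from the query time

# The example from the text: 12 windows of 30 days each
# (k is chosen arbitrarily here for illustration).
query = ClusteringQuery(k=4, w=30, p=12)
```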

At any time stamp, each stream receives a new value simultaneously, and there are in total n data streams. More specifically, we have the n streams {S_1, S_2, ..., S_n} at time stamp m, where S_i = {f_i^1, f_i^2, ..., f_i^m}, for 1 ≤ i ≤ n, and f_i^j is the value of stream S_i that arrives at time j. First, we introduce the offline clustering phase. Let k denote the number of clusters, and let w be the window size of the clustering query submitted at time stamp t_now. The proposed algorithm will generate at most p windows of k-clustering results, Cl(w_i) = {C_1(w_i), C_2(w_i), ..., C_k(w_i)} for 1 ≤ i ≤ p, where each Cl(w_i) minimizes the clustering cost Cost(Cl(w_i)) of the sub-streams in the interval (t_now − w×i, t_now − w×(i−1)]. Note that C_j(w_i) is the j-th cluster of window w_i, with the properties that ∩_{j=1..k} C_j(w_i) = ∅ and ∪_{j=1..k} C_j(w_i) = {S_1(w_i), S_2(w_i), ..., S_n(w_i)}, where S_q(w_i) = {f_q^{(t_now−w×i)+1}, f_q^{(t_now−w×i)+2}, ..., f_q^{(t_now−w×i)+w}}, for 1 ≤ q ≤ n.

Figure 1. Clustering of multiple data streams by COD at t_now = 15, for a query of k = 2, w = 5, and p = 2.

Example 1: Consider the first three data streams {S_1, S_2, S_3} in Figure 1 at time stamp t_now = 15, where
S_1 = {64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40},
S_2 = {24, 38, 46, 52, 54, 56, 40, 16, 24, 26, 34, 28, 20, 14, 8}, and
S_3 = {32, 46, 54, 60, 62, 64, 66, 64, 58, 50, 42, 36, 28, 22, 16}.
Assume that the clustering query is k = 2, w = 5, and p = 2. The algorithm will generate at most 2 windows of 2-clustering results, Cl(w_1) = {C_1(w_1), C_2(w_1)} and Cl(w_2) = {C_1(w_2), C_2(w_2)}. Note that the resulting clusters of window w_1 are C_1(w_1) = {S_1} and C_2(w_1) = {S_2, S_3}, since S_2 is more similar to S_3 in the interval (15 − 5×1, 15 − 5×(1−1)] = (10, 15]. In window w_2, the clusters are C_1(w_2) = {S_1, S_2} and C_2(w_2) = {S_3}, since S_2 is more similar to S_1 in the interval (15 − 5×2, 15 − 5×(2−1)] = (5, 10]. ∎
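The following sketch reproduces the window-by-window grouping of Example 1. The paper leaves the distance measure and the clustering algorithm open, so the sketch simply assumes Euclidean distance between the raw sub-streams and, since k = 2 with three streams, merges the closest pair; all function and variable names are hypothetical.

```python
import itertools
import math

# Streams from Example 1 (values observed at time stamps 1..15).
streams = {
    "S1": [64, 48, 16, 32, 56, 56, 48, 24, 32, 24, 16, 16, 24, 32, 40],
    "S2": [24, 38, 46, 52, 54, 56, 40, 16, 24, 26, 34, 28, 20, 14, 8],
    "S3": [32, 46, 54, 60, 62, 64, 66, 64, 58, 50, 42, 36, 28, 22, 16],
}

def sub_stream(values, t_now, w, i):
    """Values of a stream in the interval (t_now - w*i, t_now - w*(i-1)]."""
    start = t_now - w * i          # exclusive bound
    end = t_now - w * (i - 1)      # inclusive bound
    return values[start:end]       # time stamp t is stored at index t - 1

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

t_now, w, p = 15, 5, 2
for i in range(1, p + 1):
    subs = {name: sub_stream(v, t_now, w, i) for name, v in streams.items()}
    # With k = 2 and only three streams, merge the closest pair of sub-streams.
    pair = min(itertools.combinations(subs, 2),
               key=lambda ab: euclidean(subs[ab[0]], subs[ab[1]]))
    rest = [name for name in subs if name not in pair]
    print(f"window w_{i}: clusters {set(pair)} and {set(rest)}")
```

Running it groups {S2, S3} against {S1} for w_1 and {S1, S2} against {S3} for w_2, matching Example 1.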

We next describe the online maintaining phase. To support various clustering queries in the offline clustering phase, adequate information has to be preserved during the online maintaining phase for discovering both fast and slowly evolving patterns. Note that the resolution of the statistics maintained can affect the patterns obtained. With a small interval used for summarization, short-term patterns can be observed, but long-term patterns are likely to be neglected. In addition, the summaries are easily affected by noise and oscillations, and have to be updated or reconstructed frequently. On the other hand, if a large interval is used, long-term patterns can be observed while short-term patterns are neglected. Also, the patterns may not catch up with the changes of the data streams. Moreover, the patterns could be very rough because of the generalization/summarization of the streams. Therefore, we devise a hierarchical structure to store data summaries at different resolutions in the online maintaining phase. Whenever a clustering query is submitted, the algorithm of the offline clustering phase selects approximations from appropriate levels of the summary hierarchies to satisfy the requirements of the clustering query. Details of the online and offline phases are described in the following sections.

2.2 Advantages of the COD Framework

Since the summaries maintained in the summary hierarchies are at multiple resolutions, the COD framework is able to support clustering for variable window sizes. Applying conventional time series clustering algorithms to the raw streams with the desired window sizes is not practical in the streaming environment because there is not enough space to store the continuous streams. On the other hand, summarization techniques which maintain the streams at a fixed resolution do not perform well for various window sizes. Although the prior work [11] for clustering of evolving streams continuously reports clusters within a given distance threshold, it does not allow the user to specify a desired window size to be observed. In our COD framework, even though the data streams are collected and summarized into the summary hierarchies before the clustering queries are submitted, the clusters with various window sizes can be obtained directly from the existing summary hierarchies without parsing the data streams again to construct summaries for the desired resolutions. For example, suppose that the summary hierarchies of stock prices have been kept for ten years. Then, clusters over one day, one month, or even one year can be observed very efficiently without resorting to the old prices again.

From the definition of a clustering query, at most p windows of clustering results can be obtained for a query. It is thus very efficient to observe the trends and changes of clusters at one time. The behaviors of the clusters, such as the moving paths of clusters, the merges and splits of clusters, and the streams jumping between clusters, can be investigated from the results of a clustering query. Consider the example of stock prices again. Assume that the window size is one month and 12 windows are inspected. We might find that stock A and stock B are in the same cluster for one month and then stock A jumps to another cluster for the following several months. Such trends and changes within one year can be observed from the results of this clustering query.

3 The Framework of COD

3.1 Online Maintaining Phase

The main objective of the online maintaining phase is to provide a one-scan algorithm over the incoming multiple data streams for statistics collection. A summary hierarchy is maintained incrementally to provide multi-resolution approximations of a stream; multiple levels in the hierarchy correspond to different resolutions.

Figure 2. The illustration of the summary hierarchy, where F(h) represents the fitting model in the time interval h = [t_s, t_e], and δ = 6.

A fitting model F(h) is defined as an approximation of a raw sub-stream in the interval h = [t_s, t_e] by some summarization technique. The fitting model on level L is generated by the aggregation of B_h models on level (L − 1), and each level keeps the time stamp of the latest data point that has been summarized to that level as the end time t_e of the level. The procedure of generating and updating the summary hierarchy of a stream is described as follows.

Procedure of the online maintaining phase

1. For each incoming value, put it in the temporary bucket.
2. If the number of items in the temporary bucket is less than the bucket size B_t, go to Step 1. Else:
   2.1 Generate a new fitting model of level 0 according to the values in the temporary bucket.
   2.2 Update the end time t_e^0 of level 0.
   2.3 Move all the items from the temporary bucket to the raw bucket.
   2.4 If the number of items in the raw bucket is more than δ, remove the oldest items to keep the number no larger than δ.
3. For each level L, if B_h new models have accumulated:
   3.1 Generate a new model of level (L + 1) by aggregating the latest B_h models in level L.
   3.2 Update the end time t_e^(L+1) of level (L + 1).
   3.3 If the number of models in level (L + 1) is larger than δ, remove the oldest model from that level.

As shown in the procedure, to meet the space limitation of the streaming environment, the number of fitting models maintained at each level is limited to a maximum of δ. Note that δ should be set to a number no smaller than B_h in order to have enough fitting models for the model generation at a higher level. Figure 2 gives an example of the summary hierarchy.
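A minimal sketch of the online maintaining procedure for a single stream is given below. The summarization technique is left open in the paper, so the sketch assumes the mean of a bucket as its fitting model; the class and variable names (SummaryHierarchy, bt, bh, delta) are hypothetical.

```python
class SummaryHierarchy:
    """Sketch of the online maintaining phase for one stream.

    Assumption (not from the paper): the fitting model of an interval is
    simply its mean; real deployments may use regression or other
    summarization techniques.
    """

    def __init__(self, bt, bh, delta):
        self.bt = bt              # B_t: raw values summarized into one level-0 model
        self.bh = bh              # B_h: models aggregated into one higher-level model
        self.delta = delta        # maximum number of models kept per level
        self.temp = []            # temporary bucket of raw values
        self.raw = []             # raw bucket (most recent raw values)
        self.levels = []          # levels[L]: list of (end_time, model) pairs
        self.new_count = []       # new models per level since the last aggregation
        self.t = 0                # current time stamp

    def add(self, value):
        # Step 1: put each incoming value into the temporary bucket.
        self.t += 1
        self.temp.append(value)
        if len(self.temp) < self.bt:              # Step 2
            return
        # Steps 2.1-2.2: build a level-0 fitting model and record its end time.
        self._append(0, (self.t, sum(self.temp) / len(self.temp)))
        # Steps 2.3-2.4: move the values to the raw bucket, keep at most delta.
        self.raw = (self.raw + self.temp)[-self.delta:]
        self.temp = []

    def _append(self, level, entry):
        if level == len(self.levels):
            self.levels.append([])
            self.new_count.append(0)
        self.levels[level].append(entry)
        self.new_count[level] += 1
        # Steps 3.1-3.2: once B_h new models accumulate, aggregate them into
        # a single model one level up, keeping the latest end time.
        if self.new_count[level] == self.bh:
            self.new_count[level] = 0
            recent = self.levels[level][-self.bh:]
            self._append(level + 1, (recent[-1][0],
                                     sum(m for _, m in recent) / self.bh))
        # Step 3.3: keep at most delta models in this level.
        if len(self.levels[level]) > self.delta:
            self.levels[level].pop(0)
```

Feeding values one at a time then yields, at any moment, up to δ fitting models per level, where a level-L model spans B_t × (B_h)^L time stamps.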

3.2 Offline Clustering Phase


As described by the clustering query form in Section 2.1, users want to inspect clusters with window size w, and at most p windows will be observed. Note that the desired window size could differ from those maintained in the summary hierarchy. In this situation, we have to select the fitting models from appropriate levels of the hierarchy to approximate the desired windows. We have the following theorems for adaptive level selection.

Theorem 1: The highest level to approximate a sub-stream with window size w is L_max = ⌊ log_{B_h}(w / B_t) ⌋.

Theorem 2: The lowest level to approximate a sub-stream with window size w is L_min = min_L { L | w ≤ (t_now − t_e^L) + δ_L × h_L }, where δ_L is the exact number of fitting models in level L, and h_L = B_t × (B_h)^L is the window size of the fitting models in level L.
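Under the same assumptions as the sketch above (a list of (end_time, model) pairs per level), the two bounds can be computed as follows; this is an illustrative reading of Theorems 1 and 2, not code from the paper.

```python
import math

def level_bounds(w, t_now, bt, bh, levels):
    """Sketch of the level-selection bounds of Theorems 1 and 2.

    `levels` is assumed to be a list where levels[L] is the list of
    (end_time, model) pairs kept at level L, as in the SummaryHierarchy
    sketch above.
    """
    # Theorem 1: highest usable level, L_max = floor(log_{B_h}(w / B_t)).
    l_max = int(math.floor(math.log(w / bt, bh)))

    # Theorem 2: lowest usable level, i.e. the smallest L whose kept models
    # (plus the gap up to the current time) cover a window of size w.
    l_min = None
    for L, models in enumerate(levels):
        if not models:
            continue
        t_e = models[-1][0]                       # end time of level L
        h_L = bt * (bh ** L)                      # span of one level-L model
        coverage = (t_now - t_e) + len(models) * h_L
        if w <= coverage:
            l_min = L
            break
    return l_min, l_max
```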

From the above theorems, fitting models in the levels between L_min and L_max are able to approximate the sub-streams in the windows of clustering queries. The fitting models in higher levels span longer intervals and thus provide more generalized fittings of the original streams. In contrast, the fitting models in lower levels possess shorter spans and can thus provide more specific and accurate fittings of the original streams. However, only δ models are maintained in each level. Therefore, we devise an adaptive clustering algorithm for the offline clustering phase to approximate each window of the data streams with the fitting models as accurately as possible. Note that the offline clustering phase is not tied to a specific clustering algorithm; users can adopt any traditional clustering algorithm with minor modification if necessary.

Procedure of adaptive clustering algorithm

1. Calculate L_min and L_max. For each data stream, do Steps 2 and 3.
2. If the end time t_e^{L_min} of level L_min is not equal to the current time, aggregate the models of the lower levels (from L_min − 1 down to 0) and the temporary bucket to generate a temporary model characterizing the interval between t_e^{L_min} and the current time. Then, aggregate this temporary model with the latest model in level L_min.
3. Encapsulate the fitting models between levels L_min and L_max to generate at most p entries, where each entry represents a window of size w. Set L = L_min initially. For the windows from w_1 to w_p, if the range of a desired window is covered by the interval of the fitting models in level L, encapsulate an appropriate number of fitting models into that entry. Else, increase L by one to look for fitting models with enough coverage. This step stops when p entries have been retrieved or when L exceeds the maximum level L_max with p_r entries obtained, where p_r ≤ p.
4. Run the clustering algorithm to cluster the sub-streams using the retrieved entries for each window.
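A rough sketch of Step 3 (entry retrieval with level escalation) is given below, again over the hypothetical SummaryHierarchy structure from above; Step 2, which patches level L_min up to the current time, is omitted for brevity.

```python
def retrieve_entries(hier, t_now, w, p, l_min, l_max):
    """Sketch of Step 3 of the adaptive clustering algorithm.

    Assumes hier.levels[L] holds (end_time, model) pairs, as in the
    SummaryHierarchy sketch; the names and coverage test are illustrative.
    """
    entries = []
    level = l_min
    for i in range(1, p + 1):
        # Desired window w_i: the interval (t_now - w*i, t_now - w*(i-1)].
        start, end = t_now - w * i, t_now - w * (i - 1)
        while level <= l_max:
            models = hier.levels[level] if level < len(hier.levels) else []
            span = hier.bt * hier.bh ** level            # h_L of this level
            if models and models[0][0] - span <= start and models[-1][0] >= end:
                # Level L covers the window: encapsulate the overlapping models.
                entries.append([m for (t_e, m) in models
                                if start < t_e < end + span])
                break
            level += 1            # escalate to a coarser level with more coverage
        else:
            break                 # L exceeded L_max: stop with p_r <= p entries
    return entries
```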

4 Conclusions

In order to deal with various types of multiple data streams and to support flexible mining requirements, we devised the COD framework to dynamically cluster multiple data streams. While providing a general framework of clustering on multiple data streams, the COD framework has two major advantages, namely one data scan for online statistics collection and compact multi-resolution approximations. The online maintaining phase of COD provides an efficient algorithm to maintain the summaries of the data streams with multiple resolutions. On the other hand, an adaptive clustering algorithm is devised for the offline phase of COD to retrieve the approximations of the desired sub-streams from the summary hierarchies as precisely as possible according to the clustering queries. The COD framework performs very efficiently in the data stream environment while producing clustering results of very high quality.

Acknowledgement

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contract NSC93-2752-E-002-006-PAE.

References

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. of VLDB, 2003.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS, June 2002.

[3] A. Bulut and A. K. Singh. SWAT: Hierarchical stream summarization in large networks. In Proc. of ICDE, pages 303–314, Mar. 2003.

[4] A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proc. of ACM SIGMOD, pages 61–72, June 2002.

[5] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. of FOCS, pages 359–366, 2000.

[6] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. of ACM SIGKDD, pages 97–106, Aug. 2001.

[7] E. Keogh, J. Lin, and W. Truppel. Clustering of time series subsequences is meaningless: Implications for past and future research. In Proc. of ICDM, Nov. 2003.

[8] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, pages 346–357, Aug. 2002.

[9] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high-quality clustering. In Proc. of ICDE, 2002.

[10] W.-G. Teng, M.-S. Chen, and P. S. Yu. A regression-based temporal pattern mining scheme for data streams. In Proc. of VLDB, Sep. 2003.

[11] J. Yang. Dynamic clustering of evolving streams with a single pass. In Proc. of ICDE, Mar. 2003.

