
Clustering over Multiple Evolving Streams by Events and Correlations

Mi-Yen Yeh, Bi-Ru Dai, and Ming-Syan Chen, Fellow, IEEE

Abstract—In applications of multiple data streams such as stock market trading and sensor network data analysis, the clusters of streams change at different times because of data evolution. The information about evolving cluster is valuable to support corresponding online decisions. In this paper, we present a framework for Clustering Over Multiple Evolving sTreams by CORrelations and Events, which, abbreviated as COMET-CORE, monitors the distribution of clusters over multiple data streams based on their correlation. Instead of directly clustering the multiple data streams periodically, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. The coming streams are smoothed as sequences of end points by employing piecewise linear approximation. At the time when end points are generated, weighted correlations between streams are updated. End points are good indicators of significant change in streams, and this is a main cause of a cluster evolution event. When an event occurs, through split and merge operations we can report the latest clustering results. As shown in our experimental studies, COMET-CORE can be performed effectively with good clustering quality.

Index Terms—Data mining, data clustering, data streams.


1 INTRODUCTION

Research about mining in the data stream environment has flourished recently [1], [2], [3], [4], [5], [6], [7], [8]. In addition to work that considers one data stream at a time, more and more emerging applications involve monitoring multiple data streams concurrently. Such applications include online stock market trades, call detail records in telecommunication, data collection in sensor networks, and ATM operations in banks, to name a few. We are able to discover interesting and useful knowledge by analyzing the relationships among these multiple data streams. Therefore, mining multiple data streams has attracted an increasing amount of attention from researchers. To discover the cross-relationship among streams, one way is to calculate the correlation between streams and report the stream pairs with high correlation [9], [10], [11]. Another is to perform a similarity pattern query between multiple data streams [9], [12]. Moreover, several works apply the clustering technique to multiple data streams [13], [14], [15], [16].

In this paper, we present a framework for monitoring the evolution of groups of correlated data streams by the clustering technique. Explicitly, this framework will trace not only those streams becoming similar to one another but also those becoming dissimilar as the streams grow. Clustering is a mining technique that puts similar objects together and separates dissimilar ones into different clusters. As a result, by clustering the streams dynamically, we can achieve the goal of monitoring the evolution of stream clusters. By observing the changes in the number of clusters and the members of each cluster, we are able to get useful information for decision making or data management in various applications. The following are examples of applications of clustering over multiple evolving streams:

1. Sensor network data. A sensor network is composed of a large number of sensors. Each sensor reads in data continuously as time advances. It is important to know the interrelationships among sensors. By clustering these sensors into groups, the administrator of the sensor network can see which sensors work together or behave similarly in different time intervals. The sensor data clusters help in detecting intrusions or abnormalities.

2. Automatic stock exchange monitoring system. In the stock market, the price of each stock may vary from time to time, and some stocks tend to rise and fall concurrently in some time intervals. The stock monitoring system can show which streams are in the same group and have similar behavior. From such evolving streams, the investors would like to buy a proper set of stocks to maximize the profit. According to the clusters provided, investors are able to choose a combination of several groups of stocks to reduce the risk of investment.

In addition to the above applications, the clustering framework can also be applied to other applications such as mining gene expression data, protein sequences, climate data, and so forth.

• M.-Y. Yeh is with the Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan, ROC. E-mail: [email protected].
• B.-R. Dai is with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, No. 43, Sec. 4, Keelung Road, Taipei 106, Taiwan, ROC. E-mail: [email protected].
• M.-S. Chen is with the Department of Electrical Engineering and the Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan 106, ROC. E-mail: [email protected].

Manuscript received 25 Sept. 2006; revised 17 Feb. 2007; accepted 26 Apr. 2007; published online 23 May 2007.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0454-0906. Digital Object Identifier no. 10.1109/TKDE.2007.1066.

In [14], an online data summarization framework is designed for offline clustering on multiple data streams when users submit requests. In contrast, we want to provide in this paper a more real-time and automatic system, which performs online clustering. The system will report the evolution of clusters as time advances. To achieve this goal, one intuitive solution is to recluster these data streams periodically. At each predetermined time point, streams are updated and clustered directly with an existing clustering algorithm. However, due to the large number of streams and the huge data volume, reclustering streams is very costly. Furthermore, periodical clustering is not able to cope with data streams that evolve at different speeds. If the values of data streams are relatively steady, most of the clustering tasks are unnecessary since the resulting clusters are likely to remain the same. On the other hand, if the values of data streams fluctuate rapidly, we may lose some cluster information when the fixed time period is too long. In view of these issues, we need a solution that is able to perform clustering whenever it is necessary. Consequently, a framework named Clustering Over Multiple Evolving sTreams by CORrelations and Events, abbreviated as COMET-CORE, is proposed in this paper.

For generality, we consider data on the numerical domain. Our work can be easily extended to applications with categorical data via proper data transformation. Initially, the streams are divided into several clusters by applying any traditional clustering method. In fact, we can also apply our merge operation, which will be introduced later, to obtain the initial clusters. Each stream grows and evolves as new data points come in. Hence, a group of streams may be similar at this moment but become dissimilar to one another later. In order to capture the significant changes of each stream, we use continuous piecewise linear line segments to approximate the original data stream. There are two main reasons to adopt the piecewise linear approximation: it can be performed in real time as the data comes in, and it is able to capture the trend of the data. Two line segments with different slopes are connected by an end point. The end point represents a significant trend change point of the streaming data. If a stream in a cluster has a significant change, it may cause the split of this cluster. As a result, we can regard each end point of a stream as a “trigger” of the cluster evolution and call a stream that has a newly encountered end point a “trigger stream.” For clusters that have trigger streams, the similarities between trigger streams and other streams in the same clusters are updated incrementally. Here, we use correlation as our similarity measurement. Two different streams may vary at different numerical levels but with similar trends or shapes. Also, it is not easy to normalize all data in advance since, in an evolving streaming environment, the coming data is unknown. Hence, correlations between streams are used as the similarity measure to deal with the aforementioned issues.

Correlation is a standard statistical measurement, which shows the degree of association and works best with linear relationships between two random variables. This property perfectly meets our requirement. Furthermore, since more recent data is preferred over obsolete data, a weight factor is added in the calculation of correlations. A later section describes in detail how the weighted correlation between corresponding streams is updated when necessary. In the process of similarity updates, if we detect that the weighted correlations related to trigger streams in clusters are below a given threshold, we say an “event” is detected. An event is a signal for the system to make necessary cluster modifications. When an event is found via the event detection mechanism, clusters will be split according to trigger streams. Then, a procedure is activated to check whether there exist clusters that are close enough to be merged together. Since the split and merge processes are very efficient, the event processing procedure is able to handle thousands of streams concurrently.

Consequently, the main contributions of the COMET-CORE framework are listed as follows:

1. We propose a novel online clustering framework COMET-CORE over multiple data streams. COMET-CORE is efficient in monitoring the relationships between streams concurrently. The system reports the clustering results whenever the clusters encounter significant changes.

2. We devise a weighted correlation measurement for incrementally updating both the similarities among summarized streams and the similarities among clusters online.

3. We propose an efficient split and merge algorithm to perform the cluster modification with good clustering quality.

The remainder of this paper is organized as follows: Preliminaries are given in Section 2, which includes related studies and the problem model. In Section 3, the data summarization method is described. The similarity measurement in COMET-CORE is discussed in Section 4. In Section 5, the event detection mechanism and the strategy of modifying the clusters with the split and merge processes are described. The detailed algorithm of event-driven clustering is also provided. Section 6 presents the experimental results and, finally, this paper concludes with Section 7.

2 PRELIMINARIES

2.1 Related Work

Various research works have been reported to deal with clustering on one data stream [6], [8], [17], [18], [19], [20]. However, more and more applications are modeled better if multiple data streams are employed. There is an increasing number of works dealing with clustering of multiple data streams, such as [13], [14], [15], [16]. Among these works, the COD framework [14] performs online data summarization and clusters offline when users give queries. As for online clustering, the work [16] briefly describes, without providing details, a framework to continuously report clusters within a given distance threshold. Another work [13] discusses clustering over parallel data streams. It summarizes the data streams with the Discrete Fourier Transform and adopts weighted euclidean distance as the distance measurement. By applying a sliding window on streams, it reports clustering results only for the current window. The work [15] provides a time-series system for whole clustering by incrementally constructing a hierarchy of clusters from a divisive point of view. It performs clustering every time a fixed number of data points of every time series has been collected. The scheme in [15] thus also checks cluster evolutions periodically. The cluster split or merge processes may be a waste of time if the clusters remain almost the same. On the other hand, cluster information can be missed when the data changes quickly. In conclusion, to the best of our knowledge, none of the proposed frameworks can achieve the goal of reporting cluster evolutions dynamically with efficient cluster split and merge processes triggered by events.

2.2 Problem Model

Given an integer $n$, an $n$-stream set is denoted as $\{S_1, S_2, \ldots, S_n\}$, where $S_i$ is the $i$th stream. A data stream $S_i$ can be represented as $S_i[1, \ldots, t, \ldots]$, where $S_i[t]$ is the data value of stream $S_i$ arriving at time $t$. The data points of each stream arrive continuously, and the data grows rapidly. The objective of this paper is that, given a set of data streams and the threshold parameters, events of stream clusters are monitored online. Due to the large volume of data streams, we need to summarize each stream. Each stream is represented as a series of end points, and the summary of stream $S_i$ is denoted as $\hat{S}_i$. The event detection mechanism and clustering task are both based on the summary data. When events occur, cluster modifications will be performed instead of reclustering all streams, and the latest clustering results are reported.

3 DATA SUMMARIZATION

The volume of data in a streaming environment is huge and possibly infinite. As a result, we can only store a summary of the data seen thus far. In COMET-CORE, the piecewise linear approximation is used as the summarization method for three reasons. First, in our model, the linear relationships between streams are of concern. Second, in the data-streaming environment, we need an online, single linear scan algorithm to process the data. Such an online and single scan algorithm for piecewise linear approximation indeed exists [21]. Third, we want to detect the significant change of streams during the process of summarization. An end point between two approximation line segments is a good indicator of the significant change of a stream [22]. Since the purpose of our framework is to report the cluster evolutions online, we believe that the significant change of streams is a main cause of significant change of stream clusters.

There are some related works about using piecewise linear line segments as the data representation [23], [24], [25]. Most of the prior works consider data with a fixed length and scan the data multiple times offline to obtain the best fit line segments. As mentioned in [21], such methods include bottom-up and top-down approaches, whose practicality in stream environments needs further justification. To adapt to the characteristics of streaming data, we need an online piecewise linear approximation algorithm. The work in [21] describes the basic concept of segmenting time series online. Using sliding window techniques, we are able to smooth the original data stream as a combination of line segments in linear time. The concept of linear approximation is illustrated in Fig. 1.

In Fig. 2, we briefly state the generic sliding window algorithm for transforming raw data streams into the piecewise linear representation. $S_i[t_s, \ldots, t_e]$ denotes the substream from time $t_s$ to $t_e$ of the raw data stream $S_i$. It is approximated by a line segment denoted as $S_i^{app}[t] = a \cdot t + b$, where $t \in [t_s, t_e]$. The parameter $a$ is the slope of the line, and $b$ is the intercept. These two parameters can be calculated as follows:

$$a = \frac{S_i[t_e] - S_i[t_s]}{t_e - t_s}, \qquad b = S_i[t_e] - a \cdot t_e = S_i[t_s] - a \cdot t_s.$$

The two parameters in the algorithm, $a_i^{now}$ and $b_i^{now}$, are the $a$ and $b$ of the latest approximated line segment.

Fig. 1. The data stream is smoothed by a sequence of line segments.

In this piecewise linear algorithm, the cal_error function is used to decide when to start a new line segment. One feasible error measurement is the residual error illustrated in Fig. 2. It computes the sum of squared differences between the predicted value $S_i^{app}[t]$ and the original value $S_i[t]$. It can be decomposed into a combination of the summations of $t$, $S_i[t]$, and $t S_i[t]$ and calculated incrementally:

$$error = \sum_{t=t_s}^{t_e} \left(S_i[t] - S_i^{app}[t]\right)^2 = \sum_{t=t_s}^{t_e} \left(S_i[t] - (a \cdot t + b)\right)^2,$$

where $a$ is the slope, and $b$ is the intercept of the current line segment. This equation describes a kind of absolute error. However, in a streaming data environment, it may not be easy to give a proper absolute error threshold. In this case, we can use a relative error instead. For example, we may require the residual error to be less than 2 percent of the square sum of the original stream.
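To make the incremental computation concrete, the squared residual can be expanded term by term. The expansion below is a worked illustration rather than a formula from the paper; $N = t_e - t_s + 1$ is a symbol introduced here for the number of points in the window, and all sums run over $t = t_s, \ldots, t_e$. Note that the expansion also relies on running sums of $t^2$ and $S_i[t]^2$:

$$\sum \left(S_i[t] - (a t + b)\right)^2 = \sum S_i[t]^2 - 2a \sum t\,S_i[t] - 2b \sum S_i[t] + a^2 \sum t^2 + 2ab \sum t + b^2 N.$$

Each sum on the right can be extended by one term when a new data point arrives, so the error for a given $a$ and $b$ is available without rescanning the window.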

Assume that at time $t_s + p$, the error between the raw substream $S_i[t_s, \ldots, t_s + p]$ and the approximated one $S_i^{app}[t_s, \ldots, t_s + p]$ exceeds the threshold l. A segment connecting $S_i[t_s]$ and $S_i[t_s + p - 1]$ is obtained, and we record the latest end point $S_i[t_s + p - 1]$ in the summary structure. This is illustrated in Fig. 3a. After the new end point is recorded, a new line segment is started, and the latest slope $a_i^{now}$ and intercept $b_i^{now}$ are temporarily updated according to $S_i[t_s + p - 1]$ and $S_i[t_s + p]$.

The sliding window method of online piecewise linear approximation is efficient because of its linear time complexity. Many variations have been designed to adapt to different types of data and error measurements. For example, because the residual error is monotonically nondecreasing, we can speed up the algorithm by extending the candidate segment by $k$ ($k > 1$) time units each time. In [21], Keogh et al. have proposed an approach that combines the sliding window and bottom-up methods. The proposed SWAB algorithm not only takes advantage of the linear time and online fashion of the sliding window method but also retains the higher quality of the bottom-up method. Wu et al. [12] provide a three-tiered online segmentation and pruning strategy for financial data streams. Whatever optimization is used, the choice of error threshold is subjective and data dependent. The trade-off between efficiency and good approximation quality is an interesting research topic in itself, which is, however, beyond the scope of this paper.
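The segmentation step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of Fig. 2: the function names and the `max_error` parameter are hypothetical, an absolute error threshold is used, time is taken to be the list index, and the error is recomputed from scratch at each step for clarity instead of being maintained incrementally.

```python
from typing import List, Sequence, Tuple


def residual_error(values: Sequence[float], start: int, end: int) -> float:
    """Squared residual between values[start..end] and the line joining its two ends."""
    if end == start:
        return 0.0
    a = (values[end] - values[start]) / (end - start)  # slope
    b = values[start] - a * start                      # intercept
    return sum((values[t] - (a * t + b)) ** 2 for t in range(start, end + 1))


def sliding_window_end_points(values: Sequence[float],
                              max_error: float) -> List[Tuple[float, int]]:
    """Return (value, time) end points; the open last segment is not closed."""
    end_points = [(values[0], 0)]
    start = 0
    for t in range(1, len(values)):
        if residual_error(values, start, t) > max_error:
            # The segment start..t-1 was the last acceptable one, so t-1
            # becomes an end point and the next segment starts there.
            end_points.append((values[t - 1], t - 1))
            start = t - 1
    return end_points
```

For a stream that rises for three steps and then falls, `sliding_window_end_points([0, 1, 2, 3, 2, 1, 0], 0.5)` records the peak as the only new end point. Replacing `residual_error` with running sums would restore the linear-time behavior discussed above.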

4 SIMILARITY MEASUREMENT ON SUMMARY STRUCTURE

As mentioned in the previous section, we smooth the stream into a sequence of line segments, and the end points are recorded. The summary structure of each stream is defined as follows:

Definition 4.1. The summary structure of $S_i$ is defined as

$$\hat{S}_i = \{(S_i[t_{v_1}], t_{v_1}), (S_i[t_{v_2}], t_{v_2}), \ldots, (S_i[t_{v_k}], t_{v_k})\},$$

where $S_i[t_{v_k}]$ is the end point value, and $t_{v_k}$ denotes the corresponding arrival time of the end point. The end points in the set are not necessarily contiguous in the time domain.

The two most common measures for streaming data are euclidean distance and correlation. The euclidean distance between two normalized streams reflects the magnitude of difference, whereas the correlation of two streams refers to the degree of resemblance of two streams in shape or pattern. Furthermore, correlation works best with linear relationships and perfectly meets the requirement in our COMET-CORE clustering model. Although correlation is a similarity measure rather than a distance measure, it can be converted to a dissimilarity measure (that is, a distance measure) through a straightforward transformation. As a result, we adopt correlation as our similarity measurement. There are several different correlation techniques, and here, we focus on the most commonly used one, the Pearson correlation. The correlation coefficient between two random variables, say, $X$ and $Y$, is defined as follows:

$$corr(X, Y) = \frac{E(X, Y) - E(X)E(Y)}{\sigma_X \sigma_Y},$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively, and $E(X, Y)$, $E(X)$, and $E(Y)$ are the expectations of $XY$, $X$, and $Y$, respectively.

In the multiple streaming data case, two different streams are regarded as two independent random variables. Therefore, following the definition, the correlation between two streams $S_i$ and $S_j$ is

$$corr(S_i, S_j) = \frac{\sum_t S_i[t] S_j[t] - \frac{1}{T} \sum_t S_i[t] \sum_t S_j[t]}{\sqrt{\sum_t (S_i[t] - \bar{S}_i)^2} \, \sqrt{\sum_t (S_j[t] - \bar{S}_j)^2}},$$

where $\bar{S}_i$ and $\bar{S}_j$ are the means of the two streams, and $T$ is the number of time points in the sums.

In most streaming applications, the recent data are usually more important compared to the aged data. To reflect the bias toward recent data, a weight function is introduced in the calculation of correlation between two streams.

Definition 4.2. Given two streams $S_i$ and $S_j$, and a weight function $w(t)$, the weighted correlation coefficient between these two streams is defined as

$$wcorr(S_i, S_j) = \frac{\sum_t w(t) S_i[t] S_j[t] - \dfrac{\sum_t w(t) S_i[t] \sum_t w(t) S_j[t]}{\sum_t w(t)}}{\sqrt{A(S_i) \cdot A(S_j)}}, \qquad (1)$$

where

$$A(S_x) = \sum_t w(t) (S_x[t])^2 - \frac{\left(\sum_t w(t) S_x[t]\right)^2}{\sum_t w(t)},$$

and $w(t)$ is a monotonically nondecreasing function of the time index $t$.

Fig. 3. (a) The streams are smoothed as a sequence of end points.

$wcorr(S_i, S_j)$ is different from $corr(S_i, S_j)$ in that each term is multiplied by the weight function $w(t)$. To be more efficient, six necessary terms are kept to form a real vector for calculating (1).

Definition 4.3. A WC vector is defined as a weighted correlation vector, which is used to maintain the statistics for weighted correlation calculation between two sequences of end points. Given two streams $S_i$ and $S_j$, and a weight function $w(t)$, the WC vector of $S_i$ and $S_j$ is defined as

$$WC_{S_i S_j} = (WC_1^{t_k}, WC_2^{t_k}, WC_3^{t_k}, WC_4^{t_k}, WC_5^{t_k}, t_k),$$

where the first five tuples are accumulation values from the beginning to time $t_k$:

$$WC_1^{t_k} = \sum_{t=0}^{t_k} w(t) S_i[t], \quad WC_2^{t_k} = \sum_{t=0}^{t_k} w(t) (S_i[t])^2, \quad WC_3^{t_k} = \sum_{t=0}^{t_k} w(t) S_i[t] S_j[t],$$

$$WC_4^{t_k} = \sum_{t=0}^{t_k} w(t) S_j[t], \quad \text{and} \quad WC_5^{t_k} = \sum_{t=0}^{t_k} w(t) (S_j[t])^2,$$

and the last tuple $t_k$ records the latest update time.

We set $w(t) = \lambda^{(t_{now} - t)}$, where $\lambda$ is a real number in (0,1], and $t_{now}$ refers to the current time and keeps growing. Obviously, the closer a data point is to the current time, the more weight it gets. The parameter $\lambda$ controls the fading speed of the data weight. As $t_{now}$ becomes larger and larger, the weight of a given data point keeps fading as new data come in continuously.
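As a reference point for the incremental scheme that follows, the weighted correlation in (1) can be evaluated directly from two equal-length sequences. The sketch below is illustrative only: the function name and the parameter `lam` (standing for the fading factor) are assumptions, the samples are assumed to sit at times $1, \ldots, t_{now}$, and no guard is included for constant sequences, for which the denominator is zero.

```python
import math
from typing import Sequence


def weighted_corr(s_i: Sequence[float], s_j: Sequence[float], lam: float) -> float:
    """Weighted correlation as in (1), with w(t) = lam ** (t_now - t)."""
    t_now = len(s_i)
    w = [lam ** (t_now - t) for t in range(1, t_now + 1)]
    sum_w = sum(w)
    sum_i = sum(wt * x for wt, x in zip(w, s_i))
    sum_j = sum(wt * y for wt, y in zip(w, s_j))
    sum_ij = sum(wt * x * y for wt, x, y in zip(w, s_i, s_j))
    # A(S_x) from Definition 4.2 for each stream
    a_i = sum(wt * x * x for wt, x in zip(w, s_i)) - sum_i ** 2 / sum_w
    a_j = sum(wt * y * y for wt, y in zip(w, s_j)) - sum_j ** 2 / sum_w
    return (sum_ij - sum_i * sum_j / sum_w) / math.sqrt(a_i * a_j)
```

Two sequences that fall along parallel lines, such as `[80, 70, 60]` and `[50, 40, 30]`, give a weighted correlation of 1 for any `lam` in (0,1]. This direct form rescans all data, which is exactly what the WC vector is meant to avoid.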

Now, let us see how COMET-CORE realizes the similarity measurement. Since each stream is smoothed, the calculation of the distance between two data streams can be reduced to the one between two sequences of end points. Because the data streams can only be scanned once, the distance should be updated in an incremental fashion. In other words, we need to update the $WC_{S_i S_j}$ vector incrementally. Consider the case shown in Fig. 3b. Assume that $WC_{S_i S_j}$ is updated to $((WC_m^{t_s})_{m=1,\ldots,5}, t_s)$ at time $t_s + 1$. At time $t_e + 1$, $\hat{S}_j$ is detected to have a new end point at time $t_e$, and the WC vector should be updated. The incremental value of the $m$th tuple of the WC vector is denoted as $\Delta WC_m$. In the period from time $t_s$ to $t_e$, $S_i$ is approximated by $S_i^{app}[t] = a_i^{now} \cdot t + b_i^{now}$, where $t \in [t_s, t_e]$. The two parameters, $a_i^{now}$ and $b_i^{now}$, are maintained by the piecewise linear approximation method. For stream $S_j$, the line segment between time $t_s$ and $t_e$ is approximated by $S_j^{app}[t] = a_j \cdot t + b_j$, where $t \in [t_s, t_e]$. The two parameters, $a_j$ and $b_j$, are calculated from the last two end points in $\hat{S}_j$, that is, $(S_j[t_s], t_s)$ and $(S_j[t_e], t_e)$. Then, the corresponding $\Delta WC_m$ values can be computed as follows:

$$\Delta WC_1 = \sum_{t=t_s+1}^{t_e} w(t) S_i^{app}[t] = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now}) = a_i^{now} \sum_{t=t_s+1}^{t_e} t\,w(t) + b_i^{now} \sum_{t=t_s+1}^{t_e} w(t), \qquad (2)$$

$$\Delta WC_2 = \sum_{t=t_s+1}^{t_e} w(t) (S_i^{app}[t])^2 = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now})^2 = (a_i^{now})^2 \sum_{t=t_s+1}^{t_e} t^2 w(t) + 2 a_i^{now} b_i^{now} \sum_{t=t_s+1}^{t_e} t\,w(t) + (b_i^{now})^2 \sum_{t=t_s+1}^{t_e} w(t), \qquad (3)$$

$$\Delta WC_3 = \sum_{t=t_s+1}^{t_e} w(t) S_i^{app}[t] S_j^{app}[t] = \sum_{t=t_s+1}^{t_e} w(t) (a_i^{now} t + b_i^{now})(a_j t + b_j) = (a_i^{now} a_j) \sum_{t=t_s+1}^{t_e} t^2 w(t) + (a_i^{now} b_j + a_j b_i^{now}) \sum_{t=t_s+1}^{t_e} t\,w(t) + (b_i^{now} b_j) \sum_{t=t_s+1}^{t_e} w(t). \qquad (4)$$

The incremental values of the other two tuples, $\Delta WC_4$ and $\Delta WC_5$, are calculated in the same way as (2) and (3). Note that the parameters $a$ and $b$ should be replaced with $a_j$ and $b_j$. By observing these equations, we can see that they are all linear combinations of $\sum T(t) w(t)$, where $T(t)$ can be $t^2$, $t$, or some constant. Taking the weight function into account and according to (2), the first tuple of $WC_{S_i S_j}$ is updated as shown below:

$$WC_1^{t_e} = \sum_{t=1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t] = \lambda^{(t_e - t_s)} \sum_{t=1}^{t_s} \lambda^{(t_s - t)} S_i^{app}[t] + \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} S_i^{app}[t]$$

$$= \lambda^{(t_e - t_s)} WC_1^{t_s} + \left( a_i^{now} \sum_{t=t_s+1}^{t_e} t\,\lambda^{(t_e - t)} + b_i^{now} \sum_{t=t_s+1}^{t_e} \lambda^{(t_e - t)} \right) = \lambda^{(t_e - t_s)} WC_1^{t_s} + \Delta WC_1.$$

Induced from the above, the WC vector at time $t_e$ is

$$WC_{S_i S_j} = \left( \left( \lambda^{(t_e - t_s)} WC_m^{t_s} + \Delta WC_m \right)_{m=1,\ldots,5}, t_e \right). \qquad (5)$$

Lemma 4.1. The amortized time complexity of a WC vector update is $O(1)$.

Proof. From (5), the WC vector can always be updated by adding $\Delta WC_m$ to an existing WC vector. It is clear that this equation can be processed in constant time. Next, we observe the time complexity of computing $\Delta WC_m$. The five terms ($\Delta WC_1$ to $\Delta WC_5$) are updated, as shown in (2), (3), and (4), and the worst-case cost of each operation is $O(n)$, where $n$ is the length of a stream. Using the aggregate method of amortized analysis, we obtain that the amortized cost of an operation is the average: $O(n)/n = O(1)$. □

Consequently, we can incrementally maintain the similarity between two streams efficiently by keeping the $WC_{S_i S_j}$ vector of two stream summaries online. At the end of this section, a brief example is given to illustrate the update procedure.
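The update in (5), together with the increments in (2), (3), and (4), can be sketched as a small function. The names `update_wc`, `line_i`, `line_j`, and `lam` are hypothetical, and the window sums are computed with explicit loops for readability; closed forms for the geometric sums could replace them.

```python
from typing import Tuple

Line = Tuple[float, float]  # (slope, intercept) of the current segment


def update_wc(wc: Tuple[float, ...], t_s: int, t_e: int,
              line_i: Line, line_j: Line, lam: float) -> Tuple[float, ...]:
    """Advance the five accumulated WC terms from time t_s to t_e as in (5)."""
    a_i, b_i = line_i
    a_j, b_j = line_j
    times = range(t_s + 1, t_e + 1)
    w = [lam ** (t_e - t) for t in times]
    s0 = sum(w)                                        # sum of w(t)
    s1 = sum(t * wt for t, wt in zip(times, w))        # sum of t * w(t)
    s2 = sum(t * t * wt for t, wt in zip(times, w))    # sum of t^2 * w(t)
    delta = (
        a_i * s1 + b_i * s0,                                        # (2)
        a_i ** 2 * s2 + 2 * a_i * b_i * s1 + b_i ** 2 * s0,         # (3)
        a_i * a_j * s2 + (a_i * b_j + a_j * b_i) * s1 + b_i * b_j * s0,  # (4)
        a_j * s1 + b_j * s0,                                        # as (2), stream j
        a_j ** 2 * s2 + 2 * a_j * b_j * s1 + b_j ** 2 * s0,         # as (3), stream j
    )
    decay = lam ** (t_e - t_s)  # old terms fade by the elapsed time
    return tuple(decay * old + d for old, d in zip(wc, delta))
```

Only the segment parameters and the previous vector are needed, so the raw data between $t_s$ and $t_e$ never has to be stored.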

Example 4.1. Consider the example shown in Fig. 3b. Assume $t_s = 3$, $t_e = 6$, and let $w(t) = \lambda^{(t_{now} - t)}$, where $\lambda = 0.9$. The summaries of two streams are $\hat{S}_i = \{(80, 1), (60, 3)\}$ and $\hat{S}_j = \{(50, 1), (0, 6)\}$, where $\hat{S}_j$ has a new end point (0, 6) at time 6. The last distance updating time of $\hat{S}_i$ and $\hat{S}_j$ is $t_s = 3$. Until time 3, we have $\sum t^2 w(t) = 13.41$, $\sum t\,w(t) = 5.61$, and $\sum w(t) = 2.71$, and

$$WC_{S_i S_j} = ((187.8, 13194, 7560, 106.5, 4365, 2.71), 3).$$

Therefore, at time 3, the weighted correlation of $\hat{S}_i$ and $\hat{S}_j$ is 1. At time 7, $\hat{S}_j$ has a new end point (0, 6). Now, $t_{now}$ is set to 6. The slope and the intercept for $\hat{S}_j$ between time 4 and 6 are $a_j = -10$ and $b_j = 60$, respectively. For $\hat{S}_i$, assume the current parameters $a_i^{now} = 10$ and $b_i^{now} = 30$. As a result, we have $\sum_{t=4}^{6} t^2 w(t) = 71.46$, $\sum_{t=4}^{6} t\,w(t) = 13.74$, and $\sum_{t=4}^{6} w(t) = 2.71$. The incremental values of the $WC_{S_i S_j}$ vector in this period are

$$\Delta WC_{S_i S_j} = (218.7, 17829, 1854, 25.2, 414).$$

By applying (5), the final $WC_{S_i S_j}$ is

$$((187.8, 13194, 7560, 106.5, 4365) \times (0.9)^{(6-3)} + (218.7, 17829, 1854, 25.2, 414), 6) = ((355.61, 27447.43, 7365.24, 102.84, 3596.09), 6).$$

From this vector, the weighted correlation coefficient between $\hat{S}_i$ and $\hat{S}_j$ is $-0.56$. □
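The last step of Example 4.1 can be checked numerically. The sketch below evaluates (1) from the five accumulated terms plus the accumulated weight sum, which must be tracked alongside them; the function name `wcorr_from_wc` and the argument `sum_w` are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def wcorr_from_wc(wc: Sequence[float], sum_w: float) -> float:
    """Evaluate the weighted correlation (1) from accumulated WC terms."""
    wc1, wc2, wc3, wc4, wc5 = wc
    a_i = wc2 - wc1 ** 2 / sum_w   # A(S_i)
    a_j = wc5 - wc4 ** 2 / sum_w   # A(S_j)
    return (wc3 - wc1 * wc4 / sum_w) / math.sqrt(a_i * a_j)


# The weight sum fades like the other terms: 2.71 up to time 3, decayed by
# 0.9 ** 3, plus 2.71 accumulated over times 4 to 6.
sum_w = 2.71 * 0.9 ** 3 + 2.71
r = wcorr_from_wc((355.61, 27447.43, 7365.24, 102.84, 3596.09), sum_w)
```

The result agrees in magnitude with the 0.56 reported in Example 4.1 and is negative, since $\hat{S}_i$ is rising over times 4 to 6 while $\hat{S}_j$ is falling.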

At the end of this section, the properties of this incremental similarity update are discussed. Since the weighted correlation between two streams is updated incrementally, the last updated time of two streams is needed when every new end point is generated. This can be done by checking the arrival time of the latest two end points of each stream.

Lemma 4.2. The last updated time between two streams $S_i$ and $S_j$ can certainly be decided by comparing the last two end points of either stream summary.

Proof. Assume that at time $(t_e + 1)$ an end point $(S_i[t_e], t_e)$ of $\hat{S}_i$ is generated, so the latest two end points of $\hat{S}_i$ are $\{\ldots, (S_i[t_{i1}], t_{i1}), (S_i[t_e], t_e)\}$. A stream $\hat{S}_j$ has the latest two end points $\{\ldots, (S_j[t_{j1}], t_{j1}), (S_j[t_{j2}], t_{j2})\}$, where $t_{j2} > t_{j1}$. If $t_{j2} = t_e$, that is, $\hat{S}_j$ also has an end point at time $t_e$, then the last updated time between $\hat{S}_i$ and $\hat{S}_j$ is $\max(t_{i1}, t_{j1})$. If $t_{j2} < t_e$, then the last updated time between $\hat{S}_i$ and $\hat{S}_j$ is $\max(t_{i1}, t_{j2})$. □

It is of great advantage that, according to Lemma 4.2, the number of end points that should be kept for each stream is limited.

Lemma 4.3. The maximal number of end points that a stream summary should keep is 2.

Proof. According to Lemma 4.2, by checking the latest two end points of each summary in a stream pair, we can get the last update time and then complete the similarity update. □

If there are $n$ streams, from the above lemma, the space for end points is bounded by $O(2n)$. This bound is satisfactory because it is independent of the length of each stream. Therefore, we can limit the space usage of summarized data efficiently.

5 THE COMET-CORE FRAMEWORK

At the beginning of this section, the basic concept of the COMET-CORE framework is given. Instead of reclustering whole streams, COMET-CORE reports cluster results through efficient cluster splitting and merging when there are significant changes. In essence, a cluster is a set of summarized streams, and all the clusters form a cluster set. The number of clusters varies during the process of event detection and stream clustering. Each cluster has a center that is simply the average of its members, that is, the summarized streams. Therefore, the center of a cluster is also a sequence of end points. As the streams grow, the center of the cluster varies as well. In this way, the similarity between two clusters can also be maintained incrementally with a weighted correlation vector, the WC vector, as mentioned in Section 4.

Definition 5.1. Assume that the centers of two clusters $C_i$ and $C_j$ are represented by the end point sequences $\hat{S}_i$ and $\hat{S}_j$, respectively. Then, the WC vector of the two clusters, denoted by $WC_{C_i C_j}$, is equal to $WC_{S_i S_j}$. The weighted correlation between $C_i$ and $C_j$, denoted by $wcorr(C_i, C_j)$, is equal to $wcorr(S_i, S_j)$.

5.1 Event Detection

Once a stream generates a new end point, we regard it as a trigger stream. A trigger stream is a trigger of events, where an event here means that some clusters are required to be split or merged. Thus, the weighted correlation between trigger streams and other streams in the same cluster is updated to check whether there are streams in the same cluster that are becoming dissimilar. The procedure of how trigger streams work is illustrated in Fig. 4. A cluster may have more than one trigger stream. Among these trigger streams, some are similar at the current time unit, and these trigger streams should be grouped first. Given the user-defined threshold a, if the weighted correlation of two trigger streams is no less than this threshold, they will be put into the same group, which is called a trigger group. Clearly, if the number of trigger groups is larger than one, the cluster should be split. For each trigger group, a trigger stream is randomly picked as the representative stream of this group. Each of the remaining nontrigger streams, which are the streams in the same cluster without newly generated end points, is compared to the representative trigger streams. According to the weighted correlation between itself and the representative trigger streams, a nontrigger stream will be assigned to the trigger group with the most correlated representative trigger stream. It is possible that some of the nontrigger streams cannot fit any trigger group, that is, the weighted correlations between the nontrigger stream and all the representative trigger streams are not high enough to reach the user-defined threshold a. In this case, these streams form a temporary cluster. Though the streams in the temporary cluster may not be highly correlated, they will be split in the future when the next event is detected.
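The grouping procedure described above can be sketched as follows. This is an illustration under explicit assumptions rather than the algorithm of the paper: the function name, the `wcorr` callback, and the `threshold` parameter are hypothetical; trigger streams are grouped greedily against the first member of each existing group; and that first member serves as the representative, whereas the text picks the representative at random.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def split_by_triggers(cluster: Sequence[Hashable],
                      triggers: Sequence[Hashable],
                      wcorr: Callable[[Hashable, Hashable], float],
                      threshold: float) -> Tuple[List[list], list]:
    """Return (trigger groups, temporary cluster) for one cluster."""
    groups: List[list] = []  # groups[k][0] acts as the representative
    for s in triggers:
        for g in groups:
            if wcorr(s, g[0]) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])  # no similar group found: start a new one

    temporary = []
    trigger_set = set(triggers)
    for s in cluster:
        if s in trigger_set:
            continue
        # Assign each nontrigger stream to its most correlated representative,
        # provided the correlation reaches the threshold.
        best = max(groups, key=lambda g: wcorr(s, g[0]), default=None)
        if best is not None and wcorr(s, best[0]) >= threshold:
            best.append(s)
        else:
            temporary.append(s)
    return groups, temporary
```

If more than one group is returned, the cluster is split along those groups, and any streams in the temporary list stay together until a later event separates them.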

5.2 Split Clusters

In the previous section, we introduced how trigger streams trigger the split of clusters. When new clusters are formed, the corresponding WC vectors of the modified clusters must be maintained. Since reclustering of streams is not permitted here, the new WC vectors are approximated by existing ones. These newly generated WC vectors can be classified into three types. The first two types are the WC vectors between two newly generated clusters that originally belonged to the same cluster and to different clusters, respectively. The last type is the WC vector between a newly generated cluster and an originally existing cluster. To make this easy to understand, let us go through the split process in Fig. 5. Originally, there are four clusters: $C_1$, $C_x$, $C_y$, and $C_{11}$. Suppose $C_1$ is split into three clusters, $C_1$,$^1$ $C_4$, and $C_6$, and cluster $C_{11}$ is split into two clusters, $C_{11}$ and $C_{14}$.

First of all, the WC vectors among newly generated clusters that originally belonged to the same cluster are set. Since the cluster is split according to the representative trigger streams, their WC vectors are used to approximate this first type of WC vector among clusters.

Type 1. The WC vector between newly generated clusters $C_i$ and $C_j$, which originally belong to the same cluster, is set equal to the WC vector of the representative trigger streams $S_i$ and $S_j$:

$$WC_{C_iC_j} = WC_{S_iS_j},$$

where $S_i$ and $S_j$ are the representative trigger streams of $C_i$ and $C_j$, respectively.

In the example in Fig. 5a, the cluster $C_{11}$ is split into $C_{11}$ and $C_{14}$, so the WC vector of clusters $C_{11}$ and $C_{14}$ is $WC_{C_{11}C_{14}} = WC_{S_{11}S_{14}}$, and the same is true for cluster $C_1$, which is split into $C_1$, $C_4$, and $C_6$:

$$WC_{C_1C_4} = WC_{S_1S_4}, \quad WC_{C_1C_6} = WC_{S_1S_6}, \quad \text{and} \quad WC_{C_4C_6} = WC_{S_4S_6}.$$

Then, consider the WC vectors of newly generated clusters that originally belong to different clusters.

Type 2. The WC vector between newly generated clusters $C_i$ and $C_j$, which originally belong to different clusters, is set as

$$WC_{C_iC_j} = WC_{C_{io}C_{jo}},$$

where $C_{io}$ and $C_{jo}$ are the original clusters of $C_i$ and $C_j$, respectively.

This type of similarity is illustrated in Fig. 5b, where the WC vector between the newly generated clusters $C_4$ and $C_{14}$ is $WC_{C_4C_{14}} = WC_{C_1C_{11}}$, since $C_4$ is split from $C_1$, and $C_{14}$ is split from $C_{11}$.

1. For simplicity, each cluster is labeled with the smallest stream number in it. Therefore, if a cluster is split into several subclusters, there must be one cluster that owns the same label as the original one.

Fig. 4. Illustration of how trigger streams work.


Type 3. Assume $C_i$ is a newly generated cluster split from cluster $C_{io}$. The WC vector between $C_i$ and any other originally existing cluster, say, $C_{oo}$, is set as

$$WC_{C_iC_{oo}} = WC_{C_{io}C_{oo}}.$$

This type of similarity is illustrated in Fig. 5c. The WC vectors of $C_4$ and the other existing clusters are

$$WC_{C_4C_x} = WC_{C_1C_x}, \quad WC_{C_4C_y} = WC_{C_1C_y}, \quad \text{and} \quad WC_{C_4C_{11}} = WC_{C_1C_{11}}.$$

For the other newly generated clusters $C_{14}$ and $C_6$, the corresponding vectors are updated likewise.

Note that only the WC vectors of Type 1 are updated to the latest end point time, because a Type 1 vector is copied directly from the WC vector of the latest representative trigger streams. The WC vectors of Type 2 and Type 3, whose sixth tuple $t_k$ remains at the last updated time, will be updated to the current end point time in the following step.
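The three assignment rules can be condensed into a short Python sketch. It is a sketch under assumptions rather than the paper's code: WC vectors are treated as opaque values stored in dictionaries keyed by unordered label pairs, and the names `split_wc_vectors`, `origin`, and `rep` are hypothetical. Types 2 and 3 collapse into one branch because both copy the vector of the original cluster pair, with an unsplit cluster acting as its own original.

```python
from itertools import combinations


def split_wc_vectors(old_wc, stream_wc, origin, rep):
    """Approximate the inter-cluster WC vectors after a split.

    old_wc:    {frozenset({Ca, Cb}): vector} for the clusters before the split.
    stream_wc: {frozenset({Si, Sj}): vector} between representative streams.
    origin:    {label after split: label before split}; clusters that were
               not split map to themselves.
    rep:       {new label: representative trigger stream} for split clusters.
    """
    new_wc = {}
    for ci, cj in combinations(sorted(origin), 2):
        oi, oj = origin[ci], origin[cj]
        if oi == oj:
            # Type 1: both came from the same cluster, so copy the WC vector
            # of their representative trigger streams.
            vector = stream_wc[frozenset((rep[ci], rep[cj]))]
        else:
            # Types 2 and 3: inherit the vector of the original cluster pair.
            vector = old_wc[frozenset((oi, oj))]
        new_wc[frozenset((ci, cj))] = vector
    return new_wc
```

With the clusters of Fig. 5, the pair ($C_4$, $C_{14}$) receives the old vector of ($C_1$, $C_{11}$), and the pair ($C_1$, $C_4$) receives the vector of the streams $S_1$ and $S_4$.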

5.3 Update Intercluster Similarity

It is assumed that the latest end point occurs at time $t_e$, and we need to update the center of each cluster. The corresponding new end point of the center of each cluster is the average of the data values of the streams in that cluster at time $t_e$. For streams that have no end point at time $t_e$, their values are approximated with $a_i^{now}$ and $b_i^{now}$, that is, by $a_i^{now} \cdot t_e + b_i^{now}$. Then, the new end point of the cluster center can be calculated.

Once we update the center of each cluster, only the intercluster WC vectors whose update time is not equal to $t_e$ are updated. As mentioned at the beginning of this section, since the center of a cluster can be regarded as another stream summary, the intercluster WC vector can be maintained in the same way as shown in Section 4. Furthermore, Lemma 4.3 also holds here: we need only preserve the latest two end points of the center of each cluster.
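The computation of a cluster center's new end point can be illustrated as follows. This is a minimal sketch assuming each stream is a dictionary that either carries its actual value at $t_e$ (key `value`) or the coefficients of its current line segment (keys `a_now` and `b_now`); these key names and the function name are hypothetical.

```python
def center_end_point(streams, t_e):
    """Average the stream values of one cluster at time t_e.

    Each stream is a dict with either "value" (it has an end point at t_e)
    or "a_now" and "b_now" (slope and intercept of its current segment).
    """
    total = 0.0
    for s in streams:
        if s.get("value") is not None:
            total += s["value"]                      # actual end point value
        else:
            total += s["a_now"] * t_e + s["b_now"]   # linear approximation
    return total / len(streams)
```

For example, a cluster with one stream whose end point value is 10 and one stream approximated by the line $2t + 1$ has the center value $(10 + 7)/2 = 8.5$ at $t_e = 3$.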

5.4 Merge Clusters

After splitting and updating the intercluster correlation of each cluster pair, the COMET-CORE framework checks whether any clusters are close enough to be merged. How close two clusters must be to be merged is defined by a user-given threshold $e$. If the intercluster correlation between any two clusters is equal to or higher than the threshold $e$, these two clusters are merged. Note that we can apply an appropriate agglomerative hierarchical clustering method in the merge process by setting the stop criterion as the threshold $e$. In other words, clusters are merged repeatedly until no correlation of the existing cluster pairs is higher than or equal to $e$. The number of clusters is small relative to the original number of streams and, thus, the execution time is correspondingly short. The policy of the merge process is as follows. Consider the example shown in Fig. 6. Suppose that the weighted correlation of clusters $C_1$ and $C_2$ is no less than $e$; then they will be merged. The WC vector $WC_{C_{new}C_k}$, where $k \neq 1$ and $k \neq 2$, is set to either $WC_{C_1C_k}$ or $WC_{C_2C_k}$, whichever has the smaller weighted correlation, that is, $\min(wcorr(C_1, C_k), wcorr(C_2, C_k))$. After updating the corresponding correlations, the merge process is performed on the rest of the clusters until the stop criterion is met.
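The merge policy can be sketched as a small agglomerative loop. The sketch below makes two simplifying assumptions that are not stated in the paper: only the scalar weighted correlation is tracked instead of the whole WC vector, and the most correlated pair is merged first. The merged cluster keeps the smaller label, in line with footnote 1; the function name `merge_clusters` is hypothetical.

```python
def merge_clusters(labels, wcorr, e):
    """Merge clusters until no pair has correlation >= e.

    labels: iterable of sortable cluster labels.
    wcorr:  {frozenset({Ca, Cb}): weighted correlation} for every pair.
    Returns (members, corr): members maps each surviving label to the set
    of original labels it absorbed; corr holds the remaining correlations.
    """
    members = {c: {c} for c in labels}
    corr = dict(wcorr)
    while len(members) > 1:
        pair = max(corr, key=corr.get)   # most correlated pair first
        if corr[pair] < e:
            break                        # stop criterion reached
        keep, drop = sorted(pair)
        members[keep] |= members.pop(drop)
        del corr[pair]
        for k in members:
            if k == keep:
                continue
            # Keep the smaller of the two correlations toward cluster k.
            c_keep = corr.pop(frozenset((keep, k)))
            c_drop = corr.pop(frozenset((drop, k)))
            corr[frozenset((keep, k))] = min(c_keep, c_drop)
    return members, corr
```

Taking the minimum makes the policy behave like complete linkage on correlations: a merged cluster is considered close to a third cluster only if both of its parts were.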

Finally, the algorithm of COMET-CORE is outlined in Fig. 7. Note that the triggerList is an array where the index is the cluster id, and the corresponding item is the trigger stream list tList. $S_i.cid$ represents the id of the cluster to which stream $S_i$ belongs.

Fig. 6. Illustration of the merge process and the distances to be updated.

5.5 Analysis of COMET-CORE

In this section, we give analyses of COMET-CORE on parameter usage, space complexity, and efficiency.

5.5.1 Parameter Analysis

In COMET-CORE, several parameters are considered. Because future data are not visible in a streaming environment, we need heuristics based on historical data to adjust these parameters. The first one is the threshold used in piecewise linear approximation. As mentioned in Section 3, the error threshold is subjective and data dependent. In our case, we can set a reasonable error threshold according to past experience with historical data. Then, as the streaming data come in, we can observe whether the triggers (end points) are too frequent or too rare. According to this frequency, we can raise or lower the error threshold dynamically.
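One possible form of this dynamic adjustment is sketched below. The paper does not prescribe a concrete rule, so everything here is an assumption made for illustration: the function name, the observation window, the acceptable rate band (`low_rate`, `high_rate`), and the multiplicative step `factor`.

```python
def adjust_error_threshold(threshold, n_triggers, window,
                           low_rate, high_rate, factor=1.5):
    """Adapt the piecewise linear approximation error threshold.

    n_triggers: end points generated during the last `window` time units.
    low_rate / high_rate: hypothetical band of acceptable trigger rates.
    """
    rate = n_triggers / window
    if rate > high_rate:
        return threshold * factor   # too many end points: tolerate more error
    if rate < low_rate:
        return threshold / factor   # too few end points: tolerate less error
    return threshold
```

A larger tolerance produces fewer end points and therefore fewer triggers, which is why the threshold is raised when triggers are too frequent.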

Another parameter in COMET-CORE is the decay factor. In the streaming environment, more recent data are of more interest and should be sampled more. When the decay factor is smaller than 1, the data are sampled with a bias and, naturally, the clustering results vary. The choice of its value is optional and subjective. How this value affects the clustering results is studied further in the experimental section.

Finally, the thresholds of the split and merge processes are discussed. As suggested in [26], a correlation coefficient above 0.8 is regarded as high correlation. Accordingly, the higher the threshold is, the more compact the clusters will be. Users may decide the parameters $a$ and $e$ based on the characteristics of the historical data streams for future use. In addition, without loss of generality, we can set $a = e$. Notice that if $a$ is set to $-1$, every pair of trigger streams reaches the split threshold, so only the merge operation is activated. On the other hand, if $e$ is set to 1, only the split process is activated.

5.5.2 Space Analysis

In this section, the space usage of COMET-CORE is discussed. As mentioned in Lemma 4.3, at most two end points need to be kept per stream, so the total number of end points is $O(2n) = O(n)$. Next, consider the space usage for WC vectors. For $n$ streams, suppose that there are $p$ clusters and cluster $C_k$ contains $m_k$ streams. The space complexity of the WC vectors (intra-WC vectors + inter-WC vectors) is

$$\begin{aligned} \text{space complexity} &= \{m_1(m_1 - 1)/2 + m_2(m_2 - 1)/2 + \ldots + m_p(m_p - 1)/2\} + \{p(p - 1)/2\} \\ &< m_1^2 + m_2^2 + \ldots + m_p^2 + p^2 \\ &\le (m_1 + m_2 + \ldots + m_p)^2 + p^2 = n^2 + p^2. \end{aligned}$$

It can be shown that most of the time, the space usage is far less than $O(n^2)$. The worst case appears when all streams are in the same cluster or each stream is a cluster by itself, but this is a rare case. Note that $n$, the number of streams, is much smaller than the length of the endless streams.
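The count of WC vectors can be checked numerically with a few lines of Python. The helper name `wc_vector_count` is hypothetical; it simply evaluates the first line of the expression above for a given list of cluster sizes.

```python
def wc_vector_count(sizes):
    """Number of WC vectors kept for clusters with the given sizes."""
    intra = sum(m * (m - 1) // 2 for m in sizes)   # pairs inside each cluster
    p = len(sizes)
    inter = p * (p - 1) // 2                       # pairs of cluster centers
    return intra + inter


# 12 streams in three clusters of four: 18 intra + 3 inter vectors.
balanced = wc_vector_count([4, 4, 4])
# The two worst cases both need one vector per pair of streams.
one_cluster = wc_vector_count([12])
singletons = wc_vector_count([1] * 12)
```

For 12 streams, the balanced partition needs 21 vectors, whereas both worst cases need $12 \cdot 11 / 2 = 66$.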

5.5.3 Efficiency Analysis

The main processes of COMET-CORE are event detection, cluster splitting, and merging. In the event-detection process, each trigger stream updates only its correlations with members of the same cluster. The update time is linear in the number of members in a cluster. Furthermore, only the clusters with trigger streams need to update their similarities. This saves a lot of time compared with reclustering the streams themselves all the time.

Next, we analyze the efficiency of the split process. Assume there are $p$ clusters before the split process and that, by event detection, $q$ clusters are newly generated. According to Section 5.2, there are three types of WC vectors to be generated or updated. Type 1 and Type 2 WC vectors are inter-WC vectors among the $q$ newly generated clusters. Hence, the number of these WC vectors is $q(q - 1)/2$. A Type 3 WC vector is the inter-WC vector between a newly generated cluster and an existing cluster. Therefore, their number is $pq$. The cost of assigning each new WC vector is $O(1)$. Therefore, the total time cost of the split process is $O(pq + q^2)$. Note that when $q \ll p$, the time cost reduces to $O(pq)$.

Finally, the complexity of the merge process depends on the agglomerative algorithm used. As mentioned in Section 5.4, because the number of clusters is smaller than the number of streams most of the time, the time cost is low.

6 EMPIRICAL STUDIES

To assess the performance of the COMET-CORE framework, we conducted all the following experiments on a PC with 3 Gbytes of RAM and a 3-GHz CPU, running the Windows XP Professional operating system. We compare COMET-CORE with two other methods: one is called Basic here, and the other is the ODAC algorithm proposed in [15]. The Basic algorithm performs clustering once a fixed number of examples has been collected. Agglomerative hierarchical clustering with complete linkage is always applied directly to the streams after the accumulated statistics are updated. The ODAC algorithm also checks cluster splitting and aggregation periodically, but with criteria based on Hoeffding bounds. In contrast, COMET-CORE updates the necessary statistics by triggers and adjusts the cluster structure by events. All three methods use a correlation-like similarity measure. COMET-CORE and Basic both use the weighted correlation as the similarity measure; the difference is that COMET-CORE computes correlations on approximated data, as described in the preceding sections, whereas Basic computes correlations on the actual stream data. ODAC uses the rooted normalized one-minus correlation, which normalizes the one-minus-correlation value to [0, 1], as the distance measure.
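To make the ODAC distance concrete, the sketch below computes the rooted normalized one-minus correlation as $\sqrt{(1 - corr)/2}$, which is our reading of the definition in [15]. The helper names `pearson` and `rnomc` are hypothetical, and the plain Pearson correlation is computed in batch form here, whereas ODAC maintains the required statistics incrementally.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rnomc(x, y):
    """Rooted normalized one-minus correlation, a distance in [0, 1]."""
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, (1.0 - pearson(x, y)) / 2.0))
```

Perfectly correlated streams are at distance 0 and perfectly anticorrelated streams at distance 1, so the measure maps the correlation range [-1, 1] onto [0, 1].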

To evaluate the clustering quality, we use a well-known cluster validation technique, Silhouette validation [27]. The silhouette validation first assigns each stream a quality measure:

$$sil(S_i) = \frac{b(S_i) - a(S_i)}{\max\{a(S_i), b(S_i)\}},$$

where $a(S_i)$ is the average dissimilarity of stream $S_i$ to all other streams in the same cluster, and $b(S_i)$ is the average dissimilarity of stream $S_i$ to all streams in the closest other cluster. Following from the formula, the $sil(S_i)$ value ranges from $-1$ to 1. If this value is close to 1, it means that $S_i$ is well clustered. On the other hand, if $sil(S_i)$ is close to $-1$, it shows that $S_i$ is assigned to an improper cluster. The cluster silhouette of a cluster $C_k$ with $m$ members is

$$sil_{C_k} = \frac{\sum_{S_i \in C_k} sil(S_i)}{m}.$$

Finally, the Global Silhouette, GS, which is used to assess clustering quality in the following experiments, is defined as $GS = \frac{1}{p} \sum_{k=1}^{p} sil_{C_k}$, where $p$ is the total number of clusters.

Silhouette validation considers the cluster structure from the view of the relationship between objects. It takes both the compactness and the separation of clusters into account.

Note that in the following experiments, the distance used in computing the GS value is based on the actual data instead of on approximated data.
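The GS computation can be sketched in Python as follows. This is an illustrative sketch rather than the evaluation code used in the experiments: the function name and the `dist` callable are hypothetical, at least two clusters are assumed, and a single-member cluster is given a silhouette of 0, which is a common convention that the text does not spell out.

```python
from statistics import fmean


def global_silhouette(clusters, dist):
    """Global Silhouette of a clustering (at least two clusters assumed).

    clusters: list of lists of stream objects.
    dist(s, t): hypothetical callable returning the dissimilarity.
    """
    cluster_scores = []
    for k, members in enumerate(clusters):
        scores = []
        for i, s in enumerate(members):
            if len(members) == 1:
                scores.append(0.0)   # assumed convention for singletons
                continue
            # a: average dissimilarity to the other members of the cluster.
            a = fmean(dist(s, t) for j, t in enumerate(members) if j != i)
            # b: average dissimilarity to the closest other cluster.
            b = min(fmean(dist(s, t) for t in other)
                    for j, other in enumerate(clusters) if j != k)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
        cluster_scores.append(fmean(scores))   # cluster silhouette
    return fmean(cluster_scores)               # average over clusters
```

As a quick check with one-dimensional points and the absolute difference as distance, the clustering {0, 1}, {10, 11} yields a GS close to 1, whereas the mixed clustering {0, 10}, {1, 11} yields a negative GS.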

6.1 Evaluation of COMET-CORE on Real Data

In this section, three real data sets are used for testing. The first one is the weather data obtained from the Temperature Data Archive of the University of Dayton.$^2$ It contains the average daily temperatures of 290 cities around the world, collected from the year 1995 onward. Each city is regarded as a data stream, and each stream has 3,416 points. The second real data set is S&P 500 stock data, which can be obtained from a public Web site.$^3$ We collected one year of historical data for 476 stocks. For each stock, we use the high and the low values as two streams, which gives 952 streams with a length of 252. Finally, the third data set is the training data set of the Physiological Data Modeling Contest (PDMC) competition.$^4$ There are nine sensors to be observed, and the total number of observations is 189,961. Assume every stream arrives at the rate of one sample per time unit. The parameter settings of the three methods are as follows. The stop criterion for Basic is $e = 0.5$. For COMET-CORE, the thresholds are set as $a = e = 0.5$. For ODAC, we set the confidence value to 0.05 for the Hoeffding bound, as suggested in [15]. Since the Basic and ODAC methods perform clustering periodically, two periods are tested, namely 10 and 50 time units. COMET-CORE maintains end points with two relative error levels, 10 percent and 20 percent, and checks necessary cluster splitting or merging by events.

First, we evaluate the average processing time for a round of clustering. As shown in Fig. 8a, COMET-CORE is much more efficient than the other two methods. For the Basic and ODAC methods, the average processing time increases as the period becomes longer, because a larger amount of data needs to be processed when updating the statistics required for the similarities. However, if the period between two clustering rounds becomes shorter, the frequency of clustering tasks increases, and the total processing time grows significantly with the stream length. Both Basic and ODAC spend a lot of time updating pairwise similarities between streams. ODAC additionally spends time testing cluster splitting and aggregation by computing pairwise distances between streams in the same cluster. Although ODAC may have good clustering quality, its high processing time makes its scalability in a multiple-stream environment questionable. In contrast, COMET-CORE only updates clusters with trigger streams. The number of end points under different error thresholds of the piecewise linear approximation is listed in Fig. 8b. We find that the threshold value has very little influence on the processing time for all three data sets. Across all three methods, it is the number of streams that dominates the average processing time; the time cost is almost independent of the length of a stream. The stock data, which contain 952 streams of length 252, take much more time than the PDMC data, which contain nine streams of length 189,961. (Note that the time scale for the PDMC data set is in milliseconds.)

Next, the cluster quality is discussed. We probe the GS value along the stream progression, and the average of these values is used to evaluate performance. The average GS value of the three methods for each data set is shown in Fig. 8c. The stock data fluctuate more, whereas the weather and PDMC data are relatively steady. Therefore, all methods have lower GS values on the stock data set. There are only nine streams in the PDMC data set; therefore, the overall GS values are high.

Then, we discuss the reasons for the different quality of each method. For the Basic method, the clustering period does not affect the clustering results, since Basic always starts merging from the state in which each stream is a cluster. For the ODAC method, the Hoeffding bound depends on the number of samples collected. Therefore, checking cluster splitting and aggregation at different frequencies yields different clustering results for the same data streams. It is not easy to choose a good clustering period and confidence value for computing Hoeffding bounds. For COMET-CORE, the cluster quality is affected by the following factors. In our algorithm, to avoid too many 1-stream clusters during clustering, we put all such streams in temporary clusters and wait for the next trigger to split them. Until then, the temporary cluster degrades the cluster quality. The approximation error introduced by the WC vector has a significant influence as well. However, this can be remedied by the piecewise linear smoothing. With a higher tolerance on the piecewise linear approximation error, the number of trigger points decreases. This reduces the frequency of WC vector approximation and, hence, the cluster quality increases.

In the following, we discuss why the GS value of COMET-CORE is superior to that of the other two methods. The Basic method applies agglomerative hierarchical clustering, and ODAC applies mainly divisive clustering. Once the order of merging or splitting is decided in hierarchical clustering, it cannot be changed later. In COMET-CORE, however, the combination of split and merge processes gives streams the flexibility to move among clusters for a better clustering structure. Furthermore, the advantage that COMET-CORE captures cluster evolutions dynamically by events is revealed here: periodical clustering methods such as Basic and ODAC cannot capture cluster evolutions that occur between two clustering tasks.

2. http://www.engr.udayton.edu/weather/.

3. http://kumo.swcp.com/stocks/.

4. http://www.cs.utexas.edu/users/sherstov/pdmc/.

Finally, taking the weather data as an example, the clustering results under different decay-factor values are shown in Fig. 9a. Since the ODAC algorithm provides no bias option, it is not considered here. For different decay-factor values, the data weights vary; hence, we get different clustering results. Again, COMET-CORE is superior to Basic in that it captures the clustering evolution dynamically instead of clustering at every fixed time period. In addition, we compare the number of final clusters that have exactly the same members under the two methods. In Fig. 9b, we can see that the proportion of exact matches is quite high, which supports the reliability of COMET-CORE.

6.2 Evaluation of COMET-CORE on Synthetic Data

In this section, two kinds of synthetic data sets are used for evaluation.

6.2.1 Cylinder-Bell-Funnel Data Set

We use cylinder-bell-funnel data to demonstrate that COMET-CORE can cluster data meaningfully in an online fashion and to explain the function of the decay factor. The generation of this data was proposed in [28] and [29]. The three classes of data, cylinder, bell, and funnel, are illustrated with examples in Fig. 10a. Each class of data has a length of 128. In our test data, we concatenate the three distinct classes in different orders, as shown in Fig. 10b. The different combinations of data classes constitute six types of 384-long streams. For each type, 100 streams are generated, and normally distributed random numbers ranging from 0 to 1 are added to each of them. Consequently, we have a total of 600 streams to test.

We set $a = e = 0.80$, which is quite a strict criterion, to make sure that streams inside the same cluster are highly correlated. The decay factor is varied from 0.98 to 1. In Fig. 10c, the columns show different time points, and the rows represent different decay-factor values. At certain time points and decay-factor values, the cluster number and the error rate are recorded. Here, the error rate is calculated as

$$\text{error rate} = \frac{\sum_{C_i} \dfrac{\text{number of streams} \notin C_i}{\text{total stream number inside } C_i}}{\text{number of real clusters}},$$

where a real cluster means a cluster with more than one member. If the class of a stream $S_i$ is in the minority in the cluster $C_i$ to which the stream is assigned, we say $S_i \notin C_i$.

Fig. 9. (a) The average GS value and (b) the number of exactly matching clusters under different decay-factor values.

According to Fig. 10c, around time 92, there should be three clusters, and COMET-CORE successfully reflects this under all decay-factor values. Around time 128, which is the end of the first class period of each stream, we find that when the decay factor is 0.99, the cluster number becomes 2. One cluster contains exactly the F-B-C and F-C-B data (200 streams), whereas the other cluster contains all the B-C-F, B-F-C, C-B-F, and C-F-B data (a total of 400 streams). This is because the data near time 128 are weighted more than the data at the beginning, which makes the bell and cylinder data alike, and both of them are dissimilar to the funnel data. The error rate of 25 percent arises because 200 of the 400 streams in the same cluster are judged not to belong to this cluster. For a decay factor of 0.98, the cluster number increases. Due to the faster decay rate and the domination of noise, some correlations between streams of the same type fall below 0.8.

Around time 226, the cluster number becomes 6 for a decay factor of 1. This is because each data point is treated with equal weight and, therefore, different types of streams are separated into different clusters. For decay factors of 0.99 and 0.98, there are three clusters. This time, the clusters are decided by the second part of each stream, since the weight before time 128 is almost 0. Hence, B-C-F and F-C-B are in one cluster, B-F-C and C-F-B are in another cluster, and the remaining two types are in the third cluster. The cases at the following time points are the same as the previous ones and are hence omitted here. In this experiment, we show that COMET-CORE indeed performs meaningful clustering and can obtain the desired clustering results under different decay-factor values.
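The 25 percent figure can be reproduced with a short Python sketch of the error rate. The sketch rests on two assumptions: the class of a stream is taken to be its shape in the current 128-point period (F, B, or C), and when two classes tie for the majority of a cluster, one of them is treated as the majority. The function name `error_rate` is hypothetical.

```python
from collections import Counter


def error_rate(clusters):
    """Average minority fraction over clusters with more than one member.

    clusters: list of lists holding one class label per stream.
    """
    real = [c for c in clusters if len(c) > 1]
    total = 0.0
    for c in real:
        majority = Counter(c).most_common(1)[0][1]   # size of the largest class
        total += (len(c) - majority) / len(c)        # fraction judged misplaced
    return total / len(real)


# Around time 128: one cluster holds the 200 funnel-first streams, the other
# holds 200 bell-first and 200 cylinder-first streams.
rate = error_rate([["F"] * 200, ["B"] * 200 + ["C"] * 200])
```

The first cluster contributes 0 and the second contributes 200/400, so the error rate is $(0 + 0.5)/2 = 0.25$.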

6.2.2 Random Walk Data Set

To evaluate the scalability of the three methods, we use the following synthetic data. First, we generate four random walk data sets with the number of streams ranging from 100 to 2,000. Each stream has a length of 20,000 points. For the Basic and ODAC methods, the clustering task is performed every 200 points. The confidence value of the Hoeffding bound for ODAC is set to 0.05. The threshold of the merge process for Basic is $e = 0.7$, and for COMET-CORE, we set $a = e = 0.7$. In addition, COMET-CORE does not apply piecewise linear approximation to the stream data here. The average processing time is shown in Fig. 11a. Here, the time axis (y-axis) is plotted in log scale for clarity because the processing times of the methods differ by several orders of magnitude. Again, COMET-CORE is clearly more efficient than the other two periodical clustering methods. Basic and ODAC spend too much time computing pairwise similarities between streams, whereas COMET-CORE only checks clusters with trigger streams for possible splitting and merging. Then, we examine the average GS value of each method in Fig. 11c. These values are obtained by averaging the GS values taken every 400 points for each method. In most cases, Basic has the highest value because each time it merges streams starting from one-stream clusters and stops when the criterion is met. ODAC has a lower GS value for two possible reasons. First, as mentioned in the empirical study on real data, it is hard to choose a good clustering period and confidence value for the Hoeffding bound. Second, ODAC splits a cluster into only two clusters each time the splitting condition is met, so the cluster may not be split well enough. Finally, for COMET-CORE, the main degradation in cluster quality is caused by the approximation of the WC vector when splitting and merging clusters. This is a trade-off between efficiency and quality. However, the clustering results show that COMET-CORE still keeps an acceptable cluster quality with high efficiency.

Next, we show the performance of the three methods as the number of clusters varies. The number of streams is fixed at 500, and each stream contains 20,000 points. For the different data sets, the streams are randomly distributed over different numbers of clusters. The cluster number varies from 50 to 200. In Fig. 11b, we find that the processing time is almost independent of the cluster number for all three methods. This again indicates that it is the number of streams that dominates the time cost. As for the clustering quality shown in Fig. 11d, the GS value of each method stays steady under different cluster numbers. The reasons for better or worse GS values are the same as those given above.

Fig. 11. The average processing time while the number of (a) streams and (b) clusters varies, and the clustering quality while the number of (c) streams and (d) clusters varies.


Finally, we further analyze how the number of clusters that contain trigger streams affects the processing time of COMET-CORE. In the following, we use trigger cluster to mean a cluster with trigger streams. In the empirical study on real data, we have shown that the number of trigger points is not a significant factor in the average processing time. This is because many trigger streams may exist in only a few clusters, and COMET-CORE only updates the similarities of trigger clusters. To show this, we use the same data set of 500 streams in 200 clusters. As the stream clustering progresses along the 20,000 points, we record the number of trigger clusters and the corresponding processing time. As shown in Fig. 12, the processing time indeed tends to increase with the number of trigger clusters.

7 CONCLUSION

In this paper, we proposed the COMET-CORE framework for online monitoring of clusters over multiple evolving streams by correlations and events. The streams are smoothed by piecewise linear approximation, and each end point of a line segment is regarded as a trigger point. At each trigger point, for clusters that have trigger streams, we update the weighted correlations related to the trigger streams. Whenever an event happens, that is, clusters change, the clusters are modified through efficient split and merge processes. As validated by the experimental results, COMET-CORE is efficient and scalable while producing clustering results of good quality.

REFERENCES

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” Proc. Principles of Database Systems, 2002.

[2] A. Bulut and A.K. Singh, “SWAT: Hierarchical Stream Summarization in Large Networks,” Proc. Int’l Conf. Data Eng., 2003.

[3] P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, 2000.

[4] M.M. Gaber, S. Krishnaswamy, and A. Zaslavsky, “Cost-Efficient Mining Techniques for Data Streams,” Proc. Australasian Workshop Data Mining and Web Intelligence, 2004.

[5] V. Ganti, J. Gehrke, and R. Ramakrishnan, “Demon: Mining and Monitoring Evolving Data,” IEEE Trans. Knowledge and Data Eng., vol. 13, no. 1, pp. 50-63, Jan./Feb. 2001.

[6] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, “Clustering Data Streams,” Proc. Ann. Symp. Foundations of Computer Science, 2000.

[7] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, 2001.

[8] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-Data Algorithms for High-Quality Clustering,” Proc. Int’l Conf. Data Eng., 2002.

[9] A. Bulut and A.K. Singh, “A Unified Framework for Monitoring Data Streams in Real Time,” Proc. Int’l Conf. Data Eng., 2005.

[10] X. Liu and H. Ferhatosmanoglu, “Efficient K-nn Search on Streaming Data Series,” Proc. Symp. Spatial and Temporal Databases, pp. 83-101, 2003.

[11] Y. Zhu and D. Shasha, “Statstream: Statistical Monitoring of Thousands of Data Streams in Real Time,” Proc. Very Large Data Bases Conf., pp. 358-369, 2002.

[12] H. Wu, B. Salzberg, and D. Zhang, “Online Event-Driven Subsequence Matching over Financial Data Streams,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 23-34, 2004.

[13] J. Beringer and E. Hullermeier, “Online Clustering of Parallel Data Streams,” Data and Knowledge Eng., vol. 58, no. 2, pp. 180-204, 2005.

[14] B.-R. Dai, J.-W. Huang, M.-Y. Yeh, and M.-S. Chen, “Adaptive Clustering for Multiple Evolving Streams,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 9, pp. 1166-1180, Sept. 2006.

[15] P.P. Rodrigues, J. Gama, and J.P. Pedroso, “ODAC: Hierarchical Clustering of Time Series Data Streams,” Proc. Sixth SIAM Int’l Conf. Data Mining, pp. 499-503, Apr. 2006.

[16] J. Yang, “Dynamic Clustering of Evolving Streams with a Single Pass,” Proc. Int’l Conf. Data Eng., pp. 695-697, 2003.

[17] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “A Framework for Clustering Evolving Data Streams,” Proc. Very Large Data Bases Conf., 2003.

[18] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, “On High Dimensional Projected Clustering of Data Streams,” Data Mining and Knowledge Discovery, vol. 10, no. 3, pp. 251-273, 2005.

[19] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-Based Clustering over an Evolving Data Stream with Noise,” Proc. SIAM Conf. Data Mining, 2006.

[20] Z. He, X. Xu, S. Deng, and J.Z. Huang, “Clustering Categorical Data Streams,” Computational Methods in Science and Eng., 2004.

[21] E.J. Keogh, S. Chu, D. Hart, and M.J. Pazzani, “An Online Algorithm for Segmenting Time Series,” Proc. Int’l Conf. Data Mining, 2001.

[22] V. Guralnik and J. Srivastava, “Event Detection from Time Series Data,” Proc. Fifth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 33-42, 1999.

[23] D.P. Kacso, “Approximation by Means of Piecewise Linear Functions,” Results in Math., vol. 35, pp. 89-102, Jan. 1999.

[24] E.J. Keogh, “A Fast and Robust Method for Pattern Matching in Time Series Databases,” Proc. Int’l Conf. Tools with Artificial Intelligence, 1997.

[25] E.J. Keogh and M.J. Pazzani, “An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback,” Proc. Int’l Conf. Knowledge Discovery and Data Mining, pp. 239-243, 1998.

[26] A. Franzblau, A Primer of Statistics for Non-Statisticians. Harcourt, Brace, and World, 1958.

[27] P. Rousseeuw, “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis,” J. Computational and Applied Math., vol. 20, pp. 53-65, 1987.

[28] N. Saito, “Local Feature Extraction and Its Applications Using a Library of Bases,” PhD dissertation, Dept. of Math., Yale Univ., 1994.

[29] S. Manganaris, “Supervised Classification with Temporal Data,” PhD dissertation, Dept. of Computer Science, Vanderbilt Univ., 1997.

Fig. 12. The average processing time versus the number of trigger clusters.


Mi-Yen Yeh received the BS degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, in 2002. She is currently a PhD candidate in the Electrical Engineering Department, National Taiwan University, Taipei, Taiwan. Her research interests include data mining, data clustering, and data stream management.

Bi-Ru Dai received the BS and PhD degrees in electrical engineering from the National Taiwan University, Taipei, Taiwan, in 2000 and 2006, respectively. She is currently an assistant professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan. Her research interests include data mining, data stream management, and bioinformatics.

Ming-Syan Chen received the BS degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, and the MS and PhD degrees in computer, information, and control engineering from the University of Michigan, Ann Arbor, in 1985 and 1988, respectively. He is currently a distinguished professor jointly appointed by the Electrical Engineering Department, the Computer Science and Information Engineering Department, and also the Graduate Institute of Communication Engineering at the National Taiwan University. He was a research staff member at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, from 1988 to 1996. His research interests include database systems, data mining, mobile computing systems, and multimedia networking. He has published more than 240 papers in his research areas. In addition to serving as program chairs/vice chairs and keynote/tutorial speakers in many international conferences, he was an associate editor of the IEEE Transactions on Knowledge and Data Engineering (TKDE) and also the Journal of Information Science and Engineering (JISE). He is currently on the editorial board of the Very Large Data Base (VLDB) Journal, the Knowledge and Information Systems (KAIS) Journal, and the International Journal of Electrical Engineering (IJEE). He was a distinguished visitor of the IEEE Computer Society for Asia-Pacific from 1998 to 2000 and also from 2005 to 2007. He holds, or has applied for, 18 patents and seven ROC patents in his research areas. He is a recipient of the National Science Council (NSC) Distinguished Research Award, Pan Wen Yuan Distinguished Research Award, Teco Award, Honorary Medal of Information, and K.-T. Li Research Breakthrough Award for his research work and also the Outstanding Innovation Award from IBM Corporate for his contribution to a major database product. He also received numerous awards for his research, teaching, inventions, and patent applications. He is a fellow of the ACM and the IEEE.


