
Constrained Clustering for the Evolving Data Stream


Bi-Ru Dai1,* and Ming-Syan Chen 2

1 Department of Computer Science and Information Engineering National Taiwan University of Science and Technology

Taipei, Taiwan, ROC

[email protected]

2 Department of Electrical Engineering National Taiwan University

Taipei, Taiwan, ROC

[email protected]

Received June 22, 2007 ; Accepted June 30, 2007

Abstract. In order to import the domain knowledge or application dependent parameters into the data mining systems, constraint-based mining has attracted a lot of research attention recently. However, most of the constraint-based mining algorithms are designed for static data sets, and are not investigated in the data stream environment. In this paper, we devise a framework of Constrained Clustering for the Evolving Data Stream, abbreviated as CCDS framework, to cluster the data stream under the pairwise range constraint. The CCDS framework proposed consists of two phases, namely the statistics reserving phase and the clustering responding phase. The statistics reserving phase provides an efficient algorithm to process and maintain the data points in a compact structure named constraint tree. The clustering responding phase generates clusters whenever a clustering request is submitted by the user. As shown in our analyses, the time complexity of the statistics reserving phase, which is the time complexity of constructing the constraint tree, is linear in the number of data points. In addition, the clustering time is also reduced by applying the statistics maintained to the clustering algorithm. Therefore, framework CCDS is very suitable for the data stream environment. It is empirically shown that the proposed framework is very efficient for dealing with multiple constrained clustering requests in the data stream environment while producing clustering results of very high quality.

Keywords: data mining, data clustering, constrained clustering, data stream

1 Introduction

Data clustering is a useful technique for many applications, including similarity search, pattern recognition, trend analysis, marketing analysis, grouping, classification of documents, and so forth [1][2]. In data clustering, similar data points are grouped together in a cluster. In recent years, several mining capabilities have been explored for the data stream environment [3][4], including those on the association rules [5], frequent patterns [6], data clustering [7-9], and data classification [10][11], to name a few. For data stream applications, the volume of data is usually too huge to be stored on permanent devices or to be scanned thoroughly more than once. It is hence recognized that both approximation and adaptivity are key ingredients for executing queries and performing mining tasks over rapid data streams.

Since data mining is an application dependent technology, the information involving domain knowledge is usually imposed on the mining systems as various constraints. Some algorithms have been proposed for clustering with constraints [12-17]. However, they are mainly designed for static data sets and are usually not able to work well in the data stream environment. On the other hand, the main focus of the research on mining data streams is to solve mining problems under the condition of limited resources, such as time and space, compared to the fast and infinite data points. However, the constraints for clusters or data points are not fully investigated. In this paper, these two concepts, i.e., the properties of constraints and data streams, are combined and considered at the same time. Note that the effects of these two factors are entangled, thus significantly complicating the problem. We focus our attention on clustering with the pairwise numerical constraint, which was proposed in the work [13], and propose a framework to support this constrained clustering problem in the data stream environment. Specifically, those attributes employed to model the constraints are called constraint attributes whereas those attributes involved in the objective function to be optimized, similar to those in most prior works, are called optimization attributes. The constrained clustering is conducted in such a way that the objective function of optimization attributes is optimized subject to the condition that the imposed constraint by constraint attributes is satisfied. The constraint considered in this work is that the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range.

This pairwise constrained clustering problem can be understood by the following example. Consider the data points in Fig. 1 where X and Y are the conventional optimization attributes which form the coordinate of the residential location and age is the constraint attribute with the constraint range being 5 years. The nodes in Fig. 1(a) are identical to those in Fig. 1(b) and the number next to each node represents the age of that person. By conventional clustering methods which are designed to cluster nearby nodes together, these nodes may be partitioned into the two clusters as shown in Fig. 1(a), in which, however, the age constraint is not obeyed. Instead, one possible solution to this constrained clustering problem is shown in Fig. 1(b), where members within 22 - 27 years old are in one cluster, and people within 33 - 38 years old are in the other cluster. Such a clustering with numerical constraints is called for in many real applications. For example, we may apply this pairwise numerical constrained clustering on the basic data of people in a club to group people provided that we require the range of ages in each group to be no more than 5 years. Notice that we do not indicate a fixed boundary of a cluster explicitly. Instead, the boundaries of constraint attributes of each cluster are dynamically determined by the data points existing in the cluster. In the example in Fig. 1, once a person joins a cluster, the age range of this cluster is adjusted immediately. For example, assume that the first person joining a group is 23 years old; then people between 18 and 28 years old are allowed to join this group afterward. However, if the second person joining this group is 27 years old, the age range of this group is narrowed down to 22 - 28 years old, meaning that the age boundary of this cluster is revised. Finally, groups with similar people are generated based on the 5-year range. Figures 1(c) and 1(d) show the age in the third dimension, and the difference to the result from a conventional clustering method is also shown. Possible applications of this constrained clustering model include bioinformatics data with the control of temperature [18], the group behavior of GPS users along the time line, and the segmentation of a video into story units [19], to name a few. More justification of this problem model can be found in [15][20].
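The dynamic adjustment of a cluster's age boundary in the example above can be illustrated with a short sketch. The function name `admissible_range` is hypothetical and not part of the paper; the sketch assumes a single numerical constraint attribute whose constraint distance is the absolute difference.

```python
def admissible_range(member_values, constraint_range):
    """Return the (low, high) interval that a new member's constraint value
    must fall in so that all pairwise differences stay within constraint_range.

    Hypothetical helper for illustration; assumes member_values is non-empty.
    """
    low = max(member_values) - constraint_range
    high = min(member_values) + constraint_range
    return low, high


# The first person is 23 years old and the constraint range is 5 years.
print(admissible_range([23], 5))      # (18, 28)
# A 27-year-old joins, so the admissible range narrows.
print(admissible_range([23, 27], 5))  # (22, 28)
```

The interval depends only on the smallest and largest constraint values currently in the cluster, which is why the boundary can be revised immediately whenever a member joins.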

[Fig. 1: (a) A conventional clustering. (b) A constrained clustering with age constraint range = 5. (c) Three-dimensional graph of (a). (d) Three-dimensional graph of (b). Panels (a) and (b) plot the data points on the X and Y axes, with each point labeled by its age; panels (c) and (d) add Age as the third axis.]

Data points plotted in Fig. 1:

Age  Y   X
38   17  49
33   7   47
35   15  43
34   26  42
36   10  41
25   18  37
22   11  35
37   23  32
24   18  29
33   8   23
22   25  20
35   18  19
36   12  17
27   24  14
26   8   14
23   13  10
25   20  6

Note that in the data stream environment, data points arrive continuously. The limited space does not allow us to maintain all the information of data points. Therefore, data points can only be parsed once to collect necessary statistics and are then discarded. In addition, to process fast incoming data points, an efficient algorithm for maintaining statistics is also required. Furthermore, as data points accumulate, users may have different clustering requirements at different times. For example, when a club has fewer members and fewer resources, 5 clusters with the constraint range being 10 years are good. However, as more and more members join, users may feel that 10 clusters with a 5-year range are more appropriate. In addition, clustering requirements usually cannot be predicted in advance because they are application dependent and differ for various purposes. Therefore, a framework that is able to handle infinitely arriving data points and support various clustering requirements is desired.

In this paper, we devise a framework of Constrained Clustering for the Evolving Data Stream, abbreviated as CCDS framework, to cluster the data stream under the pairwise numerical constraint. The CCDS framework proposed consists of two phases, namely the statistics reserving phase and the clustering responding phase. The statistics reserving phase provides an efficient algorithm to process and maintain the data points in a compact structure named constraint tree. On the other hand, the statistics maintained will be retrieved in the clustering responding phase to generate clusters according to clustering requests specified by users. Since the clustering requirement is obtained only when a user submits the request, the maintained information is required to be able to service various clustering requests without processing the original data points again. Note that in addition to minimizing the clustering costs, the pairwise numerical constraint should be followed strictly. Accordingly, we design the constraint tree structure to give a higher priority to the constraint attribute when data points are inserted. Then, data points with higher similarity are summarized and maintained in one record to reduce the storage space. As shown in the complexity analyses, the time complexity of the statistics reserving phase, which is the time complexity of constructing the constraint tree, is O(n), where n is the number of data points. In addition, the clustering time is also reduced by applying the statistics maintained to the clustering algorithm, which ensures the efficient execution of framework CCDS in the data stream environment. Furthermore, theoretical properties of CCDS are derived in this paper. It is also validated by our empirical studies that the CCDS framework performs very efficiently in the data stream environment while producing clustering results of very high quality.

The rest of this paper is organized as follows. The problem description and framework definitions are given in Section 2. Section 3 presents the proposed algorithms to deal with this constrained clustering problem in the data stream environment. Then empirical studies are conducted in Section 4. This paper concludes with Section 5.

2 Problem Description

In this section we will describe the problem studied in this paper, and introduce a framework to solve the constrained clustering for the evolving data stream. As mentioned earlier, among all attributes of the data set, some are specified as constraint attributes and some are optimization attributes. Constraint attributes and optimization attributes may overlap because some attributes are possibly important in both considerations. Similar to the conventional clustering problems, there is an objective function operating on the optimization attributes to measure the cost of clustering. In addition, a constraint range $R_{a_C}$ is set for the constraint attribute $a_C$. For any pair of objects $o_i$, $o_j$ in a cluster, the constraint distance $d_{a_C}(o_i, o_j)$ of constraint attribute $a_C$ between the two objects is required to be less than or equal to $R_{a_C}$. For example, in Fig. 1, age is a constraint attribute, denoted by $a_{age}$, and its constraint range $R_{a_{age}}$ is 5. The constraint distance $d_{a_{age}}(\cdot,\cdot)$, which is the difference of ages, should be no more than 5 years. For the simplification and clearer discussion of this problem, we focus on the scenario of a single constraint attribute in this paper. Formally, we have the following problem definition.

Definition 1 Clustering Data Stream with Pairwise Constraint:

• A data stream $D$ of infinite incoming objects is represented as $\{o_1, o_2, \ldots, o_i, \ldots\}$.

• A clustering request is a pair $(k, R_{a_C})$, where $k$ is the number of clusters, $R_{a_C}$ is the constraint range of the constraint attribute $a_C$, and the value of $R_{a_C} \times k$ is larger than or equal to the whole range of the constraint attribute $a_C$.

• Given a data stream $D$ and a received clustering request $(k, R_{a_C})$, the pairwise constrained clustering problem is defined as the problem of determining the $k$-clustering $Cl = \{C_1, C_2, \ldots, C_k\}$ in such a way that the total cost $Cost(Cl)$ is minimized subject to the condition that for any pair of objects $(o_i, o_j)$ in a cluster, we have $d_{a_C}(o_i, o_j) \le R_{a_C}$. The cost function $Cost(Cl)$ is calculated based on the optimization attributes.

Example 1: Given a data stream as follows, D = {(9,11), (2,27), (7,3), (10,9), (6,29), (8,23), (4,22), (5,8), (3,6), (1,25), ...}, where the first value of each data point is the value of the constraint attribute, and the second one is the value of the optimization attribute. Suppose that a clustering request of $(k, R_{a_C})$ = (2, 6) is submitted when 8 data points have been received. The constrained clustering result will be C1 = {(2, 27), (6, 29), (8, 23), (4, 22)} and C2 = {(9, 11), (7, 3), (10, 9), (5, 8)}. Note that the values of the constraint attribute are in the range [2, 8] for cluster C1 and in the range [5, 10] for cluster C2 to comply with the constraint range 6. Then, after two more data points are inserted, another clustering request of $(k, R_{a_C})$ = (3, 5) is submitted. The clustering result of C1 = {(2, 27), (6, 29), (4, 22), (1, 25)}, C2 = {(9, 11), (10, 9), (8, 23)}, and C3 = {(7, 3), (5, 8), (3, 6)} will be returned in response to the new request. It is noted that some data points are not in the same clusters as in the previous clustering result since the constraint range and the number of clusters have changed.
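The feasibility of the two clustering results in Example 1 can be checked with a minimal sketch. The function names are hypothetical; the sketch assumes a single numerical constraint attribute stored as the first value of each point, so the pairwise condition reduces to comparing the largest and smallest constraint values in a cluster. It checks only the constraint and the number of clusters, not the optimality of the cost.

```python
def satisfies_pairwise_constraint(cluster, constraint_range):
    """True if every pair of points differs by at most constraint_range
    in the constraint attribute (first value of each point)."""
    if not cluster:
        return True
    values = [point[0] for point in cluster]
    return max(values) - min(values) <= constraint_range


def is_feasible_clustering(clusters, k, constraint_range):
    """True if there are exactly k clusters and each obeys the constraint."""
    return len(clusters) == k and all(
        satisfies_pairwise_constraint(c, constraint_range) for c in clusters
    )


# Result of the first request (k, R) = (2, 6) in Example 1.
first = [
    [(2, 27), (6, 29), (8, 23), (4, 22)],
    [(9, 11), (7, 3), (10, 9), (5, 8)],
]
# Result of the second request (k, R) = (3, 5) in Example 1.
second = [
    [(2, 27), (6, 29), (4, 22), (1, 25)],
    [(9, 11), (10, 9), (8, 23)],
    [(7, 3), (5, 8), (3, 6)],
]

print(is_feasible_clustering(first, 2, 6))   # True
print(is_feasible_clustering(second, 3, 5))  # True
print(is_feasible_clustering(first, 2, 5))   # False: C1 spans [2, 8]
```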

[Figure: a constraint tree whose upper part is labeled "constraint levels" and whose leaf nodes link to the buckets in the lower part, labeled "dense-record buckets".]

Fig. 2. Illustration of the constraint tree, where the maximum number of keys in a node is 4. Each node contains at least 2 keys and at most 4 keys. The leaf nodes link to dense-record buckets.

Note that in the data stream environment, data points arrive continuously and users can submit a clustering request $(k, R_{a_C})$ at any time. According to the above definition, two main tasks are required to be handled in our framework. First, we have to keep track of the incoming data stream, and maintain sufficient information or summaries for generating clusters with high quality. Second, clusters that satisfy the requirements specified by users should be generated from the available information instead of from the original whole data set. Note that the maintained information should service various clustering requests without processing the original data points again. Therefore, a framework named Constrained Clustering for the Evolving Data Stream, abbreviated as CCDS framework, containing two phases, which are the statistics reserving phase and the clustering responding phase, is proposed to deal with these two tasks, respectively. In the statistics reserving phase, the arriving data points are processed and some statistics are maintained. Note that clusters are generated whenever a clustering request is submitted by a user; therefore, we have no idea about what kind of clusters will be generated when data points arrive. In other words, the statistics maintained during the statistics reserving phase should contain enough information for currently unknown clustering requests while keeping the required storage space small. In the CCDS framework, a compact data structure, named constraint tree, is proposed to maintain the statistics efficiently for future clustering requests. In the clustering responding phase, clustering requests containing the cluster number $k$ and the constraint range $R_{a_C}$ are obtained. Then, according to the clustering requests, the clustering algorithm will be applied to the statistics kept in the constraint tree. Therefore, multiple clustering requests can be served without visiting the original data set. The details of the constraint tree and the proposed algorithms will be described in Section 3.

3 Design of CCDS Framework

In Section 3.1, the algorithm for maintaining the constraint tree during the statistics reserving phase is presented. Next, the approach to generate clusters in the clustering responding phase is described in Section 3.2.


3.1 Statistics Reserving Phase

Since the data points in the data stream environment arrive continuously, we are not able to store the information of all data points. Therefore, a compact structure is required to keep enough information for clustering requests in the future without incurring massive storage space.

Note that data points considered in this work contain two types of attributes, which are the constraint attribute and the optimization attributes. The constraint attribute sets a restriction on data points to be put in the same cluster. On the other hand, clusters with lower cost, which is calculated by the optimization attributes, are more desirable. Therefore, these two types of attributes should be handled separately in the statistics reserving phase. Accordingly, the constraint tree is designed to incorporate the information of the constraint attribute and the optimization attributes in a compact structure. This constraint tree consists of two parts, which are constraint levels and dense-record buckets, as shown in Fig. 2. The index keys in the constraint levels are values of the constraint attribute, whereas similar data points in the same leaf node are aggregated into one record in the dense-record bucket. When a newly arriving data point is inserted into the constraint tree, it first traverses the constraint levels according to the value of the constraint attribute, and then goes to the dense-record bucket. As such, among all the attributes, the constraint attribute is given a higher priority. Then, the optimization attributes are taken into consideration for data points within a smaller range of constraint values.

The purpose of constraint levels is to distribute data points according to the constraint attribute in a more balanced way. However, the limitation of restricted space in the data stream environment usually does not allow us to keep track of every possible value. Therefore, when more and more data points with different constraint values are received, the precise constraint values are not always inserted as new index key values of the constraint tree. Instead, some data points will be assigned to an appropriate leaf node directly without creating a new key. The inner nodes of constraint levels are represented in the following format:

<child0, key1, child1, key2, child2,..., keym, childm> ,

where $child_i$ is the pointer to a child node, and $key_i$ is a constraint value where $key_i < key_j$ if $i < j$. There are at most $m$ keys and at least $\lceil m/2 \rceil$ keys in each node. The child node linked by $child_i$ contains records with the constraint values in the range $[key_i, key_{i+1})$ for $1 \le i \le m-1$. The child node linked by $child_0$ contains records with the constraint values being smaller than $key_1$, and the child node linked by $child_m$ contains records with the constraint values being larger than or equal to $key_m$. The format of leaf nodes is the same as that of inner nodes. Rather than linking to a child node, a pointer in a leaf node links to a dense-record bucket instead. Similar to the insertion of the B+-tree, an incoming record traverses the constraint tree to find a suitable leaf node for insertion. If there is still empty space in the node, a new key is inserted. Otherwise, the node needs to be split and the median key is inserted into its parent node. If the parent node is also full, split it and repeat the procedure until no parent node needs to be split.
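The mapping from a constraint value to the child pointer that covers it can be sketched with a binary search over the sorted keys of a node. The function name `child_index` and the sample keys are illustrative assumptions, not taken from the paper.

```python
from bisect import bisect_right


def child_index(keys, value):
    """Return i such that child_i of a node with sorted `keys` covers `value`.

    With keys = [key_1, ..., key_m] (stored 0-based in the list):
      child_0 covers values smaller than key_1,
      child_i covers [key_i, key_{i+1}) for 1 <= i <= m - 1,
      child_m covers values larger than or equal to key_m.
    """
    return bisect_right(keys, value)


keys = [50, 90, 120, 173]      # illustrative node with m = 4 keys
print(child_index(keys, 31))   # 0: smaller than key_1
print(child_index(keys, 50))   # 1: equal to key_1, so in [key_1, key_2)
print(child_index(keys, 119))  # 2: in [key_2, key_3)
print(child_index(keys, 200))  # 4: larger than key_m
```

Repeating this lookup from the root down to a leaf gives the traversal described above; the split logic of the B+-tree is deliberately left out of the sketch.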

However, the constraint tree is not allowed to grow infinitely due to the limited space in the data stream environment. Therefore, a predefined parameter $H_{max}$, which can be specified by users according to the available space, is assigned to be the maximum height of constraint levels. Note that when the height of the constraint tree is smaller than $H_{max}$, not many data points have arrived yet. In this situation, we have enough space to build the constraint tree precisely without any approximation. Therefore, data points are inserted into the constraint tree without any restriction. However, when more and more data points have been inserted, not all of them can be maintained precisely. Therefore, the insertion operation will be adjusted to include more data records in a dense-record bucket without losing much information.

We can observe that except for the root node, which has at least two branches, every node contains at least $(\lceil m/2 \rceil + 1)$ branches. Therefore, the constraint tree is a highly balanced tree, where leaf nodes are all at the same level and each node is at least half full. In addition to the balanced structure of the constraint tree, we also desire that the distribution of the constraint values be highly balanced. This highly balanced distribution can be achieved more easily for static data sets, where all the data points are available for calculation. However, in the data stream environment, maintaining the balance is not easy because of the evolution of data points and the limitation of available space. Therefore, we propose a mechanism to construct a more balanced constraint tree for the evolving data stream. The following lemmas help us to build a more balanced constraint tree.

Lemma 1 The number of leaf nodes, denoted as $n_L$, in the constraint tree of height $H_{max}$ is in the range of

$$\left[\, 2 \times \left(\lceil m/2 \rceil + 1\right)^{H_{max}-2},\ (m+1)^{H_{max}-1} \,\right].$$

Proof. Except for the root node, which has at least two links, every node contains at least $(\lceil m/2 \rceil + 1)$ links. Therefore, the number of leaf nodes of the tree with height $H_{max}$ will not be smaller than $2 \times (\lceil m/2 \rceil + 1)^{H_{max}-2}$. On the other hand, the maximum number of keys in each node is $m$; at most $(m+1)$ links will be in a node. The maximum number of leaf nodes will be $(m+1)^{H_{max}-1}$.

Note that a link of a leaf node connects to a dense-record bucket. According to the above lemma, the number of dense-record buckets in the constraint tree is bounded as stated in the following lemma.

Lemma 2 The number of dense-record buckets, denoted as $n_B$, in the constraint tree of height $H_{max}$ is in the range of

$$\left[\, 2 \times \left(\lceil m/2 \rceil + 1\right)^{H_{max}-1},\ (m+1)^{H_{max}} \,\right].$$

Note that all the dense-record buckets will cover the whole range of the constraint attribute without overlap.

Proof. The minimum number of dense-record buckets in a leaf node is $(\lceil m/2 \rceil + 1)$. Therefore, the number of dense-record buckets of the tree with height $H_{max}$ will not be smaller than $2 \times (\lceil m/2 \rceil + 1)^{H_{max}-1}$. On the other hand, the maximum number of dense-record buckets in each leaf node is $(m+1)$; thus, the maximum number of dense-record buckets will be $(m+1)^{H_{max}}$.
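The bounds of Lemma 1 and Lemma 2 can be evaluated numerically with the following sketch. The function names are hypothetical, and the rounding of m/2 is taken as a ceiling, which is an assumption of this sketch; for even m, such as m = 4 in Fig. 2, floor and ceiling coincide.

```python
def min_branches(m):
    """Assumed minimum number of links of a non-root node: ceil(m / 2) + 1."""
    return (m + 1) // 2 + 1


def leaf_node_bounds(m, h_max):
    """(lower, upper) bounds on the number of leaf nodes, for h_max >= 2."""
    lower = 2 * min_branches(m) ** (h_max - 2)
    upper = (m + 1) ** (h_max - 1)
    return lower, upper


def bucket_bounds(m, h_max):
    """(lower, upper) bounds on the number of dense-record buckets, for h_max >= 2."""
    lower = 2 * min_branches(m) ** (h_max - 1)
    upper = (m + 1) ** h_max
    return lower, upper


# Setting of Fig. 2: at most m = 4 keys per node, maximum height 3.
print(leaf_node_bounds(4, 3))  # (6, 25)
print(bucket_bounds(4, 3))     # (18, 125)
```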

Recall that a dense-record bucket contains a number of data points whose constraint values are in the range specified by the keys in the leaf node of the constraint tree. In order to maintain more precise constraint values for data points, dense-record buckets with smaller ranges of constraint values will be preferred. However, as shown in Lemma 2, the number of dense-record buckets in a constraint tree is limited, and all the dense-record buckets will cover the whole range of the constraint attribute. Since we have no idea about the distribution of constraint values of incoming data points, a balanced distribution of constraint values into all the dense-record buckets will allow us to maintain data points more precisely. Furthermore, because the number of incoming data points is infinite, the range of the constraint attribute may vary as data points arrive, and we are not able to specify the ranges of the constraint attribute for all the dense-record buckets directly. Therefore, a guideline is introduced in this paper to adjust the ranges of the constraint attribute in dense-record buckets dynamically according to arriving data points. Since the range of constraint values in a dense-record bucket is affected by the number of dense-record buckets in the constraint tree, we have the following observation.

Observation: From Lemma 2, we can observe that the constraint tree of height $H_{max}$ contains $n_B$ dense-record buckets, where $2 \times (\lceil m/2 \rceil + 1)^{H_{max}-1} \le n_B \le (m+1)^{H_{max}}$. Therefore, if the whole range $R_{whole}$ of the constraint attribute is distributed to these dense-record buckets uniformly, each dense-record bucket is desired to cover the range between $\frac{R_{whole}}{(m+1)^{H_{max}}}$ and $\frac{R_{whole}}{2 \times (\lceil m/2 \rceil + 1)^{H_{max}-1}}$.

Because the number of branches in each node is between $(\lceil m/2 \rceil + 1)$ and $(m+1)$, we choose a middle value for approximating the range of the constraint values in a dense-record bucket. Accordingly, a guideline is provided as follows.

Lemma 3 The approximating range $R_{app}$ is assigned as

$$R_{app} = \frac{R_{whole}}{\left(\lceil \tfrac{3}{4} m \rceil + 1\right)^{H_{max}}}.$$

When a data point arrives at a leaf node, if the constraint value of the new data point, denoted as $key_{new}$, is larger than the largest key in the leaf node, denoted as $key_{max}$, insert the new key if $(key_{new} - key_{max}) > R_{app}$. If it is smaller than the smallest key in the leaf node, denoted as $key_{min}$, insert the new key if $(key_{min} - key_{new}) > R_{app}$. Otherwise, this data point is put into a dense-record bucket directly without creating a new key.
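The guideline of Lemma 3 can be sketched as two small functions. The names `approximating_range` and `needs_new_key` are hypothetical, the rounding of 3m/4 is assumed to be a ceiling, and the leaf keys in the usage lines are illustrative values whose largest key is 27.

```python
def approximating_range(r_whole, m, h_max):
    """R_app = R_whole / (ceil(3m/4) + 1) ** H_max (ceiling is an assumption)."""
    middle_branches = (3 * m + 3) // 4 + 1  # integer form of ceil(3m/4) + 1
    return r_whole / middle_branches ** h_max


def needs_new_key(key_new, leaf_keys, r_app):
    """Decide whether key_new becomes a new key of the leaf node.

    A key is created only when key_new lies outside the leaf's key range
    by more than r_app; otherwise the point goes to a dense-record bucket.
    """
    key_min, key_max = min(leaf_keys), max(leaf_keys)
    if key_new > key_max:
        return key_new - key_max > r_app
    if key_new < key_min:
        return key_min - key_new > r_app
    return False


r_app = approximating_range(200 - 8, m=4, h_max=3)
print(r_app)                                        # 3.0
print(needs_new_key(31, [17, 20, 25, 27], r_app))   # True: 31 - 27 = 4 > 3
print(needs_new_key(29, [17, 20, 25, 27], r_app))   # False: 29 - 27 = 2 <= 3
```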

The procedure of constructing the constraint tree is described as follows.

Procedure of Constructing the Constraint Tree

1. When a data point with the constraint value $key_{new}$ is inserted, update the minimum value and the maximum value of the constraint attribute if necessary.

2. Traverse the tree to an appropriate leaf node according to the constraint value. Then three cases are considered:

2.1 Case 1: If the key value $key_{new}$ already exists in the leaf node, insert this data point into a dense-record bucket. If this data point has high similarity with an existing dense-record in the bucket, merge the data point into that dense-record.

2.2 Case 2: If the height of the constraint tree is smaller than $H_{max}$, insert the constraint value of the data point as a new key. A dense-record bucket with this data point is also created. If this leaf node is full, split it into two nodes and insert the middle key into their parent node. This step of split and insertion will be repeated until no splitting of a parent node is required.

2.3 Case 3: If the height of the constraint tree is equal to $H_{max}$, two cases will be considered:

Case 3.1: If $(key_{new} - key_{max}) \le R_{app}$ and $(key_{min} - key_{new}) \le R_{app}$, this data point is inserted into a dense-record bucket directly.

Case 3.2: Otherwise, insert the new key, create the corresponding dense-record bucket, and split nodes as in Case 2 when needed. After this insertion, if the height of the constraint tree increases to $(H_{max}+1)$, which exceeds the maximum height allowed, the constraint tree needs to be adjusted. In order to keep the height of the constraint tree from exceeding $H_{max}$, the leaf nodes of the constraint tree will be regarded as dense-record buckets afterwards, and the parents of the original leaf nodes will become new leaf nodes. The dense-record buckets in an original leaf node are combined into a single dense-record bucket, and dense-records with high similarity are merged. Consequently, the height of the constraint tree is reduced to $H_{max}$ again.
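The case analysis of step 2 can be sketched as a decision function that only reports which case applies at the reached leaf node. The function name `classify_insertion` and its parameters are hypothetical; node splitting, bucket creation and the height adjustment of Case 3.2 are deliberately omitted.

```python
def classify_insertion(key_new, leaf_keys, tree_height, h_max, r_app):
    """Return the case of the construction procedure that applies.

    leaf_keys   -- keys of the leaf node reached by the traversal
    tree_height -- current height of the constraint tree
    h_max       -- maximum height allowed for the constraint levels
    r_app       -- approximating range for a dense-record bucket
    """
    if key_new in leaf_keys:
        return "case 1"    # key exists: insert into a dense-record bucket
    if tree_height < h_max:
        return "case 2"    # tree may still grow: always insert a new key
    key_min, key_max = min(leaf_keys), max(leaf_keys)
    if key_new - key_max <= r_app and key_min - key_new <= r_app:
        return "case 3.1"  # close enough: insert into a bucket directly
    return "case 3.2"      # far from existing keys: insert a new key


leaf = [17, 20, 25, 27]  # illustrative leaf keys
print(classify_insertion(25, leaf, 3, 3, 3.0))  # case 1
print(classify_insertion(31, leaf, 2, 3, 3.0))  # case 2
print(classify_insertion(29, leaf, 3, 3, 3.0))  # case 3.1
print(classify_insertion(31, leaf, 3, 3, 3.0))  # case 3.2
```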

[Figure: the constraint tree after the insertion, with its constraint levels above and the dense-record buckets below.]

Fig. 3. Illustration of inserting a data point with constraint value 31.

Example 2: Consider the constraint tree in Fig. 2 as an example and recall that $m = 4$ and $H_{max} = 3$. Assume that a data point with key 31 is inserted. Since the constraint tree already reaches the maximum height, Case 3 in the Procedure of Constructing the Constraint Tree will be followed. $R_{app}$ is calculated by

$$\frac{200 - 8}{\left(\lceil \tfrac{3}{4} \times 4 \rceil + 1\right)^{3}} = 3,$$

$key_{max}$ of the leaf node is 27, and we will have that $(31 - 27) = 4 > R_{app}$. Therefore, Case 3.2 will be followed next. After this insertion, several nodes are split and the height of the constraint tree becomes $(H_{max}+1)$, as shown in Fig. 3, where the split nodes are marked with red and bold key values. Therefore, the original leaf nodes of the constraint tree will be regarded as dense-record buckets afterwards, and the parents of the original leaf nodes will become new leaf nodes. Then, the height of the constraint tree is reduced to $H_{max}$ again.

It is noted that if the values of the constraint attribute do not have much correlation with time, the constraint tree will soon become stable, and most new data records will be inserted into the appropriate dense-record buckets. However, if the values of the constraint attribute depend on time, the whole range of the constraint attribute will grow larger and larger. In this situation, our approach is able to adjust the constraint tree dynamically according to the increasing range.

After introducing the properties of constraint levels and the construction of the constraint tree, we now describe the details of record buckets. A record bucket contains multiple records, and a dense-record, which is an aggregation of similar data points, can be regarded as a micro-cluster within a smaller range of the constraint attribute. A dense-record is represented in the following format,

$$[a_C.s,\ a_C.e],\ (N, LS, SS),$$

where $a_C.s$ and $a_C.e$ are the smallest and largest constraint attribute values of the data points in this dense-record, respectively, and the part $(N, LS, SS)$ is the same as the Clustering Feature defined in BIRCH [21]. Accordingly, $N$ is the number of data points in the dense-record, $LS$ is the linear sum of the $N$ data points, i.e., $\sum_{i=1}^{N} X_i$, and $SS$ is the square sum of the $N$ data points, i.e., $\sum_{i=1}^{N} X_i^2$. By the Clustering Feature, the distance $d(F_i, F_j)$ between two dense-records, where $F_i = [a_C.s_i, a_C.e_i], (N_i, LS_i, SS_i)$ and $F_j = [a_C.s_j, a_C.e_j], (N_j, LS_j, SS_j)$, can be calculated efficiently.

$$
\begin{aligned}
d(F_i, F_j) &= \left( \frac{\sum_{a=1}^{N_i} \sum_{b=N_i+1}^{N_i+N_j} (X_a - X_b)^2}{N_i N_j} \right)^{1/2} \\
&= \left( \frac{N_j \sum_{a=1}^{N_i} X_a^2 + N_i \sum_{b=N_i+1}^{N_i+N_j} X_b^2 - 2 \sum_{a=1}^{N_i} X_a \cdot \sum_{b=N_i+1}^{N_i+N_j} X_b}{N_i N_j} \right)^{1/2} \\
&= \left( \frac{N_j\, SS_i + N_i\, SS_j - 2\, LS_i \cdot LS_j}{N_i N_j} \right)^{1/2}
\end{aligned}
$$
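As a minimal sketch of why the Clustering Feature makes this distance cheap, the following Python computes d(F_i, F_j) from (N, LS, SS) alone and compares it with the brute-force average over all point pairs. The names `CF`, `make_cf`, `cf_distance` and `brute_force_distance` are hypothetical, and storing SS per dimension (summed when the distance is evaluated) is an assumption of this sketch.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    n: int                   # N: number of data points
    ls: Tuple[float, ...]    # LS: per-dimension linear sum
    ss: Tuple[float, ...]    # SS: per-dimension square sum (assumed layout)


def make_cf(points: Sequence[Sequence[float]]) -> CF:
    """Summarize a non-empty list of points as (N, LS, SS)."""
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return CF(len(points), ls, ss)


def cf_distance(fi: CF, fj: CF) -> float:
    """sqrt((N_j*SS_i + N_i*SS_j - 2*LS_i.LS_j) / (N_i*N_j)), using only the CFs."""
    dot = sum(a * b for a, b in zip(fi.ls, fj.ls))
    numerator = fj.n * sum(fi.ss) + fi.n * sum(fj.ss) - 2.0 * dot
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(numerator, 0.0) / (fi.n * fj.n))


def brute_force_distance(pi, pj) -> float:
    """Root of the mean squared Euclidean distance over all cross pairs."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p in pi for q in pj)
    return math.sqrt(total / (len(pi) * len(pj)))


points_i = [(20, 26), (22, 24)]
points_j = [(21, 25), (23, 22)]
print(cf_distance(make_cf(points_i), make_cf(points_j)))
print(brute_force_distance(points_i, points_j))
```

The CF-based version touches only three stored quantities per dense-record, so its cost does not depend on how many points each dense-record has absorbed.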

When two dense-records Fi and Fj are merged, the new dense-record will be

$$[a_C.s',\ a_C.e'],\ (N_i + N_j,\ LS_i + LS_j,\ SS_i + SS_j),$$

where $a_C.s' = \min(a_C.s_i, a_C.s_j)$ and $a_C.e' = \max(a_C.e_i, a_C.e_j)$.

Example 3: Assume that Fi contains two data points (3, 20, 26) and (5, 22, 24), and Fj contains two data points (6, 21, 25) and (7, 23, 22), where the first value of each data point is the value of the constraint attribute, and the others are values of the optimization attributes. Dense-record Fi is represented as

$$
\begin{aligned}
F_i &= [a_C.s_i,\ a_C.e_i],\ (N_i, LS_i, SS_i) \\
&= [3, 5],\ (2,\ (20+22,\ 26+24),\ (20^2+22^2,\ 26^2+24^2)) \\
&= [3, 5],\ (2,\ (42, 50),\ (884, 1252)).
\end{aligned}
$$

Similarly, dense-record Fj is represented as

$$
\begin{aligned}
F_j &= [6, 7],\ (2,\ (21+23,\ 25+22),\ (21^2+23^2,\ 25^2+22^2)) \\
&= [6, 7],\ (2,\ (44, 47),\ (970, 1109)).
\end{aligned}
$$

When dense-records Fi and Fj are merged, the new dense-record will be

$$
\begin{aligned}
&[a_C.s',\ a_C.e'],\ (N_i + N_j,\ LS_i + LS_j,\ SS_i + SS_j) \\
&= [3, 7],\ (2+2,\ (42+44,\ 50+47),\ (884+970,\ 1252+1109)) \\
&= [3, 7],\ (4,\ (86, 97),\ (1854, 2361)).
\end{aligned}
$$
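The numbers in Example 3 can be checked with a short sketch. The class name `DenseRecord` and the helpers `from_points` and `merge` are hypothetical names introduced here for illustration; the only logic used is the dense-record format and the merge rule stated above, with LS and SS kept per optimization attribute as in the example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class DenseRecord:
    start: float             # a_C.s: smallest constraint value
    end: float               # a_C.e: largest constraint value
    n: int                   # N
    ls: Tuple[float, ...]    # LS per optimization attribute
    ss: Tuple[float, ...]    # SS per optimization attribute


def from_points(points: Sequence[Sequence[float]]) -> DenseRecord:
    """Build a dense-record; each point is (constraint value, optimization values...)."""
    keys = [p[0] for p in points]
    opt = [p[1:] for p in points]
    dims = len(opt[0])
    ls = tuple(sum(v[d] for v in opt) for d in range(dims))
    ss = tuple(sum(v[d] ** 2 for v in opt) for d in range(dims))
    return DenseRecord(min(keys), max(keys), len(points), ls, ss)


def merge(fi: DenseRecord, fj: DenseRecord) -> DenseRecord:
    """Merge two dense-records: widen the constraint range and add the CF parts."""
    return DenseRecord(
        min(fi.start, fj.start),
        max(fi.end, fj.end),
        fi.n + fj.n,
        tuple(a + b for a, b in zip(fi.ls, fj.ls)),
        tuple(a + b for a, b in zip(fi.ss, fj.ss)),
    )


f_i = from_points([(3, 20, 26), (5, 22, 24)])
f_j = from_points([(6, 21, 25), (7, 23, 22)])
merged = merge(f_i, f_j)
print(merged)
```

Because every component is either a minimum, a maximum, or a sum, merging never needs the original data points, which is what lets the constraint tree discard them.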

Theorem 1: The time complexity of the statistics reserving phase is O(n), where n is the number of data points inserted.

Proof. The time complexity of reading n data points is O(n). When the height of the constraint tree is smaller than Hmax, a data point traverses $O(\log_m n)$ levels to reach the leaf node, and the maximum number of splits caused by this insertion is also $O(\log_m n)$. Therefore, the overall complexity is $O(n) + O(\log_m n) = O(n)$. On the other hand, when the height of the constraint tree is Hmax, a data point traverses Hmax levels to reach the leaf node, and the maximum number of splits caused by this insertion is also Hmax. However, if the height exceeds Hmax during the insertion, the leaf nodes and dense-record buckets must be processed with complexity $O((m+1)^{H_{max}})$. Therefore, the total time complexity of the insertion is $O(H_{max} + (m+1)^{H_{max}})$, which is independent of the number of data points n. Consequently, the overall time complexity of the statistics reserving phase is $O(n) + O(H_{max} + (m+1)^{H_{max}}) = O(n)$.

As will be shown in the performance studies, the time complexity of the statistics reserving phase, which is the time complexity of constructing the constraint tree, is linear in the number of data points. Therefore, framework CCDS is very efficient and suitable for the data stream environment.

3.2 Clustering Responding Phase

As in the clustering request described in Section 2, users want to inspect clusters with the cluster number being k and the constraint range being $R_{a_C}$. Note that the clustering algorithm is not applied to the original data set. Instead, the dense-records maintained in the leaf nodes of the constraint tree are retrieved for clustering. We apply the constrained clustering algorithm progressive CCL, proposed in [13], to generate clusters according to the clustering requests. The distance between clusters is defined as follows.

$$dist(c_i, c_j) = \max\{\, d(F_a, F_b) \mid F_a \in C_i,\ F_b \in C_j \,\}.$$

Here we take the time constraint attribute $a_{time}$ as an example, i.e., the constraint attribute $a_C$ represents $a_{time}$. For a cluster C, we define the start constraint value, denoted by C.ts, and the end constraint value, denoted by C.te, as the smallest and largest constraint attribute values of the data points in the cluster, respectively, i.e., $C.ts = \min\{a_C.s_a \mid F_a \in C\}$ and $C.te = \max\{a_C.e_a \mid F_a \in C\}$. Then, the constraint distance between two clusters $C_i$ and $C_j$ is determined as

$$d_{a_C}(c_i, c_j) = \max\{\, C_i.te - C_j.ts,\ C_j.te - C_i.ts \,\}.$$

The distance measurement of two clusters is modified for this constrained clustering problem. Instead of the original distance dist(., .), the distance measurement between two clusters is defined as follows:

$$
dist'(c_i, c_j) =
\begin{cases}
dist(c_i, c_j), & \text{if } d_{a_C}(c_i, c_j) \le R_{a_C}, \\
\infty, & \text{otherwise.}
\end{cases}
$$
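The three definitions above can be combined in a small sketch. Several things here are assumptions made for illustration: the `Record` type and all function names are hypothetical, the pairwise distance between dense-records is replaced by a plain Euclidean distance between assumed centers, and a pair of clusters that violates the constraint range is assumed to be assigned an infinite distance so that it is never merged.

```python
import math
from typing import Callable, List, NamedTuple, Tuple


class Record(NamedTuple):
    start: float                  # a_C.s of the dense-record
    end: float                    # a_C.e of the dense-record
    center: Tuple[float, ...]     # assumed stand-in for the optimization attributes


Cluster = List[Record]


def center_distance(fa: Record, fb: Record) -> float:
    """Stand-in for d(F_a, F_b); the paper derives it from the Clustering Feature."""
    return math.dist(fa.center, fb.center)


def constraint_distance(ci: Cluster, cj: Cluster) -> float:
    """max{C_i.te - C_j.ts, C_j.te - C_i.ts}: constraint span of the union."""
    ci_ts, ci_te = min(r.start for r in ci), max(r.end for r in ci)
    cj_ts, cj_te = min(r.start for r in cj), max(r.end for r in cj)
    return max(ci_te - cj_ts, cj_te - ci_ts)


def complete_link(ci: Cluster, cj: Cluster,
                  d: Callable[[Record, Record], float] = center_distance) -> float:
    """dist(c_i, c_j): largest pairwise distance between members."""
    return max(d(fa, fb) for fa in ci for fb in cj)


def constrained_distance(ci: Cluster, cj: Cluster, r_ac: float,
                         d: Callable[[Record, Record], float] = center_distance) -> float:
    """Cluster distance that forbids merges exceeding the constraint range r_ac."""
    if constraint_distance(ci, cj) <= r_ac:
        return complete_link(ci, cj, d)
    return math.inf  # assumed: violating pairs are never the closest pair


c1 = [Record(3, 5, (21.0, 25.0))]
c2 = [Record(6, 7, (22.0, 23.5))]
print(constrained_distance(c1, c2, r_ac=4))   # span is 7 - 3 = 4, allowed
print(constrained_distance(c1, c2, r_ac=3))   # span exceeds 3, not mergeable
```

Passing the pairwise distance as a parameter keeps the constraint check separate from whatever measure is used on the optimization attributes.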

The outline of algorithm progressive CCL is shown in Fig. 4. The basic idea of algorithm progressive CCL is that, by using a tighter constraint range in the early iterations of merging, we leave more room to seek better solutions in subsequent iterations. Along with the relaxation of the constraint range, the desired cluster number is also temporarily modified. The whole process starts with a small local constraint range and a large local desired cluster number. The procedure continuously relaxes the constraint range and reduces the number of desired clusters, and eventually reaches the actual constraint range and cluster number specified by the user. Explicitly, a parameter level is used to control the number of relaxation steps.


Fig. 4. Outline of algorithm progressive CCL
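The relaxation schedule used by progressive CCL can be sketched directly from the formulas in its outline, local_k = (n×(level−i) + k×i)/level and local_R = R × (k/local_k). The function name `relaxation_schedule` is hypothetical, and leaving local_k as a real number (no rounding) is an assumption of this sketch, since the outline does not state how fractional values are handled.

```python
from typing import List, Tuple


def relaxation_schedule(n: int, k: int, level: int, r_ac: float) -> List[Tuple[float, float]]:
    """Return (local_k, local_R) for i = 1..level.

    n     : number of items being clustered
    k     : cluster number requested by the user
    level : number of relaxation steps
    r_ac  : constraint range requested by the user
    """
    steps = []
    for i in range(1, level + 1):
        local_k = (n * (level - i) + k * i) / level   # shrinks from near n down to k
        local_r = r_ac * (k / local_k)                # grows up to the requested range
        steps.append((local_k, local_r))
    return steps


for local_k, local_r in relaxation_schedule(n=100, k=20, level=4, r_ac=4000.0):
    print(f"local_k={local_k:.1f}  local_R={local_r:.1f}")
```

In the last step local_k equals k and local_R equals the requested range, so the final run of CCL uses exactly the user's request.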

Theorem 2: The time complexity of the clustering responding phase is

[Equation: the complexity bound of algorithm progressive CCL with n replaced by n_R, stated in terms of n_R, r_c, k, level, and O(f).]

where $r_c$ is between 0 and 1, $n_R$ is the number of dense-records retrieved from the constraint tree, O(f) represents the lower-order terms, which can be ignored in the analysis of CCL, and level is the number of iterations of algorithm CCL performed in algorithm progressive CCL.

Proof. The time complexity of algorithm progressive CCL is

[Equation: the complexity bound of algorithm progressive CCL, stated in terms of n, r_c, k, level, and O(f).]

In the CCDS framework, only the dense-records retrieved are used for the clustering. Therefore, the number of data points n will be replaced by the number of dense-records nR.

Since only dense-records are passed to the clustering algorithm, framework CCDS is more efficient than algorithm progressive CCL. As will be shown in the performance studies, as more and more data points are inserted, the superiority of framework CCDS over algorithm progressive CCL becomes more significant because more similar data points have been merged into dense-records.

4 Performance Studies

To assess the performance of these algorithms, we have conducted a series of experiments. These experiments are performed on a computer with a 1.7 GHz Intel CPU and 1 GB of memory. To generate the synthetic data used in our experiments, we adopt a method similar to the one in [13] and [21], by which n data points for k clusters are generated. Without loss of generality, every generated data point contains two optimization attributes and one constraint attribute. Several such synthetic data sets are combined to simulate the phenomenon of clusters with a time constraint. Both optimization attributes are in the range [–100, 100], and the range of the constraint attribute is [0, 10000]. Four such data sets, each of which contains 5000 data points for 5 clusters, are combined into 20000 data points with 20 clusters, which may overlap with one another in the optimization and constraint attributes. All the data sets in our experiments are random samples of this data set. Therefore, these data sets have the same distribution as described above. In the performance studies, we compare our CCDS framework with algorithm progressive CCL [13]. Note that algorithm progressive CCL can only deal with static data sets; therefore, all data points are given to it in advance. On the other hand, to show the merit of framework CCDS in supporting streaming data sets, in which data points are collected continuously, we let each data point be received at an individual time stamp.

Outline of algorithm progressive CCL (Fig. 4):

Progressive CCL
// Input: an input data set, the number of clusters k, the relaxation level, and the constraint range R_ac
1. For i = 1 to level, do Step 2 and Step 3.
2. Calculate local_k and local_R_ac, where local_k = (n×(level−i) + k×i)/level and local_R_ac = R_ac × (k/local_k).
3. Run algorithm CCL based on local_k and local_R_ac.

Algorithm CCL
// Input: an input data set, the number of clusters k, and the constraint range R_ac
1. Initially, each data point forms a cluster by itself.
2. The algorithm repetitively merges the two closest clusters.

In Section 4.1, the CCDS framework is evaluated over different parameter values. In Section 4.2, we conduct a series of experiments with different clustering requests to compare the performance of the CCDS framework with that of algorithm progressive CCL. Finally, we examine the scalability of the CCDS framework in Section 4.3.

In the following experiments, we use the average squared error of all points as the evaluation function for clustering results, i.e.,

$$Cost(Cl) = sq(Cl) = \sum_{C_i \in Cl} \sum_{p \in C_i} \| p - C_i.center \|^2,$$

where Ci denotes a cluster of the clustering result Cl, and Ci.center is the center of cluster Ci.

4.1 Parameter Sensitivity of CCDS

In this section, two parameters of the CCDS framework, the maximum number of keys in each node (m) and the maximum height of the constraint tree (Hmax), are investigated with a fixed clustering request (k, $R_{a_C}$) = (20, 4000). The data set of 20000 data points is used in this experiment. The size of the noise set is the number of data points that cannot join a cluster without violating the constraint range. As shown in Fig. 5, when the number of keys m and the height of the constraint tree Hmax are too small, there are too few dense-record buckets. Therefore, the ranges of the constraint attribute covered by each dense-record bucket are too large or biased, and the dense-records maintained are not good representatives of similar data points within small constraint ranges. Consequently, the clustering quality is poor, and many data points are regarded as noise since they cannot be inserted into clusters subject to the constraint. However, when parameters m and Hmax increase to sufficiently large values, most of the data points can be maintained effectively, and the performance of framework CCDS becomes stable.

Fig. 5. Performance of CCDS for different parameter settings: (a) clustering cost, (b) size of noise set, (c) responding time to a clustering request, and (d) construction time of the constraint tree, each plotted against m for Hmax = 5 to 10.

Fig. 6. Comparison between CCDS and progressive CCL on (a) clustering cost, (b) size of noise set, (c) execution time of clustering, and (d) construction time of the constraint tree as the constraint range varies.

Fig. 7. Comparison between CCDS and progressive CCL on (a) clustering cost, (b) size of noise set, (c) execution time of clustering, and (d) construction time of the constraint tree as the number of clusters varies.

4.2 Clustering Result by CCDS

In this section, various clustering requests (k, $R_{a_C}$) are submitted to evaluate the performance of the CCDS framework. As mentioned earlier, algorithm progressive CCL is also executed on the same clustering requests for comparison. The same data set of 20000 data points is used. First, with the cluster number k fixed at 20, the performance under different constraint ranges $R_{a_C}$ is investigated. As shown in Fig. 6(a), when the constraint range is very small, more similar data points are assigned to different clusters because they exceed the constraint range. Therefore, the clustering costs of both algorithms are higher. Note that although framework CCDS applies the same clustering algorithm as algorithm progressive CCL, it assigns many more data points to appropriate clusters without violating the constraint range, because the constraint tree structure of framework CCDS gives higher priority to the constraint attribute, as shown in Fig. 6(b). On the other hand, when the constraint range is very large, say, close to the whole range of the constraint attribute, clustering results similar to those without a constraint are generated. Therefore, the clustering costs are lower. In this situation, the performance of algorithm CCL is slightly better than that of framework CCDS because the approximation of data points in the constraint tree sacrifices a little accuracy for the limited space. Except for the two extreme cases discussed above, the clustering quality of framework CCDS is much better than that of algorithm progressive CCL. These results show that framework CCDS is able to solve the problem of clustering with a pairwise numerical constraint gracefully, while requiring much shorter execution time for generating clusters, as shown in Fig. 6(c). It is also noted that the execution time for constructing the constraint tree is very short, as shown in Fig. 6(d). As shown in Fig. 7, similar results are obtained from the experiments that vary the number of clusters (k) with a fixed constraint range. Therefore, our CCDS framework is able to maintain the statistics of the data stream very efficiently, while generating high-quality clusters in much shorter clustering time.

4.3 On Scalability

To evaluate the scalability of the proposed framework, scale-up experiments on the number of data points are conducted. Recall that one data point is received at each time stamp for framework CCDS. Therefore, the value of the time stamp shown in Fig. 8 is also the number of data points accumulated at that time. As shown in Fig. 8(a) and Fig. 8(b), the properties of the constraint tree, such as giving the constraint attribute a higher priority and merging similar data points in dense-record buckets, allow the clustering responding phase of framework CCDS to produce clusters of higher quality than algorithm progressive CCL as the data set becomes larger in the data stream environment. As also illustrated by Fig. 8(c), clustering on the summarized statistics is very efficient for large data sets. This result conforms to Theorem 2, which states that the time required for responding to a clustering request depends on the number of dense-records rather than the number of data points. In addition, as shown in Fig. 8(d), as the number of data points increases from 5,000 to 20,000, the execution time of constructing the constraint tree remains very short and grows linearly. This experiment also validates the time complexity analyzed in Theorem 1, which states that the time required for the statistics reserving phase is O(n). Therefore, our CCDS framework is able to handle fast data streams very efficiently while producing excellent clustering results.

Fig. 8. Comparison between CCDS and progressive CCL on (a) clustering cost, (b) size of noise set, (c) execution time of clustering, and (d) construction time of the constraint tree as the number of data points varies.

5 Conclusions

In this paper, we devised a framework named CCDS to cluster the data stream under the pairwise numerical constraint. The proposed CCDS framework consists of two phases, namely the statistics reserving phase and the clustering responding phase. The statistics reserving phase provides an efficient algorithm to process and maintain the data points in a compact structure named the constraint tree. The statistics maintained are then retrieved in the clustering responding phase to generate clusters according to the clustering requests specified by users. Since the clustering requirement is known only when a user submits the request, the information maintained in the constraint tree is able to serve various clustering requests without processing the original data points again. As shown in the complexity analyses and validated by our empirical studies, the CCDS framework performs very efficiently in the data stream environment while producing clustering results of very high quality.

References

[1] M.-S. Chen, J. Han, and P. S. Yu, “Data mining: An overview from database perspective,” IEEE Trans. on Knowledge and Data Engineering, Vol. 5, No. 1, pp. 866–883, Dec. 1996.

[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” Proceedings of PODS, June 2002.

[4] M. R. Henzinger, P. Raghavan, and S. Rajagopalan, “Computing on data streams,” DIMACS, Vol. 50, pp. 107–118, 1999.

[5] G. S. Manku and R. Motwani, “Approximate frequency counts over streaming data,” Proceedings of VLDB, pp. 346–357, Aug. 2002.

[6] W.-G. Teng, M.-S. Chen, and P. S. Yu, “A regression-based temporal pattern mining scheme for data streams,” Proceedings of VLDB, Sep. 2003.

[7] C. C. Aggarwal, J. W. Han, J. Y. Wang, and P. S. Yu, “A framework for projected clustering of high dimensional data streams,” Proceedings of VLDB, pp. 852–863, 2004.

[8] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan, “Clustering data streams,” Proceedings of FOCS, pp. 359–366, Nov. 2000.

[9] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-data algorithms for high-quality clustering,” Proceedings of ICDE, 2002.

[10] C. C. Aggarwal, J. W. Han, J. Y. Wang, and P. S. Yu, “On demand classification of data streams,” Proceedings of ACM SIGKDD, pp. 503–508, 2004.

[11] G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams,” Proceedings of ACM SIGKDD, pp. 97–106, Aug. 2001.

[12] P. S. Bradley, K. P. Bennett, and A. Demiriz, Constrained K-Means Clustering, MSR-TR-2000-65, Microsoft Research, May 2000.

[13] B.-R. Dai, C.-R. Lin, and M.-S. Chen, “On the techniques for data clustering with numerical constraints,” Proceedings of SDM, 2003.

[14] D. Klein, S. D. Kamvar, and C. Manning, “From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering,” Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), Sydney, Australia, 2002.

[15] C.-R. Lin, K.-H. Liu, and M.-S. Chen, “Dual clustering: Integrating data clustering over optimization and constraint domains,” IEEE Trans. on Knowledge and Data Engineering, Vol. 17, No. 5, pp. 628–637, May 2005.

[16] A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng, “Constraint-based clustering in large databases,” Proceedings of 2001 International Conference on Database Theory, Jan. 2001.

[17] O. R. Zaïane, A. Foss, C.-H. Lee, and W. Wang, “On data clustering analysis: Scalability, constraints, and validation,” Proceedings of PAKDD, pp. 28–39, 2002.

[18] S. L. Huang, L. C. Wu, H. K. Laing, K. T. Pan, M. T. Ko, and J. T. Horng, “Pgtdb: a database providing growth temperatures of prokaryotes,” Bioinformatics, Vol. 20, No. 2, pp. 276–278, January 2004.

[19] M. Yeung and B. L. Yeo, “Time-constrained clustering for segmentation of video into story units,” Proceedings of the International Conference on Pattern Recognition, pp. 375–380, May 1996.

[20] B.-R. Dai, C.-R. Lin, and M.-S. Chen, “Constrained Data Clustering by Depth Control and Progressive Constraint Relaxation,” Very Large Data Base Journal (VLDBJ), Vol. 16, No. 2, pp. 201–217, April 2007.

[21] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large database,” Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 103–114, 1996.
