
Chapter 3 Notation and Problem Definition

3.2 Problem Statement

Given a data stream S, the data records are mapped into the density grid structure G = {G1, G2, …, Gm}. Grid clusters are generated from the density grid structure. For every grid cluster, the output is a pair consisting of the cluster label i, where i ∈ [1, K], and the set of grids from G that belong to that cluster. The grid sets of any two distinct clusters i ≠ j are disjoint, and K is the number of clusters.

Chapter 4 Dynamic Grid-Based Clustering

In this chapter, we propose Dynamic Grid-Based Clustering (DGBC), a clustering algorithm based on a dynamic grid structure. Section 4.1 presents the framework of DGBC. Section 4.2 presents the two indexing schemes we use and how they work. The maintenance step of the online component is described in Section 4.3, and the cluster generation stage in Section 4.4. Section 4.5 discusses the decay factor and the related parameters, and Section 4.6 gives the overall algorithm.

4.1 Framework of DGBC

Figure 4.1: Illustration of DGBC

We first give an overview of the architecture of DGBC. DGBC is an algorithm based on the fading model. As in CluStream, our method has an online component and an offline component. The online component processes the incoming data and maintains the data summary, and the offline component generates the clustering result whenever a user sends a request, as shown in Fig. 4.1. The grid structure and the clusters are defined in Chapter 3, and more details are given in the following sections.

4.2 Indexing

Because we use a dynamic grid structure, grids are created and deleted during processing, so they cannot be addressed directly. To reduce the search cost, we propose two types of indexing: one based on the R+ tree and the other based on dimensional intervals.

4.2.1 R+ Tree-Based Indexing

In R+ tree-based indexing, a node records a grid and the number of data records that have been mapped into it. When a new grid is created, we insert it into the tree.

When a new data record arrives, we start from the root node and check whether any entry can accept it. If one does, we descend into it and repeat the process until either no entry can hold the record or the record is accepted by a grid at the leaf level. If the incoming record falls in an existing grid, we simply insert it into the grid and update the grid's feature vector.

Otherwise, no existing grid can hold the incoming record, which happens, for example, when the record belongs to a new cluster. We have to create a new grid for it because we do not know whether it is just an outlier or the start of a new cluster. A new grid is created with the current record as its center and is inserted into the R+ tree at the last node visited by the search. The newly inserted grid is at the leaf level and has a pointer to the last searched node. This means we do not need to update all the nodes on the path, which saves update time when building the index.

Figure 4.2: An example of the R+ tree-based index, m = 5. (a) The view of grids and the arriving data. (b) The index structure when the new data arrives. (c) The view of grids after the new data is inserted. (d) The index structure after the data is inserted.

Fig. 4.2 shows an example of searching for and inserting a new data record in a two-dimensional domain, where m, the node capacity, is 5. The cross is the incoming record, 1 to 5 are the existing grids, and A and B are the intermediate nodes. When the new record arrives, we first search from the root node, which has two entries, A and B. The record belongs to A because it falls within the boundary of A in every dimension. We then search the child nodes under A. No grid there can hold the record, so we create a new grid, named N, and insert it into the tree. Fig. 4.2 (c) and (d) show the result after the insertion.

R+ trees handle high-dimensional data well, but they sometimes need extra processing time because of the insertion order and outliers. A burst of noise may cause several split operations and increase the depth of the tree. Afterwards, even once the noise has been removed, the search path is longer than before, and finding data takes more time.

4.2.2 Dimensional Interval-Based Indexing

Another type of indexing we use is based on dimensional intervals. For each dimension, we keep an interval list that holds, in sorted order, the intervals of all grids on that dimension. When a new data record arrives, we choose a dimension and do a binary search on its interval list to find whether the record fits in the interval of any existing grid. If so, we discard the candidate grids that did not match in the previous step, choose another dimension, and repeat the process until either no candidate grids remain or all dimensions have been checked. If no existing grid can hold the record, we create a new grid, take the incoming record as the grid center, and insert the record. Then we update the interval list on every dimension. Whenever an interval contains no grid, the interval is deleted. This can only happen right after a grid is deleted, so the lists need this cleanup only when a grid is deleted.
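The lookup described above can be sketched in Python as follows. This is only a minimal illustration under simplifying assumptions that are not taken from the thesis: every grid has the same half-width per dimension, each per-dimension list stores one (center, grid id) entry per grid instead of shared intervals, and the names `IntervalIndex`, `find`, `insert` and `delete` are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class IntervalIndex:
    """Per-dimension sorted lists of grid centers (hypothetical sketch).

    Assumption: every grid spans [center - h, center + h] on a dimension,
    with the same half-width h for all grids on that dimension.
    """

    def __init__(self, half_widths):
        self.h = list(half_widths)
        self.lists = [[] for _ in self.h]  # sorted (center, grid_id) pairs
        self.centers = {}                  # grid_id -> center tuple
        self._next_id = 0

    def _candidates(self, dim, value):
        # A grid contains `value` on this dimension iff its center lies in
        # [value - h, value + h]; two binary searches find that range.
        lst = self.lists[dim]
        lo = bisect_left(lst, (value - self.h[dim], -1))
        hi = bisect_right(lst, (value + self.h[dim], float("inf")))
        return {gid for _, gid in lst[lo:hi]}

    def find(self, point):
        """Return the id of a grid containing `point`, or None."""
        remaining = None
        for dim, value in enumerate(point):
            found = self._candidates(dim, value)
            remaining = found if remaining is None else remaining & found
            if not remaining:      # intersection is empty: stop early
                return None
        return min(remaining)

    def insert(self, point):
        """Map `point` to an existing grid, or create one centered on it."""
        gid = self.find(point)
        if gid is None:
            gid = self._next_id
            self._next_id += 1
            self.centers[gid] = tuple(point)
            for dim, value in enumerate(point):
                insort(self.lists[dim], (value, gid))
        return gid

    def delete(self, gid):
        """Remove a grid and its entry from every dimension list."""
        center = self.centers.pop(gid)
        for dim, value in enumerate(center):
            self.lists[dim].remove((value, gid))
```

In this sketch the early exit in `find` corresponds to the case where the intersection of the per-dimension search results becomes empty, so the remaining dimensions never need to be searched.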

Figure 4.3: An example of dimensional interval-based indexing. (a) The view of grids and the arriving data. (b) The index structure when the new data arrives. (c) The view of grids after the new data is inserted. (d) The index structure after the data is inserted.

Fig. 4.3 shows the same example as Fig. 4.2, but with dimensional interval-based indexing. The cross is the incoming record, 1 to 5 are the existing grids, and List(x) and List(y) are the dimensional interval lists on the x and y axes. When the new record arrives, we first search the x-dimension list, where the matching grids are 1 and 2. We then search the y-dimension list and find no interval that contains the new record. Because the intersection of the search results is empty, no existing grid can accept the record. The algorithm then creates a new grid, named N, for the record and updates the grid lists in every dimension. Fig. 4.3 (c) and (d) show the result after the insertion.

When there are d dimensions, we need to keep d lists and may have to search all of them to find the target grid. In most cases, however, only a few candidate grids remain after searching several dimensions, so we can check them directly and obtain the result. This happens even more often with high-dimensional data: as the number of dimensions increases, the data usually become sparse except at the cluster cores. The target grid can then be found by checking only a few dimensions, so the index still works well for high-dimensional data.

4.3 Maintenance Step and Grid Resizing

In every time period, we maintain the grid structure. The maintenance step has three parts: maintaining the grids, updating the index if needed, and resizing the grids.

First, the sparse grids are removed. A grid with very low density is either noise or an outlier, or it was a high dense grid long ago and is no longer meaningful. Because we use a dynamic grid structure, removing such grids also controls the number of grids in use and reduces memory usage and time cost. When a grid is deleted in this way, its past information is lost, so we have to make sure that the grid is safe to delete. In a later section we show that a suitable time gap lets us do this effectively without losing too much of the data information, and that this operation is necessary to keep the data storage bounded.

If any grid is deleted in this way, the algorithm also updates the index. For each deleted grid, it checks and updates the corresponding index entry: the node and entry in the R+ tree, or the grid's dimensional interval on every dimension. If the deleted grid was the last object in its R+ tree node or interval, the corresponding index structure is deleted as well.

In dimensional interval-based indexing, this is done by simply checking, at the time of deletion, whether the interval no longer contains any object and should therefore be removed. In an R+ tree-based index, we first remove the grid node and then check whether any other objects remain under its parent node. If not, the parent node is also deleted, and the process is repeated until we reach the root node or a node that still has other objects. There is no need to reinsert objects from the deleted entry node, since the R+ tree does not require nodes to be at least half filled.

In a data stream, there is no prior information about the data distribution, so it is hard to find a parameter setting that is good at every time point. Because we use a dynamic grid structure, we can adjust the grid size according to the recent data distribution. We keep a global data feature vector to record this information.

Definition 13 (Data feature vector)

The data feature vector is denoted as $D_v = \{\bar{A}, \vec{S}, \vec{SS}\}$, where

$\bar{A}$ is the number of data in the recent time stamp,

$\vec{S}$ is a d-dimensional vector that stores the linear sum of the data in each dimension for the recent $\bar{A}$ data, and

$\vec{SS}$ is a d-dimensional vector that stores the square sum of the data in each dimension for the recent $\bar{A}$ data.

From the data feature vector, the distribution of the recent $\bar{A}$ data can be computed, and we can resize the grids with this information. Some methods resize grids by merging them or by resetting the distance boundary, but then the grid size or maximum boundary may grow continuously and produce a very large grid. A large grid may absorb more data than the others and lose its meaning and the clustering information. We therefore have to make sure that our method works well under this condition.

Definition 14 (Grid Resizing)

The algorithm assigns a new grid boundary after the maintenance step. Let $B_i(T+1)$ denote the new grid boundary on the $i^{th}$ dimension at time T+1, and $B_i(T)$ denote the boundary currently used for the $i^{th}$ dimension.

The new grid size on the $i^{th}$ dimension is computed by

$$B_i(T+1) = R_a \cdot St_i(T, T+1) \qquad (8)$$

where $R_a$ is a constant ratio factor for the grid size, and $St_i(T, T+1)$ is the root-mean-square deviation on the $i^{th}$ dimension of the recent $\bar{A}$ data that arrived between T and T+1.

From Eq. (8), the grid boundary at any time point is clearly bounded within [0, Ra · St]. Because we resize the boundary based on the data feature vector rather than on the information stored in the grids, the boundary can both expand and shrink over time. For noisy data, Eq. (8) can be modified by weighting St against the old boundary, which gives a boundary that changes smoothly between time stamps.
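As a rough illustration of the data feature vector and of Eq. (8), the following Python sketch accumulates the count, linear sum and square sum per dimension and derives the new boundary from the root-mean-square deviation. The class and function names are hypothetical, and the smoothed variant is only one possible reading of the weight factor: it assumes a convex combination with a weight `w`, which the text does not specify.

```python
import math


class DataFeatureVector:
    """Count, linear sum and square sum of the recent data (sketch)."""

    def __init__(self, dims):
        self.n = 0
        self.s = [0.0] * dims
        self.ss = [0.0] * dims

    def add(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.s[i] += v
            self.ss[i] += v * v

    def rms_deviation(self):
        """Per-dimension deviation: sqrt(SS/n - (S/n)^2)."""
        if self.n == 0:
            raise ValueError("no data recorded in this period")
        out = []
        for s, ss in zip(self.s, self.ss):
            mean = s / self.n
            var = max(ss / self.n - mean * mean, 0.0)  # guard rounding
            out.append(math.sqrt(var))
        return out


def resize(dv, ra, old_boundary=None, w=1.0):
    """New boundary per dimension, B_i = Ra * St_i.

    If `old_boundary` is given, blend with it using the assumed form
    w * Ra * St_i + (1 - w) * B_i(old); w = 1 reproduces Eq. (8).
    """
    target = [ra * st for st in dv.rms_deviation()]
    if old_boundary is None:
        return target
    return [w * t + (1.0 - w) * b for t, b in zip(target, old_boundary)]
```

Because only three aggregates per dimension are kept, the boundary can be recomputed at each maintenance step without storing the individual records.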

4.4 Cluster Generation

At any time, a user can send a request, and the offline component generates the clustering result from the online structure. We first find all high dense grids and then generate clusters by connecting them into grid groups as in Def. 11. As in DUC-Stream [10], we treat the grid structure as a graph: the vertices are the grids, and there is an edge between two grids if they are neighbors. A depth-first search is used to merge grids and generate grid clusters.
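The depth-first merging step can be sketched as follows. The neighbor test used here is a stand-in and not the definition from Def. 8 and Def. 9: it assumes that two grids are neighbors when their centers are at most two half-widths apart in every dimension. The function names and the dictionary layout (grid center mapped to density) are likewise hypothetical.

```python
def are_neighbors(a, b, half_width):
    # Assumed stand-in for the neighbor definition: the two boxes touch
    # or overlap in every dimension.
    return all(abs(x - y) <= 2 * half_width for x, y in zip(a, b))


def generate_clusters(grids, d_h, half_width):
    """Label connected groups of high dense grids.

    grids: dict mapping a grid center (tuple) to its current density.
    Returns a dict mapping each high dense grid center to a cluster label.
    """
    dense = [c for c, density in grids.items() if density >= d_h]
    labels = {}
    next_label = 1
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]                     # explicit stack: iterative DFS
        while stack:
            current = stack.pop()
            for other in dense:
                if other not in labels and are_neighbors(
                        current, other, half_width):
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels
```

This version compares every pair of high dense grids; restricting the inner loop to grids in the same index partition is where an index would save time.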

The indexing can also reduce the merging time when generating clusters. By Def. 8 and Def. 9, we would normally need to check all grids to find their neighbors and then decide whether to merge them. When the number of dimensions and the number of grids increase, the time cost also increases, not only because more objects need to be processed but also because there are more possible neighbors to check. With the index, we can partition all grids into unconnected sets. Only grids in the same set need to be checked, because grids in different sets cannot be neighbors.

In an R+ tree-based index, we check the nodes starting from the root, merge the connected nodes, and repeat the process on the child nodes they contain. In a dimensional interval-based index, we choose a dimension and split it into unconnected parts, then choose another dimension and repeat the process within each part. After the partition, we check and merge the neighboring grids within each set and generate the result: the cluster labels and, for each label, the information of the grids that carry it.

4.5 Decay Factor, Threshold, and Time gap

The algorithm maintains the grids and the index structure, and controls their size by removing the sparse grids. Some measures can help us make a suitable choice of parameters.

We denote the data rate of a stream by Rx, the high threshold by Dh, the low threshold by Ds, and the decay factor by λ. With these we can find a setting that handles the grid and index structure effectively.

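Lemma 2 (Grow Gap, TH)

The grow gap, denoted as TH, is the minimum time needed for a sparse grid to become a high dense grid.

$$TH = \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_s(1-\lambda)} \qquad (9)$$

Proof:

Let G be a sparse grid at time $t$; by Def. 6, $D(G, t) \le D_s$.

Suppose G becomes a high dense grid at time $t+\delta$. Let $C_1(x), C_2(x), \dots, C_\delta(x)$ denote the sets of data that are inserted into G at times $t+1$ to $t+\delta$, and let $C(x) = \bigcup_{q \in [1,\delta]} C_q(x)$. By Def. 5, $D(G, t+\delta) \ge D_h$, where

$$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t)\,\lambda^{\delta} = \sum_{q=1}^{\delta} \sum_{x \in C_q(x)} D(x, t+\delta) + D(G, t)\,\lambda^{\delta}$$

and $D(x, t+\delta)$ is the density coefficient of record x at time $t+\delta$.

We know that $D(G, t) \le D_s$, and since $|C_q(x)| \le R_x$ for every q,

$$D(G, t+\delta) \le \lambda^{0} R_x + \lambda^{1} R_x + \lambda^{2} R_x + \dots + \lambda^{\delta-1} R_x + D_s\,\lambda^{\delta} = R_x \frac{1-\lambda^{\delta}}{1-\lambda} + D_s\,\lambda^{\delta}.$$

For this upper bound to reach $D_h$, with $0 < \lambda < 1$ and $R_x > D_h(1-\lambda)$, we need

$$\lambda^{\delta} \le \frac{R_x - D_h(1-\lambda)}{R_x - D_s(1-\lambda)}, \qquad \delta \ge \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_s(1-\lambda)}. \qquad (10)$$

By Eq. (10), the minimum time for a sparse grid to become a high dense grid is

$$TH = \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_s(1-\lambda)} \qquad (11)$$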

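Lemma 3 (Decay Gap, TS)

The decay gap, denoted as TS, is the minimum time needed for a high dense grid to become a sparse grid.

$$TS = \log_\lambda \frac{D_s}{D_h} \qquad (12)$$

Proof:

Let G be a high dense grid at time $t$; by Def. 5, $D(G, t) \ge D_h$.

Suppose G becomes a sparse grid at time $t+\delta$, and let $C(x)$ denote the set of data that are inserted into G at times $t+1$ to $t+\delta$. For G to be a sparse grid at time $t+\delta$ we need

$$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t)\,\lambda^{\delta} \le D_s,$$

where $D(x, t+\delta)$ is the density coefficient of record x at time $t+\delta$. Since the contribution of $C(x)$ is non-negative, we have

$$D_h\,\lambda^{\delta} \le D(G, t)\,\lambda^{\delta} \le D(G, t+\delta) \le D_s.$$

Therefore, since $0 < \lambda < 1$,

$$\lambda^{\delta} \le \frac{D_s}{D_h}, \qquad \delta \ge \log_\lambda \frac{D_s}{D_h}. \qquad (13)$$

By Eq. (13), the minimum time for a high dense grid to become a sparse grid is

$$TS = \log_\lambda \frac{D_s}{D_h} \qquad (14)$$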

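Lemma 4 (Living Gap, TM)

The living gap, denoted as TM, is the minimum time needed for an empty grid to become an intermediate grid.

$$TM = \log_\lambda \left( \frac{D_s(\lambda - 1)}{R_x} + 1 \right) \qquad (15)$$

Proof:

An empty grid G is a grid with no data inserted before time $t$, so $D(G, t) = 0$.

Suppose G becomes an intermediate grid at time $t+\delta$. Let $C_1(x), C_2(x), \dots, C_\delta(x)$ denote the sets of data that are inserted into G at times $t+1$ to $t+\delta$, and let $C(x) = \bigcup_{q \in [1,\delta]} C_q(x)$. For G to be an intermediate grid at time $t+\delta$, by Def. 7 we need $D(G, t+\delta) \ge D_s$, where

$$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t)\,\lambda^{\delta} = \sum_{q=1}^{\delta} \sum_{x \in C_q(x)} D(x, t+\delta) + D(G, t)\,\lambda^{\delta}$$

and $D(x, t+\delta)$ is the density coefficient of record x at time $t+\delta$.

We know that $D(G, t) = 0$, and since $|C_q(x)| \le R_x$ for every q,

$$D(G, t+\delta) \le \lambda^{0} R_x + \lambda^{1} R_x + \lambda^{2} R_x + \dots + \lambda^{\delta-1} R_x = R_x \frac{1-\lambda^{\delta}}{1-\lambda}.$$

For this upper bound to reach $D_s$, with $0 < \lambda < 1$, we need

$$\lambda^{\delta} \le \frac{D_s(\lambda - 1)}{R_x} + 1, \qquad \delta \ge \log_\lambda \left( \frac{D_s(\lambda - 1)}{R_x} + 1 \right). \qquad (16)$$

By Eq. (16), the minimum time for an empty grid to become an intermediate grid is

$$TM = \log_\lambda \left( \frac{D_s(\lambda - 1)}{R_x} + 1 \right) \qquad (17)$$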

From the above results we can now decide the time gap, that is, the period at which the algorithm checks and maintains the structure.

The time gap should not be too small. If we maintain the structure every time a new data record arrives, the overhead is heavy and slows down the system. In addition, most of the data information will be removed, because sparse grids are removed in the maintenance stage: many grids will not have received enough data and will be deleted before they can reach the low threshold and become intermediate grids.

The time gap should not be too large either; otherwise the evolution of the clusters may be lost. The cluster features may change noticeably during a large time gap, and some clusters may appear and then disappear within it. The time gap therefore needs to be small enough to detect changes in the cluster features. For an intermediate grid, we want to catch the change when it becomes a high dense grid, or when it becomes a sparse grid and is removed by the algorithm. Moreover, with too large a time gap the system has to keep more grids and more indexing information, which increases the search time because both the index and the number of grids grow.

Definition 15 (Time Gap, TG)

The time gap, denoted as TG, defines how often we check and maintain the grids and the index structure.

$$TM < TG < \min(TS, TH) \qquad (18)$$
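To make the constraint in Eq. (18) concrete, the sketch below computes the three gaps for one illustrative parameter setting. It assumes the closed forms TH = log_λ((Rx − Dh(1−λ)) / (Rx − Ds(1−λ))), TS = log_λ(Ds / Dh) and TM = log_λ(1 − Ds(1−λ) / Rx), which follow from the geometric-series bound on the fading model used in the proofs. The function names and the example values of λ, Rx, Dh and Ds are hypothetical.

```python
import math


def log_base(lam, value):
    """Logarithm of `value` to base `lam` (0 < lam < 1)."""
    return math.log(value) / math.log(lam)


def grow_gap(lam, rx, d_h, d_s):
    # TH: sparse grid (density <= Ds) growing to Dh at the full rate Rx.
    # Requires Rx > Dh * (1 - lam), otherwise Dh is never reached.
    return log_base(lam, (rx - d_h * (1 - lam)) / (rx - d_s * (1 - lam)))


def decay_gap(lam, d_h, d_s):
    # TS: high dense grid (density >= Dh) decaying to Ds with no new data.
    return log_base(lam, d_s / d_h)


def living_gap(lam, rx, d_s):
    # TM: empty grid growing to Ds at the full rate Rx.
    return log_base(lam, 1 - d_s * (1 - lam) / rx)


def valid_time_gap(tg, lam, rx, d_h, d_s):
    """Check TM < TG < min(TS, TH), i.e. Eq. (18)."""
    upper = min(decay_gap(lam, d_h, d_s), grow_gap(lam, rx, d_h, d_s))
    return living_gap(lam, rx, d_s) < tg < upper
```

With λ = 0.9, Rx = 10, Dh = 50 and Ds = 20, the gaps come out to roughly TM ≈ 2.1, TH ≈ 4.5 and TS ≈ 8.7, so a time gap of 3 or 4 time steps would satisfy Eq. (18) under these assumptions.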

Another measure shows that our method works within bounded memory. Since the data stream is infinite, we need to limit the storage usage.

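Lemma 5 (The Maximum Density, MD)

The maximum density, denoted as MD, is the sum of the density coefficients of all data. It is bounded by

$$MD \le \frac{R_x}{1-\lambda} \qquad (19)$$

Proof:

Given a time T, $MD = \sum_{x \in C(x)} D(x, T)$ is the total density coefficient of the data that arrived during the time 0 to T, where $D(x, T)$ is the density coefficient of record x at time T. Let $C_0(x), C_1(x), \dots, C_T(x)$ denote the sets of data that arrived at times 0, 1, 2, …, T, with $|C_q(x)| \le R_x$ for every q, and let $C(x) = \bigcup_{q \in [0,T]} C_q(x)$. For any T, we have

$$\sum_{x \in C(x)} D(x, T) = \sum_{q=0}^{T} \sum_{x \in C_q(x)} D(x, T) \le \lambda^{T} R_x + \lambda^{T-1} R_x + \dots + \lambda^{1} R_x + \lambda^{0} R_x \le \sum_{k=0}^{\infty} \lambda^{k} R_x = R_x \cdot \frac{1}{1-\lambda}. \qquad (20)$$

Therefore, from Eq. (20),

$$MD = \sum_{x \in C(x)} D(x, T) \le \frac{R_x}{1-\lambda} \qquad (21)$$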

A living grid is a grid that needs to be kept after the maintenance step; it is either a high dense grid or an intermediate grid. Since the total density kept is bounded, the number of living grids is also bounded. At any time point, the total number of grids we need is in turn bounded by the number of living grids and the data rate.

Lemma 6 (Maximum Living Grid, MA)

The maximum living grid, denoted as MA, is the maximum total number of intermediate grids and high dense grids. MA is the number of grids that need to be kept after any maintenance step.

$$MA \le \frac{MD}{D_s} = \frac{R_x}{D_s(1-\lambda)} \qquad (22)$$

Proof:

From Eq. (19), we know that at any time point the total density coefficient of all data records is no more than MD. After the maintenance step, all grids whose density is lower than $D_s$ are removed, so every living grid has density at least $D_s$. Therefore, the maximum number of living grids is $MD / D_s$.
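The two bounds can be checked numerically with a short sketch. It assumes the geometric-series form MD ≤ Rx / (1 − λ) and MA ≤ MD / Ds, with each record contributing a density coefficient of 1 on arrival and decaying by λ per time step. The function names and example values are hypothetical.

```python
def max_density(rx, lam):
    """Upper bound on the total density: Rx / (1 - lam)."""
    return rx / (1.0 - lam)


def max_living_grids(rx, lam, d_s):
    """Upper bound on the number of grids with density >= Ds."""
    return max_density(rx, lam) / d_s


def total_density_after(steps, rx, lam):
    """Simulate the total density when Rx records arrive at every step.

    Each step decays the accumulated density by lam, then adds Rx new
    records with coefficient 1 each.
    """
    total = 0.0
    for _ in range(steps):
        total = total * lam + rx
    return total
```

For λ = 0.9, Rx = 10 and Ds = 20, the simulated total density approaches but does not exceed 100, which under these assumptions leaves room for at most 5 living grids.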

Definition 16 (Maximum Grid, MG)

The maximum number of grids in use, denoted as MG, is the maximum number of all the grids held by the algorithm at any time. MG is bounded in terms of the maximum number of living grids MA and the data rate Rx. (23)

The worst case occurs when a time point contains only noise, or when every incoming data record belongs to a different grid. This case is rare in real data streams, and most such grids will be deleted at the next maintenance step.

4.6 DGBC algorithm

Fig. 4.4 shows the overall algorithm of DGBC. For a data stream, the online component continuously reads new data records and searches the index for a grid that can accept each record. If one is found, we insert the record into that grid. Otherwise, we create a new grid that takes the current record as its center, insert the record into it, and update the index for later use. At every time gap TG, the algorithm removes the sparse grids, whose density is too low. It then regulates the grids and the index, and computes the new grid size based on the recent data distribution.
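The online loop can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of Fig. 4.4: it replaces the index with a linear scan, keeps the grid size fixed instead of resizing it, and assumes that each record adds a density of 1 on arrival and that density decays by λ per time step. The names `Grid` and `run_online` are hypothetical.

```python
class Grid:
    """A grid with a center, a faded density and its last update time."""

    def __init__(self, center, t):
        self.center = center
        self.density = 0.0
        self.last = t

    def decay_to(self, t, lam):
        # Fade the stored density from the last update time up to time t.
        self.density *= lam ** (t - self.last)
        self.last = t


def run_online(stream, half_width, lam, d_s, tg):
    """Process `stream`, a sequence of per-time-step batches of points.

    Returns the grids that survive the last maintenance step.
    """
    grids = []
    for t, batch in enumerate(stream, start=1):
        for x in batch:
            # Linear scan stands in for the index lookup.
            grid = next(
                (g for g in grids
                 if all(abs(a - c) <= half_width
                        for a, c in zip(x, g.center))),
                None)
            if grid is None:               # no grid accepts x: create one
                grid = Grid(tuple(x), t)
                grids.append(grid)
            grid.decay_to(t, lam)
            grid.density += 1.0
        if t % tg == 0:                    # maintenance every TG steps
            for g in grids:
                g.decay_to(t, lam)
            grids = [g for g in grids if g.density >= d_s]
    return grids
```

In this sketch a single noise point creates a grid that is removed at the next maintenance step, while a region that keeps receiving data retains a density close to its per-step arrival count divided by (1 − λ).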

The offline component generates the clustering result for the user. When a user sends a request, the algorithm finds all the high dense grids, that is, the grids whose density reaches the high threshold. The system then merges neighboring high dense grids and assigns a cluster label to each merged group.
