Chapter 2 Related Works
2.3 Indexing Structure
2.3.2 R+ tree
R+ trees [23] are a compromise variant of R-trees: they avoid overlapping internal nodes by inserting an object into multiple leaves when necessary. In an R+ tree, nodes are not guaranteed to be at least half filled, and the entries of any internal node do not overlap. Because the entries do not overlap, every spatial region is covered by at most one node, so a search follows a single path and visits fewer nodes than in an R-tree. Some extra storage is needed because of the multiple insertions. The R+ tree is more appropriate for processing data streams since it has a lower search cost and does not need reinsertion to keep the nodes half full.
Chapter 3
Notation and Problem Definition
In this chapter, we formally introduce necessary notation and formulate the problem. Section 3.1 describes the notations and the structure we use, and in Section 3.2 we define the problem statement.
3.1 Notation and Symbol definition
Definition 1 (Data Stream)
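A data stream S consists of an infinite sequence of multi-dimensional records, $S = \{X_1, X_2, \ldots, X_n, \ldots\}$. The data arrive at time stamps $\{t_1, t_2, \ldots, t_n, \ldots\}$, and each $X_i$ is a multidimensional record with d dimensions, denoted by $X_i = [x_{i1}, x_{i2}, x_{i3}, \ldots, x_{id}]$.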
Since we cannot store all the data of a stream in memory, the algorithm uses density grid structures to keep summary information about the data.
Definition 2 (Density Grid)
A density grid G for a set of d-dimensional points is denoted as G = {r, n, t}, where
r = $[r_1, r_2, \ldots, r_{2d}]$ is a vector of 2*d entries that stores, for each dimension, the upper boundary and lower boundary pair,
n is the number of data points maintained in G, and
t is the last update time, i.e. the last time a data record was inserted into G.
We map the incoming data into the grid structure to generate summary information.
When a data record X is inserted into a grid G, it contributes a density coefficient that decreases over time. The density coefficient and the density of a grid are defined as follows.
Definition 3 (Density Coefficient)
If a data record X arrives at time stamp $t_c$ and the current time is $t$, we write $T(X) = t_c$, and its density coefficient is defined by Eq. (1):
$s(X, t) = \lambda^{t - T(X)} = \lambda^{t - t_c}$ (1)
where λ ∈ (0, 1) is a constant named the decay factor.
Definition 4 (Density of a Grid)
For a grid G at a given time stamp $t$, let G(x) denote the set of data records that are mapped into G at or before time $t$. The density of G is the sum of the density coefficients of all data records that are mapped into it:
$D(G, t) = \sum_{X \in G(x)} s(X, t)$ (2)
The density of a grid changes over time, but we do not need to maintain it at every time unit. The density of a grid increases only when data are inserted into it, so we update it only when a data record is mapped into the grid. We keep the time of the last update for each grid so that the current density can be computed from the current time and the time of the last update.
Lemma 1 (Update Grid’s Density)
If a grid G receives a new data record at time $t_n$, and the last time it received a data record is $t_l$ with $t_n > t_l$, then the density of G at time $t_n$ can be updated with Eq. (3):
$D(G, t_n) = 1 + \lambda^{t_n - t_l} \cdot D(G, t_l)$. (3)
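Proof:
Let G'(x) denote the set of data records that are mapped into G before time $t_n$, and let $X_n$ be the new data record that arrives at time $t_n$. By Eq. (2),
$D(G, t_l) = \sum_{X \in G'(x)} s(X, t_l)$.
For every $X \in G'(x)$ we have
$s(X, t_n) = \lambda^{t_n - T(X)} = \lambda^{(t_n - t_l) + (t_l - T(X))} = \lambda^{t_n - t_l} \cdot \lambda^{t_l - T(X)} = \lambda^{t_n - t_l} \cdot s(X, t_l)$.
Therefore, we have:
$D(G, t_n) = \sum_{X \in G(x)} s(X, t_n)$
$= s(X_n, t_n) + \sum_{X \in G'(x)} s(X, t_n) = \lambda^{0} + \sum_{X \in G'(x)} \lambda^{t_n - t_l} \cdot s(X, t_l)$
$= 1 + \lambda^{t_n - t_l} \cdot D(G, t_l)$. (4)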
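The update rule of Lemma 1 can be illustrated with a minimal Python sketch. The class name `DensityGrid`, its field names, and the helper `direct_density` are hypothetical names chosen for illustration; the sketch only assumes that the density follows Eq. (2) and is updated lazily on insertion.

```python
from dataclasses import dataclass


@dataclass
class DensityGrid:
    """Hypothetical summary of one grid: density, record count, last update time."""
    density: float = 0.0      # density at the time of the last update
    count: int = 0            # n, the number of records mapped into the grid
    last_update: int = 0      # t, the time of the last insertion

    def insert(self, now: int, lam: float) -> None:
        # Lazy update: decay the stored density by lam^(now - last_update),
        # then add 1 for the record that has just arrived.
        self.density = 1.0 + (lam ** (now - self.last_update)) * self.density
        self.count += 1
        self.last_update = now


def direct_density(arrivals, now, lam):
    """Density computed directly as the sum of the density coefficients."""
    return sum(lam ** (now - t) for t in arrivals)
```

Feeding the same arrival times to `insert` and to `direct_density` should give the same value up to floating-point rounding, which is what the lemma states.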
We can now define three types of grids according to their density: high dense grids, sparse grids, and intermediate grids. Clusters are generated from the high dense grids, which form the cores of clusters because their density is high. The decay factor ensures that a high dense grid is significant at the current time point, because an outdated high dense grid decays over time and may become a sparse or intermediate grid.
Definition 5 (High Dense Grid)
A high dense grid is a grid whose density is at least the high threshold. If G is a high dense grid at time $t$, then
$D(G, t) \ge D_h$, (5)
where $D_h$ is the high threshold.
Definition 6 (Sparse Grid)
A sparse grid is a grid whose density is at most the low threshold. If G is a sparse grid at time $t$, then
$D(G, t) \le D_w$, (6)
where $D_w$ is the low threshold.
Definition 7 (Intermediate Grid)
An intermediate grid is a grid whose density is between the low threshold and the high threshold. If G is an intermediate grid at time $t$, then
$D_w \le D(G, t) \le D_h$. (7)
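Definitions 5 to 7 can be read as a three-way classification, sketched below. The function name `classify_grid` and the string labels are hypothetical. Since the inequalities in Eq. (5) to Eq. (7) share their boundary values, the sketch assumes that a density exactly equal to a threshold is assigned to the high dense or the sparse class; this tie-breaking is an assumption of the sketch.

```python
def classify_grid(density: float, d_high: float, d_low: float) -> str:
    """Classify a grid by its current density (assumes d_low < d_high)."""
    if density >= d_high:
        return "high"          # Eq. (5)
    if density <= d_low:
        return "sparse"        # Eq. (6)
    return "intermediate"      # strictly between the two thresholds
```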
In the clustering step, we connect neighboring high dense grids to generate grid clusters. Neighboring grids and grid clusters are defined as follows.
Definition 8 (Dimensional Overlapping)
For two grids, G1 and G2, at a dimension d’, their upper boundary and lower boundary pairs are ( , ), ( , ), and . G1 and G2 are dimensional overlapping in dimension d’ if and only if
and
Definition 9 (Dimensional Connecting)
For two grids, G1 and G2, at a dimension d’, their upper boundary and lower boundary pairs are ( , ), ( , ), and . G1 and G2 are dimensional connecting in dimension d’ if and only if
Definition 10 (Neighboring Grids)
Two grids, G1 and G2, are neighboring if and only if their boundaries are dimensional overlapping in at least d-1 dimensions, and the boundaries are dimensional connecting in at most 1 dimension, denoted as G1~ G2.
Definition 11 (Grid Group)
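A list of density grids Lg = (G1, G2, G3, …, Gm) is a grid group if for any two members Gi, Gj ∈ Lg there exists a sequence $G_{l_1}, G_{l_2}, \ldots, G_{l_m}$ connecting Gi and Gj such that $G_{l_1} \sim G_{l_2}, G_{l_2} \sim G_{l_3}, \ldots, G_{l_{m-1}} \sim G_{l_m}$.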
Definition 12 (Grid Cluster)
A set of density grids Cg is a grid cluster if it is a grid group and every member of the set is a high dense grid.
3.2 Problem Statement
Given a data stream $S = \{X_1, X_2, \ldots\}$, the data records are mapped into the density grid structure G = {G1, G2, …, Gm}. Grid clusters are generated from the density grid structure. For every grid cluster $C_i$, the output is in the form of $\{i, C_i\}$, where $i \in [1, K]$, $C_i \subseteq G$, and $C_i \cap C_j = \emptyset$ for $i \ne j$, where K is the number of clusters.
Chapter 4
Dynamic Grid-Based Clustering
In this chapter, we propose Dynamic Grid-Based Clustering (DGBC), a clustering algorithm based on a dynamic grid. Section 4.1 presents the framework of DGBC. Section 4.2 presents the indexing we use and how it works. The maintenance step in the online component is described in Section 4.3, and the cluster generation stage in Section 4.4. Section 4.5 discusses the decay factor and the related parameters, and Section 4.6 gives the overall algorithm.
4.1 Framework of DGBC
Figure 4.1: Illustration of DGBC
We first give an overview of the architecture of DGBC. DGBC is an algorithm with a fading model. As in CluStream, our method has online and offline components. The online component processes the incoming data and keeps the data summary, and the offline component generates the clustering result whenever a user sends a request, as shown in Fig. 4.1. The grid structure and the clusters are defined in Chapter 3, and more details are described in the following sections.
4.2 Indexing
Because we use a dynamic grid structure, grids are created and deleted during processing, so we cannot access them directly. To reduce the search cost, we propose two types of indexing: one based on the R+ tree, and the other based on dimensional intervals.
4.2.1 R+ Tree-Based Indexing
In R+ tree-based indexing, a node records a grid and the number of data records that have been mapped into it. When a new grid is created, we insert the data into the grid and the grid into the tree.
When a new data record arrives, we start from the root node and search for an entry that can accept it. If such an entry exists, we descend and repeat the process until no entry can hold the data or the data is accepted by a grid at the leaf level. If the incoming data falls in an existing grid, we simply insert the data into the grid and update the grid’s feature vector.
Otherwise, there is no existing grid for the incoming data, which may happen when the incoming data belongs to a new cluster. We have to create a new grid for it because we do not know whether it is just an outlier or the start of a new cluster. A new grid is then created, taking the current data as its center, and the grid is inserted into the R+ tree at the last node we searched. The newly inserted grid is at the leaf level and has a pointer to the last searched node. This means we do not need to update all the nodes on the path, which saves update time when building the index.
Figure 4.2: An example of the R+ tree-based index, m = 5. (a) The view of grids and the arrived data. (b) The index structure when the new data arrives. (c) The view of grids after the new data is inserted. (d) The index structure after the data is inserted.
Fig. 4.2 is an example of searching for and inserting a new data record in a two-dimensional domain, where the node capacity m is 5. The cross is the incoming data, 1 to 5 are the existing grids, and A and B are the intermediate nodes. When a new data record comes in, we first search from the root node, which has two entries, A and B. The data belongs to A because it falls within the boundary of A in every dimension. We then search the child nodes under A. There is no grid that the data falls into, so we create a new grid, named N, and insert it into the tree structure. Fig. 4.2 (c) and (d) show the result after the insertion.
R+ trees handle high dimensional data well, but they sometimes need extra processing time because of the arrival order of the data and outliers. A burst of noise may cause several split operations and increase the depth of the tree. After that, even when the noise has been cleaned up, the search path is longer than before and more time is needed to find the data.
4.2.2 Dimensional Interval-Based Indexing
Another type of indexing we use is based on dimensional intervals. For each dimension, we keep an interval list that holds the intervals of all grids on this dimension in sorted order. When a new data record arrives, we choose a dimension and do a binary search on its interval list to find out whether the data fits in the interval of any existing grid. If so, we choose another dimension, filter out the grids that did not match in the previous step, and repeat the process until all dimensions have been checked or no grids remain. If there is no existing grid that the data can be mapped into, we create a new grid, take the incoming data as the grid center, and insert the data. Then we update the interval list on every dimension. Whenever an interval contains no grid, the interval is deleted. This only happens right after a grid is deleted, so the lists need this maintenance only when a grid is deleted.
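The lookup described above can be sketched as follows. This is a simplified illustration and not the exact structure used in DGBC: the class `IntervalIndex` and its methods are hypothetical, intervals are assumed to be closed, grid identifiers are assumed to be integers, and after the binary search on the lower boundaries the remaining candidates are filtered linearly on their upper boundaries.

```python
import bisect
import math


class IntervalIndex:
    """Hypothetical per-dimension index of grid intervals, sorted by lower bound."""

    def __init__(self, dims: int):
        # One sorted list of (low, high, grid_id) tuples per dimension.
        self.lists = [[] for _ in range(dims)]

    def add_grid(self, grid_id: int, bounds) -> None:
        # bounds holds one (low, high) pair per dimension.
        for entries, (low, high) in zip(self.lists, bounds):
            bisect.insort(entries, (low, high, grid_id))

    def remove_grid(self, grid_id: int, bounds) -> None:
        for entries, (low, high) in zip(self.lists, bounds):
            entries.remove((low, high, grid_id))

    def find(self, point) -> set:
        """Return the ids of all grids whose intervals contain the point."""
        candidates = None
        for entries, x in zip(self.lists, point):
            # Binary search: every entry with low <= x lies before position `end`.
            end = bisect.bisect_right(entries, (x, math.inf))
            hits = {gid for _, high, gid in entries[:end] if x <= high}
            # Keep only the grids that also matched in the previous dimensions.
            candidates = hits if candidates is None else candidates & hits
            if not candidates:
                return set()   # no grid can accept the point; stop early
        return candidates
```

An empty result corresponds to the case in which a new grid has to be created for the incoming data.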
Figure 4.3: An example of dimensional interval-based indexing. (a) The view of grids and the arrived data. (b) The index structure when the new data arrives. (c) The view of grids after the new data is inserted. (d) The index structure after the data is inserted.
Fig. 4.3 shows the same example as Fig. 4.2, but with dimensional interval-based indexing. The cross is the incoming data, 1 to 5 are the existing grids, and List(x) and List(y) are the dimensional interval indexes on the x and y axes. When a new data record arrives, we first search the x-dimension list; the matching grids in the list are 1 and 2. Then we search the y-dimension list, where there is no interval for the new data. Because the intersection of the search results is empty, no existing grid can accept the data. The algorithm then creates a new grid, named N, for the data and updates the grid lists in every dimension. Fig. 4.3 (c) and (d) show the result after the data insertion.
With d dimensions, we need to keep d lists and may have to search all of them to find the target grid. In most cases, however, only a few grids remain after searching several dimensions, so we can check them directly to find the result. This happens more often with high dimensional data: as the dimension increases, the data usually become sparse except at the cluster cores. Since the target grid can be found by checking only a few dimensions, the index still works well for high dimensional data.
4.3 Maintenance Step and Grid Resizing
In every time period, we maintain the grid structure. The maintenance step has three parts: maintaining the grids, updating the index if needed, and resizing the grids.
First, the sparse grids are removed. A grid with very low density may be noise or an outlier, or it was a high dense grid long ago and is no longer meaningful. As we use a dynamic grid structure, this removal also controls the number of grids in use and reduces the memory usage and time cost. When a grid is deleted in this way, its past information is lost, so we have to make sure that the grid is safe to delete. In a later section we show that a suitable time gap can be chosen so that the removal is effective without losing too much of the data information, and that this operation is necessary to keep the data storage bounded.
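The removal of sparse grids can be sketched as follows. The sketch assumes that each grid stores its density as of its last update, so that its current density is obtained by decaying the stored value. The function names and the dictionary layout are hypothetical.

```python
def current_density(stored: float, last_update: int, now: int, lam: float) -> float:
    """Decay a stored density from its last update time to the current time."""
    return stored * lam ** (now - last_update)


def prune_sparse(grids: dict, now: int, lam: float, d_low: float):
    """Split grids into those kept and the ids removed as sparse.

    grids maps a grid id to (stored density, last update time).
    """
    kept, removed = {}, []
    for gid, (stored, last_update) in grids.items():
        if current_density(stored, last_update, now, lam) <= d_low:
            removed.append(gid)   # the index entries of these grids must be updated too
        else:
            kept[gid] = (stored, last_update)
    return kept, removed
```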
If any grid is deleted in this way, the algorithm also has to update the index. For each deleted grid, the algorithm checks and updates the index: the node and entry in the R+ tree, or the dimensional interval on every dimension. If the deleted grid is the last object in the R+ tree node or interval, the corresponding index structure is also deleted.
In dimensional interval-based indexing, this is done by checking whether the interval contains no object and removing it at the same time. In an R+ tree-based index, we first remove the grid node, and then check whether there are any other objects under its parent node. If not, the parent node is also deleted, and the process is repeated until we reach the root node or a node that still has other objects. There is no need to reinsert the objects of a deleted entry node, since the R+ tree does not require nodes to be half filled.
In a data stream, there is no prior information about the data properties, so it is hard to find a parameter setting that is good at every time point. Since we use a dynamic grid structure, we can adjust the grid size according to the recent data distribution. We keep a global data feature vector to record this information.
Definition 13 (Data feature vector)
The data feature vector is denoted as $D_v = \{\bar{A}, \vec{S}, \vec{SS}\}$, where $\bar{A}$ is the number of data records in the recent time stamp,
$\vec{S}$ is a d-dimensional vector that stores the linear sum of the data in each dimension for the recent $\bar{A}$ data, and
$\vec{SS}$ is a d-dimensional vector that stores the square sum of the data in each dimension for the recent $\bar{A}$ data.
From the data feature vector, the distribution of the recent $\bar{A}$ data can be computed, and we can resize the grids with this information. Some methods resize the grid by merging grids or resetting the distance boundary, but then the grid size or maximum boundary may grow continuously and produce a very large grid. A large grid may take in more data than the others, so it loses its meaning and the clustering information. We therefore have to make sure that our method works well under this condition.
Definition 14 (Grid Resizing)
The algorithm assigns new grid boundaries after the maintenance step. Let $B_i(T+1)$ denote the new grid boundary on the $i$-th dimension at time T+1, and $B_i(T)$ denote the boundary currently used for the $i$-th dimension.
The new grid size on the $i$-th dimension is computed by
$B_i(T+1) = R_a \cdot \mathit{St}_i(T, T+1)$ (8)
where $R_a$ is a constant ratio factor for the grid size, and
$\mathit{St}_i(T, T+1)$ is the root-mean-square deviation on the $i$-th dimension of the recent $\bar{A}$ data that arrived between T and T+1.
Eq. (8) shows that the grid boundary at any time point is bounded within [0, $R_a \cdot \mathit{St}$]. As we resize the boundary according to the data feature vector and not the information stored in the grids, the boundary can both expand and shrink over time. For noisy data, we can modify Eq. (8) by weighting $\mathit{St}$ against the old boundary, which gives a smoothly changing boundary between time stamps.
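The resizing can be sketched as follows, assuming that the root-mean-square deviation is computed per dimension from the count, linear sum, and square sum of the data feature vector. The class and function names are hypothetical. The weighted variant is only one possible reading of the smoothing mentioned above: the convex combination with the parameter `weight` is an assumption of this sketch.

```python
import math


class DataFeatureVector:
    """Hypothetical container for the count, linear sum, and square sum."""

    def __init__(self, dims: int):
        self.count = 0
        self.linear_sum = [0.0] * dims
        self.square_sum = [0.0] * dims

    def add(self, record) -> None:
        self.count += 1
        for i, x in enumerate(record):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x

    def rms_deviation(self):
        """Per-dimension root-mean-square deviation from the mean."""
        if self.count == 0:
            raise ValueError("no data recorded")
        result = []
        for s, ss in zip(self.linear_sum, self.square_sum):
            mean = s / self.count
            variance = max(ss / self.count - mean * mean, 0.0)  # guard against rounding
            result.append(math.sqrt(variance))
        return result


def new_boundaries(fv, ratio, old=None, weight=1.0):
    """Eq. (8) per dimension; optionally blended with the old boundaries (assumed form)."""
    fresh = [ratio * st for st in fv.rms_deviation()]
    if old is None:
        return fresh
    return [weight * f + (1.0 - weight) * o for f, o in zip(fresh, old)]
```

With `weight=1.0` or without old boundaries, the function reduces to Eq. (8).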
4.4 Cluster Generation
At any time, a user can send a request, and the offline component generates clustering results from the online structure. First we find all high dense grids, and then generate clusters by connecting them into grid groups as in Def. 11. As in DUC-Stream [10], we treat the grid structure as a graph. The vertices are the grids, and there is an edge between two grids if they are neighbors. A depth-first search is used to merge grids and generate grid clusters.
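The graph view of cluster generation can be sketched as a connected-component labeling with an iterative depth-first search. The function `grid_clusters` is a hypothetical name, and the neighbor test is passed in as a predicate `is_neighbor`, standing in for the neighboring relation of Def. 10. The sketch compares all pairs of high dense grids and does not use an index.

```python
def grid_clusters(high_dense_ids, is_neighbor):
    """Label each high dense grid with the number of its grid cluster.

    high_dense_ids: list of grid ids that are high dense.
    is_neighbor: function (id, id) -> bool implementing the neighboring test.
    """
    labels = {}
    next_label = 1
    for start in high_dense_ids:
        if start in labels:
            continue                      # already reached from an earlier grid
        labels[start] = next_label
        stack = [start]
        while stack:                      # iterative depth-first search
            grid = stack.pop()
            for other in high_dense_ids:
                if other not in labels and is_neighbor(grid, other):
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels
```

For a toy one-dimensional case in which grids are integers and two grids are neighbors when their ids differ by one, the grids 1, 2, 3 form one cluster and 7, 8 form another.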
The indexing can also reduce the merging time when generating clusters. By Def. 8 and Def. 9, we would normally need to check all grids to find their neighbors and then decide whether to merge them. When the number of dimensions and the number of grids increase, the time cost also increases, not only because more objects need to be processed but also because there are more possible neighbors to check. With the index, we can partition all grids into unconnected sets. Only grids in the same set need to be checked, because grids in different sets cannot be neighbors.
In an R+ tree-based index, we check the nodes from the root, merge the connected nodes, and repeat the process on the child nodes they contain. In a dimensional interval-based index, we choose a dimension and split it into unconnected parts, then choose another dimension and repeat the process in each part. After the partition, we check and merge the neighboring grids within each set and generate the result, which consists of the cluster labels and the information of the grids with the same label.
4.5 Decay Factor, Threshold, and Time gap
The algorithm maintains the grids and the index structure, and controls the number of grids and index entries by removing the sparse grids. Some measures can help to make a suitable choice of parameters.
We denote the data rate of a stream by $R_x$, the high threshold by $D_h$, the low threshold by $D_w$, and the decay factor by λ. With these we can find a setting that handles the grid and index structure effectively.
Lemma 2 (Grow Gap, TH)
The grow gap, denoted as TH, is the minimum time needed for a sparse grid to become a high dense grid.
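$T_H = \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_w(1-\lambda)}$ (9)
Proof:
Let G be a sparse grid at time $t$; by Def. 6, $D(G, t) \le D_w$.
Suppose G becomes a high dense grid at time $t+\delta$, so that by Def. 5, $D(G, t+\delta) \ge D_h$. Let $C_1(x), C_2(x), \ldots, C_\delta(x)$ denote the sets of data that are inserted into G at time steps 1 to $\delta$ after $t$, and let $C(x) = \bigcup_{q \in [1,\delta]} C_q(x)$. Then
$D(G, t+\delta) = \sum_{X \in C(x)} s(X, t+\delta) + D(G, t) \cdot \lambda^{\delta}$
$= \sum_{X \in C(x)} \lambda^{t+\delta-T(X)} + D(G, t) \cdot \lambda^{\delta}$
$= \sum_{C_q(x) \subseteq C(x)} \sum_{X \in C_q(x)} s(X, t+\delta) + D(G, t) \cdot \lambda^{\delta}$.
We know that $D(G, t) \le D_w$, and since $|C_1(x)|, |C_2(x)|, \ldots, |C_\delta(x)| \le R_x$,
$D(G, t+\delta) \le \lambda^{0} R_x + \lambda^{1} R_x + \lambda^{2} R_x + \ldots + \lambda^{\delta-1} R_x + D_w \lambda^{\delta} = R_x \frac{1-\lambda^{\delta}}{1-\lambda} + D_w \lambda^{\delta}$.
Combining this bound with $D(G, t+\delta) \ge D_h$, and assuming $R_x > D_h(1-\lambda)$ (otherwise the density can never reach $D_h$), we have
$\lambda^{\delta} \le \frac{R_x - D_h(1-\lambda)}{R_x - D_w(1-\lambda)}$, and since $0 < \lambda < 1$, $\delta \ge \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_w(1-\lambda)}$. (10)
By Eq. (10), the minimum time for a sparse grid to become a high dense grid is
$T_H = \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_w(1-\lambda)}$ (11)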
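As a numerical illustration of the grow gap, the sketch below evaluates $T_H = \log_\lambda \frac{R_x - D_h(1-\lambda)}{R_x - D_w(1-\lambda)}$ and compares it with a direct simulation in which a grid starts at the low threshold and receives $R_x$ records at every time step. The function names and the parameter values in the usage note are hypothetical, and the closed form is taken from the bound derived in the proof under the assumption $R_x > D_h(1-\lambda)$.

```python
import math


def grow_gap(lam: float, rate: float, d_high: float, d_low: float) -> float:
    """Closed-form grow gap T_H (a real number of time steps)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("decay factor must lie in (0, 1)")
    if rate <= d_high * (1.0 - lam):
        raise ValueError("the high threshold can never be reached at this rate")
    ratio = (rate - d_high * (1.0 - lam)) / (rate - d_low * (1.0 - lam))
    return math.log(ratio, lam)


def simulated_steps(lam: float, rate: float, d_high: float, d_low: float) -> int:
    """Count the steps until the density reaches d_high at the maximum rate."""
    density, steps = d_low, 0
    while density < d_high:
        density = density * lam + rate   # decay one step, then add `rate` new records
        steps += 1
    return steps
```

For example, with λ = 0.9, a rate of 5, a high threshold of 30, and a low threshold of 1, the closed form gives about 8.5, and the simulation needs 9 steps, the next integer above it.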
Lemma 3 (Decay Gap, TS)
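The decay gap, denoted as TS, is the minimum time needed for a high dense grid to become a sparse grid.
$T_S = \log_\lambda \frac{D_w}{D_h}$ (12)
Proof:
Let G be a high dense grid at time $t$; by Def. 5, $D(G, t) \ge D_h$.
Suppose G becomes a sparse grid at time $t+\delta$, and let C(x) denote the set of data that are inserted into G at time steps 1 to $\delta$ after $t$. For G to be a sparse grid at time $t+\delta$ we need
$D(G, t+\delta) = \sum_{X \in C(x)} s(X, t+\delta) + D(G, t) \cdot \lambda^{\delta} \le D_w$.
Since the contribution of C(x) is non-negative and $D(G, t) \ge D_h$, we have
$D_h \cdot \lambda^{\delta} \le D(G, t) \cdot \lambda^{\delta} \le D(G, t+\delta) \le D_w$.
Therefore
$\lambda^{\delta} \le \frac{D_w}{D_h}$, and since $0 < \lambda < 1$, $\delta \ge \log_\lambda \frac{D_w}{D_h}$. (13)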
By Eq(13), the minimum time for a high dense grid to become a sparse grid is