Chapter 4 Dynamic Grid-Based Clustering
4.5 Decay Factor, Threshold, and Time Gap
The algorithm maintains the grids and the index structure, and controls their number by removing sparse grids. The measures derived in this section help choose suitable parameter values.
Let Rx be the data rate of the stream (the maximum number of records that arrive per time unit), Dh the high threshold, Ds the low threshold, and λ the decay factor, with 0 < λ < 1. From these we can derive a setting under which the grids and the index structure are handled effectively.
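Lemma 2 (Grow Gap, TH)
The grow gap, denoted as TH, is the minimum time needed for a sparse grid to become a high dense grid.
$TH = \log_{\lambda} \frac{R_x - (1-\lambda) D_h}{R_x - (1-\lambda) D_s}$ (9)
Proof:
Let G be a sparse grid at time $t$; by Def. 6, $D(G, t) \le D_s$.
Suppose G becomes a high dense grid at time $t+\delta$. Let $C_{t+1}(x), C_{t+2}(x), \dots, C_{t+\delta}(x)$ denote the sets of data that are inserted into G at times $t+1$ to $t+\delta$, and let $C(x) = \bigcup_{q \in [t+1,\, t+\delta]} C_q(x)$. By Def. 5, $D(G, t+\delta) \ge D_h$, where
$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t) \cdot \lambda^{\delta}$
$= \sum_{C_q(x) \subseteq C(x)} \sum_{x \in C_q(x)} D(x, t+\delta) + D(G, t) \cdot \lambda^{\delta}$
Here $D(x, t+\delta)$ is the density coefficient at time $t+\delta$ of a record $x$; for a record inserted at time $q$ it equals $\lambda^{t+\delta-q}$. We know that $D(G, t) \le D_s$ and that $|C_q(x)| \le R_x$ for every $q$, so
$D(G, t+\delta) \le \lambda^{0} R_x + \lambda^{1} R_x + \lambda^{2} R_x + \dots + \lambda^{\delta-1} R_x + D_s \lambda^{\delta} = R_x \frac{1-\lambda^{\delta}}{1-\lambda} + D_s \lambda^{\delta}$
Combining this with $D(G, t+\delta) \ge D_h$, and assuming $R_x > (1-\lambda) D_h$ (otherwise the grid can never reach the high threshold), we obtain
$\lambda^{\delta} \le \frac{R_x - (1-\lambda) D_h}{R_x - (1-\lambda) D_s}$, and, because $\log_{\lambda}$ is decreasing for $0 < \lambda < 1$, $\delta \ge \log_{\lambda} \frac{R_x - (1-\lambda) D_h}{R_x - (1-\lambda) D_s}$. (10)
By Eq. (10), the minimum time for a sparse grid to become a high dense grid is
$TH = \log_{\lambda} \frac{R_x - (1-\lambda) D_h}{R_x - (1-\lambda) D_s}$ (11)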
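The grow gap can be checked numerically. The following is a minimal sketch, assuming the closed form TH = log_λ((Rx − (1−λ)Dh) / (Rx − (1−λ)Ds)) and a worst-case grid that starts exactly at Ds and receives Rx records per time unit; the function names and the parameter values in the usage note are illustrative and do not come from the original algorithm.

```python
import math


def grow_gap(rate, d_high, d_sparse, lam):
    """Closed-form grow gap under the assumed formula (hypothetical helper)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("decay factor must lie in (0, 1)")
    if rate <= (1.0 - lam) * d_high:
        raise ValueError("grid can never reach the high threshold at this rate")
    ratio = (rate - (1.0 - lam) * d_high) / (rate - (1.0 - lam) * d_sparse)
    return math.log(ratio) / math.log(lam)


def simulate_grow_steps(rate, d_high, d_sparse, lam):
    """Count time steps until a grid starting at d_sparse reaches d_high
    when it receives `rate` new records (weight 1 each) at every step."""
    density, steps = d_sparse, 0
    while density < d_high:
        density = density * lam + rate  # decay old density, then add new records
        steps += 1
    return steps
```

With Rx = 10, Dh = 50, Ds = 5 and λ = 0.9, the closed form evaluates to roughly 6.09, and the simulation needs 7 whole time steps, which is the closed-form value rounded up.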
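Lemma 3 (Decay Gap, TS)
The decay gap, denoted as TS, is the minimum time needed for a high dense grid to become a sparse grid.
$TS = \log_{\lambda} \frac{D_s}{D_h}$ (12)
Proof:
Let G be a high dense grid at time $t$; by Def. 5, $D(G, t) \ge D_h$.
Suppose G becomes a sparse grid at time $t+\delta$, and let $C(x)$ denote the set of data that are inserted into G at times $t+1$ to $t+\delta$. For G to be a sparse grid at time $t+\delta$ we need
$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t) \cdot \lambda^{\delta} \le D_s$
Since the contribution of $C(x)$ is non-negative, we have
$D_h \lambda^{\delta} \le D(G, t) \cdot \lambda^{\delta} \le D(G, t+\delta) \le D_s$
Therefore, because $\log_{\lambda}$ is decreasing for $0 < \lambda < 1$,
$\lambda^{\delta} \le \frac{D_s}{D_h}$, $\delta \ge \log_{\lambda} \frac{D_s}{D_h}$ (13)
By Eq. (13), the minimum time for a high dense grid to become a sparse grid is
$TS = \log_{\lambda} \frac{D_s}{D_h}$ (14)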
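The decay gap corresponds to the fastest possible decay: a grid that sits exactly at Dh and receives no further data. A minimal sketch of that case, assuming the closed form TS = log_λ(Ds / Dh); the function names are illustrative only.

```python
import math


def decay_gap(d_high, d_sparse, lam):
    """Closed-form decay gap under the assumed formula (hypothetical helper)."""
    return math.log(d_sparse / d_high) / math.log(lam)


def simulate_decay_steps(d_high, d_sparse, lam):
    """Count time steps until a grid starting at d_high, with no new data,
    decays to a density of at most d_sparse."""
    density, steps = d_high, 0
    while density > d_sparse:
        density *= lam  # pure decay, nothing is inserted
        steps += 1
    return steps
```

For Dh = 50, Ds = 5 and λ = 0.9 the closed form gives about 21.85, so the grid first counts as sparse after 22 whole time steps.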
Lemma 4 (Living Gap, TM)
The living gap, denoted as TM, is the minimum time needed for an empty grid to become an intermediate grid.
$TM = \log_{\lambda} \left( 1 - \frac{(1-\lambda) D_s}{R_x} \right)$ (15)
Proof:
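An empty grid G is a grid into which no data has been inserted before time $t$, so $D(G, t) = 0$.
Suppose G becomes an intermediate grid at time $t+\delta$. Let $C_{t+1}(x), C_{t+2}(x), \dots, C_{t+\delta}(x)$ denote the sets of data that are inserted into G at times $t+1$ to $t+\delta$, and let $C(x) = \bigcup_{q \in [t+1,\, t+\delta]} C_q(x)$. For G to be an intermediate grid at time $t+\delta$, by Def. 7 we need $D(G, t+\delta) \ge D_s$, where
$D(G, t+\delta) = \sum_{x \in C(x)} D(x, t+\delta) + D(G, t) \cdot \lambda^{\delta}$
$= \sum_{C_q(x) \subseteq C(x)} \sum_{x \in C_q(x)} D(x, t+\delta) + D(G, t) \cdot \lambda^{\delta}$
We know that $D(G, t) = 0$ and that $|C_q(x)| \le R_x$ for every $q$, so
$D(G, t+\delta) \le \lambda^{0} R_x + \lambda^{1} R_x + \lambda^{2} R_x + \dots + \lambda^{\delta-1} R_x = R_x \frac{1-\lambda^{\delta}}{1-\lambda}$
Combining this with $D(G, t+\delta) \ge D_s$, and using that $\log_{\lambda}$ is decreasing for $0 < \lambda < 1$, we obtain
$\lambda^{\delta} \le 1 - \frac{(1-\lambda) D_s}{R_x}$, $\delta \ge \log_{\lambda} \left( 1 - \frac{(1-\lambda) D_s}{R_x} \right)$. (16)
By Eq. (16), the minimum time for an empty grid to become an intermediate grid is
$TM = \log_{\lambda} \left( 1 - \frac{(1-\lambda) D_s}{R_x} \right)$ (17)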
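The living gap can be illustrated the same way, with a grid that starts empty and receives the full data rate at every step. This is a minimal sketch, assuming the closed form TM = log_λ(1 − (1−λ)Ds / Rx); the function names and the example values are illustrative assumptions.

```python
import math


def living_gap(rate, d_sparse, lam):
    """Closed-form living gap under the assumed formula (hypothetical helper)."""
    inner = 1.0 - (1.0 - lam) * d_sparse / rate
    if inner <= 0.0:
        raise ValueError("grid can never reach the low threshold at this rate")
    return math.log(inner) / math.log(lam)


def simulate_living_steps(rate, d_sparse, lam):
    """Count time steps until an empty grid reaches d_sparse when it
    receives `rate` new records (weight 1 each) at every step."""
    density, steps = 0.0, 0
    while density < d_sparse:
        density = density * lam + rate  # decay old density, then add new records
        steps += 1
    return steps
```

With a low data rate, Rx = 2, Ds = 5 and λ = 0.9, the closed form gives about 2.73 and the simulated grid becomes intermediate after 3 whole time steps.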
From the above results we can now decide the time gap, that is, the period at which the algorithm checks and maintains the structure.
The time gap should not be too small. If we maintain the structure every time a new record arrives, the overhead is heavy and the system slows down. Moreover, much of the data information is lost, because sparse grids are removed in the maintenance stage: many grids do not receive enough data and are deleted before they can reach the low threshold and become intermediate grids.
The time gap also should not be too large; otherwise the evolution of the clusters may be lost. Cluster features may change noticeably during a large time gap, and some clusters may appear and then disappear within it. The time gap therefore needs to be small enough to detect changes in cluster features. For an intermediate grid, we want to catch the change when it becomes a high dense grid, or when it becomes a sparse grid and is removed by the algorithm. In addition, if the time gap is too large, the system has to keep more grids and more indexing information, which increases the search time because both the index and the number of grids grow.
Definition 15 (Time Gap, TG)
The time gap, denoted as TG, defines how often we check and maintain the grids and index structure.
$TM < TG < \min(TS, TH)$ (18)
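The constraint on TG can be turned into a small parameter check. The sketch below assumes the closed forms TM = log_λ(1 − (1−λ)Ds / Rx), TS = log_λ(Ds / Dh) and TH = log_λ((Rx − (1−λ)Dh) / (Rx − (1−λ)Ds)); the function names are hypothetical, and picking the midpoint of the admissible interval is an arbitrary illustrative choice, not a rule from the text.

```python
import math


def log_base(value, base):
    """Logarithm of `value` to the given base."""
    return math.log(value) / math.log(base)


def time_gap_bounds(rate, d_high, d_sparse, lam):
    """Return (TM, min(TS, TH)), the open interval in which TG should lie."""
    living = log_base(1.0 - (1.0 - lam) * d_sparse / rate, lam)
    decay = log_base(d_sparse / d_high, lam)
    grow = log_base(
        (rate - (1.0 - lam) * d_high) / (rate - (1.0 - lam) * d_sparse), lam
    )
    return living, min(decay, grow)


def pick_time_gap(rate, d_high, d_sparse, lam):
    """Pick a TG strictly between the bounds (midpoint, an arbitrary choice)."""
    lower, upper = time_gap_bounds(rate, d_high, d_sparse, lam)
    if lower >= upper:
        raise ValueError("no admissible time gap for these parameters")
    return (lower + upper) / 2.0
```

For Rx = 10, Dh = 50, Ds = 5 and λ = 0.9 the interval is roughly (0.49, 6.09), so any maintenance period of one to six time units satisfies the constraint under these assumptions.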
Another measure shows that our method works within bounded memory. Since the data stream is unbounded, we need to limit the storage usage.
Lemma 5 (The Maximum Density, MD)
The maximum density, denoted as MD, is the sum of density coefficients from all data. It is bounded by
$MD \le \frac{R_x}{1-\lambda}$ (19)
Proof:
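Given a time $T$, $MD = \sum_{x \in C(x)} D(x, T)$ is the total density coefficient of the data that arrived from time 0 to $T$. Let $C_0(x), C_1(x), \dots, C_T(x)$ denote the sets of data that arrived at times $0, 1, 2, \dots, T$, with $|C_q(x)| \le R_x$ for every $q$, and let $C(x) = \bigcup_{q \in [0,\, T]} C_q(x)$. For any $T$, we have
$\sum_{x \in C(x)} D(x, T) = \sum_{C_q(x) \subseteq C(x)} \sum_{x \in C_q(x)} D(x, T)$
$\le \lambda^{T} R_x + \lambda^{T-1} R_x + \lambda^{T-2} R_x + \dots + \lambda^{0} R_x$
$= R_x \cdot \frac{1-\lambda^{T+1}}{1-\lambda} \le \frac{R_x}{1-\lambda}$ (20)
Therefore, from Eq. (20),
$MD = \sum_{x \in C(x)} D(x, T) \le \frac{R_x}{1-\lambda}$ (21)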
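The bound on the total density is the limit of a geometric series, which a short simulation makes visible. This is a minimal sketch, assuming the bound MD ≤ Rx / (1 − λ) and a stream that delivers exactly Rx records at every time step; the function names are illustrative.

```python
def max_density_bound(rate, lam):
    """Assumed upper bound on the total density coefficient, Rx / (1 - lam)."""
    return rate / (1.0 - lam)


def total_density_after(steps, rate, lam):
    """Total density coefficient after `steps` time steps in which the
    stream always delivers `rate` records (each with weight 1 on arrival)."""
    total = 0.0
    for _ in range(steps):
        total = total * lam + rate  # decay everything seen so far, add new records
    return total
```

With Rx = 10 and λ = 0.9 the bound is 100; the simulated total approaches that value from below as the number of steps grows and never exceeds it, apart from floating-point rounding.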
A living grid is a grid that must be kept after the maintenance step, that is, a high dense grid or an intermediate grid. Since the maximum total density is bounded, the number of living grids is also bounded. At any time point, the total number of grids is in turn bounded by the number of living grids and the data rate.
Lemma 6 (Maximum Living Grid, MA)
The maximum living grid, denoted as MA, is the maximum total number of intermediate grids and high dense grids, that is, the number of grids that must be kept after any maintenance step.
$MA \le \frac{MD}{D_s} \le \frac{R_x}{(1-\lambda) D_s}$ (22)
Proof:
From Eq. (19), we know that at any time point the total density coefficient of all data records is no more than MD. After the maintenance step, all grids whose density is lower than Ds are removed, so every living grid has density at least Ds. Therefore, the maximum number of living grids is $\frac{MD}{D_s} \le \frac{R_x}{(1-\lambda) D_s}$.
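The counting argument translates directly into a memory estimate. A minimal sketch, assuming the bound MA ≤ Rx / ((1 − λ) Ds) and rounding down because the number of grids is an integer; the function name and the example values are illustrative assumptions.

```python
import math


def max_living_grids(rate, d_sparse, lam):
    """Assumed upper bound on the number of living grids: the total density
    bound Rx / (1 - lam) divided by the minimum density Ds of a living grid."""
    if not 0.0 < lam < 1.0:
        raise ValueError("decay factor must lie in (0, 1)")
    return math.floor(rate / ((1.0 - lam) * d_sparse))
```

For example, with Rx = 10, Ds = 4 and λ = 0.5 the total density is at most 20, so at most 5 grids can hold a density of 4 or more at the same time.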
Definition 13 (Maximum Grid, MG)
The number of maximum grids in use, denoted as MG, is the maximum number of all the grids in the algorithm at any time.
MG ≤ 𝑅 (23)
The worst case arises when a time point contains only noise, or when every incoming record falls into a different grid. This case is rare in real data streams, and most of these grids are deleted at the next maintenance step.