
Chapter 4 Dynamic Grid-Based Clustering

4.6 DGBC algorithm

Fig. 4.4 shows the overall DGBC algorithm. For a data stream, the online component continuously reads each new data point and searches the index for a grid that can accept it. If such a grid exists, the point is inserted into that grid. Otherwise, we create a new grid centered on the current point, insert the point into it, and update the index for later use. Every TG time steps, the algorithm removes the sparse grids, i.e., those whose density scores are too low. It then regulates the grids and the index and computes the new grid size based on the recent data distribution.

The offline component generates the clustering result for the user. When a user issues a request, the algorithm finds all high-density grids, i.e., those whose density score is higher than the threshold. The system then tries to merge neighboring high-density grids and assigns a cluster label to each merged group.


Input: data stream S

Output: (Cluster label, grid list)

Variable: time gap TG, grid ratio R, high threshold Dh, low threshold Dw

Initialize: current time tc = 0; an empty grid list Glist; index I; data feature vector Dv = 0; initial grid size B(0).

While data stream is active:

Figure 4.4: The pseudo code of DGBC
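The online and offline steps described in this section can be sketched roughly as follows. This is only an illustrative sketch, not the thesis implementation: the names `Grid`, `accepts`, `insert`, `prune`, `neighbors` and `cluster` are hypothetical, a linear scan stands in for the index, the acceptance and neighbor tests are assumed to be simple per-dimension distance checks, and density decay and grid-size recomputation are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass
class Grid:
    center: Point
    density: float = 0.0


def accepts(grid: Grid, point: Point, size: float) -> bool:
    # Assumption: a grid accepts a point lying within size/2 of its center
    # on every dimension.
    return all(abs(p - c) <= size / 2 for p, c in zip(point, grid.center))


def insert(grids: List[Grid], point: Point, size: float) -> Grid:
    # Online step: find a grid that accepts the point, else create one.
    # A linear scan is used here in place of the index.
    for g in grids:
        if accepts(g, point, size):
            g.density += 1.0
            return g
    g = Grid(center=point, density=1.0)
    grids.append(g)
    return g


def prune(grids: List[Grid], low: float) -> List[Grid]:
    # Periodic step (every TG time steps): drop sparse grids.
    return [g for g in grids if g.density >= low]


def neighbors(a: Grid, b: Grid, size: float) -> bool:
    # Assumption: two grids are neighbors if their centers are at most one
    # grid size apart on every dimension.
    return all(abs(x - y) <= size for x, y in zip(a.center, b.center))


def cluster(grids: List[Grid], size: float, high: float) -> List[Tuple[int, Grid]]:
    # Offline step: keep high-density grids and label connected groups.
    dense = [g for g in grids if g.density > high]
    labels = {}
    next_label = 0
    for i in range(len(dense)):
        if i in labels:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:
            a = stack.pop()
            for b in range(len(dense)):
                if b not in labels and neighbors(dense[a], dense[b], size):
                    labels[b] = next_label
                    stack.append(b)
        next_label += 1
    return [(labels[i], dense[i]) for i in range(len(dense))]
```

In use, `insert` would be called for each arriving point, `prune` once per time gap TG, and `cluster` only when a user requests a result.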


Chapter 5 Experimental Results

We compare our method with CLUStream, DUC-Stream, D-stream, AGD-stream and IGDLC. Two types of index are used, denoted DGBC(DI) and DGBC(R+). The CLUStream algorithm is distance-based, whereas the others are density-based. All algorithms were implemented in C++ and tested on an i7-2600k 3.4 GHz machine with 16 GB of memory running Microsoft Windows 7 Professional.

A comprehensive performance study was conducted on both synthetic and real-world datasets. To show the efficiency of DGBC, we perform three kinds of experiments. First, we compare the execution time and clustering quality of DGBC with those of other stream clustering algorithms on synthetic datasets. Second, we investigate the scalability of DGBC. Finally, we apply DGBC to real datasets to compare performance and discuss the effect of the initial setting.

5.1 Clustering Quality

Clustering quality is measured by clustering purity, a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class that is most frequent in it, and the accuracy of this assignment is measured as the number of correctly assigned members divided by the total number of members in the cluster. Eq. (25) gives the purity of one cluster, and Eq. (24) gives the total purity over all clusters.

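$$\text{Purity} = \sum_{j=1}^{k} \frac{|C_j|}{N}\,\text{Purity}(C_j) \tag{24}$$

$$\text{Purity}(C_j) = \frac{1}{|C_j|} \cdot \left(\text{number of majority items in } C_j\right) \tag{25}$$

where k is the number of clusters, C_j is the j-th cluster, and N is the total number of data recorded.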

For example, suppose we generate two clusters as in Fig. 5.1, each containing two different types of items. By Eq. (24) and Eq. (25), the purity of each cluster is

Purity (C1) = 1/4 * max (3, 1) = 3/4

Purity (C2) = 1/11 * max (3, 8) = 8/11, and

the total purity = 4/15 * Purity (C1) + 11/15 * Purity (C2) = 11/15.

We say that the purity of the clustering result is 11/15. Note that purity is bounded between 1/K and 1, where K is the number of item types. The worst case occurs only when every item type is equally frequent within each cluster.
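The worked example above can be checked with a few lines of code. The sketch below is illustrative only: the function names `cluster_purity` and `total_purity` and the class names "a" and "b" are hypothetical, and only the item counts (3 and 1 in C1, 3 and 8 in C2) come from the example. Exact fractions are used so the results match the hand calculation.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Sequence


def cluster_purity(labels: Sequence[str]) -> Fraction:
    # Eq. (25): share of the most frequent class inside one cluster.
    counts = Counter(labels)
    return Fraction(max(counts.values()), len(labels))


def total_purity(clusters: List[Sequence[str]]) -> Fraction:
    # Eq. (24): per-cluster purity weighted by cluster size |C_j| / N.
    n = sum(len(c) for c in clusters)
    return sum(Fraction(len(c), n) * cluster_purity(c) for c in clusters)


# Counts taken from the example; the class names are placeholders.
c1 = ["a"] * 3 + ["b"] * 1
c2 = ["a"] * 3 + ["b"] * 8

print(cluster_purity(c1), cluster_purity(c2), total_purity([c1, c2]))
# prints: 3/4 8/11 11/15
```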

Figure 5.1: Example of computing purity.

5.2 Data Generator

The synthetic datasets in the experiments are generated using the synthetic data generation program proposed by Vennam et al. [18]. The parameter settings of the data generator are shown in Table 1.


Table 1: Parameters of synthetic data generator

Parameter   Description
N           Number of points to be created in the dataset (an integer)
C           Number of clusters that should be present (an integer)
d           Number of dimensions the dataset should have (an integer)
u           The maximum data value for all the dimensions (a real number)
f           Flag: 1 if subspace clusters are to be computed, 0 otherwise

In all the following experiments, some parameters are fixed, i.e., |u| = 1000 and f = 0. Each dataset is converted into a data stream by taking the data input order as the streaming order. The data rate is 1000 data points per time unit; the high and low thresholds are 100 and 10, respectively, and the decay factor λ is 0.98. We send a request and compute the cluster purity every 10 time units.
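The conversion of a static dataset into a stream with this setup can be sketched as below. This is a minimal sketch under stated assumptions: the helper names `to_time_units` and `request_times` are hypothetical, time units are assumed to be numbered from 1, and a trailing partial batch is assumed to form its own time unit.

```python
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def to_time_units(records: Iterable[T], rate: int = 1000) -> Iterator[Tuple[int, List[T]]]:
    # Replay records in input order, grouping `rate` records per time unit.
    batch: List[T] = []
    t = 1
    for r in records:
        batch.append(r)
        if len(batch) == rate:
            yield t, batch
            batch = []
            t += 1
    if batch:
        # Assumption: leftover records form a final, shorter time unit.
        yield t, batch


def request_times(n_units: int, every: int = 10) -> List[int]:
    # Time units at which a clustering request is sent and purity computed.
    return [t for t in range(1, n_units + 1) if t % every == 0]
```

For a dataset of 70,000 points at 1000 points per time unit, this yields 70 time units and requests at time units 10, 20, ..., 70.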

5.2.1 Performance on Synthetic Datasets

The first experiment runs the seven algorithms on the dataset N70k–C15–d10k. Fig. 5.2 shows the quality result: our method is always better than the others. As clusters are created, deleted, and changed over time, the algorithm periodically adjusts the newly generated and existing grids according to the changes in the data property, so it is able to capture the evolution of the data stream and generate a high-quality clustering result.


Figure 5.2: Quality comparison (Synthetic dataset N70k–C15–d10k)

Fig.5.3 shows the result of execution time. CLUStream needs the most time because it needs a linear search on all micro-clusters for the insertion. IRGC also needs a search on all grids, so when the number of grids increases over the process, the search cost increases. At the start of the process, our method needs some time to set up the


Figure 5.3: Execution time comparison (Synthetic dataset N70k–C15–d10k)

5.2.2 Scalability Study

In this section, we study the scalability of the DGBC algorithm on synthetic datasets and compare the two types of indexing we proposed. The first series of datasets is generated by varying the dimensionality from 10 to 100 while fixing the stream size (100k, the lower lines, and 1000k, the upper lines) and the number of natural clusters (10). Fig. 5.4 shows that the execution time of our method is close to linear with respect to the number of dimensions.

An R+-tree-based index is largely insensitive to the number of dimensions, so it is well suited to high-dimensional data. However, maintaining and updating the tree structure is costly, especially when the data property changes or the data is noisy. In these cases, some old grids need to be deleted and many new grids are created as new cluster cores or noise, which leads to heavy overhead because the R+-tree index may need several node split operations.


When the number of dimensions increases, the dimensional interval-based index needs to keep more information for each dimension, but the search time and merge time increase slowly because the data becomes sparser in higher dimensions. In the usual case, we can find the target grid by checking only a few dimensions and do not need to check all of them.
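The early-exit behavior of a per-dimension interval search can be illustrated with the sketch below. It is not the index structure defined in the thesis: the class `IntervalIndex` and its methods are hypothetical, each grid is assumed to cover the interval [center - size/2, center + size/2] on every dimension, and the candidates on a dimension are filtered by a plain scan rather than a sorted interval structure. The search stops as soon as no candidate grid remains, which in sparse high-dimensional data often happens after only a few dimensions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Interval = Tuple[float, float]


class IntervalIndex:
    def __init__(self, n_dims: int) -> None:
        # per_dim[d] maps a grid id to its (low, high) interval on dimension d.
        self.per_dim: List[Dict[int, Interval]] = [dict() for _ in range(n_dims)]

    def add(self, grid_id: int, center: Sequence[float], size: float) -> None:
        half = size / 2
        for d, c in enumerate(center):
            self.per_dim[d][grid_id] = (c - half, c + half)

    def remove(self, grid_id: int) -> None:
        for intervals in self.per_dim:
            intervals.pop(grid_id, None)

    def search(self, point: Sequence[float]) -> Tuple[Optional[int], int]:
        # Returns (grid id or None, number of dimensions checked).
        candidates = None
        for d, x in enumerate(point):
            intervals = self.per_dim[d]
            pool = list(intervals) if candidates is None else candidates
            candidates = {g for g in pool
                          if intervals[g][0] <= x <= intervals[g][1]}
            if not candidates:
                # No grid can accept the point; skip the remaining dimensions.
                return None, d + 1
        if not candidates:
            return None, len(point)
        # Assumption: if several grids overlap, pick the smallest id.
        return min(candidates), len(point)
```

A miss reported after one or two dimensions means a new grid can be created without inspecting the remaining dimensions.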

Figure 5.4: Scalability test with different number of dimensions. (Synthetic datasets)

The other series of datasets is generated by varying the number of natural clusters from 2 to 50 while fixing the stream size (100k, the lower lines, and 1000k, the upper lines) and the number of dimensions (10). Fig. 5.5 shows that the execution time of our method is stable with respect to the number of clusters. When the number of clusters increases, we need to be more careful not to merge data belonging to different clusters. The two scalability tests show that our method is suitable for both high-dimensional data and dispersed data.


Figure 5.5: Scalability test with different number of clusters. (Synthetic datasets)

5.3 Network Intrusion Detection dataset

The KDD-CUP’99 Network Intrusion Detection dataset [27] consists of raw TCP connection records from a local area network. Each record in the dataset corresponds to either a normal connection or an attack. There are four attack types: DOS (denial of service), R2L (unauthorized access from a remote machine, e.g., guessing a password), U2R (unauthorized access to root), and PROBING (surveillance and other probing). As a result, the data contains five clusters including the normal class, and the attack types are further classified into 24 types. Most of the connections in this dataset are normal, but bursts of attacks occasionally occur. The cluster property evolves significantly over time. We use the type of connection as its cluster label. As in other experiments [4] [6], all 34 continuous attributes out of the 42 available attributes are used for clustering. The data rate is 1000 data points per time unit, and the decay factor λ is 0.98. We send a request and compute the cluster purity every 10 time units.

5.3.1 Performance on Network Intrusion Detection dataset

Fig. 5.6 shows the clustering quality. Our proposed method is usually better than the others. For all algorithms, the quality of clustering falls when the cluster property changes, but our proposed method can adapt itself to capture the new data property. Especially on a dataset that is noisy and evolves significantly during processing, such as the Network Intrusion dataset, our method performs much better at finding and generating meaningful results. Our method also works well on the Charitable Donation dataset (Section 5.4), even though it has less noise and a stable cluster property.

Figure 5.6: Quality comparison (Network Intrusion dataset)

Fig. 5.7 shows the execution time. CLUStream takes the most time because it needs a linear search over all micro-clusters for the insertion of every incoming data point. IRGC also needs a linear search over all grids, so as the number of grids increases during processing, the search cost increases accordingly. At the start of the process, some time is needed to build the index; after that, our proposed method works even more efficiently than on the synthetic dataset because the data order is usually not completely random as it is in the synthetic data. On the Charitable Donation dataset, the execution time is more stable because the dataset contains little noise and the cluster property is also stable.

Figure 5.7: Execution time comparisons (Network Intrusion dataset)

5.4 Charitable Donation dataset

Another real dataset is KDD-CUP’98 [28], a relatively stable real-life dataset. It contains 95412 records about people who made donations in response to mailing requests. We use clustering to group donors with similar donation behaviors.

In total, 56 out of 481 fields are used, and the dataset is converted into a data stream by taking the data input order as the streaming order. The data rate is 1000 data points per time unit; the high and low thresholds are 100 and 10, respectively, and the decay factor λ is 0.98. We send a request and compute the cluster purity every 10 time units.

5.4.1 Performance on Charitable Donation dataset

Fig. 5.8 shows the clustering quality on the Charitable Donation dataset: our method always has better clustering quality than the others. For all algorithms, the quality is high because the dataset has little noise and a stable cluster property, and our method works well under these conditions.

Figure 5.8: Quality comparison (Charitable Donation dataset)

Fig. 5.9 shows the execution time. CLUStream and IRGC still need the most time because they perform a linear search for every insertion. The execution time is more stable because the dataset contains little noise and the cluster property is also stable. Our algorithm benefits more than the others under these conditions because it can build a proper index in a short time and speed up the whole process without rebuilding.

Figure 5.9: Execution time comparisons (Charitable Donation dataset)

5.5 Parameter Analysis

Since there is no information about the data property at every timestamp, we may not always have an optimal initial setting. The parameter studied here is Ra, the ratio of the grid size assigned to the algorithm, which is defined in Def. 14. The value of R should be small enough to separate data that belong to different clusters and to detect most new incoming points that represent a new cluster or an outlier. At the same time, it should not be so small that it generates too many meaningless grids or outliers.

Fig. 5.10 and Fig. 5.11 show the clustering quality when different grid size settings are assumed at the start of the process, and Fig. 5.12 and Fig. 5.13 show the execution time under the same conditions.

Fig. 5.12 and Fig. 5.13 show that when we use a larger ratio factor for the grid size, the execution time is reduced because fewer grids are needed to summarize the data points.

On the other hand, as revealed in Fig. 5.10 and Fig. 5.11, a larger grid size means a larger boundary distance from the grid center, so more data is collected by each grid and grids are merged more easily in the merge stage; as a result, the purity of the clustering result decreases. Across all of these experiments, Ra = 5 gave the best balance between execution time and clustering quality. Therefore, the value of the factor R is set to 5 for all experiments in this thesis.

Figure 5.10: Cluster Quality with different initialization (Network Intrusion dataset)


Figure 5.11: Cluster Quality with different initialization (Charitable Donation dataset)

Figure 5.12: Execution time with different initialization (Network Intrusion dataset)


Figure 5.13: Execution time with different initialization (Charitable Donation dataset)

[Plot for Figure 5.13: axes Time unit and Time (sec); series Ra2, Ra3, Ra5, Ra10, Ra50, Ra100]

Chapter 6 Conclusion

We proposed a stream clustering algorithm based on dynamic grids with indexing. The algorithm maps the data into dynamic grids and generates clusters from a data summary structure. The grid structure can update its size to suitable boundaries and cope with changes in an evolving data stream. We tried two kinds of indexing to reduce the execution time: R+-tree-based and dimensional interval-based. The algorithm produces effective results, handles noise, and finds clusters of arbitrary shape. It is also less sensitive to initialization parameters because it automatically adapts to the data property. The experimental results show that our method is more effective and yields better clustering quality than the others, not only on an unstable data stream but also on a stable dataset.

Future work will focus on applying more indexing and adjusting methods and on providing more evolution analysis functionality based on DAG-Stream. For some very tight local clusters, the adjusting method sometimes fails because the clusters may be merged together. Another problem is that the ground truth is hard to obtain. So far there is no convincing theory of how to measure the quality of stream clustering, as it is affected by the threshold setting, the boundary setting, and the number of clusters. In the real world, correctly determining and presenting the needed information in a data stream remains a challenging problem.


Bibliography

[1] Amineh Amini, Teh Ying Wah, Mahmoud Reza Saybani, and Saeed Reza Aghabozorgi Sahaf Yazdi, "A study of density-grid based clustering algorithms on data streams", 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 3, pp. 1652-1656.

[2] Amineh Amini, Teh Ying Wah, “Density Micro-Clustering Algorithms on Data Streams: A Review", Proceedings of the International MultiConference of Engineers and Computer Scientists 2011, vol.1, pp. 1652–1656.

[3] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu, “A framework for clustering evolving data streams”, Proceedings of the 29th international conference on Very large data bases, Very Large Data Base Endowment, 2003, pp. 81-92.

[4] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu, "A framework for projected clustering of high dimensional data streams", Very Large Data Base Endowment '04, pp. 852-863.

[5] P. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, 1997, pp. 368–374.

[6] Feng Cao, Martin Ester, Weining Qian, and Aoying Zhou, “Density-based clustering over an evolving data stream with noise,” in SIAM Conference on Data Mining, 2006, pp. 328-339.

[7] Yixin Chen and Li Tu, “Density-based clustering for real-time stream data”, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 133-142.


[8] Wenxin Zhu, Jianpei Zhang, and Yue Yang, "Data Stream Clustering Algorithm Based On Active Grid Density", 2010 Fifth International Conference on Internet Computing for Science and Engineering (ICICSE), 2010, pp. 97-101.

[9] Kyungmin Cho, SungJae Jo, Hyukjae Jang, Su Myeon Kim, and Junehwa Song, "DCF: An Efficient Data Stream Clustering Framework for Streaming Applications", DEXA, 2006, pp. 114-122.

[10] Jing Gao, Jianzhong Li, Zhaogong Zhang and Pang-Ning Tan, “An incremental data stream clustering algorithm based on dense units detection”, Lecture Notes in Computer Science, vol. 3518, 2005.

[11] Antonin Guttman,"R-trees: a dynamic index structure for spatial searching", Proceedings of the 1984 ACM SIGMOD international conference on Management of data, 1984, pp.47-57

[12] GuiBin Hou, RuiXia Yao, JiaDong Ren, and ChangZhen Hu , “Irregular Grid-Based Clustering over High-Dimensional Data Streams”, 2010 First International Conference on Pervasive Computing, Signal Processing and Applications, 2010, pp.783-786.

[13] Chen Jia, ChengYu Tan, and Ai Yong,"A Grid and Density-Based Clustering Algorithm for Processing Data Stream”, Proceeding WGEC '08 Proceedings of the 2008 Second International Conference on Genetic and Evolutionary Computing, 2008, pp. 517-521

[14] Philipp Kranen, Ira Assent, Corinna Baldauf and Thomas Seidl, "The ClusTree indexing micro-clusters for anytime stream mining", Knowledge and Information Systems , 2011, vol.29, pp. 249-272


[15] Yinzhao Li, and Jiadong Ren, "Clustering algorithm based on optimal intervals division for high-dimension data streams", Computer Science & Education, 2009. ICCSE '09, pp. 783-787

[16] Lloyd., S. P. "Least squares quantization in PCM". IEEE Transactions on Information Theory 28 (2), 1982, pp. 129–137.

[17] Liadan O'Callaghan , Nina Mishra , Adam Meyerson , Sudipto Guha ,and Rajeev Motwani , "Streaming-Data Algorithms For High-Quality Clustering", ICDE Conference, 2002.

[18] Jhansi Rani Vennam and Soujanya Vadapalli. “Syndeca: A tool to generate synthetic datasets for evaluation of clustering algorithms”. In 11th International Conference on Management of Data (COMAD 2005), Goa, India, January 2005.

[19] Jiadong Ren, and Ruiqing Ma, “Density-based data streams clustering over sliding windows,” in Proceedings of the 6th International Conference on Fuzzy systems and Knowledge Discovery (FSKD). Piscataway, NJ, USA: IEEE Press, 2009, pp. 248-252.

[20] Jiadong Ren, Shiyuan Cao, and Changzhen Hu, "Density-based Data Streams Subspace Clustering over Weighted Sliding Windows", 2010 First ACIS International Symposium on Cryptography, and Network Security, Data Mining and Knowledge Discovery, E-Commerce and Its Applications, and Embedded Systems, 2010, pp. 49-54

[21] Jiadong Ren, Binlei Cai, and Changzhen Hu, “Clustering over data streams based on grid density and index tree”, Journal of Convergence Information Technology, vol. 6, 2011, pp. 83 -93.


[22] Carlos Ruiz, Ernestina Menasalvas and Myra Spiliopoulou, "C-DenStream: Using Domain Knowledge on a Data Stream", Proceeding DS '09 Proceedings of the 12th International Conference on Discovery Science, 2009, pp. 287-301.

[23] Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos, "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects", VLDB '87 Proceedings of the 13th International Conference on Very Large Data Bases, 1987, pp. 507-518.

[24] Li Tu and Yixin Chen, “Stream data clustering based on grid density and attraction,” ACM Transactions on Knowledge Discovery Data, vol. 3, no. 3, 2009, pp. 1-27.

[25] Tian Zhang , Raghu Ramakrishnan ,and Miron Livny , “BIRCH: an efficient data clustering method for very large databases”, in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, J. Widom, Ed. ACM Press, 1996, pp. 103-114.

[26] Aoying Zhou, Feng Cao, Weining Qian and Cheqing Jin, “Tracking clusters in evolving data streams over sliding windows”, Knowledge and Information Systems, vol. 15, 2008 , pp. 181-214.

[27] University of California, Irvine, Information and Computer Science, KDD Cup 1999 Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

[28] University of California, Irvine, Information and Computer Science, KDD Cup 1998 Data, http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html
