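In this thesis, we propose a locality-based similarity transformation method. By observing the local nearest-neighbor distributions of the two endpoints of each link, we re-adjust the link weight so that the similarity between data points is revised according to our proposed assumptions; this amounts to a preprocessing step on the data. We propose two ways of finding local neighbor sets: K-nearest neighbors (K-nn) and mutually inclusive nearest neighbors (MI-nn). Experimental results show that the MI-nn-based similarity transformation can find clusters of arbitrary shape and depends on its parameters far less than the K-nn-based one. We believe that similarity transformation based on local neighbor sets accentuates the boundaries between data, thereby improving overall accuracy.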
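The idea of re-weighting a link from the neighborhoods of its two endpoints can be illustrated with a minimal sketch. This section does not state the exact weighting rule, so the code below swaps in a plain shared-nearest-neighbor (Jaccard) weighting over K-nn sets as a stand-in; it is not the K-nn or MI-nn formula of this thesis. The function names `knn_sets` and `transform_similarity`, the Euclidean base distance, and the toy data are assumptions made only for illustration.

```python
import numpy as np


def knn_sets(dist, k):
    """Return, for each point, the index set of its k nearest neighbors (self excluded)."""
    n = dist.shape[0]
    sets = []
    for i in range(n):
        order = np.argsort(dist[i], kind="stable")
        order = order[order != i][:k]  # drop the point itself, keep the k closest
        sets.append(set(order.tolist()))
    return sets


def transform_similarity(points, k):
    """Re-weight every link (i, j) by the Jaccard overlap of the endpoints' neighborhoods.

    Assumption: this shared-neighbor weighting is a stand-in, not the thesis's rule.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Include each point in its own neighborhood so tight groups overlap fully.
    hoods = [s | {i} for i, s in enumerate(knn_sets(dist, k))]
    n = len(hoods)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = len(hoods[i] & hoods[j]) / len(hoods[i] | hoods[j])
    return sim


# Toy data (hypothetical): two well-separated groups of three points each.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
sim = transform_similarity(points, k=2)
```

On this toy input every within-group link receives weight 1 and every cross-group link receives weight 0, which shows how a neighborhood-based re-weighting can sharpen cluster boundaries before any clustering algorithm runs.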

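Unlike other similarity transformation methods, the method based on local neighbor sets reaches its goal without the help of any pairwise constraints. Viewed as unsupervised clustering and compared with other unsupervised clustering algorithms, it obtains nearly perfect results on all data sets with unusual distributions, significantly outperforming K-means, cluto [5], and DBSCAN [27]; on the real-world UCI data sets it also achieves relatively high accuracy in most cases. Viewed as semi-supervised clustering, the proposed method involves similarity transformation and is therefore closest to the three similarity-based semi-supervised clustering algorithms Xing [11], RCA [12], and LMNN [16]; in comparison, its accuracy is similar to theirs, and on some data sets it is even higher.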

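A similarity transformation does not necessarily capture the similarity relationships among the data completely. We therefore believe that combining the similarity transformation with semi-supervised clustering lets expert opinion cover what the transformation cannot, improving clustering performance.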

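Besides improving clustering accuracy, we can also reduce the dimensionality of the data during the similarity transformation. Dimensionality reduction not only speeds up the similarity transformation but also speeds up the clustering algorithm without affecting the clustering result.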
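This section does not say how the dimensionality is reduced. As one possible illustration, assuming the transformed similarities are available as a dissimilarity matrix, classical multidimensional scaling (the technique covered in [17]) can embed the data in fewer dimensions. The function name `classical_mds` and the toy data are hypothetical, and the sketch is not claimed to be the procedure used in this thesis.

```python
import numpy as np


def classical_mds(dissim, dims):
    """Embed n items in `dims` dimensions from an n x n dissimilarity matrix."""
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    gram = -0.5 * centering @ (dissim ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]   # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Toy data (hypothetical): four 3-D points that all lie on one line.
points = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0],
                   [2.0, 2.0, 2.0], [4.0, 4.0, 4.0]])
diff = points[:, None, :] - points[None, :, :]
dissim = np.sqrt((diff ** 2).sum(axis=-1))
coords = classical_mds(dissim, dims=1)
```

Because the toy points are collinear, the one-dimensional embedding preserves all pairwise distances, so a distance-based clustering algorithm would see the same structure at one third of the original dimensionality.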

Summarizing the experimental results above, the method proposed in this thesis has the following advantages:

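1. It improves the similarity function used by the clustering algorithm through similarity transformation.

2. It does not need many external parameters to guide the clustering.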

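3. It handles clusters of arbitrary shape effectively even when no pairwise constraints are available.

4. When the similarity transformation is combined with a semi-supervised clustering algorithm, expert opinion covers what the transformation cannot, further improving overall accuracy.

5. The similarity transformation allows a moderate reduction of data dimensionality. The reduction does not significantly affect the clustering result and lowers the amount of computation, which speeds up the clustering algorithm.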

References

[1] Han, J., Kamber, M., Pei, J., “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2011

[2] Grira, N., Crucianu, M., Boujemaa, N., “Unsupervised and Semi-supervised Clustering: a Brief Survey”, A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence, 2004

[3] MacQueen, J. B., “Some Methods for Classification and Analysis of Multivariate Observations”, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967

[4] Kaufman, L., Rousseeuw, P.J., “Finding Groups in Data: an Introduction to Cluster Analysis”, John Wiley & Sons, 2005

[5] Karypis, G., Han, E., Kumar, V., “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”, IEEE Computer: Special Issue on Data Analysis and Mining, pp. 68-75, 1999

[6] Ester, M., Kriegel, H., Sander, J., Xu, X., “A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, 1996

[7] Basu, S., Banerjee, A., Mooney, R., “Semi-supervised Clustering by Seeding”, Proceedings of the 19th International Conference on Machine Learning, pp. 19-26, 2002

[8] Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S., “Constrained K-means Clustering with Background Knowledge”, 18th International Conference on Machine Learning, pp. 577-584, 2001

[9] Demiriz, A., Bennett, K. P., Embrechts, M. J., “Semi-supervised Clustering Using Genetic Algorithms”, Artificial Neural Networks in Engineering, pp. 809-814, 1999

[10] Basu, S., Banerjee, A., Mooney, R., “Active Semi-Supervision for Pairwise Constrained Clustering”, Proceedings of the 4th SIAM International Conference on Data Mining (SDM-2004), 2004

[11] Xing, E. P., Ng, A. Y., Jordan, M. I., Russell, S., “Distance Metric Learning, with Application to Clustering with Side-information”, Neural Information Processing Systems, pp. 521-528, 2002

[12] Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D., “Learning Distance Functions using Equivalence Relations”, Proceedings of the 20th International Conference on Machine Learning, pp. 11-18, 2003

[13] Basu, S., Bilenko, M., Mooney, R., “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining Systems, pp. 42-49, 2003

[14] Basu, S., Bilenko, M., Mooney, R., “A Probabilistic Framework for Semi-supervised Clustering”, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59-68, 2004

[15] Klein, D., Kamvar, S., Manning, C., “From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering”, Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), pp. 307-314, 2002

[16] Weinberger, K., Blitzer, J., Saul, L., “Distance Metric Learning for Large Margin Nearest Neighbor Classification”, Neural Information Processing Systems, pp. 1473-1480, 2006

[17] Cox, T., Cox, M., “Multidimensional Scaling, 2nd Edition”, Chapman & Hall, 2001

[18] Kruskal, J.B., “Nonmetric Multidimensional Scaling: a Numerical Method”, Psychometrika, pp. 115-129, 1964

[19] Rand, W.M., “Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, pp. 846-850, 1971

[20] Hubert, L., Arabie, P., “Comparing Partitions”, Journal of Classification, pp. 193-218, 1985

[21] Cohen, J., “A Coefficient of Agreement for Nominal Scales”, Educational and Psychological Measurement, pp. 37-46, 1960

[22] Landis, J.R., Koch, G.G., “The Measurement of Observer Agreement for Categorical Data”, Biometrics, pp. 159-174, 1977

[23] Reilly, C., Wang, C., Rutherford, M., “A Rapid Method for the Comparison of Cluster Analysis”, Statistica Sinica, pp. 19-33, 2005

[24] Ihaka, R., Gentleman, R., “R: A Language for Data Analysis and Graphics”, Journal of Computational and Graphical Statistics, pp. 299-314, 1996

[25] Kleinberg, J., Tardos, E., “Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields”, Journal of the ACM, pp. 616-639, 2002

[26] Blake, C., Merz, C., “UCI Repository of Machine Learning Databases”, 1998

[27] Daszykowski, M., Walczak, B., Massart, D., “Looking for Natural Patterns in Data. Part 1: Density Based Approach”, Chemometrics and Intelligent Laboratory Systems, pp. 83-92, 2001