Conclusion - A Similarity Transformation Method Based on Local Neighbor Sets, Applied to Clustering Algorithms

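In this thesis, we propose a locality-based similarity transformation method. By examining the local nearest-neighbor distribution around the two endpoints of each link, the method readjusts the link weight so that the similarity between data points is modified according to our proposed assumptions; in effect, it preprocesses the data. We propose two ways of finding local neighbor sets: K nearest neighbors (K-nn) and mutually inclusive nearest neighbors (MI-nn). The experimental results show that the MI-nn-based similarity transformation can find clusters of arbitrary shape and depends far less on its parameter than the K-nn-based one. We believe that similarity transformation based on local neighbor sets accentuates the boundaries between data, thereby improving overall accuracy.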
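The idea of reweighting a link from the neighborhoods of its two endpoints can be illustrated with a minimal sketch. It assumes the K-nn variant and a hypothetical reweighting rule (stretching a distance by the Jaccard overlap of the two neighbor sets); the exact formula used in the thesis and the definition of MI-nn are not reproduced in this section, so the function names and the rule below are illustrative only.

```python
def knn_sets(dist, k):
    """For each point, return the set holding its k nearest neighbours and itself."""
    n = len(dist)
    sets = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        sets.append(set(order[:k]) | {i})
    return sets


def transform_distances(dist, k):
    """Stretch each distance according to how little the endpoint neighbourhoods overlap.

    Hypothetical rule (an assumption, not the thesis formula):
    new_d = d * (2 - jaccard), so fully shared neighbourhoods keep d
    and disjoint neighbourhoods double it.
    """
    n = len(dist)
    nbrs = knn_sets(dist, k)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            jaccard = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
            out[i][j] = out[j][i] = dist[i][j] * (2.0 - jaccard)
    return out


# Two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]
new_dist = transform_distances(dist, k=2)
```

In this toy data set, distances inside a group are left unchanged while distances across the gap are stretched, which is one way to picture how such a transformation could accentuate cluster boundaries before any clustering algorithm runs.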

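Unlike other similarity transformation methods, the method based on local neighbor sets reaches its goal without the help of any pairwise constraints. Viewed as unsupervised clustering and compared with other unsupervised clustering algorithms, it yields nearly perfect results on data sets with unusual distributions, significantly outperforming K-means, cluto[5], and DBSCAN[27]; on the real-world UCI data sets, it also achieves relatively high accuracy in most cases. Viewed instead as semi-supervised clustering, the proposed method involves similarity transformation and is therefore closest to Xing[11], RCA[12], and LMNN[16], three similarity-based semi-supervised clustering algorithms; in comparison, it attains similar accuracy, and even higher accuracy on some data sets.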

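A similarity transformation method may not fully capture the similarity relationships in the data. We therefore believe that combining similarity transformation with semi-supervised clustering allows expert opinion to cover what the transformation cannot, improving clustering performance.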

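Besides improving clustering accuracy, we can also reduce the data dimensionality during the similarity transformation. Reducing the dimensionality not only speeds up the similarity transformation but also speeds up the clustering algorithm without affecting the clustering result.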
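As a rough illustration of how a transformed distance matrix could be mapped to fewer dimensions, the sketch below uses classical multidimensional scaling. This is an assumption: the section does not name the reduction technique, although the reference list cites MDS [17][18]. The function name `classical_mds` is hypothetical, and plain power iteration stands in for a proper eigensolver.

```python
import math


def classical_mds(dist, dims=1, iters=200):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D is symmetric,
    # so column means equal row means).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic, normalised start vector for power iteration.
        v = [float(i + 1) for i in range(n)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue; clamp negatives to zero.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(sum(v[i] * bv[i] for i in range(n)), 0.0)
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(lam)
        # Deflate so the next pass finds the next component.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Four points on a line; a 1-D embedding should preserve their distances.
points = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
embedding = classical_mds(dist, dims=1)
```

A clustering algorithm would then run on the low-dimensional coordinates instead of the full matrix, which is where the speed-up described above would come from under this assumption.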

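Summarizing the experimental results above, the method proposed in this thesis has the following advantages:

1. It improves the similarity function used by the clustering algorithm through similarity transformation.

2. It does not need many external parameters to guide the clustering.

3. It effectively handles clusters of arbitrary shape even when no pairwise constraints are available.

4. When similarity transformation is combined with a semi-supervised clustering algorithm, expert opinion can cover what the transformation cannot, further improving overall accuracy.

5. The similarity transformation can moderately reduce the data dimensionality. The reduction does not significantly affect the clustering result, and it lowers the amount of computation, which speeds up the clustering algorithm.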

References

[1] Han, J., Kamber, M., Pei, J., “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2011

[2] Grira, N., Crucianu, M., Boujemaa, N., “Unsupervised and Semi-supervised Clustering: a Brief Survey”, A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence, 2004

[3] MacQueen, J. B., “Some Methods for Classification and Analysis of Multivariate Observations”, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967

[4] Kaufman, L., Rousseeuw, P.J., “Finding Groups in Data: an Introduction to Cluster Analysis”, John Wiley & Sons, 2005

[5] Karypis, G., Han, E., Kumar, V., “CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling”, IEEE Computer: Special Issue on Data Analysis and Mining, pp. 68-75, 1999

[6] Ester, M., Kriegel, H., Sander, J., Xu, X., “A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, 1996

[7] Basu, S., Banerjee, A., Mooney, R., “Semi-supervised Clustering by Seeding”, Proceedings of the 19th International Conference on Machine Learning, pp. 19-26, 2002

[8] Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S., “Constrained K-means Clustering with Background Knowledge”, 18th International Conference on Machine Learning, pp. 577-584, 2001

[9] Demiriz, A., Bennett, K. P., Embrechts, M. J., “Semi-supervised Clustering Using Genetic Algorithms”, Artificial Neural Networks in Engineering, pp. 809-814, 1999

[10] Basu, S., Banerjee, A., Mooney, R., “Active Semi-Supervision for Pairwise Constrained Clustering”, Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)

[11] Xing, P., Ng, Y., Jordan, M., Russell, S., “Distance Metric Learning, with Application to Clustering with Side-information”, Neural Information Processing Systems, pp. 521-528, 2002

[12] Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D., “Learning Distance Functions using Equivalence Relations”, Proceedings of the 20th International Conference on Machine Learning, pp. 11-18, 2003

[13] Basu, S., Bilenko, M., Mooney, R., “Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering”, Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining Systems, pp. 42-49, 2003

[14] Basu, S., Bilenko, M., Mooney, R., “A Probabilistic Framework for Semi-supervised Clustering”, Proceedings of Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59-68, 2004

[15] Klein, D., Kamvar, S., Manning, C., “From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering”, Proceedings of the Nineteenth International Conference on Machine Learning (ICML '02), pp. 307-314, 2002

[16] Weinberger, K., Blitzer, J., Saul, L., “Distance Metric Learning for Large Margin Nearest Neighbor Classification”, Neural Information Processing Systems, pp. 1473-1480, 2006

[17] Cox, T., Cox, M., “Multidimensional Scaling, 2nd Edition”, Chapman & Hall, 2001

[18] Kruskal, J.B., “Nonmetric Multidimensional Scaling: a Numerical Method”, Psychometrika, pp. 115-129, 1964

[19] Rand, W.M., “Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, pp. 846-850, 1971

[20] Hubert, L., Arabie, P., “Comparing Partitions”, Journal of Classification, pp. 193-218, 1985

[21] Cohen, J., “A Coefficient of Agreement for Nominal Scales”, Educational and Psychological Measurement, pp. 37-46, 1960

[22] Landis, J.R., Koch, G.G., “The Measurement of Observer Agreement for Categorical Data”, Biometrics, pp. 159-174, 1977

[23] Reilly, C., Wang, C., Rutherford, M., “A Rapid Method for The Comparison of Cluster Analysis”, Statistica Sinica, pp. 19-33, 2005

[24] Ihaka, R., Gentleman, R., “R: A Language for Data Analysis and Graphics”, Journal of Computational and Graphical Statistics, pp. 299-314, 1996

[25] Kleinberg, J., Tardos, E., “Approximation algorithms for classification problems with pairwise relationships: Metric Labeling And Markov Random Field”, Journal of the ACM, pp. 616-639, 2002

[26] Blake, C., Merz, C., “UCI repository of machine learning databases”, 1998

[27] Daszykowski, M., Walczak, B., Massart, D., “Looking for Natural Patterns in Data. Part 1: Density Based Approach”, Chemometrics and Intelligent Laboratory Systems, pp. 83-92, 2001
