Clustering with Local Density Peaks-Based Minimum Spanning Tree

(1)

Clustering with Local Density Peaks-Based Minimum Spanning Tree

Presenters: 闕中一, 吳海韜

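This paper applies minimum spanning trees (MST) to the clustering problem. During the epidemic-prevention period, we present it in writing instead of giving an oral talk. The paper is: Clustering with Local Density Peaks-Based Minimum Spanning Tree. IEEE Trans. Knowl. Data Eng. 33(2): 374-387 (2021)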

(2)

Outline

1. Why clustering?

2. How to generate clusters?

a. Center-Based

b. Density-Based

c. MST-Based

3. Local Density Peaks-Based Minimum Spanning Tree (LDP-MST)

4. Experiments

a. Synthetic Datasets

b. Real Datasets

c. Evaluation on Running Time

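Parts 1 and 2 give the background of clustering as an application problem. Part 3 presents the method proposed in the paper, and Part 4 shows the experimental results.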

(3)

Why clustering?

(4)

Why clustering?

https://github.com/bheemnitd/2D-KMeans-Clustering

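Clustering takes n sample points as input and partitions them into k groups according to some rule (an algorithm and its parameters). The number of groups k can be given in advance or discovered by the algorithm itself. Most of this report assumes that k is given.

In this example, k=3.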

(5)

Why clustering?

1. Unsupervised learning: labels are not needed

a. Especially useful when labels are expensive

2. Discover how many categories are in your sample

3. Anomaly detection

4. Find useful intuitions to help supervised learning

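Used well, clustering helps us understand the patterns hidden in the data. Two common methods are shown here, Kmeans and DBSCAN; sample points with the same color form one cluster. The figure above also shows that different algorithms can produce very different results. In this example, the DBSCAN result is probably closer to human intuition.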

(6)

Clustering algorithm matters

https://scikit-learn.org/stable/modules/clustering.html

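This example comes from the scikit-learn documentation. Comparing the KMeans and DBSCAN results in the red boxes shows that different algorithms can behave very differently on different datasets.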

(7)

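How to tell if it is good or bad?

These are the six synthetic datasets used in the authors' experiments. Intuitively, each has some obvious structure. The key question is then how to judge whether a clustering method is useful to us, and how to compare these algorithms quantitatively.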

(8)

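How to tell if it is good or bad?

Label helps

A straightforward idea is to take the rules that generated these clusters as ground truth and evaluate the clustering result the way a classification problem is evaluated in machine learning.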

(9)

How to tell if it is good or bad?

Metrics

1. Clustering Accuracy (ACC)

2. Normalized Mutual Information (NMI)

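Quantities that appear in the metric definitions:

number of samples in category i

number of samples in cluster j

number of samples in both category i and cluster j

Given ground truth, these are the two metrics the authors use. ACC can be thought of as the accuracy of a classification problem. For both metrics, higher is better.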
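As a rough illustration of how the two metrics can be computed from the three counts listed above, here is a minimal sketch. The function names are hypothetical, and two details are assumptions not stated in these notes: ACC matches clusters to categories with an optimal one-to-one assignment, and NMI is normalized by the geometric mean of the two entropies.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency(y_true, y_pred):
    """counts[i, j] = number of samples in category i and cluster j."""
    _, ci = np.unique(y_true, return_inverse=True)
    _, cj = np.unique(y_pred, return_inverse=True)
    counts = np.zeros((ci.max() + 1, cj.max() + 1), dtype=np.int64)
    np.add.at(counts, (ci, cj), 1)
    return counts


def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one matching of clusters to categories (assumed)."""
    counts = contingency(y_true, y_pred)
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / counts.sum()


def nmi(y_true, y_pred):
    """NMI with geometric-mean normalization (an assumption of this sketch)."""
    counts = contingency(y_true, y_pred).astype(float)
    n = counts.sum()
    n_i = counts.sum(axis=1)   # number of samples in category i
    n_j = counts.sum(axis=0)   # number of samples in cluster j
    nz = counts > 0            # skip empty cells to avoid log(0)
    expected = np.outer(n_i, n_j)
    mi = np.sum(counts[nz] / n * np.log(n * counts[nz] / expected[nz]))
    h_true = -np.sum(n_i / n * np.log(n_i / n))
    h_pred = -np.sum(n_j / n * np.log(n_j / n))
    denom = np.sqrt(h_true * h_pred)
    return mi / denom if denom > 0 else 0.0
```

Both functions ignore the actual label values, so a clustering that recovers the true groups under different cluster ids still scores 1.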

(10)

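How to tell if it is good or bad?

Running Time / Time Complexity

We also care about an algorithm's time complexity and its actual running time. The two tables here are only a preview; they are discussed in detail later.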

(11)

How to generate clusters?

Next, a brief introduction to existing clustering methods.

(12)

How to generate clusters?

1. Center-Based

a. Kmeans

b. DP

c. DCore

2. Density-Based

a. DBSCAN

3. MST-Based

a. LOF-MST

b. LDP-MST

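These are the six algorithms covered by the experiments, grouped into the three categories proposed by the authors. We will introduce Kmeans, DBSCAN, and a simplified MST-based method.

(Preview figure) The six algorithms and six datasets examined in the paper; details follow later.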

(13)

Center-Based: Kmeans

1. Specify number of clusters K

2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.

3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters isn’t changing.

a. Assign each data point to its nearest centroid.

b. Recalculate means (centroids) for observations assigned to each cluster.

https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

https://en.wikipedia.org/wiki/K-means_clustering

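Kmeans is the most common clustering method. It starts from randomly chosen centers (centroids), then repeatedly averages the samples assigned to each centroid to obtain new centroids until convergence.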
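The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation; the function name, the `seed` and `max_iter` parameters, and the handling of empty clusters (keeping the old centroid) are assumptions of this sketch.

```python
import numpy as np


def kmeans(X, k, seed=0, max_iter=100):
    """Plain K-means: random initial centroids, then assign and average until stable."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid (assumption).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because the initial centroids are random, different seeds can converge to different local optima, which is one reason Kmeans results vary between runs.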

(14)

Density-Based: DBSCAN

Parameters: ε, minPts

Definitions

● A point p is a core point if at least minPts points are within distance ε of it

● A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

● A point q is reachable from p if there is a path p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable from p_i

● All points not reachable from any other point are outliers or noise points

Density-based spatial clustering of applications with noise

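DBSCAN is a commonly used clustering method. In the figure on the right, the red core points each have at least minPts=4 neighbors within radius eps. The yellow points are not core points but are directly reachable from a core point. The blue point is treated as noise.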

https://en.wikipedia.org/wiki/DBSCAN

(15)

Density-Based: DBSCAN

Density-based spatial clustering of applications with noise

https://en.wikipedia.org/wiki/DBSCAN

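The DBSCAN algorithm first finds all core points; core points that are reachable from one another form a cluster (so the number of clusters is determined by the algorithm itself). Every remaining point that is directly reachable from a core point joins that core point's cluster; otherwise it is treated as noise.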

The DBSCAN algorithm can be abstracted into the following steps:

1. Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors

2. Find the connected components of core points on the neighbor graph, ignoring all non-core points

3. Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise
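The three steps above can be sketched directly with a brute-force distance matrix. This is a minimal illustration under stated assumptions: the function name is hypothetical, a point counts as core when at least `min_pts` points (itself included) lie within `eps`, a border point joins the cluster of its closest core point, and noise is labeled -1.

```python
import numpy as np


def dbscan(X, eps, min_pts):
    """Brute-force DBSCAN sketch; returns one label per point, -1 for noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 1: eps-neighborhoods and core points (a point is its own neighbor).
    neighbors = dist <= eps
    core = neighbors.sum(axis=1) >= min_pts
    labels = np.full(n, -1)
    cluster = 0
    # Step 2: connected components of the core points on the neighbor graph.
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in np.flatnonzero(neighbors[p] & core & (labels == -1)):
                labels[q] = cluster
                stack.append(q)
        cluster += 1
    # Step 3: attach each non-core point to a core point within eps, if any.
    for i in np.flatnonzero(~core):
        near_core = np.flatnonzero(neighbors[i] & core)
        if len(near_core) > 0:
            labels[i] = labels[near_core[np.argmin(dist[i, near_core])]]
    return labels
```

The full distance matrix costs O(n^2) time and memory, so this form is only suitable for small examples.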

(16)

MST-Based

1. Traditional MST-based clustering algorithms first construct an MST according to a distance measure

2. Continually remove inconsistent edges to get a set of connected components (clusters)

3. Do step 2 until the terminal condition is satisfied

Clustering with Local Density Peaks-Based Minimum Spanning Tree. IEEE Trans. Knowl. Data Eng. 33(2): 374-387 (2021)

https://chih-ling-hsu.github.io/2017/09/01/Divisive-Clustering

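A typical MST-based method can be summarized in three steps: build an MST over the data points (usually with Euclidean distance), remove inconsistent edges to form clusters (for example, remove the longest edge), and check the termination condition (for example, stop once there are k groups). LDP-MST, the paper reported today, also fits this framework.

With noise points, however, Euclidean distance is not necessarily the ideal choice. Suppose k=3 and two noise points p1 and p2 happen to lie far from the groups we care about. Cutting the longest edges then yields p1 and p2 as one cluster each and all normal points as a single cluster, which is not what we want. How an inconsistent edge is defined therefore matters a great deal.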
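To make the failure case concrete, here is a minimal sketch of the simplified scheme (Euclidean MST, then remove the k-1 longest edges) using SciPy. The function name is hypothetical, and this is the simplified baseline described in these notes, not LDP-MST itself.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def mst_clustering(X, k):
    """Build a Euclidean MST and cut its k-1 longest edges (distinct points assumed)."""
    dist = squareform(pdist(X))
    mst = minimum_spanning_tree(dist).toarray()
    rows, cols = np.nonzero(mst)
    # Indices of the k-1 longest MST edges.
    longest = np.argsort(mst[rows, cols])[::-1][:k - 1]
    mst[rows[longest], cols[longest]] = 0.0
    # Each remaining connected component is one cluster.
    _, labels = connected_components(mst, directed=False)
    return labels


# Three tight groups plus two far-away noise points p1 and p2.
groups = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
          [5.0, 0.0], [5.1, 0.0], [5.2, 0.0],
          [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]]
noise = [[100.0, 0.0], [0.0, 100.0]]
labels = mst_clustering(np.array(groups + noise), k=3)
```

With k=3 the two longest edges are the ones attaching p1 and p2, so each noise point becomes its own cluster and the three real groups stay merged.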

(17)

Local Density Peaks-Based Minimum Spanning Tree

(LDP-MST)

(18)

Natural Neighbor

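Before describing the method itself, we introduce two basic concepts related to natural neighbors: NN(p), the k nearest neighbors of p, and RNN(p), the reverse nearest neighbors of p, i.e., the points that have p among their k nearest neighbors. The figure on the right shows an example with k=3, where NN(C)={C,D,E} and RNN(C)={A,B,C,D,E}. When computing the kNN of a point p, p itself is counted, so p also belongs to RNN(p). Note, however, that in Algorithm 1 (NaN-Searching), which follows, NN(p) and RNN(p) no longer include p itself.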
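A small sketch may help keep the two sets apart. The function name and the `include_self` flag are hypothetical, and the five points used in the usage note are made up for illustration; they are not the points in the slide's figure.

```python
import numpy as np


def knn_and_rnn(X, k, include_self=True):
    """Return (nn, rnn): nn[p] is the kNN set of p, rnn[p] the points whose kNN contains p."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if not include_self:
        # Push each point to the end of its own neighbor ranking.
        np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)[:, :k]
    nn = [set(row.tolist()) for row in order]
    rnn = [{q for q in range(n) if p in nn[q]} for p in range(n)]
    return nn, rnn
```

For example, with the one-dimensional points 0, 1, 2.1, 10, 25 (indices 0 to 4) and k=3 counting the point itself, every point has index 2 among its three nearest neighbors, so the RNN set of index 2 contains all five points.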

(19)

r is the neighbor searching range; the iteration starts from r=1. The example that follows illustrates the procedure.

(20)

r=1 A B C D E

nb 0 0 0 0 0

nb(q) is the number of times point q is counted among the r nearest neighbors of other points when the searching range is r. All values are initially 0.

(21)

r=2 A B C D E

nb 1 3 1 0 0

(22)

r=3 A B C D E

nb 3 4 3 0 0

The algorithm stops when the number of zeros in nb after a new iteration equals the number after the previous iteration.

(23)

r=3 A B C D E

nb 3 4 3 0 0

The value of r when the iteration stops is also called the natural characteristic value.
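The iteration walked through on the previous slides can be sketched as follows. This is a brute-force illustration of the stopping rule as described in these notes, not a transcription of the paper's Algorithm 1: the function name is hypothetical, r is incremented before the stopping check so that the returned r follows the same convention as the tables above, and stopping early once no zero remains is an added assumption.

```python
import numpy as np


def nan_search(X):
    """Return (r, nb) when the number of zeros in nb stops changing."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)      # a point is not its own neighbor here
    order = np.argsort(dist, axis=1)    # order[p, j] = (j+1)-th nearest neighbor of p
    nb = np.zeros(n, dtype=int)
    prev_zeros = n
    r = 1
    while r < n:
        # Every point votes for its r-th nearest neighbor.
        np.add.at(nb, order[:, r - 1], 1)
        r += 1
        zeros = int(np.sum(nb == 0))
        if zeros == prev_zeros or zeros == 0:
            break
        prev_zeros = zeros
    return r, nb
```

The value k=max(nb) used later for the local density can be read directly from the returned nb. The O(n^2) distance matrix here is only for clarity; the paper's complexity analysis relies on a KD-tree instead.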

(24)

Main Idea of LDP-MST

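This is an overview of the proposed method, which has roughly three steps:

(a) Use local density peaks as representatives (red points in the figure).

(b) Build an MST over the representatives using a newly defined distance (green lines in the figure).

(c) Cut edges starting from the longest until the target number of clusters is obtained (yellow lines in the figure).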

(25)

Local Density Peaks

Local density

Where k=max(nb).

r=3 A B C D E

nb 3 4 3 0 0

Representatives are chosen mainly by each point's local density, where k is determined by the nb obtained from Algorithm 1. In the earlier example, k is 4.

(26)

Local Density Peaks

Acyclic

Combining local density with kNN lets every point find its own representative. A point that represents itself is an LDP.
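The idea can be sketched as follows. Since the density formula from the slide is not reproduced in these notes, this sketch assumes a common kNN density estimate, k divided by the sum of distances to the k nearest neighbors; it also assumes that a point's representative is the densest point among itself and its k nearest neighbors. The function name is hypothetical, and the paper's actual algorithm may differ in these details.

```python
import numpy as np


def local_density_peaks(X, k):
    """Return (density, root, ldps) where root[p] is the LDP that finally represents p."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)     # column 0 is the point itself (distinct points assumed)
    knn = order[:, 1:k + 1]
    # ASSUMPTION: density = k / (sum of distances to the k nearest neighbors).
    density = k / dist[np.arange(n)[:, None], knn].sum(axis=1)
    # Representative: the densest point among p and its kNN; argmax keeps p on ties.
    candidates = order[:, :k + 1]
    rep = candidates[np.arange(n), np.argmax(density[candidates], axis=1)]
    # Follow representatives upward; density strictly increases, so there is no cycle.
    root = rep.copy()
    while not np.array_equal(rep[root], root):
        root = rep[root]
    ldps = np.flatnonzero(rep == np.arange(n))
    return density, root, ldps
```

Because a point only hands over to a strictly denser neighbor, the representative links form a forest whose roots are exactly the LDPs, which is the "Acyclic" property noted on the slide.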

(27)

Local Density Peaks

(28)

Local Density Peaks

The concrete algorithm for finding LDPs and MLDPs. This completes the first step.

(29)

Construct MST

Euclidean distance?

Geodesic distance?

Shared neighbors-based distance!

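Past experience, including issues such as the curse of dimensionality, suggests that Euclidean distance is not a good choice for building the MST. Geodesic distance is theoretically more suitable, but its time complexity is too high. The paper therefore proposes a new way to define distance.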

(30)

Construct MST

(31)

Construct MST

(32)

Construct MST

SD is computed differently depending on whether the two LDPs have shared neighbors.

(33)

Construct MST

With this definition, the distance between two closely related LDPs is shrunk, while the distance between two weakly related LDPs is enlarged.

(34)

Construct MST

Taking this data set as an example, focus on the three LDPs p, q, and o and on the MSTs built from the different distances.

(35)

Construct MST

Euclidean distance. Shared neighbors-based distance (SD).

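The MST built with Euclidean distance does not match what we expect for this data set, whereas the one built with SD behaves as expected. This completes the second step.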

(36)

Clustering Based on LDP and MST

Repeatedly cut the longest edge (SD).

Make sure the sizes of clusters are larger than a loosely estimated minimum number of points (MinSize).

MinSize mainly prevents a group of outliers that lie relatively close to each other from being treated as a separate cluster.
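One way to picture the MinSize check is the sketch below. It is an assumption about how the constraint could be applied, not the paper's exact procedure: edges are tried from longest to shortest, and a cut is accepted only if both resulting components represent more than `min_size` points. All names are hypothetical; `node_sizes[i]` stands for the number of original points represented by tree node i (for example, by one LDP).

```python
def _component_labels(n_nodes, edges):
    """Label connected components with a small union-find."""
    parent = list(range(n_nodes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n_nodes)]


def cut_with_min_size(n_nodes, edges, weights, node_sizes, k, min_size):
    """Cut the longest tree edges until k clusters exist, skipping cuts that
    would leave a component with min_size points or fewer."""
    kept = list(range(len(edges)))
    n_clusters = 1  # the edges are assumed to form one spanning tree
    for e in sorted(kept, key=lambda i: weights[i], reverse=True):
        if n_clusters >= k:
            break
        trial = [i for i in kept if i != e]
        labels = _component_labels(n_nodes, [edges[i] for i in trial])
        size = {}
        for node, lab in enumerate(labels):
            size[lab] = size.get(lab, 0) + node_sizes[node]
        u, v = edges[e]
        if size[labels[u]] > min_size and size[labels[v]] > min_size:
            kept = trial
            n_clusters += 1
    return _component_labels(n_nodes, [edges[i] for i in kept])
```

On a path 0-1-2-3-4 with edge lengths 1, 5, 1, 9 and node sizes 10, 10, 10, 10, 1, asking for two clusters with `min_size=5` skips the longest edge (it would isolate the single-point node 4) and cuts the edge of length 5 instead; with `min_size=0` the outlier node is split off first.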

(37)

MinSize

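If the data set has no outliers or only very few, setting MinSize or not does not affect the result. Otherwise, taking Dataset 3 as an example, not setting MinSize produces a cluster made up purely of outliers (red box in the right figure), while the two clusters in the black box in the right figure are not separated.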

Dataset 3. LDP-MST without setting MinSize.

(38)

MinSize

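If MinSize can be set larger than the proportion of outliers and smaller than the proportion of the smallest cluster (the ideal case), small changes to MinSize do not change the clustering result.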

(39)

The concrete procedure for obtaining the final cluster labels. This completes LDP-MST.

(40)

Time Complexity O(nlogn)

1. Search local density peaks.

a. NaN-searching: O(n^2) -> O(nlogn) by KD-tree

b. Calculating densities and searching LDPs: O(n)

2. Compute the shared neighbors-based distance between LDPs: O(n+n_{ldp}^2) (n_{ldp} << n)

3. Employ the MST-based clustering algorithm to cluster the LDPs: O(n_{ldp}^2)

n_{ldp} is the number of local density peaks.
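As an illustration of the KD-tree speed-up for the neighbor queries, the sketch below finds every point's r-th nearest neighbor with SciPy's `cKDTree` instead of a full distance matrix. The function name is hypothetical, and the paper's own implementation may be organized differently.

```python
import numpy as np
from scipy.spatial import cKDTree


def rth_neighbor_kdtree(X, r):
    """Index of the r-th nearest neighbor of every point (distinct points assumed)."""
    tree = cKDTree(X)
    # Query r + 1 neighbors because each point is returned as its own nearest neighbor.
    _, idx = tree.query(X, k=r + 1)
    return idx[:, r]
```

KD-tree queries degrade as the dimension grows, which matches the later remark that data sets with more than 10 dimensions are first reduced with PCA.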

(41)

Experiments

(42)

Clustering on Synthetic Datasets

(43)

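These are the results of the six algorithms on the first dataset. This example closely matches our usual picture of a clustering problem, and center-based methods handle this kind of case relatively easily. Note that although Kmeans is fast and easy to implement, its result here is somewhat unsatisfactory (red circle).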

(44)

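The second dataset contrasts sharply with the first: the center-based methods fail to recognize the parabola as a single cluster. The results of DBSCAN, LOF-MST, and the method of this paper are more intuitive.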

(45)

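For most methods, the third dataset takes longer to run. DBSCAN and LDP-MST both perform well on this example. Note that LOF-MST marks noise with black points, and the red circle shows that LOF-MST tends to treat normal points as outliers.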

(46)

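On the fourth dataset, the results of DBSCAN and LOF-MST both deviate somewhat from the ground truth (red circles). LDP-MST and DCore perform well on this dataset. LDP-MST also does relatively better here than LOF-MST, the other MST-based method.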

(47)

The fifth dataset highlights the advantage of MST-based methods in handling clusters of many different shapes; each of the other methods shows some oddities.

(48)

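In the sixth dataset, the clusters differ not only in shape but also considerably in sample count, which makes this kind of data relatively hard for typical clustering methods. Both MST-based methods perform well on this dataset.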

(49)

Results

In terms of accuracy, the LDP-MST algorithm is very competitive

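These are the quantitative results of the clustering experiments; for both ACC and NMI, higher is better. The highlighted entries mark the method with the second-highest ACC in each row. Note that LDP-MST clearly outperforms the other methods on the third and sixth datasets.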

(50)

Results

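While maintaining good clustering performance, LDP-MST also runs quite fast in practice

The highlighted entries are the cases I consider especially slow; overall, the third dataset takes the longest. For LDP-MST, the larger number of clusters in the fifth dataset (9 clusters) also slightly increases the time needed.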

(51)

Clustering on Real Data Sets

(52)

Real Data Sets

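Because of the limitations of the KD-tree, data sets with more than 10 dimensions were all reduced in dimension with PCA. PCA is treated as preprocessing, so its running time is not counted. Part of the Olivetti Face Database was also used in the experiments, but the method for computing the distance between two faces is taken directly from another paper and is rather complicated, so it is not explained in detail here.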

(53)

Results

Compared with the other methods, LDP-MST is mostly in the lead in accuracy.

(54)

Results

(55)

Evaluation on Running Time

(56)

The Impact of the Number of Instances

Two-dimensional data sets.

Instances are randomly generated from two different Gaussian distributions.

LDP-MST takes longer than Kmeans and DBSCAN but is far faster than LOF-MST.

(57)

The Impact of the Number of Dimensions

5000 instances.

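Instances are randomly generated from two different Gaussian distributions.

When the number of dimensions exceeds 30, the running time of DBSCAN increases sharply, whereas the running times of Kmeans, DP, DCore, and LDP-MST are less affected by the increase in dimensions.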
