# Clustering with Local Density Peaks-Based Minimum Spanning Tree

(1)

(2)

(3)

## Why clustering?

(4)

### Why clustering?

https://github.com/bheemnitd/2D-KMeans-Clustering

(5)

### Why clustering?

1. Unsupervised learning: labels are not needed
   a. Especially useful when labels are expensive
2. Discover how many categories are in your sample
3. Anomaly detection
4. Find useful intuitions to help supervised learning

(6)

### Clustering algorithm matters

https://scikit-learn.org/stable/modules/clustering.html

(7)

(8)

Label helps

(9)

### How to tell if it is good or bad?

Metrics

1. Clustering Accuracy (ACC)

2. Normalized Mutual Information (NMI)

(In the formulas: n_i = number of samples in category i, n_j = number of samples in cluster j, n_ij = number of samples in both category i and cluster j.)
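As a concrete reference, both metrics can be sketched with the standard library alone (the function names here are our own; real evaluations usually compute ACC with the Hungarian algorithm, so brute-force permutation is only viable for a handful of clusters):

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(labels, clusters):
    """ACC: fraction of points correctly assigned under the best
    one-to-one mapping from cluster ids to category ids.
    Brute-force over mappings; assumes #clusters <= #categories."""
    cats, clus = sorted(set(labels)), sorted(set(clusters))
    best = 0
    for perm in permutations(cats, len(clus)):
        mapping = dict(zip(clus, perm))
        hits = sum(1 for y, c in zip(labels, clusters) if mapping[c] == y)
        best = max(best, hits)
    return best / len(labels)

def nmi(labels, clusters):
    """NMI = I(Y; C) / sqrt(H(Y) * H(C)), with counts from the data."""
    n = len(labels)
    cat = Counter(labels)                    # n_i: points in category i
    clu = Counter(clusters)                  # n_j: points in cluster j
    joint = Counter(zip(labels, clusters))   # n_ij: points in both
    mi = sum((nij / n) * math.log(n * nij / (cat[i] * clu[j]))
             for (i, j), nij in joint.items())
    h = lambda cnt: -sum((v / n) * math.log(v / n) for v in cnt.values())
    return mi / math.sqrt(h(cat) * h(clu))
```

Both scores are 1.0 for a perfect clustering, even when the cluster ids are a permutation of the category ids.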

(10)

### How to tell if it is good or bad?

Running Time / Time Complexity

(11)

## How to generate clusters?

(12)

### a. LOF-MST b. LDP-MST

(Schematic) The six algorithms and six datasets examined in this paper; they are described in detail later.

(13)

### Center-Based: Kmeans

1. Specify number of clusters K

2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.

3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing.

a. Recalculate means (centroids) for observations assigned to each cluster.

https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

https://en.wikipedia.org/wiki/K-means_clustering

Kmeans is the most common clustering method. It starts from randomly chosen centroids, then repeatedly replaces each centroid by the mean of the samples assigned to it, until convergence.
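The three steps above can be sketched in plain Python (a toy implementation for intuition only; in practice one would use e.g. `sklearn.cluster.KMeans`):

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on tuples: random initial centroids, then alternate
    assignment and mean-update until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # step 2: K points without replacement
    for _ in range(max_iter):
        # step 3: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # step 3a: recompute each centroid as the mean of its cluster
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:            # converged: no centroid moved
            break
        centroids = new
    return centroids, clusters
```

On two well-separated blobs this recovers the blobs regardless of which points the initialization happens to pick.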

(14)

### Density-Based: DBSCAN

Parameters: ε, minPts

Definitions:

A point p is a core point if at least minPts points are within distance ε of it

A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.

A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each p(i+1) is directly reachable from p(i)

All points not reachable from any other point are outliers or noise points

Density-based spatial clustering of applications with noise

DBSCAN is another commonly used clustering method. In the diagram on the right, the red core points have at least minPts = 4 neighbors within radius eps; the yellow points are not core points but are directly reachable from a core point; the blue point is treated as a noise point.

https://en.wikipedia.org/wiki/DBSCAN

(15)

### Density-Based: DBSCAN

Density-based spatial clustering of applications with noise

https://en.wikipedia.org/wiki/DBSCAN

The DBSCAN algorithm is quite intuitive: first find all core points; mutually reachable core points form a cluster (that is, the number of clusters is decided by the algorithm itself). Every remaining point that is directly reachable from a core point is assigned to that cluster; all other points are treated as noise.

The DBSCAN algorithm can be abstracted into the following steps:

1. Find the points in the ε (eps) neighborhood of every point, and identify the core points with more than minPts neighbors

2. Find the connected components of core points on the neighbor graph, ignoring all non-core points

3. Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise
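The three steps map almost line for line onto a short implementation. A stdlib-only sketch (the function name is ours):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on coordinate tuples, following the three steps above.
    Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # step 1: eps-neighborhoods (each point counts itself)
    nbrs = [[j for j in range(n) if d2(points[i], points[j]) <= eps * eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # step 2: flood-fill the connected component of core points;
        # step 3: border points get the cluster id but are not expanded
        labels[i] = cid
        stack = [i]
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cid
                    if core[q]:        # only core points propagate the cluster
                        stack.append(q)
        cid += 1
    return labels
```

Points never reached by any core point keep the label -1, i.e. they are reported as noise.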

(16)

3. Repeat step 2 until the termination condition is satisfied

Clustering with Local Density Peaks-Based Minimum Spanning Tree. IEEE Trans. Knowl. Data Eng. 33(2): 374-387 (2021)

https://chih-ling-hsu.github.io/2017/09/01/Divisive-Clustering

(17)

## Local Density Peaks-Based Minimum Spanning Tree

(18)

### Natural Neighbor

For example, NN(C) = {C, D, E} and RNN(C) = {A, B, C, D, E}. When computing the kNN of a point p, p itself is counted, so p would also appear in RNN(p). Note, however, that in the subsequent Algorithm 1 (NaN-Searching), NN(p) and RNN(p) no longer include p itself.

(19)

r is the neighbor searching range; the iteration starts from r = 1. The following example illustrates the process step by step.

(20)

| r = 1 | A | B | C | D | E |
|-------|---|---|---|---|---|
| nb    | 0 | 0 | 0 | 0 | 0 |

nb(q) denotes how many times point q has been counted among another point's r nearest neighbors at searching range r; all counters are initialized to 0.

(21)

| r = 2 | A | B | C | D | E |
|-------|---|---|---|---|---|
| nb    | 1 | 3 | 1 | 0 | 0 |

(22)

| r = 3 | A | B | C | D | E |
|-------|---|---|---|---|---|
| nb    | 3 | 4 | 3 | 0 | 0 |

(23)

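The searching loop above can be sketched as follows (stdlib only; the termination test, stopping when the count of points with nb = 0 stops shrinking, is a simplification of the paper's Algorithm 1):

```python
import math

def nan_searching(points):
    """Sketch of NaN-Searching: grow the searching range r, and at each r
    every point increments the nb counter of its r-th nearest neighbor.
    Stops when no point has nb == 0, or when the number of such points
    stops shrinking. Returns (r, nb)."""
    n = len(points)
    # each point's neighbors sorted by distance, excluding the point itself
    order = [sorted((j for j in range(n) if j != i),
                    key=lambda j, i=i: math.dist(points[i], points[j]))
             for i in range(n)]
    nb = [0] * n
    prev_zeros = n + 1
    r = 0
    while True:
        r += 1
        for i in range(n):
            if r <= len(order[i]):
                nb[order[i][r - 1]] += 1
        zeros = sum(1 for v in nb if v == 0)
        if zeros == 0 or zeros == prev_zeros:   # termination condition
            return r, nb
        prev_zeros = zeros
```

On points placed at 0, 1, 2 and 10 on a line, the isolated point at 10 ends with nb = 0, just like D and E in the table above.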

(24)

(25)

Local density, where k = max(nb).

| r = 3 | A | B | C | D | E |
|-------|---|---|---|---|---|
| nb    | 3 | 4 | 3 | 0 | 0 |

(26)

Acyclic

(27)

(28)

(29)

### Construct MST

Euclidean distance?

Geodesic distance?

Shared neighbors-based distance!

(30)

(31)

(32)

### Construct MST

SD is computed in one of two ways, depending on whether the two LDPs have shared neighbors.

(33)

(34)

(35)

### Construct MST

Euclidean distance. Shared neighbors-based distance (SD).

(36)

### Clustering Based on LDP and MST

Repeatedly cut the longest edge (SD).

Make sure the sizes of the clusters are larger than a loosely estimated minimum number of points (MinSize).

MinSize is mainly used to avoid treating a group of relatively close outliers as a separate cluster.
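A sketch of this cutting loop (helper names are ours; generic edge weights stand in for SD here, and the actual algorithm estimates MinSize and its stopping point somewhat differently):

```python
def mst_cut_clustering(nodes, edges, num_clusters, min_size):
    """Build an MST with Kruskal, then repeatedly cut the longest remaining
    edge, skipping any cut that would create a component smaller than
    min_size. `edges` is a list of (weight, u, v) tuples."""
    # --- Kruskal's MST via union-find ---
    parent = {u: u for u in nodes}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path compression
            u = parent[u]
        return u
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    # --- connected components of a kept edge set ---
    def components(kept):
        adj = {u: [] for u in nodes}
        for w, u, v in kept:
            adj[u].append(v)
            adj[v].append(u)
        seen, comps = set(), []
        for s in nodes:
            if s in seen:
                continue
            comp, stack = [], [s]
            seen.add(s)
            while stack:
                x = stack.pop()
                comp.append(x)
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            comps.append(comp)
        return comps
    # --- cut longest edges first, respecting min_size ---
    kept = list(mst)
    for edge in sorted(mst, reverse=True):
        if len(components(kept)) >= num_clusters:
            break
        trial = [e for e in kept if e != edge]
        if all(len(c) >= min_size for c in components(trial)):
            kept = trial
    return components(kept)
```

Cutting the single long bridge between two tight groups splits them into the expected clusters, while the size guard would refuse a cut that isolates fewer than min_size points.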

(37)

### MinSize

Dataset 3. LDP-MST without setting MinSize.

(38)

(39)

(40)

### Time Complexity: O(n log n)

1. Search local density peaks.
   - NaN-searching: O(n^2) -> O(n log n) with a KD-tree
   - Calculating densities and searching LDPs: O(n)
2. Compute the shared neighbors-based distance between LDPs.
   - O(n + n_{ldp}^2), where n_{ldp} << n
3. Employ the MST-based clustering algorithm to cluster the LDPs.
   - O(n_{ldp}^2)

n_{ldp} is the number of local density peaks.

(41)

(42)

(43)

(44)

(45)

(46)

(47)

(48)

(49)

### Results

In terms of accuracy, the LDP-MST algorithm is very competitive.

(50)

### Results

While maintaining good clustering quality, LDP-MST is also quite fast in actual execution.

(51)

(52)

(53)

(54)

(55)

(56)

### The Impact of the Number of Instances

Two-dimensional data sets.

Instances are randomly generated from two different Gaussian distributions.

Although LDP-MST takes longer than Kmeans and DBSCAN, it is far faster than LOF-MST.

(57)

### The Impact of the Number of Dimensions

5000 instances.

Instances are randomly generated from two different Gaussian distributions.

(58)

(59)
