Applied Multivariate Quantitative Methods-Cluster Analysis

Academic year: 2021

(1)

Applied Multivariate

Quantitative Methods

Cluster Analysis

By Jen-pei Liu, PhD

Division of Biometry, Department of Agronomy,

National Taiwan University

and

Wei-Chie Chie, MD, PhD

Department of Public Health

National Taiwan University

(2)

Cluster Analysis

Introduction

Measures of Similarity

Hierarchical Clustering

K-means Clustering

Summary

(3)

Introduction

A sample of n objects, each with measurements of p variables

To use the measurements of p variables to devise a scheme for grouping n objects into classes

(4)

Introduction

In general, the number of clusters is not known in advance (unsupervised analysis)

The number of classes is pre-specified in discriminant analysis, which is based on a prediction function (supervised analysis)

(5)

Introduction

Examples

Cluster of depressed patients

Data reduction

Marketing

Test markets: large number of cities

Small number of groups of similar cities

one member from each group selected for testing

Microarray

Clusters of genes

Clusters of subjects

(6)

Introduction

Types of Clustering Methods

Hierarchical Clustering

To find a series of partitions

A bottom-up clustering

Partitional Method

To produce a single partition of objects

A top-down clustering

(7)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(8)

Measures of Similarity

Distance (Dissimilarity) Function

Data vectors

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})', \qquad X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})'$$

(a) Euclidean Distance

$$d_{ij} = \left[ \sum_{k=1}^{p} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

(9)

Measures of Similarity

Euclidean Distances Matrix for 6 students

       1      2      3      4      5      6
1      0  78.10  33.60  13.93  68.77  27.73
2             0  46.14  90.52  15.13  51.66
3                    0  47.17  35.36  15.81
4                           0  82.01  39.05
5                                  0  44.72
6                                         0
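The matrix for the six students can be recomputed directly from the Chinese and Math scores. The following is a minimal Python sketch; the names `scores`, `euclidean` and `dist` are illustrative and do not come from the slides.

```python
import math

# Chinese (X1) and Math (X2) scores of the six students in the example.
scores = [(85, 82), (25, 32), (65, 55), (90, 95), (40, 30), (60, 70)]

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Full symmetric matrix; dist[i][j] is the distance between students i+1 and j+1.
dist = [[euclidean(x, y) for y in scores] for x in scores]

for student, row in enumerate(dist, start=1):
    print(student, " ".join(f"{d:6.2f}" for d in row))
```

Only the upper triangle is shown on the slide because the matrix is symmetric with a zero diagonal.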

(10)

Measures of Similarity

Distance (Dissimilarity) Function

(b) Manhattan (city block) Distance

$$d'_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}|$$

Manhattan distance is more robust to extreme values

The Manhattan distance between Students 1 and 2: $|85 - 25| + |82 - 32| = 110$
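As a quick check, the city-block distance for Students 1 and 2 can be computed from their scores in the student example, (85, 82) and (25, 32). This is a small sketch; the function name `manhattan` is illustrative.

```python
import math

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

student1 = (85, 82)  # Chinese, Math
student2 = (25, 32)

d12_manhattan = manhattan(student1, student2)
d12_euclidean = math.dist(student1, student2)

print(d12_manhattan)            # sum of |85 - 25| and |82 - 32|
print(round(d12_euclidean, 2))  # Euclidean distance for comparison
```

Because differences are not squared, a single extreme coordinate contributes linearly rather than quadratically.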

(11)

Measures of Similarity

(12)

Measures of Similarity

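Similarity Measures

Pearson product-moment correlation coefficient

$$r = \frac{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)^2 \, \sum_{k=1}^{p} (X_{jk} - \bar{X}_j)^2}}$$

Distance measure based on the Pearson correlation coefficient

$$d_r = 1 - r$$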

(13)

Measures of Similarity

Other Similarity Measures

Angular distance (uncentered Pearson correlation coefficient)

$$r' = \frac{\sum_{k=1}^{p} X_{ik} X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2 \, \sum_{k=1}^{p} X_{jk}^2}}$$

Spearman correlation coefficient

Use the ranks of the observations in the Pearson correlation

(14)

Measures of Similarity

Correlation coefficient

A measure of association

Not a measure of similarity (or agreement)

Euclidean distance

(15)

Measures of Similarity

Example

  Case I       Case II      Case III
  X1   X2      X1   X2      X1   X2
   1    1       1    2       1    4
   2    2       2    4       2    8
   3    3       3    6       3   12
   4    4       4    8       4   16

Case I: r = 1, $d^2$ = 0

Case II: r = 1, $d^2$ = 30

Case III: r = 1, $d^2$ = 270
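The three cases can be verified numerically: the correlation stays at 1 while the squared Euclidean distance grows. This is a minimal sketch; the helper names `pearson` and `squared_euclidean` are illustrative.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x1 = [1, 2, 3, 4]
cases = {"I": [1, 2, 3, 4], "II": [2, 4, 6, 8], "III": [4, 8, 12, 16]}

results = {name: (pearson(x1, x2), squared_euclidean(x1, x2))
           for name, x2 in cases.items()}

for name, (r, d2) in results.items():
    print(f"Case {name}: r = {r:.2f}, d^2 = {d2}")
```

Perfect association (r = 1) therefore does not imply agreement: the two profiles can be far apart in Euclidean terms.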

(16)

Hierarchical Clustering

General Steps for n objects

Step 1: There are n clusters at the beginning and each object is a cluster. Compute pairwise distances among all clusters

Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster

Step 3: Based on n-1 clusters, compute pairwise distances among all n-1 clusters

Step 4: Find the minimum distance and merge the

(17)
(18)

Hierarchical Clustering

Single Linkage (Nearest-neighbor) Method

Use the minimal distance

Distance matrix for 5 objects

     1    2    3    4    5
1    0
2    9    0
3    3    7    0

(19)

Hierarchical Clustering

Single Linkage Method

Step 1: 5 clusters: {1}, {2}, {3}, {4}, {5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the minimal distance among {35}, {1}, {2}, {4}

$d_{\{35\}1} = \min[d_{31}, d_{51}] = \min[3, 11] = 3$

$d_{\{35\}2} = \min[d_{32}, d_{52}] = \min[7, 10] = 7$

$d_{\{35\}4} = \min[d_{34}, d_{54}] = \min[9, 8] = 8$

(20)

Hierarchical Clustering

Single Linkage Method

Update the distance matrix

       {35}    1    2    4
{35}     0
1        3    0

(21)

Hierarchical Clustering

Single Linkage Method

Step 4: Minimal distance is 3 between {35} and {1}; merge {35} and {1} into {135}

Step 5: Find the distances between {135} and {2} and {4}

$d_{\{135\}2} = \min[d_{\{35\}2}, d_{12}] = \min[7, 9] = 7$

$d_{\{135\}4} = \min[d_{\{35\}4}, d_{14}] = \min[8, 6] = 6$

(22)

Hierarchical Clustering

Single Linkage Method

Update the distance matrix

        {135}    2    4
{135}     0
2         7    0
4         6    5    0

The minimal distance is 5 between {2} and {4}

(23)

Hierarchical Clustering

Single Linkage Method

Find the minimum distance between {135} and {24}

$d_{\{135\}\{24\}} = \min[d_{\{135\}2}, d_{\{135\}4}] = \min[7, 6] = 6$

Update the distance matrix

        {135}  {24}
{135}     0

(24)

Hierarchical Clustering

Single Linkage Method

Distance  Clusters
2         {1},{35},{2},{4}
3         {135},{2},{4}
4         {135},{2},{4}
5         {135},{24}
6         {12345}
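The table can be reproduced with a short Python sketch. It uses a swapped-in but equivalent view of single linkage: the clusters at a given distance are the connected components of the graph that links every pair of objects whose distance does not exceed that value. The full set of pairwise distances below is collected from the values quoted in the worked steps (for example $d_{14} = 6$, $d_{24} = 5$, $d_{54} = 8$); the names `D` and `clusters_at` are illustrative.

```python
# Pairwise distances for the 5 objects, as quoted in the worked steps.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11, (2, 3): 7,
     (2, 4): 5, (2, 5): 10, (3, 4): 9, (3, 5): 2, (4, 5): 8}

def clusters_at(threshold, objects=(1, 2, 3, 4, 5)):
    """Single-linkage clusters at a given distance: connected components
    of the graph joining every pair with distance <= threshold."""
    clusters = [{o} for o in objects]
    for (i, j), distance in D.items():
        if distance <= threshold:
            ci = next(c for c in clusters if i in c)
            cj = next(c for c in clusters if j in c)
            if ci is not cj:
                clusters.remove(cj)  # merge the two components
                ci |= cj
    return sorted(sorted(c) for c in clusters)

for threshold in (2, 3, 4, 5, 6):
    print(threshold, clusters_at(threshold))
```

No merge happens at distance 4, which is why the rows for 3 and 4 list the same clusters.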

(25)

Hierarchical Clustering

Dendrograms

A 2-dimensional tree structure rooted at the top

One dimension is the distance measure

The other dimension is the clustering results

The height of a vertical (horizontal) line represents the distance between the two clusters it merges

(26)

Hierarchical Clustering

Complete Linkage (Farthest-neighbor) Method

Use the maximal distance

Distance matrix for 5 objects

     1    2    3    4    5
1    0
2    9    0
3    3    7    0

(27)
(28)

Hierarchical Clustering

Complete Linkage Method

Step 1: 5 clusters: {1}, {2}, {3}, {4}, {5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the maximal distance among {35}, {1}, {2}, {4}

(29)

Hierarchical Clustering

Complete Linkage Method

Update the distance matrix

       {35}    1    2    4
{35}     0
1       11    0
2       10    9    0

(30)

Hierarchical Clustering

Complete Linkage Method

Step 4: Minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}

Step 5: Find the maximal distances

$d_{\{24\}\{35\}} = \max[d_{2\{35\}}, d_{4\{35\}}] = \max[10, 9] = 10$

$d_{\{24\}1} = \max[d_{21}, d_{41}] = \max[9, 6] = 9$

(31)

Hierarchical Clustering

Complete Linkage Method

Update the distance matrix

       {35}  {24}    1
{35}     0
{24}    10     0
1       11     9    0

The minimal distance is 9 between {1} and {24}

(32)

Hierarchical Clustering

Complete Linkage Method

Find the maximal distance between {124} and {35}

$d_{\{124\}\{35\}} = \max[d_{1\{35\}}, d_{\{24\}\{35\}}] = \max[11, 10] = 11$

Update the distance matrix

       {35}  {124}
{35}     0

(33)

Hierarchical Clustering

Complete Linkage Method

Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
9         {35},{124}
11        {12345}

(34)
(35)

Average Clustering

Average Linkage Method

Use the average distance

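Average distance between cluster {uv} and cluster {w}

$$d_{\{uv\}\{w\}} = \frac{\sum_{i \in \{uv\}} \sum_{k \in \{w\}} d_{ik}}{n_{\{uv\}} \, n_{\{w\}}}$$

where $n_{\{uv\}}$ and $n_{\{w\}}$ are the numbers of objects in clusters {uv} and {w}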



(36)

Average Clustering

Average Linkage Method

Use the average distance

Distance matrix for 5 objects

     1    2    3    4    5
1    0
2    9    0
3    3    7    0

(37)

Hierarchical Clustering

Average Linkage Method

Step 1: 5 clusters: {1}, {2}, {3}, {4}, {5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the average distance among {35}, {1}, {2}, {4}

$d_{\{35\}1} = (d_{31} + d_{51})/(2 \times 1) = (3 + 11)/2 = 7$

$d_{\{35\}2} = (d_{32} + d_{52})/(2 \times 1) = (7 + 10)/2 = 8.5$

$d_{\{35\}4} = (d_{34} + d_{54})/(2 \times 1) = (9 + 8)/2 = 8.5$

(38)

Hierarchical Clustering

Average Linkage Method

Update the distance matrix

       {35}    1    2    4
{35}     0
1        7    0

(39)

Hierarchical Clustering

Average Linkage Method

Step 4: Minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}

Step 5: Find the average distances

$d_{\{24\}\{35\}} = (d_{23} + d_{25} + d_{43} + d_{45})/(2 \times 2) = (7 + 10 + 9 + 8)/(2 \times 2) = 8.5$

$d_{\{24\}1} = (d_{21} + d_{41})/(2 \times 1) = (9 + 6)/2 = 7.5$

(40)

Hierarchical Clustering

Average Linkage Method

Update the distance matrix

       {35}  {24}    1
{35}     0
{24}   8.5     0
1        7   7.5    0

The minimal distance is 7 between {1} and {35}

(41)

Hierarchical Clustering

Average Linkage Method

Find the average distance between {24} and {135}

$d_{\{24\}\{135\}} = (d_{12} + d_{14} + d_{32} + d_{34} + d_{52} + d_{54})/(3 \times 2) = (9 + 6 + 7 + 9 + 10 + 8)/6 = 8.17$

Update the distance matrix

        {135}  {24}
{135}     0

(42)

Hierarchical Clustering

Average Linkage Method

Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
7         {135},{24}
8.17      {12345}
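The three linkage rules differ only in how the distance between two clusters is summarised from the pairwise object distances (minimum, maximum or mean). This is a naive Python sketch under that reading. The distance values are those quoted in the worked steps, and the names `D`, `LINKAGES` and `agglomerate` are illustrative rather than part of any library.

```python
from itertools import combinations

# Pairwise distances for the 5 objects, as quoted in the worked steps.
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11, (2, 3): 7,
     (2, 4): 5, (2, 5): 10, (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(i, j):
    return D[(min(i, j), max(i, j))]

def mean(values):
    return sum(values) / len(values)

# Each linkage summarises all cross-cluster object distances differently.
LINKAGES = {"single": min, "complete": max, "average": mean}

def agglomerate(objects, linkage):
    """Naive agglomerative clustering; returns (merge distance, new cluster)."""
    summarise = LINKAGES[linkage]
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        # Cluster-to-cluster distance for every pair of current clusters.
        candidates = [(summarise([d(i, j) for i in a for j in b]), a, b)
                      for a, b in combinations(clusters, 2)]
        height, a, b = min(candidates, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((round(height, 2), sorted(a | b)))
    return history

for name in LINKAGES:
    print(name, agglomerate(range(1, 6), name))
```

Recomputing every cluster distance from scratch keeps the sketch short; it is not meant to be efficient.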

(43)
(44)

Hierarchical Clustering

Example Manly (2005)

Distance Matrix of 5 objects

     1    2    3    4    5
1    0
2    2    0
3    6    5    0

(45)

Hierarchical Clustering

Single Linkage Method

Distance  Clusters
2         {12},{3},{4},{5}
3         {12},{3},{45}
4         {12},{345}
5         {12345}

The same results are obtained from the complete and average linkage methods

(46)
(47)

Hierarchical Clustering

Example: Canine group by single linkage clustering

Distance  Clusters                    #
0.72      {MD,PD},GJ,CW,IW,CU,DI      6
1.38      {MD,PD,CU},GJ,CW,IW,DI      5
1.63      {MD,PD,CU},GJ,CW,IW,DI      5
1.68      {MD,PD,CU,DI},GJ,CW,IW      4
2.07      {MD,PD,CU,DI,GJ},CW,IW      3
2.31      {MD,PD,CU,DI,GJ},{CW,IW}    2
2.37      {MD,PD,CU,DI,GJ,CW,IW}      1

(48)
(49)

Results of single linkage method

for European employment data

(50)

Hierarchical Clustering

Centroid (Center or Average) Method

Start with each object being a cluster

Merge the two clusters with the shortest distance

Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new cluster

Merge the two clusters with the shortest distance

Compute the centroid as the averages of all variables in the new clusters and update the distance matrix using the averages of the new clusters

(51)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(52)

Hierarchical Clustering

Centroid Method

Euclidean Distance matrix of 6 students

       1      2      3      4      5      6
1      0
2  78.10      0
3  33.60  46.14      0
4  13.93  90.52  47.17      0
5  68.77  15.13  35.36  82.01      0

(53)

Hierarchical Clustering

Centroid Method

The shortest distance is between student {1} and student {4}

Merge {1} and {4} into {14}

Compute the averages for Chinese and math

Average of Chinese = (85+90)/2 = 87.5

Average of math = (82+95)/2 = 88.5

(54)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

        {14}      2      3      5      6
{14}       0
2      84.25      0
3      40.35  46.14      0

(55)

Hierarchical Clustering

Centroid Method

The shortest distance is between {2} and {5}

Merge {2} and {5} into {25}

The average of Chinese of {25} is 32.5

The average of math of {25} is 31.0

(56)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

        {14}   {25}      3      6
{14}       0
{25}   79.57      0
3      40.35  40.40      0

(57)

Hierarchical Clustering

Centroid Method

The shortest distance is between {3} and

{6}

Merge {3} and {6} into {36}

The average of Chinese of {36} is 62.5

The average of math of {36} is 62.5

(58)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

        {14}   {25}   {36}
{14}       0
{25}   79.57      0

(59)

Hierarchical Clustering

Centroid Method

The shortest distance is between {14} and

{36}

Merge {14} and {36} into {1346}

Cluster means

Cluster  Chinese  Math
{25}     32.5     31.0

(60)

Hierarchical Clustering

Centroid Method

Distance between {25} and {1346} is 61.53

Distance  Clusters
13.93     {14},{2},{3},{5},{6}
15.13     {14},{25},{3},{6}
15.81     {14},{25},{36}
36.07     {1346},{25}
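The centroid steps for the six students can be traced with the Python sketch below. It recomputes each cluster mean from the raw scores at every stage and merges the pair of clusters whose centroids are closest. The names `scores`, `centroid` and `centroid_clustering` are illustrative.

```python
import math

# Chinese (X1) and Math (X2) scores, keyed by student number.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    """Mean of each variable over the students in a cluster."""
    points = [scores[m] for m in members]
    return tuple(sum(column) / len(points) for column in zip(*points))

def centroid_clustering():
    """Repeatedly merge the two clusters with the closest centroids."""
    clusters = [(s,) for s in scores]
    history = []
    while len(clusters) > 1:
        pairs = [(math.dist(centroid(a), centroid(b)), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        height, a, b = min(pairs, key=lambda t: t[0])
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        history.append((round(height, 2), merged))
    return history

for height, merged in centroid_clustering():
    print(height, merged)
```

The last merge joins {25} with {1346} at the distance between their centroids, (32.5, 31.0) and (75.0, 75.5).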

(61)
(62)

Hierarchical Clustering

Application to gene expression data

from microarray experiments

# of genes >>> # of subjects

Clustering in two directions

Clusters of subjects (patients)

Clusters of genes

(63)
(64)
(65)
(66)
(67)
(68)

Hierarchical Clustering

The complexity of a bottom-up method can vary between $n^2$ and $n^3$, depending on the linkage chosen.

The complexity of a top-down method can vary between $n \log n$ and $n^2$, depending on ...

(69)

Hierarchical Clustering

Determination of the number of clusters

Criteria

Root-mean-square total-sample standard deviation (RMSSTD)

Semipartial R-square (SPRSQ)

R-square (RSQ)

(70)
(71)

Hierarchical Clustering

Determination of the number of clusters

Example: test scores of 6 students

# of clusters  RMSSTD  SPRSQ   RSQ   MD
5              6.96    0.0145  0.98  0.30
4              7.57    0.0171  0.97  0.33
3              7.91    0.0187  0.95  0.34
2              15.93   0.1946  0.76  0.60
1              25.86   0.7551  0.00  0.77
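The slides do not define the three criteria, so the Python sketch below assumes their common sum-of-squares definitions. RMSSTD is the pooled standard deviation of the newly formed cluster, SPRSQ is the increase in within-cluster sum of squares caused by a merge divided by the total sum of squares, and RSQ is one minus the within-cluster share of the total. Under these assumptions the student data gives RMSSTD values of 6.96 and 25.86 for {14} and for all six students. The function names are illustrative.

```python
import math

scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}
P = 2  # number of variables (Chinese, Math)

def sum_of_squares(members):
    """Squared deviations from the cluster mean, summed over variables."""
    points = [scores[m] for m in members]
    means = [sum(column) / len(points) for column in zip(*points)]
    return sum((x - m) ** 2 for point in points for x, m in zip(point, means))

TOTAL_SS = sum_of_squares(list(scores))

def rmsstd(members):
    """Assumed definition: pooled standard deviation of a cluster."""
    return math.sqrt(sum_of_squares(members) / (P * (len(members) - 1)))

def sprsq(a, b):
    """Assumed definition: merge-induced increase in within SS over total SS."""
    increase = sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)
    return increase / TOTAL_SS

def rsq(partition):
    """Assumed definition: 1 - (within-cluster SS / total SS)."""
    return 1 - sum(sum_of_squares(c) for c in partition) / TOTAL_SS

print(round(rmsstd([1, 4]), 2), round(sprsq([1], [4]), 4))
print(round(rsq([[1, 4], [2, 5], [3, 6]]), 2))
print(round(rmsstd([1, 3, 4, 6]), 2), round(sprsq([1, 4], [3, 6]), 4))
```

A sharp drop in RSQ, or a jump in RMSSTD and SPRSQ, between consecutive rows suggests that the merge joined dissimilar clusters.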

(72)

K-means Clustering

Step 1: Select the number of clusters, say K, and determine the distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient

Step 2: Divide the n objects into K clusters, either randomly or based on a preliminary hierarchical clustering

(73)

K-means Clustering

Step 4: For each object, find the minimal distance and reallocate the object to the corresponding cluster with the minimal distance

Step 5: Update the clusters and their centroids

Step 6: Repeat Step 3 and Step 4 until no reallocation of objects among clusters occurs
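The reallocation loop can be sketched in Python on the six students' scores. Step 3 is not shown in the text, so the sketch assumes it computes each cluster centroid and the Euclidean distance from every object to every centroid. The initial partition {1,2}, {3,4}, {5,6} is arbitrary and chosen only for illustration, as are the names `centroid` and `kmeans`.

```python
import math

# Chinese and Math scores of students 1..6.
points = [(85, 82), (25, 32), (65, 55), (90, 95), (40, 30), (60, 70)]

def centroid(members):
    return tuple(sum(column) / len(members) for column in zip(*members))

def kmeans(points, labels, k):
    """Batch K-means from an initial partition; stops when no object moves."""
    labels = list(labels)
    centroids = [None] * k
    while True:
        # Assumed Step 3: centroid of every non-empty cluster
        # (an emptied cluster keeps its previous centroid).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
        # Step 4: reallocate each object to the nearest centroid.
        candidates = [c for c in range(k) if centroids[c] is not None]
        new_labels = [min(candidates, key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 6: finished once no object changes cluster.
        if new_labels == labels:
            return labels, centroids
        labels = new_labels  # Step 5: update the clusters

labels, centroids = kmeans(points, [0, 0, 1, 1, 2, 2], k=3)
print(labels)
print(centroids)
```

From this start the loop settles on the groups {1,4}, {2,5} and {3,6} after one reallocation pass.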

(74)
(75)
(76)
(77)
(78)

K-means Clustering

The number of computations that need

to be performed can be written as c*p

where c is a value that does depend on

the number of iterations and p is the

number of variables (e.g., the number

of genes)

(79)

K-means Clustering

The number of clusters is selected to

maximize the between-cluster sum of squares

(variation) and to minimize the within-cluster

sum of squares (variation)

The best-of-10 partition: to apply K-means

method 10 times using 10 different randomly

chosen sets of initial clusters and choose the

result that minimizes the within-cluster sum

of squares
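The best-of-10 idea can be sketched in Python as follows: run K-means from several random initial partitions and keep the run with the smallest within-cluster sum of squares. The data are the six students' scores. The seed, the round-robin way of drawing a random partition, and all function names are assumptions made for illustration.

```python
import math
import random

points = [(85, 82), (25, 32), (65, 55), (90, 95), (40, 30), (60, 70)]

def centroids_of(points, labels, k):
    """Centroid of each cluster; None for a cluster with no members."""
    result = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        result.append(tuple(sum(v) / len(members) for v in zip(*members))
                      if members else None)
    return result

def wss(points, labels, k):
    """Within-cluster sum of squared distances to the cluster centroids."""
    cents = centroids_of(points, labels, k)
    return sum(math.dist(p, cents[label]) ** 2 for p, label in zip(points, labels))

def kmeans(points, labels, k):
    """Batch K-means; a cluster that becomes empty is dropped."""
    while True:
        cents = centroids_of(points, labels, k)
        live = [c for c in range(k) if cents[c] is not None]
        new = [min(live, key=lambda c: math.dist(p, cents[c])) for p in points]
        if new == labels:
            return labels
        labels = new

def best_of(points, k, runs=10, seed=0):
    """Keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        order = list(range(len(points)))
        rng.shuffle(order)
        labels = [0] * len(points)
        for position, index in enumerate(order):
            labels[index] = position % k  # random partition, no empty cluster
        final = kmeans(points, labels, k)
        score = wss(points, final, k)
        if best is None or score < best[0]:
            best = (score, final)
    return best

best_score, best_labels = best_of(points, k=3)
print(round(best_score, 2), best_labels)
```

Restarting guards against a poor local optimum, but it does not guarantee that the global minimum is found.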

(80)

Issues and Limitations

With considerable overlap between the initial

groups, cluster analysis may produce a result

that is quite different from the true situation

Different approaches produce different results.

The dendrogram itself is almost never the answer to the research question.

(81)
(82)

Issues and Limitations

Shape of clusters will create difficulty in

cluster analysis

(a) and (b): recoverable by any reasonable algorithm

(c) some methods will fail because of

overlapping points

(d), (e) and (f): great challenges for most

(83)

Issues and Limitations

Anything can be clustered

The clustering algorithm applied to the same

data may produce different results

Ignore the magnitudes of distance measures in the dendrogram

Position of the patterns within the clusters does not reflect their relationship in the input space

(84)
(85)
(86)
(87)

Summary

Goals

Methods

Hierarchical Methods

Single

Complete

Average

Centroid

K-means Clustering

Limitations
