Applied Multivariate Quantitative Methods: Cluster Analysis

(1)

Applied Multivariate

Quantitative Methods

Cluster Analysis

By Jen-pei Liu, PhD

Division of Biometry, Department of Agronomy,

National Taiwan University

and

Wei-Chie Chie, MD, PhD

Department of Public Health

National Taiwan University

(2)

Cluster Analysis



Introduction



Measures of Similarity



Hierarchical Clustering



K-means Clustering



Summary

(3)

Introduction



A sample of n objects, each with

measurements of p variables



To use the measurements of p variables

to devise a scheme for grouping n

objects into classes

(4)

Introduction



In general, the number of clusters is not known in advance (unsupervised analysis)

In discriminant analysis, the number of classes is pre-specified and classification is based on a prediction function (supervised analysis)

(5)

Introduction



Examples



Cluster of depressed patients



Data reduction



Marketing



Test markets: large number of cities



Small number of groups of similar cities



one member from each group selected for testing



Microarray



Clusters of genes



Clusters of subjects

(6)

Introduction



Types of Clustering Methods



Hierarchical Clustering



To find a series of partitions

A bottom-up clustering



Partitional Method



To produce a single partition of objects



A top-down clustering

(7)

Introduction

Example

Student   Chinese (X1)   Math (X2)
1         85             82
2         25             32
3         65             55
4         90             95
5         40             30
6         60             70

(8)

Measures of Similarity

Distance (Dissimilarity) Function

Data vectors

$X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})'$ and $X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})'$

(a) Euclidean Distance

$$d_{ij} = \left[ \sum_{k=1}^{p} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

(9)

Measures of Similarity



Euclidean Distances Matrix for 6 students

      1      2       3       4       5       6
1     0      78.10   33.60   13.93   68.77   27.73
2            0       46.14   90.52   15.13   51.66
3                    0       47.17   35.36   15.81
4                            0       82.01   39.05
5                                    0       44.72
6                                            0
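The matrix above can be reproduced directly from the score table. The following is a minimal Python sketch, not part of the original slides; the names `scores`, `euclidean` and `dist` are chosen here for illustration.

```python
import math

# Scores (Chinese, Math) for students 1..6, copied from the example table.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def euclidean(x, y):
    """d_ij = sqrt(sum over k of (x_k - y_k)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Pairwise distances, keyed by (i, j) with i < j (upper triangle).
dist = {(i, j): euclidean(scores[i], scores[j])
        for i in scores for j in scores if i < j}

# Print one row per student, upper triangle only.
for i in scores:
    row = " ".join(f"{dist[(i, j)]:6.2f}" for j in scores if j > i)
    print(i, row)
```

The closest pair under this measure is students 1 and 4 (distance 13.93), which is why they are merged first in the centroid example later in the deck.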

(10)

Measures of Similarity

Distance (Dissimilarity) Function

(b) Manhattan (city block) Distance

$$d'_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}|$$

Manhattan distance is more robust to extreme values

The Manhattan distance between Students 1 and 2: |85 - 25| + |82 - 32| = 110
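As a quick check of the city block distance for Students 1 and 2, here is a small Python sketch (an illustration added here, with the hypothetical helper name `manhattan`); the scores are taken from the example table.

```python
def manhattan(x, y):
    """City block distance: sum over k of |x_k - y_k|."""
    return sum(abs(a - b) for a, b in zip(x, y))

student1 = (85, 82)  # (Chinese, Math) for Student 1
student2 = (25, 32)  # (Chinese, Math) for Student 2

d12 = manhattan(student1, student2)
print(d12)  # |85 - 25| + |82 - 32| = 60 + 50 = 110
```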

(11)

Measures of Similarity

(12)

Measures of Similarity

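Similarity Measures

Pearson product-moment correlation coefficient

$$r_{ij} = \frac{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\left[ \sum_{k=1}^{p} (X_{ik} - \bar{X}_i)^2 \sum_{k=1}^{p} (X_{jk} - \bar{X}_j)^2 \right]^{1/2}}$$

Distance measure based on the Pearson correlation coefficient

$$d_{ij} = 1 - r_{ij}$$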

(13)

Measures of Similarity

Other Similarity Measures

Angular distance (uncentered Pearson correlation coefficient)

$$r'_{ij} = \frac{\sum_{k=1}^{p} X_{ik} X_{jk}}{\left[ \sum_{k=1}^{p} X_{ik}^2 \sum_{k=1}^{p} X_{jk}^2 \right]^{1/2}}$$

Spearman correlation coefficient

Use the ranks of the observations in the Pearson correlation
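To make the three correlation-type measures concrete, the sketch below computes them in plain Python. The vectors `x` and `y` are hypothetical values invented for illustration, and the `ranks` helper assumes there are no ties.

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation between two vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def uncentered(x, y):
    """Uncentered correlation (cosine of the angle between x and y)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(x):
    """Ranks 1..p of the values in x; assumes no tied values."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    r = [0] * len(x)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4]      # hypothetical profile of object i
y = [1, 4, 9, 16]     # hypothetical profile of object j (monotone in x)

print(round(pearson(x, y), 4), round(uncentered(x, y), 4), round(spearman(x, y), 4))
print("Pearson-based distance:", round(1 - pearson(x, y), 4))
```

Because `y` increases whenever `x` does, the ranks agree exactly and the Spearman coefficient is 1, while the Pearson and uncentered coefficients are slightly below 1.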

(14)

Measures of Similarity



Correlation coefficient



A measure for association



Not a measure for similarity (or

agreement)



Euclidean distance

(15)

Measures of Similarity



Example

Case I          Case II         Case III
X1    X2        X1    X2        X1    X2
1     1         1     2         1     4
2     2         2     4         2     8
3     3         3     6         3     12
4     4         4     8         4     16
r=1, d^2=0      r=1, d^2=30     r=1, d^2=270
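The three cases can be verified numerically. The sketch below (added for illustration; function names are hypothetical) computes the Pearson correlation and the squared Euclidean distance between X1 and X2 in each case, showing that perfect correlation does not imply agreement.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def squared_euclidean(x, y):
    """Squared Euclidean distance d^2 = sum of (x_k - y_k)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x1 = [1, 2, 3, 4]
cases = {
    "I": [1, 2, 3, 4],
    "II": [2, 4, 6, 8],
    "III": [4, 8, 12, 16],
}

results = {name: (pearson(x1, x2), squared_euclidean(x1, x2))
           for name, x2 in cases.items()}

for name, (r, d2) in results.items():
    print(f"Case {name}: r = {r:.2f}, d^2 = {d2}")
```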

(16)

Hierarchical Clustering



General Steps for n objects



Step 1: There are n clusters at the beginning and each object is a cluster. Compute pairwise distances among all clusters

Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster

Step 3: Based on n-1 clusters, compute pairwise distances among all n-1 clusters

Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster

(17)

(18)

Hierarchical Clustering



Single Linkage (Nearest-neighbor) Method



Use the minimal distance



Distance matrix for 5 objects

      1     2     3     4     5
1     0
2     9     0
3     3     7     0
4     6     5     9     0
5     11    10    2     8     0

(19)

Hierarchical Clustering



Single Linkage Method



Step 1: 5 clusters: {1},{2},{3},{4},{5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the minimal distance among {35},{1},{2},{4}

$d_{(35)1} = \min[d_{31}, d_{51}] = \min[3, 11] = 3$

$d_{(35)2} = \min[d_{32}, d_{52}] = \min[7, 10] = 7$

$d_{(35)4} = \min[d_{34}, d_{54}] = \min[9, 8] = 8$

(20)

Hierarchical Clustering



Single Linkage Method



Update the distance matrix

        {35}   1     2     4
{35}    0
1       3      0
2       7      9     0
4       8      6     5     0

(21)

Hierarchical Clustering



Single Linkage Method



Step 4: The minimal distance is 3, between {35} and {1}; merge {35} and {1} into {135}

Step 5: Find the distances between {135} and {2} and {4}

$d_{(135)2} = \min[d_{(35)2}, d_{12}] = \min[7, 9] = 7$

$d_{(135)4} = \min[d_{(35)4}, d_{14}] = \min[8, 6] = 6$

(22)

Hierarchical Clustering



Single Linkage Method



Update the distance matrix

         {135}   2     4
{135}    0
2        7       0
4        6       5     0

The minimal distance is 5, between {2} and {4}

(23)

Hierarchical Clustering



Single Linkage Method

Find the minimum distance between {135} and {24}

$d_{(135)(24)} = \min[d_{(135)2}, d_{(135)4}] = \min[7, 6] = 6$

Update the distance matrix

         {135}   {24}
{135}    0
{24}     6       0

(24)

Hierarchical Clustering



Single Linkage Method



Distance

Clusters



2 {1},{35},{2},{4}



3 {135},{2},{4}



4 {135},{2},{4}



5 {135},{24}



6 {12345}
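The single linkage steps can be traced with a short program. This is a minimal sketch under the assumption that the five-object distance matrix is the one implied by the worked steps (d35 = 2, d31 = 3, and so on); the function names `single_link` and `agglomerate` are illustrative, not from the slides.

```python
from itertools import combinations

# Pairwise distances for objects 1..5, keyed by (larger, smaller) index.
D = {
    (2, 1): 9,
    (3, 1): 3, (3, 2): 7,
    (4, 1): 6, (4, 2): 5, (4, 3): 9,
    (5, 1): 11, (5, 2): 10, (5, 3): 2, (5, 4): 8,
}

def d(i, j):
    """Distance between two single objects."""
    return D[(max(i, j), min(i, j))]

def single_link(a, b):
    """Single linkage: smallest distance between a member of a and a member of b."""
    return min(d(i, j) for i in a for j in b)

def agglomerate(objects, cluster_dist):
    """Repeatedly merge the two closest clusters; record (height, clusters) per merge."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((height, sorted(sorted(c) for c in clusters)))
    return history

history = agglomerate([1, 2, 3, 4, 5], single_link)
for height, clusters in history:
    print(height, clusters)
```

The merge heights come out as 2, 3, 5 and 6. This naive version recomputes every between-cluster distance at each step, which is fine for five objects but not how an efficient implementation would do it.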

(25)

Hierarchical Clustering



Dendrograms



A 2-dimensional tree structure rooted at the top

One dimension is the distance measure

Another dimension is the clustering results

The height of a vertical (horizontal) line represents the distance between the two clusters it merges

(26)

Hierarchical Clustering



Complete Linkage (Farthest-neighbor) Method



Use the maximal distance



Distance matrix for 5 objects

      1     2     3     4     5
1     0
2     9     0
3     3     7     0
4     6     5     9     0
5     11    10    2     8     0

(27)

(28)

Hierarchical Clustering



Complete Linkage Method



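Step 1: 5 clusters: {1},{2},{3},{4},{5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the maximal distance among {35},{1},{2},{4}

$d_{(35)1} = \max[d_{31}, d_{51}] = \max[3, 11] = 11$

$d_{(35)2} = \max[d_{32}, d_{52}] = \max[7, 10] = 10$

$d_{(35)4} = \max[d_{34}, d_{54}] = \max[9, 8] = 9$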

(29)

Hierarchical Clustering



Complete Linkage Method



Update the distance matrix

        {35}   1     2     4
{35}    0
1       11     0
2       10     9     0
4       9      6     5     0

(30)

Hierarchical Clustering



Complete Linkage Method



Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}

Step 5: Find the maximal distances

$d_{(24)(35)} = \max[d_{2(35)}, d_{4(35)}] = \max[10, 9] = 10$

$d_{(24)1} = \max[d_{21}, d_{41}] = \max[9, 6] = 9$

(31)

Hierarchical Clustering



Complete Linkage Method



Update the distance matrix

        {35}   {24}   1
{35}    0
{24}    10     0
1       11     9      0

The minimal distance is 9, between {1} and {24}

(32)

Hierarchical Clustering



Complete Linkage Method

Find the maximal distance between {124} and {35}

$d_{(124)(35)} = \max[d_{1(35)}, d_{(24)(35)}] = \max[11, 10] = 11$

Update the distance matrix

         {35}   {124}
{35}     0
{124}    11     0

(33)

Hierarchical Clustering



Complete Linkage Method



Distance

Clusters



2 {35},{1},{2},{4}



5 {35},{1},{24}



9 {35},{124}



11 {12345}
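The complete linkage sequence can be checked the same way. The sketch below is an illustration added here, assuming the five-object distance matrix implied by the worked steps; only the between-cluster distance changes, taking the largest pairwise distance instead of the smallest.

```python
from itertools import combinations

# Pairwise distances for objects 1..5, keyed by (larger, smaller) index.
D = {
    (2, 1): 9,
    (3, 1): 3, (3, 2): 7,
    (4, 1): 6, (4, 2): 5, (4, 3): 9,
    (5, 1): 11, (5, 2): 10, (5, 3): 2, (5, 4): 8,
}

def d(i, j):
    """Distance between two single objects."""
    return D[(max(i, j), min(i, j))]

def complete_link(a, b):
    """Complete linkage: largest distance between a member of a and a member of b."""
    return max(d(i, j) for i in a for j in b)

def agglomerate(objects, cluster_dist):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((height, sorted(sorted(c) for c in clusters)))
    return history

history = agglomerate([1, 2, 3, 4, 5], complete_link)
for height, clusters in history:
    print(height, clusters)
```

Note that the merge rule is still "smallest between-cluster distance first"; the maximum is used only to define the distance between two clusters. The heights are 2, 5, 9 and 11.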

(34)

(35)

Average Clustering



Average Linkage Method



Use the average distance

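Average distance between cluster {uv} and cluster {w}

$$d_{(uv)w} = \frac{\sum_{i \in \{uv\}} \sum_{k \in \{w\}} d_{ik}}{n_{(uv)} \, n_{w}}$$

where $n_{(uv)}$ and $n_{w}$ are the numbers of objects in clusters {uv} and {w}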





(36)

Average Clustering



Average Linkage Method



Use the average distance



Distance matrix for 5 objects

      1     2     3     4     5
1     0
2     9     0
3     3     7     0
4     6     5     9     0
5     11    10    2     8     0

(37)

Hierarchical Clustering



Average Linkage Method



Step 1: 5 clusters: {1},{2},{3},{4},{5}

Step 2: $\min\{d_{ij}\} = d_{35} = 2$; merge objects 3 and 5 into one cluster {35}

Step 3: Find the average distance among {35},{1},{2},{4}

$d_{(35)1} = (d_{31} + d_{51})/(2 \times 1) = (3 + 11)/2 = 7$

$d_{(35)2} = (d_{32} + d_{52})/(2 \times 1) = (7 + 10)/2 = 8.5$

$d_{(35)4} = (d_{34} + d_{54})/(2 \times 1) = (9 + 8)/2 = 8.5$

(38)

Hierarchical Clustering



Average Linkage Method



Update the distance matrix

        {35}   1     2     4
{35}    0
1       7      0
2       8.5    9     0
4       8.5    6     5     0

(39)

Hierarchical Clustering



Average Linkage Method



Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}

Step 5: Find the average distances

$d_{(24)(35)} = (d_{23} + d_{25} + d_{43} + d_{45})/(2 \times 2) = (7 + 10 + 9 + 8)/(2 \times 2) = 8.5$

$d_{(24)1} = (d_{21} + d_{41})/(2 \times 1) = (9 + 6)/2 = 7.5$

(40)

Hierarchical Clustering



Average Linkage Method



Update the distance matrix

        {35}   {24}   1
{35}    0
{24}    8.5    0
1       7      7.5    0

The minimal distance is 7, between {1} and {35}

(41)

Hierarchical Clustering



Average Linkage Method

Find the average distance between {24} and {135}



$d_{(24)(135)} = (d_{12} + d_{14} + d_{32} + d_{34} + d_{52} + d_{54})/(3 \times 2) = (9 + 6 + 7 + 9 + 10 + 8)/6 = 8.17$

Update the distance matrix

         {135}   {24}
{135}    0
{24}     8.17    0

(42)

Hierarchical Clustering



Average Linkage Method



Distance

Clusters



2 {35},{1},{2},{4}



5 {35},{1},{24}



7 {135},{24}



8.17 {12345}
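For average linkage, the between-cluster distance is the mean of all pairwise distances between the two clusters. The following sketch (illustrative names, same assumed five-object distance matrix as in the worked steps) reproduces the merge sequence; the final merge joins {135} and {24} at 49/6, about 8.17.

```python
from itertools import combinations

# Pairwise distances for objects 1..5, keyed by (larger, smaller) index.
D = {
    (2, 1): 9,
    (3, 1): 3, (3, 2): 7,
    (4, 1): 6, (4, 2): 5, (4, 3): 9,
    (5, 1): 11, (5, 2): 10, (5, 3): 2, (5, 4): 8,
}

def d(i, j):
    """Distance between two single objects."""
    return D[(max(i, j), min(i, j))]

def average_link(a, b):
    """Average linkage: mean of all distances between members of a and members of b."""
    return sum(d(i, j) for i in a for j in b) / (len(a) * len(b))

def agglomerate(objects, cluster_dist):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((height, sorted(sorted(c) for c in clusters)))
    return history

history = agglomerate([1, 2, 3, 4, 5], average_link)
for height, clusters in history:
    print(round(height, 2), clusters)
```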

(43)

(44)

Hierarchical Clustering



Example Manly (2005)

Distance Matrix of 5 objects

      1     2     3     4     5
1     0
2     2     0
3     6     5     0

(45)

Hierarchical Clustering



Single Linkage Method



Distance

Clusters



2 {12},{3},{4},{5}



3 {12},{3},{45}



4 {12},{345}



5 {12345}

The same results are obtained from the complete and average linkage methods

(46)

(47)

Hierarchical Clustering

Example: Canine group by single linkage clustering

Distance   Clusters                       # of clusters
0.72       {MD,PD},GJ,CW,IW,CU,DI         6
1.38       {MD,PD,CU},GJ,CW,IW,DI         5
1.63       {MD,PD,CU},GJ,CW,IW,DI         5
1.68       {MD,PD,CU,DI},GJ,CW,IW         4
2.07       {MD,PD,CU,DI,GJ},CW,IW         3
2.31       {MD,PD,CU,DI,GJ},{CW,IW}       2
2.37       {MD,PD,CU,DI,GJ,CW,IW}         1

(48)

(49)

Results of single linkage method

for European employment data

(50)

Hierarchical Clustering



Centroid (Center or Average) Method



Start with each object being a cluster



Merge the two clusters with the shortest distance



Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new clusters

Merge the two clusters with the shortest distance

Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new clusters

(51)

Introduction

Example

Student   Chinese (X1)   Math (X2)
1         85             82
2         25             32
3         65             55
4         90             95
5         40             30
6         60             70

(52)

Hierarchical Clustering



Centroid Method

Euclidean Distance matrix of 6 students

      1       2       3       4       5       6
1     0
2     78.10   0
3     33.60   46.14   0
4     13.93   90.52   47.17   0
5     68.77   15.13   35.36   82.01   0
6     27.73   51.66   15.81   39.05   44.72   0

(53)

Hierarchical Clustering



Centroid Method



The shortest distance is between student

{1} and student {4}



Merge {1} and {4} into {14}



Compute the averages for Chinese and math



Average of Chinese = (85+90)/2 = 87.5



Average of math = (82+95)/2=88.5

(54)

Hierarchical Clustering



Centroid Method



Update the Euclidean distance matrix

        {14}    2       3       5       6
{14}    0
2       84.25   0
3       40.35   46.14   0
5       75.36   15.13   35.36   0
6       33.14   51.66   15.81   44.72   0

(55)

Hierarchical Clustering



Centroid Method



The shortest distance is between {2} and

{5}



Merge {2} and {5} into {25}

The average of Chinese of {25} is 32.5

The average of math of {25} is 31.0

(56)

Hierarchical Clustering



Centroid Method



Update the Euclidean distance matrix

        {14}    {25}    3       6
{14}    0
{25}    79.57   0
3       40.35   40.40   0
6       33.14   47.72   15.81   0

(57)

Hierarchical Clustering



Centroid Method



The shortest distance is between {3} and

{6}



Merge {3} and {6} into {36}



The average of Chinese of {36} is 62.5



The average of math of {36} is 62.5

(58)

Hierarchical Clustering



Centroid Method



Update the Euclidean distance matrix

        {14}    {25}    {36}
{14}    0
{25}    79.57   0
{36}    36.07   43.50   0

(59)

Hierarchical Clustering



Centroid Method



The shortest distance is between {14} and

{36}



Merge {14} and {36} into {1346}



Cluster means

Cluster   Chinese   Math
{25}      32.5      31.0
{1346}    75.0      75.5

(60)

Hierarchical Clustering



Centroid Method



Distance between {25} and {1346} is 61.53

Distance   Clusters
13.93      {14},{2},{3},{5},{6}
15.13      {14},{25},{3},{6}
15.81      {14},{25},{36}
36.07      {1346},{25}
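The centroid method on the six students can be reproduced with a short sketch. This is an illustration added here, not the authors' code; it recomputes each cluster centroid from the raw scores and measures Euclidean distance between centroids. The names `centroid`, `centroid_dist` and `agglomerate` are hypothetical.

```python
import math
from itertools import combinations

# Scores (Chinese, Math) for students 1..6, copied from the example table.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def centroid(members):
    """Mean of each variable over the students in a cluster."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def centroid_dist(a, b):
    """Euclidean distance between the centroids of clusters a and b."""
    ca, cb = centroid(a), centroid(b)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(ca, cb)))

def agglomerate(objects, cluster_dist):
    """Merge the two closest clusters until one remains; record each merge."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((height, sorted(sorted(c) for c in clusters)))
    return history

history = agglomerate(list(scores), centroid_dist)
for height, clusters in history:
    print(f"{height:6.2f}", clusters)
```

The merge distances are 13.93, 15.13, 15.81 and 36.07, followed by a final merge of {25} with {1346} at 61.53, the centroid distance stated above.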

(61)

(62)

Hierarchical Clustering



Application to gene expression data

from microarray experiments



# of genes >>> # of subjects



Clustering in two directions



Clusters of subjects (patients)



Clusters of genes

(63)

(64)

(65)

(66)

(67)

(68)

Hierarchical Clustering



The complexity of a bottom-up method can vary between $n^2$ and $n^3$, depending on the linkage chosen.

The complexity of a top-down method can vary between $n \log n$ and $n^2$, depending on …

(69)

Hierarchical Clustering



Determination of the number of clusters



Criteria



Root-mean-square total-sample standard deviation (RMSSTD)



Semipartial R-square (SPRSQ)



R-square (RSQ)

(70)

(71)

Hierarchical Clustering



Determination of the number of clusters

Example: test scores of 6 students

# of clusters   RMSSTD   SPRSQ    RSQ    MD
5               6.96     0.0145   0.98   0.30
4               7.57     0.0171   0.97   0.33
3               7.91     0.0187   0.95   0.34
2               15.93    0.1946   0.76   0.60
1               25.86    0.7751   0.00   0.77
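Two of these criteria can be recomputed from the student scores. The sketch below rests on assumed definitions that are not spelled out on the slides: RSQ is taken as 1 minus the within-cluster sum of squares divided by the total sum of squares, and RMSSTD as the pooled standard deviation of the variables within the newly formed cluster. These assumptions reproduce the tabulated RMSSTD and RSQ values for the rows checked.

```python
import math

# Scores (Chinese, Math) for students 1..6, copied from the example table.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}
P = 2  # number of variables

def ss(members):
    """Sum of squared deviations from the cluster mean, summed over variables."""
    pts = [scores[m] for m in members]
    means = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for pt in pts for v, m in zip(pt, means))

TOTAL_SS = ss(scores.keys())

def rsq(partition):
    """Assumed definition: 1 - (within-cluster SS) / (total SS)."""
    return 1 - sum(ss(c) for c in partition) / TOTAL_SS

def rmsstd(members):
    """Assumed definition: pooled SD over the P variables within one cluster."""
    return math.sqrt(ss(members) / (P * (len(members) - 1)))

three = [{1, 4}, {2, 5}, {3, 6}]   # 3-cluster solution from the centroid method
two = [{1, 3, 4, 6}, {2, 5}]       # 2-cluster solution from the centroid method

print("RSQ, 3 clusters:", round(rsq(three), 2))
print("RSQ, 2 clusters:", round(rsq(two), 2))
print("RMSSTD of {1346}:", round(rmsstd({1, 3, 4, 6}), 2))
print("RMSSTD of all six:", round(rmsstd(scores.keys()), 2))
```

The sharp drop in RSQ (and jump in RMSSTD) when going from 3 clusters to 2 is the kind of pattern these criteria are used to detect.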

(72)

K-means Clustering



Step 1: Select the number of clusters, say K, and choose a distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient

Step 2: Divide the n objects into K clusters, either randomly or based on a preliminary hierarchical clustering

(73)

K-means Clustering



Step 4: For each object, find the minimal distance to the cluster centroids and reallocate the object to the corresponding cluster

Step 5: Update the clusters and their centroids

Step 6: Repeat Step 3 and Step 4 until no reallocation of objects among clusters occurs
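The K-means loop can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the slides' own procedure: K = 2, Euclidean distance, the six student scores as data, and an arbitrary initial split {1,2,3} / {4,5,6} chosen here. The centroid and distance computation inside the loop is my reading of what the slides call Step 3, and the sketch assumes no cluster becomes empty.

```python
import math

# Scores (Chinese, Math) for students 1..6, copied from the example table.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def centroid(members):
    """Mean of each variable over the objects in a cluster (must be non-empty)."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(initial, max_iter=100):
    """Reallocate objects to the nearest centroid until nothing moves."""
    clusters = [set(c) for c in initial]
    for _ in range(max_iter):
        centers = [centroid(c) for c in clusters]      # centroids of current clusters
        new = [set() for _ in clusters]
        for obj, x in scores.items():                  # reallocate each object
            nearest = min(range(len(centers)),
                          key=lambda k: euclidean(x, centers[k]))
            new[nearest].add(obj)
        if new == clusters:                            # no reallocation: stop
            break
        clusters = new                                 # update the clusters
    return clusters

# Arbitrary initial partition, chosen only for illustration.
final = kmeans([{1, 2, 3}, {4, 5, 6}])
print(sorted(sorted(c) for c in final))
```

With this starting partition the procedure settles on {2,3,5} and {1,4,6}, which differs from the two-cluster split found by the centroid method ({25} and {1346}); the outcome of K-means depends on the initial clusters.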

(74)

(75)

(76)

(77)

(78)

K-means Clustering



The number of computations that need

to be performed can be written as c*p

where c is a value that does depend on

the number of iterations and p is the

number of variables (e.g., the number

of genes)

(79)

K-means Clustering



The number of clusters is selected to

maximize the between-cluster sum of squares

(variation) and to minimize the within-cluster

sum of squares (variation)



The best-of-10 partition: to apply K-means

method 10 times using 10 different randomly

chosen sets of initial clusters and choose the

result that minimizes the within-cluster sum

of squares
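The within-cluster sum of squares used to compare candidate partitions can be computed as below. This is a minimal sketch on the six student scores; the two candidate partitions are picked here for illustration (they are not the output of ten random starts), and the helper names are hypothetical.

```python
# Scores (Chinese, Math) for students 1..6, copied from the example table.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def cluster_ss(members):
    """Sum of squared deviations from the cluster centroid, over all variables."""
    pts = [scores[m] for m in members]
    means = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for pt in pts for v, m in zip(pt, means))

def wss(partition):
    """Within-cluster sum of squares of a whole partition."""
    return sum(cluster_ss(c) for c in partition)

# Two candidate 2-cluster partitions (illustrative).
candidates = [
    [{2, 5}, {1, 3, 4, 6}],
    [{2, 3, 5}, {1, 4, 6}],
]

for part in candidates:
    print(sorted(sorted(c) for c in part), round(wss(part), 1))

# Keep the candidate with the smallest within-cluster sum of squares.
best = min(candidates, key=wss)
print("best:", sorted(sorted(c) for c in best))
```

Here the partition {25}, {1346} has the smaller within-cluster sum of squares (1637.5 versus 2032.0), so it would be the one retained.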

(80)

Issues and Limitations



With considerable overlap between the initial

groups, cluster analysis may produce a result

that is quite different from the true situation



Different approaches may produce different results.

The dendrogram itself is almost never the answer to the research question.

(81)

(82)

Issues and Limitations



Shape of clusters will create difficulty in

cluster analysis



(a) and (b): can be found by any reasonable algorithm



(c) some methods will fail because of

overlapping points



(d), (e) and (f): great challenges for most

(83)

Issues and Limitations



Anything can be clustered



Clustering algorithms applied to the same data may produce different results

Ignoring the magnitudes of the distance measures in the dendrogram

Position of the patterns within the clusters does not reflect their relationship in the input space

(84)

(85)

(86)

(87)

Summary



Goals



Methods



Hierarchical Methods



Single



Complete



Average



Centroid



K-means Clustering



Limitations