Multivariate Statistical Methods－Cluster Analysis

(1)

Multivariate Statistical Methods

Cluster Analysis

By Jen-pei Liu, PhD

Division of Biometry, Department of Agronomy, National Taiwan University

and

Division of Biostatistics and Bioinformatics National Health Research Institutes

(2)

Cluster Analysis



Introduction



Measures of Similarity



Hierarchical Clustering



K-means Clustering



Summary

(3)

Introduction



A sample of n objects, each with measurements of p variables

To use the measurements of p variables to devise a scheme for grouping n objects into classes

(4)

Introduction



In general, the number of clusters is not known in advance: unsupervised analysis

In discriminant analysis, the number of classes is pre-specified and classification is based on a prediction function: supervised analysis

(5)

Introduction



Examples

 Cluster of depressed patients
 Data reduction
 Marketing
   Test markets: a large number of cities
   A small number of groups of similar cities
   One member from each group selected for testing
 Microarray
   Clusters of genes
   Clusters of subjects

(6)

Introduction



Types of Clustering Methods

 Hierarchical Clustering
   To find a series of partitions
   A bottom-up clustering
 Partitional Method
   To produce a single partition of objects
   A top-down clustering

(7)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(8)

Measures of Similarity

Distance (Dissimilarity) Function

Data vectors

$$X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})', \qquad X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})'$$

(a) Euclidean Distance

$$d_{ij} = \left[ \sum_{k=1}^{p} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

(9)

Measures of Similarity

 Euclidean Distances Matrix for 6 students

     1    2      3      4      5      6
1    0    78.10  33.60  13.93  68.77  27.73
2         0      46.14  90.52  15.13  51.66
3                0      47.17  35.36  15.81
4                       0      82.01  39.05
5                              0      44.72
6                                     0
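The Euclidean distance matrix for the six students can be recomputed with a short sketch. It uses only the test scores from the example; the names `scores`, `euclidean`, and `dist` are illustrative choices, not part of the original slides.

```python
import math

# Test scores (Chinese, Math) for the six students in the example.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def euclidean(x, y):
    """Square root of the summed squared differences over all p variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Upper-triangular distance matrix keyed by (i, j) with i < j.
dist = {(i, j): euclidean(scores[i], scores[j])
        for i in scores for j in scores if i < j}

for (i, j), d in sorted(dist.items()):
    print(f"d({i},{j}) = {d:.2f}")
```

Students 1 and 4 are the closest pair (13.93) and students 2 and 4 the farthest (90.52).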

(10)

Measures of Similarity

Distance (Dissimilarity) Function

(b) Manhattan (city block) Distance

$$d'_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}|$$

Manhattan distance is more robust to extreme values

The Manhattan distance between Student 1 and 2: |85 - 25| + |82 - 32| = 110
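As a minimal sketch, the city block distance between Students 1 and 2 can be computed from their test scores; the function name `manhattan` is an illustrative choice.

```python
def manhattan(x, y):
    """Sum of absolute differences over all p variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

student1 = (85, 82)  # (Chinese, Math)
student2 = (25, 32)

d12 = manhattan(student1, student2)
print(d12)  # 110
```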

(11)

Measures of Similarity

(12)

Measures of Similarity

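Similarity Measures

Pearson product-moment correlation coefficient

$$r_{ij} = \frac{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)^2 \, \sum_{k=1}^{p} (X_{jk} - \bar{X}_j)^2}}$$

Distance measure based on Pearson correlation coefficient

$$d_{ij} = 1 - r_{ij}$$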

(13)

Measures of Similarity

Other Similarity Measures

Angular distance (Uncentered Pearson correlation coefficient)

$$r'_{ij} = \frac{\sum_{k=1}^{p} X_{ik} X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2 \, \sum_{k=1}^{p} X_{jk}^2}}$$

Spearman correlation coefficient

Use the ranks of observations for Pearson correlation
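The three correlation-based measures can be sketched in plain Python. The vectors `x` and `y` below are hypothetical and chosen only for illustration, and the `ranks` helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation between two data vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def uncentered(x, y):
    """Uncentered correlation: the cosine of the angle between x and y."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(v):
    """Ranks 1..p of the observations (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda k: v[k])
    r = [0] * len(v)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r

def spearman(x, y):
    """Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical vectors: y increases with x, but not linearly.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
print(pearson(x, y), uncentered(x, y), spearman(x, y))
print(1 - pearson(x, y))  # correlation-based distance
```

Because `y` is a monotone function of `x`, the Spearman coefficient equals 1 while the Pearson coefficient is slightly below 1.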

(14)

Measures of Similarity



Correlation coefficient

 A measure of association

 Not a measure of similarity (or agreement)



Euclidean distance

(15)

Measures of Similarity

 Example

Case I        Case II       Case III
X1   X2       X1   X2       X1   X2
1    1        1    2        1    4
2    2        2    4        2    8
3    3        3    6        3    12
4    4        4    8        4    16
r=1, d²=0     r=1, d²=30    r=1, d²=270
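The three cases can be checked with a small sketch: the correlation is 1 in every case, while the squared Euclidean distance grows as X2 moves away from X1. The helper names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation between two data vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def squared_distance(x, y):
    """Squared Euclidean distance between two data vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x1 = [1, 2, 3, 4]
cases = {"I": [1, 2, 3, 4], "II": [2, 4, 6, 8], "III": [4, 8, 12, 16]}

results = {name: (pearson(x1, x2), squared_distance(x1, x2))
           for name, x2 in cases.items()}

for name, (r, d2) in results.items():
    print(f"Case {name}: r = {r:.2f}, d2 = {d2}")
```

This illustrates why perfect association (r = 1) does not imply agreement between the two vectors.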

(16)

Hierarchical Clustering

 General Steps for n objects

 Step 1: There are n clusters at the beginning and each object is a cluster. Compute pairwise distances among all clusters

 Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster

 Step 3: Based on n-1 clusters, compute pairwise distances among all n-1 clusters

 Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster

(17)

(18)

Hierarchical Clustering

 Single Linkage (Nearest-neighbor) Method

 Use the minimal distance

 Distance matrix for 5 objects

     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0

(19)

Hierarchical Clustering



Single Linkage Method

 Step 1: 5 clusters:{1},{2},{3},{4},{5}

 Step 2: min{d_ij} = d₃₅ = 2; merge objects 3 and 5 into one cluster {35}

 Step 3: Find the minimal distance among {35},{1},{2},{4}

   d_{35}1 = min[d₃₁,d₅₁] = min[3,11] = 3
   d_{35}2 = min[d₃₂,d₅₂] = min[7,10] = 7

(20)

Hierarchical Clustering



Single Linkage Method

 Update the distance matrix

       {35}   1    2    4
{35}   0
1      3      0

(21)

Hierarchical Clustering



Single Linkage Method

 Step 4: The minimal distance is 3 between {35} and {1}; merge {35} and {1} into {135}

 Step 5: Find the distances between {135} and {2} and {4}

   d_{135}2 = min[d_{35}2,d₁₂] = min[7,9] = 7
   d_{135}4 = min[d_{35}4,d₁₄] = min[8,6] = 6

(22)

Hierarchical Clustering

 Single Linkage Method

        {135}  2    4
{135}   0
2       7      0
4       6      5    0

The minimal distance is 5 between {2} and {4}

(23)

Hierarchical Clustering

 Single Linkage Method

Find the minimum distance between {135} and {24}

 d_{135}{24} = min[d_{135}2,d_{135}4] = min[7,6] = 6

        {135}  {24}
{135}   0

(24)

Hierarchical Clustering



Single Linkage Method

Distance  Clusters
2         {1},{35},{2},{4}
3         {135},{2},{4}
4         {135},{2},{4}
5         {135},{24}
6         {12345}
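The single linkage steps can be reproduced with a minimal agglomerative sketch. The pairwise distances in `d` are the ones quoted in the worked steps for the five objects; the function names (`agglomerate`, `single_linkage`) are illustrative, and ties in the minimum distance are not treated specially.

```python
from itertools import combinations

# Pairwise distances for the five objects, as used in the worked steps.
d = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def dist(i, j):
    """Look up the distance between objects i and j."""
    return d[(min(i, j), max(i, j))]

def single_linkage(a, b):
    """Nearest-neighbor distance between clusters a and b."""
    return min(dist(i, j) for i in a for j in b)

def agglomerate(objects, linkage):
    """Repeatedly merge the two closest clusters; return (height, merged cluster)."""
    clusters = [frozenset([o]) for o in objects]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((linkage(a, b), sorted(a | b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

merges = agglomerate([1, 2, 3, 4, 5], single_linkage)
for height, members in merges:
    print(height, members)
```

The merges occur at distances 2, 3, 5 and 6, matching the table above. Passing a different linkage function to `agglomerate` changes only how the distance between two clusters is measured.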

(25)

Hierarchical Clustering



Dendrograms

 A 2-dimensional tree structure rooted at the top

 One dimension is the distance measure

 Another dimension is the clustering results

 The height of the vertical (horizontal) line represents the distance between the two clusters it merges

(26)

Hierarchical Clustering

 Complete Linkage (Farthest-neighbor) Method

 Use the maximal distance

     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0

(27)

(28)

Hierarchical Clustering



Complete Linkage Method

 Step 1: 5 clusters:{1},{2},{3},{4},{5}

 Step 3: Find the maximal distance among

{3,5},{1},{2},{4}

(29)

Hierarchical Clustering



Complete Linkage Method

       {35}   1    2    4
{35}   0
1      11     0
2      10     9    0

(30)

Hierarchical Clustering



Complete Linkage Method

 Step 4: The minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}

 Step 5: Find the maximal distances

   d_{24}{35} = max[d_2{35}, d_4{35}] = max[10,9] = 10
   d_{24}1 = max[d₂₁, d₄₁] = max[9,6] = 9

(31)

Hierarchical Clustering

 Complete Linkage Method

       {35}   {24}  1
{35}   0
{24}   10     0
1      11     9     0

The minimal distance is 9 between {1} and {24}

(32)

Hierarchical Clustering

 Complete Linkage Method

Find the maximal distance between {124} and {35}

 d_{124}{35} = max[d_1{35}, d_{24}{35}] = max[11,10] = 11

       {35}   {124}
{35}   0

(33)

Hierarchical Clustering



Complete Linkage Method

 Distance Clusters

 2 {35},{1},{2},{4}

 5 {35},{1},{24}

 9 {35},{124}

(34)

(35)

Average Clustering

 Average Linkage Method

 Use the average distance

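Average distance between cluster {uv} and cluster {w}

$$d_{\{uv\}\{w\}} = \frac{\sum_{i \in \{uv\}} \sum_{k \in \{w\}} d_{ik}}{n_{\{uv\}} \, n_{\{w\}}}$$

where n_{uv} and n_{w} are the numbers of objects in clusters {uv} and {w}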





(36)

Average Clustering

 Use the average distance

     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0

(37)

Hierarchical Clustering



Average Linkage Method

 Step 1: 5 clusters:{1},{2},{3},{4},{5}

 Step 3: Find the average distance among {35},{1},{2},{4}

   d_{35}1 = (d₃₁+d₅₁)/(2x1) = (3+11)/2 = 7
   d_{35}2 = (d₃₂+d₅₂)/(2x1) = (7+10)/2 = 8.5

(38)

Hierarchical Clustering



Average Linkage Method

       {35}   1    2    4
{35}   0
1      7      0

(39)

Hierarchical Clustering



Average Linkage Method

 Step 4: The minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}

 Step 5: Find the average distances

   d_{24}{35} = (d₂₃+d₂₅+d₄₃+d₄₅)/(2x2) = (7+10+9+8)/(2x2) = 8.5
   d_{24}1 = (d₂₁+d₄₁)/(2x1) = (9+6)/2 = 7.5

(40)

Hierarchical Clustering

       {35}   {24}  1
{35}   0
{24}   8.5    0
1      7      7.5   0

The minimal distance is 7 between {1} and {35}

(41)

Hierarchical Clustering

Find the average distance between {24} and {135}

 d_{24}{135} = (d₁₂+d₁₄+d₃₂+d₃₄+d₅₂+d₅₄)/(3x2) = (9+6+7+9+10+8)/6 = 8.17

        {135}  {24}
{135}   0

(42)

Hierarchical Clustering



Average Linkage Method

 2 {35},{1},{2},{4}

 5 {35},{1},{24}

 7 {135},{24}

(43)

(44)

Hierarchical Clustering



Example Manly (2005)

Distance Matrix of 5 objects

1 2 3 4 5

1 0

2 2 0

3 6 5 0

(45)

Hierarchical Clustering



Single Linkage Method

 2 {12},{3},{4},{5}

 3 {12},{3},{45}

 4 {12},{345}

 5 {12345}

The same results are obtained from the complete and average linkage methods

(46)

(47)

Hierarchical Clustering

Example: Canine group by single linkage clustering

Distance  Clusters                      #
0.72      {MD,PD},GJ,CW,IW,CU,DI        6
1.38      {MD,PD,CU},GJ,CW,IW,DI        5
1.68      {MD,PD,CU,DI},GJ,CW,IW        4
2.07      {MD,PD,CU,DI,GJ},CW,IW        3
2.31      {MD,PD,CU,DI,GJ},{CW,IW}      2
2.37      {MD,PD,CU,DI,GJ,CW,IW}        1

(48)

(49)

Results of single linkage method

for European employment data

(50)

Hierarchical Clustering

 Centroid (Center or Average) Method

 Start with each object being a cluster

 Merge the two clusters with the shortest distance

 Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new clusters

 Merge the two clusters with the shortest distance

 Compute the centroid as the average of all variables in the new cluster and update the distance matrix

(51)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(52)

Hierarchical Clustering



Centroid Method

Euclidean Distance matrix of 6 students

     1      2      3      4      5      6
1    0
2    78.10  0
3    33.60  46.14  0
4    13.93  90.52  47.17  0
5    68.77  15.13  35.36  82.01  0
6    27.73  51.66  15.81  39.05  44.72  0

(53)

Hierarchical Clustering



Centroid Method

 The shortest distance is between student {1} and student {4}

 Merge {1} and {4} into {14}

 Compute the averages for Chinese and math

   Average of Chinese = (85+90)/2 = 87.5
   Average of math = (82+95)/2 = 88.5

(54)

Hierarchical Clustering



Centroid Method

 Update the Euclidean distance matrix

{14} 2 3 5 6

{14} 0

2 84.25 0

3 40.35 46.14 0

(55)

Hierarchical Clustering



Centroid Method

 The shortest distance is between {2} and {5}

 The average of Chinese of {25} is 32.5

 The average of math of {25} is 31.0

(56)

Hierarchical Clustering



Centroid Method

{14} {25} 3 6

{14} 0

{25} 79.57 0

3 40.35 40.40 0

(57)

Hierarchical Clustering



Centroid Method

 The shortest distance is between {3} and {6}

 The average of Chinese of {36} is 62.5

 The average of math of {36} is 62.5

(58)

Hierarchical Clustering



Centroid Method

{14} {25} {36}

{14} 0

{25} 79.57 0

(59)

Hierarchical Clustering



Centroid Method

 The shortest distance is between {14} and {36}

 Merge {14} and {36} into {1346}

 Cluster means

Cluster  Chinese  Math
{25}     32.5     31.0
{1346}   75.0     75.5

(60)

Hierarchical Clustering



Centroid Method

 Distance between {25} and {1346} is 61.53

Distance  Clusters
13.93     {14},{2},{3},{5},{6}
15.13     {14},{25},{3},{6}
15.81     {14},{25},{36}
36.07     {1346},{25}
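The centroid method on the six students can be sketched as follows. The centroid of a merged cluster is taken as the mean of all its members, which matches the cluster means shown for {1346}; the variable and function names are illustrative.

```python
import math
from itertools import combinations

# Test scores (Chinese, Math) for the six students.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    """Average of each variable over all members of the cluster."""
    points = [scores[m] for m in members]
    return tuple(sum(col) / len(points) for col in zip(*points))

def centroid_distance(a, b):
    """Euclidean distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))

clusters = [frozenset([s]) for s in scores]
merges = []
while len(clusters) > 1:
    # Merge the pair of clusters whose centroids are closest.
    a, b = min(combinations(clusters, 2), key=lambda pair: centroid_distance(*pair))
    merges.append((round(centroid_distance(a, b), 2), sorted(a | b)))
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

for height, members in merges:
    print(height, members)
```

The merge distances come out as 13.93, 15.13, 15.81, 36.07 and 61.53, in line with the steps above.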

(61)

(62)

Hierarchical Clustering



Application to gene expression data

from microarray experiments

 # of genes >>> # of subjects

 Clustering in two directions

   Clusters of subjects (patients)
   Clusters of genes

(63)

(64)

(65)

(66)

(67)

(68)

Hierarchical Clustering



The complexity of a bottom-up method can vary between n² and n³, depending on the linkage chosen.

The complexity of a top-down method can vary between n log n and n²

(69)

Hierarchical Clustering



Determination of the number of clusters

 Criteria

 Root-mean-square total-sample standard deviation (RMSSTD)

 Semipartial R-square (SPRSQ)

 R-square (RSQ)

(70)

(71)

Hierarchical Clustering



Determination of the number of clusters

Example: test scores of 6 students

# of clusters  RMSSTD  SPRSQ   RSQ   MD
5              6.96    0.0145  0.98  0.30
4              7.57    0.0171  0.97  0.33
3              7.91    0.0187  0.95  0.34
2              15.93   0.1946  0.76  0.60
1              25.86   0.7751  0.00  0.77
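As a rough check of the RSQ column, the sketch below assumes RSQ is defined as one minus the ratio of the within-cluster sum of squares to the total sum of squares. Under that assumption, the partitions from the centroid-method example reproduce the tabulated values for 3 and 2 clusters. The function names are illustrative.

```python
# Test scores (Chinese, Math) for the six students.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def sum_of_squares(members):
    """Sum of squared deviations of the members from their own centroid."""
    points = [scores[m] for m in members]
    center = [sum(col) / len(points) for col in zip(*points)]
    return sum((v - c) ** 2 for p in points for v, c in zip(p, center))

total_ss = sum_of_squares(scores)  # all six students as one cluster

def rsq(partition):
    """Assumed definition: 1 - (within-cluster SS) / (total SS)."""
    within = sum(sum_of_squares(cluster) for cluster in partition)
    return 1 - within / total_ss

rsq_3 = rsq([{1, 4}, {2, 5}, {3, 6}])
rsq_2 = rsq([{1, 3, 4, 6}, {2, 5}])
print(round(rsq_3, 2), round(rsq_2, 2))  # 0.95 0.76
```

The sharp drop in RSQ from 3 clusters to 2 is the kind of change these criteria are meant to reveal.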

(72)

K-means Clustering

 Step 1: Select the number of clusters, say K, and determine the distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient

 Step 2: Divide n objects into K clusters, either randomly or based on a preliminary hierarchical clustering

(73)

K-means Clustering



Step 4: For each object, find the minimal distance and reallocate the object to the corresponding cluster with the minimal distance

Step 5: Update the clusters and their centroids

Step 6: Repeat Step 3 and Step 4 until no reallocation of objects among clusters occurs
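The reallocation loop can be sketched on the six-student data with K = 2 and Euclidean distance. This is a minimal illustration, not a definitive implementation: the two starting partitions are arbitrary choices made here, and the sketch assumes no cluster becomes empty during reallocation.

```python
import math

# Test scores (Chinese, Math) for the six students.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    """Mean of each variable over the cluster members (cluster must be non-empty)."""
    points = [scores[m] for m in members]
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(initial_clusters, max_iter=100):
    clusters = [set(c) for c in initial_clusters]
    for _ in range(max_iter):
        centers = [centroid(c) for c in clusters]
        new_clusters = [set() for _ in clusters]
        # Reallocate every object to the cluster with the nearest centroid.
        for student, point in scores.items():
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(point, centers[k]))
            new_clusters[nearest].add(student)
        if new_clusters == clusters:  # no reallocation occurred
            break
        clusters = new_clusters
    return clusters

from_arbitrary = kmeans([{1, 2, 3}, {4, 5, 6}])
from_hierarchical = kmeans([{2, 5}, {1, 3, 4, 6}])
print(from_arbitrary)     # [{2, 3, 5}, {1, 4, 6}]
print(from_hierarchical)  # [{2, 5}, {1, 3, 4, 6}]
```

The two starting partitions converge to different final clusters, which shows that the result of K-means can depend on the initial allocation.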

(74)

(75)

(76)

(77)

(78)

K-means Clustering



The number of computations that need to be performed can be written as c*p, where c is a value that depends on the number of iterations and p is the number of variables (e.g., the number of genes)

(79)

K-means Clustering

 The number of clusters is selected to maximize the between-cluster sum of squares (variation) and to minimize the within-cluster sum of squares (variation)

 The best-of-10 partition: apply the K-means method 10 times using 10 different randomly chosen sets of initial clusters and choose the result that minimizes the within-cluster sum of squares
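A best-of-10 run can be sketched on the six-student data as follows. Several details are assumptions made for illustration: the random initial clusters are equal-sized partitions of a shuffled list, a fixed `seed` is used for reproducibility, and clusters that lose all members are dropped.

```python
import math
import random

# Test scores (Chinese, Math) for the six students.
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    points = [scores[m] for m in members]
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(initial_clusters, max_iter=100):
    clusters = [set(c) for c in initial_clusters if c]
    for _ in range(max_iter):
        centers = [centroid(c) for c in clusters]
        new_clusters = [set() for _ in clusters]
        for obj, point in scores.items():
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(point, centers[k]))
            new_clusters[nearest].add(obj)
        new_clusters = [c for c in new_clusters if c]  # assumption: drop empty clusters
        if new_clusters == clusters:
            break
        clusters = new_clusters
    return clusters

def within_ss(clusters):
    """Within-cluster sum of squared distances to the cluster centroids."""
    return sum(math.dist(scores[m], centroid(c)) ** 2 for c in clusters for m in c)

def best_of(n_starts, k, seed=0):
    rng = random.Random(seed)
    runs = []
    for _ in range(n_starts):
        ids = list(scores)
        rng.shuffle(ids)
        initial = [set(ids[i::k]) for i in range(k)]  # random, non-empty initial clusters
        final = kmeans(initial)
        runs.append((within_ss(final), final))
    return min(runs, key=lambda run: run[0]), runs

best, runs = best_of(n_starts=10, k=2)
print(round(best[0], 1), best[1])
```

Keeping every run in `runs` makes it easy to see how much the within-cluster sum of squares varies across starting partitions.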

(80)

Issues and Limitations

 With considerable overlap between the initial groups, cluster analysis may produce a result that is quite different from the true situation

 Different approaches may produce different results.

 The dendrogram itself is almost never the answer to the research question.

(81)

(82)

Issues and Limitations



The shape of clusters can create difficulty in cluster analysis

 (a) and (b): found by any reasonable algorithm

 (c): some methods will fail because of overlapping points

 (d), (e) and (f): great challenges for most algorithms

(83)

Issues and Limitations

 Anything can be clustered

 The clustering algorithm applied to the same data may produce different results

 Ignore the magnitudes of distance measures in the dendrogram

 The position of the patterns within the clusters does not reflect their relationship in the input space

(84)

(85)

(86)

(87)

Summary

 Goals
 Methods
   Hierarchical Methods
     Single
     Complete
     Average
     Centroid
   K-means Clustering
 Limitations