Multivariate Statistical Methods
Cluster Analysis
By Jen-pei Liu, PhD
Division of Biometry, Department of Agronomy, National Taiwan University
and
Division of Biostatistics and Bioinformatics National Health Research Institutes
Cluster Analysis
Introduction
Measures of Similarity
Hierarchical Clustering
K-means Clustering
Summary
Introduction
A sample of n objects, each with
measurements of p variables
To use the measurements of p variables
to devise a scheme for grouping n
objects into classes
Introduction
In general, the number of clusters is not known in advance (unsupervised analysis)
The number of classes is pre-specified in discriminant analysis, which is based on a prediction function (supervised analysis)
Introduction
Examples
Clusters of depressed patients
Data reduction
Marketing
  Test markets: a large number of cities
  A small number of groups of similar cities
  One member from each group selected for testing
Microarray
  Clusters of genes
  Clusters of subjects
Introduction
Types of Clustering Methods
Hierarchical Clustering
  To find a series of partitions
  A bottom-up clustering
Partitional Method
  To produce a single partition of objects
  A top-down clustering
Introduction
Example
Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70
Measures of Similarity
Distance (Dissimilarity) Function
Data vectors
\[ X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})', \quad X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})' \]
(a) Euclidean Distance
\[ d_{ij} = \left[ \sum_{k=1}^{p} (X_{ik} - X_{jk})^2 \right]^{1/2} \]
Measures of Similarity
Euclidean Distances Matrix for 6 students
     1      2      3      4      5      6
1    0      78.10  33.60  13.93  68.77  27.73
2           0      46.14  90.52  15.13  51.66
3                  0      47.17  35.36  15.81
4                         0      82.01  39.05
5                                0      44.72
6                                       0
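The distance matrix for the six students can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the names `scores`, `euclidean` and `dist` are illustrative choices, not part of the slides.

```python
import math

# Test scores (Chinese, Math) for the six students in the example.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Full symmetric distance matrix, keyed by (student i, student j).
dist = {(i, j): euclidean(scores[i], scores[j]) for i in scores for j in scores}

# Print the matrix row by row, rounded to two decimals.
for i in scores:
    print(i, " ".join(f"{dist[i, j]:6.2f}" for j in scores))
```

The smallest off-diagonal entry is the distance between Students 1 and 4, which is the first pair merged by a hierarchical method.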
Measures of Similarity
Distance (Dissimilarity) Function
(b) Manhattan (city block) Distance
\[ d'_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}| \]
Manhattan distance is more robust to extreme values
The Manhattan distance between Students 1 and 2: |85 - 25| + |82 - 32| = 110
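As a quick check of the city-block formula, the sketch below computes the Manhattan distance for the student example. The helper name `manhattan` and the `scores` dictionary are illustrative assumptions.

```python
# Test scores (Chinese, Math) for the six students in the example.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

d12 = manhattan(scores[1], scores[2])  # |85 - 25| + |82 - 32|
d14 = manhattan(scores[1], scores[4])  # |85 - 90| + |82 - 95|
print(d12, d14)
```

Because differences enter linearly rather than squared, a single extreme coordinate contributes less to this distance than it does to the Euclidean distance.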
Measures of Similarity
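Pearson product-moment correlation coefficient
\[ r = \frac{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)^2 \, \sum_{k=1}^{p} (X_{jk} - \bar{X}_j)^2}} \]
Distance measure based on Pearson correlation coefficient: d = 1 - r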
Measures of Similarity
Other Similarity Measures
Angular distance (uncentered Pearson correlation coefficient)
\[ r' = \frac{\sum_{k=1}^{p} X_{ik} X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2 \, \sum_{k=1}^{p} X_{jk}^2}} \]
Spearman correlation coefficient
  Use the ranks of observations for Pearson correlation
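The three correlation-type measures can be compared on a small example. The sketch below is illustrative only: the profiles `x` and `y` are hypothetical data, and the function names are assumptions rather than anything defined in the slides.

```python
import math

def pearson(x, y):
    """Centered (Pearson) correlation between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def uncentered(x, y):
    """Uncentered correlation: cosine of the angle between the two vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles: y increases with x, but not linearly.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
for name, f in [("Pearson", pearson), ("uncentered", uncentered), ("Spearman", spearman)]:
    print(f"{name}: r = {f(x, y):.4f}, distance 1 - r = {1 - f(x, y):.4f}")
```

For a monotone but non-linear relationship, the rank-based coefficient equals 1 while the other two fall slightly below 1.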
Measures of Similarity
Correlation coefficient
A measure of association, not a measure of similarity (or agreement)
Euclidean distance
Measures of Similarity
Example
Case I          Case II         Case III
X1   X2         X1   X2         X1   X2
1    1          1    2          1    4
2    2          2    4          2    8
3    3          3    6          3    12
4    4          4    8          4    16
r=1, d^2=0      r=1, d^2=30     r=1, d^2=270
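The three cases can be checked numerically. In this sketch (helper names are illustrative), `r` is the Pearson correlation between the X1 and X2 columns and `d2` is their squared Euclidean distance.

```python
import math

x1 = [1, 2, 3, 4]
cases = {
    "I": [1, 2, 3, 4],
    "II": [2, 4, 6, 8],
    "III": [4, 8, 12, 16],
}

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

results = {name: (pearson(x1, x2), squared_euclidean(x1, x2)) for name, x2 in cases.items()}
for name, (r, d2) in results.items():
    print(f"Case {name}: r = {r:.2f}, d^2 = {d2}")
```

All three cases are perfectly correlated, yet their distances differ widely, which is why correlation measures association rather than agreement.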
Hierarchical Clustering
General Steps for n objects
Step 1: There are n clusters at the beginning and each object is a cluster. Compute pairwise distances among all clusters
Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster
Step 3: Based on n-1 clusters, compute pairwise distances among all n-1 clusters
Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster
Hierarchical Clustering
Single Linkage (Nearest-neighbor) Method
Use the minimal distance
Distance matrix for 5 objects
     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0
Hierarchical Clustering
Single Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the minimal distance among {35},{1},{2},{4}
d{35}1 = min[d31,d51] = min[3,11] = 3
d{35}2 = min[d32,d52] = min[7,10] = 7
d{35}4 = min[d34,d54] = min[9,8] = 8
Hierarchical Clustering
Single Linkage Method
Update the distance matrix
       {35}  1    2    4
{35}   0
1      3     0
2      7     9    0
4      8     6    5    0
Hierarchical Clustering
Single Linkage Method
Step 4: Minimal distance is 3 between {35} and {1}; merge {35} and {1} into {135}
Step 5: Find the distances between {135} and {2} and {4}
d{135}2 = min[d{35}2,d12] = min[7,9] = 7
d{135}4 = min[d{35}4,d14] = min[8,6] = 6
Hierarchical Clustering
Single Linkage Method
Update the distance matrix
        {135}  2    4
{135}   0
2       7      0
4       6      5    0
The minimal distance is 5 between {2} and {4}
Hierarchical Clustering
Single Linkage Method
Find the minimum distance between {135} and {24}
d{135}{24} = min[d{135}2,d{135}4] = min[7,6] = 6
Update the distance matrix
        {135}  {24}
{135}   0
{24}    6      0
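The single-linkage steps above can be sketched in Python as a repeated "find the closest pair, merge, update the matrix" loop. This is a minimal illustration, not an efficient implementation; the pairwise distances are those used in the worked steps, and the names `pair_dist` and `single_linkage` are assumptions.

```python
# Pairwise distances for the 5-object example, as used in the worked steps.
pair_dist = {
    (1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
    (2, 3): 7, (2, 4): 5, (2, 5): 10,
    (3, 4): 9, (3, 5): 2,
    (4, 5): 8,
}

def single_linkage(pair_dist, labels):
    """Return the list of (merge distance, merged cluster) in merge order.

    Clusters are frozensets of object labels. After a merge, the distance
    from the new cluster to any other cluster is the minimum of the
    distances from its two parts (the single-linkage update rule).
    """
    clusters = [frozenset([x]) for x in labels]
    dist = {
        frozenset([frozenset([i]), frozenset([j])]): v
        for (i, j), v in pair_dist.items()
    }
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((a, b) for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda ab: dist[frozenset(ab)],
        )
        height = dist[frozenset((a, b))]
        merged = a | b
        rest = [c for c in clusters if c not in (a, b)]
        # Update the distance matrix for the new cluster.
        for c in rest:
            dist[frozenset((merged, c))] = min(
                dist[frozenset((a, c))], dist[frozenset((b, c))]
            )
        clusters = rest + [merged]
        merges.append((height, sorted(merged)))
    return merges

merges = single_linkage(pair_dist, [1, 2, 3, 4, 5])
for height, members in merges:
    print(height, members)
```

Replacing `min` with `max` in the update step would give the complete-linkage rule instead.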
Hierarchical Clustering
Single Linkage Method
Distance  Clusters
2         {1},{35},{2},{4}
3         {135},{2},{4}
4         {135},{2},{4}
5         {135},{24}
6         {12345}
Hierarchical Clustering
Dendrograms
A 2-dimensional tree structure rooted at the top
One dimension is the distance measure
Another dimension is the clustering results
The height of a vertical (horizontal) line represents the distance between the two clusters it merges
Hierarchical Clustering
Complete Linkage (Farthest-neighbor) Method
Use the maximal distance
Distance matrix for 5 objects
     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0
Hierarchical Clustering
Complete Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the maximal distance among {35},{1},{2},{4}
Hierarchical Clustering
Complete Linkage Method
Update the distance matrix
       {35}  1    2    4
{35}   0
1      11    0
2      10    9    0
4      9     6    5    0
Hierarchical Clustering
Complete Linkage Method
Step 4: Minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the maximal distances
d{24}{35} = max[d2{35}, d4{35}] = max[10,9] = 10
d{24}1 = max[d21, d41] = max[9,6] = 9
Hierarchical Clustering
Complete Linkage Method
Update the distance matrix
        {35}  {24}  1
{35}    0
{24}    10    0
1       11    9     0
The minimal distance is 9 between {1} and {24}; merge {1} and {24} into {124}
Hierarchical Clustering
Complete Linkage Method
Find the maximal distance between {124} and {35}
d{124}{35} = max[d{24}{35}, d1{35}] = max[10,11] = 11
Update the distance matrix
        {35}  {124}
{35}    0
{124}   11    0
Hierarchical Clustering
Complete Linkage Method
Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
9         {35},{124}
11        {12345}
Average Clustering
Average Linkage Method
Use the average distance
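Average distance between cluster {uv} and cluster {w}
\[ d_{\{uv\}\{w\}} = \frac{\sum_{i \in \{uv\}} \sum_{k \in \{w\}} d_{ik}}{n_{\{uv\}} \, n_{\{w\}}} \]
where d_{ik} is the distance between object i in {uv} and object k in {w}, and n_{uv} and n_{w} are the numbers of objects in the two clusters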
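The average distance between two clusters is simply the mean of all between-cluster pairwise distances. A minimal sketch, assuming the 5-object distance matrix of the linkage examples (the names `D`, `d` and `average_distance` are illustrative):

```python
# Pairwise distances for the 5-object example (keys are ordered pairs i < j).
D = {
    (1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
    (2, 3): 7, (2, 4): 5, (2, 5): 10,
    (3, 4): 9, (3, 5): 2,
    (4, 5): 8,
}

def d(i, j):
    """Distance between two distinct objects, looked up symmetrically."""
    return D[(min(i, j), max(i, j))]

def average_distance(cluster_a, cluster_b):
    """Mean of all pairwise distances between two disjoint clusters."""
    total = sum(d(i, k) for i in cluster_a for k in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

print(average_distance({3, 5}, {1}))        # (3 + 11) / 2
print(average_distance({2, 4}, {3, 5}))     # (7 + 10 + 9 + 8) / 4
print(average_distance({2, 4}, {1, 3, 5}))  # (9 + 6 + 7 + 9 + 10 + 8) / 6
```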
Average Clustering
Average Linkage Method
Use the average distance
Distance matrix for 5 objects
     1    2    3    4    5
1    0
2    9    0
3    3    7    0
4    6    5    9    0
5    11   10   2    8    0
Hierarchical Clustering
Average Linkage Method
Step 1: 5 clusters: {1},{2},{3},{4},{5}
Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
Step 3: Find the average distance among {35},{1},{2},{4}
d{35}1 = (d31+d51)/(2x1) = (3+11)/2 = 7
d{35}2 = (d32+d52)/(2x1) = (7+10)/2 = 8.5
d{35}4 = (d34+d54)/(2x1) = (9+8)/2 = 8.5
Hierarchical Clustering
Average Linkage Method
Update the distance matrix
       {35}  1    2    4
{35}   0
1      7     0
2      8.5   9    0
4      8.5   6    5    0
Hierarchical Clustering
Average Linkage Method
Step 4: Minimal distance is 5 between {2} and {4}; merge {2} and {4} into {24}
Step 5: Find the average distances
d{24}{35} = (d23+d25+d43+d45)/(2x2) = (7+10+9+8)/(2x2) = 8.5
d{24}1 = (d21+d41)/(2x1) = (9+6)/2 = 7.5
Hierarchical Clustering
Average Linkage Method
Update the distance matrix
        {35}  {24}  1
{35}    0
{24}    8.5   0
1       7     7.5   0
The minimal distance is 7 between {1} and {35}
Hierarchical Clustering
Average Linkage Method
Find the average distance between {24} and {135}
d{24}{135} = (d12+d14+d32+d34+d52+d54)/(3x2) = (9+6+7+9+10+8)/6 = 8.17
Update the distance matrix
        {135}  {24}
{135}   0
{24}    8.17   0
Hierarchical Clustering
Average Linkage Method
Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
7         {135},{24}
8.17      {12345}
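The single, complete and average linkage methods differ only in how the distance between two clusters is summarized from the pairwise distances. The sketch below makes that explicit by passing the summary function as a parameter. It is an illustrative, brute-force implementation under that assumption, and the names `agglomerate` and `average` are hypothetical.

```python
from itertools import combinations

# Pairwise distances for the 5-object example (keys are ordered pairs i < j).
D = {
    (1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
    (2, 3): 7, (2, 4): 5, (2, 5): 10,
    (3, 4): 9, (3, 5): 2,
    (4, 5): 8,
}

def d(i, j):
    """Distance between two distinct objects, looked up symmetrically."""
    return D[(min(i, j), max(i, j))]

def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def agglomerate(labels, linkage):
    """Merge the two closest clusters until one remains.

    `linkage` reduces the list of between-cluster pairwise distances to one
    number: min (single), max (complete) or average (average linkage).
    Returns a list of (merge distance, clusters after the merge).
    """
    clusters = [(x,) for x in labels]
    history = []

    def cluster_distance(pair):
        a, b = pair
        return linkage([d(i, k) for i in a for k in b])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        height = cluster_distance((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        history.append((height, sorted(clusters)))
    return history

heights = {
    name: [h for h, _ in agglomerate([1, 2, 3, 4, 5], fn)]
    for name, fn in [("single", min), ("complete", max), ("average", average)]
}
for name, hs in heights.items():
    print(name, [round(h, 2) for h in hs])
```

Recomputing every cluster distance from the raw pairs keeps the sketch short; practical implementations update the distance matrix incrementally instead.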
Hierarchical Clustering
Example Manly (2005)
Distance Matrix of 5 objects
1 2 3 4 5
1 0
2 2 0
3 6 5 0
Hierarchical Clustering
Single Linkage Method
Distance Clusters 2 {12},{3},{4},{5}
3 {12},{3},{45}
4 {12},{345}
5 {12345}
The same results are obtained from the complete and average linkage methods
Hierarchical Clustering
Example: Canine group by single linkage clustering
Distance  Clusters                     # of clusters
0.72      {MD,PD},GJ,CW,IW,CU,DI       6
1.38      {MD,PD,CU},GJ,CW,IW,DI       5
1.68      {MD,PD,CU,DI},GJ,CW,IW       4
2.07      {MD,PD,CU,DI,GJ},CW,IW       3
2.31      {MD,PD,CU,DI,GJ},{CW,IW}     2
2.37      {MD,PD,CU,DI,GJ,CW,IW}       1
Results of single linkage method
for European employment data
Hierarchical Clustering
Centroid (Center or Average) Method
Start with each object being a cluster
Merge the two clusters with the shortest distance
Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new clusters
Merge the two clusters with the shortest distance
Compute the centroid as the average of all variables in the new cluster and update the distance matrix
Introduction
Example
Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70
Hierarchical Clustering
Centroid Method
Euclidean Distance matrix of 6 students
     1      2      3      4      5      6
1    0
2    78.10  0
3    33.60  46.14  0
4    13.93  90.52  47.17  0
5    68.77  15.13  35.36  82.01  0
6    27.73  51.66  15.81  39.05  44.72  0
Hierarchical Clustering
Centroid Method
The shortest distance is between student {1} and student {4}
Merge {1} and {4} into {14}
Compute the averages for Chinese and math
Average of Chinese = (85+90)/2 = 87.5
Average of math = (82+95)/2 = 88.5
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix
{14} 2 3 5 6
{14} 0
2 84.25 0
3 40.35 46.14 0
Hierarchical Clustering
Centroid Method
The shortest distance is between {2} and {5}
Merge {2} and {5} into {25}
The average of Chinese of {25} is 32.5
The average of math of {25} is 31.0
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix
{14} {25} 3 6
{14} 0
{25} 79.57 0
3 40.35 40.40 0
Hierarchical Clustering
Centroid Method
The shortest distance is between {3} and {6}
Merge {3} and {6} into {36}
The average of Chinese of {36} is 62.5
The average of math of {36} is 62.5
Hierarchical Clustering
Centroid Method
Update the Euclidean distance matrix
{14} {25} {36}
{14} 0
{25} 79.57 0
Hierarchical Clustering
Centroid Method
The shortest distance is between {14} and {36}
Merge {14} and {36} into {1346}
Cluster means
Cluster  Chinese  Math
{25}     32.5     31.0
{1346}   75.0     75.5
Hierarchical Clustering
Centroid Method
Distance between {25} and {1346} is 61.53
Distance  Clusters
13.93     {14},{2},{3},{5},{6}
15.13     {14},{25},{3},{6}
15.81     {14},{25},{36}
36.07     {1346},{25}
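The centroid-method walk-through can be reproduced with a short script. This is a minimal sketch under the assumption that each cluster is represented by the unweighted mean of its members and that clusters are compared by the Euclidean distance between centroids; the function names are illustrative.

```python
import math
from itertools import combinations

# Test scores (Chinese, Math) for the six students in the example.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def centroid(members):
    """Mean of each variable over the members of a cluster."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def centroid_distance(pair):
    """Euclidean distance between the centroids of two clusters."""
    a, b = pair
    return math.dist(centroid(a), centroid(b))

def centroid_clustering(labels):
    """Repeatedly merge the two clusters whose centroids are closest."""
    clusters = [(x,) for x in labels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=centroid_distance)
        height = centroid_distance((a, b))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((height, merged))
    return merges

merges = centroid_clustering(sorted(scores))
for height, merged in merges:
    print(f"{height:6.2f}  merged into {merged}, centroid {centroid(merged)}")
```

The last line printed corresponds to joining {25} with {1346} into a single cluster.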
Hierarchical Clustering
Application to gene expression data
from microarray experiments
# of genes >>> # of subjects
Clustering in two directions
  Clusters of subjects (patients)
  Clusters of genes
Hierarchical Clustering
The complexity of a bottom-up method can vary between n^2 and n^3, depending on the linkage chosen.
The complexity of a top-down method can vary between n log n and n^2, depending on ...
Hierarchical Clustering
Determination of the number of clusters
Criteria
  Root-mean-square total-sample standard deviation (RMSSTD)
  Semipartial R-square (SPRSQ)
  R-square (RSQ)
Hierarchical Clustering
Determination of the number of clusters
Example: test scores of 6 students
# of clusters  RMSSTD  SPRSQ   RSQ   MD
5              6.96    0.0145  0.98  0.30
4              7.57    0.0171  0.97  0.33
3              7.91    0.0187  0.95  0.34
2              15.93   0.1946  0.76  0.60
1              25.86   0.7551  0.00  0.77
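The slides do not spell out how these criteria are computed. The sketch below assumes commonly used definitions: RSQ is one minus the pooled within-cluster sum of squares divided by the total sum of squares, SPRSQ is the drop in RSQ caused by a merge, and RMSSTD is the root-mean-square standard deviation of the newly formed cluster. The merge sequence is the one from the centroid-method example, and all names are illustrative.

```python
import math

# Test scores (Chinese, Math) for the six students in the example.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}
P = 2  # number of variables

def within_ss(members):
    """Sum of squared deviations from the cluster mean, over all variables."""
    pts = [scores[m] for m in members]
    means = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for pt in pts for v, m in zip(pt, means))

def rmsstd(members):
    """Assumed definition: pooled standard deviation of one cluster."""
    return math.sqrt(within_ss(members) / (P * (len(members) - 1)))

total_ss = within_ss(list(scores))

# (partition after the merge, newly formed cluster), in merge order.
steps = [
    ([[1, 4], [2], [3], [5], [6]], [1, 4]),
    ([[1, 4], [2, 5], [3], [6]], [2, 5]),
    ([[1, 4], [2, 5], [3, 6]], [3, 6]),
    ([[1, 3, 4, 6], [2, 5]], [1, 3, 4, 6]),
    ([[1, 2, 3, 4, 5, 6]], [1, 2, 3, 4, 5, 6]),
]

table = []
prev_rsq = 1.0  # with n singleton clusters, RSQ is 1
for partition, new_cluster in steps:
    rsq = 1.0 - sum(within_ss(c) for c in partition) / total_ss
    table.append((len(partition), rmsstd(new_cluster), prev_rsq - rsq, rsq))
    prev_rsq = rsq

for k, r, sp, rs in table:
    print(f"{k}  RMSSTD={r:6.2f}  SPRSQ={sp:.4f}  RSQ={rs:.3f}")
```

A large jump in SPRSQ or RMSSTD (here, when going from 3 to 2 clusters) suggests that two dissimilar clusters were merged, so the partition just before the jump is a natural candidate.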
K-means Clustering
Step 1: Select the number of clusters, say K, and determine the distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient
Step 2: Divide n objects into K clusters, either randomly or based on a preliminary hierarchical clustering
K-means Clustering
Step 4: For each object, find the minimal distance and reallocate the object to the corresponding cluster with the minimal distance
Step 5: Update the clusters and their centroids
Step 6: Repeat Step 3 and Step 4 until no reallocation of objects among clusters occurs
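The K-means steps can be sketched on the six-student example with K = 2. Step 3 is not shown in the text above; this sketch assumes it consists of computing each cluster centroid and the Euclidean distance from every object to every centroid. The starting partitions and function names are illustrative choices.

```python
import math

# Test scores (Chinese, Math) for the six students in the example.
scores = {
    1: (85, 82), 2: (25, 32), 3: (65, 55),
    4: (90, 95), 5: (40, 30), 6: (60, 70),
}

def centroid(members):
    """Mean of each variable over the members of a cluster."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def kmeans(initial_clusters, max_iter=100):
    """Reallocate objects to the nearest centroid until nothing moves."""
    clusters = [sorted(c) for c in initial_clusters]
    for _ in range(max_iter):
        centers = [centroid(c) for c in clusters]          # assumed Step 3
        new = [[] for _ in clusters]
        for student, x in scores.items():                  # Step 4
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(x, centers[k]))
            new[nearest].append(student)
        if any(not c for c in new):
            raise ValueError("a cluster became empty; not handled in this sketch")
        if new == clusters:                                # Step 6: no reallocation
            break
        clusters = new                                     # Step 5
    return clusters

result_a = kmeans([[1, 2, 3], [4, 5, 6]])
result_b = kmeans([[2, 5], [1, 3, 4, 6]])
print(result_a)
print(result_b)
```

The two starting partitions converge to different final partitions, which illustrates that the result of K-means depends on the initial clusters.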
K-means Clustering
The number of computations that need
to be performed can be written as c*p
where c is a value that does depend on
the number of iterations and p is the
number of variables (e.g., the number
of genes)
K-means Clustering
The number of clusters is selected to
maximize the between-cluster sum of squares (variation) and to minimize the within-cluster sum of squares (variation)
The best-of-10 partition: to apply K-means
method 10 times using 10 different randomly chosen sets of initial clusters and choose the result that minimizes the within-cluster sum of squares
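The best-of-10 idea can be sketched as follows. This is an illustrative implementation only: the random starts deal the shuffled objects into K groups, the seed is fixed for reproducibility, and all names (`kmeans`, `wss`, `runs`) are assumptions.

```python
import math
import random

# Test scores (Chinese, Math) for the six students in the example.
points = [(85, 82), (25, 32), (65, 55), (90, 95), (40, 30), (60, 70)]
K = 2

def kmeans(labels, k, max_iter=100):
    """Plain K-means from an initial labelling; returns (labels, centers)."""
    labels = list(labels)
    centers = [None] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
        new = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new == labels:
            break
        labels = new
    return labels, centers

def wss(labels, centers):
    """Within-cluster sum of squares for a labelling and its centers."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
runs = []
for _ in range(10):
    order = list(range(len(points)))
    rng.shuffle(order)
    init = [0] * len(points)
    for rank, idx in enumerate(order):
        init[idx] = rank % K  # every initial cluster is non-empty
    labels, centers = kmeans(init, K)
    runs.append((wss(labels, centers), labels))

best_wss, best_labels = min(runs)
print(round(best_wss, 2), best_labels)
```

Keeping the run with the smallest within-cluster sum of squares reduces, but does not remove, the dependence on the starting partition.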
Issues and Limitations
With considerable overlap between the initial groups, cluster analysis may produce a result that is quite different from the true situation
Different approaches may yield different results.
The dendrogram itself is almost never the answer to the research question.
Issues and Limitations
The shape of the clusters can create difficulty in cluster analysis
(a) and (b): can be clustered by any reasonable algorithm
(c): some methods will fail because of overlapping points
(d), (e) and (f): great challenges for most
Issues and Limitations
Anything can be clustered
The clustering algorithm applied to the same data may produce different results
Ignore the magnitudes of distance measures in the dendrogram
The position of the patterns within the clusters does not reflect their relationship in the input space