Multivariate Statistical Methods-Cluster Analysis

Academic year: 2021

(1)

Multivariate Statistical Methods

Cluster Analysis

By Jen-pei Liu, PhD

Division of Biometry, Department of Agronomy, National Taiwan University

and

Division of Biostatistics and Bioinformatics, National Health Research Institutes

(2)

Cluster Analysis

Introduction

Measures of Similarity

Hierarchical Clustering

K-means Clustering

Summary

(3)

Introduction

A sample of n objects, each with measurements on p variables

Goal: use the measurements on the p variables to devise a scheme for grouping the n objects into classes

(4)

Introduction

In general, the number of clusters is not known in advance (unsupervised analysis)

In discriminant analysis, the number of classes is pre-specified and classification is based on a prediction function (supervised analysis)

(5)

Introduction

Examples

• Cluster of depressed patients
• Data reduction
• Marketing
  - Test markets: a large number of cities
  - A small number of groups of similar cities
  - One member from each group selected for testing
• Microarray
  - Clusters of genes
  - Clusters of subjects

(6)

Introduction

Types of Clustering Methods

• Hierarchical Clustering
  - Finds a series of partitions
  - A bottom-up clustering
• Partitional Method
  - Produces a single partition of the objects
  - A top-down clustering

(7)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(8)

Measures of Similarity

Distance (Dissimilarity) Function

Data vectors: $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})'$ and $X_j = (X_{j1}, X_{j2}, \ldots, X_{jp})'$

(a) Euclidean Distance

$$d_{ij} = \left[ \sum_{k=1}^{p} (X_{ik} - X_{jk})^2 \right]^{1/2}$$
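As a minimal sketch of the Euclidean distance formula, the following computes all pairwise distances for the six students in the example. The names `scores`, `euclidean`, and `dist` are illustrative choices, not part of the slides.

```python
import math

# Scores from the six-student example: (Chinese X1, Math X2)
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Pairwise distances keyed by student pair (i, j) with i < j
dist = {(i, j): euclidean(scores[i], scores[j])
        for i in scores for j in scores if i < j}

for (i, j), d in sorted(dist.items()):
    print(f"d({i},{j}) = {d:.2f}")
```

The smallest distance is between Students 1 and 4, which is why that pair is the first to be merged in a bottom-up clustering of these data.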

(9)

Measures of Similarity

Euclidean distance matrix for the 6 students

         1      2      3      4      5      6
1        0  78.10  33.60  13.93  68.77  27.73
2               0  46.14  90.52  15.13  51.66
3                      0  47.17  35.36  15.81
4                             0  82.01  39.05
5                                    0  44.72
6                                           0

(10)

Measures of Similarity

Distance (Dissimilarity) Function

(b) Manhattan (city block) Distance

$$d'_{ij} = \sum_{k=1}^{p} |X_{ik} - X_{jk}|$$

Manhattan distance is more robust to extreme values than Euclidean distance

The Manhattan distance between Students 1 and 2: |85 - 25| + |82 - 32| = 110

(11)

Measures of Similarity

(12)

Measures of Similarity

Similarity Measures

Pearson product-moment correlation coefficient

$$r = \frac{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{p} (X_{ik} - \bar{X}_i)^2 \sum_{k=1}^{p} (X_{jk} - \bar{X}_j)^2}}$$

Distance measure based on the Pearson correlation coefficient

$$d_r = 1 - r$$

(13)

Measures of Similarity

Other Similarity Measures

Angular distance (uncentered Pearson correlation coefficient)

$$r' = \frac{\sum_{k=1}^{p} X_{ik} X_{jk}}{\sqrt{\sum_{k=1}^{p} X_{ik}^2 \sum_{k=1}^{p} X_{jk}^2}}$$

Spearman correlation coefficient: apply the Pearson correlation to the ranks of the observations

(14)

Measures of Similarity

Correlation coefficient

• A measure of association
• Not a measure of similarity (or agreement)

Euclidean distance

(15)

Measures of Similarity

Example

        Case I       Case II      Case III
        X1   X2      X1   X2      X1   X2
         1    1       1    2       1    4
         2    2       2    4       2    8
         3    3       3    6       3   12
         4    4       4    8       4   16
r        1            1            1
d^2      0            30           270
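The contrast in this example can be checked numerically. The sketch below assumes that d2 in the table denotes the squared Euclidean distance between the X1 and X2 columns; the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def squared_euclidean(x, y):
    """Sum of squared differences (assumed meaning of d2 in the table)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

x1 = [1, 2, 3, 4]
cases = {"I": [1, 2, 3, 4], "II": [2, 4, 6, 8], "III": [4, 8, 12, 16]}

results = {name: (pearson(x1, x2), squared_euclidean(x1, x2))
           for name, x2 in cases.items()}
for name, (r, d2) in results.items():
    print(f"Case {name}: r = {r:.2f}, d^2 = {d2}")
```

In all three cases the correlation is 1 while the squared distance grows from 0 to 270, which illustrates why a correlation measures association rather than agreement.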

(16)

Hierarchical Clustering

General Steps for n objects

• Step 1: There are n clusters at the beginning, and each object is a cluster. Compute the pairwise distances among all clusters
• Step 2: Find the minimum distance and merge the corresponding two clusters into one cluster
• Step 3: Based on the n-1 clusters, compute the pairwise distances among all n-1 clusters
• Step 4: Find the minimum distance and merge the corresponding two clusters into one cluster

(17)
(18)

Hierarchical Clustering

Single Linkage (Nearest-neighbor) Method

• Use the minimal distance

Distance matrix for 5 objects

1 2 3 4 5

1 0

2 9 0

3 3 7 0

(19)

Hierarchical Clustering

Single Linkage Method

• Step 1: 5 clusters: {1},{2},{3},{4},{5}
• Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
• Step 3: Find the minimal distance among {35},{1},{2},{4}
  - d{35}1 = min[d31,d51] = min[3,11] = 3
  - d{35}2 = min[d32,d52] = min[7,10] = 7

(20)

Hierarchical Clustering

Single Linkage Method

Update the distance matrix

       {35}  1  2  4
{35}   0
1      3     0

(21)

Hierarchical Clustering

Single Linkage Method

• Step 4: The minimal distance is 3, between {35} and {1}; merge {35} and {1} into {135}
• Step 5: Find the distances between {135} and each of {2} and {4}
  - d{135}2 = min[d{35}2,d12] = min[7,9] = 7
  - d{135}4 = min[d{35}4,d14] = min[8,6] = 6

(22)

Hierarchical Clustering

Single Linkage Method

Update the distance matrix

        {135}  2  4
{135}   0
2       7      0
4       6      5  0

The minimal distance is 5, between {2} and {4}

(23)

Hierarchical Clustering

Single Linkage Method

Find the minimum distance between {135} and {24}

• d{135}{24} = min[d{135}2,d{135}4] = min[7,6] = 6

Update the distance matrix

        {135}  {24}
{135}   0

(24)

Hierarchical Clustering

Single Linkage Method

Distance  Clusters
2         {1},{35},{2},{4}
3         {135},{2},{4}
4         {135},{2},{4}
5         {135},{24}
6         {12345}
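The single linkage walk-through can be sketched in a few lines of plain Python. The distance table `D` below is assembled from the values quoted in the worked steps of these slides (the printed matrix is cut off after row 3), and the function name `single_linkage` is an illustrative choice rather than a library call.

```python
from itertools import combinations

# Pairwise distances for the 5 objects, collected from the worked
# steps in the slides (d35 = 2, d24 = 5, d34 = 9, d45 = 8, ...).
D = {(1, 2): 9, (1, 3): 3, (1, 4): 6, (1, 5): 11,
     (2, 3): 7, (2, 4): 5, (2, 5): 10,
     (3, 4): 9, (3, 5): 2, (4, 5): 8}

def d(i, j):
    """Look up the distance between two objects in either order."""
    return D[(min(i, j), max(i, j))]

def single_linkage(objects):
    """Return the merge history as (distance, merged cluster) pairs."""
    clusters = [frozenset([o]) for o in objects]
    history = []
    while len(clusters) > 1:
        # Cluster-to-cluster distance = smallest object-to-object distance
        dist, a, b = min(
            ((min(d(i, j) for i in a for j in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((dist, sorted(a | b)))
    return history

history = single_linkage([1, 2, 3, 4, 5])
for dist, members in history:
    print(dist, members)
```

The merge distances come out as 2, 3, 5, and 6, matching the distance column of the table above.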

(25)

Hierarchical Clustering

Dendrograms

• A 2-dimensional tree structure rooted at the top
• One dimension is the distance measure
• The other dimension is the clustering result
• The height of a vertical (horizontal) line represents the distance between the two clusters it merges

(26)

Hierarchical Clustering

Complete Linkage (Farthest-neighbor) Method

• Use the maximal distance

Distance matrix for 5 objects

1 2 3 4 5

1 0

2 9 0

3 3 7 0

(27)
(28)

Hierarchical Clustering

Complete Linkage Method

• Step 1: 5 clusters: {1},{2},{3},{4},{5}
• Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
• Step 3: Find the maximal distance among {35},{1},{2},{4}

(29)

Hierarchical Clustering

Complete Linkage Method

Update the distance matrix

       {35}  1  2  4
{35}   0
1      11    0
2      10    9  0

(30)

Hierarchical Clustering

Complete Linkage Method

• Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}
• Step 5: Find the maximal distances
  - d{24}{35} = max[d2{35}, d4{35}] = max[10,9] = 10
  - d{24}1 = max[d21, d41] = max[9,6] = 9

(31)

Hierarchical Clustering

Complete Linkage Method

Update the distance matrix

       {35}  {24}  1
{35}   0
{24}   10    0
1      11    9     0

The minimal distance is 9, between {1} and {24}

(32)

Hierarchical Clustering

Complete Linkage Method

Find the maximal distance between {124} and {35}

• d{124}{35} = max[d{24}{35}, d1{35}] = max[10,11] = 11

Update the distance matrix

        {35}  {124}
{35}    0

(33)

Hierarchical Clustering

Complete Linkage Method

Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
9         {35},{124}

(34)
(35)

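Average Clustering

Average Linkage Method

• Use the average distance

Average distance between cluster {uv} and cluster {w}

$$d_{\{uv\}\{w\}} = \frac{\sum_{i} \sum_{k} d_{ik}}{n_{\{uv\}} \, n_{\{w\}}}$$

where i runs over the objects in {uv}, k runs over the objects in {w}, and n denotes the number of objects in a cluster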

(36)

Average Clustering

Average Linkage Method

• Use the average distance

Distance matrix for 5 objects

1 2 3 4 5

1 0

2 9 0

3 3 7 0

(37)

Hierarchical Clustering

Average Linkage Method

• Step 1: 5 clusters: {1},{2},{3},{4},{5}
• Step 2: min{dij} = d35 = 2; merge objects 3 and 5 into one cluster {35}
• Step 3: Find the average distance among {35},{1},{2},{4}
  - d{35}1 = (d31+d51)/(2x1) = (3+11)/2 = 7
  - d{35}2 = (d32+d52)/(2x1) = (7+10)/2 = 8.5

(38)

Hierarchical Clustering

Average Linkage Method

Update the distance matrix

       {35}  1  2  4
{35}   0
1      7     0

(39)

Hierarchical Clustering

Average Linkage Method

• Step 4: The minimal distance is 5, between {2} and {4}; merge {2} and {4} into {24}
• Step 5: Find the average distances
  - d{24}{35} = (d23+d25+d43+d45)/(2x2) = (7+10+9+8)/(2x2) = 8.5
  - d{24}1 = (d21+d41)/(2x1) = (9+6)/2 = 7.5

(40)

Hierarchical Clustering

Average Linkage Method

Update the distance matrix

       {35}  {24}  1
{35}   0
{24}   8.5   0
1      7     7.5   0

The minimal distance is 7, between {1} and {35}

(41)

Hierarchical Clustering

Average Linkage Method

Find the average distance between {24} and {135}

• d{24}{135} = (d12+d14+d32+d34+d52+d54)/(3x2) = (9+6+7+9+10+8)/6 = 8.17

Update the distance matrix

        {135}  {24}
{135}   0

(42)

Hierarchical Clustering

Average Linkage Method

Distance  Clusters
2         {35},{1},{2},{4}
5         {35},{1},{24}
7         {135},{24}

(43)
(44)

Hierarchical Clustering

Example Manly (2005)

Distance Matrix of 5 objects

1 2 3 4 5

1 0

2 2 0

3 6 5 0

(45)

Hierarchical Clustering

Single Linkage Method

Distance  Clusters
2         {12},{3},{4},{5}
3         {12},{3},{45}
4         {12},{345}
5         {12345}

The same results are obtained from the complete and average linkage methods

(46)
(47)

Hierarchical Clustering

Example: Canine groups by single linkage clustering

Distance  Clusters                    # of clusters
0.72      {MD,PD},GJ,CW,IW,CU,DI      6
1.38      {MD,PD,CU},GJ,CW,IW,DI      5
1.68      {MD,PD,CU,DI},GJ,CW,IW      4
2.07      {MD,PD,CU,DI,GJ},CW,IW      3
2.31      {MD,PD,CU,DI,GJ},{CW,IW}    2
2.37      {MD,PD,CU,DI,GJ,CW,IW}      1

(48)
(49)

Results of single linkage method for European employment data

(50)

Hierarchical Clustering

Centroid (Center or Average) Method

• Start with each object being a cluster
• Merge the two clusters with the shortest distance
• Compute the centroid as the average of all variables in the new cluster and update the distance matrix using the averages of the new clusters
• Merge the two clusters with the shortest distance
• Compute the centroid as the average of all variables in the new cluster

(51)

Introduction

Example

Student  Chinese (X1)  Math (X2)
1        85            82
2        25            32
3        65            55
4        90            95
5        40            30
6        60            70

(52)

Hierarchical Clustering

Centroid Method

Euclidean distance matrix of the 6 students

         1      2      3      4      5      6
1        0
2    78.10      0
3    33.60  46.14      0
4    13.93  90.52  47.17      0
5    68.77  15.13  35.36  82.01      0
6    27.73  51.66  15.81  39.05  44.72      0

(53)

Hierarchical Clustering

Centroid Method

• The shortest distance is between student {1} and student {4}
• Merge {1} and {4} into {14}
• Compute the averages for Chinese and math
  - Average of Chinese = (85+90)/2 = 87.5
  - Average of math = (82+95)/2 = 88.5

(54)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

{14} 2 3 5 6

{14} 0

2 84.25 0

3 40.35 46.14 0

(55)

Hierarchical Clustering

Centroid Method

• The shortest distance is between {2} and {5}
• Merge {2} and {5} into {25}
  - The average of Chinese of {25} is 32.5
  - The average of math of {25} is 31.0

(56)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

{14} {25} 3 6

{14} 0

{25} 79.57 0

3 40.35 40.40 0

(57)

Hierarchical Clustering

Centroid Method

• The shortest distance is between {3} and {6}
• Merge {3} and {6} into {36}
  - The average of Chinese of {36} is 62.5
  - The average of math of {36} is 62.5

(58)

Hierarchical Clustering

Centroid Method

Update the Euclidean distance matrix

{14} {25} {36}

{14} 0

{25} 79.57 0

(59)

Hierarchical Clustering

Centroid Method

• The shortest distance is between {14} and {36}
• Merge {14} and {36} into {1346}
• Cluster means

Cluster  Chinese  Math
{25}     32.5     31.0
{1346}   75.0     75.5

(60)

Hierarchical Clustering

Centroid Method

• Distance between {25} and {1346} is 61.53

Distance  Clusters
13.93     {14},{2},{3},{5},{6}
15.13     {14},{25},{3},{6}
15.81     {14},{25},{36}
36.07     {1346},{25}
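The centroid walk-through on the six students can be sketched as follows. This is a minimal illustration that recomputes each cluster centroid from its members; the names `centroid` and `centroid_clustering` are assumptions made for the example.

```python
import math
from itertools import combinations

# Scores from the six-student example: (Chinese X1, Math X2)
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    """Average of each variable over the members of a cluster."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def centroid_clustering(objects):
    """Repeatedly merge the two clusters whose centroids are closest."""
    clusters = [(o,) for o in objects]
    history = []
    while len(clusters) > 1:
        dist, a, b = min(
            ((math.dist(centroid(a), centroid(b)), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0])
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        history.append((round(dist, 2), merged))
    return history

history = centroid_clustering([1, 2, 3, 4, 5, 6])
for dist, members in history:
    print(dist, members, centroid(members))
```

The merge distances 13.93, 15.13, 15.81, and 36.07 agree with the table above, and the final merge of {25} with {1346} occurs at 61.53.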

(61)
(62)

Hierarchical Clustering

Application to gene expression data from microarray experiments

• # of genes >>> # of subjects
• Clustering in two directions
  - Clusters of subjects (patients)
  - Clusters of genes

(63)
(64)
(65)
(66)
(67)
(68)

Hierarchical Clustering

The complexity of a bottom-up method can vary between $n^2$ and $n^3$, depending on the linkage chosen.

The complexity of a top-down method can vary between $n \log n$ and $n^2$.

(69)

Hierarchical Clustering

Determination of the number of clusters

Criteria

• Root-mean-square total-sample standard deviation (RMSSTD)
• Semipartial R-square (SPRSQ)
• R-square (RSQ)

(70)
(71)

Hierarchical Clustering

Determination of the number of clusters

Example: test scores of 6 students

# of clusters  RMSSTD  SPRSQ   RSQ   MD
5              6.96    0.0145  0.98  0.30
4              7.57    0.0171  0.97  0.33
3              7.91    0.0187  0.95  0.34
2              15.93   0.1946  0.76  0.60
1              25.86   0.7751  0.00  0.77
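The slides do not define the three criteria, so the following sketch rests on assumed definitions that are common in statistical software: RMSSTD is the pooled standard deviation of the newly formed cluster, SPRSQ is the increase in within-cluster sum of squares caused by the merge divided by the total sum of squares, and RSQ is one minus the pooled within-cluster share of the total. The merge order is the one from the centroid example; all names are illustrative.

```python
import math

# Scores from the six-student example: (Chinese X1, Math X2)
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}
p = 2  # number of variables

def ss(members):
    """Sum of squared deviations from the cluster mean, over all variables."""
    pts = [scores[m] for m in members]
    means = [sum(col) / len(pts) for col in zip(*pts)]
    return sum((v - m) ** 2 for pt in pts for v, m in zip(pt, means))

total_ss = ss(tuple(scores))

# (new cluster, first part, second part), in the order of the centroid example
merges = [((1, 4), (1,), (4,)), ((2, 5), (2,), (5,)), ((3, 6), (3,), (6,)),
          ((1, 3, 4, 6), (1, 4), (3, 6)),
          ((1, 2, 3, 4, 5, 6), (2, 5), (1, 3, 4, 6))]

within = 0.0
rows = []
for k, (new, a, b) in zip(range(5, 0, -1), merges):
    increase = ss(new) - ss(a) - ss(b)      # loss of homogeneity from merging
    within += increase                       # pooled within-cluster SS so far
    rmsstd = math.sqrt(ss(new) / (p * (len(new) - 1)))
    rows.append((k, round(rmsstd, 2), round(increase / total_ss, 4),
                 round(1 - within / total_ss, 3)))

for row in rows:
    print(row)
```

Under these assumptions the RMSSTD column and the first four SPRSQ values of the table are reproduced, and RSQ agrees to within about 0.01; for the final merge this sketch gives an SPRSQ of about 0.755. A sharp jump in RMSSTD and SPRSQ when going from 3 clusters to 2 is the kind of signal these criteria are meant to expose.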

(72)

K-means Clustering

• Step 1: Select the number of clusters, say K, and determine the distance measure, such as Euclidean distance or 1 - Pearson correlation coefficient
• Step 2: Divide the n objects into K clusters, either randomly or based on a preliminary hierarchical clustering

(73)

K-means Clustering

Step 4: For each object, find the minimal distance and reallocate the object to the corresponding cluster with the minimal distance

Step 5: Update the clusters and their centroids

Step 6: Repeat Step 3 and Step 4 until no reallocation of objects among clusters occurs
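The K-means steps can be sketched on the six-student data. Step 3 is not shown in this text; the sketch assumes it computes each cluster centroid and the distance from every object to every centroid. The starting partition {1,2,3}, {4,5,6} is hypothetical, chosen only for illustration, and the code assumes no cluster becomes empty.

```python
import math

# Scores from the six-student example: (Chinese X1, Math X2)
scores = {1: (85, 82), 2: (25, 32), 3: (65, 55),
          4: (90, 95), 5: (40, 30), 6: (60, 70)}

def centroid(members):
    """Average of each variable over the members of a cluster."""
    pts = [scores[m] for m in members]
    return tuple(sum(col) / len(pts) for col in zip(*pts))

def k_means(initial_clusters):
    """Reassign every object to its nearest centroid until nothing moves.
    Assumes no cluster becomes empty during the iterations."""
    clusters = [set(c) for c in initial_clusters]
    while True:
        centers = [centroid(c) for c in clusters]      # assumed Step 3
        new_clusters = [set() for _ in clusters]
        for obj, x in scores.items():                  # Step 4
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(x, centers[k]))
            new_clusters[nearest].add(obj)
        if new_clusters == clusters:                   # Step 6 stopping rule
            return clusters
        clusters = new_clusters                        # Step 5

# Hypothetical arbitrary starting partition
from_arbitrary = k_means([{1, 2, 3}, {4, 5, 6}])
# Starting partition taken from the two-cluster centroid solution
from_hierarchical = k_means([{2, 5}, {1, 3, 4, 6}])
print(from_arbitrary)
print(from_hierarchical)
```

The two starting partitions settle on different final clusters ({2,3,5}, {1,4,6} versus {2,5}, {1,3,4,6}), which shows how the result of K-means can depend on the initial partition.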

(74)
(75)
(76)
(77)
(78)

K-means Clustering

The number of computations that need to be performed can be written as c*p, where c is a value that depends on the number of iterations and p is the number of variables (e.g., the number of genes)

(79)

K-means Clustering

• The number of clusters is selected to maximize the between-cluster sum of squares (variation) and to minimize the within-cluster sum of squares (variation)
• The best-of-10 partition: apply the K-means method 10 times using 10 different randomly chosen sets of initial clusters and choose the result that minimizes the within-cluster sum of squares

(80)

Issues and Limitations

• With considerable overlap between the initial groups, cluster analysis may produce a result that is quite different from the true situation
• Different approaches yield different results.
• The dendrogram itself is almost never the answer to the research question.

(81)
(82)

Issues and Limitations

Shape of clusters will create difficulty in cluster analysis

• (a) and (b): can be found by any reasonable algorithm
• (c): some methods will fail because of overlapping points
• (d), (e) and (f): great challenges for most

(83)

Issues and Limitations

• Anything can be clustered
• The clustering algorithm applied to the same data may produce different results
• Ignore the magnitudes of distance measures in the dendrogram
• Position of the patterns within the clusters does not reflect their relationship in the input space

(84)
(85)
(86)
(87)

Summary

• Goals
• Methods
  - Hierarchical Methods
    - Single
    - Complete
    - Average
    - Centroid
  - K-means Clustering
• Limitations
