Applied Multivariate
Quantitative Methods
Canonical Correlation Analysis
By Jen-pei Liu, PhD
Division of Biometry, Department of Agronomy,
National Taiwan University
and
Canonical Correlation Analysis
Introduction
Procedures
Examples
Introduction
Examples
The health department is interested in
determining a relationship between housing
quality – measured by a number of
variables such as type of housing, heating,
and cooling conditions, availability of
running water, and kitchen and toilet
facilities, and incidences of minor and
Introduction
Examples
A medical researcher is interested in
determining if individuals' lifestyles and
eating habits have an effect on their health
measured by a number of health-related
variables such as hypertension, weight,
anxiety, and tension levels
Introduction
Examples
The marketing manager is interested in
determining if there is a relationship between
types of products purchased and consumers'
lifestyles and personalities
Any relationship between four environment
Introduction
Objectives
To determine whether there is a
relationship between two sets of variables
To identify relationships between two sets
of variables
To determine if the predictor set of
variables affects the dependent set of
variables
Procedures
Example: Hotelling (1936)
Reading speed (X1)
Reading power (X2)
Arithmetic speed (Y1)
Arithmetic power (Y2)
Procedures
A linear combination of X1 and X2
U = a1X1+a2X2
A linear combination of Y1 and Y2
V = b1Y1 + b2Y2
Coefficients in U and V are chosen as
Procedures
The procedure is very similar to
principal component analysis (PCA)
PCA is to maximize the variance of
selected components
Canonical analysis is to maximize the
Procedures
Example (Continued)
U = -2.78X1 + 2.27X2
V = -2.44Y1 + 1.00Y2
U and V have a correlation of 0.67
U measures the difference between reading power
and speed
V measures the difference between arithmetic power
and speed
A large difference between X1 and X2 tends to go
with a large difference between Y1 and Y2
This is the aspect of reading and arithmetic that
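As a minimal sketch of how the two canonical variates in this example are evaluated, the function below plugs one student's scores into the coefficients quoted above. The function name and the input scores are hypothetical, and the scores are assumed to be standardized.

```python
def hotelling_variates(x1, x2, y1, y2):
    """Evaluate U and V with the coefficients quoted on the slide.

    Inputs are assumed to be standardized scores (an assumption, not
    something stated in the slides).
    """
    u = -2.78 * x1 + 2.27 * x2  # reading: power weighted against speed
    v = -2.44 * y1 + 1.00 * y2  # arithmetic: power weighted against speed
    return u, v


# Hypothetical student whose power score is one unit above speed on both tests.
u, v = hotelling_variates(x1=0.0, x2=1.0, y1=0.0, y2=1.0)
print(u, v)
```

Because the speed and power coefficients have opposite signs, both U and V grow when power exceeds speed, which is why they are read as differences.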
Procedures
X1, X2, …, Xp and Y1, Y2, …, Yq
measured on the same n objects
There can be up to the minimum of p a
Procedures
$U_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$
$U_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$
.
.
$U_r = a_{r1}X_1 + a_{r2}X_2 + \dots + a_{rp}X_p$
Procedures
$V_1 = b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q$
$V_2 = b_{21}Y_1 + b_{22}Y_2 + \dots + b_{2q}Y_q$
.
.
$V_r = b_{r1}Y_1 + b_{r2}Y_2 + \dots + b_{rq}Y_q$
Procedures
Maximum correlation between U1 and
V1
Maximum correlation between U2 and
V2 subject to these variables being
uncorrelated with (U1, V1)
Maximum correlation between U3 and
V3 subject to these variables being
uncorrelated with (U1, V1, U2, V2)
Procedures
Each of (U1,V1),(U2,V2),…,(Ur,Vr) represents
an independent dimension in the relationship
between two sets of variables
(X1,X2,..,Xp) and (Y1,Y2,..,Yq)
The first pair (U1,V1): highest correlation
Procedures
Xi and Yj are standardized to have mean 0 and variance 1
Compute the (p+q)x(p+q) correlation matrix for
X1,X2,…,Xp and Y1,Y2,…,Yq
A is the pxp correlation matrix of X1, X2, …, Xp
B is the qxq correlation matrix of Y1, Y2, …, Yq
C is the pxq correlation matrix between X1,X2,…,Xp and Y1,Y2,…,Yq
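The partition into A, B and C can be sketched with NumPy as follows. This is only an illustration: the function name `partition_correlation` and the random data are hypothetical, and both data matrices are assumed to have one row per object, with the same objects in both.

```python
import numpy as np


def partition_correlation(X, Y):
    """Return A (p x p), B (q x q) and C (p x q) from two data matrices.

    X is n x p and Y is n x q; rows are assumed to be the same n objects.
    """
    p = X.shape[1]
    # Correlations are unchanged by standardization, so np.corrcoef on the
    # raw columns gives the correlation matrix of the standardized variables.
    R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    A = R[:p, :p]   # correlations within the X set
    B = R[p:, p:]   # correlations within the Y set
    C = R[:p, p:]   # correlations between the X set and the Y set
    return A, B, C


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X[:, :2] + rng.normal(size=(50, 2))
A, B, C = partition_correlation(X, Y)
print(A.shape, B.shape, C.shape)
```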
Procedures
Compute $B^{-1}C'A^{-1}C$
Find the eigenvalues and eigenvectors
$(B^{-1}C'A^{-1}C - \lambda I)b = 0$
$\lambda_1, \lambda_2, \dots, \lambda_r$: the eigenvalues are the squares
of the correlations between the canonical
variables (U1,V1), (U2,V2),…,(Ur,Vr)
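A minimal sketch of the eigenvalue step, assuming NumPy is available. The matrices A, B and C below are small hypothetical values chosen for illustration (they are not taken from the slides), and the function name is an assumption.

```python
import numpy as np


def squared_canonical_correlations(A, B, C):
    """Eigenvalues of B^{-1} C' A^{-1} C, sorted from largest to smallest."""
    # np.linalg.solve(B, C.T) computes B^{-1} C' without forming the inverse.
    M = np.linalg.solve(B, C.T) @ np.linalg.solve(A, C)
    # The eigenvalues are real in theory; .real drops any rounding residue.
    eigvals = np.linalg.eigvals(M).real
    return np.sort(eigvals)[::-1]


# Hypothetical correlation blocks with p = q = 2.
A = np.array([[1.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.3], [0.3, 1.0]])
C = np.array([[0.4, 0.2], [0.1, 0.3]])

lam = squared_canonical_correlations(A, B, C)
print(lam)            # squared canonical correlations
print(np.sqrt(lam))   # canonical correlations
```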
Procedures
$b_1, b_2, \dots, b_r$ provide the coefficients for
$V_j$ based on Y1,…,Yq
$a_i = A^{-1}Cb_i$ gives the coefficients for $U_i$
$U_i = a_i'X = a_{i1}X_1 + a_{i2}X_2 + \dots + a_{ip}X_p$
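The coefficient step can be sketched as below, again with hypothetical A, B and C rather than values from the slides. The helper names are assumptions. The sketch also checks numerically that the squared correlation between each pair of variates equals the corresponding eigenvalue.

```python
import numpy as np


def canonical_coefficients(A, B, C):
    """Return eigenvalues, a-vectors and b-vectors (one pair per column)."""
    M = np.linalg.solve(B, C.T) @ np.linalg.solve(A, C)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]   # largest eigenvalue first
    lam = eigvals.real[order]
    b = eigvecs.real[:, order]               # column i holds b_i
    a = np.linalg.solve(A, C @ b)            # column i holds a_i = A^{-1} C b_i
    return lam, a, b


def variate_correlation(a_i, b_i, A, B, C):
    """Correlation between U = a_i'X and V = b_i'Y for standardized X, Y."""
    cov_uv = a_i @ C @ b_i
    return cov_uv / np.sqrt((a_i @ A @ a_i) * (b_i @ B @ b_i))


# Hypothetical correlation blocks with p = q = 2.
A = np.array([[1.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.3], [0.3, 1.0]])
C = np.array([[0.4, 0.2], [0.1, 0.3]])

lam, a, b = canonical_coefficients(A, B, C)
for i in range(len(lam)):
    r = variate_correlation(a[:, i], b[:, i], A, B, C)
    print(i + 1, lam[i], r ** 2)   # the last two columns should agree
```

The eigenvectors are returned with an arbitrary scale, so the variates in this sketch are not rescaled to unit variance; the correlation between them does not depend on that scale.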
Procedures
Test of significance
$X^2 = -\{n - (p+q+3)/2\} \sum_{i=1}^{r} \ln(1 - \lambda_i)$
Reject the null hypothesis of no correlation at the
$\alpha$ significance level if
$X^2 > \chi^2_{\alpha, pq}$
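The overall test statistic can be computed with a few lines of standard-library Python. The function name is hypothetical. The inputs in the usage lines are the four eigenvalues and the 16 colonies reported for the butterfly example in these slides, with p = q = 4.

```python
import math


def bartlett_overall(eigenvalues, n, p, q):
    """Return (X^2, degrees of freedom) for the overall test of no correlation."""
    factor = n - (p + q + 3) / 2
    stat = -factor * sum(math.log(1 - lam) for lam in eigenvalues)
    return stat, p * q


stat, df = bartlett_overall([0.7425, 0.2049, 0.1425, 0.0069], n=16, p=4, q=4)
print(round(stat, 2), df)
```

The statistic is then compared with the upper-tail chi-square critical value for the returned degrees of freedom.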
Procedures
Test of significance for individual contribution
$X_i^2 = -\{n - (p+q+3)/2\} \ln(1 - \lambda_i)$
Reject the null hypothesis of no correlation at the
$\alpha$ significance level if
$X_i^2 > \chi^2_{\alpha, p+q-2i+1}$
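A sketch of the individual tests, one per canonical pair, using only the standard library. The function name is an assumption. The usage lines reuse the butterfly eigenvalues from these slides (n = 16, p = q = 4).

```python
import math


def individual_tests(eigenvalues, n, p, q):
    """Return a list of (i, X_i^2, degrees of freedom), with i starting at 1."""
    factor = n - (p + q + 3) / 2
    results = []
    for i, lam in enumerate(eigenvalues, start=1):
        stat = -factor * math.log(1 - lam)
        df = p + q - 2 * i + 1
        results.append((i, stat, df))
    return results


for i, stat, df in individual_tests([0.7425, 0.2049, 0.1425, 0.0069], 16, 4, 4):
    print(i, round(stat, 2), df)
```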
Procedures
Test of significance from the (i+1)th to
the rth contribution
Sum from the (i+1)th $X_{i+1}^2$ to the rth $X_r^2$:
$X^2 = X_{i+1}^2 + \dots + X_r^2$, where
$X_i^2 = -\{n - (p+q+3)/2\} \ln(1 - \lambda_i)$
Reject the null hypothesis of no correlation at the
$\alpha$ significance level if
$X^2 > \chi^2_{\alpha, (p-i)(q-i)}$
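The test on the remaining pairs can be sketched in the same way. Here `i` counts the canonical pairs already set aside, so i = 0 reproduces the overall test. The function name is hypothetical, and the usage lines reuse the butterfly eigenvalues from these slides (n = 16, p = q = 4).

```python
import math


def residual_tests(eigenvalues, n, p, q):
    """Return a list of (i, X^2 for pairs i+1..r, degrees of freedom)."""
    factor = n - (p + q + 3) / 2
    terms = [-factor * math.log(1 - lam) for lam in eigenvalues]
    results = []
    for i in range(len(terms)):
        stat = sum(terms[i:])          # contributions from pair i+1 to pair r
        df = (p - i) * (q - i)
        results.append((i, stat, df))
    return results


for i, stat, df in residual_tests([0.7425, 0.2049, 0.1425, 0.0069], 16, 4, 4):
    print(i, round(stat, 2), df)
```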
Examples
Environmental and genetic correlations
16 colonies of the butterfly Euphydryas editha
in Oregon and California
Four environmental variables (Xs)
Altitude (X1), annual precipitation (X2), annual
maximum temperature (X3), and annual
minimum temperature (X4)
Examples
Environmental and genetic correlations
Issue: the sum of 6 Pgi gene frequencies is 100%
- linear dependence among Pgi gene frequencies
Solution: Remove one of the Pgi gene frequencies
(% of 1.30-mobility genes) and combine the lower-frequency
genes (0.40- and 0.60-mobility genes)
Y1 = % of 0.40- and 0.60-mobility genes, Y2 = % of 0.80-mobility genes,
Y3 = % of 1.00-mobility genes, Y4 = % of 1.16-mobility genes
Examples
Correlation Matrix of X1, X2, X3, and X4
         X1       X2       X3       X4
X1    1.000    0.568   -0.828   -0.936
X2             1.000   -0.479   -0.705
Examples
Correlation Matrix of Y1, Y2, Y3, and Y4
         Y1       Y2       Y3       Y4
Y1    1.000    0.638   -0.561   -0.584
Y2             1.000   -0.824   -0.127
Y3                      1.000   -0.264
Examples
Correlation Matrix of Y1, Y2, Y3, and Y4
and X1, X2, X3, and X4
         Y1       Y2       Y3       Y4
X1   -0.201   -0.573    0.727   -0.458
X2   -0.468   -0.550    0.699   -0.138
Examples
The eigenvalues are 0.7425, 0.2049, 0.1425 and 0.0069
The corresponding canonical correlations are
the square roots of the eigenvalues: 0.8617, 0.4527,
0.3775, and 0.0833
$X^2 = 18.34 < \chi^2_{0.05, 16} = 26.2962$; fail to reject
the null hypothesis of zero correlation
Examples
U1 = -0.09X1-0.29X2+0.48X3+0.29X4
V1 = 0.54Y1+0.42Y2-0.10Y3+0.82Y4
U2 = 2.31X1-0.73X2+0.45X3+1.27X4
V2 = -1.66Y1-2.20Y2+0.45Y3+2.77Y4
U3 = 3.02X1+1.33X2+0.57X3+3.58X4
V3 = -3.56Y1-1.35Y2-3.86Y3-2.86Y4
Examples
U1 is a contrast between maximum and minimum
temperature on one side and precipitation on the other
V1 is for 0.60-, 0.80-, and 1.16-mobility
genes
Correlations between environmental
variables
Examples
Correlations between environmental variables
Altitude               -0.92
Precipitation          -0.77
Maximum Temperature     0.90
Minimum Temperature     0.92
Examples
Correlations between genetic variables
Mobility 0.40/0.60      0.38
Mobility 0.80           0.74
Mobility 1.00          -0.96
Mobility 1.16           0.49
V1 indicates a lack of mobility-1.00 genes
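Correlations of this kind, between a canonical variate and the original variables in its own set, follow from the within-set correlation matrix and the coefficient vector. The sketch below uses the standard identity for standardized variables, corr(X, U) = A a / sqrt(a'Aa). The function name and the small input matrices are hypothetical and are not taken from the butterfly data.

```python
import numpy as np


def structure_correlations(R, coef):
    """Correlations between standardized variables and their canonical variates.

    R is the within-set correlation matrix (A for the X set, B for the Y set);
    coef holds one coefficient vector per column.
    """
    cov = R @ coef                                # cov(variable j, variate i)
    sd = np.sqrt(np.sum(coef * (R @ coef), axis=0))  # sd of each variate
    return cov / sd


# Hypothetical two-variable set with one variate U = X1 + X2.
R = np.array([[1.0, 0.5], [0.5, 1.0]])
coef = np.array([[1.0], [1.0]])
print(structure_correlations(R, coef))
```

Reading these correlations alongside the raw coefficients helps when the variables in a set are strongly intercorrelated, since the coefficients alone can then be misleading.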
Examples
Soil and Vegetation
Prehistoric Maya sites in Belize in Central
America
Four soil variables
Four vegetation variables
Examples
Soil and Vegetation
Soil variables
X1: % of soil with constant lime enrichment
X2: % of meadow soil with calcium
groundwater
X3: % of soil with coral bedrock under
Examples
Soil and Vegetation
Vegetation variables
Y1: % of deciduous seasonal broadleaf forest
Y2: % of high and low marsh forest,
herbaceous marsh, and swamp
Y3: % of cohune palm forest
Y4: % mixed forest
Examples
The eigenvalues are 0.580, 0.320, 0.059 and
0.0149
The canonical correlations are 0.762, 0.566,
0.243 and 0.122
$X^2 = 193.63 > \chi^2_{0.05, 16} = 26.2962$,
Examples
U1 = 1.34X1+0.34X2+1.33X3+0.59X4
V1 = 1.71Y1+1.07Y2+0.22Y3+0.52Y4
U2 = 0.41X1+0.90X2+0.23X3+0.89X4
V2 = 0.64Y1+1.47Y2+0.27Y3+0.28Y4
U3 = -0.44X1-0.51X2+0.18X3+0.93X4
V3 = -0.18Y1-0.24Y2+0.93Y3+0.22Y4
U4 = -0.44X1-0.02X2+0.72X3+0.15X4
V4 = 0.12Y1+0.01Y2+0.26Y3-0.93Y4
Examples
Correlation between the canonical
variables and soil variables
        U1      U2      U3      U4
X1    0.55   -0.23    0.00   -0.80
X2   -0.02    0.73   -0.68   -0.04
Examples
Correlation between the canonical
variables and vegetation variables
        V1      V2      V3      V4
Y1    0.77   -0.58   -0.08    0.24
Y2   -0.36    0.91   -0.19   -0.03
Y3    0.03    0.13    0.95    0.28
Examples
The most important relationships between soil and
vegetation variables are described by the first two
pairs of canonical variables
The presence of soil types 1 and 3 and the absence of soil
type 4 are associated with the presence of vegetation type 1
The presence of soil types 2 and 4 is associated with the
presence of vegetation type 2 and absence of vegetation
type 1