Dimension Reduction

(1)

http://www.hmwu.idv.tw

吳漢銘國立政治大學統計學系

維度縮減

Dimension Reduction

C01

(2)

Visualizing High-dimensional Data:

Dimension Reduction Techniques



Why Reduce Dimensionality?



Feature Selection vs Extraction



Dimension Reduction Techniques

1) Spectral Decomposition (Eigen) 譜(特徵值)分解 2) Singular Value Decomposition (SVD) 奇異值分解 3) Principal Component Analysis (PCA) 主成份分析 4) Biplot 雙標圖

5) Factor Analysis (FA) 因子分析

6) Multidimensional Scaling (MDS) 多元尺度法

7) Isometric Feature Mapping (ISOMAP) 等軸距特徵映射 8) Linear Discriminant Analysis (LDA) 線性區別分析

9) Sliced Inverse Regression (SIR) 切片逆迴歸法

10) Canonical Correlation Analysis (CCA) 典型相關分析

 The high-dimension low sample size data (HDLSS)

 DR Quality Assessment

2/144

(3)

Dimension Reduction

^3/144

PCA FA MDS

Methods using additional information y: LDA, Sufficient Dimension Reduction (SIR, SAVE, pHd, IRE,...)

input data matrix:

X

Y

keep information as much as possible without loss of information.

transformed data matrix: Z

Visualization Clustering Classification ....

(4)

Why Reduce Dimensionality?



Reduces time complexity:

less computation



Reduces space complexity:

Less parameters



Saves the cost of observing the features:

input is unnecessary.



Simpler models are more robust on small datasets:

simpler models vary less depending on the particulars of a sample.



More interpretable; simpler explanation



Data visualization (structure, groups, outliers, etc) if

plotted in 2 or 3 dimensions . (Dimension reduction visualization is often adopted for presenting grouping structure for methods such as K-

means. )

4/144

(5)

DR: Feature Selection vs Extraction

Feature Selection



Choosing k<d important features



Ignoring the remaining d – k



Subset selection algorithms

Feature Extraction



Project the original x

_i

, i =1,...,d

dimensions to new k<d dimensions, z

_j

, j =1,...,k.



z = w

^T

x



PCA, FA, MDS, LDA, SIR

5/144

z = w

^T

x

(6)

Subset Selection

 Finding the best subset of the set of features.

 Best: contains the least number of dimensions that most contribute to accuracy.

Forward Search: Add the best feature at each step

1. Set of features Finitially Ø.

2. At each iteration, find the best new feature

3. j = argmini E ( F ∪x_i )

4. Add x_i to F if E ( F ∪ x_i) < E ( F )

5. Stop: |E^t-E^t+1| < epsilon

Backward Search:

Start with all features and remove one at a time

1. Set of features Finitially ALL.

2. At each iteration, find the best new feature

3. j = argmini E ( F - x_i )

4. remove x_ifrom Fif E ( F - x_i ) < E ( F )

5. Stop: |E^t-E^t+1| < epsilon

Others: Floating search (Add k, Remove l), . . .

6/144

Training error

Validation error

(test error)

(7)

(1) Spectral Decomposition

^(1/2)



Definition: Let A be a m × m real symmetric matrix.

Then there exists an orthogonal matrix C such that

C

^T

AC= D or A=C DC

^T

, where D is a diagonal matrix of eigenvalues, and the C matrix has the eigenvectors.

(A. L. Cauchy established the Spectral Decomposition in 1829.)



Eigen-decomposition : Ac

_i

=λ

_i

c

_i

, i=1,...,m.



Let Z=W

^T

X, or Z=W

^T

(X-m) center the data on the origin. To find a matrix W such that when we have

Z=W

^T

X, we will get Cov(Z)=D is any diagonal matrix.

7/144

(8)

Spectral Decomposition

^(2/2)



We would like to get uncorrelated z

_i.



Let

C

= [c

_i

] be the normalized eigenvector of S, then (1)

C^TC = I

and

(2)

8/144

Application:

• Data reduction.

• Image processing and compression.

• K-selection for K-means clustering

• Multivariate outliers detection

• Noise filtering

• Trend detection in the observations.

(9)

Spectral Decomposition

^{in R} ^9/144

> # Spectral Decomposition

> X <- iris[,1:4]

> (S <- cov(X))

Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707 Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394 Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094 Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063

> (e <- eigen(S)) eigen() decomposition

$values

[1] 4.22824171 0.24267075 0.07820950 0.02383509

$vectors

[,1] [,2] [,3] [,4]

[1,] 0.36138659 0.65658877 -0.58202985 0.3154872 [2,] -0.08452251 0.73016143 0.59791083 -0.3197231 [3,] 0.85667061 -0.17337266 0.07623608 -0.4798390 [4,] 0.35828920 -0.07548102 0.54583143 0.7536574

> D <- diag(e$values)

> C <- e$vectors

> C%*%D%*%t(C)

[,1] [,2] [,3] [,4]

[1,] 0.6856935 -0.0424340 1.2743154 0.5162707 [2,] -0.0424340 0.1899794 -0.3296564 -0.1216394 [3,] 1.2743154 -0.3296564 3.1162779 1.2956094 [4,] 0.5162707 -0.1216394 1.2956094 0.5810063

> # Eigen-decomposition

> S%*%C[,1]

[,1]

Sepal.Length 1.5280299 Sepal.Width -0.3573816 Petal.Length 3.6222104 Petal.Width 1.5149333

> D[1]*C[,1]

[1] 1.5280299 -0.3573816 [3] 3.6222104 1.5149333

(10)

Some Decompositions in R

 Eigen-decomposition

 eigen {base}: Spectral Decomposition of a Matrix

 LU decomposition

 LU-class {Matrix}: LU (dense) Matrix Decompositions

 lu {Matrix}: (Generalized) Triangular Decomposition of a Matrix

 LU.decompose {optR}: LU decomposition

 svd {base}

: Singular Value Decomposition of a Matrix

 qr {base}

: The QR Decomposition of a Matrix

 If A is a m×n matrix with linearly independent columns, then A can be decomposed as , A=QR, where Q is a m × n matrix whose

columns form an orthonormal basis for the column space of A and R is an nonsingular upper triangular matrix.

 chol {base}

: The Choleski Decomposition

 Cholesky Decomposition: If A is a real, symmetric and positive definite matrix then there exists a unique lower triangular matrix L with positive diagonal element such that A=LL^T .

10/144

> eigen(cov(iris[,1:4]))

(11)

(2) Singular Value Decomposition

 The m columns of U are called the left singular vectors. The n columns of V are called the right singular vectors.

 The right singular vectors are eigenvectors of A^TA.

 The left singular vectors are eigenvectors of AA^T.

 The non-zero values of S are the square root of the eigenvalues of AA^T and A^TAare called the singular values

 U and V are unitary matrixes, i.e. UU^T=I, VV^T=I.

11/144

奇異值分解

(12)

SVD Examples

SVD in R

 svd{base}: Singular Value Decomposition of a Matrix

https://stat.ethz.ch/R-manual/R-devel/library/base/html/svd.html

 fast.svd{corpcor}: Fast Singular Value Decomposition http://svitsrv25.epfl.ch/R-doc/library/corpcor/html/fast.svd.html

 R Package 'svd': Interfaces to various state-of-art SVD and eigensolvers https://cran.r-project.org/web/packages/svd/index.html

 R Package 'svdvisual’: SVD visualization tools

https://cran.r-project.org/web/packages/svdvisual/index.html

R Code Examples

 Data Mining Algorithms In R/Dimensionality Reduction/Singular Value Decomposition

https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposit ion

 R Code Fragments, Examples of Singular Value Decomposition

http://www.ats.ucla.edu/stat/r/pages/svd_demos.htm

 Running PCA and SVD in R

http://genomicsclass.github.io/book/pages/pca_svd.html

 R-bloggers: Using the SVD to find the needle in the haystack

http://www.r-bloggers.com/using-the-svd-to-find-the-needle-in-the-haystack/

 Image Compression with the SVD in R

http://www.johnmyleswhite.com/notebook/2009/12/17/image-compression-with-the-svd-in-r/

 The Netflix Prize, Big Data, SVD and R

http://blog.revolutionanalytics.com/2011/05/the-neflix-prize-big-data-svd-and-r.html

 Singular Value Decomposition in R

http://www.di.fc.ul.pt/~jpn/r/svd/svd.html

12/144

(13)

SVD: Making Approximations

^13/144

> iris.sub <- iris[sample(1:150, 8),1:4]

> iris.sub

Sepal.Length Sepal.Width Petal.Length Petal.Width 58 4.9 2.4 3.3 1.0 59 6.6 2.9 4.6 1.3 77 6.8 2.8 4.8 1.4 84 6.0 2.7 5.1 1.6 54 5.5 2.3 4.0 1.3 39 4.4 3.0 1.3 0.2 71 5.9 3.2 4.8 1.8 87 6.7 3.1 4.7 1.5

> M.svd <- svd(iris.sub)

> M.svd

$d

[1] 22.2553197 2.2360595 0.7611307 0.1773813

$u

[,1] [,2] [,3] [,4]

[1,] -0.2898553 -0.096733080 -0.007319459 -0.05013322 [2,] -0.3885172 -0.006908608 0.305429166 -0.28018114 [3,] -0.3992219 0.066823762 0.450619905 0.06954991 [4,] -0.3793864 0.327663490 -0.174978375 -0.69121320 [5,] -0.3275406 0.106859076 0.218964907 0.50428246 [6,] -0.2285027 -0.918407065 -0.136688043 -0.13248141 [7,] -0.3782377 0.154232634 -0.776519488 0.27425130 [8,] -0.3989561 -0.009386179 0.058068514 0.29884180

$v

[,1] [,2] [,3] [,4]

[1,] -0.7497994 -0.3154600 0.5318218 0.2354811 [2,] -0.3527468 -0.5480151 -0.7276317 -0.2140122 [3,] -0.5343751 0.7023730 -0.1376906 -0.4496184 [4,] -0.1667746 0.3268587 -0.4108028 0.8346201

> M.svd$u %*% (diag(M.svd$d) %*% t(M.svd$v)) [,1] [,2] [,3] [,4]

[1,] 4.9 2.4 3.3 1.0 [2,] 6.6 2.9 4.6 1.3 [3,] 6.8 2.8 4.8 1.4 [4,] 6.0 2.7 5.1 1.6 [5,] 5.5 2.3 4.0 1.3 [6,] 4.4 3.0 1.3 0.2 [7,] 5.9 3.2 4.8 1.8 [8,] 6.7 3.1 4.7 1.5

>

> # use the first two values to approximate

> d.sub <- diag(M.svd$d[1:2])

> u.sub <- as.matrix(M.svd$u[, 1:2])

> v.sub <- as.matrix(M.svd$v[, 1:2])

> iris.sub.approx <- u.sub %*% d.sub %*% t(v.sub)

> iris.sub.approx

[,1] [,2] [,3] [,4]

[1,] 4.905057 2.394043 3.295235 1.0051334 [2,] 6.488070 3.058517 4.609664 1.4369796 [3,] 6.614690 3.052204 4.852772 1.5306008 [4,] 6.099701 2.576853 5.026535 1.6476201 [5,] 5.390302 2.440411 4.063166 1.2938078 [6,] 4.460863 2.919270 1.275109 0.1768745 [7,] 6.202869 2.780357 4.740493 1.5166003 [8,] 6.664012 3.143504 4.729919 1.4739142

>

> # compute the sum of squared errors

> sum((iris.sub - iris.sub.approx)^2) [1] 0.610784

(14)

SVD: Image Compression

^(1/3) ^14/144

Lenna 97: A Complete Story of Lenna

http://www.ee.cityu.edu.hk/~lmpo/lenna/Lenna97.html https://en.wikipedia.org/wiki/Lenna

This scan became one of the most used images in computer history.[4] In a 1999 issue of IEEE Transactions on Image Processing"Lena" was used in three separate articles,[5] and the picture continued to appear in scientific journals throughout the beginning of the 21st century. ... To explain Lenna's popularity, David C. Munson, editor-in-chief of IEEE Transactions on Image Processing, noted that it was a good test image because of its detail, flat regions, shading, and texture. However, he also noted that its popularity was largely because an image of an attractive woman appealed to the males in a male-dominated field

> # require packages: locfit, tiff, fftwtools

> library(EBImage) # (Repositories: BioC Software)

> lena <- readImage("lena.jpg")

> dims <- dim(lena)

> dims

[1] 512 512 3

>

> plot(c(0, dims[1]), c(0, dims[2]), type='n', xlab="", ylab="")

> rasterImage(lena, 0, 0, dims[1], dims[2])

> library(jpeg) # install.packages("jpeg")

> lena <- readJPEG("lena.jpg")

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("EBImage")

(15)

SVD: Image Compression

^(2/3) ^15/144

> lena.flip <- Image(flip(lena))

> # convert RGB to grayscale

> red.weight <- .2989

> green.weight <- .587

> blue.weight <- 0.114

>

> lena.gray <- red.weight * imageData(lena.flip)[,,1] + + green.weight * imageData(lena.flip)[,,2] + + blue.weight * imageData(lena.flip)[,,3]

> dim(lena.gray) [1] 512 512

> lena.gray[1:5, 1:5]

[,1] [,2] [,3] [,4] [,5]

[1,] 0.07128863 0.06344627 0.05952510 0.06736745 0.07913098 [2,] 0.07913098 0.06736745 0.06344627 0.07128863 0.08305216 [3,] 0.08697333 0.07520980 0.07128863 0.07913098 0.09089451 [4,] 0.09089451 0.08305216 0.07913098 0.08305216 0.09873686 [5,] 0.09570980 0.08786745 0.08002510 0.08786745 0.10355216

> image(lena.gray, col = grey(seq(0, 1, length = 256))) Converting RGB to grayscale/intensity

http://stackoverflow.com/questions/687261/converting-rgb-to-grayscale-intensity

工程師都愛的萊娜小姐究竟是誰？(林一平 2019-12-06)

https://www.digitimes.com.tw/col/article.asp?id=1129

In 1972, at the age of 21. 67-year-old

(16)

SVD: Image Compression

^(3/3) ^16/144

> lena.svd <- svd(lena.gray)

> d <- diag(lena.svd$d)

> dim(d) [1] 512 512

> u <- lena.svd$u

> v <- lena.svd$v

> plot(1:length(lena.svd$d), lena.svd$d, pch=19, xlab="i-th lena.svd$d", ylab="lena.svd$d")

>

> used.no <- 20

> u.sub <- as.matrix(u[, 1:used.no])

> v.sub <- as.matrix(v[, 1:used.no])

> d.sub <- as.matrix(d[1:used.no, 1:used.no])

> lena.approx <- u.sub %*% d.sub %*% t(v.sub)

> image(lena.approx, col = grey(seq(0, 1, length = 256)))

(17)

(3) Principal Component Analysis

^17/144

 As an exploratory tool to uncover unknown trends in the data.

 Make a visual inspection of the relationship between observations in a multi-dimensional matrix.

 PCA is a method that reduces data dimensionality by performing a covariance analysis between factors.

 PCA is to ‘summarize’ the data, it is not considered a clustering tool.

Karl Pearson 1901;

Hotelling 1933;

Jolliffe 2002

On lines and planes of closest fit to systems of points in space.

The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 6 (2), 559-572, 1901.

Karl Pearson (1857-1936)

(18)

PCA

^18/144

PCA is a method that reduces data dimensionality by finding the new variables (major axes, principal components).

Image source: 61BL4165 Multivariate Statistics, Department of Biological Sciences, Manchester Metropolitan University

Amongst all possible projections, PCA finds the projections so that the

maximum amount of information, measured in terms of variability, is retained

in the smallest number of dimensions.

(19)

PCA: Loadings and Scores

^19/144

PCA is the spectral decomposition of the variance covariance

matrix, S, of the data matrix, X. we can write S=CDC

^T

.

(20)

PCA: General Methodology

^(1/3)



Find a low-dimensional space such that when x is projected there, information loss is minimized.



PCA centers the sample and then rotates the axes to line up with the directions of highest variance.

20/144

?cov

x <- iris[, 1:4]

cov(x)

(21)

PCA: General Methodology

^(2/3) ^21/144

拉格朗日

x <- iris[, 1:4]

(covx <- cov(x)) e <- eigen(covx) V <- e$vectors

V.inverse <- solve(e$vectors)

covx.hat <- V %*% diag(e$values) %*% V.inverse covx.hat # same with covx

(22)

PCA: General Methodology

^(3/3) ^22/144

Proof (by induction – Assume true for i -1, then prove true for i) w

_i

is the eigenvector of S associated with the i

^th

largest eigenvalue.

 w_i

's are uncorrelated (orthogonal)

 w₁

explains as much as possible of original variance in data set

 w₂

explains as much as possible of remaining variance, etc.

(23)

PCA: Solutions

 If the dimensions are highly correlated, there will be a small number of

eigenvectors with large eigenvalues, and k will be much smaller than d and a large reduction in dimensionality may be attained.

 If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.

 S is symmetric: for two different eigenvalues, the eigenvectors are orthogonal.

 If S is positive definite: (x’Sx>0, x!=null): all eigenvalues are positive.

 If S is singular: then its rank, the effective dimension is k < d, and λ_i=0, i > k.

23/144

# PCA for iris data

z <- as.matrix(x) %*% e$vectors[, 1:2]

plot(z[, 1], z[, 2], col=iris[, 5])

(24)

When to use PCA?

 As an exploratory tool to uncover unknown trends in the data.

 PCA on genes provide a way to identify predominant gene expression patterns.

 PCA on conditions explore correlations between samples or conditions.

 PCA is to ‘summarize’ the data, it is not considered a clustering tool.

24/144

Yeast Microarray Data is from

DeRisi, JL, Iyer, VR, and Brown, PO.(1997). "Exploring the metabolic and genetic control of gene expression on a genomic scale"; Science, Oct 24;278(5338):680-6.

(25)

How Many Components to Use?

 The proportion of Variance explained by the first

k

principal components

 If

λ

_j

< 1

then component explains less variance than original variable (correlation matrix).

 Use 2 components (or 3) for visual ease.

 Keep only the eigenvectors with eigenvalues greater than average eigenvalue.

 Keep only those which have higher than the average input variance.

25/144

(26)

Scree Diagram

At the elbow , adding another eigenvector does not significantly increase the variance explained.

26/144

(27)

PCA: Scaling

^(1/2)



Different variables have completely different scaling.

 Eigenvalues of the matrix is scale dependent.

 (If we would multiply one column of the data matrix Xby some scale factor (say s) then variance of this variable would increase by s2 and this variable can dominate whole covariance matrix and hence whole eigenvalue and eigenvectors.)



Covariance Matrix

 Variables must be in same units.

 Emphasizes variables with most variance.

 Mean eigenvalue > 1.0



Correlation Matrix

 Variables are standardized (mean=0.0, SD=1.0)

 Variables can be in different units.

 All variables have same impact on analysis.

 Mean eigenvalue = 1.0

27/144

(28)

PCA: Scaling

^(2/2)

 Bring all data to the same scale: If scale of the data is unknown

then it is better to use correlation matrix instead of the

covariance matrix.



The interpretation of the principal components derived by these two methods (Covariance Matrix and Correlation Matrix) can be completely different.



If the variance of the original xi dimensions vary considerably, they affect the directions of the principal components more than the correlations.

 Standardized

data by columns, mean=0, sd=1,



PCA on correlation matrix, for the correlations to be effective and not the individual variances.

28/144

(29)

PCA: Potential Problems

 Sensitive to outliers

 Robust estimation

 Discarding the isolated data points that are far away.

 Lack of Independence

 NO PROBLEM

 Lack of Normality

 Normality desirable but not essential

 Lack of Precision

 Precision desirable but not essential

 Many Zeroes in Data Matrix

 Problem

(use Correspondence Analysis)

29/144

Example:

This works better than taking the average because we take into account correlations and

differences in variances.

average

(30)

Circle of Correlations

^30/144

Circles of correlations associated with the first two DR components based on the applied vector and path point

approaches applied to the microarray data of the yeast cell cycle.

(31)

Principal Components Analysis and SVD

31/144

Source: Singular Value Decomposition, João Neto, http://www.di.fc.ul.pt/~jpn/r/svd/svd.html

> m1 <- matrix(sample(1:16,16),4,4)

> m1

[,1] [,2] [,3] [,4]

[1,] 15 1 16 7 [2,] 8 9 2 10 [3,] 13 4 12 3 [4,] 5 14 11 6

> m1.scale.svd <- svd(scale(m1))

> m1.scale.svd

$d

[1] 2.827057e+00 1.759769e+00 9.544443e-01 4.258869e-17

$u

[,1] [,2] [,3] [,4]

[1,] -0.5580839 -0.3507292 0.5617218 0.5 [2,] 0.5716314 -0.5999038 -0.2517003 0.5 [3,] -0.4320314 0.2946885 -0.6902952 0.5 [4,] 0.4184839 0.6559445 0.3802737 0.5

$v

[,1] [,2] [,3] [,4]

[1,] -0.5663183 -0.3664762 -0.1512544 0.72256547 [2,] 0.5395013 0.4585555 0.1574801 0.68837872 [3,] -0.5008709 0.3789166 0.7772586 -0.03767775 [4,] 0.3706081 -0.7154329 0.5900773 0.05112996

> m1.pca <- prcomp(m1, scale=T)

> m1.pca

Standard deviations:

[1] 1.6322e+00 1.016003e+00 5.510487e-01 2.458859e-17 Rotation:

PC1 PC2 PC3 PC4 [1,] -0.5663183 -0.3664762 -0.1512544 0.72256547 [2,] 0.5395013 0.4585555 0.1574801 0.68837872 [3,] -0.5008709 0.3789166 0.7772586 -0.03767775 [4,] 0.3706081 -0.7154329 0.5900773 0.05112996

> pca2 <- princomp(m1, cor=T)

> pca2$scores

Comp.1 Comp.2 Comp.3 Comp.4 [1,] 1.821811 0.7126839 0.6190721 6.227657e-16 [2,] -1.866036 1.2190081 -0.2773982 -9.020562e-16 [3,] 1.410325 -0.5988089 -0.7607725 -1.804112e-16 [4,] -1.366101 -1.3328832 0.4190986 4.649059e-16

> pca2$loadings Loadings:

Comp.1 Comp.2 Comp.3 Comp.4 [1,] 0.566 0.366 -0.151 0.723 [2,] -0.540 -0.459 0.157 0.688 [3,] 0.501 -0.379 0.777 [4,] -0.371 0.715 0.590

Comp.1 Comp.2 Comp.3 Comp.4 SS loadings 1.00 1.00 1.00 1.00 Proportion Var 0.25 0.25 0.25 0.25 Cumulative Var 0.25 0.50 0.75 1.00

The principal components are equal to the right singular values if the variables of data is

scaled. (subtract the mean, divide by the standard deviation).

(32)

PCA in R

 5 functions to do Principal Components Analysis in R (by Gaston Sanchez): prcomp{stats}; princomp{stats};

PCA{FactoMineR}; dudi.pca{ade4}; acp{amap}; epPCA{ExPosition}

(http://gastonsanchez.com/blog/how-to/2012/06/17/PCA-in-R.html)

 STHDA: Statistical tools for high-throughput data analysis

 Principal component analysis in R : prcomp() vs. princomp() - R software and data mining

http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software- and-data-mining

 FactoMineR and factoextra: Principal Component Analysis Visualization - R software and data mining

http://www.sthda.com/english/wiki/factominer-and-factoextra-principal-component-analysis- visualization-r-software-and-data-mining

 Gregory B. Anderson, principal component analysis in R

https://www.ime.usp.br/~pavan/pdf/MAE0330-PCA-R-2013

 R Package "bigpca": Title PCA, Transpose and Multicore Functionality for 'big.matrix'

https://cran.r-project.org/web/packages/bigpca/

32/144

(33)

範例: Microarray Data of Yeast Cell Cycle

 Synchronized by alpha factor arrest method (Spellman et al. 1998; Chu et al. 1998)

 103 known genes: every 7 minutes and totally 18 time points. (remove NA’s: 79 genes)

33/144

cell.matrix <- read.table("YeastCellCycle_alpha.txt", header=TRUE, row.names=1) n <- dim(cell.matrix)[1]

p <- dim(cell.matrix)[2]-1

cell.data <- cell.matrix[,2:p+1]

gene.phase <- cell.matrix[,1]

phase <- unique(gene.phase)

phase.name <- c("G1", "S", "S/G2", "G2/M", "M/G1") cell.sdata <- t(scale(t(cell.data)))

rc <- rainbow(5)[as.integer(gene.phase)]

cc <- rainbow(ncol(cell.sdata))

hv <- heatmap(cell.sdata, col = GBRcol, scale = "column", Colv=NA, Rowv=NA, RowSideColors = rc, ColSideColors = cc, margins = c(5,10),

xlab = "Times", ylab = "Genes", main = "Heatmap of Microarray Data")

(34)

PCA on Conditions

^34/144

cell.pca <- princomp(cell.sdata, cor=TRUE, scores=TRUE)

# 2D plot for first two components pca.dim1 <- cell.pca$scores[,1]

pca.dim2 <- cell.pca$scores[,2]

plot(pca.dim1, pca.dim2, main="PCA for Cell Cycle Data on Genes", xlab="1st PCA Component", ylab="2nd PCA Component",col=c(phase), pch=c(phase))

legend(3, 4, phase.name, pch=c(phase), col=c(phase))

# shows a screeplot.

plot(cell.pca) biplot(cell.pca)

(35)

Loadings Plot

^35/144

# loadings plot

plot(loadings(cell.pca)[,1], loadings(cell.pca)[,2], xlab="1st PCA", ylab="2nd PCA", main="Loadings Plot", type="n")

text(loadings(cell.pca)[,1], loadings(cell.pca)[,2], labels=paste(1:p)) abline(h=0)

abline(v=0)

# print loadings loadings(cell.pca) summary(cell.pca)

(36)

範例: two-dimensional simulated data

^36/144

> library(MASS)

> mu <- c(2, -1)

> Sigma <- matrix(c(2.4, -0.5, -0.5, 1), 2)

> n <- 250

> X <- mvrnorm(n, mu, Sigma)

>

> mycol <- terrain.colors(n)

> sorted.x1 <- sort(X[,1])

> order.x1 <- order(X[,1])

> id <- 1:n

> sorted.id <- id[order.x1]

> x1.col <- mycol[order(sorted.id)]

>

> par(mfrow=c(1, 2))

> plot(X, col=x1.col, pch=16,

+ main="simulated bivariate normal")

> abline(h=0, v=0, col="gray")

> X.pca <- princomp(X, cor = TRUE)

> X.pca$sdev

Comp.1 Comp.2 1.1875339 0.7679605

> X.pca$loadings Loadings:

Comp.1 Comp.2 [1,] -0.707 -0.707 [2,] 0.707 -0.707

Comp.1 Comp.2 SS loadings 1.0 1.0 Proportion Var 0.5 0.5 Cumulative Var 0.5 1.0

> plot(X.pca$scores, col=x1.col, pch=16, main="PCA")

> abline(h=0, v=0, col="gray")

(37)

範例: decathlon2 {factoextra} : Athletes' performance in decathlon

37/144

Practical Guide to Principal Component Methods in R http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-

principal-component-analysis-essentials/

use FactoMineR for the analysis and factoextra for ggplot2-based visualization.

> pca.pkg <- c("FactoMineR", "factoextra", "corrplot")

> install.packages(pca.pkg)

> lapply(pca.pkg, library, character.only=TRUE)

> data(decathlon2) # 十項全能

> head(decathlon2) # 100米, 跳遠, 鉛球, 跳高, 400米, 110米跨欄, 鐵餅, 撐竿跳高, 標槍, 1500米 X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus

SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75 CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72 BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87 YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26 ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67 McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41

Pole.vault Javeline X1500m Rank Points Competition SEBRLE 5.02 63.19 291.7 1 8217 Decastar CLAY 4.92 60.15 301.5 2 8122 Decastar BERNARD 5.32 62.77 280.1 4 8067 Decastar YURKOV 4.72 63.44 276.4 5 8036 Decastar ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar McMULLEN 4.42 56.37 285.1 8 7995 Decastar

> dim(decathlon2) [1] 27 13

> # standardization # x <- scale(x)

> x <- decathlon2[,1:10]

> # (default, PCA {FactoMineR} standardizes the data automatically FactoMineR: Multivariate Exploratory Data Analysis and Data Mining

(38)

範例: PCA {FactoMineR}

^38/144

> # PCA(X, scale.unit = TRUE, ncp = 5, graph = TRUE)

> x.pca <- PCA(x, graph = FALSE)

> class(x.pca)

[1] "PCA" "list "

> str(x.pca) List of 5

$ eig : num [1:10, 1:3] 3.75 1.745 1.518 1.032 0.618 ...

..- attr(*, "dimnames")=List of 2 ...

..$ call : language PCA(X = x, graph = FALSE) - attr(*, "class")= chr [1:2] "PCA" "list "

> print(x.pca)

**Results for the Principal Component Analysis (PCA)**

The analysis was performed on 27 individuals, described by 10 variables

*The results are available in the following objects:

name description 1 "$eig" "eigenvalues"

2 "$var" "results for the variables"

3 "$var$coord" "coord. for the variables"

4 "$var$cor" "correlations variables - dimensions"

...

15 "$call$col.w" "weights for the variables"

(OLD version)

# get_eigenvalue(x.pca): Extract the eigenvalues/variances of principal components

# fviz_eig(x.pca): Visualize the eigenvalues

# get_pca_ind(x.pca), get_pca_var(x.pca): Extract the results for individuals and variables.

# fviz_pca_ind(x.pca), fviz_pca_var(x.pca): Visualize the results individuals and variables.

# fviz_pca_biplot(x.pca): Make a biplot of individuals and variables.

FactoMineR (version 2.3)

x.pca$eig x.pca$ind x.pca$var

(39)

Eigenvalues/Variances, scree plot

^39/144

> # Eigenvalues/Variances

> eig.val <- get_eigenvalue(x.pca)

> eig.val

eigenvalue variance.percent cumulative.variance.percent Dim.1 3.7499727 37.499727 37.49973 Dim.2 1.7451681 17.451681 54.95141 Dim.3 1.5178280 15.178280 70.12969 Dim.4 1.0322001 10.322001 80.45169 Dim.5 0.6178387 6.178387 86.63008 ...

Dim.10 0.1122959 1.122959 100.00000

> # scree plot

> fviz_eig(x.pca, addlabels = TRUE, ylim = c(0, 50))

New Version, See

plot.PCA {FactoMineR}

(40)

Correlation Circles

^40/144

> var <- get_pca_var(x.pca)

> var

Principal Component Analysis Results for variables

===================================================

Name Description 1 "$coord" "Coordinates for the variables"

2 "$cor" "Correlations between variables and dimensions"

3 "$cos2" "Cos2 for the variables"

4 "$contrib" "contributions of the variables"

> # Coordinates of variables

> head(var$coord, 4)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 X100m -0.8189521 0.3427787 0.100864539 0.10134200 -0.2198061 Long.jump 0.7588985 -0.3814931 -0.006261254 -0.18542415 0.2637141 Shot.put 0.7150783 0.2821167 0.473854591 0.03610404 -0.2786370 High.jump 0.6084933 0.6113542 0.004605966 0.07124353 0.3005866

> # Correlation circle, variable correlation plots

> fviz_pca_var(x.pca, col.var = "black")

# Positively correlated variables are grouped together.

# Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).

# The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

# var$cos2: quality of representation for variables on the factor map.

var.cos2 = var.coord * var.coord.

# var$contrib: the contributions of the variables to the principal

components.

(var.cos2*100) / (total cos2 of the component).

(41)

Quality of representation:

square cosine (cos2 values) plot

41/144

> head(var$cos2, 4)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 X100m 0.6706825 0.11749725 1.017366e-02 0.010270201 0.04831474 Long.jump 0.5759270 0.14553701 3.920330e-05 0.034382115 0.06954513 Shot.put 0.5113370 0.07958983 2.245382e-01 0.001303502 0.07763857 High.jump 0.3702640 0.37375400 2.121493e-05 0.005075641 0.09035233

> corrplot(var$cos2, is.corr=FALSE)

> fviz_cos2(x.pca, choice = "var", axes = 1:2)

> # variables with low/mid/high cos2 values will be colored in blue/yellow/red

> fviz_pca_var(x.pca, col.var = "cos2",

+ gradient.cols = c("blue", "yellow", "red"), + repel = TRUE) # Avoid text overlapping

(42)

Contributions of variables to PCs

^42/144

> head(var$contrib, 4)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 X100m 17.88500 6.732718 0.670277238 0.9949816 7.81996 Long.jump 15.35817 8.339426 0.002582856 3.3309545 11.25620 Shot.put 13.63575 4.560583 14.793387670 0.1262838 12.56616 High.jump 9.87378 21.416504 0.001397716 0.4917303 14.62394

> corrplot(var$contrib, is.corr=FALSE)

> # Contributions of variables to PC1

> fviz_contrib(x.pca, choice = "var", axes = 1, top = 10)

> # Contributions of variables to PC2

> fviz_contrib(x.pca, choice = "var", axes = 2, top = 10)

> # The total contribution to PC1 and PC2:

> fviz_contrib(x.pca, choice = "var", axes = 1:2, top = 10)

(43)

Highlight the most important variables

^43/144

> fviz_pca_var(x.pca, col.var = "contrib",

+ gradient.cols = c("blue", "yellow", "red"))

> set.seed(123)

> var.kms <- kmeans(var$coord, centers = 3, nstart = 25)

> kms.grp <- as.factor(var.kms$cluster)

> # Color variables by kmeans' result

> fviz_pca_var(x.pca, col.var = kms.grp, palette = c("blue", "green", "red"), + legend.title = "Cluster")

(44)

Obtain PCA results for individuals

^44/144

> ind <- get_pca_ind(x.pca)

> ind

Principal Component Analysis Results for individuals

===================================================

Name Description 1 "$coord" "Coordinates for the individuals"

2 "$cos2" "Cos2 for the individuals"

3 "$contrib" "contributions of the individuals"

> # Coordinates of individuals

> head(ind$coord, 3)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 SEBRLE 0.2779582 -0.5364345 1.5852390 0.10582254 1.07462350 CLAY 0.9048536 -2.0942803 0.8406848 1.85071783 -0.40864484 BERNARD -1.3722659 -1.3481155 0.9619317 -1.49307179 -0.18266695

> # Quality of individuals

> head(ind$cos2, 3)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 SEBRLE 0.015446507 0.05753138 0.50241310 0.0022388646 0.2308788213 CLAY 0.065574226 0.35127414 0.05660346 0.2743196867 0.0133742243 BERNARD 0.232236641 0.22413434 0.11411498 0.2749258382 0.0041150408

> # Contributions of individuals

> head(ind$contrib, 3)

Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 SEBRLE 0.07630746 0.6107062 6.132015 0.04018174 6.92267277 CLAY 0.80865774 9.3082618 1.724567 12.29002506 1.00104404 BERNARD 1.85987901 3.8570314 2.257887 7.99896372 0.20002354

cos2 (Quality of individuals):https://personal.utdallas.edu/~herve/abdi-awPCA2010.pdf

(45)

Graph of individuals

^45/144

> # color individuals by their cos2 values

> fviz_pca_ind(x.pca, col.ind = "cos2", gradient.cols = c("blue", "black", "red"), + repel = TRUE)

>

> # change the point size according the cos2 of the corresponding individuals

> fviz_pca_ind(x.pca, pointsize = "cos2", pointshape = 21, fill = "lightblue", + repel = TRUE)

(46)

Graph of individuals

^46/144

> fviz_pca_ind(x.pca, geom.ind = "point", col.ind = decathlon2[,13], + palette = c("blue", "red"), legend.title = "Competition")

>

> # quality of representation (cos2) of individuals on the factor map

> fviz_cos2(x.pca, choice = "ind", top = 5)

>

> # Total contribution of individuals on PC1 and PC2

> fviz_contrib(x.pca, choice = "ind", axes = 1:2, top = 5)

get eigenvectors x.pca$svd$V

(47)

(4) The Biplot

^47/144

• K. R. Gabriel (1971). The biplot graphical display of matrices with application to principal component analysis. Biometrika 58, 453–467.

• J.C. Gower and D. J. Hand (1996). Biplots. Chapman & Hall.

(48)

Moving Towards the Biplot:

Rotation of axes

48/144

x <- c(-0.9, 0.6, 0.1) y <- c(-0.5, 0, 0.4)

plot(x, y, xlim=c(-1, 1), ylim=c(-1, 1), main="Data Input Space") abline(h=0, v=0, col="blue", lwd=2)

text(x+0.05, y+0.05, c("s1", "s2", "s3"), col="red") pca <- princomp(cbind(x, y)); ab <- pca$loadings

arrows(-ab[1,1], -ab[2,1], ab[1,1], ab[2,1], col="green", angle = 10, lwd=2) text(-ab[1,1], -ab[2,1], "Comp.1")

arrows(-ab[1,2], -ab[2,2], ab[1,2], ab[2,2], col="green", angle = 10, lwd=2) text(-ab[1,2], -ab[2,2], "Comp.2")

(49)

Graphical Display of Matrix Multiplication

49/144

A(4, 2) B(2, 3) P(4, 3)

Inner product property:

• P

_ij = OA_i

×

OB_j

×

cosα_ij

• Implies the product matrix

(Source) Modified from: Biplot Analysis of

Multi-Environment Trial Data, Weikai Yan, May 2006

Source: Biplots in Practice, Michael Greenacre.

• The biplot points can be projected onto the biplot axes, and their projections, multiplied by the length of the corresponding biplot vectors, give the scalar products which are equal to the elements of the target matrix.

• The cosines of the angles between the variables approximate the pairwise

correlation coefficients.

(50)

Singular Value Decomposition (SVD) &

Singular Value Partitioning (SVP)















 



r k

kj f k f

k ik SVP

r k

kj k ik SVD

ij

b a

P

1

1 1

) )(

(  



50/144

(0 ≤ f

≤

¹⁾

“Singular values”

Matrix

characterising the rows

Matrix

characterising the columns

SVD:

SVP:

The ‘rank’ of Y, i.e., the minimum number of PC required to fully represent Y

Rows scores Column scores Biplot

Plot Plot

(Source) from: Biplot Analysis of Multi-Environment Trial Data, Weikai Yan, May 2006

(51)

biplot {stats} :

Biplot of Multivariate Data

51/144

iris.pca <- princomp(iris[,1:4])

biplot(iris.pca, main="Biplot for iris data")

(52)

Interpretation of Biplot

 Interpreting Points: the relative location of the points can be interpreted.

 Points that are close together correspond to observations that have similar scores on the components displayed in the plot.

 To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.

 Interpreting Vectors: both the direction and length of the vectors can be interpreted.

 A vector points in the direction which is most like the variable represented by the vector.

This is the direction which has the highest squared multiple correlation with the principal components.

 The length of the vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.

 The fitted values for a variable are the result of projecting the points in the space orthogonally onto the variable's vector (to do this, you must imagine extending the vector in both directions).

 The observations whose points project furthest in the direction in which the vector points are the observations that have the most of whatever the variable measures.

Those points that project at the other end have the least. Those projecting in the middle have an average amount.

 Thus, vectors that point in the same direction correspond to variables that have similar response profiles, and can be interpreted as having similar meaning in the context set by the data.

52/144

Source: Principal Components: BiPlots, Forrest Young's Notes

http://forrest.psych.unc.edu/research/vista-frames/help/lecturenotes/lecture13/biplot.html

(53)

Biplot in R

 biplot {stats}: Biplot of Multivariate Data

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/biplot.html

 biplot.princomp {stats}: Biplot for Principal Components:

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/biplot.princomp.html

 ggbiplot: https://github.com/vqv/ggbiplot

 biplot.rda {vegan}: http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/biplot.rda.html

 biplot.psych {psych}: http://www.personality-project.org/r/html/biplot.psych.html https://cran.r-project.org/web/packages/psych/

 bpca: Biplot of Multivariate Data Based on Principal Components Analysis

https://cran.r-project.org/web/packages/bpca/index.html

 dynBiplotGUI: Full Interactive GUI for Dynamic Biplot in R

https://cran.r-project.org/web/packages/dynBiplotGUI/index.html

 The BiplotGUI package homepage: http://biplotgui.r-forge.r-project.org/

 BiplotGUI: Interactive Biplots in R: https://cran.r-project.org/web/packages/BiplotGUI/index.html

 biplotbootGUI: Bootstrap on Classical Biplots and Clustering Disjoint Biplot:

https://cran.r-project.org/web/packages/biplotbootGUI/index.html

 GGEBiplotGUI: GGEBiplotGUI: Interactive GGE Biplots in R:

https://cran.r-project.org/web/packages/GGEBiplotGUI/index.html

 multibiplotGUI: Multibiplot Analysis in R:

https://cran.r-project.org/web/packages/multibiplotGUI/index.html

 MultBiplotIn R: http://biplot.dep.usal.es/classicalbiplot/multbiplot-in-r/

 NominalLogisticBiplot: Biplot representations of categorical data

https://cran.r-project.org/web/packages/NominalLogisticBiplot/index.html

 OrdinalLogisticBiplot: Biplot representations of ordinal variables

https://cran.r-project.org/web/packages/OrdinalLogisticBiplot/index.html

 PLSbiplot1: The Partial Least Squares (PLS) Biplot

https://cran.r-project.org/web/packages/PLSbiplot1/index.html

 PCABiplot: Principal Components Analysis Biplots

http://biplot.usal.es/ClassicalBiplot/multbiplot-in-r/pcabiplot-manual.pdf

53/144

(54)

ggbiplot for Iris Data

^54/144

> library(devtools)

> install_github("vqv/ggbiplot")

>

> library(ggbiplot)

> iris.prcomp <- prcomp(iris[,1:4], scale. = TRUE)

> ggbiplot(iris.prcomp, groups = iris[,5])

> ggbiplot(iris.prcomp, obs.scale = 1, var.scale = 1, + groups = iris[,5], ellipse = TRUE, circle = TRUE) + + scale_color_discrete(name = '') +

+ theme(legend.direction = 'horizontal', legend.position = 'top')

(55)

Creating A Biplot Using SVD

^(1/3) ^55/144

Source: Measurement, Scaling, and Dimensional Analysis, Bill Jacoby Summer 2015

> library(lattice)

> state.spending <- read.table("StatePolicySpending.txt", header = T, row.names=1)

> head(state.spending)

Education Welfare Highways HealthCare Corrections LawEnforcement AL 0.863 0.448 0.168 0.284 0.044 0.183 AR 0.887 0.492 0.241 0.150 0.044 0.141 AZ 0.765 0.473 0.249 0.112 0.090 0.364 CA 0.916 0.817 0.142 0.170 0.104 0.425 CO 0.782 0.415 0.188 0.108 0.076 0.282 CT 0.764 0.742 0.251 0.323 0.128 0.347

> spend <- scale(state.spending)

> spend.svd <- svd(spend, nu=2, nv=2)

>

> D <- spend.svd$d[1:2]

> V <- spend.svd$v[,1:2]

> U <- spend.svd$u[,1:2]

>

> # Create matrices for variables and observations by weighting singular vectors with

> # the square roots of the first two singular values. These will be used to construct

> # a symmetric biplot.

>

> spend.var <- V * (rep(1, length(V[,1])) %*% t(D^.5))

> spend.obs <- U * (rep(1, length(U[,1])) %*% t(D^.5))

> row.names(spend.var) <- colnames(spend)

> row.names(spend.obs) <- row.names(spend)

(56)

Creating A Biplot Using SVD

^(2/3) ^56/144

> # Within the panel function, "panel.xyplot" draws the observation points and

> # "panel.segments" draws the variable vectors. The first "panel.text" labels the vectors,

> # and the second "panel.text" provides labels for relatively extreme observation points.

> xyplot(spend.obs[,2] ~ spend.obs[,1], + aspect = 1,

+ panel = function(x, y) {

+ panel.xyplot(x, y, col = "black")

+ panel.segments(rep(0, length(spend.var[,1])), + rep(0, length(spend.var[,1])),

+ spend.var[,1], spend.var[,2], lty = 1, col = "blue") + panel.text(spend.var[,1], spend.var[,2],

+ row.names(spend.var), cex = .7, col = "green")

+ panel.text(x[abs(x)>.7 | abs(y)>.7]+.04, y[abs(x)>.7 | abs(y)>.7], + row.names(spend.obs)[abs(x)>.7 | abs(y)>.7], cex = .7, + adj = 0, col= "red")

+ },

+ xlim = c(-2.5, 2.5), ylim = c(-2.5, 2.5), + xlab = " ", ylab = " ")

>

> # Calculate proportion of variance explained by

> # first two pairs of singular vectors

> var.explained <- sum(D^2)/sum(spend.svd$d^2)

> var.explained [1] 0.6233372

Dimension Reduction

http://www.hmwu.idv.tw

吳漢銘 國立政治大學 統計學系

維度縮減

Dimension Reduction

C01

Visualizing High-dimensional Data:

Dimension Reduction Techniques

Why Reduce Dimensionality?

Feature Selection vs Extraction

Dimension Reduction Techniques

 The high-dimension low sample size data (HDLSS)

 DR Quality Assessment

Dimension Reduction

input data matrix:

X

Y

transformed data matrix: Z

Why Reduce Dimensionality?

Reduces time complexity:

less computation

Reduces space complexity:

Less parameters

Saves the cost of observing the features:

input is unnecessary.

Simpler models are more robust on small datasets:

simpler models vary less depending on the particulars of a sample.

More interpretable; simpler explanation

Data visualization (structure, groups, outliers, etc) if

plotted in 2 or 3 dimensions . (Dimension reduction visualization is often adopted for presenting grouping structure for methods such as K-

means. )

DR: Feature Selection vs Extraction

Feature Selection

Choosing k<d important features

Ignoring the remaining d – k

Subset selection algorithms

Feature Extraction

Project the original x

, i =1,...,d

dimensions to new k<d dimensions, z

, j =1,...,k.

z = w

x

PCA, FA, MDS, LDA, SIR

z = w

x

Subset Selection

Training error

Validation error

(test error)

(1) Spectral Decomposition

Definition: Let A be a m × m real symmetric matrix.

Then there exists an orthogonal matrix C such that

C

AC= D or A=C DC

, where D is a diagonal matrix of eigenvalues, and the C matrix has the eigenvectors.

Eigen-decomposition : Ac

=λ

c

, i=1,...,m.

Let Z=W

X, or Z=W

(X-m) center the data on the origin. To find a matrix W such that when we have

Z=W

X, we will get Cov(Z)=D is any diagonal matrix.

Spectral Decomposition

We would like to get uncorrelated z

Let

= [c

] be the normalized eigenvector of S, then (1)

and

(2)

Spectral Decomposition

Some Decompositions in R

: Singular Value Decomposition of a Matrix

: The QR Decomposition of a Matrix

: The Choleski Decomposition

(2) Singular Value Decomposition

奇異值分解

SVD Examples

吳漢銘國立政治大學統計學系