利用相關性矩陣降維進行雙分群分析：以基因表現資料為例

(1)

國立台灣大學公共衛生學院流行病學與預防醫學研究所碩士論文

Institute of Epidemiology and Preventive Medicine College of Public Health

National Taiwan University Master Thesis

利用相關性矩陣降維進行雙分群分析：

以基因表現資料為例

A biclustering method with correlation matrix for gene expression profiling

陳建瑋 Chian Wei Chen

指導教授: 蕭朱杏博士

Advisor: Chuhsing Kate Hsiao, Ph.D.

中華民國 103 年 7 月

July, 2014

(2)

致謝

這個碩士論文，首先要非常感謝我的指導教授蕭朱杏老師。蕭老師擁有豐富的研究知識，以身作則帶領我們做研究。從老師身上我學會了不僅是如何做研究，

更學習到認真負責任的生命態度。每當我感到困惑或陷入研究瓶頸時，她總是能抽出時間陪我聊天並給予方向。跟老師聊完後，目標明確許多也不再那麼徬徨。

蕭老師總是不斷的鼓勵學生，“你絕對會解決了這個問題！＂，“你會按時畢業，

真的！＂很幸運這兩年能在蕭老師的實驗室學習、成長！

要謝謝郭柏秀老師與李美賢老師擔任口試委員，從不同角度來評論這篇論文，

兩位老師提的問題與建議都讓論文架構與內容更豐富完整。

另外要感謝我的同學劉智彥，他教了我很多 R 的語言。也要謝謝 R01 的碩班同學們，我們一同學習、研究，也一同玩樂、聊天。也要謝謝蕭老師實驗室的成員很親切、什麼問題都難不倒的彥雯學姐，穿著變化多端也最可愛的家瑋，快要

是極端氣候達人的于瑄，直升博班的書如，世亨學長與小麥。在lab meting 上要

用英文互相提問，看到大家抽籤被抽到的尷尬表情還有絞盡腦汁想問題與回答的模樣還是很有趣。也要感謝我的女友珮綺，碩班的這兩年我們互相學習和成長，

彼此打氣加油的模樣是最美的記憶。最後，我要感謝我的家人支持我念碩士學位，

溫暖而彼此關懷的家庭是能夠心無旁騖念書的穩定依靠。

建瑋

(3)

中文摘要

雙分群分析方法近年來在統計學上是相當重要的分析工具，特別是在歸類哪些基因在某些特定實驗下會有相似的基因表現。雙分群分析目標是找出哪些基因在一個特定實驗集合下的基因表現會有相同表現趨勢。先前研究大多是類別分析方法的推廣，集中研究於基因在所有的實驗條件之間的相似性。在本篇論文當中我們提出利用基因表達之間的相關性矩陣以及實驗條件之間的相關性矩陣降維進行雙分群分析，簡稱 BiCor。利用這兩個相關性矩陣，每次的迭代運算都會刪除最不相關的基因或實驗條件。根據預先指定的收斂條件，結果會得到較小的矩形陣列，此矩形陣列裡的基因表現從基因角度以及實驗條件角度看來都有相似的趨勢。我們更進一步定義真實偵測率（TDR）與成功被偵測率（DTR）用來評估 BiCor 的表現。最後利用模擬試驗與實際資料進行分析，比較 BiCor 和其他現有雙分群分析方法優劣。

關鍵字: 雙分群, 相關性, 基因表現

(4)

Abstract

Biclustering has become an important analytical tool in recent statistical practice, particularly when it is of interest to group genes under certain experimental conditions.

The goal of such biclustering analysis is to identify sets of genes sharing similar expression patterns across subsets of samples. Previous developed approaches were mostly extensions of clustering methods and thus focused more on similarity between genes across all experimental conditions. Here we proposed a bicluster algorithm via correlation matrices, called BiCor, between gene expression patterns and between conditions. Each of these two matrices was visited iteratively to remove the most irrelevant genes or conditions. Under a pre-specified convergence criterion, the resulting smaller rectangular contains expression levels that are considered similar at both the gene and the condition level. We further defined the true discovery rate (TDR) and discovered true rate (DTR) to assess the performance of the proposed algorithm.

Simulation studies and applications were conducted to evaluate and compare the proposed BiCor with other existing algorithms.

Key words: Bicluster, correlation, gene expression

(5)

In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and experimental conditions simultaneously. Such analyses are usually called biclustering and are used to identify sets of genes sharing similar expression patterns across subsets of samples, meaning the genes may work together under these

conditions. Biclustering methods can be used not only in gene expression profiling data but also in other biological data.

Biclustering is a method that identifies sets of genes sharing similar expression patterns across subsets of experiment conditions. The difference between traditional clustering and biclustering algorithm is that the clustering method clusters only rows or only columns in a data matrix; while the biclustering method clusters rows and columns simultaneously. With biclustering, genes belonging to different groups of conditions may be identified. In addition, a gene functions under different experiment conditions can be identified if it is grouped in different biclusters.

In contrast, cluster analysis was used to group a set of objects such as subjects who behave similarly in one group than in others. The main purpose of cluster analysis is exploratory data mining. There are many cluster analysis methods such as K-means, hierarchical clustering, Independent component analysis (ICA) and

(10)

Principal component analysis (PCA). These methods have been applied in many fields, such as bioinformatics (gene expression data), machine learning, image analysis and pattern recognition. An underlying assumption of cluster analysis is that the grouped objects (subjects or genes) behave similarly across all experiment conditions

(measurement), treating all conditions exchangeable. In reality, however, genes tend to co-regulate in some experiment conditions but not in all experiment conditions. In this case, therefore, a biclustering algorithm would serve the purpose better than a clustering algorithm.

The idea of biclustering was first proposed by Hartigan (1972). Currently there are four biclustering algorithms widely used in research, -size bicluster, Cheng and Church’s algorithm (CC, Cheng et al., 2000), Statistical-Algorithmic Method of Bicluster Analysis (SAMBA, Tanay et al., 2002), Iterative Signature Algorithm (ISA, Ihmels et al., 2002, 2004), and Binary inclusion-maximal biclustering algorithm (Bimax, Prelic´ et al., 2006). A special feature of CC is that it was applied under a fixed size of bicluster. This fixed size denotes the upper limit of the mean squared residual (MSR) of the bicluster. For SAMBA, it uses bipartite graph and binomial distribution to find the potential bicluster. For ISA, its pros is that ISA uses iterative method to see if the output bicluster wiil be the same when different initial genes are considered as input. The advantage of BiMax is that BiMax can

(11)

find the largest number of biclusters because BiMax searches every possible biclusters.

These four methods did not consider correlation between genes or between conditions but assumed genes independent and experiment conditions independent as well. In fact, genes may co-express in a condition but not in other conditions, leading to correlation between conditions for this certain set of genes. Hence, correlation is an important and intuitive characteristic that should be accounted for in biclustering algorithms.

In this paper, we calculate first the gene-gene and condition-condition correlation matrices, and then iteratively reduce the size to a bicluster if the criterion is satisfied.

One advantage of the proposed BiCor is that the algorithm does not need to be normalized because the operation of correlation matrices is not affected by the original scales. Next, we carried out simulations and compare BiCor with other biclustering algorithms. Finally we conclude with a discussion and conclusion.

(12)

Method

Notations

Let A_{n m}_* be the data matrix of gene expressions from n genes

1 2 3

{ ,G G G, ,..., }G_n and m experiment conditions { , , ,...,C C C₁ ₂ ₃ C_m}. Each row vector x , _i i1,...,n of the matrix A_{n m}_* is of dimension 1 m , and can be written as x_i ( ,g g_i₁ _i₂,...,g_im) where g_ij stands for the expression level of the gene G _i under the condition C_j. Similarly, each column vector y_j, j1,...,m of the matrix

*

An m is of dimension n , and can be written as 1 y_j (g₁_j,g₂_j,...,g_nj) standing for the gene expression levels of genes { ,G G G₁ ₂, ,..., }₃ G_n under the same condition

Cj. Thus the matrix A_{n m}_* is

11 1

*

1

m n m

n nm

g g

A

g g

 

 

  

 

 



  



.

Our algorithm use Pearson correlation coefficient to measuring similarity between expression patterns of two genes G_i and G or between two conditions _i_'

Cj and C_j_'. For instance, the Pearson correlation coefficient between G_i and G is _i_' defined as :

' '

' ' 1

2 2

' '

1 1

( )( )

( , ) ( , )

( ) ( )

m

il i i l i

l

i i i i m m

il i i l i

l l

g g g g

Corr G G Corr

g g g g



 

 

 

 



 

x x . (1)

Here the g_il and g are the _{i l}_' l-th expression value of the i-th gene and 'i -th

(13)

gene, and the g_i and g_i_' are the mean values over m expression values of the i -th gene and 'i -th gene. Since there are n genes in total, the gene-gene correlation matrix becomes an n n matrix ^G^{n n}^ ^

 

^g_{ij n n}_ ^.

The correlation between conditions C_j and C_j_' is defined in a similar way,

' '

1

' '

2 2

' '

1 1

( )( )

( , ) ( , )

( ) ( )

m

lj j lj j

l

j j j j m m

lj j lj j

l l

g g g g

Corr C C Corr

g g g g



 

 

 

 



 

y y .

And the resulting column-column correlation matrix ^C^{m m}^ ^

 

^c_{ij m m}_ is of dimension

m m .

A bicluster B_k can be defined as a subset of genes I_k possesing a similar behavior over a subset of experiment conditions (measurements) J_k. Thus, a bicluster

Bk can be represented as B_k ( , )I J_k _k . This bicluster B_k ( , )I J_k _k contains a subset I I_k( _k { ,G G G₁ ₂, ,..., })₃ G_n of genes and a subset J J_k( _k { , , ,...,C C C₁ ₂ ₃ C_m}) of experiment conditions where each gene in I_k is correlated with a correlation value greater than or equal to a pre-specified threshold (_gene or _condition), with all other genes in I_k over the measurements in J_k . That is, |Corr(x^{( )}_i^k ,x^{( )}_i_'^k ) |_gene if both gene i and gene i^' belong to I_k) and |Corr(y^{( )}_j^k ,y^{( )}_j^k_' ) |_condition if the conditions

j and j' are in the same subset. Note that we used the notations x^{( )}_i^k and y^{( )}_j^k to indicate that the correlations are measured over the genes and conditions in the

corresponding bicluster B_k only, not over the original n genes and m experiment

(14)

conditions.

Algorithm

The proposed algorithm starts with the expression matrix A_{n m}_* with n genes and m experiment conditions. For the purpose of illustration, we first perform our bicluster algorithm on the column-column correlation matrix.

Step 1: Calculate the column-column correlation matrix C_{m m}_ and identify the pair with the minimum absolute value of correlation. For example, if conditions

Cj and C_j_' has the minimum absolute value of correlation in the

column-column correlation matrix C_{m m}_ . Then delete one of the pair (C_j,C_j_') whose summation of absolute correlation values over rows,

1

| |

n jk k

c



 ^or '

1

| |

n j k k

c



 , is smaller. After removing one condition, the original expression data matrix would be reduced to the matrix A_n⁽¹⁾_{ }₍_m₁₎.

Step 2: For the remaining n genes and m experiment conditions, calculate the 1 row-row correlation matrix G_{n n}⁽¹⁾_ and identify the pair of genes

corresponding to the minimum absolute value of correlation. For instance, if

Gi and G_'i has the smallest absolute value of correlation in the row-row correlation G_{n n}⁽¹⁾_ , then compute the two sums of correlations,

1

| |

m ik k

r



 ^and '

1

| |

m i k k

r



 , and remove either G_i or G_'i with the smaller value. The original expression matrix is now updated to A₍⁽²⁾_n_{  }_{1) (}_m ₁₎.

(15)

Step 3: Repeat Steps 1 and 2 until every absolute value of correlation in the

column-column correlation matrix and row-row correlation matrix is greater

than or equal to a pre-specified threshold (^gene or ^condition). The resulting matrix then leads to a bicluster B_k ( , )I J_k _k where

( ) ( )

|Corr(x_i^k ,x_i'^k ) |_gene if gene ⁱand gene i^' belong to I and ^k ^|^Corr⁽y^{( )}^j^k ^,y^{( )}^j^k^' ^{) |}^condition

if

conditions j and j belong to ^' J . ^k

If we start the bicluster algorithm from the row-row correlation matrix, then step 2 will be performed as an initial step before step 1. The Steps are as follows:

Step 1: Calculate the row-row correlation matrix G_{n n}_ and identify the pair with the minimum absolute value of correlation. For example, if genes G_j and G_j_'

has the minimum absolute value of correlation in the row-row correlation matrix G_{n n}_ . Then delete one of the pair (G_j,G_j_') whose summation of absolute correlation values over rows,

1

| |

n jk k

r



 ^or ^' 1

| |

n j k k

r



 , is smaller. After removing one condition, the original expression data matrix would be

reduced to the matrix A₍⁽¹⁾_n_{ }₁₎_m.

Step 2: For the remaining n genes and m experiment conditions, calculate the 1 column-column correlation C_{m m}⁽¹⁾_ and identify the pair of conditions

corresponding to the minimum absolute value of correlation. For instance, if

Ci and C_i_' has the smallest absolute value of correlation in the

(16)

column-column correlation C_{m m}⁽¹⁾_ , then compute the two sums of correlations,

1

| |

n jk k

c



 ^or ^' 1

| |

n j k k

c



 , and remove either C_i and C_i_' with the smaller value.

The original expression matrix is now updated to A₍⁽²⁾_n_{  }_{1) (}_m ₁₎. Step 3: Repeat Steps 1 and 2 until every absolute value of correlation in the

column-column correlation matrix and row-row correlation matrix is greater

than or equal to a pre-specified threshold (^gene or ^condition). The resulting matrix then leads to a bicluster B_k ( , )I J_k _k where

( ) ( )

|Corr(x_i^k ,x_i'^k ) |_gene if gene ⁱand gene i^' belong to I and ^k ^|^Corr⁽y^{( )}^j^k ^,y^{( )}^j^k^' ^{) |}^condition

if

conditions j and j belong to ^' J . ^k

It is worth noting that no matter which direction (row or column) we start with, the resulting DTR differs by only 0.02, and the selected rows and columns in the identified bicluster only differ by 1-2 rows or 1-2 columns.

Choice of Thresholds

We propose three choices for both the threshold _gene and _condition. The first one is user-defined. Since the correlation between expression levels in a bicluster may depend on the strains of cells considered in the study and the treatments applied on the cells, expert’s opinion on the degree of correlation should be ascertained and to

construct the threshold values. For example, one researcher may prefer _gene=0.3 and

condition

 =0.5 in the algorithm; while another may select a more strict standard as _gene

(17)

=0.5 and _condition=0.5.

The second choice is data-dependent. For example, one may consider the 75th, 85thor 95thpercentile of the correlations in the column-column (or row-row) correlation matrix as the threshold. Two advantages are associated with this choice.

First, such choice would be practical especially when no expert knowledge is available. Second, the different percentiles may help to investigate the sensitivity of the final biclusters in the threshold values. In the rest of the paper, we adopt this choice and will carry out comparison for different choices.

The third choice is also data-dependent but requires prior statistical inference from a mixture model. Under the assumption that the correlations in the

column-column (or row-row) correlation matrix consists of a bicluster and noise, we use the mixtools package in R software to fit a mixture model of two normal

components. The threshold is then determined as the value where two normal density functions intersect. Figure 1 is an illustration of a mixture model for correlations using data (S_{100 100}_ ) randomly selected from A_{200 200}_ with seed 1. A two-component normal mixture model was then fitted as the figure showed. We can see that there were two distributions in the mixture model, where the proportion of the red distribution was 84%. The mean of the red distribution was 0.018, while the mean for the green distribution was 0.15. The intersection occurs at correlation=0.18. If there are three

(18)

biclusters in data, then there will be three correlation distributions in the mixture model. In this case, we can find the rightmost bicluster first, with one pair of

thresholds in the proposed algorithm, and then use the remaining data (the left part of the histogram of correlations) to undergo further biclustering procedure to separate the rest two distributions.

(19)

Simulation

Simulation settings

To evaluate the performance of the proposed biclustering algorithm and to compare with other existing methods such as CC, Bimax, and ISA, we performed simulation studies. First we constructed a larger population matrix of expression levels A_{200 200}_ containing a true bicluster B_{50 50}_ as well as other noises. Let B_{50 50}_ denote the expression levels from 50 truly clustered genes and 50 clustered conditions.

The expression levels in B_{50 50}_ were generated in a conditional fashion, where the first random vector in B_{50 50}_ was from a multivariate normal distribution

50*50

1 0.5 0.5

( , 0.5 0.5 )

0.5 0.5 1 MVN

 

 

   

 

 

0  ,

and the other random vectors were from another multivariate normal with a

conditional mean vector of the first generated random vector. Such conditioning was to ensure that the correlations between genes Corr x x( , )_i _i' would be no less than 0.3 (Corr( , ) 0.3x x_i _i'  ); while the correlation between conditions would be around 0.5.

The reason for a larger threshold for the correlations between conditions was because, in thereal data, the correlation among genes was smaller than the correlation between conditions. For the remaining components in A_{200 200}_ , they were all generated from a standard normal distribution. The large population matrix A_{200 200}_ now contained 200

(20)

genes and 200 conditions. All the generations were carried out with mvtnorm package in R software.

Next we selected randomly 100 gene samples under 100 conditions from the population matrix A_{200 200}_ with 30 genes and 30 conditions from the true bicluster

50 50

B _ to form a sample matrix S_{100 100}_ . This sample matrix was then analyzed with BiCor and other biclutsering algorithms. Such replications were carried out 100 times and the resulting identified biclsuters were collected for comparison. Another

simulation setting selected 15 genes and 20 conditions from B_{50 50}_ to evaluate the performance.

Criteria for performance evaluation

To evaluate the performance of bicluster algorithm, Li considered gene ontology weighted enrichment score and protein-protein interaction score (Li et al., 2012), Prelic considered proportion of disconnected gene pairs and average shortest distance in the graph for metabolic pathway map (MPM) for A. thaliana and a protein–protein interaction network (PPI) for S. cerevisiae (Prelic A. et al., 2006). Here we proposed two criterion for performance evaluation. The first one is true discovery rate (TDR). It indicates the proportion of the true bicluster among the identified bicluster,

(21)

The bigger the true discovery rate of genes (or conditions) is, the better the identified bicluster. A value close to 1 implies a large proportion of true genes (or conditions) in the identified bicluster. This measure, however, cannot evaluate if the identified bicluster recovers most of the original true bicluster. Therefore, we propose the second criterion discovered true rate (DTR).

ate

The bigger the discovered true rate of genes (conditions) is, the better the identified bicluster.

Both criteria above evaluate only one direction of the identified bicluster. To assess the two-dimensional matrix, we combine the TDR and DTR as an overall measure of performance:

∗

(22)

∗

These measurements are next considered in the simulation studies.

For example, if we considered a S(30 70) (30 70)_ _ _ matrix for biclustering, where the first 30 rows and first 30 columns were from the true bicluster B_{50 50}_ and the rest components were noise. If the BiCor identified a biclusterB(25 2) (24 3)_{ } _ containing 25 true genes and 24 true conditions, but 2 false positive genes and 3 false positive conditions. Then the criteria of performance can be calculated, as listed in Table 1. If BiCor identified a much smaller bicluster B(10 0)*(15 0)_ _ , then its TDR remained large but the DTR decreased reflecting the fact many genes and conditions have not been recovered. Figures 2-4 are TDR and DTR for three examples. Figure 2 demonstrates the case when TDR and DTR are small, Figure 3 for large TDR and small DTR, and Figure 4 for small TDR and large DTR.

Note that DTR is like the sensitivity and TDR the specificity. An alternative way to compare the performance of the algorithms is the sum of TDR and DTR. One should bear in mind however that the sum of TDR and DTR may disguise the original value of TDR or DTR.

Results

To compare with other existing methods, we consider CC (Cheng et al., 2000), BiMax (Prelic´ et al., 2006), and ISA (Ihmels et al., 2002, 2004). Table 2 lists the

(23)

parameter values suggested by the authors to be used in the algorithms. The results are shown in Tables 3 and Tables 4. The rates in Tables 3 and Tables 4 were also plotted in Figure 5 and Figure 6.

In the left part of Table 3, under the first simulation setting of 30 true genes and 30 true conditions in S_{100 100}_ both the marginal true discovery rate for genes or conditions and the overall TDR for BiCor under four different sets of threshold values {(0.3, 0.5), (0.2, 0.5), (0.2, 0.4), (0.1, 0.3)} remains close to 1, outperforming CC and ISA. Although BiCor and BiMax had similar performance in TDR, the computation time for BiCor was much less than BiMax. The latter took about 5 minutes for each iteration, while BiCor took only 4 seconds.

The marginal and overall discovered true rates are listed in the right part of Table 3. Such criteria evaluate how many genes or conditions were recovered among the original 30 genes and 30 conditions. It can be observed that the rates under BiCor were between 0.63 and 0.92, depending on the strictness of the threshold values. Less stringent values would lead to better performance. Under these criteria, BiCor

performs the second best, next to CC, indicating that CC’s method usually identifies a large bicluster, as compared with other algorithms.

In the second simulation we considered a smaller proportion of true bicluster.

The second simulation setting considered only 15 true genes and 20 conditions from

(24)

the true bicluster B_{50 50}_ , along with other noise components in the matrix S_{100 100}_ . The number of replications remained at 100. The top half of Table 4 contained the same numbers from the Table 3; while the lower half of Table 4 contained the results under the second simulation settings. It is obvious that, when the proportion of the true bicluster in S_{100 100}_ is smaller, the performance of BiCor becomes sensitive due to the fact that the identified bicluster becomes small, especially the marginal and overall DTR which used 15 and 20 in the denominator than 50 and 50 in the first setting. The other biclustering algorithms were not compared here because of their poorer performance than that in the first setting.

In addition, we investigated the sensitivity of the performance with respect to data-dependent thresholds. In Table 5, different percentiles were considered in both simulation settings. Although the marginal and overall TDR and DTR in Table 5 were not as good as those in Table 4 under the fixed thresholds, the rates remain satisfactory.

The values were expressed in Figure 7.

Considering all the factors affecting the performance, we recommend Bicor over other biclustering algorithms because of its balance between two types of performance evaluation, less computational burden, and because of its robustness to the choice of thresholds.

(25)

Application

For real data analysis, we considered the gene expression data from Arabidopsis

thaliana. Arabidopsis thaliana is a small flower with short life cycle of about 6 weeks

from germination to mature seed. The genome of Arabidopsis thaliana is small, only approximately 135 megabase pairs (Mbp) in 5 chromosomes, and was the first to be sequenced in the year 2000. To understand plant traits, it is popular to consider

Arabidopsis thaliana as an experimental material. In the following we introduce

NASC's data and MVA and MEP pathway study as applications.

NASC's data

We used the real Arabidopsis data from NASC's International Affymetrix Service (http://arabidopsis.info/affy/). The data can be downloaded at

http://data.iplantcollaborative.org/quickshare/da175c84258a9cf3/Exp340.zip.

(Thilmon et al, 2006) The data contain 734 genes and 23 experiment conditions (each experiment condition were replicated 3 times, leading to a total of 69 experiment conditions). Details are shown in Table 8. As stated in the paper by Thilmon et al.

(2006), “Pseudomonas syringae pv. tomato DC3000 (Pst) is a virulent pathogen, which causes disease on tomato and Arabidopsis. The type III secretion system (TTSS) plays a key role in pathogenesis by translocating virulence effectors from the bacteria into the plant host cell, while the phytotoxin coronatine (COR) contributes to

(26)

virulence and disease symptom development.” The goal of this study was to

understand if both TTSS and COR are associated with the suppression of host basal defenses.

After performing the biclustering algorithm, we can see in Figure 8 that there are two distributions in correlation of genes and correlation of conditions. Therefore we used _gene=0.4, _condition=0.6 and _gene=0.4, _condition=0.7 in the BiCor for analysis.

In Table 6, we can see that CC identified a large bicluster because CC’s original idea is to find any possible bicluster. BiMax and ISA identified the same condition (DC3000-10e6-24h) and 8 genes. These 8 genes were also identified by BiMax and were among the 43 genes found by ISA. Hence, we have more confidence that these 8 genes work together under the DC3000-10e6-24h condition. BiCor’s bicluster is very different from that under BiMax and ISA. This is because BiCor uses correlation of gene expression data to bicluster, while BiMax and ISA use gene expression value to bicluster. As Table 6 shows, BiCor can identify one type of experiment condition alone with its other replications. The DTR of experiment conditions are 73.3% under BiCor(0.4,0.7) and 86.7%. under BiCor(0.4,0.6). Although DTR of experiment conditions under BiCor is smaller than BiMax and ISA, BiCor identified more experiment conditions than BiMax and ISA.

(27)

MVA and MEP pathway data

This study investigated two pathways, one was the mevalonate pathway (MVA) and the other was non-mevalonate pathway (MEP). The data were collected from the 118 GeneChip (Affymetrix) microarrays with 39 genes, where 20 of which were assigned to MVA and 19 to MEP, as shown in Figure 8 and Table 9. We use this gene network as a standard to compare the performance of bicluster algorithms. The data can be downloaded at

http://genomebiology.com/content/supplementary/gb-2004-5-11-r92-s1.txt. [Wille et

al., 2004]

Based on the prior knowledge of the existence of MVA and MEP pathways in

Arabidopsis data, we fitted a mixture model of correlations in the Arabidopsis data

with two distributions (Figure 9). We also plot the histograms of correlations for the data under MVA and MEP pathways separately in Figures 10 and 11. We can see that the histograms of correlations of data under MVA and MEP pathways are really similar to the distributions in the mixture model. We use then considered

condition 0.8

  to perform BiCor.

Table 10 shows the bicluster results of four bicluster algorithms and two cluster results. BiCor performs the best in terms of TDR for each pathway, the identified biclusters by BiCor are all in their pathways. However, the DTR of pathways was not

(28)

large. When threshold of gene became strict, the performance of DTR of pathway got worse. CC performed the best in terms of DTR, almost 74% of genes in two pathways were found. In fact, the correlations under MVA and MEP pathways are similar, therefore it is not easy to have good bicluster results. In this case, it is a tradeoff between TDR and TDR, while one increases when the other decreases.

In cluster algorithms, we use Pearson’s correlation distance as a measure of similarity to perform hierarchical clustering (HCL) and K-means. The TDR of pathway and DTR of pathway under K-means was approximately 82%, while that under HCL was only 49%. K-means performs the best here because the current application is interested in clustering, not bicluster.

(29)

Discussion

In this paper, we propose a new bicluster algorithm called Biclustering methods via correlation matrix (BiCor). This method uses two correlation matrices to cluster data such as gene expression levels. In the simulation studies, we showed that BiCor can successfully identify the true bicluster with large true discovery rate and

discovered true rate. In addition, BiCor outperforms other existing algorithms like CC, BiMax and ISA. However, BiCor can identify one bicluster at one time; while others can find more than one bicluster. To identify more than one biclusters that are

non-overlapping, we can use different pairs of _gene and _condition, as stated at the end of Section Choice of Thresholds.

BiCor has three advantages. First, BiCor does not require the normalization step either in the levels of genes or conditions. Only the correlation among observations will be investigated. The origin data magnitude as well as the data information will not be lost. Second, the true discovery rate of bicluster in the simulation studies was larger than 90%, implying that BiCor has a low false positive rate. Third, BiCor takes correlation characteristic into consideration, while the CC, BiMax and ISA overlook this property.

Here in this research, we suggested three criteria to determine the threshold used in the algorithm. When no expert opinion is available, we recommend the second

(30)

criterion. Thus choice is robust to the correlation pattern in the observed data and to the subject matter under study. Our simulation studies also suggested a satisfactory result when this criterion was considered.

Several issues remain for future studies. First, we plan to make the proposed BiCor algorithm publicly available by providing the code in R so that more people can use it. Second, the validation of true bicluster in real gene expression data analysis may be carried out with the Gene Ontology (GO) database. If considering GO as a standard, then the identified bicluster can be compared with results in GO. We will be working in this direction in the near future. Third, in the simulations, we have

generated a square matrix of gene expression levels as the population matrix to start with the replications. This may not be realistic because the number of conditions in laboratory work is usually smaller. Future investigation may focus on such rectangular matrices.

There are two limitations in the BiCor algorithm. First, if the data contain more than the linear correlation, BiCor cannot bicluster well. This is because BiCor uses Pearson Correlation to find bicluster. A remedy can be the exchangeable correlation, first-order autoregressive correlation, unstructured correlation or user-specified correlation matrix, or the kernel methods for nonlinear correlations. In addition. when the data are binary, then other correlation measurements for categorical data should be

(31)

considered in BiCor. Second, when the correlations within different biclusters are similar, BiCor may not be able to separate them successfully and may combine these biclusters as a big one. As a modification, one may consider other algorithms that focus on features other than correlations. In conclusion, BiCor has good performance with TDR and moderate DTR; while CC has large DTR but small TDR. On future direction would be to combine CC and BiCor as two steps in bicluster analysis. We could use CC as the first step algorithm to find more correlated genes and conditions, and then use BiCor to check if these genes and conditions are correlated with each other.

(32)

References

1. Anindya Bhattacharya and Rajat K. De. 2009. Bi-correlation clustering algorithm for determining a set of co-regulated genes. Bioinformatics, 25:21,2795-2801

2. Yizong Cheng and George M. Church. 2000. Biclustering of Expression Data.

In Book Biclustering of Expression Data. 93–103

3. Ihmels J and Friedlander G et al. 2002. Revealing modular organization in the yeast transcriptional network. Nature Genetics, 31, 370–377

4. Ihmels J and Sven Bergmann et al. 2004. Defining transcription modules using large-scale gene expression data. Bioinformatics, 20, 1993–2003

5. Li Li and Yang Guo et al. 2012. A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data. BioData Mining 5:8

6. Sara C. Madeira and Arlindo L. Oliveira. 2004. Biclustering Algorithms for Biological Data Analysis: A Survey. Ieee transactions on computational biology and bioinformatics. 1:1

7. Amela Prelić and Stefan Bleuler et al. 2006. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22:1122–1129

(33)

8. Udi Ben Porat and Ophir Bleiberg. 2006. Analysis of Biological

Networks:Network Modules – Clustering and Biclustering. Lecturer: Roded Sharan

9. Anja Wille and Philip Zimmermann et al. 2004. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 2004, 5:R92

10. Amos Tanay and Roded Sharan et al. 2002. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18 Suppl. 1.

S136–S144

11. Thilmony R and Underwood W et al. 2006. Genome-wide transcriptional analysis of the Arabidopsis thaliana interaction with the plant pathogen Pseudomonas syringae pv. tomato DC3000 and the human pathogen Escherichia coli O157:H7. The Plant Journal 46, 34–53

12. https://www.arabidopsis.org/

(34)

Table 1 Examples of performance evaluation Performance BiCor

(25 2)*(24 3)

B _ _

BiCor

(10 0)*(15 0)

B _ _ True discovery rate of row 25/(25+2)=0.93 10/(10+0)=1.00 True discovery rate of column 24/(24+3)=0.89 15/(15+0)=1.00 Overall true discovery rate √0.93 ∗ 0.89=0.91 √1.00 ∗ 1.00=1.00 Discovered true rate of row 25/30=0.83 10/30=0.33

Discovered true rate of column 24/30=0.80 15/30=0.50

Overall discovered true rate √0.83 ∗ 0.80=0.81 √0.33 ∗ 0.50=0.41

(35)

Table 2 Parameter settings for the four bicluster algorithms.

Algorithm Default Parameter Settings Changed values

Biclustering method via correlation matrix (BiCor)

0.3, 0.5 0.3, 0.5

CC δ 0.5, α 0.12 δ 0.5, α 0.12

BiMax Normalize genes and conditions~N 0,1

Discretize (to binary values) by percentage=10

Normalize genes and conditions~N 0,1 Discretize (to binary values) by percentage=30

ISA = 2.0, = 2.0,nr. seeds = 13 = 1.0, = 1.0,nr. seeds = 1000

BiCor: : minCorr g g( , )_i _i'  ; : min Corr c c( , )_j _j' 

CC: δ: the maximum acceptable mean squared residue score.

 

²

,

H(I,J) 1 _ij _iJ _Ij _IJ

i I j J

a a a a

I J 

 





    ^.

α: a threshold for multiple node deletion.

BiMax: Discretize (to binary values) by percentage=30

ISA: nr. seeds: set seed for random number generator, gi j,_gi j',   ci j, _ci j, '  

(36)

Table 3 Performance comparison of the four bicluster algorithms.

Algorithm (mean time/each permute)

True discovery rate of rows (mean, se)

True discovery rate of columns (mean, se)

Overall true discovery rate of bicluster

(mean, se)

Discovered true rate of rows (mean, se)

Discovered true rate of columns (mean, se)

Overall

discovered true rate of bicluster (mean, se) BiCor (0.3,0.5)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.72 (0.02) 0.63 (0.02) 0.67 (0.02)

BiCor (0.2,0.5)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.78 (0.01) 0.64 (0.02) 0.71 (0.02)

BiCor (0.2,0.4)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.83 (0.02) 0.79 (0.02) 0.81 (0.02)

BiCor (0.1,0.3)

( . )

0.99 (<0.01) 1.00 (<0.01) 0.99 (<0.01) 0.87 (0.02) 0.87 (0.02) 0.87 (0.01)

BiCor (90^th ,90^th ) ( . )

0.99 (<0.01) 1.00 (<0.01) 0.99 (<0.01) 0.85(0.01) 0.92(0.01) 0.88(0.01)

CC

( )

0.73(0.02) 0.60(0.02) 0.66 (0.02) 0.93(0.05) 0.92(0.02) 0.92 (0.03)

BiMax (

)

1.00 (<0.01) 0.99(<0.01) 0.99 (0.01) 0.28(0.01) 0.30(0.02) 0.29 (<0.01)

ISA ( ) 1.00 (<0.01) 0.63(<0.01) 0.79 (<0.01) 0.68(0.02) 0.72(0.02) 0.70 (0.01) se: standard error

(37)

Table 4 Different simulation settings for BiCor algorithm.

(mean, se)

Overall

discovered true rate of bicluster (mean, se)

(30 70)*(30 70)

S _ _ BiCor (0.3,0.5)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.72 (0.02) 0.63 (0.02) 0.69 (0.02)

BiCor (0.2,0.5)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.78 (0.01) 0.64 (0.02) 0.71 (0.02)

BiCor (0.2,0.4)

( . )

1.00 (<0.01) 1.00 (<0.01) 1.00 (<0.01) 0.83 (0.02) 0.79 (0.02) 0.81 (0.02)

(15 85)*(20 80)

S _ _ BiCor (0.3,0.5)

( . )

0.91 (0.02) 0.94 (0.02) 0.94 (0.02) 0.55 (0.02) 0.39 (0.02) 0.49 (0.02)

BiCor (0.2,0.5)

( . )

0.88 (0.02) 0.93 (0.02) 0.93 (0.02) 0.59 (0.02) 0.40 (0.02) 0.51 (0.02)

BiCor (0.2,0.4)

( . )

0.89 (0.02) 0.94 (0.02) 0.93 (0.02) 0.61 (0.02) 0.48 (0.02) 0.56 (0.02)

(38)

Table 5 Performance of BiCor under different data-dependent thresholds.

(mean, se)

Overall

discovered true rate of bicluster (mean, se)

(30 70) (30 70)

S _ _ _ BiCor (60^th , 60^th)

( . )

0.96(<0.01) 0.97(<0.01) 0.97(<0.01) 0.88(0.01) 0.94(<0.01) 0.91(0.01)

BiCor (70^th , 70^th)

( . )

0.98(<0.01) 0.98(<0.01) 0.98(<0.01) 0.88(0.01) 0.94(<0.01) 0.91(0.01)

BiCor (80^th , 80^th)

( . )

0.99(<0.01) 0.99(<0.01) 0.99(<0.01) 0.87(0.01) 0.93(<0.01) 0.90(0.01)

(15 85) (20 80)

S _ _ _ BiCor (75^th , 70^th)

( . )

0.81(0.01) 0.88(0.01) 0.85(0.02) 0.70(0.02) 0.68(0.02) 0.71(0.02)

BiCor (85^th , 0.85(0.01) 0.90(0.01) 0.88(0.02) 0.68(0.02) 0.66(0.02) 0.68(0.02)

(39)

80^th)

( . )

BiCor (95^th , 90^th)

( . )

0.89(0.01) 0.920.01) 0.91(0.02) 0.66(0.02) 0.63(0.02) 0.66(0.02)

(40)

Table 6 Bicluster result of four bicluster algorithms

Bicluster algorithm Number of genes Id of experiment conditions Discovered true rate of experiment conditions BiCor (0.4,0.7) 32 16 18 (Cor-hrpS-5x10e7-10h) ,

24 (hrpAfliC-10e8-7h), 25 26 27 (hrpA-10e8-7h),

34 35 36 (E.coli-0157-H7-10e8-7h), 37 39 (E.coli-TUV86-2-fliC-10e8-7h)

11

5 ∗ 3 73.3%

BiCor (0.4,0.6) 44 16 18 (Cor-hrpS-5x10e7-10h) , 23 24 (hrpAfliC-10e8-7h), 25 26 27 (hrpA-10e8-7h),

34 35 36 (E.coli-0157-H7-10e8-7h), 37 38 39 (E.coli-TUV86-2-fliC-10e8-7h)

13

5 ∗ 3 87.7%

CC 155 23 experiment conditions are included 39

13 ∗ 3 100.0%

BiMax 8 7 8 9 (DC3000-10e6-24h) 3

1 ∗ 3 100.0%

(41)

ISA 43 7 8 9 (DC3000-10e6-24h) 3

3 100.0%

(42)

Table 7 Parameter settings of 3 bicluster algorithms in Arabidopsis data

Algorithm Parameter Settings

CC δ 0.5, α 0.12

BiMax Normalize genes and conditions~N 0,1 Discretize to binary values by =2

ISA = 2.0, = 2.0,nr. seeds = 1000

Table 8 23 experiment conditions of Arabidopsis data ID Pathogen-cfu/ml Collected time

1-3 Cor-10e6 24h

4-6 Cor-hrpS-10e6 24h

7-9 DC3000-10e6 24h

10-12 Mock-Inoculum 24h 13-15 Cor-5x10e7 10h 16-18 Cor-hrpS-5x10e7 10h 19-21 Mock-Inoculum 10h 22-24 hrpAfliC-10e8 7h 25-27 hrpA-10e8 7h 28-30 DC3000-10e8 7h 31-33 Mock-Inoculum 7h 34-36 E.coli-0157-H7-10e8 7h 37-39 E.coli-TUV86-2-fliC-10e8 7h

Collected time: collected sample time after giving pathogen-cfu/ml Arabidopsis data from NASC's International Affymetrix Service

734 genes and 23 experiment conditions (each experiment conditions replicate 3 times, total 69 experiment conditions).

(43)

Table 9 Genes coding for enzymes in the two isoprenoid pathways

Name AGI number Pathway

AACT1 At5g47720 ^MVA

AACT2 At5g48230 ^MVA

CMK At2g26930 ^MEP

DPPS1 At2g23410 ^MVA

DXPS1 At3g21500 ^MEP

DXR At5g62790 ^MEP

FPPS1 At4g17190 ^MVA

FPPS2 At5g47770 ^MVA

GGPPS1 At1g49530 ^MVA

GGPPS2 At2g18620 ^MEP

GPPS At2g34630 ^MEP

HDR At4g34350 ^MEP

HDS At5g60600 ^MEP

HMGR1 At1g76490 ^MVA

HMGR2 At2g17370 ^MVA

HMGS At4g11820 ^MVA

IPPI1 At3g02780 ^MEP

IPPI2 At5g16440 ^MVA

MCT At2g02500 ^MEP

MECPS At1g63970 ^MEP

MK At5g27450 ^MVA

MPDC1 At2g38700 ^MVA

MPDC2 At3g54250 ^MVA

(44)

PPDS1 At1g17050 MEP

PPDS2 At1g78510 MEP

UPPS1 At2g17570 MVA

MVA: Mevalonate pathway; MEP: Non-mevalonate pathway

Table 10 Bicluster result of four bicluster algorithms and two cluster results Bicluster algorithm True discovery rate of

pathway

Discovered true rate of pathway

BiCor (0.1,0.8) 1 0.45

BiCor (0.2,0.8) 1 0.45

BiCor (0.3,0.8) 1 0.35

CC 0.63 0.74

BiMax 0.63 0.25

ISA 0.92 0.55

Cluster algorithm

HCL 0.49 0.49

K means 0.83 0.82

Table 11 Parameter settings of 5 bicluster algorithms in Arabidopsis data Algorithm Parameter Settings

CC δ 0.5, α 1.2

BiMax Normalize genes and conditions~N 0,1 Discretize to binary values by =1

ISA = 1.0, = 1.0,nr. seeds = 10000

HCL Number of cluster=2, single linkage, Pearson’s correlation distance K means Number of cluster=2, Pearson’s correlation distance

(45)

Figu Perf

Figu Red

Figu

ure 1 An exa formance an

ure 2 Low t d: Setting tru

ure 3 High t

ure 4 Low t

ample of a t nd comparis

true discove ue bicluster

true discove

two-compon son:

ery rate and r; Green: Ide

ery rate and

nent mixtur

low discov entified bicl

d low discov

high discov

re model for

vered true ra luster; Oran

vered true ra

vered true r

r correlation

ate.

nge: Overlap

ate

ns.

pping regioon

(46)

Figure 5 Performance of four algorithms

Figure 6 Performance of BiCor with fixed thresholds

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0 0.2 0.4 0.6 0.8 1

Overall discovered true rate of bicluster

Overall true discovery rate of bicluster

Performance of four algorithms

BiCor (0.3,0.5) BiCor (0.2,0.5) BiCor (0.2,0.4) BiCor (0.1,0.3) BiCor (90%,90%) CC

BiMax ISA

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.92 0.94 0.96 0.98 1

Overall discovered true rate of bicluster

Overall true discovery rate of bicluster

Performance of BiCor

BiCor (0.3,0.5) (30+70*30+70) BiCor (0.2,0.5) (30+70*30+70) BiCor (0.2,0.4) (30+70*30+70) BiCor (0.3,0.5) (15+85*20+80) BiCor (0.2,0.5) (15+85*20+80) BiCor (0.2,0.4) (15+85*20+80)

(47)

Figu

Overalldiscoveredtruerateof

ure 7 Perfor

ure 8 Mixtu

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.8 Overall discovered true rate of bicluster

rmance of B

ure model of

0.85 Overall

Pe

BiCor with f

f correlation

5 0

true discove

erform

fuzzy thresh

n of rows an

.9 0

ery rate of b

mance o

hold

nd columns

0.95 bicluster

of Bico

in Arabidop

1

or

BiCo (30+

BiCo (15+

opsis data

or (60th,60th +70*30+70) or (70th,70th +70*30+70) or (80th,80th +70*30+70) or (75th,70th +85*20+80) or (85th,80th +85*20+80) or (95th,90th +85*20+80)

)

(48)

Figu Wil Figu

Red

ure 9 Chloro le et al. (20 ure 9 Mixtu

d: MEP path

oplast (MEP 04) [5:R92]

ure model of

hway; Green

P pathway) ].

f correlation

n: MVA pat

and cytopla

n of rows an

thway

asm (MVA p

nd columns

pathway). T

in Arabidop

Taken from

psis data

(49)

Figu Ara

ure 10 Corre abidopsis da

ure 11 Corr abidopsis da

elation of 20 ata

elation of 1 ata

0 genes and

9 genes and

d 118 experi

d 118 exper

iments of M

MVA pathwa

MEP pathwa ay in

ay in

利用相關性矩陣降維進行雙分群分析：以基因表現資料為例

國立台灣大學公共衛生學院流行病學與預防醫學研究所 碩士論文

Institute of Epidemiology and Preventive Medicine College of Public Health

National Taiwan University Master Thesis

利用相關性矩陣降維進行雙分群分析：

以基因表現資料為例

A biclustering method with correlation matrix for gene expression profiling

陳建瑋 Chian Wei Chen

指導教授: 蕭朱杏博士

Advisor: Chuhsing Kate Hsiao, Ph.D.

中華民國 103 年 7 月

July, 2014

致謝

中文摘要

Abstract

Table of Contents

List of Tables

List of Figures

List of Appendices

Introduction

Method



 

 



 

 

















Simulation

Application

Discussion

References

 



Performance of four algorithms

Performance of BiCor

Pe

erform

mance o

of Bico

or

國立台灣大學公共衛生學院流行病學與預防醫學研究所碩士論文