Pathway analysis - Data analysis - 利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑

Chapter 1 Introduction

1.3 Data analysis

1.3.3 Pathway analysis

Pathway analysis is actually gene set analysis only to focus on pathways. It is

expected to not only find pathways with significant differential expression but also

detect consistent yet subtle expression changes among members of a pathway. A review

on various pathway analysis methods evaluating the involvement of pathways in

different phenotypes under study is available in [19, 20].

A simplest procedure is to assess the significance of overlap between preselected

genes (differentially expressed genes) and predefined annotation groups (pathways)

using Fisher Exact test (hypergeometric distribution) or Chi-squared test, which is then

followed by adequate multiple hypothesis adjustments. This over-representation method

is intuitive, easy to implement and computationally efficient, thus it dominates current

commercial and public software, such as IPA [21], Metacore [22] and DAVID [23].

An alternative approach needs no prior filtering of genes. It assigns a score to each

gene set and assesses p-value by re-sampling procedure, which is to compare the score

with its null distribution. BRB-ArrayTools [24] use the LS statistic and

Kolmogorov-Smirnov (K-S) statistic to test if the single-gene p-values in a gene set are

of a uniform distribution. In gene set enrichment analysis (GSEA) [25, 26], an

enrichment score is obtained by considering the distribution of pathway genes in the

entire list of genes, which in spirit is a weighted K-S statistic. Tian et al. [27] designed a

statistical framework to determine perturbed pathways. In their work the overall

objective is to “test whether a group of genes has a coordinated association with a

phenotype of interest.” After each gene set is assigned a score by averaging the test

statistics of its member genes, p-values regarding two different hypotheses are

estimated by permuting class labels and gene orders. Eventually, gene sets with

significant p-values under both hypotheses are considered differentially expressed

10 across phenotypes.

The latter approach is more biologically reasonable than over-representation

method since it reserve expression information and does not depend on an artificial

filtering of genes, in which the results may depend strongly on the cutoff chosen to

make the significant gene list [28].

However, most pathway analysis methods suffer from some weaknesses. First, due

to the fact that only a limited number of well-studied genes are classified into pathways,

the remained many genes with unknown functions are left out of considerations. Second,

most pathway annotations contain only labels of pathway members and with no

information about the interplay between them, therefore the results obtained from these

methods could not help to give direct understanding of cellular processes at molecular

level. Also, pathway is actually a dynamic model and the component activated is

specific to conditions under study, which is usually not emphasized in most methods,

however, GSEA did specify a list of core members, named leading edge subset, that are

the main contributors of the pathway’s enrichment score.

In all, pathway analysis can successfully interpret data at pathway level but fails to

elucidate the interplay between members within and provide no information about genes

not involved in known pathways.

11 1.3.4 Network analysis

Extensive works had been done seeking to characterize the design principles of

biomolecular network structure using graph theory. An interesting finding shows that its

connectivity distribution follows the typical power-law distribution for scale-free

network and it shows small-world properties [29] in terms of network diameter and

clustering. It is believed that this feature, which indicates the existence of hubs, enables

the biological network to be robust against occasionally removal of arbitrary network

elements during evolution process [30].

Another subject of active research is the identification of relevant modules or

subnetworks in the global network. Notably the interactome data cannot alone complete

this task due to the inconsistent conditions of its content. That is, these interactions are

usually temporal, spatial or dependent on conditions and tissue types. Therefore,

additional annotations providing condition-specific information is required, such as

combining microarray data to identify connected sets of nodes based on their coherent

expression patterns at mRNA level. In this regard, several methods of identifying

responsive modules are reviewed in [31].

Ideker et al. [32] are among the first groups to extract active subnetworks based on

both interactome and transcriptomic data. They searched the global network for

high-scoring subnetworks via simulated annealing algorithm, where the score is in spirit

an aggregation of z-scores measuring differential expression of individual genes in the

subnetwork and calibrated against the null distribution generated by randomly sampled

groups of genes. Interestingly, the result contained many examples of genes individually

with low score but are required to connect together several high-scoring genes. Genes

with such character agree with the behavior of some transcription factors (TFs) and will

be referred to as “key nodes” in this article. Despite the exciting observation, the

time-consuming nature confines its application on large interaction network.

Nacu et al. [33] developed a method GXNA that expands subnetworks from seed

nodes using a greedy approach and then identifies those with significantly high scores.

One of their contributions lies in the design of scoring functions that correct for biases

due to subnetwork size if necessary. They are generally divided into two scoring

schemes -͂T and T͂. What subsequent to the evaluation of subnetwork score is the

estimation of p-value that assesses its significance against the null distribution derived

from scores of subnetworks under different phenotype permutations. Eventually, these

p-values are adjusted for family-wise error rate and used to select relevant subnetwork.

GXNA is fast and it focuses on small modules differentially expressed between

phenotypes, however, it allows no such key nodes since the greedy approach is adopted.

In general, pathway can be viewed as the assembly of causal sequences of

molecular events and with all building blocks available in the biomolecular network.

Ideally it seems promising to infer new pathways directly by network analysis, but in

reality such pathway inference are complex and less validated due to the noise and

incompleteness of current biological networks, especially when coupling with the

small-world property of its topology. Additionally, the identified module is prone to be

associated with multiple canonical pathways, which makes it hard to reveal a unifying

scene and interpret in a fashion related to focus under study, as in the example of Figure

A-2.

1.3.5 Methodology in this work

In this study we attempt to develop a methodology that goes a step further than

current pathway analysis by bringing in the advantages of network analysis. Practically

speaking, it can be achieved by processing microarray data through pathway and

network analyses in series, however, it is worthwhile mentioning that the incorporation

of both pathway and network knowledge into microarray data analysis is yet to be

widely realized. Based on results from pathway analysis, the methodology here aims to

enhance researchers’ understanding by stepping from pathway level into molecular level,

with its main objectives specified below:

1. Extract module most relevant to the pathway’s dysregulation.

It is without doubt that pathway is in fact a dynamic model and the members

perturbed differs between conditions under study. Although GSEA specifies a leading

edge subset and suggests their being responsible for pathway’s dysregulation, the subset

is of limited biological meaning since it provides no information for elucidating the

roles they play. To put it simply, the leading edge subset is statistically meaningful

more than biologically meaningful. Therefore, methodology here attempt to

complement pathway analysis in this regard by making use of biomolecular networks.

After dysregulated pathways are identified, concept of network analysis is then

imposed to extract modules most relevant to dysregulation of interested pathways. It is

mainly due to these sequential events that make the pathways deemed dysregulated.

On the other hand, the module obtained here differs from that yielded by other

network analyses in its ability to interpret microarray data at known-pathway level and

to investigate interested pathways in greater details.

2. Other focus-oriented strategies enabling exploratory survey on pathways of interest.

Current understanding of biological functions at the pathway level is far from

being thorough. In one way, some of known pathways are actually incomplete. In the

other, although recorded in a separate fashion, they are not truly isolated at the level of

global interaction network. In fact they should not be so as in the example of

cross-pathway inhibition, which allows an activated pathway to refocus cellular

resources to respond to stimuli it received by competing against other pathways. As a

matter of fact, a considerable degree of cross-talks between pathways are revealed by

various small-scaled biological studies.

Using the methodology here, a task-oriented investigation is promised by

exploiting the valuable information imbedded in biomolecular networks. These yet

well-understood interactions possess great potential to point an investigator to either

missing pathway components or cross-talks between pathways and help in the design of

appropriate experiments for identifying them.

In previous study physical interaction data were integrated with genetic interaction

information to uncover the mechanisms underneath [34], which can further be used to

find cross-talks between two pathways where the two interacting genes reside.

The interaction data was also coupled with pathway information to provide clues

of cross-talks between pathways [35], however, it is done by merely calculating whether

physical connections between two pathways are higher than random using Fisher Exact

Test.

In two successive works of Ideker et al. [36, 37], subnetwork and condition-

responsive genes (CORGs) were defined in dysregulated pathway as the part delivering

optimal discriminative power for the disease phenotype using T͂ evaluation.

Nonetheless, they were used as prediction markers rather than to discuss the underlying

mechanisms.

Before going to the next chapter, the methodology framework is summarized in

four parts.

1. Construction of database for storage of molecular interaction network, canonical

pathway collections and gene annotations.

2. Statistical analysis of microarray data at pathway level: in this part Tian’s algorithm

is adopted yet with slight but crucial modification.

3. Algorithms allow navigating the global network on the region of interested

pathways: the scoring function is based on GXNA and modified to tolerate key

nodes. In addition, a merging step is developed to complement the scoring function.

4. Visualize the module by laying genes/gene products according to their subcellular

localizations.

Chapter 2 Materials

2.1 Lung cancer datasets

1. NTUH Lung Cancer Dataset

This dataset is created by Bioinformatics and Biostatistics Core, National Taiwan

University Research Center for Medical Excellence - Division of Genomic Medicine.

Matched normal and tumor samples of 31 female patients with non-small cell lung

cancer (NSCLC) in National Taiwan University Hospital (NTUH) are collected for

DNA microarray analysis using Affymetrix Human Genome U133 Plus 2.0 Array.

2. Public Dataset GSE7670 (TVGH Lung Cancer dataset)

A public dataset [38] created by Taipei Veterans General Hospital (TVGH) and

deposited in NCBI’s Gene Expression Omnibus (GEO) [39] (also available at EBI’s

ArrayExpress: www.ebi.ac.uk/arrayexpress) under GEO series accession number

GSE7670 is downloaded as CEL files. Of all the 66 Affymetrix Human Genome

U133A Array data, those from 21 female NSCLC patients with paired normal-tumor

arrays are used to create this dataset.

2.2 Databases

1. Pathways

Predefined gene sets performing tasks of metabolic functions or signaling

transductions are recorded in several public databases.

The Molecular Signature Database (MSigDB) v2.5 maintained by Broad Institute

collects curated gene sets information from several online databases or literatures and

records them with a unified format. Its “Canonical Pathway (CP) collection” in

“Functional Sets (C2) category” [9] is the main source of pathway information in this

work. The collection of 639 gene sets is released in a file with filename extension “gmt”.

In the following work, gene sets are simplified into pathways.

2. Protein interaction network

Protein-protein interactions (PPI) detected by high-throughput methods are

recorded in several PPI databases. The NCBI’s Entrez Gene database containing curated

interaction information from BIND [16], BioGRID [40], EcoCyc [41] and HPRD [17] is

utilized as the main source to construct protein interaction network in this work.

3. Target genes of probe sets on microarray

The information of genes targeted by probe sets designed on Affymetrix GeneChip

arrays comes from annotation files in Affymetrix website (www.affymetrix.com).

4. Gene annotation

Unique gene identifiers (Entrez ID or HUGO gene symbol) and historical aliases

of genes are retrieved from NCBI’s Entrez Gene database.

2.3 Program environment and public/ commercial tools

With database built under MySQL, this methodology was realized with the help of

Matlab^®. Besides, some public or commercial tools are used to process the dataset for

different purposes. Following are some brief descriptions about them.

1. Partek^® Genomics Suite [42]

It is a commercial software that enables various statistical analysis of microarray

data. In the work here it is used simply to complete preprocessing steps of microarray

data, which summarizes expression value for each probe set and applied normalization

algorithm to remove potential systematic biases.

2. Gene set enrichment analysis (GSEA) [25]

GSEA evaluates the probability a gene set is differentially expressed across

phenotypes and defines a leading-edge subset comprising the core members activated in

the gene set.

At first, genes are ordered by their correlation between expression values and

phenotype classes, then an enrichment score (ES), corresponding to a weighted

Kolmogorov-Smirno-like statistic, is calculated for the gene set. A p-value representing

significance level of the ES score is determined against null distribution of ES estimated

by permuting class labels. After p-value for each gene sets is obtained, false discovery

rate (FDR) method is used to adjust for multiple hypothesis testing.

3. Database for Annotation, Visualization and Integrated Discovery (DAVID) [23]

DAVID uses an EASE score, a modified Fisher Exact p-value, to measure the

enrichment of gene sets in the gene list specified by user. Furthermore, to reduce the

redundant nature of annotations that might dilute the focus of the result, which is

especially inevitable when associating with Gene Ontology (GO) terms, DAVID

provides the option to classify significantly associated gene sets into different clusters

and order the clusters according to their significance. According to manual on the

website, it is achieved by integrating the same techniques of Kappa statistics to measure

the degree of the common genes between two gene sets, and fuzzy heuristic clustering

to classify the groups of similar annotations according to kappa values.

4. Cytoscape [43]

Cytoscape is a JAVA application which provides basic functionality to layout and

query networks, overlay nodes with expression data, or link genes/gene products to

databases of functional annotation. Notably, it is featured in its extensibility through a

straightforward plug-in architecture, which enables additional computational analyses to

be incorporated. In this study, a plug-in - Cerebral v2.0 [44] is used to layout the

network according to the subcellular location of each node.

5. Kyoto Encyclopedia of Genes and Genomes (KEGG) - Color Objects in KEGG

Pathways [45]

It is an on-line tool that provides a personalized pathway map by allowing the

assignment of different colors to font, border or background of each particular node.

6. European Bioinformatics Institute (EBI) - QuickGO [46]

QuickGO is a web-based browser that enables the extraction of branches from the

entire hierarchy of Gene Ontology according to a list of specified GO terms.

Chapter 3 Methods

Figure 3-1 illustrates the flow chart of this methodology and in the subsequent

sections we will describe each step in detail.

Figure 3-1. Flow chart of methodology in this work.

PPI is the abbreviation of protein-protein interaction data.

3.1 Database construction

Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.

In this ERD rectangles stand for entities, ellipses for attributes and diamonds

for relationships.

As the first step of integrating genomic data at different levels, a relational

database is required in order to bridge between information from various sources, to

speed up data retrieval and to correct for ambiguously recorded data. The structure of

database constructed here is elucidated in Figure 3-2 as a simplified entity relationship

diagram, where genes, pathways and array probes are considered as different entity

types.

While constructing the database, some records with ambiguous information should

be corrected:

1. Importing gene sets from MSigDB.

The gmt file records gene set members in the form of “synonym”, which is alias

instead of official name. The task here is to convert each of them into corresponding

unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,

or to “synonym” of genes in the case when there were no “official gene symbols”

matched.

2. Importing target genes of each probe set on microarray.

Some “gene_id” recorded in NetAffx annotation files had been changed or

discontinued. New “gene_id” should be updated based on information from NCBI’s

Entrez database.

3.2 Single gene analysis

1. Probe sets filtering

As one might note in chapter 2, the two datasets were assayed on different versions

of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both

U133A and U133B [47], thus in following works only those probe sets common in both

versions are utilized. Note that this step could be skipped when datasets to be compared

are of the same version.

2. Summarizing expression value for each gene

The analysis of Affymetrix array data starts with CEL files recording fluorescence

signals at probe level, from which probe-set level intensities are derived using robust

multi-chip average (RMA) method [48]. In this method probe level data undergo

background correction, quantile normalization [49] and median-polish summarization

[50]. The RMA process is completed under the commercial software Partek^® [42] and

the resulting values are log-transformed expression values.

3. Hypothesis testing

Hypothesis testing methods are used to measure the degree of association between

response/covariate (either numerical or categorical factor, e.g. phenotypes) and random

variables (expression levels of probe sets/genes). In the simple but most common case,

given a covariate, its association with each gene is estimated by univariate hypothesis

testing method such as two-sample t test or Mann-Whitney statistics for binary

phenotype, F-statistic for polytomous phenotype, to name but a few.

Here two-sample paired t test is applied and the concluded p-value functions as an

index of degree of differential expression in terms of a probe set between phenotypes.

4. Select representative probe set for each gene

Among all probe sets targeting the same gene, the one with the smallest p-value is

selected to represent their target gene and the rest removed.

Until this step, an input matrix with logged gene expression values in rows and

arrays in columns is generated.

3.3 Pathway analysis

Both Tian method and modified Tian method are applied on the datasets.

Conceptually the procedure of pathway analysis starts with evaluating a pathway score

by employing a scoring function, which is the major difference between the two

methods. The score is then to be normalized and assigned with p-value according to its

null distribution that could be generated in two different ways of permutation. Finally

pathways are ordered by the addition of rankings under two permutation types. The

detailed procedures of both methods are described below.

T the t statistic of gene i. To note, the latter scoring

function comes from equation 2 in Nacu et al.[33].

2. Significant level of the score

A one/two-sided p-value representing significance of a pathway is estimated from

the f1/f0 score’s null distribution that could be generated in different ways depending on

the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:

either to test if genes inside a set show significantly higher associations with phenotypes

than that outside a set, or to test if a set does contain genes differentially expressed

在文檔中利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑 (頁 18-0)