Chapter 1 Introduction
1.3 Data analysis
1.3.3 Pathway analysis
Pathway analysis is actually gene set analysis only to focus on pathways. It is
expected to not only find pathways with significant differential expression but also
detect consistent yet subtle expression changes among members of a pathway. A review
on various pathway analysis methods evaluating the involvement of pathways in
different phenotypes under study is available in [19, 20].
A simplest procedure is to assess the significance of overlap between preselected
9
genes (differentially expressed genes) and predefined annotation groups (pathways)
using Fisher Exact test (hypergeometric distribution) or Chi-squared test, which is then
followed by adequate multiple hypothesis adjustments. This over-representation method
is intuitive, easy to implement and computationally efficient, thus it dominates current
commercial and public software, such as IPA [21], Metacore [22] and DAVID [23].
An alternative approach needs no prior filtering of genes. It assigns a score to each
gene set and assesses p-value by re-sampling procedure, which is to compare the score
with its null distribution. BRB-ArrayTools [24] use the LS statistic and
Kolmogorov-Smirnov (K-S) statistic to test if the single-gene p-values in a gene set are
of a uniform distribution. In gene set enrichment analysis (GSEA) [25, 26], an
enrichment score is obtained by considering the distribution of pathway genes in the
entire list of genes, which in spirit is a weighted K-S statistic. Tian et al. [27] designed a
statistical framework to determine perturbed pathways. In their work the overall
objective is to “test whether a group of genes has a coordinated association with a
phenotype of interest.” After each gene set is assigned a score by averaging the test
statistics of its member genes, p-values regarding two different hypotheses are
estimated by permuting class labels and gene orders. Eventually, gene sets with
significant p-values under both hypotheses are considered differentially expressed
10 across phenotypes.
The latter approach is more biologically reasonable than over-representation
method since it reserve expression information and does not depend on an artificial
filtering of genes, in which the results may depend strongly on the cutoff chosen to
make the significant gene list [28].
However, most pathway analysis methods suffer from some weaknesses. First, due
to the fact that only a limited number of well-studied genes are classified into pathways,
the remained many genes with unknown functions are left out of considerations. Second,
most pathway annotations contain only labels of pathway members and with no
information about the interplay between them, therefore the results obtained from these
methods could not help to give direct understanding of cellular processes at molecular
level. Also, pathway is actually a dynamic model and the component activated is
specific to conditions under study, which is usually not emphasized in most methods,
however, GSEA did specify a list of core members, named leading edge subset, that are
the main contributors of the pathway’s enrichment score.
In all, pathway analysis can successfully interpret data at pathway level but fails to
elucidate the interplay between members within and provide no information about genes
not involved in known pathways.
11 1.3.4 Network analysis
Extensive works had been done seeking to characterize the design principles of
biomolecular network structure using graph theory. An interesting finding shows that its
connectivity distribution follows the typical power-law distribution for scale-free
network and it shows small-world properties [29] in terms of network diameter and
clustering. It is believed that this feature, which indicates the existence of hubs, enables
the biological network to be robust against occasionally removal of arbitrary network
elements during evolution process [30].
Another subject of active research is the identification of relevant modules or
subnetworks in the global network. Notably the interactome data cannot alone complete
this task due to the inconsistent conditions of its content. That is, these interactions are
usually temporal, spatial or dependent on conditions and tissue types. Therefore,
additional annotations providing condition-specific information is required, such as
combining microarray data to identify connected sets of nodes based on their coherent
expression patterns at mRNA level. In this regard, several methods of identifying
responsive modules are reviewed in [31].
Ideker et al. [32] are among the first groups to extract active subnetworks based on
both interactome and transcriptomic data. They searched the global network for
12
high-scoring subnetworks via simulated annealing algorithm, where the score is in spirit
an aggregation of z-scores measuring differential expression of individual genes in the
subnetwork and calibrated against the null distribution generated by randomly sampled
groups of genes. Interestingly, the result contained many examples of genes individually
with low score but are required to connect together several high-scoring genes. Genes
with such character agree with the behavior of some transcription factors (TFs) and will
be referred to as “key nodes” in this article. Despite the exciting observation, the
time-consuming nature confines its application on large interaction network.
Nacu et al. [33] developed a method GXNA that expands subnetworks from seed
nodes using a greedy approach and then identifies those with significantly high scores.
One of their contributions lies in the design of scoring functions that correct for biases
due to subnetwork size if necessary. They are generally divided into two scoring
schemes -͂T and T͂. What subsequent to the evaluation of subnetwork score is the
estimation of p-value that assesses its significance against the null distribution derived
from scores of subnetworks under different phenotype permutations. Eventually, these
p-values are adjusted for family-wise error rate and used to select relevant subnetwork.
GXNA is fast and it focuses on small modules differentially expressed between
phenotypes, however, it allows no such key nodes since the greedy approach is adopted.
13
In general, pathway can be viewed as the assembly of causal sequences of
molecular events and with all building blocks available in the biomolecular network.
Ideally it seems promising to infer new pathways directly by network analysis, but in
reality such pathway inference are complex and less validated due to the noise and
incompleteness of current biological networks, especially when coupling with the
small-world property of its topology. Additionally, the identified module is prone to be
associated with multiple canonical pathways, which makes it hard to reveal a unifying
scene and interpret in a fashion related to focus under study, as in the example of Figure
A-2.
1.3.5 Methodology in this work
In this study we attempt to develop a methodology that goes a step further than
current pathway analysis by bringing in the advantages of network analysis. Practically
speaking, it can be achieved by processing microarray data through pathway and
network analyses in series, however, it is worthwhile mentioning that the incorporation
of both pathway and network knowledge into microarray data analysis is yet to be
widely realized. Based on results from pathway analysis, the methodology here aims to
enhance researchers’ understanding by stepping from pathway level into molecular level,
with its main objectives specified below:
14
1. Extract module most relevant to the pathway’s dysregulation.
It is without doubt that pathway is in fact a dynamic model and the members
perturbed differs between conditions under study. Although GSEA specifies a leading
edge subset and suggests their being responsible for pathway’s dysregulation, the subset
is of limited biological meaning since it provides no information for elucidating the
roles they play. To put it simply, the leading edge subset is statistically meaningful
more than biologically meaningful. Therefore, methodology here attempt to
complement pathway analysis in this regard by making use of biomolecular networks.
After dysregulated pathways are identified, concept of network analysis is then
imposed to extract modules most relevant to dysregulation of interested pathways. It is
mainly due to these sequential events that make the pathways deemed dysregulated.
On the other hand, the module obtained here differs from that yielded by other
network analyses in its ability to interpret microarray data at known-pathway level and
to investigate interested pathways in greater details.
2. Other focus-oriented strategies enabling exploratory survey on pathways of interest.
Current understanding of biological functions at the pathway level is far from
being thorough. In one way, some of known pathways are actually incomplete. In the
other, although recorded in a separate fashion, they are not truly isolated at the level of
15
global interaction network. In fact they should not be so as in the example of
cross-pathway inhibition, which allows an activated pathway to refocus cellular
resources to respond to stimuli it received by competing against other pathways. As a
matter of fact, a considerable degree of cross-talks between pathways are revealed by
various small-scaled biological studies.
Using the methodology here, a task-oriented investigation is promised by
exploiting the valuable information imbedded in biomolecular networks. These yet
well-understood interactions possess great potential to point an investigator to either
missing pathway components or cross-talks between pathways and help in the design of
appropriate experiments for identifying them.
In previous study physical interaction data were integrated with genetic interaction
information to uncover the mechanisms underneath [34], which can further be used to
find cross-talks between two pathways where the two interacting genes reside.
The interaction data was also coupled with pathway information to provide clues
of cross-talks between pathways [35], however, it is done by merely calculating whether
physical connections between two pathways are higher than random using Fisher Exact
Test.
16
In two successive works of Ideker et al. [36, 37], subnetwork and condition-
responsive genes (CORGs) were defined in dysregulated pathway as the part delivering
optimal discriminative power for the disease phenotype using T͂ evaluation.
Nonetheless, they were used as prediction markers rather than to discuss the underlying
mechanisms.
Before going to the next chapter, the methodology framework is summarized in
four parts.
1. Construction of database for storage of molecular interaction network, canonical
pathway collections and gene annotations.
2. Statistical analysis of microarray data at pathway level: in this part Tian’s algorithm
is adopted yet with slight but crucial modification.
3. Algorithms allow navigating the global network on the region of interested
pathways: the scoring function is based on GXNA and modified to tolerate key
nodes. In addition, a merging step is developed to complement the scoring function.
4. Visualize the module by laying genes/gene products according to their subcellular
localizations.
17
Chapter 2 Materials
2.1 Lung cancer datasets
1. NTUH Lung Cancer Dataset
This dataset is created by Bioinformatics and Biostatistics Core, National Taiwan
University Research Center for Medical Excellence - Division of Genomic Medicine.
Matched normal and tumor samples of 31 female patients with non-small cell lung
cancer (NSCLC) in National Taiwan University Hospital (NTUH) are collected for
DNA microarray analysis using Affymetrix Human Genome U133 Plus 2.0 Array.
2. Public Dataset GSE7670 (TVGH Lung Cancer dataset)
A public dataset [38] created by Taipei Veterans General Hospital (TVGH) and
deposited in NCBI’s Gene Expression Omnibus (GEO) [39] (also available at EBI’s
ArrayExpress: www.ebi.ac.uk/arrayexpress) under GEO series accession number
GSE7670 is downloaded as CEL files. Of all the 66 Affymetrix Human Genome
U133A Array data, those from 21 female NSCLC patients with paired normal-tumor
arrays are used to create this dataset.
2.2 Databases
1. Pathways
Predefined gene sets performing tasks of metabolic functions or signaling
18
transductions are recorded in several public databases.
The Molecular Signature Database (MSigDB) v2.5 maintained by Broad Institute
collects curated gene sets information from several online databases or literatures and
records them with a unified format. Its “Canonical Pathway (CP) collection” in
“Functional Sets (C2) category” [9] is the main source of pathway information in this
work. The collection of 639 gene sets is released in a file with filename extension “gmt”.
In the following work, gene sets are simplified into pathways.
2. Protein interaction network
Protein-protein interactions (PPI) detected by high-throughput methods are
recorded in several PPI databases. The NCBI’s Entrez Gene database containing curated
interaction information from BIND [16], BioGRID [40], EcoCyc [41] and HPRD [17] is
utilized as the main source to construct protein interaction network in this work.
3. Target genes of probe sets on microarray
The information of genes targeted by probe sets designed on Affymetrix GeneChip
arrays comes from annotation files in Affymetrix website (www.affymetrix.com).
4. Gene annotation
Unique gene identifiers (Entrez ID or HUGO gene symbol) and historical aliases
of genes are retrieved from NCBI’s Entrez Gene database.
19
2.3 Program environment and public/ commercial tools
With database built under MySQL, this methodology was realized with the help of
Matlab®. Besides, some public or commercial tools are used to process the dataset for
different purposes. Following are some brief descriptions about them.
1. Partek® Genomics Suite [42]
It is a commercial software that enables various statistical analysis of microarray
data. In the work here it is used simply to complete preprocessing steps of microarray
data, which summarizes expression value for each probe set and applied normalization
algorithm to remove potential systematic biases.
2. Gene set enrichment analysis (GSEA) [25]
GSEA evaluates the probability a gene set is differentially expressed across
phenotypes and defines a leading-edge subset comprising the core members activated in
the gene set.
At first, genes are ordered by their correlation between expression values and
phenotype classes, then an enrichment score (ES), corresponding to a weighted
Kolmogorov-Smirno-like statistic, is calculated for the gene set. A p-value representing
significance level of the ES score is determined against null distribution of ES estimated
by permuting class labels. After p-value for each gene sets is obtained, false discovery
20
rate (FDR) method is used to adjust for multiple hypothesis testing.
3. Database for Annotation, Visualization and Integrated Discovery (DAVID) [23]
DAVID uses an EASE score, a modified Fisher Exact p-value, to measure the
enrichment of gene sets in the gene list specified by user. Furthermore, to reduce the
redundant nature of annotations that might dilute the focus of the result, which is
especially inevitable when associating with Gene Ontology (GO) terms, DAVID
provides the option to classify significantly associated gene sets into different clusters
and order the clusters according to their significance. According to manual on the
website, it is achieved by integrating the same techniques of Kappa statistics to measure
the degree of the common genes between two gene sets, and fuzzy heuristic clustering
to classify the groups of similar annotations according to kappa values.
4. Cytoscape [43]
Cytoscape is a JAVA application which provides basic functionality to layout and
query networks, overlay nodes with expression data, or link genes/gene products to
databases of functional annotation. Notably, it is featured in its extensibility through a
straightforward plug-in architecture, which enables additional computational analyses to
be incorporated. In this study, a plug-in - Cerebral v2.0 [44] is used to layout the
network according to the subcellular location of each node.
21
5. Kyoto Encyclopedia of Genes and Genomes (KEGG) - Color Objects in KEGG
Pathways [45]
It is an on-line tool that provides a personalized pathway map by allowing the
assignment of different colors to font, border or background of each particular node.
6. European Bioinformatics Institute (EBI) - QuickGO [46]
QuickGO is a web-based browser that enables the extraction of branches from the
entire hierarchy of Gene Ontology according to a list of specified GO terms.
22
Chapter 3 Methods
Figure 3-1 illustrates the flow chart of this methodology and in the subsequent
sections we will describe each step in detail.
Figure 3-1. Flow chart of methodology in this work.
PPI is the abbreviation of protein-protein interaction data.
3.1 Database construction
Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.
In this ERD rectangles stand for entities, ellipses for attributes and diamonds
for relationships.
23
As the first step of integrating genomic data at different levels, a relational
database is required in order to bridge between information from various sources, to
speed up data retrieval and to correct for ambiguously recorded data. The structure of
database constructed here is elucidated in Figure 3-2 as a simplified entity relationship
diagram, where genes, pathways and array probes are considered as different entity
types.
While constructing the database, some records with ambiguous information should
be corrected:
1. Importing gene sets from MSigDB.
The gmt file records gene set members in the form of “synonym”, which is alias
instead of official name. The task here is to convert each of them into corresponding
unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,
or to “synonym” of genes in the case when there were no “official gene symbols”
matched.
2. Importing target genes of each probe set on microarray.
Some “gene_id” recorded in NetAffx annotation files had been changed or
discontinued. New “gene_id” should be updated based on information from NCBI’s
Entrez database.
24
3.2 Single gene analysis
1. Probe sets filtering
As one might note in chapter 2, the two datasets were assayed on different versions
of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both
U133A and U133B [47], thus in following works only those probe sets common in both
versions are utilized. Note that this step could be skipped when datasets to be compared
are of the same version.
2. Summarizing expression value for each gene
The analysis of Affymetrix array data starts with CEL files recording fluorescence
signals at probe level, from which probe-set level intensities are derived using robust
multi-chip average (RMA) method [48]. In this method probe level data undergo
background correction, quantile normalization [49] and median-polish summarization
[50]. The RMA process is completed under the commercial software Partek® [42] and
the resulting values are log-transformed expression values.
3. Hypothesis testing
Hypothesis testing methods are used to measure the degree of association between
response/covariate (either numerical or categorical factor, e.g. phenotypes) and random
variables (expression levels of probe sets/genes). In the simple but most common case,
25
given a covariate, its association with each gene is estimated by univariate hypothesis
testing method such as two-sample t test or Mann-Whitney statistics for binary
phenotype, F-statistic for polytomous phenotype, to name but a few.
Here two-sample paired t test is applied and the concluded p-value functions as an
index of degree of differential expression in terms of a probe set between phenotypes.
4. Select representative probe set for each gene
Among all probe sets targeting the same gene, the one with the smallest p-value is
selected to represent their target gene and the rest removed.
Until this step, an input matrix with logged gene expression values in rows and
arrays in columns is generated.
3.3 Pathway analysis
Both Tian method and modified Tian method are applied on the datasets.
Conceptually the procedure of pathway analysis starts with evaluating a pathway score
by employing a scoring function, which is the major difference between the two
methods. The score is then to be normalized and assigned with p-value according to its
null distribution that could be generated in two different ways of permutation. Finally
pathways are ordered by the addition of rankings under two permutation types. The
detailed procedures of both methods are described below.
26
T the t statistic of gene i. To note, the latter scoring
function comes from equation 2 in Nacu et al.[33].
2. Significant level of the score
A one/two-sided p-value representing significance of a pathway is estimated from
the f1/f0 score’s null distribution that could be generated in different ways depending on
the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:
either to test if genes inside a set show significantly higher associations with phenotypes
than that outside a set, or to test if a set does contain genes differentially expressed
than that outside a set, or to test if a set does contain genes differentially expressed