Chapter 3 Methods
3.1 Database construction
Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.
In this ERD rectangles stand for entities, ellipses for attributes and diamonds
for relationships.
23
As the first step of integrating genomic data at different levels, a relational
database is required in order to bridge between information from various sources, to
speed up data retrieval and to correct for ambiguously recorded data. The structure of
database constructed here is elucidated in Figure 3-2 as a simplified entity relationship
diagram, where genes, pathways and array probes are considered as different entity
types.
While constructing the database, some records with ambiguous information should
be corrected:
1. Importing gene sets from MSigDB.
The gmt file records gene set members in the form of “synonym”, which is alias
instead of official name. The task here is to convert each of them into corresponding
unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,
or to “synonym” of genes in the case when there were no “official gene symbols”
matched.
2. Importing target genes of each probe set on microarray.
Some “gene_id” recorded in NetAffx annotation files had been changed or
discontinued. New “gene_id” should be updated based on information from NCBI’s
Entrez database.
24
3.2 Single gene analysis
1. Probe sets filtering
As one might note in chapter 2, the two datasets were assayed on different versions
of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both
U133A and U133B [47], thus in following works only those probe sets common in both
versions are utilized. Note that this step could be skipped when datasets to be compared
are of the same version.
2. Summarizing expression value for each gene
The analysis of Affymetrix array data starts with CEL files recording fluorescence
signals at probe level, from which probe-set level intensities are derived using robust
multi-chip average (RMA) method [48]. In this method probe level data undergo
background correction, quantile normalization [49] and median-polish summarization
[50]. The RMA process is completed under the commercial software Partek® [42] and
the resulting values are log-transformed expression values.
3. Hypothesis testing
Hypothesis testing methods are used to measure the degree of association between
response/covariate (either numerical or categorical factor, e.g. phenotypes) and random
variables (expression levels of probe sets/genes). In the simple but most common case,
25
given a covariate, its association with each gene is estimated by univariate hypothesis
testing method such as two-sample t test or Mann-Whitney statistics for binary
phenotype, F-statistic for polytomous phenotype, to name but a few.
Here two-sample paired t test is applied and the concluded p-value functions as an
index of degree of differential expression in terms of a probe set between phenotypes.
4. Select representative probe set for each gene
Among all probe sets targeting the same gene, the one with the smallest p-value is
selected to represent their target gene and the rest removed.
Until this step, an input matrix with logged gene expression values in rows and
arrays in columns is generated.
3.3 Pathway analysis
Both Tian method and modified Tian method are applied on the datasets.
Conceptually the procedure of pathway analysis starts with evaluating a pathway score
by employing a scoring function, which is the major difference between the two
methods. The score is then to be normalized and assigned with p-value according to its
null distribution that could be generated in two different ways of permutation. Finally
pathways are ordered by the addition of rankings under two permutation types. The
detailed procedures of both methods are described below.
26
T the t statistic of gene i. To note, the latter scoring
function comes from equation 2 in Nacu et al.[33].
2. Significant level of the score
A one/two-sided p-value representing significance of a pathway is estimated from
the f1/f0 score’s null distribution that could be generated in different ways depending on
the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:
either to test if genes inside a set show significantly higher associations with phenotypes
than that outside a set, or to test if a set does contain genes differentially expressed
between phenotypes. The former is achieved by permuting members of a pathway, and
the latter by randomly shuffling phenotype labels on each paired samples.
Since all pathways are tested simultaneously, multiple testing problems can no
longer be ignored. The p-values are either adjusted by Bonferroni method [51] or
converted to q values [52].
3. Pathway score normalization
27
To overcome the hurdle that pathway scores are not able to be compared directly
due to its dependency on the size and unique correlation structure of each pathway, Tian
proposed that normalization of the observed scores can be achieved by replacing them
with their quantiles. Here f0 and f1 scores are normalized in the same principle.
4. Ranking the pathways
Under each permutation procedure, an adjusted p-value and a normalized score are
obtained and from which a ranking is summarized. With descending importance, all
pathways are eventually ordered by the addition of two rankings derived from separate
permutation procedures.
3.4 Network analysis
Based on the significant pathway identified, network analysis tries to investigate
the pathway, looking for connected subgraphs that are either essential for differential
expression of the pathway or related to genes outside the defined pathway boundary. In
short, a candidate subnetwork is generated from each root and then being merged into
several main components. This algorithm follows Ideker’s idea and some methods of
GXNA, and will be described dividedly in six steps.
1. Starting points and the search space
In the beginning, a background interaction network is constructed and afterward
28
referred to as the “search space”, within which the algorithm searches for main
components. As the objective here is to show the applicability of this approach before
any further explorative investigations, the search space is at first confined to genes
within the dysregulated pathways. After then a version of searching under global
interaction network was demonstrated.
Suppose there are N genes (nodes) in the pathway, each of them will be considered
as a root and thus N candidate subnetwork would be generated.
2. Extension
Starting from a root node, a candidate subnetwork is generated by an assigned
number of extensions within the search space. In each time of extension, the node
yielding maximal score of the new subnetwork is incorporated from those directly
neighboring the current subnetwork.
3. Scoring
Two ways of evaluating current subnetwork are adopted. One is identical to
f S1 , only when applied here, S represents a subnetwork instead of a pathway. The other, f S2
, is similar but with slight modifications in order to increase the tolerance of key nodes mentioned in chapter 1.
Considering a subnetwork of size k, the algorithm first rearrange its members in
29
where the setting of m is flexible to users (default “1”). The equation suggests that the m
members with least contribution would be left aside from the scoring function.
Note the default T scoring function in GXNA (eq.6) [33] is not applied here, for
more details please refer to chapter 6.
4. Stopping criterion
There are two criteria in GXNA for stopping the extension of a subnetwork. One is
when predefined size is met and the other is when the new subnetwork score does not
surpass the current one. The former criterion is used here due to the same reason that f2
score is dependent on subnetwork size and thus not comparable to each other.
Nonetheless, to make up for artificial restrictions in fixed-size search, a merging step is
developed to produce subgraphs of different sizes.
5. Merging
Until this step, a candidate pool has been formed by the N candidate subnetworks
derived from the N roots. The merging process is a decisive step. It ends up with at most
h main components as final results where h is a user-specified parameter.
30
For each i
^ `
1..h the merging process starts a main component Ci with an empty set Ø. Step by step, the algorithm merges it with the highest-scored candidatesubnetwork sharing overlap with it. Ever since a candidate subnetwork has been chosen
from the candidate pool and merged with Ci, it is excluded from the pool. The merging
process stops when certain criteria are met, which varies depending on the user’s
concern. Here a handleable size of main component within the pathway is to be found,
so the algorithm stops when it reaches an amount approximately r percent of the search
space size, or, stops at predefined min/max size in the case of small/large search space.
However, the whole process could break off anytime the candidate pool is emptied.
6. Visualization
Main components found in this methodology are visualized using Cytoscape [43].
Moreover, they can also be mapped on the pathway figure using KEGG’s online tool
[45] when the significant pathway is retrieved from KEGG database.