Chapter 5 Discussion
6.1 Input matrix creation
To obtain the input matrix required for pathway/network analysis, redundant probe
sets were removed by keeping a single representative for each gene. Three general
approaches are commonly adopted: representing each gene by its maximal probe set or
its median probe set in terms of differential expression, or by the average of all its
probe sets.
To prevent the gene’s significance level from being affected by potentially
ineffective probe designs, the maximal probe set was selected in section 4.1. The
trade-off of this decision is that noise may be amplified when a high-scoring probe set
targets several different genes. The peak illustrated in Figure 4-3 is an example of this
situation. It could certainly be avoided by simply discarding those non-specific probe
sets; however, they were retained here in order to preserve as much information as
possible.
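The three ways of collapsing probe sets described above can be sketched as follows.
This is a minimal illustration, not the code used in this work: the function name
collapse_probe_sets, the probe identifiers and the t-scores are hypothetical, and
"maximal" and "median" are assumed to be ranked by the absolute t-score.

```python
from collections import defaultdict
from statistics import mean


def collapse_probe_sets(t_scores, probe_to_gene, method="max"):
    """Return one score per gene from per-probe-set t-scores (illustrative only)."""
    by_gene = defaultdict(list)
    for probe, score in t_scores.items():
        by_gene[probe_to_gene[probe]].append(score)

    collapsed = {}
    for gene, scores in by_gene.items():
        ranked = sorted(scores, key=abs)  # ascending by |t|
        if method == "max":
            collapsed[gene] = ranked[-1]  # probe set with the largest |t|
        elif method == "median":
            # lower median by |t| when the number of probe sets is even
            collapsed[gene] = ranked[(len(ranked) - 1) // 2]
        elif method == "average":
            collapsed[gene] = mean(scores)  # signed average of all probe sets
        else:
            raise ValueError(f"unknown method: {method}")
    return collapsed


# Toy data (assumed values): three probe sets for GENE_A, one for GENE_B.
t_scores = {"p1": 0.5, "p2": -4.0, "p3": 1.5, "p4": 2.0}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_A", "p4": "GENE_B"}
by_max = collapse_probe_sets(t_scores, probe_to_gene, "max")
by_median = collapse_probe_sets(t_scores, probe_to_gene, "median")
by_average = collapse_probe_sets(t_scores, probe_to_gene, "average")
```

In this toy case the maximal rule lets the single extreme probe set p2 decide the score
of GENE_A, which is the behavior that protects against weak probe designs but can also
amplify a non-specific probe set.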
6.2 Pathway analysis
Two scoring functions in pathway analysis
As mentioned in section 3.3, the original f0 scoring function was slightly altered
into f1. The change was not made for mathematical reasons; rather, the two scoring
functions target different types of gene sets. The f0 method aims to find sets of genes
with concordant changes, even though the genes might not show individually significant
differential expression; f1 aims to select sets that contain a higher proportion of
significant genes than is found outside the set, regardless of the direction of their
changes. This difference is apparent in Figure 4-5, where a consistent up-regulation in
the cancer phenotype was observed in Figure 4-5A but not in Figure 4-5B. The choice of
scoring scheme therefore depends on the purpose of the analysis. For example, if one
focuses on the downstream targets of a transcription factor, f0 is a good fit; in
contrast, if one is searching for signal transduction chains or regulatory circuits that
involve various activation/inhibition relationships and lead to no definite overall
direction of change, f1 may be closer to the need.
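The contrast between a direction-sensitive and a direction-blind score can be
illustrated with a small sketch. The two functions below are stand-ins chosen for
illustration only and are not the definitions of f0 and f1 given in section 3.3:
signed_score assumes a sum of signed t-scores scaled by the square root of the set
size, and unsigned_score assumes the same form applied to absolute t-scores. The
t-scores are made up.

```python
from math import sqrt


def signed_score(t_values):
    """Direction-sensitive stand-in: opposite changes cancel each other."""
    return sum(t_values) / sqrt(len(t_values))


def unsigned_score(t_values):
    """Direction-blind stand-in: only the magnitude of each change counts."""
    return sum(abs(t) for t in t_values) / sqrt(len(t_values))


# Toy gene sets (assumed values).
concordant = [1.2, 1.0, 1.4, 0.9]   # modest changes, all in the same direction
mixed = [3.0, -3.1, 2.9, -2.8]      # strong changes in both directions

signed_concordant = signed_score(concordant)
signed_mixed = signed_score(mixed)
unsigned_concordant = unsigned_score(concordant)
unsigned_mixed = unsigned_score(mixed)
```

Under these assumptions the concordant set ranks higher with the signed score, while
the mixed set, whose members are individually strong but cancel in sum, ranks higher
with the unsigned score.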
Permutation method and pathway score normalization
The significance level of a pathway score was derived from its null distribution and
served as the major index for assessing the importance of a pathway. Tian et al. [27]
suggested two types of permutation methods that correspond to different biological
questions: one permutes the gene order and the other permutes the phenotypes.
In the case of phenotype permutation, it is inappropriate to shuffle all class labels
directly, as is usually done, because paired normal-tumor arrays were used here. Doing
so would additionally assume that expression profiles are invariant among patients,
which is clearly not true. Instead, phenotype permutation was achieved by randomly
deciding, for each pair, whether to exchange the tumor and normal class labels.
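The pair-wise label exchange can be sketched as below. This is a minimal illustration
under stated assumptions rather than the implementation used here: the function names,
the expression values for a single gene and the use of a paired t-statistic as the
per-gene score are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, stdev


def paired_t(diffs):
    """Paired t-statistic computed from per-patient tumor-minus-normal differences."""
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def paired_label_permutation(tumor, normal, rng):
    """Exchange the tumor/normal labels independently within each patient pair."""
    new_tumor, new_normal = [], []
    for t, n in zip(tumor, normal):
        if rng.random() < 0.5:  # coin flip: swap the labels of this pair or not
            t, n = n, t
        new_tumor.append(t)
        new_normal.append(n)
    return new_tumor, new_normal


def null_paired_t(tumor, normal, n_perm=1000, seed=0):
    """Null distribution of the paired t-statistic under pair-wise label exchange."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        perm_tumor, perm_normal = paired_label_permutation(tumor, normal, rng)
        null.append(paired_t([a - b for a, b in zip(perm_tumor, perm_normal)]))
    return null


# Toy expression values for one gene in six patients (assumed values).
tumor = [8.1, 7.9, 9.4, 8.8, 9.0, 8.5]
normal = [6.0, 6.3, 7.1, 6.9, 6.2, 6.8]
observed = paired_t([t - n for t, n in zip(tumor, normal)])
null = null_paired_t(tumor, normal, n_perm=200, seed=0)
```

Because the two arrays of a patient are never separated, the between-patient
variability is preserved in every permuted dataset, which is the property that a
free shuffle of all class labels would destroy.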
The effect of this modification was not apparent because both ways of shuffling
yielded mostly significant results. Cancer tissues usually differ greatly from normal
ones, so it was not surprising that so many pathways passed the significance criteria
in Figure 4-4 when their scores were compared with null distributions that assume no
difference between phenotypes. Phenotype permutation is therefore much less
discriminative than gene order permutation.
Furthermore, a limited-resolution problem arises because a null distribution cannot
cover a broad enough range of pathway scores. This weakness is inherent in resampling
procedures. It occurs when too few permutations are performed and becomes especially
evident with a dataset whose genes show dramatically altered expression. This is
exactly the case here, and it leads to many pathways sharing the same extremely small
significance level.
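The resolution limit can be made concrete with a short sketch. It assumes a two-sided
empirical p-value with the common add-one correction, which is a convention chosen for
this example and not necessarily the estimator used in this work; the null scores and
pathway scores are made up.

```python
def empirical_p(observed, null_scores):
    """Two-sided empirical p-value with add-one correction (assumed convention)."""
    exceed = sum(1 for s in null_scores if abs(s) >= abs(observed))
    return (exceed + 1) / (len(null_scores) + 1)


# Stand-in null distribution: 999 evenly spaced scores between -3 and 3.
null_scores = [-3 + 6 * i / 998 for i in range(999)]

# Two pathways with very different observed scores, both beyond the null range.
p_moderate = empirical_p(8.0, null_scores)
p_extreme = empirical_p(25.0, null_scores)
smallest_possible = 1 / (len(null_scores) + 1)
```

Both pathways receive the smallest p-value that 999 permutations can express, so the
significance level alone cannot rank them; only more permutations or a second index
can separate them.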
In such situations, where the significance level cannot discriminate between
pathways, the normalized pathway score serves as a further index for comparing their
importance. Tian et al. [27] normalize pathway scores by the following principle: a
score that falls within its null distribution is replaced with its quantile, and a score
that falls far from the null distribution is converted into the corresponding z-score.
However, z-scores might not be directly comparable with each other, since null
distributions differ from one another in different datasets. Normalized scores obtained
by this method should therefore be used with caution, especially when many of them are
derived from the z-score transformation, because the z-scores depend largely on the
features of their null distributions. Consequently, when the pathway analysis indicates
that a pathway is significant, the result can be trusted; when it suggests that one
pathway is the most dysregulated among the significant results, users should be more
careful.
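The dependence of a z-score on its null distribution can be shown with a toy
calculation. The two null distributions below are invented for the example and only
differ in spread; z_score is a hypothetical helper, not part of any cited tool.

```python
from statistics import mean, pstdev


def z_score(observed, null_scores):
    """Distance of an observed score from the null mean, in null standard deviations."""
    return (observed - mean(null_scores)) / pstdev(null_scores)


# Two made-up null distributions with the same center but different spread.
narrow_null = [-1.0, -0.5, 0.0, 0.5, 1.0]
wide_null = [-4.0, -2.0, 0.0, 2.0, 4.0]

# The same raw pathway score evaluated against each null distribution.
z_narrow = z_score(6.0, narrow_null)
z_wide = z_score(6.0, wide_null)
```

The same raw score yields a z-score four times larger against the narrow null than
against the wide one, so a ranking based on z-scores partly reflects the shape of
each null distribution rather than the degree of dysregulation.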
The roles of this methodology in relatively large and small pathways
The ability of this methodology to extract modules within pathways is both
applicable and profitable, especially in large pathways such as the focal adhesion
pathway selected here (size=200). This is not only because a pathway with a handful of
members is easier to handle, but also because a larger group of pathway members
generally forms a more interconnected network when information from interaction
databases is used.
At present, manually curated pathways remain the most reliable source for
pathway analysis, yet many public databases, such as BioCarta [7], tend to be
conservative: they record relatively few members compared with the actual size of a
pathway, which is believed to be a few hundred or even thousands of molecules. The
distribution of pathway sizes in the database is illustrated in Figure A-8, which shows
that most pathways have fewer than 50 members.
However, one might ask what this methodology can actually provide for small
pathways. The contents of individual pathways are expected to improve as more
biological data are gathered. With a computational tool that enables modules to be
extended to genes located outside a predefined pathway, the methodology has great
potential to point researchers to interacting neighbors that are suspected to be missing
components of existing, relatively small pathways. This capability has yet to be widely
realized by other computational analysis tools.
6.3 Network analysis
Two scoring functions in network analysis
Specifically, to assess whether a group of genes (either a pathway or a subnetwork)
is related to a study, two indices are of major concern: a score that is independent of
group size and the significance level of that score.
When scores do not depend on size, they are directly comparable with each other.
For network analysis this implies the ability to conduct a flexible-size subnetwork
search, which may identify modules of indefinite size. This can be achieved by directly
implementing the T̃ scoring scheme (equation 8 in Nacu et al. [33], Ideker et al. [36])
or the parametric T̃ scoring scheme (equation 6 in Nacu et al. [33], the default scoring
function in GXNA). Otherwise, the dependency on group size must be eliminated before
comparison. In the nonparametric T̃ scoring scheme (the f0 and f1 scoring functions),
for example, scores are normalized using the reference null distribution.
However, as illustrated in Figure 4-3, the filtered t-scores do not follow a normal
distribution as expected. The situation did not improve when the median probe set was
used to represent a gene. The parametric assumption therefore could not be established,
which is the major reason why we did not apply GXNA’s scoring function in our
network analysis. The nonparametric counterpart, on the other hand, requires a large
amount of resampling and is thus time-inefficient. Consequently, a fixed-size approach
was adopted, in which the comparison of scores is no longer an issue, and the f1 scoring
method was used in the network analysis here.
In addition, it is known that regulatory mechanisms span from the DNA/mRNA
level to the protein level. This implies that certain proteins may be key players in
connecting significant components while showing no differential expression at the
mRNA level, as mentioned in chapter 1.
GXNA filters out probe sets with small variances, and doing so might lose track of
these key nodes. In the work of Ideker et al. [32], this problem did not exist because
simulated annealing was used. In fact, the main objectives of the two methods differ:
GXNA aims to identify subnetworks in which all members show a certain degree of
differential expression, whereas Ideker et al. [32] tried to adapt the algorithm to the
events actually happening in biological systems. Unfortunately, simulated annealing is
too time-consuming, so we followed GXNA’s approach. To compensate, the newly
devised f2 scoring function was used; it is able to tolerate key nodes, as shown in
Figure A-3.
As shown in Table A-6, the key nodes found in the two pathways did not pass the
significance criteria, and the anticipated result, two groups of densely connected genes
bridged by a key node, was not clearly observed. Nonetheless, this idea still retains
flexibility for nodes that are not identified by mRNA microarray studies.
Starting condition: root nodes and search space
For the search algorithm, we basically follow the greedy approach in GXNA, with
some modifications to the determination of root nodes and search space.
Unlike GXNA, which always chooses random root nodes and searches the global
interaction network, this methodology has a much more flexible starting condition. The
search space and the root nodes are determined by the purpose of the analysis. The root
nodes can be members of specific pathways of interest, and the search space can be the
global interaction network or a subset of it defined by functional or positional
grouping. This provides the flexibility to meet biologists’ interests.
In section 4.4, we aimed to obtain the most important module in a pathway, so the
root nodes were pathway members and the search space was defined within the pathway.
In section 4.6, since the purpose was to explore genes interacting with known pathway
members, the root nodes were pathway members and the search space was the global
interaction network.
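The two starting conditions used in sections 4.4 and 4.6 can be sketched with a
simplified greedy extension. This is an illustration under stated assumptions and not
GXNA's code or the implementation used here: the function greedy_subnetwork, the toy
network and the per-gene scores are hypothetical, and the extension rule (add the
highest-scoring neighbor of the current members) stands in for whichever subnetwork
score is being optimized.

```python
def greedy_subnetwork(root, neighbors, score, size, search_space=None):
    """Grow a fixed-size subnetwork from a root, optionally restricted to a search space."""
    if search_space is not None and root not in search_space:
        raise ValueError("root node must lie inside the search space")
    members = [root]
    chosen = {root}
    while len(members) < size:
        # Candidate genes: unchosen neighbors of current members inside the search space.
        frontier = {
            n
            for m in members
            for n in neighbors.get(m, ())
            if n not in chosen and (search_space is None or n in search_space)
        }
        if not frontier:
            break  # no further extension is possible
        best = max(sorted(frontier), key=lambda g: score.get(g, 0.0))
        members.append(best)
        chosen.add(best)
    return members


# Toy interaction network and gene scores (assumed values).
neighbors = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"], "D": ["B"], "E": ["C"],
}
score = {"A": 1.0, "B": 0.5, "C": 2.0, "D": 3.0, "E": 4.0}
pathway = {"A", "B", "D"}

# Module within a pathway: root and search space are both pathway members.
within_pathway = greedy_subnetwork("A", neighbors, score, 3, search_space=pathway)
# Exploration of interacting genes: pathway root, global interaction network.
global_search = greedy_subnetwork("A", neighbors, score, 3)
```

With the search space restricted to the pathway, the extension stays on pathway
members; with the global network, the same root is extended toward high-scoring genes
outside the pathway.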
Within the most significant subnetwork obtained by GXNA (visualized in Figure
A-2), few members hint at a common pathway. In contrast, the results in Figure 4-5,
Figure A-3 and Figure 4-10 obtained by our methodology were much more focused on
specific pathways. The ability to conduct such focus-oriented analyses indicates that
these modifications make our approach more useful.
Merging process
GXNA allows both fixed-size and flexible-size subnetwork search. The reason
why the flexible-size approach was not applied was discussed in the previous section.
This methodology allows only fixed-size search; to compensate for this disadvantage, a
merging process was developed.
It is hypothesized that once an information flow is triggered, the signal propagates
along its predefined paths through various interactions between molecules. If a region
shows strong evidence of such information flow, that is, a group of connected genes
showing significant differential expression, then searches started from neighboring
genes would follow the gradient of evidence strength and eventually reach the region
during the greedy extension algorithm. Such a region would therefore be contained in
several candidate subnetworks. Once the region with the strongest evidence is
identified, the merging process is used to reshape the region. However, there might be
more than one such informative region, which is why multiple main components may be
specified.
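One way to picture such a merging step is sketched below. The merging rule is not
specified in this section, so everything in the sketch is an assumption made for
illustration: the function merge_candidates, the overlap threshold min_shared, the
single pass over candidates in descending score order, and the toy subnetworks.

```python
def merge_candidates(candidates, min_shared=2, n_components=1):
    """Merge overlapping candidate subnetworks around the highest-scoring ones.

    candidates: list of (score, set_of_genes) pairs from fixed-size searches.
    """
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    components = []
    while remaining and len(components) < n_components:
        _, seed = remaining.pop(0)  # strongest remaining candidate seeds a component
        merged = set(seed)
        leftover = []
        for cand_score, genes in remaining:
            if len(merged & genes) >= min_shared:
                merged |= genes  # candidate overlaps the region: absorb it
            else:
                leftover.append((cand_score, genes))
        remaining = leftover
        components.append(merged)
    return components


# Toy candidate subnetworks of fixed size 3 (assumed scores and members).
candidates = [
    (9.0, {"A", "B", "C"}),
    (7.5, {"B", "C", "D"}),
    (7.0, {"X", "Y", "Z"}),
    (6.0, {"C", "D", "E"}),
    (5.0, {"Y", "Z", "W"}),
]
components = merge_candidates(candidates, min_shared=2, n_components=2)
```

In this toy case the fixed-size candidates around A to E are reshaped into one larger
region, and a second, disjoint region is reported as another main component.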
A potential alternative is to apply the clustering method used in DAVID: the
candidate subnetworks would be clustered to identify overlapping regions, and the
clusters would be ordered so that each of them can be viewed as a main component.
Incompleteness of biomolecular interaction information
Although a pathway is an integration of interacting genes that should also be seen in
biomolecular networks, it was observed that small pathways tend not to be connected
into a single component because of the incompleteness of interaction information. In
commercial databases, the knowledge base is constructed by hiring an army of experts
to curate information from public databases or the scientific literature. Teams
maintaining public databases, on the other hand, may also have transformed their
pathway information into formatted data such as the KEGG Markup Language (KGML).
These data enable automatic pathway drawing and facilitate computational analysis. The
incompleteness of interaction data can be reduced by incorporating such formatted data
from these pathway databases.
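As a rough illustration of how such formatted data could be turned into interaction
pairs, the sketch below reads gene entries and relation elements from a KGML document
with Python's standard XML parser. The embedded document is a toy example with
made-up identifiers, the function kgml_interactions is hypothetical, and only a small
part of the KGML structure (entry, relation and subtype elements) is handled.

```python
import xml.etree.ElementTree as ET

# Toy KGML fragment (made-up pathway and gene identifiers).
KGML_EXAMPLE = """<pathway name="path:hsa00000" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:1111" type="gene"/>
  <entry id="2" name="hsa:2222" type="gene"/>
  <entry id="3" name="hsa:3333 hsa:4444" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
</pathway>"""


def kgml_interactions(kgml_text):
    """Return (gene_a, gene_b, subtype_names) tuples from KGML relation elements."""
    root = ET.fromstring(kgml_text)
    # Map each entry id to the gene identifiers it lists (an entry may list several).
    genes = {
        entry.get("id"): entry.get("name").split()
        for entry in root.findall("entry")
        if entry.get("type") == "gene"
    }
    edges = []
    for relation in root.findall("relation"):
        subtypes = [s.get("name") for s in relation.findall("subtype")]
        for a in genes.get(relation.get("entry1"), []):
            for b in genes.get(relation.get("entry2"), []):
                edges.append((a, b, subtypes))
    return edges


edges = kgml_interactions(KGML_EXAMPLE)
```

Edges obtained this way could be added to the interaction network so that members of
a small pathway that are unconnected in the interaction databases become linked.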