Input matrix creation - Developing new gene interaction pathways using known gene transmission mechanisms and protein interaction maps

Chapter 5 Discussion

6.1 Input matrix creation

To obtain the input matrix required for pathway/network analysis, redundant probe sets were removed by keeping only one representative value for each gene. Three general approaches are commonly adopted: representing each gene by its maximal probe set or by its median probe set in terms of differential expression, or by the average of all its probe sets.
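
The three strategies can be illustrated with a minimal sketch. The function name `collapse_probe_sets`, the probe identifiers and the scores below are hypothetical and serve only as an illustration; "maximal" is assumed here to mean the largest absolute score, and the median of an even number of probe sets is taken as the midpoint of the two central values.

```python
from statistics import mean, median

def collapse_probe_sets(probe_scores, probe_to_gene, method="max"):
    """Collapse probe-set scores to one value per gene.

    probe_scores:  dict probe_set_id -> differential-expression score (e.g. a t-score)
    probe_to_gene: dict probe_set_id -> gene symbol
    method:        "max" (largest absolute score), "median" or "mean"
    """
    by_gene = {}
    for probe, score in probe_scores.items():
        by_gene.setdefault(probe_to_gene[probe], []).append(score)

    if method == "max":
        # Keep the probe set with the strongest signal, preserving its sign.
        return {g: max(s, key=abs) for g, s in by_gene.items()}
    if method == "median":
        return {g: median(s) for g, s in by_gene.items()}
    if method == "mean":
        return {g: mean(s) for g, s in by_gene.items()}
    raise ValueError(f"unknown method: {method}")

# Hypothetical scores, for illustration only.
scores = {"p1": 0.5, "p2": -3.0, "p3": 1.0, "p4": 2.0}
mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_A", "p4": "GENE_B"}

collapsed_max = collapse_probe_sets(scores, mapping, "max")
collapsed_median = collapse_probe_sets(scores, mapping, "median")
collapsed_mean = collapse_probe_sets(scores, mapping, "mean")
```

In this toy example the maximal strategy keeps the single extreme probe set of GENE_A, whereas the median and mean strategies dampen it, which is the trade-off between sensitivity and noise discussed in this section.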

To prevent a gene's significance level from being affected by potentially ineffective probe designs, the maximal probe set was selected in section 4.1. The trade-off of this decision is that noise may be amplified when a high-scoring probe set targets several different genes. The peak illustrated in Figure 4-3 is an example of this situation. It could certainly be avoided by simply discarding those non-specific probe sets; however, they were retained here in order to preserve as much information as possible.

6.2 Pathway analysis

• Two scoring functions in pathway analysis

As mentioned in section 3.3, the original f0 scoring function was slightly altered into f1. This was not done for any mathematical reason; rather, the two scoring functions target different types of gene sets. The f0 method aims to find sets of genes with concordant changes, even though the genes might not individually show significant differential expression; f1 aims to select sets that contain a higher proportion of significant genes than is found outside the sets, regardless of whether the directions of change are concordant. This difference is revealed in Figure 4-5, where a consistent up-regulation in the cancer phenotype was observed in Figure 4-5A but not in Figure 4-5B. The choice of scoring scheme therefore depends on one's purpose. For example, if one focuses on the downstream targets of a transcription factor, f0 would be a good fit; in contrast, if one is searching for signal transduction chains or regulatory circuits that involve various activation/inhibition relationships and lead to an indefinite overall direction of change, f1 might be closer to the need.
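
The contrast between a direction-sensitive and a direction-agnostic score can be sketched as follows. The exact definitions of f0 and f1 are those of section 3.3 and are not reproduced here; `concordance_score` and `enrichment_score` are hypothetical stand-ins that only mimic the behaviours described above, and the cutoff of 2.0 and all t-scores are assumed values.

```python
from math import sqrt

def concordance_score(t_scores, members):
    """f0-like stand-in: signed t-scores summed and scaled by sqrt(set size),
    so that changes in opposite directions cancel out."""
    vals = [t_scores[g] for g in members]
    return sum(vals) / sqrt(len(vals))

def enrichment_score(t_scores, members, cutoff=2.0):
    """f1-like stand-in: proportion of significant genes (|t| >= cutoff) inside
    the set minus the proportion outside, ignoring direction.
    Assumes both the set and its complement are non-empty."""
    inside = [abs(t) >= cutoff for g, t in t_scores.items() if g in members]
    outside = [abs(t) >= cutoff for g, t in t_scores.items() if g not in members]
    return sum(inside) / len(inside) - sum(outside) / len(outside)

gene_set = {"a", "b", "c", "d"}

# Case 1: strong but opposite changes inside the set (hypothetical values).
t_mixed = {"a": 3.0, "b": -3.0, "c": 2.5, "d": -2.5,
           "e": 0.1, "f": -0.2, "g": 0.3, "h": 0.0}
# Case 2: modest but concordant changes inside the set (hypothetical values).
t_concordant = {"a": 1.5, "b": 1.5, "c": 1.5, "d": 1.5,
                "e": 0.1, "f": -0.2, "g": 0.3, "h": 0.0}

mixed_f0_like = concordance_score(t_mixed, gene_set)
mixed_f1_like = enrichment_score(t_mixed, gene_set)
concordant_f0_like = concordance_score(t_concordant, gene_set)
concordant_f1_like = enrichment_score(t_concordant, gene_set)
```

In the first case the signed sum cancels while the proportion-based score is maximal; in the second case the opposite holds, which mirrors the two use cases named in the text.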

• Permutation method and pathway score normalization

The significance level of a pathway score was derived from its null distribution and served as the major index for assessing the importance of a pathway. Tian et al. [27] suggested two types of permutation methods that correspond to different biological questions: one permutes gene order and the other permutes phenotypes.

In the case of phenotype permutation, it is inappropriate to shuffle all class labels directly, as is usually done, because paired normal-tumor arrays were used here. Doing so would further assume that expression profiles are invariant among patients, which is clearly not true. Instead, phenotype permutation was achieved by randomly deciding, for each patient, whether to exchange the tumor and normal class labels of that pair.
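
A minimal sketch of such a within-pair permutation is given below. It relies on the fact that exchanging the tumor and normal labels of one pair flips the sign of that pair's difference. The paired t-statistic, the function names and the data are assumptions made for illustration and are not necessarily the statistic used in this work.

```python
import random
from math import sqrt
from statistics import mean, stdev

def paired_t(diffs):
    """Paired t-statistic of tumor-minus-normal differences."""
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def paired_permutation_null(diffs, n_perm=1000, seed=0):
    """Null distribution built by exchanging labels within each pair.

    Swapping tumor and normal in one pair is equivalent to flipping the
    sign of that pair's difference, so patients are never mixed.
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        null.append(paired_t(flipped))
    return null

# Hypothetical tumor-minus-normal differences of one gene in six patients.
diffs = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3]
observed = paired_t(diffs)
null = paired_permutation_null(diffs)

# Two-sided empirical p-value with the usual +1 correction.
extreme = sum(1 for t in null if abs(t) >= abs(observed) - 1e-12)
p_value = (extreme + 1) / (len(null) + 1)
```

Shuffling all labels freely would instead compare a tumor sample of one patient with a normal sample of another, which is the assumption of between-patient invariance that the paired scheme avoids.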

The effect of this modification was not apparent, because both ways of shuffling yielded mostly significant results. Cancer tissues usually differ greatly from normal ones, so it was not surprising that so many pathways passed the significance criteria in Figure 4-4 when their scores were compared with null distributions that assume no difference between phenotypes. Phenotype permutation is therefore much less discriminative than gene order permutation.

Furthermore, a limited-resolution problem arises because a null distribution cannot cover a broad enough range of pathway scores. This weakness is inherent in resampling procedures. It occurs when too few permutations are performed and becomes especially evident for a dataset whose genes show dramatically altered expression. This is exactly the case here, and it leads to many pathways sharing the same extremely small significance level.
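
The resolution limit can be made concrete with a short sketch. With B permutations, an empirical p-value computed with the common +1 correction can never fall below 1/(B+1), so every score beyond the range of the null distribution receives the same value. The correction, the function name `empirical_p` and the numbers are assumptions for illustration only.

```python
def empirical_p(observed, null_scores):
    """One-sided permutation p-value with the +1 correction.

    The smallest attainable value is 1 / (B + 1), where B = len(null_scores).
    """
    b = len(null_scores)
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (b + 1)

# 1000 hypothetical null scores spread evenly over [-5, 5).
null_scores = [i / 100 for i in range(-500, 500)]

p_strong = empirical_p(8.0, null_scores)     # beyond the null range
p_stronger = empirical_p(20.0, null_scores)  # far beyond the null range
floor = 1 / (len(null_scores) + 1)
```

Both pathways in this toy example hit the same floor although their scores differ greatly, which is why a second index such as a normalized score is needed to rank them.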

In situations where the significance level cannot discriminate between pathways, the normalized pathway score serves as a further index for comparing their importance. Tian et al. [27] normalize pathway scores according to the following principle: a score that falls within its null distribution is replaced with its quantile, and a score that falls far from the null distribution is converted into the corresponding z-score.

However, z-scores might not be directly comparable with one another, since null distributions differ between datasets. Normalized scores obtained by this method should therefore be used with caution, especially when many of them are derived from the z-score transformation, because such z-scores depend largely on the features of their null distributions and may be unreliable. In short, when the pathway analysis indicates that a pathway is significant, the result can be trusted; when it suggests that one pathway is the most dysregulated among the significant results, users should be more careful.

• The roles of this methodology in relatively large and small pathways

The ability of this methodology to extract modules within pathways is both applicable and beneficial, especially for large pathways such as the focal adhesion pathway selected here (size=200). This is not only because a pathway with a handful of members is easier to work with, but also because a larger group of pathway members generally forms a more interconnected network when information from interaction databases is used.

At present, manually curated pathways remain the most reliable source for pathway analysis, yet many public databases, such as BioCarta [7], tend to be conservative: they record relatively few members compared with the actual size of a pathway, which is believed to be a few hundred or even thousands of molecules. An overview of the pathway size distribution in the database is given in Figure A-8, which shows that most pathways have fewer than 50 members.

However, one might ask what this methodology can actually offer for small pathways. The contents of individual pathways are expected to improve as more biological data are gathered. A computational tool that can extend modules to genes located outside a predefined pathway has great potential to point researchers to interacting neighbors that are suspected to be the missing components of relatively small existing pathways. This capability has yet to be widely provided by other computational analysis tools.

6.3 Network analysis

• Two scoring functions in network analysis

To assess whether a group of genes (either a pathway or a subnetwork) is related to a study, two indices are of major concern: a score that is independent of group size, and the significance level of that score.

When scores do not depend on size, they are directly comparable with one another. For network analysis this implies the ability to conduct a flexible-size subnetwork search, which may identify modules of indefinite size. This can be achieved by directly implementing the T̃ scoring scheme (equation 8 in Nacu et al. [33], Ideker et al. [36]) or the parametric T̃ scoring scheme (equation 6 in Nacu et al. [33], the default scoring function in GXNA). Otherwise, the dependency on group size must be eliminated before comparison. In the case of the nonparametric T̃ scoring scheme (the f0 and f1 scoring functions), scores are normalized by using the reference null distribution.

However, as illustrated in Figure 4-3, the filtered t-scores do not follow a normal distribution as expected. This situation did not improve when the median probe set was used to represent a gene. The parametric assumption therefore could not be established, and this is the major reason why we did not apply GXNA's scoring function in our network analysis. On the other hand, the nonparametric counterpart requires a large amount of resampling and is thus time-inefficient. As a consequence, a fixed-size approach was adopted, in which the comparison of scores is no longer an issue, and the f1 scoring method was used in the network analysis here.

In addition, regulatory mechanisms are known to operate from the DNA/mRNA level to the protein level. Certain proteins may therefore be key players in connecting significant components while showing no differential expression at the mRNA level, as mentioned in chapter 1.

GXNA filters out probe sets with small variances, and doing so might lose track of these key nodes. In the work of Ideker et al. [32] this problem did not exist because simulated annealing was used. In fact, the two methods have different main objectives: GXNA aims to identify subnetworks in which all members show a certain degree of differential expression, whereas Ideker et al. [32] tried to adapt the algorithm to the events actually happening in biological systems. Unfortunately, simulated annealing is too time-consuming, so we followed GXNA's approach. To compensate for this, a newly devised f2 scoring function was used. The new scoring function is able to tolerate key nodes, as shown in Figure A-3.

As shown in Table A-6, the key nodes found in the two pathways did not pass the significance criteria. Moreover, the expected pattern, in which two groups of densely connected genes are bridged by a key node, was not clearly observed. Nonetheless, this idea still leaves room for nodes that cannot be identified by mRNA microarray studies.

• Starting condition: root nodes and search space

For the search algorithm, we essentially followed the greedy approach of GXNA, with some modifications to the determination of root nodes and search space.

Unlike GXNA, which always chooses random root nodes and searches the global interaction network, the starting condition in this methodology is much more flexible. The choice of search space and root nodes depends on the purpose of the analysis. The root nodes can be members of specific pathways of interest, and the search space can be the global interaction network or a subset of it defined by functional or positional groupings. This provides the flexibility needed to meet biologists' interests.

In section 4.4, we aimed to obtain the most important module in a pathway, so the

root nodes were pathway members and the search space was defined within the pathway.

In section 4.6, since the purpose was to explore genes interacting with known pathway

members, the root nodes were pathway members and the search space was the global

interaction network.
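
The two starting conditions can be sketched with a simplified greedy extension. This is an illustration in the spirit of the greedy approach described above, not the GXNA implementation or the exact algorithm of this work; the function `greedy_subnetwork`, the toy network and the node scores are all hypothetical.

```python
def greedy_subnetwork(graph, scores, root, size, search_space=None):
    """Grow a connected subnetwork of up to `size` nodes from `root`.

    At each step the highest-scoring node adjacent to the current
    subnetwork is added. `search_space` restricts the candidate nodes;
    None means the whole (global) network.
    """
    allowed = set(graph) if search_space is None else set(search_space)
    if root not in allowed:
        raise ValueError("root must lie inside the search space")
    selected = [root]
    chosen = {root}
    while len(selected) < size:
        frontier = {n for node in selected for n in graph.get(node, ())
                    if n in allowed and n not in chosen}
        if not frontier:
            break  # no admissible neighbour left
        # sorted() makes tie-breaking deterministic
        best = max(sorted(frontier), key=lambda n: scores.get(n, 0.0))
        selected.append(best)
        chosen.add(best)
    return selected

# Toy interaction network: A-D are pathway members, X and Y lie outside.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "X"],
         "D": ["B"], "X": ["C", "Y"], "Y": ["X"]}
scores = {"A": 1.0, "B": 0.5, "C": 2.0, "D": 0.2, "X": 3.0, "Y": 2.5}
pathway = {"A", "B", "C", "D"}

# Module inside the pathway (search space = pathway members).
within_pathway = greedy_subnetwork(graph, scores, "A", 3, search_space=pathway)
# Extension beyond the pathway (search space = global network).
global_search = greedy_subnetwork(graph, scores, "A", 3)
```

With the same root node, restricting the search space keeps the module inside the pathway, whereas the global search reaches the high-scoring neighbor X outside it.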

Within the most significant subnetwork obtained by GXNA (visualized in Figure A-2), few members point to a common pathway. In contrast, the results in Figure 4-5, Figure A-3 and Figure 4-10 obtained by our methodology were much more focused on specific pathways. This ability to conduct focus-oriented analyses shows that these modifications make our approach much more useful.

• Merging process

GXNA allows both fixed-size and flexible-size subnetwork search. The reason why the flexible-size approach was not applied was discussed in the previous section. This methodology only allows fixed-size search; to compensate for this disadvantage, a merging process was developed.

It is hypothesized that once an information flow is triggered, the signal propagates along predefined paths that involve various interactions between molecules. If a region shows strong evidence of such an information flow, that is, if it contains a group of connected genes showing significant differential expression, then searches starting from its neighboring genes would follow the gradient of evidence strength and eventually reach the region during the greedy extension. In this methodology, such a region would therefore be contained in several candidate subnetworks. Once the region with the strongest evidence is identified, the merging process is used to reshape it. However, there might be more than one such informative region, and this is why we allow multiple main components to be specified.

A potential alternative is to apply the clustering method used in DAVID: the candidate subnetworks would be clustered to identify overlapping regions, and the clusters would be ordered so that each of them can be viewed as a main component.

• Incompleteness of biomolecular interaction information

Although a pathway is an integration of interacting genes that should also be visible in biomolecular networks, it was observed that small pathways tend not to be connected into a single component because of the incompleteness of interaction information. In commercial databases, the knowledge base is constructed by hiring an army of experts to curate information from public databases or the scientific literature. On the other hand, teams maintaining public databases may also have transformed their pathway information into formatted data such as the KEGG Markup Language (KGML). These data enable automatic pathway drawing and facilitate computational analysis. The incompleteness of interaction data can be reduced by incorporating such formatted data from these pathway databases.
