Database construction - 利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑

Chapter 3 Methods

3.1 Database construction

Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.

In this ERD rectangles stand for entities, ellipses for attributes and diamonds

for relationships.

As the first step of integrating genomic data at different levels, a relational

database is required in order to bridge between information from various sources, to

speed up data retrieval and to correct for ambiguously recorded data. The structure of

database constructed here is elucidated in Figure 3-2 as a simplified entity relationship

diagram, where genes, pathways and array probes are considered as different entity

types.

While constructing the database, some records with ambiguous information should

be corrected:

1. Importing gene sets from MSigDB.

The gmt file records gene set members in the form of “synonym”, which is alias

instead of official name. The task here is to convert each of them into corresponding

unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,

or to “synonym” of genes in the case when there were no “official gene symbols”

matched.

2. Importing target genes of each probe set on microarray.

Some “gene_id” recorded in NetAffx annotation files had been changed or

discontinued. New “gene_id” should be updated based on information from NCBI’s

Entrez database.

3.2 Single gene analysis

1. Probe sets filtering

As one might note in chapter 2, the two datasets were assayed on different versions

of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both

U133A and U133B [47], thus in following works only those probe sets common in both

versions are utilized. Note that this step could be skipped when datasets to be compared

are of the same version.

2. Summarizing expression value for each gene

The analysis of Affymetrix array data starts with CEL files recording fluorescence

signals at probe level, from which probe-set level intensities are derived using robust

multi-chip average (RMA) method [48]. In this method probe level data undergo

background correction, quantile normalization [49] and median-polish summarization

[50]. The RMA process is completed under the commercial software Partek^® [42] and

the resulting values are log-transformed expression values.

3. Hypothesis testing

Hypothesis testing methods are used to measure the degree of association between

response/covariate (either numerical or categorical factor, e.g. phenotypes) and random

variables (expression levels of probe sets/genes). In the simple but most common case,

given a covariate, its association with each gene is estimated by univariate hypothesis

testing method such as two-sample t test or Mann-Whitney statistics for binary

phenotype, F-statistic for polytomous phenotype, to name but a few.

Here two-sample paired t test is applied and the concluded p-value functions as an

index of degree of differential expression in terms of a probe set between phenotypes.

4. Select representative probe set for each gene

Among all probe sets targeting the same gene, the one with the smallest p-value is

selected to represent their target gene and the rest removed.

Until this step, an input matrix with logged gene expression values in rows and

arrays in columns is generated.

3.3 Pathway analysis

Both Tian method and modified Tian method are applied on the datasets.

Conceptually the procedure of pathway analysis starts with evaluating a pathway score

by employing a scoring function, which is the major difference between the two

methods. The score is then to be normalized and assigned with p-value according to its

null distribution that could be generated in two different ways of permutation. Finally

pathways are ordered by the addition of rankings under two permutation types. The

detailed procedures of both methods are described below.

T the t statistic of gene i. To note, the latter scoring

function comes from equation 2 in Nacu et al.[33].

2. Significant level of the score

A one/two-sided p-value representing significance of a pathway is estimated from

the f1/f0 score’s null distribution that could be generated in different ways depending on

the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:

either to test if genes inside a set show significantly higher associations with phenotypes

than that outside a set, or to test if a set does contain genes differentially expressed

between phenotypes. The former is achieved by permuting members of a pathway, and

the latter by randomly shuffling phenotype labels on each paired samples.

Since all pathways are tested simultaneously, multiple testing problems can no

longer be ignored. The p-values are either adjusted by Bonferroni method [51] or

converted to q values [52].

3. Pathway score normalization

To overcome the hurdle that pathway scores are not able to be compared directly

due to its dependency on the size and unique correlation structure of each pathway, Tian

proposed that normalization of the observed scores can be achieved by replacing them

with their quantiles. Here f0 and f1 scores are normalized in the same principle.

4. Ranking the pathways

Under each permutation procedure, an adjusted p-value and a normalized score are

obtained and from which a ranking is summarized. With descending importance, all

pathways are eventually ordered by the addition of two rankings derived from separate

permutation procedures.

3.4 Network analysis

Based on the significant pathway identified, network analysis tries to investigate

the pathway, looking for connected subgraphs that are either essential for differential

expression of the pathway or related to genes outside the defined pathway boundary. In

short, a candidate subnetwork is generated from each root and then being merged into

several main components. This algorithm follows Ideker’s idea and some methods of

GXNA, and will be described dividedly in six steps.

1. Starting points and the search space

In the beginning, a background interaction network is constructed and afterward

referred to as the “search space”, within which the algorithm searches for main

components. As the objective here is to show the applicability of this approach before

any further explorative investigations, the search space is at first confined to genes

within the dysregulated pathways. After then a version of searching under global

interaction network was demonstrated.

Suppose there are N genes (nodes) in the pathway, each of them will be considered

as a root and thus N candidate subnetwork would be generated.

2. Extension

Starting from a root node, a candidate subnetwork is generated by an assigned

number of extensions within the search space. In each time of extension, the node

yielding maximal score of the new subnetwork is incorporated from those directly

neighboring the current subnetwork.

3. Scoring

Two ways of evaluating current subnetwork are adopted. One is identical to

f S1 , only when applied here, S represents a subnetwork instead of a pathway. The other, ^{f S}²

, is similar but with slight modifications in order to increase the tolerance of key nodes mentioned in chapter 1.

Considering a subnetwork of size k, the algorithm first rearrange its members in

where the setting of m is flexible to users (default “1”). The equation suggests that the m

members with least contribution would be left aside from the scoring function.

Note the default T scoring function in GXNA (eq.6) [33] is not applied here, for

more details please refer to chapter 6.

4. Stopping criterion

There are two criteria in GXNA for stopping the extension of a subnetwork. One is

when predefined size is met and the other is when the new subnetwork score does not

surpass the current one. The former criterion is used here due to the same reason that f2

score is dependent on subnetwork size and thus not comparable to each other.

Nonetheless, to make up for artificial restrictions in fixed-size search, a merging step is

developed to produce subgraphs of different sizes.

5. Merging

Until this step, a candidate pool has been formed by the N candidate subnetworks

derived from the N roots. The merging process is a decisive step. It ends up with at most

h main components as final results where h is a user-specified parameter.

For each ⁱ

^ `

^1..^h the merging process starts a main component Ci with an empty set Ø. Step by step, the algorithm merges it with the highest-scored candidate

subnetwork sharing overlap with it. Ever since a candidate subnetwork has been chosen

from the candidate pool and merged with Ci, it is excluded from the pool. The merging

process stops when certain criteria are met, which varies depending on the user’s

concern. Here a handleable size of main component within the pathway is to be found,

so the algorithm stops when it reaches an amount approximately r percent of the search

space size, or, stops at predefined min/max size in the case of small/large search space.

However, the whole process could break off anytime the candidate pool is emptied.

6. Visualization

Main components found in this methodology are visualized using Cytoscape [43].

Moreover, they can also be mapped on the pathway figure using KEGG’s online tool

[45] when the significant pathway is retrieved from KEGG database.

在文檔中利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑 (頁 32-40)