利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑

(1)

୯ҥᆵ᡼εᏢႝᐒၗૻᏢଣғᙴႝηᆶၗૻᏢࣴز܌

ᅺγፕЎ

Graduate Institute of Biomedical Electronics and Bioinformatics College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

ճҔςޕ୷Ӣ໺ሀᐒڋϷೈқ፦ҬϕբҔკ᛼!

ٰ໒วཥޑ୷Ӣϕ୏೼৩^!

Integrate Pathway Information and Protein Interaction Network to Explore Possible Interactions Between Genes

શλ఼

Siao-Han Wong

ࡰᏤ௲௤Ǻಷᘑӹ റγ AdvisorǺDr. Eric Y. Chuang

ύ๮҇୯ΐΜΖԃΎД July, 2009

(2)

II

ᖴᇞ

! ҁጇፕЎᆶࣴزޑֹԋǴһеΑ೚ӭΓޑႴᓰᆶᔅշǶ२Ӄाགᖴޑࢂᅺγ

੤ࡰᏤ௲௤ಷᘑӹԴৣ๏ϒךᐒ཮уΕ೭ঁёངޑၠሦୱࣴزი໗Ǵӧ೭ٿԃ໔ ගٮᓬؼޑᕉნᆶᙦ؊ޑࣴزၗྍǵᙖӚԄ඲Тϩ݋૽ግ஥ሦך຾Εғނၗૻሦ ୱᆶ໒৖౳ࣚǴ٠ததჹךॺޑՉ٣Бӛගрࡌ᝼ᆶᐋҥڂጄǶӆٰाགᖴᒘߝ ӄԴৣᆶጰۏᏌԴৣӧѳВޑኧᏵϩ݋ᆶፕЎϣ৒অ҅ޑٌ༇ࡰᏤǶ!

! ќѦा੝ձགᖴԾֻᏢߏவፕЎᚒҞޑ௴ᆾǵϣ৒૸ፕᆶ΋ၡ΢ޑ஥ሦǴ٠

όᘐӦעךவФفӾ္ඣрٰǴዴࢂٌध࿤ϩޑৡ٣ǼགᖴᏢߏۆٵ։ǵηர҉

ᇻό཮ӧךॺගрభᡉୢᚒਔያଆ࣭ᓐǴᕴࢂ஑཰ӦӣเᆶࡰᏤǴ٠ЪӧፕЎቪ բ΢ගрࡌ᝼ᆶڐշঅׯǹགᖴྷᔇᏢߏऐЈвಒӦ௲Ꮴǵҭόਔࣁךॺ஥ٰ॥

፪ޑقᇟϷ҅ӛޑΚໆǶགᖴᏢߏۆБᛄᆶॣذǵதόᜏٌമӦࣁεৎुᖼϱᓓ

໯਑܈ᗺЈǴЪ๏ϒךॺᆂЈޑႴᓰᆶ៿኷ණѲǹགᖴॕңᏢۆǴவ຾ჴᡍ࠻а

ٰޑ service ஥ሦᆶ΋ၡ΢೚೚ӭӭ዗Јޑ࿶ᡍϩ٦Ǵᡣךૈ໩ճӦᑼΕ೭ঁᕉნ ᆶ೭ঁሦୱǴ٠ЪόਔӦ஥๏ךॺྕធᆶᅈᗺޑ៿ઢǶགᖴշ౛⃎ྼǵ܍ਤޑႴ ᓰک਺༇Ӧᗯ१Ǵᡣךӧჴᡍ࠻ޑ؂΋Ϻ೿кᅈᡋ഻ᆶႫىགǹགᖴёངΞԖ፪ ޑᏢ׌ۂऩ໚ǵᄨ޲ǵݒᑉ೭΋ԃٰࣁךॺϩ៽Α೚ӭჴᡍ࠻ޑ٣୍ǵ٬ךॺள аӄΚྗഢፕЎᆶࣴزǶགᖴٿՏӕᏢࡌ኷ᆶࡌᗶǴӧ೭ঁჴᡍ࠻ӕҒӅधΑٿ ԃǴவ΋໒ۈᝢฝჴᡍ࠻ޑࢲ୏ǵঅፐǵᏢಞϩ݋ǵௗπբаϷനࡕ΋ଆዖၸஒ ፕЎԋ׎ޑٿঁӭДᇴ௘ၮޑВηǶགᖴғᙴႝၗ܌ޑৣߏǵ౲܌ᒤΓ঩ӧՉࡹ

٣୍΢ޑᔅշǴаϷҁۛ߈ΟΜՏӕᏢٿԃύޑן࡭ᆶྣ៝Ƕགᖴךيᜐޑ؂΋

Տܻ϶Ǵૈᇡ᛽գॺ٠ᏱԖգॺޑഉՔࢂΜϩ۩ၮҭࡐ۩ᅽޑ٣Ƕ!

! നࡕǴགᖴךޑৎΓॺ΋ޔࢂךᆒઓ΢ޑЍࢊǴ๏ϒךόᘐ۳߻ޑ୏ΚǶؒ

ԖգॺǴ൩ؒԖϞВᆶܴВޑךǶ!

λ఼!2009.7.29!୷ӢᡏύЈ

(3)

III

ᄔा

߈ԃٰǴ୷Ӣ඲Т (microarray) ޑኧᏵϩ݋ς࿶வ໺಍΢໻ճҔ಍ीϩ݋ޑ БݤǴᙯӛуΕ׳ӭςޕ୷Ӣфૈ܄ຏှٰڐշϩ݋Ƕғނϸᔈၡ৩ϩ݋ݤ (pathway analysis) ଞჹၗ਑৤ύ؂΋ঁςۓကӳޑϸᔈၡ৩ (pathway)Ǵෳໆܭჴ ᡍ೛ीύځࢂցӸӧૻ৲ਡᑗਡለ (mRNA) ቫભ΢ޑᡉ๱ᡂ౦ǶԶќѦ΋ᅿᆛ๎

კ᛼ϩ݋ݤ (network analysis) ߾ԑӧཛྷ൨΋ঁҗ୷Ӣ໔܌ԖёૈӸӧޑҬϕբҔ

܌ಔԋޑკ᛼ (global interaction network) Ǵ࣮ځύࢂցԖᡉ๱ׯᡂޑηკ᛼

(subnetwork) Ƕ೭ٿᅿϩ݋БݤӚᏰയ൑ǴЪёఈံىჹБޑલᗺǺ߻ޣ໻ϩ݋ς

ޕޑϸᔈၡ৩Ǵࡺஒֽ่݀ज़ӧዕޕޑғނޕ᛽ύǹࡕޣᗨ᝶ᙒΑ೚ӭёૈޑ୷ Ӣϕ୏೼৩Ǵՠޔௗவ global interaction network ύཛྷ൨৒ܰפډคݤа౜Ԗғނ ޕ᛽և౜ځ΋ठ܄ғނཀကޑ subnetworkǶ

ҁЎගр΋ঁૈ୷ܭ pathway analysis ՠ׳຾΋؁่ӝ network analysis ᓬᗺޑ ϩ݋БԄٰׯ๓౜Ԗޑϩ݋БԄǴॶள΋ගޑࢂԜ΋่ӝཷۺӧҞ߻ޑϩ݋ሦୱ ύ٠όதـǶ ԜБݤ२ӃճҔ Tian et al. ว৖ޑ pathway analysis БԄෳۓԖᡉ๱

ᡂ౦ޑ pathwayǶௗ๱а೭٤ pathway ޑԋ঩բࣁрวᗺǴ௦Ҕ Nacu et al. ޑ network analysisБݤ຾Չ΋ঁҞޑᏤӛԄޑ subnetwork ཛྷ൨Ƕ

ҁጇፕЎஒԜБݤᔈҔӧѠεᙴଣޤᕎੰΓޑ୷Ӣ඲ТኧᏵ΢Ƕ΋໒ۈ჋၂ ӧԖᡂ౦ޑ pathway ύ൨פځനڀж߄܄ޑԋϩ (subnetwork) Ƕ೭ಔኧᏵౢғޑ

subnetworkӧќ΋ಔѠчᄪᕴޤᕎੰΓޑኧᏵύҭளډΑև౜ଯࡋ΋ठޑ่݀ǶԜ

Ѧךॺଞჹפډޑ subnetwork ຾Չځԋ঩୷Ӣޑфૈ܄ϩ݋Ǵว౜வচҁֹ᏾

pathwayᕭ෧ډ subnetwork ޑၸำύǴ᏾ᡏޑфૈҗচҁڀԖޑӭኬ܄Ǵ஑΋ϯډ

੝ۓޑфૈ΢Ƕ೭སҢΑ၀ pathway ޑࢌ೽ϩфૈӧჴᡍύࢂܴᡉӦ೏ׯᡂޑǴԶ ೭ኬޑׯᡂளаҔ೭ጇፕЎޑБݤ೏ჸ᝺Ƕ׳຾΋؁Ǵךॺ৖౜ΑҁБݤ܍ᝩΑ

(4)

IV

network analysisԶٰޑᓬᗺǶҥ୷ܭ pathway ޑςޕԋ঩ѐཛྷ൨ځёૈԖϕ୏ޑᎃ

߈୷ӢǴךॺளډޑ subnetwork ࢂаԜ pathway ࣁрวᗺǴՠх֖೚ӭ҂೏ᇡ᛽

ޑҬϕբҔǴ೭ኬޑ่݀ёаڐշࣴزޣჹܭ҂ޕޑ೽ϩ೛ीჴᡍ଺׳ుΕޑ௖

زǶவќѦ΋Бय़ٰᇥǴ೭ኬޑ่݀ΨԖ౦ܭ໺಍ network analysis ޑБԄǴѬૈ

٬ளࣴزёаҥ୷ܭࣴزޣགᑫ፪ޑ pathwayǴ୷ܭςޕޑғނޕ᛽ѐܗ৖࣬ᜢ҂

ޕޑёૈ܄Ƕ

ᕴ่܌Ԗϩ݋ޑ่݀ǴѬॺவόӕБӛࡰрΑԜϩ݋Бݤόӕܭа۳ޑ೚ӭ ᓬᗺǺѬёаவᡉ๱ׯᡂޑ pathway ύ๧ڗр΋ঁനख़ाЪελ፾ӝࣴزޑ

subnetworkǵΨёаଞჹࣴزޣགᑫ፪ޑ pathway ܈੝ۓޑፓ௓ᐒڋ຾ՉЬᚒԄޑ

ుΕ௖૸ǵԜѦନΑҥ୷ܭচԖғނޕ᛽ѦǴѬҭڀԖ໒ว୷Ӣ໔ཥޑϕ୏ᐒڋ ޑૈΚǶ

ᜢᗖຒ

୷Ӣ඲ТǵኧᏵϩ݋!

! !

(5)

V

ABSTRACT

Currently the analysis of microarray data had turned into integrating with prior

biological knowledge: pathway analysis interprets transcriptomic data on pathway level

and identified predefined groups of genes with dysregulation; network analysis takes

gene-gene interactions information into consideration and searches for modules

associated to the phenotypes under study. The two analyses have its own advantages

respectively and they complement the weaknesses of each other: pathway analysis

provides little clues to directly explore new biological knowledge and network analysis

usually yields modules including few consistent biological information. In this study an

analytical methodology was developed to integrate current pathway analysis method

with network analysis methods.

Initially, dysregulated pathways are identified by modified pathway analysis

method in Tian et al.. Subsequently, a focus-oriented investigation on dysregulated

pathways are performed by network analysis following the work of Nacu et al., and this

step is using modules within or related to members of the pathways to be further

investigated. Several improvements were made, such as the scoring functions and the

module identification algorithms.

To illustrate the benefits of this methodology, a lung cancer study with 30 paired

(6)

VI

cancer and normal tissues was explored. The results derived within dysregulated

pathways were also identified consistently in another public dataset GSE7670.

Furthermore, GO term enrichment analysis was applied to show that the modules have a

specialized functionality than the original pathways. In brief, original large modules

were reduced from the entire pathway to a smaller size of relevant interconnected

members, which are much easier to be manipulated but still remain their biological

information. Moreover, the ability of this methodology to explore novel interactions

related to pathway members were also demonstrated by extending the module search

algorithm beyond the pre-defined pathways. This would not be achieved by traditional

pathway analysis methods, which usually don’t include biomolecular interaction

information. Yet, modules identified in this methodology were based on dysregulated

pathways with specific biological meaning since their members were mainly associated.

In conclusion, these data all indicated the advantages to integrate both pathway and

network information during microarray analysis: to uncover manageable size of

molecular interaction networks important for pathway dysregulation, to focus on

interested pathways, functions or even specific regulatory events, and to possess the

potential of performing exploratory researches on mechanisms that are not yet well

understood. Undoubtedly, this concept could be extensively applied to other array

(7)

VII

experiments of similar design regardless of the disease under study.

Key words

microarray, pathway analysis, network analysis, module!

!

(8)

LIST OF FIGURES

Figure 3-1. Flow chart of this methodology ... 22

Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here ... 22

Figure 4-1. Overlaps between array, pathway and interaction data ... 31

Figure 4-3. t-score distribution of probe sets before and after representative filtering . 33 Figure 4-4. Histograms of unadjusted p-values for 560 pathways in database ... 35

Figure 4-5. Main components extracted from two pathways ... 38

Figure 4-6. Map the main component and leading edge subset to original pathway .... 40

Figure 4-7. GO terms (cellular component 5) enriched in focal adhesion pathway ... 44

Figure 4-8. GO terms (molecular function 1-5) enriched in focal adhesion pathway ... 45

(10)

Figure 4-9. GO terms (biological process 5) enriched in cell cycle pathway ... 46

Figure 4-10. GO terms (cellular component 5) enriched in cell cycle pathway ... 47

Figure 4-11. Extend subnetwork search to the global interaction network ... 49

Figure A-1. Top cancer killers in Taiwan in 2008 ... 70

Figure A-2. The most significant subnetwork identified by GXNA ... 71

Figure A-3. Main components obtained by f2 scoring method ... 76

Figure A-4. GO term hierarchy for cluster C in Fig. 4-6 ... 79

Figure A-5. GO term hierarchy for terms involved in Fig. 4-7 ... 80

Figure A-6. GO term hierarchy for cluster A and term B,C in Fig. 4-8 ... 81

Figure A-7. GO term hierarchy for cluster C,D,E in Fig. 4-9 ... 82

Figure A-8. Histogram of pathway sizes in our database ... 83

LIST OF TABLES

Table 1-1. Definition of gene sets and exemplary databases ... 5

Table 4-1. Statistics of current database ... 31

Table 4-2. Network analysis procedure and the consistency with GSE7670 ... 36

Table A-1. Statistics of t-scores in two datasets ... 70

Table A-2. Parameters set during pathway analysis ... 72

Table A-3. Significant pathways identified by different methods ... 72

Table A-4. Lists of significant pathways identified by f0 and f1 scoring function... 73

Table A-5. Annotations of genes present in Figure 4-5 ... 74

Table A-6. Annotations of genes present in Figure A-3 ... 77

Table A-7. Overlap of main components and the leading edge subset ... 78

Table A-8. Annotations of genes present in Figure 4-11 ... 83

(11)

1

Chapter 1 Introduction

1.1 Lung cancer

According to figures released by Department of Health in June 2009, cancer has

topped the major 10 causes of death in Taiwan for 27 consecutive years. The statistics

showed that, in 2008, malignant tumors were responsible for 27.3 percent of all deaths,

among which the proportions of top cancer killers are displayed in Figure A-1.

Moreover, when gender is taken into consideration, lung cancer, in particular, has

occupied the leading cause of female cancer mortality since it first overtook cervical

cancer in 1986 [1].

Noteworthily, despite that tobacco smoke is the major risk factor for lung cancer,

many patients, especially female ones, are never smokers. This situation is not unusual

in countries other than Taiwan, as it had already been discussed in the review paper [2].

It was suggested in the article that pathways of carcinogenesis for lung cancer in never

smokers and tobacco-associated lung cancer are not exactly the same due to the clinical

and biological differences observed in patients of the two types. However, specific

mechanisms are still under investigation, which in further requires good experimental

techniques and analytical methodologies that help researchers focus on clues to

carcinogenesis process with efficiency.

(12)

2

1.2 Microarray

Microarray is a powerful tool to screen tens of thousands of genes at one time,

giving a semi-quantitative sketch of genome-wide mRNA expression levels in cells,

which greatly facilitates and accelerates biological studies. Since its invention in the

1990s [3], gradually improved technologies had lead to more affordable commercial

arrays with stable quality, making microarray widely applied in various biomedical

researches and beyond question cancer-related studies are no exceptions.

However, unlike array experiments that could be conducted with acceptable quality

as long as protocols are followed adequately, there has always been room for

bioinformaticians to develop analytical methods that better extract biological insights

from array data.

Following this section some methods will be reviewed and in the end of this

chapter, the idea about methodology developed here will be introduced.

1.3 Data analysis

Starting from introducing regularly employed single gene analysis that

concentrates on independent statistical analysis of individual genes, what followed

subsequently are some advanced data analysis methods that take additional biological

information into account. Each of these methods detects biological processes being

(13)

3

dysregulated from different entry points, however, major concerns of biologists can yet

be satisfied simultaneously: to uncover manageable size of molecular interaction

networks important for dysregulation, to focus on interested pathways, functions or

even specific regulatory events, and to possess the potential of performing exploratory

researches on mechanisms that are not yet well understood. In the end of this chapter, an

idea about combining advantages of current methods will be introduced in attempt to

fulfill these requirements.

1.3.1 Single gene analysis

Whatever array platform the samples are assayed on, after image scanning and

preprocessing steps including background correction and normalization within/between

arrays, conventional analytic procedure starts with evaluation of individual genes: all

genes are ordered by the extent of their associations with phenotypes, with a significant

degree of association suggesting the gene’s being differentially expressed at mRNA

level and worthwhile to undergo further biological validations.

However, varied according to topics under study, this approach sometimes ends up

with a long list of significant genes even after multiple hypothesis adjustments [4, 5]

that are usually required when testing large amount of hypotheses simultaneously. This

makes follow-up validations laborious or even infeasible. Such obstacle could not be

(14)

4

surmounted by mere selection from the significant gene list, where the decision of genes

deserving further investigations depends on researchers’ expertise due to multiple roles

a gene might play.

To deal with this problem, analysis shall not be confined to expression data itself

anymore. As has been well-known to all, cellular functions are not implemented by

individual genes independently, rather, they are accomplished by a group of genes

acting together to perform cellular tasks. Thus it is anticipated that a more consistent

biological scene can be revealed by methods that not only incorporate transcriptomic

data but also consider prior biological knowledge such as functions in common or

relationships between genes/gene products during analysis. Such information was

derived from previous biological experiments and was deposited in various public

databases as described below.

1.3.2 Biological knowledge database

It gains more insight into the interpretation of transcriptomic data if analysis could

be integrated with functional annotations or other omic data from different levels. To

see how this can be realized, in this section and the next we will introduce what kind of

information can be utilized and how they can be integrated.

(15)

5 A. Gene set databases

Generally speaking, a gene set is a group of functionally related genes. Specific

instances for gene set definitions and corresponding databases are tabulated in Table

1-1.

Since members in a gene set tend to function in coordination, several methods are

developed to analyze genes in groups and identify gene sets instead of genes that are

significantly regulated.

Table 1-1. Definition of gene sets and exemplary databases

Common feature Databases

metabolic / signaling pathway member KEGG, Biocarta,

GenMapp, MSigDB (c2) chromosomal location / cytogenetic band MSigDB (c1)

target of microRNA / transcription factor MSigDB (c3)

biological process participants GO - Biological Process subcellular location/macromolecular complex GO - Cellular Component perform molecular function GO - Molecular Function

Ɂ Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], BioCarta [7], GenMapp [8], MsigDB [9], Gene Ontology (GO) [10]

In this study, we focus only on metabolic and signaling pathways that are

essentially an abstraction of the information flow throw physical interaction network in

response to a drug, nutrients or external stimuli.

(16)

6 B. Interaction databases

An interaction knowledge base contains genetic [11] and physical interactions

between genes/gene products of various types. They can be related to physical binding

(protein complex), protein modification (methylation, (de)phosphorylation), promoter

binding (transcriptional regulation) or chemical reaction (activation/inhibition).

Typically, they are stored as binary interactions and in the form of

gene(product)1-relation-gene(product)2 triplets, from which a global biomolecular

network visualizing this knowledge base can be easily constructed by representing

genes/gene products as vertices (nodes) and interaction relationships as edges (directed

or undirected).

Although these relationships have been extensively studied in small-scaled

experiments using synthetic lethality or other biochemical and biophysical techniques,

they are recorded in scientific literatures and cannot be directly utilized by

computational scientists unless undergone information extraction into machine-readable

format. Currently, this can be done by manual curations or by text mining techniques

such as applying natural language processing algorithm [12].

The construction of such databases was triggered actually by the expansion of

high-throughput techniques in the last ten years, which includes: microarray for

(17)

7

synexpression of genes; yeast two-hybrid (Y2H) technology, tandem affinity

purification coupled with mass spectrometry (TAP-MS), high-throughput

mass-spectrometric protein complex identification (HMS-PCI) and

co-immunoprecipitation (Co-IP) for protein-protein interactions; chromatin

immunoprecipitation coupled with DNA microarray (ChIP-chip) or with paired-end

ditag (ChIP-PET), DNA adenine methylase identification (DamID) and yeast

one-hybrid assays for protein-DNA interactions. The boosted amount of formatted data

from aforementioned techniques had received wide attention of bioinformaticians and

enabled the development of this field. At present, these data account for more than 70%

of current database content. Nonetheless, scientific literatures remain to be the most

important and reliable source since a high percentage of high-throughput data were

estimated to be spurious [13, 14]. On the other hand, it is also believed that these so

called false positives are in fact true physical interactions, yet might not be biologically

meaningful [15].

To sum up, a pool of all known biomolecular interactions between genes/gene

products are accommodated in public interaction databases such as BIND [16], HPRD

[17], MINT [18] and commercial knowledge bases held by Ingenuity Systems

(Ingenuity Pathway Analysis, IPA) and GeneGo (Metacore).

(18)

8

At last, a clear difference between network and pathway can be elucidated by this

paragraph [15] :

“A network represents a static image of all possible physical and/or regulatory

interactions between biological entities, while a pathway represents how the

information propagates through the network. Because information propagation is a

directional process, a pathway must have entry nodes where the information flow starts

and terminal points where the information flow ends.”

Several methods evolved with the aid of these knowledge bases and they will be

reviewed in the next two sections. These methods differ mainly in the databases

incorporated, however, what as well cannot be left out of consideration are the statistical

methodology they utilized and the extent they exploit transcriptomic data.

1.3.3 Pathway analysis

Pathway analysis is actually gene set analysis only to focus on pathways. It is

expected to not only find pathways with significant differential expression but also

detect consistent yet subtle expression changes among members of a pathway. A review

on various pathway analysis methods evaluating the involvement of pathways in

different phenotypes under study is available in [19, 20].

A simplest procedure is to assess the significance of overlap between preselected

(19)

9

genes (differentially expressed genes) and predefined annotation groups (pathways)

using Fisher Exact test (hypergeometric distribution) or Chi-squared test, which is then

followed by adequate multiple hypothesis adjustments. This over-representation method

is intuitive, easy to implement and computationally efficient, thus it dominates current

commercial and public software, such as IPA [21], Metacore [22] and DAVID [23].

An alternative approach needs no prior filtering of genes. It assigns a score to each

gene set and assesses p-value by re-sampling procedure, which is to compare the score

with its null distribution. BRB-ArrayTools [24] use the LS statistic and

Kolmogorov-Smirnov (K-S) statistic to test if the single-gene p-values in a gene set are

of a uniform distribution. In gene set enrichment analysis (GSEA) [25, 26], an

enrichment score is obtained by considering the distribution of pathway genes in the

entire list of genes, which in spirit is a weighted K-S statistic. Tian et al. [27] designed a

statistical framework to determine perturbed pathways. In their work the overall

objective is to “test whether a group of genes has a coordinated association with a

phenotype of interest.” After each gene set is assigned a score by averaging the test

statistics of its member genes, p-values regarding two different hypotheses are

estimated by permuting class labels and gene orders. Eventually, gene sets with

significant p-values under both hypotheses are considered differentially expressed

(20)

10 across phenotypes.

The latter approach is more biologically reasonable than over-representation

method since it reserve expression information and does not depend on an artificial

filtering of genes, in which the results may depend strongly on the cutoff chosen to

make the significant gene list [28].

However, most pathway analysis methods suffer from some weaknesses. First, due

to the fact that only a limited number of well-studied genes are classified into pathways,

the remained many genes with unknown functions are left out of considerations. Second,

most pathway annotations contain only labels of pathway members and with no

information about the interplay between them, therefore the results obtained from these

methods could not help to give direct understanding of cellular processes at molecular

level. Also, pathway is actually a dynamic model and the component activated is

specific to conditions under study, which is usually not emphasized in most methods,

however, GSEA did specify a list of core members, named leading edge subset, that are

the main contributors of the pathway’s enrichment score.

In all, pathway analysis can successfully interpret data at pathway level but fails to

elucidate the interplay between members within and provide no information about genes

not involved in known pathways.

(21)

11 1.3.4 Network analysis

Extensive works had been done seeking to characterize the design principles of

biomolecular network structure using graph theory. An interesting finding shows that its

connectivity distribution follows the typical power-law distribution for scale-free

network and it shows small-world properties [29] in terms of network diameter and

clustering. It is believed that this feature, which indicates the existence of hubs, enables

the biological network to be robust against occasionally removal of arbitrary network

elements during evolution process [30].

Another subject of active research is the identification of relevant modules or

subnetworks in the global network. Notably the interactome data cannot alone complete

this task due to the inconsistent conditions of its content. That is, these interactions are

usually temporal, spatial or dependent on conditions and tissue types. Therefore,

additional annotations providing condition-specific information is required, such as

combining microarray data to identify connected sets of nodes based on their coherent

expression patterns at mRNA level. In this regard, several methods of identifying

responsive modules are reviewed in [31].

Ideker et al. [32] are among the first groups to extract active subnetworks based on

both interactome and transcriptomic data. They searched the global network for

(22)

12

high-scoring subnetworks via simulated annealing algorithm, where the score is in spirit

an aggregation of z-scores measuring differential expression of individual genes in the

subnetwork and calibrated against the null distribution generated by randomly sampled

groups of genes. Interestingly, the result contained many examples of genes individually

with low score but are required to connect together several high-scoring genes. Genes

with such character agree with the behavior of some transcription factors (TFs) and will

be referred to as “key nodes” in this article. Despite the exciting observation, the

time-consuming nature confines its application on large interaction network.

Nacu et al. [33] developed a method GXNA that expands subnetworks from seed

nodes using a greedy approach and then identifies those with significantly high scores.

One of their contributions lies in the design of scoring functions that correct for biases

due to subnetwork size if necessary. They are generally divided into two scoring

schemes -͂T and T͂. What subsequent to the evaluation of subnetwork score is the

estimation of p-value that assesses its significance against the null distribution derived

from scores of subnetworks under different phenotype permutations. Eventually, these

p-values are adjusted for family-wise error rate and used to select relevant subnetwork.

GXNA is fast and it focuses on small modules differentially expressed between

phenotypes, however, it allows no such key nodes since the greedy approach is adopted.

(23)

13

In general, pathway can be viewed as the assembly of causal sequences of

molecular events and with all building blocks available in the biomolecular network.

Ideally it seems promising to infer new pathways directly by network analysis, but in

reality such pathway inference are complex and less validated due to the noise and

incompleteness of current biological networks, especially when coupling with the

small-world property of its topology. Additionally, the identified module is prone to be

associated with multiple canonical pathways, which makes it hard to reveal a unifying

scene and interpret in a fashion related to focus under study, as in the example of Figure

A-2.

1.3.5 Methodology in this work

In this study we attempt to develop a methodology that goes a step further than

current pathway analysis by bringing in the advantages of network analysis. Practically

speaking, it can be achieved by processing microarray data through pathway and

network analyses in series, however, it is worthwhile mentioning that the incorporation

of both pathway and network knowledge into microarray data analysis is yet to be

widely realized. Based on results from pathway analysis, the methodology here aims to

enhance researchers’ understanding by stepping from pathway level into molecular level,

with its main objectives specified below:

(24)

14

1. Extract module most relevant to the pathway’s dysregulation.

It is without doubt that pathway is in fact a dynamic model and the members

perturbed differs between conditions under study. Although GSEA specifies a leading

edge subset and suggests their being responsible for pathway’s dysregulation, the subset

is of limited biological meaning since it provides no information for elucidating the

roles they play. To put it simply, the leading edge subset is statistically meaningful

more than biologically meaningful. Therefore, methodology here attempt to

complement pathway analysis in this regard by making use of biomolecular networks.

After dysregulated pathways are identified, concept of network analysis is then

imposed to extract modules most relevant to dysregulation of interested pathways. It is

mainly due to these sequential events that make the pathways deemed dysregulated.

On the other hand, the module obtained here differs from that yielded by other

network analyses in its ability to interpret microarray data at known-pathway level and

to investigate interested pathways in greater details.

2. Other focus-oriented strategies enabling exploratory survey on pathways of interest.

Current understanding of biological functions at the pathway level is far from

being thorough. In one way, some of known pathways are actually incomplete. In the

other, although recorded in a separate fashion, they are not truly isolated at the level of

(25)

15

global interaction network. In fact they should not be so as in the example of

cross-pathway inhibition, which allows an activated pathway to refocus cellular

resources to respond to stimuli it received by competing against other pathways. As a

matter of fact, a considerable degree of cross-talks between pathways are revealed by

various small-scaled biological studies.

Using the methodology here, a task-oriented investigation is promised by

exploiting the valuable information imbedded in biomolecular networks. These yet

well-understood interactions possess great potential to point an investigator to either

missing pathway components or cross-talks between pathways and help in the design of

appropriate experiments for identifying them.

In previous study physical interaction data were integrated with genetic interaction

information to uncover the mechanisms underneath [34], which can further be used to

find cross-talks between two pathways where the two interacting genes reside.

The interaction data was also coupled with pathway information to provide clues

of cross-talks between pathways [35], however, it is done by merely calculating whether

physical connections between two pathways are higher than random using Fisher Exact

Test.

(26)

16

In two successive works of Ideker et al. [36, 37], subnetwork and condition-

responsive genes (CORGs) were defined in dysregulated pathway as the part delivering

optimal discriminative power for the disease phenotype using T͂ evaluation.

Nonetheless, they were used as prediction markers rather than to discuss the underlying

mechanisms.

Before going to the next chapter, the methodology framework is summarized in

four parts.

1. Construction of database for storage of molecular interaction network, canonical

pathway collections and gene annotations.

2. Statistical analysis of microarray data at pathway level: in this part Tian’s algorithm

is adopted yet with slight but crucial modification.

3. Algorithms allow navigating the global network on the region of interested

pathways: the scoring function is based on GXNA and modified to tolerate key

nodes. In addition, a merging step is developed to complement the scoring function.

4. Visualize the module by laying genes/gene products according to their subcellular

localizations.

(27)

17

Chapter 2 Materials

2.1 Lung cancer datasets

1. NTUH Lung Cancer Dataset

This dataset is created by Bioinformatics and Biostatistics Core, National Taiwan

University Research Center for Medical Excellence - Division of Genomic Medicine.

Matched normal and tumor samples of 31 female patients with non-small cell lung

cancer (NSCLC) in National Taiwan University Hospital (NTUH) are collected for

DNA microarray analysis using Affymetrix Human Genome U133 Plus 2.0 Array.

2. Public Dataset GSE7670 (TVGH Lung Cancer dataset)

A public dataset [38] created by Taipei Veterans General Hospital (TVGH) and

deposited in NCBI’s Gene Expression Omnibus (GEO) [39] (also available at EBI’s

ArrayExpress: www.ebi.ac.uk/arrayexpress) under GEO series accession number

GSE7670 is downloaded as CEL files. Of all the 66 Affymetrix Human Genome

U133A Array data, those from 21 female NSCLC patients with paired normal-tumor

arrays are used to create this dataset.

2.2 Databases

1. Pathways

Predefined gene sets performing tasks of metabolic functions or signaling

(28)

18

transductions are recorded in several public databases.

The Molecular Signature Database (MSigDB) v2.5 maintained by Broad Institute

collects curated gene sets information from several online databases or literatures and

records them with a unified format. Its “Canonical Pathway (CP) collection” in

“Functional Sets (C2) category” [9] is the main source of pathway information in this

work. The collection of 639 gene sets is released in a file with filename extension “gmt”.

In the following work, gene sets are simplified into pathways.

2. Protein interaction network

Protein-protein interactions (PPI) detected by high-throughput methods are

recorded in several PPI databases. The NCBI’s Entrez Gene database containing curated

interaction information from BIND [16], BioGRID [40], EcoCyc [41] and HPRD [17] is

utilized as the main source to construct protein interaction network in this work.

3. Target genes of probe sets on microarray

The information of genes targeted by probe sets designed on Affymetrix GeneChip

arrays comes from annotation files in Affymetrix website (www.affymetrix.com).

4. Gene annotation

Unique gene identifiers (Entrez ID or HUGO gene symbol) and historical aliases

of genes are retrieved from NCBI’s Entrez Gene database.

(29)

19

2.3 Program environment and public/ commercial tools

With database built under MySQL, this methodology was realized with the help of

Matlab^®. Besides, some public or commercial tools are used to process the dataset for

different purposes. Following are some brief descriptions about them.

1. Partek^® Genomics Suite [42]

It is a commercial software that enables various statistical analysis of microarray

data. In the work here it is used simply to complete preprocessing steps of microarray

data, which summarizes expression value for each probe set and applied normalization

algorithm to remove potential systematic biases.

2. Gene set enrichment analysis (GSEA) [25]

GSEA evaluates the probability a gene set is differentially expressed across

phenotypes and defines a leading-edge subset comprising the core members activated in

the gene set.

At first, genes are ordered by their correlation between expression values and

phenotype classes, then an enrichment score (ES), corresponding to a weighted

Kolmogorov-Smirno-like statistic, is calculated for the gene set. A p-value representing

significance level of the ES score is determined against null distribution of ES estimated

by permuting class labels. After p-value for each gene sets is obtained, false discovery

(30)

20

rate (FDR) method is used to adjust for multiple hypothesis testing.

3. Database for Annotation, Visualization and Integrated Discovery (DAVID) [23]

DAVID uses an EASE score, a modified Fisher Exact p-value, to measure the

enrichment of gene sets in the gene list specified by user. Furthermore, to reduce the

redundant nature of annotations that might dilute the focus of the result, which is

especially inevitable when associating with Gene Ontology (GO) terms, DAVID

provides the option to classify significantly associated gene sets into different clusters

and order the clusters according to their significance. According to manual on the

website, it is achieved by integrating the same techniques of Kappa statistics to measure

the degree of the common genes between two gene sets, and fuzzy heuristic clustering

to classify the groups of similar annotations according to kappa values.

4. Cytoscape [43]

Cytoscape is a JAVA application which provides basic functionality to layout and

query networks, overlay nodes with expression data, or link genes/gene products to

databases of functional annotation. Notably, it is featured in its extensibility through a

straightforward plug-in architecture, which enables additional computational analyses to

be incorporated. In this study, a plug-in - Cerebral v2.0 [44] is used to layout the

network according to the subcellular location of each node.

(31)

21

5. Kyoto Encyclopedia of Genes and Genomes (KEGG) - Color Objects in KEGG

Pathways [45]

It is an on-line tool that provides a personalized pathway map by allowing the

assignment of different colors to font, border or background of each particular node.

6. European Bioinformatics Institute (EBI) - QuickGO [46]

QuickGO is a web-based browser that enables the extraction of branches from the

entire hierarchy of Gene Ontology according to a list of specified GO terms.

(32)

22

Chapter 3 Methods

Figure 3-1 illustrates the flow chart of this methodology and in the subsequent

sections we will describe each step in detail.

Figure 3-1. Flow chart of methodology in this work.

PPI is the abbreviation of protein-protein interaction data.

3.1 Database construction

Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.

In this ERD rectangles stand for entities, ellipses for attributes and diamonds

for relationships.

(33)

23

As the first step of integrating genomic data at different levels, a relational

database is required in order to bridge between information from various sources, to

speed up data retrieval and to correct for ambiguously recorded data. The structure of

database constructed here is elucidated in Figure 3-2 as a simplified entity relationship

diagram, where genes, pathways and array probes are considered as different entity

types.

While constructing the database, some records with ambiguous information should

be corrected:

1. Importing gene sets from MSigDB.

The gmt file records gene set members in the form of “synonym”, which is alias

instead of official name. The task here is to convert each of them into corresponding

unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,

or to “synonym” of genes in the case when there were no “official gene symbols”

matched.

2. Importing target genes of each probe set on microarray.

Some “gene_id” recorded in NetAffx annotation files had been changed or

discontinued. New “gene_id” should be updated based on information from NCBI’s

Entrez database.

(34)

24

3.2 Single gene analysis

1. Probe sets filtering

As one might note in chapter 2, the two datasets were assayed on different versions

of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both

U133A and U133B [47], thus in following works only those probe sets common in both

versions are utilized. Note that this step could be skipped when datasets to be compared

are of the same version.

2. Summarizing expression value for each gene

The analysis of Affymetrix array data starts with CEL files recording fluorescence

signals at probe level, from which probe-set level intensities are derived using robust

multi-chip average (RMA) method [48]. In this method probe level data undergo

background correction, quantile normalization [49] and median-polish summarization

[50]. The RMA process is completed under the commercial software Partek^® [42] and

the resulting values are log-transformed expression values.

3. Hypothesis testing

Hypothesis testing methods are used to measure the degree of association between

response/covariate (either numerical or categorical factor, e.g. phenotypes) and random

variables (expression levels of probe sets/genes). In the simple but most common case,

(35)

25

given a covariate, its association with each gene is estimated by univariate hypothesis

testing method such as two-sample t test or Mann-Whitney statistics for binary

phenotype, F-statistic for polytomous phenotype, to name but a few.

Here two-sample paired t test is applied and the concluded p-value functions as an

index of degree of differential expression in terms of a probe set between phenotypes.

4. Select representative probe set for each gene

Among all probe sets targeting the same gene, the one with the smallest p-value is

selected to represent their target gene and the rest removed.

Until this step, an input matrix with logged gene expression values in rows and

arrays in columns is generated.

3.3 Pathway analysis

Both Tian method and modified Tian method are applied on the datasets.

Conceptually the procedure of pathway analysis starts with evaluating a pathway score

by employing a scoring function, which is the major difference between the two

methods. The score is then to be normalized and assigned with p-value according to its

null distribution that could be generated in two different ways of permutation. Finally

pathways are ordered by the addition of rankings under two permutation types. The

detailed procedures of both methods are described below.

(36)

26 1. Scoring function

For each pathway _୫ ൌ ൛_ଵǡ ǥ ǡ _୩_ౣൟ , its score is calculated by either

0

1

( ) 1

m

i

k

m g

m i

f S T

k

¦

^(Tian’smethod)

or ₁

1

( ) 1

m

i

k

m g

m i

f S T

k

¦

(modified Tian’s method), where km is the size of Sm and

gi

T the t statistic of gene i. To note, the latter scoring

function comes from equation 2 in Nacu et al.[33].

2. Significant level of the score

A one/two-sided p-value representing significance of a pathway is estimated from

the f1/f0 score’s null distribution that could be generated in different ways depending on

the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:

either to test if genes inside a set show significantly higher associations with phenotypes

than that outside a set, or to test if a set does contain genes differentially expressed

between phenotypes. The former is achieved by permuting members of a pathway, and

the latter by randomly shuffling phenotype labels on each paired samples.

Since all pathways are tested simultaneously, multiple testing problems can no

longer be ignored. The p-values are either adjusted by Bonferroni method [51] or

converted to q values [52].

3. Pathway score normalization

(37)

27

To overcome the hurdle that pathway scores are not able to be compared directly

due to its dependency on the size and unique correlation structure of each pathway, Tian

proposed that normalization of the observed scores can be achieved by replacing them

with their quantiles. Here f0 and f1 scores are normalized in the same principle.

4. Ranking the pathways

Under each permutation procedure, an adjusted p-value and a normalized score are

obtained and from which a ranking is summarized. With descending importance, all

pathways are eventually ordered by the addition of two rankings derived from separate

permutation procedures.

3.4 Network analysis

Based on the significant pathway identified, network analysis tries to investigate

the pathway, looking for connected subgraphs that are either essential for differential

expression of the pathway or related to genes outside the defined pathway boundary. In

short, a candidate subnetwork is generated from each root and then being merged into

several main components. This algorithm follows Ideker’s idea and some methods of

GXNA, and will be described dividedly in six steps.

1. Starting points and the search space

In the beginning, a background interaction network is constructed and afterward

(38)

28

referred to as the “search space”, within which the algorithm searches for main

components. As the objective here is to show the applicability of this approach before

any further explorative investigations, the search space is at first confined to genes

within the dysregulated pathways. After then a version of searching under global

interaction network was demonstrated.

Suppose there are N genes (nodes) in the pathway, each of them will be considered

as a root and thus N candidate subnetwork would be generated.

2. Extension

Starting from a root node, a candidate subnetwork is generated by an assigned

number of extensions within the search space. In each time of extension, the node

yielding maximal score of the new subnetwork is incorporated from those directly

neighboring the current subnetwork.

3. Scoring

Two ways of evaluating current subnetwork are adopted. One is identical to

f S1 , only when applied here, S represents a subnetwork instead of a pathway. The other, ^{f S}²

, is similar but with slight modifications in order to increase the tolerance of key nodes mentioned in chapter 1.

Considering a subnetwork of size k, the algorithm first rearrange its members in

(39)

29

ascending order of p-values, so that for ^S

^

^g¹^,^!^,^g^k

`

^,Tgi Tgi1

d for all ⁱ

^ `

^2..^k ^,

where

gi

T is the t statistic for gene i. The score of S is then obtained by

2

1

( ) 1

i

k m

g i

f S T

k m

¦

^,

where the setting of m is flexible to users (default “1”). The equation suggests that the m

members with least contribution would be left aside from the scoring function.

Note the default T scoring function in GXNA (eq.6) [33] is not applied here, for

more details please refer to chapter 6.

4. Stopping criterion

There are two criteria in GXNA for stopping the extension of a subnetwork. One is

when predefined size is met and the other is when the new subnetwork score does not

surpass the current one. The former criterion is used here due to the same reason that f2

score is dependent on subnetwork size and thus not comparable to each other.

Nonetheless, to make up for artificial restrictions in fixed-size search, a merging step is

developed to produce subgraphs of different sizes.

5. Merging

Until this step, a candidate pool has been formed by the N candidate subnetworks

derived from the N roots. The merging process is a decisive step. It ends up with at most

h main components as final results where h is a user-specified parameter.

(40)

30

For each ⁱ

^ `

^1..^h the merging process starts a main component Ci with an empty set Ø. Step by step, the algorithm merges it with the highest-scored candidate

subnetwork sharing overlap with it. Ever since a candidate subnetwork has been chosen

from the candidate pool and merged with Ci, it is excluded from the pool. The merging

process stops when certain criteria are met, which varies depending on the user’s

concern. Here a handleable size of main component within the pathway is to be found,

so the algorithm stops when it reaches an amount approximately r percent of the search

space size, or, stops at predefined min/max size in the case of small/large search space.

However, the whole process could break off anytime the candidate pool is emptied.

6. Visualization

Main components found in this methodology are visualized using Cytoscape [43].

Moreover, they can also be mapped on the pathway figure using KEGG’s online tool

[45] when the significant pathway is retrieved from KEGG database.

3.5 Results demonstration

In the end, attempts were made to reveal the biological scenes underlying the

results of this methodology by associating Gene Ontology terms with members of main

components. This is done with the help of DAVID [23] for GO terms association and

clustering, and QuickGO [46] for GO hierarchy visualization.

(41)

31

Chapter 4 Results

4.1 Database

Table 4-1 lists some characteristics of the database constructed here, and

informative relationships of its content data are shown in Figure 4-1.

Table 4-1. Statistics of current database

Current Database Number of Records

NCBI - Entrez Gene database 40234 signaling / metabolic pathways 691

biomolecular interactions 31340 genes with interaction information (A) 8787

genes involved in pathways (B) 5596 genes targeted by U133A probe sets (C) 13799

Figure 4-1. It shows overlaps between array, pathway and interaction data.

Notably there are a large number (a+b=4,873) of genes within the global interaction network that are currently not assigned to any predefined pathways, which implies the potential to exploit functions of unknown genes when integrating interaction information. On the other hand, there are also 1,682 (c+d) genes in pathways that show

(42)

32

no related interaction information, which foretells some obstacles to be met in this methodology. However, this could be overcame by constructing a database that includes more complete interaction data, although it may take lots of time to manually record them from pathway databases and scientific literatures.

4.2 Single gene analysis

In this step, probe level data in CEL files are transformed into an input matrix which will be fed into pathway/network analysis. Figure 4-2 describes the single gene analysis procedure and Figure 4-3 shows the t-score distribution of probe sets before and after representative filtering. When a Bonferroni adjusted p-value < 0.05 criterion was applied, there were 1,489/1,345 significantly up-/down- regulated probe sets in NTUH dataset after representative filtering, which was much greater than that in GSE7670 dataset as shown in the lower panel in Figure 4-3.

Figure 4-2. Single gene analysis procedure. It describes how the input matrix was produced from CEL files.

(43)

33

Figure 4-3. t-score distribution of probe sets before and after representative filtering.

The total probe set number decreases from 22215 to 13799. Probe sets with t-scores greater/smaller than the Bonferroni thresholds are significantly up-/down-regulated.

In both datasets the original left-skewed distribution turned into a bimodal

distribution after filtering, and in the bimodal distribution the positive and negative

sides show unequal peak heights and variances. This might be due to the different

correlation structures between genes. More detailed statistics of t-score distribution is

shown in Table A-1. In addition, an unusual peak highlighted in the Figure 4-3 comes

from a nonspecific high-scoring probe set 209079_x_at that targets at 22 related genes.

(44)

34

4.3 Pathway analysis

Three pathway analysis methods were applied on the datasets, namely Tian scoring

method (using scoring function f0), modified Tian scoring method (using scoring

function f1) and GSEA, and the third one was applied only for result comparisons.

A total of 560 pathway information with size ranging from 10 to 500 were used for

pathway analysis; each pathway has corresponding null distributions generated by

permuting either gene order or class label for 10,000 times. The unadjusted p-values of

the 560 pathways were corrected for multiple hypotheses test using Bonfferoni

correction or q-value conversion in f0 scoring function and f1 scoring function,

respectively. Detailed parameter settings in each method were provided in Table A-2.

Histograms of unadjusted p-values for 560 pathways under different permutation

types or scoring schemes were displayed in Figure 4-4 and among them a number of

pathways were deemed significant under both scoring functions. This phenomenon was

different from that in GSEA (Table A-3) and suggested an elevated statistical power in

the methods applied here. Detailed significant pathways with adjusted p-value passing

the criteria of 0.05 in both datasets were listed in Table A-3,A-4.

After all pathways had undergone pathway score evaluation and p-value

adjustment, network analysis was applied further to investigate molecular mechanisms

(45)

35

Figure 4-4. Histograms of unadjusted p-values for 560 pathways in database.

Pathway p-values in the upper panel were derived by permuting gene orders and the lower ones by permuting phenotypes. Bars in green color stand for the result when applying f1 scoring function and that in orange color come from result of f0

scoring function.

based on the interested pathways. These two scoring methods identified significant

pathways for further analysis. For example, “Cell cycle” pathway was identified by f0

scoring, whereas “Focal adhesion” pathway was selected by the f₁ scoring method. In

principle, cancer-associated pathways with larger pathway size were chosen out for

examples from those with adjusted p-value lower than 0.05 in both datasets.

(46)

36

4.4 Network analysis - within pathways

In this and the next section, network analysis was applied on significant pathways,

and the focus will be on extracting main components from pathways that take the

interplay of member genes into consideration.

Within the search space containing all nodes in the pathway, candidate

subnetworks each with size 8 were formed using f1 scoring. They were then merged into

one main component with size basically 0.75 times the space size, or bounded by the

size of 15, 25 for small, large search spaces respectively. The whole process was

summarized in Table 4-2.

Table 4-2. Network analysis procedure and the consistency with GSE7670.

NTUH (m=0) Cell cycle Focal adhesion

search space (entire pathway) 90 199

candidate subneworks 60 111

top subneworks merged 10 10

1^st main component size 19 25

leading-edge subset size 35 63

consistency with GSE7670 74% 60%

Fisher’s exact test p-value 9.4E-08 4.6E-10

This table showed that the search spaces of two pathways were dramatically

reduced from 90 and 199 to 19 and 25 of their 1^st main components, respectively.

Furthermore, the 1^st main component sizes were also smaller than the sizes of leading

(47)

37

edge subsets obtained by GSEA that equaled to 35 and 63.

To examine the robustness of this method, members of main components derived

from NTUH dataset were compared with that obtained from GSE7670 dataset under

same analysis procedures. The result of comparison was also shown in the table that the

1^st main component from NTUH dataset had a significant overlap (74% and 60%

overlap and the overlaps were all with p-value << 0.05) with that derived from

GSE7670 dataset.

These main components obtained by network analysis are displayed in Figure 4-5,

where gene/gene products are arranged according to their cellular locations. For detailed

expression level and annotations of each gene in Figure 4-5, please refer to Table A-5.

Evidently the major difference between main components obtained from the two

pathways is that a coherently up-regulation in tumor samples is observed in Figure 4-5A,

while such coherence does not appear in Figure 4-5B. This observation once again

reveals the main difference of Tian method and modified Tian method: the former

emphasizes pathways with moderate but concordant changes and the latter focuses on

pathways with significant degree of overall changes, regardless of up- or down-

regulation.

(48)

38 (A)

(B)

Figure 4-5. Main components extracted from two pathways. After network analysis, main components were derived from (A) cell cycle pathway and (B) focal adhesion pathway.

Genes were arranged by their subcellular locations and colored by expression fold change values (red for up-regulation and green for down-regulation in cancer phenotype).

Rectangles indicate that genes were found in GSE7670 and NTUH dataset in common, whereas circles mean that genes were identified only in NTUH dataset.

(49)

39

In addition, we also applied network analysis using f₂ scoring with m=1 and the

rest settings remained unchanged on the two pathways. The main components yielded

by this analysis share approximately 70-90 percent similarities with previous ones.

Some nodes playing essential roles of maintaining the integrity of main components did

exist, however, Figure A-3 and Table A-6 indicate that an expected situation of catching

clues of key nodes bridging separately connected subgraphs did not show apparently.

4.5 Result demonstration and comparison

4.5.1 Mapping the main component and leading edge subset on KEGG pathways

Two main purposes of this section is to illustrate how the main component

obtained by network analysis differed from the leading edge subset obtained by GSEA

from biological viewpoints, and to highlight the fundamental difference between

information stored in pathway databases and interaction databases, thus to present the

benefit it brought to integrate both information with microarray data.

Figure 4-6 was produced by an on-line tool provided by KEGG [45], where

members of leading edge subset and main component for each pathway were

simultaneously mapped onto simplified pathway figures. It seemed that most members

of main components overlap with leading edge subsets, however, they actually share an

overlap of about 60-80 percent as summarized in details in Table A-7.

(50)

40 (A)

(B)

Figure 4-6. Overlay the main component and leading edge subset on the original pathway.

(51)

41

In the simplified figure, rectangles represent genes, complexes or families, etc. in (A) cell cycle pathway and (B) focal adhesion pathway. Members of main components were filled in red/green colors standing for up-/down-regulation in cancer phenotype. Members of leading edge subset were colored in grey. Note that those nodes with black border and filled with color were common members of both sets, while those with white borders were found in main components only. In addition, dashed red lines were added artificially implying interaction relationship not recorded in KEGG databases.

In Figure 4-6, the members of the leading edge subsets were colored in grey.

Although these genes show statistically significances, they located in the pathway in a

scattered manner. In contrast, main components are tightly interconnected subset of

genes because it took topology into consideration. It represents the most significant

module in the pathway, which would be more biologically meaningful when compared

to the leading edge subset that only takes gene significances into account.

Furthermore, the dashed red lines represent possible interactions which do not

appear in the predefined pathway may illustrate the additional information one would

get when incorporating interaction data with pathway information. The benefit of

bringing in the additional information would become more valuable later in section 4-6.

4.5.2 GO term enrichment analysis

In this section, we try to manifest the biological meanings of these main

components. It was achieved by using DAVID to compare the GO terms enriched in

both the entire pathway and the main component extracted from the pathway. As

(52)

42

mentioned in section 2.3, DAVID evaluates the randomness of each GO term being

associated with a user-specified gene list and assigns it with both an EASE score and a

false discovery rate (FDR) reflecting the significance of the EASE score. After that,

similar GO terms were sorted into a cluster and gave the cluster a new enriched score by

summarizing member term EASE scores. This enriched score was then used to rank

relative importance of clusters.

DAVID was used to separately evaluate GO terms/clusters associated with gene

lists containing the entire pathway and the main component. Then the GO terms or

clusters that were contained terms with significant association (FDR<0.05) with the

gene list were visualized by a pie chart. Each portion in the pie chart stands for a GO

term/cluster, and the proportion of gene list members involved in this cluster/term is

showed as a percentage. In addition, the rankings of relative importance of clusters are

specified on the pie chart. Note that different GO term/cluster may probably contain

overlap information, so the overall percentage in the pie chart would not be exactly 100.

Figure 4-7, 4-8 display the result obtained in focal adhesion pathway and Figure 4-9,

4-10 for that in cell cycle pathway.

In Figure 4-7, GO terms/clusters enriched in focal adhesion was identified in terms

of cellular component category. It was obvious that Cluster C turned to occupy a larger

(53)

43

proportion in the main component than in the entire pathway. It suggested that when

reducing the list of genes from the entire pathway to the main component, the functions

that the list of genes able to play had gradually specialized. It might also imply that

certain functions are more differentially regulated than other functions in the original

pathway, and that these important functions could be revealed by methodology.

Furthermore, when member terms of Cluster C were individually considered in Figure

4-7C, each of them showed a consistent trend of such function specialization in both

leading edge subset and main component. In contrast, this specialization was not

observed in a random subset that was randomly chosen from the entire pathway and

with the same size of the main component. In terms of molecular function category in

Gene Ontology, several significant GO terms not grouped into clusters (Term B-F) also

showed this consistent trend of function specialization in Figure 4-8C.

Such trends did not appear only in focal adhesion pathway. In Figure 4-9 cell cycle

pathway was analyzed using GO biological process terms, where Cluster A showed also

a dramatic increase from sharing 36% of the entire pathway to sharing 63% of the main

component. When analyzing cell cycle pathway using GO cellular component in Figure

4-10, the function specialization also existed since the proportion Cluster C,D,E

occupied were all amplified in Figure 4-10B.

利用已知基因傳遞機制及蛋白質交互作用圖譜來開發新的基因互動途徑

National Taiwan University Master Thesis

ABSTRACT

Key words

CONTENTS

LIST OF FIGURES

LIST OF TABLES

Chapter 1 Introduction

1.1 Lung cancer

1.2 Microarray

1.3 Data analysis

Chapter 2 Materials

2.1 Lung cancer datasets

2.2 Databases

2.3 Program environment and public/ commercial tools

Chapter 3 Methods

3.1 Database construction

3.2 Single gene analysis

3.3 Pathway analysis

¦

¦

3.4 Network analysis

^

`

^ `

¦

^ `

3.5 Results demonstration

Chapter 4 Results

4.1 Database

4.2 Single gene analysis

4.3 Pathway analysis

4.4 Network analysis - within pathways

4.5 Result demonstration and comparison