୯ҥᆵεᏢႝᐒၗૻᏢଣғᙴႝηᆶၗૻᏢࣴز܌
ᅺγፕЎ
Graduate Institute of Biomedical Electronics and Bioinformatics College of Electrical Engineering and Computer Science
National Taiwan University Master Thesis
ճҔςޕ୷ӢሀᐒڋϷೈқ፦ҬϕբҔკ!
ٰ໒วཥޑ୷Ӣϕ৩!
Integrate Pathway Information and Protein Interaction Network to Explore Possible Interactions Between Genes
શλ఼
Siao-Han Wong
ࡰᏤ௲Ǻಷᘑӹ റγ AdvisorǺDr. Eric Y. Chuang
ύ҇୯ΐΜΖԃΎД July, 2009
II
ᖴᇞ
! ҁጇፕЎᆶࣴزޑֹԋǴһеΑӭΓޑႴᓰᆶᔅշǶ२Ӄाགᖴޑࢂᅺγ
ࡰᏤ௲ಷᘑӹԴৣ๏ϒךᐒуΕ೭ঁёངޑၠሦୱࣴزი໗Ǵӧ೭ٿԃ໔ ගٮᓬؼޑᕉნᆶᙦ؊ޑࣴزၗྍǵᙖӚԄТϩ૽ግሦךΕғނၗૻሦ ୱᆶ໒ࣚǴ٠ததჹךॺޑՉ٣БӛගрࡌᆶᐋҥڂጄǶӆٰाགᖴᒘߝ ӄԴৣᆶጰۏᏌԴৣӧѳВޑኧᏵϩᆶፕЎϣঅ҅ޑٌ༇ࡰᏤǶ!
! ќѦाձགᖴԾֻᏢߏவፕЎᚒҞޑ௴ᆾǵϣፕᆶၡޑሦǴ٠
όᘐӦעךவФفӾ္ඣрٰǴዴࢂٌधϩޑৡ٣ǼགᖴᏢߏۆٵ։ǵηர҉
ᇻόӧךॺගрభᡉୢᚒਔያଆ࣭ᓐǴᕴࢂӦӣเᆶࡰᏤǴ٠ЪӧፕЎቪ բගрࡌᆶڐշঅׯǹགᖴྷᔇᏢߏऐЈвಒӦ௲Ꮴǵҭόਔࣁךॺٰ॥
፪ޑقᇟϷ҅ӛޑΚໆǶགᖴᏢߏۆБᛄᆶॣذǵதόᜏٌമӦࣁεৎुᖼϱᓓ
܈ᗺЈǴЪ๏ϒךॺᆂЈޑႴᓰᆶණѲǹགᖴॕңᏢۆǴவჴᡍ࠻а
ٰޑ service ሦᆶၡӭӭЈޑᡍϩ٦ǴᡣךૈճӦᑼΕ೭ঁᕉნ ᆶ೭ঁሦୱǴ٠ЪόਔӦ๏ךॺྕធᆶᅈᗺޑઢǶགᖴշྼǵ܍ਤޑႴ ᓰک༇Ӧᗯ१Ǵᡣךӧჴᡍ࠻ޑϺкᅈᡋ഻ᆶႫىགǹགᖴёངΞԖ፪ ޑᏢۂऩǵᄨǵݒᑉ೭ԃٰࣁךॺϩΑӭჴᡍ࠻ޑ٣୍ǵ٬ךॺள аӄΚྗഢፕЎᆶࣴزǶགᖴٿՏӕᏢࡌᆶࡌᗶǴӧ೭ঁჴᡍ࠻ӕҒӅधΑٿ ԃǴவ໒ۈᝢฝჴᡍ࠻ޑࢲǵঅፐǵᏢಞϩǵௗπբаϷനࡕଆዖၸஒ ፕЎԋޑٿঁӭДᇴၮޑВηǶགᖴғᙴႝၗ܌ޑৣߏǵ܌ᒤΓӧՉࡹ
٣୍ޑᔅշǴаϷҁۛ߈ΟΜՏӕᏢٿԃύޑןᆶྣ៝Ƕགᖴךيᜐޑ
Տܻ϶Ǵૈᇡգॺ٠ᏱԖգॺޑഉՔࢂΜϩ۩ၮҭࡐ۩ᅽޑ٣Ƕ!
! നࡕǴགᖴךޑৎΓॺޔࢂךᆒઓޑЍࢊǴ๏ϒךόᘐ۳ޑΚǶؒ
ԖգॺǴ൩ؒԖϞВᆶܴВޑךǶ!
λ఼!2009.7.29!୷ӢᡏύЈ
III
ᄔा
߈ԃٰǴ୷ӢТ (microarray) ޑኧᏵϩςவճҔीϩޑ БݤǴᙯӛуΕ׳ӭςޕ୷Ӣфૈ܄ຏှٰڐշϩǶғނϸᔈၡ৩ϩݤ (pathway analysis) ଞჹၗύঁςۓကӳޑϸᔈၡ৩ (pathway)Ǵෳໆܭჴ ᡍीύځࢂցӸӧૻ৲ਡᑗਡለ (mRNA) ቫભޑᡉᡂ౦ǶԶќѦᅿᆛ๎
კϩݤ (network analysis) ߾ԑӧཛྷ൨ঁҗ୷Ӣ໔܌ԖёૈӸӧޑҬϕբҔ
܌ಔԋޑკ (global interaction network) Ǵ࣮ځύࢂցԖᡉׯᡂޑηკ
(subnetwork) Ƕ೭ٿᅿϩБݤӚᏰയǴЪёఈံىჹБޑલᗺǺޣϩς
ޕޑϸᔈၡ৩Ǵࡺஒֽ่݀ज़ӧዕޕޑғނޕύǹࡕޣᗨᙒΑӭёૈޑ୷ Ӣϕ৩Ǵՠޔௗவ global interaction network ύཛྷ൨ܰפډคݤаԖғނ ޕևځठ܄ғނཀကޑ subnetworkǶ
ҁЎගрঁૈ୷ܭ pathway analysis ՠ׳่ӝ network analysis ᓬᗺޑ ϩБԄٰׯ๓ԖޑϩБԄǴॶளගޑࢂԜ่ӝཷۺӧҞޑϩሦୱ ύ٠όதـǶ ԜБݤ२ӃճҔ Tian et al. วޑ pathway analysis БԄෳۓԖᡉ
ᡂ౦ޑ pathwayǶௗа೭٤ pathway ޑԋբࣁрวᗺǴ௦Ҕ Nacu et al. ޑ network analysisБݤՉঁҞޑᏤӛԄޑ subnetwork ཛྷ൨Ƕ
ҁጇፕЎஒԜБݤᔈҔӧѠεᙴଣޤᕎੰΓޑ୷ӢТኧᏵǶ໒ۈ၂ ӧԖᡂ౦ޑ pathway ύ൨פځനڀж߄܄ޑԋϩ (subnetwork) Ƕ೭ಔኧᏵౢғޑ
subnetworkӧќಔѠчᄪᕴޤᕎੰΓޑኧᏵύҭளډΑևଯࡋठޑ่݀ǶԜ
Ѧךॺଞჹפډޑ subnetwork Չځԋ୷Ӣޑфૈ܄ϩǴวவচҁֹ
pathwayᕭ෧ډ subnetwork ޑၸำύǴᡏޑфૈҗচҁڀԖޑӭኬ܄Ǵϯډ
ۓޑфૈǶ೭སҢΑ၀ pathway ޑࢌϩфૈӧჴᡍύࢂܴᡉӦׯᡂޑǴԶ ೭ኬޑׯᡂளаҔ೭ጇፕЎޑБݤჸǶ׳ǴךॺΑҁБݤ܍ᝩΑ
IV
network analysisԶٰޑᓬᗺǶҥ୷ܭ pathway ޑςޕԋѐཛྷ൨ځёૈԖϕޑᎃ
߈୷ӢǴךॺளډޑ subnetwork ࢂаԜ pathway ࣁрวᗺǴՠх֖ӭ҂ᇡ
ޑҬϕբҔǴ೭ኬޑ่݀ёаڐշࣴزޣჹܭ҂ޕޑϩीჴᡍ׳ుΕޑ
زǶவќѦБय़ٰᇥǴ೭ኬޑ่݀ΨԖ౦ܭ network analysis ޑБԄǴѬૈ
٬ளࣴزёаҥ୷ܭࣴزޣགᑫ፪ޑ pathwayǴ୷ܭςޕޑғނޕѐܗ࣬ᜢ҂
ޕޑёૈ܄Ƕ
ᕴ่܌Ԗϩޑ่݀ǴѬॺவόӕБӛࡰрΑԜϩБݤόӕܭа۳ޑӭ ᓬᗺǺѬёаவᡉׯᡂޑ pathway ύڗрঁനख़ाЪελӝࣴزޑ
subnetworkǵΨёаଞჹࣴزޣགᑫ፪ޑ pathway ܈ۓޑፓᐒڋՉЬᚒԄޑ
ుΕǵԜѦନΑҥ୷ܭচԖғނޕѦǴѬҭڀԖ໒ว୷Ӣ໔ཥޑϕᐒڋ ޑૈΚǶ
ᜢᗖຒ
୷ӢТǵኧᏵϩ!
! !
V
ABSTRACT
Currently the analysis of microarray data had turned into integrating with prior
biological knowledge: pathway analysis interprets transcriptomic data on pathway level
and identified predefined groups of genes with dysregulation; network analysis takes
gene-gene interactions information into consideration and searches for modules
associated to the phenotypes under study. The two analyses have its own advantages
respectively and they complement the weaknesses of each other: pathway analysis
provides little clues to directly explore new biological knowledge and network analysis
usually yields modules including few consistent biological information. In this study an
analytical methodology was developed to integrate current pathway analysis method
with network analysis methods.
Initially, dysregulated pathways are identified by modified pathway analysis
method in Tian et al.. Subsequently, a focus-oriented investigation on dysregulated
pathways are performed by network analysis following the work of Nacu et al., and this
step is using modules within or related to members of the pathways to be further
investigated. Several improvements were made, such as the scoring functions and the
module identification algorithms.
To illustrate the benefits of this methodology, a lung cancer study with 30 paired
VI
cancer and normal tissues was explored. The results derived within dysregulated
pathways were also identified consistently in another public dataset GSE7670.
Furthermore, GO term enrichment analysis was applied to show that the modules have a
specialized functionality than the original pathways. In brief, original large modules
were reduced from the entire pathway to a smaller size of relevant interconnected
members, which are much easier to be manipulated but still remain their biological
information. Moreover, the ability of this methodology to explore novel interactions
related to pathway members were also demonstrated by extending the module search
algorithm beyond the pre-defined pathways. This would not be achieved by traditional
pathway analysis methods, which usually don’t include biomolecular interaction
information. Yet, modules identified in this methodology were based on dysregulated
pathways with specific biological meaning since their members were mainly associated.
In conclusion, these data all indicated the advantages to integrate both pathway and
network information during microarray analysis: to uncover manageable size of
molecular interaction networks important for pathway dysregulation, to focus on
interested pathways, functions or even specific regulatory events, and to possess the
potential of performing exploratory researches on mechanisms that are not yet well
understood. Undoubtedly, this concept could be extensively applied to other array
VII
experiments of similar design regardless of the disease under study.
Key words
microarray, pathway analysis, network analysis, module!
!
CONTENTS
α၂ہቩۓਜ ... I
ᖴᇞ ... II ᄔा ... III ABSTRACT ... V
Chapter 1 Introduction ... 1
1.1 Lung cancer ... 1
1.2 Microarray ... 2
1.3 Data analysis ... 2
1.3.1 Single gene analysis ... 3
1.3.2 Biological knowledge database ... 4
1.3.3 Pathway analysis ... 8
1.3.4 Network analysis ... 11
1.3.5 Methodology in this work ... 13
Chapter 2 Materials ... 17
2.1 Lung cancer datasets ... 17
2.2 Databases ... 17
2.3 Program environment and public/ commercial tools... 19
Chapter 3 Methods ... 22
3.1 Database construction ... 22
3.2 Single gene analysis ... 24
3.3 Pathway analysis ... 25
3.4 Network analysis ... 27
3.5 Results demonstration ... 30
Chapter 4 Results ... 31
4.1 Database ... 31
4.2 Single gene analysis ... 32
4.3 Pathway analysis ... 34
4.4 Network analysis - within pathways ... 36
4.5 Result demonstration and comparison ... 39
4.5.1 Mapping the main component and leading edge subset on KEGG pathways ... 39
4.5.2 GO term enrichment analysis ... 41
4.6 Network analysis – protruding pathways ... 48
Chapter 5 Discussion ... 50
6.1 Input matrix creation ... 50
6.2 Pathway analysis ... 51
6.3 Network analysis ... 55
6.4 Future perspectives ... 60
Chapter 6 Conclusions ... 63
REFERENCES ... 66
APPENDIX ... 70
LIST OF FIGURES
Figure 3-1. Flow chart of this methodology ... 22Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here ... 22
Figure 4-1. Overlaps between array, pathway and interaction data ... 31
Figure 4-3. t-score distribution of probe sets before and after representative filtering . 33 Figure 4-4. Histograms of unadjusted p-values for 560 pathways in database ... 35
Figure 4-5. Main components extracted from two pathways ... 38
Figure 4-6. Map the main component and leading edge subset to original pathway .... 40
Figure 4-7. GO terms (cellular component 5) enriched in focal adhesion pathway ... 44
Figure 4-8. GO terms (molecular function 1-5) enriched in focal adhesion pathway ... 45
Figure 4-9. GO terms (biological process 5) enriched in cell cycle pathway ... 46
Figure 4-10. GO terms (cellular component 5) enriched in cell cycle pathway ... 47
Figure 4-11. Extend subnetwork search to the global interaction network ... 49
Figure A-1. Top cancer killers in Taiwan in 2008 ... 70
Figure A-2. The most significant subnetwork identified by GXNA ... 71
Figure A-3. Main components obtained by f2 scoring method ... 76
Figure A-4. GO term hierarchy for cluster C in Fig. 4-6 ... 79
Figure A-5. GO term hierarchy for terms involved in Fig. 4-7 ... 80
Figure A-6. GO term hierarchy for cluster A and term B,C in Fig. 4-8 ... 81
Figure A-7. GO term hierarchy for cluster C,D,E in Fig. 4-9 ... 82
Figure A-8. Histogram of pathway sizes in our database ... 83
LIST OF TABLES
Table 1-1. Definition of gene sets and exemplary databases ... 5Table 4-1. Statistics of current database ... 31
Table 4-2. Network analysis procedure and the consistency with GSE7670 ... 36
Table A-1. Statistics of t-scores in two datasets ... 70
Table A-2. Parameters set during pathway analysis ... 72
Table A-3. Significant pathways identified by different methods ... 72
Table A-4. Lists of significant pathways identified by f0 and f1 scoring function... 73
Table A-5. Annotations of genes present in Figure 4-5 ... 74
Table A-6. Annotations of genes present in Figure A-3 ... 77
Table A-7. Overlap of main components and the leading edge subset ... 78
Table A-8. Annotations of genes present in Figure 4-11 ... 83
1
Chapter 1 Introduction
1.1 Lung cancer
According to figures released by Department of Health in June 2009, cancer has
topped the major 10 causes of death in Taiwan for 27 consecutive years. The statistics
showed that, in 2008, malignant tumors were responsible for 27.3 percent of all deaths,
among which the proportions of top cancer killers are displayed in Figure A-1.
Moreover, when gender is taken into consideration, lung cancer, in particular, has
occupied the leading cause of female cancer mortality since it first overtook cervical
cancer in 1986 [1].
Noteworthily, despite that tobacco smoke is the major risk factor for lung cancer,
many patients, especially female ones, are never smokers. This situation is not unusual
in countries other than Taiwan, as it had already been discussed in the review paper [2].
It was suggested in the article that pathways of carcinogenesis for lung cancer in never
smokers and tobacco-associated lung cancer are not exactly the same due to the clinical
and biological differences observed in patients of the two types. However, specific
mechanisms are still under investigation, which in further requires good experimental
techniques and analytical methodologies that help researchers focus on clues to
carcinogenesis process with efficiency.
2
1.2 Microarray
Microarray is a powerful tool to screen tens of thousands of genes at one time,
giving a semi-quantitative sketch of genome-wide mRNA expression levels in cells,
which greatly facilitates and accelerates biological studies. Since its invention in the
1990s [3], gradually improved technologies had lead to more affordable commercial
arrays with stable quality, making microarray widely applied in various biomedical
researches and beyond question cancer-related studies are no exceptions.
However, unlike array experiments that could be conducted with acceptable quality
as long as protocols are followed adequately, there has always been room for
bioinformaticians to develop analytical methods that better extract biological insights
from array data.
Following this section some methods will be reviewed and in the end of this
chapter, the idea about methodology developed here will be introduced.
1.3 Data analysis
Starting from introducing regularly employed single gene analysis that
concentrates on independent statistical analysis of individual genes, what followed
subsequently are some advanced data analysis methods that take additional biological
information into account. Each of these methods detects biological processes being
3
dysregulated from different entry points, however, major concerns of biologists can yet
be satisfied simultaneously: to uncover manageable size of molecular interaction
networks important for dysregulation, to focus on interested pathways, functions or
even specific regulatory events, and to possess the potential of performing exploratory
researches on mechanisms that are not yet well understood. In the end of this chapter, an
idea about combining advantages of current methods will be introduced in attempt to
fulfill these requirements.
1.3.1 Single gene analysis
Whatever array platform the samples are assayed on, after image scanning and
preprocessing steps including background correction and normalization within/between
arrays, conventional analytic procedure starts with evaluation of individual genes: all
genes are ordered by the extent of their associations with phenotypes, with a significant
degree of association suggesting the gene’s being differentially expressed at mRNA
level and worthwhile to undergo further biological validations.
However, varied according to topics under study, this approach sometimes ends up
with a long list of significant genes even after multiple hypothesis adjustments [4, 5]
that are usually required when testing large amount of hypotheses simultaneously. This
makes follow-up validations laborious or even infeasible. Such obstacle could not be
4
surmounted by mere selection from the significant gene list, where the decision of genes
deserving further investigations depends on researchers’ expertise due to multiple roles
a gene might play.
To deal with this problem, analysis shall not be confined to expression data itself
anymore. As has been well-known to all, cellular functions are not implemented by
individual genes independently, rather, they are accomplished by a group of genes
acting together to perform cellular tasks. Thus it is anticipated that a more consistent
biological scene can be revealed by methods that not only incorporate transcriptomic
data but also consider prior biological knowledge such as functions in common or
relationships between genes/gene products during analysis. Such information was
derived from previous biological experiments and was deposited in various public
databases as described below.
1.3.2 Biological knowledge database
It gains more insight into the interpretation of transcriptomic data if analysis could
be integrated with functional annotations or other omic data from different levels. To
see how this can be realized, in this section and the next we will introduce what kind of
information can be utilized and how they can be integrated.
5 A. Gene set databases
Generally speaking, a gene set is a group of functionally related genes. Specific
instances for gene set definitions and corresponding databases are tabulated in Table
1-1.
Since members in a gene set tend to function in coordination, several methods are
developed to analyze genes in groups and identify gene sets instead of genes that are
significantly regulated.
Table 1-1. Definition of gene sets and exemplary databases
Common feature Databases
metabolic / signaling pathway member KEGG, Biocarta,
GenMapp, MSigDB (c2) chromosomal location / cytogenetic band MSigDB (c1)
target of microRNA / transcription factor MSigDB (c3)
biological process participants GO - Biological Process subcellular location/macromolecular complex GO - Cellular Component perform molecular function GO - Molecular Function
Ɂ Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], BioCarta [7], GenMapp [8], MsigDB [9], Gene Ontology (GO) [10]
In this study, we focus only on metabolic and signaling pathways that are
essentially an abstraction of the information flow throw physical interaction network in
response to a drug, nutrients or external stimuli.
6 B. Interaction databases
An interaction knowledge base contains genetic [11] and physical interactions
between genes/gene products of various types. They can be related to physical binding
(protein complex), protein modification (methylation, (de)phosphorylation), promoter
binding (transcriptional regulation) or chemical reaction (activation/inhibition).
Typically, they are stored as binary interactions and in the form of
gene(product)1-relation-gene(product)2 triplets, from which a global biomolecular
network visualizing this knowledge base can be easily constructed by representing
genes/gene products as vertices (nodes) and interaction relationships as edges (directed
or undirected).
Although these relationships have been extensively studied in small-scaled
experiments using synthetic lethality or other biochemical and biophysical techniques,
they are recorded in scientific literatures and cannot be directly utilized by
computational scientists unless undergone information extraction into machine-readable
format. Currently, this can be done by manual curations or by text mining techniques
such as applying natural language processing algorithm [12].
The construction of such databases was triggered actually by the expansion of
high-throughput techniques in the last ten years, which includes: microarray for
7
synexpression of genes; yeast two-hybrid (Y2H) technology, tandem affinity
purification coupled with mass spectrometry (TAP-MS), high-throughput
mass-spectrometric protein complex identification (HMS-PCI) and
co-immunoprecipitation (Co-IP) for protein-protein interactions; chromatin
immunoprecipitation coupled with DNA microarray (ChIP-chip) or with paired-end
ditag (ChIP-PET), DNA adenine methylase identification (DamID) and yeast
one-hybrid assays for protein-DNA interactions. The boosted amount of formatted data
from aforementioned techniques had received wide attention of bioinformaticians and
enabled the development of this field. At present, these data account for more than 70%
of current database content. Nonetheless, scientific literatures remain to be the most
important and reliable source since a high percentage of high-throughput data were
estimated to be spurious [13, 14]. On the other hand, it is also believed that these so
called false positives are in fact true physical interactions, yet might not be biologically
meaningful [15].
To sum up, a pool of all known biomolecular interactions between genes/gene
products are accommodated in public interaction databases such as BIND [16], HPRD
[17], MINT [18] and commercial knowledge bases held by Ingenuity Systems
(Ingenuity Pathway Analysis, IPA) and GeneGo (Metacore).
8
At last, a clear difference between network and pathway can be elucidated by this
paragraph [15] :
“A network represents a static image of all possible physical and/or regulatory
interactions between biological entities, while a pathway represents how the
information propagates through the network. Because information propagation is a
directional process, a pathway must have entry nodes where the information flow starts
and terminal points where the information flow ends.”
Several methods evolved with the aid of these knowledge bases and they will be
reviewed in the next two sections. These methods differ mainly in the databases
incorporated, however, what as well cannot be left out of consideration are the statistical
methodology they utilized and the extent they exploit transcriptomic data.
1.3.3 Pathway analysis
Pathway analysis is actually gene set analysis only to focus on pathways. It is
expected to not only find pathways with significant differential expression but also
detect consistent yet subtle expression changes among members of a pathway. A review
on various pathway analysis methods evaluating the involvement of pathways in
different phenotypes under study is available in [19, 20].
A simplest procedure is to assess the significance of overlap between preselected
9
genes (differentially expressed genes) and predefined annotation groups (pathways)
using Fisher Exact test (hypergeometric distribution) or Chi-squared test, which is then
followed by adequate multiple hypothesis adjustments. This over-representation method
is intuitive, easy to implement and computationally efficient, thus it dominates current
commercial and public software, such as IPA [21], Metacore [22] and DAVID [23].
An alternative approach needs no prior filtering of genes. It assigns a score to each
gene set and assesses p-value by re-sampling procedure, which is to compare the score
with its null distribution. BRB-ArrayTools [24] use the LS statistic and
Kolmogorov-Smirnov (K-S) statistic to test if the single-gene p-values in a gene set are
of a uniform distribution. In gene set enrichment analysis (GSEA) [25, 26], an
enrichment score is obtained by considering the distribution of pathway genes in the
entire list of genes, which in spirit is a weighted K-S statistic. Tian et al. [27] designed a
statistical framework to determine perturbed pathways. In their work the overall
objective is to “test whether a group of genes has a coordinated association with a
phenotype of interest.” After each gene set is assigned a score by averaging the test
statistics of its member genes, p-values regarding two different hypotheses are
estimated by permuting class labels and gene orders. Eventually, gene sets with
significant p-values under both hypotheses are considered differentially expressed
10 across phenotypes.
The latter approach is more biologically reasonable than over-representation
method since it reserve expression information and does not depend on an artificial
filtering of genes, in which the results may depend strongly on the cutoff chosen to
make the significant gene list [28].
However, most pathway analysis methods suffer from some weaknesses. First, due
to the fact that only a limited number of well-studied genes are classified into pathways,
the remained many genes with unknown functions are left out of considerations. Second,
most pathway annotations contain only labels of pathway members and with no
information about the interplay between them, therefore the results obtained from these
methods could not help to give direct understanding of cellular processes at molecular
level. Also, pathway is actually a dynamic model and the component activated is
specific to conditions under study, which is usually not emphasized in most methods,
however, GSEA did specify a list of core members, named leading edge subset, that are
the main contributors of the pathway’s enrichment score.
In all, pathway analysis can successfully interpret data at pathway level but fails to
elucidate the interplay between members within and provide no information about genes
not involved in known pathways.
11 1.3.4 Network analysis
Extensive works had been done seeking to characterize the design principles of
biomolecular network structure using graph theory. An interesting finding shows that its
connectivity distribution follows the typical power-law distribution for scale-free
network and it shows small-world properties [29] in terms of network diameter and
clustering. It is believed that this feature, which indicates the existence of hubs, enables
the biological network to be robust against occasionally removal of arbitrary network
elements during evolution process [30].
Another subject of active research is the identification of relevant modules or
subnetworks in the global network. Notably the interactome data cannot alone complete
this task due to the inconsistent conditions of its content. That is, these interactions are
usually temporal, spatial or dependent on conditions and tissue types. Therefore,
additional annotations providing condition-specific information is required, such as
combining microarray data to identify connected sets of nodes based on their coherent
expression patterns at mRNA level. In this regard, several methods of identifying
responsive modules are reviewed in [31].
Ideker et al. [32] are among the first groups to extract active subnetworks based on
both interactome and transcriptomic data. They searched the global network for
12
high-scoring subnetworks via simulated annealing algorithm, where the score is in spirit
an aggregation of z-scores measuring differential expression of individual genes in the
subnetwork and calibrated against the null distribution generated by randomly sampled
groups of genes. Interestingly, the result contained many examples of genes individually
with low score but are required to connect together several high-scoring genes. Genes
with such character agree with the behavior of some transcription factors (TFs) and will
be referred to as “key nodes” in this article. Despite the exciting observation, the
time-consuming nature confines its application on large interaction network.
Nacu et al. [33] developed a method GXNA that expands subnetworks from seed
nodes using a greedy approach and then identifies those with significantly high scores.
One of their contributions lies in the design of scoring functions that correct for biases
due to subnetwork size if necessary. They are generally divided into two scoring
schemes -͂T and T͂. What subsequent to the evaluation of subnetwork score is the
estimation of p-value that assesses its significance against the null distribution derived
from scores of subnetworks under different phenotype permutations. Eventually, these
p-values are adjusted for family-wise error rate and used to select relevant subnetwork.
GXNA is fast and it focuses on small modules differentially expressed between
phenotypes, however, it allows no such key nodes since the greedy approach is adopted.
13
In general, pathway can be viewed as the assembly of causal sequences of
molecular events and with all building blocks available in the biomolecular network.
Ideally it seems promising to infer new pathways directly by network analysis, but in
reality such pathway inference are complex and less validated due to the noise and
incompleteness of current biological networks, especially when coupling with the
small-world property of its topology. Additionally, the identified module is prone to be
associated with multiple canonical pathways, which makes it hard to reveal a unifying
scene and interpret in a fashion related to focus under study, as in the example of Figure
A-2.
1.3.5 Methodology in this work
In this study we attempt to develop a methodology that goes a step further than
current pathway analysis by bringing in the advantages of network analysis. Practically
speaking, it can be achieved by processing microarray data through pathway and
network analyses in series, however, it is worthwhile mentioning that the incorporation
of both pathway and network knowledge into microarray data analysis is yet to be
widely realized. Based on results from pathway analysis, the methodology here aims to
enhance researchers’ understanding by stepping from pathway level into molecular level,
with its main objectives specified below:
14
1. Extract module most relevant to the pathway’s dysregulation.
It is without doubt that pathway is in fact a dynamic model and the members
perturbed differs between conditions under study. Although GSEA specifies a leading
edge subset and suggests their being responsible for pathway’s dysregulation, the subset
is of limited biological meaning since it provides no information for elucidating the
roles they play. To put it simply, the leading edge subset is statistically meaningful
more than biologically meaningful. Therefore, methodology here attempt to
complement pathway analysis in this regard by making use of biomolecular networks.
After dysregulated pathways are identified, concept of network analysis is then
imposed to extract modules most relevant to dysregulation of interested pathways. It is
mainly due to these sequential events that make the pathways deemed dysregulated.
On the other hand, the module obtained here differs from that yielded by other
network analyses in its ability to interpret microarray data at known-pathway level and
to investigate interested pathways in greater details.
2. Other focus-oriented strategies enabling exploratory survey on pathways of interest.
Current understanding of biological functions at the pathway level is far from
being thorough. In one way, some of known pathways are actually incomplete. In the
other, although recorded in a separate fashion, they are not truly isolated at the level of
15
global interaction network. In fact they should not be so as in the example of
cross-pathway inhibition, which allows an activated pathway to refocus cellular
resources to respond to stimuli it received by competing against other pathways. As a
matter of fact, a considerable degree of cross-talks between pathways are revealed by
various small-scaled biological studies.
Using the methodology here, a task-oriented investigation is promised by
exploiting the valuable information imbedded in biomolecular networks. These yet
well-understood interactions possess great potential to point an investigator to either
missing pathway components or cross-talks between pathways and help in the design of
appropriate experiments for identifying them.
In previous study physical interaction data were integrated with genetic interaction
information to uncover the mechanisms underneath [34], which can further be used to
find cross-talks between two pathways where the two interacting genes reside.
The interaction data was also coupled with pathway information to provide clues
of cross-talks between pathways [35], however, it is done by merely calculating whether
physical connections between two pathways are higher than random using Fisher Exact
Test.
16
In two successive works of Ideker et al. [36, 37], subnetwork and condition-
responsive genes (CORGs) were defined in dysregulated pathway as the part delivering
optimal discriminative power for the disease phenotype using T͂ evaluation.
Nonetheless, they were used as prediction markers rather than to discuss the underlying
mechanisms.
Before going to the next chapter, the methodology framework is summarized in
four parts.
1. Construction of database for storage of molecular interaction network, canonical
pathway collections and gene annotations.
2. Statistical analysis of microarray data at pathway level: in this part Tian’s algorithm
is adopted yet with slight but crucial modification.
3. Algorithms allow navigating the global network on the region of interested
pathways: the scoring function is based on GXNA and modified to tolerate key
nodes. In addition, a merging step is developed to complement the scoring function.
4. Visualize the module by laying genes/gene products according to their subcellular
localizations.
17
Chapter 2 Materials
2.1 Lung cancer datasets
1. NTUH Lung Cancer Dataset
This dataset is created by Bioinformatics and Biostatistics Core, National Taiwan
University Research Center for Medical Excellence - Division of Genomic Medicine.
Matched normal and tumor samples of 31 female patients with non-small cell lung
cancer (NSCLC) in National Taiwan University Hospital (NTUH) are collected for
DNA microarray analysis using Affymetrix Human Genome U133 Plus 2.0 Array.
2. Public Dataset GSE7670 (TVGH Lung Cancer dataset)
A public dataset [38] created by Taipei Veterans General Hospital (TVGH) and
deposited in NCBI’s Gene Expression Omnibus (GEO) [39] (also available at EBI’s
ArrayExpress: www.ebi.ac.uk/arrayexpress) under GEO series accession number
GSE7670 is downloaded as CEL files. Of all the 66 Affymetrix Human Genome
U133A Array data, those from 21 female NSCLC patients with paired normal-tumor
arrays are used to create this dataset.
2.2 Databases
1. Pathways
Predefined gene sets performing tasks of metabolic functions or signaling
18
transductions are recorded in several public databases.
The Molecular Signature Database (MSigDB) v2.5 maintained by Broad Institute
collects curated gene sets information from several online databases or literatures and
records them with a unified format. Its “Canonical Pathway (CP) collection” in
“Functional Sets (C2) category” [9] is the main source of pathway information in this
work. The collection of 639 gene sets is released in a file with filename extension “gmt”.
In the following work, gene sets are simplified into pathways.
2. Protein interaction network
Protein-protein interactions (PPI) detected by high-throughput methods are
recorded in several PPI databases. The NCBI’s Entrez Gene database containing curated
interaction information from BIND [16], BioGRID [40], EcoCyc [41] and HPRD [17] is
utilized as the main source to construct protein interaction network in this work.
3. Target genes of probe sets on microarray
The information of genes targeted by probe sets designed on Affymetrix GeneChip
arrays comes from annotation files in Affymetrix website (www.affymetrix.com).
4. Gene annotation
Unique gene identifiers (Entrez ID or HUGO gene symbol) and historical aliases
of genes are retrieved from NCBI’s Entrez Gene database.
19
2.3 Program environment and public/ commercial tools
With database built under MySQL, this methodology was realized with the help of
Matlab®. Besides, some public or commercial tools are used to process the dataset for
different purposes. Following are some brief descriptions about them.
1. Partek® Genomics Suite [42]
It is a commercial software that enables various statistical analysis of microarray
data. In the work here it is used simply to complete preprocessing steps of microarray
data, which summarizes expression value for each probe set and applied normalization
algorithm to remove potential systematic biases.
2. Gene set enrichment analysis (GSEA) [25]
GSEA evaluates the probability a gene set is differentially expressed across
phenotypes and defines a leading-edge subset comprising the core members activated in
the gene set.
At first, genes are ordered by their correlation between expression values and
phenotype classes, then an enrichment score (ES), corresponding to a weighted
Kolmogorov-Smirno-like statistic, is calculated for the gene set. A p-value representing
significance level of the ES score is determined against null distribution of ES estimated
by permuting class labels. After p-value for each gene sets is obtained, false discovery
20
rate (FDR) method is used to adjust for multiple hypothesis testing.
3. Database for Annotation, Visualization and Integrated Discovery (DAVID) [23]
DAVID uses an EASE score, a modified Fisher Exact p-value, to measure the
enrichment of gene sets in the gene list specified by user. Furthermore, to reduce the
redundant nature of annotations that might dilute the focus of the result, which is
especially inevitable when associating with Gene Ontology (GO) terms, DAVID
provides the option to classify significantly associated gene sets into different clusters
and order the clusters according to their significance. According to manual on the
website, it is achieved by integrating the same techniques of Kappa statistics to measure
the degree of the common genes between two gene sets, and fuzzy heuristic clustering
to classify the groups of similar annotations according to kappa values.
4. Cytoscape [43]
Cytoscape is a JAVA application which provides basic functionality to layout and
query networks, overlay nodes with expression data, or link genes/gene products to
databases of functional annotation. Notably, it is featured in its extensibility through a
straightforward plug-in architecture, which enables additional computational analyses to
be incorporated. In this study, a plug-in - Cerebral v2.0 [44] is used to layout the
network according to the subcellular location of each node.
21
5. Kyoto Encyclopedia of Genes and Genomes (KEGG) - Color Objects in KEGG
Pathways [45]
It is an on-line tool that provides a personalized pathway map by allowing the
assignment of different colors to font, border or background of each particular node.
6. European Bioinformatics Institute (EBI) - QuickGO [46]
QuickGO is a web-based browser that enables the extraction of branches from the
entire hierarchy of Gene Ontology according to a list of specified GO terms.
22
Chapter 3 Methods
Figure 3-1 illustrates the flow chart of this methodology and in the subsequent
sections we will describe each step in detail.
Figure 3-1. Flow chart of methodology in this work.
PPI is the abbreviation of protein-protein interaction data.
3.1 Database construction
Figure 3-2. Entity Relationship Diagram (ERD) for database constructed here.
In this ERD rectangles stand for entities, ellipses for attributes and diamonds
for relationships.
23
As the first step of integrating genomic data at different levels, a relational
database is required in order to bridge between information from various sources, to
speed up data retrieval and to correct for ambiguously recorded data. The structure of
database constructed here is elucidated in Figure 3-2 as a simplified entity relationship
diagram, where genes, pathways and array probes are considered as different entity
types.
While constructing the database, some records with ambiguous information should
be corrected:
1. Importing gene sets from MSigDB.
The gmt file records gene set members in the form of “synonym”, which is alias
instead of official name. The task here is to convert each of them into corresponding
unique identifier: “gene_id”. This is done by comparing them to “official gene symbol”,
or to “synonym” of genes in the case when there were no “official gene symbols”
matched.
2. Importing target genes of each probe set on microarray.
Some “gene_id” recorded in NetAffx annotation files had been changed or
discontinued. New “gene_id” should be updated based on information from NCBI’s
Entrez database.
24
3.2 Single gene analysis
1. Probe sets filtering
As one might note in chapter 2, the two datasets were assayed on different versions
of Affymetrix GeneChip array. In fact, U133 Plus 2.0 comprises probe sets in both
U133A and U133B [47], thus in following works only those probe sets common in both
versions are utilized. Note that this step could be skipped when datasets to be compared
are of the same version.
2. Summarizing expression value for each gene
The analysis of Affymetrix array data starts with CEL files recording fluorescence
signals at probe level, from which probe-set level intensities are derived using robust
multi-chip average (RMA) method [48]. In this method probe level data undergo
background correction, quantile normalization [49] and median-polish summarization
[50]. The RMA process is completed under the commercial software Partek® [42] and
the resulting values are log-transformed expression values.
3. Hypothesis testing
Hypothesis testing methods are used to measure the degree of association between
response/covariate (either numerical or categorical factor, e.g. phenotypes) and random
variables (expression levels of probe sets/genes). In the simple but most common case,
25
given a covariate, its association with each gene is estimated by univariate hypothesis
testing method such as two-sample t test or Mann-Whitney statistics for binary
phenotype, F-statistic for polytomous phenotype, to name but a few.
Here two-sample paired t test is applied and the concluded p-value functions as an
index of degree of differential expression in terms of a probe set between phenotypes.
4. Select representative probe set for each gene
Among all probe sets targeting the same gene, the one with the smallest p-value is
selected to represent their target gene and the rest removed.
Until this step, an input matrix with logged gene expression values in rows and
arrays in columns is generated.
3.3 Pathway analysis
Both Tian method and modified Tian method are applied on the datasets.
Conceptually the procedure of pathway analysis starts with evaluating a pathway score
by employing a scoring function, which is the major difference between the two
methods. The score is then to be normalized and assigned with p-value according to its
null distribution that could be generated in two different ways of permutation. Finally
pathways are ordered by the addition of rankings under two permutation types. The
detailed procedures of both methods are described below.
26 1. Scoring function
For each pathway ୫ ൌ ൛ଵǡ ǥ ǡ ୩ౣൟ , its score is calculated by either
0
1
( ) 1
m
i
k
m g
m i
f S T
k
¦
(Tian’s method)or 1
1
( ) 1
m
i
k
m g
m i
f S T
k
¦
(modified Tian’s method), where km is the size of Sm andgi
T the t statistic of gene i. To note, the latter scoring
function comes from equation 2 in Nacu et al.[33].
2. Significant level of the score
A one/two-sided p-value representing significance of a pathway is estimated from
the f1/f0 score’s null distribution that could be generated in different ways depending on
the null hypothesis to be tested. Tian et al. [27] proposed two ways to choose from:
either to test if genes inside a set show significantly higher associations with phenotypes
than that outside a set, or to test if a set does contain genes differentially expressed
between phenotypes. The former is achieved by permuting members of a pathway, and
the latter by randomly shuffling phenotype labels on each paired samples.
Since all pathways are tested simultaneously, multiple testing problems can no
longer be ignored. The p-values are either adjusted by Bonferroni method [51] or
converted to q values [52].
3. Pathway score normalization
27
To overcome the hurdle that pathway scores are not able to be compared directly
due to its dependency on the size and unique correlation structure of each pathway, Tian
proposed that normalization of the observed scores can be achieved by replacing them
with their quantiles. Here f0 and f1 scores are normalized in the same principle.
4. Ranking the pathways
Under each permutation procedure, an adjusted p-value and a normalized score are
obtained and from which a ranking is summarized. With descending importance, all
pathways are eventually ordered by the addition of two rankings derived from separate
permutation procedures.
3.4 Network analysis
Based on the significant pathway identified, network analysis tries to investigate
the pathway, looking for connected subgraphs that are either essential for differential
expression of the pathway or related to genes outside the defined pathway boundary. In
short, a candidate subnetwork is generated from each root and then being merged into
several main components. This algorithm follows Ideker’s idea and some methods of
GXNA, and will be described dividedly in six steps.
1. Starting points and the search space
In the beginning, a background interaction network is constructed and afterward
28
referred to as the “search space”, within which the algorithm searches for main
components. As the objective here is to show the applicability of this approach before
any further explorative investigations, the search space is at first confined to genes
within the dysregulated pathways. After then a version of searching under global
interaction network was demonstrated.
Suppose there are N genes (nodes) in the pathway, each of them will be considered
as a root and thus N candidate subnetwork would be generated.
2. Extension
Starting from a root node, a candidate subnetwork is generated by an assigned
number of extensions within the search space. In each time of extension, the node
yielding maximal score of the new subnetwork is incorporated from those directly
neighboring the current subnetwork.
3. Scoring
Two ways of evaluating current subnetwork are adopted. One is identical to
f S1 , only when applied here, S represents a subnetwork instead of a pathway. The other, f S2
, is similar but with slight modifications in order to increase the tolerance of key nodes mentioned in chapter 1.
Considering a subnetwork of size k, the algorithm first rearrange its members in
29
ascending order of p-values, so that for S
^
g1,!,gk`
,Tgi Tgi1d for all i
^ `
2..k ,where
gi
T is the t statistic for gene i. The score of S is then obtained by
2
1
( ) 1
i
k m
g i
f S T
k m
¦
,where the setting of m is flexible to users (default “1”). The equation suggests that the m
members with least contribution would be left aside from the scoring function.
Note the default T scoring function in GXNA (eq.6) [33] is not applied here, for
more details please refer to chapter 6.
4. Stopping criterion
There are two criteria in GXNA for stopping the extension of a subnetwork. One is
when predefined size is met and the other is when the new subnetwork score does not
surpass the current one. The former criterion is used here due to the same reason that f2
score is dependent on subnetwork size and thus not comparable to each other.
Nonetheless, to make up for artificial restrictions in fixed-size search, a merging step is
developed to produce subgraphs of different sizes.
5. Merging
Until this step, a candidate pool has been formed by the N candidate subnetworks
derived from the N roots. The merging process is a decisive step. It ends up with at most
h main components as final results where h is a user-specified parameter.
30
For each i
^ `
1..h the merging process starts a main component Ci with an empty set Ø. Step by step, the algorithm merges it with the highest-scored candidatesubnetwork sharing overlap with it. Ever since a candidate subnetwork has been chosen
from the candidate pool and merged with Ci, it is excluded from the pool. The merging
process stops when certain criteria are met, which varies depending on the user’s
concern. Here a handleable size of main component within the pathway is to be found,
so the algorithm stops when it reaches an amount approximately r percent of the search
space size, or, stops at predefined min/max size in the case of small/large search space.
However, the whole process could break off anytime the candidate pool is emptied.
6. Visualization
Main components found in this methodology are visualized using Cytoscape [43].
Moreover, they can also be mapped on the pathway figure using KEGG’s online tool
[45] when the significant pathway is retrieved from KEGG database.
3.5 Results demonstration
In the end, attempts were made to reveal the biological scenes underlying the
results of this methodology by associating Gene Ontology terms with members of main
components. This is done with the help of DAVID [23] for GO terms association and
clustering, and QuickGO [46] for GO hierarchy visualization.
31
Chapter 4 Results
4.1 Database
Table 4-1 lists some characteristics of the database constructed here, and
informative relationships of its content data are shown in Figure 4-1.
Table 4-1. Statistics of current database
Current Database Number of Records
NCBI - Entrez Gene database 40234 signaling / metabolic pathways 691
biomolecular interactions 31340 genes with interaction information (A) 8787
genes involved in pathways (B) 5596 genes targeted by U133A probe sets (C) 13799
Figure 4-1. It shows overlaps between array, pathway and interaction data.
Notably there are a large number (a+b=4,873) of genes within the global interaction network that are currently not assigned to any predefined pathways, which implies the potential to exploit functions of unknown genes when integrating interaction information. On the other hand, there are also 1,682 (c+d) genes in pathways that show
32
no related interaction information, which foretells some obstacles to be met in this methodology. However, this could be overcame by constructing a database that includes more complete interaction data, although it may take lots of time to manually record them from pathway databases and scientific literatures.
4.2 Single gene analysis
In this step, probe level data in CEL files are transformed into an input matrix which will be fed into pathway/network analysis. Figure 4-2 describes the single gene analysis procedure and Figure 4-3 shows the t-score distribution of probe sets before and after representative filtering. When a Bonferroni adjusted p-value < 0.05 criterion was applied, there were 1,489/1,345 significantly up-/down- regulated probe sets in NTUH dataset after representative filtering, which was much greater than that in GSE7670 dataset as shown in the lower panel in Figure 4-3.
Figure 4-2. Single gene analysis procedure. It describes how the input matrix was produced from CEL files.
33
Figure 4-3. t-score distribution of probe sets before and after representative filtering.
The total probe set number decreases from 22215 to 13799. Probe sets with t-scores greater/smaller than the Bonferroni thresholds are significantly up-/down-regulated.
In both datasets the original left-skewed distribution turned into a bimodal
distribution after filtering, and in the bimodal distribution the positive and negative
sides show unequal peak heights and variances. This might be due to the different
correlation structures between genes. More detailed statistics of t-score distribution is
shown in Table A-1. In addition, an unusual peak highlighted in the Figure 4-3 comes
from a nonspecific high-scoring probe set 209079_x_at that targets at 22 related genes.
34
4.3 Pathway analysis
Three pathway analysis methods were applied on the datasets, namely Tian scoring
method (using scoring function f0), modified Tian scoring method (using scoring
function f1) and GSEA, and the third one was applied only for result comparisons.
A total of 560 pathway information with size ranging from 10 to 500 were used for
pathway analysis; each pathway has corresponding null distributions generated by
permuting either gene order or class label for 10,000 times. The unadjusted p-values of
the 560 pathways were corrected for multiple hypotheses test using Bonfferoni
correction or q-value conversion in f0 scoring function and f1 scoring function,
respectively. Detailed parameter settings in each method were provided in Table A-2.
Histograms of unadjusted p-values for 560 pathways under different permutation
types or scoring schemes were displayed in Figure 4-4 and among them a number of
pathways were deemed significant under both scoring functions. This phenomenon was
different from that in GSEA (Table A-3) and suggested an elevated statistical power in
the methods applied here. Detailed significant pathways with adjusted p-value passing
the criteria of 0.05 in both datasets were listed in Table A-3,A-4.
After all pathways had undergone pathway score evaluation and p-value
adjustment, network analysis was applied further to investigate molecular mechanisms
35
Figure 4-4. Histograms of unadjusted p-values for 560 pathways in database.
Pathway p-values in the upper panel were derived by permuting gene orders and the lower ones by permuting phenotypes. Bars in green color stand for the result when applying f1 scoring function and that in orange color come from result of f0
scoring function.
based on the interested pathways. These two scoring methods identified significant
pathways for further analysis. For example, “Cell cycle” pathway was identified by f0
scoring, whereas “Focal adhesion” pathway was selected by the f1 scoring method. In
principle, cancer-associated pathways with larger pathway size were chosen out for
examples from those with adjusted p-value lower than 0.05 in both datasets.
36
4.4 Network analysis - within pathways
In this and the next section, network analysis was applied on significant pathways,
and the focus will be on extracting main components from pathways that take the
interplay of member genes into consideration.
Within the search space containing all nodes in the pathway, candidate
subnetworks each with size 8 were formed using f1 scoring. They were then merged into
one main component with size basically 0.75 times the space size, or bounded by the
size of 15, 25 for small, large search spaces respectively. The whole process was
summarized in Table 4-2.
Table 4-2. Network analysis procedure and the consistency with GSE7670.
NTUH (m=0) Cell cycle Focal adhesion
search space (entire pathway) 90 199
candidate subneworks 60 111
top subneworks merged 10 10
1st main component size 19 25
leading-edge subset size 35 63
consistency with GSE7670 74% 60%
Fisher’s exact test p-value 9.4E-08 4.6E-10
This table showed that the search spaces of two pathways were dramatically
reduced from 90 and 199 to 19 and 25 of their 1st main components, respectively.
Furthermore, the 1st main component sizes were also smaller than the sizes of leading
37
edge subsets obtained by GSEA that equaled to 35 and 63.
To examine the robustness of this method, members of main components derived
from NTUH dataset were compared with that obtained from GSE7670 dataset under
same analysis procedures. The result of comparison was also shown in the table that the
1st main component from NTUH dataset had a significant overlap (74% and 60%
overlap and the overlaps were all with p-value << 0.05) with that derived from
GSE7670 dataset.
These main components obtained by network analysis are displayed in Figure 4-5,
where gene/gene products are arranged according to their cellular locations. For detailed
expression level and annotations of each gene in Figure 4-5, please refer to Table A-5.
Evidently the major difference between main components obtained from the two
pathways is that a coherently up-regulation in tumor samples is observed in Figure 4-5A,
while such coherence does not appear in Figure 4-5B. This observation once again
reveals the main difference of Tian method and modified Tian method: the former
emphasizes pathways with moderate but concordant changes and the latter focuses on
pathways with significant degree of overall changes, regardless of up- or down-
regulation.
38 (A)
(B)
Figure 4-5. Main components extracted from two pathways. After network analysis, main components were derived from (A) cell cycle pathway and (B) focal adhesion pathway.
Genes were arranged by their subcellular locations and colored by expression fold change values (red for up-regulation and green for down-regulation in cancer phenotype).
Rectangles indicate that genes were found in GSE7670 and NTUH dataset in common, whereas circles mean that genes were identified only in NTUH dataset.
39
In addition, we also applied network analysis using f2 scoring with m=1 and the
rest settings remained unchanged on the two pathways. The main components yielded
by this analysis share approximately 70-90 percent similarities with previous ones.
Some nodes playing essential roles of maintaining the integrity of main components did
exist, however, Figure A-3 and Table A-6 indicate that an expected situation of catching
clues of key nodes bridging separately connected subgraphs did not show apparently.
4.5 Result demonstration and comparison
4.5.1 Mapping the main component and leading edge subset on KEGG pathways
Two main purposes of this section is to illustrate how the main component
obtained by network analysis differed from the leading edge subset obtained by GSEA
from biological viewpoints, and to highlight the fundamental difference between
information stored in pathway databases and interaction databases, thus to present the
benefit it brought to integrate both information with microarray data.
Figure 4-6 was produced by an on-line tool provided by KEGG [45], where
members of leading edge subset and main component for each pathway were
simultaneously mapped onto simplified pathway figures. It seemed that most members
of main components overlap with leading edge subsets, however, they actually share an
overlap of about 60-80 percent as summarized in details in Table A-7.
40 (A)
(B)
Figure 4-6. Overlay the main component and leading edge subset on the original pathway.
41
In the simplified figure, rectangles represent genes, complexes or families, etc. in (A) cell cycle pathway and (B) focal adhesion pathway. Members of main components were filled in red/green colors standing for up-/down-regulation in cancer phenotype. Members of leading edge subset were colored in grey. Note that those nodes with black border and filled with color were common members of both sets, while those with white borders were found in main components only. In addition, dashed red lines were added artificially implying interaction relationship not recorded in KEGG databases.
In Figure 4-6, the members of the leading edge subsets were colored in grey.
Although these genes show statistically significances, they located in the pathway in a
scattered manner. In contrast, main components are tightly interconnected subset of
genes because it took topology into consideration. It represents the most significant
module in the pathway, which would be more biologically meaningful when compared
to the leading edge subset that only takes gene significances into account.
Furthermore, the dashed red lines represent possible interactions which do not
appear in the predefined pathway may illustrate the additional information one would
get when incorporating interaction data with pathway information. The benefit of
bringing in the additional information would become more valuable later in section 4-6.
4.5.2 GO term enrichment analysis
In this section, we try to manifest the biological meanings of these main
components. It was achieved by using DAVID to compare the GO terms enriched in
both the entire pathway and the main component extracted from the pathway. As
42
mentioned in section 2.3, DAVID evaluates the randomness of each GO term being
associated with a user-specified gene list and assigns it with both an EASE score and a
false discovery rate (FDR) reflecting the significance of the EASE score. After that,
similar GO terms were sorted into a cluster and gave the cluster a new enriched score by
summarizing member term EASE scores. This enriched score was then used to rank
relative importance of clusters.
DAVID was used to separately evaluate GO terms/clusters associated with gene
lists containing the entire pathway and the main component. Then the GO terms or
clusters that were contained terms with significant association (FDR<0.05) with the
gene list were visualized by a pie chart. Each portion in the pie chart stands for a GO
term/cluster, and the proportion of gene list members involved in this cluster/term is
showed as a percentage. In addition, the rankings of relative importance of clusters are
specified on the pie chart. Note that different GO term/cluster may probably contain
overlap information, so the overall percentage in the pie chart would not be exactly 100.
Figure 4-7, 4-8 display the result obtained in focal adhesion pathway and Figure 4-9,
4-10 for that in cell cycle pathway.
In Figure 4-7, GO terms/clusters enriched in focal adhesion was identified in terms
of cellular component category. It was obvious that Cluster C turned to occupy a larger
43
proportion in the main component than in the entire pathway. It suggested that when
reducing the list of genes from the entire pathway to the main component, the functions
that the list of genes able to play had gradually specialized. It might also imply that
certain functions are more differentially regulated than other functions in the original
pathway, and that these important functions could be revealed by methodology.
Furthermore, when member terms of Cluster C were individually considered in Figure
4-7C, each of them showed a consistent trend of such function specialization in both
leading edge subset and main component. In contrast, this specialization was not
observed in a random subset that was randomly chosen from the entire pathway and
with the same size of the main component. In terms of molecular function category in
Gene Ontology, several significant GO terms not grouped into clusters (Term B-F) also
showed this consistent trend of function specialization in Figure 4-8C.
Such trends did not appear only in focal adhesion pathway. In Figure 4-9 cell cycle
pathway was analyzed using GO biological process terms, where Cluster A showed also
a dramatic increase from sharing 36% of the entire pathway to sharing 63% of the main
component. When analyzing cell cycle pathway using GO cellular component in Figure
4-10, the function specialization also existed since the proportion Cluster C,D,E
occupied were all amplified in Figure 4-10B.