國立交通大學
生物資訊所
碩
士
論
文
基於人類與小鼠微陣列基因表現圖譜識別功能性基因
群
Identifying functional gene clusters based on microarray
gene expression profiles in human and mouse genomes
研 究 生:洪瑞鴻
指導教授:黃憲達 教授
基於人類與小鼠微陣列基因表現圖譜識別功能性基因
群
Identifying functional gene clusters based on microarray
gene expression profiles in human and mouse genomes
研 究 生:洪瑞鴻 Student:Jui-Hung Hung
指導教授:黃憲達 Advisor:Hsien-Da Huang
國 立 交 通 大 學
生 物 資 訊 所
碩 士 論 文
A ThesisSubmitted to Institute of Bioinformatics College of Biological Science and Technology
National Chiao Tung University in partial Fulfillment of the Requirements
for the Degree of Master
in
Bioinformatics
June 2007
Hsinchu, Taiwan, Republic of China
基於人類與小鼠微陣列基因表現圖譜識別功能性基因群
學生:
洪瑞鴻指導教授
:黃憲達國立交通大學生物資訊所碩士班
摘 要
跨物種的基因表現圖譜分析能提供在天擇的進程中被保留的基因的功能和其參 與機制的訊息,保留在物種間的基因群所扮演的生化功能極可能具有不易被取代 的重要功能。尋找這樣的功能性基因群能加速基因療法中候選基因的發現和藥物 的開發。為此,本研究提出一個新穎的計算處理架構試圖找出保留於人類和小鼠的基因群,利用奇異值分解(singular value decomposition) 和分群演算法分析
基因表現,並且利用同源連結(ortholog linkage)和時間規整演算法(time warping
algorithm)使來自不同物種(異質性)的微陣列基因表現圖譜時間序列可以比較 並依此建議同源基因群。同時,我們實作模糊最近聚類(fuzzy nearest-cluster) 方 法 來 預 測 這 些 可 能 在 生 物 過 程 ( bioprocess ) 扮 演 極 重 要 的 角 色 的 同 源 (orthologous)基因群的功能併施行統計檢定找出具有生物意義、保留在物種間 影響細胞週期的基因群。 簡而言之,這個研究結合序列和時間序列圖譜層級的相似性來建議跨物種的功能 性基因群,提供基因療法實驗的候選基因並希望能貢獻在跨物種基因分析的卓越 進展。 最後,為了讓整個流程方便應用在未來的相關研究上,整個架構被模組並程式化 成一個獨立的可執行套件並結合視覺化的呈現。
Identifying functional gene clusters based on microarray gene
expression profiles in human and mouse genome
student:
Jui-Hung HungAdvisors:Dr.
Hsian-Da HuangInstitute of Bioinformatics
National Chiao Tung University
ABSTRACT
Cross-species gene expression analysis provides information of gene functions and
involving mechanisms, which conserved in evolutionary process. Gene groups
conserved in species are very likely to play irreplaceable biochemical functions.
Searching for this kind of functional gene groups can accelerate the discovery of
candidate genes in gene therapy and development of drug design. For this, our
research proposes a novel computational scheme to figure out the genes having
important biochemical functions, especially targets on the genes which are conserved
in human and mouse. These genes conserved across evolutionary history would be
most likely to reveal fundamental biochemical functions. This work utilizes singular
value decomposition (SVD) and clustering techniques to analyze gene expression, and
exploits orthologous linkage and time warping algorithm making microarray
time-series gene-expression profiles of different species (heterogeneous profiles)
comparable to suggest orthologous gene groups. In the meanwhile, in order to make
the results more promising, we use fuzzy nearest cluster method to predict the
and perform statistical test according to our annotation of predicated gene function to
find the genes having biological significance among these orthologous genes.
In brief, this research combines sequence- and time-series expression- levels ortholog
to suggest functional genes among multiple species, provides materials for candidate
gene therapy experiments and hopes to contribute remarkable advancement in
cross-species orthologous gene analysis. In the end, in order to let the whole process
be utilized in further application, the scheme is modeled and programmed in a
目
錄
摘 要 ... I ABSTRACT ...II 表 目 錄 ... VI 圖 目 錄 ... VII CHAPTER 1 INTRODUCTION... 1
1.1 OVERVIEW OF THE SCHEME... 1
1.1.1 Gene expression time series... 1
1.1.2 Gene function prediction... 3
1.1.3 Data clustering ... 3
1.1.4 Distance functions ... 6
1.1.5 Comparative analysis of different species... 7
1.2 MOTIVATION... 8
1.3 GOALS... 9
CHAPTER 2 RELATED WORKS ... 11
2.1 CROSS SPECIES ANALYSIS WITH STATIC PROFILING... 11
2.1.1 Genome-wide expression data of six organisms [10] ... 11
2.1.2 Orthologous expression profiling in multi-species models[11] ... 12
2.2 CROSS SPECIES ANALYSIS WITH TIME SERIES... 14
2.2.1 GSVD for comparative analysis of expression data sets of two different organisms[13]14 2.2.2 Continuous representation of time-series expression profiles[2]... 17
2.2.3 Aligning gene expression time series with time warping algorithms [1] ... 19
CHAPTER 3 MATERIALS AND METHODS ... 21
3.1 MATERIALS... 21
3.1.1 Datebases... 21
3.1.2 Data set of Experiments and results ... 24
3.1.3 Date set of gene funcation predication ... 24
3.2 METHODS OVERVIEW... 26
3.3 ALGORITHM... 28
3.3.1 Dynamic Time Warping (DTW) algorithm ... 28
3.3.2 Singular value decomposition (SVD) and Clustering approaches... 31
3.3.3 Guilt-by-association (GBA) principle... 34
3.3.4 Fuzzy Nearest-Cluster (FNC)[21] ... 35
3.4 METHODS... 38
3.4.1 Preprocessing ... 38
3.4.2 Ranking by single-gene distance... 40
3.4.3 Finding orthologous gene groups ... 42
3.4.4 Annotations and statistics analysis ... 46
3.4.5 Visualization ... 48 CHAPTER 4 IMPLEMENTATION ... 50 4.1 DATABASE... 50 4.1.1 GEO dataset ... 50 4.1.2 GO dataset... 51 4.1.3 Homologene dataset ... 52
4.2 IMPLEMENTATION OF TIME WARPING... 52
4.2.1 DTW core... 52
4.2.2 Group-Euclidean distance function ... 53
4.3 IMPLEMENTATION OF SVD ... 53
4.3.1 SVD core... 53
4.3.2 Noise reduction ... 55
4.4 IMPLEMENTATION OF FNC ... 55
4.4.1 Mining by unsupervised approach... 55
4.5 STATISTICAL ANALYSIS... 56
4.5.1 Hypergeometric testing... 56
4.6 IMPLEMENTATION OF VISUALIZATION... 57
4.6.1 GD library[28] ... 57
4.6.2 Genesis and graphwarp ... 57
CHAPTER 5 RESULTS... 59
5.1 ARTIFICIAL DATA SET... 59
5.2 YEAST DATA SET... 62
5.3 HUMAN AND MOUSE FETAL LIVER DATA SET... 64
5.4 GENE FUNCTION PREDICTION... 71
5.4.1 Human gene prediction... 72
5.4.2 Mouse gene prediction... 73
5.5 WEB SERVER... 74
5.5.1 Twins overview... 74
5.5.2 Web interface ... 77
5.6 SOFTWARE PACKAGE... 80
CHAPTER 6 DISCUSSION ... 81
6.1 THE LIMITATION OF THE DISTANCE FUNCTION CALCULATION BY DYNAMIC PROGRAMMING. 81 6.2 THE LIMITATION OF SVD... 81
6.3 FUTURE WORK... 82
CHAPTER 7 CONCLUSIONS ... 83
CHAPTER 8 REFERENCES... 84
表 目 錄
Table 3.1 Classification result (%) for largest 20 functional classes. Values in bold indicate the top performance in each row.………..………
38
Table 5.1 GO term with high biological significance in case 1………..… 67
Table 5.2 GO term with high biological significance in case 2.…….……… 69
Table 5.3 GO term with high biological significance in case 3.…….……… 71
Table 5.4 GO term predicted in human………... 72
Table 5.5 Known GO term in human……….. 73
Table 5.6. GO term predicted in mouse.………..……… 74
Table 5.7 Known GO term in mouse.……….……….………...… 74
Table 5.8 A List of cell-cycle profile fetched from NCBI GEO………….……… 76
圖 目 錄
FIGURE 1.1 GENE EXPRESSION ALSO VARIES WITHIN A CERTAIN TYPE OF CELL AT DIFFERENT POINTS IN TIME.FOR EXAMPLE, THE GENE EXPRESSION PROFILES OF AN ORGAN MIGHT DIFFER BETWEEN NORMAL AND CANCEROUS STATES, AS SHOWN HERE.
[HTTP://WWW.NCBI.NLM.NIH.GOV/CLASS/MLACOURSE/MODULES/MOLBIOREVIEW/GENE_EXPRES
SION_PROSTATE.HTML] ... 2
FIGURE 1.2 DIFFERENT REPRESENTATIONS OF GENE EXPRESSION PROFILES:(A) PATTERN,(B) COLOR SCALE [HTTP://GEPAS.BIOINFO.CIPF.ES/CGIBIN/TUTOX?C=CLUSTERING/CLUSTERING.CONFIG] ... 3
FIGURE 1.3 OVERVIEW OF HIERARCHICAL CLUSTERING OF ALL SAMPLES.GENES AND BLOOD SAMPLES ARE ORGANIZED BY HIERARCHICAL CLUSTERING BASED ON OVERALL SIMILARITY IN EXPRESSION PATTERNS.EXPRESSION LEVELS ARE REPRESENTED BY A COLOR KEY IN WHICH BRIGHT RED REPRESENTS THE HIGHEST LEVELS AND BRIGHT GREEN REPRESENTS THE LOWEST LEVELS, AND LESS SATURATED SHADES REPRESENT INTERMEDIATE LEVELS OF EXPRESSION. [HTTP://WWW.BIOMEDCENTRAL.COM/1471-2164/7/115/FIGURE/F1] ... 5
FIGURE 1.4 DIFFERENT DISTANCES WILL RENDER DIFFERENT CLASSIFICATIONS BECAUSE WE ARE ASKING FOR GROUPING BASED ON DIFFERENT FEATURES (TRENDS IN THE CASE OF CORRELATION AND ABSOLUTE DIFFERENCES IN THE CASE OF EUCLIDEAN DISTANCES) [HTTP://GEPAS.BIOINFO.CIPF.ES/CGIBIN/TUTOX?C=CLUSTERING/CLUSTERING.CONFIG] ... 7
FIGURE 2.1 STARTING FROM A SET OF COEXPRESSED GENES ASSOCIATED WITH A PARTICULAR FUNCTION IN ORGANISM A, THEY FIRST IDENTIFY THE HOMOLOGUES IN ORGANISM B USING BLAST. ONLY SOME OF THESE HOMOLOGUES ARE COEXPRESSED WHILE OTHERS ARE NOT.THE SIGNATURE ALGORITHM SELECTS THIS COEXPRESSED SUBSET AND ADDS FURTHER GENES THAT WERE NOT IDENTIFIED BASED ON SEQUENCE... 12
FIGURE 2.2 DIFFERENT DISTANCES WILL RENDER DIFFERENT CLASSIFICATIONS BECAUSE WE ARE ASKING FOR GROUPING... 14
FIGURE 2.3 ILLUSTRATION OF GSVD... 16
FIGURE 2.4 YEAST AND HUMAN EXPRESSION RECONSTRUCTED IN THE SIX-DIMENSIONAL CELL-CYCLE SUBSPACES APPROXIMATED BY TWO-DIMENSIONAL SUBSPACES. ... 17
FIGURE 2.5 ALIGNMENT OF GENES FOR THE CDC28DS TO CDC15DS. ... 18
FIGURE 2.6 TIME WARPING RESULT. ... 20
FIGURE 3.1 WEB PAGE OF GEO. [HTTP://WWW.NCBI.NLM.NIH.GOV/GEO/] ... 22
FIGURE 3.2 WEB PAGE OF GO. [HTTP://WWW.GENEONTOLOGY.ORG/] ... 23
FIGURE 3.3 WEB PAGE OF HOMOLOGENE. [HTTP://WWW.NCBI.NLM.NIH.GOV/HOMOLOGENE/]... 24
FIGURE 3.4 SYSTEM FLOW OF THE SCHEME... 26
FIGURE 3.5 AN ILLUSTRATION OF DTW ALGORITHM. ... 29
FIGURE 3.6 TWO TIME SERIES IN A TWO-DIMENSIONAL FEATURE SPACE CONTAINING SAMPLE POINTS FROM A CONTINUOUS PROCESS, WITH SAMPLE POINTS OF EACH SERIES MAPPED TO EACH OTHER BY SIMPLE TIME WARPING... 30
FIGURE 3.7 SVD FOR GENOME-SCALE EXPRESSION DATA ANALYSIS. [HTTP://GENOME-WWW.STANFORD.EDU/SVD/]... 32
FIGURE 3.8 VISUALIZATION OF THE SVD OF CELL CYCLE DATA.(A)PLOTS OF RELATIVE VARIANCE;(B) THE FIRST EIGNEGEN IS SHOWN;(C) THE SECOND EIGNEGENE IS SHOWN.(D)THE THIRD EIGENGEN LACKS THE OBVIOUS CYCLIC STRUCTURE OF THE FIRST AND SECOND.[18]... 34
FIGURE 3.9 THE SYSTEM FLOW OF FNC. ... 35
FIGURE 3.10 ALGORITHM OF MINING CO-EXPRESSED SUBGROUPS WITHIN EACH FUNCTION... 36
FIGURE 3.11 PSEUDO CODE OF MINING CO-EXPRESSED SUBGROUPS... 36
FIGURE 3.12 ALGORITHM OF FUZZY K-NEAREST CLUSTERS FOR FUNCTIONAL PREDICTION. ... 37
FIGURE 3.13 PSEUDO CODE OF FUZZY K-NEAREST CLUSTERS FOR FUNCTIONAL PREDICTION. ... 37
FIGURE 3.14 INPUT FILE FORMAT... 39
FIGURE 3.15 BUILD THE GENE LINKAGE BETWEEN EACH GENE OF GROUPS AND THE GENE OF ANOTHER SPECIES. ... 41
FIGURE 3.16 THE FLOW CHART OF STAGE FINDING ORTHOUGOUS GENE GROUPS. ... 43
FIGURE 3.17 (A)GROUP THE GENES IN SPECIES A TO FIND GENES SHARE SIMILAR EXPRESSION PATTERN AND SELECT THE GENES IN SPECIES B BY THE MAPPINGS ACCORDING TO GENE LINKAGE.(B) FOR EACH GROUP OF GENES FOUND IN (A), PERFORM CLUSTERING AGAIN ON SPECIES B IN ORDER TO DIVIDE THE SELECTED GENES OF SPECIES B INTO SMALLER GROUPS WHICH SHARE RESEMBLING PATTERN.THEN FOLLOW THE GENE LINKAGE AGAIN TO RE-CONSTRUCT THE RELATIONSHIP... 45
FIGURE 3.18 PAIR CONTAINS TWO SET OF GENES ASSOCIATED BY GENE LINKAGE.EACH SET BELONGS TO A SPECIES AND BUILDS THE CORRESPONDING RELATION BY GENE LINKAGE... 46
FIGURE 3.19 (A)WARPING PATH DISPLAY.(B)ANOTHER WAY TO UNDERSTAND WARPING PATH.(C)AN
OVERVIEW CONTAINING PROFILE PATTERNS AND THE EXPRESSION LEVEL. ... 49
FIGURE 4.1 DATABASE SCHEME OF MIRROR GEO. ... 51
FIGURE 4.2 DEMONSTRATION OF SEMI-GLOBAL DTW... 52
FIGURE 4.3 VISUALIZATION OF (A)GENESIS (B) GRAPHWARP... 58
FIGURE 5.1 RESULTS OF (A) PATTERN ONE (INCREASING)(B) PATTERN TWO (DECREASING)(C) PATTERN THREE (TWO PEAKS)... 60
FIGURE 5.2 (A)RESULTS SHOW A TIME SHIFT IN THE BEGINNING OF EACH PEAK.(B)SVD FILTERS LOTS OF NOISE. ... 61
FIGURE 5.3 (A)RESULTS OF MIXED PATTERNS.(B) SEMI-GLOBAL DTW WORKS WELL IN THIS CASE... 62
FIGURE 5.4 GENE CLUSTERING OF (A) NORMAL YEAST CELL, AND (B) KNOCK-OUT YEAST. ... 63
FIGURE 5.5 CDC20 SHOWS TIME SHIFT IN KNOCKOUT YEAST... 64
FIGURE 5.6 LEFT:GENE CLUSTERING OF MOUSE.RIGHT:GENE CLUSTERING OF HUMAN... 65
FIGURE 5.7 RESULT OF UN-SVD(RIGHT) GENES COMPARING TO GENES WITH SVD(LEFT)... 66
FIGURE 5.8 CASE 1: GENES HIGHLY EXPRESSED IN EARLY STAGES... 67
FIGURE 5.9 CASE 2: GENES HIGHLY EXPRESSED IN LATER STAGES. ... 69
FIGURE 5.10 CASE 3: GENES HIGHLY EXPRESSED LATER THAN GENES IN CASE 2... 70
FIGURE 5.11 VERIFIED THE PREDICTION ACCURACY OF FNC IN HUMAN. ... 72
FIGURE 5.12 TWINS SYSTEM FLOW... 75
FIGURE 5.13 TWO TYPE OF OPERATIONS ON WEB INTERFACE.(1) COMPARE TWO UPLOADED EXPERIMENTS, AND (2) COMPARE AN UPLOADED EXPERIMENT WITH DATABASE... 78
FIGURE 5.14 THE RESULT OF USERS’ REQUEST.THE GRAPH IN BLUE SQUARE IS DRAWN BY GRPHWARP. 79
Chapter 1
Introduction
1.1 Overview of the scheme
Since the high-throughput microarray assay has been widely used, gene function
prediction is shown to be reliable by classifying their gene expression profile
similarity. In some of frontier research, including experimental drug and gene therapy,
understandings of orthologous genes can be helpful accelerating the progress of the
discovery[1].
Research about orthologous gene functional groups searching is not rare but
limited.
This research considers both sequence homology and time-series expression
profile pattern similarity to search for the orthologous functional gene groups among
multiple species, and is able to deal with different time point number and interval, and
it integrates gene functional predication to annotate and suggest their biological
meanings which make the result more reliable.
In sum, this work presents a novel scheme to discover orthologous time-series
gene expression profiles in multiple species. With this scheme, it is now easier to
observe and disclose the important functional genes conserved in evolution process by
time-series microarray profiles.
1.1.1 Gene expression time series
A gene expression profile is the result of microarray analyses, which give the
breakdown of the switching on and off of certain genes (Figure 1.1 ). Gene expression
profile is an important asset, especially for scientifically understanding biological
all other high throughput assays for gene activity give biologists the chance to view
the global mRNA profile systematically[1]. There are two types of experiments, static
and time series experiments[2]. In static expression experiments, only a snapshot of
the expression of genes in different samples is measured[3]. On the other hand, when
the profile containing the information of time intervals, it reveals the expression of
genes with cell cycle stage, development stage, or any time related pattern. Gene
expression time series is a list of expression data for a gene along a number of
different experimental time intervals and would correspond to a row in the
representation (Figure 1.2 ). Through the variation of mRNA expression level with
time, which enables further investigations of the gene regulation networks[4, 5],
functional groupings of genes, distinction of cell cycles[6], tissue-specific profiling,
etc, scientists are now unraveling the mechanism of bioprocesses efficiently.
Figure 1.1 Gene expression also varies within a certain type of cell at different points in time. For example, the gene expression profiles of an organ might differ between normal and cancerous states, as shown here.
[http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/gene_express ion_prostate.html]
Figure 1.2 Different representations of gene expression profiles: (A) pattern, (B) color scale [http://gepas.bioinfo.cipf.es/cgibin/tutoX?c=clustering/clustering.config]
1.1.2 Gene function prediction
Determining the functions of genes is an essential problem in biology, which is
fundamental to realize the molecular and biochemical processes, identify and validate
new drug targets and develop reliable diagnostics. Recent advances in genomic
sequencing have generated an astounding number of new putative genes and
hypothetical proteins whose biological function remains a mystery. On average, there
are 70% of the genes in a genome having poorly known or no known functions. There
are two typical techniques that can be used on gene expression data for gene function
annotation or predication. The first technique is clustering, such as hierarchical
clustering, k-means clustering, SVD, and PCA, while the second is classification,
such as Hidden Markov Model (HMM), Support Vector Machine (SVM), and Neural
Networks (NN).
1.1.3 Data clustering
The primary aim of clustering is to figure out several clusters and centroids or
prototypes and using these centroids to represent the original enormous data.
In brief, data clustering is more likely to attempt to group data in smaller set.
However, some clustering approaches can be used as classifiers as well, and it is
needless to predefine classes (unsupervised learning). Clustering approaches are
feasible to be utilized in gene expression analysis, since the genes are numerous and
the interactions are complex.
Hierarchical clustering: In hierarchical clustering, a series of partitions takes
place, which may run from a single cluster containing all objects to n clusters each
containing a single object[7]. Hierarchical Clustering is subdivided into
agglomerative methods, which proceed by series of fusions of the n objects into
groups, and divisive methods, which separate n objects successively into finer
groupings. One of the simplest agglomerative hierarchical clustering methods is single
linkage, also known as the nearest neighbor technique. The defining feature of the
method is that distance between groups is defined as the distance between the closest
pair of objects, where only pairs consisting of one object from each group are
considered.
The minimum value of these distances is said to be the distance between two
clusters. At each stage of hierarchical clustering, the clusters whose distance is
Figure 1.3 Overview of hierarchical clustering of all samples. Genes and blood samples are organized by hierarchical clustering based on overall similarity in expression patterns. Expression levels are represented by a color key in which bright red represents the highest levels and bright green represents the lowest levels, and less saturated shades represent intermediate levels of expression.
[http://www.biomedcentral.com/1471-2164/7/115/figure/F1]
K-means clustering: This nonhierarchical method initially takes the number of
this step itself the final required number of clusters is chosen such that the points are
mutually farthest apart. Next, it examines each component in the population and
assigns it to one of the clusters depending on the minimum distance. The centroid's
position is recalculated every time a component is added to the cluster and this
continues until all the components are grouped into the final required number of
clusters.
K-Means Training starts with a single cluster with its center as the mean of the
data. This cluster is split into two and the means of the new clusters are iteratively
trained. These two clusters are again split and the process continues until the specified
number of clusters is obtained. If the specified number of clusters is not a power of
two, then the nearest power of two above the number specified is chosen and then the
least important clusters are removed and the remaining clusters are again iteratively
trained to get the final clusters.
1.1.4 Distance functions
There are two main families of distances to measure how closely related are two
groups of genes:
Euclidean: this kind of distance strategy calculates the length of two separate points
in n-directional space by their absolute differences[9]. For example, Euclidean
distance is measure by following definition:
For two points A= (a1, a2… an), and B= (b1, b2… bn), Euclidean distance =
∑
= n 1 i 2 i i-b ) (awithin both profiles[9]. For example, Pearson correlation measures the similarity in
shape between two profiles by the following formula:
For two points A= (a1, a2… an), and B= (b1, b2… bn), Pearson correlation distance
= a -a b -b n 1 -1 b i n 1 i a i ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛
∑
= σ σ i i b a ii andb ,and , are thestandarddeviation of a andb
a of mean the are b , a where σ σ
These two kinds of distance strategies will lead to different clustering results.
Please see Figure 1.4 for illustration.
Figure 1.4 Different distances will render different classifications because we are asking for grouping based on different features (trends in the case of correlation and absolute differences in the case of Euclidean distances)
[http://gepas.bioinfo.cipf.es/cgibin/tutoX?c=clustering/clustering.config]
1.1.5 Comparative analysis of different species
the study of biological and evolutionary principles[10]. Although differences among
organisms are often attributed to differential gene expression, genome-wide
comparative analysis thus far has been based primarily on genomic sequence
information.
By miscellaneous gene function predication techniques, biologists are now more
interesting in orthologous gene searching among different species. Since comparative
analysis of the expression data among two or more model organisms promises to
enhance fundamental understanding of the universality as well as the specialization of
molecular biological mechanisms. It also may prove useful in medical diagnosis,
treatment, and drug design. Comparisons of the DNA sequence of entire genomes
already give insights into evolutionary, biochemical, and genetic pathways.
Considering that gene expression profiling gives more information of genes’
biochemistry functional roles, comparative analysis based on microarray data is now a
blossomed area.
1.2 Motivation
Recently, Microarray expression analysis has become an important technique for
evaluating gene expression level in genomic scale. In addition, due to the profound
progress in gene sequencing, considerable number of genes are predicted and found.
However, there are only 30% of genes are explicitly analyzed and understood the
functional roles they playing in biological process[1]. It had been shown that using
microarray gene expression analysis to predict the functions of genes is an important
and efficient means[5]. However, one of the defects of conventional approaches is
that the number of sampling points and growth rate of cells (affected by experiment
designed to provide comparable samplings, and this is not practical in most of the
cases. Therefore, mostly, searching for functional gene clusters is constrained in
mono-species model and parallel experiment.
Despite so, by compiling the information of gene expression profile from diverse
organisms, tissues, and conditions, scientists are now capable of dissecting more
advanced topics. For instance, Grigoryev D. N., who proposed his renowned research
on Genome Biology, introduced a multi-species model using gene expression profile
to find orthologous gene-expression genes of lung cells suffered from
ventilator-associated lung injury among human, mouse, and dog. Grigoryev also
suggested these genes are potential candidate genes of acute lung injury remedy in the
future, and inferred these genes are conserved among the evolution process because
they play crucial protection functions after lung injury.
Applying the idea of utilizing cross-species or -tissue orthologous gene-expression
profile to search important gene groups and their biological functions to other topics
like cell cycle, is a general concept, which needs further investigation and
development.
Nevertheless, when analyzing microarray time-series gene-expression profile,
scientists have to face the difficulty of coordinating different growth rate of cells and
the number and time intervals of sampling in each independent experiment, especially
of distinct organisms, which is now becoming a pressing issue to let cross-species
gene-expression profiles comparable.
1.3 Goals
This dissertation proposes a novel computational scheme to search and analyze the
involved in certain bioprocess. We try to contribute to experiment verifying and
treatment development, and answer following questions:
1. How to combine and take advantage of both sequence- and expression-
level orthologous gene predication?
2. How to build up the mapping relationships between genes of multi-species?
3. How to deal with noise or experimental artifacts?
4. How to let different time-series profile be comparable when the experiment
conditions, growth rate, sampling time point number are different?
5. How to make our predication convincing enough?
This research hopes to propose a novel scheme to solve these related tasks on the
basis of other research with integration and improvement, and contributes remarkable
Chapter 2
Related Works
Some of the existing research had given answers to parts of the questions we devote to
solve; however, these solutions are still not sufficient to resolve our problems
completely. Furthermore, most of them avoid the questions that how to let different
time-series profile be comparable when the experiment conditions, growth rate, and
especially sampling time point number are different
2.1 Cross species analysis with static profiling
2.1.1 Genome-wide expression data of six organisms [10]
S. Bergmann et al. present a comparative study of large datasets of expression profilesfrom sic evolutionarily distant organisms: S. cerevisiae, C. elegans, E. coli, A.
thaliana, D. melanogaster, and H. sapiens. They use genomic sequence information
to connect these data and compare global and modular properties of the transcription
programs. Linking genes whose expression profiles are similar, functionally related
sets of genes are frequently coexpressed in multiple organisms. Bergmann integrates
the expression data with genomic sequence information to address three biological
issues. First, we verify that coexpression is often conserved among organisms and
propose a method for improving functional gene annotations using this conservation.
Second, we compare the regulatory relationships between particular functional groups
in the different organisms using the iterative signature algorithm (ISA), giving initial
insights into the extent of conservation of the gene regulatory architecture. See Figure
Figure 2.1 Starting from a set of coexpressed genes associated with a particular function in organism A, they first identify the homologues in organism B using BLAST. Only some of these homologues are coexpressed while others are not. The signature algorithm selects this coexpressed subset and adds further genes that were not identified based on sequence.
This approach didn’t consider data with time series. Also, they linked data of different
species by BLAST, which can provide sequence level homology, but they use ISA to
extend genes they linked to more genes co-expressed in the same species. In order
words, genes they found in the end only have ortholog in expression-level. Moreover,
although they try to tell the functions of genes they found, but not with clear evidence
and inference.
2.1.2 Orthologous expression profiling in multi-species
models[11]
species-specific Affymetrix GeneChips to search for candidate genes related VALI
(ventilator-associated lung injury). The individual analysis of species-specific arrays
produced large lists of candidate genes and several challenges, with the most notable
being an excessive number of genes for candidate gene selection. While meta-analysis
strategies exist for narrowing candidate gene selection from multiple experimental
systems, this analysis can only be applied to the same species cross-platform array
comparison, to use this approach for analysis of experiments involving diverse species
we speculated that multispecies gene0expression profiles could be linked using
RESOURCERER[12], which is based on EGO database and contains information for
all commercially available Affymetrix Genechips.
D. N. Grigoryev speculated that overlapping responses to mechanical stretch in
orthologous genes across species might reveal candidate genes involved in an
evolutionarily conserved defense mechanism to lung injury that might be triggered by
ventilator-induced lung injury.
This research first calculated gene-expression changes for each tested species and
linked expression values obtained for orthologous genes. Orthologous genes
exhibiting similar patterns of expression across all species were selected as
VALI-related candidates under the assumption that gene-expression responses
conserved across evolutionary history would be most likely to reveal fundamental
Figure 2.2 Different distances will render different classifications because we are asking for grouping
The basic concept of linking differentially-expressed gene with other species is
similar to linking co-expressed gene group in our approaches. Even regardless of the
inability of dealing time-series profiles, their scheme need a common experiment
condition (in their case VALI) among samples of all species, which means that the
experiment should be carefully designed and executed. This constraint makes this
approach unpractical in many situations. Although they did do some experiment to
prove genes they found is associate with VALI, however, their scheme unable to give
a global view of gene function in genomic scale.
2.2 Cross species analysis with time series
2.2.1 GSVD for comparative analysis of expression data
sets of two different organisms[13]
GSVD (generalized singular value decomposition) provides a comparative
mathematical framework for two genome-scale data sets from the two-genes X arrays
are shared by both data sets. Each genelet is expressed only in the two corresponding
arraylets, with a corresponding “angular distance” indicating the relative significance
of this genelet, i.e. its significance, in one data set relative to that in the other (see
Figure 2.3).
O. Alter shows that mathematical reconstruction of gene expression in a subset of
genelets may simulate experimental expression in subset of genelets may simulate
experimental observation of only the process that these genelets are inferred to
represent. By using GSVD, the framework enables comparative reconstruction and
classification of the genes and arrays of both data sets and the comparison of yeast
Figure 2.3 Illustration of GSVD
GSVD relies on the strong basis of mathematical theory and suggest a general
approach analyzing two data sets. However, the major improvement can be
categorized in three points: flexibility of multi-species model, limitation of data
reduction, and incapableness of heterogeneous data sets.
First, GSVD provides useful framework to analyzing data set from two species.
However, it is not suitable for models consisted of more than two species. Also, data
sets from two species should be in same vector space, i.e. their dimension—the
number of time points—should be the same, which is unpractical in most of the case.
GSVD restrict the data to a small subset of similar conditions, such as time points
along the cell cycle, which drastically reduces the size of the dataset and limits the
scope of comparison[10].
Third, data sets from two species should be in same vector space, i.e. their
dimensions—the number of time points—should be the same, which is unpractical in
Figure 2.4 Yeast and human expression reconstructed in the six-dimensional cell-cycle subspaces approximated by two-dimensional subspaces.
2.2.2 Continuous representation of time-series expression
profiles[2]
Z. Bar-Joseph et al. present a general algorithm to detect genes differentially
expressed between two nonhomogeneous time-series data sets. Their algorithm
overcomes these difficulties by using a continuous representation for time-series data
and combining a noise model for individual samples with a global difference measure.
They introduce a corresponding statistical method for computing the significance of
this differential expression measure. They used their algorithm to compare cell-cycle
dependent gene expression in wild type and knockout yeast strains. Their experiments
suggest additional roles for the transcription factors Fkh1 and Fkh2 in controlling
cellular activity in yeast.
of piecewise cubic polynomials and are frequently used for fitting time series and
other noisy data.
Using splines, we can use a linear warping function to obtain an optimal alignment
by adjusting shift and stretch parameters to minimize a global error function. See
Figure 2.5.
Figure 2.5 Alignment of genes for the cdc28DS to cdc15DS.
In this work, they used B-splines, a type of spline that is mathematically
convenient for data approximation. B-splines are described as a linear combination of
a set of basis polynomials. This approach has shown to be useful in many cases. It
considers heterogeneous data sets and gives the solution by Cubic spline algorithm.
However, their design is not suitable dealing with data sets need to be mapped by
semi-global alignment—one of the data sets is in fact only former or later part of
fitting process slower. And it is a shame that their statistically analysis did not cover
function annotation.
2.2.3 Aligning gene expression time series with time
warping algorithms [1]
Biological processes have the property that multiple instances of a single process may
unfold at different and possibly non-uniform rates in different organisms, strains,
individuals, or conditions. For instance, different individuals affected by a common
disease may progress at different and varying rates. Increasingly, biological processes
are being studied through time series of RNA expression data collected for large
numbers of genes. Because common processes may unfold at varying rates in
different experiments or individuals, methods are needed that will allow
corresponding expression states in different time series to bemapped to one another.
John Aach and George M. Church present implementations of time warping
algorithms applicable to RNA and protein expression data and demonstrate their
Figure 2.6 Time warping result.
They show time warping to be superior to simple clustering at mapping
corresponding time states. Depending on the domain of application, these might
include cell-specific parameters such as average cell size or physiological parameters
such as blood pressure or temperature. The relative contributions of such parameters
to alignment score calculations can be adjusted using feature weight parameters
already supported by the programs. The alignment programs can also be used not only
to align RNA and protein expression series individually, but series that combine both
RNA and protein data. Finally, the programs can also be applied to aligning
non-temporal series such as expression profiles for cells over a range of
Chapter 3
Materials and Methods
3.1 Materials
3.1.1 Datebases
GEO: a curated, online resource for gene expression data browsing, query and
retrieval[14]. GEO contains 141678 sampling data, which provide us enormous
experimental gene expression profiles. See Figure 3.1. Each dataset is fully annotated
and completely normalized. But the comprehensive data collection, our gene function
predication can be supported by adequate experiments under all kinds of conditions
Figure 3.1 Web page of GEO. [http://www.ncbi.nlm.nih.gov/geo/]
GO: provides a controlled vocabulary to describe gene and gene product attributes in
Figure 3.2 Web page of GO. [http://www.geneontology.org/]
HomoloGene: a system for automated detection of homologs among the annotated
Figure 3.3 Web page of homologene. [http://www.ncbi.nlm.nih.gov/HomoloGene/]
3.1.2 Data set of Experiments and results
We take three types of data to test the proposed scheme: artificial, homogeneous, and
heterogeneous data sets. We will discuss this in chapter result.
3.1.3 Date set of gene funcation predication
The microarray sample was fetched from NCBI GEO in order to be the materials of
prediction. GEO stores many precious gene expression profiles. We took advantage of
GEO’s comprehensive collection of published dataset, and extracted experiments
which consisted of several time points and are suitable to be compared with the time
series. Time series samples generated on Affymetrix GeneChip platform was
considered firstly. The samples of each time point are combined by averaging and
converting to log2 ratio by the mean of the expression level in all time point. That is,
for the gene G1 in a profile with 7 time points T1, T2, …, T7 and each time point have
two samples Et1a, Et1b, Et2a, Et2b, …, Et7a, Et7b. Next, we do the process of averaging,
converting to log2 ratio, and standardization as shown below. We arranged the
dataset as a matrix which represented as following Matrix.
Averaging:
7 ; 2 ) ( ,..., 2 ) ( , 2 ) ( 7 1 1 1 7 7 17 2 2 12 1 1 11∑
= = + = + = + = k k mean b t a t b t a t b t a t A A E E A E E A E E ALog2 ratio:
⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = mean mean mean A A N A A N A A N 1 17 7 1 12 2 1 11 11 lg , lg ,..., lgStandardization:
1 1mean 11 11 N -N S σ =Matix:
47 42 41 4 37 32 31 3 27 22 21 2 17 12 11 1 7 2 1 S ... S S S ... S S S ... S S S ... S S ... G G G G T T TExperiment performed on human and mouse using Affymetrix GeneChips were
extracted from GEO and annotated their functional category by GO. We compiled the
mapping relation of Gene symbol name between SwissProt ID (because part of the
data store in GO is specified by SwissProt) from SwissProt. By this, we can then
annotate the GO term to each gene, and calculate the p-value to suggest biological
significance of the occurrence of this GO term.
3.2 Methods overview
The flowchart of the scheme is presented below (Figure 3.4 ):
Figure 3.4 system flow of the scheme
The whole scheme consists of five components: preprocessing, ranking by
single-gene distance, finding orthologous gene groups, annotation and statistics
analysis and visualization.
preprocessing stage, the datasets are normalized, standardized, and filtered out genes
with flat expressions, which implies these genes are not response dynamically toward
the perturbation in the experiment along with time and therefore lack of referable
meanings in our time-series profile analysis.
After the preprocessing, the remaining genes in human and mouse datasets are
cross-species linked by ortholog linkage provide by Homologene database, which
builds the linkages based on the sequence similarity of genes among species. Each
linked gene pair is calculated its single-gene distance by dynamic time warping
algorithm. All gene pairs are ranked according to their distance. The scheme selects
top T gene pairs for further analysis, since these genes share both sequence and
expression similarity as responding to experimental perturbation.
Next, in order to further infer the functional roles among these genes, we group
these T genes into smaller clusters. Before doing so, we act singular value
decomposition to filter out the noise in expression. We don’t do SVD in preprocessing
stage for several reasons, which will be discussed in following chapters. Then,
K1-means clustering was performed on one of the dataset (in this research, human),
and after that, the scheme acts second time K2-means clustering on another dataset
within each firstly-clustered group into even smaller clusters, which generates K1*K2
clusters. The scheme acts group-dynamic time warping to suggest unified warping
path of this group of genes pairs. Now, by above processes, the parallel essence in
sequence and expression among each group is sufficient to suggest that these groups
of genes play important roles in certain bioprocess, and conserve in evolution process.
In the fourth stage of the scheme, we try to suggest the functional roles of these gene
groups. By the help of GO, GEO and fuzzy nearest clustering algorithm, we can
judge and recognize the candidate genes with biological significance and proposed the
inference that these genes play essential roles in certain bioprocess during the
evolution process.
In the end, to further visualize of result, the scheme which had been carefully
programmed will generate abundant and useful information of the results. The whole
scheme is embedded in our web sever TWins, and the output files can be fed in to
Genesis and grphwarp program for further visualization and analysis.
3.3 Algorithm
3.3.1 Dynamic Time Warping (DTW) algorithm
DTW (see figure 3.12), which is similar to the sequence alignment used in
computational biology, are firstly introduced in speech recognition. By compression
and expansion operations, multiple time points with calculated weight coefficient can
be aligned to a single time point. DTW considers the warping distance according to
the vectors in feature space, and the distance can be evaluated by simple Euclidean
distance, Pearson correlation coefficient, or more complicated functions in which the
Figure 3.5 An illustration of DTW algorithm.
The basic idea of time warping is that replications of nominally the same trajectory
will trace out approximately the same curve (expression profiles pattern), but with
varying time patterns. To minimize the warping distance between two observed
profiles, a recursion to find the minimal distance is the main part of the calculation. This program adopts conventional DTW, see the equation below, whereτ , μare time points, and a, b are the expression values of two time series:
⎪
⎪
⎪
⎩
⎪
⎪
⎪
⎨
⎧
−
⋅
+
−
⋅
+
−
⋅
+
+
=
− − − − j i j i j i j i j i j i j ib
a
D
b
a
D
b
a
D
D
2
2
2
1 , , 1 1 , 1 ,μ
τ
μ
τ
Relative to the a series, a second time series b for a different instance of the process may contain a set of time pointsτ andμ. The sample points may come from a trajectory that traces through different regions of k-space or traces through the same
regions at different rates[1] (Figure 3.6 ). Simple time warping uses dynamic
programming to find the mapping between two series that minimizes a weighted sum
of the k-space distances between the corresponding sample points, subject to
constraints of order preservation and globality. The mapping identifies an optimal
time alignment of the two series. The task of finding it is set up as a dynamic
programming problem by placing the time points of each series along the axes of a
grid, representing alignments as paths through the grid cells, and finding the path with
minimum accumulated weighted distance score.
Figure 3.6 Two time series in a two-dimensional feature space containing sample points from a continuous process, with sample points of each series mapped to each other by simple time warping.
The mappings of the optimal path identify places where multiple time points of
one series correspond to a single time point of the other. Where measurement time
intervals are comparable between the series, these may represent situations in which
phase of the process relative to the instance measured by the other series. We call
such situations compression/expressions and they are analogous to the insertion /
deletions considered in sequence alignment algorithms. Time warping algorithm maps
two time series in a way that compensates for varying relative rate differences in gene
expression levels moving along similar expression trajectories[17].
3.3.2 Singular value decomposition (SVD) and Clustering
approaches
SVD is a common technique for analysis of multivariate data, and gene expression
data are well suited to analysis using SVD. In the literature the number of components
that results from SVD is sometimes associated with the number of underlying
biological processes that give rise to the patterns in the data[18].
Let X denotes an m x n matrix of real-valued data and rank r. In the case of
microarray data, xij is the expression level of the ith gene in the jth assay. The
elements of the ith row of X form the n-dimensional vector gi, which we refer to as
the transcriptional response of the ith gene. Alternatively, the elements of the jth
column of X form the m-dimensional vector aj, which we refer to as the expression
profile of the jth assay.
The equation for singular value decomposition of X is the following:
T
USV
X
=
where U is an m x n matrix, S is an n x n diagonal matrix, and VT is also an n x n matrix. The columns of U are called the left singular vectors (eigengenes), {uk}, and
form an orthonormal basis for the assay expression profiles, so that ui·uj = 1 for i = j,
vectors (eigenarrays), {vk}, and form an orthonormal basis for the gene transcriptional
responses. The elements of S are only nonzero on the diagonal, and are called the
singular values.
In systems biology applications, we generally wish to understand relations
among genes. The signal of interest in this case is the gene transcriptional response gi.
The SVD equation for gi is
m
i
v
s
u
g
r k k k ik i,
where
:
1
,...,
1∑
==
which is a linear combination of the eigengenes {vk}.
SVD is a linear transformation of the expression data from the n-genes x
m-arrays space to the reduced r-eigenarrays x r-eigengenes space[19]. See Figure 3.7
for illustration.
Figure 3.7 SVD for genome-scale expression data analysis. [http://genome-www.stanford.edu/SVD/]
Relation to principal component analysis: There is a direct relation between PCA
and SVD in the case where principal components are calculated from the covariance
matrix. The matrix US then contains the principal component scores, which are the
Even though each component on its own may not necessarily be biologically
meaningful, SVD can aid in the search for biologically meaningful signals[18]. The
height of each singular value indicates its importance in explaining the data. More
specifically, the square of each singular value is proportional to the variance explained
by each singular vector. The relative variances are often plotted (See Figure 3.8). If
the original variables are linear combinations of a smaller number of underlying
variables, combined with some low-level noise, the plot will tend to drop sharply for
the singular values associated with the underlying variables and then much more
slowly for the remaining singular values. One approach is to ignore components
beyond where the cumulative relative variance or singular value becomes larger than a
certain threshold, usually defined upon the dimensionality of the data. Everitt and
Dunn[20] propose an alternate approach based on comparing the relative variance of
each component to 0.7/n[18]. By normalizing the data and filtering out those
eigengens and eigenarrays (i.e. substituting zero for the singular value lower than
0.7/n) that are inferred to represent noise or experimental artifacts, SVD can
Figure 3.8 Visualization of the SVD of cell cycle data. (a) Plots of relative variance; (b) the first eignegen is shown; (c) the second eignegene is shown. (d) The third eigengen lacks the obvious cyclic structure of the first and second.[18]
K-means clustering takes the matrix as input and genes are grouped according to
the value in the row they represented. In our experiment we took Pearson correlation
distance to evaluate how close two genes are.
3.3.3 Guilt-by-association (GBA) principle
GBA infers uncategorized items by the close similarity to known items which can be
judged by evaluating the distance[21]. GBA principle is widely applied in biological
function prediction and candidate gene discovery. In gene expression profile analysis,
uncategorized genes can be grouped together with known genes by the distance or the
3.3.4 Fuzzy Nearest-Cluster (FNC)[21]
FNC utilizes the advantages of both clustering and classification. It contains two parts:
(1) mining by unsupervised approach, hierarchical clustering algorithm; (2) prediction
category of unclassified items by classification methods using GBA principle. See Fig.
3.9.
Figure 3.9 The system flow of FNC.
Figure 3.10 and 3.11 details the clustering algorithm for mining step. In the
algorithm, the FNC focuses on grouping gene expression profiles, in order to mining
Figure 3.10 Algorithm of mining co-expressed subgroups within each function.
Figure 3.11 Pseudo code of mining co-expressed subgroups.
would be predicted its function according to fuzzy k-nearest clusters algorithm
described in figure 3.12 and 3.13.
Figure 3.12 Algorithm of fuzzy k-nearest clusters for functional prediction.
Figure 3.13 Pseudo code of fuzzy k-nearest clusters for functional prediction.
true functional classes in its top-ranked perditions. The classification results in listed
in Table 3.1.
Table 3.1 Classification result (%) for largest 20 functional classes. Values in bold
indicate the top performance in each row.
3.4 Methods
In this section we will disclose the proposed scheme in five steps.
3.4.1 Preprocessing
File format description: The applicable file format consists of three parts:
experiment name, time periods, and expression data with gene symbol. Experment
name needs to be notified by a '>' at the beginning of the first line. Time periods
should start with "Gene" and follow with time points, which are separated by tab.
Next, each line of the expression data are named by its gene symbol as the beginning
of the line, and followed by corresponding time-series expression data separated by
gene name will not be able to map to gene ontology category defined by GO.
Figure 3.14 Input file format.
Convert data to log2 ratio: When the formatted data set is inputted to the system, it
will be converted to log2 ratio according to the mean expression of each row. Please
see 3.1.3 for details.
Data standardization: Since our algorithm considers Euclidean distance as our
distance function, each profile should therefore be standardized, or the distance will
be affected by the expression level of profile, and the trend of pattern will be missing
or ignore by the algorithm. In statistics, a standard score (also called z-score or normal
score) is a dimensionless quantity derived by subtracting the population mean from an
individual (raw) score and then dividing the difference by the population standard
deviation. The z-score reveals how many units of the standard deviation a case is
above or below the mean. So, by standardization, the pattern will be substitute to a
trend according to original mean, which is suitable for our Euclidean distance
Flat expression filtering: A level expression pattern is not helpful throughout our
analysis, since this kind of pattern reveals no information of differential expression,
which hinders our time warping algorithm to distinguish the warping path and
following warping distance. So before entering next stage that performing time
warping algorithm, flat expression should be filtered.
We denote a flat expressed gene as the gene of which expression in each time points
is oscillating in the range of 1.3*(mean of the expression in all samples) and
0.7*(mean of the expression in all samples). In other word, if there are more than one
expression of all time points higher than the upper bound or lower than the lower
bound, this gene will be retained and submitted to next stage. (We do not perform
noise filtering in this stage for several reasons. We will disclose them in discussion
chapter.)
3.4.2 Ranking by single-gene distance
For the reason that we are finding genes that share both sequence- and
expression-level similarity, we have to select genes linked by sequence homology and
with relatively low distance which implies their expression pattern is similar to
achieve our requirement.
Homologous gene linking: One means of searching orthologous gene-expression
profile in multi-species models we refer to is proposed by Grigoryev et al. in 2004[11].
Although their research is not suitable for time series expression profiling which is
strongly influenced by cell growth rate, their idea to associate genes of multi-species
is examined to be very useful. Grigoryev used ortholog links which are identified by
RESOURCERER[12, 22] between the most commonly used Affymetrix rat, mouse,
build the linkage of different species. NCBI Homologene is also competent providing
gene linkage between species.
Figure 3.15 Build the gene linkage between each gene of groups and the gene of another species.
Since each Affymetrix GeneChip has full annotation, the gene symbol of the probe
set is indicated. NCBI Homologene provides great mapping information of gene
symbols of different organisms. Homologene built the mapping relationship according
to their homology on sequence level. Since genes with similar sequence probably
share similar role of biochemical functions, combining sequence homology and
expression profiles to suggest their common functions is intuitionally more reliable.
of another species (see Figure 3.15 ). Next, we perform single gene time warping to
calculate and rank the warping distance to tell the genes with similar expression
pattern.
One-on-one dynamic time warping: When the expression profile are sampled every
single hour from normal cells of human and mouse, the different growth rate affected
the profiling which make direct one-on-one mapping unreliable. Aach’s research[1]
indicated even in the same processes the unfold rates of different experiments or
individuals are different. Dynamic time warping (DTW) algorithms are proposed to
make different time series to be comparable by finding their corresponding expression
states. When two time series expressed in the same pattern however in different rate,
DTW is able to find the optimal warping path which aligns two time series yielding
shortest distance.
Therefore, we leverage the advantage of the DTW to estimate how close two genes
are considering their time difference. By doing so, we can rank all genes by their
warping distance. The top ranking genes show relatively resembling expression,
which means these genes are both sequence and expression analogical.
After collecting these genes, we are going further to specify their function. This
entails the helps of following two stages.
3.4.3 Finding orthologous gene groups
Orthologous genes are likely to share similar pattern of expression. The co-expressed
genes can be inferred to be coding for proteins that partake in common biological
function[21]. We therefore find the gene cluster that share parallel expression pattern
by unsupervised learning approach, which assumes no prior knowledge about the
Singular value decomposition (noise filtering): Before been grouped, SVD was
performed to normalize the data by filtering out the eigengenes and eigenarrays that
are inferred to represent noise or experimental artifacts enables meaningful
comparison of the expression of different genes across different arrays in different
experiments. However, SVD has its mathematical limitation; we will discuss this in
chapter discussion. SVD filters eigengenes of which relative variance lower than 0.7/n
(n is the number of time points) and recombines the matrix again. After SVD, we will
retain only significant information which contains less noise. Expression profile
without observable noise can lead to better clustering result. See Figure 3.16 .
Figure 3.16 The flow chart of stage Finding Orthougous Gene Groups.
The challenge we faced next is how to determine which group of genes has the
common expression among both species. Although each of them has already shown
good expression resemblance, it is not guarantee that all of the genes share parallel
which can assure the groups we found are conserved in both species.
1st K-means clustering: There are numerous techniques can be used for searching
co-expression gene cluster, for example, hierarchical clustering, k-means clustering,
diametrical clustering, etc. These unsupervised learning methods are especially useful
when we try to search candidate genes from huge amounts of genes expression, such
as genome-wide microarray, since the entire gene reaction toward the conditions,
treatments, development stages are not yet completely understood. We decide to use
K-means clustering in our scheme. See Figure 3.17 (a), we groups gene in species A
(in our case, human). By this, human genes in the same group show great pattern
homology, however, not in mouse, which can be solved by doing 2nd K-means clustering.
2nd K-means clustering: We perform K-means clustering algorithm again on linked
genes belongs to another species (in this case, mouse) and therefore divided the
groups, which were firstly grouped and linked, into more specific and conserved parts
Figure 3.17 (a) Group the genes in species A to find genes share similar expression pattern and select the genes in species B by the mappings according to gene linkage. (b) For each group of genes found in (a), perform clustering again on species B in order to divide the selected genes of species B into smaller groups which share resembling pattern. Then follow the gene linkage again to re-construct the relationship.
Group-dynamic time warping: When genes are grouped into k1*k2 clusters by
clustering technique, the next step is to discover the same expression pattern among
multi species. After grouping genes, each group of gene is regarded as a pair (see
Figure 3.18 ). Take all of the pairs as the input of group-DTW algorithm and rank
each pair with their warping distance. The concept is similar to one-on-one DTW,
however this time we are doing group-DTW, which align two groups of genes. Two
groups of genes, each group belong to a species, if there expression profiles are
followed certain rules or patterns, even though their sampling time points and growth
rate are different, DTW would consider all the time points alignment possibility and
provides by DTW, we can figure out which group of genes shows better orthologous
expression.
Figure 3.18 Pair contains two set of genes associated by gene linkage. Each set belongs to a species and builds the corresponding relation by gene linkage.
Next, we are going to assign functions to these groups of genes.
3.4.4 Annotations and statistics analysis
Although in some situations, biologists already know a subset of genes involved in
certain biological pathway of interest; however, as shown in Yi’s research[23], the
gene annotated with GO terms[15] ranging from 25.1% in Danio rerio to 96.2% in
Saccharomyces cerevisiae, also, the gene annotations by MIPS[24] or Biomax is
ranging from 44.3% in Thermoplasma acidophilum to 73.1% in Bacillus subtilis
168[25], the functional roles of each gene are not thoroughly studied. So, we are not
supposed to only focus on these annotated genes. Many useful approaches show great
accuracy when predicting gene functions. Scientists use hierarchical clustering,
Hidden Markov models, SVM, FNC, and so on, to predict the functions of genes.
Fuzzy nearest cluster: We chose FNC to be the prediction approach of our scheme,
since it combines both clustering and classification and display great prediction
Affymatrix gene chip data set of human and mouse is fetched from GEO and
processed (please see 3.1.3). We search each genes in GO to know their functional
term. Any gene without GO term will be predicted by FNC throughout thousands of
experimental data set. The final predication will count on the voting of these
thousands data set. Top three GO terms will be annotated to this gene.
By doing functional predication in human and mouse, we are now having the ability
to annotate and analyze the orthologous gene group found by previous steps.
Known/predicted GO term annotation: Now, genes in each cluster, which share
both sequence and expression similarity can be annotated by known GO term and
predicted GO term by FNC. We have know that the co-expressed genes can be
inferred to be coding for proteins that partake in common biological function. It is
important to understand that these genes are not just co-expressed, they also
conserved in sequence and expression in evolution path. By the annotation, we can
infer the functions of orthologous gene groups according to the statistical test, which
makes the inference more reliable and promising.
Hypergeometric testing: To examine the biological significance of the pairs
(ortholgous gene groups); the known and predicted GO term annotation is take into
account. Genes will be calculated the p-value of GO terms by hypergeometric
distribution:
∑
− = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − − ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − = 1 0 1 k i n N i n M N i M pIn this equation, N is the total number of gene product [15] of platform species,
can be regarded as background. M is the number of genes within the background
specified group.
A pair with relatively low p-value implies that these genes gathering together
shows significant biological meanings, which suggest that the function of this group is
annotated by this specific GO term.
3.4.5 Visualization
In order to provide better understanding and analysis of the results, we design a web
sever and implement several useful functions in our program package. We generate
many kinds of files containing useful information, including warping path, expression
level plot, p-value, and so on. The package gives many delicate illustrations (Figure
3.19 ). We also demonstrate great ability leveraging other tools, like Genesis and
Figure 3.19 (a) Warping path display. (b) Another way to understand warping path. (c) An overview containing profile patterns and the expression level.