基於人類與小鼠微陣列基因表現圖譜識別功能性基因群

(1)

國立交通大學

生物資訊所

碩

士

論

文

基於人類與小鼠微陣列基因表現圖譜識別功能性基因

群

Identifying functional gene clusters based on microarray

gene expression profiles in human and mouse genomes

研究生：洪瑞鴻

指導教授：黃憲達教授

(2)

基於人類與小鼠微陣列基因表現圖譜識別功能性基因

群

Identifying functional gene clusters based on microarray

gene expression profiles in human and mouse genomes

研究生：洪瑞鴻 Student：Jui-Hung Hung

指導教授：黃憲達 Advisor：Hsien-Da Huang

國立交通大學

生物資訊所

碩士論文

A Thesis

Submitted to Institute of Bioinformatics College of Biological Science and Technology

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Bioinformatics

June 2007

Hsinchu, Taiwan, Republic of China

(3)

基於人類與小鼠微陣列基因表現圖譜識別功能性基因群

學生：

洪瑞鴻

指導教授

：黃憲達

國立交通大學生物資訊所碩士班

摘要

跨物種的基因表現圖譜分析能提供在天擇的進程中被保留的基因的功能和其參與機制的訊息，保留在物種間的基因群所扮演的生化功能極可能具有不易被取代的重要功能。尋找這樣的功能性基因群能加速基因療法中候選基因的發現和藥物的開發。為此，本研究提出一個新穎的計算處理架構試圖找出保留於人類和小鼠

的基因群，利用奇異值分解（singular value decomposition）和分群演算法分析

基因表現，並且利用同源連結（ortholog linkage）和時間規整演算法(time warping

algorithm)使來自不同物種（異質性）的微陣列基因表現圖譜時間序列可以比較並依此建議同源基因群。同時，我們實作模糊最近聚類（fuzzy nearest-cluster）方法來預測這些可能在生物過程（ bioprocess ）扮演極重要的角色的同源（orthologous）基因群的功能併施行統計檢定找出具有生物意義、保留在物種間影響細胞週期的基因群。簡而言之，這個研究結合序列和時間序列圖譜層級的相似性來建議跨物種的功能性基因群，提供基因療法實驗的候選基因並希望能貢獻在跨物種基因分析的卓越進展。最後，為了讓整個流程方便應用在未來的相關研究上，整個架構被模組並程式化成一個獨立的可執行套件並結合視覺化的呈現。

(4)

Identifying functional gene clusters based on microarray gene

expression profiles in human and mouse genome

student：

Jui-Hung Hung

Advisors：Dr.

Hsian-Da Huang

Institute of Bioinformatics

National Chiao Tung University

ABSTRACT

Cross-species gene expression analysis provides information of gene functions and

involving mechanisms, which conserved in evolutionary process. Gene groups

conserved in species are very likely to play irreplaceable biochemical functions.

Searching for this kind of functional gene groups can accelerate the discovery of

candidate genes in gene therapy and development of drug design. For this, our

research proposes a novel computational scheme to figure out the genes having

important biochemical functions, especially targets on the genes which are conserved

in human and mouse. These genes conserved across evolutionary history would be

most likely to reveal fundamental biochemical functions. This work utilizes singular

value decomposition (SVD) and clustering techniques to analyze gene expression, and

exploits orthologous linkage and time warping algorithm making microarray

time-series gene-expression profiles of different species (heterogeneous profiles)

comparable to suggest orthologous gene groups. In the meanwhile, in order to make

the results more promising, we use fuzzy nearest cluster method to predict the

(5)

and perform statistical test according to our annotation of predicated gene function to

find the genes having biological significance among these orthologous genes.

In brief, this research combines sequence- and time-series expression- levels ortholog

to suggest functional genes among multiple species, provides materials for candidate

gene therapy experiments and hopes to contribute remarkable advancement in

cross-species orthologous gene analysis. In the end, in order to let the whole process

be utilized in further application, the scheme is modeled and programmed in a

(6)

目

錄

摘要 ... I ABSTRACT ...II 表目錄 ... VI 圖目錄 ... VII CHAPTER 1 INTRODUCTION... 1

1.1 OVERVIEW OF THE SCHEME... 1

1.1.1 Gene expression time series... 1

1.1.2 Gene function prediction... 3

1.1.3 Data clustering ... 3

1.1.4 Distance functions ... 6

1.1.5 Comparative analysis of different species... 7

1.2 MOTIVATION... 8

1.3 GOALS... 9

CHAPTER 2 RELATED WORKS ... 11

2.1 CROSS SPECIES ANALYSIS WITH STATIC PROFILING... 11

2.1.1 Genome-wide expression data of six organisms [10] ... 11

2.1.2 Orthologous expression profiling in multi-species models[11] ... 12

2.2 CROSS SPECIES ANALYSIS WITH TIME SERIES... 14

2.2.1 GSVD for comparative analysis of expression data sets of two different organisms[13]14 2.2.2 Continuous representation of time-series expression profiles[2]... 17

2.2.3 Aligning gene expression time series with time warping algorithms [1] ... 19

CHAPTER 3 MATERIALS AND METHODS ... 21

3.1 MATERIALS... 21

3.1.1 Datebases... 21

3.1.2 Data set of Experiments and results ... 24

3.1.3 Date set of gene funcation predication ... 24

3.2 METHODS OVERVIEW... 26

3.3 ALGORITHM... 28

3.3.1 Dynamic Time Warping (DTW) algorithm ... 28

3.3.2 Singular value decomposition (SVD) and Clustering approaches... 31

3.3.3 Guilt-by-association (GBA) principle... 34

3.3.4 Fuzzy Nearest-Cluster (FNC)[21] ... 35

3.4 METHODS... 38

3.4.1 Preprocessing ... 38

3.4.2 Ranking by single-gene distance... 40

3.4.3 Finding orthologous gene groups ... 42

3.4.4 Annotations and statistics analysis ... 46

3.4.5 Visualization ... 48 CHAPTER 4 IMPLEMENTATION ... 50 4.1 DATABASE... 50 4.1.1 GEO dataset ... 50 4.1.2 GO dataset... 51 4.1.3 Homologene dataset ... 52

4.2 IMPLEMENTATION OF TIME WARPING... 52

4.2.1 DTW core... 52

4.2.2 Group-Euclidean distance function ... 53

4.3 IMPLEMENTATION OF SVD ... 53

4.3.1 SVD core... 53

4.3.2 Noise reduction ... 55

4.4 IMPLEMENTATION OF FNC ... 55

4.4.1 Mining by unsupervised approach... 55

(7)

4.5 STATISTICAL ANALYSIS... 56

4.5.1 Hypergeometric testing... 56

4.6 IMPLEMENTATION OF VISUALIZATION... 57

4.6.1 GD library[28] ... 57

4.6.2 Genesis and graphwarp ... 57

CHAPTER 5 RESULTS... 59

5.1 ARTIFICIAL DATA SET... 59

5.2 YEAST DATA SET... 62

5.3 HUMAN AND MOUSE FETAL LIVER DATA SET... 64

5.4 GENE FUNCTION PREDICTION... 71

5.4.1 Human gene prediction... 72

5.4.2 Mouse gene prediction... 73

5.5 WEB SERVER... 74

5.5.1 Twins overview... 74

5.5.2 Web interface ... 77

5.6 SOFTWARE PACKAGE... 80

CHAPTER 6 DISCUSSION ... 81

6.1 THE LIMITATION OF THE DISTANCE FUNCTION CALCULATION BY DYNAMIC PROGRAMMING. 81 6.2 THE LIMITATION OF SVD... 81

6.3 FUTURE WORK... 82

CHAPTER 7 CONCLUSIONS ... 83

CHAPTER 8 REFERENCES... 84

(8)

表目錄

Table 3.1 Classification result (%) for largest 20 functional classes. Values in bold indicate the top performance in each row.………..………

38

Table 5.1 GO term with high biological significance in case 1………..… 67

Table 5.2 GO term with high biological significance in case 2.…….……… 69

Table 5.3 GO term with high biological significance in case 3.…….……… 71

Table 5.4 GO term predicted in human………... 72

Table 5.5 Known GO term in human……….. 73

Table 5.6. GO term predicted in mouse.………..……… 74

Table 5.7 Known GO term in mouse.……….……….………...… 74

Table 5.8 A List of cell-cycle profile fetched from NCBI GEO………….……… 76

(9)

圖目錄

FIGURE 1.1 GENE EXPRESSION ALSO VARIES WITHIN A CERTAIN TYPE OF CELL AT DIFFERENT POINTS IN TIME.FOR EXAMPLE, THE GENE EXPRESSION PROFILES OF AN ORGAN MIGHT DIFFER BETWEEN NORMAL AND CANCEROUS STATES, AS SHOWN HERE.

[HTTP://WWW.NCBI.NLM.NIH.GOV/CLASS/MLACOURSE/MODULES/MOLBIOREVIEW/GENE_EXPRES

SION_PROSTATE.HTML] ... 2

FIGURE 1.2 DIFFERENT REPRESENTATIONS OF GENE EXPRESSION PROFILES:(A) PATTERN,(B) COLOR SCALE [HTTP://GEPAS.BIOINFO.CIPF.ES/CGIBIN/TUTOX?C=CLUSTERING/CLUSTERING.CONFIG] ... 3

FIGURE 1.3 OVERVIEW OF HIERARCHICAL CLUSTERING OF ALL SAMPLES.GENES AND BLOOD SAMPLES ARE ORGANIZED BY HIERARCHICAL CLUSTERING BASED ON OVERALL SIMILARITY IN EXPRESSION PATTERNS.EXPRESSION LEVELS ARE REPRESENTED BY A COLOR KEY IN WHICH BRIGHT RED REPRESENTS THE HIGHEST LEVELS AND BRIGHT GREEN REPRESENTS THE LOWEST LEVELS, AND LESS SATURATED SHADES REPRESENT INTERMEDIATE LEVELS OF EXPRESSION. [HTTP://WWW.BIOMEDCENTRAL.COM/1471-2164/7/115/FIGURE/F1] ... 5

FIGURE 1.4 DIFFERENT DISTANCES WILL RENDER DIFFERENT CLASSIFICATIONS BECAUSE WE ARE ASKING FOR GROUPING BASED ON DIFFERENT FEATURES (TRENDS IN THE CASE OF CORRELATION AND ABSOLUTE DIFFERENCES IN THE CASE OF EUCLIDEAN DISTANCES) [HTTP://GEPAS.BIOINFO.CIPF.ES/CGIBIN/TUTOX?C=CLUSTERING/CLUSTERING.CONFIG] ... 7

FIGURE 2.1 STARTING FROM A SET OF COEXPRESSED GENES ASSOCIATED WITH A PARTICULAR FUNCTION IN ORGANISM A, THEY FIRST IDENTIFY THE HOMOLOGUES IN ORGANISM B USING BLAST. ONLY SOME OF THESE HOMOLOGUES ARE COEXPRESSED WHILE OTHERS ARE NOT.THE SIGNATURE ALGORITHM SELECTS THIS COEXPRESSED SUBSET AND ADDS FURTHER GENES THAT WERE NOT IDENTIFIED BASED ON SEQUENCE... 12

FIGURE 2.2 DIFFERENT DISTANCES WILL RENDER DIFFERENT CLASSIFICATIONS BECAUSE WE ARE ASKING FOR GROUPING... 14

FIGURE 2.3 ILLUSTRATION OF GSVD... 16

FIGURE 2.4 YEAST AND HUMAN EXPRESSION RECONSTRUCTED IN THE SIX-DIMENSIONAL CELL-CYCLE SUBSPACES APPROXIMATED BY TWO-DIMENSIONAL SUBSPACES. ... 17

FIGURE 2.5 ALIGNMENT OF GENES FOR THE CDC28DS TO CDC15DS. ... 18

FIGURE 2.6 TIME WARPING RESULT. ... 20

FIGURE 3.1 WEB PAGE OF GEO. [HTTP://WWW.NCBI.NLM.NIH.GOV/GEO/] ... 22

FIGURE 3.2 WEB PAGE OF GO. [HTTP://WWW.GENEONTOLOGY.ORG/] ... 23

FIGURE 3.3 WEB PAGE OF HOMOLOGENE. [HTTP://WWW.NCBI.NLM.NIH.GOV/HOMOLOGENE/]... 24

FIGURE 3.4 SYSTEM FLOW OF THE SCHEME... 26

FIGURE 3.5 AN ILLUSTRATION OF DTW ALGORITHM. ... 29

FIGURE 3.6 TWO TIME SERIES IN A TWO-DIMENSIONAL FEATURE SPACE CONTAINING SAMPLE POINTS FROM A CONTINUOUS PROCESS, WITH SAMPLE POINTS OF EACH SERIES MAPPED TO EACH OTHER BY SIMPLE TIME WARPING... 30

FIGURE 3.7 SVD FOR GENOME-SCALE EXPRESSION DATA ANALYSIS. [HTTP://GENOME-WWW.STANFORD.EDU/SVD/]... 32

FIGURE 3.8 VISUALIZATION OF THE SVD OF CELL CYCLE DATA.(A)PLOTS OF RELATIVE VARIANCE;(B) THE FIRST EIGNEGEN IS SHOWN;(C) THE SECOND EIGNEGENE IS SHOWN.(D)THE THIRD EIGENGEN LACKS THE OBVIOUS CYCLIC STRUCTURE OF THE FIRST AND SECOND.[18]... 34

FIGURE 3.9 THE SYSTEM FLOW OF FNC. ... 35

FIGURE 3.10 ALGORITHM OF MINING CO-EXPRESSED SUBGROUPS WITHIN EACH FUNCTION... 36

FIGURE 3.11 PSEUDO CODE OF MINING CO-EXPRESSED SUBGROUPS... 36

FIGURE 3.12 ALGORITHM OF FUZZY K-NEAREST CLUSTERS FOR FUNCTIONAL PREDICTION. ... 37

FIGURE 3.13 PSEUDO CODE OF FUZZY K-NEAREST CLUSTERS FOR FUNCTIONAL PREDICTION. ... 37

FIGURE 3.14 INPUT FILE FORMAT... 39

FIGURE 3.15 BUILD THE GENE LINKAGE BETWEEN EACH GENE OF GROUPS AND THE GENE OF ANOTHER SPECIES. ... 41

FIGURE 3.16 THE FLOW CHART OF STAGE FINDING ORTHOUGOUS GENE GROUPS. ... 43

FIGURE 3.17 (A)GROUP THE GENES IN SPECIES A TO FIND GENES SHARE SIMILAR EXPRESSION PATTERN AND SELECT THE GENES IN SPECIES B BY THE MAPPINGS ACCORDING TO GENE LINKAGE.(B) FOR EACH GROUP OF GENES FOUND IN (A), PERFORM CLUSTERING AGAIN ON SPECIES B IN ORDER TO DIVIDE THE SELECTED GENES OF SPECIES B INTO SMALLER GROUPS WHICH SHARE RESEMBLING PATTERN.THEN FOLLOW THE GENE LINKAGE AGAIN TO RE-CONSTRUCT THE RELATIONSHIP... 45

FIGURE 3.18 PAIR CONTAINS TWO SET OF GENES ASSOCIATED BY GENE LINKAGE.EACH SET BELONGS TO A SPECIES AND BUILDS THE CORRESPONDING RELATION BY GENE LINKAGE... 46

(10)

FIGURE 3.19 (A)WARPING PATH DISPLAY.(B)ANOTHER WAY TO UNDERSTAND WARPING PATH.(C)AN

OVERVIEW CONTAINING PROFILE PATTERNS AND THE EXPRESSION LEVEL. ... 49

FIGURE 4.1 DATABASE SCHEME OF MIRROR GEO. ... 51

FIGURE 4.2 DEMONSTRATION OF SEMI-GLOBAL DTW... 52

FIGURE 4.3 VISUALIZATION OF (A)GENESIS (B) GRAPHWARP... 58

FIGURE 5.1 RESULTS OF (A) PATTERN ONE (INCREASING)(B) PATTERN TWO (DECREASING)(C) PATTERN THREE (TWO PEAKS)... 60

FIGURE 5.2 (A)RESULTS SHOW A TIME SHIFT IN THE BEGINNING OF EACH PEAK.(B)SVD FILTERS LOTS OF NOISE. ... 61

FIGURE 5.3 (A)RESULTS OF MIXED PATTERNS.(B) SEMI-GLOBAL DTW WORKS WELL IN THIS CASE... 62

FIGURE 5.4 GENE CLUSTERING OF (A) NORMAL YEAST CELL, AND (B) KNOCK-OUT YEAST. ... 63

FIGURE 5.5 CDC20 SHOWS TIME SHIFT IN KNOCKOUT YEAST... 64

FIGURE 5.6 LEFT:GENE CLUSTERING OF MOUSE.RIGHT:GENE CLUSTERING OF HUMAN... 65

FIGURE 5.7 RESULT OF UN-SVD(RIGHT) GENES COMPARING TO GENES WITH SVD(LEFT)... 66

FIGURE 5.8 CASE 1: GENES HIGHLY EXPRESSED IN EARLY STAGES... 67

FIGURE 5.9 CASE 2: GENES HIGHLY EXPRESSED IN LATER STAGES. ... 69

FIGURE 5.10 CASE 3: GENES HIGHLY EXPRESSED LATER THAN GENES IN CASE 2... 70

FIGURE 5.11 VERIFIED THE PREDICTION ACCURACY OF FNC IN HUMAN. ... 72

FIGURE 5.12 TWINS SYSTEM FLOW... 75

FIGURE 5.13 TWO TYPE OF OPERATIONS ON WEB INTERFACE.(1) COMPARE TWO UPLOADED EXPERIMENTS, AND (2) COMPARE AN UPLOADED EXPERIMENT WITH DATABASE... 78

FIGURE 5.14 THE RESULT OF USERS’ REQUEST.THE GRAPH IN BLUE SQUARE IS DRAWN BY GRPHWARP. 79

(11)

Chapter 1 Introduction

1.1 Overview of the scheme

Since the high-throughput microarray assay has been widely used, gene function

prediction is shown to be reliable by classifying their gene expression profile

similarity. In some of frontier research, including experimental drug and gene therapy,

understandings of orthologous genes can be helpful accelerating the progress of the

discovery[1].

Research about orthologous gene functional groups searching is not rare but

limited.

This research considers both sequence homology and time-series expression

profile pattern similarity to search for the orthologous functional gene groups among

multiple species, and is able to deal with different time point number and interval, and

it integrates gene functional predication to annotate and suggest their biological

meanings which make the result more reliable.

In sum, this work presents a novel scheme to discover orthologous time-series

gene expression profiles in multiple species. With this scheme, it is now easier to

observe and disclose the important functional genes conserved in evolution process by

time-series microarray profiles.

1.1.1 Gene expression time series

A gene expression profile is the result of microarray analyses, which give the

breakdown of the switching on and off of certain genes (Figure 1.1 ). Gene expression

profile is an important asset, especially for scientifically understanding biological

(12)

all other high throughput assays for gene activity give biologists the chance to view

the global mRNA profile systematically[1]. There are two types of experiments, static

and time series experiments[2]. In static expression experiments, only a snapshot of

the expression of genes in different samples is measured[3]. On the other hand, when

the profile containing the information of time intervals, it reveals the expression of

genes with cell cycle stage, development stage, or any time related pattern. Gene

expression time series is a list of expression data for a gene along a number of

different experimental time intervals and would correspond to a row in the

representation (Figure 1.2 ). Through the variation of mRNA expression level with

time, which enables further investigations of the gene regulation networks[4, 5],

functional groupings of genes, distinction of cell cycles[6], tissue-specific profiling,

etc, scientists are now unraveling the mechanism of bioprocesses efficiently.

Figure 1.1 Gene expression also varies within a certain type of cell at different points in time. For example, the gene expression profiles of an organ might differ between normal and cancerous states, as shown here.

[http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/gene_express ion_prostate.html]

(13)

Figure 1.2 Different representations of gene expression profiles: (A) pattern, (B) color scale [http://gepas.bioinfo.cipf.es/cgibin/tutoX?c=clustering/clustering.config]

1.1.2 Gene function prediction

Determining the functions of genes is an essential problem in biology, which is

fundamental to realize the molecular and biochemical processes, identify and validate

new drug targets and develop reliable diagnostics. Recent advances in genomic

sequencing have generated an astounding number of new putative genes and

hypothetical proteins whose biological function remains a mystery. On average, there

are 70% of the genes in a genome having poorly known or no known functions. There

are two typical techniques that can be used on gene expression data for gene function

annotation or predication. The first technique is clustering, such as hierarchical

clustering, k-means clustering, SVD, and PCA, while the second is classification,

such as Hidden Markov Model (HMM), Support Vector Machine (SVM), and Neural

Networks (NN).

1.1.3 Data clustering

(14)

The primary aim of clustering is to figure out several clusters and centroids or

prototypes and using these centroids to represent the original enormous data.

In brief, data clustering is more likely to attempt to group data in smaller set.

However, some clustering approaches can be used as classifiers as well, and it is

needless to predefine classes (unsupervised learning). Clustering approaches are

feasible to be utilized in gene expression analysis, since the genes are numerous and

the interactions are complex.

Hierarchical clustering: In hierarchical clustering, a series of partitions takes

place, which may run from a single cluster containing all objects to n clusters each

containing a single object[7]. Hierarchical Clustering is subdivided into

agglomerative methods, which proceed by series of fusions of the n objects into

groups, and divisive methods, which separate n objects successively into finer

groupings. One of the simplest agglomerative hierarchical clustering methods is single

linkage, also known as the nearest neighbor technique. The defining feature of the

method is that distance between groups is defined as the distance between the closest

pair of objects, where only pairs consisting of one object from each group are

considered.

The minimum value of these distances is said to be the distance between two

clusters. At each stage of hierarchical clustering, the clusters whose distance is

(15)

Figure 1.3 Overview of hierarchical clustering of all samples. Genes and blood samples are organized by hierarchical clustering based on overall similarity in expression patterns. Expression levels are represented by a color key in which bright red represents the highest levels and bright green represents the lowest levels, and less saturated shades represent intermediate levels of expression.

[http://www.biomedcentral.com/1471-2164/7/115/figure/F1]

K-means clustering: This nonhierarchical method initially takes the number of

(16)

this step itself the final required number of clusters is chosen such that the points are

mutually farthest apart. Next, it examines each component in the population and

assigns it to one of the clusters depending on the minimum distance. The centroid's

position is recalculated every time a component is added to the cluster and this

continues until all the components are grouped into the final required number of

clusters.

K-Means Training starts with a single cluster with its center as the mean of the

data. This cluster is split into two and the means of the new clusters are iteratively

trained. These two clusters are again split and the process continues until the specified

number of clusters is obtained. If the specified number of clusters is not a power of

two, then the nearest power of two above the number specified is chosen and then the

least important clusters are removed and the remaining clusters are again iteratively

trained to get the final clusters.

1.1.4 Distance functions

There are two main families of distances to measure how closely related are two

groups of genes:

Euclidean: this kind of distance strategy calculates the length of two separate points

in n-directional space by their absolute differences[9]. For example, Euclidean

distance is measure by following definition:

For two points A= (a1, a2… an), and B= (b1, b2… bn), Euclidean distance =

∑

= n 1 i 2 i i-b ) (a

(17)

within both profiles[9]. For example, Pearson correlation measures the similarity in

shape between two profiles by the following formula:

For two points A= (a1, a2… an), and B= (b1, b2… bn), Pearson correlation distance

= a -a b -b n 1 -1 b i n 1 i a i ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛

∑

= σ σ i i b a i

i andb ,and , are thestandarddeviation of a andb

a of mean the are b , a where σ σ

These two kinds of distance strategies will lead to different clustering results.

Please see Figure 1.4 for illustration.

Figure 1.4 Different distances will render different classifications because we are asking for grouping based on different features (trends in the case of correlation and absolute differences in the case of Euclidean distances)

[http://gepas.bioinfo.cipf.es/cgibin/tutoX?c=clustering/clustering.config]

1.1.5 Comparative analysis of different species

(18)

the study of biological and evolutionary principles[10]. Although differences among

organisms are often attributed to differential gene expression, genome-wide

comparative analysis thus far has been based primarily on genomic sequence

information.

By miscellaneous gene function predication techniques, biologists are now more

interesting in orthologous gene searching among different species. Since comparative

analysis of the expression data among two or more model organisms promises to

enhance fundamental understanding of the universality as well as the specialization of

molecular biological mechanisms. It also may prove useful in medical diagnosis,

treatment, and drug design. Comparisons of the DNA sequence of entire genomes

already give insights into evolutionary, biochemical, and genetic pathways.

Considering that gene expression profiling gives more information of genes’

biochemistry functional roles, comparative analysis based on microarray data is now a

blossomed area.

1.2 Motivation

Recently, Microarray expression analysis has become an important technique for

evaluating gene expression level in genomic scale. In addition, due to the profound

progress in gene sequencing, considerable number of genes are predicted and found.

However, there are only 30% of genes are explicitly analyzed and understood the

functional roles they playing in biological process[1]. It had been shown that using

microarray gene expression analysis to predict the functions of genes is an important

and efficient means[5]. However, one of the defects of conventional approaches is

that the number of sampling points and growth rate of cells (affected by experiment

(19)

designed to provide comparable samplings, and this is not practical in most of the

cases. Therefore, mostly, searching for functional gene clusters is constrained in

mono-species model and parallel experiment.

Despite so, by compiling the information of gene expression profile from diverse

organisms, tissues, and conditions, scientists are now capable of dissecting more

advanced topics. For instance, Grigoryev D. N., who proposed his renowned research

on Genome Biology, introduced a multi-species model using gene expression profile

to find orthologous gene-expression genes of lung cells suffered from

ventilator-associated lung injury among human, mouse, and dog. Grigoryev also

suggested these genes are potential candidate genes of acute lung injury remedy in the

future, and inferred these genes are conserved among the evolution process because

they play crucial protection functions after lung injury.

Applying the idea of utilizing cross-species or -tissue orthologous gene-expression

profile to search important gene groups and their biological functions to other topics

like cell cycle, is a general concept, which needs further investigation and

development.

Nevertheless, when analyzing microarray time-series gene-expression profile,

scientists have to face the difficulty of coordinating different growth rate of cells and

the number and time intervals of sampling in each independent experiment, especially

of distinct organisms, which is now becoming a pressing issue to let cross-species

gene-expression profiles comparable.

1.3 Goals

This dissertation proposes a novel computational scheme to search and analyze the

(20)

involved in certain bioprocess. We try to contribute to experiment verifying and

treatment development, and answer following questions:

1. How to combine and take advantage of both sequence- and expression-

level orthologous gene predication?

2. How to build up the mapping relationships between genes of multi-species?

3. How to deal with noise or experimental artifacts?

4. How to let different time-series profile be comparable when the experiment

conditions, growth rate, sampling time point number are different?

5. How to make our predication convincing enough?

This research hopes to propose a novel scheme to solve these related tasks on the

basis of other research with integration and improvement, and contributes remarkable

(21)

Chapter 2 Related Works

Some of the existing research had given answers to parts of the questions we devote to

solve; however, these solutions are still not sufficient to resolve our problems

completely. Furthermore, most of them avoid the questions that how to let different

time-series profile be comparable when the experiment conditions, growth rate, and

especially sampling time point number are different

2.1 Cross species analysis with static profiling

2.1.1 Genome-wide expression data of six organisms [10]

S. Bergmann et al. present a comparative study of large datasets of expression profiles

from sic evolutionarily distant organisms: S. cerevisiae, C. elegans, E. coli, A.

thaliana, D. melanogaster, and H. sapiens. They use genomic sequence information

to connect these data and compare global and modular properties of the transcription

programs. Linking genes whose expression profiles are similar, functionally related

sets of genes are frequently coexpressed in multiple organisms. Bergmann integrates

the expression data with genomic sequence information to address three biological

issues. First, we verify that coexpression is often conserved among organisms and

propose a method for improving functional gene annotations using this conservation.

Second, we compare the regulatory relationships between particular functional groups

in the different organisms using the iterative signature algorithm (ISA), giving initial

insights into the extent of conservation of the gene regulatory architecture. See Figure

(22)

Figure 2.1 Starting from a set of coexpressed genes associated with a particular function in organism A, they first identify the homologues in organism B using BLAST. Only some of these homologues are coexpressed while others are not. The signature algorithm selects this coexpressed subset and adds further genes that were not identified based on sequence.

This approach didn’t consider data with time series. Also, they linked data of different

species by BLAST, which can provide sequence level homology, but they use ISA to

extend genes they linked to more genes co-expressed in the same species. In order

words, genes they found in the end only have ortholog in expression-level. Moreover,

although they try to tell the functions of genes they found, but not with clear evidence

and inference.

2.1.2 Orthologous expression profiling in multi-species

models[11]

(23)

species-specific Affymetrix GeneChips to search for candidate genes related VALI

(ventilator-associated lung injury). The individual analysis of species-specific arrays

produced large lists of candidate genes and several challenges, with the most notable

being an excessive number of genes for candidate gene selection. While meta-analysis

strategies exist for narrowing candidate gene selection from multiple experimental

systems, this analysis can only be applied to the same species cross-platform array

comparison, to use this approach for analysis of experiments involving diverse species

we speculated that multispecies gene0expression profiles could be linked using

RESOURCERER[12], which is based on EGO database and contains information for

all commercially available Affymetrix Genechips.

D. N. Grigoryev speculated that overlapping responses to mechanical stretch in

orthologous genes across species might reveal candidate genes involved in an

evolutionarily conserved defense mechanism to lung injury that might be triggered by

ventilator-induced lung injury.

This research first calculated gene-expression changes for each tested species and

linked expression values obtained for orthologous genes. Orthologous genes

exhibiting similar patterns of expression across all species were selected as

VALI-related candidates under the assumption that gene-expression responses

conserved across evolutionary history would be most likely to reveal fundamental

(24)

Figure 2.2 Different distances will render different classifications because we are asking for grouping

The basic concept of linking differentially-expressed gene with other species is

similar to linking co-expressed gene group in our approaches. Even regardless of the

inability of dealing time-series profiles, their scheme need a common experiment

condition (in their case VALI) among samples of all species, which means that the

experiment should be carefully designed and executed. This constraint makes this

approach unpractical in many situations. Although they did do some experiment to

prove genes they found is associate with VALI, however, their scheme unable to give

a global view of gene function in genomic scale.

2.2 Cross species analysis with time series

2.2.1 GSVD for comparative analysis of expression data

sets of two different organisms[13]

GSVD (generalized singular value decomposition) provides a comparative

mathematical framework for two genome-scale data sets from the two-genes X arrays

(25)

are shared by both data sets. Each genelet is expressed only in the two corresponding

arraylets, with a corresponding “angular distance” indicating the relative significance

of this genelet, i.e. its significance, in one data set relative to that in the other (see

Figure 2.3).

O. Alter shows that mathematical reconstruction of gene expression in a subset of

genelets may simulate experimental expression in subset of genelets may simulate

experimental observation of only the process that these genelets are inferred to

represent. By using GSVD, the framework enables comparative reconstruction and

classification of the genes and arrays of both data sets and the comparison of yeast

(26)

Figure 2.3 Illustration of GSVD

GSVD relies on the strong basis of mathematical theory and suggest a general

approach analyzing two data sets. However, the major improvement can be

categorized in three points: flexibility of multi-species model, limitation of data

reduction, and incapableness of heterogeneous data sets.

First, GSVD provides useful framework to analyzing data set from two species.

However, it is not suitable for models consisted of more than two species. Also, data

sets from two species should be in same vector space, i.e. their dimension—the

number of time points—should be the same, which is unpractical in most of the case.

GSVD restrict the data to a small subset of similar conditions, such as time points

along the cell cycle, which drastically reduces the size of the dataset and limits the

scope of comparison[10].

Third, data sets from two species should be in same vector space, i.e. their

dimensions—the number of time points—should be the same, which is unpractical in

(27)

Figure 2.4 Yeast and human expression reconstructed in the six-dimensional cell-cycle subspaces approximated by two-dimensional subspaces.

2.2.2 Continuous representation of time-series expression

profiles[2]

Z. Bar-Joseph et al. present a general algorithm to detect genes differentially

expressed between two nonhomogeneous time-series data sets. Their algorithm

overcomes these difficulties by using a continuous representation for time-series data

and combining a noise model for individual samples with a global difference measure.

They introduce a corresponding statistical method for computing the significance of

this differential expression measure. They used their algorithm to compare cell-cycle

dependent gene expression in wild type and knockout yeast strains. Their experiments

suggest additional roles for the transcription factors Fkh1 and Fkh2 in controlling

cellular activity in yeast.

(28)

of piecewise cubic polynomials and are frequently used for fitting time series and

other noisy data.

Using splines, we can use a linear warping function to obtain an optimal alignment

by adjusting shift and stretch parameters to minimize a global error function. See

Figure 2.5.

Figure 2.5 Alignment of genes for the cdc28DS to cdc15DS.

In this work, they used B-splines, a type of spline that is mathematically

convenient for data approximation. B-splines are described as a linear combination of

a set of basis polynomials. This approach has shown to be useful in many cases. It

considers heterogeneous data sets and gives the solution by Cubic spline algorithm.

However, their design is not suitable dealing with data sets need to be mapped by

semi-global alignment—one of the data sets is in fact only former or later part of

(29)

fitting process slower. And it is a shame that their statistically analysis did not cover

function annotation.

2.2.3 Aligning gene expression time series with time

warping algorithms [1]

Biological processes have the property that multiple instances of a single process may

unfold at different and possibly non-uniform rates in different organisms, strains,

individuals, or conditions. For instance, different individuals affected by a common

disease may progress at different and varying rates. Increasingly, biological processes

are being studied through time series of RNA expression data collected for large

numbers of genes. Because common processes may unfold at varying rates in

different experiments or individuals, methods are needed that will allow

corresponding expression states in different time series to bemapped to one another.

John Aach and George M. Church present implementations of time warping

algorithms applicable to RNA and protein expression data and demonstrate their

(30)

Figure 2.6 Time warping result.

They show time warping to be superior to simple clustering at mapping

corresponding time states. Depending on the domain of application, these might

include cell-specific parameters such as average cell size or physiological parameters

such as blood pressure or temperature. The relative contributions of such parameters

to alignment score calculations can be adjusted using feature weight parameters

already supported by the programs. The alignment programs can also be used not only

to align RNA and protein expression series individually, but series that combine both

RNA and protein data. Finally, the programs can also be applied to aligning

non-temporal series such as expression profiles for cells over a range of

(31)

Chapter 3 Materials and Methods

3.1 Materials

3.1.1 Datebases

GEO: a curated, online resource for gene expression data browsing, query and

retrieval[14]. GEO contains 141678 sampling data, which provide us enormous

experimental gene expression profiles. See Figure 3.1. Each dataset is fully annotated

and completely normalized. But the comprehensive data collection, our gene function

predication can be supported by adequate experiments under all kinds of conditions

(32)

Figure 3.1 Web page of GEO. [http://www.ncbi.nlm.nih.gov/geo/]

GO: provides a controlled vocabulary to describe gene and gene product attributes in

(33)

Figure 3.2 Web page of GO. [http://www.geneontology.org/]

HomoloGene: a system for automated detection of homologs among the annotated

(34)

Figure 3.3 Web page of homologene. [http://www.ncbi.nlm.nih.gov/HomoloGene/]

3.1.2 Data set of Experiments and results

We take three types of data to test the proposed scheme: artificial, homogeneous, and

heterogeneous data sets. We will discuss this in chapter result.

3.1.3 Date set of gene funcation predication

The microarray sample was fetched from NCBI GEO in order to be the materials of

(35)

prediction. GEO stores many precious gene expression profiles. We took advantage of

GEO’s comprehensive collection of published dataset, and extracted experiments

which consisted of several time points and are suitable to be compared with the time

series. Time series samples generated on Affymetrix GeneChip platform was

considered firstly. The samples of each time point are combined by averaging and

converting to log2 ratio by the mean of the expression level in all time point. That is,

for the gene G1 in a profile with 7 time points T1, T2, …, T7 and each time point have

two samples Et1a, Et1b, Et2a, Et2b, …, Et7a, Et7b. Next, we do the process of averaging,

converting to log2 ratio, and standardization as shown below. We arranged the

dataset as a matrix which represented as following Matrix.

Averaging:

7 ; 2 ) ( ,..., 2 ) ( , 2 ) ( 7 1 1 1 7 7 17 2 2 12 1 1 11

∑

= = + = + = + = k k mean b t a t b t a t b t a t A A E E A E E A E E A

Log2 ratio:

_⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ = mean mean mean A A N A A N A A N 1 17 7 1 12 2 1 11 11 lg , lg ,..., lg

Standardization:

1 1mean 11 11 N -N S σ =

Matix:

47 42 41 4 37 32 31 3 27 22 21 2 17 12 11 1 7 2 1 S ... S S S ... S S S ... S S S ... S S ... G G G G T T T

Experiment performed on human and mouse using Affymetrix GeneChips were

extracted from GEO and annotated their functional category by GO. We compiled the

(36)

mapping relation of Gene symbol name between SwissProt ID (because part of the

data store in GO is specified by SwissProt) from SwissProt. By this, we can then

annotate the GO term to each gene, and calculate the p-value to suggest biological

significance of the occurrence of this GO term.

3.2 Methods overview

The flowchart of the scheme is presented below (Figure 3.4 ):

Figure 3.4 system flow of the scheme

The whole scheme consists of five components: preprocessing, ranking by

single-gene distance, finding orthologous gene groups, annotation and statistics

analysis and visualization.

(37)

preprocessing stage, the datasets are normalized, standardized, and filtered out genes

with flat expressions, which implies these genes are not response dynamically toward

the perturbation in the experiment along with time and therefore lack of referable

meanings in our time-series profile analysis.

After the preprocessing, the remaining genes in human and mouse datasets are

cross-species linked by ortholog linkage provide by Homologene database, which

builds the linkages based on the sequence similarity of genes among species. Each

linked gene pair is calculated its single-gene distance by dynamic time warping

algorithm. All gene pairs are ranked according to their distance. The scheme selects

top T gene pairs for further analysis, since these genes share both sequence and

expression similarity as responding to experimental perturbation.

Next, in order to further infer the functional roles among these genes, we group

these T genes into smaller clusters. Before doing so, we act singular value

decomposition to filter out the noise in expression. We don’t do SVD in preprocessing

stage for several reasons, which will be discussed in following chapters. Then,

K1-means clustering was performed on one of the dataset (in this research, human),

and after that, the scheme acts second time K2-means clustering on another dataset

within each firstly-clustered group into even smaller clusters, which generates K1*K2

clusters. The scheme acts group-dynamic time warping to suggest unified warping

path of this group of genes pairs. Now, by above processes, the parallel essence in

sequence and expression among each group is sufficient to suggest that these groups

of genes play important roles in certain bioprocess, and conserve in evolution process.

In the fourth stage of the scheme, we try to suggest the functional roles of these gene

groups. By the help of GO, GEO and fuzzy nearest clustering algorithm, we can

(38)

judge and recognize the candidate genes with biological significance and proposed the

inference that these genes play essential roles in certain bioprocess during the

evolution process.

In the end, to further visualize of result, the scheme which had been carefully

programmed will generate abundant and useful information of the results. The whole

scheme is embedded in our web sever TWins, and the output files can be fed in to

Genesis and grphwarp program for further visualization and analysis.

3.3 Algorithm

3.3.1 Dynamic Time Warping (DTW) algorithm

DTW (see figure 3.12), which is similar to the sequence alignment used in

computational biology, are firstly introduced in speech recognition. By compression

and expansion operations, multiple time points with calculated weight coefficient can

be aligned to a single time point. DTW considers the warping distance according to

the vectors in feature space, and the distance can be evaluated by simple Euclidean

distance, Pearson correlation coefficient, or more complicated functions in which the

(39)

Figure 3.5 An illustration of DTW algorithm.

The basic idea of time warping is that replications of nominally the same trajectory

will trace out approximately the same curve (expression profiles pattern), but with

varying time patterns. To minimize the warping distance between two observed

profiles, a recursion to find the minimal distance is the main part of the calculation. This program adopts conventional DTW, see the equation below, whereτ , μare time points, and a, b are the expression values of two time series:

⎪

⎩

⎪

⎨

⎧

−

⋅

+

−

⋅

+

−

⋅

+

=

− − − − j i j i j i j i j i j i j i

b

a

D

b

a

D

b

a

D

2

1 , , 1 1 , 1 ,

μ

τ

μ

τ

(40)

Relative to the a series, a second time series b for a different instance of the process may contain a set of time pointsτ andμ. The sample points may come from a trajectory that traces through different regions of k-space or traces through the same

regions at different rates[1] (Figure 3.6 ). Simple time warping uses dynamic

programming to find the mapping between two series that minimizes a weighted sum

of the k-space distances between the corresponding sample points, subject to

constraints of order preservation and globality. The mapping identifies an optimal

time alignment of the two series. The task of finding it is set up as a dynamic

programming problem by placing the time points of each series along the axes of a

grid, representing alignments as paths through the grid cells, and finding the path with

minimum accumulated weighted distance score.

Figure 3.6 Two time series in a two-dimensional feature space containing sample points from a continuous process, with sample points of each series mapped to each other by simple time warping.

The mappings of the optimal path identify places where multiple time points of

one series correspond to a single time point of the other. Where measurement time

intervals are comparable between the series, these may represent situations in which

(41)

phase of the process relative to the instance measured by the other series. We call

such situations compression/expressions and they are analogous to the insertion /

deletions considered in sequence alignment algorithms. Time warping algorithm maps

two time series in a way that compensates for varying relative rate differences in gene

expression levels moving along similar expression trajectories[17].

3.3.2 Singular value decomposition (SVD) and Clustering

approaches

SVD is a common technique for analysis of multivariate data, and gene expression

data are well suited to analysis using SVD. In the literature the number of components

that results from SVD is sometimes associated with the number of underlying

biological processes that give rise to the patterns in the data[18].

Let X denotes an m x n matrix of real-valued data and rank r. In the case of

microarray data, xij is the expression level of the ith gene in the jth assay. The

elements of the ith row of X form the n-dimensional vector gi, which we refer to as

the transcriptional response of the ith gene. Alternatively, the elements of the jth

column of X form the m-dimensional vector aj, which we refer to as the expression

profile of the jth assay.

The equation for singular value decomposition of X is the following:

T

USV

X

=

where U is an m x n matrix, S is an n x n diagonal matrix, and VT is also an n x n matrix. The columns of U are called the left singular vectors (eigengenes), {uk}, and

form an orthonormal basis for the assay expression profiles, so that ui·uj = 1 for i = j,

(42)

vectors (eigenarrays), {vk}, and form an orthonormal basis for the gene transcriptional

responses. The elements of S are only nonzero on the diagonal, and are called the

singular values.

In systems biology applications, we generally wish to understand relations

among genes. The signal of interest in this case is the gene transcriptional response gi.

The SVD equation for gi is

m

i

v

s

u

g

r k k k ik i

,

where

:

1 ,...,

1

∑

=

which is a linear combination of the eigengenes {vk}.

SVD is a linear transformation of the expression data from the n-genes x

m-arrays space to the reduced r-eigenarrays x r-eigengenes space[19]. See Figure 3.7

for illustration.

Figure 3.7 SVD for genome-scale expression data analysis. [http://genome-www.stanford.edu/SVD/]

Relation to principal component analysis: There is a direct relation between PCA

and SVD in the case where principal components are calculated from the covariance

matrix. The matrix US then contains the principal component scores, which are the

(43)

Even though each component on its own may not necessarily be biologically

meaningful, SVD can aid in the search for biologically meaningful signals[18]. The

height of each singular value indicates its importance in explaining the data. More

specifically, the square of each singular value is proportional to the variance explained

by each singular vector. The relative variances are often plotted (See Figure 3.8). If

the original variables are linear combinations of a smaller number of underlying

variables, combined with some low-level noise, the plot will tend to drop sharply for

the singular values associated with the underlying variables and then much more

slowly for the remaining singular values. One approach is to ignore components

beyond where the cumulative relative variance or singular value becomes larger than a

certain threshold, usually defined upon the dimensionality of the data. Everitt and

Dunn[20] propose an alternate approach based on comparing the relative variance of

each component to 0.7/n[18]. By normalizing the data and filtering out those

eigengens and eigenarrays (i.e. substituting zero for the singular value lower than

0.7/n) that are inferred to represent noise or experimental artifacts, SVD can

(44)

Figure 3.8 Visualization of the SVD of cell cycle data. (a) Plots of relative variance; (b) the first eignegen is shown; (c) the second eignegene is shown. (d) The third eigengen lacks the obvious cyclic structure of the first and second.[18]

K-means clustering takes the matrix as input and genes are grouped according to

the value in the row they represented. In our experiment we took Pearson correlation

distance to evaluate how close two genes are.

3.3.3 Guilt-by-association (GBA) principle

GBA infers uncategorized items by the close similarity to known items which can be

judged by evaluating the distance[21]. GBA principle is widely applied in biological

function prediction and candidate gene discovery. In gene expression profile analysis,

uncategorized genes can be grouped together with known genes by the distance or the

(45)

3.3.4 Fuzzy Nearest-Cluster (FNC)[21]

FNC utilizes the advantages of both clustering and classification. It contains two parts:

(1) mining by unsupervised approach, hierarchical clustering algorithm; (2) prediction

category of unclassified items by classification methods using GBA principle. See Fig.

3.9.

Figure 3.9 The system flow of FNC.

Figure 3.10 and 3.11 details the clustering algorithm for mining step. In the

algorithm, the FNC focuses on grouping gene expression profiles, in order to mining

(46)

Figure 3.10 Algorithm of mining co-expressed subgroups within each function.

Figure 3.11 Pseudo code of mining co-expressed subgroups.

(47)

would be predicted its function according to fuzzy k-nearest clusters algorithm

described in figure 3.12 and 3.13.

Figure 3.12 Algorithm of fuzzy k-nearest clusters for functional prediction.

Figure 3.13 Pseudo code of fuzzy k-nearest clusters for functional prediction.

(48)

true functional classes in its top-ranked perditions. The classification results in listed

in Table 3.1.

Table 3.1 Classification result (%) for largest 20 functional classes. Values in bold

indicate the top performance in each row.

3.4 Methods

In this section we will disclose the proposed scheme in five steps.

3.4.1 Preprocessing

File format description: The applicable file format consists of three parts:

experiment name, time periods, and expression data with gene symbol. Experment

name needs to be notified by a '>' at the beginning of the first line. Time periods

should start with "Gene" and follow with time points, which are separated by tab.

Next, each line of the expression data are named by its gene symbol as the beginning

of the line, and followed by corresponding time-series expression data separated by

(49)

gene name will not be able to map to gene ontology category defined by GO.

Figure 3.14 Input file format.

Convert data to log2 ratio: When the formatted data set is inputted to the system, it

will be converted to log2 ratio according to the mean expression of each row. Please

see 3.1.3 for details.

Data standardization: Since our algorithm considers Euclidean distance as our

distance function, each profile should therefore be standardized, or the distance will

be affected by the expression level of profile, and the trend of pattern will be missing

or ignore by the algorithm. In statistics, a standard score (also called z-score or normal

score) is a dimensionless quantity derived by subtracting the population mean from an

individual (raw) score and then dividing the difference by the population standard

deviation. The z-score reveals how many units of the standard deviation a case is

above or below the mean. So, by standardization, the pattern will be substitute to a

trend according to original mean, which is suitable for our Euclidean distance

(50)

Flat expression filtering: A level expression pattern is not helpful throughout our

analysis, since this kind of pattern reveals no information of differential expression,

which hinders our time warping algorithm to distinguish the warping path and

following warping distance. So before entering next stage that performing time

warping algorithm, flat expression should be filtered.

We denote a flat expressed gene as the gene of which expression in each time points

is oscillating in the range of 1.3*(mean of the expression in all samples) and

0.7*(mean of the expression in all samples). In other word, if there are more than one

expression of all time points higher than the upper bound or lower than the lower

bound, this gene will be retained and submitted to next stage. (We do not perform

noise filtering in this stage for several reasons. We will disclose them in discussion

chapter.)

3.4.2 Ranking by single-gene distance

For the reason that we are finding genes that share both sequence- and

expression-level similarity, we have to select genes linked by sequence homology and

with relatively low distance which implies their expression pattern is similar to

achieve our requirement.

Homologous gene linking: One means of searching orthologous gene-expression

profile in multi-species models we refer to is proposed by Grigoryev et al. in 2004[11].

Although their research is not suitable for time series expression profiling which is

strongly influenced by cell growth rate, their idea to associate genes of multi-species

is examined to be very useful. Grigoryev used ortholog links which are identified by

RESOURCERER[12, 22] between the most commonly used Affymetrix rat, mouse,

(51)

build the linkage of different species. NCBI Homologene is also competent providing

gene linkage between species.

Figure 3.15 Build the gene linkage between each gene of groups and the gene of another species.

Since each Affymetrix GeneChip has full annotation, the gene symbol of the probe

set is indicated. NCBI Homologene provides great mapping information of gene

symbols of different organisms. Homologene built the mapping relationship according

to their homology on sequence level. Since genes with similar sequence probably

share similar role of biochemical functions, combining sequence homology and

expression profiles to suggest their common functions is intuitionally more reliable.

(52)

of another species (see Figure 3.15 ). Next, we perform single gene time warping to

calculate and rank the warping distance to tell the genes with similar expression

pattern.

One-on-one dynamic time warping: When the expression profile are sampled every

single hour from normal cells of human and mouse, the different growth rate affected

the profiling which make direct one-on-one mapping unreliable. Aach’s research[1]

indicated even in the same processes the unfold rates of different experiments or

individuals are different. Dynamic time warping (DTW) algorithms are proposed to

make different time series to be comparable by finding their corresponding expression

states. When two time series expressed in the same pattern however in different rate,

DTW is able to find the optimal warping path which aligns two time series yielding

shortest distance.

Therefore, we leverage the advantage of the DTW to estimate how close two genes

are considering their time difference. By doing so, we can rank all genes by their

warping distance. The top ranking genes show relatively resembling expression,

which means these genes are both sequence and expression analogical.

After collecting these genes, we are going further to specify their function. This

entails the helps of following two stages.

3.4.3 Finding orthologous gene groups

Orthologous genes are likely to share similar pattern of expression. The co-expressed

genes can be inferred to be coding for proteins that partake in common biological

function[21]. We therefore find the gene cluster that share parallel expression pattern

by unsupervised learning approach, which assumes no prior knowledge about the

(53)

Singular value decomposition (noise filtering): Before been grouped, SVD was

performed to normalize the data by filtering out the eigengenes and eigenarrays that

are inferred to represent noise or experimental artifacts enables meaningful

comparison of the expression of different genes across different arrays in different

experiments. However, SVD has its mathematical limitation; we will discuss this in

chapter discussion. SVD filters eigengenes of which relative variance lower than 0.7/n

(n is the number of time points) and recombines the matrix again. After SVD, we will

retain only significant information which contains less noise. Expression profile

without observable noise can lead to better clustering result. See Figure 3.16 .

Figure 3.16 The flow chart of stage Finding Orthougous Gene Groups.

The challenge we faced next is how to determine which group of genes has the

common expression among both species. Although each of them has already shown

good expression resemblance, it is not guarantee that all of the genes share parallel

(54)

which can assure the groups we found are conserved in both species.

1st K-means clustering: There are numerous techniques can be used for searching

co-expression gene cluster, for example, hierarchical clustering, k-means clustering,

diametrical clustering, etc. These unsupervised learning methods are especially useful

when we try to search candidate genes from huge amounts of genes expression, such

as genome-wide microarray, since the entire gene reaction toward the conditions,

treatments, development stages are not yet completely understood. We decide to use

K-means clustering in our scheme. See Figure 3.17 (a), we groups gene in species A

(in our case, human). By this, human genes in the same group show great pattern

homology, however, not in mouse, which can be solved by doing 2nd K-means clustering.

2nd K-means clustering: We perform K-means clustering algorithm again on linked

genes belongs to another species (in this case, mouse) and therefore divided the

groups, which were firstly grouped and linked, into more specific and conserved parts

(55)

Figure 3.17 (a) Group the genes in species A to find genes share similar expression pattern and select the genes in species B by the mappings according to gene linkage. (b) For each group of genes found in (a), perform clustering again on species B in order to divide the selected genes of species B into smaller groups which share resembling pattern. Then follow the gene linkage again to re-construct the relationship.

Group-dynamic time warping: When genes are grouped into k1*k2 clusters by

clustering technique, the next step is to discover the same expression pattern among

multi species. After grouping genes, each group of gene is regarded as a pair (see

Figure 3.18 ). Take all of the pairs as the input of group-DTW algorithm and rank

each pair with their warping distance. The concept is similar to one-on-one DTW,

however this time we are doing group-DTW, which align two groups of genes. Two

groups of genes, each group belong to a species, if there expression profiles are

followed certain rules or patterns, even though their sampling time points and growth

rate are different, DTW would consider all the time points alignment possibility and

(56)

provides by DTW, we can figure out which group of genes shows better orthologous

expression.

Figure 3.18 Pair contains two set of genes associated by gene linkage. Each set belongs to a species and builds the corresponding relation by gene linkage.

Next, we are going to assign functions to these groups of genes.

3.4.4 Annotations and statistics analysis

Although in some situations, biologists already know a subset of genes involved in

certain biological pathway of interest; however, as shown in Yi’s research[23], the

gene annotated with GO terms[15] ranging from 25.1% in Danio rerio to 96.2% in

Saccharomyces cerevisiae, also, the gene annotations by MIPS[24] or Biomax is

ranging from 44.3% in Thermoplasma acidophilum to 73.1% in Bacillus subtilis

168[25], the functional roles of each gene are not thoroughly studied. So, we are not

supposed to only focus on these annotated genes. Many useful approaches show great

accuracy when predicting gene functions. Scientists use hierarchical clustering,

Hidden Markov models, SVM, FNC, and so on, to predict the functions of genes.

Fuzzy nearest cluster: We chose FNC to be the prediction approach of our scheme,

since it combines both clustering and classification and display great prediction

(57)

Affymatrix gene chip data set of human and mouse is fetched from GEO and

processed (please see 3.1.3). We search each genes in GO to know their functional

term. Any gene without GO term will be predicted by FNC throughout thousands of

experimental data set. The final predication will count on the voting of these

thousands data set. Top three GO terms will be annotated to this gene.

By doing functional predication in human and mouse, we are now having the ability

to annotate and analyze the orthologous gene group found by previous steps.

Known/predicted GO term annotation: Now, genes in each cluster, which share

both sequence and expression similarity can be annotated by known GO term and

predicted GO term by FNC. We have know that the co-expressed genes can be

inferred to be coding for proteins that partake in common biological function. It is

important to understand that these genes are not just co-expressed, they also

conserved in sequence and expression in evolution path. By the annotation, we can

infer the functions of orthologous gene groups according to the statistical test, which

makes the inference more reliable and promising.

Hypergeometric testing: To examine the biological significance of the pairs

(ortholgous gene groups); the known and predicted GO term annotation is take into

account. Genes will be calculated the p-value of GO terms by hypergeometric

distribution:

∑

− = ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − − ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − = 1 0 1 k i n N i n M N i M p

In this equation, N is the total number of gene product [15] of platform species,

can be regarded as background. M is the number of genes within the background

(58)

specified group.

A pair with relatively low p-value implies that these genes gathering together

shows significant biological meanings, which suggest that the function of this group is

annotated by this specific GO term.

3.4.5 Visualization

In order to provide better understanding and analysis of the results, we design a web

sever and implement several useful functions in our program package. We generate

many kinds of files containing useful information, including warping path, expression

level plot, p-value, and so on. The package gives many delicate illustrations (Figure

3.19 ). We also demonstrate great ability leveraging other tools, like Genesis and

(59)

Figure 3.19 (a) Warping path display. (b) Another way to understand warping path. (c) An overview containing profile patterns and the expression level.

基於人類與小鼠微陣列基因表現圖譜識別功能性基因群

國立交通大學

生物資訊所

碩

士

論

文

基於人類與小鼠微陣列基因表現圖譜識別功能性基因

群

Identifying functional gene clusters based on microarray

gene expression profiles in human and mouse genomes

研 究 生：洪瑞鴻

指導教授：黃憲達 教授

基於人類與小鼠微陣列基因表現圖譜識別功能性基因

群

Identifying functional gene clusters based on microarray

gene expression profiles in human and mouse genomes

研 究 生：洪瑞鴻 Student：Jui-Hung Hung

指導教授：黃憲達 Advisor：Hsien-Da Huang

國 立 交 通 大 學

生 物 資 訊 所

碩 士 論 文

基於人類與小鼠微陣列基因表現圖譜識別功能性基因群

學生：

指導教授

國立交通大學生物資訊所碩士班

摘 要

Identifying functional gene clusters based on microarray gene

expression profiles in human and mouse genome

student：

Advisors：Dr.

Institute of Bioinformatics

National Chiao Tung University

ABSTRACT

目

錄

表 目 錄

圖 目 錄

Chapter 1

Introduction

1.1 Overview of the scheme

1.1.1 Gene expression time series

1.1.2 Gene function prediction

1.1.3 Data clustering

1.1.4 Distance functions

∑

∑

1.1.5 Comparative analysis of different species

1.2 Motivation

1.3 Goals

Chapter 2

Related Works

2.1 Cross species analysis with static profiling

2.1.1 Genome-wide expression data of six organisms [10]

2.1.2 Orthologous expression profiling in multi-species

models[11]

2.2 Cross species analysis with time series

2.2.1 GSVD for comparative analysis of expression data

sets of two different organisms[13]

2.2.2 Continuous representation of time-series expression

profiles[2]

2.2.3 Aligning gene expression time series with time

warping algorithms [1]

Chapter 3

Materials and Methods

3.1 Materials

3.1.1 Datebases

3.1.2 Data set of Experiments and results

3.1.3 Date set of gene funcation predication

Averaging:

∑

Log2 ratio:

Standardization:

Matix:

3.2 Methods overview

3.3 Algorithm

3.3.1 Dynamic Time Warping (DTW) algorithm

⎪

⎪

⎪

研究生：洪瑞鴻

指導教授：黃憲達教授

研究生：洪瑞鴻 Student：Jui-Hung Hung

國立交通大學

生物資訊所

碩士論文

摘要

表目錄

圖目錄