
Chung Hua University Master's Thesis

Cross-species Comparison of Putative Promoter Regulatory Elements in Non-human Primates

Department: Computer Science and Information Engineering, Master's Program
Student ID and name: M09102047 蘇美蘭
Advisor: Dr. 張慧玫

January 2005


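ABSTRACT (CHINESE VERSION)

In this thesis, we study transcriptional regulation in primates and humans with a comparative genomics approach. We obtained 222 nonhuman primate promoter sequences, covering 59 genes and 33 species, from GenBank at NCBI as our research data. We then performed basic sequence analysis on these sequences, carried out multiple sequence alignment, and used BLAST for homology-hit analysis. To investigate putative regulatory elements in the promoters, we used two methods: the PCMC method, which finds position-corresponding, highly represented short sequences in primates, and a combination of the less-conserved regions from multiple sequence alignment with the TRANSFAC database. Both were used to analyze putative promoter regulatory sequences in human and primate species, and the results are presented with a graphical interface that we developed and with sequence logos.

According to the sequence divergence study, the promoter sequences of genes within a single nonhuman primate species show greater sequence variability. With the homology-hits method, most primate gene promoters, when compared against all gene databases, have homologs of low randomness and high sequence conservation, whereas the similarity of short fragments of the MHC class I gene family falls in the medium-randomness region, indicating large sequence variability. For highly similar promoters, the PCMC method for finding position-corresponding, highly represented short sequences in primates can identify putative regulatory short sequences similar to those of human. The highly represented regions found statistically by PCMC broadly agree with the similar regions displayed from the multiple sequence alignment by the graphical interface we developed. Among the putative regulatory sequences found by combining the less-conserved regions of the multiple sequence alignment with the TRANSFAC database, the primate sequence preferences for CREB, SRF, IRF-2, and c-Myb are corroborated by similar sequence preferences in the JASPAR vertebrate database. This thesis also identifies new putative primate regulatory sequences not found in other databases; their sequence preferences are given as statistical sequence logos for functional testing in future molecular biology experiments.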


ABSTRACT

In this study, we use comparative genomics to study transcriptional regulation in non-human primates and humans. We retrieved 222 nonhuman primate promoter sequences, covering 59 genes and 33 species, from GenBank at NCBI as the starting dataset. We performed a general analysis of these promoter sequences by multiple sequence alignment (MSA) and by a homology-hit method using BLAST (Basic Local Alignment Search Tool). Then, to determine putative regulatory elements, we analyzed the primate promoter sequences with two methods: the PCMC program, which finds primate-specific, position-corresponding, highly represented short sequences, and a screen of the less-conserved regions obtained through MSA against the TRANSFAC database. The results are presented with a visualization tool that we developed and with statistical sequence logos.

Sequence divergence analysis shows that the promoters of different genes within a single non-human primate species have the greatest sequence variability. In the homology-hit analysis, most gene promoters have less-random, more highly conserved orthologs among the published genomes in NCBI, but the short fragment matches of MHC class I genes fall in the medium-stringency range of random similarity, indicating high variability. The PCMC program for primate-specific, position-corresponding, highly represented short sequences is helpful for identifying putative primate regulatory elements in highly similar promoters. The PCMC results are consistent with the MSA trends shown by our visualization tool. Screening the less-conserved MSA regions against TRANSFAC shows that the primate consensus sequence preferences we obtained for CREB, SRF, IRF-2, and c-Myb elements are consistent with those found in the vertebrate JASPAR database. Novel primate consensus sequence preferences found only in the present study are also listed as sequence logos for future molecular functional analysis.


ACKNOWLEDGEMENT

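For the completion of this master's thesis, I first sincerely thank my advisor, Dr. 張慧玫, for patient teaching and diligent guidance. Throughout my graduate studies, Dr. 張 gave me careful instruction and encouragement both in the professional field and in dealing with people, and the free research atmosphere allowed me to broaden my knowledge and to learn the emerging field of bioinformatics, which inspired and guided my ideas during the research. I also thank the oral defense committee, Professor 劉世華 of this university, Professor 張大慈 of Tsing Hua University, and Professor 黎耀基, for taking time from their busy schedules to review my thesis and for their valuable comments, from which I benefited greatly.

I thank my two junior labmates, 為宏 and 裕翔, whose help with data collection and programming allowed me to complete the thesis research smoothly. I also thank my fellow lab members; your support and encouragement made my time as a master's student a happy one. I further thank the many teachers of the department for enriching my knowledge, and the department assistants for handling administrative matters for us. To the friends around me, thank you for your care, concern, and support; I offer my heartfelt gratitude here.

Finally, I am grateful to my family: my parents, my eldest sister and her husband, my second sister and her husband, and my two younger brothers. Their financial support freed me from worry, and their encouragement kept me going until I obtained my master's degree. I will repay them with this gratitude in the future.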


CONTENTS

ABSTRACT (CHINESE VERSION)
ABSTRACT
ACKNOWLEDGEMENT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDIXES
ABBREVIATIONS
CHAPTER 1. Introduction
CHAPTER 2. Literature Review
2.1 The Information of Promoter Regulatory Elements
2.2 The Importance of Studies of Human Promoter Elements
2.3 The Importance of Studies of Primate Promoter Elements
2.4 Prediction of Transcription Elements Binding Sites
2.5 Thesis Research
CHAPTER 3. Materials and Methods
3.1 Sources of Promoter Sequences
3.2 General Analysis of Promoter Sequences
3.3 Development of Visualization Tool for Element Matching
CHAPTER 4. Experimental Results
4.1 Preliminary Data Analysis
4.2 Results of Multiple Sequence Alignments
4.3 Detection of Putative Transcription Regulatory Elements
4.4 Analysis of Consensus Transcription Regulatory Elements
CHAPTER 5. Discussion and Conclusion
CHAPTER 6. References


LIST OF TABLES

Table

2.1 Prediction programs of transcription elements binding sites
3.1 Summary of the primate, human and rodents promoter sequences
4.1 The divergence of substitution of sequence in the nonhuman primate promoter genes
4.2 Summary of the data of putative binding sites identified from the TRANSFAC database


LIST OF FIGURES

Figure

3.1 The flowchart of this study
3.2 The procedure of the visualization tool for transcription elements binding site
3.3 The output format
4.1 (a) The distribution of numbers of species in genes available for nonhuman primates. (b) The distribution of numbers of genes available for nonhuman primates
4.2 Phylogenic trees of Gorilla
4.3 The relation of hit numbers and E-values (-log E-value scale) using the BLAST program
4.4 Three kinds of output results of the Gogo-A promoter sequence using the ClustalW program
4.5 Three kinds of output results of the AFP promoter sequence using the ClustalW program
4.6 Phylogenic trees of lower-similarity genes
4.7 Output format of the AFP promoter gene using the PCMC program
4.8 Diagram format by a visualization tool for putative conserved regulatory element sites of higher-similarity promoter sequences from PCMC program (a), TRANSFAC database (b) and multiple sequence alignment (c)
4.9 Diagram format by a visualization tool for putative conserved regulatory element sites of lower-similarity promoter sequences from PCMC program (a), TRANSFAC database (b)
4.10 The distribution of putative regulatory elements in the conserved regions (a) and less-conserved regions (b)
4.11 Comparison of some consensus binding sequences from our study and JASPAR database
4.12 Comparison of TATA box and CAAT box sequence from NCBI, JASPAR database, and our study
4.13 Sequence representation and position frequency of putative consensus regulatory elements in non-human primates


LIST OF APPENDIXES

Appendix I Taxonomy of primates from NCBI
Appendix II Primate images from Primate Info Net
Appendix III The source code of visualization program written in Visual Basic 6.0 for transcription elements binding site


ABBREVIATIONS

Alpha-CBF    Alpha core-binding factor
Alpha-IRP    Alpha inverted repeat protein
AFP          Alpha-fetoprotein
AP-1         Activator protein-1
APP          Amyloid beta precursor protein
BMP2         Bone morphogenetic protein 2
CFTR         Cystic fibrosis transmembrane conductance regulator
CCR5         CC chemokine receptor 5
CREB         Cyclic AMP-response element binding protein
DARC         Duffy antigen
DPYD         Dihydropyrimidine dehydrogenase
Fmr1         Fragile X mental retardation syndrome protein
GPHA         Glycoprotein hormone alpha subunit
INMT         Indolethylamine N-methyltransferase
IRF-2        Interferon regulatory factor 2
LPA          Lipoprotein Lp(a)
MAO A        Monoamine oxidase A
MC1R         Melanocortin 1 receptor
MCP1         Monocyte chemoattractant protein 1
RB1          Retinoblastoma protein 1
RHAG         Rh50 glycoprotein
SRF          Serum response factor
TNF          Tumor necrosis factor
TNFA         Tumor necrosis factor alpha
UGT1A1       Uridine diphosphate glucuronosyltransferase 1A1
VHL          Von Hippel-Lindau tumor suppressor


CHAPTER 1 Introduction

Among the most important functional elements in any genome are those of transcriptional regulation: the transcription factors (TFs) and the sites within the DNA to which they bind [1]. Interactions between these proteins and DNA regions control many important processes, such as development and responses to environmental stress, and defects in them contribute to the progression of various diseases. Much progress has been made recently in accumulating and analyzing mRNA transcript profiles of a variety of cells and tissues, including those associated with various human diseases [2]. Much remains to be understood, however, about the transcriptional regulatory networks that govern these expression profiles.

Non-human primates have contributed to studies in human physiology, medicine, and immunology. With the sequencing of the chimpanzee genome, non-human primates also provide a bridge toward understanding functional genomics through transcriptomic and proteomic studies. Chimpanzees are estimated to have shared a common ancestor with humans only 4.6 to 6.2 million years ago, and their genomic sequence differs from the human sequence by only 1.6% [1].

Whether the regulation of genomic DNA sequences accounts for the complexity of humans is currently a subject of intense investigation. Since this regulation acts through expression of the transcriptome and proteome, the present study focuses on extracting the promoter elements important for transcription. A concern is that rules previously deduced from the small percentage of known TATA-containing promoters might not apply to the majority of still-uncharacterized promoters in the human genome. We take an evolutionary approach, comparing promoters across non-human primate species. Identifying promoter elements of non-human primates should shed some light on the regulation of TATA-containing and TATA-less promoters in humans.

Transcription regulatory elements (also called cis-acting elements or promoter elements) are sequences within the promoter region of a gene that bind transcription factors, so that the downstream gene can be turned on under the conditions set by those factors and elements. Genome-wide exploration of transcription factor binding sites can be achieved by genetic, biochemical, and large-scale ChIP-chip (chromatin immunoprecipitation and DNA chip) experiments [4-5]. A promoter is generally described as a TATAAA motif located about 30 nucleotides upstream of the transcription start site (TSS), at position -30, together with a G-C enriched region downstream of the TSS. The accumulation of data defining the TSS is limited by the experimental difficulty of 5'-RACE, primer extension, and cDNA cloning. New experimental approaches are therefore urgently needed to reveal the functions and interactions of genes in the genome. Computational prediction can accelerate this process by providing putative promoter candidates. Computational methods for detecting transcription start sites include TATA box motif detection [8], hidden Markov models [9], and neural networks [10]. Agreement between experimental measurements and theoretical predictions of promoters in the human genome is unsatisfactory because of poor sensitivity, a high false-positive rate, and poor positional accuracy [11], so the results of these analyses have had limited further biological use. We therefore aim to add information about primate promoter regions that is of value for studying human-primate comparisons and divergence.

We retrieved 222 nonhuman primate promoter sequences, covering 59 genes and 33 species, from GenBank at NCBI as the starting dataset for our study. We performed a general analysis of these promoter sequences by multiple sequence alignment (MSA) and by a homology-hit method using BLAST (Basic Local Alignment Search Tool). Then, to determine putative regulatory elements, we analyzed the primate promoter sequences with two methods: the PCMC program, which finds primate-specific, position-corresponding, highly represented short sequences, and a screen of the less-conserved regions obtained through MSA against the TRANSFAC database. The results are presented with a visualization tool that we developed and with statistical sequence logos.

This thesis is organized as follows. Chapter 2 describes the definition and significant features of the promoter and reviews studies of human and other primate promoter elements. Chapter 3 covers our materials and analytic methods. Chapter 4 presents our experimental results, and Chapter 5 gives the discussion and conclusions.


CHAPTER 2 Literature Review

This chapter introduces the importance of regulatory elements in mediating gene expression, as well as the regulatory regions of primate promoters.

2.1 The Information of Promoter Regulatory Elements

The well-known central dogma of genetic information states that DNA is transcribed to RNA, which is subsequently translated to protein [12]. Within this framework, transcriptional regulatory regions in the genome are characterized by the target binding sites for transcription factors. Transcription factors are proteins that recognize short sequences (5 to 25 base pairs) in non-coding regions containing enhancers or promoters. The interactions between transcription factors and their binding sequences activate or repress the transcription of a particular gene. Distant promoter regulatory elements control gene expression in a temporal, tissue-specific, or condition-dependent manner. These distant elements are characterized by their ability to drive the expression of recombinant constructs that contain putative promoter elements and reporter genes. Specific transcriptional elements control gene expression in particular tissues, organisms, or developmental stages. By slicing the promoter into fragments for such reporter experiments, it is possible to locate the positions of specific regulatory elements. Knowledge of the transcription factors that mediate gene expression therefore provides important insight into how the function of a gene is regulated. The binding specificity of a transcription factor is typically represented as a consensus sequence or a position weight matrix that summarizes its position-specific sequence preferences [13]. In some cases, such 'motif' models of transcription factor binding sites can be inferred from genome sequences using computational methods [14-17].
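
As an illustration of how a set of aligned binding sites is summarized, the following minimal Python sketch builds a position frequency matrix (the count table from which a position weight matrix is derived) and reads off a consensus sequence. The site list and the function names are hypothetical and are not taken from this study's data or from any database.

```python
from collections import Counter

BASES = "ACGT"


def position_frequency_matrix(sites):
    """Count each base at each position of equal-length binding sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        matrix.append({b: counts.get(b, 0) for b in BASES})
    return matrix


def consensus(matrix):
    """Most frequent base per position (ties resolved in A, C, G, T order)."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in matrix)


# Hypothetical TATA-like sites, invented for illustration only.
example_sites = ["TATAAA", "TATAAG", "TATATA", "TTTAAA"]
pfm = position_frequency_matrix(example_sites)
print(consensus(pfm))  # prints TATAAA
```

A consensus string keeps only the most frequent base at each position, whereas the matrix retains how strongly each position prefers that base.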

Although regulatory regions are not under the same evolutionary constraints as coding sequences, comparison of regulatory regions from related species reveals highly conserved sequences [18-21]. Such sequence conservation suggests similar regulation of function and generates testable hypotheses, which have often been confirmed [18,19]. Most functional regulatory elements are widely distributed in the non-coding regions of the genome [22,23]. However, non-coding regions make up the majority of mammalian genomes, and their function is largely unknown. Computational methods are therefore helpful for identifying conserved, potentially functional non-coding sequences [24-26]. Comparative genomics is an effective way to identify similarities and differences among genomes [26-29]. Comparative analyses of complete genome sequences are used extensively to identify coding regions [30,31] and conserved non-coding regions [32,33], including regulatory elements [34,35]. Likewise, DNA sequences that control the expression of genes regulated similarly in different species can be identified through their conservation. Conversely, sequences that encode proteins and RNAs responsible for differences among species will themselves be divergent [26].

2.2 The Importance of Studies of Human Promoter Elements

The first draft of the entire human genome, about three billion base pairs, was published in February 2001 by the International Human Genome Sequencing Consortium [36]. The draft revealed that the number of human genes, put at 30,000 to as many as 40,000, is significantly lower than previously estimated. This result remained true after the full sequence was completed and published in April 2003. The sequencing and functional analysis of the human genome are transforming biological research. Future advances in biomedical research on the causes and treatment of human diseases will be accelerated through continuing elaboration and interpretation of this information.

The availability of these genomic sequence data makes it possible to address functional genomic questions about how this relatively small number of genes accounts for the complexity of human biology. These include, first, transcriptomic promoter analyses, using expression arrays to study gene regulation and confocal microscopy to report tissue or cellular localization, and second, proteomic studies, using yeast two-hybrid assays for protein-protein interactions, mass spectrometry for protein identification, and protein arrays for post-translational modifications. For transcriptomic promoter analyses, computer tools that search for candidate regulatory sequences are crucial for guiding empirical biological studies. This motivated the present study.

2.3 The Importance of Studies of Primate Promoter Elements

Extensive analysis of genome structure and function in selected nonhuman primates could make immediate and significant contributions to understanding the mis-regulation found in many human diseases. Studies of primate genomics will become an important adjunct to human genomic research. Primate models of human diseases such as atherosclerosis, AIDS, diabetes, osteoporosis, neurodegeneration, mental illness, alcohol dependency, asthma, cancer, and others are critical to the long-term success of biomedical research.

It should be noted, however, that current biological predictions from the human genome are largely based on the small percentage of known TATA-containing promoters, so these results may not carry over to the regulatory principles of the majority of uncharacterized promoters.

Fortunately, cross-species homologous cloning has opened the way to an evolutionary approach. In addition, the availability of genetic information will enhance the value of nonhuman primates, both as models for specific disease etiology and as tools for understanding normal physiological processes.


2.4 Prediction of Transcription Elements

Traditionally, much of the information on the binding ability and function of transcription elements has been determined with experimental methods such as footprinting, nitrocellulose binding assays, gel-shift analysis, and southwestern blotting. High-throughput technologies have also been developed to identify binding sites, such as SELEX (systematic evolution of ligands by exponential enrichment) [37] in vitro and ChIP-chip (chromatin immunoprecipitation assays) [4] in vivo. These methods are generally quite time-consuming. In recent years, as bioinformatics data have accumulated, new computational approaches and algorithms have appeared for predicting the binding sites of regulatory elements. Specialized databases on transcriptional regulation, including TRANSFAC [38], TFD [39], and IMD [40], provide access to sequence-analysis tools. TRANSFAC contains eukaryotic cis-acting regulatory DNA elements and trans-acting factors from yeast to human. It consists of six cross-linked tables: SITE, CELL, FACTOR, CLASS, MATRIX, and GENE. TRANSFAC-associated prediction programs include MatInspector, which uses a library of matrices selected from the TRANSFAC MATRIX table. Other software, such as AliBaba2, Match 1.0, and TFBLAST, has also been developed from TRANSFAC. These public in silico prediction programs for transcription factor binding sites are listed in Table 2.1. In this study, we selected the Signal Scan program [41] to search for putative transcription elements. Signal sequences are retrieved from these databases, and the results are presented on the Signal Scan website. Details are given in Section 3.3.2.

2.5 Thesis Research

In the present study, we first collected 222 nonhuman primate promoter sequences of 59 genes from NCBI and aligned them with those of human and rodents. Sequence analysis was performed with multiple sequence alignments and bioinformatics tools. The results from the various computer programs showed that the consensus putative promoter regulatory elements of nonhuman primates might differ from their human or rodent counterparts. Information on regulatory elements obtained by comparing sequences from multiple species can support biological research.


CHAPTER 3 Materials and Methods

This chapter introduces the sources of our data and the online bioinformatics tools used in our analysis. The flowchart of the study is given in Figure 3.1.

3.1 Sources of Promoter Sequences

Promoter sequences of nonhuman primate species were identified by a combination of keyword searches (by gene name and by nonhuman primate species name). We retrieved 222 nonhuman primate promoter sequences of 59 genes from GenBank at NCBI (National Center for Biotechnology Information), covering 33 species in 19 genera, as the starting dataset for our study (2004/04/13). The promoter sequences ranged from 90 to 5848 base pairs in length and were obtained by experimental methods and from the region upstream of the mRNA sequence. Some primate sequence data (1 sequence for 11beta hydroxylase [42], 2 for HaA [43], 1 for histone H1t [44], and 6 for Gogo-A, 4 for Gogo-B, 3 for Patr-A, 2 for Patr-B, 4 for Popy-A, and 2 for Popy-B [45]) were also collected from the literature. Some entries identify functional binding sites in the promoter region. Ongoing primate genomic resources are available at the BACPAC Resource Center (http://www.chori.org/bacpac). We analyzed the regulatory elements of nonhuman primate promoter sequences by contrasting them with human and rodent sequences, which were obtained by keyword and BLAST searches.

Some of the human sequences (APP [46], CCR5 [47], CFTR [48], factor IX [49], MSH2 [50], StAR [51], TNFA [18]) were also obtained from the literature. In total, we obtained 24 human and 5 rodent promoter sequences. A summary of the data is presented in Table 3.1.


3.2 General Analysis of Promoter Sequences

3.2.1 Multiple sequence alignments (MSA)

The primate promoter sequences were aligned with the ClustalW tool [52] on the workbench site (http://workbench.sdsc.edu/) and checked by manual inspection. The tool also constructs phylogenic trees from the promoter sequences with PHYLIP [53], using the neighbor-joining algorithm.

We then performed MSA among homologs from the primates, human, and rodents. Rodent homologous promoters were used as a phylogenic outgroup.

3.2.2 The significance of BLAST hits

A BLASTn search returns hits, sequences that produce “significant” alignments to the query sequence. The significance of a hit is measured by its expect value (E-value), a statistical measure of the likelihood that a similar sequence would occur by chance. Each alignment also has a bit score (S), a measure of similarity between the hit and the query, which is reported in the column next to the E-value. Since the query length and database size are fixed, a hit must have a larger bit score for its E-value to be smaller. In other words, a lower E-value threshold corresponds to a smaller number of hits. However, running BLAST at a single threshold does not show how the similarity of an entry is distributed. We therefore adopted the homology-hit analysis developed by Dr. Shinozawa and colleagues, which uses multiple thresholds and was originally applied to determine the origin of the nucleus [54]. They compared the number of orthologs to yeast ORFs (open reading frames) in each prokaryotic genome. Following this approach, we used BLAST to determine the hit numbers of orthologous nonhuman primate promoters at different E-values. The method assumes that the evolutionary rate of each gene remains constant over time. Because evolutionary rates differ between genes, the hit number for each primate promoter is expected to decrease gradually with decreasing E-value. The threshold was set at intervals of 10 over the range 10 to 180 on the -log E scale, and the hit numbers for the nonhuman primate promoters were calculated at each threshold. The initial E-value is set to exclude potential orthologs with low sequence similarity. As stringency increases, that is, as -log E rises and only potential orthologs with higher sequence similarity are retained (moving to the right on the X-axis), the hit number for each gene declines, indicating that the remaining hits are more likely to be true orthologs. As stringency decreases, that is, as -log E falls and potential orthologs with lower sequence similarity are admitted (moving to the left on the X-axis), the hit number for each gene increases, and the genes behave similarly, reflecting random matches against the whole database.
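
The multiple-threshold counting described above can be sketched in Python as follows. This is a simplified illustration, not the program used in this study: the E-values are invented, hits are counted cumulatively (a hit counts at every threshold it meets), and an E-value reported as 0 is capped at -log E = 180. All three choices are assumptions.

```python
import math


def neg_log10_e(e_value, cap=180.0):
    """Convert an E-value to -log10(E); E = 0 is capped at `cap` (assumption)."""
    if e_value <= 0.0:
        return cap
    return min(-math.log10(e_value), cap)


def hits_per_threshold(e_values, start=10, stop=180, step=10):
    """Number of hits whose -log10(E) meets or exceeds each threshold."""
    scores = [neg_log10_e(e) for e in e_values]
    return {
        threshold: sum(1 for s in scores if s >= threshold)
        for threshold in range(start, stop + step, step)
    }


# Hypothetical E-values for the hits of one query promoter.
example_e_values = [2e-5, 3e-12, 2e-45, 5e-95, 4e-173, 0.0]
counts = hits_per_threshold(example_e_values)
for threshold in (10, 50, 100, 180):
    print(threshold, counts[threshold])
```

Plotting such counts against the threshold for each gene gives one curve per promoter, which is the kind of display used to compare genes across stringency levels.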

3.3 Development of Visualization Tool for Element Matching

3.3.1 Promoter cluster motif classification (PCMC) program (Dr. Tun-Wen Pai, Department of Computer Science, National Taiwan Ocean University)

The PCMC promoter analysis program performs multiple sequence alignment with a sliding scanning model. It can rapidly identify common sequence fragments of 5 to 20 bp among up to 20 sequence entries, with the overall input length limited to 30 kb [55]. The frequency of each pattern is calculated and its positions are displayed. Important transcription binding motifs, such as the TATA box, can thus be located in each primate promoter when present.
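
The idea of finding short fragments shared among several promoters, together with their positions, can be illustrated with the minimal Python sketch below. It is not the PCMC algorithm of [55]; it is a plain exact-match k-mer count under assumed names (`shared_kmers`, `min_fraction`) and invented toy sequences.

```python
from collections import defaultdict


def shared_kmers(sequences, k=6, min_fraction=1.0):
    """Map each k-mer present in at least `min_fraction` of the sequences
    to its 0-based start positions in every sequence that contains it."""
    positions = defaultdict(lambda: defaultdict(list))
    for name, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]][name].append(i)
    needed = min_fraction * len(sequences)
    return {
        kmer: dict(by_seq)
        for kmer, by_seq in positions.items()
        if len(by_seq) >= needed
    }


# Hypothetical toy promoters, invented for illustration only.
toy = {
    "species_a": "GGCTATAAAGGC",
    "species_b": "ACCTATAAATTG",
    "species_c": "TTTTATAAACCA",
}
common = shared_kmers(toy, k=6)
print(common)
```

In this toy input the only 6-mer shared by all three sequences is TATAAA, at position 3 in each; lowering `min_fraction` admits fragments shared by only some of the sequences.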

3.3.2 Search for transcription elements from TRANSFAC 2.5 database

The second promoter analysis method aligned the nonhuman primate promoter sequences against the transcription factor database TRANSFAC (Version 2.5) [38] on the Signal Scan Version 4.05 platform [41]. TRANSFAC is a large library of eukaryotic transcription factors, their genomic binding sites, and their DNA-binding profiles, covering organisms from yeast to human. Signal Scan is software developed by Dr. Dan Prestridge to help molecular biologists find eukaryotic transcription factor elements based on the TRANSFAC database, and it is available from the Signal Scan website. It is most useful for analyzing mammalian sequences because mammalian elements predominate in the database. It matches a user's input DNA sequence against both specific sequence elements derived from biochemical characterization and elements derived from consensus sequences. Any element in the input sequence that matches an element in the Signal Scan database is reported by the program; the biological relevance of the element must then be confirmed by further biochemical investigation.
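
The kind of element matching that Signal Scan performs can be approximated by the following Python sketch, which searches both strands of an input sequence for consensus strings written in IUPAC nucleotide codes. It is not Signal Scan itself, and the two-entry element table is a hypothetical stand-in for TRANSFAC entries.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical element table for illustration; not copied from TRANSFAC.
ELEMENTS = {"TATA box": "TATAAA", "CAAT box": "CCAAT"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan(sequence, elements):
    """Return (name, strand, 0-based start on the + strand, matched text)."""
    sequence = sequence.upper()
    hits = []
    for name, consensus in elements.items():
        body = "".join(IUPAC[c] for c in consensus.upper())
        pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlaps
        strands = (("+", sequence), ("-", reverse_complement(sequence)))
        for strand, text in strands:
            for match in pattern.finditer(text):
                start = match.start()
                if strand == "-":
                    start = len(sequence) - start - len(consensus)
                hits.append((name, strand, start, match.group(1)))
    return sorted(hits, key=lambda h: (h[2], h[0], h[1]))


for hit in scan("GGCCAATCTTATAAAGG", ELEMENTS):
    print(hit)
```

As with the real program, a reported match is only a candidate site; short consensus strings match often by chance, so each hit still needs experimental confirmation.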

3.3.3 Overrepresented and primate-specific transcription elements detection methods

Putative primate transcription elements were first detected by cross-species multiple alignment with ClustalW. Conserved regions of the primate promoters, those with 100% identity after alignment, were examined for sequences and sites of transcription factor motifs from the TRANSFAC database, to look for overrepresentation.

Motifs within the less-conserved regions (60% to less than 100% similarity after MSA) that resemble transcription motifs in the TRANSFAC database were regarded as putative primate promoter elements that deviate more from those of human.
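
One possible way to mark conserved and less-conserved regions in an alignment is sketched below in Python. It assumes that similarity is scored per alignment column as the share of sequences carrying the most common base, that gaps never count as matches, and that columns below 60% are simply labeled variable. The thesis does not specify these details, and the alignment shown is invented.

```python
from collections import Counter
from itertools import groupby


def column_identity(column):
    """Fraction of rows sharing the most common base; gaps never count."""
    counts = Counter(c for c in column if c != "-")
    return max(counts.values(), default=0) / len(column)


def classify(identity, lower=0.6):
    if identity == 1.0:
        return "conserved"
    if identity >= lower:
        return "less-conserved"
    return "variable"


def regions(alignment, lower=0.6):
    """Return (label, start, end) runs of alignment columns, end exclusive."""
    labels = [classify(column_identity(col), lower) for col in zip(*alignment)]
    runs, position = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        runs.append((label, position, position + length))
        position += length
    return runs


# Hypothetical aligned promoter fragments ("-" marks a gap).
ALIGNMENT = [
    "TATAAAGC-T",
    "TATAAAGCAT",
    "TATATAGGAT",
    "TATAAACTAT",
    "TATAAAGAAT",
]
for run in regions(ALIGNMENT):
    print(run)
```

Each run can then be cut out of the alignment and passed to a motif search, so that conserved and less-conserved regions are screened separately.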

3.3.4 Development of visualization tool for transcription elements

A further layout tool was developed by my labmate Wei-Hung Yang, currently a graduate student in the Department of Computer Science and Information Engineering at Chung Hua University. The program provides a graphical user interface for viewing the relative positions of common motifs in different input DNA sequences. It is written in Visual Basic 6.0 and runs on the Windows platform (source code in Appendix III). Input files must be in FASTA format: lines beginning with ">" are ignored, and the remaining lines are filtered to keep only the nucleotides A, T, C, and G. The common motifs to be searched for must be sequences shared among the input sequences that meet the criteria of the program (Figures 3.2 and 3.3).
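
The input filtering rule can be sketched in Python (rather than the Visual Basic 6.0 of the actual tool) as follows. The function names and the sample records are hypothetical, and, unlike the rule stated above, this sketch keeps the header text as a record name instead of discarding it.

```python
def read_fasta_text(text):
    """Parse FASTA text into {header: sequence}, keeping only A, T, C, G."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            kept = "".join(c for c in line.upper() if c in "ATCG")
            records[name].append(kept)
    return {n: "".join(parts) for n, parts in records.items()}


def motif_positions(records, motif):
    """0-based start positions of `motif` in each sequence, overlaps included."""
    motif = motif.upper()
    result = {}
    for name, seq in records.items():
        starts, i = [], seq.find(motif)
        while i != -1:
            starts.append(i)
            i = seq.find(motif, i + 1)
        result[name] = starts
    return result


# Hypothetical input; digits and spaces are dropped by the filter.
fasta = """>seq1
ggcTATAAAgg 12
>seq2
TATATATA
"""
records = read_fasta_text(fasta)
print(motif_positions(records, "TATA"))
```

The position lists are what a graphical front end would need in order to draw each motif at its relative location along every input sequence.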

3.3.5 The sequence logos of transcription elements sequences

Sequence logos were drawn with WebLogo Version 2.8 [56], a web-based application for creating sequence logos. A sequence logo is a quantitative graphical representation of an aligned set of amino acid or nucleic acid sequences, developed by Tom Schneider and Mike Stephens [57]. Each logo shows the frequency of each base at each position of the consensus sequence as the relative height of its letter, and the degree of sequence conservation as the total height of the stack of letters, measured in bits of information. The vertical scale is in bits, with a maximum of 2 bits at each position. In general, a sequence logo gives a richer and more precise description of, for example, a binding site than a consensus sequence does.
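
The bit values plotted in a sequence logo can be reproduced approximately with the short Python sketch below. It computes 2 minus the Shannon entropy of each column and omits the small-sample correction that logo software may apply; the example sites are invented, and only the bases A, C, G, and T are assumed to occur.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information in bits (2 - H), no small-sample correction."""
    bits = []
    for column in zip(*sites):
        counts = Counter(column)
        total = sum(counts.values())
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        bits.append(2.0 - entropy)
    return bits


# Hypothetical aligned sites, invented for illustration only.
example_sites = ["TATAAA", "TATAAG", "TATATA", "TTTAAA"]
bits = information_content(example_sites)
print([round(b, 3) for b in bits])
```

A fully conserved position reaches the 2-bit maximum, and a position where the four bases are equally frequent scores 0 bits; within a stack, each letter's height is its frequency multiplied by the position's bit value.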


CHAPTER 4 Experimental Results

4.1 Preliminary Data Analysis

4.1.1 Analysis of general evolution trends

Our starting dataset was 222 nonhuman primate promoter sequences of 59 genes from NCBI, covering 30 species in 19 genera, as shown in Figure 4.1(a). Fourteen genes with only 1 promoter sequence available were excluded because no related sequences were available for comparison. Eight genes with 7 or more promoter sequences were compared: CCR5 (22 sequences), 5-HTT (12), LPA (10), DQB1 (8), protein C (8), TNF (7), TNFA (7), and UGT1A (7). To estimate sequence divergence in the nonhuman primate promoters, we divided the collected promoter sequences into two regions based on degree of similarity (details are given in the next paragraph) and calculated nucleotide substitution rates.

For the same gene within the same species, the evolutionary distances among the different sequence entries of the DQB1 gene were 0.052 by p-distance (standard deviation (SD) = 0.007) and 0.049 by the Kimura 2-parameter model (SD = 0.006). For the same gene across different species, the evolutionary distances were obtained for the CCR5, 5-HTT, LPA, protein C, TNF, TNFA, and UGT1A promoters (Table 4.1); the mean distances were d = 0.061 with SD = 0.008 for p-distance and d = 0.057 with SD = 0.007 for Kimura 2-parameter.
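
For reference, the two distance measures can be computed for a pair of aligned sequences as in the Python sketch below. It uses the standard definitions: p-distance is the fraction of differing sites, and the Kimura 2-parameter distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the fractions of transitions and transversions. Gap handling by pairwise deletion and the example sequences are assumptions; the values reported in this chapter were not produced by this code.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_fractions(seq1, seq2):
    """Return (P, Q): transition and transversion fractions over aligned
    positions where both sequences carry A, C, G or T."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (pairwise deletion)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return transitions / compared, transversions / compared


def p_distance(seq1, seq2):
    p, q = substitution_fractions(seq1, seq2)
    return p + q


def kimura_2p(seq1, seq2):
    p, q = substitution_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical aligned pair: one transition and one transversion in 10 sites.
s1 = "ACGTACGTAC"
s2 = "ACGTATGTCC"
print(round(p_distance(s1, s2), 3), round(kimura_2p(s1, s2), 3))
```

The Kimura 2-parameter value is always at least as large as the p-distance because it corrects for multiple substitutions at the same site, and the gap between the two grows as sequences diverge.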

As shown in Figure 4.1(b) for the species distribution, 5 species have more than 10 promoter sequences: chimpanzee (53 sequences), gorilla (36), orangutan (27), rhesus monkey (20), and cynomolgus monkey (11). This makes it possible to derive theoretical species-specific basal promoter elements. The average evolutionary distances within species, based on nucleotide substitutions, were also calculated. For gorilla, d = 1.355 with SD = 0.064 (Kimura 2-parameter) and d = 0.656 with SD = 0.008 (p-distance); for orangutan, d = 1.722 with SD = 0.055 (Kimura 2-parameter) and d = 0.668 with SD = 0.008 (p-distance). Figure 4.2 shows the phylogenic tree of gorilla.

4.1.2 General layout of BLAST hits

To obtain a general evolutionary trend for comparing these primate promoter sequences, we used the BLASTn program with default settings to find similarity hits for the promoter sequences against the non-redundant nucleotide database of all published genomes in NCBI. The resulting hits at different expect values (E-values), a statistical measure of the likelihood that a similar sequence match occurs by chance, were recorded.

In Figure 4.3, the X-axis is a sliding scale of stringency, expressed as –log E-values. The initial E-value cutoff is set to exclude potential orthologs with low sequence similarity. As the stringency increases (higher –log E-values, moving to the right on the X-axis), only potential orthologs with higher sequence similarity are retained and the hit numbers decline for each gene, indicating that the remaining hits are more likely to be true orthologs. As the stringency decreases (lower –log E-values, moving to the left on the X-axis), hits with lower sequence similarity are admitted and the hit numbers for each gene increase, indicating that random matches against the whole database accumulate in a similar way for each gene. Accordingly, most genes, with the exception of VHL, showed higher hit numbers at lower –log E-values. Large numbers of orthologs were found at medium stringency, with –log E-values of 40 through 100, for the major histocompatibility complex (MHC) class I genes (dotted lines). This could indicate that these highly variable gene sequences are readily subject to adventitious factors that result in sequence change or transfer. At E-values of 10^-160 to 10^-180, five gene promoters (Factor IX, GPHA, IL-4, MCP1, and VHL) had numerous hits. These gene sequences match database entries nonrandomly, with lower-similarity matches excluded. This suggests that the higher sequence conservation retains potential biological meaning.
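The sliding stringency scale can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the E-values of one gene's hits are assumed to be already parsed from the BLAST output into a list of floats, hits are counted cumulatively (a hit counts at every threshold it passes), and the function names are hypothetical.

```python
import math


def neg_log10_e(evalue: float) -> float:
    """Convert an E-value to -log10(E); BLAST reports very small values as 0.0."""
    return math.inf if evalue == 0.0 else -math.log10(evalue)


def hits_per_threshold(evalues, thresholds):
    """Count the hits whose -log10(E) is at least each stringency threshold."""
    scores = [neg_log10_e(e) for e in evalues]
    return {t: sum(score >= t for score in scores) for t in thresholds}


# Hypothetical E-values for the hits of one promoter sequence.
example_evalues = [1e-5, 1e-55, 1e-170, 0.0]
counts = hits_per_threshold(example_evalues, thresholds=[0, 40, 100, 160])
print(counts)
```

Plotting such counts per gene against the thresholds gives a curve of the kind described for Figure 4.3: the count can only stay level or fall as the threshold moves to the right.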

4.2 Results of Multiple Sequence Alignments

To investigate the phylogenetic relationships among the 222 nonhuman primate sequences, we performed multiple sequence alignment using ClustalW on the Workbench website. It is worth noting that ClustalW is able to differentiate alignments that arise from functional constraint from those that arise from highly repetitive DNA sequence. ClustalW provides three kinds of output. The first is the multiple sequence format (MSF) output, which gives the direct results of the sequence alignment, as shown in Figures 4.4 and 4.5. Figure 4.4 shows the multiple sequence alignment of the Gogo-A gene with 80-100% conservation. The MSA was divided into two regions: less-conserved (less than 100% similarity) and conserved (100% similarity). The alignment of the AFP promoter revealed a similar TATA box sequence (TATAAA) among primates, human, and rodents, as shown in Figure 4.5 [27]. The second output is the score matrix of pairwise distances, which shows the similarity between two sequences as estimated by the p-distance model. The score matrix showed that 39 of 47 genes have similarity ranging from 80% to 100%. In contrast, 8 genes have lower scores, from 3% to 74%: APP, BMP2, CCR5, GPHA, IL-4, MID1, TNFA, and VHL. This indicates that these promoters may show so little conservation that sequence transfer or deletion has occurred. Their phylogenetic trees were constructed with MEGA [58] using the neighbor-joining method [59] and the Kimura two-parameter model with 1000 bootstrap replications, as shown in Figure 4.6. The third output is the phylogenetic tree, which converts sequence similarity into theoretical evolutionary distance and is produced by Phylip (Phylogeny Inference Package) version 3.5 [60], in which the initial guide tree is modified according to the neighbor-joining algorithm. We also included the promoter sequences of homologs from human and rodents as input to ClustalW. The rodents were used as a phylogenetic outgroup.
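The division of an alignment into conserved (100% similarity) and less-conserved regions can be sketched as follows. This is an illustrative assumption about how the split could be made column by column, not the procedure actually implemented in this study; the function names are hypothetical, and a column containing a gap is treated as less-conserved.

```python
def label_columns(alignment):
    """Label each alignment column 'conserved' (identical in all rows, no gap)
    or 'less-conserved'. Rows must be aligned strings of equal length."""
    if not alignment:
        raise ValueError("empty alignment")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for column in zip(*alignment):
        residues = {c.upper() for c in column}
        conserved = len(residues) == 1 and "-" not in residues
        labels.append("conserved" if conserved else "less-conserved")
    return labels


def runs(labels):
    """Collapse per-column labels into (start, end, label) runs; end is exclusive."""
    out = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((start, i, labels[start]))
            start = i
    return out


# Toy alignment of three hypothetical entries.
toy = ["ACGTTA", "ACGATA", "ACG-TA"]
print(runs(label_columns(toy)))
```

Putative binding sites can then be assigned to a region by checking which run their alignment coordinates fall into.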

4.3 Detection of Putative Transcription Regulatory Elements

To determine the putative transcription regulatory elements, we analyzed the primate promoter sequences using two different methods: the PCMC program and the TRANSFAC database.

4.3.1 Results of Promoter Cluster Motif Classification (PCMC)

Highly represented short sequences, or characteristic sequences, called cluster motifs were searched with the extraction tool PCMC, which retrieves common sequence fragments of 5 to 20 bp in length shared among different sequence entries, with the overall length of each input limited to 30 kb. Our data pool of 222 nonhuman primate promoter sequences in 59 genes served as the input to the PCMC program. The direct results for these characteristic sequences were then transformed into a diagram that shows the change in position of a shared characteristic sequence among entries, using a visualization tool that we developed (shown in Figure 3.3). Figure 4.7 shows the direct short-sequence output of the PCMC program for the AFP gene promoters.

The analysis detected 222 characteristic sequences that might be putative conserved cluster elements, all 20 bp long; no other lengths were found. To show the change in position of a shared characteristic sequence among entries, we examined the degree of positional difference of the elements in related promoter sequences through a graphical user interface, as shown in Figures 4.8 and 4.9. As Figure 4.8 shows, this is particularly useful for promoters with higher similarity (99-100%), such as AFP, Brain-2 / N-Oct 3, and nerve growth factor. These promoters have densely shared characteristic sequences. In contrast, genes with lower similarity (28-55%), such as APP, GPHA, TNFA, and VHL, contain fewer shared characteristic sequences and give sparse, non-aligned graphs, as shown in Figure 4.9.
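The idea of a shared characteristic sequence and its position in each entry can be illustrated with a simple exact k-mer search. This is not the PCMC algorithm, whose search strategy is not reproduced here; it is a brute-force sketch with hypothetical names that reports every fragment of a fixed length occurring in all input entries, together with its start positions.

```python
from collections import defaultdict


def shared_kmers(entries: dict[str, str], k: int = 20) -> dict[str, dict[str, list[int]]]:
    """Return {kmer: {entry_name: [0-based start positions]}} for every k-mer
    that occurs at least once in every entry."""
    positions = defaultdict(lambda: defaultdict(list))
    for name, seq in entries.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]][name].append(i)
    return {kmer: dict(by_entry)
            for kmer, by_entry in positions.items()
            if len(by_entry) == len(entries)}


# Two short hypothetical entries sharing one 8-bp fragment at different offsets.
toy_entries = {"entry_a": "TTGACGTCAAA", "entry_b": "GGTGACGTCAC"}
for kmer, where in shared_kmers(toy_entries, k=8).items():
    print(kmer, where)
```

The difference between the start positions of the same fragment in two entries is the positional shift that the visualization tool displays; highly similar promoters yield many shared fragments with small shifts.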

4.3.2 Results of Aligned TRANSFAC Database

The second analysis method screens the nonhuman primate promoter sequences against the known transcription factor database (TRANSFAC) through the Signal Scan website. The transcription regulatory elements are possible binding-site sequences for transcription factors, as collected in the TRANSFAC database. To obtain the putative transcription regulatory elements for nonhuman primates, we searched the results of the cross-species multiple sequence alignment (MSA) from ClustalW. The promoter sequences of the genes in nonhuman primates were first searched for putative binding sites, and the number of sites was counted. The putative binding sites were then annotated along the MSA results in both the conserved (100% similarity) and less-conserved (less than 100% similarity) regions. The complete search results are listed in Table 4.2. Most genes have putative binding sites in both regions. However, the lower-similarity genes BMP2, CCR5, GPHA, MID1, and TNFA contain only less-conserved regions, and the higher-similarity genes AFP, DARC, DPYD, and the Huntington's disease gene contain only conserved regions. There are 1439 occurrences of 82 binding sites in the less-conserved regions and 1854 occurrences of 89 binding sites in the conserved regions.
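Searching a promoter sequence for putative binding sites can be illustrated with a consensus-string scan. The sketch below is not the Signal Scan implementation and does not use TRANSFAC entries; it assumes each site is described by a consensus written in standard IUPAC nucleotide codes, scans the forward strand only, and uses hypothetical function names.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence: str, consensus: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) matches
    of an IUPAC consensus on the forward strand."""
    pattern = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


# Hypothetical promoter fragment containing a CCAAT box and a TATA-like box.
fragment = "GGCCAATCTATAAAGG"
print(find_sites(fragment, "CCAAT"))
print(find_sites(fragment, "TATAWA"))
```

Counting the returned positions per factor, and checking whether each position lies in a conserved or a less-conserved region, gives per-region site counts of the kind summarized in Table 4.2.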

Furthermore, the putative regulatory elements in the conserved regions were transformed into a diagram that shows the change in position of a putative regulatory element among entries, using the visualization tool that we developed. In Figure 4.8, AFP, Brain-2 / N-Oct 3, and nerve growth factor show gene-specific, similar placements of the related putative regulatory elements in both the PCMC and TRANSFAC analyses. We found few related regulatory elements conserved in both sequence and position in the promoters of the GPHA and TNFA genes, as shown in Figure 4.9. However, significant differences between the TRANSFAC and PCMC results for related regulatory elements were found in the Brain-2 / N-Oct 3 and VHL promoter sequences. In the Brain-2 / N-Oct 3 gene, the distances between putative regulatory elements with GC-rich sequences, such as SP1 or CP1, are greater in Pongo than in the other two great apes (Pan and Gorilla). In the VHL promoter sequences, the distances between the NF-1 and SP1 regulatory elements are more similar between Gorilla and Papio than between Pan and Macaca. The finding that conservation of both element sequence and position after TRANSFAC analysis is higher between Gorilla and Papio than between Pan and Macaca is consistent with the distribution of the VHL promoter sequences across primate species in the evolutionary tree (Figure 4.6).

The occurrence frequency of putative regulatory elements in both the conserved and less-conserved regions was also investigated. In the conserved regions, the most frequent putative regulatory elements consist of sequences capable of binding 10 kinds of transcription factors (occurrence numbers in parentheses): NF-1 (249), Sp1 (145), GR (131), CAC-binding protein (69), NF-E (55), CP1 (42), CTF (42), NF-1/L (41), CACCC-binding factor (39), and gammaCAC2 (39), as shown in Figure 4.10(a). In the less-conserved regions, the most frequent putative regulatory elements likewise consist of sequences capable of binding 10 kinds of transcription factors: NF-1 (274), GR (197), Sp1 (193), CAC-binding protein (112), CACCC-binding factor (74), gammaCAC2 (73), GATA-1 (62), NF-E (62), Pit-1 (57), and NF-1/L (55), as shown in Figure 4.10(b). In short, sequences capable of binding the CP1 and CTF transcription factors were frequent only in the conserved regions, and sequences capable of binding GATA-1 and Pit-1 were frequent only in the less-conserved regions. Sequences capable of binding the other eight transcription factors were frequent in both the conserved and less-conserved regions.
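The comparison of the two top-10 lists can be checked with a short Python sketch. The factor names are those reported in the text; the function names are hypothetical, and the tallying helper assumes one factor name per predicted site as input.

```python
from collections import Counter


def top_factors(site_names, n=10):
    """Return the n most frequent factor names, given one name per predicted site."""
    return [name for name, _ in Counter(site_names).most_common(n)]


def compare_top_lists(conserved_top, less_conserved_top):
    """Split two top lists into factors unique to each region and factors in both."""
    conserved, less = set(conserved_top), set(less_conserved_top)
    return {
        "conserved_only": sorted(conserved - less),
        "less_conserved_only": sorted(less - conserved),
        "both": sorted(conserved & less),
    }


# Top-10 factors per region as reported for Figure 4.10(a) and 4.10(b).
CONSERVED_TOP = ["NF-1", "Sp1", "GR", "CAC-binding protein", "NF-E",
                 "CP1", "CTF", "NF-1/L", "CACCC-binding factor", "gammaCAC2"]
LESS_CONSERVED_TOP = ["NF-1", "GR", "Sp1", "CAC-binding protein",
                      "CACCC-binding factor", "gammaCAC2", "GATA-1",
                      "NF-E", "Pit-1", "NF-1/L"]

result = compare_top_lists(CONSERVED_TOP, LESS_CONSERVED_TOP)
print(result["conserved_only"])       # ['CP1', 'CTF']
print(result["less_conserved_only"])  # ['GATA-1', 'Pit-1']
print(len(result["both"]))            # 8
```

The output reproduces the statement in the text: two factors are frequent only in the conserved regions, two only in the less-conserved regions, and eight in both.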

4.4 Analysis of Consensus Transcription Regulatory Elements

To visualize the changes in the putative elements of each sequence, related regulatory elements were collected from the less-conserved regions, which allows the variation among elements to be observed. The results are presented as the sequence representation and position frequency of putative promoter regulatory elements in nonhuman primates. Position frequency was counted from the putative regulatory elements in the less-conserved regions.

Sequence logos were used as a graphical representation of the putative regulatory elements. We extracted sequence representations for the putative regulatory elements with more than 30 occurrences, and 33 sequence logos were obtained. These sequence logos might indicate consensus regulatory elements for nonhuman primates.

The resulting nonhuman primate sequence logos were compared with those of human and rodents retrieved from the JASPAR database [61], as shown in Figure 4.11. For example, regulatory elements such as AP-2, CREB, IRF-2, Sp1, and SRF are human consensus sequences found in JASPAR, whereas c-Myb and GATA-1 are rodent consensus sequences found in JASPAR, as shown in Figure 4.12. The CREB, IRF-2, SRF, and c-Myb consensus sequences from our analysis are similar to those from JASPAR. Stronger nucleotide preference is also found in those four elements, in that one particular nucleotide occurs at each position at higher frequency than the others, except for the c-Myb elements. In contrast, the AP-2, Sp1, and GATA-1 elements show lower scores for every nucleotide at each position, indicating lower nucleotide preference for these three elements.
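The position frequencies and the per-position scores that a sequence logo displays can be computed as in the following sketch. It follows the usual information-content definition for DNA logos (2 bits minus the Shannon entropy of the column) but omits the small-sample correction, so it is an approximation rather than the exact output of the logo software; the function names and the example sites are hypothetical.

```python
import math


def position_frequencies(sites):
    """Per-position base frequencies for equal-length binding-site sequences."""
    if not sites:
        raise ValueError("no sites given")
    sites = [s.upper() for s in sites]
    if any(len(s) != len(sites[0]) for s in sites):
        raise ValueError("sites must have equal length")
    return [{base: column.count(base) / len(sites) for base in "ACGT"}
            for column in zip(*sites)]


def information_content(frequencies):
    """Bits per position: 2 - H, where H is the Shannon entropy of the column.
    The small-sample correction used for real logos is omitted."""
    return [2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
            for column in frequencies]


# Hypothetical TATA-like sites collected from several promoters.
example_sites = ["TATAAA", "TATAAA", "TCTAAA", "TATATA"]
freqs = position_frequencies(example_sites)
for position, bits in enumerate(information_content(freqs), start=1):
    print(position, round(bits, 3))
```

A position where one nucleotide dominates scores close to 2 bits (strong preference), while a position where all four nucleotides are equally frequent scores 0 bits, which corresponds to the lower nucleotide preference described for AP-2, Sp1, and GATA-1.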

We also compared the consensus TATA and CAAT box binding sequences from three sources: experimentally supported, primate-specific data from NCBI; vertebrate-specific data from the JASPAR database; and vertebrate-specific data that we derived from the TRANSFAC database. Of this set, 25 promoter sequences of 9 genes have annotated TATA and CAAT boxes as significant features based on experimentally determined mRNAs collected by NCBI, a high proportion of which are expected to be full-length or 5'-enriched transcripts. These genes are 11 beta hydroxylase [42], HaA [43], histone H1t [44], and Gogo-A, Gogo-B, Patr-A, Patr-B, Popy-A, and Popy-B [45]. The annotated TATA and CAAT boxes from the 25 promoters were collected and run through sequence logo analysis. The Gogo-A, Gogo-B, Patr-A, Patr-B, Popy-A, and Popy-B genes, which are major histocompatibility complex (MHC) class I genes, showed a shared consensus preference in the TATA and CAAT box sequences: TCTAAA for the TATA box and GCCAAT for the CAAT box. The TATA and CAAT box sequences of the other three genes vary from these consensus sequences. Figure 4.12 compares the consensus TATA and CAAT box binding sequences from the above-mentioned 25 NCBI sequences, the JASPAR database [61], and our vertebrate-specific data. The source data in the JASPAR database were collected from the scientific literature as in vivo binding sequences for candidate transcription factors, mainly for vertebrates. The TATA box consensus sequence (TTCTAAA) of the 25 NCBI sequences might represent theoretical basal promoter elements of nonhuman primates.

In addition, 24 other putative regulatory elements were not found in the JASPAR database [61], as shown in Figure 4.13. Several of these elements correspond to factors that bind the CCAAT box, a common promoter element present in the proximal promoter of numerous mammalian genes [8]. CCAAT-box complexes are most often found between 80 and 100 bp upstream of the transcription start site. Figure 4.13 shows that CP1 had a highly conserved CCAAT sequence. CTF and NF1, however, showed lower sequence scores; they might not contain complete CCAAT-binding elements and might co-regulate with other specific elements to stimulate transcription.


CHAPTER 5

Discussion and Conclusion

The purpose of the present study is to identify putative transcription regulatory elements in primate promoter regions, in order to provide useful information for comparing regulation between human and nonhuman primates. With respect to sequence divergence in the dataset, the genetic distances from the inter-species comparison of the same gene, for the eight genes with the most promoter sequences, were lower (ranging from 0.051 through 0.071) than those within the species with the most sequences (1.355 through 1.722 by the Kimura two-parameter model and 0.656 through 0.668 by the p-distance model). This higher divergence indicates great sequence variability among the genes of a single species for nonhuman primate promoters. It suggests a higher probability that paralogs are present within one species, that is, that gene duplication occurred after speciation.

To obtain a general evolutionary trend for comparing primate promoter sequences, we used the BLASTn program to count, at each threshold (E-value), the number of orthologous genes (hit number, the number of genes with close similarity) for the promoter sequences against the non-redundant nucleotide database of all published genomes in NCBI. Figure 4.3 shows that at the stringent E-values of 10^-160 to 10^-180, five promoters (Factor IX, GPHA, IL-4, MCP1, and VHL) had numerous hits. Likewise, these genes, except VHL, showed higher hit numbers at lower –log E-values. These gene sequences match database entries nonrandomly, with lower-similarity matches excluded. This could suggest that higher sequence conservation retains potential biological information. Large numbers of orthologs were found at medium stringency, with –log E-values of 40 through 100, for the major histocompatibility complex (MHC) class I genes (dotted lines). This could be explained if these highly variable gene sequences are readily subject to extrinsic factors that result in sequence change or transfer.

Multiple sequence alignment identified eight genes with lower sequence similarity (APP, BMP2, CCR5, GPHA, IL-4, MID1, TNFA, and VHL), which could indicate that these promoters show so little conservation that sequence transfer or deletion has occurred. This may make it difficult to identify homologous elements in different species groups by sequence comparison alone [62].

To determine the characteristics of putative regulatory elements, we analyzed primate promoter sequences using two different methods: the PCMC program and the TRANSFAC database. TRANSFAC yielded fewer putative regulatory elements than the PCMC program, as shown in Figures 4.8 and 4.9. Potential regulatory elements could therefore be found easily from the TRANSFAC database. Nevertheless, false negatives might occur in our analysis. Our detection power is partially diluted because we used TRANSFAC vertebrate standards to cross-examine the highly representative short sequences from the PCMC analysis. Less-conserved sequences have long been believed not to be functionally definable, but they are now theoretically definable using the PCMC program. Many highly representative short sequences from the PCMC analysis might be functional only in primates. Promoter reporter analysis aimed at putative regulatory elements and primate expression microarrays looking for co-regulated promoter elements might both be helpful.

Taken together, the contributions of the present study are as follows. Nonhuman primate promoters among the genes of a single species show the greatest sequence variability. Most gene promoters contain less-random, more highly conserved orthologs from the published genomes in NCBI. MHC class I genes, however, place their short fragment matches in the medium-relaxed, random similarity range, indicating high variability. The PCMC program for primate-specific, position-corresponding, highly representative short sequences is helpful for analyzing putative primate regulatory elements from highly similar promoters. The results of the PCMC program are consistent with those of our visualization tool showing MSA trends. The results for the less-conserved regions from MSA plus TRANSFAC screening show that the primate consensus sequence preferences that we obtained for the CREB, SRF, IRF-2, and c-Myb elements are consistent with those found in the vertebrate transcription factor JASPAR database. Novel primate consensus sequence preferences for putative regulatory elements found uniquely by the present study were also obtained and are listed as sequence logos for future molecular functional analysis.


CHAPTER 6 References

[1] Collins, F., Green, E., Guttmacher, A. and Guyer, M. (2003) US National Human Genome Institute: A vision for the future of genomics research. Nature, 422: 835-847.

[2] Lockhart, D. and Winzeler, E. (2000) Genomics, gene expression and DNA arrays. Nature, 405: 827-836.

[3] Chen, F.C. and Li, W.H. (2001) Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet., 68: 444-456.

[4] Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T.L., Wilson, C.J., Bell, S.P. and Young, R.A. (2000) Genome-wide location and function of DNA binding proteins. Science, 290: 2306-2309.

[5] Lee, T., Rinaldi, N., Robert, R., Odom, D., Bar-Joseph, Z., Gerber, G., Hannett, N., Harbison, C., Thompson, C., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K. and Young, R.A. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298: 799-804.

[6] Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M. and Brown, P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409: 533-538.

[7] Horak, C.E., Mahajan, M.C., Luscombe, N.M., Gerstein, M., Weissman, S.M. and Snyder, M. (2002) GATA-1 binding sites mapped in the beta-globin locus by using mammalian chIP-chip analysis. Proc. Natl. Acad. Sci. USA, 99: 2924-2929.


[8] Bucher, P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol., 212: 563-578.

[9] Audic, S. and Claverie, J.M. (1997) Detection of eukaryotic promoters using Markov transition matrices. Comput. Chem., 21: 223-227.

[10] Knudsen, S. (1999) Promoter2.0: for the recognition of Pol II promoter sequences. Bioinformatics, 15: 356-361.

[11] Fickett, J.W. and Hatzigeorgiou, A.G. (1997) Eukaryotic promoter recognition. Genome Res., 7: 861-878.

[12] Griffiths, A.J.F., Miller, J.H., Suzuki, D.T., Lewontin, R.C. and Gelbart, W.M. (1999) Introduction to Genetic Analysis 7th ed., WH Freeman & Company, New York.

[13] Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16: 16-23.

[14] Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262: 208-214.

[15] Eskin, E. and Pevzner, P.A. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics, 18: S354-363.

[16] Liu, X.S., Brutlag, D.L. and Liu, J.S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20: 835-839.

[17] Marsan, L. and Sagot, M.F. (2000) Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol., 7: 345-362.


[18] Leung, J.Y., McKenzie, F.E., Uglialoro, A.M., Flores-Villanueva, P.O., Sorkin, B.C., Yunis, E.J., Hartl, D.L. and Goldfeld, A.E. (2000) Identification of phylogenetic footprints in primate tumor necrosis factor-alpha promoters. Proc. Natl. Acad. Sci. USA, 97: 6614-6618.

[19] Wasserman, W.W., Palumbo, M., Thompson, W., Fickett, J.W. and Lawrence, C.E. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat. Genet., 26: 225-228.

[20] Hardison, R.C., Oeltjen, J. and Miller, W. (1997) Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res., 7: 959-966.

[21] Jareborg, N., Birney, E. and Durbin, R. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res., 9: 815-824.

[22] Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M. and Frazer, K.A. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288: 136-140.

[23] Dermitzakis, E.T. and Clark, A.G. (2002) Evolution of Transcription Factor Binding Sites in Mammalian Gene Regulatory Regions: Conservation and Turnover. Mol. Biol. Evol., 9: 1114-1121.

[24] Blanchette, M., Schwikowski, B. and Tompa, M. (2002) Algorithms for phylogenetic footprinting. J. Comput. Biol., 9: 211-223.

[25] McCue, L., Thompson, W., Carmack, C., Ryan, M.P., Liu, J.S., Derbyshire, V. and Lawrence, C.E. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic. Acids. Res., 29: 774-782.

[26] Hardison, R.C. (2003) Primer on Comparative Genomics. Public Library of Science, Biology, 1: 156-160.


[27] Lenhard, B., Sandelin, A., Mendoza, L., Engstrom, P., Jareborg, N. and Wasserman, W.W. (2003) Identification of conserved regulatory elements by comparative genome analysis. J. Biol., 2: 13.

[28] Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., Maskeri, B., Hansen, N.F., Schwartz, M.S., Weber, R.J., Kent, W.J., Karolchik, D., Bruen, T.C., Bevan, R., Cutler, D.J., Schwartz, S., Elnitski, L., Idol, J.R., Prasad, A.B., Lee-Lin, S.Q., Maduro, V.V., Summers, T.J., Portnoy, M.E., Dietrich, N.L., Akhter, N., Ayele, K., Benjamin, B., Cariaga, K., Brinkley, C.P., Brooks, S.Y., Granite, S., Guan, X., Gupta, J., Haghighi, P., Ho, S.L., Huang, M.C., Karlins, E., Laric, P.L., Legaspi, R., Lim, M.J., Maduro, Q.L., Masiello, C.A., Mastrian, S.D., McCloskey, J.C., Pearson, R., Stantripop, S., Tiongson, E.E., Tran, J.T., Tsurgeon, C., Vogt, J.L., Walker, M.A., Wetherby, K.D., Wiggins, L.S., Young, A.C., Zhang, L.H., Osoegawa, K., Zhu, B., Zhao, B., Shu, C.L., De Jong, P.J., Lawrence, C.E., Smit, A.F., Chakravarti, A., Haussler, D., Green, P., Miller, W. and Green, E.D. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424: 788-793.

[29] Lecompte, O., Thompson, J.D., Plewniak, F., Thierry, J. and Poch, O. (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene, 270: 17-30.

[30] Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B. and Lander, E. S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res., 10: 950-958.

[31] Chen, R., Bouck, J. B., Weinstock, G. M. and Gibbs, R. A. (2001) Comparing vertebrate whole-genome shotgun reads to the human genome. Genome Res., 11: 1807-1816.

[32] Mouse Genome Sequencing Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420: 520-562.


[33] Dubchak, I., Brudno, M., Loots, G.G., Pachter, L., Mayor, C., Rubin, E.M. and Frazer, K.A. (2000) Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res., 10: 1304-1306.

[34] Hardison, R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends. Genet., 16: 369-372.

[35] Pennacchio, L.A. and Rubin, E.M. (2001) Genomic strategies to identify mammalian regulatory sequences. Nature Rev. Genet., 2: 100-109.

[36] International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409: 860-921.

[37] Oliphant, A.R., Brandl, C.J. and Struhl, K. (1989) Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell Biol., 9: 2944-9.

[38] Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov, B., Michael, H., Munch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S. and Wingender, E. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic. Acids. Res., 31: 374-378.

[39] Ghosh, D. (1993). Status of the transcription factors database (TFD). Nucleic. Acids. Res., 24: 238-241.

[40] Chen, Q.K., Hertz, G.Z. and Stormo, G.D. (1995) MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci., 11: 563-566.

[41] Prestridge, D.S. (1991) SIGNAL SCAN: A computer program that scans DNA sequences for eukaryotic transcriptional elements. Comput. Appl. Biosci., 7: 203-206.


[42] Nakabayashi, H., Koyama, Y., Sakai, M., Li, H.M., Wong, N.C. and Nishi, S. (2001) Glucocorticoid stimulates primate but inhibits rodent alpha-fetoprotein gene promoter. Biochem. Biophys. Res. Commun., 287: 160-172.

[43] Winter, H., Langbein, L., Krawczak, M., Cooper, D.N., Jave-Suarez, L.F., Rogers, M.A., Praetzel, S., Heidt, P.J. and Schweizer, J. (2001) Human type I hair keratin pseudogene phi-hHaA has functional orthologs in the chimpanzee and gorilla: evidence for recent inactivation of the human gene after the Pan-Homo divergence. Hum. Genet., 108: 37-42.

[44] Koppel, D.A., Wolfe, S.A., Fogelfeld, L.A., Merchant, P.S., Prouty, L. and Grimes, S.R. (1994) Primate testicular histone H1t genes are highly conserved and the human H1t gene is located on chromosome 6. J. Cell Biochem., 54: 219-230.

[45] Vallejo, A.N. and Pease, L.R. (1995) Structure of the MHC A and B locus promoters in hominoids. Insights on the evolution of the class I MHC multigene family. J. Immunol., 154: 3912-3921.

[46] Clarimon, J., Andres, A.M., Bertranpetit, J. and Comas, D. (2004) Comparative analysis of Alu insertion sequences in the APP 5' flanking region in humans and other primates. J. Mol. Evol., 58: 722-31.

[47] Mummidi, S., Bamshad, M., Ahuja, S.S., Gonzalez, E., Feuillet, P.M., Begum, K., Galvis, M.C., Kostecki, V., Valente, A.J., Murthy, K.K., Haro, L., Dolan, M.J., Allan, J.S. and Ahuja, S.K. (2000) Evolution of human and non-human primate CC chemokine receptor 5 gene and mRNA. Potential roles for haplotype and mRNA diversity, differential haplotype-specific transcriptional activity, and altered transcription factor binding to polymorphic nucleotides in the pathogenesis of HIV-1 and simian immunodeficiency virus. J. Biol. Chem., 275: 18946-61.


[48] Yoshimura, K., Nakamura, H., Trapnell, B.C., Dalemans, W., Pavirani, A., Lecocq, J.P. and Crystal, R.G. (1991) The cystic fibrosis gene has a 'housekeeping'-type promoter and is expressed at low levels in cells of epithelial origin. J. Biol. Chem., 266: 9140-9144.

[49] Reitsma, P.H., Bertina, R.M., Ploos van Amstel, J.K., Riemens, A. and Briet, E. (1988) The putative factor IX gene promoter in hemophilia B Leyden. Blood, 72: 1074-1076.

[50] Clarimon, J., Andres, A.M., Bertranpetit, J. and Comas, D. (1996) Isolation and characterization of the human mismatch repair gene hMSH2 promoter region. Hum. Genet., 97: 114-116.

[51] Sugawara, T., Lin, D., Holt, J.A., Martin, K.O., Javitt, N.B., Miller, W.L. and Strauss, J.F. III (1995) Structure of the human steroidogenic acute regulatory (StAR) gene: StAR stimulates mitochondrial cholesterol 27-hydroxylase activity. Biochemistry, 34: 12506-12512.

[52] Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic. Acids. Res., 22: 4673-4680.

[53] Felsenstein, J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, 5: 164-166.

[54] Horiike, T., Hamada, K., Kanaya, S. and Shinozawa, T. (2001) Origin of eukaryotic nuclei by symbiosis of Archaea in Bacteria is revealed by homology-hit analysis. Nat. Cell Biol., 3: 210-214.

[55] Pai, T.-W., Chang, W.-Y., Chang, M.D.-T., Chu, J.-H. and Tai, H.L. (2004) Ladderlike Stepping and Interval Jumping Searching Algorithm for DNA Sequences. In Proc. Second Asia-Pacific Bioinformatics Conference (APBC2004), 29: 93-98.

[56] Crooks, G.E., Hon, G., Chandonia, J.M. and Brenner, S.E. (2004) WebLogo: A sequence logo generator. Genome Res., 14: 1188-1190.


[57] Schneider, T.D. and Stephens, R.M. (1990) Sequence Logos: A New Way to Display Consensus Sequences. Nucleic. Acids. Res., 18: 6097-6100.

[58] Quandt, K., Frech, K., Karas, H., Wingender, E. and Werner, T. (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic. Acids. Res., 23: 4878-4884.

[59] Chen, Q.K., Hertz, G.Z. and Stormo, G.D. (1995) Matrix search 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci., 11: 563-566.

[60] Felsenstein, J. (1989) PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics, 5: 164-166.

[61] Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W. and Lenhard, B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic. Acids. Res., 32: D91-4.

[62] Ludwig, M.Z., Bergman, C., Patel, N.H. and Kreitman, M. (2000) Evidence for stabilizing selection in a eukaryotic enhancer element. Nature, 403: 564-567.

[63] Grabe, N. (2002) AliBaba2: Context Specific Identification of Transcription Factor Binding Sites. In Silico. Biol., 2: S1-1.

[64] Schug, J. and Overton, G.C. (1997) TESS: Transcription Element Search Software on the WWW. Technical Report CBIL-TR-1997-1001-v0.0, Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania.

[65] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic. Acids. Res., 25, 3389-3402.

Table 2.1 Prediction programs for transcription element binding sites

| Program | Operating principle | Technical data | URL | Reference |
|---|---|---|---|---|
| AliBaba2 | Predicts transcription factor binding sites by constructing matrices on the fly from TRANSFAC 4.0 public. | The matrices are constructed from all binding sites of the TRANSFAC database [38]. | http://www.alibaba2.com/ | [63] |
| Match 1.0 (Matrix search) | The MonoMatch Profilers and DiMatch Profilers provide a means of creating (editing, deleting) matrix profiles, that is, specific subsets of weight matrices with defined cut-offs. | Both profilers use the library of mono- and di-nucleotide weight matrices from TRANSFAC 3.5 [38]. | http://compel.bionet.nsc.ru/Match/Match.html | [40] |
| MatInspector | Uses a library of matrix descriptions for transcription factor binding sites to locate matches in sequences of unlimited length. | The matrix family library contains 592 weight matrices in six groups from the TRANSFAC database [38]. | http://www.genomatix.de/products/MatInspector/ | [58] |
| Signal Scan (Signal Sequence Scan) | Searches for transcription factor binding sites against a signal database. | The signal database is derived from the TFDa, TRANSFAC [38] and IMDb Matrix databases. | http://bimas.dcrt.nih.gov/molbio/signal/ | [41] |
| TESS (Transcription Element Search System) | A web tool for identifying binding sites using site or consensus strings and positional weight matrices from databases. | Data come from the TRANSFAC [38], IMDb, and CBIL-GibbsMatc databases. | http://www.cbil.upenn.edu/tess | [64] |
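
Most of the programs in Table 2.1 work by scanning a sequence with a weight matrix and reporting windows that score above a cut-off. The Python sketch below illustrates that general idea only and is not the implementation of any listed program. The count matrix `COUNTS`, the pseudocount, the uniform background frequency, the cut-off of 4.0, and the function names `log_odds_matrix` and `scan` are all assumptions made for illustration, and the counts are not taken from TRANSFAC.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# The counts are invented for illustration, not taken from TRANSFAC.
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 2, "T": 0},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    weights = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        weights.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return weights


def scan(sequence, weights, cutoff):
    """Return (start, window, score) for every window scoring >= cutoff."""
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(weights[i][base] for i, base in enumerate(window))
        if score >= cutoff:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    weights = log_odds_matrix(COUNTS)
    for start, window, score in scan("CCTGACGTTGACAA", weights, cutoff=4.0):
        print(start, window, round(score, 2))
```

The cut-off trades sensitivity against false positives. Raising it keeps only windows close to the matrix consensus, which is the role the "defined cut-offs" play in the Match profiles.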
