Overlapping-Gene Distance - A Study on Constructing Genome Trees of Prokaryotes Using Overlapping Genes

As in studies of genome rearrangements, we use a signed integer to represent a gene encoded in a chromosome, with its sign indicating the transcriptional orientation of the corresponding gene (e.g., "+" stands for 5' → 3' and "−" stands for 3' ← 5'). Moreover, we use a pair of signed integers (x, y) to represent an OG of x and y. Basically, there are three possible overlapping types (or structures/directions) of OGs [11, 13]: (1) unidirectional OGs with sign (+, +) or (−, −), that is, the 3' end of one gene overlaps with the 5' end of the other; (2) convergent OGs with sign (+, −), that is, the 3' ends of the two genes overlap; and (3) divergent OGs with sign (−, +), that is, the 5' ends of the two genes overlap. It has been reported that in prokaryotic genomes unidirectional OGs are the most widespread, convergent OGs are less common, and divergent OGs are rare [8, 9, 13].
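The sign convention above can be illustrated with a minimal Python sketch. The function name `og_type` is hypothetical and chosen only for this illustration; it assumes genes are encoded as nonzero signed integers, as described in the text.

```python
def og_type(x: int, y: int) -> str:
    """Classify an OG (x, y) by the signs of its two genes.

    Assumes x and y are nonzero signed integers whose signs give the
    transcriptional orientation, with x located upstream of y on the
    chromosome.
    """
    if x == 0 or y == 0:
        raise ValueError("gene identifiers must be nonzero signed integers")
    if (x > 0) == (y > 0):
        return "unidirectional"  # (+, +) or (-, -)
    if x > 0:
        return "convergent"      # (+, -): the 3' ends overlap
    return "divergent"           # (-, +): the 5' ends overlap
```

For example, `og_type(5, -6)` classifies the pair as convergent, because the first gene is on the forward strand and the second on the reverse strand.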

For our purpose, the orthologous OG pairs considered here are further restricted to those with the same (i.e., conserved) overlapping structure. Let {c_1, c_2, ..., c_n} denote the set of all orthologous OG pairs between two genomes G_i and G_j, which are typically circular. Two consecutive OGs in G_i, say (u, v) and (x, y) with (u, v) preceding (x, y), determine a breakpoint if neither (u, v) precedes (x, y) nor (−y, −x) precedes (−v, −u) in G_j. It is not hard to see that the number of breakpoints in G_i is equal to the number of breakpoints in G_j. We then define the overlapping-gene distance D_{i,j} between G_i and G_j as follows.

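$$D_{i,j} = w_o \cdot \frac{b_{i,j}}{n} + w_c \cdot \left( \frac{x_i - n}{x_i} + \frac{x_j - n}{x_j} \right)$$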

In the above formula, b_{i,j} denotes the number of breakpoints in genome G_i with respect to genome G_j, and x_i and x_j denote the total numbers of OGs in G_i and G_j, respectively. Note that if the considered genomes are linear, the denominator of the first term on the right-hand side of this equation should be n − 1, because in this case it is the maximum number of breakpoints between G_i and G_j. Basically, D_{i,j} evaluates the distance between G_i and G_j by considering the orthologous OG order measure defined in the first term (i.e., the normalized breakpoint distance) and the OG content measure defined in the second term (i.e., the sum of the ratios of the OGs found in one genome but not in the other to the total number of OGs in that genome). Then w_o and w_c can be considered as the weights of orthologous OG order and OG content, respectively; both default to 1 in OGtree.
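As an illustration of the two measures, the following Python sketch counts breakpoints and combines them with the content term. It is not the OGtree implementation: it assumes that the order term is b_{i,j}/n (or b_{i,j}/(n − 1) for linear genomes) and that the content term is (x_i − n)/x_i + (x_j − n)/x_j, as the description above suggests. Each genome is given as the ordered list of its n orthologous OGs, each OG being a pair of signed integers with shared ortholog labels. All function names are hypothetical.

```python
def reverse_og(og):
    """Return the OG (u, v) as read on the opposite strand: (-v, -u)."""
    u, v = og
    return (-v, -u)


def adjacencies(genome, circular=True):
    """Set of ordered pairs of consecutive OGs in a genome."""
    n = len(genome)
    last = n if circular else n - 1
    return {(genome[k], genome[(k + 1) % n]) for k in range(last)}


def count_breakpoints(gi, gj, circular=True):
    """Number of consecutive OG pairs of gi that are not conserved in gj."""
    adj_j = adjacencies(gj, circular)
    n = len(gi)
    last = n if circular else n - 1
    breakpoints = 0
    for k in range(last):
        a, b = gi[k], gi[(k + 1) % n]
        same_strand = (a, b) in adj_j
        opposite_strand = (reverse_og(b), reverse_og(a)) in adj_j
        if not same_strand and not opposite_strand:
            breakpoints += 1
    return breakpoints


def og_distance(gi, gj, x_i, x_j, w_o=1.0, w_c=1.0, circular=True):
    """Weighted sum of the order term and the content term (assumed form).

    gi, gj: ordered lists of the n orthologous OGs shared by both genomes.
    x_i, x_j: total numbers of OGs in each genome (x_i, x_j >= n > 0).
    """
    n = len(gi)
    max_breakpoints = n if circular else n - 1
    order = 0.0
    if max_breakpoints > 0:
        order = count_breakpoints(gi, gj, circular) / max_breakpoints
    content = (x_i - n) / x_i + (x_j - n) / x_j
    return w_o * order + w_c * content
```

Under these assumptions, two genomes with the same OG order and no genome-specific OGs have distance 0, and reading one circular genome from the opposite strand does not introduce any breakpoint.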

3.2 Algorithm

Figure 3.1 shows the flowchart of our algorithm for constructing the genome tree of prokaryotes based on overlapping-gene distance.

Given the accession numbers of several species, the first step of our algorithm is to download their complete genomes from the National Center for Biotechnology Information (NCBI) according to the accession numbers specified by the user. The putative genes are then extracted from each of these genomes on the basis of the coding sequence (CDS) annotation.

Inevitably, some of the putative genes in each genome downloaded from the NCBI may be misannotated. For a stringent analysis, we may therefore exclude those genes that were annotated as unknown, hypothetical or putative. In addition, horizontal gene transfer (HGT), the transfer of genes between different species, has been reported to be very common in prokaryotes [14]. It may obscure the OG pairs with which we hope to reconstruct the genome tree of prokaryotes. Hence, we offer an additional option in OGtree to remove those genes that were annotated as horizontally transferred genes in the HGT-DB database [14], which currently provides lists of putative horizontally transferred genes for a large number of complete prokaryotic genomes.
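The two optional filters can be sketched in Python as follows. This is only an illustration under stated assumptions: each gene is a dictionary with hypothetical keys `id` and `product`, the annotation keywords are matched as case-insensitive substrings of the product description, and `hgt_ids` stands for a set of gene identifiers that the user has already obtained from HGT-DB.

```python
EXCLUDED_KEYWORDS = ("unknown", "hypothetical", "putative")


def filter_genes(genes, hgt_ids=frozenset(), stringent=True):
    """Return the genes that survive the optional annotation filters.

    genes: list of dicts with (hypothetical) keys "id" and "product".
    hgt_ids: identifiers of genes listed as horizontally transferred.
    stringent: if True, drop genes annotated as unknown, hypothetical
        or putative.
    """
    kept = []
    for gene in genes:
        product = gene["product"].lower()
        if stringent and any(word in product for word in EXCLUDED_KEYWORDS):
            continue  # possibly misannotated gene
        if gene["id"] in hgt_ids:
            continue  # putative horizontally transferred gene
        kept.append(gene)
    return kept
```

Passing `stringent=False` and an empty `hgt_ids` leaves the gene list unchanged, which corresponds to answering "no" to both options.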

Next, we use the BLASTP program to determine putative orthologous genes between two genomes by the bidirectional best hit (BBH) approach. In addition, we offer Inparanoid [14] as an alternative for identifying putative orthologous genes between any two genomes. It has been demonstrated that Inparanoid is the best among five currently existing methods of automatically detecting orthologous genes [16].
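The BBH criterion itself is simple to state in code. The sketch below assumes that the BLASTP results of both search directions have already been parsed into dictionaries mapping a (query, subject) pair to a score; the parsing step, the score used for ranking, and the function names are assumptions made for this illustration and are not taken from OGtree.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) such that b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

A gene whose best hit does not point back to it is left without a putative ortholog, which is what makes the approach conservative.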

After that, two adjacent genes in each genome are identified as overlapping genes (OGs), or an OG pair, if their CDSs overlap partially or completely. Two OGs, say (a, c) and (b, d), from different genomes are then considered as an orthologous OG pair if a and b, as well as c and d, are orthologous to each other, and (a, c) and (b, d) have the same overlapping structure.
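A minimal Python sketch of these two tests is given below. It assumes 1-based inclusive CDS coordinates, ignores the wrap-around at the origin of a circular chromosome, and follows the definition literally, so an orthologous OG annotated on the opposite strand in the other genome (with its two genes in reverse order) is not handled. The `Gene` record and all function names are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # +1 or -1


def find_ogs(genes):
    """Return pairs of adjacent genes whose CDSs overlap partially or completely."""
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    ogs = []
    for left, right in zip(ordered, ordered[1:]):
        if right.start <= left.end:  # the two intervals share at least one base
            ogs.append((left, right))
    return ogs


def og_structure(og):
    """Overlapping structure of an OG, from the strands of its two genes."""
    left, right = og
    if left.strand == right.strand:
        return "unidirectional"
    return "convergent" if left.strand > 0 else "divergent"


def is_orthologous_og_pair(og_a, og_b, orthologs):
    """True if (a, c) and (b, d) are orthologous gene by gene and share a structure.

    orthologs: set of (gene_id in genome A, gene_id in genome B) pairs.
    """
    (a, c), (b, d) = og_a, og_b
    return (
        (a.gene_id, b.gene_id) in orthologs
        and (c.gene_id, d.gene_id) in orthologs
        and og_structure(og_a) == og_structure(og_b)
    )
```

Sorting by start coordinate makes adjacency a matter of comparing each gene with its successor, so the overlap test reduces to a single comparison.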

Finally, for any two genomes G_i and G_j, we compute their OG distance D_{i,j} on the basis of their OG pairs. We then apply distance-based tree-building approaches, such as UPGMA, NJ and FM, to the matrix of overlapping-gene distances between genomes to construct the genome trees of the input prokaryotic genomes.
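To make the last step concrete, the following is a compact pure-Python sketch of UPGMA, one of the distance-based methods mentioned above. It is not the code used by OGtree: it returns only the tree topology as a Newick string without branch lengths, and the function name and input layout (a list of labels plus a symmetric matrix) are assumptions made for this illustration.

```python
def upgma(names, dist):
    """Cluster genomes by UPGMA and return the topology as a Newick string.

    names: list of genome labels.
    dist: symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(names)
    # cluster id -> (Newick subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(n)}
    # pairwise distances, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # average distance over all leaf pairs (size-weighted mean)
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = (f"({tree_a},{tree_b})", size_a + size_b)
        next_id += 1
    newick, _ = next(iter(clusters.values()))
    return newick + ";"
```

The input matrix would be filled with the pairwise overlapping-gene distances; the resulting Newick string can be passed to any standard tree viewer.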

Figure 3.1: The flowchart of our algorithm. (1) Input a set of accession numbers of species genomes. (2) Download these complete genomes from NCBI. (3) Extract the ORFs of each genome. (4) Optionally discard ORFs annotated as horizontally transferred genes. (5) Optionally discard ORFs annotated as “hypothetical” or “putative” genes. (6) Apply the BBH approach or the INPARANOID program to each genome pair to identify the families of orthologous genes. (7) Calculate the overlapping-gene distance between any pair of genomes. (8) Output the genome tree constructed from the matrix of overlapping-gene distances.

Chapter 4

Implementation

Based on the algorithm we described in the previous chapter, we have implemented a web server called OGtree (short for genome tree based on Overlapping Genes). The kernel programs of OGtree were written in C and Perl. Its web interface was implemented in PHP. It is available at http://bioalgorithm.life.nctu.edu.tw/OGtree/ for online analysis and can be easily accessed via a simple web interface, as shown in Figure 4.1.
