QuartetS - Large-Scale Detection of Orthologous Genes

Chapter 2 Preliminaries

2.3 QuartetS

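In 2011, Yu et al. [24] proposed QuartetS, a method for large-scale detection of orthologous genes. QuartetS separates paralogs from orthologs by using evolutionary evidence of gene duplication events. This evidence comes from a gene tree built from the two target genes and two genes from a third genome. The QuartetS gene tree has four member genes: x, y, z₁, and z₂. If gene x in genome X and gene y in genome Y form a BBH (Bidirectional Best Hit) pair, then x and y are homologous genes. The genes z₁ and z₂, taken from a third genome Z in the database, provide the evolutionary evidence of a potential duplication event.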
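The BBH condition above can be illustrated with a minimal sketch. It assumes the BLAST results are already available as dictionaries mapping a (query, subject) pair to a bit score; the function names `best_hits` and `bbh_pairs` and the gene identifiers are hypothetical and are not taken from QuartetS.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict mapping (query, subject) -> BLAST bit score (assumed input format).
    """
    best = {}
    for (query, subject), bits in scores.items():
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def bbh_pairs(x_vs_y, y_vs_x):
    """Return the (x, y) pairs that are each other's best hit in both directions."""
    forward = best_hits(x_vs_y)   # genome X genes searched against genome Y
    reverse = best_hits(y_vs_x)   # genome Y genes searched against genome X
    return sorted((x, y) for x, y in forward.items() if reverse.get(y) == x)


# Hypothetical bit scores for illustration only.
x_vs_y = {("x1", "y1"): 200, ("x1", "y2"): 80, ("x2", "y2"): 150, ("x3", "y1"): 90}
y_vs_x = {("y1", "x1"): 198, ("y2", "x2"): 149, ("y2", "x1"): 70}
print(bbh_pairs(x_vs_y, y_vs_x))
```

Here x3 also hits y1 best, but y1's best hit is x1, so (x3, y1) is not a BBH pair.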

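The main idea of QuartetS is to check whether genes x and y descend from the same duplication event that lies on the path between the paralogs z₁ and z₂. If they do, the authors infer that x and y are paralogs. Otherwise, this pair of paralogs z₁ and z₂ is considered unable to establish the relationship between x and y (Figure 2.3).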

Figure 2.3: Unrooted gene trees built for genes x, y, z₁, and z₂. Panels a, b, c, and d illustrate how QuartetS works.

As Figure 2.3 shows, the four panels are the unrooted gene trees built for genes x, y, z₁, and z₂ …

Figure 2.4: Position of α in the gene tree.

The authors approximate α with the following equation.

Here S_{i,j} is the similarity between genes i and j, measured by the BLAST bit score.

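If α exceeds a predefined threshold (Ω, 20 by default), the authors infer that genes x and y are paralogs. The larger Ω is, the fewer paralogs and the more orthologs are reported, and vice versa.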
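The thresholding step can be sketched as follows. The sketch takes the α values as given inputs, one per (z₁, z₂) pair examined, and it assumes that a BBH pair is reported as orthologous when no examined pair gives α above Ω; that aggregation rule and the name `infer_relationship` are assumptions made for illustration, not details stated in the text above.

```python
def infer_relationship(alphas, omega=20.0):
    """Classify a BBH pair (x, y) from the alpha values of its quartets.

    alphas: iterable of alpha values, one per (z1, z2) pair examined.
    omega:  threshold; the text gives 20 as the default.
    Assumption: a single quartet with alpha > omega is enough to call
    the pair paralogous; otherwise the pair is kept as orthologous.
    """
    return "paralog" if any(alpha > omega for alpha in alphas) else "ortholog"


# Hypothetical alpha values for one BBH pair.
alphas = [5.0, 25.0]
print(infer_relationship(alphas))            # default threshold of 20
print(infer_relationship(alphas, omega=30))  # stricter threshold
```

Raising Ω from 20 to 30 turns the same pair from a paralog call into an ortholog call, which is the trade-off the paragraph describes.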

Chapter 3

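… the homology relationship between gene x in genome X and gene y in genome Y. If the most recent common ancestor of the ingroup genes (that is, the root) lies on the internal branch, as shown in Figure 3.1, then genes x and y may derive from the gene duplication event that produced the paralogs z₁ and z₂, and we infer that x and y are paralogs. Conversely, if the outgroup gene o places the most recent common ancestor of the ingroup genes on an external branch of the ingroup, this gives no evidence that x and y derive from the duplication event of z₁ and z₂, so it cannot be used to establish the relationship between x and y, as shown in Figure 3.2.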

Figure 3.1: The most recent common ancestor of the ingroup genes lies on the internal branch.

Figure 3.2: The most recent common ancestor of the ingroup genes lies on an external branch.
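The root-position test can be sketched in a few lines. The sketch assumes the five-gene tree has already been rooted with the outgroup gene o and that o has been pruned, leaving a rooted binary tree of the four ingroup genes written as nested 2-tuples. Under that assumption, the root lies on the internal branch of the quartet exactly when it splits the ingroup genes two and two. The tuple representation and the function names are hypothetical.

```python
def count_leaves(node):
    """Count leaves in a tree written as nested tuples of gene names."""
    if isinstance(node, tuple):
        return sum(count_leaves(child) for child in node)
    return 1


def root_on_internal_branch(rooted_ingroup):
    """True if the root splits the four ingroup genes into two pairs."""
    if len(rooted_ingroup) != 2 or count_leaves(rooted_ingroup) != 4:
        raise ValueError("expected a rooted binary tree with four leaves")
    left, right = rooted_ingroup
    return count_leaves(left) == 2 and count_leaves(right) == 2


def infer_from_outgroup(rooted_ingroup):
    """Paralog call when the root is internal; otherwise no evidence."""
    if root_on_internal_branch(rooted_ingroup):
        return "paralog"
    return "no evidence"


# Root on the internal branch (the situation of Figure 3.1).
print(infer_from_outgroup((("x", "z1"), ("y", "z2"))))
# Root on an external branch (the situation of Figure 3.2).
print(infer_from_outgroup(("x", ("y", ("z1", "z2")))))
```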

3.2 Algorithm

A. First, take gene x as the query gene and search genome Z with BLAST, …

… gene o cannot be used as the outgroup gene, the program switches to the four candidate outgroup …

Figure 3.3: Flowchart of our method.

3.3 BBH Method and EC Number

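For a BBH, we require the alignment to cover more than 50% of the sequence length and the bit score of the pairwise alignment to be greater than 50. We use EC numbers to represent gene function. The EC number database [26] was built by the Swiss Institute of Bioinformatics from enzyme nomenclature and assigns EC numbers according to the enzymatic function of each protein; we use it as our reference. We also remove proteins that have more than one EC number. The rationale for using EC numbers is that, since orthologs share the same function, we generally expect orthologs to share the same EC number. Therefore, if a predicted ortholog pair has the same EC number, we count the prediction as a True Positive; otherwise we count it as a False Positive.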
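The two criteria above can be sketched as follows. The text does not say which sequence the 50% coverage refers to, so this sketch assumes it must hold for both the query and the subject; it also assumes that a pair can only be scored when each protein has exactly one EC number. The function names and example values are hypothetical.

```python
def passes_bbh_filter(aligned_len, query_len, subject_len, bit_score,
                      min_coverage=0.5, min_bits=50.0):
    """Check the coverage and bit-score cutoffs for one pairwise alignment.

    Assumption: coverage must exceed the cutoff on both sequences.
    """
    coverage = min(aligned_len / query_len, aligned_len / subject_len)
    return coverage > min_coverage and bit_score > min_bits


def label_pair(ec_x, ec_y):
    """Label a predicted ortholog pair using the EC numbers of its proteins.

    ec_x, ec_y: sets of EC numbers annotated for each protein.
    Returns "TP", "FP", or None when the pair is excluded from scoring.
    """
    if len(ec_x) != 1 or len(ec_y) != 1:
        return None  # multi-EC (or unannotated) proteins are left out
    return "TP" if ec_x == ec_y else "FP"


print(passes_bbh_filter(300, 400, 500, 120.0))
print(label_pair({"2.7.1.1"}, {"2.7.1.1"}))
print(label_pair({"2.7.1.1"}, {"2.7.1.2"}))
```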

Chapter 4

Experimental Results

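In this chapter, we experimentally compare the predictions of our method with those of QuartetS for large-scale ortholog detection, and we discuss the strengths and weaknesses of our method. Note that the BBH pairs used by our method and by QuartetS differ somewhat, because the number of BBH pairs depends on the size of the database.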

4.1 7 γ-Proteobacteria Genomes and 4 Outgroup Genomes

Table 4.1 lists the seven γ-Proteobacterial genomes used in our experiments, and Table 4.2 lists the four outgroup genomes.

Table 4.1: The seven γ-Proteobacterial genomes.

Table 4.2: The four outgroup genomes.

4.2 Experimental Results

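The results of our method are shown in Table 4.3; the part marked by the right arrow is the prediction of our method, and the part marked by the left arrow is the result of QuartetS. In Table 4.4, the first column gives the number of paralogs found and the second column the number of orthologs found. The third column is the intersection of the paralogs found by our method with the orthologs found by QuartetS. The fourth column is the intersection of the orthologs found by our method with the paralogs found by QuartetS. The third column shows that 416 of the genes that QuartetS reports as orthologs are inferred to be paralogs by our method, and the first column shows that our method finds more paralogs than QuartetS. The other six experiments show the same pattern, which indicates that our method separates more paralogs from the orthologs.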

Table 4.3: Experimental results of our method.

Table 4.4: Results for Escherichia coli versus Pseudomonas aeruginosa.

We next examine the results for Escherichia coli (NC_000913, eco) versus Pseudomonas aeruginosa (NC_002516, pae) (Table 4.5). QuartetS yields 379 pairs with EC numbers, of which 372 are True Positives and 7 are False Positives; our method yields 302 pairs, of which 298 are True Positives and 4 are False Positives. Of the 7 QuartetS False Positives, 3 are inferred to be paralogs by our method (boxed in Table 4.6).

Table 4.5: Results for E. coli versus P. aeruginosa, using EC numbers to represent function.

Table 4.6: The 7 QuartetS False Positives; 3 of them are inferred to be paralogs by our method.

We then examine the results for Pseudomonas aeruginosa (NC_002516, pae) versus Salmonella Typhimurium (NC_003197, stm) (Table 4.7). QuartetS yields 334 pairs with EC numbers, of which 328 are True Positives and 6 are False Positives; our method yields 276 pairs, of which 275 are True Positives and 1 is a False Positive. Of the 6 QuartetS False Positives, 3 are inferred to be paralogs by our method (boxed in Table 4.8), and another 2 are BBH pairs that our method does not have.

Table 4.7: Results for P. aeruginosa versus S. Typhimurium, using EC numbers to represent function.

Table 4.8: The 6 QuartetS False Positives; 3 of them are inferred to be paralogs by our method, and 2 are BBH pairs that our method does not have and therefore cannot be compared.

Finally, we examine the results for Escherichia coli (NC_000913, eco) versus the aphid endosymbiont Buchnera aphidicola (NC_004545, BBp) (Table 4.9). QuartetS yields 230 pairs with EC numbers, of which 229 are True Positives and 1 is a False Positive; our method yields 183 pairs, all 183 of which are True Positives, with no False Positives. The single QuartetS False Positive is inferred to be a paralog by our method (Table 4.10).

Table 4.9: Results for E. coli versus the aphid endosymbiont, using EC numbers to represent function.

Table 4.10: The single QuartetS False Positive, which our method infers to be a paralog.

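We summarize the results of QuartetS and our method in Table 4.11. QuartetS has 12 False Positives among 1467 inferred pairs, a False Positive Rate of 0.8%; our method has 5 False Positives among 1219 inferred pairs, a False Positive Rate of 0.4%. Here the False Positive Rate is the number of False Positives divided by the total number of inferred pairs. Figure 4.1 suggests that our method is more accurate than QuartetS.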

Table 4.11: Comparison of the FPR of our method and QuartetS.

Figure 4.1: The False Positive Rates indicate that our method is more accurate than QuartetS.
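The reported rates can be checked with a short calculation. The sketch uses only the counts quoted in the text for Table 4.11 and computes the rate as False Positives divided by all inferred pairs; the function name is hypothetical.

```python
def false_positive_rate(false_positives, total_inferred):
    """False positives as a fraction of all inferred pairs."""
    if total_inferred <= 0:
        raise ValueError("total_inferred must be positive")
    return false_positives / total_inferred


# Counts quoted in the text for Table 4.11.
quartets_rate = false_positive_rate(12, 1467)
our_rate = false_positive_rate(5, 1219)

print(f"QuartetS:   {quartets_rate:.1%}")
print(f"Our method: {our_rate:.1%}")
```

Rounded to one decimal place, the two values agree with the 0.8% and 0.4% reported above.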

4.3 Execution Time Requirement in Our Program

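During the experiments, we found that our method takes longer to run than existing methods and applications. We therefore measured the execution time of each part of our method (Table 4.12). The table shows that ClustalW, used for multiple sequence alignment, and the PHYLIP package, used for building gene trees, together account for 93.66% of the total time (Figure 4.2). If this bottleneck can be reduced, the overall execution time will drop substantially.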

Table 4.12: Time cost of the tools used by our method.

Figure 4.2: Pie chart of the share of execution time taken by each program in our method.
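To give a rough sense of how much could be gained, the sketch below applies Amdahl's law to the 93.66% share reported above. The speed-up factors are hypothetical, and the calculation assumes the remaining 6.34% of the run time is unchanged; it is an estimate, not a measurement.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: overall speed-up when only part of the run is accelerated.

    accelerated_fraction: share of total time spent in the accelerated part.
    factor: how many times faster that part becomes.
    """
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


ALIGN_AND_TREE_SHARE = 0.9366  # ClustalW plus PHYLIP share reported in the text

for factor in (2, 10, 100):  # hypothetical speed-up factors
    print(f"{factor}x faster alignment and tree building: "
          f"{overall_speedup(ALIGN_AND_TREE_SHARE, factor):.2f}x overall")
```

Under these assumptions the overall gain is bounded by 1 / (1 - 0.9366), so even an arbitrarily fast alignment and tree-building step cannot make the whole pipeline more than about 16 times faster.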

Chapter 5

References

1. Liolios,K., Chen,I.M., Mavromatis,K., Tavernarakis,N., Hugenholtz,P., Markowitz,V.M. and Kyrpides,N.C. 2010 The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Research, 38, D346–D354.

2. Berglund AC, Sjölund E, Ostlund G, Sonnhammer EL 2008: InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Research, 36 Database: D263-266.

3. Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS 2006: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Research, 34 Database: D363-368.

4. Tatusov RL, Galperin MY, Natale DA, Koonin EV 2000: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28:33-36.

5. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E 2008: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 36 Database: D13-21.

6. Jensen LJ, Julien P, Kuhn M, von Mering C, Muller J, Doerks T, Bork P 2008: eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Research, 36 Database: D250-254.

7. Schneider A, Dessimoz C, Gonnet GH 2007: OMA Browser-exploring orthologous relations across 352 complete genomes. Bioinformatics, 23(16):2180-2182.

8. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A, Birney E 2007: Ensembl 2007. Nucleic Acids Research, 35 Database: D610-617.

9. Koonin,E.V. 2005 Orthologs, paralogs, and evolutionary genomics. Annual Review of Genetics, 39, 309–338.

10. Ohta,T. 2003 Evolution by gene duplication revisited: differentiation of regulatory elements versus proteins. Genetica, 118, 209–216.

11. Serres,M.H., Kerr,A.R., McCormack,T.J. and Riley,M. 2009 Evolution by leaps: gene duplication in bacteria. Biology Direct, 4, 46.

12. Dufayard,J.F., Duret,L., Penel,S., Gouy,M., Rechenmann,F. and Perriere,G. 2005 Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics, 21, 2596–2603.

13. Zmasek,C.M. and Eddy,S.R. 2002 RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3, 14.

14. Hollich,V., Storm,C.E. and Sonnhammer,E.L. 2002 OrthoGUI: graphical presentation of Orthostrapper results. Bioinformatics, 18, 1272–1273.

15. Remm,M., Storm,C.E. and Sonnhammer,E.L. 2001 Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology, 314, 1041–1052.

16. Salter,L.A. and Pearl,D.K. 2001 Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Systematic Biology, 50, 7–17.

17. Altenhoff,A.M. and Dessimoz,C. 2009 Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Computational Biology, 5, e1000262.

18. Li,L., Stoeckert,C.J. Jr and Roos,D.S. 2003 OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research, 13, 2178–2189.

19. Dessimoz,C., Cannarozzi,G., Gil,M., Margadant,D., Roth,A., Schneider,A. and Gonnet,G.H. 2005 OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: Introduction and first achievements. Compare Genomics, 3678, 61–72.

20. Alexeyenko,A., Tamas,I., Liu,G. and Sonnhammer,E.L. 2006 Automatic clustering of orthologs and in paralogs shared by multiple proteomes. Bioinformatics, 22, e9–e15.

21. Dessimoz,C., Boeckmann,B., Roth,A.C. and Gonnet,G.H. 2006 Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic Acids Research, 34, 3309–3316.

22. Fulton,D.L., Li,Y.Y., Laird,M.R., Horsman,B.G., Roche,F.M. and Brinkman,F.S. 2006 Improving the specificity of high-throughput ortholog prediction. BMC Bioinformatics, 7, 270.

23. Roth,A.C., Gonnet,G.H. and Dessimoz,C. 2008 Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics, 9, 518.

24. Yu,C., Zavaljevski,N., Desai,V. and Reifman,J. 2011 QuartetS: a fast and accurate algorithm for large-scale orthology detection. Nucleic Acids Research, 39, e88.

25. C. Dessimoz et al. 2000 Bairoch, A.: The ENZYME database in 2000. Nucleic Acids Research, 28, 304–305.
