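National Science Council, Executive Yuan: Research Project Final Report

Design and Implementation of a Distributed Construction Environment for Optimal Evolutionary Trees (I): Research Results Report (Concise Version)

Project type: Individual

Project number: NSC 94-2213-E-216-028-

Project period: August 1, 2005 to July 31, 2006

Executing institution: Department of Information Management, Chung Hua University

Principal investigator: Kun-Ming Yu (游坤明)

Co-principal investigator: Chuan Yi Tang (唐傳義)

Project participants: master's students serving as part-time assistants: Jiayi Zhou (周嘉奕), Yi-Lin Tsai (蔡宜霖), Li-Ming Huang (黃立明)

Attachments: report on attending an international conference, with the published paper

Availability: this project is open to public inquiry

December 22, 2006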

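National Science Council, Executive Yuan: Research Project Final Report

Design and Implementation of a Distributed Construction Environment for Optimal Evolutionary Trees (I)

Project type: Individual project

Project number: NSC 94-2213-E-216-028-

Project period: August 1, 2005 to July 31, 2006

Executing institution: Department of Information Management, Chung Hua University

Principal investigator: Kun-Ming Yu (游坤明)

Co-principal investigator: Chuan Yi Tang (唐傳義)

Project participants: Jiayi Zhou (周嘉奕), Li-Ming Huang (黃立明), Yi-Lin Tsai (蔡宜霖)

Report type: Concise report

Attachments: report on attending an international conference, with the published paper

Availability: this project is open to public inquiry

October 16, 2006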

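National Science Council, Executive Yuan: Subsidized Research Project ■ Final Report □ Interim Progress Report

Design and Implementation of a Distributed Construction Environment for Optimal Evolutionary Trees

Project type: ■ Individual project □ Integrated project

Project number: NSC 94-2213-E-216-028

Project period: August 1, 2005 to July 31, 2006

Principal investigator: Kun-Ming Yu (游坤明)

Co-principal investigator: Chuan Yi Tang (唐傳義)

Project participants: Jiayi Zhou (周嘉奕), Li-Ming Huang (黃立明), Yi-Lin Tsai (蔡宜霖)

Report type (submitted as required by the approved funding list): ■ Concise report □ Full report

This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

■ One report on attending an international academic conference, with one copy of the published paper

□ One overseas research report for an international cooperative research project

Availability: except for industry-academia cooperative research projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the cases below, the report may be opened to public inquiry immediately.

□ Involves patents or other intellectual property rights; open to public inquiry after □ one year □ two years

Executing institution: Department of Computer Science and Information Engineering, Chung Hua University

August 30, 2006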

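Design and Implementation of a Distributed Construction Environment for Optimal Evolutionary Trees

Kun-Ming Yu¹ and Chuan Yi Tang²

¹ Department of Computer Science and Information Engineering, Chung Hua University

² Department of Computer Science, Tsing Hua University

Abstract (translated from the Chinese):

Building an optimal evolutionary tree from a distance matrix that captures the evolutionary relationships among species, in order to learn how similar the species are in evolutionary terms, is an important topic in computational biology. The main goal of this project is to provide an efficient and user-friendly parallel system that constructs optimal ultrametric trees from distance matrices. In carrying out the project, we use compact sets to cluster the species data, which reduces the number of species and shortens construction time so that large evolutionary trees can be built. We also consider techniques such as the 2nd best methodology, the 3-3 relationship, and the 4-point relationship in exploring parallelization strategies and methods for constructing optimal evolutionary trees quickly. Finally, we integrate the project results into a Web-based execution environment (UTCE) for researchers in this field and for biologists who need to choose a suitable evolutionary tree for practical applications. The results obtained during the project have been written up as two papers presented at international conferences and two papers presented at domestic conferences; one of these papers is included in the Lecture Notes in Computer Science series (SCI), Springer-Verlag. In addition, we described the tree construction functions offered by the completed website (UTCE: Ultrametric Tree Construction and Evaluation platform) in a paper entered in the poster competition of the Evolutionary Genomics and Bioinformatics Symposium and Workshop (EGBS 2006), where it won the Excellent Poster Prize.

Keywords: ultrametric tree, branch and bound, evolutionary tree evaluation methods, cluster computers, grid computing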

Abstract:

The construction of evolutionary trees is an important problem in biology and taxonomy. The purposes of studying phylogenetics include (1) reconstructing the correct genealogical ties between species and (2) estimating the time at which species diverged from a common ancestor. Both are usually done by constructing evolutionary trees, which provide a great deal of information. We often assume that the evolutionary tree is an ultrametric tree whose leaves are the present species and whose interior nodes represent the ancestors of those species. By the definition of an ultrametric tree, the distances from the root to all leaves are the same, which means that the present species have progressed equally far in evolution.

In this project, we developed an effective parallel algorithm that constructs an optimal ultrametric tree from a given distance matrix, using the branch-and-bound technique together with clustering based on compact sets. We also studied several methods to speed up ultrametric tree construction, for example the 2nd best methodology, the 3-3 relationship, and the 4-point relationship.

In addition, we implemented efficient ultrametric tree construction algorithms for a cluster computing environment. Finally, we integrated the related results to provide an efficient and user-friendly Web-based ultrametric tree construction environment (UTCE: Ultrametric Tree Construction and Evaluation platform).

Keywords: Ultrametric Tree, Branch and Bound, Evolutionary Tree Evaluation, PC Cluster, Grid Computing

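I. Introduction

With the rapid development of molecular biology, DNA can now be extracted from all kinds of species, and comparisons between species have become quantitative. Once biologists have DNA sequences, how to compute the relatedness between sequences becomes an important issue. To do so, we usually first obtain a distance matrix that gives the distance between every pair of sequences, and then use this matrix to construct an evolutionary tree. To compute the relatedness between sequences, biologists line the sequences up so that identical parts are aligned as much as possible; this problem is called multiple sequence alignment. Once the distance matrix is available, many different mathematical models can be used to build an evolutionary tree. Most of these models lead to hard problems, by which we mean problems that are NP-hard or harder.

There are currently many different methods for building evolutionary trees. Depending on the assumptions made, they include many heuristic algorithms, and depending on the mathematical model, there are many different optimum algorithms. For biologists, none of these choices is absolutely good or bad, nor does any carry absolute meaning; biologists often arrive at the answer they want only after many tests and trials. Therefore, after constructing an evolutionary tree, we also try to evaluate its quality.

Among these methods, a tree can be built from a distance matrix or directly from DNA sequences. Most problems of building an optimal evolutionary tree have been proven to be NP-complete. In this project we chose to build evolutionary trees from distance matrices. Commonly used heuristic algorithms for this include (1) the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA), (2) the Transformed Distance Method, (3) the Neighbors-Relation Method, and (4) the Neighbor-Joining Method. Because we want the optimal tree, we chose the Minimum Size Ultrametric Tree model.

An ultrametric tree is also constructed from a distance matrix. It is a rooted tree whose leaves each represent a species and whose internal nodes represent the common ancestor of the species beneath them, under the assumption that every species evolves at the same rate. Under this assumption, every internal node of the resulting tree is equidistant from all the leaves beneath it. However, constructing a minimum size ultrametric tree (MUT) from the distance matrix of a set of species has likewise been proven to be NP-hard.

Most current methods for constructing evolutionary trees use branch-and-bound to find the optimal solution. When the data set is small, a single processor can handle the computation, but as the data grows, a single processor runs out of memory or cannot find the answer in reasonable time. With current technology, obtaining an answer in reasonable time therefore requires multiprocessors or parallel systems. Earlier work proposed a branch-and-bound method that constructs MUTs effectively on a single machine.

In our previously approved two-year project, we obtained the following experience and results on this topic. Taking [24] as our basis, we developed an efficient parallel algorithm. In our parallel system, all computing nodes branch on their own candidate trees simultaneously; when a node finds that a candidate tree meets the bounding condition, it stops branching on that tree. When a node obtains a better upper bound, it sends the value to all other nodes, which can then bound more candidate trees. For this reason, the solution set explored on a cluster is smaller than on a single processor, so the speedup of our parallel branch-and-bound algorithm may be super-linear. We use a global pool and local pools as a load-balancing mechanism so that computing nodes do not sit idle. Our system uses a master/slave architecture, and data is assigned by the master at run time.

Our results show that on a single processor, 23 species is the largest number for which an optimal solution can be obtained in acceptable time. To handle more species, parallelization or similar methods are needed to obtain optimal solutions within reasonable time (in general, we expect to obtain the desired tree within one day). With our parallel algorithm, the number of species can be pushed to 48 within reasonable time; beyond that, the time grows exponentially. The research therefore showed us that reducing the number of species (for example, by clustering the species to reduce the number of candidate trees) or using other constraints (such as the 2nd Best Methodology, the 3-3 Relationship, and the 4 Points Relationship) to speed up bounding or reduce the generation of candidate trees is the way toward an optimal solution for 136 species. Because the same distance matrix may yield many different evolutionary trees, how to evaluate the trees with a suitable mathematical model or criterion is a focus of our research and an important aid to evaluation by biologists. Finally, we hope to provide a user-friendly interface and to use the computing power of the grid to accelerate tree construction.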

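II. Research Objectives

With the rapid development of biology, many DNA sequences have been extracted from organisms, and once biologists have these sequences, how to compute the relatedness between them becomes an important issue. The main goal of this project is to provide an efficient and user-friendly parallel system that constructs optimal evolutionary trees from distance matrices. We want to obtain optimal trees for more species, but during branch-and-bound the number of candidate trees grows very quickly as the number of species increases. In our earlier work, the number of species for which an optimal tree could be obtained in reasonable time was still limited (about 48 species with the parallel program). How to reduce the number of species, or cluster them appropriately, is therefore a main direction of this project. We propose several methods, including compact sets, the 2nd Best Methodology, the 3-3 Relationship, and the 4 Points Relationship, and study how effectively they shorten the time needed to obtain the optimal tree. We also want to integrate all methods into a Web-based user interface in which users can choose parameters, set the computing mode and time limit, select the method used to evaluate tree quality, and view the shape of the resulting tree directly in the browser.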

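III. Literature Review

In the problem of constructing an optimal evolutionary tree, the number of branches generated during branch-and-bound grows very quickly as the number of species increases. For example, the number of branches far exceeds 10⁷ for 10 species and 10¹³ for 15 species, and reaches 10²¹ for 20 species. Consequently, for our ultimate goal of finding the optimal solution for 136 human mitochondrial species, the number of branches exceeds 10²⁶⁸. Given this observation, how to reduce the number of species or merge species is an important issue. During the project, we tried to use the compact set method to reduce or merge species. Compact set clustering is a hierarchical clustering algorithm based on the notion of compact sets in graph theory; it lets us find sets with strong cohesion.

In addition, we studied the 3-3 Relationship and the 4 Points Relationship. With the 3-3 Relationship, we consult the relationships among three points in the original distance matrix during tree construction, so that we can decide at branching time whether a branch should be expanded; this reduces the number of candidate trees quickly and substantially. With the 4 Points Relationship, we aim to exclude contradictory points in order to reduce the number of species, build the evolutionary tree faster, and evaluate the quality of a resulting tree.

In parallel processing, load balance is an important factor affecting overall efficiency. If the load is uneven across processing units, valuable computing resources are wasted on idle units. When solving a problem with branch-and-bound, the key lies in choosing the branching and bounding methods. In a parallel distributed system, the original problem is usually divided into subproblems with feasible solutions, and the one or more subproblems produced by branching are assigned to several processing units, so branching affects load balance. We therefore used a global pool and local pools as a load-balancing mechanism so that computing nodes do not sit idle. Our system uses a master/slave architecture, and data is assigned by the master at run time to improve overall computing efficiency.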

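IV. Self-Evaluation of Project Results

The results of this project make it possible to construct MUTs in less time. They not only improve execution efficiency but also substantially improve the correctness of the ultrametric trees and the readability of the results. This year's work met the planned objectives. The research results have been written up and published as three international conference papers (two indexed by SCI) and one domestic conference paper (EGBS 2006). We also turned the optimal tree construction results into tools and made them available as a website for researchers in related fields (UTCE: Ultrametric Tree Construction and Evaluation platform), which shares the project results effectively. Figures 1 to 5 show UTCE screens. Overall, this year's results are complete and substantial.

Figure 1: Home page of the Ultrametric Tree Construction and Evaluation platform

Figure 2: User input interface for Parallel Ultrametric Tree construction

Figure 3: User output screen after Parallel Ultrametric Tree construction completes

Figure 4: User input interface for construction with the 3PR constraint

Figure 5: MSA user input interface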

Appendix 1: Conference Papers

International conference papers

1. Kun-Ming Yu, Jiayi Zhou, Chun-Yuan Lin and Chuan Yi Tang, “An Efficient Parallel Algorithm for Ultrametric Tree Construction Based on 3PR,” Parallel and Distributed Processing and Applications, Lecture Notes in Computer Science, Vol. 4331, pp. 215–220, Springer-Verlag, Dec. 2006 (SCI). (NSC-94-2213-E-216-028)

2. Kun-Ming Yu and Yi-Lin Tsai, “An Efficient Scheduling Algorithm for Irregular Data Redistribution,” The 2006 IAENG International Workshop on Computer Science, Certificate of Merit (IMECS 2006, pp. 270-275, 6/20 ~ 6/22, 2006, Hong Kong).

Domestic conference papers

1. Kun-Ming Yu, Jiayi Zhou, Chun-Yuan Lin and Chuan Yi Tang, “UTCE: Ultrametric Tree Construction and Evaluation platform,” The Evolutionary Genomics and Bioinformatics Symposium and Workshop, Excellent Poster Prize (EGBS 2006, pp. 27, 8/15 ~ 8/17, 2006, Taipei, Taiwan). (NSC-94-2213-E-216-028)

2. 游坤明、徐蓓芳、賴威廷、謝一功、周嘉、林俊淵、唐傳義, “Building a High-Performance Parallel Construction Environment for Evolutionary Trees Using the Grid” (應用網格建立一個高效能演化樹平行建構環境), 2005 National Computer Symposium (NCS 2005), Abs. pp. 51 (Kun Shan University, Tainan, 12/15 ~ 12/16, 2005). (NSC-93-2213-E-216-037, NSC-94-2213-E-216-028)


G. Min et al. (Eds.): ISPA 2006 Ws, LNCS 4331, pp. 215–220, 2006.

An Efficient Parallel Algorithm for Ultrametric Tree Construction Based on 3PR*

Kun-Ming Yu¹, Jiayi Zhou²,**, Chun-Yuan Lin³, and Chuan Yi Tang⁴

¹ Department of Computer Science and Information Engineering, Chung Hua University

² Institute of Engineering Science, Chung Hua University

³ Institute of Molecular and Cellular Biology, National Tsing Hua University

⁴ Department of Computer Science, National Tsing Hua University, Hsinchu 300, Taiwan, R.O.C.

¹ yu@chu.edu.tw, ² jyzhou@pdlab.csie.chu.edu.tw, ³ cyulin@mx.nthu.edu.tw, ⁴ cytang@cs.nthu.edu.tw

Abstract. Constructing phylogenetic trees is an important problem in computational biology and taxonomy. A phylogenetic tree represents the relationships and histories of a set of species and helps biologists study existing species. One popular model is the ultrametric tree, which assumes that the rate of evolution is constant. UPGMA is a well-known algorithm for constructing ultrametric trees. However, UPGMA is a heuristic, and it cannot guarantee that the constructed tree has minimum size. Constructing a minimum ultrametric tree (MUT) has been shown to be NP-hard. In this paper, we propose an efficient parallel branch-and-bound algorithm with the 3-Point Relationship (3PR) that reduces construction time substantially. 3PR is a relationship between a distance matrix and the constructed phylogenetic tree. The main idea is that any two species that are close to each other in the distance matrix should also be close to each other in the constructed tree. We use this property to rank branching paths as higher or lower priority, and we move the lower-ranked paths to a delay bound pool instead of removing them, which ensures that the optimal solution can still be found. The experimental results show that the parallel algorithm with 3PR saves about 25% of the computing time on average.

Keywords: phylogenetic tree, minimum ultrametric tree, parallel branch-and- bound algorithm, 3-point relationship, 4-point relationship.

1 Introduction

Constructing phylogenetic trees is an important problem in computational biology and taxonomy. A phylogenetic tree represents the histories of a set of species and helps biologists study existing species or evaluate the relationships among them. However, the real evolutionary histories are unknown in practice. Therefore, many methods have been proposed that try to construct a meaningful phylogenetic tree that is close to the real one.

* The work is partially supported by National Science Council. (NSC 94-2213-E-216 -028).

** The corresponding author.


When the input is a distance matrix, a phylogenetic tree is constructed according to that matrix [10,11]. In general, its entries are edit distances between the sequences of each pair of species. Many different models and associated algorithmic problems have been proposed [1,9]. However, most optimization problems for phylogenetic tree construction have been shown to be NP-hard [2-4,6,7]. An important and commonly used model assumes that the rate of evolution is constant.

Under this assumption, the phylogenetic tree is an ultrametric tree (UT), which is a rooted, leaf-labeled, edge-weighted binary tree. Because many of these problems are intractable and NP-hard, biologists usually construct the trees with heuristic algorithms. The Unweighted Pair Group Method with Arithmetic mean (UPGMA, [1]) is one of the popular heuristic algorithms for constructing UTs.

Although constructing MUTs is an NP-hard problem, it is still worthwhile for a moderate number of species, so it might seem possible to find an optimal tree by exhaustive search. Nevertheless, for n species, the number of rooted, leaf-labeled trees, A(n), grows very rapidly. For example, $A(10) > 10^{7}$, $A(20) > 10^{21}$, and $A(30) > 10^{37}$. Hence, it is impossible to search all possible trees exhaustively even when n is of moderate size. Wu et al. [13] proposed a branch-and-bound algorithm for constructing MUTs that avoids exhaustive search. The branch-and-bound strategy is a general technique for solving combinatorial search problems.
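The growth of A(n) can be checked numerically. The sketch below assumes that A(n) counts rooted, leaf-labeled binary trees, for which the standard closed form is the double factorial (2n-3)!!. The text above does not reproduce the paper's own formula, so both the formula and the function name `num_rooted_binary_trees` are assumptions made for illustration.

```python
def num_rooted_binary_trees(n: int) -> int:
    """Count rooted, leaf-labeled binary trees on n leaves.

    Assumption: uses the standard closed form (2n-3)!! = 1 * 3 * 5 * ... * (2n-3).
    The paper's own expression for A(n) is not given in the text.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= odd
    return count


if __name__ == "__main__":
    # Compare against the lower bounds quoted in the text.
    for n, exponent in [(10, 7), (20, 21), (30, 37)]:
        a_n = num_rooted_binary_trees(n)
        print(f"A({n}) = {a_n}; exceeds 10^{exponent}: {a_n > 10 ** exponent}")
```

Under this assumption the three bounds quoted above hold, which is why exhaustive enumeration is impractical well before n reaches 30.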

In this paper, the 3-Point Relationship (3PR) is used to construct MUTs more efficiently. 3PR is the relationship between a distance matrix and the constructed phylogenetic tree. The idea is that, in a triplet of species (a, b, c), the two species that are closest to each other in the distance matrix should also be closest to each other in the constructed phylogenetic tree. The experimental results show that PBBU with 3PR reduces computation time by about 25% for both the sequential and the parallel algorithm.

The paper is organized as follows. Section 2 gives preliminaries on the sequential branch-and-bound algorithm and 3PR. Section 3 describes the parallel algorithm. Section 4 presents our experimental results, and the final section gives our conclusions.

2 Preliminaries

In this paper, we present PBBU with 3PR for constructing minimum ultrametric trees. In the following, we denote an edge-weighted graph by G = (V, E, w), with vertex set V, edge set E, and edge weight function w. Some definitions are given as follows:

Definition 1: A distance matrix of n species is a symmetric n × n matrix M such that $M[i,j] \ge 0$ and $M[i,i] = 0$ for all $1 \le i, j \le n$.

Definition 2: Let T = (V, E, w) be an edge-weighted tree and u, v ∈ V. The path length from u to v is denoted by $d_T(u,v)$. The weight of T is defined by $w(T) = \sum_{e \in E} w(e)$.

Definition 3: For any M (not necessarily a metric), an MUT for M is an ultrametric tree T with minimum $w(T)$ such that $L(T) = \{1, \ldots, n\}$ and $d_T(i,j) \ge M[i,j]$ for all $1 \le i, j \le n$. The problem of finding an MUT for M is called the MUT problem.

Definition 4: Let P be a topology and a, b ∈ L(P). LCA(a, b) denotes the lowest common ancestor of a and b. If x and y are two nodes of P, we write x → y if and only if x is an ancestor of y.

Definition 5: A distance matrix M and the rooted topology of a phylogenetic tree are consistent if, for any $1 \le i, j, k \le n$, $M[i,j] < \min(M[i,k], M[j,k])$ if and only if $LCA(i,k) = LCA(j,k) \rightarrow LCA(i,j)$. Otherwise they are contradictory.

2.1 Sequential Branch-and-Bound Algorithm for MUTs

In the MUT construction problem, branch-and-bound is a tree search algorithm that repeatedly searches the branch-and-bound tree (BBT) [8,14] for a better solution until the optimal one is found. Each node of the BBT represents a topology of a UT. Assume that the root of the BBT has depth 0; then each node at depth i in the BBT represents a topology with leaf set {1,...,i+2}.

2.2 3-Point Relationship (3PR)

3PR is a logical method for checking whether the LCA relation of any triplet of species (a, b, c) in a distance matrix is preserved in the constructed phylogenetic tree. For any two species (a, b), LCA(a, b) denotes the lowest common ancestor of a and b. If x and y are two nodes in a phylogenetic tree, we write x → y if x is an ancestor of y. For a triplet of species (a, b, c) in the distance matrix M, if the distance M[a, b] between species a and b is less than both M[a, c] and M[b, c], then LCA(a, c) = LCA(b, c) → LCA(a, b), which is written ((a, b), c); in Newick tree format. A triplet (a, b, c) is contradictory if this relation in the distance matrix is not preserved in the constructed phylogenetic tree. 3PR can be used to evaluate the quality of constructed phylogenetic trees: a tree is considered unreliable if the number of contradictory triplets is large. The evaluation result may help biologists choose a suitable phylogenetic tree construction tool.
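The triplet check described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the tree is given as nested tuples of leaf labels and the distance matrix as a dictionary of dictionaries, and the helper names `leaves`, `lca_depth`, and `contradictory_triplets` are hypothetical. The example matrix uses the values from Table 1 of this paper.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def lca_depth(tree, x, y, depth=0):
    """Depth of the lowest common ancestor of leaves x and y (root has depth 0)."""
    if isinstance(tree, tuple):
        for child in tree:
            under = leaves(child)
            if x in under and y in under:
                return lca_depth(child, x, y, depth + 1)
    return depth


def contradictory_triplets(matrix, tree):
    """List triplets whose closest pair in the matrix is not the closest pair in the tree."""
    bad = []
    for a, b, c in combinations(sorted(matrix), 3):
        for i, j, k in ((a, b, c), (a, c, b), (b, c, a)):
            if matrix[i][j] < min(matrix[i][k], matrix[j][k]):
                # 3PR expects LCA(i, k) = LCA(j, k) to be a proper ancestor of LCA(i, j).
                d_ij = lca_depth(tree, i, j)
                if not (d_ij > lca_depth(tree, i, k) == lca_depth(tree, j, k)):
                    bad.append((a, b, c))
    return bad


# Distance matrix from Table 1.
TABLE_1 = {
    "a": {"a": 0, "b": 25, "c": 20},
    "b": {"a": 25, "b": 0, "c": 15},
    "c": {"a": 20, "b": 15, "c": 0},
}

if __name__ == "__main__":
    print(contradictory_triplets(TABLE_1, (("b", "c"), "a")))  # b and c grouped first
    print(contradictory_triplets(TABLE_1, (("a", "b"), "c")))  # a and b grouped first
```

Because b and c are the closest pair in Table 1, only the topology that joins b and c first is free of contradictory triplets.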

3 Parallel Branch-and-Bound Algorithm with 3PR

The Parallel Branch-and-Bound Algorithm with 3PR (PBBU with 3PR) is designed for distributed-memory multiprocessors with a master-slave architecture. PBBU uses branch-and-bound to avoid an exhaustive search of all possible trees. For load balancing, the master processor (MP) holds a Global Pool and each slave processor (SP) has a Local Pool. Moreover, we use a new data structure instead of a linked list to store the BBT.

In [5], 3PR is applied as a tree evaluation method. We use this property to put lower-ranked branching paths into the Delay Bound Pool (DBP) when selecting a branching path in the branch-and-bound algorithm. For example, Table 1 is a distance matrix and Figure 1 shows the two candidates obtained when inserting the third species c. In PBBU without 3PR, both candidates (a) and (b) must be added to the pool when branching.

However, the topology of (b) is closer to the distance matrix, so it obtains the higher rank and (a) the lower rank. In PBBU with 3PR, only candidate (b), which has the higher rank, is selected, because the distance between a and c is greater than the distance between b and c. This choice rests on the idea that, in a triplet of species (a, b, c), the two species that are closest to each other in the distance matrix should also be closest to each other in the corresponding phylogenetic tree. However, 3PR cannot be used directly to bound the other branching paths, so PBBU with 3PR puts the other candidates into the DBP to ensure that the optimal solution can still be found.

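Table 1. Distance matrix

|   | a  | b  | c  |
|---|----|----|----|
| a | 0  | 25 | 20 |
| b | 25 | 0  | 15 |
| c | 20 | 15 | 0  |

Fig. 1. Candidate BBT. Panels (a) and (b) show the two candidate topologies obtained when inserting species c.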

4 Experimental Results

We implemented PBBU and PBBU with 3PR on a Linux-based PC cluster. Each computing node is an AMD Athlon PC with a 2.0 GHz clock rate and 1 GB of memory, and the nodes are connected by a 100 Mbps network. Two data sets are used to test the algorithms. One is a randomly generated data set whose distance matrix is metric, with distances between 1 and 100. The other is a data set of 136 Human Mitochondrial DNAs (HMDNA) obtained from [12]; its distance matrix is metric, with distances between 1 and 200. To reduce the effect of data dependence, we run 10 instances for each test case and then compare the average, median, and worst cases.

Figures 2 and 3 show that PBBU with 3PR and the delay bound technique finds the optimal solution and saves about 25% of the computation time compared with PBBU without 3PR. Because the 3PR technique moves lower-ranked candidates that do not agree with 3PR to the delay bound pool, a better bounding value is found earlier, which in turn bounds more candidates and decreases computation time.

Figure 4 shows the speed-up ratio on the HMDNA data set. The speed-up ratio with 3PR is better than without 3PR, and the gap widens as the number of processors increases, because a tighter bounding value is found more quickly with more processors. This also shows that our algorithm scales to a large number of computing resources. Figure 5 shows the computation time of PBBU with 3PR on 16 processors for different numbers of species. The computation time grows rapidly as the number of species increases. Moreover, the proportion of time saved by PBBU with 3PR relative to PBBU grows with the number of species. We believe this is because a larger number of species yields more candidates, so the tighter bounding value obtained through the 3PR technique bounds a greater number of candidates and thereby decreases the computation time.

Fig. 2. 3PR vs. Without 3PR (HMDNA). Time (sec.) against the number of processors (1, 2, 4, 8, 16), with series "Without 3PR" and "With 3PR".

Fig. 3. 3PR vs. Without 3PR (Random). Time (sec.), with axis ticks at 1, 2, 4, 8, and 16.

Fig. 4. Speed-up ratio (HMDNA). Speed-up ratio, with axis ticks at 1, 2, 4, 8, and 16.

Fig. 5. Computing time (16 processors). Time (sec.) against the number of species (14 to 29).

5 Conclusions

In this paper, we have designed PBBU with 3PR for the MUT construction problem. 3PR is the relationship between a distance matrix and the constructed evolutionary tree. In the branch-and-bound algorithm, candidates that do not fit 3PR are moved to the delay bound pool. As a result, a tighter bounding value is obtained quickly and is used to bound more candidates. To evaluate the performance of the proposed algorithm, we used a random data set and a practical HMDNA data set. The experimental results show that PBBU with 3PR can find the optimal solution for 36 species within a reasonable time on 16 PCs. Furthermore, the speed-up ratio shows that our algorithm performs well in our PC cluster environment. The results also show that PBBU with 3PR saves about 25% of the computing time on average compared with PBBU without 3PR, and the delay bound technique ensures that the results are optimal.


References

1. T.H. Cormen, C.E. Leiserson, R.L. Rivest and C. Stein “Introduction to Algorithms,” MIT Press, 1990.

2. W.H.E. Day “Computationally difficult parsimony problems in phylogenetic systematics,” J. Theoretic Biol., 103, 1983, pp.429-438.

3. W.H.E. Day “Computational complexity of inferring phylogenies from dissimilarity matrices,” Bulletin of Math. Biol., 49, 1987, pp.461-467.

4. W.H.E. Day, D.S. Johnson, and D. Sankoff “The computational complexity of inferring rooted phylogenies by parsimony,” Math. Biosci., 81, 1986, pp.33-42.

5. C.T. Fan “The evaluation model of evolutionary tree,” Master Thesis, National Tsing Hua University, 2000.

6. L.R. Foulds “Maximum savings in the Steiner problem in phylogeny,” J. Theoretic Biol., 107, 1984, pp.471-474.

7. D. Gusfield “Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology,” Cambridge University Press, 1997.

8. M.D. Hendy and D. Penny “Branch and bound algorithms to determine minimal evolutionary trees,” Math. Biol., 59, 1982, pp.277-290.

9. S. Kumar, K. Tamura, M. Nei “MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers,” Comput. Appl. Biosci., 10, 1994, pp.189-191.

10. W.H. Li “Molecular Evolution,” Sinauer Associates, 1997.

11. R.D.M. Page “TreeView: An application to display phylogenetic trees on personal computers,” Comput. Appl. Biosci., 12, 1996, pp.357-358.

12. L. Vigilant, M. Stoneking, H. Harpending, K. Hawkes and A.C. Wilson “African Populations and the Evolution of Human Mitochondrial DNA,” Science, 253, 1991, pp.1503-1507.

13. B.Y. Wu, K.M. Chao, and C.Y. Tang “Approximation and Exact Algorithms for Constructing Minimum Ultrametric Trees from Distance Matrices,” J. Combinatorial Optimization, 3, 1999, pp.199-211.

14. C.F. Yu and B.W. Wah “Efficient Branch-and-Bound Algorithms on a Two-Level Memory System,” IEEE Trans. Parallel and Distributed Systems, 14, 1988, pp.1342-1356.


An Efficient Scheduling Algorithm for Irregular Data Redistribution

Kun-Ming Yu and Yi-Lin Tsai

Department of Computer Science and Information Engineering Chung Hua University, Hsinchu, Taiwan 300, ROC.

Tel : 886-3-5186412 Fax: 886-3-5329701 Email: yu@chu.edu.tw

Abstract. Dynamic data redistribution is used to enhance the performance of an algorithm and to achieve data locality in parallel programs on distributed-memory multicomputers. The data redistribution problem has therefore been studied extensively. Previous results focus on reducing index computation cost, schedule computation cost, and message packing/unpacking cost. Irregular data redistribution is more flexible than regular data redistribution: it can assign data segments of different sizes to processors according to their computation capability. High Performance Fortran 2 (HPF-2), the current version of HPF, provides irregular distribution functionality such as GEN_BLOCK, which addresses some requirements of irregular applications for distributing data in an irregular manner and for explicit control of load balancing. In this paper, we present a degree-reduction-and-coloring (DRC) algorithm for scheduling HPF-2 irregular array redistribution. We aim to obtain the minimal number of transmission steps and to reduce the overall redistribution time. In its first phase, the proposed algorithm reduces the maximum number of transmission messages per processor; it then applies a graph-coloring mechanism to obtain the final schedule. The proposed method not only avoids node contention but also shortens the overall redistribution time. To evaluate the performance of the DRC algorithm, we implemented it along with the Divide-and-Conquer algorithm. The simulation results show that the DRC algorithm significantly improves communication costs compared with the Divide-and-Conquer algorithm.

Keywords: Irregular redistribution, communication scheduling, GEN_BLOCK, degree-reduction

1. Introduction

Parallel computing systems have been widely adopted to solve complex scientific and engineering problems. To execute a data-parallel program efficiently on distributed-memory multicomputers, an appropriate data distribution is critical to performance: it can balance the computational load, increase data locality, and reduce inter-processor communication. Array redistribution is crucial for system performance because a specific array distribution may be appropriate for the current phase but incompatible with the subsequent one. Many parallel programming languages thus support run-time primitives for rearranging the array distribution of a program. The data redistribution problem has been widely studied in the literature. In general, data redistribution can be classified into two categories: regular data redistribution [1,5,6,7,9,11,13,15,18] and irregular data redistribution [4,8,22-24]. Regular data redistribution decomposes data into equal-sized pieces across processors; it has three types, called BLOCK, CYCLIC, and BLOCK-CYCLIC(n). Irregular data distribution employs user-defined functions to distribute data unevenly. High Performance Fortran 2 (HPF-2) provides the GEN_BLOCK functionality, which lets different processors handle data sizes appropriate to their computation capability. Previous work emphasized minimizing the number of redistribution steps and ordering messages to minimize the total transmission size. For regular array redistribution, [15] proposed an Optimal Processor Mapping (OPM) scheme to minimize the data transmission cost of general BLOCK-CYCLIC regular data realignment. OPM uses a maximum matching of realigned logical processors to achieve the maximum number of data hits, which reduces the cost of data exchange. For irregular array redistribution, [22, 23] proposed a greedy algorithm that uses the Divide-and-Conquer technique to obtain near-optimal scheduling while attempting to minimize both the total size of communication messages and the number of steps.

In this paper, we propose the Degree-Reduction-and-Coloring (DRC) algorithm to perform GEN_BLOCK array redistribution efficiently. Section 2 defines the data communication model of irregular data redistribution and gives an example of GEN_BLOCK data redistribution as a preliminary. Section 3 describes the DRC algorithm for the irregular redistribution problem. Section 4 presents the performance analysis, simulation results, and practical transmission with MPI on an SMP/Linux cluster. Finally, Section 5 gives the conclusions.

2. Data communication models

In this section, we present some properties of irregular data redistribution with the GEN_BLOCK functionality. There are no repetitive communication patterns in irregular GEN_BLOCK array redistribution. A data redistribution can be represented by a bipartite graph, called a redistribution graph. To simplify the presentation, the notation and terminology used in this paper are defined below.

Definition 1: Given an irregular GEN_BLOCK redistribution of arrays A[SPi] and A[DPi] over P processors, the source processors of array elements A[SPi] are denoted by SPi and the destination processors of array elements A[DPi] are denoted by DPi, where 1 ≤ i ≤ P.

Definition 2: A bipartite graph G = (V, E) represents the communications of an irregular data redistribution between source and destination processors. The vertices of G represent the processors. Edge e_ij ∈ E denotes the message sent from SPi to DPj, and |E_ij| denotes the size of that message.

Definition 3: The message transmission links in an irregular data redistribution do not overlap. Hence, the total number of message transmission links |E| satisfies P ≤ |E| ≤ 2 × P − 1.

Definition 4: Each processor has one or more edges e_ij, through which it sends data to destination processors or receives data from source processors. The number D of edges e_ij owned by a processor is called its D-degree, and the maximum D-degree over all processors is called the Max-degree. Processors whose number of messages equals the Max-degree are denoted by P_max.

Definition 5: If SPi sends messages to DPj-1 and DPj+1, then a transmission between SPi and DPj must exist, where 1 ≤ i, j ≤ P. This is known as the consecutive communication property [12].

Fig. 1 shows an example of redistribution between two GEN_BLOCK distributions on A[SPi] and A[DPi]. Table 1(a) lists the total message size mapped to each source processor and each destination processor. The communications between the source and destination processor sets are depicted in Fig. 1. There are 13 transmission messages, e₁₁, e₂₁, e₂₂, …, e₇₇, among the processors involved in the redistribution.

Because node contention has a considerable influence, a processor can send at most one message and receive at most one message in each communication step. Messages that cannot be scheduled in the same communication step form a conflict tuple. For instance, {e₁₁, e₂₁} is a conflict tuple because the two messages share the destination processor DP1; {e₂₁, e₂₂} is also a conflict tuple because of the common source processor SP2. Table 1(b) shows a simple schedule for this example.

Figure 1. An example of data redistribution

Table 1(a). The total message size of redistribution data for each processor in Fig. 1.

SP1 SP2 SP3 SP4 SP5 SP6 SP7
7   27  32  15  15  7   14

DP1 DP2 DP3 DP4 DP5 DP6 DP7
16  12  14  17  27  23  8

Table 1(b). A simple schedule

Schedule Table
Step1: e34, e45, e22, e77, e11, e66
Step2: e56, e23, e35
Step3: e21, e33, e55, e76
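The contention rule can be checked mechanically. The following is a small sketch, assuming each message e_ij is written as the pair (i, j); the helper name `is_contention_free` is hypothetical. It confirms that every step of Table 1(b) is free of conflict tuples.

```python
def is_contention_free(step):
    """True if no two messages (i, j) in the step share a source or a destination."""
    sources = [i for i, _ in step]
    dests = [j for _, j in step]
    return len(set(sources)) == len(sources) and len(set(dests)) == len(dests)

# Table 1(b), with e_ij written as the pair (i, j)
schedule = [
    [(3, 4), (4, 5), (2, 2), (7, 7), (1, 1), (6, 6)],
    [(5, 6), (2, 3), (3, 5)],
    [(2, 1), (3, 3), (5, 5), (7, 6)],
]
print(all(is_contention_free(step) for step in schedule))  # True
print(is_contention_free([(1, 1), (2, 1)]))                # False: both target DP1
```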

3. Proposed Algorithm

The performance of a data redistribution procedure is determined by four costs: index computational cost T_i, schedule computational cost T_s, message packing/unpacking cost T_p, and data transfer cost. The data transfer cost for each communication step consists of start-up cost T_su and transmission cost T_t. Let the unit transmission time τ denote the cost of transferring a message of unit length. In general, the message startup cost is directly proportional to the number of communication steps. The total number of communication steps is denoted by N. The total redistribution time equals

$$T_i + T_s + \sum_{i=1}^{N} \left( T_p + T_{su} + m_i \tau \right),$$

where $m_i = \max\{e_1, e_2, e_3, \ldots, e_k\}$ and e_j represents the size of the j-th message scheduled in the i-th communication step, for j = 1 to k. In irregular redistribution, messages of varying sizes are scheduled in the same communication step. Therefore, the largest message in a communication step dominates the data transfer time required for that step.
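A short sketch of the summation term may help: each step costs its packing and start-up overhead plus the transmission time of its largest message. The function name `transfer_time`, the parameter names, and the message labels and sizes below are hypothetical and chosen only for illustration.

```python
def transfer_time(schedule, sizes, tau=1.0, t_su=0.0, t_p=0.0):
    """Sum over the steps of (T_p + T_su + m_i * tau), where m_i is the
    size of the largest message scheduled in step i."""
    total = 0.0
    for step in schedule:
        m_i = max(sizes[msg] for msg in step)
        total += t_p + t_su + m_i * tau
    return total

# Hypothetical toy schedule: two steps, three messages.
sizes = {"a": 17, "b": 7, "c": 10}
schedule = [["a", "b"], ["c"]]
print(transfer_time(schedule, sizes))            # 27.0
print(transfer_time(schedule, sizes, t_su=2.0))  # 31.0
```

Message "b" does not affect the result because "a" dominates its step, which is why a scheduler benefits from grouping messages of similar size.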

The main idea of the Degree-Reduction-and-Coloring (DRC) algorithm is to reduce the degree of P_max repeatedly, by scheduling messages into the first steps of the data redistribution process, until Max-degree equals 2.

The remaining messages are then scheduled into communication steps by means of a bipartite graph coloring mechanism. The details of these steps are described in the following subsections.


3.1 Degree-Reduction Step

The goal of this step is to reduce Max-degree in each iteration until Max-degree equals 2. Fig. 2 shows an example redistribution with Max-degree 4. In the first phase (phase-1) of the degree-reduction step, the messages are sorted in non-increasing order of |E_ij|; the result is shown in Table 2. DRC then places the messages into Step 1 of the schedule in that order, skipping any message that would cause a conflict. After phase-1, Max-degree has decreased by 1 (from 4 to 3). Fig. 3 and Table 4 show this scenario.

DRC repeats the procedure until Max-degree reaches 2, as depicted in Fig. 4.
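The degree-reduction phases can be sketched as a greedy selection over the size-sorted messages. This is an illustrative reading of the description above rather than the authors' implementation: the names `max_degree`, `reduction_phase` and `degree_reduction` are assumptions, and the plain greedy rule shown here does not by itself guarantee that every P_max processor is served in each phase. The message sizes are those of the Max-degree 4 example.

```python
from collections import Counter

def max_degree(messages):
    """Largest number of messages attached to any source or destination."""
    degree = Counter()
    for src, dst in messages:
        degree["SP", src] += 1
        degree["DP", dst] += 1
    return max(degree.values(), default=0)

def reduction_phase(sizes):
    """Take messages in non-increasing size order, skipping any message
    that conflicts with one already chosen for this step."""
    step, used_src, used_dst = [], set(), set()
    for src, dst in sorted(sizes, key=sizes.get, reverse=True):
        if src not in used_src and dst not in used_dst:
            step.append((src, dst))
            used_src.add(src)
            used_dst.add(dst)
    return step

def degree_reduction(sizes):
    """Fill one schedule step per phase until Max-degree drops to 2."""
    remaining, steps = dict(sizes), []
    while max_degree(remaining) > 2:
        step = reduction_phase(remaining)
        steps.append(step)
        for msg in step:
            del remaining[msg]
    return steps, remaining

# Messages e_ij of the example as (i, j): size
sizes = {(3, 4): 17, (4, 5): 15, (2, 2): 12, (6, 5): 10, (2, 1): 9,
         (3, 3): 8, (7, 7): 8, (1, 1): 7, (3, 5): 7, (6, 6): 7,
         (3, 2): 6, (7, 6): 6, (5, 5): 5}
steps, remaining = degree_reduction(sizes)
print(steps[0])          # [(3, 4), (4, 5), (2, 2), (7, 7), (1, 1), (6, 6)]
print(steps[1])          # [(6, 5), (2, 1), (3, 3), (7, 6)]
print(sorted(remaining)) # [(3, 2), (3, 5), (5, 5)]
```

Ties in message size keep their input order because Python's sort is stable, so the result depends on how equal-sized messages are listed.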

Figure 2. A data redistribution example with Max-degree = 4

Table 2. The messages sorted in non-increasing order of message size

Msg no.   e34 e45 e22 e65 e21 e33 e77 e11 e35 e66 e32 e76 e55
Msg size  17  15  12  10  9   8   8   7   7   7   6   6   5

Figure 3. The message communication (a) before phase-1 of the degree-reduction step; (b) after phase-1 of the degree-reduction step.

Table 4. The schedule after phase-1

Schedule Table
Step1: e34, e45, e22, e77, e11, e66
Step2:
Step3:
Step4:

Figure 4. The message communication (a) before phase-2 of the degree-reduction step; (b) after phase-2 of the degree-reduction step.

Table 5. The schedule after phase-2

Message no.   e34 e45 e22 e65 e21 e33 e77 e11 e35 e66 e32 e76 e55
Message size  17  15  12  10  9   8   8   7   7   7   6   6   5

Schedule Table
Step1: e34, e45, e22, e77, e11, e66
Step2: e65, e21, e33, e76
Step3:
Step4:


3.2 Message-Coloring Step

After completing the degree-reduction step, we obtain a redistribution graph with Max-degree 2. This graph is 2-edge colorable [2], since it is a bipartite graph whose maximum degree equals 2. In the Message-Coloring Step, DRC schedules the remaining messages in non-increasing order of size, placing them in the same step unless a conflict occurs, to accomplish an optimal scheduling. Figure 5 shows the outcome of message coloring and Table 6 shows the final schedule obtained from the DRC algorithm.

Figure 5. The outcome of redistribution graph after applying the message coloring mechanism

Table 6. The final schedule obtained from DRC

Msg no.   e34 e45 e22 e65 e21 e33 e77 e11 e35 e66 e32 e76 e55
Msg size  17  15  12  10  9   8   8   7   7   7   6   6   5

Schedule Table
Step1: e34, e45, e22, e77, e11, e66
Step2: e65, e21, e33, e76
Step3: e35
Step4: e32, e55
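The last two steps can be reproduced with a simple two-coloring of the leftover messages. The sketch below assumes that the input already has Max-degree at most 2, so that every message conflicts with at most two others; it walks each connected group of messages starting from its largest member and alternates between two steps. The name `color_messages` is hypothetical, and this is one possible realization of the coloring idea rather than the authors' exact procedure.

```python
def color_messages(sizes):
    """Split messages (i, j) of a graph with Max-degree <= 2 into two
    contention-free steps by alternating colors along each chain."""
    color = {}
    for start in sorted(sizes, key=sizes.get, reverse=True):
        if start in color:
            continue
        color[start] = 0                      # largest message goes first
        stack = [start]
        while stack:
            src, dst = stack.pop()
            for other in sizes:
                # A neighbor shares the source or the destination processor.
                if other not in color and (other[0] == src or other[1] == dst):
                    color[other] = 1 - color[(src, dst)]
                    stack.append(other)
    first = [m for m in sizes if color[m] == 0]
    second = [m for m in sizes if color[m] == 1]
    return first, second

# Messages left after the degree-reduction step of the example
left = {(3, 5): 7, (3, 2): 6, (5, 5): 5}
step3, step4 = color_messages(left)
print(step3)   # [(3, 5)]
print(step4)   # [(3, 2), (5, 5)]
```

With the per-step maxima 17, 10, 7 and 6 from Table 6, the four steps together transmit 40 units along the critical path.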

The algorithm of the Degree-Reduction-Coloring is given as follows.

======================================

Algorithm DRC

generating messages;
// generate messages from AS[Pi] to AD[Pi]

step = maximal degree;

sort_msgSize();
// sort in decreasing order by message size

while ( step > 2 )
{
    choose_msg(step);
    // select messages without conflict tuple and put them
    // into schedule step (maximal degree - step + 1)

    step--;
} // degree-reduction iteration

while ( remaining_messages != null )
{
    selecting_msg(maximal degree - 1);
    // select messages into schedule step (maximal degree - 1)

    check_msg_continue_set();
    // check the remaining message set

    coloring_maximal_msg(maximal degree);
    // color the maximal message with step (maximal degree - 1)
    // and the neighbor message with step (maximal degree)
} // message coloring mechanism

end of DRC

======================================

4. Performance Evaluation

To evaluate the performance of the proposed method, we have implemented DRC along with the Divide-and-Conquer algorithm [23]. The simulation is discussed in two categories, even and uneven GEN_BLOCK distributions. In an even GEN_BLOCK distribution, each processor owns a similar amount of data. In contrast, in an uneven distribution a few processors might be allocated large volumes of data.

Since data elements could be centralized on some specific processors, it is also possible for those processors to have the maximum degree of communications.

The simulation program generates a set of random integers as the data sizes A[SPi] and A[DPi]. Moreover, the total message size sent from the source processors equals the total size received by the destination processors, keeping the balance between source and destination processors.

We assume that the data computation (communication) time in the simulation is represented by the transmission size |E_ij|. In the following figures, the percentage of events is plotted as a function of the message size and the number of processors. In the figures, “DRC Better” represents the percentage of events in which the DRC algorithm has a lower total computation (communication) time than the Divide-and-Conquer algorithm, while “DC Better” represents the reverse situation. “The Same Results” represents the events in which both algorithms have the same total computation (communication) time.

In the uneven distribution, the upper bound of the message size is set to B*1.7 and the lower bound to B*0.3, where B equals the total transmission message size divided by the total number of processors. In the even distribution, the upper bound is set to B*1.3 and the lower bound to B*0.7. The total message size is 10M.
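One way to draw test distributions within these bounds is sketched below. The paper does not state how the random sizes were generated, so the rejection-sampling approach, the function name `random_gen_block`, the percentage parameters, and the reading of 10M as 10,000,000 units are all assumptions made for illustration.

```python
import random

def random_gen_block(total, procs, low_pct, high_pct, rng):
    """Draw `procs` integer block sizes that sum to `total`, each within
    [low_pct% * B, high_pct% * B], where B = total / procs."""
    lo = -(-total * low_pct // (100 * procs))   # ceiling of the lower bound
    hi = total * high_pct // (100 * procs)      # floor of the upper bound
    while True:
        sizes = [rng.randint(lo, hi) for _ in range(procs - 1)]
        last = total - sum(sizes)               # forces the exact total
        if lo <= last <= hi:                    # otherwise draw again
            return sizes + [last]

TOTAL = 10_000_000                              # assumed meaning of "10M"
rng = random.Random(2006)                       # fixed seed for repeatability
uneven = random_gen_block(TOTAL, 8, 30, 170, rng)   # bounds B*0.3 .. B*1.7
even = random_gen_block(TOTAL, 8, 70, 130, rng)     # bounds B*0.7 .. B*1.3
print(sum(uneven) == TOTAL, sum(even) == TOTAL)     # True True
```

Calling the function once for the source side and once for the destination side yields a pair of GEN_BLOCK distributions with equal totals, which matches the balance requirement described above.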

Fig. 6(a) and 6(b) show the simulation results of the DRC and Divide-and-Conquer algorithms with different numbers of processors and different total message sizes. The number of processors ranges from 8 to 24. We can observe that the DRC algorithm performs better than the Divide-and-Conquer algorithm in the uneven data redistribution. Since the data is concentrated in the even case, from Fig. 7(a) and 7(b) we can observe that DRC performs better than it does in the uneven case. In both the even and uneven cases, DRC performs better than the Divide-and-Conquer algorithm.

Figure 6. The events percentage of computing time is plotted (a) with different number of processors and (b) with different number of total message sizes in 24 processors, on the uneven data set.

Figure 7. The events percentage of computing time is plotted (a) with different number of processors and (b) with different number of total message sizes in 24 processors, on the even data set.

5. Conclusion

In this paper, we have presented a Degree-Reduction-Coloring (DRC) scheduling algorithm to efficiently perform HPF2 irregular array redistribution on a distributed memory multi-computer.

The DRC algorithm is a simple method with low algorithmic complexity for performing GEN_BLOCK array redistribution. It is optimal in terms of the number of communication steps and, at the same time, near optimal in terms of the total message size over all steps. The proposed method not only avoids node contention but also shortens the overall communication length.

To verify the performance of the proposed algorithm, we have implemented DRC as well as the Divide-and-Conquer redistribution algorithm. The experimental results show improvement in communication costs and high practicability on different processor hierarchies. The experimental results also indicate that both algorithms perform well on GEN_BLOCK redistribution, and that in many situations DRC is better than the Divide-and-Conquer redistribution algorithm.

References

[1] G. Bandera and E.L. Zapata, “Sparse Matrix Block-Cyclic Redistribution,” Proceedings of IEEE Int'l. Parallel Processing Symposium (IPPS'99), San Juan, Puerto Rico, pp. 355-359, April 1999.

[2] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, Macmillan, London, 1976.

[3] Frederic Desprez, Jack Dongarra and Antoine Petitet, “Scheduling Block-Cyclic Data Redistribution,” IEEE Trans. on PDS, vol. 9, no. 2, pp. 192-205, Feb. 1998.

[4] Minyi Guo, “Communication Generation for Irregular Codes,” The Journal of Supercomputing, vol. 25, no. 3, pp. 199-214, 2003.

[5] Minyi Guo and I. Nakata, “A Framework for Efficient Array Redistribution on Distributed Memory Multicomputers,” The Journal of Supercomputing, vol. 20, no. 3, pp. 243-265, 2001.

[6] Minyi Guo, I. Nakata and Y. Yamashita, “Contention-Free Communication Scheduling for Array Redistribution,” Parallel Computing, vol. 26, no. 8, pp. 1325-1343, 2000.

[7] Minyi Guo, I. Nakata and Y. Yamashita, “An Efficient Data Distribution Technique for Distributed Memory Parallel Computers,” Joint Symp. on Parallel Processing (JSPP'97), pp. 189-196, 1997.

[8] Minyi Guo, Yi Pan and Zhen Liu, “Symbolic Communication Set Generation for Irregular Parallel Applications,” The Journal of Supercomputing, vol. 25, pp. 199-214, 2003.

[9] Edgar T. Kalns and Lionel M. Ni, “Processor Mapping Technique Toward Efficient Data Redistribution,” IEEE Trans. on PDS, vol. 6, no. 12, pp. 1234-1247, December 1995.

[10] S. D. Kaushik, C. H. Huang, J. Ramanujam and P. Sadayappan, “Multiphase Data Redistribution: Modeling and Evaluation,” International Parallel Processing Symposium (IPPS’95), pp. 441-445, 1995.

[11] Peizong Lee and Zvi Meir Kedem, “Automatic Data and Computation Decomposition on Distributed Memory Parallel Computers,” ACM Transactions on Programming Languages and Systems, vol. 24, no. 1, pp. 1-50, January 2002.

[12] S. Lee, H. Yook, M. Koo and M. Park, “Processor Reordering Algorithms Toward Efficient GEN_BLOCK Redistribution,” Proceedings of the ACM Symposium on Applied Computing, pp. 539-543, 2001.

[13] Y. W. Lim, Prashanth B. Bhat and Viktor K. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Algorithmica, vol. 24, no. 3-4, pp. 298-330, 1999.

[14] C.-H Hsu, S.-W Bai, Y.-C Chung and C.-S Yang, “A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 12, pp. 1201-1216, Dec. 2000.

[15] Ching-Hsien Hsu and Kun-Ming Yu, “An Optimal Processor Replacement Scheme for Efficient Communication of Runtime Data Realignment,” Parallel and Distributed Processing and Applications, Lecture Notes in Computer Science, vol. 3358, pp. 268-273, 2004.

[16] C.-H Hsu, Dong-Lin Yang, Yeh-Ching Chung and Chyi-Ren Dow, “A Generalized Processor Mapping Technique for Array Redistribution,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 7, pp. 743-757, July 2001.

[17] Antoine P. Petitet and Jack J. Dongarra, “Algorithmic Redistribution Methods for Block-Cyclic Decompositions,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1201-1216, Dec. 1999.

[18] Neungsoo Park, Viktor K. Prasanna and Cauligi S. Raghavendra, “Efficient Algorithms for Block-Cyclic Data Redistribution Between Processor Sets,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 1217-1240, Dec. 1999.

[19] L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” Journal of Parallel and Distributed Computing, vol. 45, pp. 63-72, Aug. 1997.

[20] S. Ramaswamy, B. Simons and P. Banerjee, “Optimization for Efficient Data Redistribution on Distributed Memory Multicomputers,” Journal of Parallel and Distributed Computing, vol. 38, pp. 217-228, 1996.

[21] Akiyoshi Wakatani and Michael Wolfe, “Optimization of Data Redistribution for Distributed Memory Multicomputers,” short communication, Parallel Computing, vol. 21, no. 9, pp. 1485-1490, September 1995.

[22] Hui Wang, Minyi Guo and Wenxi Chen, “An Efficient Algorithm for Irregular Redistribution in Parallelizing Compilers,” Proceedings of 2003 International Symposium on Parallel and Distributed Processing with Applications, LNCS 2745, 2003.

[23] Hui Wang, Minyi Guo and Daming Wei, “Divide-and-Conquer Algorithm for Irregular Redistributions in Parallelizing Compilers,” The Journal of Supercomputing, vol. 29, no. 2, pp. 157-170, 2004.

[24] H.-G. Yook and Myung-Soon Park, “Scheduling GEN_BLOCK Array Redistribution,” Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, November 1999.


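Using Grid Computing to Build a High-Performance Parallel Construction Environment for Evolutionary Trees *

游坤明¹, 徐蓓芳¹, 賴威廷¹, 謝一功¹, 周嘉奕¹, 林俊淵², 唐傳義³

1 Department of Computer Science and Information Engineering, Chung Hua University

2 Institute of Molecular and Cellular Biology, National Tsing Hua University

3 Department of Computer Science, National Tsing Hua University

1 yu@chu.edu.tw, {b9102042, b9004060, b9102004}@cc.chu.edu.tw, jyzhou@pdlab.csie.chu.edu.tw

2 cyulin@mx.nthu.edu.tw

3 cytang@cs.nthu.edu.tw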

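Abstract

Using parallel processing to handle very large computations has become an important practice in recent years, and different environments and architectures have emerged for different applications. A grid is an architecture built on the Internet. A grid can share resources with other grids over the Internet, so it can be regarded as computing with a large pool of resources that is easy to grow or shrink. By comparison, increasing the computing power of a traditional cluster system costs more than it does for a grid, so the computing power of a cluster is limited. In a typical grid, common protocols, mutually recognized authentication, security considerations, and reasonable resource access are all required before grids can communicate with each other over the network. We use grid computing to process our data and programs and to obtain correct results within a reasonable time. This paper uses a parallel algorithm, taking human mitochondria as an example, to construct evolutionary trees on a single machine, on a grid, and on a cluster, and compares the performance differences.

Keywords: ultrametric tree, cluster computing, grid computing, Globus Toolkit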

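1. Introduction

In bioinformatics research, scientists often rely on evolutionary trees to understand how closely species are related. Constructing evolutionary trees from a distance matrix is an important issue in biology and taxonomy, and many different models and corresponding algorithms have been proposed. Most of the associated optimization problems have been proven to be NP-hard.

*

This work was supported in part by the NSC of ROC, under grant NSC-93-2213-E-216-037 and NSC-94-2213-E-216-028

Among the many models, an important one assumes that the rate of evolution is constant [5, 17]. Under this assumption, the evolutionary tree computed from the distance matrix is an ultrametric tree.

This paper uses a high-performance parallel branch-and-bound algorithm to construct minimum-distance evolutionary trees. The parallel algorithm is built on a centralized master-slave architecture and incorporates effective load-balancing and inter-node communication strategies, so that the minimum-weight ultrametric tree construction problem can be solved within a tolerable time.

In recent years, more and more problems are being solved with the aid of computers, and the computing power of a personal computer can no longer deliver results within a reasonable time. Distributed computing is therefore the next level of development. This paper constructs evolutionary trees taking human mitochondria as an example. Constructing an evolutionary tree is a very complex and time-consuming computation. On an ordinary personal computer it takes a great deal of time to obtain a result, and sometimes the run is interrupted after a long wait because of insufficient resources. Obtaining satisfactory results within a reasonable time therefore requires a high-performance computer such as a supercomputer, but for economic reasons we can use a cluster or a grid to achieve similar performance.

Clusters come in different sizes, and their greatest advantage is scalability: adding new personal computers raises the performance of the cluster. In some situations the data are distributed across different regions and must be accessed mutually. A grid links several clusters in different regions through network connections and can exploit this arrangement to keep information up to date, so its resource utilization is far better than that of a cluster [19].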
