
An Efficient Parallel Algorithm for Ultrametric Tree Construction Based on 3PR

G. Min et al. (Eds.): ISPA 2006 Ws, LNCS 4331, pp. 215–220, 2006.

© Springer-Verlag Berlin Heidelberg 2006


216 K.-M. Yu et al.

Given a distance matrix as input, a phylogenetic tree is constructed according to that matrix [10,11]. In general, the entries are edit distances between the sequences of each pair of species. Many different models and the algorithmic problems they motivate have been proposed [1,9]. However, most optimization problems for phylogenetic tree construction have been shown to be NP-hard [2-4,6,7]. An important and commonly used model assumes that the rate of evolution is constant.

Under this assumption, the phylogenetic tree is an ultrametric tree (UT), which is a rooted, leaf-labeled, edge-weighted binary tree. Because many of these problems are NP-hard, biologists usually construct the trees with heuristic algorithms. The Unweighted Pair Group Method with Arithmetic mean (UPGMA, [1]) is one of the popular heuristic algorithms for constructing UTs.

Although constructing minimum ultrametric trees (MUTs) is an NP-hard problem, it is still worthwhile to construct them for a moderate number of species, so it may seem possible to find an optimal tree by exhaustive search. Nevertheless, for n species, the number of rooted, leaf-labeled binary trees is $A(n) = \prod_{i=3}^{n} (2i-3)$, which grows very rapidly. For example, $A(10) > 10^{7}$, $A(20) > 10^{21}$, and $A(30) > 10^{37}$. Hence, it is impossible to search all possible trees exhaustively even when n is of moderate size. Wu et al. [13] proposed a branch-and-bound algorithm for constructing MUTs that avoids exhaustive search. The branch-and-bound strategy is a general technique for solving combinatorial search problems.
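The growth of the search space can be checked with a few lines of Python. This is a minimal sketch that assumes the standard count of rooted, leaf-labeled binary trees on n species, (2n-3)!! = 3 · 5 · ... · (2n-3); the function name is hypothetical.

```python
def num_rooted_binary_trees(n: int) -> int:
    """Count rooted, leaf-labeled binary trees on n >= 2 species: (2n-3)!!."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3  # attaching species i to a tree on i-1 species has 2i-3 choices
    return count


for n in (10, 20, 30):
    print(n, num_rooted_binary_trees(n))
```

The three printed values exceed 10^7, 10^21, and 10^37 respectively, in line with the bounds quoted in the text.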

In this paper, the 3-Point Relationship (3PR) is used to construct MUTs more efficiently. 3PR is the relationship between a distance matrix and the phylogenetic tree constructed from it. The idea is that, in any triplet of species (a, b, c), the two species that are closest to each other in the distance matrix should also be closest to each other in the constructed phylogenetic tree. The experimental results show that the parallel branch-and-bound algorithm PBBU with 3PR reduces computation time by about 25% in both the sequential and the parallel algorithms.

The paper is organized as follows. Section 2 gives preliminaries on the sequential branch-and-bound algorithm and 3PR. The parallel algorithm is described in Section 3. Section 4 presents our experimental results, and the final section gives our conclusions.

2 Preliminaries

In this paper, we present PBBU with 3PR for constructing minimum ultrametric trees. In the following, G=(V,E,w) denotes an edge-weighted graph with vertex set V, edge set E, and edge weight function w. Some definitions are given as follows:

Definition 1: A distance matrix of n species is a symmetric $n \times n$ matrix M such that $M[i,j] \ge 0$ and $M[i,i] = 0$ for all $1 \le i, j \le n$.

Definition 2: Let T=(V,E,w) be an edge-weighted tree and $u, v \in V$. The path length from u to v is denoted by $d_T(u,v)$. The weight of T is defined by $w(T) = \sum_{e \in E} w(e)$.

Definition 3: For any M (not necessarily a metric), an MUT for M is an ultrametric tree T with minimum $w(T)$ such that $L(T) = \{1, \ldots, n\}$, where L(T) is the leaf set of T, and $d_T(i,j) \ge M[i,j]$ for all $1 \le i, j \le n$. The problem of finding an MUT for M is called the MUT problem.


Definition 4: Let P be a topology, and $a, b \in L(P)$. $LCA(a,b)$ denotes the lowest common ancestor of a and b. If x and y are two nodes of P, we write $x \rightarrow y$ if and only if x is an ancestor of y.

Definition 5: A distance matrix M and a rooted topology of a phylogenetic tree are consistent if, for any $1 \le i, j, k \le n$, $M[i,j] < \min(M[i,k], M[j,k])$ if and only if $LCA(i,k) = LCA(j,k) \rightarrow LCA(i,j)$. Otherwise they are contradictory.

2.1 Sequential Branch-and-Bound Algorithm for MUTs

In the MUT construction problem, branch-and-bound is a tree search algorithm that repeatedly searches the branch-and-bound tree (BBT) [8,14] for a better solution until an optimal one is found. Each node of the BBT represents a topology of a UT. Assume that the root of the BBT has depth 0; then each node at depth i in the BBT represents a topology with leaf set {1,...,i+2}.
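The levels of the BBT can be enumerated with a short Python sketch. It assumes, beyond what the text states, that the children of a BBT node are obtained by attaching the next species to every edge of the parent topology or above its root; the nested-tuple encoding and the names `insert_leaf` and `bbt_level` are hypothetical.

```python
def insert_leaf(tree, leaf):
    """Yield every topology made by attaching `leaf` to one edge of `tree`
    or above its root. Trees are nested 2-tuples; leaves are ints."""
    yield (tree, leaf)  # `leaf` becomes the sibling of this whole subtree
    if isinstance(tree, tuple):
        left, right = tree
        for sub in insert_leaf(left, leaf):
            yield (sub, right)
        for sub in insert_leaf(right, leaf):
            yield (left, sub)


def bbt_level(depth):
    """All topologies at a BBT depth, i.e. with leaf set {1, ..., depth + 2}."""
    level = [(1, 2)]  # root of the BBT: the only topology on species 1 and 2
    for leaf in range(3, depth + 3):
        level = [child for t in level for child in insert_leaf(t, leaf)]
    return level


for depth in range(4):
    print(depth, len(bbt_level(depth)))
```

Under this assumption a topology with k leaves has 2k-1 children, so the level sizes are 1, 3, 15, 105, and so on.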

2.2 3-Point Relationship (3PR)

3PR is a logical method for checking whether the LCA relation of each triplet of species (a, b, c) in a distance matrix is preserved in the constructed phylogenetic tree. For any two species (a, b), LCA(a, b) denotes the lowest common ancestor of a and b. If x and y are two nodes in a phylogenetic tree, we write x → y if x is an ancestor of y. For a triplet of species (a, b, c) in the distance matrix M, if the distance M[a, b] is less than both M[a, c] and M[b, c], then LCA(a, c)=LCA(b, c) → LCA(a, b) (written ((a, b), c); in Newick tree format). A triplet of species (a, b, c) is contradictive if its lowest common ancestor relation in the distance matrix is not preserved in the constructed phylogenetic tree. 3PR can be used to evaluate the quality of constructed phylogenetic trees: a phylogenetic tree is considered unreliable if the number of contradictive triplets is large. The evaluation may help biologists choose a suitable phylogenetic tree construction tool.
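The 3PR check described above can be sketched in Python as follows. This is an illustration rather than the authors' implementation. It assumes topologies encoded as nested 2-tuples whose integer leaves index M, and it treats a triplet with a tie for the closest pair in M as imposing no constraint. All function names are hypothetical, and the sample matrix reuses the values of Table 1 with a, b, c numbered 0, 1, 2.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def tree_pair(tree, triple):
    """Return the two species of `triple` whose LCA is lowest in `tree`."""
    left = leaves(tree[0])
    inside = [s for s in triple if s in left]
    outside = [s for s in triple if s not in left]
    if len(inside) == 3:
        return tree_pair(tree[0], triple)
    if len(inside) == 0:
        return tree_pair(tree[1], triple)
    return frozenset(inside if len(inside) == 2 else outside)


def matrix_pair(M, triple):
    """Return the strictly closest pair of `triple` in M, or None on a tie."""
    pairs = sorted(combinations(triple, 2), key=lambda p: M[p[0]][p[1]])
    (i, j), (k, l) = pairs[0], pairs[1]
    return frozenset((i, j)) if M[i][j] < M[k][l] else None


def contradictive_triplets(M, tree):
    """List the triplets whose closest pair in M is not grouped in `tree`."""
    bad = []
    for triple in combinations(sorted(leaves(tree)), 3):
        wanted = matrix_pair(M, triple)
        if wanted is not None and wanted != tree_pair(tree, triple):
            bad.append(triple)
    return bad


M = [[0, 25, 20], [25, 0, 15], [20, 15, 0]]
print(contradictive_triplets(M, (0, (1, 2))))  # [] since b and c are grouped
print(contradictive_triplets(M, ((0, 1), 2)))  # [(0, 1, 2)]
```

The length of the returned list is the count of contradictive triplets that the text proposes as a quality measure for a tree.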

3 Parallel Branch-and-Bound Algorithm with 3PR

The Parallel Branch-and-Bound Algorithm with 3PR (PBBU with 3PR) is designed for distributed-memory multiprocessors with a master-slave architecture. PBBU uses a branch-and-bound technique to avoid an exhaustive search of all possible trees. For load balancing, the master processor (MP) holds a Global Pool and each slave processor (SP) has a Local Pool. Moreover, we use a new data structure instead of a linked list to store the BBT.

In [5], 3PR is applied as a tree evaluation method. We use this property to place lower-ranked branching paths in the Delay Bound Pool (DBP) when selecting a branching path in the branch-and-bound algorithm. For example, Table 1 gives a distance matrix and Figure 1 shows two candidates for inserting the third species c. In PBBU without 3PR, both candidates (a) and (b) must be added to the pool when branching.

However, the topology of (b) is closer to the distance matrix, so it obtains a higher rank and (a) a lower rank. In PBBU with 3PR, only candidate (b), with the higher rank, is selected, because the distance between a and c is greater than the distance between b and c. This choice is based on the idea that, in a triplet of species (a, b, c), the two species that are closest to each other in the distance matrix should also be closest to each other in the corresponding phylogenetic tree. However, 3PR cannot be used directly to bound the other branching paths, so PBBU with 3PR puts the other candidates into the DBP to ensure that the optimal solution can still be found.

Table 1. Distance matrix

      a    b    c
a     0   25   20
b    25    0   15
c    20   15    0

Fig. 1. Candidate BBT: two candidate topologies, (a) and (b), over the species a, b, and c
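The role of the Delay Bound Pool can be illustrated with a sequential Python sketch run on the matrix of Table 1. This is not the authors' PBBU: it is a single-process simplification built on several assumptions that go beyond the text. Topologies are nested 2-tuples with a, b, c numbered 0, 1, 2. The lightest ultrametric tree for a fixed topology is obtained by setting each internal node to the smallest height allowed by M. The weight of a partial topology is used as the lower bound. A candidate is delayed if the newly inserted species creates at least one contradictive triplet. All names are hypothetical.

```python
from itertools import combinations


def leaves(t):
    """Leaf labels of a nested-tuple topology (leaves are ints)."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])


def insert_leaf(t, leaf):
    """Yield every topology made by attaching `leaf` to an edge of t or above its root."""
    yield (t, leaf)
    if isinstance(t, tuple):
        for sub in insert_leaf(t[0], leaf):
            yield (sub, t[1])
        for sub in insert_leaf(t[1], leaf):
            yield (t[0], sub)


def height_weight(M, t):
    """(root height, total edge weight) of the lightest ultrametric tree on topology t."""
    if not isinstance(t, tuple):
        return 0.0, 0.0
    (hl, wl), (hr, wr) = height_weight(M, t[0]), height_weight(M, t[1])
    cross = max(M[i][j] for i in leaves(t[0]) for j in leaves(t[1])) / 2
    h = max(hl, hr, cross)  # d_T(i, j) = 2h must be at least M[i][j]
    return h, wl + wr + (h - hl) + (h - hr)


def grouped_pair(t, triple):
    """The two species of `triple` whose LCA is lowest in t."""
    left = leaves(t[0])
    inside = [s for s in triple if s in left]
    if len(inside) == 3:
        return grouped_pair(t[0], triple)
    if not inside:
        return grouped_pair(t[1], triple)
    return frozenset(inside) if len(inside) == 2 else frozenset(triple) - set(inside)


def violates_3pr(M, t, new):
    """True if some triplet containing `new` is grouped differently by M and by t."""
    for i, j in combinations(sorted(leaves(t) - {new}), 2):
        d = {frozenset((i, j)): M[i][j],
             frozenset((i, new)): M[i][new],
             frozenset((j, new)): M[j][new]}
        closest = min(d, key=d.get)
        strict = sum(v == d[closest] for v in d.values()) == 1
        if strict and closest != grouped_pair(t, (i, j, new)):
            return True
    return False


def solve(M, use_3pr=True):
    """Branch-and-bound for the MUT; delayed candidates are bounded later, never dropped."""
    n = len(M)
    best_w, best_t = float("inf"), None
    pool, delay = [(0, 1)], []
    while pool or delay:
        if not pool:  # main pool exhausted: revisit the Delay Bound Pool
            pool, delay = delay, []
        t = pool.pop()
        w = height_weight(M, t)[1]
        if w >= best_w:
            continue  # bounded by the best complete tree found so far
        k = len(leaves(t))
        if k == n:
            best_w, best_t = w, t
            continue
        for child in insert_leaf(t, k):
            target = delay if use_3pr and violates_3pr(M, child, k) else pool
            target.append(child)
    return best_w, best_t


table1 = [[0, 25, 20], [25, 0, 15], [20, 15, 0]]
print(solve(table1))  # (32.5, (0, (1, 2)))
```

On Table 1 the only candidate kept in the main pool is (a, (b, c)), whose weight of 32.5 then bounds the two delayed candidates (weights 35 and 37.5) without expanding them. Because delayed candidates are still examined against the bound, the sketch returns the same optimum with `use_3pr=False`.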

4 Experimental Results

In our experiments, we implemented PBBU and PBBU with 3PR on a Linux-based PC cluster. Each computing node is an AMD Athlon PC with a clock rate of 2.0 GHz and 1 GB of memory. The nodes are connected to each other by a 100 Mbps network. Two data sets are used to test our algorithms. One is a randomly generated data set, whose distance matrix is a metric with distances between 1 and 100. The other is a data set composed of 136 Human Mitochondrial DNAs (HMDNA), obtained from [12]. Its distance matrix is a metric with distances between 1 and 200. To reduce the effect of data dependence, we run 10 instances for each test and compare the average, median, and worst cases.

Figures 2 and 3 show that PBBU with 3PR and the delay bound technique finds the optimal solution and saves about 25% of the computation time compared with PBBU without 3PR.

Because the 3PR technique moves lower-ranked candidates that disagree with 3PR to the delay bound pool, a better bounding value can be found earlier. That value can then bound more candidates, which decreases the computation time.

Figure 4 shows the speed-up ratio on the HMDNA data set. We observe that the speed-up ratio with 3PR is better than without 3PR. Furthermore, the difference between the two grows as the number of processors increases, because a tighter bounding value can be found more quickly with more processors. This also shows that our algorithm scales to a large number of computing resources. Figure 5 shows the computation time on 16 processors for different numbers of species. The computation time grows rapidly as the number of species increases. Moreover, the proportion of time saved by PBBU with 3PR relative to PBBU increases with the number of species. We attribute this to the fact that a larger number of species yields more candidates, so the tighter bounding value obtained with the 3PR technique bounds a greater number of candidates and thus decreases the computation time.


Fig. 2. 3PR vs. Without 3PR (HMDNA). Time (sec.) against number of processors (1, 2, 4, 8, 16), without 3PR and with 3PR.

Fig. 3. 3PR vs. Without 3PR (Random). Time (sec.) against number of processors (1, 2, 4, 8, 16), without 3PR and with 3PR.

Fig. 4. Speed-up ratio (HMDNA). Speed-up ratio against number of processors (1, 2, 4, 8, 16), without 3PR and with 3PR.

Fig. 5. Computing time (16 processors). Time (sec.) against number of species (14 to 29), without 3PR and with 3PR.

5 Conclusions

In this paper, we have designed PBBU with 3PR for the MUT construction problem. 3PR is the relationship between a distance matrix and the constructed evolutionary tree. The algorithm moves candidates that do not fit 3PR to the delay bound pool of the branch-and-bound algorithm, so that a tighter bounding value is obtained quickly and used to bound more candidates. To evaluate the performance of the proposed algorithm, a random data set and a practical HMDNA data set are used. The experimental results show that PBBU with 3PR can find the optimal solution for 36 species within a reasonable time on 16 PCs. Furthermore, the speed-up ratio shows that the algorithm performs well in our PC cluster environment. Moreover, the results show that PBBU with 3PR saves about 25% of the computing time on average compared with PBBU without 3PR, while the delay bound technique ensures that the results remain optimal.
