A fast algorithm for rooting a tree to minimize the ultrametric size


Bang Ye Wu
Email: [email protected]

Abstract

For a given unrooted tree and observed distances among the species, we developed a fast algorithm for rooting the tree such that the size of the rooted ultrametric tree is minimum. The time complexity of the algorithm is $O(n^2)$, while a naive algorithm takes $O(n^3)$ time.

Keywords: algorithms, computational biology, ultrametric trees.

1. Introduction

Trees are used to represent evolutionary relationships and to guide the alignment of multiple sequences. The leaves of the tree represent the species and the internal nodes are the inferred ancestors. For constructing trees from observed distances, there are many different models, each of which motivates algorithmic problems. However, most of the optimization problems of evolutionary tree construction have been shown to be NP-hard. Heuristic algorithms and computer software have been developed to build rooted or unrooted trees from observed distances among the species. For example, PHYLIP [3] is one of the popular software packages and contains several methods for building trees.

To guide the alignment of sequences, as in the software CLUSTAL W [4], a tree should be rooted. An unrooted tree may be rooted at any edge. Trees obtained by rooting the same unrooted tree at different edges represent different grouping orders, and therefore should be considered different. Usually the root of a tree may be determined by outgroups. We investigated how to determine the root from the distances.

The mathematical model we used is the minimum ultrametric tree [2]. An ultrametric tree is a rooted tree in which every internal node has the same path length to all the leaves in its subtree. The size of a tree is the sum of the lengths of all its edges.
For given observed distances among species, we seek the ultrametric tree of minimum size subject to the constraint that, for each pair of species, the distance on the tree is no less than the given one. Constructing the minimum ultrametric tree for given distances has been shown to be NP-hard, so it is very unlikely that the optimal tree can be found in reasonable time [2]. In [5], a branch-and-bound algorithm was developed to solve the problem for a moderate number of species, about 20.

The problem considered in this paper is much easier. In addition to the observed distances, an unrooted tree topology is also given. The goal is to root the tree at an edge and to assign a length to each edge such that the rooted tree is an ultrametric tree and its size is minimum among all possible roots. Such a root will be referred to as the optimal root in the remainder of this paper. The optimal root may not be unique, and our goal is to find one of them.

To determine the optimal root, we may try every edge of the tree. Once the tree is rooted at an edge, the minimum ultrametric size with respect to the fixed topology can be computed in $O(n^2)$ time by an algorithm developed in [5], where $n$ is the number of species. Consequently, the optimal root can be determined in $O(n^3)$ time, since there are only $O(n)$ edges in a tree with $n$ leaves. In this paper, we present an $O(n^2)$ time algorithm for the problem.
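The count behind the naive bound can be written out explicitly. The following is a short worked calculation, assuming the unrooted tree is binary (every internal node has degree three), so that a tree with $n$ leaves has $n-2$ internal nodes. The constant $c$ is a placeholder introduced here for the cost of one fixed-topology computation and does not appear elsewhere in the paper.

```latex
\[
|E| = n + (n-2) - 1 = 2n - 3,
\qquad
\underbrace{(2n-3)}_{\text{candidate root edges}}
\cdot
\underbrace{c\,n^{2}}_{\text{one fixed-topology computation}}
= O(n^{3}).
\]
```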

2. Preliminaries

In this paper, by $T = (V, E)$ we denote an unweighted tree with vertex set $V$ and edge set $E$. A tree with an edge weight function $w$ is denoted by $T = (V, E, w)$. Let $n$ denote the number of species. All elements of a matrix and all edge weights of a graph are assumed to be nonnegative. We first give some definitions.

Definition 1: A distance matrix of $n$ species is a symmetric $n \times n$ matrix $M$ such that $M[i, j] \ge 0$ for all $1 \le i, j \le n$, and $M[i, i] = 0$ for all $1 \le i \le n$.

Definition 2: An $n \times n$ metric $M$ is an ultrametric if and only if $M[i, j] \le \max\{M[i, k], M[j, k]\}$ for all $1 \le i, j, k \le n$ [1].

Definition 3: Let $T = (V, E, w)$ be an edge-weighted tree and $u, v \in V$. The path length from $u$ to $v$ is denoted by $d_T(u, v)$. The size of $T$ is defined by $w(T) = \sum_{e \in E} w(e)$.

Definition 4: Let $T$ be a rooted tree and $r$ be any node of $T$. We use $T_r$ to denote the subtree rooted at $r$, and $L(T)$ to denote the leaf set of $T$.

Definition 5: An ultrametric tree $T$ of $\{1, \ldots, n\}$ is a rooted and edge-weighted binary tree with $L(T) = \{1, \ldots, n\}$ and root $r$ such that $d_T(u, r) = d_T(v, r)$ for all $u, v \in L(T)$.

A rooted tree is binary if every internal node has exactly two children. An unrooted binary tree is a tree in which the degree of every internal node is exactly three. We consider only binary trees, since any nonbinary tree can easily be transformed into a binary tree without changing the distances between leaves.

Let $T$ be an ultrametric tree with root $r$. It is easy to see that for any internal node $v$, $T_v$ is an ultrametric tree of $L(T_v)$. It should be noted that an $n \times n$ metric $M$ is an ultrametric if and only if there is an ultrametric tree $T$ of $\{1, \ldots, n\}$ such that $d_T(i, j) = M[i, j]$ for all $1 \le i, j \le n$ [1]. By the definition of an ultrametric tree, the distances from an internal node $r$ to all the leaves in $T_r$ are the same. Therefore we can define the height of a node as follows.

Definition 6: Let $T = (V, E, w)$ be an ultrametric tree. For any $r \in V$, the height of $r$ is the distance from $r$ to any leaf in the subtree $T_r$.

The minimum ultrametric tree of a distance matrix was defined in [2].

Definition 7: For an $n \times n$ distance matrix $M$, an ultrametric tree $T$ is an ultrametric tree of $M$ if $L(T) = \{1, \ldots, n\}$ and $d_T(i, j) \ge M[i, j]$ for all $1 \le i, j \le n$. $T$ is the minimum ultrametric tree of $M$ if its size is minimum among all ultrametric trees of $M$.

The next definition and two lemmas were given in [5].

Definition 8: Minimum Ultrametric Tree with a given Topology (MUTT) problem: Given a distance matrix $M$ and an unweighted rooted tree $P = (V, E)$ with $L(P) = \{1, \ldots, n\}$, the MUTT problem is to find a nonnegative edge weight function $w$ of $P$ such that $T = (V, E, w)$ is the minimum ultrametric tree of $M$ among the trees with topology $P$.

Lemma 1: A tree $T$ is a minimum ultrametric tree with respect to the fixed topology and distance matrix $M$ if and only if the height of each internal node $r$ is exactly $\max\{M[u, v]/2 \mid u, v \in L(T_r)\}$ [5].

Lemma 2: The MUTT problem can be solved, and the heights of all nodes of the minimum tree can be computed, in $O(n^2)$ time [5].

The problem to be solved in this paper is formally defined as follows.

Definition 9: Given any distance matrix $M$ and an unweighted unrooted tree $P = (V, E)$ with $L(P) = \{1, \ldots, n\}$, the RMUT problem is to root $P$ at one of its edges and to find a nonnegative edge weight function $w$ for the resulting tree $T$ such that $T$ is an ultrametric tree of $M$ and $w(T)$ is minimum among all possible roots and edge weight functions.

3. The algorithm

As mentioned in Section 1, the RMUT problem can be solved in $O(n^3)$ time. We shall reduce the time complexity to $O(n^2)$ in this section. The next property is helpful for improving the time efficiency.

Lemma 3: Let $M$ be the distance matrix and $M[u, v]$ be maximal among all observed distances. Then the tree can be rooted optimally at some edge of the path between $u$ and $v$ on the tree.

Proof: Let $T$ and $r$ be an optimal tree and an optimal root of the RMUT problem respectively. By Lemma 1, the height of $r$ is $M[u, v]/2$ since $M[u, v]$ is maximal. We also have $d_T(u, v) = M[u, v]$, because $d_T(u, v) \ge M[u, v]$ and no path between two leaves is longer than twice the height of $r$. Therefore there is an internal node $r_1$ on the path between $u$ and $v$ on $T$ whose height is exactly $M[u, v]/2$. In the case that $r_1 \ne r$, since the heights of $r$ and $r_1$ are the same, we may reroot $T$ at $r_1$ and the size of the tree remains minimal.

By the above lemma, the trees rooted at one of the edges of the path are the candidates for the solution. However, the number of edges of the path may be up to $O(n)$, so computing all of the candidates individually still takes $O(n^3)$ time in the worst case. The idea is to compute all the candidates in two passes. Let $M[u, v]$ be a maximal element of $M$ and $(u = x_0, x_1, x_2, \ldots, x_k = v)$ be the path from $u$ to $v$ on $T$. For each vertex $x_i$, we first compute $f_1(i)$ as the minimum size of the subtree rooted at $x_i$ if the optimal root is between $x_i$ and $v$. Then we compute $f_2(i)$ as the minimum size of the subtree rooted at $x_i$ if the optimal root is between $x_i$ and $u$. Finally, the minimum size of the whole tree rooted at edge $(x_i, x_{i+1})$ can be found from $f_1(i)$ and $f_2(i+1)$. The time complexity is reduced because the values $f_1(i)$ for all $i$ can be computed in one pass. Similarly, every value $f_2(i)$ can be found in the second pass. Our algorithm is listed below and illustrated in Figure 1.

Algorithm RootMUT
Input: An unweighted unrooted tree $T = (V, E)$ with $L(T) = \{1, \ldots, n\}$ and a distance matrix $M$.
Output: A rooted tree with edge weights.
1: Find $u, v$ such that $M[u, v]$ is a maximal element of $M$.
2: Find $(u = x_0, x_1, x_2, \ldots, x_k = v)$, which is the path from $u$ to $v$ on $T$.
3: Root $T$ at edge $(x_{k-1}, v)$. For every $i$, compute $f_1(i)$ to be the minimum size of the subtree rooted at $x_i$ and $h_1(i)$ to be the height of $x_i$.
4: Root $T$ at edge $(u, x_1)$. For every $i$, compute $f_2(i)$ to be the minimum size of the subtree rooted at $x_i$ and $h_2(i)$ to be the height of $x_i$.
5: For every $0 \le i \le k-1$, compute $f_1(i) + f_2(i+1) + M[u, v] - h_1(i) - h_2(i+1)$, which is the minimum size of the whole tree rooted at edge $(x_i, x_{i+1})$. Then find the optimal root by choosing the minimum.
6: Output the tree with the optimal root.

Theorem 4: The algorithm RootMUT finds the optimal root for the RMUT problem in $O(n^2)$ time.

Proof: Clearly, Step 1 takes $O(n^2)$ time and Steps 2, 5 and 6 take $O(n)$ time. By Lemma 2, Steps 3 and 4 can be done in $O(n^2)$ time. Therefore the time complexity of the algorithm is $O(n^2)$.

For the correctness of the algorithm, we shall show that $f_1(i)$ is the minimum size of the subtree rooted at $x_i$ in the case that the optimal root is between $x_i$ and $v$. Let $e_1, e_2$ be two edges of the path between $x_i$ and $v$. For the two trees obtained by rooting $T$ at $e_1$ and $e_2$ respectively, the leaf sets of the subtrees rooted at $x_i$ are the same. By Lemma 1, the subtree rooted at $x_i$ has the same minimum size whenever the root is between $x_i$ and $v$. Therefore, in the case that the optimal root is between $x_i$ and $v$, the minimum size of the subtree rooted at $x_i$ is correctly given by $f_1(i)$. The correctness of $f_2(i)$ can be shown similarly.

Let $r$ be the root. The minimum size of the tree rooted at edge $(x_i, x_{i+1})$ is $f_1(i) + f_2(i+1) + w(r, x_i) + w(r, x_{i+1})$, in which $w(r, x_i) = M[u, v]/2 - h_1(i)$ and $w(r, x_{i+1}) = M[u, v]/2 - h_2(i+1)$ since the height of $r$ is $M[u, v]/2$. Summing these terms gives the expression in Step 5.

4. Concluding remarks

An interesting question is how to compute the minimum additive tree size for a given tree topology, without the restriction to ultrametric trees. Obviously such a problem can be solved by linear programming, but the algorithmic approach is still open.

For the RMUT problem discussed in this paper, a C program based on algorithm RootMUT was written and ported to a PC running MS-DOS. The program, together with some explanation and a sample input, is freely available at URL http://www.personal.stu.edu.tw/bangye/mutroot.htm.
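As an illustration of Section 3, the steps of algorithm RootMUT can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the C program mentioned above: the function names (`solve_side`, `find_path`, `root_mut`), the adjacency-dictionary representation of the unrooted tree, the 0-based leaf labels and the sample data are all hypothetical choices made for this example. Heights follow Lemma 1, and each leaf pair is examined once at its lowest common ancestor, which is one way to stay within the $O(n^2)$ bound of Lemma 2.

```python
from itertools import combinations, product


def solve_side(adj, M, top, parent):
    """Return {node: (height, size)} for every subtree hanging below `top`,
    where `parent` is the neighbour of `top` on the side of the root.
    Heights follow Lemma 1; leaves have height 0 and size 0."""
    info = {}

    def visit(node, par):
        children = [c for c in adj[node] if c != par]
        if not children:  # a leaf of the tree
            info[node] = (0.0, 0.0)
            return [node]
        leaf_sets = [visit(c, node) for c in children]
        height = max(info[c][0] for c in children)
        # Pairs of leaves from different children meet exactly at this node.
        for a_set, b_set in combinations(leaf_sets, 2):
            for a, b in product(a_set, b_set):
                height = max(height, M[a][b] / 2)
        # Edge (node, c) has weight height(node) - height(c).
        size = sum(info[c][1] + height - info[c][0] for c in children)
        info[node] = (height, size)
        return [leaf for s in leaf_sets for leaf in s]

    visit(top, parent)  # recursive; adequate for an illustration
    return info


def find_path(adj, u, v):
    """Return the unique path [u, ..., v] in the tree (Step 2)."""
    prev = {u: None}
    stack = [u]
    while stack:
        node = stack.pop()
        for nb in adj[node]:
            if nb not in prev:
                prev[nb] = node
                stack.append(nb)
    path = [v]
    while path[-1] != u:
        path.append(prev[path[-1]])
    return path[::-1]


def root_mut(adj, M, leaves):
    """Return (minimum size, root edge) following Steps 1 to 5."""
    u, v = max(combinations(leaves, 2), key=lambda p: M[p[0]][p[1]])  # Step 1
    path = find_path(adj, u, v)                                       # Step 2
    k = len(path) - 1
    pass1 = solve_side(adj, M, path[k - 1], path[k])  # Step 3: root next to v
    pass2 = solve_side(adj, M, path[1], path[0])      # Step 4: root next to u
    best = None
    for i in range(k):                                # Step 5
        h1, f1 = pass1[path[i]]
        h2, f2 = pass2[path[i + 1]]
        total = f1 + f2 + M[u][v] - h1 - h2
        if best is None or total < best[0]:
            best = (total, (path[i], path[i + 1]))
    return best


# Hypothetical sample: leaves 0..3, internal nodes 4 and 5.
adj = {0: [4], 1: [4], 2: [5], 3: [5], 4: [0, 1, 5], 5: [4, 2, 3]}
M = [[0, 2, 6, 8],
     [2, 0, 6, 8],
     [6, 6, 0, 4],
     [8, 8, 4, 0]]
print(root_mut(adj, M, [0, 1, 2, 3]))  # (11.0, (4, 5))
```

In this sample the maximal distance is $M[0, 3] = 8$, the path is $(0, 4, 5, 3)$, and the three candidate root edges give sizes 14, 11 and 12, so the sketch selects the middle edge. Recording the results of one traversal per pass, instead of re-solving each candidate separately, is what keeps the work to two fixed-topology computations.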
Acknowledgements

The work was partially supported by grant NSC 89-2218-E-366-003 from the National Science Council.

Figure 1: (a) Find the path between $u$ and $v$ on the tree. (b) Root the tree at the edge incident to $v$ and compute $f_1(i)$, $h_1(i)$. (c) Root the tree at the edge incident to $u$ and compute $f_2(i)$, $h_2(i)$. (d) The minimum size for rooting at edge $(x_i, x_{i+1})$ can be computed from $f_1(i)$, $f_2(i+1)$, $h_1(i)$ and $h_2(i+1)$.

References

[1] H.J. Bandelt, Recognition of tree metrics, SIAM Journal on Discrete Mathematics, 3(1), 1–6, 1990.
[2] M. Farach, S. Kannan and T. Warnow, A robust model for finding optimal evolutionary trees, Algorithmica, 13, 155–179, 1995.
[3] J. Felsenstein, PHYLIP - Phylogeny Inference Package (Version 3.2), Cladistics, 5, 164–166, 1989.
[4] J.D. Thompson, D.G. Higgins and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research, 22, 4673–4680, 1994.
[5] B.Y. Wu, K.M. Chao and C.Y. Tang, Approximation and exact algorithms for constructing minimum ultrametric trees from distance matrices, Journal of Combinatorial Optimization, 3, 199–211, 1999.
