A POLYNOMIAL-TIME APPROXIMATION SCHEME FOR MINIMUM ROUTING COST SPANNING TREES^∗

BANG YE WU^†, GIUSEPPE LANCIA^‡, VINEET BAFNA^§, KUN-MAO CHAO^¶, R. RAVI^‖, AND CHUAN YI TANG^∗∗

Abstract. Given an undirected graph with nonnegative costs on the edges, the routing cost of any of its spanning trees is the sum over all pairs of vertices of the cost of the path between the pair in the tree. Finding a spanning tree of minimum routing cost is NP-hard, even when the costs obey the triangle inequality. We show that the general case is in fact reducible to the metric case and present a polynomial-time approximation scheme valid for both versions of the problem. In particular, we show how to build a spanning tree of an n-vertex weighted graph with routing cost at most (1 + ε) of the minimum in time O(n^{O(1/ε)}). Besides the obvious connection to network design, trees with small routing cost also find application in the construction of good multiple sequence alignments in computational biology.

The communication cost spanning tree problem is a generalization of the minimum routing cost tree problem where the routing costs of different pairs are weighted by different requirement amounts.

We observe that a randomized O(log n log log n)-approximation for this problem follows directly from a recent result of Bartal, where n is the number of nodes in a metric graph. This also yields the same approximation for the generalized sum-of-pairs alignment problem in computational biology.

Key words. approximation algorithms, network design, spanning trees, computational biology

AMS subject classifications. 68W25, 68M10, 05C05, 92B02

PII. S009753979732253X

1. Introduction. Consider the following problem in network design: given an undirected graph with nonnegative delays on the edges, the goal is to find a spanning tree such that the average delay of communicating between any pair using the tree is minimized. The delay between a pair of vertices is the sum of the delays of the edges in the path between them in the tree. Minimizing the average delay is equivalent to minimizing the total delay between all pairs of vertices in the tree.

In general, when the cost on an edge represents a price for routing messages between its endpoints (such as the delay), we define the routing cost for a pair of vertices in a given spanning tree as the sum of the costs of the edges in the unique tree path between them. The routing cost of the tree itself is the sum over all pairs of vertices of the routing cost for the pair in this tree.
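As a concrete illustration of this definition, the following minimal Python sketch computes the routing cost of a small tree by summing tree-path costs over all unordered pairs. The function names (`tree_distances`, `routing_cost`) and the edge-list representation with vertices numbered from 0 are assumptions made for this example only; they do not come from the paper.

```python
def tree_distances(n, edges, source):
    """Distances from `source` to every vertex along the unique tree paths."""
    adj = {v: [] for v in range(n)}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def routing_cost(n, edges):
    """Sum of tree-path costs over all unordered pairs of vertices."""
    total = 0
    for i in range(n):
        d = tree_distances(n, edges, i)
        total += sum(d[j] for j in range(i + 1, n))
    return total


# Path 0-1-2 with edge costs 1 and 2: the pair costs are 1, 2 and 3.
path = [(0, 1, 1), (1, 2, 2)]
print(routing_cost(3, path))  # 6
```

This brute-force version takes quadratic time, which is enough for checking small examples by hand.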

∗Received by the editors June 9, 1997; accepted for publication (in revised form) November 25, 1998; published electronically December 7, 1999. http://www.siam.org/journals/sicomp/29-3/32253.html

†Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, R.O.C. (bangye@ms16.hinet.net).

‡Dipartimento di Elettronica e Informatica, Università di Padova, Padova, Italy (lancia@dei.unipd.it).

§Celera Genomics, 45 West Gude Drive, Rockville, MD 20850 (Vineet.Bafna@celera.com).

¶Department of Computer Science and Information Management, Providence University, Shalu, Taiwan, R.O.C. (kmchao@csim.pu.edu.tw). The research of this author was supported in part by NSC grant NSC86-2213-E-126-002.

‖GSIA, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 (ravi@cmu.edu). The research of this author was supported by NSF Career grant CCR-9625297.

∗∗Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan, R.O.C. (cytang@cs.nthu.edu.tw). The research of this author was supported by NSC grant NSC86-2213-E-007-008.

Finding a spanning tree of minimum routing cost in a general weighted undirected graph is known to be NP-hard [11]. In this paper we show that finding a minimum routing cost tree in a general weighted graph G is equivalent to solving the same problem on a complete graph in which the edge weights are the shortest path lengths in G. This result implies that the minimum routing tree problem with metric inputs is also NP-hard.

Wong [22] studied the minimum routing cost tree problem and presented a 2-approximation algorithm even without the metric requirement. We give a better result for the metric case, which, by the above remark, applies to the general case as well.

Theorem 1.1. There is a polynomial-time approximation scheme (PTAS) for finding the minimum routing cost tree of a weighted undirected graph. In particular, on an n-vertex graph, we can find a (1 + ε)-approximate solution in time O(n^{2⌈2/ε⌉−2}).

Our result is derived by approximating a minimum routing cost tree by a restricted class of trees that we call k-stars. For any fixed size k, a k-star is a tree in which at most k vertices have degree greater than one. For a given accuracy parameter ε, we consider all (⌈2/ε⌉ − 1)-stars and output the one with the minimum routing cost. To argue the performance guarantee, we show how a minimum routing cost tree can be converted into a k-star without much degradation in its routing cost (no more than a factor of 1 + 2/(k + 1)). We also prove that for any fixed k, the minimum k-star can be determined in polynomial time. Hence, by finding the (⌈2/ε⌉ − 1)-star with the minimum routing cost, we get a (1 + ε)-approximate solution.

There is an important difference between our PTAS for the routing cost tree problem and Wong’s 2-approximation: While we show an approximation bound to the best tree’s routing cost, Wong’s proof shows that his trees have routing cost at most twice the value of the sum of pairwise distances between nodes in the input graph. This stronger connection is exploited by Gusfield [9] in an application to multiple alignments in computational biology (described later).

1.1. Optimum communication spanning trees. Hu [10] formulated a general version of the routing cost spanning tree problem that he called optimum communication spanning trees. In this problem, in addition to the costs on edges, a requirement value r_ij is specified for every pair of vertices i, j. The communication cost between a pair in a given spanning tree is the cost of the path between them in the tree multiplied by their requirement r_ij. The communication cost of the tree is the sum of all the pairwise communication costs. Thus the routing cost is a special case of the communication cost when all the requirement values are one.¹ In [10], Hu derives weak conditions under which the optimum routing cost tree is a star. In this paper, we demonstrate that simple generalizations of stars are indeed sufficient to guarantee any desirable accuracy in approximating optimal routing trees.

By using a recent result of Bartal [3] on approximating metrics probabilistically by tree metrics, we notice the following result.

Theorem 1.2. There is an O(log² n)-approximation algorithm for the communication spanning tree problem on an n-node metric.

Recent improvements to Bartal’s original result in [4, 6] also lead to an improvement of the performance guarantee in Theorem 1.2 to O(log n log log n).

The result in Theorem 1.2 is actually stronger in the same sense as Wong [22]. Given (symmetric) requirement values r_ij and metric distances d_ij between node pairs i, j, our approximate solution has communication cost at most O(log² n) times $\sum_{i,j} r_{ij} d_{ij}$. As in [9], we exploit this connection in the application to computational biology.

¹Hu uses the term “optimum distance spanning trees” to denote trees with minimum routing cost.

An overview of the remainder of the paper is as follows. In section 2 we describe the application of minimum routing cost trees to alignment problems in computational biology. In section 3 we give some basic definitions. In section 4, we show how the general case of the problem can be reduced to the metric one. Section 5 describes how k-stars provide good approximations to the optimum routing cost trees in metrics.

In section 6, we discuss a polynomial algorithm for finding minimum cost k-stars in a graph. Finally, in section 7 we describe an algorithm for approximating optimum communication spanning trees.

2. An application to computational biology.

2.1. Multiple sequence alignments. Multiple sequence alignments are important tools for highlighting patterns common to a set of genetic sequences in computational biology. A multiple alignment of a set of n strings involves inserting gaps in the strings and arranging their characters into columns with n rows, one from each string. The order of characters along a row corresponding to string s_i is the same as that in s_i, with possibly some blanks inserted. The following is an example of an alignment of three strings ATTCGAC, TTCCGTC, and ATCGTC.

    A T T - C G A - C
    - T T C C G - T C
    A - T - C G - T C

The intent of identifying common patterns is represented by attempting as much as possible to place the same character in every column.

The multiple sequence alignment problem has typically been formalized as an optimization problem in which some explicit objective function is minimized or maximized. One of the most popular objective functions for multiple alignment generalizes ideas from optimally aligning two sequences. The pairwise-alignment problem [21] can be phrased as that of finding a minimum mutation path between two sequences. Formally, given costs for inserting or deleting a character and for substituting one character of the alphabet for another, the problem is to find a minimum-cost mutation path from one sequence to the other. The cost of this path is the edit distance between them. An optimal alignment of two sequences of length l can be computed efficiently by dynamic programming [14, 21] in O(l²) time. The generalization to multiple sequences leads to the sum-of-pairs objective.
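The quadratic-time dynamic program mentioned here can be sketched as follows. This is a minimal illustration, assuming a single insertion/deletion cost `indel` and a single substitution cost `mismatch` (both hypothetical parameter names, defaulting to 1); the cost model in the cited work may be more general.

```python
def edit_distance(s, t, indel=1, mismatch=1):
    """Edit distance between s and t using O(len(s) * len(t)) time."""
    # prev[j] is the cost of transforming the current prefix of s into t[:j].
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * indel]
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            delete = prev[j] + indel
            insert = cur[j - 1] + indel
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept, so the sketch uses linear space; recovering the alignment itself would require the full table or a traceback structure.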

The sum-of-pairs (SP) objective for multiple alignment is to minimize the sum, over all pairs of sequences, of the pairwise distance between them in the alignment (where the distance of two sequences in an alignment with l columns is obtained by adding up the costs of the pairs of characters appearing at positions 1, . . . , l).
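To make the SP objective concrete, the sketch below scores the three-string example alignment from section 2.1. It assumes unit cost for a mismatch and for a character aligned with a blank, and zero cost for two blanks (the last assumption matches the one stated in section 2.2); the names `pair_cost` and `sp_cost` are illustrative only.

```python
from itertools import combinations


def pair_cost(x, y, mismatch=1, gap=1):
    """Cost of placing characters x and y in the same column."""
    if x == y:
        return 0  # identical characters, including blank against blank
    if x == "-" or y == "-":
        return gap
    return mismatch


def sp_cost(rows):
    """Sum over all pairs of rows of their column-by-column cost."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(
        pair_cost(x, y)
        for r, s in combinations(rows, 2)
        for x, y in zip(r, s)
    )


alignment = ["ATT-CGA-C", "-TTCCG-TC", "A-T-CG-TC"]
print(sp_cost(alignment))  # 10 under the unit costs assumed here
```

The three pairwise contributions are 4, 3 and 3, which add up to the printed value.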

Pioneering work of Sankoff and Kruskal [17] and Sankoff, Morel, and Cedergren [18] led to an exponential-time dynamic programming solution to the SP-alignment problem. A straightforward implementation requires time proportional to 2ⁿlⁿ for a problem with n sequences each of length at most l. Considering that in typical real-life instances l can be a few hundred, the basic dynamic programming approach turns out to be infeasible for all but very small problems. Carrillo and Lipman [5] have introduced some bounding criteria which reduce the time and space requirements of dynamic programming and make problems with n ≤ 6 and l ≤ 200 solvable. However, constructing optimal alignments is bound to be computationally expensive, since the problem has been shown to be NP-complete (Wang and Jiang [20]). Despite these very expensive solution methods, the SP-objective is implemented in several popularly available multiple alignment packages such as MACAW [19] and MSA [13].

2.2. Approximation algorithms via routing cost trees. The first approximation algorithm for the SP-alignment problem was by Gusfield [9]. It had a performance ratio of 2 − 2/n, where n is the number of sequences aligned. This was slightly improved to 2 − 3/n by Pevzner [15]. The best-known approximation algorithm for this problem is due to Bafna, Lawler, and Pevzner [2], which achieves a ratio of 2 − r/n for any fixed value of r. The running time is exponential in r. Notice that this is not a PTAS for the problem, and no polynomial-time approximation scheme is known yet for the SP-alignment problem.

Gusfield’s approximation algorithm for the SP-alignment problem is based on the 2-approximation for minimum routing cost trees due to Wong [22]. Gusfield’s algorithm uses a folklore approach to multiple alignment guided by a tree, due to Feng and Doolittle [8]: Given a spanning tree on the complete graph on the sequences to be aligned, the multiple alignment guided by the tree is built recursively as follows.

First, remove a leaf sequence l in the tree attached to sequence v by a tree edge (l, v), and align the remaining sequences recursively. Then, reinsert the leaf sequence into the alignment guided by an optimal pairwise alignment between the pair l and v. If this optimal pairwise alignment introduces a gap in v, insert the same gap in the recursively computed alignment for the tree without the leaf. Since the cost of aligning a blank to a blank is assumed to be zero, the resulting alignment has the property that for every pair related by a tree edge, the cost of the induced pairwise alignment equals their edit distance. By the triangle inequality on edit-distances, the SP-cost of the alignment derived from this spanning tree can be upper-bounded by the routing cost of the tree.

Wong’s 2-approximation algorithm considers the shortest path tree rooted at every vertex in turn, and picks the one with minimum routing cost. For graphs with metric distances obeying the triangle inequality, every shortest path tree is isomorphic to a star. Furthermore, in this case, Wong’s analysis shows that the best star has routing cost at most twice the total cost of the graph itself. The cost of the graph in this case is the sum of pairwise edit distances between sequences, which is a lower bound on the SP-cost. Thus, Gusfield observed that a multiple alignment derived from the best center-star gives a 2-approximation for the SP-alignment problem.
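For metric inputs, the best-star computation described above reduces to trying every vertex as the center. The sketch below is an illustration under that reading, not Wong's original procedure; the distance-matrix representation and the names `star_routing_cost` and `best_star` are assumptions of this example.

```python
def star_routing_cost(d, c):
    """Routing cost of the star centered at c: each leaf edge lies on n - 1 paths."""
    n = len(d)
    return (n - 1) * sum(d[c][v] for v in range(n) if v != c)


def best_star(d):
    """Return (center, cost) of the star with minimum routing cost."""
    center = min(range(len(d)), key=lambda c: star_routing_cost(d, c))
    return center, star_routing_cost(d, center)


# Three points on a line at positions 0, 1 and 3 (a metric).
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
center, cost = best_star(d)
pair_sum = sum(d[i][j] for i in range(3) for j in range(i + 1, 3))
print(center, cost, pair_sum)  # 1 6 6
```

In this instance the best star costs 6, which is within twice the sum of pairwise distances (also 6), as the bound quoted above requires.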

2.3. Tree-driven SP-alignment. Despite the popularity of the SP-objective, most of the currently available methods for finding alignments use a progressive approach of incrementally building the alignment adding sequences one at a time with no performance guarantee on the SP-cost. The Feng–Doolittle procedure can be viewed as one such procedure. The advantage of such approaches is their low running time, but the shortcoming is that the order in which the sequences are merged into the alignment determines its cost.

In trying to define a middle ground between the SP-objective and the more practical progressive methods, we introduce the tree-driven SP-alignment method: apply the Feng–Doolittle procedure to the best possible spanning tree in the complete graph on the sequences. By our reasoning above, the tree that gives the best upper bound on the SP-cost of the alignment is the one with the minimum routing cost. Thus, our PTAS for routing cost trees may be useful in finding good trees for applying any progressive alignment method such as the Feng–Doolittle procedure.

2.4. Generalized SP-alignments. A simple generalization of the SP objective for multiple alignments is to weight the different sequence pairs in the alignment differently in the objective function. Given a priority value r_ij for the pair i, j of sequences, the generalized sum-of-pairs objective for multiple alignment is to minimize the sum, over all pairs of sequences, of the pairwise distance between them in the alignment multiplied by the priority value of the pair. This allows one to increase the priority of aligning some pairs while down-weighting others, using other information (such as evolutionary) to decide on the priorities. An extreme case of assigning priorities is the threshold objective.

In an evolutionary context, a multiple alignment is used to reconstruct the blocks or motifs in a single ancestral sequence from which the given sequences have evolved. However, if the evolutionary events of the ancestral sequence occur randomly at a certain rate over the course of time, and independently at each location (character) of the string, after a sufficiently long time, the mutated sequence appears essentially like a random sequence compared to the initial ancestral sequence. If we postulate a threshold time beyond which this happens, this translates roughly to a threshold edit distance between the pair of sequences. The threshold objective sets r_ij to be one for all pairs of input sequences whose edit distance is less than this threshold and zero for other pairs which are more distant. In this way we try to capture the most information about closely related pairs in the objective function by setting an appropriate threshold.
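The threshold objective can be sketched in a few lines. The example assumes a symmetric matrix `d` of pairwise edit distances and a threshold `tau`; both names, and the helper functions, are hypothetical. It also evaluates the priority-weighted sum of edit distances for the resulting 0/1 priorities.

```python
def threshold_priorities(d, tau):
    """r[i][j] = 1 if sequences i and j are closer than tau, else 0."""
    n = len(d)
    return [[1 if i != j and d[i][j] < tau else 0 for j in range(n)]
            for i in range(n)]


def weighted_pair_sum(r, d):
    """Sum of r[i][j] * d[i][j] over unordered pairs i < j."""
    n = len(d)
    return sum(r[i][j] * d[i][j] for i in range(n) for j in range(i + 1, n))


# Sequences 0 and 1 are close; sequence 2 is distant from both.
d = [[0, 2, 9],
     [2, 0, 8],
     [9, 8, 0]]
r = threshold_priorities(d, tau=5)
print(weighted_pair_sum(r, d))  # 2
```

With the threshold at 5, only the close pair contributes, so the distant sequence no longer influences the objective.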

In the same vein as Gusfield [9], Theorem 1.2 can be used to approximate the generalized SP objective within an O(log² n) factor on inputs with n sequences. Let d_ij denote the edit distance between sequences i and j. The theorem guarantees a tree whose communication cost using the r_ij values given by the priority function is at most O(log² n) times $\sum_{i,j} r_{ij} d_{ij}$, which is a lower bound on the generalized SP value of any alignment. The Feng–Doolittle procedure guarantees that the generalized SP value of the resulting alignment is at most the communication cost of the tree, which in turn is at most O(log² n) times the generalized SP value of any alignment.

3. Definitions. Throughout the paper we will be referring to a given weighted, connected, undirected graph G = (V, E, w), where we assume V = {1, . . . , n} and w is a nonnegative edge weight function, not necessarily metric. For a subset S ⊆ V , by P(S) we denote the set of all unordered pairs of elements of S.

Definition 3.1. Let G = (V, E, w) and i, j ∈ V. Let S = (V_S, E_S, w) be a subgraph of G. By SP(S, i, j) we denote a shortest path from i to j on S. When S is a tree, SP(S, i, j) denotes the unique path between i and j.

Definition 3.2. Let S be a subgraph of G and i, j ∈ V. The weight of S is denoted by $w(S) = \sum_{e \in E_S} w(e)$. The distance of i and j in S is denoted by d_S(i, j) := w(SP(S, i, j)). We define $d_G(i, S) = \min_{j \in V_S} d_G(i, j)$. If T is a tree and S ⊂ T, we denote the value w(SP(T, i, j) ∩ S) by w_S(T, i, j).

Definition 3.3. Let S be a subgraph of G. The routing cost of S is defined as $C(S) = \sum_{(i,j) \in P(V_S)} d_S(i, j)$.

Definition 3.4. Given a graph G = (V, E, w), the minimum routing cost spanning tree problem (MRCT) is to find a spanning tree $\widehat{T}_G$ of G such that $C(\widehat{T}_G)$ is minimum.

Definition 3.5. A metric graph G = (V, E, w) is a complete graph in which w(i, j) ≥ 0 and w(i, j) + w(j, k) ≥ w(i, k) for all i ≠ j ≠ k ∈ V.

Definition 3.6. The metric closure of G is the complete weighted graph $\bar{G}$ = (V, P(V), δ), where δ(i, j) := d_G(i, j) for all (i, j) ∈ P(V). Note that $\bar{G}$ is a metric graph.
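The metric closure can be computed with any all-pairs shortest path method. The following sketch uses the Floyd-Warshall algorithm as one possible choice; the function name `metric_closure` and the edge-list input with vertices numbered from 0 are assumptions of this example.

```python
import math


def metric_closure(n, edges):
    """Return the n x n matrix of shortest-path distances (Floyd-Warshall)."""
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        if w < d[u][v]:  # keep the cheaper of any parallel edges
            d[u][v] = d[v][u] = w
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# The direct edge (0, 2) of weight 5 violates the triangle inequality;
# the closure replaces it by the length of the path through vertex 1.
delta = metric_closure(3, [(0, 1, 1), (1, 2, 1), (0, 2, 5)])
print(delta[0][2])  # 2
```

The running time is cubic in n, matching the O(n³) construction time used in the reduction of section 4.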

Definition 3.7. Given a metric graph G, the metric minimum routing cost spanning tree problem (∆MRCT) is to find a spanning tree T of G such that C(T ) is minimum.

4. A reduction from the general to the metric case. Let G = (V, E, w) and $\bar{G}$ = (V, P(V), δ) be its metric closure. In this section, we present an algorithm which can transfer a spanning tree of $\bar{G}$ into a spanning tree of G without increasing cost. This implies that we can solve the MRCT problem on G by solving the same problem on $\bar{G}$. An edge (a, b) in $\bar{G}$ is termed a bad edge if (a, b) ∉ E or w(a, b) > δ(a, b). For any bad edge e = (a, b), there must exist a path P ≠ e such that w(P) = δ(a, b).

Given any spanning tree T of $\bar{G}$, the algorithm iteratively replaces bad edges (if any) in T with edges from the path defining the weight of the edge until there are no more bad edges in the tree. Since the resulting tree Y has no bad edge, it can be thought of as a spanning tree of G with the same cost. It will be shown later that the iteration will be executed at most O(n²) times and the cost is never increased while replacing the bad edges. The algorithm listed below details how to obtain Y from T.

Algorithm Remove bad
Input: a spanning tree T of $\bar{G}$
Output: a spanning tree Y of G (i.e., without any bad edge) such that C(Y) ≤ C(T).

    Compute all-pairs shortest paths of G.
    while there exists a bad edge in T                        (1)
        Pick a bad edge (a, b). Root T at a.
        /* assume SP(G, a, b) = (a, x, ..., b) and y is the father of x in T */
        if b is not an ancestor of x then
            Y_1 = T ∪ (x, b) − (a, b)
            Y_2 = Y_1 ∪ (a, x) − (x, y)
        else
            Y_1 = T ∪ (a, x) − (a, b)
            Y_2 = Y_1 ∪ (b, x) − (x, y)
        endif
        if C(Y_1) < C(Y_2) then
            Y = Y_1
        else
            Y = Y_2
        endif
        T = Y                                                 (2)
    endwhile
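The pseudocode above can be turned into a small runnable sketch. The version below is an illustration under several assumptions that are not in the paper: vertices are numbered from 0, the input graph is a dictionary `w` keyed by pairs (u, v) with u < v, shortest paths come from Floyd-Warshall with next-hop pointers, and the names `all_pairs`, `parents`, `cost` and `remove_bad` are hypothetical. When the father y of x already equals the vertex to which x would be reattached, the second candidate tree is taken to be the same as the first.

```python
import math


def all_pairs(n, w):
    """Floyd-Warshall over edge dict w[(u, v)], u < v; returns (dist, next_hop)."""
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    nxt = [list(range(n)) for _ in range(n)]
    for (u, v), c in w.items():
        d[u][v] = d[v][u] = c
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
                    nxt[i][j] = nxt[i][k]
    return d, nxt


def parents(n, tree, root):
    """Parent map and DFS preorder of `tree` (a collection of edges) rooted at `root`."""
    adj = {v: [] for v in range(n)}
    for e in tree:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    par, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v not in par:
                par[v] = u
                stack.append(v)
    return par, order


def cost(n, tree, d):
    """Routing cost of `tree` under the closure weights d, via subtree sizes."""
    par, order = parents(n, tree, 0)
    size = {v: 1 for v in range(n)}
    total = 0
    for v in reversed(order):  # descendants are processed before ancestors
        p = par[v]
        if p is not None:
            size[p] += size[v]
            total += size[v] * (n - size[v]) * d[v][p]
    return total


def remove_bad(n, w, tree):
    """Replace bad edges of a spanning tree of the closure until none remain."""
    d, nxt = all_pairs(n, w)

    def bad(e):
        u, v = sorted(e)
        return (u, v) not in w or w[(u, v)] > d[u][v]

    tree = {frozenset(e) for e in tree}
    while True:
        e = next((e for e in tree if bad(e)), None)
        if e is None:
            return tree
        a, b = sorted(e)
        x = nxt[a][b]                  # first vertex after a on SP(G, a, b)
        par, _ = parents(n, tree, a)   # root T at a
        y = par[x]                     # father of x
        anc, below_b = x, False
        while anc is not None:         # is b an ancestor of x?
            if anc == b:
                below_b = True
                break
            anc = par[anc]
        first, second = (a, b) if below_b else (b, a)
        y1 = (tree - {e}) | {frozenset((x, first))}
        if y == second:
            y2 = y1
        else:
            y2 = (y1 - {frozenset((x, y))}) | {frozenset((second, x))}
        tree = y1 if cost(n, y1, d) < cost(n, y2, d) else y2
```

For example, on the path graph 0-1-2-3 with unit weights, starting from the star centered at 0 in the metric closure (routing cost 18), the procedure returns the path itself (routing cost 10), which is the only spanning tree of that graph.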

We assume that the shortest paths obtained in the beginning of the algorithm have the following property: If the obtained shortest path between a and b is (a, x)∪P , then P is the obtained shortest path between x and b. Note that since x is on the shortest a-b path, δ(a, b) = δ(a, x) + δ(x, b).

Proposition 4.1. The loop (1) is executed at most O(n²) times.


Fig. 4.1. Remove bad edge (a, b). Case 1 (left) and Case 2 (right).

Proof. For each bad edge e = (a, b), let l(e) be the number of edges in SP(G, a, b) and $f(T) = \sum_{\text{bad } e} l(e)$. Since l(e) ≤ n − 1 and T has n − 1 edges, f(T) < n². Since l(x, b) < l(a, b) and (a, x) is not a bad edge, it is easy to check that f(T) decreases by at least 1 at each loop iteration.

Proposition 4.2. Before instruction (2) is executed, C(Y ) ≤ C(T ).

Proof. For any node v, define S_v = {u | v is an ancestor of u on T} ∪ {v}. Also, let $C(T, S_1, S_2) = \sum_{i \in S_1, j \in S_2} d_T(i, j)$.

Case 1 (see Figure 4.1). x ∈ S_a − S_b. If C(Y_1) ≤ C(T), the result follows. Otherwise, let S_1 = S_a − S_b and S_2 = S_a − S_b − S_x. Since the distance between any two vertices both in S_1 (or both in S_b) does not change, we have

\begin{align*}
 & C(T) < C(Y_1) \\
\Rightarrow\ & C(T, S_1, S_b) < C(Y_1, S_1, S_b) \\
\Rightarrow\ & |S_b|\, C(T, a, S_1) + |S_1||S_b|\,\delta(a, b) < |S_b|\, C(T, x, S_1) + |S_1||S_b|\,\delta(x, b) \\
\Rightarrow\ & C(T, a, S_1) + |S_1|\,\delta(a, b) < C(T, x, S_1) + |S_1|\,\delta(x, b) \\
\Rightarrow\ & C(T, a, S_1) < C(T, x, S_1) - |S_1|\,\delta(a, x).
\end{align*}

The last inequality follows from the property of the shortest path lengths alluded to earlier, namely δ(a, b) = δ(a, x) + δ(x, b).

Also,

\[ C(Y_2) - C(T) = \bigl(C(Y_2, S_2, S_x) - C(T, S_2, S_x)\bigr) + \bigl(C(Y_2, S_b, S_1) - C(T, S_b, S_1)\bigr). \]

Since d_{Y_2}(i, j) ≤ d_T(i, j) for i ∈ S_b and j ∈ S_1, the second term is not positive, and

\begin{align*}
C(Y_2) - C(T)
 &\le C(Y_2, S_2, S_x) - C(T, S_2, S_x) \\
 &= |S_x|\, C(T, a, S_2) + |S_2||S_x|\,\delta(a, x) - |S_x|\, C(T, x, S_2) \\
 &= |S_x| \bigl( (C(T, a, S_1) - C(T, a, S_x)) + |S_2|\,\delta(a, x) - (C(T, x, S_1) - C(T, x, S_x)) \bigr) \\
 &= |S_x| \bigl( (C(T, a, S_1) - C(T, x, S_1)) + |S_2|\,\delta(a, x) + (C(T, x, S_x) - C(T, a, S_x)) \bigr) \\
 &< |S_x| \bigl( -|S_1|\,\delta(a, x) + |S_2|\,\delta(a, x) \bigr) \\
 &\le 0.
\end{align*}

Case 2. x ∈ S_b. The case is identical to Case 1 if we reroot the tree at b and follow the analysis in Case 1 exchanging the roles of a and b.


As a direct consequence of Propositions 4.1 and 4.2 we obtain the following lemma.

Lemma 4.3. Given a spanning tree T of $\bar{G}$, the algorithm Remove bad constructs a spanning tree Y of G with C(Y) ≤ C(T) in O(n³) time.

The above lemma implies that $C(\widehat{T}_G) \le C(\widehat{T}_{\bar{G}})$. Since, for any edge, the weight on the original graph is no less than the weight on the metric closure, it is easy to see that $C(\widehat{T}_G) \ge C(\widehat{T}_{\bar{G}})$. Therefore, we have the following corollary.

Corollary 4.4. $C(\widehat{T}_G) = C(\widehat{T}_{\bar{G}})$.

Corollary 4.5. If there is a (1 + ε)-approximation algorithm for ∆MRCT with time complexity O(f(n)), then there is a (1 + ε)-approximation algorithm for MRCT with time complexity O(f(n) + n³).

Proof. Let G be the input graph for a MRCT problem. We can construct $\bar{G}$ in time O(n³) (see, e.g., [7]). If there is a (1 + ε)-approximation algorithm for the ∆MRCT problem, we can compute in time O(f(n)) a spanning tree T_1 of $\bar{G}$ such that $C(T_1) \le (1 + \varepsilon)\, C(\widehat{T}_{\bar{G}})$. Using Algorithm Remove bad, we can then construct a spanning tree T_2 of G such that $C(T_2) \le C(T_1) \le (1 + \varepsilon)\, C(\widehat{T}_{\bar{G}}) = (1 + \varepsilon)\, C(\widehat{T}_G)$. The overall time complexity is then O(f(n) + n³).

5. A PTAS for the ∆MRCT problem.

5.1. Overview. As described in the previous section, the fact that the costs w may not obey the triangle inequality is irrelevant, since we can simply replace these costs by their metric closure. Therefore, in this and the following sections we may assume that G = (V, E, w) is a metric graph. We remind the reader that n = |V|. Also, for a subgraph G′ of G, we use V(G′) to denote the vertex set of G′.

To establish the performance guarantee, we use k-stars, i.e., trees with no more than k internal nodes. In section 6 we show that for any constant k, the minimum routing cost k-star can be determined in polynomial (in n) time. In order to show that a k-star achieves a (1 + ε) approximation, we show that, for any tree T and constant δ ≤ 1/2:

1. It is possible to determine a δ-separator (a particular subtree of T to be defined later), and the separator can be cut into several δ-paths such that the total number of cut nodes and leaves of the separator is at most ⌈2/δ⌉ − 3 (Lemma 5.9).

2. Using the separator, T can be converted into a (⌈2/δ⌉ − 3)-star X(T), whose internal nodes are just those cut nodes and leaves. The routing cost of X(T) satisfies C(X(T)) ≤ (1 + δ/(1 − δ)) C(T) (Lemma 5.13).

By using T = $\widehat{T}_G$, δ = ε/(1 + ε), and finding the best (⌈2/δ⌉ − 3)-star K, we obtain $C(K) \le C(X(\widehat{T}_G)) \le (1 + \delta/(1-\delta))\, C(\widehat{T}_G) = (1 + \varepsilon)\, C(\widehat{T}_G)$, i.e., the desired approximation.
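The parameter choice in this step can be verified with a short worked derivation. It assumes δ = ε/(1 + ε), the value for which the bound 1 + δ/(1 − δ) of Lemma 5.13 equals 1 + ε; the numeric instance that follows is only an illustration.

```latex
\begin{align*}
\delta = \frac{\varepsilon}{1+\varepsilon}
  &\;\Longrightarrow\; 1-\delta = \frac{1}{1+\varepsilon}
  \;\Longrightarrow\; \frac{\delta}{1-\delta} = \varepsilon, \\
\frac{2}{\delta} = \frac{2}{\varepsilon} + 2
  &\;\Longrightarrow\; \left\lceil \frac{2}{\delta} \right\rceil - 3
   = \left\lceil \frac{2}{\varepsilon} \right\rceil - 1 .
\end{align*}
```

For ε = 1/2 this gives δ = 1/3 and ⌈2/δ⌉ − 3 = 3, so the best 3-star is a 3/2-approximation, in agreement with the star size stated in the introduction.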

5.2. The δ-spine of a tree.

Definition 5.1. Let T be a spanning tree of G and S be a connected subgraph of T . A branch of S is a connected component of T \ S. Let δ ≤ 1/2 be a positive number. If |V (B)| ≤ δn for every branch B of S, then S is a δ-separator of T . A δ-separator S is minimal if any proper subgraph of S is not a δ-separator of T .

Intuitively, a δ-separator is like a “center” of the tree. Starting from any node, there are sufficiently many nodes which cannot be reached without touching the separator. To illustrate the concept of separator, we examine the simplest case for δ = 1/2. For any tree T, there always exists a 1/2-separator which contains only one vertex. That is, we can always cut a tree at a node c such that each branch contains at most half of the nodes. Such a node is usually called the centroid of the tree in the literature. Note that this also shows the existence of a minimal δ-separator for any δ ≤ 0.5.

Fig. 5.1. B_1, . . . , B_7 are branches of P. VB(T, P, i) = {i} ∪ V(B_1) ∪ V(B_2) ∪ V(B_3). P^c is the number of vertices in {r_1, r_2, r_3} ∪ V(B_4) ∪ V(B_5) ∪ V(B_6).

If we construct a star X centered at the centroid c, the routing cost will be at most twice that of T. This can be easily shown as follows. First, if i and j are two nodes not in the same branch, d_T(i, j) = d_T(i, c) + d_T(j, c). Consider the total distance of all ordered pairs of nodes on T. This value is exactly 2C(T) by the definition. For any node i, since each branch contains no more than half of the nodes, the term d_T(i, c) will be counted in the total distance at least n times, n/2 times for i to others, and n/2 times for others to i. Hence, we have $2C(T) \ge n \sum_i d_T(i, c)$. Since $C(X) = (n - 1) \sum_i d_G(i, c)$ and d_G(i, c) ≤ d_T(i, c), it follows that C(X) ≤ 2C(T). The idea in this paper can be thought of as a generalization of the above method. However, the proof is much more involved.
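A centroid as used in this argument can be found in linear time from subtree sizes. The sketch below is one standard way to do it; the function name `centroid` and the unweighted edge-list input with vertices numbered from 0 are assumptions of this example.

```python
def centroid(n, edges):
    """Return a vertex whose removal leaves branches of at most n / 2 vertices."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Iterative DFS from vertex 0 to get parents and a preorder.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                stack.append(v)
    # Subtree sizes, accumulated from the leaves upward.
    size = [1] * n
    for u in reversed(order):
        if parent[u] >= 0:
            size[parent[u]] += size[u]
    for u in range(n):
        # Branches of u: its child subtrees and the part of the tree above u.
        branches = [n - size[u]] + [size[v] for v in adj[u] if parent[v] == u]
        if 2 * max(branches) <= n:
            return u


# On the path 0-1-2-3-4 the middle vertex is the centroid.
print(centroid(5, [(0, 1), (1, 2), (2, 3), (3, 4)]))  # 2
```

Connecting every other vertex directly to the returned vertex yields the star X of the argument above.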

Definition 5.2. Let T be a spanning tree of G and S be a connected subgraph of T . For any vertex i in S, V B(T, S, i) denotes the set of vertex i and the vertices in the branches connected to i.

Definition 5.3. Let P = SP(T, i, j) in which |VB(T, P, i)| ≥ |VB(T, P, j)|. Define P^a = |VB(T, P, i)|, P^b = |VB(T, P, j)|, and P^c = n − P^a − P^b. Assume P = (i, r_1, r_2, . . . , r_h, j). Define

\[ Q(P) = \sum_{1 \le x \le h} |VB(T, P, r_x)| \times d_T(r_x, i). \]

The above notations are defined to simplify the expressions. P^a and P^b are the numbers of vertices that are hanging off the two endpoints of the path. Note that we always assume P^a ≥ P^b. In the case that P contains only one edge, P^c = 0. The notations are illustrated in Figure 5.1.

Lemma 5.4. Let S be a minimal δ-separator of T. If i is a leaf of S, then |VB(T, S, i)| > δn.

Proof. If S contains only one vertex, the result is trivial since |VB(T, S, i)| = n. Otherwise, if |VB(T, S, i)| ≤ δn, deleting i from S we still get a δ-separator. This is a contradiction to S being minimal.

Definition 5.5. Let 1 ≤ k ≤ n. A k-star is a spanning tree of G which has no more than k internal nodes. The set of all k-stars is denoted by k^∗(G). T is a minimum k-star if T ∈ k^∗(G) and C(T ) ≤ C(Y ) for all Y ∈ k^∗(G).

We now turn to the notions of δ-path and δ-spine. Informally, a δ-path is a path such that not too many nodes (at most δn/2) are hanging off its internal nodes. A δ-spine is a set of edge-disjoint δ-paths, whose union is a minimal δ-separator. That is, a δ-spine is obtained by cutting the minimal δ-separator into δ-paths. In the case that the minimal δ-separator contains just one node, the only δ-spine is the empty set.

Fig. 5.2. Trees with maximum value for the size of the minimum cut and leaf set.

Definition 5.6. Given a spanning tree T of G, and 0 < δ ≤ 0.5, a δ-path of T is a path P such that P^c ≤ δn/2.

Definition 5.7. Let 0 < δ ≤ 0.5. A δ-spine Y = {P_1, P_2, ..., P_h} of T is a set of pairwise edge-disjoint δ-paths in T such that $S = \bigcup_{1 \le i \le h} P_i$ is a minimal δ-separator of T. Furthermore, for any pair of distinct paths P_i and P_j in the spine, we require that either they do not intersect or, if they do, the intersection point is an endpoint of both paths.

Definition 5.8. Let Y be a δ-spine of a tree T . CAL(Y ) (which stands for the cut and leaf set of Y ) is the set of the endpoints of the paths in Y . In the case that Y is empty, the CAL set contains only one node which is the δ-separator of T . Formally CAL(Y ) = {u|∃P ∈ Y, v ∈ T : P = SP (T, u, v)} if Y is not empty, and otherwise CAL(Y ) = {u|u is the minimal δ-separator }.

Two trees achieving the maximum value for the size of the minimum CAL set for δ = 1/3 (|CAL(Y)| = 3) and δ = 1/4 (|CAL(Y)| = 5) are depicted in Figure 5.2. Next, we show that for any tree, there always exists a (1/3)-spine Y_1 such that |CAL(Y_1)| ≤ 3 and a (1/4)-spine Y_2 such that |CAL(Y_2)| ≤ 5.

Lemma 5.9. For any constant 0 < δ ≤ 0.5, and spanning tree T of G, there exists a δ-spine Y of T such that |CAL(Y)| ≤ ⌈2/δ⌉ − 3.

Proof. Let S be a minimal δ-separator of T. S is a tree. Let U_1 be the set of leaves in S, U_2 be the set of vertices which have more than two neighbors in S, and U = U_1 ∪ U_2. Let h = |U_1|. Clearly, |U| ≤ 2h − 2. Let Y_1 be the set of paths obtained by cutting S at all the vertices in U_2. For example, for the tree on the right side of Figure 5.2, U_1 = {2, 3, 4}; U_2 = {1}; Y_1 contains SP(T, 1, 2), SP(T, 1, 3), and SP(T, 1, 4). For any P ∈ Y_1, if P^c > δn/2 then P is called a heavy path. It is easy to check that Y_1 satisfies the requirements of a δ-spine except that there may exist some heavy paths. Suppose P is not a δ-path. We can break it up into δ-paths by the following process. First find the longest prefix of P starting at one of its endpoints and ending at some internal vertex, i, say, in the path, that determines a δ-path. Now we break P at vertex i. Then we repeat the breaking process on the remaining suffix of P starting at i, stripping off the next δ-path, and so on. In this way P can be cut into δ-paths by breaking it in at most ⌈2P^c/(δn)⌉ − 1 vertices. Since there are at least δn nodes hung at each leaf,

\[ \sum_{P \in Y_1} P^c < n - h\delta n. \]

Assume U_3 to be the minimal vertex set for cutting the heavy paths to result in a δ-spine Y of T. We have

\[ |U_3| \le \lceil 2 (n - h\delta n) / (\delta n) \rceil - 1 = \lceil 2/\delta \rceil - 2h - 1. \]

So, |CAL(Y)| = |U| + |U_3| ≤ ⌈2/δ⌉ − 3.

5.3. Lower bound.

Definition 5.10. The routing load of an edge e in T is the number e^a e^b of pairs in T connected by a path containing e, where e^a and e^b are the numbers of vertices on the two sides of e in T.

The following lemma is immediate.

Lemma 5.11. For any spanning tree T of G, $C(T) = \sum_{e \in T} e^a e^b w(e)$.
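The identity of Lemma 5.11 also gives a linear-time way to evaluate the routing cost: root the tree, compute subtree sizes, and charge every edge its weight times the number of pairs it separates. The sketch below illustrates this; the name `routing_cost_by_loads` and the edge-list input with vertices numbered from 0 are assumptions of this example.

```python
def routing_cost_by_loads(n, edges):
    """Routing cost as the sum over edges of (size below) * (size above) * weight."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    parent = [None] * n
    parent_weight = [0] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                parent_weight[v] = w
                stack.append(v)
    size = [1] * n
    total = 0
    for v in reversed(order):  # every vertex is handled after its descendants
        p = parent[v]
        if p is not None:
            size[p] += size[v]
            # The edge (v, p) separates size[v] vertices from the other n - size[v].
            total += size[v] * (n - size[v]) * parent_weight[v]
    return total


# Path 0-1-2 with weights 1 and 2: loads are 2 and 2, so the cost is 2*1 + 2*2.
print(routing_cost_by_loads(3, [(0, 1, 1), (1, 2, 2)]))  # 6
```

The result agrees with summing the three pairwise path costs 1, 2 and 3 directly.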

Lemma 5.12. Let Y be a δ-spine of a spanning tree T of G and $S = \bigcup_{P \in Y} P$ be a minimal δ-separator of T. Then

\[ C(T) \ge (1 - \delta)\, n \sum_{v \in V} d_T(v, S) + \sum_{P \in Y} \bigl( P^b (P^a + P^c)\, w(P) + (P^a - P^b)\, Q(P) \bigr). \]

Proof. Since e^a≥ (1 − δ)n for any edge e ∈ T \ S, we have C(T ) =X

e∈T

e^ae^bw(e)

≥ X

e∈T \S

(1 − δ)ne^bw(e) +X

e∈S

e^ae^bw(e)

≥ (1 − δ)nX

v∈V

d_T(v, S) + X

P ∈Y

X

e∈P

e^ae^bw(e).

Now we simplify the second term. Assume P = (r₀, r₁, r₂, . . . , r_h) in which

|V B(T, P, r₀)| ≥ |V B(T, P, r_h)|. Let |V B(T, P, r_i)| = n_i for 1 ≤ i ≤ h − 1 and e_i= (r_i−1, r_i) for 1 ≤ i ≤ h.

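\[
\begin{aligned}
\sum_{e \in P} e^a e^b\, w(e)
&= \sum_{i=1}^{h} \Bigl( P^a + P^c - \sum_{j=i}^{h-1} n_j \Bigr) \Bigl( P^b + \sum_{j=i}^{h-1} n_j \Bigr) w(e_i) \\
&\ge \sum_{i=1}^{h} P^b (P^a + P^c)\, w(e_i) + (P^a - P^b) \sum_{i=1}^{h} \sum_{j=i}^{h-1} n_j\, w(e_i)
   + \sum_{i=1}^{h} \Bigl( \sum_{j=i}^{h-1} n_j \Bigr) \Bigl( P^c - \sum_{j=i}^{h-1} n_j \Bigr) w(e_i) \\
&\ge P^b (P^a + P^c)\, w(P) + (P^a - P^b) \sum_{j=1}^{h-1} n_j \Bigl( \sum_{i=1}^{j} w(e_i) \Bigr) \\
&= P^b (P^a + P^c)\, w(P) + (P^a - P^b)\, Q(P).
\end{aligned}
\]

This completes the proof.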
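The steps of this chain can be unpacked as follows. This is only a sketch: the shorthand $x_i$ is not in the original, and the bound $0 \le x_i \le P^c$ assumes that $P^c = \sum_{j=1}^{h-1} n_j$, as the earlier definitions suggest.

```latex
% Shorthand (not in the original): x_i = \sum_{j=i}^{h-1} n_j.
% Assumption: P^c = \sum_{j=1}^{h-1} n_j, hence 0 \le x_i \le P^c.
\begin{align*}
(P^a + P^c - x_i)(P^b + x_i)
  &= P^b (P^a + P^c) + (P^a - P^b)\, x_i + x_i (P^c - x_i), \\
x_i (P^c - x_i) &\ge 0
  \quad \text{(so the third sum can be dropped)}, \\
\sum_{i=1}^{h} x_i\, w(e_i)
  &= \sum_{i=1}^{h} \sum_{j=i}^{h-1} n_j\, w(e_i)
   = \sum_{j=1}^{h-1} n_j \sum_{i=1}^{j} w(e_i)
  \quad \text{(exchange of the order of summation)}.
\end{align*}
```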


5.4. From trees to stars.

Lemma 5.13. For any constant 0 < δ ≤ 0.5, there exists a spanning tree X ∈ (⌈2/δ⌉ − 3)^∗(G) such that $C(X) \le \frac{1}{1-\delta}\, C(\widehat{T}_G)$.

Proof. Let $T = \widehat{T}_G = (V, E, w)$ and n = |V|. Also, let Y = {P_i | 1 ≤ i ≤ h} be a δ-spine of T in which |CAL(Y)| ≤ ⌈2/δ⌉ − 3. Note that the set of all the edges in Y forms a δ-separator S. Assume P_i = SP(T, u_i, v_i) and |VB(T, P_i, u_i)| ≥ |VB(T, P_i, v_i)|.

We construct a spanning tree whose internal nodes are exactly the CAL set of the δ-spine we just identified. We connect these nodes by short-cutting paths along the spine to include a set of acyclic edges with the same skeletal structure as the spine. All vertices in subtrees hanging off the CAL nodes of the spine are connected directly to their closest node in the spine. Along a δ-path in the spine, all the internal nodes and nodes in subtrees hanging off internal nodes are connected to one of the two endpoints of this path (note that both are in the CAL set of the spine) in such a way as to minimize the resulting routing cost. This is the spanning tree used to argue the upper bound on the routing cost in the proof.

More formally, construct a subgraph R ⊂ G with vertex set CAL(Y) and edge set E_r = {(u_i, v_i) | 1 ≤ i ≤ h}. Trivially, R is a tree. Let f(i) be an indicator variable such that f(i) = 1 if

\[ (P_i^a - P_i^b)\, P_i^c\, w(P_i) - n \bigl( 2Q(P_i) - P_i^c\, w(P_i) \bigr) \ge 0, \]

and f(i) = 0 otherwise. The indicator variable f(i) determines the endpoint of P_i to which all the internal nodes and nodes hanging off such internal nodes will be directly connected.

We construct a spanning tree X of G where the edge set E_x is determined by the following rules:

1. R ⊂ X.

2. If q ∈ VB(T, S, r), then (q, r) ∈ E_x, for any r ∈ {u_i, v_i | 1 ≤ i ≤ h}.

3. For the vertex set V_i = V − VB(T, P_i, u_i) − VB(T, P_i, v_i), if f(i) = 1, then {(q, u_i) | q ∈ V_i} ⊂ E_x, else {(q, v_i) | q ∈ V_i} ⊂ E_x. That is, the vertices in V_i are either all connected to u_i or all connected to v_i.
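The role of f(i) can be illustrated with a small Python sketch. The function name `choose_endpoint` and its parameter names are hypothetical. The two candidate quantities are the per-path terms that appear inside the minimum in the bound on C(X) derived later in this proof, so f(i) = 1 exactly when attaching V_i to u_i gives the smaller term.

```python
def choose_endpoint(pa, pb, pc, w_path, q, n):
    """Evaluate the indicator f(i) for one path P_i of the spine.

    pa, pb, pc stand for P_i^a, P_i^b, P_i^c; w_path for w(P_i); q for Q(P_i).
    Returns (f, bound_u, bound_v), where bound_u and bound_v are the terms
    compared in the proof when V_i is attached to u_i or to v_i, respectively.
    """
    bound_u = pb * pc * w_path + n * q
    bound_v = pa * pc * w_path + n * (pc * w_path - q)
    # Indicator exactly as defined in the text; equivalent to bound_u <= bound_v.
    f = 1 if (pa - pb) * pc * w_path - n * (2 * q - pc * w_path) >= 0 else 0
    return f, bound_u, bound_v
```

With P_i^a = 5, P_i^b = 3, P_i^c = 2, n = 10, w(P_i) = 4, and Q(P_i) = 3, the two terms are 54 and 90, so f(i) = 1 and V_i is attached to u_i.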

It is easy to see that X ∈ (⌈2/δ⌉ − 3)^∗(G). Let us consider the cost of X.

\[
\begin{aligned}
C(X) &= \sum_{e \in E_x} e^a e^b\, w(e) \\
&= \sum_{e \in E_r} e^a e^b\, w(e) + (n - 1) \sum_{e \in E_x - E_r} w(e).
\end{aligned}
\]

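First, for any e = (u_i, v_i) ∈ E_r,

\[
\begin{aligned}
e^a e^b\, w(e) &\le \bigl( P_i^a + f(i) P_i^c \bigr) \bigl( P_i^b + (1 - f(i)) P_i^c \bigr) w(P_i) \\
&= P_i^a P_i^b\, w(P_i) + \bigl( f(i) P_i^b + (1 - f(i)) P_i^a \bigr) P_i^c\, w(P_i).
\end{aligned}
\]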

Recall that for a subset of edges S ⊂ T, w_S(T, i, j) stands for w(SP(T, i, j) ∩ S). Second, by the triangle inequality,

\[
\begin{aligned}
\sum_{e \in E_x - E_r} w(e)
&\le \sum_{v \in V} d_T(v, S) + \sum_{i=1}^{h} \sum_{v \in V_i} \bigl( f(i)\, w_S(T, v, u_i) + (1 - f(i))\, w_S(T, v, v_i) \bigr) \\
&= \sum_{v \in V} d_T(v, S) + \sum_{i=1}^{h} \bigl( f(i)\, Q(P_i) + (1 - f(i)) (P_i^c\, w(P_i) - Q(P_i)) \bigr).
\end{aligned}
\]


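Thus,

\[
\begin{aligned}
C(X) \le{} & \sum_{i=1}^{h} P_i^a P_i^b\, w(P_i) + n \sum_{v \in V} d_T(v, S) \\
& + \sum_{i=1}^{h} \min\bigl\{ P_i^b P_i^c\, w(P_i) + n Q(P_i),\; P_i^a P_i^c\, w(P_i) + n (P_i^c\, w(P_i) - Q(P_i)) \bigr\}.
\end{aligned}
\]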

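Since the minimum of two numbers is not larger than their weighted mean, we have

\[
\begin{aligned}
& \min\bigl\{ P_i^b P_i^c\, w(P_i) + n Q(P_i),\; P_i^a P_i^c\, w(P_i) + n (P_i^c\, w(P_i) - Q(P_i)) \bigr\} \\
&\quad \le \bigl( P_i^b P_i^c\, w(P_i) + n Q(P_i) \bigr) \frac{P_i^a}{P_i^a + P_i^b}
 + \bigl( P_i^a P_i^c\, w(P_i) + n (P_i^c\, w(P_i) - Q(P_i)) \bigr) \frac{P_i^b}{P_i^a + P_i^b}.
\end{aligned}
\]

Then,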

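\[
\begin{aligned}
C(X) \le{} & \sum_{i=1}^{h} P_i^a P_i^b\, w(P_i) + n \sum_{v \in V} d_T(v, S)
 + \sum_{i=1}^{h} \frac{\bigl( 2 P_i^a P_i^b P_i^c + n P_i^b P_i^c \bigr) w(P_i)}{P_i^a + P_i^b}
 + \sum_{i=1}^{h} \frac{(P_i^a - P_i^b)\, n\, Q(P_i)}{P_i^a + P_i^b} \\
={} & n \sum_{v \in V} d_T(v, S)
 + \sum_{i=1}^{h} \frac{w(P_i)}{P_i^a + P_i^b} \bigl( (P_i^a P_i^b + P_i^b P_i^c)\, n + P_i^a P_i^b P_i^c \bigr)
 + \sum_{i=1}^{h} \frac{(P_i^a - P_i^b)\, n\, Q(P_i)}{P_i^a + P_i^b}.
\end{aligned}
\]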

The simplification in the last step uses the observation that for any i, we have P_i^a + P_i^b + P_i^c = n. By Lemma 5.12,

\[
C(X) \le C(T) \max_{1 \le i \le h} \left\{ \frac{1}{1 - \delta},\; \frac{n}{P_i^a + P_i^b} + \frac{P_i^a P_i^c}{(P_i^a + P_i^b)(P_i^a + P_i^c)} \right\}.
\]

Since P_i^c ≤ δn/2,

\[
\frac{n}{P_i^a + P_i^b} + \frac{P_i^a P_i^c}{(P_i^a + P_i^b)(P_i^a + P_i^c)}
\le \frac{n}{P_i^a + P_i^b} + \frac{P_i^c}{P_i^a + P_i^b}
= \frac{n + P_i^c}{n - P_i^c}
\le \frac{2 + \delta}{2 - \delta}
\le \frac{1}{1 - \delta}.
\]

This completes the proof.
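For readers who want the individual steps of this last chain spelled out, here is a brief sketch. It relies only on P_i^a + P_i^b + P_i^c = n, P_i^c ≤ δn/2, and 0 < δ ≤ 0.5; the annotations are ours, not the authors'.

```latex
\begin{align*}
\frac{P_i^a}{P_i^a + P_i^c} &\le 1
  && \text{(first inequality)}, \\
P_i^a + P_i^b &= n - P_i^c
  && \text{(the equality)}, \\
\frac{n + P_i^c}{n - P_i^c} &\le \frac{n + \delta n/2}{n - \delta n/2} = \frac{2 + \delta}{2 - \delta}
  && \text{(the ratio increases with } P_i^c\text{)}, \\
(2 + \delta)(1 - \delta) &= 2 - \delta - \delta^2 \le 2 - \delta
  && \text{(hence } \tfrac{2 + \delta}{2 - \delta} \le \tfrac{1}{1 - \delta}\text{)}.
\end{align*}
```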

In the following section we will show that it is possible to determine the minimum k-star of a graph in polynomial time. In fact, we have the following lemma.

Lemma 5.14. The minimum k-star of a graph G can be constructed in time O(n^{2k}).

The proof is delayed to the next section. The following theorem establishes the time-complexity of our PTAS.


Theorem 5.15. There exists a PTAS for the ∆MRCT problem, which can find a (1+ε)-approximation solution in O(n^ρ) time, where ρ = 2⌈2/ε⌉ − 2.

Proof. By Lemma 5.13, there exists a spanning tree X ∈ (⌈2/δ⌉ − 3)^∗(G) such that $C(X) \le \frac{1}{1-\delta}\, C(\widehat{T}_G)$. To find a (1+ε)-approximation solution, we set 1/δ = (1/ε) + 1, so that 1/(1 − δ) = 1 + ε, and find a minimum k-star with k = ⌈2/δ⌉ − 3 = ⌈2/ε⌉ − 1. By Lemma 5.14, the time complexity is O(n^ρ) where ρ = 2k = 2⌈2/ε⌉ − 2.
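The parameter choices in this proof can be reproduced with exact rational arithmetic. The sketch below is illustrative only; the function name `ptas_parameters` is an assumption and does not appear in the paper.

```python
from fractions import Fraction
from math import ceil


def ptas_parameters(eps):
    """Return (delta, k, rho) for accuracy eps, following the proof of Theorem 5.15.

    delta satisfies 1/delta = 1/eps + 1, k = ceil(2/delta) - 3 is the number of
    internal nodes allowed in the k-star, and rho = 2k is the exponent in O(n^rho).
    """
    eps = Fraction(eps)
    delta = eps / (1 + eps)          # equivalent to 1/delta = 1/eps + 1
    k = ceil(2 / delta) - 3
    rho = 2 * k
    return delta, k, rho
```

For ε = 1/2 this gives δ = 1/3, k = 3, and ρ = 6, in agreement with k = ⌈2/ε⌉ − 1 and ρ = 2⌈2/ε⌉ − 2.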

The result in Theorem 1.1 is immediately derived from Theorem 5.15 and Corollary 4.5.

6. Finding the best k-star. In this section we describe an algorithm for finding the minimum routing cost k-star in G for a given value of k. As mentioned before, given an accuracy parameter ε > 0, we apply this algorithm for k = ⌈2/ε − 1⌉ and return the minimum routing cost k-star as a (1 + ε)-approximate solution.

For a given k, to find the best k-star, we consider all possible subsets S of vertices of size k, and for each such choice, find the best k-star where the remaining vertices have degree one.

6.1. A polynomial-time method. First, we verify that the overall complexity of this step is polynomially bounded for any fixed k. Any k-star can be described by a triple (S, τ, L), where S = {v₁, . . . , v_k} ⊆ V is the set of k distinguished vertices which may have degree more than one, τ is a spanning tree topology on S, and L = (L₁, . . . , L_k), where L_i ⊆ V \ S is the set of vertices connected to vertex v_i ∈ S.

Let l = (l₁, . . . , l_k) be a nonnegative k-vector² such that $\sum_{i=1}^{k} l_i = n - k$. We say that a k-star (S, τ, L) has the configuration (S, τ, l) if l_i = |L_i| for all 1 ≤ i ≤ k. For a fixed k, the total number of configurations is O(n^{2k−1}) since there are $\binom{n}{k}$ choices for S, k^{k−2} possible tree topologies on k vertices, and $\binom{n-1}{k-1}$ possible such k-vectors. (To see this, observe that every such vector can be put in correspondence with picking k − 1 among n − 1 linearly ordered elements and using the cardinalities of the segments between consecutively picked elements as the components of the vector.) Note that any two k-stars with the same configuration have the same routing load on their corresponding edges. We define α(S, τ, l) to be the minimum routing cost k-star with configuration (S, τ, l).
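The correspondence used in this counting argument can be made concrete. The following Python sketch enumerates the k-vectors by choosing k − 1 "picked" positions among n − 1 ordered elements and reading off the sizes of the gaps between them; the function names are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def leaf_count_vectors(n, k):
    """Yield every nonnegative integer k-vector whose entries sum to n - k."""
    # Pick k-1 of the positions 1..n-1; the unpicked positions fall into k
    # segments whose sizes are the components of the vector.
    for picked in combinations(range(1, n), k - 1):
        cuts = (0,) + picked + (n,)
        yield tuple(cuts[i + 1] - cuts[i] - 1 for i in range(k))


def num_configurations(n, k):
    """Number of triples (S, tau, l): choices of S, tree topologies, k-vectors."""
    topologies = k ** max(k - 2, 0)   # Cayley's formula; equals 1 for k = 1, 2
    return comb(n, k) * topologies * comb(n - 1, k - 1)
```

For n = 6 and k = 3 there are 10 vectors, each summing to 3, and 20 · 3 · 10 = 600 configurations.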

Note that any vertex v in V \ S that is connected to a node s ∈ S contributes a term of w(v, s) multiplied by its routing load of n − 1. Since all these routing loads are the same, the best way of connecting the vertices in V \ S to nodes in S is obtained by finding a minimum-cost way of matching up the nodes of V \ S to those in S that obeys the degree constraints on the nodes of S imposed by the configuration, where the costs are the distances w. This problem can be solved in polynomial time for a given configuration (by a straightforward reduction to an instance of minimum-cost perfect matching). The above minimum-cost perfect matching problem, also called the assignment problem, has been well studied and several efficient algorithms can be found in [1]. For instance, by using an O(n³) algorithm for the assignment problem, the overall complexity would be O(n^{2k+2}) for finding the best k-star.
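One possible form of this reduction is sketched below in Python. It is an assumption about how the reduction might be carried out, not the authors' implementation: each center v_i is replicated l_i times as a "slot", and the leaves are assigned to slots with a standard O(n³) Hungarian-style routine. The names `hungarian` and `attach_leaves` are hypothetical, and w is assumed to be a symmetric matrix indexed by vertices 0..n−1.

```python
def hungarian(cost):
    """Min-cost assignment of rows to columns for a matrix with rows <= columns.

    Potential-based Hungarian method; returns assignment[row] = column.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                              # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    return assignment


def attach_leaves(w, centers, leaf_counts):
    """Attach every vertex outside `centers` so that centers[i] gets leaf_counts[i] leaves.

    Returns (leaf -> center map, leaf-edge contribution (n - 1) * sum of w).
    """
    n = len(w)
    leaves = [x for x in range(n) if x not in centers]
    slots = [s for s, cnt in zip(centers, leaf_counts) for _ in range(cnt)]
    assert len(slots) == len(leaves), "leaf counts must sum to n - k"
    if not leaves:
        return {}, 0
    cols = hungarian([[w[x][s] for s in slots] for x in leaves])
    attach = {x: slots[c] for x, c in zip(leaves, cols)}
    return attach, (n - 1) * sum(w[x][s] for x, s in attach.items())
```

Only the leaf edges are optimized here; the cost of the edges of τ is fixed by the configuration and would be added separately.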

6.2. A faster method. We now show how the minimum k-stars for the different configurations can be computed more efficiently by carefully ordering the matching problems for the configurations and exploiting the common structure of two consecutive problems. In particular, we show how we can obtain the optimal solution of any configuration in this order by performing a single augmentation on the optimal solution of the previous configuration. Thus, we show (Lemma 6.2) how to compute α(S, τ, l) for a given configuration in time O(nk).

² For any r ∈ Z⁺, an r-vector is an integer vector with r components.

Let W_{ab} be the set of all nonnegative a-vectors whose entries add up to a constant b. In W_{ab} × W_{ab}, we introduce the relation ∼ as l ∼ l′ if there exist 1 ≤ s, t ≤ a such that

\[
l'_i = \begin{cases}
l_i - 1 & \text{if } i = s, \\
l_i + 1 & \text{if } i = t, \\
l_i & \text{otherwise.}
\end{cases}
\]

For a pair l and l′ such as the above, we say that l′ is obtained from l by s and t.

Let $r = |W_{ab}| = \binom{a+b-1}{a-1}$. The following proposition shows that the elements of W_{ab} can be linearly ordered as l¹, . . . , l^r so that lⁱ⁺¹ ∼ lⁱ for all 1 ≤ i ≤ r − 1.

Proposition 6.1. For all positive integers a, b, there exists a permutation π^{a,b} of W_{ab} such that π^{a,b}_1 is the lexicographic minimum, π^{a,b}_r is the lexicographic maximum, and π^{a,b}_{i+1} ∼ π^{a,b}_i for all i = 1, . . . , r − 1.

Proof. By induction. The claim is clearly true when a = 1 for any b. Assume the claim is true for all b when a = m − 1. For a = m construct the ordering as follows: first, the elements for which l₁ = 0, ordered by applying π^{a−1,b} to (l₂, . . . , l_a); then the elements for which l₁ = 1, ordered according to decreasing π^{a−1,b−1}. In general each block for which l₁ = h is ordered by applying π^{a−1,b−h} to (l₂, . . . , l_a), forward or backward according to the parity of h. Note that π^{a,b}_{i+1} ∼ π^{a,b}_i within one block. Furthermore, at block boundaries the part (l₂, . . . , l_a) is either a lexicographic minimum or maximum, so that it is feasible to increase l₁ by one. Finally, it is obvious that the first and the last elements of the constructed ordering are the lexicographic minimum and maximum, respectively.
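The recursive construction in this proof translates directly into code. The sketch below is a plain Python rendering under the assumption that vectors are represented as tuples; the names `ordered_vectors` and `related` are illustrative and not from the paper.

```python
def ordered_vectors(a, b):
    """List all nonnegative integer a-vectors summing to b in the order of Proposition 6.1."""
    if a == 1:
        return [(b,)]
    order = []
    for h in range(b + 1):                 # block of vectors with first entry h
        block = ordered_vectors(a - 1, b - h)
        if h % 2 == 1:                     # odd blocks are traversed backward
            block = block[::-1]
        order.extend((h,) + tail for tail in block)
    return order


def related(l, m):
    """True if m is obtained from l by moving one unit from one entry to another."""
    changes = sorted(y - x for x, y in zip(l, m) if x != y)
    return changes == [-1, 1]
```

For a = 3 and b = 4 the ordering starts at (0, 0, 4), ends at (4, 0, 0), contains 15 vectors, and every consecutive pair satisfies the relation ∼.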

According to Proposition 6.1 we can order the elements of W_{k,(n−k)} as l¹, . . . , l^r, where $r = \binom{n-1}{k-1}$. Note that l¹ = (0, . . . , 0, n − k) and l^r = (n − k, 0, . . . , 0). In the remainder of this section, we shall prove the following lemma.

Lemma 6.2. α(S, τ, lⁱ⁺¹) can be computed from α(S, τ, lⁱ) in O(nk) time.

Proof. We shall show that α(S, τ, lⁱ⁺¹) can be found from α(S, τ, lⁱ) by means of a shortest path computation. A similar argument is used in [1, Exercise 10.20] for solving a minimum cost flow problem given the solution of another minimum cost flow problem which differs by only one unit capacity arc.

For convenience, let us rename the vertices so that S = {1, . . . , k}. Let lⁱ = (|L₁|, . . . , |L_k|) and (S, τ, L) = α(S, τ, lⁱ). Let us define an auxiliary weighted digraph D(L) = (V, A, δ) in which the arc set is A = {(u, v) | u ∈ V \ S, v ∈ S} ∪ {(u, v) | u ∈ S, v ∈ L_u} and δ(u, v) = w(u, v) if u ∉ S, and δ(u, v) = −w(u, v) if u ∈ S. For a node in S, the weight on an outgoing arc reflects the cost reduction for removing a leaf from its neighbors, and the weight on an incoming arc reflects the increase in cost for connecting a leaf to the node.

It is immediately seen that any cycle (not necessarily simple) in the graph describes a way of changing (S, τ, L) into another k-star with the same configuration, and the difference in cost between the new and the old k-stars is given by (n − 1) times the length of the cycle. Because (S, τ, L) is optimal for its configuration, there is no negative length cycle in D(L).
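The auxiliary digraph and the no-negative-cycle property can be illustrated with a short Python sketch. This is not the O(nk) procedure of Lemma 6.2; it only builds D(L) for a given attachment and uses a generic Bellman-Ford pass to test for a negative cycle. The names `build_aux_digraph` and `has_negative_cycle` are hypothetical, vertices are numbered from 0, and `attach` maps each leaf to its current center.

```python
def build_aux_digraph(w, centers, attach):
    """Return the arcs (u, v, length) of D(L) for the attachment `attach`."""
    center_set = set(centers)
    arcs = []
    for u in range(len(w)):
        if u in center_set:
            continue
        for s in centers:
            arcs.append((u, s, w[u][s]))       # cost of connecting leaf u to s
    for leaf, s in attach.items():
        arcs.append((s, leaf, -w[s][leaf]))    # saving from detaching leaf from s
    return arcs


def has_negative_cycle(n, arcs):
    """Bellman-Ford with all distances initialised to 0 (a virtual source)."""
    dist = [0] * n
    for _ in range(n):
        changed = False
        for u, v, length in arcs:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length
                changed = True
        if not changed:
            return False                       # converged: no negative cycle
    return True                                # still improving after n rounds
```

On a toy instance, an optimal attachment for its configuration yields no negative cycle, while a suboptimal attachment with the same leaf counts does, and following that cycle corresponds to re-attaching leaves at lower cost.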

Similarly, if lⁱ⁺¹ is obtained from lⁱ by s and t, then any path from s to t in D(L) changes (S, τ, L) into a k-star with configuration (S, τ, lⁱ⁺¹). Conversely, any k-star with configuration (S, τ, lⁱ⁺¹) can be obtained by a path from s to t and possibly some cycles. Since positive length cycles contribute positive cost and there is no