An optimal algorithm for the maximum-density segment problem

(1)

AN OPTIMAL ALGORITHM FOR THE

MAXIMUM-DENSITY SEGMENT PROBLEM∗

KAI-MIN CHUNG† AND HSUEH-I LU‡

Abstract. We address a fundamental problem arising from analysis of biomolecular sequences.

The input consists of two numbers wmin and wmax and a sequence S of n number pairs (ai, wi)

with wi > 0. Let segment S(i, j) of S be the consecutive subsequence of S between indices i

and j. The density of S(i, j) is d(i, j) = (ai+ ai+1+· · · + aj)/(wi+ wi+1+· · · + wj). The maximum-density segment problem is to ﬁnd a maximum-density segment over all segments S(i, j)

with wmin≤ wi+ wi+1+· · · + wj≤ wmax. The best previously known algorithm for the problem,

due to Goldwasser, Kao, and Lu [Proceedings of the Second International Workshop on Algorithms

in Bioinformatics, R. Guig´o and D. Gusﬁeld, eds., Lecture Notes in Comput. Sci. 2452, Springer-Verlag, New York, 2002, pp. 157–171], runs in O(n log(wmax−wmin+ 1)) time. In the present paper,

we solve the problem in O(n) time. Our approach bypasses the complicated right-skew decomposition, introduced by Lin, Jiang, and Chao [J. Comput. System Sci., 65 (2002), pp. 570–586]. As a result, our algorithm has the capability to process the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences. Moreover, for a type of input sequences

S representable in O(m) space, we show how to exploit the sparsity of S and solve the

maximum-density segment problem for S in O(m) time.

Key words. bioinformatics, biological sequence analysis, maximum-density segment, slope selection, computational geometry, sequence algorithm, data structure

AMS subject classiﬁcations. 68P05, 68Q25, 68R05, 68U05, 68W05, 68W40, 92-08, 92D20 DOI. 10.1137/S0097539704440430

1. Introduction. We address the following fundamental problem: The input

consists of two numbers wmin and wmax and a sequence S of number pairs (ai, wi)

with wi > 0 for i = 1, . . . , n. A segment S(i, j) is a consecutive subsequence of S from position i to position j. The width w(i, j) of S(i, j) is wi+ wi+1+· · · + wj. The density d(i, j) of S(i, j) is (ai+ ai+1+· · · + aj)/w(i, j). It is not diﬃcult to see that with an O(n)-time preprocessing to compute all O(n) preﬁx sums a1+ a2+· · · + aj

and w1+ w2+· · · + wj, the density of any segment can be computed in O(1) time.

S(i, j) is feasible if wmin≤ w(i, j) ≤ wmax. The maximum-density segment problem is

to ﬁnd a maximum-density segment over all O(n(wmax−wmin+ 1)) feasible segments.

This problem arises from the investigation of the nonuniformity of nucleotide com-position within genomic sequences, which was ﬁrst revealed through thermal melting and gradient centrifugation experiments [21, 28]. The GC content of the DNA se-quences in all organisms varies from 25% to 75%. GC-ratios have the greatest vari-ation among bacterial DNA sequences, while the typical GC-ratios of mammalian

∗_{Received by the editors February 3, 2004; accepted for publication (in revised form) July 7,}

2004; published electronically December 1, 2004. A preliminary version of this paper appeared in

Proceedings of the 11th Annual European Symposium on Algorithms (Budapest, Hungary, 2003), G.

Di Battista and U. Zwick, eds., Lecture Notes in Comput. Sci. 2832, Springer-Verlag, New York, 2003, pp. 136–147.

http://www.siam.org/journals/sicomp/34-2/44043.html

†_{Institute of Information Science, Academia Sinica. Part of this work was done while this author}

was an undergraduate student in the Department of Computer Science and Information Engineering, National Taiwan University.

‡_{Corresponding author. Institute of Information Science, Academia Sinica, 128 Academia Road,}

Section 2, Taipei 115, Taiwan, Republic of China (hil@iis.sinica.edu.tw, www.iis.sinica.edu.tw/∼hil/). This author’s research was supported in part by NSC grants 91-2215-E-001-001 and 92-2218-E-001-001.

(2)

genomes stay in the range 45–50%. Despite intensive research eﬀort in the past two decades, the underlying causes of the observed heterogeneity remain debatable [2, 3, 5, 8, 9, 10, 11, 18, 40, 42]. Researchers [32, 39] observed that the extent of the com-positional heterogeneity in a genomic sequence strongly correlates with its GC content regardless of genome size. Other investigations showed that gene length [7], gene den-sity [44], patterns of codon usage [37], distribution of diﬀerent classes of repetitive elements [7, 38], number of isochores [2], lengths of isochores [32], and recombina-tion rate within chromosomes [12] are all correlated with GC content. More research related to GC-rich segments can be found in [16, 17, 20, 23, 29, 31, 35, 41, 43] and ref-erences therein.

In the most basic form of the maximum-density segment problem, the sequence S corresponds to the given DNA sequence, where ai= 1 if the corresponding nucleotide in the DNA sequence is a G or C, and ai= 0 otherwise. In the work of Huang [19], sequence entries took on values of p and 1− p for some real number 0 ≤ p ≤ 1. More generally, we can look for regions where a given set of patterns occurs very often. In such applications, ai could be the relative frequency with which the corresponding DNA segments appear in the given patterns. Further natural applications of this problem can be designed for sophisticated sequence analysis such as mismatch den-sity [36], ungapped local alignments [1], annotated multiple sequence alignments [39], promoter mapping [22], and promoter recognition [33].

For the uniform case, i.e., wi = 1 for all indices i, Nekrutenko and Li [32] and Rice, Longden, and Bleasby [34] employed algorithms for the case wmin= wmax, which

is trivially solvable in O(n) time. More generally, when wmin= wmax, the problem is

also easily solvable in O(n(wmax− wmin+ 1)) time, linear in the number of feasible

segments. Huang [19] studied the case where wmax = n, i.e., there is eﬀectively no

upper bound on the width of the desired maximum-density segments. He observed that an optimal segment exists with width at most 2wmin− 1. Therefore, this case is

equivalent to the case with wmax = 2wmin− 1 and can be solved in O(nwmin) time

in a straightforward manner. Lin, Jiang, and Chao [27] gave an O(n log wmin)-time

algorithm for this case based on right-skew decompositions of a sequence. (See [26] for related software.) The case with general wmax was ﬁrst investigated by Goldwasser,

Kao, and Lu [13, 14], who gave an O(n)-time algorithm for the uniform case. Recently, Kim [25] showed an alternative algorithm based upon a geometric interpretation of the problem, which basically relates the maximum-density segment problem to the fundamental slope selection problem in computational geometry [4, 6, 24, 30]. Unfor-tunately, Kim’s analysis of time complexity has a flaw which seems difficult to fix.1 For the general (i.e., nonuniform) case, Goldwasser, Kao, and Lu [13] also gave an O(n log(wmax− wmin+ 1))-time algorithm. By bypassing the complicated

preprocess-ing step required in [13], we successfully reduce the time required for the general case down to O(n). Our result is based upon the following set of equations, stating that the order of d(x, y), d(y + 1, z), and d(x, z) with x≤ y < z can be determined by the

1_{Kim claims that all the progressive updates of the lower convex hulls L}

j∪ Rjcan be done in

overall linear time. The paper only sketches how to obtain Lj+1∪ Rj+1 from Lj∪ Rj. (See the

fourth-to-last paragraph of p. 340 in [25].) Unfortunately, Kim seems to overlook the marginal cases when the upper bound wmaxforces the pzof Lj∪ Rjto be deleted from Lj+1∪ Rj+1. As a result,

obtaining Lj+1∪ Rj+1from Lj∪ Rjcould be much more complicated than Kim’s sketch. A naive

implementation of Kim’s algorithm still takes Ω(n(wmax− wmin+ 1)) time in the worst case. We

believe that any correct implementation of Kim’s algorithm requires Ω(n log(wmax− wmin+ 1)) time

(3)

y ≥ ≥ y + 1 x y ≤ ≤ x z z y + 1

Fig. 1.1_. _{An illustration for (1.1): There exist only two possibilities for the order among}

d(x, y), d(x, z), d(y + 1, z).

order of any two of them:

d(x, y)≤ d(y + 1, z) ⇔ d(x, y) ≤ d(x, z) ⇔ d(x, z) ≤ d(y + 1, z), d(x, y)≥ d(y + 1, z) ⇔ d(x, y) ≥ d(x, z) ⇔ d(x, z) ≥ d(y + 1, z). (1.1)

(Both equations can be easily veriﬁed by observing the existence of some number ρ with 0 < ρ < 1 and d(x, z) = ρ· d(x, y) + (1 − ρ) · d(y + 1, z). See Figure 1.1.) Our algorithm is capable of processing the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences.

For bioinformatics applications, e.g., in [1, 22, 33, 36, 39], the input sequence S is usually very sparse. That is, S can be represented by m = o(n) triples

(a₁, w₁, n1), (a2, w2, n2), . . . , (am, wm , nm)

with 0 = n0< n1< n2<· · · < nm= n to signify that (ai, wi) = (aj, wj) holds for all indices i and j with nj₋₁< i≤ nj and 1≤ j ≤ m. If wj = 1 holds for all 1≤ j ≤ m, we show how to exploit the sparsity of S and solve the maximum-density problem for S given in the above compact representation in O(m) time.

The remainder of the paper is organized as follows. Section 2 shows the main algorithm. Section 3 explains how to cope with the simple case that the width upper bound wmax is ineﬀective. Section 4 takes care of the more complicated case that

wmaxis eﬀective. Section 5 explains how to exploit the sparsity of the input sequence

for the uniform case.

2. The main algorithm. For any integers x and y, let [x, y] denote the set {x, x + 1, . . . , y}. Without loss of generality, we may assume that w1+w2+· · ·+wn≥

wmin and wi ≤ wmax holds for each i = 1, 2, . . . , n. Throughout the paper, we need

the following deﬁnitions and notation with respect to the input length-n sequence S and width bounds wmin and wmax. Deﬁne φ(x, y) to be the largest index z ∈ [x, y]

that minimizes d(x, z). That is, S(x, φ(x, y)) is the longest minimum-density preﬁx of S(x, y). Let j0 be the smallest index with w(1, j0) ≥ wmin. Let J = [j0, n]. For

each j ∈ J, let j be the smallest index i with w(i, j) ≤ wmax, and let rj be the largest index i with w(i, j)≥ wmin. That is, S(i, j) is feasible if and only if i∈ [j, rj].

(Figure 2.1 is an illustration for the deﬁnitions of j and rj.) Clearly, for the uniform case, we have i+1= i+ 1 and ri+1= ri+ 1. As for the general case, we know only that j and rj are both (not necessarily strictly) increasing. One can easily compute all j and rj in O(n) time.

Let i∗_j be the largest index k ∈ [j, rj] with d(k, j) = max{d(i, j) | i ∈ [j, rj]}. There must be an index j∗ such that S(i∗_j∗, j∗) is a maximum-density segment of

(4)

j n rj

1 j

≥ wmin

≤ wmax

Fig. 2.1. An illustration for the deﬁnitions of _jand r_j.

Algorithm main. 1 let ij0−1= 1; 2 for j = j0to n do{ 3 let ij= best(max(ij−1, j), rj, j); 4 output (ij, j); 5 } Function best_{(, r, j).} 1 let i = ;

2 while i < r and d(i, φ(i, r− 1)) ≤ d(i, j) do

3 let i = φ(i, r− 1) + 1;

4 return i;

Fig. 2.2_{. Our main algorithm.}

S. Therefore, a natural but seemingly difficult possibility for solving the maximum-density segment problem would be to compute i∗_j for all indices j∈ J in O(n) time. Instead, our approach is to compute a pair (ij, j) of indices with ij∈ [j, rj] for each index j∈ J by the algorithm shown in Figure 2.2. More specifically, each iteration of our algorithm, based upon (1.1), keeps chopping off the lowest-density prefix without affecting the feasibility of the remaining segment until its density does not go any higher. For brevity of presentation, each algorithm throughout the paper outputs a linear number of index pairs (i, j). (See, e.g., step 4 of algorithm main shown in Figure 2.2.) Clearly, it takes a linear-time postprocessing for the produced index pairs to obtain one that maximizes d(i, j). The rest of the section ensures the correctness of our algorithm by showing ij∗ = i∗j∗, and thus reduces the maximum-density segment problem to implementing our algorithm to run in O(n) time.

Lemma 2.1. The index returned by function call best(, r, j) is the largest index i∈ [, r] that maximizes d(i, j).

Proof. Let i∗ be the largest index in [, r] that maximizes d(i, j), i.e., d(i∗, j) = maxi∈[,r]d(i, j). Let ij be the index returned by function call best(, r, j). We show ij = i∗as follows. If ij < i∗, then ij < r. By the condition of the while-loop at step 2 of best, we know d(ij, φ(ij, r− 1)) > d(ij, j). By d(ij, j)≤ d(i∗, j) and (1.1), we have d(ij, i∗− 1) ≤ d(ij, j). It follows that d(ij, i∗− 1) < d(ij, φ(ij, r− 1)), contradicting the deﬁnition of φ(ij, r− 1).

On the other hand, suppose that ij > i∗. By definition of best, there must be an index i∈ [, r] with i < r, d(i, φ(i, r − 1)) ≤ d(i, j), and i ≤ i∗ < φ(i, r− 1) + 1. If i = i∗, by (1.1) we have d(i∗, φ(i∗, r− 1)) ≤ d(i∗, j)≤ d(φ(i∗, r− 1) + 1, j), where the last inequality contradicts the definition of i∗. Now that i < i∗, we have d(i∗, j) ≥ d(i, j)≥ d(i, i∗−1) ≥ d(i, φ(i, r−1)) ≥ d(i∗, φ(i, r−1)), where (a) the first inequality is

(5)

by definition of i∗, (b) the second inequality is by (1.1) and the first inequality, (c) the third inequality is by i∗≤ φ(i, r − 1) and the definition of φ(i, r − 1), and (d) the last inequality is by (1.1) and the third inequality. It follows from d(i∗, j)≥ d(i∗, φ(i, r−1)) and (1.1) that d(φ(i, r− 1) + 1, j) ≥ d(i∗, j), contradicting the definition of i∗ by i∗< φ(i, r− 1) + 1.

Theorem 2.2. _{Algorithm main correctly solves the maximum-density segment} problem.

Proof. We prove the theorem by showing ij∗ = i∗_j∗. By j0 = ij0−1 = 1 and Lemma 2.1, the equality holds if j∗= j0. The rest of the proof assumes j∗> j0. By

Lemma 2.1 and j∗ ≤ i∗j∗, it suﬃces to ensure ij∗−1≤ i∗j∗. Assume for contradiction that there is an index j∈ [j0, j∗−1] with ij−1 ≤ i∗j∗ < ij. By j < j∗, we know j ≤ i∗j∗.

By Lemma 2.1 and max(j, ij−1) ≤ i∗j∗ < ij ≤ rj, we have d(ij, j) ≥ d(i∗j∗, j). It

follows from (1.1) and i∗_j∗ < ij that d(i∗j∗, j)≥ d(ij∗∗, ij− 1). By j∗ ≤ i∗j∗ < ij ≤ rj∗ and the deﬁnition of j∗, we know d(i∗_j∗, j∗) > d(ij, j∗). It follows from i∗j∗ < ij and (1.1) that d(i∗_j∗, ij− 1) > d(i∗j∗, j∗). Therefore, d(ij, j) ≥ d(i∗j∗, j)≥ d(i∗j∗, ij− 1) > d(i∗_j∗, j∗), contradicting the deﬁnition of j∗.

One can verify that the value of i increases by at least 1 each time step 3 of best is executed. Therefore, to implement the algorithm to run in O(n) time, it suﬃces to maintain a data structure to support an O(1)-time query for each φ(i, rj − 1) in step 2 of best.

3. Coping with ineﬀective width upper bound. When wmax is ineﬀective,

i.e., wmax ≥ w(1, n), we have j = 1 for all j ∈ J. Therefore, the function call in

step 3 of main is exactly best(ij₋₁, rj, j). Moreover, during the execution of the function call best(ij₋₁, rj, j), the value of i can only be ij₋₁, φ(ij₋₁, rj − 1) + 1, φ(φ(ij₋₁, rj− 1) + 1, rj− 1) + 1, . . . , rj. Suppose that a subroutine call to update(j) yields an array Φ of indices and two indices p and q of Φ with p≤ q and Φ[p] = ij₋₁ such that the following condition holds.

Condition Cj : Φ[q] = rj and Φ[t] = φ(Φ[t− 1], rj− 1) + 1 holds for each index t∈ [p + 1, q]. (See Figure 3.1 for an illustration.)

Φ[p + 1] Φ[p + 2]

rj

φ(Φ[p], rj− 1) φ(Φ[p + 1], rj− 1) φ(Φ[q− 1], rj− 1)

Φ[p] Φ[q]

ij−1

Fig. 3.1_{. An illustration for Condition C}_j_.

Then, the subroutine call to best(ij₋₁, rj, j) can clearly be replaced by lbest(j), as deﬁned in Figure 3.2. That is, lbest(j) can access the value of each φ(i, rj− 1) by looking up Φ in O(1) time. It remains to show how to implement update(j) such that all O(n) subroutine calls to update from step 3 of lmain run in overall O(n) time. With the initialization of letting p = q = rj0−1 = Φ[1] = 1, as described at step 1 of algorithm lmain, Condition C_j0−1clearly holds before the subroutine call to update(j₀). The following lemma is crucial in ensuring the correctness and eﬃciency of our implementation shown in Figure 3.2.

Lemma 3.1. For each index j ∈ J, the following statements hold:

1. If Condition Cj−1 holds right before calling update(j), then Condition Cj holds right after the subroutine call.

(6)

Algorithm lmain_. 1 let p = q = rj0−1= Φ[1] = 1; 2 for j = j0to n do{ 3 call update(j); 4 let ij= lbest(j); 5 output (ij, j); 6 } Function lbest(j). 1 while p < q and d(Φ[p], Φ[p + 1]− 1) ≤ d(Φ[p], j) do 2 let p = p + 1; 3 return Φ[p]; Subroutine update(j). 1 for r = rj−1+ 1 to rjdo{ 2 while p < q and d(Φ[q− 1], Φ[q] − 1) ≥ d(Φ[q − 1], r − 1) do 3 let q = q− 1; 4 let q = q + 1; 5 let Φ[q] = r; 6 }

Fig. 3.2_{. An eﬃcient implementation for the case that w}_max _{is ineﬀective.}

2. If Condition Cj holds right before calling lbest(j), then the index returned by the function call is exactly that returned by best(Φ[p], Φ[q], j).

Proof. Statement 1. We need the following condition.

Condition Dr : Φ[q] = r and Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each index t∈ [p + 1, q].

Clearly, Condition Cj is exactly Condition Drj. To prove the statement, we show

that if Condition Dr−1holds, then Condition Drholds right after executing steps 2–5 of update (i.e., an iteration of the for-loop of update), although the value of q may change.

Consider the moment when step 4 of update is about to be executed. We ﬁrst show that if p < q, then Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each t ∈ [p + 1, q]. By the deﬁnition of φ, we have

φ(, r− 1) =

r− 1 if d(, φ(, r− 2)) ≥ d(, r − 1), φ(, r− 2) otherwise.

(3.1)

By Condition Dr₋₁, we know φ(Φ[q− 1], r − 2) = Φ[q] − 1. With = Φ[q − 1] plugged into (3.1), we know that if d(Φ[q− 1], Φ[q] − 1) < d(Φ[q − 1], r − 1), then φ(Φ[q−1], r−1) = φ(Φ[q −1], r−2). It follows from the condition of step 2 of update that φ(Φ[q− 1], r − 1) = φ(Φ[q − 1], r − 2). Furthermore, if φ(, r − 2) < r − 2, then one can prove as follows that φ(φ(, r− 2) + 1, r − 1) = φ(φ(, r − 2) + 1, r − 2) implies φ(, r− 1) = φ(, r − 2).

Let m = φ(, r− 2). By φ(m + 1, r − 1) = φ(m + 1, r − 2), we have d(m+1, φ(m+1, r−1)) < d(m+1, r−1). By deﬁnition of φ and (1.1), we have d(, m) < d(, φ(m + 1, r− 1)) < d(m + 1, φ(m + 1, r − 1)). As a result, we have d(, m) < d(m + 1, r− 1), which by (1.1) implies d(, m) < d(, r− 1). Thus φ(, r − 1) = φ(, r − 2).

(7)

Therefore, φ(Φ[t], r−1) = φ(Φ[t], r−2) implies φ(Φ[t−1], r−1) = φ(Φ[t−1], r−2). As a result, φ(Φ[t], r−1) = φ(Φ[t], r −2) holds for each t ∈ [p, q −1]. By Condition Dr₋₁, we know Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each t ∈ [p + 1, q].

Since the value of q will be incremented by 1 after executing step 4 of update and Φ[q+1] will equal r after executing step 5 of update, it remains to show φ(Φ[q], r−1) = r− 1. Clearly, the equality holds if Φ[q] = r − 1, i.e., step 3 was not executed. If step 3 was executed at least once, then we know d(Φ[q], Φ[q + 1]− 1) ≥ d(Φ[q], r − 1), i.e., d(Φ[q], φ(Φ[q], r− 2)) ≥ d(Φ[q], r − 1). By plugging = Φ[q] into (3.1), we know φ(Φ[q], r− 1) = r − 1.

Statement 2. By Condition Cj, one can easily verify that lbest(j) is a faithful implementation of best(Φ[p], Φ[q], j). Therefore, the statement holds.

Lemma 3.2. The implementation lmain solves the maximum-density problem for the case with ineﬀective wmax in O(n) time.

Proof. By Lemma 3.1(1) and the deﬁnitions of update and lbest, both Con-dition Cjand Φ[p] = ij−1hold right after the subroutine call to update(j). By Condi-tion Cjand Lemma 3.1(2), lbest(j) is a faithful implementation of best(Φ[p], Φ[q], j). Therefore, the correctness of lmain follows from Φ[p] = ij₋₁, Φ[q] = rj, and Theo-rem 2.2.

As for the eﬃciency of lmain, observe that q− p ≥ 0 holds throughout the execution of lmain. Note that each iteration of the while-loops of lbest and update decreases the value of q− p by one. Since step 4 of update, which is the only place that increases the value of q− p, increases the value of q − p by one for O(n) times, the overall running time of lmain is O(n).

4. Coping with eﬀective width upper bound. In contrast to the previous

simple case, when wmax is arbitrary, j may not always be 1. Therefore, the first argument of the function call in step 3 of main could be j with j > ij₋₁. It seems quite difficult to update the corresponding data structure Φ in overall linear time such that both Φ[p] = max(ij−1, j) and Condition Cj hold throughout the execution of our algorithm. To overcome this difficulty, our algorithm sticks with Condition Cj but allows Φ[p] > max(ij−1, j). As a result, maxj∈Jd(ij, j) may be less than maxj∈Jd(i∗j, j). Fortunately, this potential problem can be resolved if we simultaneously solve a series of variant versions of the maximum-density segment problem.

≥ wmin

≤ wmax

1 = y0 r y0 y1 n

Fig. 4.1_{. An illustration for the relation among , r, y}₀_{, y}₁_.

4.1. A variant version of the maximum-density segment problem.

Sup-pose that we are given two indices r and y0 with w(r, y0) ≥ wmin. Let X = [, r]

and Y = [y0, y1] be two intervals such that = y0 and y1 is the largest index in J

with w(r, y1)≤ wmax. See Figure 4.1 for an illustration. The variant version of the

(8)

Algorithm vmain_{(r, y}₀_).

1 let be the smallest index in [1, n] with w(, y0)≤ wmax;

2 let y1 be the largest index in [1, n] with w(r, y1)≤ wmax;

3 let xy0−1= ;

4 for y = y0to y1 do{

5 let xy= best(max(xy−1, y), r, y);

6 output (xy, y);

7 }

Fig. 4.2_{. Our algorithm for the variant version of the maximum-density segment problem,}

where function best is as deﬁned in Figure 2.2.

all feasible segments S(x, y) with x ∈ X, y ∈ Y , and wmin ≤ w(x, y) ≤ wmax such

that d(x, y) is maximized.

For each y∈ Y , let x∗_y be the largest index x∈ X with wmin≤ w(x, y) ≤ wmax

that maximizes d(x, y). Let y∗ be an index in Y with d(x∗_y∗, y∗) = maxy_∈Y d(x∗y, y). Although solving the variant version can naturally be reduced to computing the index x∗_y for each index y∈ Y , the required running time is more than what we can aﬀord. Instead, we compute an index xy ∈ X with wmin ≤ w(xy, y) ≤ wmax for each index

y∈ Y such that xy∗ = x∗_y∗. By w(r, y0)≥ wminand w(r, y1)≤ wmax, one can easily see

that, for each y∈ Y , r is always the largest index x ∈ X with wmin≤ w(x, y) ≤ wmax.

Our algorithm for solving the variant problem is as shown in Figure 4.2, presented in a way to emphasize the analogy between vmain and main. For example, the index xy in vmain is the counterpart of the index ij in main. Also, the index r in vmain plays the role of the index rj in main. We have the following lemma whose proof is very similar to that of Theorem 2.2.

Lemma 4.1. _{Algorithm vmain solves the variant version of the maximum-density} problem correctly.

Proof. We prove the theorem by showing xy∗ = x∗_y∗. By y0 = xy0−1 = and Lemma 2.1, the equality holds if y∗= y0. The rest of the proof assumes y∗> y0. By

Lemma 2.1 and y∗ ≤ x∗y∗, it suﬃces to ensure xy∗₋₁≤ x∗y∗. Assume for contradiction that there is an index y ∈ [y0, y∗− 1] with xy₋₁ ≤ x∗y∗ < xy. By y < y∗, we know y ≤ x∗_y∗. By Lemma 2.1 and max(y, xy−1) ≤ x∗_y∗ < xy ≤ r, we have d(xy, y) ≥

d(x∗_y∗, y). It follows from (1.1) and x∗y∗ < xy that d(x∗y∗, y) ≥ d(x∗y∗, xy− 1). By y∗ ≤ x∗_y∗ < xy ≤ r and the deﬁnition of y∗, we know d(x∗y∗, y∗) > d(xy, y∗). It

follows from x∗y∗ < xyand (1.1) that d(x∗y∗, xy−1) > d(x∗y∗, y∗). Therefore, d(xy, y)≥

d(x∗y∗, y)≥ d(x∗y∗, xy− 1) > d(x∗y∗, y∗), contradicting the deﬁnition of y∗.

Again, the challenge lies in supporting each query to φ(i, r− 1) of best in O(1) time during the execution of vmain. Fortunately, unlike during the execution of main, where both parameters of φ(i, r− 1) may change, the second parameter r − 1 is now fixed. Therefore, to support each query to φ(i, r− 1) in O(1) time, we can actually afford O(r− + 1) time to compute a data structure Ψ such that Ψ[i] = φ(i, r − 1) for each i ∈ [, r − 1]. As a result, the function best can be implemented as the function vbest shown in Figure 4.3. The following lemma ensures the correctness and efficiency of our implementation variant shown in Figure 4.3.

Lemma 4.2. The implementation variant correctly solves the variant version of the maximum-density segment problem in O(r− + y1− y0+ 1) time.

(9)

Algorithm variant_{(r, y}₀_).

1 let be the smallest index in [1, n] with w(, y0)≤ wmax;

2 let y1 be the largest index in [1, n] with w(r, y1)≤ wmax;

3 call init(, r− 1);

4 let xy0−1= ;

5 for y = y0to y1 do{

6 let xy= vbest(max(xy−1, y), r, y);

7 output (xy, y);

8 }

Function vbest(, r, y).

1 let x = ; 2 while x < r and d(x, Ψ[x])≤ d(x, y) do 3 let x = Ψ[x] + 1; 4 return x; Subroutine init_{(, r)} 1 let Ψ[r] = r; 2 for s = r− 1 downto do { 3 let t = s; 4 while t < r and d(s, t)≥ d(s, Ψ[t + 1]) do 5 let t = Ψ[t + 1]; 6 let Ψ[s] = t; 7 }

Fig. 4.3_{. An eﬃcient implementation for the algorithm vmain.}

[, r− 1], then vbest is a faithful implementation of best. Therefore, by Lemma 4.1, the correctness of variant can be ensured by showing that after calling init(, r−1) at step 3 of algorithm variant , Ψ[i] = φ(i, r−1) holds for each index i ∈ [, r−1]. Note that for brevity of the following proof, we slightly abuse the notation r in Figure 4.3. That is, although the subroutine call at step 3 of algorithm variant is init(, r− 1), the second parameter in the deﬁnition of subroutine init in Figure 4.3 becomes r. Let us make it clear that the rest of the proof (i) lets r denote the one in the deﬁnition of subroutine init(, r), and (ii) proves that Ψ[i] = φ(i, r) holds for each index i∈ [, r] at the end of the subroutine call to subroutine init(, r).

By step 1 of init, we have Ψ[r] = r = φ(r, r). Now suppose that Ψ[i] = φ(i, r) holds for each index i∈ [x + 1, r] right before init is about to execute the iteration for index x∈ [, r]. It suﬃces to show Ψ[x] = φ(x, r) after the iteration. Let

Zx={x, Ψ[x + 1], Ψ[Ψ[x + 1] + 1], Ψ[Ψ[Ψ[x + 1] + 1] + 1], . . . , r}. Let|Zx| denote the cardinality of Zx. We ﬁrst show φ(x, r)∈ Zxas follows.

Assume for contradiction that φ(x, r) ∈ Zx, i.e., there is an index z ∈ Zx with z < φ(x, r) < Ψ[z + 1] = φ(z + 1, r). By deﬁnition of φ and (1.1), we have d(z + 1, φ(x, r)) > d(z + 1, φ(z + 1, r)) > d(φ(x, r) + 1, φ(z + 1, r)) and d(φ(x, r) + 1, φ(z + 1, r))≥ d(x, φ(z + 1, r)) ≥ d(x, φ(x, r)). By d(z + 1, φ(x, r)) > d(x, φ(x, r)) and (1.1), we have d(x, φ(x, r)) > d(x, z), contradicting the deﬁnition of φ(x, r).

(10)

Algorithm general_. 1 let p = q = rj0−1= Φ[1] = 1; 2 for j = j0to n do{ 3 call update(j); 4 while Φ[p] < jdo 5 let p = p + 1; 6 if ij−1< Φ[p] then 7 call variant(Φ[p], j); 8 let ij= lbest(j); 9 output (ij, j); 10 }

Fig. 4.4_{. Our algorithm for the general case, where update and lbest are deﬁned in Figure 3.2}

and variant is deﬁned in Figure 4.3.

φ(x, r). By φ(z + 1, r) ≤ φ(x, r) ≤ r and the deﬁnition of φ(z + 1, r), we have d(z + 1, φ(x, r)) ≥ d(z + 1, φ(z + 1, r)). By deﬁnition of φ(x, r) and (1.1), we have d(x, z)≥ d(x, φ(x, r)) ≥ d(z + 1, φ(x, r)). By d(x, z) ≥ d(z + 1, φ(z + 1, r)) and (1.1), we have d(x, z)≥ d(x, φ(z + 1, r)). Therefore, if z < φ(x, r), then step 5 of init will be executed to increase the value of z. Observe that φ(x, r) = z < r and Ψ[z + 1] > z imply d(x, z) < d(x, Φ[z + 1]). It follows that as soon as z = φ(x, r) holds, whether φ(x, r) = r or not, the value of Ψ[x] will immediately be set to z at step 6 of init.

As for the time complexity, we ﬁrst observe that and y1 can be found from r

and y0 in O(r− + y1− y0+ 1) time:

• Let = r, and then repeatedly decrease by 1 as long as w( − 1, y0)≤ wmax

holds.

• Let y1= y0, and then repeatedly increase y1 by 1 as long as w(r, y1+ 1)≤

wmaxholds.

Secondly, one can see that the rest of the implementation also runs in O(r− + y1−

y0+ 1) time by verifying that throughout the execution of the implementation (a) the

while-loop of vbest runs for O(r−+y1−y0+ 1) iterations, and (b) the while-loop of

initruns for O(r− + 1) iterations. To see statement (a), just observe that the value of index x (i) never decreases, (ii) stays in [, r], and (iii) increases by at least one each time step 3 of vbest is executed. As for statement (b), consider the iteration with index s of the for-loop of init. Note that if step 6 of init executes ts times in this iteration, then|Zs| = |Zs+1| − ts+ 1. Since|Zs| ≥ 1 holds for each s ∈ X, we have_s_∈Xts= O(r− + 1), and thus statement (b) holds.

4.2. Our algorithm for the general case. With the help of variant, we have

a linear-time algorithm for solving the original maximum-density segment problem as shown in Figure 4.4. Algorithm general is obtained by inserting four lines of code (i.e., steps 4–7 of general) between steps 3 and 4 of lmain in order to handle the case ij−1< j. Speciﬁcally, when ij−1< j, we cannot aﬀord to appropriately update the data structure Φ. Therefore, instead of moving i to j, steps 4 and 5 move i to Φ[p], where p is the smallest index with j ≤ Φ[p]. Of course, these two steps may cause our algorithm to overlook the possibility of ij ∈ [ij−1, Φ[p]− 1], as illustrated in Figure 4.5. This is when the variant version comes in: As shown in the next theorem, we can remedy the problem by calling variant(Φ[p], j).

(11)

ij−1 j ij Φ[p] rj

Fig. 4.5_{. An illustration for the situation when Steps 6 and 7 of general are needed.}

k j j rj k k j rk 1 n

Fig. 4.6_. _{An illustration showing that the overall running time of all subroutine calls to} variant₍

j, j) in general is O(n).

Theorem 4.3. _{Algorithm general solves the maximum-density segment} prob-lem in an online manner in linear time.

Proof. We prove the correctness of general by showing that i∗_j∗ = ij∗ implies i∗_j∗ = xj∗. By Lemma 3.1(1), Condition Cj holds after the subroutine call update(j) at step 3 of general. Observe that steps 4 and 5 of general, which may increase the value of p, do not aﬀect the validity of Condition Cj. Also, steps 6 and 7 do not modify p, q, and Φ. Let _j be the value of Φ[p] right before executing step 8 of general_{. By Lemma 3.1(2), the index i}_j_{returned by lbest(j) is the largest index in} [_j, rj] that maximizes d(ij, j). Clearly, ij∗ = i∗j∗ implies the correctness of general.

If ij∗ = i∗j∗, there must be an index j∈ [j0, j∗] with ij−1≤ i∗j∗ < ij. It can be proved as follows that i∗_j∗ ≤ j− 1.

Assume _j≤ i∗_j∗ for contradiction. It follows from Lemma 3.1(2) and

(1.1) that d(ij, j)≥ d(i∗_j∗, j)≥ d(i∗j∗, ij− 1). By the deﬁnition of j∗, we have d(i∗_j∗, j∗) > d(ij, j∗), which by (1.1) implies d(i∗j∗, ij− 1) > d(i∗j∗, j∗). Therefore, d(ij, j) > d(i∗j∗, j∗), contradicting the deﬁnition

of j∗.

Since ij−1≤ i∗j∗ ≤ j− 1, we know w(j− 1, j∗)≤ w(i∗j∗, j∗)≤ wmax. Thus, S(i∗j∗, j∗) is a feasible segment in the variant version of the maximum-density segment problem for S with respect to indices r = _j and y0= j. By Lemma 4.2, the subroutine call

variant₍

j, j) at step 7 of algorithm general has to output an index pair (x, y) with wmin≤ w(x, y) ≤ wmaxand d(x, y) = d(i∗j∗, j∗).

As for the running time, observe that q− p ≥ 0 holds throughout the execution of general. Step 4 of update, which is the only place that increases the value of q− p, increases the value of q − p by 1 for O(n) times. Note that each iteration of the while-loops of general, lbest, and update decreases the value of q− p by 1. Therefore, to show that the overall running time of general is O(n), it remains to ensure that all those subroutine calls to variant at step 7 of general take overall O(n) time. Suppose that j and k are two arbitrary indices with k < j such that general makes subroutine calls to variant(

k, k) and variant(j, j). Let rk be the largest index in [1, n] with w(_k, r_k)≤ wmax. By Lemma 4.2, it suﬃces to show

_k < j and rk < j as follows. (See Figure 4.6.) By the deﬁnition of general, we know that ij−1 < j, which is ensured by the situation illustrated in Figure 4.5. By k < j, we have _k ≤ ij₋₁, implying _k < j. Moreover, by the deﬁnitions of j and

(12)

r_k, one can easily verify that _k < j implies r_k< j.

It is not difficult to see that our algorithm shown in Figure 4.4 is already capable of processing the input sequence in an online manner, since the only preprocessing required is to obtain j, rj, and the prefix sums of a1, a2, . . . , aj and w1, w2, . . . , wj (for the purpose of evaluating the density of any segment in O(1) time), which can easily be computed on the fly.

5. Exploiting sparsity for the uniform case. In this section, we assume that

the input sequence S = (a1, w1), (a2, w2), . . . , (an, wn) is run-length encoded [15], i.e.,

represented by m pairs (a₁, n1), (a2, n2), . . . , (am, nm) with 0 = n0< n1< n2<· · · <

nm = n to signify that w1 = w2 = · · · = wn = 1 and that ai = aj holds for all indices i and j with nj₋₁ < i ≤ nj and 1 ≤ j ≤ m. Our algorithm for solving the maximum-density problem for the O(m)-space representable sequence S is shown in Figure 5.1.

Algorithm sparse_.

1 for k = 1 to m do

2 let nk= nk− nk−1;

3 let S be the length-m sequence (n1a1, n1), (n2a2, n2), . . . , (nmam, nm);

4 let (i, j) be an optimal output of general(wmin, wmax, S);

5 output (ni−1+ 1, nj);

6 for k = 1 to m do{

7 if nk≥ wmin then

8 output (nk, nk) and (rnk, nk);

9 if nk−1+ wmin≤ n then

10 output (nk−1+ 1, nk−1+ wmin) and (nk−1+ 1, min(n, nk−1+ wmax));

11 }

Fig. 5.1. Our algorithm that handles sparse input sequence for the uniform case, where general

is deﬁned in Figure 4.4.

Theorem 5.1. Algorithm sparse solves the maximum-density problem for the above O(m)-space representable sequence in O(m) time.

Proof. By Theorem 4.3, sparse runs in O(m) time. Let S(i∗, j∗) be a feasible segment with maximum density. We ﬁrst show that without loss of generality i∗−1 ∈ {n0, n1, . . . , nm−1} or j∗ ∈ {n1, n2, . . . , nm} holds. More speciﬁcally, we show as

follows that if i∗ − 1 ∈ {n0, n1, . . . , nm−1} and j∗ ∈ {n1, n2, . . . , nm}, then S(i∗+

1, j∗+ 1) is also a feasible segment with maximum density.

By i∗− 1 ∈ {n0, n1, . . . , nm−1}, we know ai∗−1 = ai∗. By j∗ ∈ {n1, n2, . . . , nm}, we know aj∗ = aj∗+1. It follows from the optimality

of S(i∗, j∗) that ai∗ ≥ aj∗+1and ai∗−1≤ aj∗, implying ai∗−1= ai∗ = aj∗ = aj∗+1. Therefore, S(i∗+ 1, j∗+ 1) is also a maximum-density

segment.

It remains to show that our algorithm works correctly for each possible case.

• Case 1: i∗_{− 1 ∈ {n}₀_{, n}₁_{, . . . , nm}_{−1} and j}∗ _{∈ {n}₁_{, n}₂_{, . . . , nm}. Clearly,} steps 1–5 of sparse take care of this case.

• Case 2: i∗_{− 1 ∈ {n}

0, n1, . . . , nm−1} and j∗ ∈ {n1, n2, . . . , nm}. Clearly, if

i∗∈ {j∗, rj∗}, S(i∗, j∗) can be discovered by steps 7 and 8 of sparse. Since i∗− 1 ∈ {n0, n1, . . . , nm−1}, we have ai∗−1 = ai∗. If ai∗−1 = ai∗ = d(i∗, j∗), then by (1.1) we have either d(i∗−1, j∗) > d(i∗, j∗) or d(i∗+1, j∗) > d(i∗, j∗),

(13)

which implies that either S(i∗− 1, j∗) or S(i∗+ 1, j∗) is infeasible, and thus i∗∈ {j∗, rj∗}. On the other hand, if ai∗₋₁ = ai∗ = d(i∗, j∗) and i∗= ∗_j, then

S(i∗−1, j∗) is also a feasible segment with maximum density. We can continue the same argument until we have a maximum-density segment S(i, j∗) such that either i− 1 ∈ {n0, n1, . . . , nm₋₁}, which is handled in Case 1, or i = j∗,

which is handled by steps 7 and 8 of sparse.

• Case 3: i∗_{− 1 ∈ {n}₀_{, n}₁_{, . . . , nm}_{−1} and j}∗_{∈ {n}₁_{, n}₂_{, . . . , nm}. The proof of} this case, omitted for brevity, is very similar to that of Case 2.

The theorem is proved.

Acknowledgments. We thank Yi-Hsuan Hsin and Hsu-Cheng Tsai for

discus-sions during the preliminary stage of this research. We thank Tien-Ching Lin for comments.

REFERENCES

[1] N. N. Alexandrov and V. V. Solovyev, Statistical signiﬁcance of ungapped sequence

align-ments, in Proceedings of the Paciﬁc Symposium on Biocomputing, Maui, HI, 1998, World

Scientiﬁc, River Edge, NJ, Vol. 3, pp. 461–470.

[2] G. Barhardi, Isochores and the evolutionary genomics of vertebrates, Gene, 241 (2000), pp. 3– 17.

[3] G. Bernardi and G. Bernardi, Compositional constraints and genome evolution, J. Molec. Evolution, 24 (1986), pp. 1–11.

[4] H. Br¨onnimann and B. Chazelle_{, Optimal slope selection via cuttings, Comput. Geom., 10} (1998), pp. 23–29.

[5] B. Charlesworth, Genetic recombination: Patterns in the genome, Current Biol., 4 (1994), pp. 182–184.

[6] R. Cole, J. S. Salowe, W. L. Steiger, and E. Szemer´edi_{, An optimal-time algorithm for}

slope selection, SIAM J. Comput., 18 (1989), pp. 792–810.

[7] L. Duret, D. Mouchiroud, and C. Gautier, Statistical analysis of vertebrate sequences

reveals that long genes are scarce in GC-rich isochores, J. Mol. Evol., 40 (1995), pp. 308–

371.

[8] A. Eyre-Walker, Evidence that both G+C rich and G+C poor isochores are replicated early

and late in the cell cycle, Nucleic Acids Res., 20 (1992), pp. 1497–1501.

[9] A. Eyre-Walker, Recombination and mammalian genome evolution, Proc. Roy. Soc. London Ser. B, 252 (1993), pp. 237–243.

[10] J. Filipski, Correlation between molecular clock ticking, codon usage ﬁdelity of DNA

re-pair, chromosome banding and chromatin compactness in germline cells, FEBS Lett., 217

(1987), pp. 184–186.

[11] M. P. Francino and H. Ochman, Isochores result from mutation not selection, Nature, 400 (1999), pp. 30–31.

[12] S. M. Fullerton, A. B. Carvalho, and A. G. Clark, Local rates of recombination are

positively correlated with GC content in the human genome, Mol. Biol. Evol., 18 (2001),

pp. 1139–1142.

[13] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu, Fast algorithms for ﬁnding maximum-density

segments of a sequence with applications to bioinformatics, in Proceedings of the Second

International Workshop on Algorithms in Bioinformatics (Rome, Italy, 2002), R. Guig´o and D. Gusﬁeld, eds., Lecture Notes in Comput. Sci. 2452, Springer-Verlag, New York, 2002, pp. 157–171.

[14] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu, Linear-time algorithms for computing

maximum-density sequence segments with bioinformatics applications, J. Comput. System

Sci., to appear.

[15] S. W. Golomb, Run-length encodings, IEEE Trans. Inform. Theory, 12 (1966), pp. 399–401. [16] P. Guldberg, K. Gronbak, A. Aggerholm, A. Platz, P. thor Straten, V. Ahrenkiel,

P. Hokland, and J. Zeuthen_{, Detection of mutations in GC-rich DNA by bisulphite}

denaturing gradient gel electrophoresis, Nucleic Acids Res., 26 (1998), pp. 1548–1549.

[17] W. Henke, K. Herdel, K. Jung, D. Schnorr, and S. A. Loening, Betaine improves the PCR

ampliﬁcation of GC-rich DNA sequences, Nucleic Acids Res., 25 (1997), pp. 3957–3958.

[18] G. P. Holmquist, Chromosome bands, their chromatin ﬂavors, and their functional features, Am. J. Hum. Genet., 51 (1992), pp. 17–37.

(14)

[19] X. Huang, An algorithm for identifying regions of a DNA sequence that satisfy a content

requirement, Comput. Appl. Biosci., 10 (1994), pp. 219–225.

[20] K. Ikehara, F. Amada, S. Yoshida, Y. Mikata, and A. Tanaka, A possible origin of newly

born bacterial genes: Signiﬁcance of GC-rich nonstop frame on antisense strand, Nucleic

Acids Res., 24 (1996), pp. 4249–4255.

[21] R. B. Inman, A denaturation map of the 1 phage DNA molecule determined by electron

mi-croscopy, J. Mol. Biol., 18 (1966), pp. 464–476.

[22] I. P. Ioshikhes and M. Q. Zhang, Large-scale human promoter mapping using CpG islands, Nature Genet., 26 (2000), pp. 61–63.

[23] R. Jin, M.-E. Fernandez-Beros, and R. P. Novick, Why is the initiation nick site of an

AT-rich rolling circle plasmid at the tip of a GC-rich cruciform?, EMBO J., 16 (1997),

pp. 4456–4466.

[24] M. J. Katz and M. Sharir, Optimal slope selection via expanders, Inform. Process. Lett., 47 (1993), pp. 115–122.

[25] S. K. Kim, Linear-time algorithm for ﬁnding a maximum-density segment of a sequence, In-form. Process. Lett., 86 (2003), pp. 339–342.

[26] Y.-L. Lin, X. Huang, T. Jiang, and K.-M. Chao, MAVG: Locating nonoverlapping maximum

average segments in a given sequence, Bioinformatics, 19 (2003), pp. 151–152.

[27] Y.-L. Lin, T. Jiang, and K.-M. Chao, Algorithms for locating the length-constrained heaviest

segments, with applications to biomolecular sequence analysis, J. Comput. System Sci., 65

(2002), pp. 570–586.

[28] G. Macaya, J.-P. Thiery, and G. Bernardi, An approach to the organization of eukaryotic

genomes at a macromolecular level, J. Mol. Biol., 108 (1976), pp. 237–254.

[29] C. S. Madsen, C. P. Regan, and G. K. Owens, Interaction of CArG elements and a GC-rich

repressor element in transcriptional regulation of the smooth muscle myosin heavy chain gene in vascular smooth muscle cells, J. Biol. Chem., 272 (1997), pp. 29842–29851.

[30] J. Matouˇsek_{, Randomized optimal algorithm for slope selection, Inform. Process. Lett., 39} (1991), pp. 183–187.

[31] S.-I. Murata, P. Herman, and J. R. Lakowicz, Texture analysis of ﬂuorescence lifetime

im-ages of AT- and GC-rich regions in nuclei, J. Histochem. Cytochem., 49 (2001), pp. 1443–

1452.

[32] A. Nekrutenko and W.-H. Li, Assessment of compositional heterogeneity within and between

eukaryotic genomes, Genome Res., 10 (2000), pp. 1986–1995.

[33] U. Ohler, H. Niemann, G. Liao, and G. M. Rubin, Joint modeling of DNA sequence and

physical properties to improve eukaryotic promoter recognition, Bioinformatics, 17 (2001),

pp. S199–S206.

[34] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European molecular biology open

software suite, Trends Genet., 16 (2000), pp. 276–277.

[35] L. Scotto and R. K. Assoian, A GC-rich domain with bifunctional eﬀects on mRNA and

protein levels: Implications for control of transforming growth factor beta 1 expression,

Mol. Cell. Biol., 13 (1993), pp. 3588–3597.

[36] P. H. Sellers, Pattern recognition in genetic sequences by mismatch density, Bull. Math. Biol., 46 (1984), pp. 501–514.

[37] P. M. Sharp, M. Averof, A. T. Lloyd, G. Matassi, and J. F. Peden, DNA sequence

evolution: The sounds of silence, Philos. Trans. Soc. Lond. B Biol. Sci., 349 (1995), pp. 241–

247.

[38] P. Soriano, M. Meunier-Rotival, and G. Bernardi, The distribution of interspersed repeats

is nonuniform and conserved in the mouse and human genomes, Proc. Natl. Acad. Sci.

USA, 80 (1983), pp. 1816–1820.

[39] N. Stojanovic, L. Florea, C. Riemer, D. Gumucio, J. Slightom, M. Goodman, W. Miller, and R. Hardison_{, Comparison of ﬁve methods for ﬁnding conserved}

se-quences in multiple alignments of gene regulatory regions, Nucleic Acids Res., 27 (1999),

pp. 3899–3910.

[40] N. Sueoka, Directional mutation pressure and neutral molecular evolution, Proc. Natl. Acad. Sci. USA, 80 (1988), pp. 1816–1820.

[41] Z. Wang, E. Lazarov, M. O’Donnel, and M. F. Goodman, Resolving a ﬁdelity paradox: Why

Escherichia coli DNA polymerase II makes more base substitution errors in At- compared to GC-rich DNA, J. Biol. Chem., 277 (2002), pp. 4446–4454.

[42] K. H. Wolfe, P. M. Sharp, and W.-H. Li, Mutation rates diﬀer among regions of the

mam-malian genome, Nature, 337 (1989), pp. 283–285.

[43] Y. Wu, R. P. Stulp, P. Elfferich, J. Osinga, C. H. Buys, and R. M. Hofstra, Improved

(15)

Acids Res., 27 (1999), article e9.

[44] S. Zoubak, O. Clay, and G. Bernardi, The gene distribution of the human genome, Gene, 174 (1996), pp. 95–102.