AN OPTIMAL ALGORITHM FOR THE
MAXIMUM-DENSITY SEGMENT PROBLEM∗
KAI-MIN CHUNG† AND HSUEH-I LU‡
Abstract. We address a fundamental problem arising from analysis of biomolecular sequences.
The input consists of two numbers wmin and wmax and a sequence S of n number pairs (ai, wi)
with wi > 0. Let segment S(i, j) of S be the consecutive subsequence of S between indices i
and j. The density of S(i, j) is d(i, j) = (ai+ ai+1+· · · + aj)/(wi+ wi+1+· · · + wj). The maximum-density segment problem is to find a maximum-density segment over all segments S(i, j)
with wmin≤ wi+ wi+1+· · · + wj≤ wmax. The best previously known algorithm for the problem,
due to Goldwasser, Kao, and Lu [Proceedings of the Second International Workshop on Algorithms
in Bioinformatics, R. Guig´o and D. Gusfield, eds., Lecture Notes in Comput. Sci. 2452, Springer-Verlag, New York, 2002, pp. 157–171], runs in O(n log(wmax−wmin+ 1)) time. In the present paper,
we solve the problem in O(n) time. Our approach bypasses the complicated right-skew decomposition, introduced by Lin, Jiang, and Chao [J. Comput. System Sci., 65 (2002), pp. 570–586]. As a result, our algorithm has the capability to process the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences. Moreover, for a type of input sequences
S representable in O(m) space, we show how to exploit the sparsity of S and solve the
maximum-density segment problem for S in O(m) time.
Key words. bioinformatics, biological sequence analysis, maximum-density segment, slope selection, computational geometry, sequence algorithm, data structure
AMS subject classifications. 68P05, 68Q25, 68R05, 68U05, 68W05, 68W40, 92-08, 92D20 DOI. 10.1137/S0097539704440430
1. Introduction. We address the following fundamental problem: The input
consists of two numbers wmin and wmax and a sequence S of number pairs (ai, wi)
with wi > 0 for i = 1, . . . , n. A segment S(i, j) is a consecutive subsequence of S from position i to position j. The width w(i, j) of S(i, j) is wi+ wi+1+· · · + wj. The density d(i, j) of S(i, j) is (ai+ ai+1+· · · + aj)/w(i, j). It is not difficult to see that with an O(n)-time preprocessing to compute all O(n) prefix sums a1+ a2+· · · + aj
and w1+ w2+· · · + wj, the density of any segment can be computed in O(1) time.
S(i, j) is feasible if wmin≤ w(i, j) ≤ wmax. The maximum-density segment problem is
to find a maximum-density segment over all O(n(wmax−wmin+ 1)) feasible segments.
This problem arises from the investigation of the nonuniformity of nucleotide com-position within genomic sequences, which was first revealed through thermal melting and gradient centrifugation experiments [21, 28]. The GC content of the DNA se-quences in all organisms varies from 25% to 75%. GC-ratios have the greatest vari-ation among bacterial DNA sequences, while the typical GC-ratios of mammalian
∗Received by the editors February 3, 2004; accepted for publication (in revised form) July 7,
2004; published electronically December 1, 2004. A preliminary version of this paper appeared in
Proceedings of the 11th Annual European Symposium on Algorithms (Budapest, Hungary, 2003), G.
Di Battista and U. Zwick, eds., Lecture Notes in Comput. Sci. 2832, Springer-Verlag, New York, 2003, pp. 136–147.
http://www.siam.org/journals/sicomp/34-2/44043.html
†Institute of Information Science, Academia Sinica. Part of this work was done while this author
was an undergraduate student in the Department of Computer Science and Information Engineering, National Taiwan University.
‡Corresponding author. Institute of Information Science, Academia Sinica, 128 Academia Road,
Section 2, Taipei 115, Taiwan, Republic of China (hil@iis.sinica.edu.tw, www.iis.sinica.edu.tw/∼hil/). This author’s research was supported in part by NSC grants 91-2215-E-001-001 and 92-2218-E-001-001.
genomes stay in the range 45–50%. Despite intensive research effort in the past two decades, the underlying causes of the observed heterogeneity remain debatable [2, 3, 5, 8, 9, 10, 11, 18, 40, 42]. Researchers [32, 39] observed that the extent of the com-positional heterogeneity in a genomic sequence strongly correlates with its GC content regardless of genome size. Other investigations showed that gene length [7], gene den-sity [44], patterns of codon usage [37], distribution of different classes of repetitive elements [7, 38], number of isochores [2], lengths of isochores [32], and recombina-tion rate within chromosomes [12] are all correlated with GC content. More research related to GC-rich segments can be found in [16, 17, 20, 23, 29, 31, 35, 41, 43] and ref-erences therein.
In the most basic form of the maximum-density segment problem, the sequence S corresponds to the given DNA sequence, where ai= 1 if the corresponding nucleotide in the DNA sequence is a G or C, and ai= 0 otherwise. In the work of Huang [19], sequence entries took on values of p and 1− p for some real number 0 ≤ p ≤ 1. More generally, we can look for regions where a given set of patterns occurs very often. In such applications, ai could be the relative frequency with which the corresponding DNA segments appear in the given patterns. Further natural applications of this problem can be designed for sophisticated sequence analysis such as mismatch den-sity [36], ungapped local alignments [1], annotated multiple sequence alignments [39], promoter mapping [22], and promoter recognition [33].
For the uniform case, i.e., wi = 1 for all indices i, Nekrutenko and Li [32] and Rice, Longden, and Bleasby [34] employed algorithms for the case wmin= wmax, which
is trivially solvable in O(n) time. More generally, when wmin= wmax, the problem is
also easily solvable in O(n(wmax− wmin+ 1)) time, linear in the number of feasible
segments. Huang [19] studied the case where wmax = n, i.e., there is effectively no
upper bound on the width of the desired maximum-density segments. He observed that an optimal segment exists with width at most 2wmin− 1. Therefore, this case is
equivalent to the case with wmax = 2wmin− 1 and can be solved in O(nwmin) time
in a straightforward manner. Lin, Jiang, and Chao [27] gave an O(n log wmin)-time
algorithm for this case based on right-skew decompositions of a sequence. (See [26] for related software.) The case with general wmax was first investigated by Goldwasser,
Kao, and Lu [13, 14], who gave an O(n)-time algorithm for the uniform case. Recently, Kim [25] showed an alternative algorithm based upon a geometric interpretation of the problem, which basically relates the maximum-density segment problem to the fundamental slope selection problem in computational geometry [4, 6, 24, 30]. Unfor-tunately, Kim’s analysis of time complexity has a flaw which seems difficult to fix.1 For the general (i.e., nonuniform) case, Goldwasser, Kao, and Lu [13] also gave an O(n log(wmax− wmin+ 1))-time algorithm. By bypassing the complicated
preprocess-ing step required in [13], we successfully reduce the time required for the general case down to O(n). Our result is based upon the following set of equations, stating that the order of d(x, y), d(y + 1, z), and d(x, z) with x≤ y < z can be determined by the
1Kim claims that all the progressive updates of the lower convex hulls L
j∪ Rjcan be done in
overall linear time. The paper only sketches how to obtain Lj+1∪ Rj+1 from Lj∪ Rj. (See the
fourth-to-last paragraph of p. 340 in [25].) Unfortunately, Kim seems to overlook the marginal cases when the upper bound wmaxforces the pzof Lj∪ Rjto be deleted from Lj+1∪ Rj+1. As a result,
obtaining Lj+1∪ Rj+1from Lj∪ Rjcould be much more complicated than Kim’s sketch. A naive
implementation of Kim’s algorithm still takes Ω(n(wmax− wmin+ 1)) time in the worst case. We
believe that any correct implementation of Kim’s algorithm requires Ω(n log(wmax− wmin+ 1)) time
y ≥ ≥ y + 1 x y ≤ ≤ x z z y + 1
Fig. 1.1. An illustration for (1.1): There exist only two possibilities for the order among
d(x, y), d(x, z), d(y + 1, z).
order of any two of them:
d(x, y)≤ d(y + 1, z) ⇔ d(x, y) ≤ d(x, z) ⇔ d(x, z) ≤ d(y + 1, z), d(x, y)≥ d(y + 1, z) ⇔ d(x, y) ≥ d(x, z) ⇔ d(x, z) ≥ d(y + 1, z). (1.1)
(Both equations can be easily verified by observing the existence of some number ρ with 0 < ρ < 1 and d(x, z) = ρ· d(x, y) + (1 − ρ) · d(y + 1, z). See Figure 1.1.) Our algorithm is capable of processing the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences.
For bioinformatics applications, e.g., in [1, 22, 33, 36, 39], the input sequence S is usually very sparse. That is, S can be represented by m = o(n) triples
(a1, w1, n1), (a2, w2, n2), . . . , (am, wm , nm)
with 0 = n0< n1< n2<· · · < nm= n to signify that (ai, wi) = (aj, wj) holds for all indices i and j with nj−1< i≤ nj and 1≤ j ≤ m. If wj = 1 holds for all 1≤ j ≤ m, we show how to exploit the sparsity of S and solve the maximum-density problem for S given in the above compact representation in O(m) time.
The remainder of the paper is organized as follows. Section 2 shows the main algorithm. Section 3 explains how to cope with the simple case that the width upper bound wmax is ineffective. Section 4 takes care of the more complicated case that
wmaxis effective. Section 5 explains how to exploit the sparsity of the input sequence
for the uniform case.
2. The main algorithm. For any integers x and y, let [x, y] denote the set {x, x + 1, . . . , y}. Without loss of generality, we may assume that w1+w2+· · ·+wn≥
wmin and wi ≤ wmax holds for each i = 1, 2, . . . , n. Throughout the paper, we need
the following definitions and notation with respect to the input length-n sequence S and width bounds wmin and wmax. Define φ(x, y) to be the largest index z ∈ [x, y]
that minimizes d(x, z). That is, S(x, φ(x, y)) is the longest minimum-density prefix of S(x, y). Let j0 be the smallest index with w(1, j0) ≥ wmin. Let J = [j0, n]. For
each j ∈ J, let j be the smallest index i with w(i, j) ≤ wmax, and let rj be the largest index i with w(i, j)≥ wmin. That is, S(i, j) is feasible if and only if i∈ [j, rj].
(Figure 2.1 is an illustration for the definitions of j and rj.) Clearly, for the uniform case, we have i+1= i+ 1 and ri+1= ri+ 1. As for the general case, we know only that j and rj are both (not necessarily strictly) increasing. One can easily compute all j and rj in O(n) time.
Let i∗j be the largest index k ∈ [j, rj] with d(k, j) = max{d(i, j) | i ∈ [j, rj]}. There must be an index j∗ such that S(i∗j∗, j∗) is a maximum-density segment of
j n rj
1 j
≥ wmin
≤ wmax
Fig. 2.1. An illustration for the definitions of jand rj.
Algorithm main. 1 let ij0−1= 1; 2 for j = j0to n do{ 3 let ij= best(max(ij−1, j), rj, j); 4 output (ij, j); 5 } Function best(, r, j). 1 let i = ;
2 while i < r and d(i, φ(i, r− 1)) ≤ d(i, j) do
3 let i = φ(i, r− 1) + 1;
4 return i;
Fig. 2.2. Our main algorithm.
S. Therefore, a natural but seemingly difficult possibility for solving the maximum-density segment problem would be to compute i∗j for all indices j∈ J in O(n) time. Instead, our approach is to compute a pair (ij, j) of indices with ij∈ [j, rj] for each index j∈ J by the algorithm shown in Figure 2.2. More specifically, each iteration of our algorithm, based upon (1.1), keeps chopping off the lowest-density prefix without affecting the feasibility of the remaining segment until its density does not go any higher. For brevity of presentation, each algorithm throughout the paper outputs a linear number of index pairs (i, j). (See, e.g., step 4 of algorithm main shown in Figure 2.2.) Clearly, it takes a linear-time postprocessing for the produced index pairs to obtain one that maximizes d(i, j). The rest of the section ensures the correctness of our algorithm by showing ij∗ = i∗j∗, and thus reduces the maximum-density segment problem to implementing our algorithm to run in O(n) time.
Lemma 2.1. The index returned by function call best(, r, j) is the largest index i∈ [, r] that maximizes d(i, j).
Proof. Let i∗ be the largest index in [, r] that maximizes d(i, j), i.e., d(i∗, j) = maxi∈[,r]d(i, j). Let ij be the index returned by function call best(, r, j). We show ij = i∗as follows. If ij < i∗, then ij < r. By the condition of the while-loop at step 2 of best, we know d(ij, φ(ij, r− 1)) > d(ij, j). By d(ij, j)≤ d(i∗, j) and (1.1), we have d(ij, i∗− 1) ≤ d(ij, j). It follows that d(ij, i∗− 1) < d(ij, φ(ij, r− 1)), contradicting the definition of φ(ij, r− 1).
On the other hand, suppose that ij > i∗. By definition of best, there must be an index i∈ [, r] with i < r, d(i, φ(i, r − 1)) ≤ d(i, j), and i ≤ i∗ < φ(i, r− 1) + 1. If i = i∗, by (1.1) we have d(i∗, φ(i∗, r− 1)) ≤ d(i∗, j)≤ d(φ(i∗, r− 1) + 1, j), where the last inequality contradicts the definition of i∗. Now that i < i∗, we have d(i∗, j) ≥ d(i, j)≥ d(i, i∗−1) ≥ d(i, φ(i, r−1)) ≥ d(i∗, φ(i, r−1)), where (a) the first inequality is
by definition of i∗, (b) the second inequality is by (1.1) and the first inequality, (c) the third inequality is by i∗≤ φ(i, r − 1) and the definition of φ(i, r − 1), and (d) the last inequality is by (1.1) and the third inequality. It follows from d(i∗, j)≥ d(i∗, φ(i, r−1)) and (1.1) that d(φ(i, r− 1) + 1, j) ≥ d(i∗, j), contradicting the definition of i∗ by i∗< φ(i, r− 1) + 1.
Theorem 2.2. Algorithm main correctly solves the maximum-density segment problem.
Proof. We prove the theorem by showing ij∗ = i∗j∗. By j0 = ij0−1 = 1 and Lemma 2.1, the equality holds if j∗= j0. The rest of the proof assumes j∗> j0. By
Lemma 2.1 and j∗ ≤ i∗j∗, it suffices to ensure ij∗−1≤ i∗j∗. Assume for contradiction that there is an index j∈ [j0, j∗−1] with ij−1 ≤ i∗j∗ < ij. By j < j∗, we know j ≤ i∗j∗.
By Lemma 2.1 and max(j, ij−1) ≤ i∗j∗ < ij ≤ rj, we have d(ij, j) ≥ d(i∗j∗, j). It
follows from (1.1) and i∗j∗ < ij that d(i∗j∗, j)≥ d(ij∗∗, ij− 1). By j∗ ≤ i∗j∗ < ij ≤ rj∗ and the definition of j∗, we know d(i∗j∗, j∗) > d(ij, j∗). It follows from i∗j∗ < ij and (1.1) that d(i∗j∗, ij− 1) > d(i∗j∗, j∗). Therefore, d(ij, j) ≥ d(i∗j∗, j)≥ d(i∗j∗, ij− 1) > d(i∗j∗, j∗), contradicting the definition of j∗.
One can verify that the value of i increases by at least 1 each time step 3 of best is executed. Therefore, to implement the algorithm to run in O(n) time, it suffices to maintain a data structure to support an O(1)-time query for each φ(i, rj − 1) in step 2 of best.
3. Coping with ineffective width upper bound. When wmax is ineffective,
i.e., wmax ≥ w(1, n), we have j = 1 for all j ∈ J. Therefore, the function call in
step 3 of main is exactly best(ij−1, rj, j). Moreover, during the execution of the function call best(ij−1, rj, j), the value of i can only be ij−1, φ(ij−1, rj − 1) + 1, φ(φ(ij−1, rj− 1) + 1, rj− 1) + 1, . . . , rj. Suppose that a subroutine call to update(j) yields an array Φ of indices and two indices p and q of Φ with p≤ q and Φ[p] = ij−1 such that the following condition holds.
Condition Cj : Φ[q] = rj and Φ[t] = φ(Φ[t− 1], rj− 1) + 1 holds for each index t∈ [p + 1, q]. (See Figure 3.1 for an illustration.)
Φ[p + 1] Φ[p + 2]
rj
φ(Φ[p], rj− 1) φ(Φ[p + 1], rj− 1) φ(Φ[q− 1], rj− 1)
Φ[p] Φ[q]
ij−1
Fig. 3.1. An illustration for Condition Cj.
Then, the subroutine call to best(ij−1, rj, j) can clearly be replaced by lbest(j), as defined in Figure 3.2. That is, lbest(j) can access the value of each φ(i, rj− 1) by looking up Φ in O(1) time. It remains to show how to implement update(j) such that all O(n) subroutine calls to update from step 3 of lmain run in overall O(n) time. With the initialization of letting p = q = rj0−1 = Φ[1] = 1, as described at step 1 of algorithm lmain, Condition Cj0−1clearly holds before the subroutine call to update(j0). The following lemma is crucial in ensuring the correctness and efficiency of our implementation shown in Figure 3.2.
Lemma 3.1. For each index j ∈ J, the following statements hold:
1. If Condition Cj−1 holds right before calling update(j), then Condition Cj holds right after the subroutine call.
Algorithm lmain. 1 let p = q = rj0−1= Φ[1] = 1; 2 for j = j0to n do{ 3 call update(j); 4 let ij= lbest(j); 5 output (ij, j); 6 } Function lbest(j). 1 while p < q and d(Φ[p], Φ[p + 1]− 1) ≤ d(Φ[p], j) do 2 let p = p + 1; 3 return Φ[p]; Subroutine update(j). 1 for r = rj−1+ 1 to rjdo{ 2 while p < q and d(Φ[q− 1], Φ[q] − 1) ≥ d(Φ[q − 1], r − 1) do 3 let q = q− 1; 4 let q = q + 1; 5 let Φ[q] = r; 6 }
Fig. 3.2. An efficient implementation for the case that wmax is ineffective.
2. If Condition Cj holds right before calling lbest(j), then the index returned by the function call is exactly that returned by best(Φ[p], Φ[q], j).
Proof. Statement 1. We need the following condition.
Condition Dr : Φ[q] = r and Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each index t∈ [p + 1, q].
Clearly, Condition Cj is exactly Condition Drj. To prove the statement, we show
that if Condition Dr−1holds, then Condition Drholds right after executing steps 2–5 of update (i.e., an iteration of the for-loop of update), although the value of q may change.
Consider the moment when step 4 of update is about to be executed. We first show that if p < q, then Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each t ∈ [p + 1, q]. By the definition of φ, we have
φ(, r− 1) =
r− 1 if d(, φ(, r− 2)) ≥ d(, r − 1), φ(, r− 2) otherwise.
(3.1)
By Condition Dr−1, we know φ(Φ[q− 1], r − 2) = Φ[q] − 1. With = Φ[q − 1] plugged into (3.1), we know that if d(Φ[q− 1], Φ[q] − 1) < d(Φ[q − 1], r − 1), then φ(Φ[q−1], r−1) = φ(Φ[q −1], r−2). It follows from the condition of step 2 of update that φ(Φ[q− 1], r − 1) = φ(Φ[q − 1], r − 2). Furthermore, if φ(, r − 2) < r − 2, then one can prove as follows that φ(φ(, r− 2) + 1, r − 1) = φ(φ(, r − 2) + 1, r − 2) implies φ(, r− 1) = φ(, r − 2).
Let m = φ(, r− 2). By φ(m + 1, r − 1) = φ(m + 1, r − 2), we have d(m+1, φ(m+1, r−1)) < d(m+1, r−1). By definition of φ and (1.1), we have d(, m) < d(, φ(m + 1, r− 1)) < d(m + 1, φ(m + 1, r − 1)). As a result, we have d(, m) < d(m + 1, r− 1), which by (1.1) implies d(, m) < d(, r− 1). Thus φ(, r − 1) = φ(, r − 2).
Therefore, φ(Φ[t], r−1) = φ(Φ[t], r−2) implies φ(Φ[t−1], r−1) = φ(Φ[t−1], r−2). As a result, φ(Φ[t], r−1) = φ(Φ[t], r −2) holds for each t ∈ [p, q −1]. By Condition Dr−1, we know Φ[t] = φ(Φ[t− 1], r − 1) + 1 holds for each t ∈ [p + 1, q].
Since the value of q will be incremented by 1 after executing step 4 of update and Φ[q+1] will equal r after executing step 5 of update, it remains to show φ(Φ[q], r−1) = r− 1. Clearly, the equality holds if Φ[q] = r − 1, i.e., step 3 was not executed. If step 3 was executed at least once, then we know d(Φ[q], Φ[q + 1]− 1) ≥ d(Φ[q], r − 1), i.e., d(Φ[q], φ(Φ[q], r− 2)) ≥ d(Φ[q], r − 1). By plugging = Φ[q] into (3.1), we know φ(Φ[q], r− 1) = r − 1.
Statement 2. By Condition Cj, one can easily verify that lbest(j) is a faithful implementation of best(Φ[p], Φ[q], j). Therefore, the statement holds.
Lemma 3.2. The implementation lmain solves the maximum-density problem for the case with ineffective wmax in O(n) time.
Proof. By Lemma 3.1(1) and the definitions of update and lbest, both Con-dition Cjand Φ[p] = ij−1hold right after the subroutine call to update(j). By Condi-tion Cjand Lemma 3.1(2), lbest(j) is a faithful implementation of best(Φ[p], Φ[q], j). Therefore, the correctness of lmain follows from Φ[p] = ij−1, Φ[q] = rj, and Theo-rem 2.2.
As for the efficiency of lmain, observe that q− p ≥ 0 holds throughout the execution of lmain. Note that each iteration of the while-loops of lbest and update decreases the value of q− p by one. Since step 4 of update, which is the only place that increases the value of q− p, increases the value of q − p by one for O(n) times, the overall running time of lmain is O(n).
4. Coping with effective width upper bound. In contrast to the previous
simple case, when wmax is arbitrary, j may not always be 1. Therefore, the first argument of the function call in step 3 of main could be j with j > ij−1. It seems quite difficult to update the corresponding data structure Φ in overall linear time such that both Φ[p] = max(ij−1, j) and Condition Cj hold throughout the execution of our algorithm. To overcome this difficulty, our algorithm sticks with Condition Cj but allows Φ[p] > max(ij−1, j). As a result, maxj∈Jd(ij, j) may be less than maxj∈Jd(i∗j, j). Fortunately, this potential problem can be resolved if we simultaneously solve a series of variant versions of the maximum-density segment problem.
≥ wmin
≤ wmax
≤ wmax
1 = y0 r y0 y1 n
Fig. 4.1. An illustration for the relation among , r, y0, y1.
4.1. A variant version of the maximum-density segment problem.
Sup-pose that we are given two indices r and y0 with w(r, y0) ≥ wmin. Let X = [, r]
and Y = [y0, y1] be two intervals such that = y0 and y1 is the largest index in J
with w(r, y1)≤ wmax. See Figure 4.1 for an illustration. The variant version of the
Algorithm vmain(r, y0).
1 let be the smallest index in [1, n] with w(, y0)≤ wmax;
2 let y1 be the largest index in [1, n] with w(r, y1)≤ wmax;
3 let xy0−1= ;
4 for y = y0to y1 do{
5 let xy= best(max(xy−1, y), r, y);
6 output (xy, y);
7 }
Fig. 4.2. Our algorithm for the variant version of the maximum-density segment problem,
where function best is as defined in Figure 2.2.
all feasible segments S(x, y) with x ∈ X, y ∈ Y , and wmin ≤ w(x, y) ≤ wmax such
that d(x, y) is maximized.
For each y∈ Y , let x∗y be the largest index x∈ X with wmin≤ w(x, y) ≤ wmax
that maximizes d(x, y). Let y∗ be an index in Y with d(x∗y∗, y∗) = maxy∈Y d(x∗y, y). Although solving the variant version can naturally be reduced to computing the index x∗y for each index y∈ Y , the required running time is more than what we can afford. Instead, we compute an index xy ∈ X with wmin ≤ w(xy, y) ≤ wmax for each index
y∈ Y such that xy∗ = x∗y∗. By w(r, y0)≥ wminand w(r, y1)≤ wmax, one can easily see
that, for each y∈ Y , r is always the largest index x ∈ X with wmin≤ w(x, y) ≤ wmax.
Our algorithm for solving the variant problem is as shown in Figure 4.2, presented in a way to emphasize the analogy between vmain and main. For example, the index xy in vmain is the counterpart of the index ij in main. Also, the index r in vmain plays the role of the index rj in main. We have the following lemma whose proof is very similar to that of Theorem 2.2.
Lemma 4.1. Algorithm vmain solves the variant version of the maximum-density problem correctly.
Proof. We prove the theorem by showing xy∗ = x∗y∗. By y0 = xy0−1 = and Lemma 2.1, the equality holds if y∗= y0. The rest of the proof assumes y∗> y0. By
Lemma 2.1 and y∗ ≤ x∗y∗, it suffices to ensure xy∗−1≤ x∗y∗. Assume for contradiction that there is an index y ∈ [y0, y∗− 1] with xy−1 ≤ x∗y∗ < xy. By y < y∗, we know y ≤ x∗y∗. By Lemma 2.1 and max(y, xy−1) ≤ x∗y∗ < xy ≤ r, we have d(xy, y) ≥
d(x∗y∗, y). It follows from (1.1) and x∗y∗ < xy that d(x∗y∗, y) ≥ d(x∗y∗, xy− 1). By y∗ ≤ x∗y∗ < xy ≤ r and the definition of y∗, we know d(x∗y∗, y∗) > d(xy, y∗). It
follows from x∗y∗ < xyand (1.1) that d(x∗y∗, xy−1) > d(x∗y∗, y∗). Therefore, d(xy, y)≥
d(x∗y∗, y)≥ d(x∗y∗, xy− 1) > d(x∗y∗, y∗), contradicting the definition of y∗.
Again, the challenge lies in supporting each query to φ(i, r− 1) of best in O(1) time during the execution of vmain. Fortunately, unlike during the execution of main, where both parameters of φ(i, r− 1) may change, the second parameter r − 1 is now fixed. Therefore, to support each query to φ(i, r− 1) in O(1) time, we can actually afford O(r− + 1) time to compute a data structure Ψ such that Ψ[i] = φ(i, r − 1) for each i ∈ [, r − 1]. As a result, the function best can be implemented as the function vbest shown in Figure 4.3. The following lemma ensures the correctness and efficiency of our implementation variant shown in Figure 4.3.
Lemma 4.2. The implementation variant correctly solves the variant version of the maximum-density segment problem in O(r− + y1− y0+ 1) time.
Algorithm variant(r, y0).
1 let be the smallest index in [1, n] with w(, y0)≤ wmax;
2 let y1 be the largest index in [1, n] with w(r, y1)≤ wmax;
3 call init(, r− 1);
4 let xy0−1= ;
5 for y = y0to y1 do{
6 let xy= vbest(max(xy−1, y), r, y);
7 output (xy, y);
8 }
Function vbest(, r, y).
1 let x = ; 2 while x < r and d(x, Ψ[x])≤ d(x, y) do 3 let x = Ψ[x] + 1; 4 return x; Subroutine init(, r) 1 let Ψ[r] = r; 2 for s = r− 1 downto do { 3 let t = s; 4 while t < r and d(s, t)≥ d(s, Ψ[t + 1]) do 5 let t = Ψ[t + 1]; 6 let Ψ[s] = t; 7 }
Fig. 4.3. An efficient implementation for the algorithm vmain.
[, r− 1], then vbest is a faithful implementation of best. Therefore, by Lemma 4.1, the correctness of variant can be ensured by showing that after calling init(, r−1) at step 3 of algorithm variant , Ψ[i] = φ(i, r−1) holds for each index i ∈ [, r−1]. Note that for brevity of the following proof, we slightly abuse the notation r in Figure 4.3. That is, although the subroutine call at step 3 of algorithm variant is init(, r− 1), the second parameter in the definition of subroutine init in Figure 4.3 becomes r. Let us make it clear that the rest of the proof (i) lets r denote the one in the definition of subroutine init(, r), and (ii) proves that Ψ[i] = φ(i, r) holds for each index i∈ [, r] at the end of the subroutine call to subroutine init(, r).
By step 1 of init, we have Ψ[r] = r = φ(r, r). Now suppose that Ψ[i] = φ(i, r) holds for each index i∈ [x + 1, r] right before init is about to execute the iteration for index x∈ [, r]. It suffices to show Ψ[x] = φ(x, r) after the iteration. Let
Zx={x, Ψ[x + 1], Ψ[Ψ[x + 1] + 1], Ψ[Ψ[Ψ[x + 1] + 1] + 1], . . . , r}. Let|Zx| denote the cardinality of Zx. We first show φ(x, r)∈ Zxas follows.
Assume for contradiction that φ(x, r) ∈ Zx, i.e., there is an index z ∈ Zx with z < φ(x, r) < Ψ[z + 1] = φ(z + 1, r). By definition of φ and (1.1), we have d(z + 1, φ(x, r)) > d(z + 1, φ(z + 1, r)) > d(φ(x, r) + 1, φ(z + 1, r)) and d(φ(x, r) + 1, φ(z + 1, r))≥ d(x, φ(z + 1, r)) ≥ d(x, φ(x, r)). By d(z + 1, φ(x, r)) > d(x, φ(x, r)) and (1.1), we have d(x, φ(x, r)) > d(x, z), contradicting the definition of φ(x, r).
Algorithm general. 1 let p = q = rj0−1= Φ[1] = 1; 2 for j = j0to n do{ 3 call update(j); 4 while Φ[p] < jdo 5 let p = p + 1; 6 if ij−1< Φ[p] then 7 call variant(Φ[p], j); 8 let ij= lbest(j); 9 output (ij, j); 10 }
Fig. 4.4. Our algorithm for the general case, where update and lbest are defined in Figure 3.2
and variant is defined in Figure 4.3.
φ(x, r). By φ(z + 1, r) ≤ φ(x, r) ≤ r and the definition of φ(z + 1, r), we have d(z + 1, φ(x, r)) ≥ d(z + 1, φ(z + 1, r)). By definition of φ(x, r) and (1.1), we have d(x, z)≥ d(x, φ(x, r)) ≥ d(z + 1, φ(x, r)). By d(x, z) ≥ d(z + 1, φ(z + 1, r)) and (1.1), we have d(x, z)≥ d(x, φ(z + 1, r)). Therefore, if z < φ(x, r), then step 5 of init will be executed to increase the value of z. Observe that φ(x, r) = z < r and Ψ[z + 1] > z imply d(x, z) < d(x, Φ[z + 1]). It follows that as soon as z = φ(x, r) holds, whether φ(x, r) = r or not, the value of Ψ[x] will immediately be set to z at step 6 of init.
As for the time complexity, we first observe that and y1 can be found from r
and y0 in O(r− + y1− y0+ 1) time:
• Let = r, and then repeatedly decrease by 1 as long as w( − 1, y0)≤ wmax
holds.
• Let y1= y0, and then repeatedly increase y1 by 1 as long as w(r, y1+ 1)≤
wmaxholds.
Secondly, one can see that the rest of the implementation also runs in O(r− + y1−
y0+ 1) time by verifying that throughout the execution of the implementation (a) the
while-loop of vbest runs for O(r−+y1−y0+ 1) iterations, and (b) the while-loop of
initruns for O(r− + 1) iterations. To see statement (a), just observe that the value of index x (i) never decreases, (ii) stays in [, r], and (iii) increases by at least one each time step 3 of vbest is executed. As for statement (b), consider the iteration with index s of the for-loop of init. Note that if step 6 of init executes ts times in this iteration, then|Zs| = |Zs+1| − ts+ 1. Since|Zs| ≥ 1 holds for each s ∈ X, we haves∈Xts= O(r− + 1), and thus statement (b) holds.
4.2. Our algorithm for the general case. With the help of variant, we have
a linear-time algorithm for solving the original maximum-density segment problem as shown in Figure 4.4. Algorithm general is obtained by inserting four lines of code (i.e., steps 4–7 of general) between steps 3 and 4 of lmain in order to handle the case ij−1< j. Specifically, when ij−1< j, we cannot afford to appropriately update the data structure Φ. Therefore, instead of moving i to j, steps 4 and 5 move i to Φ[p], where p is the smallest index with j ≤ Φ[p]. Of course, these two steps may cause our algorithm to overlook the possibility of ij ∈ [ij−1, Φ[p]− 1], as illustrated in Figure 4.5. This is when the variant version comes in: As shown in the next theorem, we can remedy the problem by calling variant(Φ[p], j).
ij−1 j ij Φ[p] rj
Fig. 4.5. An illustration for the situation when Steps 6 and 7 of general are needed.
k j j rj k k j rk 1 n
Fig. 4.6. An illustration showing that the overall running time of all subroutine calls to variant(
j, j) in general is O(n).
Theorem 4.3. Algorithm general solves the maximum-density segment prob-lem in an online manner in linear time.
Proof. We prove the correctness of general by showing that i∗j∗ = ij∗ implies i∗j∗ = xj∗. By Lemma 3.1(1), Condition Cj holds after the subroutine call update(j) at step 3 of general. Observe that steps 4 and 5 of general, which may increase the value of p, do not affect the validity of Condition Cj. Also, steps 6 and 7 do not modify p, q, and Φ. Let j be the value of Φ[p] right before executing step 8 of general. By Lemma 3.1(2), the index ijreturned by lbest(j) is the largest index in [j, rj] that maximizes d(ij, j). Clearly, ij∗ = i∗j∗ implies the correctness of general.
If ij∗ = i∗j∗, there must be an index j∈ [j0, j∗] with ij−1≤ i∗j∗ < ij. It can be proved as follows that i∗j∗ ≤ j− 1.
Assume j≤ i∗j∗ for contradiction. It follows from Lemma 3.1(2) and
(1.1) that d(ij, j)≥ d(i∗j∗, j)≥ d(i∗j∗, ij− 1). By the definition of j∗, we have d(i∗j∗, j∗) > d(ij, j∗), which by (1.1) implies d(i∗j∗, ij− 1) > d(i∗j∗, j∗). Therefore, d(ij, j) > d(i∗j∗, j∗), contradicting the definition
of j∗.
Since ij−1≤ i∗j∗ ≤ j− 1, we know w(j− 1, j∗)≤ w(i∗j∗, j∗)≤ wmax. Thus, S(i∗j∗, j∗) is a feasible segment in the variant version of the maximum-density segment problem for S with respect to indices r = j and y0= j. By Lemma 4.2, the subroutine call
variant(
j, j) at step 7 of algorithm general has to output an index pair (x, y) with wmin≤ w(x, y) ≤ wmaxand d(x, y) = d(i∗j∗, j∗).
As for the running time, observe that q− p ≥ 0 holds throughout the execution of general. Step 4 of update, which is the only place that increases the value of q− p, increases the value of q − p by 1 for O(n) times. Note that each iteration of the while-loops of general, lbest, and update decreases the value of q− p by 1. Therefore, to show that the overall running time of general is O(n), it remains to ensure that all those subroutine calls to variant at step 7 of general take overall O(n) time. Suppose that j and k are two arbitrary indices with k < j such that general makes subroutine calls to variant(
k, k) and variant(j, j). Let rk be the largest index in [1, n] with w(k, rk)≤ wmax. By Lemma 4.2, it suffices to show
k < j and rk < j as follows. (See Figure 4.6.) By the definition of general, we know that ij−1 < j, which is ensured by the situation illustrated in Figure 4.5. By k < j, we have k ≤ ij−1, implying k < j. Moreover, by the definitions of j and
rk, one can easily verify that k < j implies rk< j.
It is not difficult to see that our algorithm shown in Figure 4.4 is already capable of processing the input sequence in an online manner, since the only preprocessing required is to obtain j, rj, and the prefix sums of a1, a2, . . . , aj and w1, w2, . . . , wj (for the purpose of evaluating the density of any segment in O(1) time), which can easily be computed on the fly.
5. Exploiting sparsity for the uniform case. In this section, we assume that
the input sequence S = (a1, w1), (a2, w2), . . . , (an, wn) is run-length encoded [15], i.e.,
represented by m pairs (a1, n1), (a2, n2), . . . , (am, nm) with 0 = n0< n1< n2<· · · <
nm = n to signify that w1 = w2 = · · · = wn = 1 and that ai = aj holds for all indices i and j with nj−1 < i ≤ nj and 1 ≤ j ≤ m. Our algorithm for solving the maximum-density problem for the O(m)-space representable sequence S is shown in Figure 5.1.
Algorithm sparse.
1 for k = 1 to m do
2 let nk= nk− nk−1;
3 let S be the length-m sequence (n1a1, n1), (n2a2, n2), . . . , (nmam, nm);
4 let (i, j) be an optimal output of general(wmin, wmax, S);
5 output (ni−1+ 1, nj);
6 for k = 1 to m do{
7 if nk≥ wmin then
8 output (nk, nk) and (rnk, nk);
9 if nk−1+ wmin≤ n then
10 output (nk−1+ 1, nk−1+ wmin) and (nk−1+ 1, min(n, nk−1+ wmax));
11 }
Fig. 5.1. Our algorithm that handles sparse input sequence for the uniform case, where general
is defined in Figure 4.4.
Theorem 5.1. Algorithm sparse solves the maximum-density problem for the above O(m)-space representable sequence in O(m) time.
Proof. By Theorem 4.3, sparse runs in O(m) time. Let S(i∗, j∗) be a feasible segment with maximum density. We first show that without loss of generality i∗−1 ∈ {n0, n1, . . . , nm−1} or j∗ ∈ {n1, n2, . . . , nm} holds. More specifically, we show as
follows that if i∗ − 1 ∈ {n0, n1, . . . , nm−1} and j∗ ∈ {n1, n2, . . . , nm}, then S(i∗+
1, j∗+ 1) is also a feasible segment with maximum density.
By i∗− 1 ∈ {n0, n1, . . . , nm−1}, we know ai∗−1 = ai∗. By j∗ ∈ {n1, n2, . . . , nm}, we know aj∗ = aj∗+1. It follows from the optimality
of S(i∗, j∗) that ai∗ ≥ aj∗+1and ai∗−1≤ aj∗, implying ai∗−1= ai∗ = aj∗ = aj∗+1. Therefore, S(i∗+ 1, j∗+ 1) is also a maximum-density
segment.
It remains to show that our algorithm works correctly for each possible case.
• Case 1: i∗− 1 ∈ {n0, n1, . . . , nm−1} and j∗ ∈ {n1, n2, . . . , nm}. Clearly, steps 1–5 of sparse take care of this case.
• Case 2: i∗− 1 ∈ {n
0, n1, . . . , nm−1} and j∗ ∈ {n1, n2, . . . , nm}. Clearly, if
i∗∈ {j∗, rj∗}, S(i∗, j∗) can be discovered by steps 7 and 8 of sparse. Since i∗− 1 ∈ {n0, n1, . . . , nm−1}, we have ai∗−1 = ai∗. If ai∗−1 = ai∗ = d(i∗, j∗), then by (1.1) we have either d(i∗−1, j∗) > d(i∗, j∗) or d(i∗+1, j∗) > d(i∗, j∗),
which implies that either S(i∗− 1, j∗) or S(i∗+ 1, j∗) is infeasible, and thus i∗∈ {j∗, rj∗}. On the other hand, if ai∗−1 = ai∗ = d(i∗, j∗) and i∗= ∗j, then
S(i∗−1, j∗) is also a feasible segment with maximum density. We can continue the same argument until we have a maximum-density segment S(i, j∗) such that either i− 1 ∈ {n0, n1, . . . , nm−1}, which is handled in Case 1, or i = j∗,
which is handled by steps 7 and 8 of sparse.
• Case 3: i∗− 1 ∈ {n0, n1, . . . , nm−1} and j∗∈ {n1, n2, . . . , nm}. The proof of this case, omitted for brevity, is very similar to that of Case 2.
The theorem is proved.
Acknowledgments. We thank Yi-Hsuan Hsin and Hsu-Cheng Tsai for
discus-sions during the preliminary stage of this research. We thank Tien-Ching Lin for comments.
REFERENCES
[1] N. N. Alexandrov and V. V. Solovyev, Statistical significance of ungapped sequence
align-ments, in Proceedings of the Pacific Symposium on Biocomputing, Maui, HI, 1998, World
Scientific, River Edge, NJ, Vol. 3, pp. 461–470.
[2] G. Barhardi, Isochores and the evolutionary genomics of vertebrates, Gene, 241 (2000), pp. 3– 17.
[3] G. Bernardi and G. Bernardi, Compositional constraints and genome evolution, J. Molec. Evolution, 24 (1986), pp. 1–11.
[4] H. Br¨onnimann and B. Chazelle, Optimal slope selection via cuttings, Comput. Geom., 10 (1998), pp. 23–29.
[5] B. Charlesworth, Genetic recombination: Patterns in the genome, Current Biol., 4 (1994), pp. 182–184.
[6] R. Cole, J. S. Salowe, W. L. Steiger, and E. Szemer´edi, An optimal-time algorithm for
slope selection, SIAM J. Comput., 18 (1989), pp. 792–810.
[7] L. Duret, D. Mouchiroud, and C. Gautier, Statistical analysis of vertebrate sequences
reveals that long genes are scarce in GC-rich isochores, J. Mol. Evol., 40 (1995), pp. 308–
371.
[8] A. Eyre-Walker, Evidence that both G+C rich and G+C poor isochores are replicated early
and late in the cell cycle, Nucleic Acids Res., 20 (1992), pp. 1497–1501.
[9] A. Eyre-Walker, Recombination and mammalian genome evolution, Proc. Roy. Soc. London Ser. B, 252 (1993), pp. 237–243.
[10] J. Filipski, Correlation between molecular clock ticking, codon usage fidelity of DNA
re-pair, chromosome banding and chromatin compactness in germline cells, FEBS Lett., 217
(1987), pp. 184–186.
[11] M. P. Francino and H. Ochman, Isochores result from mutation not selection, Nature, 400 (1999), pp. 30–31.
[12] S. M. Fullerton, A. B. Carvalho, and A. G. Clark, Local rates of recombination are
positively correlated with GC content in the human genome, Mol. Biol. Evol., 18 (2001),
pp. 1139–1142.
[13] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu, Fast algorithms for finding maximum-density
segments of a sequence with applications to bioinformatics, in Proceedings of the Second
International Workshop on Algorithms in Bioinformatics (Rome, Italy, 2002), R. Guig´o and D. Gusfield, eds., Lecture Notes in Comput. Sci. 2452, Springer-Verlag, New York, 2002, pp. 157–171.
[14] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu, Linear-time algorithms for computing
maximum-density sequence segments with bioinformatics applications, J. Comput. System
Sci., to appear.
[15] S. W. Golomb, Run-length encodings, IEEE Trans. Inform. Theory, 12 (1966), pp. 399–401. [16] P. Guldberg, K. Gronbak, A. Aggerholm, A. Platz, P. thor Straten, V. Ahrenkiel,
P. Hokland, and J. Zeuthen, Detection of mutations in GC-rich DNA by bisulphite
denaturing gradient gel electrophoresis, Nucleic Acids Res., 26 (1998), pp. 1548–1549.
[17] W. Henke, K. Herdel, K. Jung, D. Schnorr, and S. A. Loening, Betaine improves the PCR
amplification of GC-rich DNA sequences, Nucleic Acids Res., 25 (1997), pp. 3957–3958.
[18] G. P. Holmquist, Chromosome bands, their chromatin flavors, and their functional features, Am. J. Hum. Genet., 51 (1992), pp. 17–37.
[19] X. Huang, An algorithm for identifying regions of a DNA sequence that satisfy a content
requirement, Comput. Appl. Biosci., 10 (1994), pp. 219–225.
[20] K. Ikehara, F. Amada, S. Yoshida, Y. Mikata, and A. Tanaka, A possible origin of newly
born bacterial genes: Significance of GC-rich nonstop frame on antisense strand, Nucleic
Acids Res., 24 (1996), pp. 4249–4255.
[21] R. B. Inman, A denaturation map of the 1 phage DNA molecule determined by electron
mi-croscopy, J. Mol. Biol., 18 (1966), pp. 464–476.
[22] I. P. Ioshikhes and M. Q. Zhang, Large-scale human promoter mapping using CpG islands, Nature Genet., 26 (2000), pp. 61–63.
[23] R. Jin, M.-E. Fernandez-Beros, and R. P. Novick, Why is the initiation nick site of an
AT-rich rolling circle plasmid at the tip of a GC-rich cruciform?, EMBO J., 16 (1997),
pp. 4456–4466.
[24] M. J. Katz and M. Sharir, Optimal slope selection via expanders, Inform. Process. Lett., 47 (1993), pp. 115–122.
[25] S. K. Kim, Linear-time algorithm for finding a maximum-density segment of a sequence, In-form. Process. Lett., 86 (2003), pp. 339–342.
[26] Y.-L. Lin, X. Huang, T. Jiang, and K.-M. Chao, MAVG: Locating nonoverlapping maximum
average segments in a given sequence, Bioinformatics, 19 (2003), pp. 151–152.
[27] Y.-L. Lin, T. Jiang, and K.-M. Chao, Algorithms for locating the length-constrained heaviest
segments, with applications to biomolecular sequence analysis, J. Comput. System Sci., 65
(2002), pp. 570–586.
[28] G. Macaya, J.-P. Thiery, and G. Bernardi, An approach to the organization of eukaryotic
genomes at a macromolecular level, J. Mol. Biol., 108 (1976), pp. 237–254.
[29] C. S. Madsen, C. P. Regan, and G. K. Owens, Interaction of CArG elements and a GC-rich
repressor element in transcriptional regulation of the smooth muscle myosin heavy chain gene in vascular smooth muscle cells, J. Biol. Chem., 272 (1997), pp. 29842–29851.
[30] J. Matouˇsek, Randomized optimal algorithm for slope selection, Inform. Process. Lett., 39 (1991), pp. 183–187.
[31] S.-I. Murata, P. Herman, and J. R. Lakowicz, Texture analysis of fluorescence lifetime
im-ages of AT- and GC-rich regions in nuclei, J. Histochem. Cytochem., 49 (2001), pp. 1443–
1452.
[32] A. Nekrutenko and W.-H. Li, Assessment of compositional heterogeneity within and between
eukaryotic genomes, Genome Res., 10 (2000), pp. 1986–1995.
[33] U. Ohler, H. Niemann, G. Liao, and G. M. Rubin, Joint modeling of DNA sequence and
physical properties to improve eukaryotic promoter recognition, Bioinformatics, 17 (2001),
pp. S199–S206.
[34] P. Rice, I. Longden, and A. Bleasby, EMBOSS: The European molecular biology open
software suite, Trends Genet., 16 (2000), pp. 276–277.
[35] L. Scotto and R. K. Assoian, A GC-rich domain with bifunctional effects on mRNA and
protein levels: Implications for control of transforming growth factor beta 1 expression,
Mol. Cell. Biol., 13 (1993), pp. 3588–3597.
[36] P. H. Sellers, Pattern recognition in genetic sequences by mismatch density, Bull. Math. Biol., 46 (1984), pp. 501–514.
[37] P. M. Sharp, M. Averof, A. T. Lloyd, G. Matassi, and J. F. Peden, DNA sequence
evolution: The sounds of silence, Philos. Trans. Soc. Lond. B Biol. Sci., 349 (1995), pp. 241–
247.
[38] P. Soriano, M. Meunier-Rotival, and G. Bernardi, The distribution of interspersed repeats
is nonuniform and conserved in the mouse and human genomes, Proc. Natl. Acad. Sci.
USA, 80 (1983), pp. 1816–1820.
[39] N. Stojanovic, L. Florea, C. Riemer, D. Gumucio, J. Slightom, M. Goodman, W. Miller, and R. Hardison, Comparison of five methods for finding conserved
se-quences in multiple alignments of gene regulatory regions, Nucleic Acids Res., 27 (1999),
pp. 3899–3910.
[40] N. Sueoka, Directional mutation pressure and neutral molecular evolution, Proc. Natl. Acad. Sci. USA, 80 (1988), pp. 1816–1820.
[41] Z. Wang, E. Lazarov, M. O’Donnel, and M. F. Goodman, Resolving a fidelity paradox: Why
Escherichia coli DNA polymerase II makes more base substitution errors in At- compared to GC-rich DNA, J. Biol. Chem., 277 (2002), pp. 4446–4454.
[42] K. H. Wolfe, P. M. Sharp, and W.-H. Li, Mutation rates differ among regions of the
mam-malian genome, Nature, 337 (1989), pp. 283–285.
[43] Y. Wu, R. P. Stulp, P. Elfferich, J. Osinga, C. H. Buys, and R. M. Hofstra, Improved
Acids Res., 27 (1999), article e9.
[44] S. Zoubak, O. Clay, and G. Bernardi, The gene distribution of the human genome, Gene, 174 (1996), pp. 95–102.