Fast Algorithms for the Density Finding Problem


DOI 10.1007/s00453-007-9023-8


D.T. Lee · Tien-Ching Lin · Hsueh-I Lu

Received: 27 September 2006 / Accepted: 2 August 2007

Abstract We study the problem of finding a specific density subsequence of a sequence arising from the analysis of biomolecular sequences. Given a sequence $A = (a_1, w_1), (a_2, w_2), \ldots, (a_n, w_n)$ of $n$ ordered pairs $(a_i, w_i)$ of numbers $a_i$ and widths $w_i > 0$ for each $1 \le i \le n$, two nonnegative numbers $\ell$, $u$ with $\ell \le u$, and a number $\delta$, the DENSITY FINDING PROBLEM is to find the consecutive subsequence $A(i^*, j^*)$ over all $O(n^2)$ consecutive subsequences $A(i, j)$ satisfying the width constraint $\ell \le w(i, j) = \sum_{r=i}^{j} w_r \le u$ such that its density $d(i^*, j^*) = \sum_{r=i^*}^{j^*} a_r / w(i^*, j^*)$ is closest to $\delta$. The extensively studied MAXIMUM-DENSITY SEGMENT PROBLEM is a special case of the DENSITY FINDING PROBLEM with $\delta = \infty$. We show that the DENSITY FINDING PROBLEM has a lower bound of $\Omega(n \log n)$ in the algebraic decision tree model of computation. We give an algorithm for the DENSITY FINDING PROBLEM that runs in optimal $O(n \log n)$ time and $O(n \log n)$ space for the case when there is no upper bound on the width of the sequence, i.e., $u = w(1, n)$. For the general case, we give an algorithm that runs in $O(n \log^2 m)$ time and $O(n + m \log m)$ space, where

Grants NSC95-2221-E-001-016-MY3, NSC-94-2422-H-001-0001, and

NSC-95-2752-E-002-005-PAE, and by the Taiwan Information Security Center (TWISC) under the Grants NSC NSC95-2218-E-001-001, NSC95-3114-P-001-002-Y, NSC94-3114-P-001-003-Y and NSC 94-3114-P-011-001.

D.T. Lee · T.-C. Lin (✉) · H.-I Lu

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

e-mail: kero@iis.sinica.edu.tw

D.T. Lee

e-mail: dtlee@csie.ntu.edu.tw

H.-I Lu

e-mail: hil@csie.ntu.edu.tw

D.T. Lee · T.-C. Lin

Institute of Information Science, Academia Sinica, Nankang, Taipei 115, Taiwan


$m = \min\{\lfloor (u - \ell)/w_{\min} \rfloor, n\}$ and $w_{\min} = \min_{r=1}^{n} w_r$. As a byproduct, we give another $O(n)$ time and space algorithm for the MAXIMUM-DENSITY SEGMENT PROBLEM.

Keywords Maximum-density segment problem · Density finding problem · Slope selection problem · Convex hull · Computational geometry · GC content · DNA sequence · Bioinformatics

1 Introduction

Let $A$ be a sequence of $n$ ordered pairs of numbers $(a_1, w_1), (a_2, w_2), \ldots, (a_n, w_n)$ with $w_i > 0$ for each $i$. A segment $A(i, j) = (a_i, w_i), (a_{i+1}, w_{i+1}), \ldots, (a_j, w_j)$ is a consecutive subsequence of $A$ starting with index $i$ and ending with index $j$. The width $w(i, j)$ of segment $A(i, j)$ is $w_i + w_{i+1} + \cdots + w_j$. The density $d(i, j)$ of segment $A(i, j)$ is $(a_i + a_{i+1} + \cdots + a_j)/w(i, j)$. Given a sequence $A$ of $n$ ordered pairs of numbers, a density $\delta$, and width bounds $\ell$ and $u$ with $\ell \le u$, the DENSITY FINDING PROBLEM is to find the feasible segment $A(i^*, j^*)$ that minimizes $|d(i, j) - \delta|$. We say that a segment $A(i, j)$ is feasible if its width satisfies $\ell \le w(i, j) \le u$.
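As a reference point for the definitions above, the problem can be solved directly by enumerating all $O(n^2)$ segments. The following is a minimal quadratic-time sketch, not one of the algorithms of this paper; the function name `density_finding_bruteforce` and the parameter names `lo` and `hi` (standing for $\ell$ and $u$) are illustrative assumptions.

```python
from itertools import accumulate


def density_finding_bruteforce(pairs, lo, hi, delta):
    """Return (i, j, density) for the feasible segment A(i, j), 1-indexed,
    whose density is closest to delta, or None if no segment is feasible.

    pairs is a list of (a_i, w_i) with w_i > 0; lo and hi play the roles
    of the width bounds l and u. Runs in O(n^2) time.
    """
    a_prefix = [0] + list(accumulate(a for a, _ in pairs))
    w_prefix = [0] + list(accumulate(w for _, w in pairs))
    n = len(pairs)
    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            width = w_prefix[j] - w_prefix[i - 1]
            if lo <= width <= hi:
                d = (a_prefix[j] - a_prefix[i - 1]) / width
                # Strict comparison keeps the first segment found on ties.
                if best is None or abs(d - delta) < abs(best[2] - delta):
                    best = (i, j, d)
    return best
```

Such a baseline is mainly useful for checking faster implementations on small inputs.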

The density finding problem with $\delta = \infty$ is exactly the extensively studied MAXIMUM-DENSITY SEGMENT PROBLEM [1–9], which arises from the investigation of non-uniformity of nucleotide composition within genomic sequences. For the uniform case, where all $w_i$'s are identical for $1 \le i \le n$, Nekrutenko and Li [8] and Rice, Longden, and Bleasby [9] presented algorithms for the case $u = \ell$, which is trivially solvable in $O(n)$ time. If $u \ne \ell$, this problem can be easily solved in $O(n(u - \ell))$ time, linear in the number of feasible segments. Huang [4] studied the special case $u = n$ and observed that an optimal segment exists with width at most $2\ell - 1$. Therefore, this case is equivalent to the case with $u = 2\ell - 1$ and can be solved in $O(n\ell)$ time. Lin, Jiang, and Chao [6] gave an $O(n \log \ell)$ time algorithm based on right-skew decomposition. For the general case when $u < n$, Goldwasser, Kao, and Lu [2,3] gave an $O(n)$ time algorithm. Without loss of generality we shall assume $w_i = 1$ for the uniform case in this paper. For the nonuniform case, Goldwasser, Kao, and Lu [2] also gave an $O(n \log(u - \ell))$ time algorithm. Kim [5] gave a clever $O(n)$ time algorithm based on a geometric interpretation of the problem, which transforms the finding of a maximum-density segment into the finding of a maximum-slope line segment in computational geometry. Unfortunately, Kim's algorithm has a flaw, as pointed out by Chung and Lu [1], which causes it to run in $O(n^2)$ time in the worst case. Chung and Lu [1] recently gave an $O(n)$ time algorithm bypassing the complicated right-skew decomposition. Based on Kim's geometric technique and the decomposability of tangent queries to convex hulls, we fix the flaw of Kim's algorithm and give yet another $O(n)$ time algorithm for the nonuniform case.

The MAXIMUM-DENSITY SEGMENT PROBLEM becomes the problem of finding the most GC-rich region of a given DNA sequence when we let the input sequence $A = (a_1, w_1), (a_2, w_2), \ldots, (a_n, w_n)$ correspond to a given DNA sequence with uniform width such that $a_i = 1$ if the corresponding nucleotide in the DNA sequence is G or C, and $a_i = 0$ if the corresponding nucleotide in the DNA sequence is A or T. It is obvious that the output segment corresponds to the most GC-rich region of the given DNA sequence.
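The encoding just described can be sketched in a few lines. The helper names `gc_pairs` and `density` are illustrative assumptions, and unit widths are used as in the uniform case.

```python
def gc_pairs(dna):
    """Map a DNA string to unit-width pairs (a_i, w_i):
    a_i = 1 if the nucleotide is G or C, and a_i = 0 otherwise."""
    return [(1 if base in "GC" else 0, 1) for base in dna.upper()]


def density(pairs, i, j):
    """Density d(i, j) of the segment A(i, j), with 1-indexed inclusive bounds."""
    segment = pairs[i - 1:j]
    return sum(a for a, _ in segment) / sum(w for _, w in segment)
```

For the string "GATTACA", for instance, the segment covering the last two nucleotides "CA" has density 1/2, which is its GC-ratio.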


The DENSITY FINDING PROBLEM for $A$ with respect to density $\delta$ was motivated by the results due to Ioshikhes and Zhang [10], and to Ohler et al. [11], who used a specific GC-ratio as one of the parameters in their programs that locate promoter regions in large-scale genomic analysis.

The rest of the paper is organized as follows. Section 2 reduces the DENSITY FINDING PROBLEM to a geometric SLOPE FINDING PROBLEM. Section 3 gives our density finding algorithm and a proof of its optimality. Section 4 gives an optimal algorithm for finding the maximum-density segment. Section 5 concludes this paper with some future research directions.

2 Reduction to a Geometric Slope Finding Problem

We first briefly review some basic concepts of computational geometry used in this paper. We say that a point $p$ is below (resp. above) a line $l$ or a line segment $\overline{uv}$ if the $y$-coordinate of $p$ is less (resp. greater) than that of the point $p_l$, which is the intersection of the vertical line passing through $p$ with line $l$ or with line segment $\overline{uv}$. The point $p_l$ is called the vertical projection of $p$ on $l$ or on $\overline{uv}$. A polygonal chain $C$ is a sequence of points $c_1, c_2, \ldots, c_t$ such that $\overline{c_i c_{i+1}}$, $i = 1, 2, \ldots, t - 1$, is a line segment, or an edge. A polygonal chain $C$ is said to be monotone with respect to a straight line $l$ if any line perpendicular to $l$ intersects $C$ at most once. If the line perpendicular to $l$ passes through point $c_i$, then the intersection point $q_i$ on $l$ is called the orthogonal projection of $c_i$ on $l$. Similarly, a point $p$ is said to be below (resp. above) a polygonal chain $C$ monotone with respect to the $x$-axis if $p$ is below (resp. above) an edge of $C$. The upper hull of a point set $P = \{p_0, p_1, \ldots, p_n\}$ in the plane, denoted $\mathrm{UH}(P)$, is a polygonal chain monotone with respect to the $x$-axis such that any point of $P$ is either on $\mathrm{UH}(P)$ or below $\mathrm{UH}(P)$. That is, $\mathrm{UH}(P)$ is a sequence of hull points $p_{h_1}, p_{h_2}, \ldots, p_{h_t}$, where $p_{h_i} \in P$, such that for every point $p_i = (x_i, y_i) \in P$ the vertical line $\nu_i$ passing through $p_i$ intersects the sequence of hull edges $\overline{p_{h_1} p_{h_2}}, \overline{p_{h_2} p_{h_3}}, \ldots, \overline{p_{h_{t-1}} p_{h_t}}$ exactly once and the intersection point $q_i$ has $y$-coordinate no less than $y_i$. Let $p_{h_m}$ be the hull point in $\mathrm{UH}(P)$ with the greatest $y$-coordinate; if the hull points in $\mathrm{UH}(P)$ with the greatest $y$-coordinate are not unique, we set $p_{h_m}$ to be the rightmost point among them. We define the left branch $L(P)$ (resp. right branch $R(P)$) of $\mathrm{UH}(P)$ to be the sequence of hull points $p_{h_1}, p_{h_2}, \ldots, p_{h_m}$ (resp. $p_{h_m}, p_{h_{m+1}}, \ldots, p_{h_t}$).
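To make the definitions of the upper hull and its two branches concrete, the following sketch computes them with the standard monotone-chain scan. It is an illustration under the assumption that all $x$-coordinates are distinct (as is the case for the prefix-sum points used later); the names `upper_hull` and `branches` are hypothetical and do not appear in the paper.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper hull of points with distinct x-coordinates, from left to right."""
    hull = []
    for p in sorted(points):
        # Keep only right turns; pop the middle point on a left turn or a
        # collinear triple, since it then lies on or below the chain.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def branches(hull):
    """Split the hull at its highest point (rightmost on ties) into the
    left branch and the right branch; both contain the highest point."""
    top = max(range(len(hull)), key=lambda k: (hull[k][1], hull[k][0]))
    return hull[:top + 1], hull[top:]
```

For the points (0,0), (1,2), (2,1), (3,3), (4,0), the point (2,1) lies below the chain and is discarded, and the hull is split at (3,3).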

Given an upper hull $\mathrm{UH}(P)$ and a point $q \notin P$ external to $\mathrm{UH}(P)$, the tangent segment of $\mathrm{UH}(P)$ from $q$ is a line segment $l_i = \overline{p_{h_i} q}$ passing through a hull point $p_{h_i}$ on $\mathrm{UH}(P)$ such that all points of $P$ lie on or below the line containing $l_i$, and $p_{h_i}$ is called the tangent point of $\mathrm{UH}(P)$ from $q$. Given any hull point $p_{h_r} \in \mathrm{UH}(P)$, if $p_{h_r}$ is not the tangent point of $\mathrm{UH}(P)$ from $q$ and $\overline{p_{h_r} q}$ does not intersect the region below $\mathrm{UH}(P)$, then $p_{h_r}$ is called a reflex point of $\mathrm{UH}(P)$ from $q$. If $p_{h_r}$ is neither a tangent point nor a reflex point from $q$, then it is called a concave point of $\mathrm{UH}(P)$ from $q$. It is easy to see that testing whether a hull point on $\mathrm{UH}(P)$ is tangent, reflex or concave from a point $q$ not in $P$ can be done in $O(1)$ time, assuming all the hull points are maintained as a doubly-linked list and the hull points before and after a given hull point can be accessed in $O(1)$ time. An example of the upper hull is shown in Fig. 1.
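The constant-time test can be sketched as follows, under the additional assumption (which holds for the way the test is used later, with $q = p_j$) that $q$ lies strictly to the right of every hull point. A hull point is then concave if its successor lies above the line through it and $q$, reflex if only its predecessor does, and tangent otherwise. The function name `classify` and the list-based hull representation are illustrative assumptions.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def classify(hull, k, q):
    """Classify hull[k] as 'tangent', 'reflex' or 'concave' from q in O(1).

    hull is an upper hull listed from left to right, and q is assumed to
    lie strictly to the right of all hull points.
    """
    h = hull[k]
    # A point is above the directed line h -> q iff it is to its left.
    succ_above = k + 1 < len(hull) and cross(h, q, hull[k + 1]) > 0
    pred_above = k > 0 and cross(h, q, hull[k - 1]) > 0
    if succ_above:
        return "concave"   # segment h-q passes below the next hull point
    if pred_above:
        return "reflex"    # visible from q, but the line cuts the hull
    return "tangent"       # both neighbours lie on or below the line
```

With the hull (0,0), (1,2), (3,3), (4,0) and $q = (6,4)$, the point (3,3) is the tangent point, (4,0) is reflex, and the two leftmost points are concave, matching the left-to-right pattern concave, tangent, reflex illustrated in Fig. 1.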


Fig. 1 The upper hull $\mathrm{UH}(P) = p_{h_1}, p_{h_2}, \ldots, p_{h_t}$. The left branch $L(P)$ (resp. right branch $R(P)$) of $\mathrm{UH}(P)$ is $p_{h_1}, p_{h_2}, \ldots, p_{h_m}$ (resp. $p_{h_m}, p_{h_{m+1}}, \ldots, p_{h_t}$). $p_{h_3}$ is the tangent point on $\mathrm{UH}(P)$ from $q$, $\{p_{h_4}, p_{h_5}, \ldots, p_{h_t}\}$ are reflex points on $\mathrm{UH}(P)$ from $q$, and $\{p_{h_1}, p_{h_2}\}$ are concave points on $\mathrm{UH}(P)$ from $q$

We shall transform the DENSITY FINDING PROBLEM into a geometric SLOPE FINDING PROBLEM as follows. We define the point set $P = \{p_0, p_1, \ldots, p_n\}$ in $\mathbb{R}^2$ according to the prefix sums of the sequence $A$, where $p_i = (x_i, y_i) = (\sum_{k=1}^{i} w_k, \sum_{k=1}^{i} a_k)$ for $i = 1, 2, \ldots, n$ and $p_0 = (0, 0)$. To simplify the presentation, we assume that the points are in general position, meaning that no three points are collinear and no two lines induced by pairs of points have the same slope. It is easy to see that the slope $m(i, j)$ of the line segment $s(i, j)$ connecting $p_i$ and $p_j$ is equal to the density $d(i + 1, j)$ of the segment $A(i + 1, j)$, so we define a line segment $s(i, j)$ to be feasible if its corresponding segment $A(i + 1, j)$ is feasible, i.e., $s(i, j)$ is feasible if $\ell \le w(i + 1, j) \le u$. Let $F(\delta)$ be the set of all feasible line segments of $P$. Let $F^+(\delta) = \{s(i, j) \in F(\delta) \mid m(i, j) \ge \delta\}$ and $F^-(\delta) = \{s(i, j) \in F(\delta) \mid m(i, j) \le \delta\}$ denote the sets of all feasible line segments of $P$ with slopes no less than and no greater than $\delta$, respectively. Without loss of generality, we may assume $\delta = 0$ for the DENSITY FINDING PROBLEM, since the density finding problem for sequence $A$ with respect to density $\delta$ and width bounds $\ell$ and $u$ is equivalent to the same problem for the sequence $B$ of $n$ ordered pairs $(b_i, w_i)$ with respect to density 0 and the same width bounds, where $b_i = a_i - \delta w_i$ for each $i = 1, 2, \ldots, n$. Clearly, we can further restrict our attention to segments in $F^+(0)$, since the segments in $F^-(0)$ can be converted to those in $F^+(0)$ by replacing $b_i$ with $-b_i$ for each $i = 1, 2, \ldots, n$. Therefore, it suffices to consider the following geometric SLOPE FINDING PROBLEM.
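The reduction can be checked numerically with a small sketch. The helper names `prefix_points` and `slope` are illustrative assumptions; the optional `delta` argument applies the shift $b_i = a_i - \delta w_i$ before taking prefix sums, so that slopes measure the signed distance of a density from $\delta$.

```python
from itertools import accumulate


def prefix_points(pairs, delta=0.0):
    """Points p_0, ..., p_n with p_i = (sum of w_k, sum of (a_k - delta * w_k))
    over k = 1..i, and p_0 = (0, 0)."""
    xs = [0] + list(accumulate(w for _, w in pairs))
    ys = [0] + list(accumulate(a - delta * w for a, w in pairs))
    return list(zip(xs, ys))


def slope(points, i, j):
    """Slope m(i, j) of the segment joining p_i and p_j, for i < j."""
    (xi, yi), (xj, yj) = points[i], points[j]
    return (yj - yi) / (xj - xi)
```

For $A = (1,2), (3,1), (0,1)$, the slope $m(1,3)$ equals the density $d(2,3) = 3/2$, and after shifting by $\delta = 3/2$ the same slope becomes 0.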

Given a set of points $P = \{p_0, p_1, \ldots, p_n\}$ in $\mathbb{R}^2$, where $p_i = (x_i, y_i)$ as defined earlier, and two width bounds $\ell$, $u$ with $\ell \le u$, find the feasible line segment $s(i^*, j^*)$ in $F^+(0)$ that minimizes $m(i, j)$.

Let $P_{a,b}$ denote the subset $\{p_a, p_{a+1}, \ldots, p_b\}$ of $P$ starting with left index $a$ and ending with right index $b$. The indices $a$ and $b$ are assumed to be in $[0, n]$: if $a < 0$, we take $a = 0$; if $b > n$, we take $b = n$. If $a > b$, we define $P_{a,b}$ to be the empty set.

For each point $p_j$ we have a set of all feasible points $P_{c_j,d_j} = \{p_{c_j}, p_{c_j+1}, \ldots, p_{d_j}\}$ such that each $p_i \in P_{c_j,d_j}$ satisfies $\ell \le x_j - x_i = w(i + 1, j) \le u$. When no confusion arises, we shall for simplicity denote $P_{c_j,d_j}$ as $P_j$. That is, $P_j$ is the subset of $P$ such that the line segment $s(i, j)$ is feasible for each $p_i \in P_j$. Since the sequence $\{x_j\}_{j=1}^{n}$ of the $x$-coordinates of $P$ is monotonically increasing, the left and right index sequences $\{c_j\}_{j=1}^{n}$ and $\{d_j\}_{j=1}^{n}$ are non-decreasing. Therefore, we can obtain the sequences $\{c_j\}_{j=1}^{n}$ and $\{d_j\}_{j=1}^{n}$ by a linear scan of the sequence $\{x_j\}_{j=1}^{n}$. Let $P_j^+ = \{p_i \in P_j \mid m(i, j) \ge 0\}$. For each $j$, we define $i_j$ to be the index of the point $p_{i_j} \in P_j^+$ such that $m(i, j)$ is minimized. That is, we have $m(i_j, j) = \min\{m(i, j) \mid p_i \in P_j^+\}$. Let $i^*$ and $j^*$ be the pair of indices that minimizes $m(i_j, j)$. That is, we have $m(i^*, j^*) = \min\{m(i_j, j) \mid 1 \le j \le n\}$. The geometric SLOPE FINDING PROBLEM can now be reduced to that of finding the feasible point $p_{i_j}$ in $P_j^+$ that minimizes $m(i, j)$ for each $j$, and then selecting the minimum among all $m(i_j, j)$'s. Let $p_{t_j}$ be the tangent point of $\mathrm{UH}(P_j^+)$ from $p_j$. It is not difficult to see that $m(t_j, j) = \min\{m(i, j) \mid p_i \in P_j^+\} = m(i_j, j)$. Therefore, if we can find the tangent point $p_{t_j}$ of $\mathrm{UH}(P_j^+)$ from $p_j$ for each $1 \le j \le n$, we can solve the geometric SLOPE FINDING PROBLEM.
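The linear scan for the index sequences and the per-$j$ minimization can be sketched as follows. This is a quadratic stand-in that inspects every point of $P_j^+$ instead of performing a tangent query; the names `construct_index` and `slope_finding`, the convention $d_j = -1$ for an empty feasible set, and the parameters `lo` and `hi` (for $\ell$ and $u$) are assumptions of this sketch.

```python
def construct_index(xs, lo, hi):
    """Two-pointer scan over x_0 < x_1 < ... < x_n.

    Returns lists c, d such that p_i is feasible for p_j exactly when
    c[j] <= i <= d[j]; the range is empty when d[j] < c[j].
    """
    n = len(xs) - 1
    c, d = [0] * (n + 1), [-1] * (n + 1)
    ci, di = 0, -1
    for j in range(n + 1):
        while xs[j] - xs[ci] > hi:                       # too wide: drop left end
            ci += 1
        while di + 1 < j and xs[j] - xs[di + 1] >= lo:   # wide enough: extend right end
            di += 1
        c[j], d[j] = ci, di
    return c, d


def slope_finding(points, lo, hi):
    """Return (m, i, j) with the smallest nonnegative feasible slope m(i, j),
    or None if no feasible segment has nonnegative slope."""
    xs = [x for x, _ in points]
    c, d = construct_index(xs, lo, hi)
    best = None
    for j in range(len(points)):
        for i in range(c[j], d[j] + 1):
            m = (points[j][1] - points[i][1]) / (xs[j] - xs[i])
            if m >= 0 and (best is None or m < best[0]):
                best = (m, i, j)
    return best
```

Both pointers only move forward, so `construct_index` takes $O(n)$ time in total, which is the property the linear scan relies on.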

3 Algorithm for Density Finding Problem

We first show that the DENSITY FINDING PROBLEM has a lower bound of $\Omega(n \log n)$ in the algebraic decision tree model of computation. The ELEMENT UNIQUENESS PROBLEM is to determine whether $n$ positive numbers $z_1, z_2, \ldots, z_n$ are all distinct. It is known that the ELEMENT UNIQUENESS PROBLEM has a lower bound of $\Omega(n \log n)$ in the algebraic decision tree model of computation [12]. We can transform an instance of the ELEMENT UNIQUENESS PROBLEM to an instance of the DENSITY FINDING PROBLEM for the uniform case with $\ell = 1$, $u = n$ and $\delta = 0$ in $O(n)$ time by letting $a_1 = z_1$ and $a_i = z_i - z_{i-1}$ for $i = 2, \ldots, n$. With $z_0 = 0$, the density of the output segment $A(i, j)$ is $d(i, j) = (a_i + a_{i+1} + \cdots + a_j)/(j - i + 1) = (z_j - z_{i-1})/(j - i + 1)$, which is not equal to 0 if and only if $z_1, z_2, \ldots, z_n$ are all distinct. Therefore, the DENSITY FINDING PROBLEM has a lower bound of $\Omega(n \log n)$ in the algebraic decision tree model of computation.

Lemma 1 The DENSITY FINDING PROBLEM for the uniform case with $\ell = 1$, $u = n$ and $\delta = 0$ has a lower bound of $\Omega(n \log n)$ in the algebraic decision tree model of computation.
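The reduction behind Lemma 1 can be illustrated by the following sketch. The density finding step is replaced here by a quadratic brute-force stand-in, so the code only demonstrates the correctness of the transformation, not its running time; the function names are illustrative assumptions, and the input is assumed to be a nonempty list of positive numbers.

```python
def uniqueness_to_density_instance(z):
    """Build the uniform-width instance a_1 = z_1, a_i = z_i - z_{i-1}."""
    a = [z[0]] + [z[k] - z[k - 1] for k in range(1, len(z))]
    return [(ai, 1) for ai in a]


def all_distinct_via_density(z):
    """Decide element uniqueness from the density closest to delta = 0,
    with width bounds l = 1 and u = n (every segment is feasible)."""
    pairs = uniqueness_to_density_instance(z)
    n = len(pairs)
    prefix = [0]
    for a, _ in pairs:
        prefix.append(prefix[-1] + a)   # prefix[k] equals z_k, and prefix[0] = 0
    closest = min(
        abs(prefix[j] - prefix[i - 1]) / (j - i + 1)
        for i in range(1, n + 1)
        for j in range(i, n + 1)
    )
    return closest != 0
```

For example, the input 3, 1, 3 yields the segment $A(2, 3)$ with density $(z_3 - z_1)/2 = 0$, revealing the repeated value.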

For each $p_i = (x_i, y_i) \in P_j$, we let $P_j^i = \{p_r \in P_j \mid y_r \le y_i\}$ be the subset of $P_j$ whose $y$-coordinates are no greater than the $y$-coordinate of $p_i$. The following lemma, Lemma 2, shows that if we can find the point $p_{k_j}$ in $P_j$ with the largest $y$-coordinate no greater than that of $p_j$, then we can obtain $p_{i_j}$ by finding the tangent point on the left branch of $\mathrm{UH}(P_j^{k_j})$ from $p_j$. Therefore, once we maintain a data structure of the left branch of $\mathrm{UH}(P_j^i)$ for each $p_i \in P_j$, we can find the tangent point on the left branch of $\mathrm{UH}(P_j^{k_j})$ from $p_j$.

Lemma 2 Let $p_{k_j}$ be the point in $P_j$ with the largest $y$-coordinate no greater than that of $p_j$. Then $P_j^+ = P_j^{k_j}$ and $p_{i_j}$ must be a tangent point on the left branch $L(P_j^{k_j})$ of the upper hull $\mathrm{UH}(P_j^{k_j})$.


Proof Since $p_{k_j}$ is the point in $P_j$ with the largest $y$-coordinate no greater than that of $p_j$, we have $m(i, j) < 0$ for each $p_i \in P_j \setminus P_j^{k_j}$ and $m(i, j) \ge 0$ for each $p_i \in P_j^{k_j}$. It means $P_j^+ = P_j^{k_j}$. We also have $x_{k_j} < x_i < x_j$ and $y_i \le y_{k_j}$ for each $p_i$ on the right branch $R(P_j^{k_j})$ of $\mathrm{UH}(P_j^{k_j})$. This implies

$$m(k_j, j) = \frac{y_j - y_{k_j}}{x_j - x_{k_j}} < \frac{y_j - y_{k_j}}{x_j - x_i} \le \frac{y_j - y_i}{x_j - x_i} = m(i, j).$$

Therefore, $p_{i_j}$ must be a tangent point on the left branch $L(P_j^{k_j})$ of $\mathrm{UH}(P_j^{k_j})$.

3.1 Special Case when u= w(1, n)

Our density finding algorithm for the case $u = w(1, n)$ iterates from $j = 1$ to $n$ to find the tangent point $p_{t_j}$ on the left branch of $\mathrm{UH}(P_j^+)$ from $p_j$, and is described as follows. At the beginning of iteration $j$, we have available the set of all feasible points, $P_{c_j,d_j}$ (abbreviated as $P_j$ for convenience), and maintain some data structures on $P_j$ such that we can make a predecessor query to obtain $p_{k_j}$ and a tangent query to obtain $p_{t_j}$. At the end of the iteration we update $P_j$ to $P_{j+1}$ by inserting points $p_{d_j+1}, \ldots, p_{d_{j+1}}$ one at a time from left to right. In this special case, since $u = w(1, n)$, all $c_j$'s are identical, i.e., $c_j = 1$ for all $j$. We maintain two dynamic data structures on $P_j$ for the above purposes: a balanced binary search tree $T(P_j)$ [13], which stores all points of $P_j$ in ascending order of $y$-coordinate to support both the predecessor query for finding the point $p_{k_j}$ in $P_j$ and the insertion operations; and a left branch data structure $\mathcal{L}(P_j)$, which stores the left branch $L(P_j^i)$ for each $p_i$, $c_j \le i \le d_j$, in $P_j$ by a singly linked list in descending $x$-coordinates to support the tangent query for finding the tangent point $p_{t_j}$ on the left branch $L(P_j^{k_j})$ from $p_j$. Updates from $T(P_j)$ to $T(P_{j+1})$ by insertions of points into the balanced binary search tree can be done in a straightforward manner. We briefly describe the update of the left branch data structure from $\mathcal{L}(P_j)$ to $\mathcal{L}(P_{j+1})$ by inserting each point $p_i \in P_{d_j+1,d_{j+1}}$ into $\mathcal{L}(P_j)$ in $O(\log n)$ time as follows. We need to construct the left branch $L(P_{j+1}^i)$ for each $p_i$ in $P_{j+1}$. We first find the point $p_{k_i}$ in $P_j$ with the largest $y$-coordinate less than that of $p_i$ by a predecessor query to $T(P_j)$. If $p_{k_i}$ exists, then we can obtain the tangent point $p_{t_i}$ on $L(P_j^{k_i}) = p_{h_1}, p_{h_2}, \ldots, p_{h_m}$ from $p_i$ by a tangent query (which will be described later) to $\mathcal{L}(P_j^{k_i})$. After finding the tangent point $p_{t_i} = p_{h_v} \in L(P_j^{k_i})$ for some $h_v$, we insert $p_i$ into $\mathcal{L}(P_j)$, make a link from $p_i$ to $p_{h_v}$, and set $L(P_{j+1}^i) = p_{h_1}, p_{h_2}, \ldots, p_{h_v}, p_i$. But if $p_{k_i}$ doesn't exist, we just insert $p_i$ into $\mathcal{L}(P_j)$ and set $L(P_{j+1}^i) = p_i$. An example of inserting a point $p_i$ into $\mathcal{L}(P_j)$ is shown in Fig. 2.

To ensure the correctness of the construction of $\mathcal{L}(P_{j+1})$, we need to show that $L(P_{j+1}^i)$ is indeed equal to the sequence $H_i = p_{h_1}, p_{h_2}, \ldots, p_{h_v}, p_i$. It is easy to see that $H_i$ is a sequence of hull points of $\mathrm{UH}(P_{j+1}^i)$ and $p_i$ is the hull point in $\mathrm{UH}(P_{j+1}^i)$ with the greatest $y$-coordinate. Therefore, we have $L(P_{j+1}^i) = H_i$. The correctness of the construction then follows.

In order to support the tangent query in $O(\log n)$ time to every $L(P_{j+1}^i)$ of $\mathcal{L}(P_{j+1})$, we need to add a few jumping pointers to $L(P_{j+1}^i)$ so that we can perform binary search in logarithmic time. We denote this new data structure for $L(P_{j+1}^i)$ that is augmented with jumping pointers by $\mathcal{L}(P_{j+1}^i)$. For each point in $\mathcal{L}(P_{j+1})$, we maintain an array of size $\log_2 n$ to store the jumping pointers. We let $\mathcal{L}(P_{j+1}^i)^{(r)}[i]$ denote the index of the hull point on $L(P_{j+1}^i)$ to the left of $p_i$ with link distance $2^r$ away from $p_i$ for $0 \le r \le \log_2 n$. It can be defined recursively using the following formula: $\mathcal{L}(P_{j+1}^i)^{(0)}[i] = t_i$ and $\mathcal{L}(P_{j+1}^i)^{(r+1)}[i] = \max\{\mathcal{L}(P_{j+1}^i)^{(r)}[\mathcal{L}(P_{j+1}^i)^{(r)}[i]], 0\}$. Note that $p_{t_i}$, the tangent point from $p_i$, is one ($= 2^0$) link distance away from point $p_i$.

Fig. 2 Insert point $p_i$ into $\mathcal{L}(P_j)$ to obtain $\mathcal{L}(P_{j+1}^i)$. We first find the point $p_{k_i}$ in $P_j$ with the largest $y$-coordinate no greater than that of $p_i$ by a predecessor query to $T(P_j)$, and then we can obtain the tangent point $p_{t_i}$ on $L(P_j^{k_i})$ from $p_i$ by a tangent query to $\mathcal{L}(P_j^{k_i})$. After finding the tangent point $p_{t_i}$ we make a link from $p_i$ to $p_{t_i}$ to get $L(P_{j+1}^i)$

The operation of tangent queries in $\mathcal{L}(P_j)$ from $p_j$ is implemented by a binary search in $O(\log n)$ time as follows. Let $p_{k_j}$ be the point in $P_j$ with the largest $y$-coordinate less than that of $p_j$, obtained from the predecessor query to $T(P_j)$. We do binary search in the array $\mathcal{L}(P_j^{k_j})^{(r)}[k_j]$, $0 \le r \le \log_2 n$. When we search a point $p_s$ in this array, if $p_s$ on $L(P_j^{k_j})$ is a reflex point from $p_j$, we search forward, and if $p_s$ is a concave point from $p_j$, we search backward, until we find the tangent point $p_{t_j}$ from $p_j$. A detailed description of the tangent query to $\mathcal{L}(P_j^{k_j})$ is shown in the pseudo code below.

Algorithm DENSITYFINDINGPROBLEM($P$, $\ell$)

Input: a set of points $P = \{p_0, p_1, \ldots, p_n\}$ in $\mathbb{R}^2$ and a nonnegative number $\ell$.

Output: the feasible line segment $s(i^*, j^*)$ such that $m(i^*, j^*) = \min_{s(i,j) \in F^+(0)} m(i, j)$.

$j \leftarrow 0$; $P_j \leftarrow \emptyset$;
$(\{c_i\}_{i=1}^{n+1}, \{d_i\}_{i=1}^{n+1}) \leftarrow$ CONSTRUCTINDEX($\{x_i\}_{i=1}^{n}$);
for $j = 0$ to $n$ do
    $p_{k_j} \leftarrow$ point in $T(P_j)$ with the largest $y$-coord. less than $p_j$ by a predecessor query;
    $p_{t_j} \leftarrow$ TANGENTQUERY($\mathcal{L}(P_j^{k_j})$, $\log_2 n$, $k_j$, $p_j$);
    $m(i_j, j) \leftarrow m(t_j, j)$;
    for $i = d_j + 1$ to $d_{j+1}$ do
        $T(P_{j+1}) \leftarrow$ insert point $p_i$ into $T(P_j)$;
        $\mathcal{L}(P_{j+1}) \leftarrow$ INSERTPOINT($\mathcal{L}(P_j)$, $p_i$);
$m(i^*, j^*) \leftarrow \min_{0 \le j \le n} m(i_j, j)$;
return $s(i^*, j^*)$;


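Function CONSTRUCTINDEX($\{x_i\}_{i=1}^{n}$).

Input: sequence $\{x_i\}_{i=1}^{n}$.

Output: sequences $\{c_i\}_{i=1}^{n+1}$ and $\{d_i\}_{i=1}^{n+1}$.

$c_0 \leftarrow 0$; $d_0 \leftarrow 0$; $c_{n+1} \leftarrow 0$; $d_{n+1} \leftarrow 0$;
for $j = 1$ to $n$ do
    $c_j \leftarrow c_{j-1}$;
    while $x_j - x_{c_j+1} > u$ do $c_j \leftarrow c_j + 1$;
    $d_j \leftarrow d_{j-1}$;
    while $x_j - x_{d_j+1} \ge \ell$ do $d_j \leftarrow d_j + 1$;
return $(\{c_i\}_{i=1}^{n+1}, \{d_i\}_{i=1}^{n+1})$;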

Function TANGENTQUERY($\mathcal{L}(P_j^{k_j})$, $r$, $t$, $p_j$).

Input: data structure $\mathcal{L}(P_j^{k_j})$ of left branch $L(P_j^{k_j})$, order $r$, index $t$, and point $p_j$.

Output: the tangent point on $L(P_j^{k_j})$ from $p_j$.

if $r \ge 0$
    $s \leftarrow \mathcal{L}(P_j^{k_j})^{(r)}[t]$;
    if $p_s$ is tangent on $L(P_j^{k_j})$ from $p_j$ return $p_s$;
    if $p_s$ is reflex on $L(P_j^{k_j})$ from $p_j$ return TANGENTQUERY($\mathcal{L}(P_j^{k_j})$, $r - 1$, $s$, $p_j$);
    if $p_s$ is concave on $L(P_j^{k_j})$ from $p_j$ return TANGENTQUERY($\mathcal{L}(P_j^{k_j})$, $r - 1$, $t$, $p_j$);

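Function INSERTPOINT($\mathcal{L}(P_j)$, $p_i$).

Input: data structure $\mathcal{L}(P_j)$ of left branch $L(P_j)$ and point $p_i$.

$p_{k_i} \leftarrow$ point with the largest $y$-coord. less than $p_i$ in $T(P_j)$ by predecessor query;
if $p_{k_i}$ doesn't exist
    then insert $p_i$ into $\mathcal{L}(P_j)$;
else
    $p_{t_i} \leftarrow$ TANGENTQUERY($\mathcal{L}(P_j^{k_i})$, $\log_2 n$, $k_i$, $p_i$);
    insert $p_i$ into $\mathcal{L}(P_j)$ and make a link from $p_i$ to $p_{t_i}$;
    $\mathcal{L}(P_{j+1}^i)^{(0)}[i] \leftarrow t_i$;
    for $r = 0$ to $\log_2 n - 1$ do
        $\mathcal{L}(P_{j+1}^i)^{(r+1)}[i] \leftarrow \max\{\mathcal{L}(P_{j+1}^i)^{(r)}[\mathcal{L}(P_{j+1}^i)^{(r)}[i]], 0\}$;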

Lemma 3 The tangent query and the insertion operation can each be done in $O(\log n)$ time on the left branch data structure $\mathcal{L}(P_j)$.

Proof The insertion of a point into $\mathcal{L}(P_j)$ can be done by a predecessor query to $T(P_j)$ in $O(\log n)$ time and a tangent query to $\mathcal{L}(P_j)$ in $O(\log n)$ time by binary search using the jumping pointers, so it takes $O(\log n)$ time per insertion. Setting up the jumping pointers for each newly inserted point can also be done in $O(\log n)$ time, since only $O(\log n)$ such pointers need to be created.
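The jumping pointers and the binary search they enable can be sketched as follows. This is an iterative variant of the search rather than a transcription of the recursive TANGENTQUERY, and it uses $-1$ instead of index 0 as the "no such point" sentinel; the names `build_jumps` and `tangent_query`, the `kind` callback, and the static construction from a complete `link` array are assumptions of this sketch.

```python
def build_jumps(link, levels):
    """jump[r][i] is the index reached from i after 2**r links, or -1.

    link[i] is the index that point i links to (its tangent point),
    or -1 if point i has no link.
    """
    jump = [list(link)]
    for _ in range(1, levels):
        prev = jump[-1]
        jump.append([prev[prev[i]] if prev[i] != -1 else -1
                     for i in range(len(link))])
    return jump


def tangent_query(jump, start, kind):
    """Follow links from start and return the first index that is not reflex.

    kind(s) returns 'reflex', 'tangent' or 'concave'. Along the chain the
    reflex points are assumed to come first, then the tangent point, then
    the concave points, and 2**len(jump) must exceed the chain length.
    Returns -1 if every point on the chain is reflex.
    """
    if kind(start) != "reflex":
        return start
    t = start
    for r in reversed(range(len(jump))):
        s = jump[r][t]
        if s != -1 and kind(s) == "reflex":
            t = s              # a jump of 2**r links stays among reflex points
    return jump[0][t]          # one more link reaches the first non-reflex point
```

Each query inspects one pointer per level, so it performs $O(\log n)$ constant-time classifications, in line with Lemma 3.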


Theorem 1 The DENSITY FINDING PROBLEM for the case $u = w(1, n)$ can be solved in optimal $O(n \log n)$ time and $O(n \log n)$ space.

Proof The correctness of this algorithm follows from the arguments given earlier and the correctness of the insertion operation of $\mathcal{L}(P_j)$. Since $T(P_j)$ is a balanced binary tree, it takes $O(\log n)$ time to insert a point, and hence $O(n \log n)$ time overall. Since the algorithm performs $O(n)$ tangent queries and insertions on $\mathcal{L}(P_j)$, the overall time needed is $O(n \log n)$ by Lemma 3. Since this algorithm dynamically maintains $\mathcal{L}(P_j)$, whose size is at most $n$, and it takes $O(\log n)$ space for each point $p_i \in P_j$ to store the jumping pointers of all orders $r$, for $1 \le r \le \log_2 n$, the total space requirement is $O(n \log n)$.

3.2 General Case when u < w(1, n)

Now we develop our density finding algorithm for the general case, where the width of the segment has an upper bound $u < w(1, n)$. At any iteration $j$, we maintain, as described above, a dynamic left branch data structure $\mathcal{L}(P_j)$ of the upper hull $\mathrm{UH}(P_j)$ such that we can make a tangent query to $\mathcal{L}(P_j)$ to obtain the tangent point $p_{t_j}$. The only difference is that the set of all feasible points $P_{c_j,d_j}$ for each $p_j$ varies in that $c_{j+1}$ is no longer identical to $c_j$. In case $c_{j+1} \ne c_j$, we need to delete points $p_{c_j}, p_{c_j+1}, \ldots, p_{c_{j+1}-1}$ from $P_j$. Insertions of points $p_{d_j+1}, \ldots, p_{d_{j+1}}$ are performed in the same way to obtain $\mathcal{L}(P_{j+1})$ from $\mathcal{L}(P_j)$. However, our dynamic left branch data structure, as described above, supports only insertion operations; it does not support deletion operations as efficiently as we desire. Further modifications to the dynamic left branch data structure are needed.

We observe that the tangent query is decomposable. A query is called decomposable if the answer to the query over the entire set can be obtained by combining the answers to the queries to a suitable collection of subsets of the set. We will partition $P_j$ into several disjoint canonical subsets such that $P_j = P^0 \cup P^1 \cup \cdots \cup P^h$, where $h = \log m$, $m = \min\{\lfloor (u - \ell)/w_{\min} \rfloor, n\}$ and each canonical subset $P^i$ has size $|P^i| \le 2^i$. Note that some of these canonical subsets are empty, and that except for the last nonempty canonical subset, all of the nonempty ones will be full, i.e., of size $2^i$ for some $i \ge 0$. We will maintain a left branch data structure $\mathcal{L}(P^i)$ for each $i$. We define the dynamic left branch data structure $\mathcal{L}(P_j)$ to be $\mathcal{L}(P^0) \cup \mathcal{L}(P^1) \cup \cdots \cup \mathcal{L}(P^h)$. At iteration $j$, we first make a tangent query to $\mathcal{L}(P^i)$ for each $i$ and take the one with the smallest slope as the tangent point $p_{t_j}$, and then we delete points $p_{c_j}, \ldots, p_{c_{j+1}-1}$ from $\mathcal{L}(P_j)$ and insert points $p_{d_j+1}, \ldots, p_{d_{j+1}}$ into $\mathcal{L}(P_j)$. When we insert a point $p_i$ into $\mathcal{L}(P_j)$, we insert it into the $\mathcal{L}(P^r)$ that contains the point $p_{i-1}$ if $|P^r| < 2^r$; otherwise we insert it into $\mathcal{L}(P^{r+1})$. When we delete a point $p_i$ from $\mathcal{L}(P_j)$, we delete the entire data structure $\mathcal{L}(P^r)$ in which $p_i$ is located and reconstruct $\mathcal{L}(P^0), \mathcal{L}(P^1), \ldots, \mathcal{L}(P^{r-1})$ by inserting all the remaining points of $\mathcal{L}(P^r)$ into $\mathcal{L}(P^0), \mathcal{L}(P^1), \ldots, \mathcal{L}(P^{r-1})$ one by one in left-to-right order. A detailed description of the density finding algorithm for the general case is shown in the pseudo code below.


Algorithm DENSITYFINDINGPROBLEM($P$, $\ell$, $u$)

Input: a set of points $P = \{p_0, p_1, \ldots, p_n\}$ in $\mathbb{R}^2$, where $p_i = (x_i, y_i)$, and two nonnegative numbers $\ell$, $u$.

Output: the feasible line segment $s(i^*, j^*)$ such that $m(i^*, j^*) = \min_{s(i,j) \in F^+(0)} m(i, j)$.

1. $j \leftarrow 0$; $P_j \leftarrow \emptyset$; $t \leftarrow 0$; $h \leftarrow \log(\min\{\lfloor (u - \ell)/w_{\min} \rfloor, n\})$;
2. $(\{c_i\}_{i=1}^{n+1}, \{d_i\}_{i=1}^{n+1}) \leftarrow$ CONSTRUCTINDEX($\{x_i\}_{i=1}^{n}$);
3. for $j = 0$ to $n$ do
4.     $m(i_j, j) \leftarrow \infty$;
5.     for $i = 0$ to $h$
           /* lines 6–8 do tangent query to $\mathcal{L}(P_j)$ */
6.         $p_{k_j} \leftarrow$ point with the largest $y$-coord. less than $p_j$ in $T(P^i)$ by predecessor query;
7.         $p_{t_j} \leftarrow$ TANGENTQUERY($\mathcal{L}(P^i)$, $h$, $k_j$, $p_j$);
8.         if $m(i_j, j) > m(t_j, j)$ then $s(i_j, j) \leftarrow s(t_j, j)$;
       /* lines 9–21 delete points $p_{c_j}, \ldots, p_{c_{j+1}-1}$ from $\mathcal{L}(P_j)$ */
9.     for $a = c_j$ to $c_{j+1} - 1$ do
10.        let $P^r$ be the set such that $p_a \in P^r$;
11.        let $|P^r| - 1 = \lambda_0 2^0 + \lambda_1 2^1 + \cdots + \lambda_{r-1} 2^{r-1}$, where $\lambda_k = 0$ or $1$, $k = 0, 1, \ldots, r - 1$;
12.        $q \leftarrow a + 1$;
13.        for $i = 0$ to $r - 1$
14.            $f \leftarrow 0$;
15.            while $\lambda_i = 1$ and $f < 2^i$ and $q \le |P^r| - 1$
16.                $T(P^i) \leftarrow$ insert point $p_q$ into $T(P^i)$;
17.                $\mathcal{L}(P^i) \leftarrow$ INSERTPOINT($\mathcal{L}(P^i)$, $p_q$);
18.                $q \leftarrow q + 1$; $f \leftarrow f + 1$;
19.        if $r = t$ /* $P^t$ is the last nonempty canonical subset */
20.            $t \leftarrow \max\{k \mid \lambda_k = 1, k = 0, 1, \ldots, r - 1\}$;
21.        Delete $\mathcal{L}(P^r)$ and $T(P^r)$;
       /* lines 22–29 insert points $p_{d_j+1}, \ldots, p_{d_{j+1}}$ into $\mathcal{L}(P_j)$ */
22.    for $b = d_j + 1$ to $d_{j+1}$ do
23.        if $|P^t| < 2^t$
24.            $T(P^t) \leftarrow$ insert point $p_b$ into $T(P^t)$;
25.            $\mathcal{L}(P^t) \leftarrow$ INSERTPOINT($\mathcal{L}(P^t)$, $p_b$);
26.        else
27.            $t \leftarrow t + 1$;
28.            $T(P^t) \leftarrow$ insert point $p_b$ into $T(P^t)$;
29.            $\mathcal{L}(P^t) \leftarrow$ INSERTPOINT($\mathcal{L}(P^t)$, $p_b$);
30. $m(i^*, j^*) = \min\{m(i_j, j) \mid 0 \le j \le n\}$;
31. return $s(i^*, j^*)$;

Theorem 2 The DENSITY FINDING PROBLEM can be solved in $O(n \log^2 m)$ time and $O(n + m \log m)$ space, where $m = \min\{\lfloor (u - \ell)/w_{\min} \rfloor, n\}$.

Proof The algorithm dynamically maintains a left branch data structure $\mathcal{L}(P_j)$ whose size is at most $m = \min\{\lfloor (u - \ell)/w_{\min} \rfloor, n\}$. When a point $p_i$ in $\mathcal{L}(P^r)$ is deleted, we destroy the entire data structure $\mathcal{L}(P^r)$ and reconstruct at most $r$ left branch data structures $\mathcal{L}(P^0), \mathcal{L}(P^1), \ldots, \mathcal{L}(P^{r-1})$ with the remaining points in $\mathcal{L}(P^r)$. We note that each time a data structure $\mathcal{L}(P^r)$ of size $2^r$ is destroyed, the remaining points are reinserted into data structures $\mathcal{L}(P^i)$, $i < r$, of smaller size. Over the whole deletion process of the algorithm, each point in $P$ can be reinserted into a left branch data structure of each of the sizes $2^0, 2^1, \ldots, 2^{\log m}$ at most once, and it takes $O(\log m)$ time to insert a point into a left branch data structure, so reinsertions and deletions take $O(n \log^2 m)$ time in total. Since the algorithm needs to do $O(\log m)$ tangent queries at any iteration $j$ and it takes $O(\log m)$ time for each query, the total time taken is $O(n \log^2 m)$. As for the space requirement, since this algorithm dynamically maintains $\mathcal{L}(P_j) = \mathcal{L}(P^0) \cup \mathcal{L}(P^1) \cup \cdots \cup \mathcal{L}(P^{\log m})$, whose size is at most $m$, and it needs at most $O(\log m)$ jumping pointers for each point, it needs $O(n + m \log m)$ space in total.
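The decomposability that the general-case algorithm relies on can be illustrated with a small sketch: the answer over $P_j$ is the minimum of the answers over the canonical subsets. Each per-subset tangent query is replaced here by a linear scan, so the sketch shows only how answers are combined, not the stated running time; the names `canonical_partition`, `min_slope_in` and `decomposed_query` are assumptions.

```python
def canonical_partition(points):
    """Split points, in left-to-right order, into consecutive blocks
    P^0, P^1, ... with |P^i| <= 2**i; only the last block may be partial."""
    blocks, size, k = [], 1, 0
    while k < len(points):
        blocks.append(points[k:k + size])
        k += size
        size *= 2
    return blocks


def min_slope_in(subset, q):
    """Stand-in for one tangent query: the smallest nonnegative slope from
    a point of subset to q (q lies to the right), or None if there is none."""
    slopes = [(q[1] - p[1]) / (q[0] - p[0]) for p in subset if p[1] <= q[1]]
    return min(slopes, default=None)


def decomposed_query(blocks, q):
    """Combine the per-block answers into the answer for the whole set."""
    answers = [min_slope_in(block, q) for block in blocks]
    return min((a for a in answers if a is not None), default=None)
```

Because the combination step is a plain minimum, destroying one block and redistributing its points among smaller blocks never changes the combined answer, which is what makes the deletion strategy above sound.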

4 Algorithm for Maximum-Density Segment Problem

As a byproduct, we shall present yet another optimal algorithm for the MAXIMUM-DENSITY SEGMENT PROBLEM based on Kim's idea and the decomposability of tangent queries to convex hulls, by which we fix the flaw of his algorithm. Recall that the MAXIMUM-DENSITY SEGMENT PROBLEM is a special case of the DENSITY FINDING PROBLEM with $\delta = \infty$. For this case we have $F^-(\delta) = F(\delta)$. The MAXIMUM-DENSITY SEGMENT PROBLEM can be reformulated as the following geometric MAXIMUM SLOPE FINDING PROBLEM.

Given a set of points $P = \{p_0, p_1, \ldots, p_n\}$ in $\mathbb{R}^2$ and two width bounds $\ell$, $u$ with $\ell \le u$, output the feasible line segment $s(i^*, j^*)$ in $F(\infty)$ that maximizes $m(i, j)$.

Similar to the upper hull of $P_j$, we introduce the notion of the lower hull of $P_j$, denoted $\mathrm{LH}(P_j)$, which is a polygonal chain monotone with respect to the $x$-axis such that all points of $P_j$ are either on or above the polygonal chain. Let $p_{t_j}$ be the tangent point on the lower hull $\mathrm{LH}(P_j)$ of $P_j$ from $p_j$. It is easy to see that $m(t_j, j) = \max\{m(i, j) \mid p_i \in P_j\} = m(i_j, j)$. Therefore, if we can find the tangent point $p_{t_j}$ on $\mathrm{LH}(P_j)$ from $p_j$ for $0 \le j \le n$, we can solve the MAXIMUM-DENSITY SEGMENT PROBLEM.
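The claim that the maximum slope from $p_j$ is attained at a vertex of the lower hull can be checked with the following sketch. It recomputes the hull from scratch with a monotone-chain scan and scans its vertices linearly, so it illustrates the geometric fact only, not the amortized $O(1)$ bounds; the function names are assumptions, and $q$ is assumed to lie strictly to the right of all points.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower hull of points with distinct x-coordinates, from left to right."""
    hull = []
    for p in sorted(points):
        # Keep only left turns; otherwise the middle point lies on or above
        # the chain and cannot be a lower hull vertex.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def max_slope_from(points, q):
    """Largest slope of a segment joining a point of points to q."""
    return max((q[1] - p[1]) / (q[0] - p[0]) for p in points)
```

For the points (0,1), (1,0), (2,2), (3,1) and $q = (5,3)$, the lower hull is (0,1), (1,0), (3,1), and the maximum slope 1 is attained at the hull vertex (3,1).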

Our maximum-density segment algorithm, which iterates from $j = 0$ to $n$ to find the tangent point $p_{t_j}$ on the lower hull $\mathrm{LH}(P_j)$ from $p_j$, is described as follows. As before, associated with each $p_j$ we have a set $P_j = P_{c_j,d_j}$ of all feasible points. At any iteration $j$, we maintain a dynamic lower hull data structure for $\mathrm{LH}(P_j)$ such that we can make a tangent query to $\mathrm{LH}(P_j)$ to obtain the tangent point $p_{t_j}$ in amortized $O(1)$ time, and then we delete points $p_{c_j}, p_{c_j+1}, \ldots, p_{c_{j+1}-1}$ and insert points $p_{d_j+1}, p_{d_j+2}, \ldots, p_{d_{j+1}}$, each in amortized $O(1)$ time, to obtain the dynamic lower hull data structure for $\mathrm{LH}(P_{j+1})$. It is not obvious how to maintain just one dynamic data structure that supports insertion and deletion of points as well as tangent queries. That is the difficulty that Kim's lower hull data structure faced. We provide a fix by exploiting the decomposability of the tangent query again. We