An upper bound of the number of tests in pooling designs for the error-tolerant complex model


DOI 10.1007/s11590-007-0070-5

ORIGINAL PAPER


Hong-Bin Chen · Hung-Lin Fu · Frank K. Hwang

Received: 9 June 2006 / Accepted: 1 November 2007 / Published online: 28 November 2007 © Springer-Verlag 2007

Abstract Recently pooling designs have been used in screening experiments in molecular biology. In some applications, the property to be screened is defined on subsets of items, instead of on individual items. Such a model is usually referred to as the complex model. In this paper we give an upper bound of the number of tests required in a pooling design for the complex model (with given design parameters) where experimental errors are allowed.

Keywords Pooling designs · Nonadaptive algorithms · Disjunct matrices

1 Introduction

In the classical group testing problem (see [3,5] for surveys), we consider a set N of n items consisting of at most d defective items with the others being good items. The problem is to identify all defective items with a small number of group tests. A group test can be applied to an arbitrary subset S ⊆ N with two possible outcomes; a negative outcome implies all items in S are good while a positive outcome implies otherwise, i.e., there exists at least one defective item in S, not knowing which or how many.

Group testing is a basic tool which can be applied to a variety of problems such as blood testing, multiple access communication, and coding theory, among others [5]. Group testing procedures have recently been applied to computational molecular biology. Recent advances in molecular biology, especially the success of the Human Genome Project, have made the study of gene functions more popular. The study of gene functions requires a high quality DNA library, which is a collection of the copies of DNA segments called clones. In screening a clone library, the goal is to determine, in an efficient fashion, which clones in the library hybridize with a given probe. A clone is said to be positive if it hybridizes with the probe, and negative otherwise. Clearly, classic group testing can be applied to the screening situation. Over the past few years, a considerable number of studies have been made on this topic. For a general reference on the connection between group testing and molecular biology, the reader may refer to the book [6].

In molecular biology, the biological objects (e.g., clones, cells, molecules) are being identified, but it remains a challenge to understand how they cooperate to produce various attributes. Torney [17] first introduced a generalized group testing problem geared to such a need. The problem is described as follows. Consider a set N of n molecules and a given family C of subsets of N. The goal is to identify an unknown family D = {D_i} from the given family C, where the joint appearance of all molecules in each D_i causes a certain given biological phenomenon. An experiment, sometimes called a pool, can be applied to an arbitrary subset S ⊆ N with two possible outcomes; a negative outcome implies S does not contain any D_i ∈ D, and a positive outcome implies otherwise. A set of molecules which is a member of C is called a complex; members of D are called positive complexes and the others are called negative complexes. Such a model is usually referred to as the complex model. Of particular note is the basic assumption that members of C are subject to non-inclusion, i.e., no member of C contains another. The reason for this is not difficult to grasp: no positive complex may include any other positive complex.

Treating each molecule as a vertex and every complex as a hyper-edge, C can be viewed as a hypergraph on the vertex set N. Thus, the complex model fits easily into the framework of graph testing: learning the hidden subgraph D in the given hypergraph C, where the only allowed operation is to query whether a set of vertices induces a hyper-edge of D. There are different graph testing problems according to the prior knowledge of D; the usual assumption is that D has at most d edges, but it can also be that D is a matching [1,2] or a Hamiltonian circuit [9]. The complex model is also related to other problems such as superimposed codes and secure key distribution, among others [1,4,7,12,17]. Besides, some probabilistic analysis can be found in [13,14].

In the applications to molecular biology, an experiment corresponding to a group test could take several hours or even several days. Thus, it is impractical to perform the experiments sequentially and great importance is attached to nonadaptive group testing algorithms, also called pooling designs in the literature, in which all experiments are performed simultaneously. For this purpose, one must decide exactly which pools to test before any testing occurs. Another feature of biological experiments is that errors almost always occur during the testing procedure. In practice, the decoding issue becomes even more difficult due to experimental errors, and thus algorithms with error-tolerant ability are desirable. In this paper we focus on the construction of error-tolerant pooling designs for the complex model.

2 Preliminaries

A pooling design can be represented by an incidence matrix M where rows are labeled by tests, columns by molecules, and cell (i, j) has a 1-entry if and only if molecule j is in test i. For convenience, we view each column as the set of row indices where it has 1-entries. Then we can talk about the union and the intersection of columns. We say that a set X of columns appears (or is contained) in a row if all columns in X have a 1-entry in that row. A pool with a negative (positive) outcome is called a negative (positive) pool.

A main property of M to solve the group testing problem on complexes, i.e., the (d, r; z]-disjunctness, has been discussed in [4,7,10,15,16]. A binary matrix M is said to be (d, r; z]-disjunct if for any d + r columns C_1, C_2, ..., C_{d+r},

$$\left| \bigcap_{i=1}^{r} C_i \setminus \bigcup_{i=r+1}^{d+r} C_i \right| \ge z.$$

That means for any d + r columns there exist at least z rows where each of the r designated columns has 1-entries and each of the other d columns has 0-entries.
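The definition can be checked by brute force on small matrices. The following is a minimal sketch, not part of the original construction; the function name `is_disjunct` and the list-of-rows matrix representation are assumptions made for illustration.

```python
from itertools import combinations

def is_disjunct(matrix, d, r, z):
    """Return True if the 0/1 matrix (a list of rows) is (d, r; z]-disjunct.

    Brute force over every r-set R of columns and every disjoint d-set D.
    """
    n = len(matrix[0])
    for R in combinations(range(n), r):
        rest = [j for j in range(n) if j not in R]
        for D in combinations(rest, d):
            # Rows with a 1 in every column of R and a 0 in every column of D.
            count = sum(
                1 for row in matrix
                if all(row[j] == 1 for j in R) and all(row[j] == 0 for j in D)
            )
            if count < z:
                return False
    return True

# Hypothetical toy input: the 3 x 3 identity matrix.
identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(is_disjunct(identity3, d=2, r=1, z=1))
print(is_disjunct(identity3, d=2, r=1, z=2))
```

The running time grows like the number of (R, D) pairs times the number of rows, so this check is only practical for small n, d and r.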

Suppose our only knowledge of D is |D| ≤ d and every complex consists of at most r molecules. Then a (d, r; z]-disjunct matrix M can be used to identify D if the number of erroneous tests is at most ⌊(z − 1)/2⌋. Consider a negative complex Q and a set D = {D_i} of positive complexes, |D| ≤ d. According to the non-inclusion assumption, we can choose two disjoint sets of columns, an r-set R and a d-set S with S ∩ R = ∅, such that Q ⊆ R and S ∩ D_i ≠ ∅ for all D_i. By the (d, r; z]-disjunctness property, M has at least z rows each containing R but none of S. The pools corresponding to these rows must test negative since they do not contain any positive complex. Even in the worst case that ⌊(z − 1)/2⌋ outcomes are erroneous, Q still appears in at least z − ⌊(z − 1)/2⌋ ≥ ⌊(z − 1)/2⌋ + 1 negative pools. On the other hand, each positive complex appears in at most ⌊(z − 1)/2⌋ negative pools (due to errors). Hence, for each complex, by simply counting the number of negative pools in which it appears, we can determine whether it is positive or not.
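The counting rule in this paragraph translates directly into a decoder. The sketch below is illustrative only: `decode_complexes`, the toy matrix (all weight-2 rows on 4 columns, repeated three times so that it is (2, 2; 3]-disjunct) and the single flipped outcome are assumptions, not data from the paper.

```python
from itertools import combinations

def decode_complexes(matrix, outcomes, candidates, z):
    """Classify candidate complexes by counting the negative pools they appear in.

    matrix: list of 0/1 rows (pools x molecules)
    outcomes: observed pool outcomes, 1 = positive, 0 = negative (may contain errors)
    candidates: iterable of complexes, each a collection of column indices
    A complex is declared positive when it appears in at most floor((z - 1) / 2)
    negative pools.
    """
    threshold = (z - 1) // 2
    positives = []
    for complex_ in candidates:
        negative_pools = sum(
            1 for row, out in zip(matrix, outcomes)
            if out == 0 and all(row[j] == 1 for j in complex_)
        )
        if negative_pools <= threshold:
            positives.append(frozenset(complex_))
    return positives

# Hypothetical toy instance: n = 4 molecules, complexes are all pairs.
n = 4
pairs = list(combinations(range(n), 2))
base = [[1 if j in p else 0 for j in range(n)] for p in pairs]
matrix = base * 3                      # three copies of each pool, so z = 3
positive = {frozenset({0, 1})}         # the hidden positive complex
truth = [int(any(all(row[j] for j in c) for c in positive)) for row in matrix]
observed = truth.copy()
observed[0] ^= 1                       # one erroneous outcome, within floor((z - 1) / 2) = 1
result = decode_complexes(matrix, observed, pairs, z=3)
print(result)
```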

Let t(n, d, r; z] denote the minimum number of rows among all (d, r; z]-disjunct matrices with n columns. In this paper, we are interested in providing a method to construct the (d, r; z]-disjunct matrices and then to obtain an upper bound of t(n, d, r; z]. Our result is obtained by translating the problem into a hypergraph problem. Engel [8] first observed the equivalence between a (d, r; 1]-disjunct matrix and a cover of a properly defined hypergraph. Stinson and Wei [15] generalized the equivalence to (d, r; z]-disjunct matrices for general z, but used the equivalence only to derive a lower bound of t(n, d, r; z]. The hypergraph we construct in this paper is similar to that of Stinson and Wei except that we take a weight-l binary vector as a vertex if and only if l = w, while Stinson and Wei relaxed the condition l = w to r ≤ l ≤ n − d. This crucial step allows us to use the Lovász lemma on minimum cover to derive an upper bound of t(n, d, r; z].

Stinson and Wei proved a lower bound

$$t(n, d, r; z] \ge 0.7c\,\frac{(d+r)\binom{d+r}{r}}{\log\binom{d+r}{r}}\,\log n + \frac{c(z-1)}{2}\binom{d+r}{r}$$

when n is sufficiently large, where c is a constant. Also, they provided two asymptotic upper bounds for t(n, d, r; z] by using some other structures: one involves the function log*, where log*(1) = 1 and log* n = log*(log n) + 1 if n > 1, and the other is $O\left(z\binom{d+r}{r}\log n\right)$. However, we believe that there is a flaw in the latter one, and we will explain the matter later. In the next section, we show that t(n, d, r; z] < z(k/r)^r (k/d)^d [1 + k(1 + ln(n/k + 1))], where k = d + r. Finally, we conclude this paper with some remarks.

3 The main results

Given a finite set X, a hypergraph H = (X, F) is a family F = {E_1, E_2, ..., E_m} of subsets of X. The elements of X are called vertices, and the sets E_i are the edges of the hypergraph H. A hypergraph with |E_i| = |E_j| for all i ≠ j is said to be uniform.

For u ∈ X, define the degree d_H(u) of u to be the number of edges containing u. A hypergraph H in which all vertices have the same degree is said to be regular, i.e., d_H(u) = d_H(v) for all u, v ∈ X. A z-cover of H is a subset C ⊆ X such that |C ∩ E_i| ≥ z for all i. Let τ_z(H) denote the minimum size among all z-covers of H. It is easy to see that

$$\tau_z(H) \le z\,\tau_1(H). \tag{1}$$

By a greedy strategy, i.e., choosing vertices sequentially in X such that every chosen vertex intersects the maximum number of edges which are not covered yet, a fundamental result by Lovász [11] implies that

$$\tau_1(H) < \frac{|X|}{\min_{E \in F} |E|}\,(1 + \ln \Delta), \tag{2}$$

where Δ = max_{u∈X} d_H(u).
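The greedy strategy and the quantity on the right-hand side of (2) can be sketched as follows. This is an illustrative implementation under stated assumptions: the names `greedy_cover` and `lovasz_bound` and the small 6-vertex hypergraph are hypothetical, and every edge is assumed to be a nonempty subset of the vertex set.

```python
import math

def greedy_cover(vertices, edges):
    """Greedy 1-cover: repeatedly pick a vertex lying in the most uncovered edges."""
    uncovered = [set(e) for e in edges]
    cover = []
    while uncovered:
        best = max(vertices, key=lambda v: sum(v in e for e in uncovered))
        if not any(best in e for e in uncovered):
            raise ValueError("some edge contains no vertex of the vertex set")
        cover.append(best)
        uncovered = [e for e in uncovered if best not in e]
    return cover

def lovasz_bound(vertices, edges):
    """Right-hand side of (2): |X| / min|E| * (1 + ln(max degree))."""
    min_edge = min(len(e) for e in edges)
    max_degree = max(sum(1 for e in edges if v in e) for v in vertices)
    return len(vertices) / min_edge * (1 + math.log(max_degree))

# Hypothetical example: a 3-uniform, 2-regular hypergraph on 6 vertices.
vertices = [1, 2, 3, 4, 5, 6]
edges = [{1, 2, 3}, {3, 4, 5}, {5, 6, 1}, {2, 4, 6}]
cover = greedy_cover(vertices, edges)
print(cover, lovasz_bound(vertices, edges))
```

Ties between equally good vertices are broken by list order here; any tie-breaking rule is compatible with the greedy strategy described above.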

Our aim is to show that (d, r; z]-disjunct matrices can be obtained from z-covers of properly defined hypergraphs, and then Lovász’s result (2) provides us a desired upper bound of t(n, d, r; z].

Let X_w be the set of all binary vectors x = (x_1, x_2, ..., x_n) of length n containing exactly w 1’s. For any two disjoint subsets D, R of [n], where [n] denotes the set {1, 2, ..., n}, define E_{D,R} = {x = (x_1, x_2, ..., x_n) ∈ X_w : x_i = 1, x_j = 0 for i ∈ R, j ∈ D}. For example, E_{{1},{3}} = {(0, 1, 1, 1, 0), (0, 1, 1, 0, 1), (0, 0, 1, 1, 1)} when w = 3, n = 5. Then, for r ≤ w ≤ n − d, define the hypergraph H = (X_w, F), where F = {E_{D,R} : |D| = d, |R| = r, and D ∩ R = ∅}. Note that $|X_w| = \binom{n}{w}$ and $|F| = \binom{n}{r}\binom{n-r}{d}$.
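For small parameters the vertex set X_w and the edges E_{D,R} can be enumerated directly from these definitions. The sketch below is only an illustration; the function names and the parameter choice n = 5, w = 2, d = 2, r = 1 are assumptions, and indices are 1-based as in the text.

```python
from itertools import combinations
from math import comb

def weight_w_vectors(n, w):
    """All binary vectors of length n with exactly w ones (the vertex set X_w)."""
    return [tuple(1 if i in ones else 0 for i in range(1, n + 1))
            for ones in combinations(range(1, n + 1), w)]

def edge(n, w, D, R):
    """E_{D,R}: vectors in X_w with x_i = 1 for i in R and x_j = 0 for j in D."""
    return {x for x in weight_w_vectors(n, w)
            if all(x[i - 1] == 1 for i in R) and all(x[j - 1] == 0 for j in D)}

def edge_family(n, w, d, r):
    """The family F, keyed by the pair (D, R) of disjoint index sets."""
    family = {}
    for R in combinations(range(1, n + 1), r):
        rest = [j for j in range(1, n + 1) if j not in R]
        for D in combinations(rest, d):
            family[(D, R)] = edge(n, w, D, R)
    return family

example = edge(5, 2, D={1, 2}, R={3})   # vectors with x_3 = 1 and x_1 = x_2 = 0
family = edge_family(5, 2, d=2, r=1)
print(sorted(example))
print(len(weight_w_vectors(5, 2)), len(family))
```

The two printed counts can be compared with the formulas for |X_w| and |F| stated above.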

By the construction, a (d, r; z]-disjunct matrix with n columns can be obtained from a z-cover of H = (X_w, F) by treating {1, 2, ..., n} as the set of columns and each vertex in the z-cover as a row, where the jth column has a 1-entry in that row if and only if x_j = 1 in that vertex. The reason is as follows. For any two disjoint d-set D and r-set R of [n], there exists a corresponding edge E_{D,R} in the hypergraph H = (X_w, F). Thus, in the z-cover there are at least z vertices, each of which belongs to the edge E_{D,R}. According to our transformation, each row corresponding to one of these vertices in the z-cover has a 1-entry in every column of R and a 0-entry in every column of D. In other words, for any d + r columns there exist at least z rows where each of the r designated columns has 1-entries and each of the other d columns has 0-entries; hence it is a (d, r; z]-disjunct matrix. We then obtain the following theorem.

Theorem 3.1 For any positive integers d, r, w, z and n, with r ≤ w ≤ n − d, there exists a t × n (d, r; z]-disjunct matrix with

$$t < z\,\frac{\binom{n}{r}\binom{n-r}{d}}{\binom{w}{r}\binom{n-w}{d}}\left[1 + \ln\left(\binom{w}{r}\binom{n-w}{d}\right)\right].$$
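The passage from a z-cover to a matrix can be sketched in code for small parameters. Note the assumptions: the sketch below finds a z-cover with a greedy multicover heuristic (each vertex is used at most once, and an edge stays active until it has been hit z times), which is a choice made here for illustration rather than the route through inequalities (1) and (2); the names `build_disjunct_matrix` and `is_disjunct` are hypothetical, and columns are indexed from 0.

```python
from itertools import combinations

def build_disjunct_matrix(n, d, r, w, z):
    """Greedily pick weight-w vectors until every edge E_{D,R} is hit z times."""
    cols = range(n)
    remaining = [frozenset(s) for s in combinations(cols, w)]  # supports of vertices
    need = {}                      # (D, R) -> how many more hits the edge needs
    for R in combinations(cols, r):
        rest = [j for j in cols if j not in R]
        for D in combinations(rest, d):
            need[(frozenset(D), frozenset(R))] = z

    def gain(x):
        # Number of still-deficient edges E_{D,R} that contain the vertex x.
        return sum(1 for (D, R), v in need.items()
                   if v > 0 and R <= x and not (D & x))

    chosen = []
    while any(v > 0 for v in need.values()):
        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("no z-cover exists for these parameters")
        remaining.remove(best)
        chosen.append(best)
        for (D, R) in need:
            if need[(D, R)] > 0 and R <= best and not (D & best):
                need[(D, R)] -= 1
    return [[1 if j in x else 0 for j in cols] for x in chosen]

def is_disjunct(matrix, d, r, z):
    """Brute-force check of the (d, r; z]-disjunct property."""
    n = len(matrix[0])
    for R in combinations(range(n), r):
        for D in combinations([j for j in range(n) if j not in R], d):
            hits = sum(1 for row in matrix
                       if all(row[j] for j in R) and not any(row[j] for j in D))
            if hits < z:
                return False
    return True

M = build_disjunct_matrix(n=5, d=1, r=1, w=2, z=2)
print(len(M), is_disjunct(M, d=1, r=1, z=2))
```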

Proof By the construction of H = (X_w, F), H is uniform and regular; hence

$$\frac{|X|}{\min_{E \in F} |E|} = \frac{|F|}{\Delta}.$$

Moreover, we have $|F| = \binom{n}{r}\binom{n-r}{d}$ and $\Delta = \binom{w}{r}\binom{n-w}{d}$. The theorem follows directly from (1) and (2).

From Theorem 3.1, all we need to do is to minimize the bound by properly choosing w. Let $f(w) = \binom{w}{r}\binom{n-w}{d}$. Then

$$f(w+1) = \frac{w+1}{w-r+1} \cdot \frac{n-w-d}{n-w}\, f(w).$$

When r ≤ w < n, clearly the function

$$\frac{w+1}{w-r+1} \cdot \frac{n-w-d}{n-w}$$

is decreasing in w since the two factors are both decreasing. Accordingly, if we ignore the fact that w must be an integer, then setting w = (nr − d)/(d + r), which satisfies

$$\frac{w+1}{w-r+1} \cdot \frac{n-w-d}{n-w} = 1,$$

maximizes the function f(w). Since (1 + ln x)/x is decreasing for x ≥ 1, a larger f(w) gives a smaller bound in Theorem 3.1, so this is, up to integrality, a good choice of w. For convenience, denote k = d + r.

Theorem 3.2 For any positive integers d, r, z and n with d + r ≤ n,

$$t(n, d, r; z] < z(k/r)^r (k/d)^d\,[1 + k(1 + \ln(n/k + 1))].$$
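To get a feel for the two bounds, they can be evaluated numerically. The sketch below is illustrative; the function names and the parameters n = 100, d = r = 2, z = 3 are assumptions chosen so that nr/k is an integer.

```python
from math import comb, log

def theorem_3_1_bound(n, d, r, z, w):
    """Bound of Theorem 3.1 for a given weight w with r <= w <= n - d."""
    delta = comb(w, r) * comb(n - w, d)
    return z * comb(n, r) * comb(n - r, d) / delta * (1 + log(delta))

def theorem_3_2_bound(n, d, r, z):
    """Bound of Theorem 3.2, with k = d + r."""
    k = d + r
    return z * (k / r) ** r * (k / d) ** d * (1 + k * (1 + log(n / k + 1)))

def best_w(n, d, r):
    """Integer w in [r, n - d] maximizing f(w) = C(w, r) * C(n - w, d)."""
    return max(range(r, n - d + 1), key=lambda w: comb(w, r) * comb(n - w, d))

n, d, r, z = 100, 2, 2, 3
w = best_w(n, d, r)
print(w, theorem_3_1_bound(n, d, r, z, w), theorem_3_2_bound(n, d, r, z))
```

For these parameters the integer maximizer of f lies next to the continuous value (nr − d)/(d + r) = 49.5, and the bound of Theorem 3.1 at that w is smaller than the closed form of Theorem 3.2, as the proof below leads one to expect.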

Proof For given positive integers d, r, z and n, setting w = n'r/k, where n' ≥ n is the least integer such that n'r/k is an integer, we have

$$z\,\frac{\binom{n'}{r}\binom{n'-r}{d}}{\binom{w}{r}\binom{n'-w}{d}} \le z\,\frac{n'(n'-1)\cdots(n'-k+1)}{[(n'r/k)\cdots(n'r/k-r+1)]\,[(n'd/k)\cdots(n'd/k-d+1)]} \le z(k/r)^r (k/d)^d.$$

Moreover, using the inequality $\binom{a}{b} \le (ea/b)^b$, where e ≈ 2.71828 is the base of the natural logarithm, one concludes

$$\binom{n'r/k}{r}\binom{n'-n'r/k}{d} \le e^k (n'/k)^k.$$

From the above inequalities and Theorem 3.1, we have

$$t(n', d, r; z] < z\,\frac{\binom{n'}{r}\binom{n'-r}{d}}{\binom{w}{r}\binom{n'-w}{d}}\left[1 + \ln\left(\binom{w}{r}\binom{n'-w}{d}\right)\right] < z(k/r)^r (k/d)^d\,[1 + k(1 + \ln(n'/k))].$$

Note that n' < n + k because of the choice of n'. For any given positive integers d, r, z and n, we have

$$t(n, d, r; z] \le t(n', d, r; z] < z(k/r)^r (k/d)^d\,[1 + k(1 + \ln(n'/k))] < z(k/r)^r (k/d)^d\,[1 + k(1 + \ln(n/k + 1))].$$

4 Conclusions

This paper is concerned with the construction of pooling designs for the complex model. Although a number of studies have been made, little is known about such constructions. A main tool for solving the group testing problem on complexes is the (d, r; z]-disjunct matrix. In this paper we provide an explicit construction of (d, r; z]-disjunct matrices by using Lovász’s result (2). Precisely, we show that for any positive integers d, r, z and n with d + r ≤ n, there exists a t × n (d, r; z]-disjunct matrix with t < z(k/r)^r (k/d)^d [1 + k(1 + ln(n/k + 1))], where k = d + r. This result is the first nontrivial and non-asymptotic bound of t(n, d, r; z]. Note that the two upper bounds proposed in [15] by Stinson and Wei are asymptotic.

It is worth pointing out that we believe the original analysis in [15] has a flaw. The authors showed that for any positive integers d, r, z and n, there exists a t × n (d, r; z]-disjunct matrix with $t = O\left(z\binom{d+r}{r}\log n\right)$, by using a construction of perfect hash families described in [18]. However, the asymptotic result in [18] cited by Stinson and Wei should not be O(log n), but O(C log n), where C depends on the parameters d and r (hence it cannot be ignored even under O-notation). In addition, the same flaw can also be found in the construction of (d, 1; 1]-disjunct matrices in [18].

Acknowledgments The authors would like to thank the anonymous referees for their careful reading and valuable suggestions.

References

1. Alon, N., Beigel, R., Kasif, S., Rudich, S., Sudakov, B.: Learning a hidden matching. SIAM J. Comput. 33, 487–501 (2004)

2. Beigel, R., Alon, N., Apaydin, M.S., Fortnow, L., Kasif, S.: An optimal procedure for gap closing in whole genome shotgun sequencing. In: Proceedings of 2001 RECOMB, pp. 22–30. ACM, New York (2001)


3. Balding, D.J., Bruno, W.J., Knill, E., Torney, D.C.: A comparative survey of nonadaptive pooling designs. In: Genetic Mapping and DNA Sequencing, IMA Volumes in Mathematics and Its Applications, pp. 133–154. Springer, Berlin (1996)

4. Chen, H.B., Du, D.Z., Hwang, F.K.: An unexpected meeting of four seemingly unrelated problems: graph testing, DNA complex screening, superimposed codes and secure key distribution. J. Combin. Optim. 14, 121–129 (2007)

5. Du, D.Z., Hwang, F.K.: Combinatorial Group Testing and Its Applications, 2nd edn. World Scientific, Singapore (2000)

6. Du, D.Z., Hwang, F.K.: Pooling Designs and Nonadaptive Group Testing: Important Tools for DNA Sequencing. World Scientific, Singapore (2006)

7. D’yachkov, A.G., Vilenkin, P.A., Macula, A.J., Torney, D.C.: Families of finite sets in which no intersection of sets is covered by the union of s others. J. Combin. Theory Ser. A 99, 195–218 (2002)

8. Engel, K.: Interval packing and covering in the Boolean lattice. Combin. Prob. Comput. 5, 373–384 (1996)

9. Grebinski, V., Kucherov, G.: Reconstructing a Hamiltonian cycle by querying the graph: application to DNA physical mapping. Discrete Appl. Math. 88, 147–165 (1998)

10. Kim, H.K., Lebedev, V.: On optimal superimposed codes. J. Combin. Designs 12, 79–91 (2004)

11. Lovász, L.: On the ratio of optimal integral and fractional covers. Discrete Math. 13, 383–390 (1975)

12. Macula, A.J., Popyack, L.J.: A group testing method for finding patterns in data. Discrete Appl. Math. 144, 149–157 (2004)

13. Macula, A.J., Rykov, V.V., Yekhanin, S.: Trivial two-stage group testing for complexes using almost disjunct matrices. Discrete Appl. Math. 137, 97–107 (2004)

14. Macula, A.J., Torney, D.C., Vilenkin, P.A.: Two-stage group testing for complexes in the presence of errors. DIMACS Ser. Discrete Math. Theoret. Comput. Sci. 55, 145–157 (1999)

15. Stinson, D.R., Wei, R.: Generalized cover-free families. Discrete Math. 279, 463–477 (2004)

16. Stinson, D.R., Wei, R., Zhu, L.: Some new bounds for cover-free families. J. Combin. Theory Ser. A 90, 224–234 (2000)

17. Torney, D.C.: Sets pooling designs. Ann. Combin. 3, 95–101 (1999)

18. Wang, H., Xing, C.: Explicit constructions of perfect hash families from algebraic curves over finite fields. J. Combin. Theory Ser. A 93, 112–124 (2001)