Error-tolerant pooling designs with inhibitors


© Mary Ann Liebert, Inc. Pp. 231–236

Error-Tolerant Pooling Designs with Inhibitors

F.K. HWANG and Y.C. LIU

ABSTRACT

Pooling designs are used in clone library screening to efficiently distinguish positive clones from negative clones. Mathematically, a pooling design is just a nonadaptive group testing scheme, which has been extensively studied in the literature. In some applications, there is a third category of clones called “inhibitors” whose effect is to neutralize positives. Specifically, the presence of an inhibitor in a pool dictates a negative outcome even though positives are present. Sequential group testing schemes, which can be modified to three-stage schemes, have been proposed for the inhibitor model, but it is unknown whether a pooling design (a one-stage scheme) exists. Another open question raised in the literature is whether the inhibitor model can treat unreliable pool outcomes. In this paper, we answer both open problems by giving a pooling design, as well as a two-stage scheme, for the inhibitor model with unreliable outcomes. The number of pools required by our schemes is quite comparable to that of the three-stage scheme.

Key words: nonadaptive group testing, inhibitors, error-tolerant.

1. INTRODUCTION

Sequencing a set of clones often relies on identifying the clones which contain a given probe. For example, in physical mapping, the probe can be a sequence-tagged site, which is a unique subsequence in the target sequence. The identification of a clone containing this probe essentially locates the clone. In a DNA array, a probe is a given l-tuple, and a positive identification confirms the existence of such an l-tuple in the target sequence. We will assume that the setting is the first application, namely, to identify which clones in the given set contain the probe. A clone is called a positive if it contains the probe, and a negative if not. A pool is a subset of clones put together for a joint assay with two possible outcomes: a negative pool signifies that there is no positive in the pool; a positive pool signifies otherwise, namely, that there is at least one positive in the pool. A pooling design is a 0-1 matrix where the columns are the set of clones, the rows are the set of pools, and a 1-entry in cell $(i, j)$ signifies that clone $j$ is in pool $i$.

In some biological applications, there is a third category of clones called inhibitors whose presence in a pool dictates a negative outcome, regardless of the presence of a positive in the pool. While the pooling design corresponds to the classical nonadaptive group testing problem (Du and Hwang, 2000), the presence of inhibitors presents a new group testing model not considered in the group testing literature. Farach et al. (1997) first introduced this model. Let $n$ denote the total number of clones, including at most $d$ positives and at most $r$ inhibitors. They gave a randomized algorithm to identify all positives in $O((d + r) \log n)$ tests, assuming $d + r \ll n$. De Bonis and Vaccaro (1998) gave a deterministic algorithm in $O((r^2 + d) \log n)$ tests. However, both algorithms are sequential in nature; specifically, tests cannot be performed in parallel (as some tests depend on the results of other tests). It is possible to convert the De Bonis and Vaccaro algorithm into a three-stage algorithm (tests in a given stage can be performed in parallel) by increasing the number of tests to $O((r^2 + d^2) \log n)$. But it remains an open question whether there exists a pooling design (1-stage) for the inhibitor model. Further, experimental or recording errors may be made. De Bonis and Vaccaro raised the open problem of treating errors in the inhibitor model. In this paper, we answer both open problems by proving the existence of an error-tolerant pooling design with $O((d^2 + r^2 + e^2) \log n)$ tests when there are at most $e$ unreliable outcomes. We also give a two-stage algorithm in $O((d^2 + r^2 + e^2) \log n)$ tests; each stage can have at most $e$ errors.

Department of Applied Mathematics, National Chiao Tung University, Hsìnchu 300, Taiwan.

2. PRELIMINARY

Since a pooling design is a 0-1 matrix, whether we can identify positives from negatives and inhibitors by decoding the pooling outcomes apparently depends on the structure of this matrix. Indeed, when there are only positives and negatives, the case corresponding to the nonadaptive group testing problem, we can identify positives from negatives by a 0-1 matrix with some particular structure.

Consider a $t \times n$ 0-1 matrix $M$ where $R_i$ and $C_j$ denote row $i$ and column $j$. A 1-entry in cell $(i, j)$ is covered by a set $X$ of columns if at least one column in $X$ has a 1-entry in row $i$. A set $A$ of columns is covered by a set $B$ of columns if every 1-entry in $A$ is covered by $B$. $M$ is called $d$-disjunct if no union, or Boolean sum, of up to $d$ columns covers any other column.

Note that a set of $d$ columns can be viewed as a candidate set of positives, while the union of this candidate set, also a binary vector, corresponds to the set of pools which give positive outcomes. Since the $d$-disjunct property guarantees that every negative has at least one 1-entry not covered by the set of positive pools, we can identify the clones covered by the set of positive pools as positives. It is well known (Hwang and Sós, 1987) that an upper bound on the number of pools in a $d$-disjunct matrix is $O(d^2 \log n)$. Figure 1 gives an example of a 2-disjunct matrix.
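The definition and the classical decoding rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `line_matrix` is a hypothetical helper (not taken from the paper) that builds a small matrix whose columns are the lines y = ax + b over the integers modulo a prime q and whose rows are the points (x, y), and `is_d_disjunct` is a brute-force check that is practical only for tiny matrices.

```python
from itertools import combinations

def line_matrix(q):
    """Hypothetical example matrix: columns are lines y = a*x + b over Z_q
    (q prime), rows are points (x, y). Two distinct lines share at most one point."""
    M = [[0] * (q * q) for _ in range(q * q)]
    for a in range(q):
        for b in range(q):
            for x in range(q):
                M[x * q + (a * x + b) % q][a * q + b] = 1
    return M

def is_d_disjunct(M, d):
    """Brute force: True if no union of up to d columns covers any other column."""
    t, n = len(M), len(M[0])
    cols = [{i for i in range(t) if M[i][j]} for j in range(n)]
    for size in range(1, d + 1):
        for subset in combinations(range(n), size):
            union = set().union(*(cols[j] for j in subset))
            for j in range(n):
                if j not in subset and cols[j] <= union:
                    return False
    return True

def decode_classical(M, positive_pools):
    """Error-free, inhibitor-free decoding: a clone is declared positive
    when every pool containing it is a positive pool."""
    t, n = len(M), len(M[0])
    return [j for j in range(n)
            if all(i in positive_pools for i in range(t) if M[i][j])]

M = line_matrix(3)                       # 9 pools, 9 clones, column weight 3
print(is_d_disjunct(M, 2))               # True
positives = [0, 4]
pools = {i for i in range(len(M)) if any(M[i][j] for j in positives)}
print(decode_classical(M, pools))        # [0, 4]
```

The example matrix is 2-disjunct because two other columns can cover at most two of the three 1-entries of any column.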

As we mentioned in Section 1, the inhibitors present a new group testing model not considered before. We prove in Section 4 that a $(d + r + 2e)$-disjunct matrix can be used as the pooling design for the inhibitor model with $e$ errors. We also give a two-stage method in Section 3. It uses an $(r + 2e)$-disjunct matrix and a $(d + 2e)$-disjunct matrix in the first and second stage, respectively. Note that the numbers of tests of both our methods have the same complexity as the three-stage method of De Bonis and Vaccaro and are only slightly more than that of their sequential method.

FIG. 1. This is a 2-disjunct matrix. No union of up to two columns covers any other column. Suppose $C_2$ and $C_5$ are the positives; then every negative has at least one 1-entry (shown underlined) not covered by the union of $C_2$ and $C_5$.

Before introducing our methods, some lemmas about the structure of disjunct matrices are needed. A column $C_i$ in a 0-1 matrix is isolated if there exists a row that contains $C_i$ only. A 1-entry has a 1-outcome (0-outcome) if it lies in a positive (negative) pool. A column has $m$ 1-outcomes (0-outcomes) if $m$ of its 1-entries have 1-outcomes (0-outcomes).
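To make the notions of 1-outcomes and 0-outcomes concrete, the following sketch simulates pool outcomes under the inhibitor model and counts, for every column, how many of its 1-entries fall in positive and in negative pools. The function names and the `flipped` parameter (the set of pools whose outcome is recorded wrongly) are illustrative assumptions, not notation from the paper.

```python
def pool_outcomes(M, positives, inhibitors, flipped=()):
    """Outcome of each pool: 1 iff the pool contains a positive and no inhibitor.
    Pools whose index is in `flipped` have their outcome reversed (an error)."""
    outcomes = []
    for i, row in enumerate(M):
        has_positive = any(row[j] for j in positives)
        has_inhibitor = any(row[j] for j in inhibitors)
        value = 1 if has_positive and not has_inhibitor else 0
        outcomes.append(1 - value if i in flipped else value)
    return outcomes

def outcome_counts(M, outcomes):
    """For each column j, return (number of 1-outcomes, number of 0-outcomes)."""
    counts = []
    for j in range(len(M[0])):
        ones = sum(1 for i, row in enumerate(M) if row[j] and outcomes[i] == 1)
        zeros = sum(1 for i, row in enumerate(M) if row[j] and outcomes[i] == 0)
        counts.append((ones, zeros))
    return counts

# Toy 3-pool, 3-clone matrix: clone 0 is positive, clone 2 is an inhibitor.
M = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
V = pool_outcomes(M, positives={0}, inhibitors={2})
print(V)                     # [1, 0, 0]
print(outcome_counts(M, V))  # [(1, 1), (1, 1), (0, 2)]
```

In the toy run, pool 2 contains the positive but is negative because the inhibitor is also present, which is exactly the effect the lemmas below have to account for.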

Note that the deletion of the isolated column and rows that contain it in a disjunct matrix does not destroy the disjunctness. Further, it’s easy to determine whether an isolated clone is a positive. We assume that a disjunct matrix has no isolated column. D’yachkov and Rykov (1982) proved the following.

Lemma 2.1. Every column in an $m$-disjunct matrix has column weight at least $m + 1$.

Lemma 2.2. Let $K$ denote a set of $k$ row indices. There exists a set of at most $k$ columns whose union covers $K$.

Proof. For each row index in $K$, select any column with a 1-entry in that row. Then the set of at most $k$ chosen columns covers $K$.

Suppose there are at most d positives, r inhibitors, and e errors. Then we have the following.

Lemma 2.3. A positive has at least $(m - r - e + 1)$ 1-outcomes in an $m$-disjunct matrix.

Proof. Suppose there is a positive $C$ which has at most $m - r - e$ 1-outcomes. By Lemma 2.2, there exists a set of at most $m - r - e$ columns covering the row indices of these 1-outcomes; these columns can be chosen different from $C$ because there is no isolated column. Further, all its other (at least $r + e + 1$, by Lemma 2.1) 1-entries are covered by the $r$ inhibitors and the up to $e$ errors, since there exists a set $E$ of $e$ columns, with $C \notin E$ when there is no isolated column, covering the row indices of the up to $e$ errors. Hence, at most $(m - r - e) + (r + e) = m$ columns cover $C$, contradicting the $m$-disjunct property.

Lemma 2.4. A negative has at least $(m - d - e + 1)$ 0-outcomes in an $m$-disjunct matrix.

Proof. Suppose to the contrary that a negative $c$ has at most $m - d - e$ 0-outcomes. Let $E$ be defined as in Lemma 2.3. Then all the 1-entries of $c$ with 1-outcomes are covered by at most $d + e$ columns, and all with 0-outcomes by at most $m - d - e$ columns, leading to the conclusion that $c$ is covered by $m$ columns, a contradiction to the $m$-disjunctness.

Lemma 2.5. An inhibitor has at most $e$ 1-outcomes in any disjunct matrix.

Proof. An inhibitor has no 1-outcome if there is no error. The $e$ errors can result in at most $e$ 1-outcomes for an inhibitor.
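As a worked instance of Lemmas 2.3 to 2.5 (an illustration added here, not part of the original derivation), substituting the three values of $m$ used in Sections 3 and 4 gives the thresholds on which the decoding rules rely:

```latex
\begin{align*}
m = r + 2e:\quad & \text{a positive has at least } (r + 2e) - r - e + 1 = e + 1 \text{ 1-outcomes},\\
m = d + 2e:\quad & \text{a negative has at least } (d + 2e) - d - e + 1 = e + 1 \text{ 0-outcomes},\\
m = d + r + 2e:\quad & \text{a positive has at least } d + e + 1 \text{ 1-outcomes, and}\\
 & \text{a negative has at least } r + e + 1 \text{ 0-outcomes}.
\end{align*}
```

In every case an inhibitor has at most $e$ 1-outcomes, so a cutoff at $e$ separates inhibitors from positives.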

3. THE 2-STAGE METHOD

In this section, we will first state our algorithm, then prove its correctness.

The first stage:

Pooling: Use an $(r + 2e)$-disjunct matrix.

Decoding: Collect all clones which have at most $e$ 1-outcomes as a set AI. Take the complement of AI as PN.

The second stage:

Pooling: Use a $(d + 2e)$-disjunct matrix on clones in the set PN.

Decoding: Collect all clones which have at most $e$ 0-outcomes as a set $P$. Output the set $P$ as our positives.
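The two stages can be sketched as follows. This is a minimal simulation under several assumptions that are not in the paper: `line_matrix` is a hypothetical construction (lines over the integers modulo a prime q, column weight q, any two columns sharing at most one row, hence (q - 1)-disjunct), `make_assay` stands in for the laboratory and injects the listed erroneous pools in each stage, and the second stage simply reuses the same matrix with only the clones of PN placed in the pools.

```python
def line_matrix(q):
    """Hypothetical example: columns are lines y = a*x + b over Z_q (q prime),
    rows are points (x, y); two distinct columns share at most one row."""
    M = [[0] * (q * q) for _ in range(q * q)]
    for a in range(q):
        for b in range(q):
            for x in range(q):
                M[x * q + (a * x + b) % q][a * q + b] = 1
    return M

def make_assay(positives, inhibitors, flips_by_stage):
    """Simulated lab. Each call runs one stage; flips_by_stage[s] is the set
    of pools whose outcome is recorded wrongly in stage s."""
    positives, inhibitors = set(positives), set(inhibitors)
    flips = iter(flips_by_stage)
    def assay(M, members):
        members, flipped = set(members), set(next(flips))
        outcomes = []
        for i, row in enumerate(M):
            present = {j for j in members if row[j]}
            value = int(bool(present & positives) and not (present & inhibitors))
            outcomes.append(1 - value if i in flipped else value)
        return outcomes
    return assay

def count(M, outcomes, j, value):
    """Number of 1-entries of column j lying in pools with the given outcome."""
    return sum(1 for i, row in enumerate(M) if row[j] and outcomes[i] == value)

def two_stage_decode(M1, M2, e, assay):
    n = len(M1[0])
    out1 = assay(M1, range(n))                        # stage 1: all clones
    AI = {j for j in range(n) if count(M1, out1, j, 1) <= e}
    PN = [j for j in range(n) if j not in AI]         # complement of AI
    out2 = assay(M2, PN)                              # stage 2: only clones in PN
    return [j for j in PN if count(M2, out2, j, 0) <= e]

M = line_matrix(7)            # 49 pools, 49 clones, 6-disjunct
assay = make_assay(positives={3, 20}, inhibitors={11}, flips_by_stage=[{5}, {17}])
print(two_stage_decode(M, M, e=1, assay=assay))       # [3, 20]
```

With d = 2, r = 1, and e = 1 the stages need 3-disjunct and 4-disjunct matrices, so the 6-disjunct example matrix is more than sufficient.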


Lemma 3.1. The set AI contains all inhibitors and no positives.

Proof. By Lemma 2.3, a positive has at least $(e + 1)$ 1-outcomes in an $(r + 2e)$-disjunct matrix; hence, no positive will appear in AI. Further, since all clones with at most $e$ 1-outcomes are chosen into AI, by Lemma 2.5, all inhibitors are contained in AI.

Theorem 3.2. The set P contains all positives and nothing else.

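Proof. By Lemma 2.4, a negative has at least $(e + 1)$ 0-outcomes in a $(d + 2e)$-disjunct matrix; hence, those with at most $e$ 0-outcomes cannot be negatives. They are all positives. Conversely, since PN contains no inhibitor, a positive can get a 0-outcome only through an error, so it has at most $e$ 0-outcomes and is collected into $P$.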

Actually, what we do in the first stage is leave all inhibitors in the set AI and collect all positives into the set PN. But there can be negatives in PN; hence, we need a second stage. Since there is no inhibitor in PN, we can then do the classical group testing in the second stage to find all positives.

4. THE 1-STAGE METHOD

Pooling: Use a $(d + r + 2e)$-disjunct matrix.

Decoding:

Step I: Partition the clones into 4 sets:

  $P$ consists of those with at most $r + e$ 0-outcomes;

  $N$ consists of those with at least $e + 1$ but at most $d + e$ 1-outcomes;

  $O$ consists of those with at most $e$ 1-outcomes;

  $R$ consists of the rest.

Step II: If $R \neq \emptyset$:

  Denote the outcome vector as $V$. For all possible $r$-subsets of $O$:

  a. denote the union of the $r$-subset as $V'$;

  b. let $V$ union $V'$ be $U$;

  c. if clone $c$ in $R$ has at most $e$ 0-outcomes under $U$, then put it into $P$.

Step III: Output $P$ as our set of positives.
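Steps I to III can be sketched as below. The sketch makes a few assumptions beyond the text: the column weight is at least d + r + 2e + 1 (so the tests in Step I cannot overlap), subsets of size min(r, |O|) are tried when O has fewer than r members, and `line_matrix` and `pool_outcomes` are hypothetical helpers used only to generate test data.

```python
from itertools import combinations

def line_matrix(q):
    """Hypothetical example: columns are lines y = a*x + b over Z_q (q prime),
    rows are points (x, y); column weight q, (q - 1)-disjunct."""
    M = [[0] * (q * q) for _ in range(q * q)]
    for a in range(q):
        for b in range(q):
            for x in range(q):
                M[x * q + (a * x + b) % q][a * q + b] = 1
    return M

def pool_outcomes(M, positives, inhibitors, flipped=()):
    """Simulated outcome vector V; pools in `flipped` are recorded wrongly."""
    V = []
    for i, row in enumerate(M):
        value = int(any(row[j] for j in positives)
                    and not any(row[j] for j in inhibitors))
        V.append(1 - value if i in flipped else value)
    return V

def one_stage_decode(M, V, d, r, e):
    t, n = len(M), len(M[0])
    rows = [[i for i in range(t) if M[i][j]] for j in range(n)]
    ones = [sum(V[i] for i in rows[j]) for j in range(n)]
    zeros = [len(rows[j]) - ones[j] for j in range(n)]
    # Step I: partition the clones into P, N, O, R.
    P, N, O, R = set(), set(), set(), set()
    for j in range(n):
        if zeros[j] <= r + e:
            P.add(j)
        elif ones[j] <= e:
            O.add(j)
        elif ones[j] <= d + e:
            N.add(j)
        else:
            R.add(j)
    # Step II: try candidate inhibitor sets drawn from O.
    if R:
        for subset in combinations(sorted(O), min(r, len(O))):
            U = list(V)
            for j in subset:                 # U = V union V'
                for i in rows[j]:
                    U[i] = 1
            for c in R:
                if sum(1 for i in rows[c] if U[i] == 0) <= e:
                    P.add(c)
    # Step III: output P.
    return sorted(P)

M = line_matrix(7)   # 6-disjunct, enough for d + r + 2e = 5
V = pool_outcomes(M, positives={3, 20}, inhibitors={11}, flipped={5})
print(one_stage_decode(M, V, d=2, r=1, e=1))   # [3, 20]
```

Because any two columns of this example matrix share at most one row, R is empty here and Step II is skipped; Step II matters for disjunct matrices without that property.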

Lemma 4.1. After Step I, $N$ is contained in the set of all negatives, and $P$ is contained in the set of all positives. $O$ contains all inhibitors and no positives.

Proof. By Lemma 2.3, a clone with at most $d + e$ 1-outcomes cannot be positive. An inhibitor has at most $e$ 1-outcomes, so an inhibitor cannot appear in $N$. Hence, there are only negatives in $N$.

By Lemma 2.4, a clone with at most $r + e$ 0-outcomes cannot be negative. Further, the column weight is at least $(d + r + 2e + 1)$ in a $(d + r + 2e)$-disjunct matrix; these clones then have at least $(d + e + 1) > e$ 1-outcomes. By Lemma 2.5, inhibitors will not appear in $P$ either.

By Lemma 2.3, a clone with at most $e$ 1-outcomes cannot be a positive. An inhibitor cannot have more than $e$ 1-outcomes. Hence, $O$ contains all inhibitors but no positives.

Lemma 4.2. A clone $c$ in $R$ is positive if and only if there exists at least one $r$-subset of $O$ such that $c$ has at most $e$ 0-outcomes under $U$.

Proof. ($\Rightarrow$) Since $O$ contains all inhibitors (whose number is at most $r$), some chosen $r$-subset in Step II contains all inhibitors. Then the vector $V'$ formed by such an $r$-subset corrects the false negative outcomes caused by the inhibitors. If $c$ is positive, then every pool containing $c$ has a positive outcome except at most $e$ pools with unreliable outcomes. Thus $c$ is in $P$.

($\Leftarrow$) Suppose a clone $c$ in $R$ is negative. We can view $U$ as the union of $d + r$ clones. Then $|c \setminus U| > e$, for otherwise $c$ can be covered by $U \cup E$, where $E$ is a set of $e$ columns, a contradiction to the $(d + r + 2e)$-disjunctness since $|U \cup E| \leq d + r + e$.


In this one-stage method, we first partition all clones into four parts. After Step I, part $P$ contains only positives, $N$ contains only negatives, and $O$ contains all inhibitors and no positives. This is guaranteed by Lemmas 2.3, 2.4, and 2.5. Clones in part $R$ may be either positives or negatives; hence, we need Step II. What we do in Step II is recover the false negative outcomes caused by inhibitors by trying all possible $r$-subsets of $O$. By Lemma 4.1, at least one of these $r$-subsets will contain the true inhibitor subset. Then each positive in $R$ will be covered by the corresponding $U$ and thus collected into $P$. On the other hand, due to the sufficiency part of Lemma 4.2, we can always keep negatives in $R$ out of $P$. Since there may be more than one $r$-subset that recovers the false negative outcomes correctly, we cannot recognize which one contains the true inhibitor subset.

Next, we estimate the time complexity of this method. Most disjunct matrices are constructed by using some combinatorial designs (see Du and Hwang [2000] for a general introduction) which treat the clones symmetrically. In particular, most such designs have a constant column weight $k$; i.e., each column has exactly $k$ 1-entries. The partition into $P$, $N$, $O$, and $R$ requires one to count for each column how many of its 1-entries have 0-outcomes and how many have 1-outcomes. Thus, the partition takes $O(kn)$ time.

Next, we estimate the time complexity of Step II. We first give a case for which Step II is not needed.

Corollary 4.4. $R = \emptyset$ after Step I if the $(d + r + 2e)$-disjunct matrix has constant column weight $(d + r + 2e + 1)$.

Proof. By Lemma 2.4, we put a clone into $P$ in Step I when it has at most $r + e$ 0-outcomes; that is, this is a sufficient condition for a positive. However, in a matrix with constant column weight $(d + r + 2e + 1)$, having at most $r + e$ 0-outcomes is equivalent to having at least $(d + e + 1)$ 1-outcomes, and by Lemma 2.3, this is a necessary condition for a positive. That is, all positives are collected into $P$. A similar argument works for negatives.

Assume $R \neq \emptyset$; then the time complexity of Step II is $O\left(\binom{|O|}{r} k |R|\right)$. To estimate $|O|$ and $|R|$, we need to be more specific about the $(d + r + 2e)$-disjunct matrix. A popular construction is to have constant column weight $k \geq d + r + 2e + 1$, while each pair of columns intersects (both having a 1-entry in the same row) at most once (called the 1-intersection property). Then $p = k/t$ is approximately the probability of the appearance of a 1-entry in the matrix. We will give a rough analysis by assuming that the distribution of 1-entries in the rows is random.

We first estimate $|R|$. A positive $C$ is in $R$ only if it has at least $r + e + 1$ 0-outcomes, which is impossible if the $(d + r + 2e)$-disjunct matrix has the 1-intersection property: the $r$ inhibitors can cause at most $r$ 0-outcomes, leaving at least $e + 1$ 0-outcomes, and even if all $e$ errors occur in these rows, there is still one 0-outcome unexplained.

The above argument is very conservative in explaining the 0-outcomes. So even if a $(d + r + 2e)$-disjunct matrix does not have the 1-intersection property, we expect $|R|$ to be small.

For $|R| \neq 0$, we compute the probability that a negative $c$ is in $O$, i.e., the probability that it has at most $e$ 1-outcomes.

Without errors, $c$ gets a 0-outcome at a row if either an inhibitor is present or no positive is present in that row. The probability $P(c^-)$ of this is

$$P(c^-) = 1 - (1 - p)^r \left[1 - (1 - p)^d\right].$$

Hence the probability that $c$ has $z$ 0-outcomes is

$$f(z) = \binom{k}{z} P(c^-)^z \left(1 - P(c^-)\right)^{k - z}.$$

However, there are $e'$ errors present, where $e' \leq e$. Suppose $i$ of them make some of the $z$ 0-outcomes of $c$ become 1-outcomes, and $j$ of them make some of the $k - z$ 1-outcomes become 0-outcomes. Let $c_1$ denote the number of 1-outcomes. Then

$$P(c_1 \leq e) = \sum_{z=0}^{k} \sum_{i=0}^{e'} \sum_{j=k-z+i-e}^{e'-i} f(z) \frac{\binom{z}{i} \binom{k-z}{j} \binom{t-k}{e'-x}}{\binom{t}{e'}},$$

because we need at least $k - z + i - e$ 1-outcomes to become 0-outcomes so that $c$ has at most $e$ 1-outcomes. Therefore,

$$E(|O|) = r + (n - d - r) \cdot P(c_1 \leq e).$$

For $e = 0$, $c \in O$ if and only if $z = k$. The probability is $P(c^-)^k$.
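The estimate of $E(|O|)$ can be evaluated numerically. The sketch below is an illustration under explicit assumptions: the symbol $x$ in the last binomial coefficient is taken to be $i + j$ (the text does not define it), the lower limit of the sum over $j$ is clamped at 0, and the function names and the sample parameters are made up for the example.

```python
from math import comb

def p_zero(p, d, r):
    """P(c^-): probability that a row of a negative clone has a 0-outcome
    (no errors): an inhibitor is present or no positive is present."""
    return 1 - (1 - p) ** r * (1 - (1 - p) ** d)

def f(z, k, p0):
    """Probability that exactly z of the k 1-entries have 0-outcomes."""
    return comb(k, z) * p0 ** z * (1 - p0) ** (k - z)

def prob_in_O(k, t, p0, e, e_actual):
    """P(c_1 <= e) when e_actual <= e errors occur among the t pools.
    Assumption: the garbled symbol x in the source is read as i + j."""
    total = 0.0
    for z in range(k + 1):
        for i in range(min(e_actual, z) + 1):
            lowest_j = max(0, k - z + i - e)
            for j in range(lowest_j, min(e_actual - i, k - z) + 1):
                total += (f(z, k, p0) * comb(z, i) * comb(k - z, j)
                          * comb(t - k, e_actual - i - j) / comb(t, e_actual))
    return total

def expected_O(n, d, r, k, t, e, e_actual):
    """E(|O|) = r + (n - d - r) * P(c_1 <= e), with p = k / t."""
    p0 = p_zero(k / t, d, r)
    return r + (n - d - r) * prob_in_O(k, t, p0, e, e_actual)

# Illustrative parameters only: 49 clones, 49 pools, column weight 7.
print(expected_O(n=49, d=2, r=1, k=7, t=49, e=1, e_actual=1))
```

For $e = 0$ the triple sum collapses to the single term $z = k$, so `prob_in_O` returns $P(c^-)^k$, in agreement with the closed form stated above.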

5. CONCLUSION

In this paper, we offer an error-tolerant 1-stage method to solve the inhibitor model. This solves the existence problem for a pooling design with inhibitors and errors. As in the classical group testing problem, we use disjunct matrices as our pooling designs. But the decoding algorithm is slightly more complicated than that in the classical group testing problem. Fortunately, by Corollary 4.4 and the analysis of $|R|$ and $|O|$, we can still decode quickly if the disjunct matrices we use as pooling designs have a certain constant column weight, and many well-studied (Du and Hwang, 2000) methods for constructing disjunct matrices actually do yield constant column weight.

After the submission of the paper, the authors became aware of a recent conference paper by D’yachkov et al. (2001) which gave a 1-stage method for the inhibitor model without error. Besides the errorless assumption, their method exhaustively checks all $\binom{n}{r'}$, $r' \leq r$, $r'$-subsets of the $n$ clones for each clone, and hence could be too computationally intensive for practical use for large $n$ and moderate $r$.

ACKNOWLEDGMENT

Research partially supported by the Republic of China NSC grant 90-2115-M-009-029.

REFERENCES

De Bonis, A., and Vaccaro, U. 1998. Improved algorithms for group testing with inhibitors. Information Processing Letters 67, 57–64.

Du, D., and Hwang, F.K. 2000. Combinatorial group testing and its applications, 2nd ed., World Scientific, Singapore.

D’yachkov, A.G., Macula, A.J., Torney, D.C., and Vilenkin, P.A. 2001. Two models of nonadaptive group testing for designing screening experiments. Proc. 6th Int. Workshop on Model-Oriented Design and Analysis 63–75.

D’yachkov, A.G., and Rykov, V.V. 1982. Bounds of the length of disjunct codes. Problems Control Inform. Thy. 11, 7–13.

Farach, M., Kannan, S., Knill, E., and Muthukrishnan, S. 1997. Group testing problems with sequences in experimental molecular biology. Proc. Compression and Complexity of Sequences, 357–367. IEEE.

Hwang, F.K., and Sós, V.T. 1987. Non-adaptive hypergeometric group testing. Studia Scient. Math. Hungarica 22, 257–263.

Address correspondence to:
Y.C. Liu
Department of Applied Mathematics
National Chiao Tung University
Hsìnchu 300, Taiwan

E-mail: [email protected]
