Pooling designs for clone library screening in the inhibitor complex model


DOI 10.1007/s10878-009-9279-9


Fei-huang Chang · Huilan Chang · Frank K. Hwang

Published online: 9 February 2010

Abstract In this paper we introduce inhibitors into the complex model and give a lower bound and an upper bound on the number of tests in a nonadaptive pooling design for some inhibitor complex models. We propose a very efficient pooling design for the general inhibitor complex model and extend it to the error-tolerant case.

Keywords Pooling design · Group testing · Inhibitor complex model · Error-tolerance

1 Introduction

In the clone library screening problem, the goal is to identify a small set of positive clones from a large set of negative clones. Group testing is often the tool to accomplish this. A group test is performed on an arbitrary subset of clones with two possible outcomes: a negative outcome if all clones in the subset are negative, and a positive outcome otherwise. A sequence of tests which can identify a fixed but unknown set of positive clones is a solution to the problem, usually called a design. Since a test is time-consuming, it is much preferred to perform all tests simultaneously. In this case, we refer to the design as a nonadaptive pooling design.

This research is partially supported by a Republic of China National Science grant NSC 93-2115-M-009-013.

F. Chang

Department of Mathematics and Science, National Taiwan Normal University, Taipei County 24449, Taiwan, ROC

e-mail:[email protected]

H. Chang (✉) · F.K. Hwang

Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan, ROC

e-mail:[email protected]


To have an efficient design, we need to assume some knowledge about the positive set. The most innocent assumption is an upper bound d on the number of positive clones in the test population. A pooling design is usually represented by its incidence matrix M, where rows are indexed by tests and columns by clones. A binary matrix is called d-disjunct if no column is covered by the union of any other d columns. In the clone model, it is well known (Du and Hwang 2000) that a d-disjunct matrix can identify all p positive clones (p ≤ d). In certain applications, there is a third type of clone, called an inhibitor, whose presence may cancel the effect of positive clones. Farach et al. (1997) first proposed the 1-inhibitor model, in which a single inhibitor clone dictates the test outcome to be negative regardless of how many positive clones are in the test. De Bonis and Vaccaro (2003) extended this model to the k-inhibitor model, in which k inhibitor clones dictate the test outcome to be negative. However, only sequential designs were given in De Bonis and Vaccaro (1998), De Bonis and Vaccaro (2003), and Farach et al. (1997). Recently, nonadaptive pooling designs have been proposed for the inhibitor model (Dyachkov et al. 2001; Hwang and Liu 2003; Du and Hwang 2005) and extended to the general inhibitor model (Chang and Hwang 2007), in which the exact cancellation effect of inhibitors on positive clones is not specified.
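As an illustration of the d-disjunct property used in the clone model, the following is a minimal brute-force sketch. The function name `is_d_disjunct` and the list-of-rows matrix layout are assumptions made for this example only; the check is exponential in d and is meant for small matrices.

```python
from itertools import combinations

def is_d_disjunct(matrix, d):
    """Return True if no column is covered by the union of any d other columns.

    `matrix` is a list of rows of 0/1 entries (rows = tests, columns = clones).
    """
    n_cols = len(matrix[0])
    # Support of a column: the set of tests (rows) that contain the clone.
    supports = [{i for i, row in enumerate(matrix) if row[j]} for j in range(n_cols)]
    for j in range(n_cols):
        others = [k for k in range(n_cols) if k != j]
        for group in combinations(others, d):
            union = set().union(*(supports[k] for k in group))
            if supports[j] <= union:  # column j is covered by the group
                return False
    return True

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(is_d_disjunct(identity, 2))          # True
print(is_d_disjunct([[1, 1], [0, 1]], 1))  # False
```

In the second matrix the first column has a 1 only in the first test, which the second column also contains, so the first column is covered.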

In some DNA screening environments, it takes a subset of clones, called a complex, to induce a positive outcome. We call such a model the complex model, as opposed to the clone model discussed before. Thus in the complex model, a fixed but unknown set of complexes are designated positive, while the other candidate complexes are called negative complexes. It is usually assumed that two positive complexes can overlap, but neither contains the other. Torney (1999) first introduced the complex model and gave the complexes of eukaryotic DNA transcription and RNA translation as an example.

In this paper we introduce inhibitors into the complex model and call it the inhibitor complex model. In this model, an inhibitor is a third type of complex, other than positive and negative, whose presence may cancel the effect of positive complexes. In the simplest inhibitor complex model, the 1-inhibitor complex model, the mere existence of a single inhibitor dictates the test outcome to be negative, regardless of the presence of positive complexes. If the requirement is changed from a single inhibitor to k inhibitors, then it is the k-inhibitor complex model. In general, in a (k, g)-inhibitor complex model, k inhibitors cancel the effect of g positive complexes. Usually, we do not know the two parameters k and g for sure. We refer to a model without such specification as the general inhibitor complex model. In this paper, we propose a very efficient nonadaptive pooling design for the general inhibitor complex model, i.e., it works against any (k, g)-inhibitor complex model, and extend it to the error-tolerant case.

2 The general inhibitor complex model

In the general inhibitor clone model, the (d+ s)-disjunct matrix is the main tool to identify the positive clones from n clones including at most d positive clones and at most s inhibitor clones. We extend this idea to the general inhibitor complex model.


We attach the parameters (r, d, s) to a general inhibitor complex model to denote the fact that among the complexes, which are subsets of the n clones, there are at most d positive complexes and at most s inhibitors, and a complex has at most r clones. For a complex $X$, we say a row covers $X$ if it intersects (has a 1-entry in) every column of $X$, while $\overline{X}$ denotes the set of rows covering $X$. Let $H$ denote the given set of complexes in the considered problem. Then $H$ can be viewed as a hypergraph with clones as vertices and complexes as edges. Following the terminology of Gao et al. (2006), a binary matrix is $(H : d)$-disjunct if for any $d + 1$ complexes $X_0, X_1, \ldots, X_d$ there always exists a row covering $X_0$ but none of $X_1, \ldots, X_d$, i.e.,

$$\overline{X_0} \not\subseteq \bigcup_{i=1}^{d} \overline{X_i}.$$

Denote the minimum number of rows in an $(H : d)$-disjunct matrix with $n$ columns by $t(H : d, n)$.

It is generally assumed that no edge is a subset of another edge, for otherwise the requirement of $(H : d)$-disjunctness cannot be fulfilled when $X_0$ contains one of the $X_i$: every row covering $X_0$ then also covers that $X_i$. We write $H_r$ for $H$ to denote the fact that each edge in $H$ has maximum rank (number of vertices) $r$.
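The (H: d)-disjunct condition can be tested directly on small instances. The sketch below is a brute-force check under assumed conventions that are not part of the paper: each pool (row) is a Python set of clone indices, each complex is a `frozenset`, and the names `covering_rows` and `is_H_d_disjunct` are hypothetical. When fewer than d other edges exist, the sketch compares against all of them.

```python
from itertools import combinations

def covering_rows(pools, complex_):
    """Indices of the rows (pools) that contain every clone of `complex_`."""
    return {i for i, pool in enumerate(pools) if complex_ <= pool}

def is_H_d_disjunct(pools, edges, d):
    """Brute-force check: for every X0 and every d other edges,
    some row covers X0 but none of the other edges."""
    cover = {x: covering_rows(pools, x) for x in edges}
    for x0 in edges:
        others = [x for x in edges if x != x0]
        for group in combinations(others, min(d, len(others))):
            covered_by_group = set().union(*(cover[x] for x in group))
            if cover[x0] <= covered_by_group:
                return False
    return True

edges = [frozenset(e) for e in ({0, 1}, {1, 2}, {2, 3})]
one_pool_per_edge = [set(e) for e in edges]
print(is_H_d_disjunct(one_pool_per_edge, edges, 2))    # True
print(is_H_d_disjunct([{0, 1, 2}, {2, 3}], edges, 1))  # False
```

In the second design the only row covering {0, 1} also covers {1, 2}, so the condition fails already for d = 1.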

By the definition given above, clearly, we have

Lemma 1 $(H_r : d)$-disjunct implies $(H_{r'} : d')$-disjunct for $d \ge d'$ and $r \ge r'$.

We give a lower bound for the (r, d, s) general inhibitor complex model, which is an extension of a similar result in De Bonis and Vaccaro (1998) for the general inhibitor clone model.

Theorem 2 The number of rows in a nonadaptive pooling design for the (r, d, s) general inhibitor complex model is at least $t(H_r : s, n)$.

Proof Since a lower bound for the 1-inhibitor complex model is clearly a lower bound for the general inhibitor complex model, it suffices to prove the claim for the 1-inhibitor case. Let $M$ be the testing matrix of a nonadaptive pooling design. Suppose $M$ is not $(H_r : s)$-disjunct. Then, writing $\overline{X}$ for the set of rows covering a complex $X$, there exists a set of $s + 1$ complexes $X_0, \ldots, X_s$ such that

$$\overline{X_0} \subseteq \bigcup_{i=1}^{s} \overline{X_i}.$$

Consider the sample in which $X_0$ is a positive complex and $\{X_1, \ldots, X_s\}$ is the set of inhibitors. Then the outcomes of all tests covering $X_0$ are negative, and we cannot identify $X_0$ from such outcomes.

Theorem 3 An $(H_r : d + s)$-disjunct matrix can identify all positive complexes under the (r, d, s) general inhibitor complex model.

Proof Let $Q$ denote a negative complex, $P$ a positive complex and $I$ an inhibitor complex. Let $t(X)$ denote the number of negative pools in which complex $X$ appears. For an $s$-set $S$ of complexes, let $t_S(X)$ denote the same except that a 0-outcome is changed to 1 if that test covers a complex contained in $S$. Define $\mathcal{H}$ as the set of all choices of $s$ edges from $H$ and $t_H(X) = \min_{S \in \mathcal{H}} t_S(X)$. Then

$$t_H(P) = t_{S^*}(P) = 0,$$

where $S^*$ is an $s$-set containing all inhibitors.

By the definition of an $(H : d + s)$-disjunct matrix, for any complex $X$ and any other complexes $X_1, \ldots, X_{d+s}$, we have $\overline{X} \not\subseteq \bigcup_{i=1}^{d+s} \overline{X_i}$ (with $\overline{X}$ the set of rows covering $X$), i.e., there exists a row covering $X$ but none of $X_1, \ldots, X_{d+s}$. In particular, when $\{X_1, \ldots, X_d\}$ covers the set of positive complexes and $\{X_{d+1}, \ldots, X_{d+s}\}$ is a given set $S$, we have $t_S(X) \ge 1$ for $X \in \{Q, I\}$, for all $S$. Consequently, $t_H(X) \ge 1$ for $X \in \{Q, I\}$. Thus we conclude that $\{X : t_H(X) = 0\}$ is the set of all positive complexes.

Corollary 4 The number of rows in a nonadaptive pooling design for the (r, d, s) general inhibitor complex model is at most $t(H : d + s, n)$.
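The decoding rule in the proof of Theorem 3 can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: pools are sets of clone indices, complexes are `frozenset`s, all function names are hypothetical, and the outcomes are simulated with the 1-inhibitor rule only to produce test data. The sketch also assumes that the s-sets S range over complexes other than X, as the proof's phrase "other complexes" suggests.

```python
from itertools import combinations

def outcome_1_inhibitor(pool, positives, inhibitors):
    """Error-free outcome in the 1-inhibitor complex model (simulation only)."""
    has_positive = any(p <= pool for p in positives)
    has_inhibitor = any(i <= pool for i in inhibitors)
    return 1 if has_positive and not has_inhibitor else 0

def t_S(pools, outcomes, x, s_set):
    """Negative pools covering x, skipping pools that cover a complex in s_set."""
    return sum(
        1 for pool, out in zip(pools, outcomes)
        if out == 0 and x <= pool and not any(y <= pool for y in s_set)
    )

def t_H(pools, outcomes, x, edges, s):
    """Minimum of t_S(x) over all s-sets S of edges other than x."""
    others = [y for y in edges if y != x]
    return min(t_S(pools, outcomes, x, S)
               for S in combinations(others, min(s, len(others))))

def identify_positives(pools, outcomes, edges, s):
    return {x for x in edges if t_H(pools, outcomes, x, edges, s) == 0}

A, B, C, D = (frozenset(e) for e in ({0, 1}, {1, 2}, {2, 3}, {0, 3}))
edges = [A, B, C, D]
pools = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {0, 1, 2}]
outcomes = [outcome_1_inhibitor(p, positives={A}, inhibitors={B}) for p in pools]
print(outcomes)                                              # [1, 0, 0, 0, 0]
print(identify_positives(pools, outcomes, edges, s=1) == {A})  # True
```

Here d = s = 1 and the five pools form an (H: 2)-disjunct design for these four edges. The last pool covers both the positive complex A and the inhibitor B, so its negative outcome is discounted once S = {B} is tried.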

Next, we discuss the error-tolerant case. We consider two types of errors: the 10-type, changing a 1-outcome to 0, and the 01-type, changing a 0-outcome to 1. Let $e^*_{10}$ and $e^*_{01}$ denote the unknown numbers of 10-type errors and 01-type errors, respectively, and denote upper bounds of $e^*_{10}$ and $e^*_{01}$ by $e_{10}$ and $e_{01}$, either known or unknown. We assume that $e$, an upper bound on the total number of errors, is known. We extend the $(H : d)$-disjunct matrix to the error-tolerant case. A binary matrix is called $(H : d; z)$-disjunct if for any $d + 1$ complexes $X_0, X_1, \ldots, X_d$, there exist at least $z$ rows which cover $X_0$ but none of $X_1, \ldots, X_d$. Construction of $(H : d; z)$-disjunct

matrices was studied in Gao et al. (2006). For n clones and z n, the construction yields a matrix with $O((d \log n)^{r+1})$ rows.

Theorem 5 An $(H : d + s; c + e + 1)$-disjunct matrix can identify all positive complexes under the (r, d, s) general inhibitor complex model with at most $e$ errors, where

$$c = \begin{cases} e_{10} + e_{01} - e, & \text{(i) if } e_{10} \text{ and } e_{01} \text{ are known,} \\ e, & \text{(ii) if there are no estimates of } e_{10} \text{ and } e_{01}, \\ 0, & \text{(iii) if the number of positive complexes is } d. \end{cases}$$

Proof Ignoring the inhibitors for the moment, a positive complex $P$ can appear in a negative pool only if its outcome is one of the 10-type errors. So when $S^*$ is an $s$-set containing all inhibitors,

$$t_H(P) \le t_{S^*}(P) \le e^*_{10}.$$

On the other hand, for $X \in \{Q, I\}$, by the definition of $(H : d + s; c + e + 1)$-disjunctness, $X$ appears in at least $c + e + 1$ rows each covering none of the up-to-$d$ positive complexes nor the $s$ complexes in $S$; hence the corresponding tests have negative outcomes. Errors of the 01-type may reduce the number of such negative pools. But still, with the minimum taken over all $s$-sets $S$ of edges,

$$t_H(X) = \min_{S} t_S(X) \ge \min_{S} \{c + e + 1 - e^*_{01}\} = c + e + 1 - e^*_{01}.$$

Since $e \ge e^*_{10} + e^*_{01}$, we have $t_H(X) \ge c + e^*_{10} + 1 > t_H(P)$. The problem is that we do not know $e^*_{10}$, and hence do not know where to draw the line separating $P$ from $I$ and $Q$. We consider three cases:

(i) We know $e_{10}$ and $e_{01}$. Set $c = e_{01} + e_{10} - e$. Then for $X \in \{Q, I\}$, since $e^*_{01} \le e_{01}$,

$$t_H(X) \ge (e_{01} + e_{10} - e) + e + 1 - e^*_{01} \ge e_{10} + 1.$$

Thus $\{X : t_H(X) \le e_{10}\}$ is the set of all positive complexes.

(ii) If we have no estimates of $e_{10}$ and $e_{01}$, set $c = e$. Then

$$t_H(X) \ge e + e + 1 - e^*_{01} \ge e + 1.$$

Thus $\{X : t_H(X) \le e\}$ is the set of all positive complexes.

(iii) If the number of positive complexes is known to be exactly $d$, then set $c = 0$, and $\{X : t_H(X) \text{ is among the } d \text{ smallest } t_H \text{ values}\}$ is the set of all positive complexes.
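The three cases of Theorem 5 amount to a choice of the margin c and of a decision rule on the values t_H(X). The following sketch makes that choice explicit. The function names, the dictionary of precomputed t_H values, and the sample numbers are all hypothetical; the sketch assumes the t_H values have already been computed.

```python
def margin_and_threshold(e, e10=None, e01=None):
    """Return (c, threshold) for cases (i) and (ii) of Theorem 5."""
    if e10 is not None and e01 is not None:
        return e10 + e01 - e, e10   # case (i): both bounds known
    return e, e                     # case (ii): no estimates

def decode(t_values, e, e10=None, e01=None, exact_d=None):
    """Pick the positive complexes from precomputed t_H values."""
    if exact_d is not None:
        # Case (iii): c = 0, take the exact_d complexes with the smallest t_H.
        ranked = sorted(t_values, key=t_values.get)
        return set(ranked[:exact_d])
    _, threshold = margin_and_threshold(e, e10, e01)
    return {x for x, t in t_values.items() if t <= threshold}

# Hypothetical t_H values for two positive complexes, one negative, one inhibitor.
t_values = {"P1": 1, "P2": 0, "Q": 4, "I": 3}
print(margin_and_threshold(3, e10=2, e01=2))  # (1, 2)
print(decode(t_values, e=2) == {"P1", "P2"})  # True
print(decode(t_values, e=2, exact_d=2) == {"P1", "P2"})  # True
```

With e = 3 and e10 = e01 = 2, case (i) needs only c + e + 1 = 5 separating rows, whereas case (ii) needs 2e + 1 = 7.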

3 A faster procedure

Suppose $H$ has $n$ vertices and $h$ edges. Then $h$ can be much larger than $n$. For example, if $H$ is the complete $r$-graph, i.e., the edge-set consists of all $r$-sets of vertices, then $h = \binom{n}{r}$. The computation of $t_H(X)$ requires going through all $s$-sets of edges, and there are $\binom{h}{s}$ of them. This could be a very large number. We now show that there is a way to reduce this work by orders of magnitude.
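To give a feel for how quickly the number of s-sets of edges grows, here is a small numeric sketch for the complete r-graph. The parameter values n = 20, r = 3, s = 2 are arbitrary choices for illustration, and the helper name `edge_subset_count` is hypothetical.

```python
from math import comb

def edge_subset_count(n, r, s):
    """Number of s-sets of edges in the complete r-graph on n vertices."""
    h = comb(n, r)  # number of edges
    return comb(h, s)

n, r, s = 20, 3, 2
print(comb(n, r))                  # 1140 edges
print(edge_subset_count(n, r, s))  # 649230 s-sets of edges
```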

A seemingly unrelated notion, the $(d, r; z)$-disjunct matrix, has been well studied (see Dyachkov et al. 2001; Stinson and Wei 2004 for recent developments) in the contexts of superimposed codes and secure key distribution methods. A binary matrix is $(d, r; z)$-disjunct if for any $d + r$ columns $C_1, \ldots, C_{d+r}$, there always exist $z$ rows with 1-entries in $C_1, \ldots, C_r$ and 0-entries in $C_{r+1}, \ldots, C_{d+r}$. We now show the relevance of this matrix to our problem.
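The (d, r; z)-disjunct definition can also be checked by brute force on small matrices. The sketch below assumes a list-of-rows 0/1 matrix, and the name `is_drz_disjunct` is hypothetical. The example matrix, whose rows are the indicator vectors of all 2-subsets of 4 columns, is chosen only because the answer is easy to verify by hand.

```python
from itertools import combinations

def is_drz_disjunct(matrix, d, r, z):
    """Check that every choice of r 'one' columns and d other 'zero' columns
    is witnessed by at least z rows."""
    n_cols = len(matrix[0])
    for ones in combinations(range(n_cols), r):
        rest = [j for j in range(n_cols) if j not in ones]
        for zeros in combinations(rest, d):
            good_rows = sum(
                1 for row in matrix
                if all(row[j] == 1 for j in ones)
                and all(row[j] == 0 for j in zeros)
            )
            if good_rows < z:
                return False
    return True

# Rows are the indicator vectors of all 2-subsets of 4 columns.
pairs = [[1 if j in pair else 0 for j in range(4)]
         for pair in combinations(range(4), 2)]
print(is_drz_disjunct(pairs, d=2, r=2, z=1))  # True
print(is_drz_disjunct(pairs, d=2, r=2, z=2))  # False
```

Each pair of columns is witnessed by exactly one row here, so the matrix is (2, 2; 1)-disjunct but not (2, 2; 2)-disjunct.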

Theorem 6 A (d+ s, r; 2e + 1)-disjunct matrix can identify all positive complexes under the (r, d, s) general inhibitor complex model with at most e errors.

Proof We use the same terminology defined in Theorem 3 except that $S$ now stands for an $s$-set of vertices, instead of an $s$-set of edges. Consider a positive complex $P$ and let $\{X_1, \ldots, X_s\}$ denote a set of other complexes containing all inhibitors. Since no edge is contained in another edge, there exists a vertex $v_i \in X_i \setminus P$ for $1 \le i \le s$. Let $V$ denote the set of these $v_i$. If $|V| < s$, add $s - |V|$ arbitrary other vertices to it to obtain $V'$. Let $N$ denote the set of the $n$ vertices and let $t_N(X)$ be the minimum of $t_S(X)$ over all $s$-sets $S \subseteq N$. Then

$$t_N(P) \le t_{V'}(P) \le e,$$

since $P$ can be in a negative pool only by the occurrence of error.

Next, consider $X \in \{Q, I\}$ and let $\{X_1, \ldots, X_d\}$ denote a set of other complexes containing all positive complexes. As before, we can define $w_i \in X_i \setminus X$ and $W = \{w_i\}$. Again, if $|W| < d$, add $d - |W|$ arbitrary other vertices to obtain $W'$. By the definition of a $(d + s, r; 2e + 1)$-disjunct matrix, there exist at least $2e + 1$ rows in which each of the columns in $X$ has a 1-entry but none of the columns in $W'$ nor the $s$ columns in an arbitrary $s$-set $S$ does. Hence the outcomes of these $2e + 1$ pools should be negative under $S$ except for the occurrence of errors. Hence

$$t_N(X) \ge 2e + 1 - e = e + 1.$$

Thus $\{X : t_N(X) \le e\}$ is the set of positive complexes.

To compute $t_N(X)$, we need only go through $\binom{n}{s}$ $s$-sets of columns, a big reduction from $\binom{h}{s}$.
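The vertex-based decoding of Theorem 6 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: pools are sets of clone indices, complexes are `frozenset`s, the function names are hypothetical, a negative pool is discounted when it contains any vertex of S, and S ranges over vertices outside X. The outcomes are the error-free 1-inhibitor outcomes with A positive and B an inhibitor, written out by hand.

```python
from itertools import combinations

def t_N(pools, outcomes, x, n, s):
    """Minimum over s-sets S of vertices outside x of the number of negative
    pools that cover x and contain no vertex of S."""
    others = [v for v in range(n) if v not in x]
    best = None
    for S in combinations(others, min(s, len(others))):
        blocked = set(S)
        count = sum(
            1 for pool, out in zip(pools, outcomes)
            if out == 0 and x <= pool and not (blocked & pool)
        )
        best = count if best is None else min(best, count)
    return best

def identify_positives_fast(pools, outcomes, edges, n, s, e):
    return {x for x in edges if t_N(pools, outcomes, x, n, s) <= e}

A, B, C, D = (frozenset(c) for c in ({0, 1}, {1, 2}, {2, 3}, {0, 3}))
edges = [A, B, C, D]
# All 2-subsets of 4 clones, plus one pool covering both A and B.
pools = [set(p) for p in combinations(range(4), 2)] + [{0, 1, 2}]
outcomes = [1, 0, 0, 0, 0, 0, 0]  # only the pool {0, 1} is positive
print(identify_positives_fast(pools, outcomes, edges, n=4, s=1, e=0) == {A})  # True
```

For each complex the loop visits at most n choose s vertex sets, here at most 4, instead of every s-set of edges.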

4 The k-inhibitor complex model

In the k-inhibitor complex model, the outcome of a test is positive if and only if it contains at least one positive complex and at most $k - 1$ inhibitors. While Sect. 2 provided a nonadaptive pooling design for this model, we now give a more efficient one, following the approach of De Bonis and Vaccaro (2003) for the k-inhibitor clone model. Call a binary matrix $M$ an $(H : d, m \text{ out of } s)$-disjunct matrix if for any $d + s + 1$ complexes $X_0, X_1, \ldots, X_d, X_{d+1}, \ldots, X_{d+s}$, there exists a test which covers $X_0$ but none of $X_1, \ldots, X_d$ and does not cover at least $m$ of $X_{d+1}, \ldots, X_{d+s}$. Clearly,

Theorem 7 An (H: d, s − k + 1 out of s)-disjunct matrix can identify all positive complexes under the (r, d, s) k-inhibitor complex model.

Proof For an $s$-set $S$ of complexes and a fixed $k$, let $t_S(X)$ denote the number of negative pools in which complex $X$ appears, except that a 0-outcome is changed to 1 if that test covers $k$ complexes from $S$. Then

$$t_H(P) = \min_{S \in \mathcal{H}} t_S(P) = t_{S^*}(P) = 0,$$

where $\mathcal{H}$ is the set of all $s$-sets of edges and $S^*$ is an $s$-set containing all inhibitors.

On the other hand, by the definition of an $(H : d, s - k + 1 \text{ out of } s)$-disjunct matrix, for any complex $X$, there exists a test which covers $X$ but none of $X_1, \ldots, X_d$ and does not cover at least $s - k + 1$ of $X_{d+1}, \ldots, X_{d+s}$. In particular, when $\{X_1, \ldots, X_d\}$ covers the set of positive complexes and $\{X_{d+1}, \ldots, X_{d+s}\}$ is the given set $S$, we have $t_S(X) \ge 1$ for $X \in \{Q, I\}$, for all $S$. Consequently, $t_H(X) \ge 1$ for $X \in \{Q, I\}$. Thus we conclude that $\{X : t_H(X) = 0\}$ is the set of all positive complexes.

Corollary 8 An $(H : d, s - k + 1 \text{ out of } s; c + e + 1)$-disjunct matrix can identify all positive complexes under the (r, d, s) k-inhibitor complex model with at most $e$ errors, where

$$c = \begin{cases} e_{10} + e_{01} - e, & \text{(i) if } e_{10} \text{ and } e_{01} \text{ are known,} \\ e, & \text{(ii) if there are no estimates of } e_{10} \text{ and } e_{01}. \end{cases}$$
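The outcome rule of the k-inhibitor complex model and the modified count t_S(X) from the proof of Theorem 7 can be written down directly. The sketch below uses hypothetical function names, represents pools as sets and complexes as `frozenset`s, and reads "covers k complexes from S" as "covers at least k complexes from S", which is an interpretation rather than something the text states.

```python
def k_inhibitor_outcome(pool, positives, inhibitors, k):
    """Error-free outcome: positive iff the pool covers at least one positive
    complex and at most k - 1 inhibitors."""
    n_inhibitors = sum(1 for i in inhibitors if i <= pool)
    has_positive = any(p <= pool for p in positives)
    return 1 if has_positive and n_inhibitors <= k - 1 else 0

def t_S_k(pools, outcomes, x, s_set, k):
    """Negative pools covering x, skipping pools that cover >= k complexes of s_set."""
    return sum(
        1 for pool, out in zip(pools, outcomes)
        if out == 0 and x <= pool
        and sum(1 for y in s_set if y <= pool) < k
    )

A, B, C = (frozenset(c) for c in ({0, 1}, {1, 2}, {2, 3}))
pools = [{0, 1}, {0, 1, 2}, {0, 1, 2, 3}]
outcomes = [k_inhibitor_outcome(p, {A}, {B, C}, k=2) for p in pools]
print(outcomes)                                # [1, 1, 0]
print(t_S_k(pools, outcomes, A, {B, C}, k=2))  # 0
```

With k = 2, the second pool stays positive because it covers only one inhibitor, and the third pool's negative outcome is discounted for the positive complex A once S contains both inhibitors.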

There is also a fast procedure for this result. A binary matrix is $((d, m \text{ out of } s), r; z)$-disjunct if for any $d + s + r$ columns $C_1, \ldots, C_{d+s+r}$, there always exist $z$ rows with 1-entries in $C_1, \ldots, C_r$, 0-entries in $C_{r+1}, \ldots, C_{d+r}$, and at least $m$ 0-entries in $C_{d+r+1}, \ldots, C_{d+r+s}$. A $((d, s - k + 1 \text{ out of } s), r; 2e + 1)$-disjunct matrix can identify all positive complexes under the (r, d, s) k-inhibitor complex model with at most $e$ errors. This procedure also results in a big reduction in computation, namely, from computing $t_H(X)$ to computing $t_N(X)$.

For the 1-inhibitor model, a further reduction in computation is possible. Note that any pool containing an inhibitor complex induces a negative outcome unless an error occurs. We use this property to confine all inhibitor complexes to a small set $O$ of complexes. Let $t_1(X)$ denote the number of positive pools in which $X$ appears. Define $O = \{X : t_1(X) \le e\}$. Then $O$ contains all inhibitor complexes. So instead of computing $t_H(X)$ over all $s$-sets $S$ of edges, we need only compute $t_O(X)$ over all $S \subseteq O$. Further, $(H : d, s \text{ out of } s; c + e + 1)$-disjunct is just $(H : d + s; c + e + 1)$-disjunct. Hence for any $s + 1$ complexes $X_0, X_1, \ldots, X_s$, there exist $c + e + 1$ rows covering $X_0$ but none of $X_1, \ldots, X_s$. In particular, this is true when $X_0$ is positive and $\{X_1, \ldots, X_s\}$ contains all inhibitor complexes. Thus $t_1(P) \ge c + e + 1$ and we need to compute $t_O(X)$ only for those $X$ satisfying $t_1(X) \ge c + e + 1$ to identify positive complexes.
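The candidate set O for the 1-inhibitor model is cheap to compute, as the following sketch suggests. The function names are hypothetical, pools are sets, complexes are `frozenset`s, and the outcomes are error-free 1-inhibitor outcomes written out by hand for a small example in which A is positive and B is an inhibitor.

```python
def t1(pools, outcomes, x):
    """Number of positive pools that cover the complex x."""
    return sum(1 for pool, out in zip(pools, outcomes) if out == 1 and x <= pool)

def inhibitor_candidates(pools, outcomes, edges, e):
    """The set O = {X : t1(X) <= e}; every inhibitor complex lies in it."""
    return {x for x in edges if t1(pools, outcomes, x) <= e}

A, B, C, D = (frozenset(c) for c in ({0, 1}, {1, 2}, {2, 3}, {0, 3}))
pools = [{0, 1}, {1, 2}, {2, 3}, {0, 3}, {0, 1, 2}]
outcomes = [1, 0, 0, 0, 0]  # only the pool {0, 1} is positive
O = inhibitor_candidates(pools, outcomes, [A, B, C, D], e=0)
print(B in O, A in O)  # True False
```

Only s-sets drawn from O then need to be tried when computing the minimum, instead of s-sets drawn from all edges.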

A similar reduction is possible for the faster procedure. Define $O' = \{C : C \in X \in O\}$, the set of columns appearing in complexes of $O$. Then we need to compute $t_{O'}(X)$ instead of $t_N(X)$.

5 Conclusions

We introduced a new pooling design environment by allowing the coexistence of inhibitors and complexes which, separately, have been well studied in the literature. We gave a nonadaptive pooling design, with error-tolerance ability, for the most general model in such an environment with no need to know the exact relation between inhibitors and positive complexes. Thus the design is applicable to many practical situations. This design is an (H: d; z)-type disjunct matrix whose construction has been studied in Dyachkov et al. (2001), Gao et al. (2006), and Stinson and Wei (2004).

References

Chang FH, Hwang FK (2007) The identification of positive clones in a general inhibitor model. J Comput Syst Sci 73:1090–1094

De Bonis A, Vaccaro U (1998) Improved algorithms for group testing with inhibitors. Inf Process Lett 67:57–64

De Bonis A, Vaccaro U (2003) Constructions of generalized superimposed codes with applications to group testing and conflict resolution in multiple access channels. Theor Comput Sci A 356:223–243

Dyachkov AG, Macula AJ, Torney DC, Villenkin PA (2001) Two models of nonadaptive group testing for designing screening experiments. In: Atkinson AC, Hackl P, Muller WG (eds) Proc 6th int workshop on model-oriented design and analysis. Physica-Verlag, Heidelberg, pp 63–75

Du DZ, Hwang FK (2000) Combinatorial group testing and its applications, 2nd edn. World Scientific, Singapore

Du DZ, Hwang FK (2005) Identifying d positive clones in the presence of inhibitors. Int J Bioinform Res Appl 1:162–168


Farach M, Kannan S, Knill E, Muthukrishnan S (1997) Group testing problems with sequences in experimental molecular biology. In: Carpentieri B et al (eds) Proc compression and complexity of sequences. IEEE Press, New York, pp 357–367

Gao H, Hwang FK, Thai M, Wu W, Znati T (2006) Construction of d(H )-disjunct matrix for group testing in hypergraphs. J Comb Optim 12:297–301

Hwang FK, Liu YC (2003) Error-tolerant pooling designs with inhibitors. J Comput Biol 10:231–236

Stinson DR, Wei R (2004) Generalized cover-free families. Discrete Math 279:463–477