Combinatorial Group Testing Problems on Various Models

(組合群試問題的研究)

Student: Huilan Chang (張惠蘭)

Advisor: Prof. Hung-Lin Fu (傅恆霖)

National Chiao Tung University

Department of Applied Mathematics

Doctoral Dissertation

A Dissertation

Submitted to Department of Applied Mathematics

College of Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

in

Applied Mathematics

June 2010

Hsinchu, Taiwan, Republic of China


Abstract

In the classical group testing problem, there is a set N of n clones, each of which is either positive or negative. The task is to identify all positive clones by group tests while minimizing the number of tests. Motivated by applications, many studies have introduced a third type of clone, called an “inhibitor”, whose effect is, in a sense, to obscure the positive clones in pools. Furthermore, in many applications, a subset of clones (rather than a single clone), called a complex, can induce a positive effect.

There are two general types of group testing algorithms: sequential and nonadaptive. A sequential algorithm conducts the tests one by one, so that the outcomes of all previous tests can guide the design of later ones, while a nonadaptive algorithm specifies all tests in advance, so that all tests can be conducted simultaneously. Generally, sequential algorithms require fewer tests than nonadaptive ones, but performing all tests of a sequential algorithm takes more time than performing all tests of a nonadaptive one.

The group testing model that takes inhibitors (respectively, complexes) into consideration is referred to as an inhibitor model (respectively, a complex model). These two models have been well studied in the group testing literature. In this thesis, we first study group testing problems in a new pooling design environment that allows the coexistence of inhibitors and complexes, and we devote our attention to nonadaptive algorithms. To identify positive items, we attach a novel property, “inclusiveness”, to a design. This property and a well-studied property, “disjunctness”, lead to a significant improvement in the decoding procedure. In addition to the identification problem, where only positive items are identified, we also attempt to classify all items. We prove that the well-studied “(d, r; z]-disjunct matrices” are sufficient for the classification problems and are associated with fast decoding procedures.

In the identification and classification problems, (H : d; z)-disjunctness, (d, r; z]-disjunctness, and (d, r; z]-disjunctness combined with (h, r; y]-inclusiveness with z > y are the three main properties of matrices employed as nonadaptive pooling designs. We study their constructions and lower bounds on the number of rows (tests).

Finally, we study the graph reconstruction problem, which is a generalization of the classical combinatorial group testing problem. A group testing problem is a search paradigm in which it is usually assumed that there are at most d positive items among the given items. A graph reconstruction problem is to reconstruct a hidden graph G from a given family of graphs by asking queries of the form “Does a set of vertices induce an edge of G?”. Reconstruction problems on families of Hamiltonian cycles, matchings, stars, and cliques on n vertices have been studied, and algorithms using at most 2n lg n, (1 + o(1))(n lg n), 2n, and 2n queries, respectively, were proposed. We exploit strategies such as the affine plane method to improve these bounds to (1 + o(1))(n lg n), (1 + o(1))((n/2) lg n), n + 2 lg n, and n + lg n, respectively.

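Abstract (Chinese version)

The classical group testing problem is to identify the positive clones in a population consisting of positive and negative clones. The tool used is the group test, and the main concern is how to reduce the number of group tests. Applications often give rise to other types of clones, such as inhibitors. An inhibitor interferes with the behavior of positive clones so that they cannot function normally; hence a group test containing a positive clone may fail to reveal its presence if the test also contains an inhibitor. Moreover, in DNA screening, certain clones can combine into a complex with particular properties, which we call a clone complex; accordingly, in contrast to the clone model, the complex model concerns the identification of positive complexes.

Sequential algorithms and nonadaptive algorithms are the two general types of group testing algorithms. In the former, tests are conducted one by one, and the next test can be designed according to the outcomes of the previous tests. In the latter, all tests are conducted simultaneously; that is, all tests are designed only from the information and assumptions given by the problem, and together they must still be able to identify all positive items. In general, a sequential algorithm uses fewer tests than a nonadaptive one, but takes more time to complete all of them.

The inhibitor model refers to a group testing model that contains inhibitors, while the complex model refers to problems posed on complexes. Each of these two models has been well developed in the group testing literature. In this thesis, we propose a group testing environment in which inhibitors and complexes coexist (the inhibitor complex model) and concentrate on the design of nonadaptive algorithms. We use a newly proposed notion, “inclusiveness”, to design algorithms. Designs built on “disjunctness” together with this notion significantly reduce the time needed to decode the test outcomes. Besides identifying positive complexes, we also classify complexes; that is, we design algorithms that distinguish all complexes (positive, negative, and inhibitor). The (d, r; z]-disjunct pooling design is a well-developed design tool; we prove that it suffices for classifying complexes and also comes with a fast decoding procedure.

In summary, (H : d; z)-disjunctness, (d, r; z]-disjunctness, and (h, r; y]-inclusiveness are the three main tools for designing nonadaptive algorithms. We also discuss their constructions and lower bounds on the number of tests required.

Finally, we study a generalization of the classical group testing problem: the graph reconstruction problem. Group testing can be viewed as an instance of a search problem, usually under an assumption on the total number of positive items. Graph reconstruction is another kind of search problem, whose main task is to identify a hidden graph that is known to be one of many possible graphs. The identification tool is a query similar to a group test: a query tells whether a subset of vertices contains all the vertices of some edge of the hidden graph. Some results on graph reconstruction are known, for example when the hidden graph is a Hamiltonian cycle, a matching, a star, or a clique. For these problems, we use several strategies to improve the known results, such as the affine plane method and the matching-structure method (maximal ...).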

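Acknowledgments

“When practicing shots from beyond the three-point line, most people practice right along the line; but with a small adjustment, practicing each time from a little farther out, shooting from the line itself will no longer be difficult!” This is an insight of my advisor, Professor Hung-Lin Fu (傅恆霖), reminding us of the vision we should have when doing research.

Along the road of research, Professor Fu has always offered me many directions and possibilities and encouraged me to try them. With his encouragement, I received support from the 千里馬 program to visit the DIMACS Center for one year. The training during that period brought me real progress in every respect and deepened my research, and for all of this I thank Professor Fu for his support. Professor Fu also hopes that we develop the habit of exercising. Later, while playing badminton, I came to understand his intention: competitive play trains concentration and ambition and fosters a spirit of cooperation, which is exactly what research requires. I am very grateful for everything Professor Fu has taught me.

Professor Frank K. Hwang (黃光明) is the teacher who led me into the world of mathematical research. Both during my master's studies and in my contact with him in recent years, I have learned many things on many levels from him. Professor Hwang also points out where I need to improve and the attitude I should hold in different situations, which has benefited me greatly. Doing research with Professor Hwang is a great pleasure. Whether pointing out problems or giving praise, he has given me a great deal of positive energy. I thank Professor Hwang for his guidance.

I would especially like to thank Professor Chiuyuan Chen (陳秋媛). I have known her for many years, and we all feel like her grown-up children. She has always given me much care and help at crucial moments and has taught me a great deal academically; in particular, her teaching gave me my foundation in algorithms. I thank her for her constant encouragement and guidance. Next, I thank Professor Michael Fuchs (符麥克). He teaches with great dedication, and whenever I ask him a question he explains it carefully. I am fortunate to have done research with him; in our discussions he always shares everything he knows, and he is a teacher I deeply admire. I also thank several teachers I met as an undergraduate, especially Professors 蘇用善, 謝維華, and 葉芳柏; their encouragement and help have had a profound influence on me, and I have always been grateful to them.

Next, I thank my senior schoolmates for their help, especially 賓賓 and 君逸. 賓賓 takes great care of us juniors; he shares his own observations and offers many perspectives on doing research, and when asked he often gives mature and objective advice. 君逸 is an outstanding senior with rich academic experience who shares it without reservation, and he helped me a great deal with my dissertation and the preparation for my oral defense. I must also thank the friends who have helped and cheered me on all along: 馨華, 雅靜, 慧珊, 敏筠, 康康, 小巴, 千砡, 小貓, 貓頭, 智懷, 威雄, and others. Space is limited, and there are many more good companions I cannot mention; I thank them all, for they made my life as a graduate student fuller and more enjoyable.

Finally, I thank my dear family for your encouragement and support. This dissertation is dedicated to you.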


List of Tables


List of Figures

2.1 An example of Theorem 2.2.2

3.1 An example of Theorem 3.1.4

5.1 An example of FIND-ALL-PATHS algorithm


Chapter 1 Introduction

Given a set N of n clones, each of which is either positive (usually called defective) or negative (usually called good), the group testing problem is to identify all positive ones by group tests. A group test is applied to a subset of N and has two possible outcomes: a negative outcome indicates that all clones in the subset are negative, and a positive outcome indicates otherwise. In particular, a group test on a single clone reveals whether that clone is positive. Consequently, the main issue is to minimize the number of group tests needed to identify all positive clones.
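As a minimal sketch of the definition above, the following Python snippet models a group test as an oracle on a subset of clones. The function name `group_test` and the example sets are illustrative assumptions, not notation from the thesis.

```python
def group_test(pool, positives):
    """Outcome of one group test: True (positive) iff the pool holds a positive clone."""
    return any(clone in positives for clone in pool)


# Hypothetical instance: 8 clones, two of which are positive.
clones = range(8)
positives = {2, 5}

negative_pool_outcome = group_test({0, 1, 3}, positives)  # all clones in the pool are negative
positive_pool_outcome = group_test({1, 2, 3}, positives)  # the pool contains clone 2

# Testing every clone individually always works but costs n tests;
# group testing aims to use far fewer.
identified = {c for c in clones if group_test({c}, positives)}
print(negative_pool_outcome, positive_pool_outcome, sorted(identified))
```

Individual testing serves only as the baseline here; the algorithms discussed in the thesis aim to beat it.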

The origin of group testing can be traced back to World War II. The concept of group testing was first conceived in a session in the offices of the Price Statistics Branch of the Research Division of the Office of Price Administration in Washington, D.C. Researchers in the session, such as David Rosenblatt and Robert Dorfman, were struck by the wastefulness of testing blood samples from millions of draftees to detect a few thousand cases of syphilis. They suggested that pooling the blood samples might be economical (for more detail, see Du and Hwang, 1993 [20]).

In the probabilistic model of group testing, a probability distribution is attached to the positive set, and the expected number of tests required to identify the positive elements is the criterion of efficiency. Robert Dorfman (1943) [19] studied the group testing problem under the probabilistic model and proposed a method that could eliminate all syphilitic men called up for induction. However, the need for group testing faded with the conclusion of World War II. Group testing stayed dormant for many years until it came into use in industry. Sobel and Groll (1959) [46], two Bell Laboratory scientists, considered many industrial applications of group testing and also studied group testing under probabilistic models. Li (1962) [37] was the first to study combinatorial group testing, in which probability distributions on the positive set are eliminated entirely; for instance, the number of positive items among the n items can be assumed to be at most d. Henceforward, combinatorial group testing developed alongside probabilistic group testing and has prospered due to its applications in chemical leak testing, electric shorting detection, codes, multi-access channel communication, and AIDS screening (see Du and Hwang, 1993 [20] and 2nd ed. 2000 [21] for a general reference). Recently, group testing has been found useful in molecular biology, where it is usually referred to as pooling designs. This new application also generates new models and new problems, such as pooling designs on complexes (Torney, 1999 [52]), the inhibitor model (Farach et al., 1997 [28]), contig sequencing, and the non-unique probe selection problem (Du and Hwang, 2006 [23]).

1.1 Preliminaries on Algorithm

There are two general types of group testing algorithms: sequential and nonadaptive. A sequential algorithm conducts the tests one by one, so that the outcomes of all previous tests can guide the design of later ones. A nonadaptive algorithm specifies all tests in advance, so that all tests can be conducted simultaneously. Sequential algorithms generally require fewer tests, since the extra information allows for more efficient test designs, while nonadaptive algorithms allow all tests to be conducted simultaneously, thus saving testing time. Sequential algorithms have dominated the literature historically because the main goal of group testing is to minimize the number of tests required to identify all positive items. However, in some applications, such as molecular biology, an experiment corresponding to a group test is considerably time-consuming, so it is impractical to perform the experiments sequentially. The focus then shifts to nonadaptive group testing algorithms, in which all experiments are performed simultaneously. Sequential procedures can still be used, but the total time required to identify the positive items must be considered along with the total number of tests. There is a natural tradeoff between sequential and nonadaptive algorithms: one can seek 2-stage or k-stage algorithms in which all tests in a stage must be specified simultaneously, but the stages are sequential.
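To make the contrast concrete, here is a small sketch for the simplest possible setting, which is an assumption of this example rather than a setting treated in the thesis: exactly one positive clone among n. The sequential routine chooses each pool after seeing the previous outcome; the nonadaptive routine fixes all pools in advance (one per bit of the clone index) and decodes afterwards. Both function names are hypothetical.

```python
def find_single_positive_sequential(clones, test):
    """Sequential halving: each pool depends on the previous outcomes."""
    candidates = list(clones)
    tests_used = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        half = candidates[:mid]
        tests_used += 1
        # Keep the half that must contain the single positive clone.
        candidates = half if test(set(half)) else candidates[mid:]
    return candidates[0], tests_used


def find_single_positive_nonadaptive(n, test):
    """Nonadaptive bit design: pool j holds the clones whose j-th index bit is 1."""
    bits = max(1, (n - 1).bit_length())
    pools = [{c for c in range(n) if (c >> j) & 1} for j in range(bits)]
    outcomes = [test(pool) for pool in pools]  # all pools are fixed before any outcome is seen
    return sum(1 << j for j, positive in enumerate(outcomes) if positive)


def oracle(pool):
    # Hypothetical hidden state: clone 11 is the only positive one.
    return 11 in pool


sequential_result = find_single_positive_sequential(range(16), oracle)
nonadaptive_result = find_single_positive_nonadaptive(16, oracle)
print(sequential_result, nonadaptive_result)
```

In this toy case both approaches use four tests; the gap in test counts between the two types of algorithms appears once several positives are allowed.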

With experimental errors, test outcomes may contain false negatives and false positives. The former means that a test yields a negative outcome when the pool contains at least one positive clone. Likewise, the latter means that a test yields a positive outcome when the pool contains no positive clones. Error tolerance is therefore a concern when proposing a design.

1.2 Models Originating from Applications in Molecular Biology

The wide range of conditions in which group testing has practical applications calls for meaningful variants of the basic model that better accommodate the applications at hand. In this section, we introduce three models of group testing that originated from applications in molecular biology: inhibitor, complex, and graph reconstruction. These models have been studied in separate literatures. We follow the original terminology of each model.

1.2.1 Group Testing with Inhibitors

In certain applications, there is a third type of clones, called inhibitors, whose presence may cancel the effect of positive clones; the number of such clones in the population is usually assumed to be at most h. Various models can be formulated with inhibitors in the pooling design, depending on the interference between inhibitors and positive clones. Farach et al. (1997) [28], motivated by molecular biology applications, first proposed the 1-inhibitor model, in which a single inhibitor clone dictates the testing outcome to be negative regardless of how many positive clones are in the test, and gave a randomized algorithm to identify all positives in O((d + h) log n) tests. Inhibitors arise in several settings: in molecular biology, enzyme inhibitors are molecules that interact in some way with the enzyme to prevent it from working normally; in drug discovery applications, certain compounds can block the detection of a potent compound (Xie et al., 2001 [54]); similar phenomena were mentioned in blood testing applications (Phatarfod and Sudbury, 1994 [44]).

De Bonis and Vaccaro (1998) [17] connected the 1-inhibitor model to a certain generalization of superimposed codes (D’yachkov and Rykov, 1983 [26]) and provided a lower bound of $\Omega\left(\frac{h^2}{d \log h} \log n\right)$ on the number of tests required to identify exactly d positives in the presence of h inhibitors. Further, De Bonis et al. (2005) [16] gave an asymptotically optimal 4-stage algorithm for the 1-inhibitor model under the assumption that the exact number of positives and an upper bound on the number of inhibitors are known beforehand. Note that all these algorithms are sequential. Recently, nonadaptive pooling designs have been proposed for the inhibitor model (D’yachkov et al., 2001 [25]; Hwang and Liu, 2003 [32]; Du and Hwang, 2005 [22]).

De Bonis and Vaccaro (2003) [18] extended the model to the k-inhibitor model, in which k inhibitor clones dictate the testing outcome to be negative. In general, one can consider a (k, g)-inhibitor model in which k inhibitors cancel the effect of g positive clones.
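The outcome rule of the 1-inhibitor and k-inhibitor models can be sketched as a small function. Reading "k inhibitor clones dictate the outcome" as "the pool contains at least k inhibitors" is an assumption of this sketch, as are the function and variable names.

```python
def outcome_k_inhibitor(pool, positives, inhibitors, k=1):
    """Test outcome in the k-inhibitor clone model (k = 1 gives the 1-inhibitor model).

    Assumed rule: the outcome is positive iff the pool contains a positive clone
    and fewer than k inhibitors.
    """
    has_positive = bool(pool & positives)
    cancelled = len(pool & inhibitors) >= k
    return has_positive and not cancelled


# Hypothetical instance.
positives = {0, 1}
inhibitors = {7, 8}

plain = outcome_k_inhibitor({0, 3, 4}, positives, inhibitors)          # positive, no inhibitor
masked = outcome_k_inhibitor({0, 3, 7}, positives, inhibitors)         # one inhibitor masks clone 0
tolerated = outcome_k_inhibitor({0, 3, 7}, positives, inhibitors, k=2) # one inhibitor is not enough for k = 2
print(plain, masked, tolerated)
```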

Besides the mathematical complexity of dealing with various inhibitor models, determining which model fits reality is also a practical question. Hwang and Chang (2007) [33] considered the general inhibitor model, which applies without knowing the exact relation between inhibitors and positive clones. De Bonis (2008) [15] proposed an almost optimal algorithm using $O\left(\frac{h^2}{d} \log(n/h)\right)$ tests under the hypothesis that the exact number d of positives is given. In particular, this algorithm is a trivial two-stage algorithm; that is, most non-positive candidates are eliminated in the first stage and the remaining clones are tested individually in the second stage.

1.2.2 Group Testing on Complexes

The classical group testing problem has a set of elements each of which induces a positive or negative effect. In many DNA screening environments, a subset of clones (rather than a single clone), called a complex, can induce a positive effect. We call such a model the complex model, in contrast to the clone model discussed previously. Formally, in the complex model, we consider a given set H of complexes in which a fixed but unknown subset of complexes are designated positive, while the other candidates are called negative complexes. In particular, the case H = N is referred to as the clone model. A group test is executed on a subset of N and yields a positive outcome only when it contains at least one positive complex. To obtain an efficient design, we need to make some assumptions on the positive set. The simplest assumption is an upper bound d on the number of positive complexes in the test population. It is usually assumed that two positive complexes can overlap, but neither contains the other. Torney (1999) [52] first introduced the concept of the complex model and gave some substances in eukaryotic DNA transcription and RNA translation as examples of complexes.
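The test outcome in the complex model can be illustrated as follows. This is a sketch that assumes no inhibitors and no errors, so that a pool is positive exactly when it contains every clone of some positive complex; the names are hypothetical.

```python
def outcome_complex(pool, positive_complexes):
    """Positive iff the pool contains all clones of at least one positive complex."""
    return any(cx <= pool for cx in positive_complexes)


# Hypothetical positive complexes over clones 0..5.
positive_complexes = [frozenset({0, 1}), frozenset({2, 3, 4})]

partial = outcome_complex({0, 2, 3}, positive_complexes)  # parts of both complexes, neither complete
full = outcome_complex({0, 1, 5}, positive_complexes)     # contains the whole complex {0, 1}

# The clone model is the special case in which every complex is a single clone.
clone_model = outcome_complex({3, 7}, [frozenset({7})])
print(partial, full, clone_model)
```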

Group testing on complexes is widely applied in modern molecular and cellular biology. A prominent example is its application in the identification of protein-to-protein interactions (Lappe and Holm, 2003 [36]; Li et al., 2005 [38]). The interactions between proteins are significant for many biological functions. For example, in the signal transduction process, the protein-to-protein interactions of the signaling molecules convey signals from the exterior of a cell to its interior. This conveying process plays a fundamental role in living cells. Furthermore, information about the interactions between proteins improves our understanding of diseases and thus provides the basis for new therapeutic approaches. Therefore, in many biological projects, identifying all protein-to-protein interactions is an essential task. The development of some laboratory approaches (Lappe and Holm, 2003 [36]) enables the application of group testing to this problem. Li et al. (2005) [38] formulated this identification problem as a group testing problem in bipartite graphs, which can be regarded as a special case of group testing on complexes. Besides the protein-to-protein interaction problem, other problems such as graph testing, superimposed codes, and secure key distribution are also closely related to the complex model. Recent developments on this topic can be found in (Macula et al., 2000 [42]; Macula et al., 2004 [41]; Du and Hwang, 2006 [23]; Gao et al., 2006 [29]; Chen et al., 2007 [13]; Chen et al., 2008 [14]).

Chang et al. (2010) [9] first introduced the inhibitor complex model, in which an inhibitor is a third type of complex. As in the inhibitor clone model, the presence of an inhibitor may cancel the effect of positive complexes; in other words, a group test executed on a set of clones containing an inhibitor may yield a negative outcome even if that set contains a positive complex. Furthermore, the inhibitor complex model, like the inhibitor clone model, can be subdivided into the 1-inhibitor, k-inhibitor, and general inhibitor models according to the interference between positive complexes and inhibitors. For instance, under the k-inhibitor model, a pool of clones containing at least k inhibitors would yield a negative response.

1.2.3 Graph Reconstruction Model

Combinatorial search problems on graphs in the literature (Aigner, 1988 [6]) include identifying an unknown edge or vertex in a given graph, verifying a property of a hidden graph, reconstructing a hidden graph of a given class, and others. The graph reconstruction problem we consider here is as follows. A hidden graph G is known to belong to a given family $\mathcal{G}$ of labeled graphs on the set [n] := {1, 2, ..., n}. The main task is to reconstruct G by asking as few queries as possible, where a query is of the form “Does S induce at least one edge of G?”, denoted by Q(S), for S ⊆ [n], with Q(S) = 1 representing “yes” and Q(S) = 0 representing “no”. Of course, the design of queries depends on the information provided by $\mathcal{G}$.
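The query Q(S) can be sketched as a closure over the hidden edge set. The helper name `make_query`, the example graph, and the naive pair-by-pair reconstruction are illustrative assumptions; the naive method uses all n(n - 1)/2 pairs and only serves as a baseline for the better algorithms cited in this chapter.

```python
from itertools import combinations


def make_query(hidden_edges):
    """Build Q for a hidden graph: Q(S) = 1 iff S contains both endpoints of some edge."""
    edge_set = {frozenset(e) for e in hidden_edges}

    def Q(S):
        S = set(S)
        return int(any(e <= S for e in edge_set))

    return Q


# Hypothetical hidden graph: the Hamiltonian cycle 1-2-3-4-5-1 on [5].
n = 5
Q = make_query([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)])

no_edge = Q({1, 3})       # 1 and 3 are not adjacent on the cycle
has_edge = Q({1, 2, 4})   # contains the edge {1, 2}

# Naive baseline: query every pair of vertices.
pairs = list(combinations(range(1, n + 1), 2))
reconstructed = {frozenset(p) for p in pairs if Q(p) == 1}
print(no_edge, has_edge, len(pairs), len(reconstructed))
```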


Different settings on the prior knowledge of the hidden graph produce various graph reconstruction problems. The group testing problem under the complex model is a (hyper)graph version of the graph reconstruction problem, in which the vertices stand for the clones, the edges stand for the complexes, and the number of edges of the hidden graph is assumed to be at most d. Moreover, hidden graphs of bounded degree were studied in (Grebinski and Kucherov, 2000 [31]; Bouvel et al., 2005 [8]), while general hidden graphs were considered in (Bouvel et al., 2005 [8]; Angluin and Chen, 2008 [5]). We study graph reconstruction problems under the assumption that the structure of the hidden graph is known.

Various families of hidden graphs have been studied. Many recent studies focus on two cases, Hamiltonian cycles and matchings (Grebinski and Kucherov, 1998 [30]; Beigel et al., 2001 [7]; Alon et al., 2004 [2]), which have specific applications to the genome sequencing problem. In genome sequencing, the contigs, which are longer continuous fragments formed from overlapping short reads, cover the genome with possible gaps. The task is to determine the relative placement of the contigs on the genome. A tool for doing this is an experiment called multiplex Polymerase Chain Reaction (PCR) (Sorokin et al., 1996 [47]). In a multiplex PCR, the input of an experiment is a set of primers, which are short nucleotide sequences that characterize the ends of the contigs. Whenever the input set contains two primers corresponding to adjacent ends of neighboring contigs, the experiment outputs a reaction yielding a PCR product. Hence, the relative placement of contigs can be represented by the reaction graph, which has the primers as its vertices and the pairs of primers that react as its edges. In particular, for a circular genome, the reaction graph is a Hamiltonian cycle if the two primers of each contig are mixed together and considered as one vertex, or a matching if primers are treated independently, i.e., each primer corresponds to a vertex. The problem can be generalized to identifying the pairs that react with each other among a given set of molecules (Torney, 1999 [52]; Alon and Asodi, 2005 [1]).


Sequential algorithms for graph reconstruction problems on some families of hidden graphs of known structure have been proposed. Grebinski and Kucherov (1998) [30] gave a sequential algorithm that reconstructs a Hamiltonian cycle in 2n lg n queries (lg := log₂), while the information lower bound for the number of queries needed is n lg n. Bouvel et al. (2005) [8] provided a sequential algorithm that reconstructs a matching in (1 + o(1))(n lg n) queries, while (1 + o(1))((n/2) lg n) is the best lower bound known so far, and an algorithm that reconstructs a star in 2n queries, while the information lower bound is (1 + o(1))n. They also proved that a clique of unknown size can be reconstructed in 2n queries, while n queries are required in the worst case. These gaps leave room for improvement.

1.3 Thesis Overview

The inhibitor complex model, introduced by Chang et al. (2010) [9], is a new group testing environment that allows the coexistence of inhibitors and complexes. In Chapter 2, we study the group testing problem in the inhibitor complex model. We devote our attention to efficient nonadaptive designs with fast decoding procedures.

For group testing problems in the inhibitor model, much research has been devoted to identifying all positive items; however, only a few studies have addressed classifying all items, especially for nonadaptive designs. Furthermore, almost no work has been done on the classification problems under the inhibitor complex model. Yet the identification of inhibitory substances is important in many practical applications; for example, many drugs are enzyme inhibitors because they reduce the activity of enzymes, thus leading to the destruction of a pathogen or the correction of a metabolic disturbance. In Chapter 3, we provide efficient nonadaptive algorithms for the classification problems under the 1-inhibitor model. Notably, the pooling designs we propose have polynomial decoding procedures, i.e., determining the three types of complexes from the testing outcomes can be done in polynomial time. Finally, for the k-inhibitor clone model, we solve the classification problems with both efficient nonadaptive algorithms and fast decoding procedures (joint work with Chen and Fu, 2010 [10]).

From Chapter 2 and Chapter 3, we conclude that (H : d; z)-disjunctness, (d, r; z]-disjunctness, and (h, r; y]-inclusiveness are the three main properties of matrices employed as efficient designs. Many studies have been done on the constructions of (H : d; z)-disjunct matrices and (d, r; z]-disjunct matrices. In Chapter 4, we introduce their constructions and some lower bounds that are widely discussed in the literature. A matrix that is both (d, r; z]-disjunct and (h, r; y]-inclusive was newly proposed in (Chang et al., 2010 [10]; Chen, 2006 [12] for r = 1), and little literature is available on its construction. Accordingly, we provide some general results and prove that some well-known constructions of disjunct matrices have a certain inclusiveness property.

In Chapter 5, we show some improvements on sequential algorithms for graph reconstruction problems. We employ an affine plane method (Tettelin et al., 1996 [51]; Grebinski and Kucherov, 1998 [30]), together with constructing a maximal matching first and some other strategies, to derive better algorithms (Chang et al., 2010 [10]). We improve the result in (Grebinski and Kucherov, 1998 [30]) on Hamiltonian cycles by a factor of 1/2. We also provide algorithms that close the gaps between the lower and upper bounds on the numbers of queries required to reconstruct a matching and a star of unknown size. Further, we slightly improve the result in (Bouvel et al., 2005 [8]) on cliques by giving an algorithm with at most n + lg n queries.


Chapter 2 Nonadaptive Pooling Designs with Fast Decoding Procedures

In this chapter we study group testing problems in the inhibitor complex model. We devote our attention to nonadaptive designs that are not only efficient in terms of the number of tests, but also associated with fast decoding procedures.

A nonadaptive group testing scheme can be represented as a 0-1 (or binary) matrix whose columns are labeled by clones and whose rows are labeled by tests. Row $j$ intersects (has a 1-entry in) column $i$ exactly when test $j$ contains clone $i$. Sometimes it is convenient to view a column $C_i$ as the set of tests (rows) containing the clone $C_i$. Thus $C_i \cap C_{i'}$ is the set of tests (rows) containing (intersecting) both $C_i$ and $C_{i'}$. Accordingly, for a complex $X$,

\[ \cap X := \bigcap_{C \in X} C \]

is the set of rows intersecting all clones in $X$, and we say that a row $j$ covers $X$ if $j \in \cap X$.
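The set view of columns and the notion of a row covering a complex can be sketched as follows. The matrix `M`, with clones indexed 0..3, and the helper names are hypothetical choices for illustration.

```python
# Rows are tests, columns are clones; M[j][i] == 1 means test j contains clone i.
M = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]


def column_as_rows(M, i):
    """The column of clone i viewed as the set of tests (rows) containing it."""
    return {j for j, row in enumerate(M) if row[i] == 1}


def rows_covering(M, X):
    """The set written as ∩X in the text: rows intersecting every clone of complex X."""
    rows = set(range(len(M)))
    for i in X:
        rows &= column_as_rows(M, i)
    return rows


col0 = column_as_rows(M, 0)
cover_01 = rows_covering(M, {0, 1})
cover_23 = rows_covering(M, {2, 3})
print(col0, cover_01, cover_23)
```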

For nonadaptive pooling designs, some enumerators are frequently used to differentiate complexes (or clones) of different properties. For example, let $\tau_0(X)$ denote the number of negative pools in which complex $X$ appears. Then, according to the testing outcomes of a design, this enumerator could serve as a cutoff function, i.e., there may be a fixed value, say $a$, such that $\tau_0(X) \le a$ holds exactly for the complexes of one type (for instance, the positive ones). Furthermore, for a set $S$ of complexes (or clones), let $\tau_0^S(X)$ denote the same count except that a negative pool that covers an element of $S$ is not counted. Similarly, let $\tau_1(\cdot)$ and $\tau_1^S(\cdot)$ denote the numbers of corresponding positive pools, respectively.
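A direct way to compute these enumerators from a test matrix and an outcome vector is sketched below. The function `tau`, the matrix, and the outcome vector are hypothetical; `outcome=0` gives the count of negative pools and `outcome=1` the count of positive pools, with pools covering an element of `S` left out.

```python
def rows_covering(M, X):
    """Rows (tests) that contain every clone of X."""
    return {j for j, row in enumerate(M) if all(row[i] for i in X)}


def tau(M, outcomes, X, outcome, S=()):
    """Number of pools with the given outcome that cover X but no element of S."""
    excluded = set()
    for Y in S:
        excluded |= rows_covering(M, Y)
    return sum(
        1
        for j in rows_covering(M, X)
        if outcomes[j] == outcome and j not in excluded
    )


# Hypothetical design and observed outcomes (1 = positive, 0 = negative).
M = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
outcomes = [1, 0, 1, 0]

tau0_x = tau(M, outcomes, {0}, 0)              # negative pools containing clone 0
tau0_S_x = tau(M, outcomes, {0}, 0, S=[{2}])   # same, ignoring pools that contain clone 2
tau1_x = tau(M, outcomes, {0, 1}, 1)           # positive pools covering the complex {0, 1}
print(tau0_x, tau0_S_x, tau1_x)
```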

2.1 The General Inhibitor Complex Model

With the introduction of inhibitors to the clone model, Hwang and Chang (2007) [33] proposed the general inhibitor model, in which the exact cancellation effect of inhibitors on positive clones is not specified. In the general inhibitor clone model, the (d + h)-disjunct matrix is the main design for identifying the positive clones among n clones that include at most d positive clones and at most h inhibitory clones. A binary matrix is d-disjunct if for any d + 1 columns $C_0, C_1, \ldots, C_d$,

\[ \left| C_0 \setminus \bigcup_{i=1}^{d} C_i \right| \ge 1. \]
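For small matrices, d-disjunctness can be checked by brute force straight from the definition. The following is a sketch with a hypothetical function name; it enumerates all choices of one column and d other columns, so it is exponential in d and is meant only for illustration.

```python
from itertools import combinations


def is_d_disjunct(M, d):
    """Check that every column has a row not shared with the union of any d other columns."""
    n = len(M[0])
    cols = [{j for j, row in enumerate(M) if row[i]} for i in range(n)]
    for c0 in range(n):
        others = [i for i in range(n) if i != c0]
        for group in combinations(others, d):
            union = set().union(*(cols[i] for i in group))
            if not (cols[c0] - union):
                return False
    return True


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# The third column {0, 1} contains the first column {0}, so the matrix is not 1-disjunct.
nested = [[1, 0, 1], [0, 1, 1]]

identity_ok = is_d_disjunct(identity, 2)
nested_ok = is_d_disjunct(nested, 1)
print(identity_ok, nested_ok)
```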

Chang et al. (2010) [9] are the first ones to study the general inhibitor complex model, and expand the idea of (d + h)-disjunctness to this model.

We attach the parameters (n, d, h) to an inhibitor complex model with complex set H to denote the fact that among the complexes of H, which are subsets of the n clones, there are at most d positive complexes and at most h inhibitors. Following the terminology of (Gao et al., 2006 [29]), a binary matrix is (H : d; z)-disjunct if for any d + 1 complexes $X_0, X_1, \ldots, X_d$ there exist z rows each covering $X_0$ but none of $X_1, \ldots, X_d$, i.e.,

\[ \left| \cap X_0 \setminus \bigcup_{i=1}^{d} \cap X_i \right| \ge z. \]
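The definition can again be checked by brute force on a toy instance. In this sketch the complex set `H`, the matrix `M` (one row per complex, containing exactly its clones), and the function names are all hypothetical.

```python
from itertools import combinations


def rows_covering(M, X):
    """Rows (tests) that contain every clone of complex X."""
    return {j for j, row in enumerate(M) if all(row[i] for i in X)}


def is_H_disjunct(M, H, d, z):
    """Check that for every X0 and any d other complexes, at least z rows cover X0 only."""
    for X0 in H:
        others = [Y for Y in H if Y != X0]
        for group in combinations(others, d):
            blocked = set().union(*(rows_covering(M, Y) for Y in group))
            if len(rows_covering(M, X0) - blocked) < z:
                return False
    return True


# Three complexes on clones 0..3; no complex contains another.
H = [frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 2})]
M = [
    [1, 1, 0, 0],  # pool {0, 1}
    [0, 0, 1, 1],  # pool {2, 3}
    [1, 0, 1, 0],  # pool {0, 2}
]

z1 = is_H_disjunct(M, H, d=2, z=1)
z2 = is_H_disjunct(M, H, d=2, z=2)
print(z1, z2)
```

Each complex here has a private row, which is why the matrix is (H : 2; 1)-disjunct but not (H : 2; 2)-disjunct.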

Let t(n, (H : d; z)) denote the minimum number of rows in an (H : d; z)-disjunct matrix with n columns. The construction of (H : d; z)-disjunct matrices was studied in (Gao et al., 2006 [29]). When z = O(1) and each complex contains at most r clones, the construction yields a matrix with

\[ O\!\left( \left( \frac{2dr \lg n}{\lg(dr \lg n)} \right)^{r+1} \right) \]

rows (Corollary 4.3.4).

It is generally assumed that no complex is a subset of another, for otherwise the requirement of (H : d; z)-disjunctness cannot be fulfilled when $X_0$ is contained in one of the $X_i$'s.

A lower bound for the general inhibitor complex model is as follows, which is an extension of a result in (De Bonis and Vaccaro, 1998 [17]) for the general inhibitor clone model.

Theorem 2.1.1. The number of rows in a nonadaptive pooling design under the (n, d, h) general inhibitor complex model with complex set H is at least t(n, (H : h; 1)).

Proof. Since a lower bound for the 1-inhibitor complex model is clearly a lower bound for the general inhibitor complex model, it suffices to prove the 1-inhibitor case. Let M be the testing matrix of a nonadaptive pooling design. Suppose M is not (H : h; 1)-disjunct. Then there exists a set of h + 1 complexes $X_0, \ldots, X_h$ such that every row covering $X_0$ must cover some of $X_1, \ldots, X_h$. Consider the sample in which $X_0$ is a positive complex and $\{X_1, \ldots, X_h\}$ is the set of inhibitors. Then the outcomes of all tests covering $X_0$ are negative, and thus $X_0$ cannot be identified from such outcomes.

Theorem 2.1.2. An (H : d + h; 1)-disjunct matrix can identify all positive complexes under the (n, d, h) general inhibitor complex model with complex set H.

Proof. A positive complex appears in a negative pool only when the pool also contains some inhibitors. Thus, for a positive complex $P$, if $S$ is an h-set containing all inhibitors, then

\[ \tau_0^S(P) = 0. \]

On the other hand, consider a non-positive complex $X^*$. By the definition of an (H : d + h; 1)-disjunct matrix, for any other complexes $X_1, \ldots, X_{d+h}$, there exists a row covering $X^*$ but none of $X_1, \ldots, X_{d+h}$. In particular, when $\{X_1, \ldots, X_d\}$ contains all positive complexes and $\{X_{d+1}, \ldots, X_{d+h}\}$ is a given set $S$, this row yields a negative outcome. Thus

\[ \tau_0^S(X^*) \ge 1 \]

for any h-set $S \subseteq H \setminus \{X^*\}$. Consequently, $\{X : \tau_0^S(X) = 0 \text{ for some h-set } S \subseteq H \setminus \{X\}\}$ is the set of all positive complexes.
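The decoding rule at the end of the proof can be sketched as a brute-force search over h-sets. The instance below is hypothetical: three complexes with d = 1 and h = 1, a matrix that is (H : 2; 1)-disjunct because each complex has a private row, and one extra pool in which the inhibitor masks the positive complex.

```python
from itertools import combinations


def rows_covering(M, X):
    """Rows (tests) that contain every clone of complex X."""
    return {j for j, row in enumerate(M) if all(row[i] for i in X)}


def tau0_S(M, outcomes, X, S):
    """Negative pools covering X, not counting pools that cover an element of S."""
    excluded = set()
    for Y in S:
        excluded |= rows_covering(M, Y)
    return sum(1 for j in rows_covering(M, X) if outcomes[j] == 0 and j not in excluded)


def decode_positives(M, outcomes, H, h):
    """Complexes X with tau0_S(X) == 0 for some h-set S of other complexes."""
    positives = []
    for X in H:
        others = [Y for Y in H if Y != X]
        if any(tau0_S(M, outcomes, X, S) == 0 for S in combinations(others, h)):
            positives.append(X)
    return positives


H = [frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 2})]
M = [
    [1, 1, 0, 0],  # pool {0, 1}
    [0, 0, 1, 1],  # pool {2, 3}
    [1, 0, 1, 0],  # pool {0, 2}
    [1, 1, 1, 1],  # pool containing every clone
]
# Hidden truth (assumed for the example): {0, 1} is positive, {2, 3} is an inhibitor.
# The last pool is negative because the inhibitor cancels the positive complex.
outcomes = [1, 0, 0, 0]

decoded = decode_positives(M, outcomes, H, h=1)
print(decoded)
```

The loop over `combinations(others, h)` is exactly the costly enumeration that the faster procedures later in the chapter are designed to avoid.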

Next, for the error-tolerant case, we consider two types of errors: the (10)-type, changing a 1-outcome to 0, and the (01)-type, changing a 0-outcome to 1. Let $e^*_{10}$ and $e^*_{01}$ denote the unknown numbers of (10)-type errors and (01)-type errors, respectively, and denote upper bounds on $e^*_{10}$ and $e^*_{01}$ by $e_{10}$ and $e_{01}$, which may be known or unknown. We assume that $e$, an upper bound on the total number of errors, is known, and set

\[ c := \begin{cases} e_{10} + e_{01} - e & \text{if } e_{10} \text{ and } e_{01} \text{ are known,} \\ e & \text{if there are no estimates of } e_{10} \text{ and } e_{01}, \\ 0 & \text{if the number of positive complexes is } d. \end{cases} \]

Chang et al. (2010) [9] dealt with the error-tolerant case as follows.

Theorem 2.1.3. An (H : d + h; c + e + 1)-disjunct matrix can identify all positive complexes under the (n, d, h) general inhibitor complex model with complex set H which has at most e errors.

Proof. Ignoring the inhibitors for the moment, then a positive complex

P can appear in a negative pool only if its outcome is one of the (10)-type errors. Therefore, if S contains all inhibitors, then τS

0(P )≤ e∗10.

On the other hand, for a non-positive complex X∗, by the definition of (H : d+h; c+e+1)-disjunct, X∗appears in at least c+e+1 rows each covering none of the up-to-d positive complexes, nor the h complexes in S; hence the

(26)

corresponding tests have negative outcomes. Errors of the (01)-type may reduce the number of such negative pools. But still,

τ₀S(X∗)≥ c + e + 1 − e∗

01≥ c + 1 + e∗10,

where the last inequality follows from e∗₁₀+ e∗₀₁ ≤ e.

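However, $e^*_{10}$ is unknown, so these two bounds alone do not tell us how to distinguish positive complexes from the others. We consider three cases.

Case (1): $e_{10}$ and $e_{01}$ are known. Then $c = e_{01} + e_{10} - e$. Thus

$$\tau_0^S(X^*) \ge (e_{01} + e_{10} - e) + e + 1 - e^*_{01} \ge e_{10} + 1.$$

This implies that $\{X : \tau_0^S(X) \le e_{10} \text{ for some } h\text{-set } S \subseteq H \setminus \{X\}\}$ is the set of all positive complexes.

Case (2): no estimates of $e_{10}$ and $e_{01}$ are given. Then $c = e$. Thus

$$\tau_0^S(X^*) \ge e + e + 1 - e^*_{01} \ge e + 1.$$

Hence $\{X : \tau_0^S(X) \le e \text{ for some } h\text{-set } S \subseteq H \setminus \{X\}\}$ is the set of all positive complexes.

Case (3): the number of positive complexes is known to be $d$. Then $c = 0$. Thus

$$\tau_0^S(X^*) \ge e^*_{10} + 1.$$

Since every positive complex $P$ satisfies $\tau_0^S(P) \le e^*_{10}$ for some $S$, the $d$ positive complexes are exactly those with the smallest minima. Therefore, $\{X : \min_{X \notin S} \tau_0^S(X) \text{ is among the } d \text{ smallest } \min_{X \notin S} \tau_0^S(\cdot) \text{ values}\}$ is the set of all positive complexes.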

The decoding procedure requires computing $\tau_0^S(X)$ for all $h$-subsets $S \subseteq H \setminus \{X\}$, and there are $\binom{|H|-1}{h}$ of them. However, $|H|$ can be much larger than $n = |N|$. For example, if $H$ contains all $r$-sets of clones, then $|H| = \binom{n}{r}$. Thus $\binom{|H|-1}{h}$ could be a very large number. In the following, we introduce some ways to reduce the decoding complexity by orders of magnitude.
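To get a feel for these counts, the short Python sketch below evaluates $\binom{|H|-1}{h}$ when $H$ consists of all $r$-sets of $n$ clones. The function name and the parameter values are illustrative assumptions, not taken from the text.

```python
from math import comb

def num_candidate_sets(n: int, r: int, h: int) -> int:
    """Number of h-subsets S of H minus {X} when H is the family of all r-sets of n clones."""
    size_H = comb(n, r)          # |H| = C(n, r)
    return comb(size_H - 1, h)   # C(|H| - 1, h)

# Assumed toy parameters: 20 clones, complexes of size 2, h = 2.
print(num_candidate_sets(20, 2, 2))
```

Even for these small assumed parameters each candidate complex already has 17766 sets $S$ to examine, which motivates the faster procedures that follow.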


2.1.1 Faster Procedures

For convenience, we use an $(n, d, h, r)$ inhibitor complex model to denote an $(n, d, h)$ inhibitor complex model where every complex contains at most $r$ clones. Chang et al. (2010) [9] employed a seemingly unrelated notion, the $(d, r; z]$-disjunct matrix, to tackle the problem. Moreover, this idea also provides a fast decoding procedure. A binary matrix is $(d, r; z]$-disjunct if for any $r + d$ columns $C_1, \dots, C_{r+d}$, there exist at least $z$ rows each intersecting $C_1, \dots, C_r$ but none of $C_{r+1}, \dots, C_{r+d}$, i.e.,

$$\left| \bigcap_{i=1}^{r} C_i \setminus \bigcup_{i=r+1}^{d+r} C_i \right| \ge z.$$

Let $t(n, (d, r; z])$ denote the minimum number of rows in a $(d, r; z]$-disjunct matrix with $n$ columns. The $(d, r; z]$-disjunct matrix has been well studied (Stinson et al., 2000 [50]; D'yachkov et al., 2002 [27]; Stinson and Wei, 2004 [49]; Du et al., 2006 [24]). See Chapter 4 for a general introduction.
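The definition can be checked by brute force on small matrices. The following Python sketch is only an illustration under assumed names (`is_disjunct`, the test matrix `M`); it enumerates every choice of $r$ columns to be hit and $d$ other columns to be avoided, so it is exponential and meant for toy sizes only.

```python
from itertools import combinations

def is_disjunct(M, d, r, z):
    """Brute-force check that the 0/1 matrix M (a list of rows) is (d, r; z]-disjunct."""
    n = len(M[0])
    for R in combinations(range(n), r):              # columns every counted row must hit
        rest = [c for c in range(n) if c not in R]
        for D in combinations(rest, d):              # columns every counted row must avoid
            good = sum(
                1 for row in M
                if all(row[c] == 1 for c in R) and all(row[c] == 0 for c in D)
            )
            if good < z:
                return False
    return True

# Assumed test matrix: rows are the indicator vectors of all 3-subsets of 5 columns.
M = [[1 if c in pool else 0 for c in range(5)] for pool in combinations(range(5), 3)]
print(is_disjunct(M, 1, 2, 2), is_disjunct(M, 1, 2, 3), is_disjunct(M, 2, 2, 1))
```

For this matrix, any two columns are jointly contained in exactly three rows, and a third column removes exactly one of them, so the matrix is $(1, 2; 2]$-disjunct but not $(1, 2; 3]$-disjunct.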

Theorem 2.1.4. A $(d+h, r; 2e+1]$-disjunct matrix can identify all positive complexes under the $(n, d, h, r)$ general inhibitor complex model with error tolerance $e$.

Proof. Consider a positive complex $P$ and let $\{X_1, \dots, X_h\}$ denote a set of other complexes containing all inhibitors. Since no complex is contained in another, there exists a clone $v_i \in X_i \setminus P$ for $1 \le i \le h$. Let $S'$ be an $h$-set containing these $v_i$'s such that $S' \cap P = \emptyset$. Then

$$\tau_0^{S'}(P) \le e,$$

since a pool that contains $P$ and avoids $S'$ contains no inhibitor, so it can be negative only through the occurrence of an error.

On the other hand, consider a non-positive complex $X^*$ and let $\{X_1, \dots, X_d\}$ denote a set of other complexes containing all positive ones. Similarly, we can define $w_i \in X_i \setminus X^*$. Let $D$ be a $d$-set containing these $w_i$'s with $D \cap X^* = \emptyset$. By the definition of a $(d+h, r; 2e+1]$-disjunct matrix, there exist at least $2e+1$ rows each covering $X^*$ but none of the columns in $D \cup S$, for an arbitrary $h$-set $S \subseteq N$ which is disjoint from $X^*$. Hence the outcomes of these $2e+1$ pools should be negative except for the occurrence of errors. This implies that

$$\tau_0^S(X^*) \ge 2e + 1 - e = e + 1.$$

Hence $\{X : \tau_0^S(X) \le e \text{ for some } h\text{-set } S \subseteq N \setminus X\}$ is the set of positive complexes.

The decoding procedure demonstrated in the proof of Theorem 2.1.4 requires computing $\tau_0^S(X)$ from the testing outcomes for each candidate complex $X \in H$ and every $h$-set $S \subseteq N \setminus X$. Let $t = t(n, (d, r; z])$. Then each computation of $\tau_0^S(X)$ takes $O(t(h+r))$ time, and thus the time complexity of the decoding procedure is $O\!\left(t(h+r)\binom{n-r}{h}|H|\right)$, which can be a big reduction from the $O\!\left(t'hr\binom{|H|-1}{h}|H|\right)$ of Theorem 2.1.3, where $t' = t(n, (H : d; z))$.

Chang et al. (2010) [10] provided an efficient design with a faster decoding procedure for the general inhibitor complex model, where the improvement in decoding ability is attributed to the introduction of an inclusiveness property to the design. A matrix is $(h, r; y]$-inclusive if for any $h + r$ columns $C_1, \dots, C_{r+h}$, there are at most $y$ rows each intersecting $C_1, \dots, C_r$ and at least one of $C_{r+1}, \dots, C_{r+h}$, i.e.,

$$\left| \left( \bigcap_{i=1}^{r} C_i \right) \cap \left( \bigcup_{i=r+1}^{r+h} C_i \right) \right| \le y.$$
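As with disjunctness, inclusiveness can be verified exhaustively on a toy matrix. The sketch below is a hedged illustration; the name `is_inclusive` and the test matrix are assumptions introduced here, not part of the cited work.

```python
from itertools import combinations

def is_inclusive(M, h, r, y):
    """Brute-force check that the 0/1 matrix M (a list of rows) is (h, r; y]-inclusive."""
    n = len(M[0])
    for R in combinations(range(n), r):              # columns every counted row must hit
        rest = [c for c in range(n) if c not in R]
        for S in combinations(rest, h):              # the row must also hit at least one of these
            hits = sum(
                1 for row in M
                if all(row[c] == 1 for c in R) and any(row[c] == 1 for c in S)
            )
            if hits > y:
                return False
    return True

# Assumed test matrix: rows are the indicator vectors of all 3-subsets of 5 columns.
M = [[1 if c in pool else 0 for c in range(5)] for pool in combinations(range(5), 3)]
print(is_inclusive(M, 1, 2, 1), is_inclusive(M, 1, 2, 0))
```

Here exactly one row contains any given two columns together with a given third one, so the matrix is $(1, 2; 1]$-inclusive but not $(1, 2; 0]$-inclusive.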

Lemma 2.1.5. A matrix which is $(d, r; z]$-disjunct and also $(h, r; y]$-inclusive with $z - y \ge 2e + 1$ is $(d+h, r; 2e+1]$-disjunct.

Proof. For any $r + d + h$ columns $C_1, \dots, C_{r+d+h}$, there exist $z$ rows intersecting each of $C_1, \dots, C_r$ but none of $C_{r+1}, \dots, C_{r+d}$, and at most $y$ rows intersecting each of $C_1, \dots, C_r$ and at least one of $C_{r+d+1}, \dots, C_{r+d+h}$. Then there remain at least $z - y \ge 2e + 1$ rows intersecting each of $C_1, \dots, C_r$ but none of $C_{r+1}, \dots, C_{r+d+h}$.

Let $t(n, (d, h, r; x])$ denote the minimum number of rows in a $(d, r; z]$-disjunct and $(h, r; y]$-inclusive matrix with $n$ columns for some $z$ and $y$ satisfying $z - y \ge x$.

From Lemma 2.1.5 and Theorem 2.1.4, we immediately have that a $(d, r; z]$-disjunct and $(h, r; y]$-inclusive matrix with $z - y \ge 2e + 1$ can identify all positive complexes under the $(n, d, h, r)$ general inhibitor complex model with error tolerance $e$. However, the decoding ability of the design is not revealed by the implication of Lemma 2.1.5. In particular, when every positive complex contains exactly $r$ clones, we have the following improved decoding procedure.

Algorithm 1:

Step 1. Implement a $(d, r; z]$-disjunct and $(h, r; y]$-inclusive matrix with $z - y \ge 2e + 1$ as a design.

Step 2. Evaluate $\tau_0(X)$ for every $X \in H$.

Step 3. Return $\{X \in H : \tau_0(X) \le z - e - 1\}$.

Theorem 2.1.6. Algorithm 1 can identify all positive complexes in $O(r|H|\,t(n, (d, h, r; 2e+1]))$ decoding time under the $(n, d, h, r)$ general inhibitor complex model with error tolerance $e$ when each positive complex contains exactly $r$ clones.

Proof. Consider a positive complex $P$ and let $\{X_1, \dots, X_h\}$ be a set of other complexes containing all inhibitors. Under the hypothesis that no complex is contained in another, there exist $v_i \in X_i \setminus P$ for $1 \le i \le h$. By the $(h, r; y]$-inclusiveness property, the number of pools containing $P$ and at least one of the $v_i$'s is at most $y$. Hence $P$ can appear in at most $y$ negative pools if there is no error. This implies

$$\tau_0(P) \le y + e.$$

On the other hand, consider a non-positive complex $X^* \in H$. Similarly, there exists a clone $v \in P \setminus X^*$ for each positive complex $P$. By the $(d, r; z]$-disjunctness of the matrix, there are at least $z$ rows each covering $X^*$ and none of these $v$'s. Thus the pools corresponding to these rows yield negative outcomes if there is no error. Even in the worst case that all errors occur in these pools, we still have

$$\tau_0(X^*) \ge z - e > y + e.$$

Therefore, $\{X : \tau_0(X) \le y + e\}$ is the set of positive complexes. Since $y + e \le z - e - 1$, this set coincides with the one returned in Step 3.

Since each computation of $\tau_0(X)$ takes $O(tr)$ time where $t = t(n, (d, h, r; 2e+1])$, the time complexity of the decoding procedure is $O(tr|H|)$.

This procedure also yields a big reduction in computation: instead of computing $\tau_0^S(X)$ for every set $S$, it computes $\tau_0(X)$ only once for each candidate complex, which considerably lowers the decoding complexity.

In Chapter 4, we will introduce some existing disjunct matrices that have a certain inclusiveness property.

2.2 The k-inhibitor Complex Model

In the $k$-inhibitor complex model, the outcome of a test is positive if and only if it contains at least one positive complex and at most $k - 1$ inhibitors. While Section 2.1 provided nonadaptive pooling designs for this model, we now give a more efficient one.
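The outcome rule just stated can be written down directly. The following Python sketch assumes that complexes and inhibitors are given as sets of clones and that no errors occur; the function name `pool_outcome` is introduced here only for illustration.

```python
def pool_outcome(pool, positives, inhibitors, k):
    """Error-free outcome of one pool in the k-inhibitor complex model.

    Returns 1 iff the pool contains at least one positive complex
    and at most k - 1 inhibitors, and 0 otherwise.
    """
    pool = set(pool)
    has_positive = any(set(P) <= pool for P in positives)
    num_inhibitors = sum(1 for I in inhibitors if set(I) <= pool)
    return 1 if has_positive and num_inhibitors <= k - 1 else 0

# Assumed toy instance: positive complex {1, 2}, inhibitor {2, 3}.
print(pool_outcome({1, 2, 3}, [{1, 2}], [{2, 3}], k=1))
print(pool_outcome({1, 2, 4}, [{1, 2}], [{2, 3}], k=1))
```

With $k = 1$ a single inhibitor in the pool already cancels the positive outcome; with $k = 2$ the same pool $\{1, 2, 3\}$ would test positive.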

Du and Hwang (2006) [23] used a $(d+h-k+1, 1; 2e+1]$-disjunct matrix to solve the group testing problem in the $k$-inhibitor clone model with error tolerance $e$. It can be easily extended to the complex model as follows.

Theorem 2.2.1. A $(d+h-k+1, r; 2e+1]$-disjunct matrix can identify all positive complexes under the $(n, d, h, r)$ $k$-inhibitor complex model with error tolerance $e$.

Let $\binom{N}{h}$ denote the set consisting of all $h$-subsets of $N$. Then the associated decoding procedure for Theorem 2.2.1 is to compute $\tau_0^S(X)$ for each $S \in \binom{N \setminus X}{h-k+1}$, while $\{X \in H : \tau_0^S(X) \le e \text{ for some } (h-k+1)\text{-subset } S \text{ of } N \setminus X\}$ is the set of positive complexes.

According to Lemma 2.1.5 and Theorem 2.2.1, a $(d, r; z]$-disjunct and $(h-k+1, r; y]$-inclusive matrix with $z - y \ge 2e + 1$ can identify all positive complexes under the $(n, d, h, r)$ $k$-inhibitor complex model with error tolerance $e$, but the decoding ability of such a design has not been revealed yet. When every positive complex has exactly $r$ clones, we show that the decoding algorithm can be improved.

Algorithm 2:

Step 1. Implement a $(d, r; z]$-disjunct and $(h-k+1, r; y]$-inclusive matrix with $z - y \ge 2e + 1$ as a design.

Step 2. Evaluate $\tau_0(X)$ for every $X \in H$.

Step 3. Return $\{X \in H : \tau_0(X) \le z - e - 1\}$.

Theorem 2.2.2. Algorithm 2 can identify all positive complexes in $O(r|H|\,t(n, (d, h-k+1, r; 2e+1]))$ decoding time under the $(n, d, h, r)$ $k$-inhibitor complex model with error tolerance $e$ when each positive complex contains exactly $r$ clones.

Proof. Since the implemented matrix is $(d, r; z]$-disjunct, by the same argument used in the proof of Theorem 2.1.6 we have

$$\tau_0(X) \ge z - e$$

for any non-positive complex $X$.

On the other hand, let $P$ be a positive complex and $\{X_1, \dots, X_{h-k+1}\}$ be a set of other complexes containing as many inhibitors as possible. Since no complex is included in another, we can take a $v_i \in X_i \setminus P$ for $1 \le i \le h-k+1$. By the $(h-k+1, r; y]$-inclusiveness property, the number of pools containing both $P$ and at least one of the $v_i$'s is no more than $y$. Since a pool containing $P$ and none of these $v_i$'s would be tested positive, $P$ can appear in at most $y$ negative pools if there is no error. Thus, using $z - y \ge 2e + 1$,

$$\tau_0(P) \le y + e \le z - e - 1.$$

We conclude that $\{X : \tau_0(X) \le z - e - 1\}$ is the set of positive complexes.

Theorem 2.2.1 suggests a decoding algorithm that computes $\tau_0^S(X)$ for each $S \in \binom{N \setminus X}{h-k+1}$ and each complex $X \in H$, while the decoding procedure shown in Theorem 2.2.2 computes only $\tau_0(X)$ for each complex $X \in H$, a big reduction in computation.

Example 1. Consider the $(5, 1, 1, 2)$ 1-inhibitor complex model with $N = \{1, \dots, 5\}$ and $H = \{12, 23, 13, 34, 15\}$, where $ij$ denotes the complex consisting of clones $i$ and $j$. Assume that no error is allowed and 23 is the inhibitor. In Figure 2.1, $M$ is a $(1, 2; 2]$-disjunct and $(1, 2; 1]$-inclusive matrix (refer to Example 3 in Chapter 4 for a general construction). In the chart we can see that only 12, the only positive complex, can make the value of $\tau_0$ lower than or equal to one.

$$M = \begin{array}{ccccc|c}
1 & 2 & 3 & 4 & 5 & \text{outcomes} \\ \hline
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 & 0
\end{array}$$

[Figure 2.1: the matrix $M$ with its outcomes, and a chart of $\tau_0$ (vertical axis, 0 to 5) for the complexes 12, 23, 13, 34, 15.]
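The decoding in Example 1 can be replayed in a few lines of Python. This is a minimal sketch that assumes $\tau_0(X)$ counts the negative pools containing every clone of $X$ (the reading consistent with the proofs above); the variable names are illustrative, and the outcome vector is copied from the example.

```python
from itertools import combinations

clones = [1, 2, 3, 4, 5]
pools = [set(p) for p in combinations(clones, 3)]    # rows of M, in the order listed
outcomes = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]            # outcomes listed in Example 1
H = [{1, 2}, {2, 3}, {1, 3}, {3, 4}, {1, 5}]         # candidate complexes
z, e = 2, 0                                          # (1, 2; 2]-disjunct, no errors

def tau0(X):
    """Assumed meaning of tau_0: number of negative pools that contain all of X."""
    return sum(1 for pool, out in zip(pools, outcomes) if out == 0 and X <= pool)

tau = {tuple(sorted(X)): tau0(X) for X in H}
positives = [X for X in H if tau0(X) <= z - e - 1]   # Step 3 of Algorithm 1
print(tau)
print(positives)
```

Under this reading the values are 1 for 12, 2 for 15, and 3 for 23, 13 and 34, so the threshold $z - e - 1 = 1$ returns only the complex 12.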

Chapter 3 Classification Problems on the Inhibitor Models

The problem we consider in this chapter is to classify all items in the inhibitor clone/complex models. Some multi-stage algorithms proposed for identifying positive items first identify and remove almost all inhibitors in their early stages (Farach et al., 1997 [28]; Hwang and Liu, 2003 [32] under the inhibitor clone model; De Bonis and Vaccaro, 2003 [18] under the k-inhibitor clone model). Of course, one could accomplish the classification work by extending these results. However, very little is known about nonadaptive pooling designs for the classification problem. An interesting feature is that a trivial strategy does not work for identifying inhibitors, i.e., one cannot simply test every item individually to classify the whole set. We propose a nonadaptive pooling design that classifies all items by starting with the identification of inhibitors (Chang et al., 2010 [10]). Our approach is to strengthen the parameters of a (d, r; z]-disjunct type matrix such that the design generated from the matrix is sufficient to identify all inhibitors and also contains enough pools in which inhibitors lose their cancellation effect to identify positive items. In the following we first introduce the results in the 1-inhibitor model and then extend them to the k-inhibitor model.

3.1 Nonadaptive Pooling Design for 1-inhibitor Complex Model

In order to distinguish inhibitors from negatives, we need to make an additional assumption: (A) Among the given complexes in $H$, there exists at least one positive complex. The reason for this is that one cannot distinguish negative complexes from inhibitors without any positive complex. In addition, for the inhibitor complex model with $r \ge 2$, we need another essential assumption on complexes: (B) For each negative complex, there is always a positive complex such that no inhibitor is included in their union. Otherwise, any test containing the negative complex that violates the assumption must yield a negative outcome and thus the recognition of this complex would be ambiguous. Due to the naturalness of these two assumptions, we take them as default properties on complexes throughout this section. The following result was obtained by Chang et al. (2010) [10].

Theorem 3.1.1. An $(h, 2r; 2e+1]$-disjunct matrix can identify all inhibitors under the $(n, d, h, r)$ 1-inhibitor complex model with error tolerance $e$.

Proof. Consider a positive complex $P$ and let $\{X_1, \dots, X_h\}$ be a set of other complexes containing all inhibitors. Since no complex is contained in another, there exists $v_i \in X_i \setminus P$ for $1 \le i \le h$. By the $(h, 2r; 2e+1]$-disjunctness property, there exist at least $2e+1$ rows each containing $P$ but none of the $v_i$'s. The pools corresponding to these rows must be tested positive if no erroneous outcome occurs. Hence,

$$\tau_1(P) \ge e + 1$$

even in the worst case that $e$ erroneous outcomes occur.

Next, consider a negative complex $X^-$. According to assumption (B) on complexes, there exists a positive complex $P$ such that there is a clone $v \in I \setminus (P \cup X^-)$ for each inhibitor $I$. By the $(h, 2r; 2e+1]$-disjunctness property, there exist at least $2e+1$ rows each containing $P$ and $X^-$, but none of these $v$'s. Hence, we have that

$$\tau_1(X^-) \ge e + 1$$

despite $e$ erroneous outcomes.

On the other hand, since an inhibitor appears in a positive pool only when an erroneous outcome occurs,

$$\tau_1(X^*) \le e$$

for any inhibitor $X^*$. Thus, we conclude that $\{X : \tau_1(X) \le e\}$ is the set of inhibitors.

An interesting observation coming from this theorem is that the number of tests required for identifying inhibitors does not depend on the number of positive complexes, while the number of inhibitors is significant to the number of tests required for identifying all positives in the inhibitor model.

For the inhibitor clone model, after identifying all inhibitors, one can remove them and then continue to identify positive ones; however, this strategy cannot be applied to the complex model due to intersections between complexes. In the following, we deal with the clone model and the complex model separately.

3.1.1 The Inhibitor Clone Model

For the inhibitor clone model, following Theorem 3.1.1, a two-stage algorithm to classify all clones could be to identify and eliminate all inhibitors by an $(h, 2; 2e+1]$-disjunct matrix in the first stage and then turn to the clone model in the second stage. The group testing problem in the clone model has been well studied in the literature and a main design for this model is as follows.

Lemma 3.1.2. A $(d, 1; 2e+1]$-disjunct matrix can identify all positive clones under the $(n, d)$ clone model with error tolerance $e$; furthermore, it can be concluded from the design that $\{v \in N : \tau_0(v) \le e\}$ is the set of positive clones.

According to Theorem 3.1.1 and Lemma 3.1.2, it is quite natural to consider a matrix that is $(h, 2; 2e+1]$-disjunct and satisfies the following condition: ($*$) deleting any $h$ columns and all rows intersecting them would yield a $(d, 1; 2e+1]$-disjunct matrix. Again, Chang et al. (2010) [10] proved that a $(d+h, 2; 2e+1]$-disjunct matrix can indeed accomplish this job based on the following general result.

Lemma 3.1.3. For any $d \ge d'$ and $r \ge r'$, a $(d, r; z]$-disjunct matrix is $(d', r'; z]$-disjunct, and the $(d', r'; z]$-disjunctness property is preserved after deleting any $d - d'$ columns and all rows intersecting them.

Proof. The first part of the statement is clear. Consider the second part. Let $M$ be a $(d, r; z]$-disjunct matrix with column index set $[n]$ and let $S$ be a $(d-d')$-subset of $[n]$. Let $M'$ be the matrix obtained from $M$ by deleting the columns corresponding to indices in $S$ and the rows intersecting them. Let $D$ and $R$ be two disjoint subsets of $[n] \setminus S$ with $|D| \le d'$ and $|R| \le r'$. Any row of $M$ that intersects all columns of $M(R)$ and none of the columns of $M(S \cup D)$ is preserved in $M'$, where $M(S)$ denotes the submatrix of $M$ obtained by restricting the column indices to $S$. Since $|S \cup D| \le d$ and $|R| \le r$, there are at least $z$ such rows in $M$. Thus the number of rows intersecting all columns of $M'(R)$ and none of the columns of $M'(D)$ is at least $z$.
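The deletion step of Lemma 3.1.3 can be tried out on a small matrix. The sketch below is an illustration under assumed names (`is_disjunct`, `delete_columns`) and an assumed test matrix; it deletes one column of a $(2, 2; 1]$-disjunct matrix together with all rows intersecting it and checks that what remains is $(1, r'; 1]$-disjunct for $r' = 1, 2$.

```python
from itertools import combinations

def is_disjunct(M, d, r, z):
    """Brute-force check that the 0/1 matrix M (a list of rows) is (d, r; z]-disjunct."""
    n = len(M[0])
    for R in combinations(range(n), r):
        rest = [c for c in range(n) if c not in R]
        for D in combinations(rest, d):
            good = sum(
                1 for row in M
                if all(row[c] == 1 for c in R) and all(row[c] == 0 for c in D)
            )
            if good < z:
                return False
    return True

def delete_columns(M, S):
    """Remove the columns in S and every row that has a 1 in some column of S."""
    kept_rows = [row for row in M if all(row[c] == 0 for c in S)]
    return [[v for c, v in enumerate(row) if c not in S] for row in kept_rows]

# Assumed test matrix: indicator vectors of all 3-subsets of 5 columns.
M = [[1 if c in pool else 0 for c in range(5)] for pool in combinations(range(5), 3)]
preserved = all(
    is_disjunct(delete_columns(M, {c}), 1, 2, 1) and is_disjunct(delete_columns(M, {c}), 1, 1, 1)
    for c in range(5)
)
print(is_disjunct(M, 2, 2, 1), preserved)
```

Here $d = 2$, $d' = 1$ and $z = 1$, so exactly $d - d' = 1$ column is deleted in each trial.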

Therefore, a $(d+h, 2; 2e+1]$-disjunct matrix is also $(h, 2; 2e+1]$-disjunct and satisfies ($*$), and thus it can classify all clones. Here, we give a proof relating to the decoding procedure.

Theorem 3.1.4. A $(d+h, 2; 2e+1]$-disjunct matrix can classify all clones under the $(n, d, h)$ 1-inhibitor model with error tolerance $e$.

Proof. A $(d+h, 2; 2e+1]$-disjunct matrix is also $(h, 2; 2e+1]$-disjunct and thus by Theorem 3.1.1, we immediately obtain that

$$I := \{v \in N : \tau_1(v) \le e\}$$

is the set of inhibitors. Consider the matrix $M'$ obtained by deleting the columns corresponding to inhibitors and the rows intersecting them. Notice that the testing outcome for each pool in $M'$ inherits the outcome of its corresponding pool in $M$, showing that no additional tests are required. By Lemma 3.1.3, $M'$ is $(d, 1; 2e+1]$-disjunct; hence, by Lemma 3.1.2,

$$\{v \in N \setminus I : \tau_0(v) \le e\}$$

is the set of all positive clones, where the computing of $\tau_0(v)$ refers to the pools in $M'$.

Since the computing of $\tau_1(v)$ and $\tau_0(v)$ takes $O(t)$ time, the decoding procedure for such a design takes $O(tn)$ time where $t = t(n, (d+h, 2; 2e+1])$.

Example 2. Consider the $(5, 1, 1)$ 1-inhibitor clone model on $N = \{1, \dots, 5\}$. Assume that no error is allowed, i.e., $e = 0$. Consider that clone 1 is the inhibitor and clone 2 is the positive clone. In Figure 3.1, $M$ is a $(2, 2; 1]$-disjunct matrix (see Chapter 4 for general constructions). In chart (a), we can see that only 1, the only inhibitor, can make the value of $\tau_1$ lower than or equal to $e = 0$. $M$ is then shrunk to a 1-disjunct matrix $M'$ whose columns represent all clones except inhibitors. Chart (b) shows that only 2, the only positive clone, has a value of $\tau_0$ lower than or equal to $e = 0$.

$$M = \begin{array}{ccccc|c}
1 & 2 & 3 & 4 & 5 & \text{outcomes} \\ \hline
1 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 & 0
\end{array}$$

[Charts, vertical axes 0 to 5: (a) $\tau_1$ for clones 1 to 5; (b) $\tau_0$ for the non-inhibitory clones 2 to 5, computed on $M'$.]

Figure 3.1: An example of Theorem 3.1.4
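The two decoding stages of Example 2 can be sketched in Python as follows. The sketch assumes that $\tau_1(v)$ and $\tau_0(v)$ count the positive and negative pools containing clone $v$, respectively, and it restricts $\tau_0$ to the rows of $M'$; all names are illustrative and the outcome vector is copied from the example.

```python
from itertools import combinations

clones = [1, 2, 3, 4, 5]
pools = [set(p) for p in combinations(clones, 3)]    # rows of M, in the order listed
outcomes = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]            # outcomes listed in Example 2
e = 0                                                # no errors allowed

def tau(v, value, rows):
    """Number of pools among `rows` with the given outcome that contain clone v."""
    return sum(1 for i in rows if outcomes[i] == value and v in pools[i])

all_rows = range(len(pools))

# Stage 1: inhibitors are the clones that appear in at most e positive pools.
inhibitors = {v for v in clones if tau(v, 1, all_rows) <= e}

# Stage 2: M' keeps only the rows that avoid every identified inhibitor.
rows_prime = [i for i in all_rows if not (pools[i] & inhibitors)]
positives = {v for v in clones
             if v not in inhibitors and tau(v, 0, rows_prime) <= e}
negatives = set(clones) - inhibitors - positives
print(inhibitors, positives, negatives)
```

No new tests are simulated in the second stage: the rows of $M'$ simply reuse the outcomes already recorded for $M$, as noted in the proof of Theorem 3.1.4.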

3.1.2 The Inhibitor Complex Model

We know that an $(h, 2r; 2e+1]$-disjunct matrix can identify all inhibitors for the $(n, d, h, r)$ inhibitor complex model (Theorem 3.1.1) and that a $(d, r; 2e+1]$-disjunct matrix can identify all positives for the $(n, d, r)$ complex model (Theorem 2.1.4). Thus, learning from Theorem 3.1.4, one may use a $(d+h, 2r; 2e+1]$-disjunct matrix to tackle the classification problem. However, unlike in the inhibitor clone model, in the complex model we cannot simply remove the inhibitors after identifying them, because doing so could break other complexes and thus affect their identification. The cutoff function $\tau_0^S(X)$ provides a way to overcome this problem. When $S$ is disjoint from $X$ and contains a clone from each inhibitor, each negative pool counted in $\tau_0^S(X)$ is not due to the appearance of inhibitors and thus any complex covered by the pool can be identified as negative. The following result is obtained by following this idea.

Theorem 3.1.5. A $(d+h, 2r; 2e+1]$-disjunct matrix can classify all complexes under the $(n, d, h, r)$ 1-inhibitor complex model with error tolerance $e$.

Proof. First, since a $(d+h, 2r; 2e+1]$-disjunct matrix is $(h, 2r; 2e+1]$-disjunct, by Theorem 3.1.1, $I := \{X : \tau_1(X) \le e\}$ is the set of inhibitors. Assume that $X_1, X_2, \dots, X_{h'}$ are the inhibitors. Let $I_X$ be a set that contains a clone in $X_i \setminus X$ for $1 \le i \le h'$. Define $\tau_{0,I}(X) = \tau_0^{I_X}(X)$.

A positive complex $P$ can appear in a negative pool only when an inhibitor also appears in it or its testing result is erroneous. Thus

$$\tau_0^{I_P}(P) \le e,$$

since a pool containing an inhibitor, and thus some clone in $I_P$, is not evaluated in the computation.

On the other hand, consider a negative complex $X^*$. Assume that $D$ is a set consisting of a clone in $X \setminus X^*$ for each positive complex $X$. A $(d+h, 2r; 2e+1]$-disjunct matrix is also $(d+h, r; 2e+1]$-disjunct; hence, there are $2e+1$ rows covering $X^*$ but none of the clones in $D \cup I_{X^*}$. Then

$$\tau_0^{I_{X^*}}(X^*) > e,$$

since each pool corresponding to any of these $2e+1$ rows contains no positive complex and thus yields a negative outcome if there is no error, and it is evaluated in the computation because it contains no clone in $I_{X^*}$. Hence $\{X \in H \setminus I : \tau_{0,I}(X) \le e\}$ is the set of positive complexes.

Indeed, a decoding procedure for this design is to distinguish inhibitors from other complexes by the cutoff function $\tau_1(\cdot)$ and then distinguish positive complexes from negative ones by the cutoff function $\tau_{0,I}(\cdot)$. Since the computing of $\tau_1(X)$ and $\tau_{0,I}(X)$ takes $O(trh)$ time (including the setting of $I_X$), the procedure takes $O(trh|H|)$ time where $t = t(n, (d+h, 2r; 2e+1])$.
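The two cutoff functions can be exercised on a toy instance. Everything in the following Python sketch is an assumption made for illustration: the instance (six clones, positive complex 12, inhibitor 34, negative complexes 56 and 13), the design (all 4-subsets of the six clones, which is $(2, 4; 1]$-disjunct), the error-free outcomes, and the choice of the smallest clone of $X_i \setminus X$ when building $I_X$.

```python
from itertools import combinations

clones = range(1, 7)
pools = [set(p) for p in combinations(clones, 4)]    # assumed design: all 4-subsets
H = [frozenset(c) for c in ({1, 2}, {3, 4}, {5, 6}, {1, 3})]
true_pos, true_inh = [frozenset({1, 2})], [frozenset({3, 4})]
e = 0

# Error-free outcomes under the 1-inhibitor rule: positive iff the pool
# contains a positive complex and no inhibitor.
outcomes = [
    1 if any(P <= p for P in true_pos) and not any(I <= p for I in true_inh) else 0
    for p in pools
]

def tau1(X):
    """Number of positive pools containing X."""
    return sum(1 for p, o in zip(pools, outcomes) if o == 1 and X <= p)

def tau0_S(X, S):
    """Number of negative pools containing X and no clone of S."""
    return sum(1 for p, o in zip(pools, outcomes) if o == 0 and X <= p and not (S & p))

inhibitors = [X for X in H if tau1(X) <= e]

def I_X(X):
    """One clone (here the smallest) from each identified inhibitor, outside X."""
    return {min(I - X) for I in inhibitors}

positives = [X for X in H if X not in inhibitors and tau0_S(X, I_X(X)) <= e]
negatives = [X for X in H if X not in inhibitors and X not in positives]
print(inhibitors, positives, negatives)
```

The instance satisfies assumptions (A) and (B), and no complex contains another, so `I - X` is never empty for a non-inhibitor `X`.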

3.2 Nonadaptive Pooling Design for k-inhibitor Clone Model

In this section we consider the $k$-inhibitor model, where a test yields a positive outcome if and only if it contains at least one positive clone and fewer than $k$ inhibitors. It is assumed that the threshold $k$ is known beforehand. In order to identify inhibitors, besides assumption (A), another assumption is also essential: (C) Among the given clones, there exist at least $k$ inhibitors. Otherwise, inhibitors do not have enough ability to obscure positive clones and thus there is no way to differentiate them from negative ones. A bundle of $k$ arbitrary inhibitors has a blocking effect, while any other $k$ clones (not all inhibitors) do not. Chang et al. (2010) [10] used this characteristic to prove the following result, which is an extension of Theorem 3.1.1 from $k = 1$ to a general $k \ge 1$.

Theorem 3.2.1. An $(h-k+1, k+1; 2e+1]$-disjunct matrix can identify all inhibitors under the $(n, d, h)$ $k$-inhibitor clone model with error tolerance $e$.

Proof. For any $k$-set $K$ of inhibitors, it is obvious that $\tau_1(K) \le e$.

Consider a set $K$ of $k$ clones, not all inhibitors. Let $P$ be a positive clone and let $S$ be an $(h-k+1)$-set of clones outside $K \cup \{P\}$ containing as many inhibitors not in $K$ as possible. By the $(h-k+1, k+1; 2e+1]$-disjunctness property, there exist at least $2e+1$ rows each intersecting $P$ and all clones in $K$ but none in $S$. Then each pool corresponding to any of these rows contains a positive clone and at most $k - 1$ inhibitors, implying that its testing outcome is positive unless an error occurs. Thus

$$\tau_1(K) \ge e + 1.$$

Therefore, $\bigcup \{K \subseteq N : \tau_1(K) \le e, |K| = k\}$ is the set of inhibitors.

The computing of $\tau_1(K)$ takes $O(kt)$ time and thus the overall decoding procedure takes $O\!\left(\binom{n}{k} kt\right)$ time.
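The decoding rule of Theorem 3.2.1 can be illustrated on a small instance. The following Python sketch rests on assumptions chosen for the illustration: five clones, $k = 2$, inhibitors 1 and 2, positive clone 3, no errors, and the design consisting of all 3-subsets of the clones (which is $(1, 3; 1]$-disjunct); $\tau_1(K)$ is taken to be the number of positive pools containing all of $K$.

```python
from itertools import combinations

clones = [1, 2, 3, 4, 5]
pools = [set(p) for p in combinations(clones, 3)]    # assumed design: all 3-subsets
true_pos, true_inh, k, e = {3}, {1, 2}, 2, 0

# Error-free outcomes: positive iff the pool has a positive clone
# and fewer than k inhibitors.
outcomes = [1 if (p & true_pos) and len(p & true_inh) < k else 0 for p in pools]

def tau1(K):
    """Number of positive pools that contain every clone of K."""
    return sum(1 for p, o in zip(pools, outcomes) if o == 1 and set(K) <= p)

# Union of all k-sets K with tau1(K) <= e.
inhibitors = set()
for K in combinations(clones, k):
    if tau1(K) <= e:
        inhibitors |= set(K)
print(inhibitors)
```

Only the pair $\{1, 2\}$ never appears in a positive pool, so the union recovers exactly the two inhibitors; the loop over all $\binom{n}{k}$ pairs reflects the decoding cost stated above.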

In the previous section, we discussed a two-stage algorithm to classify all clones under the 1-inhibitor model, where the first stage is to identify (and eliminate) all inhibitors by a disjunct matrix and the subsequent stage is to distinguish positive clones from negative ones by another disjunct matrix. We can extend this idea to produce a two-stage algorithm for the $k$-inhibitor model, but with the following modification in the first stage: use an $(h-k+1, k+1; 2e+1]$-disjunct matrix (instead of an $(h, 2; 2e+1]$-disjunct matrix) to identify inhibitors and then remove either all of them or exactly $h-k+1$ of them so that the remaining inhibitors, at most $k-1$, do not obscure the positive clones.

Again, from Lemma 3.1.3, a nonadaptive pooling design obtained from combining an $(h-k+1, k+1; 2e+1]$-disjunct matrix and a $(d, 1; 2e+1]$-disjunct matrix as follows can classify all clones. The following proof is given from the perspective of decoding.

Theorem 3.2.2. A $(d+h-k+1, k+1; 2e+1]$-disjunct matrix can classify all clones under the $(n, d, h)$ $k$-inhibitor model with error tolerance $e$.

Proof. A $(d+h-k+1, k+1; 2e+1]$-disjunct matrix is also $(h-k+1, k+1; 2e+1]$-disjunct and then by Theorem 3.2.1, we immediately obtain that the union $I$ of all $k$-sets $K$ of clones with $\tau_1(K) \le e$ is the set of inhibitors. Consider the matrix $M'$ obtained from $M$ by deleting $h-k+1$ columns corresponding to inhibitors and the rows intersecting them. Then the columns of $M'$ relate to at most $k-1$ inhibitors and hence $M'$ could be used as a design for the clone model. Since at most $h-k+1$ columns are deleted, $M'$ is $(d, 1; 2e+1]$-disjunct by Lemma 3.1.3. Then by Lemma 3.1.2, $\{v \in N \setminus I : \tau_0(v) \le e\}$ is the set of positive clones, where the computing of $\tau_0(v)$ refers to the pools in $M'$ and the outcome of each pool coincides with the outcome of its expanded pool in $M$ because the deleted columns do not intersect it.

Notice that in Theorem 3.1.4 for the 1-inhibitor model, all columns associated with inhibitors are deleted, but in Theorem 3.2.2 at most $h-k+1$ columns of inhibitors are deleted. Such a deletion is proper because the inhibitors corresponding to the remaining columns do not have the ability to obscure positives, and the remaining matrix still maintains the ability to solve the classical group testing problem.

The decoding procedure for this design is to compute $\tau_0(v)$ for each $v \in N \setminus I$ besides the computing of $\tau_1(K)$ for each $K \in \binom{N}{k}$, and thus its time complexity is $O\!\left(\binom{n}{k} kt\right)$ where $t = t(n, (d+h-k+1, k+1; 2e+1])$.

Chapter 4 Constructions of Related Disjunct Matrices

In the previous chapters, the three main properties of matrices employed as nonadaptive pooling designs are $(H : d; z)$-disjunct, $(d, r; z]$-disjunct, and $(d, r; z]$-disjunct and $(h, r; y]$-inclusive with $z > y$. Many strategies were used to construct the related matrices: constructing by design theory and set intersections; transforming an $m$-ary matrix with certain properties into a binary one, called the $m$-ary method in (Du and Hwang, 2006 [23]); and controlling the number of rows covering or not covering a certain number of columns, called the row-covering method.

Before proceeding to the constructions, we present some basic definitions and notations. We start with some notations from graph theory and then from coding theory.

Let $H$ be the given set of complexes in the considered problem. Then $H$ can be viewed as a hypergraph with clones as vertices and complexes as edges and, accordingly, it is usually assumed that no edge contains another. A hypergraph is usually represented by $(V, E)$ where $V$ is its vertex set and $E$ is its edge set. The degree of a vertex is the number of edges that it belongs to, while the rank of an edge is the number of vertices that it contains. A hypergraph in which all vertices have the same degree is said to be regular; a hypergraph where all edges have the same rank is called uniform. Let $H_{\bar{r}}$ denote a hypergraph where the maximum rank is $r$ and $H^*_r$ the hypergraph with edge set $\binom{V}{r}$.

A code is a set of vectors called codewords and has three primary parameters: length, size and Hamming distance. The number of entries in a codeword is its length and is also the length of a code if all codewords have the same length; the size of a code is the number of codewords in it; the Hamming distance of a code is the minimum number of nonidentical symbols between two codewords, where the minimum is taken over all pairs of codewords. Moreover, an $m$-ary code is a code whose symbols are from the $m$-ary alphabet $\{0, 1, \dots, m-1\}$. For an $m$-ary code $C$ of length $t$, the incidence matrix of $C$ is a $t \times |C|$ $m$-ary matrix whose columns are the codewords of $C$.

4.1 Lower Bound

Stinson et al. (2000) [50] considered the generalized cover-free family, which is equivalent to a $(d, r; z]$-disjunct design. They derived a lower bound for the case $z = 1$ by a recursive relation. Stinson and Wei (2004) extended the method to a general $z$ by induction on $r + d$. The basic cases and a recursive relation are as follows.

Theorem 4.1.1. $t(n, (d, 1; z]) \ge c\left(\frac{d^2}{\log d}\log n + (z-1)d\right)$ for some absolute constant $c$.

The basic case $d = 1$ is the same as the case $r = 1$ according to the following result.

Lemma 4.1.2. $t(n, (d, r; z]) = t(n, (r, d; z])$.

Proof. Interchanging 0 and 1 in a $(d, r; z]$-disjunct matrix yields an $(r, d; z]$-disjunct matrix.
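The complementation argument in this proof is easy to test numerically. The Python sketch below is illustrative only; `is_disjunct` and the test matrix are assumptions introduced here, and the check is brute force.

```python
from itertools import combinations

def is_disjunct(M, d, r, z):
    """Brute-force check that the 0/1 matrix M (a list of rows) is (d, r; z]-disjunct."""
    n = len(M[0])
    for R in combinations(range(n), r):
        rest = [c for c in range(n) if c not in R]
        for D in combinations(rest, d):
            good = sum(
                1 for row in M
                if all(row[c] == 1 for c in R) and all(row[c] == 0 for c in D)
            )
            if good < z:
                return False
    return True

# Assumed test matrix: indicator vectors of all 3-subsets of 5 columns.
M = [[1 if c in pool else 0 for c in range(5)] for pool in combinations(range(5), 3)]
complement = [[1 - v for v in row] for row in M]     # interchange 0 and 1

print(is_disjunct(M, 1, 2, 2), is_disjunct(complement, 2, 1, 2))
```

A row of `M` that hits the $r$ chosen columns and avoids the $d$ others becomes, after complementation, a row that avoids the former and hits the latter, which is why the roles of $d$ and $r$ swap.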

Lemma 4.1.3. $t(n, (d, r; z]) \ge t(n-1, (d-1, r; z]) + t(n-1, (d, r-1; z])$.

Proof. Let $M$ be a $(d, r; z]$-disjunct matrix. By Lemma 3.1.3, deleting a column of $M$ and all rows intersecting it yields a $(d-1, r; z]$-disjunct matrix. Similarly, deleting a column of $M$ and all rows not intersecting it yields a $(d, r-1; z]$-disjunct matrix.

This recursion leads to a lower bound for $t(n, (d, r; z])$.

Theorem 4.1.4. For $d + r > 2$,

$$t(n, (d, r; z]) \ge c\binom{d+r}{r}\left(\frac{2\log(n-1)}{\log(d+r)} + \frac{z-1}{2}\right)$$

where $c$ is the same constant as in Theorem 4.1.1.

Proof. The proof is by induction on $r + d$. The case $r = 1$ or $d = 1$ is easily obtained from Theorem 4.1.1 and Lemma 4.1.2. For $d \ge 2$ and $r \ge 2$,

$$\begin{aligned}
t(n, (d, r; z]) &\ge t(n-1, (d-1, r; z]) + t(n-1, (d, r-1; z]) \\
&\ge c\binom{d+r}{r}\left(\frac{2\log(n-2)}{\log(d+r-1)} + \frac{z-1}{2}\right) \\
&\ge c\binom{d+r}{r}\left(\frac{2\log(n-1)}{\log(d+r)} + \frac{z-1}{2}\right).
\end{aligned}$$

Stinson and Wei (2004) [49] further gave a stronger lower bound by a similar argument.

Theorem 4.1.5. There exists an integer $n_{d,r}$ such that for $n \ge n_{d,r}$,

$$t(n, (d, r; z]) \ge c\binom{d+r}{r}\left(\frac{0.7(d+r)\log n}{\log\binom{d+r}{r}} + \frac{z-1}{2}\right).$$


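Abstract

The classical group testing problem is to identify the positive clones in a population that contains both positive and negative clones. The tool used is the group test, and the main concern is how to reduce the number of group tests. Applications also frequently give rise to other types of clones, such as inhibitors. An inhibitor interferes with the behavior of positive clones so that they cannot function normally; consequently, a group test containing a positive clone may fail to reveal its presence if it also contains an inhibitor. In addition, in the setting of DNA screening, certain clones can combine into a complex with particular properties, which we call a clone complex; thus, in contrast to the clone model, the complex model is concerned with identifying positive complexes.

Sequential algorithms [...] are the two common types of group testing algorithms. In the former, the tests are conducted one by one [...] while retaining the ability to identify all positive items. In general, [...] (the inhibitor model) [...]; (the complex model) is built on complexes. In the group testing literature, [...] developed. In this thesis, [...] and concentrate on the design of nonadaptive algorithms. We adopt the newly proposed notion of "inclusiveness" to design algorithms. Designs [...] the notion of "disjunctness" [...] can significantly reduce the time needed to decode the test outcomes. Besides studying positive complexes, [...] (positive, negative and inhibitor) [...] $(d, r; z]$-[...], $(H, d; z]$-[...], $(d, r; z]$-[...] $(h, r; y]$-[...] the main tools for designing nonadaptive algorithms. We also discuss their constructions [...]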