Chapter 1. Introduction
1.2. Mathematical formulation
Due to the nature of primer reactions, our problem can be modeled as the problem of learning a hidden graph, given its vertex set and an allowed query operation.
The problem of learning a hidden graph is the following. Imagine that there is a graph $G = (V, E)$ whose vertices are known to us and whose edges are not. We wish to identify all the edges of $G$ by asking edge-detecting queries of the form
$Q(S)$: does $S$ include at least one edge of $G$?
Here, $S$ is a subset of $V$. The problem is therefore to find an algorithm that reconstructs the hidden graph using as few queries as possible. Time is also very important, so we also want to parallelize the experiments into as few rounds as possible.
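The query operation can be simulated in a few lines. The following is a minimal sketch, assuming a small hypothetical hidden graph; the names `make_oracle` and `learn_trivially` are ours and only illustrate the model. The second function is the naive baseline that queries every pair of vertices.

```python
from itertools import combinations

def make_oracle(edges):
    """Build an edge-detecting query Q for a hidden edge set (illustrative helper)."""
    hidden = {frozenset(e) for e in edges}

    def Q(S):
        Q.calls += 1                      # count how many queries were asked
        S = set(S)
        return int(any(e <= S for e in hidden))

    Q.calls = 0
    return Q

def learn_trivially(V, Q):
    """Baseline: ask one query per pair of vertices."""
    return {frozenset(p) for p in combinations(V, 2) if Q(p)}

# Hypothetical hidden graph on V = {1, ..., 6}.
Q = make_oracle([(1, 2), (2, 5), (4, 6)])
print(Q({1, 3, 4}))   # no hidden edge inside this set
print(Q({2, 3, 5}))   # contains the hidden edge {2, 5}
```

The baseline always spends one query per pair, which is the cost that a good adaptive algorithm tries to beat when the hidden graph has few edges.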
An important aspect of an algorithm in this model is the number of parallel rounds. An algorithm is non-adaptive if the whole set of queries is chosen before the answers to earlier queries are known; in other words, a non-adaptive algorithm is a 1-round algorithm. An algorithm is adaptive if the queries may be made one by one, each depending on the answers to earlier queries.
Several related results are known. In 1997, Grebinski and Kucherov [5] considered the problem of finding a Hamiltonian cycle, motivated by the study of DNA physical mapping. They obtained a deterministic adaptive algorithm using $O(n \log n)$ queries. (All logarithms in this paper are in base 2.) Later, in 2001, Beigel et al. [4] described an 8-round deterministic algorithm for learning matchings using $O(n \log n)$ queries, which has direct application in genome sequencing projects. For randomized algorithms, a 1-round Monte Carlo algorithm for learning matchings, which succeeds with high probability, was given by Alon et al. [1]. More recently, a more precise estimate of the number of queries in the adaptive model of learning a general graph was obtained by Angluin and Chen [2].
Their algorithm for learning a general graph uses at most $12m \log n$ queries, where $m$ is the number of edges and $n$ is the number of vertices. They also prove a lower bound of $\Omega(m \log n)$ on the number of queries, so their algorithm achieves the asymptotic lower bound in this model.
In this thesis, we further improve the upper bound on the number of queries in this model when an adaptive algorithm is used to learn a hidden general graph. Our main result is that our adaptive algorithm for learning a hidden graph with $m$ edges defined on a set of $n$ vertices uses at most $m(2 \log n + 9)$ queries. As a consequence, the same algorithm reconstructs a Hamilton cycle or a matching using at most $m(2 \log n + 9)$ queries, which achieves the asymptotic lower bound.
Chapter 2. Preliminaries
2.1. Notations
2.1.1. Notations of Graphs
In graph theory, a simple graph is a pair $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. A subgraph $H$ of a graph $G$ is said to be induced if, for any pair of vertices $x$ and $y$ of $H$, $xy$ is an edge of $H$ if and only if $xy$ is an edge of $G$. In other words, $H$ is an induced subgraph of $G$ if it has all the edges that appear in $G$ over the same vertex set. If the vertex set of $H$ is the subset $S$ of $V(G)$, then $H$ can be written as $G[S]$ and is said to be induced by $S$.
An undirected graph is a graph in which every edge is undirected. A set $S \subseteq V$ is an independent set of $G$ if it contains no edge of $G$. A bipartite graph is a simple graph in which the vertex set can be decomposed into two independent sets.
A matching in a graph is a set of edges that do not share vertices. A maximal matching is a matching $M$ of a graph $G$ with the property that if any edge not in $M$ is added to $M$, it is no longer a matching.
A Hamiltonian cycle (or Hamiltonian circuit) is a cycle in an undirected graph which visits each vertex exactly once and also returns to the starting vertex. The Hamiltonian cycle on $n$ vertices has $n$ vertices and $n$ edges.
A complete graph is a simple graph in which every pair of distinct vertices is connected by an edge. The complete graph on $n$ vertices has $n$ vertices and $\binom{n}{2}$ edges, and is denoted by $K_n$.
A tree is a graph in which any two vertices are connected by exactly one path.
Alternatively, any connected graph with no cycles is a tree.
2.1.2. Notations of computer science
In computer science, a computation tree is a tree of nodes and edges. Each node in the tree represents a single computational state, while each edge represents a transition to the next possible computation. The length of the path from the root to a given node is the depth of the node.
A binary tree is a tree data structure in which each node has at most two children.
Typically the child nodes are called left and right. The root node of a tree is the node with no parents. There is at most one root node in a rooted tree. Nodes at the bottommost level of the tree are called leaf nodes. Since they are at the bottommost level, they do not have any children. An internal node or inner node is any node of a tree that has child nodes and is thus not a leaf node.
In computational complexity theory, big Ο notation is often used to describe how the size of the input data affects an algorithm's usage of computational resources (usually running time or memory). The symbol Ο is used to describe an asymptotic upper bound for the magnitude of a function in terms of another, usually simpler, function. There are also other symbols, ο, Ω, ω, and Θ, for various other upper, lower, and tight bounds; these are useful in the analysis of the complexity of algorithms.
2.2. Models
There are four types of queries, which lead to four different mathematical models for learning a hidden graph $G$.
(1) Multi-vertex model. For a set of vertices $\{a_1, a_2, \ldots, a_t\}$, ask whether $E(G) \cap E(K_{a_1, a_2, \ldots, a_t})$ is non-empty, where $K_{a_1, a_2, \ldots, a_t}$ is the complete graph on the set of vertices $\{a_1, a_2, \ldots, a_t\}$.
(2) Quantitative multi-vertex model. For a set of vertices $\{a_1, a_2, \ldots, a_t\}$, ask for the number of edges in $E(G) \cap E(K_{a_1, a_2, \ldots, a_t})$.
(3) $k$-vertex model. Assume that a $k$ is predefined. For a set of vertices $\{a_1, a_2, \ldots, a_t\}$, where $t \le k$, ask whether $E(G) \cap E(K_{a_1, a_2, \ldots, a_t})$ is non-empty.
(4) Quantitative $k$-vertex model. For a set of vertices $\{a_1, a_2, \ldots, a_t\}$, where $t \le k$, ask for the number of edges in $E(G) \cap E(K_{a_1, a_2, \ldots, a_t})$.
The model used in this paper is (1), the multi-vertex model.
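The four models differ only in what a query returns and in how many vertices it may contain. A minimal sketch, assuming a hypothetical helper `make_query_models` (not part of the original text), makes the distinction concrete: `detect` answers yes or no, `count` answers quantitatively, and the optional limit `k` gives the two $k$-vertex variants.

```python
def make_query_models(edges, k=None):
    """Return (detect, count) query functions for a hidden edge set.

    k=None gives the multi-vertex models (1) and (2);
    an integer k gives the k-vertex models (3) and (4).
    """
    hidden = {frozenset(e) for e in edges}

    def count(vertices):
        S = set(vertices)
        if k is not None and len(S) > k:
            raise ValueError(f"query has {len(S)} vertices, limit is {k}")
        # Number of hidden edges inside the complete graph on S.
        return sum(1 for e in hidden if e <= S)

    def detect(vertices):
        return int(count(vertices) > 0)

    return detect, count

# Hypothetical hidden graph with edges 12, 23 and 45.
detect, count = make_query_models([(1, 2), (2, 3), (4, 5)])
print(detect({1, 2, 3}), count({1, 2, 3}))
```

Only `detect` without a size limit is available to the algorithms in this paper.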
2.3. Lower bound
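Theorem 2.3.1. [1] For any $0 < m < \binom{n}{2}$, at least $m \log \bigl( \binom{n}{2} / m \bigr)$ edge-detecting queries are required to identify a graph drawn from the class of all graphs with $n$ vertices and $m$ edges.
Proof. There are $\binom{\binom{n}{2}}{m}$ graphs that have $n$ vertices and $m$ edges. For any algorithm, its computation tree is a binary tree and has at least $\binom{\binom{n}{2}}{m}$ leaves, so its depth is at least
$$\log \binom{\binom{n}{2}}{m} \ge \log \left( \frac{\binom{n}{2}}{m} \right)^{m} = m \log \binom{n}{2} - m \log m .$$
Therefore, any algorithm needs at least $m \log \bigl( \binom{n}{2} / m \bigr)$ queries in the worst case. □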
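The counting argument can be checked numerically. The sketch below assumes only the standard fact that a binary computation tree needs at least one leaf per candidate graph; the function names are ours. It compares the exact information-theoretic depth with the simpler estimate obtained from $\binom{N}{m} \ge (N/m)^m$, where $N = \binom{n}{2}$.

```python
from math import comb, log2

def info_lower_bound(n, m):
    """Minimum depth of a binary tree with one leaf per graph on n vertices, m edges."""
    graphs = comb(comb(n, 2), m)
    # For an integer x >= 1, ceil(log2(x)) equals (x - 1).bit_length().
    return (graphs - 1).bit_length()

def simple_bound(n, m):
    """The weaker closed-form estimate m * log2(C(n,2) / m)."""
    return m * log2(comb(n, 2) / m)

for n, m in [(8, 3), (64, 10), (1024, 100)]:
    print(n, m, info_lower_bound(n, m), round(simple_bound(n, m), 1))
```

The closed-form estimate is always the smaller of the two, so any statement derived from it remains a valid lower bound.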
2.4. Two Algorithms
In order to prove our main result, we shall need two adaptive algorithms from earlier work. Both of them are by Angluin and Chen [1, 2]. The first one identifies an arbitrary edge in a non-empty hidden graph using $2 \log n$ queries, where $n$ is the size of the vertex set $V$. For convenience, we will denote this algorithm by Algorithm A ($V$). The second one finds the edges between two known independent sets $S_1$ and $S_2$. We will use this algorithm in the special case where either $|S_1| = 1$ or $|S_2| = 1$, and we denote it by Algorithm B ($v$, $S$), where $v$ is a vertex and $S$ is an independent set not containing $v$. Note that this algorithm uses $\log n + 1$ rounds with $2s \log |S| + 1$ queries, where $s$ is the number of edges between $v$ and $S$. Moreover, if the answer of $Q(\{v\} \cup S)$ is known, then we need at most $2s \log |S|$ queries to identify all the edges between $v$ and $S$.
For completeness, we include them with slight modification.
‧ Algorithm A (V)
Part 1. FIND_ONE_VERTEX(V)
1. S ← V, A ← V
2. while |A| > 1 do
3. Divide A arbitrarily into A0 and A1, such that |A0| = ⌈|A|/2⌉, |A1| = ⌊|A|/2⌋.
4. if Q(S \ A0) = 0 then
5. A ← A0
6. else
7. A ← A1, S ← S \ A0
8. end if
9. end while
10. Let v be the unique element in A.
11. Output v, S \ {v}
Part 2. FIND_ONE_EDGE(V)
1. v, A ← FIND_ONE_VERTEX(V)
2. while |A| > 1 do
3. Divide A arbitrarily into A0 and A1, such that |A0| = ⌈|A|/2⌉, |A1| = ⌊|A|/2⌋.
4. if Q(A0 ∪ {v}) = 1 then
5. A ← A0
6. else
7. A ← A1
8. end if
9. end while
10. Output the edge joining v and the unique element of A.
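The two parts of Algorithm A can be sketched in Python as follows. This is an illustrative reading of the binary search described above, not a verbatim transcription: the oracle helper `make_oracle`, the list-based halving and the query counter are our assumptions, and the input is assumed to contain at least one hidden edge.

```python
def make_oracle(edges):
    """Edge-detecting oracle with a query counter (illustrative helper)."""
    hidden = {frozenset(e) for e in edges}
    calls = [0]

    def Q(S):
        calls[0] += 1
        S = set(S)
        return int(any(e <= S for e in hidden))

    return Q, calls

def find_one_vertex(V, Q):
    """Return (v, S): v is an endpoint of a hidden edge, S is an
    independent set that contains a neighbour of v. Assumes Q(V) == 1."""
    S, A = list(V), list(V)
    while len(A) > 1:
        half = (len(A) + 1) // 2
        A0, A1 = A[:half], A[half:]
        drop = set(A0)
        rest = [u for u in S if u not in drop]
        if Q(rest) == 0:
            A = A0              # every edge inside S touches A0
        else:
            A, S = A1, rest     # S without A0 still contains an edge
    v = A[0]
    return v, [u for u in S if u != v]

def find_one_edge(V, Q):
    """Return one hidden edge (v, u) by two binary searches."""
    v, A = find_one_vertex(V, Q)
    while len(A) > 1:
        half = (len(A) + 1) // 2
        A0, A1 = A[:half], A[half:]
        A = A0 if Q(A0 + [v]) == 1 else A1
    return v, A[0]

# Hypothetical hidden graph on V = {1, ..., 8}.
Q, calls = make_oracle([(3, 7), (5, 1), (5, 2)])
print(find_one_edge(range(1, 9), Q), calls[0])
```

Each loop halves the candidate set, so the two searches together ask at most about $2 \log n$ queries, in line with the bound quoted for Algorithm A.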
[Figure: a worked example of Algorithm A on the vertex set {1, 2, …, 8}, and the computation tree of Algorithm B finding the edges 51 and 52, with the queries Q({5,1,2}) = 1, Q({5,1}) = 1, Q({5,2}) = 1 and Q({5,3,4}) = 0.]
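Algorithm B is only described in prose here, so the following is a sketch under the assumption that it works by recursive halving of the independent set, which is consistent with its computation tree being a pruned full binary tree. The names `make_oracle` and `neighbours_in_independent_set` are ours.

```python
def make_oracle(edges):
    """Edge-detecting oracle with a query counter (illustrative helper)."""
    hidden = {frozenset(e) for e in edges}
    calls = [0]

    def Q(S):
        calls[0] += 1
        S = set(S)
        return int(any(e <= S for e in hidden))

    return Q, calls

def neighbours_in_independent_set(v, S, Q):
    """Return all u in S with uv a hidden edge; S must be independent
    and must not contain v."""
    S = list(S)
    if not S or Q(S + [v]) == 0:
        return []                       # no edge between v and S
    if len(S) == 1:
        return S                        # the single candidate is a neighbour
    half = (len(S) + 1) // 2
    return (neighbours_in_independent_set(v, S[:half], Q)
            + neighbours_in_independent_set(v, S[half:], Q))

# Hypothetical instance: vertex 5 is adjacent to 1 and 2 inside {1, 2, 3, 4}.
Q, calls = make_oracle([(5, 1), (5, 2)])
print(neighbours_in_independent_set(5, [1, 2, 3, 4], Q), calls[0])
```

Every recursive call asks exactly one query, so the total never exceeds the number of nodes of a full binary tree with $|S|$ leaves. The two recursive calls at each level are independent of each other, which is why the calls at one depth can be issued in the same round.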
Chapter 3. Main Result
3.1. Algorithms
We start by presenting our algorithms.
Note that the edges between two independent sets can all be found by using Algorithm B. The following algorithm uses a maximal matching to partition the vertices of $G$ into several bipartite graphs and an independent set. In other words, we provide an algorithm to partition the vertex set into several independent sets. Our first objective is to minimize the number of independent sets.
Algorithm 1. MAXIMAL_MATCHING(V)
1. M ← ∅, i ← 1, S ← V
2. while Q(S) = 1 do
3. x_i y_i ← FIND_ONE_EDGE(S)
4. M ← M ∪ {x_i y_i}, S ← S \ {x_i, y_i}, i ← i + 1
5. end while
6. Output M
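A maximal matching can be built from edge-detecting queries by repeatedly finding one edge and removing its endpoints. The sketch below is our reading of that idea, with a compact version of the binary search of Algorithm A included so that the block is self-contained; all function names are illustrative assumptions.

```python
def make_oracle(edges):
    """Edge-detecting oracle for a hidden edge set (illustrative helper)."""
    hidden = {frozenset(e) for e in edges}

    def Q(S):
        S = set(S)
        return int(any(e <= S for e in hidden))

    return Q

def find_one_edge(V, Q):
    """Binary search for one hidden edge; assumes Q(V) == 1."""
    S, A = list(V), list(V)
    while len(A) > 1:
        half = (len(A) + 1) // 2
        A0, A1 = A[:half], A[half:]
        rest = [u for u in S if u not in A0]
        if Q(rest) == 0:
            A = A0
        else:
            A, S = A1, rest
    v = A[0]
    A = [u for u in S if u != v]
    while len(A) > 1:
        half = (len(A) + 1) // 2
        A0, A1 = A[:half], A[half:]
        A = A0 if Q(A0 + [v]) == 1 else A1
    return v, A[0]

def maximal_matching(V, Q):
    """Return (M, S): a maximal matching M and the independent set S
    of vertices left unmatched."""
    S, M = list(V), []
    while Q(S) == 1:                    # one query per loop test
        x, y = find_one_edge(S, Q)
        M.append((x, y))
        S = [u for u in S if u != x and u != y]
    return M, S

# Hypothetical hidden graph on V = {1, ..., 7}.
Q = make_oracle([(1, 2), (2, 3), (3, 4), (5, 6)])
print(maximal_matching(range(1, 8), Q))
```

The loop stops exactly when the remaining vertices are independent, which is what makes the matching maximal.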
Algorithm 2. PARTITION_OF_VERTEX_SET( )
1.
2. MAXIMAL_MATCHING( )
3. , 1
4. for 1 , k , do
5. for 1 , , do
6. if then
7. , , ,
8. break(1‐loop)
9. else
10. Make queries on , , ,
11. if 0 then
12. , , ,
13. break(1‐loop)
14. else if 0 then
15. , , ,
16. break(1‐loop)
17. end if
18. end if
19. end for
20. end for
21. ! . . ,
22. Output
Remark: After running Algorithm 2, it is easily seen that the vertex sets , and for 1, 2, … , , are independent sets in , where \ \ . Now, if we can identify all the edges between every two different independent sets, then the graph is reconstructed.
Algorithm 3. HIDDEN_GRAPH( )
1. PARTITION_OF_VERTEX_SET( )
2. V\
3.
4. for 1 , , do
5. for do
6. ! s. t.
7. VERTEX_INDEPENDENT_SET( , \ )
8. end for
9. end for
10.
11. for 2 , , do
12. for do
13. for 1 , , do
14. if 1 then
15. VERTEX_INDEPENDENT_SET( , )
16. end if
17. if 1 then
18. VERTEX_INDEPENDENT_SET( , )
19. end if
20. end for
21. end for
22. end for
23.
24. for 1 , , do
25. for every v do
26. VERTEX_INDEPENDENT_SET( , )
27. end for
28. end for
29. Output
3.2. Analysis
3.2.1. Query complexity
The following two observations are essential in determining the complexity of our algorithms. First, (1) for every 1 there exists a unique such that , . And (2) there exist at least two edges in between two vertex sets and for every .
Also, for convenience of counting the number of queries, we let | |
, | | , | | and | | (i.e. m ).
In Algorithm 1, we need one query to check whether the current vertex set is independent before each call of Algorithm A, and Algorithm A is called $|M|$ times to identify a maximal matching $M$, so the number of queries is the sum of $|M| + 1$ and $2|M| \log n$.
In Algorithm 2, it suffices to consider the number of queries in the 2nd and 10th lines. By observations (1) and (2) mentioned above, we know the 10th line will be repeated at most ⁄2 times and thus in total 5 2 log 1 2 queries.
In Algorithm 3, we use Algorithm B to find the edge sets , and . The 7th line is repeated times, where is the size of the maximal matching ; in other words, Algorithm B is called times in this line, and the number of queries in this line is therefore 2 log . In the 15th and 18th lines, we use Algorithm B to find the edge set , and since we already have some information before calling Algorithm B, the number of queries in the 15th and 18th lines is 2 log . Finally, in the 26th line, we identify the edge set , and Algorithm B is called 2 times, so the number of queries is 2 2 log . The following tables show the above facts.
Algorithm 1.
Line Number of queries
2 1
3 2 log
Total 1 2 log 1
Algorithm 2.
Line Number of queries
2 1 2 log 1
10 4 2
Total 5 2 log 1 2
Algo
(2.) VERTEX_INDEPENDENT_SET(7 , )
(3.) VERTEX_INDEPENDENT_SET(1 , )
VERTEX_INDEPENDENT_SET(1 , \ 3 )
(4.) VERTEX_INDEPENDENT_SET(5 , )
VERTEX_INDEPENDENT_SET(5 , \ 7 ) find edge
(5.) VERTEX_INDEPENDENT_SET(4 , )
VERTEX_INDEPENDENT_SET(4 , ) find edge
VERTEX_INDEPENDENT_SET(4 , )
(6.) VERTEX_INDEPENDENT_SET(2 , )
VERTEX_INDEPENDENT_SET(2 , )
VERTEX_INDEPENDENT_SET(2 , ) find edge
NOTE. \ 4 COMPLETED……
3.2.3. Worst case of this algorithm
We have shown that our adaptive algorithm for learning a hidden graph with $m$ edges defined on a set of $n$ vertices uses at most $m(2 \log n + 9)$ queries. This algorithm is efficient for reconstructing a graph when the number of edges is small. If the number of edges of the hidden graph is large, say on the order of $n^2$, the upper bound $m(2 \log n + 9)$ from the above analysis exceeds the $\binom{n}{2}$ queries used by the trivial algorithm, which queries every pair of vertices. In what follows, we analyze the upper bound and the lower bound of this algorithm from another point of view.
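To see where the two approaches cross over, the bounds can simply be evaluated. The sketch below assumes that the adaptive bound reads $m(2 \log n + 9)$ and that the trivial algorithm asks one query per pair of vertices; the function names are ours.

```python
from math import comb, log2

def adaptive_bound(n, m):
    """Upper bound m * (2 log2 n + 9) on the queries of the adaptive algorithm."""
    return m * (2 * log2(n) + 9)

def trivial_bound(n):
    """Queries used when every pair of vertices is tested."""
    return comb(n, 2)

def crossover(n):
    """Smallest number of edges m for which the adaptive bound
    exceeds the trivial bound."""
    per_edge = 2 * log2(n) + 9
    return int(trivial_bound(n) // per_edge) + 1

for n in (64, 1024, 2 ** 16):
    print(n, trivial_bound(n), crossover(n))
```

For $n = 1024$ the cost per edge is 29 queries, so the adaptive bound stays below the trivial one as long as the graph has roughly fewer than $n^2 / 58$ edges.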
Lemma 3.2.1. In Multi‐vertex model,
2edge‐detecting queries are required to identify a graph drawn from the class of all graphs with vertices and
2edges.
Proof. Clearly, every edge in must be identified by an edge-detecting query Q where and \ . □
Lemma 3.2.2. Algorithm B ($v$, $S$) identifies the edges between $v$ and $S$ using no more than $2|S| - 1$ edge-detecting queries.
Proof. For convenience, we may assume that the size of $S$ is a power of 2. We consider the computation tree for this algorithm. This computation tree is a full binary tree; the number of leaves in this tree is equal to the size of $S$, and the number of nodes is $2|S| - 1$. In general, we may delete some pairs of leaves from a full binary tree to obtain its computation tree. □
Theorem 3.2.3. The number of queries used by our adaptive algorithm to learn a general graph $G = (V, E)$ does not exceed 1 log 3 where $n$ is the size of $V$.
Proof. In Algorithm 1, since the size of a maximal matching does not exceed $n/2$, the worst case is the sum of $n/2 + 1$ and $n \log n$.
In Algorithm 2, the 10th line is not repeated more than $|M|$ times, where $M$ is the maximal matching of $G$. So the worst case in this line does not exceed queries.
After Algorithm 1 and Algorithm 2, the vertex set will be partitioned into several vertex sets , , where 1 . Assign indices 1, 2, … , to the vertices of according to the following restrictions. First, (1) for every 1 , if and , then the index of is smaller than the index of . And (2) if and , then the index of is smaller than the index of when 0 .
In Algorithm 3, we use Algorithm B to find all the edges, and Lemma 3.2.2 gives a bound for Algorithm B. The number of queries does not exceed double the sum of the sizes of the sets on which Algorithm B is called. Consider the vertex and the independent set in a call of Algorithm B. By the above paragraph, the index of is greater than the index of every vertex in , so the number of queries of Algorithm 3 is at most 2 . Therefore, we can reconstruct a hidden graph using no more than 1 log 3 queries. □
3.2.4. Number of parallel nonadaptive rounds
Because the calls of Algorithm B in Algorithm 3 can be parallelized, all edges can be found in $2 \log n + 1$ rounds, which sharply reduces the number of rounds used in reconstructing the hidden graph. However, we do not yet have a good way to reduce the number of rounds of Algorithm 1 and Algorithm 2.
Concluding remarks
In this paper, we have presented a new adaptive algorithm to find a hidden graph. The number of queries used by our algorithm is close to the lower bound that we hope to achieve, especially when the size of the hidden graph is far less than its order. Here are two directions that we would like to pursue in the near future:
1. Reduce the rounds of Algorithm 1 (i.e., obtain an efficient algorithm to find a maximal matching).
2. Learn a hidden graph in the quantitative k-multi-vertex model.
References
[1] N. Alon, R. Beigel, S. Kasif, S. Rudich, and B. Sudakov, Learning a hidden matching, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 197–206, 2002.
[2] D. Angluin and J. Chen, Learning a hidden graph using O(log n) queries per edge, Manuscript, 2006.
[3] D. Angluin and J. Chen, Learning a hidden hypergraph, J. of Machine Learning Research 7, 2215–2236, 2007.
[4] R. Beigel, N. Alon, S. Kasif, M. S. Apaydin, and L. Fortnow, An optimal procedure for gap closing in whole genome shotgun sequencing, In RECOMB, 22–30, 2001.
[5] V. Grebinski and G. Kucherov, Optimal query bounds for reconstructing a Hamiltonian cycle in complete graphs, In Fifth Israel Symposium on the Theory of Computing Systems, 166–173, 1997.
[6] V. Grebinski and G. Kucherov, Reconstructing a Hamiltonian cycle by querying the graph: Application to DNA physical mapping, Discrete Applied Math., 88(1–3): 147–165, 1998.