Computing Complexity and Algorithm - 基因排列的中端標記部分剪切問題

1.2 Preliminaries

1.2.3 Computing Complexity and Algorithm

An algorithm is like a recipe in a cookbook: given the ingredients as input, following the steps produces a dish as output. Designing an algorithm means designing a procedure that solves a problem. Cooking takes time: you can spend an hour drilling wood to make fire, or light a gas stove in a second. Algorithms are the same: a better algorithm needs less time to solve the same problem. We say an algorithm is O(f(n)) if its worst-case running time is at most c·f(n) when n is large enough, where c is a constant. Ω(f(n)) means the running time is at least c·f(n), and Θ(f(n)) means the running time is between c₁f(n) and c₂f(n) for constants c₁ and c₂. We usually focus on the O(f(n)) bound for the running time.

For any problem, we would like to find a polynomial-time algorithm, such as one running in O(n) or O(n² lg n) time, but for some problems we cannot. For these we only know exponential-time algorithms, such as O(eⁿ) or O(2ⁿ); the NP-hard problems are the best-known class of this kind. This does not mean that no polynomial-time algorithm exists for these problems, only that none has been found.

For the sorting problem, merge sort sorts a sequence of n numbers in O(n lg n) time. For searching, there is an O(n) algorithm that decides whether a number is contained in a sequence of n numbers. If the sequence is already sorted, however, binary search needs only O(lg n) time to find the number. So preprocessing the data can reduce the running time.
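The difference between the two searches can be sketched in Python. This is a minimal illustration, not code from the thesis; the function names `linear_search` and `binary_search` are chosen here for the example.

```python
def linear_search(seq, target):
    """O(n): inspect the elements one at a time."""
    for x in seq:
        if x == target:
            return True
    return False


def binary_search(sorted_seq, target):
    """O(lg n): halve the candidate range at every step.

    Assumes sorted_seq is sorted in increasing order.
    """
    lo, hi = 0, len(sorted_seq) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_seq[mid] == target:
            return True
        if sorted_seq[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return False


numbers = [7, 3, 9, 1, 4]
print(linear_search(numbers, 9))          # works on unsorted data
print(binary_search(sorted(numbers), 9))  # needs the data sorted first
```

Sorting costs O(n lg n) once, so the preprocessing pays off when the same sequence is searched many times.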

In computer science, we often use a binary tree to explain a method or to analyze the running time of an algorithm.

A binary tree is a kind of tree with a top vertex, called the root, in which every vertex has at most two children: a left child and a right child. That vertex is the parent of these children. The height of a tree is the length of the longest path starting from the root. A complete binary tree with height h is a binary tree with height h whose vertices are full, meaning that every vertex above the bottom row has two children. We give an example:

[Figure: a binary tree with height 3 (left) and a complete binary tree with height 3 (right).]

The graph on the left, the binary tree with height 3, has root v₁, whose children are the left child v₂ and the right child v₃. The parent of v₂ is v₁. Notice that v₃ has only one child, the left child v₆. The graph on the right is a complete binary tree: all of its vertices are full.

Depth First Search (DFS) is a technique for searching a graph; the main idea is to always continue from the most recently discovered vertex that still has an unexplored edge. Breadth First Search (BFS), in contrast, explores the oldest discovered vertex that still has an unexplored edge. We use the binary tree on the left above as an example, starting from v₁ and visiting the left child before the right child. In DFS, we start from v₁, then discover v₂, then v₄, then v₇. Since v₇ has no children, we go back and find the right child of v₄, which is v₈. The discovery order for DFS is therefore v₁, v₂, v₄, v₇, v₈, v₅, v₃, v₆, v₉. BFS searches the nearest vertices first, so its discovery order is v₁, v₂, v₃, v₄, v₅, v₆, v₇, v₈, v₉.
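Both traversals can be sketched in Python on the example tree. The tree shape below is an assumption reconstructed from the text and the two discovery orders, since the figure itself is not reproduced here; in particular, v₉ is assumed to be the left child of v₆, which the text does not state. The names `TREE`, `dfs_order`, and `bfs_order` are illustrative.

```python
from collections import deque

# vertex -> (left child, right child); None marks a missing child.
# Shape inferred from the traversal orders in the text (an assumption).
TREE = {
    1: (2, 3),
    2: (4, 5),
    3: (6, None),
    4: (7, 8),
    5: (None, None),
    6: (9, None),  # assumed: v9 hangs on the left of v6
    7: (None, None),
    8: (None, None),
    9: (None, None),
}


def dfs_order(root):
    """Discovery order of DFS, left child before right child."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()  # most recently discovered vertex
        order.append(v)
        left, right = TREE[v]
        # Push right first so that the left child is popped first.
        for child in (right, left):
            if child is not None:
                stack.append(child)
    return order


def bfs_order(root):
    """Discovery order of BFS, left child before right child."""
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()  # oldest discovered vertex
        order.append(v)
        for child in TREE[v]:
            if child is not None:
                queue.append(child)
    return order


print(dfs_order(1))  # [1, 2, 4, 7, 8, 5, 3, 6, 9]
print(bfs_order(1))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The only difference between the two functions is the container: a stack yields DFS and a queue yields BFS.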

To analyze the running time, we have to count how many vertices there are in the tree.

For a complete binary tree, there are 2ⁱ vertices in the i-th row, counting the root as row 0. So, there are $\sum_{i=0}^{n} 2^{i} = 2^{n+1} - 1$ vertices in a complete binary tree with height n.

[Figure: a complete binary tree with height n.]

We need to count the vertices of a special tree. For given integers l and r, the tree T_{l,r} has height l + r and consists of all paths starting from the root that contain exactly l left children and r right children. We give an example for T_{1,2}.

T_{1,2} has 9 vertices. The number of vertices in T_{l,r} is

$$\sum_{0 \le i \le l,\; 0 \le j \le r} \binom{i+j}{j}.$$

To explain this expression: each term $\binom{i+j}{j}$ is the number of paths starting from the root that contain exactly i left children and j right children. For example, in T_{1,2} there are $\binom{3}{2} = 3$ paths with one left child and two right children.

So the sum counts all the paths starting from the root, and the number of these paths is equal to the number of vertices. The number of vertices in T_{l,r} is:

$$|T_{l,r}| = -1 + \binom{l+r+2}{l+1}$$
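The vertex count of T_{l,r} can be checked numerically. The sketch below assumes that a vertex of T_{l,r} is a path from the root using at most l left steps and at most r right steps, which is how the definition above is read here; the function names are illustrative and not from the thesis.

```python
from math import comb


def count_by_walking(l, r):
    """Count the vertices of T_{l,r} by walking the tree itself."""
    def walk(lefts, rights):
        total = 1  # the current vertex
        if lefts < l:
            total += walk(lefts + 1, rights)  # left subtree
        if rights < r:
            total += walk(lefts, rights + 1)  # right subtree
        return total
    return walk(0, 0)


def count_by_sum(l, r):
    """Sum of C(i+j, j) over 0 <= i <= l and 0 <= j <= r."""
    return sum(comb(i + j, j) for i in range(l + 1) for j in range(r + 1))


def count_closed_form(l, r):
    """Closed form: C(l+r+2, l+1) - 1."""
    return comb(l + r + 2, l + 1) - 1


print(count_by_walking(1, 2), count_by_sum(1, 2), count_closed_form(1, 2))
```

For T_{1,2} all three counts agree with the 9 vertices stated in the text.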

Another thing we need to know is Stirling's approximation:

$$n! \sim \left(\frac{n}{e}\right)^{n} \sqrt{2\pi n}$$

By Stirling's approximation, we have:

$$\binom{n}{n/2} \sim \frac{2^{n}\sqrt{2}}{\sqrt{\pi n}}$$
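The step from Stirling's approximation to the binomial estimate can be sketched as follows. This assumes n is even and that the estimated quantity is the central binomial coefficient $\binom{n}{n/2}$, which is the coefficient whose growth matches $2^{n}\sqrt{2}/\sqrt{\pi n}$:

```latex
\begin{align*}
\binom{n}{n/2} = \frac{n!}{\left((n/2)!\right)^{2}}
&\sim \frac{\left(\frac{n}{e}\right)^{n}\sqrt{2\pi n}}
           {\left(\left(\frac{n}{2e}\right)^{n/2}\sqrt{\pi n}\right)^{2}} \\
&= \frac{\left(\frac{n}{e}\right)^{n}\sqrt{2\pi n}}
        {\left(\frac{n}{2e}\right)^{n}\,\pi n} \\
&= \frac{2^{n}\sqrt{2\pi n}}{\pi n}
 = \frac{2^{n}\sqrt{2}}{\sqrt{\pi n}}
\end{align*}
```

The denominator uses Stirling's approximation for $(n/2)!$, which is $\left(\frac{n}{2e}\right)^{n/2}\sqrt{\pi n}$.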

In this thesis, we will use these two results to analyze our running times:

$$|T_{l,r}| = -1 + \binom{l+r+2}{l+1} \tag{1.1}$$

$$\binom{n}{n/2} \sim \frac{2^{n}\sqrt{2}}{\sqrt{\pi n}} \tag{1.2}$$

Chapter 2 Previous Works

In this chapter, we introduce some variants of partial digest problems. For convenience, we define:

X = {x₁, x₂, ..., x_n}, the restriction sites of DNA, with 0 = x₁ < x₂ < ... < x_n.

∆X = {x_j − x_i | 1 ≤ i < j ≤ n}, the distances between every two distinct restriction sites.

T = x_n − x₁, the length of the target DNA.

2.1 Partial Digest Problem (PDP)

The partial digest problem is to find the solution X, given only ∆X.

We give an example for PDP. Suppose X = {0, 2, 6, 9, 10}, as below.

Then, ∆X = {2, 6, 9, 10, 4, 7, 8, 3, 4, 1} is the multiset of $\binom{5}{2} = 10$ distances between every two distinct restriction sites.
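Computing ∆X from X is the easy direction of the problem. A minimal Python sketch, with the illustrative name `pairwise_distances`:

```python
from itertools import combinations


def pairwise_distances(sites):
    """Return the multiset of x_j - x_i over all pairs i < j, sorted."""
    ordered = sorted(sites)
    return sorted(b - a for a, b in combinations(ordered, 2))


X = [0, 2, 6, 9, 10]
delta_x = pairwise_distances(X)
print(delta_x)       # [1, 2, 3, 4, 4, 6, 7, 8, 9, 10]
print(len(delta_x))  # 10
```

The result is kept as a list rather than a set because ∆X is a multiset: the distance 4 occurs twice in this example.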

Skiena et al. [16] showed that the maximum number of possible solutions is between $\frac{1}{2}n^{0.8107144}$ and $\frac{1}{2}n^{1.2324827}$. For the error-free case, they also gave a back-tracking algorithm. The main idea of this algorithm is to always place the maximum of the remaining lengths. In each step, let M be the maximum of the remaining lengths. M must be measured from the left end or from the right end; otherwise it would produce a remaining length larger than M. So we first put M at the left end and see what happens. If that fails, we put M at the right end instead. If both sides fail, we back-track to the previous step and change how that step's maximum length was placed. The back-tracking algorithm thus runs like a DFS on a binary tree with height n.

For the example, ∆X = {1, 2, 3, 4, 4, 6, 7, 8, 9, 10}. Pick the maximum T = 10, which is the length of the target DNA, and delete T from ∆X. So ∆X = {1, 2, 3, 4, 4, 6, 7, 8, 9}.

Then choose M = 9, the maximum of the remaining ∆X, put M at the left end (a site at position 9), and delete the lengths it produces, 9 and 1. So ∆X = {2, 3, 4, 4, 6, 7, 8}.

Then M = 8. If we put it at the left end, the site at 8 and the site at 9 produce a length 1 that is not in the remaining ∆X. So we put M at the right end (a site at position 2) and delete the lengths 2, 7, and 8. ∆X = {3, 4, 4, 6}.

Then M = 6. As before, put M at the left end (a site at position 6), which deletes the lengths 6, 4, 3, and 4. Now ∆X = ∅, so output the answer X = {0, 2, 6, 9, 10}.

This is not the end of the algorithm, since we want all the solutions. Back-track to the previous step, where ∆X = {3, 4, 4, 6}; putting M = 6 at the right end does not fit, because a site at position 4 would produce the length 2, which is not in the remaining ∆X. So back-track again to ∆X = {2, 3, 4, 4, 6, 7, 8}. Continuing in this way, we obtain all the solutions.
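The walk-through can be turned into a short Python sketch of the back-tracking idea. This is an illustration written for this example, not the implementation of Skiena et al.; the name `partial_digest` and its helpers are hypothetical, and the input is assumed to be error-free with no zero distances.

```python
from collections import Counter


def partial_digest(distances):
    """Return all site sets (with leftmost site 0) whose pairwise
    distances equal the given multiset."""
    remaining = Counter(distances)
    width = max(remaining)   # T, the length of the target DNA
    remaining[width] -= 1
    placed = [0, width]      # the two end sites
    solutions = set()

    def try_place(y):
        """Place a site at y if every distance it produces is still
        available; return the distances used, or None on failure."""
        needed = Counter(abs(y - x) for x in placed)
        if any(remaining[d] < c for d, c in needed.items()):
            return None
        remaining.subtract(needed)
        placed.append(y)
        return needed

    def undo(needed):
        """Take back the most recent placement."""
        placed.pop()
        remaining.update(needed)

    def search():
        left_over = +remaining   # drop lengths whose count is zero
        if not left_over:
            solutions.add(tuple(sorted(placed)))
            return
        m = max(left_over)       # the maximum remaining length M
        for y in (m, width - m): # left end first, then right end
            needed = try_place(y)
            if needed is not None:
                search()
                undo(needed)     # back-track

    search()
    return sorted(solutions)


print(partial_digest([2, 6, 9, 10, 4, 7, 8, 3, 4, 1]))
# [(0, 1, 4, 8, 10), (0, 2, 6, 9, 10)]
```

On the example it finds X = {0, 2, 6, 9, 10} and its mirror image {0, 1, 4, 8, 10}, which has the same pairwise distances.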

The worst-case running time of the back-tracking algorithm is O(n2ⁿ lg n). Zhang [18] gave an example on which this algorithm takes exponential time, so the back-tracking algorithm is Θ(n2ⁿ lg n). The hardness of PDP is unknown, but no polynomial-time algorithm has been found.

For the partial digest problem with errors, Cieliebak et al. proved that it is NP-hard both for measurement errors [3] and for noisy data [4]. Skiena and Sundaram [15] analyzed the back-tracking algorithm under the assumption that the locations of the sites are chosen according to a binomial distribution. Then the algorithm runs in O(n³) time with high probability and can tolerate a relative error of $O(1/n^{2})$ in the fragment lengths.
