
A Class Note on Basic Algorithmic Techniques

Kun-Mao Chao 1,2,3

1 Graduate Institute of Biomedical Electronics and Bioinformatics
2 Department of Computer Science and Information Engineering
3 Graduate Institute of Networking and Multimedia
National Taiwan University, Taipei, Taiwan 106

September 24, 2008

An algorithm is a step-by-step procedure for solving a problem by a computer.

Although the act of designing an algorithm is considered an art and can never be fully automated, its general strategies are learnable. Here we introduce a few frameworks of computer algorithms, including greedy algorithms, divide-and-conquer strategies, and dynamic programming.

This note starts with the definition of algorithms and their complexity in Section 1. We introduce the asymptotic O-notation used in the analysis of the running time and space of an algorithm. Two tables are used to demonstrate that the asymptotic complexity of an algorithm will ultimately determine the size of problems that can be solved by the algorithm.

Then, we introduce greedy algorithms in Section 2. For some optimization problems, greedy algorithms yield simple and efficient solutions. A greedy algorithm pursues the best choice at the moment in the hope that it will lead to the best solution in the end. It works quite well for a wide range of problems. Huffman's algorithm is used as an example of a greedy algorithm.

Section 3 describes another common algorithmic technique, called divide-and-conquer. This strategy divides the problem into smaller parts, conquers each part individually, and then combines them to form a solution for the whole. We use the mergesort algorithm to illustrate the divide-and-conquer algorithm design paradigm.

Following its introduction by Needleman and Wunsch, dynamic programming has become a major algorithmic strategy for many optimization problems in sequence comparison. The development of a dynamic-programming algorithm has three basic components: the recurrence relation for defining the value of an optimal solution, the tabular computation for computing the value of an optimal solution, and the backtracking procedure for delivering an optimal solution. In Section 4, we introduce these basic ideas by developing dynamic-programming solutions for problems from different application areas, including the maximum-sum segment problem, the longest increasing subsequence problem, and the longest common subsequence problem.

1 Algorithms and Their Complexity

An algorithm is a step-by-step procedure for solving a problem by a computer.

When an algorithm is executed by a computer, the central processing unit (CPU) performs the operations and the memory stores the program and data.

Let n be the size of the input, the output, or their sum. The time or space complexity of an algorithm is usually expressed as a function f(n). Table 1 lists the time needed when f(n) stands for the number of operations required by an algorithm, assuming that the CPU performs one million operations per second.

Exponential functions grow very quickly, so exponential-time algorithms become impractical even when n is small. Quadratic and cubic functions grow faster than linear ones; the constant and logarithmic factors matter, but are mostly acceptable in practice. As a rule of thumb, algorithms with a quadratic time complexity or higher are often impractical for large data sets.

Table 2 further shows the growth of the input size solvable by polynomial-time and exponential-time algorithms as computers improve. Even with a million-times faster computer, the 10^n algorithm only adds 6 to the solvable input size, which makes it hopeless for handling even a moderate-size input.

These observations lead to the definition of the O-notation, which is very useful in the analysis of algorithms. We say f(n) = O(g(n)) if and only if there exist two positive constants c and n_0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n_0. In other words, for sufficiently large n, f(n) can be bounded above by g(n) times a constant. In this kind of asymptotic analysis, the most crucial part is the order of the function, not the constant. For example, if f(n) = 3n^2 + 5n, we can say f(n) = O(n^2) by letting c = 4 and n_0 = 10.


Table 1: The time needed by the functions, assuming one million operations per second.

f(n)              n = 10           n = 100                    n = 100000
30n               0.0003 second    0.003 second               3 seconds
100 n log10 n     0.001 second     0.02 second                50 seconds
3n^2              0.0003 second    0.03 second                8.33 hours
n^3               0.001 second     1 second                   31.71 years
10^n              2.78 hours       3.17 × 10^84 centuries     3.17 × 10^99984 centuries

By definition, it is also correct to say n^2 = O(n^3), but we always prefer to choose a tighter order if possible. On the other hand, 10^n ≠ O(n^x) for any integer x. That is, an exponential function cannot be bounded above by any polynomial function.

Table 2: The growth of the input size solvable in an hour as the computer runs faster.

f(n)     Present speed    1000-times faster    10^6-times faster
n        x1               1000 x1              10^6 x1
n^2      x2               31.62 x2             10^3 x2
n^3      x3               10 x3                10^2 x3
10^n     x4               x4 + 3               x4 + 6
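As a quick sanity check on these tables, the following short Python sketch (ours, not part of the original note; the dictionary of functions and the names are illustrative) converts the operation counts of Table 1 into seconds under the one-million-operations-per-second assumption.

import math

OPS_PER_SECOND = 1_000_000  # the assumption used in Table 1

# The five functions of Table 1 (the dictionary and its labels are ours).
functions = {
    "30n":           lambda n: 30 * n,
    "100 n log10 n": lambda n: 100 * n * math.log10(n),
    "3n^2":          lambda n: 3 * n ** 2,
    "n^3":           lambda n: n ** 3,
    "10^n":          lambda n: 10 ** n,
}

for name, f in functions.items():
    for n in (10, 100):
        seconds = f(n) / OPS_PER_SECOND
        print(f"{name:>14}  n = {n:<4}  {seconds:g} seconds")

Running it for n = 10 and n = 100 reproduces the corresponding columns; the 10^n entry for n = 100000 is omitted here only because the resulting number of seconds overflows a floating-point number, which is exactly the point of Table 1.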

2 Greedy Algorithms

A greedy method works in stages. It always makes a locally optimal (greedy) choice at each stage. Once a choice has been made, it cannot be withdrawn, even if later we realize that it is a poor decision. In other words, this greedy choice may or may not lead to a globally optimal solution, depending on the characteristics of the problem.

It is a very straightforward algorithmic technique and has been used to solve a variety of problems. In some situations, it solves the problem exactly. In others, it has been proved to yield effective approximations.

What kinds of problems are suitable for a greedy solution? There are two ingredients for an optimization problem to be solved exactly by a greedy approach.

One is that it has the so-called greedy-choice property, meaning that a locally optimal choice can reach a globally optimal solution. The other is that it satisfies the principle of optimality, i.e., each solution substructure is optimal. We use Huffman coding, a frequency-dependent coding scheme, to illustrate the greedy approach.

2.1 Huffman Codes

Suppose we are given a very long DNA sequence where the occurrence probabilities of nucleotides A (adenine), C (cytosine), G (guanine), T (thymine) are 0.1, 0.1, 0.3, and 0.5, respectively. In order to store it in a computer, we need to transform it into a binary sequence, using only 0's and 1's. A trivial solution is to encode A, C, G, and T by "00," "01," "10," and "11," respectively. This representation requires two bits per nucleotide. The question is "Can we store the sequence in a more compressed way?" Fortunately, by assigning shorter codes to the frequent nucleotides G and T, and longer codes to the rare nucleotides A and C, it can be shown that fewer than two bits per nucleotide are required on average.

In 1952, Huffman proposed a greedy algorithm for building up an optimal way of representing each letter as a binary string. It works in two phases. In phase one, we build a binary tree based on the occurrence probabilities of the letters. To do so, we first write down all the letters, together with their associated probabilities.

They are initially the unmarked terminal nodes of the binary tree that we will build up as the algorithm proceeds. As long as there is more than one unmarked node left, we repeatedly find the two unmarked nodes with the smallest probabilities, mark them, create a new unmarked internal node with an edge to each of the nodes just marked, and set its probability as the sum of the probabilities of the two nodes.

The tree-building process is depicted in Figure 1. Initially, there are four unmarked nodes with probabilities 0.1, 0.1, 0.3, and 0.5. The two smallest ones have probabilities 0.1 and 0.1. Thus we mark these two nodes and create a new node with probability 0.2, connected to the two nodes just marked. Now we have three unmarked nodes with probabilities 0.2, 0.3, and 0.5. The two smallest ones have probabilities 0.2 and 0.3. They are marked, and a new node with probability 0.5 connecting them is created. The final iteration connects the only two unmarked nodes, with probabilities 0.5 and 0.5. Since there is only one unmarked node left, i.e., the root of the tree, we are done with the binary tree construction.

Figure 1: Building a binary tree based on the occurrence probabilities of the letters.

After the binary tree is built in phase one, the second phase is to assign the binary strings to the letters. Starting from the root, we recursively assign the value "zero" to the left edge and "one" to the right edge. Then for each leaf, i.e., letter, we concatenate the 0's and 1's on the path from the root to it to form its binary string representation. For example, in Figure 2 the resulting codewords for A, C, G, and T are "000," "001," "01," and "1," respectively. By this coding scheme, a 20-nucleotide DNA sequence "GTTGTTATCGTTTATGTGGC" will be represented as a 34-bit binary sequence "0111011100010010111100010110101001." In general, since 3 × 0.1 + 3 × 0.1 + 2 × 0.3 + 1 × 0.5 = 1.7, we conclude that, with Huffman coding, each nucleotide requires 1.7 bits on average, which is superior to the 2 bits of the trivial solution. Notice that in a Huffman code, no codeword is a prefix of any other codeword. Therefore we can decode a binary sequence without any ambiguity. For example, given "0111011100010010111100010110101001," we decode the binary sequence as "01" (G), "1" (T), "1" (T), "01" (G), and so forth.

The correctness of Huffman's algorithm lies in two properties: (1) the greedy-choice property and (2) the optimal-substructure property. It can be shown that there exists an optimal binary code in which the codewords for the two smallest-probability nodes have the same length and differ only in the last bit.


Figure 2: Huffman code assignment.

That is the reason why we can contract them greedily without missing the path to an optimal solution. Moreover, after the contraction, the optimal-substructure property allows us to consider only the remaining unmarked nodes.

Let n be the number of letters under consideration. For DNA, n is 4 and for English, n is 26. Since a heap can be used to maintain the minimum dynamically in O(log n) time for each insertion or deletion, the time complexity of Huffman’s algorithm is O(n log n).
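To make the two phases concrete, here is a minimal Python sketch of Huffman's algorithm using the standard heapq module (the function and variable names are ours, not part of the note). It repeatedly merges the two smallest-probability unmarked nodes, then assigns "0" to left edges and "1" to right edges.

import heapq

def huffman_codes(probs):
    """Build Huffman codewords for a dict mapping letters to probabilities."""
    # A heap entry is (probability, tie_breaker, tree); a tree is either a
    # letter or a pair (left_subtree, right_subtree).
    heap = [(p, i, letter) for i, (letter, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Greedy choice: merge the two nodes with the smallest probabilities.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tie, (t1, t2)))
        tie += 1
    # Phase two: assign "0" to left edges and "1" to right edges.
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # degenerate one-letter alphabet
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"A": 0.1, "C": 0.1, "G": 0.3, "T": 0.5}))
# {'T': '0', 'A': '100', 'C': '101', 'G': '11'}

The exact 0/1 labels may differ from Figure 2 depending on how ties are broken, but the codeword lengths are again 1, 2, 3, and 3 bits for T, G, A, and C, so the average is the same 1.7 bits per nucleotide.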

3 Divide-and-Conquer Strategies

The divide-and-conquer strategy divides the problem into a number of smaller subproblems. If a subproblem is small enough, it conquers it directly. Otherwise, it conquers the subproblem recursively. Once the solutions to the subproblems are available, it combines them to form a solution to the original problem.

One of the well-known applications of the divide-and-conquer strategy is the design of sorting algorithms. We use mergesort to illustrate the divide-and-conquer algorithm design paradigm.


3.1 Mergesort

Given a sequence of n numbers ⟨a_1, a_2, . . . , a_n⟩, the sorting problem is to sort these numbers into a nondecreasing sequence. For example, if the given sequence is ⟨65, 16, 25, 85, 12, 8, 36, 77⟩, then its sorted sequence is ⟨8, 12, 16, 25, 36, 65, 77, 85⟩.

To sort a given sequence, mergesort splits the sequence in half, sorts each half recursively, and then combines the resulting two sorted sequences into one sorted sequence. Figure 3 illustrates the dividing process. The original input sequence consists of eight numbers. We first divide it into two smaller sequences, each consisting of four numbers. Then we divide each four-number sequence into two smaller sequences, each consisting of two numbers. Here we can sort the two numbers by comparing them directly, or divide them further into two smaller sequences, each consisting of only one number. Either way we will reach the boundary cases where sorting is trivial. Notice that a sequential recursive process does not expand the subproblems simultaneously; instead, it solves the subproblems at the same recursion depth one by one.

Figure 3: The top-down dividing process of mergesort.

How do we combine the solutions to the two smaller subproblems to form a solution to the original problem? Let us consider the process of merging two sorted sequences into a sorted output sequence. For each merging sequence, we maintain a cursor pointing to its smallest element not yet included in the output sequence.


At each iteration, the smaller of these two smallest elements is removed from its merging sequence and added to the end of the output sequence. Once one merging sequence has been exhausted, the remainder of the other sequence is appended to the end of the output sequence. Figure 4 depicts the merging process. The merging sequences are ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. The smallest elements of the two merging sequences are 16 and 8. Since 8 is the smaller one, we remove it from its merging sequence and add it to the output sequence. Now the smallest elements of the two merging sequences are 16 and 12. We remove 12 from its merging sequence and append it to the output sequence. Then 16 and 36 are the smallest elements of the two merging sequences, so 16 is appended to the output sequence. Finally, the resulting output sequence is ⟨8, 12, 16, 25, 36, 65, 77, 85⟩. Let N and M be the lengths of the two merging sequences. Since the merging process scans the two merging sequences linearly, its running time is O(N + M) in total.

Figure 4: The merging process of mergesort.

After the top-down dividing process, mergesort accumulates the solutions in a bottom-up fashion by combining two smaller sorted sequences into a larger sorted sequence, as illustrated in Figure 5. In this example, the recursion depth is ⌈log_2 8⌉ = 3. At recursion depth 3, every single element is itself a sorted sequence. They are merged to form sorted sequences at recursion depth 2: ⟨16, 65⟩, ⟨25, 85⟩, ⟨8, 12⟩, and ⟨36, 77⟩. At recursion depth 1, they are further merged into two sorted sequences: ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. Finally, we merge these two sequences into one sorted sequence: ⟨8, 12, 16, 25, 36, 65, 77, 85⟩.

It can be easily shown that the recursion depth of mergesort is ⌈log_2 n⌉ for sorting n numbers, and the total time spent at each recursion depth is O(n). Thus, we conclude that mergesort sorts n numbers in O(n log n) time.


Figure 5: Accumulating the solutions in a bottom-up manner.
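The whole mergesort just described fits in a few lines of Python; the sketch below (the function names are ours) splits the input in half, sorts each half recursively, and merges the two sorted halves with two cursors.

def merge(left, right):
    """Merge two sorted lists in O(len(left) + len(right)) time."""
    output = []
    i = j = 0  # cursors to the smallest not-yet-used element of each list
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            output.append(left[i])
            i += 1
        else:
            output.append(right[j])
            j += 1
    output.extend(left[i:])   # one list is exhausted:
    output.extend(right[j:])  # append whatever remains of the other
    return output

def mergesort(a):
    """Sort a list of numbers into nondecreasing order."""
    if len(a) <= 1:            # boundary case: sorting is trivial
        return a
    mid = len(a) // 2          # divide ...
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))  # ... conquer and combine

print(mergesort([65, 16, 25, 85, 12, 8, 36, 77]))
# [8, 12, 16, 25, 36, 65, 77, 85]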

4 Dynamic Programming

Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. It is one of the major paradigms of algorithm design in computer science. As in linear programming, the word "programming" refers to finding an optimal plan of action, rather than writing programs. The word "dynamic" in this context conveys the idea that choices may depend on the current state, rather than being decided ahead of time.

Typically, dynamic programming is applied to optimization problems. In such problems, there exist many possible solutions. Each solution has a value, and we wish to find a solution with the optimum value. There are two ingredients for an optimization problem to be suitable for a dynamic-programming approach.

One is that it satisfies the principle of optimality, i.e., each solution substructure is optimal. Greedy algorithms require this very same ingredient, too. The other ingredient is that it has overlapping subproblems, which implies that the problem can be solved more efficiently if the solutions to the subproblems are recorded. If the subproblems are not overlapping, a divide-and-conquer approach is the natural choice.

The development of a dynamic-programming algorithm has three basic components: the recurrence relation for defining the value of an optimal solution, the tabular computation for computing the value of an optimal solution, and the backtracking procedure for delivering an optimal solution. Here we introduce these basic ideas by developing dynamic-programming solutions for problems from different application areas.

First of all, the Fibonacci numbers are used to demonstrate how a tabular computation can avoid recomputation. Then we use three classic problems, namely the maximum-sum segment problem, the longest increasing subsequence problem, and the longest common subsequence problem, to explain how dynamic-programming approaches can be used to solve sequence-related problems.

4.1 Fibonacci Numbers

The Fibonacci numbers were first described by Leonardo Fibonacci in 1202. It is a simple sequence, but its applications are nearly everywhere in nature, and it has fascinated mathematicians for more than 800 years. The Fibonacci numbers are defined by the following recurrence:

F_0 = 0,
F_1 = 1,
F_i = F_{i−1} + F_{i−2}   for i ≥ 2.

By definition, the sequence goes like this: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, and so forth. Given a positive integer n, how would you compute F_n? You might say that it can be easily solved by a straightforward divide-and-conquer method based on the recurrence. That's right. But is it efficient? Take the computation of F_10 for example (see Figure 6). By definition, F_10 is derived by adding up F_9 and F_8. What about the values of F_9 and F_8? Again, F_9 is derived by adding up F_8 and F_7; F_8 is derived by adding up F_7 and F_6. Working in this direction, we finally reach the values of F_1 and F_0, i.e., the end of the recursive calls. By adding them up backwards, we have the value of F_10. It can be shown that the number of recursive calls we have to make for computing F_n is exponential in n.

Those who are ignorant of history are doomed to repeat it. A major drawback of this divide-and-conquer approach is that it solves many of the subproblems repeatedly. A tabular method solves every subproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time the subproblem is encountered. Figure 7 shows that F_n can be computed in O(n) steps by a tabular computation. It should be noted that F_n can even be computed in just O(log n) steps by applying matrix computation.


Figure 6: Computing F_10 by divide-and-conquer.

0 1 1 2 3 5 8 13 21 34 55

Figure 7: Computing F_10 by a tabular computation.
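The contrast between the two approaches can be seen in a few lines of Python (the function names are ours): the first version re-solves the same subproblems over and over, while the tabular version computes each F_i exactly once.

def fib_recursive(n):
    """Straightforward divide-and-conquer; takes an exponential number of calls."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_tabular(n):
    """Tabular computation: each F_i is computed exactly once, O(n) steps in total."""
    F = [0, 1]
    for i in range(2, n + 1):
        F.append(F[i - 1] + F[i - 2])
    return F[n]

print(fib_tabular(10))   # 55, matching Figure 7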

4.2 The Maximum-Sum Segment Problem

Given a sequence of numbers A = ⟨a_1, a_2, . . . , a_n⟩, the maximum-sum segment problem is to find, in A, a consecutive subsequence, i.e., a substring or segment, with the maximum sum. For each position i, we can compute the maximum-sum segment ending at that position in O(i) time. Therefore, a naive algorithm runs in ∑_{i=1}^{n} O(i) = O(n^2) time.

Now let us describe a more efficient dynamic-programming algorithm for this problem. Define S[i] to be the maximum sum of the segments ending at position i of A. The value S[i] can be computed by the following recurrence:

S[i] = a_i + max{S[i − 1], 0}   if i > 1,
S[i] = a_1                      if i = 1.

If S[i − 1] < 0, concatenating a_i with its previous elements will give a smaller sum than a_i itself. In this case, the maximum-sum segment ending at position i is a_i itself.

By a tabular computation, each S[i] can be computed in constant time for i from 1 to n; therefore all S values can be computed in O(n) time. During the computation, we record the largest S entry computed so far in order to report where the maximum-sum segment ends. We also record the traceback information for each position i so that we can trace back from the end position of the maximum-sum segment to its start position. If S[i − 1] > 0, we need to concatenate with the previous elements for a larger sum; therefore the traceback symbol for position i is "←."

Otherwise, "↑" is recorded. Once we have computed all S values, the traceback information is used to construct the maximum-sum segment by starting from the largest S entry and following the arrows until a "↑" is reached. For example, in Figure 8, A = ⟨3, 2, −6, 5, 2, −3, 6, −4, 2⟩. By computing from i = 1 to i = n, we have S = ⟨3, 5, −1, 5, 7, 4, 10, 6, 8⟩. The maximum S entry is S[7], whose value is 10. By backtracking from S[7], we conclude that the maximum-sum segment of A is ⟨5, 2, −3, 6⟩, whose sum is 10.

Figure 8: Finding a maximum-sum segment.
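The recurrence, the running maximum, and the traceback translate directly into code. Below is a Python sketch (ours; the names and the "<-"/"^" traceback symbols mirror the arrows described above) that returns the best sum together with the 1-based start and end positions of a maximum-sum segment.

def maximum_sum_segment(A):
    """Return (best sum, start, end) of a maximum-sum segment, 1-based positions."""
    n = len(A)
    S = [0] * (n + 1)        # S[i]: maximum sum of a segment ending at position i
    back = [None] * (n + 1)  # traceback: "<-" extends the previous segment, "^" starts anew
    best = 1                 # position of the largest S entry seen so far
    for i in range(1, n + 1):
        if i > 1 and S[i - 1] > 0:
            S[i] = A[i - 1] + S[i - 1]
            back[i] = "<-"
        else:
            S[i] = A[i - 1]
            back[i] = "^"
        if S[i] > S[best]:
            best = i
    start = best
    while back[start] == "<-":  # follow the arrows back to the segment's start
        start -= 1
    return S[best], start, best

print(maximum_sum_segment([3, 2, -6, 5, 2, -3, 6, -4, 2]))
# (10, 4, 7): the segment <5, 2, -3, 6> of Figure 8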

Let the prefix sum P[i] = ∑_{j=1}^{i} a_j be the sum of the first i elements. It can be easily seen that ∑_{k=i}^{j} a_k = P[j] − P[i − 1]. Therefore, if we wish to compute the maximum-sum segment ending at a given position, we could just look for the minimum prefix sum ahead of that position. This yields another linear-time algorithm for the maximum-sum segment problem.
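The prefix-sum observation gives an equally short alternative; the sketch below (ours) keeps the minimum prefix sum seen so far and the best difference P[j] − P[i − 1] found so far.

def maximum_sum_by_prefix_sums(A):
    """Linear-time variant: best segment sum = max over j of P[j] - min_{i<j} P[i]."""
    best = A[0]
    prefix = 0       # P[j] as we scan
    min_prefix = 0   # smallest prefix sum over positions strictly before the current one
    for a in A:
        prefix += a
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

print(maximum_sum_by_prefix_sums([3, 2, -6, 5, 2, -3, 6, -4, 2]))  # 10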

4.3 Longest Increasing Subsequences

Given a sequence of numbers A = ⟨a_1, a_2, . . . , a_n⟩, the longest increasing subsequence problem is to find an increasing subsequence in A whose length is maximum. Without loss of generality, we assume that these numbers are distinct. Formally speaking, given a sequence of distinct real numbers A = ⟨a_1, a_2, . . . , a_n⟩, a sequence B = ⟨b_1, b_2, . . . , b_k⟩ is said to be a subsequence of A if there exists a strictly increasing sequence ⟨i_1, i_2, . . . , i_k⟩ of indices of A such that for all j = 1, 2, . . . , k, we have a_{i_j} = b_j. In other words, B is obtained by deleting zero or more elements from A. We say that the subsequence B is increasing if b_1 < b_2 < · · · < b_k. The longest increasing subsequence problem is to find a maximum-length increasing subsequence of A.

For example, suppose A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩; both ⟨2, 3, 6⟩ and ⟨2, 7, 9, 10⟩ are increasing subsequences of A, whereas ⟨8, 7, 9⟩ (not increasing) and ⟨2, 3, 5, 7⟩ (not a subsequence) are not.

Note that we may have more than one longest increasing subsequence, so we use "a longest increasing subsequence" instead of "the longest increasing subsequence." Let L[i] be the length of a longest increasing subsequence ending at position i. These values can be computed by the following recurrence:

L[i] = 1 + max_{j=0,...,i−1} {L[j] | a_j < a_i}   if i > 0,
L[i] = 0                                          if i = 0.

Here we assume that a_0 is a dummy element smaller than any element in A, and L[0] is equal to 0. By tabular computation for every i from 1 to n, each L[i] can be computed in O(i) steps. Therefore, they require ∑_{i=1}^{n} O(i) = O(n^2) steps in total. For each position i, we use an array P to record the index of the best previous element for the current element to concatenate with. By tracing back from the element with the largest L value, we derive a longest increasing subsequence.

Figure 9 illustrates the process of finding a longest increasing subsequence of A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩. Take i = 4 for instance, where a_4 = 7. Its previous smaller elements are a_1 and a_3, both with L value equal to 1. Therefore, we have L[4] = L[1] + 1 = 2, meaning that a longest increasing subsequence ending at position 4 is of length 2. Indeed, both ⟨a_1, a_4⟩ and ⟨a_3, a_4⟩ are increasing subsequences ending at position 4. In order to trace back the solution, we use the array P to record which entry contributes the maximum to the current L value. Thus, P[4] can be 1 (standing for a_1) or 3 (standing for a_3).

Once we have computed all L and P values, the maximum L value is the length of a longest increasing subsequence of A. In this example, L[9] = 5 is the maximum. Tracing back from position 9 by the array P, we find a longest increasing subsequence ⟨a_3, a_5, a_6, a_7, a_9⟩, i.e., ⟨2, 3, 6, 9, 10⟩.
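A direct Python transcription of this quadratic algorithm might look as follows (the function name is ours); L holds the lengths and P the backtracking indices described above.

def lis_quadratic(A):
    """Return one longest increasing subsequence of A in O(n^2) time."""
    n = len(A)
    L = [1] * n    # L[i]: length of a longest increasing subsequence ending at i
    P = [-1] * n   # P[i]: index of the previous element on that subsequence
    for i in range(n):
        for j in range(i):
            if A[j] < A[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
                P[i] = j
    end = max(range(n), key=lambda i: L[i])   # position of the largest L value
    subseq = []
    while end != -1:                          # trace back through the predecessors
        subseq.append(A[end])
        end = P[end]
    return subseq[::-1]

print(lis_quadratic([4, 8, 2, 7, 3, 6, 9, 1, 10, 5]))
# [2, 3, 6, 9, 10], as in Figure 9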

In the following, we briefly describe a more efficient dynamic-programming algorithm for delivering a longest increasing subsequence. A crucial observation is that it suffices to store only the smallest ending elements for all possible lengths of the increasing subsequences.


Figure 9: An O(n^2)-time algorithm for finding a longest increasing subsequence.

For example, in Figure 9, there are three entries whose L value is 2, namely a_2 = 8, a_4 = 7, and a_5 = 3, where a_5 is the smallest. Any element after position 5 that is larger than a_2 or a_4 is also larger than a_5. Therefore, a_5 can take over the roles of a_2 and a_4 after position 5.

Let SmallestEnd[k] denote the smallest ending element of all possible increasing subsequences of length k ending before the current position i. The algorithm proceeds for i from 1 to n. How do we update SmallestEnd[k] when we consider a_i? By definition, it is easy to see that the elements in SmallestEnd are in increasing order. In fact, a_i will affect only one entry in SmallestEnd. If a_i is larger than all the elements in SmallestEnd, then we can concatenate a_i to the longest increasing subsequence computed so far. That is, one more entry is added to the end of SmallestEnd. A backtracking pointer is recorded, pointing to the previous last element of SmallestEnd. Otherwise, let SmallestEnd[k′] be the smallest element that is larger than a_i. We replace SmallestEnd[k′] by a_i because now we have a smaller ending element for an increasing subsequence of length k′.

Since SmallestEnd is a sorted array, the above process can be done by a binary search. A binary search compares the query element with the middle element of the sorted array; if the query element is larger, it searches the larger half recursively, and otherwise it searches the smaller half recursively. Either way, the size of the search space shrinks by a factor of two. At position i, the size of SmallestEnd is at most i. Therefore, for each position i, it takes O(log i) time to determine the appropriate entry to be updated by a_i. In total, this gives an O(n log n)-time algorithm for the longest increasing subsequence problem.
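Using Python's bisect module for the binary search, the O(n log n) method can be sketched as follows (the names are ours); SmallestEnd is kept both as values and as positions in A so that the backtracking pointers P can be recovered.

from bisect import bisect_left

def lis_nlogn(A):
    """Return one longest increasing subsequence of A in O(n log n) time."""
    ends = []        # ends[k-1]: value of SmallestEnd[k]
    ends_idx = []    # index in A of that ending element
    P = [-1] * len(A)                     # backtracking pointer for each position
    for i, a in enumerate(A):
        k = bisect_left(ends, a)          # first entry of SmallestEnd that is >= a
        if k == len(ends):                # a extends the longest subsequence so far
            ends.append(a)
            ends_idx.append(i)
        else:                             # a becomes the new, smaller ending element
            ends[k] = a
            ends_idx[k] = i
        P[i] = ends_idx[k - 1] if k > 0 else -1
    # Trace back from the ending element of a longest increasing subsequence.
    subseq, i = [], ends_idx[-1]
    while i != -1:
        subseq.append(A[i])
        i = P[i]
    return subseq[::-1]

print(lis_nlogn([4, 8, 2, 7, 3, 6, 9, 1, 10, 5]))
# [2, 3, 6, 9, 10], matching Figure 10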

Figure 10 illustrates the process of finding a longest increasing subsequence of A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩. When i = 1, there is only one increasing subsequence, i.e., ⟨4⟩. We have SmallestEnd[1] = 4. Since a_2 = 8 is larger than SmallestEnd[1], we create a new entry SmallestEnd[2] = 8 and set the backtracking pointer P[2] = 1, meaning that a_2 can be concatenated with a_1 to form an increasing subsequence ⟨4, 8⟩. When a_3 = 2 is encountered, its nearest larger element in SmallestEnd is SmallestEnd[1] = 4. We know that we now have an increasing subsequence ⟨2⟩ of length 1. So SmallestEnd[1] is changed from 4 to a_3 = 2 and P[3] = 0. When i = 4, we have SmallestEnd[1] = 2 < a_4 = 7 < SmallestEnd[2] = 8. By concatenating a_4 with SmallestEnd[1], we have a new increasing subsequence ⟨2, 7⟩ of length 2 whose ending element is smaller than 8. Thus, SmallestEnd[2] is changed from 8 to a_4 = 7 and P[4] = 3. We continue this way until we reach a_10. When a_10 is encountered, we have SmallestEnd[2] = 3 < a_10 = 5 < SmallestEnd[3] = 6. We set SmallestEnd[3] = a_10 = 5 and P[10] = 5. Now the largest element in SmallestEnd is SmallestEnd[5] = a_9 = 10. We can trace back from a_9 by the backtracking pointers P and deliver a longest increasing subsequence ⟨a_3, a_5, a_6, a_7, a_9⟩, i.e., ⟨2, 3, 6, 9, 10⟩.

Figure 10: An O(n log n)-time algorithm for finding a longest increasing subsequence.


4.4 Longest Common Subsequences

A subsequence of a sequence S is obtained by deleting zero or more elements from S. For example, ⟨P, R, E, D⟩, ⟨S, D, N⟩, and ⟨P, R, E, D, E, N, T⟩ are all subsequences of ⟨P, R, E, S, I, D, E, N, T⟩, whereas ⟨S, N, D⟩ and ⟨P, E, F⟩ are not.

Recall that, given two sequences, the longest common subsequence (LCS) problem is to find a subsequence that is common to both sequences and whose length is maximized. For example, given the two sequences ⟨P, R, E, S, I, D, E, N, T⟩ and ⟨P, R, O, V, I, D, E, N, C, E⟩, the sequence ⟨P, R, D, N⟩ is a common subsequence of them, whereas ⟨P, R, V⟩ is not. Their LCS is ⟨P, R, I, D, E, N⟩.

Now let us formulate the recurrence for computing the length of an LCS of two sequences. We are given two sequences A = ⟨a_1, a_2, . . . , a_m⟩ and B = ⟨b_1, b_2, . . . , b_n⟩. Let len[i, j] denote the length of an LCS between ⟨a_1, a_2, . . . , a_i⟩ (a prefix of A) and ⟨b_1, b_2, . . . , b_j⟩ (a prefix of B). These values can be computed by the following recurrence:

len[i, j] = 0                                   if i = 0 or j = 0,
len[i, j] = len[i − 1, j − 1] + 1               if i, j > 0 and a_i = b_j,
len[i, j] = max{len[i, j − 1], len[i − 1, j]}   otherwise.

In other words, if one of the sequences is empty, the length of their LCS is just zero. If a_i and b_j are the same, an LCS between ⟨a_1, a_2, . . . , a_i⟩ and ⟨b_1, b_2, . . . , b_j⟩ is the concatenation of an LCS of ⟨a_1, a_2, . . . , a_{i−1}⟩ and ⟨b_1, b_2, . . . , b_{j−1}⟩ with a_i. Therefore, len[i, j] = len[i − 1, j − 1] + 1 in this case. If a_i and b_j are different, their LCS is equal to either an LCS of ⟨a_1, a_2, . . . , a_i⟩ and ⟨b_1, b_2, . . . , b_{j−1}⟩, or that of ⟨a_1, a_2, . . . , a_{i−1}⟩ and ⟨b_1, b_2, . . . , b_j⟩. Its length is thus the maximum of len[i, j − 1] and len[i − 1, j].

Figure 11 gives the pseudo-code for computing len[i, j]. For each entry (i, j), we retain the backtracking information in prev[i, j]. If len[i − 1, j − 1] contributes the maximum value to len[i, j], then we set prev[i, j] = "↖." Otherwise prev[i, j] is set to "↑" or "←" depending on which of len[i − 1, j] and len[i, j − 1] contributes the maximum value to len[i, j]. Whenever there is a tie, either one will work. These arrows guide the backtracking process upon reaching the terminal entry (m, n).


Algorithm LCS LENGTH(A = ⟨a_1, a_2, . . . , a_m⟩, B = ⟨b_1, b_2, . . . , b_n⟩)
begin
    for i ← 0 to m do len[i, 0] ← 0
    for j ← 1 to n do len[0, j] ← 0
    for i ← 1 to m do
        for j ← 1 to n do
            if a_i = b_j then
                len[i, j] ← len[i − 1, j − 1] + 1
                prev[i, j] ← "↖"
            else if len[i − 1, j] ≥ len[i, j − 1] then
                len[i, j] ← len[i − 1, j]
                prev[i, j] ← "↑"
            else
                len[i, j] ← len[i, j − 1]
                prev[i, j] ← "←"
    return len and prev
end

Figure 11: Computation of the length of an LCS of two sequences.

Since the time spent on each entry is O(1), the total running time of algorithm LCS LENGTH is O(mn).

Figure 12 illustrates the tabular computation. The length of an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩ is 4.

Besides computing the length of an LCS of the whole sequences, Figure 12 in fact computes the length of an LCS between each pair of prefixes of the two sequences. For example, from this table we can also tell that the length of an LCS between ⟨A, L, G, O, R⟩ and ⟨A, L, I, G⟩ is 3.

Once algorithm LCS LENGTH reaches (m, n), the backtracking information retained in the array prev allows us to find out which common subsequence contributes len[m, n], the maximum length of an LCS of sequences A and B.


Figure 12: Tabular computation of the length of an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.

Figure 13 lists the pseudo-code for delivering an LCS. We trace back the dynamic-programming matrix from the entry (m, n) recursively, following the direction of the arrows. Whenever a diagonal arrow "↖" is encountered, we append the current matched letter to the end of the LCS under construction. Algorithm LCS OUTPUT takes O(m + n) time in total since each recursive call reduces the index i and/or j by one.
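For readers who prefer executable code to pseudo-code, here is a compact Python version combining LCS LENGTH and LCS OUTPUT (the code and names are ours, mirroring Figures 11 and 13): it fills the length table, records the arrows, and then traces back from entry (m, n).

def lcs(A, B):
    """Return one longest common subsequence of sequences A and B in O(mn) time."""
    m, n = len(A), len(B)
    length = [[0] * (n + 1) for _ in range(m + 1)]   # length[i][j]: LCS length of prefixes
    prev = [[None] * (n + 1) for _ in range(m + 1)]  # traceback arrows
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
                prev[i][j] = "diag"
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j] = length[i - 1][j]
                prev[i][j] = "up"
            else:
                length[i][j] = length[i][j - 1]
                prev[i][j] = "left"
    # Backtrack from (m, n), collecting a matched letter at every diagonal arrow.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if prev[i][j] == "diag":
            out.append(A[i - 1])
            i -= 1
            j -= 1
        elif prev[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs("ALGORITHM", "ALIGNMENT"))
# ['A', 'L', 'G', 'T'] (length 4), one LCS as in Figure 14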

Figure 14 backtracks the dynamic-programming matrix computed in Figure 12. It outputs ⟨A, L, G, T⟩ (the shaded entries) as an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.


Algorithm LCS OUTPUT(A = ⟨a_1, a_2, . . . , a_m⟩, prev, i, j)
begin
    if i = 0 or j = 0 then return
    if prev[i, j] = "↖" then
        LCS OUTPUT(A, prev, i − 1, j − 1)
        print a_i
    else if prev[i, j] = "↑" then LCS OUTPUT(A, prev, i − 1, j)
    else LCS OUTPUT(A, prev, i, j − 1)
end

Figure 13: Delivering an LCS.

Figure 14: Backtracking process for finding an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.
