
A Class Note on Basic Algorithmic Techniques

Kun-Mao Chao 1,2,3

1 Graduate Institute of Biomedical Electronics and Bioinformatics
2 Department of Computer Science and Information Engineering
3 Graduate Institute of Networking and Multimedia
National Taiwan University, Taipei, Taiwan 106

September 24, 2008

An algorithm is a step-by-step procedure for solving a problem by a computer.

Although the act of designing an algorithm is considered an art and can never be fully automated, its general strategies are learnable. Here we introduce a few frameworks of computer algorithms, including greedy algorithms, divide-and-conquer strategies, and dynamic programming.

This note starts with the definition of algorithms and their complexity in Section 1. We introduce the asymptotic O-notation used in the analysis of the running time and space of an algorithm. Two tables are used to demonstrate that the asymptotic complexity of an algorithm will ultimately determine the size of problems that can be solved by the algorithm.

Then, we introduce greedy algorithms in Section 2. For some optimization problems, greedy algorithms yield simple and efficient solutions. A greedy algorithm pursues the best choice at the moment in the hope that it will lead to the best solution in the end. It works quite well for a wide range of problems. Huffman's algorithm is used as an example of a greedy algorithm.

Section 3 describes another common algorithmic technique, called divide-and-conquer. This strategy divides the problem into smaller parts, conquers each part individually, and then combines them to form a solution for the whole. We use the mergesort algorithm to illustrate the divide-and-conquer algorithm design paradigm.

Following its introduction by Needleman and Wunsch, dynamic programming has become a major algorithmic strategy for many optimization problems in sequence comparison. The development of a dynamic-programming algorithm has three basic components: the recurrence relation for defining the value of an optimal solution, the tabular computation for computing the value of an optimal solution, and the backtracking procedure for delivering an optimal solution. In Section 4, we introduce these basic ideas by developing dynamic-programming solutions for problems from different application areas, including the maximum-sum segment problem, the longest increasing subsequence problem, and the longest common subsequence problem.

1 Algorithms and Their Complexity

An algorithm is a step-by-step procedure for solving a problem by a computer.

When an algorithm is executed by a computer, the central processing unit (CPU) performs the operations and the memory stores the program and data.

Let n be the size of the input, the output, or their sum. The time or space complexity of an algorithm is usually expressed as a function f(n). Table 1 lists the time needed when f(n) stands for the number of operations required by an algorithm, assuming that the CPU performs one million operations per second.

Exponential functions grow very quickly, so exponential-time algorithms become impractical even when n is small. Quadratic and cubic functions grow faster than linear ones; the constant and logarithmic factors matter, but are mostly acceptable in practice. As a rule of thumb, algorithms with a quadratic time complexity or higher are often impractical for large data sets.

Table 2 further shows the growth of the input size solvable by polynomial-time and exponential-time algorithms as computers improve. Even with a million-times faster computer, the 10^n algorithm only adds 6 to the solvable input size, which makes it hopeless for handling even a moderate-size input.

These observations lead to the definition of the O-notation, which is very useful in the analysis of algorithms. We say f(n) = O(g(n)) if and only if there exist two positive constants c and n_0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n_0. In other words, for sufficiently large n, f(n) can be bounded above by g(n) times a constant. In this kind of asymptotic analysis, the most crucial part is the order of the function, not the constant. For example, if f(n) = 3n^2 + 5n, we can say f(n) = O(n^2) by letting c = 4 and n_0 = 10.


Table 1: The time needed by the functions, assuming one million operations per second.

f(n)              n = 10           n = 100                    n = 100000
30n               0.0003 second    0.003 second               3 seconds
100 n log10 n     0.001 second     0.02 second                50 seconds
3n^2              0.0003 second    0.03 second                8.33 hours
n^3               0.001 second     1 second                   31.71 years
10^n              2.78 hours       3.17 × 10^84 centuries     3.17 × 10^99984 centuries

By definition, it is also correct to say n^2 = O(n^3), but we always prefer to choose a tighter order if possible. On the other hand, 10^n ≠ O(n^x) for any integer x. That is, an exponential function cannot be bounded above by any polynomial function.

Table 2: The growth of the input size solvable in an hour as the computer runs faster.

f(n)     Present speed    1000-times faster    10^6-times faster
n        x1               1000 x1              10^6 x1
n^2      x2               31.62 x2             10^3 x2
n^3      x3               10 x3                10^2 x3
10^n     x4               x4 + 3               x4 + 6
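As a quick sanity check on these tables, the following short Python sketch (ours, not part of the original note; the dictionary of functions and the names are illustrative) converts the operation counts of Table 1 into seconds under the one-million-operations-per-second assumption.

import math

OPS_PER_SECOND = 1_000_000  # the assumption used in Table 1

# The five functions of Table 1 (the dictionary and its labels are ours).
functions = {
    "30n":           lambda n: 30 * n,
    "100 n log10 n": lambda n: 100 * n * math.log10(n),
    "3n^2":          lambda n: 3 * n ** 2,
    "n^3":           lambda n: n ** 3,
    "10^n":          lambda n: 10 ** n,
}

for name, f in functions.items():
    for n in (10, 100):
        seconds = f(n) / OPS_PER_SECOND
        print(f"{name:>14}  n = {n:<4}  {seconds:g} seconds")

Running it for n = 10 and n = 100 reproduces the corresponding columns; the 10^n entry for n = 100000 is omitted here only because the resulting number of seconds overflows a floating-point number, which is exactly the point of Table 1.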

2 Greedy Algorithms

A greedy method works in stages. It always makes a locally optimal (greedy) choice at each stage. Once a choice has been made, it cannot be withdrawn, even if later we realize that it is a poor decision. In other words, this greedy choice may or may not lead to a globally optimal solution, depending on the characteristics of the problem.

It is a very straightforward algorithmic technique and has been used to solve a variety of problems. In some situations, it solves the problem exactly. In others, it has been proved to yield effective approximations.

What kinds of problems are suitable for a greedy solution? There are two ingredients for an optimization problem to be solved exactly by a greedy approach.

One is that it has the so-called greedy-choice property, meaning that a locally optimal choice can reach a globally optimal solution. The other is that it satisfies the principle of optimality, i.e., each solution substructure is optimal. We use Huffman coding, a frequency-dependent coding scheme, to illustrate the greedy approach.

2.1 Huffman Codes

Suppose we are given a very long DNA sequence where the occurrence probabilities of nucleotides A (adenine), C (cytosine), G (guanine), T (thymine) are 0.1, 0.1, 0.3, and 0.5, respectively. In order to store it in a computer, we need to transform it into a binary sequence, using only 0's and 1's. A trivial solution is to encode A, C, G, and T by "00," "01," "10," and "11," respectively. This representation requires two bits per nucleotide. The question is "Can we store the sequence in a more compressed way?" Fortunately, by assigning shorter codes to the frequent nucleotides G and T, and longer codes to the rare nucleotides A and C, it can be shown that fewer than two bits per nucleotide are required on average.

In 1952, Huffman proposed a greedy algorithm for building up an optimal way of representing each letter as a binary string. It works in two phases. In phase one, we build a binary tree based on the occurrence probabilities of the letters. To do so, we first write down all the letters, together with their associated probabilities.

They are initially the unmarked terminal nodes of the binary tree that we will build up as the algorithm proceeds. As long as there is more than one unmarked node left, we repeatedly find the two unmarked nodes with the smallest probabilities, mark them, create a new unmarked internal node with an edge to each of the nodes just marked, and set its probability as the sum of the probabilities of the two nodes.

The tree-building process is depicted in Figure 1. Initially, there are four unmarked nodes with probabilities 0.1, 0.1, 0.3, and 0.5. The two smallest ones have probabilities 0.1 and 0.1. Thus we mark these two nodes and create a new node with probability 0.2, connected to the two nodes just marked. Now we have three unmarked nodes with probabilities 0.2, 0.3, and 0.5. The two smallest ones have probabilities 0.2 and 0.3. They are marked, and a new node with probability 0.5 connecting them is created. The final iteration connects the only two unmarked nodes, with probabilities 0.5 and 0.5. Since there is only one unmarked node left, i.e., the root of the tree, we are done with the binary tree construction.

Figure 1: Building a binary tree based on the occurrence probabilities of the letters.

After the binary tree is built in phase one, the second phase is to assign the binary strings to the letters. Starting from the root, we recursively assign the value "zero" to the left edge and "one" to the right edge. Then for each leaf, i.e., letter, we concatenate the 0's and 1's on the path from the root to it to form its binary string representation. For example, in Figure 2 the resulting codewords for A, C, G, and T are "000," "001," "01," and "1," respectively. By this coding scheme, a 20-nucleotide DNA sequence "GTTGTTATCGTTTATGTGGC" will be represented as a 34-bit binary sequence "0111011100010010111100010110101001." In general, since 3 × 0.1 + 3 × 0.1 + 2 × 0.3 + 1 × 0.5 = 1.7, we conclude that, with Huffman coding, each nucleotide requires 1.7 bits on average, which is superior to the 2 bits of the trivial solution. Notice that in a Huffman code, no codeword is a prefix of any other codeword. Therefore we can decode a binary sequence without any ambiguity. For example, given "0111011100010010111100010110101001," we decode the binary sequence as "01" (G), "1" (T), "1" (T), "01" (G), and so forth.

The correctness of Huffman's algorithm lies in two properties: (1) the greedy-choice property and (2) the optimal-substructure property. It can be shown that there exists an optimal binary code in which the codewords for the two smallest-probability nodes have the same length and differ only in the last bit.


Figure 2: Huffman code assignment.

That is the reason why we can contract them greedily without missing the path to an optimal solution. Moreover, after the contraction, the optimal-substructure property allows us to consider only the remaining unmarked nodes.

Let n be the number of letters under consideration. For DNA, n is 4 and for English, n is 26. Since a heap can be used to maintain the minimum dynamically in O(log n) time for each insertion or deletion, the time complexity of Huffman’s algorithm is O(n log n).
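To make the two phases concrete, here is a minimal Python sketch of Huffman's algorithm using the standard heapq module (the function and variable names are ours, not part of the note). It repeatedly merges the two smallest-probability unmarked nodes, then assigns "0" to left edges and "1" to right edges.

import heapq

def huffman_codes(probs):
    """Build Huffman codewords for a dict mapping letters to probabilities."""
    # A heap entry is (probability, tie_breaker, tree); a tree is either a
    # letter or a pair (left_subtree, right_subtree).
    heap = [(p, i, letter) for i, (letter, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Greedy choice: merge the two nodes with the smallest probabilities.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tie, (t1, t2)))
        tie += 1
    # Phase two: assign "0" to left edges and "1" to right edges.
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # degenerate one-letter alphabet
    assign(heap[0][2], "")
    return codes

print(huffman_codes({"A": 0.1, "C": 0.1, "G": 0.3, "T": 0.5}))
# {'T': '0', 'A': '100', 'C': '101', 'G': '11'}

The exact 0/1 labels may differ from Figure 2 depending on how ties are broken, but the codeword lengths are again 1, 2, 3, and 3 bits for T, G, A, and C, so the average is the same 1.7 bits per nucleotide.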

3 Divide-and-Conquer Strategies

The divide-and-conquer strategy divides the problem into a number of smaller subproblems. If a subproblem is small enough, it conquers it directly. Otherwise, it conquers the subproblem recursively. Once the solutions to the subproblems are available, it combines them to form a solution to the original problem.

One of the well-known applications of the divide-and-conquer strategy is the design of sorting algorithms. We use mergesort to illustrate the divide-and-conquer algorithm design paradigm.


3.1 Mergesort

Given a sequence of n numbers ⟨a_1, a_2, . . . , a_n⟩, the sorting problem is to sort these numbers into a nondecreasing sequence. For example, if the given sequence is ⟨65, 16, 25, 85, 12, 8, 36, 77⟩, then its sorted sequence is ⟨8, 12, 16, 25, 36, 65, 77, 85⟩.

To sort a given sequence, mergesort splits the sequence in half, sorts each half recursively, and then combines the resulting two sorted sequences into one sorted sequence. Figure 3 illustrates the dividing process. The original input sequence consists of eight numbers. We first divide it into two smaller sequences, each consisting of four numbers. Then we divide each four-number sequence into two smaller sequences, each consisting of two numbers. Here we can sort the two numbers by comparing them directly, or divide them further into two smaller sequences, each consisting of only one number. Either way we will reach the boundary cases where sorting is trivial. Notice that a sequential recursive process does not expand the subproblems simultaneously; instead, it solves the subproblems at the same recursion depth one by one.

Figure 3: The top-down dividing process of mergesort.

How do we combine the solutions to the two smaller subproblems to form a solution to the original problem? Let us consider the process of merging two sorted sequences into a sorted output sequence. For each merging sequence, we maintain a cursor pointing to its smallest element not yet included in the output sequence.


At each iteration, the smaller of these two smallest elements is removed from its merging sequence and added to the end of the output sequence. Once one merging sequence has been exhausted, the remainder of the other sequence is appended to the end of the output sequence. Figure 4 depicts the merging process. The merging sequences are ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. The smallest elements of the two merging sequences are 16 and 8. Since 8 is the smaller one, we remove it from its merging sequence and add it to the output sequence. Now the smallest elements of the two merging sequences are 16 and 12. We remove 12 from its merging sequence and append it to the output sequence. Then 16 and 36 are the smallest elements of the two merging sequences, so 16 is appended to the output sequence. Finally, the resulting output sequence is ⟨8, 12, 16, 25, 36, 65, 77, 85⟩. Let N and M be the lengths of the two merging sequences. Since the merging process scans the two merging sequences linearly, its running time is O(N + M) in total.

Figure 4: The merging process of mergesort.

After the top-down dividing process, mergesort accumulates the solutions in a bottom-up fashion by combining two smaller sorted sequences into a larger sorted sequence, as illustrated in Figure 5. In this example, the recursion depth is ⌈log_2 8⌉ = 3. At recursion depth 3, every single element is itself a sorted sequence. They are merged to form sorted sequences at recursion depth 2: ⟨16, 65⟩, ⟨25, 85⟩, ⟨8, 12⟩, and ⟨36, 77⟩. At recursion depth 1, they are further merged into two sorted sequences: ⟨16, 25, 65, 85⟩ and ⟨8, 12, 36, 77⟩. Finally, we merge these two sequences into one sorted sequence: ⟨8, 12, 16, 25, 36, 65, 77, 85⟩.

It can be easily shown that the recursion depth of mergesort is ⌈log_2 n⌉ for sorting n numbers, and the total time spent at each recursion depth is O(n). Thus, we conclude that mergesort sorts n numbers in O(n log n) time.


Figure 5: Accumulating the solutions in a bottom-up manner.
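The whole mergesort just described fits in a few lines of Python; the sketch below (the function names are ours) splits the input in half, sorts each half recursively, and merges the two sorted halves with two cursors.

def merge(left, right):
    """Merge two sorted lists in O(len(left) + len(right)) time."""
    output = []
    i = j = 0  # cursors to the smallest not-yet-used element of each list
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            output.append(left[i])
            i += 1
        else:
            output.append(right[j])
            j += 1
    output.extend(left[i:])   # one list is exhausted:
    output.extend(right[j:])  # append whatever remains of the other
    return output

def mergesort(a):
    """Sort a list of numbers into nondecreasing order."""
    if len(a) <= 1:            # boundary case: sorting is trivial
        return a
    mid = len(a) // 2          # divide ...
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))  # ... conquer and combine

print(mergesort([65, 16, 25, 85, 12, 8, 36, 77]))
# [8, 12, 16, 25, 36, 65, 77, 85]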

4 Dynamic Programming

Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. It is one of the major paradigms of algorithm design in computer science. As in linear programming, the word "programming" refers to finding an optimal plan of action, rather than writing programs. The word "dynamic" in this context conveys the idea that choices may depend on the current state, rather than being decided ahead of time.

Typically, dynamic programming is applied to optimization problems. In such problems, there exist many possible solutions. Each solution has a value, and we wish to find a solution with the optimum value. There are two ingredients for an optimization problem to be suitable for a dynamic-programming approach.

One is that it satisfies the principle of optimality, i.e., each solution substructure is optimal. Greedy algorithms require this very same ingredient, too. The other ingredient is that it has overlapping subproblems, which implies that the problem can be solved more efficiently if the solutions to the subproblems are recorded. If the subproblems are not overlapping, a divide-and-conquer approach is the natural choice.

The development of a dynamic-programming algorithm has three basic components: the recurrence relation for defining the value of an optimal solution, the tabular computation for computing the value of an optimal solution, and the backtracking procedure for delivering an optimal solution. Here we introduce these basic ideas by developing dynamic-programming solutions for problems from different application areas.

First of all, the Fibonacci numbers are used to demonstrate how a tabular computation can avoid recomputation. Then we use three classic problems, namely the maximum-sum segment problem, the longest increasing subsequence problem, and the longest common subsequence problem, to explain how dynamic-programming approaches can be used to solve sequence-related problems.

4.1 Fibonacci Numbers

The Fibonacci numbers were first described by Leonardo Fibonacci in 1202. It is a simple sequence, but its applications are nearly everywhere in nature, and it has fascinated mathematicians for more than 800 years. The Fibonacci numbers are defined by the following recurrence:

F_0 = 0,
F_1 = 1,
F_i = F_{i−1} + F_{i−2}   for i ≥ 2.

By definition, the sequence goes like this: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, and so forth. Given a positive integer n, how would you compute F_n? You might say that it can be easily solved by a straightforward divide-and-conquer method based on the recurrence. That's right. But is it efficient? Take the computation of F_10 for example (see Figure 6). By definition, F_10 is derived by adding up F_9 and F_8. What about the values of F_9 and F_8? Again, F_9 is derived by adding up F_8 and F_7; F_8 is derived by adding up F_7 and F_6. Working in this direction, we finally reach the values of F_1 and F_0, i.e., the end of the recursive calls. By adding them up backwards, we have the value of F_10. It can be shown that the number of recursive calls we have to make for computing F_n is exponential in n.

Those who are ignorant of history are doomed to repeat it. A major drawback of this divide-and-conquer approach is that it solves many of the subproblems repeatedly. A tabular method solves every subproblem just once and then saves its answer in a table, thereby avoiding the work of recomputing the answer every time the subproblem is encountered. Figure 7 shows that F_n can be computed in O(n) steps by a tabular computation. It should be noted that F_n can even be computed in just O(log n) steps by applying matrix computation.


Figure 6: Computing F_10 by divide-and-conquer.

0 1 1 2 3 5 8 13 21 34 55

Figure 7: Computing F_10 by a tabular computation.
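The contrast between the two approaches can be seen in a few lines of Python (the function names are ours): the first version re-solves the same subproblems over and over, while the tabular version computes each F_i exactly once.

def fib_recursive(n):
    """Straightforward divide-and-conquer; takes an exponential number of calls."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_tabular(n):
    """Tabular computation: each F_i is computed exactly once, O(n) steps in total."""
    F = [0, 1]
    for i in range(2, n + 1):
        F.append(F[i - 1] + F[i - 2])
    return F[n]

print(fib_tabular(10))   # 55, matching Figure 7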

4.2 The Maximum-Sum Segment Problem

Given a sequence of numbers A = ⟨a_1, a_2, . . . , a_n⟩, the maximum-sum segment problem is to find, in A, a consecutive subsequence, i.e., a substring or segment, with the maximum sum. For each position i, we can compute the maximum-sum segment ending at that position in O(i) time. Therefore, a naive algorithm runs in ∑_{i=1}^{n} O(i) = O(n^2) time.

Now let us describe a more efficient dynamic-programming algorithm for this problem. Define S[i] to be the maximum sum of the segments ending at position i of A. The value S[i] can be computed by the following recurrence:

S[i] = a_i + max{S[i − 1], 0}   if i > 1,
S[i] = a_1                      if i = 1.

If S[i − 1] < 0, concatenating a_i with its previous elements will give a smaller sum than a_i itself. In this case, the maximum-sum segment ending at position i is a_i itself.

By a tabular computation, each S[i] can be computed in constant time for i from 1 to n; therefore all S values can be computed in O(n) time. During the computation, we record the largest S entry computed so far in order to report where the maximum-sum segment ends. We also record the traceback information for each position i so that we can trace back from the end position of the maximum-sum segment to its start position. If S[i − 1] > 0, we need to concatenate with the previous elements for a larger sum; therefore the traceback symbol for position i is "←."

Otherwise, "↑" is recorded. Once we have computed all S values, the traceback information is used to construct the maximum-sum segment by starting from the largest S entry and following the arrows until a "↑" is reached. For example, in Figure 8, A = ⟨3, 2, −6, 5, 2, −3, 6, −4, 2⟩. By computing from i = 1 to i = n, we have S = ⟨3, 5, −1, 5, 7, 4, 10, 6, 8⟩. The maximum S entry is S[7], whose value is 10. By backtracking from S[7], we conclude that the maximum-sum segment of A is ⟨5, 2, −3, 6⟩, whose sum is 10.

Figure 8: Finding a maximum-sum segment.
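The recurrence, the running maximum, and the traceback translate directly into code. Below is a Python sketch (ours; the names and the "<-"/"^" traceback symbols mirror the arrows described above) that returns the best sum together with the 1-based start and end positions of a maximum-sum segment.

def maximum_sum_segment(A):
    """Return (best sum, start, end) of a maximum-sum segment, 1-based positions."""
    n = len(A)
    S = [0] * (n + 1)        # S[i]: maximum sum of a segment ending at position i
    back = [None] * (n + 1)  # traceback: "<-" extends the previous segment, "^" starts anew
    best = 1                 # position of the largest S entry seen so far
    for i in range(1, n + 1):
        if i > 1 and S[i - 1] > 0:
            S[i] = A[i - 1] + S[i - 1]
            back[i] = "<-"
        else:
            S[i] = A[i - 1]
            back[i] = "^"
        if S[i] > S[best]:
            best = i
    start = best
    while back[start] == "<-":  # follow the arrows back to the segment's start
        start -= 1
    return S[best], start, best

print(maximum_sum_segment([3, 2, -6, 5, 2, -3, 6, -4, 2]))
# (10, 4, 7): the segment <5, 2, -3, 6> of Figure 8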

Let the prefix sum P[i] = ∑_{j=1}^{i} a_j be the sum of the first i elements. It can be easily seen that ∑_{k=i}^{j} a_k = P[j] − P[i − 1]. Therefore, if we wish to compute the maximum-sum segment ending at a given position, we could just look for the minimum prefix sum ahead of that position. This yields another linear-time algorithm for the maximum-sum segment problem.
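The prefix-sum observation gives an equally short alternative; the sketch below (ours) keeps the minimum prefix sum seen so far and the best difference P[j] − P[i − 1] found so far.

def maximum_sum_by_prefix_sums(A):
    """Linear-time variant: best segment sum = max over j of P[j] - min_{i<j} P[i]."""
    best = A[0]
    prefix = 0       # P[j] as we scan
    min_prefix = 0   # smallest prefix sum over positions strictly before the current one
    for a in A:
        prefix += a
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

print(maximum_sum_by_prefix_sums([3, 2, -6, 5, 2, -3, 6, -4, 2]))  # 10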

4.3 Longest Increasing Subsequences

Given a sequence of numbers A = ⟨a_1, a_2, . . . , a_n⟩, the longest increasing subsequence problem is to find an increasing subsequence in A whose length is maximum. Without loss of generality, we assume that these numbers are distinct. Formally speaking, given a sequence of distinct real numbers A = ⟨a_1, a_2, . . . , a_n⟩, a sequence B = ⟨b_1, b_2, . . . , b_k⟩ is said to be a subsequence of A if there exists a strictly increasing sequence ⟨i_1, i_2, . . . , i_k⟩ of indices of A such that for all j = 1, 2, . . . , k, we have a_{i_j} = b_j. In other words, B is obtained by deleting zero or more elements from A. We say that the subsequence B is increasing if b_1 < b_2 < · · · < b_k. The longest increasing subsequence problem is to find a maximum-length increasing subsequence of A.

For example, suppose A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩; both ⟨2, 3, 6⟩ and ⟨2, 7, 9, 10⟩ are increasing subsequences of A, whereas ⟨8, 7, 9⟩ (not increasing) and ⟨2, 3, 5, 7⟩ (not a subsequence) are not.

Note that we may have more than one longest increasing subsequence, so we use "a longest increasing subsequence" instead of "the longest increasing subsequence." Let L[i] be the length of a longest increasing subsequence ending at position i. These values can be computed by the following recurrence:

L[i] = 1 + max_{j=0,...,i−1} {L[j] | a_j < a_i}   if i > 0,
L[i] = 0                                          if i = 0.

Here we assume that a_0 is a dummy element smaller than any element in A, and L[0] is equal to 0. By tabular computation for every i from 1 to n, each L[i] can be computed in O(i) steps. Therefore, they require ∑_{i=1}^{n} O(i) = O(n^2) steps in total. For each position i, we use an array P to record the index of the best previous element for the current element to concatenate with. By tracing back from the element with the largest L value, we derive a longest increasing subsequence.

Figure 9 illustrates the process of finding a longest increasing subsequence of A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩. Take i = 4 for instance, where a_4 = 7. Its previous smaller elements are a_1 and a_3, both with L value equal to 1. Therefore, we have L[4] = L[1] + 1 = 2, meaning that a longest increasing subsequence ending at position 4 is of length 2. Indeed, both ⟨a_1, a_4⟩ and ⟨a_3, a_4⟩ are increasing subsequences ending at position 4. In order to trace back the solution, we use the array P to record which entry contributes the maximum to the current L value. Thus, P[4] can be 1 (standing for a_1) or 3 (standing for a_3).

Once we have computed all L and P values, the maximum L value is the length of a longest increasing subsequence of A. In this example, L[9] = 5 is the maximum. Tracing back from position 9 by the array P, we find a longest increasing subsequence ⟨a_3, a_5, a_6, a_7, a_9⟩, i.e., ⟨2, 3, 6, 9, 10⟩.
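A direct Python transcription of this quadratic algorithm might look as follows (the function name is ours); L holds the lengths and P the backtracking indices described above.

def lis_quadratic(A):
    """Return one longest increasing subsequence of A in O(n^2) time."""
    n = len(A)
    L = [1] * n    # L[i]: length of a longest increasing subsequence ending at i
    P = [-1] * n   # P[i]: index of the previous element on that subsequence
    for i in range(n):
        for j in range(i):
            if A[j] < A[i] and L[j] + 1 > L[i]:
                L[i] = L[j] + 1
                P[i] = j
    end = max(range(n), key=lambda i: L[i])   # position of the largest L value
    subseq = []
    while end != -1:                          # trace back through the predecessors
        subseq.append(A[end])
        end = P[end]
    return subseq[::-1]

print(lis_quadratic([4, 8, 2, 7, 3, 6, 9, 1, 10, 5]))
# [2, 3, 6, 9, 10], as in Figure 9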

In the following, we briefly describe a more efficient dynamic-programming algorithm for delivering a longest increasing subsequence. A crucial observation is that it suffices to store only the smallest ending elements for all possible lengths of the increasing subsequences.


Figure 9: An O(n^2)-time algorithm for finding a longest increasing subsequence.

For example, in Figure 9, there are three entries whose L value is 2, namely a_2 = 8, a_4 = 7, and a_5 = 3, where a_5 is the smallest. Any element after position 5 that is larger than a_2 or a_4 is also larger than a_5. Therefore, a_5 can take over the roles of a_2 and a_4 after position 5.

Let SmallestEnd[k] denote the smallest ending element of all possible increasing subsequences of length k ending before the current position i. The algorithm proceeds for i from 1 to n. How do we update SmallestEnd[k] when we consider a_i? By definition, it is easy to see that the elements in SmallestEnd are in increasing order. In fact, a_i will affect only one entry in SmallestEnd. If a_i is larger than all the elements in SmallestEnd, then we can concatenate a_i to the longest increasing subsequence computed so far. That is, one more entry is added to the end of SmallestEnd. A backtracking pointer is recorded, pointing to the previous last element of SmallestEnd. Otherwise, let SmallestEnd[k′] be the smallest element that is larger than a_i. We replace SmallestEnd[k′] by a_i because now we have a smaller ending element for an increasing subsequence of length k′.

Since SmallestEnd is a sorted array, the above process can be done by a binary search. A binary search compares the query element with the middle element of the sorted array; if the query element is larger, it searches the larger half recursively, and otherwise it searches the smaller half recursively. Either way, the size of the search space shrinks by a factor of two. At position i, the size of SmallestEnd is at most i. Therefore, for each position i, it takes O(log i) time to determine the appropriate entry to be updated by a_i. In total, this gives an O(n log n)-time algorithm for the longest increasing subsequence problem.
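Using Python's bisect module for the binary search, the O(n log n) method can be sketched as follows (the names are ours); SmallestEnd is kept both as values and as positions in A so that the backtracking pointers P can be recovered.

from bisect import bisect_left

def lis_nlogn(A):
    """Return one longest increasing subsequence of A in O(n log n) time."""
    ends = []        # ends[k-1]: value of SmallestEnd[k]
    ends_idx = []    # index in A of that ending element
    P = [-1] * len(A)                     # backtracking pointer for each position
    for i, a in enumerate(A):
        k = bisect_left(ends, a)          # first entry of SmallestEnd that is >= a
        if k == len(ends):                # a extends the longest subsequence so far
            ends.append(a)
            ends_idx.append(i)
        else:                             # a becomes the new, smaller ending element
            ends[k] = a
            ends_idx[k] = i
        P[i] = ends_idx[k - 1] if k > 0 else -1
    # Trace back from the ending element of a longest increasing subsequence.
    subseq, i = [], ends_idx[-1]
    while i != -1:
        subseq.append(A[i])
        i = P[i]
    return subseq[::-1]

print(lis_nlogn([4, 8, 2, 7, 3, 6, 9, 1, 10, 5]))
# [2, 3, 6, 9, 10], matching Figure 10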

Figure 10 illustrates the process of finding a longest increasing subsequence of A = ⟨4, 8, 2, 7, 3, 6, 9, 1, 10, 5⟩. When i = 1, there is only one increasing subsequence, i.e., ⟨4⟩. We have SmallestEnd[1] = 4. Since a_2 = 8 is larger than SmallestEnd[1], we create a new entry SmallestEnd[2] = 8 and set the backtracking pointer P[2] = 1, meaning that a_2 can be concatenated with a_1 to form an increasing subsequence ⟨4, 8⟩. When a_3 = 2 is encountered, its nearest larger element in SmallestEnd is SmallestEnd[1] = 4. We know that we now have an increasing subsequence ⟨2⟩ of length 1. So SmallestEnd[1] is changed from 4 to a_3 = 2 and P[3] = 0. When i = 4, we have SmallestEnd[1] = 2 < a_4 = 7 < SmallestEnd[2] = 8. By concatenating a_4 with SmallestEnd[1], we have a new increasing subsequence ⟨2, 7⟩ of length 2 whose ending element is smaller than 8. Thus, SmallestEnd[2] is changed from 8 to a_4 = 7 and P[4] = 3. We continue this way until we reach a_10. When a_10 is encountered, we have SmallestEnd[2] = 3 < a_10 = 5 < SmallestEnd[3] = 6. We set SmallestEnd[3] = a_10 = 5 and P[10] = 5. Now the largest element in SmallestEnd is SmallestEnd[5] = a_9 = 10. We can trace back from a_9 by the backtracking pointers P and deliver a longest increasing subsequence ⟨a_3, a_5, a_6, a_7, a_9⟩, i.e., ⟨2, 3, 6, 9, 10⟩.

Figure 10: An O(n log n)-time algorithm for finding a longest increasing subsequence.


4.4 Longest Common Subsequences

A subsequence of a sequence S is obtained by deleting zero or more elements from S. For example, ⟨P, R, E, D⟩, ⟨S, D, N⟩, and ⟨P, R, E, D, E, N, T⟩ are all subsequences of ⟨P, R, E, S, I, D, E, N, T⟩, whereas ⟨S, N, D⟩ and ⟨P, E, F⟩ are not.

Recall that, given two sequences, the longest common subsequence (LCS) problem is to find a subsequence that is common to both sequences and whose length is maximized. For example, given the two sequences ⟨P, R, E, S, I, D, E, N, T⟩ and ⟨P, R, O, V, I, D, E, N, C, E⟩, the sequence ⟨P, R, D, N⟩ is a common subsequence of them, whereas ⟨P, R, V⟩ is not. Their LCS is ⟨P, R, I, D, E, N⟩.

Now let us formulate the recurrence for computing the length of an LCS of two sequences. We are given two sequences A = ⟨a_1, a_2, . . . , a_m⟩ and B = ⟨b_1, b_2, . . . , b_n⟩. Let len[i, j] denote the length of an LCS between ⟨a_1, a_2, . . . , a_i⟩ (a prefix of A) and ⟨b_1, b_2, . . . , b_j⟩ (a prefix of B). These values can be computed by the following recurrence:

len[i, j] = 0                                   if i = 0 or j = 0,
len[i, j] = len[i − 1, j − 1] + 1               if i, j > 0 and a_i = b_j,
len[i, j] = max{len[i, j − 1], len[i − 1, j]}   otherwise.

In other words, if one of the sequences is empty, the length of their LCS is just zero. If a_i and b_j are the same, an LCS between ⟨a_1, a_2, . . . , a_i⟩ and ⟨b_1, b_2, . . . , b_j⟩ is the concatenation of an LCS of ⟨a_1, a_2, . . . , a_{i−1}⟩ and ⟨b_1, b_2, . . . , b_{j−1}⟩ with a_i. Therefore, len[i, j] = len[i − 1, j − 1] + 1 in this case. If a_i and b_j are different, their LCS is equal to either an LCS of ⟨a_1, a_2, . . . , a_i⟩ and ⟨b_1, b_2, . . . , b_{j−1}⟩, or that of ⟨a_1, a_2, . . . , a_{i−1}⟩ and ⟨b_1, b_2, . . . , b_j⟩. Its length is thus the maximum of len[i, j − 1] and len[i − 1, j].

Figure 11 gives the pseudo-code for computing len[i, j]. For each entry (i, j), we retain the backtracking information in prev[i, j]. If len[i − 1, j − 1] contributes the maximum value to len[i, j], then we set prev[i, j] = "↖." Otherwise prev[i, j] is set to "↑" or "←" depending on which of len[i − 1, j] and len[i, j − 1] contributes the maximum value to len[i, j]. Whenever there is a tie, either one will work. These arrows guide the backtracking process upon reaching the terminal entry (m, n).


Algorithm LCS LENGTH(A = ⟨a_1, a_2, . . . , a_m⟩, B = ⟨b_1, b_2, . . . , b_n⟩)
begin
    for i ← 0 to m do len[i, 0] ← 0
    for j ← 1 to n do len[0, j] ← 0
    for i ← 1 to m do
        for j ← 1 to n do
            if a_i = b_j then
                len[i, j] ← len[i − 1, j − 1] + 1
                prev[i, j] ← "↖"
            else if len[i − 1, j] ≥ len[i, j − 1] then
                len[i, j] ← len[i − 1, j]
                prev[i, j] ← "↑"
            else
                len[i, j] ← len[i, j − 1]
                prev[i, j] ← "←"
    return len and prev
end

Figure 11: Computation of the length of an LCS of two sequences.

Since the time spent on each entry is O(1), the total running time of algorithm LCS LENGTH is O(mn).

Figure 12 illustrates the tabular computation. The length of an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩ is 4.

Besides computing the length of an LCS of the whole sequences, Figure 12 in fact computes the length of an LCS between each pair of prefixes of the two sequences. For example, from this table we can also tell that the length of an LCS between ⟨A, L, G, O, R⟩ and ⟨A, L, I, G⟩ is 3.

Once algorithm LCS LENGTH reaches (m, n), the backtracking information retained in the array prev allows us to find out which common subsequence contributes len[m, n], the maximum length of an LCS of sequences A and B.


Figure 12: Tabular computation of the length of an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.

Figure 13 lists the pseudo-code for delivering an LCS. We trace back the dynamic-programming matrix from the entry (m, n) recursively, following the direction of the arrows. Whenever a diagonal arrow "↖" is encountered, we append the current matched letter to the end of the LCS under construction. Algorithm LCS OUTPUT takes O(m + n) time in total since each recursive call reduces the index i and/or j by one.
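For readers who prefer executable code to pseudo-code, here is a compact Python version combining LCS LENGTH and LCS OUTPUT (the code and names are ours, mirroring Figures 11 and 13): it fills the length table, records the arrows, and then traces back from entry (m, n).

def lcs(A, B):
    """Return one longest common subsequence of sequences A and B in O(mn) time."""
    m, n = len(A), len(B)
    length = [[0] * (n + 1) for _ in range(m + 1)]   # length[i][j]: LCS length of prefixes
    prev = [[None] * (n + 1) for _ in range(m + 1)]  # traceback arrows
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
                prev[i][j] = "diag"
            elif length[i - 1][j] >= length[i][j - 1]:
                length[i][j] = length[i - 1][j]
                prev[i][j] = "up"
            else:
                length[i][j] = length[i][j - 1]
                prev[i][j] = "left"
    # Backtrack from (m, n), collecting a matched letter at every diagonal arrow.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if prev[i][j] == "diag":
            out.append(A[i - 1])
            i -= 1
            j -= 1
        elif prev[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs("ALGORITHM", "ALIGNMENT"))
# ['A', 'L', 'G', 'T'] (length 4), one LCS as in Figure 14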

Figure 14 backtracks the dynamic-programming matrix computed in Figure 12. It outputs ⟨A, L, G, T⟩ (the shaded entries) as an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.


Algorithm LCS OUTPUT(A = ⟨a_1, a_2, . . . , a_m⟩, prev, i, j)
begin
    if i = 0 or j = 0 then return
    if prev[i, j] = "↖" then
        LCS OUTPUT(A, prev, i − 1, j − 1)
        print a_i
    else if prev[i, j] = "↑" then LCS OUTPUT(A, prev, i − 1, j)
    else LCS OUTPUT(A, prev, i, j − 1)
end

Figure 13: Delivering an LCS.

Figure 14: Backtracking process for finding an LCS of ⟨A, L, G, O, R, I, T, H, M⟩ and ⟨A, L, I, G, N, M, E, N, T⟩.
