
National University of Kaohsiung
Department of Computer Science and Information Engineering
Master's Thesis

植基於 QS 方法之改良式 Shift-And 演算法
An Improved Shift-And Algorithm using Quick Search Method

Student: Yu-Feng Lin
Advisor: Dr. Chien-Yuan Chen

June 2010


An Improved Shift-And Algorithm using Quick Search Method

Advisor: Dr. Chien-Yuan Chen
Institute of Computer Science and Information Engineering, National University of Kaohsiung

Student: Yu-Feng Lin
Institute of Computer Science and Information Engineering, National University of Kaohsiung

Abstract

This thesis presents a new algorithm for exact single pattern matching, called SAQS, that possesses the advantages of both the Shift-And and the Quick-Search algorithms. Compared with the Shift-And algorithm and the Quick-Search algorithm, SAQS demands only n/(m+1) and n matching steps in the best case and the worst case, respectively. According to our analysis, SAQS yields an efficient and effective pattern matching algorithm through the combination of the Shift-And and the Quick-Search algorithms when the pattern length is short. The experimental results also support our analysis.

Keywords: Exact Pattern Matching, Algorithms, Bioinformatics

植基於 QS 方法之改良式 Shift-And 演算法

Advisor: Dr. Chien-Yuan Chen
Institute of Computer Science and Information Engineering, National University of Kaohsiung

Student: Yu-Feng Lin
Institute of Computer Science and Information Engineering, National University of Kaohsiung

摘要 (Abstract)

This thesis proposes SAQS, an exact pattern matching algorithm that possesses the advantages of both the Shift-And algorithm and the Quick-Search algorithm, so that it requires only n/(m+1) matching steps in the best case and only n matching steps in the worst case. According to our analysis, the Shift-And algorithm is more effective when the pattern is short and the alphabet is small, whereas the Quick-Search algorithm outperforms other pattern matching algorithms when the pattern is short and the alphabet is large; therefore, SAQS is an efficient and effective pattern matching algorithm when the pattern length is short. The experimental results also support our analysis.

Keywords: Exact Pattern Matching, Algorithms, Bioinformatics

Acknowledgement

First of all, I sincerely thank my advisor, Professor Chien-Yuan Chen. His attentive guidance, constant discussions, and direction kept me on the right track, and I have benefited enormously during these two years. He often sacrificed evenings and holidays to fit my work schedule and offered many directions for research and study; my gratitude to him is beyond words.

I would also like to thank the oral examination committee members, Professors 黃健峯 and 曾新穆, who took the time to attend my defense. Their guidance and suggestions made this thesis more complete and showed me where its topic can be explored further. Professor 黃健峯 in particular kept giving me advice and direction throughout the writing period, which allowed the thesis to be finished smoothly.

The two years of graduate life, with its discussions about research and coursework, sports in our spare time, idle chatter, and frequent group meals, are the shared memories of our laboratory. I thank everyone in the lab, 筱凌, 坪亨, 蕃薯, 小猜, 阿水, and 能寅; your company made these two years of research life colorful.

Finally, of course, I must thank my dear family. Your quiet support and selfless devotion allowed me to complete my master's studies smoothly. I also thank the many friends who have helped me along the way; it is because of all of you that I have what I have achieved today.

Yu-Feng Lin
2010.06.28

Content

Abstract
摘要 (Abstract)
Acknowledgement
Content
Chapter 1 Introduction
Chapter 2 Preliminaries
  2.1 Boyer-Moore Algorithm
  2.2 Quick-Search Algorithm
  2.3 Shift-And Algorithm
Chapter 3 The Proposed Algorithm
  3.1 The Pre-Processing Phase
  3.2 Searching Phase
  3.3 Example
Chapter 4 Analysis of our proposed algorithm
Chapter 5 Experimental evaluation
  5.1 The Nucleotide Sequences
  5.2 The Amino Acid Sequences
Chapter 6 Conclusion
References

Chapter 1 Introduction

Pattern matching has been an important research subject with many practical applications in computer science. It has been applied to many different areas, including cryptosystems [8], data compression [7], image processing [1], and the mining of meaningful patterns in nucleotide or amino acid sequences in biology [9,10]. Pattern matching algorithms are mainly categorized by functionality as single pattern versus multiple pattern, and as exact matching versus approximate matching. In this thesis we report our research results on the problem of exact single pattern matching.

The goal of exact single pattern matching is to find all occurrences of a pattern of m symbols in a given sequence of n symbols, where n is usually much larger than m. The easiest yet least efficient solution is to scan each position of the sequence serially and determine whether all symbols of the pattern match those of the corresponding segment of the sequence; this approach is referred to as brute-force matching. In the worst case it requires n*m symbol comparisons. To tackle this problem, several other algorithms have been proposed, such as the Boyer-Moore algorithm (BM) [2], the Quick-Search algorithm (QS) [3], and the Shift-And algorithm (SA) [4].

Most of these algorithms contain a pre-processing phase and a searching phase, and they usually aim at minimizing the computational cost of the searching phase.

In this thesis, we present a new method, called SAQS, that integrates the advantages of both the SA and the QS algorithms for exact single pattern matching. The SA algorithm replaces symbol matching with bit matching through bitwise operations, and it requires n matching steps in every case, including the worst, average, and best cases. The QS algorithm, on the other hand, employs a shift-value idea; it requires only n/(m+1) matching steps in the best case, but in the worst case it incurs the same computational overhead as the brute-force algorithm. As we will show later, our proposed SAQS algorithm possesses the advantages of both algorithms and demands only n/(m+1) and n matching steps in the best case and the worst case, respectively. According to our analysis, SAQS works well when the pattern length is short.

The organization of this thesis is as follows. Chapter 2 presents a literature review of three efficient algorithms. Chapter 3 describes the SAQS algorithm in detail and offers an example.

Chapter 4 gives the analysis of our proposed algorithm, and Chapter 5 shows the experimental results comparing the proposed algorithm with the other three algorithms. We then conclude this thesis in Chapter 6.

Chapter 2 Preliminaries

In this thesis, we assume all the symbols of patterns and sequences are taken from an alphabet Σ. Given a sequence (or text) of length n, T[1…n], and a pattern of length m, P[1…m], the task is to find all occurrences of P in T.

2.1 Boyer-Moore Algorithm

The Boyer-Moore pattern matching algorithm (BM) [2] is a well-known, straightforward, yet efficient algorithm for pattern matching. In a nutshell, Boyer and Moore proposed two rules for comparing the symbols in the search window with those of the pattern: the good-suffix rule and the bad-character rule. The BM algorithm slides the pattern across the sequence from left to right, yet the actual symbol comparisons between the pattern and the sequence take place from right to left. The good-suffix rule and the bad-character rule are illustrated in Fig. 1, where "G", "A", and "T" are symbols of the pattern and v is a suffix of the pattern.

Figure 1. The good-suffix rule and the bad-character rule in the BM algorithm.

The bad-character rule allows the algorithm to determine in one pass whether the symbol being compared occurs in the search pattern or not. Starting from the rightmost symbol of the pattern, if the mismatching symbol of the sequence occurs among the symbols of the pattern, the pattern is shifted right by a number of positions that depends on the distance of that symbol from the rightmost symbol of the pattern. The good-suffix rule is applied when the suffix of the pattern that has already matched occurs at another position of the pattern. If a symbol of the sequence being compared does not match any symbol in the pattern, then the pattern can be shifted completely past the mismatching symbol. In common cases the BM algorithm compares only a small portion of the symbols of the sequence. The time complexity is O(mn) in the worst case and O(n/m) in the best case [11].
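As a concrete illustration of the occurrence information behind the bad-character rule, the following sketch tabulates, for each symbol, the distance of its rightmost occurrence from the right end of the pattern (essentially the Horspool simplification of BM; the function name and the example pattern are illustrative choices, not taken from the thesis):

```python
def bad_character_shift(pattern: str) -> dict:
    """Distance of each symbol's rightmost occurrence (excluding the last
    position) from the right end of the pattern."""
    m = len(pattern)
    shift = {}
    for j in range(m - 1):           # later occurrences overwrite earlier ones
        shift[pattern[j]] = m - 1 - j
    return shift                     # symbols not in the table shift the window by m

# For a pattern such as "GCAGAGAG": shift['G'] == 2, shift['A'] == 1,
# shift['C'] == 6, and any other symbol shifts the window by 8.
```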

2.2 Quick-Search Algorithm

Sunday [3] improved BM's bad-character rule to create a new shift table and proposed the Quick-Search algorithm (QS). The QS algorithm differs from the BM algorithm in two respects. First, the symbol comparisons in the QS algorithm are performed strictly from left to right. Second, the shift values are calculated differently: if the symbols of the window do not match those of the pattern, the shift value is computed from the symbol immediately to the right of the window. The bad-character rule of the Quick-Search algorithm is displayed in Fig. 2. In the best case the maximum shift value is m+1 and the time complexity of the corresponding comparisons is O(n/(m+1)), but the search time in the worst case is still O(mn) [9].

Figure 2. The bad-character rule in the QS algorithm.
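The QS shift table itself is easy to tabulate. The sketch below follows the rule just described; the identifiers are illustrative and not taken from the thesis:

```python
def qs_shift_table(pattern: str) -> dict:
    """Quick-Search shift: distance of each symbol's rightmost occurrence
    from the position just past the right end of the pattern."""
    m = len(pattern)
    shift = {}
    for j, c in enumerate(pattern):   # j = 0..m-1; later occurrences overwrite earlier ones
        shift[c] = m - j
    return shift

def qs_shift(shift: dict, c: str, m: int) -> int:
    return shift.get(c, m + 1)        # symbols absent from the pattern shift by m + 1

# For the pattern "LVSST" used in Section 3.3: shift of 'T' is 1, 'S' is 2,
# 'V' is 4, 'L' is 5, and any other symbol shifts by 6 (= m + 1).
```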

2.3 Shift-And Algorithm

Baeza-Yates and Gonnet [4] developed the Shift-And algorithm (SA), which uses sets of states to indicate the matches between the pattern and the sequence. The SA algorithm designates a state r, a binary string of size m, to store information about all prefixes of the pattern P that end at a certain position. More specifically, ri[j] = 1 if the first j symbols of the pattern match exactly the last j symbols up to position i of the sequence, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. The algorithm also designates, for every symbol Bj of the alphabet Σ = {B1, B2, …, B|Σ|}, a binary string of size m that marks the positions of the pattern containing Bj.

Figure 3. Meaning of ri in the Shift-And algorithm.

The SA algorithm then scans across the sequence to search for the pattern, if it exists, by recursively computing ri+1 from a right shift of ri and a bitwise AND with the mask of the symbol read, until the end of the sequence. The time complexity of the SA algorithm is O(⌈mb/ω⌉ n) [5], where ⌈mb/ω⌉ is the time required to compute a constant number of operations on integers of mb bits using a word size of ω bits.
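For reference, here is a minimal runnable sketch of the Shift-And search in the conventional formulation, where bit j of the state marks a match of the prefix of length j+1; the thesis writes the state with the opposite bit order and shifts right, but the mechanics are the same, and the identifiers below are ours:

```python
def shift_and_search(text: str, pattern: str):
    m = len(pattern)
    B = {}                                  # B[c]: bit j set iff pattern[j] == c (0-based)
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    occurrences = []
    D = 0                                   # bit j set => pattern[0..j] matches, ending here
    for i, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):              # the whole pattern matches, ending at i
            occurrences.append(i - m + 1)   # 0-based starting position
    return occurrences

# shift_and_search("MKEALHQIVVRCSELVSST", "LVSST") -> [14]
```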

Chapter 3 The Proposed Algorithm

The SAQS algorithm we propose here searches for a pattern P of length m in a sequence T of length n. Our algorithm divides the matching task into a pre-processing phase and a searching phase.

In the pre-processing phase, we construct two tables: a query table B and a shift table S. The query table B stores, for each symbol of the alphabet Σ, its positions in the pattern P. The shift table S stores, for each symbol, how many positions ahead the next search should start when a match attempt fails on that symbol.

In the searching phase, the algorithm first designates a state r (a binary string) to store the states of the symbol comparisons. For each symbol of the sequence being processed, the algorithm uses tables B and S to query the state of that symbol and the shift value, respectively. The algorithm then updates r iteratively and examines whether the rightmost bit (the m-th bit) of r is 1. If the rightmost bit is 1, the pattern occurs in the sequence.

3.1 The Pre-Processing Phase

We first construct the query table B by examining which symbols appear in pattern P. Each state B[c] of table B is a binary string of length m, where B[c]_j = 1 indicates that c is the symbol of the pattern at the j-th position, 1 ≤ j ≤ m. That is, the query table B satisfies

    B[c]_j = 1 if P[j] = c, and B[c]_j = 0 if P[j] ≠ c, for 1 ≤ j ≤ m.    (1)

Therefore, by scanning across the pattern, the state B[c] can be generated for every symbol c (refer to Section 3.3 for an example).

Thereafter, we generate a shift table S according to the Quick-Search bad-character rule and compute the shift value S[c] for every symbol c, where S[c] ≥ 1. If c occurs in the pattern, S[c] is the minimum distance of c from the position just past the right end of the pattern; otherwise, S[c] is equal to m+1. That is, S[c] satisfies (refer to Section 3.3 for an example)

    S[c] = min{ m - j + 1 : P[j] = c, 1 ≤ j ≤ m } if c occurs in P, and S[c] = m + 1 otherwise.    (2)
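The following sketch builds both tables for a given pattern; the function name is ours, and the bit masks use the thesis's bit order (the leftmost of the m bits corresponds to pattern position 1), so for the Section 3.3 pattern they print exactly like the table given there:

```python
def saqs_preprocess(pattern: str):
    m = len(pattern)
    B = {}                                    # query table, Eq. (1)
    S = {}                                    # shift table, Eq. (2)
    for j, c in enumerate(pattern, start=1):  # j = 1..m, left to right
        B[c] = B.get(c, 0) | (1 << (m - j))   # set the bit for position j
        S[c] = m - j + 1                      # rightmost occurrence overwrites, giving the minimum
    return B, S                               # absent symbols: B defaults to 0, S to m + 1

B, S = saqs_preprocess("LVSST")
# format(B['L'], '05b') == '10000', format(B['S'], '05b') == '00110'
# S == {'L': 5, 'V': 4, 'S': 2, 'T': 1}; any other symbol uses m + 1 = 6
```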

3.2 Searching Phase

In the searching phase, we construct a state r (a binary string of length m) in order to store the outcomes of the symbol comparisons. The algorithm scans the first symbol of the sequence and determines whether it matches the first symbol of the pattern. If so, the algorithm continues to scan the next symbol of the sequence and compares it with the second symbol of the pattern. This process iterates until a mismatch is found; the algorithm then starts over, comparing the mismatched symbol with the first symbol of the pattern, and proceeds until the next mismatch. If a symbol of the sequence matches the final symbol of the pattern, then the pattern has been found in the sequence.

To this end, we assign 1 to the leftmost bit of the initial state r and 0's to the remaining bits. This configuration ensures that the comparison of a symbol of the sequence always begins with the first symbol of the pattern. Starting from the leftmost symbol of the sequence, the algorithm reads one symbol at a time and queries the corresponding state B[T[i]] from the query table B, where T[i] is the symbol of the sequence at position i, 1 ≤ i ≤ n. The algorithm then performs an AND operation between B[T[i]] and r and stores the result in a new state r', a binary string of length m, as shown in (3):

    r' = r & B[T[i]],    (3)

where "&" denotes the bitwise AND operator. According to the state r', as shown in Fig. 4, one can determine whether symbol T[i+j] in T[i…i+m] of the sequence matches symbol P[j] of pattern P[1…m]. For instance, assume that the algorithm reads T[i+j] and computes r' via Formula (3). If the j-th bit of r', r'_j, is equal to 1, then there is a match between T[i+j] and P[j], meaning that substring T[i…i+j] is the same as substring P[1…j]. Thereafter, our algorithm updates r by the formula

    r = (r' >> 1) | r_initial,    (4)

where ">>" and "|" denote the bitwise right-shift and bitwise OR operators, respectively. Our algorithm then scans the next symbol, T[i+j+1], and applies Formula (3) in the next iteration. The OR with r_initial after the right shift (r' >> 1) ensures that the leftmost bit of r is always equal to 1.

Figure 4. The shift situation of SAQS.

If r'_j is equal to zero, we consider the following two cases according to S[T[i+m]]:

Case A: S[T[i+m]] ≤ j. Symbol T[i+j] of the sequence has already been compared, and the result of the comparisons so far is stored in r'. In this case, the algorithm employs Formula (4) to update the state r and then scans T[i+j+1] in the next iteration.

Case B: S[T[i+m]] > j. According to the QS rule, we can shift S[T[i+m]] positions from position i of the sequence. In this case, we assign r_initial to r and process symbol T[i+S[T[i+m]]] of the sequence in the next iteration.

In summary, the algorithm performs pattern matching by executing the aforementioned steps until the end of the sequence. If the rightmost (m-th) bit of r' is equal to 1, the pattern has been found in the sequence. In other words, if the j-th bit of r' is equal to 1, then the algorithm has located successful matches between the most recently processed j symbols of the sequence and symbols 1 to j of the pattern. In this case, the comparison proceeds to the next symbol of the sequence and the value of j is computed by the following formula:

    j = max{ k : r'_k = 1, 1 ≤ k ≤ m - 1 }.    (5)

Here we describe the SAQS algorithm step by step:

Step 1. Construct the query table B and the shift table S for the pattern.
Step 2. Construct r and initialize it to obtain r_initial; designate r' and set i = 0, j = 0.
Step 3. Read one symbol of the sequence and query the corresponding state B[T[i]] from table B.
Step 4. Compute r' = r & B[T[i]].
Step 5. If r'_j is equal to zero, execute one of the following two cases based on the value of S[T[i+m]]:
    i.  If S[T[i+m]] ≤ j: j = j - S[T[i+m-j]] + 1; r = (r' >> 1) | r_initial; i++.
    ii. If S[T[i+m]] > j: r = r_initial; i = i + S[T[i+m-j]] - j; j = 0.
Step 6. If r'_j is equal to 1, then: i++; j++.
Step 7. If r'_j is equal to 1 and j is equal to m, then the pattern is found; output i; i++; j = max{ k : r'_k = 1, 1 ≤ k ≤ m - 1 }.
Step 8. Repeat Steps 3 to 7 until the last symbol of the sequence.

Therefore, our proposed algorithm requires n iterations in the worst case and n/(m+1) iterations in the best case (refer to Chapter 4 for a detailed analysis).
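To make the flow of Steps 1 to 8 concrete, here is a simplified, runnable sketch of the same idea. It uses 0-based indices and the conventional left-shift bit order, and it folds the Case A/Case B bookkeeping into a single test (jump only when the whole state is empty), so the details differ from the thesis's description, but the combination is the same: Shift-And bit matching plus Quick-Search skips.

```python
def saqs_search(text: str, pattern: str):
    n, m = len(text), len(pattern)
    B, S = {}, {}                              # query table (Eq. 1) and shift table (Eq. 2)
    for j, c in enumerate(pattern):            # j = 0..m-1
        B[c] = B.get(c, 0) | (1 << j)
        S[c] = m - j                           # rightmost occurrence wins
    occurrences = []
    D = 0                                      # Shift-And state
    i = 0                                      # index of the next text symbol to read
    while i < n:
        D = ((D << 1) | 1) & B.get(text[i], 0)
        if D & (1 << (m - 1)):                 # a full match ends at position i
            occurrences.append(i - m + 1)
        if D == 0:                             # no live prefix: try a Quick-Search skip
            if i + m >= n:                     # no further window fits into the text
                break
            i += S.get(text[i + m], m + 1)     # skip positions that cannot start a match
        else:
            i += 1                             # keep extending the surviving prefixes
    return occurrences
```

When every symbol read falls outside the pattern, the state stays empty and each step advances i by m + 1, which is exactly the best-case behaviour analyzed in Chapter 4; when the state never empties, the sketch degenerates to plain Shift-And and reads every symbol once.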

3.3 Example

As an illustration, we choose a putative GR6 protein sequence of Homo sapiens (variants of this sequence affect asthma and myocardial infarction, and the structural features of this protein in colorectal cancer differ from those in normal mucosa [12]) from the National Center for Biotechnology Information (NCBI) and use the first 40 amino acids of the full sequence to test our algorithm. The FASTA record of the full sequence is as follows:

>gi|6680081|ref|NP_031380.1| putative GR6 protein [Homo sapiens]
MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGTLGQGGWKL
LGIVGSLAPETLGGLGTEFGPCTHPLPFDMVRERERDDELRQGWLLQ
CPQCARTLLCHCGPFLTPPSQTSSSGFQLCSLKPSGSLVTATEPLSNFAF
SYFP

T = MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
P = LVSST
n = 40, m = 5

(1) In the pre-processing phase:

According to pattern P, we create the query table B, the shift table S, and the initial state r. The symbol * denotes any symbol c of Σ that does not appear in the pattern.

    Symbol c    B[c]        S[c]
    L           (10000)2    5
    V           (01000)2    4
    S           (00110)2    2
    T           (00001)2    1
    *           (00000)2    6

i = 1, j = 1, r = r_initial = (10000)2

(2) In the searching phase:

Iteration 1: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[1]] = (10000)2 & (00000)2 = (00000)2
Since r'_j = r'_1 = 0 and S[T[i+m]] > j (6 > 1):
r = (10000)2, i = i + S[T[i+m]] = 7, j = 1.

Iteration 2: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[7]] = (10000)2 & (00000)2 = (00000)2

Since r'_j = r'_1 = 0 and S[T[i+m]] > j (6 > 1):
r = (10000)2, i = i + S[T[i+m]] = 13, j = 1.

Iteration 3: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[13]] = (10000)2 & (00110)2 = (00000)2
Since r'_j = r'_1 = 0 and S[T[i+m]] > j (2 > 1):
r = (10000)2, i = i + S[T[i+m]] = 15, j = 1.

Iteration 4: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[15]] = (10000)2 & (10000)2 = (10000)2
Since r'_j = r'_1 = 1:
r = (r' >> 1) | r_initial = (01000)2 | (10000)2 = (11000)2, i = i + 1 = 16, j = j + 1 = 2.

Iteration 5: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[16]] = (11000)2 & (01000)2 = (01000)2
Since r'_j = r'_2 = 1:

r = (r' >> 1) | r_initial = (00100)2 | (10000)2 = (10100)2, i = i + 1 = 17, j = j + 1 = 3.

Iteration 6: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[17]] = (10100)2 & (00110)2 = (00100)2
Since r'_j = r'_3 = 1:
r = (r' >> 1) | r_initial = (00010)2 | (10000)2 = (10010)2, i = i + 1 = 18, j = j + 1 = 4.

Iteration 7: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[18]] = (10010)2 & (00110)2 = (00010)2
Since r'_j = r'_4 = 1:
r = (r' >> 1) | r_initial = (00001)2 | (10000)2 = (10001)2, i = i + 1 = 19, j = j + 1 = 5.

Iteration 8: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[19]] = (10001)2 & (00001)2 = (00001)2
Since r'_j = r'_5 = 1 and j is equal to m, the pattern has been found in the sequence. Then

r = (r' >> 1) | r_initial = (00000)2 | (10000)2 = (10000)2, i = i + 1 = 20, j = 1.

Iteration 9: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[20]] = (10000)2 & (00110)2 = (00000)2
Since r'_j = r'_1 = 0 and S[T[i+m]] > j (2 > 1):
r = (10000)2, i = i + S[T[i+m]] = 22, j = 1.

Iteration 10: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[22]] = (10000)2 & (00000)2 = (00000)2
Since r'_j = r'_1 = 0 and S[T[i+m]] > j (2 > 1):
r = (10000)2, i = i + S[T[i+m]] = 24, j = 1.

Iteration 11: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[24]] = (10000)2 & (10000)2 = (10000)2
Since r'_j = r'_1 = 1:
r = (r' >> 1) | r_initial = (01000)2 | (10000)2 = (11000)2, i = i + 1 = 25, j = j + 1 = 2.

Iteration 12: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[25]] = (11000)2 & (00110)2 = (00000)2
Since r'_j = r'_2 = 0 and S[T[i+m]] > j (6 > 2):
r = (10000)2, i = i + S[T[i+m]] = 31, j = 1.

Iteration 13: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[31]] = (10000)2 & (00000)2 = (00000)2
Since r'_j = r'_1 = 0 and S[T[i+m]] > j (6 > 1):
r = (10000)2, i = i + S[T[i+m]] = 37, j = 1.

Iteration 14: MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT
r' = r & B[T[37]] = (10000)2 & (00000)2 = (00000)2
Since r'_j = r'_1 = 0 and i + m > n, the search finishes.
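As a quick cross-check, running the sketch given at the end of Section 3.2 (with its hypothetical saqs_search helper and 0-based indexing) on this example reports the same single occurrence:

```python
T = "MKEALHQIVVRCSELVSSTSLPRLSVSRLQGPPDSQPLGT"
P = "LVSST"
print(saqs_search(T, P))   # -> [14], i.e. the match T[15..19] = "LVSST"
                           #    in the thesis's 1-based indexing
```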

Chapter 4 Analysis of our proposed algorithm

We analyze our algorithm in each of its two phases: the pre-processing phase and the searching phase.

Our algorithm uses two tables, B and S, in the pre-processing phase. According to the Shift-And algorithm and the Quick-Search algorithm, the complexity of constructing B is O(⌈mb/ω⌉ (m + σ)), where σ is the size of Σ, and the complexity of constructing S is O(m + σ). So in the pre-processing phase the complexity of our algorithm is O(⌈mb/ω⌉ (m + σ)).

Next, we offer two lemmas to describe the time complexity of the searching phase.

Lemma 4.1. In the best case, the time complexity of the searching phase is O(⌈mb/ω⌉ n/(m+1)).

Proof: The algorithm reads the first symbol of the sequence and computes r'. If the leftmost bit of r' is not 1, the algorithm shifts (m+1) positions for the next comparison, as defined by table S of the Quick-Search algorithm. If this situation repeats until the last symbol of the sequence, then n/(m+1) iterations are required to scan the sequence, and

the time complexity of each iteration is O(⌈mb/ω⌉) [4]. Hence the time complexity is O(⌈mb/ω⌉ n/(m+1)).

An example of the best case is as follows:
Sequence: GGGGGGGGGGGGGGG
Pattern: TTTTT

Lemma 4.2. In the worst case, the time complexity of the searching phase is O(⌈mb/ω⌉ n).

Proof: The worst case occurs when only one comparison is processed in each iteration. This means every symbol of the sequence is used to look up the corresponding bits of B at most once; thus at most n computations of r are needed for the n symbols of the sequence. So the complexity of the searching phase in the worst case is O(⌈mb/ω⌉ n).

An example of the worst case is as follows:
Sequence: GGGGGGGGGGGGGGG
Pattern: GGGGG

The time complexity of our algorithm can be further refined following Baeza-Yates and Gonnet [4]. They indicated, "In practice (the patterns of length up to the word size are 32 or 64 bits) we have O(n) worst case."

Our SAQS algorithm outperforms the Shift-And algorithm because the time complexity of our algorithm is O(n/(m+1)) in the best case, and it outperforms the Quick-Search algorithm because our time complexity is O(n) in the worst case. To sum up, because the SAQS algorithm possesses the advantages of both the SA and QS algorithms, it works well when the pattern length is short. The reason is two-fold: (1) as discussed in [5], the SA algorithm works well when the pattern length is short and the alphabet size is small; (2) as discussed in [6], the QS algorithm works well when the pattern length is short and the alphabet size is large. Therefore, through the combination of the SA and QS algorithms, our SAQS algorithm yields an efficient and effective exact pattern matching method when the pattern length is short.
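For a concrete sense of these bounds, the arithmetic below applies them to the example of Section 3.3 (n = 40, m = 5); this is only an illustration of the two lemmas, not an additional measurement:

    best case:  n/(m+1) = 40/6 ≈ 6.7 iterations
    worst case: n = 40 iterations
    walkthrough in Section 3.3: 14 iterations, which falls between the two bounds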

Chapter 5 Experimental evaluation

In this chapter, we report the performance evaluation of our algorithm using two types of datasets, nucleotide sequences and amino acid sequences, downloaded from the NCBI website. We compare our algorithm with the SA, QS, and BM algorithms on a PC with an Intel Core 2 processor and 4 GB of RAM running the Windows XP operating system.

5.1 The Nucleotide Sequences

The nucleotide sequence dataset we use to test the algorithms is Gnomon mRNA, a set of genes predicted by Gnomon on Homo sapiens. The Gnomon mRNA dataset (394 MB in size) is drawn from an alphabet of four symbols (σ = 4): G (70,976,937 occurrences), T (99,604,486), A (98,283,399), and C (71,401,736). Based on the frequency of the symbols in this dataset, we examine many patterns of various lengths. For every pattern length, we generated one hundred different random patterns and computed the average execution time for our algorithm and the other three algorithms.

The results are displayed in Fig. 5; the vertical bars overlapping the performance curves are 95% confidence intervals about the means. As can be seen from this figure, our algorithm outperforms the other three when the pattern length is relatively short (between four and ten). Due to the short pattern lengths and the small alphabet size of this dataset, the shift table S is not often consulted, and the shift value is not large even when it is. One can also notice that our algorithm always outperforms the SA and QS algorithms when the pattern length is larger than two, but it does not outperform the SA algorithm when the pattern length is equal to two. The reason is that the SAQS algorithm splits the SA update into two steps, one that stores the results of the symbol comparisons and one that determines the next symbol of the sequence to read, so at this pattern length SAQS effectively performs worst-case pattern matching.

Figure 5. Comparison of the SAQS algorithm with the other three well-known algorithms using the nucleotide sequence (average time in seconds versus pattern length).

Here we present further results using four additional datasets from NCBI to examine the performance of our algorithm and the three other algorithms. The results shown in Table 1 are consistent with those of Fig. 5, which demonstrates that our results are promising. Table 1 also shows that SAQS outperforms BM by 13-15%, QS by 5-7%, and SA by 17-19% when the pattern length is between two and ten.

Table 1. Performance comparison between the four algorithms for various pattern lengths. Each cell is the average time in seconds with its 95% confidence interval (CI).

Length  Algorithm  alt_Celera_chr12.fa (128M)  alt_HuRef_chr15 (75.9M)  alt_Celera_chr19 (53.6M)  mRNA (415M)
2       BM         3.042 (±0.041)   2.141 (±0.032)   5.224 (±0.067)   17.660 (±0.279)
        QS         4.074 (±0.040)   2.412 (±0.023)   1.687 (±0.017)   13.177 (±0.139)
        SA         3.376 (±0.002)   2.319 (±0.001)   1.499 (±0.001)   11.780 (±0.005)
        SAQS       3.754 (±0.034)   2.156 (±0.020)   1.497 (±0.015)   11.978 (±0.125)
4       BM         3.620 (±0.073)   2.118 (±0.043)   1.498 (±0.032)   11.792 (±0.248)
        QS         3.087 (±0.072)   1.846 (±0.042)   1.288 (±0.030)   10.273 (±0.234)
        SA         3.280 (±0.003)   2.366 (±0.002)   1.465 (±0.001)   11.812 (±0.005)
        SAQS       3.052 (±0.071)   1.746 (±0.042)   1.224 (±0.030)   9.898 (±0.242)
6       BM         2.905 (±0.079)   1.698 (±0.045)   1.186 (±0.032)   9.438 (±0.249)
        QS         2.769 (±0.092)   1.644 (±0.053)   1.145 (±0.037)   9.099 (±0.296)
        SA         3.279 (±0.002)   2.364 (±0.002)   1.461 (±0.001)   11.830 (±0.005)
        SAQS       2.620 (±0.085)   1.498 (±0.048)   1.049 (±0.034)   8.433 (±0.263)
8       BM         2.590 (±0.082)   1.508 (±0.047)   1.057 (±0.033)   8.109 (±0.261)
        QS         2.652 (±0.102)   1.572 (±0.060)   1.102 (±0.042)   8.539 (±0.326)
        SA         3.293 (±0.001)   2.366 (±0.001)   1.462 (±0.001)   11.796 (±0.005)
        SAQS       2.478 (±0.091)   1.416 (±0.052)   0.996 (±0.037)   7.899 (±0.298)
10      BM         2.363 (±0.083)   1.374 (±0.048)   0.965 (±0.033)   7.293 (±0.247)
        QS         2.545 (±0.108)   1.503 (±0.063)   1.059 (±0.046)   7.898 (±0.347)
        SA         3.295 (±0.001)   2.366 (±0.002)   1.461 (±0.002)   11.470 (±0.005)
        SAQS       2.365 (±0.096)   1.349 (±0.055)   0.950 (±0.001)   7.234 (±0.309)
12      BM         2.229 (±0.078)   1.295 (±0.045)   0.909 (±0.033)   6.851 (±0.252)
        QS         2.502 (±0.115)   1.480 (±0.068)   1.027 (±0.047)   7.547 (±0.348)
        SA         3.297 (±0.002)   2.365 (±0.002)   1.461 (±0.001)   11.470 (±0.006)
        SAQS       2.318 (±0.107)   1.317 (±0.061)   0.918 (±0.042)   6.938 (±0.317)
14      BM         2.168 (±0.083)   1.261 (±0.048)   0.883 (±0.034)   6.620 (±0.250)
        QS         2.469 (±0.114)   1.459 (±0.005)   1.020 (±0.048)   7.503 (±0.350)
        SA         3.297 (±0.001)   2.367 (±0.002)   1.461 (±0.001)   11.468 (±0.005)
        SAQS       2.294 (±0.104)   1.30 (±0.060)    0.912 (±0.043)   6.873 (±0.320)
16      BM         2.137 (±0.078)   1.259 (±0.046)   0.880 (±0.033)   6.525 (±0.241)
        QS         2.544 (±0.110)   1.490 (±0.064)   1.049 (±0.046)   7.661 (±0.335)
        SA         3.296 (±0.002)   2.366 (±0.002)   1.461 (±0.001)   11.468 (±0.005)
        SAQS       2.335 (±0.101)   1.331 (±0.058)   0.934 (±0.041)   7.016 (±0.308)
18      BM         2.058 (±0.062)   1.207 (±0.001)   0.845 (±0.002)   6.271 (±0.002)
        QS         2.430 (±0.099)   1.422 (±0.056)   0.990 (±0.040)   7.270 (±0.294)
        SA         3.296 (±0.001)   2.365 (±0.002)   1.462 (±0.001)   11.469 (±0.006)
        SAQS       2.237 (±0.078)   1.271 (±0.046)   0.888 (±0.033)   6.651 (±0.241)
20      BM         1.925 (±0.071)   1.129 (±0.001)   0.787 (±0.002)   5.809 (±0.005)
        QS         2.371 (±0.102)   1.385 (±0.058)   0.979 (±0.042)   7.123 (±0.312)
        SA         3.297 (±0.001)   2.365 (±0.002)   1.463 (±0.001)   11.474 (±0.004)
        SAQS       2.183 (±0.069)   1.243 (±0.040)   0.871 (±0.029)   6.503 (±0.217)

5.2 The Amino Acid Sequences

The amino acid sequence dataset considered here is 72.2 MB in size. In this case, the alphabet size is large (σ = 20): Σ = {A (3,768,696 occurrences), C (3,590,440), D (2,708,175), E (6,623,723), F (1,545,720), G (1,211,997), H (1,766,250), I (2,737,795), K (1,120,005), L (2,594,914), M (2,571,322), N (3,828,316), P (2,297,735), Q (248,162), R (5,283,992), S (4,150,425), T (3,901,164), V (265,042), W (436,449), Y (530,554)}. As can be seen in Fig. 6, the SAQS algorithm outperforms the other three algorithms, except for pattern length two, where the SA algorithm slightly outperforms the SAQS algorithm. One can clearly notice the performance discrepancy between our algorithm and the other three when the alphabet size is sufficiently large (20 in this dataset, compared with the alphabet size of 4 in the datasets of the previous subsection). When the alphabet size increases, two factors contribute to the superior performance of the SAQS algorithm: (1) the shift table S is consulted often, and (2) the shift value is close to the maximum (m+1). Here we use four additional datasets to examine the performance discrepancy.

Table 2 again shows promising results for our algorithm: SAQS clearly outperforms the other three when the pattern length is between four and twenty. Table 2 also shows that SAQS outperforms BM by 48-49%, QS by 20-21%, and SA by 126-127% when the pattern length is between two and twenty-six. Summing up the results of Table 1 and Table 2, it is clear that the efficiency of the algorithm is contingent on the size of the alphabet: compared with the other three algorithms, the performance of the SAQS algorithm improves as the alphabet size increases.

Figure 6. Comparison of the SAQS algorithm with the other three well-known algorithms using the amino acid sequence (average time in seconds versus pattern length).

Table 2. Performance comparison between the four algorithms for various pattern lengths. Each cell is the average time in seconds with its 95% confidence interval (CI).

Length  Algorithm  H_sapiens/Gnomon Protein (20.4M)  H_sapiens/Protein (410M)  protein_fasta/gbbct12 (102M)  protein_fasta/gbbct1 (39.1M)
2       BM         0.745 (±0.010)   40.335 (±0.599)   3.742 (±0.046)   1.400 (±0.017)
        QS         0.492 (±0.002)   23.429 (±0.142)   2.454 (±0.020)   0.940 (±0.005)
        SA         0.365 (±0.001)   19.635 (±0.187)   1.897 (±0.034)   0.709 (±0.002)
        SAQS       0.422 (±0.003)   19.204 (±0.151)   2.079 (±0.012)   0.766 (±0.005)
4       BM         0.400 (±0.002)   20.894 (±0.182)   1.978 (±0.007)   0.739 (±0.003)
        QS         0.306 (±0.002)   16.296 (±0.179)   1.515 (±0.007)   0.585 (±0.003)
        SA         0.371 (±0.001)   20.506 (±0.198)   1.853 (±0.005)   0.704 (±0.001)
        SAQS       0.266 (±0.002)   12.186 (±0.121)   1.297 (±0.009)   0.477 (±0.004)
6       BM         0.282 (±0.002)   14.208 (±0.164)   1.372 (±0.008)   0.515 (±0.003)
        QS         0.243 (±0.002)   11.375 (±0.128)   1.177 (±0.008)   0.445 (±0.003)
        SA         0.377 (±0.002)   20.609 (±0.168)   1.858 (±0.004)   0.704 (±0.001)
        SAQS       0.198 (±0.002)   8.997 (±0.107)    0.957 (±0.007)   0.352 (±0.003)
8       BM         0.219 (±0.002)   10.513 (±0.114)   1.069 (±0.007)   0.402 (±0.003)
        QS         0.202 (±0.002)   8.901 (±0.105)    0.964 (±0.006)   0.360 (±0.003)
        SA         0.380 (±0.002)   20.632 (±0.190)   1.852 (±0.002)   0.703 (±0.001)
        SAQS       0.163 (±0.002)   6.874 (±0.091)    0.778 (±0.006)   0.286 (±0.003)
10      BM         0.181 (±0.002)   7.507 (±0.019)    0.885 (±0.007)   0.334 (±0.003)
        QS         0.169 (±0.002)   7.185 (±0.020)    0.815 (±0.006)   0.305 (±0.003)
        SA         0.376 (±0.002)   18.418 (±0.016)   1.849 (±0.001)   0.705 (±0.002)
        SAQS       0.139 (±0.002)   5.516 (±0.053)    0.657 (±0.006)   0.243 (±0.003)
12      BM         0.155 (±0.002)   6.294 (±0.015)    0.761 (±0.007)   0.285 (±0.003)
        QS         0.146 (±0.002)   5.909 (±0.012)    0.705 (±0.007)   0.267 (±0.003)
        SA         0.370 (±0.002)   18.540 (±0.070)   1.851 (±0.002)   0.704 (±0.001)
        SAQS       0.119 (±0.002)   4.570 (±0.010)    0.574 (±0.006)   0.211 (±0.002)
14      BM         0.138 (±0.002)   5.911 (±0.073)    0.675 (±0.006)   0.256 (±0.002)
        QS         0.131 (±0.002)   5.699 (±0.124)    0.636 (±0.006)   0.241 (±0.003)
        SA         0.369 (±0.002)   19.946 (±0.179)   1.849 (±0.001)   0.705 (±0.001)
        SAQS       0.108 (±0.001)   4.288 (±0.088)    0.517 (±0.005)   0.190 (±0.002)
16      BM         0.123 (±0.002)   4.866 (±0.039)    0.598 (±0.006)   0.232 (±0.002)
        QS         0.120 (±0.002)   4.496 (±0.057)    0.576 (±0.006)   0.220 (±0.003)
        SA         0.369 (±0.008)   21.570 (±1.584)   1.849 (±0.008)   0.704 (±0.006)
        SAQS       0.098 (±0.002)   3.506 (±0.024)    0.469 (±0.005)   0.173 (±0.002)
18      BM         0.116 (±0.002)   4.283 (±0.064)    0.558 (±0.001)   0.215 (±0.001)
        QS         0.110 (±0.002)   4.050 (±0.033)    0.534 (±0.005)   0.203 (±0.002)
        SA         0.369 (±0.002)   18.627 (±0.051)   1.849 (±0.006)   0.703 (±0.002)
        SAQS       0.092 (±0.002)   3.233 (±0.039)    0.436 (±0.006)   0.161 (±0.002)
20      BM         0.105 (±0.002)   3.773 (±0.014)    0.508 (±0.002)   0.196 (±0.005)
        QS         0.102 (±0.002)   3.656 (±0.019)    0.494 (±0.005)   0.195 (±0.002)
        SA         0.369 (±0.002)   18.217 (±0.025)   1.850 (±0.006)   0.727 (±0.003)
        SAQS       0.084 (±0.002)   2.865 (±0.026)    0.400 (±0.006)   0.147 (±0.003)

Chapter 6 Conclusion

In this thesis, we propose a new algorithm for exact single pattern matching that combines the shift function of the Quick-Search algorithm with the matching mechanism of the Shift-And algorithm. Compared with the SA algorithm, the time complexity of our algorithm is O(⌈mb/ω⌉ n/(m+1)) in the best case; compared with the QS algorithm, the time complexity of our algorithm is O(⌈mb/ω⌉ n) in the worst case. We have shown several promising results: based on our analysis, our algorithm improves pattern matching over three other well-known algorithms when the pattern length is short. The experimental results also show that our algorithm outperforms the three algorithms when the alphabet size is large.

References

[1] A. Amir, O. Kapah, and D. Tsur, "Faster two-dimensional pattern matching with rotations," Theoretical Computer Science, Vol. 368, pp. 196-204, 2006.

[2] R. Boyer and S. Moore, "A fast string searching algorithm," Communications of the ACM, Vol. 20, pp. 762-772, 1977.

[3] D. M. Sunday, "A very fast substring search algorithm," Communications of the ACM, Vol. 33, pp. 132-142, 1990.

[4] R. Baeza-Yates and G. H. Gonnet, "A new approach to text searching," Communications of the ACM, Vol. 35, pp. 74-82, 1992.

[5] K. Fredriksson, "Shift-or string matching with super-alphabets," Information Processing Letters, Vol. 87, pp. 201-204, 2003.

[6] T. Lecroq, "Fast exact string matching algorithms," Information Processing Letters, Vol. 102, pp. 229-235, 2007.

[7] T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, "Shift-And approach to pattern matching in LZW compressed text," Lecture Notes in Computer Science, pp. 1-13, 1999.

[8] C. Haack and A. Jeffrey, "Pattern-matching spi-calculus," Information and Computation, pp. 1195-1263, 2006.

[9] R. Thathoo, A. Virmani, S. S. Lakshmi, N. Balakrishnan, and K. Sekar, "TVSBS: a fast exact pattern matching algorithm for biological sequences," Current Science, Vol. 91, pp. 47-53, 2006.

[10] Y. Huang, L. Ping, X. Pan, and G. Cai, "A fast exact pattern matching algorithm for biological sequences," The International Conference on BioMedical Engineering and Informatics, Vol. 1, pp. 8-12, 2008.

[11] Z. Galil, "On improving the worst case running time of the Boyer-Moore string matching algorithm," Communications of the ACM, pp. 505-508, 1979.

[12] J. Lin, Y. M. Zhu, and M. D. Lai, "Structural features of GR6 gene and its expression in colorectal neoplasm," Medical Sciences, pp. 102-107, 2004.

[13] Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences, The Press Syndicate of the University of Cambridge, 1999.


