National University of Kaohsiung
Department of Computer Science and Information Engineering
Master's Thesis

植基於符號組之改良式 BLIM 演算法
An Improved BLIM Algorithm Using q-grams

Graduate student: Bo-Hau Lin
Advisor: Dr. Chien-Feng Huang

June 2012

Acknowledgements

First, I sincerely thank my advisor, Professor Chien-Feng Huang. His careful guidance covered not only my research but also how to conduct myself, and over these two years of graduate study I have grown greatly in both knowledge and character. During the writing and revision of this thesis in particular, he often discussed it with me late into the evening or gave up his holidays to do so. He also frequently told us about conditions in industry and the attitude with which we should face such challenges. I will never forget his earnest teaching, and my gratitude to him is beyond words.

I also especially thank the oral examination committee members, Professor 陳建源 and Professor 曾新穆, who took the time to attend; their guidance and suggestions during the defense made this thesis more complete. Professor 陳建源 in particular continually offered advice on my research throughout my graduate studies, and during the final writing stage provided comments and suggestions on the algorithm analysis that allowed this thesis to be completed smoothly.

During these two years I thank the seniors, classmates, and juniors of the Machine Learning and Data Mining Laboratory and the Cryptography and Information Security Laboratory. Whether discussing coursework and research or chatting over cheerful meals, it is because of you that this time is unforgettable. 鈺峰, 宇哲, 宗男, 俊宏, 嘉澤, 信宏 — thank you for filling this period with memories; it is a lifelong treasure.

I also thank the teachers, seniors, classmates, and friends who have helped and encouraged me all along the way, not only during my master's studies. Because of your help and encouragement, my learning journey has been colorful, and your support has been my motivation and drive. Finally, I thank my family; it is because of your trust and selfless devotion that I was able to complete my master's degree smoothly.

Bo-Hau Lin
Department of Computer Science and Information Engineering, National University of Kaohsiung
July 2012

An Improved BLIM Algorithm Using q-grams

Advisor: Dr. Chien-Feng Huang (Assistant Professor)
Institute of Computer Science and Information Engineering
National University of Kaohsiung

Student: Bo-Hau Lin
Institute of Computer Science and Information Engineering
National University of Kaohsiung

ABSTRACT

In this thesis we present a new matching algorithm that improves the BLIM algorithm by using q-grams in the matching phase. Our analysis shows that, in the best case, the time complexity of our algorithm is O(n/m), where n is the length of the sequence and m is the length of the pattern. The experiments show that the run time of our algorithm, compared with the BLIM algorithm, is reduced by about 20-40%.

Keywords: pattern matching; BLIM algorithm; q-grams

植基於符號組之改良式 BLIM 演算法

指導教授 (Advisor): Dr. Chien-Feng Huang (Assistant Professor)
國立高雄大學資訊工程研究所

學生 (Student): Bo-Hau Lin
國立高雄大學資訊工程研究所

摘要 (Abstract)

This thesis proposes a pattern matching algorithm that builds on the BLIM algorithm and uses the concept of q-grams to speed up matching. According to our analysis, in the best case our method has a time complexity of O(n/m), where n is the length of the sequence and m is the length of the pattern. Experimental results show that, compared with the BLIM algorithm, our method reduces matching time by 20-40%.

Keywords: pattern matching; BLIM algorithm; q-grams

Table of Contents

Acknowledgements ....... I
Abstract ....... II
摘要 ....... III
Table of Contents ....... IV
List of Figures ....... V
List of Tables ....... VI
Chapter 1. Introduction ....... 1
Chapter 2. Preliminaries ....... 5
  2.1 Shift-And algorithm ....... 5
  2.2 BNDM algorithm ....... 6
  2.3 BNDMq algorithm ....... 7
  2.4 BLIM algorithm ....... 10
Chapter 3. The Proposed Algorithm ....... 14
Chapter 4. Analysis ....... 17
Chapter 5. Experimental Results ....... 20
Chapter 6. Conclusion ....... 27
References ....... 28

List of Figures

Figure 1. A finite state machine example for "ABC" ....... 1
Figure 2. The windows of brute-force matching ....... 2
Figure 3. The window shift on a mismatch ....... 3
Figure 4. The shift flag in the BNDM algorithm ....... 7
Figure 5. The pattern "abcab" in the alignment matrix with computer word size W = 8 ....... 10
Figure 6. The probability of mismatch with various q's for alphabet sizes 4 and 20 ....... 19
Figure 7. The average run time of the algorithms for pattern length 5-30 ....... 20
Figure 8. The average run time of the algorithms for pattern length 31-60 ....... 21
Figure 9. The average run time of the BLIMq algorithm for various q's (alphabet size 4) ....... 24
Figure 10. The average run time of the BLIMq algorithm for various q's (alphabet size 20) ....... 25
Figure 11. The average number of inspections for various q's ....... 26
Figure 12. The average run time for various q's ....... 26

List of Tables

Table 1. The pseudo code of the SA algorithm ....... 6
Table 2. The pseudo code of the BNDM algorithm ....... 8
Table 3. The pseudo code of the BNDMq algorithm ....... 9
Table 4. The set B of "abcab" with alphabet {a, b, c, d} ....... 10
Table 5. The pseudo code of the preprocessing phase of the BLIM algorithm ....... 11
Table 6. The shift vector table for pattern "abcab" with alphabet {a, b, c, d} ....... 12
Table 7. The pseudo code of the searching phase of the BLIM algorithm ....... 13
Table 8. The pseudo code of the searching phase of the BLIMq algorithm ....... 15
Table 9. The searching phase of the BLIM algorithm ....... 16
Table 10. The searching phase of the BLIMq algorithm ....... 16
Table 11. The average run time of the algorithms for sequences of different length ....... 22
Table 12. The average run time of the algorithms for different alphabet sizes ....... 22
Table 13. The average number of inspections for alphabet size 2 ....... 23
Table 14. The average number of inspections for alphabet size 4 ....... 23
Table 15. The average number of inspections for alphabet size 20 ....... 24

Chapter 1
Introduction

Pattern matching is an important technology in computer science with many applications, such as intrusion detection systems (IDS) [1], deep packet inspection [2], anti-virus scanning [3][4], data mining [5][6][7], data compression [8], and bioinformatics [9]. For these applications, many good algorithms have been proposed and studied. The main task of a pattern matching algorithm is to search for the locations of a pattern in a sequence. Depending on the conditions, matching may be for exact or approximate occurrences of single or multiple patterns. This thesis focuses on exact single-pattern matching.

In general, pattern matching algorithms can be divided into two phases: preprocessing and searching. According to the differences in these two phases, pattern matching algorithms fall into five categories: finite state machine, symbol-order shift, bit-parallel, filter-based, and mixed types. We introduce these types in the following.

Figure 1. A finite state machine example for "ABC"

First, the finite state machine algorithms, e.g., the Aho-Corasick algorithm (AC) [10], typically turn the symbols of the pattern into states of a finite state machine in the preprocessing phase. Fig. 1 illustrates how this algorithm works for the pattern "ABC". The finite state

machine is constructed in the preprocessing phase, and the states then change as each symbol of the given sequence is read in the searching phase. The pattern is found when the machine reaches a specific state, for example, state 3 in Fig. 1. Finite state machine algorithms need only n iterations to find all the locations of the pattern in a sequence of length n. The run time of this type of algorithm is linear, which is appropriate for real-time applications. However, the transitions among states make these algorithms less efficient in practice.

Figure 2. The windows of brute-force matching

The second type of pattern matching algorithm is symbol-order shift, which includes the well-known Boyer-Moore algorithm (BM) [11]. In brute-force pattern matching, the sequence can be regarded as a series of windows. Fig. 2 shows brute-force matching for the pattern "ABC". It is obvious that the matching in Window 2 is unnecessary, and symbol-order shift algorithms can skip such unnecessary matching. These algorithms exploit the order of the symbols in the pattern, in either of two ways, by prefix or by suffix, to construct a shift table of shift values computed in the preprocessing phase. When a mismatch occurs in a window, the shift value determines how far to move to the next window; see Fig. 3 for an example. Compared with the finite state machine algorithms, the symbol-order shift algorithms have distinct best and worst cases, so it is important to analyze their average case.
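As a small illustration of the window view of brute-force matching described above, the following Python sketch (the function name is ours, not from the thesis) tries every window in turn:

```python
def brute_force_match(pattern, text):
    """Try every window of len(pattern) symbols, as in Fig. 2."""
    m, n = len(pattern), len(text)
    hits = []
    for i in range(n - m + 1):        # window i covers text[i : i + m]
        if text[i:i + m] == pattern:
            hits.append(i)
    return hits

# Every window is inspected, including ones a shift table could skip.
print(brute_force_match("ABC", "ABABCABC"))  # -> [2, 5]
```

A symbol-order shift algorithm improves on this by jumping over windows that a precomputed shift table proves cannot match.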

Figure 3. The window shift on a mismatch

The bit-parallel algorithms, such as the Shift-And algorithm (SA) [12], are the third type of pattern matching algorithm. The main idea of bit-parallelism is to represent states as bits. Because the state transitions mentioned above make the finite state machine algorithms inefficient, using bitwise operations can improve efficiency for real-time applications. The bit-parallel algorithms convert each symbol of the pattern to bits, where the number of bits equals the length of the pattern. In the searching phase, the state vector is updated by the bitwise operations 'and' and 'or'. In general, the bit-parallel algorithms are faster in practice than the previous two types, but the pattern length is constrained by the computer word size.

The fourth type of pattern matching algorithm is filter-based, for instance, the Lecroq algorithm (LQ) [13]. The filter-based algorithms split the searching phase into two stages, the filter stage and the examining stage. The filter values, usually computed from grams consisting of several symbols, are used to determine whether the pattern can possibly occur in a given region of the sequence. If the pattern may occur there, the algorithm examines the subsequence in the examining stage. In the filter stage, the filter-based algorithms eliminate unnecessary matching to speed up the search. Typically, designing an effective filter is the major subject in filter-based algorithms.

Recently, pattern matching algorithms mixing several of the aforementioned types have become prevalent. For instance, the BNDMq algorithm [14] mixes the symbol-order shift, bit-parallel, and filter-based types. Using the same preprocessing phase

of BNDM [15], BNDMq turns the symbols of the pattern into states of bits, and improves the filter stage by finding possible occurrences with grams of several symbols. When a mismatch occurs in either the filter stage or the examining stage, the shift value determines the move to the next window. By mixing three classes of algorithms, the BNDMq algorithm is highly effective.

In this thesis, we present a new pattern matching algorithm, called the BLIMq algorithm, which integrates the mechanism of the BLIM algorithm [16] with a q-gram filter in order to improve the efficiency of pattern matching. The BLIM algorithm removes the constraint of bit-parallel algorithms that the pattern length is limited by the computer word size, but it is not a particularly efficient algorithm [16]. Here we improve the BLIM algorithm by incorporating a q-gram filter in the searching phase. We will show that our proposed BLIMq algorithm is about 20% faster than the BLIM algorithm.

The organization of this thesis is as follows. Chapter 2 presents a literature review of four representative algorithms. Chapter 3 then describes the BLIMq algorithm in detail and gives an example. The analysis and experimental results are shown in Chapter 4 and Chapter 5. We conclude this thesis in Chapter 6.

Chapter 2
Preliminaries

In this thesis, we assume all the symbols of the pattern and the sequence are taken from an alphabet Σ[1…σ] of size σ. In addition, we denote the computer word size by W. Given a sequence (or text) of length n, T[1…n], and a pattern of length m, P[1…m], the task of pattern matching is to find all occurrences of P in T.

2.1 Shift-And algorithm

Baeza-Yates and Gonnet [12] developed the Shift-And algorithm (SA), which uses sets of states to track matches between the pattern and the sequence. The SA algorithm designates a state vector D, a binary string of length m, to store information about all prefixes of the pattern P that end at the current position. More specifically, Di[j] = 1 if the prefix P[1…j] matches the sequence ending at position i, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. The SA algorithm also designates a binary string BΣ[k] for each symbol in the alphabet Σ, where 1 ≤ k ≤ σ, and a set B collecting all the binary strings, B = {BΣ[1], BΣ[2], …, BΣ[σ]}. The bits of BΣ[k], 1 ≤ k ≤ σ, are set to 1 at the positions where Σ[k] occurs in the pattern. The SA algorithm then scans the sequence to search for the pattern, recursively computing Di+1 from Di by a shift and a bitwise 'and' operation with B[T[i+1]], until the end of the sequence. Position i in the sequence is reported if the state vector Di[m] = 1. For the detailed steps, the pseudo code of the SA algorithm is described in Table 1. In practice SA is very effective, especially for searching moderate-length patterns in sequences over small alphabets.
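As a sketch of the idea, Shift-And can be written in Python with unbounded integers standing in for the W-bit state; this uses the common left-shift formulation (bit j of D marks a live prefix of length j+1), and the function name is ours:

```python
def shift_and(pattern, text):
    """Shift-And: bit j of D is 1 iff pattern[0..j] matches text ending here."""
    m = len(pattern)
    # B[c] has bit j set iff pattern[j] == c
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    D, hits = 0, []
    for i, c in enumerate(text):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):        # full pattern recognized
            hits.append(i - m + 1)    # report the start position
    return hits

print(shift_and("abcab", "abcabcabdcabd"))  # -> [0, 3]
```

Each text symbol costs a constant number of word operations, which is why SA runs in linear time as long as m fits in one computer word.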

Table 1. The pseudo code of the SA algorithm

Shift-And(P, m, T, n)
  /* Preprocessing */
  for all c ∈ Σ do
    B[c] = 0^W
  end for
  for i = 0 to m − 1 do
    B[P[i]] = B[P[i]] | (0^(W−i−1) 1 0^i)
  end for
  /* Searching */
  D = 0^W
  for i = 0 to n − 1 do
    D = ((D << 1) | 1) & B[T[i]]
    if D & 0^(W−m) 1 0^(m−1) ≠ 0^W then
      Pattern detected ending at T[i]
    end if
  end for

2.2 BNDM algorithm

Navarro and Raffinot [15] implemented the nondeterministic backward directed acyclic word graph in a bit-parallel fashion, which resulted in the BNDM algorithm. BNDM includes a symbol-order shift for suffixes, and its performance is much better than that of SA. The BNDM algorithm regards the sequence as windows, as shown in Fig. 2 of Chapter 1, and designates a state vector D, a binary string of length m, to store information about all matches in each window. The BNDM algorithm designates a set B of binary strings corresponding to the symbols of the alphabet Σ, where B = {BΣ[1], BΣ[2], …, BΣ[σ]}; the bits of BΣ[k], 1 ≤ k ≤ σ, are set to 1 at the positions where Σ[k] occurs in the reversed pattern. The BNDM algorithm then scans each window backwards to search for the pattern, recursively computing Di+1 by a left

shift of Di and a bitwise 'and' operation with the mask of the symbol read, until the first symbol of the window. Position i in the sequence is reported if the state vector Di[1] = 1. The BNDM algorithm also designates a shift flag, last, to store the occurrence of a pattern prefix in the suffix of the window. The next window is then aligned using the shift flag, as shown in Fig. 4. For instance, Fig. 4 shows that the pattern prefix "AB" occurs in the suffix of Window 1, so Window 2 is aligned with the third symbol of Window 1 by a shift flag of 2. We provide the pseudo code of the BNDM algorithm in Table 2.

Figure 4. The shift flag in the BNDM algorithm

2.3 BNDMq algorithm

Recently, Durian et al. [14] studied BNDM with q-grams, which resulted in the BNDMq algorithm. The BNDMq algorithm constructs the set B of binary strings by the same method as in the preprocessing phase of BNDM. In the scanning window, the BNDMq algorithm first reads q symbols and calculates the state vector D in the searching phase as

  D = B[T[i]] & (B[T[i+1]] << 1) & … & (B[T[i+q−1]] << (q−1)).    (1)

If a portion of the pattern occurs in the window, the remaining symbols are read one by one in the examining stage. We show the pseudo code of BNDMq in Table 3.

Table 2. The pseudo code of the BNDM algorithm

BNDM(P, m, T, n)
  /* Preprocessing */
  for all c ∈ Σ do
    B[c] = 0^W
  end for
  for i = 0 to m − 1 do
    B[P[i]] = B[P[i]] | (0^i 1 0^(m−1−i))    /* mask of the reversed pattern */
  end for
  /* Searching */
  i = 0
  while i ≤ n − m do
    j = m; last = m
    D = 0^(W−m) 1^m
    while j > 0 and D ≠ 0^W do
      D = D & B[T[i + j]]
      j = j − 1
      if D & 1 0^(m−1) ≠ 0^W then
        if j > 0 then
          last = j
        else
          Pattern detected beginning at T[i + 1]
        end if
      end if
      D = D << 1
    end while
    i = i + last
  end while
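The backward window scan of Table 2 can be sketched in Python as follows; the masks encode the reversed pattern, and the function name and the explicit m-bit mask are our additions for illustration:

```python
def bndm(pattern, text):
    """BNDM sketch: scan each window backwards with bit-parallel factor masks."""
    m, n = len(pattern), len(text)
    mask = (1 << m) - 1
    B = {}
    for j, c in enumerate(pattern):   # masks of the reversed pattern
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    hits, i = [], 0
    while i <= n - m:
        j, last = m, m
        D = mask                      # every factor still alive
        while j > 0 and D != 0:
            D &= B.get(text[i + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):    # window suffix is a pattern prefix
                if j > 0:
                    last = j          # candidate shift
                else:
                    hits.append(i)    # whole window matched the pattern
            D = (D << 1) & mask
        i += last
    return hits

print(bndm("abcab", "abcabcabdcabd"))  # -> [0, 3]
```

Note how `last` shrinks only when a pattern prefix is seen in the window suffix, so the next window skips alignments that cannot match.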

Table 3. The pseudo code of the BNDMq algorithm

BNDMq(P, m, T, n, q)
  /* Preprocessing */
  for all c ∈ Σ do
    B[c] = 0^W
  end for
  for i = 0 to m − 1 do
    B[P[i]] = B[P[i]] | (0^i 1 0^(m−1−i))    /* mask of the reversed pattern */
  end for
  /* Searching */
  i = m − q + 1
  while i ≤ n − q + 1 do
    D = B[T[i]] & (B[T[i+1]] << 1) & … & (B[T[i+q−1]] << (q−1))
    if D ≠ 0^W then
      j = i; first = i − (m − q + 1)
      while D ≠ 0^W do
        j = j − 1
        if D & 1 0^(m−1) ≠ 0^W then
          if j > first then
            i = j
          else
            Pattern detected beginning at T[j + 1]
          end if
        end if
        D = (D << 1) & B[T[j]]
      end while
    end if
    i = i + m − q + 1
  end while
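The q-gram first step of Eq. (1) amounts to combining q consecutive symbol masks before any per-symbol examination. A minimal Python sketch (helper name and the example masks are ours):

```python
def qgram_state(B, text, i, q):
    """First state of a window from q symbols at once, as in Eq. (1)."""
    D = B.get(text[i], 0)
    for k in range(1, q):
        D &= B.get(text[i + k], 0) << k
    return D

# Reversed-pattern masks for "abcab" (m = 5): bit (m-1-j) set for pattern[j].
B = {'a': 0b10010, 'b': 0b01001, 'c': 0b00100}
print(bin(qgram_state(B, "abc", 0, 3)))  # "abc" aligns with the pattern: 0b10000
print(qgram_state(B, "abd", 0, 3))       # "abd" is not in the pattern -> 0
```

A zero state after this single step lets the algorithm skip the whole window, which is exactly the filtering benefit of q-grams.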

Figure 5. The pattern "abcab" in the alignment matrix with computer word size W = 8

         column:  0  1  2  3  4  5  6  7  8  9  10 11
  row 0:          a  b  c  a  b
  row 1:             a  b  c  a  b
  row 2:                a  b  c  a  b
  row 3:                   a  b  c  a  b
  row 4:                      a  b  c  a  b
  row 5:                         a  b  c  a  b
  row 6:                            a  b  c  a  b
  row 7:                               a  b  c  a  b

Table 4. The set B of "abcab" with alphabet {a, b, c, d}

      0   1   2   3   4   5   6   7   8   9   10  11
  a   FF  FE  FC  F9  F2  E5  BC  97  2F  5F  BF  7F
  b   FE  FD  FA  F2  E9  D3  A7  4F  9F  3F  7F  FF
  c   FE  FC  F9  F4  E4  C9  93  27  4F  9F  3F  7F
  d   FE  FC  F8  F0  E0  C1  83  07  0F  1F  3F  7F

2.4 BLIM algorithm

The common property of both the SA and BNDM classes of algorithms is that each symbol has a corresponding binary string on which the locations of its occurrences in the pattern are marked. In other words, each bit in a binary string indicates whether the symbol occurs in the pattern at that position. As a result, a pattern longer than the computer word requires more computer words, which clearly degrades the performance; the string should be no longer than the computer word size for the bits to be processed efficiently.

Table 5. The pseudo code of the preprocessing phase of the BLIM algorithm

Preprocess(P, m, W)
  ws = W + m − 1
  /* Construct the mask matrix */
  for i = 0 to ws − 1 do
    for all a ∈ Σ do
      B[a][i] = 1^W
    end for
  end for
  for j = 0 to W − 1 do
    tmp = 0^(W−j−1) 1 0^j
    for h = 0 to m − 1 do
      for all a ∈ Σ do
        B[a][j + h] = B[a][j + h] & ~tmp
      end for
      B[P[h]][j + h] = B[P[h]][j + h] | tmp
    end for
  end for
  /* Compute the shift vector */
  for all a ∈ Σ do
    S[a] = ws + 1
  end for
  for j = 0 to m − 1 do
    S[P[j]] = ws − j − 1
  end for
  /* Compute the scan order */
  i = 0
  for j = m − 1 down to 0 do
    k = j
    while k < ws do
      I[i] = k
      k = k + m
      i = i + 1
    end while
  end for

The BLIM algorithm [16] introduced a new mask matrix to overcome the limitation of the computer word size. The input pattern is transformed into an alignment matrix; Fig. 5 is an illustration for the pattern "abcab". Each symbol is assigned a binary string of length W per column, and the bits of the binary string are set to 1 according to the symbol positions of the alignment matrix, where the blanks in the alignment matrix can be regarded as matching any symbol of the alphabet. The set B consists of σ rows and ws columns collecting all the binary strings, where σ is the alphabet size and ws = W + m − 1. Table 4 displays the set B for "abcab".

The BLIM algorithm divides the given sequence into windows. Once a window has been processed, the BLIM algorithm continues to the next window by consulting the shift vector table S. The shift vectors stored in S are calculated by the method of the QS algorithm [17]:

  S[Σ[k]] = ws − s, with s = max{ j : P[j] = Σ[k] },  if Σ[k] occurs in P,
  S[Σ[k]] = ws + 1,                                   otherwise,              (2)

for 1 ≤ k ≤ σ, with positions j taken 1-based.

Table 6. The shift vector table for pattern "abcab" with alphabet {a, b, c, d}

  Symbol   Shift vector
  a        8
  b        7
  c        9
  d        12

Table 6 is the shift vector table for pattern "abcab". The maximum shift is ws + 1, and the minimum shift is ws − m = W − 1. When windows overlap, reading symbols repeatedly causes unnecessary inspections. To avoid them, the BLIM algorithm uses a scan order I. The scan orders are generated as m − j, 2m − j, …, Lm − j, where Lm − j < ws and 1 ≤ j ≤ m. For example, the scan order for pattern "abcab" is 4, 9, 3, 8, 2, 7, 1, 6, 11, 0, 5, 10. The pseudo code of the preprocessing phase is provided in Table 5. In the

searching phase, following the scan order, the algorithm reads the symbols in each window, initializing the state vector D by Eq. (3) and updating it by Eq. (4):

  D = B[T[i + I[0]]][I[0]],                        (3)

  D = D & B[T[i + I[j]]][I[j]],  0 < j < ws,       (4)

where & denotes the bitwise 'and' operation. The pattern is found when the scan of the window completes with D nonzero. The pseudo code of BLIM is provided in Table 7. We show an example of the searching phase of the BLIM algorithm in Table 9 of Chapter 3.

Table 7. The pseudo code of the searching phase of the BLIM algorithm

BLIM(P, m, T, n, W)
  i = 0
  while i ≤ n − ws do
    D = B[T[i + I[0]]][I[0]]
    for j = 1 to ws − 1 do
      D = D & B[T[i + I[j]]][I[j]]
      if D = 0^W then
        break
      end if
    end for
    if D ≠ 0^W then
      for j = 0 to W − 1 do
        if D & 0^(W−j−1) 1 0^j ≠ 0^W then
          Pattern detected beginning at T[i + j]
        end if
      end for
    end if
    i = i + S[T[i + ws]]
  end while
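The shift vector and scan order steps of the preprocessing in Section 2.4 are easy to check in a few lines of Python; this sketch (function name ours) follows Eq. (2) with 1-based positions and the scan-order rule given above:

```python
def blim_orders(pattern, W):
    """Shift vector (Eq. (2)) and scan order I from BLIM's preprocessing."""
    m = len(pattern)
    ws = W + m - 1                    # window size
    S = {}
    for j, c in enumerate(pattern):   # later occurrences overwrite earlier ones,
        S[c] = ws - (j + 1)           # so S[c] = ws - (last 1-based position)
    default_shift = ws + 1            # any symbol absent from the pattern
    I = []
    for j in range(m - 1, -1, -1):    # columns j, j+m, j+2m, ... below ws
        I.extend(range(j, ws, m))
    return S, default_shift, I

S, d, I = blim_orders("abcab", 8)
print(S)  # -> {'a': 8, 'b': 7, 'c': 9}
print(I)  # -> [4, 9, 3, 8, 2, 7, 1, 6, 11, 0, 5, 10]
```

The printed values reproduce the shift vector of Table 6 (for the symbols occurring in the pattern) and the scan order example for "abcab" given in the text.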

Chapter 3
The Proposed Algorithm

In this chapter, we describe the BLIMq algorithm, our improvement of the BLIM algorithm. The main idea of the proposed BLIMq algorithm is to improve the efficiency of the searching phase using a q-gram composed of the first q symbols read in the sliding window. It is unnecessary for the original algorithm to check the state after every single symbol read in the window; the straightforward improvement is to check the state only after reading several symbols.

The proposed BLIMq algorithm has the same preprocessing phase as the BLIM algorithm, including the computation of the set B, the shift vector table S by Eq. (2), and the scan order I; see Table 5 for details. In the searching phase, the first state vector D in a window is calculated from the q binary strings of the symbols read from the set B, following the scan order I, combined by the bitwise 'and' operation:

  D = B[T[i + I[0]]][I[0]] & B[T[i + I[1]]][I[1]] & … & B[T[i + I[q−1]]][I[q−1]],   (5)

where & denotes the bitwise 'and' operation. Then the state vector D is updated by

  D = D & B[T[i + I[j]]][I[j]],  q ≤ j < ws.       (6)

The pseudo code of the searching phase of the BLIMq algorithm is displayed in Table 8. We then provide examples of the BLIM algorithm and the BLIMq algorithm in Table 9 and Table 10, which are based on the previous preprocessing examples in Table 4, Table 6, and

the example of the scan order in Chapter 2.

Table 8. The pseudo code of the searching phase of the BLIMq algorithm

BLIMq(P, m, T, n, W, q)
  /* Searching */
  i = 0
  while i ≤ n − ws do
    D = B[T[i + I[0]]][I[0]] & … & B[T[i + I[q−1]]][I[q−1]]
    for j = q to ws − 1 do
      if D = 0^W then
        break
      end if
      D = D & B[T[i + I[j]]][I[j]]
    end for
    if D ≠ 0^W then
      for j = 0 to W − 1 do
        if D & 0^(W−j−1) 1 0^j ≠ 0^W then
          Pattern detected beginning at T[i + j]
        end if
      end for
    end if
    i = i + S[T[i + ws]]
  end while
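Putting the pieces together, the following Python sketch implements the BLIMq search of Table 8 end to end (the function name and the simplifying assumption that the text is at least one window long are ours; the full algorithm also handles the tail of the text):

```python
def blimq_search(pattern, text, W=8, q=4):
    """Sketch of BLIMq: mask-matrix search with a q-gram first state (Eq. (5))."""
    m, n = len(pattern), len(text)
    ws = W + m - 1
    alphabet = set(text) | set(pattern)
    # Mask matrix: bit r of B[a][col] stays 1 unless alignment r places a
    # different pattern symbol at window column col.
    B = {a: [(1 << W) - 1] * ws for a in alphabet}
    for r in range(W):
        for h in range(m):
            for a in alphabet:
                if a != pattern[h]:
                    B[a][r + h] &= ~(1 << r)
    # Shift vector (Eq. (2)) and scan order I.
    S = {a: ws + 1 for a in alphabet}
    for j, c in enumerate(pattern):
        S[c] = ws - (j + 1)
    I = [k for j in range(m - 1, -1, -1) for k in range(j, ws, m)]
    hits, i = [], 0
    while i <= n - ws:
        D = B[text[i + I[0]]][I[0]]
        for k in range(1, q):              # q-gram first state, Eq. (5)
            D &= B[text[i + I[k]]][I[k]]
        k = q
        while D != 0 and k < ws:           # remaining updates, Eq. (6)
            D &= B[text[i + I[k]]][I[k]]
            k += 1
        for r in range(W):                 # bit r set -> match starts at i + r
            if D & (1 << r):
                hits.append(i + r)
        i += S[text[i + ws]] if i + ws < n else ws + 1
    return hits

print(blimq_search("abcab", "abcabcabdcabd"))  # -> [0, 3]
```

On the running example, the state after the first window settles to 00001001 (bits 0 and 3), matching the D column of Tables 9 and 10.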

Table 9. The searching phase of the BLIM algorithm (sequence "abcabcabdcabd")

  j    I[j]  B[ch][I[j]]               D          Remark
  0    4     B[b][4]  = E9 = 11101001  11101001   Eq. (3)
  1    9     B[c][9]  = 9F = 10011111  10001001   Eq. (4)
  2    3     B[a][3]  = F9 = 11111001  10001001   Eq. (4)
  3    8     B[d][8]  = 0F = 00001111  00001001   Eq. (4)
  4    2     B[c][2]  = F9 = 11111001  00001001   Eq. (4)
  5    7     B[b][7]  = 4F = 01001111  00001001   Eq. (4)
  6    1     B[b][1]  = FD = 11111101  00001001   Eq. (4)
  7    6     B[a][6]  = BC = 10101101  00001001   Eq. (4)
  8    11    B[b][11] = FF = 11111111  00001001   Eq. (4)
  9    0     B[a][0]  = FF = 11111111  00001001   Eq. (4)
  10   5     B[c][5]  = C9 = 11001001  00001001   Eq. (4)
  11   10    B[a][10] = BF = 10111111  00001001   Eq. (4)
  Shift window

Table 10. The searching phase of the BLIMq algorithm (q = 4, sequence "abcabcabdcabd")

  j    I[j]  B[ch][I[j]]               D          Remark
  0    4     B[b][4]  = E9 = 11101001             Eq. (5)
  1    9     B[c][9]  = 9F = 10011111             Eq. (5)
  2    3     B[a][3]  = F9 = 11111001             Eq. (5)
  3    8     B[d][8]  = 0F = 00001111  00001001   Eq. (5)
  4    2     B[c][2]  = F9 = 11111001  00001001   Eq. (6)
  5    7     B[b][7]  = 4F = 01001111  00001001   Eq. (6)
  6    1     B[b][1]  = FD = 11111101  00001001   Eq. (6)
  7    6     B[a][6]  = BC = 10101101  00001001   Eq. (6)
  8    11    B[b][11] = FF = 11111111  00001001   Eq. (6)
  9    0     B[a][0]  = FF = 11111111  00001001   Eq. (6)
  10   5     B[c][5]  = C9 = 11001001  00001001   Eq. (6)
  11   10    B[a][10] = BF = 10111111  00001001   Eq. (6)
  Shift window

Chapter 4
Analysis

In this chapter, we analyze the searching phase of the proposed BLIMq algorithm, and then analyze the benefit of using q-grams to find the best q.

First, our proposed BLIMq algorithm uses the same preprocessing phase as the BLIM algorithm. Therefore, according to [16], the time and space complexities are O(mσ) and O(wsσ) for the set B, O(m + σ) and O(σ) for the shift vector table S, and O(ws) for the scan order I. For the searching phase, the worst case is O(nm/W) and the best case is O(n/m), according to [16]. The average-case analysis assumes that the symbols of the pattern and the sequence are uniformly distributed. The average case is

  O( ASI × n / AS ),                               (7)

where AS is the average shift and ASI is the average number of symbol inspections per window. The average shift of our proposed BLIMq algorithm is the same as that of the BLIM algorithm, and can be calculated as

  AS = ((σ − m)(W + m) + W + (W + 1) + … + (W + m − 1)) / σ
     = W + m − m(m + 1) / (2σ).                    (8)

Afterwards, we calculate the average number of symbol inspections per window (ASI). The probability that the pattern P of length m occurs in a window of size ws is denoted by H, as

follows:

  H = (W · σ^(ws−m)) / σ^ws = W / σ^m,             (9)

and the probability that the first q symbols of the pattern occur in a window of size ws is denoted by G:

  G = (W · σ^(ws−q)) / σ^ws = W / σ^q.             (10)

When the pattern occurs in the window, ws symbol inspections are performed, but the effective count per window is W + m − q, because the q-gram is counted as one inspection. If the first q symbols of the pattern occur in the window but the pattern itself does not, the number of inspections lies between W − q + 1 and W + m − q, with average

  ((W − q + 1) + (W + m − q)) / 2 = W − q + (m + 1)/2.    (11)

Therefore, the average number of symbol inspections per window (ASI) is

  ASI = H · (W + m − q) + (G − H) · (W − q + (m + 1)/2) + (1 − G) · 1
      = (W / σ^m) · ((m − 1)/2) + (W / σ^q) · (W − q + (m − 1)/2) + 1.   (12)

So, the average case is

  O( ( (W/σ^m) · ((m − 1)/2) + (W/σ^q) · (W − q + (m − 1)/2) + 1 )
     / ( W + m − m(m + 1)/(2σ) ) × n ).            (13)

In this study, we also discuss the benefit of using q-grams. In a nutshell, using q-grams is an extension of the alphabet, which increases the probability of a mismatch, and a high probability of mismatch often yields the maximum shift. Fig. 6 shows the probability of mismatch for various q's. According to Fig. 6, the best benefit is obtained when q is set to about 4 for alphabet size 4 and about 2 for alphabet size 20. Following Eq. (12), it is clear that a suitable q leads to the minimal ASI, and that the number of symbol inspections for the sequence is constant for patterns of the same length over an alphabet of the same size. In practice, constructing q-grams with a bigger q costs more. Therefore, we should find the best q such that the total required cost is minimal.

Figure 6. The probability of mismatch with various q's for alphabet sizes 4 and 20
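The closed forms in Eqs. (8) and (12) are easy to evaluate numerically. The following sketch (function names and the sample parameter values are ours, chosen for illustration rather than taken from the thesis's experiments) shows how the expected inspections per window shrink as q grows, with diminishing returns beyond small q:

```python
def average_shift(W, m, sigma):
    """AS from Eq. (8): expected shift over a uniform alphabet of size sigma."""
    return W + m - m * (m + 1) / (2 * sigma)

def average_inspections(W, m, sigma, q):
    """ASI from Eq. (12): expected symbol inspections per window."""
    H = W / sigma ** m        # window contains the whole pattern  (Eq. (9))
    G = W / sigma ** q        # window contains the first q-gram   (Eq. (10))
    return H * (W + m - q) + (G - H) * (W - q + (m + 1) / 2) + (1 - G) * 1

# Example with W = 8, m = 16, sigma = 20.
for q in (1, 2, 3, 4):
    print(q, round(average_inspections(8, 16, 20, q), 4))
print(average_shift(8, 16, 20))  # -> 17.2
```

This mirrors the trade-off discussed above: Eq. (12) alone keeps improving with larger q, so the best q in practice is set where the extra cost of constructing larger q-grams outweighs the remaining gain.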

Chapter 5
Experimental Results

In this chapter we conduct several experiments on our method and compare it with other algorithms. The experiments were run on a Pentium 4 3.0 GHz CPU with 2.5 GB of memory, under the Windows XP operating system in a Visual C++ environment.

Figure 7. The average run time of the algorithms for pattern length 5-30

In the first experiment, we set the alphabet size to 4 and randomly generate a sequence of size 10 MB. The average execution time over 1000 tests with patterns of length 5-30 is calculated. The results are displayed in Fig. 7. Apparently, the execution efficiency of our

proposed algorithm is better than that of the other four algorithms.

In the second experiment, we use the same setup as in the first experiment, but the pattern length is from 31 to 60. Due to the limitation of the computer word size, the SA algorithm cannot be executed for pattern lengths greater than 32. As a result, we replace the SA algorithm with the QS algorithm; the results are shown in Fig. 8. As can be seen, even for relatively long patterns (pattern length from 31 to 60), our proposed algorithm still outperforms the others.

Figure 8. The average run time of the algorithms for pattern length 31-60

In the third experiment, the alphabet size is also set to 4, and the pattern length is set to 16. We randomly generate sequences of length 100,000, 1,000,000, 10,000,000, and 100,000,000, and each sequence is examined in 1000 tests. The average run time is shown in Table 11. It is clear that our approach still outperforms the other algorithms.

Table 11. The average run time of the algorithms for sequences of different length

  Length of     BM         QS         SA         LQ         BLIM       BLIMq
  sequence      algorithm  algorithm  algorithm  algorithm  algorithm  algorithm
  100,000       0.000532   0.001373   0.000439   0.001094   0.000282   0.000311
  1,000,000     0.005242   0.015249   0.006101   0.009390   0.003860   0.003120
  10,000,000    0.051185   0.152437   0.060959   0.090130   0.035208   0.029140
  100,000,000   0.510665   1.529660   0.603417   0.913761   0.351612   0.292099

In the fourth experiment, we consider alphabet sizes of 2, 4, 20, 64, and 128, along with randomly generated patterns of length 16 and sequences of length 10,000,000. The execution times of the algorithms are displayed in Table 12. The results show that even for different alphabet sizes, our method is still better than the other algorithms.

Table 12. The average run time of the algorithms for different alphabet sizes

  Alphabet  BM         QS         SA         LQ         BLIM       BLIMq
  size      algorithm  algorithm  algorithm  algorithm  algorithm  algorithm
  2         0.067392   0.383488   0.062648   0.154077   0.070177   0.062764
  4         0.053189   0.159070   0.065271   0.093236   0.037226   0.030833
  20        0.015169   0.044394   0.063165   0.062829   0.012958   0.009852
  64        0.011129   0.032011   0.060639   0.050193   0.010765   0.008677
  128       0.010791   0.031107   0.060721   0.064525   0.010445   0.008339

In the fifth experiment, we compare the average number of inspections of the BLIM algorithm and our approach. We execute the BLIM algorithm, the BLIM algorithm without scan order (reading symbols directly from left to right in the window), and our approach for alphabet sizes of 2, 4, and 20, sequences of length 100,000, and pattern

(30) length of 10 to 16. For each case 1,000 tests are being conducted. The results are shown in Tables 13, 14, and 15. From the results, the BLIM algorithm without scan order in terms of alphabet size and pattern length entail most inspections, and the BLIM algorithm substantially reduces the amount of inspections by using scan order. But, our proposed BLIMq algorithm shows the least number of inspections.. Table 13. The average number of inspections for alphabet size 2 Algorithm. BLIM algorithm. Length of pattern. BLIM algorithm (without scan order). BLIMq algorithm (q = 4). Average number of inspections (alphabet size 2). 10. 53841.3. 96159.8. 35154.5. 11. 45039.7. 94276.6. 27752.5. 12. 46347.5. 96528.6. 26440.2. 13. 40457.1. 90267.5. 22932.5. 14. 41748.3. 98301.6. 22893.7. 15. 42319.4. 95410. 22915.9. 16. 43326. 95132. 23677.7. Table 14. The average number of inspections for alphabet size 4 Algorithm Length of pattern. BLIM algorithm. BLIM algorithm (without scan order). BLIMq algorithm (q = 4). Average number of inspections (alphabet size 4). 10. 15715.3. 69062.4. 10407.8. 11. 15753.5. 69179.2. 9981.53. 12. 15400.2. 69896.9. 9977.34. 13. 16014.1. 69981.3. 9945.96. 14. 15433.6. 67364.6. 9599.68. 15. 15437.5. 68569.9. 9771.78. 16. 15919.8. 69873.9. 9979.49. 23.
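The inspection metric can be illustrated with a short sketch. The matcher below is a naive right-to-left window scan that simply tallies every symbol comparison; BLIM's fingerprints, scan order, and shift tables are omitted, so this only shows how such counts are defined, not how the BLIM algorithm obtains them.

```python
# Sketch of the "number of inspections" metric: a naive window matcher that
# counts every symbol comparison.  This is an illustrative stand-in, not the
# BLIM algorithm itself (no scan order, no shift table).
def count_inspections(seq, pat):
    m, inspections, hits = len(pat), 0, []
    for w in range(len(seq) - m + 1):      # slide the window one step at a time
        for j in range(m - 1, -1, -1):     # read the window right to left
            inspections += 1
            if seq[w + j] != pat[j]:
                break                      # mismatch: abandon this window
        else:
            hits.append(w)                 # whole window matched the pattern
    return hits, inspections
```

An algorithm with good shifts and an informative scan order inspects far fewer symbols than this naive slide-by-one scheme, which is exactly the gap the tables quantify.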

Table 15. The average number of inspections for alphabet size 20

  Length of pattern   BLIM      BLIM (without scan order)   BLIMq (q = 4)
  10                  6592.24   45173.2                     6418.68
  11                  6347.27   43587                       6186.44
  12                  6123.14   42014.2                     5964.54
  13                  5878.51   40351.8                     5726.88
  14                  5827.47   39920.9                     5671.71
  15                  5717.07   39187.5                     5572.41
  16                  5490.37   37628                       5348.67

Figure 9. The average run time of the BLIMq algorithm for various q's (alphabet size 4)

In the sixth experiment, following our analysis in Chapter 4, we examine the advantage of using q-grams for different alphabet sizes. Sequences of length 10,000,000 are randomly generated with alphabet sizes 4 and 20, and the pattern length ranges from 5 to 32. We run the tests for q = 1, 2, 3, and 4, where the BLIM algorithm can be regarded as the BLIMq algorithm with q = 1. The results are presented in Figs. 9 and 10. As can be seen, for alphabet size 4 our BLIMq algorithm with q = 4 reduces the run time of the BLIM algorithm by about 40%. For alphabet size 20, in the best case (q = 2), our BLIMq algorithm improves on the BLIM algorithm by approximately 20% of the run time.

Figure 10. The average run time of the BLIMq algorithm for various q's (alphabet size 20)

Finally, according to our analysis in Chapter 4, the number of symbol inspections per window (ASI) depends on the length of the pattern when a suitable q is chosen by equation (12), and the run time of the BLIMq algorithm may increase as q increases. To test this observation, we randomly generate sequences of length 100,000,000 and patterns of length 20 with alphabet size 128, and run the tests for q from 2 through 20. The average number of inspections and the average run time are presented in Figs. 11 and 12. As can be seen, the best q is 4, for which both the number of inspections and the run time are the least.
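The q-gram idea underlying these trade-offs can be sketched briefly: q consecutive symbols are combined into a single integer, so one table lookup effectively processes q symbols at once. The encoding below is an illustrative assumption (a rank table over a DNA-style alphabet), not the thesis's exact preprocessing.

```python
# Illustrative q-gram packing: map q consecutive symbols to one integer in
# the range 0 .. sigma**q - 1, so a table of that size can be indexed by
# q symbols at once.  The rank table is a hypothetical DNA alphabet.
def qgram_index(window, q, rank, sigma):
    """Combine the ranks of the first q symbols into a single table index."""
    idx = 0
    for c in window[:q]:
        idx = idx * sigma + rank[c]   # base-sigma positional encoding
    return idx

rank = {'a': 0, 'c': 1, 'g': 2, 't': 3}     # sigma = 4
assert qgram_index('ac', 2, rank, 4) == 1   # 'ac' and 'ca' get distinct indices
assert qgram_index('ca', 2, rank, 4) == 4
```

Such a table holds sigma^q entries (256 for sigma = 4 and q = 4), so table size and cache behavior bound how large q can usefully grow, which is consistent with an optimal q existing rather than larger q always being better.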

Figure 11. The average number of inspections for various q's

Figure 12. The average run time for various q's

Chapter 6 Conclusion

In this thesis we present a new matching algorithm, the BLIMq algorithm, which improves the BLIM algorithm by using q-grams in the matching phase. Our analysis shows that, in the best case, the time complexity of our algorithm is O(n/m), where n is the length of the sequence and m is the length of the pattern. Furthermore, we examine the efficiency of our proposed algorithm with respect to factors such as the length of the pattern, the length of the sequence, the alphabet size, and q. The experiments show that the run time of our algorithm, when compared with the BLIM algorithm, is reduced by about 20 - 40%.

