The Proposed Algorithm - An Improved BLIM Algorithm Based on q-Grams

In this chapter, we describe the BLIMq algorithm, our improvement of the BLIM algorithm. The main idea of the proposed BLIMq algorithm is to improve the efficiency of the searching phase by using q-grams, where a q-gram is composed of the first q symbols read in the sliding window. The original algorithm checks the state after every symbol read in the window, which is unnecessary; the straightforward improvement is to check the state only after several symbols have been read. The BLIMq algorithm is the same as the BLIM algorithm in the preprocessing phase, including the computation of the set B, the shift vector table S by Eq. (2), and the scan order I (see Table 6 for details). In the searching phase, the first state vector D in a window is calculated from the q binary strings of set B that correspond to the first q symbols read in the window following the scan order I, combined by the bitwise 'and' operation. The calculation of D is as follows:

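$$ D = B[T[i + I[0]]][I[0]] \;\&\; B[T[i + I[1]]][I[1]] \;\&\; \cdots \;\&\; B[T[i + I[q-1]]][I[q-1]] \qquad (5) $$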

The pseudo code of the BLIMq algorithm for the searching phase is displayed in Table 8.

Then, we provide examples for the BLIM algorithm and the BLIMq algorithm in Table 9 and Table 10, which are based on the previous preprocessing examples in Table 4 and Table 6 and on the example of the scan order in Chapter 2.

Table 8. The pseudo code of the searching phase of the BLIMq algorithm

BLIMq(P, m, T, n, W, q)
/* Searching */
i = 0;
while i < n do
    D = B[T[i + I[0]]][I[0]] & ... & B[T[i + I[q-1]]][I[q-1]];
    for j = q to ws-1 do
        if D = 0^W then
            break;
        end if
        D = D & B[T[i + I[j]]][I[j]];
    end for
    if D ≠ 0^W then
        for j = 0 to W-1 do
            if D & 0^(W-j-1)10^j then
                Pattern detected beginning at T[i + j];
            end if
        end for
    end if
    i = i + S[T[i + ws]];
end while
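The searching phase in Table 8 can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the function name `blimq_search`, the dictionary-based mask table, the Sunday-style shift rule (the thesis defines S by Eq. (2), which is not reproduced here), and the simple residue-based scan order are all assumptions made for the example. Positions past the end of the text are treated as a symbol that does not occur in the pattern.

```python
def blimq_search(pattern, text, W=8, q=2):
    """Return the start positions of `pattern` in `text` (BLIMq-style sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    ws = W + m - 1                 # window size: W alignments of an m-symbol pattern
    if not 1 <= q <= ws:
        raise ValueError("q must be between 1 and the window size")
    full = (1 << W) - 1            # the all-ones state vector 1^W

    def mask_row(symbol):
        # Bit j of row[pos] stays 1 unless the alignment starting at window
        # offset j puts a pattern symbol different from `symbol` at `pos`.
        row = []
        for pos in range(ws):
            bits = full
            for j in range(W):
                k = pos - j        # pattern index facing window position pos
                if 0 <= k < m and pattern[k] != symbol:
                    bits &= ~(1 << j)
            row.append(bits)
        return row

    B = {c: mask_row(c) for c in set(pattern)}
    other = mask_row(None)         # shared row for symbols absent from the pattern
    last = {c: k for k, c in enumerate(pattern)}   # rightmost index of each symbol
    # Assumed scan order: any permutation of 0..ws-1 gives correct results.
    I = [pos for r in range(m - 1, -1, -1) for pos in range(r, ws, m)]

    def mask(index, pos):
        if index >= n:             # past the end of the text: force a mismatch
            return other[pos]
        return B.get(text[index], other)[pos]

    matches = []
    i = 0
    while i < n:
        D = full
        for j in range(q):         # AND the first q masks without testing D
            D &= mask(i + I[j], I[j])
        for j in range(q, ws):     # then test D before every further symbol
            if D == 0:
                break
            D &= mask(i + I[j], I[j])
        if D:
            for j in range(W):     # bit j set: occurrence starting at i + j
                if D & (1 << j):
                    matches.append(i + j)
        nxt = text[i + ws] if i + ws < n else None
        i += ws - last.get(nxt, -1)    # assumed Sunday-style shift, at most ws + 1
    return matches


print(blimq_search("abd", "abcabcabdcabd", W=8, q=2))  # [6, 10]
```

Setting q = 1 in this sketch gives the plain BLIM behaviour, in which D is tested after every symbol read.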

Table 9. The searching phase of the BLIM algorithm

Sequence | j | I[j] | B[ch][I[j]] | D | Remark
abcabcabdcabd | 11 | 10 | B[a][10] = BF = 10111111 | 00001001 | Equation (3)
abcabcabdcabd | | | | | Shift window

Table 10. The searching phase of the BLIMq algorithm

Sequence | j | I[j] | B[ch][I[j]] | D | Remark
abcabcabdcabd | 11 | 10 | B[a][10] = BF = 10111111 | 00001001 | Equation (5)
abcabcabdcabd | | | | | Shift window

Chapter 4 Analysis

In this chapter, we analyze the searching phase of the proposed BLIMq algorithm. Thereafter, we analyze the benefit of using q-grams in order to find the best q.

First, the proposed BLIMq algorithm uses the same preprocessing phase as the BLIM algorithm. Therefore, according to [15], the time and space complexities are O(mσ) and O(wsσ) for the set B, O(m+σ) and O(σ) for the shift vector table S, and both O(σ) for the scan order I. For the searching phase, the worst case is O(nm/W) and the best case is O(n/m), according to [15]. The average-case analysis assumes that the symbols of the pattern and the sequence are uniformly distributed. The average case can be obtained as follows:

$$ O\left(\text{Average Symbol Inspection per window} \times \frac{n}{\text{Average Shift}}\right). \qquad (7) $$
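As a small worked example of Eq. (7): the sequence of length n is covered by about n / AS windows, and each window costs ASI symbol inspections on average. The sketch below evaluates that product; the function name and the sample values of ASI and AS are illustrative assumptions, not figures from the experiments.

```python
def average_case_inspections(n, asi, avg_shift):
    """Estimate total symbol inspections as ASI * n / AS, following Eq. (7)."""
    if avg_shift <= 0:
        raise ValueError("average shift must be positive")
    windows = n / avg_shift          # expected number of windows visited
    return asi * windows


# Hypothetical values: a 10,000,000-symbol sequence, 2 inspections per
# window on average, and an average shift of 40 positions.
print(average_case_inspections(10_000_000, asi=2.0, avg_shift=40.0))  # 500000.0
```

The estimate falls when either the inspections per window decrease or the average shift grows, which is why the analysis treats ASI and AS separately.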

In Eq. (7), one has to calculate the average shift (AS) and average symbol inspection per window (ASI). The average shift in our proposed BLIMq algorithm is the same as that in the BLIM algorithm. The average shift in the BLIM algorithm can be calculated as follows:



and the probability that the q symbols of pattern P exist in the window of size ws is denoted by G as follows:

... regarded as one inspection. If the q symbols of the pattern P exist in the window but the pattern P itself does not, the number of symbol inspections is between W - q + 1 and W + m - q. The average of these two bounds can be shown as

$$ \frac{(W - q + 1) + (W + m - q)}{2} = \frac{2W + m - 2q + 1}{2}. $$

Therefore, the average symbol inspection per window (ASI) can be denoted as follows

1

So, the average case can be shown as

O(

q-grams are an extension of the alphabet that can increase the probability of mismatch. A high probability of mismatch often causes the maximum shift. Fig. 6 shows the probability of mismatch for various values of q. According to Fig. 6, the best benefit is obtained when q is set between 2 and 4 for alphabet sizes 4 and 20. Following Eq. (12), it is clear that a suitable q leads to the minimal ASI, and that the number of symbol inspections for the sequence is constant for patterns of the same length and alphabets of the same size. In practice, constructing q-grams with a larger q costs more computation. Therefore, we should find the q for which the total cost is minimal.

Figure 6. The probability of mismatch with various q's for alphabet size 4 and 20

Chapter 5 Experimental Results

In this chapter, we conduct several experiments on our method and compare it with other algorithms. To run the experiments, we use a Pentium 4 3.0 GHz CPU with 2.5 GB of memory, running the Windows XP operating system with the Visual C++ environment.

Figure 7. The average run time of the algorithms for pattern lengths of 5-30

In the first experiment, we set the alphabet size to 4 and randomly generate a sequence of size 10 MB. The average execution time over 1000 tests is calculated for patterns of length 5-30. The results are displayed in Fig. 7. The execution efficiency of our proposed algorithm is better than that of the other four algorithms.
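The setup of this experiment can be sketched as a small timing harness. This is an illustration only: the helper names, the use of randomly generated patterns, and the `str.find`-based stand-in matcher are assumptions, and the sizes in the usage lines are scaled down so that the sketch finishes quickly.

```python
import random
import time


def random_sequence(length, alphabet_size, rng):
    """Draw `length` symbols uniformly from an alphabet of the given size."""
    alphabet = [chr(ord("a") + k) for k in range(alphabet_size)]
    return "".join(rng.choices(alphabet, k=length))


def find_all(pattern, text):
    """Stand-in matcher built on str.find; swap in the algorithm under test."""
    hits = []
    k = text.find(pattern)
    while k != -1:
        hits.append(k)
        k = text.find(pattern, k + 1)
    return hits


def average_run_time(matcher, text, pattern_lengths, tests, alphabet_size, rng):
    """Average wall-clock seconds per search over random patterns."""
    total, runs = 0.0, 0
    for m in pattern_lengths:
        for _ in range(tests):
            pattern = random_sequence(m, alphabet_size, rng)
            start = time.perf_counter()
            matcher(pattern, text)
            total += time.perf_counter() - start
            runs += 1
    return total / runs


rng = random.Random(2024)                      # fixed seed for repeatability
text = random_sequence(100_000, 4, rng)        # the thesis uses a 10 MB sequence
avg = average_run_time(find_all, text, range(5, 31), 10, 4, rng)
print(f"average run time: {avg:.6f} s")
```

Passing a different matcher with the same `(pattern, text)` signature lets several algorithms be timed on identical inputs.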

In the second experiment, we use the same setup as in the first experiment, but the pattern length ranges from 31 to 60. Due to the limitation of the computer word size, the SA algorithm cannot be executed for pattern lengths greater than 32. We therefore replace the SA algorithm with the QS algorithm; the results are shown in Fig. 8. As can be seen, even for relatively long patterns (pattern lengths from 31 to 60), our proposed algorithm still outperforms the others.

Figure 8. The average run time of the algorithms for pattern length of 31-60

In the third experiment, the alphabet size is again set to 4, and the pattern length is set to 16. We randomly generate sequences of length 100,000, 1,000,000, 10,000,000, and 100,000,000, and each sequence is examined by 1000 tests. The average run time is shown in Table 11. It is clear that our approach still outperforms the other algorithms.

Table 11. The average run time of the algorithms for sequences of different length

Algorithms ^BM

100000 0.000532 0.001373 0.000439 0.001094 0.000282 0.000311

1000000 0.005242 0.015249 0.006101 0.009390 0.003860 0.003120

10000000 0.051185 0.152437 0.060959 0.090130 0.035208 0.029140

100000000 0.510665 1.529660 0.603417 0.913761 0.351612 0.292099

In the fourth experiment, we consider alphabet sizes of 2, 4, 20, 64, and 128, along with randomly generated patterns of length 16 and sequences of length 10,000,000. The execution times of the algorithms are displayed in Table 12. The results show that, even for different alphabet sizes, our method is still better than the other algorithms.

Table 12. The average run time of the algorithms for different alphabet sizes

Algorithms ^BM

Alphabet size Average run time

2 0.067392 0.383488 0.062648 0.154077 0.070177 0.062764

4 0.053189 0.159070 0.065271 0.093236 0.037226 0.030833

20 0.015169 0.044394 0.063165 0.062829 0.012958 0.009852

68 0.011129 0.032011 0.060639 0.050193 0.010765 0.008677

128 0.010791 0.031107 0.060721 0.064525 0.010445 0.008339

In the fifth experiment, we compare the average number of inspections of the BLIM algorithm and our approach. We execute the BLIM algorithm, the BLIM algorithm without scan order (reading symbols directly from left to right in the window), and our approach for alphabet sizes of 2, 4, and 20, sequences of length 100,000, and pattern lengths of 10 to 16. For each case, 1,000 tests are conducted. The results are shown in Tables 13, 14, and 15. For every alphabet size and pattern length, the BLIM algorithm without scan order requires the most inspections, and the BLIM algorithm substantially reduces the number of inspections by using the scan order. The proposed BLIMq algorithm requires the fewest inspections.

Table 13. The average number of inspections for alphabet size 2

Average number of inspections (alphabet size 2)

Length of pattern | BLIM algorithm | BLIM algorithm (without scan order) | BLIMq algorithm (q = 4)
10 | 53841.3 | 96159.8 | 35154.5

Table 14. The average number of inspections for alphabet size 4

Average number of inspections (alphabet size 4)

Length of pattern | BLIM algorithm | BLIM algorithm (without scan order) | BLIMq algorithm (q = 4)
10 | 15715.3 | 69062.4 | 10407.8

Table 15. The average number of inspections for alphabet size 20

Average number of inspections (alphabet size 20)

Length of pattern | BLIM algorithm | BLIM algorithm (without scan order) | BLIMq algorithm (q = 4)
10 | 6592.24 | 45173.2 | 6418.68
11 | 6347.27 | 43587 | 6186.44
12 | 6123.14 | 42014.2 | 5964.54
13 | 5878.51 | 40351.8 | 5726.88
14 | 5827.47 | 39920.9 | 5671.71
15 | 5717.07 | 39187.5 | 5572.41
16 | 5490.37 | 37628 | 5348.67

Figure 9. The average run time of the BLIMq algorithm for various q's (alphabet size 4)

In the sixth experiment, following our analysis in Chapter 4, we examine the advantage of using q-grams for different alphabet sizes. Sequences of length 10,000,000 are randomly generated with alphabet sizes 4 and 20, and the pattern length ranges from 5 to 32. We run the tests for q = 1, 2, 3, and 4, where the BLIM algorithm can be regarded as the BLIMq algorithm with q = 1. The results are presented in Figs. 9 and 10. For alphabet size 4, the BLIMq algorithm with q = 4 reduces the run time of the BLIM algorithm by about 40%. For alphabet size 20, in the best case (q = 2), the BLIMq algorithm reduces the run time of the BLIM algorithm by approximately 20%.

Figure 10. The average run time of the BLIMq algorithm for various q's (alphabet size 20)

Finally, according to our analysis in Chapter 4, the average symbol inspection per window (ASI) depends on the length of the pattern when a suitable q is chosen by Eq. (12), and the run time of the BLIMq algorithm may increase as q increases. To test this observation, we randomly generate sequences of length 100,000,000 and patterns of length 20 with alphabet size 128. We run the tests for q from 2 through 20. The average number of inspections and the average run time are presented in Figs. 11 and 12. As can be seen, the best q is 4, for which both the number of inspections and the run time are the lowest.

Figure 11. The average inspections for various q's

Figure 12. The average run time for various q's

Chapter 6 Conclusion

In this thesis, we present a new matching algorithm, the BLIMq algorithm, which improves the BLIM algorithm by using q-grams in the matching phase. Our analysis shows that, in the best case, the time complexity of our algorithm is O(n/m), where n is the length of the sequence and m is the length of the pattern. Furthermore, we examine the efficiency of the proposed algorithm with respect to factors such as pattern length, sequence length, alphabet size, and q. The experiments show that the run time of our algorithm is about 20-40% lower than that of the BLIM algorithm.

References

[1] M. Alicherry, M. Muthuprasanna, and V. Kumar, “High speed pattern matching for network ids/ips,” In Proceedings of IEEE International Conference on Network Protocols 2006, pp. 187-196, 2006.

[2] N. Hua, H. Song, and T. V. Lakshman, “Variable-stride multi-pattern matching for scalable deep packet inspection,” In Proceedings of The 28th IEEE International Conference on Computer Communications INFOCOM 2009, Rio De Janeiro, Brazil, pp. 415-423, 2009.

[3] C. Haack and A. Jeffrey, “Pattern-matching spi-calculus,” Information and Computation, Vol. 204, No. 8, pp. 1195-1263, 2006.

[4] D. E. Knuth, J. H. Morris, and V. R. Pratt, “Fast pattern matching in strings,” SIAM Journal on Computing, Vol. 6, No. 1, pp. 323-350, 1977.

[5] J. Zheng, T. J. Close, T. Jiang, and S. Lonardi, “Efficient selection of unique and popular oligos for large EST databases,” In Proceedings of Combinatorial Pattern Matching 2003, pp. 384-401, 2003.

[6] M. E. Califf and R. J. Mooney, “Relational learning of pattern-match rules for information extraction,” In Proceedings of the 16th National Conference on AI, pp. 328-334, 1999.

[7] M. Wolverton, P. Berry, I. Harrison, J. Lowrance, D. Morley, A. Rodriguez, E. Ruspini, and J. Thomere, “LAW: A workbench for approximate pattern matching in relational data,” In Proceedings of the Fifteenth Innovative Applications of Artificial Intelligence Conference, pp. 143-150, 2003.

[8] K. Takuya, T. Masayuki, S. Ayumi, and A. Setsuo, “Shift-and approach to pattern matching in LZW compressed text,” In Proceedings of Combinatorial Pattern Matching 1999, pp. 1-13, 1999.

[9] R. Prasad, S. Agarwal, I. Yadav, and B. Singh, “A fast bit-parallel multi-patterns string matching algorithm for biological sequences,” In Proceedings of the International Symposium on Biocomputing 2010, pp. 1-4, 2010.

[10] A. V. Aho and M. J. Corasick, “Efficient string matching: an aid to bibliographic search,” Communications of the ACM, Vol. 18, No. 6, pp. 333-340, 1975.

[11] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, Vol. 20, pp. 762-772, 1977.

[12] R. Baeza-Yates and G. H. Gonnet, “A new approach to text searching,” Communications of the ACM, Vol. 35, pp. 74-82, 1992.

[13] T. Lecroq, “Fast exact string matching algorithms,” Information Processing Letters, Vol. 102, No. 6, pp. 229-235, 2007.

[14] B. Ďurian, J. Holub, H. Peltola, and J. Tarhio, “Tuning BNDM with q-grams,” In Proceedings of Algorithm Engineering and Experiments 2009, pp. 29-37, 2009.

[15] G. Navarro and M. Raffinot, “A bit-parallel approach to suffix automata: Fast extended string matching,” In Proceedings of Combinatorial Pattern Matching 1998, Springer-Verlag, pp. 14-33, 1998.

[16] M. O. Külekci, “A method to overcome computer word size limitation in bit-parallel pattern matching,” In Proceedings of the 19th International Symposium on Algorithm and Computation, volume 5369 of Lecture Notes in Computer Science, pp. 496-506, 2008.

[17] D. M. Sunday, “A very fast substring search algorithm,” Communications of the ACM, Vol. 33, No. 8, pp. 132-142, 1990.
