Compression of Goto Graph - String Matching Algorithm for Simple Regular Expressions

The G graphs, especially G₀, are likely to have a large number of states for a large signature set. Therefore, a straightforward implementation using a two-dimensional table requires a huge amount of memory space. In this section, we present a compression scheme which can significantly reduce the memory space requirement.

Consider graph G₀. (The other G graphs can be compressed similarly.) In our proposed compression scheme, states are classified according to the number of child states. State P is said to be a branch state, a single-child state, or a leaf state, if it has at least two child states, exactly one child state, or no child state, respectively.

A single-child state P is a first single-child state if its parent state is a branch state.

Finally, state P is said to be an explicit state if it is the start state, a branch state, a first single-child state, a final state, a fork state, or a fragment-end state, i.e., a state represented by R_i for some i, n₁+1 ≤ i ≤ n. We store all strings in Z₀ and some data structures for the explicit states. Note that every final state, fork state, and fragment-end state has to be a branch state, a single-child state, or a leaf state.
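The classification above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The helper names `build_trie`, `classify`, and `explicit_states` are hypothetical, the pattern set {he, she, his, hers} is used only as an example, and fork states and fragment-end states are left out because they depend on definitions given elsewhere in the paper.

```python
def build_trie(patterns):
    """Return (children, final): children[s] maps a symbol to a child state id."""
    children = [{}]          # state 0 is the start state
    final = set()
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in children[state]:
                children.append({})
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        final.add(state)
    return children, final


def classify(children):
    """Label each state by its number of child states."""
    kinds = []
    for kids in children:
        if len(kids) >= 2:
            kinds.append("branch")
        elif len(kids) == 1:
            kinds.append("single-child")
        else:
            kinds.append("leaf")
    return kinds


def explicit_states(children, final):
    """Start, branch, first single-child, and final states.

    Assumption: fork and fragment-end states are omitted in this sketch.
    """
    kinds = classify(children)
    explicit = {0} | set(final)
    for parent, kids in enumerate(children):
        if kinds[parent] == "branch":
            explicit.add(parent)
            for child in kids.values():
                if kinds[child] == "single-child":
                    explicit.add(child)   # first single-child state
    return explicit


children, final = build_trie(["he", "she", "his", "hers"])
kinds = classify(children)
explicit = explicit_states(children, final)
print(kinds)
print(sorted(explicit))
```

With this pattern set the trie has ten states, and the two states representing sh and her are the only ones that are not explicit.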

The strings in Z₀ are stored contiguously in a compacted_G₀_strings file.

For example, if Z₀ = {he, she, his, hers}, then the compacted_G₀_strings file is simply heshehishers. Similar to the AC-bnfa scheme adopted by Snort, branch states are further classified into Branch_2, Branch_3, Branch_4, Branch_5, and Branch_256 states. State P is a Branch_i (2 ≤ i ≤ 5) state if it has exactly i child states. For such a state, we store i pairs of (symbol, next state). If state P has more than five child states, it is classified as a Branch_256 state and we sequentially store 256 next states corresponding to the 256 possible input symbols.

Note that the next state could be the END state for some input symbols. Our experimental results show that there are only a small number of Branch_256 states and that, for most Branch_256 states, the number of child states is much larger than five.
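The storage layout described above can be sketched in Python as follows. This is an illustrative sketch only: `compact_strings` and `encode_branch` are hypothetical names, and `None` is used as a placeholder for table slots that have no child state (the paper stores a real next state, possibly the END state, in every slot).

```python
def compact_strings(patterns):
    """Concatenate patterns; return the blob and each pattern's start offset."""
    offsets, pos = [], 0
    for pat in patterns:
        offsets.append(pos)
        pos += len(pat)
    return "".join(patterns), offsets


def encode_branch(transitions):
    """Choose a Branch_i layout for a branch state.

    transitions: dict mapping a byte value (0..255) to a next-state id.
    """
    n = len(transitions)
    if n < 2:
        raise ValueError("a branch state has at least two child states")
    if n <= 5:
        # Branch_2 .. Branch_5: store n (symbol, next state) pairs.
        return f"Branch_{n}", sorted(transitions.items())
    # Branch_256: one slot per input symbol.
    # Assumption: None stands in for "no child" in this sketch.
    table = [transitions.get(b) for b in range(256)]
    return "Branch_256", table


blob, offsets = compact_strings(["he", "she", "his", "hers"])
print(blob, offsets)
print(encode_branch({ord("h"): 1, ord("s"): 3}))
```

A state with six or more children falls through to the 256-entry table, which trades some memory for a single indexed look-up per input symbol.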

By storing all the 256 next states, we waste a little memory space but achieve high-speed look-up for state transition.

Assume that state P is a single-child state with representing string S^P. Let A_i be the first string in Z₀ which contains S^P as a prefix. For state P, we store (position, distance), where position is the position of the |S^P|-th byte of A_i in the compacted_G₀_strings file and distance is the number of bytes from state P to its nearest descendant explicit state, i.e., the explicit state whose representing string is the shortest one which contains S^P as a proper prefix.
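The (position, distance) pair can be computed as in the following Python sketch, which is a simplified illustration rather than the implementation used in the experiments. Several choices here are assumptions: all function names are hypothetical, positions are 0-based offsets into the compacted string, fork and fragment-end states are ignored, and A_i is taken to be the first string that contains S^P as a proper prefix, so that the bytes following position always spell the path to the next explicit state.

```python
def build_trie(patterns):
    """Return children maps, representing strings, and the set of final states."""
    children, rep, final = [{}], [""], set()
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in children[state]:
                children.append({})
                rep.append(rep[state] + ch)
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        final.add(state)
    return children, rep, final


def explicit_set(children, final):
    """Start, branch, first single-child, and final states (simplified)."""
    explicit = {0} | set(final)
    for state, kids in enumerate(children):
        if len(kids) >= 2:
            explicit.add(state)
            explicit.update(c for c in kids.values() if len(children[c]) == 1)
    return explicit


def position_distance(state, patterns, children, rep, explicit):
    """Return (position, distance) for a single-child state."""
    if len(children[state]) != 1:
        raise ValueError("only defined for single-child states")
    prefix = rep[state]
    offset = 0
    for pat in patterns:
        # Assumption: require a proper prefix so the path bytes follow position.
        if len(pat) > len(prefix) and pat.startswith(prefix):
            break
        offset += len(pat)
    position = offset + len(prefix) - 1      # 0-based offset of the last byte of S^P
    distance, cur = 0, state
    while True:                              # follow the chain of only children
        cur = next(iter(children[cur].values()))
        distance += 1
        if cur in explicit:
            return position, distance


patterns = ["he", "she", "his", "hers"]
blob = "".join(patterns)
children, rep, final = build_trie(patterns)
explicit = explicit_set(children, final)
for state in sorted(explicit):
    if len(children[state]) == 1:
        pos, dist = position_distance(state, patterns, children, rep, explicit)
        print(state, rep[state], pos, dist, blob[pos + 1:pos + 1 + dist])
```

For the state representing s, the sketch yields position 2 and distance 2, and the two bytes after that position in heshehishers are he, which leads to the final state for she.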

Finally, for each leaf state, we basically store nothing but an identifier to indicate that it is a leaf state. Of course, every explicit state needs flags to indicate whether or not it is a final state, a fork state, and/or a fragment-end state. For a final state, we need to store the identification of the matched signature(s). For a fork state, the minimum and the maximum values as well as the starting state of some T graph to be traversed are stored. Note that the number of states on the compressed G₀ graph is equal to the number of explicit states, which normally is much smaller than the number of states in the original G₀ graph constructed with Algorithm AC1. As a result, the memory requirement is significantly reduced.

Since every T graph is constructed with a single string, its memory space requirement is small. Strictly speaking, we do not create states for T graphs at all, since no explicit state arises during transitions in a T graph. All we need to store is an array of input symbols, the failure function, and the counter increment applied when the failure function is consulted.

Chapter 6. Experimental Result

In this section, we compare the performance of our proposed signature matching system with that of the ClamAV implementation and its enhancement [?].

Both throughput performance and memory requirement are compared. Programs are coded in C++ and the experiments are conducted on a PC with an Intel Pentium 4 CPU operated at 2.02GHz with 1.75GB of RAM.

We traced the ClamAV implementation, extracted the ideas, and rewrote the code for our experiments. In the ClamAV implementation, a trie of height two is constructed for the first two bytes of all patterns based on the AC pattern matching machine. Effectively, patterns are grouped based on their first two bytes. The failure function for non-leaf states is eliminated because the next move function δ is adopted. The next move function δ is defined as δ(P, σ) = g(P, σ) if g(P, σ) ≠ fail, or δ(P, σ) = δ(f(P), σ) otherwise. When the first two bytes of some group are matched, a sequential search is performed for all patterns in the group.

Different from our proposed scheme, a regular expression is fragmented by the three operators *, ?, and {min, max}. A data structure is maintained to indicate up to which fragment a regular expression has been matched and the position in the text of the last matched fragment. Consider a regular expression which consists of k fragments. Assume that the first e fragments have been matched and the e-th fragment ends at the i-th position of the text. Assume further that another fragment is matched at the j-th position. This newly matched fragment is discarded if it is not the (e+1)-th fragment or if i and j do not satisfy the condition specified by the operator which separates the e-th and the (e+1)-th fragments. As an example, consider a regular expression RE = sre₁ ? sre₂ {2,4} sre₃ {3,5} sre₄. Assume that the first fragment sre₁ was matched at the i-th position of the text.

If the second fragment sre₂ is matched at the (i+|sre₂|+1)-th position, then the data structure will be updated to indicate that the first two fragments are matched and that the second fragment is matched at the (i+|sre₂|+1)-th position.

Assume that a fragment is further found at the j-th position. The data structure is then updated only if it is the third fragment sre₃ and j satisfies 2 ≤ j − i − |sre₂| − |sre₃| − 1 ≤ 4. Otherwise, the newly matched fragment is discarded and the data structure remains intact.

Note that, strictly speaking, the ClamAV implementation may result in false negatives. For example, consider the same regular expression RE = sre₁ ? sre₂ {2,4} sre₃ {3,5} sre₄ and assume that the input text is sre₁ sre₁ a sre₂ abc sre₃ abcd sre₄ (written with spaces only for readability). There is obviously a match starting at the (|sre₁|+1)-th position. However, the ClamAV implementation does not detect the match because the second sre₁ will be discarded when it is found.
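The false negative can be reproduced with a small Python model of the bookkeeping described above. This is a simplified model of that description, not ClamAV's actual code. The function names and the concrete fragment strings W1, X2, Y3, and Z4 are hypothetical stand-ins for sre₁ through sre₄, and Python's `re` module serves only as a reference matcher.

```python
import re


def find_events(text, fragments):
    """All fragment occurrences as (end_position, fragment_index, length).

    End positions are 1-based; events are sorted in text order.
    """
    events = []
    for k, frag in enumerate(fragments):
        start = text.find(frag)
        while start != -1:
            events.append((start + len(frag), k, len(frag)))
            start = text.find(frag, start + 1)
    return sorted(events)


def single_state_tracker(text, fragments, gaps):
    """Keep one (matched count, last end position) record per expression."""
    matched, last_end = 0, 0
    for end, k, length in find_events(text, fragments):
        if matched == len(fragments):
            break
        if k != matched:
            continue                      # not the (e+1)-th fragment: discard
        if matched > 0:
            lo, hi = gaps[matched - 1]
            gap = end - length - last_end # bytes between the two fragments
            if not lo <= gap <= hi:
                continue                  # gap condition violated: discard
        matched, last_end = matched + 1, end
    return matched == len(fragments)


def reference_match(text, fragments, gaps):
    """Ground truth using an ordinary regular expression search."""
    parts = [re.escape(fragments[0])]
    for (lo, hi), frag in zip(gaps, fragments[1:]):
        parts.append(".{%d,%d}" % (lo, hi) + re.escape(frag))
    return re.search("".join(parts), text, re.DOTALL) is not None


fragments = ["W1", "X2", "Y3", "Z4"]      # hypothetical sre1..sre4
gaps = [(1, 1), (2, 4), (3, 5)]           # ?, {2,4}, {3,5}
text = "W1" + "W1" + "a" + "X2" + "abc" + "Y3" + "abcd" + "Z4"
print(single_state_tracker(text, fragments, gaps))
print(reference_match(text, fragments, gaps))
```

In this model the tracker commits to the first W1, discards the second one, and then rejects X2 because the gap measured from the first W1 is three bytes instead of one, while the reference matcher finds the occurrence that starts at the second W1.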

The performance of the ClamAV implementation can be improved by using a variable-height trie [16]. The variable-height trie requires more memory space for larger maximum heights. It was found that a trie with maximum height three is a good tradeoff between throughput performance and space requirement. Therefore, we shall compare our proposed system with tries of maximum height two and three.

Figure 5. Performance comparison of ClamAV implementation and our proposed signature matching system for clean files of various sizes.

Figure 5 compares the CPU execution time for randomly generated files of various sizes that contain no signature occurrence. It can be seen that the CPU execution time is proportional to the file size. The CPU time required by the ClamAV implementation is about 20 times that required by our proposed system. Figure 6 illustrates a similar comparison with a string S₁abcdeS₁′ inserted into a randomly generated file of size slightly larger than 2M bytes to match a signature (W32.Gop) of the form RE = S₁*S₁′. Again, the CPU execution time required by the ClamAV implementation is about 20 times that required by our proposed system. We also conducted simulations with a string S₁abcdeS₁′ inserted at various positions to match a signature (DOS.Bg-2) of the form RE = S₁{1,6}S₁′. The results are similar. We expect the performance improvement to become larger as the number of signatures increases. The reason is that, in the ClamAV implementation, the number of strings in a group with identical first two bytes increases as the number of signatures increases. Since the ClamAV implementation performs a sequential search over the strings in the same group, it consumes more CPU time to find the match in a larger group.

As for the memory requirement, the ClamAV implementation uses 362K bytes and our proposed system uses about 1.94M bytes. The pre-filter requires 128K bytes and the verification module needs 1.8M bytes. There are 2,486 final states and, therefore, the output function takes about 5K bytes. (In our implementation, we use two bytes for each signature ID.) We believe the amount of memory required by our proposed signature matching system is acceptable for practical systems.

Figure 6. Performance comparison of ClamAV implementation and our proposed signature matching system with a string S₁abcdeS₁′ inserted at various positions in the file.

Chapter 7. Conclusion

We have presented in this paper a systematic approach to construct a signature matching system for simple regular expressions which are used to define virus/worm signatures in ClamAV. Like the Aho-Corasick algorithm, the verification module of our proposed system is dictated by three functions, namely, the goto, failure, and output functions. Experimental results using ClamAV signatures show that, compared with the ClamAV implementation and its enhancement, our proposed system achieves much better throughput performance while requiring an acceptable amount of memory.

The work presented in this paper provides some guidelines for writing signatures. For example, the non-overlapping condition is very important in reducing the space complexity. If the non-overlapping condition has to be violated, one should minimize the number of * operators in the overlapping signatures. As another example, the throughput performance can be improved considerably by using long pre-filter patterns. Extension of our work to other types of signatures is an interesting and useful topic for further research.

Appendix: The Aho-Corasick Algorithm

The Aho-Corasick (AC) algorithm is dictated by three functions: a goto function g, a failure function f, and an output function output. Fig. A.1 shows the three functions for the pattern set Y = {he, she, his, hers} [9].


Fig. A.1. (a) goto function, (b) failure function, and (c) output function for Y = {he, she, his, hers}.

Some definitions are needed. Let S₁S₂ represent the concatenation of strings S₁ and S₂. We say S₁ is a prefix and S₂ is a suffix of the string S₁S₂. Moreover, S₁ is a proper prefix if S₂ is not empty. Likewise, S₂ is a proper suffix if S₁ is not empty. One state, numbered 0, is designated as the start state. String S^P is said to represent state P on a goto graph if the shortest path from the start state to state P spells out S^P. For example, string her represents state 8 in Fig. A.1. The start state is represented by the empty string ε. The length of string S is represented by |S|.

Note that there might be a self-loop at the start state of a goto graph. However, the graph becomes a tree after removing the self-loop, if it exists. In the following definitions, we ignore the self-loop. We call state P₁ the parent of state P₂ and state P₂ the child of state P₁ if there exists a symbol σ such that g(P₁, σ) = P₂. State P₂ is said to be a descendant of state P₁ and state P₁ an ancestor of state P₂ if S^P₁ is a proper prefix of S^P₂. The tree which consists of state P and all its descendant states is called the sub-tree of P.

The goto function g maps a pair (state, input symbol) into a state or the message fail.

For the example shown in Fig. A.1, we have g(0, h) = 1 and g(1, σ) = fail if σ is not e or i. State 0 is a special state which never results in the fail message. With this property, one input symbol is processed by the AC algorithm in every operation cycle.

The failure function f maps a state into a state and is consulted when the outcome of the goto function is the fail message. We have f(P₁) = P₂ if and only if (iff) S^P₂ is the longest proper suffix of S^P₁ that is also a prefix of some pattern. The output function maps a state into a set (could be empty) of patterns. The set output(P) contains a pattern if the pattern is a suffix of S^P.

Let P₁ be the current state and σ the current input symbol. Also, let T denote the input string. Initially, the start state is assigned as the current state and the first symbol of T is the current input symbol. An operation cycle of the AC algorithm is defined as follows.

1. If g(P₁, σ) = P₂, the algorithm makes a state transition such that state P₂ becomes the current state and the next symbol in T becomes the current input symbol. If output(P₂) ≠ ∅, the algorithm emits the set output(P₂). The operation cycle is complete.

2. If g(P₁, σ) = fail, the algorithm makes a failure transition by consulting the failure function f. Assume that f(P₁) = P₂. The algorithm repeats the cycle with P₂ as the current state and σ as the current input symbol.
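The operation cycle can be sketched in Python as follows. The tables are transcribed from the well-known example for {he, she, his, hers} in [9]; since the figure itself is not reproduced in this text, the state numbering should be treated as an assumption. The names `GOTO`, `FAILURE`, `OUTPUT`, `g`, and `scan` are illustrative.

```python
# Goto, failure, and output tables for Y = {he, she, his, hers}
# (assumed state numbering, following the example in [9]).
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}
FAIL = None


def g(state, symbol):
    """Goto function; state 0 never returns the fail message."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else FAIL


def scan(text):
    """Return (1-based end position, sorted matched patterns) for each hit."""
    state, hits = 0, []
    for pos, symbol in enumerate(text, start=1):
        while g(state, symbol) is FAIL:   # step 2: failure transitions
            state = FAILURE[state]
        state = g(state, symbol)          # step 1: goto transition
        if state in OUTPUT:
            hits.append((pos, sorted(OUTPUT[state])))
    return hits


print(scan("ushers"))
```

Scanning ushers reports he and she ending at position 4 and hers ending at position 6, with one failure transition (from state 5 to state 2) on the symbol r.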

It can be shown that the maximum number of state transitions is 2n−1 for scanning if |T| = n. This number can be reduced to n if the next move function δ is adopted. The next move function is defined as δ(P, σ) = g(P, σ) if g(P, σ) ≠ fail, or δ(P, σ) = δ(f(P), σ) otherwise.
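The recursive definition of δ translates directly into a memoized Python function. The sketch below again assumes the goto and failure tables of the {he, she, his, hers} example in [9]; the names `delta` and `run` are illustrative.

```python
from functools import lru_cache

# Assumed tables for Y = {he, she, his, hers}, following the example in [9].
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}


def g(state, symbol):
    """Goto function; None plays the role of the fail message."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else None


@lru_cache(maxsize=None)
def delta(state, symbol):
    """Next move function: delta(P, s) = g(P, s), or delta(f(P), s) on fail."""
    nxt = g(state, symbol)
    if nxt is not None:
        return nxt
    return delta(FAILURE[state], symbol)


def run(text):
    """Visit exactly one state per input symbol."""
    state, visited = 0, []
    for symbol in text:
        state = delta(state, symbol)
        visited.append(state)
    return visited


print(run("ushers"))
```

The recursion terminates because each failure transition leads to a state with a shorter representing string, and state 0 never fails. Each input symbol then costs one look-up once δ has been computed.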

The procedures to construct the goto, failure, and output functions are described in Algorithms AC1 and AC2 below [9]. The goto function and the failure function are constructed, respectively, in Algorithms AC1 and AC2. The output function is partially constructed in Algorithm AC1 and completed in Algorithm AC2.

Algorithm AC1. Construction of the goto function.

Input. Set of keywords Y = {y₁, y₂, ..., y_k}.

Output. Goto function g and a partially computed output function output.

Method. We assume output(P)=∅ when state P is first created, and g(P, σ) = fail if σ is undefined or if g(P,σ) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.

begin
    newstate ← 0
    for i ← 1 until k do enter(y_i)
    for all σ such that g(0, σ) = fail do g(0, σ) ← 0
end

procedure enter(a₁a₂...a_m):
begin
    state ← 0; j ← 1
    while g(state, a_j) ≠ fail do
    begin
        state ← g(state, a_j)
        j ← j + 1
    end
    for p ← j until m do
    begin
        newstate ← newstate + 1
        g(state, a_p) ← newstate
        state ← newstate
    end
    output(state) ← {a₁a₂...a_m}
end
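Algorithm AC1 can be rendered in Python roughly as follows. This is a sketch under stated assumptions: the goto function is held in a dictionary keyed by (state, symbol), the self-loop at state 0 is left implicit rather than stored for every symbol, and a bound check is added to the while loop for keywords that are prefixes of an existing path. The name `build_goto` is illustrative.

```python
def build_goto(keywords):
    """Construct the goto function and the partially computed output function.

    Returns (goto, output), where goto maps (state, symbol) to a state and
    output maps each state to a set of keywords.
    """
    goto = {}
    output = {0: set()}
    newstate = 0
    for word in keywords:                 # procedure enter(a1 a2 ... am)
        state, j = 0, 0
        # Follow the existing path as far as it spells out the keyword.
        while j < len(word) and (state, word[j]) in goto:
            state = goto[(state, word[j])]
            j += 1
        # Create new states for the remaining symbols.
        for symbol in word[j:]:
            newstate += 1
            goto[(state, symbol)] = newstate
            output[newstate] = set()
            state = newstate
        output[state].add(word)
    return goto, output


goto, output = build_goto(["he", "she", "his", "hers"])
print(goto)
print(output)
```

At this stage output(5) contains only she; the keyword he is added to it when the failure function is constructed in Algorithm AC2.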

Algorithm AC2. Construction of the failure function.

Input. Goto function g and output function output from Algorithm AC1.

Output. Failure function f and output function output.

Method.

begin
    queue ← empty
    for each σ such that g(0, σ) = P ≠ 0 do
    begin
        queue ← queue ∪ {P}
        f(P) ← 0
    end
    while queue ≠ empty do
    begin
        let R be the next state in queue
        queue ← queue − {R}
        for each σ such that g(R, σ) = P ≠ fail do
        begin
            queue ← queue ∪ {P}
            state ← f(R)
            while g(state, σ) = fail do state ← f(state)
            f(P) ← g(state, σ)
            output(P) ← output(P) ∪ output(f(P))
        end
    end
end
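A Python sketch of Algorithm AC2 is given below, together with a compact version of the goto construction so that the block is self-contained. As before, the dictionary representation, the implicit self-loop at state 0, and the function names are assumptions made for illustration only.

```python
from collections import deque


def build_goto(keywords):
    """Compact goto construction (Algorithm AC1) used as input here."""
    goto, output, newstate = {}, {0: set()}, 0
    for word in keywords:
        state = 0
        for symbol in word:
            if (state, symbol) not in goto:
                newstate += 1
                goto[(state, symbol)] = newstate
                output[newstate] = set()
            state = goto[(state, symbol)]
        output[state].add(word)
    return goto, output


def build_failure(goto, output):
    """Compute the failure function breadth-first and complete output in place."""
    children = {}
    for (state, symbol), target in goto.items():
        children.setdefault(state, []).append((symbol, target))

    def g(state, symbol):
        if (state, symbol) in goto:
            return goto[(state, symbol)]
        return 0 if state == 0 else None   # state 0 never fails

    failure = {}
    queue = deque()
    for _, child in children.get(0, []):   # depth-1 states fail to state 0
        failure[child] = 0
        queue.append(child)
    while queue:
        r = queue.popleft()
        for symbol, p in children.get(r, []):
            queue.append(p)
            state = failure[r]
            while g(state, symbol) is None:
                state = failure[state]
            failure[p] = g(state, symbol)
            output[p] |= output[failure[p]]
    return failure


goto, output = build_goto(["he", "she", "his", "hers"])
failure = build_failure(goto, output)
print(failure)
print(output[5])
```

The breadth-first order guarantees that f(R) is already known for every state R taken from the queue, which is what the inner while loop relies on.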

References

[1] Clam anti virus signature database, www.clamav.net.

[2] SNORT system, www.snort.org.

[3] D. Moore, C. Shannon, and J. Brown, “Code-Red: a case study on the spread and victims of an Internet worm,” in Proc. ACM/USENIX Internet Measurement Workshop, France, Nov. 2002.

[4] CAIDA, Dynamic graphs of the Nimda worm, http://www.caida.org/dynamic/analysis/security/nimda.

[5] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, “Inside the Slammer worm,” IEEE Security and Privacy, 1(4): 33-39, July 2003.

[6] S. E. Schechter, J. Jung, and A. W. Berger, “Fast detection of scanning worm infections,” 7th International Symposium on Recent Advances in Intrusion Detection (RAID), French Riviera, September 2004.

[7] D. E. Knuth, J. H. Morris, and V. R. Pratt, “Fast pattern matching in strings,” TR CS-74-440, Stanford University, Stanford, California, 1974.

[8] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, Vol. 20, October 1977, pp. 762-772.

[9] A. V. Aho and M. J. Corasick, “Efficient string matching: an aid to bibliographic search,” Communications of the ACM, Vol. 18, June 1975, pp. 333-340.

[10] S. Wu and U. Manber, “A fast algorithm for multi-pattern searching,” Technical Report, May 1994.

[11] K. Thompson, “Programming techniques: Regular expression search algorithm,” Commun. ACM, 11(6):419-422, 1968.

[12] V. M. Glushkov, “The abstract theory of automata,” Russian Mathematical Surveys, 16:1-53, 1961.

[13] R. W. Floyd and J. D. Ullman, “The compilation of regular expression into integrated circuits,” Journal of ACM, vol. 29, no. 3, pp. 603-622, July 1982.

[14] R. Sidhu and V. Prasanna, “Fast regular expression matching using FPGAs,” in Field-Programmable Custom Computing Machines (FCCM), April 2001.

[15] C. R. Clark and D. E. Schimmel, “Efficient reconfigurable logic circuit for matching complex network intrusion detection patterns,” Proceedings of 13th International Conference on Field Programmable Logic and Applications, 2003.

[16] Y. Miretskiy, A. Das, C. P. Wright, and E. Zadok, “Avfs: An on-access anti-virus file system,” USENIX Security Symposium, 2004.

[17] I. Sourdis and D. Pnevmatikatos, “Pre-decoded CAMs for efficient and high-speed NIDS pattern matching,” IEEE Symposium on Field Programmable Custom Computing Machines, Napa, CA, 2004.

[18] P. Sutton, “Partial character decoding for improved regular expression matching in FPGAs,” in IEEE International Conference on Field Programmable Technology (FPT), Dec. 2004.

[19] S. Yusuf and W. Luk, “Bitwise optimized CAM for network intrusion detection systems,” Proceedings of 15th International Conference on Field Programmable Logic and Applications, 2005.

[20] B. C. Brodie, D. E. Taylor, and R. K. Cytron, “A scalable architecture for high-throughput regular-expression pattern matching,” ISCA, 2006.

[21] C. H. Lin, C. T. Huang, and S. C. Chang, “Optimization of regular expression pattern matching circuits on FPGA,” in Proc. of Conference on Design, Automation and Test in Europe, 2006.

[22] J. Moscola, Y. H. Cho, and J. W. Lockwood, “A scalable hybrid regular expression pattern matcher,” in Field-Programmable Custom Computing Machines (FCCM), 2006.

[23] J. C. Bispo, I. Sourdis, J. M. Cardoso, and S. Vassiliadis, “Regular expression matching for reconfigurable packet inspection,” in IEEE International Conference on Field Programmable Technology (FPT), Dec. 2006.

[24] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, “Fast and memory-efficient regular expression matching for deep packet inspection,” in Proc. of Architectures for Networking and Communications Systems (ANCS), pp. 93-102, 2006.

[25] M. Alicherry, M. Muthuprasanna, and V. Kumar, “High speed pattern matching for network IDS/IPS,” IEEE International Conference on Network Protocols, 2006.

[26] T. H. Lee, “Generalized Aho-Corasick algorithm for signature based anti-virus applications,” IEEE ICCCN 2007.

[27] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. P. Markatos, and S. Ioannidis, “Gnort: High performance network intrusion detection using graphics processors,” in Recent Advances in Intrusion Detection (RAID), 2008.

[28] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, “Algorithms to accelerate multiple regular expressions matching for deep packet inspection,” ACM SIGCOMM 2006.

[29] J. Rejeb and M. Srinivasan, “Extension of Aho-Corasick algorithm to detect injection attacks,” SCSS (1) 2007.

[30] R. Smith, C. Estan, and S. Jha, “XFA: Fast signature matching with extended automata,” In IEEE Symposium on Security and Privacy, May 2008.

[31] T. H. Lee and N. L. Huang, “An efficient and scalable pattern matching scheme for network security applications,” IEEE ICCCN Workshop, 2008.

[32] N. Schear, D. Albrecht, and N. Borisov, “High-speed matching of vulnerability signatures,” in Recent Advances in Intrusion Detection (RAID), 2008.

[33] M. Norton, “Optimizing pattern matching for intrusion detection,” http://docs.idsresearch.org/OptimizingPatternMatchingForIDS.pdf, July 2004.

[34] J. E. Hopcroft and J. D. Ullman, “Introduction to automata theory, languages, and computation,” 2nd edition, Addison-Wesley, 2001.
