Existing Works and Literature Background - String-Matching Algorithms for Accelerating Deep Packet Inspection

A multiple-string matching algorithm searches the text T = t_1 t_2 ... t_n for occurrences of the patterns in a pattern set P = {P_1, P_2, ..., P_r} over the same alphabet Σ, where r is the number of patterns. We use m to denote the shortest pattern length and assume |Σ| = 256 (the number of values a byte can take). Table 3.1 summarizes the notation used in this paper.

Table 3.1: Important notations throughout this paper.

Notation         Description
P                The pattern set.
P⁰               The set of pattern prefixes under consideration during pre-processing and scanning.
P_i              The i-th pattern in the pattern set.
P_i[j ... k]     The substring from the j-th to the k-th character of P_i.
Σ                The character set. |Σ| = 256 in this paper.
r                The number of patterns in the pattern set.
n                The text length.
m                The shortest pattern length in the pattern set. Also the length of the search window.
b                The block size. b = 4 in this paper.
s                The shift value.
v                The size of the bit vector in a Bloom filter.
BF(G_j)          The Bloom filter storing the group G_j.
B_j              The block that lies j characters back from the last character in the search window.

3.2.1 String Matching Algorithms

The Aho-Corasick (AC) algorithm [AC75] feeds the input characters one by one into a finite automaton that accepts the patterns in the pattern set, so its time complexity is O(n). A match is reported whenever a final state is reached. Such automaton-based approaches, built on either a Deterministic Finite Automaton (DFA) or a Non-deterministic Finite Automaton (NFA), are common because they represent patterns flexibly [Tar06, Cav05] and because their deterministic execution time makes them robust against algorithmic attacks. The transition table of an automaton can be compressed to reduce the memory requirement [TSC04, Nor04]. Given the wide data buses of modern architectures, however, tracking one character at a time is inefficient.

Several designs determine the next state after reading a whole block of characters to boost performance [SIH04, DL05], but they have two drawbacks. (1) Compressing the transition table may require intricate techniques, if it is feasible at all, because the table grows with the block size. (2) Because a signature may not start at a block boundary, the match engine must be replicated, with each copy offset by one more character from the block boundary [DL05].
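The character-at-a-time automaton scan described above can be illustrated with a minimal Python sketch. It is not the implementation from [AC75] or from any of the cited designs: the function names `build_automaton` and `search` are hypothetical, the transition table is an uncompressed list of dictionaries, and failure links are followed at scan time rather than being precomputed into a full DFA.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure and output tables for a list of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> indices of patterns that end in this state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first traversal: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, patterns):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for pos, ch in enumerate(text):   # one input character per step
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            matches.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return matches
```

Each text character is consumed exactly once, which is the source of both the O(n) bound and the inefficiency on wide data buses. The sketch works on `str` or `bytes` input alike.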

The Boyer-Moore (BM) algorithm [BM77] was the first to skip characters that cannot be part of a match, using algorithmic heuristics that are illustrated in [BMI].

Among the heuristics of the BM algorithm and its derivatives, we specifically mention the bad-character heuristic for its relevance to our work. This heuristic compares characters backward, one by one, starting from the end of the search window, until either a mismatched character is found or the entire pattern is matched. On a mismatch, the heuristic looks up a table to decide the shift distance of the window, which depends on whether the mismatched character occurs in the pattern and, if so, at which position. With a large pattern set, however, the shift distance decreases significantly because a character is very likely to appear in one of the patterns.
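A minimal single-pattern sketch of the bad-character heuristic follows, in Python. It assumes only this one heuristic is used (the full BM algorithm combines it with others), and the function name `bad_character_search` and the table name `last` are hypothetical.

```python
def bad_character_search(text, pattern):
    """Return the start positions of pattern in text, using only the
    bad-character heuristic to decide how far the window may shift."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Rightmost position of every character that occurs in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    matches = []
    start = 0                      # left edge of the search window
    while start <= n - m:
        j = m - 1                  # compare backward from the window's end
        while j >= 0 and pattern[j] == text[start + j]:
            j -= 1
        if j < 0:
            matches.append(start)  # the entire pattern matched
            start += 1
        else:
            bad = text[start + j]
            # Align the rightmost occurrence of the mismatched character
            # with the text; characters absent from the pattern map to -1,
            # so the window moves entirely past them. Shift by at least 1.
            start += max(1, j - last.get(bad, -1))
    return matches
```

For example, `bad_character_search("abracadabra", "abra")` finds the occurrences at positions 0 and 7. When the mismatched character does not occur in the pattern, the window jumps past it in one step, which is exactly the benefit that shrinks as more patterns (and hence more characters) are added to the set.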

The Wu-Manber (WM) algorithm matches a block of characters instead of a single character, which greatly reduces the chance that a block appears in the patterns. The algorithm assumes equal pattern lengths; if the lengths differ, it considers only the first m characters of each pattern during pre-processing and scanning. The search window of m characters slides along the text during scanning according to the following heuristic: if the rightmost block of b characters in the search window appears in none of the patterns, the window can safely shift by the maximum of m − b + 1 characters without missing any match; otherwise, the shift value is m − j, where the rightmost occurrence of the block in the patterns ends at position j. If the shift value is 0, i.e., the block is the suffix of some pattern, the window is verified to see whether a true match occurs.

The algorithm builds a shift table that stores the shift values and is indexed by the rightmost block. Different blocks may be mapped to the same table entry, which is then filled with the minimum of their shift values. This mapping saves table space at the cost of smaller shift values. The worst-case performance of the WM algorithm can be poor. For example, if a pattern is aaaaa and the text consists entirely of a's, the search window cannot skip any character. In that case the time complexity is O(mn), because verification takes O(m) time at every text position. Nonetheless, variants of the algorithm can be found in popular software, such as ClamAV (www.clamav.net) for anti-virus.
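The block-based shift rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original WM implementation: the names `wm_build` and `wm_search` are hypothetical, the shift table is a plain dictionary keyed by the block itself (so distinct blocks never share an entry), the default block size is b = 2 to keep small examples readable, and verification simply compares each candidate pattern against the text.

```python
from collections import defaultdict

def wm_build(patterns, b=2):
    """Build the shift table and the verification buckets."""
    m = min(len(p) for p in patterns)     # shortest pattern = window length
    assert m >= b, "every pattern must hold at least one block"
    default_shift = m - b + 1             # block occurs in no pattern
    shift = {}
    buckets = defaultdict(list)           # suffix block -> candidate patterns
    for p in patterns:
        prefix = p[:m]                    # only the first m characters count
        for j in range(b, m + 1):         # block ends at position j (1-based)
            block = prefix[j - b:j]
            shift[block] = min(shift.get(block, default_shift), m - j)
        buckets[prefix[m - b:]].append(p)
    return m, default_shift, shift, buckets

def wm_search(text, patterns, b=2):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    m, default_shift, shift, buckets = wm_build(patterns, b)
    matches = []
    pos = 0                               # left edge of the search window
    while pos + m <= len(text):
        block = text[pos + m - b:pos + m] # rightmost block of the window
        s = shift.get(block, default_shift)
        if s == 0:
            # The block ends some pattern prefix: verify the candidates.
            for p in buckets.get(block, []):
                if text.startswith(p, pos):
                    matches.append((pos, p))
            pos += 1
        else:
            pos += s
    return matches
```

With the pattern aaaaa and a text of all a's, every window yields a shift of 0, so the sketch verifies at every position and exhibits the O(mn) behavior described above.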

A Bloom filter compactly stores the patterns in a v-bit vector for membership queries [Blo70]. For each pattern X, the filter sets to 1 the bits addressed by the k hash values h_1(X), h_2(X), ..., h_k(X), each ranging from 0 to v − 1. To test whether a substring W of the text is a pattern, a membership query looks up the bits addressed by W's hash values. If any of these bits is unset, W cannot be in the pattern set; otherwise, W is a candidate, and verification follows to determine whether a true match occurs. Verification is needed because the checked bits may all have been set by other patterns, which produces a false positive. Choosing v and k properly keeps the false-positive rate under control.
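The insert and query operations can be sketched in Python as below. The class name `BloomFilter` and its method names are hypothetical, and deriving the k hash functions from SHA-256 with a one-byte index prefix is an assumption made for brevity (it limits k to at most 256); hardware designs would use much cheaper hash functions.

```python
import hashlib

class BloomFilter:
    """A v-bit Bloom filter with k hash functions over byte strings."""

    def __init__(self, v, k):
        self.v = v
        self.k = k
        self.bits = bytearray((v + 7) // 8)   # v bits, all initially 0

    def _positions(self, item):
        # h_1(item) ... h_k(item), each in the range 0 .. v - 1.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.v

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means definitely absent; True means verification is needed.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter(v=1024, k=3)
for pattern in (b"attack", b"exploit"):
    bf.add(pattern)
print(bf.might_contain(b"attack"))   # True: every stored pattern passes
```

A stored pattern always passes the query, so the filter never misses a match; a `True` answer for any other string is a false positive that the subsequent verification step must rule out.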

3.2.2 Hardware Accelerators

String-matching hardware accelerators either hardwire the patterns into logic cells on an FPGA or store them in memory. In the former case, updating the patterns may take hours to regenerate a bitstream and a few minutes to download it onto the chip, although partial reconfiguration can reduce the cost [Xil04]. Besides the reconfiguration cost, the number of available gates limits the size of the pattern set. Several designs take this approach. For example, four scanning modules run in parallel to scan multiple packets concurrently in [MLL03], achieving a throughput of up to 1.184 Gbps. Cho et al. designed a pipelined architecture of discrete comparators [CNM02]. A pattern match unit uses four sets of four 8-bit comparators to directly compare four consecutive characters in each stage, and the matching results of each stage are fed to the next stage of the pipeline. The design was later enhanced by fully pipelining the entire system [SP03], reaching a throughput of up to 11 Gbps at 344 MHz, but its area cost remains high. Several subsequent studies, such as [SP04], were devoted to area reduction.

Reconfiguration in memory-based accelerators involves only updating the memory content; the logic either remains intact or changes only slightly. The designs may use an AC-style automaton [TS06, Lun06, TLLar, LTLar, LTH07, TLL05], a filtering search window [DKS04, PP05, SPW05, AC07], or both [DL05]. Whichever approach they take, a fundamental issue is that scanning only one character at a time demands a high operating frequency to reach high speed. Some designs advance several characters at once by using multiple parallel engines, but the available hardware resources restrict the degree of parallelism.
