Chapter 2. Related Work - High-Efficiency String-Matching Engine with Multi-Character State Transitions

Among string-matching algorithms, the Aho-Corasick algorithm [1] and the Bloom algorithm [8] are used in many applications to filter out specific data efficiently.

The Bloom algorithm accelerates string matching by allowing a small number of falsely matched patterns; however, a further exact verification is required to confirm whether a result is a false positive. The Bloom algorithm can be implemented in hardware with high space efficiency [9]. In contrast, the Aho-Corasick algorithm (AC-algorithm) is an exact string-matching algorithm that can locate multiple patterns in a text with linear time complexity. Since the approaches proposed in this thesis are mainly based on the AC-algorithm, the following discussion of related work focuses on the AC-algorithm.
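As a point of reference for the discussion that follows, the AC-algorithm can be sketched in a few lines of Python. This is a minimal textbook-style illustration, not the engine proposed in this thesis; the names `build_ac` and `search` and the list-of-dictionaries layout are assumptions made for the example.

```python
from collections import deque

def build_ac(patterns):
    """Build goto, failure, and output tables (illustrative layout)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the outputs of the (shallower) failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_ac(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Each input character is consumed once, and failure links are followed only as far as earlier goto transitions allow, which is where the linear-time behaviour mentioned above comes from.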

2.1 Optimization in Hardware

Owing to the progress and flexibility of programmable devices such as FPGAs, developers can design and evaluate various architectures tailored to the features of the AC-algorithm. Nevertheless, the resources of programmable devices are limited, so several works have attempted to increase hardware efficiency. To improve memory efficiency, Tuck et al. [2] developed a bitmap-compression and path-compression approach for the AC-algorithm that reduces the required memory and improves the performance of hardware implementations. Zha and Sahni [3] improved the bitmap-compression and path-compression approach so that it requires considerably less memory. Alternatively, Alicherry et al. [4] implemented the AC-algorithm by integrating a Ternary Content Addressable Memory (TCAM) and a Static Random-Access Memory (SRAM); the ternary matching of the TCAM is used to match characters expressed in negation expressions, which reduces the space required for storing the state transitions. Lin and Liu [10] and Pao et al. [11] applied pipeline architectures to implement a character trie that contains only the goto functions of the AC-trie, to reduce the space introduced by the failure functions. Hua et al. [12] developed another approach based on a block-oriented scheme, instead of the usual byte-oriented processing of patterns, to minimize memory usage. Dimopoulos et al. [13] developed a Split-AC algorithm that partitions a whole AC-trie into multiple smaller tries to increase memory efficiency.
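The general idea behind bitmap compression can be illustrated with a small sketch. It assumes a simplified node layout (a 256-bit presence bitmap plus a densely packed child list) and is not the exact data structure of Tuck et al. [2] or Zha and Sahni [3]; the class name `BitmapNode` is hypothetical.

```python
class BitmapNode:
    """Trie node that stores only existing children, indexed via a bitmap.

    Assumed layout for illustration: one bit per byte value marks whether
    a transition exists, and children are packed in byte order.
    """

    def __init__(self, children):
        # children: dict mapping a byte value (0-255) to a child object
        self.bitmap = 0
        self.packed = []
        for byte in sorted(children):
            self.bitmap |= 1 << byte
            self.packed.append(children[byte])

    def child(self, byte):
        """Return the child for `byte`, or None if no transition exists."""
        if not (self.bitmap >> byte) & 1:
            return None
        # Count the set bits below `byte`; that count is the dense index.
        below = self.bitmap & ((1 << byte) - 1)
        return self.packed[bin(below).count("1")]
```

A node with three outgoing transitions then stores three child entries and one bitmap instead of a full 256-entry next-state array, at the cost of a population count per lookup.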

Several studies have developed hybrid string-matching approaches that combine DFA and NFA to obtain the advantages of both while avoiding their disadvantages. Previous hybrid approaches cover both exact and regular-expression string matching. For example, Yang et al. [14] proposed the head-body finite automaton (HBFA), which partitions an AC-trie into a head DFA (H-DFA) and a body NFA (B-NFA) to improve the performance of software implementations of the AC-algorithm. Becchi et al. [15] proposed a hybrid-FA solution for regular-expression string matching that brings together the strengths of both DFA and NFA, namely space efficiency and a deterministic hardware implementation. In the hybrid-FA of Becchi et al., the nodes causing state explosion retain an NFA encoding, while the rest are transformed into DFA nodes.

2.2 Multi-Character Hardware Approaches

Because of the flexibility of programmable devices, some works have developed string-matching architectures that inspect multiple characters in parallel to multiply the throughput of string matching. However, an approach that inspects multiple characters in parallel must address both the complexity and the alignment problem incurred in k-character matching. As an extension of the AC-algorithm, Sugawara et al. [16] proposed a string-matching method called Suffix-Based Traversing (SBT) that processes multiple input characters in parallel and reduces the lookup-table size. Alicherry et al. [4] proposed a k-compressed AC-DFA to achieve a parallel k-character matching engine. A k-compressed AC-DFA consists only of the states whose depth in the original AC-trie is a multiple of k, together with the leaf states of the original AC-trie; the alignment problem is solved by using additional shallow states. Other works used multiple FSMs to achieve parallelism and solve the alignment problem, with each FSM responsible for processing patterns that begin at a different position [5–7]. However, those approaches require specific logic to combine the matching results from the FSMs. The approaches of Yamagaki et al. [17] and Katashita et al. [18] solve the alignment problem by using additional states and transitions.
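The alignment problem and the multiple-FSM remedy can be illustrated with a small software simulation. It is a deliberately naive sketch, not any of the cited hardware designs: each "engine" is a hypothetical function `match_aligned` that only sees pattern starts on its own k-character block boundaries, and `match_k_parallel` plays the role of the logic that combines the engines' results.

```python
def match_aligned(text, patterns, k, offset):
    """One engine: report patterns starting at offset, offset + k, ...

    Stands in for an FSM that consumes k characters per step and can
    therefore only recognise patterns aligned to its block boundaries.
    """
    hits = []
    for start in range(offset, len(text), k):
        for pat in patterns:
            if text.startswith(pat, start):
                hits.append((start, pat))
    return hits

def match_k_parallel(text, patterns, k):
    """Combine k engines, one per alignment (run in parallel in hardware)."""
    hits = []
    for offset in range(k):
        hits.extend(match_aligned(text, patterns, k, offset))
    return sorted(hits)
```

With `k = 4`, the single engine `match_aligned("xxabcx", ["abc"], 4, 0)` inspects only positions 0 and 4 and misses the occurrence at position 2, whereas the combined `match_k_parallel` finds it through the engine with offset 2.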

2.3 Software Approaches

Salmela et al. [20] developed a software approach capable of processing multiple characters in parallel. That approach uses short substrings of length q, referred to as q-grams, so that q characters are processed as a single character, and applies bit parallelism to increase filtering efficiency. Nevertheless, their approach is designed to match a set of keywords of the same length.

Because of advanced semiconductor technologies, multiple processing cores can be packaged in a single CPU or GPU chip. Recently, many software implementations of the AC-algorithm have used the power of multicore CPUs or GPUs to accelerate string matching. For example, Scarpazza et al. proposed an optimized algorithm for the IBM Cell/B.E. processor, a heterogeneous multicore processor comprising a 64-bit processor core and eight synergistic processor cores, to achieve high-performance exact string matching [19, 21]. In that algorithm, keywords are split to fit in the local memories of the processing cores to reach extremely high throughput for each processor.
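The q-gram filtering idea attributed to Salmela et al. [20] earlier in this section can be sketched as follows. The code is a simplified bit-parallel (shift-and) filter written for illustration, in the spirit of that approach rather than a reproduction of it; the function name `qgram_filter_search` and the default `q=2` are assumptions. As in the description above, all keywords must have the same length, and every candidate reported by the filter is verified exactly.

```python
def qgram_filter_search(text, patterns, q=2):
    """Filter with q-grams and bit parallelism, then verify candidates."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns) and 1 <= q <= m
    n_pos = m - q + 1                 # q-grams per keyword
    # masks[g] has bit i set if q-gram g occurs at position i of any keyword.
    masks = {}
    for pat in patterns:
        for i in range(n_pos):
            g = pat[i:i + q]
            masks[g] = masks.get(g, 0) | (1 << i)
    accept = 1 << (n_pos - 1)
    keyword_set = set(patterns)
    state, hits = 0, []
    for t in range(len(text) - q + 1):
        # One shift-and step per text q-gram (q characters at a time).
        state = ((state << 1) | 1) & masks.get(text[t:t + q], 0)
        if state & accept:
            start = t - (n_pos - 1)
            candidate = text[start:start + m]
            if candidate in keyword_set:   # exact verification
                hits.append((start, candidate))
    return hits
```

Because the filter merges all keywords into per-position q-gram classes, it can report candidates that mix q-grams from different keywords; a larger q makes such false candidates rarer, and the verification step removes those that remain.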

Yang et al. [14] and Yang and Prasanna [22] derived an approach that uses a head-body finite automaton (HBFA) to improve the match ratio on multicore processors and implemented the HBFA in multiple threads on a multicore system to achieve high throughput. Villa et al. presented a software approach for the AC-algorithm on the Cray XMT multithreaded shared-memory machine that achieves a throughput of 28 Gbps [23]. The approach of Tumeo et al. assigns different packets to different GPU threads using CUDA, the programming model proposed by NVIDIA, to increase the efficiency of pattern matching [24]. Tumeo et al. later evaluated several software implementations of the AC-algorithm on shared- and distributed-memory architectures [25].

Herath et al. applied multicore CPUs to accelerate the string matching used in biology applications [26].

Software and hardware approaches differ significantly in how they achieve parallelism. Software approaches split an input text into multiple chunks and process the chunks with multiple threads, where each thread still inspects its chunk character by character. Conversely, hardware approaches inspect multiple characters in parallel. Both can multiply the throughput of string matching. However, the two also differ in that software approaches can support a larger dictionary size.
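The chunk-based parallelism of software approaches can be sketched as follows. This is a structural illustration only, not any of the cited implementations: the names `scan_chunk` and `parallel_search` are hypothetical, the per-position scan is deliberately naive, and letting a worker read slightly past the end of its chunk (so that matches straddling a chunk boundary are not lost) is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(text, patterns, lo, hi):
    """Report matches that START in [lo, hi), one text position at a time.

    A match may extend past `hi`; it is attributed to the chunk in which
    it starts, so no occurrence is reported twice.
    """
    return [(i, p) for i in range(lo, hi)
            for p in patterns if text.startswith(p, i)]

def parallel_search(text, patterns, workers=4):
    """Split the text into chunks and scan each chunk in its own thread."""
    size = max(1, -(-len(text) // workers))   # ceiling division
    bounds = [(lo, min(lo + size, len(text)))
              for lo in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: scan_chunk(text, patterns, *b), bounds)
        return sorted(hit for part in parts for hit in part)
```

For example, `parallel_search("abcabcabc", ["bca", "abc"])` splits the text into three chunks of three characters, and the occurrence of `"bca"` at position 1 is found by the first worker even though it crosses into the second chunk. In CPython, threads running pure-Python code typically do not execute in parallel, so the sketch shows the partitioning scheme rather than a speed-up.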

