A platform-based SoC design and implementation of
scalable automaton matching for deep packet inspection
Ying-Dar Lin
a, Kuo-Kun Tseng
a,*, Tsern-Huei Lee
b, Yi-Neng Lin
a,
Chen-Chou Hung
a, Yuan-Cheng Lai
caDepartment of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan
bDepartment of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan
cDepartment of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan
Received 1 November 2006; received in revised form 14 March 2007; accepted 14 March 2007 Available online 11 April 2007
Abstract
String matching plays a central role in packet inspection applications such as intrusion detection, anti-virus, anti-spam and Web filtering. Since they are computation and memory intensive, software matching algorithms are insufficient to meet the high-speed performance. Thus, offloading packet inspection to a dedicated hardware seems inevitable. This paper pre-sents a scalable automaton matching (SAM) coprocessor that uses Aho-Corasick (AC) algorithm with two parallel accel-eration techniques, root-indexing and pre-hashing. The root-indexing can match multiple bytes in one single matching, and the pre-hashing can be used to avoid bitmap AC matching which is a cycle-consuming operation. In the platform-based SoC implementation of the Xilinx ML310 FPGA, the proposed hardware architecture can achieve almost 10.7 Gbps and support over 10,000 patterns for virus, which is the largest pattern set from among the existing works. On the average, the performance of SAM is 7.65 times faster than the original bitmap AC. Furthermore, SAM is feasible for either internal or external memory architecture. The internal memory architecture provides high performance, while the external memory architecture provides high scalability in term of the number of patterns.
2007 Elsevier B.V. All rights reserved.
Keywords: Deep packet inspection; Automaton; String matching; Content filtering
1. Introduction
Since detecting malicious traffic on the Internet, such as viruses and intrusions, relies on looking for signatures in the packet payload, traditional
fire-walls that inspect only the packet header are insufficient for detection. Thus, deeper packet inspection such as intrusion detection, anti-virus, anti-spam and Web filtering are required to detect such application-level attacks that can be found in the field. The essential part of such solutions is the string matching which has been shown to be a time-consuming component that should be acceler-ated[1,2].
1383-7621/$ - see front matter 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.sysarc.2007.03.005
* Corresponding author.
E-mail address:[email protected](K.-K. Tseng).
For string matching, several algorithms and hardware architectures have been proposed to improve performance. Although the throughput of some approaches can achieve up to 10 Gbps, their common drawback is poor scalability. Their rules and pattern sets are hardwired into the FPGA and thus, the scalability is limited by the number of logic cells and the size of the embedded memory in the FPGA.
In this paper, we propose a scalable automaton matching (SAM) which is based on the Aho-Cora-sick (AC) algorithm with external memory architec-ture. AC is a common algorithm with the following features. First of all, AC has a deterministic perfor-mance in the worst case. Secondly, AC is robust for large and long patterns. Thirdly, AC can perform a multiple patterns match in a single matching itera-tion. However, the large memory usage is a critical problem for the AC algorithm. Another AC-based algorithm, the bitmap AC, improves memory utili-zation by using a 256-bit bitmap to replace the 256 word-size pointers of each state in AC. Hence, bitmap AC is the alternative that we adopted in this work.
SAM is developed with two acceleration tech-niques to archive a sub-linear matching time. The first technique is root-indexing that comes from the observing the AC’s high frequency root-visiting behavior. The second technique is pre-hashing that comes from the observing the time-consuming bit-map operation during the bitbit-map AC matching. To reduce this kind of operation, pre-hashing can test quickly to avoid the bitmap AC matching. For scalability, our architecture uses external mem-ories to store the whole pattern database of SNORT or even ClamAV.
Furthermore, SAM can easily update patterns without interrupting the operation or shutting down the machine since it is a memory based architecture. In our evaluation, we implemented SAM on a Sys-tem On Chip (SoC) based platform with Xilinx FPGA Virtex2P and EDK design tool.
The rest of this paper is organized as follows. In
Section2, we first introduce the related algorithms
and string matching hardware. Then, the proposed architecture and acceleration techniques are
pre-sented and an example is given in Section3. Section
4 presents the software and hardware
implementa-tion of the SAM approach. The performance analy-sis, evaluation, and comparison with existing works
are given in Section5. Finally, we draw up our
con-clusion in Section6.
2. Background
2.1. Selecting matching algorithms for packet inspection
To understand the appropriate requirements of string matching algorithms, we surveyed the real patterns from open source software including
SNORT (http://www.snort.org) for intrusion
detec-tion, ClamAV (http://www.clamav.net) for
anti-virus, SpamAssassin (http://spamassassin.apache.
org) for anti-spam, and SquidGuard (http://
www.squidguard.org) and DansGuardian (http://
dansguardian.org) for Web blocking. The necessary requirements are variable-length, multiple patterns and on-line processing for all packet inspection systems.
Although the complex patterns such as class, wildcard, regular expression and case sensitive pat-terns might increase the expression power of the patterns and has been used in some applications, they can be converted into multiple simple patterns
[3], and they are optional for matching algorithms.
Current existing on-line string matching algo-rithms for packet inspection can be classified into four categories: dynamic programming, bit parallel, filtering, and automaton algorithms. As
summa-rized inTable 1, dynamic programming [3]and bit
parallel [4] algorithms are inappropriate for
vari-able-length and multiple patterns, and the filtering
algorithms[5]have poor worst-case time complexity
O(nm), where n and m are the length of the text and patterns, respectively. Only the automaton based
algorithms such as Aho-Corasick (AC) [6] support
variable-length and multiple patterns, and also have the deterministic worst-case time complexity O(n). Hence, the automaton based algorithm is a better choice for the packet inspection system, and was selected as the base to develop new approaches. 2.2. AC related algorithms
Before performing AC matching, there is a need to construct a state machine from the patterns.
Adapting from the example in [6], Fig. 1a–c are
AC’s three major functions for patterns ‘‘TEST’’, ‘‘THE’’, ‘‘HE’’.
The first Goto function shown in Fig. 1a starts
with an empty root node and adds states to the state machine for each pattern. That Goto function is a tree structure that shares common prefixes with all of the patterns. During the matching the Goto
function is traversed from one state to the other with the text byte by byte.
The second is the Output function shown in Fig. 1b needs a table to store the matched pattern with their corresponding state in the Goto tree. Out-put function records a matched state for a matched pattern if that current state is matched during the visiting.
The third is the Failure function as shown in Fig. 1c. During the construction, failure states are added from the state, where their longest prefix also leads to a valid state in the Goto tree. During the matching Failure function is used when a match fails after a partial match.
After the construction of a machine, the AC state machine is traversed from the current node to the next node according to the input byte.
For the data structure of the Goto Function, there are two alternatives to store the next state links:
• The first alternative is the construction of a 2D-array table. Each state has 256 next state pointers for all of the ASCII input cases, as shown in Fig. 1d. It is the most popular implementation for fast matching, but it wastes memory space when the table is sparse.
• In the second alternative, data structure uses the link list, and each state only has the link list of existing next states. This kind of data structure has a smaller space requirement, but it is slow when there are many next states.
AC is a typical deterministic finite automaton (DFA) based on string matching. However, there
are several variations of it. Bitmap AC[7]uses
bit-map compression to reduce the storage of AC
states. AC_BM [8] is a combination of the AC
and Boyer Moore (BM) algorithms, and aims to improve the conventional AC from O(n) to the
sub-linear time complexity. AC_BDM[9]combines
AC with backward DAWG matching (BDM) to Table 1
Comparison of the on-line string matching algorithms
Algorithm Dynamic programming Backward filtering Automaton Bit parallel
Description Matrix operations to
compute the similarity between text and pattern
Discarding window of text that is not a substring of pattern in backward scanning
Search through a Deterministic Finite Automaton (DFA)
Simulate Non-Deterministic Finite Automaton (NFA) by bitwise operations
Average time complexity
O(n) Sub-linear O(n) O(n)
Worst time complexity
O(n) O(nm) O(n) O(n)
Text length Fixed short length Variable long length Variable long length Variable long length
Pattern length
Fixed short length Variable short length Variable long length Fixed short length
Multiple pattern
No Yes Yes Yes
Regular expression No No Yes Yes Advantage for hardware
Simple systolic array circuit Storage is normally smaller than
automaton
Comparison is a lookup operations
Bitwise operation is fast
Disadvantage for hardware
Not feasible to have a large systolic array
Long latency to compute discarding window
Table size is larger than bit-parallel
Not feasible to have a long bit mask
Typical algorithm
Edit distance Boyer–Moore Aho–Corasick Shift-OR
Current State Next State . . . y a b z . . . Next State i 1 2 3 4 5 6 7 8 f(i) 0 0 0 0 3 4 0 1 0 1 2 7 8 5 6 3 4 T E S T H E H E i output(i) 8 {TEST} 6 {THE, HE} 4 {HE}
a
b
c
d
Fig. 1. (a) Goto function. (b) Output function. (c) Failure function. (d) AC table implementation.
improve the average-case time complexity of the
conventional AC. Bit-split AC[10] splits the width
of the input text into a smaller bit width to reduce the memory usage and the number of comparisons when selecting the next states. Since AC_BM has the worst-case time complexity O(nm) and overhead for switching between AC and BDM, and bit-split AC requires a large match vector for each bit-split state, they are impractical for large patterns. Thus, a scalable bitmap AC with space efficiency is more preferable for our embodiment.
Bitmap AC is a compromise between table and link list approaches. It resolves the wasted memory of the AC table that uses 256 next pointers for each state. Bitmap AC maintains a 256-bit bitmap for each state to indicate whether a valid next state with a given character is valid or invalid, and it requires
traversing along the failure pointer path. Fig. 2
shows the data structure of bitmap AC and how it locates the next state.
Bitmap AC solves this problem for AC. How-ever, in order to locate the next state in bitmap AC, it must count all 1s in the 256-bit bitmap. This is known as a time-consuming operation that is dominated by loading the state and performing the population count.
2.3. Hardware-based string matching
Since sting matching is a bottleneck for packet
inspection systems [1], hardware solutions are
required for high-speed content processing. Among the existing string matching hardware, the most prevalent one is the finite automaton (FA) based hardware because it has support for deterministic matching times and large patterns. FA based hard-ware can be divided into deterministic FA (DFA) and non-deterministic FA (NFA) based hardware.
DFA based hardware has a unique transition that activates one state at a time and normally has a lar-ger number of states compared to NFA. NFA can handle multiple transitions at one time, but it requires parallel circuits for comparing its variable multiple next states. Thus, majority of DFA based hardware uses the table or link list to store their pat-terns, and most NFA based hardware uses parallel reconfigurable circuits to handle their patterns.
For DFA based hardware, there are three com-mon designs in recent string matching hardware:
Aho-Corasick (AC) based hardware[11,12]Regular
Expression (RE) based hardware [13,14] and
Knuth–Morris–Pratt (KMP) [15–17] based
hard-ware. To save more states, KMP and AC are simpli-fied from RE DFA by disabling their regular expression patterns. Each AC DFA supports multi-ple simmulti-ple patterns, and each KMP DFA support single simple patterns only. Thus, many KMP DFAs use duplicated hardware to support multiple patterns.
As for NFA based hardware, there are two
vari-ations: comparator NFA[18,19], which uses the
dis-tributed comparators, and decoder NFA[20], which
the uses the character decoder (shared decoder) to build their NFA circuits.
Other existing non-DFA based hardware are the
parallel comparator [21–23], Bloom filter [24],
sys-tolic array [25] and parallel-and-pipeline [26]
hard-ware in our classification. Parallel comparator based hardware improves the performance of brute force algorithm by exploiting architecture parallel-ism and pipelining. Bloom filter based hardware uses multiple hashing keys for quick approximate
matching. Using systolic array implementing
dynamic programming for string matching is only proper for short patterns and text because the cir-cuit size is proportional to the length of the patterns and text. Parallel and pipeline hardware uses naı¨ve string matching and only accelerates processing time by increasing the hardware circuits. Like the systolic array, this approach also has the drawback of only being suited for short-length patterns. 3. SAM design
Although bitmap AC has the good worst-case matching time complexity in O(n), this is insufficient for high speed processing. In this paper, we present a scalable automaton matching (SAM) that is built on an embedded based system and applied to a net-work gateway to perform deep content filtering as
… 256-bit bitmap Data structure for state i
Matched Pointer State Info.
Failure State Next State Pointer
...
Sum all valid 1s
Bit Offset
Next State Table of State i Next State Base Address
Fig. 2. Data structure of bitmap AC for state i, using bitmap to locate the next state.
shown in Fig. 3 SAM employs two techniques: novel pre-hashing for the non-root state and root-indexing for the root state to accelerate automaton based algorithms. The pre-hashing approach is a quick scanning for the non-root state to avoid the time-consuming automaton matching. First of all, the idea is hashing the substring of text and compar-ing the result with the vector for the suffixes of the state in the bitmap AC automaton. If a non-hit occurs, the slow automaton matching is not required. For the root state, the root-indexing approach uses a compressed technique to remember all the next states whose lengths are counting from the root state, less than l (l > 1). Thus, multiple bytes of length l, rather than one byte, can be han-dled in one single matching for the root state to accelerate the matching speed. Actually, since the root state is often visited in the matching operation, the root-indexing approach is an effective accelera-tion approach.
3.1. Root-indexing matching
In the AC tree, most of the failure links point to the root state, i.e., it will always go back to the root state when there is no next state for a given charac-ter. Thus, it is efficient to apply root-indexing in the root state where it can match multiple characters
simultaneously. In Fig. 4, root-indexing comprises
kroot index tables IDX[1, . . ., kroot] and a root next
table NEXT, where kroot denotes the maximum
length of root-indexing matching at the same time. Each entry of IDX stores a partial address for locat-ing the next state in NEXT.
For example, if patterns are ‘‘TEST’’, ‘‘THE’’ and ‘‘HE’’, IDX1 to IDX4 will at least contain the appearing characters in the corresponding position as {‘‘H’’,‘‘T’’} for level one, {‘‘E’’,‘‘H’’} for level 2, {‘‘E’’,‘‘S’’} for level 3, {‘‘T’’} for level 4, respec-tively. However, since the latter tables are required to contain the entries of former tables, IDX1 to IDX4 will actually contain {‘‘H’’,‘‘T’’}, {‘‘E’’,‘‘H’’, ‘‘T’’}, {‘‘E’’,‘‘H’’,‘‘S’’,‘‘T’’} and {‘‘E’’,‘‘H’’,‘‘S’’, ‘‘T’’}, respectively.
In numbering the entries of IDX tables, the first IDX has 2 appearing characters and thus, ‘‘H’’ and ‘‘T’’ are numbered ‘‘01’’ and ‘‘10’’ in the binary format, respectively. The second IDX table using ‘‘01’’, ‘‘10’’ and ‘‘11’’ stands for {‘‘E’’,‘‘H’’,‘‘T’’}, respectively. The NEXT table, indexed by a concat-enation address of lookup value from the all the IDX tables, is used to store all the next states within
length kroot. In the example of Fig. 4,
10_01_001_000, 10_01_011_100, 10_10_001_000
and 10_11_000_000 are concatenation addresses to locate the next states for ‘‘TEE’’, ‘‘TEST’’, ‘‘THE’’ and ‘‘TT’’, respectively.
3.2. Pre-hashing matching
The pre-hashing method can quickly examine the existence of next states to further avoid slow AC matching. Before the pre-hashing matching, it is necessary to build the pre-hashing bit vector in the preprocessing phase. First, we input the AC tree, which was built using conventional AC algo-rithm. For each state, we extract suffixes with the length 1.
Fig. 3. Packet inspection gateway with SAM coprocessor, which performs two techniques: root-indexing matching for the root state and the pre-hashing matching for the non-root states before the AC matching. Input text 0-255 …. …. 00 01 IDX1 10 01 010 11 H T E T 10_10_001_000 …. … Next table 0 1 2 3 5 4 6 7 8 |z|= kroot T H E Next state T E S T 10 01 011 100 …. …. H T 10 001 100…. …. 011 E S 010 H T E S H 001 011 100 …. 10_01_011_100 10_11_000_000 T T 10_01_001_000 T E E T E S T
IDX2 IDX3 IDX4
Fig. 4. Root-indexing architecture and example for the input text ‘‘TEST’’ with the patterns ‘‘TEST’’, ‘‘THE’’ and ‘‘HE’’.
When suffixes are obtained, the pre-hashing algo-rithm hashes suffixes into bit vectors. This proce-dure of building the bit vectors for state 1 is
illustrated in Fig. 5a. In Fig. 5b and c, the mask
of the rightmost four bits of the characters and the transformation from binary to one-hot represen-tation are used as the hash function in our design. However, a better mask position is adjustable for a lower false positive according to the characteris-tics of the patterns.
In pre-hashing matching, the pre-hashing unit reads a byte substring and then hashes the sub-string. The hash result will be indicated by the pre-hashing unit. When the pre-hashing unit indi-cates a non-hit, the next state will be obtained from the root-indexing unit. However, if the hit condition is indicated by the pre-hashing unit, a slow bitmap AC matching will be performed.
3.3. System architecture
A preferable searching architecture is suggested inFig. 6, where a string matching coprocessor per-forms three independent matching units in parallel. Hence, the control logic coordinates pre-hashing, root-indexing and bitmap AC matching for parallel processing. Also each matching function has its individual memory interface to access its pre-pro-cessing data. Since the design methodologies of SoC are popular and have matured in recent times, this specific component is quite feasible for use in modern IC technology.
In the SAM coprocessor, the three units can read the text in different lengths and perform their matching concurrently. This example processes a one-byte substring for AC matching, a two-byte
substring for pre-hashing matching, and a four-byte substring for root-indexing matching in a single matching iteration. The root-indexing and bitmap AC are used to locate the next states, and the pre-hashing matching is used to decide on which is the next state to be used in the next matching iteration. 4. SAM implementation
4.1. Pre-processing and simulation software
The pre-processing procedure generates essential data structures for the proposed hardware, as
shown in Fig. 7a. The Make_Goto() and Make_
Failure() functions are original functions defined by the AC algorithm, and our data structures were further on built according to the table constructed from these two basic functions. For bitmap AC, the Make_Bitmap() function builds a 256-bit bit-map for each state and sets 1 to the corresponding bit position for each existing next state. It also builds the next state table for each state. The next
function is Build_Index() which builds the
IDX[1,. . .,k] tables and root next table NEXT for
root-indexing pre-processing. In the final stage, Build_BitVector() sets 1 to the bit vector by hashing the function according to all the next states of both the current state and the recursive failure node for pre-hashing preprocessing.
After the pre-processing procedure is finished, the simulation of SAM can perform matching
according to the flow inFig. 7b. For every matching
iteration, the first current state is checked. When the current state is in the root state, the Root-Index() matching is performed otherwise Pre-Hash() is performed.
Fig. 5. (a) AC tree of state 1 for building the bit vector. (b) and (c) example of building the bit vector for length 1 and length 2 suffixes of state 1 in the preprocessing phase.
… … … … Text … … …… H1 H2 Bit vectors Non-hit? . . . . . . . . . . . . Load bit vector Current state . . . . . . . . . . . .
Root index tables Root next table . . . . . . . . . . . . . . . . . . Indexing Root-Indexing next state . . . . . . State table Load state Compute next state 1 0 . . . . . . Next state address AC next state Next state Current state Root-Indexing matching Pre-Hashing matching AC matching
If Pre-Hash() reports a non-hit situation, the current state will be set to the root state directly, and will do root-indexing matching. If a hit situa-tion is reported, Search_Bitmap() will check the existence of the next state for a given byte. If Search_Bitmap() = 1, the next state will be obtained from the base address pointer of the next state table plus the return value of Bitmap_offset(). Note that if Search_Bitmap() reports zero, the current state will be set to the failure state in the while loop until the current state becomes the root state. This C model can be the golden model for the proposed hardware design, and it can also be used to gather statistics for performance analysis.
4.2. Root-indexing and pre-hashing functions Since bitmap AC has been done in previous work, its detailed function implementation is not described in this paper, and only the important root-indexing and pre-hashing functions are dis-cussed. In pre-processing, procedures Build_Index() and Build_BitVector() are the two important functions.
The root-indexing technique comprises krootroot
index tables IDX½1;...;kroot and a root next table
NEXT, where kroot denotes the maximum length
of root-indexing matching. Each entry of IDX stores a partial address for locating the next state in NEXT, where the partial address is a sequential integer to represent the order of characters appear-ing in the correspondappear-ing substrappear-ings in the suffixes of the root state.
In the pre-processing of the root-indexing,
Buil-d_Index(S) is first invoked to build IDX½1;...;kroot as
Fig. 8. The length of input text and the number of
IDX tables are equal to kroot. This function builds
the IDX table from IDX1 to IDXkroot. It first
per-forms IDXj[x] 0 to initialize the current IDX table
and then performs IDXj+1[x] IDXj[x] to bring the
later IDXj to the former IDXj1, and, finally,
per-forms IDXj[aj [x]] q to set the index value from
the current character of the suffixes, where a are
the suffixes of S0, such that a from a set of possible
transition paths from root state S0 to the states
within length kroot, and can be defined as
a g(S0,kroot). The xth suffix of length j in a will
be indexed into the entry by IDXj[aj[x]] and
num-bered by an increasing value q. If the corresponding
entry in IDXjappears in suffixes aj[x], q will be put
into that entry and increased by one.
S is the set of all AC states, andjSj is the number
of states built by conventional AC algorithm from a
set of multiple patterns P. Let bi,jbe the set of
suf-fixes of length j for state Si, and bi,j,xrepresents the
xth suffix in length j for state Si. A transition
func-tion g can collect the possible bi,j from Si to the
states with length j.
Build_BitVector(S) builds the pre-hashing bit vector in the preprocessing phase, as seen in Fig. 9. This function first inputs the AC tree that is built by the conventional AC algorithm. Then,
it extracts suffixes bi within the length kpre-hash for
the specific state Si by using g(Si,kpre-hash), where
kpre-hashis the maximum length of the pre-hashing
suffixes, and is also the length of the substring in
the text for each pre-hashing matching. bi also
includes the failure links in the AC tree. When suf-fixes are obtained, the pre-hashing algorithm hashes
c_state=root c_state=root ? Root_Index() yes no Pre_Hash ()=1? c_state = next_p + Bitmap_offset() yes no Input data ptr=ptr+1 ptr=ptr+2 yes no c_state=root? no yes Make_Goto() pattern Make_Failure() Make_Bitmap() Build_Index() Build_BitVector()
a
b
Fig. 7. (a) The pre-processing procedure. (b) The flow of C simulation model.
the suffixes into bit vectors by Vi,j Hj(bi,j,x), where
Hj is a hashing function for the corresponding bit
vector Ve,jand the same Hjis used for all states.
4.3. Matching functions
In the matching phase of root-indexing, Root_In-dex(z) inputs a substring of the text z to locate the
new state Sc in parallel, and is defined in Fig. 10.
The lookup operation inputs z[j] into IDXj(z[j]) to
generate a NA, repeatedly, which is defined as
NA NAIDXj[z[j]], where symbol is a
concate-nation operation. When NA is obtained, Scis then
looked up by NEXT[NA].
In the searching phase, the pre-hashing performs
Pre Hash(w,Vc) to rapidly match the current state
in the AC tree, as shown inFig. 11, where w is the
current compared substring of the text, and Vc is
the current bit vector.
The operation TNj Hj(w[1,. . .,j])&Vc,j
per-forms the bitwise AND (&) for Hj(w[1,. . .,j]) and
Vc,j, in order to return a true non-hit TNjfor length
j. TNjis 1 (True) if the hashed w[1,. . .,j] bit is set in
Vc,j. The pre-hashing matching returns False (no
match) when ^
kpre-hash
j¼1 TNj6¼ 1, where the operation
^
kpre-hash
j¼1 is a conditional AND for multiple TNjwhose
amount is kpre-hash. This means that the longer w will
not be further matched when the shorter w is not matched.
4.4. Hardware implementation
Xilinx ML310 is a FPGA based platform for
SAM system as shown in Fig. 12. This platform
has 2448 Kbits internal block RAM, 30816 LUTs and two hardwired IBM PPC405 processors. For the peripheral, ML310 has one Ethernet port, one PCI slot for additional NIC extension, one 256 MB DDR RAM module and one CF card to store the image of the file system. During the operation, the packets are inputted from the on board Ethernet port, and processed by the PPC 405 CPU. Of course, if the SAM is implemented, the deep packet inspection of the packets is offloaded to SAM engine.
For the development tools, the Xilinx EDK and Synplicity SynplifyPro are used in the system imple-mentation. The EDK can generate the bit streams Fig. 9. Function for building pre-hashing bit vectors.
Fig. 10. Function for root-index matching.
Fig. 11. Function for pre-hashing matching.
from the hardware/software co-design files of SRAM implementation. For the software design, the files include the mapping address define files and the drivers of all peripherals needed for building the complete RTOS image. For the hardware design, the Verilog is used to design string matching hardware, then ModelSim and Debussy are the sim-ulator and debugger tools, respectively, to verify the SAM design.
The proposed architecture is a parallel design where all modules are working at the same time. The block diagram of the hardware implementation has the following major modules as shown in Fig. 13.
• FSM Unit controls the working flow of the whole hardware system.
• Root-Indexing Unit is used for fast state indexing at the root state.
• Pre-Hashing Unit tests the bit vector for two input bytes by hashing function and sending the hashing result to FSM.
• Bitmap AC Unit counts all 1s for locating the next state.
• SM Controller Unit provides the control registers including the length of text buffer and enable the signal for the software to program.
The most important module in SAM hardware is the FSM Unit. Once the SM controller Unit is enabled, FSM Unit controls all other modules in parallel and its detailed operations are given in Fig. 14. In the FSM diagram, the starting state is the IDLE state. When the control signal is enabled, the FETCH state fetches a waiting text if the text
buffer is empty. Otherwise, the MATCH state will enable Root-Indexing Unit, Pre-Hashing Unit, and Bitmap AC Unit simultaneously.
If the current state is a root or the result of pre-hashing is non-hit, FSM Unit moves to ROOT_-MATCH to keep the Root-Indexing Unit working. Once the root-indexing matching is done, the cur-rent state will be assigned by Root-Indexing Unit at SET_ROOT_IDX state. Afterward, the FSM Unit will return to MATCH state to match subse-quent texts. When a hit situation is reported, the Bitmap AC Unit and Root-Indexing Unit are trig-gered in AC_MATCH state, and the next state is assigned by root-indexing module if the current state of AC is required to set to root by failure link. Otherwise, the next state is provided by the Bitmap AC Unit.
As for the pattern updating, SAM can update patterns without interrupting the operation or shut-ting the machine down. Since the pattern is stored in the programmable memory, with the size of the cur-rent pattern sets and the download speed, the pat-tern data can easily be updated in a flash and thus, there is no need to shutdown the machine. In addition to that, SAM can also support the non-interrupting (incremental) update if the data structure of the SAM is ordered by state number. 5. Evaluation
5.1. Formal analysis
If pre-hashing, root-indexing and bitmap AC are to run as the sequential algorithm, the average time is
Tavg time¼
Thashþ Proot Trootþ ð1 ProotÞ TAC
ðkroot ProotÞ þ ð1 ProotÞ
; ð1Þ SM Controller FSM Root-Indexing Unit Index Table Pre-Hashing Unit Bitmap AC Unit control root_state root_index_en pre_hash_en ac_match_en data root_index_over state# hit no_hit ac_match_over failure offset text_buffer_1 text_buffer_2 current state register bitmap interrup Bus t addr data pre_hash_over
Fig. 13. The block diagram of proposed matching architecture.
IDLE FETCH MATCH ROOT_MATCH AC_MATCH SET_ROOT_IDX control=1 control=1 & no_text=0 root_state=1 || (prehash_over=1 & hit=0)
prehash_over=1 & hit=1 root_index_over=1 no_text=1 no_text=0 & set_acmatch_over=1 & text_rdy=0 SET_AC
failure=1 & root_state=1 & root_index_over=1
ac_match_over=1 no_text=0 & set_rmatch_over=1
& text_rdy=0
no_text=1
where Tavg_timeis the average time to process a byte,
Thashis the pre-hashing matching time, Proot is the
probability of using the root-indexing matching,
Troot is the root-indexing matching time, and TAC
is the AC matching time.
However, in the hardware, the pre-hashing, root-indexing and AC can be performed in parallel, and the computation of the next states in these three units are independent. Thus, the average time can be reduced to
Tavg time¼
Proot Trootþ ð1 ProotÞ TAC
ðkroot ProotÞ þ ð1 ProotÞ
: ð2Þ
The number of state skipping depends on the pat-tern sets (form of automaton) and input data (net-work traffic). Hence, the objective performance should consider the average case.
Since AC matching is the critical path, the
worst-case time of SAM is equal to TAC, i.e.,
Tworst time¼ TAC: ð3Þ
The probability Proot is an average probability to
visit the root state and is equal to the average
prob-ability of the true non-hit. Prootis calculated by
Proot¼
X
kpre-hash
j¼1
Ptnc j; ð4Þ
where Proot is computed by summing the
condi-tional probabilities of a true non-hit Ptnc_j, which
is the conditional probability of a true non-hit for
length j. Now, if the (j 1)th pre-hashing is not
matched, then the jth pre-hashing function cannot
be matched either. Ptnc_j is determined from the
unconditional probability of a true non-hit Ptn_ j.
Ptn_1is the first unconditional probability of a true
non-hit Ptn_ j, and can be obtained by
Ptnc 1¼ Ptn 1; ð5Þ
The subsequent Ptnc_ jfor length j can be computed
by Ptnc j¼ 1 Xj1 y¼1 Ptnc y ! Ptn j; ð6Þ
where Pj1y¼1Ptnc y is a summing probability of the
previous Ptnc_ j. When the shorter suffix indicates a
true non-hit, the longer suffix definitely outputs a
true non-hit too. Hence, Ptnc_ j is computed by
subtracting the previous summing probability
Pj1
y¼1Ptnc y, and multiplying by Ptn_ j. Ptn_ j is the
unconditional probability of a true non-hit, which
is referred from[24]as Ptn j¼ 1 1 M jbjj ; ð7Þ
where jbjj is the number of suffixes for the
corre-sponding length j, and M is the size of the bit vector. Pre-hashing intends to improve the probability of a true non-hit by increasing the non-matching suf-fixes. Thus, using one hashing function for each bit vector is sufficient and can significantly reduce hardware cost and latency.
InFig. 15, notation Ptnis the same as the
previ-ous Ptn_1and Ptn_ j. In our observation, a higher Ptn
results in better matching performance because fewer text bytes need to perform AC matching.
Fig. 15 shows the Proot value by computing Eq.
(4)and it obviously shows that a short length of
suf-fixes can also achieve an acceptable Proot whose
value is larger than 0.4. Therefore, setting the
max-imum suffix length kpre-hash to 2 is sufficient. For
example, when Ptn is set to 0.6 and kpre-hash is set
to 2, Proot is equal to 0.84.
For the space evaluation, we need first of all to determine the bit vector size M. Since the
probabil-ity of a true non-hit is defined in Eq.(7), M can be
determined by given number of suffixes jbj and Ptn
as M ¼ 1 1 p 1 jbj tn : ð8Þ
Fig. 16. shows that M increases exponentially asjbj
grows and thus, M is feasible when jbj is small.
a
b
Fig. 15. (a) Prootsimulation for length 1–4 and Ptnfrom 0.1 to
The space requirement can be determined by
summing the bitmap AC space SizeAC, the
pre-hash-ing bit vector space Sizepre-hash, and the
root-index-ing space Sizeroot, as
Sizetotal¼ SizeACþ Sizerootþ Sizepre-hash: ð9Þ
The original space requirement of bitmap AC,
Si-zeAC, is mainly dominated by the state table, which
is equal to the number of statesjSj multiplied by the
state size Sizestate,
SizeAC¼ jSj Sizestate: ð10Þ
Each state size Sizestate includes one byte of state
information, the failure and next state address Sizestate_address, and the size of the bitmap Sizebitmap
for locating the next state. Hence, Sizestate can be
determined by
Sizestate¼ 1 þ Sizestate address 2 þ Sizebitmap: ð11Þ
The pre-hashing size Sizepre-hash is determined from
Pkpre-hash
j¼1 Mj, which is the size of all bit vectors for one
state, where Mjis a bit vector size for length j, and
kpre-hash is the maximum length of the pre-hashing.
jSj is the number of states. Thus, Sizepre-hashis
ob-tained from Sizepre-hash¼ X kpre-hash j¼1 Mj jSj: ð12Þ
Sizeroot, which includes all root-indexing tables and
the root next table. The size of all root-indexing
ta-ble is 256 multiplied by kroot, and the root next table
is the number of the next state addresses multiplied
by the state address size Sizestate_address. The number
of root next state addresses is the cross product of the number of appearing alphabets in the index
ta-bles IDXjand one zero entry. Sizerootis formulated
as
Sizeroot¼ 256 krootþ prodkj¼1rootðjIDXjj þ 1Þ
Sizestate address: ð13Þ
5.2. Simulation analysis
This simulation analysis can determine the per-formance of our simulation software. In our analy-sis, the test contents are execution files in Linux and Windows, as well as normal text files. The 32-bit bit vector and 1000 virus patterns are used to evaluate the proportion of root-indexing matching and bit-map AC matching.
There are two important factors which can affect the rate of the non-hit case. The first factor is the number of patterns. As the number of patterns increases, the branches of a node also increase. This means that the performance will be degraded by raising the rate of the hit portion. The second factor is the size of the bit vector for pre-hashing matching. The 8-bit bit vector is a choice for the development environment when the memory resource is limited, while the 32-bit bit vector has a better performance when enough memory is available.
After analyzing these two key factors, the non-hit rate for different sizes of the bit vector and the num-ber of patterns in the three different data types are
shown in Fig. 17. As the pattern set increases, the
32-bit bit vector has a better relative improvement than the 16-bit bit vector. In addition to the hit rate, the false positive rate of pre-hashing matching is also affected by the size of the bit vector. The false positive will lead to a little penalty in the clock cycles in the internal SRAM architecture.
For the proposed architecture, the 256-bit bitmap, 32-bit bit vector, two 8-bit width IDX table, one root next table, base address pointer of the next state table and failure state pointer are the data structures we used. Each state takes 384 bits and 336 bits to store data structures when the representation bit of the state number is 32 and 16 bits, respectively. For the overall memory usages, 303 kB and 265 kB are the combined
mem-a
b
Fig. 16. (a) The simulation result of bit vector size M for Ptn
from 0.1 to 0.9, with the number of suffixesjbj from 2 to 2048. (b)
ory usages for vector size 32 and 16 bits, respectively.
In addition to the above simulation, we also con-ducted two simulations for the large patterns, which are the static real patterns and the dynamic real net-work traffic simulation. The static simulation of real patterns assumes that the probability of non-hit for the text (network traffic) is a uniform distribution, i.e., the performance is no influence by network traf-fic, and that the result can be obtained with the
equations of Section5.1.
In this analysis, we chose the virus signatures fromhttp://www.clamav.net. Since the virus signa-tures have a lot of patterns with long patterns as well, such patterns are sufficient to evaluate the per-formance of our SAM algorithm. The virus signa-tures have 10,000 patterns and generate 402,173
states. When kpre-hashand kroot are both 2, and the
other parameters are assumed to be the same as the above-mentioned setting, then the conditional probabilities of true non-hit for length 1 and length
2 can be computed as Ptn_1= 0.29 and Ptn_2= 0.14.
According to Prootequation, the probability of
root-indexing matching is computed as Proot= 0.43.
As for the dynamic simulation of the real net-work traffic, the SAM performance is actually not only affected by the pattern sets but also the net-work traffics. Thus, the over 120 MB ethereal cap-tured data were selected as the text to evaluate the above-mentioned URL patterns. Using the same conditions (10,000 virus patterns) with static real
pattern simulation, Proot in the dynamic simulation
of the real network traffic is 0.49, which is close to 0.43 in the static simulation.
5.3. Hardware analysis and comparison
As previously mentioned, our approach is flexible for both internal and external memory architecture. External memory architecture is suitable for large-pattern applications with modest throughput, such as the anti-virus and anti-spam applications. On the other hand, internal memory architecture can be used for high performance with fewer patterns, such as IDS and firewall applications.
The operating frequency of the synthesis result for our internal SRAM architecture is 350 MHz as reported by SynplifyPro. The root-indexing module takes 2 clock cycles to index a mapping state. The bitmap AC matching module takes 8 clock cycles per operation. Thus, the throughput can be esti-mated by the probability, frequency and processing bits per cycle. The best case throughput, wherein no byte has been matched, is 5.6 Gbps. The throughput in the average case, depending on the average pro-portion of the root-indexing matching and the bit-map AC matching, can be estimated at 5.37 Gbps. For the worst case, all bytes are matched in the text buffer. The throughput is 1.56 Gbps.
It is obvious that the average case has a very high performance, and is very close to that in the best case. It also has moderate performance in the worst case. This result demonstrates that our pre-hashing and root-indexing techniques are robust for high-performance packet inspection applications.
Compared to pure bitmap AC in hardware design, 96% of bitmap AC matching can be avoided by our two proposed techniques. This can be esti-mated by the portion of root-indexing, false-positive and non-hit cases at 96.18%. Furthermore, the throughput of pure bitmap AC hardware in the identical hardware environment can be estimated at 440 Mbps. Thus, our throughput is almost 7.65 times faster than the original bitmap AC in the aver-age case.
Since our design is memory based architecture, in the implemented FPGA, Xilinx Virtex2P with speed grade 6 consumed only 1,688 LUTs and 106 Block RAMs, and are far less than that of other works. Compared to memory-based architecture work
[17], the 384 bits of memory usage for each state is
much less than their 8192 bits which use 256 32-bit pointers. Also, the operating frequency of 350 MHz does not decrease as the number and size of patterns grow.
Since many related works [11,12,14,24–26]
employed the duplicate hardware for parallel
pro-Tex t file Win exe file
Linux exe file
a
c
b
Fig. 17. The non-hit rate of 8-bit, 16-bit and 32-bit bit vectors for (a) text files. (b) Windows execution files. (c) Linux execution files.
cessing, the two engine architecture of SAM would be fair in this comparison. The double-engine SAM requires a simple control finite state machine (FSM) to coordinate the two single SAM control-lers, and needs two extra cycles for each single SAM operation. The double-engine SAM has a slightly slow clock rate than the single engine SAM. Moreover, our optimally utilized dual port block RAM of the Xilinx FPGA not only doubles the performance, but it does not increase an extra block RAM.
The results demonstrate that our design has throughput at 10.7 Gbps and a support pattern of 21,563 bytes.
We compared and analyze about 12 major
hard-ware from recent related works as shown inFig. 18.
The common goals for such kind of hardware are to pursue a higher throughput and a larger pattern size, which are the major evaluated factors in this comparison. Pattern sizes are used for measuring scalability with the unit in byte, and the throughput factor is used for measuring performance with the unit in giga bit per second (Gbps).
Nevertheless, 21,563 bytes is not the largest amount for the external memory version. The pro-posed SAM architecture is scalable to support more patterns with high performance, and SAM can be implemented with external multiple memory banks. Although external memory produces overhead for memory access, ASIC hardware can often run at a much higher speed than FPGA devices. For instance, the previous example with 21,302 patterns only ran at a clock rate of 800 MHz only to main-tain about 10 Gpbs throughput with 35 MB
mem-ory requirement, which is quite feasible in today’s technology.
6. Conclusion
In this paper, we presented an architecture which takes scalability, flexibility and performance into consideration. Root-indexing and pre-hashing are the acceleration techniques used to improve the per-formance of our design. Also, our data structures are compressed and stored in either the internal SRAM or the external DRAM. The internal SRAM architecture provides an average of 10.7 Gbps throughput with the size limitation of patterns. The external DRAM architecture provides high sca-lability for the large number of patterns with accept-able throughput.
The internal SRAM architecture is implemented on the Xilinx Virtex2P FPGA-based platform. The string matching function of the target application ClamAV is also modified to set up the string match-ing engine. We tuned the hardware design accordmatch-ing to the analysis results of our software simulation, and also built a prototype system for packet inspec-tion applicainspec-tions such as IDS, URL blocking and ClamAV.
For a robust system evaluation, SAM should be operated in a real network environment for our future work. At the moment, the SAM implementa-tion is a prototype system only and is not yet ready for field trial evaluation.
References
[1] S. Antonatos, K. Anagnostakis, E. Markatos, Generating realistic workloads for network intrusion detection systems, in: ACM Workshop on Software and Performance, Red-wood Shores, CA, Jan. 2004.
[2] G. Navarro, M. Ranot, Flexible Pattern Matching in Strings, Cambridge University Press, 2002.
[3] G. Navarro, A guided tour to approximate string matching, ACM Computing Surveys 33 (1) (2001) 31–88.
[4] S. Wu, U. Manber, Fast text searching allowing errors, Communication of the ACM 35 (1992) 83–91.
[5] R.S. Boyer, J.S. Moore, A fast string searching algorithm, Communications of the ACM 20 (10) (1977) 762–772. [6] A.V. Aho, M.J. Corasick, Efficient string matching: an aid to
bibliographic search, Communications of the ACM, 1975, pp. 333–340.
[7] N. Tuck, T. Sherwood, B. Calder, G. Varghese, Determin-istic memory-efficient string matching algorithms for intru-sion detection, IEEE Infocom, Hong Kong, China, 2004. [8] C. Coit, S. Staniford, J. Mcalerney, Towards faster string
matching for intrusion detection, DARPA Information Survivability Conference and Exhibition 2002, pp. 367–373. Fig. 18. SAM comparison with the other string matching
[9] N. Desai, Increasing performance in high speed NIDS,
<http://www.snort.org/>.
[10] M. Raffinot, On the multi backward Dawg matching algorithm (MultiBDM), Workshop on String Processing, Carleton U. Press, 1997.
[11] L. Tan, T. Sherwood, A high throughput string matching architecture for intrusion detection and prevention, ISCA, 2005.
[12] M. Aldwairi, T. Conte, P. Franzon, Configurable string matching hardware for speeding up intrusion detection, ACM CAN, 2005.
[13] J. Lockwood, An open platform for development of network processing modules in reconfigurable hardware, IEC Design-Con, Santa Clara, CA, Jan. 2001.
[14] J. Moscola, J. Lockwood, R.P. Loui, M. Pachos, Imple-mentation of a content-scanning module for an internet firewall, IEEE FCCM, 2003.
[15] Z.K. Baker, V.K. Prasanna, Time and area efficient pattern matching on FPGAs, ACM/SIGDA FPGA, CA, USA, Feb 2004.
[16] G. Tripp, A finite-state-machine based string matching system for intrusion detection on high-speed network, EICAR, May 2005.
[17] L. Bu, J.A. Chandy, A keyword match processor architec-ture using content addressable memory, ACM VLSI (April) (2004) 26–28.
[18] R. Sidhu, V. Prasanna, Fast regular expression matching using FPGAs, IEEE FCCM April 2001.
[19] R. Franklin, D. Carver, B.L. Hutchings, Assisting network intrusion detection with reconfigurable hardware, IEEE FCCM, Napa, CA, Apr 2002.
[20] C.R. Clark, D.E. Schimmel, Scalable pattern matching for high speed networks, IEEE FCCM, 2004.
[21] Y.H. Cho, W.H. Mangione, A pattern matching coprocessor for network security, ACM/IEEE DAC, California, USA, June 2005.
[22] I. Sourdis, D. Pnevmatikatos, Pre-decoded CAMs for efficient and high-speed NIDS pattern matching, IEEE FCCM, 2004.
[23] M. Gokhale, D. Dubois, A. Dubois, M. Boorman, S. Poole, V. Hogsett, Granidt: towards gigabit rate network intrusion detection technology, LNCS 2438 (Jan) (2002).
[24] S. Dharmapurikar, P. Krishnamurthy, T.S. Sproull, J.W. Lockwood, Deep packet inspection using parallel bloom filters, IEEE Micro. 24 (1) (2004).
[25] H.M. Blu¨thgen, T. Noll, R. Aachen, A programmable processor for approximate string matching with high throughput rate, IEEE ASAP, 2000.
[26] J.H. Park, K.M. George, Parallel string matching algorithms based on dataflow, HICSS, Hawaii, 1999.