HARDWARE-BASED STRING MATCHING - Accelerating Automaton-Based String Matching Hardware with Root-Indexing and Pre-Hashing: Design, Implementation, and Evaluation

2. BACKGROUND

The critical defect of AC is the memory required to store the next pointers for each state. Bitmap AC solves this problem while keeping the advantages of AC. However, to locate the next state in bitmap AC, we must count all the 1s before the valid bit in the 256-bit bitmap. This counting is known to be a time-consuming operation and is costly on x86-based systems.

[Figure: data structure for state i in bitmap AC. The state information holds a matched pointer, a failure state and a next state pointer, together with a 256-bit bitmap. Summing all valid 1s gives the bit offset, which is combined with the next state base address to index the next state table of state i.]

2.2 HARDWARE-BASED STRING MATCHING

The major kinds of string matching hardware are listed in Table 1. The most common matching hardware is the finite automaton approach, which uses a DFA or NFA to match every input byte. Sidhu and Prasanna [13] introduced hardwired Nondeterministic Finite Automata (NFAs) for finding matches to a given regular-expression pattern. Their FPGA implementation matches one text byte per clock cycle. However, this kind of design has a lower clock rate and throughput.

Another approach is the CAM-based architecture, which is able to match all contents at once to achieve high throughput. Nowadays, FPGAs often have embedded block RAMs for high-performance designs. Thus, a CAM is easily constructed from block RAMs. For CAM-based hardware, Cho [7] proposed a solution that uses comparators to perform the initial part of pattern matching and uses the matched prefix as an address into a CAM to read the whole pattern. The other CAM-based solution is the pre-decoding CAM, which was proposed by Baker for processing a large rule set [9]. Some CAM-based implementations [10], [16] also combine the CAM with hardware comparators for lower circuit usage and higher performance.

As for the last two kinds of matching hardware, memory-based hardware was presented by Monther and Thomas [17]. They constructed an AC table that adds the extra failure links of the longest prefix and stores them in external memory. However, the memory requirement of their design is too large. A different architecture is Bloom filter hardware [12]-[13], which uses multiple hash functions for approximate matching. Once a Bloom filter reports a possible hit, further verification is needed for exact matching. The main drawback of Bloom filter hardware is that a dedicated processing unit is needed for every pattern length.

Most of the aforementioned works can achieve up to 10 Gbps by hardwiring the rule sets into the FPGA, which limits the scalability of rules and the size of patterns.

Even the memory-based design wastes too much memory space in its data structure. Therefore, the scalability of patterns and rules is our main concern, and it is the focus of this work.

Table 1
Comparison of existing string matching hardware

Matching Hardware        Advantage                      Disadvantage
NFA / DFA Hardware       Easy to implement              High area cost
                         Regular expression support     Modest throughput
CAM-based Hardware       High throughput                High area cost
Bloom Filter Hardware    Low area cost                  False positive issue
                                                        Fixed length
Memory-based Hardware    Reconfigurable                 Low throughput
                         High capacity                  Memory space wasting

Chapter 3 Fast Bitmap AC String Matching Hardware

3.1 Overview

This thesis is based on Tseng’s string matching approach [18]. It can match multiple characters at the root state by root-indexing matching, and it avoids some slow bitmap AC matching operations by pre-hashing matching. These two acceleration techniques can also operate in parallel. An example is illustrated in Figs. 3 to 5, which show the difference between bitmap AC and fast bitmap AC.

[Figure 3 content: (a) Goto trie: 0 -T-> 1 -E-> 2 -S-> 3 -T-> 4; 1 -H-> 5 -E-> 6; 0 -H-> 7 -E-> 8. (b) Output function: output(4) = {TEST}, output(6) = {THE}, output(8) = {HE}. (c) Input text: TESTTHEUSHER.]

Fig. 3. An example for fast bitmap AC. (a) Goto Trie. (b) Output function. (c) Input text.

Fig. 3 is an example of the original AC that can be used for both bitmap AC and fast bitmap AC. In AC and in bitmap AC, each transition goes to the next state according to the given byte. Their transition sequences, which follow the previously described AC and bitmap AC algorithms, are the same, as shown in Fig. 4.

Fig. 4. State transition sequence of conventional AC.

Since our fast bitmap AC approach implements the root-indexing and pre-hashing techniques, its transition sequence differs from that of the original AC and bitmap AC, as shown in Fig. 5. The “RI” symbol denotes root-indexing.

Fig. 5. State transition sequence of fast bitmap AC.

The root-indexing mechanism can process two or more bytes at the same time and is applied to the root state in our design. Therefore, starting from the root state, the next state is decided by root-indexing. Beyond the root state, pre-hashing is used to quickly examine the existence of a next state for every state transition. When the pre-hashing unit reports no match for the given two bytes, the next state is determined by the root-indexing unit. Otherwise, the next state is decided by the bitmap AC unit. It is worth noting that, because of parallel processing, when the two characters “TH” are given at state 4, the next state 5 is obtained directly from the root-indexing unit.

3.2 Root-Indexing Matching

In an AC trie, most failure links point to the root state; that is, matching goes back to the root state in most cases when there is no next state for a given character. Thus, it is efficient to apply root-indexing at the root state. Root-indexing can match multiple characters simultaneously at the root state. In Fig. 6, root-indexing comprises k index tables IDX[1…k] and a root next table NEXT, where k denotes the maximum number of characters that root-indexing matches at the same time. Each entry of IDX stores a partial address for locating the next state in NEXT. The partial address is a unique sequential integer that represents the order of the characters appearing at the corresponding position of the substrings in the suffixes of the root state. Note that, to advance k characters in one matching iteration, a substring may start at any byte from the current one up to the k-th, which means that each later IDX table is required to include the entries of the former IDX tables.


Fig. 6. Root-indexing architecture and example for the input text “TEST” with the patterns “TEST”, “THE” and “HE”.

For example, if the patterns are “TEST”, “THE” and “HE”, IDX₁ to IDX₄ must contain at least the characters appearing at the corresponding positions: {“H”, “T”} for level 1, {“E”, “H”} for level 2, {“E”, “S”} for level 3, and {“T”} for level 4. However, because the later tables are required to contain the entries of the former tables, IDX₁ to IDX₄ actually contain {“H”, “T”}, {“E”, “H”, “T”}, {“E”, “H”, “S”, “T”} and {“E”, “H”, “S”, “T”}, respectively.

For numbering the entries of the IDX tables, the first IDX table has two characters, so “H” and “T” are numbered “01” and “10” in binary. The second IDX table uses “01”, “10” and “11” for {“E”, “H”, “T”}. The NEXT table stores all the next states within length k, and it is indexed by the concatenation of the values looked up from all the IDX tables. In the example of Fig. 6, 10_01_001_000, 10_01_011_100, 10_10_001_000 and 10_11_000_000 are the concatenated addresses that locate the next states for “TEE”, “TEST”, “THE” and “TT”.
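The address concatenation in this example can be reproduced with a short C sketch. It is only an illustration under stated assumptions: the table contents are hard-coded from the example above, the names `root_index_address`, `idx_code`, `IDX_CHARS` and `IDX_BITS` are hypothetical, and the sketch assumes that code 0 marks an absent character and that every level after the first absent character (or after the end of the text) contributes 0.

```c
#include <stddef.h>
#include <string.h>

#define ROOT_K 4  /* number of IDX tables in the example (k = 4) */

/* Characters held by IDX1..IDX4 in the example, in numbering order.
   The 1-based position in each string is the partial address. */
static const char *const IDX_CHARS[ROOT_K] = { "HT", "EHT", "EHST", "EHST" };

/* Width in bits of the partial address taken from each IDX table. */
static const unsigned IDX_BITS[ROOT_K] = { 2, 2, 3, 3 };

/* Partial address of character c in table `level`; 0 means "absent". */
static unsigned idx_code(size_t level, char c)
{
    const char *p = strchr(IDX_CHARS[level], c);
    return p ? (unsigned)(p - IDX_CHARS[level]) + 1u : 0u;
}

/* Concatenate the partial addresses of the first ROOT_K characters of text
   into one NEXT-table address. Assumption: once a character is absent or
   the text ends, all remaining levels contribute 0. */
unsigned root_index_address(const char *text)
{
    unsigned addr = 0;
    int ended = 0;

    for (size_t i = 0; i < ROOT_K; i++) {
        unsigned code = 0;
        if (!ended) {
            if (text[i] == '\0') {
                ended = 1;
            } else {
                code = idx_code(i, text[i]);
                if (code == 0)
                    ended = 1;
            }
        }
        addr = (addr << IDX_BITS[i]) | code;
    }
    return addr;
}
```

With these tables, `root_index_address("TEST")` yields 10_01_011_100 in binary (0x25C), which matches the address given for “TEST” above; a NEXT table would then be indexed by this value.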

3.3 Pre-Hashing Matching

The pre-hashing method quickly examines the existence of a next state to avoid further slow AC matching. It uses a single hash function and builds a bit vector for the substrings of each state. During pre-hashing, if the pre-hashing unit indicates a true negative, the next state is obtained from the root-indexing unit instead of from the bitmap AC unit. A true negative is the condition in which the hash of the given character is absent from the pre-hashing bit vector built for the suffixes of the current state.

Before pre-hashing matching, the pre-hashing bit vectors must be built in the preprocessing phase. First, we input the AC trie built by the conventional AC algorithm. For each state, we extract the suffixes of length 1, which differs from Tseng’s original design. The suffixes reached through the recursive failure links of each state, except links to the root state, are also included. This avoids filling the bit vector with almost all 1s when the number of patterns is large, which would lead to a high hit rate. Once the suffixes are obtained, the pre-hashing algorithm hashes them into bit vectors. The procedure of building the bit vector for state 1 in Fig. 7(a) is illustrated in Fig. 7(b). The hash function in our design masks the rightmost four bits of the character and transforms them from binary to a one-hot representation. However, the mask position can be adjusted to lower the false-positive rate according to the characteristics of the patterns.

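[Figure 7 content: (a) State 1 has next state 2 on “E” and next state 5 on “H”. (b) ASCII “E” = 01000101, whose rightmost four bits give decimal 5; ASCII “H” = 01001000, whose rightmost four bits give decimal 8. The resulting bit vector, written from bit 15 down to bit 0, is 0000000100100000.]

Fig. 7. (a) AC trie of state 1 for building bit vector. (b) Example of building the bit vector for state 1 in the preprocessing phase.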

A pre-hashing matching example is shown in Fig. 8. The pre-hashing unit reads a one-byte substring, here “G”, and hashes it. When the pre-hashing unit indicates a non-hit, the next state 5 for the substring “GDTH” is obtained from the root-indexing unit. However, if the pre-hashing unit indicates a hit, the slow bitmap AC matching is performed.

[Figure 8 content: bit vector of state 1, from bit 15 down to bit 0: 0000000100100000. ASCII “G” = 01000111. Input text: G D T H.]

Fig. 8. Example for hashing at state 1.
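The hash function and the two examples above can be sketched in C as follows. This is a minimal sketch assuming a 16-bit vector and the rightmost-four-bit mask described in the text; the function names `prehash`, `build_bit_vector` and `prehash_hit` are hypothetical and are not taken from the thesis.

```c
#include <stddef.h>
#include <stdint.h>

/* Hash one character: keep its rightmost four bits and convert the
   resulting value (0..15) to a one-hot 16-bit word. */
uint16_t prehash(unsigned char c)
{
    return (uint16_t)(1u << (c & 0x0Fu));
}

/* Preprocessing: OR together the hashes of all length-1 suffixes of a state. */
uint16_t build_bit_vector(const unsigned char *chars, size_t n)
{
    uint16_t vec = 0;
    for (size_t i = 0; i < n; i++)
        vec = (uint16_t)(vec | prehash(chars[i]));
    return vec;
}

/* Matching: returns 0 for a non-hit (true negative), 1 for a hit.
   A hit may be a false positive and still needs bitmap AC matching. */
int prehash_hit(uint16_t vec, unsigned char c)
{
    return (vec & prehash(c)) != 0;
}
```

For state 1, `build_bit_vector` over “E” and “H” sets bits 5 and 8 (0x0120). “G” hashes to bit 7 and is a non-hit, while a character such as “U” (ASCII 01010101) also hashes to bit 5 and is reported as a hit even though state 1 has no transition on it, which is the false-positive case.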

Chapter 4 Implementation

4.1 Overview

To verify the correctness of fast bitmap AC, we wrote a C simulation model, which is also used to generate the essential data structures from the patterns. These data structures are loaded to initialize the matching hardware. After the system is initialized, user applications call the string matching API offered by the driver to start the matching operation. Once a text buffer has been scanned, the matching engine triggers an interrupt signal. The interrupt handler is then invoked to check the matching results and to fill the buffer with new text if any remains.

The hardware design is partitioned into five main modules, each in charge of a specific function. The implementation is based on the Xilinx ML310 FPGA platform, and Xilinx’s ISE and EDK and Synplicity’s SynplifyPro are the development tools for embedded software/hardware integration. The MontaVista RTOS is chosen for our system, and ClamAV is the target application for our content filtering system.

4.2 Pre-Processing and Simulation Software

The pre-processing procedure generates the essential data structures for the proposed hardware, as shown in Fig. 9(a). The Make_Goto() and Make_Failure() functions are the original functions defined by the AC algorithm, and our data structures are built from the table constructed by these two basic functions. For bitmap AC, the Make_Bitmap() function builds a 256-bit bitmap for each state and sets the bit at the corresponding position to 1 for each existing next state. It also builds the next state table for each state. The next function is Build_Index(), which builds the IDX[1…k] tables and the root next table NEXT for root-indexing. In the final stage, Build_BitVector() prepares pre-hashing by setting bits of the bit vector to 1, using the hash function, according to all next states of both the current state and its recursive failure nodes.


Fig. 9. (a) The pre-processing procedure. (b) The flow of C simulation model.

After the pre-processing procedure is finished, the simulation of the proposed fast bitmap AC algorithm performs matching according to the flow in Fig. 9(b). Each matching iteration first checks the current state. When the current state is the root state, Root-Index() matching is performed; otherwise, Pre_Hash() is performed. If Pre_Hash() reports a non-hit, the current state is set to the root state directly and root-indexing matching is performed. If a hit is reported, Search_Bitmap() checks the existence of the next state for the given byte. If Search_Bitmap() = 1, the next state is obtained from the base address pointer of the next state table plus the return value of Bitmap_offset(). Note that if Search_Bitmap() reports zero, the current state is set to the failure state in a while loop until the current state is the root state. This C model serves as the golden model for the proposed hardware design, and it can also be used to gather statistics for performance analysis.
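The bitmap lookup in this flow can be sketched in C as follows. This is not the thesis’s C model: the struct layout, the lowercase names `search_bitmap`, `bitmap_offset`, `set_bit` and `next_state_address`, their signatures, and the convention that “before the valid bit” means lower character codes are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed per-state record: a 256-bit bitmap plus the base address of the
   state's next state table. */
typedef struct {
    uint64_t bitmap[4];   /* bit c is 1 if a next state exists for byte c */
    uint32_t next_base;   /* base address of the next state table */
} bitmap_state;

/* Portable population count of one 64-bit word. */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;       /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Preprocessing helper: mark byte c as having a next state. */
void set_bit(bitmap_state *s, unsigned char c)
{
    s->bitmap[c >> 6] |= UINT64_C(1) << (c & 63);
}

/* Returns 1 if a next state exists for byte c, 0 otherwise. */
int search_bitmap(const bitmap_state *s, unsigned char c)
{
    return (int)((s->bitmap[c >> 6] >> (c & 63)) & 1u);
}

/* Counts the 1s that precede bit c; this is the index of c's entry in the
   next state table. */
unsigned bitmap_offset(const bitmap_state *s, unsigned char c)
{
    unsigned word = c >> 6, bit = c & 63, n = 0;

    for (unsigned i = 0; i < word; i++)
        n += popcount64(s->bitmap[i]);
    /* Mask off bit c and everything above it in the last word. */
    return n + popcount64(s->bitmap[word] & ((UINT64_C(1) << bit) - 1));
}

/* Address of the next state entry: base address plus bit offset. */
uint32_t next_state_address(const bitmap_state *s, unsigned char c)
{
    return s->next_base + bitmap_offset(s, c);
}
```

The loop over up to four 64-bit words is the counting step that makes this lookup slow in software, which is why the flow tries root-indexing and pre-hashing first.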

4.3 Matching Hardware

This subsection introduces the proposed hardware architecture, its block diagram and the FPGA platform in turn.

4.3.1 Architecture

The proposed architecture is a highly parallel design in which all modules work at the same time, and it is flexible enough for either internal or external memory-based platforms. The block diagram of the proposed architecture is shown in Fig. 10.


Fig. 10. The block diagram of proposed fast bitmap AC architecture.

1) FSM module: The most important part of our architecture is the FSM module, which controls the working flow of the whole hardware system. Once the SM controller is enabled, the FSM controls all other modules in parallel. The detailed state transitions are shown in Fig. 11.

Fig. 11. State transition diagram of FSM module.

In this FSM diagram, the starting point is the IDLE state. When the control signal is enabled, the FETCH state fetches the text waiting to be scanned if the text buffer is empty. Otherwise, the MATCH state enables the root-indexing, pre-hashing and bitmap AC matching units simultaneously. If the current state is the root or the result of pre-hashing is a non-hit, the FSM transitions to ROOT_MATCH to keep the root-indexing module working. Once the root-indexing matching is done, the current state is assigned by the root-indexing module in the SET_ROOT_IDX state. Afterward, the FSM returns to the MATCH state to match the subsequent text. When a hit is reported, the bitmap AC matching and root-indexing matching are triggered in the AC_MATCH state, and the next state is assigned by the root-indexing module if the failure link requires the current state of the AC trie to be set to the root. Otherwise, the next state is provided by the bitmap AC matching module.

2) Root-Indexing module: This module is used for fast state indexing at the root state. It contains internal RAMs for the index tables and the root next table. For practical memory and throughput considerations, it matches only two characters at the same time. For a large number of patterns, when the root has more than 128 next states, the two characters can be used to index the next state directly from the NEXT table. In this case, the lookup takes only one clock cycle. For a small number of patterns, the given two characters first index an encoded address from the IDX table, and the obtained address is then used to index the next state from the NEXT table. This saves a large amount of memory, but it takes two clock cycles to index a next state. After finishing the root-indexing matching, this module outputs the next state to the SM controller.

3) Pre-Hashing module: The pre-hashing module tests the bit vector for two input bytes using the hash function and sends the hashing result to the FSM. For an external-memory architecture, the bit vector is stored in internal memory, which saves the considerable time of fetching the 256-bit bitmap when the hash result is a miss.

4) Bitmap AC matching module: When this module is enabled by the FSM, it first checks the bit corresponding to the input byte. If that bit is 1, it masks off the unnecessary bits and counts all the remaining 1s to locate the next state. Otherwise, it issues the failure signal and notifies the controller to set the failure state as the current state.

5) SM controller module: This module sits between the system bus and the string matching module as a whole. It provides control registers, including the text buffer length and the enable signal, for software to program. It also contains two text buffers and two matching-result buffers for content applications. After a buffer is scanned, the SM controller triggers the interrupt signal, and the application reads out the matching result, if any, and fills in new text. For the string matching module, it provides the input bytes from the text buffers and feeds the necessary data structures to each module.
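The control flow of the FSM module described in item 1 can be sketched as a next-state function in C. Fig. 11 is not reproduced here, so this is only a reading of the prose: the input flags and the names `fsm_state`, `fsm_inputs` and `fsm_next` are hypothetical, and the FETCH to MATCH and AC_MATCH to MATCH transitions are assumptions, since the text does not state them explicitly.

```c
#include <stdbool.h>

/* Control states named in the FSM description. */
typedef enum {
    IDLE, FETCH, MATCH, ROOT_MATCH, SET_ROOT_IDX, AC_MATCH
} fsm_state;

/* Hypothetical input flags sampled by the FSM in each cycle. */
typedef struct {
    bool enabled;        /* control signal from the SM controller */
    bool buffer_empty;   /* text buffer holds no text to scan */
    bool at_root;        /* current AC state is the root state */
    bool prehash_hit;    /* pre-hashing unit reported a hit */
    bool root_idx_done;  /* root-indexing lookup has finished */
} fsm_inputs;

/* Next control state for the current state and inputs. */
fsm_state fsm_next(fsm_state s, const fsm_inputs *in)
{
    switch (s) {
    case IDLE:
        if (!in->enabled)
            return IDLE;
        return in->buffer_empty ? FETCH : MATCH;
    case FETCH:
        return MATCH;    /* assumption: matching starts once text is fetched */
    case MATCH:
        /* Root state or pre-hash non-hit: only root-indexing is needed. */
        return (in->at_root || !in->prehash_hit) ? ROOT_MATCH : AC_MATCH;
    case ROOT_MATCH:
        return in->root_idx_done ? SET_ROOT_IDX : ROOT_MATCH;
    case SET_ROOT_IDX:
        return MATCH;    /* stated in the text: return to MATCH */
    case AC_MATCH:
        return MATCH;    /* assumption: continue with the following text */
    }
    return IDLE;
}
```

The sketch leaves out end-of-buffer handling and the choice, inside AC_MATCH, between the root-indexing result and the bitmap AC result.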

4.3.2 FPGA Implementation

Fig. 12. The architecture of ML310 platform.

We use the Xilinx ML310 FPGA-based platform as our development system, as shown in Fig. 12. This platform has 2448 Kbits of internal block RAM, 30816 LUTs and two hardwired IBM PPC405 processors in the FPGA. As peripherals, we use one Ethernet port, one PCI slot for an additional NIC, one 256 MB DDR RAM module and one CF card to store the file system image. Packets enter through the on-board Ethernet port and are processed by the PPC405 CPU, which offloads the packet content to the string matching engine. Finally, the clean traffic leaves through the NIC in the PCI slot.

The MontaVista Preview Kit is chosen as our RTOS. Xilinx EDK, ISE and Synplicity’s SynplifyPro are the basic development tools. The EDK generates the BSP and the bit stream file for our system design. The BSP includes the address-mapping definition files and the drivers of all peripherals for building the complete RTOS image. For the RTL design of the string matching hardware, ModelSim and Debussy are the simulator and debugger we used, respectively.

4.4 Driver Interface

We also provide the driver interface for communication between hardware and software. The detailed driver functions are listed below.

Write_Buff(unsigned int length, char *buff, void *base_addr);

This function writes the text to the buffer for scanning, and also specifies the length and address of the target buffer.

Read_Match_Result(unsigned int *match_count, void *base_addr);

This function reads the matching results from the result buffer, and the application then identifies the matched virus name.

Intr_Handler(void *baseaddr_p);

When the interrupt signal is triggered, this function is invoked to check the matching results and to write the text buffer. Therefore, it calls Write_Buff() and Read_Match_Result().

Start_Matching(char *buffer, unsigned int length);

This function is called by the ClamAV string matching function. It writes the text waiting to be scanned to the buffer and sets the text-ready register to enable the matching operation.

Stop_Matching(void *baseaddr_p);

This function is called when the application is closed. It clears the string matching control register to stop the operation.

cli_ac_scanbuff(const char *buffer, unsigned int length, int *vir_id, struct ac_status *status);

This is the ClamAV API that performs the string matching. It specifies the address and length of the text buffer allocated by ClamAV, and returns the matched virus IDs and the matching status. It divides the buffer pointed to by buffer into several portions, depending on the size of the text buffer we use, and matches them sequentially. Thus, the Start_Matching() function is called in a for loop to scan each text portion and return the matched results.
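The portion-by-portion scanning described for cli_ac_scanbuff() can be sketched in C as follows. This is a hedged illustration rather than the driver code: `scan_in_portions`, the callback type `start_matching_fn`, the `int` return convention, and the stand-in `demo_start_matching` are all hypothetical, and the real Start_Matching() may differ in signature and behavior.

```c
#include <stddef.h>

/* Hypothetical callback type standing in for Start_Matching(); a return
   value of 0 is assumed to mean success. */
typedef int (*start_matching_fn)(char *buffer, unsigned int length);

/* Split text into portions of at most buf_size bytes and hand each portion
   to start_matching in order. Returns the number of portions scanned, or
   -1 on invalid arguments or when a portion fails. */
int scan_in_portions(char *text, unsigned int length,
                     unsigned int buf_size, start_matching_fn start_matching)
{
    int portions = 0;
    unsigned int off = 0;

    if (buf_size == 0 || start_matching == NULL)
        return -1;

    while (off < length) {
        unsigned int remaining = length - off;
        unsigned int n = remaining < buf_size ? remaining : buf_size;

        if (start_matching(text + off, n) != 0)
            return -1;
        off += n;
        portions++;
    }
    return portions;
}

/* Stand-in for the driver call, used only to exercise the loop: it records
   how many bytes were handed over in total. */
unsigned int demo_total = 0;

int demo_start_matching(char *buffer, unsigned int length)
{
    (void)buffer;
    demo_total += length;
    return 0;
}
```

For the 12-byte text “TESTTHEUSHER” and a 5-byte text buffer, the loop issues three calls with 5, 5 and 2 bytes.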

Chapter 5 Evaluation

5.1 Simulation Analysis

This simulation analysis determines the performance of our design using the software simulation flow described in Chapter 4. In our analysis, the test contents are executable files from Linux and Windows and normal text files. A 32-bit bit vector and 1000 virus patterns are used to evaluate the proportions of root-indexing matching and bitmap AC matching, as shown in Fig. 13(a). A high proportion of fast root-indexing matching improves the performance.

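[Figure 13 content: (a) chart with values 79.94% and 20.06% and legend entries Root-Indexing and Pre-Hashing. (b) chart with values 4.97%, 19.03% and 76.00% and legend entries hit (false positive), hit and non-hit.]

Fig. 13. (a) The proportion of root-indexing and pre-hashing. (b) The proportion of hit, non-hit and false positive.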

The pre-hashing portion in Fig. 13(a) can be divided into three sub-portions, as shown in Fig. 13(b). The first and second are the hit and false-positive portions, which account for 24% and 12% and must perform the slow bitmap AC matching operation. The third is the non-hit portion, which accounts for 64% and performs the fast root-indexing matching. Thus, as the proportion of non-hits increases, the performance improves.
