3. Fast Bitmap AC String Matching Hardware

3.3 Pre-Hashing Matching

The pre-hashing method quickly checks whether a next state can exist, so that the slow bitmap AC matching can be skipped when it cannot. It uses a single hash function and builds one bit vector per state from the substrings of that state. During matching, if the pre-hashing unit indicates a true negative, the next state is obtained from the root-indexing unit instead of the bitmap AC unit. A true negative means that the input character is absent from the pre-hashing bit vector built for the suffixes of the current state.

The pre-hashing bit vectors must be built in the preprocessing phase, before matching starts. The input is the AC trie built by the conventional AC algorithm. For each state, we extract only the suffixes of length 1, which differs from Tseng's original design. The suffixes reached through the recursive failure links of each state, except the link to the root state, are also included. Restricting the suffixes in this way keeps the bit vector from filling with almost all 1s when the number of patterns is large, which would otherwise cause a high hit rate. Once the suffixes are obtained, the pre-hashing algorithm hashes them into the bit vector. Fig. 7(b) illustrates this procedure for state 1 of Fig. 7(a). The hash function in our design masks the rightmost four bits of the character and converts the resulting binary value to a one-hot representation. The mask position can be adjusted to lower the false positive rate according to the characteristics of the patterns.

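[Fig. 7 content: (a) AC trie fragment with states 1, 2 and 5 and edges labeled E and H. (b) E (ASCII 01000101) hashes to 5 and H (ASCII 01001000) hashes to 8, giving the 16-bit vector 0000 0001 0010 0000 (bit 15 on the left, bit 0 on the right).]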

Fig. 7. (a) AC trie of state 1 for building bit vector. (b) Example of building the bit vector for state 1 in the preprocessing phase.

Fig. 8 shows a pre-hashing matching example at state 1. The pre-hashing unit reads one byte of the input, here "G", hashes it, and reports the result. If the pre-hashing unit reports a non-hit, the next state, state 5 for the substring "GDTH", is obtained from the root-indexing unit. If it reports a hit, the slow bitmap AC matching is performed instead.

[Fig. 8 content: bit vector 0000 0001 0010 0000 (bit 15 on the left, bit 0 on the right); G (ASCII 01000111); input text: G D T H.]

Fig. 8. Example for hashing at state 1.

Chapter 4 Implementation

4.1 Overview

To verify the correctness of fast bitmap AC, we wrote a C simulation model, which is also used to generate the essential data structures from the patterns. These data structures are loaded to initialize the matching hardware. After the system is initialized, user applications call the string matching API offered by the driver to start the matching operation. Once a text buffer has been scanned, the matching engine triggers an interrupt. The interrupt handler is then invoked to check the matching results and to fill the buffer with new text, if any.

The hardware design is partitioned into five main modules, each in charge of a specific function. The implementation is based on the Xilinx ML310 FPGA platform, and Xilinx ISE, Xilinx EDK and Synplicity's SynplifyPro are the development tools for embedded software/hardware integration. The MontaVista RTOS is chosen for our system, and ClamAV is the target application for our content filtering system.

4.2 Pre-Processing and Simulation Software

The pre-processing procedure generates the essential data structures for the proposed hardware, as shown in Fig. 9(a). The Make_Goto() and Make_Failure() functions are the original functions defined by the AC algorithm, and our data structures are built from the tables constructed by these two functions. For bitmap AC, the Make_Bitmap() function builds a 256-bit bitmap for each state and sets the bit position corresponding to each existing next state to 1. It also builds the next state table for each state. The next function, Build_Index(), builds the IDX_[1…k] tables and the root next table NEXT for root-indexing. In the final stage, Build_BitVector() builds the pre-hashing bit vector of each state by hashing the transitions to all next states of both the current state and its recursive failure nodes and setting the corresponding bits to 1.

Fig. 9. (a) The pre-processing procedure. (b) The flow of C simulation model.

After the pre-processing procedure is finished, the simulation of the proposed fast bitmap AC algorithm performs matching according to the flow in Fig. 9(b). Each matching iteration first checks the current state. When the current state is the root state, Root-Index() matching is performed; otherwise Pre_Hash() is performed. If Pre_Hash() reports a non-hit, the current state is set to the root state directly and root-indexing matching is performed. If a hit is reported, Search_Bitmap() checks whether a next state exists for the given byte. If Search_Bitmap() returns 1, the next state is read from the next state table at the base address pointer plus the return value of Bitmap_offset(). If Search_Bitmap() returns 0, the current state is set to its failure state in a while loop until a next state is found or the current state is the root state. This C model serves as the golden model for the proposed hardware design, and it can also be used to gather statistics for performance analysis.

4.3 Matching Hardware

This subsection introduces the proposed hardware architecture, its block diagram and the FPGA platform in turn.

4.3.1 Architecture

The proposed architecture is a highly parallel design in which all modules work at the same time, and it is flexible enough for either internal or external memory-based platforms. The block diagram of the proposed architecture is shown in Fig. 10.

Fig. 10. The block diagram of proposed fast bitmap AC architecture.

1) FSM module: The most important part of our architecture is the FSM module, which controls the working flow of the whole hardware system. Once the SM controller is enabled, the FSM controls all other modules in parallel. The detailed state transitions are shown in Fig. 11.

Fig. 11. State transition diagram of FSM module.

In this FSM diagram, the starting point is the IDLE state. When the control signal is enabled, the FETCH state fetches the text waiting to be scanned if the text buffer is empty. Otherwise, the MATCH state enables the root-indexing, pre-hashing, and bitmap AC matching units simultaneously. If the current state is the root or the pre-hashing result is a non-hit, the FSM transitions to ROOT_MATCH to keep the root-indexing module working. Once the root-indexing matching is done, the current state is assigned by the root-indexing module in the SET_ROOT_IDX state. Afterward, the FSM returns to the MATCH state to match the subsequent text. When a hit is reported, the bitmap AC matching and root-indexing matching are triggered in the AC_MATCH state. The next state is assigned by the root-indexing module if the failure links set the current state of the AC trie to the root. Otherwise, the next state is provided by the bitmap AC matching module.

2) Root-Indexing module: This module is used for fast state indexing at the root state. It contains internal RAMs for the index tables and the root next table. For practical memory and throughput considerations, it matches only two characters at a time. For a large number of patterns, when the root has more than 128 next states, the two characters can be used to index the next state directly from the NEXT table. In this case, it takes only one clock cycle. For a small number of patterns, the two characters first index an encoded address from the IDX table, and the obtained address is then used to index the next state from the NEXT table. This saves a large amount of memory, but it takes two clock cycles to index a next state. After finishing the root-indexing matching, this module outputs the next state to the SM controller.

3) Pre-Hashing module: The pre-hashing module tests the bit vector for two input bytes using the hash function and sends the hashing result to the FSM. For the external memory architecture, the bit vector is stored in internal memory, which saves the considerable time otherwise spent fetching the 256-bit bitmap when the hash result is a non-hit.

4) Bitmap AC matching module: When this module is enabled by the FSM, it first checks the bit corresponding to the input byte. If the bit is 1, it masks off the unnecessary bits and counts the remaining 1s to locate the next state. Otherwise, it issues the failure signal and notifies the controller to set the failure state as the current state.

5) SM controller module: This module sits between the system bus and the whole string matching module. It provides control registers, including the length of the text buffer and the enable signal, for software to program. It also contains two text buffers and two matching-result buffers for content applications. After a buffer is scanned, the SM controller triggers the interrupt signal, and the application reads out the matching result, if any, and fills in the new text. For the whole string matching module, it provides the input bytes from the text buffers and feeds the necessary data structures to each module.

4.3.2 FPGA Implementation

Fig. 12. The architecture of ML310 platform.

We use the Xilinx ML310 FPGA-based platform as our development system, as shown in Fig. 12. This platform has 2448 Kbits of internal block RAM, 30816 LUTs and two hardwired IBM PPC405 processors in the FPGA. For the peripherals, we use one Ethernet port, one PCI slot for an additional NIC, one 256 MB DDR RAM module and one CF card to store the file system image. Packets enter through the on-board Ethernet port and are processed by the PPC405 CPU, and the packet content is offloaded to the string matching engine. Finally, the clean traffic is output through the NIC in the PCI slot.

The MontaVista Preview Kit is chosen as our RTOS. Xilinx EDK, ISE and Synplicity's SynplifyPro are the basic development tools. The EDK generates the BSP and the bitstream file for our system design. The BSP includes the address mapping definition files and the drivers of all peripherals needed to build the complete RTOS image. For the RTL code design of the string matching hardware, ModelSim and Debussy are used as the simulator and debugger, respectively.

4.4 Driver Interface

We also provide a driver interface for communication between hardware and software. The driver functions are listed below.

Write_Buff(unsigned int length, char *buff, void *base_addr);

This function writes the text to be scanned into the buffer, and also specifies the length and address of the target buffer.

Read_Match_Result(unsigned int match_count, void *base_addr);

This function reads the matching results from the result buffer, from which the application identifies the matched virus name.

Intr_Handler(void *baseaddr_p);

When the interrupt signal is triggered, this function is invoked to check the matching results and to write the text buffer. It therefore calls Write_Buff() and Read_Match_Result().

Start_Matching(char *buffer, unsigned int length);

This function is called by the ClamAV string matching function. It writes the text waiting to be scanned into the buffer and sets the text ready register to enable the matching operation.

Stop_Matching(void *baseaddr_p);

This function is called when the application is closed. It clears the string matching control register to stop the operation.

cli_ac_scanbuff(const char *buffer, unsigned int length, int vir_id, struct ac_status *status);

This is the ClamAV API that performs the string matching. It specifies the address and length of the text buffer allocated by ClamAV, and returns the matched virus IDs and the match status. It divides the buffer pointed to by buffer into several portions, depending on the size of the text buffer we use, and matches them sequentially. Thus, the Start_Matching() function is called in a for loop to scan each text portion, and the matched results are returned.

Chapter 5 Evaluation

5.1 Simulation Analysis

This simulation analysis evaluates the performance of our design using the software simulation flow described in Chapter 4. The test contents are executable files from Linux and Windows, and normal text files. A 32-bit bit vector and 1000 virus patterns are used to evaluate the proportions of root-indexing matching and bitmap AC matching, as shown in Fig. 13(a). A high proportion of fast root-indexing matching improves the performance.

[Fig. 13 data: (a) Root-Indexing 79.94%, Pre-Hashing 20.06%. (b) hit (false positive) 4.97%, hit 19.03%, non-hit 76.00%.]

Fig. 13. (a) The proportion of root-indexing and pre-hashing. (b) The proportion of hit, non-hit and false positive.

The pre-hashing portion in Fig. 13(a) can be divided into three sub-portions, as shown in Fig. 13(b). The first and second are the hit and false positive portions, which account for 19.03% and 4.97% and must perform the slow bitmap AC matching operation. The third is the non-hit portion, which accounts for 76% and performs the fast root-indexing matching. Thus, as the proportion of non-hit cases increases, the performance improves.

Two important factors affect the rate of the non-hit case. The first is the number of patterns. As the number of patterns increases, the number of branches of a node increases, so the hit portion grows and the performance degrades. The second is the size of the bit vector for pre-hashing matching. To balance performance and memory usage, the bit vector size can be adjusted in the preprocessing phase. A reasonable size is 8 bits or 32 bits for practical purposes. The 8-bit bit vector is a choice for a development environment with limited memory, and the 32-bit bit vector gives better performance when memory is available. To analyze these two factors, Fig. 14 shows the non-hit rate for different bit vector sizes and numbers of patterns on three data types. As the pattern set grows, the 32-bit bit vector shows a more apparent improvement than the 16-bit bit vector.

Fig. 14. The non-hit rate of 8-bit, 16-bit and 32-bit bit vectors for (a) text files, (b) Windows executable files, (c) Linux executable files.

In addition to the hit rate, the false positive rate of pre-hashing matching is also affected by the size of the bit vector, as shown in Fig. 15. A false positive costs only a few clock cycles in the internal SRAM architecture, but it incurs a large bus contention penalty in the external DRAM architecture.

Fig. 15. The false positive rate of 16-bit and 32-bit bit vectors for (a) text files, (b) Windows executable files, (c) Linux executable files.

5.2 Hardware Analysis

For the proposed architecture, the data structures we use are the 256-bit bitmap, the 32-bit bit vector, two 8-bit-wide IDX tables, one root next table, the base address pointer of the next state table, and the failure state pointer. Each state takes 384 bits or 336 bits to store these data structures when the state number is represented with 32 bits or 16 bits, respectively.

As mentioned before, our approach is flexible for both internal and external memory architectures. The external memory architecture is suitable for large-pattern applications with modest throughput, such as anti-virus and anti-spam applications. On the other hand, the internal memory architecture can be used for high performance with fewer patterns, such as IDS and firewall applications.

In our design, the root-indexing can match four bytes at the same time with an address decoding technique, which minimizes memory usage and makes the design more space efficient. Furthermore, two string matching engines can be used to take advantage of the dual-port SRAM.

The operating frequency of the synthesis result for our internal SRAM architecture is 220 MHz, as reported by SynplifyPro. The root-indexing module takes 2 clock cycles to index a mapping state. The bitmap AC matching module takes 8 clock cycles per operation. Thus, the throughput can be estimated from the probability, frequency and processing bits per cycle. The best-case throughput, in which no byte is matched, is

$$16\ \text{bits} \times 220\ \text{MHz} \div 2\ (\text{clock}) \times 2\ (\text{SM engine}) = 3.52\ \text{Gbps} \tag{1}$$

The throughput in the average case, depending on the average proportion of root-indexing matching and bitmap AC matching shown in Fig. 13(a), can be estimated as

$$(79.94\% \times 16 \div 2 + 20.06\% \times 19.03\% \times 8 \div 8 + 20.06\% \times 76\% \times 16 \div 2) \times 220\ \text{MHz} \times 2 \approx 3367\ \text{Mbps} = 3.367\ \text{Gbps} \tag{2}$$

For the worst case, all bytes in the text buffer are matched. The throughput is

$$(26.67\% \times 16 \div 2 + 73.33\% \times 1 \div 8) \times 220\ \text{MHz} \times 2 = 979.1155\ \text{Mbps} \approx 0.98\ \text{Gbps} \tag{3}$$

The average-case performance is very close to that of the best case, and the worst case still has moderate performance. This result demonstrates that our pre-hashing and root-indexing techniques are useful for high-performance content filtering applications.

5.3 Comparison with Existing Works

Compared with a pure bitmap AC hardware design, about 96% of the bitmap AC matching can be avoided by our two proposed techniques. This is estimated from the portions of the root-indexing, false-positive and non-hit cases in Fig. 13(a) and (b):

$$79.94\% + 20.06\% \times (4.97\% + 76\%) = 96.18\% \tag{4}$$

Furthermore, the throughput of pure bitmap AC hardware in the identical hardware environment can be estimated as

$$8\ \text{bits} \div 8\ (\text{clock}) \times 220\ \text{MHz} \times 2\ (\text{SM engine}) = 440\ \text{Mbps} \tag{5}$$

Thus, our average-case throughput described in Section 5.2 is about 7.65 times that of the original bitmap AC.

Because our design is a memory-based architecture, it takes only 1688 LUTs, far fewer than other works. Compared with the memory-based architecture in [17], our 384 bits of memory per state is much less than their 8192 bits, which come from 256 32-bit pointers. Also, the operating frequency of 220 MHz does not decrease as the number and size of patterns grow. Although some existing works claim throughput of up to 10 Gbps, their designs are not feasible for real systems. Compared with these works, we provide a flexible and scalable architecture for real applications with acceptable throughput.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this thesis, we proposed an architecture that takes scalability, flexibility and performance into consideration. Root-indexing and pre-hashing are the two acceleration techniques used to dramatically improve the performance of our design. Our data structures are also compressed and can be stored in either internal SRAM or external DRAM. The internal SRAM architecture provides an average throughput of 3.367 Gbps, with a limit on the size of the pattern set. The external DRAM architecture provides high scalability for the integration of multiple applications with acceptable throughput.

The proposed internal SRAM architecture is implemented on the Xilinx ML310 FPGA-based platform, and a driver interface API is provided for software/hardware integration. The string matching function of the target application ClamAV is also modified to set up the string matching engine. We tuned the hardware design according to the analysis results of our software simulation, and also built a complete system solution for content filtering applications such as IDS, URL blocking and ClamAV.

6.2 Future Work

Although the average throughput of our internal SRAM design reaches 3.367 Gbps, our architecture is too complicated to be turned into a pipelined architecture, which could achieve better throughput. Therefore, we currently adopt a multi-cycle implementation, which degrades the throughput. For higher performance of the internal SRAM architecture, pipelining is necessary, so the proposed design should be simplified. For the external memory-based architecture, the main drawback is the bus bandwidth and contention issue, which is an inherent limitation. Furthermore, the path compression of bitmap AC is not used in our design. This technique uses memory more effectively and can also reduce the memory access frequency in the external DRAM architecture. Thus, the path compression technique is worth considering in the future.

References

[1] S. Antonatos, K. Anagnostakis, and E. Markatos. Generating realistic workloads for network intrusion detection systems. In ACM Workshop on Software and Performance, Redwood Shores, CA, Jan. 2004.

[2] Aho and M. Corasick. Fast pattern matching: an aid to bibliographic search. In
