
5. HARDWARE IMPLEMENTATION AND PERFORMANCE COMPARISON

In this section, Section 5.1 gives a detailed description of the hardware implementation, including the block diagram, finite state machine, components, and interface of the FSAM hardware. Section 5.2 gives an exhaustive comparison with previous hardware implementations.


Fig. 13. FPGA implementation of the double-engine FSAM: (a) block diagram of the double-engine architecture, (b) finite state machine for the FSAM controller, (c) finite state machine for the FSAM.

5.1 Hardware Implementation

Figure 13 illustrates the FPGA implementation of the double-engine FSAM, and includes (a) a block diagram of the hardware architecture, (b) a finite state machine for the FSAM controller, and (c) a finite state machine for the FSAM.

In the double-engine FSAM, two FSAMs perform matching on different texts at the same time, and thus operate independently without affecting each other. That is, FSAM1 and FSAM2 have their own texts, Text1 and Text2, respectively. In addition, ping-pong buffers are used for each FSAM. For instance, the “Select1” signal chooses either the Text1A or Text1B buffer as Text1. Of the two ping-pong buffers, one is used during matching while the other is prepared concurrently by the processor or DMA.
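
As a rough software illustration of this organization, the following C sketch models one engine together with its ping-pong buffer pair. Only the Text1A/Text1B/Select1 roles follow Figure 13(a); the buffer depth, struct names, and helper functions are our own illustrative assumptions, not the paper's RTL.

```c
#include <stdint.h>

#define TEXT_BUF_SIZE 2048        /* illustrative depth, not specified in the paper */

/* One matching engine with its pair of ping-pong text buffers (cf. Fig. 13(a)). */
struct fsam_engine {
    uint8_t text_a[TEXT_BUF_SIZE];   /* e.g. Text1A */
    uint8_t text_b[TEXT_BUF_SIZE];   /* e.g. Text1B */
    int     select;                  /* e.g. Select1: which buffer is "Text1" right now */
};

/* Double-engine FSAM: FSAM1 and FSAM2 work on Text1 and Text2 independently. */
struct double_fsam {
    struct fsam_engine fsam1;
    struct fsam_engine fsam2;
};

/* The buffer chosen by Select is matched, while the other one is being
   refilled by the processor or DMA at the same time. */
static inline uint8_t *active_text(struct fsam_engine *e)
{
    return e->select ? e->text_b : e->text_a;
}

static inline uint8_t *standby_text(struct fsam_engine *e)
{
    return e->select ? e->text_a : e->text_b;
}
```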

As shown in Figure 13(b), the controller feeds the text to the corresponding FSAM and activates it via the Start FSAM signal in the FSAM START state. If the FSAM rdy signal is one, indicating that the FSAM is idle, the controller sets Start FSAM to one to begin a new matching process and clears FSAM rdy. Once the matching process finishes, the FSAM sets FSAM rdy back to one and sends this signal to the FSAM controller. The controller then transitions from the FSAM START state to FSAM END, the end of the matching operation. The Select signal switches between 0 and 1 in the FSAM END state to obtain the alternative text in the ping-pong buffers.
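
This handshake can be summarized with a small behavioral model in C. The state and signal names mirror those in Figure 13(b); the function, struct, and field names are our own sketch under that reading, not the paper's implementation.

```c
#include <stdbool.h>

typedef enum { FSAM_START, FSAM_END } ctrl_state_t;

typedef struct {
    bool start_fsam;   /* Start FSAM: controller -> engine                     */
    bool fsam_rdy;     /* FSAM rdy:   engine -> controller, 1 = idle/finished  */
    int  select;       /* Select: picks one of the ping-pong text buffers      */
} fsam_if_t;

/* One controller step, evaluated once per cycle in the hardware analogy. */
ctrl_state_t controller_step(ctrl_state_t state, fsam_if_t *io)
{
    switch (state) {
    case FSAM_START:
        if (!io->start_fsam && io->fsam_rdy) {
            io->start_fsam = true;    /* engine idle: kick off a new matching pass */
            io->fsam_rdy   = false;   /* cleared until the engine finishes         */
            return FSAM_START;
        }
        if (io->start_fsam && io->fsam_rdy) {
            return FSAM_END;          /* engine raised FSAM rdy: matching done     */
        }
        return FSAM_START;            /* matching still in progress                */

    case FSAM_END:
        io->start_fsam = false;
        io->select ^= 1;              /* fetch the alternative ping-pong buffer    */
        return FSAM_START;
    }
    return FSAM_START;
}
```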

The default unit of text handled by the double-engine FSAM is a message.

However, when the granularity is a packet rather than a message, our FSAM can still operate well with little modification.

Fig. 14. (a) Components of FSAM Implementation, (b) suggested memory interfaces for the double-engine FSAM.

The method is to keep the last AC state of the previous packet and use it as the starting state for matching the next packet; thus, the FSAM can easily match across multiple packets.
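
A minimal sketch of this cross-packet matching is given below, assuming a hypothetical per-byte AC transition function ac_next(); none of these identifiers come from the paper. It only illustrates how carrying the last AC state lets patterns that straddle a packet boundary still be detected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-byte AC transition: from the current state, consume one
   text byte, return the next state, and flag any pattern hit. This stands in
   for FSAM's AC lookup; the name and signature are assumptions. */
typedef uint32_t ac_state_t;
extern ac_state_t ac_next(ac_state_t state, uint8_t byte, int *hit);

#define AC_ROOT ((ac_state_t)0)   /* assumed encoding of the AC root state */

/* Per-flow context: the last AC state of the previous packet is kept here. */
struct flow_ctx {
    ac_state_t last_state;
};

static inline void flow_init(struct flow_ctx *f) { f->last_state = AC_ROOT; }

/* Match one packet, resuming from where the previous packet of the same flow
   ended, so patterns crossing a packet boundary are still found. */
int match_packet(struct flow_ctx *flow, const uint8_t *pkt, size_t len)
{
    int matches = 0;
    ac_state_t s = flow->last_state;      /* resume; do not restart at the root */

    for (size_t i = 0; i < len; i++) {
        int hit = 0;
        s = ac_next(s, pkt[i], &hit);
        matches += hit;
    }

    flow->last_state = s;                 /* keep for the next packet of this flow */
    return matches;
}
```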

The statistics of the FSAM components in Figure 14(a) reveal the detailed hardware usage. Circuit size is measured in terms of logic element (LE) counts. The results demonstrate that the root-index and prehash modules consume less circuit area, memory, and bandwidth than the AC module, indicating that AC dominates the hardware cost in FSAM. In the case of the single-engine implementation, the total circuit size is only 329 LEs.

For the scalability of the storage, Figure 14(b) shows the suggested memory interfaces for the double-engine FSAM. The suffix “1&2” of the signal symbols denotes the first and second interfaces of each memory bank. Since the four root-index tables are very small, storing them in internal memory is recommended. The text and output memories are implemented as internal or external memories according to their scales. Finally, since the root-index next table, prehash vector, and bitmap-AC-related tables could be large for a large number of patterns, they should be implemented in external memory for scalability.
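
For reference, the suggested placement can be written down as a small descriptive table in C. The entry names and the enum are only our shorthand for Figure 14(b), not an interface defined by the paper.

```c
/* Descriptive summary of the suggested memory placement (cf. Fig. 14(b)). */
enum mem_kind { MEM_INTERNAL, MEM_INTERNAL_OR_EXTERNAL, MEM_EXTERNAL };

struct mem_plan {
    const char   *memory;
    enum mem_kind kind;
};

static const struct mem_plan fsam_mem_plan[] = {
    { "root-index tables (x4)",  MEM_INTERNAL             },  /* very small              */
    { "text memory",             MEM_INTERNAL_OR_EXTERNAL },  /* depends on scale        */
    { "output memory",           MEM_INTERNAL_OR_EXTERNAL },  /* depends on scale        */
    { "root-index next table",   MEM_EXTERNAL             },  /* grows with pattern set  */
    { "prehash vector",          MEM_EXTERNAL             },
    { "bitmap AC tables",        MEM_EXTERNAL             },
};
```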

For the suggested external memory interface, if high-speed and high-capacity memories are used, access times of about two cycles at clock rates of up to 500 MHz are obtainable with QDR-III SRAMs (www.qdrsram.com). Moreover, since ASIC hardware can often run at a much higher speed than FPGA devices, an ASIC implementation with external memories for a large pattern set can feasibly maintain throughput competitive with our FPGA implementation.


5.2 Performance Comparison

Since many string-matching hardware designs [Aldwairi et al. 2005; Moscola et al. 2003; Baker et al. 2004; Cho et al. 2005; Dharmapurikar et al. 2004] store their patterns in on-chip hardwired circuits and internal memories, we also implemented our FSAM using FPGA internal memories for a fair evaluation. Besides, because several previous matching hardware designs [Aldwairi et al. 2005; Tan et al. 2005; Moscola et al. 2003; Baker et al. 2004; Dharmapurikar et al. 2004, etc.] employed duplicated hardware for parallel processing, comparing our double-engine architecture with them is still fair. In particular, our optimal use of the dual-port block RAM of the Xilinx FPGA not only virtually doubles the performance but also requires no extra block RAM.

We synthesized FSAM on various Xilinx FPGA devices and compared it with the major types of hardware described in the related work, as shown in Table III. The common goals of such hardware are higher throughput, larger pattern sizes, and smaller circuit size, which are also the factors considered in this comparison. The pattern size is equal to the number of patterns multiplied by the average pattern length and is used for evaluating scalability; the throughput is used for measuring performance.

The results demonstrate that FSAM achieves a throughput of 11.1 Gbps with double engines and 5.6 Gbps with a single engine on a Xilinx Virtex2P device. For storage, the FSAM implementation uses an internal memory, the Xilinx block RAM, to store the pattern set. Among all matching hardware, our FPGA implementation can handle the largest pattern size of 32,634 bytes, a set of truncated URL patterns composed of 2,940 patterns with an average length of 11.1 bytes. Thus, our FSAM is superior to all previous string-matching hardware in terms of both space requirement and performance.
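
As a quick consistency check of the pattern-size definition given above, the reported figures multiply out exactly:

\[
\text{Pattern size} = 2{,}940~\text{patterns} \times 11.1~\text{bytes/pattern} = 32{,}634~\text{bytes}.
\]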

Next, the pattern-placement column shows the major difference between our FSAM and the other matching hardware. Previous architectures often employed hardwired circuits and internal memories to store their patterns, so the number of patterns they can support is limited by FPGA resources.
