4. IMPLEMENTATION
4.4 Driver Interface
We also provide a driver interface for communication between hardware and software. The driver functions are listed below.
Write_Buff(unsigned int *length, char *buff, void *base_addr);
This function writes the text to be scanned into the text buffer. Its arguments specify the length of the text and the address of the target buffer.
Read_Match_Result(unsigned int *match_count, void *base_addr);
This function reads the matching results from the result buffer. The application then determines the name of each matched virus.
Intr_Handler(void * baseaddr_p);
When the interrupt signal is triggered, this function is invoked to check the matching results and to write the text buffer. It therefore calls Write_Buff() and Read_Match_Result().
Start_Matching(char *buffer, unsigned int length);
This function is called by the ClamAV string matching function. It writes the text waiting to be scanned into the buffer and sets the text-ready register to enable the matching operation.
Stop_Matching(void * baseaddr_p);
This function is called when the application is closed. It clears the string matching control register to stop the operation.
cli_ac_scanbuff(const char *buffer, unsigned int length, int *vir_id, struct ac_status *status);
This is the ClamAV API that performs string matching. It takes the address and length of the text buffer allocated by ClamAV and returns the matched virus IDs and the match status. It divides the buffer pointed to by buffer into several partitions, according to the size of the text buffer we use, and matches them sequentially. Thus, Start_Matching() is called in a for loop to scan each text partition, and the matched results are returned.
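The partitioned scan described above can be sketched in software as follows. This is only an illustrative sketch, not the driver code itself: the buffer capacity TEXT_BUF_SIZE, the helper scan_in_partitions(), the callback type match_fn, and the stub count_marker_bytes() (which stands in for Start_Matching() and the hardware) are all hypothetical names and values introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed capacity of the hardware text buffer in bytes (not given in the text). */
#define TEXT_BUF_SIZE 2048u

/* Hypothetical stand-in for Start_Matching(): scans one partition and
 * returns how many matches it found. */
typedef unsigned int (*match_fn)(const char *buffer, unsigned int length);

/* Software stub used only for illustration: counts the marker byte 'X'. */
unsigned int count_marker_bytes(const char *buffer, unsigned int length)
{
    unsigned int hits = 0;
    for (unsigned int i = 0; i < length; i++) {
        if (buffer[i] == 'X') {
            hits++;
        }
    }
    return hits;
}

/* Split the text into partitions of at most TEXT_BUF_SIZE bytes and scan
 * them one after another. Returns the total number of matches and, if
 * `partitions` is not NULL, stores the number of partitions scanned. */
unsigned int scan_in_partitions(const char *buffer, unsigned int length,
                                match_fn start_matching,
                                unsigned int *partitions)
{
    unsigned int total = 0;
    unsigned int count = 0;

    assert(buffer != NULL && start_matching != NULL);

    while (length > 0) {
        unsigned int chunk = length < TEXT_BUF_SIZE ? length : TEXT_BUF_SIZE;
        total += start_matching(buffer, chunk);
        buffer += chunk;
        length -= chunk;
        count++;
    }
    if (partitions != NULL) {
        *partitions = count;
    }
    return total;
}
```

Note that this sketch treats each partition independently. How patterns that straddle a partition boundary are handled is not stated in the text, so it is not modeled here.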
Chapter 5 Evaluation
5.1 Simulation Analysis
The simulation analysis estimates the performance of our design using the software simulation flow described in Chapter 4. The test inputs are Linux executable files, Windows executable files, and plain text files. A 32-bit bit vector and 1000 virus patterns are used to evaluate the proportions of root-index matching and bitmap AC matching, as shown in Fig. 13(a). A high proportion of fast root-index matching improves the performance.
[Fig. 13 data. (a) Root-indexing: 79.94%; pre-hashing: 20.06%. (b) Hit (false positive): 4.97%; hit: 19.03%; non-hit: 76.00%.]
Fig. 13. (a) The proportion of root-indexing and pre-hashing. (b) The proportion of hit, non-hit and false positive.
The pre-hashing portion in Fig. 13(a) can be divided into three sub-portions, as shown in Fig. 13(b). The first and second are the hit and false-positive portions, which account for 19.03% and 4.97% and must perform the slow bitmap AC matching operation. The third is the non-hit portion, which accounts for 76.00% and performs the fast root-indexing matching. Thus, as the proportion of non-hits increases, the performance improves.
Two important factors affect the non-hit rate. The first is the number of patterns. As the number of patterns increases, the number of branches of a node increases, which raises the hit rate and degrades the performance. The second factor is the size of the bit vector used for pre-hashing matching. To balance performance against memory usage, the bit vector size can be adjusted during preprocessing. Taking both into account, a reasonable size is 8 bits or 32 bits. The 8-bit bit vector suits a development environment where memory is limited, and the 32-bit bit vector gives better performance when memory is available. To analyze these two key factors, Fig. 14 shows the non-hit rate for different bit vector sizes and numbers of patterns in three different data types. As the pattern set grows, the 32-bit bit vector shows a more apparent improvement than the 16-bit bit vector.
Fig. 14. The non-hit rate of 8-bit, 16-bit and 32-bit bit vectors for (a) text files, (b) Windows executable files, (c) Linux executable files.
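The effect of the bit vector size on non-hits and false positives can be illustrated with a minimal sketch. It rests on assumptions that this section does not confirm: each state keeps one bit vector, and a byte is hashed by taking its value modulo the vector size. The names prehash(), build_bit_vector(), and prehash_may_match() are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed hash function: fold the byte value onto a vector of `bits` bits. */
static unsigned int prehash(uint8_t byte, unsigned int bits)
{
    return byte % bits;
}

/* Build the bit vector of a state from the bytes that have a valid
 * next state. `bits` is the vector size: 8, 16, or 32. */
uint32_t build_bit_vector(const uint8_t *next_bytes, unsigned int n,
                          unsigned int bits)
{
    uint32_t vector = 0;

    assert(bits == 8 || bits == 16 || bits == 32);
    for (unsigned int i = 0; i < n; i++) {
        vector |= (uint32_t)1 << prehash(next_bytes[i], bits);
    }
    return vector;
}

/* Returns 0 when the byte is a guaranteed non-hit (fast path), and 1 when
 * the slow lookup is still required (a real hit or a false positive). */
int prehash_may_match(uint32_t vector, uint8_t byte, unsigned int bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    return (int)((vector >> prehash(byte, bits)) & 1u);
}
```

With valid next bytes 'a' and 'q', the byte 'i' collides with 'a' in an 8-bit vector (a false positive) but maps to a clear bit in a 32-bit vector (a non-hit), which is the kind of improvement a larger vector is expected to give under these assumptions.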
In addition to the hit rate, the false positive rate of pre-hashing matching is also affected by the size of the bit vector, as shown in Fig. 15. A false positive incurs a small penalty in clock cycles in the internal SRAM architecture, and a large penalty from bus contention in the external DRAM architecture.
Fig. 15. The false positive rate of 16-bit and 32-bit bit vectors for (a) text files, (b) Windows executable files, (c) Linux executable files.
5.2 Hardware Analysis
For the proposed architecture, the data structures we use are the 256-bit bitmap, the 32-bit bit vector, two 8-bit-wide IDX tables, one root next table, the base address pointer of the next state table, and the failure state pointer. Each state takes 384 bits or 336 bits to store these data structures when the state number is represented with 32 bits or 16 bits, respectively.
As mentioned before, our approach is flexible for both internal and external memory architectures. The external memory architecture is suitable for large-pattern applications with modest throughput, such as anti-virus and anti-spam applications.
On the other hand, the internal memory architecture can be used for high performance with fewer patterns, such as IDS and firewall applications.
In our design, root-indexing can match four bytes at the same time using an address decoding technique, which minimizes the memory usage and makes the design more space efficient. Furthermore, two string matching engines can be used to take advantage of the dual-port SRAM.
The operating frequency of the synthesis result for our internal SRAM architecture is 220 MHz, as reported by Synplify Pro. The root-indexing module takes 2 clock cycles to index a mapping state. The bitmap AC matching module takes 8 clock cycles per operation. Thus, the throughput can be estimated from the probability, the frequency, and the number of bits processed per cycle. The best-case throughput, in which no byte is matched, is
$$16\ \text{bits} \times 220\ \text{MHz} \div 2\ (\text{clock}) \times 2\ (\text{SM engine}) = 3.52\ \text{Gbps}. \tag{1}$$
The throughput in the average case, which depends on the average proportions of root-indexing matching and bitmap AC matching shown in Fig. 13(a), can be estimated as
$$(79.94\% \times 16 \div 2 + 20.06\% \times 19.03\% \times 8 \div 8 + 20.06\% \times 76\% \times 16 \div 2) \times 220\ \text{MHz} \times 2 \approx 3367\ \text{Mbps} = 3.367\ \text{Gbps}. \tag{2}$$
In the worst case, all bytes in the text buffer are matched. The throughput is
$$(26.67\% \times 16 \div 2 + 73.33\% \times 1 \div 8) \times 220\ \text{MHz} \times 2 = 979.1155\ \text{Mbps} \approx 0.98\ \text{Gbps}. \tag{3}$$
The average-case throughput is very close to the best case, and the worst case still delivers moderate performance. This result demonstrates that our pre-hashing and root-indexing techniques are useful for high-performance content filtering applications.
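The arithmetic behind the best-case and average-case estimates, Eqs. (1) and (2), can be checked with a short sketch. The function throughput_mbps() and the macro names are hypothetical. The rates follow the text: the fast path processes 16 bits every 2 clock cycles, the bitmap AC path processes 8 bits every 8 clock cycles, the clock is 220 MHz, and there are two engines. As in Eq. (2), the false-positive portion is not included in the average case.

```c
#include <assert.h>

#define CLOCK_MHZ            220.0
#define NUM_ENGINES          2.0
#define FAST_BITS_PER_CYCLE  (16.0 / 2.0)  /* root-indexing: 16 bits per 2 cycles */
#define SLOW_BITS_PER_CYCLE  (8.0 / 8.0)   /* bitmap AC: 8 bits per 8 cycles */

/* Estimated throughput in Mbps, given the fraction of the input handled
 * by the fast path and the fraction handled by the slow path. */
double throughput_mbps(double fast_fraction, double slow_fraction)
{
    assert(fast_fraction >= 0.0 && slow_fraction >= 0.0);
    assert(fast_fraction + slow_fraction <= 1.0 + 1e-9);

    return (fast_fraction * FAST_BITS_PER_CYCLE +
            slow_fraction * SLOW_BITS_PER_CYCLE) * CLOCK_MHZ * NUM_ENGINES;
}
```

For example, throughput_mbps(1.0, 0.0) corresponds to the best case, and throughput_mbps(0.7994 + 0.2006 * 0.76, 0.2006 * 0.1903) corresponds to the average case with the proportions from Fig. 13.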
5.3 Comparison with Existing Works
Compared with a pure bitmap AC hardware design, 96% of the bitmap AC matching can be avoided by our two proposed techniques. This is estimated from the root-indexing, false-positive and non-hit portions in Fig. 13(a) and (b), as shown below:
$$79.94\% + 20.06\% \times (4.97\% + 76\%) = 96.18\%. \tag{4}$$
Furthermore, the throughput of pure bitmap AC hardware in the identical hardware environment can be estimated as
$$8\ \text{bits} \div 8\ (\text{clock}) \times 220\ \text{MHz} \times 2\ (\text{SM engine}) = 440\ \text{Mbps}. \tag{5}$$
Thus, our average-case throughput described in Section 5.2 is about 7.65 times that of the original bitmap AC.
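The comparison figures can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical function names and takes its inputs from Fig. 13, Eq. (5), and the average-case throughput of 3367 Mbps reported in Section 5.2.

```c
#include <assert.h>

/* Fraction of bitmap AC matching that is avoided, following Eq. (4):
 * the root-indexing portion plus the false-positive and non-hit
 * sub-portions of the pre-hashing portion. */
double avoided_bitmap_ac(double root, double prehash,
                         double false_positive, double non_hit)
{
    assert(root >= 0.0 && prehash >= 0.0);
    assert(false_positive >= 0.0 && non_hit >= 0.0);
    return root + prehash * (false_positive + non_hit);
}

/* Throughput of pure bitmap AC hardware in Mbps, following Eq. (5):
 * 8 bits every 8 clock cycles on each engine. */
double pure_bitmap_ac_mbps(double clock_mhz, unsigned int engines)
{
    assert(clock_mhz > 0.0 && engines > 0);
    return 8.0 / 8.0 * clock_mhz * (double)engines;
}

/* Ratio of our throughput to the baseline throughput. */
double speedup(double ours_mbps, double baseline_mbps)
{
    assert(baseline_mbps > 0.0);
    return ours_mbps / baseline_mbps;
}
```

Calling speedup(3367.0, pure_bitmap_ac_mbps(220.0, 2)) gives a ratio of roughly 7.65, in line with the figure quoted above.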
Because our design is a memory-based architecture, it takes only 1688 LUTs, far fewer than other works. Compared with the memory-based architecture in [17], our memory usage of 384 bits per state is much less than their 8192 bits, which hold 256 32-bit pointers. Also, the operating frequency of 220 MHz does not decrease as the number and size of patterns grow. Although some existing works claim a throughput of up to 10 Gbps, their designs are not feasible for real systems. Compared with these works, we provide a flexible and scalable architecture for real applications with acceptable throughput.
Chapter 6 Conclusion and Future Work
6.1 Conclusion
In this thesis, we proposed an architecture that takes scalability, flexibility and performance into consideration. Root-indexing and pre-hashing are the two acceleration techniques used to dramatically improve the performance of our design. Our data structures are also compressed and can be stored either in internal SRAM or in external DRAM. The internal SRAM architecture provides an average throughput of 3.367 Gbps, with a limit on the size of the pattern set. The external DRAM architecture provides high scalability for the integration of multiple applications with acceptable throughput.
The proposed internal SRAM architecture is implemented on the Xilinx ML310 FPGA-based platform, and a driver interface API is provided for software/hardware integration. The string matching function of the target application, ClamAV, is also modified to set up the string matching engine. We tuned the hardware design according to the analysis results of our software simulation, and also built a complete system solution for content filtering applications such as IDS, URL blocking and ClamAV.
6.2 Future Work
Although the average throughput of our internal SRAM design reaches 3.367 Gbps, the architecture is too complicated to be pipelined, which would yield better throughput. We therefore currently adopt a multi-cycle implementation, which degrades the throughput. Pipelining is necessary for higher performance of the internal SRAM architecture, so the proposed design should be refined and simplified. For the external memory-based architecture, the main drawback is bus bandwidth and contention, which is an inherent limitation. Furthermore, the path compression of bitmap AC is not used in our design. This technique uses memory effectively and can also reduce the frequency of memory accesses in the external DRAM architecture. Thus, path compression is worth considering in the future.
References
[1] S. Antonatos, K. Anagnostakis, and E. Markatos. Generating realistic workloads for network intrusion detection systems. In ACM Workshop on Software and Performance, Redwood Shores, CA, Jan. 2004.
[2] A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. In Commun. ACM, volume 18(6), pages 333-340, June 1975.
[3] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Technical Report TR-94-17, Department of Computer Science, University of Arizona, 1994.
[4] R. Boyer and J. Moore. A fast string searching algorithm. Communications of the ACM, vol. 20, no. 10, pp. 762-772, October 1977.
[5] A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. In Proc. of Allerton Conference, 2002.
[6] Young H. Cho, Shiva Navab, and William Mangione-Smith. Specialized hardware for deep network packet filtering. In Proceedings of 12th International Conference on Field Programmable Logic and Applications, France, 2002.
[7] Young H. Cho and William H. Mangione-Smith. Deep packet filter with dedicated logic and read only memories. In IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, April 2004.
[8] Z. K. Baker and V. K. Prasanna. Time and area efficient reconfigurable pattern matching on FPGAs. In Proceedings of FPGA ’04, 2004.
[9] Z. K. Baker and V. K. Prasanna. A methodology for synthesis of efficient intrusion detection systems on FPGAs. In IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, April 2004.
[10] I. Sourdis and D. Pnevmatikatos. Fast, large-scale string match for a 10Gbps FPGA-based network intrusion detection system. In Proceedings of 13th International Conference on Field Programmable Logic and Applications, Lisbon, Portugal, September 2003.
[11] I. Sourdis and D. Pnevmatikatos. Pre-decoded CAMs for efficient and high-speed NIDS pattern matching. In IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, April 2004.
[12] Sarang Dharmapurikar, Praveen Krishnamurthy, Todd Sproull, and John Lockwood. Deep packet inspection using Bloom filters. In Hot Interconnects, Stanford, CA, August 2003.
[13] S. Dharmapurikar, M. Attig and J. Lockwood. Design and Implementation of a String Matching System for Network Intrusion Detection using FPGA-based Bloom filters. In the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ’04), April 2004.
[14] N. Tuck, T. Sherwood, B. Calder and G. Varghese. Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection. In Proceedings of the IEEE Infocom Conference, Hong Kong, China, 2004.
[15] R. Sidhu and V. K. Prasanna. Fast regular expression matching using FPGAs. In IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, CA, USA, April 2001.
[16] M. Gokhale, D. Dubois, A. Dubois, M. Boorman, S. Poole, and V. Hogsett. Granidt: Towards gigabit rate network intrusion detection technology. In Proceedings of the 12th International Conference on Field-Programmable Logic and Applications, Sept. 2002.
[17] M. Aldwairi, T. Conte, P. Franzon. Configurable string matching hardware for speeding up Intrusion Detection. In ACM Sigarch Computer Architecture News, Vol. 33, No. 1, March 2005.
[18] Kuo-Kun Tseng, Ying-Dar Lin, Tsern-Huei Lee, Yuan-Cheng Lai, A Parallel Automaton String Matching with Pre-Hashing and Root-Indexing Techniques for Content Filtering Coprocessor, 16th IEEE International Conference on Application-Specific Systems, Architectures, and Processors, Samos, Greece, July 2005.
[19] SNORT official web site. http://www.snort.org
[20] ClamAV official web site. http://www.clamav.net