Conclusion and future work - String-Matching Algorithms for Accelerating Deep Packet Inspection

This work designs the BFAST architecture, which uses Bloom filters to realize a sub-linear time algorithm in hardware. Guided by algorithmic heuristics, it effectively inspects multiple characters at once, raising the throughput to as much as 5.64 Gbps for more than 10,000 virus signatures, while the worst-case throughput is 1.2 Gbps with properly specified signatures. If the block size is eight characters, the throughput can reach 9.34 Gbps for 16,384 patterns. The architecture needs only m Bloom filters and reads a block of only b = 4 characters from the text per iteration, so it achieves high throughput with low hardware cost and memory usage. Although a method to guarantee linear worst-case time complexity is proposed, a more lightweight solution that reduces the overhead deserves further study.

An increasing number of signatures are represented as regular expressions. This architecture can support regular expressions by filtering the text for the necessary substrings in the regular expressions. The presence of a regular expression is verified only if all the substrings in it are found. This filtration-then-verification method is common in open-source packages such as Snort and ClamAV. Supporting regular-expression matching entirely in hardware is the next work we will pursue.
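The filtration-then-verification idea can be illustrated with a minimal Python sketch. The signature names, regular expressions, and required substrings below are hypothetical, and the plain `in` substring test merely stands in for the Bloom-filter-based filtering stage; it is not the BFAST architecture itself.

```python
import re

# Hypothetical signatures: each regular expression is paired with literal
# substrings that every match must contain (chosen by hand for illustration).
SIGNATURES = [
    {"name": "sig_a",
     "regex": re.compile(rb"evil\d+payload"),
     "required": [b"evil", b"payload"]},
    {"name": "sig_b",
     "regex": re.compile(rb"GET /[a-z]+\.cgi\?cmd="),
     "required": [b"GET /", b".cgi?cmd="]},
]

def scan(text: bytes) -> list[str]:
    """Return the names of the signatures whose regular expression matches text."""
    hits = []
    for sig in SIGNATURES:
        # Filtration: cheap substring tests (a stand-in for the hardware filter).
        if all(sub in text for sub in sig["required"]):
            # Verification: run the expensive regex only on surviving candidates.
            if sig["regex"].search(text):
                hits.append(sig["name"])
    return hits
```

Text that contains none of the required substrings never reaches the regular-expression engine, which is where the saving comes from when most of the text is legitimate.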

CHAPTER 4 A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning

4.1 Introduction

Scanning the content on network or storage devices for viruses involves computationally intensive string matching against a pattern set of virus signatures. Although the design of efficient methods for high-speed content inspection has sparked a number of innovations lately, most of them look to hardware approaches that offload string matching to a specialized hardware engine [LLLar], especially for Snort-style intrusion detection (www.snort.org). However, because many anti-virus applications run in a software environment (e.g., a commodity computer), deploying a hardware accelerator is costly and inflexible. Compared with intrusion detection, anti-virus applications have received relatively little attention as a target for acceleration. Therefore, we believe a scalable and fast string-matching algorithm and its efficient software implementation are still desired for anti-virus scanning.

Modern computer architecture brings new challenges to software implementation. A compact data structure that improves cache locality becomes critical because of the “memory wall”: memory access is slow [WM95]. Anti-virus applications have a much larger pattern set than Snort, which has only thousands of patterns. For example, ClamAV (www.clamav.net) has claimed a set of more than 200,000 patterns. A large pattern set not only demands a large memory space, but also significantly slows down string matching. Careful tuning of the data structure is therefore becoming critical to performance.

A common class of methods tracks a finite automaton that accepts the patterns in the pattern set, such as the Aho-Corasick (AC) algorithm [AC75]. The tracking generally reads one character of the text per iteration. Although some methods can track multiple characters per iteration with hardware assistance for high performance [DL05, SIH04], implementing them in software is not as efficient. The data structure of the automaton contains the transitions from each state and the failure links, and should be compressed into a compact representation [TSC04, Nor04].
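To make the automaton-tracking approach concrete, the following is a minimal, uncompressed Python sketch of the AC algorithm. It stores transitions in per-state dictionaries purely for readability; it does not reflect the compressed representations cited above, and the function names are illustrative.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto transitions, failure links, and output sets for non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognized when a state is reached
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # Breadth-first traversal sets the failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search(text, automaton):
    """Read one character per iteration; return (start index, pattern) pairs."""
    goto, fail, out = automaton
    s = 0
    matches = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]              # follow failure links on a mismatch
        s = goto[s].get(ch, 0)
        for p in out[s]:
            matches.append((i - len(p) + 1, p))
    return matches
```

Every state here carries a dictionary and a failure link, which hints at why the memory footprint grows quickly with a large pattern set and why compression matters.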

The existing compression methods have two limitations. First, many of them rely on hardware assistance for fast tracking, but their software implementation is sequential and much slower. Second, the pattern set in anti-virus applications is much larger than that in intrusion detection, and virus signatures are generally long (up to hundreds of characters) to avoid false positives, making compression even more challenging.

Another class of methods moves a search window through the text to check whether it contains a suspicious match [EC05, WM94]. Assuming most of the text is legitimate, these methods can quickly exclude the legitimate text and verify only the suspicious matches. The patterns can be represented in a compact data structure such as a shift table or a Bloom filter. There is a tradeoff in deciding the window size [EC05]. A large window is preferred because matching a long window implies a high likelihood of a true match and thus reduces the verification frequency. However, matching a pattern shorter than the window then becomes difficult. Some of these methods, such as the Wu-Manber algorithm [WM94], accelerate the scanning by skipping characters that cannot be part of a match, based on algorithmic heuristics applied to a block of characters within the search window. They are generally fast, but have an Achilles’ heel: the maximum skip distance is bounded by the shortest pattern length in the pattern set. These methods therefore have a problem with short patterns.
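The skipping heuristic and its dependence on the shortest pattern can be sketched as follows. This is a simplified Wu-Manber-style scanner written for illustration, assuming a block size of two characters and Python dictionaries in place of hashed tables; it is not the exact algorithm of [WM94] nor the backward-hashing variant presented in this chapter.

```python
def build_tables(patterns, block=2):
    """Build the shift table and the verification buckets."""
    m = min(len(p) for p in patterns)      # window size = shortest pattern length
    assert m >= block, "every pattern must be at least one block long"
    default = m - block + 1                # largest safe skip for an unseen block
    shift = {}
    bucket = {}                            # last block of the window -> candidates
    for p in patterns:
        for end in range(block, m + 1):
            b = p[end - block:end]
            shift[b] = min(shift.get(b, default), m - end)
        bucket.setdefault(p[m - block:m], []).append(p)
    return m, default, shift, bucket

def scan(text, patterns, block=2):
    """Return (start index, pattern) pairs found in text."""
    m, default, shift, bucket = build_tables(patterns, block)
    matches = []
    pos = m - 1                            # index of the last character in the window
    while pos < len(text):
        b = text[pos - block + 1:pos + 1]
        s = shift.get(b, default)
        if s > 0:
            pos += s                       # skip: no match can end inside the skipped span
            continue
        start = pos - m + 1
        for p in bucket.get(b, []):        # verify only the suspicious candidates
            if text.startswith(p, start):
                matches.append((start, p))
        pos += 1
    return matches
```

With the patterns "virus", "trojan", and "worm", the window is four characters and the default skip is three; adding a single two-character pattern shrinks the default skip to one, which is the short-pattern problem described above.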

As observed above, each class of methods has its limitations. Because most of the patterns in anti-virus applications are long in order to reduce false positives [KA94], we can exploit this feature to increase the efficiency while reducing the memory requirement. This work presents a hybrid method that combines the AC algorithm and a variant of the WM algorithm, namely the backward hashing (BH) algorithm. The patterns of virus signatures are partitioned into long and short ones, separated by a length threshold. The BH algorithm scans only for the long patterns, which allows a long shift distance of the search window. The character distribution in both the patterns and the text is non-uniform, which makes the shift distance shorter and the verification frequency higher than theoretical analysis predicts, and thus degrades performance. The backward-hashing mechanism can effectively reduce the verification frequency and exploit a long shift distance whenever there is a chance. After the partition, the AC algorithm scans only the relatively small set of short patterns, so the data structure of the automaton is compact and saves memory space. The method is applied to ClamAV to improve its performance. Some factors in software implementation, such as cache locality, drastically affect the overall performance, and this work also discusses them in the context of practical implementation.
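The partitioning step can be sketched in Python as below. The threshold value, the rule that a pattern whose length equals the threshold counts as long, and the `max_skip` bound (a Wu-Manber-style default shift of m - B + 1 for block size B) are assumptions made for illustration; the two scanners are passed in as parameters because the sketch does not implement BH or AC itself, and running them one after the other is likewise an assumption.

```python
def partition(patterns, threshold):
    """Split the patterns into long and short sets by a length threshold."""
    long_set = [p for p in patterns if len(p) >= threshold]
    short_set = [p for p in patterns if len(p) < threshold]
    return long_set, short_set

def max_skip(patterns, block=2):
    """Upper bound on the window shift: shortest pattern length - block + 1."""
    return min(len(p) for p in patterns) - block + 1

def hybrid_scan(text, patterns, threshold, scan_long, scan_short):
    """Scan long patterns with one scanner and short patterns with another."""
    long_set, short_set = partition(patterns, threshold)
    matches = []
    if long_set:
        matches += scan_long(text, long_set)    # e.g. a window-skipping scanner
    if short_set:
        matches += scan_short(text, short_set)  # e.g. an automaton-based scanner
    return sorted(matches)

def naive_scan(text, patterns):
    """Brute-force stand-in used only to exercise hybrid_scan."""
    return [(i, p) for p in patterns
            for i in range(len(text)) if text.startswith(p, i)]
```

For the hypothetical set "ab", "xyz", "longsignature1", "anotherlongsig" with a threshold of 8, the skip bound over the whole set is 1, whereas over the long patterns alone it is 13, which is the benefit the partition is meant to capture.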

The rest of this chapter is organized as follows. Section 4.2 reviews the existing work on string matching for virus scanning. Section 4.3 presents the details of the hybrid method and the practical implementation issues, followed by the performance evaluation of the algorithm in Section 4.4.1. Section 4.5 concludes this work and points out future work.
