
4.2 Review of existing work

4.2.1 String matching algorithms

Scanning the text for multiple patterns typically either tracks the partially matched prefixes with a finite automaton that accepts the patterns, or filters the text with a search window to weed out unsuccessful matches and verifies only the suspicious ones. The former features deterministic execution time, which bounds the worst-case performance even when algorithmic attacks try to exploit the worst case of an algorithm, but it may require large memory space to store the transition information. The latter features memory economy and fast average execution time, but must carefully deal with possible attacks.

Recent automaton approaches focus on fast tracking of a compressed automaton with hardware assistance [DL05, Lun06, AC07, TS06, BCT06], but their software implementations are not as efficient as the hardware counterparts. Although some compression methods are independent of hardware [Nor04, YCD06], their scalability to a large pattern set of long virus signatures could be a problem. First, the transition table is not so sparse due to the large pattern set. Second, the method to simplify repetitions in the patterns [YCD06] is unable to compress the characters in the long patterns.

A filtering approach can map the search window along the text with one or more hash functions to see whether the window matches an entire pattern or part of a long pattern. If a suspicious match occurs, the approach verifies whether it is a true match with that pattern. This approach is very memory efficient because a pattern is stored as only a hash value or the bits in a few addresses. The window must be long enough to reduce the verification frequency, but a long window has two side effects: longer time in hash computation and inability to match shorter patterns [EC05]. Moreover, the window can slide by only one character per iteration.

Algorithms such as the Wu-Manber (WM) algorithm can map only the suffix of the search window, ignore the remaining characters in the window, and shift the window to the next position according to certain heuristics in the search stage.

To be specific, suppose the window is m characters long, and let X be the block of b characters in the window suffix. The heuristic can look up the block in a shift table (built in the preprocessing stage) to derive the shift distance as follows.

1. The search window can be shifted by m − b + 1 characters if X is not a substring of any pattern. Any shift shorter than m − b + 1 cannot lead to a match, because such a match would contain X, contradicting the fact that X does not appear in any pattern.

2. Otherwise, the shift value is m − j, assuming the rightmost occurrence of X ends in position j of some pattern. If j = m (i.e., zero shift value), X is the suffix of one or more patterns, so whether a true match occurs should be verified.
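The two rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [WM94] or of ClamAV: the names `build_shift_table` and `wm_search` are hypothetical, the shift table is an uncompressed dictionary, m is taken to be the shortest pattern length, and only the first m characters of each pattern contribute to the table.

```python
def build_shift_table(patterns, b):
    """Preprocessing: map each b-character block to its shift distance."""
    m = min(len(p) for p in patterns)       # window length (assumption: shortest pattern)
    if m < b:
        raise ValueError("every pattern must be at least b characters long")
    default = m - b + 1                     # rule 1: block occurs in no pattern
    shift = {}
    for p in patterns:
        for j in range(b, m + 1):           # j = 1-based end position of the block
            block = p[j - b:j]
            s = m - j                       # rule 2: distance to the window end
            if s < shift.get(block, default):
                shift[block] = s            # keep the rightmost occurrence
    return m, shift, default


def wm_search(text, patterns, b=2):
    """Search stage: slide the window, verifying only on a zero shift."""
    m, shift, default = build_shift_table(patterns, b)
    # Group patterns by the block that ends their first m characters.
    candidates = {}
    for p in patterns:
        candidates.setdefault(p[m - b:m], []).append(p)

    matches = []
    end = m                                 # exclusive end of the current window
    while end <= len(text):
        block = text[end - b:end]
        s = shift.get(block, default)
        if s == 0:                          # suspicious match: verify candidates
            start = end - m
            for p in candidates.get(block, []):
                if text.startswith(p, start):
                    matches.append((start, p))
            s = 1                           # then advance by one character
        end += s
    return matches
```

For instance, `wm_search(b"zzabcdezz", [b"abcd", b"bcde"])` reports both patterns, at offsets 2 and 3, after inspecting only four window positions.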

The maximum shift distance is m − b + 1, meaning it is bounded by the shortest pattern length. Although using a heuristic like that in [?] allows the maximum distance up to m characters, building the shift table becomes extremely time-consuming for large b because up to |Σ|^b possible blocks should be considered to build the table, where Σ is usually the ASCII character set and |Σ| is 256.

Mapping multiple blocks to the same entry, which is filled with the minimum shift value of these blocks, can compress the shift table [WM94]. The compression involves tradeoffs. Reducing the table size can improve the cache locality, but at the cost of smaller shift values (because each entry holds the minimum shift value of multiple blocks) and more frequent verification (because a zero shift value becomes more likely). Conversely, expanding the shift table can yield larger shift values, but at the cost of worse cache locality. Therefore, the table size should be carefully tuned to achieve the optimal performance. Moreover, the non-uniform character distribution in the text and the patterns means that some characters or blocks appear more frequently than probabilistic analysis assumes. A frequent block may happen to be the suffix of some pattern, resulting in frequent verification. The likelihood increases with the size of the pattern set. Unless this problem is solved, the verification frequency may be high.
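The effect of the table size can be illustrated with a short Python sketch. The hash in `block_index` (the block read as an integer, reduced modulo the table size) and all function names are assumptions chosen for illustration; [WM94] and ClamAV may map blocks to entries differently.

```python
def block_index(block, table_size):
    """Hypothetical hash: the block as a big-endian integer, modulo the table size."""
    return int.from_bytes(block, "big") % table_size


def build_compressed_shift_table(patterns, b, table_size):
    """Build a fixed-size shift table; colliding blocks share the minimum shift."""
    m = min(len(p) for p in patterns)
    table = [m - b + 1] * table_size        # default shift for unseen blocks
    for p in patterns:
        for j in range(b, m + 1):           # j = 1-based end position of the block
            idx = block_index(p[j - b:j], table_size)
            table[idx] = min(table[idx], m - j)
    return m, table


def zero_entries(table):
    """Number of entries that force a verification when they are hit."""
    return sum(1 for s in table if s == 0)
```

Building the table for the same pattern set with a small and a large `table_size` shows the tradeoff: no entry of the small table is larger than the corresponding entry of the large one, and a greater fraction of its entries is zero.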

4.2.2 Virus signatures and string matching in ClamAV

The virus database in ClamAV has been growing very fast, and contains more than 200,000 signatures at the time of writing (April 2008). The database contains four types of virus signatures: basic patterns, multi-part patterns, MD5 patterns and phishing patterns (see clamdoc.pdf and signatures.pdf in the source package).

A basic pattern is simply a string of characters for exact match, while a multi-part pattern consists of multiple parts of basic patterns to be matched in sequence for virus identification. The former is sufficient to detect non-polymorphic viruses, and the latter allows specifications such as wildcard characters and bounded gaps (the minimum or maximum distance between two consecutive parts) to detect polymorphic viruses. MD5 matching spends most of the time in MD5 computation, and then checks whether the 16-byte output is an MD5 signature. Phishing matching checks whether a URL is in a URL list. The latter two types are beyond the scope of this work because matching them differs from ordinary multiple string matching, which scans long text for multiple patterns.
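As an illustration of matching parts in sequence under bounded gaps, consider the following Python sketch. The representation is an assumption made for this example and is not ClamAV's signature syntax: `parts` is a list of byte strings, and `gaps[i]` is a `(minimum, maximum)` pair for the number of bytes between part i and part i+1, with `None` as the maximum meaning unbounded. Wildcard characters inside a part are not handled.

```python
def match_multipart(data, parts, gaps):
    """Return True if all parts occur in order and every gap is within bounds."""
    if len(gaps) != len(parts) - 1:
        raise ValueError("need exactly one gap between consecutive parts")

    def search(idx, lo, hi):
        # Try every occurrence of parts[idx] that starts within [lo, hi].
        part = parts[idx]
        pos = data.find(part, lo)
        while pos != -1 and (hi is None or pos <= hi):
            if idx == len(parts) - 1:
                return True                 # last part found: full match
            end = pos + len(part)
            gap_lo, gap_hi = gaps[idx]
            next_hi = None if gap_hi is None else end + gap_hi
            if search(idx + 1, end + gap_lo, next_hi):
                return True
            pos = data.find(part, pos + 1)  # backtrack to a later occurrence
        return False

    return search(0, 0, None)
```

With `data = b"AAxxBByyyyCC"` and `parts = [b"AA", b"BB", b"CC"]`, the gaps `[(0, 3), (4, 4)]` are satisfied, whereas `[(0, 1), (4, 4)]` are not, because two bytes separate `AA` from `BB`.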

Table 4.1: The number of parts or basic patterns and their minimum/maximum lengths in each target type.

            Aho-Corasick (AC)         Wu-Manber (WM)
Type        num.     min.   max.      num.      min.   max.
Generic     4,704    2      144       29,794    10     246
MS PE       2,878    2      176       48,852    4      392
MS OLE2     1,474    2      134       177       23     176
HTML        3,142    2      140       1,629     5      355
Mail        390      3      120       838       12     172
Graphics    2        3      26        0         N/A    N/A
ELF         0        N/A    N/A       15        17     198
Subtotal    12,590   2      176       81,305    4      392
MD5         0        N/A    N/A       143,641   0      0
Phishing    206      6      43        0         N/A    N/A
Total       12,796   2      176       224,946   4      392

The patterns can come with context information, such as the target file type and the virus position in the text, to reduce false positives. For example, the signature W32.Deadc0de is a four-character basic pattern: 0xdec0adde. The pattern is required to start from the 64th byte in a file of Portable Executable (PE) format; otherwise, no match is claimed. ClamAV separates the patterns other than generic ones into individual data structures according to the target type. Therefore, after determining the target type of the text, ClamAV can scan for only the patterns associated with that type, in addition to the generic patterns.
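The use of context information can be sketched as follows. The data layout is hypothetical and is not ClamAV's internal structure: `db` maps a target type to a list of `(name, pattern, offset)` entries, where an `offset` of `None` means the pattern may occur anywhere. The offset 64 used in the usage note below is one reading of "the 64th byte" (a zero-based offset) and is likewise an assumption.

```python
GENERIC = "generic"


def select_patterns(db, target_type):
    """Generic patterns plus the patterns registered for the detected type."""
    selected = list(db.get(GENERIC, []))
    if target_type != GENERIC:
        selected += db.get(target_type, [])
    return selected


def scan_with_context(data, db, target_type):
    """Report names of patterns that match and satisfy their offset constraint."""
    hits = []
    for name, pattern, offset in select_patterns(db, target_type):
        if offset is None:
            found = pattern in data                  # may occur anywhere
        else:
            found = data.startswith(pattern, offset) # must start at the offset
        if found:
            hits.append(name)
    return hits
```

With `db = {"pe": [("W32.Deadc0de", bytes.fromhex("dec0adde"), 64)]}`, a file in which these four bytes start at offset 64 is reported only when its target type is `"pe"`; the same bytes at another offset, or in a file of another type, are ignored.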

The current version of ClamAV (version 0.92) scans the text with both the AC algorithm for parts in the multi-part patterns and the WM algorithm¹ for basic patterns. Table 4.1 summarizes the number of parts or basic patterns, as well as their minimum/maximum lengths for the two algorithms in each target type.

¹ ClamAV calls it the Boyer-Moore (BM) algorithm, but the algorithm actually operates in the same way as the WM algorithm.

An old version of ClamAV simplified the AC automaton to a trie structure up to a maximum height h. The patterns with identical prefixes of h characters are stored in a linked list pointed to by a leaf node at level h. Because the minimum pattern length for the AC algorithm is only two characters, h was set to 2, and the linked lists grew increasingly long as the pattern set grew. Traversing a long linked list is time-consuming. Miretskiy et al. [MDW04] proposed a trie structure that stores a pattern at the lowest possible level as soon as the pattern's unique prefix is identified, but the automaton still needs the space to accommodate thousands of patterns. The scalability issue remains unsolved.
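The cost of the height-limited trie can be seen in a small Python sketch. It is a simplification under stated assumptions rather than ClamAV's code: the two trie levels are collapsed into a dictionary keyed by the first h bytes, a Python list stands in for the linked list, failure transitions are omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def build_prefix_lists(patterns, h=2):
    """Group patterns by their first h bytes (the leaf reached at level h)."""
    lists = defaultdict(list)
    for p in patterns:
        if len(p) < h:
            raise ValueError("pattern shorter than the trie height")
        lists[p[:h]].append(p)
    return lists


def scan_prefix_lists(data, lists, h=2):
    """Scan data; return the matches and the number of list entries compared."""
    hits = []
    comparisons = 0
    for i in range(len(data) - h + 1):
        # Every pattern sharing this h-byte prefix must be compared in turn.
        for p in lists.get(data[i:i + h], ()):
            comparisons += 1
            if data.startswith(p, i):
                hits.append((i, p))
    return hits, comparisons
```

Each text position whose first h bytes reach a populated leaf triggers one comparison per pattern in that list, so the scan time grows with the list length even when only one pattern, or none, actually matches.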

The trie structure is inherently expensive in space because each node in it must contain 256 pointers, each of which either points to the next node or is null. If a pointer takes 4 bytes, the pointers alone in a node take 256 × 4 bytes = 1 KB of space.

Erdogan and Cao [EC05] presented a filtering approach named Hash-AV to weed out most of the legitimate text with a Bloom filter [Blo70], which can reside in the L2 cache due to its space efficiency. The design selects a window of seven characters for the filtering and four hash functions to reduce the number of false positives. Because the Bloom filter is unable to handle the patterns shorter than seven characters, they are left to the AC algorithm for multi-part patterns.
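A Bloom-filter prefilter of this kind can be sketched in Python as follows. The sketch is not the Hash-AV implementation: the four hash functions are derived here from SHA-256 with a one-byte salt purely for convenience, the filter is keyed on the first seven bytes of each pattern as an assumption, and the class and function names are hypothetical.

```python
import hashlib

WINDOW = 7  # window length in characters, as in the design described above


class BloomFilter:
    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: salted SHA-256 stands in for four independent hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def bloom_scan(data, patterns, num_bits=1 << 16):
    """Filter every window through the Bloom filter, then verify exactly."""
    bloom = BloomFilter(num_bits)
    by_prefix = {}
    for p in patterns:
        if len(p) >= WINDOW:                 # shorter patterns are not handled here
            bloom.add(p[:WINDOW])
            by_prefix.setdefault(p[:WINDOW], []).append(p)

    hits = []
    for i in range(len(data) - WINDOW + 1):  # the window slides one byte at a time
        window = data[i:i + WINDOW]
        if window in bloom:                  # suspicious match (may be a false positive)
            for p in by_prefix.get(window, []):
                if data.startswith(p, i):    # exact verification
                    hits.append((i, p))
    return hits
```

A false positive in the filter costs only a failed verification, never a wrong result, which is why a compact bit array that fits in cache can safely stand in front of the exact matcher.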

The search window in Hash-AV does not skip any characters in the text, unlike the original WM algorithm in ClamAV. Hash-AV forgoes the benefit of skipping because the search window is rather short in ClamAV due to the short patterns, and the short window significantly limits the skip distance. Using only three characters to derive the distance also results in a high false-positive rate. However, we believe that skipping is still beneficial to performance if the search window can somehow be lengthened. Moreover, Hash-AV does not improve the AC algorithm in ClamAV at all.