The Bloom Filter Accelerated Sub-linear Time algorithm

3.1 The Bloom Filter Accelerated Sub-linear Time algorithm

This work designs a hardware architecture for a sub-linear time algorithm, extended from the WM algorithm, to accelerate multiple-string matching. The two key design goals are to avoid the need for a large shift table and to reduce the impact of the worst case on performance.

3.1.1 Drawbacks of using a shift table

The WM algorithm looks up shift values in the shift table during the scanning stage, using the block at the suffix of the search window as the index. A block of fewer than three characters is very likely to appear in a large pattern set, such as a set of virus signatures, so under the WM heuristic most shift distances will be short and verifications will be frequent. A block of at least three characters improves the situation, but it also requires a large shift table. For example, storing the shift value of every three-character block requires 256³ entries, which amounts to 16 MB of memory if each entry takes one byte. A block size larger than three is almost impractical because of the huge table size. The table can be compressed by mapping multiple blocks to one entry, but the shift value stored in an entry must then be the minimum over all the blocks mapped to that entry. If the table is compressed too much, the shift distances shrink and the number of verifications increases significantly.
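The table-size arithmetic above can be checked with a short calculation. This is only an illustrative sketch; the function name and its parameters are hypothetical, and a one-byte entry over a 256-symbol alphabet is assumed, as in the text.

```python
def shift_table_bytes(block_size, alphabet_size=256, bytes_per_entry=1):
    """Memory needed for a directly indexed, uncompressed shift table."""
    return (alphabet_size ** block_size) * bytes_per_entry


# Table size for block sizes of two, three, and four characters.
for block_size in (2, 3, 4):
    size = shift_table_bytes(block_size)
    print(f"|B| = {block_size}: {size:,} bytes ({size / 2**20:g} MiB)")
```

Each additional character in the block multiplies the table size by 256, which is why a block of four characters already needs 4 GiB.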

A large table cannot fit into embedded memory, and if the table is stored in external memory, the slow memory accesses degrade the overall performance. Moreover, the shift table can be indexed only by the rightmost block of the search window. If a shift value of zero occurs frequently, the frequent verifications slow down the overall performance. The BFAST algorithm keeps the positions of the blocks in the patterns, so that not only the rightmost block but also the other blocks in the search window can be located in the patterns. The algorithm can therefore use a heuristic similar to the bad-character heuristic of the Boyer-Moore algorithm to determine a better shift value. This benefit will be demonstrated in the next sub-section.

3.1.2 Implicit shift table using Bloom filters

Let B_0^B be the rightmost block in the search window. The shift distance is a function of the positions of B_0^B or its suffixes in the patterns [24], so storing the blocks separately for each position in the patterns is sufficient to derive the shift distance. Fig. 2 shows an example of this derivation. Assume the current block of the text is 'XAMP'. It is the fourth-last block of the pattern 'EXAMPLE1', three characters before the last block 'PLE1', so the shift distance of 'XAMP' should be 3. After this shift, the block 'PLE1' is fetched and checked to see whether its shift distance is 0, as illustrated in Section 2.1.2. The shift value is derived formally in Equation (1).

Figure 2. Shift distance of a block can be derived from its position in the patterns

With this derivation of the shift distance, we can replace the shift-table lookup with membership queries to parallel Bloom filters. Bloom filtering is a space-efficient approach to storing strings of the same length for membership queries, i.e., for checking whether a string belongs to the string set. By grouping the blocks according to their positions in the patterns and storing each group in a separate Bloom filter, we can determine whether a block occurs in the patterns, and at which position, by querying these Bloom filters in parallel.
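As a software illustration of the membership query, a Bloom filter can be sketched as below. This is a minimal sketch, not the hardware design of this work: the class name, the bit-vector length, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings (software sketch only)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        # Assumption: num_hashes < 256 so the hash index fits in one salt byte.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function by salting a single hash.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # A hit requires every addressed bit to be set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


group0 = BloomFilter()
for block in (b"efgh", b"mnop", b"vuts"):
    group0.add(block)

print(b"efgh" in group0)  # True: a stored block is always reported as a hit
print(b"cdef" in group0)  # usually False; a False answer is always correct
```

A stored block can never be missed, while a block that was not stored may occasionally be reported as a hit.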

Fig. 3 illustrates how to establish an implicit shift table using Bloom filters. Assume the pattern set is {P₁, P₂, P₃}. After the blocks are divided by position, Group 0 is {efgh, mnop, vuts}, Group 1 is {defg, lmno, wvut}, and so on. If the block of the text is "cdef", the query reports a hit in Group 2, so the shift distance is 2. If no hit is reported, the block does not occur in the patterns, and we can safely shift by the maximum shift distance, which is 8 in this example.

Figure 3. Grouping of blocks in the patterns for deriving the shift distance from querying Bloom filters. The shift table in the WM algorithm becomes implicit in the Bloom filters herein.
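The grouping and the resulting shift lookup can be sketched in software as follows. This is a hedged illustration: the three pattern strings are hypothetical, chosen only so that Groups 0 and 1 match the blocks listed for Fig. 3; exact Python sets stand in for the Bloom filters; and only full-length blocks are grouped, so the block-suffix handling of [24] is not modeled.

```python
def build_groups(patterns, block_size):
    """Group i holds the blocks that end i characters before the pattern end."""
    m = min(len(p) for p in patterns)            # window length = shortest pattern
    groups = [set() for _ in range(m - block_size + 1)]
    for p in patterns:
        p = p[:m]                                # assumption: use the first m characters
        for i in range(m - block_size + 1):
            end = m - i
            groups[i].add(p[end - block_size:end])
    return groups


def shift_distance(block, groups):
    """Smallest group index hit by the block, or None if no group is hit."""
    for i, group in enumerate(groups):           # hardware would query all groups in parallel
        if block in group:
            return i
    return None


patterns = ["abcdefgh", "ijklmnop", "zyxwvuts"]  # hypothetical P1, P2, P3
groups = build_groups(patterns, block_size=4)

print(sorted(groups[0]))               # ['efgh', 'mnop', 'vuts']
print(sorted(groups[1]))               # ['defg', 'lmno', 'wvut']
print(shift_distance("cdef", groups))  # 2
print(shift_distance("qqqq", groups))  # None: no hit, so the maximum shift applies
```

Taking the smallest hit index keeps the shift safe when a block occurs at several positions in the patterns.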

The grouping is defined formally in Equation 2:
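One formulation of the grouping that is consistent with the Fig. 3 example is sketched below. It is an assumption rather than a quotation: the indexing convention (1-indexed characters of the first m characters of each pattern), the symbol r for the number of patterns, and the restriction to full-length blocks are all choices made here, and the block-suffix case of [24] is not covered.

```latex
\[
  G_i \;=\; \bigl\{\, P_k[\,m-|B|-i+1 \,..\, m-i\,] \;:\; 1 \le k \le r \,\bigr\},
  \qquad 0 \le i \le m-|B| ,
\]
\[
  \mathrm{shift}\bigl(B_0^B\bigr) \;=\; \min\bigl\{\, i \;:\; B_0^B \in G_i \,\bigr\}
  \quad \text{when at least one group is hit.}
\]
```

With m = 8 and |B| = 4, G_0 collects characters 5 to 8 of each pattern and G_2 collects characters 3 to 6, which agrees with the blocks "efgh" and "cdef" in the example.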

The membership query of the Bloom filters may produce false positives. In other words, a block may not exist in a group, yet the Bloom filter of that group may still report a hit. When a false positive occurs, the shift distance is smaller than it should be, but the search is still safe: no match will be missed. As long as the false-positive rate is kept low through proper parameter settings of the Bloom filter, such as the length of the bit vector, false positives will not be an issue.
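The effect of the bit-vector length can be estimated with the standard Bloom filter approximation, in which a filter of M bits with k hash functions holding n items has a false-positive rate of about (1 - e^(-kn/M))^k. The sketch below evaluates it; the parameter values are illustrative assumptions, not the settings used in this work.

```python
import math


def bloom_false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation (1 - e^(-k*n/M))^k of the false-positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


num_items = 10_000  # illustrative: one block per pattern in a group
for bits_per_item in (8, 16, 32):
    rate = bloom_false_positive_rate(bits_per_item * num_items,
                                     num_hashes=4, num_items=num_items)
    print(f"{bits_per_item} bits per item: {rate:.4%}")
```

With the number of hash functions fixed, lengthening the bit vector lowers the rate monotonically.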

3.1.3 Additional checking in the Bloom filters

Although G_0 is rarely hit for random samples, i.e., the block is usually not the rightmost block of any pattern, this is not always the case in practice, for reasons such as the one illustrated in Section 3.2.1. Therefore, unlike the original WM algorithm, which verifies a possible match immediately, the BFAST algorithm continues by checking the blocks B_1^B, B_2^B, …, B_{m-|B|}^B, in the manner of the bad-character heuristic of the Boyer-Moore algorithm. Here B_j^B stands for the |B| characters that end j characters before the rightmost character of the search window. If B_j^B misses G_j and the first Bloom filter it hits is that of G_i, where i > j, the shift distance can be i - j. The reasoning is much like that of the bad-character heuristic: a shift of less than i - j cannot lead to a match because B_j^B cannot match any block in the groups from G_j to G_{i-1}. The verification procedure follows, to check whether a true match occurs, only if every block from B_0^B to B_{m-|B|}^B is in the corresponding Bloom filter from G_0 to G_{m-|B|}.

For example, assume the text is abcdefghijklmn…. When the query of the block hijk reports a hit in group 0, i.e., the shift distance equals 0, we take the preceding block ghij to query the Bloom filter of group 1. If it also hits, we continue with the preceding block fghi to query group 2; otherwise, we declare the verification ended and move on to scan the block ijkl, which follows the block that triggered the check, i.e., the block whose shift distance is 0. This procedure repeats until the last group has been queried. If all the groups are hit, Anchored-AC verification is invoked. This additional checking can significantly reduce the number of verifications compared with the WM algorithm. In the simulation using 10,000 patterns, this approach reduces the number of verifications by around 50%.
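The additional checking can be sketched in software as below. This is a simplified illustration under stated assumptions, not the hardware procedure itself: exact sets stand in for the Bloom filters, the patterns are hypothetical, the function names are invented for the example, and when a block hits no group the sketch falls back to the conservative distance that moves the block just past the window, without modeling the block-suffix handling of [24].

```python
def build_groups(patterns, block_size):
    """Group i holds the blocks that end i characters before the pattern end."""
    m = min(len(p) for p in patterns)
    groups = [set() for _ in range(m - block_size + 1)]
    for p in patterns:
        for i in range(m - block_size + 1):
            groups[i].add(p[m - i - block_size:m - i])
    return groups


def safe_shift(window, groups, block_size):
    """Return 0 if every B_j hits G_j (verify), else a shift justified by one block."""
    m = len(window)
    last = m - block_size                        # index of the last group
    for j in range(last + 1):
        block = window[m - j - block_size:m - j]  # B_j: ends j characters from the right
        hits = [i for i in range(j, last + 1) if block in groups[i]]
        shift_j = hits[0] - j if hits else last - j + 1
        if shift_j > 0:
            return shift_j                       # chain broken: no verification needed
    return 0


def scan(text, patterns, block_size):
    """Collect window positions that would be handed to the verification engine."""
    groups = build_groups(patterns, block_size)
    m = min(len(p) for p in patterns)
    pos, candidates = 0, []
    while pos + m <= len(text):
        shift = safe_shift(text[pos:pos + m], groups, block_size)
        if shift == 0:
            candidates.append(pos)
            shift = 1
        pos += shift
    return candidates


patterns = ["abcdefgh", "ijklmnop", "zyxwvuts"]  # hypothetical patterns
groups = build_groups(patterns, 4)
print(safe_shift("xxxxefgh", groups, 4))         # 4: G_0 is hit, but B_1 hits no group
print(scan("xxabcdefghxxijklmnopxx", patterns, 4))  # [2, 12]
```

In the first call, the original WM algorithm would start a verification because the rightmost block hits G_0, whereas the chained check rejects the window after one more query.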

3.1.4 Worst case handling

The performance of a sub-linear time algorithm, such as the WM algorithm, may be low in some cases. First, when the pattern length m is close to the block size |B|, the maximum shift distance of m - |B| + 1 is very short, given m ≥ |B|. The BFAST algorithm can process at least four characters in each shift of the search window, while the shift distance in the WM algorithm can be as short as one or two characters in the same case. Second, the worst-case time complexity can be as high as O(mn) if the patterns occur frequently in the text. Consider the extreme case in which the text and some patterns consist entirely of a's: verification is then required after every shift of only one character. To improve the worst-case performance, this work uses a linear-time algorithm, Anchored-AC, alongside the sub-linear time algorithm to perform the verification. The verification engine reports the verification result directly to software (upper-layer applications). The search engine and the verification engine communicate through a descriptor buffer. As long as the buffer is not full, the search engine can offload the verification after finding a potential match and move on to scan the next block without blocking.
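The decoupling through the descriptor buffer can be pictured with the following software sketch. It is only an analogy under stated assumptions: the class and method names are hypothetical, a descriptor is reduced to a text position, and a direct string comparison stands in for the Anchored-AC verification used in this work.

```python
from collections import deque


class DescriptorBuffer:
    """Bounded FIFO between the search engine and the verification engine."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, position):
        """Search side: enqueue a candidate; False means the buffer is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(position)
        return True

    def take(self):
        """Verification side: dequeue the oldest candidate, or None if empty."""
        return self.queue.popleft() if self.queue else None


def verify(text, position, patterns):
    """Stand-in for Anchored-AC: report the patterns that start at this position."""
    return [p for p in patterns if text.startswith(p, position)]


text = "xxabcdefghxx"
patterns = ["abcdefgh", "ijklmnop"]          # hypothetical patterns
buffer = DescriptorBuffer(capacity=2)

print(buffer.offer(2))   # True: the search side continues scanning immediately
print(buffer.offer(5))   # True
print(buffer.offer(7))   # False: buffer full, the search side would have to wait
print(verify(text, buffer.take(), patterns))  # ['abcdefgh']
```

The search side stalls only when `offer` returns False, which corresponds to the buffer-full condition described above.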

3.1.5 Advantages of the proposed architecture

This architecture can process multiple characters at a time with a number of Bloom filters on the order of O(m). Compared with other Bloom-filter-based architectures, such as [29], which requires Bloom filters on the order of O(ms), where s is the allowed shift distance, the proposed architecture has two major advantages. First, fewer Bloom filters are required for the same purpose of processing multiple characters at a time. Second, the proposed architecture allows a long shift distance. For example, if the shortest pattern length is 10, the proposed architecture allows shifting by as many as 10 characters at a time. This is not feasible in the architecture of [29] because the number of Bloom filters is large and simultaneous accesses to the bit vector from so many Bloom filters are difficult. Moreover, as far as we know, no other hardware architecture has achieved such a long shift distance so far.

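From the document: Accelerating Traditional Sub-linear Time String Matching Algorithms with a Bloom Filter Hardware Implementation: Design, Implementation, and Evaluation (pages 18-23)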