
Chapter 4 Implementation details

4.1 String match architecture

The string matching architecture includes two main components: (1) the scanning module, the main block that performs the proposed algorithm by querying the Bloom filters and shifting the text according to the query result, and (2) the verification module and its interface. When the scanning module finds a potential match, it issues a verification job by filling an entry in the verification job buffer of the verification interface. Fig. 4 shows the block diagram of the entire architecture. Each component of this architecture is described in the following sections.

Figure 4. Overview of the string matching architecture

Each shift in the text includes three operations implemented in three separate sub-modules in the scanning module.

1. TextMemoryFetch fetches the suffix block of the search window from the text memory.

2. BloomFilterQuery queries the Bloom filters to find which group(s) the block belongs to.

3. TextPositionController calculates the location (address) of the search window in the text memory for the next round according to the query result from the Bloom filters.

4.1.1 Text memory fetching

The block size is set to one four-byte word, both to access memory efficiently and to reduce the probability that a random block matches. To access four contiguous bytes in parallel, the text memory is divided into four interleaved banks.

Fig. 5 illustrates an example of fetching the word ‘BCDE’ starting at byte address 0001 (binary). Note that the characters of the text are interleaved across the memory banks and that the first character to fetch is located in bank 1. The underlined bits of the address, i.e., all bits except the last two, form the word address. The last two bits, the byte offset, are decoded to fetch the correct byte from each bank. The fetched word is then rotated according to the byte offset from a multiple of four.

Figure 5. An example of fetching the four bytes at addresses 0001, 0010, 0011 and 0100.
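The interleaved fetch can be modelled in software as a minimal sketch. The function names `build_banks` and `fetch_word` are illustrative and do not come from the original design; the sketch assumes byte address a is stored in bank a mod 4 at row a div 4, and it does no bounds handling at the end of the text.

```python
def build_banks(text: bytes, num_banks: int = 4):
    """Interleave the text: byte address a goes to bank a % 4, row a // 4."""
    return [text[b::num_banks] for b in range(num_banks)]


def fetch_word(banks, byte_addr: int) -> bytes:
    """Fetch four contiguous bytes starting at an arbitrary byte address."""
    num_banks = len(banks)
    word_addr, offset = divmod(byte_addr, num_banks)
    # Banks below the byte offset hold bytes that belong to the next word row.
    raw = bytes(
        banks[b][word_addr + (1 if b < offset else 0)]
        for b in range(num_banks)
    )
    # raw is in bank order; rotate left by the offset to restore text order.
    return raw[offset:] + raw[:offset]


banks = build_banks(b"ABCDEFGH")
print(fetch_word(banks, 0b0001))  # b'BCDE'
```

In this model the first byte of the word comes from bank 1 and the last from bank 0 of the following row, which is why the rotation step is needed.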

4.1.2 Bloom filter querying

There are N independent Bloom filters, each storing a different set of pattern blocks grouped by their positions in the patterns, where N is the number of groups. The block fetched by the TextMemoryFetch module queries these N Bloom filters in parallel to obtain the membership information. After the query, the priority encoder in the TextPositionController encodes the membership information into the shift distance, as illustrated in Chapter 3. The block diagram of the BloomFilterQuery module is presented in Fig. 6.

Figure 6. BloomFilterQuery module architecture
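The query path can be sketched in software as follows. Several details are assumptions made only for illustration: the SHA-256-derived hash indices (the source does not specify its hash functions), the class and function names, and above all the encoding rule in `priority_encode`, which assumes that group i corresponds to shift distance i, that the lowest hit group wins, and that no hit yields the maximum shift N. The actual mapping is the one defined in Chapter 3.

```python
import hashlib

M = 16384  # bits per filter, matching one 16 kb block RAM


def _hashes(block: bytes, k: int = 2):
    """Illustrative hash indices only; not the hardware hash functions."""
    digest = hashlib.sha256(block).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(k)]


class BloomFilter:
    def __init__(self):
        self.bits = bytearray(M // 8)

    def add(self, block: bytes) -> None:
        for h in _hashes(block):
            self.bits[h >> 3] |= 1 << (h & 7)

    def __contains__(self, block: bytes) -> bool:
        return all((self.bits[h >> 3] >> (h & 7)) & 1 for h in _hashes(block))


def query_groups(filters, block: bytes):
    """One membership bit per group; the hardware queries all N in parallel."""
    return [block in f for f in filters]


def priority_encode(membership):
    """ASSUMPTION: group i means shift distance i; no hit means shift N."""
    for i, hit in enumerate(membership):
        if hit:
            return i
    return len(membership)


filters = [BloomFilter() for _ in range(4)]
filters[2].add(b"BCDE")
membership = query_groups(filters, b"BCDE")
print(membership, priority_encode(membership))
```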

Because the bit vector has to be long enough to keep the false positive rate low, on-chip dual-port block RAM is a lower-cost way to implement it than flip-flops. Fig. 6 is an example that uses 16 kb block RAMs on the Xilinx XC2VP30 to implement one Bloom filter. Each block RAM is configured as a 1-bit-wide, 16 kb-long bit array and can be accessed on both ports simultaneously to support two hash functions. Thus, the false positive rate f of one block memory is $f = \left(1 - e^{-2n/16384}\right)^{2}$ [29], where n is the number of pattern blocks stored in that bit vector. Using k block memories reduces this rate to f^k, which is very close to the false positive rate of a single k×16 kb memory with 2k ports. The hash functions are independent, so they can be computed, and the corresponding bits of the M-bit bit vector fetched, in parallel.
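The claim that k block memories come very close to one large memory can be checked numerically. The sketch below uses the standard Bloom filter false positive formulas; the function names and the sample values n = 1000 and k = 4 are chosen here for illustration and are not taken from the source.

```python
import math

M_BITS = 16384  # one 16 kb block RAM
HASHES = 2      # one hash function per RAM port


def fp_rate(n: int, m: int = M_BITS, h: int = HASHES) -> float:
    """Exact-form false positive rate of an m-bit filter with h hashes."""
    return (1.0 - (1.0 - 1.0 / m) ** (h * n)) ** h


def fp_rate_approx(n: int, m: int = M_BITS, h: int = HASHES) -> float:
    """Common exponential approximation of the same rate."""
    return (1.0 - math.exp(-h * n / m)) ** h


n, k = 1000, 4                      # hypothetical load and memory count
f = fp_rate_approx(n)               # one block memory
partitioned = fp_rate(n) ** k       # k block memories, 2 ports each
monolithic = fp_rate(n, m=k * M_BITS, h=k * HASHES)  # one memory, 2k ports

print(f, partitioned, monolithic)
```

Under the exponential approximation the two configurations give identical rates; the exact forms differ only slightly, which is what makes the partitioned block RAM layout attractive.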

4.1.3 Text position controller

The TextPositionController maintains the position of the suffix block of the search window in the text and calculates the next position according to the membership information from the BloomFilterQuery module and the current matching state. A finite state machine with five states controls how the position is calculated. Fig. 7 illustrates the state transition diagram.

Figure 7. Text position controller state transition diagram.

1. In the beginning, the initial address is set according to the scan window size and the block size. For example, if the scan window size is l and the block size is b, the initial text position is l - b.

2. When the shift distance is non-zero, i.e., there is no potential match, the shift distance is added to the text position to get the next one.

3. When the shift distance is zero, it subtracts 1 from the text position to get the preceding block in the text for the additional checking described in Section 3.2.2, and stores the text position of this hit block so that it can move on to the next block once verification has finished.

4. When the additional checking has finished, it shifts by the shift distance of the non-hit block if there is no match; otherwise it reports a match and shifts by just one byte to find the next match.

5. When there is a potential match, i.e., the additional checking reports a match, but the verification job buffer is full, there is no space for issuing a verification job. The TextPositionController halts and waits for an entry to become free, so the text position does not change in this state.
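The position update performed in each of the five states can be summarised as a small sketch. The state names and the function `next_position` are hypothetical, and the transitions between states (Fig. 7) are deliberately not modelled. One further assumption is marked in the code: after the additional checking, the shift is applied relative to the stored hit position.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()     # item 1: set the initial address
    SHIFT = auto()    # item 2: non-zero shift distance
    HIT = auto()      # item 3: zero shift distance, step back to check
    CHECKED = auto()  # item 4: additional checking finished
    WAIT = auto()     # item 5: verification job buffer full


def next_position(state, pos, hit_pos, shift, window, block):
    """Return (new_pos, new_hit_pos) for one round in the given state."""
    if state is State.INIT:
        return window - block, hit_pos
    if state is State.SHIFT:
        return pos + shift, hit_pos
    if state is State.HIT:
        # Remember the hit block, then move to the preceding block.
        return pos - 1, pos
    if state is State.CHECKED:
        # ASSUMPTION: the shift is applied from the stored hit position.
        # A zero shift here stands for a reported match: advance one byte.
        step = shift if shift != 0 else 1
        return hit_pos + step, None
    return pos, hit_pos  # WAIT: the position does not change


# Hypothetical parameters: scan window of 11 bytes, block of 4 bytes.
pos, hit = next_position(State.INIT, None, None, 0, window=11, block=4)
pos, hit = next_position(State.SHIFT, pos, hit, 3, 11, 4)
pos, hit = next_position(State.HIT, pos, hit, 0, 11, 4)
print(pos, hit)
```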

4.1.4 Verification interface

This work defines a flexible verification interface rather than implementing a specific verification mechanism, so that the verification mechanism can be replaced to suit different applications. The verification mechanism itself is beyond the scope of this work, so we only briefly describe the advantages of the approach taken here.

This work uses the anchored Aho-Corasick (AC) algorithm to verify the suspicious data for two reasons. (1) Its data structure allows a high compression rate. The original AC data structure storing 1000 patterns is compressed to 1 Mb, almost 0.2% of the original, which fits in the Virtex-II Pro platform used in our experiments. (2) Its time complexity is linear in the worst case. Because a potential match is very likely to be a true match, i.e., a virus, an algorithm with linear worst-case time discovers it efficiently.

There are two parts in the verification interface: the JobDispatcher and the VerificationJobBuffer (VJB). When the scanning module discovers a potential match, it instructs the JobDispatcher to write a verification job descriptor (VJD), composed of the text position, the length and other related information, into the VJB. The format of the VJD is illustrated in Fig. 8. The most significant bit of a VJB entry is set when the entry is allocated. The verification module should test this bit to learn whether there is a new verification job, and then decode the text position, the length and any other information it needs for verification. After it finishes the verification, it should clear the entry it verified to free it.

Figure 8. Verification Job Descriptor format

The VJB is also implemented with one 16 kb dual-port block RAM, so it holds 16k/32 = 512 entries. The JobDispatcher keeps a VJB pointer. After the scanning module allocates an entry, the pointer moves to the next entry. The next time the scanning module finds a potential match, the JobDispatcher reads the allocation bit of that entry to tell whether it is empty. The pointer wraps around when it reaches the bottom of the VJB, so if the allocation bit at the pointer is set, the VJB is full. This situation can be mitigated by using more than one verification module to process the jobs in the VJB, balancing the verification and scanning speeds.
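The allocation protocol can be sketched as a circular buffer of 32-bit entries. The field widths below (23 bits of text position and 8 bits of length under the allocation bit) are hypothetical, since the actual VJD layout is the one given in Fig. 8; the class and method names are likewise illustrative.

```python
ENTRY_BITS = 32
NUM_ENTRIES = 16 * 1024 // ENTRY_BITS  # 512 entries in one 16 kb block RAM
ALLOC_BIT = 1 << (ENTRY_BITS - 1)      # most significant bit
POS_BITS = 23                          # hypothetical field width
LEN_BITS = 8                           # hypothetical field width


def pack_vjd(pos: int, length: int) -> int:
    """Build an allocated descriptor from a text position and a length."""
    if not (0 <= pos < (1 << POS_BITS) and 0 <= length < (1 << LEN_BITS)):
        raise ValueError("field out of range")
    return ALLOC_BIT | (length << POS_BITS) | pos


def unpack_vjd(entry: int):
    """Return (pos, length) decoded from a descriptor."""
    return entry & ((1 << POS_BITS) - 1), (entry >> POS_BITS) & ((1 << LEN_BITS) - 1)


class VerificationJobBuffer:
    def __init__(self):
        self.entries = [0] * NUM_ENTRIES
        self.pointer = 0  # the JobDispatcher's VJB pointer

    def dispatch(self, pos: int, length: int) -> bool:
        """Allocate the entry at the pointer; False means the VJB is full."""
        if self.entries[self.pointer] & ALLOC_BIT:
            return False  # the scanning module has to halt and retry
        self.entries[self.pointer] = pack_vjd(pos, length)
        self.pointer = (self.pointer + 1) % NUM_ENTRIES  # wrap at the bottom
        return True

    def release(self, index: int) -> None:
        """Called by the verification module once a job has been verified."""
        self.entries[index] = 0


vjb = VerificationJobBuffer()
results = [vjb.dispatch(pos=i, length=8) for i in range(NUM_ENTRIES + 1)]
print(results.count(True), results[-1])
```

Because the pointer only advances on allocation, a single set allocation bit at the pointer is enough to detect a full buffer without a separate counter.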

4.1.5 Pipelining the design

In the original multi-cycle design, the location of the next block is decided by the shift value derived from the Bloom filters before the next round of matching can continue, so only one module is active at a time, as in Fig. 9(a). We pipeline the design by dividing the text into four independent segments, as in the example of Fig. 9(b). If the length of the text is m, the segments are obtained by dividing the original text into four parts and extending each part by the scan window size so that no pattern spanning two segments is missed. For example, assume the length of the text is 40. The four segments will be 0~9, 3~19, 13~29 and 23~39. With the text divided into four independent segments in this way, the TextPositionController can assign the four start addresses, one per cycle, during the first four cycles. It then calculates the second block position of the first segment at the fifth cycle, 7+S1, the second block position of the second segment at the sixth cycle, 10+S2, and so on, where S1 is the shift distance returned by the first query of the first segment and S2 is the shift distance returned by the first query of the second segment.

Figure 9. (a) State diagram of pipelining
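The segment boundaries of the 40-byte example can be reproduced with a short sketch. The function `split_segments` and its `overlap` parameter are illustrative; the overlap of 7 bytes and the initial block offset of 7 are read off the example values (segments starting at 3, 13, 23 and first block positions 7 and 10), and how they derive from the scan window size is not stated here.

```python
def split_segments(text_len: int, overlap: int, parts: int = 4):
    """Split the range [0, text_len) into overlapping (start, end) segments."""
    size = text_len // parts
    segments = []
    for i in range(parts):
        # Every segment after the first reaches back by `overlap` bytes.
        start = max(0, i * size - overlap)
        end = text_len - 1 if i == parts - 1 else (i + 1) * size - 1
        segments.append((start, end))
    return segments


segments = split_segments(40, overlap=7)
# ASSUMPTION: the first block of each segment sits 7 bytes after its start,
# matching the positions 7 and 10 used in the example.
first_positions = [start + 7 for start, _ in segments]
print(segments)         # [(0, 9), (3, 19), (13, 29), (23, 39)]
print(first_positions)  # [7, 10, 20, 30]
```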

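From the document: Accelerating Traditional Sub-linear Time String Matching Algorithms with a Hardware Implementation of Bloom Filters: Design, Implementation, and Evaluation (pages 25-31)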