4.3 The hybrid algorithm and practical issues
4.3.1 The BH algorithm
The WM algorithm implementation in ClamAV faces several practical performance issues. First, the block used to derive the shift distance consists of only three characters (i.e., b = 3). Although this block size seems sufficient, from a probabilistic standpoint, to weed out most false positives (i.e., suspicious matches that should be verified), it is not sufficient in practice because the blocks are not uniformly distributed.
For example, the block ‘0x00 0x00 0x00’ is frequent in Windows executable files.
The false-positive rate will be higher than expected if the block is the suffix of some pattern. For example, the rate is around 37% in our study on a sample set of Windows executable files. Extending the block size can reduce the false-positive rate, but it has three side effects. (1) Computing the hash function to look up a large block in the shift table takes longer. (2) The maximum shift distance in the WM algorithm is m − b + 1, so increasing b also shortens the maximum distance. (3) Although a heuristic such as that in [?] allows a maximum distance of up to m characters, filling up the shift table in the preprocessing stage will be time-consuming for large b. We will discuss this point in detail later. Second, the search window in the implementation is rather short because the pattern set has a pattern as short as four characters (see Table 4.1).
A skip longer than the shortest pattern length may miss the shortest pattern, because that pattern may happen to appear in a position between two consecutive skips.
Figure 4.1: The illustration of a missed match (in the FNP2 algorithm). This example illustrates that the shift of the search window should not go beyond the PSW.
This factor restricts the effectiveness of applying the WM algorithm to ClamAV.
We use the following methods in the backward hashing (BH) algorithm to solve the aforementioned problems.
4.3.1.1 A better heuristic to determine the shift distance
The heuristic in the WM algorithm is conservative because it considers only the entire block to derive the shift distance: if the rightmost block X does not appear in the pattern set, the shift value is m − b + 1. The value could be larger if the suffixes of X are also considered. For example, if X does not appear in any pattern and no suffix of X is a prefix of any pattern, a shift value of m is safe, i.e., no match will be missed. Liu et al. make a similar observation in their method, which indexes the shift table from the prefix sliding window (PSW) [LHC04], but the forward search may result in false negatives. Figure 4.1 gives an example showing that the shift should not go beyond the PSW: because the characters beyond the PSW have not been examined, no match that involves them may be excluded. In this example, the pattern ‘MATCH’ will be missed if the shift distance is longer than three characters.
The BH algorithm looks backward in the search window instead. Because the characters in the suffix have been examined, shifting beyond them is safe.
The new heuristic is formally stated as follows.
1. X does not appear in the pattern set, and no suffix of X is a prefix of any pattern. The shift value can be m if m ≥ b, or b otherwise.
2. X does not appear in the pattern set, but it has at least one suffix that is also a prefix of some pattern. Let k be the length of the longest such suffix. The shift value can be m − k if m ≥ b, or b − k otherwise.
3. X is a substring of some pattern if m ≥ b, or some pattern is a substring of X otherwise. In the former case, the shift value is m − j, assuming the rightmost occurrence of X ends in position j of some pattern. If j = m, X is the suffix of some pattern, and whether a true match occurs should be verified, after which the search window is shifted by one character. In the latter case, a match is claimed.
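The three cases above can be sketched directly for the common case m ≥ b. This is a minimal illustration, not the ClamAV implementation: the function name `bh_shift` is hypothetical, and the sketch assumes that only the first m characters of each pattern take part in the shift computation, as is usual in WM-style algorithms.

```python
def bh_shift(block, patterns, m):
    """Shift value for the rightmost block X of the search window (m >= b).

    Assumption: only the first m characters of every pattern are considered.
    """
    b = len(block)
    assert m >= b, "this sketch covers only the case m >= b"
    heads = [p[:m] for p in patterns]

    # Case 3: X is a substring of some pattern. j is the (1-based) end
    # position of its rightmost occurrence; a result of 0 means "verify,
    # then shift by one character".
    j = max((h.rfind(block) + b for h in heads if block in h), default=0)
    if j:
        return m - j

    # Case 2: the longest proper suffix of X (length k) that is also a
    # prefix of some pattern limits the shift to m - k.
    for k in range(b - 1, 0, -1):
        if any(h.startswith(block[-k:]) for h in heads):
            return m - k

    # Case 1: neither X nor any of its suffixes is related to a pattern.
    return m
```

For the single pattern XAMPAMPLE (m = 9, b = 3), the sketch yields m for a block such as QQQ, m − 1 for QQX, and m − 2 for QXA, whereas the WM heuristic would cap all three at m − b + 1 = 7.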
The shift value for every distinct X is calculated and stored in a shift table in preprocessing, so a single table lookup yields the shift distance, just as in the WM algorithm. The maximum shift value is m, rather than m − b + 1.
Like the WM algorithm, the BH algorithm builds a shift table in the preprocessing stage according to the above heuristic. Suppose a block X is mapped to the table with the hash function h. The steps are as follows.
1. Initialize each entry in the shift table to max(m, b). This value is filled because the maximum shift distance is m if m ≥ b, and b otherwise.
2. For every $x = x_1 \ldots x_q$ that is a prefix of some pattern, where $1 \le q < \min(m, b)$, set SHIFT[h(yx)] to $\max(m, b) - q$ for all $y \in \Sigma^{b-q}$.
3. For every block X that is a substring of some pattern, set SHIFT[h(X)] to m − j, where the rightmost occurrence of X ends at position j. If b > m, this step will be ignored because no such X exists.
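The three preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the name `build_shift_table` is hypothetical, a Python dictionary stands in for the hash function h (so no two blocks share an entry), absent keys represent the initial value of step 1, and only the first m characters of each pattern are used.

```python
from itertools import product

def build_shift_table(patterns, b, alphabet):
    """Return (table, default); blocks missing from table have shift `default`."""
    m = min(len(p) for p in patterns)
    heads = [p[:m] for p in patterns]   # assumption: first m characters only
    default = max(m, b)                 # step 1: initial value of every entry
    table = {}

    # Step 2: every block yx whose suffix x (length q) is a pattern prefix.
    # q ascends, so a longer matching suffix leaves the smaller shift.
    for q in range(1, min(m, b)):
        for x in {h[:q] for h in heads}:
            for y in product(alphabet, repeat=b - q):
                key = "".join(y) + x
                table[key] = min(table.get(key, default), default - q)

    # Step 3: every block inside a pattern; the rightmost occurrence wins
    # because it gives the smallest m - j. Skipped when b > m.
    if m >= b:
        for h in heads:
            for j in range(b, m + 1):
                block = h[j - b:j]
                table[block] = min(table.get(block, default), m - j)
    return table, default
```

Step 2 enumerates every y of length b − q, so its cost grows with the alphabet size raised to the power b − q; the sketch is practical only for small alphabets or small b.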
4.3.1.2 The bad-character heuristic
Increasing the block size b can reduce the false-positive rate, but the increase also complicates the build-up of the shift table. For example, the preprocessing stage needs to map all possible blocks whose last character (a suffix) is also the first character (a prefix) of some pattern to fill in the shift value of m − 1.
For each pattern, a total of $|\Sigma|^{b-1}$ blocks have a last character that is also the first character of that pattern. In other words, the number of blocks grows exponentially with b, which makes building the shift table with a large block size very expensive in preprocessing.
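To make the growth concrete, the count can be evaluated for a few block sizes. The short sketch below assumes a byte alphabet of 256 symbols, which is an assumption about the scanned data rather than a figure from the text; the function name is hypothetical.

```python
def blocks_ending_with_first_char(alphabet_size, b):
    """Number of length-b blocks whose last character is one fixed character."""
    return alphabet_size ** (b - 1)

# Assumption: byte alphabet, |Sigma| = 256.
for b in range(2, 6):
    print(f"b = {b}: {blocks_ending_with_first_char(256, b):,} blocks per pattern")
```

Under this assumption, b = 3 gives 65,536 blocks per pattern, while b = 5 already exceeds four billion.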
The bad-character heuristic aims to reduce the false-positive rate while keeping the block size manageable. Moreover, it attempts to exploit a large shift value when possible. The block size is still 3, but the heuristic avoids immediate verification by checking the additional blocks $B_1, B_2, \ldots, B_{\lfloor m/b \rfloor}$ to exploit a larger shift value if needed, where Bj denotes the j-th non-overlapping block counted backward from the window suffix. We give a trivial example in Figure 4.2 to justify the derivation of a safe shift from Bj. Suppose the algorithm scans for only one pattern, XAMPAMPLE. The rightmost block PLE in the search window matches the pattern suffix, so the hashing goes on to the next block XAM, whose position is 3 characters away from its rightmost appearance in the pattern. Therefore, shifting the search window by 3 characters is safe.
The heuristic also looks up Bj in the shift table. Let SHIFT[Bj] be the shift value derived by mapping Bj to the table. Because the shift values in the table are computed with respect to B0, they are not directly applicable to Bj. However, we still have a chance to exploit the shift values for Bj. In the above example, SHIFT[XAM] is 6 because XAM ends at position 3 and the pattern length is 9. From the value in the shift table, we can derive XAM’s position and infer the safe shift distance: XAM is B1, which lies b = 3 characters to the left of B0, so the safe shift is 6 − 3 = 3 characters.
Figure 4.2: The heuristic in the bad-block heuristic.
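The XAMPAMPLE example can be replayed with a small sketch. It is a hedged illustration for a single pattern: `shift_value` and `bad_block_shift` are hypothetical names, the shift values are computed on the fly rather than read from a compressed table, and only the blocks that lie fully inside the search window are checked.

```python
def shift_value(block, pattern):
    """Shift table entry for one block, computed directly from one pattern."""
    m, b = len(pattern), len(block)
    start = pattern.rfind(block)
    if start >= 0:                      # block occurs in the pattern
        return m - (start + b)
    for k in range(b - 1, 0, -1):       # longest suffix that is a pattern prefix
        if pattern.startswith(block[-k:]):
            return m - k
    return m

def bad_block_shift(window, pattern, b=3):
    """Safe shift for the window, or None if verification is required."""
    m = len(pattern)
    s0 = shift_value(window[m - b:], pattern)        # B0, the rightmost block
    if s0 > 0:
        return s0
    # B0 is a pattern suffix: look at B1, B2, ... before verifying.
    for j in range(1, m // b):
        end = m - j * b                               # Bj ends here in the window
        s = shift_value(window[end - b:end], pattern)
        if s > j * b:
            return s - j * b
    return None                                       # be conservative: verify
```

With the window ZZZXAMPLE, B0 = PLE gives a shift of 0, B1 = XAM gives 6 > 3, and the sketch returns a shift of 3. When the window equals the pattern itself, no Bj satisfies the condition and the sketch falls back to verification.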
The heuristic has two subtleties. (1) To compress the shift table, multiple blocks are mapped to the same entry, which stores the minimum of their shift values. Therefore, the shift value derived in the heuristic may be smaller than it should be, but it is still safe: no match will be missed. (2) The shift table records only the position of a block’s rightmost occurrence; it loses exact information such as whether a block appears at a specific position or appears multiple times in the patterns. Figure 4.3 gives an example of the information lost in the shift table. In this example, because PLE is the pattern suffix, SHIFT[PLE] is 0. When the heuristic looks up the block B1 = PLE in the table, it learns only that PLE is a pattern suffix. Whether PLE also appears at positions 4–6 is unknown from the table, even though a shift of 6 characters is safe in this case. In general, if SHIFT[Bj] > jb, a shift of SHIFT[Bj] − jb characters is safe; otherwise, the bad-block heuristic had better be conservative and go on to check Bj+1 so as not to miss any match. The correctness of the bad-block heuristic is proved as follows.
Theorem 1. The shift value derived in the bad-block heuristic is safe. That is, if SHIFT[Bj] > jb, a shift of SHIFT[Bj] − jb characters is safe.
Figure 4.3: An example to illustrate the information lost in the shift table.
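Proof. Suppose a match occurs when the search window is shifted by s, where s < SHIFT[Bj] − jb. Because Bj ends at position m − jb of the search window, this means there is a Bj that ends at position m − jb − s of some pattern. The shift table stores at most m minus the end position of a block’s rightmost occurrence, so

$$\mathrm{SHIFT}[B_j] \le m - (m - jb - s) = jb + s, \quad \text{i.e.,} \quad s \ge \mathrm{SHIFT}[B_j] - jb. \tag{4.1}$$

Equation 4.1 leads to a contradiction with the assumption on s, i.e., if the search window is shifted by less than SHIFT[Bj] − jb, no match should occur. A shift of SHIFT[Bj] − jb is thus safe.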
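Theorem 1 can also be checked exhaustively on small instances. The sketch below is an illustrative sanity check under simplifying assumptions (a single pattern, an uncompressed shift table computed on the fly, and hypothetical function names); it does not replace the proof.

```python
from itertools import product

def shift_value(block, pattern):
    """Shift table entry for one block, computed directly from one pattern."""
    m, b = len(pattern), len(block)
    start = pattern.rfind(block)
    if start >= 0:
        return m - (start + b)
    for k in range(b - 1, 0, -1):
        if pattern.startswith(block[-k:]):
            return m - k
    return m

def match_possible(window, pattern, s):
    """True if the pattern could match after shifting the window by s.

    Only the characters already inside the window constrain the answer;
    the s characters that follow the window are unknown.
    """
    return window[s:] == pattern[:len(pattern) - s]

def theorem_holds(pattern, b, alphabet):
    """Check every window: no match is possible at a shift below SHIFT[Bj] - jb."""
    m = len(pattern)
    for chars in product(alphabet, repeat=m):
        window = "".join(chars)
        for j in range(1, m // b):
            end = m - j * b
            safe = shift_value(window[end - b:end], pattern) - j * b
            if any(match_possible(window, pattern, s) for s in range(max(safe, 0))):
                return False
    return True
```

The check enumerates every window over the alphabet, so it is feasible only for short patterns and small alphabets, for example a six-character pattern over three symbols.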