2.5.1 Implementation in three content security packages
Each application has different pattern lengths and a different pattern set size. The shaded areas in Figure 2.10 indicate the range of pattern lengths and pattern set sizes of each application, overlaid on the profiling results. A shaded arrow indicates that an application has patterns longer than the lengths covered by the shaded area. The figure therefore suggests which algorithm is best suited to each application. We revised the packages in Section 2.2.2 by implementing the suggested algorithms and observed the acceleration reported below.
ClamAV The LSP of basic patterns in ClamAV is 10, and the number of patterns is larger than 30,000 to date. We replace the WM algorithm with the 2-gram BG+ algorithm to match basic patterns. If the pattern set grows even larger in the future, the 3-gram BG+ algorithm can be used to improve efficiency. The AC algorithm is still responsible for matching regular expressions of multi-part patterns.

Figure 2.10: The profiling summary of each algorithm. (C denotes |Σ|.) [Axis label: pattern length.]
DansGuardian According to our investigation, the current implementation uses the Horspool algorithm to scan for 25 content keywords one by one. If the forcequicksearch option is enabled, every pattern in the pattern set is searched for with the Horspool algorithm. We do not enable this option, because the text would then be scanned as many times as there are patterns, which actually slows down the search. Instead, we group all the patterns together and implement the Modified-WM algorithm to handle the short patterns with LSP = 2 and 3, and the 2-gram BG+ algorithm to handle the longer patterns.
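The cost of searching keywords one by one can be illustrated with a minimal sketch. It is a textbook Horspool search written for this illustration, not DansGuardian's code; the function names `horspool_find_all` and `scan_one_by_one` are hypothetical. The point is only that the number of passes over the text equals the number of patterns.

```python
from typing import Dict, List, Tuple


def horspool_find_all(text: bytes, pattern: bytes) -> List[int]:
    """Return the start offsets of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: for each byte, the distance from its last occurrence in
    # pattern[:-1] to the end of the pattern; bytes not present shift by m.
    shift = [m] * 256
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Slide the window according to the text byte under its last position.
        pos += shift[text[pos + m - 1]]
    return hits


def scan_one_by_one(text: bytes, patterns: List[bytes]) -> Tuple[Dict[bytes, List[int]], int]:
    """Search each pattern separately and count the passes over the text."""
    results = {}
    passes = 0
    for p in patterns:
        results[p] = horspool_find_all(text, p)  # one full pass per pattern
        passes += 1
    return results, passes
```

With 25 keywords, `scan_one_by_one` makes 25 passes over each page, which is the overhead that a multi-pattern algorithm avoids by scanning the text once for the whole set.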
Snort Snort groups patterns into rule sets according to the packet header, and the LSP differs from one rule set to another. Instead of using only the default method, the Modified-WM algorithm, we implement a hybrid method: the AC algorithm is selected for rule sets with LSP = 1; otherwise, the Modified-WM algorithm still handles the pattern matching.
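The hybrid selection for Snort can be sketched as a simple per-rule-set dispatch. The sketch assumes that LSP denotes the length of the shortest pattern in a set; the function names and the rule-set names in the usage example are hypothetical and do not come from Snort's source code.

```python
from typing import Dict, List


def lsp(patterns: List[bytes]) -> int:
    """Length of the shortest pattern in the set (assumed meaning of LSP)."""
    return min(len(p) for p in patterns)


def choose_algorithm(patterns: List[bytes]) -> str:
    """Pick AC when the rule set contains a one-byte pattern, else Modified-WM."""
    return "AC" if lsp(patterns) == 1 else "Modified-WM"


def plan(rule_sets: Dict[str, List[bytes]]) -> Dict[str, str]:
    """Map each rule set name to the algorithm selected for it."""
    return {name: choose_algorithm(pats) for name, pats in rule_sets.items()}
```

For example, `plan({"http": [b"GET", b"/"], "smtp": [b"HELO", b"MAIL"]})` selects AC for the first rule set, because it contains a one-byte pattern, and Modified-WM for the second.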
2.5.2 Benchmarking of the revised implementation
The speedup of the revised packages is benchmarked in this section. The performance on real and synthetic sample data is also compared, where the synthetic data are generated from uniformly distributed random characters. Comparing both types of sample data shows whether the observations for synthetic data also apply to real situations.
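Synthetic data of this kind can be produced with a few lines of code. The following is a minimal sketch, not the generator used in the experiments; the function name, the `seed` parameter (added for reproducibility), and the `base` offset are assumptions made for illustration.

```python
import random


def synthetic_bytes(size: int, alphabet_size: int = 256, seed: int = 0, base: int = 0) -> bytes:
    """Return `size` bytes drawn uniformly from `alphabet_size` consecutive byte values.

    `base` is the first byte value of the alphabet (hypothetical parameter),
    e.g. base=ord("a") with alphabet_size=26 gives lowercase letters.
    """
    if not 1 <= alphabet_size <= 256 or base < 0 or base + alphabet_size > 256:
        raise ValueError("alphabet must fit within the byte range 0..255")
    rng = random.Random(seed)  # seeded so that a benchmark input can be regenerated
    return bytes(base + rng.randrange(alphabet_size) for _ in range(size))
```

Calling `synthetic_bytes(32 * 1024)` gives a 32 KB sample over all 256 byte values, while `synthetic_bytes(32 * 1024, alphabet_size=26, base=ord("a"))` restricts the sample to 26 characters.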
2.5.2.1 Benchmarking for ClamAV
We select 10 Windows executable files whose sizes are between 32 KB and 16 MB as the real data in the benchmark, and also test synthetic data of the same sizes. Figure 2.11 compares the execution time of the original ClamAV and its revised version. The difference in scanning time between the two versions becomes more pronounced as the file size increases. For example, the revised version is five times faster than the original when the file size is 16 MB. The acceleration comes primarily from the reduction of verification during the search. Figure 2.11 also compares the execution time for real and synthetic data in both versions. The difference between the two data types is almost unnoticeable because the character distribution in the patterns and files is close to random in ClamAV.
2.5.2.2 Benchmarking for DansGuardian
We use wget (www.gnu.org/software/wget/wget.html) to mirror an RFC Web site at asg.web.cmu.edu/rfc/rfc-index.html that contains more than 8,000 files, including HTML files and ordinary text files. DansGuardian’s content filtering function scans these files. Figure 2.12 shows that the original implementation takes 2128 seconds to mirror the entire site, while the revised implementation takes 1708 seconds. The acceleration is modest because the verification algorithm has to check every possible match that has the same hash value. The filtering part of the search therefore accounts for a smaller share of the total time, and so does its acceleration.

Figure 2.11: The performance improvement for both random and real data in the revised version of ClamAV.
Figure 2.12: The performance improvement for both random and real data in the revised version of DansGuardian. (C denotes |Σ|.)
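As a quick check of the mirroring times reported for DansGuardian, the speedup and the relative time saving follow directly from the two measurements (2128 s and 1708 s); no other data are assumed.

```latex
\[
\text{speedup} = \frac{2128}{1708} \approx 1.25,
\qquad
\text{time saved} = \frac{2128 - 1708}{2128} = \frac{420}{2128} \approx 19.7\%.
\]
```

A saving of roughly one fifth of the total time is far smaller than the fivefold speedup observed for ClamAV on 16 MB files.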
We also generate synthetic Web pages of the same sizes for comparison with the real Web pages. First, we generate data from a character set of 256 characters. Execution on the synthetic data is faster than on the real data, because the character distribution of the synthetic data is close to uniform, whereas that of the real data is biased toward English characters. Because the real data concentrate on English characters, the effective character set is small. More possible matches occur, and more verification is required, than for a uniformly distributed character set.
A character set of only 26 characters is also tested. The number of possible matches increases, and the processing time of content inspection is three times longer than in the previous experiment. However, processing in this case is still much faster than for the real data, because keywords are more likely to appear in real data than in randomly generated synthetic data, so the real data incur more possible matches and more verification.
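The effect of the character set size on the number of possible matches can be estimated with a simplified model. The model is an assumption made here for illustration: it considers a single q-gram of uniformly distributed text compared against one fixed q-gram, and it ignores hashing collisions and the number of patterns, so it describes the trend rather than the measured running times.

```latex
\[
P(\text{a random } q\text{-gram equals a given } q\text{-gram}) = C^{-q},
\qquad C = |\Sigma| .
\]
\[
q = 2:\quad
C = 256 \;\Rightarrow\; P = \frac{1}{65536} \approx 1.5 \times 10^{-5},
\qquad
C = 26 \;\Rightarrow\; P = \frac{1}{676} \approx 1.5 \times 10^{-3},
\]
\[
\frac{P_{C=26}}{P_{C=256}} = \left(\frac{256}{26}\right)^{2} \approx 97 .
\]
```

Under this model, shrinking the character set from 256 to 26 characters makes a chance 2-gram match about two orders of magnitude more likely, which is consistent in direction with the increase in possible matches and verification described above.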
2.5.2.3 Benchmarking for Snort
HTTP traffic accounts for a large share of Internet traffic, so we feed HTTP traffic to Snort. Snort is configured to run in inline mode so that its throughput can be measured easily. Figure 2.13 presents the throughput results. First, we use a single client to mirror the entire site, but the acceleration is insignificant. We then increase the number of clients to five to generate more traffic, and the acceleration becomes slightly more noticeable. However, the improvement is still small because Snort inspects only the HTTP header rather than the HTTP body in most cases [NRb], so only a small portion of the traffic is inspected.
Figure 2.13: The benchmarking result of Snort. [Bar chart of throughput (Mbits/s) for 1 client and 5 clients; series: Snort (original) and Snort (revised); bar labels as extracted: 48.6121, 27.0067, 50.979, 27.6654.]