5.4 Experiments
5.4.3 Practical Considerations in Deployment
As the number of categories to be classified increases, ambiguity between the categories may also increase. In our opinion, Web content filtering is best performed only at the edge devices, for performance reasons. Such edge devices usually need to ban fewer categories, so the problem of a growing number of categories is not that serious.
The two thresholds, Tbypass and Tblock, can be tuned to trade accuracy against efficiency. Decreasing Tbypass or increasing Tblock raises accuracy at the cost of efficiency; increasing Tbypass or decreasing Tblock raises efficiency at the cost of accuracy. The tuning depends on which matters more to an organization. For example, if the Web filter is overloaded, it may trade a little accuracy for efficiency to avoid overburdening the system; otherwise, it can adjust the two parameters for better accuracy.
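The effect of the two thresholds can be illustrated with a minimal sketch. It assumes, hypothetically, that the classifier keeps a running probability that the page is banned, updated from per-token log-odds weights. The function name `early_decision`, the weights, the logistic update, and the 0.5 fallback at the end of the content are all illustrative assumptions, not the scoring model used in the experiments. The page is passed as soon as the probability drops below Tbypass and blocked as soon as it exceeds Tblock.

```python
import math
from typing import Iterable, Tuple


def _sigmoid(log_odds: float) -> float:
    """Map log-odds to a probability, clamped to avoid overflow in exp()."""
    clamped = max(min(log_odds, 50.0), -50.0)
    return 1.0 / (1.0 + math.exp(-clamped))


def early_decision(
    token_log_odds: Iterable[float],
    t_bypass: float,
    t_block: float,
    prior_log_odds: float = 0.0,
) -> Tuple[str, int]:
    """Return (decision, tokens_scanned) for one page.

    token_log_odds is a hypothetical stream of per-token evidence:
    positive values favour "banned", negative values favour "allowed".
    """
    if not 0.0 <= t_bypass < t_block <= 1.0:
        raise ValueError("require 0 <= t_bypass < t_block <= 1")

    log_odds = prior_log_odds
    scanned = 0
    for weight in token_log_odds:
        log_odds += weight
        scanned += 1
        p_banned = _sigmoid(log_odds)
        if p_banned > t_block:   # confident enough to block early
            return "block", scanned
        if p_banned < t_bypass:  # confident enough to pass early
            return "pass", scanned

    # No early decision: fall back to a plain 0.5 cut (an assumption).
    return ("block" if _sigmoid(log_odds) > 0.5 else "pass"), scanned
```

In this sketch, raising `t_block` from 0.9 to 0.99 on a stream of identical positive weights makes the scan run longer before blocking, and lowering `t_bypass` has the same effect on the pass side, which mirrors the accuracy/efficiency trade-off described above.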
Even though the early decision algorithm significantly speeds up the filtering decision, we believe that it should complement other Web filtering approaches, especially URL blocking, rather than replace them. First, URL blocking is faster than content analysis because a URL has far fewer characters to process than the Web content. Moreover, if a banned URL is blocked, no network bandwidth is wasted on downloading the banned content. However, as discussed in Section 5.2.1, content analysis is still needed to catch banned content that URL blocking misses, and the early decision algorithm can accelerate this part significantly. Second, Web content may contain images, video, Flash objects, Java applets, and so on, which are non-trivial to analyze. Analyzing these objects is beyond the scope of this work, although it would help increase the accuracy of filtering the Web content.
In summary, a Web filtering system can support various approaches in practice. The system first blocks URLs according to a constantly maintained database of banned URLs. To reduce false negatives due to an outdated database, content analysis can catch banned content whose source is not in the URL database. The early decision algorithm speeds up content analysis to reduce the latency perceived by the user and to increase the system throughput.
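The ordering of the two stages can be sketched as follows. This is only an illustration under stated assumptions: `normalize_url`, `filter_request`, and the `fetch` and `classify` callables are hypothetical names, and the URL database is modelled as an in-memory set of normalized host-plus-path strings rather than any real product's format.

```python
from typing import Callable, Set
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a lower-case host plus path (an assumed key format)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host + path


def filter_request(
    url: str,
    banned_urls: Set[str],
    fetch: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    """Return "block" or "pass" for one request.

    Stage 1: URL blocking, which needs no download at all.
    Stage 2: content analysis (for example, with early decision),
             applied only to pages whose URL is not in the database.
    """
    if normalize_url(url) in banned_urls:
        return "block"          # banned content is never fetched
    content = fetch(url)        # bandwidth is spent only from here on
    return classify(content)
```

Because the URL check comes first, the `fetch` callable is never invoked for a URL that is already in the database, which is the bandwidth saving noted earlier.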
Although analyzing other types of objects in the Web content, such as images, could increase the accuracy, doing so currently trades processing effort for that accuracy. It is up to the user to evaluate whether turning on such analysis is worthwhile.
5.5 Conclusion
This work addresses the potentially long delay of text classification algorithms that perform run-time content analysis in Web filtering. We present an early decision algorithm that decides to either block or pass the content as early as possible. A significant performance improvement is observed: the throughput increases by about five times for banned content and nearly four times for allowed content, while the accuracy remains fairly good. In the F1 measure, the accuracy is about 89% for filtering banned content and about 93% for allowed content.
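For reference, the F1 measure is conventionally defined as the harmonic mean of precision $P$ and recall $R$. The symbols $TP$, $FP$, and $FN$ (true positives, false positives, and false negatives for the class in question) are notation introduced here for illustration:

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}.
\]
```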
The early decision algorithm is simple but effective. The rationale behind this algorithm can also be applied to other content filtering applications, such as anti-spam. The algorithm can be combined with features other than keywords from the text to further increase the overall accuracy of the content filter. In addition, the filtering can be further accelerated by combining the URL-based method with cached results. That is, by caching the decisions on the URLs of the filtered Web pages, duplicate filtering of the same Web page can be avoided: content analysis can be skipped if the cached URL is matched. The maintenance of the URL database is also facilitated.
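A minimal sketch of such a decision cache is given below. The class name `DecisionCache`, the least-recently-used eviction policy, and the default capacity are assumptions made for illustration; the text does not prescribe a particular cache design.

```python
from collections import OrderedDict
from typing import Callable, Optional


class DecisionCache:
    """Bounded URL -> decision cache with least-recently-used eviction."""

    def __init__(self, capacity: int = 1024) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, url: str) -> Optional[str]:
        decision = self._entries.get(url)
        if decision is not None:
            self._entries.move_to_end(url)  # mark as recently used
        return decision

    def put(self, url: str, decision: str) -> None:
        self._entries[url] = decision
        self._entries.move_to_end(url)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the oldest entry


def filter_with_cache(
    url: str, cache: DecisionCache, analyze: Callable[[str], str]
) -> str:
    """Skip content analysis when a decision for this URL is cached."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    decision = analyze(url)  # the expensive content analysis
    cache.put(url, decision)
    return decision
```

A cached decision can become stale if the page behind a URL changes, so a deployment along these lines would probably also need an expiry or revalidation policy, which this sketch leaves out.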
CHAPTER 6
Conclusions and Future Work
To accelerate deep packet inspection, we review existing string matching algorithms and profile their performance on various DPI applications. From the study, we make two major observations:
1. Verification frequency, memory access (and thus cache locality), and shift distance are the three major factors that affect the performance of string matching.
2. String matching in intrusion detection is not as critical as the literature suggests, because in practice it is common that only part of the traffic, say the HTTP requests, is scanned. Therefore, we focus more on anti-virus, where improving string matching is significant to performance and the size of the pattern set challenges the scalability of a design.
Based on these observations, we design a hardware architecture, namely BFAST, and a hybrid algorithm. The BFAST architecture exploits an algorithmic heuristic with Bloom filters to scan the content in sub-linear time, so that the design does not rely solely on hardware parallelism or high frequency for acceleration. The architecture also avoids some practical hurdles with the bad-block heuristic, and it includes a method that achieves linear time in the worst case. The throughput of the design is up to 9.34 Gbps for 16,384 patterns and a block size of b = 8. In the hybrid algorithm, we separate the pattern set by length so that the backward hashing (BH) algorithm, which is good at long patterns, scans for only the long patterns. The Aho-Corasick algorithm then handles only a relatively small set of short patterns, and the reduced automaton improves the performance due to good cache locality. The overall performance is three times faster than the original implementation in ClamAV.
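The partitioning step of the hybrid algorithm can be sketched as follows. This is an illustration only: `build_hybrid`, `naive_matcher`, and the `length_cutoff` parameter are hypothetical names, the cutoff value is not taken from the dissertation, and a naive search stands in for both the BH and the Aho-Corasick matchers so that the example stays short and self-contained.

```python
from typing import Callable, Iterable, List, Tuple

Match = Tuple[int, bytes]                  # (start offset, matched pattern)
Matcher = Callable[[bytes], List[Match]]
MatcherFactory = Callable[[Iterable[bytes]], Matcher]


def naive_matcher(patterns: Iterable[bytes]) -> Matcher:
    """Stand-in for a real multi-pattern matcher such as BH or Aho-Corasick."""
    pats = sorted({p for p in patterns if p})  # drop empty patterns

    def scan(text: bytes) -> List[Match]:
        hits: List[Match] = []
        for p in pats:
            start = text.find(p)
            while start != -1:
                hits.append((start, p))
                start = text.find(p, start + 1)
        return hits

    return scan


def build_hybrid(
    patterns: Iterable[bytes],
    length_cutoff: int,
    long_factory: MatcherFactory = naive_matcher,
    short_factory: MatcherFactory = naive_matcher,
) -> Matcher:
    """Split the pattern set by length and give each part its own matcher."""
    pats = list(patterns)
    scan_long = long_factory([p for p in pats if len(p) >= length_cutoff])
    scan_short = short_factory([p for p in pats if len(p) < length_cutoff])

    def scan(text: bytes) -> List[Match]:
        # Both matchers see the same text; their results are merged.
        return sorted(scan_long(text) + scan_short(text))

    return scan
```

In a real implementation, `long_factory` would build the BH structures and `short_factory` the reduced Aho-Corasick automaton; the point of the split is that each matcher only ever sees the patterns it handles well.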
We also propose a probabilistic approach, namely the early decision algorithm, to accelerate classification in Web filtering. The algorithm can make the classification decision early, before scanning the entire content, for both allowed and banned Web pages. The thresholds for passing and blocking a Web page are also easily configured because they are directly associated with the accuracy. The algorithm increases the throughput by around five times for banned content and nearly four times for allowed content, while the accuracy remains fairly good: about 89% for filtering banned content and about 93% for allowed content.
Some practical issues in the dissertation remain for future study:
1. Packet processing as a whole involves more than string matching for DPI. Although the throughput of scanning a buffer could be up to several gigabits per second, the other stages, e.g., loading data into the buffer, could become a bottleneck. A complete solution is needed in addition to accelerating string matching.
2. String matching with an automaton approach is still a bottleneck in software implementation. Although many hardware accelerators can accelerate automaton tracking significantly, they are not applicable to a software implementation, which cannot exploit the parallelism available in a hardware solution.
3. The packet content in applications such as spam filtering is structured, and the patterns are significant only in a certain context. Therefore, parsing the content to learn the contextual information, instead of merely scanning the text stream for the patterns, should also be accelerated in the future.