Accelerating web content filtering by the early decision algorithm


IEICE TRANS. INF. & SYST., VOL.E91-D, NO.2 FEBRUARY 2008

PAPER

Po-Ching LIN†a), Ming-Dao LIU†b), Nonmembers, Ying-Dar LIN††c), Member, and Yuan-Cheng LAI†††d), Nonmember

SUMMARY Real-time content analysis is typically a bottleneck in Web filtering. To accelerate the filtering process, this work presents a simple, but effective early decision algorithm that analyzes only part of the Web content. This algorithm can make the filtering decision, either to block or to pass the Web content, as soon as it is confident with a high probability that the content really belongs to a banned or an allowed category. Experiments show the algorithm needs to examine only around one-fourth of the Web content on average, while the accuracy remains fairly good: 89% for the banned content and 93% for the allowed content. This algorithm can complement other Web filtering approaches, such as URL blocking, to filter the Web content with high accuracy and efficiency. Text classification algorithms in other applications can also follow the principle of early decision to accelerate their applications.
keywords: Web filtering, text classification, World Wide Web, early decision

1. Introduction

A huge amount of Web content is widely accessible nowadays. As inappropriate content such as pornography proliferates with the growth of the World Wide Web, access control of such content is demanded in some situations. For example, an employer does not want the employees to watch stock information during working hours, or parents do not want their children to browse pornographic content. Web filtering products that enforce access control are therefore becoming popular on the market. They can be deployed either on a host computer (e.g., in a family), or on the gateway for central management in a company or an Internet service provider.

Four major approaches are generally adopted in Web filtering nowadays: Platform for Internet Content Selection (PICS), URL-based, keyword-based and content analysis [1]. According to a recent review of up-to-date Internet filters, commercial products have widely adopted content analysis besides the URL-based approach [2]. Content analysis automatically classifies the Web content into a category first, and then makes the filtering decision, either to block or to pass the content, according to the management policy. The analysis generally complements the URL-based approach to relieve the effort of frequently updating the URL list and to reduce the number of false negatives due to an outdated URL database.

The efficiency of content analysis algorithms is essential due to their complexity. Slow analysis in Web filtering leads to long user response time and also degrades the throughput of Web filtering systems. We therefore focus on text classification, which remains an important and efficient approach to Web content analysis, despite the research on image content analysis for Web filtering [3], [4]. Moreover, image content analysis in Web filtering is mostly designed for pornography recognition, and is not as effective for content in other banned categories.

Numerous text classification algorithms with high accuracy have been proposed. They are designed primarily for off-line applications, such as Web categorization for catalogs hosted by Internet portals. The research on these algorithms mostly emphasizes classification accuracy, and their execution efficiency is rarely addressed. However, their efficiency deserves attention in on-line applications such as Web filtering, so that text classification will not slow down these applications significantly.

This work presents a simple, but effective early decision algorithm to accelerate Web filtering. The algorithm is based on the observation that it is possible to make the filtering decision before scanning the entire content, as soon as the content can be confirmed with a high probability to belong to a certain category. The fast decision is particularly important, since most Web content is normally allowed and should pass the filter as soon as possible.

The rest of this paper is organized as follows. Section 2 reviews related work in Web filtering. The early decision algorithm is presented in Sect. 3. Section 4 presents the accuracy and efficiency of this algorithm from experimental results, and discusses the issues of deployment in a practical environment. Section 5 finally concludes this work.

Manuscript received February 16, 2006.
Manuscript revised August 25, 2007.
†The authors are with the Department of Computer Science, National Chiao Tung University, Taiwan.
††The author is with the High-speed Lab, Department of Computer Science, National Chiao Tung University, Taiwan.
†††The author is with the Department of Information Management, National Taiwan University of Science and Technology, Taiwan.
a) E-mail: pclin@cis.nctu.edu.tw
b) E-mail: mdliu@cis.nctu.edu.tw
c) E-mail: ydlin@cis.nctu.edu.tw
d) E-mail: laiyc@cs.ntust.edu.tw
DOI: 10.1093/ietisy/e91-d.2.251
Copyright (c) 2008 The Institute of Electronics, Information and Communication Engineers

2. Related Work in Web Filtering

2.1 Approaches of Web Filtering

A Web filtering system can either block HTTP requests according to their URLs, or block the Web content using several approaches to be discussed later. The former approach maintains a large database of banned URLs. If the URL in a request is found in the database, the request is blocked. The database is frequently updated by the collaborative effort of human reviewers (who may include the users). URL blocking is very efficient in processing, and the content on the banned sites will not occupy the bandwidth of the download link. However, since Web sites on the Internet change very often and new sites grow extremely fast, the database is unlikely to keep pace with the dynamic change of Web sites. Hence the system may fail to ban some sites that should be banned.

Blocking the Web content can remedy the insufficiency of URL blocking. Several types of information can help to determine whether a Web page should be blocked. The Platform for Internet Content Selection (PICS) specification (http://www.w3.org/PICS/) allows the content publishers to rate and label Web content so that a Web filtering system can identify the category and judge the offensiveness of the content according to the PICS information. However, labeling the Web content is voluntary. Publishers of banned content may not want to label the content and let their sites be banned. Hence a system cannot rely solely on the PICS information to judge whether the content should be blocked or not.

Another simple approach is looking for keywords in the Web content. The set of offensive keywords should be carefully selected, or false positives are likely to happen. For example, using 'sex' as a keyword could possibly block Web content about sex education. Therefore, detailed content analysis is often desired rather than simply looking for keywords.

Content analysis is generally based on machine-learning methods [5]. It involves a training stage that learns the representative features from a training set of Web pages, collected off-line, in both the banned and the allowed categories. After learning, the classifier can tell the category of Web content according to the features, which may be keywords, hyperlinks or images. Most features are text-based in practice, because text classification is more efficient than image analysis.

2.2 Text Classification Algorithms

Text classification is an important part of Web content filtering. Yang et al. and Sebastiani comprehensively surveyed and compared existing text classification algorithms, such as support vector machine (SVM), k-nearest neighbor (kNN), neural network (NNet), decision tree and naive Bayesian (NB) [6], [7]. These algorithms can achieve around 80% in accuracy, where the accuracy is defined by the harmonic average of recall and precision. Recall is measured by the number of correct positive predictions divided by the number of positive examples, and precision is measured by the number of correct positive predictions divided by the number of positive predictions.

The support vector machine (SVM) represents documents as vectors in a multi-dimensional feature space. In the training stage, it finds a decision surface that separates the positive examples from the negative examples. A kernel function can map the feature space so that non-linearly separable examples become linearly separable.

The k-nearest neighbor (kNN) method selects the k training documents that are most similar to a test document (measured by the distance in the feature space), and assigns the test document to the most frequent category among the labels of those k nearest documents.

A neural network (NNet) is an adaptive system that consists of a group of interconnected artificial neurons, which change their internal states during training. Because of its 'black-box' nature, a trained neural network is not easily interpretable.

In a decision tree, an internal node represents a test on an attribute, the outgoing branches of the node are the results of the test, and a leaf node represents a category. A test document is classified by successively traversing the nodes from the root until a leaf node is reached. The decision tree is easily understandable, but it should be built carefully to avoid the over-fitting problem.

The naive Bayesian (NB) classification is based on a probabilistic model that views a document as a sequence of words. It estimates the probabilities that a test document belongs to each category, and takes the most likely category. The details are presented in Sect. 3.2.

Some work has attempted to exploit semantic and structural information, such as hyperlinks and meta-information, to improve the accuracy [8], [9]. These methods require parsing the Web content to extract such information, and may not be efficient enough for real-time Web filtering. This research direction is beyond the scope of this work.

3. The Early Decision Algorithm

The key idea of the early decision algorithm to accelerate Web filtering is that making the filtering decision is possible before scanning the entire content, as soon as the content is confirmed to really belong to a certain category with a high probability. The fast decision is particularly important for on-line filtering because most Web content is allowed and should pass the filter as soon as possible.

Among the classification algorithms in Sect. 2.2, we choose the naive Bayesian classification herein, because the score accumulated as the classifier scans a test document can be easily turned into the estimated probability that the document belongs to each category. However, we believe that other text classification algorithms can follow a similar principle of early decision to accelerate the classification.

3.1 Keyword Distribution in the Web Content

The early decision algorithm makes the filtering decision by scanning only the front part of the Web content, so the keywords in the front part should provide sufficient information about the category. To justify the feasibility, we investigated the keyword distribution in typical Web content of both the banned and the allowed categories, collected from the YAHOO directory service (http://www.yahoo.com). A keyword herein is defined as a word that has high 'information gain' [11], which measures how indicative a word is to distinguish one category from the others. The position of each keyword is normalized by the length of the Web page and represented in percentage. For example, if a keyword is the 50-th word in a 100-word Web page, its position is at 50%.

Figure 1 presents the keyword distribution. The horizontal axis represents the normalized position in the Web content, and the vertical axis represents the average probability of keyword appearance at that position. According to Fig. 1, the keywords of both the banned and the allowed categories appear frequently in the front part of the Web content. In other words, the front part of the content generally provides the clue to identify the category to which the entire content belongs. The situation is like a speech, whose topic one may know from the front part without waiting until the end. Therefore, it is feasible to make the filtering decision before scanning the entire content.

Fig. 1 The keyword distribution in the Web content of both the banned and the allowed categories.

Although the above observation is true in normal conditions, a malicious user may deliberately stuff irrelevant content in the front part of a Web page to deceive the filter. This deception is generally difficult to carry out. First, the filter strips the HTML tags, which specify attributes that are not displayed on the browser, during content analysis. If the irrelevant content is hidden inside HTML tags, it will be ignored and unable to deceive the filter. Second, if the irrelevant content is in the Web text outside the tags, it will be displayed on the browser, and will also confuse the viewer who browses the content. This deception approach will lead to a great limitation on the layout design of the Web pages.

3.2 Naive Bayesian Classification

The NB classification is divided into two stages: training and classification. In the training stage, the classifier learns the probabilistic parameters of the generative model from a set of training documents, $D=\{d_1,\ldots,d_{|D|}\}$. Each document consists of an ordered sequence of words from a vocabulary set $V=\{w_1,w_2,\ldots,w_{|V|}\}$ and is associated with some category from a set of categories $C=\{c_1,c_2,\ldots,c_{|C|}\}$. Two types of parameters are included in the model [10]: (1) $P(w_t|c_j)$: the estimated probability that word $w_t$ appears in the documents of category $c_j$ and (2) $P(c_j)$: the estimated probability of category $c_j$ in the training documents. The former parameter is derived by

$$P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N_1(w_t, d_i \mid d_i \in c_j)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_1(w_s, d_i \mid d_i \in c_j)}, \tag{1}$$

where $N_1(w_t,d_i)$ is the number of times word $w_t$ appears in document $d_i$, and the latter by

$$P(c_j) = \frac{1 + \sum_{i=1}^{|D|} N_2(d_i, c_j)}{|C| + |D|}, \tag{2}$$

where $N_2(d_i,c_j)$ is 1 if $d_i \in c_j$, or 0 otherwise. Notably, the above two equations are smoothed by Laplace smoothing to avoid estimating probabilities to be zero.

In the classification stage, the posterior probability $P(c_j|d_i)$ that a test document $d_i$ belongs to each category $c_j$ is computed. The category $c_j$ that maximizes $P(c_j|d_i)$ is the one that $d_i$ most likely belongs to. $P(c_j|d_i)$ is derived by

$$P(c_j \mid d_i) = \frac{P(c_j)\,P(d_i \mid c_j)}{P(d_i)} = \frac{P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)}{P(d_i)}, \tag{3}$$

where $w_{d_i,k}$ is the k-th word in document $d_i$. The document $d_i$ is viewed as an ordered sequence $\langle w_{d_i,1}, w_{d_i,2}, \ldots, w_{d_i,|d_i|} \rangle$, with the assumption that the probability of a word occurrence is independent of its position in the document, given the category $c_j$. Therefore, $P(d_i|c_j)$ can be written as the product of $P(w_{d_i,k}|c_j)$, for $k=1 \ldots |d_i|$. Taking the logarithm on both sides of Eq. (3) simplifies the computation of the posterior probability $P(c_j|d_i)$ from a series of multiplications to a series of additions. The computation then becomes

$$\log P(c_j \mid d_i) = \log \frac{P(c_j)}{P(d_i)} + \sum_{k=1}^{|d_i|} \log P(w_{d_i,k} \mid c_j). \tag{4}$$

Since the term $\log P(c_j)/P(d_i)$ in Eq. (4) is constant during the classification stage, only accumulating $\log P(w_{d_i,k}|c_j)$ while scanning the text is sufficient to derive $P(c_j|d_i)$. Because the logarithm is an increasing function, the category with the maximum $\log P(c_j|d_i)$ is also the one with the maximum $P(c_j|d_i)$. The score of each word $w_t \in V$ that belongs to category $c_j$ can be defined by the function $\log P(w_t|c_j)$. These scores are pre-computed for each word in the training stage and kept for the classification stage. The scores are accumulated while the content is scanned from the beginning to the end. The computation in Eq. (4) is very fast because only the addition operation is performed for each word. This is why we select NB as the basis of the early decision algorithm herein.

3.3 Keyword Extraction in the Training Stage

The classifier of the early decision algorithm is trained off-line from sample Web content in both the banned and the allowed categories. We used Rainbow, a program based on the Bow library from Carnegie Mellon University [11], to extract the keywords, i.e., the features with high information gain, from the training set. Common words (called stop words) such as 'the' and 'of' provide little help in classification (with low information gain), so they are dropped.

Automatic keyword extraction for Web content in some oriental languages is more complex. Unlike the words in western languages (e.g., English), the characters in typical CJK (Chinese, Japanese, Korean) languages have no space to delimit words, so it is unclear how many characters compose a semantic 'word'. The N-gram method is a typical means to extract keywords in these languages [12], [13]. We suggest a simple modified N-gram keyword extraction algorithm for these languages, in which Rainbow is modified to extract N-grams of various lengths as keywords [14]. Simply put, the algorithm follows two rules:

(1) If a string t is a substring of a longer string s, and every appearance of t is within an appearance of s, only the long string s is left as a keyword. The short string (i.e., t) can be eliminated without harm, because s provides more clues for classification than t does. This rule reduces redundant keywords and the false positives due to short keywords.

(2) The score of a concatenated string is defined to be the maximum score of each composite substring. For instance, $Score(st)=\max\{Score(s), Score(t)\}$, where $Score(s)$ denotes the score of string s.

3.4 The Filtering Stage

In the filtering stage, the incoming content of a Web page is scanned from the beginning. Suppose $E_{n,m}$ denotes the event that the filter has scanned n% of the content and the accumulated score has reached m. The probability that the content belongs to category $c_j$ when the event $E_{n,m}$ happens is

$$P(c_j \mid E_{n,m}) = \frac{P(E_{n,m} \mid c_j)\,P(c_j)}{P(E_{n,m} \mid c_j)\,P(c_j) + P(E_{n,m} \mid c'_j)\,P(c'_j)}. \tag{5}$$

The probabilities on the right side are estimated beforehand:

・$P(c_j)$: the probability that Web content belongs to category $c_j$. It can be estimated from the sample content, or dynamically from the actual Web traffic in a running environment.
・$P(c'_j)$: the probability that Web content does not belong to category $c_j$, so $P(c'_j)=1-P(c_j)$.
・$P(E_{n,m}|c_j)$: the probability that $E_{n,m}$ happens given that the content belongs to category $c_j$. It is estimated by the number of Web pages of $c_j$ in which $E_{n,m}$ happens, divided by the total number of Web pages of $c_j$ in the training set.
・$P(E_{n,m}|c'_j)$: estimated similarly as $P(E_{n,m}|c_j)$, except that $c_j$ is replaced by $c'_j$.

To accelerate the computation, $P(E_{n,m}|c_j)$ and $P(E_{n,m}|c'_j)$ are recorded in two two-dimensional look-up tables, indexed by the subscripts n and m, that are built in the training stage. Note that the subscripts are discrete and the tables are finite, so the accumulated score may not exactly match any subscript m in the tables; the best match is looked up in that case. The computation overhead of looking up the tables is negligible, occupying less than 0.1% of the total time in our profiling.

Figure 2 presents the pseudo-code of the early decision algorithm. Let PCEj denote $P(c_j|E_{n,m})$ for each banned category $c_j$. Two thresholds, Tbypass and Tblock, are defined for the early decision. If PCEj < Tbypass for all banned categories, the content is unlikely to belong to a banned category, and the remaining content is bypassed. On the contrary, if PCEj > Tblock for some banned category, the content is likely to belong to that category and should be blocked. The thresholds Tbypass and Tblock are set to 0.1 and 0.9, respectively, herein.

Fig. 2 The pseudo-code of the early decision algorithm.

The classifier should scan a minimum amount of the content in a Web page to avoid deciding too early from only the very front part of the content. If the content in a banned category happens to have no keywords in this part, false negatives may occur. The parameter min_scan in Fig. 2 denotes the minimum amount in percentage. We set min_scan=15 arbitrarily since it is sufficient to effectively avoid false negatives in our experiment.

4. Experiments

4.1 Performance Metrics

To measure the accuracy of the early decision algorithm, we use the F1 measure that combines the recall and the precision by taking the harmonic average of them with equal weight [15]. We also use two metrics, the average scan rate (ASR) and the throughput, defined by

$$\mathrm{ASR} = \frac{\text{Total bytes that are scanned}}{\text{Total bytes in the content}} \times 100\% \tag{6}$$

and

$$\mathrm{Throughput} = \frac{\text{Total bits in the content}}{\text{Total execution time (sec)}}, \tag{7}$$

to measure the effectiveness of acceleration. The former reflects the percentage of the Web content that is scanned in the early decision algorithm, and the latter shows the actual throughput in Web content filtering.

4.2 Experimental Results and Discussion

For the experiment, a total of 300 sample Web pages in four typically banned categories, Pornography, Game, Online Shopping and Finance, are randomly collected from the YAHOO directory service, and another 300 pages are collected from other categories to serve as the allowed content. The early decision algorithm searches the Web content with a multiple-string matching algorithm for the keywords extracted in the training stage. A sub-linear time algorithm (e.g., the Wu-Manber algorithm, which can skip characters in the text by nearly the length of the shortest keywords [16]) hardly helps the performance here because short keywords are not uncommon in natural languages. The filtering routine is implemented on Lex [17], which uses the linear-time Aho-Corasick algorithm [18], and thus its performance is independent of the keyword lengths.

Table 1 Comparison of classification accuracy in four banned categories. Here Pr denotes the precision, Re denotes the recall, and F1 denotes the F1 measure, which is the harmonic average of Pr and Re.

The accuracy of the original Bayesian classifier, which scans the entire content, is compared with that of the early decision algorithm for the four banned categories in Table 1. Among the categories in comparison, only the shopping category presents noticeable accuracy degradation, while the others retain fairly good accuracy. After a careful examination, we observed that the Web pages in the shopping category have many common words that also appear in allowed categories. Therefore, the score accumulation from keywords is slow. Lacking representative keywords reduces the accuracy if the scanned part is not long enough. We consider that the categorization should be more specific in this case so that precise keywords can be extracted.

Table 2 Average accuracy and average scan rate in the early decision algorithm.

Table 2 presents the average filtering accuracy of the content in the four banned categories (summarized from Table 1) and the allowed categories. The accuracy of both types of content with the early decision algorithm is close to that when the entire content is scanned. The speed-up is obvious because the early decision algorithm scans only 17.22% of the content in the banned categories and 26.51% in the allowed categories on average. A large portion of the Web content is bypassed in the Web filtering, and the classification time is significantly shortened.
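The filtering stage of Sect. 3.4 and the scan rate of Eq. (6) can be sketched together in a few lines. The following Python sketch is illustrative only and is not the pseudo-code of Fig. 2. The function names, the toy keyword scores and the `toy_posterior` stand-in for the look-up tables of Eq. (5) are assumptions. The scan rate is counted in words rather than bytes, and the fall-back decision when the whole page is scanned without reaching a threshold is also an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple


def early_decision(
    words: List[str],
    word_score: Dict[str, float],
    posterior: Callable[[int, float], float],
    t_bypass: float = 0.1,
    t_block: float = 0.9,
    min_scan: int = 15,
) -> Tuple[str, float]:
    """Scan words front to back; return (decision, scan rate in percent)."""
    total = len(words)
    if total == 0:
        return "pass", 0.0
    score = 0.0
    for k, word in enumerate(words, start=1):
        score += word_score.get(word, 0.0)  # non-keywords add nothing
        n = 100 * k // total                # percentage scanned so far
        if n < min_scan:
            continue                        # never decide before min_scan%
        p = posterior(n, score)             # stands in for Eq. (5)
        if p > t_block:
            return "block", 100.0 * k / total
        if p < t_bypass:
            return "pass", 100.0 * k / total
    # Assumption: with no early decision, fall back to a 0.5 cut-off.
    return ("block" if posterior(100, score) > 0.5 else "pass"), 100.0


def toy_posterior(n: int, m: float) -> float:
    """Hypothetical stand-in for the trained tables: a high score m raises
    the probability, and scanning further (n) without score lowers it."""
    return 1.0 / (1.0 + math.exp(-(m - 0.1 * n)))


# Hypothetical keyword scores and two 20-word toy pages.
toy_scores = {"casino": 1.5, "poker": 1.5, "bet": 1.5}
banned_page = ["casino", "poker", "bet"] + ["filler"] * 17
allowed_page = ["news"] * 20

banned_result = early_decision(banned_page, toy_scores, toy_posterior)
allowed_result = early_decision(allowed_page, toy_scores, toy_posterior)
print(banned_result, allowed_result)
```

With these toy inputs, the banned page is blocked as soon as min_scan is reached, and the allowed page is bypassed after a quarter of its words, so both decisions are made without scanning the rest of the page.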
False positives of allowed content may be considered unacceptable in a practical environment, in which case a high threshold Tblock is set. Lifting the threshold Tblock to 1.0 can effectively avoid false positives in the allowed categories, as shown by the high precision in Table 3. Note that lifting Tblock also leads to more false negatives in the banned categories because some banned content is unable to reach such a high threshold. Therefore, deciding a proper threshold is a tradeoff in practice.

Table 3 Accuracy in the setting of no false positives in allowed content.

Both the execution time and the throughput of the early decision algorithm are compared with those of the original Bayesian classifier to show the improvement. Both classifiers are implemented on a PC with an Intel Pentium III 700 MHz and 64 MB of RAM. Table 4 presents the comparison results of filtering both the banned and the allowed content. The results show a significant improvement in throughput, about five times higher than that of the original Bayesian classifier for banned content and nearly four times higher for allowed content.

Table 4 Comparison of the throughput of the early decision algorithm and the original Bayesian classifier.

Many commercial products and open source packages in our investigation, such as DansGuardian [19], can block a page as soon as the score accumulation reaches a given threshold configured arbitrarily by the user. In contrast, the early decision algorithm compares the threshold with the probability estimation of the classification, rather than the score itself. This approach has two advantages over that in DansGuardian. First, the two parameters, Tbypass and Tblock, have a stronger association with the accuracy than the threshold on the score in DansGuardian. Therefore, it is easier to customize the thresholds in the early decision algorithm to achieve the desired accuracy. In comparison, deciding a proper threshold in DansGuardian to get the desired accuracy takes more effort in trial and error, since the threshold provides few clues to the accuracy. Second, the early decision algorithm accelerates not only filtering blocked Web pages, but also filtering allowed pages. The advantage is particularly significant when the Web accesses are mostly to allowed content.
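To illustrate why a threshold on the probability is easier to interpret than a threshold on the raw score, the posterior of Eq. (5) can be computed directly from training counts. The Python sketch below is a minimal example under stated assumptions: the function name, the counts and the priors are hypothetical, and returning the prior when the event was never observed in training is an assumption rather than something the paper specifies.

```python
def posterior_from_counts(
    hits_c: int,
    pages_c: int,
    hits_not_c: int,
    pages_not_c: int,
    prior_c: float,
) -> float:
    """Eq. (5): P(c | E) from how often event E (n% scanned, score m)
    was observed in training pages of category c and of its complement."""
    p_e_given_c = hits_c / pages_c              # P(E | c)
    p_e_given_not_c = hits_not_c / pages_not_c  # P(E | c')
    numerator = p_e_given_c * prior_c
    denominator = numerator + p_e_given_not_c * (1.0 - prior_c)
    if denominator == 0.0:
        return prior_c  # assumption: no evidence, keep the prior
    return numerator / denominator


# Hypothetical counts: the event occurred in 240 of 300 banned training
# pages and in 15 of 300 allowed training pages.
p_low_prior = posterior_from_counts(240, 300, 15, 300, prior_c=0.2)
p_even_prior = posterior_from_counts(240, 300, 15, 300, prior_c=0.5)
print(p_low_prior, p_even_prior)
```

In this toy setting, the same accumulated score yields a posterior of about 0.8 when banned content is rare (prior 0.2) and about 0.94 when it is as common as allowed content (prior 0.5). Only the latter exceeds a Tblock of 0.9, which is the kind of distinction a fixed threshold on the raw score cannot express.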
The early decision algorithm is also implemented in the content analysis of DansGuardian by modifying its filtering code. In our testing samples, the throughput is about three times higher than that of the original version of DansGuardian. The increase primarily comes from the better criterion in the content filtering and the acceleration of filtering the allowed content. The principle of early decision can also be implemented in the content filtering process of other Web filtering products.

4.3 Practical Consideration in Deployment

As the number of categories to be classified increases, ambiguity between these categories may increase. In our opinion, the proper place to perform Web content filtering is restricted to the edge devices for performance reasons. Such edge devices usually require fewer banned categories, and thus the problem with the increasing number of categories is not that serious.

The two thresholds, Tbypass and Tblock, can be tuned according to the tradeoffs between accuracy and efficiency. The accuracy can be increased at the cost of less efficiency by decreasing Tbypass or increasing Tblock, and the efficiency can be increased at the cost of less accuracy by increasing Tbypass or decreasing Tblock. The tuning depends on which is more important for an organization: accuracy or efficiency.

Even though the early decision algorithm significantly speeds up the filtering decision, we believe that it should complement other Web filtering approaches, especially URL blocking, rather than replace them. First, URL blocking is faster than content analysis since a URL has far fewer characters to be processed than the Web content. Besides, if a banned URL is successfully blocked, no network bandwidth will be wasted to download the banned content. As discussed in Sect. 2.1, content analysis is still needed to successfully catch the banned content. The early decision can accelerate this part significantly.
Second, Web content may contain images, video, Flash objects, Java applets and so on, which are non-trivial to analyze. Analyzing these objects is beyond the scope of this paper, but it can still help to increase the accuracy of filtering the Web content.

In summary, a Web filtering system can support various approaches in practice, as many commercial products and open source packages do. The system first blocks URLs according to a constantly maintained database of banned URLs. To reduce false negatives due to an outdated database, content analysis can catch the banned content whose source is not in the URL database. The early decision algorithm can speed up content analysis to reduce the latency perceived by the user and to increase the system throughput. Although analyzing other types of objects in the Web content, such as images, could increase the accuracy, the gain must still be weighed against the extra processing effort. It is up to the user to evaluate whether turning on such an analysis is worthwhile.

5. Conclusion

This work addresses the problem of possibly long delay in text classification algorithms that perform run-time content analysis in Web filtering. We present an early decision algorithm that decides to either block or pass the content as early as possible. A significant performance improvement is observed. The throughput is about five times higher for banned content and nearly four times higher for allowed content, while the accuracy remains fairly good.

In the F1 measure, the accuracy is about 89% for the banned content, and about 93% for allowed content. The early decision algorithm is simple but effective. The same rationale behind this algorithm can be applied to other content filtering applications as well, such as anti-spam.

The algorithm can be combined with more features other than keywords from the text content to increase the overall accuracy. Besides, Web filtering can be further accelerated by combining the URL-based filtering method with caching the decisions on the URLs of the filtered Web pages. That is, the duplicate filtering of the same Web page can be skipped if its URL is matched in the cache. The maintenance of the URL database is also facilitated.

Acknowledgment

This work was supported in part by the Taiwan National Science Council's Program of Excellence in Research, and in part by grants from Cisco and Intel.

References

[1] P.Y. Lee, S.C. Hui, and A.C. Fong, "Neural networks for Web content filtering," IEEE Intelligent Systems, vol.17, no.5, pp.48-57, Sept.-Oct. 2002.
[2] Internet Filter Review, 2006, available at http://internet-filter-review.toptenreviews.com/
[3] J.Z. Wang, Integrated region-based image retrieval, Kluwer Academic Publishers, Dordrecht, Holland, pp.107-122, 2001.
[4] J.Z. Wang, WIPE: Wavelet image pornography elimination, available at http://wang.ist.psu.edu/docs/projects/wipe.html
[5] M. Hammami, Y. Chahir, and L. Chen, "WebGuard: A Web filtering engine combining textual, structural, and visual content-based analysis," IEEE Trans. Knowl. Data Eng., vol.18, no.2, pp.272-284, Feb. 2006.
[6] Y. Yang and X. Liu, "A re-examination of text categorization methods," Proc. 22nd ACM International Conference on Research and Development in Information Retrieval, SIGIR'99, pp.42-49, 1999.
[7] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., vol.34, no.1, pp.1-47, March 2002.
[8] E.J. Glover, K. Tsioutsiouliklis, S. Lawrence, D.M. Pennock, and G.W. Flake, "Using Web structure for classifying and describing Web pages," Proc. International World Wide Web Conference (WWW), pp.562-569, 2002.
[9] G. Attardi, A. Gulli, and F. Sebastiani, "Automatic Web page categorization by link and context analysis," Proc. THAI-99, First European Symp. on Telematics, Hypermedia, and Artificial Intelligence, pp.105-119, 1999.
[10] T. Mitchell, Machine learning, McGraw Hill, 1996.
[11] The Rainbow library, available at http://www-2.cs.cmu.edu/~mccallum/bow/rainbow/
[12] F. Peng and D. Schuurmans, "Combining naive Bayes and n-gram language models for text classification," Proc. 25th European Conference on Information Retrieval Research (ECIR), pp.335-350, 2003.
[13] H.H. Chen and J.C. Lee, "Identification and classification of proper nouns in Chinese texts," Proc. 16th International Conference on Computational Linguistics, pp.222-229, Aug. 1996.
[14] F.H. Huang and Y.D. Lin, Evaluating the accuracy and efficiency of a multi-language content filter, MS.c. Thesis, Department of Computer and Information Science, National Chiao Tung University, 2003.
[15] C.J. Rijsbergen, Information retrieval, Butterworths, London, 1979.
[16] S. Wu and U. Manber, "A fast algorithm for multi-pattern searching," Technical Report TR-94-17, University of Arizona, 1994.
[17] A.V. Aho and M.J. Corasick, "Efficient string matching: An aid to bibliographic search," Commun. ACM, vol.18, pp.333-340, 1975.
[18] M.E. Lesk, "Lex - A lexical analyzer generator," Comp. Sci. Tech. Rep., no.39, Bell Laboratories, 1975.
[19] DansGuardian, available at http://dansguardian.org

Po-Ching Lin received the bachelor's degree in Computer and Information Education from National Taiwan Normal University, Taipei, Taiwan in 1995, and the M.S. degree in Computer Science from National Chiao Tung University, Hsinchu, Taiwan in 2001. He is a Ph.D. candidate in Computer Science at National Chiao Tung University. His research interests include content networking, algorithm designing and embedded hardware software co-design.

Ming-Dao Liu received the bachelor's degree and the M.S. degree in Computer Science from National Chiao Tung University, Hsinchu, Taiwan in 2001 and 2003. His research interests include network security and content networking.

Ying-Dar Lin received the bachelor's degree in Computer Science and Information Engineering from National Taiwan University in 1988, and the M.S. and Ph.D. degrees in Computer Science from the University of California, Los Angeles in 1990 and 1993. He joined the faculty of the Department of Computer and Information Science since 1993. From 2005, he is the director of the graduate Institute of Network Engineering, and then the director of Computer and Network Center since 2006. He is also the founder and director of Network Benchmarking Lab since 2002. His research interests include design, analysis, implementation and benchmarking of network protocols and algorithms, wire-speed switching and routing, quality of services, network security, content networking, network processors and SoCs, and embedded hardware software co-design.

Yuan-Cheng Lai received the bachelor's degree and the M.S. degree in Computer Science and Information Engineering from National Taiwan University in 1988 and 1990, and the Ph.D. degree from Computer and Information Science, National Chiao Tung University in 1997. He joined the faculty of National Cheng-Kung University, Tainan, Taiwan in 1998. He is an associate professor in Department of Information Management, National Taiwan University of Science and Technology, Taipei, Taiwan. His research interests include high-speed networking, wireless network and network performance evaluation, Internet applications.
