5.4 Experiments
5.4.2 Experimental Results and Discussion
For the experiment, a total of 300 sample Web pages in four typically banned categories (Pornography, Game, Online-Shopping, and Finance) are randomly collected from the YAHOO directory service, and another 300 pages are collected from other categories to serve as the allowed content. The early decision algorithm searches the Web content with a multiple-string matching algorithm for the keywords extracted in the training stage. A sub-linear time algorithm (e.g., the Wu-Manber algorithm, which can skip characters in the text by nearly the length of the shortest keyword [WM94]) hardly helps the performance here because short keywords
Early_bypass ← False;
Early_block ← False;
n ← 0;
while not end of content do
    Skip stop words and HTML tags;
    Read the next keyword;
    n ← percentage of the content that has been scanned;
    if n > min_scan then {scan at least min_scan% of the content}
        m ← the accumulated score;
        for each banned category c_j do
            PEC_j ← P(E_{n,m} | c_j) at the current scanning position;
            PEC'_j ← P(E_{n,m} | c'_j) at the current scanning position;
            PCE_j ← (PEC_j P(c_j)) / (PEC_j P(c_j) + PEC'_j P(c'_j));
        end for
        if (∀c_j, PCE_j ≤ T_bypass) then
            Early_bypass ← True;
            Exit;
        end if
        if (∃c_j, PCE_j ≥ T_block) then
            Early_block ← True;
            Exit;
        end if
    end if
end while
Figure 5.2: The pseudo-code of the early decision algorithm.
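The control flow of Figure 5.2 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `early_decision`, the `(n, score)` keyword stream, and the `likelihood` callback standing in for P(E_{n,m} | c_j) and P(E_{n,m} | c'_j) are all assumptions made for the example, and P(c'_j) is taken to be 1 - P(c_j).

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical callback: likelihood(n, m, category, banned) returns
# P(E_{n,m} | c_j) when banned is True and P(E_{n,m} | c'_j) otherwise.
Likelihood = Callable[[float, float, str, bool], float]


def early_decision(
    keywords: Iterable[Tuple[float, float]],
    priors: Dict[str, float],
    likelihood: Likelihood,
    min_scan: float,
    t_bypass: float,
    t_block: float,
) -> str:
    """Return 'bypass', 'block', or 'undecided' for one page.

    keywords yields (n, score) pairs: n is the percentage of content scanned
    when a keyword is read (stop words and tags already skipped), and score
    is that keyword's contribution to the accumulated score m.
    """
    m = 0.0
    for n, score in keywords:
        m += score
        if n <= min_scan:
            continue  # scan at least min_scan% of the content first
        posteriors = []
        for cj, p_cj in priors.items():
            pec = likelihood(n, m, cj, True)
            pec_not = likelihood(n, m, cj, False)
            # Bayes' rule, with P(c'_j) assumed to be 1 - P(c_j)
            denom = pec * p_cj + pec_not * (1.0 - p_cj)
            posteriors.append(pec * p_cj / denom if denom > 0 else 0.0)
        if all(p <= t_bypass for p in posteriors):
            return "bypass"  # no banned category is plausible
        if any(p >= t_block for p in posteriors):
            return "block"   # some banned category is near certain
    return "undecided"  # caller falls back to the full-content decision
```

Returning 'undecided' when the content ends without crossing either threshold is a design choice of this sketch; the figure itself leaves that case to the surrounding classifier.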
Table 5.1: Comparison of classification accuracy in four banned categories.
Category    Original Bayesian classifier    Early decision
            Pr      Re      F1              Pr      Re      F1
Porn        1.00    .993    .996            .977    .918    .947
Game        1.00    .971    .985            .958    .819    .883
Shopping    1.00    .975    .987            .866    .750    .804
Finance     .896    1.00    .945            .964    .900    .931
are not uncommon in natural languages. The filtering routine is instead implemented with Lex [LS75], which uses the linear-time Aho-Corasick algorithm [AC75], and thus its performance is independent of the keyword lengths.
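To illustrate why a linear-time matcher does not depend on keyword length, the following is a small Aho-Corasick sketch in Python. It is not the Lex-based routine described above; the function names `build_automaton` and `find_keywords` are hypothetical, and the example keywords are chosen only for demonstration.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(keywords: List[str]):
    """Build Aho-Corasick goto, failure, and output tables."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    fail: List[int] = [0]              # failure link of each state
    out: List[List[str]] = [[]]        # keywords recognized at each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text: str, automaton) -> List[Tuple[int, str]]:
    """Return (end_index, keyword) for every match in one pass over text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i, word))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_keywords("ushers", automaton))
```

Each text character is consumed exactly once, and failure links only ever move toward the root, so the scan cost grows with the text length and the number of matches rather than with how short or long the keywords are.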
The accuracy of the original Bayesian classifier, which scans the entire content, is compared with that of the early decision algorithm for the four banned categories in Table 5.1, in which Pr denotes the precision, Re denotes the recall, and F1 denotes the F1 measure, which is the harmonic mean of Pr and Re.
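As a quick check of the F1 definition, the snippet below recomputes F1 from the Pr and Re values reported for the early decision algorithm in Table 5.1. The helper name `f1` is introduced here for illustration only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pr, Re, reported F1) for the early decision columns of Table 5.1
rows = {
    "Porn": (0.977, 0.918, 0.947),
    "Game": (0.958, 0.819, 0.883),
    "Shopping": (0.866, 0.750, 0.804),
    "Finance": (0.964, 0.900, 0.931),
}

for name, (pr, re, reported) in rows.items():
    print(f"{name}: computed {f1(pr, re):.3f}, reported {reported:.3f}")
```

The recomputed values agree with the reported F1 column to three decimal places.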
Among the categories in comparison, only the shopping category shows noticeable accuracy degradation, while the others retain fairly good accuracy. After a careful examination, we observed that the Web pages in the shopping category contain many common words that also appear in allowed categories. Therefore, the score accumulation from keywords is slow. A lack of representative keywords reduces the accuracy if the scanned part is not long enough. We believe the categorization should be more specific in this case so that more precise keywords can be extracted.
Table 5.2 presents the average filtering accuracy of the content in the four banned categories (summarized from Table 5.1) and the allowed categories. The accuracy of both types of content with the early decision algorithm is close to that when the entire content is scanned. The speed-up is obvious because the early decision algorithm scans only 17.22% of content in the banned categories
Table 5.2: Average accuracy and scan rate in the early decision algorithm.
Banned                  Allowed                 ASR         ASR
Pr      Re      F1      Pr      Re      F1      (Banned)    (Allowed)
.941    .847    .892    .947    .920    .934    17.22%      26.51%
Table 5.3: Accuracy in the setting of no false positives in allowed content.
Category    Original Bayesian classifier    Early decision
            Pr      Re      F1              Pr      Re      F1
Porn        .977    .918    .947            1.00    .773    .871
Game        .958    .819    .883            1.00    .623    .767
Shopping    .866    .750    .804            1.00    .55     .709
Finance     .964    .900    .931            1.00    .730    .843
and 26.51% in the allowed categories on average. A large portion of the Web content is bypassed, and the classification time is significantly shortened.
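The banned-content averages in Table 5.2 can be cross-checked against the early decision columns of Table 5.1. The sketch below assumes that Pr and Re are plain means over the four categories and that F1 is recomputed from the averaged Pr and Re; the text does not state the averaging method, so both are assumptions.

```python
# Early decision (Pr, Re) per banned category, copied from Table 5.1
early = {
    "Porn": (0.977, 0.918),
    "Game": (0.958, 0.819),
    "Shopping": (0.866, 0.750),
    "Finance": (0.964, 0.900),
}

avg_pr = sum(pr for pr, _ in early.values()) / len(early)
avg_re = sum(re for _, re in early.values()) / len(early)
# Assumption: F1 is recomputed from the averaged Pr and Re.
avg_f1 = 2 * avg_pr * avg_re / (avg_pr + avg_re)

print(f"Pr={avg_pr:.3f} Re={avg_re:.3f} F1={avg_f1:.3f}")
```

Under these assumptions the results match the banned-content entries of Table 5.2 (.941, .847, .892) to within rounding.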
False positives on allowed content may be considered unacceptable in a practical environment, in which case a high threshold T_block is set. Raising T_block to 1.0 effectively avoids false positives in the allowed categories, as shown by the high precision in Table 5.3. Note that raising T_block also leads to more false negatives in the banned categories because some banned content cannot reach such a high threshold. Therefore, choosing a proper threshold is a tradeoff in practice.
Both the execution time and the throughput of the early decision algorithm are compared with those of the original Bayesian classifier to demonstrate the improvement. Both classifiers are implemented on a PC with an Intel Pentium III 700 MHz processor and 64 MB of RAM. Table 5.4 presents the comparison results for filtering both the banned and the allowed content. The results show a significant improvement in throughput: about five times higher than that of the original Bayesian classifier for banned content and nearly four times higher for allowed content.
Table 5.4: Comparison of the throughput of the early decision algorithm and the original Bayesian classifier.
Algorithm                             Execution time (ms)    Throughput (Mb/s)
Original Bayesian classifier          1333.77                41.05
Early decision for banned content     241.89                 226.36
Early decision for allowed content    239.90                 156.68
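The throughput ratios quoted in the text follow directly from Table 5.4. A short sketch of the arithmetic (the dictionary keys and the `speedup` helper are names chosen for this example):

```python
# Throughput in Mb/s, copied from Table 5.4
throughput = {
    "original": 41.05,
    "early_banned": 226.36,
    "early_allowed": 156.68,
}


def speedup(key: str) -> float:
    """Throughput of a configuration relative to the original classifier."""
    return throughput[key] / throughput["original"]


print(f"banned:  {speedup('early_banned'):.2f}x")
print(f"allowed: {speedup('early_allowed'):.2f}x")
```

The ratios come out at roughly 5.5 for banned content and 3.8 for allowed content, consistent with the "about five times" and "nearly four times" figures.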
Many commercial products and open source packages in our investigation, such as DansGuardian (dansguardian.org), block a page as soon as the accumulated score reaches a threshold that the user configures arbitrarily. In contrast, the early decision algorithm compares its thresholds with the estimated probability of the classification rather than with the score itself. This approach has two advantages over that of DansGuardian. First, the two parameters, T_bypass and T_block, are more strongly associated with the accuracy than the score threshold in DansGuardian is. Therefore, it is easier to tune the thresholds in the early decision algorithm to achieve the desired accuracy. In comparison, choosing a proper threshold in DansGuardian to obtain the desired accuracy takes more trial and error, since the threshold provides few clues to the accuracy.
Second, the early decision algorithm accelerates the filtering of not only blocked Web pages but also allowed pages. The advantage is particularly significant when the Web accesses are mostly to allowed content.
The early decision algorithm is also implemented in the content analysis of DansGuardian by modifying its filtering code. In our testing samples, the throughput is about three times higher than that of the original version of DansGuardian. The increase primarily comes from the better criterion in the content filtering and the acceleration of filtering the allowed content. The principle of early decision can also be implemented in the content filtering process of other Web filtering products.