Experimental Results and Discussion - String Matching Algorithms for Accelerating Deep Packet Inspection

5.4 Experiments

5.4.2 Experimental Results and Discussion

In the experiment, a total of 300 sample Web pages in four typically banned categories (Pornography, Game, Online-Shopping and Finance) are randomly collected from the YAHOO directory service, and another 300 pages are collected from other categories to serve as the allowed content. The early decision algorithm searches the Web content with a multiple-string matching algorithm for the keywords extracted in the training stage. A sub-linear time algorithm (e.g., the Wu-Manber algorithm, which can skip characters in the text by nearly the length of the shortest keyword [WM94]) hardly helps the performance here because short keywords

Earlybypass ← False;
Earlyblock ← False;
n ← 0;
while not end of content do
    Skip stop words and the HTML tags.
    Read the next keyword;
    n ← percentage of the content that has been scanned;
    if n > min_scan then    {scanning at least min_scan% of content}
        m ← the accumulated score;
        for each banned category c_j do
            PEC_j ← P(E_{n,m} | c_j) of current scanning position;
            PEC′_j ← P(E_{n,m} | c′_j) of current scanning position;
            PCE_j ← (PEC_j · P(c_j)) / (PEC_j · P(c_j) + PEC′_j · P(c′_j));
        end for
        if (∀c_j, PCE_j ≤ T_bypass) then
            Earlybypass ← True;
            Exit;
        end if
        if (∃c_j, PCE_j ≥ T_block) then
            Earlyblock ← True;
            Exit;
        end if
    end if
end while

Figure 5.2: The pseudo-code of the early decision algorithm.

Table 5.1: Comparison of classification accuracy in four banned categories.

            Original Bayesian classifier    Early decision
Category    Pr      Re      F1              Pr      Re      F1
Porn        1.00    .993    .996            .977    .918    .947
Game        1.00    .971    .985            .958    .819    .883
Shopping    1.00    .975    .987            .866    .750    .804
Finance     .896    1.00    .945            .964    .900    .931

are not uncommon in natural languages. The filtering routine is implemented with Lex [LS75], which uses the linear-time Aho-Corasick algorithm [AC75], and thus its performance is independent of the keyword lengths.
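The loop in Figure 5.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments. The names `posterior`, `early_decision` and `toy_likelihood` are hypothetical, as is the input format (a stream of pairs holding the scanned percentage and the score of the keyword just read). The likelihood model is passed in as a callable because its form is not given in this section, and the second likelihood in Figure 5.2 is assumed to be conditioned on the complement of c_j, so its prior is taken as 1 - P(c_j).

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical likelihood model for one banned category: given the scanned
# percentage n and the accumulated score m, it returns the pair
# (P(E | category), P(E | not category)).
Likelihood = Callable[[float, float], Tuple[float, float]]


def posterior(p_e_c: float, p_e_not_c: float, prior_c: float) -> float:
    """Bayes' rule as in Figure 5.2, assuming P(not c) = 1 - P(c)."""
    numerator = p_e_c * prior_c
    denominator = numerator + p_e_not_c * (1.0 - prior_c)
    # Assumption: if both likelihoods are zero, report a posterior of 0.
    return numerator / denominator if denominator > 0 else 0.0


def early_decision(
    stream: Iterable[Tuple[float, float]],
    likelihoods: Dict[str, Likelihood],
    priors: Dict[str, float],
    min_scan: float,
    t_bypass: float,
    t_block: float,
) -> Tuple[str, float]:
    """Return (decision, scanned percentage at the decision point)."""
    m = 0.0  # accumulated keyword score
    n = 0.0  # percentage of the content scanned so far
    # Stop words and HTML tags are assumed to be skipped before this loop.
    for n, score in stream:
        m += score
        if n > min_scan:
            post = {
                c: posterior(*lik(n, m), priors[c])
                for c, lik in likelihoods.items()
            }
            if all(p <= t_bypass for p in post.values()):
                return "bypass", n
            if any(p >= t_block for p in post.values()):
                return "block", n
    return "undecided", n


def toy_likelihood(n: float, m: float) -> Tuple[float, float]:
    """Made-up model for illustration only: a high score favors the category."""
    return (0.9, 0.1) if m >= 5 else (0.1, 0.9)


decision = early_decision(
    [(5, 1), (10, 1), (20, 3), (30, 1)],
    {"porn": toy_likelihood},
    {"porn": 0.5},
    min_scan=15,
    t_bypass=0.2,
    t_block=0.8,
)
print(decision)
```

With the toy model, the score reaches 5 once 20% of the content is scanned, so the posterior crosses `t_block` and the remaining content is never read. The sketch reports "undecided" when neither threshold is crossed before the content ends; how that case is resolved is not shown in Figure 5.2.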

The accuracy of the original Bayesian classifier, which scans the entire content, is compared with that of the early decision algorithm for the four banned categories in Table 5.1, in which Pr denotes the precision, Re denotes the recall, and F1 denotes the F1 measure, which is the harmonic mean of Pr and Re.
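As a quick check of the F1 definition, the harmonic mean can be recomputed from the early decision Pr and Re values reported in Table 5.1. The function name `f1` and the dictionary layout are illustrative choices, not part of the original work.

```python
def f1(prec: float, rec: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


# Early decision (Pr, Re) pairs copied from Table 5.1.
early_decision_scores = {
    "Porn": (0.977, 0.918),
    "Game": (0.958, 0.819),
    "Shopping": (0.866, 0.750),
    "Finance": (0.964, 0.900),
}

f1_by_category = {
    name: round(f1(prec, rec), 3)
    for name, (prec, rec) in early_decision_scores.items()
}
print(f1_by_category)
```

Rounded to three decimals, the results match the F1 column of Table 5.1 for the early decision algorithm.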

Among the categories in comparison, only the shopping category shows noticeable accuracy degradation, while the others retain fairly good accuracy. After a careful examination, we observed that the Web pages in the shopping category contain many common words that also appear in allowed categories. Therefore, the score accumulation from keywords is slow. The lack of representative keywords reduces the accuracy if the scanned part is not long enough. We believe the categorization should be more specific in this case so that precise keywords can be extracted.

Table 5.2 presents the average filtering accuracy of the content in the four banned categories (summarized from Table 5.1) and the allowed categories. The accuracy of both types of content with the early decision algorithm is close to that when the entire content is scanned. The speed-up is obvious because the early decision algorithm scans only 17.22% of content in the banned categories

Table 5.2: Average accuracy and scan rate in the early decision algorithm.

Banned                  Allowed                 ASR         ASR
Pr      Re      F1      Pr      Re      F1      (Banned)    (Allowed)
.941    .847    .892    .947    .920    .934    17.22%      26.51%

Table 5.3: Accuracy in the setting of no false positives in allowed content.

            Original Bayesian classifier    Early decision
Category    Pr      Re      F1              Pr      Re      F1
Porn        .977    .918    .947            1.00    .773    .871
Game        .958    .819    .883            1.00    .623    .767
Shopping    .866    .750    .804            1.00    .55     .709
Finance     .964    .900    .931            1.00    .730    .843

and 26.51% in the allowed categories on average. A large portion of the Web content is bypassed, and the classification time is significantly shortened.
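The banned-content figures in Table 5.2 appear to be consistent with averaging the per-category early decision Pr and Re values from Table 5.1 and then recomputing F1 from those two averages. This reading is an inference from the numbers rather than something the text states, and the variable names below are illustrative.

```python
# Early decision (Pr, Re) pairs copied from Table 5.1.
early_decision_scores = [
    (0.977, 0.918),  # Porn
    (0.958, 0.819),  # Game
    (0.866, 0.750),  # Shopping
    (0.964, 0.900),  # Finance
]

avg_pr = sum(prec for prec, _ in early_decision_scores) / len(early_decision_scores)
avg_re = sum(rec for _, rec in early_decision_scores) / len(early_decision_scores)
# Assumption: F1 is taken from the averaged Pr and Re, not averaged per category.
avg_f1 = 2 * avg_pr * avg_re / (avg_pr + avg_re)

banned_row = (round(avg_pr, 3), round(avg_re, 3), round(avg_f1, 3))
print(banned_row)
```

Under that assumption the three rounded values agree with the Banned columns of Table 5.2.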

False positives on allowed content may be considered unacceptable in a practical environment, in which case a high threshold T_block is set. Raising T_block to 1.0 effectively avoids false positives in the allowed categories, as shown by the high precision in Table 5.3. Note that raising T_block also leads to more false negatives in the banned categories because some banned content cannot reach such a high threshold. Therefore, choosing a proper threshold is a tradeoff in practice.
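The tradeoff can be illustrated with a toy example. The posterior values and labels below are made up purely for illustration and are not taken from the experiments; the function name `precision_recall` is likewise hypothetical. The sketch blocks a page when its posterior reaches T_block and reports the resulting precision and recall.

```python
from typing import List, Tuple


def precision_recall(
    posteriors: List[float], is_banned: List[bool], t_block: float
) -> Tuple[float, float]:
    """Precision and recall of blocking pages whose posterior >= t_block."""
    blocked = [p >= t_block for p in posteriors]
    tp = sum(b and y for b, y in zip(blocked, is_banned))
    fp = sum(b and not y for b, y in zip(blocked, is_banned))
    fn = sum((not b) and y for b, y in zip(blocked, is_banned))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical posteriors: four banned pages followed by three allowed pages.
toy_posteriors = [1.0, 0.99, 0.95, 0.70, 0.97, 0.40, 0.10]
toy_is_banned = [True, True, True, True, False, False, False]

moderate = precision_recall(toy_posteriors, toy_is_banned, t_block=0.9)
strict = precision_recall(toy_posteriors, toy_is_banned, t_block=1.0)
print(moderate, strict)
```

In this toy data, the stricter threshold removes the single false positive but also misses most banned pages, which mirrors the direction of the change between Table 5.1 and Table 5.3.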

Both the execution time and throughput of the early decision algorithm are compared with those of the original Bayesian classifier to demonstrate the improvement. Both classifiers are implemented on a PC with an Intel Pentium III 700 MHz processor and 64 MB of RAM. Table 5.4 presents the comparison results of filtering both the banned and allowed content. The results show a significant improvement in throughput, about five times higher than that of the original Bayesian classifier for banned content and nearly four times higher for allowed content.

Table 5.4: Comparison of the throughput of the early decision algorithm and the original Bayesian classifier.

Algorithm                             Execution time (ms)    Throughput (Mb/s)
Original Bayesian classifier          1333.77                41.05
Early decision for banned content     241.89                 226.36
Early decision for allowed content    239.90                 156.68
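The speed-up factors quoted for Table 5.4 can be reproduced by dividing each early decision throughput by the throughput of the original classifier. The variable names are illustrative.

```python
# Throughput values (Mb/s) copied from Table 5.4.
original_throughput = 41.05
early_banned_throughput = 226.36
early_allowed_throughput = 156.68

speedup_banned = early_banned_throughput / original_throughput
speedup_allowed = early_allowed_throughput / original_throughput
print(round(speedup_banned, 2), round(speedup_allowed, 2))
```

The ratios come out at roughly 5.5 for banned content and 3.8 for allowed content, in line with the "about five times" and "nearly four times" stated in the text.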

Many commercial products and open source packages in our investigation, such as DansGuardian (dansguardian.org), can block a page as soon as the accumulated score reaches a threshold configured arbitrarily by the user. In contrast, the early decision algorithm compares the threshold with the probability estimate of the classification, rather than with the score itself. This approach has two advantages over that in DansGuardian. First, the two parameters, T_bypass and T_block, are more strongly associated with the accuracy than the score threshold in DansGuardian. Therefore, it is easier to customize the thresholds in the early decision algorithm to achieve the desired accuracy. In comparison, finding a proper threshold in DansGuardian to achieve the desired accuracy takes more trial and error, since the threshold provides few clues about the accuracy.

Second, the early decision algorithm accelerates the filtering of not only blocked Web pages but also allowed pages. The advantage is particularly significant when most Web accesses are to allowed content.

The early decision algorithm is also implemented in the content analysis of DansGuardian by modifying its filtering code. In our testing samples, the throughput is about three times higher than that of the original version of DansGuardian. The increase comes primarily from the better criterion in the content filtering and the acceleration of filtering the allowed content. The principle of early decision can also be applied to the content filtering process in other Web filtering products.
