

3.3 Learning Model Comparison

Machine learning algorithms can be roughly divided into two major categories: unsupervised and supervised learning. Unsupervised learning is distinguished from supervised learning in that the learner is given only unlabeled samples. It is closely related to the problem of density estimation in statistics, but is intuitively not suitable for classification tasks. Thus, I focus only on supervised learning algorithms in my study.

For malicious web detection, several machine learning algorithms have been employed frequently in recent years. In 2009, Likarish, Jung, and Jo [25] employed Naïve Bayes, Alternating Decision Tree (ADTree), Support Vector Machine (SVM), and the RIPPER rule learner to detect obfuscated malicious JavaScript in web pages. All of these classifiers are available as part of the Java-based open source machine learning toolkit Weka [26].

Classifier     Precision   Recall   F2      NPP
Naïve Bayes    0.808       0.659    0.685   0.996
ADTree         0.891       0.732    0.757   0.997
SVM            0.920       0.742    0.764   0.997
RIPPER         0.882       0.787    0.806   0.997

Table 3.2: Learning algorithm comparison made by Likarish et al.

In order to compare the effectiveness of those learning algorithms, they extracted the same features from the training samples for each learning model, trained each model, and then classified the testing samples. The experimental results are listed in Table 3.2, in which the fields are defined as below (a small computational sketch follows the list):

• Precision:

The ratio of (malicious scripts labeled correctly)/(all scripts that are labeled as malicious)

• Recall:

The ratio of (malicious scripts labeled correctly)/(all malicious scripts)

• F2-score:

The F2-score combines precision and recall, valuing recall twice as much as precision.

• Negative predictive power (NPP):

The ratio of (benign scripts labeled correctly)/(all benign scripts)
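To make these definitions concrete, the following Python sketch (my own illustration, not code from the cited work) computes the four fields from raw confusion-matrix counts, where tp, fp, tn, and fn abbreviate true/false positives and negatives:

def detection_metrics(tp, fp, tn, fn, beta=2.0):
    # Precision: malicious scripts labeled correctly / all scripts labeled malicious.
    precision = tp / (tp + fp)
    # Recall: malicious scripts labeled correctly / all malicious scripts.
    recall = tp / (tp + fn)
    # F-beta score; beta = 2 weights recall more heavily than precision.
    f2 = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    # NPP as defined above: benign scripts labeled correctly / all benign scripts.
    npp = tn / (tn + fp)
    return precision, recall, f2, npp

# Hypothetical counts, for illustration only:
print(detection_metrics(tp=74, fp=8, tn=990, fn=26))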

Likewise, Hou et al. [24] made a similar comparison in their research in 2010. The learning algorithms they used were Naïve Bayes, Decision Tree, Support Vector Machine, and Boosted Decision Tree. The results are shown in Table 3.3, where the desired false positive rate was set below 1% for the left pair of FP and TP columns and below 10% for the right pair.

                        desired FP < 1%     desired FP < 10%
Algorithm               FP(%)    TP(%)      FP(%)    TP(%)
Naïve Bayes             1.87     68.75      9.6      84.60
Decision Tree           0.73     73.29      6.5      90.90
SVM                     0.73     73.30      9.9      86.36
Boosted Decision Tree   0.21     85.20      7.7      92.60

Table 3.3: Learning algorithm comparison made by Hou et al.

In their comparisons, the RIPPER rule learner and the Boosted Decision Tree are both variations of the Decision Tree. Generally speaking, Decision Tree algorithms have the following disadvantages [26]:

• The tree structure can be unstable; small variations in the data can produce very different trees.

• Some Decision Tree models generated may be very large and complex.

• Decision Tree models are not very good at estimation tasks.

• Decision Tree models are computationally expensive to train.

However, the focus of my study is not on comparing these machine learning algorithms. I adopt the Support Vector Machine, which was used in both of the above studies, so that I can set my desired accuracy targets based on their results.
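As a minimal illustration of this choice, the sketch below trains an SVM on a toy feature matrix; note that it uses scikit-learn's SVC rather than the Weka classifiers from the cited works, and the feature values are invented:

import numpy as np
from sklearn.svm import SVC

# Hypothetical features: one row per script, one column per extracted feature.
X_train = np.array([[0.1, 3.0, 12.0],
                    [0.9, 41.0, 2.0],
                    [0.2, 5.0, 10.0],
                    [0.8, 37.0, 1.0]])
y_train = np.array([0, 1, 0, 1])  # 0 = benign, 1 = malicious

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
print(clf.predict([[0.85, 40.0, 3.0]]))  # a new sample resembling the malicious rows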

The machine learning approaches mentioned thus far are batch learning algorithms, which require all the training data to be prepared in advance and then process all of it at once. Whenever a new training sample needs to be merged into the learning model, the entire training process must be re-run.

The opposite of batch learning is online learning, also known as incremental learning, which can train on new samples incrementally, starting from an already trained model. It is generally believed that the efficient computation of online learning comes at the expense of accuracy.
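The operational difference can be sketched with scikit-learn (again my own illustration; neither cited work used this library). An online learner exposes an incremental update, so new samples are merged without refitting from scratch:

import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
y = np.array([0, 1, 1, 0])

online = SGDClassifier(loss="hinge")  # hinge loss yields a linear SVM-style model
online.partial_fit(X[:2], y[:2], classes=np.array([0, 1]))  # initial batch
online.partial_fit(X[2:], y[2:])  # later samples; no full retraining needed

A batch learner such as SVC has no such update method; merging the later samples would require calling fit again on all four rows.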

Ma et al. [27] leveraged online learning in their research in 2009 to identify suspicious URLs. They used the same feature extraction method as the one they proposed in another article [21] in the same year, but utilized the Confidence-Weighted (CW) algorithm instead of SVM. Figure 3.1 compares the cumulative error rates for CW and for SVM under different training datasets.

Figure 3.1: Cumulative error rates in CW and SVM
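For background (my addition; the cited papers give the full derivation), the CW algorithm maintains a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ over weight vectors and, on each example $(x_t, y_t)$, makes the smallest update in KL divergence that keeps the probability of correct classification at or above a confidence level $\eta$:

$$(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{\mathrm{KL}}\left( \mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_t, \Sigma_t) \right) \quad \text{s.t.} \quad \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\left[ y_t \, (w \cdot x_t) \ge 0 \right] \ge \eta$$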

In the figure, we can clearly see that the CW curve maintained the best cumulative error rate, a result that seems to contradict the belief that batch learning should be more accurate. However, after carefully inspecting the results, we can see that the discrepancy came from the training datasets: CW took all the data for training, while the SVM variants were trained on only a portion of the data.

Admittedly, online learning algorithms provide an efficient way to continuously process huge amounts of training data. However, they usually require more memory to maintain the extra information needed for incremental training. If a learning model can withstand the effect of rapid change for a period of time, so that re-training does not need to happen very frequently, the advantage of online learning becomes negligible.

Chapter 4

METHODOLOGY

The general principle for all types of supervised machine learning is to go through three necessary stages in both the training and classifying processes: data collection, feature extraction, and machine learning computation. In my study, I follow this principle to conduct my own experiments. Figure 4.1 shows the high-level flow diagram for the training and classifying processes in my experiment, where the solid arrows represent the training flow and the dashed arrows denote the classifying flow. The details of each stage are described in the following sections; a schematic sketch of the three stages follows the figure.

Figure 4.1: High-level flow diagram for training and classifying stages
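As a schematic sketch of these three stages (every function here is a hypothetical placeholder rather than my actual implementation, and scikit-learn's SVC again stands in for the SVM):

from sklearn.svm import SVC

def collect_data(source):
    # Stage 1: gather raw scripts or pages; canned strings stand in here.
    return ["var a = 1;", "eval(unescape('%75%72%6c'))"]

def extract_features(samples):
    # Stage 2: map each sample to a fixed-length feature vector (toy features).
    return [[len(s), s.count("(")] for s in samples]

def train_model(features, labels):
    # Stage 3: fit the learning model (an SVM in this study).
    model = SVC()
    model.fit(features, labels)
    return model

# Training flow (solid arrows in Figure 4.1):
model = train_model(extract_features(collect_data("training set")), [0, 1])

# Classifying flow (dashed arrows): the same extraction feeds the trained model.
print(model.predict(extract_features(collect_data("testing set"))))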
