

5.3 Further Analysis and Discussion

At this point, I have an SVM learning model that distinguishes malicious web pages from benign ones. However, I still want to verify whether my experiments and the corresponding results are sound, considering not only the classification accuracy but also other experiment-related factors.

Recall from the related work chapter that the studies of Ma et al. [21] and Hou et al. [24] shared the same goal as mine, and all of us used Support Vector Machines. Therefore, I specifically compare my experiments with theirs, as listed in Table 5.5.

Research            Target Content       Data Source                                        Sample Amount                       Accuracy
Ma et al. (2009)    URL                  PhishTank and Spamscatter                          Malicious: 20,500; Benign: 15,000   TP: 92.4%; FP: 0.1%
Hou et al. (2010)   DHTML                StopBadWare                                        Malicious: 176; Benign: 965         TP: 86.36%; FP: 9.9%
My study            URL + page content   URL: security vendor; pages: crawled from the Internet   Malicious: 1,198; Benign: 40,308    TP: 89.98%; FP: 8.27%

Table 5.5: Comparison between prior research and my study

Based on this comparison, I can highlight some differences between my experiments and theirs:

• Ma et al. (2009)

Their classification accuracy might look excellent, but the results could suffer from overfitting: the model might perform extremely well on their dataset yet yield poor accuracy in the real world. The data source provides the clue; they collected malicious samples from PhishTank and Spamscatter, where most of the URLs are phishing URLs. Phishing URLs tend to mimic the URLs they impersonate, so they expose particularly telling information to the machine learning process. In other words, their learning model might only work for phishing detection. Furthermore, from an attacker's point of view, altering URLs is much easier than modifying page contents. As a result, malicious URLs are believed to change more frequently than malicious page contents, and a URL-only model could become outdated very quickly. Later in this section, I conduct another experiment to demonstrate this effect.

• Hou et al. (2010)

Though my classification accuracy looks better than their SVM results, their Boosted Decision Tree still outperformed my SVM model. However, this comparison could be unfair: only 176 malicious and 965 benign pages can hardly be representative of all the malicious and benign pages on the web. Their sample dataset was obviously too small, so their experiments lacked generalization.

Therefore, my experiments should be more objective than theirs. Nevertheless, regarding the frequent change of malicious URLs mentioned above, I would still like to see whether my learning model is resistant to this effect. So I designed another experiment whose parameter values were chosen according to the previous experimental results (P = 10,000, C = 0.001, J = 60). The Week 0 dataset was taken for training, and the Week 1 to Week 3 datasets were used as testing data. In this setting, the learning model generated from Week 0 detects malicious pages in the Week 1 to Week 3 datasets; in other words, I could simulate the process of classifying future data.
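To make the protocol concrete, the sketch below shows one way such a time-split evaluation could be reproduced. It is a minimal illustration, not the actual SVMlight invocation used in this study: scikit-learn's LinearSVC stands in for SVMlight, its class_weight option loosely plays the role of the cost factor J, and the randomly generated matrices merely stand in for the real weekly feature sets.

    # Minimal sketch of the time-split evaluation: train on Week 0, test on
    # Weeks 1-3. The thesis used SVMlight; here LinearSVC approximates it and
    # class_weight={1: 60} loosely mimics the cost factor J = 60. The random
    # matrices below only stand in for the real IG Top-10,000 feature sets.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)

    def fake_week(n_benign=500, n_malicious=15, n_features=1000):
        """Placeholder for one week's feature matrix and labels (1 = malicious)."""
        X = rng.random((n_benign + n_malicious, n_features))
        y = np.array([0] * n_benign + [1] * n_malicious)
        return X, y

    def tp_fp_rates(y_true, y_pred):
        """Return (TP rate, FP rate) in percent for binary labels {0, 1}."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return 100.0 * tp / (tp + fn), 100.0 * fp / (fp + tn)

    X_train, y_train = fake_week()
    model = LinearSVC(C=0.001, class_weight={1: 60})  # C and J from the previous experiments
    model.fit(X_train, y_train)                       # train on Week 0 only

    for week in (1, 2, 3):
        X_test, y_test = fake_week()
        tpr, fpr = tp_fp_rates(y_test, model.predict(X_test))
        print(f"Week {week}: TP {tpr:.2f}%  FP {fpr:.2f}%")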

Accuracy       Week 0 (cross-validation)   Week 1   Week 2   Week 3
TP Rate (%)    89.98                       88.33    87.76    88.06
FP Rate (%)    8.27                        7.91     8.50     8.28

Table 5.6: Results of classifying the future data

Table 5.6 shows the results of the simulation: the detection accuracy in Weeks 1 to 3 was roughly the same as that from the cross-validation in Week 0. This result shows that my learning model is able to keep up with the frequent changes of malicious pages.

Moreover, in order to reproduce the experiment of Ma et al., I picked the 6,421 URL-based features from the IG Top 10,000 feature set and conducted another experiment. The C value automatically chosen by SVMlight was 0.022, and I additionally tried 0.0022 as an extra test. Because the ratio of malicious to benign samples was identical to the previous experiments, I directly adopted the same cost factor value, J = 60. Table 5.7 lists the results.
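As a rough illustration (again not the actual SVMlight invocation), such a URL-only run could be set up as sketched below; the boolean mask url_feature_mask marking the 6,421 URL-derived columns is an assumed placeholder name.

    # Hypothetical sketch of the URL-only experiment: keep just the columns of
    # the feature matrix that were derived from the URL, then retrain with the
    # two C values from the text and the same cost factor J = 60.
    from sklearn.svm import LinearSVC

    def train_url_only(X_train, y_train, url_feature_mask, C):
        """Train on URL-derived feature columns only (url_feature_mask is a boolean mask)."""
        X_url = X_train[:, url_feature_mask]
        model = LinearSVC(C=C, class_weight={1: 60})  # J = 60 as before
        model.fit(X_url, y_train)
        return model

    # Usage (placeholder names):
    # for C in (0.0022, 0.022):
    #     model = train_url_only(X_week0, y_week0, url_mask, C)
    #     ... evaluate on Weeks 1-3 exactly as in the previous sketch ...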

Penalty Factor (C)   Accuracy       Week 0 (cross-validation)   Week 1   Week 2   Week 3
0.0022               TP Rate (%)    85.83                       88.33    90.00    83.33
                     FP Rate (%)    17.49                       21.01    25.33    26.89
0.022                TP Rate (%)    75.83                       73.33    70.00    69.17
                     FP Rate (%)    7.67                        8.56     8.66     9.65

Table 5.7: Detection accuracy of URL-only learning model

Although I could not obtain accuracy as high as that reported by Ma et al., I could reproduce the trend of change and observe the accuracy decline over time. This means that a URL-only learning model can quickly become outdated, because malicious URLs change even more frequently than page contents. When I manually inspected the malicious samples correctly detected by this URL-only model, many of the pages were phishing pages, which supported my conjecture.

Chapter 6

CONCLUSION

The keys to a successful machine learning application include data collection, feature extraction, and learning model fine-tuning, and all of them are equally important. In this chapter, I briefly recap these stages and reemphasize the key points of my study. My solution is certainly not perfect, but through this research I have been inspired with more ideas for improving both web threat protection and machine learning. I also list some possible enhancements, in the hope that they can be realized as a useful product in the future.

Recap

In the data collection stage, the primary difficulty was crawling the malicious pages. I had to play some tricks on the malicious web servers so that I could successfully download the pages they hosted. Besides, correctly labeling the samples was the most important work in this stage; therefore, I manually inspected the malicious pages to prevent benign pages from being labeled as malicious and contaminating the sample dataset.
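The specific tricks are not spelled out here; purely as an illustration, one common technique is to present browser-like request headers so that a malicious server returns the same content it would serve a real victim. The header values below are assumptions, not the actual crawler configuration used in this study.

    # Illustrative only: presenting browser-like headers when crawling, so that
    # a malicious server serves the crawler the same page a victim would see.
    # The actual tricks used in this study are not specified here.
    import requests

    BROWSER_HEADERS = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 6.1; WOW64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/45.0.2454.85 Safari/537.36"),
        "Referer": "https://www.google.com/",
    }

    def fetch_page(url, timeout=10):
        """Download a page while looking like an ordinary browser visit."""
        resp = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
        resp.raise_for_status()
        return resp.text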

A good feature extraction strategy relies on domain knowledge of the field to which machine learning is applied; in my study, that domain is web threat protection. I referenced related work published in journals and conferences and also consulted domain experts at the security vendor that provided the URL data, and consequently decided to extract both URL and page content features from the sample dataset. In addition, Information Gain could filter out the most discriminative features, which helped not only reduce the complexity but also enhance the accuracy.
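As a minimal sketch of this filtering step (assuming a binary feature matrix, and using scikit-learn's mutual information, which for discrete features is the quantity usually called Information Gain), the top features could be selected as follows; the variable names are placeholders.

    # Minimal Information-Gain-style feature filtering, assuming a binary
    # feature matrix X and labels y (1 = malicious). mutual_info_classif with
    # discrete features computes the mutual information between each feature
    # and the class label; the k highest-scoring columns are kept.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def top_k_by_information_gain(X, y, k=10_000):
        """Return the column indices of the k most discriminative features."""
        ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
        return np.argsort(ig)[::-1][:k]

    # Usage (placeholder names):
    # keep = top_k_by_information_gain(X_all, y, k=10_000)
    # X_reduced = X_all[:, keep]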

While fine-tuning the SVM learning model with SVMlight, I could adjust the penalty factor (C) and the cost factor (J) to achieve the best detection accuracy. In my study, I concluded that when the 10,000 most discriminative features were used, parameters of C = 0.001 and J = 60 yielded the highest TP rate of 89.98% with an acceptable FP rate of 8.27%. Furthermore, I conducted an experiment simulating real-world conditions and showed that my learning model is resistant to the frequent changes of malicious web pages.
