

3.2 Discriminative Feature Selection

Discriminative features draw the line that separates malicious pages from benign ones, and the eyes of human domain experts are arguably the most sensitive instrument for calibrating this line.

Apart from manual evaluation by domain experts, the most effective approach to malicious web page detection is the sandbox, in which a fully functional web browser provides an environment identical to a genuine one. However, this is undoubtedly a resource-consuming process, and its performance is always a concern.

In 2008, Seifert, Welch and Komisarczuk [20] adopted a machine learning approach, the decision tree, as a pre-filter to enhance the overall throughput of their high-interaction client honeypot system, which can be regarded as a type of sandbox. Their decision tree used features (named attributes in their article) extracted from the exploit, the exploit delivery mechanism, and the way of hiding them. As Table 3.1 shows, the features they extracted from web pages were HTML elements, JavaScript functions, and properties derived from them, all of which are widely used as part of the infection chain depicted in the previous chapter.

Exploit
• Plug-ins: Count of the number of applet and object tags.
• Script Tags: Count of script tags.
• XML Processing Instructions: Count of XML processing instructions, including special XML processing instructions such as VML.

Exploit Delivery Mechanism
• Frames: Count of frames and iframes, including information about the source.
• Redirects: Indications of redirects, including response code, meta-refresh tags, and JavaScript code.
• Script Tags: Count of script tags, including information about the source.

Hiding
• Script Obfuscation: Functions and elements that indicate script obfuscation, such as encoded string values, decoding functions, etc.
• Frames: Information about the visibility and size of iframes.

Table 3.1: Attributes defined by Seifert et al.
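To make these attributes concrete, the following is a minimal sketch, not Seifert et al.'s actual implementation, of counting a few Table 3.1-style attributes from raw HTML using only Python's standard html.parser module; the feature names and the toy page are illustrative assumptions.

```python
# Illustrative sketch: count Table 3.1-style attributes from raw HTML.
from html.parser import HTMLParser

class AttributeCounter(HTMLParser):
    """Counts constructs similar to the 'Exploit' and 'Exploit Delivery
    Mechanism' attributes in Table 3.1 (hypothetical helper, not the
    authors' code)."""

    def __init__(self):
        super().__init__()
        self.counts = {"plugins": 0, "scripts": 0, "frames": 0,
                       "xml_pi": 0, "meta_refresh": 0}

    def handle_starttag(self, tag, attrs):
        if tag in ("applet", "object"):
            self.counts["plugins"] += 1
        elif tag == "script":
            self.counts["scripts"] += 1
        elif tag in ("frame", "iframe"):
            self.counts["frames"] += 1
        elif tag == "meta":
            # meta-refresh is one of the redirect indications in Table 3.1
            if any(k.lower() == "http-equiv" and (v or "").lower() == "refresh"
                   for k, v in attrs):
                self.counts["meta_refresh"] += 1

    def handle_pi(self, data):
        # XML processing instructions, e.g. <?xml ... ?> or VML namespaces
        self.counts["xml_pi"] += 1

def extract_features(html_text):
    parser = AttributeCounter()
    parser.feed(html_text)
    return parser.counts

# Example usage on a toy page
print(extract_features('<html><script src="a.js"></script>'
                       '<iframe width="0" height="0"></iframe></html>'))
```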

In their research, 5,678 instances of malicious and 16,006 instances of benign web pages were fed into the machine learning algorithm, and the generated classifier was used to classify new samples. To determine the false positive and false negative rates, they inspected the samples with the high-interaction client honeypot, which is assumed to be error-free. They ultimately obtained a false positive rate of 5.88% and a false negative rate of 46.15% for the classification method.

The false negative rate is in fact very high: about half of the malicious web pages could not be detected by their method. However, they did not rely on machine learning alone for malicious web page detection. Instead, machine learning merely played the role of a pre-filter that prioritized the input URLs for the high-interaction client honeypot. With this practice, they could maintain a detection rate as high as the high-interaction client honeypot could provide while improving the processing speed to roughly 13 times faster than before.
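The pre-filtering idea itself is simple to express. Below is a minimal sketch assuming scikit-learn is available; the feature columns, the tiny training set, and the ranking step are illustrative assumptions rather than Seifert et al.'s configuration. A decision tree is trained on labelled pages, and new pages are then ranked so the slow honeypot inspects the most suspicious ones first.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Rows are pages; columns are illustrative counts such as plug-in tags,
# script tags, and hidden iframes (hypothetical data, not the authors' set).
X_train = np.array([[0, 1, 0], [2, 5, 1], [0, 0, 0], [1, 8, 2]])
y_train = np.array([0, 1, 0, 1])  # 0 = benign, 1 = malicious

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# New, unlabelled pages queued for the high-interaction client honeypot.
X_new = np.array([[0, 2, 0], [3, 9, 1], [0, 0, 0]])
scores = clf.predict_proba(X_new)[:, 1]  # estimated probability of "malicious"

# The pre-filter visits the most suspicious pages first, so the slow but
# accurate honeypot spends its time where infections are most likely.
priority_order = np.argsort(-scores)
print(priority_order, scores)
```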

From another point of view, before a malicious web page can be browsed, the entry point must be its URL. In addition to links on an HTML document, malicious URLs can also be spread through various types of Internet media, such as email and instant messengers.

In order for attackers to host their sites, whether for exploits or for redirection mechanisms, they have to register domains. Compared with the profits gained from cybercrime, the cost of registering domains is extremely low. As a result, attackers can easily acquire many domains to reduce the chance of being detected by URL string matching. Furthermore, once they have a domain, they can also vary the path and query string parts to confuse detectors. However, this tendency itself provides clues for detection.

In 2009, instead of web page contents, Ma et al. [21] focused on suspicious URLs and proposed an approach to detecting malicious web sites by investigating solely the URLs and their corresponding information. They categorized the features gathered for URLs as either lexical or host-based:

• Lexical features:

They used a combination of features suggested by the studies of Kolari, Finin and Joshi [22] and McGrath and Gupta [23]. These properties include the length of the hostname, the length of the entire URL, and the number of dots in the URL. Additionally, they created a binary feature for each token in the hostname and in the URL path, and distinguished between tokens belonging to the hostname, the path, the top-level domain (TLD) and the primary domain name (a sketch of such lexical feature extraction follows this list).

• Host-based features:

They argued that host-based features could describe “where” malicious sites are hosted, “who” owns them, and “how” they are managed. The properties of the hosts include IP address, WHOIS, domain name and geographic properties, some of which overlap with the lexical properties of the URL.
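The lexical side of this feature set is straightforward to reproduce. The following is a minimal sketch in the spirit of Ma et al.; the exact tokenisation, delimiters and feature naming are my own assumptions rather than their published code, and the example URL is made up.

```python
from urllib.parse import urlparse
import re

def lexical_features(url):
    # Hypothetical helper: delimiters and feature names are assumptions.
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path or ""
    features = {
        "url_length": len(url),
        "hostname_length": len(host),
        "dot_count": url.count("."),
    }
    # Binary "bag of tokens", kept separate for hostname and path
    # as described in the list above.
    for token in re.split(r"[.\-_]", host):
        if token:
            features["host_token=" + token] = 1
    for token in re.split(r"[/.\-_?=&]", path):
        if token:
            features["path_token=" + token] = 1
    return features

print(lexical_features("http://login.example-payments.com/secure/update.php"))
```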

Using their approach, they claimed that the best results could reach a false positive rate of 0.1% and a false negative rate of 7.6%, which is much better than the 0.1% and 74.3% obtained by merely looking up a pre-analyzed URL blacklist. However, as I mentioned earlier, attackers can easily and rapidly change their URLs at very little cost, so a trained model can become stale very quickly. Hence, it could be risky for a web threat protection application to rely solely on classification of URL-related information.

Finally, let us return to the most fundamental part. Since the targeted data is the web page itself, the most informative features should still be extracted from the complete page content, which is capable of revealing the most comprehensive information about a web page. Accordingly, Hou et al. [24] published their research on malicious web content detection by machine learning in 2010. In their paper, they selected features from entire DHTML web pages, which include:

• Native JavaScript functions (154 features):

Count of the use of each native JavaScript function

• HTML document level (9 features):

1. Word count

2. Word count per line

3. Line count

4. Average word length

5. Null space count

6. Delimiter count

7. Distinct word count

8. Whether the script tags are symmetric

9. The size of the iframe

• Advanced features (8 features):

Count of the use of each ActiveX object
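To make the document-level features above concrete, here is a minimal sketch of how such page statistics could be computed; the precise definitions of “null space”, “delimiter” and “size of iframe” are not spelled out here, so the choices in the code are assumptions rather than Hou et al.'s definitions.

```python
import re

def document_level_features(page_source):
    # Hypothetical helper for the nine HTML document-level statistics above.
    lines = page_source.splitlines() or [""]
    words = re.findall(r"\S+", page_source)

    # Assumption: "size of iframe" = total declared width * height of iframes.
    iframe_area = sum(int(w) * int(h) for w, h in
                      re.findall(r'<iframe[^>]*?width="?(\d+)"?[^>]*?height="?(\d+)"?',
                                 page_source, re.I))

    return {
        "word_count": len(words),
        "word_count_per_line": len(words) / len(lines),
        "line_count": len(lines),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "null_space_count": len(re.findall(r"\s", page_source)),      # assumed: whitespace chars
        "delimiter_count": len(re.findall(r"[;,(){}\[\]]", page_source)),  # assumed delimiter set
        "distinct_word_count": len(set(words)),
        "script_tags_symmetric": page_source.lower().count("<script")
                                 == page_source.lower().count("</script>"),
        "iframe_size": iframe_area,
    }

print(document_level_features(
    '<html>\n<script>eval(x);</script>\n'
    '<iframe width="0" height="0" src="http://evil.example"></iframe>\n</html>'))
```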

Using all the selected features with their machine learning approach (a boosted decision tree), they obtained a true positive rate (TP) of 85.20% at a false positive rate (FP) of 0.21%, or a TP of 92.60% at an FP of 7.6%, depending on the FP tolerance they set.

All of the related studies so far concentrate on only one specific part of the complete process of browsing a web page. Since each part has its own value for discriminating malicious from normal web sites and pages, intuitively, all of these features should be taken into consideration.
