
Malicious Sample Collection

On the other hand, as is well known, malicious web pages change persistently and extremely rapidly. In order to keep up with this trend, I needed to collect malicious web pages in real time. Hence, once I obtained the malicious URL list, I wanted to download all page contents as soon as possible. Basically, if downloading a web page requires 5 seconds, 24 hours should be sufficient for around 17,000 pages. However, the reality is far from this ideal, as attackers are good at playing various tricks to obstruct our analysis processes. The following lists some difficulties I encountered while collecting malicious pages; some could be overcome by applying a simple option to curl, whilst for the others I could only drop the URL to minimize the impact on my batch crawling process.

• Target browser version

Probably because Microsoft Internet Explorer (IE) has the highest browser market share, most attacks target IE. Additionally, some attacks can only infect the victim's computer through IE-only functions, such as ActiveXObject. In order to hide the exploit from analysis, the malicious web server usually delivers different contents to different clients according to the User-Agent specified in the HTTP request headers. As a consequence, I extracted the User-Agent string generated by IE version 6:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4; InfoPath.1;
.NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

and manually filled it into the corresponding request header when crawling. In this way, curl could pretend to be IE and correctly fetch the desired page contents, as sketched below.
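For illustration only, a single page could be fetched this way through curl's -A (equivalently --user-agent) option; the URL below is merely a placeholder:

    curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" \
         -o page.html "http://example.com/suspicious"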

• Target language or region

Organized cybercrime, for example, may target the credentials of a local bank. Similarly, the malicious web server can deliver harmless contents to clients from regions other than the targeted one so as to avoid detection. The locality-related information can be revealed from either the client IP address or request headers such as Accept-Language. Due to resource limitations, I could not crawl through many IPs or Internet service providers (ISPs) in different countries, so I simply neglected this factor. As for the request headers, I handled Accept-Language in the same way as User-Agent, as sketched below.
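Again for illustration, the header can be attached with curl's -H option; the zh-TW locale below is just an example of a possible target region, and in practice it would be combined with the User-Agent setting shown earlier:

    curl -H "Accept-Language: zh-TW" -o page.html "http://example.com/suspicious"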

• Block unwelcome visitors

Malicious web servers, as well as normal web services, usually apply some anti-crawling mechanism, which blocks an unwelcome client IP if it finds that the client is trying to massively crawl its web contents. In order to hide my intention, I could not consecutively download page contents from the same host within a short time period. Instead, I simply applied the same scheduling as in the benign page crawling process: a host would be re-visited only after all the other hosts had been visited at least once. In addition, my ISP applies a dynamic IP assignment policy; whenever I change the MAC address of the network interface, my server is assigned a new IP. Hence, I could change the IP of my crawling server every day to decrease the probability of being blocked; a rough sketch of the renewal step follows.
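This sketch assumes a Linux crawling server whose uplink interface is eth0 and an ISP that indeed hands out one IP per MAC address; the interface name, the chosen MAC, and the use of dhclient are assumptions for illustration:

    ip link set dev eth0 down
    ip link set dev eth0 address 02:00:00:12:34:56   # any locally administered MAC
    ip link set dev eth0 up
    dhclient -r eth0 && dhclient eth0                # release the old lease, obtain a new IP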

• Hold network connection

Once the malicious web server finds an unwelcome client, it will deliberately hold the network connection for a long time. It may deliver a huge garbage file to the client or continuously respond with a small garbage packet right before the connection timeout is reached. Actually, I once spent about 6 hours fetching a page of merely 200 kB, and of course the content was totally useless. Before I could figure out how such servers detect unwelcome clients, the only thing I could do was forcibly terminate the connection to the malicious server if it lasted longer than a specific time period, which was set to 30 seconds in my crawling process, as sketched below.
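curl can enforce such a limit by itself; the sketch below caps the TCP handshake and the whole transfer separately, with the URL again a placeholder:

    # --connect-timeout caps the TCP handshake; --max-time caps the whole transfer at 30 seconds.
    # --max-filesize only helps when the server honestly reports a Content-Length.
    curl --connect-timeout 10 --max-time 30 --max-filesize 1048576 \
         -o page.html "http://example.com/suspicious"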

Apart from the list above, there are still many other mechanisms that attackers frequently use to obstruct the data collection process. However, overcoming these obstacles is beyond the scope of my study, and I simply adopted some straightforward methods suggested by domain experts from the web threat protection vendor.

Even with all the techniques described above, I could download only about 4,000 malicious pages per day. After filtering by Content-Type, as sketched below, only about 1,000 pages were left.
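The filter itself can be sketched with curl's %{content_type} write-out variable; keeping only text/html is an assumption here, since the exact filtering rule is not spelled out above:

    ctype=$(curl -s --max-time 30 -o page.html -w '%{content_type}' "http://example.com/suspicious")
    case "$ctype" in
        text/html*) echo "keep page.html" ;;   # an HTML page worth analyzing
        *)          rm -f page.html ;;         # executables, images, etc. are dropped
    esac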

Even so, I could still find many noises in the malicious sample dataset. For example, some malicious pages initially resided on a web hosting service but were then detected as inappropriate by the service provider, so the pages were soon removed. Afterwards, when visiting the same URL, I would receive only the page-removal announcement generated by the web hosting service provider. Such a page is totally harmless but extremely hard to filter out automatically. Therefore, to purify the malicious sample dataset, I had to manually inspect those web pages one after another. In the end, I could pick only about 200 samples each day that were malicious for sure.

Table 4.1 summarizes the weekly statistics of the malicious/benign URLs and pages collected for my experiments.

                   Week 0    Week 1    Week 2    Week 3      Total
Malicious URLs    100,213   138,650   116,759    97,419    453,041
Malicious Pages     1,198     1,322     1,205     1,080      4,805
Benign URLs       805,432   769,378   813,354   789,982  3,178,146
Benign Pages       40,308    41,356    39,896    41,086    162,646

Table 4.1: Statistics of malicious/benign URLs and pages collected

4.2 Feature Extraction

Another critical stage of supervised machine learning is feature extraction. Features are attributes, or combinations of attributes, of the samples that are able to correctly separate the samples into different classes. In other words, the distribution of feature values should be distinct among the different classes. In addition, the training samples collected should be representative of the real data, which means the training samples and the real samples should share the same distribution of feature values. These are the golden rules I needed to follow for feature extraction.

Considering when the sample data become available, as in a web browser we first have the URLs and fetch the page contents afterwards. Due to the distinct properties and representations of the data at each stage, the samples have to be processed in different ways accordingly.
