URL Feature - Applying Support Vector Machines to Detect Malicious Web Pages

A URL is the first piece of information that can reveal important clues about the maliciousness of a web page or a web site. Following the definitions introduced by Ma et al. [21,27], a URL yields lexical features directly and host-based features indirectly.

Lexical features are the textual properties of the URL itself, so no additional information is required to extract them. The reason for using lexical features is that URLs tend to “look different” from one another. Hence, including lexical features helps to capture this property methodically for classification purposes, and perhaps to infer patterns in malicious URLs.

A full URL string (without the userinfo part) follows the well-defined format shown below; each numbered portion is described in turn.

http://example.com:80/dir/file?var0=value0&var1=value1
^^^^   ^^^^^^^^^^^ ^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^
1           2      3     4                5

1. URI Scheme

For the URLs considered here, the URI scheme is either http or https, which indicates whether the SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocol is used to provide encryption and secure identification of the web server.

2. Hostname

A hostname must be a fully qualified domain name (FQDN) that resolves to at least one IP address, so that the hostname can lead clients to connect to the web server on the Internet.

3. Port

The port number can be omitted, in which case it defaults to 80 for the http scheme and 443 for the https scheme.

4. Path

The path portion follows the same concept as a file system hierarchy: the page file indicated by the URL resides in a directory relative to the root directory of the web service. However, the web service program can interpret the path in an arbitrary way as long as the page content is correctly delivered to the client.

5. Query String

A query string is composed of “var=value” pairs joined with an ampersand (&) as the delimiter. It usually appears in an HTTP request that uses the GET method, and it enables interaction between the web service and the visitor.
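To make the five portions concrete, the following is a minimal sketch that splits a URL with Python's standard `urllib.parse.urlsplit`. The function name `split_url` and the `DEFAULT_PORTS` table are illustrative assumptions, not part of the experiments described in this chapter.

```python
from urllib.parse import urlsplit

# Default ports implied by each scheme when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_url(url):
    """Split a URL into scheme, hostname, port, path and query string."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # parts.port is None when the URL carries no explicit port number.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return {
        "scheme": scheme,
        "hostname": parts.hostname or "",
        "port": port,
        "path": parts.path,
        "query": parts.query,
    }


if __name__ == "__main__":
    print(split_url("http://example.com:80/dir/file?var0=value0&var1=value1"))
```

Note that `urlsplit` keeps the leading slash in the path and drops the question mark from the query string.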

In my experiments, the URI scheme and port number were transformed directly into binary features: the URI scheme is either http or https, and the port number is either standard (80 or 443) or non-standard (otherwise). The other portions of a URL were first split into tokens at every character other than a letter or a digit. In other words, each token contained only the characters “a” to “z” (case insensitive) and “0” to “9”. In addition, I distinguished among tokens belonging to the top-level domain (TLD), the primary-level domain (the domain name registered with a registrar), and the subordinate-level domain in the hostname portion. The example below illustrates the definitions of these three levels, and it clearly shows the difference between the two occurrences of “com” in different parts of a hostname.

http://www.amazon.com.evil.com.ru/

|--| → top-level domain (TLD)
|---| → primary-level domain name
|---| → subordinate-level domain name
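The binary features and the tokenization rule described above can be sketched as follows. This is only an illustration under stated assumptions: the names `binary_features` and `tokenize` are hypothetical, an omitted port is treated as standard, and the separation of hostname tokens into TLD, primary-level and subordinate-level domains is not attempted here.

```python
import re
from urllib.parse import urlsplit

# Ports regarded as standard, as stated in the text.
STANDARD_PORTS = {80, 443}


def binary_features(url):
    """Return the two binary features derived from scheme and port."""
    parts = urlsplit(url)
    port = parts.port  # None when the URL has no explicit port
    return {
        "is_https": parts.scheme.lower() == "https",
        # Assumption: an omitted port implies the scheme default (standard).
        "nonstandard_port": port is not None and port not in STANDARD_PORTS,
    }


def tokenize(text):
    """Split text at every character that is not a letter or a digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


if __name__ == "__main__":
    print(binary_features("http://example.com:8080/dir/file"))
    print(tokenize("www.amazon.com.evil.com.ru"))
```

Lowercasing before splitting implements the case-insensitive treatment of letters; empty strings produced by leading or trailing delimiters are discarded.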

Afterwards, the tokens were further processed by bi-gram computation, which slides a window two characters wide across the token one character at a time and extracts the contents of each window as a feature. Figure 4.2 demonstrates the bi-gram computation on the string “token”, from which the items “to”, “ok”, “ke” and “en” are extracted.

Figure 4.2: Bi-gram computation
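The sliding-window computation can be written in a few lines. The sketch below uses a hypothetical helper name, `bigrams`, and assumes that tokens shorter than two characters simply yield no items.

```python
def bigrams(token):
    """Return every two-character window of token, moving one character at a time."""
    return [token[i:i + 2] for i in range(len(token) - 1)]


if __name__ == "__main__":
    print(bigrams("token"))  # ['to', 'ok', 'ke', 'en']
```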

In addition, I adopted as features several properties that are commonly used in other research, including:

• Length of the entire URL

• Length of the hostname

• Number of dots in the hostname

• Length of the path

• Number of slashes in the path

• Length of the query string

• Number of ampersands in the query string
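The length and count properties listed above could be computed as in the following sketch. The function name `length_features` and the dictionary keys are illustrative. Whether the leading slash of the path or the question mark of the query string is counted is not stated in the text, so the conventions of `urllib.parse.urlsplit` (slash included, question mark excluded) are assumed here.

```python
from urllib.parse import urlsplit


def length_features(url):
    """Compute simple length and delimiter-count features of a URL."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "hostname_dots": hostname.count("."),
        "path_length": len(parts.path),        # includes the leading "/"
        "path_slashes": parts.path.count("/"),
        "query_length": len(parts.query),      # excludes the leading "?"
        "query_ampersands": parts.query.count("&"),
    }


if __name__ == "__main__":
    print(length_features("http://example.com:80/dir/file?var0=value0&var1=value1"))
```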

On the other hand, retrieving host-based information relies on issuing queries to open services and receiving the responses over the Internet, so it is not as easy as extracting lexical features. Some host-based information, WHOIS in particular, is not necessarily open to the public. As a result, I could extract only the few host-based features listed below.

For each hostname, I could obtain:

• Number of DNS A records

• DNS time-to-live (TTL) value

• Autonomous System (AS) Number

• Real location (country code, or cc)

• Whether the country-code top-level domain (ccTLD) matches the real location

The number of DNS A records and the TTL can be obtained easily by hosting a caching-only DNS server that forwards the queries to the ISP’s DNS server. With this setup, I could get the real TTL values.

To collect AS numbers, I used the IP-to-ASN service provided by Team Cymru (http://www.team-cymru.org/Services/ip-to-asn.html), which allows me to issue a special DNS query to get the AS numbers:

$ dig +short 95.176.45.114.origin.asn.cymru.com TXT

"3462 | 114.45.0.0/16 | TW | apnic | 2008-04-18"

where the number “3462” in the first field is the AS number. (Note that a hostname can have more than one AS number, since one hostname can resolve to multiple IP addresses.) According to the AS Number Analysis Reports (http://bgp.potaroo.net/index-as.html), last updated on May 16, 2010, a total of 61,438 AS numbers have been allocated.
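The query name in the dig command is the IPv4 address with its octets reversed, followed by the zone `origin.asn.cymru.com`. The sketch below builds that name and parses a TXT answer of the form shown above without performing any network lookup. The function names `cymru_origin_query` and `parse_cymru_txt` are hypothetical, and splitting the first field on whitespace is an assumption made to tolerate answers that list several AS numbers.

```python
import ipaddress


def cymru_origin_query(ipv4):
    """Build the DNS name to query for the origin AS of an IPv4 address."""
    # IPv4Address validates the input and normalizes its text form.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return ".".join(reversed(octets)) + ".origin.asn.cymru.com"


def parse_cymru_txt(txt):
    """Parse a TXT answer such as '"3462 | 114.45.0.0/16 | TW | apnic | 2008-04-18"'."""
    fields = [f.strip() for f in txt.strip().strip('"').split("|")]
    asn_field, prefix, cc, registry, allocated = fields
    return {
        # Assumption: several AS numbers, if present, are space-separated.
        "asns": [int(a) for a in asn_field.split()],
        "prefix": prefix,
        "cc": cc,
        "registry": registry,
        "allocated": allocated,
    }


if __name__ == "__main__":
    print(cymru_origin_query("114.45.176.95"))
    print(parse_cymru_txt('"3462 | 114.45.0.0/16 | TW | apnic | 2008-04-18"'))
```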

As for the country codes, ip2nation (http://www.ip2nation.com/) publishes a database that includes 231 distinct country codes and can map an IP address to its country code (cc).
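Once the country code of the server's IP address is known, the ccTLD-match feature reduces to a string comparison. The sketch below is one possible reading of that feature, not necessarily the rule used in the experiments: the name `cctld_matches` is hypothetical, and it assumes that any hostname whose last label is not a two-letter code counts as a non-match.

```python
def cctld_matches(hostname, country_code):
    """Return True if the hostname's two-letter TLD equals the given country code."""
    # Drop a trailing root dot, then take the last label of the hostname.
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    # Assumption: only two-letter ASCII labels are treated as ccTLDs.
    if len(tld) != 2 or not (tld.isascii() and tld.isalpha()):
        return False
    return tld == country_code.lower()


if __name__ == "__main__":
    print(cctld_matches("www.amazon.com.evil.com.ru", "RU"))
    print(cctld_matches("example.com.tw", "US"))
```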
