Detecting Malicious Web Links and Identifying Their Attack Types
Joe Huang Anti-Spam Team
Cellopoint
Introduction
A great effort has been directed towards detection of malicious URLs
Blacklisting incurs no false positives, yet is effective only for known malicious URLs
With classification, discriminative feature selection is crucial
This paper, Choi et al. (2011), proposes a machine learning approach to detect
malicious URLs and identify their attack types
Framework Overview
1. Data Collection
2. Supervised Learning
3-1. Detection
3-2. Identification
Input: URL
Output: Benign URL, or Malicious URL with {Type}
This process can run as batch learning or in an interleaved manner
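The two-stage flow above (detection, then identification) can be sketched roughly as follows. The function `classify_url` and the two toy stand-in classifiers are hypothetical names introduced only for illustration; the slides do not specify an implementation, and real detectors would be trained models rather than keyword checks.

```python
from typing import Callable, FrozenSet, Tuple


def classify_url(
    url: str,
    is_malicious: Callable[[str], bool],
    attack_types: Callable[[str], FrozenSet[str]],
) -> Tuple[str, FrozenSet[str]]:
    """Run stage 3-1 (detection), then stage 3-2 (identification)."""
    if not is_malicious(url):
        # Benign URLs carry no attack type.
        return "benign", frozenset()
    # Only URLs flagged as malicious are passed to the identifier.
    return "malicious", attack_types(url)


# Toy stand-ins for trained classifiers (assumptions, not the paper's models).
def toy_detector(url: str) -> bool:
    return "login" in url or "free" in url


def toy_identifier(url: str) -> FrozenSet[str]:
    types = set()
    if "login" in url:
        types.add("phishing")
    if "free" in url:
        types.add("spam")
    return frozenset(types)
```

Usage: `classify_url("http://example.test/free-login", toy_detector, toy_identifier)` returns the label `"malicious"` together with a set of attack types, matching the "Malicious URL, {Type}" output format.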
Discriminative Features
Lexical
Link popularity
Webpage content
DNS
DNS fluxiness
Network
Lexical Features
No. Feature Type
1 Domain token count Integer
2 Path token count Integer
3 Average domain token length Real
4 Average path token length Real
5 Longest domain token length Integer
6 Longest path token length Integer
7∼9 Spam, phishing and malware SLD hit ratio Real
10 Brand name presence Binary
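A minimal sketch of how features 1 to 6 and 10 might be computed. The delimiter sets, the `BRAND_NAMES` list, and the function name `lexical_features` are assumptions made for illustration and are not taken from the slides. Features 7 to 9 are omitted because they need blacklists of known spam, phishing and malware domains.

```python
import re
from urllib.parse import urlsplit

# Hypothetical brand list; a real system would use a much larger one.
BRAND_NAMES = {"paypal", "ebay"}


def _tokens(text, delimiter_pattern):
    """Split text on the given regex and drop empty tokens."""
    return [t for t in re.split(delimiter_pattern, text) if t]


def lexical_features(url):
    parts = urlsplit(url)
    # Assumption: domain tokens are separated by '.'.
    domain_tokens = _tokens(parts.hostname or "", r"\.")
    # Assumption: path tokens are separated by / ? . = - _ &
    path = parts.path + ("?" + parts.query if parts.query else "")
    path_tokens = _tokens(path, r"[/?.=\-_&]")

    def avg(tokens):
        return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

    def longest(tokens):
        return max((len(t) for t in tokens), default=0)

    all_tokens = {t.lower() for t in domain_tokens + path_tokens}
    return {
        "domain_token_count": len(domain_tokens),
        "path_token_count": len(path_tokens),
        "avg_domain_token_length": avg(domain_tokens),
        "avg_path_token_length": avg(path_tokens),
        "longest_domain_token_length": longest(domain_tokens),
        "longest_path_token_length": longest(path_tokens),
        # Simplified check: 1 if any token equals a listed brand name.
        "brand_name_presence": int(bool(all_tokens & BRAND_NAMES)),
    }
```

For example, `lexical_features("http://www.example.com/paypal/login.php?id=1")` yields three domain tokens (`www`, `example`, `com`) and five path tokens (`paypal`, `login`, `php`, `id`, `1`).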
Link Popularity (LPOP) Features
No. Feature Type
1∼5 5 LPOPs of the URL Integer
6∼10 5 LPOPs of the domain Integer
11 Distinct domain link ratio Real
12 Max domain link ratio Real
13∼15 Spam, phishing and malware link ratio Real
The 5 LPOPs come from five search engines: AltaVista, Alltheweb, Google, Yahoo! and Ask
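Features 11 and 12 can be illustrated with a small sketch. It assumes that the distinct domain link ratio is the number of unique linking domains divided by the total number of incoming links, and that the max domain link ratio is the largest number of links from any single domain divided by the same total. These definitions are an assumed reading of the feature names and should be checked against Choi et al. (2011). The function name `domain_link_ratios` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_link_ratios(inbound_links):
    """Return (distinct_domain_link_ratio, max_domain_link_ratio)."""
    # Count how many incoming links originate from each host.
    domains = Counter(
        (urlsplit(link).hostname or "").lower() for link in inbound_links
    )
    total = sum(domains.values())
    if total == 0:
        # No incoming links: both ratios are defined as 0 here (assumption).
        return 0.0, 0.0
    distinct_ratio = len(domains) / total
    max_ratio = max(domains.values()) / total
    return distinct_ratio, max_ratio
```

Intuitively, a link farm pointing at a manipulated URL tends to produce few distinct domains and a high share from one domain, which is why such ratios can complement the raw LPOP counts.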
Webpage Content Features
No. Feature Type
1 HTML tag count Integer
2 Iframe count Integer
3 Zero size iframe count Integer
4 Line count Integer
5 Hyperlink count Integer
6∼12 Count of each suspicious JavaScript function Integer
13 Total count of suspicious JavaScript functions Integer
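The content features could be extracted along these lines with Python's standard `html.parser`. This is only a sketch: the `SUSPICIOUS_JS` tuple is an illustrative placeholder rather than the list used in the paper, "HTML tag count" is assumed to mean the number of start tags, and a zero size iframe is assumed to be one whose `width` or `height` attribute is exactly `"0"`.

```python
import re
from html.parser import HTMLParser

# Placeholder list of function names (assumption, not the paper's list).
SUSPICIOUS_JS = ("eval", "escape", "unescape", "exec", "search", "link")


class _TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = 0
        self.iframes = 0
        self.zero_size_iframes = 0
        self.hyperlinks = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        attr = dict(attrs)
        if tag == "iframe":
            self.iframes += 1
            if attr.get("width") == "0" or attr.get("height") == "0":
                self.zero_size_iframes += 1
        elif tag == "a" and "href" in attr:
            self.hyperlinks += 1


def content_features(html):
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    # Count call-like occurrences (name followed by "(") in the raw page text.
    js_counts = {
        name: len(re.findall(r"\b" + re.escape(name) + r"\s*\(", html))
        for name in SUSPICIOUS_JS
    }
    return {
        "html_tag_count": parser.tags,
        "iframe_count": parser.iframes,
        "zero_size_iframe_count": parser.zero_size_iframes,
        "line_count": len(html.splitlines()),
        "hyperlink_count": parser.hyperlinks,
        "js_function_counts": js_counts,
        "js_function_total": sum(js_counts.values()),
    }
```

The word-boundary match keeps `unescape(` from also being counted as `escape(`.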
DNS Features
No. Feature Type
1 Resolved IP count Integer
2 Name server count Integer
3 Name server IP count Integer
4 Malicious ASN ratio of resolved IPs Real
5 Malicious ASN ratio of name server IPs Real
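Features 4 and 5 can be sketched as a simple ratio, assuming a malicious ASN ratio means the fraction of IP addresses whose autonomous system number appears in a list of ASNs known to host malicious sites. The function name, the `ip_to_asn` mapping, and the `malicious_asns` set are hypothetical inputs; how the paper builds its ASN list is not stated in these slides.

```python
def malicious_asn_ratio(ips, ip_to_asn, malicious_asns):
    """Fraction of the given IPs that map to an ASN in malicious_asns."""
    if not ips:
        return 0.0
    # IPs with no known ASN map to None, which never counts as a hit.
    hits = sum(1 for ip in ips if ip_to_asn.get(ip) in malicious_asns)
    return hits / len(ips)
```

The same function serves both features: pass the resolved IPs for feature 4 and the name server IPs for feature 5.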
DNS Fluxiness Features
No. Feature Type
1∼2 ϕ of NIP, NAS Real
3∼5 ϕ of NNS, NNSIP and NNSAS Real
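A fluxiness value ϕ can be sketched as follows, assuming ϕ is the number of unique values (for example resolved IPs) seen across repeated DNS lookups divided by the number returned by a single lookup. Using the first response as the "single" lookup is an assumption of this sketch, as is the function name `fluxiness`.

```python
def fluxiness(lookups):
    """lookups: list of sets, one set of values per DNS response."""
    if not lookups or not lookups[0]:
        return 0.0
    # Unique values observed over all lookups.
    n_total = len(set().union(*lookups))
    # Values returned by a single lookup (here: the first one).
    n_single = len(lookups[0])
    return n_total / n_single
```

A domain that always resolves to the same addresses gives ϕ = 1.0, while a fast-flux domain that rotates its addresses between lookups gives a larger value.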
$\phi = N / N_{single}$
Network Features (NET)
No. Feature Type
1 Redirection count Integer
2 Downloaded bytes from content-length Real
3 Actual downloaded bytes Real
4 Domain lookup time Real
5 Average download speed Real
Data sets
Benign URLs: DMOZ and Yahoo!
Spam URLs: jwSpamSpy and webspam
Phishing URLs: PhishTank
Malware URLs: DNS-BH
Multi-label: McAfee SiteAdvisor and Web of Trust (WOT)
Multi-label Data
Label Attribute LSAd LWOT LBoth
λ1 spam 6020 6432 5835
λ2 phishing 1119 1067 899
λ3 malware 9478 8664 8105
λ1,2 spam, phishing 4076 4261 3860
λ1,3 spam, malware 2391 2541 2183
λ2,3 phishing, malware 4729 4801 4225
λ1,2,3 spam, phishing, malware 2219 2170 2080
Results - Detection Accuracy
Results - Link Popularity Feature Analysis
Link Classification
Unpopular legitimate link
LPOPs might be ineffective for links of low LPOPs
Malicious URL detection result: accuracy of 91.2%
Popularity manipulated link
LPOPs can be manipulated
Detection result: accuracy of 90.03%
Error Analysis
False positives
Disreputable URL (LPOP, LEX and DNS errors)
Contentless URL
Brand name URL
Abnormal token URL
False negatives
Hosted by popular social networking sites