Detecting Malicious Web Links and Identifying Their Attack Types

(1)

Joe Huang Anti-Spam Team

Cellopoint

(2)

Introduction

A great effort has been directed towards detection of malicious URLs

Blacklisting incurs no false positives, yet is effective only for known malicious URLs

With classification, discriminative feature selection is crucial

This paper, Choi et al. (2011), proposes a machine learning approach to detect malicious URLs and identify their attack types

(3)

Framework Overview

1. Data Collection

2. Supervised Learning

3-1. Detection

3-2. Identification

Input: URL

Output: Benign URL, or Malicious URL with {Type}

This process can run as batch learning or in an interleaved manner
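The two-stage flow above (detect first, then identify the attack types of malicious URLs) can be sketched as follows. This is a minimal skeleton, not the authors' implementation: `classify_url` and the three `toy_*` functions are hypothetical stand-ins for a feature extractor and two trained models.

```python
def classify_url(url, extract, detect, identify):
    """Run detection, then attack type identification only for malicious URLs."""
    features = extract(url)
    if not detect(features):
        return ("benign", set())
    return ("malicious", set(identify(features)))


# Hypothetical stand-ins; a real system would plug in trained classifiers here.
def toy_extract(url):
    return {"has_login": "login" in url}


def toy_detect(features):
    return features["has_login"]


def toy_identify(features):
    return {"phishing"}


print(classify_url("http://example.com/login", toy_extract, toy_detect, toy_identify))
```

Keeping the two stages as separate callables means the detector and the multi-label identifier can be trained and replaced independently.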

(4)

Discriminative Features

Lexical

Link popularity

Webpage content

DNS

DNS fluxiness

Network

(5)

Lexical Features

No. Feature Type

1 Domain token count Integer

2 Path token count Integer

3 Average domain token length Real

4 Average path token length Real

5 Longest domain token length Integer

6 Longest path token length Integer

7∼9 Spam, phishing and malware SLD hit ratio Real

10 Brand name presence Binary
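Features 1 to 6 in the table can be computed from the URL string alone. The sketch below is one possible reading: the function name `lexical_features` and the choice of delimiters for splitting tokens are assumptions, not taken from the slides.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Token counts and lengths for the domain and path parts of a URL."""
    parsed = urlparse(url)
    # Assumption: domain tokens are separated by '.', path tokens by
    # any of the characters / ? . = - _ &
    domain_tokens = [t for t in parsed.netloc.split(".") if t]
    path_tokens = [t for t in re.split(r"[/?.=\-_&]", parsed.path) if t]

    def avg_len(tokens):
        return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

    def max_len(tokens):
        return max((len(t) for t in tokens), default=0)

    return {
        "domain_token_count": len(domain_tokens),
        "path_token_count": len(path_tokens),
        "avg_domain_token_length": avg_len(domain_tokens),
        "avg_path_token_length": avg_len(path_tokens),
        "longest_domain_token_length": max_len(domain_tokens),
        "longest_path_token_length": max_len(path_tokens),
    }


print(lexical_features("http://www.example.com/path/to/index.html"))
```

The SLD hit ratios and brand name presence (features 7 to 10) need external blacklists and a brand list, so they are left out of this sketch.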

(6)

Link Popularity (LPOP) Features

No. Feature Type

1∼5 5 LPOPs of the URL Integer

6∼10 5 LPOPs of the domain Integer

11 Distinct domain link ratio Real

12 Max domain link ratio Real

13∼15 Spam, phishing and malware link ratio Real

LPOPs are taken from five search engines: AltaVista, Alltheweb, Google, Yahoo! and Ask
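The two domain link ratios (features 11 and 12) are not defined on the slide. The sketch below assumes one plausible reading: the distinct domain link ratio is the number of unique linking domains over the total number of incoming links, and the max domain link ratio is the largest number of links from any single domain over that same total. The function name and the input list of incoming link URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_link_ratios(incoming_links):
    """Return (distinct_domain_link_ratio, max_domain_link_ratio)."""
    if not incoming_links:
        return 0.0, 0.0
    # Count how many incoming links originate from each domain.
    counts = Counter(urlparse(link).netloc.lower() for link in incoming_links)
    total = len(incoming_links)
    return len(counts) / total, max(counts.values()) / total


links = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
    "http://a.example/3",
]
print(domain_link_ratios(links))
```

Under this reading, a URL whose popularity was inflated by a link farm would tend to show a low distinct ratio and a high max ratio, because most links come from a few domains.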

(7)

Webpage Content Features

No. Feature Type

1 HTML tag count Integer

2 Iframe count Integer

3 Zero size iframe count Integer

4 Line count Integer

5 Hyperlink count Integer

6∼12 Count of each suspicious JavaScript function Integer

13 Total count of suspicious JavaScript functions Integer
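The counts in this table can be approximated with Python's standard `html.parser` module. This is a rough sketch under stated assumptions: the slide does not name the seven suspicious JavaScript functions, so `suspicious_functions` is a hypothetical parameter with placeholder defaults, and a "zero size" iframe is taken to mean a `width` or `height` attribute of `0` or `0px`.

```python
import re
from html.parser import HTMLParser

ZERO_SIZES = {"0", "0px"}  # assumption: what counts as "zero size"


class ContentFeatureCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.iframe_count = 0
        self.zero_size_iframe_count = 0
        self.hyperlink_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        attr = dict(attrs)
        if tag == "iframe":
            self.iframe_count += 1
            if attr.get("width") in ZERO_SIZES or attr.get("height") in ZERO_SIZES:
                self.zero_size_iframe_count += 1
        elif tag == "a" and "href" in attr:
            self.hyperlink_count += 1


def content_features(html, suspicious_functions=("eval", "unescape")):
    counter = ContentFeatureCounter()
    counter.feed(html)
    # Count textual call sites such as "eval(" anywhere in the page source.
    js_counts = {
        name: len(re.findall(r"\b" + re.escape(name) + r"\s*\(", html))
        for name in suspicious_functions
    }
    return {
        "html_tag_count": counter.tag_count,
        "iframe_count": counter.iframe_count,
        "zero_size_iframe_count": counter.zero_size_iframe_count,
        "line_count": len(html.splitlines()),
        "hyperlink_count": counter.hyperlink_count,
        "js_function_counts": js_counts,
        "js_function_total": sum(js_counts.values()),
    }
```

Counting call sites with a regular expression is only a textual approximation; it will miss obfuscated JavaScript, which the evadability slide lists as an open issue.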

(8)

DNS Features

No. Feature Type

1 Resolved IP count Integer

2 Name server count Integer

3 Name server IP count Integer

4 Malicious ASN ratio of resolved IPs Real

5 Malicious ASN ratio of name server IPs Real
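The malicious ASN ratio (features 4 and 5) can be read as the fraction of IPs whose autonomous system number appears on a list of ASNs known to host malicious content. That reading, the function name, and the pre-resolved `ip_to_asn` mapping are assumptions for illustration; the slide does not give the exact definition.

```python
def malicious_asn_ratio(ip_to_asn, malicious_asns):
    """Fraction of IPs whose ASN is in the given set of malicious ASNs."""
    if not ip_to_asn:
        return 0.0
    hits = sum(1 for asn in ip_to_asn.values() if asn in malicious_asns)
    return hits / len(ip_to_asn)


# Documentation-range IPs and ASNs, used purely as sample data.
resolved = {"192.0.2.1": 64500, "192.0.2.2": 64501, "198.51.100.7": 64500}
print(malicious_asn_ratio(resolved, {64500}))
```

The same function applies to name server IPs by passing the mapping for those addresses instead.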

(9)

DNS Fluxiness Features

No. Feature Type

1∼2 ϕ of N_IP, N_AS Real

3∼5 ϕ of N_NS, N_NSIP and N_NSAS Real

ϕ = N / N_single, where N is each of the counts listed above (N_IP, N_AS, N_NS, N_NSIP, N_NSAS)
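A small worked example of the fluxiness ratio may help. The sketch assumes that N is the number of unique values seen across repeated DNS lookups and that N_single is the number returned by a single lookup (here, the first one). Both the function name and that interpretation are assumptions, since the slide gives only the formula.

```python
def fluxiness(lookups):
    """phi = unique values over all lookups / values in a single lookup."""
    if not lookups or not lookups[0]:
        return 0.0
    n_single = len(set(lookups[0]))
    n_total = len(set().union(*lookups))
    return n_total / n_single


# Three consecutive lookups of the same domain (documentation-range IPs).
rotating = [
    ["192.0.2.1", "192.0.2.2"],
    ["192.0.2.2", "192.0.2.3"],
    ["192.0.2.4"],
]
stable = [["192.0.2.1"], ["192.0.2.1"]]
print(fluxiness(rotating), fluxiness(stable))
```

A domain that always resolves to the same addresses gives ϕ = 1, while a fast-flux domain that rotates addresses between lookups gives a larger value.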

(10)

Network Features (NET)

No. Feature Type

1 Redirection count Integer

2 Downloaded bytes from content-length Real

3 Actual downloaded bytes Real

4 Domain lookup time Real

5 Average download speed Real

(11)

Data sets

Benign URLs: DMOZ and Yahoo!

Spam URLs: jwSpamSpy and webspam

Phishing URLs: PhishTank

Malware URLs: DNS-BH

Multi-label: McAfee SiteAdvisor and Web of Trust (WOT)

(12)

Multi-label Data

Label Attribute L_SAd L_WOT L_Both

λ₁ spam 6020 6432 5835

λ₂ phishing 1119 1067 899

λ₃ malware 9478 8664 8105

λ_1,2 spam, phishing 4076 4261 3860

λ_1,3 spam, malware 2391 2541 2183

λ_2,3 phishing, malware 4729 4801 4225

λ_1,2,3 spam, phishing, malware 2219 2170 2080

(13)

Results - Detection Accuracy

(14)

Results - Link Popularity Feature Analysis

(15)

Link Classification

Unpopular legitimate link

LPOPs might be ineffective for links with low LPOPs

Malicious URL detection result: accuracy of 91.2%

Popularity manipulated link

LPOPs can be manipulated

Detection result: accuracy of 90.03%

(16)

Error Analysis

False positives

Disreputable URL (LPOP, LEX and DNS errors)

Contentless URL

Brand name URL

Abnormal token URL

False negatives

Hosted by popular social networking sites

(17)

Attack Type Identification Metrics

Assume there is an evaluation data set of multi-label examples (x_i, Y_i)

Micro-averaged and macro-averaged metrics

Ranking-based metrics
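The difference between micro-averaging and macro-averaging is easiest to see in code. The sketch below computes precision and recall both ways for multi-label predictions, using the standard definitions; the function name and the sample data are illustrative and not taken from the paper. Ranking-based metrics are not covered here.

```python
def micro_macro_precision_recall(true_sets, pred_sets, labels):
    """Return micro and macro averaged precision and recall."""
    tp = {l: 0 for l in labels}
    fp = {l: 0 for l in labels}
    fn = {l: 0 for l in labels}
    for truth, pred in zip(true_sets, pred_sets):
        for l in labels:
            if l in pred and l in truth:
                tp[l] += 1
            elif l in pred:
                fp[l] += 1
            elif l in truth:
                fn[l] += 1

    def ratio(num, den):
        return num / den if den else 0.0

    # Micro: pool the counts over all labels, then take one ratio.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    # Macro: take one ratio per label, then average the ratios.
    n = len(labels)
    return {
        "micro_precision": ratio(tp_all, tp_all + fp_all),
        "micro_recall": ratio(tp_all, tp_all + fn_all),
        "macro_precision": sum(ratio(tp[l], tp[l] + fp[l]) for l in labels) / n,
        "macro_recall": sum(ratio(tp[l], tp[l] + fn[l]) for l in labels) / n,
    }


labels = ["spam", "phishing", "malware"]
truth = [{"spam"}, {"spam", "malware"}, {"phishing"}]
pred = [{"spam"}, {"spam"}, {"phishing", "malware"}]
print(micro_macro_precision_recall(truth, pred, labels))
```

Micro-averaging weights every label decision equally, so frequent labels dominate; macro-averaging weights every label equally, so a poorly handled rare label lowers the score more visibly.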

(18)

Attack Type Identification Results

(19)

Evadability Analysis

Robust against known evasions (redirection/link manipulation/fast-flux hosting)

URL obfuscation

JavaScript obfuscation

Social network site

(20)

Conclusion

A framework for detecting and identifying malicious URLs

Discriminative features

Evadability issue for further improvement
