### Department of Information Management, College of Management

### National Taiwan University

### Master's Thesis

### Malware Analysis with 3-Valued Deterministic Finite Tree Automata (以三值樹狀自動機為基礎之惡意程式分析)

### Yi-Hsiang Wang (王奕翔)

### Advisor: Yih-Kuen Tsay, Ph.D. (蔡益坤 博士)

### November, 2011

This thesis is submitted to the Graduate Institute of Information Management, National Taiwan University, in partial fulfillment of the requirements for the Master's degree.

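## Acknowledgements

More than two years of graduate study, over seven hundred days, are finally coming to an end. During this time, many different experiences opened my eyes and broadened my horizons. Besides thanking heaven, there are many people and things I am grateful for.

First of all, I thank my advisor, Professor Yih-Kuen Tsay (蔡益坤). From him I learned the importance of steady accumulation, and I witnessed his persistence in the pursuit of knowledge. I also thank everyone in the lab. 明憲, your strength needs no elaboration; even after getting married you manage both research and family, and I cannot match that. 晉碩 and 怡文, thank you for helping me understand the algorithms implemented in GOAL, and I wish you a long life together. 智斌, I learned a lot while developing the website, which was an interesting experience, and I hope you can start your business smoothly. 睿元, you always look after the junior students, and thank you for the past exam papers you provided. 昇峰, I troubled you often when preparing presentations, and I wish you success at work. 任峰, although you graduated before me, thank you for your help and for the sofa you provided. 辰旻, as the "旻" of NTU Information Management, it is understandable that you do not come to the lab often, and I wish you happiness from now on. 啟祥 and 靖婕, thank you for your hard work in our course group projects; things may feel hard now, but you will surely graduate on time. 瑞舜 and 暐獻, it is a pity that we did not play board games again, and I wish you a smooth graduation as well.

I also thank my family. Since coming to Taipei to study I have seldom returned to Taichung; thank you for always caring about me.

I also thank 孟璘, who helped me cope with many difficulties over these two years and gave me great support.

Finally, I am grateful for all the setbacks I met during these two years. These experiences let me grow and face pressure and reality. I will certainly put to use what I have learned, and I wish everyone happiness and success.

Yi-Hsiang Wang (王奕翔), Graduate Institute of Information Management, National Taiwan University, November 2011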

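## Thesis Abstract (translated from the Chinese)

### Student: Yi-Hsiang Wang, November 2011

### Advisor: Yih-Kuen Tsay

**Malware Analysis with 3-Valued Deterministic Finite Tree Automata**

Many different security threats exist on the Internet, and the most notorious is malware. Malware refers to programs that have malicious intent and exhibit malicious behavior. Typical malware includes viruses, worms, trojan horses, and spyware. Malware detectors reduce the risk of being attacked by malware. Each detector has its own detection method, but the most basic and most widely applied method in current detectors is static signature matching. However, this method is widely regarded as unable to cope with today's more advanced malware. Advanced malware uses program obfuscation to alter the structure of the program itself and can therefore evade detectors very easily. Fortunately, the semantics of a program usually remains unchanged after obfuscation, so a feasible countermeasure is to design a detection method based on program semantics.

In this thesis, we propose a semantics-based malware detection method. Observing the approaches proposed in recent years, we find that strings are still widely used as a form of signature. Strings can be extended to trees, which are more general and carry more semantic information. We therefore use trees as the signature format of our detector. Our detector takes a set of malware instances and a set of benign programs as input. The semantics of these programs is represented as data dependence graphs of system calls. We then parse these dependence graphs into trees. Using grammatical inference, we finally obtain a 3-valued tree automaton. A 3-valued tree automaton has three mutually exclusive final states: accept, reject, and unknown. If we use a 3-valued tree automaton as our malware detector, it has three possible outputs depending on the input program. If the input program is a malware instance, the detector outputs yes. If the input program is a benign program, the detector outputs no. Otherwise, it outputs unknown. According to our experimental results, our detector has very low false positives. The price, however, is that many programs are classified as unknown.

Keywords: Malware Analysis, Malware Detector, Grammatical Inference, 3-Valued Automata, Program Semantics, System Call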

### THESIS ABSTRACT

### Graduate Institute of Information Management National Taiwan University

### Student: Wang, Yi-Hsiang Month/Year: November, 2011 Advisor: Tsay, Yih-Kuen

### Malware Analysis with 3-Valued Deterministic Finite Tree Automata

There exist many security threats on the Internet, and the most notorious is malware. Malware (malicious software) refers to programs that have malicious intent and perform harmful actions. Typical malware includes viruses, worms, trojan horses, and spyware. The first line of defense against malware is the malware detector. Each malware detector has its own analysis method. The most basic and prevalent methods used in commercial malware detectors are based on syntactic signature matching. It is widely recognized that this detection mechanism cannot cope with advanced malware. Advanced malware uses program obfuscation to alter program structures and can therefore evade detection easily. However, the semantics of a malware instance is usually preserved after obfuscation, so it is feasible to develop a malware detector that is based on program semantics.

In this thesis, we propose a semantics-based approach to malware analysis. Observing recently proposed methods for malware detection, we notice that string-based signatures are still widely used. It is natural to extend from strings to trees, which are more general and can carry more semantics. Therefore, we use trees as signatures. Our malware detector requires a set of malware instances and a set of benign programs. The semantics of each input program is extracted and represented as a system call dependence graph. The graph is then transformed into a tree. With the set of trees generated from malware and benign programs, we use grammatical inference to learn a 3-valued deterministic finite tree automaton (3DFT). A 3DFT has three different final states: accept, reject, and unknown. If we take this 3DFT as the malware detector, it outputs one of three values. If an input program is a malware instance, the detector outputs true. If an input program is a benign program, the detector outputs false. Otherwise, it outputs unknown. According to our experiments, our detector exhibits very low false positives. However, the tradeoff is that many programs are identified as unknown.

Keywords: Malware Analysis, Malware Detector, Grammatical Inference, 3-Valued Automata, Program Semantics, System Call

### Contents

1 Introduction

1.1 Background

1.2 Motivation and Objective

1.3 Thesis Outline

2 Related Work

2.1 Grammatical Inference

2.1.1 Tree Automata Inference

2.2 Malware Analysis with Executed System Calls

2.2.1 Effective and Efficient Malware Detection at the End Host

2.2.2 A Layered Architecture for Detecting Malicious Behaviors

2.3 Malware Analysis with Tree Automata

2.3.1 Architecture of a Morphological Malware Detector

2.3.2 Malware Analysis with Tree Automata Inference

3 Preliminaries

3.1 Finite Ordered Trees

3.2 Finite Tree Automata

4 Approach

4.1 Architecture

4.2 3-Valued Deterministic Finite Tree Automata

4.3 Semantics Extraction

4.4 Graph Parser

4.5 Automata Learning Algorithm

4.5.1 Learning Algorithm of Drewes

4.5.2 Tree Automata Learning Algorithm

5 Implementation and Experiments

5.1 Implementation

5.2 Experiments

5.2.1 Experimental Results

6 Conclusion

6.1 Contributions

6.2 Future Work

Bibliography

### List of Figures

1.1 The amount of new malware samples collected by AV-TEST

1.2 Program obfuscation

1.3 A malware detector identifies programs as malware, benign, or unknown

2.1 The interactions between learner and teacher in Angluin's $L^{*}$ algorithm

2.2 Partial behavior graph for malware Netsky, redrawn from [18]

2.3 Layered behavior specification, redrawn from [19]

2.4 The architecture of the detector, redrawn from [19]

2.5 The architecture of a morphological malware detector, redrawn from [8]

3.1 A finite ordered tree t and its domain

3.2 A finite ordered tree t and a run π of A over tree t

4.1 Our malware detector

4.2 Architecture

4.3 A behavior graph

4.4 A one-leveled dependence tree

4.5 A two-leveled dependence tree

5.1 Experiment 1 results with one-leveled dependence tree

5.2 Experiment 3 results with one-leveled dependence tree

5.3 Experiment 1 results with two-leveled dependence tree

5.4 Experiment 3 results with two-leveled dependence tree

5.5 Total states of generated automaton

5.6 Total transitions of generated automaton

5.7 The time expense for learning automaton

5.8 Classification results for automaton generated from Experiment 1

5.9 Classification results for automaton generated from Experiment 3

### List of Tables

5.1 48 malware families and the amount of contained samples

5.2 The list of benign programs

5.3 Experiments for testing detection ability

5.4 Experiments for testing classification ability

### Chapter 1 Introduction

### 1.1 Background

Nowadays, many different services are provided on the Internet. As more trust has been placed in the Internet, more attention has been paid to Internet security. There exist many security threats on the Internet, and the most notorious is malware. Generally speaking, malware (malicious software) refers to programs that have malicious intent and perform harmful actions. Typical malware includes viruses, worms, trojan horses, bots, and spyware.

As the underground economy flourishes, malware has become a profitable tool. Malware can be used to launch zero-day attacks, install bots, distribute phishing websites, and crash systems. Worse, malware is widely used to steal private information. As observed by Symantec in 2010 [21], stolen credit card numbers were advertised in the underground economy for $0.07 to $100 each, and 10,000 bots were advertised for $15. Bots are widely used for spam and distributed denial-of-service (DDoS) attacks.

It is therefore unsurprising that the number of malware samples has increased every year. As shown in Figure 1.1, the number of new malware samples collected by AV-TEST [1] has grown rapidly: about 12 million new samples were collected in 2009, and about 17 million in 2010.

Figure 1.1: The amount of new malware samples collected by AV-TEST (x-axis: year, 2002 to 2010; y-axis: new malware samples in millions, 0 to 18)

We mostly rely on malware detectors to cope with malware. A malware detector accepts a suspicious program as input and determines whether it is a malware instance or a benign program. There are many detection approaches, but the most widely used is syntactic signature matching.

A signature refers to a pattern that is present only in a particular malware instance or malware family. If a program matches a pattern (signature), that program is regarded as a malware instance. Strings are the most commonly used format for signatures. For example, a sequence of machine instructions, a sequence of magic numbers, or a regular expression can be used as a signature.

Syntactic signature matching has the advantages that detection is efficient and easy to implement. However, syntactic signatures cannot cope with advanced malware that uses program obfuscation, encryption, and packing. Program obfuscation changes the syntactic structure of a program while preserving its original semantics. Encryption changes the original entry point (OEP) of a program. Packing changes the format of a program using a variety of techniques. These techniques were initially designed to prevent reverse engineering: corporations use them to prevent rivals from imitating their products. Unfortunately, malware writers also use these techniques to protect malware. The most common program obfuscation techniques include garbage code insertion, equivalent instruction substitution, and instruction reordering.

For example, Figure 1.2(a) is the original program. Figure 1.2(b) inserts jmp instructions, and Figure 1.2(c) reorders instruction 3 and instruction 4. However, the semantics of these three programs is identical.

    1 push eax
    2 mov eax, [esi]
    3 add esi, 1
    4 mov [edi], eax
    5 add edi, 1
    6 pop eax

(a) Original program

        jmp L1
    L2: mov [edi], eax
        add edi, 1
        pop eax
    L1: push eax
        mov eax, [esi]
        add esi, 1
        jmp L2

(b) Instruction reordering 1 (unconditional jumps inserted)

    1 push eax
    2 mov eax, [esi]
    4 mov [edi], eax
    3 add esi, 1
    5 add edi, 1
    6 pop eax

(c) Instruction reordering 2 (independent instructions reordered)

Figure 1.2: Program obfuscation

Therefore, rather than using syntactic signatures, it is more effective to use semantics-based signatures. Malware writers usually change syntactic structures to create new malware variants, but these variants usually preserve the original semantics. A semantics-based signature should therefore be able to detect different malware variants as long as they have the same program semantics.
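The contrast between syntactic and semantic comparison can be illustrated with a small Python sketch. It is not part of the thesis implementation: the tiny interpreter, the chosen signature, and the start state are all assumptions made for illustration, and only the instruction subset of Figure 1.2(a) and 1.2(c) is supported.

```python
ORIGINAL = [
    "push eax",
    "mov eax, [esi]",
    "add esi, 1",
    "mov [edi], eax",
    "add edi, 1",
    "pop eax",
]

# Figure 1.2(c): instructions 3 and 4 are swapped (they are independent).
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2], ORIGINAL[4], ORIGINAL[5]]


def run(program, regs, mem):
    """Execute the tiny instruction subset above and return the final (registers, memory)."""
    regs, mem, stack = dict(regs), dict(mem), []
    for line in program:
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")]
        if op == "push":
            stack.append(regs[args[0]])
        elif op == "pop":
            regs[args[0]] = stack.pop()
        elif op == "add":
            regs[args[0]] += int(args[1])
        elif op == "mov":
            dst, src = args
            # "[reg]" denotes the memory cell addressed by the register.
            value = mem.get(regs[src[1:-1]], 0) if src.startswith("[") else regs[src]
            if dst.startswith("["):
                mem[regs[dst[1:-1]]] = value
            else:
                regs[dst] = value
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return regs, mem


def matches_signature(program, signature):
    """Syntactic matching: the signature must occur as a contiguous run of instructions."""
    n = len(signature)
    return any(program[i:i + n] == signature for i in range(len(program) - n + 1))


# Hypothetical syntactic signature: instruction 3 followed by instruction 4.
SIGNATURE = ORIGINAL[2:4]
START_REGS = {"eax": 7, "esi": 100, "edi": 200}  # assumed start state
START_MEM = {100: 42}

print(matches_signature(ORIGINAL, SIGNATURE), matches_signature(REORDERED, SIGNATURE))
print(run(ORIGINAL, START_REGS, START_MEM) == run(REORDERED, START_REGS, START_MEM))
```

The signature matches the original program but not the reordered one, although both programs leave registers and memory in the same final state for the assumed start state. A single start state is of course only evidence, not a proof, of semantic equivalence.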

### 1.2 Motivation and Objective

An ideal malware detector identifies malicious programs correctly. In practice, however, malware writers and malware detectors co-evolve. Every time a new malware sample is designed, malware detectors are improved to detect it. Once that sample can be detected, more advanced malware is developed to evade detection. A practical malware detector therefore inevitably has false positives and false negatives.

We would like to design a malware detector that identifies programs as malware, benign, or unknown. Only programs identified as unknown need a double check; in the other two cases, programs are recognized as malware instances or benign programs with great confidence. We design the detector this way because a double check is always needed in practice. In a highly security-sensitive environment, more than one malware detector is used to cope with malware, and the results of earlier detectors may need to be re-processed by later detectors. If we can explicitly separate programs into three groups (malware, benign, and unknown), only programs identified as unknown need to be re-processed, which is more efficient.

As shown in Figure 1.3, the outermost rectangle represents all possible programs. Each circle is a benign program and each triangle is a malware instance. The circles in the left rectangle are correctly identified as benign programs. The triangles in the right rectangle are correctly identified as malware instances. The remaining circles and triangles in the middle rectangle are identified as unknown and need a double check.

Figure 1.3: A malware detector identifies programs as malware, benign, or unknown (regions from left to right: Benign, Unknown, Malware)
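The double-check idea can be sketched as a cascade of detectors in which only programs classified as unknown are passed on. This is a minimal Python sketch under stated assumptions: `first_stage` and `second_stage` are hypothetical stand-ins with hard-coded answers, not detectors from this thesis.

```python
from enum import Enum


class Verdict(Enum):
    MALWARE = "malware"
    BENIGN = "benign"
    UNKNOWN = "unknown"


def cascade(program, detectors):
    """Run detectors in order; only UNKNOWN results are passed to the next detector."""
    for detect in detectors:
        verdict = detect(program)
        if verdict is not Verdict.UNKNOWN:
            return verdict
    return Verdict.UNKNOWN


def first_stage(program):
    # Hypothetical 3-valued detector that answers only when it is confident.
    if program == "netsky.exe":
        return Verdict.MALWARE
    if program == "notepad.exe":
        return Verdict.BENIGN
    return Verdict.UNKNOWN


def second_stage(program):
    # Hypothetical slower detector that re-processes what is still unknown.
    return Verdict.MALWARE if program.endswith(".scr") else Verdict.UNKNOWN


for name in ["netsky.exe", "notepad.exe", "holiday.scr", "tool.exe"]:
    print(name, cascade(name, [first_stage, second_stage]).value)
```

Programs that receive a confident verdict from the first detector never reach the second one, which is where the efficiency gain described above would come from.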

In addition, we would like to use trees as semantics-based signatures. As mentioned before, strings have been widely used for syntactic signature matching. We could still select strings as semantics-based signatures, but strings are too specific and easy to evade. Strings can represent sequential relations and basic dependence relations, whereas trees can represent more subtle dependence relations, such as shared dependence.

Some proposed detection approaches require a priori knowledge about malicious behaviors. They specify a set of behaviors as malicious, or design benign policies and regard a program as malware if any of its behaviors violates the policies. This is inefficient and error-prone. We would like to learn malicious behaviors using grammatical inference instead. Given a set of malware samples and benign programs, we use grammatical inference to learn signatures. Compared with signatures built from given knowledge, our signatures can reflect more implicit knowledge.

### 1.3 Thesis Outline

The rest of this thesis is structured as follows:

• In Chapter 2, we introduce related work.

• In Chapter 3, we give the preliminaries for this thesis, including finite ordered trees and finite tree automata.

• In Chapter 4, we describe our detection approach.

• In Chapter 5, we describe our implementation and show the results of our experiments.

• In Chapter 6, we summarize our contributions and indicate some possible directions for future research.

### Chapter 2 Related Work

In this chapter, we describe work related to our research. We first introduce grammatical inference, focusing on tree automata inference. We then introduce related work on malware analysis with executed system calls. Finally, we review two related works that combine malware analysis with tree automata.

### 2.1 Grammatical Inference

Grammatical inference is concerned with learning language representations from given information, which can be text, examples and counter-examples, or anything that provides insight into the elements of the target language [13]. The learner is sometimes called an inference machine or a learning algorithm. Research on grammatical inference crosses a number of related fields, including artificial intelligence, machine learning, formal language theory, pattern recognition, computational linguistics, and speech recognition. The field has a variety of learning models, which differ in their settings: the target of learning, the language representation, the available information, and the way the information is presented. However, there are three major established formal models:

• Gold’s identification in the limit,

• Angluin’s active learning model,

• Valiant's probably approximately correct (PAC) model.

Because Angluin's learning model is closely linked to the work discussed below, we give a brief introduction to it. Introductions to the other two learning models can be found in [12], [20], and [13].

In Angluin's active learning model (also called the query learning model), there is a teacher that can answer specific kinds of queries about the unknown grammar, and the learner learns the target language by asking the teacher these queries. Angluin described several types of queries in [5]. In [4], Angluin proposed the $L^{*}$ learning algorithm, which can learn a regular language from a minimally adequate teacher. A minimally adequate teacher (MAT) is assumed to answer correctly two types of queries from the learner about the target language.

• Membership query: given a string t, the teacher returns "yes" if t is a member of the target language and "no" otherwise.

• Equivalence query: given a hypothesis grammar, the teacher returns "yes" if the hypothesis is equivalent to the target and "no" otherwise. When it returns "no", it also returns a counter-example, which is a string in the symmetric difference of the languages of the hypothesis grammar and the target grammar.

The learner uses membership queries to infer the grammar structure. Every time the learner has inferred a structure, it asks the teacher an equivalence query. If a counter-example is returned, the learner modifies the inferred structure with the help of membership queries and asks an equivalence query again. This process repeats until the inferred structure is equivalent to the target. The details of the $L^{*}$ learning algorithm are omitted here and are described in a later section. The interactions between the learner and the teacher are depicted in Figure 2.1.
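The two query types can be made concrete with a small Python sketch of a teacher. This is an illustration under stated assumptions, not Angluin's construction: the target language is given as a predicate, and the equivalence query is only approximated by comparing hypothesis and target on all strings up to an assumed length bound `max_len`, whereas a true MAT answers equivalence exactly.

```python
from itertools import product


class Teacher:
    """Approximate minimally adequate teacher for a target language given as a predicate."""

    def __init__(self, alphabet, is_member, max_len=6):
        self.alphabet = alphabet
        self.is_member = is_member
        self.max_len = max_len  # assumption: equivalence is checked only up to this length

    def membership(self, word):
        """Membership query: is `word` in the target language?"""
        return self.is_member(word)

    def equivalence(self, hypothesis):
        """Equivalence query: return (True, None) or (False, counter_example)."""
        for length in range(self.max_len + 1):
            for letters in product(self.alphabet, repeat=length):
                word = "".join(letters)
                # A counter-example lies in the symmetric difference of the two languages.
                if hypothesis(word) != self.is_member(word):
                    return False, word
        return True, None


# Hypothetical target: strings over {a, b} with an even number of a's.
teacher = Teacher("ab", lambda w: w.count("a") % 2 == 0)

print(teacher.membership("aab"))
print(teacher.equivalence(lambda w: True))
print(teacher.equivalence(lambda w: w.count("a") % 2 == 0))
```

A learner would alternate between `membership` calls, to fill in its observation table, and `equivalence` calls, to test each hypothesis and obtain a counter-example when the hypothesis is wrong.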

### 2.1.1 Tree Automata Inference

As mentioned in Section 1.2, we would like to infer a tree automaton from examples. Here, we give a brief introduction to the related work. The main focus of research in grammatical inference has been on learning regular grammars or deterministic finite automata (DFA). The reason is that this problem seems simple enough, as regular grammars are less general than context-free grammars. Tree automata can be seen as a direct extension of finite automata in which the input is a finite tree instead of a string. Therefore, it is straightforward to extend finite automata inference algorithms to tree automata inference.

Figure 2.1: The interactions between learner and teacher in Angluin's $L^{*}$ algorithm (the learner sends membership queries, each a finite string, and equivalence queries, each a finite automaton, to the minimally adequate teacher; the teacher answers a membership query with yes or no, and an equivalence query with yes, or with no and a counter-example)

Brayer and Fu [9] proposed a tree automata inference algorithm that extends the original k-tail inference method. Similarly, Fukuda and Kamata [16] use k-followers to infer a tree automaton. Both algorithms infer the tree automaton from a given sample set. García and Oncina [17] extend the RPNI (Regular Positive and Negative Inference) algorithm to tree automata inference; as the name suggests, learning is based on given positive and negative examples. The idea is to first build the subtree automaton, which recognizes each subtree of the positive samples, and then merge automaton states as long as no negative sample is accepted. The algorithm runs in time polynomial in the size of the input data.
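The first step of the RPNI-style approach, building the subtree automaton, can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [17]: trees are assumed to be nested tuples of the form `(label, child, ...)`, the function names are ours, and the state-merging step guided by negative samples is omitted.

```python
def build_subtree_automaton(positive_trees):
    """Create one state per distinct subtree; the roots of the samples become final states."""
    state_of = {}      # subtree -> state id
    transitions = {}   # (label, tuple of child states) -> state id
    final = set()

    def visit(tree):
        if tree in state_of:
            return state_of[tree]
        label, *children = tree
        child_states = tuple(visit(child) for child in children)
        state = len(state_of)  # fresh id, assigned after all children are numbered
        state_of[tree] = state
        transitions[(label, child_states)] = state
        return state

    for tree in positive_trees:
        final.add(visit(tree))
    return transitions, final


def accepts(transitions, final, tree):
    """Bottom-up run; a missing transition means the tree is rejected."""
    def run(node):
        label, *children = node
        child_states = []
        for child in children:
            state = run(child)
            if state is None:
                return None
            child_states.append(state)
        return transitions.get((label, tuple(child_states)))

    root_state = run(tree)
    return root_state is not None and root_state in final


samples = [("f", ("a",), ("b",)), ("g", ("a",))]
transitions, final = build_subtree_automaton(samples)

print(accepts(transitions, final, ("f", ("a",), ("b",))))
print(accepts(transitions, final, ("f", ("b",), ("a",))))
```

Before any merging, the automaton accepts exactly the positive samples. Generalization comes only from the merging step, which is where the negative samples are needed.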

Drewes [14] extends Angluin's $L^{*}$ learning algorithm to tree automata inference. Just as in Angluin's algorithm, an observation table is maintained during the learning process and used to construct the automaton. To extend the observation table, the learner asks the teacher membership queries and equivalence queries. Two different tree automata learning algorithms are proposed in [14], $L^{*}_{tfta}$ and $L^{*}_{fta}$. The key difference between the two is that $L^{*}_{tfta}$ learns a total finite tree automaton, whereas $L^{*}_{fta}$ learns a partial finite tree automaton, which has fewer transitions and states.

The idea behind the partial finite tree automaton is that the dead state is removed from the automaton. A dead state is one whose corresponding equivalence class contains no subtree of any tree in the target language. This learning algorithm is discussed in more detail in Chapter 4.
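The relation between a total and a partial automaton can be illustrated by removing dead states after the fact. The Python sketch below is our own illustration and not the procedure of [14], which avoids constructing the dead state during learning. It assumes a deterministic bottom-up automaton stored as a dictionary from `(label, child states)` to a state, and it assumes that every state is reachable by some tree.

```python
from itertools import product


def remove_dead_states(transitions, final):
    """Keep only live states, i.e. states from which some final state can still be reached."""
    live = set(final)
    changed = True
    while changed:
        changed = False
        for (label, children), target in transitions.items():
            # If the target is live, every argument state can occur below an accepted root.
            if target in live:
                for child in children:
                    if child not in live:
                        live.add(child)
                        changed = True
    partial = {key: target for key, target in transitions.items() if target in live}
    return partial, live


# Hypothetical total automaton for the language {f(a, a)}:
# state 0 = "a", state 1 = "f(a, a)" (final), state 2 = dead.
STATES = (0, 1, 2)
total = {("a", ()): 0, ("b", ()): 2}
for left, right in product(STATES, repeat=2):
    total[("f", (left, right))] = 1 if (left, right) == (0, 0) else 2

partial, live = remove_dead_states(total, final={1})
print(len(total), len(partial), sorted(live))
```

In this example the total automaton needs eleven transitions, while the partial automaton keeps only the two that can contribute to an accepting run.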

Another algorithm that extends Angluin's active learning model is that of Besombes and Marion [7], who proposed a learning algorithm that learns the automaton from positive examples and membership queries. To compensate for the lack of equivalence queries, the learner is initially provided with a representative sample. Informally, a representative sample is a set of trees of the language such that each transition of the target automaton is used to process the sample. By checking the consistency of the observation table, the learner obtains new differentiating contexts.

### 2.2 Malware Analysis with Executed System Calls

As mentioned in Section 1.1, the new generation of malware detectors focuses on the behavior of a program rather than its syntactic structure. There exist many kinds of behavior models; one of the prevalent models is based on the system calls executed by a program. In order to access system resources, a user-level program has to invoke system calls. By tracking the system calls executed by malware, we can learn the purpose of the program and the resources it requests. However, if only the sequence of system calls is used as a malware signature, the signature is easy to evade by reordering system calls. Several more informative representations based on system calls have been proposed; we give a brief introduction as follows.

### 2.2.1 Effective and Efficient Malware Detection at the End Host

In [18], Kolbitsch et al. proposed a malware detection approach that is efficient enough to deploy at the end host. To model the behavior of a program, they rely on the executed system calls. As mentioned above, it is unwise to represent program behavior as system call sequences. Instead, they use the dependence graph of system calls, called a behavior graph in the paper.

A behavior graph is a tuple $G = (V, E, F, \delta)$, where

• $V$ is the set of vertices, each representing a system call $s \in \Sigma$,

• $E$ is the set of edges, $E \subseteq V \times V$,

• $F$ is the set of functions $\bigcup f : x_1, x_2, \ldots, x_n \to y$, where each $x_i$ is an output argument of a system call and $y$ is an input argument of a system call,

• $\delta$ assigns a function $f_i$ to each system call input argument $a_i$.

For example, a partial behavior graph for the malware Netsky is shown in Figure 2.2.
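As a reading aid, the tuple $G = (V, E, F, \delta)$ can be written down as a small Python data structure. This is a sketch under stated assumptions, not the representation used in [18]: the class and method names are ours, the two edges are only our reading of part of Figure 2.2, and the identity function is a hypothetical stand-in for the functions in $F$.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorGraph:
    """G = (V, E, F, delta) for system call dependences."""
    vertices: dict = field(default_factory=dict)   # V: vertex id -> system call name
    edges: set = field(default_factory=set)        # E: (producer id, consumer id)
    functions: dict = field(default_factory=dict)  # F: function name -> callable
    delta: dict = field(default_factory=dict)      # delta: (vertex id, input argument) -> function name

    def add_call(self, vertex, syscall):
        self.vertices[vertex] = syscall

    def add_dependence(self, producer, consumer, argument, name, function):
        """Record that `argument` of `consumer` is computed from outputs of `producer`."""
        if producer not in self.vertices or consumer not in self.vertices:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges.add((producer, consumer))
        self.functions[name] = function
        self.delta[(consumer, argument)] = name

    def producers_of(self, consumer):
        return sorted(src for src, dst in self.edges if dst == consumer)


def identity(value):
    # Hypothetical f: the name or handle is passed on unchanged.
    return value


graph = BehaviorGraph()
for vertex, syscall in [
    ("name", "GetModuleFileNameA"),
    ("open", "NtCreateFile"),
    ("section", "NtCreateSection"),
]:
    graph.add_call(vertex, syscall)

# Assumed edges, labeled with the argument names that appear in Figure 2.2.
graph.add_dependence("name", "open", "Name", "f_name", identity)
graph.add_dependence("open", "section", "FileHandle", "f_handle", identity)

print(graph.producers_of("section"), graph.delta[("open", "Name")])
```

Keeping $\delta$ as a map from input arguments to function names mirrors the definition: each input argument of a system call is explained by exactly one function over earlier outputs.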

Figure 2.2: Partial behavior graph for malware Netsky, redrawn from [18] (nodes: GetModuleFileNameA, two NtCreateFile nodes with Mode: Create and Mode: Open, NtCreateSection, NtMapViewOfSection, and NtWriteFile; file name annotation: C:\WINDOWS\AVprotect9x.exe; edge labels: Name, FileHandle, SectionHandle, FileHandle (Read Buffer))

To detect malware at the end host, the suspicious program must be executed dynamically. The invoked system calls are matched against the vertices of the collected malware behavior graphs. However, the dependences between graph vertices also need to be checked, and dynamic taint analysis at the end host would generate significant overhead. Therefore, they use program slicing. For a data flow between system calls x and y in the malware, the instructions responsible for reading the input and transforming it into the output are extracted. This program slice is used to derive a symbolic expression that represents the semantics of the slice. Using this symbolic expression, the expected output can be pre-computed when system call x is invoked at the end host. Later, when system call y is invoked, the detector checks whether the values of its arguments are identical to the expected output. If the values are equal, the data dependence has been detected.
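The pre-compute-then-compare check described above can be sketched in Python. This is an illustration only: `DependenceChecker` and its interface are hypothetical, and `slice_fn` stands in for the symbolic expression that [18] derives from a program slice.

```python
class DependenceChecker:
    """Checks one data dependence x -> y using a function derived from a program slice."""

    def __init__(self, producer, consumer, slice_fn):
        self.producer = producer    # system call x
        self.consumer = consumer    # system call y
        self.slice_fn = slice_fn    # stand-in for the symbolic expression of the slice
        self.expected = None

    def observe(self, syscall, outputs=(), inputs=()):
        """Feed one observed system call; return True when the dependence is confirmed."""
        if syscall == self.producer:
            # Pre-compute what y should receive if it really depends on x.
            self.expected = self.slice_fn(*outputs)
            return False
        if syscall == self.consumer and self.expected is not None:
            return self.expected in inputs
        return False


checker = DependenceChecker(
    producer="GetModuleFileNameA",
    consumer="NtCreateFile",
    slice_fn=lambda path: path.lower(),  # hypothetical transformation recovered from a slice
)

first = checker.observe("GetModuleFileNameA", outputs=("C:\\Temp\\Sample.EXE",))
second = checker.observe("NtCreateFile", inputs=("c:\\temp\\sample.exe", "OPEN"))
print(first, second)
```

Only a value comparison is needed at run time, which is what lets this approach avoid the overhead of taint tracking at the end host.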

To test detection effectiveness and false positives, they set up experiments with six common malware families. They randomly selected 100 samples from each family and extracted behavior graphs from 50 of them; the detection rate of their detector on the remaining test samples is about 93%. Although the experiment focuses only on six known malware families, the detection rate is still impressive. For false positives, they tested five common benign applications and reported no false positives. Although this result is promising, the exact number of test samples is not clear from the experiments, so we cannot judge how often their detector misclassifies goodware. Nevertheless, as their method shows, the dependence graph of system calls is closely related to program semantics.

### 2.2.2 A Layered Architecture for Detecting Malicious Behaviors

Martignoni et al. [19] proposed a layered architecture for detecting malicious behaviors. Although the focus of this method is on botnet detection, it is reasonable to extend it to malware detection. To capture the behavior of a malicious program, their method is based on the executed system calls, but the system calls are used in a smarter way. For each malicious behavior, they use a layered representation. Each behavior is composed of events, and events are generated from observed system calls. For example, Figure 2.3 shows the hierarchy of events used to specify the high-level behavior of downloading and executing a program.

Figure 2.3: Layered behavior specification, redrawn from [19]

High-level behaviors are decomposed into multiple layers. Events are represented as bold strings, and the edges between events indicate dependence relations. The lowest layer, Layer 0, consists of the invoked system calls. Events in Layer 1 aggregate Layer 0 events that have a common side effect. In Figure 2.3, the event net_recv is generated whenever either of the Layer 0 events recv or recvfrom occurs. Events at Layer 2 and above identify correlated sequences of lower-layer events that have some aggregate, composite effect. In Figure 2.3, the event sync_tcp_client indicates that a synchronous TCP socket has been created, bound, and connected. As we can see, events at upper layers have richer semantics. Decomposing high-level behaviors into multiple layers makes the behavior specifications configurable, less error-prone, and easy to update.

Figure 2.4: The architecture of the detector, redrawn from [19] (components: analyzed programs, Qemu emulator, arguments extraction, process-generated events, behavior matcher, behavior specification)

The overall architecture of the detector is presented in Figure 2.4. It includes an analysis environment, a set of behavior specifications, and a behavior matcher. It works as follows:

1. The suspicious program is executed dynamically in the emulator, and each invoked system call is sent to the behavior matcher.

2. Every time the matcher receives an invoked system call, it attempts to match the call with a node of a behavior specification.

   Each behavior specification is composed in the layered fashion of Figure 2.3. It is a graph in which each node is a system call or an event, and each graph has a single output node. When all nodes of the graph except the output node have been matched, the output event is generated. This event can in turn be used to compose other behavior specifications.

3. If a high-level behavior specification has been matched, the malicious behavior has been observed.
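The matching steps above can be sketched in Python. This is a strongly simplified illustration, not the matcher of [19]: it ignores event ordering and arguments, and apart from `recv`, `recvfrom`, and `net_recv`, all event names (`write_file`, `exec`, `download_exec`) are hypothetical.

```python
class Spec:
    """One behavior specification: a set of input nodes and a single output event."""

    def __init__(self, output, needs, mode="all"):
        self.output = output
        self.needs = set(needs)
        self.mode = mode   # "any": one input suffices (aggregation); "all": every input is required
        self.seen = set()

    def feed(self, event):
        """Return True when this event completes the specification."""
        if event not in self.needs:
            return False
        self.seen.add(event)
        return self.mode == "any" or self.seen == self.needs


class Matcher:
    def __init__(self, specs):
        self.specs = specs
        self.generated = []   # output events generated so far, in order

    def observe(self, event):
        """Match one system call and propagate generated events to upper layers."""
        pending = [event]
        while pending:
            current = pending.pop()
            for spec in self.specs:
                if spec.feed(current) and spec.output not in self.generated:
                    self.generated.append(spec.output)
                    pending.append(spec.output)


matcher = Matcher([
    Spec("net_recv", {"recv", "recvfrom"}, mode="any"),            # Layer 1 aggregation
    Spec("download_exec", {"net_recv", "write_file", "exec"}),     # hypothetical high-level behavior
])

for call in ["recvfrom", "write_file", "exec"]:
    matcher.observe(call)
print(matcher.generated)
```

Because an output event is fed back into the matcher like any other event, specifications can be stacked into layers without the matcher knowing which layer an event belongs to.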

In their implementation, the behavior specifications were extracted manually using domain knowledge and the analysis of tens of gigabytes of execution traces. Although constructing the behavior specifications is laborious, modifying or updating a specification is simple, and the extracted specifications provide the detector with good behavioral signatures.

### 2.3 Malware Analysis with Tree Automata

Below, we review the works that implement malware detection with the help of tree automata. In addition to give a brief introduction of the methods, we also comment their advantages and drawbacks.

### 2.3.1 Architecture of a Morphological Malware Detector

In [8], Bonfante et al. proposed a malware detector that makes use of tree automata.

To capture program semantics, the detector relies on the extracted control flow graph (CFG). The input program is transformed to the abstracted language first, and the corre- sponding CFG is extracted. The vertices of the extracted CFG includes inst, sequential instructions; jmp, uncondition jumps; jcc, conditional jumps; call, function calls; and end, function returns or undefined instructions. Besides, in order to deal with the clas- sical obfuscation skills, they also design an rewriting engine that reduces the extracted CFG. After this steps, the reduced CFG of a malware can be regarded as a signature. To build the signature database, collect the set of CFG of malware and transform its into tree representation. Using the finite set of trees, a minimal tree automata which recog- nizes these trees can be build. And this automata can be used to detect malware infection.

The overall architecture of the detector is depicted in Figure 2.5. For the malware samples, the control flow graphs are extracted and reduced by the rewriting engine. Each reduced graph is then transformed into a tree. The set of trees is used to construct a tree automaton which, after minimization, serves as the malware detector. A suspicious program is processed by the same steps: graph extraction, graph rewriting, and transformation from graph to tree. If the resulting tree is accepted by the tree automaton, the program is identified as malware; otherwise, it is regarded as benign.

The detector proposed by Bonfante et al. can be constructed automatically. Moreover, the properties of tree automata make malware detection with their tool efficient.

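Figure 2.5: The architecture of a morphological malware detector, redrawn from [8]. Malware samples pass through CFG extraction, graph rewriting, and a converter into tree automata learning, which produces the tree automaton. A suspicious program passes through CFG extraction, graph rewriting, and a converter; if the automaton recognizes the resulting tree an alert is raised, otherwise the program is reported as OK.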

However, this detection can only handle basic malware, that is, malware with the same CFG as a malware signature. To detect more general malware infection, the detection has to be based on CFG subgraph isomorphism, which increases the complexity of detection. When a new malware signature has to be added, this can be done by computing the union of automata, which takes linear time. However, the method of automata construction is not described in [8], so we cannot assess the capability of their tree automata or the effects of automata union. Moreover, their experiments report only false positives and neglect false negatives. Although the false positive rate of their detector is remarkably low (0.09% when the lower bound of the CFG size is 15), it remains unclear how effective and efficient their malware detector is. Nevertheless, their experiments do show that it is worthwhile to combine tree automata with malware analysis.

### 2.3.2 Malware Analysis with Tree Automata Inference

Similar to [8], Babic et al. [6] proposed a malware analysis method based on tree automata inference. The major contribution of their work, however, is the automata inference algorithm.

For each malware instance, they use dynamic taint analysis to construct the data-flow dependence graph of system calls. As mentioned in Section 2.2, the executed system calls reflect the system resources accessed by the malware, and they have been used in the community for a while. In addition, by focusing on data-flow dependence, the graph can resist some common obfuscation techniques. After building the graphs, they learn the tree automaton directly from the graphs in order to avoid the exponential blowup caused by expanding them into trees.

In [6], they use only the set of positive samples (malware) to learn the automaton. Instead of learning from the full class of regular tree languages, they learn from a subclass, the k-testable tree languages. These languages are defined by a finite set of k-level-deep tree patterns, and this kind of language can be identified from positive samples only. The value k is a tunable factor that influences the level of generalization: the smaller k is, the more abstract the inferred automaton. Therefore, by adjusting k, the detector can balance false positives against false negatives. Methods for k-testable tree automata inference had already been proposed, but the algorithm designed by Babic et al. has better performance, with complexity $O(kN)$, where $N$ is the size of the graphs.

Their experiments show that, when k = 4, the detector has 20% false negatives and 5% false positives, and that increasing k above 4 does not significantly improve the detection rates. Therefore, they determined that k = 4 is the optimal abstraction level. They also tested the classification ability of their detector. The experimental results only support that the tool has some classification ability, and noise still exists. Unfortunately, extracting the system call dependence graph takes a lot of time, which makes the detector impractical for on-demand use. However, it takes only a few seconds to learn the automaton, and the time needed to analyze a program is less than the timing jitter. If the front end of the detector can be improved, the detector will have more practical value.

### Chapter 3

### Preliminaries

In this chapter, we give formal definitions of finite ordered trees and finite tree automata.

### 3.1 Finite Ordered Trees

First, we define the term ranked alphabet. A ranked alphabet $\Sigma$ is a finite set of symbols together with a function $rank : \Sigma \to \mathbb{N} \cup \{0\}$. For each symbol $a \in \Sigma$, $rank(a)$ is the rank (arity) of $a$. We denote by $\Sigma_n$, $n \geq 0$, the set of all symbols $a$ with $rank(a) = n$. A finite ordered tree (abbreviated as tree in the following) $t$ over a ranked alphabet $\Sigma$ is a partial mapping $t : \mathbb{N}^* \to \Sigma$ that satisfies the following conditions:

• $dom(t)$ is a finite, prefix-closed subset of $\mathbb{N}^*$,

• for each $p \in dom(t)$, if $t(p) \in \Sigma_n$, then $\{ i \mid pi \in dom(t) \} = \{1, ..., n\}$.

We call each sequence $p \in dom(t)$ a node of $t$, and use $\varepsilon$ to denote the empty sequence. The root node of the tree $t$ is the node $\varepsilon$. A frontier node of the tree $t$ is a node $p$ such that $pj \notin dom(t)$ for all $j \in \mathbb{N}$. We denote by $T(\Sigma)$ the set of all such trees over the ranked alphabet $\Sigma$.

For example, Figure 3.1 depicts a tree $t$ on the left and its domain on the right. For the tree $t$, $\Sigma_0 = \{a, b\}$, $\Sigma_2 = \{f, g\}$, and $\Sigma_3 = \{h\}$. The root node of $t$ is the node $\varepsilon$, and the frontier nodes of $t$ are the nodes 11, 12, 21, 221, 222, 231, and 232.

Figure 3.1: A finite ordered tree $t$ and its domain. In term notation, $t = f(g(a, b), h(a, f(a, b), g(a, b)))$ and $dom(t) = \{\varepsilon, 1, 11, 12, 2, 21, 22, 221, 222, 23, 231, 232\}$.
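The definition of a tree as a partial mapping from positions to symbols can be made concrete with a short sketch. The code below is only an illustration under our own assumptions: positions are encoded as Python tuples of child indices (the empty tuple stands for the empty sequence), the tree is our reading of Figure 3.1, and the helper names `is_tree` and `frontier` are hypothetical.

```python
# Rank of each symbol in the example: a, b are constants,
# f and g are binary, h is ternary.
RANK = {"a": 0, "b": 0, "f": 2, "g": 2, "h": 3}

# Our reading of the tree in Figure 3.1, stored as a mapping
# from positions (tuples of child indices) to symbols.
T = {
    (): "f",
    (1,): "g", (1, 1): "a", (1, 2): "b",
    (2,): "h", (2, 1): "a",
    (2, 2): "f", (2, 2, 1): "a", (2, 2, 2): "b",
    (2, 3): "g", (2, 3, 1): "a", (2, 3, 2): "b",
}


def is_tree(t, rank):
    """Check the two conditions on dom(t) from the definition."""
    dom = set(t)
    if not dom:
        return False  # assumption: a tree has at least a root node
    for p in dom:
        # Prefix-closed: the parent of every non-root node is in dom(t).
        if p and p[:-1] not in dom:
            return False
        # A node labelled with a rank-n symbol has exactly the children 1..n.
        children = {q[-1] for q in dom if q and q[:-1] == p}
        if children != set(range(1, rank[t[p]] + 1)):
            return False
    return True


def frontier(t):
    """Return the nodes of t that have no child in dom(t)."""
    dom = set(t)
    return sorted(p for p in dom if not any(q and q[:-1] == p for q in dom))
```

Calling `frontier(T)` lists the positions of the leaves, which correspond to the frontier nodes named in the example.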

### 3.2 Finite Tree Automata

A finite tree automaton (FTA) is a tuple $A = (\Sigma, Q, \Delta, F)$ where:

• $Q$ is a finite set of states,

• $F \subseteq Q$ is a finite set of final states,

• $\Sigma$ is a ranked alphabet,

• $\Delta$ is a set of transition rules of the following form:

$f(q_1, q_2, ..., q_n) \to q$, where $q_1, ..., q_n, q \in Q$ and $f \in \Sigma_n$.

A finite tree automaton is deterministic if there do not exist two transition rules with the same left-hand side; such an automaton is denoted by DFT. Given a tree $t$, we use the transition rules of the FTA to traverse $t$ from the bottom to the top. Note that an FTA does not have an initial state; instead, the transition rules of the form $f() \to q$ with $f \in \Sigma_0$ can be regarded as the initial rules.

More formally, a run $\pi$ of a finite tree automaton $A$ over a tree $t$ is a mapping from $dom(t)$ to $Q$. It starts from the leaf rules $a() \to q$, where $a \in \Sigma_0$. A run $\pi$ assigns a state to every node $p \in dom(t)$ according to the following rule: if $f(q_1, ..., q_n) \to q \in \Delta$, $t(p) = f$, and $\pi(pi) = q_i$ for each $i \in \{1, ..., n\}$, then $\pi(p) = q$. We say that a run $\pi$ of $A$ on a tree $t$ is successful if $\pi(\varepsilon) \in F$. A tree $t$ is accepted by $A$ if there is a successful run of $A$ on $t$. The tree language $L(A)$ recognized by $A$ is the set of trees accepted by $A$. Two FTAs are said to be equivalent if they recognize the same tree language.

Let us look at an example. Let $\Sigma = \{0, 1, not, and, or\}$, with $\Sigma_0 = \{0, 1\}$, $\Sigma_1 = \{not\}$, and $\Sigma_2 = \{and, or\}$. Consider a DFT $A = (\Sigma, Q, \Delta, F)$ where $Q = \{q_0, q_1\}$, $F = \{q_1\}$, and $\Delta$ contains the following rules.

$$
\begin{array}{ll}
0 \to q_0 & 1 \to q_1 \\
not(q_0) \to q_1 & not(q_1) \to q_0 \\
and(q_0, q_0) \to q_0 & and(q_0, q_1) \to q_0 \\
and(q_1, q_0) \to q_0 & and(q_1, q_1) \to q_1 \\
or(q_0, q_0) \to q_0 & or(q_0, q_1) \to q_1 \\
or(q_1, q_0) \to q_1 & or(q_1, q_1) \to q_1
\end{array}
$$

For the tree $t$ in Figure 3.2, the run $\pi$ of $A$ over $t$ is also presented in Figure 3.2. Because this automaton is deterministic, it has only one possible run on $t$. Since $\pi(\varepsilon) = q_0 \notin F$, $\pi$ is not a successful run. Therefore, the tree $t$ is not accepted by $A$.

Figure 3.2: A finite ordered tree $t$ and a run $\pi$ of $A$ over tree $t$. In term notation, $t = and(not(or(0, 1)), or(1, not(0)))$; the run $\pi$ labels the root with $q_0$.
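The bottom-up evaluation described above can be sketched in a few lines. The code is an illustration, not part of any library: trees are encoded as nested tuples `(symbol, child, ...)`, the rule table is the one given for the example DFT, the tree is our reading of Figure 3.2, and the function names `run` and `accepts` are our own.

```python
# Transition rules of the example DFT: (symbol, child states) -> state.
DELTA = {
    ("0", ()): "q0", ("1", ()): "q1",
    ("not", ("q0",)): "q1", ("not", ("q1",)): "q0",
    ("and", ("q0", "q0")): "q0", ("and", ("q0", "q1")): "q0",
    ("and", ("q1", "q0")): "q0", ("and", ("q1", "q1")): "q1",
    ("or", ("q0", "q0")): "q0", ("or", ("q0", "q1")): "q1",
    ("or", ("q1", "q0")): "q1", ("or", ("q1", "q1")): "q1",
}
FINAL = {"q1"}


def run(tree, delta):
    """Return the state assigned to the root, or None if no rule applies."""
    symbol, *children = tree
    states = []
    for child in children:
        state = run(child, delta)  # children are evaluated first (bottom-up)
        if state is None:
            return None
        states.append(state)
    return delta.get((symbol, tuple(states)))


def accepts(tree, delta, final):
    """A tree is accepted if the state at its root is a final state."""
    return run(tree, delta) in final


# Our reading of the tree t in Figure 3.2.
T = ("and",
     ("not", ("or", ("0",), ("1",))),
     ("or", ("1",), ("not", ("0",))))
```

Because the automaton is deterministic, `run` follows at most one rule per node, so a single recursive pass suffices. For the tree above, the root receives the state `q0`, which is not final.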

### Chapter 4 Approach

In this chapter, we present our malware detection approach. As mentioned in Section 1.2, our detector uses semantics-based signatures. We assume that program semantics is captured by system call data-flow dependence graphs. Because we want to use trees as signatures, we unfold dependence graphs into trees. We use a finite tree automaton to represent correlated signatures and take this automaton as a malware detector.

Our tree automaton is a 3-valued tree automaton, which has three disjoint kinds of final states: accept, reject, and unknown. If we take a 3-valued tree automaton as a malware detector and give it a suspicious program as input, there are three possible outcomes:

• An accept implies that the input program is a malware instance.

• A reject implies that the input program is a benign program.

• An unknown means no conclusive answer.

Our detector therefore works as illustrated in Figure 4.1, where every circle and every triangle is an input program. The circles inside the left circle are rejected by our tree automaton, and the triangles inside the right circle are accepted by it. The remaining circles and triangles are unknown to our tree automaton.

Figure 4.1: Our malware detector

### 4.1 Architecture

Our detector is trained with positive examples and negative examples. The positive examples are a set of malware instances and the negative examples are a set of benign programs. We use a system call dependence graph to represent the semantics of each input program. A parser then converts the graph into a tree. Using the tree automata learning algorithm, we learn a 3-valued finite tree automaton and take this automaton as our malware detector.

To test a suspicious program, we follow similar steps and use the 3-valued tree automaton to test the generated tree. The overall architecture is depicted in Figure 4.2. The dashed line separates the architecture into detector construction and suspicious program testing.

Figure 4.2: Architecture. For detector construction, positive and negative examples pass through semantics extraction (yielding system call dependence graphs) and the parser (yielding trees) into the learning algorithm, which outputs the tree automaton. For testing, a test sample passes through semantics extraction and the parser, and the tree automaton answers accept, unknown, or reject.

### 4.2 3-Valued Deterministic Finite Tree Automata

We have mentioned that our tree automaton is a 3-valued tree automaton. Chen et al. [10] proposed the notion of finite state automata with three disjoint final states. We extend their definition to finite tree automata. A 3-valued deterministic finite tree automaton (3DFT) is a tuple $A = (\Sigma, Q, \Delta, Accept, Reject, Unknown)$ where:

• Σ is a ranked alphabet,

• Q is a finite set of states,

• $Accept \subseteq Q$ is a finite set of states, disjoint from $Reject$ and $Unknown$,

• $Reject \subseteq Q$ is a finite set of states, disjoint from $Accept$ and $Unknown$,

• $Unknown = Q \setminus (Accept \cup Reject)$,

• $\Delta$ is a set of transition rules of the following form:

$f(q_1, q_2, ..., q_n) \to q$, where $q_1, ..., q_n, q \in Q$ and $f \in \Sigma_n$.

The definition of a run of a 3DFT is the same as that for finite tree automata. We say that a tree $t$ is accepted by a 3DFT $A$ if there is a run $\pi$ such that $\pi(\varepsilon) \in Accept$, and that $t$ is rejected by $A$ if there is a run $\pi$ such that $\pi(\varepsilon) \in Reject$. If $t$ is neither accepted nor rejected by $A$, we say that $t$ is unknown for $A$.
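The three possible verdicts of a 3DFT can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: the toy alphabet (`a`, `b` of rank 0 and `f` of rank 2), the state names, and the function name `classify`. Trees are nested tuples `(symbol, child, ...)`. A tree for which no rule applies is neither accepted nor rejected, so the sketch reports it as unknown.

```python
# A toy 3DFT (hypothetical): rules map (symbol, child states) to a state.
DELTA = {
    ("a", ()): "qa",
    ("b", ()): "qb",
    ("f", ("qa", "qa")): "q_mal",
    ("f", ("qb", "qb")): "q_ben",
    ("f", ("qa", "qb")): "q_mix",
}
ACCEPT = {"q_mal"}
REJECT = {"q_ben"}
# Unknown is implicitly every other state: Q \ (Accept ∪ Reject).


def run(tree, delta):
    """Bottom-up evaluation; returns the root state or None if stuck."""
    symbol, *children = tree
    states = []
    for child in children:
        state = run(child, delta)
        if state is None:
            return None
        states.append(state)
    return delta.get((symbol, tuple(states)))


def classify(tree, delta, accept, reject):
    """Return 'accept', 'reject', or 'unknown' for the given tree."""
    state = run(tree, delta)
    if state in accept:
        return "accept"
    if state in reject:
        return "reject"
    return "unknown"
```

For instance, `classify(("f", ("a",), ("a",)), DELTA, ACCEPT, REJECT)` yields `"accept"`, while a tree whose root state lies outside both sets, or for which the run gets stuck, yields `"unknown"`.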

### 4.3 Semantics Extraction

As we know, programs have to invoke system calls to acquire system resources. We can discover the intention of a malware instance by observing the executed system calls. The dependence relations between system calls are also important. It is easy to reorder system calls that have no dependence relations, whereas reordering system calls that do have dependence relations always changes the program semantics. Therefore, a signature that records only the executed system calls but not their dependence relations is easy to evade.

We represent a system call data-flow dependence graph as a behavior graph. A behavior graph is a directed acyclic graph $G = (V, E)$:

• $V$ is a set of vertices, where each vertex is a system call $s \in S$ (with $S$ the set of system calls).

• $E$ is a set of directed edges. A directed edge $\langle v_1, v_2 \rangle$ from $v_1 \in V$ to $v_2 \in V$ represents a dependence relation: the input arguments of $v_2$ depend in some way on the outputs of $v_1$.

For example, Figure 4.3 shows a behavior graph for the malware Allaple.b, redrawn from [18]. The directed edge from NtOpenFile to NtCreateSection means that an input argument of NtCreateSection depends on the output of NtOpenFile. In this example, the dependence relation between NtOpenFile and NtCreateSection is the shared file handle.

Figure 4.3: A behavior graph. Its vertices are the system calls NtCreateProcessEx, NtOpenFile, NtQuerySection, NtQueryInformationProcess, NtCreateSection, NtCreateThread, NtResumeThread, and GetModuleFileName.

### 4.4 Graph Parser

For a given behavior graph $G$, we extract all dependence relations and output a tree $t_G$. For each directed edge $\langle v_1, v_2 \rangle$, we create a tree $t_{\langle v_1, v_2 \rangle}$ with root $v_1$, which has a single child $v_2$. All trees generated in this way from a behavior graph $G$ form a set of trees $T_G$. Each $v_i(v_j) \in T_G$ is merged with each $v_i(v_k) \in T_G$, where $v_j \neq v_k$, to form a new tree $v_i(v_j, v_k)$.

We generate two different kinds of trees from a single behavior graph. We call the first kind a one-leveled dependence tree. We create a single tree $t_G$ from a behavior graph $G$ with an artificial root Root, and each tree in $T_G$ is a child of Root. Because the maximal height of the trees in $T_G$ is 2, a one-leveled dependence tree has height at most 3. It is called a one-leveled dependence tree because it captures only the direct dependences in the graph, that is, dependences of depth one. Figure 4.4 shows the one-leveled dependence tree parsed from the behavior graph in Figure 4.3.

Figure 4.4: A one-leveled dependence tree

We call the second kind a two-leveled dependence tree. We expand the trees in $T_G$ by one more level: each $v_i(v_j) \in T_G$ is merged with $v_j(v_k) \in T_G$ to form a new tree $v_i(v_j(v_k))$. As before, we create a single tree $t_G$ from a behavior graph $G$ with an artificial root Root, and each tree in $T_G$ is a child of Root. Because the maximal height of the trees in $T_G$ is now 3, a two-leveled dependence tree has height at most 4. Figure 4.5 shows the two-leveled dependence tree parsed from the behavior graph in Figure 4.3.

Observe that each child of the root of $t_G$ represents a dependence relation, and that there is no inherent order among the children. To avoid the problem that identical dependence relations in a behavior graph may generate different trees, we impose an order on the trees. Each vertex in a behavior graph is a system call, which is a string. We can therefore sort a set of trees with the following rules:

1. Compare two different trees from the root node to the frontier nodes.

2. Nodes are sorted according to the dictionary order of their corresponding system calls.

The trees in Figures 4.4 and 4.5 are sorted according to the above rules.
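The parsing and sorting steps of this section can be sketched as follows. This is a minimal illustration under our own assumptions, not the TALA implementation: a behavior graph is given as a list of edges, trees are nested tuples `(symbol, child, ...)`, the function names are hypothetical, and Python's built-in string comparison stands in for the dictionary order. Apart from the edge from NtOpenFile to NtCreateSection, the edges in the example are made up.

```python
from collections import defaultdict


def successors(edges):
    """Map each vertex to the sorted list of vertices that depend on it."""
    succ = defaultdict(set)
    for v1, v2 in edges:
        succ[v1].add(v2)
    return {v: sorted(ws) for v, ws in succ.items()}


def one_leveled(edges):
    """Merge all edges v(w) with the same source v, then hang them under Root."""
    succ = successors(edges)
    subtrees = [(v, *[(w,) for w in ws]) for v, ws in succ.items()]
    # Tuples compare element by element, i.e. from the root downwards.
    return ("Root", *sorted(subtrees))


def two_leveled(edges):
    """As one_leveled, but each child is expanded by its own direct dependents."""
    succ = successors(edges)
    subtrees = [
        (v, *[(w, *[(x,) for x in succ.get(w, [])]) for w in ws])
        for v, ws in succ.items()
    ]
    return ("Root", *sorted(subtrees))


# Illustrative edges only; they are not the edges of Figure 4.3.
EDGES = [
    ("NtOpenFile", "NtCreateSection"),
    ("NtCreateSection", "NtQuerySection"),
    ("NtCreateSection", "NtCreateThread"),
    ("NtCreateThread", "NtResumeThread"),
]
```

Because both the children of each vertex and the children of Root are sorted, the resulting tree does not depend on the order in which the edges are listed.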

Figure 4.5: A two-leveled dependence tree

### 4.5 Automata Learning Algorithm

We have mentioned in Section 2.1.1 that Drewes proposed two different tree automata learning algorithms. Because $L^*_{fta}$ is the core of our learning algorithm, we take a closer look at it.

### 4.5.1 Learning Algorithm of Drewes

For a set of trees $T$, let $\Sigma(T)$ denote the set of all trees of the form $f(t_1, ..., t_k)$, where $f \in \Sigma_k$ and $t_1, ..., t_k \in T$. Recall that $T(\Sigma)$ denotes the set of all trees over the ranked alphabet $\Sigma$. Let $\square \notin \Sigma$ be a special symbol of rank 0, and let $C(\Sigma)$ be the set of all trees in $T(\Sigma \cup \{\square\})$ with exactly one occurrence of $\square$, which we call contexts over $\Sigma$. The concatenation $c \cdot t$ with $c \in C(\Sigma)$ and $t \in T(\Sigma) \cup C(\Sigma)$ is the tree obtained from $c$ by replacing $\square$ with $t$.
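To make the notions of context, concatenation, and $\Sigma(T)$ concrete, the following sketch encodes trees as nested tuples `(symbol, child, ...)` and writes the special rank-0 placeholder symbol as `"□"`. The encoding and the names `HOLE`, `is_context`, `concat`, and `sigma_of` are our own assumptions for illustration.

```python
from itertools import product

HOLE = ("□",)  # the special rank-0 symbol that marks the hole of a context


def count_holes(t):
    """Count the occurrences of the hole symbol in t."""
    if t == HOLE:
        return 1
    return sum(count_holes(child) for child in t[1:])


def is_context(t):
    """A context contains exactly one occurrence of the hole."""
    return count_holes(t) == 1


def concat(c, t):
    """Compute c · t by replacing the hole of the context c with t."""
    if c == HOLE:
        return t
    return (c[0], *[concat(child, t) for child in c[1:]])


def sigma_of(trees, rank):
    """Compute Σ(T): all trees f(t1, ..., tk) with f of rank k and ti in T."""
    return {(f, *args) for f, k in rank.items() for args in product(trees, repeat=k)}
```

For example, with the context `c = ("f", HOLE, ("a",))`, the call `concat(c, ("b",))` replaces the hole and gives `("f", ("b",), ("a",))`. Note that `sigma_of` always contains the rank-0 symbols, since a symbol of rank 0 needs no subtrees.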

Similar to Angluin's $L^*$ algorithm, Drewes's algorithm uses an observation table $\Omega$ to construct a tree automaton. The observation table $\Omega$ consists of two parts: the upper table $\Omega_U$ and the lower table $\Omega_L$. The rows of $\Omega_U$ are indexed by the trees in $S$, $S \subseteq T(\Sigma)$. The rows of $\Omega_L$ are indexed by the trees in $T$, $T \subseteq \Sigma(S)$. The columns of $\Omega$ are indexed by contexts from a finite set $C \subseteq C(\Sigma)$. We use $Mem_L : T(\Sigma) \to \mathbb{B}$ to represent membership of a tree $t$ in the tree language $L$: if $t \in L$, then $Mem_L(t) = True$; otherwise, $Mem_L(t) = False$. The cell in row $t$ and column $c$ of the observation table is filled with $Mem_L(c \cdot t)$, the membership of the tree $c \cdot t$ in $L$. We use $\langle t \rangle$ to denote the row of $t$ in $\Omega$, and extend this notation to a finite set of trees $T$ by $\langle T \rangle = \{\langle t \rangle \mid t \in T\}$.

The idea behind Angluin's $L^*$ algorithm is to construct an automaton by exploiting the Myhill-Nerode congruence of the target language. The Myhill-Nerode congruence $\equiv_L$ on $T(\Sigma)$ is defined as follows: $t \equiv_L t'$ iff for all $c \in C(\Sigma)$, $Mem_L(c \cdot t) = Mem_L(c \cdot t')$. We say that trees $t$ and $t'$ are equivalent with respect to $C$ iff for all $c \in C$, $Mem_L(c \cdot t) = Mem_L(c \cdot t')$.

To construct the finite tree automaton $A_\Omega$ from $\Omega$, two properties have to hold:

1. $\Omega$ is closed, that is, $\langle t \rangle \in \langle S \rangle$ for every $t \in T$.

2. $\Omega$ is consistent. Let $\Sigma_\square(S) = C(\Sigma) \cap \Sigma(S \cup \{\square\})$. The observation table $\Omega$ is consistent if $\langle c \cdot s \rangle = \langle c \cdot s' \rangle$ for all $c \in \Sigma_\square(S)$ and all $s, s' \in S$ with $\langle s \rangle = \langle s' \rangle$. If $\langle c \cdot s \rangle \neq \langle c \cdot s' \rangle$, then $s$ and $s'$ are not equivalent with respect to $\Sigma_\square(S)$, and there exists a separating context $c$ that witnesses this inequivalence.

After ensuring that $\Omega$ is both closed and consistent, we can construct $A_\Omega = (\Sigma, Q, \Delta, F)$ as follows:

• The set of states $Q$ is $\langle S \rangle$.

• $\langle s \rangle \in F$ if $s \in L$.

• For every tree $t = f(s_1, ..., s_k) \in \Sigma(S)$, the corresponding transition rule is $f(\langle s_1 \rangle, ..., \langle s_k \rangle) \to \langle t \rangle$.
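The construction of an automaton from a closed observation table can be sketched as follows. The code is an illustration under our own assumptions rather than the algorithm's reference implementation: trees are nested tuples `(symbol, child, ...)`, the hole is written `"□"`, a row is the tuple of membership answers over the contexts, and the target language (Boolean terms over `0`, `1`, and `not` that evaluate to true) and all function names are hypothetical.

```python
from itertools import product

HOLE = ("□",)


def concat(c, t):
    """Replace the hole of the context c with the tree t."""
    if c == HOLE:
        return t
    return (c[0], *[concat(child, t) for child in c[1:]])


def row(t, contexts, mem):
    """The row of t: membership of c · t for every context c."""
    return tuple(mem(concat(c, t)) for c in contexts)


def is_closed(S, T, contexts, mem):
    """The table is closed if every row of T already occurs among the rows of S."""
    rows_S = {row(s, contexts, mem) for s in S}
    return all(row(t, contexts, mem) in rows_S for t in T)


def build_automaton(S, contexts, mem, rank):
    """States are the rows of S; one rule per tree f(s1, ..., sk) in Σ(S)."""
    state = {s: row(s, contexts, mem) for s in S}
    states = set(state.values())
    final = {state[s] for s in S if mem(s)}
    delta = {}
    for f, k in rank.items():
        for args in product(S, repeat=k):
            # For a closed table, this row equals the row of some s in S.
            delta[(f, tuple(state[a] for a in args))] = row((f, *args), contexts, mem)
    return states, final, delta


# Hypothetical target language: terms over 0, 1, not that evaluate to true.
RANK = {"0": 0, "1": 0, "not": 1}


def mem(t):
    if t[0] == "0":
        return False
    if t[0] == "1":
        return True
    return not mem(t[1])  # t[0] == "not"


S = [("0",), ("1",)]
T = [("not", ("0",)), ("not", ("1",))]
C = [HOLE]
```

With this table, `is_closed(S, T, C, mem)` holds, and `build_automaton(S, C, mem, RANK)` yields two states, one per distinct row, with the row `(True,)` as the only final state.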

Now, let us describe the $L^*_{fta}$ learning algorithm. As an extension of Angluin's $L^*$, the learning algorithm asks the teacher membership queries and equivalence queries. Initially, the observation table has $S = \emptyset$ and $C = \{\square\}$. The algorithm then constructs an automaton from the observation table and performs an equivalence query. If a counter-example is returned, it updates the observation table. This process is repeated until no counter-example is returned. The pseudocode of the algorithm is presented in Algorithm 4.1.

Algorithm 4.1 $L^*_{fta}$

Input: $\Omega = \{S, T, C\}$, Teacher.

Output: $A_\Omega$.

    loop
        construct A_Ω;
        t := EquivalenceQuery(A_Ω);
        if t = yes then
            return A_Ω;
        else
            update(Ω, t);
        end if
    end loop

When updating the observation table (Algorithm 4.2), we decompose the returned counter-example $t$ from the bottom to the top and obtain a subtree $t'$ that is not in $S$, where $t = c \cdot t'$ for some context $c \in C(\Sigma)$. If $t'$ is also not in $T$, we add $t'$ to $\Omega_L$ and ensure that $\Omega$ is closed. Otherwise, we find the equivalent tree $t_e$ in $\Omega_U$ and replace $t'$ with $t_e$ to get a new tree $t_{new} = c \cdot t_e$. If $Mem_L(t) = Mem_L(t_{new})$, we decompose $t_{new}$ again by the above process. Otherwise, $Mem_L(t) \neq Mem_L(t_{new})$, and we have found a separating context $c$. We add the context $c$ to the observation table and ensure that the table is closed. The function close in Algorithm 4.2 checks whether $\langle t \rangle \in \langle S \rangle$ for every $t \in T$. If there is a tree $t \in T$ with $\langle t \rangle \notin \langle S \rangle$, it moves $t$ from $T$ to $S$.

Algorithm 4.2 update

Input: $\Omega = \{S, T, C\}$, Counter-example $t$.

Output: $\Omega$.

    loop
        decompose t into t = c · t′, where t′ ∈ Σ(S) \ S;
        if t′ ∈ T then
            let s be the unique tree in S with ⟨s⟩ = ⟨t′⟩;
            if membershipQuery(c · s) = membershipQuery(t) then
                t := c · s;
            else
                C := C ∪ {c};
                return close(Ω);
            end if
        else
            T := T ∪ {t′};
            return close(Ω);
        end if
    end loop

This algorithm has several interesting properties. For every tree $t \in T$, there is exactly one tree $s \in S$ such that $\langle s \rangle = \langle t \rangle$. In other words, no redundant information is recorded, and therefore there is no need to check whether the table is consistent. Moreover, it can be assured that the number of contexts is no more than the number of states in $\Omega_U$. The $L^*_{fta}$ algorithm outputs a finite tree automaton $A = (\Sigma, Q, \Delta, F)$ with complexity $O(r \cdot |Q| \cdot |\Delta| \cdot (|Q| + m))$, where $m$ is the maximum size of the counter-examples returned by the teacher and $r$ is the maximum rank of the symbols in $\Sigma$. The algorithm requires $|Q| + |\Delta| + 1$ equivalence queries and $m + |Q| \cdot (|\Delta| + 1)$ membership queries. As mentioned in [14], the major disadvantage of $L^*_{fta}$ is the number of equivalence queries.

### 4.5.2 Tree Automata Learning Algorithm

Now, we show how to learn a 3DFT by adapting Drewes's $L^*_{fta}$ algorithm. We are given a set of positive examples (malware) and a set of negative examples (benign programs). To use Drewes's learning algorithm, we need a teacher to answer membership queries and equivalence queries. We therefore simulate the teacher with the given positive and negative examples, which are trees rather than behavior graphs.

For membership queries, we check whether a given tree $t$ belongs to the positive examples or the negative examples, that is, whether $t$ is an element of the respective set. If $t$ is in the positive examples only, the query returns true. If $t$ is in the negative examples only, it returns false. If $t$ is in both the positive and the negative examples, or in neither of them, it returns unknown.

Regarding the last case of membership queries, it is possible that the positive examples and the negative examples have common members. This happens when a malware sample is aware of being analyzed and pretends to be a benign program; the generated behavior graph is then identical to that of a benign program.

For equivalence queries, we check whether the samples in the positive examples are accepted and the samples in the negative examples are rejected. A tree $t$ that is in both the positive and the negative examples must be identified as unknown. If a sample violates these rules, it is returned as a counter-example. We check the positive samples first and then the negative samples.
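The simulated teacher can be sketched as two small functions. This is only an illustration of the rules above, not the TALA code: the function names are hypothetical, the examples are kept in lists so that positive samples are checked before negative ones, `None` stands for the answer unknown, and the hypothesis automaton is abstracted as a function `classify` that returns `"accept"`, `"reject"`, or `"unknown"`.

```python
def membership_query(tree, positives, negatives):
    """Return True, False, or None (unknown) for the given tree."""
    in_pos = tree in positives
    in_neg = tree in negatives
    if in_pos and not in_neg:
        return True
    if in_neg and not in_pos:
        return False
    return None  # in both sets, or in neither of them


def equivalence_query(classify, positives, negatives):
    """Return the first sample that the hypothesis misclassifies, or None."""
    for t in positives:
        expected = "unknown" if t in negatives else "accept"
        if classify(t) != expected:
            return t
    for t in negatives:
        expected = "unknown" if t in positives else "reject"
        if classify(t) != expected:
            return t
    return None
```

A returned tree plays the role of the counter-example that is passed to the table update, and a result of `None` from `equivalence_query` corresponds to the teacher answering yes.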

### Chapter 5

### Implementation and Experiments

### 5.1 Implementation

We implemented a prototype, TALA (Tree Automata Learning Algorithm), based on the 3DFT learning algorithm, and we take it as our malware detector. TALA is written in C++ and currently provides the following functionalities.

• 3DFT learning algorithm.

• Parsing from a graph to a tree.

• Tree-language membership testing.

Inside TALA, we adapt the library libSFTA [3]. libSFTA is a symbolically encoded finite tree automata library that supports basic automata operations. The term symbolically encoded means that it uses multi-terminal binary decision diagrams (MTBDDs) to represent the transition functions of tree automata. The alphabet symbols of the transition functions are encoded as boolean variables. More implementation details can be found in [3].

### 5.2 Experiments

To test the capability of our malware detector, we performed several experiments. Our detector requires a set of malware instances and a set of benign programs. As input data, we use the data that are publicly available on Babic's website [2]. They are provided as system call data-flow dependence graphs, which were generated using the tool designed by Daniel Reynaud and the tracing library libwst. The provided graphs were generated from 2632 malware instances and 35 benign programs. We directly use these data as the behavior graphs in our experiments.

The positive examples (malware samples) are pre-classified into 48 malware families. As mentioned in [6], the classification methods are based on the work of Christodorescu et al. [11] and Fredrikson et al. [15]. Table 5.1 lists all 48 malware families and the number of samples in each family.

Table 5.1: The 48 malware families and the number of samples in each

| Malware Family | Samples | Malware Family | Samples |
|---|---|---|---|
| ABU,Banload | 16 | Hupigon,AWQ | 219 |
| Agent,Agent | 42 | IRCBot,Sdbot | 66 |
| Agent,Small | 15 | LdPinch,LDPinch | 16 |
| Allaple,RAHack | 201 | Lmir,LegMir | 23 |
| Ardamax,Ardamax | 25 | Mydoom,Mydoom | 15 |
| Bactera,VB | 28 | Nilage,Lineage | 24 |
| Banbra,Banker | 52 | OnlineGames,Delf | 11 |
| Bancos,Banker | 46 | OnLineGames,LegMir | 76 |
| Banker,Banker | 317 | OnLineGames,Mmorpg | 19 |
| Banker,Delf | 20 | OnLineGames,OnlineGames | 23 |
| Banload,Banker | 138 | Parite,Pate | 71 |
| BDH,Small | 5 | Plemood,Pupil | 32 |
| BGM,Delf | 17 | PolyCrypt,Swizzor | 43 |
| Bifrose,CEP | 35 | Prorat,AVW | 40 |
| Bobax,Bobic | 15 | Rbot,Sdbot | 302 |
| DKI,PoisonIvy | 15 | SdBot,Sdbot | 75 |
| DNSChanger,DNSChanger | 22 | Small,Downloader | 29 |
| Downloader,Agent | 13 | Stration,Warezov | 19 |
| Downloader,Delf | 22 | Swizzor,Obfuscated | 27 |
| Downloader,VB | 17 | Viking,HLLP | 32 |
| Gaobot,Agobot | 20 | Virut,Virut | 115 |
| Gobot,Gbot | 58 | VS,INService | 17 |
| Horst,CMQ | 48 | Zhelatin,ASH | 53 |
| Hupigon,ARR | 33 | Zlob,Puper | 64 |

The negative examples (benign programs) consist of 35 samples of frequently used applications. The complete list is given in Table 5.2.

Table 5.2: The list of benign programs

| | | |
|---|---|---|
| Adobe Reader | Apple Software Update | Autoruns |
| Battle for Wesnoth | Chrome | Chrome Setup |
| Copy to system folder | Firefox | Freecell |
| Freeciv | Freeciv server | GIMP |
| Google Earth | Hello world | Internet Explorer |
| iTunes | Minesweeper | MSN Messenger |
| Netcat port listen | Netcat port scan | NetHack |
| Notepad | OpenOffice Writer | Outlook Express |
| ping | Self extracting archive | Skype |
| Solitaire | System information | Task Manager |
| Tux Racer | uTorrent | VLC |
| Windows Media Player | WordPad | |

We have tested the detection ability and classification ability of our detector. The procedure of testing detection ability is depicted as follows.

1. The samples in each malware family are separated into two disjoint sets, one for learning and one for testing.

2. The benign programs are also separated into learning sets and testing sets.

3. We parse each sample into a tree.

4. We use the learning samples to learn a 3DFT for each malware family.

5. We use the testing samples to test detection ability.

As mentioned in Section 4.4, we extract the dependences from a behavior graph and build a tree for the graph. In our experiments, we parse two types of trees from each graph. The first type has height at most 3 and is defined in Section 4.4 as the one-leveled dependence tree. The second type has height at most 4 and is defined in Section 4.4 as the two-leveled dependence tree. For each type of dependence tree, we ran 4 experiments with different combinations of learning samples and testing samples. The ratios of the number of samples in the learning set and the testing set are presented in Table 5.3.

Table 5.3: Experiments for testing detection ability

| | Positive Examples: Learning | Positive Examples: Testing | Negative Examples: Learning | Negative Examples: Testing |
|---|---|---|---|---|
| Experiment 1 | 80% | 20% | 100% | 0% |
| Experiment 2 | 80% | 20% | 80% | 20% |
| Experiment 3 | 50% | 50% | 100% | 0% |
| Experiment 4 | 50% | 50% | 50% | 50% |

The procedure of testing classification ability is as follows.

1. We use the automata for the malware family Banker,Banker that were generated in the above experiments.

2. We take the samples in the other malware families as testing samples.

3. We test these samples with the automata.

Table 5.4: Experiments for testing classification ability

| | Positive Examples: Learning | Positive Examples: Testing | Negative Examples: Learning | Negative Examples: Testing |
|---|---|---|---|---|
| Experiment 5 | 80% | 20% | 100% | 0% |
| Experiment 6 | 50% | 50% | 100% | 0% |