Chapter 2 Related Works

National Chengchi University


and generate rules for detecting malicious behaviors.

2.2 Rule Synthesis

Over the past few years, many studies on malware detection using machine learning have been proposed. Commonly used methods include neural networks (e.g., Self-Organizing Map and Growing Hierarchical Self-Organizing Map), nearest neighbors, Support Vector Machines, and other clustering-based algorithms [7].

2.2.1 Self-Organizing Map and Growing Hierarchical Self-Organizing Map

The Self-Organizing Map (SOM) was developed by Teuvo Kohonen in 1981. SOM is an unsupervised learning method (that is, the input data given to SOM are unlabeled) and is often used for clustering. SOM maintains a map that presents the distribution of the output values (clusters). This map lets SOM present the original high-dimensional data visually in a low-dimensional space, which makes the clusters easier to interpret. SOM has two phases: training and mapping. In the training phase, SOM builds the map from the input data (also called input vectors). In the mapping phase, SOM automatically classifies a new input vector.
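The two phases can be illustrated with a minimal sketch. It assumes a rectangular grid, Euclidean distance, a Gaussian neighborhood, and linearly decaying learning rate and radius; none of these choices come from the cited works, and the names `train_som`, `map_vector`, and `best_matching_unit` are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Training phase: fit a rows x cols map to unlabeled input vectors.

    Assumed schedule (illustrative only): learning rate and neighborhood
    radius both decay linearly with the epoch number.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2
    # One randomly initialized weight vector per grid node.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)
        radius = max(radius0 * (1 - frac), 0.5)
        for x in rng.sample(data, len(data)):  # visit inputs in shuffled order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes close to the winner on the grid are pulled more strongly.
                grid_dist = math.dist(node, bmu)
                influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights


def map_vector(weights, x):
    """Mapping phase: place a new input vector on the trained map."""
    return best_matching_unit(weights, x)
```

After training, input vectors that land on the same or neighboring grid nodes can be read as belonging to the same cluster, which is what makes the map usable for visualization.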

According to [6], SOM has been used in many areas of computer security since its introduction. Many studies use SOM to detect malware or outliers.

In [36], the authors use SOM to visualize viruses in Windows executable files. They find that viruses cannot hide their features from SOM visualization, so virus patterns in infected files can be found with this technique. They create maps for Windows executable files both before and after infection by a virus. These maps can be analyzed visually, and files infected by the same virus show a common pattern. This indicates that each virus family has the same virus mask, which can be used to detect viruses.

Faour et al. [5] couple SOM with Bayesian belief networks to automatically filter intrusion detection alarms and determine whether the network is under attack. They use SOM to cluster attack behaviors and normal behaviors, and then use a Bayesian belief network for classification. Experimental results show that their system filters out most false positives. They further use SOM and the Growing Hierarchical Self-Organizing Map (GHSOM) to find potential attacks [4].

GHSOM is a variant of SOM, a dynamic algorithm proposed by Rauber et al. to overcome the limitations of SOM [24]. A GHSOM forms a multilayer hierarchical structure in which each layer consists of several independent SOMs that grow adaptively. Data are grouped into multiple clusters according to their correlation with each other.

Lee and Yu [15] use GHSOM to cluster the distributions of collected system calls and pick out the clusters that carry attack signatures. They compute statistics for each data cluster and derive detection rules by analyzing the means obtained from GHSOM. These rules help their system detect malicious behaviors.

However, these studies only detect known malicious behaviors in the training corpus. They cannot cope with unknown behaviors.


2.2.2 k-Means

Many studies use k-Means to detect malicious behaviors. Among partition-based clustering methods, k-Means is the most basic.

The number of clusters (k) must be set before clustering starts. k-Means selects representative points (the cluster centers), computes the distance between each data point and every cluster center, and assigns each point to the cluster whose center is nearest. The centers are then recomputed from their assigned points, and these steps repeat until the cluster centers no longer change.
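The loop described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and initial centers drawn at random from the data; the function name `k_means` and its parameters are hypothetical and are not taken from any of the cited studies.

```python
import math
import random


def k_means(data, k, max_iter=100, seed=0):
    """Cluster `data` (a list of equal-length numeric vectors) into k groups.

    Returns (centers, assignment), where assignment[i] is the index of the
    cluster that data[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(data, k)]
    assignment = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest cluster center.
        assignment = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in data]
        # Step 2: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment
```

Because k is fixed up front and the starting centers are random, different values of `k` or `seed` can produce different partitions of the same data.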

In [12], the authors use k-Means to cluster unlabeled data, assuming that data in the same cluster are similar. They analyze the clusters to find patterns that can be used to detect malicious behaviors.

The authors of [34] propose a hybrid detection method that combines k-Means and k-Nearest Neighbors (k-NN). They use k-Means to find cluster centers and treat each cluster center as one kind of malicious behavior. They then form a triangle from two cluster centers and one data point, and use its area to detect similar behaviors. Finally, they use k-NN to detect malicious behaviors based on the triangle areas. Another study [18] also uses a hybrid method, combining k-Means with Naive Bayes classifiers. k-Means clusters the data so that each cluster contains similar behaviors, which lets them separate normal from malicious behaviors.

If a new data point is classified into a cluster that contains malicious activity, it is considered malicious. The work in [20] also uses k-Means with k-NN and Naive Bayes classifiers.
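The triangle-area idea can be sketched under one plausible reading of the description above: each data point is re-described by the areas of the triangles it forms with every pair of cluster centers, and k-NN then votes in that feature space. The exact procedure of [34] is not given here, so the functions `triangle_area`, `triangle_features`, and `knn_predict` are hypothetical illustrations.

```python
import math
from collections import Counter
from itertools import combinations


def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c (Heron's formula, any dimension)."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    s = (ab + bc + ca) / 2
    # Clamp at zero to absorb rounding error for nearly collinear points.
    return math.sqrt(max(s * (s - ab) * (s - bc) * (s - ca), 0.0))


def triangle_features(x, centers):
    """One area per pair of cluster centers, each triangle using x as the third vertex."""
    return [triangle_area(x, c1, c2) for c1, c2 in combinations(centers, 2)]


def knn_predict(train_features, train_labels, query, k=3):
    """Majority label among the k training vectors nearest to `query`."""
    nearest = sorted(range(len(train_features)),
                     key=lambda i: math.dist(train_features[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```

With n cluster centers, each data point becomes a vector of n(n-1)/2 areas, so the dimensionality of the k-NN input depends on the number of clusters rather than on the number of original features.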

Because k-Means requires k to be set in advance, the quality of the result depends on k. A value of k that is too large or too small reduces the effectiveness of the result and the accuracy of detection. These methods can only detect malicious behaviors they have been trained on, or unknown malicious behaviors similar to them.

2.2.3 Other Clustering Algorithms

Other studies use clustering-based algorithms [2, 17, 23, 26]. These algorithms cluster the data, compute the distance between each newly observed data point and each known cluster, and either assign the new point to its nearest cluster or create a new cluster.
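A minimal sketch of this assign-or-create step follows. The condition for creating a new cluster is not specified above, so the sketch assumes a fixed distance `threshold`; the function name `assign_or_create`, the cluster representation, and the running-mean update are likewise illustrative assumptions rather than the method of any cited work.

```python
import math


def assign_or_create(clusters, x, threshold):
    """Assign x to its nearest cluster, or start a new cluster if none is close enough.

    `clusters` is a list of dicts {"center": [...], "count": n} and is
    modified in place. Returns the index of the cluster that received x.
    """
    if clusters:
        idx = min(range(len(clusters)),
                  key=lambda i: math.dist(clusters[i]["center"], x))
        nearest = clusters[idx]
        if math.dist(nearest["center"], x) <= threshold:
            # Update the center as the running mean of all members so far.
            nearest["count"] += 1
            nearest["center"] = [m + (xi - m) / nearest["count"]
                                 for m, xi in zip(nearest["center"], x)]
            return idx
    # No cluster yet, or the nearest one is too far away (assumed rule).
    clusters.append({"center": list(x), "count": 1})
    return len(clusters) - 1
```

The outcome is sensitive to `threshold` and to the order in which points arrive, which is one way to see why a purely distance-based decision can be fragile.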

However, using distance alone to decide cluster membership has limited effectiveness and is not very rigorous.

2.2.4 Support Vector Machines

Support Vector Machines (SVMs) are a supervised machine learning method proposed by Cortes and Vapnik [3] and are often used for classification. Given labeled input data, an SVM finds a maximum-margin hyperplane that separates the classes; in general, a larger margin gives better accuracy on new data. The data points lying on or closest to the hyperplane are called support vectors. When new unlabeled data arrive, the SVM uses the hyperplane and support vectors to predict which class they belong to.
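For reference, the maximum-margin idea can be written down in its simplest form. The formulation below assumes the linear, hard-margin case with labels y_i in {-1, +1}; the symbols w, b, x_i, y_i, and n are notation introduced here for illustration and do not appear in the cited works as quoted.

```latex
% Hard-margin linear SVM: find the separating hyperplane w^T x + b = 0
% with the largest margin (assumed simplest case, linearly separable data).
\min_{\mathbf{w},\, b} \; \frac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n

% Width of the margin, and the rule used to classify a new point x.
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert},
\qquad
f(\mathbf{x}) = \operatorname{sign}\!\left( \mathbf{w}^{\top} \mathbf{x} + b \right)
```

Minimizing the norm of w is equivalent to maximizing the margin, and the support vectors are the training points for which the constraint holds with equality.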

Mukkamala et al. [19] use SVMs and neural networks separately to distinguish normal behaviors from malicious behaviors, and compare the two methods. Their experimental results show that both achieve high detection accuracy, but SVMs outperform neural networks.


SVMs achieve higher accuracy with less training time. Sahs et al. [29] and Rieck [25] therefore use SVMs to detect malicious behaviors. They train SVMs on labeled data covering both the normal and the malicious behaviors in their databases, so that the SVMs can predict which kind of behavior new data belong to. However, the drawback of this approach is that the behaviors it can detect are limited to those the SVMs have learned.

Unfortunately, the methods mentioned above have several drawbacks. First, they detect only the known malicious behaviors in the training corpus. For unknown behaviors, they can detect only those similar to behaviors already in their model. Second, they are not suitable for handling large amounts of data. To address these problems, we use the outlier detection algorithm proposed by Huang et al. and extend it into a new parallel algorithm that improves performance and reduces execution time.