National Chengchi University (國立政治大學)
Chapter 1 Introduction
1.1 Background and Motivation
Malicious behavior poses a serious problem to the security of computer systems.
As online technologies advance, cloud computing has become increasingly popular, and many applications can now be synchronized across platforms through the cloud. This convenience also gives attackers more opportunities for intrusion, so security has become more important.
Data collected on a computer (such as system calls, memory information, and the libraries that applications invoke) accumulate quickly, and big data analytics, which addresses both data volume and variety, offers an attractive way to analyze these massive logs and system information. Our objective is to analyze patterns in these data and synthesize rules that identify malicious behavior. However, the data are both large and complex; to a human they are just collections of numbers. Analyzing them manually is not only time-consuming but also inefficient. Many previous studies on malicious behavior detection therefore analyze data with machine learning techniques, such as neural networks, support vector machines, and k-means, which can automatically derive rules that distinguish patterns in large amounts of data through iterative and adjustable computation. These rules can help us to
identify known and unknown malicious behaviors. Unfortunately, analyzing a large data set in this way usually requires substantial computing power.
With the widespread use of cloud computing and the arrival of big data, both the amount of data and the computation it requires are often very large and demand substantial computing resources. Machine learning tools that can be deployed only on a single computer can hardly keep up with this growth. Distributed computing techniques are needed to harness the many resources available in the cloud to speed up and scale the computation.
In this work, we propose a new distributed algorithm for learning outliers from massive samples with unknown patterns. The algorithm is implemented in Scala and deployed in the cloud.
1.2 Research Method
In this paper, we use a new distributed outlier detection algorithm to analyze the collected data and derive detection rules that can be used to identify normal, malicious, and unknown behaviors.
The original algorithm was proposed by Tsaih and Cheng [35]. However, when the amount of data and the dimensionality of each data point are large, the algorithm becomes complicated and time-consuming. Huang et al. [11] therefore proposed an improved algorithm with an envelope module. Its main idea is to find a nonlinear fitting function surrounded by a constant-width region, which we call the envelope, that contains most of the data. The data far from the fitting function
are not contained in the envelope and are regarded as outliers. Unfortunately, even after we deployed this algorithm to the cloud, it still could not load such a large amount of data. To cope with this, we develop a distributed outlier detection algorithm based on Huang et al.’s work. The distributed algorithm is implemented in Scala, so a single multi-core computer can execute it in parallel. With the Akka toolkit, it can also be deployed in the cloud with multiple actors as computation agents, making fuller use of cloud resources and thereby improving efficiency.
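To make the envelope idea concrete, the following is a minimal sketch, not the implementation evaluated in this work. It is written in Python for brevity rather than in Scala with Akka, and a thread pool stands in for the actors. The function `fit`, the helpers `outliers_in_chunk` and `find_outliers`, the parameter `half_width`, and the sample values are all hypothetical; in particular, `fit` is a fixed quadratic standing in for the learned nonlinear fitting function.

```python
from concurrent.futures import ThreadPoolExecutor


def fit(x):
    # Hypothetical stand-in for the learned nonlinear fitting function.
    return x * x


def outliers_in_chunk(chunk, half_width):
    # A sample (x, y) is inside the envelope when |y - fit(x)| <= half_width;
    # everything else is reported as an outlier.
    return [(x, y) for x, y in chunk if abs(y - fit(x)) > half_width]


def find_outliers(samples, half_width, workers=4):
    # Split the samples into roughly equal chunks, one per worker.
    size = max(1, -(-len(samples) // workers))  # ceiling division
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    # Each worker checks its own chunk (assumption: threads stand in for actors).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: outliers_in_chunk(c, half_width), chunks))
    # Merge the per-chunk results, preserving the original sample order.
    return [sample for part in parts for sample in part]


samples = [(0.0, 0.1), (1.0, 0.9), (2.0, 4.2), (3.0, 15.0), (4.0, 16.1), (5.0, 10.0)]
print(find_outliers(samples, half_width=0.5))  # [(3.0, 15.0), (5.0, 10.0)]
```

Because each chunk is checked independently against the same fitting function, the per-chunk results can simply be concatenated, which is what makes this step easy to distribute. The sketch covers only the envelope test; it does not show how the fitting function itself is learned.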
As for practical impact, we have collected a large amount of training data from real-world malicious software to evaluate the effectiveness of the algorithm. The algorithm learns a nonlinear fitting function that serves as a detection rule. During detection, we use these rules to distinguish normal behaviors from malicious ones.
Behaviors that are far from the fitting function and not contained in the envelope are outliers and are categorized as unknown behaviors.
1.3 Contribution
The purpose of this research is to synthesize rules for detecting malicious behaviors. These rules can be used not only to distinguish all known behaviors (both normal and malicious) but also to identify unknown behaviors. This feature helps us defend against zero-day attacks. To handle large amounts of data in the cloud, we extend the algorithm proposed by Huang et al. into a new distributed algorithm that can be deployed in the cloud. It can enhance the
performance of the algorithm and reduce computing time.
The presented distributed outlier detection algorithm can also be extended to other contexts, such as medical condition monitoring [9] (e.g., heart-rate tracking), activity monitoring (e.g., monitoring phone activity to prevent mobile phone fraud), and pharmaceutical research (e.g., identifying new molecular compositions).
1.4 Content Organization
The rest of this paper is organized as follows. In Chapter 2, we briefly review previous work on malware detection and rule synthesis and give background on common machine learning techniques for clustering and classification.
In Chapter 3, we present the details of our detection strategy and the distributed outlier detection algorithm. In Chapter 4, we report experimental results on simulated data and on real data collected by observing the system calls and processes of virtual machines.