National Chengchi University
Department of Management Information Systems
Master's Thesis

Rule Synthesis for Malicious Behavior Detection
(惡意行為檢測規則生成之研究)

Advisors: Dr. 蔡瑞煌, Dr. 郁方
Student: 彭雅筠
July 2015 (ROC 104)

Abstract

Malicious behavior that has unknown patterns poses a great challenge to the security mechanisms of computers. Without effective detection rules, tools that monitor system behaviors may fail to identify unknown attacks. The threats extend to cloud systems, even those equipped with VMMs that can collect much larger and more detailed online system and operation information in a virtualization environment than a traditional PC system. It is essential to be able to pick abnormal behavior out of a large data set in order to detect unknown attacks. To address this issue, we propose a new distributed outlier detection algorithm that characterizes the majority pattern of observations as a backpropagation neural network and derives detection rules to reveal abnormal samples that fail to fall into the majority. Specifically, the rules generated by the algorithm can be used to distinguish samples as outliers that violate the patterns of known attacks and normal behaviors, and hence to identify unknown attacks and reform their patterns. With distributed computing we can enhance the performance of the algorithm and handle huge amounts of data.

Keywords: Malicious behavior, Outliers, Distributed computing, Learning algorithm, Anomaly detection

Contents

Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Research Method
  1.3 Contribution
  1.4 Content Organization
Chapter 2 Related Works
  2.1 Malware Detection: Common Detection Tools/Methods
  2.2 Rule Synthesis
    2.2.1 Self-Organizing Map and Growing Hierarchical Self-Organizing Map
    2.2.2 k-Means
    2.2.3 Other Clustering Algorithm
    2.2.4 Support Vector Machines
  2.3 Algorithm Optimization
Chapter 3 Methodology
  3.1 Detection Strategy
  3.2 Parallel Computation
  3.3 Distributed Computation
Chapter 4 Experiment
  4.1 Evaluation with Nonlinear Function
  4.2 Real-World Experiment and Analysis
  4.3 Discussion
Chapter 5 Conclusion
References

List of Figures

Figure 1. Process of the outlier detection algorithm
Figure 2. System call distribution (part)
Figure 3. Iterations of each stage in testing data0_4
Figure 4. Iterations of each stage in testing data3_4
Figure 5. Libraries that each process used (part)

List of Tables

Table 1. Notation definitions
Table 2. The parallel backpropagation algorithm
Table 3. Add extra hidden nodes (part of the outlier detection algorithm)
Table 4. Results of correctness test
Table 5. Results of performance test (sequential and parallel version)
Table 6. Results of performance test (distributed version)
Table 7. The experimental data for test 1
Table 8. Result of test 1
Table 9. The experimental data for test 2
Table 10. Result of test 2

Chapter 1 Introduction

1.1 Background and Motivation

Malicious behavior poses a serious problem to the security of computer systems. As online technologies become more and more advanced, cloud computing has become increasingly popular, and many applications can be used across platforms synchronously through the cloud. This convenience has also made it easier for hackers to find opportunities for invasion. Therefore, security has become more important.

By collecting data on a computer (such as system calls, memory information, and the libraries that applications invoke) and accumulating large amounts of data, big data analytics that addresses data volume and variety poses an attractive solution for analyzing these massive collected logs and system information. Our objective is to analyze the patterns of these data and synthesize rules to identify malicious behavior. However, these data are both large and complex; to a human they are just a jumble of numbers. Analyzing them manually is not only time-consuming but also inefficient. Thus many past studies on malicious behavior detection analyze data using machine learning techniques, such as Neural Networks, Support Vector Machines, and k-means, which can be leveraged for automatic reasoning on rules that distinguish patterns in large amounts of data through iterative and adjustable computation. These rules can help us to

identify known and unknown malicious behaviors. Unfortunately, it usually requires powerful computation to realize this idea on a large set of data. As cloud computing has become widely used and big data has arrived, the amount of data, and the computation on it, is often quite large. Such huge data and computation need a great deal of computing resources. Machine learning tools that can only be deployed on a single computer can hardly keep up with the growth in the amount of data. Distributed computing techniques are needed to leverage the great number of resources in the cloud to speed up and scale the computation. In this work, we propose a new distributed algorithm for learning outliers from massive samples that have unknown patterns. The algorithm is implemented in Scala and deployed in the cloud.

1.2 Research Method

In this paper, we use a new distributed outlier detection algorithm to analyze collected data and derive detection rules that can be used to identify normal, malicious, and unknown behaviors. The original algorithm was proposed by Tsaih and Cheng [35]. However, when the amount of data and the dimensionality of each datum are large, the algorithm becomes complicated and time-consuming. Huang et al. [11] thus proposed an improved algorithm with the envelope module. The main idea of this algorithm is to find a nonlinear fitting function with a constant-width area around it such that most of the data lie in this area (we call this area the envelope). The data far from the fitting function

are not contained in the envelope and are regarded as outliers. Unfortunately, even after we deployed the algorithm to the cloud, it still could not load such a great deal of data. To cope with this, we further improved the algorithm into a distributed one.

We develop a distributed outlier detection algorithm based on Huang et al.'s work. The distributed algorithm is implemented in Scala, so that a single multi-core computer can execute it in parallel. With the Akka framework, it can be deployed with multiple actors as computation agents in the cloud, making full use of all the resources of the cloud for higher efficiency.

As for the practical impact, we have collected a large amount of training data from real-world malicious software to evaluate the effectiveness of the algorithm. The algorithm learns a nonlinear fitting function as a detection rule. When detecting, we can use the rules to distinguish normal behaviors from malicious behaviors. Those behaviors that are far from the fitting function and not contained in the envelope are outliers and are categorized as unknown behaviors.

1.3 Contribution

The purpose of this research is to synthesize rules for detecting malicious behaviors. These rules can be used not only to distinguish all known behaviors (including normal and malicious behaviors) but also to flag unknown behaviors. This feature allows us to prevent zero-day attacks. In order to handle large amounts of data in the cloud, we improve the algorithm proposed by Huang et al. into a new distributed algorithm that can be deployed in the cloud. It can enhance the

performance of the algorithm and reduce computing time. The presented distributed outlier detection algorithm can also be extended to other contexts, such as medical condition monitoring [9] (e.g., heart-rate tracking), activity monitoring (e.g., monitoring phone activity to prevent mobile phone fraud), and pharmaceutical research (e.g., identifying new molecular compositions).

1.4 Content Organization

The rest of this paper is organized as follows. In Chapter 2 we briefly review previous work on malware detection and rule synthesis and give background knowledge on common machine learning techniques for clustering and classification. In Chapter 3, we present the details of our detection strategy and the distributed outlier detection algorithm, and in Chapter 4, we report the experimental results with simulation data and real data collected by observing the system calls and processes of virtual machines.

Chapter 2 Related Works

2.1 Malware Detection: Common Detection Tools/Methods

In the past few years, cloud computing has grown continually. However, "Security is one of the major issues which hamper the growth of cloud." [33]. Surveys indicate that the most worrying issue is security, followed by performance. Therefore, if intrusions can be detected, data on the cloud will be better protected.

To protect a cloud system from threats, the most common approach is to intercept system calls. When user programs require access to system resources, they must send requests to the operating system via system calls. According to [10, 13], using sequences of system calls to identify malicious behaviors is an effective method. First, they record system call sequences from normal behaviors. Then, using the concept of a sliding window, they build a short system call sequence of the same length as the sliding window at each slide. Finally, all these short sequences are stored in a database, as a basis for comparison when detecting.

Another powerful technique is virtual machine introspection (VMI) [8, 21]. Unlike previous research that collects system call sequences, VMI can observe the internal behavior of virtual machines from the outside through the virtual machine monitor (VMM). VMI can thus obtain much information, such as CPU state, memory state, and I/O device state (e.g., the contents of storage devices and the register state of I/O controllers).

After collecting data, we need machine learning to automatically analyze the data

and generate rules for detecting malicious behaviors.

2.2 Rule Synthesis

Over the past few years, many studies on malware detection using machine learning have been proposed. Commonly used methods are Neural Networks (e.g., the Self-Organizing Map and the Growing Hierarchical Self-Organizing Map), Nearest Neighbors, Support Vector Machines, and other clustering-based algorithms [7].

2.2.1 Self-Organizing Map and Growing Hierarchical Self-Organizing Map

The Self-Organizing Map (SOM) was developed by Teuvo Kohonen in 1981. SOM is an unsupervised learning method (which means the input data given to SOM are unlabeled) and is often used for clustering. SOM has a map used to present the distribution of each output value (cluster). Because of this map, SOM can visually present the original high-dimensional data in a low-dimensional space. The visualized results can clearly convey all clusters. SOM has two phases: training and mapping. In the training phase, SOM creates the map using the input data (also called vectors). In the mapping phase, SOM automatically classifies a new input vector.

According to [6], since the SOM was developed it has been used in many fields of computer security. Many studies use SOM to detect malware or outliers. In [36], the authors use SOM to visualize viruses in Windows executable files. They found that viruses cannot hide their own features from the SOM visualization, so that virus patterns in infected files can be found using the SOM visualization

technique. They create maps for Windows executable files both before and after infection by a virus. Such maps can be analyzed visually, and they find a pattern in files infected by the same virus. It shows that each virus family has the same virus mask, which can be used to detect viruses.

Faour et al. [5] use SOM coupled with Bayesian belief networks to build an automatic filter for intrusion detection alarms that determines whether the network is under attack or not. In that paper, they use SOM to cluster aggressive behaviors and normal behaviors, and then use the Bayesian network for classification. Experimental results show that their system can filter out most of the false positives. They further use SOM and the Growing Hierarchical Self-Organizing Map (GHSOM) to find potential attacks [4].

GHSOM is a variant of SOM, a dynamic algorithm proposed by Rauber et al. to overcome the limitations of SOM [24]. GHSOM renders an overall multilayer hierarchical structure, in which each layer is composed of several adaptive and independently growing SOMs. Data are presented as multiple clusters according to the correlation between each other.

Lee and Yu [15] use GHSOM to cluster collected system call distributions, and pick out clusters having the attack signature. They calculate each data cluster and derive detection rules using the mean analysis from GHSOM. These rules can help their system detect malicious behaviors.

However, these studies only detect known malicious behaviors in the training corpus. They cannot cope with unknown behaviors.

2.2.2 k-Means

There are many studies that use k-means to detect malicious behaviors. Among all partition clustering methods, the most basic is the k-means clustering method. We have to set the number of clusters (k) before clustering starts. k-means finds representative data (referred to as cluster centers), calculates the distance between each datum and all cluster centers, and assigns each datum to the cluster with the shortest distance. The above steps are repeated until the cluster centers do not change.

In [12], they use k-means to cluster unlabeled data. They assume the data in the same cluster are quite similar. They analyze all these clusters to find patterns that can be used to detect malicious behaviors.

[34] propose a hybrid method for detection, combining k-means and k-Nearest Neighbors (k-NN). They use k-means to find some cluster centers, and regard each cluster center as one kind of malicious behavior. They then use two cluster centers and one datum to form a triangle area that can be used to detect similar behaviors. Finally they use k-NN to detect malicious behaviors based on the triangle areas.

Another study [18] also uses a hybrid method, combining k-means and Naïve Bayes classifiers. They use k-means to cluster the data. Each cluster contains similar behavior, so that normal and malicious behaviors can be separated. If a new datum is classified into a cluster with malicious activity, it is considered to be a malicious behavior. [20] also uses k-means with k-NN and Naïve Bayes classifiers.

Because k-means requires a pre-set k, the quality of the result depends on k. A k that is too

big or too small reduces the effectiveness of the result and the accuracy of detection. These methods can only detect malicious behaviors they have been trained on, or unknown malicious behaviors similar to the former.

2.2.3 Other Clustering Algorithm

Other studies use clustering-based algorithms [2, 17, 23, 26]. These algorithms cluster the data, compute the distance between each newly detected datum and each known cluster, and classify the new datum into its nearest cluster, otherwise creating a new cluster. However, using distance alone to decide clusters has limited effect and is somewhat less rigorous.

2.2.4 Support Vector Machines

Support Vector Machines (SVMs) are a supervised machine learning method proposed by Cortes and Vapnik [3], often used for classification. SVMs find a maximum-margin hyperplane (the larger the margin, the higher the accuracy) that separates the different data clusters in the input labeled data. SVMs also record the data on or closest to the hyperplane, which are called support vectors. When new unlabeled data come in, SVMs can use this hyperplane and the support vectors to predict which cluster the new unlabeled data belong to.

Mukkamala et al. [19] use SVMs and a neural network, respectively, to distinguish normal behaviors from malicious behaviors, and compare the two methods. Their experimental results show that both SVMs and the neural network achieve high detection accuracy, but SVMs outperform the neural network.
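Once an SVM is trained, classifying a new sample is only a sign test against the hyperplane. A minimal sketch of that last step (the weight vector `w` and bias `b` here are hypothetical, standing in for the result of SVM training, which is out of scope for this sketch):

```scala
// Illustrative only: classifying with an already-trained maximum-margin
// hyperplane (w, b). The sign of w·x + b decides which side of the
// hyperplane the sample x falls on.
object Hyperplane {
  def side(w: Vector[Double], b: Double, x: Vector[Double]): Int = {
    val score = w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
    if (score >= 0) 1 else -1
  }
}
```

Samples whose score is small in magnitude lie near the margin; the support vectors recorded during training are exactly such borderline points.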

SVMs achieve higher accuracy with less training time. Sahs et al. [29] and Rieck et al. [25] thus use SVMs to detect malicious behaviors. They train SVMs on labeled data, with both normal and malicious behaviors existing in their databases. The SVMs can then predict which kind of behavior new data belong to. However, the drawback of their approach is that the behaviors it can detect are limited to the behaviors the SVMs have learned.

Unfortunately, the methods mentioned above have several drawbacks. To begin with, they only detect known malicious behaviors in the training corpus. As for unknown behaviors, they can still only detect behaviors similar to the existing behaviors in their models. They are also not suitable for handling a great deal of data. To address these problems, we use the outlier detection algorithm proposed by Huang et al., and improve it into a new parallel algorithm with better performance and reduced execution time.

2.3 Algorithm Optimization

Our algorithm primarily uses a Backpropagation Neural Network (BPNN) to learn a fitting function. BPNN [28] is a neural network whose hidden nodes use a nonlinear activation function, which lets it solve non-linearly separable problems. Briefly, BPNN has two phases: forward and backward. In the forward phase, BPNN forwards each input datum, and sequentially calculates the activations of all hidden nodes and output nodes.

BPNN calculates an error value by comparing the desired output and the actual output of all output nodes. In the backward phase, BPNN passes the error value backward to each hidden node, adjusting the weights between output nodes and hidden nodes and the biases of the output nodes. Then, BPNN further passes the error value to each input node. Likewise, BPNN adjusts the weights between hidden nodes and input nodes and the biases of the hidden nodes. BPNN repeats these two phases until the error value is satisfactory.

However, finding the adjustment amount of the weights in each backward phase, and repeating the two phases until the error value is satisfactory, is very time-consuming. Therefore many studies have been proposed for optimizing BPNN [1, 14, 16, 27, 30, 31]. These studies use different methods for improvement, and indicate that their work outperforms the standard BPNN. However, according to the experiments of Schiffmann et al. [31], these optimized algorithms are still not fast enough. Another study speeds up BPNN through parallelization [22]. The experimental results show that parallelization is effective.

We thus adopt parallelization and further use distributed computing to speed up BPNN. The details are illustrated in the next chapter.
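The basic parallelization idea can be sketched with only the Scala standard library: split the per-sample residuals into chunks and compute each chunk's partial error sum on its own thread, then combine. (This is a simplified illustration using `Future`s to stay self-contained; the thesis's own implementation uses Scala's parallel collections, described in Chapter 3.)

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParError {
  // Evaluate the sum of squared residuals chunk by chunk, one Future
  // (thread) per chunk, then combine the partial sums.
  def parallelErrorSum(residuals: Vector[Double], chunks: Int): Double = {
    val size  = math.max(1, residuals.size / chunks)
    val parts = residuals.grouped(size).toSeq
    val futs  = parts.map(part => Future(part.map(r => r * r).sum))
    Await.result(Future.sequence(futs), Duration.Inf).sum
  }
}
```

This works because each sample's squared residual is independent of the others; only the final sum combines them, which is exactly the property the thesis exploits when choosing which parts of BPNN to parallelize.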

Chapter 3 Methodology

This research synthesizes rules that can be used to detect malicious behaviors by using a distributed outlier detection algorithm based on Huang et al.'s work. In this chapter, we first briefly introduce our algorithm and describe how it operates to generate detection rules. Finally, we present our new distributed algorithm that, after our improvements, can be deployed in the cloud.

The outlier detection algorithm is an incremental learning algorithm. Given 𝑁 training data with 𝑚 dimensions and the envelope width (𝜀), it uses BPNN to train on an arbitrary 𝑚 + 1 data and learns an initial Single-hidden Layer Feed-forward Neural Network (SLFN) with one hidden node. It then starts looping and constantly evolves the SLFN by including extra training data one by one in each loop (𝑛) until the termination condition (𝑛 > 𝑁 − 𝑘) is reached, returning the output data (e.g., the weights and thresholds of the final SLFN, and the 𝑘 outliers). In each loop, it uses the current SLFN to calculate the squared residuals of all 𝑁 training data, and orders all squared residuals to determine which training data are to be selected. If any selected training datum lies outside the envelope, namely its squared residual is greater than the envelope width (𝜀), it uses BPNN to adjust 𝑤 (the thresholds and weights) of the SLFN. If, after this evolution of the SLFN, some selected training data still lie outside the envelope, it adds extra hidden nodes. To prevent too many hidden nodes from increasing the complexity, it deletes all unnecessary hidden nodes at the end of each loop. The process of the algorithm is

shown in Figure 1.

We apply this algorithm to detecting malicious behaviors. Our strategy requires prior training and learning, so that it can distinguish between normal and malicious behaviors, and furthermore discover unknown behaviors. The details are described in the next section.
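The envelope test at the heart of each loop can be sketched as follows (a simplified, single-output version; `fit` is a hypothetical stand-in for the current SLFN and `eps` is the envelope width 𝜀):

```scala
// Simplified sketch of the envelope test: a sample whose squared residual
// against the fitted function exceeds the envelope width eps lies outside
// the envelope and is treated as an outlier. Returns the indices of the
// outlying samples.
object Envelope {
  def outliers(xs: Seq[Double], ys: Seq[Double],
               fit: Double => Double, eps: Double): Seq[Int] =
    xs.zip(ys).zipWithIndex.collect {
      case ((x, y), i) if math.pow(fit(x) - y, 2) > eps => i
    }
}
```

In the full algorithm the residuals are additionally ordered, and only the worst offenders trigger weight adjustment or the addition of hidden nodes, but the membership test is this simple comparison.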

Figure 1. Process of the outlier detection algorithm

3.1 Detection Strategy

The outlier detection algorithm continuously learns and constantly evolves the SLFN by including extra input behaviors one by one. Data far from the fitting function are regarded as outliers, which are usually particularly different from the others. With this feature, we use it to discover unknown behaviors.

In our strategy, we use the outlier detection algorithm to identify outliers, and the backpropagation algorithm with the envelope module to learn an SLFN and test new behaviors. The first three steps are designed to let the algorithm learn how to distinguish between the various behaviors. In the remaining steps, the algorithm tells us what kind of behavior the data we received are.

In step 0, if malicious behaviors are quite different from normal behaviors, they are expected to be far from the fitting function and regarded as outliers. We can give different kinds of behavior different desired outputs to make the algorithm learn that they are different kinds of behavior. In steps 4 and 5, we try changing the desired output of new behaviors until we find which kinds of behavior they belong to. If all existing desired outputs have been tried and a behavior is still regarded as an outlier, we give it a new desired output and retrain on all kinds of behaviors. The detailed steps are described below:

Detection Strategy

Step 0. Given a set of behaviors (majority normal behaviors and minority malicious behaviors) with the same desired output (e.g., 0.0), use the outlier detection algorithm to identify outliers.

Step 1. Set the outliers to a new desired output (e.g., 1.0) that differs from the existing desired outputs.

Step 2. Use the backpropagation algorithm with the envelope module to learn an SLFN from these types of behaviors. This SLFN is used in the following steps.

Step 3. Given a new set of behaviors, which may be neither normal nor malicious behaviors.

Step 4. Set these behaviors to one of the existing desired outputs that has not been tried (e.g., 0.0).

Step 5. Based on the SLFN from step 2, use the backpropagation algorithm with the envelope module to test which type of behaviors they are.

Step 6. Return the behavior types. If a behavior is regarded as an outlier, repeat steps 4 and 5 until it is no longer regarded as an outlier; if the behavior is still regarded as an outlier after trying all existing desired outputs, go back to step 1.

This strategy can be used not only to distinguish all known behaviors (including normal and malicious behaviors) but also to flag unknown behaviors. This feature allows us to prevent zero-day attacks.

When the number of input behaviors is small, the outlier detection algorithm learns fast. However, as the amount of behaviors increases, the performance of the outlier

detection algorithm degrades. Improving it with parallel computation alone is still inadequate according to our experimental results. We need both parallel and distributed computation to speed up the outlier detection algorithm.

3.2 Parallel Computation

The most important thing in developing the parallel and distributed computation for our outlier detection algorithm is to avoid access to the same variable. Therefore we apply parallelization in the parts that are not influenced by the calculating sequence of the data. We implement the algorithm in Scala. Scala's parallel collections facilitate parallel programming, so that we can easily speed up the algorithm in a simple way.

The most time-consuming part of our outlier detection algorithm is the BPNN. We thus improve the BPNN first. The parallel backpropagation algorithm and its notation definitions are presented in Table 2 and Table 1. The purpose of BPNN is to learn an optimal SLFN with minimum error. The SLFN is defined in equations (1) and (2), and the error is defined in equation (3).

$$a_i(x, w^H) \equiv \tanh\Big(w_{i0}^H + \sum_{j=1}^{m} w_{ij}^H x_j\Big) \tag{1}$$

$$f_l(x, w^H, w_l^O) \equiv g(a, w_l^O) \equiv w_{l0}^O + \sum_{i=1}^{h} w_{li}^O \tanh\Big(w_{i0}^H + \sum_{j=1}^{m} w_{ij}^H x_j\Big) \tag{2}$$

$$\min_{w} E(w) \equiv \sum_{c=1}^{N} \sum_{l=1}^{q} \Big(f_l(x^c, w^H, w_l^O) - y_l^c\Big)^2 \tag{3}$$
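Equations (1)–(3) can be transcribed directly into code. A sketch for a single output node (q = 1), where `wH(i)` holds the threshold $w_{i0}^H$ followed by the weights $w_{ij}^H$, and `wO` holds $w_{l0}^O$ followed by the weights $w_{li}^O$:

```scala
// Direct transcription of equations (1)-(3) for q = 1 output node.
object Slfn {
  // Eq. (1): hidden-node activation a_i(x, w^H).
  def hidden(x: Vector[Double], wHi: Vector[Double]): Double =
    math.tanh(wHi(0) + x.indices.map(j => wHi(j + 1) * x(j)).sum)

  // Eq. (2): network output f(x, w^H, w^O).
  def output(x: Vector[Double], wH: Vector[Vector[Double]], wO: Vector[Double]): Double =
    wO(0) + wH.indices.map(i => wO(i + 1) * hidden(x, wH(i))).sum

  // Eq. (3): sum of squared residuals E(w) over the training data.
  def error(data: Seq[(Vector[Double], Double)],
            wH: Vector[Vector[Double]], wO: Vector[Double]): Double =
    data.map { case (x, y) => val d = output(x, wH, wO) - y; d * d }.sum
}
```
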

Notations and definitions:

$\tanh(x) \equiv \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ — hyperbolic tangent.
$T$ — the transposition.
$N$ — total number of input behaviors.
$x^c \equiv (x_1, x_2, \dots, x_m)^T$ — explanatory variables of the $c$-th input behavior.
$y^c$ — the vector of desired outputs corresponding to the $c$-th input behavior with explanatory variables $x^c$.
$m$ — the number of explanatory variables $x_j$.
$H$ — the hidden layer.
$h$ — the number of adopted hidden nodes.
$w_{i0}^H$ — the threshold of the $i$-th hidden node.
$w_{ij}^H$ — the weight between the $j$-th explanatory variable $x_j$ and the $i$-th hidden node.
$O$ — the output layer.
$q$ — the number of adopted output nodes.
$w_{l0}^O$ — the threshold of the $l$-th output node.
$w_{li}^O$ — the weight between the $i$-th hidden node and the $l$-th output node.

Table 1. Notation definitions

1. Via a pipeline with p1 threads (or VMs), execute the following forward operation for each training datum in parallel.
   1.1 Via a pipeline with p1.1 threads (or VMs), calculate and record $a_i(x^c, w_i^H)\ \forall i$ and then calculate and record $g(a^c, w_l^O)\ \forall l$.
   1.2 Use a pipeline with p1.2 threads (or VMs) to calculate the net input value of each (hidden or output) node.
2. Via a pipeline with p2 threads (or VMs), and based upon $g(a^c, w_l^O)\ \forall l$, calculate and record $E(w)$.
3. If $E(w)$ is less than the predetermined value (say, $\varepsilon_1$), then STOP.
4. Via a pipeline with p4 threads (or VMs), execute the following backward operation for all training data in parallel.
   4.1 Via a pipeline with p4.1 threads (or VMs), and based upon the values of $a_i(x^c, w_i^H)\ \forall i$ and $g(a^c, w_l^O)\ \forall l$:
      4.1.1 Calculate and record the value of $\partial E^c / \partial w_{l0}^O$ in terms of each training datum.
      4.1.2 Calculate and record the value of $\partial E^c / \partial w_{li}^O$ in terms of each training datum.
      4.1.3 Calculate and record the value of $\partial E^c / \partial w_{i0}^H$ in terms of each training datum. Use a pipeline with p4.1.3 threads (or VMs) to calculate the summation.
      4.1.4 Calculate and record the value of $\partial E^c / \partial w_{ij}^H$ in terms of each training datum. Use a pipeline with p4.1.4 threads (or VMs) to calculate the summation.
5. Via a pipeline with p5 threads (or VMs), and based upon the values obtained from Step 4.1, calculate and record the partial derivatives of $E(w)$ with respect to all thresholds and weights.
6. Via a pipeline with p6 threads (or VMs), and based upon the values obtained from Step 5, calculate the magnitude of the gradient vector. If the magnitude of the gradient vector is less than the predetermined value (say, $\varepsilon_2$), then STOP; otherwise, calculate and record the normalized values of the partial derivatives of $E(w)$ with respect to all thresholds and weights.
7. Via a pipeline with p7 threads (or VMs) and the current $\eta$, update the thresholds

and weights.
8. Via a pipeline with p8 threads (or VMs), execute the following forward operation of each training datum in parallel.
   8.1 Via a pipeline with p8.1 threads (or VMs), calculate and record a_i(x^c, new_w^H_i) ∀ i, and then calculate and record g(a, new_w^O_l) ∀ l.
   8.2 Use a pipeline with p8.2 threads (or VMs) to calculate the net input value of each (hidden or output) node.
9. Via a pipeline with p9 threads (or VMs) and based upon g(a, new_w^O_l) ∀ l, calculate and record new_E(w). If new_E(w) is less than E(w), then η ← η × 1.1, w^H_{i0} ← new_w^H_{i0}, w^H_{ij} ← new_w^H_{ij}, w^O_{l0} ← new_w^O_{l0}, w^O_{li} ← new_w^O_{li}, E(w) ← new_E(w), and GOTO Step 3; otherwise, η ← η × 0.8 and GOTO Step 7.

Table 2. The parallel backpropagation algorithm

As mentioned in the previous chapter, BPNN is divided into two phases: forward and backward. The forward phase involves three steps (Steps 1 to 3). In Step 1, BPNN forwards each input datum and calculates the activations of all hidden nodes and output nodes. In Step 2, BPNN calculates E(w) by comparing the desired output against the actual output of all output nodes. Finally, if E(w) is less than the predetermined value, BPNN stops; otherwise, it goes on to the backward phase. The backward phase involves Steps 4 to 9. In Steps 4 and 5, BPNN calculates and records the gradient vector, and then calculates the magnitude of the gradient vector based upon the values obtained from Step 5. If the magnitude of the gradient vector is less than the predetermined value ε2, BPNN stops; otherwise, it calculates and records the normalized values of the partial derivatives of E(w) obtained from Step 5. After completing the calculations of the partial derivatives and the magnitude of the gradient vector, BPNN updates the thresholds and weights in Step 7. In Step 8, BPNN forwards each

input datum again and calculates new_E(w) based upon the activations of all hidden nodes and output nodes. If new_E(w) is larger than E(w), BPNN reduces the learning rate η and goes back to Step 7; otherwise, it goes back to Step 3.

Since BPNN needs a_i(x^c, w^H_i) to calculate g(a, w^O_l), parallelization can only be applied to the calculations of a_i(x^c, w^H_i) and of g(a, w^O_l) separately. All input data and all weights of the neurons are placed in parallel collections, so that the calculations of a_i(x^c, w^H_i) and g(a, w^O_l) are each executed in parallel. Each layer of the whole neural network is also placed in a parallel collection; the calculations in Step 4 are therefore executed in parallel as well.

After improving the performance of our backpropagation algorithm, we ameliorate other parts of the outlier detection algorithm. Another time-consuming part is adding extra hidden nodes. The algorithm is shown in Table 3.

0. Let ζ ← atanh(0.5) and C_k ≡ {c : c ∈ I(n), c ≠ κ, and x_j^c = x_j^κ ∀ j = 1, …, k}, where κ is the input behavior whose squared residual is greater than the envelope width.
1. Via a pipeline with p1 threads (or VMs), execute the following operation to determine α in parallel.
   1.1 Set β_1 ← 1 and let k ← 2.
   1.2 Via a pipeline with p1.2 threads (or VMs), with β_j, j = 1, …, k−1, already determined, find and record β_k, the smallest integer greater than or equal to 1 such that Σ_{j=1}^{k} β_j (x_j^c − x_j^κ) ≠ 0 ∀ c ∈ {c : c ∈ I(n), c ≠ κ} − C_k.
   1.3 k ← k + 1. If k ≤ m, GOTO Step 1.2.
   1.4 Set α_j = β_j / √(Σ_{j'=1}^{m} β_{j'}²) ∀ j = 1, …, m.
2. Restore the SLFN: w ← w′.
3. Via actors, set λ ← 2^t ∀ t = 0, …, 20 and execute the following operation in different VMs to determine the weights and thresholds of the new hidden nodes.
   3.1 Let two new hidden nodes h+1 and h+2 with w^H_{h+1,0} ← ζ − λ α^T x^κ, w^H_{h+1} ← λα, w^H_{h+2,0} ← ζ + λ α^T x^κ, w^H_{h+2} ← −λα; and w^O_{h+1} ← (y^κ − w^O_0 − Σ_{i=1}^{h} w^O_i a_i(x^κ)) / (2 tanh(ζ)), w^O_{h+2} ← (y^κ − w^O_0 − Σ_{i=1}^{h} w^O_i a_i(x^κ)) / (2 tanh(ζ)).
   3.2 Apply the parallel backpropagation algorithm to adjust w (the thresholds and weights) of the SLFN.
   3.3 If an acceptable result is obtained, record λ. Add the two new hidden nodes h+1 and h+2 to the existing SLFN with the minimum recorded λ, set h ← h + 2, and STOP.

Table 3. Adding extra hidden nodes (part of the outlier detection algorithm)

Both α and λ are important parameters for determining the weights and thresholds of the newly added hidden nodes. Through Step 1 we determine an m-vector α of length one. In Step 3 we find an appropriately positive λ and use BPNN to retrain, in parallel, so that the squared residuals of the input behaviors are less than the envelope width. The formula in Step 3.1 for determining the new weights and thresholds comes from [35], where Tsaih and Cheng give an ample demonstration. Determining the weights and thresholds of new hidden nodes is also very time-consuming.
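Because the candidate widths λ = 2^t, t = 0, …, 20 in Step 3 of Table 3 are evaluated independently, they can be farmed out to separate workers. The thesis does this with Akka actors across VMs; the fan-out pattern can be sketched with standard-library Futures instead (our simplification — `trainWith` is a placeholder for Steps 3.1–3.2, with a toy acceptance rule):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Placeholder for Steps 3.1-3.2: returns Some(lambda) when retraining
// with this lambda yields an acceptable result, None otherwise.
def trainWith(lambda: Double): Option[Double] =
  if (lambda >= 4.0) Some(lambda) else None   // toy acceptance rule

val candidates = (0 to 20).map(t => math.pow(2, t))   // lambda = 2^t

// Fan out one training job per candidate lambda, gather all results.
val results = Await.result(
  Future.sequence(candidates.map(l => Future(trainWith(l)))),
  Duration.Inf)

// Step 3.3: keep the smallest lambda that produced an acceptable result.
val accepted   = results.flatten
val bestLambda = if (accepted.nonEmpty) Some(accepted.min) else None
```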
We improve this with parallel programming, focusing on speeding up the determination of α and λ for the newly added hidden nodes. As mentioned before, all input data are placed in a parallel collection, so that the calculation in

step 1.2 is executed in parallel until the smallest integer β_k greater than or equal to 1 is found. As for Step 3.2, we have already described how BPNN is improved.

In summary, several points should be noted when applying the parallel collections provided by Scala. First, producing a parallel collection takes some time, so we cannot abuse them and should produce only the necessary ones. Second, calculations over parallel collections must not depend on the order in which the data are processed. Finally, all required parallel collections should be produced at the beginning of the program and never changed afterwards. In short, parallel collections must not be used indiscriminately; they must be used with caution. We therefore apply parallel collections to the input data, the neurons of each layer, and the weights of each neuron. This simple approach lets us speed up our program easily. According to our experiments, however, the parallel outlier detection algorithm, while better than the sequential one, is still inadequate. Developing a distributed outlier detection algorithm is imperative.

3.3 Distributed Computation

Besides the parallel collections provided by Scala, which accelerate the execution of our program, distributed computing can improve its performance much further. "Spark is a fast and general engine for large-scale data processing." With the Resilient Distributed Dataset (RDD) proposed by Spark, data are calculated in

memory before being written to disk. This lets users load data into the cluster's memory and query it repeatedly. For iterative algorithms like our outlier detection algorithm, I/O bottlenecks often degrade performance; in-memory computing removes this bottleneck, speeds up our outlier detection algorithm, and is also why Spark runs programs faster than Hadoop.

Spark is also easy to use. It offers many high-level operators — covering scheduling, distributed task scheduling, and basic I/O — that make it easy to build parallel applications, and it can be used interactively from Scala. In a word, Spark is undoubtedly the best way to improve our program.

Each Spark application has a driver program. It connects to one of several types of cluster managers, acquires executors on the nodes of the cluster, and executes various parallel operations on the cluster. Spark's core concept is the Resilient Distributed Dataset (RDD): a fault-tolerant collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel. We therefore need to build a driver program to coordinate the cluster and create RDDs to be operated on across the nodes.

In our distributed version we do not need to modify the algorithm; we simply use the collections provided by Spark to speed up the program. Before creating any RDDs, we first build a SparkConf object and a SparkContext object. A SparkConf object contains the configuration of our Spark application. The SparkContext object is the driver program: it connects to our Spark cluster and is used to create RDDs. Next, we create RDDs from our input data. All calculations against the input data would be

operated on the cluster in parallel. We also transform each layer of the whole neural network into RDDs. In this way, Steps 1 and 4 in Table 2 and Step 1.2 in Table 3 are operated on the cluster in parallel. In the next chapter we present several experiments that evaluate our program.
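Concretely, the driver-program setup described above can be sketched as follows. This is an illustrative skeleton, not the thesis code: the application name, master URL, and input path are placeholders, and `forward`/`weights` stand in for the SLFN of Section 3.2.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configuration and driver program (SparkContext) for the application.
val conf = new SparkConf()
  .setAppName("DistributedOutlierDetection")   // placeholder name
  .setMaster("spark://master:7077")            // placeholder cluster URL
val sc = new SparkContext(conf)

// Turn the input behaviors into an RDD, partitioned across the nodes;
// cache() keeps it in memory across the algorithm's many iterations.
val inputs = sc.textFile("hdfs://cluster/behaviors.txt")   // placeholder path
  .map(line => line.trim.split("\\s+").map(_.toDouble))
  .cache()

// Step 1 of Table 2 then becomes a parallel map over the cluster, e.g.:
// val activations = inputs.map(x => forward(x, weights)).collect()
```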

Chapter 4 Experiment

Having implemented the outlier detection algorithm, we verify the correctness of the program with several sets of 100 simulation data generated by nonlinear functions. We then evaluate the performance of the program (sequential, parallel, and distributed versions) with 100, 200, and 300 simulation data. After the evaluation with nonlinear functions, we experiment with real-world data. Once all these experiments are completed, we deploy the outlier detection algorithm to our system.

4.1 Evaluation with nonlinear functions

We use the nonlinear functions (4) and (5) to generate a set of 100 simulation data each. X is the explanatory variable and is randomly generated from 0.0 to 20.0, Y is the desired output, and Error is normally distributed.

    Y = 0.5 + 0.4X + 0.8e^(−X) + Error     (4)
    Y = 0.5 + 0.35X + 0.8e^(−X) + Error    (5)

We then use the generated data to verify the correctness of the program: if the program finds the outliers, it is working correctly, and we can advance to the next stage to test its performance. The experimental results are shown in Table 4.
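Functions (4) and (5) can be sampled with a few lines (a sketch; the variance of the Error term is not stated in the text, so standard-normal noise is assumed here, and the function name is ours):

```scala
import scala.util.Random

// Sample function (4): Y = 0.5 + 0.4X + 0.8e^(-X) + Error, with X ~ U(0, 20)
// and Error assumed standard normal. Swap 0.4 for 0.35 to obtain function (5).
def sample4(rng: Random): (Double, Double) = {
  val x = rng.nextDouble() * 20.0
  val y = 0.5 + 0.4 * x + 0.8 * math.exp(-x) + rng.nextGaussian()
  (x, y)
}

val rng  = new Random(42)                 // fixed seed for reproducibility
val data = Seq.fill(100)(sample4(rng))    // one set of 100 simulation data
```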

Function | Data size | Data dimension | Execution time (seconds) | Find outliers
(4)      | 100       | 1              | 372.205                  | 5/5
(5)      | 100       | 1              | 7394.357                 | 2/2

Table 4. Results of the correctness test

The correctness test shows that our program indeed identifies all outliers. Having confirmed the correctness of the program, with no errors during its execution, we proceed to the performance test.

In the performance test, we use the nonlinear function (4) to generate sets of 100, 200, and 300 simulation data. Using these data, we test the performance of the sequential and parallel versions on the same computer (4 cores and 8 GB of memory). The results are shown in Table 5.

Data size | Version    | Data dimension | Execution time (seconds) | Total iterations
100       | Sequential | 1              | 751.254                  | 15964
100       | Parallel   | 1              | 372.205                  | 16036
200       | Sequential | 1              | 27881.963                | 36966
200       | Parallel   | 1              | 11199.175                | 37276
300       | Sequential | 1              | 164933.381               | 64410
300       | Parallel   | 1              | 58158.497                | 64620

Table 5. Results of the performance test (sequential and parallel versions)
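The speedup of the parallel version can be checked directly from the Table 5 numbers; it grows with the data size (a quick arithmetic check, values copied from the table):

```scala
// Speedup = sequential execution time / parallel execution time (Table 5).
def speedup(seq: Double, par: Double): Double = seq / par

val s100 = speedup(751.254, 372.205)       // ~2.02 for 100 data
val s200 = speedup(27881.963, 11199.175)   // ~2.49 for 200 data
val s300 = speedup(164933.381, 58158.497)  // ~2.84 for 300 data
```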

According to the results in Table 5, the parallel version has a shorter execution time than the sequential one. The parallel version of the outlier detection algorithm does perform better, but not well enough: it still cannot handle large amounts of data or computation effectively. Next, we test the performance of the distributed version on our cluster (two computers, each with 4 cores and 8 GB of memory); the results are shown in Table 6.

Data size | Data dimension | Execution time (seconds)
100       | 1              | 521.319
200       | 1              | 8139.587
300       | 1              | 45156.248

Table 6. Results of the performance test (distributed version)

The results show that the improvement from distributed computing is not significant (it is even worse) when the data size is small, because Spark takes a moment to build the driver program. As the data size increases, the improvement becomes more and more obvious.

4.2 Real-world experiment and analysis

After completing the full program and affirming its correctness, we collect real-world data and follow Step 0 of the training phase described in Chapter 3 to detect malicious behaviors. We collect two kinds of VM behaviors: the system call distribution and the dynamic link libraries (DLLs) that each process uses.

Test 1: System call distribution

The first kind of data is the system call distribution (the number of times each system call is executed), collected from virtual machines. We execute and monitor various malware samples on different VMs, then collect their system call distributions. Figure 2 shows part of the collected system call distribution.

Figure 2. System call distribution (part)

Each row gives the execution frequencies of the system calls within ten seconds, produced by one VM. The first number is the label of the VM. Each attribute (before the colon) represents a system call, and each value (after the colon) is the execution frequency of that system call.

To facilitate calculation and identification, we give labels to each

VM. There are five different VMs, shown in Table 7: a normal VM and four VMs that each execute a different malware sample.

Labels    | 0      | 1         | 2         | 3         | 4
VMs       | VM0    | VM1       | VM2       | VM3       | VM4
Behaviors | Normal | Malware 1 | Malware 2 | Malware 3 | Malware 4
Sizes     | 720    | 1905      | 4896      | 2104      | 1569

Table 7. The experimental data for test 1

We collected a large amount of data, but we do not need all of it for testing. We use the following two methods to sample the data:

(1) We randomly sample 300 data from a contiguous range (500 data) of VM0, together with 5 data from each of VM1 to VM4, and synthesize 4 testing files (data0_1, data0_2, data0_3, data0_4). Each file has 305 data.

(2) For each of VM1 to VM4, we randomly sample 300 data, together with 5 data from one of the remaining VMs, and synthesize the corresponding testing files (data1_0, data1_2, data1_3, data1_4, data2_0, data2_1, data2_3, data2_4, data3_0, data3_1, data3_2, data3_4, data4_0, data4_1, data4_2, data4_3). Each file has 305 data.

We record the execution time, the average dimension of the data, the number of iterations, and the outliers found in each test. The results are shown in Table 8. Taking data0_4 and data3_4 as examples, we graphically present the number of iterations at each stage in Figure 3 and Figure 4.

[Figure: number of iterations at each training stage (x-axis: stage 0–300; y-axis: iterations)]

Figure 3. Iterations of each stage in testing data0_4

[Figure: number of iterations at each training stage (x-axis: stage 0–300; y-axis: iterations)]

Figure 4. Iterations of each stage in testing data3_4

        | Execution time (seconds) | Average dimension | Total iterations | Find outliers
data0_1 | 714.498                  | 36                | 366              | 4/5
data0_2 | 733.211                  | 36                | 369              | 4/5
data0_3 | 888.878                  | 36                | 360              | 4/5
data0_4 | 848.362                  | 36                | 331              | 5/5
data1_0 | 140489.559               | 36                | 23512            | 0/5
data1_2 | 181084.198               | 36                | 27283            | 0/5
data1_3 | 174962.884               | 36                | 26922            | 0/5
data1_4 | 65763.987                | 36                | 19789            | 0/5
data2_0 | 146232.571               | 36                | 25441            | 0/5
data2_1 | 154121.209               | 36                | 24950            | 4/5
data2_3 | 138515.703               | 36                | 23712            | 0/5
data2_4 | 86871.918                | 36                | 24108            | 1/5
data3_0 | 173386.433               | 36                | 24124            | 1/5
data3_1 | 154401.265               | 36                | 24549            | 0/5
data3_2 | 141195.341               | 36                | 25232            | 0/5
data3_4 | 79095.205                | 36                | 23833            | 1/5
data4_0 | 138873.674               | 36                | 22615            | 0/5
data4_1 | 144335.03                | 36                | 22554            | 0/5
data4_2 | 146454.622               | 36                | 22615            | 0/5
data4_3 | 142752.562               | 36                | 22338            | 0/5

Table 8. Result of test 1

The results show that data0_1, data0_2, data0_3, and data0_4 have better true-negative rates and shorter execution times. As Figure 3 and Figure 4 show, their average iteration counts are much lower than those of the other data sets. This is probably because we sampled their 300 data from a contiguous range (500 data) of VM0 rather than from the original 720 data of VM0: the wide range of the original 720 data is reduced to a continuous, small range of 500 data, which has an effect similar to clustering, dividing the 720 data into two clusters of 500 and 220 data. For this reason, we may later try to cluster the data with GHSOM before training, and then train on one of the clusters to see whether higher accuracy results.

Test 2: Dynamic link libraries (DLLs) that each process has used

The second kind of data is the set of DLLs used by each process, collected from VMs. We execute and monitor various malware samples on different VMs, then collect and record all DLLs that each process in a VM has used. Figure 5 shows part of the file that records the DLLs used by each process.

Figure 5. Libraries that each process used (part)

A row lists the DLLs that one process is using at some moment. The first number is the label of the VM behavior. Each attribute (before the colon) represents a DLL, and each value (after the colon) indicates whether the process uses that DLL: 1 means used, and 0 means unused.

We again give different labels to different VMs. There are three different VMs, shown in Table 9: a normal VM and two VMs that each execute a different malware sample.
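Records in both tests share the sparse "label attr:value …" layout shown in Figures 2 and 5. A small parser illustrates the format (a sketch; the function and variable names are ours, not taken from the thesis code, and the DLL names in the example are hypothetical):

```scala
// Parse one record of the form "label attr:value attr:value ...".
// Returns the VM label and a sparse map from attribute name to value.
def parseRecord(line: String): (Int, Map[String, Double]) = {
  val tokens = line.trim.split("\\s+")
  val label  = tokens.head.toInt
  val feats  = tokens.tail.map { t =>
    val Array(attr, value) = t.split(":", 2)   // split only on the last-field colon
    attr -> value.toDouble
  }.toMap
  (label, feats)
}

// A DLL record: label 6 (VM0), two DLLs used, one unused.
val (label, feats) = parseRecord("6 kernel32.dll:1 user32.dll:0 ntdll.dll:1")
```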

Labels    | 6      | 1         | 2
VMs       | VM0    | VM1       | VM2
Behaviors | Normal | Malware 1 | Malware 2
Sizes     | 1467   | 100       | 25

Table 9. The experimental data for test 2

In test 2, we randomly sample 775 data from VM0 and 25 data from each of the remaining VMs, and synthesize the corresponding testing files (lib6_1 and lib6_2). Each file has 800 data.

We again record the execution time, the average dimension of the data, the number of iterations, and the outliers found in each test. The results are shown in Table 10.

          | Execution time (seconds) | Average dimension | Total iterations | Find outliers
lib6_1    | 29773.487                | 678               | 18983            | 12/25
lib6_1(2) | 24224.704                | 678               | 18142            | 14/25
lib6_2    | 22216.737                | 678               | 16180            | 11/25
lib6_2(2) | 26309.865                | 678               | 17861            | 13/25

Table 10. Result of test 2

The results show that the accuracy of test 2 is about 50% to 60%. As mentioned in the previous section, we can try to add other algorithms to improve the accuracy. We can also try to remove unnecessary attributes, retaining only the influential ones, to see whether higher accuracy results.

4.3 Discussion

According to the performance tests, the current improvement in program performance is not yet sufficient. On the Spark side, we have only used the basic RDD collection to improve performance; Spark provides many other application programming interfaces (APIs) that we can exploit to optimize the program further. For example, we can call persist() or cache() to keep an RDD in memory on the nodes after it is first computed in an action, so that it can be reused and subsequent actions run faster.

Besides the collections we have used, we can also apply multiple actors, as provided by Akka, to determining λ for the newly added hidden nodes (Table 3). Since all candidate λ values follow a regular pattern, we can execute Step 3 of Table 3 on different computers or virtual machines: each computer or VM takes a different λ, determines the weights and thresholds of the new hidden nodes, applies BPNN to adjust w (the thresholds and weights) of the SLFN, and returns its result to the main program.

Another part to which Akka can be applied is deleting all of the potentially irrelevant hidden nodes. The original approach tries to remove each hidden node in turn and then uses BPNN to retrain the input behaviors until an acceptable result is obtained. By applying multiple actors, many computers can delete different hidden nodes and retrain the data at the same time, significantly shortening the time spent on deleting all of the potentially irrelevant hidden nodes.

As for the results of test 1 and test 2, the testing behaviors we collected are relatively heterogeneous: many behaviors that do not belong to malware were also collected.

Therefore, we can try to add other clustering algorithms (such as SOM, GHSOM, or k-means) before starting our training phase, and then take a cluster that gathers only normal behaviors and malicious behaviors for training, to see whether higher accuracy results.
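The pre-clustering idea above can be illustrated with a minimal one-dimensional k-means (our sketch only; the thesis suggests SOM, GHSOM, or k-means but gives no implementation):

```scala
// Minimal 1-D k-means: assign each value to its nearest centroid,
// move each centroid to the mean of its group, and iterate.
def kmeans(xs: Seq[Double], centroids: Seq[Double], iters: Int = 20): Seq[Double] =
  if (iters == 0) centroids
  else {
    val groups  = xs.groupBy(x => centroids.minBy(c => math.abs(x - c)))
    val updated = centroids.map(c =>
      groups.get(c).map(g => g.sum / g.size).getOrElse(c))
    kmeans(xs, updated, iters - 1)
  }

// Two well-separated groups converge to their respective means.
val xs = Seq(1.0, 1.2, 0.8, 10.0, 10.4, 9.6)
val cs = kmeans(xs, Seq(0.0, 5.0)).sorted
```

After clustering, training would proceed on only one of the recovered groups, mimicking the contiguous-range effect observed in test 1.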

Chapter 5 Conclusion

We propose a distributed outlier detection algorithm. Distributed techniques let us optimize the algorithm and improve its performance. This makes the algorithm deployable in the cloud, where cloud resources reduce the time needed to analyze and learn the patterns of data, providing a potential solution for big-data analytics on anomaly detection at scale.

We can derive detection rules that distinguish not only known behaviors (both normal and malicious) but also unknown behaviors. The ability to identify samples that violate known patterns allows us to guard our computer systems against zero-day attacks.

References

[1] Almeida, L., & Silva, F. (1990). Speeding up backpropagation. Adv Neural Comput, 151-158.
[2] Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C., & Kirda, E. (2009, February). Scalable, behavior-based malware clustering. In NDSS (Vol. 9, pp. 8-11).
[3] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
[4] Faour, A., Leray, P., & Eter, B. (2007). Growing hierarchical self-organizing map for alarm filtering in network intrusion detection systems. In New Technologies, Mobility and Security (pp. 631-631). Springer Netherlands.
[5] Faour, A., Leray, P., & Eter, B. (2006). A SOM and Bayesian network architecture for alert filtering in network intrusion detection systems. In Information and Communication Technologies, 2006. ICTTA'06. 2nd (Vol. 2, pp. 3175-3180). IEEE.
[6] Feyereisl, J., & Aickelin, U. (2009). Self-organising maps in computer security. Computer Security: Intrusion, Detection and Prevention, Ed. Ronald D. Hopkins, Wesley P. Tokere, 1-30.
[7] Figueroa-Nazuno, J. Neural Networks: A Comprehensive Foundation. Computación y Sistemas, 4(2), 188-190.
[8] Garfinkel, T., & Rosenblum, M. (2003, February). A Virtual Machine

Introspection Based Architecture for Intrusion Detection. In NDSS (Vol. 3, pp. 191-206).
[9] Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85-126.
[10] Hofmeyr, S. A., Forrest, S., & Somayaji, A. (1998). Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3), 151-180.
[11] Huang, S. Y., Yu, F., Tsaih, R. H., & Huang, Y. (2014, July). Resistant learning on the envelope bulk for identifying anomalous patterns. In Neural Networks (IJCNN), 2014 International Joint Conference on (pp. 3303-3310). IEEE.
[12] Jianliang, M., Haikun, S., & Ling, B. (2009, May). The application on intrusion detection based on K-means cluster algorithm. In Information Technology and Applications, 2009. IFITA'09. International Forum on (Vol. 1, pp. 150-152). IEEE.
[13] Kosoresow, A. P., & Hofmeyr, S. A. (1997). Intrusion detection via system call traces. IEEE Software, 14(5), 35-42.
[14] Kramer, A. H., & Sangiovanni-Vincentelli, A. (1989). Efficient parallel learning algorithms for neural networks. In Advances in Neural Information Processing Systems (pp. 40-48).
[15] Lee, S. W., & Yu, F. (2014, January). Securing KVM-based cloud systems via virtualization introspection. In System Sciences (HICSS), 2014 47th Hawaii International Conference on (pp. 5028-5037). IEEE.

[16] Leonard, J., & Kramer, M. A. (1990). Improvement of the backpropagation algorithm for training neural networks. Computers & Chemical Engineering, 14(3), 337-341.
[17] Leung, K., & Leckie, C. (2005, January). Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science - Volume 38 (pp. 333-342). Australian Computer Society, Inc.
[18] Muda, Z., Yassin, W., Sulaiman, M. N., & Udzir, N. I. (2011, July). Intrusion detection based on K-means clustering and Naïve Bayes classification. In Information Technology in Asia (CITA 11), 2011 7th International Conference on (pp. 1-6). IEEE.
[19] Mukkamala, S., Janoski, G., & Sung, A. (2002). Intrusion detection using neural networks and support vector machines. In Neural Networks, 2002. IJCNN'02. Proceedings of the 2002 International Joint Conference on (Vol. 2, pp. 1702-1707). IEEE.
[20] Om, H., & Kundu, A. (2012, March). A hybrid system for reducing the false alarm rate of anomaly intrusion detection system. In Recent Advances in Information Technology (RAIT), 2012 1st International Conference on (pp. 131-136). IEEE.
[21] Payne, B. D. (2012). Simplifying virtual machine introspection using LibVMI. Sandia Report.
[22] Pethick, M., Liddle, M., Werstein, P., & Huang, Z. (2003, November).

Parallelization of a backpropagation neural network on a cluster computer. In International Conference on Parallel and Distributed Computing and Systems (PDCS 2003).
[23] Portnoy, L. (2000). Intrusion detection with unlabeled data using clustering.
[24] Rauber, A., Merkl, D., & Dittenbach, M. (2002). The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data. Neural Networks, IEEE Transactions on, 13(6), 1331-1341.
[25] Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008). Learning and classification of malware behavior. In Detection of Intrusions and Malware, and Vulnerability Assessment (pp. 108-125). Springer Berlin Heidelberg.
[26] Rieck, K., Trinius, P., Willems, C., & Holz, T. (2011). Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4), 639-668.
[27] Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Neural Networks, 1993. IEEE International Conference on (pp. 586-591). IEEE.
[28] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (No. ICS-8506). California Univ San Diego La Jolla Inst for Cognitive Science.
[29] Sahs, J., & Khan, L. (2012, August). A machine learning approach to Android malware detection. In Intelligence and Security Informatics Conference (EISIC), 2012 European (pp. 141-147). IEEE.

[30] Salomon, R. (1989). Adaptive Regelung der Lernrate bei Back-Propagation. Technische Universität Berlin, FB 20, Institut für Software und Theoretische Informatik.
[31] Schiffmann, W., Joost, M., & Werner, R. (1993, April). Comparison of optimized backpropagation algorithms. In ESANN (Vol. 93, pp. 97-104).
[32] Schmidhuber, J., Pfeifer, I. R., Schreter, Z., Fogelman, Z., & Steels, L. (1989). Accelerated learning in back-propagation nets.
[33] So, K. (2011). Cloud computing security issues and challenges. International Journal of Computer Networks, 11-14.
[34] Tsai, C. F., & Lin, C. Y. (2010). A triangle area based nearest neighbors approach to intrusion detection. Pattern Recognition, 43(1), 222-229.
[35] Tsaih, R. H., & Cheng, T. C. (2009). A resistant learning procedure for coping with outliers. Annals of Mathematics and Artificial Intelligence, 57(2), 161-180.
[36] Yoo, I. (2004, October). Visualizing Windows executable viruses using self-organizing maps. In Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security (pp. 82-89). ACM.

(49)

參考文獻

相關文件

even if bound changes on non-binary variables are present, such a subset can be used like the conflict detecting clause in SAT to represent the conflict in the conflict graph..

In this chapter, we have presented two task rescheduling techniques, which are based on QoS guided Min-Min algorithm, aim to reduce the makespan of grid applications in batch

This objective of this research is to develop water based sol-gel process to apply a protective coating on both optical molds and glass performs, which can effectively prevent glass

This research used GPR detection system with electromagnetic wave of antenna frequency of 1GHz, to detect the double-layer rebars within the concrete.. The algorithm

In this chapter, the results for each research question based on the data analysis were presented and discussed, including (a) the selection criteria on evaluating

However, the CRM research was seldom used by the Science Park logistic industry; this research used structural equation modeling (SEM) to research the relationship in CRM,

The experimental results show that the developed light-on test methodology can effectively detect point defects (bright point, dark point, weak point), line defects (bright line,

The study was based on the ECSI model by Martensen et al., (2000), combined with customer inertia as a mediator in the hope of establishing a customer satisfaction model so as