filterNN: NN-based Feature Selection from Sequential Data (filterNN: 基於神經網路之序列資料特徵選取方法)


National Chengchi University (國立政治大學), Department of Management Information Systems
Master's Thesis
Advisor: Dr. 蕭舜文
filterNN: NN-based Feature Selection from Sequential Data
(filterNN: 基於神經網路之序列資料特徵選取方法)
Graduate student: 吳皓銘
July 2019 (ROC year 108)
DOI:10.6814/NCCU201900745

CONTENTS

I. Introduction
II. Related Work
   II-A. Malware Behavior Representation
   II-B. NN-based Malware Classification
   II-C. NN Feature Extraction
III. Framework Design
   III-A. filterNN
   III-B. Filter
   III-C. Classifier
      III-C1. Convolutional Neural Network
      III-C2. Recurrent Neural Network
IV. Evaluation
   IV-A. Dynamic Analysis Profile
   IV-B. Dataset and Platform
   IV-C. Encoding
   IV-D. filterNN
      IV-D1. CNN Hyper-parameter Selection
      IV-D2. RNN Hyper-parameter Selection
   IV-E. Case studies of filtered APIs
   IV-F. Case study of different learning goals of the filterNN
   IV-G. Case study of the number of 4-grams used to represent a malware before and after filterNN
   IV-H. Case study of the Jaccard distance difference among each group and in the same group
V. Conclusion
   V-A. Future work
References

LIST OF FIGURES

1. The proposed architecture of filterNN.
2. The CNN architecture.
3. The interpretation of LSTM Cells.
4. A standard LSTM cell.
5. An example of malware behavior profile that contains a sequence of Windows API invocation with timestamp, API name, and parameters.
6. A visualization of encoded Windows API call sequences in grey and viridis color.
7. A visualized, encoded malware samples of different malware families.
8. A visualized, encoded malware samples of different malware groups.
9. An example of filtered API alignment.
10. The visualized encoded API sequences of two families and their filters.
11. The average filtered out inputs of each malware family.
12. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 4.
13. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 8.
14. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 12.
15. Filtered API alignment using SLFN filter.
16. Filtered API alignment using SLFN filter.
17. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 75.
18. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 135.
19. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 165.
20. Filtered API alignment using Convolution filter.
21. The average 4-gram used in each group before filterNN.
22. The average 4-gram used in each group after filterNN.
23. The average difference of the 4-gram used in each group.
24. The distance difference among groups.
25. The distance difference of each group.

LIST OF TABLES

I. The Encoding Table of APIs
II. The Encoding Table of APIs
III. Classification Accuracy of filterNN
IV. Parameters used in CNN
V. The classification accuracy for different filter size tested in LeNet and AlexNet
VI. The classification accuracy for different hyper-parameter tested in RNN
VII. Parameters used in filterNN (SLFN filter)
VIII. Parameters used in filterNN (Convolution filter)

Abstract

We design a new neural network (NN) architecture that consists of two parts: a filter and a classifier. We implement three kinds of filters, which can filter out unnecessary input data. The filtered data are fed into the subsequent classifier to achieve the highest training accuracy. Because the filter and the classifier are trained together, the filter keeps the inputs that help the classifier perform the classification. The remaining input data can therefore be viewed as the characteristics of the class. We also design three cost functions that serve different purposes: 1) the filtered (retained) inputs should be as few as possible, 2) the filtered inputs should be as consecutive as possible, and 3) the classifier should achieve the highest possible training accuracy. These three learning goals conflict with each other, so we demonstrate the tuning process used in this research to achieve the best performance. We use text-based sequential data to test the usefulness of the proposed NN architecture. The data are API calls made by malware during execution, collected from the real world. The research shows that the proposed NN architecture is helpful for dealing with text-based sequential data and filters out the characteristics that domain experts need for further analysis.

I. INTRODUCTION

In recent years, the use of neural networks (NNs) for classification has become quite popular. LeCun [1] adopted a convolutional neural network (CNN) for the recognition of handwritten digits. Although such techniques are well developed, it is difficult to understand what information is used for classification. Hence, in this paper, we explore this question by proposing a filter structure in an NN that extracts the important features of the raw data for the subsequent classification. We use a malware behavior database as an example and analyze sequential data to classify malware families.

Conventionally, data analysis requires a lot of human work and time. With the help of an NN, we can feed the data into the network for analysis tasks such as classification. However, for some applications, such as the malware behavior analysis used in this paper, security experts want to know the root cause of an attack or the behavior characteristics of the malware, rather than just performing malware classification. Such needs may not be satisfied by a simple classifier; hence we propose an NN-based characteristics filter for analyzing text-based sequential data, in order to filter useful and human-readable information (the filtered input uses the same encoding as the original input) from the raw data for a subsequent classifier.

Malware behavior extraction is used in this paper to demonstrate the capability of filtering inputs from the original input; the filtered input is then fed into a classifier for malware family classification. Malware variants within the same family should share common behavior that distinguishes them from variants in other families. We anticipate that the proposed filter can correctly extract the behavior characteristics (in terms of Windows API call sequences) of different families, and that these characteristics can be used to better understand the family and to perform classification correctly.

The contributions of this paper are to 1) design an NN framework (filterNN) that embeds a filter structure capable of filtering valuable, human-readable features for subsequent classification, 2) implement different filters and classifiers to construct the most effective filterNN, and 3) apply filterNN to a malware behavior dataset to demonstrate the usefulness of the proposed work and provide a case study of malware characteristic extraction. In our experiments, we find that 1) in the malware behavior dataset example, filterNN can filter out half of the malware behavior data (i.e., 128 consecutive Windows API invocations) that is considered useless for subsequent classification; 2) the filtered input still yields a high classification accuracy of 90+%; and 3) we can extract the commonly shared characteristics of a malware family for further study.

The rest of the paper is organized as follows. In Section II, we review related work. Section III details the design of our proposed NN architecture. In Section IV, the evaluation of the framework is presented.

II. RELATED WORK

First, we provide some background on malware behavior analysis and then discuss NN-based solutions for classification.

A. Malware Behavior Representation

We choose malware behavior classification as our demonstration application because malware behavior is usually represented as text-based sequential data. The data could be a sequence of CPU instructions, system calls [2], C library calls, or Windows API calls. However, in a sequence of calls or events, some of the calls or events are usually not useful for characterizing the instance, and it is difficult for security experts to specify them manually. Since the malware variants are already classified into malware families, we can leverage the classification information to filter the valuable information from the sequential data.

In the literature, dynamic analysis [2] collects runtime information while the program is running. For example, recording the system calls invoked by the program is a commonly seen method. The execution history of a program can be represented by its system call sequence, which is viewed as the characteristics of a program interacting with the operating system. Forrest [2] used short sequences of system calls (i.e., 6 or 11 calls) to represent the normal behavior of a program and to distinguish it from malicious behavior. Hofmeyr [3] collected short sequences of system calls from normal activities in a real-world environment as the profile for detecting abnormal behavior. Lee [4] further utilized data mining methods (the association rules algorithm and the frequent episodes algorithm) to cluster malicious and normal sendmail system call sequences and to distinguish between attacks and normal programs. However, many intrusion activities still go undetected because the models do not take into account all available features of system calls. In particular, some attacks go undetected because the models do not make use of system call arguments. To solve this problem, Kruegel [5] developed an anomaly detection technique that utilizes the information contained in these parameters.

We believe system calls are suitable for establishing a normal program model for anomaly detection, but using system calls alone to describe malicious programs is not sufficient. Therefore, in this paper, we use a higher-level execution profile: the Windows API call sequence. Bayer et al. [6] utilized dynamic analysis to record the system calls and the Windows API calls invoked by a malicious program. They [7] then fed the profiles into a clustering algorithm to detect malicious behavior of the same type. Willems et al. [8] developed CWSandbox, which implements Windows API hooking to retrieve the execution content. Hsiao et al. [9] placed malicious programs in a virtual machine and retrieved the hooked Windows API calls as a profile to calculate similarities between different malware and normal programs. In this paper, we adopt a modified simulator [9] to retrieve Windows API call records as the malware behavior profile (an example of the text-based sequential profile is shown in Section IV).

B. NN-based Malware Classification

Many studies on malware analysis have adopted machine learning or deep learning techniques. Hou et al. [10] proposed a system that learns the features of malicious Android application API calls and performs classification tasks. Grosse et al. [11] utilized a deep neural network (DNN) for static malware analysis, performing two-class classification, while Wang et al. [12] proposed a DNN on the system calls of both malware and benign programs to perform dynamic malware analysis. Wang et al. also demonstrated the compatibility of the algorithm with other image datasets. Dahl et al. [13] also utilized an NN to learn API call features and used random projections to reduce the dimensionality of the original input space before performing binary classification. Gibert [14] surveys past research on the different approaches used to address the problem of malicious software detection and classification. Tesauro [15] and Sornil [16] represent a malware file as a sequence of hexadecimal values with n-gram analysis, using a decision tree, a multilayer perceptron (MLP), and a Support Vector Machine (SVM) to perform the classification. Ravi [17] and Veeramani [18] examine the imported APIs of PE files and count the number of times that each unique API call appears for malware analysis. They employed Naive Bayes, SVM, and decision tree classifiers. Ravi [17] developed a malware detection system that uses 4-gram Windows API call sequences. Saxe [19] built a DNN to detect malware using features extracted from the PE packaging of the file binary. Some works use a recurrent neural network (RNN) for feature extraction and then perform classification with, for example, logistic regression, an MLP [20], or a CNN [21]. Dahl [13] used API calls as inputs of one- and two-layer NNs and also compared the performance with logistic regression. Mtnet [22] uses APIs as the NN's features for classification. In addition, Saxe [19] and Seok [23] proposed DNNs for the static analysis of malware.

These works enriched the features obtained from the malware and perform malware detection or classification with different machine learning or deep learning techniques. However, from the perspective of security analysis, detection and classification may not be enough. The encoded vectors, features, and complex structures do not make it easy for security experts to comprehend the malware. Therefore, the proposed filterNN is meant to extract human-readable features (the filtered input uses the same encoding as the original input of filterNN), which are considered the behavior characteristics of the malware and of its family.

C. NN Feature Extraction

A neural network is often described as a "black box". Many researchers try to extract useful features through their model designs. Vaughan et al. [24] designed Explainable Neural Networks (xNN), which provide a parsimonious explanation of the relationship between the features and the output. In an xNN, each input node corresponds to a subnetwork, so training an xNN is very time-consuming when there are many input nodes. The proposed filterNN, in contrast, trains only one filter and one classifier on the data, which saves considerable training time. Masci et al. [25] introduced Convolutional Auto-Encoders (CAEs) to extract the features of images. Due to the nature of convolution, CAEs can produce regional features of the image. The limitation of CAEs is that they cannot work on text-based data: when an image is compressed, people can still recognize the regional pattern of the image, whereas when text-based sequence data is compressed, it is hard to explain the insights of the compressed data. filterNN, by contrast, can still extract useful information from the input data, and the extracted information can easily be translated back to the original text data. We can explore the insights of the input data from the remaining data.

III. FRAMEWORK DESIGN

A. filterNN

Fig. 1 shows our proposed filterNN architecture. filterNN is composed of two NN structures: a filter and a classifier. The difference between conventional classifiers and filterNN is that the filter structure in filterNN is capable of extracting the characteristics of each class or cluster. The input of filterNN is the encoded profile, which is converted from the profile (i.e., the Windows API calls) into vectors of real numbers. The encoding table of the profile is defined and shown in Section IV.

In filterNN, the input is fed to the filter first, and the filtered inputs are then processed by the subsequent classifier. The filter and the classifier are trained at the same time. Hence, the filter removes the unnecessary or unimportant API calls that the classifier considers unhelpful for classification. We develop three cost functions that 1) reduce the number of filtered (retained) APIs and 2) maintain the sequence information while 3) maximizing the classification accuracy. Maintaining the sequence information means that the filter retains API calls that are as consecutive as possible, which helps security experts interpret the intention of the filtered inputs, i.e., the API calls. Therefore, the filtered API calls can be viewed as the characteristics of the variant.
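To illustrate why the filtered input stays human-readable, the following minimal sketch applies a binary mask to an encoded profile and maps the retained values back to API names. The encoding values, the element-wise product, and the helper names `apply_filter` and `decode` are assumptions made for illustration; only the use of 0.0 as the padding value comes from the text.

```python
# Hypothetical encoding table: API name -> real value in (0, 1].
# The values are made up for illustration; 0.0 is reserved for padding.
ENCODING = {"PAD": 0.0, "CreateFile": 0.25, "LoadLibrary": 0.5, "RegQueryValue": 1.0}
DECODING = {value: name for name, value in ENCODING.items()}


def apply_filter(encoded, mask):
    """Keep encoded[i] where mask[i] == 1 and zero it otherwise.

    The element-wise product is an assumed way of applying the filter output.
    """
    if len(encoded) != len(mask):
        raise ValueError("mask must have the same length as the input")
    return [x * m for x, m in zip(encoded, mask)]


def decode(filtered):
    """Map retained values back to API names; removed slots become None."""
    return [DECODING[x] if x != 0.0 else None for x in filtered]


calls = ["CreateFile", "LoadLibrary", "RegQueryValue", "CreateFile"]
encoded = [ENCODING[c] for c in calls]
mask = [1, 0, 1, 1]                      # 1 = keep, 0 = filter out
filtered = apply_filter(encoded, mask)   # same encoding as the input
readable = decode(filtered)              # API names an analyst can read
```

Because the retained entries keep their original encoded values, a plain table lookup is enough to recover the API names, which is the property the text contrasts with compressed feature vectors.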

B. Filter

The design idea of our filter is to a) retain the characteristics of the filtered inputs in their original form instead of as vectors that humans cannot read, b) filter out API calls that are not helpful for the subsequent classification, and c) perform the classification with high accuracy using the filtered inputs.

In filterNN, we implement three kinds of filters: None, a single-layer feedforward neural network (SLFN), and a convolution filter. The first, the 'None' filter, is a dummy structure that forwards the input features straight to the classifier. The second, the SLFN filter, is a single-layer feedforward neural network with a hard sigmoid activation function. It outputs a binary vector whose length equals that of the original input and that specifies, for each index, whether the corresponding input is filtered out or not. The last, the convolution filter, is one convolution layer with a hard sigmoid activation function, and it outputs the same kind of result as the SLFN filter. The difference between these two filters is that the SLFN is fully connected, so it considers the interaction among all input features, while the convolution filter uses small kernels and considers the interaction among regional input features. The filters are trained together with the NN classifier.

The design principles of the cost functions of a filter are: 1) Z cost: minimize the number of filtered (retained) API calls; 2) C cost: keep consecutive API calls instead of individual API calls in the sequence; and 3) A cost: still retain the high accuracy of the classifier. In general, these principles conflict with each other; therefore, filterNN is expected to fine-tune the weights of the filter to obtain feasible models. As mentioned above, the filter outputs a binary vector that contains only zeros and ones. Each index of the vector indicates whether the corresponding API call will be kept for the subsequent classification (i.e., 1) or will be filtered out (i.e., 0).

The first cost function (Z cost) can be implemented by minimizing the summation of the binary vector (denoted as BV) values as follows.

\[
\text{minimize} \; \sum_{i=1}^{n} \sum_{j=1}^{n'} BV_{i,j} \qquad \text{(Formula 1)}
\]

• n is the number of data
• n' is the length of an input sample
• BV_{i,j} ∈ {0, 1} is the filtered input
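The Z cost of Formula 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: the hard sigmoid is taken to be the commonly used piecewise-linear form clip(0.2x + 0.5, 0, 1), and the 0.5 threshold used to obtain a strictly binary vector is an assumption, as are the function names.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid clipped to [0, 1].

    Slope 0.2 and offset 0.5 are an assumed, commonly used definition.
    """
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def binarize(scores, threshold=0.5):
    """Turn filter activations into a 0/1 vector BV (thresholding is assumed)."""
    return [1 if hard_sigmoid(s) >= threshold else 0 for s in scores]


def z_cost(batch_bv):
    """Formula 1: total number of retained inputs over all samples."""
    return sum(sum(bv) for bv in batch_bv)


bv = binarize([-3.0, 0.0, 4.0, -0.5])           # one sample's filter scores
total = z_cost([[0, 1, 1, 0], [1, 1, 1, 0]])    # two samples, five kept inputs
```

Minimizing `z_cost` alone would drive every entry to 0, which is why it has to be balanced against the other two cost terms.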

The second cost function (C cost) maximizes the number of consecutive 1's in the filtered input so that domain experts can interpret the filtered input more easily. To calculate the C cost, the output binary vector BV (|BV| = n') is duplicated k times to generate a new vector V such that |V| = n − k + 1 and

\[
V[j] = \sum_{i=j}^{j+k-1} BV[i]
\]

The cost function is as follows.

\[
\text{minimize} \; \left( \sum_{i=0}^{n-k+1} BV[i] \right)^{-1} \qquad \text{(Formula 2)}
\]

• n is the number of data

After Formula 2 is minimized, filterNN retains inputs that are as consecutive as possible. k is set to 2 in our experiment.

In the third cost function (A cost), we utilize cross-entropy to minimize the cost between the predictions of the classifier and the labels.

\[
\text{minimize} \; \left( -\sum_{i=1}^{n} label_i \cdot \log(prediction_i) \right) \qquad \text{(Formula 3)}
\]

• n is the number of data

C. Classifier

The output of the filter is fed into the classifier to perform supervised learning. During training and validation, 5-fold cross-validation is adopted, which means that twenty percent of the data (i.e., malware profiles) is used as validation data and eighty percent is used as training data. We implement several classifiers: Logistic Regression (LR), SLFN, CNN, RNN, CNN-RNN (C-RNN), and RNN-CNN (R-CNN).
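The C cost and A cost defined above can be illustrated with a small sketch. It assumes one particular reading of Formula 2, namely that the reciprocal is taken over the sum of the windowed vector V; the function names and the `eps` guard are likewise assumptions for illustration.

```python
import math


def window_sums(bv, k=2):
    """V[j] = BV[j] + ... + BV[j+k-1] for every window of length k."""
    return [sum(bv[j:j + k]) for j in range(len(bv) - k + 1)]


def c_cost(bv, k=2):
    """Reciprocal of the summed window values (assumed reading of Formula 2)."""
    total = sum(window_sums(bv, k))
    return float("inf") if total == 0 else 1.0 / total


def a_cost(labels, predictions, eps=1e-12):
    """Formula 3: cross-entropy between one-hot labels and predicted probabilities."""
    return -sum(l * math.log(p + eps) for l, p in zip(labels, predictions))


v = window_sums([1, 1, 0, 1, 1, 1], k=2)         # windows over one binary vector
cost_c = c_cost([1, 1, 0, 1, 1, 1], k=2)
cost_a = a_cost([0, 1, 0], [0.2, 0.7, 0.1])      # true class has probability 0.7
```

The figure captions list hyper-parameter sets such as "A cost: 1, Z cost: 1, C cost: 4"; how these values weight the three terms in the overall objective is not shown in this section, so no combined loss is sketched here.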

Fig. 1. The proposed architecture of filterNN.

[Figure 2: feature extraction stage with input, convolution (C1), pooling (S1), convolution (C2), and pooling (S2), each producing feature maps; classification stage with dropout (DR), fully-connected layer (n1), and output (n2).]

Fig. 2. The CNN architecture.

The C-RNN model is a hybrid model in which the output of the CNN is the input of the RNN. The R-CNN model works the other way around. The structures of the CNN and the RNN are described in the following subsections.

1) Convolutional Neural Network: For the CNN, we adopt the standard two-layer convolutional neural network, i.e., LeNet [1], which is shown in Fig. 2. The generated encoded profile (of length n) serves as the input of the filter, and the output of the filter serves as the input of the CNN. For profiles that have fewer than n API calls, we use zero padding to fill the vector. The first convolutional layer (C1) has K1 two-dimensional kernels, and the size of each kernel is 1 × F1. The next layer is a pooling layer (S1), which performs the pooling operation with pooling size P1 and stride length D1. The second convolutional layer (C2) has K2 kernels, and the size of each kernel is 1 × F2. The second pooling layer (S2) has length P2 and stride length D2. After the second pooling layer, the inputs are fed into a dropout layer (with dropout rate DR) to avoid over-fitting. Finally, there is a fully-connected layer that contains n1 neurons, and it produces a one-hot vector whose size equals the number of malware families. The index of the highest value in this vector indicates the malware family into which the input sample is classified. A Rectified Linear Unit (ReLU) is attached after the fully-connected layer to serve as the activation function. We adopt the softmax method at the output layer. The cost function is cross-entropy with logits, which calculates the cost between the predicted label and the actual label. The training process minimizes the cross-entropy.

The CNN architecture is capable of processing two-dimensional figures, and the kernels of the CNN are trained automatically. In this paper, we view a sequence of Windows API calls as a pattern in a one-dimensional figure, so that the automatically trained kernels can be viewed as the features for classification. Hence, during convolution, the whole sequence of API calls is fed into the kernels one by one to perform the convolution operation.

2) Recurrent Neural Network: For the RNN, the interpretation of the Long Short-Term Memory (LSTM) cells is shown in Fig. 3. The structure of the LSTM is as follows.

• t is the time step. The Windows API call sequence is divided into t steps.
• X_t refers to the input features (i.e., a segment of the API call sequence) at time step t.
• O_t refers to the output at time step t. For the task of family classification, we set the size of O_t to the number of families to be classified.
• U, V, and W are the weights that are shared across all time steps.
• In an LSTM cell, there are num_units units. The structure of each LSTM unit is shown in Fig. 4, which is a standard LSTM [35].

The formulas of the standard LSTM at time t are as follows.
\[
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
\]

The encoded API call sequence serves as the input of the LSTM. For initialization, the three weights (U, V, and W) are set to random, normally distributed values. As for the output, we adopt a one-hot vector to represent the classification result, where the length of the vector is n2, which equals the number of classes. The index of the highest value in the output vector indicates the malware family into which the input is classified.

[Figure 3: LSTM unit with inputs x_t, h_{t-1}, and C_{t-1}; gates f_t, i_t, and o_t; candidate state C̃_t; outputs C_t and h_t.]

Fig. 3. The interpretation of LSTM Cells.

IV. EVALUATION

A. Dynamic Analysis Profile

Fig. 5 shows an example of a generated malware profile. A profile contains the invoked Windows APIs with their parameters and timestamps. The profiling process lasts for 300 seconds for each malware sample. We then store all profiles in a database.
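As a minimal sketch of how such a profile could be reduced to the API call sequence used as model input, the snippet below extracts (timestamp, API name) pairs from profile text. It assumes each call starts with `#<timestamp> <APIName>` followed by `key="value"` parameters, as in the example profile of Fig. 5; the sample text is shortened and made up, and `parse_calls` is a hypothetical helper name.

```python
import re

# Matches a call header such as "#317560000 CreateFile".
# A space after '#' is tolerated because extracted text sometimes contains one.
CALL_HEADER = re.compile(r"#\s*(\d+)\s+([A-Za-z_]\w*)")


def parse_calls(profile_text):
    """Return (timestamp, api_name) pairs in order of appearance."""
    return [(int(ts), name) for ts, name in CALL_HEADER.findall(profile_text)]


# Shortened, made-up sample in the assumed format.
sample = (
    '#317560000 CreateFile hName="C:\\Temp\\s7785.exe" Return="SUCCESS"\n'
    '#339720000 LoadLibrary lpFileName="SHELL32.dll" Return="SUCCESS"\n'
    '# 341100000 RegQueryValue Return="FAILURE"\n'
)

calls = parse_calls(sample)
api_sequence = [name for _, name in calls]   # parameters are ignored here
```

Only the API names are kept in this sketch, since the encoding described later operates on API names rather than on parameters.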

[Figure 4: LSTM cells unrolled over time with inputs X_{t-1}, X_t, X_{t+1}, outputs O_{t-1}, O_t, O_{t+1}, shared weights U, V, and W, and num_units units per cell.]

Fig. 4. A standard LSTM cell.

#317560000 CreateFile
    hName="C:\DOCUME~1\x\LOCALS~1\Temp\n7785\s7785.exe" desiredAccess="GENERIC_WRITE" creationDisposition="CREATE_ALWAYS" Return="SUCCESS"
#339720000 LoadLibrary
    lpFileName="SHELL32.dll" Return="SUCCESS"
#341100000 RegQueryValue
    hKey="HKCU\Software\Microsoft\Windows\ShellNoRoam\MUICache\C:\DOCUME~1\x\LOCALS~1\Temp\n7785\s7785.exe" Return="FAILURE"
#341350000 RegSetValue
    hKey="HKCU\Software\Microsoft\Windows\ShellNoRoam\MUICache\C:\DOCUME~1\x\LOCALS~1\Temp\n7785\s7785.exe" data="install manager" Return="SUCCESS"
#341600000 CreateProcessInternal
    lpApplicationName="C:\DOCUME~1\x\LOCALS~1\Temp\n7785\s7785.exe" lpCommandLine="C:\DOCUME~1\x\LOCALS~1\Temp\n7785\s7785.exe ins.exe /e11831362 /u50d1d9d5-cf90-407c-820a-35e05bc06f2f /v" dwProcessId="1276" dwThreadId="1272" Return="SUCCESS"

Fig. 5. An example of malware behavior profile that contains a sequence of Windows API invocation with timestamp, API name, and parameters.

B. Dataset and Platform

The malware samples are collected from the OWL [26] database of NCHC, Taiwan, and the text-based profiles are provided by Hsiao [9]. Visualized, encoded profiles of different malware families are shown in Fig. 7. Each small square represents a Windows API call, and the color of a square represents the type of call. The figure shows the first 64 calls of some samples from different families. The family label of each malware sample in the figure is collected from VirusTotal.com [27]. For example, the first malware sample belongs to the Allaple family and its id is 2b68b4-3352. We can observe that the API call patterns differ between families, so conventional classifiers provide good classification results; the proposed filterNN, however, can additionally point out which parts of the call sequences are critical to the classifier.

Our first dataset contains 1,904 profiles that belong to 14 families, as classified by Chiu [32]. On average, each profile contains 397 Windows API calls. The maximum number of API calls in a profile is 20,766 and the minimum is 18. The second dataset contains 9,149 profiles, and we adopt a hierarchical clustering algorithm to preliminarily cluster these profiles into 130 groups. On average, each profile contains 453 Windows API calls. The maximum number of API calls in a profile is 219,743 and the minimum is 17. In our experiments, we choose the first 128 API calls of each malware variant as our input features. If a malware variant has fewer than 128 API calls, we fill the remaining positions with 0.

The second dataset contains 9,149 profiles, all of which are unlabeled. To test the usefulness of filterNN, we first cluster these malware samples. The clustering steps are as follows. First, count all the 4-grams that appear in all malware samples of the second dataset. Then compute the Jaccard distance between all malware variants in the 4-gram encoding.

Jaccard distance:

\[
JD(i, j) = 1 - \frac{|4G(i) \cap 4G(j)|}{|4G(i) \cup 4G(j)|}
\]

• i, j ∈ {1, ..., N}
• N = the number of all malware
• 4G(i) = the 4-grams of malware i

After computing the Jaccard distances, we use the UPGMA algorithm to cluster all the malware variants. We then set the distance threshold to 0.4 to properly cluster the dataset. With a threshold of 0.4, we obtain 662 groups in total.
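The distance computation behind this clustering step can be sketched as follows: 4-gram sets are built from API call sequences, the Jaccard distance compares two such sets, and the UPGMA group distance is the mean pairwise distance between the members of two groups. The function names, the toy call names, and the convention for two empty sets are assumptions for illustration; a full clustering run would typically rely on a library routine rather than this sketch.

```python
def four_grams(calls, n=4):
    """Set of n-grams (tuples of n consecutive API names) of one profile."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def jaccard_distance(a, b):
    """JD = 1 - |A intersect B| / |A union B| for two n-gram sets."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here when both sets are empty
    return 1.0 - len(a & b) / len(union)


def average_linkage(group_a, group_b):
    """UPGMA distance between two groups: mean pairwise Jaccard distance."""
    total = sum(jaccard_distance(x, y) for x in group_a for y in group_b)
    return total / (len(group_a) * len(group_b))


# Toy profiles with placeholder API names A..F.
g1 = four_grams(["A", "B", "C", "D", "E"])   # {ABCD, BCDE}
g2 = four_grams(["A", "B", "C", "D", "F"])   # {ABCD, BCDF}
g3 = four_grams(["A", "B", "C", "D", "E"])   # identical to g1

d12 = jaccard_distance(g1, g2)               # one shared 4-gram out of three
d_groups = average_linkage([g1, g3], [g2])   # mean of the two pairwise distances
```

In the toy example, `d12` exceeds the 0.4 threshold mentioned above, so these two profiles would not be merged at that cut-off.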
After clustering, we observe that many groups contain only one or two malware variants, which means that these variants are too dissimilar from the others and can only form groups by themselves. Hence, we remove the groups that contain fewer than 10 malware variants. Finally, dataset 2 is clustered into 130 groups.

UPGMA algorithm:

\[ U(G_i, G_j) = \sum_{a \in G_i,\; b \in G_j} \frac{JD(a, b)}{|G_i|\,|G_j|} \]

• G_i ∈ G
• G = all groups
• if |G_i| = 1, G_i is a single point p
• p ∈ P, the set of all points
• 4G(i) = the set of 4-grams of malware i

The analysis environment is Python 3.6 with the TensorFlow package [28]. An NVIDIA GeForce GTX 1080 Ti GPU is used to perform supervised and unsupervised learning.

C. Encoding

After acquiring all the profiles, we encode the dataset in four different ways: rank, rank ratio, frequency, and frequency ratio. We count how often each Windows API is invoked across all profiles and sort the APIs by that frequency. Table I presents the classification accuracy of the plain CNN and RNN classifiers under the different encoding methods. This experiment shows that frequency ratio is the most suitable encoding for call sequence data, since the invocation of APIs is quite imbalanced.

Table II shows the encoding table obtained with frequency ratio (only part of the entries are listed), which transforms a text-based API sequence into a vector of real numbers between 0.0 and 1.0. We reserve 0.0 as the padding value in case the input sequence is too short. Fig. 6 is an example of a profile encoded with the encoding table. We map the encoded values to grey scale and to the Viridis color map for better visualization. The Windows API calls are arranged from left to right and from top to bottom. Each slot in the figure represents an API call. Take the grey scale figure as an example: a black slot indicates the API with the smallest value in the encoding table (which is PAD), and a white slot indicates the API with the largest value (which is RegQueryValue). Such visualizations help us quickly identify which Windows APIs are used and the pattern of Windows API usage in a malware family. In Fig. 7 and 8, we can tell that the Windows API call patterns of different families or groups are different; filterNN, however, should help us identify which patterns are important for classification.
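As an illustration of the frequency ratio encoding, the sketch below builds an encoding table and pads a sequence to a fixed length. It assumes that the encoded value of an API is its cumulative frequency ratio over the APIs sorted by ascending frequency; the text does not state the formula, but the values listed in Table II appear consistent with this reading (for example, a count of 3 maps to about 3.896e-06, and adding the next count of 14 gives about 2.208e-05). The function names and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def build_encoding_table(profiles):
    """Map each API name to a value in (0, 1]; 0.0 is reserved for PAD."""
    counts = Counter(api for profile in profiles for api in profile)
    total = sum(counts.values())
    table = {"PAD": 0.0}
    running = 0
    # Ascending frequency; ties broken by name (an assumption, for determinism).
    for api, freq in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        running += freq
        table[api] = running / total
    return table


def encode(profile, table, length=128):
    """Encode the first `length` calls and pad short sequences with 0.0."""
    values = [table[api] for api in profile[:length]]
    return values + [0.0] * (length - len(values))


profiles = [
    ["RegQueryValue", "RegQueryValue", "RegQueryValue", "CreateFile"],
    ["RegQueryValue", "CreateFile", "WinExec"],
]
table = build_encoding_table(profiles)
print(table)
print(encode(["CreateFile", "RegQueryValue"], table, length=4))
```

With this scheme the most frequent API always receives 1.0, rare APIs are squeezed near 0.0, and the padding value never collides with a real API.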

TABLE I
CLASSIFICATION ACCURACY OF THE PLAIN CNN AND RNN CLASSIFIERS UNDER DIFFERENT ENCODINGS

Classifier   Rank   Rank Ratio   Frequency   Frequency Ratio
CNN          0.92   0.92         0.41        0.92
RNN          0.60   0.39         0.51        0.85

TABLE II
THE ENCODING TABLE OF APIS

API                  Frequency   Encode (Frequency Ratio)
PAD                  0           0.0
CreateRemoteThread   3           3.896038118836955e-06
OpenThread           14          2.2077549340076075e-05
WinExec              15          4.155773993426085e-05
TerminateProcess     21          6.883000676611953e-05
...                  ...         ...
EnumValue            51961       0.16465566165766032
LoadLibrary          52652       0.23303372800199476
CreateFile           143427      0.41929941442547075
DeleteFile           160671      0.6279595279560215
RegQueryValue        286476      1.0

[Figure: two 8 x 8 grids of encoded API calls, one in grey scale and one in the Viridis color map.]

Fig. 6. A visualization of encoded Windows API call sequences in grey and Viridis color.

D. filterNN

In this experiment, we compare different combinations of filters (None filter, SLFN filter, and Convolution filter) and classifiers (LR, SLFN, CNN, RNN, CNN-RNN, and RNN-CNN) under the proposed filterNN framework. In this 10-fold test, 10 percent of the malware samples, together with their labeled family classes, are fed into filterNN for training. The remaining 90 percent of the malware samples are used to test the trained models.

[Figure: one encoded sample per family: allaple:2b68b4-3352, bettersurf:27f348-3140, elkern:a617de-2876, graftor:0ea6f0-3156, hotbar:0a0253-3276, kryptik:0fbac2-3340, kryptik:0a5b82-3180, loadmoney:0e0941-3196, loring:1d1e01-3268, mydoom:0ca2ef-3324, rahack:0f1f30-3180, sytro:fff4eb-3260, vobfus:0a91d6-2872, zbot:0b575e-3216.]

Fig. 7. Visualized, encoded malware samples of different malware families.

Fig. 8. Visualized, encoded malware samples of different malware groups.

Table III shows the classification accuracy of filterNN with the different filters and classifiers on the first dataset. The table lists only the best classification result of each NN model; the individual hyper-parameters are discussed in the later subsections. The results with the SLFN filter and the Convolution filter are close to, and in several cases slightly lower than, those with the None filter. For the simple classifiers (LR, SLFN, CNN, and RNN), the SLFN filter slightly outperforms the None filter. This matches our expectation that even when a filter reduces the original input features, the classifier can still classify the data correctly. The remaining features can be viewed as a common characteristic of each malware family.

Fig. 9 displays an example of two malware variants from the same family. Both variants are processed by filterNN. We align the remaining input features to see whether filterNN has successfully removed the redundant information and extracted the common characteristics of the two variants. The filtered inputs are aligned with a global alignment algorithm (the Needleman-Wunsch algorithm) [33], and the Hamming distance [34] is used to evaluate how well the filtered inputs align. Let the aligned filtered inputs be a_i and a_j, both of length N. We calculate the Hamming distance H between a_i and a_j and the alignment mean distance M_d = H/N. The digits on the right-hand side indicate how many input features of the two malware variants are the same. As a result, the two variants are almost entirely aligned.

For the complex classifiers (i.e., CNN-RNN and RNN-CNN), we can see that, on average, the RNN-CNN classifier outperforms the CNN-RNN classifier (no matter which filter is used). Our interpretation is that in RNN-CNN the front RNN works as a feature extraction model and the back CNN serves as the classifier, which is a sensible structure. In CNN-RNN, on the contrary, the front CNN cannot output sequentially meaningful data to the back RNN, so that structure performs worse.
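The alignment mean distance M_d = H/N described above can be illustrated with a small global alignment routine. This is a sketch under stated assumptions, not the code used for Fig. 9: the scoring scheme (match +1, mismatch -1, gap -1) is hypothetical, and a gap is counted as a differing position when computing the Hamming distance.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; None marks a gap in the output."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


def alignment_mean_distance(ai, aj):
    """M_d = H / N, where H counts positions at which the aligned inputs differ."""
    h = sum(x != y for x, y in zip(ai, aj))
    return h / len(ai)


v1 = ["LoadLibrary", "RegQueryValue", "RegSetValue", "CreateFile"]
v2 = ["LoadLibrary", "RegSetValue", "CreateFile"]
ai, aj = needleman_wunsch(v1, v2)
print(ai, aj, alignment_mean_distance(ai, aj))
```

For these two toy sequences the only optimal alignment inserts one gap opposite RegQueryValue, so one of four aligned positions differs and M_d is 0.25.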

No.  Mark  Variant 1       Variant 2       |  No.  Mark  Variant 1       Variant 2
1    O     LoadLibrary     LoadLibrary     |  21   O     RegQueryValue   RegQueryValue
2    O     LoadLibrary     LoadLibrary     |  22   O     RegQueryValue   RegQueryValue
3    O     LoadLibrary     LoadLibrary     |  23   O     LoadLibrary     LoadLibrary
4    L     -               RegQueryValue   |  24   O     RegSetValue     RegSetValue
5    L     -               RegQueryValue   |  25   O     RegSetValue     RegSetValue
6    O     RegEnumValue    RegEnumValue    |  26   O     RegSetValue     RegSetValue
7    X     RegCreateKey    LoadLibrary     |  27   O     RegSetValue     RegSetValue
8    O     LoadLibrary     LoadLibrary     |  28   O     RegQueryValue   RegQueryValue
9    O     RegQueryValue   RegQueryValue   |  29   O     RegCreateKey    RegCreateKey
10   R     RegSetValue     -               |  30   O     RegCreateKey    RegCreateKey
11   R     InternetOpen    -               |  31   O     RegCreateKey    RegCreateKey
12   O     RegQueryValue   RegQueryValue   |  32   O     RegSetValue     RegSetValue
13   O     RegQueryValue   RegQueryValue   |  33   O     CreateFile      CreateFile
14   O     RegQueryValue   RegQueryValue   |  34   O     CreateFile      CreateFile
15   O     RegQueryValue   RegQueryValue   |  35   O     CreateFile      CreateFile
16   O     LoadLibrary     LoadLibrary     |  36   O     CreateFile      CreateFile
17   O     RegQueryValue   RegQueryValue   |  37   O     CreateFile      CreateFile
18   O     RegQueryValue   RegQueryValue   |  38   O     RegQueryValue   RegQueryValue
19   O     RegQueryValue   RegQueryValue   |  39   O     RegQueryValue   RegQueryValue
20   O     RegQueryValue   RegQueryValue   |  40   O     RegQueryValue   RegQueryValue

Fig. 9. An example of filtered API alignment.

TABLE III
CLASSIFICATION ACCURACY OF FILTERNN

Classifier \ Filter   None   SLFN   Convolution
LR                    0.80   0.80   0.92
SLFN                  0.90   0.92   0.80
CNN                   0.92   0.94   0.91
RNN                   0.87   0.91   0.90
CNN-RNN               0.90   0.87   0.92
RNN-CNN               0.93   0.93   0.91

1) CNN Hyper-parameter Selection: We test several hyper-parameters of the CNN and the RNN separately to find the best hyper-parameters for classifying the encoded profiles.
For the CNN, we adjust hyper-parameters such as the filter size and the output size of each convolution layer to identify a suitable model setting for classification. The hyper-parameters shown in Table IV are fixed across all experiments. One important hyper-parameter is the filter size (i.e., the kernel of the CNN), and we use LeNet [1] and AlexNet [29] to help us identify a proper size. Table V shows the classification accuracy with different convolution filter sizes in LeNet and AlexNet, respectively. In the following experiments, the filter size is set to 1*16.

TABLE IV
PARAMETERS USED IN CNN

Parameter    Value    Explanation
test rate    0.1      The ratio of the test data to all data
n            128      The length of the API call sequence
K1           64       The number of filters in the first convolution layer
F1, F2       16, 16   The length of the filter
P1, P2       4, 4     The length of pooling
D1, D2       4, 4     The strides of pooling
K2           128      The number of filters in the second convolution layer
n1           1024     The number of nodes in the fully-connected layer
DR           0.5      Dropout rate
L_rate       0.001    Learning rate
batch size   20       The number of samples in each training batch

TABLE V
THE CLASSIFICATION ACCURACY FOR DIFFERENT FILTER SIZES TESTED IN LENET AND ALEXNET

CNN model \ CNN filter size   1*4     1*8     1*16    1*32    1*64
LeNet                         0.941   0.951   0.926   0.928   0.920
AlexNet                       0.922   0.916   0.923   0.928   0.931

2) RNN Hyper-parameter Selection: For the RNN, we adjust the time steps (ts) and the length of the input feature (length_f). Table VI shows the classification accuracy under different combinations of ts and length_f. The results in Table VI show that the accuracy of the RNN is positively correlated with the length of the input feature: the longer the input feature, the higher the accuracy. As a result, we choose 1 time step and an input feature length of 128 as the hyper-parameter configuration of the RNN in this paper.

TABLE VI
THE CLASSIFICATION ACCURACY FOR DIFFERENT HYPER-PARAMETERS TESTED IN RNN

ts    length_f   accuracy
1     128        0.915
2     64         0.873
4     32         0.873
8     16         0.867
16    8          0.865
32    4          0.820
64    2          0.809
128   1          0.783

E. Case studies of filtered APIs

Fig. 10 shows two examples of the filtered APIs in two malware families, Allaple and Kryptik, respectively. The colored sequences are the originally encoded API sequences, and a black dot below an API specifies that the API is filtered out by our filter model. Hence, the APIs with white dots below them are, collectively, the filtered APIs that the filter model considers important to malware family classification. The filter selects APIs for the subsequent classification. As we can see, although the variants are slightly different, the filter still extracts common patterns shared within the family. Some variations among the malware variants are filtered out by the filter.

Fig. 10. The visualized encoded API sequences of two families and their filters.

In our experiments, we have 14 malware families, and the filter model filters out about 50% to 54% of the APIs for each family before classification. On average, 51 percent of the APIs are filtered out. Fig. 11 shows the average number of filtered-out inputs for each malware family.

Fig. 11. The average filtered-out inputs of each malware family.

After we obtain multiple filtered API sequences in the same family, we apply the local sequence alignment algorithm (a.k.a. the Smith-Waterman algorithm) [31] to extract the characteristics of a family class. Fig. 9 is an example from the Graftor family with two variants aligned; it shows the first 40 APIs of the two variants. The number in the figure is the index of the filtered APIs aligned by the algorithm, and the mark in the next column specifies whether the two aligned APIs are matched (O), mismatched (X), or aligned with a gap (-). We can see that most of the filtered APIs can be aligned (marked as O), which indicates that filterNN can identify the patterns of the family while performing classification. Such a human-readable filter makes it easier and more convenient for security experts to analyze sequential data without manually filtering out unimportant data in the sequence.

F. Case Study of different learning goals of the filterNN

As mentioned in Section III, the three cost functions have the following learning goals: 1) Z cost: reduce the inputs as much as possible; 2) C cost: keep the retained inputs (API calls) as consecutive as possible, which helps security experts perform further analysis; 3) A cost: minimize the cost between predictions and labels, which means maximizing accuracy. These three learning goals conflict with each other. We expect that if we remove most of the noise in the original data, which helps security experts explore the malicious purpose of a malware variant, the classifier may receive insufficient input to learn and classify. On the contrary, if we aim for higher classification accuracy, the filter will output a relatively complete input that may contain much noise, which makes it hard for security experts to infer the malicious purpose of the malware variant. The idea is that, by adjusting the training proportion of the three cost functions, we can train a filter model whose filtered input contains fewer data, still lets the subsequent classifier reach high accuracy, and keeps the remaining data as consecutive (2 calls in a row) as possible. Fig. 12, 13 and 14 show some of the filterNN training results with the SLFN filter.
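The two filter-side goals described above can be made concrete with simple measurements on a binary filter vector. The following sketch is only illustrative: the actual Z cost and C cost are defined in Section III and are not reproduced here, so the retained fraction and the count of consecutive retained pairs below are assumed proxies, and all function names are hypothetical.

```python
def apply_filter(calls, mask):
    """Keep the API calls whose mask entry is 1."""
    return [call for call, keep in zip(calls, mask) if keep == 1]


def retained_fraction(mask):
    """Share of inputs that survive the filter (lower means more filtering)."""
    return sum(mask) / len(mask)


def consecutive_pairs(mask):
    """Number of adjacent positions that are both retained (2 calls in a row)."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a == 1 and b == 1)


calls = ["LoadLibrary", "LoadLibrary", "RegQueryValue", "RegSetValue",
         "InternetOpen", "RegQueryValue", "CreateFile", "CreateFile"]
mask = [1, 1, 0, 1, 0, 0, 1, 1]

print(apply_filter(calls, mask))
print(retained_fraction(mask))
print(consecutive_pairs(mask))
```

A filter that drives the retained fraction down tends to break up runs of retained calls, which is one way to see why the goals pull against each other.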
In the following examples, the learning rate of A cost and C cost is 0.0001 and that of Z cost is 0.00001 (see Table VII, which lists A cost as T cost). The hyper-parameter of each cost function represents its training proportion. If it is 1, the cost function is trained in every epoch. If it is 4, the cost function is trained every 4 epochs. The full parameter sets are shown in Table VII. In this experiment, we tested LeNet and AlexNet separately as the CNN classifier, and the two models are similarly effective. Considering training efficiency (in terms of time consumed), we adopt LeNet as the CNN classifier. For the RNN model, we use 128 units and 128 inputs to perform the classification.

We expected that the higher the training proportion of Z cost, the more input would be filtered out. However, the results show that the factor that influences the number of filtered inputs is C cost: the lower the training proportion of C cost, the more input is filtered out. This is because when filterNN tries to make the filtered input as consecutive as possible, less and less input is filtered out. On the other hand, when filterNN puts less effort into making the filtered input consecutive, more and more input is filtered out. In Fig. 12, the training proportion of C cost is high (every 4 epochs), which causes the filter to eventually filter out no input, so the accuracy is very high (95%), as expected. Fig. 13 and 14 show two different situations. In Fig. 13, the C cost hyper-parameter is set to 8. The Z cost rises slowly and the filter eventually retains about 90% of the original input. The accuracy is still very high (94%). In Fig. 14, the C cost hyper-parameter is set to 12. Interestingly, the Z cost starts dropping, and the final Z cost is 0.2. This means that the filter has filtered out 80% of the original input while the accuracy still reaches 87%. This result indicates that the classifier is able to learn the essential characteristics of the input even when only 20% of the input is left.
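The training proportion described above can be read as a simple per-epoch schedule. The sketch below assumes 1-based epoch numbering and that a cost with hyper-parameter k is trained whenever the epoch number is divisible by k; the function name and the dictionary layout are hypothetical.

```python
def costs_to_train(epoch, proportions):
    """Return the cost functions that are trained in the given epoch."""
    return [name for name, every in proportions.items() if epoch % every == 0]


# Hyper-parameter set of Fig. 12: A cost 1, Z cost 1, C cost 4.
proportions = {"A": 1, "Z": 1, "C": 4}

for epoch in range(1, 9):
    print(epoch, costs_to_train(epoch, proportions))

c_updates = sum("C" in costs_to_train(e, proportions) for e in range(1, 13))
print(c_updates)
```

Under this reading, a larger hyper-parameter means a lower training proportion: with C cost set to 12 instead of 4, the consecutiveness goal is updated a third as often.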
The filtered input can be viewed as the characteristic of the malware variant. Fig. 15 shows the alignment of the filtered input produced by the model trained with the hyper-parameter set of Fig. 13. Although the filtered input still contains about 90% of the original input, the malware samples from the same group align well. Fig. 16 shows the alignment of the filtered input produced by the model trained with the hyper-parameter set of Fig. 14. Almost 80% of the original input is filtered out, leaving 23 inputs. Even so, most of the filtered inputs are aligned.

Fig. 17, 18 and 19 display some of the filterNN training results with the Convolution filter. Table VIII shows the hyper-parameters used in this part. The hyper-parameter configuration of the classifiers is the same as in Table VII, while the hyper-parameters of the three cost functions are different. Unlike with the SLFN filter, C cost does not significantly affect the number of filtered inputs. Whether the training proportion of C cost is high or low, neither the number of filtered inputs nor the accuracy differs much among the three conditions. We infer that this is due to the characteristics of the convolution operation: because it is not fully connected to the input, it is hard to train the Z cost function with a small convolutional kernel. The SLFN filter, in contrast, is fully connected to the input, so its C cost training curve looks smooth and behaves as expected. In sum, the SLFN filter in filterNN can be trained into a better model that filters out as many inputs as desired while still keeping high accuracy.

Fig. 20 shows the alignment of the filtered input produced by the model trained with the hyper-parameter set of Fig. 19. On average, 80% of the original input is filtered. With the Convolution filter, the number of 0s is difficult to fine-tune. Besides not being fully connected to the input, the convolution operation is more complex than the SLFN filter. It needs more training to find a local minimum, and the cost of that training increases exponentially.

The experiment shows that we can adjust the training proportion of the three cost functions to obtain different kinds of filtered input: 1) filterNN can output a filter result that contains 50% of the filtered input and has 80% accuracy, or 2) it can output a filter result that contains 20% of the filtered input and has 90% accuracy. A security expert can therefore choose either 1) to analyze the purpose of the malware with more filtered input and lower classification accuracy, or 2) to analyze it with less filtered input and higher classification accuracy.

Fig. 12. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 4.

TABLE VII
PARAMETERS USED IN FILTERNN (SLFN FILTER)

Parameter       Value        Explanation
SLFN filter
SLFN neuron     128          The number of neurons in the SLFN filter
In CNN
K1              64           The number of filters in the first convolution layer
F1, F2          4, 4         The length of the convolution kernel
P1, P2          4, 4         The length of pooling
D1, D2          4, 4         The strides of pooling
K2              128          The number of filters in the second convolution layer
FC              1024         The number of nodes in the fully-connected layer
In RNN
num units       128          The number of units
n input         128          The number of inputs
In SLFN
num neuron      1024         The number of single-layer neurons
In Logistic Regression
num weight      [128, 130]   The number of weights
num bias        130          The number of biases
Shared parameters
test rate       0.2          The ratio of the test data to all data (training 6165, testing 1487)
n               128          The length of the API call sequence
DR              0.5          Dropout rate
T cost          1            T cost training proportion
Z cost          1            Z cost training proportion
C cost          1 ~ 15       C cost training proportion
T cost rate     0.0001       T cost learning rate
Z cost rate     0.00001      Z cost learning rate
C cost rate     0.0001       C cost learning rate
batch size      40           The number of samples in each training batch
epochs          5000         The total number of training epochs

TABLE VIII
PARAMETERS USED IN FILTERNN (CONVOLUTION FILTER)

Parameter       Value        Explanation
Convolution filter
conv f          4            The length of the convolution filter
Shared parameters
test rate       0.2          The ratio of the test data to all data (training 6165, testing 1487)
n               128          The length of the API call sequence
DR              0.5          Dropout rate
T cost          1            T cost training proportion
Z cost          300          Z cost training proportion
C cost          75 ~ 240     C cost training proportion
T cost rate     0.0001       T cost learning rate
Z cost rate     0.0001       Z cost learning rate
C cost rate     0.0001       C cost learning rate
batch size      40           The number of samples in each training batch
epochs          5000         The total number of training epochs

Fig. 13. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 8.

Fig. 14. filterNN (SLFN filter) training results of hyper-parameter set A cost: 1, Z cost: 1, C cost: 12.

G. Case study of the number of 4-grams used to represent a malware before and after filterNN

This experiment explores the number of 4-grams needed to represent malware before and after filterNN. If the input data can be represented with only a few 4-grams, filterNN can be regarded as capable of reducing noise and making the representation of malware much simpler. Fig. 21 shows the average number of 4-grams originally used to represent the malware in each group. The total number of unique 4-grams is 1831, and each group uses 36.66 4-grams on average. We then use the filterNN trained with the hyper-parameter set of Fig. 14 to filter the input data and reduce noise. The number of 4-grams used to represent the malware in each group after filterNN is shown in Fig. 22. As expected, the totals drop: the number of unique 4-grams is 1501, and each group uses 11.22 4-grams on average. The filtered input in each group uses fewer 4-grams on average than the original input, which means that after the malware is filtered, it can be represented in a simpler way (with fewer 4-grams). The advantage of representing malware with fewer 4-grams is that security experts can analyze its malicious intention more easily. The difference in the number of 4-grams used in each group is shown in Fig. 23. These results show that filterNN is capable of reducing the noise in the original data while extracting the characteristics of the data.

H. Case study of the Jaccard distance difference among each group and in the same group

We expect that, compared with the full-information malware variants, the Jaccard distance among different groups of filtered malware variants should be longer, and the Jaccard distance within the same group of filtered malware variants should be shorter. To compare the Jaccard distances among groups before and after the input passes through filterNN, we first need to find the central point of each group. The central point is the specific input in the group whose Jaccard distance to the others is the shortest. To find the central point, we use the following formula.

Central point of the group:

\[ C_g(G_i) = \operatorname{argmin}\, JD(p_1, p_j) \]

• 1 ≤ j ≤ n
• n = |G_i|
• 1 ≤ i ≤ N
• N = the number of total groups

Compute the distance between the central points of all groups:

\[ D_g = JD(C_g(G_1), C_g(G_j)) \]

• 1 ≤ j ≤ n
• n = |G|

Fig. 24 shows the Jaccard distance difference among groups. We can see that the distance between every pair of groups increases, which means that the features common to all groups have been filtered out by filterNN. It can be inferred that the extracted input represents the characteristics of each group in dataset 2. To compute the Jaccard distance of the malware from the same group, we use the following formula.

Jaccard distance of the malware from the same group:

\[ D_g(G_i) = JD(C(p_1), C(p_j)) \]

• 1 ≤ p ≤ n
• n = |G_i|
• 1 ≤ i ≤ N
• N = total groups

After computing the Jaccard distances, the Jaccard distance difference within each group is shown in Fig. 25. On average, the Jaccard distance within the same group is shorter, which means that after being filtered by filterNN, the malware variants of the same group are more similar to each other. Meanwhile, Fig. 24 shows that each malware variant is more dissimilar to the malware variants of other groups. In other words, the difference between groups increases, and each group has its own uniqueness.
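The central point computation can be sketched as follows. This is one reading of the verbal definition in this subsection (the member whose Jaccard distance to the other members is shortest), implemented as the member with the smallest total distance; the summation, the tie-breaking by first index, and the function names are assumptions rather than the exact procedure used in the experiments.

```python
def jaccard_distance(a, b):
    """JD = 1 - |A ∩ B| / |A ∪ B| for two sets (for example, 4-gram sets)."""
    union = a | b
    if not union:  # assumption: two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def central_point(group):
    """Index of the member with the smallest total distance to the group."""
    return min(
        range(len(group)),
        key=lambda i: sum(jaccard_distance(group[i], other) for other in group),
    )


def between_group_distance(group_a, group_b):
    """Jaccard distance between the central points of two groups."""
    center_a = group_a[central_point(group_a)]
    center_b = group_b[central_point(group_b)]
    return jaccard_distance(center_a, center_b)


# Toy groups: each member is a set of (abbreviated) feature ids.
g1 = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 9, 10, 11}]
g2 = [{20, 21}, {20, 21, 22}]

print(central_point(g1))
print(between_group_distance(g1, g2))
```

Running the same computation on the original and the filtered 4-gram sets gives the before and after distances that the comparison in this subsection relies on.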

group 102

No.  Mark  Variant 1       Variant 2       |  No.  Mark  Variant 1       Variant 2
1    O     LoadLibrary     LoadLibrary     |  51   O     RegQueryValue   RegQueryValue
2    O     LoadLibrary     LoadLibrary     |  52   O     RegQueryValue   RegQueryValue
3    O     RegQueryValue   RegQueryValue   |  53   O     LoadLibrary     LoadLibrary
4    O     LoadLibrary     LoadLibrary     |  54   O     LoadLibrary     LoadLibrary
5    O     LoadLibrary     LoadLibrary     |  55   O     RegQueryValue   RegQueryValue
6    O     RegQueryValue   RegQueryValue   |  56   O     RegQueryValue   RegQueryValue
7    O     RegQueryValue   RegQueryValue   |  57   O     RegQueryValue   RegQueryValue
8    O     RegEnumValue    RegEnumValue    |  58   O     RegQueryValue   RegQueryValue
9    O     RegQueryValue   RegQueryValue   |  59   O     RegQueryValue   RegQueryValue
10   O     RegQueryValue   RegQueryValue   |  60   O     RegQueryValue   RegQueryValue
11   O     RegQueryValue   RegQueryValue   |  61   O     RegQueryValue   RegQueryValue
12   O     RegQueryValue   RegQueryValue   |  62   O     RegQueryValue   RegQueryValue
13   O     LoadLibrary     LoadLibrary     |  63   O     RegQueryValue   RegQueryValue
14   O     LoadLibrary     LoadLibrary     |  64   O     RegQueryValue   RegQueryValue
15   O     RegQueryValue   RegQueryValue   |  65   O     RegQueryValue   RegQueryValue
16   O     RegQueryValue   RegQueryValue   |  66   O     RegQueryValue   RegQueryValue
17   O     RegQueryValue   RegQueryValue   |  67   O     RegQueryValue   RegQueryValue
18   O     RegQueryValue   RegQueryValue   |  68   O     RegQueryValue   RegQueryValue
19   O     RegQueryValue   RegQueryValue   |  69   O     RegQueryValue   RegQueryValue
20   O     LoadLibrary     LoadLibrary     |  70   O     RegQueryValue   RegQueryValue
21   O     RegQueryValue   RegQueryValue   |  71   O     RegQueryValue   RegQueryValue
22   O     RegQueryValue   RegQueryValue   |  72   O     RegQueryValue   RegQueryValue
23   O     RegQueryValue   RegQueryValue   |  73   O     RegQueryValue   RegQueryValue
24   O     RegQueryValue   RegQueryValue   |  74   O     RegQueryValue   RegQueryValue
25   O     RegQueryValue   RegQueryValue   |  75   O     RegQueryValue   RegQueryValue
26   O     RegQueryValue   RegQueryValue   |  76   O     RegQueryValue   RegQueryValue
27   O     RegQueryValue   RegQueryValue   |  77   O     RegQueryValue   RegQueryValue
28   O     RegQueryValue   RegQueryValue   |  78   O     RegQueryValue   RegQueryValue
29   O     RegQueryValue   RegQueryValue   |  79   O     RegQueryValue   RegQueryValue
30   O     RegQueryValue   RegQueryValue   |  80   O     CreateFile      CreateFile
31   O     RegQueryValue   RegQueryValue   |  81   O     DeleteFile      DeleteFile
32   O     RegQueryValue   RegQueryValue   |  82   O     DeleteFile      DeleteFile
33   O     RegQueryValue   RegQueryValue   |  83   O     CreateFile      CreateFile
34   O     RegQueryValue   RegQueryValue   |  84   O     CreateFile      CreateFile
35   O     RegQueryValue   RegQueryValue   |  85   O     RegQueryValue   RegQueryValue
36   O     LoadLibrary     LoadLibrary     |  86   O     CreateFile      CreateFile
37   O     CreateFile      CreateFile      |  87   O     CreateFile      CreateFile
38   O     CreateFile      CreateFile      |  88   O     LoadLibrary     LoadLibrary
39   O     RegQueryValue   RegQueryValue   |  89   O     RegQueryValue   RegQueryValue
40   O     RegQueryValue   RegQueryValue   |  90   O     RegQueryValue   RegQueryValue
41   O     CreateFile      CreateFile      |  91   O     LoadLibrary     LoadLibrary
42   O     RegQueryValue   RegQueryValue   |  92   O     RegEnumValue    RegEnumValue
43   O     RegQueryValue   RegQueryValue   |  93   O     RegEnumValue    RegEnumValue
44   O     CreateFile      CreateFile      |  94   O     Padding         Padding
45   O     RegCreateKey    RegCreateKey    |  95   O     Padding         Padding
46   O     RegSetValue     RegSetValue     |  96   O     Padding         Padding
47   O     CreateFile      CreateFile      |  97   O     Padding         Padding
48   O     CreateFile      CreateFile      |  98   O     Padding         Padding
49   O     RegCreateKey    RegCreateKey    |  99   O     Padding         Padding
50   O     RegSetValue     RegSetValue     |  100  O     Padding         Padding

Fig. 15. Filtered API alignment using SLFN filter.

group 102

No.  Mark  Variant 1       Variant 2       |  No.  Mark  Variant 1       Variant 2
1    O     LoadLibrary     LoadLibrary     |  13   O     DeleteFile      DeleteFile
2    O     RegEnumValue    RegEnumValue    |  14   O     CreateFile      CreateFile
3    L     RegQueryValue   -               |  15   O     CreateFile      CreateFile
4    L     RegQueryValue   -               |  16   L     LoadLibrary     -
5    O     RegQueryValue   RegQueryValue   |  17   X     LoadLibrary     CreateFile
6    O     RegQueryValue   RegQueryValue   |  18   O     RegQueryValue   RegQueryValue
7    O     RegQueryValue   RegQueryValue   |  19   R     -               CreateFile
8    O     RegQueryValue   RegQueryValue   |  20   X     RegSetValue     CreateFile
9    O     CreateFile      CreateFile      |  21   O     CreateFile      CreateFile
10   O     RegQueryValue   RegQueryValue   |  22   O     CreateFile      CreateFile
11   O     RegQueryValue   RegQueryValue   |  23   O     CreateFile      CreateFile
12   O     RegQueryValue   RegQueryValue   |

Fig. 16. Filtered API alignment using SLFN filter.

Fig. 17. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 75.

Fig. 18. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 135.

Fig. 19. filterNN (Convolution filter) training results of hyper-parameter set A cost: 1, Z cost: 300, C cost: 165.

Fig. 20. Filtered API alignment using the convolution filter (group 102). Rows 1 to 74 pair each original API call (LoadLibrary, CreateFile, RegCreateKey, RegQueryValue, RegSetValue, RegEnumValue, DeleteFile) with the same call when the filter retains it; the second column is blank when the call is filtered out.
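The alignment in Fig. 20 can be pictured as applying a binary vector to the original API call sequence. The following is a minimal sketch under that assumption only; the function names `apply_filter` and `format_alignment` and the mask values are hypothetical and are not taken from the filterNN implementation.

```python
from typing import List, Optional, Tuple


def apply_filter(calls: List[str], mask: List[int]) -> List[Tuple[str, Optional[str]]]:
    """Pair each original API call with its filtered counterpart.

    A mask value of 1 keeps the call; 0 drops it (represented as None).
    """
    if len(calls) != len(mask):
        raise ValueError("calls and mask must have the same length")
    return [(call, call if keep else None) for call, keep in zip(calls, mask)]


def format_alignment(aligned: List[Tuple[str, Optional[str]]]) -> str:
    """Render the pairs as two columns, leaving dropped calls blank."""
    lines = []
    for i, (orig, kept) in enumerate(aligned, start=1):
        lines.append(f"{i:>3}  {orig:<16}{kept or ''}".rstrip())
    return "\n".join(lines)


# Illustrative values only: a short call sequence and a hypothetical mask.
calls = ["LoadLibrary", "LoadLibrary", "RegQueryValue", "CreateFile"]
mask = [1, 1, 0, 1]
aligned = apply_filter(calls, mask)
print(format_alignment(aligned))
```

Keeping the dropped positions as blanks, rather than removing them, preserves the row numbering so that an analyst can see where in the original trace the retained calls occur.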

Fig. 21. The average 4-gram used in each group before filterNN.

Fig. 22. The average 4-gram used in each group after filterNN.
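One way to read the quantity plotted in Figs. 21 and 22 is as the number of distinct 4-grams per malware sample, averaged over a group. The sketch below follows that reading. It also assumes that the filtered sequence is formed by removing the dropped calls before the sliding window is applied; the thesis text here does not confirm either point, and all names and values are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(calls: Sequence[str], n: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams found by a sliding window."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def filtered(calls: Sequence[str], mask: Sequence[int]) -> List[str]:
    """Keep only the calls whose mask value is 1 (assumed filter semantics)."""
    return [call for call, keep in zip(calls, mask) if keep]


def average_ngram_count(samples: Sequence[Sequence[str]], n: int = 4) -> float:
    """Average number of distinct n-grams per sample in one group."""
    if not samples:
        return 0.0
    return sum(len(ngrams(s, n)) for s in samples) / len(samples)


# Illustrative sample and hypothetical mask.
calls = ["LoadLibrary", "LoadLibrary", "RegQueryValue",
         "CreateFile", "CreateFile", "RegSetValue"]
mask = [1, 1, 0, 1, 1, 1]

before = len(ngrams(calls))                  # distinct 4-grams in the raw trace
after = len(ngrams(filtered(calls, mask)))   # distinct 4-grams after filtering
print(before, after)
```

Sequences shorter than four calls produce no 4-grams, so heavily filtered samples contribute zero to the group average.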

Fig. 23. The average difference of the 4-gram used in each group.

Fig. 24. The distance difference among groups.

Fig. 25. The distance difference of each group.
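The distances compared in Figs. 24 and 25 are Jaccard distances, according to the case study title. The following minimal sketch shows how mean distances within one group and between two groups could be computed. It assumes each malware sample is represented as a set of features (for example its 4-grams); the helper names and the toy data are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_within(group: List[Set[str]]) -> float:
    """Mean pairwise Jaccard distance among samples of the same group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(g1: List[Set[str]], g2: List[Set[str]]) -> float:
    """Mean Jaccard distance over all cross-group sample pairs."""
    pairs = [(a, b) for a in g1 for b in g2]
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Toy feature sets standing in for the 4-gram sets of individual samples.
group_a = [{"g1", "g2", "g3"}, {"g2", "g3", "g4"}]
group_b = [{"g7", "g8"}, {"g8", "g9"}]

print(mean_within(group_a), mean_between(group_a, group_b))
```

A small within-group distance combined with a large between-group distance would suggest that the retained features are shared inside a group and distinctive across groups.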

V. CONCLUSION

We design and implement filterNN, a framework that automatically extracts characteristics from text-based sequential data across multiple classes. Using a malware behavior dataset, we discuss the effectiveness of different NN models within the proposed framework and the selection of their hyper-parameters. Unlike past research that focuses on the design of the classifier, we believe some analysts need to know which parts of the data matter to the classifier. Most importantly, the filter distinguishes important from unimportant data in a sequence by a simple binary vector, which makes it easier for domain experts to focus on the important, valuable characteristics of a class. In filterNN, all filtered inputs use the same encoding as the original input.

In the experiment on dataset 1, we analyze the pre-classified malware behavior dataset to demonstrate the proposed filterNN framework. With the help of filterNN, the filtered input reaches an accuracy above 90% for most of the classifiers. This indicates that the filter retains enough information for classification while providing human-readable results. The framework can be further used in other applications, such as article characteristics extraction. It is a useful tool for analysts to explore the important patterns in sequential data.

In dataset 2, we deal with a large amount of input data that has no pre-classified classes. We use the UPGMA algorithm to cluster the malware from dataset 2 and then analyze these malware groups with filterNN. In the experiment on dataset 2, we test different training proportions of the three cost functions to explore the correlation among them. Ideally, filterNN should filter out as much noise as possible while still keeping high accuracy and consecutive filtered input. The experiment shows that we can adjust the training proportion of the three cost functions to obtain different kinds of filtered input: 1) filterNN can output a filter result that contains 50% of the filtered input and has 80% classification accuracy, or 2) output a filter result that contains 20% of the filtered input and has 90% classification accuracy. A security expert can either choose option 1 to analyze the malware purpose with more filtered input and lower classification accuracy, or choose option 2 to analyze the malware purpose with less filtered input and higher classification accuracy. As a result, filterNN is capable of filtering out the differences among malware from the same group or the same class while still keeping high accuracy for the subsequent classifier. The filterNN with the SLFN filter yields many or few filtered inputs depending on the training proportion of the three cost functions. Analysts can easily adjust the hyper-parameters of the three cost functions to acquire the desired filtered input for further analysis.

A. Future work

There are several directions that can further improve this research.

• Since the current encoding of an API call includes only the API call name, it may not carry rich malware information from a data analysis point of view. Many malicious intents are contained in the parameters of the API call, so encoding the parameters as well could reveal more malware intentions in the filtered inputs.

• Concatenate the clustering algorithm with filterNN to cluster the dataset automatically, letting filterNN decide the best clustering result rather than manually testing the filterNN performance with different clustering results.

• Explore more advanced encoding methods for text-based sequence data that better represent the relationships within a text sequence and work better with advanced clustering algorithms.

• Automatically generate a report of the malware intention from the filtered inputs, which can reduce the effort domain experts spend examining the filtered inputs one by one.

• Auto-tune the training proportion of the three cost functions and the hyper-parameters of the classifier, which will significantly reduce the time spent testing them manually.