
(1) 國立政治大學資訊管理學系研究所 碩士學位論文
National Chengchi University, Department of Management Information Systems, Master's Thesis

二元主體學習技術研究與張量流實作
Bipartite Majority Learning with Tensors

指導教授: 郁方 博士 (Advisor: Dr. 郁方)
研究生: 李佳倫 撰 (Author: 李佳倫)

中華民國一○八年一月 (January 2019)

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(2) Abstract

A great deal of attention has been given to machine learning owing to its remarkable achievements in the game of Go and in AI robotics. Since then, machine learning techniques have been widely used in computer vision, information retrieval, and speech recognition. However, real-world data inevitably contain statistical outliers or mislabeled samples. These anomalies can interfere with the effectiveness of learning. In a dynamic environment where the majority pattern changes, it is even harder to distinguish anomalies from the majority. This work addresses the research issue of resistant learning on categorical data. Specifically, we propose an efficient bipartite majority learning algorithm for data classification with tensors. We adopt the resistant learning approach to avoid significant impact from anomalies and then iteratively conduct bipartite classification for the majority. The learning system is implemented with the TensorFlow API and uses a GPU to speed up the training process. Our experimental results on malware classification show that the proposed bipartite majority learning algorithm reduces training time significantly while keeping accuracy competitive with previous resistant learning algorithms.

Keywords: Bipartite majority learning, Resistant learning, Malware classification.

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(3) Contents

1 Introduction  1
2 Related Work  5
3 Methodology  8
  3.1 Resistant Learning on Single Hidden Layer Feed-forward Neural Network  8
  3.2 Bipartite Majority Learning  9
  3.3 Majority Learning on Softmax Neural Network  18
  3.4 Majority Learning on Support Vector Machines  22
  3.5 Multi-Class Classifier for Majority  24
4 EXPERIMENTS  27
  4.1 Malware Samples from OWL  27
  4.2 Evaluation  28
    4.2.1 Exp. 1.1: Majority Learning on Small-Size Sampling Data  29
    4.2.2 Exp. 1.2: Use ANN to Learn the Majority  35
    4.2.3 Exp. 2.1: Majority Learning on Large Scale Data  39
    4.2.4 Exp. 2.2: Use ANN to Learn the Larger Amount of Majority  44
    4.2.5 Exp. 3: Binary Classification Performance  50
5 Discussion  54
  5.1 Exp. 1.1: Majority Learning on Small-Size Sampling Data  54
  5.2 Exp. 1.2: Use ANN to Learn the Majority  56
  5.3 Exp. 2.1: Majority Learning on Large Scale Data  57
  5.4 Exp. 2.2: Use ANN to Learn the Larger Amount of Majority  58
  5.5 Exp. 3: Binary Classification Performance  59
  5.6 Majority Learning on SVMs  60
6 Conclusion  61

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(4) References  62

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(5) List of Figures

1  The tensor graph of SLFN.  10
2  The condition L in another form.  21
3  An example for different SVM kernels classifying two class data.  23
4  The tensor graph of the softmax neuron network.  25
5  Generate system call samples.  27
6  Experimental design of experiment 1.1.  30
7  False rate of different majority learning methods on training data (100*2 samples).  33
8  False rate of different majority learning methods on testing data (100*2 samples).  33
9  False rate of different majority learning methods on outlier data (100*2 samples).  34
10  Execution time of different majority learning methods (100*2 samples).  34
11  Experimental design of experiment 1.2.  35
12  False rate of different majority training softmax neuron network on training data (100*2 samples).  38
13  False rate of different majority training softmax neuron network on testing data (100*2 samples).  38
14  False rate of different majority training softmax neuron network on outlier data (100*2 samples).  39
15  Experimental design of experiment 2.1.  40
16  False rate of different majority learning methods on training data (80% samples).  43
17  False rate of different majority learning methods on testing data (80% samples).  43
18  False rate of different majority learning methods on outlier data (80% samples).  44

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(6)
19  Execution time of different majority learning methods (80% samples).  44
20  Experimental design of experiment 2.2.  45
21  False rate of different majority training softmax neuron network on training data (80% samples).  48
22  False rate of different majority training softmax neuron network on testing data (80% samples).  48
23  False rate of different majority training softmax neuron network on outlier data (80% samples).  49
24  Experimental design of experiment 3.1.  50
25  False rate of different majority learning methods on training data (variety samples).  52
26  False rate of different majority learning methods on testing data (variety samples).  52
27  False rate of different majority learning methods on outlier data (variety samples).  53
28  False rate of different majority learning methods (100*2 samples).  54
29  False rate of different softmax neuron network (100*2 samples).  56
30  False rate of different majority learning methods (80% samples).  57
31  False rate of different softmax neuron network (80% samples).  58

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(7) List of Tables

1  Table of Notations  9
2  The bipartite majority learning algorithm  13
3  The softmax majority learning algorithm  19
4  The SVM majority learning algorithm  22
5  Sample amount in each detection rules  28
6  Training result by using 100*2 samples  31
7  Softmax neural network classification result by using 100*2 samples  36
8  Softmax neural network classification result by using 100*2 samples (cont.)  37
9  Training result by using 80% samples  42
10  Softmax neural network classification result by using 80% samples  46
11  Softmax neural network classification result by using 80% samples (cont.)  47
12  Training result of bipartite classification  51

DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(8) 1. Introduction

Artificial neural networks (ANNs) are regularly adopted in data classification research [1, 2, 3, 4]. These studies apply the softmax transformation in the output layer of the ANN so that the value of each output node represents the probability that a specific input belongs to that class. Because ANNs are well suited to classification, they can also assist humans in answering yes-no questions, for example, determining whether an image contains a specific object or forecasting whether future stock prices will rise. A sufficiently complex ANN has been proven able to represent any input-output mapping [5]. Given sufficient information, an ANN can therefore help people make decisions more effectively. However, the parameters of an ANN require many trials to find proper settings. How many hidden layers should be adopted? How many hidden nodes in each layer? Long trial-and-error processes can consume a great deal of time and computing resources. Even when a proper model complexity is found, the loss of the model often gets stuck in a local optimum and might not fulfill the required classification accuracy on the training data. Moreover, real-world data sets inevitably contain outliers or wrong information. Although ANNs handle mislabeled data and outliers more flexibly than linear models [6], a small portion of outliers can still affect how well the majority is learned.

Tsaih and Chang [7] proposed a resistant learning procedure on ANNs. The learning model of resistant learning is the single-hidden layer feed-forward neural network (SLFN). This resistant learning method allows multi-dimensional inputs but only a one-dimensional output; in other words, the SLFN can have many input nodes but only one output node. The resistant learning method dynamically changes the number of hidden nodes to find a proper model complexity. When the ANN's loss gets stuck in a local optimum, the resistant learning approach adds new hidden nodes and calculates appropriate weights for them. With this mathematical approach, data that are difficult to learn can still be learned by the ANN while the majority's numeric outputs are not affected significantly. However, the final goal of resistant learning is to find a near-perfect fitting function; that is, the resistant learning method focuses on finding perfect numeric input-output mappings, not on serving as a classifier for the data.

1 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(9) Still, if an ANN learns the perfect input-output relationship of the training data, it will probably suffer from over-fitting. Therefore, in this paper we design a new resistant learning method: we retain the advantages of resistant learning and apply it to classify different types of data. We call the proposed resistant learning mechanism bipartite majority learning (BML). Our method learns the proper majority features of two categories of data. The bipartite classification problem is relative: although the resistant learning method produces only one output value for each input, we can set different desired learning outputs for the two types of data, one of which should be greater and the other smaller. We can then find a boundary that separates the two types of data. BML can also pick out anomalies even when the anomalies' features are unknown beforehand. To achieve this goal, BML starts learning from a small portion of the training data and gradually increases the amount of selected training data until the majority features are learned. When selecting the partial training data, the whole training set is sorted by a specific criterion. This sort-and-select process imitates the learning process of human beings: humans use things they already know to describe unknown things, and learn the concept of the unknown through such analogy and reasoning. Through the sorting process, the ANN first learns the data that is most familiar to it. As the knowledge of the ANN becomes more abundant, the ANN can distinguish which data belong to the majority pattern, while those that remain difficult to learn may be outliers. When classifying two types of data, BML first learns the most distinct features of the bipartite data, that is, the data farthest from the classification boundary. BML then tunes the weights or changes the model structure to learn the differences among similar data. The stopping condition of BML is that the majority of the data can be correctly classified by the boundary, where the majority rate can be determined in advance. Although the resistant learning process would even allow BML to classify all training data perfectly, such perfect learning might cause over-fitting. Depending on the application, the majority rate can be set as needed. More precisely, the setting of the majority rate should depend on knowledge of the training data.

2 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(10) If we know roughly how many outliers may be in the training data, we can set the majority rate accordingly, and BML will automatically pick out the appropriate majority and learn it perfectly.

In the experiments, we apply BML to dynamic malware analysis. Dynamic analysis refers to the process of executing a program and recording its behavior. The program execution data sets used in our experiments are real-world data sets: each is a collection of profiles generated from dynamic analysis. When a program is executed, the monitor records all the function calls (e.g., system calls or Windows API calls) invoked by the program and saves the call sequence in a profile [8]. We collect several benign and malicious programs and label their profiles as benign (B) or malicious (M) for bipartite classification.

Our paper has three main contributions. First, we propose BML to deal with the bipartite classification problem in the context of resistant learning. Past studies [7, 9] focus on learning a fitting function for numerical outputs; BML learns a fitting function that serves as a classifier for two categories of data. The resistant learning method proposed in this paper can be applied to any bipartite data. Second, BML can find anomalies from a global view and does not need prior knowledge of the training data set. BML uses a sort-and-select mechanism over all of the training data, selecting appropriate majority data for model training and decreasing the effect of outliers. This capability is helpful in dynamic malware analysis. As we know, a malicious program may not always do something harmful to the infected system (i.e., it may be in an incubation phase); therefore, some function call sequences invoked by the malware do not belong to the attack behavior and can be considered noise or outliers. In this case, a conventional ANN might be confused by the outlier data while training the model, whereas the proposed majority learning mechanism for nominal classification avoids being affected by such anomalies. Third, BML trains the model with greater time efficiency while keeping high prediction accuracy. Compared to the former resistant learning methods, our BML outperforms them in decreasing training time while remaining competitive in classification accuracy.

3 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(11) Our method can be faster than the former ones for two reasons. First, we design a looser learning goal for the SLFN model. Former resistant learning methods need to learn a near-perfect input-output relationship for the training data, whereas our approach only requires the model to separate the outputs of the two categories of data. For example, the model does not need to drive all class 1 outputs to the value 1 and all class 2 outputs to the value 0; instead, BML determines a boundary that separates the two categories. We only need to ensure that every class 1 predicted output is greater than every class 2 predicted output. Second, the system and the underlying neural network are implemented in Python with the TensorFlow API [10] on a GPU, which provides high-performance parallel computing on high-dimensional tensors. A tensor is a set of primitive values shaped into an array of any number of dimensions. TensorFlow can perform parallel computation on large tensors and can further speed up computation by using the GPU. We tested the execution speed on a portable computer by performing a forward pass through a neural network with an input tensor of shape (1187842, 52). It took 9.998 seconds in Java and 0.273 seconds in TensorFlow in the same hardware environment, i.e., about 36 times faster with GPU-accelerated TensorFlow. We used TensorFlow as the main framework in order to reduce the execution time of training the SLFN model.

4 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
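To give a flavor of the kind of measurement described above, the following is a minimal, illustrative sketch that times one forward pass of a (1187842, 52) tensor through a small SLFN-style layer. It uses the TensorFlow 2 eager/tf.function API rather than the thesis's original TensorFlow 1 benchmark code, and the hidden-layer size and timing harness are assumptions, so the absolute numbers will differ from those reported above.

import time
import tensorflow as tf

# Illustrative sizes: 1,187,842 samples with 52 system-call frequency features,
# passed through an assumed hidden layer of 10 tanh nodes and a single output node.
x = tf.random.normal([1187842, 52])
w_h = tf.random.normal([52, 10])
b_h = tf.random.normal([10])
w_o = tf.random.normal([10, 1])
b_o = tf.random.normal([1])

@tf.function
def forward(x):
    hidden = tf.tanh(tf.matmul(x, w_h) + b_h)   # hidden-layer activations
    return tf.matmul(hidden, w_o) + b_o          # SLFN-style output

forward(x)                            # first call traces and compiles the graph
start = time.perf_counter()
_ = forward(x).numpy()                # timed forward pass (numpy() forces completion)
print("forward pass took", time.perf_counter() - start, "seconds")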

(12) 2. Related Work. Our main learning architecture of bipartite majority learning is SLFN. The SLFN has been discussed in many researches. Huang et al. [11] have proved the ability of SLFN classifying disjoint decision regions with arbitrary shapes in multidimensional cases. They also mention that failures in SLFN classification can be attributed to inadequate learning and inadequate amount of hidden nodes. Tsaih [12] uses SLFN to conduct two class data classification but having a limitation that input data must be binary inputs. Huang et al. [13] develop an efficient learning algorithm, extreme learning machine (ELM), for the SLFN. Feng et al. [14] apply ELM and have a growing hidden node approach on the SLFN. Our approach also has a growing hidden node method for SLFN and focus on two class data classification, but, we do not have the limitation on the input data, the inputs in our study can be real numbers. The proposed BML can deal with the anomalies in learning process. In the context of linear regression analysis, there are two ways to dealing with outlier problems: deletion diagnostics and robust estimators [15]. One way of deletion diagnostics is to determine the observations that cause the largest change in some regression quantity [16, 17] when they are excluded from the fitting procedure. As for the robust method, one robustness analysis is to focus on trimmed sum of squared residuals instead of including all the squared residuals as in the least squares estimator [18]. BML inspired by this robust approach, limit the attention to a trimmed set of data and gradually increase the subset size. In this way, BML can pick out the appropriate majority and fight against the outliers. There are many studies focusing on robust learning methods and dealing with anomaly pattern. Ren et al. [19] proposed a robust softmax regression for multi-class classification to cope with noisy data and statistical outliers. Jiang et al. [20] worked on a single layer robust autoencoder. Zhou and Paffenroth [21] devised a robust mechanism in deep autoencoder to find anomalies. Zhao and Fu [22] proposed a robust graph representation method to clearly split the elements in the segmentation of videos. Wang and Tan [23] studied a robust distance metric learning method which distinguishes labeled image data.. 5 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(13) Jia and Zhao [24] worked on deep neural network to develop a Chinese pinyin typo detection system. Our study was inspired by the resistant learning method on SLFN proposed by Tsaih and Chang [7] and make appropriate modifications to make our method more in line with the bipartite classification situations. This paper is an extension study in the context of resistant learning. The word resistant has the same meaning to robust. In the statistical literature, the terms robust and resistant are often used interchangeably but sometimes have specific meanings [25]. Robust procedure means that the results are not impacted significantly by violations of the model assumptions (i.e., the errors are normally distributed). As for resistant procedures, those whose numerical results are not impacted significantly by outlying observations. Tsaih and Chang [7] proposed a resistant learning procedure on ANN to learn nearperfect real-number input-output mappings. The error values of all training observations will less than a tiny value, ϵ (say, 10−6 ). However, in order to train a perfect fitting function, the resistant learning procedure suffers high model complexity and long training time. Srivastava et al. [26] introduced dropout nodes in a neural network to reduce computation requirements and hence speed up ANN training. Huang et al. [9] proposed an envelope module to ease the restriction of ϵ. The envelope method covers the fitting function with an envelope and the learning goal is to make majority data covered by the range of envelope. The envelope method allows observations having larger error values, thus accelerate the resistant learning procedure. Yet, past researchers do not address the learning algorithm with nominal anomalies (i.e., outliers) in the context of resistant learning. Hence, we propose a new resistant majority learning mechanism for nominal classification that can antagonize the anomalies in the training data automatically with less training time and high prediction accuracy. In the field of malware detection, many research studies adopted machine learning techniques. Hou et al. [27] developed a system to learn the features of malicious Android application API calls. Grosse et al. [28] and Wang et al. [29] proposed adversary approaches in the deep neural network. Grosse et al. [28] used the deep neural network as a. 6 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(14) two-class classifier for static malware analysis. Wang et al. [29] developed a deep neural network on audit logs to perform dynamic malware analysis. They also demonstrated the validity of the algorithm in other image data sets. Dahl et al. [30] learned the API calls feature and adopt random projections to deal with large, sparse, binary feature sets. Chiu et al. [31] performed clustering for system call sequences and found the features of malware behaviors. Training SLFN to distinguish between malicious behavior and benign behavior is a good application of the bipartite classification. Therefore, we decided to test the detection rate of malicious behavior of BML in our experiments. There are many types of malware behaviors in our data set. In order to do further classification for different behaviors, combining several trained neural networks to be a cascade classifier might be a possible solution. Breiman [32] combines the prediction of the tree classifiers. Bell and Koren [33] also verify the feasibility of combining multiple predictors. To synthesize the classification results of each neural network to determine the class of a system call sample might be a feasible approach. However, Krizhevsky et al. [3] consider that it appears to be too much cost for complex neural networks. Thus, we found another solution for multi-class classifying, the softmax neural network. Neural networks which contain softmax transformation in their output layer are suitable for multi-class classifying. These neural networks are usually trained with backpropagation gradient-descent procedure. [34] Lawrence et al. [2] trained convolutional neural networks with softmax function to recognize faces. Krizhevsky et al. [3] build large neural networks with softmax and classify 1,000 types of images. Karpathy et al. [4] also verified the classification ability of softmax neural networks on large scale video data set. Former studies have proved the effectiveness of neural networks with softmax function. We decide to follow the MNIST instruction in TensorFlow [35], applying the softmax neural network to be the multi-class classifier for practical application.. 7 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(15) 3. Methodology

3.1. Resistant Learning on Single Hidden Layer Feed-forward Neural Network

The SLFN has been proven able to represent complex input-output relationships [36]. Given a set of N observations (x^1, y^1), ..., (x^N, y^N), we can train an SLFN to learn the input-output relationship of the observations. The SLFN can be regarded as a fitting function f(x, w), where w is the parameter vector. The value f(x^c, w) is expected to be equal or very close to y^c. To train such a fitting function, we must define the loss function of the SLFN. We use one of the most commonly adopted methods, the least squares estimator (LSE), to evaluate the loss of the model. The LSE of the cth observation (x^c, y^c) is defined in (1). The ultimate goal of training the SLFN model is to minimize $\sum_{c=1}^{N}(e^c)^2$. Gradient descent is a popular method for optimizing an SLFN, and we apply gradient descent to minimize the LSE in this paper.

$$(e^c)^2 = (y^c - f(x^c, w))^2 \qquad (1)$$

From the perspective of statistics, outliers are the observations that lie far away from the fitting function, i.e., outliers have a larger squared error (e^c)^2. Resistant learning means that the machine learning process is not significantly influenced by the outliers. Resistant learning has a meaning similar to robust learning, but resistant learning does not need prior knowledge about the majority and the outliers. Resistant learning selects an appropriate majority of observations according to a guideline, trains the model with that majority, and applies a special method for learning the outliers. Tsaih and Cheng [7] proposed a resistant learning algorithm using an SLFN to find a near-perfect fitting function for all of the numerical observations. Nevertheless, such a training process is very time-consuming. Huang et al. [9] proposed the resistant learning procedure with an envelope. It eases the restriction on the fitting function for numeric observations and makes resistant learning more efficient. However, past solutions cannot be appropriately applied to nominal data.

8 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
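As a concrete illustration of how squared errors expose outliers, consider the following minimal sketch with made-up numbers (not data from this thesis): the observation whose prediction deviates most from its desired output has the largest (e^c)^2 and is the most outlier-like.

import numpy as np

# Hypothetical 1-D example: predictions f(x^c) from a fitted model vs. desired outputs y^c.
y      = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0,  1.0])
f_of_x = np.array([0.9, 1.1, 0.8, -0.9, -1.2, -1.1, -0.95])  # last observation is mislabeled

squared_error = (y - f_of_x) ** 2            # (e^c)^2 as in Eq. (1)
ranking = np.argsort(squared_error)[::-1]    # largest residual first

print(squared_error)
print("most outlier-like observation index:", ranking[0])  # index 6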

(16) Therefore, we propose a nominal majority learning mechanism that uses the resistant learning procedure to cope with the observations that are difficult to learn and performs robust learning to avoid interference from outliers.

3.2. Bipartite Majority Learning

The proposed resistant learning method, bipartite majority learning (BML), focuses on the binary classification problem, and the SLFN is used as the learning model. The SLFN architecture is defined in (2) to (4). Table 1 describes the notation: a_i(x) is the output value of the ith hidden node, and f(x) is the output value of the SLFN. The activation function, tanh, is the hyperbolic tangent. The SLFN can serve as a bipartite classifier by setting a threshold [12]: if the output value of an observation is greater than or equal to the threshold, it is considered class 1; otherwise class 2. Figure 1 shows the tensor graph of the SLFN.

Table 1: Table of Notations

Notation     Description
x^c          x ≡ (x_1, ..., x_m)^T; x^c is the cth input observation.
y^c          The desired output corresponding to the cth input observation x^c.
m            The dimension of the input observation x.
p            The number of adopted hidden nodes.
w^H_{i0}     The bias value θ of the ith hidden node.
w^H_{ij}     The weight between x_j and the ith hidden node.
w^O_0        The bias value θ of the output node.
w^O_i        The weight between the ith hidden node and the output node.

$$f(x) \equiv w^O_0 + \sum_{i=1}^{p} w^O_i\, a_i(x) \qquad (2)$$

9 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(17) $$a_i(x) \equiv \tanh\!\left(w^H_{i0} + \sum_{j=1}^{m} w^H_{ij}\, x_j\right) \qquad (3)$$

$$\tanh(x) \equiv \frac{e^x - e^{-x}}{e^x + e^{-x}} \qquad (4)$$

Figure 1: The tensor graph of SLFN.

10 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(18) Algorithm 1 Define the SLFN tensor graph in Python code

x ← tf.placeholder(tf.float64)
y ← tf.placeholder(tf.float64)
ht ← tf.Variable(h_t)
hw ← tf.Variable(h_w)
hidden_layer ← tf.tanh(tf.add(tf.matmul(x, hw), ht))
ot ← tf.Variable(o_t)
ow ← tf.Variable(o_w)
y_hat ← tf.add(tf.matmul(hidden_layer, ow), ot)
sr ← tf.reduce_sum(tf.square(y − y_hat))
train ← tf.train.GradientDescentOptimizer(eta).minimize(sr)

Algorithm 1 is the corresponding code in Python. In TensorFlow, we define the computational relationships between tensors; the bound tensors form a data flow graph, which is the tensor graph (i.e., Figure 1). Variables x and y are tensors of the tf.placeholder type that hold the input data. The variables hw (hidden layer weights, w^H_{ij}), ht (hidden layer theta, w^H_{i0}), ow (output layer weights, w^O_i), and ot (output layer theta, w^O_0) are tensors of the tf.Variable type, which are modified by the optimizer. The structure of the SLFN is defined by the above tensors and certain tensor operations: tf.matmul performs matrix multiplication, tf.add performs matrix addition, and tf.square squares all the elements in a tensor. tf.reduce_sum calculates the sum of all squared residuals. Finally, the optimizer, tf.train.GradientDescentOptimizer, applies the gradient descent method to modify the variable tensors of neuron weights.

In the supervised learning scenario for binary classification, we should give appropriate labels to our training set. In general, the desired outputs are set to [1, 0] and [0, 1] for binary classification when the network has two output nodes. In this study, however, our SLFN has only one output node, and the desired outputs of the observations are given dynamically by a specific method. The learning goal of the SLFN is to discern the majority of the two classes of data.

11 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
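For completeness, the graph built in Algorithm 1 could be exercised with the TensorFlow 1.x session API roughly as sketched below. The names x, y, y_hat, sr, and train refer to the tensors defined in Algorithm 1, while xs and ys are assumed NumPy arrays holding the currently selected reference observations; the iteration budget and stopping tolerance are placeholders, not the thesis's actual settings.

# Assumes `import tensorflow as tf` (1.x) and that the tensors from Algorithm 1
# (x, y, y_hat, sr, train) have been created; xs has shape [n, m], ys has shape [n, 1].
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(1000):                       # assumed iteration budget
        _, loss = sess.run([train, sr], feed_dict={x: xs, y: ys})
        if loss < 1e-6:                            # assumed stopping tolerance
            break
    outputs = sess.run(y_hat, feed_dict={x: xs})   # forward pass: f(x^c) for the fed observations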

(19) We adopt the linearly separating condition (condition L) [12] to distinguish the two classes of observations. The α in (5) is the minimum output value of the observations in class 1 (C1), and the β in (6) is the maximum output value of the observations in class 2 (C2). If α > β over all f(x^c), c ∈ n, the condition L in (7) is satisfied, and the two classes of observations can be separated by the threshold (α + β)/2. In practice, we label C1 as {1} and C2 as {-1} at the beginning of the training process.

$$\alpha = \min_{y^c \in C_1} f(x^c) \qquad (5)$$

$$\beta = \max_{y^c \in C_2} f(x^c) \qquad (6)$$

$$\text{The Linear Separating Condition: } \alpha > \beta \qquad (7)$$

Since we adopt condition L as the learning goal of the SLFN, training can be faster than with the envelope method. The envelope method proposed by Huang et al. [9] requires the square error between y^c and f(x^c) to be less than two times the standard deviation. Condition L is less restrictive than the envelope module but more appropriate for bipartite classification.

Table 2 presents the proposed bipartite majority learning algorithm. Assume there are N observations, and γ is the majority rate with γN > m + 1. The BML algorithm terminates when more than γN observations are correctly classified and condition L is satisfied. Let S(N) be the set of N observations. Let the nth stage be the stage of handling n reference observations (i.e., S(n)), with γN ≥ n > m + 1. Let Ŝ(n) be the set of observations that are classified correctly under condition L at the end of the nth stage. Then the acceptable SLFN estimate leads to a set of {(x^c, y^c)} for which a threshold can be found that separates the two classes of observations for all c ∈ Ŝ(n). Meanwhile, |Ŝ(n)| ≥ n since S(n) ⊆ Ŝ(n). To put it another way, at the end of the nth stage, the acceptable SLFN estimate provides a fitting function f for which a threshold can be found that correctly classifies at least n observations, namely those in {(x^c, y^c) : c ∈ Ŝ(n)}.

12 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
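Before walking through Table 2, condition L and the resulting threshold classifier can be checked directly from the SLFN outputs. The following is a minimal NumPy sketch with assumed variable names and toy values, not the thesis's implementation:

import numpy as np

def condition_L(outputs, labels):
    """Check the linear separating condition for SLFN outputs.

    outputs: array of f(x^c) values; labels: +1 for C1, -1 for C2.
    Returns (satisfied, threshold) where threshold = (alpha + beta) / 2.
    """
    alpha = outputs[labels == 1].min()    # Eq. (5): minimum output over C1
    beta = outputs[labels == -1].max()    # Eq. (6): maximum output over C2
    return alpha > beta, (alpha + beta) / 2.0

# Toy example: all C1 outputs exceed all C2 outputs, so condition L holds.
outs = np.array([0.8, 0.6, 0.9, -0.2, -0.5, 0.1])
labs = np.array([1, 1, 1, -1, -1, -1])
ok, thr = condition_L(outs, labs)
print(ok, thr)    # True, (0.6 + 0.1) / 2 = 0.35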

(20) Table 2: The bipartite majority learning algorithm

Step 1  Randomly obtain the initial m + 1 reference observations, with each of the two classes accounting for half of the m + 1 observations. Let S(m + 1) be the set of these observations. Set up an acceptable SLFN estimate with one hidden node regarding the reference observations (x^c, y^c) for all c ∈ S(m + 1). Set n = m + 2.

Step 2  If n > γN, STOP.

Step 3  Present the n − 1 reference observations (x^c, y^c) that have the largest distances between C1 and C2. Then select another observation (x^k, y^k) so that the value of α − β will be the largest. Let S(n) be the set of observations selected in stage n.

Step 4  If the n reference observations satisfy condition L, go to Step 7.

Step 5  Set w̃ = w.

Step 6  Apply the gradient descent algorithm to adjust the weights w until one of the following cases occurs: (1) the n reference observations satisfy condition L; go to Step 7. (2) The n observations cannot satisfy condition L; then restore the weights, set w = w̃, and apply the resistant learning mechanism by adding extra hidden nodes to obtain an acceptable SLFN.

Step 7  n + 1 → n; go to Step 2.

The proposed BML executes the following two procedures: (i) the ordering procedure, implemented by Step 3, which determines the input sequence of reference observations; and (ii) the modeling procedure, implemented by Step 6, which adjusts the weights of the SLFN to minimize the sum of squared residuals $\sum_{c=1}^{N}(e^c)^2$. If the gradient descent mechanism cannot tune the weights to find an acceptable SLFN, the weights are restored and the number of hidden nodes adopted in the SLFN is adjusted.

13 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(21) Finally, all n observations S(n) at the nth stage will satisfy condition L. The detailed operations are explained as follows.

(Step 1) It first randomly chooses m + 1 observations from the N training data. Then it calculates the weights of the initial neural network using these m + 1 reference observations. The initial weights of the neural network are given by formulas (8) to (11). We first calculate w^O_0 and w^O_1 in (8) and (9) from all of the reference observations. Next, we calculate w^H_{i0} and w^H_{ij} from (10) and (11). There are m + 1 hidden weight variables, so we can use the m + 1 reference observations to obtain a set of m + 1 simultaneous equations. We can then solve the m + 1 simultaneous equations using matrices [37] to get the desired hidden weight values and make f(x^c) = y^c ∀ c ∈ S(m + 1).

$$w^O_0 = \min_{c \in S(N)} y^c - 1 \qquad (8)$$

$$w^O_1 = \max_{c \in S(N)} y^c - \min_{c \in S(N)} y^c + 2 \qquad (9)$$

$$\tilde{y}^c = \tanh^{-1}\!\left(\frac{y^c - \min_{c \in S(N)} y^c + 1}{\max_{c \in S(N)} y^c - \min_{c \in S(N)} y^c + 2}\right) \qquad (10)$$

$$w^H_{i0} + \sum_{j=1}^{m} w^H_{ij}\, x^c_j = \tilde{y}^c \quad \forall\, c \in S(m+1) \qquad (11)$$

Algorithm 2 shows how we use the TensorFlow API to define the operations of equations (8) to (11).

14 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(22) Algorithm 2 Calculate the first SLFN weights in Python code

o_w ← tf.reduce_max(y) − tf.reduce_min(y) + 2
o_t ← tf.reduce_min(y) − 1
s_y ← y[: m + 1]
yc ← tf.atanh((s_y − o_t) / o_w)
s_x ← x[: m + 1]
h_t_vector ← tf.ones([m + 1, 1])
xc ← sess.run(tf.concat([s_x, h_t_vector], axis=1))
answer ← sess.run(tf.matrix_solve_ls(xc, yc))
h_w ← answer[: m]
h_t ← answer[m :]

The purpose of using this method to set the weights is to ensure that the initial neural network meets the condition e^c = 0 for all c ∈ S(m + 1). That is, the initial SLFN perfectly represents the correspondence between x^c and y^c for the m + 1 reference observations.

(Step 2) This is the termination condition of the system. We set the majority rate γ to 95%, which guarantees that the SLFN can correctly discern more than 95% of the observations in the training set.

(Step 3) BML first computes all the possible values of α − β over n − 1 observations drawn from all N observations. It then selects a set of n − 1 observations that has the maximal value of α − β. To find these n − 1 observations, we first sort the values f(x^c) in C1 and in C2, respectively. Then we take the i largest f(x^c) in C1 and the (n − 1 − i) smallest f(x^c) in C2, for i ∈ [1, n − 2], to calculate all possible values of α − β. The time complexity of obtaining such n − 1 observations is O(N log N), since the time complexity of sorting is O(N log N) and the time complexity of calculating all possible α − β values is O(n), with N > n. Compared to the training process, this step does not significantly reduce the efficiency of learning. After selecting the n − 1 observations, BML picks another observation (x^k, y^k) so that the value of α − β remains the largest. The purpose of this selection mechanism is to select the n observations that are most likely to be classified correctly under condition L.

15 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
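A minimal NumPy sketch of the Step 3 ordering described above (variable names assumed, and written with an explicit loop rather than the tensor operations used in the actual system): sort the C1 outputs in descending order and the C2 outputs in ascending order, then scan the split i that maximizes α − β among n − 1 observations.

import numpy as np

def select_reference(outputs, labels, n):
    """Pick n - 1 observations maximizing alpha - beta (cf. Step 3 of Table 2).

    outputs: f(x^c) for all N observations; labels: +1 for C1, -1 for C2.
    Returns indices of the selected n - 1 observations. (Illustrative sketch only;
    the choice of the extra n-th observation x^k is omitted, and both classes are
    assumed to have enough observations.)
    """
    idx_c1 = np.where(labels == 1)[0]
    idx_c2 = np.where(labels == -1)[0]
    c1_sorted = idx_c1[np.argsort(outputs[idx_c1])[::-1]]   # C1 outputs, descending
    c2_sorted = idx_c2[np.argsort(outputs[idx_c2])]          # C2 outputs, ascending

    best_gap, best_i = -np.inf, 1
    for i in range(1, n - 1):                  # i from C1, (n - 1 - i) from C2
        if i > len(c1_sorted) or (n - 1 - i) > len(c2_sorted):
            continue
        alpha = outputs[c1_sorted[i - 1]]      # smallest of the i largest C1 outputs
        beta = outputs[c2_sorted[n - 2 - i]]   # largest of the (n-1-i) smallest C2 outputs
        if alpha - beta > best_gap:
            best_gap, best_i = alpha - beta, i
    return np.concatenate([c1_sorted[:best_i], c2_sorted[:n - 1 - best_i]])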

(23) The (n − 1)th stage S(n − 1) asserts that there is at least one set of n − 1 observations for which α − β > 0. Although the n − 1 observations selected in the nth stage are not necessarily equal to S(n − 1), this selection mechanism ensures that if the observation (x^k, y^k) is excluded, the remaining n − 1 observations can satisfy condition L.

(Step 4) It checks whether the n selected reference observations satisfy condition L. If so, the nth stage has found at least one set of n observations for which α − β > 0, and it goes to the next stage. If not, BML attempts to find an acceptable SLFN for the chosen observations S(n).

(Step 5) It saves the current weights of the SLFN for the resistant learning procedure. At the end of the (n − 1)th stage, the observations in {(x^c, y^c) : c ∈ S(n − 1)} satisfy condition L. The resistant learning procedure can cram a new observation (i.e., (x^k, y^k)) by adding two hidden nodes without affecting the outputs of the other observations. Adjusting the weights with the gradient descent mechanism in Step 6 might make the n − 1 observations picked in stage n violate condition L. Therefore, the current state of the neural network needs to be stored temporarily so that it can be restored in Step 6.2.

(Step 6) We apply the gradient descent mechanism to find an acceptable SLFN. For the purpose of nominal supervised learning, the learning targets are given dynamically rather than as fixed values. Although we give the desired outputs y^c for the C1 and C2 observations at the beginning, the difference between the two classes of observations matters more. Let S̄(n) be the subset of S(n) such that S(n) = {k} + S̄(n). We first calculate max(f(x^{C1})) and min(f(x^{C2})) over all c ∈ S̄(n); then the supervised learning target changes to max(f(x^{C1})) for every (x^c, y^c) ∈ C1 and to min(f(x^{C2})) for every (x^c, y^c) ∈ C2. After setting the learning targets of S̄(n), we compute the values α and β of S̄(n). Then, if y^k ∈ C1, the learning target of x^k is set to α; otherwise (y^k ∈ C2), the learning target of x^k is set to β. The gradient descent mechanism is then applied to adjust the weights to find an acceptable SLFN.

(Step 6.1) If an acceptable SLFN is found, we move to the next stage.

16 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(24) However, we might encounter the local optimum problem, caused either by the implementation of the gradient descent mechanism or by an SLFN model that does not have enough hidden nodes. Both situations lead to an unacceptable SLFN estimate for the n reference observations. Therefore, we adopt the resistant procedure proposed by Tsaih and Cheng [7] to cope with the observation x^k.

(Step 6.2) We apply the resistant learning procedure. Restore the w̃ stored in Step 5. Then we add two hidden nodes to change the output value of x^k to its learning target; this also reflects that the observation x^k lies closer to the threshold than the other observations of its class. The outputs of the other observations c ∈ S̄(n) are not significantly affected by the resistant learning procedure. The weight formulas of the newly added hidden nodes are defined in (12) to (16).

$$w^H_{p-1,0} = \zeta - \lambda\, \alpha^T x^k \qquad (12)$$

$$w^H_{p,0} = \zeta + \lambda\, \alpha^T x^k \qquad (13)$$

$$w^H_{p-1} = \lambda\, \alpha^T \qquad (14)$$

$$w^H_{p} = -\lambda\, \alpha^T \qquad (15)$$

$$w^O_{p-1} = w^O_{p} = \frac{\left|\, y^k - w^O_0 - \sum_{i=1}^{q} w^O_i\, a_i(x^k) \right|}{2\tanh(\zeta)} \qquad (16)$$

ζ is a small constant set to 0.05. λ is a large constant set to 10^5. α^T is an m-dimensional vector whose length equals 1 and which satisfies condition (17).

$$\alpha^T (x^k - x^c) \neq 0 \quad \forall\, c \in I(n) - \{k\} \qquad (17)$$

By adding the two hidden nodes to the hidden layer, the SLFN satisfies condition L for the n reference observations S(n).

17 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
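The effect of the two crammed hidden nodes in (12) to (16) can be verified numerically. The sketch below uses assumed small dimensions and a made-up target shift delta; it checks that the added pair of tanh nodes moves the output at x^k by roughly delta while leaving the other observations almost untouched, which is the property the resistant procedure relies on.

import numpy as np

zeta, lam = 0.05, 1e5
m = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(5, m))              # 5 observations; x^k is row 0 (assumed toy data)
xk = X[0]
alpha_vec = rng.normal(size=m)
alpha_vec /= np.linalg.norm(alpha_vec)   # unit-length alpha^T; condition (17) holds almost surely here

delta = 0.7                               # assumed desired change of the output at x^k
w_out = delta / (2 * np.tanh(zeta))       # shared output weight of the two new nodes (cf. Eq. (16))

def added_contribution(x):
    a1 = np.tanh((zeta - lam * alpha_vec @ xk) + lam * alpha_vec @ x)   # node p-1, Eqs. (12) and (14)
    a2 = np.tanh((zeta + lam * alpha_vec @ xk) - lam * alpha_vec @ x)   # node p, Eqs. (13) and (15)
    return w_out * (a1 + a2)

print(added_contribution(xk))                                # ~= delta: x^k is moved to its target
print([round(added_contribution(x), 6) for x in X[1:]])      # ~= 0 for every other observation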

(25) The output value f(x^k) will be very close to α if y^k ∈ C1; otherwise (y^k ∈ C2), the output value f(x^k) will be near β.

(Step 7) We increase n by 1 and repeat Steps 2 to 7.

Most machine learning methods use all of the training data as the basis for adjusting weights. In the BML mechanism, in order to prevent anomalous observations from affecting SLFN learning and to pick appropriate majority observations, BML starts with a small amount of data and gradually increases the amount of selected data. The advantage of this approach is that, since we do not necessarily know in advance which data belong to the majority and which are anomalies, anomalous data may be selected when the first m + 1 observations are acquired. However, as the number of selected observations n increases, the selection mechanism in Step 3 selects those that are most easily classified under condition L. Since the n observations selected in the nth stage are the ones most suitable for the current SLFN, they do not necessarily include the n − 1 observations selected in the (n − 1)th stage. Through this dynamic selection method, we can select appropriate majority data and avoid the anomalies' impact on the effectiveness of learning. Although increasing n by only 1 at a time would be slower than training the SLFN with all the data at once, we found in our experiments that the SLFN does not need to be retrained at every stage, because the selected observations already satisfy condition L. The BML mechanism can quickly move to the next stage when most of the training data can be classified correctly.

3.3. Majority Learning on Softmax Neural Network

The softmax neural network is a popular neural network architecture that is commonly adopted for nominal classification. Normally, a softmax neural network learns from all of the training data, but in order to make the softmax neural network a comparable benchmark learning model, we design a majority learning method for it. The designed softmax majority learning method has six procedures, which are similar to those of the proposed majority method, BML. We list the process of softmax majority learning in Table 3.

18 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(26) Table 3: The softmax majority learning algorithm

Step 1  Randomly obtain the initial m + 1 reference observations, with each of the two classes accounting for half of the m + 1 observations. Let S(m + 1) be the set of these observations. Set up an acceptable model estimate regarding the reference observations (x^c, y^c) for all c ∈ S(m + 1). Set n = m + 2.

Step 2  If n > γN, STOP.

Step 3  Present the n reference observations (x^c, y^c) that have the smallest cross-entropy values. Let S(n) be the set of observations selected in stage n.

Step 4  If the n reference observations can be correctly classified by the model, go to Step 6.

Step 5  Apply the gradient descent algorithm to adjust the weights w until one of the following cases occurs: (1) the n reference observations can be correctly classified by the model; go to Step 6. (2) The optimization is stuck in a local optimum and the n observations cannot be correctly classified by the model; go to Step 6.

Step 6  n + 1 → n; go to Step 2.

The designed softmax majority learning method has the same procedure for increasing the training data and the same termination condition. That is to say, compared to the general training method of the softmax neural network, we iteratively obtain a new observation to add to the training data in each stage. We modify the other steps to fit the softmax neural network to the majority learning process. In Step 1, we arbitrarily pick m + 1 observations as the initial training data. We adopt the gradient descent method to obtain an initial network state in which the m + 1 observations can be correctly classified by the softmax neural network. Note that the gradient descent method cannot always tune the weights to find an acceptable neural network; we can only minimize the cross-entropy of the model. In Step 2, we check the termination condition, just like BML.

19 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(27) In Step 3, we do a forward pass for all N observations, calculate each observation's cross-entropy, and sort the N observations by their cross-entropy values. We take the n observations with the smallest cross-entropy values as the training data in stage n. In Step 4, we check whether the n observations with the smallest cross-entropy values can be correctly classified by the softmax neural network. If so, go to Step 6. Otherwise, go to Step 5: the softmax neural network applies the gradient descent optimizer to tune the weights until one of the following occurs: (1) the softmax neural network correctly classifies the n observations; (2) the gradient descent optimizer gets stuck in a local optimum and cannot find a set of weights that correctly classifies the n observations.

The softmax majority learning method does not have a resistant learning procedure. After the training process, the ANN model cannot guarantee that the γN majority will be correctly classified. In other words, the softmax majority learning method has a limited ability to deal with outliers.

The ANN model with the softmax function in the softmax majority learning method can be regarded as another form of condition L. Because we focus on the bipartite classification problem, the ANN model has two output nodes in the output layer. If we concatenate a new layer with one node to the original two output nodes, with connecting weights 1 and -1 respectively, the new output node satisfies condition L: one of the two classes has output values greater than 0, and the other has output values less than 0. Figure 2 illustrates the effect of this operation.

20 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(28) Figure 2: The condition L in another form.. 21 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
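Step 3 of Table 3 ranks all observations by their per-sample cross-entropy and keeps the best-fitted ones. A minimal NumPy sketch of that ranking (variable names and toy values are assumptions, not the thesis's implementation):

import numpy as np

def pick_smallest_cross_entropy(probs, onehot, n):
    """Return indices of the n observations with the smallest cross-entropy.

    probs:  softmax outputs of the network, shape [N, 2]
    onehot: one-hot desired outputs, shape [N, 2]
    """
    per_sample_ce = -np.sum(onehot * np.log(probs + 1e-12), axis=1)
    return np.argsort(per_sample_ce)[:n]

# Toy example with four observations; the last one is poorly predicted.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.55, 0.45]])
onehot = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(pick_smallest_cross_entropy(probs, onehot, 3))   # selects the three best-fitted samples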

(29) 3.4. Majority Learning on Support Vector Machines. Support Vector Machines (SVMs) are a prevalent supervised learning method for data classification, regression, and other learning tasks. [38] SVMs have several advantages, for example, SVMs are effective in high dimensional spaces and still effective in cases where number of dimensions is greater than the number of samples. SVMs have several kernel functions that can be specified for the decision function. We also design a majority learning method for SVM. Table 4 describe the SVM majority learning process. Table 4: The SVM majority learning algorithm Step 1. Randomly obtain the initial m + 1 reference observations, two classes of observations each account for half of the m+1 observations. Let S(m + 1) be the set of observations of these observations. Set up an initial model regarding the reference observations (xc , y c ) for all c ∈ S(m + 1). Set n = m + 2.. Step 2. If n > γN , STOP.. Step 3. Present the n reference observations (xc , y c ) that are the ones with the smallest −y c ∗ f (xc ) value. Let S(n) be the set of observations selected in stage n.. Step 4. If n reference observations can be correctly classified by the model, go to Step 6.. Step 5. Apply the weight tuning method to adjust weights w until one of the following cases occurs: (1) If n reference observations can be correctly classified by model, go to Step 6. (2) If the n observations stuck in local optimal and cannot correctly classified by model, go to step 6.. Step 6. n + 1 → n; go to Step 2.. The SVM majority learning algorithm has the same sort-and-select procedure to pick the data most familiar to the trained classifier. The main idea of SVM is to find a 22 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(30) hyperplane that can separate two classes of data in a high-dimensional space. Thus, in Step 3 we pick the data with the smallest −y^c · f(x^c) values, where y^c is the desired output of the training data (y^c is 1 for all class 1 data and −1 for all class 2 data) and f(x^c) is the predicted output value of the SVM classifier for x^c. If the two values have the same sign, x^c is separated into the correct half-space by the hyperplane; if they have opposite signs, x^c is separated into the wrong half-space. The larger the value |f(x^c)|, the farther x^c is from the hyperplane. Under this selection criterion, data that are correctly classified and far from the hyperplane are selected first.

SVM classification methods can use several kernels. Linear, polynomial, radial basis function (RBF), and sigmoid are commonly used kernels, and each kernel has a different ability to classify different data distributions. Figure 3 illustrates the ability of different kernels to classify a specific data distribution.

Figure 3: An example for different SVM kernels classifying two class data.

As for the implementation, Scikit-Learn [39] provides high-level functions for building SVM classifiers. We use the Scikit-Learn library to implement the SVM majority learning algorithm.

23 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
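A minimal sketch of the ranking criterion in Step 3 of Table 4 using Scikit-Learn, where decision_function plays the role of f(x^c); the toy data, the initial subset, and the single fit (rather than the full refit-per-stage loop) are simplifications for illustration:

import numpy as np
from sklearn import svm

def svm_majority_rank(X, y, clf):
    """Rank observations by -y * f(x): smaller means better separated (Step 3 of Table 4)."""
    margins = -y * clf.decision_function(X)
    return np.argsort(margins)

# Toy usage: fit an RBF-kernel SVM on an initial mixed subset, then rank all observations.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 0.5, size=(20, 2)), rng.normal(-1.0, 0.5, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

init = np.r_[0:5, 20:25]                              # assumed initial reference observations
clf = svm.SVC(kernel="rbf").fit(X[init], y[init])
order = svm_majority_rank(X, y, clf)
print(order[:5])    # the five observations the current classifier separates most confidently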

(31) 3.5. Multi-Class Classifier for Majority

For the bipartite classification problem, the SLFN obtained from BML can be used as a bipartite classifier. An input can be classified by comparing its output value with a threshold. The threshold of the SLFN is (α + β)/2, where α is the minimum output value in C1 and β is the maximum output value in C2 over the majority of the training set. If f(x^t) ≥ (α + β)/2, x^t is classified as C1; otherwise C2.

As for the multi-class classification problem, for example classifying several types of malware behaviors, we need to add an additional mechanism to our neural network. In the literature, a cascade classifier is an option [40]; however, a cascade classifier may lead to poor classification accuracy when there are too many classes. Hence, inspired by the MNIST experiments in TensorFlow [35], we build a neural network with a softmax function to perform multi-class classification. This neural network does not contain any hidden layer. The softmax function is a generalization of the logistic function that converts a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range [0, 1] that add up to 1. The neural network architecture is defined in (18). We use cross-entropy as the loss of our model, and the cross-entropy is defined in (19).

$$g(x^c) \equiv \mathrm{softmax}(W x^c + b) \qquad (18)$$

$$H_{y'}(y) = -\sum_{c=1}^{N} y^{c\prime} \log(y^c) \qquad (19)$$

Figure 4 shows the tensor graph of the softmax neural network. Algorithm 3 is the corresponding code in Python. Variables x and y are tensors of the tf.placeholder type that hold the input data. W and b are tensors of the tf.Variable type; the shape of W is determined by the dimension of x and the dimension of y. tf.matmul performs matrix multiplication and tf.add performs matrix addition. tf.reduce_sum sums up all of the cross-entropies. Finally, the optimizer, tf.train.GradientDescentOptimizer, applies gradient

24 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(32) descent method to modify the variable tensors of neuron weights.. Figure 4: The tensor graph of the softmax neuron network.. 25 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(33) Algorithm 3 Define the softmax tensor graph in Python code

x ← tf.placeholder(tf.float64)
y ← tf.placeholder(tf.float64)
W ← tf.Variable(tf.zeros([x.shape[1], y.shape[1]]))
b ← tf.Variable(tf.zeros([y.shape[1]]))
y_hat ← tf.nn.softmax(tf.matmul(x, W) + b)
c_e ← −tf.reduce_sum(y ∗ tf.log(y_hat))
train ← tf.train.GradientDescentOptimizer(eta).minimize(c_e)

Compared to the SLFN, this neural network architecture has neither a hidden layer nor an activation function. Gradient descent is also adopted here to adjust the weights and bias. The advantage of this neural network is that it can classify multiple malicious behaviors, whereas the aforementioned SLFN can only distinguish between malicious behavior and benign behavior. Using the softmax neural network, we can discern the category of a malicious behavior. The neural network outputs a set of probability values representing the likelihood of each category, and we take the class with the maximum probability value as the classification result of the neural network.

26 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
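Taking the class with the maximum probability corresponds to an argmax over the softmax outputs. A small TensorFlow 1.x sketch of that evaluation step is given below; x, y, and y_hat are the tensors from Algorithm 3, while xs_test and ys_test are assumed NumPy arrays of test samples and their one-hot labels.

# Assumes the tensors from Algorithm 3 (x, y, y_hat) have been created.
predicted_class = tf.argmax(y_hat, axis=1)                 # index of the largest probability per sample
correct = tf.equal(predicted_class, tf.argmax(y, axis=1))  # compare with the one-hot desired output
accuracy = tf.reduce_mean(tf.cast(correct, tf.float64))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(accuracy, feed_dict={x: xs_test, y: ys_test}))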

(34) 4. EXPERIMENTS

In the experiments, we perform dynamic malware analysis and evaluate the effectiveness of the proposed BML on real-world data sets. We apply BML to learn two types of features: malware behavior and benign program behavior.

4.1. Malware Samples from OWL

In our experiments, 998 malicious programs and 2 benign programs are used. The malware executables are downloaded from the OWL database [41]. Chiu et al. [31] executed the malware samples in a virtual machine with Cuckoo Sandbox [42] installed and used the tool Viso [43] to record the chronological data of the system calls. The two benign programs are Google Chrome and FileZilla. The chronological data are then split into multiple individual samples with the same window size N. Fig. 5 illustrates the method of collecting training data.

Figure 5: Generate system call samples.

With the window size N = 1,000, there are 1,229,634 samples in total. The number of distinct system calls recorded in these programs is 52. Hence, we can build a 52-dimensional vector for each sample, in which each dimension represents the frequency of occurrence of a specific system call.

27 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.
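The conversion from a recorded call sequence to fixed-length frequency vectors can be sketched as follows. This is a simplified, assumed reimplementation for illustration (non-overlapping windows, plain Python counting), not the tool chain used in [31]:

import numpy as np

def to_frequency_vectors(call_sequence, call_vocabulary, window=1000):
    """Split a chronological system-call sequence into windows of `window` calls and
    count how often each distinct call appears in every window."""
    call_index = {name: i for i, name in enumerate(call_vocabulary)}
    vectors = []
    for start in range(0, len(call_sequence) - window + 1, window):
        vec = np.zeros(len(call_vocabulary))
        for call in call_sequence[start:start + window]:
            vec[call_index[call]] += 1
        vectors.append(vec)
    return np.array(vectors)   # shape: (#windows, 52) for the 52 distinct calls in our data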

(35) After clustering by the GHSOM algorithm [31], there were 19 clusters containing only malicious behavior. In this case, the samples that are not in those 19 clusters are considered non-malicious, i.e., benign samples. Hence, the unique patterns of these 19 clusters serve as the detection rules for malicious behaviors. Some clusters contain many samples with repetitive features. In order to prevent bias in model training, we filtered out the samples with identical features in advance, keeping only one sample per individual feature. In addition, we need a large enough sample size for the division into training and testing sets, so we select clusters with more than 1,000 samples. After these two filtering processes, 9 malware detection rule clusters meet our needs. To balance the amounts of benign and malware samples, we randomly sampled 19,391 benign samples from the 1,187,842 benign samples; the number 19,391 equals the sum of the 9 malware sample counts. The sample amount of each detection rule is shown in Table 5.

Table 5: Sample amount in each detection rules

Rule 1: 3175
Rule 2: 1331
Rule 3: 2025
Rule 4: 1838
Rule 5: 2451
Rule 6: 4356
Rule 7: 1208
Rule 8: 1220
Rule 9: 1787
Benign: 19391

4.2. Evaluation

To evaluate our model, we compare BML with state-of-the-art methods, including the former resistant learning approach, envelope (ENV) [9], softmax, and SVM. In order to maintain the same benchmark and make a fair comparison, the softmax and SVM methods are adapted to majority learning. ENV already has a majority-picking method but does not have a classification method for bipartite data; thus, we combined ENV with condition L to evaluate ENV's classification accuracy. The ANN with a softmax function does not have a majority method, so we designed a majority learning method for the softmax neural network in Section 3.3. SVM methods do not have majority methods either, so we designed majority learning methods for SVMs in Section 3.4. We applied these majority learning methods in Experiments 1.1, 2.1, and 3 to compare their performance.

28 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

(36) The majority learning mechanisms mentioned in this study are applied to learn the majority pattern in each cluster. We test all algorithms in the same hardware environment. We implement the majority learning methods with the TensorFlow API and use a GPU to accelerate training. In this paper, we set γ to 95%, which means that all of the majority learning mechanisms stop when 95% of the majority in the training set is correctly classified. We tested several different settings of γ. Since our system call data set has already been clustered by GHSOM, there is a clear difference between the benign sample patterns and the malicious sample patterns, and the SLFN can easily separate most of the data. Therefore, we set a stricter majority rate, 95%, to ensure that the SLFN learns most of the training data patterns without being interfered with by the few abnormal patterns.

We test the performance of BML under different experimental designs. In Experiment 1, we focus on the performance of BML in classifying the samples in each cluster and test whether BML selects the appropriate majority. In Experiment 2, we examine the performance of BML in classifying a large number of samples in each cluster and again test whether BML selects the appropriate majority. In Experiment 3, we test whether BML is able to classify two categories of samples that contain a variety of benign and malware patterns.

4.2.1. Exp. 1.1: Majority Learning on Small-Size Sampling Data

RQ 1: How does BML perform in terms of efficiency and accuracy compared to the state-of-the-art approaches on small-size data sets?

To answer RQ1, in this experiment we tested the learning results of BML, ENV, SVM, and softmax on sampled data to see whether majority learning improves performance. The performance evaluation reports the classification accuracy and the time efficiency of each method.

29 DOI:10.6814/THE.NCCU.MIS.001.2019.A05.

Figure 6: Experimental design of experiment 1.1.

Figure 6 illustrates the experimental design of experiment 1.1. We randomly select an equal number of malicious samples from each of the 9 malware clusters and benign samples from the benign data set. We chose 100 samples per detection rule cluster because 100 is small relative to the total size of each cluster (less than 10%). These randomly selected samples form the training set; the remaining samples form the testing set. For example, for rule cluster 1 we randomly select 100 malware samples and 100 benign samples from the benign cluster, label these 200 samples according to their class, and use them as the training set; the remaining rule 1 samples and the remaining benign samples become the testing set. After sampling the data from each cluster, we run all of the majority learning experiments on the same sampled data.
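A minimal sketch of this per-cluster split, assuming the samples of a rule cluster and of the benign cluster are indexed by NumPy arrays; the function and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the 100 + 100 split used in experiment 1.1.
def split_cluster(malware_idx: np.ndarray, benign_idx: np.ndarray, n: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    train_mal = rng.choice(malware_idx, size=n, replace=False)   # 100 malware samples
    train_ben = rng.choice(benign_idx, size=n, replace=False)    # 100 benign samples
    test_mal = np.setdiff1d(malware_idx, train_mal)              # remaining malware -> test set
    test_ben = np.setdiff1d(benign_idx, train_ben)               # remaining benign -> test set
    return (train_mal, train_ben), (test_mal, test_ben)
```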

Table 6: Training result by using 100*2 samples

Rule  Majority Method  #HN  Outliers (#FB/#B, #FM/#M)  Execute Time (s)  Train B (#FB/#B)  Train M (#FM/#M)  Test B (#FB/#B)  Test M (#FM/#M)
1     SVM linear       -    (0/1, 0/9)                 0.1               0/99              0/91              85/19291         3/3075
1     SVM poly         -    (0/0, 0/10)                0.1               0/100             0/90              127/19291        30/3075
1     SVM rbf          -    (0/0, 3/10)                0.3               0/100             0/90              58/19291         350/3075
1     Softmax          -    (0/7, 0/3)                 0.6               0/93              0/97              345/19291        1/3075
1     ENV              7    (7/9, 0/1)                 106.1             0/91              0/99              1078/19291       0/3075
1     BML              1    (4/6, 3/4)                 0.4               0/94              0/96              759/19291        30/3075
2     SVM linear       -    (0/6, 0/4)                 0.1               0/94              0/96              12/19291         0/1231
2     SVM poly         -    (0/7, 0/3)                 0.1               0/93              0/97              18/19291         0/1231
2     SVM rbf          -    (0/7, 3/3)                 0.2               0/93              0/97              0/19291          47/1231
2     Softmax          -    (0/0, 0/10)                0.5               0/100             0/90              6/19291          7/1231
2     ENV              7    (0/0, 0/10)                82.5              0/100             0/90              2/19291          17/1231
2     BML              1    (0/4, 0/6)                 0.4               0/96              0/94              2/19291          6/1231
3     SVM linear       -    (1/6, 0/4)                 0.1               0/94              0/96              14/19291         0/1925
3     SVM poly         -    (1/7, 0/3)                 0.1               0/93              0/97              18/19291         0/1925
3     SVM rbf          -    (0/3, 2/7)                 0.2               0/97              0/93              0/19291          48/1925
3     Softmax          -    (0/0, 0/10)                0.5               0/100             0/90              3/19291          0/1925
3     ENV              5    (0/4, 0/6)                 34.5              0/96              0/94              8/19291          0/1925
3     BML              1    (0/7, 0/3)                 0.4               0/93              0/97              8/19291          0/1925
4     SVM linear       -    (0/10, 0/0)                0.1               0/100             0/100             15/19291         0/1738
4     SVM poly         -    (1/10, 0/0)                0.1               0/90              0/100             18/19291         0/1738
4     SVM rbf          -    (0/1, 1/9)                 0.2               0/99              0/91              0/19291          28/1738
4     Softmax          -    (4/10, 0/0)                0.4               0/90              0/100             548/19291        0/1738
4     ENV              11   (0/2, 4/8)                 104.7             0/98              0/92              0/19291          98/1738
4     BML              1    (0/6, 0/4)                 0.5               0/94              0/96              0/19291          0/1738
5     SVM linear       -    (0/4, 0/6)                 0.1               0/96              0/94              12/19291         0/2351
5     SVM poly         -    (0/4, 0/6)                 0.1               0/96              0/94              18/19291         0/2351
5     SVM rbf          -    (0/0, 4/10)                0.2               0/100             0/90              0/19291          74/2351
5     Softmax          -    (7/8, 0/2)                 0.5               0/92              0/98              1784/19291       0/2351
5     ENV              55   (1/2, 0/8)                 1265.0            0/98              0/92              500/19291        0/2351
5     BML              1    (0/3, 0/7)                 0.4               0/97              0/93              3/19291          0/2351
6     SVM linear       -    (0/8, 0/2)                 0.1               0/92              0/98              16/19291         0/4256
6     SVM poly         -    (0/8, 0/2)                 0.1               0/92              0/98              19/19291         0/4256
6     SVM rbf          -    (0/0, 7/10)                0.2               0/100             0/90              0/19291          125/4256
6     Softmax          -    (4/10, 0/0)                0.3               0/90              0/100             814/19291        0/4256
6     ENV              39   (0/3, 0/7)                 461.0             0/97              0/93              0/19291          0/4256
6     BML              1    (0/3, 0/7)                 0.4               0/97              0/93              0/19291          0/4256
7     SVM linear       -    (0/4, 0/6)                 0.1               0/96              0/94              17/19291         0/1108
7     SVM poly         -    (0/4, 0/6)                 0.1               0/96              0/94              34/19291         0/1108
7     SVM rbf          -    (0/0, 9/10)                0.3               0/100             0/90              0/19291          171/1108
7     Softmax          -    (0/0, 5/10)                0.5               0/100             0/90              1/19291          79/1108
7     ENV              25   (0/0, 4/10)                378.3             0/100             0/90              0/19291          38/1108
7     BML              1    (0/0, 4/10)                0.4               0/100             0/90              0/19291          36/1108
8     SVM linear       -    (0/0, 0/10)                0.1               0/100             0/90              10/19291         0/1120
8     SVM poly         -    (0/0, 0/10)                0.1               0/100             0/90              16/19291         0/1120
8     SVM rbf          -    (0/1, 3/9)                 0.2               0/99              0/91              0/19291          58/1120
8     Softmax          -    (0/4, 0/6)                 0.4               0/96              0/94              35/19291         0/1120
8     ENV              37   (0/1, 0/9)                 1472.5            0/99              0/91              44/19291         32/1120
8     BML              1    (0/6, 0/4)                 0.4               0/94              0/96              11/19291         0/1120
9     SVM linear       -    (0/2, 0/8)                 0.1               0/98              0/92              11/19291         0/1687
9     SVM poly         -    (0/1, 0/9)                 0.1               0/99              0/91              16/19291         0/1687
9     SVM rbf          -    (0/0, 9/10)                0.2               0/100             0/90              0/19291          60/1687
9     Softmax          -    (2/10, 0/0)                0.4               0/90              0/100             664/19291        0/1687
9     ENV              5    (0/1, 0/9)                 92.1              0/99              0/91              118/19291        0/1687
9     BML              1    (0/1, 0/9)                 0.4               0/99              0/91              7/19291          0/1687

Table 6 shows the training results of SVM, softmax, ENV, and BML. For each cluster, we train the different models on the randomly selected training data (100 benign and 100 malicious samples). The column "#HN" specifies the number of hidden nodes in the SLFN after training with ENV and BML. Note that the softmax neural network has no hidden layer, so its number of hidden nodes is always zero. ENV needs more than one hidden node to find a fitting function for every rule, whereas BML needs only one hidden node to classify the majority data for all rules. This indicates that we do not even need to apply the hidden-node-adding procedure to deal with the outliers in the training data.

In the column "Outliers (#FB/#B, #FM/#M)", #B is the number of benign samples regarded as outliers in the training data, and #M is the number of malware samples regarded as outliers. The sum of #B and #M equals 5% of the training data because the majority rate is set to 95%. The values #FB and #FM are the numbers of falsely classified samples among the benign and malware outliers, respectively. Although outliers incur a larger loss than the majority data, not all outliers are misclassified: because we apply the condition L for classification, an outlier is not misclassified unless its loss is large enough. On average, BML has higher classification accuracy on the training data than ENV, and most of the misclassified samples are benign. As for training time, BML outperforms ENV and is comparable to SVM and softmax, since BML does not need to re-train the model as many times as ENV.

In this study, we evaluate the accuracy of a model by its "false rate", defined as:

False Rate = (number of falsely classified samples) / (total number of samples).

For example, if a rule 1 sample is classified as benign by a model, it is a falsely classified sample. We count the falsely classified rule 1 samples and divide by the total number of rule 1 samples to obtain the false rate for rule 1. The same calculation applies to all rule clusters and to the benign cluster.
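As a concrete illustration of this definition (with hypothetical variable names), the per-cluster false rate could be computed as follows.

```python
# Hypothetical illustration of the false-rate definition above.
def false_rate(predictions, labels) -> float:
    # `predictions` and `labels` are equal-length sequences of class labels
    # (e.g., "benign" or "rule 1"); a sample is falsely classified when they differ.
    false_count = sum(1 for p, y in zip(predictions, labels) if p != y)
    return false_count / len(labels)

# Example: 2 of 8 rule-1 samples predicted as benign -> false rate 0.25.
print(false_rate(["rule 1"] * 6 + ["benign"] * 2, ["rule 1"] * 8))
```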

The column "Train B (#FB/#B)" reports the false rate on the benign training data, where #B is the number of benign samples in the training data. The column "Train M (#FM/#M)" reports the false rate on the malware training data, where #M is the number of malware samples in the training data. Likewise, "Test B (#FB/#B)" and "Test M (#FM/#M)" report the false rates on the benign and malware testing data, where #B and #M are the numbers of benign and malware samples in the testing data, respectively.

Figure 7: False rate of different majority learning methods on training data (100*2 samples).

Figure 8: False rate of different majority learning methods on testing data (100*2 samples).

Figure 9: False rate of different majority learning methods on outlier data (100*2 samples).

Figure 10: Execution time of different majority learning methods (100*2 samples).

Figures 7 to 9 show the false rates of SVM, softmax, ENV, and BML. Averaging the false rates over the 9 rules, BML achieves higher classification accuracy than softmax and ENV on the testing data. On the training data, BML has higher accuracy on the benign samples but lower accuracy than softmax on the malware samples. Figure 10 shows the execution time of SVM, softmax, ENV, and BML: BML, SVM, and softmax finish model training much faster than ENV.

To answer RQ1, BML has, on average, higher time efficiency and higher classification accuracy than the state-of-the-art methods on small-size data sets.

4.2.2 Exp. 1.2: Use ANN to Learn the Majority

RQ 2: How well does the majority set selected by BML represent the small-size data sets?

Figure 11: Experimental design of experiment 1.2.

Figure 11 illustrates the experimental design of experiment 1.2. As far as we know, outliers affect the learning of a softmax neural network: if the training data contain outliers, the classification accuracy on both the training and testing data decreases. In this experiment, we test whether BML's selection mechanism picks a proper majority. We train a softmax neural network on the majority selected by each majority learning method, and we also train it directly on the original training data. We then examine whether the selected majority improves the performance of softmax neural network learning: if the majority is chosen properly, the network should learn the training data more accurately.

We label the data with one-hot vectors. A one-hot vector has the value 1 in a single dimension and 0 in all others. In our case, a sample of the nth rule is labeled with a vector whose nth dimension is 1, and a benign sample with a vector whose 0th dimension is 1; hence the shape of the one-hot vector is (10, 1). We apply gradient descent 10,000 times for weight tuning and compare the classification accuracy. The learning rate is set to 0.0001.
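A minimal TensorFlow 2 sketch of this setup, assuming a single softmax layer over 10 classes trained with full-batch gradient descent; the cross-entropy loss, Keras layers, and variable names are illustrative assumptions rather than the exact network described in Section 3.3.

```python
import tensorflow as tf

# Hypothetical sketch: a single softmax layer over 10 classes
# (benign = class 0, rules 1-9 = classes 1-9), trained with plain gradient descent.
# `x_train` is a float32 feature matrix and `y_train` holds integer class labels.
NUM_CLASSES = 10

def train_softmax(x_train, y_train, steps=10_000, learning_rate=1e-4):
    y_onehot = tf.one_hot(y_train, depth=NUM_CLASSES)       # one-hot label vectors of length 10
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")
    ])
    optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    for _ in range(steps):                                   # 10,000 gradient-descent updates
        with tf.GradientTape() as tape:
            loss = loss_fn(y_onehot, model(x_train))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return model
```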

Table 7: Softmax neural network classification result by using 100*2 samples

Rule  Majority Method  Outlier FR  Train FR  Test FR
1     None             -           0/100     13/3075
1     SVM linear       0/9         0/91      17/3075
1     SVM poly         0/10        0/90      19/3075
1     SVM rbf          0/10        0/90      17/3075
1     Softmax          0/3         0/97      16/3075
1     ENV              0/1         0/99      17/3075
1     BML              0/4         0/96      19/3075
2     None             -           2/100     34/1231
2     SVM linear       0/4         2/96      42/1231
2     SVM poly         0/3         2/97      41/1231
2     SVM rbf          1/3         1/97      30/1231
2     Softmax          5/10        1/90      75/1231
2     ENV              1/10        1/90      42/1231
2     BML              1/6         2/94      43/1231
3     None             -           14/100    150/1925
3     SVM linear       1/4         7/96      128/1925
3     SVM poly         1/3         8/97      131/1925
3     SVM rbf          0/7         14/93     149/1925
3     Softmax          1/10        13/90     158/1925
3     ENV              1/6         11/94     148/1925
3     BML              0/3         9/97      133/1925
4     None             -           9/100     199/1738
4     SVM linear       0/0         8/100     181/1738
4     SVM poly         0/0         8/100     184/1738
4     SVM rbf          2/9         8/91      220/1738
4     Softmax          0/0         7/100     164/1738
4     ENV              0/8         8/92      182/1738
4     BML              0/4         8/96      180/1738
5     None             -           2/100     207/2351
5     SVM linear       2/6         4/94      246/2351
5     SVM poly         2/6         4/94      247/2351
5     SVM rbf          1/10        2/90      195/2351
5     Softmax          0/2         3/98      216/2351
5     ENV              2/8         2/92      234/2351
5     BML              3/3         3/97      233/2351

Table 8: Softmax neural network classification result by using 100*2 samples (cont.)

Rule    Majority Method  Outlier FR  Train FR  Test FR
6       None             -           22/100    922/4256
6       SVM linear       1/2         20/98     798/4256
6       SVM poly         1/2         19/98     790/4256
6       SVM rbf          6/10        18/90     1002/4256
6       Softmax          0/0         22/100    907/4256
6       ENV              4/7         17/93     922/4256
6       BML              0/7         21/93     847/4256
7       None             -           2/100     24/1108
7       SVM linear       0/6         2/94      37/1108
7       SVM poly         0/6         2/94      37/1108
7       SVM rbf          6/10        0/90      57/1108
7       Softmax          1/10        2/90      43/1108
7       ENV              6/10        0/90      57/1108
7       BML              6/10        0/90      57/1108
8       None             -           0/100     15/1120
8       SVM linear       0/10        0/90      14/1120
8       SVM poly         0/10        0/90      14/1120
8       SVM rbf          0/9         0/91      13/1120
8       Softmax          0/6         0/94      9/1120
8       ENV              0/9         0/91      16/1120
8       BML              0/4         0/96      13/1120
9       None             -           7/100     118/1687
9       SVM linear       0/8         8/92      127/1687
9       SVM poly         0/9         8/91      129/1687
9       SVM rbf          1/10        6/90      96/1687
9       Softmax          0/0         7/100     118/1687
9       ENV              0/9         6/91      88/1687
9       BML              4/9         3/91      153/1687
Benign  None             -           2/900     67/18491
Benign  SVM linear       1/41        1/859     72/18491
Benign  SVM poly         1/41        1/859     72/18491
Benign  SVM rbf          0/12        2/888     70/18491
Benign  Softmax          2/49        0/851     108/18491
Benign  ENV              2/22        0/878     95/18491
Benign  BML              3/40        0/860     105/18491

Tables 7 and 8 show the classification results of the trained models. The column "Train FR" reports the false rate on the training data, and the column "Test FR" reports the false rate on the testing data.

Figure 12: False rate of softmax neural networks trained with different majorities on training data (100*2 samples).

Figure 13: False rate of softmax neural networks trained with different majorities on testing data (100*2 samples).

Figure 14: False rate of softmax neural networks trained with different majorities on outlier data (100*2 samples).

Figures 12 to 14 show the false rates of softmax neural networks trained on the different majorities. On average, all of the majority learning methods increase the classification accuracy on the training data without losing much accuracy on the testing data. To answer RQ2, BML chooses a proper majority set that lets the softmax neural network classify the training data more accurately. However, the 95% majority set discards some information, so the classification accuracy on the testing data is slightly lower than when the softmax neural network is trained on all of the training data.

4.2.3 Exp. 2.1: Majority Learning on Large Scale Data

RQ 3: How does BML perform, in terms of efficiency and accuracy, compared to the state-of-the-art approaches on large-size data sets?

Figure 15: Experimental design of experiment 2.1.

Figure 15 illustrates the experimental design of experiment 2.1. In this subsection, we test how BML performs when learning a larger amount of data. Instead of the randomly selected 100*2 samples (100 benign + 100 malicious), we use 80% of each rule cluster. For example, for rule cluster 1 we randomly select 80% of its malware samples (3175 * 0.8 = 2540) and randomly select 2,540 benign samples from the benign cluster. We label these 5,080 samples according to their class and use them as the training set; the remaining rule 1 samples and the remaining benign samples form the testing set. After sampling the data from each cluster, we run the different majority learning experiments on the same sampled data. Note that we again use equal numbers of benign and malicious samples for training, and the majority rate is set to 95%.

Table 9 shows the training results on the large-scale training sets. On these sets, BML reduces the training time significantly compared to ENV. BML also does not need to increase the model complexity to learn 95% of the training data; it trains a suitable SLFN for classifying the majority of the bipartite data more efficiently than ENV. Figures 16 to 18 show the false rates of SVM, softmax, ENV, and BML, with the mean false rate over the 9 rules computed as in experiment 1.1; BML's accuracy lies between softmax and ENV. Figure 19 shows the execution time of SVM, softmax, ENV, and BML. ENV achieves slightly higher classification accuracy when the amount of training data is larger, but the trade-off is a long model training time.
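The only change from the split sketched in experiment 1.1 is the number of samples drawn per cluster; a self-contained sketch for rule cluster 1, again with hypothetical names and working on sample indices only:

```python
import numpy as np

# Hypothetical sketch of the 80%/20% split in experiment 2.1, shown for rule cluster 1.
rng = np.random.default_rng(0)
n_rule1, n_benign = 3175, 19391                                 # cluster sizes from Table 5
n_train = int(0.8 * n_rule1)                                    # 3175 * 0.8 = 2540 training malware samples
train_mal = rng.choice(n_rule1, size=n_train, replace=False)
train_ben = rng.choice(n_benign, size=n_train, replace=False)   # an equal number of benign samples
test_mal = np.setdiff1d(np.arange(n_rule1), train_mal)          # remaining 635 malware samples -> test set
test_ben = np.setdiff1d(np.arange(n_benign), train_ben)         # remaining benign samples -> test set
```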

To answer RQ3, BML is more time-efficient and loses little classification accuracy compared to the state-of-the-art approaches on large-size data sets.
