4.2 Evaluation
4.2.1 Exp. 1.1: Majority Learning on Small-Size Sampling Data
RQ 1: How does BML perform in terms of efficiency and accuracy compared to the state-of-the-art approaches on small-size data sets?
To answer RQ1, in this experiment we compared the learning results of BML, ENV, SVM, and softmax on the sampled data to see whether BML improves performance in the majority learning setting. The performance evaluation reports the classification accuracy and the time efficiency of each method.
Figure 6: Experimental design of experiment 1.1.
Figure 6 illustrates the experimental design of experiment 1.1. We randomly selected an equal number of malicious samples from the 9 malware clusters and benign samples from the benign data set. We chose 100 samples from each detection rule cluster because 100 is relatively small compared to the total number of samples in each cluster (less than 10%). These randomly selected samples are used as the training set; the remaining samples are used as the testing set.
For example, we randomly selected 100 malware samples from the rule 1 cluster and 100 benign samples from the benign cluster. We then labeled these 200 samples according to their class and used them as the training set. The rest of the rule 1 cluster samples and the rest of the benign cluster samples formed the testing set. After sampling the data from each cluster, we conducted the different majority learning experiments on the same sampled data.
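The per-cluster sampling described above can be sketched as follows. The cluster sizes below are taken from the rule 1 row of Table 6 (100 training samples plus the listed testing counts); the actual feature vectors are replaced by placeholders.

```python
import numpy as np

def split_cluster(samples, n_train=100, seed=0):
    """Randomly pick n_train samples from a cluster for training;
    the remainder become the testing set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    train = [samples[i] for i in idx[:n_train]]
    test = [samples[i] for i in idx[n_train:]]
    return train, test

# Placeholder samples; sizes are consistent with the rule 1 row of Table 6
# (100 + 3075 malware, 100 + 19291 benign).
rule1_cluster = [("rule1", i) for i in range(3175)]
benign_cluster = [("benign", i) for i in range(19391)]

train_m, test_m = split_cluster(rule1_cluster)
train_b, test_b = split_cluster(benign_cluster)

# 100 malicious + 100 benign labeled samples form the training set.
train_set = [(x, 1) for x in train_m] + [(x, 0) for x in train_b]
test_set = [(x, 1) for x in test_m] + [(x, 0) for x in test_b]
```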
Table 6: Training results using 100*2 samples
Rule  Majority Method  #HN  Outliers (#FB/#B, #FM/#M)  Execute Time(s)  Train B (#FB/#B)  Train M (#FM/#M)  Test B (#FB/#B)  Test M (#FM/#M)
Rule 1:
SVM linear - (0/1, 0/9) 0.1 0/99 0/91 85/19291 3/3075
SVM poly - (0/0, 0/10) 0.1 0/100 0/90 127/19291 30/3075
SVM rbf - (0/0, 3/10) 0.3 0/100 0/90 58/19291 350/3075
Softmax - (0/7, 0/3) 0.6 0/93 0/97 345/19291 1/3075
ENV 7 (7/9, 0/1) 106.1 0/91 0/99 1078/19291 0/3075
BML 1 (4/6, 3/4) 0.4 0/94 0/96 759/19291 30/3075
Rule 2:
SVM linear - (0/6, 0/4) 0.1 0/94 0/96 12/19291 0/1231
SVM poly - (0/7, 0/3) 0.1 0/93 0/97 18/19291 0/1231
SVM rbf - (0/7, 3/3) 0.2 0/93 0/97 0/19291 47/1231
Softmax - (0/0, 0/10) 0.5 0/100 0/90 6/19291 7/1231
ENV 7 (0/0, 0/10) 82.5 0/100 0/90 2/19291 17/1231
BML 1 (0/4, 0/6) 0.4 0/96 0/94 2/19291 6/1231
Rule 3:
SVM linear - (1/6, 0/4) 0.1 0/94 0/96 14/19291 0/1925
SVM poly - (1/7, 0/3) 0.1 0/93 0/97 18/19291 0/1925
SVM rbf - (0/3, 2/7) 0.2 0/97 0/93 0/19291 48/1925
Softmax - (0/0, 0/10) 0.5 0/100 0/90 3/19291 0/1925
ENV 5 (0/4, 0/6) 34.5 0/96 0/94 8/19291 0/1925
BML 1 (0/7, 0/3) 0.4 0/93 0/97 8/19291 0/1925
Rule 4:
SVM linear - (0/10, 0/0) 0.1 0/100 0/100 15/19291 0/1738
SVM poly - (1/10, 0/0) 0.1 0/90 0/100 18/19291 0/1738
SVM rbf - (0/1, 1/9) 0.2 0/99 0/91 0/19291 28/1738
Softmax - (4/10, 0/0) 0.4 0/90 0/100 548/19291 0/1738
ENV 11 (0/2, 4/8) 104.7 0/98 0/92 0/19291 98/1738
BML 1 (0/6, 0/4) 0.5 0/94 0/96 0/19291 0/1738
Rule 5:
SVM linear - (0/4, 0/6) 0.1 0/96 0/94 12/19291 0/2351
SVM poly - (0/4, 0/6) 0.1 0/96 0/94 18/19291 0/2351
SVM rbf - (0/0, 4/10) 0.2 0/100 0/90 0/19291 74/2351
Softmax - (7/8, 0/2) 0.5 0/92 0/98 1784/19291 0/2351
ENV 55 (1/2, 0/8) 1265.0 0/98 0/92 500/19291 0/2351
BML 1 (0/3, 0/7) 0.4 0/97 0/93 3/19291 0/2351
Rule 6:
SVM linear - (0/8, 0/2) 0.1 0/92 0/98 16/19291 0/4256
SVM poly - (0/8, 0/2) 0.1 0/92 0/98 19/19291 0/4256
SVM rbf - (0/0, 7/10) 0.2 0/100 0/90 0/19291 125/4256
Softmax - (4/10, 0/0) 0.3 0/90 0/100 814/19291 0/4256
ENV 39 (0/3, 0/7) 461.0 0/97 0/93 0/19291 0/4256
BML 1 (0/3, 0/7) 0.4 0/97 0/93 0/19291 0/4256
Rule 7:
SVM linear - (0/4, 0/6) 0.1 0/96 0/94 17/19291 0/1108
SVM poly - (0/4, 0/6) 0.1 0/96 0/94 34/19291 0/1108
SVM rbf - (0/0, 9/10) 0.3 0/100 0/90 0/19291 171/1108
Softmax - (0/0, 5/10) 0.5 0/100 0/90 1/19291 79/1108
ENV 25 (0/0, 4/10) 378.3 0/100 0/90 0/19291 38/1108
BML 1 (0/0, 4/10) 0.4 0/100 0/90 0/19291 36/1108
Rule 8:
SVM linear - (0/0, 0/10) 0.1 0/100 0/90 10/19291 0/1120
SVM poly - (0/0, 0/10) 0.1 0/100 0/90 16/19291 0/1120
SVM rbf - (0/1, 3/9) 0.2 0/99 0/91 0/19291 58/1120
Softmax - (0/4, 0/6) 0.4 0/96 0/94 35/19291 0/1120
ENV 37 (0/1, 0/9) 1472.5 0/99 0/91 44/19291 32/1120
BML 1 (0/6, 0/4) 0.4 0/94 0/96 11/19291 0/1120
Rule 9:
SVM linear - (0/2, 0/8) 0.1 0/98 0/92 11/19291 0/1687
SVM poly - (0/1, 0/9) 0.1 0/99 0/91 16/19291 0/1687
SVM rbf - (0/0, 9/10) 0.2 0/100 0/90 0/19291 60/1687
Softmax - (2/10, 0/0) 0.4 0/90 0/100 664/19291 0/1687
ENV 5 (0/1, 0/9) 92.1 0/99 0/91 118/19291 0/1687
BML 1 (0/1, 0/9) 0.4 0/99 0/91 7/19291 0/1687
Table 6 shows the training results of SVM, softmax, ENV, and BML. For each cluster, we trained the different models using the randomly selected samples as training data (100 benign samples and 100 malicious samples). The column "#HN" specifies the number of hidden nodes in the SLFN after training with ENV and BML.
Note that the softmax neural network has no hidden layer, so its number of hidden nodes is always zero.
For ENV, every rule needs more than one hidden node to find the fitting function. For BML, every rule needs only one hidden node to classify the majority data. This indicates that we did not even need to apply the hidden-node-addition procedure to deal with the outliers in the training data.
In the column "Outliers (#FB/#B, #FM/#M)", #B is the number of benign samples regarded as outliers in the training data and #M is the number of malware samples regarded as outliers. The sum of #B and #M equals 5% of the training data because the majority rate is set to 95%. #FB and #FM are the numbers of falsely classified samples among the benign and malware outliers, respectively. Although outliers incur a greater loss than the majority data, not all outliers are misclassified: because we applied condition L for classification, outliers whose losses are not large enough are not misclassified by the model.
On average, BML has higher classification accuracy on training data than ENV, and most of the misclassified samples are benign. As for training time, BML outperforms ENV and is comparable to SVM and softmax, since BML does not need to retrain the model as many times as ENV does.
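As a rough sketch of how the SVM and softmax baselines can be trained and timed, the snippet below uses scikit-learn (an assumption; the paper does not name its implementation) on synthetic stand-in features. With only two classes, a softmax classifier reduces to logistic regression.

```python
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for 100 benign + 100 malicious feature vectors.
X = np.vstack([rng.normal(0, 1, (100, 16)), rng.normal(2, 1, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

models = {
    "SVM linear": SVC(kernel="linear"),
    "SVM poly": SVC(kernel="poly"),
    "SVM rbf": SVC(kernel="rbf"),
    # Two-class softmax is equivalent to logistic regression.
    "Softmax": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    start = time.time()
    model.fit(X, y)
    print(f"{name}: trained in {time.time() - start:.2f}s, "
          f"train acc = {model.score(X, y):.2f}")
```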
In this study, we evaluate the accuracy of a model by its "false rate", defined as follows: False Rate = (number of falsely classified samples) / (total number of samples).
For example, if a rule 1 sample is classified as benign by a model, that sample is a falsely classified sample. We sum the number of falsely classified rule 1 samples and divide by the total number of rule 1 samples to obtain the false rate of the rule 1 samples. This calculation applies to all rule clusters and to the benign cluster.
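The definition above is a single division; as a minimal illustration, using BML's rule 1 testing counts from Table 6:

```python
def false_rate(n_false, n_total):
    """False rate = falsely classified samples / total samples."""
    return n_false / n_total

# BML on rule 1 testing data (values from Table 6).
print(false_rate(759, 19291))  # benign testing false rate, about 0.039
print(false_rate(30, 3075))    # malware testing false rate, about 0.010
```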
The column "Train B (#FB/#B)" gives the false rate on benign training data, where #B is the number of benign samples in the training data. The column "Train M (#FM/#M)" gives the false rate on malware training data, where #M is the number of malware samples in the training data. The columns "Test B (#FB/#B)" and "Test M (#FM/#M)" give the corresponding false rates on the benign and malware testing data.
Figure 7: False rate of different majority learning methods on training data (100*2 samples).
Figure 8: False rate of different majority learning methods on testing data (100*2 samples).
Figure 9: False rate of different majority learning methods on outlier data (100*2 samples).
Figure 10: Execution time of different majority learning methods (100*2 samples).
Figures 7 to 9 show the false rates of SVM, softmax, ENV, and BML. Averaged over the 9 rules, BML achieves higher classification accuracy than softmax and ENV on the testing data. On the training data, BML has higher classification accuracy than softmax on benign data but lower classification accuracy on malware data. Figure 10 shows the execution time of SVM, softmax, ENV, and BML: BML, SVM, and softmax finish the model training process much faster than ENV.
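Taking BML's malware-testing columns from Table 6, the mean false rate over the 9 rules can be computed as:

```python
# BML (false, total) counts on malware testing data per rule (Table 6).
bml_test_m = [(30, 3075), (6, 1231), (0, 1925), (0, 1738), (0, 2351),
              (0, 4256), (36, 1108), (0, 1120), (0, 1687)]

rates = [f / n for f, n in bml_test_m]
mean_rate = sum(rates) / len(rates)
print(f"mean malware-test false rate of BML: {mean_rate:.4f}")
```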
To answer RQ1: on average, BML achieves higher time efficiency and higher classification accuracy than the state-of-the-art methods on small-size data sets.