
8.4 Evaluation

8.4.4 Evaluation: Incident Detection and Categorization

We perform 10-fold Monte Carlo cross-validation, where each fold randomly samples 70% of the machine-days from the training dataset collected from July to September 2016; the remaining 30% are held out for testing.
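As a point of reference, this splitting scheme can be sketched as follows; the use of scikit-learn and the variable names (machine_day_features, incident_labels) are our own assumptions, not part of the original implementation.

```python
# Minimal sketch of 10-fold Monte Carlo cross-validation with a 70%/30% split
# of machine-days. Variable names are hypothetical.
from sklearn.model_selection import ShuffleSplit

def monte_carlo_folds(machine_day_features, incident_labels, n_folds=10, seed=0):
    """Yield (train, test) index arrays; each fold re-samples 70% of the machine-days."""
    splitter = ShuffleSplit(n_splits=n_folds, train_size=0.7, random_state=seed)
    for train_idx, test_idx in splitter.split(machine_day_features, incident_labels):
        yield train_idx, test_idx
```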

We set up a baseline model (shorthand: LR) by training a logistic regression classifier directly on the event count matrix X, with missing entries filled with zeros. In our approach (shorthand: VP), we train a logistic regression classifier on the low-dimensional feature representation of X produced by the Virtual Product model.
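For concreteness, the two classifiers can be sketched as below; `vp_features` stands for the low-dimensional representation produced by the Virtual Product model and is assumed to be computed elsewhere, and the use of scikit-learn is our assumption rather than the original tooling.

```python
# Sketch of the two classifiers compared in this section. X is the
# (machine-day x event-type) count matrix with NaN marking unobserved entries;
# `vp_features` is assumed to come from the Virtual Product factorization.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_baseline_lr(X, y):
    """LR baseline: fill missing event counts with zeros and train directly on X."""
    return LogisticRegression(max_iter=1000).fit(np.nan_to_num(X, nan=0.0), y)

def fit_vp_lr(vp_features, y):
    """VP: logistic regression on the learned low-dimensional representation."""
    return LogisticRegression(max_iter=1000).fit(vp_features, y)
```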

The purpose of introducing the baseline model is twofold. First, we use the results from the baseline model to further validate our initial assumption: it is possible to predict the events that would have been reported by additional security products that were not deployed. The baseline model performs classification using only the observed events from the deployed products; no reconstructed event information is embedded. Therefore, if the baseline method can detect or categorize incidents with acceptable accuracy, we have strong reason to believe that the proposed Virtual Product model can perform even better by incorporating the reconstructed event counts into the classifier design. Second, we aim to conduct a fair comparative study for our proposed methodology, though we note that the baseline model is not a comparison to prior art, as Virtual Product addresses a novel problem of not only predicting the incidents but also recovering the associated security events. The objective function of SSNMF, used in Virtual Product, can be roughly understood as the construction of a logistic regression classifier on the projected space of the original data.
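As an illustration of this view, a generic SSNMF-style objective can be written as

\min_{W \ge 0,\, H \ge 0,\, \theta} \; \lVert M \odot (X - WH) \rVert_F^2 \;+\; \lambda \sum_{i} \log\!\bigl(1 + \exp(-y_i\, \theta^\top w_i)\bigr),

where M is the binary mask of observed entries of X, \odot denotes the element-wise product, w_i is the i-th row of W (the low-dimensional representation of machine-day i), and \lambda balances reconstruction against classification loss. This is an illustrative form only; the exact objective used by Virtual Product is the one defined in Section 8.3.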

This comparative study aims to verify the benefits gained from the algorithmic design of Virtual Product for classification with missing features.

To allow a fine-grained comparison, we compute the mean and standard deviation of the Area Under the Curve (AUC) and the True Positive Rate (TPR) across the 10 folds and report them in Table 8.6 and Table 8.7, respectively. As the two tables show, both the baseline and the proposed Virtual Product method achieve good classification performance on the training datasets of all five security products.

            VP AUC              LR AUC
Dataset     Mean      Std       Mean      Std
FW1         0.9831    0.0041    0.9695    0.0055
FW2         0.9900    0.0018    0.9810    0.0029
FW3         0.9200    0.0070    0.8761    0.0131
EP1         0.8218    0.0066    0.8076    0.0072
EP2         0.8962    0.0083    0.8306    0.0164

Table 8.6: Our approach (VP) detects security incidents with high accuracy (AUC) across all five datasets, outperforming the baseline model (LR).

            VP TPR              LR TPR
Dataset     Mean      Std       Mean      Std
FW1         0.9724    0.0114    0.9661    0.0078
FW2         0.9820    0.0057    0.9810    0.0074
FW3         0.7879    0.0157    0.7608    0.0228
EP1         0.5200    0.0175    0.5016    0.0268
EP2         0.5897    0.0293    0.5663    0.0399

Table 8.7: True positive rate (TPR) of incident detection on all five datasets at a 10% false positive rate (FPR). Our approach (VP) outperforms the baseline (LR).
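For reference, the TPR at a fixed 10% FPR reported in Table 8.7 can be read off the per-fold ROC curve; a minimal sketch is given below, again assuming scikit-learn rather than the original tooling.

```python
# Sketch: true positive rate at a fixed false positive rate (here 10%),
# read off the ROC curve of one cross-validation fold. Per-fold values can
# then be aggregated into the means and standard deviations of Table 8.7.
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, y_score, target_fpr=0.10):
    """Largest TPR achievable at an FPR no greater than `target_fpr`."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    feasible = fpr <= target_fpr
    return float(tpr[feasible].max())
```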

This indicates that counts of events collected from different organizations can predict the occurrence of incidents that would have been reported by products that were not deployed. Furthermore, the results show consistently better incident detection performance for the proposed Virtual Product model than for the baseline method across the training datasets of the different products. Figure 8.2 shows the average ROC curves and AUCs derived from the cross-validation test, offering a global and intuitive view of incident detection performance on the training datasets of the different products using the proposed Virtual Product model. All obtained results support the design of the proposed Virtual Product method.

Embedding matrix completion into the classification task helps extract correlations among the observed events of different products, which increases the information available to the classifier and boosts classification precision.

Additionally, testing on the validation datasets follows the standard training-testing process of machine learning models in real-world applications: the classification model built with the training dataset collected during the preceding time period is used to detect incidents on the validation dataset constructed within the current time slot.
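A minimal sketch of this temporal split is shown below; the cut-off date is illustrative (the training window of July to September 2016 is stated above, but the exact validation window is not specified here).

```python
# Sketch of the temporal train/validation split: train on machine-days from the
# preceding period, validate on the following time slot. The cut-off date is
# illustrative, not taken from the original experiments.
import numpy as np

def temporal_split(timestamps, cutoff=np.datetime64("2016-10-01")):
    """Return boolean masks selecting training (before cutoff) and validation rows."""
    days = np.asarray(timestamps, dtype="datetime64[D]")
    train_mask = days < cutoff
    return train_mask, ~train_mask
```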

Figure 8.2: Averaged ROC curves (true positive rate versus false positive rate) from 10-fold cross-validation of Virtual Product on our top five product datasets. Per-panel results for VP: FW1 (AUC = 98.31%), FW2 (AUC = 99.00%), FW3 (AUC = 78.79%), EP1 (AUC = 82.18%), EP2 (AUC = 89.62%).

Interestingly, as shown in Figure 8.3, incident detection using the proposed Virtual Product model achieves consistently high detection accuracy on the validation datasets of the different products. The reported detection accuracy confirms the robustness of the proposed Virtual Product model.

As described in Section 8.3, the proposed Virtual Product can be seamlessly extended to incident categorization, which classifies detected incidents at a finer granularity. Without major modification, the proposed Virtual Product is able to perform incident detection and categorization (multi-class classification) at the same time. Table 8.8 shows the average F1 score of incident categorization on the training datasets of the different products using Virtual Product. As we can see, Virtual Product achieves almost perfect incident categorization on the FW1 and FW2 datasets. In the EP2 dataset, over 99% of detected incidents belong to a single incident type.
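The averaging scheme behind the F1 scores in Table 8.8 is not spelled out here; the sketch below shows one plausible computation, assuming macro-averaging over incident categories and the use of scikit-learn.

```python
# Sketch: average F1 score for multi-class incident categorization on one fold.
# Macro-averaging over incident categories is an assumption, not necessarily
# the scheme used for Table 8.8.
from sklearn.metrics import f1_score

def categorization_f1(y_true, y_pred):
    """Macro-averaged F1 over incident categories."""
    return f1_score(y_true, y_pred, average="macro")
```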

Figure 8.3: ROC curves (true positive rate versus false positive rate) of the Virtual Product model evaluated on the validation datasets of the five products. Per-panel results for VP: FW1 (AUC = 99.12%), FW2 (AUC = 97.62%), FW3 (AUC = 84.78%), EP1 (AUC = 66.11%), EP2 (AUC = 92.58%).

Such severe class imbalance makes any classifier built on that dataset statistically unstable, so we chose not to include the EP2 dataset in the experimental study of incident categorization. The categorization performance on EP1 and FW3 is relatively lower. This is mainly due to class imbalance among the incident categories in these two datasets, particularly in the EP1 training dataset, for which nearly half of the 30 incident types are minority classes. Each of these minority classes contains fewer than 10 machine-day observations, which increases the difficulty of categorization. The impact of class imbalance is also confirmed by the baseline LR method. Nevertheless, even in this extreme situation, the proposed Virtual Product still obtains improvements over the baseline model.

In general, all experimental results in this section verify the effectiveness of Virtual Product. By jointly conducting matrix factorization and discriminative model learning, the proposed model makes full use of inter-event correlations to compensate for the information missing due to the extremely sparse data structure.

Dataset     VP        LR
FW1         0.9927    0.9910
FW2         0.9425    0.9338
FW3         0.8005    0.8043
EP1         0.7501    0.7220

Table 8.8: Average F1 scores of incident categorization on our datasets. We do not include EP2 because over 99% of the detected incidents belong to one single incident type.

As a result, it provides a good reconstruction of the classification boundary from highly incomplete event-occurrence data.