Experimental Results - Using Three-Phase Behavior Analysis to Detect and Classify Known and Unknown Malware

Chapter 5. Evaluation

5.2 Experimental Results

We report our results using four metrics: true positive ratio (TPR), true negative ratio (TNR), false positive ratio (FPR), and false negative ratio (FNR).

TPR and TNR denote the fractions of malware and benign programs, respectively, that we identify correctly. FPR is the fraction of benign programs mistaken for malware, and FNR is the fraction of malware mistaken for benign programs.
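As a minimal sketch of how the four metrics relate to raw counts, the function below derives them from a confusion matrix. The function name and the argument names (`tp`, `fn`, `tn`, `fp`) are illustrative assumptions, not identifiers from our system.

```python
def detection_ratios(tp, fn, tn, fp):
    """Return TPR, FNR, TNR and FPR from raw confusion counts.

    tp: malware flagged as malware      fn: malware flagged as benign
    tn: benign flagged as benign        fp: benign flagged as malware
    """
    malware_total = tp + fn  # every malware sample is either a hit or a miss
    benign_total = tn + fp   # every benign sample is either cleared or mistaken
    if malware_total == 0 or benign_total == 0:
        raise ValueError("need at least one malware and one benign sample")
    return {
        "TPR": tp / malware_total,
        "FNR": fn / malware_total,
        "TNR": tn / benign_total,
        "FPR": fp / benign_total,
    }
```

Because TPR + FNR = 1 and TNR + FPR = 1, reporting FNR and FPR alone, as Table 3 does, is enough to recover the other two ratios.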

Next, we examine 7 issues that resolve open questions and deepen our understanding of this research. First, we examine the detection performance of SDM and SCDM working alone. Then, we measure the time cost of SDM and SCDM and compare three strategies in which the two detection mechanisms serve in different phases. Finally, we present the results for malware classification and intrusive malware recognition.

Impact of MD on ANN

This experiment evaluates the sandbox-based detection ability. As discussed above, we obtain the MD value of each program using the ANN. To delimit the ambiguous area and determine the optimal MD value, we plot the distribution of MD values in Figure 13.

There is an obvious ambiguous area from 0.2 to 0.75. SDM cannot discriminate between malware and benign programs within this area, whereas it tells them apart well outside it. Therefore, we illustrate the detection ratio for different MD values in Figure 14.

In Figure 14, we set different MD values as thresholds to evaluate the detection ability. FNR rises sharply within two intervals, i.e., from 0.2 to 0.25 and from 0.4 to 0.5. Looking into the cause of this trend, we notice that a large number of benign programs and malware exhibit the same suspicious behaviors, so their calculated MD values are similar. This might result from the GFI sandbox being unable to dissect programs in a fine-grained way. If we want to keep the FNR below 10%, the best MD threshold is 0.4.

Figure 13. Distribution for MD values by ANN

Figure 14. Detection ratio with different MD thresholds
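The threshold selection discussed in this subsection can be sketched as a simple sweep. This is an illustration under stated assumptions rather than our actual evaluation code: we assume a program is flagged as malware when its MD value is at least the threshold, and the function names and any sample MD values fed to them are hypothetical.

```python
def sweep_thresholds(malware_md, benign_md, thresholds):
    """Return (threshold, FNR, FPR) rows.

    Assumption: a program is flagged as malware when MD >= threshold.
    """
    rows = []
    for t in thresholds:
        # Malware scoring below the threshold is missed (false negative).
        fnr = sum(md < t for md in malware_md) / len(malware_md)
        # Benign programs scoring at or above it are mistaken (false positive).
        fpr = sum(md >= t for md in benign_md) / len(benign_md)
        rows.append((t, fnr, fpr))
    return rows


def best_threshold(rows, max_fnr=0.10):
    """Pick the threshold with the lowest FPR whose FNR stays below max_fnr."""
    acceptable = [row for row in rows if row[1] < max_fnr]
    if not acceptable:
        return None
    # Ties on FPR are broken in favor of the larger threshold.
    return min(acceptable, key=lambda row: (row[2], -row[0]))[0]
```

Under this reading, raising the threshold trades a lower FPR for a higher FNR, so the chosen value is the largest threshold that still satisfies the FNR bound.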

Impact of Length on System Call Sequence

In this experiment, we evaluate the system call-based detection approach. Our MB contains many variable-length system call sequences, and different sequences represent different malware behaviors. Generally, a longer system call sequence represents a common malware behavior more specifically, as Figure 15 shows.

Figure 15. Detection ratio with different lengths as thresholds

We use different lengths of system call sequences as thresholds to conduct the experiment. A system call sequence is used for malware detection only if it is longer than the configured threshold. Figure 15 shows a remarkable decrease in FPR as the length threshold goes from 10 to 50. Beyond 50, both TPR and FPR vary only slightly. We also discuss the issue of misjudgment.

The FNR might be ascribed to hidden malicious behaviors or to the lack of triggering events. The FPR might arise from an impure MB in which some benign behaviors have not been filtered out. Thus, we must collect more benign samples to purify the MB. Figure 16 shows that the extracted malicious behaviors that are also performed by benign programs are in the minority. In future work, we should look into what these malicious behaviors mean so that we can understand them better.

Figure 16. Distribution for the number of malicious behaviors
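The length-threshold filtering used in this experiment can be sketched as follows. The matching rule is an assumption made for illustration: we treat a behavior as present when its system call sequence occurs as a contiguous run in the observed trace. The function names and the system call names in the usage note are hypothetical.

```python
def contains(trace, pattern):
    """True if pattern occurs as a contiguous run inside trace."""
    trace, pattern = tuple(trace), tuple(pattern)
    n, m = len(trace), len(pattern)
    # range() is empty when the pattern is longer than the trace.
    return any(trace[i:i + m] == pattern for i in range(n - m + 1))


def is_malware(trace, behavior_db, min_length):
    """Flag a trace that contains any MB sequence longer than min_length."""
    # Only sequences strictly longer than the threshold are put into use.
    usable = [seq for seq in behavior_db if len(seq) > min_length]
    return any(contains(trace, seq) for seq in usable)
```

For example, with an MB holding `["open", "write"]` and `["open", "read", "connect", "send"]`, the trace `["open", "write", "close"]` is flagged at `min_length=1` but not at `min_length=2`, which mirrors how short, generic sequences inflate the FPR.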

Detection Ability vs. Time Cost

Our system adopts two different detection approaches, i.e., SDM and SCDM. Although they differ in detection capability, each should retain its own merits. We are curious about the tradeoff between detection capability and time cost, which Table 3 summarizes.

Table 3. Detection ratio vs. Time cost

Mechanism        FNR    FPR     Observing time  Detecting time
SDM (external)   7.6%   44.9%   180 sec         0.068 sec
SCDM (internal)  7.4%   7.5%    900 sec         0.35 sec

For SDM, we set a threshold to make the judgment. According to Table 3, there is a large gap between the FPR produced by SDM (44.9%) and that produced by SCDM (7.5%). We measure the time cost as observing time and detecting time. The observing time is the time consumed by the GFI sandbox or the System Call tracer to observe the behaviors of a program. The detecting time is the time that the ANN system or the Sequence Detector takes to reach a verdict on a program. As Table 3 demonstrates,

and 98.8 percent for SCDM as a benign filter. SCDM as a malware filter introduces such a high FPR that we should abandon it. Subsequently, we consider the time cost of analyzing a program. The first strategy takes 731 seconds to analyze a program, which is better than all the other strategies. We attribute the difference in time cost to the fact that SDM passes on the programs whose calculated MD values fall in the ambiguous area. SDM can thus act as both a malware filter and a benign filter, i.e., a fuzzy filter. SCDM, on the other hand, cannot generate such a distribution for programs, so it can be a malware filter only. We conclude that SDM as a fuzzy filter should serve as the 1st-phase defense.

Detection: 1-phase vs. 2-phase

This experiment evaluates the feasibility of the 2-phase approach with SDM in the 1st phase and SCDM in the 2nd phase. Figure 17 shows that the 2-phase approach achieves better accuracy than the 1-phase approaches.

Although SDM alone does not perform as expected, it still supports SCDM. We can deduce that each detection approach compensates for the deficiencies of the other.

Figure 17. 1-phase vs. 2-phase
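The 2-phase flow, with SDM acting as a fuzzy filter in front of SCDM, can be sketched as below. The bounds 0.2 and 0.75 are the ambiguous area reported for the MD distribution. Treating the bounds as part of the ambiguous area, and the `run_scdm` callable that stands in for the system call-based detector, are assumptions made for illustration.

```python
MALWARE, BENIGN = "malware", "benign"


def two_phase_verdict(md_value, run_scdm, low=0.2, high=0.75):
    """Decide with SDM when the MD value is clear-cut, otherwise defer to SCDM.

    run_scdm is a hypothetical zero-argument callable that performs the
    slower system call-based detection and returns MALWARE or BENIGN.
    """
    if md_value < low:
        return BENIGN   # 1st phase filters the program out as benign
    if md_value > high:
        return MALWARE  # 1st phase filters the program out as malware
    return run_scdm()   # ambiguous area: pay for the 2nd phase
```

Only programs inside the ambiguous area incur the longer SCDM observing time, which is consistent with the lower average time cost reported for the strategy that places SDM in the 1st phase.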

Evaluation for Type Vectors

This experiment evaluates our classification method. To better recognize what kind of malware a sample is, we classify malware based on its exhibited behaviors. We extract 3337 different system call sequences, so the length of every type vector is 3337. Initially, we obtain 500 type vectors to denote 5 types. In our classification approach, a threshold decides whether a type vector is put into use. As we apply different values to it, the number of type vectors decreases, and the experimental results are illustrated in Figure 18. We prepare some malware of known types and some malware of unknown type. The lower-bound threshold for the unknown type is set to 0.4. As Figure 18 shows, malware of known types is distinguished more poorly as the threshold grows. The misjudgment might arise because the higher the threshold is set, the more type vectors are abandoned. Putting fewer type vectors into use leads to more serious misjudgment, as shown in Figure 19. If we want to keep better classification performance for known types, a threshold of 0.4 is desirable.

Figure 18. Classification for malware of known and unknown types

Figure 19. Number of type vectors vs. Threshold
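To make the role of the lower-bound threshold for the unknown type concrete, the sketch below assigns a sample to the type of its most similar type vector and falls back to "unknown" when no similarity reaches the bound. Cosine similarity is an assumption chosen for illustration; this section does not state the similarity measure, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(behavior_vector, type_vectors, unknown_below=0.4):
    """Return the best-matching type name, or "unknown" below the bound.

    type_vectors maps a type name to a list of type vectors; each vector has
    one entry per extracted system call sequence (3337 in our experiment).
    """
    best_type, best_score = "unknown", 0.0
    for type_name, vectors in type_vectors.items():
        for vec in vectors:
            score = cosine(behavior_vector, vec)
            if score > best_score:
                best_type, best_score = type_name, score
    return best_type if best_score >= unknown_below else "unknown"
```

With fewer type vectors in use, a known sample has fewer chances to find a close match, which is one way to read the degradation for known types as the threshold grows.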

Intrusive vs. Non-intrusive

We now turn to recognizing malware with or without intrusive behaviors. Intrusive malware is malware that carries out intrusive behaviors, and only malware of certain types does so. According to our statistics on all the malware from VX Heaven, intrusive malware accounts for only 3.8%. As Table 5 shows, the Behavioral Classifier differentiates intrusive and non-intrusive malware well. Some malware is still recognized incorrectly, probably because some intrusive behaviors are mistaken or because the intrusive malware hides its intrusive behaviors.

Table 5. Identification for malware with or without intrusions

Category               Classification result  Accuracy
Non-intrusive malware  Correct                88.9%
                       Incorrect              11.1%
Intrusive malware      Correct                82.7%
                       Incorrect              17.3%

Then, we compare the behaviors that non-intrusive malware and intrusive malware exhibit and describe the results in Table 6. In Table 6, the overlapping behavior ratio is the ratio of the behaviors of a given malware type that are also carried out by other types of malware. For example, of all the behaviors that worms carry out, 28.9% are also carried out by other types of malware. Conversely, 71.1% (1 - 28.9%) are carried out by worms only. The overlapping behavior ratio for intrusive malware is the greatest: more of the behaviors that intrusive malware performs are also carried out by other types of malware. Therefore, we can deduce that intrusive behaviors are in the minority.

Table 6. Behaviors carried out by non-intrusive or intrusive malware

                            Non-intrusive malware            Intrusive malware
Malware type                worm   backdoor  Trojan  Hoax    Bot
Overlapping behavior ratio  28.9%  27.8%     22.0%   17.5%   44.4%
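The overlapping behavior ratio defined above can be computed with plain set operations. The sketch assumes that each malware type is represented by the set of behaviors it carries out; the function name and the toy behavior labels in the usage note are hypothetical.

```python
def overlapping_behavior_ratio(behaviors_by_type, malware_type):
    """Fraction of a type's behaviors that some other type also carries out.

    behaviors_by_type maps a type name to the set of behaviors observed for it.
    """
    own = behaviors_by_type[malware_type]
    # Union of the behaviors of every other type.
    others = set().union(
        *(b for name, b in behaviors_by_type.items() if name != malware_type)
    )
    return len(own & others) / len(own) if own else 0.0
```

For instance, if worms exhibit behaviors `{"a", "b", "c", "d"}` and only `"a"` appears in any other type, the ratio is 0.25, and the remaining 0.75 are behaviors carried out by worms only.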
