Experimental Results - Three-Phase Android Malware Detection and Classification Based on Common Behaviors

Chapter 5. Evaluation

5.2 Experimental Results

We discuss the experimental results from seven aspects: impact on permissions, impact of the value N for N-gram on system call sequences, impact of the number of malicious behaviors for LCS on system call sequences, detection performance vs. time consumption, effectiveness on the order of applied detection phases, performance comparison for one-phase and two-phase detectors, and evaluation for type vector based classification.

Impact on Permissions

We first evaluate the PBD. We began by collecting preliminary statistics on how often each permission appears in benign applications and in malware. Figure 9(a) and 9(b) show the distribution of the top 20 permissions requested by malware and by benign applications, respectively. The permissions most commonly requested by the two groups differ. In Figure 9(a), we also observe that some permissions, such as SEND_SMS, READ_SMS, READ_CONTACTS, RECEIVE_SMS, WRITE_SMS, and RECEIVE_BOOT_COMPLETED, are requested much more frequently by malware than by benign applications. Figure 10(a) and 10(b) show the permissions requested by benign applications and malware, respectively.

We calculate the probabilities of 139 permissions using Bayes' theorem. To judge an inspected application, we calculate the product of its permission probabilities. If the product falls within the predefined range, the application is judged as suspicious.

Figure 11 shows the accuracy for different thresholds. More benign applications and malware are filtered out as the threshold is increased. Since we use this phase to reduce the number of applications that need further analysis, an upper bound of 0.9 and a lower bound of 0.1 are a good choice based on our experiments.
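The thresholding step described above can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the source does not give the exact formula, so the add-one smoothing, the equal prior, and the normalized product used in `combined_score` are assumptions, and all function names are hypothetical. The bounds 0.1 and 0.9 are the ones chosen in the text.

```python
from math import prod

def permission_posterior(n_mal_with, n_mal, n_ben_with, n_ben, prior_mal=0.5):
    """P(malware | permission requested) via Bayes' theorem.

    Add-one smoothing and the 0.5 prior are assumptions of this sketch.
    """
    p_perm_given_mal = (n_mal_with + 1) / (n_mal + 2)
    p_perm_given_ben = (n_ben_with + 1) / (n_ben + 2)
    numerator = p_perm_given_mal * prior_mal
    return numerator / (numerator + p_perm_given_ben * (1 - prior_mal))

def combined_score(posteriors):
    """Combine per-permission probabilities by a normalized product (assumed form)."""
    mal = prod(posteriors)
    ben = prod(1 - p for p in posteriors)
    return mal / (mal + ben)

def pbd_verdict(score, lower=0.1, upper=0.9):
    """Three-way decision: only scores between the bounds stay suspicious."""
    if score >= upper:
        return "malware"
    if score <= lower:
        return "benign"
    return "suspicious"

# Hypothetical posteriors for the permissions one application requests.
print(pbd_verdict(combined_score([0.9, 0.9])))  # decisive on the malware side
print(pbd_verdict(combined_score([0.5])))       # stays suspicious
```

Only applications that end up as "suspicious" would need to be passed on to a later, more expensive phase.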

Impact of the value N for N-gram on System Call Sequences

We evaluate SBD with the N-gram and the LCS algorithms in this experiment. The length of a system call sequence is the number of system calls that the sequence contains. For LCS, the length of system call sub-sequences is dynamic, so we do not have to predefine it. However, the length of system call sequences for N-gram is configurable, so we vary the value N to see its effect.

Figure 9. Distribution of the permissions requested by benign and malicious applications

(a) The top 20 of requested permissions by malware

(b) The top 20 of requested permissions by benign applications

Figure 10. The permissions requested by benign applications and malware

(a) The requested permissions by benign applications

(b) The requested permissions by malware

Figure 11. Performance for PBD

The value N for N-gram is the unit length of the system call sequences extracted from the full system call traces. Different lengths can lead to different performance. A small N would cause malicious system call sequences to be filtered out. If a system provides only 200 different system calls, a value of N = 2 yields only 19,900 combinations of two distinct calls, so malicious sequences can easily be filtered out by a relatively large number of benign system call sequences. In contrast, a large N would preserve too many benign behaviors within a malicious system call sequence. Consequently, it is important to choose a good value for N.
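A minimal sketch of the trade-off described above, assuming that N-grams are contiguous windows over a trace and that a malware N-gram is discarded when it also occurs in any benign trace. The function names and the toy traces are hypothetical. The sketch also reproduces the count quoted in the text: 19,900 is the number of unordered pairs of distinct calls, C(200, 2); counting ordered pairs with repeats would give 200^2 = 40,000.

```python
from math import comb

def ngrams(trace, n):
    """All contiguous length-n windows of a system call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def surviving_ngrams(malware_trace, benign_traces, n):
    """N-grams of a malware trace that never occur in any benign trace."""
    benign = set()
    for trace in benign_traces:
        benign.update(ngrams(trace, n))
    return [g for g in ngrams(malware_trace, n) if g not in benign]

# Size of the 2-gram space for 200 distinct system calls.
unordered_pairs = comb(200, 2)  # 19900, the figure quoted in the text
ordered_pairs = 200 ** 2        # 40000 if order and repeats are counted

# Toy traces (hypothetical).
benign = [["open", "read", "close"], ["open", "write", "close"]]
malware = ["open", "read", "socket", "sendto", "close"]

print(surviving_ngrams(malware, benign, 2))
print(surviving_ngrams(malware, benign, 3))
```

With N = 2 the benign pair (open, read) is removed, while with N = 3 every window survives, including one that begins with the benign prefix open, read. This is the "large N preserves benign behavior" effect on a small scale.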

Figure 12 shows the detection performance for various values of N. We divide Figure 12 into three areas by N. In the first area (N from 2 to 4), the FN curve descends while the FP curve ascends. In the second area (N from 5 to 15), the FN curve reaches 0% and the FP curve remains steady. In the third area (N from 20 to 150), the FP curve descends while the FN curve ascends.

The higher FN rate in the first area arises because N is so small that significant system call sequences are filtered out. In contrast, the higher FN rate in the third area arises because N is so large that malicious system call sequences are mixed with benign ones. If we want a lower FN rate, a good value of N lies in the second area. Based on the experiment, we choose N = 15. Although the FP rate is higher when N is 15, we can reduce it with the help of PBD.

Impact of the number of malicious behaviors for LCS on System Call Sequences

Figure 13 presents the percentage of samples versus the number of malicious behaviors. We observe that some malicious behaviors are also performed by benign applications, and that there are fewer FPs as the number of malicious behaviors increases. If LCS-based SBD works with PBD, requiring one malicious behavior would be a good choice. If LCS-based SBD works alone, requiring two malicious behaviors would be more suitable.
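The following sketch illustrates the two ideas in this subsection: extracting a common behavior from two traces with a longest common subsequence, and flagging a trace once it contains a minimum number of known malicious behaviors. It is an assumption-laden illustration, not the thesis code: the source does not say how behaviors are matched, so treating a behavior as a (not necessarily contiguous) subsequence is an assumption, and every name and trace below is hypothetical.

```python
def lcs(a, b):
    """Longest common subsequence of two system call traces (classic DP)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def contains_behavior(trace, behavior):
    """True if `behavior` occurs in `trace` as a subsequence (assumed matching rule)."""
    remaining = iter(trace)
    return all(call in remaining for call in behavior)

def sbd_verdict(trace, behaviors, min_behaviors=1):
    """Flag a trace once it matches at least `min_behaviors` known behaviors."""
    hits = sum(contains_behavior(trace, b) for b in behaviors)
    return hits >= min_behaviors

# Hypothetical traces from two samples of the same malware family.
common = lcs(["fork", "open", "socket", "sendto"],
             ["open", "read", "socket", "close", "sendto"])
print(common)
```

Raising `min_behaviors` from 1 to 2 makes the detector stricter, which corresponds to the setting suggested in the text for running LCS-based SBD without PBD.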

Figure 12. Detection performances for system call sequences with various N

Figure 13. Distribution for the number of malicious behaviors

Detection Performance vs. Time Consumption

Malware detection mechanisms can be categorized into static analysis and dynamic analysis techniques. Static analysis is simple and efficient, whereas dynamic analysis is complex and time-consuming. In our approach, PBD is static and SBD is dynamic. Here we discuss the tradeoff between accuracy and time requirement. All the experiments are conducted on a machine equipped with an Intel Core i3 3.1GHz CPU and 16GB of RAM, running a 64-bit Windows 7 operating system. A summary of the experimental results is shown in Table 5.

We first compare accuracy and time for LCS-based and N-gram-based SBD. Although it requires slightly more time, the LCS-based detector achieves better accuracy. We also compare accuracy and time between PBD and LCS-based SBD. Although PBD runs much faster, its accuracy is poorer.

Table 5. Detection performance vs. time consumption

Category  Algorithm          FN   FP   Accuracy  Time Consumption (sec/application)
PBD       Bayes Probability  2%   24%  87%       2.57
SBD       LCS                3%   14%  91.5%     601.38
SBD       N-gram             0%   35%  82.5%     600.87
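As a quick consistency check on Table 5, the reported Accuracy values coincide with the mean of the true-positive rate (1 - FN) and the true-negative rate (1 - FP). The source does not define Accuracy, so reading it this way is an observation about the numbers rather than the author's stated formula, and the function name is hypothetical.

```python
def balanced_accuracy(fn_rate, fp_rate):
    """Mean of the true-positive rate (1 - FN) and the true-negative rate (1 - FP)."""
    return ((1 - fn_rate) + (1 - fp_rate)) / 2

# (FN, FP, reported accuracy) taken from Table 5.
table5 = {
    "PBD / Bayes Probability": (0.02, 0.24, 0.870),
    "SBD / LCS": (0.03, 0.14, 0.915),
    "SBD / N-gram": (0.00, 0.35, 0.825),
}

for name, (fn, fp, reported) in table5.items():
    print(name, round(balanced_accuracy(fn, fp), 3), reported)
```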

Effectiveness on the Order of Applied Detection Phases

We also consider the order in which the detection phases, i.e., PBD and SBD, are applied. The measured detection time and detection performance are shown in Table 6.

In terms of time consumption, if the PBD is applied first, it takes 599 seconds and 262 seconds to analyze permissions for all benign applications and all malware, respectively, and the LCS-based SBD then spends 32,228 seconds and 6,747 seconds to analyze system call sequences for the remaining 23% of benign applications and 11% of malware. In contrast, if we swap the order, the LCS-based SBD takes 140,122 seconds and 61,341 seconds to analyze system call sequences for all benign applications and all malware, and the PBD then spends 84 seconds and 8 seconds to analyze permissions for the remaining 14% of benign applications and 3% of malware. Running the PBD first gives a lower FNR and a lower FPR, and it is also far cheaper because PBD consumes much less time than SBD. Based on the experiments, we choose PBD as the first-phase detector and SBD as the second-phase detector.

Table 6. Time consumption, FNR, and FPR

                    Procedure             Time Consumption                FNR for    FPR for
Strategy   Samples  1st Phase  2nd Phase  1st Phase  2nd Phase  Total     1st Phase  1st Phase
PBD → SBD  Benign   100%       23%        599s       32,228s    32,827s   -          1%
PBD → SBD  Malware  100%       11%        262s       6,747s     7,009s    2%         -
SBD → PBD  Benign   100%       14%        140,122s   84s        140,206s  -          14%
SBD → PBD  Malware  100%       3%         61,341s    8s         61,349s   3%         -
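The totals in Table 6 can be re-derived with a few lines of arithmetic. The sketch below uses only the per-phase times reported in the table. The dictionary layout and the function name are hypothetical, and the overall ratio is a derived figure that the source does not state.

```python
# (first-phase seconds, second-phase seconds) per sample set, from Table 6.
timings = {
    "PBD -> SBD": {"benign": (599, 32228), "malware": (262, 6747)},
    "SBD -> PBD": {"benign": (140122, 84), "malware": (61341, 8)},
}

def strategy_total(strategy):
    """Total seconds spent on benign plus malware samples for one ordering."""
    return sum(first + second for first, second in timings[strategy].values())

pbd_first = strategy_total("PBD -> SBD")  # 32,827 + 7,009
sbd_first = strategy_total("SBD -> PBD")  # 140,206 + 61,349

print(pbd_first, sbd_first, round(sbd_first / pbd_first, 2))
```

On these numbers, running PBD first takes roughly one fifth of the time of the reverse order, because the expensive dynamic phase only sees the samples that PBD could not decide.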

Performance Comparison for One-phase and Two-phase Detectors

From the above experiments, we know that PBD filters out 76% of benign applications with the lower bound of 0.1 and 87% of malware with the upper bound of 0.9, and that LCS-based SBD has better accuracy than PBD. Figure 14 compares the accuracy of one-phase detectors and two-phase detectors (PBD first and then SBD). The two-phase detectors have a lower FPR than the one-phase detectors.

Although one-phase detectors can perform poorly on their own, the combined detectors consistently perform well. This also shows that PBD and SBD complement each other.
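The two-phase arrangement (PBD first, then SBD) can be sketched as a simple cascade. This is an illustration under assumptions: the PBD score is taken to be a malware probability, so values at or below the lower bound are treated as benign and values at or above the upper bound as malware. Both detector callables and the toy scores are hypothetical stand-ins.

```python
def two_phase_detect(app, pbd_score, sbd_is_malicious, lower=0.1, upper=0.9):
    """Run the cheap static phase first; fall back to the dynamic phase if undecided."""
    score = pbd_score(app)
    if score <= lower:
        return "benign"
    if score >= upper:
        return "malware"
    # Only suspicious applications pay for the expensive dynamic analysis.
    return "malware" if sbd_is_malicious(app) else "benign"

# Hypothetical stand-ins for the two detectors.
scores = {"app_a": 0.02, "app_b": 0.97, "app_c": 0.40}
sbd_calls = []

def fake_sbd(app):
    sbd_calls.append(app)
    return app == "app_c"

for app in scores:
    print(app, two_phase_detect(app, scores.get, fake_sbd))
print("SBD invoked for:", sbd_calls)
```

In this toy run the dynamic phase is invoked for a single application, which mirrors why the cascade saves time when most samples are decided by permissions alone.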

Figure 14. Accuracy comparison for one-phase and two-phase detectors

Now we examine the causes of false negatives for two-phase detectors. In PBD, 2% of malware was wrongly filtered out because some malicious applications request only a few non-critical permissions. In SBD, we missed 1% of malware because some malicious behaviors are triggered only after the update option is accepted. We also discuss the causes of false positives for two-phase detectors. Since most Android malware consists of repackaged applications, the recorded system call sequences mix benign and malicious behaviors. False positives may occur if benign behaviors are not completely filtered out. In addition, noise from the Dalvik VM, which runs the applications on Android devices, may cause both false positives and false negatives. This is because system call sequences originating from the Dalvik VM itself are recorded as well, and it is not possible to tell the real origin of a system call.

Evaluation for Type Vector Based Classification

Table 7 shows the details of type vectors. For system call sequences, we get 149 system call sequence vectors to denote 17 types, and the length of system call sequence vectors is 1460. For permissions, we get 68 permission vectors to denote 17 types, and the length of permission vectors is 139. If we mix system call sequences and permissions, we get 156 mix vectors to denote 17 types, and the length of mix vectors is 1599.

Table 7. The details of type vectors

Category                      Number of Vectors  Length of Vectors
System Call Sequence Vectors  149                1460
Permission Vectors            68                 139
Mix Vectors                   156                1599
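The vector lengths above add up (1460 + 139 = 1599), which is consistent with a mix vector being the concatenation of a system call sequence vector and a permission vector. The source does not state how mix vectors are built, so the concatenation below is an assumption, and the names are hypothetical.

```python
SYSCALL_DIM = 1460     # length of system call sequence vectors (from the text)
PERMISSION_DIM = 139   # length of permission vectors (from the text)

def mix_vector(syscall_vec, permission_vec):
    """Concatenate the two feature vectors (assumed construction of a mix vector)."""
    if len(syscall_vec) != SYSCALL_DIM or len(permission_vec) != PERMISSION_DIM:
        raise ValueError("unexpected vector length")
    return list(syscall_vec) + list(permission_vec)

mixed = mix_vector([0] * SYSCALL_DIM, [1] * PERMISSION_DIM)
print(len(mixed))
```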

Finally, we evaluate the performance of BBC, which recognizes the type (known or new) of a detected malware. We classify a detected malware based on permissions, system call sequences, or a mix of both. To show that type vectors are good at classifying malware types, we attempt to classify all identified malicious applications into a malware type. We use the PBD detector followed by the LCS-based SBD detector. The classification results for known types using LCS-based type vectors, permission-based type vectors, and mix-based type vectors are shown in Table 8. We find that 93%, 99%, and 96% of malicious applications are classified into the correct type based on LCS-based, permission-based, and mix-based type vectors, respectively. This indicates that permission-based type vectors are a better choice than LCS-based or mix-based type vectors.

Table 8. Classification results for known types of malware

Malware Type  Category                      Number of Malware  Classification Result  Percentage
Known Type    System Call Sequence Vectors  99                 Correct                93%
Known Type    System Call Sequence Vectors  99                 Incorrect              7%
Known Type    Permission Vectors            99                 Correct                99%
Known Type    Permission Vectors            99                 Incorrect              1%
Known Type    Mix Vectors                   99                 Correct                96%
Known Type    Mix Vectors                   99                 Incorrect              4%

[...] known types and new types of malware, as shown in Table 4. We use a threshold of 0.5 for cosine similarities with LCS-based type vectors, a threshold of 0.8 for cosine similarities with permission-based type vectors, and a threshold of 0.65 for cosine similarities with mix-based type vectors. Table 9 shows the classification results.

With LCS-based type vectors, although the correct classification rate drops by 10 percentage points for known types, more than 81% of new-type malware is classified correctly. It is worth noting that with permission-based type vectors, the correct classification rate drops by only 1 percentage point for known types, and the correct classification rate for new types is more than 98%. With mix-based type vectors, more than 99% of new-type malware is classified correctly, but the correct classification rate drops by 3 percentage points for known types. We conclude that the permission-based type vector classifier performs better at classifying malware types.
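The thresholded cosine-similarity step can be sketched as below. This is a minimal illustration, assuming a detected sample is assigned to the most similar type vector and is reported as a new type when even the best similarity falls below the threshold. The construction of type vectors is not covered in this section, so the toy vectors, the type names, and the function names are hypothetical. The threshold 0.8 is the one quoted for permission-based type vectors.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(sample, type_vectors, threshold):
    """Return the best-matching known type, or "new type" below the threshold."""
    best_type, best_sim = None, -1.0
    for type_name, vector in type_vectors:
        sim = cosine(sample, vector)
        if sim > best_sim:
            best_type, best_sim = type_name, sim
    return best_type if best_sim >= threshold else "new type"

# Toy type vectors (hypothetical); several vectors may denote the same type.
type_vectors = [("TypeA", [1, 1, 0, 0]), ("TypeB", [0, 0, 1, 1])]

print(classify([1, 1, 1, 0], type_vectors, threshold=0.8))
print(classify([1, 0, 1, 0], type_vectors, threshold=0.8))
```

A lower threshold assigns more samples to known types, while a higher one reports more samples as new types, which is one way to read the differing thresholds chosen for the three kinds of type vectors.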

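Table 9. Classification results for known and new types of malware

Category                      Malware Type  Number of Malware  Classification Result  Percentage
System Call Sequence Vectors  Known Type    99                 Correct                83%
System Call Sequence Vectors  Known Type    99                 Incorrect              17%
System Call Sequence Vectors  New Type      42                 Correct                81%
System Call Sequence Vectors  New Type      42                 Incorrect              19%
Permission Vectors            Known Type    99                 Correct                98%
Permission Vectors            Known Type    99                 Incorrect              2%
Permission Vectors            New Type      42                 Correct                98%
Permission Vectors            New Type      42                 Incorrect              2%
Mix Vectors                   Known Type    99                 Correct                93%
Mix Vectors                   Known Type    99                 Incorrect              7%
Mix Vectors                   New Type      42                 Correct                99%
Mix Vectors                   New Type      42                 Incorrect              1%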
