Chapter 4 Experiment
National Chengchi University
According to the above results (Table 5), the parallel version has a shorter execution time than the sequential one. The parallel version of the outlier detection algorithm does perform better, but not well enough: it still cannot handle large amounts of data or computation effectively. Next, we test the performance of the distributed version on our cluster (two computers, each with 4 cores and 8 GB of memory); the results are shown in Table 6.
Data size Data dimension Execution time (seconds)
100 1 521.319
200 1 8139.587
300 1 45156.248
Table 6. Results of performance test (distributed version)
The results show that the improvement from distributed computing is not significant (performance may even be worse) when the data size is small. This is because Spark needs some time to build the driver program. As the data size increases, the improvement will become more and more obvious.
4.2 Real-world experiment and analysis
After we complete the full program and verify its correctness, we collect real-world data and follow step 0 of the training phase described in Chapter 3 to detect malicious behaviors. We collect two kinds of VM behaviors: the system call distribution and the dynamic link libraries (DLLs) that each process uses.
Test1: System call distribution
The first kind of data is the system call distribution (the number of times each system call is executed), collected from virtual machines. We execute and monitor many malware samples on different VMs and then collect their system call distributions.
Figure 2 shows the collected system call distribution.
Figure 2. System call distribution (part)
Each row is the execution frequency of the system calls within ten seconds, produced by one VM. The first number is the label of the VM. Each attribute (before the colon) represents a system call, and each value (after the colon) is the execution frequency of that system call.
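The row layout described above (a label followed by attribute:value pairs) can be read with a few lines of code. The following is a minimal sketch in Python, assuming that tokens are separated by whitespace and that each value is an integer count. The function name parse_row and the sample row are hypothetical and are not taken from the collected data.

```python
from typing import Dict, Tuple


def parse_row(line: str) -> Tuple[int, Dict[str, int]]:
    """Parse one row of the form '<label> <attr>:<value> <attr>:<value> ...'."""
    tokens = line.split()
    label = int(tokens[0])  # the first number is the VM label
    counts: Dict[str, int] = {}
    for token in tokens[1:]:
        # Split on the last colon: attribute before it, frequency after it.
        attribute, _, value = token.rpartition(":")
        counts[attribute] = int(value)
    return label, counts


# Hypothetical row: the attribute identifiers and counts are made up.
label, counts = parse_row("0 3:12 17:4 42:130")
```

Keeping the counts in a dictionary preserves the sparse layout of the file, since only the system calls that appear in a row are stored.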
To simplify calculation and identification, we give a label to each VM. There are five different VMs, shown in Table 7: a normal VM and four VMs that each execute different malware.
Labels 0 1 2 3 4
VMs VM0 VM1 VM2 VM3 VM4
Behaviors Normal Malware 1 Malware 2 Malware 3 Malware 4
Sizes 720 1905 4896 2104 1569
Table 7. The experimental data for test 1
We collect a large amount of data, but we do not need all of it for testing. We use the following two methods to sample the data:
(1) We randomly sample 300 records from a contiguous range (500 records) of VM0 and 5 records from each of VM1 to VM4 in turn, and synthesize 4 testing files (data0_1, data0_2, data0_3, data0_4). Each file has 305 records.
(2) We randomly sample 300 records from each of VM1 to VM4 and 5 records from each of the remaining VMs in turn, and synthesize the corresponding testing files (data1_0, data1_2, data1_3, data1_4, data2_0, data2_1, data2_3, data2_4, data3_0, data3_1, data3_2, data3_4, data4_0, data4_1, data4_2, data4_3). Each file has 305 records.
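The two sampling methods can be sketched as follows in Python. This is only an illustration under stated assumptions: the function name build_test_file, the seed, and the random choice of where the contiguous range starts are hypothetical, and the placeholder rows stand in for the real records. The sketch also assumes that a file named dataX_Y holds 300 records from VMX and 5 records from VMY.

```python
import random
from typing import List, Optional, Sequence


def build_test_file(majority: Sequence, minority: Sequence,
                    n_majority: int = 300, n_minority: int = 5,
                    window: Optional[int] = None, seed: int = 0) -> List:
    """Draw n_majority rows from one VM and n_minority rows from another.

    If window is given, the majority rows are drawn from a contiguous range
    of that length (method 1); otherwise from all rows (method 2).
    """
    rng = random.Random(seed)
    pool = list(majority)
    if window is not None:
        # Assumption: the start of the contiguous range is picked at random.
        start = rng.randrange(len(pool) - window + 1)
        pool = pool[start:start + window]
    # random.sample draws without replacement, so no row is repeated.
    return rng.sample(pool, n_majority) + rng.sample(list(minority), n_minority)


# Placeholder rows (VM name, row index) with the sizes listed in Table 7.
vm0 = [("VM0", i) for i in range(720)]
vm3 = [("VM3", i) for i in range(2104)]
vm4 = [("VM4", i) for i in range(1569)]

data0_4 = build_test_file(vm0, vm4, window=500)  # method (1)
data3_4 = build_test_file(vm3, vm4)              # method (2)
```

Both calls return 305 rows, matching the file size stated above.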
In each test we record the execution time, the average dimension of the data, the number of iterations, and the outliers found. The results are shown in Table 8. Taking data0_4 and data3_4 as examples, we present the number of iterations of each stage graphically in Figure 3 and Figure 4.
Figure 3. Iterations of each stage in testing data0_4
Figure 4. Iterations of each stage in testing data3_4
Execution time (seconds) Average dimension Total iterations Find outliers
data0_1 714.498 36 366 4/5
The results show that data0_1, data0_2, data0_3, and data0_4 have better true negative results and shorter execution times. Figure 3 and Figure 4 show that their average number of iterations is much lower than that of the other data. This is probably because we sample their 300 records from a contiguous range (500 records) of VM0 rather than from the original 720 records of VM0. The wide range of the original 720 records is reduced to a small, contiguous range of 500 records. This has an effect similar to clustering: the 720 records are divided into two clusters of 500 and 220 records. For this reason, we may later try to cluster the data with GHSOM before training and then take one of the clusters for training, to see whether we obtain a higher accuracy.
Test2: Dynamic link libraries (DLLs) that each process has used
The other kind of data is the DLLs used by each process, collected from the VMs. We execute and monitor many malware samples on different VMs and then collect and record all the DLLs that each process in the VM has used. Figure 5 shows the file that records all the DLLs used by each process.
Figure 5. Libraries that each process used (part)
Each row lists the DLLs that one process is using at a given moment. The first number is the label of the VM behavior. Each attribute (before the colon) represents a DLL, and each value (after the colon) indicates whether the process uses that DLL: 1 means used and 0 means unused.
We also give different labels to different VMs. There are three different VMs, shown in Table 9: a normal VM and two VMs that each execute different malware.
Behaviors Normal Malware 1 Malware 2
Sizes 1467 100 25
Table 9. The experimental data for test 2
In test2, we randomly sample 775 records from VM0 and 25 records from each of the remaining VMs in turn, and synthesize the corresponding testing files (lib6_1 and lib6_2). Each file has 800 records.
We again record the execution time, the average dimension of the data, the number of iterations, and the outliers found in each test. The results are shown in Table 10.
Execution time (seconds) Average dimension Total iterations Find outliers
lib6_1 29773.487 678 18983 12/25
influential attributes to see whether a higher accuracy we will get or not.
collection to improve program performance. Spark also provides many application programming interfaces (APIs), which we can use to optimize the performance of our program. For example, we can use the persist() or cache() method to keep an RDD in memory on the nodes after it is first computed in an action. The RDD can then be reused, which makes subsequent actions faster.
Besides the collections we have used, we can also apply multiple actors provided by Akka to determine lambda for newly added hidden nodes (Table 3).
Since all possible lambdas are regular, we can execute step 3 in Table 3 on different computers or virtual machines. Each computer or VM receives a different lambda, determines the weights and thresholds of the new hidden nodes, applies BPNN to adjust w (the thresholds and weights) of the SLFN, and returns the results to the main program.
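The dispatch pattern described above (one lambda per worker, with the results returned to the main program) can be illustrated with a small sketch. It is written in Python with a thread pool from concurrent.futures as a stand-in for Akka actors on separate machines, so it shows only the fan-out and collection of results, not Akka itself. The function trial_for_lambda is a hypothetical placeholder that returns a toy error value; it does not implement step 3 of Table 3 or BPNN.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def trial_for_lambda(lam: float) -> float:
    """Placeholder for one worker's job.

    A real worker would determine the weights and thresholds of the new
    hidden nodes for this lambda and retrain with BPNN. Here a toy error
    value is returned so that the sketch can run on its own.
    """
    return (lam - 0.3) ** 2


def run_trials(lambdas: Iterable[float], max_workers: int = 4) -> Dict[float, float]:
    """Send each candidate lambda to a worker and collect the results."""
    candidates = list(lambdas)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the candidates.
        errors = list(pool.map(trial_for_lambda, candidates))
    return dict(zip(candidates, errors))


results = run_trials([0.1, 0.2, 0.3, 0.4])
```

Because the trials do not depend on each other, the same structure would apply if each worker were an actor on a different computer or VM; only the transport of the lambda and of the result would change.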
Another part to which we can apply Akka is the deletion of all potentially irrelevant hidden nodes. The original approach is to try removing each hidden node in turn and then use BPNN to retrain on the input behaviors until an acceptable result is obtained. By applying multiple actors, we can use many computers to delete different hidden nodes and retrain at the same time. This can significantly shorten the time spent on deleting all potentially irrelevant hidden nodes.
As for the results of test1 and test2, the testing behaviors we collected are relatively heterogeneous. Many behaviors that do not belong to malware were also collected.