
VI. Applications – Kernel Rootkit Recognition

6.2 Experiments

To evaluate the performance and precision of MrKIP, we conduct three sets of experiments. In the first experiment, the trojan Srizbi is used to demonstrate MrKIP’s profiling mechanism against pure kernel-level rootkits. The second experiment measures the performance of MrKIP, showing that it can recognize rootkits in a reasonable time. In the last experiment, we group 536 kernel-level rootkit instances into families using VirusTotal labels and randomly divide them into a training set and a testing set. We then evaluate the effectiveness of recognition.

All experiments are conducted on an Intel i7 machine running Windows 7. The samples are collected from Offensive Computing, a public sample-sharing forum. Note that MrKIP can also be applied to the recognition of ordinary user-level malware, since its behaviors are eventually executed through kernel-level functions. However, our experiments focus on evaluating the effectiveness of MrKIP against advanced, kernel-level trojans.

6.2.1 Case Study : Srizbi

Srizbi is one of the world’s largest botnets. Because it can hide itself at both the user level and the system level, it is difficult to detect and remove. Since Srizbi executes entirely in kernel mode, it can make its files and network traffic invisible to bypass detection. With these advanced rootkit techniques, Srizbi is considered a sophisticated rootkit. To demonstrate the correctness of BehaviorProfiler, we use this well-known rootkit family as a case study. We use two variants, Trojan.Win32.Srizbi.ah and Trojan.Win32.Srizbi.x, as labeled by Kaspersky, to evaluate the correctness of the extracted behavior profiles.

Our tool records the behaviors in both samples’ profiles. We observe that both Srizbi samples first delete some system files and then manipulate driver files. They also register themselves as a system service. For comparison, we also uploaded the Srizbi instances to two well-known online malware analysis systems, Threat Expert and Anubis. Anubis did not generate any information about them. Threat Expert captured certain registry modification behaviors, which form merely a subset of our profiling result. This comparison shows that our kernel-level behavior profiling is more effective than these conventional approaches for this rootkit.

The complete HMM generated by PatternGenerator for Srizbi contains more than a hundred states, too many to present in this article. To illustrate the idea, Figure 18 shows a portion of the generated pattern. Each node is a clustered state, and the string inside the node is the data selected as the centroid of that cluster. Each edge is labeled with its transition probability. In the model we can observe the three major types of captured behaviors: registry modification, packet transfer, and process creation. As shown, the transition probabilities between the sequential registry modifications are 1. This matches the fact that registering a program as a system service requires setting multiple registry entries in sequence.
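The transition probabilities described above can be illustrated with a small counting sketch. It is not PatternGenerator itself: it assumes each trace has already been mapped to clustered state labels, and it only computes maximum-likelihood transition frequencies between observed states rather than training a full HMM. The state names and traces are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next_state | state) by counting adjacent state pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for cur, nexts in counts.items()
    }

# Hypothetical traces of clustered states (not taken from the real Srizbi model).
traces = [
    ["reg_set_A", "reg_set_B", "reg_set_C", "send_packet"],
    ["reg_set_A", "reg_set_B", "reg_set_C", "create_process"],
]
probs = transition_probabilities(traces)
print(probs["reg_set_A"])  # the registry steps always follow each other
print(probs["reg_set_C"])  # after the last registry step the traces diverge
```

In this toy input the registry states always occur in the same order, so their transitions get probability 1, while the state after the last registry write splits evenly between the two follow-up behaviors.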

Figure 18 : Constructed model for Srizbi.

6.2.2 Effectiveness of Recognition

To evaluate the effectiveness of MrKIP’s PatternRecognizer, the next experiment compares the clustering result of MrKIP with the clustering done by commercial anti-virus software. The comparison is performed as follows. We upload each collected rootkit instance to VirusTotal, a website that provides the analysis results of dozens of anti-virus products simultaneously. Two instances are grouped together if they are reported as the same family by any of the anti-virus products, even different ones. This grouping is used as the ground truth against which our recognition result is compared. The 536 rootkit instances are then separated into a training set and a testing set. We divide each family into two partitions of equal size, intending to keep the total size of the training set equal to that of the testing set. However, certain families contain so few variants that we must keep enough instances for training, leading to a slightly imbalanced partition. In the end, we have 351 rootkit samples in the training set and 185 samples in the testing set.
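The grouping and splitting procedure can be sketched as follows. This is only an illustration under stated assumptions: family names are assumed to be already normalized across anti-virus engines, the `labels` input format is hypothetical, and `min_train` is a hypothetical parameter standing in for the unspecified "enough instances for training".

```python
import random
from collections import defaultdict

def group_by_shared_label(labels):
    """labels: {sample_id: set of family names reported by any engine}.
    Samples that share at least one family name end up in the same group."""
    parent = {s: s for s in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # family name -> first sample reported with that name
    for sample, families in labels.items():
        for fam in families:
            if fam in first_seen:
                parent[find(sample)] = find(first_seen[fam])
            else:
                first_seen[fam] = sample

    groups = defaultdict(list)
    for sample in labels:
        groups[find(sample)].append(sample)
    return sorted(sorted(g) for g in groups.values())

def split_family(samples, min_train=2, seed=0):
    """Randomly halve one family, keeping at least min_train samples for training."""
    shuffled = sorted(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = min(len(shuffled), max(min_train, len(shuffled) // 2))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical labels: s2 links s1 and s3 through two different family names.
labels = {"s1": {"Srizbi"}, "s2": {"Srizbi", "Agent"}, "s3": {"Agent"}, "s4": {"Delf"}}
groups = group_by_shared_label(labels)
train, test = split_family(groups[0])
```

With a minimum training size, small families contribute more samples to training than to testing, which is one way an imbalance like 351 versus 185 can arise.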

For each instance in the testing set, PatternRecognizer compares it with each constructed model and generates a matching score. Sorting these scores shows where the correct group (the right answer) is placed among all families. We refer to the position of the correct group in the sequence of families, sorted by similarity score from high to low, as the rank of that instance. For example, if the similarity score of family Trojan.Win32.Delf takes fourth place among all families when we recognize Trojan.Win32.Delf.cit, a confirmed variant of Trojan.Win32.Delf, then the rank of Trojan.Win32.Delf.cit is 4, meaning that Trojan.Win32.Delf is the fourth most similar family to Trojan.Win32.Delf.cit.
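The rank definition can be made concrete with a short sketch. The score values and the two placeholder families (`FamilyA`, `FamilyB`) are hypothetical, and ties are assumed to be broken by the order in which the families are listed.

```python
def rank_of_correct_family(scores, correct_family):
    """scores: {family: similarity score}. Rank 1 means the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(correct_family) + 1

# Hypothetical matching scores for the test instance Trojan.Win32.Delf.cit.
scores = {
    "FamilyA": 0.91,
    "Trojan.Win32.Srizbi": 0.85,
    "FamilyB": 0.80,
    "Trojan.Win32.Delf": 0.74,
}
rank = rank_of_correct_family(scores, "Trojan.Win32.Delf")
print(rank)
```

Here three families score higher than the correct one, so the instance gets rank 4, as in the example in the text.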

The cumulative distribution of the classification ranking is shown in Figure 19. The X-axis represents the rank and the Y-axis indicates the cumulative percentage of instances. A point (x, y) in the figure indicates that y% of the instances in the testing set have a rank of at most x, i.e., the correct family of y% of the instances can be found among the top x most similar families. A steeper curve indicates that more instances have lower rank numbers, which implies that the correct family gets a higher score from our PatternRecognizer.
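A cumulative ranking curve of this kind can be computed as in the sketch below. It assumes that a point (x, y) counts the instances whose rank is at most x, i.e., whose correct family is among the top x. The rank list is made-up illustrative data, not the experimental results.

```python
def cumulative_ranking(ranks, max_rank):
    """Return [(x, percentage of instances with rank <= x)] for x = 1..max_rank."""
    total = len(ranks)
    return [
        (x, 100.0 * sum(r <= x for r in ranks) / total)
        for x in range(1, max_rank + 1)
    ]

# Hypothetical ranks of ten test instances.
ranks = [1, 1, 1, 1, 1, 1, 2, 3, 5, 7]
curve = cumulative_ranking(ranks, 5)
for x, y in curve:
    print(f"top {x}: {y:.0f}% of instances")
```

The curve is non-decreasing by construction, and the earlier it approaches 100%, the more often the correct family appears near the top of the sorted list.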

Since our algorithms have a parameter δ, we repeat this experiment with different threshold values. When δ is set to 0, each behavior forms its own behavior group, even if its arguments are similar to those of other behaviors. As shown, without grouping similar behaviors, the classification result is poor. After raising the threshold to 0.2, 60% of the instances in the testing set have rank 1, indicating that MrKIP finds the correct answer. The cumulative curve also indicates that 80% of the instances have a rank less than 4; that is, MrKIP places the correct answer in the top three for 80% of the test instances.

The cumulative percentage increases to 90% when the rank reaches 5. If we raise the threshold further, unrelated behaviors may be grouped together, and the classification rate therefore decreases. As our experiments show, the appropriate threshold is around 0.2.
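The effect of the threshold δ can be illustrated with a toy grouping routine. This is a sketch under assumptions, not MrKIP's algorithm: it uses a greedy scheme in which a behavior joins the first group whose representative lies within δ, and it uses a normalized string distance from Python's `difflib` as a stand-in for the real distance measure. The behavior strings are hypothetical. In this sketch, δ = 0 still merges exactly identical strings.

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Normalized string distance in [0, 1]; 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def group_behaviors(behaviors, delta):
    """Greedy grouping: join the first group whose representative is within delta."""
    groups = []  # each group is a list; its first element is the representative
    for b in behaviors:
        for g in groups:
            if distance(g[0], b) <= delta:
                g.append(b)
                break
        else:
            groups.append([b])  # no group is close enough, start a new one
    return groups

# Hypothetical behavior records: two similar registry writes and one process creation.
behaviors = [
    r"RegSetValue HKLM\System\Services\drvA\Start",
    r"RegSetValue HKLM\System\Services\drvB\Start",
    r"CreateProcess C:\Windows\Temp\x.exe",
]
for delta in (0.0, 0.2, 1.0):
    print(delta, len(group_behaviors(behaviors, delta)))
```

With δ = 0 the two similar registry writes stay separate, with δ = 0.2 they merge while the process creation stays apart, and with a very large δ all three collapse into one group, which mirrors the trade-off described above.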

Figure 19 : Cumulative Ranking.

In the document: ProbeBuilder - Automating Probe Construction in Virtual Machine Introspection through Uncovering Opaque Kernel Data Structures (pages 95-99)