Figure: Calculation of the physical address generated by EBP-based accessing. Legend: the offset encoded in the EBP-based accessing instruction; the current value of the register; the physical address of the base page; the physical address accessed by this instruction.

This observation saves us from enqueuing a memory address for each EBP-based (or ESP-based) memory access. The benefit comes with a trade-off, however: the helper must hold the correct paddr_base and paddr_siding for the calculation. This is achieved by having the emulator perform virtual address translation and deliver the translated addresses to the helper every time EBP or ESP is modified. Their values can sometimes change so frequently that the cost of address translation erodes the benefit.

To overcome this limitation, the following technique is used. In most cases, both EBP and ESP point to locations inside the stack segment (if the program has one). Therefore, an out-of-box hook is placed on the code in charge of segment allocation to acquire the physical pages mapped to the pages of the stack segment. These physical pages are specially labeled so that, whenever EBP or ESP is modified, we can tell whether the register points to a labeled page. The addresses of all these physical pages are delivered to the helper thread only at the infrequent context switches or user/kernel mode switches. As a result, the helper learns paddr_base and paddr_siding without requiring the emulator to perform address translation every time EBP or ESP is modified. If EBP or ESP points to an unlabeled page, delivery automatically falls back to the slower translate-then-deliver mode upon each EBP or ESP modification.
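The labeled-page lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SWIFT's actual helper code: the function name `resolve_stack_access`, the 4 KB page size, the 32-bit address wrap-around, and the table `labeled_pages` (virtual page base to physical page base) are all hypothetical and chosen only for the example.

```python
PAGE_SIZE = 4096              # assumed 4 KB pages (32-bit x86 guest)
PAGE_MASK = PAGE_SIZE - 1
ADDR_MASK = 0xFFFFFFFF        # assumed 32-bit virtual addresses


def resolve_stack_access(reg_value, offset, labeled_pages):
    """Return the physical address of [reg + offset], or None.

    labeled_pages is a hypothetical table that maps the virtual base
    address of each labeled stack page to the physical base address of
    that page. None means the page is unlabeled, so the caller would have
    to fall back to the slower translate-then-deliver mode.
    """
    vaddr = (reg_value + offset) & ADDR_MASK   # effective virtual address
    page_base = vaddr & ~PAGE_MASK             # virtual page containing it
    paddr_page = labeled_pages.get(page_base)
    if paddr_page is None:
        return None
    return paddr_page + (vaddr & PAGE_MASK)    # keep the in-page offset
```

With a table delivered once at a context switch, the helper can resolve every later EBP-based or ESP-based access locally; only a miss (a `None` result) requires the emulator to translate the address.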

4.4 Peripherals

Tracking information flows across peripherals is a primary goal of SWIFT. For the time being, SWIFT tracks information flows in hard disks and network interfaces.

For hard disks, every DMA operation and port I/O transfer between the hard disk buffer and memory is watched, and taint tags are propagated along with the data movement. The taint maps are stored hierarchically, much like the page directory mechanism of a conventional MMU, to avoid excessive memory consumption. In this way, no taint tags are allocated for sectors that have never been tainted.
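The hierarchical taint map can be illustrated with a two-level sparse structure. The sketch below is only an assumption about how such a map might look: the class name `SectorTaintMap`, the leaf size of 1024 sectors, and the one-byte tag per sector are hypothetical and are not taken from the SWIFT implementation.

```python
class SectorTaintMap:
    """Two-level sparse taint map: directory -> leaf table -> taint tag."""

    LEAF_BITS = 10                       # assumed: 1024 sectors per leaf table
    LEAF_SIZE = 1 << LEAF_BITS

    def __init__(self):
        self._directory = {}             # leaf index -> bytearray of tags

    def set_tag(self, sector, tag):
        """Store a taint tag (0..255) for a sector, allocating lazily."""
        hi = sector >> self.LEAF_BITS
        lo = sector & (self.LEAF_SIZE - 1)
        leaf = self._directory.get(hi)
        if leaf is None:
            if tag == 0:
                return                   # untainted sector: nothing to allocate
            leaf = self._directory[hi] = bytearray(self.LEAF_SIZE)
        leaf[lo] = tag

    def get_tag(self, sector):
        """Return the taint tag of a sector; 0 means never tainted."""
        leaf = self._directory.get(sector >> self.LEAF_BITS)
        if leaf is None:
            return 0
        return leaf[sector & (self.LEAF_SIZE - 1)]

    def allocated_leaves(self):
        """Number of leaf tables that actually consume memory."""
        return len(self._directory)
```

A leaf table is created only when a sector in its range is first tainted, so a mostly clean disk costs almost no shadow storage.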

The watching on network interfaces follows a similar pattern. Since packet exchange between the NIC and memory is usually done with DMA, only DMA operations are watched in the SWIFT implementation.


Chapter 5 Evaluation

Algorithm 1 is implemented on the QEMU emulator. The algorithm generally requires 8~12 gigabytes of memory to store the acquired dereference paths. Ubuntu 12.04 64-bit is chosen as the host environment, and the host has 16 gigabytes of RAM installed so that the QEMU process can use as much memory as it demands. To implement Algorithm 2 and Algorithm 3, IDA Pro and IDAPython are used to parse the memory dump and construct the basic graph. The graph is then converted to a NetworkX graph object, with the edges of indirect branches added. The code generator is implemented in 127 lines of Ruby script.

The effectiveness of ProbeBuilder is evaluated by executing the following six subject behaviors: process creation, file creation, registry creation, process termination, file deletion, and registry deletion. All these behaviors are performed through the upper-layer Win32 API with tainted arguments. Windows XP SP3 32-bit is installed as the guest operating system.

ProbeBuilder generates probe locations for these behaviors and the corresponding data dereferences. To verify the correctness and quality of the automatically generated probes, for each behavior three probes are selected at random and implemented in another QEMU instance dedicated to behavior monitoring. In the QEMU instance, Process Monitor produced by Sysinternals and Wireshark are installed. The log trace generated by our probes is then compared with the ones generated by the above two tools. To demonstrate the strength of ProbeBuilder, a kernel-level VMI profiler is implemented using the generated probe locations and dereferences.

5.1 Probe Generation

As mentioned earlier, Algorithm 1 is executed repeatedly to eliminate unstable probe locations. After that, graph analysis (Algorithm 2 and Algorithm 3) is used to eliminate non-dedicated code locations. To illustrate the effectiveness of the process, each entry in Table 5 shows the total number of remaining probe candidates for a specific behavior after the given number of runs of Algorithm 1. Since only identical probe locations and data dereferences are kept after each round, the total number decreases as the process continues. Each test is repeated 50 times to ensure the stability of the sieved probe candidates; in all these tests, however, the total number of probe candidates stabilized in fewer than 10 rounds. As an example, the test for process termination takes only 3 rounds to converge. The experiment indicates two important facts. First, a large portion of the candidates found in the first round are eliminated in later rounds, which shows the benefit of multi-run elimination. Second, the fact that all these numbers converge to stable points supports the existence of stable probe locations and data dereferences.
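The multi-run elimination described above amounts to intersecting the candidate sets of successive runs. The following sketch shows that idea only; it does not reproduce Algorithm 1 itself. The function name `sieve_candidates` and the representation of a candidate as a `(probe_location, dereference_path)` tuple are assumptions made for illustration.

```python
def sieve_candidates(runs):
    """Intersect candidate sets round by round.

    runs: iterable of sets. Each element is a hashable candidate, here
    assumed to be a (probe_location, dereference_path) tuple. A candidate
    survives only if the identical location and path appear in every run.
    Returns the surviving set and the survivor count after each round.
    """
    survivors = None
    counts = []
    for candidates in runs:
        current = set(candidates)
        survivors = current if survivors is None else survivors & current
        counts.append(len(survivors))
    return (survivors if survivors is not None else set()), counts
```

The list of counts returned by the function plays the role of one row of Table 5: it can only decrease or stay flat, and a flat tail signals convergence.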

Table 5 : Remainder candidates after each run of Algorithm 1.

Behavior               1    2    3    4    5    6    7    8    9   10   11
Process Creation     474  311  308  303  300  295  294  281  277  276    -
File Creation        220  183  181  178  177  177  173  168  168  167    -
Registry Creation     83   70   64   62   58    -    -    -    -    -    -
Process Termination   42   38   35    -    -    -    -    -    -    -    -
File Deletion        104  100   99   99   98    -    -    -    -    -    -
Registry Deletion     86   67   67   66   62   60   55   51    -    -    -

In Table 6, the analysis results of Algorithm 2 and Algorithm 3 are listed. The set of the probe candidates discovered by Algorithm 1 is used as an input variable C to Algorithm 2 and Algorithm 3, that is, the numbers listed in the first row of Table 6 are identical to the final stable numbers in Table 5. The second row in Table 6 gives the total number of leading nodes generated by Algorithm 2. The third row gives the total number of dedicated nodes as the final output of ProbeBuilder. As shown, a large portion of candidates are again eliminated by Algorithm 3. The test on registry creation filtered 44 non-dedicated probe candidates, leaving 14 candidates as the final answer.

To understand the effectiveness of the refinement phase, the eliminated probe candidates are inspected. However, the massive amount of code and its assembly form make manual examination extremely difficult. To simplify the task, the non-dedicated probe candidates (code blocks) are mapped back to their owner functions with IDA Pro. Functions with human-readable names recognized by IDA Pro and WinDbg are then collected. The functionality of these “named” functions is manually checked on MSDN, looking for those dedicated to the subject behavior. Subroutines called by these dedicated functions are also considered as non-dedicated.

Table 6 : Remainder candidates after Algorithm 2 and Algorithm 3.

The results of the examination are shown in Table 7. For each subject behavior, a number of functions are manually confirmed to be non-dedicated. Their names are listed in the first column. Among the non-dedicated functions recognized by Algorithm 2 and Algorithm 3, the proportion p of these manually verified functions is listed in the second column. For instance, the test on file creation shows that 80% of the discovered non-dedicated probe candidates are manually verified. Tests on behaviors such as registry deletion and registry creation give a low proportion of successfully verified functions. However, the test investigates only the documented functions, which can be recognized by IDA Pro and WinDbg, and hence the number may be underestimated. Considering this fact, it is reasonable to conclude that this test has verified the effectiveness of Algorithm 2 and Algorithm 3.

Table 7 : Eliminated non-dedicated functions.

CcUnpinData, RtlSplay, CcPinRead, CcRemapBcb, SeDeassignSecurity, CcPinMappedData

In Table 8, a few concrete examples of generated probes and their data dereferences are shown. During the execution of Algorithm 1, the first 512 bytes of the data matched by the taint-based predicate P are captured. For conciseness, however, only the part before the terminating null character is listed. The first and second columns give the probe locations and the corresponding dereference paths, respectively. As expected, the large offsets in the dereference paths show the existence of huge data structures in the Windows kernel. The given SRCH_WIDTH: [512, 256, 128, 64, 512] may not be enough to cover all possible dereferences. Nevertheless, it is not necessary to identify all of them in this application.

Table 8 : Examples of probes, data dereferences, and data collected by Algorithm 1.

EIP          Dereference Path       Captured Data
Process Creation
0x804d9050   ESP, +20, +0           T_A_R_G_E_T.exe\0
0x804e447f   EAX, +184, +16, +0
0x804e917d   ESP, +20, +12, +0      T_A_R_G_E_T.exe\0
File Deletion

Note that our dummy program invokes the ANSI-string-based API. Yet, the corresponding Unicode strings encoded by the operating system are also identified. In addition, prefixed Unicode strings are captured as well in process creation and file deletion. The results demonstrate the use of the taint-based predicate.
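To make the dereference path notation of Table 8 concrete, the sketch below walks a path over a flat byte buffer. The interpretation is an assumption: each offset except the last is added to the current address and a 32-bit little-endian pointer is loaded from there, and the last offset locates the data itself. The function name `follow_path`, the flat `mem` buffer, and the 512-byte capture limit with truncation at the first null byte are illustrative choices, not ProbeBuilder code.

```python
import struct


def follow_path(mem, base, offsets, max_len=512):
    """Follow a dereference path such as (ESP, +20, +0) in a byte buffer.

    mem:     bytes or bytearray standing in for guest memory (hypothetical)
    base:    value of the starting register, used as an index into mem
    offsets: e.g. [20, 0]; all but the last trigger a pointer load
    Returns at most max_len bytes, cut at the first null byte.
    """
    addr = base
    for off in offsets[:-1]:
        # Load a 32-bit little-endian pointer from [addr + off].
        (addr,) = struct.unpack_from("<I", mem, addr + off)
    start = addr + offsets[-1]
    window = bytes(mem[start:start + max_len])
    return window.split(b"\0", 1)[0]
```

Under this reading, a path like ESP, +20, +12, +0 performs two pointer loads before the data is read, which is why longer paths reach deeper into nested kernel structures.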

5.2 Effectiveness of Generated Probes

To verify the effectiveness of the generated probes and dereferences, two experiments are performed.

5.2.1 Monitoring User-Space Activities

Since all the probes generated by ProbeBuilder are located in the OS kernel, they should produce profiles at least as complete as those of any user-level monitor. To verify this, for each behavior, three probes are randomly selected from the output of ProbeBuilder and manually implemented in another QEMU instance (without the functionality of ProbeBuilder). The same OS image is installed in that guest machine. Sysinternals Process Monitor v3.04 and API Monitor v2.0 from Rohitab are also installed. The system is then manually exercised for 30 minutes, producing more than one million activities recorded by the two commercial applications. The API trace logged by the probes of ProbeBuilder is compared with theirs.

The comparison shows that all the occurrences of these six behaviors reported by Process Monitor and API Monitor are also logged by the probes of ProbeBuilder. The API arguments are also correctly captured by the generated data dereferences. The probes of ProbeBuilder recorded more activities than the two user-space profiling tools, especially for the file creation behavior. A large portion of these extra records are confirmed to be kernel activities, as expected, since the probes reside in kernel space. However, 0.21% of the total recorded activities contain meaningless binary data and are considered false positives.
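The comparison between the probe trace and the reference traces can be expressed as a multiset difference. The sketch below is a simplified illustration: the function name `compare_logs` and the event representation as `(behavior, argument)` tuples are assumptions, and matching by timestamp or process is deliberately left out.

```python
from collections import Counter


def compare_logs(probe_events, reference_events):
    """Compare two event logs as multisets.

    Each event is assumed to be a hashable (behavior, argument) tuple.
    Returns (missed, extra): occurrences present only in the reference
    log, and occurrences present only in the probe log.
    """
    probe = Counter(probe_events)
    reference = Counter(reference_events)
    missed = reference - probe      # reported by the tools, not by the probes
    extra = probe - reference       # reported by the probes only
    return missed, extra
```

An empty `missed` result corresponds to the finding that every occurrence reported by the user-space tools is also logged by the probes, while `extra` holds the additional records that need manual inspection.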

5.2.2 Monitoring Kernel-Space Activities

The kernel-level probe shows its effectiveness against kernel activities. For evaluation, a kernel module is implemented to simulate a pure kernel-level Trojan. The following tasks are performed in sequence: 1) Establish a TCP connection with an HTTP server controlled by us. 2) Create a dummy file ProbeBuilderTest.txt at C:\. 3) Create a registry key ProbeBuilderRegistryInjection in the start-up program entries. Note that these tasks are executed purely in kernel space. This module is packed within a leading program in which the attached module is registered as a system service. The program is then profiled with the same QEMU instance used in the previous subsection. It is also uploaded to ThreatExpert and Anubis for comparison.

The result shows that only our profiling tool successfully captures all three activities. ThreatExpert identified only the created registry key, and Anubis logged only the TCP connection. (The user-space activities of the leading program are captured by all three platforms.) Since both ThreatExpert and Anubis captured at least one of the three kernel-level behaviors, the successful execution of the kernel module is confirmed. This experiment demonstrates not only the effectiveness of ProbeBuilder but also the necessity of kernel-level probes.

Please note that this result does not imply that the probes generated by ProbeBuilder are more effective than those in existing VMI-based systems. Given sufficient time, any experienced analyst can discover probe locations and data dereferences through reverse engineering. The contribution of ProbeBuilder is to automate these procedures in an effective way.

5.3 Performance

ProbeBuilder utilizes emulation to monitor the system state before each code block. Under the taint checking mode, additional taint analysis must be performed for each executed instruction. The data dereference analysis (Algorithm 1) runs about 30 times slower than the native machine. It takes 1~3 minutes to complete an upper-layer API invocation. The time required by the control flow analysis (Algorithm 2 and Algorithm 3) depends heavily on the number of probe candidates discovered by Algorithm 1, varying from 6 minutes (for process termination) to 167 minutes (for process creation). Note, however, that ProbeBuilder is designed to reduce the effort of manually building VMI tools. Compared with the enormous effort generally required to manually reverse-engineer a closed-source kernel image, the execution time needed by ProbeBuilder is trivial. In addition, the probes and the data dereferences generated by ProbeBuilder are meant to be transferred to systems implemented with faster emulation, virtualization, or even native machines, rather than run directly on ProbeBuilder itself. Consequently, the schemes proposed in this dissertation do not impose overhead on the final application system.

On the other hand, the performance of SWIFT should be carefully evaluated, since its contribution lies in the performance boost brought by the optimizations proposed in this dissertation. To evaluate the performance improvement attributed to these techniques, both a commercial test suite and common workloads are used to acquire benchmark scores for the following configurations.

(a) Native QEMU

(b) SWIFT (decoupled design)

(c) SWIFT w/ only OPT1 enabled

(d) SWIFT w/ both OPT1 and OPT2 enabled

(e) QEMU with inline taint propagation

(f) TEMU (Based on QEMU Ver. 0.9.1)

To set up a baseline for our evaluations, a native version of QEMU, on which our system is based, is tested in configuration (a). Note that neither KQEMU nor KVM was activated because we want to benchmark the performance of pure emulation. In (b), solely the decoupling mechanism is enabled so that its performance advantage can be measured. Configurations (c) and (d) operate with the decoupled design just as (b) does, yet with only OPT1 enabled and with both OPT1 and OPT2 enabled, respectively. To understand how much performance gain can be achieved with the proposed schemes, we also set up configuration (e), which inlines the taint propagation routines of SWIFT directly into the code blocks generated by the original QEMU. The comparison also includes TEMU, a well-known system-wide taint analysis system, as configuration (f).

A little more explanation is needed to elaborate the goal of this evaluation. For configurations (b), (c), (d), and (e), all data in guest memory or received from the network are labeled as tainted. Although this leads to considerable memory usage, since the accompanying shadow memory becomes as large as the RAM allocated to the guest machine, it is necessary for measuring the performance gain under the worst case. Moreover, it is difficult to compare TEMU with our scheme fairly and directly. TEMU is designed for extremely high flexibility, and it thus keeps a large taint record for each byte and many callbacks for additional plug-ins. These features inherently incur severe performance overhead in TEMU. It is also difficult to port its code into SWIFT for direct comparison because TEMU is based on an older version of QEMU. For these reasons, only the taint propagation of TEMU is activated and all other plug-ins are removed in configuration (f). In addition, no tainted data is introduced in configuration (f) in any of the experiments.

All the evaluations are performed on an IBM System x3650 with one Intel Xeon E5430 2.66 GHz quad-core processor, 8 GB of DDR2 RAM, and a 150 GB SATA-II hard disk installed. In each configuration, one identical virtual machine snapshot is loaded into the emulator to ensure fairness. The virtual machine is allocated 512 MB of RAM and a 10 GB hard disk. Windows XP with Service Pack 3 is installed and booted in the snapshot. In addition, 512 MB are allocated for the IF-code delivering circular queue.

To perform information flow tracking, SWIFT consumes more memory than the original emulator does. First of all, an extra 512 MB is allocated to construct the circular queue for IF-code delivering. In addition, each byte in the guest memory is augmented with an extra shadow byte to preserve its taint status. Since every byte in memory is labeled as tainted in our evaluation, the shadow memory is as large as the physical memory of the guest.

To enable OPT1, a shared 128 MB memory region is pre-allocated to store the IF-code blocks generated in the translation phase. We also force the emulator to flush all the code blocks when this region is full. Therefore, the total extra memory usage can be statically calculated as 512+512+128=1152 MB.
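The memory accounting above can be written out as a short calculation. This is a sketch under the assumptions stated in the text (one shadow byte per guest byte, a fixed queue, and a fixed IF-code region); the function name `swift_extra_memory_mb` and its parameter names are hypothetical.

```python
def swift_extra_memory_mb(guest_ram_mb, queue_mb=512, ifcode_region_mb=128):
    """Extra memory (in MB) used on top of the original emulator.

    guest_ram_mb:     RAM allocated to the guest; the shadow memory has
                      the same size because each guest byte gets one
                      shadow byte
    queue_mb:         circular queue for IF-code delivering
    ifcode_region_mb: shared region for IF-code blocks (OPT1)
    """
    shadow_mb = guest_ram_mb
    return queue_mb + shadow_mb + ifcode_region_mb
```

For the 512 MB guest used in this evaluation, `swift_extra_memory_mb(512)` reproduces the 1152 MB figure.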

The first result of the performance evaluation is acquired with PassMark Performance Test 6.0, an off-the-shelf commercial test suite adopted extensively in CPU and system benchmarking. Benchmark items can be categorized into CPU-intense jobs and memory-intense ones. To present the overhead imposed by each configuration more clearly, the benchmark scores of configurations (b), (c), (d), (e), and (f) are divided by the score of baseline configuration (a). As indicated in Figure 14, although the design with decoupled DIFT still imposes high overhead (1.43X ~ 5.00X) on all test items, it already outperforms configuration (e) significantly. When OPT1 and OPT2 are enabled, the overall performance degradation is reduced to 1.28X~3.16X, which is 2.74X~7.48X faster than the interleaved design (f).
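The normalization against the baseline can be sketched as below. The function name `normalize_scores` and the dictionary layout keyed by configuration letter are assumptions for illustration, and no real benchmark numbers are used. The text does not state whether a PassMark score is time-like or throughput-like, so whether this ratio or its reciprocal expresses the slowdown is left open here.

```python
def normalize_scores(scores, baseline="a"):
    """Divide every non-baseline score by the baseline score.

    scores:   mapping from configuration name to benchmark score
    baseline: key of the baseline configuration (native QEMU in the text)
    Returns a mapping from configuration name to score / baseline score.
    """
    base = scores[baseline]
    return {name: value / base
            for name, value in scores.items()
            if name != baseline}
```

Normalizing each benchmark item separately keeps CPU-intense and memory-intense items comparable on one chart even though their raw scores differ in scale.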

The result demonstrates the effectiveness of the optimizations proposed in this dissertation. In addition, the close scores of (e) and (f) give us confidence in the representativeness of configuration (f).

Figure 14 also reveals an interesting fact: memory-intense benchmark items benefit a lot from OPT2, but no significant improvement is shown on CPU-intense ones.

After analyzing instruction traces of those experiments, it is discovered that EBP-based memory accesses in the benchmark program occur less frequently than expected. In such cases, the overhead of delivering memory addresses cannot be effectively removed by the optimization.

Figure 14 : Overhead imposed by different configurations. To present the overhead imposed by each configuration more clearly, benchmark scores of configurations (b), (c), (d), and (e) are divided by the score of baseline configuration (a).

Next, the same configurations are benchmarked with common workloads such as file transfer and source code compilation to further investigate the analysis overhead in real applications. Details of these workloads are explained below. Each measurement is repeated a number of times and the average value is calculated. The number of repetitions of each item is listed after the name of the workload.

System Booting (50)

The time needed for booting Windows XP is measured. More precisely, the measurement covers the time elapsed from powering on the emulator until Windows loads Graphical Identification and Authentication (GINA), which brings up the Windows Security dialog for users to log on. This point is chosen as the end of the measurement because the loading of GINA indicates that the boot sequence has completed.

Web Browsing (50)

Since web browsing is an extremely frequent user behavior and a common way to get attacked by malware, this item is included in the benchmark to investigate how our implementation affects browsing speed. The experiment is carried out by measuring the time needed to sequentially browse the top 50 websites ranked by Alexa Internet, an authoritative Internet information provider. The sequential browsing mechanism is implemented with a Firefox plugin, which automatically visits the next website once it receives a page-load-complete event.

Communication over SCP (20)

In this benchmark, a large file is downloaded into the emulator through Secure Copy, a file transfer mechanism based on the SSH protocol that provides confidentiality and authentication. The file consists of a 120 MB random binary sequence and resides on a host located in the same 100BASE-TX local area network. The benchmark is performed with PuTTY SCP, a Win32 implementation of the protocol. The time needed to complete the following command is measured.
