IV. System Design of SWIFT
4.3 Optimization
Although the approaches proposed so far successfully decouple the execution of system emulation from the analysis task, the en-queuing operations still degrade performance in three ways. First, the extra instructions injected for en-queuing inherently introduce latency in both the emulation process and the helper thread. Second, the massive number of memory accesses to the queue can consume hardware cache or cause cache misses through producer-consumer communication. Third, the more data is en-queued, the faster the circular queue becomes saturated.
Therefore, reducing en-queuing operations can accelerate emulation in multiple ways. Two optimizations are proposed to aggressively remove them.
4.3.1 OPT1 : Delayed-Delivering on a Per-Block Basis
Avoiding frequent en-queuing of IF-codes can bring substantial performance improvement because they form the major part of the messages delivered to the helper thread. One possible optimization is to deliver IF-codes on a per-block basis. While translating a basic block, the translator groups all generated IF-codes into a special entity called the IF-code block. In the IF-code block, IF-codes are stored in the same order as their corresponding instructions. Instead of delivering IF-codes between every pair of emulated instructions, code is injected only at the beginning of the code block to inform the helper thread which code block is being emulated, so that the correct IF-code block can be traced. In this way, only one en-queuing operation is needed to deliver all IF-codes for a whole code block.
However, the above optimization does not always reflect the correct information flows. Consider the third emulated instruction of the example basic block listed in Figure 8. Since the “mov EAX, [EBP+8]” instruction accesses a memory address indirectly, it can lead to a page fault exception if the EBP register contains an inappropriate value. To emulate such behavior correctly, a code block must exit when things go wrong. Therefore, only the first two IF-codes would be effective in such a case because the remaining instructions have not been emulated. Since an exception is unpredictable, the number of effective IF-codes may vary from one execution of the code block to the next.
To amend the problem, a counter is added to accumulate the number of instructions emulated in each code block. Between emulated instructions, code is injected to increment the counter so that it reflects the number of instructions emulated so far. The accumulation continues until the code block exits. The value of the counter is delivered to the helper thread at the very beginning of the next code block. An example of this amendment is shown in Figure 12. At the beginning of every code block, the counter is delivered as the IF-code accumulation of the previous code block, together with the sequence number of the current code block. IF-codes are no longer delivered during the execution of code blocks. Instead, they are delivered only once, by the binary translator, on a per-block basis. After that, only the counter is delivered, so that the IF-code interpreter can decide how many IF-codes should be tracked for each execution of a code block. Communication between the two threads is thereby reduced. In addition, the counter increment can be realized with the following instruction, which is far more concise than the previous IF-code en-queuing routine.
inc dword ptr [LOC_counter]
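The helper-side bookkeeping implied by this scheme can be illustrated with a minimal sketch. It assumes that each queue message is a pair (counter of the previous block, sequence number of the current block) and that IF-codes can be represented as opaque strings; the class name, method names, and sample IF-codes are hypothetical and serve only as illustration, not as the actual SWIFT implementation.

```python
from collections import deque


class Helper:
    """Hypothetical consumer side of OPT1: replays IF-codes block by block."""

    def __init__(self):
        self.if_code_blocks = {}  # sequence number -> IF-codes, stored once at translation time
        self.current_seq = None   # code block currently being emulated
        self.tracked = []         # IF-codes interpreted so far, in order

    def register_block(self, seq, if_codes):
        # Called once per translated basic block (the per-block delivery).
        self.if_code_blocks[seq] = list(if_codes)

    def on_block_entry(self, prev_count, seq):
        # prev_count is the counter value left behind by the previous block,
        # i.e. how many of its instructions were emulated before it exited.
        if self.current_seq is not None:
            effective = self.if_code_blocks[self.current_seq][:prev_count]
            self.tracked.extend(effective)
        self.current_seq = seq


helper = Helper()
helper.register_block(1, ["if_a", "if_b", "if_c", "if_d"])
helper.register_block(2, ["if_x", "if_y"])

queue = deque()
queue.append((0, 1))  # enter block 1; no block ran before it
queue.append((2, 2))  # block 1 exited on its 3rd instruction, so its counter is 2
queue.append((2, 1))  # block 2 ran to completion, so its counter is 2

while queue:
    helper.on_block_entry(*queue.popleft())
```

In this sketch only the first two IF-codes of block 1 are tracked, which mirrors the page fault scenario described above: the IF-codes of instructions that were never emulated are skipped.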
Figure 12 : An improved system design with OPT1. Communications between the two threads are reduced since IF-codes are no longer delivered during the execution of code blocks.
OPT1 eliminates en-queuing operations effectively based on the observation that once a basic block is generated, its IF-codes are fixed as well. However, OPT1 cannot entirely eliminate the slow, per-code delivering mechanism. Even without exceptions, the total number of executed instructions of a code block can still be unpredictable if it contains conditionally executed instructions. The CMOV instruction of the IA-32 ISA is one such case. When the emulator translates these instructions, it falls back to the slow delivering mode. Fortunately, they are seldom used in ordinary programs, and hence their presence does not impede the acceleration of OPT1.
On the other hand, a large number of en-queuing operations are still needed to pass physical memory addresses. As stated earlier, these physical memory addresses can only be observed while the code block is being executed. To reduce the overhead incurred by memory address delivery, another optimization is proposed below.
4.3.2 OPT2 : Stack-based Indirect Accessing
The second optimization relies on several phenomena observed on the frame pointer register and the stack pointer register, namely EBP and ESP of the IA-32 architecture. By the conventions of common compilers, EBP stores the beginning address of an activation record and ESP stores the top of the stack. In addition, their values usually change only when a new activation record is created. Referencing memory indirectly through these two registers is a common way to access local variables and function arguments. The third emulated instruction “mov EAX, [EBP+8]” listed in Figure 12 is a representative instance. Moreover, in more than 90 percent of EBP- or ESP-based memory accesses, the offsets fall within the range from -1024 to +512 bytes. This clustering is understandable since local variables and arguments are usually located near the beginning of the frame and occupy little space.
Figure 13 depicts the three scenarios that can occur when offsets within the above range are added to EBP. It is easy to see that at most two pages can be touched by offsets in this range. Note that although the two pages are contiguous in the virtual address space, their physical locations may not be, due to the virtual memory mapping. Of the two physical pages, the page pointed to by EBP is referred to as the base page and the other one as the siding page. The physical address generated by EBP-based addressing with such offsets can be acquired using Algorithm 4. Note that the discussion above also applies to ESP-based indirect addressing.
Using the algorithm, the helper thread can calculate the physical address generated by an EBP (or ESP)-based addressing instruction as long as the instruction has an offset within this range.
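One way such a calculation could look is sketched below. The sketch assumes 4 KiB pages, page-aligned values for paddr_base and paddr_siding, and 32-bit wraparound of virtual addresses; the function name and the example addresses are hypothetical, and the code is an illustration of the idea rather than a reproduction of Algorithm 4.

```python
PAGE_SIZE = 4096           # assumption: 4 KiB IA-32 pages
ADDRESS_MASK = 0xFFFFFFFF  # assumption: 32-bit virtual addresses


def physical_address(offset, reg_value, paddr_base, paddr_siding):
    """Map an EBP/ESP-relative access to a physical address.

    offset       -- displacement encoded in the instruction (-1024..+512)
    reg_value    -- current value of EBP or ESP
    paddr_base   -- physical address of the page the register points into
    paddr_siding -- physical address of the neighboring page that the
                    offset range can reach
    """
    vaddr = (reg_value + offset) & ADDRESS_MASK
    in_page = vaddr % PAGE_SIZE
    if vaddr // PAGE_SIZE == reg_value // PAGE_SIZE:
        # The access stays inside the page the register points into.
        return paddr_base + in_page
    # The access crossed a page boundary and lands in the siding page.
    return paddr_siding + in_page


# Hypothetical values: EBP points 0x10 bytes into a virtual page.
same_page = physical_address(8, 0x0012F010, 0x7A000, 0x33000)
crossed = physical_address(-1024, 0x0012F010, 0x7A000, 0x33000)
```

Because the offset range spans less than one page, a given register value can reach at most one neighboring page, which is why a single siding page suffices in this sketch.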
Figure 13 : Scenarios of EBP-based memory addressing with offsets within the range -1024~+512.
Algorithm 4 : Calculation of the physical address generated by EBP-based accessing.
Input: offset encoded in the EBP-based accessing instruction.
Input: current value of the register.
Input: physical address of the base page (paddr_base).
Input: physical address of the siding page (paddr_siding).
Output: physical address accessed by this instruction.
Let us consider the four inputs needed for the algorithm. Since the offset is encoded in the instruction, it can be determined in the translation phase and stored as part of the IF-code. This observation saves us from en-queuing a memory address for each EBP (or ESP)-based memory access. However, the benefit comes with the trade-off that the helper must possess the correct paddr_base and paddr_siding for the calculation. This is done by having the emulator perform virtual address translation and deliver the translated addresses to the helper every time EBP or ESP is modified. However, their values can sometimes change so frequently that the cost of address translation attenuates the benefit.
To resolve this limitation, the following technique is used. Note that in most cases, both EBP and ESP point to locations inside the stack segment (if the program has one). Therefore, an out-of-box hook is implemented on the part in charge of segment allocation to acquire the physical pages mapped to the pages of the stack segment. These physical pages are specially labeled so that we can identify whether EBP or ESP points to a labeled page when its value is modified. The addresses of all these physical pages are delivered to the helper thread only at the infrequent context switches or user/kernel mode switches. As a result, the helper can be informed of paddr_base and paddr_siding without requiring the emulator to perform address translation every time EBP or ESP is modified. If EBP or ESP points to an unlabeled page, the delivering operation automatically falls back to the slower translate-then-deliver mode upon each EBP or ESP modification.
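The fast path and the fallback can be contrasted with a small sketch. It assumes, purely for illustration, that the labeled pages are available as a mapping from virtual page number to physical page address and that the slow path is a callable performing the translation; neither the data layout nor the names below come from the SWIFT implementation.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def resolve_base_page(reg_value, labeled_pages, translate):
    """Return (physical page address, used_fallback) for a new EBP/ESP value.

    labeled_pages -- hypothetical mapping: virtual page number -> physical
                     page address, refreshed only at context or mode switches
    translate     -- hypothetical slow path: virtual page number -> physical
                     page address via a full address translation
    """
    vpn = reg_value // PAGE_SIZE
    if vpn in labeled_pages:
        # Labeled stack page: no address translation is needed.
        return labeled_pages[vpn], False
    # Unlabeled page: fall back to translate-then-deliver.
    return translate(vpn), True


labeled = {0x12F: 0x7A000, 0x12E: 0x33000}
translated = []  # records which pages needed the slow path


def slow_translate(vpn):
    translated.append(vpn)
    return 0x90000  # placeholder physical page for the example


fast = resolve_base_page(0x0012F010, labeled, slow_translate)
slow = resolve_base_page(0x00400000, labeled, slow_translate)
```

Only the second lookup reaches the slow path here, which reflects the intent of the technique: translation cost is paid only when the register leaves the labeled stack pages.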