Runtime Optimizations - A Retargetable and High-Performance Dynamic Binary Translator Framework

We can also integrate runtime optimizations that manipulate LLVM IR to further improve performance in the LnQ framework. These optimizations should be retargetable, that is, independent of both the guest ISA and the host ISA. Retargetability is an important goal of the LnQ framework, and we expect these optimization techniques to be reusable in binary translators for any ISA.

To demonstrate this ability, we implemented three classic optimizations that reduce the frequency of context switches between the emulation engine and the code cache.

The three optimizations are block linking, indirect branch target caching, and shadow stack. We describe each of them in the following sections.

2.2.1 Block Linking

Block linking [7] links translated blocks in the code cache so that execution transfers directly from one code block to another, eliminating expensive context switches. Block linking targets blocks whose last instruction is a direct branch, a conditional direct branch, or a direct call. We use the term exit to refer to such a jump or call instruction at the end of a block. Each exit has a unique exit ID, and the destination of an exit is referred to as its branch target.

Block linking can be either proactive or lazy. Proactive block linking links translated basic blocks whenever a new block is generated. Lazy block linking links translated basic blocks when a context switch occurs.

We choose lazy block linking for two reasons. First, lazy block linking links translated basic blocks only when execution actually goes from one block to another. Second, when a new block is generated, proactive block linking must update the branch targets of all exits that go to the newly generated block; it therefore needs a data structure that maps the branch addresses of exits to the starting addresses of translated basic blocks. In contrast, lazy block linking does not require this extra data structure and incurs no extra maintenance overhead. Further comparisons between proactive and lazy block linking can be found in [8, 39].

We implement block linking as follows. Before an execution thread leaves a block, it stores the exit ID of the block in a specified thread-private memory location. Then the emulation engine locates the translated basic block pointed to by the branch target, retrieves the exit ID that was just stored, and uses this exit ID to look into an exit-ID-to-address mapping table that maps each exit ID to the address of the exit in the code cache.

[Figure 2.5 labels: Last Block and Current Block in the Code Cache; Exit-ID-To-Address Table (example entry: #8 -> 0xXXYYZZ); steps: 1. Save exit ID, 2. Lookup table, 3. Find address, 4. Insert jmp.]

Figure 2.5: Illustration of execution steps of block linking optimization.

Entries of the exit-ID-to-address table are added when the LLVM JIT generates host instructions for exits. The emulation engine can then patch the branch target of the exit of the block that the execution thread just left. Figure 2.5 illustrates our implementation.
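The steps of lazy block linking can be sketched in Python as follows. This is a minimal model under stated assumptions, not the LnQ implementation: the names `Exit`, `Block`, and `Engine` are hypothetical, each block is given a single direct-branch exit, and patching is modeled by storing a reference to the target block rather than by rewriting a host jump instruction.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(eq=False)
class Exit:
    exit_id: int                      # unique exit ID
    guest_target: int                 # guest address of the branch target
    linked: Optional["Block"] = None  # None: the exit still returns to the engine


@dataclass(eq=False)
class Block:
    guest_addr: int
    exit: Exit                        # assumption: one direct-branch exit per block


class Engine:
    """Hypothetical emulation engine that links blocks lazily."""

    def __init__(self, program: Dict[int, int]):
        self.program = program                    # guest block address -> guest branch target
        self.code_cache: Dict[int, Block] = {}    # guest address -> translated block
        self.exit_table: Dict[int, Exit] = {}     # exit-ID-to-address table
        self.saved_exit_id: Optional[int] = None  # models the thread-private slot
        self.context_switches = 0

    def translate(self, guest_addr: int) -> Block:
        # The table entry is added when the exit is generated.
        exit_id = len(self.exit_table)
        ex = Exit(exit_id, self.program[guest_addr])
        self.exit_table[exit_id] = ex
        block = Block(guest_addr, ex)
        self.code_cache[guest_addr] = block
        return block

    def find_or_translate(self, guest_addr: int) -> Block:
        block = self.code_cache.get(guest_addr)
        return block if block is not None else self.translate(guest_addr)

    def run(self, entry: int, steps: int) -> Block:
        block = self.find_or_translate(entry)
        for _ in range(steps):
            ex = block.exit
            if ex.linked is not None:
                block = ex.linked                 # direct transfer inside the code cache
                continue
            self.saved_exit_id = ex.exit_id       # step 1: save the exit ID
            self.context_switches += 1            # leave the code cache
            target = self.find_or_translate(ex.guest_target)
            last_exit = self.exit_table[self.saved_exit_id]  # steps 2 and 3: look up the exit
            last_exit.linked = target             # step 4: patch the exit
            block = target
        return block
```

In this model a two-block guest loop returns to the emulation engine only once per exit; every later iteration follows the patched links.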

2.2.2 Indirect Branch Target Caching

To further reduce context switches, we use indirect branch target caching (IBTC) to quickly look up whether the target of an indirect branch has a translated basic block in the code cache, without returning to the emulation engine. We add an IBTC lookup function call at the end of each block that ends with an indirect jump or indirect call instruction.

The IBTC contains a shared cache, as shown in Figure 2.6. The shared cache is implemented as a direct-access hash table with 1K entries that caches all indirect branches. Given a guest target address, we use its last 10 bits as the key to index the shared cache and compare the guest address stored in that entry with the target. If the comparison succeeds, control transfers to the memory location of the translated basic block (TBB). Otherwise, control transfers back to the emulation engine, which updates the shared cache after locating the TBB.
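A direct-mapped table of this kind can be sketched in Python as follows. The class `IBTC`, its method names, and the choice to store the full guest address as the tag that is compared are assumptions made for illustration; the text specifies only the 1K-entry size and the 10-bit index.

```python
from typing import List, Optional

IBTC_BITS = 10
IBTC_SIZE = 1 << IBTC_BITS   # 1K entries
IBTC_MASK = IBTC_SIZE - 1    # selects the last 10 bits of an address


class IBTC:
    """Hypothetical shared cache mapping guest targets to host addresses."""

    def __init__(self) -> None:
        self.guest: List[Optional[int]] = [None] * IBTC_SIZE  # tag: guest target address
        self.host: List[Optional[int]] = [None] * IBTC_SIZE   # address of the translated block

    def lookup(self, guest_target: int) -> Optional[int]:
        """Return the host address on a hit, or None to fall back to the engine."""
        i = guest_target & IBTC_MASK
        if self.guest[i] == guest_target:   # the comparison succeeds
            return self.host[i]
        return None

    def update(self, guest_target: int, host_addr: int) -> None:
        """Called by the engine after it locates the translated block."""
        i = guest_target & IBTC_MASK
        self.guest[i] = guest_target        # overwrites any entry sharing this index
        self.host[i] = host_addr
```

Because the table is direct-mapped, two guest targets that share their last 10 bits evict each other, so a miss does not imply that the target is untranslated; the engine still has to check the code cache.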


Figure 2.6: Illustration of execution steps of indirect branch target caching optimization.

2.2.3 Shadow Stack

Shadow stack (SS) optimizes the function return mechanism in binary translation and was first introduced in FX!32 [14]. Although a function return instruction can be viewed as an indirect branch, it can be optimized without an indirect branch cache lookup. This is because the guest return address is known when the translator translates a guest call instruction, since the call pushes the guest return address onto the stack.

If the translated basic block of the guest return address exists in the code cache, the translator can push the memory location of the translated basic block onto a shadow stack, from which we can fetch the host return address when the function ends, without looking up the indirect branch cache or going back to the emulation engine. However, if the block at the guest return address is not yet translated, we push the address that goes back to the emulation engine.

The details of our implementation of the shadow stack are as follows. We assign a memory location, called an address box, to each guest call instruction; it stores either the host return address or the return address back to the emulation engine. If the guest return address does not yet have a translated basic block in the code cache, we mark the address box as untranslated and store the return address back to the emulation engine in it. Then, when the translator generates a translated basic block, it checks whether there is an address box that should store the address of this translated block but is marked untranslated. If such an address box is found, the translator marks it as translated and stores the address of the translated basic block in it. For each guest call instruction, we insert instructions that push the content of its address box onto the top of the shadow stack. Please refer to Figure 2.7 for an illustration.
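The address box bookkeeping described above can be sketched in Python as follows. The names `AddressBox`, `Translator`, `translate_call`, `on_block_translated`, and the `ENGINE_ENTRY` sentinel are hypothetical, and the `pending` dictionary is only one assumed way to find untranslated boxes; the text does not say how the translator locates them.

```python
from typing import Dict, List

ENGINE_ENTRY = -1  # assumption: stands for the return address back to the emulation engine


class AddressBox:
    """One box per guest call instruction."""

    def __init__(self, guest_return_addr: int) -> None:
        self.guest_return_addr = guest_return_addr
        self.translated = False
        self.address = ENGINE_ENTRY   # until the return block is translated


class Translator:
    def __init__(self) -> None:
        self.code_cache: Dict[int, int] = {}             # guest address -> host address
        self.pending: Dict[int, List[AddressBox]] = {}   # untranslated boxes by guest address

    def translate_call(self, guest_return_addr: int) -> AddressBox:
        """Create the address box for a guest call instruction."""
        box = AddressBox(guest_return_addr)
        host = self.code_cache.get(guest_return_addr)
        if host is not None:
            box.translated = True
            box.address = host
        else:
            # Mark as untranslated; the box keeps the address back to the engine.
            self.pending.setdefault(guest_return_addr, []).append(box)
        return box

    def on_block_translated(self, guest_addr: int, host_addr: int) -> None:
        """Patch every untranslated box that was waiting for this block."""
        self.code_cache[guest_addr] = host_addr
        for box in self.pending.pop(guest_addr, []):
            box.translated = True
            box.address = host_addr
```

The translated call then pushes `box.address` onto the shadow stack at run time, so a box patched after the call was translated is still picked up by later executions of that call.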

For each guest return instruction, we insert instructions that pop the shadow stack at the end of the translated block. Note that we need to check whether the guest return address matches the one stored on top of the shadow stack. If they match, execution transfers directly to the address popped from the shadow stack. If they do not match, we flush the shadow stack, since it is no longer valid, and execution transfers back to the emulation engine.
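The return-side check can be sketched in Python as follows. The sketch assumes that each shadow stack entry is a pair of guest return address and host return address, so that the check has something to compare against; the class `ShadowStack`, its method names, and the `ENGINE_ENTRY` sentinel are hypothetical.

```python
from typing import List, Tuple

ENGINE_ENTRY = -1  # assumption: stands for "transfer back to the emulation engine"


class ShadowStack:
    def __init__(self) -> None:
        self.entries: List[Tuple[int, int]] = []  # (guest return address, host return address)

    def push(self, guest_return_addr: int, host_return_addr: int) -> None:
        """Executed by the translated code of a guest call."""
        self.entries.append((guest_return_addr, host_return_addr))

    def on_return(self, actual_guest_return_addr: int) -> int:
        """Executed by the translated code of a guest return.

        Returns the host address to jump to, or ENGINE_ENTRY on a mismatch.
        """
        if self.entries:
            guest_addr, host_addr = self.entries.pop()
            if guest_addr == actual_guest_return_addr:
                return host_addr      # matched: jump directly to the popped address
        # Mismatch or empty stack: the shadow stack is no longer valid, so flush it.
        self.entries.clear()
        return ENGINE_ENTRY
```

A mismatch arises whenever the guest does not return to the address its call pushed, for example after a `longjmp`-style transfer; flushing restores correctness at the cost of a few returns that go through the engine.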

[Figure 2.7 labels: translated basic blocks containing Call and Ret; Shadow Stack with Push and Pop; Address Box holding the host return address; arrows: read host return address, jump to.]

Figure 2.7: Illustration of execution steps of shadow stack optimization.
