5.3.1 Method-Based Language Virtual Machines
Region expansion is widely used in method-based JIT systems, e.g., the HotSpot Java VM [57]. These JIT systems compile methods as follows. When a method-based JIT system compiles a method for the first time, it compiles only those basic blocks whose execution counts exceeded a threshold during interpretation. If execution frequently leaves a region through side exits, the JIT system expands the region to include the basic blocks that are the destinations of these side exits.
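The two steps described above can be sketched as follows. This is only a minimal illustration and not the implementation of HotSpot or any other JIT system: the names `Region`, `form_initial_region`, `record_side_exit`, and `expand_region`, as well as both threshold values, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; real JIT systems choose and tune their own values.
HOT_BLOCK_THRESHOLD = 3
SIDE_EXIT_THRESHOLD = 2


@dataclass
class Region:
    blocks: set = field(default_factory=set)
    # Destination block outside the region -> number of times it was reached.
    side_exit_counts: dict = field(default_factory=dict)


def form_initial_region(exec_counts, threshold=HOT_BLOCK_THRESHOLD):
    """Keep only blocks whose interpreter execution count exceeds the threshold."""
    return Region(blocks={b for b, n in exec_counts.items() if n > threshold})


def record_side_exit(region, target):
    """Count one departure from the region to a block that is outside it."""
    if target not in region.blocks:
        region.side_exit_counts[target] = region.side_exit_counts.get(target, 0) + 1


def expand_region(region, threshold=SIDE_EXIT_THRESHOLD):
    """Add side-exit destinations reached more than `threshold` times."""
    hot_targets = {t for t, n in region.side_exit_counts.items() if n > threshold}
    region.blocks |= hot_targets
    for t in hot_targets:
        del region.side_exit_counts[t]
    return hot_targets
```

In this sketch, a block that was cold during interpretation stays outside the region until enough side exits reach it, at which point `expand_region` adds it to the region.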
EEG and method-based JIT systems use similar heuristics to decide when to expand regions during the second stage of region expansion, but they differ considerably in the first and third stages, both in motivation and in the type of blocks they merge.
The major difference between EEG and those systems in the first stage is the motivation for forming the initial regions. EEG uses traces as initial regions for two reasons. First, traces represent frequently executed paths that may span several methods. Second, it takes less time to optimize traces because of their simple control flow graphs and small numbers of basic blocks. For example, we found only 4.2 blocks per trace in EEG. Method-based JIT systems, on the other hand, build initial regions by selecting blocks from hot methods and excluding blocks that are rarely executed. For example, the HotSpot JVM excludes blocks that are never executed during interpretation.
The major difference between EEG and method-based JIT systems in the third stage is the type of blocks they merge. In the third stage, EEG merges traces that contain frequently executed paths. In contrast, method-based JIT systems merge only blocks that were rarely executed in the first stage, since the frequently executed blocks have already been merged in the first stage.
Another difference between method-based JIT and this dissertation is that we integrate trace-guided optimization into procedure-based DBT.
Although procedure-based JIT compilation has been widely studied in language virtual machines [36, 58], it has not been studied adequately outside of the JVM community.
Suganuma et al. [36] investigate how to use region-based compilation to improve the performance of method-based Java just-in-time compilation. They use region-based compilation to partially inline procedures instead of using traditional method inlining techniques. They collect execution counts of basic blocks in order to understand program runtime behavior, and they apply static code analysis to the Java bytecode to identify rarely executed code blocks, such as those that handle exceptions. They use this information to identify and optimize only the frequently executed code blocks, without optimizing the entire method.
In our case, it is difficult to identify rarely executed regions by static code analysis, as they did for Java bytecode. Therefore, we cannot apply their approach in our system.
5.3.2 Trace-Based Language Virtual Machines
Trace-based compilation has gained popularity in dynamic scripting languages [32, 59] and high-level language virtual machines [34, 35, 54, 60] in recent years. Wu et al. [54] and Inoue et al. [34, 60] investigate the performance of several variations of NET on trace-based Java virtual machines.
Gal et al. [32] propose merging loop traces into a trace tree. Their approach requires adding annotations while compiling JavaScript into bytecode, and thus cannot be applied in our case.
In contrast, EEG merges delinquent traces/regions, which are not necessarily loop traces. EEG uses hardware monitoring to identify frequently executed traces, then determines whether they have many side exits, and finally merges those frequently executed regions that have many side exits in order to avoid early exits from a region. EEG also uses a spill index to avoid generating regions that may degrade performance.
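The decision flow described above can be illustrated with a small sketch. It is written under several assumptions and is not the EEG implementation: the `Trace` fields, the thresholds, and in particular the `spill_index` formula (combined live values divided by the number of available registers) are hypothetical stand-ins chosen only to show where each check would sit.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    name: str
    samples: int      # hardware-monitor samples attributed to the trace (assumed metric)
    live_values: int  # hypothetical register-pressure estimate for the trace
    early_exits: dict = field(default_factory=dict)  # target trace name -> exit count


def is_delinquent(trace, hot_samples, exit_ratio):
    """Treat a trace as delinquent if it is hot and is often left early."""
    if trace.samples < hot_samples or not trace.early_exits:
        return False
    return sum(trace.early_exits.values()) >= exit_ratio * trace.samples


def spill_index(traces, num_registers):
    """Hypothetical stand-in: combined live values relative to available registers."""
    return sum(t.live_values for t in traces) / num_registers


def select_merge(trace, all_traces, hot_samples=100, exit_ratio=0.5,
                 num_registers=16, max_spill_index=1.0):
    """Return the traces to merge into one region, or [trace] if merging is rejected."""
    if not is_delinquent(trace, hot_samples, exit_ratio):
        return [trace]
    # Pick the most frequently taken early-exit destination as the merge partner.
    target_name = max(trace.early_exits, key=trace.early_exits.get)
    target = all_traces.get(target_name)
    if target is None:
        return [trace]  # destination has no recorded trace; nothing to merge with
    candidate = [trace, target]
    if spill_index(candidate, num_registers) > max_spill_index:
        return [trace]  # merged region is estimated to spill; keep traces separate
    return candidate
```

The spill check runs last so that a merge which would otherwise remove early exits is still rejected when the merged region is estimated to need more registers than are available.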
Chapter 6
Conclusion and Future Works
6.1 Conclusion
The performance of dynamic binary translators is determined by the quality of the translated code and the ability to detect hot regions at runtime. Retargetability also reduces the effort of reimplementing high-performance dynamic binary translators. In Chapter 2, this dissertation presents the design and implementation of the LLVM+QEMU (LnQ) framework, which can be used to build retargetable, high-performance, multi-threaded trace-based and procedure-based dynamic binary translators. We show how to use the LLVM compiler infrastructure to design the IR library and the IR translator in the translation module of LnQ. We also show how we perform register mapping to remove redundant loads and stores of guest architecture states, and how to implement dynamic binary translation optimizations in LnQ that can be used in all translators built from LnQ.
In Chapter 3, this dissertation has identified and quantified the delinquent trace problem in the popular Next-Executing-Tail (NET) trace formation algorithm.
Delinquent traces contain frequently taken early exits, which cause significant overhead and limit the performance of dynamic binary translators. Motivated by this problem, we develop a lightweight region formation strategy called Early-Exit Guided region formation (EEG) to improve the performance of NET by merging delinquent traces into larger code regions. EEG region formation improves not only NET but also any other trace formation approach whose traces suffer from the inefficiency of early exits. We also show how to offload compilation overhead to other CPUs to fully utilize the computation power of multi-core CPUs.
In this dissertation, we further study region formation at a larger granularity: a whole procedure. In Chapter 4, we present a trace-guided procedure-based region formation algorithm for dynamic binary translation. We describe how to solve the Call-Return problem and other issues that arise only in dynamic binary translation.
A prototype was built to study various design issues. The experimental results show that, on the SPEC 2006 integer benchmark, the procedure-based approach incurs on average a 2.0X, 2.95X, and 1.92X slowdown compared to the native run on the X86-to-X86_64, ARM-to-X86_64, and X86-to-ARM LnQ DBTs, respectively.