In computer engineering, new ideas are often evaluated and verified by simulation before they are actually implemented. A micro-architecture simulator models the detailed implementation of a computer system, such as the processor, the memory hierarchy, and sometimes the memory interconnections and buses. In micro-architecture design, simulations are frequently used to predict performance, and tradeoffs are then made to fine-tune the design. This process has been very effective in reducing design cost and in increasing the reliability and performance of the implementation. In micro-architecture simulations, performance-critical components such as caches, register modules, out-of-order instruction issue mechanisms, re-order buffers, branch predictors, and load/store buffers are often modeled in detail. Usually, the more implementation details are modeled, the more accurate the predictions become, but the slower the simulation runs. Among all micro-architecture modules, the cache hierarchy is one of the most important, since it is not only performance critical but is also invoked by every memory reference instruction during program execution.
Traditional cache simulations are usually based on interpretation: each instruction is interpreted, and the memory operation involved goes through the cache hierarchy simulation. This commonly used approach is straightforward but very slow. Over the years, JIT (Just-In-Time translation) techniques [1] have been used to speed up the interpretation of bytecode, and DBT (Dynamic Binary Translation) techniques have been adopted to speed up the interpretation of executables [2]. It is natural to ask why such techniques have not been used to speed up micro-architecture simulations.
This thesis investigates how DBT techniques may be used for fast micro-architecture simulations. We start from cache simulations, but the idea could be extended to other micro-architecture simulations such as pipeline simulations and load/store buffer simulations. Although DBT is not a new technology, it has not been used effectively in micro-architecture simulations. DBT has been applied quite successfully to functional simulation; for example, most high-performance functional simulators, such as QEMU, Simics, and Shade, are based on DBT.
DBT turns interpretation into native code execution. For example, instead of interpreting an “ADD” instruction, DBT translates it into an equivalent native add instruction on the host machine, and the “ADD” instruction is then emulated by executing the translated native instruction rather than by the slow process of interpretation. Translation needs to be done only once, since the translated code is stored in a code cache so that subsequent executions of the same code sequence can go directly to the translated code. However, when DBT is applied to micro-architecture simulations, one challenge emerges.
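The translate-once behaviour of the code cache described above can be illustrated with a minimal sketch in C. It is not taken from any existing translator: the names code_cache, translate, and execute are hypothetical, and host function pointers stand in for the native code that a real DBT would emit into the code cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a translated native code fragment.
 * A real DBT would emit host machine code instead. */
typedef int32_t (*host_fn)(int32_t, int32_t);

enum guest_op { GUEST_ADD, GUEST_MUL };

static int32_t host_add(int32_t a, int32_t b) { return a + b; }
static int32_t host_mul(int32_t a, int32_t b) { return a * b; }

#define CODE_CACHE_SLOTS 64u   /* assumed size, for illustration only */

typedef struct {
    uint32_t      guest_pc[CODE_CACHE_SLOTS]; /* guest address of each entry */
    host_fn       code[CODE_CACHE_SLOTS];     /* "translated" code */
    int           valid[CODE_CACHE_SLOTS];
    unsigned long translations;               /* number of translations done */
} code_cache;

static void code_cache_init(code_cache *cc)
{
    assert(cc != NULL);
    for (size_t i = 0; i < CODE_CACHE_SLOTS; i++) {
        cc->guest_pc[i] = 0;
        cc->code[i] = NULL;
        cc->valid[i] = 0;
    }
    cc->translations = 0;
}

/* The (slow) translation step: map a guest operation to host code. */
static host_fn translate(enum guest_op op)
{
    return (op == GUEST_ADD) ? host_add : host_mul;
}

/* Emulate the guest instruction at pc. Translation happens only when the
 * code cache has no entry for pc; later executions reuse the entry. */
static int32_t execute(code_cache *cc, uint32_t pc, enum guest_op op,
                       int32_t a, int32_t b)
{
    size_t slot = (pc >> 2) % CODE_CACHE_SLOTS;
    if (!cc->valid[slot] || cc->guest_pc[slot] != pc) {
        cc->code[slot] = translate(op);
        cc->guest_pc[slot] = pc;
        cc->valid[slot] = 1;
        cc->translations++;
    }
    return cc->code[slot](a, b);
}
```

Under these assumptions, executing the same guest “ADD” one hundred times in a loop triggers a single call to translate; the remaining ninety-nine executions go directly to the cached entry.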
Micro-architecture simulations often involve many details to emulate. Those details are often coded as functions that are called whenever the corresponding activities are involved in emulating an instruction. If we translate all these activities into native code, the resulting code expansion may degrade the simulation speed instead of improving it. We discuss this impact further later in this chapter and use experimental data to support this point. To avoid this code expansion dilemma, this study applies the idea of DBT to a selected area of micro-architecture simulation. In this study, the simulation goes through the same components many times; for example, each memory operation calls the cache simulator. We propose to have the dynamic translator choose some critical blocks of these components, apply optimizations to remove redundancies, convert the selected blocks to native instructions, and store the translated code in the code cache for subsequent simulations.
Significant redundancies exist among multiple load/store instructions. Due to spatial locality, multiple loads/stores are likely to reference data located on the same cache line. Instead of calling the cache simulation multiple times, the simulation activities can be combined into one transaction. This is particularly useful for frequently executed loops. Based on this observation, the DBT could merge multiple cache simulations into one and eliminate a large portion of the simulation activities.
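The merging idea can be sketched with a toy direct-mapped cache model in C. This is only an illustration under stated assumptions, not SimpleScalar's cache_access: the type toy_cache, the line size, and the number of sets are hypothetical, write policies are ignored, and the grouping of references by cache line is computed at run time here, whereas in the approach described above it would be decided by the translator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BITS 5u    /* assumed 32-byte cache lines */
#define NUM_SETS  256u  /* assumed number of lines (direct-mapped) */

typedef struct {
    uint32_t      tag[NUM_SETS];
    int           valid[NUM_SETS];
    unsigned long hits, misses;  /* statistics the simulator reports */
    unsigned long lookups;       /* how often the full model was invoked */
} toy_cache;

static void toy_cache_init(toy_cache *c)
{
    assert(c != NULL);
    for (size_t i = 0; i < NUM_SETS; i++) {
        c->tag[i] = 0;
        c->valid[i] = 0;
    }
    c->hits = c->misses = c->lookups = 0;
}

/* One full lookup: the expensive step normally paid per memory reference.
 * Returns 1 on a hit and 0 on a miss (the line is then filled). */
static int toy_cache_access(toy_cache *c, uint32_t addr)
{
    uint32_t line = addr >> LINE_BITS;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;

    c->lookups++;
    if (c->valid[set] && c->tag[set] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[set] = 1;
    c->tag[set] = tag;
    c->misses++;
    return 0;
}

/* Merged simulation: consecutive references to the same cache line are
 * handled as one transaction. Only the first reference of each run pays
 * for a lookup; the others must hit the line just touched, so they are
 * credited as hits directly. */
static void toy_cache_access_merged(toy_cache *c, const uint32_t *addrs,
                                    size_t n)
{
    assert(c != NULL && (addrs != NULL || n == 0));
    size_t i = 0;
    while (i < n) {
        uint32_t line = addrs[i] >> LINE_BITS;
        size_t run = 1;
        while (i + run < n && (addrs[i + run] >> LINE_BITS) == line) {
            run++;
        }
        toy_cache_access(c, addrs[i]);
        c->hits += run - 1;
        i += run;
    }
}
```

For the reference sequence 0x1000, 0x1004, 0x1008, 0x2000, 0x2004, both routines report the same hit and miss counts in this model, but the merged version performs two lookups instead of five.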
Before starting our work, a proper platform must be chosen. SimpleScalar is a well-known open-source computer architecture and micro-architecture simulator; it is a set of tools that model a virtual computer system with a CPU, caches, and a memory hierarchy. In 2000, more than one third of all papers published in top computer architecture conferences used the SimpleScalar tool set to evaluate their designs [3]. That is why we chose it for this study.
The purpose of this study is to devise a method for identifying these redundancies and to let a DBT eliminate them via code transformations. By eliminating the redundancies, the total time of data cache simulation would be decreased. We believe the same idea could also be applied to other micro-architecture simulations. In our study, we modified the original sim-cache simulator to include our proposed approach. Unmodified code runs as before, and modified code executes as if the DBT had applied the transformation, performing merged cache simulations.
Motivating Example
In this section, we use real examples to point out the differences between emulating guest instructions by interpretation and by DBT in micro-architecture simulations, which motivates this work.
We use a short and simple piece of code as an example, as shown in Figure 1.
Figure 1 A simple C code example
In Figure 1, there are some arithmetic instructions in a loop. When we emulate by interpretation, the ADD and MUL instructions are each emulated as a sequence of instructions, as shown in Figure 2. If we use DBT for the emulation, the ADD and MUL instructions can be converted to native code, as simple as one single native instruction each, so the number of instructions executed during emulation is much smaller than with interpretation. Although DBT is an effective way to do emulation, the speedup may not be realized when it is applied to micro-architecture simulations. The reason is that a micro-architecture simulator does not merely translate code as an interpreter does; it also requires many instructions to simulate caches, register updates, load/store refreshing, interconnection buses, and peripheral devices.
Figure 2 Translated instructions: Interpreter vs. Micro-architecture simulator (Add instruction with cache simulation only)
Code expansion problem
If we inline these function calls, the code expansion may slow down the program, and the simulation performance may be even worse than the original. Table 1 shows the code size of these functions. Even though the source C code is simple and short, the translated code of these functions is large. Based on this data, take Figure 2 and Figure 3 as examples: the number of instructions for micro-architecture simulation (with cache simulation only) is about 3300 (according to Table 1, each cache access function translates to 1111 native instructions), which is more than the translated code for interpretation. When the number of iterations is 100, the translated code of micro-architecture simulation with inlined cache simulation is much larger than with the original approach (as shown in Figure 3). We therefore conclude that even if we use binary translation to translate the source instructions into native instructions, the performance may be worse than the original.
Figure 3 Code expansion problem
Table 1 Code size of main functions in sim-outorder simulator
Function name    Source code size    Native code size (instructions)
cache_access     207                 1111
ruu_commit       554                 3294
ruu_writeback    453                 863
lsq_refresh      111                 300
ruu_issue        707                 3165
ruu_dispatch     15484               77452+
ruu_fetch        1137                4806
In conclusion, instead of inlining these functions, a better way to improve the performance of micro-architecture simulation is to reduce the number of function calls. Therefore, it is important to find a proper function to optimize. Cache simulation is easier to optimize and has more potential than the other functions.
The remainder of this thesis is organized as follows. Chapter 2 introduces the background of this work. Chapter 3 presents the design and implementation of merged cache simulations. Chapter 4 evaluates the performance of our design and discusses the correctness and applicability of this approach. Chapter 5 summarizes and concludes.