Chapter 3 Design of Out-of-order Execution Mechanism

3.4 Example

Figures 3.14 to 3.30 show the execution process of a machine with one dependency control bit, one clock cycle per figure.

Figure 3.14: The 1st clock cycle. There are 3 function units and an instruction bundle queue with 8 slots.

Figure 3.15: The 2nd clock cycle.

Figure 3.16: The 3rd clock cycle.

Figure 3.17: The 4th clock cycle.

Figure 3.18: The 5th clock cycle.

Figure 3.19: The 6th clock cycle. A marked instruction can be executed if no stalled instruction with the same mark is in the function units. When a marked instruction is executed, the executable status of all continuous instructions is set.

Figure 3.20: The 7th clock cycle. A stall occurred.

Figure 3.21: The 8th clock cycle. The blocked instructions are skipped at this time. Note that these two instructions can be executed at the same time because they belong to the same instruction group. A branch misprediction occurs in this clock cycle.

Figure 3.22: The 9th clock cycle.

Figure 3.23: The 10th clock cycle.

Figure 3.24: The 11th clock cycle.

Figure 3.25: The 12th clock cycle.

Figure 3.26: The 13th clock cycle. No instruction can be executed until the stalled instruction completes.

Figure 3.27: The 14th clock cycle. The stalled instruction has completed. Blocked instructions in the queue can now be executed.

Figure 3.28: The 15th clock cycle.

Figure 3.29: The 16th clock cycle. The executable status of the next continuous marked instructions is set.

Figure 3.30: The 17th clock cycle.
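The issue rule stated in the captions of Figures 3.19 and 3.21 can be illustrated with a minimal sketch. It is not the thesis hardware design: the names `Instr`, `mark`, and `select_issuable` are hypothetical, and the sketch ignores instruction groups, bundle boundaries, and the setting of executable status. It only shows that an instruction is skipped while a stalled instruction with the same mark occupies a function unit.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Instr:
    name: str
    mark: Optional[int] = None  # hypothetical dependency mark; None means unmarked


def select_issuable(queue: List[Instr], stalled_marks: Set[int], free_units: int) -> List[Instr]:
    """Pick up to free_units instructions from the queue, skipping blocked ones.

    An instruction is blocked when it carries a mark that matches a
    stalled instruction still occupying a function unit.
    """
    issued: List[Instr] = []
    for instr in queue:
        if len(issued) == free_units:
            break
        if instr.mark is not None and instr.mark in stalled_marks:
            continue  # blocked: stays in the queue for a later cycle
        issued.append(instr)
    return issued


queue = [Instr("add", mark=1), Instr("sub"), Instr("shl", mark=2), Instr("or")]

# Mark 1 belongs to an instruction that is still stalled in a function unit.
print([i.name for i in select_issuable(queue, stalled_marks={1}, free_units=3)])
# -> ['sub', 'shl', 'or']

# Once the stall completes, the blocked instruction can be selected again.
print([i.name for i in select_issuable(queue, stalled_marks=set(), free_units=3)])
# -> ['add', 'sub', 'shl']
```

With 3 function units, as in Figure 3.14, the blocked `add` is skipped in the first call and later instructions fill the units instead.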

Chapter 4 Simulation Environment and Result

In this chapter, we describe the simulation environment, including the simulator, compilers, and benchmarks. We then present the simulation results for total execution time and analyze them.

4.1 Simulation Environment

The only implementation of the EPIC architecture is the Intel IA-64 Itanium family.

We use the Itanium 2 processor as the simulation target.

4.1.1 IA-64 Simulator

The IA-64 is a RISC-style, register-register instruction set, but with many novel features designed to support compiler-based exploitation of ILP.

The components of the IA-64 register state are:

• 128 64-bit general-purpose registers, which as we will see shortly are actually 65 bits wide

• 128 82-bit floating-point registers, which provide two extra exponent bits over the standard 80-bit IEEE format

• 8 64-bit branch registers, which are used for indirect branches

• a variety of registers used for system control, memory mapping, performance counters, and communication with the OS

Execution unit slot   Instruction type   Instruction description   Example instructions
I-unit                A                  Integer ALU               add, subtract, and, or, compare
I-unit                I                  Non-ALU integer           integer and multimedia shifts, bit tests, moves
M-unit                A                  Integer ALU               add, subtract, and, or, compare
M-unit                M                  Memory access             Loads and stores for integer/FP registers
F-unit                F                  Floating point            Floating-point instructions
B-unit                B                  Branches                  Conditional branches, calls, loop branches
L + X                 L + X              Extended                  Extended immediates, stops and no-ops

Table 4.1: The five execution unit slots in the IA-64 architecture and the instruction types they may hold.

The IA-64 instruction set architecture (ISA) includes six instruction types: A-type, I-type, M-type, F-type, B-type, and L+X-type. A-type instructions, which correspond to integer ALU instructions, may be placed in either an I-unit or an M-unit slot. L+X slots are special because they occupy two instruction slots; L+X instructions are used to encode 64-bit immediates and a few special instructions. L+X instructions are executed by either the I-unit or the B-unit.
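As an illustration, the slot rules of Table 4.1 can be written as a lookup table. This is only a sketch: the names `SLOTS_FOR_TYPE` and `can_hold` are hypothetical and are not taken from the simulator described in this chapter.

```python
# Instruction type -> execution unit slots that may hold it (from Table 4.1).
SLOTS_FOR_TYPE = {
    "A": ("I-unit", "M-unit"),  # integer ALU instructions fit either slot
    "I": ("I-unit",),
    "M": ("M-unit",),
    "F": ("F-unit",),
    "B": ("B-unit",),
    "L+X": ("L+X",),            # occupies two instruction slots
}


def can_hold(slot: str, instr_type: str) -> bool:
    """Return True if the given slot may hold an instruction of this type."""
    return slot in SLOTS_FOR_TYPE.get(instr_type, ())


print(can_hold("M-unit", "A"))  # True
print(can_hold("I-unit", "M"))  # False
```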

There is no open-source IA-64 simulator, so we wrote our own to simulate the processor. It simulates all A-type, I-type, M-type, and B-type instructions, and some F-type instructions.

The simulator counts clock cycles and models branch mispredictions, instruction fetch stalls, and data access cycles.

4.1.2 IA-64 Compiler

We use two free C/C++ compilers for the IA-64 architecture:

• Microsoft C/C++ Optimizing Compiler for IA-64

• GNU C/C++ compiler

Both compilers can save object files in ELF64 format for our simulator.

Instead of rewriting the code, we wrote an independent program that generates the dependency bits and stores them in a corresponding file.
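The idea of computing dependency information outside the object code can be sketched as follows. This is not the program used in this thesis. It is a simplified illustration that assumes loads are the only stall-prone instructions, tracks dependencies only through register reads, and uses a hypothetical instruction record (`Instr`) and function name (`dependency_marks`). The result is keyed by instruction address, so it could be stored in a separate file next to the unmodified object file.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    addr: int
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def dependency_marks(program: List[Instr]) -> Dict[int, int]:
    """Map instruction address -> mark of the load it (transitively) depends on."""
    tainted: Dict[str, int] = {}  # register -> mark of the load that feeds it
    marks: Dict[int, int] = {}
    next_mark = 1
    for ins in program:
        src_marks = [tainted[r] for r in ins.srcs if r in tainted]
        if ins.op == "ld":
            # Assumption: every load may stall, so it starts a new mark.
            # Simplification: a load that reads a tainted register is not marked.
            if ins.dest is not None:
                tainted[ins.dest] = next_mark
            next_mark += 1
        elif src_marks:
            # Simplification: keep only the first mark if several sources are tainted.
            marks[ins.addr] = src_marks[0]
            if ins.dest is not None:
                tainted[ins.dest] = src_marks[0]
        elif ins.dest is not None:
            tainted.pop(ins.dest, None)  # register overwritten by an independent value
    return marks


program = [
    Instr(0x00, "ld", "r1", ("r2",)),
    Instr(0x10, "add", "r3", ("r1", "r4")),  # reads the loaded value
    Instr(0x20, "sub", "r5", ("r6", "r7")),  # independent of the load
    Instr(0x30, "or", "r8", ("r3", "r5")),   # depends on the load through r3
]
print(dependency_marks(program))  # {16: 1, 48: 1}
```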

4.1.3 Benchmarks

Because some floating-point instructions are not implemented, we wrote only two simple benchmarks. One is a compute-intensive program, an implementation of the RSA algorithm. The other is a memory-intensive program that simply performs a block memory copy.

4.2 Simulation Results and Analysis

With our chosen compilers, the run of continuous instructions after each stall instruction is very short. Figure 4.1 shows the counts for a parser generator called ‘Bison’. We therefore simply assume an infinite number of dependency bits and do not model the overhead of merging them.

Figure 4.1: The length of continuous instructions after each stall instruction.

Figure 4.2: Simulation Results

Figure 4.2 shows the simulation results. For the compute-intensive program, 24.2% of the total clock cycles are saved. For the memory-intensive program, 48.1% of the total clock cycles are saved.
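For readers who prefer speedup factors, the percentages above can be converted with a short calculation. The sketch assumes that "cycles saved" is measured relative to the baseline cycle count without the proposed mechanism; the function name is hypothetical.

```python
def speedup_from_cycles_saved(fraction_saved: float) -> float:
    """Convert a fraction of baseline cycles saved into a speedup factor."""
    if not 0.0 <= fraction_saved < 1.0:
        raise ValueError("fraction_saved must be in [0, 1)")
    return 1.0 / (1.0 - fraction_saved)


for name, saved in [("compute-intensive", 0.242), ("memory-intensive", 0.481)]:
    print(f"{name}: {speedup_from_cycles_saved(saved):.2f}x")
# compute-intensive: 1.32x
# memory-intensive: 1.93x
```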

The simulation results show that, when executing a compute-intensive program, a stall can overlap with other non-stall instructions, and the blocked instructions are executed later together with non-stall instructions. When executing a memory-intensive program, stall cycles can overlap with other stall cycles.

The real performance gain may be less than these results because some other stalls are not simulated. However, those stalls can still be overlapped.
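The two kinds of overlap can be illustrated with a toy cycle model. It is an idealized sketch, not the simulator used here: it assumes every instruction is independent, one instruction issues per cycle, and function units are never a bottleneck. The function names and the stall length of 10 cycles are hypothetical.

```python
from typing import List


def in_order_cycles(stalls: List[int]) -> int:
    """Each instruction takes 1 cycle plus its stall; nothing overlaps."""
    return sum(1 + s for s in stalls)


def overlapped_cycles(stalls: List[int]) -> int:
    """One instruction issues per cycle; stalls run concurrently with later work."""
    finish = 0
    for issue_cycle, stall in enumerate(stalls):
        finish = max(finish, issue_cycle + 1 + stall)
    return finish


compute_like = [10, 0, 0, 0]    # one stall followed by non-stall instructions
memory_like = [10, 10, 10, 10]  # every instruction stalls

print(in_order_cycles(compute_like), overlapped_cycles(compute_like))  # 14 11
print(in_order_cycles(memory_like), overlapped_cycles(memory_like))    # 44 14
```

In the first case the stall hides the non-stall instructions behind it; in the second the stalls hide one another, which is why the saving is larger.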

Chapter 5 Conclusions and Future Works

This chapter concludes this thesis. We summarize and conclude this study in section 5.1. Section 5.2 points out some possible issues worth further investigation.

5.1 Conclusions

In this thesis, an approach to out-of-order execution for EPIC is proposed and simulated. The compiler gives the hardware hints for handling stall cycles, instead of relying on a complex circuit to detect instruction dependencies.

To hide stall cycles in an out-of-order execution processor, the key point of this research is how to keep the dependency relations: the dependency information between instructions must be retained, and the instructions must still execute correctly.

If a stall lasts too long, the dependency chain soon occupies all execution resources and no more instructions can be executed. Therefore, the out-of-order execution mechanism is most effective at tolerating short stall penalties.

As the simulation results show, both the compute-intensive program and the memory-intensive program gain performance.

5.2 Future Works

Still, there are some other issues worth further investigation.

This design is a conservative approach that prevents incorrect execution; it is not an optimized out-of-order execution mechanism. An optimized approach might need all the dependency relations, which may be hard to store.

The simulator in this thesis assumes an infinite number of dependency bits and does not model the merging issue. This assumption needs to be verified, and the merging method may need to be improved.

Finally, simultaneous multithreading (SMT), in which multiple software threads run simultaneously on one processor, is effective at tolerating large stall penalties. SMT and the proposed mechanism do not conflict, and combining them may be interesting.

