Dynamic Scheduling Using Tomasulo’s Approach
2.5 Dynamic Scheduling: Examples and the Algorithm
In describing the operation of this scheme, we use terminology taken from the CDC scoreboard scheme (see Appendix A) rather than introduce new terminology, and we show the terminology used by the IBM 360/91 for historical reference. It is important to remember that the tags in the Tomasulo scheme refer to the buffer or unit that will produce a result; the register names are discarded when an instruction issues to a reservation station.
Each reservation station has seven fields:
■ Op—The operation to perform on source operands S1 and S2.
■ Qj, Qk—The reservation stations that will produce the corresponding source operand; a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary. (The IBM 360/91 calls these SINKunit and SOURCEunit.)
■ Vj, Vk—The value of the source operands. Note that only one of the V field or the Q field is valid for each operand. For loads, the Vk field is used to hold the offset field. (These fields are called SINK and SOURCE on the IBM 360/91.)
■ A—Used to hold information for the memory address calculation for a load or store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.
■ Busy—Indicates that this reservation station and its accompanying functional unit are occupied.
The register file has a field, Qi:
■ Qi—The number of the reservation station that contains the operation whose result should be stored into this register. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register, meaning that the value is simply the register contents.
The load and store buffers each have a field, A, which holds the result of the effective address calculation once the first step of execution has been completed.
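The fields described above can be written down as a small data structure. The following is a minimal sketch, not the 360/91 hardware layout: the class name, the `ready` helper, the use of `None` in place of a zero tag, and the use of strings for symbolic values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                   # tag of this station, e.g. "Add1"
    busy: bool = False          # Busy: station and its unit are occupied
    op: Optional[str] = None    # Op: operation to perform
    vj: Optional[str] = None    # Vj: value of the first source operand
    vk: Optional[str] = None    # Vk: value of the second source operand
    qj: Optional[str] = None    # Qj: tag of the producer; None stands for zero
    qk: Optional[str] = None    # Qk: tag of the producer; None stands for zero
    a: Optional[str] = None     # A: immediate, then the effective address

    def ready(self) -> bool:
        # An occupied station may execute once neither operand is pending.
        return self.busy and self.qj is None and self.qk is None


# Register status: one Qi entry per register; None means the register
# file already holds the current value (F0, F2, ..., F30 in this sketch).
register_status = {f"F{i}": None for i in range(0, 32, 2)}

# MUL.D F0,F2,F4 issued while F2 is still being loaded by Load2.
mult1 = ReservationStation("Mult1", busy=True, op="MUL",
                           vk="Regs[F4]", qj="Load2")
register_status["F0"] = "Mult1"
```

Here `mult1.ready()` is false until the pending tag in `qj` is replaced by a value in `vj`.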
In the next section, we will first consider some examples that show how these mechanisms work and then examine the detailed algorithm.
Before we examine Tomasulo’s algorithm in detail, let’s consider a few examples, which will help illustrate how the algorithm works.
Example: Show what the information tables look like for the following code sequence when only the first load has completed and written its result:
98 ■ Chapter Two Instruction-Level Parallelism and Its Exploitation
L.D F6,32(R2)
L.D F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
Answer: Figure 2.10 shows the result in three tables. The numbers appended to the names add, mult, and load stand for the tag for that reservation station; for example, Add1 is the tag for the result from the first add unit. In addition, we have included an instruction status table. This table is included only to help you understand the algorithm; it is not actually a part of the hardware. Instead, the reservation station keeps the state of each operation that has issued.
Tomasulo’s scheme offers two major advantages over earlier and simpler schemes: (1) the distribution of the hazard detection logic and (2) the elimination of stalls for WAW and WAR hazards.
The first advantage arises from the distributed reservation stations and the use of the Common Data Bus (CDB). If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by the broadcast of the result on the CDB. If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
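The simultaneous release described above can be sketched in a few lines. This is an illustrative model only: the dictionary layout, the function names `broadcast` and `is_ready`, and the numeric operand values are assumptions, not part of the original design.

```python
def broadcast(stations, tag, value):
    """Model one CDB broadcast: every station waiting on `tag` grabs `value`."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None


def is_ready(rs):
    # A station can begin execution when no operand is still pending.
    return rs["Qj"] is None and rs["Qk"] is None


# Three stations wait on Load2; two of them already hold their other operand.
# The values 1.5, 2.0, and 7.0 are arbitrary placeholders.
stations = [
    {"Name": "Add1", "Vj": None, "Vk": 1.5, "Qj": "Load2", "Qk": None},
    {"Name": "Mult1", "Vj": None, "Vk": 2.0, "Qj": "Load2", "Qk": None},
    {"Name": "Add2", "Vj": None, "Vk": None, "Qj": "Add1", "Qk": "Load2"},
]

ready_before = [rs["Name"] for rs in stations if is_ready(rs)]
broadcast(stations, "Load2", 7.0)
ready_after = [rs["Name"] for rs in stations if is_ready(rs)]
```

After the single broadcast, Add1 and Mult1 are both ready in the same step, while Add2 has captured its second operand but still waits on Add1.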
The second advantage, the elimination of WAW and WAR hazards, is accomplished by renaming registers using the reservation stations, and by the process of storing operands into the reservation station as soon as they are available.
For example, the code sequence in Figure 2.10 issues both the DIV.D and the ADD.D, even though there is a WAR hazard involving F6. The hazard is eliminated in one of two ways. First, if the instruction providing the value for the DIV.D has completed, then Vk will store the result, allowing DIV.D to execute independent of the ADD.D (this is the case shown). On the other hand, if the L.D had not completed, then Qk would point to the Load1 reservation station, and the DIV.D instruction would be independent of the ADD.D. Thus, in either case, the ADD.D can issue and begin executing. Any uses of the result of the DIV.D would point to the reservation station, allowing the ADD.D to complete and store its value into the registers without affecting the DIV.D.
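The first of these two cases can be illustrated with a small issue routine. This is a hedged sketch under simplifying assumptions: the function `issue`, the dictionary layout, and the register values (3.0 for the loaded F6, 99.0 for the later sum) are hypothetical, and execution itself is not modeled.

```python
def issue(op, dest, src1, src2, tag, regs, qi, stations):
    """Issue one FP operation to the station named `tag`, renaming operands."""
    rs = {"Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        if qi.get(src) is not None:
            rs[q] = qi[src]        # operand pending: remember the producer tag
        else:
            rs[v] = regs[src]      # operand available: copy the value now
    qi[dest] = tag                 # the destination is renamed to this station
    stations[tag] = rs


# State after the first load has written F6; F0, F2, F8 are still pending.
regs = {"F0": 0.0, "F2": 0.0, "F6": 3.0, "F8": 0.0, "F10": 0.0}
qi = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
stations = {}

issue("DIV", "F10", "F0", "F6", "Mult2", regs, qi, stations)
issue("ADD", "F6", "F8", "F2", "Add2", regs, qi, stations)
f6_tag_after_issue = qi["F6"]

# Suppose ADD.D later finishes first and overwrites F6 in the register file.
regs["F6"], qi["F6"] = 99.0, None
```

Because DIV.D copied F6 into Vk at issue, `stations["Mult2"]["Vk"]` still holds 3.0 after the register file is overwritten, which is why the WAR hazard causes no stall.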
We’ll see an example of the elimination of a WAW hazard shortly. But let’s first look at how our earlier example continues execution. In this example, and the ones that follow in this chapter, assume the following latencies: load is 1 clock cycle, add is 2 clock cycles, multiply is 6 clock cycles, and divide is 12 clock cycles.
Example: Using the same code segment as in the previous example (page 97), show what the status tables look like when the MUL.D is ready to write its result.
Answer: The result is shown in the three tables in Figure 2.11. Notice that ADD.D has completed since the operands of DIV.D were copied, thereby overcoming the WAR hazard. Notice that even if the load of F6 had been delayed, the add into F6 could be executed without triggering a WAW hazard.
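The WAW case mentioned here follows from how the register status field Qi is checked when a result is written. The sketch below assumes a hypothetical `write_result` helper and arbitrary values (10.0 and 5.0); it models only the register file side of a CDB write.

```python
def write_result(tag, value, regs, qi):
    """Update only those registers whose Qi still names the producing station."""
    for reg in regs:
        if qi.get(reg) == tag:
            regs[reg] = value
            qi[reg] = None


regs = {"F6": 0.0}
qi = {"F6": "Load1"}     # L.D F6,32(R2) has issued but is delayed

qi["F6"] = "Add2"        # ADD.D F6,F8,F2 issues later and takes over F6

write_result("Add2", 10.0, regs, qi)   # the add finishes first
write_result("Load1", 5.0, regs, qi)   # the delayed load finishes afterward
```

The late load finds that Qi for F6 no longer names Load1, so it leaves the register alone and F6 keeps the value from the add. Instructions that needed the loaded value would still receive it through their own Q fields.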
Instruction status

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,32(R2) | √ | √ | √ |
| L.D F2,44(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |
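Reservation stations

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | no | | | | | | |
| Load2 | yes | Load | | | | | 44 + Regs[R3] |
| Add1 | yes | SUB | | Mem[32 + Regs[R2]] | Load2 | | |
| Add2 | yes | ADD | | | Add1 | Load2 | |
| Add3 | no | | | | | | |
| Mult1 | yes | MUL | | Regs[F4] | Load2 | | |
| Mult2 | yes | DIV | | Mem[32 + Regs[R2]] | Mult1 | | |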
Register status

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | . . . | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Add2 | Add1 | Mult2 | | | |
Figure 2.10 Reservation stations and register tags shown when all of the instructions have issued, but only the first load instruction has completed and written its result to the CDB. The second load has completed effective address calculation, but is waiting on the memory unit. We use the array Regs[ ] to refer to the register file and the array Mem[ ] to refer to the memory. Remember that an operand is specified by either a Q field or a V field at any time. Notice that the ADD.D instruction, which has a WAR hazard at the WB stage, has issued and could complete before the DIV.D initiates.
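The rule that each operand is specified by exactly one of a Q field or a V field can be checked mechanically against the figure. The sketch below transcribes the busy arithmetic stations and the register status; the tuple layout, the helper `one_of`, and the placeholder strings such as "val(F6)" (standing for a captured value) are assumptions for illustration.

```python
# Busy arithmetic stations from the figure: (Op, Vj, Vk, Qj, Qk).
arith = {
    "Add1": ("SUB", None, "val(F6)", "Load2", None),
    "Add2": ("ADD", None, None, "Add1", "Load2"),
    "Mult1": ("MUL", None, "val(F4)", "Load2", None),
    "Mult2": ("DIV", None, "val(F6)", "Mult1", None),
}
busy = set(arith) | {"Load2"}

# Register status (only the non-blank Qi entries).
register_status = {"F0": "Mult1", "F2": "Load2", "F6": "Add2",
                   "F8": "Add1", "F10": "Mult2"}


def one_of(v, q):
    # Exactly one of the V field and the Q field may be filled in.
    return (v is None) != (q is None)


# Stations that break the "either Q or V" rule for some operand.
violations = [name for name, (_, vj, vk, qj, qk) in arith.items()
              if not (one_of(vj, qj) and one_of(vk, qk))]

# Tags that point at a station that is not busy.
tags = list(register_status.values())
tags += [q for (_, _, _, qj, qk) in arith.values() for q in (qj, qk) if q]
dangling = [t for t in tags if t not in busy]
```

Both lists come out empty for the state shown, which is a quick way to confirm that a hand-built status table is internally consistent.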