Tomasulo’s Algorithm: A Loop-Based Example
2.6 Hardware-Based Speculation
per clock, just predicting branches accurately may not be sufficient to generate the desired amount of instruction-level parallelism. A wide issue processor may need to execute a branch every clock cycle to maintain maximum performance.
Hence, exploiting more parallelism requires that we overcome the limitation of control dependence.
Overcoming control dependence is done by speculating on the outcome of branches and executing the program as if our guesses were correct. This mechanism represents a subtle, but important, extension over branch prediction with dynamic scheduling. In particular, with speculation, we fetch, issue, and execute instructions, as if our branch predictions were always correct; dynamic scheduling only fetches and issues such instructions. Of course, we need mechanisms to handle the situation where the speculation is incorrect. Appendix G discusses a variety of mechanisms for supporting speculation by the compiler.
In this section, we explore hardware speculation, which extends the ideas of dynamic scheduling.
Hardware-based speculation combines three key ideas: dynamic branch prediction to choose which instructions to execute, speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence), and dynamic scheduling to deal with the scheduling of different combinations of basic blocks. (In comparison, dynamic scheduling without speculation only partially overlaps basic blocks because it requires that a branch be resolved before actually executing any instructions in the successor basic block.)
Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions. This method of executing programs is essentially a data flow execution: Operations execute as soon as their operands are available.
To extend Tomasulo’s algorithm to support speculation, we must separate the bypassing of results among instructions, which is needed to execute an instruction speculatively, from the actual completion of an instruction. By making this separation, we can allow an instruction to execute and to bypass its results to other instructions, without allowing the instruction to perform any updates that cannot be undone, until we know that the instruction is no longer speculative.
Using the bypassed value is like performing a speculative register read, since we do not know whether the instruction providing the source register value is providing the correct result until the instruction is no longer speculative. When an instruction is no longer speculative, we allow it to update the register file or memory; we call this additional step in the instruction execution sequence instruction commit.
The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order and to prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits. Hence, when we add speculation, we need to separate the process of completing execution from instruction commit, since instructions may finish execution considerably before they are ready to commit. Adding this commit phase to the instruction execution sequence requires an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed. This hardware buffer, which we call the reorder buffer, is also used to pass results among instructions that may be speculated.
The reorder buffer (ROB) provides additional registers in the same way as the reservation stations in Tomasulo’s algorithm extend the register set. The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. Hence, the ROB is a source of operands for instructions, just as the reservation stations provide operands in Tomasulo’s algorithm. The key difference is that in Tomasulo’s algorithm, once an instruction writes its result, any subsequently issued instructions will find the result in the register file. With speculation, the register file is not updated until the instruction commits (and we know definitively that the instruction should execute); thus, the ROB supplies operands in the interval between completion of instruction execution and instruction commit. The ROB is similar to the store buffer in Tomasulo’s algorithm, and we integrate the function of the store buffer into the ROB for simplicity.
Each entry in the ROB contains four fields: the instruction type, the destination field, the value field, and the ready field. The instruction type field indicates whether the instruction is a branch (and has no destination result), a store (which has a memory address destination), or a register operation (ALU operation or load, which has register destinations). The destination field supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written. The value field is used to hold the value of the instruction result until the instruction commits. We will see an example of ROB entries shortly. Finally, the ready field indicates that the instruction has completed execution, and the value is ready.
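The four fields just described can be written down as a small record. The following Python sketch is only an illustration: the names ROBEntry and InstrType, and the choice of Python types for each field, are assumptions made here and do not come from any particular processor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"      # no destination result
    STORE = "store"        # destination is a memory address
    REGISTER = "register"  # ALU operation or load; destination is a register


@dataclass
class ROBEntry:
    # Hypothetical layout: one attribute per field named in the text.
    instr_type: InstrType
    destination: Optional[Union[str, int]] = None  # register name or memory address
    value: Optional[float] = None                  # result held until commit
    ready: bool = False                            # True once execution has completed


# A load into F6 that has finished executing but has not yet committed.
entry = ROBEntry(InstrType.REGISTER, destination="F6")
entry.value, entry.ready = 3.5, True

# A branch occupies an entry but never produces a destination or a value.
branch = ROBEntry(InstrType.BRANCH)
```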
Figure 2.14 shows the hardware structure of the processor including the ROB.
The ROB subsumes the store buffers. Stores still execute in two steps, but the second step is performed by instruction commit. Although the renaming function of the reservation stations is replaced by the ROB, we still need a place to buffer operations (and operands) between the time they issue and the time they begin execution. This function is still provided by the reservation stations. Since every instruction has a position in the ROB until it commits, we tag a result using the ROB entry number rather than using the reservation station number. This tagging requires that the ROB assigned for an instruction must be tracked in the reservation station. Later in this section, we will explore an alternative implementation that uses extra registers for renaming and the ROB only to track when instructions can commit.
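Tagging results by ROB entry number changes how an issuing instruction finds its operands. The sketch below shows one way to express that lookup in Python. The function name read_operand and the dictionary-based register file, register status table, and ROB are hypothetical stand-ins chosen for illustration.

```python
def read_operand(reg, regs, reg_status, rob):
    """Return (value, tag); exactly one of the two is not None."""
    tag = reg_status.get(reg)        # ROB entry that will produce reg, if any
    if tag is None:
        return regs[reg], None       # no pending writer: read the register file
    entry = rob[tag]
    if entry["ready"]:
        return entry["value"], None  # completed but uncommitted: read the ROB
    return None, tag                 # not yet computed: wait for this tag on the CDB


# Illustrative state (values are made up): F2 is produced by ROB entry 2,
# which has completed; F6 is produced by ROB entry 6, which has not.
regs = {"F2": 1.5, "F4": 2.0, "F6": 0.0}
reg_status = {"F2": 2, "F6": 6}
rob = {2: {"ready": True, "value": 7.0}, 6: {"ready": False, "value": None}}

from_regfile = read_operand("F4", regs, reg_status, rob)
from_rob = read_operand("F2", regs, reg_status, rob)
pending = read_operand("F6", regs, reg_status, rob)
```

In the third case the returned tag is what would be placed in the Qj or Qk field of the reservation station.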
Here are the four steps involved in instruction execution:
1. Issue: Get an instruction from the instruction queue. Issue the instruction if there is an empty reservation station and an empty slot in the ROB; send the operands to the reservation station if they are available in either the registers or the ROB. Update the control entries to indicate the buffers are in use. The number of the ROB entry allocated for the result is also sent to the reservation station, so that the number can be used to tag the result when it is placed on the CDB. If either all reservation stations are full or the ROB is full, then instruction issue is stalled until both have available entries.
2. Execute: If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed. This step checks for RAW hazards. When both operands are available at a reservation station, execute the operation. Instructions may take multiple clock cycles in this stage, and loads still require two steps in this stage. Stores need only have the base register available at this step, since execution for a store at this point is only effective address calculation.
Figure 2.14 The basic structure of a FP unit using Tomasulo’s algorithm and extended to handle speculation. Comparing this to Figure 2.9 on page 94, which implemented Tomasulo’s algorithm, the major change is the addition of the ROB and the elimination of the store buffer, whose function is integrated into the ROB. This mechanism can be extended to multiple issue by making the CDB wider to allow for multiple completions per clock.
[Figure 2.14 diagram. Labeled components: instruction queue (fed from the instruction unit), FP registers, reorder buffer (Reg #, Data), reservation stations, FP adders, FP multipliers, address unit, load buffers, memory unit. Labeled paths: floating-point operations, load-store operations, operand buses, operation bus, common data bus (CDB), store address, store data, load data.]
3. Write result: When the result is available, write it on the CDB (with the ROB tag sent when the instruction issued) and from the CDB into the ROB, as well as to any reservation stations waiting for this result. Mark the reservation station as available. Special actions are required for store instructions. If the value to be stored is available, it is written into the Value field of the ROB entry for the store. If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the Value field of the ROB entry of the store is updated. For simplicity we assume that this occurs during the Write Result stage of a store; we discuss relaxing this requirement later.
4. Commit: This is the final stage of completing an instruction, after which only its result remains. (Some processors call this commit phase “completion” or “graduation.”) There are three different sequences of actions at commit depending on whether the committing instruction is a branch with an incorrect prediction, a store, or any other instruction (normal commit). The normal commit case occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB. Committing a store is similar except that memory is updated rather than a result register. When a branch with incorrect prediction reaches the head of the ROB, it indicates that the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated, eliminating the need for the ROB entry. If the ROB fills, we simply stop issuing instructions until an entry is made free.
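The ROB side of these steps can be sketched as a toy model. The Python below is a simplified illustration under stated assumptions, not a description of real hardware: the class name SpeculativeCore and its methods are hypothetical, reservation stations and the Execute step are left out entirely, and a misprediction is simply flagged by the caller when the branch writes its result.

```python
from collections import deque


class SpeculativeCore:
    """Toy model of ROB allocation at issue, write result, and in-order commit."""

    def __init__(self, rob_size):
        self.rob_size = rob_size
        self.rob = deque()   # oldest instruction at the left (the head)
        self.next_tag = 1
        self.regs = {}       # architectural registers, updated only at commit
        self.memory = {}     # architectural memory, updated only at commit

    def issue(self, kind, dest=None):
        """Allocate a ROB entry; return its tag, or None to signal a stall."""
        if len(self.rob) >= self.rob_size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.rob.append({"tag": tag, "kind": kind, "dest": dest,
                         "value": None, "ready": False, "mispredicted": False})
        return tag

    def write_result(self, tag, value=None, mispredicted=False):
        """Model the CDB broadcast: record the result in the tagged ROB entry."""
        for entry in self.rob:
            if entry["tag"] == tag:
                entry.update(value=value, ready=True, mispredicted=mispredicted)

    def commit(self):
        """Retire the head entry if it is ready; report which case applied."""
        if not self.rob or not self.rob[0]["ready"]:
            return "wait"
        head = self.rob.popleft()
        if head["kind"] == "branch":
            if head["mispredicted"]:
                self.rob.clear()   # discard all speculative work behind the branch
                return "flush"
            return "branch"
        if head["kind"] == "store":
            self.memory[head["dest"]] = head["value"]
            return "store"
        self.regs[head["dest"]] = head["value"]
        return "register"


core = SpeculativeCore(rob_size=3)
t1 = core.issue("register", "F0")
t2 = core.issue("branch")
t3 = core.issue("store", 100)
stalled = core.issue("register", "F2")    # ROB is full, so issue stalls
core.write_result(t3, 9.0)                # the youngest instruction finishes first
first = core.commit()                     # head (t1) is not ready yet
core.write_result(t1, 4.0)
core.write_result(t2, mispredicted=True)
results = [core.commit(), core.commit(), core.commit()]
```

In this run the store completes before the older instructions but never reaches memory, because the mispredicted branch ahead of it flushes the ROB first.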
Now, let’s examine how this scheme would work with the same example we used for Tomasulo’s algorithm.
Example Assume the same latencies for the floating-point functional units as in earlier examples: add is 2 clock cycles, multiply is 6 clock cycles, and divide is 12 clock cycles.
Using the code segment below, the same one we used to generate Figure 2.11, show what the status tables look like when the MUL.D is ready to go to commit.
L.D F6,32(R2)
L.D F2,44(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2
Answer Figure 2.15 shows the result in the three tables. Notice that although the SUB.D instruction has completed execution, it does not commit until the MUL.D commits.
The reservation stations and register status field contain the same basic information that they did for Tomasulo’s algorithm (see page 97 for a description of those fields). The differences are that reservation station numbers are replaced with ROB entry numbers in the Qj and Qk fields, as well as in the register status fields, and we have added the Dest field to the reservation stations. The Dest field designates the ROB entry that is the destination for the result produced by this reservation station entry.
The above example illustrates the key difference between a processor with speculation and a processor with dynamic scheduling. Compare the content of Figure 2.15 with that of Figure 2.11 on page 100, which shows the same code sequence in operation on a processor with Tomasulo’s algorithm. The key difference is that, in the example above, no instruction after the earliest uncompleted instruction (MUL.D above) is allowed to complete. In contrast, in Figure 2.11 the SUB.D and ADD.D instructions have also completed.
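A small calculation makes the effect of in-order commit concrete. The Python sketch below is illustrative only. It assumes at most one commit per clock cycle and that an instruction commits no earlier than the cycle after it writes its result. The completion cycles are invented for the illustration and are not taken from Figure 2.15.

```python
def commit_cycles(completion):
    """completion: list of (name, cycle the result was written), in program order."""
    commits = []
    last = 0
    for name, done in completion:
        # Commit no earlier than the cycle after the result is written, and
        # strictly after the previous (older) instruction has committed.
        last = max(done + 1, last + 1)
        commits.append((name, last))
    return commits


# Hypothetical write-result cycles: SUB.D and ADD.D finish before MUL.D and DIV.D.
program = [("MUL.D", 10), ("SUB.D", 5), ("DIV.D", 25), ("ADD.D", 8)]
schedule = commit_cycles(program)
# schedule == [("MUL.D", 11), ("SUB.D", 12), ("DIV.D", 26), ("ADD.D", 27)]
```

Under these assumptions SUB.D holds its result in the ROB from cycle 5 until cycle 12, and ADD.D waits behind the long-latency DIV.D even though its own result was ready much earlier.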
One implication of this difference is that the processor with the ROB can dynamically execute code while maintaining a precise interrupt model. For example, if the MUL.D instruction caused an interrupt, we could simply wait until it reached the head of the ROB and take the interrupt, flushing any other pending instructions from the ROB. Because instruction commit happens in order, this yields a precise exception.
By contrast, in the example using Tomasulo’s algorithm, the SUB.D and ADD.D instructions could both complete before the MUL.D raised the exception.
The result is that the registers F8 and F6 (destinations of the SUB.D and ADD.D instructions) could be overwritten, and the interrupt would be imprecise.
Some users and architects have decided that imprecise floating-point exceptions are acceptable in high-performance processors, since the program will likely terminate; see Appendix G for further discussion of this topic. Other types of exceptions, such as page faults, are much more difficult to accommodate if they are imprecise, since the program must transparently resume execution after handling such an exception.
The use of a ROB with in-order instruction commit provides precise exceptions, in addition to supporting speculative execution, as the next example shows.
Example Consider the code example used earlier for Tomasulo’s algorithm and shown in Figure 2.13 in execution:
Loop: L.D F0,0(R1)
      MUL.D F4,F0,F2
      S.D F4,0(R1)
      DADDIU R1,R1,#-8
      BNE R1,R2,Loop ;branches if R1≠R2
Assume that we have issued all the instructions in the loop twice. Let’s also assume that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution. Normally, the store would wait in the ROB for both the effective address operand (R1 in this example) and the value (F4 in this example). Since we are only considering the floating-point pipeline, assume the effective address for the store is computed by the time the instruction is issued.
Answer Figure 2.16 shows the result in two tables.
Reorder buffer
Entry  Busy  Instruction      State         Destination  Value
1      no    L.D F6,32(R2)    Commit        F6           Mem[32 + Regs[R2]]
2      no    L.D F2,44(R3)    Commit        F2           Mem[44 + Regs[R3]]
3      yes   MUL.D F0,F2,F4   Write result  F0           #2 × Regs[F4]
4      yes   SUB.D F8,F2,F6   Write result  F8           #2 – #1
5      yes   DIV.D F10,F0,F6  Execute       F10
6      yes   ADD.D F6,F8,F2   Write result  F6           #4 + #2

Reservation stations
Name   Busy  Op     Vj                  Vk                  Qj  Qk  Dest  A
Load1  no
Load2  no
Add1   no
Add2   no
Add3   no
Mult1  no    MUL.D  Mem[44 + Regs[R3]]  Regs[F4]                    #3
Mult2  yes   DIV.D                      Mem[32 + Regs[R2]]  #3      #5

FP register status
Field      F0   F1  F2  F3  F4  F5  F6   F7   F8   F10
Reorder #  3                        6         4    5
Busy       yes  no  no  no  no  no  yes  ...  yes  yes

Figure 2.15 At the time the MUL.D is ready to commit, only the two L.D instructions have committed, although several others have completed execution. The MUL.D is at the head of the ROB, and the two L.D instructions are there only to ease understanding. The SUB.D and ADD.D instructions will not commit until the MUL.D instruction commits, although the results of the instructions are available and can be used as sources for other instructions. The DIV.D is in execution, but has not completed solely due to its longer latency than MUL.D. The Value column indicates the value being held; the format #X is used to refer to a value field of ROB entry X. Reorder buffers 1 and 2 are actually completed, but are shown for informational purposes. We do not show the entries for the load-store queue, but these entries are kept in order.
Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted. Suppose that the branch BNE is not taken the first time in Figure 2.16. The instructions prior to the branch will simply commit when each reaches the head of the ROB; when the branch reaches the head of that buffer, the buffer is simply cleared and the processor begins fetching instructions from the other path.
In practice, processors that speculate try to recover as early as possible after a branch is mispredicted. This recovery can be done by clearing the ROB for all entries that appear after the mispredicted branch, allowing those that are before the branch in the ROB to continue, and restarting the fetch at the correct branch successor. In speculative processors, performance is more sensitive to the branch prediction, since the impact of a misprediction will be higher. Thus, all the aspects of handling branches—prediction accuracy, latency of misprediction detection, and misprediction recovery time—increase in importance.
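The early-recovery policy just described amounts to splitting the ROB at the mispredicted branch. The following Python sketch shows that split under simplifying assumptions: the function name squash_younger and the list-of-dictionaries ROB are hypothetical, and the sketch leaves out the rest of recovery, such as repairing register status entries that still point at squashed ROB entries and redirecting fetch.

```python
def squash_younger(rob, branch_tag):
    """Keep entries up to and including the branch; drop everything younger."""
    for i, entry in enumerate(rob):
        if entry["tag"] == branch_tag:
            return rob[: i + 1], rob[i + 1:]
    raise ValueError(f"tag {branch_tag} not in ROB")


# ROB contents in program order, loosely following entries 3 to 7 of Figure 2.16.
rob = [
    {"tag": 3, "op": "S.D"},
    {"tag": 4, "op": "DADDIU"},
    {"tag": 5, "op": "BNE"},      # assume this branch is found to be mispredicted
    {"tag": 6, "op": "L.D"},
    {"tag": 7, "op": "MUL.D"},
]
kept, squashed = squash_younger(rob, 5)
kept_tags = [e["tag"] for e in kept]          # [3, 4, 5]
squashed_tags = [e["tag"] for e in squashed]  # [6, 7]
```

The older S.D and DADDIU stay in the buffer and commit normally, while the second-iteration L.D and MUL.D are discarded without ever updating a register.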
Reorder buffer
Entry  Busy  Instruction       State         Destination   Value
1      no    L.D F0,0(R1)      Commit        F0            Mem[0 + Regs[R1]]
2      no    MUL.D F4,F0,F2    Commit        F4            #1 × Regs[F2]
3      yes   S.D F4,0(R1)      Write result  0 + Regs[R1]  #2
4      yes   DADDIU R1,R1,#-8  Write result  R1            Regs[R1] – 8
5      yes   BNE R1,R2,Loop    Write result
6      yes   L.D F0,0(R1)      Write result  F0            Mem[#4]
7      yes   MUL.D F4,F0,F2    Write result  F4            #6 × Regs[F2]
8      yes   S.D F4,0(R1)      Write result  0 + #4        #7
9      yes   DADDIU R1,R1,#-8  Write result  R1            #4 – 8
10     yes   BNE R1,R2,Loop    Write result

FP register status
Field      F0   F1  F2  F3  F4   F5  F6  F7   F8
Reorder #  6                7
Busy       yes  no  no  no  yes  no  no  ...  no

Figure 2.16 Only the L.D and MUL.D instructions have committed, although all the others have completed execution. Hence, no reservation stations are busy and none are shown. The remaining instructions will be committed as fast as possible. The first two reorder buffers are empty, but are shown for completeness.

Exceptions are handled by not recognizing the exception until it is ready to commit. If a speculated instruction raises an exception, the exception is recorded in the ROB. If a branch misprediction arises and the instruction should not have been executed, the exception is flushed along with the instruction when the ROB is cleared. If the instruction reaches the head of the ROB, then we know it is no longer speculative and the exception should really be taken. We can also try to handle exceptions as soon as they arise and all earlier branches are resolved, but this is more challenging in the case of exceptions than for branch mispredictions and, because it occurs less frequently, not as critical.
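Deferring an exception until commit can be sketched in a few lines. In the Python below, the exception is stored as a field of the ROB entry and raised only when that entry is at the head. The names PreciseException and commit_head and the dictionary layout are assumptions made for this illustration.

```python
class PreciseException(Exception):
    """Raised only when the faulting instruction reaches the head of the ROB."""


def commit_head(rob, regs):
    """Commit the oldest entry; take its recorded exception, if any."""
    head = rob.pop(0)
    if head.get("exception"):
        rob.clear()  # younger instructions are flushed without updating state
        raise PreciseException(head["exception"])
    regs[head["dest"]] = head["value"]


regs = {}
rob = [
    {"dest": "F0", "value": 1.0},
    {"dest": "F4", "value": None, "exception": "page fault"},  # recorded, not yet taken
    {"dest": "F8", "value": 2.0},                              # already completed
]

commit_head(rob, regs)          # the oldest instruction commits normally
try:
    commit_head(rob, regs)      # the faulting instruction is now at the head
    raised = False
except PreciseException:
    raised = True
```

At the point where the exception is taken, F0 has been written and F8 has not, which is the state a sequential execution would have produced. Had a mispredicted branch ahead of the faulting entry flushed the ROB first, the recorded exception would simply have been discarded with its entry.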
Figure 2.17 shows the steps of execution for an instruction, as well as the conditions that must be satisfied to proceed to the step and the actions taken. We show the case where mispredicted branches are not resolved until commit.
Although speculation seems like a simple addition to dynamic scheduling, a comparison of Figure 2.17 with the comparable figure for Tomasulo’s algorithm in Figure 2.12 shows that speculation adds significant complications to the control. In addition, remember that branch mispredictions are somewhat more complex as well.
There is an important difference in how stores are handled in a speculative processor versus in Tomasulo’s algorithm. In Tomasulo’s algorithm, a store can update memory when it reaches Write Result (which ensures that the effective address has been calculated) and the data value to store is available. In a speculative processor, a store updates memory only when it reaches the head of the ROB. This difference ensures that memory is not updated until an instruction is no longer speculative.
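The two phases of a speculative store, capturing the data value from the CDB and writing memory only at commit, can be sketched as follows. The function names make_store, snoop_cdb, and commit_store and the dictionary fields are hypothetical. The sketch also assumes the effective address is already known, as in the loop example.

```python
def make_store(address, value=None, waiting_on=None):
    """ROB entry for a store; waiting_on is the ROB tag of a pending data value."""
    return {"address": address, "value": value, "waiting_on": waiting_on}


def snoop_cdb(store, tag, value):
    """Capture the store data when the producing instruction broadcasts it."""
    if store["waiting_on"] == tag:
        store["value"], store["waiting_on"] = value, None


def commit_store(store, memory):
    """Call only when the store is at the head of the ROB."""
    if store["waiting_on"] is not None:
        return False   # data value not captured yet, so the store cannot commit
    memory[store["address"]] = store["value"]
    return True


memory = {}
store = make_store(0x40, waiting_on=7)    # data will come from ROB entry 7
too_early = commit_store(store, memory)   # nothing is written to memory
snoop_cdb(store, 6, 1.0)                  # unrelated broadcast is ignored
snoop_cdb(store, 7, 2.5)                  # matching tag fills the Value field
committed = commit_store(store, memory)
```

Memory changes only in the final call, so a store that is flushed before reaching the head of the ROB leaves no trace.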
Figure 2.17 has one significant simplification for stores, which is unneeded in practice. Figure 2.17 requires stores to wait in the Write Result stage for the register source operand whose value is to be stored; the value is then moved from the Vk field of the store’s reservation station to the Value field of the store’s ROB