4.33.6 This problem is very similar to the previous one: we have already computed the number of stall cycles due to a branch misprediction, and we know how to compute the number of non-stall cycles between mispredictions (this is where the misprediction rate has an effect). The difference is that we are now aiming to have as many stall cycles as we have non-stall cycles. We get:
| | Stall cycles between mispredictions | Needed # of instructions between mispredictions | Allowed branch misprediction rate |
|---|---|---|---|
| a. | 6.4 | 6.4 × 4 = 25.5 | 1/(25.5 × 0.30) = 13.1% |
| b. | 7.3 | 7.3 × 2 = 14.5 | 1/(14.5 × 0.15) = 46.0% |
The needed accuracy is 100% minus the allowed misprediction rate.
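The arithmetic in the table can be checked with a short sketch. The function name is illustrative (it is not from the exercise), and the inputs are the instruction counts and branch fractions shown in the table. The stall-cycle values in the table are displayed rounded, so the instruction counts are used directly here.

```python
def allowed_misprediction_rate(instructions_between, branch_fraction):
    """Misprediction rate that yields exactly one misprediction per
    `instructions_between` instructions, given the fraction of branches."""
    branches_between = instructions_between * branch_fraction
    return 1.0 / branches_between

# Values taken from the table above (a: 25.5 instructions, 30% branches;
# b: 14.5 instructions, 15% branches).
rate_a = allowed_misprediction_rate(25.5, 0.30)
rate_b = allowed_misprediction_rate(14.5, 0.15)

for name, rate in (("a", rate_a), ("b", rate_b)):
    # Needed accuracy is 100% minus the allowed misprediction rate.
    print(f"{name}: allowed rate {rate:.1%}, needed accuracy {1 - rate:.1%}")
```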
Solution 4.34
4.34.1 We need an IF pipeline stage to fetch the instruction. Since we will only execute one kind of instruction, we do not need to decode the instruction but we still need to read registers. As a result, we will need an ID pipeline stage although it would be misnamed. After that, we have an EXE stage, but this stage is simpler because we know exactly which operation should be executed so there is no need for an ALU that supports different operations. Also, we need no Mux to select which values to use in the operation because we know exactly which value it will be. We have:
a. In the ID stage we read two registers and we do not need a sign-extend unit. In the EXE stage we need an Add unit whose inputs are the two register values read in the ID stage. After the EXE stage we have a WB stage which writes the result from the Add unit into Rd (again, no Mux). Note that there is no MEM stage, so this is a 4-stage pipeline. Also note that the PC is always incremented by 4, so we do not need the other Add and Mux units that compute the new PC for branches and jumps.
b. We only read one register in the ID stage so there is no need for the second read port in the Registers unit. We do need a sign-extend unit for the Offs field in the instruction word. In the EXE stage we need an Add unit whose inputs are the register value and the sign-extended offset from the ID stage. After the EXE stage we use the output of the Add unit as a memory address in the MEM stage, and then we have a WB stage which writes the value we read in the MEM stage into Rt (again, no Mux). Also note that the PC is always incremented by 4, so we do not need the other Add and Mux units that compute the new PC for branches and jumps.
4.34.2
a. Assuming that the register write in WB happens in the first half of the cycle and the register reads in ID happen in the second half, we only need to forward the Add result from the EX/WB pipeline register to the inputs of the Add unit in the EXE stage of the next instruction (if that next instruction depends on the previous one). No hazard detection unit is needed because forwarding eliminates all hazards.
b. Assuming that the register write in WB happens in the first half of the cycle and the register read in ID happens in the second half, we only need to forward the memory value from the MEM/WB pipeline register to the first (register) input of the Add unit in the EXE stage of the next or second-next instruction (if one of those two instructions is dependent on the one that has just read the value). We also need a hazard detection unit that stalls any instruction whose Rs register field is equal to the Rt field of the previous instruction.
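The stall condition for the load-only pipeline in part b can be sketched as a small model. This is only an illustration: the function name and the (Rt, Rs) tuple representation are assumptions, and the model counts one stall cycle for each instruction whose Rs equals the Rt of the instruction directly before it, as described above.

```python
def count_load_use_stalls(program):
    """Count stall cycles for a sequence of `lw Rt,Offs(Rs)` instructions.

    `program` is a list of (rt, rs) register-number pairs (hypothetical
    representation). An instruction stalls for one cycle when its Rs
    matches the Rt of the immediately preceding instruction; after that
    the value is forwarded from the MEM/WB pipeline register.
    """
    stalls = 0
    prev_rt = None
    for rt, rs in program:
        if prev_rt is not None and rs == prev_rt:
            stalls += 1
        prev_rt = rt
    return stalls

# lw $10,0($2); lw $11,0($10); lw $12,0($3): only the second one stalls.
dependent = [(10, 2), (11, 10), (12, 3)]
independent = [(10, 2), (11, 2)]
print(count_load_use_stalls(dependent), count_load_use_stalls(independent))
```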
4.34.3 We need to add some decoding logic to our ID stage. The decoding logic must simply check whether the opcode and funct field (if there is a funct field) match this instruction. If there is no match, we must put the address of the exception handler into the PC (this adds a Mux before the PC) and flush (convert to nops) the undefined instruction (write zeros to the ID/EX pipeline register) and the following instruction which has already been fetched (write zeros to the IF/ID pipeline register).
4.34.4
a. We need to add the logic that computes the branch address (sign-extend, shift-left-2, Add, and Mux to select the PC). We also need to replace the Add unit in EXE with an ALU that supports either an ADD or a comparison. The ALUOp signal to select between these operations must be supplied by the Control unit.
b. We need to add back the second register read port (AND reads two registers), add the Mux that selects the value supplied to the second ALU input (register for AND, Offs for LW), add an ALUOp signal to select between two ALU operations, and replace the Add unit in EXE with an ALU that supports either an Add or an And operation. Finally, we must add to the WB stage the Mux that selects whether the value to write to the register is the value from the ALU or from memory, and the Mux in the EX stage that selects which register to write to (Rd for AND, Rt for LW).
4.34.5
a. The same forwarding logic used for forwarding from one ADD to another can also be used to forward from ADD to BEQ. We still need no hazard detection for data hazards, but we must add detection of control hazards. Assuming there is no branch prediction, whenever a BEQ is taken we must flush (convert to NOPs) all instructions that were fetched after that branch.
b. We need to add forwarding from the EX/MEM pipeline register to the ALU inputs in the EXE stage (so AND can forward to the next instruction), and we need to extend our forwarding from the MEM/WB pipeline register to the second input of the ALU (so LW can forward to an AND whose Rt (input) register is the same as the Rt (result) register of the LW instruction). We also need to extend the hazard detection unit to stall any AND instruction whose Rs or Rt register field is equal to the Rt field of the previous LW instruction.
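For part b, the forwarding choice and the stall condition can be sketched as two small functions. The names, the string labels, and the use of None for "no destination available" are assumptions made for illustration. In particular, `ex_mem_dest` is assumed to be set only when the instruction in EX/MEM is an AND, since a dependent instruction behind an LW has already been stalled by the hazard detection unit.

```python
def must_stall(prev_is_lw, prev_rt, cur_sources):
    """Load-use hazard: the previous LW's Rt is a source of the current
    instruction. `cur_sources` is (rs,) for LW and (rs, rt) for AND."""
    return prev_is_lw and prev_rt in cur_sources

def forward_select(src, ex_mem_dest, mem_wb_dest):
    """Pick where an ALU input comes from. The newest producer (EX/MEM)
    takes priority over the older one (MEM/WB)."""
    if ex_mem_dest is not None and src == ex_mem_dest:
        return "EX/MEM"
    if mem_wb_dest is not None and src == mem_wb_dest:
        return "MEM/WB"
    return "REG"

# AND whose Rt input ($5) was just loaded by the previous LW: stall.
print(must_stall(True, 5, (4, 5)))
# After the stall the loaded value sits in MEM/WB and is forwarded.
print(forward_select(5, None, 5))
```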
4.34.6 The decoding logic must now check if the instruction matches either of the two instructions. After that, the exception handling is the same as for 4.34.3.
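The opcode/funct check described in 4.34.3 and 4.34.6 might look like the following sketch. The helper names and the table layout are illustrative assumptions. The field positions and encodings are the standard MIPS ones (opcode in bits 31..26, funct in bits 5..0; R-type opcode 0, add funct 0x20, and funct 0x24, lw opcode 0x23, beq opcode 0x04). Other fields such as shamt are not checked here.

```python
OPCODE_RTYPE, OPCODE_LW, OPCODE_BEQ = 0x00, 0x23, 0x04
FUNCT_ADD, FUNCT_AND = 0x20, 0x24

# Supported (opcode, funct) pairs per part; funct is None for I-type.
SUPPORTED_A = [(OPCODE_RTYPE, FUNCT_ADD), (OPCODE_BEQ, None)]
SUPPORTED_B = [(OPCODE_LW, None), (OPCODE_RTYPE, FUNCT_AND)]

def matches(word, opcode, funct=None):
    """True if the instruction word has the given opcode (and funct)."""
    if ((word >> 26) & 0x3F) != opcode:
        return False
    return funct is None or (word & 0x3F) == funct

def is_defined(word, supported):
    """False means the ID stage should raise the undefined-instruction
    exception and flush, as described in 4.34.3."""
    return any(matches(word, op, fn) for op, fn in supported)

def r_type(rs, rt, rd, funct):
    return (rs << 21) | (rt << 16) | (rd << 11) | funct

def i_type(opcode, rs, rt, imm):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

add_word = r_type(2, 3, 1, FUNCT_ADD)      # add $1,$2,$3
lw_word = i_type(OPCODE_LW, 2, 10, 1000)   # lw $10,1000($2)
print(is_defined(add_word, SUPPORTED_A), is_defined(lw_word, SUPPORTED_A))
```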
Solution 4.35
4.35.1 The worst case for control hazards is if the mispredicted branch instruction is the last one in its cycle and we have been fetching the maximum number of instructions in each cycle. Then the control hazard affects the remaining instructions in the branch’s own pipeline stage and all instructions in stages between fetch and branch execution stage. We have:
| | Delay slots needed |
|---|---|
| a. | 7 × 4 - 1 = 27 |
| b. | 17 × 2 - 1 = 33 |
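As a quick check of the table, the worst-case count is the branch's own stage plus all earlier stages, each holding a full issue group, minus the branch itself. The function name is an assumption. The stage numbers (7 and 17) and issue widths (4 and 2) are read off the arithmetic in the table.

```python
def delay_slots_needed(branch_stage, issue_width):
    """Worst case: every stage up to and including the branch's stage is
    full, and the branch is the last instruction in its issue group."""
    return branch_stage * issue_width - 1

slots_a = delay_slots_needed(7, 4)
slots_b = delay_slots_needed(17, 2)
print(slots_a, slots_b)
```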
4.35.2 If branches are executed in stage N, the number of stall cycles due to a misprediction is (N - 1). These cycles are reduced by filling them with delay slot instructions. We compute the number of execution (non-stall) cycles between mispredictions, and the speed-up as follows:
| | Non-stall cycles between mispredictions | Stall cycles without delay slots | Stall cycles with 4 delay slots | Speed-up due to delay slots |
|---|---|---|---|---|
| a. | 1/(0.20 × (1 - 0.80) × 4) = 6.25 | 6 | 5 | (6.25 + 6)/(6.25 + 5) = 1.089 |
| b. | 1/(0.25 × (1 - 0.92) × 2) = 25 | 16 | 14 | (25 + 16)/(25 + 14) = 1.051 |
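The table entries can be reproduced with the sketch below. Function names are illustrative. The rule that 4 delay slots remove 4/(issue width) stall cycles is an assumption inferred from the table values (6 to 5 at width 4, 16 to 14 at width 2), not something stated explicitly in the text.

```python
def non_stall_cycles(branch_fraction, accuracy, issue_width):
    """Cycles of useful work between two mispredictions."""
    return 1.0 / (branch_fraction * (1.0 - accuracy) * issue_width)

def stalls_with_delay_slots(stalls_without, delay_slots, issue_width):
    # Assumption: each issue group of delay slots hides one stall cycle.
    return stalls_without - delay_slots / issue_width

def speedup(non_stall, stalls_without, stalls_with):
    return (non_stall + stalls_without) / (non_stall + stalls_with)

ns_a = non_stall_cycles(0.20, 0.80, 4)
ns_b = non_stall_cycles(0.25, 0.92, 2)
with_a = stalls_with_delay_slots(6, 4, 4)
with_b = stalls_with_delay_slots(16, 4, 2)
speedup_a = speedup(ns_a, 6, with_a)
speedup_b = speedup(ns_b, 16, with_b)
print(f"a: {speedup_a:.3f}  b: {speedup_b:.3f}")
```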
4.35.3 For 20% of branches, we add an extra instruction, for 30% of the branches we add two extra instructions, and for 40% of branches, we add three extra instructions. Overall, an average branch instruction is now accompanied by 0.20 + 0.30 × 2 + 0.40 × 3 = 2 nop instructions. Note that these nops are added for every branch, not just mispredicted ones. These nop instructions add to the execution time of the program, so we have:
| | Total cycles between mispredictions without delay slots | Stall cycles with 4 delay slots | Extra cycles spent on NOPs | Speed-up due to delay slots |
|---|---|---|---|---|
| a. | 6.25 + 6 = 12.25 | 5 | 0.5 × 6.25 × 0.20 = 0.625 | 12.25/(6.25 + 5 + 0.625) = 1.032 |
| b. | 25 + 16 = 41 | 14 | 1 × 25 × 0.25 = 6.25 | 41/(25 + 14 + 6.25) = 0.906 |
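The following sketch recomputes the table using the same formula the table uses for NOP cycles (nops per branch divided by issue width, times non-stall cycles, times branch fraction). The function names and the list-of-pairs representation of the NOP distribution are assumptions for illustration.

```python
def average_nops_per_branch(distribution):
    """distribution: list of (fraction_of_branches, nops_added) pairs."""
    return sum(fraction * nops for fraction, nops in distribution)

def speedup_with_nops(non_stall, stalls_without, stalls_with,
                      branch_fraction, issue_width, nops_per_branch):
    # Extra cycles spent on NOPs, following the table's formula as written.
    nop_cycles = (nops_per_branch / issue_width) * non_stall * branch_fraction
    return (non_stall + stalls_without) / (non_stall + stalls_with + nop_cycles)

NOP_DISTRIBUTION = [(0.20, 1), (0.30, 2), (0.40, 3)]
nops = average_nops_per_branch(NOP_DISTRIBUTION)

nop_speedup_a = speedup_with_nops(6.25, 6, 5, 0.20, 4, nops)
nop_speedup_b = speedup_with_nops(25, 16, 14, 0.25, 2, nops)
print(f"nops/branch = {nops:.2f}, a: {nop_speedup_a:.3f}, b: {nop_speedup_b:.3f}")
```

A result below 1 (as in part b) means the NOPs cost more cycles than the delay slots save.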
4.35.4
a.
          add  $2,$0,$0        ; $2=0
    Loop: beq  $2,$3,Exit
          lb   $10,1000($2)    ; Delay slot
          sb   $10,2000($2)
          beq  $0,$0,Loop
          addi $2,$2,1         ; Delay slot
    Exit:

b.
          add  $2,$0,$0        ; $2=0
    Loop: lb   $10,1000($2)
          lb   $11,1001($2)
          beq  $10,$11,Exit
          addi $1,$1,1         ; Delay slot
          beq  $0,$0,Loop
          addi $2,$2,1         ; Delay slot
    Exit: addi $1,$1,-1        ; Undo c++ from delay slot
4.35.5
a.
          add  $2,$0,$0        ; $2=0
    Loop: beq  $2,$3,Exit
          lb   $10,1000($2)    ; Delay slot
          nop                  ; 2nd delay slot
          beq  $0,$0,Loop
          sb   $10,2000($2)    ; Delay slot
          addi $2,$2,1         ; 2nd delay slot
    Exit:

b.
          add  $2,$0,$0        ; $2=0
          lb   $10,1000($2)    ; Prepare for first iteration
          lb   $11,1001($2)    ; Prepare for first iteration
    Loop: beq  $10,$11,Exit
          addi $1,$1,1         ; Delay slot
          addi $2,$2,1         ; 2nd delay slot
          beq  $0,$0,Loop
          lb   $10,1000($2)    ; Delay slot, prepare for next iteration
          lb   $11,1001($2)    ; 2nd delay slot, prepare for next iteration
    Exit: addi $1,$1,-1        ; Undo c++ from delay slot
          addi $2,$2,-1        ; Undo i++ from 2nd delay slot
4.35.6 The maximum number of in-flight instructions is equal to the pipeline depth times the issue width. We have:

| | Instructions in flight | Instructions per iteration | Iterations in flight |
|---|---|---|---|
| a. | 10 × 4 = 40 | 5 | 40/5 + 1 = 9 |
| b. | 25 × 2 = 50 | 6 | roundUp(50/6) + 1 = 10 |
Note that an iteration is in-flight when even one of its instructions is in-flight. This is why we add one to the number we compute from the number of instructions in flight (instead of having an iteration entirely in flight, we can begin another one and still have the "trailing" one partially in-flight) and round up.
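The calculation above can be written as a short sketch. The function name is an assumption, and the inputs are the pipeline depths, issue widths, and per-iteration instruction counts from the table.

```python
import math

def iterations_in_flight(pipeline_depth, issue_width, instrs_per_iteration):
    """Round up the number of whole iterations that fit in the pipeline,
    then add one for a partially in-flight trailing iteration."""
    instructions_in_flight = pipeline_depth * issue_width
    return math.ceil(instructions_in_flight / instrs_per_iteration) + 1

iters_a = iterations_in_flight(10, 4, 5)
iters_b = iterations_in_flight(25, 2, 6)
print(iters_a, iters_b)
```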
Solution 4.36
4.36.1
| | Instruction | Translation |
|---|---|---|
| a. | lwinc Rt,Offset(Rs) | lw Rt,Offset(Rs); addi Rs,Rs,4 |
| b. | addr Rt,Offset(Rs) | lw tmp,Offset(Rs); add Rt,Rt,tmp |
4.36.2 The ID stage of the pipeline would now have a lookup table and a micro-PC, where the opcode of the fetched instruction would be used to index into the lookup table. Micro-operations would then be placed into the ID/EX pipeline register, one per cycle, using the micro-PC to keep track of which micro-op is the next one to be output. In the cycle in which we are placing the last micro-op of an instruction into the ID/EX register, we can allow the IF/ID register to accept the next instruction. Note that this results in executing up to one micro-op per cycle, but we are actually fetching instructions less often than that.
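A rough behavioral sketch of this scheme follows. It is not a hardware description: the table is indexed by mnemonic rather than by a real opcode, and the names `MICRO_OPS` and `decode_stream` are hypothetical. The translations are the ones given in 4.36.1.

```python
# Hypothetical micro-op lookup table (keys stand in for opcodes).
MICRO_OPS = {
    "lwinc": ["lw Rt,Offset(Rs)", "addi Rs,Rs,4"],
    "addr": ["lw tmp,Offset(Rs)", "add Rt,Rt,tmp"],
}

def decode_stream(instructions):
    """Yield (cycle, micro_op, accept_next_fetch) for each ID-stage cycle.

    One micro-op is placed into ID/EX per cycle. IF/ID may accept the
    next instruction only in the cycle that emits the last micro-op.
    """
    cycle = 0
    for instr in instructions:
        # Instructions without a table entry are treated as one micro-op.
        micro_ops = MICRO_OPS.get(instr, [instr])
        for micro_pc, micro_op in enumerate(micro_ops):
            is_last = micro_pc == len(micro_ops) - 1
            yield cycle, micro_op, is_last
            cycle += 1

trace = list(decode_stream(["lwinc", "addr"]))
for cycle, micro_op, accept in trace:
    print(cycle, micro_op, "fetch next" if accept else "hold IF/ID")
```

In this trace, four micro-ops are issued in four cycles, but only two instructions are fetched, which matches the observation that fetch happens less often than micro-op issue.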
4.36.3
a. We need to add an incrementer in the MEM stage. This incrementer would increment the value read from Rs while memory is being accessed. We also need to change the Registers unit to allow two writes to happen in the same cycle, so we can write the value from memory into Rt and the incremented value of Rs back into Rs.
b. We need another EX stage after the MEM stage to perform the addition. The result can then be stored into Rt in the WB stage.