Lecture 7: Midterm Review
• Quantitative Principle of Computer Design
• ISA Design
• Pipeline
Quantitative Principle of Computer Design
Performance Terminology
"X is n% faster than Y" means:

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

  n = 100 × (Performance(X) − Performance(Y)) / Performance(Y)
Example: Y takes 15 seconds to complete a task,
X takes 10 seconds. What % faster is X?
Example

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 15 / 10 = 1.5

  n = 100 × (1.5 − 1.0) / 1.0

  n = 50%
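A quick numeric check of this example (a minimal sketch; the helper name percent_faster is ours, not from the lecture):

```python
def percent_faster(extime_x, extime_y):
    """Return n such that X is n% faster than Y, given their execution times."""
    return 100 * (extime_y / extime_x - 1)

# Y takes 15 s, X takes 10 s
print(percent_faster(10, 15))   # 50.0 -> X is 50% faster than Y
```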
Amdahl's Law
Speedup due to enhancement E:
  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:
  ExTime(E) = ExTime × ((1 − F) + F / S)

  Speedup(E) = 1 / ((1 − F) + F / S)
Amdahl's Law

  ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Amdahl's Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
  Speedup_overall = ?

  ExTime_new = ?
Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
  Speedup_overall = 1 / 0.95 = 1.053

  ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
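The same numbers fall out of a one-line helper; this is a small sketch of Amdahl's Law, with the function name chosen here for illustration:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the time is sped up by speedup_enhanced."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP improved 2x, but only 10% of the time is FP
print(round(amdahl_speedup(0.10, 2), 3))   # 1.053
```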
Aspects of CPU Performance
  CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
              Inst Count   CPI   Clock Rate
Program           X
Compiler          X
Inst. Set         X          X
Organization                 X        X
Technology                            X
Base Machine (Reg/Reg): Typical Mix

Op        Freq   Cycles
ALU       50%    1
Load      20%    2
Store     10%    2
Branch    20%    2
Example
Add register/memory operations:
  – one source operand in memory
  – one source operand in register
  – cycle count of 2
Branch cycle count increases to 3.
What fraction of the loads must be eliminated for this to pay off?
Example Solution

Exec Time = Instr Cnt × CPI × Clock

                --- Old ---           --- New ---
Op        Freq   Cycles   CPI    Freq    Cycles   CPI
ALU       .50     1       .5     .5-X     1       .5-X
Load      .20     2       .4     .2-X     2       .4-2X
Store     .10     2       .2     .1       2       .2
Branch    .20     2       .4     .2       3       .6
Reg/Mem                           X       2       2X
Total    1.00            1.5     1-X             (1.7-X)/(1-X)

(The new cycle contributions sum to 1.7 - X; dividing by the new instruction count 1 - X gives the new CPI.)

Instr Cnt_old × CPI_old × Clock_old = Instr Cnt_new × CPI_new × Clock_new
1.00 × 1.5 = (1 - X) × (1.7 - X)/(1 - X)
1.5 = 1.7 - X
X = 0.2

ALL loads must be eliminated for this to be a win!
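The break-even point can be checked with a short script (a sketch; variable names are ours, and x is the fraction of the original instruction count that was loads eliminated by reg/mem operations):

```python
# Old machine: CPI of the typical mix (ALU 50%/1, Load 20%/2, Store 10%/2, Branch 20%/2)
cpi_old = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2      # 1.5
exec_time_old = 1.00 * cpi_old                            # relative to IC = 1, same clock

def exec_time_new(x):
    """Relative execution time after replacing a load + ALU pair with one 2-cycle
    reg/mem op for a fraction x of the original instructions (branches now 3 cycles)."""
    cycles = (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2   # = 1.7 - x
    return cycles    # instruction count (1 - x) cancels against CPI's denominator

for x in (0.0, 0.1, 0.2):
    print(x, exec_time_new(x))    # equals exec_time_old = 1.5 only at x = 0.2
```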
Integrated Circuits Costs

  IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

  Die cost = Wafer cost / (Dies per wafer × Die yield)

  Dies per wafer = π × (Wafer_diam / 2)² / Die_Area − π × Wafer_diam / (2 × Die_Area)^(1/2)

  Die yield = Wafer yield × (1 + Defects_per_unit_area × Die_Area / α)^(−α)

Die cost goes roughly with die area^4.
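As a rough illustration of how these formulas fit together (the numeric parameters below are made up, not from the lecture):

```python
import math

def dies_per_wafer(wafer_diam, die_area):
    return (math.pi * (wafer_diam / 2) ** 2) / die_area \
           - (math.pi * wafer_diam) / math.sqrt(2 * die_area)

def die_yield(wafer_yield, defects_per_area, die_area, alpha):
    return wafer_yield * (1 + defects_per_area * die_area / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam, die_area, wafer_yield, defects_per_area, alpha):
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area)
                         * die_yield(wafer_yield, defects_per_area, die_area, alpha))

# Made-up numbers: $1000 wafer, 20 cm diameter, 1 cm^2 die, 1 defect/cm^2, alpha = 3
print(round(die_cost(1000, 20, 1.0, 1.0, 1.0, 3.0), 2))
```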
Instruction Set Design
The instruction set is the interface between software and hardware.
Basic ISA Classes
Accumulator:
  add A            acc ← acc + mem[A]
Stack:
  add              top ← top + next
General Purpose Register (register-memory):
  add R1, A        R1 ← R1 + mem[A]
Load/Store (register-register):
  add Ra, Rb, Rc   Ra ← Rb + Rc
  load Ra, Rb      Ra ← mem[Rb]
  store Ra, Rb     mem[Rb] ← Ra
Complex Instruction Set Computer (CISC)
• Design goals:
  – simplify compilation of high-level languages
  – optimize code size
• Variable format, 2- and 3-operand instructions
• Rich set of addressing modes (apply to any operand)
• Rich set of operations
• Rich set of data types (B, W, L, Q, O, F, D, G, H)
• Condition codes
• Examples: VAX, Intel
• Problem: increased hardware design complexity!
Reduced Instruction Set Architecture
• Instruction set simplicity leads to a faster machine
  – efficient pipelining
• 32-bit fixed-format instructions (3 formats)
• 32 32-bit GPRs
• 3-operand, reg-reg arithmetic instructions
• Very few addressing modes for load/store
  – displacement
  – immediate
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, MC88100, AMD2900, i960, i860, PA-RISC, DEC Alpha, Clipper, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: DLX

I-type instruction:   Op (6) | rs (5) | rd (5) | Immediate (16)
  Load, store, conditional branch
  example: load rd, mem(rs + immediate)

R-type instruction:   Op (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
  Register-register ALU operations: rd ← rs1 func rs2

J-type instruction:   Op (6) | Offset added to PC (26)
  Jump and jump-link
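A small sketch of packing the I-type fields into a 32-bit word (the field layout follows the format above; the opcode value used is hypothetical):

```python
def encode_i_type(op, rs, rd, immediate):
    """Pack Op(6) | rs(5) | rd(5) | Immediate(16) into one 32-bit instruction word."""
    assert 0 <= op < 2**6 and 0 <= rs < 2**5 and 0 <= rd < 2**5
    return (op << 26) | (rs << 21) | (rd << 16) | (immediate & 0xFFFF)

# hypothetical opcode 0x23 for "load rd, mem(rs + immediate)"
print(hex(encode_i_type(0x23, 2, 1, 100)))   # 0x8c410064
```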
5 Steps of MIPS Datapath
Figure 3.4, Page 134 , CA:AQA 2e
[Pipeline datapath diagram: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back stages, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Visualizing Pipelining
Figure 3.3, Page 133 , CA:AQA 2e
[Diagram: successive instructions overlapped in time; in each of cycles 1-7 a different instruction occupies the Ifetch, Reg, ALU, DMem, and Reg (write-back) stages]
Ideal speedup from pipelining
• Ideal case: an instruction is issued every cycle
  – ideal pipelined CPI = 1
What is the reality?
• Pipeline overhead
  – pipeline register delay and clock skew
  – Clock Cycle_pipelined can be larger than Clock Cycle_unpipelined / Pipeline depth (see the sketch after this list)
• Pipeline hazards
  – prevent the CPU from issuing one instruction every cycle
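A tiny illustration of the overhead point (the stage delays and overhead below are made-up numbers):

```python
stage_delays = [10, 8, 10, 10, 7]     # ns per stage, hypothetical
overhead = 1                          # pipeline register delay + clock skew, ns

cc_unpipelined = sum(stage_delays)               # 45 ns
cc_pipelined = max(stage_delays) + overhead      # 11 ns, larger than 45/5 = 9 ns

print(cc_unpipelined, cc_pipelined, cc_unpipelined / len(stage_delays))
```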
Hazards
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved.
How do hazards affect performance?
  CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

  Speedup = (CPI_unpipelined × Clock Cycle_unpipelined) / (CPI_pipelined × Clock Cycle_pipelined)

          = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)

  Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
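A numeric sketch of the last formula (the helper name is ours; clock_ratio stands for Clock Cycle_unpipelined / Clock Cycle_pipelined and is taken as 1 here):

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = Pipeline depth / (1 + Pipeline stall CPI) x clock cycle ratio."""
    return depth / (1 + stall_cpi) * clock_ratio

print(pipeline_speedup(5, 0.0))   # ideal 5-stage pipeline -> 5.0
print(pipeline_speedup(5, 0.5))   # 0.5 stall cycles per instruction -> ~3.33
```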
Three Generic Data Hazards
• Read After Write (RAW)
    Add r1, r2, r3
    Add r1, r1, r2
• Write After Read (WAR)
    Add r2, r1, r3
    Add r1, r4, r5
• Write After Write (WAW)
    Add r1, r2, r3
    Add r1, r4, r5
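A toy classifier for the three cases above (a sketch; representing each instruction as a (destination, sources) tuple is our convention, not DLX syntax):

```python
def classify_hazards(first, second):
    """first and second are (dest, [sources]) register tuples; `second` issues after `first`."""
    hazards = []
    dest1, srcs1 = first
    dest2, srcs2 = second
    if dest1 in srcs2:
        hazards.append("RAW")   # second reads a register that first writes
    if dest2 in srcs1:
        hazards.append("WAR")   # second writes a register that first reads
    if dest1 == dest2:
        hazards.append("WAW")   # both write the same register
    return hazards

print(classify_hazards(("r1", ["r2", "r3"]), ("r1", ["r1", "r2"])))   # ['RAW', 'WAW']
print(classify_hazards(("r2", ["r1", "r3"]), ("r1", ["r4", "r5"])))   # ['WAR']
```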
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149 , CA:AQA 2e
Instruction sequence:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Diagram: the result of the add is forwarded from the ALU / pipeline registers to the later instructions that read r1, avoiding stalls]
Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
Branch Operation
Figure 3.4, Page 134 , CA:AQA 2e
[Same five-stage datapath diagram: the branch target is computed by the ALU and the Zero? condition is tested in EX, and the new PC is selected in MEM, so a taken branch is resolved late in the pipeline]
Control Hazard on Branches Three Stage Stall
Cycle:            1    2      3      4      5    6    7    8    9
Branch Inst       IF   ID     EX     MEM    WB
Br successor           IF     stall  stall  IF   ID   EX   MEM  WB
Br successor+1                                   IF   ID   EX   ...
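Plugging this three-cycle penalty into the stall-CPI formula from earlier, with the 20% branch frequency from the typical mix (a sketch, assuming every branch pays the full penalty):

```python
branch_freq = 0.20      # from the typical instruction mix
branch_penalty = 3      # cycles lost per branch with the original datapath

cpi = 1 + branch_freq * branch_penalty   # ideal CPI + stall cycles per instruction
print(cpi)                               # 1.6
```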
New Pipelined DLX Datapath
Figure 3.22, page 163, CA:AQA 2/e
[Revised datapath diagram: an extra adder and the Zero? test are moved into the ID stage, so the branch target and branch decision are available at the end of ID]
Control Hazard on Branches One Stage Stall
Cycle:            1    2    3    4    5    6    7    8
Branch Inst       IF   ID   EX   MEM  WB
Br successor           IF   IF   ID   EX   MEM  WB
Br successor+1              IF   ID   EX   ...
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– squash instructions in pipeline if branch actually taken
#3: Predict Branch Taken
  – can't implement in the DLX pipeline (the branch target is not known any earlier than the branch outcome)
Four Branch Hazard Alternatives
#4: Delayed Branch
  – schedule instructions into the branch-delay slots:
      branch instruction
      sequential successor_1
      sequential successor_2
      ...
      sequential successor_n        ← branch delay of length n
      branch target if taken
  – a 1-slot delay allows proper decision and branch target address in the 5-stage pipeline
  – DLX uses this