EI 338: Computer Systems Engineering

(1)

(Operating Systems & Computer Architecture)

Dept. of Computer Science & Engineering Chentao Wu

[email protected]

(2)

Download lectures

• ftp://public.sjtu.edu.cn

• User: wuct

• Password: wuct123456

• http://www.cs.sjtu.edu.cn/~wuct/cse/

(3)

Appendix C Pipelining

Computer Architecture

A Quantitative Approach, Fifth Edition

(4)

5 Steps of a (pre-pipelined) MIPS Datapath

[Figure: unpipelined MIPS datapath. Stages: 1 Instruction Fetch, 2 Instr. Decode / Reg. Fetch, 3 Execute / Addr. Calc, 4 Memory Access, 5 Write Back. Components: PC, adder (+4) producing Next SEQ PC, Next PC MUX, instruction memory (Address, Inst), IR, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU with Zero? test, Data Memory, LMD, WB Data.]

RTL (Register Transfer Language) actions:

    IR <= mem[PC];  PC <= PC + 4                     # stage 1

    Reg[IR_rd] <= (Reg[IR_rs] op_IRop Reg[IR_rt])    # op is done in stages 2-5

(5)

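5-Stage MIPS Datapath (has pipeline latches)

[Figure: pipelined MIPS datapath with pipeline latches IF/ID, ID/EX, EX/MEM, MEM/WB between stages 1 to 5 (Instruction Fetch, Decode / Reg. Fetch, Execute, Memory Access, Write Back). Next SEQ PC and RD are carried forward through the latches. Components: adder (+4), Next PC, Address, RS1, RS2, Imm, Sign Extend, MUX, ALU, Zero? test, WB Data.]

RTL actions by stage:

    IR <= mem[PC];  PC <= PC + 4         # 1

    A <= Reg[IR_rs];  B <= Reg[IR_rt]    # 2

    rslt <= A op_IRop B                  # 3

    WB <= rslt                           # 4

    Reg[IR_rd] <= WB                     # 5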

(6)

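Instruction Set Processor Controller

[Figure: controller state diagram. Two common states, then one path per instruction class.]

Ifetch:

    IR <= mem[PC];  PC <= PC + 4

opFetch-DCD (operand fetch / decode):

    A <= Reg[IR_rs];  B <= Reg[IR_rt]

Paths after decode:

    br:   if bop(A,B) PC <= PC + IR_im

    jmp:  PC <= IR_jaddr

    RR:   r <= A op_IRop B;      WB <= r;        Reg[IR_rd] <= WB

    RI:   r <= A op_IRop IR_im;  WB <= r;        Reg[IR_rd] <= WB

    LD:   r <= A + IR_im;        WB <= Mem[r];   Reg[IR_rd] <= WB

    ST, JAL, JR: (states shown in the diagram without RTL)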

(7)

5-Stage MIPS Datapath (has pipeline latches)

[Figure: pipelined MIPS datapath, stages 1 to 5. Components: adder (+4), Next PC, Address, RS1, RS2, Imm, Sign Extend, MUX, ALU, Zero? test, Memory, RD carried through the latches, WB Data.]

• Data stationary control

– local decode for each instruction phase / pipeline stage

(8)

Visualizing Pipelining

[Figure: pipeline diagram. Vertical axis: instruction order. Horizontal axis: time (clock cycles), Cycle 1 to Cycle 7. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]
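The diagram can be regenerated in text form. The following is a minimal sketch, assuming an ideal pipeline with no hazards in which each instruction starts one cycle after its predecessor; the function names `pipeline_diagram` and `render` are illustrative and not part of the lecture.

```python
# Stage names as they appear in the slide's diagram.
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(num_instrs):
    """Return one row per instruction.

    row[c] is the stage the instruction occupies in cycle c + 1,
    or "" if it is not in the pipeline during that cycle.
    Assumes an ideal pipeline: no stalls, one instruction issued per cycle.
    """
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows


def render(rows, width=8):
    """Format the rows as a fixed-width text table with a cycle header."""
    header = "".join(f"C{c + 1}".ljust(width) for c in range(len(rows[0])))
    body = ["".join(cell.ljust(width) for cell in row) for row in rows]
    return "\n".join([header] + body)


if __name__ == "__main__":
    print(render(pipeline_diagram(3)))
```

With three instructions the table spans 3 + 5 - 1 = 7 cycles, which is why n instructions on a k-stage pipeline need n + k - 1 cycles when nothing stalls.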

(9)

Pipelining is not quite that easy!

• Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (having a single person to fold and put clothes away at the same time)

– Data hazards: Instruction depends on result of prior instruction still in the pipeline (having a missing sock in a later wash; cannot put away)

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

(10)

One Memory Port / Structural Hazards

[Figure: pipeline diagram, instruction order against time (clock cycles), for Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

(11)

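One Memory Port / Structural Hazards

[Figure: pipeline diagram, instruction order against time, for Load, Instr 1, Instr 2, Stall, Instr 3. The stall row holds a Bubble in each of the five stage positions, so Instr 3 starts one cycle later.]

How do you “bubble” the pipe?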

(12)

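Speedup Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per inst}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

For simple RISC pipeline, Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$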

(13)

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory (“Harvard Architecture”)

• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate

• Ideal CPI = 1 for both

• Assume loads are 20% of instructions executed

    SpeedUp_A = Pipeline Depth/(1 + 0) x (clock_unpipe/clock_pipe)

              = Pipeline Depth

    SpeedUp_B = Pipeline Depth/(1 + 0.2 x 1) x (clock_unpipe/(clock_unpipe/1.05))

              = (Pipeline Depth/1.20) x 1.05        {105/120 = 7/8}

              = 0.875 x Pipeline Depth

    SpeedUp_A / SpeedUp_B = Pipeline Depth/(0.875 x Pipeline Depth)

                          = 1.14

• Machine A is 1.14 times faster
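The arithmetic in this example can be checked with a few lines of code. This is a minimal sketch of the speedup equation, assuming a pipeline depth of 5 purely for illustration (the depth cancels in the final ratio); `pipeline_speedup` and its parameter names are not from the lecture.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x clock ratio.

    clock_ratio is cycle_time_unpipelined / cycle_time_pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5  # assumed for illustration; it cancels in the A/B ratio

# Machine A: dual-ported memory, so loads cause no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: 20% loads, each costing 1 stall cycle, but a 1.05x faster clock.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.2 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b

if __name__ == "__main__":
    print(f"SpeedUp_A = {speedup_a:.3f}")
    print(f"SpeedUp_B = {speedup_b:.3f}")
    print(f"A / B     = {ratio:.2f}")
```

Changing `DEPTH` changes both speedups but leaves the ratio untouched, which matches the slide's cancellation of Pipeline Depth.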

(14)

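Data Hazard on Register R1 (If No Forwarding)

[Figure: pipeline diagram, instruction order against time, with stages IF, ID/RF, EX, MEM, WB for the sequence below.]

    add r1,r2,r3

    sub r4,r1,r3

    and r6,r1,r7

    or  r8,r1,r9

    xor r10,r1,r11

Figure annotation: no forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.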

(15)

Three Generic Data Hazards

• Read After Write (RAW)

Instr_J tries to read operand before Instr_I writes it

    I: add r1,r2,r3

    J: sub r4,r1,r3

• Caused by a “(True) Dependence” (in compiler nomenclature). This hazard results from an actual need for communicating a new data value.

(16)

Three Generic Data Hazards

• Write After Read (WAR)

Instr_J writes operand before Instr_I reads it

    I: sub r4,r1,r3

    J: add r1,r2,r3

    K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Cannot happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and

– Register reads are always in stage 2, and

– Register writes are always in stage 5

(17)

Three Generic Data Hazards

• Write After Write (WAW)

Instr_J writes operand before Instr_I writes it.

    I: sub r1,r4,r3

    J: add r1,r2,r3

    K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.

• Cannot happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and

– Register writes are always in stage 5

• Will see WAR and WAW in more complicated pipes
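The three definitions reduce to simple comparisons of destination and source registers. The following is a minimal sketch that classifies the dependences between an earlier instruction I and a later instruction J; the `Instr` record and `dependences` function are illustrative names, and the sketch reports name-level dependences only. Whether a dependence becomes an actual hazard depends on the pipeline, as the slides note for WAR and WAW on the 5-stage MIPS.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Simplified three-operand instruction: op dest, src1, src2."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Return the dependence kinds from earlier instruction i to later instruction j."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("RAW")  # true dependence: j reads what i writes
    if j.dest in i.srcs:
        kinds.add("WAR")  # anti-dependence: j overwrites what i reads
    if i.dest == j.dest:
        kinds.add("WAW")  # output dependence: both write the same register
    return kinds


# The I/J pairs from the three slides.
raw_pair = (Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r3")))
war_pair = (Instr("sub", "r4", ("r1", "r3")), Instr("add", "r1", ("r2", "r3")))
waw_pair = (Instr("sub", "r1", ("r4", "r3")), Instr("add", "r1", ("r2", "r3")))

if __name__ == "__main__":
    for name, (i, j) in [("RAW", raw_pair), ("WAR", war_pair), ("WAW", waw_pair)]:
        print(name, "slide ->", sorted(dependences(i, j)))
```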

(18)

Forwarding to Avoid Data Hazard

[Figure: pipeline diagram, instruction order against time, for the sequence below, with forwarding paths drawn between stages.]

    add r1,r2,r3

    sub r4,r1,r3

    and r6,r1,r7

    or  r8,r1,r9

    xor r10,r1,r11

• Need no forwarding since write reg is in 1st half cycle, read reg in 2nd half cycle.

• Forwarding of ALU outputs needed as ALU inputs 1 & 2 cycles later.

• Forwarding of LW MEM outputs to SW MEM or ALU inputs 1 or 2 cycles later.

(19)

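HW Datapath Changes (in red) for Forwarding

[Figure: EX and MEM portion of the datapath between the ID/EX, EX/MEM and MEM/WR latches. Components: Registers, NextPC, Immediate, a mux on each ALU input, ALU, Data Memory, and a mux on the Data Memory input.]

What circuit detects and resolves this hazard?

Forwarding paths labelled in the figure:

• (From ALU) To forward ALU output 1 cycle to ALU inputs

• To forward ALU, MEM 2 cycles to ALU

• (From LW Data Memory) To forward MEM 1 cycle to SW MEM input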

(20)

Forwarding Avoids ALU-ALU & LW-SW Data Hazards

[Figure: pipeline diagram, instruction order against time, for the sequence below.]

    add r1,r2,r3

    lw  r4, 0(r1)

    sw  r4,12(r1)

    or  r8,r6,r9

    xor r10,r9,r11
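One common answer to the question of what circuit resolves these hazards is a forwarding unit that compares register numbers held in the pipeline latches. The sketch below shows the selection for a single ALU input. It is an assumption-laden illustration, not the circuit in the figure: the signal names (`ex_mem_rd`, `ex_mem_regwrite`, and so on) follow a common textbook convention, register 0 is treated as hardwired zero, and the MEM-to-MEM path used for LW followed by SW is not modelled.

```python
def forward_select(src_reg, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Choose where one ALU input should come from.

    src_reg         : register number the instruction in EX wants to read
    ex_mem_rd       : destination of the instruction one cycle ahead (EX/MEM latch)
    mem_wb_rd       : destination of the instruction two cycles ahead (MEM/WB latch)
    *_regwrite      : whether that older instruction writes a register at all
    Returns "EX/MEM", "MEM/WB", or "REGFILE".
    """
    # The most recent producer wins, so check the EX/MEM latch first.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # lw r4,0(r1) is in EX while add r1,r2,r3 is one stage ahead.
    print(forward_select(1, ex_mem_rd=1, ex_mem_regwrite=True,
                         mem_wb_rd=0, mem_wb_regwrite=False))
    # sw r4,12(r1) is in EX: lw (dest r4) is one ahead, add (dest r1) two ahead.
    print(forward_select(1, ex_mem_rd=4, ex_mem_regwrite=True,
                         mem_wb_rd=1, mem_wb_regwrite=True))
```

Checking the EX/MEM latch before MEM/WB matters when two older instructions write the same register: the younger of the two holds the value the program expects.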

(21)

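LW-ALU Data Hazard Even with Forwarding

[Figure: pipeline diagram, instruction order against time, for the sequence below.]

    lw  r1, 0(r2)

    sub r4,r1,r6

    and r6,r1,r7

    or  r8,r1,r9

Figure annotation: no forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.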

(22)

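Data Hazard Even with Forwarding

[Figure: pipeline diagram, instruction order against time (clock cycles), with a one-cycle bubble inserted after the load.]

    lw  r1, 0(r2)    Ifetch  Reg     ALU     DMem    Reg

    sub r4,r1,r6             Ifetch  Reg     Bubble  ALU     DMem    Reg

    and r6,r1,r7                     Ifetch  Bubble  Reg     ALU     DMem    Reg

    or  r8,r1,r9                             Bubble  Ifetch  Reg     ALU     DMem

How is this hazard detected?

Figure annotation: no forwarding is needed when the register is written in the 1st half of a cycle and read in the 2nd half of the same cycle.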
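A typical way to detect this hazard is to compare, during decode, the destination of a load currently in EX with the source registers of the instruction being decoded. The sketch below is one such check under stated assumptions: the signal names (`id_ex_mem_read`, `id_ex_rt`, `if_id_rs`, `if_id_rt`) follow a common textbook convention rather than anything in this deck, and register 0 is treated as hardwired zero.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Return True if the instruction in ID must stall one cycle.

    id_ex_mem_read : the instruction now in EX is a load
    id_ex_rt       : that load's destination register number
    if_id_rs/rt    : source register numbers of the instruction now in ID
    """
    return bool(
        id_ex_mem_read
        and id_ex_rt != 0
        and id_ex_rt in (if_id_rs, if_id_rt)
    )


if __name__ == "__main__":
    # lw r1,0(r2) is in EX; sub r4,r1,r6 is in ID and reads r1 -> bubble.
    print(load_use_stall(True, 1, 1, 6))
    # After the bubble the load has left EX, so "and r6,r1,r7" is not held back.
    print(load_use_stall(False, 4, 1, 7))
```

When the check fires, the hardware typically holds the PC and the IF/ID latch for a cycle and sends a no-op down the pipe, which is the bubble shown in the diagram.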

(23)

Software Scheduling to Avoid Load Hazards

Try producing fast code with no stalls for

    a = b + c;

    d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:

    LW  Rb,b

    LW  Rc,c

    ADD Ra,Rb,Rc    <=== Stall

    SW  a,Ra

    LW  Re,e

    LW  Rf,f

    SUB Rd,Re,Rf    <=== Stall

    SW  d,Rd

Fast code (no stalls):

    LW  Rb,b

    LW  Rc,c

    LW  Re,e

    ADD Ra,Rb,Rc

    LW  Rf,f

    SW  a,Ra

    SUB Rd,Re,Rf

    SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.

Important technique!
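The benefit of the reordering can be counted directly. The following is a minimal sketch, assuming a 5-stage pipeline with full forwarding in which the only remaining stall is a one-cycle bubble when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding `(op, dest, srcs)` and the function name are illustrative; the rule is conservative for store data, which a real pipeline can forward from MEM to MEM.

```python
def count_load_use_stalls(program):
    """Count one-cycle load-use bubbles in a straight-line instruction list.

    Each instruction is (op, dest, srcs). A stall is charged when the
    previous instruction is a load whose destination is a source of the
    current instruction.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print("slow code stalls:", count_load_use_stalls(SLOW))
    print("fast code stalls:", count_load_use_stalls(FAST))
```

Both sequences contain the same eight instructions; the fast version only moves independent loads into the slots that would otherwise be bubbles.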

(24)

Outline

• MIPS – An ISA for Pipelining

• 5 stage pipelining

• Structural and Data Hazards

• Forwarding

• Branch Schemes

• Exceptions and Interrupts

• Conclusion

(25)

5-Stage MIPS Datapath (has pipeline latches)

[Figure: pipelined MIPS datapath, stages 1 to 5, with the branch circuits highlighted in red. Components: adder (+4), Next PC, Address, RS1, RS2, Imm, Sign Extend, MUX, ALU, Zero? test, Memory, RD carried through the latches, WB Data.]

• Old simple design put branch completion in stage 4 (Mem)

• Will move red circuits to 2nd stage to make branch delays shorter

(26)

Control Hazard on Branch - Three Cycle Stall (Caused if Decide Branches in 4th Stage)

[Figure: pipeline diagram, instruction order against time, for the sequence below, with the ID/RF and MEM stages of the branch marked.]

    10: beq r1,r3,34

    14: and r2,r3,r5

    18: or  r6,r1,r7

    22: add r8,r1,r9

    34: xor r10,r1,r11

What can be done with the 3 instructions between beq & xor?

Code between beq & xor must not start until we know beq does not branch => 3 stalls

(27)

Branch Stall Impact if Commit in Stage 4

• If CPI = 1 and 15% of instructions are branches, stalling 3 cycles => new CPI = 1.45 (1 + 3 x 0.15). Too much!

• Two-part solution:

– Determine sooner whether branch taken or not, AND

– Compute taken branch address earlier

• MIPS branch tests if register = 0 or ≠ 0

• Original 1986 MIPS Solution:

– Move zero_test to the ID/RF (Inst Decode & Register Fetch) stage (stage 2; MEM is stage 4)

– Add extra adder to calculate new PC (Program Counter) in ID/RF stage

– Result is 1 clock cycle penalty for branch versus 3 when decided in MEM

(28)

New Pipelined MIPS Datapath: Faster Branch

[Figure: pipelined MIPS datapath with latches IF/ID, ID/EX, EX/MEM, MEM/WB, stages 1 to 5. The Zero? test and a second adder for the branch target now sit in stage 2. Components: adder (+4), Next SEQ PC, Next PC, Address, Memory, Reg File (RS1, RS2, RD), Imm, Sign Extend, MUXes, ALU, Data Memory, WB Data.]

• Example of interplay of instruction set design and cycle time.

• The fast_branch design needs a slightly longer stage 2 cycle time, making the clock a little slower for all stages.

(29)

Four Branch Hazard Alternatives

#1: Stall until branch direction is clearly known

#2: Predict Branch Not Taken – Easy Solution

• Execute the next instructions in sequence

• PC+4 already calculated, so use it to get next instruction

• Nullify bad instructions in pipeline if branch is actually taken

• Nullify easier since pipeline state updates are late (MEM, WB)

• 47% MIPS branches not taken on average

(30)

Four Branch Hazard Alternatives

#3: Predict Branch Taken

• 53% MIPS branches taken on average

• But have not calculated branch target address in MIPS

– MIPS still incurs 1 cycle branch penalty

– Some other CPUs: branch target known before outcome

(31)

Last of Four Branch Hazard Alternatives

#4: Delayed Branch (Used Only in 1st MIPS “Killer Micro”)

• Define branch to take place AFTER a following instruction

    branch instruction

    sequential successor_1

    sequential successor_2

    ...

    sequential successor_n

    branch target if taken

(successors 1 to n form a branch delay of length n)

• 1 slot delay allows proper decision and branch target address in 5 stage pipeline

• MIPS 1st used this (Later versions of MIPS did not; pipeline deeper)

(32)

Scheduling Branch Delay Slots

A. From before branch:

    add $1,$2,$3                     if $2=0 then

    if $2=0 then          becomes        add $1,$2,$3

        delay slot

B. From branch target:

    sub $4,$5,$6                     sub $4,$5,$6

    ...                              ...

    add $1,$2,$3          becomes    add $1,$2,$3

    if $1=0 then                     if $1=0 then

        delay slot                       sub $4,$5,$6

C. From fall through:

    add $1,$2,$3                     add $1,$2,$3

    if $1=0 then          becomes    if $1=0 then

        delay slot                       sub $4,$5,$6

    sub $4,$5,$6

• A is the best choice, fills delay slot & reduces instruction count (IC)

• In B, the sub instruction may need to be copied, increasing IC

(33)

Delayed Branch Not Used in Modern CPUs

• Compiler effectiveness is about 1/2 for a single branch delay slot:

– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots useful in computation

– Only about half (60% x 80% = 48%) of slots usefully filled; cannot fill 2 or more

(34)

Delayed Branch Not Used in Modern CPUs

• Delayed Branch downside: As processor designs use deeper pipelines and multiple issue, the branch delay grows and needs many more delay slots

– Delayed branching soon lost effectiveness and popularity compared to more expensive but more flexible dynamic approaches

– Growth in available transistors soon permitted dynamic approaches that keep records of branch locations, taken/not-taken decisions, and target addresses

• Multi-issue 2 => 3 delay slots needed, 4 => 7 slots, 8 => 15 slots

(35)

Evaluating Branch Alternatives

Assume 4% unconditional jump, 10% conditional branch-taken, 6% conditional branch-not-taken, base CPI = 1.

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

    Scheduling scheme              Branch    CPI     Speedup vs.         Speedup vs.
                                   penalty           no-pipe (5 cycles)  stall pipeline

    Stall pipeline (Stage 4)       3         1.60    3.1                 1.00

    Predict taken (Stage 2)        1         1.20    4.2                 1.33

    Predict not taken (Stage 2)    1         1.14    4.4                 1.40

    Delayed branch (Stage 2)       0.5       1.10    4.5                 1.45

Sample calculations:

    1.60 = 1 + 3(4+10+6)%

    1.20 = 1 + 1(4+10+6)%    (to calculate taken target)

    1.14 = 1 + 1(4+10)%      (refetch for jump, taken-branch)

    4.5  = 5/1.10

    1.45 = 1.6/1.1
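The table can be reproduced from the stated frequencies. This is a minimal sketch, assuming a pipeline depth of 5 (the "no-pipe 5 cycles" baseline in the table) and charging each scheme's penalty only to the instruction classes the sample calculations list; the dictionary layout and function names are illustrative.

```python
DEPTH = 5  # unpipelined machine takes 5 cycles per instruction
FREQ = {"jump": 0.04, "taken": 0.10, "not_taken": 0.06}
ALL = ("jump", "taken", "not_taken")

# scheme -> (penalty in cycles, instruction classes that pay the penalty)
SCHEMES = {
    "Stall pipeline (Stage 4)":    (3,   ALL),
    "Predict taken (Stage 2)":     (1,   ALL),
    "Predict not taken (Stage 2)": (1,   ("jump", "taken")),
    "Delayed branch (Stage 2)":    (0.5, ALL),
}


def cpi(penalty, classes):
    """CPI = 1 + penalty x (total frequency of the classes that pay it)."""
    return 1 + penalty * sum(FREQ[c] for c in classes)


def table():
    """Return rows of (scheme, penalty, CPI, speedup vs no-pipe, speedup vs stall)."""
    stall_cpi = cpi(*SCHEMES["Stall pipeline (Stage 4)"])
    rows = []
    for name, (penalty, classes) in SCHEMES.items():
        c = cpi(penalty, classes)
        rows.append((name, penalty, c, DEPTH / c, stall_cpi / c))
    return rows


if __name__ == "__main__":
    for name, penalty, c, vs_nopipe, vs_stall in table():
        print(f"{name:<30}{penalty:<6}{c:<8.2f}{vs_nopipe:<8.1f}{vs_stall:.2f}")
```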

(36)

Another Problem with Pipelining

• Exception: An unusual event happens to an instruction during its execution {caused by instructions executing}

– Examples: divide by zero, undefined opcode

• Interrupt: Hardware signal to switch the processor to a new instruction stream {not directly caused by code}

– Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)

• Precise Interrupt Problem: Must seem as if the exception or interrupt appeared between 2 instructions (I_i and I_{i+1}) although several instructions were executing at the time

– All instructions up to and including I_i are totally completed

– No effect of any instruction after I_i is allowed to be saved

• After a precise interrupt, the interrupt (exception) handler either aborts the program or restarts at instruction I_{i+1}

(37)

Precise Exceptions in Static Pipelines

[Figure: pipeline with stages F (Fetch), D (Decode), E (Execute), M (Memory), W (register write).]

Key observation: “Architected” states change only in memory (M) and register write (W) stages.

(38)

And In Conclusion: Control and Pipelining

• Quantify and summarize performance

– Ratios, Geometric Mean, Multiplicative Standard Deviation

• F&P: Benchmarks age, disks fail, single-point failure

• Control via State Machines and Microprogramming

• Just overlap tasks; easy if tasks are independent

• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

• Hazards limit performance on computers by stalling:

– Structural: need more HW resources

– Data (RAW, WAR, WAW): need forwarding, compiler scheduling

– Control: delayed branch or branch (taken/not-taken) prediction

• Exceptions and interrupts add complexity

(39)

Homework

• C.1