Lecture 7: Midterm Review
• Quantitative Principle of Computer Design
• ISA Design
• Pipeline
Quantitative Principle of Computer Design
Performance Terminology
"X is n% faster than Y" means:

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

  n = 100 × (Performance(X) − Performance(Y)) / Performance(Y)
Example: Y takes 15 seconds to complete a task,
X takes 10 seconds. What % faster is X?
Example

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 15 / 10 = 1.5

  n = 100 × (1.5 − 1.0) / 1.0

  n = 50%
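A quick numeric check of this example (a minimal sketch; the helper name percent_faster is ours, not from the lecture):

```python
def percent_faster(extime_x, extime_y):
    """Return n such that X is n% faster than Y, given their execution times."""
    return 100 * (extime_y / extime_x - 1)

# Y takes 15 s, X takes 10 s
print(percent_faster(10, 15))   # 50.0 -> X is 50% faster than Y
```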
Amdahl's Law
Speedup due to enhancement E:
  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:
  ExTime(E) = ExTime × ((1 − F) + F / S)

  Speedup(E) = 1 / ((1 − F) + F / S)
Amdahl's Law

  ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Amdahl's Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
  Speedup_overall = ?

  ExTime_new = ?
Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
  Speedup_overall = 1 / 0.95 = 1.053

  ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
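The same numbers fall out of a one-line helper; this is a small sketch of Amdahl's Law, with the function name chosen here for illustration:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the time is sped up by speedup_enhanced."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP improved 2x, but only 10% of the time is FP
print(round(amdahl_speedup(0.10, 2), 3))   # 1.053
```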
Aspects of CPU Performance
  CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
              Inst Count   CPI   Clock Rate
Program           X
Compiler          X
Inst. Set         X          X
Organization                 X        X
Technology                            X
Base Machine (Reg/Reg): Typical Mix

Op        Freq   Cycles
ALU       50%    1
Load      20%    2
Store     10%    2
Branch    20%    2
Example
Add register/memory operations:
  – one source operand in memory
  – one source operand in register
  – cycle count of 2
Branch cycle count increases to 3.
What fraction of the loads must be eliminated for this to pay off?
Example Solution

Exec Time = Instr Cnt × CPI × Clock

                --- Old ---           --- New ---
Op        Freq   Cycles   CPI    Freq    Cycles   CPI
ALU       .50     1       .5     .5-X     1       .5-X
Load      .20     2       .4     .2-X     2       .4-2X
Store     .10     2       .2     .1       2       .2
Branch    .20     2       .4     .2       3       .6
Reg/Mem                           X       2       2X
Total    1.00            1.5     1-X             (1.7-X)/(1-X)

(The new cycle contributions sum to 1.7 - X; dividing by the new instruction count 1 - X gives the new CPI.)

Instr Cnt_old × CPI_old × Clock_old = Instr Cnt_new × CPI_new × Clock_new
1.00 × 1.5 = (1 - X) × (1.7 - X)/(1 - X)
1.5 = 1.7 - X
X = 0.2

ALL loads must be eliminated for this to be a win!
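The break-even point can be checked with a short script (a sketch; variable names are ours, and x is the fraction of the original instruction count that was loads eliminated by reg/mem operations):

```python
# Old machine: CPI of the typical mix (ALU 50%/1, Load 20%/2, Store 10%/2, Branch 20%/2)
cpi_old = 0.50 * 1 + 0.20 * 2 + 0.10 * 2 + 0.20 * 2      # 1.5
exec_time_old = 1.00 * cpi_old                            # relative to IC = 1, same clock

def exec_time_new(x):
    """Relative execution time after replacing a load + ALU pair with one 2-cycle
    reg/mem op for a fraction x of the original instructions (branches now 3 cycles)."""
    cycles = (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2   # = 1.7 - x
    return cycles    # instruction count (1 - x) cancels against CPI's denominator

for x in (0.0, 0.1, 0.2):
    print(x, exec_time_new(x))    # equals exec_time_old = 1.5 only at x = 0.2
```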
Integrated Circuits Costs

  IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

  Die cost = Wafer cost / (Dies per wafer × Die yield)

  Dies per wafer = π × (Wafer_diam / 2)² / Die_Area − π × Wafer_diam / (2 × Die_Area)^(1/2)

  Die yield = Wafer yield × (1 + Defects_per_unit_area × Die_Area / α)^(−α)

Die cost goes roughly with die area^4.
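As a rough illustration of how these formulas fit together (the numeric parameters below are made up, not from the lecture):

```python
import math

def dies_per_wafer(wafer_diam, die_area):
    return (math.pi * (wafer_diam / 2) ** 2) / die_area \
           - (math.pi * wafer_diam) / math.sqrt(2 * die_area)

def die_yield(wafer_yield, defects_per_area, die_area, alpha):
    return wafer_yield * (1 + defects_per_area * die_area / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam, die_area, wafer_yield, defects_per_area, alpha):
    return wafer_cost / (dies_per_wafer(wafer_diam, die_area)
                         * die_yield(wafer_yield, defects_per_area, die_area, alpha))

# Made-up numbers: $1000 wafer, 20 cm diameter, 1 cm^2 die, 1 defect/cm^2, alpha = 3
print(round(die_cost(1000, 20, 1.0, 1.0, 1.0, 3.0), 2))
```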
Instruction Set Design
The instruction set is the interface between software and hardware.
Basic ISA Classes
Accumulator:
  add A            acc ← acc + mem[A]
Stack:
  add              top ← top + next
General Purpose Register (register-memory):
  add R1, A        R1 ← R1 + mem[A]
Load/Store (register-register):
  add Ra, Rb, Rc   Ra ← Rb + Rc
  load Ra, Rb      Ra ← mem[Rb]
  store Ra, Rb     mem[Rb] ← Ra
Complex Instruction Set Computer (CISC)
• Design goals:
  – simplify compilation of high-level languages
  – optimize code size
• Variable format, 2- and 3-operand instructions
• Rich set of addressing modes (apply to any operand)
• Rich set of operations
• Rich set of data types (B, W, L, Q, O, F, D, G, H)
• Condition codes
• Examples: VAX, Intel
• Problem: increased hardware design complexity!
Reduced Instruction Set Architecture
• Instruction set simplicity leads to a faster machine
  – efficient pipelining
• 32-bit fixed-format instructions (3 formats)
• 32 32-bit GPRs
• 3-operand, reg-reg arithmetic instructions
• Very few addressing modes for load/store
  – displacement
  – immediate
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, MC88100, AMD2900, i960, i860, PA-RISC, DEC Alpha, Clipper, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: DLX

I-type instruction:   Op (6) | rs (5) | rd (5) | Immediate (16)
  Load, store, conditional branch
  example: load rd, mem(rs + immediate)

R-type instruction:   Op (6) | rs1 (5) | rs2 (5) | rd (5) | func (11)
  Register-register ALU operations: rd ← rs1 func rs2

J-type instruction:   Op (6) | Offset added to PC (26)
  Jump and jump-link
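A small sketch of packing the I-type fields into a 32-bit word (the field layout follows the format above; the opcode value used is hypothetical):

```python
def encode_i_type(op, rs, rd, immediate):
    """Pack Op(6) | rs(5) | rd(5) | Immediate(16) into one 32-bit instruction word."""
    assert 0 <= op < 2**6 and 0 <= rs < 2**5 and 0 <= rd < 2**5
    return (op << 26) | (rs << 21) | (rd << 16) | (immediate & 0xFFFF)

# hypothetical opcode 0x23 for "load rd, mem(rs + immediate)"
print(hex(encode_i_type(0x23, 2, 1, 100)))   # 0x8c410064
```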
5 Steps of MIPS Datapath
Figure 3.4, Page 134 , CA:AQA 2e
[Pipeline datapath diagram: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back stages, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Visualizing Pipelining
Figure 3.3, Page 133 , CA:AQA 2e
[Diagram: successive instructions overlapped in time; in each of cycles 1-7 a different instruction occupies the Ifetch, Reg, ALU, DMem, and Reg (write-back) stages]
Ideal speedup from pipelining
• Ideal case: an instruction is issued every cycle
  – ideal pipelined CPI = 1
What is the reality?
• Pipeline overhead
  – pipeline register delay and clock skew
  – Clock Cycle_pipelined can be larger than Clock Cycle_unpipelined / Pipeline depth (see the sketch after this list)
• Pipeline hazards
  – prevent the CPU from issuing one instruction every cycle
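A tiny illustration of the overhead point (the stage delays and overhead below are made-up numbers):

```python
stage_delays = [10, 8, 10, 10, 7]     # ns per stage, hypothetical
overhead = 1                          # pipeline register delay + clock skew, ns

cc_unpipelined = sum(stage_delays)               # 45 ns
cc_pipelined = max(stage_delays) + overhead      # 11 ns, larger than 45/5 = 9 ns

print(cc_unpipelined, cc_pipelined, cc_unpipelined / len(stage_delays))
```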
Hazards
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved.
How do hazards affect performance?
  CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

  Speedup = (CPI_unpipelined × Clock Cycle_unpipelined) / (CPI_pipelined × Clock Cycle_pipelined)

          = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)

  Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
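A numeric sketch of the last formula (the helper name is ours; clock_ratio stands for Clock Cycle_unpipelined / Clock Cycle_pipelined and is taken as 1 here):

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = Pipeline depth / (1 + Pipeline stall CPI) x clock cycle ratio."""
    return depth / (1 + stall_cpi) * clock_ratio

print(pipeline_speedup(5, 0.0))   # ideal 5-stage pipeline -> 5.0
print(pipeline_speedup(5, 0.5))   # 0.5 stall cycles per instruction -> ~3.33
```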
Three Generic Data Hazards
• Read After Write (RAW)
    Add r1, r2, r3
    Add r1, r1, r2
• Write After Read (WAR)
    Add r2, r1, r3
    Add r1, r4, r5
• Write After Write (WAW)
    Add r1, r2, r3
    Add r1, r4, r5
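A toy classifier for the three cases above (a sketch; representing each instruction as a (destination, sources) tuple is our convention, not DLX syntax):

```python
def classify_hazards(first, second):
    """first and second are (dest, [sources]) register tuples; `second` issues after `first`."""
    hazards = []
    dest1, srcs1 = first
    dest2, srcs2 = second
    if dest1 in srcs2:
        hazards.append("RAW")   # second reads a register that first writes
    if dest2 in srcs1:
        hazards.append("WAR")   # second writes a register that first reads
    if dest1 == dest2:
        hazards.append("WAW")   # both write the same register
    return hazards

print(classify_hazards(("r1", ["r2", "r3"]), ("r1", ["r1", "r2"])))   # ['RAW', 'WAW']
print(classify_hazards(("r2", ["r1", "r3"]), ("r1", ["r4", "r5"])))   # ['WAR']
```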
Forwarding to Avoid Data Hazard
Figure 3.10, Page 149 , CA:AQA 2e
Instruction sequence:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Diagram: the result of the add is forwarded from the ALU / pipeline registers to the later instructions that read r1, avoiding stalls]
Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
Branch Operation
Figure 3.4, Page 134 , CA:AQA 2e
[Same five-stage datapath diagram: the branch target is computed by the ALU and the Zero? condition is tested in EX, and the new PC is selected in MEM, so a taken branch is resolved late in the pipeline]
Control Hazard on Branches Three Stage Stall
Cycle:            1    2      3      4      5    6    7    8    9
Branch Inst       IF   ID     EX     MEM    WB
Br successor           IF     stall  stall  IF   ID   EX   MEM  WB
Br successor+1                                   IF   ID   EX   ...
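Plugging this three-cycle penalty into the stall-CPI formula from earlier, with the 20% branch frequency from the typical mix (a sketch, assuming every branch pays the full penalty):

```python
branch_freq = 0.20      # from the typical instruction mix
branch_penalty = 3      # cycles lost per branch with the original datapath

cpi = 1 + branch_freq * branch_penalty   # ideal CPI + stall cycles per instruction
print(cpi)                               # 1.6
```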
New Pipelined DLX Datapath
Figure 3.22, page 163, CA:AQA 2/e
[Revised datapath diagram: an extra adder and the Zero? test are moved into the ID stage, so the branch target and branch decision are available at the end of ID]
Control Hazard on Branches One Stage Stall
Cycle:            1    2    3    4    5    6    7    8
Branch Inst       IF   ID   EX   MEM  WB
Br successor           IF   IF   ID   EX   MEM  WB
Br successor+1              IF   ID   EX   ...
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– squash instructions in pipeline if branch actually taken
#3: Predict Branch Taken
  – can't implement in the DLX pipeline (the branch target is not known any earlier than the branch outcome)
Four Branch Hazard Alternatives
#4: Delayed Branch
  – schedule instructions into the branch-delay slots:
      branch instruction
      sequential successor_1
      sequential successor_2
      ...
      sequential successor_n        ← branch delay of length n
      branch target if taken
  – a 1-slot delay allows proper decision and branch target address in the 5-stage pipeline
  – DLX uses this