(1)

Computer Organization & Assembly Languages

Computer Organization (I) Fundamentals

Pu-Jen Cheng

(2)

Materials

Some materials used in this course are adapted from

¾ The slides prepared by Kip Irvine for the book, Assembly Language for Intel-Based Computers, 5th Ed.

¾ The slides prepared by S. Dandamudi for the book, Fundamentals of Computer Organization and Design.

¾ The slides prepared by S. Dandamudi for the book, Introduction to Assembly Language Programming, 2nd Ed.

¾ Introduction to Computer Systems, CMU (http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f05/www/)

¾ Assembly Language & Computer Organization, NTU (http://www.csie.ntu.edu.tw/~cyy/courses/assembly/05fall/news/)

(http://www.csie.ntu.edu.tw/~acpang/course/asm_2004)

(3)

Outline

General Concepts of Computer Organization

¾ Overview of Microcomputer

CPU, Memory, I/O

Instruction Execution Cycle

¾ Central Processing Unit (CPU)

CISC vs. RISC

6 Instruction Set Design Issues

¾ How Hardware Executes Processor's Instructions

Digital Logic Design (Combinational & Sequential Circuits)

Microprogrammed Control

¾ Pipelining

3 Hazards

3 technologies for performance improvement

¾ Memory

Data Alignment

2 Design Issues (Cache, Virtual Memory)

¾ I/O Devices

(4)

General Concepts of Computer Organization

Overview of Microcomputer

(5)

Von Neumann Machine, 1945

Memory, Input/Output, Arithmetic/Logic Unit, Control Unit

Stored-program Model

¾ Both data and programs are stored in the same main memory

Sequential Execution

http://www.virtualtravelog.net/entries/2003-08-TheFirstDraft.pdf

(6)

What is Microcomputer

Microcomputer

¾ A computer with a microprocessor (µP) as its central processing unit (CPU)

Microprocessor (µP)

¾ A digital electronic component with transistors on a single semiconductor integrated circuit (IC)

¾ One or more microprocessors typically serve as a central processing unit (CPU) in a computer system or handheld device.

(7)

Components of Microcomputer

(8)

Basic Microcomputer Design

[Figure: block diagram. The Central Processor Unit (CPU), containing registers, ALU, CU, and clock, is connected to the Memory Storage Unit, I/O Device #1, and I/O Device #2 through the data bus, control bus, and address bus]

(9)

CPU

Arithmetic and logic unit (ALU) performs arithmetic (add, subtract) and logical (AND, OR, NOT) operations

Registers store data and instructions used by the processor

Control unit (CU) coordinates sequence of execution steps

¾ Fetch instructions from memory, decode them to find their types

Clock

Datapath consists of registers and ALU(s)

(10)

Datapath

[Figure: datapath of a RISC processor, with registers supplying operands to the ALU inputs and receiving the ALU output]

Program Counter (PC) (or Instruction Pointer (IP))

Instruction Register (IR)

Memory Address Register (MAR)

Memory Data Register (MDR)

(11)

Clock

Provide timing signal and the basic unit of time

Synchronize all CPU and BUS operations

Machine (clock) cycle measures time of a single operation

Clock is used to trigger events

Clock period = 1 / clock frequency (e.g., 1 GHz → clock cycle = 1 ns)

An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles

[Figure: clock waveform alternating between 1 and 0, with one cycle marked]
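The relation between clock frequency, clock period, and instruction time can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the 5 MHz clock used for the 50-cycle multiply is an assumed value, not one given on the slide.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Return the clock period in nanoseconds: period = 1 / frequency."""
    return 1e9 / frequency_hz


def instruction_time_ns(cycles: int, frequency_hz: float) -> float:
    """Time taken by an instruction that needs `cycles` clock cycles."""
    return cycles * clock_period_ns(frequency_hz)


if __name__ == "__main__":
    # 1 GHz clock: one cycle lasts 1 ns.
    print(clock_period_ns(1e9))
    # A 50-cycle multiply on an assumed (hypothetical) 5 MHz clock.
    print(instruction_time_ns(50, 5e6))
```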

(12)

Memory, I/O, System Bus

Main/primary memory (random access memory, RAM)

stores both program instructions and data

I/O devices

¾ Interface: I/O controller

¾ User interface: keyboard, display screen, printer, modem, …

¾ Secondary storage: disk

¾ Communication network

System Bus

¾ A bunch of parallel wires

¾ Transfer data among the components

¾ Address bus (determine the amount of physical memory addressable)

¾ Data bus (indicate the size of the data transferred)

¾ Control bus (consists of control signals: memory/IO read/write, interrupt, bus request/grant)

(13)

Instruction Execution Cycle

Execution Cycle

¾ Fetch (IF): CU fetches next instruction, advance PC/IP

¾ Decode (ID): CU determines what the instruction will do

¾ Execute

Fetch operands (OF): (memory operand needed) read value from memory

Execute the instruction (IE)

Store output operand (WB): (memory operand needed) write result to memory
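The five steps (IF, ID, OF, IE, WB) can be sketched as a loop over a toy program. The instruction format below, a tuple `(opcode, dest, src1, src2)` with memory operands named by strings, is a hypothetical one chosen for illustration and does not correspond to any real instruction set.

```python
def run(program, memory):
    """Run a toy program; each instruction is (opcode, dest, src1, src2)."""
    pc = 0  # program counter (PC/IP)
    while pc < len(program):
        instruction = program[pc]            # IF: fetch the next instruction
        pc += 1                              #     and advance the PC
        opcode, dest, src1, src2 = instruction   # ID: decode the fields
        x, y = memory[src1], memory[src2]    # OF: read the memory operands
        if opcode == "add":                  # IE: execute the instruction
            result = x + y
        elif opcode == "sub":
            result = x - y
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        memory[dest] = result                # WB: write the result to memory
    return memory


if __name__ == "__main__":
    mem = {"b": 2, "c": 3, "e": 1}
    print(run([("add", "t", "b", "c"), ("sub", "a", "t", "e")], mem))
```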

(14)

Instruction Execution Cycle (cont.)

[Figure: the PC selects an instruction (I-1 to I-4) in program memory; the instruction is fetched into the instruction register and decoded; operands (op1, op2) are read from registers or memory; the ALU executes; results are written to registers, memory, and flags (output). Steps shown: Fetch, Decode, Fetch operands, Execute, Store output]

(15)

Introduction to Digital Logic Design

¾ See asm_ch2_dl.ppt

(16)

CPU

(17)

CPU

CISC vs. RISC

6 Instruction Set Design Issues

¾ Number of Addresses

¾ Flow of Control

¾ Operand Types

¾ Addressing Modes

¾ Instruction Types

¾ Instruction Formats

(18)

Processor

RISC and CISC designs

¾ Reduced Instruction Set Computer (RISC)

Simple instructions, small instruction set

Operands are assumed to be in processor registers

Not in memory

Simplify design (e.g., fixed instruction size)

Examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq)

¾ Complex Instruction Set Computer (CISC)

Complex instructions, large instruction set

Operands can be in registers or memory

Instruction size varies

Typically use a microprogram

Example: Intel 80x86 family

(19)

Processor (cont.)

(20)

Processor (cont.)

Variations of the ISA level can be implemented by changing the microprogram

(21)

Instruction Set Design Issues

Number of Addresses

Flow of Control

Operand Types

Addressing Modes

Instruction Types

Instruction Formats

(22)

Number of Addresses

Four categories

¾ 3-address machines

2 for the source operands and one for the result

¾ 2-address machines

One address doubles as source and result

¾ 1-address machines

Accumulator machines

Accumulator is used for one source and result

¾ 0-address machines

Stack machines

Operands are taken from the stack

Result goes onto the stack

(23)

Number of Addresses (cont.)

Three-address machines

¾ Two for the source operands, one for the result

¾ RISC processors use three addresses

¾ Sample instructions

add dest,src1,src2 ; M(dest)=[src1]+[src2]
sub dest,src1,src2 ; M(dest)=[src1]-[src2]
mult dest,src1,src2 ; M(dest)=[src1]*[src2]

(24)

Number of Addresses (cont.)

Example

¾ C statement

A = B + C * D - E + F + A

¾ Equivalent code:

mult T,C,D ;T = C*D
add T,T,B ;T = B+C*D
sub T,T,E ;T = B+C*D-E
add T,T,F ;T = B+C*D-E+F
add A,T,A ;A = B+C*D-E+F+A

(25)

Number of Addresses (cont.)

Two-address machines

¾ One address doubles (for source operand & result)

¾ Last example makes a case for it

Address T is used twice

¾ Sample instructions

load dest,src ; M(dest)=[src]
add dest,src ; M(dest)=[dest]+[src]
sub dest,src ; M(dest)=[dest]-[src]
mult dest,src ; M(dest)=[dest]*[src]

(26)

Number of Addresses (cont.)

Example

¾ C statement

A = B + C * D - E + F + A

¾ Equivalent code:

load T,C ;T = C
mult T,D ;T = C*D
add T,B ;T = B+C*D
sub T,E ;T = B+C*D-E
add T,F ;T = B+C*D-E+F
add A,T ;A = B+C*D-E+F+A

(27)

Number of Addresses (cont.)

One-address machines

¾ Use special set of registers called accumulators

Specify one source operand & receive the result

¾ Called accumulator machines

¾ Sample instructions

load addr ; accum = [addr]
store addr ; M[addr] = accum
add addr ; accum = accum + [addr]
sub addr ; accum = accum - [addr]
mult addr ; accum = accum * [addr]

(28)

Number of Addresses (cont.)

Example

¾ C statement

A = B + C * D - E + F + A

¾ Equivalent code:

load C ;load C into accum
mult D ;accum = C*D
add B ;accum = C*D+B
sub E ;accum = B+C*D-E
add F ;accum = B+C*D-E+F
add A ;accum = B+C*D-E+F+A
store A ;store accum contents in A
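To see that the one-address code computes A = B + C * D - E + F + A, the sequence can be replayed on a small interpreter. This is a sketch under stated assumptions: the interpreter, the name `ACCUMULATOR_PROGRAM`, and the sample values in memory are all invented for illustration.

```python
# The one-address program from the example, as (opcode, address) pairs.
ACCUMULATOR_PROGRAM = [
    ("load", "C"), ("mult", "D"), ("add", "B"), ("sub", "E"),
    ("add", "F"), ("add", "A"), ("store", "A"),
]


def run_accumulator(program, memory):
    """Interpret load/store/add/sub/mult on a single accumulator."""
    accum = 0
    for opcode, addr in program:
        if opcode == "load":
            accum = memory[addr]
        elif opcode == "store":
            memory[addr] = accum
        elif opcode == "add":
            accum = accum + memory[addr]
        elif opcode == "sub":
            accum = accum - memory[addr]
        elif opcode == "mult":
            accum = accum * memory[addr]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


if __name__ == "__main__":
    values = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    expected = values["B"] + values["C"] * values["D"] - values["E"] + values["F"] + values["A"]
    print(run_accumulator(ACCUMULATOR_PROGRAM, values)["A"] == expected)
```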

(29)

Number of Addresses (cont.)

Zero-address machines

¾ Stack supplies operands and receives the result

Special instructions to load and store use an address

¾ Called stack machines (Ex: HP3000, Burroughs B5500)

¾ Sample instructions

push addr ; push([addr])
pop addr ; pop([addr])
add ; push(pop + pop)
sub ; push(pop - pop)
mult ; push(pop * pop)

(30)

Number of Addresses (cont.)

Example

¾ C statement

A = B + C * D - E + F + A

¾ Equivalent code:

push E
push C
push D
mult
push B
add
sub
push F
add
push A
add
pop A
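The zero-address sequence can be traced with a short stack interpreter. This sketch assumes that in `push(pop - pop)` the first pop (the top of the stack) is the left operand, which is the reading under which the code above yields B + C * D - E. The interpreter, `STACK_PROGRAM`, and the sample values are hypothetical.

```python
# The zero-address program from the example, one instruction per string.
STACK_PROGRAM = [
    "push E", "push C", "push D", "mult", "push B", "add",
    "sub", "push F", "add", "push A", "add", "pop A",
]


def run_stack_machine(program, memory):
    """Interpret push/pop/add/sub/mult on an operand stack."""
    stack = []
    for line in program:
        op, *args = line.split()
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "pop":
            memory[args[0]] = stack.pop()
        else:
            # Assumed order: first pop (top of stack) is the left operand.
            top, below = stack.pop(), stack.pop()
            if op == "add":
                stack.append(top + below)
            elif op == "sub":
                stack.append(top - below)
            elif op == "mult":
                stack.append(top * below)
            else:
                raise ValueError(f"unknown opcode: {op}")
    return memory


if __name__ == "__main__":
    values = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
    print(run_stack_machine(STACK_PROGRAM, values)["A"])
```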

(31)

Load/Store Architecture

Instructions expect operands in internal processor registers

¾ Special LOAD and STORE instructions move data between registers and memory

¾ RISC uses this architecture

¾ Reduces instruction length

(32)

Load/Store Architecture (cont.)

Sample instructions

load Rd,addr ;Rd = [addr]
store addr,Rs ;(addr) = Rs
add Rd,Rs1,Rs2 ;Rd = Rs1 + Rs2
sub Rd,Rs1,Rs2 ;Rd = Rs1 - Rs2
mult Rd,Rs1,Rs2 ;Rd = Rs1 * Rs2

(33)

Number of Addresses (cont.)

Example

¾ C statement

A = B + C * D - E + F + A

¾ Equivalent code:

load R1,B
load R2,C
load R3,D
load R4,E
load R5,F
load R6,A
mult R2,R2,R3
add R2,R2,R1
sub R2,R2,R4
add R2,R2,R5
add R2,R2,R6
store A,R2

(34)

Flow of Control

Default is sequential flow

Several instructions alter this default execution

¾ Branches

Unconditional

Conditional

Delayed branches

¾ Procedure calls

Delayed procedure calls

(35)

Flow of Control (cont.)

Branches

¾ Unconditional

Absolute address

PC-relative

Target address is specified relative to PC contents

Relocatable code

¾ Example: MIPS

Absolute address

j target

PC-relative

b target

(36)

Flow of Control (cont.)

e.g., Pentium e.g., SPARC

(37)

Flow of Control (cont.)

Branches

¾ Conditional

Jump is taken only if the condition is met

¾ Two types

Set-Then-Jump

Condition testing is separated from branching

Condition code registers are used to convey the condition test result

Condition code registers keep a record of the status of the last ALU operation such as overflow condition

Example: Pentium code

cmp AX,BX ; compare AX and BX
je target ; jump if equal

(38)

Flow of Control (cont.)

Test-and-Jump

Single instruction performs condition testing and branching

Example: MIPS instruction

beq Rsrc1,Rsrc2,target

Jumps to target if Rsrc1 = Rsrc2

Delayed branching

¾ Control is transferred after executing the instruction that follows the branch instruction

This instruction slot is called the delay slot

¾ Improves efficiency

¾ Supported by highly pipelined RISC processors

(39)

Flow of Control (cont.)

Procedure calls

¾ Facilitate modular programming

¾ Require two pieces of information to return

End of procedure

Pentium

uses ret instruction

MIPS

uses jr instruction

Return address

In a (special) register

MIPS allows any general-purpose register

On the stack

Pentium

(40)

Flow of Control (cont.)

(41)

Flow of Control (cont.)

Delay slot

(42)

Parameter Passing

Two basic techniques

¾ Register-based (e.g., PowerPC, MIPS)

Internal registers are used

Faster

Limit the number of parameters

Recursive procedure

¾ Stack-based (e.g., Pentium)

Stack is used

More general

(43)

Operand Types

Instructions support basic data types

¾ Characters

¾ Integers

¾ Floating-point

Instruction overload

¾ Same instruction for different data types

¾ Example: Pentium

mov AL,address ;loads an 8-bit value
mov AX,address ;loads a 16-bit value
mov EAX,address ;loads a 32-bit value

(44)

Operand Types

Separate instructions

¾ Instructions specify the operand size

¾ Example: MIPS

lb Rdest,address ;loads a byte
lh Rdest,address ;loads a halfword (16 bits)
lw Rdest,address ;loads a word (32 bits)
ld Rdest,address ;loads a doubleword (64 bits)

Similar instruction: store

(45)

Addressing Modes

How the operands are specified

¾ Operands can be in three places

Registers

Register addressing mode

Part of instruction

Constant

Immediate addressing mode

All processors support these two addressing modes

Memory

Difference between RISC and CISC

CISC supports a large variety of addressing modes

RISC follows load/store architecture

(46)

Instruction Types

Several types of instructions

¾ Data movement

Pentium: mov dest,src

Some do not provide direct data movement instructions

Indirect data movement

add Rdest,Rsrc,0 ;Rdest = Rsrc+0

¾ Arithmetic and Logical

Arithmetic

Integer and floating-point, signed and unsigned

add, subtract, multiply, divide

Logical

and, or, not, xor

(47)

Instruction Types (cont.)

Condition code bits

¾ S: Sign bit (0 = +, 1 = -)

¾ Z: Zero bit (0 = nonzero, 1 = zero)

¾ O: Overflow bit (0 = no overflow, 1 = overflow)

¾ C: Carry bit (0 = no carry, 1 = carry)

Example: Pentium

cmp count,25 ;compare count to 25
             ;subtract 25 from count
je target ;jump if equal
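How a compare sets the four condition code bits can be sketched by performing the subtraction on fixed-width operands. The sketch assumes 8-bit operands and the x86 convention that the carry bit acts as a borrow on subtraction; `compare_flags` is a hypothetical helper, not a processor or library function.

```python
def compare_flags(a: int, b: int, bits: int = 8) -> dict:
    """Condition codes produced by computing a - b on `bits`-bit operands."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    return {
        "S": int(bool(result & sign)),  # sign bit of the result
        "Z": int(result == 0),          # result is zero, i.e. a == b
        # Signed overflow: a and b differ in sign and the result's sign differs from a's.
        "O": int(bool((a ^ b) & (a ^ result) & sign)),
        # Carry acts as borrow (assumed x86 convention): unsigned a < unsigned b.
        "C": int(a < b),
    }


if __name__ == "__main__":
    # je is taken when Z = 1, i.e. when count equals 25.
    print(compare_flags(25, 25))
    print(compare_flags(10, 25))
```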

(48)

Instruction Types (cont.)

¾ Flow control and I/O instructions

Branch

Procedure call

Interrupts

¾ I/O instructions

Memory-mapped I/O

Most processors support memory-mapped I/O

No separate instructions for I/O

Isolated I/O

Pentium supports isolated I/O

Separate I/O instructions

in AX,io_port ;read from an I/O port
out io_port,AX ;write to an I/O port

(49)

Instruction Formats

Two types

¾ Fixed-length

Used by RISC processors

32-bit RISC processors use 32-bit-wide instructions

Examples: SPARC, MIPS, PowerPC

¾ Variable-length

Used by CISC processors

Memory operands need more bits to specify

Opcode

¾ Major and exact operation

(50)

Examples of Instruction Formats

(51)

How Hardware Executes Processor’s Instructions

(52)

How Hardware Executes Processor’s Instructions

Digital Logic Design

¾ Combinational and Sequential Circuits

Microprogrammed Control

(53)

Virtual Machines

Abstractions for computers

Level 5: High-Level Language (machine-independent)

Level 4: Assembly Language (machine-specific)

Level 3: Operating System

Level 2: Instruction Set Architecture

Level 1: Microarchitecture

Level 0: Digital Logic

(54)

Basic Microcomputer Design

[Figure: block diagram. The Central Processor Unit (CPU), containing registers, ALU, CU, and clock, is connected to the Memory Storage Unit, I/O Device #1, and I/O Device #2 through the data bus, control bus, and address bus]

(55)

Consider 1-bus Datapath

Assume all entities are 32 bits wide

(56)

1-bit ALU

(57)

ALU Circuit in 1-bus Datapath

(58)

Memory Interface Implementation

(59)

Microprogrammed Control

32 32-bit general-purpose registers

¾ Interface only with the A-bus

¾ Each register has two control signals

Gxin and Gxout

Control signals used by the other registers

¾ PC register:

PCin, PCout, and PCbout

¾ IR register:

IRout and IRbin

¾ MAR register:

MARin, MARout, and MARbout

¾ MDR register:

MDRin, MDRout, MDRbin and MDRbout

(60)

Microprogrammed Control (cont.)

add %G9,%G5,%G7

Implemented as

Transfer G5 contents to A register

Assert G5out and Ain

Place G7 contents on the A bus

Assert G7out

Instruct ALU to perform addition

Appropriate ALU function control signals

Latch the result in the C register

Assert Cin

Transfer contents of the C register to G9

Assert Cout and G9in

(61)

Microprogrammed Control (cont.)

Instruction Fetch

Implemented as

PCbout: read: PCout: ALU=add4: Cin;

read: Cout: PCin;

read: IRbin;

Decodes the instruction and jumps to the appropriate execution routine

(62)

Microprogrammed Control (cont.)

Example instruction groups

¾ Load/store

Moves data between registers and memory

¾ Register

Arithmetic and logic instructions

¾ Branch

Jump direct/indirect

¾ Call

Procedure invocation mechanisms

¾ More…

(63)

Microprogrammed Control (cont.)

High-level FSM for instruction execution

FSM: finite state machine

(64)

Microprogrammed Control (cont.)

Software implementation

¾ Typically used in CISC

Hardware implementation (PLA) is complex and expensive

Example

add %G9,%G5,%G7

¾ Three steps

S1 G5out: Ain;

S2 G7out: ALU=add: Cin;

S3 Cout: G9in: end;
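The three steps S1 to S3 can be traced with a toy model of the 1-bus datapath. This is a simplified sketch, not the slide's hardware: it assumes one source drives the bus per step, the ALU adds the A register to the bus value, and the `end` signal only marks the last step. The function name and the register values are hypothetical.

```python
def run_microroutine(steps, regs):
    """Apply each step's asserted control signals to a toy 1-bus datapath."""
    a_reg = c_reg = alu = 0
    for signals in steps:
        bus = 0
        # Assumption: exactly one source drives the shared bus in a step.
        for s in signals:
            if s == "Cout":
                bus = c_reg
            elif s.endswith("out"):
                bus = regs[s[:-3]]           # e.g. "G5out" drives G5 onto the bus
        if "ALU=add" in signals:
            alu = (a_reg + bus) & 0xFFFFFFFF  # 32-bit addition
        if "Cin" in signals:
            c_reg = alu                       # latch the ALU output in C
        if "Ain" in signals:
            a_reg = bus                       # latch the bus value in A
        for s in signals:
            if s.endswith("in") and s not in ("Ain", "Cin"):
                regs[s[:-2]] = bus            # e.g. "G9in" loads G9 from the bus
    return regs


# add %G9,%G5,%G7 as the three microinstructions S1, S2, S3.
ADD_MICROROUTINE = [
    ["G5out", "Ain"],
    ["G7out", "ALU=add", "Cin"],
    ["Cout", "G9in", "end"],
]

if __name__ == "__main__":
    print(run_microroutine(ADD_MICROROUTINE, {"G5": 7, "G7": 5, "G9": 0}))
```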

(65)

Microprogrammed Control (cont.)

Simple microcode organization

(66)

Microprogrammed Control (cont.)

Uses a microprogram to generate the control signals

¾ Encode the signals of each step as a codeword

Called microinstruction

¾ An instruction is expressed by a sequence of codewords

Called microroutine

Microprogram essentially implements the FSM discussed before

(67)

Microprogrammed Control (cont.)

A simple microcontroller can execute a microprogram to generate the control signals

¾ Control store

Store microprogram

¾ Use μPC

Similar to PC

¾ Address generator

Generates appropriate address depending on the

Opcode, and

Condition code inputs

(68)

Microprogrammed Control (cont.)

Microcontroller

Microcode resides in the control store, which might be read-only memory (ROM)

(69)

Microprogrammed Control (cont.)

Microinstruction format

¾ Two basic ways

Horizontal organization

Vertical organization

¾ Horizontal organization

One bit for each signal

Very flexible

Long microinstructions

Example: 1-bus datapath

Needs 90 bits for each microinstruction

(70)

Microprogrammed Control (cont.)

Horizontal

microinstruction format

(71)

Microprogrammed Control (cont.)

¾ Vertical organization

Encodes to reduce microinstruction length

Reduced flexibility

¾ Example:

Horizontal organization

64 control signals for the 32 general purpose registers

Vertical organization

5 bits to identify the register and 1 for in/out
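The saving from vertical encoding can be illustrated by packing a register number and an in/out bit into a 6-bit field in place of 64 separate signal bits. The bit layout (register number in the upper 5 bits, direction in the lowest bit) and both function names are assumptions made for this sketch.

```python
def encode_register_signal(reg: int, is_out: bool) -> int:
    """Pack a register number (0-31) and an in/out bit into a 6-bit field."""
    if not 0 <= reg < 32:
        raise ValueError("register number must be in 0..31")
    return (reg << 1) | int(is_out)


def decode_register_signal(field: int) -> str:
    """Recover the name of the single control signal the field stands for."""
    reg, is_out = field >> 1, field & 1
    return f"G{reg}{'out' if is_out else 'in'}"


if __name__ == "__main__":
    field = encode_register_signal(5, True)
    print(field, decode_register_signal(field))
```

The reduced flexibility shows up directly: one 6-bit field can name only one register signal per microinstruction, while the 64-bit horizontal form could assert several at once.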

(72)

2-bus Datapath

(73)

Microprogrammed Control (cont.)

Adding more buses reduces time needed to execute instructions

¾ No need to multiplex the bus

Example

add %G9,%G5,%G7

¾ Needed three steps in 1-bus datapath

¾ Need only two steps with a 2-bus datapath

S1 G5out: Ain;

S2 G7out: ALU=add: G9in;

(74)

Pipelining

(75)

Pipelining

Introduction

3 Hazards

¾ Resource, Data and Control Hazards

3 Technologies for Performance Improvement

¾ Superscalar, Superpipelined, and Very Long Instruction Word

(76)

Serial and Pipelining

Serial execution: 20 cycles

Pipelined execution: 8 cycles

For k stages and n instructions, the number of required cycles is:

k + (n - 1)
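The cycle counts follow directly from the formula. In the sketch below, k = 5 stages and n = 4 instructions are assumed values, chosen because they reproduce the 20 and 8 cycles quoted above; the function names are illustrative.

```python
def serial_cycles(k: int, n: int) -> int:
    """Each of n instructions passes through all k stages before the next starts."""
    return k * n


def pipelined_cycles(k: int, n: int) -> int:
    """The first instruction takes k cycles; each later one completes one cycle after."""
    return k + (n - 1)


if __name__ == "__main__":
    k, n = 5, 4  # assumed values matching the 20-cycle and 8-cycle figures
    print(serial_cycles(k, n), pipelined_cycles(k, n))
    print(serial_cycles(k, n) / pipelined_cycles(k, n))  # speedup
```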

(77)

Pipelining

Pipelining

¾ Overlapped execution

¾ Increases throughput

(78)

Pipelining (cont.)

Pipelining requires buffers

¾ Each buffer holds a single value

¾ Uses just-in-time principle

Any delay in one stage affects the entire pipeline flow

¾ Ideal scenario: equal work for each stage

Sometimes it is not possible

Slowest stage determines the flow rate in the entire pipeline

(79)

Pipelining (cont.)

Some reasons for unequal work stages

¾ A complex step cannot be subdivided conveniently

¾ An operation takes variable amount of time to execute

EX: Operand fetch time depends on where the operands are located

Registers

Cache

Memory

¾ Complexity of operation depends on the type of operation

Add: may take one cycle

Multiply: may take several cycles

(80)

Pipeline Stall

Operand fetch of I2 takes three cycles

¾ Pipeline stalls for two cycles

Caused by hazards

¾ Pipeline stalls reduce overall throughput

(81)

Hazards

Three types of hazards

¾ Resource hazards

Occurs when two or more instructions use the same resource

Also called structural hazards

¾ Data hazards

Caused by data dependencies between instructions

Example: Result produced by I1 is read by I2

¾ Control hazards

Default: sequential execution suits pipelining

Altering control flow (e.g., branching) causes problems

Introduce control dependencies

(82)

Resource Hazards

Example

¾ Conflict for memory in clock cycle 3

I1 fetches operand

I3 delays its instruction fetch from the same memory

(83)

Data Hazards

Example

¾ I1: add R2,R3,R4 /* R2 = R3 + R4 */

¾ I2: sub R5,R6,R2 /* R5 = R6 – R2 */

Introduces data dependency between I1 and I2
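The dependency between I1 and I2 is a read-after-write (RAW) dependency: I2 reads R2, which I1 writes. A minimal check is sketched below, assuming instructions are represented as hypothetical `(op, dest, src1, src2)` tuples.

```python
def has_raw_hazard(producer, consumer) -> bool:
    """True if `consumer` reads the register that `producer` writes."""
    _, dest, _, _ = producer
    _, _, src1, src2 = consumer
    return dest in (src1, src2)


I1 = ("add", "R2", "R3", "R4")  # R2 = R3 + R4
I2 = ("sub", "R5", "R6", "R2")  # R5 = R6 - R2

if __name__ == "__main__":
    print(has_raw_hazard(I1, I2))
```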

(84)

Control Hazards

¾ Determine branch decision early

(85)

Performance Improvement

Several techniques to improve performance of a pipelined system

¾ Superscalar

Replicates the pipeline hardware

¾ Superpipelined

Increases the pipeline depth

¾ Very long instruction word (VLIW)

Encodes multiple operations into a long instruction word

Hardware schedules these instructions on multiple functional units (No run-time analysis)

add R1, R2, R3 ; R1 = R2 + R3
sub R5, R6, R7 ; R5 = R6 - R7
and R4, R1, R5 ; R4 = R1 AND R5
xor R9, R9, R9 ; R9 = R9 XOR R9

cycle 1: add, sub, xor

cycle 2: and
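The grouping into cycle 1 and cycle 2 can be reproduced by placing each operation in the earliest cycle after the producers of its source registers. This is a simplified sketch: it tracks only read-after-write dependencies and assumes an unlimited number of functional units, which a real VLIW scheduler would not. The tuple format and names are hypothetical.

```python
def schedule(instructions):
    """Assign each (op, dest, src1, src2) to the earliest cycle its sources allow."""
    ready = {}   # register -> cycle in which its newest value is produced
    placed = []
    for op, dest, src1, src2 in instructions:
        # A source never written in this block is available before cycle 1.
        cycle = 1 + max(ready.get(src1, 0), ready.get(src2, 0))
        ready[dest] = cycle
        placed.append((op, cycle))
    return placed


BLOCK = [
    ("add", "R1", "R2", "R3"),
    ("sub", "R5", "R6", "R7"),
    ("and", "R4", "R1", "R5"),
    ("xor", "R9", "R9", "R9"),
]

if __name__ == "__main__":
    print(schedule(BLOCK))
```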

(86)

Superscalar Processor

Ex: Pentium

(87)

Wasted Cycles (pipelined)

When one of the stages requires two or more clock cycles, clock cycles are again wasted.

[Figure: pipeline table for stages S1 to S6 (S4 is the execute stage, exe) over cycles 1 to 11, showing instructions I-1 to I-3]

For k stages and n instructions, the number of required cycles is:

k + (2n - 1)

(88)

Superscalar

A superscalar processor has multiple execution pipelines.

In the following, note that Stage S4 has left and right pipelines (u and v).

[Figure: pipeline table for stages S1 to S6 over cycles 1 to 10, with stage S4 split into pipelines u and v, showing instructions I-1 to I-4]

For k stages and n instructions, the number of required cycles is:

k + n

(89)

Superpipelined Processor

Ex: MIPS R4000

(90)

Memory

(91)

Memory

Introduction

Building Memory Blocks

Alignment of Data

2 Memory Design Issues

¾ Cache

¾ Virtual Memory

(92)

Memory (cont.)

Ordered sequence of bytes

¾ The sequence number is called the memory address

¾ Byte addressable memory

Each byte has a unique address

Almost all processors support this

Memory address space

¾ Determined by the address bus width

¾ Pentium has a 32-bit address bus

address space = 4GB (2³² bytes)

¾ Itanium with 64-bit address bus supports 2⁶⁴ bytes of address space
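The address space sizes follow from the bus width, since an n-bit address selects one of 2^n bytes in a byte-addressable memory. A short check, with an illustrative function name:

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address bus width."""
    return 2 ** address_bits


if __name__ == "__main__":
    GB = 2 ** 30
    print(address_space_bytes(32) // GB)  # size in GB for a 32-bit address bus
    print(address_space_bytes(64))        # bytes for a 64-bit address bus
```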

(93)

Memory (cont.)

(94)

Memory (cont.)

Read cycle

1. Place address on the address bus

2. Assert memory read control signal

3. Wait for the memory to retrieve the data

Introduce wait states if using a slow memory

4. Read the data from the data bus

5. Drop the memory read signal

In Pentium, a simple read takes three clock cycles

Clock 1: steps 1 and 2

Clock 2: step 3

Clock 3: steps 4 and 5

(95)

Memory (cont.)

Write cycle

1. Place address on the address bus

2. Place data on the data bus

3. Assert memory write signal

4. Wait for the memory to store the data

Introduce wait states if necessary

5. Drop the memory write signal

In Pentium, a simple write also takes three clocks

Clock 1: steps 1 and 3

Clock 2: step 2

Clock 3: steps 4 and 5

(96)

How Hardware Implements Memory Systems

(97)

Building a Memory Block

A 4 X 3 memory design using D flip-flops

(98)

Building a Memory Block (cont’d)

Block diagram representation of a 4x3 memory

Address

Data

Control signals

¾ Read

¾ Write

(99)

Building Larger Memories

2 X 16 memory module using 74373 chips

(100)

Designing Larger Memories

64M X 32 memory using 16M X 16 chips

(101)

Alignment of Data

Get 32-bit data in one or more read cycles?

(102)

Alignment of Data (cont.)

Alignment

¾ 2-byte data: Even address

Rightmost address bit should be zero

¾ 4-byte data: Address that is multiple of 4

Rightmost 2 bits should be zero

¾ 8-byte data: Address that is multiple of 8

Rightmost 3 bits should be zero

¾ Soft alignment

Can handle aligned as well as unaligned data

¾ Hard alignment

Handles only aligned data (enforces alignment)
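The "rightmost bits should be zero" rule is a bit-mask test, and it determines how many bus reads a datum needs. The sketch below assumes power-of-two data sizes and a 4-byte (32-bit) data bus; both function names are hypothetical.

```python
def is_aligned(address: int, size: int) -> bool:
    """True if the low log2(size) address bits are zero (size is a power of two)."""
    return address & (size - 1) == 0


def read_cycles(address: int, size: int, bus_bytes: int = 4) -> int:
    """Number of bus-wide words touched by `size` bytes starting at `address`."""
    first_word = address // bus_bytes
    last_word = (address + size - 1) // bus_bytes
    return last_word - first_word + 1


if __name__ == "__main__":
    for addr in (0x1004, 0x1006):
        print(hex(addr), is_aligned(addr, 4), read_cycles(addr, 4))
```

Under these assumptions an aligned 4-byte datum is read in one cycle, while an unaligned one straddles two words and needs two.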

(103)

Memory Design Issues

Slower memories

Problem: Speed gap between processor and memory

Solution: Cache memory

Use small amount of fast memory

Make the slow memory appear faster

Works due to “reference locality”

Size limitations

¾ Limited amount of physical memory

Overlay technique

Programmer managed

¾ Virtual memory

Automates overlay management

Some additional benefits

(104)

Memory Hierarchy

(105)

Cache Memory

High-speed expensive static RAM both inside and outside the CPU.

¾ Level-1 cache: inside the CPU

¾ Level-2 cache: outside the CPU

Prefetch data into cache before the processor needs it

¾ Need to predict processor future access requirements

¾ Locality of reference

Cache hit: when data to be read is already in cache memory

Cache miss: when data to be read is not in cache memory.

Causes of misses: compulsory, capacity, and conflict.

Cache design: cache size, associativity (n-way), block size, replacement policy

(106)

Why Cache Memory Works

Example

for (i=0; i<M; i++)
    for (j=0; j<N; j++)
        X[i][j] = X[i][j] + K;

¾ Each element of X is double (eight bytes)

¾ Loop is executed (M*N) times

Placing the code in cache avoids access to main memory

Repetitive use

Temporal locality

Prefetching data

Spatial locality
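The effect of spatial locality on the loop can be estimated by counting how many distinct cache blocks a sequential sweep over X touches. The sketch assumes a 64-byte block size, an array starting at address 0, and a cache large enough that only first-touch (compulsory) misses occur; none of these values come from the slide.

```python
def count_block_misses(num_elements: int, element_size: int = 8,
                       block_size: int = 64) -> int:
    """Count first-touch misses for a sequential sweep starting at address 0."""
    seen_blocks = set()
    misses = 0
    for i in range(num_elements):
        block = (i * element_size) // block_size
        if block not in seen_blocks:   # first access to this block: a miss
            seen_blocks.add(block)
            misses += 1
    return misses


if __name__ == "__main__":
    m, n = 32, 32  # hypothetical array dimensions
    print(count_block_misses(m * n), "misses for", m * n, "accesses")
```

With eight 8-byte elements per 64-byte block, only one access in eight misses; the other seven hit on data brought in with the block.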

(107)

Cache Design Basics

On every read miss

¾ A fixed number of bytes are transferred

More than what the processor needs

Effective due to spatial locality

Cache is divided into blocks of B bytes

b bits are needed as offset into the block

b = log₂B

Blocks are called cache lines

Main memory is also divided into blocks of same size
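With B a power of two, the low b = log₂B bits of an address give the offset within the block and the remaining high bits give the block number. A small sketch of that split, where the function name and the sample address are illustrative:

```python
def split_address(address: int, block_size: int):
    """Return (block number, offset) for a power-of-two block size B."""
    b = block_size.bit_length() - 1        # b = log2(B) when B is a power of two
    return address >> b, address & (block_size - 1)


if __name__ == "__main__":
    block, offset = split_address(0x1234, 64)
    print(block, offset)
```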

(108)

Mapping Function

Determines how memory blocks are mapped to cache lines

Three types

¾ Direct mapping

Specifies a single cache line for each memory block

¾ Set-associative mapping

Specifies a set of cache lines for each memory block

¾ Associative mapping

No restrictions

Any cache line can be used for any memory block
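The three mappings differ only in how many cache lines a memory block may occupy. The sketch below assumes the common modulo placement rule (set = block number mod number of sets), which the slide does not state, and treats direct mapping as 1-way and associative mapping as a single set spanning the whole cache.

```python
def candidate_lines(block: int, num_lines: int, ways: int) -> list:
    """Cache lines that may hold `block` in a `ways`-way set-associative cache.

    ways == 1 gives direct mapping; ways == num_lines gives associative mapping.
    """
    num_sets = num_lines // ways
    set_index = block % num_sets           # assumed modulo placement rule
    return list(range(set_index * ways, (set_index + 1) * ways))


if __name__ == "__main__":
    for ways in (1, 2, 8):
        print(ways, candidate_lines(13, 8, ways))
```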

(109)

Direct Mapping

(110)

Set-Associative Mapping

(111)

Virtual Memory

(112)

I/O Devices

(113)

Input/Output

I/O devices are interfaced via an I/O controller

¾ Takes care of low-level operation details

Several ways of mapping I/O

¾ Memory-mapped I/O

Reading and writing similar to memory read/write

Uses same memory read and write signals

Most processors use this I/O mapping

¾ Isolated I/O

Separate I/O address space

Separate I/O read and write signals are needed

Pentium supports isolated I/O

Also supports memory-mapped I/O

(114)

Input/Output (cont.)

(115)

Input/Output (cont.)

Several ways of transferring data

¾ Programmed I/O

Program uses a busy-wait loop

Anticipated transfer

¾ Direct memory access (DMA)

Special controller (DMA controller) handles data transfers

Typically used for bulk data transfer

¾ Interrupt-driven I/O

Interrupts are used to initiate and/or terminate data transfers

Powerful technique

Handles unanticipated transfers

(116)

Interconnection

System components are interconnected by buses

¾ Bus: a bunch of parallel wires

Uses several buses at various levels

¾ On-chip buses

Buses to interconnect ALU and registers

A, B, and C buses in our example

Data and address buses to connect on-chip caches

¾ Internal buses

PCI, AGP, PCMCIA

¾ External buses

Serial, parallel, USB, IEEE 1394 (FireWire)

(117)

PC System Buses

ISA (Industry Standard Architecture)

PCI (Peripheral Component Interconnect)

AGP (Accelerated Graphics Port)

(118)

Interconnection (cont.)

Bus is a shared resource

¾ Bus transactions

Sequence of actions to complete a well-defined activity

Involves a master and a slave

Memory read, memory write, I/O read, I/O write

¾ Bus operations

A bus transaction may perform one or more bus operations

Pentium burst read

Transfers four memory words

Bus transaction consists of four memory read operations

¾ Bus arbitration
