(1)

Computer Organization & Assembly Languages

Computer Organization (I) Fundamentals

Pu-Jen Cheng

(2)

Materials

„ Some materials used in this course are adapted from
¾ The slides prepared by Kip Irvine for the book, Assembly Language for Intel-Based Computers, 5th Ed.
¾ The slides prepared by S. Dandamudi for the book, Fundamentals of Computer Organization and Design.
¾ The slides prepared by S. Dandamudi for the book, Introduction to Assembly Language Programming, 2nd Ed.
¾ Introduction to Computer Systems, CMU (http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f05/www/)
¾ Assembly Language & Computer Organization, NTU (http://www.csie.ntu.edu.tw/~cyy/courses/assembly/05fall/news/)
¾ (http://www.csie.ntu.edu.tw/~acpang/course/asm_2004)

(3)

Outline

„ General Concepts of Computer Organization
¾ Overview of Microcomputer (CPU, Memory, I/O)
¾ Instruction Execution Cycle
¾ Central Processing Unit (CPU)
  CISC vs. RISC
  6 Instruction Set Design Issues
¾ How Hardware Executes Processor's Instructions
  Digital Logic Design (Combinational & Sequential Circuits)
  Microprogrammed Control
¾ Pipelining
  3 Hazards
  3 Technologies for Performance Improvement
¾ Memory
  Data Alignment
  2 Design Issues (Cache, Virtual Memory)
¾ I/O Devices

(4)

General Concepts of Computer Organization

Overview of Microcomputer

(5)

Von Neumann Machine, 1945

„ Memory, Input/Output, Arithmetic/Logic Unit, Control Unit
„ Stored-program Model
¾ Both data and programs are stored in the same main memory
„ Sequential Execution

http://www.virtualtravelog.net/entries/2003-08-TheFirstDraft.pdf

(6)

What is a Microcomputer?

„ Microcomputer
¾ A computer with a microprocessor (µP) as its central processing unit (CPU)
„ Microprocessor (µP)
¾ A digital electronic component with transistors on a single semiconductor integrated circuit (IC)
¾ One or more microprocessors typically serve as the central processing unit (CPU) in a computer system or handheld device.

(7)

Components of Microcomputer

(8)

Basic Microcomputer Design

[Block diagram: Central Processor Unit (CPU) with registers, ALU, CU, and clock; Memory Storage Unit; I/O Device #1 and I/O Device #2; all connected by the data bus, control bus, and address bus]

(9)

CPU

„ Arithmetic and logic unit (ALU) performs arithmetic (add, subtract) and logical (AND, OR, NOT) operations
„ Registers store data and instructions used by the processor
„ Control unit (CU) coordinates the sequence of execution steps
¾ Fetch instructions from memory, decode them to find their types
„ Clock synchronizes CPU and bus operations
„ Datapath consists of registers and ALU(s)

(10)

Datapath

[Diagram of a RISC processor datapath: operand registers feed the ALU inputs, and the ALU output is written back. Key registers: Program Counter (PC) (or Instruction Pointer (IP)), Instruction Register (IR), Memory Address Register (MAR), Memory Data Register (MDR)]

(11)

Clock

„ Provides the timing signal and the basic unit of time
„ Synchronizes all CPU and bus operations
„ Machine (clock) cycle measures the time of a single operation
„ Clock is used to trigger events
„ Clock period = 1 / Clock frequency (e.g., clock frequency = 1 GHz → clock cycle = 1 ns)
„ An instruction can take multiple cycles to complete, e.g., multiply in the 8088 takes 50 cycles

[Waveform: one clock cycle spans one full high/low (1/0) alternation of the clock signal]
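The period/frequency relation above can be checked with a few lines of Python (the helper name is my own):

```python
# Clock period = 1 / clock frequency. Working in nanoseconds avoids
# tiny floating-point values: period_ns = 1e9 / frequency_hz.
def clock_period_ns(freq_hz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in Hz."""
    return 1e9 / freq_hz

print(clock_period_ns(1e9))        # 1.0  (1 GHz -> 1 ns per cycle)
print(50 * clock_period_ns(1e9))   # 50.0 (a 50-cycle multiply at 1 GHz)
```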

(12)

Memory, I/O, System Bus

„ Main/primary memory (random access memory, RAM) stores both program instructions and data
„ I/O devices
¾ Interface: I/O controller
¾ User interface: keyboard, display screen, printer, modem, …
¾ Secondary storage: disk
¾ Communication network
„ System Bus
¾ A bunch of parallel wires
¾ Transfers data among the components
¾ Address bus (determines the amount of physical memory addressable)
¾ Data bus (indicates the size of the data transferred)
¾ Control bus (consists of control signals: memory/IO read/write, interrupt, bus request/grant)

(13)

Instruction Execution Cycle

„ Execution Cycle
¾ Fetch (IF): CU fetches the next instruction, advances PC/IP
¾ Decode (ID): CU determines what the instruction will do
¾ Execute
  Fetch operands (OF): (if a memory operand is needed) read value from memory
  Execute the instruction (IE)
  Store output operand (WB): (if a memory operand is needed) write result to memory
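The fetch/decode/execute steps above can be sketched as a toy interpreter; the accumulator-style instruction set (LOAD/ADD/STORE/HALT) is invented purely for illustration:

```python
# A toy fetch-decode-execute loop mirroring the IF/ID/OF/IE/WB steps.
def run(program, memory):
    pc, accum = 0, 0
    while True:
        op, addr = program[pc]      # IF: fetch the next instruction
        pc += 1                     #     advance the PC
        if op == "LOAD":            # ID: decode, then execute
            accum = memory[addr]    # OF: read the operand from memory
        elif op == "ADD":
            accum += memory[addr]   # IE: execute in the ALU
        elif op == "STORE":
            memory[addr] = accum    # WB: write the result to memory
        elif op == "HALT":
            return memory

mem = {0: 2, 1: 3, 2: 0}
run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)], mem)
print(mem[2])  # 5
```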

(14)

Instruction Execution Cycle (cont.)

[Diagram: the PC selects the next instruction (I-1, I-2, I-3, I-4) in program memory; fetch places it in the instruction register; decode determines the operation; operands (op1, op2) are read from memory or registers; the ALU executes and writes the result to registers, memory, or the flags (output)]

(15)

Introduction to Digital Logic Design

¾ See asm_ch2_dl.ppt

(16)

CPU

(17)

CPU

„ CISC vs. RISC
„ 6 Instruction Set Design Issues
¾ Number of Addresses
¾ Flow of Control
¾ Operand Types
¾ Addressing Modes
¾ Instruction Types
¾ Instruction Formats

(18)

Processor

„ RISC and CISC designs
¾ Reduced Instruction Set Computer (RISC)
„ Simple instructions, small instruction set
„ Operands are assumed to be in processor registers
  „ Not in memory
„ Simplifies design (e.g., fixed instruction size)
„ Examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq)
¾ Complex Instruction Set Computer (CISC)
„ Complex instructions, large instruction set
„ Operands can be in registers or memory
„ Instruction size varies
„ Typically uses a microprogram
„ Example: Intel 80x86 family

(19)

Processor (cont.)

(20)

Processor (cont.)

„ Variations of the ISA-level can be implemented by changing the microprogram

(21)

Instruction Set Design Issues

„ Number of Addresses
„ Flow of Control
„ Operand Types
„ Addressing Modes
„ Instruction Types
„ Instruction Formats

(22)

Number of Addresses

„ Four categories
¾ 3-address machines
„ Two for the source operands and one for the result
¾ 2-address machines
„ One address doubles as source and result
¾ 1-address machines
„ Accumulator machines
„ Accumulator is used for one source and the result
¾ 0-address machines
„ Stack machines
„ Operands are taken from the stack
„ Result goes onto the stack

(23)

Number of Addresses (cont.)

„ Three-address machines
¾ Two for the source operands, one for the result
¾ RISC processors use three addresses
¾ Sample instructions
add dest,src1,src2   ; M(dest)=[src1]+[src2]
sub dest,src1,src2   ; M(dest)=[src1]-[src2]
mult dest,src1,src2  ; M(dest)=[src1]*[src2]

(24)

Number of Addresses (cont.)

„ Example
¾ C statement
A = B + C * D – E + F + A
¾ Equivalent code:
mult T,C,D  ;T = C*D
add T,T,B   ;T = B+C*D
sub T,T,E   ;T = B+C*D-E
add T,T,F   ;T = B+C*D-E+F
add A,T,A   ;A = B+C*D-E+F+A
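One way to convince yourself the three-address sequence computes the C statement is to mirror each instruction as a Python assignment (the test values are arbitrary):

```python
# Each line mirrors one three-address instruction from the slide.
A, B, C, D, E, F = 1, 2, 3, 4, 5, 6   # arbitrary test values

T = C * D      # mult T,C,D
T = T + B      # add  T,T,B
T = T - E      # sub  T,T,E
T = T + F      # add  T,T,F
A = T + A      # add  A,T,A

print(A)  # 16, the same as B + C*D - E + F + (old A) = 2 + 12 - 5 + 6 + 1
```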

(25)

Number of Addresses (cont.)

„ Two-address machines
¾ One address doubles (as source operand & result)
¾ The last example makes a case for it
„ Address T is used twice
¾ Sample instructions
load dest,src  ; M(dest)=[src]
add dest,src   ; M(dest)=[dest]+[src]
sub dest,src   ; M(dest)=[dest]-[src]
mult dest,src  ; M(dest)=[dest]*[src]

(26)

Number of Addresses (cont.)

„ Example
¾ C statement
A = B + C * D – E + F + A
¾ Equivalent code:
load T,C  ;T = C
mult T,D  ;T = C*D
add T,B   ;T = B+C*D
sub T,E   ;T = B+C*D-E
add T,F   ;T = B+C*D-E+F
add A,T   ;A = B+C*D-E+F+A

(27)

Number of Addresses (cont.)

„ One-address machines
¾ Use a special set of registers called accumulators
„ Specify one source operand & receive the result
¾ Called accumulator machines
¾ Sample instructions
load addr   ; accum = [addr]
store addr  ; M[addr] = accum
add addr    ; accum = accum + [addr]
sub addr    ; accum = accum - [addr]
mult addr   ; accum = accum * [addr]

(28)

Number of Addresses (cont.)

„ Example
¾ C statement
A = B + C * D – E + F + A
¾ Equivalent code:
load C   ;load C into accum
mult D   ;accum = C*D
add B    ;accum = C*D+B
sub E    ;accum = B+C*D-E
add F    ;accum = B+C*D-E+F
add A    ;accum = B+C*D-E+F+A
store A  ;store accum contents in A

(29)

Number of Addresses (cont.)

„ Zero-address machines
¾ Stack supplies operands and receives the result
„ Special instructions to load and store use an address
¾ Called stack machines (Ex: HP3000, Burroughs B5500)
¾ Sample instructions
push addr  ; push([addr])
pop addr   ; pop([addr])
add        ; push(pop + pop)
sub        ; push(pop - pop)
mult       ; push(pop * pop)

(30)

Number of Addresses (cont.)

„ Example
¾ C statement
A = B + C * D – E + F + A
¾ Equivalent code:
push C
push D
mult
push B
add
push E
sub
push F
add
push A
add
pop A
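The stack-machine example above can be run with a minimal interpreter; note that `sub` here is taken to compute (second-from-top minus top), which is the convention needed for "push E; sub" to yield B+C*D-E:

```python
# A minimal interpreter for the zero-address (stack-machine) example.
def run_stack(program, memory):
    stack = []
    for instr in program:
        op, *arg = instr.split()
        if op == "push":
            stack.append(memory[arg[0]])
        elif op == "pop":
            memory[arg[0]] = stack.pop()
        elif op == "add":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "sub":
            b, a = stack.pop(), stack.pop(); stack.append(a - b)
        elif op == "mult":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return memory

mem = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
code = ["push C", "push D", "mult", "push B", "add",
        "push E", "sub", "push F", "add", "push A", "add", "pop A"]
print(run_stack(code, mem)["A"])  # 16, i.e. B + C*D - E + F + (old A)
```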

(31)

Load/Store Architecture

„ Instructions expect operands in internal processor registers
¾ Special LOAD and STORE instructions move data between registers and memory
¾ RISC uses this architecture
¾ Reduces instruction length

(32)

Load/Store Architecture (cont.)

„ Sample instructions
load Rd,addr      ;Rd = [addr]
store addr,Rs     ;(addr) = Rs
add Rd,Rs1,Rs2    ;Rd = Rs1 + Rs2
sub Rd,Rs1,Rs2    ;Rd = Rs1 - Rs2
mult Rd,Rs1,Rs2   ;Rd = Rs1 * Rs2

(33)

Number of Addresses (cont.)

„ Example
¾ C statement
A = B + C * D – E + F + A
¾ Equivalent code:
load R1,B
load R2,C
load R3,D
load R4,E
load R5,F
load R6,A
mult R2,R2,R3
add R2,R2,R1
sub R2,R2,R4
add R2,R2,R5
add R2,R2,R6
store A,R2

(34)

Flow of Control

„ Default is sequential flow
„ Several instructions alter this default execution
¾ Branches
„ Unconditional
„ Conditional
„ Delayed branches
¾ Procedure calls
„ Delayed procedure calls

(35)

Flow of Control (cont.)

„ Branches
¾ Unconditional
„ Absolute address
„ PC-relative
  „ Target address is specified relative to PC contents
  „ Relocatable code
¾ Example: MIPS
„ Absolute address: j target
„ PC-relative: b target

(36)

Flow of Control (cont.) Flow of Control (cont.)

e g , Pentium e g , SPARC

e.g., Pentium e.g., SPARC

(37)

Flow of Control (cont.)

„ Branches
¾ Conditional
„ Jump is taken only if the condition is met
¾ Two types
„ Set-Then-Jump
  „ Condition testing is separated from branching
  „ Condition code registers are used to convey the condition test result
  „ Condition code registers keep a record of the status of the last ALU operation, such as an overflow condition
„ Example: Pentium code
cmp AX,BX  ; compare AX and BX
je target  ; jump if equal

(38)

Flow of Control (cont.)

„ Test-and-Jump
  „ A single instruction performs condition testing and branching
  „ Example: MIPS instruction
beq Rsrc1,Rsrc2,target
  „ Jumps to target if Rsrc1 = Rsrc2
„ Delayed branching
¾ Control is transferred after executing the instruction that follows the branch instruction
  „ This instruction slot is called the delay slot
¾ Improves efficiency
¾ Highly pipelined RISC processors support it

(39)

Flow of Control (cont.)

„ Procedure calls
¾ Facilitate modular programming
¾ Require two pieces of information to return
„ End of procedure
  „ Pentium uses the ret instruction
  „ MIPS uses the jr instruction
„ Return address
  „ In a (special) register
    „ MIPS allows any general-purpose register
  „ On the stack
    „ Pentium

(40)

Flow of Control (cont.)

(41)

Flow of Control (cont.)

Delay slot

(42)

Parameter Passing

„ Two basic techniques
¾ Register-based (e.g., PowerPC, MIPS)
„ Internal registers are used
„ Faster
„ Limits the number of parameters
„ Complicates recursive procedures
¾ Stack-based (e.g., Pentium)
„ Stack is used
„ More general

(43)

Operand Types

„ Instructions support basic data types
¾ Characters
¾ Integers
¾ Floating-point
„ Instruction overload
¾ Same instruction for different data types
¾ Example: Pentium
mov AL,address   ;loads an 8-bit value
mov AX,address   ;loads a 16-bit value
mov EAX,address  ;loads a 32-bit value

(44)

Operand Types (cont.)

„ Separate instructions
¾ Instructions specify the operand size
¾ Example: MIPS
lb Rdest,address  ;loads a byte
lh Rdest,address  ;loads a halfword (16 bits)
lw Rdest,address  ;loads a word (32 bits)
ld Rdest,address  ;loads a doubleword (64 bits)
¾ Similar instructions exist for store

(45)

Addressing Modes

„ How the operands are specified
¾ Operands can be in three places
„ Registers
  „ Register addressing mode
„ Part of the instruction
  „ Constant
  „ Immediate addressing mode
  „ All processors support these two addressing modes
„ Memory
  „ Difference between RISC and CISC
  „ CISC supports a large variety of addressing modes
  „ RISC follows the load/store architecture

(46)

Instruction Types

„ Several types of instructions
¾ Data movement
„ Pentium: mov dest,src
„ Some processors do not provide direct data movement instructions
  „ Indirect data movement
  add Rdest,Rsrc,0  ;Rdest = Rsrc+0
¾ Arithmetic and Logical
„ Arithmetic
  „ Integer and floating-point, signed and unsigned
  „ add, subtract, multiply, divide
„ Logical
  „ and, or, not, xor

(47)

Instruction Types (cont.)

„ Condition code bits
¾ S: Sign bit (0 = +, 1 = -)
¾ Z: Zero bit (0 = nonzero, 1 = zero)
¾ O: Overflow bit (0 = no overflow, 1 = overflow)
¾ C: Carry bit (0 = no carry, 1 = carry)
„ Example: Pentium
cmp count,25  ;compare count to 25
              ;subtract 25 from count
je target     ;jump if equal
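A sketch of how a cmp-style instruction derives some of these condition bits from a subtraction; the 8-bit width and the helper name are assumptions for illustration (the overflow bit is omitted):

```python
# Compute S, Z, and C for the subtraction a - b on 8-bit values.
def compare8(a, b):
    result = (a - b) & 0xFF
    return {
        "S": (result >> 7) & 1,        # sign bit of the result
        "Z": 1 if result == 0 else 0,  # result is zero
        "C": 1 if a < b else 0,        # borrow on the unsigned subtract
    }

print(compare8(25, 25))  # {'S': 0, 'Z': 1, 'C': 0} -> je would be taken
print(compare8(10, 25))  # {'S': 1, 'Z': 0, 'C': 1}
```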

(48)

Instruction Types (cont.)

¾ Flow control and I/O instructions
„ Branch
„ Procedure call
„ Interrupts
¾ I/O instructions
„ Memory-mapped I/O
  „ Most processors support memory-mapped I/O
  „ No separate instructions for I/O
„ Isolated I/O
  „ Pentium supports isolated I/O
  „ Separate I/O instructions
in AX,io_port   ;read from an I/O port
out io_port,AX  ;write to an I/O port

(49)

Instruction Formats

„ Two types
¾ Fixed-length
„ Used by RISC processors
„ 32-bit RISC processors use 32-bit wide instructions
„ Examples: SPARC, MIPS, PowerPC
¾ Variable-length
„ Used by CISC processors
„ Memory operands need more bits to specify
„ Opcode
¾ Specifies the major and exact operation

(50)

Examples of Instruction Formats

(51)

How Hardware Executes Processor's Instructions

(52)

How Hardware Executes Processor's Instructions

„ Digital Logic Design
¾ Combinational and Sequential Circuits
„ Microprogrammed Control

(53)

Virtual Machines

Abstractions for computers

Level 5: High-Level Language (machine-independent)
Level 4: Assembly Language (machine-specific)
Level 3: Operating System
Level 2: Instruction Set Architecture
Level 1: Microarchitecture
Level 0: Digital Logic

(54)

Basic Microcomputer Design

[Block diagram repeated from slide (8): Central Processor Unit (CPU) with registers, ALU, CU, and clock; Memory Storage Unit; I/O Device #1 and I/O Device #2; connected by the data, control, and address buses]

(55)

Consider a 1-bus Datapath

Assume all entities are 32 bits wide

(56)

1-bit ALU

(57)

ALU Circuit in 1-bus Datapath

(58)

Memory Interface Implementation

(59)

Microprogrammed Control

„ 32 32-bit general-purpose registers
¾ Interface only with the A-bus
¾ Each register has two control signals
„ Gxin and Gxout
„ Control signals used by the other registers
¾ PC register:
„ PCin, PCout, and PCbout
¾ IR register:
„ IRout and IRbin
¾ MAR register:
„ MARin, MARout, and MARbout
¾ MDR register:
„ MDRin, MDRout, MDRbin and MDRbout

(60)

Microprogrammed Control (cont.)

add %G9,%G5,%G7 implemented as:

„ Transfer G5 contents to the A register
  „ Assert G5out and Ain
„ Place G7 contents on the A-bus
  „ Assert G7out
„ Instruct the ALU to perform addition
  „ Appropriate ALU function control signals
„ Latch the result in the C register
  „ Assert Cin
„ Transfer contents of the C register to G9
  „ Assert Cout and G9in

(61)

Microprogrammed Control (cont.)

Instruction fetch implemented as:

„ PCbout: read: PCout: ALU=add4: Cin;
„ read: Cout: PCin;
„ read: IRbin;
„ Decode the instruction and jump to the appropriate execution routine

(62)

Microprogrammed Control (cont.)

„ Example instruction groups
¾ Load/store
„ Move data between registers and memory
¾ Register
„ Arithmetic and logic instructions
¾ Branch
„ Jump direct/indirect
¾ Call
„ Procedure invocation mechanisms
¾ More…

(63)

Microprogrammed Control (cont.)

High-level FSM for instruction execution

FSM: finite state machine

(64)

Microprogrammed Control (cont.)

„ Software implementation
¾ Typically used in CISC
„ Hardware implementation (PLA) is complex and expensive
„ Example
add %G9,%G5,%G7
¾ Three steps
S1 G5out: Ain;
S2 G7out: ALU=add: Cin;
S3 Cout: G9in: end;
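The three-step microroutine above can be modeled by latching bus values into explicit registers; the signal names follow the slide, while the dictionary-based machinery is my own sketch:

```python
# Model the 1-bus datapath registers and execute S1-S3 for
# add %G9,%G5,%G7 (G5 = 10, G7 = 32 are arbitrary test values).
regs = {"G5": 10, "G7": 32, "G9": 0, "A": 0, "C": 0}

def step(out_reg, in_reg, alu=None):
    bus = regs[out_reg]            # e.g., G5out drives the bus
    if alu == "add":
        bus = regs["A"] + bus      # ALU adds the A register and the bus
    regs[in_reg] = bus             # e.g., Ain latches the bus value

step("G5", "A")             # S1: G5out: Ain
step("G7", "C", alu="add")  # S2: G7out: ALU=add: Cin
step("C", "G9")             # S3: Cout: G9in
print(regs["G9"])  # 42
```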

(65)

Microprogrammed Control (cont.)

Simple microcode organization

(66)

Microprogrammed Control (cont.)

„ Uses a microprogram to generate the control signals
¾ Encode the signals of each step as a codeword
„ Called a microinstruction
¾ An instruction is expressed by a sequence of codewords
„ Called a microroutine
„ The microprogram essentially implements the FSM discussed before

(67)

Microprogrammed Control (cont.)

„ A simple microcontroller can execute a microprogram to generate the control signals
¾ Control store
„ Stores the microprogram
¾ Use μPC
„ Similar to PC
¾ Address generator
„ Generates the appropriate address depending on the
  „ Opcode, and
  „ Condition code inputs

(68)

Microprogrammed Control (cont.)

Microcontroller

Microcodes reside in the control store, which might be read-only memory (ROM)

(69)

Microprogrammed Control (cont.)

„ Microinstruction format
¾ Two basic ways
„ Horizontal organization
„ Vertical organization
¾ Horizontal organization
„ One bit for each signal
„ Very flexible
„ Long microinstructions
„ Example: 1-bus datapath
  „ Needs 90 bits for each microinstruction

(70)

Microprogrammed Control (cont.)

Horizontal microinstruction format

(71)

Microprogrammed Control (cont.)

„ Vertical organization
¾ Encodes to reduce microinstruction length
„ Reduced flexibility
¾ Example:
„ Horizontal organization
  „ 64 control signals for the 32 general-purpose registers
„ Vertical organization
  „ 5 bits to identify the register and 1 for in/out
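The 64-vs-6 comparison above follows directly from the register count (32 registers, 2 signals each horizontally; log2(32) + 1 bits vertically):

```python
# Bit counts for horizontal vs. vertical encoding of the register signals.
import math

n_regs = 32
horizontal_bits = n_regs * 2                      # one bit per Gxin/Gxout signal
vertical_bits = math.ceil(math.log2(n_regs)) + 1  # register id + in/out bit

print(horizontal_bits)  # 64
print(vertical_bits)    # 6
```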

(72)

2-bus Datapath

(73)

Microprogrammed Control (cont.)

„ Adding more buses reduces the time needed to execute instructions
¾ No need to multiplex the bus
„ Example
add %G9,%G5,%G7
¾ Needed three steps in the 1-bus datapath
¾ Needs only two steps with a 2-bus datapath
S1 G5out: Ain;
S2 G7out: ALU=add: G9in;

(74)

Pipelining

(75)

Pipelining

„ Introduction
„ 3 Hazards
¾ Resource, Data and Control Hazards
„ 3 Technologies for Performance Improvement
¾ Superscalar, Superpipelined, and Very Long Instruction Word

(76)

Serial and Pipelining

Serial execution: 20 cycles. Pipelined execution: 8 cycles.

For k stages and n instructions, the number of required cycles is:

k + (n – 1)
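The cycle counts quoted above are consistent with k = 4 stages and n = 5 instructions (an assumption inferred from the 20/8 figures):

```python
# Serial execution takes k*n cycles; ideal pipelined execution
# takes k + (n - 1), since a new instruction completes every cycle
# once the pipeline is full.
def serial_cycles(k, n):
    return k * n

def pipelined_cycles(k, n):
    return k + (n - 1)

print(serial_cycles(4, 5))     # 20
print(pipelined_cycles(4, 5))  # 8
```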

(77)

Pipelining

„ Pipelining
¾ Overlapped execution
¾ Increases throughput

(78)

Pipelining (cont.)

„ Pipelining requires buffers
¾ Each buffer holds a single value
¾ Uses the just-in-time principle
„ Any delay in one stage affects the entire pipeline flow
¾ Ideal scenario: equal work for each stage
„ Sometimes it is not possible
„ The slowest stage determines the flow rate in the entire pipeline

(79)

Pipelining (cont.)

„ Some reasons for unequal work stages
¾ A complex step cannot be subdivided conveniently
¾ An operation takes a variable amount of time to execute
„ Ex: Operand fetch time depends on where the operands are located
  „ Registers
  „ Cache
  „ Memory
¾ Complexity of operation depends on the type of operation
„ Add: may take one cycle
„ Multiply: may take several cycles

(80)

Pipeline Stall

„ Operand fetch of I2 takes three cycles
¾ Pipeline stalls for two cycles
„ Caused by hazards
¾ Pipeline stalls reduce overall throughput

(81)

Hazards

„ Three types of hazards
¾ Resource hazards
„ Occur when two or more instructions use the same resource
„ Also called structural hazards
¾ Data hazards
„ Caused by data dependencies between instructions
„ Example: Result produced by I1 is read by I2
¾ Control hazards
„ Default: sequential execution suits pipelining
„ Altering control flow (e.g., branching) causes problems
„ Introduces control dependencies

(82)

Resource Hazards

„ Example
¾ Conflict for memory in clock cycle 3
„ I1 fetches its operand
„ I3 delays its instruction fetch from the same memory

(83)

Data Hazards

„ Example
¾ I1: add R2,R3,R4  /* R2 = R3 + R4 */
¾ I2: sub R5,R6,R2  /* R5 = R6 – R2 */
„ Introduces a data dependency between I1 and I2

(84)

Control Hazards

» Determine the branch decision early

(85)

Performance Improvement

„ Several techniques to improve performance of a pipelined system
¾ Superscalar
„ Replicates the pipeline hardware
¾ Superpipelined
„ Increases the pipeline depth
¾ Very long instruction word (VLIW)
„ Encodes multiple operations into a long instruction word
„ Hardware schedules these instructions on multiple functional units (no run-time analysis)
„ Example:
add R1, R2, R3  ; R1 = R2 + R3
sub R5, R6, R7  ; R5 = R6 – R7
and R4, R1, R5  ; R4 = R1 AND R5
xor R9, R9, R9  ; R9 = R9 XOR R9

cycle 1: add, sub, xor
cycle 2: and
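The two-cycle grouping above can be reproduced by a toy dependence-based scheduler: an instruction may issue only after every earlier instruction producing one of its sources has issued (anti- and output-dependences are ignored in this sketch, and the code itself is an illustration, not how any real VLIW unit works):

```python
# Greedy scheduling of independent instructions into shared cycles.
def schedule(instrs):
    # instrs: (name, dest, sources) tuples in program order
    cycles, pending = [], list(instrs)
    while pending:
        produced = {d for _, d, _ in pending}   # results still to come
        ready = [i for i in pending
                 if not any(s in produced - {i[1]} for s in i[2])]
        cycles.append([name for name, _, _ in ready])
        pending = [i for i in pending if i not in ready]
    return cycles

prog = [("add", "R1", ["R2", "R3"]),
        ("sub", "R5", ["R6", "R7"]),
        ("and", "R4", ["R1", "R5"]),   # needs R1 and R5, so it waits
        ("xor", "R9", ["R9", "R9"])]   # self-dependence only, issues at once
print(schedule(prog))  # [['add', 'sub', 'xor'], ['and']]
```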

(86)

Superscalar Processor

Ex: Pentium

(87)

Wasted Cycles (pipelined)

„ When one of the stages requires two or more clock cycles, clock cycles are again wasted.

[Pipeline diagram: stages S1–S6 vs. cycles 1–11; I-1, I-2, and I-3 each spend two cycles in the execute stage, stalling the instructions behind them]

For k stages and n instructions, with one stage taking two cycles, the number of required cycles is:

k + (2n – 1)

(88)

Superscalar

A superscalar processor has multiple execution pipelines. In the following, note that stage S4 has left and right pipelines (u and v).

[Pipeline diagram: stages S1–S6 vs. cycles 1–10; stage S4 is split into the u and v pipelines, so I-1 through I-4 complete in 10 cycles]

For k stages and n instructions, with the two-cycle stage duplicated, the number of required cycles is:

k + n

(89)

Superpipelined Processor

Ex: MIPS R4000

(90)

Memory

(91)

Memory

„ Introduction
„ Building Memory Blocks
„ Alignment of Data
„ 2 Memory Design Issues
¾ Cache
¾ Virtual Memory

(92)

Memory (cont.)

„ Ordered sequence of bytes
¾ The sequence number is called the memory address
¾ Byte addressable memory
„ Each byte has a unique address
„ Almost all processors support this
„ Memory address space
¾ Determined by the address bus width
¾ Pentium has a 32-bit address bus
„ address space = 4 GB (2^32)
¾ Itanium with a 64-bit address bus supports
„ 2^64 bytes of address space
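The address-space arithmetic above is just a power of two (the helper name is my own):

```python
# An n-bit address bus can address 2**n bytes of byte-addressable memory.
def address_space_bytes(bus_width_bits):
    return 2 ** bus_width_bits

print(address_space_bytes(32) // 2**30)  # 4  (GB, the Pentium case)
print(address_space_bytes(64) == 2**64)  # True (the Itanium case)
```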

(93)

Memory (cont.)

(94)

Memory (cont.)

„ Read cycle
1. Place address on the address bus
2. Assert memory read control signal
3. Wait for the memory to retrieve the data
   „ Introduce wait states if using a slow memory
4. Read the data from the data bus
5. Drop the memory read signal
„ In Pentium, a simple read takes three clock cycles
„ Clock 1: steps 1 and 2
„ Clock 2: step 3
„ Clock 3: steps 4 and 5

(95)

Memory (cont.)

„ Write cycle
1. Place address on the address bus
2. Place data on the data bus
3. Assert memory write signal
4. Wait for the memory to store the data
   „ Introduce wait states if necessary
5. Drop the memory write signal
„ In Pentium, a simple write also takes three clock cycles
„ Clock 1: steps 1 and 3
„ Clock 2: step 2
„ Clock 3: steps 4 and 5

(96)

How Hardware Implements Memory Systems

(97)

Building a Memory Block

A 4 X 3 memory design using D flip-flops

(98)

Building a Memory Block (cont'd)

Block diagram representation of a 4x3 memory

„ Address
„ Data
„ Control signals
¾ Read
¾ Write

(99)

Building Larger Memories

2 X 16 memory module using 74373 chips

(100)

Designing Larger Memories

64M X 32 memory using 16M X 16 chips

(101)

Alignment of Data

Can 32-bit data be read in one read cycle, or does it take more?

(102)

Alignment of Data (cont.)

„ Alignment
¾ 2-byte data: even address
„ Rightmost address bit should be zero
¾ 4-byte data: address that is a multiple of 4
„ Rightmost 2 bits should be zero
¾ 8-byte data: address that is a multiple of 8
„ Rightmost 3 bits should be zero
¾ Soft alignment
„ Can handle aligned as well as unaligned data
¾ Hard alignment
„ Handles only aligned data (enforces alignment)
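The rules above amount to one bit test: an address is aligned for a power-of-two size exactly when its rightmost log2(size) bits are zero (the helper name is my own):

```python
# Alignment check via the rightmost bits: addr & (size - 1) isolates them.
def is_aligned(addr, size):
    return addr & (size - 1) == 0   # size must be a power of two

print(is_aligned(0x1004, 4))  # True  (multiple of 4; rightmost 2 bits zero)
print(is_aligned(0x1006, 4))  # False
print(is_aligned(0x1006, 2))  # True  (even address; rightmost bit zero)
```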

(103)

Memory Design Issues

„ Slower memories
¾ Problem: speed gap between processor and memory
¾ Solution: cache memory
„ Use a small amount of fast memory
„ Make the slow memory appear faster
„ Works due to "reference locality"
„ Size limitations
¾ Limited amount of physical memory
„ Overlay technique
  „ Programmer managed
¾ Virtual memory
„ Automates overlay management
„ Some additional benefits

(104)

Memory Hierarchy

(105)

Cache Memory

„ High-speed, expensive static RAM both inside and outside the CPU
¾ Level-1 cache: inside the CPU
¾ Level-2 cache: outside the CPU
„ Prefetch data into cache before the processor needs it
¾ Need to predict the processor's future access requirements
¾ Locality of reference
„ Cache hit: when data to be read is already in cache memory
„ Cache miss: when data to be read is not in cache memory. When? Compulsory, capacity, and conflict misses.
„ Cache design: cache size, n-way associativity, block size, replacement policy

(106)

Why Cache Memory Works

„ Example

for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        X[i][j] = X[i][j] + K;

¾ Each element of X is a double (eight bytes)

¾ Loop body is executed (M*N) times

„ Placing the code in cache avoids access to main memory

„ Repetitive use

„ Temporal locality

„ Prefetching data

„ Spatial locality
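The spatial-locality benefit of the row-major loop above can be made concrete with a rough model: count how often an access falls in a different cache block than the previous access (a stand-in for misses with a single cache line). This is a sketch under assumed parameters (64-byte blocks, 8-byte doubles, so 8 elements per block); the function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal of an M x N array of doubles:
 * count transitions to a new 64-byte block (8 elements per block). */
size_t block_changes_rowmajor(size_t M, size_t N) {
    size_t changes = 0, last = (size_t)-1;
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            size_t blk = (i * N + j) / 8;   /* linear element index / 8 */
            if (blk != last) { changes++; last = blk; }
        }
    return changes;
}

/* Column-major traversal of the same array: poor spatial locality,
 * since consecutive accesses are N*8 bytes apart. */
size_t block_changes_colmajor(size_t M, size_t N) {
    size_t changes = 0, last = (size_t)-1;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++) {
            size_t blk = (i * N + j) / 8;
            if (blk != last) { changes++; last = blk; }
        }
    return changes;
}
```

For a 64 x 64 array this model gives 512 block changes row-major (one per 8 elements) versus 4096 column-major (one per access): the same computation, very different cache behavior.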

(107)

Cache Design Basics

„ On every read miss

¾ A fixed number of bytes are transferred

„ More than what the processor needs

„ Effective due to spatial locality

„ Cache is divided into blocks of B bytes

¾ b bits are needed as the offset into a block: b = log2(B)

¾ Blocks are called cache lines

„ Main memory is also divided into blocks of the same size
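The relation b = log2(B) can be computed directly; a minimal sketch in C (the name `offset_bits` is illustrative, and B is assumed to be a power of two):

```c
#include <assert.h>

/* Number of offset bits b for a block of B bytes: b = log2(B). */
unsigned offset_bits(unsigned B) {
    unsigned b = 0;
    while (B > 1) {
        B >>= 1;   /* each halving of B removes one offset bit */
        b++;
    }
    return b;
}
```

For example, a 64-byte cache line needs 6 offset bits; the remaining address bits select the line (or set) and form the tag.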

(108)

Mapping Function

„ Determines how memory blocks are mapped to cache lines

„ Three types

¾ Direct mapping

„ Specifies a single cache line for each memory block

¾ Set-associative mapping

„ Specifies a set of cache lines for each memory block

¾ Associative mapping

„ No restrictions

„ Any cache line can be used for any memory block
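The first two mapping functions are commonly a simple modulo on the memory block number. A C sketch under that common convention (the function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Direct mapping: memory block i can go in exactly one cache line,
 * line = i mod (number of lines). */
uint32_t direct_line(uint32_t block, uint32_t num_lines) {
    return block % num_lines;
}

/* Set-associative mapping: block i maps to one set,
 * set = i mod (number of sets); any way within that set may hold it. */
uint32_t assoc_set(uint32_t block, uint32_t num_sets) {
    return block % num_sets;
}
```

With a 128-line cache, memory blocks 2 and 130 collide in direct mapping (both map to line 2); organized 2-way set-associative (64 sets), they still share set 2 but can coexist in its two ways.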

(109)

Direct Mapping

(110)

Set-Associative Mapping

(111)

Virtual Memory

(112)

I/O Devices

(113)

Input/Output

„ I/O devices are interfaced via an I/O controller

¾ Takes care of low-level operational details

„ Several ways of mapping I/O

¾ Memory-mapped I/O

„ Reading and writing similar to memory read/write

„ Uses the same memory read and write signals

„ Most processors use this I/O mapping

¾ Isolated I/O

„ Separate I/O address space

„ Separate I/O read and write signals are needed

„ Pentium supports isolated I/O

„ Also supports memory-mapped I/O
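Memory-mapped I/O means a device register is accessed with ordinary load/store instructions. A hedged C sketch: on real hardware the register would sit at a fixed physical address (e.g. a hypothetical `(volatile uint32_t *)0x40001000`); here a plain variable simulates it so the idea stays self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Simulated device data register; stands in for a fixed hardware address. */
static volatile uint32_t sim_data_reg;

/* With memory-mapped I/O, writing the device is just a store ... */
void dev_write(volatile uint32_t *reg, uint32_t v) {
    *reg = v;
}

/* ... and reading it is just a load; no special I/O instructions needed.
 * 'volatile' keeps the compiler from optimizing the accesses away. */
uint32_t dev_read(volatile uint32_t *reg) {
    return *reg;
}
```

Under isolated I/O the same transfer would instead use dedicated instructions (IN/OUT on the Pentium) against the separate I/O address space.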

(114)

Input/Output (cont.)

(115)

Input/Output (cont.)

„ Several ways of transferring data

¾ Programmed I/O

„ Program uses a busy-wait loop

„ Anticipated transfers

¾ Direct memory access (DMA)

„ Special controller (DMA controller) handles data transfers

„ Typically used for bulk data transfer

¾ Interrupt-driven I/O

„ Interrupts are used to initiate and/or terminate data transfers

„ Powerful technique

„ Handles unanticipated transfers
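The busy-wait loop of programmed I/O can be sketched in C. This is a simulation, not real driver code: the status register and its ready bit are assumptions (`STATUS_READY`, `read_status` are illustrative names), with the device becoming ready after a few polls:

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_READY 0x1u   /* hypothetical "device ready" status bit */

/* Simulated status register: reports ready after three polls. */
static int polls_left = 3;
uint32_t read_status(void) {
    return (polls_left-- <= 0) ? STATUS_READY : 0;
}

/* Programmed I/O: the CPU spins, repeatedly polling the status register,
 * until the device is ready; only then does it perform the transfer itself.
 * Returns how many polls were wasted busy-waiting. */
int wait_until_ready(void) {
    int polls = 0;
    while (!(read_status() & STATUS_READY))
        polls++;
    return polls;
}
```

Those wasted polls are exactly what DMA and interrupt-driven I/O eliminate: the CPU does useful work until the controller or an interrupt signals completion.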

(116)

Interconnection

„ System components are interconnected by buses

¾ Bus: a bunch of parallel wires

„ Uses several buses at various levels

¾ On-chip buses

„ Buses to interconnect ALU and registers

„ A, B, and C buses in our example

„ Data and address buses to connect on-chip caches

¾ Internal buses

„ PCI, AGP, PCMCIA

¾ External buses

„ Serial, parallel, USB, IEEE 1394 (FireWire)

(117)

PC System Buses

ISA (Industry Standard Architecture)

PCI (Peripheral Component Interconnect)

AGP (Accelerated Graphics Port)

(118)

Interconnection (cont.)

„ Bus is a shared resource

¾ Bus transactions

„ Sequence of actions to complete a well-defined activity

„ Involves a master and a slave

„ Memory read, memory write, I/O read, I/O write

¾ Bus operations

„ A bus transaction may perform one or more bus operations

„ Pentium burst read

„ Transfers four memory words

„ Bus transaction consists of four memory read operations

¾ Bus arbitration
