Computer Organization & Assembly Languages
Computer Organization (I) Fundamentals
Pu-Jen Cheng
Materials
Some materials used in this course are adapted from
¾ The slides prepared by Kip Irvine for the book, Assembly Language for Intel-Based Computers, 5th Ed.
¾ The slides prepared by S. Dandamudi for the book, Fundamentals of Computer Organization and Design.
¾ The slides prepared by S. Dandamudi for the book, Introduction to Assembly Language Programming, 2nd Ed.
¾ Introduction to Computer Systems, CMU (http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f05/www/)
¾ Assembly Language & Computer Organization, NTU (http://www.csie.ntu.edu.tw/~cyy/courses/assembly/05fall/news/)
(http://www.csie.ntu.edu.tw/~acpang/course/asm_2004)
Outline
General Concepts of Computer Organization
¾ Overview of Microcomputer
CPU, Memory, I/O
Instruction Execution Cycle
¾ Central Processing Unit (CPU)
CISC vs. RISC
6 Instruction Set Design Issues
¾ How Hardware Executes Processor’s Instructions
Digital Logic Design (Combinational & Sequential Circuits)
Microprogrammed Control
¾ Pipelining
3 Hazards
3 technologies for performance improvement
¾ Memory
Data Alignment
2 Design Issues (Cache, Virtual Memory)
¾ I/O Devices
General Concepts of Computer Organization
Overview of Microcomputer
Von Neumann Machine, 1945
Memory, Input/Output, Arithmetic/Logic Unit, Control Unit
Stored-program Model
¾ Both data and programs are stored in the same main memory
Sequential Execution
http://www.virtualtravelog.net/entries/2003-08-TheFirstDraft.pdf
What is Microcomputer
Microcomputer
¾ A computer with a microprocessor (µP) as its central processing unit (CPU)
Microprocessor (µP)
¾ A digital electronic component with transistors on a single semiconductor integrated circuit (IC)
¾ One or more microprocessors typically serve as a central processing unit (CPU) in a computer system or handheld device.
Components of Microcomputer
Basic Microcomputer Design
[Figure: the Central Processor Unit (CPU), containing registers, ALU, CU, and clock, connected to the Memory Storage Unit and to I/O Device #1 and I/O Device #2 by the data bus, control bus, and address bus]
CPU
Arithmetic and logic unit (ALU) performs arithmetic (add, subtract) and logical (AND, OR, NOT) operations
Registers store data and instructions used by the processor
Control unit (CU) coordinates sequence of execution steps
¾ Fetch instructions from memory, decode them to find their types
Clock
Datapath consists of registers and ALU(s)
Datapath
[Figure: datapath of a RISC processor: ALU with two input operands and an output; Program Counter (PC) (or Instruction Pointer (IP)), Instruction Register (IR), Memory Address Register (MAR), Memory Data Register (MDR)]
Clock
Provide timing signal and the basic unit of time
Synchronize all CPU and BUS operations
Machine (clock) cycle measures time of a single operation
Clock is used to trigger events
Clock period = 1 / Clock frequency (e.g., 1 GHz → clock cycle = 1 ns)
An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles
[Figure: clock waveform alternating between 1 and 0, with one cycle marked]
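The relation between clock frequency, clock period, and instruction time can be checked with a small sketch. The function names are illustrative, and the 5 MHz clock in the second call is a hypothetical value chosen for round numbers, not a figure from these slides.

```python
def clock_period_ns(frequency_hz):
    """Clock period in nanoseconds: period = 1 / frequency."""
    return 1e9 / frequency_hz


def instruction_time_ns(cycles, frequency_hz):
    """Time taken by an instruction that needs the given number of cycles."""
    return cycles * clock_period_ns(frequency_hz)


# 1 GHz clock: one cycle lasts 1 ns.
period_1ghz = clock_period_ns(1e9)

# A 50-cycle instruction on a hypothetical 5 MHz clock (200 ns per cycle).
multiply_time = instruction_time_ns(50, 5e6)
```

A slower clock or a higher cycle count both lengthen the instruction time linearly.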
Memory, I/O, System Bus
Main/primary memory (random access memory, RAM) stores both program instructions and data
I/O devices
¾ Interface: I/O controller
¾ User interface: keyboard, display screen, printer, modem, …
¾ Secondary storage: disk
¾ Communication network
System Bus
¾ A bunch of parallel wires
¾ Transfer data among the components
¾ Address bus (determine the amount of physical memory addressable)
¾ Data bus (indicate the size of the data transferred)
¾ Control bus (consists of control signals: memory/IO read/write, interrupt, bus request/grant)
Instruction Execution Cycle
Execution Cycle
¾ Fetch (IF): CU fetches next instruction, advance PC/IP
¾ Decode (ID): CU determines what the instruction will do
¾ Execute
Fetch operands (OF): (memory operand needed) read value from memory
Execute the instruction (IE)
Store output operand (WB): (memory operand needed) write result to memory
Instruction Execution Cycle (cont.)
[Figure: instruction execution cycle: the PC selects an instruction (I-1 to I-4) in program memory; fetch places I-1 in the instruction register; decode; operands (op1, op2) are read from memory or registers; the ALU executes; the output and flags are written back]
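The fetch, decode, and execute steps can be illustrated with a toy interpreter. This is a minimal sketch: the three-opcode instruction set, the register names, and the tuple format are assumptions made for the example and do not correspond to any real processor.

```python
def run(program, memory):
    """Run (opcode, dest, src) tuples through a fetch-decode-execute loop."""
    registers = {"R1": 0, "R2": 0}
    pc = 0  # program counter
    while pc < len(program):
        instruction = program[pc]        # Fetch (IF)
        pc += 1                          # advance PC
        opcode, dest, src = instruction  # Decode (ID)
        if opcode == "load":             # Fetch operand (OF) from memory
            registers[dest] = memory[src]
        elif opcode == "add":            # Execute (IE) in the ALU
            registers[dest] = registers[dest] + registers[src]
        elif opcode == "store":          # Store output operand (WB)
            memory[dest] = registers[src]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return registers, memory


program = [
    ("load", "R1", "x"),
    ("load", "R2", "y"),
    ("add", "R1", "R2"),
    ("store", "z", "R1"),
]
registers, memory = run(program, {"x": 2, "y": 3, "z": 0})
```

After the run, memory location z holds the sum of x and y.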
Introduction to Digital Logic Design
¾ See asm_ch2_dl.ppt
CPU
CISC vs. RISC
6 Instruction Set Design Issues
¾ Number of Addresses
¾ Flow of Control
¾ Operand Types
¾ Addressing Modes
¾ Instruction Types
¾ Instruction Formats
Processor
RISC and CISC designs
¾ Reduced Instruction Set Computer (RISC)
Simple instructions, small instruction set
Operands are assumed to be in processor registers
Not in memory
Simplify design (e.g., fixed instruction size)
Examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq)
¾ Complex Instruction Set Computer (CISC)
Complex instructions, large instruction set
Operands can be in registers or memory
Instruction size varies
Typically use a microprogram
Example: Intel 80x86 family
Processor (cont.)
Variations of the ISA level can be implemented by changing the microprogram
Instruction Set Design Issues
Number of Addresses
Flow of Control
Operand Types
Addressing Modes
Instruction Types
Instruction Formats
Number of Addresses
Four categories
¾ 3-address machines
2 for the source operands and one for the result
¾ 2-address machines
One address doubles as source and result
¾ 1-address machines
Accumulator machines
Accumulator is used for one source and result
¾ 0-address machines
Stack machines
Operands are taken from the stack
Result goes onto the stack
Number of Addresses (cont.)
Three-address machines
¾ Two for the source operands, one for the result
¾ RISC processors use three addresses
¾ Sample instructions
add dest,src1,src2
; M(dest)=[src1]+[src2]
sub dest,src1,src2
; M(dest)=[src1]-[src2]
mult dest,src1,src2
; M(dest)=[src1]*[src2]
Number of Addresses (cont.)
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
mult T,C,D ;T = C*D
add T,T,B ;T = B+C*D
sub T,T,E ;T = B+C*D-E
add T,T,F ;T = B+C*D-E+F
add A,T,A ;A = B+C*D-E+F+A
Number of Addresses (cont.)
Two-address machines
¾ One address doubles (for source operand & result)
¾ Last example makes a case for it
Address T is used twice
¾ Sample instructions
load dest,src ; M(dest)=[src]
add dest,src ; M(dest)=[dest]+[src]
sub dest,src ; M(dest)=[dest]-[src]
mult dest,src ; M(dest)=[dest]*[src]
Number of Addresses (cont.)
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
load T,C ;T = C
mult T,D ;T = C*D
add T,B ;T = B+C*D
sub T,E ;T = B+C*D-E
add T,F ;T = B+C*D-E+F
add A,T ;A = B+C*D-E+F+A
Number of Addresses (cont.)
One-address machines
¾ Use special set of registers called accumulators
Specify one source operand & receive the result
¾ Called accumulator machines
¾ Sample instructions
load addr ; accum = [addr]
store addr ; M[addr] = accum
add addr ; accum = accum + [addr]
sub addr ; accum = accum - [addr]
mult addr ; accum = accum * [addr]
Number of Addresses (cont.)
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
load C ;load C into accum
mult D ;accum = C*D
add B ;accum = C*D+B
sub E ;accum = B+C*D-E
add F ;accum = B+C*D-E+F
add A ;accum = B+C*D-E+F+A
store A ;store accum contents in A
Number of Addresses (cont.)
Zero-address machines
¾ Stack supplies operands and receives the result
Special instructions to load and store use an address
¾ Called stack machines (Ex: HP3000, Burroughs B5500)
¾ Sample instructions
push addr ; push([addr])
pop addr ; pop([addr])
add ; push(pop + pop)
sub ; push(pop - pop)
mult ; push(pop * pop)
Number of Addresses (cont.)
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
push E
push C
push D
mult
push B
add
sub
push F
add
push A
add
pop A
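The zero-address sequence can be checked by running it on a small stack-machine interpreter. This is a sketch under stated assumptions: the interpreter, its text format, and the sample values for A to F are invented for the example; only the push/pop/add/sub/mult semantics follow the sample instructions, with the first pop as the left operand.

```python
def run_stack_program(program, memory):
    """Execute zero-address instructions; only push/pop take an address."""
    stack = []
    for line in program:
        parts = line.split()
        op = parts[0]
        if op == "push":
            stack.append(memory[parts[1]])
        elif op == "pop":
            memory[parts[1]] = stack.pop()
        else:
            first = stack.pop()   # first pop: top of stack
            second = stack.pop()  # second pop: element below it
            if op == "add":
                stack.append(first + second)
            elif op == "sub":
                stack.append(first - second)
            elif op == "mult":
                stack.append(first * second)
            else:
                raise ValueError(f"unknown instruction: {op}")
    return memory


program = [
    "push E", "push C", "push D", "mult", "push B", "add",
    "sub", "push F", "add", "push A", "add", "pop A",
]
values = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
expected = values["B"] + values["C"] * values["D"] - values["E"] + values["F"] + values["A"]
result = run_stack_program(program, dict(values))
```

Pushing E first is what lets sub compute (B + C*D) - E: by the time sub executes, B + C*D is on top and E is just below it.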
Load/Store Architecture
Instructions expect operands in internal processor registers
¾ Special LOAD and STORE instructions move data between registers and memory
¾ RISC uses this architecture
¾ Reduces instruction length
Load/Store Architecture (cont.)
Sample instructions
load Rd,addr ;Rd = [addr]
store addr,Rs ;(addr) = Rs
add Rd,Rs1,Rs2 ;Rd = Rs1 + Rs2
sub Rd,Rs1,Rs2 ;Rd = Rs1 - Rs2
mult Rd,Rs1,Rs2 ;Rd = Rs1 * Rs2
Number of Addresses (cont.)
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
load R1,B
load R2,C
load R3,D
load R4,E
load R5,F
load R6,A
mult R2,R2,R3
add R2,R2,R1
sub R2,R2,R4
add R2,R2,R5
add R2,R2,R6
store A,R2
Flow of Control
Default is sequential flow
Several instructions alter this default execution
¾ Branches
Unconditional
Conditional
Delayed branches
¾ Procedure calls
Delayed procedure calls
Flow of Control (cont.)
Branches
¾ Unconditional
Absolute address
PC-relative
Target address is specified relative to PC contents
Relocatable code
¾ Example: MIPS
Absolute address
j target
PC-relative
b target
Flow of Control (cont.)
[Figure: two examples, e.g., Pentium and e.g., SPARC]
Flow of Control (cont.)
Branches
¾ Conditional
Jump is taken only if the condition is met
¾ Two types
Set-Then-Jump
Condition testing is separated from branching
Condition code registers are used to convey the condition test result
Condition code registers keep a record of the status of the last ALU operation such as overflow condition
Example: Pentium code
cmp AX,BX ; compare AX and BX
je target ; jump if equal
Flow of Control (cont.)
Test-and-Jump
Single instruction performs condition testing and branching
Example: MIPS instruction
beq Rsrc1,Rsrc2,target
Jumps to target if Rsrc1 = Rsrc2
Delayed branching
¾ Control is transferred after executing the instruction that follows the branch instruction
This instruction slot is called delay slot
¾ Improves efficiency
¾ Highly pipelined RISC processors support
Flow of Control (cont.)
Procedure calls
¾ Facilitate modular programming
¾ Require two pieces of information to return
End of procedure
Pentium
uses ret instruction
MIPS
uses jr instruction
Return address
In a (special) register
MIPS allows any general-purpose register
On the stack
Pentium
Flow of Control (cont.)
[Figure: delay slot]
Parameter Passing
Two basic techniques
¾ Register-based (e.g., PowerPC, MIPS)
Internal registers are used
Faster
Limit the number of parameters
Recursive procedure
¾ Stack-based (e.g., Pentium)
Stack is used
More general
Operand Types
Instructions support basic data types
¾ Characters
¾ Integers
¾ Floating-point
Instruction overload
¾ Same instruction for different data types
¾ Example: Pentium
mov AL,address ;loads an 8-bit value
mov AX,address ;loads a 16-bit value
mov EAX,address ;loads a 32-bit value
Operand Types
Separate instructions
¾ Instructions specify the operand size
¾ Example: MIPS
lb Rdest,address ;loads a byte
lh Rdest,address ;loads a halfword
;(16 bits)
lw Rdest,address ;loads a word
;(32 bits)
ld Rdest,address ;loads a doubleword
;(64 bits)
Similar instruction: store
Addressing Modes
How the operands are specified
¾ Operands can be in three places
Registers
Register addressing mode
Part of instruction
Constant
Immediate addressing mode
All processors support these two addressing modes
Memory
Difference between RISC and CISC
CISC supports a large variety of addressing modes
RISC follows load/store architecture
Instruction Types
Several types of instructions
¾ Data movement
Pentium: mov dest,src
Some do not provide direct data movement instructions
Indirect data movement
add Rdest,Rsrc,0 ;Rdest = Rsrc+0
¾ Arithmetic and Logical
Arithmetic
Integer and floating-point, signed and unsigned
add, subtract, multiply, divide
Logical
and, or, not, xor
Instruction Types (cont.)
Condition code bits
¾ S: Sign bit (0 = +, 1 = -)
¾ Z: Zero bit (0 = nonzero, 1 = zero)
¾ O: Overflow bit (0 = no overflow, 1 = overflow)
¾ C: Carry bit (0 = no carry, 1 = carry)
Example: Pentium
cmp count,25 ;compare count to 25
;subtract 25 from count
je target ;jump if equal
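How a compare sets the four condition code bits can be sketched by modelling the subtraction on fixed-width integers. The function below is illustrative only: the 16-bit default width is an assumption, and the carry bit is modelled as a borrow (set when the first operand is smaller as an unsigned value), which is the x86 convention for subtraction.

```python
def compare_flags(a, b, bits=16):
    """Return the S, Z, O, C bits produced by computing a - b on `bits` bits."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    return {
        "S": int(bool(result & sign_bit)),  # sign of the result
        "Z": int(result == 0),              # result is zero: operands equal
        # signed overflow: operands differ in sign and the result's sign
        # differs from the sign of the first operand
        "O": int(bool((a ^ b) & (a ^ result) & sign_bit)),
        "C": int(a < b),                    # borrow out of the subtraction
    }


equal = compare_flags(25, 25)       # a "je" style branch would be taken
smaller = compare_flags(10, 25)     # negative result, borrow generated
overflow = compare_flags(0x8000, 1) # most negative 16-bit value minus 1
```

A jump-if-equal instruction then only needs to test the Z bit.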
Instruction Types (cont.)
¾ Flow control and I/O instructions
Branch
Procedure call
Interrupts
¾ I/O instructions
Memory-mapped I/O
Most processors support memory-mapped I/O
No separate instructions for I/O
Isolated I/O
Pentium supports isolated I/O
Separate I/O instructions
in AX,io_port ;read from an I/O port
out io_port,AX ;write to an I/O port
Instruction Formats
Two types
¾ Fixed-length
Used by RISC processors
32-bit RISC processors use 32-bit-wide instructions
Examples: SPARC, MIPS, PowerPC
¾ Variable-length
Used by CISC processors
Memory operands need more bits to specify
Opcode
¾ Major and exact operation
Examples of Instruction Formats
How Hardware Executes Processor’s Instructions
Digital Logic Design
¾ Combinational and Sequential Circuits
Microprogrammed Control
Virtual Machines
Abstractions for computers
Level 5: High-Level Language (machine-independent)
Level 4: Assembly Language (machine-specific)
Level 3: Operating System
Level 2: Instruction Set Architecture
Level 1: Microarchitecture
Level 0: Digital Logic
Basic Microcomputer Design
[Figure: the Central Processor Unit (CPU), containing registers, ALU, CU, and clock, connected to the Memory Storage Unit and to I/O Device #1 and I/O Device #2 by the data bus, control bus, and address bus]
Consider 1-bus Datapath
Assume all entities are 32 bits wide
1-bit ALU
ALU Circuit in 1-bus Datapath
Memory Interface Implementation
Microprogrammed Control
32 32-bit general-purpose registers
¾ Interface only with the A-bus
¾ Each register has two control signals
Gxin and Gxout
Control signals used by the other registers
¾ PC register:
PCin, PCout, and PCbout
¾ IR register:
IRout and IRbin
¾ MAR register:
MARin, MARout, and MARbout
¾ MDR register:
MDRin, MDRout, MDRbin and MDRbout
Microprogrammed Control (cont.)
add %G9,%G5,%G7
Implemented as
Transfer G5 contents to A register
Assert G5out and Ain
Place G7 contents on the A bus
Assert G7out
Instruct ALU to perform addition
Appropriate ALU function control signals
Latch the result in the C register
Assert Cin
Transfer contents of the C register to G9
Assert Cout and G9in
Microprogrammed Control (cont.)
Instruction Fetch
Implemented as
PCbout: read: PCout: ALU=add4: Cin;
read: Cout: PCin;
read: IRbin;
Decodes the instruction and jumps to the appropriate execution routine
Microprogrammed Control (cont.)
Example instruction groups
¾ Load/store
Moves data between registers and memory
¾ Register
Arithmetic and logic instructions
¾ Branch
Jump direct/indirect
¾ Call
Procedure invocation mechanisms
¾ More…
Microprogrammed Control (cont.)
High-level FSM for instruction execution
FSM: finite state machine
Microprogrammed Control (cont.)
Software implementation
¾ Typically used in CISC
Hardware implementation (PLA) is complex and expensive
Example
add %G9,%G5,%G7
¾ Three steps
S1 G5out: Ain;
S2 G7out: ALU=add: Cin;
S3 Cout: G9in: end;
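The three steps can be traced with a small simulation of the 1-bus datapath. This is a sketch, not the course's implementation: the signal names are taken from the slide, but the Python model (a dict of registers, one shared bus value per step, and the ordering "drive the bus, run the ALU, latch") is an assumption, and the end marker is left out.

```python
MASK32 = 0xFFFFFFFF  # all entities are assumed to be 32 bits wide


def run_microroutine(steps, regs):
    """Apply each step's control signals to the register dict `regs`."""
    for signals in steps:
        # 1. The register whose "out" signal is asserted drives the A bus.
        bus = None
        for signal in signals:
            if signal.endswith("out"):
                bus = regs[signal[:-3]]
        # 2. The ALU combines the A latch with the value on the bus.
        alu_result = None
        if "ALU=add" in signals:
            alu_result = (regs["A"] + bus) & MASK32
        # 3. Registers whose "in" signal is asserted latch a value:
        #    C takes the ALU output, every other register takes the bus.
        for signal in signals:
            if signal.endswith("in"):
                target = signal[:-2]
                regs[target] = alu_result if target == "C" else bus
    return regs


add_g9_g5_g7 = [
    ["G5out", "Ain"],             # S1
    ["G7out", "ALU=add", "Cin"],  # S2
    ["Cout", "G9in"],             # S3
]
regs = run_microroutine(add_g9_g5_g7, {"G5": 7, "G7": 5, "G9": 0, "A": 0, "C": 0})
```

Three steps are needed because the single bus can carry only one value at a time: G5, then G7, then the result.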
Microprogrammed Control (cont.)
Simple microcode organization
Microprogrammed Control (cont.)
Uses a microprogram to generate the control signals
¾ Encode the signals of each step as a codeword
Called microinstruction
¾ An instruction is expressed by a sequence of codewords
Called microroutine
Microprogram essentially implements the FSM discussed before
Microprogrammed Control (cont.)
A simple microcontroller can execute a microprogram to generate the control signals
¾ Control store
Store microprogram
¾ Use μPC
Similar to PC
¾ Address generator
Generates appropriate address depending on the
Opcode, and
Condition code inputs
Microprogrammed Control (cont.)
Microcontroller
Microcode resides in the control store, which might be read-only memory (ROM)
Microprogrammed Control (cont.)
Microinstruction format
¾ Two basic ways
Horizontal organization
Vertical organization
¾ Horizontal organization
One bit for each signal
Very flexible
Long microinstructions
Example: 1-bus datapath
Needs 90 bits for each microinstruction
Microprogrammed Control (cont.)
Horizontal microinstruction format
Microprogrammed Control (cont.)
Vertical organization
¾ Encodes to reduce microinstruction length
Reduced flexibility
¾ Example:
Horizontal organization
64 control signals for the 32 general purpose registers
Vertical organization
5 bits to identify the register and 1 for in/out
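The size difference between the two organizations for the register signals can be made concrete. In this sketch the bit layouts are assumptions chosen for illustration (in/out bits interleaved per register for the horizontal form, register number followed by one in/out bit for the vertical form); the slides only give the field widths.

```python
NUM_REGISTERS = 32


def horizontal_encode(register, is_out):
    """One bit per signal: 2 signals x 32 registers = 64-bit field."""
    return 1 << (2 * register + (1 if is_out else 0))


def vertical_encode(register, is_out):
    """5-bit register number plus 1 in/out bit = 6-bit field."""
    return (register << 1) | (1 if is_out else 0)


def vertical_decode(field):
    """Recover (register, is_out); this decoding step is the cost of the shorter field."""
    return field >> 1, bool(field & 1)


widest_horizontal = horizontal_encode(NUM_REGISTERS - 1, True).bit_length()
widest_vertical = vertical_encode(NUM_REGISTERS - 1, True).bit_length()

# Horizontal fields can assert several signals at once by OR-ing bits.
g5out_and_g9in = horizontal_encode(5, True) | horizontal_encode(9, False)
```

The vertical field names exactly one register signal per microinstruction, which is the reduced flexibility noted above.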
2-bus Datapath
Microprogrammed Control (cont.)
Adding more buses reduces time needed to execute instructions
¾ No need to multiplex the bus
Example
add %G9,%G5,%G7
¾ Needed three steps in 1-bus datapath
¾ Need only two steps with a 2-bus datapath
S1 G5out: Ain;
S2 G7out: ALU=add: G9in;
Pipelining
Introduction
3 Hazards
¾ Resource, Data and Control Hazards
3 Technologies for Performance Improvement
¾ Superscalar, Superpipelined, and Very Long Instruction Word
Serial and Pipelining
Serial execution: 20 cycles
Pipelined execution: 8 cycles
For k stages and n instructions, the number of required cycles is:
k + (n - 1)
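The cycle counts can be reproduced with two one-line functions. The choice of five stages and four instructions below is an assumption: it is one combination consistent with the 20 and 8 cycles quoted above, since the figure itself did not survive.

```python
def serial_cycles(k, n):
    """Each instruction passes through all k stages before the next starts."""
    return k * n


def pipelined_cycles(k, n):
    """The first instruction takes k cycles; each later one finishes 1 cycle after."""
    return k + (n - 1)


stages, instructions = 5, 4  # assumed values, see the note above
serial = serial_cycles(stages, instructions)
pipelined = pipelined_cycles(stages, instructions)
speedup = serial / pipelined
```

As n grows, the speedup k*n / (k + n - 1) approaches k, the number of stages.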
Pipelining
Pipelining
¾ Overlapped execution
¾ Increases throughput
Pipelining (cont.)
Pipelining requires buffers
¾ Each buffer holds a single value
¾ Uses just-in-time principle
Any delay in one stage affects the entire pipeline flow
¾ Ideal scenario: equal work for each stage
Sometimes it is not possible
Slowest stage determines the flow rate in the entire pipeline
Pipelining (cont.)
Some reasons for unequal work stages
¾ A complex step cannot be subdivided conveniently
¾ An operation takes variable amount of time to execute
EX: Operand fetch time depends on where the operands are located
Registers
Cache
Memory
¾ Complexity of operation depends on the type of operation
Add: may take one cycle
Multiply: may take several cycles
Pipeline Stall
Operand fetch of I2 takes three cycles
¾ Pipeline stalls for two cycles
Caused by hazards
¾ Pipeline stalls reduce overall throughput
Hazards
Three types of hazards
¾ Resource hazards
Occurs when two or more instructions use the same resource
Also called structural hazards
¾ Data hazards
Caused by data dependencies between instructions
Example: Result produced by I1 is read by I2
¾ Control hazards
Default: sequential execution suits pipelining
Altering control flow (e.g., branching) causes problems
Introduce control dependencies
Resource Hazards
Example
¾ Conflict for memory in clock cycle 3
I1 fetches operand
I3 delays its instruction fetch from the same memory
Data Hazards
Example
¾ I1: add R2,R3,R4 /* R2 = R3 + R4 */
¾ I2: sub R5,R6,R2 /* R5 = R6 – R2 */
Introduces data dependency between I1 and I2
Control Hazards
»Determine branch decision early
Performance Improvement
Several techniques to improve performance of a pipelined system
¾ Superscalar
Replicates the pipeline hardware
¾ Superpipelined
Increases the pipeline depth
¾ Very long instruction word (VLIW)
Encodes multiple operations into a long instruction word
Hardware schedules these instructions on multiple functional units (No run-time analysis)
add R1, R2, R3 ; R1 = R2 + R3
sub R5, R6, R7 ; R5 = R6 - R7
and R4, R1, R5 ; R4 = R1 AND R5
xor R9, R9, R9 ; R9 = R9 XOR R9
cycle 1: add, sub, xor
cycle 2: and
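The grouping into two cycles follows from the data dependencies among the four instructions. The sketch below is a deliberately simplified model, not a real scheduler: it assumes unlimited functional units, one cycle per operation, and tracks only read-after-write dependencies (an instruction must wait for the producers of its source registers).

```python
def schedule(instructions):
    """Assign each (op, dest, src1, src2) the earliest cycle allowed by its sources."""
    produced_in = {}  # register -> cycle in which its newest value is computed
    cycles = []
    for op, dest, src1, src2 in instructions:
        # Sources never written by this sequence are available before cycle 1.
        cycle = 1 + max(produced_in.get(src1, 0), produced_in.get(src2, 0))
        cycles.append(cycle)
        produced_in[dest] = cycle
    return cycles


program = [
    ("add", "R1", "R2", "R3"),
    ("sub", "R5", "R6", "R7"),
    ("and", "R4", "R1", "R5"),  # needs R1 from add and R5 from sub
    ("xor", "R9", "R9", "R9"),
]
cycles = schedule(program)
by_cycle = {}
for (op, *_), cycle in zip(program, cycles):
    by_cycle.setdefault(cycle, []).append(op)
```

Only the and instruction depends on earlier results, so it alone is pushed to the second cycle.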
Superscalar Processor
Ex: Pentium
Wasted Cycles (pipelined)
When one of the stages requires two or more clock cycles, clock cycles are again wasted.
[Figure: pipeline timing table of stages S1 to S6 against cycles 1 to 11 for instructions I-1 to I-3, with one stage taking two cycles per instruction]
For k stages and n instructions, the number of required cycles is:
k + (2n - 1)
Superscalar
A superscalar processor has multiple execution pipelines.
In the following, note that Stage S4 has left and right pipelines (u and v).
[Figure: pipeline timing table of stages S1 to S6 (S4 split into u and v) against cycles 1 to 10 for instructions I-1 to I-4]
For k stages and n instructions, the number of required cycles is:
k + n
Superpipelined Processor
Ex: MIPS R4000
Memory
Introduction
Building Memory Blocks
Alignment of Data
2 Memory Design Issues
¾ Cache
¾ Virtual Memory
Memory (cont.)
Ordered sequence of bytes
¾ The sequence number is called the memory address
¾ Byte addressable memory
Each byte has a unique address
Almost all processors support this
Memory address space
¾ Determined by the address bus width
¾ Pentium has a 32-bit address bus
address space = 4GB (2^32)
¾ Itanium with 64-bit address bus supports
2^64 bytes of address space
Memory (cont.)
Read cycle
1. Place address on the address bus
2. Assert memory read control signal
3. Wait for the memory to retrieve the data
Introduce wait states if using a slow memory
4. Read the data from the data bus
5. Drop the memory read signal
In Pentium, a simple read takes three clock cycles
Clock 1: steps 1 and 2
Clock 2: step 3
Clock 3: steps 4 and 5
Memory (cont.)
Write cycle
1. Place address on the address bus
2. Place data on the data bus
3. Assert memory write signal
4. Wait for the memory to store the data
Introduce wait states if necessary
5. Drop the memory write signal
In Pentium, a simple write also takes three clock cycles
Clock 1: steps 1 and 3
Clock 2: step 2
Clock 3: steps 4 and 5
How Hardware Implements Memory Systems
Building a Memory Block
A 4 X 3 memory design using D flip-flops
Building a Memory Block (cont’d)
Block diagram representation of a 4x3 memory
Address
Data
Control signals
¾ Read
¾ Write
Building Larger Memories
2 X 16 memory module using 74373 chips
Designing Larger Memories
64M X 32 memory using 16M X 16 chips
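The number of chips in such a design follows from dividing the target dimensions by the chip dimensions. This is an illustrative calculation; the function name and the row/column terminology (rows of chips extend the address range, columns of chips extend the word width) are assumptions made for the sketch.

```python
M = 2 ** 20  # "M" in memory sizes means 2^20 words


def chips_needed(target_words, target_width, chip_words, chip_width):
    """Return (rows, columns, total chips) for building the target memory."""
    if target_words % chip_words or target_width % chip_width:
        raise ValueError("target must be a whole multiple of the chip size")
    rows = target_words // chip_words      # chips stacked to add addresses
    columns = target_width // chip_width   # chips side by side to widen the word
    return rows, columns, rows * columns


rows, columns, total = chips_needed(64 * M, 32, 16 * M, 16)

# Address bits: the chip decodes the low bits, the rest select a row of chips.
target_address_bits = (64 * M - 1).bit_length()
chip_address_bits = (16 * M - 1).bit_length()
row_select_bits = target_address_bits - chip_address_bits
```

For the 64M X 32 design this gives four rows of two chips each, with the two extra address bits choosing the row.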
Alignment of Data
Get 32-bit data in one or more read cycles?
Alignment of Data (cont.)
Alignment
¾ 2-byte data: Even address
Rightmost address bit should be zero
¾ 4-byte data: Address that is multiple of 4
Rightmost 2 bits should be zero
¾ 8-byte data: Address that is multiple of 8
Rightmost 3 bits should be zero
¾ Soft alignment
Can handle aligned as well as unaligned data
¾ Hard alignment
Handles only aligned data (enforces alignment)
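The "rightmost bits should be zero" rule translates directly into a bit mask test. The sketch below also estimates how many read cycles an access needs; that part assumes a 32-bit data bus that can only transfer 4-byte-aligned words, which is a simplification introduced for the example.

```python
def is_aligned(address, size):
    """True if the low log2(size) address bits are zero (size a power of two)."""
    return (address & (size - 1)) == 0


def read_cycles(address, size=4, bus_bytes=4):
    """Number of aligned bus-width words touched by the access."""
    first_word = address // bus_bytes
    last_word = (address + size - 1) // bus_bytes
    return last_word - first_word + 1


aligned_word = is_aligned(0x1004, 4)    # low 2 bits are 00
unaligned_word = is_aligned(0x1002, 4)  # low 2 bits are 10
one_cycle = read_cycles(0x1004)
two_cycles = read_cycles(0x1002)        # straddles two aligned words
```

Under soft alignment the second access still works but costs an extra read cycle; under hard alignment it would be rejected.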
Memory Design Issues
Slower memories
Problem: Speed gap between processor and memory
Solution: Cache memory
Use small amount of fast memory
Make the slow memory appear faster
Works due to “reference locality”
Size limitations
¾ Limited amount of physical memory
Overlay technique
Programmer managed
¾ Virtual memory
Automates overlay management
Some additional benefits
Memory Hierarchy
Cache Memory
High-speed expensive static RAM both inside and outside the CPU.
¾ Level-1 cache: inside the CPU
¾ Level-2 cache: outside the CPU
Prefetch data into cache before the processor needs it
¾ Need to predict processor future access requirements
¾ Locality of reference
Cache hit: when data to be read is already in cache memory
Cache miss: when data to be read is not in cache memory.
Causes of misses: compulsory, capacity, and conflict.
Cache design: cache size, n-way, block size, replacement policy
Why Cache Memory Works
Example
for (i=0; i<M; i++)
    for (j=0; j<N; j++)
        X[i][j] = X[i][j] + K;
¾ Each element of X is double (eight bytes)
¾ Loop is executed (M*N) times
Placing the code in cache avoids access to main memory
Repetitive use
Temporal locality
Prefetching data
Spatial locality
Cache Design Basics
On every read miss
¾ A fixed number of bytes are transferred
More than what the processor needs
Effective due to spatial locality
Cache is divided into blocks of B bytes
b bits are needed as offset into the block
b = log2(B)
Blocks are called cache lines
Main memory is also divided into blocks of the same size
Mapping Function
Determines how memory blocks are mapped to cache lines
Three types
¾ Direct mapping
Specifies a single cache line for each memory block
¾ Set-associative mapping
Specifies a set of cache lines for each memory block
¾ Associative mapping
No restrictions
Any cache line can be used for any memory block
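The three mapping functions differ only in how many cache lines share a set, so one address-splitting routine can illustrate all of them. This is a sketch under assumptions: the cache geometry (32-byte blocks, 128 lines) and the sample address are invented for the example, and block size and set count are taken to be powers of two.

```python
def split_address(address, block_size, num_lines, ways=1):
    """Split an address into (tag, set index, byte offset).

    ways=1 is direct mapping, ways=num_lines is fully associative,
    anything in between is set-associative.
    """
    num_sets = num_lines // ways
    offset_bits = block_size.bit_length() - 1  # b = log2(B)
    offset = address & (block_size - 1)
    block_number = address >> offset_bits      # which memory block
    set_index = block_number % num_sets        # where it may be placed
    tag = block_number // num_sets             # identifies the block in its set
    return tag, set_index, offset


address = 0x12345
direct = split_address(address, block_size=32, num_lines=128)
four_way = split_address(address, block_size=32, num_lines=128, ways=4)
fully = split_address(address, block_size=32, num_lines=128, ways=128)
```

With direct mapping the set index names a single line; with full associativity there is one set, so the whole block number becomes the tag and any line may hold the block.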
Direct Mapping
Set-Associative Mapping
Virtual Memory
I/O Devices
Input/Output
I/O devices are interfaced via an I/O controller
¾ Takes care of low-level operations details
Several ways of mapping I/O
¾ Memory-mapped I/O
Reading and writing similar to memory read/write
Uses same memory read and write signals
Most processors use this I/O mapping
¾ Isolated I/O
Separate I/O address space
Separate I/O read and write signals are needed
Pentium supports isolated I/O
Also supports memory-mapped I/O
Input/Output (cont.)
Several ways of transferring data
¾ Programmed I/O
Program uses a busy-wait loop
Anticipated transfer
¾ Direct memory access (DMA)
Special controller (DMA controller) handles data transfers
Typically used for bulk data transfer
¾ Interrupt-driven I/O
Interrupts are used to initiate and/or terminate data transfers
Powerful technique
Handles unanticipated transfers
Interconnection
System components are interconnected by buses
¾ Bus: a bunch of parallel wires
Uses several buses at various levels
¾ On-chip buses
Buses to interconnect ALU and registers
A, B, and C buses in our example
Data and address buses to connect on-chip caches
¾ Internal buses
PCI, AGP, PCMCIA
¾ External buses
Serial, parallel, USB, IEEE 1394 (FireWire)
PC System Buses
ISA (Industry Standard Architecture)
PCI (Peripheral Component Interconnect)
AGP (Accelerated Graphics Port)
Interconnection (cont.)
Bus is a shared resource
¾ Bus transactions
Sequence of actions to complete a well-defined activity
Involves a master and a slave
Memory read, memory write, I/O read, I/O write
¾ Bus operations
A bus transaction may perform one or more bus operations
Pentium burst read
Transfers four memory words
Bus transaction consists of four memory read operations
¾ Bus arbitration