Advanced Architecture
Computer Organization and Assembly Languages
Yung-Yu Chuang
with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne
Basic architecture
Basic microcomputer design
• clock synchronizes CPU operations
• control unit (CU) coordinates sequence of execution steps
• ALU performs arithmetic and logic operations
[Figure: block diagram of a microcomputer. The Central Processor Unit (registers, ALU, CU, clock), the Memory Storage Unit, and I/O Devices #1 and #2 are connected by the data bus, the control bus, and the address bus.]
Basic microcomputer design
• The memory storage unit holds instructions and data for a running program
• A bus is a group of wires that transfer data from one part to another (data, address, control)
Clock
• synchronizes all CPU and bus operations
• machine (clock) cycle measures time of a single operation
• clock is used to trigger events
[Figure: square-wave clock signal alternating between 0 and 1, with one cycle marked]
• Basic unit of time: at 1 GHz, one clock cycle = 1 ns
• An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles
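The relation between clock frequency, cycle time, and instruction time can be sketched in C. The function names `cycle_time_ns` and `instruction_time_ns` are illustrative assumptions and do not come from the slides.

```c
/* Clock period in nanoseconds for a clock frequency given in hertz. */
double cycle_time_ns(double freq_hz)
{
    return 1e9 / freq_hz;
}

/* Time in nanoseconds for an instruction that needs `cycles` clock cycles. */
double instruction_time_ns(double freq_hz, unsigned cycles)
{
    return cycles * cycle_time_ns(freq_hz);
}
```

With a 1 GHz clock, `cycle_time_ns(1e9)` gives 1 ns, so a 50-cycle instruction would take 50 ns on such a clock.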
Instruction execution cycle
• Fetch
• Decode
• Fetch operands
• Execute
• Store output
[Figure: instruction execution cycle. The program counter (PC) selects the next instruction of the program (I-1, I-2, I-3, I-4) in memory; the instruction is fetched into the instruction queue and the instruction register and decoded; operands (op1, op2) are read from memory or registers; the ALU executes and writes the output and the flags.]
Pipeline
Multi-stage pipeline
• Pipelining makes it possible for processor to execute instructions in parallel
• Instruction execution divided into discrete stages
Example of a non-pipelined processor (for example, the 80386): many wasted cycles.
[Figure: stages S1 to S6 versus cycles 1 to 12. I-1 occupies one stage per cycle during cycles 1 to 6; I-2 starts only in cycle 7 and finishes in cycle 12.]
Pipelined execution
• More efficient use of cycles, greater throughput of instructions (the 80486 started to use pipelining)
[Figure: stages S1 to S6 versus cycles 1 to 7. I-2 follows I-1 one stage behind; I-1 finishes in cycle 6 and I-2 in cycle 7.]
• For k stages and n instructions, the number of required cycles is k + (n – 1), compared to k*n without pipelining
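The two cycle counts can be compared with a small C sketch. The function names are hypothetical, and the pipelined version assumes an ideal pipeline with no stalls.

```c
/* Cycles without pipelining: each instruction passes through all k stages alone. */
unsigned long nonpipelined_cycles(unsigned long k, unsigned long n)
{
    return k * n;
}

/* Cycles with an ideal k-stage pipeline: k cycles for the first instruction,
   then one more cycle for every further instruction. */
unsigned long pipelined_cycles(unsigned long k, unsigned long n)
{
    if (n == 0)
        return 0;
    return k + (n - 1);
}
```

For k = 6 and n = 2 these give 12 and 7 cycles, which matches the two figures.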
Pipelined execution
• Pipelining requires buffers
– Each buffer holds a single value
– Ideal scenario: equal work for each stage
• Sometimes it is not possible
• Slowest stage determines the flow rate in the entire pipeline
Pipelined execution
• Some reasons for unequal work stages
– A complex step cannot be subdivided conveniently
– An operation takes variable amount of time to execute, e.g. operand fetch time depends on where the operands are located
• Registers
• Cache
• Memory
– Complexity of operation depends on the type of operation
• Add: may take one cycle
• Multiply: may take several cycles
Pipelined execution
• Operand fetch of I2 takes three cycles
– Pipeline stalls for two cycles
• Caused by hazards
– Pipeline stalls reduce overall throughput
Wasted cycles (pipelined)
• When one of the stages requires two or more clock cycles, clock cycles are again wasted.
• For k stages and n instructions, the number of required cycles is k + (2n – 1)
[Figure: stages S1 to S6 versus cycles 1 to 11 for I-1, I-2, and I-3. The execute (exe) stage takes two cycles, so each following instruction waits for it and I-3 finishes in cycle 11.]
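The count k + (2n – 1) can be checked with a small simulation. This is a sketch under stated assumptions: in-order execution, one instruction per stage at a time, an instruction may wait in a buffer in front of a busy stage, and no other hazards. The name `pipeline_cycles` and the limit `MAX_STAGES` are hypothetical.

```c
#define MAX_STAGES 16

/* Total cycles for n instructions through a k-stage pipeline in which
   stage s takes latency[s] cycles. Returns 0 if k is 0 or too large. */
unsigned pipeline_cycles(const unsigned latency[], unsigned k, unsigned n)
{
    unsigned done[MAX_STAGES] = {0};   /* done[s]: cycle in which stage s last became free */

    if (k == 0 || k > MAX_STAGES)
        return 0;

    for (unsigned i = 0; i < n; i++) {
        unsigned t = 0;                /* cycle in which instruction i left the previous stage */
        for (unsigned s = 0; s < k; s++) {
            /* Start when both the instruction and the stage are ready. */
            unsigned start = t > done[s] ? t : done[s];
            t = start + latency[s];
            done[s] = t;
        }
    }
    return done[k - 1];
}
```

With six stages where only the fourth takes two cycles, three instructions need 11 cycles, the same as k + (2n – 1) = 6 + 5. With all latencies equal to 1 the result reduces to k + (n – 1).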
Superscalar
A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).
For k stages and n instructions, the number of required cycles is k + n.
[Figure: stages S1, S2, S3, S4 (u and v), S5, S6 versus cycles 1 to 10 for I-1 to I-4.]
Pentium: 2 pipelines
Pentium Pro: 3
Pipeline stages
• Pentium 3: 10
• Pentium 4: 20~31
• Next-generation micro-architecture: 14
• ARM7: 3
Hazards
• Three types of hazards
– Resource hazards
• Occurs when two or more instructions use the same resource, also called structural hazards
– Data hazards
• Caused by data dependencies between instructions, e.g. result produced by I1 is read by I2
– Control hazards
• Default: sequential execution suits pipelining
• Altering control flow (e.g., branching) causes problems, introducing control dependencies
Data hazards
add r1, r2, #10 ; write r1
sub r3, r1, #20 ; read r1
fetch decode reg ALU wb
fetch decode stall reg ALU wb
Data hazards
• Forwarding: provides output result as soon as possible
add r1, r2, #10 ; write r1
sub r3, r1, #20 ; read r1
fetch decode reg ALU wb
fetch decode stall reg ALU wb
Control hazards
bz r1, target
add r2, r4, 0
...
target: add r2, r3, 0
[Figure: pipeline diagram (fetch, decode, reg, ALU, wb) for the branch and the instructions fetched after it, each starting one cycle after the previous one.]
Control hazards
• Branches alter control flow
– Require special attention in pipelining
– Need to throw away some instructions in the pipeline
• Depends on when we know the branch is taken
• Pipeline wastes three clock cycles
– Called branch penalty
– Reducing branch penalty
• Determine branch decision early
Control hazards
• Delayed branch execution
– Effectively reduces the branch penalty
– We always fetch the instruction following the branch
• Why throw it away?
• Place a useful instruction to execute
• This is called delay slot
Original order:
add R2,R3,R4
branch target
sub R5,R6,R7
. . .
With the add moved into the delay slot:
branch target
add R2,R3,R4
sub R5,R6,R7
. . .
Branch prediction
• Three prediction strategies
– Fixed
• Prediction is fixed
– Example: branch-never-taken
» Not proper for loop structures
– Static
• Strategy depends on the branch type
– Conditional branch: always not taken
– Loop: always taken
– Dynamic
• Takes run-time history to make more accurate predictions
Branch prediction
• Static prediction
– Improves prediction accuracy over Fixed
Instruction type       Instruction distribution (%)   Prediction: branch taken?   Correct prediction (%)
Unconditional branch   70*0.4 = 28                    Yes                         28
Conditional branch     70*0.6 = 42                    No                          42*0.6 = 25.2
Loop                   10                             Yes                         10*0.9 = 9
Call/return            20                             Yes                         20
Overall prediction accuracy = 82.2%
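The overall accuracy is the sum, over all instruction types, of the share of that type times the fraction of its predictions that are correct. A minimal C sketch of that arithmetic follows; the struct and function names are hypothetical.

```c
#include <stddef.h>

/* One row of the table: share of all branch-type instructions (in percent)
   and the fraction of that share for which the static prediction is correct. */
struct branch_class {
    double share_percent;
    double fraction_correct;
};

/* Weighted sum of the rows, in percent. */
double overall_accuracy(const struct branch_class *rows, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += rows[i].share_percent * rows[i].fraction_correct;
    return total;
}
```

Feeding in the four rows of the table (28 at 1.0, 42 at 0.6, 10 at 0.9, 20 at 1.0) gives 28 + 25.2 + 9 + 20 = 82.2.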
Branch prediction
• Dynamic branch prediction
– Uses runtime history
• Takes the past n branch executions of the branch type and makes the prediction
– Simple strategy
• Prediction of the next branch is the majority of the previous n branch executions
• Example: n = 3
– If two or more of the last three branches were taken, the prediction is “branch taken”
• Depending on the type of mix, we get more than 90% prediction accuracy
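The majority rule can be sketched in C as follows. The function name `predict_taken` is hypothetical, and predicting "not taken" on a tie (even n) is an assumption, since the slides only give the n = 3 case.

```c
#include <stdbool.h>
#include <stddef.h>

/* history[i] is true if the i-th of the last n executions of the branch was taken.
   Returns true ("branch taken") when more than half of them were taken. */
bool predict_taken(const bool history[], size_t n)
{
    size_t taken = 0;
    for (size_t i = 0; i < n; i++)
        if (history[i])
            taken++;
    return 2 * taken > n;
}
```

For n = 3 this returns true exactly when two or more of the last three branches were taken.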
Branch prediction
• Impact of past n branches on prediction accuracy
     Type of mix
n    Compiler   Business   Scientific
0    64.1       64.4       70.4
1    91.9       95.2       86.6
2    93.3       96.5       90.8
3    93.7       96.6       91.0
4    94.5       96.8       91.8
5    94.7       97.0       92.0
Branch prediction
[Figure: two-bit branch prediction state diagram. States 00 and 01 predict no branch; states 10 and 11 predict branch. The transitions between states are labeled "branch" and "no branch".]
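A two-bit predictor of this kind can be sketched in C. The exact transition arrows of the diagram cannot be recovered from the text, so the code below uses the common saturating-counter variant as an assumption; the names `predictor_state`, `predicts_branch`, and `predictor_update` are hypothetical.

```c
#include <stdbool.h>

/* States 0 (00) and 1 (01) predict "no branch"; 2 (10) and 3 (11) predict "branch". */
typedef unsigned char predictor_state;

bool predicts_branch(predictor_state s)
{
    return s >= 2;
}

/* Saturating counter: move one step toward 3 on a taken branch,
   one step toward 0 on a branch that was not taken. */
predictor_state predictor_update(predictor_state s, bool taken)
{
    if (taken)
        return s < 3 ? (predictor_state)(s + 1) : 3;
    return s > 0 ? (predictor_state)(s - 1) : 0;
}
```

In this variant a single untaken branch in state 11 does not flip the prediction; it takes two in a row, which is what makes the scheme tolerant of loop exits.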
Multitasking
• OS can run multiple programs at the same time.
• Multiple threads of execution within the same program.
• Scheduler utility assigns a given amount of CPU time to each running program.
• Rapid switching of tasks
– gives illusion that all programs are running at once
– the processor must support task switching
– scheduling policy, round-robin, priority
Cache
SRAM vs DRAM
       Tran. per bit   Access time   Needs refresh?   Cost   Applications
SRAM   4 or 6          1X            No               100X   cache memories
DRAM   1               10X           Yes              1X     Main memories, frame buffers
The CPU-Memory gap
The gap widens between DRAM, disk, and CPU speeds.
[Figure: log-scale plot of time in ns (1 to 100,000,000) versus year (1980 to 2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]
                       register   cache   memory   disk
Access time (cycles)   1          1-10    50-100   20,000,000
Memory hierarchies
• Some fundamental and enduring properties of hardware and software:
– Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
– The gap between CPU and main memory speed is widening.
– Well-written programs tend to exhibit good locality.
• They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
Memory system in practice
L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (virtual memory) (local disks)
L5: remote secondary storage (tapes, distributed file systems, Web servers)
Toward L0: smaller, faster, and more expensive (per byte) storage devices
Toward L5: larger, slower, and cheaper (per byte) storage devices
Reading from memory
• Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.
[Figure: Pentium III cache hierarchy, with a box marking the processor chip]
– Regs.
– L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency
– L1 Instruction: 16 KB, 4-way, 32B lines
– L2 Unified: 128 KB to 2 MB, 4-way assoc, write-back, write allocate, 32B lines
– Main Memory: up to 4 GB
Cache memory
• High-speed expensive static RAM both inside and outside the CPU.
– Level-1 cache: inside the CPU
– Level-2 cache: outside the CPU
• Cache hit: when data to be read is already in cache memory
• Cache miss: when data to be read is not in cache memory. Misses are classified as compulsory, capacity, and conflict misses.
• Cache design: cache size, n-way, block size, replacement policy
Caching in a memory hierarchy
Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
Data is copied between levels in block-sized transfer units.
Larger, slower, cheaper storage device at level k+1 is partitioned into blocks.
[Figure: level k holds blocks 8, 9, 14, and 3; level k+1 holds blocks 0 to 15; blocks 4 and 10 are shown being copied between the two levels.]
General caching concepts
[Figure: level k holds blocks 4*, 9, 14, and 3; level k+1 holds blocks 0 to 15. A request for block 14 hits at level k. A request for block 12 misses; block 12 is fetched from level k+1 and takes the place of 4*.]
• Program needs object d, which is stored in some block b.
• Cache hit
– Program finds b in the cache at level k. E.g., block 14.
• Cache miss
– b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12.
– If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
• Placement policy: where can the new block go? E.g., b mod 4
• Replacement policy: which block should be evicted? E.g., LRU
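The "b mod 4" placement policy can be sketched as a four-slot cache in C. The struct and function names are hypothetical. Because each block has exactly one possible slot under this policy, the sketch needs no replacement policy such as LRU: the victim is simply whatever occupies that slot.

```c
#include <stdbool.h>

#define NUM_SLOTS 4

struct cache {
    bool valid[NUM_SLOTS];      /* does the slot hold a block yet? */
    unsigned block[NUM_SLOTS];  /* block number stored in each slot */
};

/* Returns true on a hit. On a miss, block b is brought in and replaces
   whatever occupied slot b mod NUM_SLOTS. */
bool cache_access(struct cache *c, unsigned b)
{
    unsigned slot = b % NUM_SLOTS;
    if (c->valid[slot] && c->block[slot] == b)
        return true;
    c->valid[slot] = true;
    c->block[slot] = b;
    return false;
}
```

Starting from an empty cache (`struct cache c = {0};`), blocks 4 and 12 both map to slot 0, so accessing them alternately misses every time even though the other three slots are free. That is a conflict miss.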
Locality
• Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
– Temporal locality: recently referenced items are likely to be referenced in the near future.
– Spatial locality: items with nearby addresses tend to be referenced close together in time.
• In general, programs with good locality run faster than programs with poor locality
• Locality is the reason why caches and virtual memory are built into the architecture and the operating system. Another example is a web browser caching recently visited webpages.
Locality example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
• Data
– Reference array elements in succession (stride-1 reference pattern): Spatial locality
– Reference sum each iteration: Temporal locality
• Instructions
– Reference instructions in sequence: Spatial locality
– Cycle through loop repeatedly: Temporal locality
Locality example
• Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
stride-1 reference pattern
Locality example
• Does this function have good locality?
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
stride-N reference pattern
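The difference between the two traversals follows from C's row-major array layout. The sketch below computes the distance, in elements, between consecutive accesses of each loop order; the function names are hypothetical.

```c
#include <stddef.h>

/* Position of a[i][j], counted in elements from a[0][0], for a row-major
   array with ncols columns (the N in int a[M][N]). */
size_t row_major_index(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}

/* Inner loop over j (sum_array_rows): step from a[i][j] to a[i][j+1]. */
size_t row_wise_stride(size_t ncols)
{
    return row_major_index(0, 1, ncols) - row_major_index(0, 0, ncols);
}

/* Inner loop over i (sum_array_cols): step from a[i][j] to a[i+1][j]. */
size_t col_wise_stride(size_t ncols)
{
    return row_major_index(1, 0, ncols) - row_major_index(0, 0, ncols);
}
```

The row-wise stride is always 1, while the column-wise stride equals N, so for large N each access of `sum_array_cols` is likely to touch a different cache line.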
Blocked matrix multiply performance
• Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
– relatively insensitive to array size.
[Figure: cycles/iteration (0 to 60) versus array size n (25 to 400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]
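The slides do not show the blocked code itself. The following C sketch is one common formulation of the bijk loop order (block loops over k and j outside, then i, j, k inside); the function name, the flat row-major arrays, and the parameter names are assumptions made for illustration.

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* c = a * b for n x n row-major matrices, processed in blocks of
   bsize columns of b and bsize elements of each row of a. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t bsize)
{
    if (bsize == 0)
        bsize = 1;                       /* avoid a zero step in the block loops */

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t kk = 0; kk < n; kk += bsize)
        for (size_t jj = 0; jj < n; jj += bsize)
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < min_size(jj + bsize, n); j++) {
                    double sum = 0.0;    /* partial dot product for this block */
                    for (size_t k = kk; k < min_size(kk + bsize, n); k++)
                        sum += a[i * n + k] * b[k * n + j];
                    c[i * n + j] += sum;
                }
}
```

The idea is that the bsize x bsize block of `b` stays in the cache while every row of `a` is swept across it, so each loaded cache line is reused many times before it is evicted.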
Cache-conscious programming
• make sure that memory is cache-aligned
• Split data into hot and cold (list example)
• Use union and bitfields to reduce size and increase locality
RISC vs. CISC
Trade-offs of instruction sets
[Figure: high-level language (C, C++, Lisp, Prolog, Haskell…) → compiler → machine code, with the semantic gap between high-level language and machine code]
• Before 1980, the trend is to increase instruction complexity (one-to-one mapping if possible) to bridge the gap. Reduce fetch from memory. Selling point: number of instructions, addressing modes. (CISC)
• 1980, RISC. Simplify and regularize instructions to introduce advanced architecture for better performance, pipeline, cache, superscalar.
RISC
• 1980, Patterson and Ditzel (Berkeley), RISC
• Features
– Fixed-length instructions
– Load-store architecture
– Register file
• Organization
– Hard-wired logic
– Single-cycle instruction
– Pipeline
• Pros: small die size, short development time, high performance
• Cons: low code density, not x86 compatible
RISC Design Principles
• Simple operations
– Simple instructions that can execute in one cycle
• Register-to-register operations
– Only load and store operations access memory
– Rest of the operations on a register-to-register basis
• Simple addressing modes
– A few addressing modes (1 or 2)
• Large number of registers
– Needed to support register-to-register operations
– Minimize the procedure call and return overhead
RISC Design Principles
• Fixed-length instructions
– Facilitates efficient instruction execution
• Simple instruction format
– Fixed boundaries for various fields
• opcode, source operands,…
CISC and RISC
• CISC – complex instruction set
– large instruction set
– high-level operations (simpler for compiler?)
– requires microcode interpreter (could take a long time)
– examples: Intel 80x86 family
• RISC – reduced instruction set
– small instruction set
– simple, atomic instructions
– directly executed by hardware very quickly
– easier to incorporate advanced architecture design
– examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq), PowerPC, MIPS
CISC and RISC
                     CISC (Intel 486)   RISC (MIPS R4000)
#instructions        235                94
Addr. modes          11                 1
Inst. size (bytes)   1-12               4
GP registers         8                  32
Why RISC?
• Simple instructions are preferred
– Complex instructions are mostly ignored by compilers
• Due to semantic gap
• Simple data structures
– Complex data structures are used relatively infrequently
– Better to support a few simple data types efficiently
• Synthesize complex ones
• Simple addressing modes
– Complex addressing modes lead to variable length instructions
• Lead to inefficient instruction decoding and scheduling
Why RISC? (cont’d)
• Large register set
– Efficient support for procedure calls and returns
• Patterson and Sequin’s study
– Procedure call/return: 12-15% of HLL statements
» Constitute 31-33% of machine language instructions
» Generate nearly half (45%) of memory references
– Small activation record
• Tanenbaum’s study
– Only 1.25% of the calls have more than 6 arguments
– More than 93% have less than 6 local scalar variables
– Large register set can avoid memory references
ISA design issues
Instruction set design
• Issues when determining ISA
– Instruction types
– Number of addresses
– Addressing modes
Instruction types
• Arithmetic and logic
• Data movement
• I/O (memory-mapped, isolated I/O)
• Flow control
– Branches (unconditional, conditional)
• Set-then-jump (cmp AX, BX; je target)
• Test-and-jump (beq r1, r2, target)
– Procedure calls (register-based, stack-based)
• Pentium: ret; MIPS: jr
• Register: faster but limited number of parameters
• Stack: slower but more general
Operand types
• Instructions support basic data types
– Characters
– Integers
– Floating-point
• Instruction overload
– Same instruction for different data types
– Example: Pentium
mov AL,address ;loads an 8-bit value
mov AX,address ;loads a 16-bit value
mov EAX,address ;loads a 32-bit value
Operand types
• Separate instructions
– Instructions specify the operand size
– Example: MIPS
lb Rdest,address ;loads a byte
lh Rdest,address ;loads a halfword (16 bits)
lw Rdest,address ;loads a word (32 bits)
ld Rdest,address ;loads a doubleword (64 bits)
Number of addresses
Number of addresses
• Four categories
– 3-address machines
• Two for the source operands and one for the result
– 2-address machines
• One address doubles as source and result
– 1-address machines
• Accumulator machines
• Accumulator is used for one source and result
– 0-address machines
• Stack machines
• Operands are taken from the stack
• Result goes onto the stack
Number of addresses
Number of addresses   instruction   operation
3                     OP A, B, C    A ← B OP C
2                     OP A, B       A ← A OP B
1                     OP A          AC ← AC OP A
0                     OP            T ← (T-1) OP T
A, B, C: memory or register locations
AC: accumulator
T: top of stack
T-1: second element of stack
3-address
$Y = \frac{A - B}{C + (D \times E)}$
Example: RISC machines, TOY
Instruction format: opcode A B C
SUB Y, A, B ; Y = A - B
MUL T, D, E ; T = D × E
ADD T, T, C ; T = T + C
DIV Y, Y, T ; Y = Y / T
2-address
$Y = \frac{A - B}{C + (D \times E)}$
Example: IA32
Instruction format: opcode A B
MOV Y, A ; Y = A
SUB Y, B ; Y = Y - B
MOV T, D ; T = D
MUL T, E ; T = T × E
ADD T, C ; T = T + C
DIV Y, T ; Y = Y / T
1-address
$Y = \frac{A - B}{C + (D \times E)}$
Example: IA32’s MUL (EAX)
Instruction format: opcode A
LD D ; AC = D
MUL E ; AC = AC × E
ADD C ; AC = AC + C
ST Y ; Y = AC
LD A ; AC = A
SUB B ; AC = AC – B
DIV Y ; AC = AC / Y
ST Y ; Y = AC
0-address
$Y = \frac{A - B}{C + (D \times E)}$
Example: IA32’s FPU, HP3000
Instruction format: opcode
PUSH A ; A
PUSH B ; A, B
SUB ; A-B
PUSH C ; A-B, C
PUSH D ; A-B, C, D
PUSH E ; A-B, C, D, E
MUL ; A-B, C, D×E
ADD ; A-B, C+(D×E)
DIV ; (A-B) / (C+(D×E))
POP Y
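The 0-address sequence above can be traced with a tiny stack-machine interpreter in C. This is a sketch, not a model of any real instruction set: the `enum op`, `struct instr`, and `run_stack_machine` names are hypothetical, PUSH carries a literal value instead of a memory address, and the program is assumed to be well formed (it never pops an empty stack or holds more than 32 entries).

```c
#include <stddef.h>

enum op { OP_PUSH, OP_ADD, OP_SUB, OP_MUL, OP_DIV };

struct instr {
    enum op op;
    double value;                   /* used only by OP_PUSH */
};

/* Runs the program and returns the value left on top of the stack. */
double run_stack_machine(const struct instr *prog, size_t len)
{
    double stack[32];
    size_t top = 0;                 /* number of elements on the stack */

    for (size_t i = 0; i < len; i++) {
        if (prog[i].op == OP_PUSH) {
            stack[top++] = prog[i].value;
            continue;
        }
        double t = stack[--top];    /* T: top of stack */
        double t1 = stack[--top];   /* T-1: second element of stack */
        switch (prog[i].op) {       /* T <- (T-1) OP T */
        case OP_ADD: stack[top++] = t1 + t; break;
        case OP_SUB: stack[top++] = t1 - t; break;
        case OP_MUL: stack[top++] = t1 * t; break;
        case OP_DIV: stack[top++] = t1 / t; break;
        default: break;
        }
    }
    return top > 0 ? stack[top - 1] : 0.0;
}
```

Encoding the slide's program with A = 20, B = 4, C = 2, D = 3, E = 2 leaves (20 - 4) / (2 + 3 × 2) = 2 on the stack, which is the value POP Y would store.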
Number of addresses
• A basic design decision; could be mixed
• Fewer addresses per instruction results in
– a less complex processor
– shorter instructions
– longer and more complex programs
– longer execution time
• The decision has impacts on register usage policy as well
– 3-address usually means more general-purpose registers
– 1-address usually means less
Addressing modes
Addressing modes
• How to specify location of operands? Trade-off for address range, address flexibility, number of memory references, calculation of addresses
• Operands can be in three places
– Registers
• Register addressing mode
– Part of instruction
• Constant
• Immediate addressing mode
• All processors support these two addressing modes
– Memory
• Difference between RISC and CISC
• CISC supports a large variety of addressing modes
• RISC follows load/store architecture
Addressing modes
• Common addressing modes
– Implied
– Immediate (lda R1, 1)
– Direct (st R1, A)
– Indirect
– Register (add R1, R2, R3)
– Register indirect (sti R1, R2)
– Displacement
– Stack
Implied addressing
• No address field; operand is implied by the instruction
CLC ; clear carry
• A fixed and unvarying address
[Figure: instruction format containing only an opcode]
Immediate addressing
• Address field contains the operand value
ADD 5; AC=AC+5
• Pros: no extra memory reference; faster
• Cons: limited range
[Figure: instruction format: opcode | operand]
Direct addressing
• Address field contains the effective address of the operand
ADD A; AC=AC+[A]
• single memory reference
• Pros: no additional address calculation
• Cons: limited address space
[Figure: instruction format: opcode | address A; A selects the operand in memory]
Indirect addressing
• Address field contains the address of a pointer to the operand
ADD [A]; AC=AC+[[A]]
• multiple memory references
• Pros: large address space
• Cons: slower
[Figure: instruction format: opcode | address A; A selects a pointer in memory, which in turn selects the operand in memory]
Register addressing
• Address field contains the address of a register
ADD R; AC=AC+R
• Pros: only need a small address field; shorter instruction and faster fetch; no memory reference
• Cons: limited address space
[Figure: instruction format: opcode | R; R selects the operand in the registers]
Register indirect addressing
• Address field contains the address of the register containing a pointer to the operand
ADD [R]; AC=AC+[R]
• Pros: large address space
• Cons: extra memory reference
[Figure: instruction format: opcode | R; R selects a register whose contents point to the operand in memory]
Displacement addressing
• Address field could contain a register address and an address
MOV EAX, [A+ESI*4]
• EA=A+[R×S] or vice versa
• Several variants
– Base-offset: [EBP+8]
– Base-index: [EBX+ESI]
– Scaled: [T+ESI*4]
• Pros: flexible
• Cons: complex
[Figure: instruction format: opcode | R | A; the contents of register R and the address A are added to locate the operand in memory]
Displacement addressing
MOV EAX, [A+ESI*4]
• Often, register, called indexing register, is used for displacement.
• Usually, a mechanism is provided to efficiently increase the indexing register.
[Figure: instruction format: opcode | R | A; the contents of register R and the address A are added to locate the operand in memory]
Stack addressing
• Operand is on top of the stack
ADD [R]; AC=AC+[R]
• Pros: large address space
• Pros: short and fast fetch
• Cons: limited by FILO order
[Figure: instruction format containing only an opcode; the operand is implicit, on the stack]
Addressing modes
Mode                Meaning        Pros                  Cons
Implied                            Fast fetch            Limited instructions
Immediate           Operand=A      No memory ref         Limited operand
Direct              EA=A           Simple                Limited address space
Indirect            EA=[A]         Large address space   Multiple memory ref
Register            EA=R           No memory ref         Limited address space
Register indirect   EA=[R]         Large address space   Extra memory ref
Displacement        EA=A+[R]       Flexibility           Complexity
Stack               EA=stack top   No memory ref         Limited applicability
IA32 addressing modes
Effective address calculation (IA32)
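A dummy format for one operand:
Field:   base   index   s   displacement
Bits:    3      3       2   8 or 32
[Figure: base and index select registers in the register file; the index value passes through a shifter controlled by s; adders combine the base, the shifted index, and the displacement into the address sent to memory.]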
Based Addressing
• Effective address is computed as
base + signed displacement
– Displacement:
– 16-bit addresses: 8- or 16-bit number
– 32-bit addresses: 8- or 32-bit number
• Useful to access fields of a structure or record
• Base register ==> points to the base address of the structure
• Displacement ==> relative offset within the structure
• Useful to access arrays whose element size is not 2, 4, or 8 bytes
• Displacement ==> points to the beginning of the array
• Base register ==> relative offset of an element within the array
To be used with S. Dandamudi, “Fundamentals of Computer Organization and Design,” Springer, 2003.
S. Dandamudi Chapter 11: Page 81
Based Addressing
Indexed Addressing
• Effective address is computed as
(index * scale factor) + signed displacement
– 16-bit addresses:
– displacement: 8- or 16-bit number
– scale factor: none (i.e., 1)
– 32-bit addresses:
– displacement: 8- or 32-bit number
– scale factor: 2, 4, or 8
• Useful to access elements of an array (particularly if the element size is 2, 4, or 8 bytes)
• Displacement ==> points to the beginning of the array
• Index register ==> selects an element of the array (array index)
• Scaling factor ==> size of the array element
Indexed Addressing
Examples
add AX,[DI+20]
– We have seen similar usage to access parameters off the stack
add AX,marks_table[ESI*4]
– Assembler replaces marks_table by a constant (i.e., supplies the displacement)
– Each element of marks_table takes 4 bytes (the scale factor value)
– ESI needs to hold the element subscript value
add AX,table1[SI]
– SI needs to hold the element offset in bytes
– When we use the scale factor we avoid such byte counting
Based-Indexed Addressing
Based-indexed addressing with no scale factor
• Effective address is computed as
base + index + signed displacement
• Useful in accessing two-dimensional arrays
• Displacement ==> points to the beginning of the array
• Base and index registers ==> point to a row and an element within that row
• Useful in accessing arrays of records
• Displacement ==> represents the offset of a field in a record
• Base and index registers ==> hold a pointer to the base of the array and the offset of an element relative to the base of the array
Based-Indexed Addressing
• Useful in accessing arrays passed on to a procedure
• Base register ==> points to the beginning of the array
• Index register ==> represents the offset of an element relative to the base of the array
Example
Assuming BX points to table1
mov AX,[BX+SI]
cmp AX,[BX+SI+2]
compares two successive elements of table1
Based-Indexed Addressing
Based-indexed addressing with scale factor
• Effective address is computed as
base + (index * scale factor) + signed displacement
• Useful in accessing two-dimensional arrays when the element size is 2, 4, or 8 bytes
• Displacement ==> points to the beginning of the array
• Base register ==> holds offset to a row (relative to start of array)
• Index register ==> selects an element of the row
• Scaling factor ==> size of the array element
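The effective-address formula with a scale factor can be written out in C as a sketch. The function name and parameter names are hypothetical, and the result is assumed to wrap to 32 bits as in 32-bit addressing; the other modes fall out as special cases (index 0 for based addressing, scale 1 for no scale factor).

```c
#include <stdint.h>

/* Effective address = base + (index * scale) + signed displacement,
   wrapped to 32 bits. scale is expected to be 1, 2, 4, or 8. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t displacement)
{
    return base + index * scale + (uint32_t)displacement;
}
```

For example, with a row starting at offset 0x1000, element subscript 3, 4-byte elements, and a displacement of 8, the sketch yields 0x1000 + 12 + 8 = 0x1014.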