Architecture and Assembly Basics
Computer Organization and Assembly Languages Yung-Yu Chuang
2007/10/29
with slides by Peng-Sheng Chen, Kip Irvine, Robert Sedgewick and Kevin Wayne
Announcements
• Midterm exam? Date? 11/12 (I prefer this one) or 11/19?
• 4 assignments plus midterm vs. 5 assignments
• Open-book
Basic architecture
Basic microcomputer design
• clock synchronizes CPU operations
• control unit (CU) coordinates sequence of execution steps
• ALU performs arithmetic and logic operations
[Diagram: the Central Processor Unit (CPU), containing the registers, ALU, CU, and clock, is connected to the Memory Storage Unit and to I/O Devices #1 and #2 by the data bus, address bus, and control bus]
Basic microcomputer design
• The memory storage unit holds instructions and data for a running program
• A bus is a group of wires that transfer data from one part to another (data, address, control)
Clock
• synchronizes all CPU and BUS operations
• machine (clock) cycle measures time of a single operation
• clock is used to trigger events
[Diagram: clock signal alternating between 1 and 0; one cycle spans one full period]
• Basic unit of time; at 1 GHz, one clock cycle = 1 ns
• An instruction can take multiple cycles to complete, e.g. multiply on the 8088 takes 50 cycles
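The arithmetic behind these bullets can be sketched as follows (the 1 GHz frequency is the slide's example; pairing it with the 8088's 50-cycle multiply is purely illustrative):

```python
# Clock-cycle arithmetic: cycle time is the reciprocal of clock frequency.
def cycle_time_ns(freq_hz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / freq_hz

def instruction_time_ns(freq_hz: float, cycles: int) -> float:
    """Time for an instruction that takes `cycles` clock cycles."""
    return cycles * cycle_time_ns(freq_hz)

print(cycle_time_ns(1e9))            # 1 GHz -> 1.0 ns per cycle
print(instruction_time_ns(1e9, 50))  # a 50-cycle multiply at 1 GHz -> 50.0 ns
```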
Instruction execution cycle
• Fetch
• Decode
• Fetch operands
• Execute
• Store output
[Diagram: instruction execution cycle. The program counter selects the next instruction (I-1, I-2, …) from the program in memory; it is fetched into the instruction queue and instruction register, decoded, operands op1 and op2 are read from the registers, the ALU executes and sets flags, and the output is written back to registers or memory]
Advanced architecture
Multi-stage pipeline
• Pipelining makes it possible for the processor to execute instructions in parallel
• Instruction execution divided into discrete stages
[Diagram: a six-stage (S1–S6) timing chart over 12 cycles; I-2 does not enter stage S1 until I-1 has completed all six stages]
Example of a non-pipelined processor, e.g. the 80386. Many wasted cycles.
Pipelined execution
• More efficient use of cycles and greater instruction throughput (the 80486 started to use pipelining)
[Diagram: the same six stages (S1–S6) over 7 cycles; I-2 enters stage S1 in cycle 2, one stage behind I-1]
For k stages and n instructions, the number of required cycles is:
k + (n – 1)
compared to k × n without pipelining
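The two formulas can be checked against the diagrams (k = 6 stages, n = 2 instructions, one cycle per stage):

```python
# Cycle counts for k pipeline stages and n instructions, one cycle per stage.
def cycles_unpipelined(k: int, n: int) -> int:
    return k * n            # each instruction occupies all k stages alone

def cycles_pipelined(k: int, n: int) -> int:
    return k + (n - 1)      # k cycles to fill the pipe, then one finishes per cycle

k, n = 6, 2
print(cycles_unpipelined(k, n))  # 12 cycles, as in the non-pipelined diagram
print(cycles_pipelined(k, n))    # 7 cycles, as in the pipelined diagram
```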
Wasted cycles (pipelined)
• When one of the stages requires two or more clock cycles, clock cycles are again wasted.
[Diagram: the six-stage chart again, but the execute stage S4 takes two cycles, so I-2 and I-3 each stall behind the preceding instruction]
For k stages and n instructions, the number of required cycles is:
k + (2n – 1)
Superscalar
A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).
[Diagram: the six-stage chart with stage S4 split into two parallel pipelines u and v, so two instructions can occupy S4 in the same cycle; I-1 through I-4 complete by cycle 10]
For k stages and n instructions, the number of required cycles is:
k + n
Pentium: 2 pipelines; Pentium Pro: 3
More stages, better performance?
• Pentium 3: 10
• Pentium 4: 20–31
• Next-generation micro-architecture: 14
• ARM7: 3
Pipeline hazards
add r1, r2, #10 ; write r1
sub r3, r1, #20 ; read r1
[Diagram: add flows through fetch, decode, reg, ALU, memory, wb; sub stalls after decode until add has written r1 back]
Pipelined branch behavior
[Diagram: five instructions overlapped one stage apart through fetch, decode, reg, ALU, memory, wb; a branch resolved late in the pipeline means the instructions fetched behind it must be discarded]
Reading from memory
• Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g. over a 33 MHz bus). The wasted clock cycles are called wait states.
[Diagram: on the processor chip: registers; L1 data cache (16 KB, 4-way set-associative, 32 B lines, write-through, 1-cycle latency); L1 instruction cache (16 KB, 4-way, 32 B lines). Off chip: unified L2 cache (128 KB–2 MB, 4-way, 32 B lines, write-back, write-allocate) and main memory, up to 4 GB]
Pentium III cache hierarchy
Cache memory
• High-speed expensive static RAM both inside and outside the CPU.
– Level-1 cache: inside the CPU
– Level-2 cache: outside the CPU
• Cache hit: the data to be read is already in cache memory
• Cache miss: the data to be read is not in cache memory. Misses are classified as compulsory, capacity, or conflict misses.
• Cache design parameters: cache size, associativity (n-way), block size, replacement policy
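A hit/miss lookup can be sketched with a minimal direct-mapped cache (a 1-way cache; the 4-line, 16-byte-block parameters are made up for illustration). The address is split into block offset, index, and tag:

```python
# Minimal direct-mapped cache simulator (illustrative parameters).
class DirectMappedCache:
    def __init__(self, num_lines: int, block_size: int):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines   # one tag per cache line

    def access(self, addr: int) -> bool:
        """Return True on a cache hit, False on a miss (and fill the line)."""
        block = addr // self.block_size  # strip the block offset
        index = block % self.num_lines   # which line the block maps to
        tag = block // self.num_lines    # identifies the block within that line
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag           # replace the resident block on a miss
        return False

cache = DirectMappedCache(num_lines=4, block_size=16)
print(cache.access(0))    # False: compulsory miss (first touch)
print(cache.access(4))    # True: same 16-byte block
print(cache.access(64))   # False: maps to index 0, evicts block 0
print(cache.access(0))    # False: conflict miss caused by the eviction
```

The last two accesses show a conflict miss: both addresses map to the same line even though the other three lines are empty.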
Memory system in practice
Smaller, faster, and more expensive (per byte) storage devices at the top; larger, slower, and cheaper (per byte) at the bottom:
L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks, virtual memory)
L5: remote secondary storage (tapes, distributed file systems, Web servers)
Multitasking
• OS can run multiple programs at the same time.
• Multiple threads of execution within the same program.
• Scheduler utility assigns a given amount of CPU time to each running program.
• Rapid switching of tasks
– gives the illusion that all programs are running at once
– the processor must support task switching
– scheduling policy: round-robin, priority
Trade-offs of instruction sets
• Before 1980, the trend was to increase instruction complexity (one-to-one mapping to high-level constructs where possible) to bridge the semantic gap and reduce fetches from memory. Selling points: number of instructions, addressing modes. (CISC)
• Around 1980, RISC: simplify and regularize instructions to enable advanced implementation techniques for better performance: pipelining, caches, superscalar execution.
[Diagram: the semantic gap between high-level languages (C, C++, Lisp, Prolog, Haskell…) and machine code is bridged by the compiler]
• 1980, Patterson and Ditzel (Berkeley): RISC
• Features
– Fixed-length instructions
– Load-store architecture
– Register file
• Organization
– Hard-wired logic
– Single-cycle instructions
– Pipelining
• Pros: small die size, short development time, high performance
• Cons: low code density, not x86-compatible
CISC and RISC
• CISC – complex instruction set
– large instruction set
– high-level operations (simpler for compiler?)
– requires microcode interpreter (could take a long time)
– examples: Intel 80x86 family
• RISC – reduced instruction set
– small instruction set
– simple, atomic instructions
– directly executed by hardware very quickly
– easier to incorporate advanced architecture design
– examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq)
Instruction set design
• Number of addresses
• Addressing modes
• Instruction types
Instruction types
• Arithmetic and logic
• Data movement
• I/O (memory-mapped, isolated I/O)
• Flow control
– Branches (unconditional, conditional)
• set-then-jump (cmp AX, BX; je target)
• Test-and-jump (beq r1, r2, target)
– Procedure calls (register-based, stack-based)
• Pentium: ret; MIPS: jr
• Register: faster but limited number of parameters
• Stack: slower but more general
Number of addresses
Addresses  Instruction   Operation
3          OP A, B, C    A ← B OP C
2          OP A, B       A ← A OP B
1          OP A          AC ← AC OP A
0          OP            T ← (T-1) OP T
Number of addresses
A, B, C: memory or register locations AC: accumulator
T: top of stack
T-1: second element of stack
Number of addresses
SUB Y, A, B ; Y = A - B
MUL T, D, E ; T = D × E
ADD T, T, C ; T = T + C
DIV Y, Y, T ; Y = Y / T
Example computed: Y = (A - B) / (C + D × E)
3-address instruction format: opcode | A | B | C
Number of addresses
MOV Y, A ; Y = A
SUB Y, B ; Y = Y - B
MOV T, D ; T = D
MUL T, E ; T = T × E
ADD T, C ; T = T + C
DIV Y, T ; Y = Y / T
Example computed: Y = (A - B) / (C + D × E)
2-address instruction format: opcode | A | B
Number of addresses
LD D ; AC = D
MUL E ; AC = AC × E
ADD C ; AC = AC + C
ST Y ; Y = AC
LD A ; AC = A
SUB B ; AC = AC - B
DIV Y ; AC = AC / Y
ST Y ; Y = AC
Example computed: Y = (A - B) / (C + D × E)
1-address instruction format: opcode | A
Number of addresses
PUSH A ; A
PUSH B ; A, B
SUB ; A-B
PUSH C ; A-B, C
PUSH D ; A-B, C, D
PUSH E ; A-B, C, D, E
MUL ; A-B, C, D×E
ADD ; A-B, C+(D×E)
DIV ; (A-B) / (C+(D×E))
POP Y ; Y = (A-B) / (C+(D×E))
Example computed: Y = (A - B) / (C + D × E)
0-address instruction format: opcode
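The 0-address sequence above can be executed by a tiny stack machine; the mnemonics and the sample operand values below are made up to mirror the slide:

```python
# A tiny 0-address (stack) machine interpreter.
def run(program, env):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(env[arg[0]])    # push a named value onto the stack
        elif op == "POP":
            env[arg[0]] = stack.pop()    # store the top of stack into a name
        else:                            # binary ops consume the top two entries
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "SUB": a - b,
                          "MUL": a * b, "DIV": a / b}[op])
    return env

# Y = (A - B) / (C + D×E), exactly the slide's instruction sequence
prog = [("PUSH", "A"), ("PUSH", "B"), ("SUB",),
        ("PUSH", "C"), ("PUSH", "D"), ("PUSH", "E"),
        ("MUL",), ("ADD",), ("DIV",), ("POP", "Y")]
env = run(prog, {"A": 10, "B": 4, "C": 1, "D": 2, "E": 1})
print(env["Y"])   # (10-4)/(1+2×1) = 2.0
```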
Number of addresses
• A basic design decision; could be mixed
• Fewer addresses per instruction result in
– a less complex processor
– shorter instructions
– longer and more complex programs
– longer execution time
• The decision has impacts on register usage policy as well
– 3-address usually means more general-purpose registers
– 1-address usually means fewer
Addressing modes
• How do we specify the locations of operands? Trade-offs among address range, addressing flexibility, number of memory references, and complexity of address calculation
• Often, a mode field is used to specify the mode
• Common addressing modes
– Implied
– Immediate (st R1, 1)
– Direct (st R1, A)
– Indirect
– Register (add R1, R2, R3)
– Register indirect (sti R1, R2)
– Displacement
– Stack
Implied addressing
• No address field; the operand is implied by the instruction
CLC ; clear carry
• A fixed and unvarying address
[Diagram: the instruction consists of the opcode only]
Immediate addressing
• Address field contains the operand value
ADD 5; AC=AC+5
• Pros: no extra memory reference; faster
• Cons: limited operand range
[Diagram: instruction = opcode + operand]
Direct addressing
• Address field contains the effective address of the operand
ADD A; AC=AC+[A]
• single memory reference
• Pros: no additional address calculation
• Cons: limited address space
[Diagram: instruction = opcode + address A; A points directly at the operand in memory]
Indirect addressing
• Address field contains the address of a pointer to the operand
ADD [A]; AC=AC+[[A]]
• multiple memory references
• Pros: large address space
• Cons: slower
[Diagram: instruction = opcode + address A; memory[A] holds a pointer to the operand in memory]
Register addressing
• Address field contains the address of a register
ADD R ; AC = AC + R
• Pros: only a small address field is needed; shorter instructions and faster fetch; no memory reference
• Cons: limited address space
[Diagram: instruction = opcode + register number R; the operand is in the register file]
Register indirect addressing
• Address field contains the address of the register containing a pointer to the operand
ADD [R] ; AC = AC + [R]
• Pros: large address space
• Cons: extra memory reference
[Diagram: instruction = opcode + register number R; register R holds the memory address of the operand]
Displacement addressing
• Address field could contain a register address and an address
MOV EAX, [A+ESI*4]
• EA=A+[R×S] or vice versa
• Several variants
– Base-offset: [EBP+8]
– Base-index: [EBX+ESI]
– Scaled: [T+ESI*4]
• Pros: flexible
• Cons: complex
[Diagram: instruction = opcode + register R + address A; the effective address is formed by adding A to the (optionally scaled) contents of R]
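The EA = A + [R×S] calculation and the three variants can be sketched as follows (the register values are made up for illustration):

```python
# Effective-address calculation for displacement modes: EA = disp + base + index*scale.
# Register contents are hypothetical sample values.
regs = {"EBP": 0x1000, "EBX": 0x2000, "ESI": 3}

def effective_address(disp=0, base=None, index=None, scale=1):
    ea = disp
    if base is not None:
        ea += regs[base]             # base register contribution
    if index is not None:
        ea += regs[index] * scale    # scaled index contribution
    return ea

print(hex(effective_address(disp=8, base="EBP")))               # base-offset [EBP+8] -> 0x1008
print(hex(effective_address(base="EBX", index="ESI")))          # base-index [EBX+ESI] -> 0x2003
print(hex(effective_address(disp=0x100, index="ESI", scale=4))) # scaled [0x100+ESI*4] -> 0x10c
```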
Displacement addressing
MOV EAX, [A+ESI*4]
• Often a register, called the index register, is used for the displacement.
• Usually, a mechanism is provided to efficiently increment the index register.
[Diagram: same as before; the index register R is added to address A to form the effective address of the operand in memory]
Effective address calculation
[Diagram: a dummy format for one operand with fields opcode (8 bits), base (3), index (3), s (2), and displacement (8 or 32); base and index select registers from the register file, the index value is shifted by s, and adders form base + (index << s) + displacement to address memory]
Stack addressing
• Operand is on top of the stack
ADD ; pop the top two entries, add them, push the result
• Pros: large address space
• Pros: short and fast instruction fetch
• Cons: limited by LIFO order
[Diagram: instruction = opcode only; the operand location (the stack top) is implicit]
Addressing modes
Mode               Meaning          Pros                  Cons
Implied            (implicit)       Fast fetch            Limited instructions
Immediate          Operand = A      No memory ref         Limited operand
Direct             EA = A           Simple                Limited address space
Indirect           EA = [A]         Large address space   Multiple memory refs
Register           EA = R           No memory ref         Limited address space
Register indirect  EA = [R]         Large address space   Extra memory ref
Displacement       EA = A + [R]     Flexibility           Complexity
Stack              EA = stack top   No memory ref         Limited applicability