
(1)

Architecture and Assembly Basics

Computer Organization and Assembly Languages Yung-Yu Chuang

2007/10/29

with slides by Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne

(2)

Announcements

• Midterm exam? Date? 11/12 (I prefer this one) or 11/19?

• 4 assignments plus midterm vs. 5 assignments

• Open-book

(3)

Basic architecture

(4)

Basic microcomputer design

• clock synchronizes CPU operations

• control unit (CU) coordinates sequence of execution steps

• ALU performs arithmetic and logic operations

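[Figure: block diagram of a microcomputer. The Central Processor Unit (CPU), containing registers, ALU, CU and clock, is connected to the Memory Storage Unit and to I/O Devices #1 and #2 by the data bus, control bus and address bus.]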

(5)

Basic microcomputer design

• The memory storage unit holds instructions and data for a running program

• A bus is a group of wires that transfers data from one part to another (data, address, control)

[Figure: the same block diagram as on the previous slide: CPU (registers, ALU, CU, clock), Memory Storage Unit, and I/O Devices #1 and #2, connected by the data, control and address buses.]

(6)

Clock

• synchronizes all CPU and bus operations

• machine (clock) cycle measures time of a single operation

• clock is used to trigger events

[Figure: clock signal, a square wave alternating between 1 and 0; one cycle is one full period.]

• Basic unit of time: at 1 GHz, one clock cycle = 1 ns

• An instruction could take multiple cycles to complete, e.g. multiply in 8088 takes 50 cycles
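
The relation between clock frequency and cycle time can be sketched in a few lines of Python. The function names are made up for this illustration, and the 5 MHz clock used for the 50-cycle multiply is an assumed figure, not one given on the slide.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Return the duration of one clock cycle in nanoseconds."""
    return 1e9 / frequency_hz


def instruction_time_ns(cycles: int, frequency_hz: float) -> float:
    """Return the time taken by an instruction that needs `cycles` clock cycles."""
    return cycles * clock_period_ns(frequency_hz)


if __name__ == "__main__":
    # 1 GHz clock: one cycle lasts 1 ns.
    print(clock_period_ns(1e9))
    # A 50-cycle multiply on an assumed 5 MHz clock.
    print(instruction_time_ns(50, 5e6))
```

The point of the second call is that a slow instruction costs its cycle count times the clock period, so the same instruction gets faster either by needing fewer cycles or by raising the clock rate.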

(7)

Instruction execution cycle

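• Fetch
• Decode
• Fetch operands
• Execute
• Store output

[Figure: the program counter (PC) selects the next instruction (I-1, I-2, I-3, I-4) of the program in memory. The instruction is fetched into the instruction queue and instruction register and decoded; the operands (op1, op2) are read from registers or memory; the ALU executes; the output is written back to registers or memory and the flags are updated.]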
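
The five steps of the cycle can be illustrated with a toy interpreter. This is a minimal sketch, not a model of any real CPU: the tuple instruction format, the register names and the two opcodes are assumptions chosen for the example.

```python
def run(program, registers):
    """Execute a list of (opcode, dest, src1, src2) tuples on a toy machine."""
    pc = 0                                        # program counter
    while pc < len(program):
        instruction = program[pc]                 # fetch
        pc += 1                                   # point at the next instruction
        opcode, dest, src1, src2 = instruction    # decode
        a, b = registers[src1], registers[src2]   # fetch operands
        if opcode == "add":                       # execute
            result = a + b
        elif opcode == "sub":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result                  # store output
    return registers


if __name__ == "__main__":
    regs = {"r1": 5, "r2": 7, "r3": 0, "r4": 0}
    program = [("add", "r3", "r1", "r2"), ("sub", "r4", "r3", "r1")]
    print(run(program, regs))
```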

(8)

Advanced architecture

(9)

Multi-stage pipeline

• Pipelining makes it possible for a processor to execute instructions in parallel

• Instruction execution divided into discrete stages

Stages S1 to S6 across, cycles 1 to 12 down:

Cycle  S1   S2   S3   S4   S5   S6
1      I-1
2           I-1
3                I-1
4                     I-1
5                          I-1
6                               I-1
7      I-2
8           I-2
9                I-2
10                    I-2
11                         I-2
12                              I-2

Example of a non-pipelined processor, for example the 80386. Many wasted cycles.

(10)

Pipelined execution

• More efficient use of cycles, greater throughput of instructions: (80486 started to use pipelining)

Stages S1 to S6 across, cycles 1 to 7 down:

Cycle  S1   S2   S3   S4   S5   S6
1      I-1
2      I-2  I-1
3           I-2  I-1
4                I-2  I-1
5                     I-2  I-1
6                          I-2  I-1
7                               I-2

For k stages and n instructions, the number of required cycles is k + (n – 1), compared to k*n without pipelining.
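
The two cycle counts can be compared directly. The sketch below simply encodes the formulas from this slide; the function names are illustrative.

```python
def cycles_non_pipelined(k: int, n: int) -> int:
    """Each of the n instructions occupies all k stages before the next one starts."""
    return k * n


def cycles_pipelined(k: int, n: int) -> int:
    """The first instruction takes k cycles; each later one finishes one cycle after it."""
    return k + (n - 1)


if __name__ == "__main__":
    k, n = 6, 2                            # six stages, two instructions, as in the diagrams
    print(cycles_non_pipelined(k, n))      # 12
    print(cycles_pipelined(k, n))          # 7
```

For large n the ratio k*n / (k + n – 1) approaches k, which is why a deeper pipeline looks attractive as long as every stage really takes one cycle.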

(11)

Wasted cycles (pipelined)

• When one of the stages requires two or more clock cycles, clock cycles are again wasted.

Stages S1 to S6 across, cycles 1 to 11 down. S4 is the execute (exe) stage and takes two cycles:

Cycle  S1   S2   S3   S4   S5   S6
1      I-1
2      I-2  I-1
3      I-3  I-2  I-1
4           I-3  I-2  I-1
5                I-3  I-1
6                     I-2  I-1
7                     I-2       I-1
8                     I-3  I-2
9                     I-3       I-2
10                         I-3
11                              I-3

For k stages and n instructions, the number of required cycles is:

k + (2n – 1)
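
The k + (2n – 1) count can be checked with a small timing model. This is a simplified sketch under stated assumptions: instructions issue in order, each stage handles one instruction at a time, and an instruction waiting for a busy stage does not block the stage behind it. The function name and the latency list are illustrative.

```python
def total_cycles(stage_latencies, n):
    """Cycles needed to push n instructions through an in-order pipeline.

    stage_latencies[s] is the number of cycles stage s needs per instruction.
    """
    k = len(stage_latencies)
    prev_finish = [0] * k            # when the previous instruction left each stage
    for _ in range(n):
        finish = [0] * k
        ready = 0                    # when this instruction left the preceding stage
        for s, latency in enumerate(stage_latencies):
            start = max(ready, prev_finish[s])   # wait until the stage is free
            finish[s] = start + latency
            ready = finish[s]
        prev_finish = finish
    return prev_finish[-1]


if __name__ == "__main__":
    print(total_cycles([1, 1, 1, 1, 1, 1], 2))   # 7  = k + (n - 1)
    print(total_cycles([1, 1, 1, 2, 1, 1], 3))   # 11 = k + (2n - 1)
```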

(12)

Superscalar

A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).

Stages across (S4 split into the u and v pipelines), cycles 1 to 10 down:

Cycle  S1   S2   S3   S4u  S4v  S5   S6
1      I-1
2      I-2  I-1
3      I-3  I-2  I-1
4      I-4  I-3  I-2  I-1
5           I-4  I-3  I-1  I-2
6                I-4  I-3  I-2  I-1
7                     I-3  I-4  I-2  I-1
8                          I-4  I-3  I-2
9                               I-4  I-3
10                                   I-4

For k stages and n instructions, the number of required cycles is:

k + n

Pentium: 2 pipelines
Pentium Pro: 3
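
The k + n count for the two-pipeline case can be reproduced with a small timing model. This is a sketch under assumptions that are not on the slide: instructions alternate strictly between the u and v units, the split stage takes two cycles, every other stage takes one, and instructions leave the split stage in program order. All names and default values are illustrative.

```python
def superscalar_cycles(n, k=6, wide_stage=3, wide_latency=2, units=2):
    """Cycles for n instructions when one stage has several parallel units."""
    stage_free = [0] * k        # when each single-unit stage becomes free
    unit_free = [0] * units     # when each unit of the split stage becomes free
    last = 0
    for i in range(n):
        ready = 0               # when this instruction left the preceding stage
        for s in range(k):
            if s == wide_stage:
                u = i % units                        # alternate between u and v
                start = max(ready, unit_free[u])
                ready = start + wide_latency
                unit_free[u] = ready
            else:
                start = max(ready, stage_free[s])
                ready = start + 1
                stage_free[s] = ready
        last = ready
    return last


if __name__ == "__main__":
    print(superscalar_cycles(4))            # 10 = k + n
    print(superscalar_cycles(3, units=1))   # 11 = k + (2n - 1), single pipeline
```

With a single unit the model falls back to the k + (2n – 1) count of the previous slide, which shows that the second pipeline is exactly what hides the two-cycle stage.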

(13)

More stages, better performance?

• Pentium 3: 10

• Pentium 4: 20~31

• Next-generation micro-architecture: 14

• ARM7: 3

(14)

Pipeline hazards

add r1, r2, #10 ; write r1
sub r3, r1, #20 ; read r1

fetch  decode reg    ALU    memory wb
       fetch  decode stall  reg    ALU
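
The stall above comes from a read-after-write dependency: sub reads r1 before add has written it back. A minimal sketch of how such a dependency between adjacent instructions could be detected follows; the tuple encoding of instructions is an assumption made for this example, and only back-to-back pairs are checked.

```python
def raw_hazards(instructions):
    """Return (writer_index, reader_index, register) for adjacent read-after-write pairs.

    Each instruction is (opcode, dest, sources), where sources is a tuple of
    register names; immediate operands are left out of it.
    """
    hazards = []
    for i in range(1, len(instructions)):
        _, dest, _ = instructions[i - 1]
        _, _, sources = instructions[i]
        if dest in sources:              # the next instruction reads what this one writes
            hazards.append((i - 1, i, dest))
    return hazards


if __name__ == "__main__":
    program = [
        ("add", "r1", ("r2",)),   # add r1, r2, #10 ; write r1
        ("sub", "r3", ("r1",)),   # sub r3, r1, #20 ; read r1
    ]
    print(raw_hazards(program))   # [(0, 1, 'r1')]
```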

(15)

Pipelined branch behavior

fetch decode reg ALU memory wb

fetch decode reg ALU memory wb

fetch decode reg ALU memory wb

fetch decode reg ALU memory

fetch decode reg ALU

(16)

Reading from memory

Multiple machine cycles are required when reading from memory, because memory responds much more slowly than the CPU (e.g. 33 MHz). The wasted clock cycles are called wait states.

Pentium III cache hierarchy

• Processor chip: registers; L1 data cache (16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency); L1 instruction cache (16 KB, 4-way, 32B lines)
• L2 unified cache: 128KB--2 MB, 4-way assoc, write-back, write allocate, 32B lines
• Main memory: up to 4GB

(17)

Cache memory

• High-speed, expensive static RAM both inside and outside the CPU.
– Level-1 cache: inside the CPU
– Level-2 cache: outside the CPU

• Cache hit: the data to be read is already in cache memory

• Cache miss: the data to be read is not in cache memory. Misses are classified as compulsory, capacity, or conflict misses.

• Cache design: cache size, n-way, block size, replacement policy
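
Hits and misses can be made concrete with a tiny direct-mapped (1-way) cache model. This is an illustrative sketch: the 32-byte block size matches the line size quoted for the Pentium III, while the four-line capacity and the function name are assumptions chosen to keep the example small.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_size=32):
    """Count hits and misses for a sequence of byte addresses."""
    lines = [None] * num_lines           # tag currently stored in each cache line
    hits = misses = 0
    for address in addresses:
        block = address // block_size    # memory block that holds this byte
        index = block % num_lines        # the only line this block may occupy
        tag = block // num_lines         # identifies which block is in the line
        if lines[index] == tag:
            hits += 1
        else:
            misses += 1                  # first touch (compulsory) or eviction (conflict)
            lines[index] = tag           # replace whatever was there
    return hits, misses


if __name__ == "__main__":
    # Addresses 0 and 128 map to the same line, so they keep evicting each other.
    print(simulate_direct_mapped([0, 4, 32, 0, 128, 0]))   # (2, 4)
```

In the trace, the first accesses to blocks 0 and 1 are compulsory misses, while the last two misses are conflict misses that a 2-way cache of the same size would avoid.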

(18)

Memory system in practice

Smaller, faster, and more expensive (per byte) storage devices at the top:

L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks; virtual memory)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

Larger, slower, and cheaper (per byte) storage devices at the bottom.

(19)

Multitasking

• OS can run multiple programs at the same time.

• Multiple threads of execution within the same program.

• Scheduler utility assigns a given amount of CPU time to each running program.

• Rapid switching of tasks
– gives the illusion that all programs are running at once
– the processor must support task switching
– scheduling policy: round-robin, priority
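
Round-robin scheduling can be sketched as a queue of tasks that each receive a fixed time slice in turn. This is a simplified illustration, not how any particular OS implements it; the function name, the task names and the quantum are assumptions.

```python
from collections import deque


def round_robin(tasks, quantum):
    """Return the order in which tasks run, as (name, time_run) slices.

    tasks maps a task name to the CPU time it still needs.
    """
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)       # run for one slice or until done
        timeline.append((name, run))
        if remaining > run:
            queue.append((name, remaining - run))   # unfinished: back of the queue
    return timeline


if __name__ == "__main__":
    print(round_robin({"A": 3, "B": 5}, quantum=2))
```

A priority policy would differ only in how the next task is picked: from a priority queue instead of the front of a FIFO queue.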

(20)

Trade-offs of instruction sets

• Before 1980, the trend was to increase instruction complexity (a one-to-one mapping from high-level constructs if possible) to bridge the semantic gap between high-level languages and machine code, and to reduce fetches from memory. Selling points: number of instructions, addressing modes. (CISC)

• 1980, RISC: simplify and regularize instructions so that advanced architecture techniques (pipelining, caches, superscalar execution) can be introduced for better performance.

[Figure: the compiler bridges the semantic gap between high-level languages (C, C++; Lisp, Prolog, Haskell…) and machine code.]

(21)

RISC

• 1980, Patterson and Ditzel (Berkeley), RISC

• Features
– Fixed-length instructions
– Load-store architecture
– Register file

• Organization
– Hard-wired logic
– Single-cycle instruction
– Pipeline

• Pros: small die size, short development time, high performance

• Cons: low code density, not x86 compatible

(22)

CISC and RISC

• CISC – complex instruction set
– large instruction set
– high-level operations (simpler for compiler?)
– requires microcode interpreter (could take a long time)
– examples: Intel 80x86 family

• RISC – reduced instruction set
– small instruction set
– simple, atomic instructions
– directly executed by hardware very quickly
– easier to incorporate advanced architecture design
– examples: ARM (Advanced RISC Machines) and DEC Alpha (now Compaq)

(23)

Instruction set design

• Number of addresses

• Addressing modes

• Instruction types

(24)

Instruction types

• Arithmetic and logic

• Data movement

• I/O (memory-mapped, isolated I/O)

• Flow control

– Branches (unconditional, conditional)
  • Set-then-jump (cmp AX, BX; je target)
  • Test-and-jump (beq r1, r2, target)
– Procedure calls (register-based, stack-based)
  • Pentium: ret; MIPS: jr
  • Register: faster but limited number of parameters
  • Stack: slower but more general

(25)

Number of addresses

Number of addresses  Instruction  Operation
3                    OP A, B, C   A ← B OP C
2                    OP A, B      A ← A OP B
1                    OP A         AC ← AC OP A
0                    OP           T ← (T-1) OP T

A, B, C: memory or register locations
AC: accumulator
T: top of stack
T-1: second element of stack

(26)

Number of addresses

Y = (A - B) / (C + D × E)

3-address instructions (format: opcode A B C)

SUB Y, A, B ; Y = A - B
MUL T, D, E ; T = D × E
ADD T, T, C ; T = T + C
DIV Y, Y, T ; Y = Y / T

(27)

Number of addresses

Y = (A - B) / (C + D × E)

2-address instructions (format: opcode A B)

MOV Y, A ; Y = A
SUB Y, B ; Y = Y - B
MOV T, D ; T = D
MUL T, E ; T = T × E
ADD T, C ; T = T + C
DIV Y, T ; Y = Y / T

(28)

Number of addresses

Y = (A - B) / (C + D × E)

1-address instructions (format: opcode A)

LD D ; AC = D
MUL E ; AC = AC × E
ADD C ; AC = AC + C
ST Y ; Y = AC
LD A ; AC = A
SUB B ; AC = AC - B
DIV Y ; AC = AC / Y
ST Y ; Y = AC
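
The 1-address program can be traced with a toy accumulator machine. This is a minimal sketch: the tuple encoding, the use of a Python dict as memory and the sample values are assumptions made for the illustration.

```python
def run_accumulator(program, memory):
    """Run (opcode, address) pairs on a machine with a single accumulator AC."""
    ac = 0
    for opcode, address in program:
        if opcode == "LD":
            ac = memory[address]          # AC = [address]
        elif opcode == "ST":
            memory[address] = ac          # [address] = AC
        elif opcode == "ADD":
            ac = ac + memory[address]
        elif opcode == "SUB":
            ac = ac - memory[address]
        elif opcode == "MUL":
            ac = ac * memory[address]
        elif opcode == "DIV":
            ac = ac / memory[address]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


if __name__ == "__main__":
    program = [("LD", "D"), ("MUL", "E"), ("ADD", "C"), ("ST", "Y"),
               ("LD", "A"), ("SUB", "B"), ("DIV", "Y"), ("ST", "Y")]
    memory = {"A": 20, "B": 4, "C": 2, "D": 2, "E": 3, "Y": 0}
    print(run_accumulator(program, memory)["Y"])   # (20 - 4) / (2 + 2 * 3) = 2.0
```

Note how Y doubles as a temporary for C + D × E: with only one accumulator, the denominator has to be parked in memory before the numerator is computed.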

(29)

Number of addresses

Y = (A - B) / (C + D × E)

0-address instructions (format: opcode); the comments show the stack contents

PUSH A ; A
PUSH B ; A, B
SUB ; A-B
PUSH C ; A-B, C
PUSH D ; A-B, C, D
PUSH E ; A-B, C, D, E
MUL ; A-B, C, D×E
ADD ; A-B, C+(D×E)
DIV ; (A-B) / (C+(D×E))
POP Y
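
The 0-address program can be run on a toy stack machine. This is a hedged sketch: the string encoding of instructions, the dict used as memory and the sample values are assumptions for the illustration. Only PUSH and POP name a memory location; the arithmetic operations take both operands from the stack.

```python
import operator

# Binary operations apply as (second element) OP (top of stack).
OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


def run_stack(program, memory):
    """Run a list of instruction strings such as 'PUSH A' or 'SUB'."""
    stack = []
    for instruction in program:
        opcode, *operand = instruction.split()
        if opcode == "PUSH":
            stack.append(memory[operand[0]])
        elif opcode == "POP":
            memory[operand[0]] = stack.pop()
        else:
            top = stack.pop()
            second = stack.pop()
            stack.append(OPS[opcode](second, top))   # T ← (T-1) OP T
    return memory


if __name__ == "__main__":
    program = ["PUSH A", "PUSH B", "SUB", "PUSH C", "PUSH D", "PUSH E",
               "MUL", "ADD", "DIV", "POP Y"]
    memory = {"A": 20, "B": 4, "C": 2, "D": 2, "E": 3, "Y": 0}
    print(run_stack(program, memory)["Y"])   # (20 - 4) / (2 + 2 * 3) = 2.0
```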

(30)

Number of addresses

• A basic design decision; could be mixed

• Fewer addresses per instruction result in
– a less complex processor
– shorter instructions
– longer and more complex programs
– longer execution time

• The decision has impacts on register usage policy as well
– 3-address usually means more general-purpose registers
– 1-address usually means fewer

(31)

Addressing modes

• How to specify the location of operands? Trade-offs among address range, address flexibility, number of memory references, and calculation of addresses

• Often, a mode field is used to specify the mode

• Common addressing modes
– Implied
– Immediate (st R1, 1)
– Direct (st R1, A)
– Indirect
– Register (add R1, R2, R3)
– Register indirect (sti R1, R2)
– Displacement
– Stack

(32)

Implied addressing

• No address field; the operand is implied by the instruction

CLC ; clear carry

• A fixed and unvarying address

[Figure: instruction format with an opcode field only.]

(33)

Immediate addressing

• Address field contains the operand value

ADD 5 ; AC = AC + 5

• Pros: no extra memory reference; faster

• Cons: limited range

[Figure: instruction format: opcode, operand.]

(34)

Direct addressing

• Address field contains the effective address of the operand

ADD A ; AC = AC + [A]

• Single memory reference

• Pros: no additional address calculation

• Cons: limited address space

[Figure: instruction format: opcode, address A; A points directly at the operand in memory.]

(35)

Indirect addressing

• Address field contains the address of a pointer to the operand

ADD [A] ; AC = AC + [[A]]

• Multiple memory references

• Pros: large address space

• Cons: slower

[Figure: instruction format: opcode, address A; A points at a memory location that holds the address of the operand.]

(36)

Register addressing

• Address field contains the address of a register

ADD R ; AC = AC + R

• Pros: only need a small address field; shorter instruction and faster fetch; no memory reference

• Cons: limited address space

[Figure: instruction format: opcode, R; R selects the register that holds the operand.]

(37)

Register indirect addressing

• Address field contains the address of the register containing a pointer to the operand

ADD [R] ; AC = AC + [R]

• Pros: large address space

• Cons: extra memory reference

[Figure: instruction format: opcode, R; the register R holds the address of the operand in memory.]

(38)

Displacement addressing

• Address field could contain a register address and an address

MOV EAX, [A+ESI*4]

• EA=A+[R×S] or vice versa

• Several variants
– Base-offset: [EBP+8]
– Base-index: [EBX+ESI]
– Scaled: [T+ESI*4]

• Pros: flexible

• Cons: complex

[Figure: instruction format: opcode, R, A; the content of register R is added to A to form the address of the operand in memory.]

(39)

Displacement addressing

MOV EAX, [A+ESI*4]

• Often a register, called the index register, is used for the displacement.

• Usually, a mechanism is provided to efficiently increment the index register.

[Figure: instruction format: opcode, R, A; the content of register R is added to A to form the address of the operand in memory.]

(40)

Effective address calculation

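A dummy format for one operand (field widths in bits): base (3), index (3), s (2), 8 bits in total, followed by a displacement of 8 or 32 bits.

[Figure: datapath for the address calculation: register file → shifter → adder → adder → memory.]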
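
The calculation performed by the shifter and the two adders can be written out as a formula: EA = base + index × scale + displacement. The sketch below is an illustration of that formula only; the function name is made up, and the restriction to scale factors 1, 2, 4 and 8 follows from treating the 2-bit s field as a shift count, as on x86.

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """Compute EA = base + index * scale + displacement.

    base and index are register contents; scale is the factor encoded by the
    2-bit s field, so it must be 1, 2, 4 or 8.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return base + index * scale + displacement


if __name__ == "__main__":
    # [EBP+8]: base-offset
    print(hex(effective_address(base=0x1000, displacement=8)))             # 0x1008
    # [EBX+ESI]: base-index
    print(hex(effective_address(base=0x1000, index=3)))                    # 0x1003
    # [A+ESI*4]: scaled index, with a hypothetical label A at address 0x2000
    print(hex(effective_address(index=3, scale=4, displacement=0x2000)))   # 0x200c
```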

(41)

Stack addressing

• Operand is on top of the stack

ADD ; T ← (T-1) + T

• Pros: large address space

• Pros: short and fast fetch

• Cons: limited by FILO order

[Figure: instruction format with an opcode field only; the operand is implicitly the top of the stack.]

(42)

Addressing modes

Mode               Meaning       Pros                 Cons
Implied                          Fast fetch           Limited instructions
Immediate          Operand=A     No memory ref        Limited operand
Direct             EA=A          Simple               Limited address space
Indirect           EA=[A]        Large address space  Multiple memory ref
Register           EA=R          No memory ref        Limited address space
Register indirect  EA=[R]        Large address space  Extra memory ref
Displacement       EA=A+[R]      Flexibility          Complexity
Stack              EA=stack top  No memory ref        Limited applicability
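
How each mode turns an address field into an operand can be summarized in one function. This is an illustrative sketch only: the mode names as strings, the dict-based memory and register file, and the tuple used for displacement mode are assumptions made for the example.

```python
def fetch_operand(mode, field, memory, registers):
    """Return the operand selected by an address field under a given mode."""
    if mode == "immediate":            # Operand = A
        return field
    if mode == "direct":               # EA = A
        return memory[field]
    if mode == "indirect":             # EA = [A]
        return memory[memory[field]]
    if mode == "register":             # EA = R
        return registers[field]
    if mode == "register_indirect":    # EA = [R]
        return memory[registers[field]]
    if mode == "displacement":         # EA = A + [R]; field is the pair (A, R)
        address, register = field
        return memory[address + registers[register]]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    memory = {100: 200, 200: 42, 205: 7}
    registers = {"R1": 200, "R2": 5}
    for mode, field in [("immediate", 100), ("direct", 100), ("indirect", 100),
                        ("register", "R1"), ("register_indirect", "R1"),
                        ("displacement", (200, "R2"))]:
        print(mode, fetch_operand(mode, field, memory, registers))
```

Counting the memory[...] lookups in each branch reproduces the memory-reference column of the table: none for immediate and register, one for direct, register indirect and displacement, two for indirect.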
