
Chapter 5 Thread-Level Parallelism


Academic year: 2021


(1)

Chapter 5

Thread-Level Parallelism

(2)

Thread-Level parallelism

Have multiple program counters

Uses MIMD model

Targeted for tightly-coupled shared-memory multiprocessors

For n processors, need n threads

Amount of computation assigned to each thread = grain size

Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

Introduction

(3)

Symmetric multiprocessors (SMP)

Small number of cores

Share single memory with uniform memory latency

Distributed shared memory (DSM)

Memory distributed among processors

Non-uniform memory access/latency (NUMA)

Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks


(4)

Processors may see different values through their caches:

Centralized Shared-Memory Architectures

(5)

Coherence

All reads by any processor must return the most recently written value

Writes to the same location by any two processors are seen in the same order by all processors

Consistency

When a written value will be returned by a read

If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A


(6)

Coherent caches provide:

Migration: movement of data

Replication: multiple copies of data

Cache coherence protocols

Directory based

Sharing status of each block kept in one location

Snooping

Each core tracks sharing status of each block


(7)

Write invalidate

On write, invalidate all other copies

Use bus itself to serialize

Write cannot complete until bus access is obtained


(8)

Locating an item when a read miss occurs

In write-back cache, the updated value must be sent to the requesting processor

Cache lines marked as shared or exclusive/modified

Only writes to shared lines need an invalidate broadcast

After this, the line is marked as exclusive
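The write-invalidate behaviour described above can be sketched as a small Python model. This is only an illustration under simplifying assumptions: a single memory block, atomic bus operations, and three states (invalid, shared, modified). The names `Cache`, `Bus`, and the `invalidates` counter are hypothetical and do not come from the slides.

```python
# Minimal sketch of write-invalidate snooping for ONE memory block.
# Assumption: every bus operation is atomic (real protocols are not).

class Cache:
    def __init__(self):
        self.state = "I"   # I = invalid, S = shared, M = modified
        self.value = None

class Bus:
    def __init__(self, n_caches, initial=0):
        self.memory = initial
        self.caches = [Cache() for _ in range(n_caches)]
        self.invalidates = 0   # invalidate / write-miss broadcasts seen on the bus

    def read(self, cid):
        c = self.caches[cid]
        if c.state == "I":                       # read miss
            for other in self.caches:
                if other is not c and other.state == "M":
                    self.memory = other.value    # owner supplies the updated value
                    other.state = "S"
            c.value, c.state = self.memory, "S"
        return c.value

    def write(self, cid, value):
        c = self.caches[cid]
        if c.state != "M":                       # line is shared or invalid
            for other in self.caches:
                if other is not c:
                    if other.state == "M":
                        self.memory = other.value
                    other.state = "I"            # invalidate all other copies
            self.invalidates += 1
        c.value, c.state = value, "M"            # line is now held exclusively

bus = Bus(2)
bus.read(0); bus.read(1)   # both caches hold the block in S
bus.write(0, 42)           # one broadcast, cache 1 becomes invalid
bus.write(0, 43)           # already modified: no bus traffic
print(bus.read(1), bus.invalidates, [c.state for c in bus.caches])
```

The second write generates no broadcast because the line is already held exclusively, which is the point of marking lines as shared or exclusive/modified.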


(9)


(10)


(11)

Complications for the basic MSI protocol:

Operations are not atomic

E.g. detect miss, acquire bus, receive a response

Creates possibility of deadlock and races

One solution: processor that sends invalidate can hold bus until other processors receive the invalidate

Extensions:

Add exclusive state to indicate clean block in only one cache (MESI protocol)

Avoids the invalidate broadcast when writing a block held in the exclusive state

Owned state
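A rough way to see what the exclusive state saves is to count bus transactions for a core that reads a block and then writes it. The function below is a hypothetical counting model, not a protocol implementation; `bus_messages` and its parameters are names chosen for this sketch.

```python
def bus_messages(protocol, other_sharers):
    """Bus transactions for 'read miss, then write' to one block by one core.

    protocol: "MSI" or "MESI"; other_sharers: True if another cache holds the block.
    """
    messages = 1                      # the initial read miss
    if protocol == "MESI" and not other_sharers:
        state = "E"                   # clean block present in only this cache
    else:
        state = "S"
    if state == "S":
        messages += 1                 # write must broadcast an invalidate
    return messages                   # E -> M upgrade is silent

for proto in ("MSI", "MESI"):
    print(proto, bus_messages(proto, other_sharers=False))
```

Under these assumptions MESI only helps for data that is not actually shared, which is common for private data in a multiprocessor workload.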


(12)

Shared memory bus and snooping bandwidth is a bottleneck for scaling symmetric multiprocessors

Duplicating tags

Place directory in outermost cache

Use crossbars or point-to-point networks with banked memory


(13)

Every multicore with >8 processors uses an interconnect other than a bus

Makes it difficult to serialize events

Write and upgrade misses are not atomic

How can the processor know when all invalidates are complete?

How can we resolve races when two processors write at the same time?

Solution: associate each block with a single bus


(14)

Coherence influences cache miss rate

Coherence misses

True sharing misses

Write to shared block (transmission of invalidation)

Read an invalidated block

False sharing misses

Read an unmodified word in an invalidated block
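The distinction between true and false sharing can be made concrete with a small classifier. This is a sketch under assumptions: `BLOCK_WORDS` is a hypothetical block size, addresses are word addresses, and the miss is assumed to follow an invalidation caused by a remote write.

```python
BLOCK_WORDS = 8  # assumed block size in words (illustrative)

def block_of(word_addr):
    """Block number that contains a given word address."""
    return word_addr // BLOCK_WORDS

def classify_miss(written_word, accessed_word):
    """Classify the miss seen by a core whose copy was invalidated.

    written_word: word the remote core wrote (causing the invalidation)
    accessed_word: word the local core now accesses
    """
    if block_of(written_word) != block_of(accessed_word):
        return "different block"      # the remote write did not touch this block
    if written_word == accessed_word:
        return "true sharing"         # the data actually communicated
    return "false sharing"            # same block, but an unmodified word

print(classify_miss(0, 0), "|", classify_miss(0, 1), "|", classify_miss(0, 8))
```

With one-word blocks every coherence miss would be a true sharing miss, so false sharing tends to grow with block size.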

Performance of Symmetric Shared-Memory Multiprocessors

(15)


(16)


(17)


(18)


(19)

Snooping schemes require communication among all caches on every cache miss

Limits scalability

Another approach: Use centralized directory to keep track of every block

Which caches have each block

Dirty status of each block

Implement in shared L3 cache

Keep bit vector of size = # cores for each block in L3

Not scalable beyond shared L3
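The cost of a per-block bit vector can be estimated with simple arithmetic. The sketch below assumes one presence bit per core for every block and a 64-byte block; both the function name and the block size are assumptions for illustration.

```python
def directory_overhead(cores, block_bytes):
    """Extra storage for the bit vector as a fraction of the data it tracks."""
    return cores / (block_bytes * 8)   # presence bits per block / data bits per block

for cores in (8, 64, 512):
    print(cores, "cores:", f"{directory_overhead(cores, 64):.1%}")
```

The overhead grows linearly with the core count, which is one reason a full bit vector becomes expensive as systems get larger.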

Distributed Shared Memory and Directory-Based Coherence

(20)

Alternative approach:

Distribute memory


(21)

For each block, maintain state:

Shared

One or more nodes have the block cached, value in memory is up-to-date

Set of node IDs

Uncached

Modified

Exactly one node has a copy of the cache block, value in memory is out-of-date

Owner node ID

Directory maintains block states and sends invalidation messages


(22)


(23)


(24)

For uncached block:

Read miss

Requesting node is sent the requested data and is made the only sharing node, block is now shared

Write miss

The requesting node is sent the requested data and becomes the sharing node, block is now exclusive

For shared block:

Read miss

The requesting node is sent the requested data from memory, node is added to sharing set

Write miss

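The requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, sharing set now contains only the requesting node, block is now exclusive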


(25)

For exclusive block:

Read miss

The owner is sent a data fetch message, block becomes shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor

Data write back

Block becomes uncached, sharer set is empty

Write miss

Message is sent to old owner to invalidate and send the value to the directory, requestor becomes new owner, block remains exclusive
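The directory transitions for uncached, shared, and exclusive blocks can be summarized as a small state machine. The following Python sketch tracks one block and records the messages the directory would send; the class name `Directory`, the message names, and the `log` list are hypothetical, and the model assumes each request completes before the next begins.

```python
class Directory:
    """Directory entry for a single block (illustrative model only)."""

    def __init__(self):
        self.state = "uncached"   # uncached | shared | exclusive
        self.sharers = set()      # node IDs; holds just the owner when exclusive
        self.log = []             # (message, destination node) pairs

    def read_miss(self, node):
        if self.state == "exclusive":
            (owner,) = self.sharers
            self.log.append(("fetch", owner))   # owner returns data, memory updated
        self.log.append(("data", node))
        self.sharers.add(node)                  # old owner stays in the sharing set
        self.state = "shared"

    def write_miss(self, node):
        for other in sorted(self.sharers - {node}):
            # an exclusive owner must also send the value back to the directory
            kind = "fetch_invalidate" if self.state == "exclusive" else "invalidate"
            self.log.append((kind, other))
        self.log.append(("data", node))
        self.sharers = {node}                   # requester becomes the owner
        self.state = "exclusive"

    def write_back(self, node):
        assert self.state == "exclusive" and self.sharers == {node}
        self.sharers = set()
        self.state = "uncached"

d = Directory()
d.read_miss(0); d.read_miss(1)   # block shared by nodes 0 and 1
d.write_miss(2)                  # nodes 0 and 1 invalidated, node 2 owns the block
d.read_miss(0)                   # data fetched from node 2, block shared again
print(d.state, sorted(d.sharers))
```

Unlike snooping, only the nodes named in the sharing set receive messages, so traffic depends on the number of sharers rather than the number of caches.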


(26)

Basic building blocks:

Atomic exchange

Swaps register with memory location

Test-and-set

Tests a memory location and sets it if the test passes

Fetch-and-increment

Reads original value from memory and increments it in memory

Requires memory read and write in an uninterruptible instruction

RISC-V: load reserved/store conditional

If the contents of the memory location specified by the load reserved are changed before the store conditional to the same address, the store conditional fails
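The load reserved/store conditional pair can be modelled in a few lines of Python to show why an intervening write makes the store conditional fail. This is a toy model: the `Memory` class, its per-core reservation table, and `atomic_increment` are assumptions made for this sketch, and real hardware tracks reservations differently.

```python
class Memory:
    """Toy memory with one reservation per core (illustrative only)."""

    def __init__(self):
        self.cells = {}
        self.reserved = {}          # core id -> reserved address

    def load_reserved(self, core, addr):
        self.reserved[core] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value
        # any write to the address cancels every reservation on it
        self.reserved = {c: a for c, a in self.reserved.items() if a != addr}

    def store_conditional(self, core, addr, value):
        if self.reserved.get(core) != addr:
            return 1                # failure: nonzero result, caller must retry
        self.store(addr, value)
        return 0                    # success

def atomic_increment(mem, core, addr):
    """Retry loop: fetch-and-increment built from the two primitives."""
    while True:
        old = mem.load_reserved(core, addr)
        if mem.store_conditional(core, addr, old + 1) == 0:
            return old

mem = Memory()
old = mem.load_reserved(0, 0x10)
mem.store(0x10, 5)                                # another core writes in between
print(mem.store_conditional(0, 0x10, old + 1))    # reservation lost, store fails
print(atomic_increment(mem, 0, 0x10), mem.cells[0x10])
```

The retry loop is what turns the pair into an atomic read-modify-write without requiring a single uninterruptible instruction.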

Synchronization

(27)

Atomic exchange (EXCH):

try: mov x3,x4       ;move exchange value
     lr x2,x1        ;load reserved from 0(x1)
     sc x3,0(x1)     ;store conditional
     bnez x3,try     ;branch if store fails
     mov x4,x2       ;put loaded value in x4

Atomic increment:

try: lr x2,x1        ;load reserved from 0(x1)
     addi x3,x2,1    ;increment
     sc x3,0(x1)     ;store conditional
     bnez x3,try     ;branch if store fails


(28)

Lock (no cache coherence):

        addi x2,x0,1    ;load locked value
lockit: EXCH x2,0(x1)   ;atomic exchange
        bnez x2,lockit  ;already locked?

Lock (cache coherence):

lockit: ld x2,0(x1)     ;load of lock
        bnez x2,lockit  ;not available, spin
        addi x2,x0,1    ;load locked value
        EXCH x2,0(x1)   ;swap
        bnez x2,lockit  ;branch if lock wasn't 0


(29)

Advantage of this scheme: reduces memory traffic
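The traffic reduction can be estimated with a simple counting model. The sketch below assumes one waiting core, a coherent cache, and a lock that stays held for a given number of spin attempts before being released; `bus_transactions` and its parameters are hypothetical names, and contention between several waiters is ignored.

```python
def bus_transactions(spin_attempts, test_first):
    """Bus transactions issued by one waiting core until it acquires the lock.

    spin_attempts: failed attempts made while the lock is held by another core
    test_first: True for the 'load, then exchange' spin lock
    """
    if not test_first:
        # every exchange is a write, so every failed attempt uses the bus,
        # plus the final successful exchange
        return spin_attempts + 1
    misses = 1                              # first load brings the lock into the cache
    if spin_attempts:
        misses += 1                         # release invalidates the copy: one more miss
    return misses + 1                       # plus the successful exchange

for n in (10, 1000):
    print(n, bus_transactions(n, False), bus_transactions(n, True))
```

In this model the spinning loads hit in the local cache, so the cost no longer depends on how long the lock is held.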


(30)

Models of Memory Consistency: An Introduction

Processor 1:

A=0

A=1

if (B==0) …

Processor 2:

B=0

B=1

if (A==0) …

Should be impossible for both if-statements to be evaluated as true

Delayed write invalidate?

Sequential consistency:

Result of execution should be the same as long as:

Accesses on each processor were kept in order

Accesses on different processors were arbitrarily interleaved
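The claim about the two if-statements can be checked by brute force: enumerate every interleaving that keeps each processor's accesses in program order and record what each read sees. The Python sketch below does this for the two-processor example; the tuple encoding of operations and the helper names are assumptions made for illustration.

```python
def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that preserves each program's order."""
    if not p1:
        yield list(p2)
        return
    if not p2:
        yield list(p1)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest

def run(schedule):
    """Execute one interleaving; return (value P1 read for B, value P2 read for A)."""
    mem = {"A": 0, "B": 0}
    seen = {}
    for proc, op, var in schedule:
        if op == "write":
            mem[var] = 1
        else:
            seen[proc] = mem[var]
    return seen["P1"], seen["P2"]

P1 = [("P1", "write", "A"), ("P1", "read", "B")]   # A=1; if (B==0) ...
P2 = [("P2", "write", "B"), ("P2", "read", "A")]   # B=1; if (A==0) ...

outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))
```

The outcome (0, 0), where both conditions are true, never appears, because in any interleaving at least one write precedes the other processor's read.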

(31)

To implement, delay completion of all memory accesses until all invalidations caused by the access are completed

Reduces performance!

Alternatives:

Program-enforced synchronization to force write on one processor to occur before read on the other processor

Requires synchronization object for A and another for B

“Unlock” after write

“Lock” after read


(32)

Rules:

X → Y

Operation X must complete before operation Y is done

Sequential consistency requires:

R → W, R → R, W → R, W → W

Relax W → R

“Total store ordering”

Relax W → W (in addition to W → R)

“Partial store order”
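Relaxing W → R is commonly explained with per-processor store buffers: a write waits in the buffer while a later read from the same processor goes ahead. The Python sketch below models that for two processors that each write one variable and then read the other; the schedule encoding, the explicit `drain` step, and the function name `run_tso` are assumptions for this illustration.

```python
def run_tso(schedule):
    """Run a schedule with a FIFO store buffer per processor.

    Reads check the processor's own buffer first, then memory.
    Returns (value P1 read, value P2 read).
    """
    mem = {"A": 0, "B": 0}
    buffers = {"P1": [], "P2": []}
    seen = {}
    for proc, op, var in schedule:
        if op == "write":
            buffers[proc].append((var, 1))        # buffered, not yet visible to others
        elif op == "read":
            own = [v for name, v in buffers[proc] if name == var]
            seen[proc] = own[-1] if own else mem[var]
        elif op == "drain":
            name, v = buffers[proc].pop(0)        # oldest write reaches memory
            mem[name] = v
    return seen["P1"], seen["P2"]

schedule = [
    ("P1", "write", "A"), ("P2", "write", "B"),
    ("P1", "read", "B"),  ("P2", "read", "A"),    # reads pass the buffered writes
    ("P1", "drain", None), ("P2", "drain", None),
]
print(run_tso(schedule))
```

Both reads return 0 here, an outcome that sequential consistency forbids, which is why programs on such machines rely on explicit synchronization.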


(33)


(34)

Consistency model is multiprocessor specific

Programmers will often implement explicit synchronization

Speculation gives much of the performance advantage of relaxed models while preserving sequential consistency

Basic idea: if an invalidation arrives for a result that has not been committed, use speculation recovery


(35)

Pitfall: Measuring performance of multiprocessors by linear speedup versus execution time

Fallacy: Amdahl’s Law doesn’t apply to parallel computers

Fallacy: Linear speedups are needed to make multiprocessors cost-effective

Doesn’t consider cost of other system components

Pitfall: Not developing the software to take advantage of, or optimize for, a multiprocessor architecture
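Amdahl's Law does apply to parallel computers, and a short calculation shows how strongly the sequential part limits speedup. The sketch below uses the standard form of the law; the function name and the example fractions are chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup when only parallel_fraction of the work is spread over processors."""
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / processors)

for fraction in (0.90, 0.99, 0.9975):
    print(fraction, round(amdahl_speedup(fraction, 100), 1))
```

With 100 processors, a speedup near 80 needs about 99.75% of the original computation to be parallel, which is why software that is not written for the multiprocessor sees little benefit.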

Fallacies and Pitfalls
