Chapter 5 Thread-Level Parallelism


Introduction

 Thread-Level parallelism

 Have multiple program counters

 Uses MIMD model

 Targeted for tightly-coupled shared-memory multiprocessors

 For n processors, need n threads

 Amount of computation assigned to each thread = grain size

 Threads can be used for data-level parallelism, but the overheads may outweigh the benefit

 Symmetric multiprocessors (SMP)

 Small number of cores

 Share single memory with uniform memory latency

 Distributed shared memory (DSM)

 Memory distributed among processors

 Non-uniform memory access/latency (NUMA)

 Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Centralized Shared-Memory Architectures

 Processors may see different values through their caches

 Coherence

 All reads by any processor must return the most recently written value

 Writes to the same location by any two processors are seen in the same order by all processors

 Consistency

 When a written value will be returned by a read

 If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A


 Coherent caches provide:

 Migration: movement of data

 Replication: multiple copies of data

 Cache coherence protocols

 Directory based

 Sharing status of each block kept in one location

 Snooping

 Each core tracks sharing status of each block


 Write invalidate

 On write, invalidate all other copies

 Use bus itself to serialize

 Write cannot complete until bus access is obtained


 Locating an item when a read miss occurs

 In a write-back cache, the updated value must be sent to the requesting processor

 Cache lines marked as shared or exclusive/modified

 Only writes to shared lines need an invalidate broadcast

 After this, the line is marked as exclusive
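The write-invalidate rules above can be sketched as a small state machine for one block. This is a minimal illustration, not a hardware description: the class name `SnoopyBlock`, the state letters M/S/I, and the `bus_log` list are assumptions made for the example, and the modified and exclusive states are treated as one.

```python
class SnoopyBlock:
    """One memory block tracked across n caches (hypothetical helper).

    States: "M" (modified/exclusive), "S" (shared), "I" (invalid).
    """

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches
        self.bus_log = []  # broadcasts placed on the shared bus, in order

    def read(self, cache):
        if self.state[cache] != "I":
            return  # read hit: no bus traffic
        self.bus_log.append(("read_miss", cache))
        for other, s in enumerate(self.state):
            if s == "M":
                # The owner supplies the updated value and keeps a shared copy.
                self.state[other] = "S"
        self.state[cache] = "S"

    def write(self, cache):
        if self.state[cache] == "M":
            return  # line already exclusive: no invalidate broadcast
        kind = "invalidate" if self.state[cache] == "S" else "write_miss"
        self.bus_log.append((kind, cache))
        for other in range(len(self.state)):
            if other != cache:
                self.state[other] = "I"  # invalidate all other copies
        self.state[cache] = "M"


block = SnoopyBlock(3)
block.read(0)
block.read(1)
block.write(0)  # write to a shared line: invalidate broadcast
block.write(0)  # line is now exclusive: silent
block.read(2)   # owner 0 supplies the data and drops to shared
print(block.state, block.bus_log)
```

Only the first write by cache 0 appears on the bus; the second write hits an exclusive line and generates no traffic.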


 Complications for the basic MSI protocol:

 Operations are not atomic

 E.g. detect miss, acquire bus, receive a response

 Creates possibility of deadlock and races

 One solution: processor that sends invalidate can hold bus until other processors receive the invalidate

 Extensions:

 Add exclusive state to indicate clean block in only one cache (MESI protocol)

 Avoids the need to broadcast an invalidate when writing such a block

 Owned state: block is owned by that cache and out-of-date in memory (MOESI protocol)

 Shared memory bus and snooping bandwidth is the bottleneck for scaling symmetric multiprocessors

 Duplicating tags

 Place directory in outermost cache

 Use crossbars or point-to-point networks with banked memory

 Every multicore with more than 8 processors uses an interconnect other than a single bus

 Makes it difficult to serialize events

 Write and upgrade misses are not atomic

 How can the processor know when all invalidates are complete?

 How can we resolve races when two processors write at the same time?

 Solution: associate each block with a single bus

Performance of Symmetric Shared-Memory Multiprocessors

 Coherence influences cache miss rate

 Coherence misses

 True sharing misses

 Write to shared block (transmission of invalidation)

 Read an invalidated block

 False sharing misses

 Read an unmodified word in an invalidated block
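To make the distinction between true and false sharing concrete, the sketch below classifies read misses in a short access trace. It is a simplified model under stated assumptions: the function name `classify_read_misses`, the trace format `(processor, "r" or "w", word address)`, and the two-word block size are all hypothetical, and only read misses are classified (write upgrades are not counted).

```python
def classify_read_misses(trace, words_per_block=2):
    """Label each read in the trace as hit, cold, true sharing, or false sharing."""
    valid = set()  # (proc, block) pairs that currently hold a valid copy
    stale = {}     # (proc, block) -> words written by others since invalidation
    labels = []
    for proc, op, addr in trace:
        block = addr // words_per_block
        key = (proc, block)
        if op == "w":
            # Write invalidate: every other valid copy of the block is dropped.
            for other in [k for k in valid if k[1] == block and k[0] != proc]:
                valid.discard(other)
                stale[other] = set()
            # Remember which word changed for every processor holding a stale copy.
            for k in stale:
                if k[1] == block and k[0] != proc:
                    stale[k].add(addr)
        else:
            if key in valid:
                labels.append("hit")
            elif key in stale:
                # True sharing only if the word being read was actually modified.
                labels.append("true sharing" if addr in stale[key] else "false sharing")
            else:
                labels.append("cold")
        valid.add(key)
        stale.pop(key, None)
    return labels


# Words 0 and 1 share block 0 (assumed layout for the example).
trace = [
    (1, "r", 0),  # P1 loads the block
    (2, "r", 1),  # P2 loads the block
    (1, "w", 0),  # P1 writes word 0, invalidating P2's copy
    (2, "r", 1),  # P2 reads the unmodified word 1
    (1, "w", 0),  # P1 writes word 0 again
    (2, "r", 0),  # P2 reads the modified word 0
]
print(classify_read_misses(trace))
```

The fourth access misses only because word 1 happens to live in the same block as word 0, which is the false sharing case.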


Distributed Shared Memory and Directory-Based Coherence

 Snooping schemes require communication among all caches on every cache miss

 Limits scalability

 Another approach: Use centralized directory to keep track of every block

 Which caches have each block

 Dirty status of each block

 Implement in shared L3 cache

 Keep bit vector of size = # cores for each block in L3

 Not scalable beyond shared L3
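The per-block bit vector kept alongside a shared L3 can be sketched as one integer with one bit per core. The class name `SharerVector` and its methods are illustrative assumptions, not part of any real cache controller interface.

```python
class SharerVector:
    """Bit vector with one bit per core, recording which caches hold a block."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.bits = 0       # bit c set means core c has a copy
        self.dirty = False  # dirty status of the block

    def add(self, core):
        self.bits |= 1 << core

    def remove(self, core):
        self.bits &= ~(1 << core)

    def sharers(self):
        return [c for c in range(self.n_cores) if (self.bits >> c) & 1]


vec = SharerVector(4)
vec.add(0)
vec.add(3)
print(bin(vec.bits), vec.sharers())
```

Because the vector needs one bit per core for every block, its storage cost grows with the core count, which is one way to see why this scheme stays tied to a single shared L3.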


 Alternative approach:

 Distribute memory


 For each block, maintain state:

 Shared

 One or more nodes have the block cached, value in memory is up-to-date

 Set of node IDs

 Uncached

 Modified

 Exactly one node has a copy of the cache block, value in memory is out-of-date

 Owner node ID

 Directory maintains block states and sends invalidation messages


 For uncached block:

 Read miss

 Requesting node is sent the requested data and is made the only sharing node, block is now shared

 Write miss

 The requesting node is sent the requested data and becomes the sharing node, block is now exclusive

 For shared block:

 Read miss

 The requesting node is sent the requested data from memory, node is added to sharing set

 Write miss

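 The requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, sharing set only contains the requesting node, block is now exclusive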

 For exclusive block:

 Read miss

 The owner is sent a data fetch message, block becomes shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor

 Data write back

 Block becomes uncached, sharer set is empty

 Write miss

 Message is sent to old owner to invalidate and send the value to the directory, requestor becomes new owner, block remains exclusive
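The directory transitions for uncached, shared, and exclusive blocks can be collected into one small state machine. The sketch below tracks a single block; the class name `DirectoryEntry` and the message names (`data_reply`, `invalidate`, `fetch`, `fetch_invalidate`) are assumptions chosen for readability, and message delivery is modeled simply as a returned list.

```python
class DirectoryEntry:
    """Directory state for one block: uncached, shared, or exclusive."""

    def __init__(self):
        self.state = "uncached"
        self.sharers = set()  # node IDs; holds exactly the owner when exclusive

    def read_miss(self, node):
        msgs = []
        if self.state == "exclusive":
            (owner,) = self.sharers
            msgs.append(("fetch", owner))  # owner sends data back, memory is updated
        msgs.append(("data_reply", node))
        self.sharers.add(node)             # old owner (if any) stays a sharer
        self.state = "shared"
        return msgs

    def write_miss(self, node):
        msgs = []
        if self.state == "shared":
            msgs += [("invalidate", s) for s in sorted(self.sharers - {node})]
        elif self.state == "exclusive":
            (owner,) = self.sharers
            msgs.append(("fetch_invalidate", owner))
        msgs.append(("data_reply", node))
        self.sharers = {node}              # requester becomes the only holder
        self.state = "exclusive"
        return msgs

    def write_back(self, node):
        assert self.state == "exclusive" and self.sharers == {node}
        self.sharers = set()
        self.state = "uncached"


d = DirectoryEntry()
print(d.read_miss(0), d.state)
print(d.read_miss(1), d.state)
print(d.write_miss(2), d.state)
print(d.read_miss(0), d.state, sorted(d.sharers))
```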

Synchronization

 Basic building blocks:

 Atomic exchange

 Swaps register with memory location

 Test-and-set

 Sets under condition

 Fetch-and-increment

 Reads original value from memory and increments it in memory

 Requires memory read and write in a single uninterruptible instruction

 RISC-V: load reserved/store conditional

 If the contents of the memory location specified by the load reserved are changed before the store conditional to the same address, the store conditional fails

 Atomic exchange (EXCH):

try:    mov  x3,x4         ;move exchange value
        lr   x2,x1         ;load reserved from 0(x1)
        sc   x3,0(x1)      ;store conditional
        bnez x3,try        ;branch if store fails
        mov  x4,x2         ;put loaded value in x4

 Atomic increment:

try:    lr   x2,x1         ;load reserved from 0(x1)
        addi x3,x2,1       ;increment
        sc   x3,0(x1)      ;store conditional
        bnez x3,try        ;branch if store fails
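The retry loop built from load reserved and store conditional can be imitated in software to see why it is atomic. The sketch below is a simulation under assumptions: the `Memory` class, its per-hart reservation table, and the convention that `store_conditional` returns 0 on success and 1 on failure are modeling choices for the example, not a description of any particular implementation.

```python
class Memory:
    """Toy memory with one reservation per hart (hardware thread)."""

    def __init__(self):
        self.cells = {}
        self.reservations = {}  # hart id -> reserved address

    def load_reserved(self, hart, addr):
        self.reservations[hart] = addr
        return self.cells.get(addr, 0)

    def store(self, hart, addr, value):
        self.cells[addr] = value
        # A store to the address breaks every other hart's reservation on it.
        for h, a in list(self.reservations.items()):
            if a == addr and h != hart:
                del self.reservations[h]

    def store_conditional(self, hart, addr, value):
        if self.reservations.get(hart) != addr:
            return 1  # reservation lost: fail without writing
        self.store(hart, addr, value)
        del self.reservations[hart]
        return 0


def atomic_increment(mem, hart, addr):
    """Retry until the store conditional succeeds; return the old value."""
    while True:
        old = mem.load_reserved(hart, addr)
        if mem.store_conditional(hart, addr, old + 1) == 0:
            return old


mem = Memory()
value = mem.load_reserved(0, 100)           # hart 0 reserves address 100
mem.store(1, 100, 7)                        # hart 1 writes in between
failed = mem.store_conditional(0, 100, value + 1)
print(failed, mem.cells[100])               # the stale increment is rejected
print(atomic_increment(mem, 0, 100), mem.cells[100])
```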

 Lock (no cache coherence):

        addi x2,x0,1       ;load locked value
lockit: EXCH x2,0(x1)      ;atomic exchange
        bnez x2,lockit     ;already locked?

 Lock (cache coherence):

lockit: ld   x2,0(x1)      ;load of lock
        bnez x2,lockit     ;not available, spin
        addi x2,x0,1       ;load locked value
        EXCH x2,0(x1)      ;swap
        bnez x2,lockit     ;branch if lock wasn’t 0


 Advantage of this scheme: reduces memory traffic
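The spin-on-read lock can be sketched with threads to show the control flow: spin on ordinary loads, and attempt the atomic exchange only when the lock looks free. Python has no hardware atomic exchange, so the `Word` class below uses an internal mutex as a stand-in for it; that class, `SpinLock`, and `run_demo` are all hypothetical names, and the sketch shows the structure of the algorithm rather than its memory-traffic behavior.

```python
import threading
import time


class Word:
    """A shared memory word; the internal mutex stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def load(self):
        return self._value

    def store(self, value):
        with self._hw:
            self._value = value

    def exchange(self, new):
        with self._hw:
            old, self._value = self._value, new
            return old


class SpinLock:
    def __init__(self):
        self.word = Word(0)  # 0 = free, 1 = held

    def acquire(self):
        while True:
            while self.word.load() != 0:    # spin on plain loads of the lock
                time.sleep(0)               # yield so the holder can run
            if self.word.exchange(1) == 0:  # swap only when the lock looks free
                return

    def release(self):
        self.word.store(0)


def run_demo(n_threads=4, iters=200):
    lock, shared = SpinLock(), {"count": 0}

    def worker():
        for _ in range(iters):
            lock.acquire()
            shared["count"] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


print(run_demo())
```

On real hardware the inner loop hits in the local cache, so waiting processors generate bus traffic only when the lock is released and the line is invalidated.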

Models of Memory Consistency: An Introduction

Processor 1:

A=0

…

A=1

if (B==0) …

Processor 2:

B=0

…

B=1

if (A==0) …

 Should be impossible for both if-statements to be evaluated as true

 But what if the write invalidate is delayed and the processor is allowed to continue?

 Sequential consistency:

 Result of any execution should be the same as if:

 Accesses on each processor were kept in order

 Accesses on different processors were arbitrarily interleaved
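The two-processor example can be checked by brute force: enumerate every interleaving that keeps each processor's accesses in program order and record what the two reads observe. The operation encoding, the register names `r1` and `r2`, and the helper names are assumptions for the sketch. Reversing each program models, in a simplified way, a machine that lets the read complete before the earlier write.

```python
def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    if not p1 or not p2:
        yield list(p1) + list(p2)
        return
    for rest in interleavings(p1[1:], p2):
        yield [p1[0]] + rest
    for rest in interleavings(p1, p2[1:]):
        yield [p2[0]] + rest


def outcomes(p1, p2):
    """Return the set of (r1, r2) values observed over all interleavings."""
    results = set()
    for schedule in interleavings(p1, p2):
        mem, regs = {"A": 0, "B": 0}, {}
        for op in schedule:
            if op[0] == "w":
                mem[op[1]] = 1              # ("w", location): write 1
            else:
                regs[op[2]] = mem[op[1]]    # ("r", location, register): read
        results.add((regs["r1"], regs["r2"]))
    return results


P1 = [("w", "A"), ("r", "B", "r1")]  # A=1; if (B==0) ...
P2 = [("w", "B"), ("r", "A", "r2")]  # B=1; if (A==0) ...

sc = outcomes(P1, P2)                  # program order kept
relaxed = outcomes(P1[::-1], P2[::-1]) # each read moved ahead of its write
print(sorted(sc))
print(sorted(relaxed))
```

Under sequential consistency the outcome (0, 0), which corresponds to both if-statements being true, never appears; once the read may bypass the earlier write it becomes possible.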


 To implement, delay completion of all memory accesses until all invalidations caused by the access are completed

 Reduces performance!

 Alternatives:

 Program-enforced synchronization to force the write on one processor to occur before the read on the other processor

 Requires synchronization object for A and another for B

 “Unlock” after write

 “Lock” after read


 Rules:

 X → Y

 Operation X must complete before operation Y is done

 Sequential consistency requires:

 R → W, R → R, W → R, W → W

 Relax W → R

 “Total store ordering”

 Relax W → W

 “Partial store order”


 Consistency model is multiprocessor specific

 Programmers will often implement explicit synchronization

 Speculation gives much of the performance advantage of relaxed models with sequential consistency

 Basic idea: if an invalidation arrives for a result that has not been committed, use speculation recovery

Fallacies and Pitfalls

 Pitfall: Measuring performance of multiprocessors by linear speedup versus execution time

 Fallacy: Amdahl’s Law doesn’t apply to parallel computers

 Fallacy: Linear speedups are needed to make multiprocessors cost-effective

 Doesn’t consider cost of other system components

 Pitfall: Not developing the software to take advantage of, or optimize for, a multiprocessor architecture
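Amdahl's Law is easy to evaluate for a parallel machine, which helps show why it still applies there. The sketch below uses the standard form, speedup = 1 / ((1 - f) + f / n), where f is the fraction of execution time that can be parallelized and n is the processor count; the function name and the sample values are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when parallel_fraction of the work scales across n processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


for n in (10, 100, 1000):
    # With 90% of the work parallel, speedup can never reach 10.
    print(n, round(amdahl_speedup(0.9, n), 2))
```

However many processors are added, the serial fraction caps the speedup at 1 / (1 - f).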