Chapter 5
Thread-Level Parallelism
Introduction
Thread-level parallelism
Has multiple program counters
Uses the MIMD model
Targeted at tightly coupled shared-memory multiprocessors
For n processors, need n threads
Amount of computation assigned to each thread = grain size
Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
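As a small illustration of grain size, the sketch below splits a fixed amount of work into one contiguous chunk per thread. The function name `partition` and the even-split policy are assumptions made for this example, not something the slides prescribe.

```python
def partition(total_work, n_threads):
    """Split range(total_work) into n_threads contiguous chunks.

    Returns a list of (start, stop) pairs; chunk sizes differ by at most 1.
    """
    base, extra = divmod(total_work, n_threads)
    chunks = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # grain size of thread t
        chunks.append((start, start + size))
        start += size
    return chunks


print(partition(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

A smaller grain size means more threads to schedule and synchronize for the same work, which is where the overheads come from.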
Symmetric multiprocessors (SMP)
Small number of cores
Share single memory with uniform memory latency
Distributed shared memory (DSM)
Memory distributed among processors
Non-uniform memory access/latency (NUMA)
Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
Centralized Shared-Memory Architectures
Processors may see different values for the same memory location through their caches
Coherence
All reads by any processor must return the most recently written value
Writes to the same location by any two processors are seen in the same order by all processors
Consistency
When a written value will be returned by a read
If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
Coherent caches provide:
Migration: movement of data
Replication: multiple copies of data
Cache coherence protocols
Directory based
Sharing status of each block kept in one location
Snooping
Each core tracks sharing status of each block
Write invalidate
On write, invalidate all other copies
Use bus itself to serialize
Write cannot complete until bus access is obtained
Locating an item when a read miss occurs
In write-back cache, the updated value must be sent to the requesting processor
Cache lines marked as shared or exclusive/modified
Only writes to shared lines need an invalidate broadcast
After this, the line is marked as exclusive
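The write-invalidate behavior described above can be sketched as a toy simulation. This is a minimal model under stated assumptions: the `Bus` and `Cache` classes are hypothetical, each block holds a single value, the bus handles one request at a time, and only the three MSI states are modeled.

```python
class Bus:
    """Hypothetical shared bus: serializes requests and lets every cache snoop."""
    def __init__(self):
        self.caches = []
        self.memory = {}      # address -> value
        self.log = []         # bus transactions, in serialized order

    def broadcast(self, kind, addr, source):
        self.log.append((kind, addr, source.name))
        for cache in self.caches:
            if cache is not source:
                cache.snoop(kind, addr)


class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.state = {}       # address -> "M", "S" or "I"; absent means "I"
        self.data = {}
        bus.caches.append(self)

    def read(self, addr):
        if self.state.get(addr, "I") == "I":          # read miss
            self.bus.broadcast("read_miss", addr, self)
            self.data[addr] = self.bus.memory.get(addr, 0)
            self.state[addr] = "S"
        return self.data[addr]

    def write(self, addr, value):
        st = self.state.get(addr, "I")
        if st == "I":
            self.bus.broadcast("write_miss", addr, self)
        elif st == "S":
            self.bus.broadcast("invalidate", addr, self)
        self.state[addr] = "M"    # no bus traffic when already "M"
        self.data[addr] = value

    def snoop(self, kind, addr):
        st = self.state.get(addr, "I")
        if st == "M":
            # Write-back cache: the owner supplies the updated value.
            self.bus.memory[addr] = self.data[addr]
        if kind == "read_miss" and st == "M":
            self.state[addr] = "S"
        elif kind in ("write_miss", "invalidate") and st != "I":
            self.state[addr] = "I"


bus = Bus()
p0, p1 = Cache("P0", bus), Cache("P1", bus)
p0.write(0x40, 7)         # write miss: P0 holds the block in M
print(p1.read(0x40))      # 7 (read miss: P0 writes back, both end in S)
p1.write(0x40, 9)         # write to a shared line: invalidate broadcast
p1.write(0x40, 10)        # already M: no bus transaction
print(p0.state[0x40], p1.state[0x40], len(bus.log))  # I M 3
```

Only the write to the shared line needs an invalidate broadcast; the second write by P1 finds the line already exclusive and stays off the bus.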
Complications for the basic MSI protocol:
Operations are not atomic
E.g. detect miss, acquire bus, receive a response
Creates possibility of deadlock and races
One solution: processor that sends invalidate can hold bus until other processors receive the invalidate
Extensions:
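Add exclusive state to indicate a clean block held in only one cache (MESI protocol)
Avoids having to broadcast an invalidate when that block is written
Add owned state to indicate a block that is owned by that cache and out-of-date in memory (MOESI protocol)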
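To make the benefit of the exclusive state concrete, the sketch below counts the bus transactions one cache issues for a single block under MSI and under MESI. It is a simplified model: the function `bus_transactions` is hypothetical, and it ignores requests snooped from other caches after the first read miss.

```python
def bus_transactions(protocol, ops, other_sharers=False):
    """Count bus transactions one cache issues for a single block.

    protocol: "MSI" or "MESI"; ops: sequence of "r"/"w" from one processor.
    other_sharers: whether another cache holds the block at the read miss.
    """
    state, count = "I", 0
    for op in ops:
        if op == "r":
            if state == "I":
                count += 1  # read miss goes on the bus
                exclusive = protocol == "MESI" and not other_sharers
                state = "E" if exclusive else "S"
        else:  # write
            if state == "I":
                count += 1  # write miss
            elif state == "S":
                count += 1  # invalidate broadcast
            # state "E" or "M": silent upgrade, no bus transaction
            state = "M"
    return count


print(bus_transactions("MSI", "rw"), bus_transactions("MESI", "rw"))  # 2 1
```

For private data that is read and then written, MESI saves the invalidate; when another cache already shares the block, both protocols need two transactions.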
Shared memory bus and snooping bandwidth is the bottleneck for scaling symmetric multiprocessors
Duplicating tags
Place directory in outermost cache
Use crossbars or point-to-point networks with banked memory
Every multicore with more than 8 processors uses an interconnect other than a single bus
Makes it difficult to serialize events
Write and upgrade misses are not atomic
How can the processor know when all invalidates are complete?
How can we resolve races when two processors write at the same time?
Solution: associate each block with a single bus
Performance of Symmetric Shared-Memory Multiprocessors
Coherence influences cache miss rate
Coherence misses
True sharing misses
Write to shared block (transmission of invalidation)
Read an invalidated block
False sharing misses
Read an unmodified word in an invalidated block
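The distinction between the two kinds of read miss can be sketched with a small tracker for one cache. This is an illustrative model only: `SharingTracker`, `WORDS_PER_BLOCK`, and the assumption that every block starts out cached locally are all hypothetical, and write misses are not classified.

```python
WORDS_PER_BLOCK = 8  # assumed block size in words


class SharingTracker:
    """Classify one cache's read misses on blocks invalidated by remote writes."""
    def __init__(self):
        self.dirty_words = {}   # invalidated block -> words written remotely

    def remote_write(self, word_addr):
        # Another processor writes a word: our copy of the block is invalidated.
        block = word_addr // WORDS_PER_BLOCK
        self.dirty_words.setdefault(block, set()).add(word_addr)

    def local_read(self, word_addr):
        block = word_addr // WORDS_PER_BLOCK
        if block not in self.dirty_words:
            return "hit"        # block still valid (assumed already cached)
        modified = self.dirty_words.pop(block)   # refetch revalidates the block
        if word_addr in modified:
            return "true sharing miss"
        return "false sharing miss"


t = SharingTracker()
t.remote_write(3)
print(t.local_read(5))   # false sharing miss (word 5 was never modified)
t.remote_write(3)
print(t.local_read(3))   # true sharing miss (word 3 carries the new value)
```

With one word per block, the false sharing case cannot occur, which is why it is attributed to block size rather than to real communication.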
Distributed Shared Memory and Directory-Based Coherence
Snooping schemes require communication among all caches on every cache miss
Limits scalability
Another approach: Use centralized directory to keep track of every block
Which caches have each block
Dirty status of each block
Implement in shared L3 cache
Keep a bit vector with one bit per core for each block in L3
Not scalable beyond shared L3
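A per-block bit vector can be sketched as follows. The class `BitVectorEntry`, the helper `overhead`, and the 64-byte block size are assumptions chosen for illustration.

```python
class BitVectorEntry:
    """Directory entry for one L3 block: one presence bit per core plus a dirty bit."""
    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.sharers = 0      # bit i set => core i holds a copy
        self.dirty = False

    def add(self, core):
        self.sharers |= 1 << core

    def remove(self, core):
        self.sharers &= ~(1 << core)

    def holders(self):
        return [c for c in range(self.n_cores) if (self.sharers >> c) & 1]


def overhead(n_cores, block_bytes=64):
    """Presence bits as a fraction of the data bits in one block."""
    return n_cores / (block_bytes * 8)


entry = BitVectorEntry(8)
entry.add(1)
entry.add(5)
entry.remove(1)
print(entry.holders())             # [5]
print(overhead(8), overhead(64))   # 0.015625 0.125
```

The storage cost grows linearly with the core count, which is one reason a single centralized vector per block stops being attractive as systems grow.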
Alternative approach:
Distribute memory
For each block, maintain state:
Shared
One or more nodes have the block cached, value in memory is up-to-date
Set of node IDs
Uncached
No node has a copy of the cache block
Modified
Exactly one node has a copy of the cache block, value in memory is out-of-date
Owner node ID
Directory maintains block states and sends invalidation messages
For uncached block:
Read miss
Requesting node is sent the requested data and is made the only sharing node, block is now shared
Write miss
The requesting node is sent the requested data and becomes the owner (the only node in the sharing set), block is now exclusive
For shared block:
Read miss
The requesting node is sent the requested data from memory, node is added to sharing set
Write miss
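The requesting node is sent the value, all nodes in the sharing set are sent invalidate messages, the sharing set then contains only the requesting node, block is now exclusive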
For exclusive block:
Read miss
The owner is sent a data fetch message, block becomes shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor
Data write back
Block becomes uncached, sharer set is empty
Write miss
Message is sent to old owner to invalidate and send the value to the directory, requestor becomes new owner, block remains exclusive
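The directory transitions listed above for uncached, shared, and exclusive blocks can be sketched as a small state machine for one block. The class `Directory` and its message names are hypothetical; data movement is reduced to a log of messages the directory would send.

```python
class Directory:
    """Home-node directory entry for one block (states as in the text)."""
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()
        self.messages = []    # (message, destination node) sent by the directory

    def read_miss(self, node):
        if self.state == "exclusive":
            (owner,) = self.sharers
            self.messages.append(("fetch", owner))   # owner writes data back
        self.messages.append(("data_reply", node))
        self.sharers.add(node)                        # old owner stays a sharer
        self.state = "shared"

    def write_miss(self, node):
        if self.state == "shared":
            for other in sorted(self.sharers - {node}):
                self.messages.append(("invalidate", other))
        elif self.state == "exclusive":
            (owner,) = self.sharers
            self.messages.append(("fetch_invalidate", owner))
        self.messages.append(("data_reply", node))
        self.sharers = {node}                         # requestor is the new owner
        self.state = "exclusive"

    def write_back(self, node):
        assert self.state == "exclusive" and self.sharers == {node}
        self.sharers = set()
        self.state = "uncached"


d = Directory()
d.read_miss(0)     # uncached -> shared, sharers {0}
d.read_miss(1)     # shared, sharers {0, 1}
d.write_miss(2)    # invalidate 0 and 1, exclusive with owner 2
d.read_miss(0)     # fetch from owner 2, shared, sharers {0, 2}
print(d.state, sorted(d.sharers), len(d.messages))  # shared [0, 2] 7
```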
Synchronization
Basic building blocks:
Atomic exchange
Swaps register with memory location
Test-and-set
Tests a value and sets it if the value passes the test
Fetch-and-increment
Reads original value from memory and increments it in memory
Each requires a memory read and write in one uninterruptible instruction
RISC-V: load reserved/store conditional
If the contents of the memory location specified by the load reserved are changed before the store conditional to the same address, the store conditional fails
Atomic exchange (EXCH):
try: mov x3,x4      ;move exchange value
     lr x2,x1       ;load reserved from 0(x1)
     sc x3,0(x1)    ;store conditional
     bnez x3,try    ;branch if store fails
     mov x4,x2      ;put loaded value in x4
Atomic increment:
try: lr x2,x1       ;load reserved from 0(x1)
     addi x3,x2,1   ;increment
     sc x3,0(x1)    ;store conditional
     bnez x3,try    ;branch if store fails
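The retry loops above can be mimicked in a toy model of load reserved/store conditional. Everything here is an illustrative assumption: the `Memory` class models a single word, any write clears all reservations, and the `interfere` hook exists only to force one failed store conditional.

```python
class Memory:
    """Toy single-word memory with load-reserved / store-conditional semantics."""
    def __init__(self, value=0):
        self.value = value
        self.reservations = set()   # processors holding a valid reservation

    def load_reserved(self, cpu):
        self.reservations.add(cpu)
        return self.value

    def store(self, value):
        self.value = value
        self.reservations.clear()   # any write breaks all reservations

    def store_conditional(self, cpu, value):
        if cpu not in self.reservations:
            return 1                # failure: location written since the lr
        self.store(value)
        return 0                    # success, matching the bnez retry test


def atomic_exchange(mem, cpu, new_value, interfere=None):
    while True:
        old = mem.load_reserved(cpu)
        if interfere:               # hypothetical hook: another CPU writes here
            interfere()
            interfere = None
        if mem.store_conditional(cpu, new_value) == 0:
            return old


def fetch_and_increment(mem, cpu):
    while True:
        old = mem.load_reserved(cpu)
        if mem.store_conditional(cpu, old + 1) == 0:
            return old


mem = Memory(5)
print(atomic_exchange(mem, 0, 9))       # 5
print(fetch_and_increment(mem, 0))      # 9
print(atomic_exchange(mem, 0, 1, interfere=lambda: mem.store(42)))  # 42
```

In the last call the first store conditional fails because of the intervening write, so the loop retries and returns the value it observed on the second attempt.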
Lock (no cache coherence):
        addi x2,x0,1
lockit: EXCH x2,0(x1)    ;atomic exchange
        bnez x2,lockit   ;already locked?
Lock (cache coherence):
lockit: ld x2,0(x1)      ;load of lock
        bnez x2,lockit   ;not available-spin
        addi x2,x0,1     ;load locked value
        EXCH x2,0(x1)    ;swap
        bnez x2,lockit   ;branch if lock wasn’t 0
Advantage of this scheme: spinning on a locally cached copy of the lock reduces memory traffic
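A rough way to see the traffic difference is to count bus transactions for a single waiting processor. The model is deliberately simple and its assumptions are hypothetical: one spinner, a lock that is released after `hold_iterations` spin iterations, and no competition when the lock becomes free.

```python
def spin_exchange_only(hold_iterations):
    """Bus transactions when every spin iteration performs an atomic exchange."""
    bus = 0
    for _ in range(hold_iterations + 1):
        bus += 1                      # EXCH is a write: always goes on the bus
        # Only the last exchange, after the release, acquires the lock.
    return bus


def spin_on_cached_copy(hold_iterations):
    """Bus transactions when spinning on a load and exchanging only if free."""
    bus = 0
    cached = False
    for i in range(hold_iterations + 1):
        released = i == hold_iterations
        if released:
            cached = False            # the release write invalidates our copy
        if not cached:
            bus += 1                  # read miss fetches the lock's block
            cached = True
        if released:
            bus += 1                  # lock looks free: one EXCH to claim it
    return bus


print(spin_exchange_only(100), spin_on_cached_copy(100))  # 101 3
```

With several processors waiting, the release still triggers a burst of misses and exchanges, so the saving applies to the waiting phase rather than to the hand-off.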
Models of Memory Consistency: An Introduction
Processor 1:
A=0
… A=1
if (B==0) …
Processor 2:
B=0
… B=1
if (A==0) …
Should be impossible for both if-statements to be evaluated as true
But if the write invalidates are delayed and the processors continue, both could read the old value 0
Sequential consistency:
Result of any execution should be the same as if:
Accesses on each processor were kept in order
Accesses on different processors were arbitrarily interleaved
To implement, delay completion of all memory accesses until all invalidations caused by the access are completed
Reduces performance!
Alternatives:
Program-enforced synchronization to force the write on one processor to occur before the read on the other processor
Requires synchronization object for A and another for B
“Unlock” after write
“Lock” after read
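The two-processor example can be checked by brute force under sequential consistency: enumerate every interleaving that keeps each processor's accesses in order and record what the two reads return. The operation tuples and the names `r1` and `r2` (the values tested by the two if-statements) are notation introduced for this sketch.

```python
P1 = [("write", "A", 1), ("read", "B", "r1")]   # A=1; if (B==0) ...
P2 = [("write", "B", 1), ("read", "A", "r2")]   # B=1; if (A==0) ...


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    mem = {"A": 0, "B": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "write":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0), in which both if-statements evaluate as true, never appears among the six sequentially consistent interleavings.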
Rules:
X → Y
Operation X must complete before operation Y is done
Sequential consistency requires:
R → W, R → R, W → R, W → W
Relax W → R
“Total store ordering”
Also relax W → W
“Partial store order”
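Relaxing W → R can be pictured with per-processor store buffers: a write waits in the buffer while the same processor's next read, to a different location, goes ahead. The sketch below hard-codes one such execution of the earlier A/B example; the function name and the buffer lists are hypothetical.

```python
def run_with_store_buffers():
    """One execution where each write sits in a store buffer past the next read."""
    mem = {"A": 0, "B": 0}
    buf1, buf2 = [], []

    buf1.append(("A", 1))          # P1: A = 1 (buffered, not yet visible)
    buf2.append(("B", 1))          # P2: B = 1 (buffered, not yet visible)
    r1 = mem["B"]                  # P1 reads B before its own write drains
    r2 = mem["A"]                  # P2 reads A likewise
    for loc, val in buf1 + buf2:   # buffers drain afterwards
        mem[loc] = val
    return r1, r2, mem


print(run_with_store_buffers())  # (0, 0, {'A': 1, 'B': 1})
```

Both reads return 0 here, an outcome that sequential consistency forbids; a program that needs the stronger guarantee has to add explicit synchronization.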
Consistency model is multiprocessor specific
Programmers will often implement explicit synchronization
Speculation gives much of the performance advantage of relaxed models with sequential consistency
Basic idea: if an invalidation arrives for a result that has not been committed, use speculation recovery
Fallacies and Pitfalls
Pitfall: measuring performance of multiprocessors by linear speedup versus execution time
Fallacy: Amdahl’s Law doesn’t apply to parallel computers
Fallacy: linear speedups are needed to make multiprocessors cost-effective
Doesn’t consider cost of other system components
Pitfall: not developing the software to take advantage of, or optimize for, a multiprocessor architecture
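Amdahl's Law does apply to parallel computers, and a short calculation shows how strongly the serial part limits speedup. The function name `amdahl_speedup` and the sample fractions are chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup = 1 / ((1 - f) + f / n) for parallelizable fraction f."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


print(round(amdahl_speedup(0.95, 100), 1))    # 16.8
print(round(amdahl_speedup(0.9975, 100), 1))  # 80.2
```

Even with 95% of the work parallelized, 100 processors give a speedup of under 17; reaching a speedup of about 80 requires all but roughly 0.25% of the work to run in parallel.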