Case Study 3: Advanced Directory Protocol

5.13 a. P0,0: read 100 Read hit

b. P0,0: read 120 Miss, will replace modified data (B0) and get new line in shared state

P0,0: M → MI^A → I → IS^D → S

Dir: DM {P0,0} → DI {}

c. P0,0: write 120 ← 80 Miss, will replace modified data (B0) and get new line in modified state

P0,0: M → MI^A → I → IM^AD → IM^A → M

P3,1: S → I

Dir: DS {P3,0} → DM {P0,0}

d, e, f: steps similar to parts a, b, and c

5.14 a. P0,0: read 120 Miss, will replace modified data (B0) and get new line in shared state

P0,0: M → MI^A → I → IS^D → S

Figure S.31 Directory states. [State diagram with states Invalid, Shared, Owned, and Modified; arcs are labeled with read miss, write miss, and data write-back events, together with the resulting fetch, invalidate, and data value reply actions and the sharer-set updates.]

P1,0: read 120 Miss, will replace modified data (B0) and get new line in shared state

P1,0: M → MI^A → I → IS^D → S

Dir: DS {P3,0} → DS {P3,0; P0,0} → DS {P3,0; P0,0; P1,0}

b. P0,0: read 120 Miss, will replace modified data (B0) and get new line in shared state

P0,0: M → MI^A → I → IS^D → S

P1,0: write 120 ← 80 Miss, will replace modified data (B0) and get new line in modified state

P1,0: M → MI^A → I → IM^AD → IM^A → M

P3,1: S → I

Dir: DS {P3,1} → DS {P3,0; P1,0} → DM {P1,0}

c, d, e: steps similar to parts a and b

5.15 a. P0,0: read 100 Read hit, 1 cycle

b. P0,0: read 120 Read Miss, causes modified block replacement and is satisfied in memory and incurs 4 chip crossings (see underlined)

Latency for P0,0: Lsend_data + Ldata_msg + Lwrite_memory + Linv + L_ack + Lreq_msg + Lsend_msg + Lreq_msg + Lread_memory + Ldata_msg + Lrcv_data + 4 × chip crossings latency = 20 + 30 + 20 + 1 + 4 + 15 + 6 + 15 + 100 + 30 + 15 + 4 × 20 = 336 cycles

c, d, e: follow same steps as a and b
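The sum for part b is easy to get wrong by hand, so the following minimal Python sketch re-adds the terms. The latency names and values are copied from the expression above; the list layout and the helper name `total_latency` are illustrative choices, not part of the original solution.

```python
# Latency terms (in cycles) in the order they appear in the part b expression.
# Lreq_msg and Ldata_msg each occur twice, so a list of pairs is used
# rather than a dictionary.
TERMS = [
    ("Lsend_data", 20),
    ("Ldata_msg", 30),
    ("Lwrite_memory", 20),
    ("Linv", 1),
    ("L_ack", 4),
    ("Lreq_msg", 15),
    ("Lsend_msg", 6),
    ("Lreq_msg", 15),
    ("Lread_memory", 100),
    ("Ldata_msg", 30),
    ("Lrcv_data", 15),
]
CHIP_CROSSINGS = 4
CHIP_CROSSING_LATENCY = 20


def total_latency(terms, crossings, crossing_latency):
    """Sum the per-step latencies and add the chip-crossing penalty."""
    return sum(value for _, value in terms) + crossings * crossing_latency


if __name__ == "__main__":
    print(total_latency(TERMS, CHIP_CROSSINGS, CHIP_CROSSING_LATENCY))
```

Parts c, d, and e can be checked the same way by editing the list of terms and the number of chip crossings.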

5.16 All protocols must ensure forward progress, even under worst-case memory access patterns. It is crucial that the protocol implementation guarantee (at least with a probabilistic argument) that a processor will be able to perform at least one memory operation each time it completes a cache miss. Otherwise, starvation might result. Consider the simple spin lock code:

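tas:     DADDUI R2, R0, #1     ; R2 = 1, the "locked" value
lockit:  EXCH   R2, 0(R1)      ; atomically swap R2 with the lock word
         BNEZ   R2, lockit     ; old value nonzero: lock was held, retry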

If all processors are spinning on the same loop, they will all repeatedly issue GetM messages. If a processor is not guaranteed to be able to perform at least one instruction, then each could steal the block from the other repeatedly. In the worst case, no processor could ever successfully perform the exchange.

5.17 a. The MS^A state is essentially a “transient O” because it allows the processor to read the data and it will respond to GetShared and GetModified requests from other processors. It is transient, and not a real O state, because memory will send the PutM_Ack and take responsibility for future requests.

b. See Figures S.32 and S.33

State | Read | Write | Replacement | INV | Forwarded_GetS | Forwarded_GetM | PutM_Ack | Data | Last ACK
I | send … | … | … | … | error | error | error | error | error
S | do Read | send GetM/IM | I | send Ack/I | error | error | error | error | error
O | do Read | send GetM/OM | send PutM/OI | error | send Data | send Data/I | error | - | -
M | do Read | do Write | send PutM/MI | error | send Data/O | send Data/I | error | error | error
IS | z | z | z | send Ack/ISI | error | error | error | save Data, do Read/S | error
ISI | z | z | z | send Ack | error | error | error | save Data, do Read/I | error
IM | z | z | z | send Ack | IMO | IMI^A | error | save Data | do Write/M
IMI | z | z | z | error | error | error | error | save Data | do Write, send Data/I
IMO | z | z | z | send Ack/IMI | - | IMOI | error | save Data | do Write, send Data/O
IMOI | z | z | z | error | error | error | error | save Data | do Write, send Data/I
OI | z | z | z | error | send Data | send Data | /I | error | error
MI | z | z | z | error | send Data | send Data | /I | error | error
OM | z | z | z | error | send Data | send Data/IM | error | save Data | do Write/M

Figure S.32 Directory protocol cache controller transitions.

State | Read | Write | … | …
DI | … | send Data, clear sharers, set owner/DM | error | send PutM_Ack
DS | send Data, add to sharers | send INVs to sharers, clear sharers, set owner, send Data/DM | error | send PutM_Ack
DO | forward GetS, add to sharers | forward GetM, send INVs to sharers, clear sharers, set owner/DM | send Data, send PutM_Ack/DS | send PutM_Ack
DM | forward GetS, add requester and owner to sharers/DO | forward GetM, send INVs to sharers, clear sharers, set owner | send Data, send PutM_Ack/DI | send PutM_Ack

Figure S.33 Directory controller transitions.

5.18 a. P1,0: read 100 P3,1: write 100 ← 90

In this problem, both P1,0 and P3,1 miss and send requests that race to the directory. Assuming that P1,0's GetS request arrives first, the directory will forward P1,0's GetS to P0,0, followed shortly afterwards by P3,1's GetM. If the network maintains point-to-point order, then P0,0 will see the requests in the right order and the protocol will work as expected. However, if the forwarded requests arrive out of order, then the GetM will force P0,0 to state I, causing it to detect an error when P1,0's forwarded GetS arrives.

b. P1,0: read 100 P0,0: replace 100

P1,0's GetS arrives at the directory and is forwarded to P0,0 before P0,0's PutM message arrives at the directory, which then sends the PutM_Ack. However, if the PutM_Ack arrives at P0,0 out of order (i.e., before the forwarded GetS), then this will cause P0,0 to transition to state I. In this case, the forwarded GetS will be treated as an error condition.
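Both races in 5.18 come down to the order in which P0,0 receives two messages. The Python sketch below replays each order against a small subset of the cache-controller transitions from Figure S.32. The message names `Fwd_GetS` and `Fwd_GetM`, the reduction of each table entry to just its next state, and the helper `deliver` are simplifying assumptions made for illustration.

```python
# (current state, incoming message) -> next state.
# Only the entries needed for 5.18 are listed; any pair that is absent
# is treated as the "error" entry of the table.
TRANSITIONS = {
    ("M", "Fwd_GetS"): "O",    # send Data, keep an owned copy
    ("M", "Fwd_GetM"): "I",    # send Data, give up the block
    ("O", "Fwd_GetS"): "O",    # send Data
    ("O", "Fwd_GetM"): "I",    # send Data, give up the block
    ("MI", "Fwd_GetS"): "MI",  # send Data, still waiting for PutM_Ack
    ("MI", "PutM_Ack"): "I",   # replacement complete
}


def deliver(state, messages):
    """Apply messages in arrival order; stop at the first error entry."""
    for msg in messages:
        if (state, msg) not in TRANSITIONS:
            return f"error: {msg} in state {state}"
        state = TRANSITIONS[(state, msg)]
    return state


if __name__ == "__main__":
    # Part a: P0,0 holds block 100 in M.
    print(deliver("M", ["Fwd_GetS", "Fwd_GetM"]))   # in-order delivery
    print(deliver("M", ["Fwd_GetM", "Fwd_GetS"]))   # reordered delivery
    # Part b: P0,0 has sent PutM and waits in MI.
    print(deliver("MI", ["Fwd_GetS", "PutM_Ack"]))  # in-order delivery
    print(deliver("MI", ["PutM_Ack", "Fwd_GetS"]))  # reordered delivery
```

In both parts the in-order sequence ends in state I without incident, while the reordered sequence reaches I one message too early and then hits an error entry.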

Exercises

5.19 The general form for Amdahl's Law is

\[ \text{Speedup} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} \]

so all that needs to be done to compute the formula for speedup in this multiprocessor case is to derive the new execution time.

The exercise states that the portion of the original execution time that can use i processors is given by f(i,p). If we let Execution time_old be 1, then the relative time for the application on p processors is given by summing the times required for each portion of the execution time that can be sped up using i processors, where i is between 1 and p. This yields

\[ \text{Execution time}_{\text{new}} = \sum_{i=1}^{p} \frac{f(i,p)}{i} \]

Substituting this value for Execution time_new into the speedup equation makes Amdahl's Law a function of the available processors, p.

5.20 a. (i) 64 processors arranged as a ring: largest number of communication hops = 32 → communication cost = (100 + 10 × 32) ns = 420 ns.

(ii) 64 processors arranged as 8x8 processor grid: largest number of communication hops = 14 → communication cost = (100 + 10 × 14) ns = 240 ns.

(iii) 64 processors arranged as a hypercube: largest number of hops = 6 (log₂ 64) → communication cost = (100 + 10 × 6) ns = 160 ns.
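The worst-case hop counts in part a follow from the shape of each topology, and a short Python sketch makes the pattern explicit. The function names are hypothetical, and the grid is assumed to be a mesh without wraparound links, which is what the hop count of 14 implies. The 100 ns base cost and 10 ns per hop are taken from the calculations in part a.

```python
def max_hops_ring(n):
    """Farthest node on a bidirectional ring of n nodes."""
    return n // 2


def max_hops_grid(side):
    """Corner-to-corner distance on a side x side mesh (no wraparound)."""
    return 2 * (side - 1)


def max_hops_hypercube(n):
    """log2(n) for n a power of two: one hop per differing address bit."""
    return n.bit_length() - 1


def comm_cost_ns(hops, base_ns=100, per_hop_ns=10):
    """Communication cost as used in part a: base cost plus per-hop cost."""
    return base_ns + per_hop_ns * hops


if __name__ == "__main__":
    for name, hops in [
        ("ring", max_hops_ring(64)),
        ("8x8 grid", max_hops_grid(8)),
        ("hypercube", max_hops_hypercube(64)),
    ]:
        print(name, hops, comm_cost_ns(hops), "ns")
```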

b. Base CPI = 0.5 cpi

(i) 64 processors arranged as a ring: Worst case CPI = 0.5 + 0.2/100 × (420) = 1.34 cpi

(ii) 64 processors arranged as 8x8 processor grid: Worst case CPI = 0.5 + 0.2/100 × (240) = 0.98 cpi

(iii) 64 processors arranged as a hypercube: Worst case CPI = 0.5 + 0.2/100 × (160) = 0.82 cpi

The average CPI can be obtained by replacing the largest number of communication hops in the above calculation by ĥ, the average number of communication hops. The latter number depends on both the topology and the application.

c. Since the CPU frequency and the number of instructions executed did not change, the answer can be obtained by dividing the CPI for each of the topologies (worst case or average) by the base (no remote communication) CPI.
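Parts b and c reduce to two one-line formulas, sketched below in Python. The sketch mirrors the arithmetic as written above: 0.2/100 is read as the fraction of instructions that communicate remotely, and the communication cost is added directly as cycles. Both readings are assumptions inferred from the numbers rather than stated here, and the function names are hypothetical.

```python
BASE_CPI = 0.5
REMOTE_FRACTION = 0.2 / 100  # assumed: share of instructions that go remote

# Worst-case communication costs from part a.
WORST_CASE_COST = {"ring": 420, "8x8 grid": 240, "hypercube": 160}


def cpi_with_remote(cost, base_cpi=BASE_CPI, fraction=REMOTE_FRACTION):
    """Part b: base CPI plus the remote-communication penalty."""
    return base_cpi + fraction * cost


def slowdown(cost, base_cpi=BASE_CPI):
    """Part c: CPI with remote communication divided by the base CPI."""
    return cpi_with_remote(cost, base_cpi) / base_cpi


if __name__ == "__main__":
    for name, cost in WORST_CASE_COST.items():
        print(f"{name}: CPI = {cpi_with_remote(cost):.2f}, "
              f"ratio to base = {slowdown(cost):.2f}")
```

Passing a cost computed from the average hop count ĥ instead of the worst-case hop count gives the average-case figures.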

5.21 To keep the figures from becoming cluttered, the coherence protocol is split into two parts as was done in Figure 5.6 in the text. Figure S.34 presents the CPU portion of the coherence protocol, and Figure S.35 presents the bus portion of the protocol. In both of these figures, the arcs indicate transitions and the text along each arc indicates the stimulus (in normal text) and bus action (in bold text) that occurs during the transition between states. Finally, like the text, we assume a write hit is handled as a write miss.

Figure S.34 presents the behavior of state transitions caused by the CPU itself. In this case, a write to a block in either the invalid or shared state causes us to broadcast a "write invalidate" to flush the block from any other caches that hold the block and move to the exclusive state. We can leave the exclusive state through either an invalidate from another processor (which occurs on the bus side of the coherence protocol state diagram), or a read miss generated by the CPU (which occurs when an exclusive block of data is displaced from the cache by a second block). In the shared state only a write by the CPU or an invalidate from another processor can move us out of this state. In the case of transitions caused by events external to the CPU, the state diagram is fairly simple, as shown in Figure S.35.

When another processor writes a block that is resident in our cache, we unconditionally invalidate the corresponding block in our cache. This ensures that the next time we read the data, we will load the updated value of the block from memory. Also, whenever the bus sees a read miss, it must change the state of an exclusive block to shared as the block is no longer exclusive to a single cache.

The major change introduced in moving from a write-back to write-through cache is the elimination of the need to access dirty blocks in another processor's caches. As a result, in the write-through protocol it is no longer necessary to provide the hardware to force write back on read accesses or to abort pending memory accesses. As memory is updated during any write on a write-through cache, a processor that generates a read miss will always retrieve the correct information from memory. Basically, it is not possible for valid cache blocks to be incoherent with respect to main memory in a system with write-through caches.
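The transitions described for 5.21 can be written down as a small lookup table, which makes it easy to trace a block through a sequence of events. The Python sketch below is one possible encoding. The event names are hypothetical, and two destinations are assumptions that the prose does not spell out: a CPU read from Invalid is taken to land in Shared, and a CPU read miss that displaces an Exclusive block is taken to leave the frame in Shared.

```python
# (state, event) -> next state for one cache block frame.
TRANSITIONS = {
    # CPU side (Figure S.34)
    ("Invalid", "cpu_read"): "Shared",         # assumed destination
    ("Invalid", "cpu_write"): "Exclusive",     # broadcasts write invalidate
    ("Shared", "cpu_read"): "Shared",
    ("Shared", "cpu_write"): "Exclusive",      # broadcasts write invalidate
    ("Exclusive", "cpu_read_hit"): "Exclusive",
    ("Exclusive", "cpu_write"): "Exclusive",
    ("Exclusive", "cpu_read_miss"): "Shared",  # assumed destination
    # Bus side (Figure S.35)
    ("Shared", "bus_write"): "Invalid",        # another processor wrote
    ("Exclusive", "bus_write"): "Invalid",
    ("Exclusive", "bus_read_miss"): "Shared",  # block is no longer exclusive
}


def step(state, event):
    """Return the next state; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Apply a sequence of events to a starting state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run("Invalid", ["cpu_write", "bus_read_miss", "bus_write"]))
```

No entry needs a write-back action, which reflects the point made above: with write-through caches memory always holds the current value.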

Figure S.34 CPU portion of the simple cache coherency protocol for write-through caches. [State diagram with states Invalid, Shared (read only), and Exclusive (read/write); arcs labeled CPU read, CPU read miss, CPU read hit or write, and CPU write with bus action Invalidate block.]

Figure S.35 Bus portion of the simple cache coherency protocol for write-through caches. [State diagram with states Invalid, Shared (read only), and Exclusive (read/write); arcs labeled Write miss with action Invalidate block, and Read miss.]

5.22 To augment the snooping protocol of Figure 5.7 with a Clean Exclusive state we assume that the cache can distinguish a read miss that will allocate a block destined to have the Clean Exclusive state from a read miss that will deliver a Shared block.

Without further discussion we assume that there is some mechanism to do so.

The three states of Figure 5.7 and the transitions between them are unchanged, with the possible clarifying exception of renaming the Exclusive (read/write) state to Dirty Exclusive (read/write).

The new Clean Exclusive (read only) state should be added to the diagram along with the following transitions.

■ from Clean Exclusive to Clean Exclusive in the event of a CPU read hit on this block or a CPU read miss on a Dirty Exclusive block

■ from Clean Exclusive to Shared in the event of a CPU read miss on a Shared block or on a Clean Exclusive block

■ from Clean Exclusive to Shared in the event of a read miss on the bus for this block

■ from Clean Exclusive to Invalid in the event of a write miss on the bus for this block

■ from Clean Exclusive to Dirty Exclusive in the event of a CPU write hit on this block or a CPU write miss

■ from Dirty Exclusive to Clean Exclusive in the event of a CPU read miss on a Dirty Exclusive block

■ from Invalid to Clean Exclusive in the event of a CPU read miss on a Dirty Exclusive block

■ from Shared to Clean Exclusive in the event of a CPU read miss on a Dirty Exclusive block

Several transitions from the original protocol must change to accommodate the existence of the Clean Exclusive state. The following three transitions are those that change.

■ from Dirty Exclusive to Shared, the label changes to CPU read miss on a Shared block

■ from Invalid to Shared, the label changes to CPU miss on a Shared block

■ from Shared to Shared, the miss transition label changes to CPU read miss on a Shared block

5.23 An obvious complication introduced by providing a valid bit per word is the need to match not only the tag of the block but also the offset within the block when snooping the bus. This is easy, involving just looking at a few more bits. In addition, however, the cache must be changed to support write-back of partial cache blocks. When writing back a block, only those words that are valid should be written to memory because the contents of invalid words are not necessarily coherent with the system. Finally, given that the state machine of Figure 5.7 is applied at each cache block, there must be a way to allow this diagram to apply when state can be different from word to word within a block. The easiest way to do this would be to provide the state information of the figure for each word in the block. Doing so would require much more than one valid bit per word, though. Without replication of state information the only solution is to change the coherence protocol slightly.

5.24 a. The instruction execution component would be significantly sped up because out-of-order execution and multiple instruction issue allow the latency of this component to be overlapped. The cache access component would be similarly sped up due to overlap with other instructions, but since cache accesses take longer than functional unit latencies, they would need more instructions to be issued in parallel to overlap their entire latency. So the speedup for this component would be lower.

The memory access time component would also be improved, but the speedup here would be lower than in the previous two cases. Because the memory comprises local and remote memory accesses and possibly other cache-to-cache transfers, the latencies of these operations are likely to be very high (hundreds of processor cycles). The 64-entry instruction window in this example is not likely to allow enough instructions to overlap with such long latencies.

There is, however, one case when large latencies can be overlapped: when they are hidden under other long-latency operations. This leads to a technique called miss clustering that has been the subject of some compiler optimizations. The other-stall component would generally be improved because it mainly consists of resource stalls, branch mispredictions, and the like. The synchronization component, if any, will not be sped up much.

b. Memory stall time and instruction miss stall time dominate the execution for OLTP, more so than for the other benchmarks. Neither of these components is addressed well by out-of-order execution. Hence the OLTP workload has lower speedup compared to the other benchmarks with System B.

5.25 Because false sharing occurs when both the data object size is smaller than the granularity of cache block valid bit(s) coverage and more than one data object is stored in the same cache block frame in memory, there are two ways to prevent false sharing. Changing the cache block size or the amount of the cache block covered by a given valid bit are hardware changes and outside the scope of this exercise. However, the allocation of memory locations to data objects is a software issue.

The goal is to locate data objects so that only one truly shared object occurs per cache block frame in memory and that no non-shared objects are located in the same cache block frame as any shared object. If this is done, then even with just a single valid bit per cache block, false sharing is impossible. Note that shared, read-only-access objects could be combined in a single cache block and not contribute to the false sharing problem because such a cache block can be held by many caches and accessed as needed without any invalidations to cause unnecessary cache misses.

To the extent that shared data objects are explicitly identified in the program source code, the compiler should, with knowledge of memory hierarchy details, be able to avoid placing more than one such object in a cache block frame in memory. If shared objects are not declared, then programmer directives may need to be added to the program. The remainder of the cache block frame should not contain data that would cause false sharing misses. The sure solution is to pad the block with non-referenced locations.

Padding a cache block frame containing a shared data object with unused memory locations may lead to rather inefficient use of memory space. A cache block may contain a shared object plus objects that are read-only as a trade-off between memory use efficiency and incurring some false-sharing misses. This optimization almost certainly requires programmer analysis to determine if it would be worthwhile. Generally, careful attention to data distribution with respect to cache lines and partitioning the computation across processors is needed.
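The padding strategy can be illustrated with a small layout calculation. The Python sketch below assigns each shared object its own block-aligned offset so that no two objects fall in the same cache block frame, and reports how many bytes the padding wastes. The 64-byte block size, the object sizes, and the function names are illustrative assumptions, not values from the exercise.

```python
def padded_offsets(sizes, block_size=64):
    """Give every object (size > 0) its own block-aligned starting offset."""
    offsets = []
    next_free = 0
    for size in sizes:
        offsets.append(next_free)
        blocks_needed = -(-size // block_size)  # ceiling division
        next_free += blocks_needed * block_size
    return offsets


def blocks_of(offset, size, block_size=64):
    """Set of block frame indices touched by an object."""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    return set(range(first, last + 1))


def wasted_bytes(sizes, block_size=64):
    """Bytes of padding added by the one-object-per-frame layout."""
    total_blocks = sum(-(-size // block_size) for size in sizes)
    return total_blocks * block_size - sum(sizes)


if __name__ == "__main__":
    sizes = [8, 8, 100]  # hypothetical shared objects, in bytes
    offsets = padded_offsets(sizes)
    print(offsets, wasted_bytes(sizes))
```

With these example sizes the two 8-byte objects each occupy a full 64-byte frame, which shows the memory inefficiency that the last paragraph of the answer warns about.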

5.26 The problem illustrates the complexity of cache coherence protocols. In this case, this could mean that the processor P1 evicted that cache block from its cache and immediately requested the block in subsequent instructions. Given that the write-back message is longer than the request message, with networks that allow out-of-order requests, the new request can arrive before the write back arrives at the directory. One solution to this problem would be to have the directory wait for the write back and then respond to the request. Alternatively, the directory can send out a negative acknowledgment (NACK). Note that these solutions need to be thought out very carefully since they have potential to lead to deadlocks based on the particular implementation details of the system. Formal methods are often used to check for races and deadlocks.

5.27 If replacement hints are used, then the CPU replacing a block would send a hint to the home directory of the replaced block. Such a hint would lead the home directory to remove the CPU from the sharing list for the block. That would save an invalidate message when the block is to be written by some other CPU. Note that while the replacement hint might reduce the total protocol latency incurred when writing a block, it does not reduce the protocol traffic (hints consume as much bandwidth as invalidates).

5.28 a. Considering first the storage requirements for nodes that are caches under the directory subtree:

The directory at any level will have to allocate entries for all the cache blocks cached under that directory's subtree. In the worst case (none of the CPUs under the subtree share any blocks), the directory will have to store as many entries as the number of blocks of all the caches covered in the subtree.

That means that the root directory might have to allocate enough entries to reference all the blocks of all the caches. Every memory block cached in a directory will be represented by an entry <block address, k-bit vector>, where the k-bit vector has one bit per subtree indicating whether that subtree has a copy of the block.

For example, for a binary tree an entry <m, 11> means that block m is cached
