For pipelined computers that allow out-of-order execution (discussed inChapter 3), the processor need not stall on a data cache miss. For example, the processor could
0
Figure 2.10 Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get byte addressing.
100 ■ Chapter Two Memory Hierarchy Design
continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A nonblocking cache or lockup-free cache esca-lates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This “hit under miss” optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor. A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a “hit under multiple miss” or “miss under miss” optimization. The second option is beneficial only if the memory system can service multiple misses; most high-performance pro-cessors (such as the Intel Core propro-cessors) usually support both, whereas many lower-end processors provide only limited nonblocking support in L2.
To examine the effectiveness of nonblocking caches in reducing the cache miss penalty, Farkas and Jouppi (1994) did a study assuming 8 KiB caches with a 14-cycle miss penalty (appropriate for the early 1990s). They observed a reduction in the effective miss penalty of 20% for the SPECINT92 benchmarks and 30% for the SPECFP92 benchmarks when allowing one hit under miss.
Li et al. (2011) updated this study to use a multilevel cache, more modern assumptions about miss penalties, and the larger and more demanding SPECCPU2006 benchmarks. The study was done assuming a model based on a single core of an Intel i7 (seeSection 2.6) running the SPECCPU2006 benchmarks.
Figure 2.11shows the reduction in data cache access latency when allowing 1, 2, and 64 hits under a miss; the caption describes further details of the memory system. The larger caches and the addition of an L3 cache since the earlier study have reduced the benefits with the SPECINT2006 benchmarks showing an average reduction in cache latency of about 9% and the SPECFP2006 bench-marks about 12.5%.
Example Which is more important for floating-point programs: two-way set associativity or hit under one miss for the primary data caches? What about integer programs?
Assume the following average miss rates for 32 KiB data caches: 5.2% for floating-point programs with a direct-mapped cache, 4.9% for the programs with a two-way set associative cache, 3.5% for integer programs with a direct-mapped cache, and 3.2% for integer programs with a two-way set associative cache. Assume the miss penalty to L2 is 10 cycles, and the L2 misses and penalties are the same.
Answer For floating-point programs, the average memory stall times are Miss rateDM$ Miss penalty ¼ 5:2% $ 10 ¼ 0:52 Miss rate2-way$ Miss penalty ¼ 4:9% $ 10 ¼ 0:49
The cache access latency (including stalls) for two-way associativity is 0.49/0.52 or 94% of direct-mapped cache. Figure 2.11 caption indicates that a hit under one miss reduces the average data cache access latency for floating-point programs to 87.5% of a blocking cache. Therefore, for floating-point programs, the 2.3 Ten Advanced Optimizations of Cache Performance ■ 101
direct-mapped data cache supporting one hit under one miss gives better perfor-mance than a two-way set-associative cache that blocks on a miss.
For integer programs, the calculation is
Miss rateDM$ Miss penalty ¼ 3:5% $ 10 ¼ 0:35 Miss rate2-way$ Miss penalty ¼ 3:2% $ 10 ¼ 0:32
The data cache access latency of a two-way set associative cache is thus 0.32/0.35 or 91% of direct-mapped cache, while the reduction in access latency when allow-ing a hit under one miss is 9%, makallow-ing the two choices about equal.
The real difficulty with performance evaluation of nonblocking caches is that a cache miss does not necessarily stall the processor. In this case, it is difficult to judge the impact of any single miss and thus to calculate the average memory access time. The effective miss penalty is not the sum of the misses but the nonoverlapped time that the processor is stalled. The benefit of nonblocking caches is complex, as it depends upon the miss penalty when there are multiple misses, the memory reference pattern, and how many instructions the processor can execute with a miss outstanding.
In general, out-of-order processors are capable of hiding much of the miss penalty of an L1 data cache miss that hits in the L2 cache but are not capable
40%
50%
60%
70%
80%
90%
100%
bzip2 gcc mcf hmmer sjeng libquantum h264ref omnetpp astar gamess zeusmp milc gromacs cactusADM namd soplex povray calculix GemsFDTD tonto lbm wrf sphinx3
SPECFP SPECINT
Cache access latency
Hit-under-2-misses Hit-under-64-misses Hit-under-1-miss
Figure 2.11 The effectiveness of a nonblocking cache is evaluated by allowing 1, 2, or 64 hits under a cache miss with 9 SPECINT (on the left) and 9 SPECFP (on the right) benchmarks. The data memory system modeled after the Intel i7 consists of a 32 KiB L1 cache with a four-cycle access latency. The L2 cache (shared with instructions) is 256 KiB with a 10-clock cycle access latency. The L3 is 2 MiB and a 36-cycle access latency. All the caches are eight-way set associative and have a 64-byte block size. Allowing one hit under miss reduces the miss penalty by 9% for the integer benchmarks and 12.5%
for the floating point. Allowing a second hit improves these results to 10% and 16%, and allowing 64 results in little additional improvement.
102 ■ Chapter Two Memory Hierarchy Design
of hiding a significant fraction of a lower-level cache miss. Deciding how many outstanding misses to support depends on a variety of factors:
■ The temporal and spatial locality in the miss stream, which determines whether a miss can initiate a new access to a lower-level cache or to memory.
■ The bandwidth of the responding memory or cache.
■ To allow more outstanding misses at the lowest level of the cache (where the miss time is the longest) requires supporting at least that many misses at a higher level, because the miss must initiate at the highest level cache.
■ The latency of the memory system.
The following simplified example illustrates the key idea.
Example Assume a main memory access time of 36 ns and a memory system capable of a sustained transfer rate of 16 GiB/s. If the block size is 64 bytes, what is the maximum number of outstanding misses we need to support assuming that we can maintain the peak bandwidth given the request stream and that accesses never conflict. If the prob-ability of a reference colliding with one of the previous four is 50%, and we assume that the access has to wait until the earlier access completes, estimate the number of maximum outstanding references. For simplicity, ignore the time between misses.
Answer In the first case, assuming that we can maintain the peak bandwidth, the memory system can support (16 $10)9/64 ¼250 million references per second. Because each reference takes 36 ns, we can support 250 $106$ 36 $ 10%9¼ 9 references.
If the probability of a collision is greater than 0, then we need more outstanding ref-erences, because we cannot start work on those colliding references; the memory system needs more independent references, not fewer! To approximate, we can sim-ply assume that half the memory references do not have to be issued to the memory.
This means that we must support twice as many outstanding references, or 18.
In Li, Chen, Brockman, and Jouppi’s study, they found that the reduction in CPI for the integer programs was about 7% for one hit under miss and about 12.7% for 64. For the floating-point programs, the reductions were 12.7% for one hit under miss and 17.8% for 64. These reductions track fairly closely the reductions in the data cache access latency shown inFigure 2.11.
Implementing a Nonblocking Cache
Although nonblocking caches have the potential to improve performance, they are nontrivial to implement. Two initial types of challenges arise: arbitrating conten-tion between hits and misses, and tracking outstanding misses so that we know when loads or stores can proceed. Consider the first problem. In a blocking cache, misses cause the processor to stall and no further accesses to the cache will occur 2.3 Ten Advanced Optimizations of Cache Performance ■ 103
until the miss is handled. In a nonblocking cache, however, hits can collide with misses returning from the next level of the memory hierarchy. If we allow multiple outstanding misses, which almost all recent processors do, it is even possible for misses to collide. These collisions must be resolved, usually by first giving priority to hits over misses, and second by ordering colliding misses (if they can occur).
The second problem arises because we need to track multiple outstanding mis-ses. In a blocking cache, we always know which miss is returning, because only one can be outstanding. In a nonblocking cache, this is rarely true. At first glance, you might think that misses always return in order, so that a simple queue could be kept to match a returning miss with the longest outstanding request. Consider, however, a miss that occurs in L1. It may generate either a hit or miss in L2; if L2 is also nonblocking, then the order in which misses are returned to L1 will not necessarily be the same as the order in which they originally occurred. Multi-core and other multiprocessor systems that have nonuniform cache access times also introduce this complication.
When a miss returns, the processor must know which load or store caused the miss, so that instruction can now go forward; and it must know where in the cache the data should be placed (as well as the setting of tags for that block). In recent processors, this information is kept in a set of registers, typically called the Miss Status Handling Registers (MSHRs). If we allow n outstanding misses, there will be n MSHRs, each holding the information about where a miss goes in the cache and the value of any tag bits for that miss, as well as the information indicating which load or store caused the miss (in the next chapter, you will see how this is tracked). Thus, when a miss occurs, we allocate an MSHR for handling that miss, enter the appropriate information about the miss, and tag the memory request with the index of the MSHR. The memory system uses that tag when it returns the data, allowing the cache system to transfer the data and tag information to the appropri-ate cache block and “notify” the load or store that generappropri-ated the miss that the data is now available and that it can resume operation. Nonblocking caches clearly require extra logic and thus have some cost in energy. It is difficult, however, to assess their energy costs exactly because they may reduce stall time, thereby decreasing execution time and resulting energy consumption.
In addition to the preceding issues, multiprocessor memory systems, whether within a single chip or on multiple chips, must also deal with complex implemen-tation issues related to memory coherency and consistency. Also, because cache mis-ses are no longer atomic (because the request and response are split and may be interleaved among multiple requests), there are possibilities for deadlock. For the interested reader, Section I.7 in online Appendix I deals with these issues in detail.