Ideally one would desire an indefinitely large memory capacity such that any particular… word would be immediately available…
We are… forced to recognize the possibility of constructing a hierarchy of memories each of which has greater capacity than the preceding but which is less quickly accessible.
A. W. Burks, H. H. Goldstine, and J. von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946).
Computer Architecture.https://doi.org/10.1016/B978-0-12-811905-1.00002-X
© 2019 Elsevier Inc. All rights reserved.
2.1 Introduction
Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierar-chy, which takes advantage of locality and trade-offs in the cost-performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). This principle plus the guideline that for a given implementation technology and power budget, smaller hardware can be made faster led to hierarchies based on memories of different speeds and sizes.Figure 2.1shows several different multilevel memory hierarchies, including typical sizes and speeds of access. As Flash and next generation memory technol-ogies continue to close the gap with disks in cost per bit, such technoltechnol-ogies are likely to increasingly replace magnetic disks for secondary storage. AsFigure 2.1shows, these technologies are already used in many personal computers and increasingly in servers, where the advantages in performance, power, and density are significant.
Because fast memory is more expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level, which is farther from the processor. The goal is to provide a memory system with a cost per byte that is almost as low as the cheapest level of memory and a speed almost as fast as the fastest level. In most cases (but not all), the data contained in a lower level are a superset of the next higher level. This property, called the inclusion property, is always required for the lowest level of the hierar-chy, which consists of main memory in the case of caches and secondary storage (disk or Flash) in the case of virtual memory.
The importance of the memory hierarchy has increased with advances in per-formance of processors.Figure 2.2plots single processor performance projections against the historical performance improvement in time to access main memory.
The processor line shows the increase in memory requests per second on average (i.e., the inverse of the latency between memory references), while the memory line shows the increase in DRAM accesses per second (i.e., the inverse of the DRAM access latency), assuming a single DRAM and a single memory bank. The reality is more complex because the processor request rate is not uniform, and the memory system typically has multiple banks of DRAMs and channels. Although the gap in access time increased significantly for many years, the lack of significant perfor-mance improvement in single processors has led to a slowdown in the growth of the gap between processors and DRAM.
Because high-end processors have multiple cores, the bandwidth requirements are greater than for single cores. Although single-core bandwidth has grown more slowly in recent years, the gap between CPU memory demand and DRAM band-width continues to grow as the numbers of cores grow. A modern high-end desktop processor such as the Intel Core i7 6700 can generate two data memory references per core each clock cycle. With four cores and a 4.2 GHz clock rate, the i7 can generate a peak of 32.8 billion 64-bit data memory references per second, in addi-tion to a peak instrucaddi-tion demand of about 12.8 billion 128-bit instrucaddi-tion 78 ■ Chapter Two Memory Hierarchy Design
Size:
Memory hierarchy for a laptop or a desktop Laptop
Memory hierarchy for a personal mobile device
256 KB
Figure 2.1 The levels in a typical memory hierarchy in a personal mobile device (PMD), such as a cell phone or tablet (A), in a laptop or desktop computer (B), and in a server (C). As we move farther away from the processor, the memory in the level below becomes slower and larger. Note that the time units change by a factor of 109from pico-seconds to millipico-seconds in the case of magnetic disks and that the size units change by a factor of 1010from thou-sands of bytes to tens of terabytes. If we were to add warehouse-sized computers, as opposed to just servers, the capacity scale would increase by three to six orders of magnitude. Solid-state drives (SSDs) composed of Flash are used exclusively in PMDs, and heavily in both laptops and desktops. In many desktops, the primary storage system is SSD, and expansion disks are primarily hard disk drives (HDDs). Likewise, many servers mix SSDs and HDDs.
2.1 Introduction ■ 79
references; this is a total peak demand bandwidth of 409.6 GiB/s! This incredible bandwidth is achieved by multiporting and pipelining the caches; by using three levels of caches, with two private levels per core and a shared L3; and by using a separate instruction and data cache at the first level. In contrast, the peak band-width for DRAM main memory, using two memory channels, is only 8% of the demand bandwidth (34.1 GiB/s). Upcoming versions are expected to have an L4 DRAM cache using embedded or stacked DRAM (seeSections 2.2and2.3).
Traditionally, designers of memory hierarchies focused on optimizing average memory access time, which is determined by the cache access time, miss rate, and miss penalty. More recently, however, power has become a major consideration. In high-end microprocessors, there may be 60 MiB or more of on-chip cache, and a large second- or third-level cache will consume significant power both as leakage when not operating (called static power) and as active power, as when performing a read or write (called dynamic power), as described inSection 2.3. The problem is even more acute in processors in PMDs where the CPU is less aggressive and the power budget may be 20 to 50 times smaller. In such cases, the caches can account for 25% to 50% of the total power consumption. Thus more designs must consider both performance and power trade-offs, and we will examine both in this chapter.
1 100 10 1000
Performance
10,000 100,000
2010 2005
1980 1995 2000
Year Processor
Memory
1990
1985 2015
Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access, is plotted over time.
In mid-2017, AMD, Intel and Nvidia all announced chip sets using versions of HBM technology. Note that the vertical axis must be on a logarithmic scale to record the size of the processor-DRAM performance gap. The memory baseline is 64 KiB DRAM in 1980, with a 1.07 per year performance improvement in latency (seeFigure 2.4on page 88).
The processor line assumes a 1.25 improvement per year until 1986, a 1.52 ment until 2000, a 1.20 improvement between 2000 and 2005, and only small improve-ments in processor performance (on a per-core basis) between 2005 and 2015. As you can see, until 2010 memory access times in DRAM improved slowly but consistently;
since 2010 the improvement in access time has reduced, as compared with the earlier periods, although there have been continued improvements in bandwidth. See Figure 1.1 inChapter 1for more information.
80 ■ Chapter Two Memory Hierarchy Design