Chapter 2: Memory Hierarchy Design




Programmers want unlimited amounts of memory with low latency



Fast memory technology is more expensive per bit than slower memory



Solution: organize memory system into a hierarchy



Entire addressable memory space available in largest, slowest memory



Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor



Temporal and spatial locality ensure that nearly all references can be found in smaller memories



Gives the illusion of a large, fast memory being presented to the processor

Introduction



Memory hierarchy design becomes more crucial with recent multi-core processors:



Aggregate peak bandwidth grows with the number of cores:



Intel Core i7 can generate two references per core per clock



Four cores and 3.2 GHz clock



25.6 billion 64-bit data references/second +



12.8 billion 128-bit instruction references/second



= 409.6 GB/s!



DRAM bandwidth is only 8% of this (34.1 GB/s)



Requires:



Multi-port, pipelined caches



Two levels of cache per core



Shared third-level cache on chip
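The bandwidth arithmetic above can be reproduced with a short sketch. The function name and parameter names are hypothetical, and the one-instruction-reference-per-clock figure is inferred from the 12.8 billion references/second quoted in the slide rather than stated there.

```c
/* Peak demand bandwidth in bytes per second, summed over all cores.
 * Illustrative helper only; all names are assumptions for this sketch. */
double peak_demand_bytes_per_sec(int cores, double clock_hz,
                                 int data_refs_per_clock, int data_ref_bytes,
                                 int instr_refs_per_clock, int instr_ref_bytes)
{
    /* data side: cores * clock * references per clock * bytes per reference */
    double data  = cores * clock_hz * data_refs_per_clock  * data_ref_bytes;
    /* instruction side, computed the same way */
    double instr = cores * clock_hz * instr_refs_per_clock * instr_ref_bytes;
    return data + instr;
}
```

Calling it with 4 cores, 3.2e9 Hz, two 8-byte data references and one 16-byte instruction reference per clock gives 409.6e9 bytes/second, of which the 34.1 GB/s DRAM figure is roughly 8%.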




High-end microprocessors have >10 MB on-chip cache



Consumes large amount of area and power budget




When a word is not found in the cache, a miss occurs:



Fetch word from lower level in hierarchy, requiring a higher latency reference



Lower level may be another cache or the main memory



Also fetch the other words contained within the block



Takes advantage of spatial locality



Place block into cache in any location within its set, determined by address



block address MOD number of sets in cache
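The placement rule above can be sketched as follows. The struct and function names are hypothetical, and the sketch assumes byte addresses with a fixed block size.

```c
#include <stdint.h>

/* The three fields a cache derives from an address (illustrative names). */
typedef struct {
    uint64_t tag;     /* identifies the block among those mapping to the set */
    uint64_t set;     /* block address MOD number of sets */
    uint64_t offset;  /* byte position within the block */
} cache_addr;

cache_addr split_address(uint64_t addr, uint64_t block_bytes, uint64_t num_sets)
{
    cache_addr a;
    uint64_t block_addr = addr / block_bytes;  /* drop the block offset */
    a.offset = addr % block_bytes;
    a.set    = block_addr % num_sets;          /* the rule from the slide */
    a.tag    = block_addr / num_sets;          /* remaining upper bits */
    return a;
}
```

For example, with 64-byte blocks and 128 sets, address 0x12345 has block address 1165, so it lands in set 13 with tag 9 and offset 5.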




n blocks per set => n-way set associative



Direct-mapped cache => one block per set



Fully associative => one set



Writing to cache: two strategies



Write-through



Immediately update lower levels of hierarchy



Write-back



Only update lower levels of hierarchy when an updated block is replaced



Both strategies use write buffer to make writes asynchronous




Miss rate



Fraction of cache accesses that result in a miss



Causes of misses



Compulsory



First reference to a block



Capacity



Blocks discarded because the cache cannot hold every block the program needs, and later retrieved



Conflict



Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
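Conflict misses can be illustrated with a toy direct-mapped cache model. This is a minimal sketch: the cache geometry (8 sets, 16-byte blocks) and all names are assumptions chosen for illustration, not parameters from the slides.

```c
#include <stdint.h>
#include <string.h>

#define NUM_SETS    8    /* hypothetical tiny cache */
#define BLOCK_BYTES 16   /* hypothetical block size */

typedef struct {
    int      valid[NUM_SETS];
    uint64_t tag[NUM_SETS];
} dm_cache;

void dm_init(dm_cache *c) { memset(c, 0, sizeof *c); }

/* Returns 1 on a miss (and installs the block), 0 on a hit. */
int dm_access(dm_cache *c, uint64_t addr)
{
    uint64_t block = addr / BLOCK_BYTES;
    uint64_t set   = block % NUM_SETS;
    uint64_t tag   = block / NUM_SETS;
    if (c->valid[set] && c->tag[set] == tag)
        return 0;
    c->valid[set] = 1;      /* direct mapped: the old block is evicted */
    c->tag[set]   = tag;
    return 1;
}

/* Count the misses produced by a trace of n addresses. */
int dm_count_misses(dm_cache *c, const uint64_t *trace, int n)
{
    int misses = 0;
    for (int i = 0; i < n; i++)
        misses += dm_access(c, trace[i]);
    return misses;
}
```

In this model, alternating between addresses 0 and 128 misses on every access because both blocks map to set 0, while alternating between 0 and 16 incurs only the two compulsory misses.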




Speculative and multithreaded processors may execute other instructions during a miss



Reduces performance impact of misses




Six basic cache optimizations:



Larger block size

 Reduces compulsory misses

 Increases capacity and conflict misses, increases miss penalty



Larger total cache capacity to reduce miss rate

 Increases hit time, increases power consumption



Higher associativity

 Reduces conflict misses

 Increases hit time, increases power consumption



Higher number of cache levels

 Reduces overall memory access time



Giving priority to read misses over writes

 Reduces miss penalty



Avoiding address translation in cache indexing

 Reduces hit time
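The trade-offs listed above act on the three terms of the standard average memory access time relation (hit time + miss rate x miss penalty). The sketch below is illustrative only; the function names are hypothetical and any latencies or miss rates passed in are made-up inputs, not measurements from the slides.

```c
/* Average memory access time for a single cache level. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* Two cache levels: the L1 miss penalty is the access time of L2,
 * computed with L2's local miss rate and the main-memory penalty. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_local_miss_rate,
                      double mem_penalty)
{
    return amat(l1_hit, l1_miss_rate,
                amat(l2_hit, l2_local_miss_rate, mem_penalty));
}
```

With hypothetical values of a 1-cycle L1 hit, 5% L1 miss rate, and 100-cycle memory penalty, a single level gives 6 cycles; adding an L2 with a 10-cycle hit time and 20% local miss rate brings this to 2.5 cycles.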




Performance metrics



Latency is the concern of caches



Bandwidth is the concern of multiprocessors and I/O



Access time



Time between read request and when desired word arrives



Cycle time



Minimum time between unrelated requests to memory



SRAM memory has low latency, use for cache



Organize DRAM chips into many banks for high bandwidth, use for main memory

Memory Technology and Optimizations



SRAM



Requires low power to retain bit



Requires 6 transistors/bit



DRAM



Must be re-written after being read



Must also be periodically refreshed



Every ~8 ms (roughly 5% of the time)



All bits in a row can be refreshed simultaneously



One transistor/bit



Address lines are multiplexed:



Upper half of address: row access strobe (RAS)



Lower half of address: column access strobe (CAS)
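The address multiplexing above can be sketched as a split of the address into two halves sent over the same pins. The names are hypothetical, and the sketch assumes an even address width of at most 32 bits.

```c
#include <stdint.h>

typedef struct {
    uint32_t row;   /* upper half, latched with RAS */
    uint32_t col;   /* lower half, latched with CAS */
} dram_addr;

/* addr_bits is assumed even and no larger than 32 in this sketch. */
dram_addr split_row_col(uint32_t addr, unsigned addr_bits)
{
    unsigned half = addr_bits / 2;          /* at most 16 */
    uint32_t mask = (1u << half) - 1u;
    dram_addr d;
    d.row = (addr >> half) & mask;
    d.col = addr & mask;
    return d;
}
```

For a 20-bit address 0xABCDE, the row half is 0x2AF and the column half is 0x0DE.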




Amdahl:



Memory capacity should grow linearly with processor speed



Unfortunately, memory capacity and speed have not kept pace with processors



Some optimizations:



Multiple accesses to same row



Synchronous DRAM



Added clock to DRAM interface



Burst mode with critical word first



Wider interfaces



Double data rate (DDR)



Multiple banks on each DRAM device




DDR:



DDR2



Lower power (2.5 V -> 1.8 V)



Higher clock rates (266 MHz, 333 MHz, 400 MHz)



DDR3



1.5 V



800 MHz



DDR4



1-1.2 V



1333 MHz



GDDR5 is graphics memory based on DDR3
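To relate the clock rates above to peak transfer rates: a double-data-rate interface moves data on both clock edges. The sketch below assumes a 64-bit (8-byte) DIMM data bus, which the slides do not state; the function name is hypothetical.

```c
/* Peak transfer rate in bytes/second for a double-data-rate interface:
 * two transfers per clock, each bus_bytes wide. */
double ddr_peak_bytes_per_sec(double clock_hz, int bus_bytes)
{
    return clock_hz * 2.0 * bus_bytes;
}
```

Under that assumption, the 800 MHz DDR3 clock corresponds to 12.8 GB/s per DIMM and the 400 MHz DDR2 clock to 6.4 GB/s.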




Reducing power in SDRAMs:



Lower voltage



Low power mode (ignores clock, continues to refresh)



Graphics memory:



Achieve 2-5x the bandwidth per DRAM of DDR3



Wider interfaces (32 vs. 16 bit)



Higher clock rate

 Possible because they are attached via soldering instead of socketed DIMM modules


 Stacked DRAMs in same package as processor



High Bandwidth Memory (HBM)




Type of EEPROM



Types: NAND (denser) and NOR (faster)



NAND Flash:



Reads are sequential, reads entire page (0.5 to 4 KiB)



25 µs for first byte, 40 MiB/s for subsequent bytes



SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes



2 KiB transfer: 75 µs vs. 500 ns for SDRAM, 150X slower



300 to 500X faster than magnetic disk
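The 2 KiB comparison above follows from first-byte latency plus streaming time. A minimal sketch, with a hypothetical function name:

```c
/* Seconds to read `bytes`: first-byte latency plus time to stream the data. */
double transfer_seconds(double first_byte_s, double bytes_per_s, double bytes)
{
    return first_byte_s + bytes / bytes_per_s;
}
```

With the figures quoted above (25 µs and 40 MiB/s for NAND Flash, 40 ns and 4.8 GB/s for SDRAM), a 2048-byte read takes about 74 µs versus about 467 ns, which the slide rounds to 75 µs, 500 ns, and a 150X ratio.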




Must be erased (in blocks) before being overwritten



Nonvolatile, can use as little as zero power



Limited number of write cycles (~100,000)



$2/GiB, compared to $20-40/GiB for SDRAM and $0.09/GiB for magnetic disk



Phase-Change/Memristor Memory



Possibly 10X improvement in write performance and 2X improvement in read performance




Memory is susceptible to cosmic rays



Soft errors: dynamic errors



Detected and fixed by error correcting codes (ECC)



Hard errors: permanent errors



Use spare rows to replace defective rows



Chipkill: a RAID-like error recovery technique
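To make the idea of an error correcting code concrete, here is a minimal sketch of a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is a textbook-sized stand-in: the slides do not say which code is used, and real memory systems typically protect much wider words. Function names are hypothetical.

```c
#include <stdint.h>

/* Read the bit at 1-based position pos of v. */
static unsigned bit(unsigned v, unsigned pos) { return (v >> (pos - 1)) & 1u; }

/* Encode 4 data bits into 7 bits. Parity bits sit at positions 1, 2, 4;
 * data bits d0..d3 sit at positions 3, 5, 6, 7. */
uint8_t hamming74_encode(uint8_t data)
{
    unsigned d0 = data & 1u, d1 = (data >> 1) & 1u;
    unsigned d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    unsigned p1 = d0 ^ d1 ^ d3;   /* covers positions 1, 3, 5, 7 */
    unsigned p2 = d0 ^ d2 ^ d3;   /* covers positions 2, 3, 6, 7 */
    unsigned p4 = d1 ^ d2 ^ d3;   /* covers positions 4, 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode 7 bits back to 4 data bits, correcting up to one flipped bit. */
uint8_t hamming74_decode(uint8_t code)
{
    unsigned s1 = bit(code, 1) ^ bit(code, 3) ^ bit(code, 5) ^ bit(code, 7);
    unsigned s2 = bit(code, 2) ^ bit(code, 3) ^ bit(code, 6) ^ bit(code, 7);
    unsigned s4 = bit(code, 4) ^ bit(code, 5) ^ bit(code, 6) ^ bit(code, 7);
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);
    if (syndrome != 0)            /* syndrome is the position of the bad bit */
        code = (uint8_t)(code ^ (1u << (syndrome - 1)));
    return (uint8_t)(bit(code, 3) | (bit(code, 5) << 1) |
                     (bit(code, 6) << 2) | (bit(code, 7) << 3));
}
```

Flipping any one bit of an encoded value, as a soft error would, still decodes to the original 4 data bits.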




Reduce hit time



Small and simple first-level caches



Way prediction



Increase bandwidth



Pipelined caches, multibanked caches, non-blocking caches



Reduce miss penalty



Critical word first, merging write buffers



Reduce miss rate



Compiler optimizations



Reduce miss penalty or miss rate via parallelization



Hardware or compiler prefetching

Advanced Optimizations

Access time vs. size and associativity


Energy per read vs. size and associativity




To improve hit time, predict the way to pre-set mux



Mis-prediction gives longer hit time



Prediction accuracy



> 90% for two-way



> 80% for four-way



I-cache has better accuracy than D-cache



First used on MIPS R10000 in mid-90s



Used on ARM Cortex-A8



Extend to predict block as well



“Way selection”



Increases mis-prediction penalty




Pipeline cache access to improve bandwidth



Examples:



Pentium: 1 cycle



Pentium Pro – Pentium III: 2 cycles



Pentium 4 – Core i7: 4 cycles



Increases branch mis-prediction penalty



Makes it easier to increase associativity




Organize cache as independent banks to support simultaneous access



ARM Cortex-A8 supports 1-4 banks for L2



Intel i7 supports 4 banks for L1 and 8 banks for L2



Interleave banks according to block address
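Interleaving by block address can be sketched as a modulo over the number of banks, so that consecutive blocks fall in different banks. The function name is hypothetical, and this simple sequential mapping is an assumption; real designs may hash the address differently.

```c
#include <stdint.h>

/* Sequential interleaving: block i goes to bank i MOD num_banks. */
unsigned bank_of(uint64_t addr, unsigned block_bytes, unsigned num_banks)
{
    return (unsigned)((addr / block_bytes) % num_banks);
}
```

With 64-byte blocks and 4 banks, addresses 0, 64, 128, and 192 map to banks 0 through 3 and can be accessed simultaneously, while address 256 wraps back to bank 0.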




Allow hits before previous misses complete



“Hit under miss”



“Hit under multiple miss”



L2 must support this



In general, processors can hide L1 miss penalty but not L2 miss penalty




Critical word first



Request missed word from memory first



Send it to the processor as soon as it arrives



Early restart



Request words in normal order



Send missed word to the processor as soon as it arrives



Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
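The difference between the two strategies is the order in which words are requested. In the sketch below, the wrap-around order after the critical word is an assumption (the slides only say the missed word is requested first), and all names are hypothetical.

```c
/* Critical word first: fill order[] with word indices in request order,
 * starting at the missed word and wrapping around the block (assumed). */
void critical_word_first_order(int words_per_block, int missed_word, int *order)
{
    for (int i = 0; i < words_per_block; i++)
        order[i] = (missed_word + i) % words_per_block;
}

/* Early restart: words arrive in normal order 0, 1, 2, ..., so the
 * processor resumes once this many words have arrived. */
int early_restart_words_before_resume(int missed_word)
{
    return missed_word + 1;
}
```

For an 8-word block with a miss on word 5, critical word first delivers the needed word after one transfer, while early restart must wait for six.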




When storing to a block that is already pending in the write buffer, update write buffer



Reduces stalls due to full write buffer



Do not apply to I/O addresses
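Write merging can be sketched as a small buffer whose entries each cover one block. The buffer depth, entry width, and every name below are assumptions for illustration; draining entries to memory is omitted.

```c
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES      4   /* hypothetical buffer depth */
#define WORDS_PER_ENTRY 4   /* hypothetical: four 8-byte words per entry */

typedef struct {
    int      used;
    uint64_t block_addr;                /* addr / (8 * WORDS_PER_ENTRY) */
    int      valid[WORDS_PER_ENTRY];
    uint64_t data[WORDS_PER_ENTRY];
} wb_entry;

typedef struct { wb_entry e[WB_ENTRIES]; } write_buffer;

void wb_init(write_buffer *wb) { memset(wb, 0, sizeof *wb); }

/* Returns 1 if the store was accepted, 0 if the buffer is full (a stall). */
int wb_store(write_buffer *wb, uint64_t addr, uint64_t value)
{
    uint64_t block = addr / (8 * WORDS_PER_ENTRY);
    int word = (int)((addr / 8) % WORDS_PER_ENTRY);
    int free_slot = -1;
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb->e[i].used && wb->e[i].block_addr == block) {
            wb->e[i].valid[word] = 1;   /* merge into the pending entry */
            wb->e[i].data[word]  = value;
            return 1;
        }
        if (!wb->e[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return 0;                       /* no match and no free entry */
    wb_entry *n = &wb->e[free_slot];
    memset(n, 0, sizeof *n);
    n->used = 1;
    n->block_addr = block;
    n->valid[word] = 1;
    n->data[word]  = value;
    return 1;
}

/* Number of entries currently occupied. */
int wb_entries_used(const write_buffer *wb)
{
    int n = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        n += wb->e[i].used;
    return n;
}
```

In this model, four stores to consecutive words of one block occupy a single entry instead of four, which is how merging reduces stalls due to a full buffer.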

Figure panels: no write buffering vs. write buffering



Loop Interchange



Swap nested loops to access memory in sequential order



Blocking



Instead of accessing entire rows or columns, subdivide matrices into blocks



Requires more memory accesses but improves locality of accesses
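Loop interchange can be illustrated with a pair of equivalent loops over a row-major C array. The array dimensions and function names are hypothetical, chosen only so that the stride difference is visible.

```c
#define ROWS 500   /* hypothetical dimensions */
#define COLS 100

/* Before: the inner loop varies the row index, so successive accesses
 * are COLS elements apart in memory. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop varies the column index, so
 * successive accesses touch consecutive addresses. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions compute the same result; only the order of memory accesses changes, which is what improves spatial locality.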


/* Before blocking: each x[i][j] is a full dot product of row i of y and column j of z */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After blocking with factor B: x must be zero-initialized, since partial sums accumulate */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }




Fetch two blocks on miss (include next sequential block)


Pentium 4 Pre-fetching




Insert prefetch instructions before data is needed



Non-faulting: prefetch doesn’t cause exceptions



Register prefetch



Loads data into register



Cache prefetch



Loads data into cache



Combine with loop unrolling and software pipelining
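A cache prefetch inserted by hand might look like the sketch below. It relies on the `__builtin_prefetch` builtin provided by GCC and Clang, so it is not portable C; the prefetch distance and function name are assumptions and would need tuning on a real machine.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical: elements to run ahead */

/* Sum an array while requesting data a fixed distance ahead of use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Hint only: 0 = prefetch for read, 3 = keep in all cache levels.
         * The builtin does not fault, matching the non-faulting property. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

The prefetch changes only timing, never the result, so the function returns the same sum as a plain loop.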




128 MiB to 1 GiB



Smaller blocks require substantial tag storage



Larger blocks are potentially inefficient



One approach (L-H):



Each SDRAM row is a block index



Each row contains set of tags and 29 data segments



29-way set associative



Hit requires a CAS




Another approach (Alloy cache):



Mold tag and data together



Use direct mapped



Both schemes require two DRAM accesses for misses



Two solutions:



Use map to keep track of blocks



Predict likely misses




Protection via virtual memory



Keeps processes in their own memory space



Role of architecture



Provide user mode and supervisor mode



Protect certain aspects of CPU state



Provide mechanisms for switching between user mode and supervisor mode



Provide mechanisms to limit memory accesses



Provide TLB to translate addresses

Virtual Memory and Virtual Machines



Supports isolation and security



Sharing a computer among many unrelated users



Enabled by raw speed of processors, making the overhead more acceptable



Allows different ISAs and operating systems to be presented to user programs



“System Virtual Machines”



SVM software is called “virtual machine monitor” or “hypervisor”



Individual virtual machines run under the monitor are called “guest VMs”


 Guest software should:



Behave as if running on native hardware



Not be able to change allocation of real system resources

 VMM should be able to “context switch” guests

 Hardware must allow:



System and user processor modes



Privileged subset of instructions for allocating system resources




Each guest OS maintains its own set of page tables



VMM adds a level of memory between physical and virtual memory called “real memory”



VMM maintains shadow page table that maps guest virtual addresses to physical addresses



Requires VMM to detect guest’s changes to its own page table



Occurs naturally if accessing the page table pointer is a privileged operation
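The shadow page table is the composition of two mappings: the guest's (virtual to real) and the VMM's (real to physical). The sketch below uses flat arrays over a tiny hypothetical address space; real page tables are multi-level and carry permission bits, and all names here are assumptions.

```c
#include <stdint.h>

#define PAGES    8            /* hypothetical tiny address spaces */
#define UNMAPPED UINT32_MAX   /* marker for "no translation" */

/* guest_pt: guest virtual page -> real page     (kept by the guest OS)
 * vmm_map:  real page          -> physical page (kept by the VMM)
 * shadow:   guest virtual page -> physical page (used by the hardware) */
void rebuild_shadow(const uint32_t guest_pt[PAGES],
                    const uint32_t vmm_map[PAGES],
                    uint32_t shadow[PAGES])
{
    for (int v = 0; v < PAGES; v++) {
        uint32_t real = guest_pt[v];
        if (real == UNMAPPED || real >= PAGES)
            shadow[v] = UNMAPPED;        /* guest has no mapping here */
        else
            shadow[v] = vmm_map[real];   /* compose the two translations */
    }
}
```

In this model the VMM would call `rebuild_shadow` (or update a single entry) whenever it detects that the guest changed its own page table.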




Objectives:



Avoid flushing TLB



Use nested page tables instead of shadow page tables



Allow devices to use DMA to move data



Allow guest OSes to handle device interrupts



For security: allow programs to manage encrypted portions of code and data
