Chapter 2
Memory Hierarchy Design
Programmers want unlimited amounts of memory with low latency
Fast memory technology is more expensive per bit than slower memory
Solution: organize memory system into a hierarchy
Entire addressable memory space available in largest, slowest memory
Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
Gives the illusion of a large, fast memory being presented to the processor
Memory hierarchy design becomes more crucial with recent multi-core processors:
Aggregate peak bandwidth grows with the number of cores:
Intel Core i7 can generate two references per core per clock
Four cores and 3.2 GHz clock
25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s
DRAM bandwidth is only 8% of this (34.1 GB/s)
Requires:
Multi-port, pipelined caches
Two levels of cache per core
Shared third-level cache on chip
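The bandwidth arithmetic above can be checked with a short sketch. The function and parameter names are illustrative, not from the slides. The one-instruction-reference-per-clock figure is inferred from 12.8 billion references/second over four cores at 3.2 GHz.

```c
/* Peak demand in GB/s (1 GB = 1e9 bytes) for one stream of references.
   refs_per_clock is per core; all names here are illustrative. */
double peak_demand_gbs(int cores, double clock_ghz,
                       double refs_per_clock, int bytes_per_ref)
{
    return cores * clock_ghz * refs_per_clock * bytes_per_ref;
}

/* Combined data + instruction demand for the four-core, 3.2 GHz example. */
double total_demand_gbs(void)
{
    double data = peak_demand_gbs(4, 3.2, 2.0, 8);   /* 25.6 G refs/s x 8 bytes  */
    double inst = peak_demand_gbs(4, 3.2, 1.0, 16);  /* 12.8 G refs/s x 16 bytes */
    return data + inst;
}
```

Each stream contributes 204.8 GB/s, so the total is 409.6 GB/s, and 34.1 GB/s of DRAM bandwidth is roughly 8% of that.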
High-end microprocessors have >10 MB on-chip cache
Consumes large amount of area and power budget
When a word is not found in the cache, a miss occurs:
Fetch word from lower level in hierarchy, requiring a higher latency reference
Lower level may be another cache or the main memory
Also fetch the other words contained within the block
Takes advantage of spatial locality
Place block into cache in any location within its set, determined by address
block address MOD number of sets in cache
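A minimal sketch of the set-selection rule, assuming byte addresses and a fixed block size. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* The three fields a cache derives from a byte address. */
typedef struct {
    uint64_t tag;     /* identifies the block within its set */
    uint64_t set;     /* block address MOD number of sets    */
    uint64_t offset;  /* byte position inside the block      */
} cache_addr;

cache_addr split_address(uint64_t addr, uint64_t block_bytes, uint64_t num_sets)
{
    cache_addr a;
    uint64_t block_addr = addr / block_bytes;  /* drop the block offset */
    a.offset = addr % block_bytes;
    a.set    = block_addr % num_sets;
    a.tag    = block_addr / num_sets;
    return a;
}
```

With 64-byte blocks and 128 sets, address 0x12345 has block address 1165, so it maps to set 13 with tag 9 and offset 5.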
n blocks per set => n-way set associative
Direct-mapped cache => one block per set
Fully associative => one set
Writing to cache: two strategies
Write-through
Immediately update lower levels of hierarchy
Write-back
Only update lower levels of hierarchy when an updated block is replaced
Both strategies use write buffer to make writes asynchronous
Miss rate
Fraction of cache accesses that result in a miss
Causes of misses
Compulsory
First reference to a block
Capacity
Blocks discarded because the cache cannot hold every block the program needs, then later retrieved
Conflict
Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
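Miss rate is usually folded into the standard average memory access time formula: hit time + miss rate × miss penalty. That formula does not appear in the text above, so this is a sketch for context, and the parameter names are illustrative.

```c
/* Average memory access time in cycles.
   miss_rate is a fraction in [0, 1]; hit_time and miss_penalty are in cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

For a hypothetical cache with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, `amat(1.0, 0.05, 100.0)` is 6 cycles.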
Speculative and multithreaded processors may execute other instructions during a miss
Reduces performance impact of misses
Six basic cache optimizations:
Larger block size
Reduces compulsory misses
Increases capacity and conflict misses, increases miss penalty
Larger total cache capacity to reduce miss rate
Increases hit time, increases power consumption
Higher associativity
Reduces conflict misses
Increases hit time, increases power consumption
Higher number of cache levels
Reduces overall memory access time
Giving priority to read misses over writes
Reduces miss penalty
Avoiding address translation in cache indexing
Reduces hit time
Performance metrics
Latency is the main concern for caches
Bandwidth is the main concern for multiprocessors and I/O
Access time
Time between read request and when desired word arrives
Cycle time
Minimum time between unrelated requests to memory
SRAM has low latency, so it is used for caches
DRAM chips are organized into many banks for high bandwidth and used for main memory
Memory Technology and Optimizations
SRAM
Requires low power to retain bit
Requires 6 transistors/bit
DRAM
Must be re-written after being read
Must also be periodically refreshed
Every ~ 8 ms (roughly 5% of time)
All bits in a row can be refreshed simultaneously
One transistor/bit
Address lines are multiplexed:
Upper half of address: row access strobe (RAS)
Lower half of address: column access strobe (CAS)
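A minimal sketch of the multiplexed addressing, assuming an even number of address bits split into equal row and column halves. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Row and column halves of a DRAM address, sent in two steps
   over the same address pins. */
typedef struct {
    uint32_t row;  /* upper half, latched with RAS */
    uint32_t col;  /* lower half, latched with CAS */
} dram_addr;

dram_addr split_dram_address(uint32_t addr, unsigned addr_bits)
{
    /* Sketch assumption: an even bit count of at most 30,
       so each half fits in 15 bits or fewer. */
    assert(addr_bits % 2 == 0 && addr_bits >= 2 && addr_bits <= 30);

    unsigned half = addr_bits / 2;
    uint32_t mask = (1u << half) - 1u;
    dram_addr d;
    d.row = (addr >> half) & mask;
    d.col = addr & mask;
    return d;
}
```

With a 16-bit address, 0x1234 splits into row 0x12 and column 0x34.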
Amdahl:
Memory capacity should grow linearly with processor speed
Unfortunately, memory capacity and speed have not kept pace with processors
Some optimizations:
Multiple accesses to same row
Synchronous DRAM
Added clock to DRAM interface
Burst mode with critical word first
Wider interfaces
Double data rate (DDR)
Multiple banks on each DRAM device
DDR:
DDR2
Lower power (2.5 V -> 1.8 V)
Higher clock rates (266 MHz, 333 MHz, 400 MHz)
DDR3
1.5 V
800 MHz
DDR4
1-1.2 V
1333 MHz
GDDR5 is graphics memory based on DDR3
Reducing power in SDRAMs:
Lower voltage
Low power mode (ignores clock, continues to refresh)
Graphics memory:
Achieve 2-5 X bandwidth per DRAM vs. DDR3
Wider interfaces (32 vs. 16 bit)
Higher clock rate
Possible because they are soldered to the board instead of plugged in as socketed DIMMs
Stacked DRAMs in same package as processor
High Bandwidth Memory (HBM)
Flash memory: a type of EEPROM
Types: NAND (denser) and NOR (faster)
NAND Flash:
Reads are sequential, reads entire page (.5 to 4 KiB)
25 us for first byte, 40 MiB/s for subsequent bytes
SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes
2 KiB transfer: 75 us vs. 500 ns for SDRAM, 150X slower
300 to 500X faster than magnetic disk
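The 2 KiB comparison can be reproduced with a short sketch. As a simplification, it charges the first-byte latency and then streams all 2048 bytes at the quoted rate. The function names are illustrative.

```c
/* Time in nanoseconds to transfer `bytes` bytes: a fixed first-byte
   latency plus streaming time at the given bandwidth in bytes/second. */
double transfer_time_ns(double first_byte_ns, double bytes, double bytes_per_sec)
{
    return first_byte_ns + bytes / bytes_per_sec * 1e9;
}

/* NAND Flash: 25 us to first byte, then 40 MiB/s. */
double flash_2kib_ns(void)
{
    return transfer_time_ns(25e3, 2048.0, 40.0 * 1048576.0);
}

/* SDRAM: 40 ns to first byte, then 4.8 GB/s. */
double sdram_2kib_ns(void)
{
    return transfer_time_ns(40.0, 2048.0, 4.8e9);
}
```

This gives about 73.8 us for Flash and about 467 ns for SDRAM, a ratio near 158. The slide rounds these to 75 us, 500 ns, and 150X.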
Must be erased (in blocks) before being overwritten
Nonvolatile, can use as little as zero power
Limited number of write cycles (~100,000)
$2/GiB, compared to $20-40/GiB for SDRAM and $0.09/GiB for magnetic disk
Phase-Change/Memristor Memory
Possibly 10X improvement in write performance and 2X improvement in read performance
Memory is susceptible to cosmic rays
Soft errors: dynamic errors
Detected and fixed by error correcting codes (ECC)
Hard errors: permanent errors
Use spare rows to replace defective rows
Chipkill: a RAID-like error recovery technique
Reduce hit time
Small and simple first-level caches
Way prediction
Increase bandwidth
Pipelined caches, multibanked caches, non-blocking caches
Reduce miss penalty
Critical word first, merging write buffers
Reduce miss rate
Compiler optimizations
Reduce miss penalty or miss rate via parallelization
Hardware or compiler prefetching
Advanced Optimizations
Access time vs. size and associativity
Energy per read vs. size and associativity
To improve hit time, predict the way to pre-set mux
Mis-prediction gives longer hit time
Prediction accuracy
> 90% for two-way
> 80% for four-way
I-cache has better accuracy than D-cache
First used on MIPS R10000 in mid-90s
Used on ARM Cortex-A8
Extend to predict block as well
“Way selection”
Increases mis-prediction penalty
Pipeline cache access to improve bandwidth
Examples:
Pentium: 1 cycle
Pentium Pro – Pentium III: 2 cycles
Pentium 4 – Core i7: 4 cycles
Increases branch mis-prediction penalty
Makes it easier to increase associativity
Organize cache as independent banks to support simultaneous access
ARM Cortex-A8 supports 1-4 banks for L2
Intel i7 supports 4 banks for L1 and 8 banks for L2
Interleave banks according to block address
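Interleaving by block address can be sketched as a modulo over the block address, assuming simple sequential interleaving. The function name is hypothetical.

```c
#include <stdint.h>

/* Bank that holds the block containing byte address `addr`
   under sequential interleaving. */
unsigned bank_of(uint64_t addr, unsigned block_bytes, unsigned num_banks)
{
    uint64_t block_addr = addr / block_bytes;
    return (unsigned)(block_addr % num_banks);
}
```

With 64-byte blocks and four banks, consecutive blocks at addresses 0, 64, 128, and 192 land in banks 0, 1, 2, and 3, and address 256 wraps back to bank 0. Accesses to neighboring blocks can therefore proceed in parallel.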
Allow hits before previous misses complete
“Hit under miss”
“Hit under multiple miss”
L2 must support this
In general, processors can hide L1 miss penalty but not L2 miss penalty
Critical word first
Request missed word from memory first
Send it to the processor as soon as it arrives
Early restart
Request words in normal order
Send missed word to the processor as soon as it arrives
Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
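A sketch contrasting the two strategies. The wrap-around order for the remaining words is an assumption, since the text only says the missed word is requested first. Function names are illustrative.

```c
/* Fill order[0..words-1] with the request order for critical word first:
   start at the missed word, then wrap around the block (assumed order). */
void critical_word_first_order(unsigned words, unsigned critical, unsigned order[])
{
    for (unsigned i = 0; i < words; i++)
        order[i] = (critical + i) % words;
}

/* Number of word transfers the processor waits for before it can restart.
   Critical word first: only the missed word.
   Early restart: every word up to and including the missed one. */
unsigned words_until_restart(int use_critical_word_first, unsigned critical)
{
    return use_critical_word_first ? 1u : critical + 1u;
}
```

For an 8-word block with a miss on word 5, critical word first requests words 5, 6, 7, 0, 1, 2, 3, 4 and restarts after one transfer. Early restart waits for six.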
When storing to a block that is already pending in the write buffer, update write buffer
Reduces stalls due to full write buffer
Does not apply to I/O addresses
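A minimal sketch of write merging. It assumes a four-entry buffer with four 8-byte words per entry, and those sizes and all names are hypothetical. It omits draining entries to memory and the I/O-address exception.

```c
#include <stdint.h>
#include <string.h>

#define WB_ENTRIES      4   /* hypothetical buffer depth               */
#define WORDS_PER_ENTRY 4   /* hypothetical: four words per entry      */
#define WORD_BYTES      8

typedef struct {
    int      in_use;
    uint64_t block_addr;                 /* address / (entry size in bytes) */
    uint8_t  valid[WORDS_PER_ENTRY];     /* per-word valid bits             */
    uint64_t data[WORDS_PER_ENTRY];
} wb_entry;

typedef struct {
    wb_entry entry[WB_ENTRIES];
} write_buffer;

void wb_init(write_buffer *wb)
{
    memset(wb, 0, sizeof *wb);
}

/* Returns 1 if the store was buffered, 0 if the buffer is full (stall). */
int wb_write(write_buffer *wb, uint64_t addr, uint64_t value)
{
    uint64_t block = addr / (WORDS_PER_ENTRY * WORD_BYTES);
    unsigned word  = (unsigned)((addr / WORD_BYTES) % WORDS_PER_ENTRY);
    wb_entry *target = NULL;

    for (int i = 0; i < WB_ENTRIES; i++) {
        wb_entry *e = &wb->entry[i];
        if (e->in_use && e->block_addr == block) {
            target = e;                  /* merge into the pending entry */
            break;
        }
        if (!e->in_use && target == NULL)
            target = e;                  /* remember the first free entry */
    }
    if (target == NULL)
        return 0;                        /* no match and no free entry */

    if (!target->in_use) {
        target->in_use = 1;
        target->block_addr = block;
    }
    target->valid[word] = 1;
    target->data[word]  = value;
    return 1;
}

/* Number of occupied entries, useful for seeing the effect of merging. */
int wb_entries_in_use(const write_buffer *wb)
{
    int n = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        n += wb->entry[i].in_use;
    return n;
}
```

Stores to addresses 100, 108, 116, and 124 fall in the same 32-byte block, so they occupy one entry instead of four. This is how merging reduces stalls from a full buffer.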
[Figure: write buffer contents. Panels: "No write buffering" and "Write buffering"]
Loop Interchange
Swap nested loops to access memory in sequential order
Blocking
Instead of accessing entire rows or columns, subdivide matrices into blocks
Requires more memory accesses but improves locality of accesses
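Loop interchange can be sketched on a simple element-wise update. C stores arrays in row-major order, so making the column index the inner loop gives sequential accesses. The array dimensions and function names are hypothetical.

```c
#define ROWS 5000  /* hypothetical dimensions */
#define COLS 100

/* Before: the inner loop walks down a column, striding COLS ints per access. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, touching
   consecutive words so every word of a fetched block is used. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both versions compute the same result. Only the order of memory accesses changes.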
/* Before blocking */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

/* After blocking with blocking factor B. x must be zeroed beforehand;
   min(a, b) returns the smaller of its two arguments. */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B,N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B,N); k = k + 1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
Fetch two blocks on miss (include next sequential block)
Pentium 4 Pre-fetching
Insert prefetch instructions before data is needed
Non-faulting: prefetch doesn’t cause exceptions
Register prefetch
Loads data into register
Cache prefetch
Loads data into cache
Combine with loop unrolling and software pipelining
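A minimal sketch of a cache prefetch inserted by hand, using the GCC/Clang builtin `__builtin_prefetch`. The prefetch distance is a hypothetical value that would need tuning, and a compiler or a hardware prefetcher may already cover a simple sequential loop like this one.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* hypothetical: elements to run ahead */

/* Sum an array while prefetching data a fixed distance ahead.
   Arguments to __builtin_prefetch: address, 0 = prefetch for read,
   3 = high temporal locality (keep in all cache levels). */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The guard keeps the address expression inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

The prefetch only loads data into the cache and does not change the result. Unrolling the loop would allow one prefetch per cache block rather than one per element.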
DRAM cache in the processor package (HBM): 128 MiB to 1 GiB
Smaller blocks require substantial tag storage
Larger blocks are potentially inefficient
One approach (L-H):
Each SDRAM row is a block index
Each row contains set of tags and 29 data segments
29-set associative
Hit requires a CAS
Another approach (Alloy cache):
Mold tag and data together
Use direct mapped
Both schemes require two DRAM accesses for misses
Two solutions:
Use map to keep track of blocks
Predict likely misses
Protection via virtual memory
Keeps processes in their own memory space
Role of architecture
Provide user mode and supervisor mode
Protect certain aspects of CPU state
Provide mechanisms for switching between user mode and supervisor mode
Provide mechanisms to limit memory accesses
Provide TLB to translate addresses
Virtual Memory and Virtual Machines
Supports isolation and security
Sharing a computer among many unrelated users
Enabled by raw speed of processors, making the overhead more acceptable
Allows different ISAs and operating systems to be presented to user programs
“System Virtual Machines”
SVM software is called “virtual machine monitor” or “hypervisor”
Individual virtual machines run under the monitor are called “guest VMs”
Guest software should:
Behave as if running on native hardware
Not be able to change allocation of real system resources
VMM should be able to “context switch” guests
Hardware must allow:
System and user processor modes
Privileged subset of instructions for allocating system resources
Each guest OS maintains its own set of page tables
VMM adds a level of memory between physical and virtual memory called “real memory”
VMM maintains shadow page table that maps guest virtual addresses to physical addresses
Requires VMM to detect guest’s changes to its own page table
Occurs naturally if accessing the page table pointer is a privileged operation
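The shadow page table can be sketched as the composition of two mappings: the guest's page table (guest virtual to real) and the VMM's map (real to physical). This is a toy model with single-level tables of eight pages. All names and sizes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PAGES  8      /* hypothetical toy address space */
#define PAGE_BYTES 4096u

/* guest_pt : guest virtual page -> real page      (maintained by guest OS)
   vmm_map  : real page          -> physical page  (maintained by VMM)
   shadow   : guest virtual page -> physical page  (what the hardware uses) */
void build_shadow(const uint32_t guest_pt[NUM_PAGES],
                  const uint32_t vmm_map[NUM_PAGES],
                  uint32_t shadow[NUM_PAGES])
{
    for (uint32_t v = 0; v < NUM_PAGES; v++) {
        assert(guest_pt[v] < NUM_PAGES);
        shadow[v] = vmm_map[guest_pt[v]];
    }
}

/* Translate a guest virtual address in one step through the shadow table. */
uint32_t translate(const uint32_t shadow[NUM_PAGES], uint32_t guest_vaddr)
{
    uint32_t vpn    = guest_vaddr / PAGE_BYTES;
    uint32_t offset = guest_vaddr % PAGE_BYTES;
    assert(vpn < NUM_PAGES);
    return shadow[vpn] * PAGE_BYTES + offset;
}
```

Whenever the guest changes `guest_pt`, the VMM has to rebuild or patch `shadow`, which is why it must detect those changes.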
Objectives:
Avoid flushing TLB
Use nested page tables instead of shadow page tables
Allow devices to use DMA to move data
Allow guest OSes to handle device interrupts