1
Lecture 10
Main Memory
Virtual Memory
2
Administration
• Project progress report – next week
– 15 minute presentation
3
Review: Improving Cache Performance
• Reducing Miss Rate
– Larger Block Size
– Higher Associativity
– Victim Cache
– Pseudo-Associativity
– HW Prefetching of Instructions, Data
– SW Prefetching of Data
– Compiler Optimizations
• Reducing Miss Penalty
– Read priority over write on miss
– Subblock placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit Under Miss)
– Second Level Cache
• Reducing Hit Time
– Small & simple cache
– Avoid address translation
– Pipeline writes
– Fast Writes on Misses Via Small Subblocks
4
Main Memory Background
• Performance of Main Memory:
– Latency: time to finish a request
• Access Time: time between request and word arrives
• Cycle Time: time between requests
• Cycle time > Access Time
– Bandwidth: bytes per second
• Main Memory is DRAM: Dynamic Random Access Memory
– Dynamic since it needs to be refreshed periodically (every 8 ms)
– Addresses divided into 2 halves (memory as a 2D matrix); see the sketch at the end of this slide:
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Address not divided
• Size: DRAM/SRAM ratio of 4-8
• Cost/cycle time: SRAM/DRAM ratio of 8-16
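Not on the original slide: a minimal sketch of the row/column address split, assuming a hypothetical 16M-word DRAM organized as a 4096 x 4096 matrix (the sizes are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16M-word DRAM viewed as a 4096 x 4096 matrix: the upper
   12 address bits travel with RAS, the lower 12 with CAS. */
#define ROW_BITS 12
#define COL_BITS 12

int main(void) {
    uint32_t addr = 0xABCDE;                                     /* example word address */
    uint32_t row = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);  /* sent with RAS */
    uint32_t col = addr & ((1u << COL_BITS) - 1);                /* sent with CAS */
    printf("row (RAS) = %u, column (CAS) = %u\n", (unsigned)row, (unsigned)col);
    return 0;
}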
5
Main Memory Performance
• Simple:
– CPU, Cache, Bus, Memory same width (32 bits)
• Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved:
– CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
[Figure: the three organizations: Simple (CPU, cache, bus, and memory all one word wide); Wide (a multiplexor between the CPU and a wide cache, bus, and memory); Interleaved (a one-word bus from the cache to four memory banks, Bank 0-3)]
6
Main Memory Performance
• Timing model
– 4 clock cycles to send address,
– 24 clock cycles for access time per word
– 4 clock cycles to send x (bus bandwidth) words of data
– Cache block size = 4 words
• Simple: miss penalty = 4 x (4 + 24 + 4) = 128 cycles
• Wide: miss penalty = 4 + 24 + 4 = 32 cycles
• Interleaved: miss penalty = 4 + 24 + 4 x 4 = 44 cycles (see the sketch below the table)
Word-interleaved addresses across 4 banks:

Address:   Bank 0   Bank 1   Bank 2   Bank 3
              0        1        2        3
              4        5        6        7
              8        9       10       11
             12       13       14       15
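A small sketch (my addition) that plugs the timing model above into the three organizations; it assumes the interleaved banks overlap their access times and only the word transfers are serialized on the bus:

#include <stdio.h>

int main(void) {
    int addr = 4, access = 24, xfer = 4, block = 4;        /* cycles and words from the slide */

    int simple      = block * (addr + access + xfer);      /* one word at a time              */
    int wide        = addr + access + xfer;                /* whole block in parallel         */
    int interleaved = addr + access + block * xfer;        /* accesses overlap, bus is serial */

    printf("simple = %d, wide = %d, interleaved = %d\n",   /* prints 128, 32, 44 */
           simple, wide, interleaved);
    return 0;
}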
7
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
– Multiprocessor – I/O
– Miss under Miss, Non-blocking Cache
[Figure: a memory address split into superbank number, superbank offset, bank number, and bank offset; memory built from superbanks 0, 1, 2, ..., each containing independent banks 0-3]
8
Avoiding Bank Conflicts
• Bank Conflicts
– Memory references map to the same bank
– Problem: cannot take advantage of multiple banks (supporting multiple independent requests)
• Example: with 128 memory banks interleaved on a word basis, all elements of a column fall in the same memory bank
int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];
• SW: loop interchange (see the sketch at the end of this slide) or declaring the array so a row length is not a power of 2
• HW: Prime number of banks
– Problem: more complex calculation per memory access
– Memory address = (bank number, address within bank)
– Bank number = address mod number of banks
– Address within bank = address / number of banks
– Modulo & divide per memory access
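A sketch of the software fix by loop interchange (my addition; it mirrors the fragment above): the inner loop now walks along a row, so successive references fall in successive word-interleaved banks instead of all landing in the one bank that holds a column.

int x[256][512];
int i, j;
for (i = 0; i < 256; i = i + 1)
    for (j = 0; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j];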
9
Fast Bank Number
• Bank number = address mod number of banks
• Address within bank = address mod number of words in bank
• => prove that there is no ambiguity using this mapping method
• Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules
    bi = x mod ai,  0 ≤ bi < ai,  0 ≤ x < a0 × a1 × a2 × …
and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
– bank number = b0, number of banks = a0 (b0 = x mod a0)
– address within bank = b1, number of words in bank = a1 (b1 = x mod a1)
– Bank number < number of banks (0 ≤ b0 < a0)
– Address within a bank < number of words in bank (0 ≤ b1 < a1)
– Address < number of banks × number of words in a bank (0 ≤ x < a0 × a1)
– The number of banks and the number of words in a bank are co-prime (a0 and a1 are co-prime)
• N-word addresses 0 to N-1, prime number of banks, number of words per bank a power of 2
10
Fast Bank Number
(3 banks, 8 words per bank)

                     Seq. Interleaved       Modulo Interleaved
Bank number:          0    1    2            0    1    2
Address
within bank:   0      0    1    2            0   16    8
               1      3    4    5            9    1   17
               2      6    7    8           18   10    2
               3      9   10   11            3   19   11
               4     12   13   14           12    4   20
               5     15   16   17           21   13    5
               6     18   19   20            6   22   14
               7     21   22   23           15    7   23
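A small sketch (my addition, not from the slide) that reproduces the modulo-interleaved mapping in the table above; 3 banks and 8 words per bank are co-prime, so every address gets a unique (bank, offset) pair:

#include <stdio.h>

int main(void) {
    int banks = 3, words = 8;                  /* co-prime, as the theorem requires */
    for (int x = 0; x < banks * words; x++) {
        int bank   = x % banks;                /* bank number         */
        int offset = x % words;                /* address within bank */
        printf("address %2d -> bank %d, offset %d\n", x, bank, offset);
    }
    return 0;
}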
11
Fast Memory Systems: DRAM specific
• Multiple column (CAS) accesses to one open row: several names (page mode)
– 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
• New DRAMs to address the processor-memory gap; what will they cost, will they survive?
– Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock
– RAMBUS: reinvent DRAM interface
• Each Chip a module vs. slice of memory
• Short bus between CPU and chips
• Does own refresh
• Variable amount of data returned
• 1 byte / 2 ns (500 MB/s per chip)
• Niche memory
– e.g., Video RAM for frame buffers, DRAM + fast serial output
12
Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
13
Virtual Memory: Motivation
• Permit applications to grow larger than main memory size
– 32- or 64-bit virtual addresses vs. a 28-bit physical address (256 MB)
• Automatic management
• Multiple process management
– Sharing
– Protection
• Relocation
[Figure: virtual pages A-D of one address space; some map to frames in physical memory, the rest reside on disk]
14
Virtual Memory Terminology
• Page or segment
– A block transferred from the disk to main memory
• Page: fixed size
• Segment: variable size
• Page fault
– The requested page is not in main memory (miss)
• Address translation or memory mapping
– Virtual to physical address
15
Cache vs. VM Difference
• What controls replacement?
– Cache miss: HW
– Page fault: often handled by the OS
• Size
– VM space is determined by the address size of the CPU
– Cache size is independent of the CPU address size
• Lower level use
– Cache: main memory is not shared by anything else
– VM: most of the disk contains the file system
16
Virtual Memory
• 4Qs for VM?
– Q1: Where can a block be placed in the upper level?
• Fully Associative
– Q2: How is a block found if it is in the upper level?
• Pages: use a page table
• Segments: segment table
– Q3: Which block should be replaced on a miss?
• LRU
– Q4: What happens on a write?
• Write Back
17
Page Table
• Virtual-to-physical address mapping via page table
[Figure: the virtual address is split into a virtual page number and a page offset; the virtual page number indexes the page table, which supplies the physical page frame in main memory]
What is the size of the page table given a 28-bit virtual address, 4 KB pages, and 4 bytes per page table entry?
2^(28-12) PTEs x 2^2 bytes per PTE = 2^18 bytes = 256 KB
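The same calculation as a quick sketch in C (my addition, using the page size and PTE size from the question):

#include <stdio.h>

int main(void) {
    unsigned va_bits = 28, page_bits = 12, pte_bytes = 4;    /* 4 KB pages, 4 B per PTE */
    unsigned long entries = 1ul << (va_bits - page_bits);    /* 2^16 = 65536 PTEs       */
    unsigned long bytes   = entries * pte_bytes;             /* 262144 B = 256 KB       */
    printf("page table: %lu entries, %lu KB\n", entries, bytes >> 10);
    return 0;
}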
18
Inverted Page Table (HP, IBM)
• One PTE per page frame
• Pros & Cons:
– Pro: the table size scales with the number of physical pages rather than the virtual address space
– Con: must search for the virtual address (using a hash; sketch below)
[Figure: the virtual page number is hashed through the Hash Anchor Table (HAT) into the inverted page table; each entry pairs a virtual address (VA) with its physical address (PA)]
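A rough sketch of the hashed lookup (my addition; the structures, names, and hash are assumptions, not the actual HP/IBM format):

/* Hypothetical structures; real HP/IBM formats differ. */
typedef struct {
    unsigned long vpn;     /* virtual page number stored for comparison    */
    int next;              /* next entry index in the hash chain, -1 = end */
} ipte_t;

extern ipte_t ipt[];       /* one entry per physical page frame            */
extern int    hat[];       /* hash anchor table: first entry per bucket    */
#define HAT_SIZE 4096

/* Returns the physical page frame number, or -1 on a page fault. */
int translate(unsigned long vpn) {
    int i = hat[vpn % HAT_SIZE];             /* simple hash of the VPN */
    while (i != -1) {
        if (ipt[i].vpn == vpn)
            return i;                         /* entry index == frame number */
        i = ipt[i].next;
    }
    return -1;                                /* not mapped: page fault */
}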
19
Fast Translation: Translation Look-aside Buffer (TLB)
• Cache of translated addresses
• Alpha 21064 TLB: 32 entry fully associative
[Figure: Alpha 21064 TLB: the virtual address supplies a 30-bit page frame address and a 13-bit page offset; each of the 32 fully associative entries holds valid/read/write bits, a 30-bit tag, and a 21-bit physical address; a 32:1 mux selects the matching entry, and the 21-bit physical page number plus the 13-bit offset form the 34-bit physical address]
• Problem: how to combine caches with virtual memory
20
TLB & Caches
[Figure: three ways to combine the TLB and the cache]
• Conventional organization: the CPU issues a VA to the TLB, and the cache and memory are accessed with the PA (translate before every cache access)
• Virtually addressed cache: the cache is accessed with the VA and keeps VA tags; translation happens only on a miss, on the way to memory
• Overlapped organization: the cache is indexed while the TLB translates in parallel; the cache keeps PA tags; requires the cache index to remain invariant across translation (an L2 cache below is physically addressed)
21
Virtual Cache
• Avoid address translation before accessing cache
– Faster hit time
• Context switch
– Flush the cache: cost = time to flush + "compulsory" misses from the empty cache
– Or add a process identifier (PID) to the cache address tag so flushing is unnecessary
• I/O (physical address) must interact with cache
– Physical -> virtual address translation
• Aliases (Synonyms)
– Two virtual addresses map to the same physical addresses
• Two identical copies in the cache
22
Solutions for Aliases
• HW: anti-aliasing
– Guarantee every cache block a unique physical address
• OS : page coloring
– Guarantee that the virtual and physical addresses match in their last n bits
– Avoids duplicate copies of a block in a direct-mapped cache of size <= 2^n (sketch below)
[Figure: page coloring example: virtual addresses x and y alias to the same physical address; because their index bits and block offset agree in the matching low-order bits, x and y map to the same cache set]
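A sketch (my addition, with made-up sizes and addresses) of why page coloring removes duplicates: two synonyms whose low n bits match index the same set of a direct-mapped cache no larger than 2^n bytes.

#include <stdio.h>

/* With page coloring, VA and PA agree in their low n bits. If a direct-mapped
   cache is no bigger than 2^n bytes, its index and block offset come entirely
   from those n bits, so every synonym of a block maps to the same cache set. */
int main(void) {
    unsigned n = 16;                           /* cache <= 64 KB  */
    unsigned block = 32;                       /* 32-byte blocks  */
    unsigned sets  = (1u << n) / block;

    unsigned long va_x = 0x40012340, va_y = 0x7FF12340;   /* synonyms: same low 16 bits */
    unsigned set_x = (va_x / block) % sets;
    unsigned set_y = (va_y / block) % sets;
    printf("set(x) = %u, set(y) = %u  -> same set, no duplicate copies\n", set_x, set_y);
    return 0;
}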
23
Virtually indexed & physically tagged cache
• Use the part of the address that is not affected by address translation to index the cache
– Page offset
– Overlap the time to read the tags with address translation
• Limits a direct-mapped cache to the page size; to get a bigger cache (see the worked example below):
– Higher associativity
– Page coloring
[Figure: the address split into page address and page offset, aligned against address tag, index, and block offset; the index and block offset fall entirely within the page offset, so cache indexing can begin before translation completes]
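A worked example (mine, not from the slide): with 8 KB pages only the 13 page-offset bits are available before translation, so a direct-mapped virtually indexed, physically tagged cache is limited to 8 KB; 4-way associativity raises the limit to 4 x 8 KB = 32 KB.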
24
Selecting a Page Size
• Reasons for larger page size
– Page table size is inversely proportional to the page size, so memory is saved
– A fast cache hit time is easy when the cache <= page size (virtually addressed caches); a bigger page keeps this feasible as cache size grows
– Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
– The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Fragmentation: don't waste storage; data must be contiguous within a page
– Quicker process start-up for small processes
• Hybrid solution: multiple page sizes
– Alpha: 8KB, 16KB, 32 KB, 64 KB pages (43, 47, 51, 55 virt addr bits)
25
Alpha VM Mapping
• “64-bit” address divided into 3 segments
– seg0 (bit 63 = 0): user code/heap
– seg1 (bit 63 = 1, bit 62 = 1): user stack
– kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
• Three-level page table, each level one page (see the sketch at the end of this slide)
– 8 KB pages, 8 B PTEs
– Alpha uses only 43 unique bits of VA
– (future minimum page size up to 64 KB => 55 bits of VA)
• PTE bits: valid, kernel & user read & write enable
[Figure: the virtual address is split into a seg0/seg1 selector <21>, three 10-bit level indices, and a 13-bit page offset; the page table base register plus the level-1 index selects an L1 PTE, which points to the L2 table, which points to the L3 table, which supplies the physical page-frame number that is concatenated with the page offset]
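A sketch of the virtual-address field split used by the walk above (my addition; it follows the <21><10><10><10><13> widths on the slide and omits the actual table-walk memory accesses):

#include <stdint.h>

/* Split a seg0/seg1 virtual address into the three 10-bit page-table
   indices and the 13-bit page offset, per the field widths above. */
typedef struct { unsigned l1, l2, l3, offset; } va_fields_t;

va_fields_t split_va(uint64_t va) {
    va_fields_t f;
    f.offset = va & 0x1FFF;                 /* 13-bit page offset (8 KB pages) */
    f.l3     = (va >> 13) & 0x3FF;          /* level-3 index (10 bits)         */
    f.l2     = (va >> 23) & 0x3FF;          /* level-2 index (10 bits)         */
    f.l1     = (va >> 33) & 0x3FF;          /* level-1 index (10 bits)         */
    return f;                               /* 13+10+10+10 = 43 used VA bits   */
}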
26
Virtual Memory Summary
• Why virtual memory?
• Fast address translation
– Page table, TLB
• TLB & Cache
– Virtual cache vs. physical cache
– Virtually indexed and physically tagged
27
Cross Cutting Issues
• Superscalar CPU & number of cache ports
• Speculative execution and the non-faulting option on memory accesses
• Parallel execution vs. cache locality
– Want wide separation to find independent operations vs. reuse of data accesses to avoid misses
for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j-1];

for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i][j+1] = 2 * x[i][j];
        x[i][j+2] = 2 * x[i][j+1];
        x[i][j+3] = 2 * x[i][j+2];
    }

for (j = 1; j < 512; j = j + 1)
    for (i = 0; i < 512; i = i + 1)
        x[i][j] = 2 * x[i][j-1];

for (j = 1; j < 512; j = j + 1)
    for (i = 1; i < 512; i = i + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i+1][j] = 2 * x[i+1][j-1];
        x[i+2][j] = 2 * x[i+2][j-1];
        x[i+3][j] = 2 * x[i+3][j-1];
    }
28
Cross Cutting Issues
• I/O and consistency of data between cache and memory
[Figure: two system organizations: (a) I/O traffic passes through the cache (CPU, cache, I/O bridge, main memory); (b) DMA transfers move directly between the I/O bridge and main memory, bypassing the cache]
• I/O through the cache: I/O always sees the latest data, but interferes with the CPU
• DMA straight to memory: does not interfere with the CPU, but might see stale data
• Keeping the data consistent:
– Output: use a write-through cache
– Input: make the buffer noncacheable; or SW: flush the cache; or HW: check I/O addresses against the cache on input
29
Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)
• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8 KB data cache: 17% vs. 10%
[Figure: miss rate (0%-35%) vs. cache size (1-128 KB) for data (D) and instruction (I) caches running tomcatv, gcc, and espresso]
30