1
Lecture 10
Main Memory
Virtual Memory
2
Administration
• Project progress report – next week
– 15 minute presentation
3
Review: Improving Cache Performance
• Reducing Miss Rate
– Larger Block Size
– Higher Associativity
– Victim Cache
– Pseudo-Associativity
– HW Prefetching of Instructions, Data
– SW Prefetching of Data
– Compiler Optimizations
• Reducing Miss Penalty
– Read priority over write on miss
– Subblock placement
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit Under Miss)
– Second Level Cache
• Reducing Hit Time
– Small & simple cache
– Avoid address translation
– Pipeline writes
– Fast Writes on Misses Via Small Subblocks
4
Main Memory Background
• Performance of Main Memory:
– Latency: time to finish a request
• Access Time: time between request and word arrives
• Cycle Time: time between requests
• Cycle time > Access Time
– Bandwidth: bytes per second
• Main Memory is DRAM: Dynamic Random Access Memory
– Dynamic since it needs to be refreshed periodically (every 8 ms)
– Addresses divided into 2 halves (memory as a 2D matrix); see the sketch at the end of this slide:
• RAS or Row Access Strobe
• CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
– No refresh (6 transistors/bit vs. 1 transistor/bit)
– Address not divided
• Size: DRAM/SRAM ratio of 4-8
• Cost/cycle time: SRAM/DRAM ratio of 8-16
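Not on the original slide: a minimal sketch of the row/column address split, assuming a hypothetical 16M-word DRAM organized as a 4096 x 4096 matrix (the sizes are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16M-word DRAM viewed as a 4096 x 4096 matrix: the upper
   12 address bits travel with RAS, the lower 12 with CAS. */
#define ROW_BITS 12
#define COL_BITS 12

int main(void) {
    uint32_t addr = 0xABCDE;                                     /* example word address */
    uint32_t row = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);  /* sent with RAS */
    uint32_t col = addr & ((1u << COL_BITS) - 1);                /* sent with CAS */
    printf("row (RAS) = %u, column (CAS) = %u\n", (unsigned)row, (unsigned)col);
    return 0;
}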
5
Main Memory Performance
• Simple:
– CPU, Cache, Bus, Memory same width (32 bits)
• Wide:
– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved:
– CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved
[Figure: the three organizations: Simple (CPU, cache, bus, and memory all one word wide); Wide (a multiplexor between the CPU and a wide cache, bus, and memory); Interleaved (a one-word bus from the cache to four memory banks, Bank 0-3)]
6
Main Memory Performance
• Timing model
– 4 clock cycles to send address,
– 24 clock cycles for access time per word
– 4 clock cycles to send x (bus bandwidth) words of data
– Cache block size = 4 words
• Simple: miss penalty = 4 x (4 + 24 + 4) = 128 cycles
• Wide: miss penalty = 4 + 24 + 4 = 32 cycles
• Interleaved: miss penalty = 4 + 24 + 4 x 4 = 44 cycles (see the sketch below the table)
Word-interleaved addresses across 4 banks:

Address:   Bank 0   Bank 1   Bank 2   Bank 3
              0        1        2        3
              4        5        6        7
              8        9       10       11
             12       13       14       15
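A small sketch (my addition) that plugs the timing model above into the three organizations; it assumes the interleaved banks overlap their access times and only the word transfers are serialized on the bus:

#include <stdio.h>

int main(void) {
    int addr = 4, access = 24, xfer = 4, block = 4;        /* cycles and words from the slide */

    int simple      = block * (addr + access + xfer);      /* one word at a time              */
    int wide        = addr + access + xfer;                /* whole block in parallel         */
    int interleaved = addr + access + block * xfer;        /* accesses overlap, bus is serial */

    printf("simple = %d, wide = %d, interleaved = %d\n",   /* prints 128, 32, 44 */
           simple, wide, interleaved);
    return 0;
}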
7
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
– Multiprocessor – I/O
– Miss under Miss, Non-blocking Cache
[Figure: a memory address split into superbank number, superbank offset, bank number, and bank offset; memory built from superbanks 0, 1, 2, ..., each containing independent banks 0-3]
8
Avoiding Bank Conflicts
• Bank Conflicts
– Memory references map to the same bank
– Problem: cannot take advantage of multiple banks (supporting multiple independent requests)
• Example: with 128 memory banks interleaved on a word basis, all elements of a column fall in the same memory bank
int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];
• SW: loop interchange (see the sketch at the end of this slide) or declaring the array so a row length is not a power of 2
• HW: Prime number of banks
– Problem: more complex calculation per memory access
– Memory address = (bank number, address within bank)
– Bank number = address mod number of banks
– Address within bank = address / number of banks
– Modulo & divide per memory access
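A sketch of the software fix by loop interchange (my addition; it mirrors the fragment above): the inner loop now walks along a row, so successive references fall in successive word-interleaved banks instead of all landing in the one bank that holds a column.

int x[256][512];
int i, j;
for (i = 0; i < 256; i = i + 1)
    for (j = 0; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j];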
9
Fast Bank Number
• Bank number = address mod number of banks
• Address within bank = address mod number of words in bank
• => prove that there is no ambiguity using this mapping method
• Chinese Remainder Theorem
As long as two sets of integers ai and bi follow these rules
    bi = x mod ai,  0 ≤ bi < ai,  0 ≤ x < a0 × a1 × a2 × …
and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
– bank number = b0, number of banks = a0 (b0 = x mod a0)
– address within bank = b1, number of words in bank = a1 (b1 = x mod a1)
– Bank number < number of banks (0 ≤ b0 < a0)
– Address within a bank < number of words in bank (0 ≤ b1 < a1)
– Address < number of banks × number of words in a bank (0 ≤ x < a0 × a1)
– The number of banks and the number of words in a bank are co-prime (a0 and a1 are co-prime)
• N-word addresses 0 to N-1, prime number of banks, number of words per bank a power of 2
10
Fast Bank Number
(3 banks, 8 words per bank)

                     Seq. Interleaved       Modulo Interleaved
Bank number:          0    1    2            0    1    2
Address
within bank:   0      0    1    2            0   16    8
               1      3    4    5            9    1   17
               2      6    7    8           18   10    2
               3      9   10   11            3   19   11
               4     12   13   14           12    4   20
               5     15   16   17           21   13    5
               6     18   19   20            6   22   14
               7     21   22   23           15    7   23
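A small sketch (my addition, not from the slide) that reproduces the modulo-interleaved mapping in the table above; 3 banks and 8 words per bank are co-prime, so every address gets a unique (bank, offset) pair:

#include <stdio.h>

int main(void) {
    int banks = 3, words = 8;                  /* co-prime, as the theorem requires */
    for (int x = 0; x < banks * words; x++) {
        int bank   = x % banks;                /* bank number         */
        int offset = x % words;                /* address within bank */
        printf("address %2d -> bank %d, offset %d\n", x, bank, offset);
    }
    return 0;
}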
11
Fast Memory Systems: DRAM specific
• Multiple column (CAS) accesses to one open row: several names (page mode)
– 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
• New DRAMs to address the processor-memory gap; what will they cost, will they survive?
– Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock
– RAMBUS: reinvent DRAM interface
• Each Chip a module vs. slice of memory
• Short bus between CPU and chips
• Does own refresh
• Variable amount of data returned
• 1 byte / 2 ns (500 MB/s per chip)
• Niche memory
– e.g., Video RAM for frame buffers, DRAM + fast serial output
12
Main Memory Summary
• Wider Memory
• Interleaved Memory: for sequential or independent accesses
• Avoiding bank conflicts: SW & HW
• DRAM-specific optimizations: page mode & specialty DRAM
13
Virtual Memory: Motivation
• Permit applications to grow larger than main memory size
– 32- or 64-bit virtual addresses vs. a 28-bit physical address (256 MB)
• Automatic management
• Multiple process management
– Sharing
– Protection
• Relocation
[Figure: virtual pages A-D of one address space; some map to frames in physical memory, the rest reside on disk]
14
Virtual Memory Terminology
• Page or segment
– A block transferred from the disk to main memory
• Page: fixed size
• Segment: variable size
• Page fault
– The requested page is not in main memory (miss)
• Address translation or memory mapping
– Virtual to physical address
15
Cache vs. VM Difference
• What controls replacement?
– Cache miss: HW
– Page fault: often handled by the OS
• Size
– VM space is determined by the address size of the CPU
– Cache size is independent of the CPU address size
• Lower level use
– Cache: main memory is not shared by anything else
– VM: most of the disk contains the file system
16
Virtual Memory
• 4Qs for VM?
– Q1: Where can a block be placed in the upper level?
• Fully Associative
– Q2: How is a block found if it is in the upper level?
• Pages: use a page table
• Segments: segment table
– Q3: Which block should be replaced on a miss?
• LRU
– Q4: What happens on a write?
• Write Back
17
Page Table
• Virtual-to-physical address mapping via page table
[Figure: the virtual address is split into a virtual page number and a page offset; the virtual page number indexes the page table, which supplies the physical page frame in main memory]
What is the size of the page table given a 28-bit virtual address, 4 KB pages, and 4 bytes per page table entry?
2^(28-12) PTEs x 2^2 bytes per PTE = 2^18 bytes = 256 KB
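The same calculation as a quick sketch in C (my addition, using the page size and PTE size from the question):

#include <stdio.h>

int main(void) {
    unsigned va_bits = 28, page_bits = 12, pte_bytes = 4;    /* 4 KB pages, 4 B per PTE */
    unsigned long entries = 1ul << (va_bits - page_bits);    /* 2^16 = 65536 PTEs       */
    unsigned long bytes   = entries * pte_bytes;             /* 262144 B = 256 KB       */
    printf("page table: %lu entries, %lu KB\n", entries, bytes >> 10);
    return 0;
}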
18
Inverted Page Table (HP, IBM)
• One PTE per page frame
• Pros & Cons:
– Pro: the table size scales with the number of physical pages rather than the virtual address space
– Con: must search for the virtual address (using a hash; sketch below)
[Figure: the virtual page number is hashed through the Hash Anchor Table (HAT) into the inverted page table; each entry pairs a virtual address (VA) with its physical address (PA)]
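A rough sketch of the hashed lookup (my addition; the structures, names, and hash are assumptions, not the actual HP/IBM format):

/* Hypothetical structures; real HP/IBM formats differ. */
typedef struct {
    unsigned long vpn;     /* virtual page number stored for comparison    */
    int next;              /* next entry index in the hash chain, -1 = end */
} ipte_t;

extern ipte_t ipt[];       /* one entry per physical page frame            */
extern int    hat[];       /* hash anchor table: first entry per bucket    */
#define HAT_SIZE 4096

/* Returns the physical page frame number, or -1 on a page fault. */
int translate(unsigned long vpn) {
    int i = hat[vpn % HAT_SIZE];             /* simple hash of the VPN */
    while (i != -1) {
        if (ipt[i].vpn == vpn)
            return i;                         /* entry index == frame number */
        i = ipt[i].next;
    }
    return -1;                                /* not mapped: page fault */
}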
19
Fast Translation: Translation Look-aside Buffer (TLB)
• Cache of translated addresses
• Alpha 21064 TLB: 32 entry fully associative
[Figure: Alpha 21064 TLB: the virtual address supplies a 30-bit page frame address and a 13-bit page offset; each of the 32 fully associative entries holds valid/read/write bits, a 30-bit tag, and a 21-bit physical address; a 32:1 mux selects the matching entry, and the 21-bit physical page number plus the 13-bit offset form the 34-bit physical address]
• Problem: how to combine caches with virtual memory
20
TLB & Caches
[Figure: three ways to combine the TLB and the cache]
• Conventional organization: the CPU issues a VA to the TLB, and the cache and memory are accessed with the PA (translate before every cache access)
• Virtually addressed cache: the cache is accessed with the VA and keeps VA tags; translation happens only on a miss, on the way to memory
• Overlapped organization: the cache is indexed while the TLB translates in parallel; the cache keeps PA tags; requires the cache index to remain invariant across translation (an L2 cache below is physically addressed)
21
Virtual Cache
• Avoid address translation before accessing cache
– Faster hit time
• Context switch
– Flush the cache: cost = time to flush + "compulsory" misses from the empty cache
– Or add a process identifier (PID) to the cache address tag so flushing is unnecessary
• I/O (physical address) must interact with cache
– Physical -> virtual address translation
• Aliases (Synonyms)
– Two virtual addresses map to the same physical addresses
• Two identical copies in the cache
22
Solutions for Aliases
• HW: anti-aliasing
– Guarantee every cache block a unique physical address
• OS : page coloring
– Guarantee that the virtual and physical addresses match in their last n bits
– Avoids duplicate copies of a block in a direct-mapped cache of size <= 2^n (sketch below)
[Figure: page coloring example: virtual addresses x and y alias to the same physical address; because their index bits and block offset agree in the matching low-order bits, x and y map to the same cache set]
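A sketch (my addition, with made-up sizes and addresses) of why page coloring removes duplicates: two synonyms whose low n bits match index the same set of a direct-mapped cache no larger than 2^n bytes.

#include <stdio.h>

/* With page coloring, VA and PA agree in their low n bits. If a direct-mapped
   cache is no bigger than 2^n bytes, its index and block offset come entirely
   from those n bits, so every synonym of a block maps to the same cache set. */
int main(void) {
    unsigned n = 16;                           /* cache <= 64 KB  */
    unsigned block = 32;                       /* 32-byte blocks  */
    unsigned sets  = (1u << n) / block;

    unsigned long va_x = 0x40012340, va_y = 0x7FF12340;   /* synonyms: same low 16 bits */
    unsigned set_x = (va_x / block) % sets;
    unsigned set_y = (va_y / block) % sets;
    printf("set(x) = %u, set(y) = %u  -> same set, no duplicate copies\n", set_x, set_y);
    return 0;
}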
23
Virtually indexed & physically tagged cache
• Use the part of the address that is not affected by address translation to index the cache
– Page offset
– Overlap the time to read the tags with address translation
• Limits a direct-mapped cache to the page size; to get a bigger cache (see the worked example below):
– Higher associativity
– Page coloring
[Figure: the address split into page address and page offset, aligned against address tag, index, and block offset; the index and block offset fall entirely within the page offset, so cache indexing can begin before translation completes]
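A worked example (mine, not from the slide): with 8 KB pages only the 13 page-offset bits are available before translation, so a direct-mapped virtually indexed, physically tagged cache is limited to 8 KB; 4-way associativity raises the limit to 4 x 8 KB = 32 KB.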
24
Selecting a Page Size
• Reasons for larger page size
– Page table size is inversely proportional to the page size, so memory is saved
– A fast cache hit time is easy when the cache <= page size (virtually addressed caches); a bigger page keeps this feasible as cache size grows
– Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
– The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses
• Reasons for a smaller page size
– Fragmentation: don't waste storage; data must be contiguous within a page
– Quicker process start-up for small processes
• Hybrid solution: multiple page sizes
– Alpha: 8KB, 16KB, 32 KB, 64 KB pages (43, 47, 51, 55 virt addr bits)
25
Alpha VM Mapping
• “64-bit” address divided into 3 segments
– seg0 (bit 63 = 0): user code/heap
– seg1 (bit 63 = 1, bit 62 = 1): user stack
– kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
• Three-level page table, each level one page (see the sketch at the end of this slide)
– 8 KB pages, 8 B PTEs
– Alpha uses only 43 unique bits of VA
– (future minimum page size up to 64 KB => 55 bits of VA)
• PTE bits: valid, kernel & user read & write enable
[Figure: the virtual address is split into a seg0/seg1 selector <21>, three 10-bit level indices, and a 13-bit page offset; the page table base register plus the level-1 index selects an L1 PTE, which points to the L2 table, which points to the L3 table, which supplies the physical page-frame number that is concatenated with the page offset]
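A sketch of the virtual-address field split used by the walk above (my addition; it follows the <21><10><10><10><13> widths on the slide and omits the actual table-walk memory accesses):

#include <stdint.h>

/* Split a seg0/seg1 virtual address into the three 10-bit page-table
   indices and the 13-bit page offset, per the field widths above. */
typedef struct { unsigned l1, l2, l3, offset; } va_fields_t;

va_fields_t split_va(uint64_t va) {
    va_fields_t f;
    f.offset = va & 0x1FFF;                 /* 13-bit page offset (8 KB pages) */
    f.l3     = (va >> 13) & 0x3FF;          /* level-3 index (10 bits)         */
    f.l2     = (va >> 23) & 0x3FF;          /* level-2 index (10 bits)         */
    f.l1     = (va >> 33) & 0x3FF;          /* level-1 index (10 bits)         */
    return f;                               /* 13+10+10+10 = 43 used VA bits   */
}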
26
Virtual Memory Summary
• Why virtual memory?
• Fast address translation
– Page table, TLB
• TLB & Cache
– Virtual cache vs. physical cache
– Virtually indexed and physically tagged
27
Cross Cutting Issues
• Superscalar CPU & number of cache ports
• Speculative execution and the non-faulting option on memory accesses
• Parallel execution vs. cache locality
– Want wide separation to find independent operations vs. reuse of data accesses to avoid misses
for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j-1];

for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i][j+1] = 2 * x[i][j];
        x[i][j+2] = 2 * x[i][j+1];
        x[i][j+3] = 2 * x[i][j+2];
    }

for (j = 1; j < 512; j = j + 1)
    for (i = 0; i < 512; i = i + 1)
        x[i][j] = 2 * x[i][j-1];

for (j = 1; j < 512; j = j + 1)
    for (i = 1; i < 512; i = i + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i+1][j] = 2 * x[i+1][j-1];
        x[i+2][j] = 2 * x[i+2][j-1];
        x[i+3][j] = 2 * x[i+3][j-1];
    }
28
Cross Cutting Issues
• I/O and consistency of data between cache and memory
[Figure: two system organizations: (a) I/O traffic passes through the cache (CPU, cache, I/O bridge, main memory); (b) DMA transfers move directly between the I/O bridge and main memory, bypassing the cache]
• I/O through the cache: I/O always sees the latest data, but interferes with the CPU
• DMA straight to memory: does not interfere with the CPU, but might see stale data
• Keeping the data consistent:
– Output: use a write-through cache
– Input: make the buffer noncacheable; or SW: flush the cache; or HW: check I/O addresses against the cache on input
29
Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)
• 4 KB data cache: miss rate 8%, 12%, or 28%?
• 1 KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for 8 KB data cache: 17% vs. 10%
[Figure: miss rate (0%-35%) vs. cache size (1-128 KB) for data (D) and instruction (I) caches running tomcatv, gcc, and espresso]
30