
Lecture 10

Main Memory

Virtual Memory


Administration

• Project progress report – next week

– 15-minute presentation


Review: Improving Cache Performance

• Reducing Miss Rate

  – Larger block size
  – Higher associativity
  – Victim cache
  – Pseudo-associativity
  – HW prefetching of instructions and data
  – SW prefetching of data
  – Compiler optimizations

• Reducing Miss Penalty

  – Read priority over write on miss
  – Subblock placement
  – Early restart and critical word first on miss
  – Non-blocking caches (hit under miss)
  – Second-level cache

• Reducing Hit Time

  – Small & simple cache
  – Avoiding address translation
  – Pipelined writes
  – Fast writes on misses via small subblocks


Main Memory Background

• Performance of main memory:

  – Latency: time to finish a request

    • Access time: time between the request and the word arriving
    • Cycle time: minimum time between requests
    • Cycle time > access time

  – Bandwidth: bytes per second

• Main memory is DRAM: Dynamic Random Access Memory

  – Dynamic since it needs to be refreshed periodically (every 8 ms)
  – Addresses divided into 2 halves (memory as a 2D matrix):

    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe

• Cache uses SRAM: Static Random Access Memory

  – No refresh (6 transistors/bit vs. 1 transistor/bit)
  – Address not divided

• Size: DRAM/SRAM ≈ 4–8×; cost and cycle time: SRAM/DRAM ≈ 8–16×


Main Memory Performance

• Simple:

  – CPU, cache, bus, and memory are all one word wide (32 bits)

• Wide:

  – CPU/mux path is 1 word; mux/cache, bus, and memory are N words (Alpha: 64 bits & 256 bits)

• Interleaved:

  – CPU, cache, and bus are 1 word; memory is N modules (4 modules here); the example is word-interleaved

[Figure: the three organizations — simple (one-word path from CPU through cache to memory), wide (a multiplexor between the CPU and a wide cache/memory), and interleaved (four one-word banks 0–3 behind the cache)]


Main Memory Performance

• Timing model

  – 4 clock cycles to send the address
  – 24 clock cycles of access time per word
  – 4 clock cycles to send x words of data, where x is the bus width in words
  – Cache block size = 4 words

• Simple miss penalty = 4 × (4 + 24 + 4) = 128 cycles

• Wide miss penalty = 4 + 24 + 4 = 32 cycles

• Interleaved miss penalty = 4 + 24 + 4 × 4 = 44 cycles

Word addresses across the four interleaved banks:

  Bank 0: 0, 4, 8, 12
  Bank 1: 1, 5, 9, 13
  Bank 2: 2, 6, 10, 14
  Bank 3: 3, 7, 11, 15
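To make the arithmetic concrete, here is a minimal C sketch of the miss-penalty model above; the variable names are illustrative, and the constants come straight from the slide.

#include <stdio.h>

int main(void) {
    int addr = 4, access = 24, xfer = 4, words = 4;   /* cycles; 4-word block */

    int simple      = words * (addr + access + xfer); /* one word at a time   */
    int wide        = addr + access + xfer;           /* whole block at once  */
    int interleaved = addr + access + words * xfer;   /* overlapped accesses,
                                                         serialized transfers */
    printf("simple=%d wide=%d interleaved=%d\n",
           simple, wide, interleaved);                /* prints 128, 32, 44   */
    return 0;
}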


Independent Memory Banks

• Memory banks for independent accesses vs. faster sequential accesses

  – Multiprocessor
  – I/O
  – Miss under miss, non-blocking caches

[Figure: superbanks — the address splits into (superbank number, superbank offset), and the superbank offset splits further into (bank number, bank offset); each superbank contains its own independent banks 0–3]


Avoiding Bank Conflicts

• Bank Conflicts

  – Memory references map to the same bank
  – Problem: cannot take advantage of multiple banks (which support multiple independent requests)

• Example: with 128 memory banks interleaved on a word basis, all elements of a column fall in the same memory bank — each column's elements are 512 words apart, and 512 is a multiple of 128, so x[i][j] always lands in bank j mod 128:

int x[256][512];

for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];

• SW: loop interchange, or declare the array so a row is not a power of 2 (see the sketch after this list)

• HW: a prime number of banks

  – Problem: a more complex calculation per memory access
  – Memory address = (bank number, address within bank)
  – Bank number = address mod number of banks
  – Address within bank = address / number of banks
  – A modulo and a divide per memory access
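A minimal sketch of the loop-interchange fix: with the loops swapped, consecutive references touch consecutive words, which spread across the 128 word-interleaved banks instead of piling onto one.

int x[256][512];
int i, j;

for (i = 0; i < 256; i = i + 1)        /* rows outer             */
    for (j = 0; j < 512; j = j + 1)    /* words within row inner */
        x[i][j] = 2 * x[i][j];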

Fast Bank Number

• Bank number = address mod number of banks

• Address within bank = address mod number of words in bank

=> prove that there is no ambiguity with this mapping method

• Chinese Remainder Theorem

  As long as two sets of integers ai and bi follow the rules below, and ai and aj are co-prime for i ≠ j, the integer x has only one solution (an unambiguous mapping):

  – Bank number = b0, number of banks = a0 (b0 = x mod a0)
  – Address within bank = b1, number of words in bank = a1 (b1 = x mod a1)
  – Bank number < number of banks (0 ≤ b0 < a0)
  – Address within bank < number of words in bank (0 ≤ b1 < a1)
  – Address < number of banks × number of words in bank (0 ≤ x < a0 × a1)
  – The number of banks and the number of words in a bank are co-prime (a0 and a1 are co-prime)

• An N-word address space (0 to N−1), a prime number of banks, and a power-of-2 number of words per bank satisfy these conditions:

  bi = x mod ai,   0 ≤ bi < ai,   0 ≤ x < a0 × a1 × a2 × …
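A small C check of the theorem's claim, using an assumed co-prime configuration of a0 = 3 banks and a1 = 8 words per bank: every address 0..23 lands on a distinct (bank, word) pair.

#include <assert.h>
#include <string.h>

int main(void) {
    int a0 = 3, a1 = 8;              /* co-prime: prime banks, 2^3 words */
    int seen[3][8];
    memset(seen, 0, sizeof seen);

    for (int x = 0; x < a0 * a1; x++) {
        int bank = x % a0;           /* bank number         = x mod a0 */
        int word = x % a1;           /* address within bank = x mod a1 */
        assert(!seen[bank][word]);   /* each pair must occur only once */
        seen[bank][word] = 1;
    }
    return 0;                        /* no assertion fired: unambiguous */
}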


Fast Bank Number

                  Seq. Interleaved      Modulo Interleaved
Address           Bank: 0   1   2       Bank: 0   1   2
within bank
   0                    0   1   2             0  16   8
   1                    3   4   5             9   1  17
   2                    6   7   8            18  10   2
   3                    9  10  11             3  19  11
   4                   12  13  14            12   4  20
   5                   15  16  17            21  13   5
   6                   18  19  20             6  22  14
   7                   21  22  23            15   7  23


Fast Memory Systems: DRAM specific

• Multiple accesses per RAS: several names (page mode)

  – 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns

• New DRAMs to address the processor–memory gap; what will they cost, and will they survive?

  – Synchronous DRAM: provide a clock signal to the DRAM, so transfers are synchronous to the system clock
  – RAMBUS: reinvent the DRAM interface

    • Each chip is a module rather than a slice of memory
    • Short bus between the CPU and the chips
    • Does its own refresh
    • Variable amount of data returned
    • 1 byte / 2 ns (500 MB/s per chip)

• Niche memory

  – e.g., video RAM for frame buffers: DRAM + a fast serial output


Main Memory Summary

• Wider Memory

• Interleaved Memory: for sequential or independent accesses

• Avoiding bank conflicts: SW & HW

• DRAM-specific optimizations: page mode & specialty DRAMs


Virtual Memory: Motivation

• Permit applications to grow larger than the main memory size

  – e.g., 32- or 64-bit virtual addresses vs. 28 physical address bits (256 MB)

• Automatic management

• Multiple-process management

  – Sharing
  – Protection

• Relocation

[Figure: pages A–D of a virtual address space mapped to scattered page frames in physical memory, with some pages residing on disk]


Virtual Memory Terminology

• Page or segment

– A block transferred from the disk to main memory

• Page: fixed size

• Segment: variable size

• Page fault

– The requested page is not in main memory (miss)

• Address translation or memory mapping

– Virtual to physical address


Cache vs. VM Difference

• What controls replacement?

– Cache miss: HW

– Page fault: often handled by the OS

• Size

  – VM space is determined by the address size of the CPU
  – Cache size is independent of the CPU address size

• Lower-level use

  – Cache: main memory is not shared by anything else

– VM: most of the disk contains the file system


Virtual Memory

• 4Qs for VM?

– Q1: Where can a block be placed in the upper level?

• Fully Associative

– Q2: How is a block found if it is in the upper level?

• Pages: use a page table

• Segments: segment table

– Q3: Which block should be replaced on a miss?

• LRU

– Q4: What happens on a write?

• Write Back


Page Table

• Virtual-to-physical address mapping via page table

[Figure: the virtual page number indexes the page table in main memory; the selected entry supplies the physical page frame, and the page offset passes through unchanged]

What is the size of the page table given a 28-bit virtual address, 4 KB pages, and 4 bytes per page table entry?

2^(28−12) PTEs × 2^2 bytes/PTE = 2^18 bytes = 256 KB
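The same calculation as a small C sketch, with the parameters assumed on the slide:

#include <stdio.h>

int main(void) {
    unsigned va_bits = 28, page_bits = 12, pte_bytes = 4; /* 4 KB pages */
    unsigned long entries = 1UL << (va_bits - page_bits); /* 2^16 PTEs  */
    unsigned long bytes   = entries * pte_bytes;          /* 2^18 bytes */
    printf("%lu PTEs, %lu KB\n", entries, bytes >> 10);   /* 65536, 256 */
    return 0;
}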


Inverted Page Table (HP, IBM)

• One PTE per page frame

• Pros & cons:

  – The size of the table equals the number of physical pages
  – Must search for the virtual address (using a hash)

[Figure: the virtual page number is hashed, via a hash anchor table (HAT), into the inverted page table; entries are searched for a matching VA to produce the PA, and the page offset passes through unchanged]
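A minimal sketch of an inverted-page-table lookup, assuming chained hashing; the structure and field names here are illustrative, not the actual HP/IBM formats.

typedef struct ipte {
    unsigned long vpn;   /* virtual page number held by this entry   */
    unsigned long pfn;   /* physical page frame the entry describes  */
    struct ipte *next;   /* collision chain; the HAT holds the heads */
} ipte_t;

/* Returns the page frame for vpn, or -1 to signal a page fault. */
long ipt_lookup(ipte_t **hat, unsigned long nbuckets, unsigned long vpn) {
    ipte_t *e;
    for (e = hat[vpn % nbuckets]; e != NULL; e = e->next)
        if (e->vpn == vpn)
            return (long)e->pfn;
    return -1;   /* not mapped: the OS must handle the fault */
}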


Fast Translation: Translation Look-aside Buffer (TLB)

• Cache of translated addresses

• Alpha 21064 TLB: 32-entry, fully associative

[Figure: the VA supplies a 30-bit virtual page-frame address and a 13-bit page offset; each TLB entry holds valid/read/write bits, a 30-bit tag, and a 21-bit physical page address; a 32:1 mux selects the matching entry, and the 21-bit frame concatenated with the 13-bit offset forms the 34-bit physical address]

Problem: combine caches with virtual memory
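A minimal sketch of a fully associative TLB probe like the 32-entry one above; in hardware all 32 tags are compared in parallel, which this loop only models. Field names are illustrative.

typedef struct {
    int valid;
    unsigned long vpn;   /* tag: virtual page number   */
    unsigned long pfn;   /* physical page frame number */
} tlb_entry_t;

long tlb_lookup(tlb_entry_t tlb[32], unsigned long vpn) {
    int i;
    for (i = 0; i < 32; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (long)tlb[i].pfn;  /* hit: frame + page offset = PA */
    return -1;                        /* miss: walk the page table     */
}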


TLB & Caches

[Figure: three organizations]

• Conventional organization: the CPU sends the VA to the TLB, and the resulting PA accesses the cache and then memory

• Virtually addressed cache: the CPU sends the VA straight to the cache and translates only on a miss, on the way to memory

• Overlapped organization: the cache is accessed with the VA while the TLB translates in parallel; the cache keeps PA tags, which requires the cache index to remain invariant across translation (a physically addressed L2 cache sits behind it)


Virtual Cache

• Avoid address translation before accessing cache

– Faster hit time

• Context switch

  – Flush: time to flush + “compulsory misses” from the now-empty cache
  – Or add a process identifier (PID) to the TLB

• I/O (physical addresses) must interact with the cache

  – Requires physical-to-virtual address translation

• Aliases (synonyms)

  – Two virtual addresses map to the same physical address

    • Two identical copies in the cache


Solutions for Aliases

• HW: anti-aliasing

  – Guarantee every cache block a unique physical address

• OS: page coloring

  – Guarantee that the virtual and physical addresses match in their last n bits
  – This avoids duplicate physical addresses for a block in a direct-mapped cache of size < 2^n

[Figure: with page coloring, the cache index bits and block offset fall within the matching low bits of the virtual and physical addresses (e.g., the low 18 bits), so two aliasing virtual addresses x and y map to the same cache set]
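A one-function sketch of the invariant page coloring enforces (names are illustrative): if VA and PA agree in their low n bits, any two aliases index the same set of a direct-mapped cache of size up to 2^n, so duplicate copies cannot coexist.

#include <stdbool.h>

bool colors_match(unsigned long va, unsigned long pa, unsigned n) {
    unsigned long mask = (1UL << n) - 1;   /* low n bits: index + offset */
    return (va & mask) == (pa & mask);
}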


Virtually indexed & physically tagged cache

• Use the part of the address that is not affected by address translation, the page offset, to index the cache

  – Overlap reading the tags with address translation

• Limits the cache size to the page size for a direct-mapped cache; to get a bigger cache:

  – Higher associativity
  – Page coloring

[Figure: two address breakdowns — when the index and block offset fit within the page offset, translating the page address into the address tag cannot change the cache index]
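The constraint can be stated in one line; this is a sketch under the slide's assumptions, where cache_size / associativity is the number of bytes covered by the index plus block offset, which must fit within a page.

#include <stdbool.h>

/* True if a virtually indexed, physically tagged cache of this shape
   can be indexed entirely from page-offset bits. */
bool vipt_ok(unsigned long cache_size, unsigned assoc, unsigned long page_size) {
    return cache_size / assoc <= page_size;
}

/* e.g. a 16 KB 2-way cache with 8 KB pages passes; growing capacity
   via associativity (or page coloring) keeps it legal, as noted above. */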


Selecting a Page Size

• Reasons for larger page size

  – Page table size is inversely proportional to the page size, so memory is saved
  – A fast cache hit time is easy when the cache <= page size (virtually addressed caches); a bigger page keeps this feasible as cache size grows
  – Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
  – The number of TLB entries is restricted by the clock cycle time, so a larger page size maps more memory and thereby reduces TLB misses

• Reasons for a smaller page size

  – Fragmentation: don't waste storage; data must be contiguous within a page
  – Quicker process start-up for small processes

• Hybrid solution: multiple page sizes

– Alpha: 8KB, 16KB, 32 KB, 64 KB pages (43, 47, 51, 55 virt addr bits)


Alpha VM Mapping

• The “64-bit” address is divided into 3 segments

  – seg0 (bit 63 = 0): user code/heap
  – seg1 (bit 63 = 1, bit 62 = 1): user stack
  – kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS

• Three-level page table, each level one page in size

  – 8 KB pages, 8 B PTEs
  – Alpha implements only 43 unique bits of VA
  – (a future minimum page size of up to 64 KB => 55 bits of VA)

• PTE bits: valid, kernel & user read & write enable

[Figure: the VA fields are seg0/seg1 <21>, level 1 <10>, level 2 <10>, level 3 <10>, and page offset <13>; the page table base register plus the level-1 index selects an entry pointing to the level-2 table, which points to the level-3 table, whose entry supplies the physical page-frame number that is concatenated with the page offset]
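A minimal sketch of how the 43 implemented VA bits split into the fields shown above (10 + 10 + 10 + 13); the sample address is arbitrary.

#include <stdio.h>

int main(void) {
    unsigned long va = 0x123456789aUL;    /* illustrative 43-bit VA    */
    unsigned l1  = (va >> 33) & 0x3ff;    /* level-1 index, bits 42-33 */
    unsigned l2  = (va >> 23) & 0x3ff;    /* level-2 index, bits 32-23 */
    unsigned l3  = (va >> 13) & 0x3ff;    /* level-3 index, bits 22-13 */
    unsigned off =  va        & 0x1fff;   /* page offset,   bits 12-0  */
    printf("L1=%u L2=%u L3=%u offset=%u\n", l1, l2, l3, off);
    return 0;
}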


Virtual Memory Summary

• Why virtual memory?

• Fast address translation

– Page table, TLB

• TLB & Cache

– Virtual cache vs. physical cache

– Virtually indexed and physically tagged


Cross Cutting Issues

• Superscalar CPUs & the number of cache ports

• Speculative execution and a non-faulting option on memory accesses

• Parallel execution vs. cache locality

  – Want wide separation to find independent operations vs. wanting reuse of data accesses to avoid misses; compare the four variants below:

int x[512][512];
int i, j;

/* Row-order traversal: good locality, but each operation depends on
   the previous one in the row */
for (i = 0; i < 512; i = i + 1)
    for (j = 1; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j-1];

/* Unrolled along j: still a serial dependence chain
   (cleanup of the last few columns omitted) */
for (i = 0; i < 512; i = i + 1)
    for (j = 1; j + 3 < 512; j = j + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i][j+1] = 2 * x[i][j];
        x[i][j+2] = 2 * x[i][j+1];
        x[i][j+3] = 2 * x[i][j+2];
    }

/* Column-order traversal: poor locality, but the iterations over i
   are independent */
for (j = 1; j < 512; j = j + 1)
    for (i = 0; i < 512; i = i + 1)
        x[i][j] = 2 * x[i][j-1];

/* Unrolled along i: four independent operations per iteration */
for (j = 1; j < 512; j = j + 1)
    for (i = 0; i < 512; i = i + 4) {
        x[i][j]   = 2 * x[i][j-1];
        x[i+1][j] = 2 * x[i+1][j-1];
        x[i+2][j] = 2 * x[i+2][j-1];
        x[i+3][j] = 2 * x[i+3][j-1];
    }


Cross Cutting Issues

• I/O and the consistency of data between the cache and memory

[Figure: two placements — the I/O bridge connected through the cache vs. DMA connected directly to main memory]

• I/O through the cache:

  – I/O always sees the latest data
  – but it interferes with the CPU

• DMA directly to main memory:

  – does not interfere with the CPU
  – but it might see stale data

• Output: write-through keeps memory up to date

• Input:

  – mark the buffer noncacheable
  – SW: flush the cache
  – HW: check I/O addresses against the cache on input


Pitfall: Predicting Cache Performance from a Different Program (ISA, compiler, ...)

• 4 KB data cache: miss rate 8%, 12%, or 28%?

• 1 KB instruction cache: miss rate 0%, 3%, or 10%?

• Alpha vs. MIPS for 8 KB data caches: 17% vs. 10%

[Figure: miss rate (0%–35%) vs. cache size (1–128 KB) for the data (D) and instruction (I) caches of tomcatv, gcc, and espresso]

Pitfall: Simulating Too Small an Address Trace

[Figure: cumulative average memory access time (1 to 4.5) vs. instructions executed (0 to 12 billion)]
