Academic year: 2021

(1)

Chapter 2

Memory Hierarchy Design

(2)

Programmers want unlimited amounts of memory with low latency

Fast memory technology is more expensive per bit than slower memory

Solution: organize memory system into a hierarchy

Entire addressable memory space available in largest, slowest memory

Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor

Temporal and spatial locality ensure that nearly all references can be found in smaller memories

Gives the illusion of a large, fast memory being presented to the processor

Introduction

(5)

Memory hierarchy design becomes more crucial with recent multi-core processors:

Aggregate peak bandwidth grows with the number of cores:

Intel Core i7 can generate two references per core per clock

Four cores and 3.2 GHz clock

25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!

DRAM bandwidth is only 8% of this (34.1 GB/s)

Requires:

Multi-port, pipelined caches

Two levels of cache per core

Shared third-level cache on chip
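The Core i7 bandwidth arithmetic on this slide can be reproduced with a short C sketch. The function name and parameters are illustrative and not from the slides. It assumes 1 GB = 10^9 bytes and, to match the 12.8 billion figure, one 128-bit instruction reference per core per clock.

```c
#include <assert.h>

/* Peak demand bandwidth in GB/s (1 GB = 1e9 bytes) for a multicore chip.
 * Parameter names are hypothetical, chosen for this illustration. */
double peak_demand_gbps(int cores, double clock_ghz,
                        double data_refs_per_clock, int data_ref_bytes,
                        double inst_refs_per_clock, int inst_ref_bytes)
{
    assert(cores > 0 && clock_ghz > 0.0);
    /* References per second, in billions, summed over all cores. */
    double data_grefs = cores * clock_ghz * data_refs_per_clock;
    double inst_grefs = cores * clock_ghz * inst_refs_per_clock;
    /* Billions of references times bytes per reference gives GB/s. */
    return data_grefs * data_ref_bytes + inst_grefs * inst_ref_bytes;
}
```

Calling `peak_demand_gbps(4, 3.2, 2.0, 8, 1.0, 16)` gives about 409.6 GB/s, so 34.1 GB/s of DRAM bandwidth covers roughly 8% of peak demand.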

(6)

High-end microprocessors have >10 MB on-chip cache

Consumes a large share of the area and power budget

(7)

When a word is not found in the cache, a miss occurs:

Fetch word from lower level in hierarchy, requiring a higher latency reference

Lower level may be another cache or the main memory

Also fetch the other words contained within the block

Takes advantage of spatial locality

Place the block anywhere within its set; the set is determined by the address:

block address MOD number of sets in cache
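As a minimal sketch of the placement rule above, the following C function splits a byte address into tag, set index, and block offset. The struct and function names are hypothetical, and the block size and set count are passed in as parameters rather than taken from any particular processor.

```c
#include <assert.h>
#include <stdint.h>

/* Where a byte address lands in a set-associative cache. */
typedef struct {
    uint64_t tag;     /* distinguishes blocks that share a set */
    uint64_t set;     /* (block address) MOD (number of sets) */
    uint64_t offset;  /* byte position inside the block */
} cache_location;

cache_location locate(uint64_t addr, uint64_t block_bytes, uint64_t num_sets)
{
    assert(block_bytes > 0 && num_sets > 0);
    uint64_t block_addr = addr / block_bytes;
    cache_location loc;
    loc.offset = addr % block_bytes;
    loc.set = block_addr % num_sets;
    loc.tag = block_addr / num_sets;
    return loc;
}
```

With 64-byte blocks and 128 sets, addresses 4660 and 12852 both fall in set 72 but carry different tags (0 and 1), so they compete for the same set.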

(8)

n blocks per set => n-way set associative

Direct-mapped cache => one block per set

Fully associative => one set

Writing to cache: two strategies

Write-through

Immediately update lower levels of hierarchy

Write-back

Only update lower levels of hierarchy when an updated block is replaced

Both strategies use a write buffer to make writes asynchronous

(9)

Miss rate

Fraction of cache accesses that result in a miss

Causes of misses

Compulsory

First reference to a block

Capacity

Blocks discarded because the cache cannot hold all the blocks the program needs, and later retrieved

Conflict

Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
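To make the conflict-miss case concrete, here is a small C sketch that counts misses in a direct-mapped cache for a stream of block addresses. It is an illustration only: the function name, the `MAX_SETS` limit, and the access patterns in the note below are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SETS 64   /* hypothetical upper bound for this sketch */

/* Count misses for a stream of block addresses in a direct-mapped
 * cache (one block per set). The cache starts empty. */
size_t count_misses_direct_mapped(const uint64_t *blocks, size_t n,
                                  size_t num_sets)
{
    assert(num_sets > 0 && num_sets <= MAX_SETS);
    uint64_t stored[MAX_SETS];   /* block currently held by each set */
    int valid[MAX_SETS] = {0};   /* 0 until the set is first filled */
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        size_t set = (size_t)(blocks[i] % num_sets);
        if (!valid[set] || stored[set] != blocks[i]) {
            misses++;                 /* compulsory or conflict miss */
            stored[set] = blocks[i];  /* replace the resident block */
            valid[set] = 1;
        }
    }
    return misses;
}
```

With 8 sets, alternating between blocks 0 and 8 misses on every access because both map to set 0, while alternating between blocks 0 and 1 incurs only the two compulsory misses.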

(10)

Speculative and multithreaded processors may execute other instructions during a miss

Reduces performance impact of misses

(11)

Six basic cache optimizations:

Larger block size

Reduces compulsory misses

Increases capacity and conflict misses, increases miss penalty

Larger total cache capacity to reduce miss rate

Increases hit time, increases power consumption

Higher associativity

Reduces conflict misses

Increases hit time, increases power consumption

Higher number of cache levels

Reduces overall memory access time

Giving priority to read misses over writes

Reduces miss penalty

Avoiding address translation in cache indexing

Reduces hit time
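Each of these optimizations targets one term of the standard average memory access time formula: hit time + miss rate × miss penalty. The C sketch below evaluates it for one and two cache levels. The function names are hypothetical, and any numbers plugged in are for illustration rather than measurements.

```c
#include <assert.h>

/* Average memory access time for one cache level:
 * hit time + miss rate * miss penalty (all times in the same unit). */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Two levels: the L1 miss penalty is itself the AMAT of L2.
 * l2_local_miss_rate is measured over accesses that reach L2. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_local_miss_rate,
                      double mem_penalty)
{
    return amat(l1_hit, l1_miss_rate,
                amat(l2_hit, l2_local_miss_rate, mem_penalty));
}
```

For example, with an assumed 1-cycle L1 hit, 5% L1 miss rate, and 100-cycle memory penalty, a single level gives 6 cycles; adding an L2 with a 10-cycle hit time and 20% local miss rate brings this down to 2.5 cycles.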

(12)

Performance metrics

Latency is the concern of caches

Bandwidth is the concern of multiprocessors and I/O

Access time

Time between read request and when desired word arrives

Cycle time

Minimum time between unrelated requests to memory

SRAM has low latency; use for cache

Organize DRAM chips into many banks for high bandwidth, use for main memory

Memory Technology and Optimizations

(13)

SRAM

Requires low power to retain bit

Requires 6 transistors/bit

DRAM

Must be re-written after being read

Must also be periodically refreshed

Every ~8 ms (roughly 5% of time)

All bits in a row are refreshed simultaneously

One transistor/bit

Address lines are multiplexed:

Upper half of address: row access strobe (RAS)

Lower half of address: column access strobe (CAS)

(15)

Amdahl:

Memory capacity should grow linearly with processor speed

Unfortunately, memory capacity and speed have not kept pace with processors

Some optimizations:

Multiple accesses to same row

Synchronous DRAM

Added clock to DRAM interface

Burst mode with critical word first

Wider interfaces

Double data rate (DDR)

Multiple banks on each DRAM device

(18)

DDR:

DDR2

Lower power (2.5 V -> 1.8 V)

Higher clock rates (266 MHz, 333 MHz, 400 MHz)

DDR3

1.5 V

800 MHz

DDR4

1-1.2 V

1333 MHz

GDDR5 is graphics memory based on DDR3
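The clock rates listed above translate into peak transfer rates because a double data rate interface moves data on both clock edges. A minimal C sketch follows; the function name is hypothetical, and the 8-byte (64-bit) bus width used in the note is an assumption about a standard DIMM, not a figure from the slide.

```c
#include <assert.h>

/* Peak transfer rate in MB/s (1 MB = 1e6 bytes) of a double data rate
 * interface: two transfers per clock, bus_bytes per transfer. */
double ddr_peak_mbps(double clock_mhz, int bus_bytes)
{
    assert(clock_mhz > 0.0 && bus_bytes > 0);
    return clock_mhz * 2.0 * bus_bytes;
}
```

Under that assumption, `ddr_peak_mbps(800.0, 8)` gives 12,800 MB/s for the 800 MHz DDR3 entry, and `ddr_peak_mbps(400.0, 8)` gives 6,400 MB/s for the fastest DDR2 clock listed.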

(19)

Reducing power in SDRAMs:

Lower voltage

Low power mode (ignores clock, continues to refresh)

Graphics memory:

Achieve 2-5 X bandwidth per DRAM vs. DDR3

Wider interfaces (32 vs. 16 bit)

Higher clock rate

Possible because they are attached via soldering instead of socketed DIMM modules

(21)

Stacked DRAMs in same package as processor

High Bandwidth Memory (HBM)

(22)

Flash memory: a type of EEPROM

Types: NAND (denser) and NOR (faster)

NAND Flash:

Reads are sequential, reads entire page (0.5 to 4 KiB)

25 µs for first byte, 40 MiB/s for subsequent bytes

SDRAM: 40 ns for first byte, 4.8 GB/s for subsequent bytes

2 KiB transfer: 75 µs vs. 500 ns for SDRAM, 150X slower

300 to 500X faster than magnetic disk
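The 2 KiB comparison above follows from a simple latency-plus-streaming model, sketched here in C. The function name is hypothetical, and the model assumes a fixed first-byte latency followed by a constant transfer rate, which is a simplification.

```c
#include <assert.h>

/* Time in nanoseconds to read `bytes` from a device with a fixed
 * first-byte latency followed by streaming at a constant rate. */
double transfer_ns(double first_byte_ns, double bytes_per_second,
                   double bytes)
{
    assert(bytes_per_second > 0.0 && bytes >= 0.0);
    return first_byte_ns + bytes / bytes_per_second * 1e9;
}
```

Plugging in the slide's figures, `transfer_ns(25000.0, 40.0 * 1048576.0, 2048.0)` is about 74 µs for NAND Flash and `transfer_ns(40.0, 4.8e9, 2048.0)` is about 467 ns for SDRAM, which the slide rounds to 75 µs, 500 ns, and a 150X gap.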

(23)

Must be erased (in blocks) before being overwritten

Nonvolatile, can use as little as zero power

Limited number of write cycles (~100,000)

$2/GiB, compared to $20-40/GiB for SDRAM and $0.09/GiB for magnetic disk

Phase-Change/Memristor Memory

Possibly 10X improvement in write performance and 2X improvement in read performance

(24)

Memory is susceptible to cosmic rays

Soft errors: dynamic errors

Detected and fixed by error correcting codes (ECC)

Hard errors: permanent errors

Use spare rows to replace defective rows

Chipkill: a RAID-like error recovery technique

(25)

Reduce hit time

Small and simple first-level caches

Way prediction

Increase bandwidth

Pipelined caches, multibanked caches, non-blocking caches

Reduce miss penalty

Critical word first, merging write buffers

Reduce miss rate

Compiler optimizations

Reduce miss penalty or miss rate via parallelization

Hardware or compiler prefetching

Advanced Optimizations

(26)

[Figure: access time vs. cache size and associativity]

(27)

[Figure: energy per read vs. cache size and associativity]

(28)

To improve hit time, predict the way so the multiplexer can be pre-set

Mis-prediction gives longer hit time

Prediction accuracy

> 90% for two-way

> 80% for four-way

I-cache has better accuracy than D-cache

First used on MIPS R10000 in mid-90s

Used on ARM Cortex-A8

Extend to predict block as well

“Way selection”

Increases mis-prediction penalty

(29)

Pipeline cache access to improve bandwidth

Examples:

Pentium: 1 cycle

Pentium Pro – Pentium III: 2 cycles

Pentium 4 – Core i7: 4 cycles

Increases branch mis-prediction penalty

Makes it easier to increase associativity

(30)

Organize cache as independent banks to support simultaneous access

ARM Cortex-A8 supports 1-4 banks for L2

Intel i7 supports 4 banks for L1 and 8 banks for L2

Interleave banks according to block address
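One common way to interleave by block address is sequential interleaving, where consecutive blocks go to consecutive banks. The C sketch below shows that mapping; the function name is hypothetical and the slide does not specify which interleaving any particular processor uses.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential interleaving: bank = (block address) MOD (number of banks),
 * so consecutive block addresses land in consecutive banks. */
unsigned bank_of(uint64_t addr, uint64_t block_bytes, unsigned num_banks)
{
    assert(block_bytes > 0 && num_banks > 0);
    return (unsigned)((addr / block_bytes) % num_banks);
}
```

With 64-byte blocks and 4 banks, byte addresses 0, 64, 128, and 192 map to banks 0, 1, 2, and 3, and address 256 wraps back to bank 0, so a sequential stream spreads evenly across the banks.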

(31)

Allow hits before previous misses complete

“Hit under miss”

“Hit under multiple miss”

L2 must support this

In general, processors can hide L1 miss penalty but not L2 miss penalty

(32)

Critical word first

Request missed word from memory first

Send it to the processor as soon as it arrives

Early restart

Request words in normal order

Send missed word to the processor as soon as it arrives

Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched

(33)

When storing to a block that is already pending in the write buffer, update write buffer

Reduces stalls due to full write buffer

Does not apply to I/O addresses

[Figure: no write buffering vs. write buffering]

(34)

Loop Interchange

Swap nested loops to access memory in sequential order

Blocking

Instead of accessing entire rows or columns, subdivide matrices into blocks

Requires more memory accesses but improves locality of accesses
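Loop interchange can be illustrated with a small C sketch. C stores two-dimensional arrays row by row, so making the column index the inner loop turns a strided walk into a sequential one. The array sizes and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 6 };   /* hypothetical sizes for illustration */

/* Before: the inner loop steps down a column, so consecutive accesses
 * are COLS elements apart in memory. */
void scale_before(double x[ROWS][COLS], double factor)
{
    assert(x != NULL);
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            x[i][j] = factor * x[i][j];
}

/* After interchange: the inner loop walks along a row, so consecutive
 * accesses touch adjacent elements. The result is identical. */
void scale_after(double x[ROWS][COLS], double factor)
{
    assert(x != NULL);
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            x[i][j] = factor * x[i][j];
}
```

Both functions compute the same values; only the order of memory accesses differs, which is why a compiler can apply the transformation when the loop body has no dependences that forbid it.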

(35)

/* Unblocked matrix multiply: x = y * z, all N x N */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

(36)

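/* Blocked version with blocking factor B; x must be zeroed beforehand */
/* min(a, b) is assumed to be a macro or helper returning the smaller value */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B,N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B,N); k = k + 1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }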

(37)

Fetch two blocks on miss (include next sequential block)

[Figure: Pentium 4 prefetching]

(38)

Insert prefetch instructions before data is needed

Non-faulting: prefetch doesn’t cause exceptions

Register prefetch

Loads data into register

Cache prefetch

Loads data into cache

Combine with loop unrolling and software pipelining
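A cache prefetch can be sketched in C with `__builtin_prefetch`, a builtin provided by GCC and Clang rather than standard C. The function name and the prefetch distance are hypothetical; a useful distance depends on the machine's miss latency and the work per iteration.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* hypothetical; tune per machine */

/* Sum an array while requesting data a fixed distance ahead.
 * __builtin_prefetch(addr, 0, 3) hints a read (0) with high temporal
 * locality (3); it is only a hint and does not change the result. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

For a simple sequential sum like this, hardware prefetchers often do the job already; explicit prefetches tend to matter more for irregular access patterns. Unrolling the loop would let one prefetch cover a whole cache block instead of issuing one per element.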

(39)

Use in-package HBM as a large cache: 128 MiB to 1 GiB

Smaller blocks require substantial tag storage

Larger blocks are potentially inefficient

One approach (L-H):

Each SDRAM row is a block index

Each row contains set of tags and 29 data segments

29-set associative

Hit requires a CAS

(40)

Another approach (Alloy cache):

Mold tag and data together

Use direct mapped

Both schemes require two DRAM accesses for misses

Two solutions:

Use map to keep track of blocks

Predict likely misses

(43)

Protection via virtual memory

Keeps processes in their own memory space

Role of architecture

Provide user mode and supervisor mode

Protect certain aspects of CPU state

Provide mechanisms for switching between user mode and supervisor mode

Provide mechanisms to limit memory accesses

Provide TLB to translate addresses

Virtual Memory and Virtual Machines

(44)

Supports isolation and security

Sharing a computer among many unrelated users

Enabled by raw speed of processors, making the overhead more acceptable

Allows different ISAs and operating systems to be presented to user programs

“System Virtual Machines”

SVM software is called “virtual machine monitor” or “hypervisor”

Individual virtual machines run under the monitor are called “guest VMs”

(45)

Guest software should:

Behave as if running on native hardware

Not be able to change allocation of real system resources

VMM should be able to “context switch” guests

Hardware must allow:

System and user processor modes

Privileged subset of instructions for allocating system resources

(46)

Each guest OS maintains its own set of page tables

VMM adds a level of memory between physical and virtual memory called “real memory”

VMM maintains shadow page table that maps guest virtual addresses to physical addresses

Requires VMM to detect guest’s changes to its own page table

Occurs naturally if accessing the page table pointer is a privileged operation
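The shadow page table is the composition of two mappings: the guest's table (guest virtual to real) and the VMM's table (real to physical). The C sketch below models each table as a flat array of page numbers, which is a deliberate simplification. The names and the tiny four-page address space are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_PAGES = 4 };   /* hypothetical tiny address space */

/* guest_pt: guest virtual page -> real page (maintained by guest OS)
 * vmm_pt:   real page -> physical page (maintained by the VMM)
 * shadow:   guest virtual page -> physical page (used by hardware) */
void rebuild_shadow(const int guest_pt[NUM_PAGES],
                    const int vmm_pt[NUM_PAGES],
                    int shadow[NUM_PAGES])
{
    for (size_t v = 0; v < NUM_PAGES; v++) {
        int real = guest_pt[v];
        assert(real >= 0 && real < NUM_PAGES);
        shadow[v] = vmm_pt[real];   /* compose the two translations */
    }
}
```

In this model the VMM would call `rebuild_shadow` (or update the affected entry) whenever it traps a guest change to its page table, which is why detecting those changes is required.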

(47)

Objectives:

Avoid flushing TLB

Use nested page tables instead of shadow page tables

Allow devices to use DMA to move data

Allow guest OS’s to handle device interrupts

For security: allow programs to manage encrypted portions of code and data

(48)

Predicting cache performance of one program from another

Simulating enough instructions to get accurate performance measures of the memory hierarchy

Not delivering high memory bandwidth in a cache-based system
