EI 338: Computer Systems Engineering

Academic year: 2022

(1)

EI 338: Computer Systems Engineering

(Operating Systems & Computer Architecture)

Dept. of Computer Science & Engineering Chentao Wu

[email protected]

(2)

Download lectures

• ftp://public.sjtu.edu.cn

• User: wuct

• Password: wuct123456

• http://www.cs.sjtu.edu.cn/~wuct/cse/

(3)

Chapter 2

Memory Hierarchy Design

Computer Architecture

A Quantitative Approach, Fifth Edition

(4)

Introduction

Programmers want unlimited amounts of memory with low latency

Fast memory technology is more expensive per bit than slower memory

Solution: organize memory system into a hierarchy

Entire addressable memory space available in largest, slowest memory

Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor

Temporal and spatial locality ensure that nearly all references can be found in smaller memories

Gives the illusion of a large, fast memory being presented to the processor

(5)

Memory Hierarchy

(6)

Memory Performance Gap

(7)

Memory Hierarchy Design

Memory hierarchy design becomes more crucial with recent multi-core processors:

Aggregate peak bandwidth grows with # cores:

Intel Core i7 can generate two references per core per clock

Four cores and 3.2 GHz clock

25.6 billion 64-bit data references/second + 12.8 billion 128-bit instruction references/second = 409.6 GB/s!

DRAM bandwidth is only 6% of this (25 GB/s)

Requires:

Multi-port, pipelined caches

Two levels of cache per core

Shared third-level cache on chip
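
The peak-demand figure on this slide can be reproduced with a few lines of arithmetic. The sketch below uses the slide's numbers; the variable names are illustrative, and the rate of one 128-bit instruction reference per core per clock is inferred from the 12.8 billion figure rather than stated on the slide.

```python
# Peak memory demand of a four-core Core i7, using the slide's numbers.
cores = 4
clock_hz = 3.2e9
data_refs_per_clock = 2   # 64-bit data references per core per clock
inst_refs_per_clock = 1   # 128-bit instruction references (inferred, see lead-in)

data_refs = cores * clock_hz * data_refs_per_clock   # 25.6 billion per second
inst_refs = cores * clock_hz * inst_refs_per_clock   # 12.8 billion per second

# Convert references/second to bytes/second: 8 B per data ref, 16 B per instruction ref.
demand_gb_s = (data_refs * 8 + inst_refs * 16) / 1e9
dram_gb_s = 25.0

print(f"peak demand: {demand_gb_s:.1f} GB/s")
print(f"DRAM supplies {dram_gb_s / demand_gb_s:.1%} of the peak demand")
```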

(8)

Performance and Power

High-end microprocessors have >10 MB on-chip cache

Consumes large amount of area and power budget

(9)

Memory Hierarchy Basics

When a word is not found in the cache, a miss occurs:

Fetch word from lower level in hierarchy, requiring a higher latency reference

Lower level may be another cache or the main memory

Also fetch the other words contained within the block

Takes advantage of spatial locality

Place block into cache in any location within its set, determined by address

block address MOD number of sets
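
The placement rule above can be sketched in a few lines. The function name and the cache geometry (64-byte blocks, 128 sets) are assumptions chosen only for illustration.

```python
def cache_set_index(address: int, block_size: int, num_sets: int) -> int:
    """Return the set a byte address maps to: (block address) MOD (number of sets)."""
    block_address = address // block_size
    return block_address % num_sets

# Hypothetical cache: 64-byte blocks, 128 sets.
idx_a = cache_set_index(0x1234, 64, 128)
# An address exactly (block size x number of sets) bytes later maps to the same set.
idx_b = cache_set_index(0x1234 + 64 * 128, 64, 128)
print(idx_a, idx_b)
```

Two addresses that land in the same set compete for its blocks, which is the source of the conflict misses discussed later.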

(10)

Memory Hierarchy Basics

n blocks per set => n-way set associative

Direct-mapped cache => one block per set

Fully associative => one set

Writing to cache: two strategies

Write-through

Immediately update lower levels of hierarchy

Write-back

Only update lower levels of hierarchy when an updated block is replaced

Both strategies use a write buffer to make writes asynchronous
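
To make the difference between the two write strategies concrete, the following sketch counts how many writes reach the lower level for a stream of stores. It assumes a direct-mapped, write-allocate cache and ignores the write buffer; the function name and the store trace are hypothetical.

```python
def count_lower_level_writes(write_blocks, num_lines, policy):
    """Count writes sent to the next level for a stream of stored-to block addresses.

    Models a direct-mapped, write-allocate cache with `num_lines` lines.
    `policy` is "write-through" or "write-back".
    """
    lines = {}          # line index -> (block address, dirty flag)
    lower_writes = 0
    for block in write_blocks:
        index = block % num_lines
        resident = lines.get(index)
        if resident is not None and resident[0] != block and resident[1]:
            lower_writes += 1            # write-back: a dirty block is replaced
        if policy == "write-through":
            lower_writes += 1            # every store updates the lower level
            lines[index] = (block, False)
        else:
            lines[index] = (block, True) # only mark the cached block dirty
    return lower_writes

trace = [0, 0, 0, 4, 0]   # blocks 0 and 4 share a line when num_lines == 4
wt = count_lower_level_writes(trace, 4, "write-through")
wb = count_lower_level_writes(trace, 4, "write-back")
print(wt, wb)
```

Dirty blocks still resident at the end of the trace are not flushed in this sketch, so the write-back count covers replacements only.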

(11)

Memory Hierarchy Basics

Miss rate

Fraction of cache accesses that result in a miss

Causes of misses

Compulsory

First reference to a block

Capacity

Blocks discarded because the cache cannot hold all the blocks the program needs, and later retrieved

Conflict

Program makes repeated references to multiple addresses from different blocks that map to the same location in the cache
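
One common way to separate the three causes is to compare the cache against a fully associative cache of the same capacity: a miss on a never-seen block is compulsory, a miss that the fully associative cache would also suffer is a capacity miss, and the rest are conflict misses. The sketch below follows that convention and assumes LRU replacement; the function and its parameters are illustrative.

```python
from collections import OrderedDict

def classify_misses(block_trace, num_sets, ways):
    """Classify the misses of a set-associative LRU cache into the three Cs."""
    capacity_blocks = num_sets * ways
    seen = set()                    # every block referenced so far
    fully_assoc = OrderedDict()     # reference model: same capacity, fully associative
    sets = [OrderedDict() for _ in range(num_sets)]
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        # Update the fully associative reference cache (LRU).
        fa_hit = block in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(block)
        else:
            if len(fully_assoc) == capacity_blocks:
                fully_assoc.popitem(last=False)
            fully_assoc[block] = True

        # Update the cache under study (LRU within each set).
        target = sets[block % num_sets]
        if block in target:
            target.move_to_end(block)
        else:
            if block not in seen:
                counts["compulsory"] += 1
            elif not fa_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
            if len(target) == ways:
                target.popitem(last=False)
            target[block] = True
        seen.add(block)
    return counts

# Direct-mapped cache with 4 sets: blocks 0 and 4 fight over set 0.
print(classify_misses([0, 4, 0, 4], num_sets=4, ways=1))
```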

(12)

Memory Hierarchy Basics

Note that speculative and multithreaded processors may execute other instructions during a miss

Reduces performance impact of misses

(13)

Equations on Appendix B-4

(14)

Example on B-5

(15)

Answer on B-5

(16)

Example on B-6

(17)

B-7 Q1: where can a block be placed in a cache?

(18)

B-7 Q2: how is a block found if it is in the cache?

(19)

B-9 Q2: how is a block found if it is in the cache? (contd.)

(20)

B-9 Q3: which block should be replaced on a cache miss?

(21)

B-10 Q4: what happens on a write?

(22)

B-10 Q4: what happens on a write? (contd.)

(23)

B-10 Q4: what happens on a write? (contd.)

(24)

AMD Opteron Processor

(25)

B-16 Cache Performance

(26)

B-16 Cache Performance (contd.)

(27)

(28)

B-17 Avg. Memory Access Time and Processor Performance

(29)

B-17 Avg. Memory Access Time and Processor Performance

(30)

B-20 Miss Penalty and Out-of-Order Execution Processors

(31)

B-20 Miss Penalty and Out-of-Order Execution Processors

(32)

Six basic cache optimizations

(33)

Six basic cache optimizations

(34)

Memory Hierarchy Basics

Six basic cache optimizations:

Larger block size

Reduces compulsory misses

Increases capacity and conflict misses, increases miss penalty

Larger total cache capacity to reduce miss rate

Increases hit time, increases power consumption

Higher associativity

Reduces conflict misses

Increases hit time, increases power consumption

Higher number of cache levels

Reduces overall memory access time

Giving priority to read misses over writes

Reduces miss penalty

Avoiding address translation in cache indexing

Reduces hit time
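
The effect of adding a cache level can be estimated with the standard average memory access time (AMAT) formula, hit time + miss rate x miss penalty, applied once per level. The sketch below is a minimal illustration; all cycle counts and miss rates are hypothetical, and the L2 miss rate is the local one (misses per L2 access).

```python
def amat_one_level(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, memory_penalty):
    """With an L2 cache, the L1 miss penalty is the average access time of L2."""
    l1_miss_penalty = amat_one_level(l2_hit, l2_local_miss_rate, memory_penalty)
    return amat_one_level(l1_hit, l1_miss_rate, l1_miss_penalty)

# Hypothetical numbers, in clock cycles.
without_l2 = amat_one_level(1, 0.05, 100)
with_l2 = amat_two_level(1, 0.05, 10, 0.20, 100)
print(without_l2, with_l2)
```

With these assumed numbers the second level cuts the average access time from about 6 cycles to about 2.5 cycles.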

(35)

Ten Advanced Optimizations

Small and simple first level caches

Critical timing path:

addressing tag memory, then

comparing tags, then

selecting the correct way (data item) if the cache is set associative

Direct-mapped caches can overlap tag compare and transmission of data

Lower associativity reduces power because fewer cache lines are accessed

(36)

L1 Size and Associativity

Access time vs. size and associativity

(37)

L1 Size and Associativity

Energy per read vs. size and associativity

(38)

Way Prediction

To improve hit time, predict the way to pre-set mux

Mis-prediction gives longer hit time

Prediction accuracy

> 90% for two-way

> 80% for four-way

I-cache has better accuracy than D-cache

First used on MIPS R10000 in mid-90s

Used on ARM Cortex-A8

Extend to predict block as well

“Way selection”

Increases mis-prediction penalty

(39)

Pipelining Cache

Pipeline cache access to improve bandwidth

Examples:

Pentium: 1 cycle

Pentium Pro – Pentium III: 2 cycles

Pentium 4 – Core i7: 4 cycles

Increases branch mis-prediction penalty

Makes it easier to increase associativity

(40)

Nonblocking Caches

Allow hits before previous misses complete

“Hit under miss”

“Hit under multiple miss”

L2 must support this

In general, processors can hide L1 miss penalty but not L2 miss penalty

(41)

Multibanked Caches

Organize cache as independent banks to support simultaneous access

ARM Cortex-A8 supports 1-4 banks for L2

Intel i7 supports 4 banks for L1 and 8 banks for L2

Interleave banks according to block address
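
Interleaving by block address means consecutive blocks fall into consecutive banks, so sequential accesses spread across all banks. A minimal sketch, with a hypothetical function name and a four-bank cache:

```python
def bank_of(block_address: int, num_banks: int) -> int:
    """Sequential interleaving: bank = block address MOD number of banks."""
    return block_address % num_banks

# Blocks 0..7 in a four-bank cache.
banks = [bank_of(block, 4) for block in range(8)]
print(banks)
```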

(42)

Critical Word First, Early Restart

Critical word first

Request missed word from memory first

Send it to the processor as soon as it arrives

Early restart

Request words in normal order

Send missed word to the processor as soon as it arrives

Effectiveness of these strategies depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
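
The two strategies differ only in the order in which words are requested. The sketch below compares how many word transfers pass before the processor can resume; the wrap-around order used for critical word first is an assumption, and the function names are illustrative.

```python
def fetch_order(words_per_block: int, missed_word: int, critical_word_first: bool):
    """Return the order in which the words of a block are requested from memory."""
    if critical_word_first:
        # Start at the missed word and wrap around the block (assumed order).
        return [(missed_word + i) % words_per_block for i in range(words_per_block)]
    # Early restart keeps the normal order; the processor resumes
    # as soon as the missed word arrives.
    return list(range(words_per_block))

def words_until_restart(order, missed_word):
    """Number of word transfers before the processor can continue."""
    return order.index(missed_word) + 1

cwf = fetch_order(8, 5, critical_word_first=True)
er = fetch_order(8, 5, critical_word_first=False)
print(cwf, words_until_restart(cwf, 5))
print(er, words_until_restart(er, 5))
```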

(43)

Merging Write Buffer

When storing to a block that is already pending in the write buffer, update write buffer

Reduces stalls due to full write buffer

Do not apply to I/O addresses

[Figure: write buffer contents without write merging (top) and with write merging (bottom)]
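
The benefit of merging can be seen by counting buffer entries. The sketch below assumes each entry holds one aligned block and that all stores are still pending; the function name, addresses, and 32-byte block size are hypothetical.

```python
def buffer_entries_needed(store_addresses, block_size, merging):
    """Count write-buffer entries occupied by a sequence of pending stores."""
    if not merging:
        return len(store_addresses)   # one entry per store
    # With merging, stores to the same block share a single entry.
    return len({addr // block_size for addr in store_addresses})

stores = [100, 108, 116, 124]         # four 8-byte stores to one 32-byte block
print(buffer_entries_needed(stores, 32, merging=False))
print(buffer_entries_needed(stores, 32, merging=True))
```

Fewer occupied entries means the buffer fills less often, which is why merging reduces stalls.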

(44)

Compiler Optimizations

Loop Interchange

Swap nested loops to access memory in sequential order

Blocking

Instead of accessing entire rows or columns, subdivide matrices into blocks

Requires more memory accesses but improves locality of accesses
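
Blocking can be illustrated with matrix multiplication. The sketch below restructures the loops so that each pass works on a `block` x `block` tile of `b`; it shows only the loop transformation and checks that the result is unchanged. Python lists do not expose cache behavior, so no speedup is claimed, and the function names are illustrative.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_blocked(a, b, n, block):
    """Same product, computed tile by tile to reuse a small part of b."""
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):            # tile of columns of b
        for kk in range(0, n, block):        # tile of rows of b
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    partial = 0
                    for k in range(kk, min(kk + block, n)):
                        partial += a[i][k] * b[k][j]
                    c[i][j] += partial       # accumulate partial sums per tile
    return c

n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n, block=2) == matmul_naive(a, b, n))
```

Each element of `c` is now updated once per tile of `k` values, which is the extra memory traffic the slide mentions.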

(45)

Hardware Prefetching

Fetch two blocks on miss (include next sequential block)

Pentium 4 Pre-fetching

(46)

Compiler Prefetching

Insert prefetch instructions before data is needed

Non-faulting: prefetch doesn’t cause exceptions

Register prefetch

Loads data into register

Cache prefetch

Loads data into cache

Combine with loop unrolling and software pipelining

(47)

Summary

(48)

Memory Technology

Performance metrics

Latency is the main concern of caches

Bandwidth is the main concern of multiprocessors and I/O

Access time

Time between read request and when desired word arrives

Cycle time

Minimum time between unrelated requests to memory

DRAM used for main memory, SRAM used for cache

(49)

Memory Technology

SRAM

Requires low power to retain bit

Requires 6 transistors/bit

DRAM

Must be re-written after being read

Must also be periodically refreshed

Every ~ 8 ms

All bits in a row are refreshed simultaneously

One transistor/bit

Address lines are multiplexed:

Upper half of address: row access strobe (RAS)

Lower half of address: column access strobe (CAS)
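
Address multiplexing can be sketched as a simple bit split: the upper bits are sent as the row address and the lower bits as the column address over the same pins. The function name and the 4-bit field widths are assumptions for illustration.

```python
def split_dram_address(address: int, row_bits: int, col_bits: int):
    """Split a DRAM cell address into the row (sent with RAS) and column (sent with CAS)."""
    column = address & ((1 << col_bits) - 1)
    row = (address >> col_bits) & ((1 << row_bits) - 1)
    return row, column

# Toy 8-bit address: upper 4 bits select the row, lower 4 bits the column.
row, column = split_dram_address(0b1011_0110, row_bits=4, col_bits=4)
print(row, column)
```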

(50)

Memory Technology

Amdahl:

Memory capacity should grow linearly with processor speed

Unfortunately, memory capacity and speed have not kept pace with processors

Some optimizations:

Multiple accesses to same row

Synchronous DRAM

Added clock to DRAM interface

Burst mode with critical word first

Wider interfaces

Double data rate (DDR)

Multiple banks on each DRAM device

(51)

Memory Optimizations

(52)

Memory Optimizations

(53)

Memory Optimizations

DDR:

DDR2

Lower power (2.5 V -> 1.8 V)

Higher clock rates (266 MHz, 333 MHz, 400 MHz)

DDR3

1.5 V

800 MHz

DDR4

1-1.2 V

1600 MHz

GDDR5 is graphics memory based on DDR3
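
The clock rates above translate into peak bandwidth once the two transfers per clock of DDR are taken into account. The sketch below assumes a standard 64-bit (8-byte) DIMM data bus; the function name is illustrative.

```python
def dimm_peak_mb_per_s(clock_mhz: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth = clock rate x 2 transfers per clock x bus width in bytes."""
    return clock_mhz * 2 * bus_bytes

# Clock rates taken from the slide; bandwidth in MB/s.
for name, clock_mhz in [("DDR2", 400), ("DDR3", 800), ("DDR4", 1600)]:
    print(name, clock_mhz, "MHz ->", dimm_peak_mb_per_s(clock_mhz), "MB/s")
```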

(54)

Memory Optimizations

Graphics memory:

Achieve 2-5 X bandwidth per DRAM vs. DDR3

Wider interfaces (32 vs. 16 bit)

Higher clock rate

Possible because they are attached via soldering instead of socketed DIMM modules

Reducing power in SDRAMs:

Lower voltage

Low power mode (ignores clock, continues to refresh)

(55)

Memory Power Consumption

(56)

Flash Memory

Type of EEPROM

Must be erased (in blocks) before being overwritten

Nonvolatile

Limited number of write cycles

Cheaper than SDRAM, more expensive than disk

Slower than SRAM, faster than disk

(57)

Memory Dependability

Memory is susceptible to cosmic rays

Soft errors: dynamic errors

Detected and fixed by error correcting codes (ECC)

Hard errors: permanent errors

Use spare rows to replace defective rows

Chipkill: a RAID-like error recovery technique
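
How an error correcting code fixes a soft error can be shown with a toy Hamming(7,4) code, which corrects any single flipped bit. This is only an illustration of the principle: memory ECC in practice uses wider codes over whole words, and the function names here are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, error position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome spells out the flipped position
    if position:
        c[position - 1] ^= 1
    return c, position

stored = hamming74_encode([1, 0, 1, 1])
corrupted = list(stored)
corrupted[4] ^= 1                      # simulate a soft error in position 5
repaired, where = hamming74_correct(corrupted)
print(repaired == stored, where)
```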

(58)

Virtual Memory

Protection via virtual memory

Keeps processes in their own memory space

Role of architecture:

Provide user mode and supervisor mode

Protect certain aspects of CPU state

Provide mechanisms for switching between user mode and supervisor mode

Provide mechanisms to limit memory accesses

Provide TLB to translate addresses

(59)

Virtual Machines

Supports isolation and security

Sharing a computer among many unrelated users

Enabled by raw speed of processors, making the overhead more acceptable

Allows different ISAs and operating systems to be presented to user programs

“System Virtual Machines”

SVM software is called “virtual machine monitor” or “hypervisor”

Individual virtual machines that run under the monitor are called “guest VMs”

(60)

Impact of VMs on Virtual Memory

Each guest OS maintains its own set of page tables

VMM adds a level of memory between physical and virtual memory called “real memory”

VMM maintains shadow page table that maps guest virtual addresses to physical addresses

Requires VMM to detect guest’s changes to its own page table

Occurs naturally if accessing the page table pointer is a privileged operation
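
A shadow page table is, in effect, the composition of two mappings: the guest's table (guest virtual page to real page) and the VMM's table (real page to physical page). The sketch below uses plain dictionaries and made-up page numbers; it omits permissions, faults, and the trapping that keeps the shadow table in sync.

```python
def build_shadow_table(guest_page_table, vmm_real_to_physical):
    """Compose guest-virtual -> real with real -> physical page mappings."""
    return {
        vpage: vmm_real_to_physical[rpage]
        for vpage, rpage in guest_page_table.items()
        if rpage in vmm_real_to_physical   # skip real pages the VMM has not backed
    }

guest_page_table = {0: 7, 1: 3}          # guest virtual page -> real page
vmm_real_to_physical = {3: 42, 7: 19}    # real page -> physical page
shadow = build_shadow_table(guest_page_table, vmm_real_to_physical)
print(shadow)

# When the guest changes its page table, the VMM must rebuild the affected entry.
guest_page_table[1] = 7
shadow = build_shadow_table(guest_page_table, vmm_real_to_physical)
print(shadow)
```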

(61)

Example on Page 80

(62)

Example on Page 82

(63)

Example on Page 83

(64)

Example on Page 83 (contd.)

(65)

Homework

2.8, B.1
