
Fundamentals of Computer Systems

Caches

Stephen A. Edwards and

Martha A. Kim

Columbia University

Spring 2012

Illustrations Copyright © 2007 Elsevier

(2)

Computer Systems

Performance depends on whichever is slower:
the processor or the memory system.

[Figure: processor and memory connected by Address, WriteData, and ReadData buses, with MemWrite, WE, and CLK control signals]

(3)

Memory Speeds Haven’t Kept Up

[Figure: processor vs. memory performance, 1980–2005, log scale from 1 to 100,000; the CPU curve pulls far ahead of the memory curve]

Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.

(4)

Your Choice of Memories

                 Fast   Cheap   Large
On-Chip SRAM      ✔       ✔
Commodity DRAM            ✔       ✔
Supercomputer     ✔               ✔

(5)

Memory Hierarchy

Fundamental trick to making a big memory appear fast.

Technology   Cost ($/Gb)   Access Time (ns)   Density (Gb/cm²)
SRAM              30 000          0.5              0.00025
DRAM                  10          100               1 – 16
Flash                  2          300               8 – 32
Hard Disk            0.1   10 000 000           500 – 2000

Read speed; writing much, much slower.

(6)

A Modern Memory Hierarchy

AMD Phenom 9600: quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

My desktop machine:

Level            Size     Tech.
L1 Instruction   64 K     SRAM      (per core)
L1 Data          64 K     SRAM      (per core)
L2               512 K    SRAM      (per core)
L3               2 MB     SRAM
Memory           4 GB     DRAM
Disk             500 GB   Magnetic

(7)

Temporal Locality

What path do your eyes take when you read this?
Did you look at the drawings more than once?

[Figure: an illustrated page from Euclid’s Elements]

(8)

Spatial Locality

If you need something, you may also need something nearby

(9)

Memory Performance

Hit: data is found in that level of the memory hierarchy.
Miss: data is not found; look in the next level.

Hit Rate = Number of hits / Number of accesses

Miss Rate = Number of misses / Number of accesses

Hit Rate + Miss Rate = 1

The expected access time E_L for a memory level L with latency t_L and miss rate M_L:

E_L = t_L + M_L · E_{L+1}
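The recurrence above can be checked with a short Python sketch (not part of the original slides); the function name and argument layout are my own:

```python
def expected_access_time(latencies, miss_rates):
    """Expected access time for a multi-level memory hierarchy.

    latencies[i]  = t_i, access latency of level i (cycles)
    miss_rates[i] = M_i, miss rate of level i; the last level
                    is assumed to always hit (miss rate 0).
    Implements E_L = t_L + M_L * E_{L+1}, innermost level last.
    """
    e = latencies[-1]  # last level always hits, so E = t there
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates[:-1])):
        e = t + m * e
    return e

# Two-level example from the next slide: 1-cycle cache with a
# 25% miss rate in front of 100-cycle main memory.
print(expected_access_time([1, 100], [0.25, 0.0]))  # → 26.0
```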

(10)

Memory Performance Example

Two-level hierarchy: cache and main memory.
A program executes 1000 loads & stores; 750 of these are found in the cache.
What are the cache hit and miss rates?

Hit Rate = 750/1000 = 75%
Miss Rate = 1 − 0.75 = 25%

If the cache takes 1 cycle and the main memory 100, what’s the expected access time?

Expected access time of main memory: E_1 = 100 cycles
Access time for the cache: t_0 = 1 cycle
Cache miss rate: M_0 = 0.25

E_0 = t_0 + M_0 · E_1 = 1 + 0.25 · 100 = 26 cycles


(13)

Cache

Highest levels of the memory hierarchy.
Fast: level 1 typically has a 1-cycle access time.
With luck, supplies most data.

Cache design questions:

What data does it hold?   Recently accessed data
How is data found?        A simple address hash
What data is replaced?    Often the oldest

(14)

What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.
Real cache: never that smart.

Caches Exploit Temporal Locality
Copy newly accessed data into the cache, replacing the oldest if necessary.

Spatial Locality
Copy nearby data into the cache at the same time. Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.

(15)

A Direct-Mapped Cache

[Figure: a 2^30-word main memory (word addresses 0x00000000 through 0xFFFFFFFC) mapped onto a 2^3-word cache of eight sets, Set 0 (000) through Set 7 (111); each memory address maps to exactly one set]

This simple cache has

• 8 sets
• 1 block per set
• 4 bytes per block

To simplify answering “is this memory in the cache?,” each byte is mapped to exactly one set.

(16)

Direct-Mapped Cache Hardware

[Figure: an 8-entry × (1+27+32)-bit SRAM holding a valid bit, 27-bit tag, and 32-bit data word per set; the set field selects a row, a comparator checks the tag, and V AND tag-match produces Hit]

Address bits:

0–1: byte within block
2–4: set number
5–31: block “tag”

Cache hit if, in the set selected by the address,

• the block is valid (V = 1), and
• the tag (address bits 5–31) matches.
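The address split above can be sketched in a few lines of Python (not from the slides; the function name is my own):

```python
def split_address(addr):
    """Split a 32-bit address for the slides' direct-mapped cache:
    8 sets, 1 block per set, 4-byte blocks.
    Bits 0-1: byte offset, bits 2-4: set number, bits 5-31: tag."""
    byte_offset = addr & 0x3          # bits 0-1
    set_index = (addr >> 2) & 0x7     # bits 2-4
    tag = addr >> 5                   # bits 5-31
    return tag, set_index, byte_offset

# 0x4 and 0x24 land in the same set with different tags,
# which is exactly what makes them conflict later.
print(split_address(0x4))   # → (0, 1, 0)
print(split_address(0x24))  # → (1, 1, 0)
```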

(17)

Direct-Mapped Cache Behavior

A dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure: cache state when reading 0x4 the last time — Set 1 (001), Set 2 (010), and Set 3 (011) are valid with tag 00...00, holding mem[0x00...04], mem[0x00...08], and mem[0x00...0C]; all other sets are empty]

Assuming the cache starts empty, what’s the miss rate?

Access: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result: M M M H H H H H H H H H H H H

3/15 = 0.2 = 20%
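This trace can be replayed with a minimal direct-mapped cache model in Python (my own sketch, not from the slides; only tags need tracking to count misses):

```python
def direct_mapped_misses(addresses, num_sets=8, block_bytes=4):
    """Count misses in a direct-mapped cache (one block per set)
    by remembering which tag each set currently holds."""
    held = {}  # set index -> tag currently stored there
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        idx, tag = block % num_sets, block // num_sets
        if held.get(idx) != tag:   # miss: fetch and replace
            misses += 1
            held[idx] = tag
    return misses

trace = [0x4, 0xC, 0x8] * 5
print(direct_mapped_misses(trace))  # → 3 (miss rate 3/15 = 20%)
```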

(18)

Direct-Mapped Cache Behavior

A dumb loop:

repeat 5 times load from 0x4;

load from 0xC;

load from 0x8.

li $t0, 5

l1: beq $t0, $0, done lw $t1, 0x4($0) lw $t2, 0xC($0) lw $t3, 0x8($0) addiu $t0, $t0, -1

j l1

done:

Data Tag V

00...00

1 mem[0x00...04]

0 0 0

0 0

00

Tag Set Byte Offset

Memory Address

V

3

001 00...00

1 00...00 00...00 1

mem[0x00...0C]

mem[0x00...08]

Set 7 (111) Set 6 (110) Set 5 (101) Set 4 (100) Set 3 (011) Set 2 (010) Set 1 (001) Set 0 (000)

When two recently accessed addresses map to the same cache block,

Cache when reading 0x4 last time

Assuming the cache starts empty, what’s the miss rate?

4 C 8 4 C 8 4 C 8 4 C 8 4 C 8 M M M H H H H H H H H H H H H 3/15 = 0.2 = 20%

(19)

Direct-Mapped Cache: Conflict

A dumber loop: repeat 5 times: load from 0x4; load from 0x24.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0x24($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure: cache state — 0x4 (tag 00...00) and 0x24 (tag 00...01) both map to Set 1 (001), so each load evicts the other]

Assuming the cache starts empty, what’s the miss rate?

Access: 4 24 4 24 4 24 4 24 4 24
Result: M M  M M  M M  M M  M M

10/10 = 1 = 100% Oops

These are conflict misses: two recently accessed addresses map to the same cache block and repeatedly evict each other.

(20)

Direct-Mapped Cache: Conflict

A dumber loop:

repeat 5 times load from 0x4;

load from 0x24

li $t0, 5

l1: beq $t0, $0, done lw $t1, 0x4($0) lw $t2, 0x24($0) addiu $t0, $t0, -1

j l1

done:

Data Tag

V

00...00

1 mem[0x00...04]

0 0 0

0 0

00

Tag Set Byte Offset

Memory Address

V

3

001 00...01

0 0

Set 7 (111) Set 6 (110) Set 5 (101) Set 4 (100) Set 3 (011) Set 2 (010) Set 1 (001) Set 0 (000)

mem[0x00...24]

Cache State Assuming the cache starts empty, what’s the miss rate?

4 24 4 24 4 24 4 24 4 24 M M M M M M M M M M 10/10 = 1 = 100% Oops These are conflict misses

(21)

No Way! Yes Way! 2-Way Set Associative Cache

[Figure: four sets, each with two ways; every way holds a valid bit, a 28-bit tag, and a 32-bit data word; comparators check both ways’ tags in parallel, Hit1 and Hit0 combine into Hit, and Hit1 selects which way’s data is returned]

Address bits:

0–1: byte within block
2–3: set number
4–31: block “tag” (28 bits)

(22)

2-Way Set Associative Behavior

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0x24($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure: final cache state — Set 1 holds mem[0x00...24] (tag 00...10) in Way 1 and mem[0x00...04] (tag 00...00) in Way 0; all other sets are empty]

Assuming the cache starts empty, what’s the miss rate?

Access: 4 24 4 24 4 24 4 24 4 24
Result: M M  H H  H H  H H  H H

2/10 = 0.2 = 20%

Associativity reduces conflict misses.
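The same conflicting trace can be replayed against a 2-way model in Python (my own sketch, not from the slides; it assumes LRU replacement, which matches the slides’ “replace the oldest” policy for this trace):

```python
from collections import deque

def two_way_misses(addresses, num_sets=4, block_bytes=4):
    """2-way set-associative cache with LRU replacement: each set
    holds up to two tags, most recently used kept at the right."""
    sets = {i: deque(maxlen=2) for i in range(num_sets)}
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        idx, tag = block % num_sets, block // num_sets
        ways = sets[idx]
        if tag in ways:
            ways.remove(tag)   # hit: refresh LRU position
        else:
            misses += 1        # miss: deque(maxlen=2) evicts the LRU way
        ways.append(tag)
    return misses

trace = [0x4, 0x24] * 5
print(two_way_misses(trace))  # → 2 (miss rate 2/10 = 20%)
```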

(23)

An Eight-way Fully Associative Cache

[Figure 8.11: a single set of eight ways, each with a valid bit, tag, and data word; any block may go in any way]

No conflict misses: only compulsory or capacity misses.

Either very expensive or slow because of all the associativity.

(24)

Exploiting Spatial Locality: Larger Blocks

2 sets, 1 block per set (direct mapped), 4 words per block.

Address bits:

0–1: byte within word
2–3: block offset (word within block)
4: set number
5–31: block “tag” (27 bits)

Example: 0x8000009C → tag 100...100, set 1, block offset 11, byte offset 00.

[Figure: each set holds a valid bit, a 27-bit tag, and four 32-bit words; the set bit selects the row, the tag is compared, and the block offset selects which of the four words (00, 01, 10, 11) to return]

(25)

Direct-Mapped Cache Behavior w/ 4-word block

The dumb loop: repeat 5 times: load from 0x4; load from 0xC; load from 0x8.

        li    $t0, 5
l1:     beq   $t0, $0, done
        lw    $t1, 0x4($0)
        lw    $t2, 0xC($0)
        lw    $t3, 0x8($0)
        addiu $t0, $t0, -1
        j     l1
done:

[Figure 8.14: cache state when reading 0xC — Set 0 is valid with tag 00...00 and holds mem[0x00...0C], mem[0x00...08], mem[0x00...04], and mem[0x00...00]; Set 1 is empty]

Assuming the cache starts empty, what’s the miss rate?

Access: 4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result: M H H H H H H H H H H H H H H

1/15 ≈ 0.067 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality.
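The effect of the larger block shows up directly in a model (my own Python sketch, not from the slides): the only change from the 4-byte-block cache is the block size, yet the misses drop from 3 to 1.

```python
def block_misses(addresses, num_sets=2, block_bytes=16):
    """Direct-mapped cache with 16-byte (4-word) blocks: a miss
    fetches the whole block, so neighboring words come for free."""
    held = {}  # set index -> tag currently stored there
    misses = 0
    for addr in addresses:
        block = addr // block_bytes   # 0x4, 0x8, 0xC share block 0
        idx, tag = block % num_sets, block // num_sets
        if held.get(idx) != tag:
            misses += 1
            held[idx] = tag
    return misses

trace = [0x4, 0xC, 0x8] * 5
print(block_misses(trace))  # → 1 (miss rate 1/15 ≈ 6.7%)
```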


(27)

Stephen’s Desktop Machine Revisited

AMD Phenom 9600: quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

On-chip caches:

Cache   Size    Sets   Ways     Block
L1I     64 K     512   2-way    64-byte   (per core)
L1D     64 K     512   2-way    64-byte   (per core)
L2      512 K    512   16-way   64-byte   (per core)
L3      2 MB    1024   32-way   64-byte

(28)

Intel On-Chip Caches

Chip          Year   Freq. (MHz)   L1 Data     L1 Instr         L2
80386         1985   16–25         off-chip    off-chip         none
80486         1989   25–100        8K unified                   off-chip
Pentium       1993   60–300        8K          8K               off-chip
Pentium Pro   1995   150–200       8K          8K               256K–1M (MCM)
Pentium II    1997   233–450       16K         16K              256K–512K (Cartridge)
Pentium III   1999   450–1400      16K         16K              256K–512K
Pentium 4     2001   1400–3730     8–16K       12k op trace     256K–2M
Pentium M     2003   900–2130      32K         32K              1M–2M
Core 2 Duo    2005   1500–3000     32K/core    32K/core         2M–6M
