Fundamentals of Computer Systems Caches Stephen A. Edwards and Martha A. Kim

(1)

Fundamentals of Computer Systems

Caches

Stephen A. Edwards and

Martha A. Kim

Columbia University

Spring 2012

(2)

Computer Systems

Performance depends on whichever is slower: the processor or the memory system

[Figure: a processor connected to a memory; signals shown are Address, MemWrite (WE), WriteData, ReadData, and CLK]

(3)

Memory Speeds Haven’t Kept Up

[Figure: CPU and memory performance (log scale, 1 to 100,000) by year, 1980 to 2005]

Our single-cycle memory assumption has been wrong since 1980.

Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.

(4)

Your Choice of Memories

                  Fast   Cheap   Large
On-Chip SRAM       ✔      ✔
Commodity DRAM            ✔       ✔
Supercomputer      ✔              ✔

(5)

Memory Hierarchy

Fundamental trick to making a big memory appear fast

Technology   Cost ($/Gb)   Access Time (ns)   Density (Gb/cm2)
SRAM         30 000        0.5                0.00025
DRAM         10            100                1 – 16
Flash        2             300∗               8 – 32
Hard Disk    0.1           10 000 000         500 – 2000

∗Read speed; writing much, much slower

(6)

A Modern Memory Hierarchy

My desktop machine: AMD Phenom 9600, quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

Level             Size     Tech.
L1 Instruction∗   64 K     SRAM
L1 Data∗          64 K     SRAM
L2∗               512 K    SRAM
L3                2 MB     SRAM
Memory            4 GB     DRAM
Disk              500 GB   Magnetic

∗per core

(7)

Temporal Locality

What path do your eyes take when you read this? Did you look at the drawings more than once?

[Figure: a page from Euclid’s Elements]

(8)

Spatial Locality

If you need something, you may also need something nearby

(9)

Memory Performance

Hit: Data is found in the level of memory hierarchy

Miss: Data not found; will look in next level

$\text{Hit Rate} = \dfrac{\text{Number of hits}}{\text{Number of accesses}}$

$\text{Miss Rate} = \dfrac{\text{Number of misses}}{\text{Number of accesses}}$

$\text{Hit Rate} + \text{Miss Rate} = 1$

The expected access time $E_L$ for a memory level $L$ with latency $t_L$ and miss rate $M_L$:

$E_L = t_L + M_L \cdot E_{L+1}$
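The recurrence can be evaluated from the last level of the hierarchy back toward the first. The following is a minimal sketch, not a definitive implementation: the function name `expected_access_time` and the list-based interface are assumptions made for illustration, and the last level is assumed to always hit.

```python
def expected_access_time(latencies, miss_rates):
    """Evaluate E_0 using E_L = t_L + M_L * E_{L+1}.

    latencies[i] is t_i for level i (level 0 is closest to the processor).
    miss_rates[i] is M_i. The last level is assumed never to miss, so
    miss_rates has one fewer entry than latencies.
    """
    if len(miss_rates) != len(latencies) - 1:
        raise ValueError("need exactly one miss rate per level except the last")

    # Start with the last level, whose expected time is just its latency.
    expected = latencies[-1]
    # Walk backwards through the remaining levels, applying the recurrence.
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        expected = t + m * expected
    return expected


# Two levels: a 1-cycle cache with a 25% miss rate in front of a
# 100-cycle main memory.
two_level = expected_access_time([1, 100], [0.25])
print(two_level)
```

Adding a level only means adding one latency and one miss rate to the lists; the loop applies the same formula once per level.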

(10)

Memory Performance Example

Two-level hierarchy: Cache and main memory

Program executes 1000 loads & stores

750 of these are found in the cache

What’s the cache hit and miss rate?

$\text{Hit Rate} = \dfrac{750}{1000} = 75\%$

$\text{Miss Rate} = 1 - 0.75 = 25\%$

If the cache takes 1 cycle and the main memory 100, what’s the expected access time?

Expected access time of main memory: $E_1 = 100$ cycles

Access time for the cache: $t_0 = 1$ cycle

Cache miss rate: $M_0 = 0.25$

$E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26$ cycles
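The arithmetic in this example can be checked directly from the access counts. This is only a sketch; the helper name `hit_and_miss_rate` is hypothetical.

```python
def hit_and_miss_rate(hits, accesses):
    """Return (hit rate, miss rate) for a level, given raw counts."""
    hit_rate = hits / accesses
    return hit_rate, 1.0 - hit_rate


# Numbers from the example: 1000 loads and stores, 750 found in the cache.
hit_rate, miss_rate = hit_and_miss_rate(750, 1000)

cache_cycles = 1     # t_0
memory_cycles = 100  # E_1 (main memory is the last level)

# E_0 = t_0 + M_0 * E_1
expected_cycles = cache_cycles + miss_rate * memory_cycles
print(hit_rate, miss_rate, expected_cycles)
```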


(13)

Cache

Highest levels of memory hierarchy

Fast: level 1 typically 1 cycle access time

With luck, supplies most data

Cache design questions:

What data does it hold? Recently accessed

How is data found? Simple address hash

What data is replaced? Often the oldest

(14)

What Data is Held in the Cache?

Ideal cache: always correctly guesses what you want before you want it.

Real cache: never that smart

Caches Exploit

Temporal Locality: Copy newly accessed data into cache, replacing oldest if necessary

Spatial Locality: Copy nearby data into the cache at the same time

Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.
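To make "a block at a time" concrete, the sketch below finds the block that contains a given byte address. The names `block_base` and `block_addresses` are assumptions for illustration, the 64-byte size is the example figure from the text, and the masking trick assumes the block size is a power of two.

```python
BLOCK_SIZE = 64  # bytes per block (example value; must be a power of two)


def block_base(address, block_size=BLOCK_SIZE):
    """Address of the first byte of the block containing `address`."""
    # Clearing the low log2(block_size) bits rounds down to a block boundary.
    return address & ~(block_size - 1)


def block_addresses(address, block_size=BLOCK_SIZE):
    """All byte addresses that travel together with `address`."""
    base = block_base(address, block_size)
    return range(base, base + block_size)


# Touching one byte at 0x1234 brings in the whole 64-byte block around it.
neighbors = block_addresses(0x1234)
print(hex(neighbors[0]), hex(neighbors[-1]))
```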

(15)

A Direct-Mapped Cache

[Figure: a 2³⁰-word main memory, mem[0x00000000] through mem[0xFFFFFFFC], mapped onto a 2³-word cache with sets 0 (000) through 7 (111)]

This simple cache has
- 8 sets
- 1 block per set
- 4 bytes per block

To simplify answering “is this memory in the cache?,” each byte is mapped to exactly one set.

(16)

Direct-Mapped Cache Hardware

[Figure: the memory address splits into a 27-bit tag, a 3-bit set, and a 2-bit byte offset (00). An 8-entry × (1+27+32)-bit SRAM holds V, Tag, and Data for sets 0 through 7; the stored tag is compared (=) with the address tag to produce Hit, and the 32-bit Data is read out.]

Address bits:
- 0–1: byte within block
- 2–4: set number
- 5–31: block “tag”

Cache hit if, in the set of the address,
- block is valid (V=1)
- tag (address bits 5–31) matches
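The bit fields listed above can be extracted with shifts and masks. A minimal sketch follows; the function name `split_address` and its default parameters are assumptions chosen to match this 8-set, 4-byte-block cache.

```python
def split_address(address, set_bits=3, offset_bits=2):
    """Split a 32-bit address into (tag, set index, byte offset).

    With the defaults: bits 0-1 are the byte offset, bits 2-4 the set
    number, and bits 5-31 the tag.
    """
    offset = address & ((1 << offset_bits) - 1)
    set_index = (address >> offset_bits) & ((1 << set_bits) - 1)
    tag = address >> (offset_bits + set_bits)
    return tag, set_index, offset


for addr in (0x00000004, 0x0000000C, 0x00000024, 0xFFFFFFFC):
    tag, set_index, offset = split_address(addr)
    print(f"{addr:#010x}: tag={tag:#x} set={set_index} offset={offset}")
```

Two addresses can share a set only if their set fields match; the tag is what tells them apart.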

(17)

Direct-Mapped Cache Behavior

A dumb loop:

repeat 5 times
    load from 0x4;
    load from 0xC;
    load from 0x8.

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0xC($0)
      lw    $t3, 0x8($0)
      addiu $t0, $t0, -1
      j     l1
done:

Cache when reading 0x4 last time (address 0x4: tag 00...00, set 001, byte offset 00):

Set        V   Tag       Data
7 (111)    0
6 (110)    0
5 (101)    0
4 (100)    0
3 (011)    1   00...00   mem[0x00...0C]
2 (010)    1   00...00   mem[0x00...08]
1 (001)    1   00...00   mem[0x00...04]
0 (000)    0

When two recently accessed addresses map to the same cache block,

Assuming the cache starts empty, what’s the miss rate?

Address:  4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result:   M M M H H H H H H H H H H H H

3/15 = 0.2 = 20%
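The hit/miss sequence can be reproduced with a small software model of the cache. This is a sketch under stated assumptions: the function name `simulate_direct_mapped` is hypothetical, the cache starts empty, and only tags and valid bits are modeled (no data).

```python
def simulate_direct_mapped(addresses, num_sets=8, block_bytes=4):
    """Return a list of 'H'/'M' results for a direct-mapped cache."""
    valid = [False] * num_sets
    tags = [0] * num_sets
    results = []
    for address in addresses:
        block = address // block_bytes   # drop the byte offset
        index = block % num_sets         # set number
        tag = block // num_sets          # remaining upper bits
        if valid[index] and tags[index] == tag:
            results.append("H")
        else:
            results.append("M")
            # On a miss, the new block replaces whatever the set held.
            valid[index] = True
            tags[index] = tag
    return results


# The dumb loop: five iterations of loads from 0x4, 0xC, 0x8.
trace = [0x4, 0xC, 0x8] * 5
results = simulate_direct_mapped(trace)
miss_rate = results.count("M") / len(results)
print(" ".join(results), miss_rate)
```

Only the first iteration misses, because the three addresses land in three different sets and never evict one another.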


(19)

Direct-Mapped Cache: Conflict

A dumber loop:

repeat 5 times
    load from 0x4;
    load from 0x24

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0x24($0)
      addiu $t0, $t0, -1
      j     l1
done:

[Figure: cache state, sets 0 (000) through 7 (111). Address 0x24 has tag 00...01, set 001, byte offset 00, so it maps to the same set as 0x4.]

Assuming the cache starts empty, what’s the miss rate?

Address:  4 24 4 24 4 24 4 24 4 24
Result:   M M  M M  M M  M M  M M

10/10 = 1 = 100% Oops

These are conflict misses


(21)

No Way! Yes Way! 2-Way Set Associative Cache

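[Figure: 2-way set associative cache hardware. The memory address splits into a 28-bit tag, a 2-bit set, and a 2-bit byte offset (00). Each of sets 0 through 3 holds V, a 28-bit Tag, and 32-bit Data for Way 1 and for Way 0. One tag comparator (=) per way produces Hit₁ and Hit₀, which together give Hit and select the 32-bit Data output.]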

(22)

2-Way Set Associative Behavior

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0x24($0)
      addiu $t0, $t0, -1
      j     l1
done:

Address:  4 24 4 24 4 24 4 24 4 24
Result:   M M  H H  H H  H H  H H

2/10 = 0.2 = 20%

Associativity reduces conflict misses

        Way 1                          Way 0
Set     V   Tag       Data             V   Tag       Data
3       0                              0
2       0                              0
1       1   00...10   mem[0x00...24]   1   00...00   mem[0x00...04]
0       0                              0
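A set associative cache can be modeled by letting each set hold up to `ways` tags. The sketch below assumes least-recently-used replacement within a set, which is one common policy and an assumption here rather than something the slides specify; the function name `simulate_set_associative` is also hypothetical.

```python
from collections import OrderedDict


def simulate_set_associative(addresses, num_sets=4, ways=2, block_bytes=4):
    """Return 'H'/'M' results for a set associative cache with LRU replacement."""
    # Each set maps tag -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for address in addresses:
        block = address // block_bytes
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            results.append("H")
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[tag] = True
            results.append("M")
    return results


# The dumber loop: five iterations of loads from 0x4 and 0x24.
trace = [0x4, 0x24] * 5
two_way = simulate_set_associative(trace, num_sets=4, ways=2)
direct = simulate_set_associative(trace, num_sets=8, ways=1)
print(two_way.count("M"), direct.count("M"))
```

Both configurations hold eight words in total; with two ways, 0x4 and 0x24 can sit side by side in set 1 instead of evicting each other.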

(23)

An Eight-way Fully Associative Cache

[Figure 8.11: a single set with eight ways, Way 0 through Way 7, each holding V, Tag, and Data]

No conflict misses: only compulsory or capacity misses

Either very expensive or slow because of all the associativity

(24)

Exploiting Spatial Locality: Larger Blocks

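[Figure: a direct-mapped cache with multi-word blocks. The memory address splits into a 27-bit tag, a 1-bit set, a 2-bit block offset, and a 2-bit byte offset (00). Each of sets 0 and 1 holds V, a 27-bit Tag, and four 32-bit data words; the block offset (00, 01, 10, 11) selects one 32-bit word, and the tag comparison (=) produces Hit.]

Example: address 0x8000009C has tag 100...100, set 1, block offset 11, byte offset 00.

- 2 sets
- 1 block per set (Direct Mapped)
- 4 words per block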

(25)

Direct-Mapped Cache Behavior w/ 4-word block

The dumb loop:

repeat 5 times
    load from 0x4;
    load from 0xC;
    load from 0x8.

      li    $t0, 5
l1:   beq   $t0, $0, done
      lw    $t1, 0x4($0)
      lw    $t2, 0xC($0)
      lw    $t3, 0x8($0)
      addiu $t0, $t0, -1
      j     l1
done:

Figure 8.14 Cache when reading 0xC (address 0xC: tag 00...00, set 0, block offset 11, byte offset 00):

Set   V   Tag       Data
1     0
0     1   00...00   mem[0x00...0C]  mem[0x00...08]  mem[0x00...04]  mem[0x00...00]

Address:  4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result:   M H H H H H H H H H H H H H H

1/15 ≈ 0.067 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality
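The effect of block size on this loop can be checked with a direct-mapped model in which the block size is a parameter. This is a sketch only; the name `simulate_with_blocks` is hypothetical, the cache is assumed to start empty, and only tags are tracked.

```python
def simulate_with_blocks(addresses, num_sets, words_per_block, word_bytes=4):
    """Return 'H'/'M' results for a direct-mapped cache with multi-word blocks."""
    block_bytes = words_per_block * word_bytes
    stored_tag = [None] * num_sets   # None marks an invalid (empty) set
    results = []
    for address in addresses:
        block = address // block_bytes   # drops byte offset and block offset
        index = block % num_sets
        tag = block // num_sets
        if stored_tag[index] == tag:
            results.append("H")
        else:
            # A miss fetches the whole block, so neighboring words arrive too.
            stored_tag[index] = tag
            results.append("M")
    return results


trace = [0x4, 0xC, 0x8] * 5
four_word = simulate_with_blocks(trace, num_sets=2, words_per_block=4)
one_word = simulate_with_blocks(trace, num_sets=8, words_per_block=1)
print(four_word.count("M"), one_word.count("M"))
```

Both configurations hold eight words. With four-word blocks, the first load of 0x4 also brings in 0x8 and 0xC, so the later loads hit.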


(27)

Stephen’s Desktop Machine Revisited

AMD Phenom 9600, quad-core, 2.3 GHz, 1.1–1.25 V, 95 W, 65 nm

On-chip caches:

Cache   Size    Sets   Ways     Block
L1I∗    64 K    512    2-way    64-byte
L1D∗    64 K    512    2-way    64-byte
L2∗     512 K   512    16-way   64-byte
L3      2 MB    1024   32-way   64-byte

∗per core
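The columns of this table are related by size = sets × ways × block size, which the sketch below checks for each row. The dictionary layout and the helper name `capacity_bytes` are assumptions for illustration; the offset and set bit widths are derived from the table values on the assumption that both are powers of two.

```python
def capacity_bytes(sets, ways, block_bytes):
    """Total data capacity of a cache: sets x ways x bytes per block."""
    return sets * ways * block_bytes


# (sets, ways, block size in bytes) taken from the table above.
caches = {
    "L1I": (512, 2, 64),
    "L1D": (512, 2, 64),
    "L2": (512, 16, 64),
    "L3": (1024, 32, 64),
}

for name, (sets, ways, block_bytes) in caches.items():
    size_kib = capacity_bytes(sets, ways, block_bytes) // 1024
    # For powers of two, bit_length() - 1 equals log2 of the value.
    offset_bits = block_bytes.bit_length() - 1
    set_bits = sets.bit_length() - 1
    print(f"{name}: {size_kib} K, {offset_bits} offset bits, {set_bits} set bits")
```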

(28)

Intel On-Chip Caches

Chip          Year   Freq. (MHz)   L1 Data        L1 Instr             L2
80386         1985   16–25         off-chip                            none
80486         1989   25–100        8K unified                          off-chip
Pentium       1993   60–300        8K             8K                   off-chip
Pentium Pro   1995   150–200       8K             8K                   256K–1M (MCM)
Pentium II    1997   233–450       16K            16K                  256K–512K (Cartridge)
Pentium III   1999   450–1400      16K            16K                  256K–512K
Pentium 4     2001   1400–3730     8–16K          12k op trace cache   256K–2M
Pentium M     2003   900–2130      32K            32K                  1M–2M
Core 2 Duo    2005   1500–3000     32K per core   32K per core         2M–6M