Fundamentals of Computer Systems
Caches
Stephen A. Edwards and
Martha A. Kim
Columbia University
Spring 2012
Illustrations Copyright © 2007 Elsevier
Computer Systems
Performance depends on whichever is slower:
the processor or the memory system
[Figure: the processor connected to memory through the Address, WriteData, ReadData, MemWrite (WE), and CLK signals]
Memory Speeds Haven’t Kept Up
[Figure: CPU and memory performance by year, 1980–2005; performance on a log scale from 1 to 100,000]
Our single-cycle memory assumption has been wrong since 1980.
Hennessy and Patterson. Computer Architecture: A Quantitative Approach. 3rd ed., Morgan Kaufmann, 2003.
Your Choice of Memories
                 Fast   Cheap   Large
On-Chip SRAM      ✔      ✔
Commodity DRAM           ✔       ✔
Supercomputer     ✔              ✔
Memory Hierarchy
Fundamental trick to making a big memory appear fast

Technology   Cost ($/Gb)   Access Time (ns)   Density (Gb/cm²)
SRAM         30 000        0.5                0.00025
DRAM         10            100                1 – 16
Flash        2             300*               8 – 32
Hard Disk    0.1           10 000 000         500 – 2000

* Read speed; writing much, much slower
A Modern Memory Hierarchy
AMD Phenom 9600 Quad-core
2.3 GHz 1.1–1.25 V 95 W 65 nm
My desktop machine:
Level            Size      Tech.
L1 Instruction   64 K*     SRAM
L1 Data          64 K*     SRAM
L2               512 K*    SRAM
L3               2 MB      SRAM
Memory           4 GB      DRAM
Disk             500 GB    Magnetic

* per core
Temporal Locality
What path do your eyes take when you read this?
Did you look at the
drawings more than once?
Euclid’s Elements
Spatial Locality
If you need something, you may also need something nearby
Memory Performance
Hit: Data is found in the level of memory hierarchy
Miss: Data not found; will look in next level

$\text{Hit Rate} = \frac{\text{Number of hits}}{\text{Number of accesses}}$
$\text{Miss Rate} = \frac{\text{Number of misses}}{\text{Number of accesses}}$
$\text{Hit Rate} + \text{Miss Rate} = 1$

The expected access time $E_L$ for a memory level $L$ with latency $t_L$ and miss rate $M_L$:

$E_L = t_L + M_L \cdot E_{L+1}$
Memory Performance Example
Two-level hierarchy: Cache and main memory
Program executes 1000 loads & stores
750 of these are found in the cache
What’s the cache hit and miss rate?

$\text{Hit Rate} = \frac{750}{1000} = 75\%$
$\text{Miss Rate} = 1 - 0.75 = 25\%$

If the cache takes 1 cycle and the main memory 100, what’s the expected access time?

Expected access time of main memory: $E_1 = 100$ cycles
Access time for the cache: $t_0 = 1$ cycle
Cache miss rate: $M_0 = 0.25$

$E_0 = t_0 + M_0 \cdot E_1 = 1 + 0.25 \cdot 100 = 26$ cycles
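The recurrence for expected access time can be evaluated from the last level upward. The sketch below is illustrative only: the function name `expected_access_time` and its list-based interface are assumptions, and it assumes the last level always hits.

```python
def expected_access_time(latencies, miss_rates):
    """Expected access time E_0 of a memory hierarchy, in cycles.

    latencies[i]  is t_i, the latency of level i.
    miss_rates[i] is M_i, the miss rate of level i. The last level is
    assumed to always hit, so it has no miss-rate entry.
    """
    if len(miss_rates) != len(latencies) - 1:
        raise ValueError("need one miss rate per level except the last")
    # Start at the last level, where E = t, and fold upward using
    # E_L = t_L + M_L * E_{L+1}.
    expected = latencies[-1]
    for t, m in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        expected = t + m * expected
    return expected


hits, accesses = 750, 1000
miss_rate = 1 - hits / accesses
print(expected_access_time([1, 100], [miss_rate]))  # 26.0
```

Adding a level is a matter of extending both lists, e.g. `expected_access_time([1, 10, 100], [0.25, 0.5])` for a hypothetical three-level hierarchy.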
Cache
Highest levels of memory hierarchy
Fast: level 1 typically 1 cycle access time
With luck, supplies most data
Cache design questions:
What data does it hold? Recently accessed
How is data found? Simple address hash
What data is replaced? Often the oldest
What Data is Held in the Cache?
Ideal cache: always correctly guesses what you want before you want it.
Real cache: never that smart
Caches Exploit Temporal Locality
Copy newly accessed data into cache, replacing oldest if necessary
Spatial Locality
Copy nearby data into the cache at the same time
Specifically, always read and write a block at a time (e.g., 64 bytes), never a single byte.
A Direct-Mapped Cache
[Figure: a 2^30-word main memory, mem[0x00000000] through mem[0xFFFFFFFC], mapped onto a 2^3-word cache with Set 0 (000) through Set 7 (111)]
This simple cache has
• 8 sets
• 1 block per set
• 4 bytes per block

To simplify answering “is this memory in the cache?,” each byte is mapped to exactly one set.
Direct-Mapped Cache Hardware
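[Figure: direct-mapped cache hardware. An 8-entry × (1+27+32)-bit SRAM (Set 0 through Set 7) stores V, Tag, and Data. The memory address splits into a 27-bit tag, a 3-bit set number, and a 2-bit byte offset (00). Hit is asserted when the stored tag equals the address tag and V is set; Data is 32 bits wide]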
Address bits:
0–1: byte within block
2–4: set number
5–31: block “tag”

Cache hit if, in the set of the address,
• block is valid (V=1)
• tag (address bits 5–31) matches
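The address split and the hit condition can be written out as a short sketch. The names `split_address`, `Line`, and `is_hit` are hypothetical, chosen for illustration; the bit widths follow the 8-set, 4-byte-block cache described here.

```python
from dataclasses import dataclass

OFFSET_BITS = 2   # address bits 0-1: byte within the 4-byte block
SET_BITS = 3      # address bits 2-4: one of 8 sets


def split_address(addr):
    """Return (tag, set_index, byte_offset) for a 32-bit address."""
    byte_offset = addr & ((1 << OFFSET_BITS) - 1)
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SET_BITS)   # address bits 5-31
    return tag, set_index, byte_offset


@dataclass
class Line:
    valid: bool = False   # the V bit
    tag: int = 0


def is_hit(lines, addr):
    """Hit if the line in the address's set is valid and its tag matches."""
    tag, set_index, _ = split_address(addr)
    line = lines[set_index]
    return line.valid and line.tag == tag


lines = [Line() for _ in range(1 << SET_BITS)]
lines[1] = Line(valid=True, tag=0)        # pretend mem[0x4] is cached

print(split_address(0x4))                 # (0, 1, 0)
print(split_address(0xC))                 # (0, 3, 0)
print(is_hit(lines, 0x4), is_hit(lines, 0xC))   # True False
```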
Direct-Mapped Cache Behavior
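A dumb loop:
repeat 5 times
    load from 0x4;
    load from 0xC;
    load from 0x8.

        li    $t0, 5            # loop counter: 5 iterations
l1:     beq   $t0, $0, done     # exit when the counter reaches 0
        lw    $t1, 0x4($0)      # load from 0x4
        lw    $t2, 0xC($0)      # load from 0xC
        lw    $t3, 0x8($0)      # load from 0x8
        addiu $t0, $t0, -1      # decrement the counter
        j     l1
done: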
[Figure: cache when reading 0x4 the last time. Set 1 (001) holds mem[0x00...04], Set 2 (010) holds mem[0x00...08], and Set 3 (011) holds mem[0x00...0C], each with V = 1 and tag 00...00; the other sets have V = 0]
When two recently accessed addresses map to the same cache block,

Assuming the cache starts empty, what’s the miss rate?

Address   4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result    M M M H H H H H H H H H H H H

3/15 = 0.2 = 20%
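The miss/hit sequence can be reproduced with a small simulation. This is a minimal sketch, not a model of real hardware: `simulate_direct_mapped` is a hypothetical helper that tracks only the tag stored in each set, with `None` standing in for V = 0.

```python
def simulate_direct_mapped(trace, num_sets=8, block_bytes=4):
    """Return a list of 'M'/'H' outcomes, one per address in trace."""
    tags = [None] * num_sets          # None marks an invalid (V=0) entry
    outcomes = []
    for addr in trace:
        block = addr // block_bytes   # drop the byte offset
        set_index = block % num_sets
        tag = block // num_sets
        if tags[set_index] == tag:
            outcomes.append("H")
        else:
            outcomes.append("M")
            tags[set_index] = tag     # fill (or replace) the block
    return outcomes


trace = [0x4, 0xC, 0x8] * 5           # the loop's 15 loads, in order
outcomes = simulate_direct_mapped(trace)
print(" ".join(outcomes))
print(f"miss rate = {outcomes.count('M')}/{len(outcomes)}")  # 3/15
```

Only the first pass through the loop misses; afterwards each of the three words sits in its own set.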
Direct-Mapped Cache: Conflict
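A dumber loop:
repeat 5 times
    load from 0x4;
    load from 0x24

        li    $t0, 5            # loop counter: 5 iterations
l1:     beq   $t0, $0, done     # exit when the counter reaches 0
        lw    $t1, 0x4($0)      # load from 0x4
        lw    $t2, 0x24($0)     # load from 0x24
        addiu $t0, $t0, -1      # decrement the counter
        j     l1
done: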
[Figure: cache state; 0x4 (tag 00...00) and 0x24 (tag 00...01) both map to Set 1 (001)]

Assuming the cache starts empty, what’s the miss rate?

Address   4  24  4  24  4  24  4  24  4  24
Result    M  M   M  M   M  M   M  M   M  M

10/10 = 1 = 100% Oops

These are conflict misses
No Way! Yes Way! 2-Way Set Associative Cache
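[Figure: 2-way set associative cache hardware with 4 sets (Set 0 through Set 3) and 2 ways (Way 0, Way 1). The memory address splits into a 28-bit tag, a 2-bit set number, and a 2-bit byte offset (00). Each way stores V, Tag, and 32-bit Data; per-way tag comparisons produce Hit0 and Hit1, which combine into Hit and select the output data]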
2-Way Set Associative Behavior
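        li    $t0, 5            # loop counter: 5 iterations
l1:     beq   $t0, $0, done     # exit when the counter reaches 0
        lw    $t1, 0x4($0)      # load from 0x4
        lw    $t2, 0x24($0)     # load from 0x24
        addiu $t0, $t0, -1      # decrement the counter
        j     l1
done: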
Assuming the cache starts empty, what’s the miss rate?
Address   4  24  4  24  4  24  4  24  4  24
Result    M  M   H  H   H  H   H  H   H  H

2/10 = 0.2 = 20%
Associativity reduces conflict misses
[Figure: cache state. Set 1 holds both blocks, one per way: mem[0x00...04] (tag 00...00) and mem[0x00...24] (tag 00...10), each with V = 1; Sets 0, 2, and 3 have V = 0 in both ways]
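To compare associativities on the same trace, a set-associative cache can be simulated with one small ordered map per set. This is a sketch under stated assumptions: the function name `miss_rate` is hypothetical, and least-recently-used (LRU) replacement is assumed because the slide does not name a policy (for this trace the policy never matters, since nothing is evicted in the 2-way case).

```python
from collections import OrderedDict


def miss_rate(trace, num_sets, ways, block_bytes=4):
    """Miss rate of a set-associative cache, assuming LRU replacement."""
    # One OrderedDict per set; keys are tags, ordered least recent first.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        tag, lines = block // num_sets, sets[block % num_sets]
        if tag in lines:
            lines.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[tag] = True
    return misses / len(trace)


trace = [0x4, 0x24] * 5
print(miss_rate(trace, num_sets=8, ways=1))  # direct-mapped: 1.0
print(miss_rate(trace, num_sets=4, ways=2))  # 2-way: 0.2
```

Both configurations hold eight 4-byte blocks; only the organization differs. Setting `num_sets=1, ways=8` models a fully associative cache of the same capacity.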
An Eight-way Fully Associative Cache
[Figure 8.11: an eight-way fully associative cache: a single set with Way 0 through Way 7, each holding V, Tag, and Data]
No conflict misses: only compulsory or capacity misses
Either very expensive or slow because of all the associativity
Exploiting Spatial Locality: Larger Blocks
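[Figure: direct-mapped cache with 4-word blocks. The memory address splits into a 27-bit tag, a 1-bit set number, a 2-bit block offset, and a 2-bit byte offset (00); for example, 0x8000009C has tag 100...100, set 1, block offset 11. The block offset selects one of the four 32-bit words in the block; Hit requires a valid entry with a matching tag]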
2 sets
1 block per set (Direct Mapped)
4 words per block
Direct-Mapped Cache Behavior w/ 4-word block
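The dumb loop:
repeat 5 times
    load from 0x4;
    load from 0xC;
    load from 0x8.

        li    $t0, 5            # loop counter: 5 iterations
l1:     beq   $t0, $0, done     # exit when the counter reaches 0
        lw    $t1, 0x4($0)      # load from 0x4
        lw    $t2, 0xC($0)      # load from 0xC
        lw    $t3, 0x8($0)      # load from 0x8
        addiu $t0, $t0, -1      # decrement the counter
        j     l1
done: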
[Figure 8.14: cache when reading 0xC. Set 0 holds one valid block (V = 1, tag 00...00) containing mem[0x00...00], mem[0x00...04], mem[0x00...08], and mem[0x00...0C]; Set 1 has V = 0. Address 0xC has tag 00...00, set 0, block offset 11, byte offset 00]

Assuming the cache starts empty, what’s the miss rate?

Address   4 C 8 4 C 8 4 C 8 4 C 8 4 C 8
Result    M H H H H H H H H H H H H H H

1/15 ≈ 0.067 = 6.7%

Larger blocks reduce compulsory misses by exploiting spatial locality
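The effect of block size can be checked by counting misses for the same trace under two direct-mapped geometries. This is an illustrative sketch; `count_misses` is a hypothetical helper, and a miss is modeled as fetching the whole block containing the address.

```python
def count_misses(trace, num_sets, block_bytes):
    """Count misses in a direct-mapped cache with the given geometry."""
    tags = {}                              # set index -> tag of cached block
    misses = 0
    for addr in trace:
        block = addr // block_bytes        # strips byte and block offsets
        set_index, tag = block % num_sets, block // num_sets
        if tags.get(set_index) != tag:     # absent entry means V = 0
            misses += 1
            tags[set_index] = tag          # fetch the whole block
    return misses


trace = [0x4, 0xC, 0x8] * 5
print(count_misses(trace, num_sets=8, block_bytes=4))    # 1-word blocks: 3
print(count_misses(trace, num_sets=2, block_bytes=16))   # 4-word blocks: 1
```

With 16-byte blocks, the first load of 0x4 brings in 0x0 through 0xF, so the later loads of 0xC and 0x8 hit without ever having missed.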
Stephen’s Desktop Machine Revisited
AMD Phenom 9600
Quad-core 2.3 GHz 1.1–1.25 V 95 W 65 nm
On-chip caches:
Cache   Size      Sets   Ways     Block
L1I     64 K*     512    2-way    64-byte
L1D     64 K*     512    2-way    64-byte
L2      512 K*    512    16-way   64-byte
L3      2 MB      1024   32-way   64-byte

* per core
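The listed sizes are consistent with capacity = sets × ways × block size. The check below is a sketch; it assumes K means 1024 bytes and MB means 1024 × 1024 bytes, and the name `cache_bytes` is hypothetical.

```python
def cache_bytes(sets, ways, block_bytes):
    """Data capacity of a cache: sets x ways x bytes per block."""
    return sets * ways * block_bytes


KB, MB = 1024, 1024 * 1024   # assumption: binary K and MB

# name: (sets, ways, block bytes, listed size in bytes)
caches = {
    "L1I": (512, 2, 64, 64 * KB),
    "L1D": (512, 2, 64, 64 * KB),
    "L2": (512, 16, 64, 512 * KB),
    "L3": (1024, 32, 64, 2 * MB),
}

for name, (sets, ways, block, listed) in caches.items():
    size = cache_bytes(sets, ways, block)
    print(f"{name}: {size} bytes, matches listed size: {size == listed}")
```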
Intel On-Chip Caches
Chip          Year   Freq. (MHz)   L1 Data          L1 Instr             L2
80386         1985   16–25         off-chip                              none
80486         1989   25–100        8K unified                            off-chip
Pentium       1993   60–300        8K               8K                   off-chip
Pentium Pro   1995   150–200       8K               8K                   256K–1M (MCM)
Pentium II    1997   233–450       16K              16K                  256K–512K (Cartridge)
Pentium III   1999   450–1400      16K              16K                  256K–512K
Pentium 4     2001   1400–3730     8–16K            12k op trace cache   256K–2M
Pentium M     2003   900–2130      32K              32K                  1M–2M
Core 2 Duo    2005   1500–3000     32K per core     32K per core         2M–6M