Memory Hierarchy

(1)

Memory Hierarchy

Computer Organization and Assembly Languages Yung-Yu Chuang

2006/01/05

with slides by CMU15-213

(2)

Announcement

• Grade for hw#4 is online

• Please DO submit homework if you haven’t

• Please sign up for a demo time on 1/16 or 1/17 at the following page

http://www.csie.ntu.edu.tw/~b90095/index.cgi/Assembly_Demo

• Hand in your report to TA at your demo time

• The length of the report depends on your project type. It can be html, pdf, doc, ppt…

(3)

Reference

• Chapter 6 from “Computer Systems: A Programmer’s Perspective”

(4)

Computer system model

• We assume memory is a linear array that holds both instructions and data, and that the CPU can access memory in constant time.

[Figure: block diagram of the computer system model. The Central Processor Unit (CPU: registers, ALU, CU, clock), the memory storage unit, and I/O devices #1 and #2 are connected by the data bus, control bus, and address bus.]

(5)

SRAM vs DRAM

        Transistors   Access   Needs
        per bit       time     refresh?   Cost   Applications
SRAM    4 or 6        1X       No         100X   Cache memories
DRAM    1             10X      Yes        1X     Main memories, frame buffers

(6)

The CPU-Memory gap

The gap widens between DRAM, disk, and CPU speeds.

[Figure: log-scale plot of time (ns) versus year (1980-2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]

Access time (cycles):

register   1
cache      1-10
memory     50-100
disk       20,000,000

(7)

Memory hierarchies

• Some fundamental and enduring properties of hardware and software:

– Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).

– The gap between CPU and main memory speed is widening.

– Well-written programs tend to exhibit good locality.

• They suggest an approach for organizing

memory and storage systems known as a

memory hierarchy.

(8)

Memory system in practice

Smaller, faster, and more expensive (per byte) storage devices at the top:

L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

Larger, slower, and cheaper (per byte) storage devices at the bottom.

(9)

Why does it work?

• Most programs tend to access the storage at any particular level more frequently than the storage at the next lower level.

• Locality: programs tend to access the same set of data items over and over again, or to access sets of nearby data items.

(10)

Why learn it?

• A programmer needs to understand this because the memory hierarchy has a big impact on performance.

• You can optimize your program so that its data is more often found in the higher levels of the hierarchy.

• For example, the running time of matrix multiplication can differ by up to a factor of 6 even when the same number of arithmetic instructions is performed.

(11)

Locality

• Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.

– Temporal locality: recently referenced items are likely to be referenced in the near future.

– Spatial locality: items with nearby addresses tend to be referenced close together in time.

• In general, programs with good locality run faster than programs with poor locality.

• Locality is the reason caches and virtual memory are built into architectures and operating systems. Another example: a web browser caches recently visited web pages.

(12)

Locality example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data

– Reference array elements in succession (stride-1 reference pattern): spatial locality

– Reference sum each iteration: temporal locality

• Instructions

– Reference instructions in sequence: spatial locality

– Cycle through loop repeatedly: temporal locality

(13)

Locality example

• Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Stride-1 reference pattern.

(14)

Locality example

• Does this function have good locality?

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Stride-N reference pattern.
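To make the two strides concrete, here is a minimal C sketch that measures the byte distance between consecutive accesses of each traversal. It is not from the original slides: the function names and the dimensions M = 4, N = 8 are assumptions chosen for illustration.

```c
#include <stddef.h>

enum { M = 4, N = 8 };   /* assumed dimensions, for illustration only */

/* Byte distance between consecutive accesses when the inner loop walks j
   (row-wise traversal, as in sum_array_rows). */
static ptrdiff_t row_step_bytes(int a[M][N])
{
    return (char *)&a[0][1] - (char *)&a[0][0];
}

/* Byte distance between consecutive accesses when the inner loop walks i
   (column-wise traversal, as in sum_array_cols). */
static ptrdiff_t col_step_bytes(int a[M][N])
{
    return (char *)&a[1][0] - (char *)&a[0][0];
}
```

Because C stores arrays in row-major order, the first function returns sizeof(int) (stride 1 in elements) and the second returns N * sizeof(int) (stride N).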

(15)

Locality example

    typedef struct {
        float v[3];
        float a[3];
    } point;

    point p[N];

    /* A */
    for (i=0; i<n; i++) {
        for (j=0; j<3; j++) {
            p[i].v[j]=0;
            p[i].a[j]=0;
        }
    }

    /* B */
    for (i=0; i<n; i++) {
        for (j=0; j<3; j++)
            p[i].v[j]=0;
        for (j=0; j<3; j++)
            p[i].a[j]=0;
    }

    /* C */
    for (j=0; j<3; j++) {
        for (i=0; i<n; i++)
            p[i].v[j]=0;
        for (i=0; i<n; i++)
            p[i].a[j]=0;
    }

(16)

Memory hierarchies

Smaller, faster, and more expensive (per byte) storage devices at the top:

L0: registers
L1: on-chip L1 cache (SRAM)
L2: off-chip L2 cache (SRAM)
L3: main memory (DRAM)
L4: local secondary storage (local disks)
L5: remote secondary storage (tapes, distributed file systems, Web servers)

Larger, slower, and cheaper (per byte) storage devices at the bottom.

(17)

Caches

• Cache: a smaller, faster storage device that

acts as a staging area for a subset of the data in a larger, slower device.

• Fundamental idea of a memory hierarchy:

– For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.

• Why do memory hierarchies work?

– Programs tend to access the data at level k more often than they access the data at level k+1.

– Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.

(18)

Caching in a memory hierarchy

[Figure: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (0-15). The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (here blocks 8, 9, 14, and 3). Data is copied between levels in block-sized transfer units (blocks 4 and 10 are shown being transferred).]

(19)

General caching concepts

• Program needs object d, which is stored in some block b.

• Cache hit

– Program finds b in the cache at level k. E.g., block 14.

• Cache miss

– b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12.

– If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?

• Placement policy: where can the new block go? E.g., b mod 4

• Replacement policy: which block should be evicted? E.g., LRU

[Figure: level k+1 holds blocks 0-15; level k caches blocks 4*, 9, 14, and 3. Request 14 hits at level k. Request 12 misses, so block 12 is fetched from level k+1 and replaces block 4*.]

(20)

Types of cache misses

• Cold (compulsory) miss: occurs because the cache is empty.

• Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.

• Conflict miss

– Most caches limit blocks at level k+1 to a small subset of the block positions at level k, e.g. block i at level k+1 must be placed in block (i mod 4) at level k.

– Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block, e.g. referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
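The 0, 8, 0, 8, ... example can be checked with a small C sketch. It assumes a level k cache with four block positions and the placement rule (i mod 4) quoted above; the function name and the trace representation are illustrative, not part of the slides.

```c
#include <stddef.h>

enum { NUM_SLOTS = 4 };   /* level k holds 4 blocks, as in the example above */

/* Counts misses for a trace of non-negative block numbers under the
   placement rule "block i goes to position (i mod 4)".
   All positions start empty (-1). */
static int count_misses(const int *trace, size_t len)
{
    int slot[NUM_SLOTS] = { -1, -1, -1, -1 };
    int misses = 0;

    for (size_t t = 0; t < len; t++) {
        int b = trace[t];
        int s = b % NUM_SLOTS;
        if (slot[s] != b) {   /* cold or conflict miss */
            slot[s] = b;      /* fetch block b, evicting the old occupant */
            misses++;
        }
    }
    return misses;
}
```

Blocks 0 and 8 both map to position 0, so alternating between them misses on every reference even though three positions stay empty. A trace such as 0, 1, 0, 1, ... only incurs the two cold misses.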

(21)

Cache memories

• Cache memories are small, fast SRAM-based

memories managed automatically in hardware.

• CPU looks first for data in L1, then in L2, then in main memory.

• Typical system structure:

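[Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface. The L2 cache is reached through an SRAM port. The system bus connects the bus interface to the I/O bridge, and the memory bus connects the I/O bridge to main memory.]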

(22)

General organization of a cache

Cache is an array of sets. Each set contains one or more lines. Each line holds a block of data.

• S = 2^s sets

• E lines per set

• B = 2^b bytes per cache block

• t tag bits per line

• 1 valid bit per line

Cache size: C = B x E x S data bytes

[Figure: sets 0 through S-1; each line consists of a valid bit, a tag, and a block of bytes 0 through B–1.]

(23)

Addressing caches

Address A (m bits, from bit m-1 down to bit 0):

<tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>.

The word contents begin at offset <block offset> bytes from the beginning of the block.

[Figure: sets 0 through S-1; each line consists of a valid bit (v), a tag, and a block of bytes 0 through B–1.]

(24)

Addressing caches

Address A (m bits, from bit m-1 down to bit 0):

<tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

1. Locate the set based on <set index>

2. Locate the line in the set based on <tag>

3. Check that the line is valid

4. Locate the data in the line based on <block offset>
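The address split used by these steps can be sketched in C as follows. This is an illustrative sketch rather than code from the slides: the type addr_fields, the function split_address, and the 32-bit address width are assumptions.

```c
#include <stdint.h>

/* Fields of an address for a cache with 2^s sets and 2^b-byte blocks. */
typedef struct {
    uint32_t tag;
    uint32_t set_index;
    uint32_t block_offset;
} addr_fields;

/* Splits addr into <tag> <set index> <block offset>.
   Assumes s + b < 32 so that every shift is well defined. */
static addr_fields split_address(uint32_t addr, unsigned s, unsigned b)
{
    addr_fields f;
    f.block_offset = addr & ((1u << b) - 1u);          /* low b bits   */
    f.set_index    = (addr >> b) & ((1u << s) - 1u);   /* next s bits  */
    f.tag          = addr >> (s + b);                  /* remaining t bits */
    return f;
}
```

For example, with s = 2 and b = 1, address 7 (0111 in binary) splits into tag 0, set index 3, block offset 1.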

(25)

Direct-mapped cache

• Simplest kind of cache, easy to build

(only 1 tag compare required per access)

• Characterized by exactly one line per set.

E = 1 line per set

Cache size: C = B x S data bytes

[Figure: sets 0 through S-1, each holding a single line with a valid bit, a tag, and a cache block.]

(26)

Accessing direct-mapped caches

• Set selection

– Use the set index bits to determine the set of interest.

[Figure: the address is split into tag (t bits), set index (s bits), and block offset (b bits). The set index bits 00001 select set 1 among sets 0 through S-1; each set holds one line with a valid bit, a tag, and a cache block.]

(27)

Accessing direct-mapped caches

• Line matching and word selection

– Line matching: Find a valid line in the selected set with a matching tag

– Word selection: Then extract the word

(1) The valid bit must be set.

(2) The tag bits in the cache line must match the tag bits in the address.

If (1) and (2), then cache hit.

[Figure: address with tag = 0110 (t bits), set index = i (s bits), block offset = 100 (b bits). The line in the selected set (i) has valid = 1 and tag = 0110, and its block holds bytes 0-7, with w₀ w₁ w₂ w₃ in bytes 4-7.]

(28)

Accessing direct-mapped caches

• Line matching and word selection

– Line matching: Find a valid line in the selected set with a matching tag

– Word selection: Then extract the word

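(3) If cache hit, block offset selects starting byte.

[Figure: address with tag = 0110 (t bits), set index = i (s bits), block offset = 100. The line in the selected set (i) has valid = 1 and tag = 0110, and its block holds bytes 0-7, with w₀ w₁ w₂ w₃ in bytes 4-7.]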

(29)

Direct-mapped cache simulation

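M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set

Address bits: t = 1 (tag), s = 2 (set index), b = 1 (block offset)

Address trace (reads):

0 [0000₂], 1 [0001₂], 7 [0111₂], 8 [1000₂], 0 [0000₂]

Read        Result   Line after the access (v, tag, data)
0 [0000₂]   miss     set 0: 1, 0, M[0-1]
1 [0001₂]   hit      set 0: 1, 0, M[0-1]
7 [0111₂]   miss     set 3: 1, 0, M[6-7]
8 [1000₂]   miss     set 0: 1, 1, M[8-9]
0 [0000₂]   miss     set 0: 1, 0, M[0-1]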
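The trace above can be replayed with a short C sketch of a direct-mapped cache using the same parameters (s = 2, b = 1). The names dm_line and dm_count_hits are hypothetical, and only the valid bit and tag are modelled, not the data.

```c
#include <stdbool.h>
#include <stddef.h>

enum { S_BITS = 2, B_BITS = 1, NUM_SETS = 1 << S_BITS };

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;

/* Simulates a direct-mapped cache (E = 1) on a trace of byte addresses
   and returns the number of hits. The cache starts empty. */
static int dm_count_hits(const unsigned *trace, size_t len)
{
    dm_line cache[NUM_SETS] = { { false, 0 } };
    int hits = 0;

    for (size_t n = 0; n < len; n++) {
        unsigned set = (trace[n] >> B_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[n] >> (S_BITS + B_BITS);

        if (cache[set].valid && cache[set].tag == tag) {
            hits++;
        } else {
            /* Miss: load the block, replacing whatever the line held. */
            cache[set].valid = true;
            cache[set].tag = tag;
        }
    }
    return hits;
}
```

On the trace 0, 1, 7, 8, 0 this yields one hit (the read of address 1) and four misses; the final read of address 0 misses because address 8 evicted it from set 0.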

(30)

What’s wrong with direct-mapped?

    float dotprod(float x[8], float y[8])
    {
        float sum = 0.0;

        for (int i = 0; i < 8; i++)
            sum += x[i] * y[i];
        return sum;
    }

block size = 16 bytes

element  address  set      element  address  set
x[0]     0        0        y[0]     32       0
x[1]     4        0        y[1]     36       0
x[2]     8        0        y[2]     40       0
x[3]     12       0        y[3]     44       0
x[4]     16       1        y[4]     48       1
x[5]     20       1        y[5]     52       1
x[6]     24       1        y[6]     56       1
x[7]     28       1        y[7]     60       1

x[i] and y[i] always map to the same set, so in a direct-mapped cache each reference evicts the block that the other array just loaded.

(31)

Solution? padding

    float dotprod(float x[12], float y[8])
    {
        float sum = 0.0;

        for (int i = 0; i < 8; i++)
            sum += x[i] * y[i];
        return sum;
    }

element  address  set      element  address  set
x[0]     0        0        y[0]     48       1
x[1]     4        0        y[1]     52       1
x[2]     8        0        y[2]     56       1
x[3]     12       0        y[3]     60       1
x[4]     16       1        y[4]     64       0
x[5]     20       1        y[5]     68       0
x[6]     24       1        y[6]     72       0
x[7]     28       1        y[7]     76       0

With x padded to 12 elements, x[i] and y[i] now map to different sets.
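The effect of the padding can be checked with a small C sketch. It assumes the configuration implied by the tables (16-byte blocks, 2 sets, 4-byte floats); the function names and the idea of passing the arrays' starting byte addresses are illustrative assumptions.

```c
/* Set index of a byte address for a direct-mapped cache with
   16-byte blocks and 2 sets. */
static unsigned set_of(unsigned addr)
{
    return (addr / 16u) % 2u;
}

/* Counts the indices i in 0..7 for which x[i] and y[i] (4-byte floats)
   fall into the same set, given the starting byte addresses of x and y. */
static int count_same_set(unsigned x_base, unsigned y_base)
{
    int same = 0;

    for (unsigned i = 0; i < 8; i++)
        if (set_of(x_base + 4u * i) == set_of(y_base + 4u * i))
            same++;
    return same;
}
```

With x at address 0 and y at address 32 (no padding), all 8 pairs collide. With y at address 48 (x padded to 12 floats), none do.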

(32)

Set associative caches

• Characterized by more than one line per set

E = 2 lines per set (in general, an E-way associative cache)

[Figure: sets 0 through S-1, each holding two lines; each line has a valid bit, a tag, and a cache block.]

(33)

Accessing set associative caches

• Set selection

– identical to direct-mapped cache

[Figure: the address is split into tag (t bits), set index (s bits), and block offset (b bits). The set index bits 00001 select set 1 among sets 0 through S-1; each set holds two lines, each with a valid bit, a tag, and a cache block.]

(34)

Accessing set associative caches

• Line matching and word selection

– must compare the tag in each valid line in the selected set.

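(1) The valid bit must be set.

(2) The tag bits in one of the cache lines must match the tag bits in the address.

If (1) and (2), then cache hit.

[Figure: address with tag = 0110 (t bits), set index = i (s bits), block offset = 100. The selected set (i) holds two valid lines with tags 1001 and 0110; the line with tag 0110 holds bytes 0-7, with w₀ w₁ w₂ w₃ in bytes 4-7.]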

(35)

Accessing set associative caches

• Line matching and word selection

– Word selection is the same as in a direct mapped cache

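(3) If cache hit, block offset selects starting byte.

[Figure: address with tag = 0110 (t bits), set index = i (s bits), block offset = 100 (b bits). The selected set (i) holds two valid lines with tags 1001 and 0110; the line with tag 0110 holds bytes 0-7, with w₀ w₁ w₂ w₃ in bytes 4-7.]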

(36)

2-Way associative cache simulation

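M = 16 byte addresses, B = 2 bytes/block, S = 2 sets, E = 2 entries/set

Address bits: t = 2 (tag), s = 1 (set index), b = 1 (block offset)

Address trace (reads):

0 [0000₂], 1 [0001₂], 7 [0111₂], 8 [1000₂], 0 [0000₂]

Read        Result   Line after the access (v, tag, data)
0 [0000₂]   miss     set 0: 1, 00, M[0-1]
1 [0001₂]   hit      set 0: 1, 00, M[0-1]
7 [0111₂]   miss     set 1: 1, 01, M[6-7]
8 [1000₂]   miss     set 0: 1, 10, M[8-9]
0 [0000₂]   hit      set 0: 1, 00, M[0-1]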
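The same trace can be replayed against a 2-way set associative cache with a C sketch. The slide does not name a replacement policy for this example (the trace never fills a set), so the LRU policy below is an assumption, as are the names sa_line and sa_count_hits.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SA_S_BITS = 1, SA_B_BITS = 1, SA_SETS = 1 << SA_S_BITS, SA_WAYS = 2 };

typedef struct {
    bool valid;
    unsigned tag;
    unsigned long last_used;   /* timestamp for LRU (assumed policy) */
} sa_line;

/* Simulates a 2-way set associative cache on a trace of byte addresses
   and returns the number of hits. The cache starts empty. */
static int sa_count_hits(const unsigned *trace, size_t len)
{
    sa_line cache[SA_SETS][SA_WAYS] = { { { false, 0, 0 } } };
    int hits = 0;

    for (size_t n = 0; n < len; n++) {
        unsigned set = (trace[n] >> SA_B_BITS) & (SA_SETS - 1);
        unsigned tag = trace[n] >> (SA_S_BITS + SA_B_BITS);
        sa_line *lines = cache[set];
        int way = -1;

        /* Line matching: compare the tag in each valid line of the set. */
        for (int w = 0; w < SA_WAYS; w++)
            if (lines[w].valid && lines[w].tag == tag)
                way = w;

        if (way >= 0) {
            hits++;
        } else {
            /* Miss: use an invalid line if any, else the least recently used. */
            way = 0;
            for (int w = 0; w < SA_WAYS; w++) {
                if (!lines[w].valid) { way = w; break; }
                if (lines[w].last_used < lines[way].last_used) way = w;
            }
            lines[way].valid = true;
            lines[way].tag = tag;
        }
        lines[way].last_used = (unsigned long)n + 1;
    }
    return hits;
}
```

On the trace 0, 1, 7, 8, 0 this gives two hits: unlike the direct-mapped case, address 8 goes into the second line of set 0, so the final read of address 0 still hits.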

(37)

Why use middle bits as index?

• High-order bit indexing

– adjacent memory lines would map to same

cache entry

– poor use of spatial locality

[Figure: the 16 memory lines 0000 through 1111 mapped onto a 4-line cache (lines 00, 01, 10, 11) under high-order bit indexing and under middle-order bit indexing.]
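The difference between the two indexing schemes can be sketched in C for the 4-bit memory line addresses and 4-line cache of the figure. The function names are illustrative. The sketch assumes that the 4-bit value is a line address with the block offset already removed, so the "middle" bits of the full address are its low two bits.

```c
/* High-order bit indexing: the top two bits of the 4-bit line address
   select the cache line. */
static unsigned high_order_index(unsigned line_addr)
{
    return (line_addr >> 2) & 0x3u;
}

/* Middle-order bit indexing: the two bits just above the block offset
   select the cache line. The line address excludes the offset bits,
   so these are its low two bits. */
static unsigned middle_order_index(unsigned line_addr)
{
    return line_addr & 0x3u;
}
```

Under high-order indexing, the adjacent lines 0000 through 0011 all land in cache line 00 and evict one another. Under middle-order indexing they spread across cache lines 00, 01, 10, and 11.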

(38)

What about writes?

• Multiple copies of data exist:

– L1

– L2

– Main Memory

– Disk

• What to do when we write?

– Write-through

– Write-back (need a dirty bit)

• What to do on a replacement?

– Depends on whether it is write through or write back
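As a rough illustration of the two write policies, the following C sketch models a single cached block and counts how many writes reach the next level. It is a deliberately minimal model, not the behavior of any particular processor, and all names in it are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool dirty;          /* used only by the write-back policy */
    int  memory_writes;  /* writes that reached the next level */
} wcache;

/* Write-through: every store is forwarded to the next level immediately. */
static void store_write_through(wcache *c)
{
    c->memory_writes++;
}

/* Write-back: a store only marks the cached block as dirty. */
static void store_write_back(wcache *c)
{
    c->dirty = true;
}

/* On replacement, a dirty block must be written to the next level first.
   A write-through block is never dirty, so eviction costs no write. */
static void evict(wcache *c)
{
    if (c->dirty) {
        c->memory_writes++;
        c->dirty = false;
    }
}
```

Three stores to the same block followed by an eviction cost three writes to the next level under write-through, but only one under write-back, which is why write-back needs the dirty bit.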

(39)

Multi-level caches

• Options: separate data and instruction caches, or a unified cache

Levels from the processor outward (larger, slower, cheaper):

Level                                size          speed   $/Mbyte     line size
Regs (processor)                     200 B         3 ns                8 B
L1 d-cache, L1 i-cache (processor)   8-64 KB       3 ns                32 B
Unified L2 cache                     1-4MB SRAM    6 ns    $100/MB     32 B
Memory                               128 MB DRAM   60 ns   $1.50/MB    8 KB
disk                                 30 GB         8 ms    $0.05/MB

(40)

Intel Pentium III cache hierarchy

Processor chip:

• Regs.

• L1 Data: 16 KB, 4-way assoc, write-through, 32B lines, 1 cycle latency

• L1 Instruction: 16 KB, 4-way, 32B lines

L2 Unified: 128KB--2 MB, 4-way assoc, write-back, write allocate, 32B lines

Main Memory: up to 4GB

(41)

Writing cache friendly code

• Repeated references to variables are good (temporal locality)

• Stride-1 references are good (spatial locality)

• Examples: cold cache, 4-byte words, 4-word cache blocks

    int sum_array_rows(int a[4][8])
    {
        int i, j, sum = 0;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 8; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 1/4 = 25%

    int sum_array_cols(int a[4][8])
    {
        int i, j, sum = 0;

        for (j = 0; j < 8; j++)
            for (i = 0; i < 4; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate = 100%
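The two miss rates follow from a simple estimate, sketched here in C. The function name is hypothetical, and the estimate rests on assumptions: a cold cache, block-aligned data, a fixed stride, and no block surviving in the cache long enough to be reused.

```c
/* Estimated miss rate for a single pass over an array with a fixed
   stride, measured in words, given the number of words per cache block.
   Assumes a cold cache and no reuse of blocks once they are evicted. */
static double stride_miss_rate(int stride_words, int words_per_block)
{
    if (stride_words >= words_per_block)
        return 1.0;   /* every access touches a new block */
    return (double)stride_words / (double)words_per_block;
}
```

With 4-word blocks, the stride-1 row-wise loop misses once per block, giving 1/4. The column-wise loop has a stride of 8 words (one row), so under these assumptions every access misses.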

(42)

The memory mountain

• Read throughput: number of bytes read from memory per second (MB/s)

• Memory mountain

– Measured read throughput as a function of spatial and temporal locality.

– Compact way to characterize memory system performance.

    /* data[] is assumed to be a global int array holding the working set. */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];

        /* So compiler doesn't optimize away the loop */
        sink = result;
    }

(43)

The memory mountain

[Figure: the memory mountain for a Pentium III 550 MHz with a 16 KB on-chip L1 d-cache, a 16 KB on-chip L1 i-cache, and a 512 KB off-chip unified L2 cache. Axes: stride (words, s1 to s15), working set size (bytes, 2k to 8m), and throughput (MB/sec, 0 to 1200). Annotations: ridges of temporal locality (L1, L2, mem) and slopes of spatial locality.]

(44)

Ridges of temporal locality

• Slice through the memory mountain (stride=1)

– illuminates read throughputs of different caches and memory

[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, 1k to 8m), showing the L1 cache region, the L2 cache region, and the main memory region.]

(45)

A slope of spatial locality

• Slice through memory mountain (size=256KB)

– shows cache block size.

[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16). Annotation: one access per cache line.]

(46)

Matrix multiplication example

• Major cache effects to consider

– Total cache size

• Exploit temporal locality and keep the working set small (e.g., use blocking)

– Block size

• Exploit spatial locality

• Description:

– Multiply N x N matrices

– O(N³) total operations

– Accesses

• N reads per source element

• N values summed per destination, but may be able to hold in register

    /* ijk */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;    /* variable sum held in register */
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

(47)

Miss rate analysis for matrix multiply

• Assume:

– Line size = 32B (big enough for four 64-bit words)

– Matrix dimension (N) is very large

• Approximate 1/N as 0.0

– Cache is not even big enough to hold multiple rows

• Analysis method:

– Look at access pattern of inner loop

[Figure: matrices A, B, and C. A is indexed by row i and column k, B by row k and column j, and C by row i and column j.]

(48)

Matrix multiplication (ijk)

    /* ijk */
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop:

A          B             C
(i,*)      (*,j)         (i,j)
Row-wise   Column-wise   Fixed

Misses per inner loop iteration:

A      B     C
0.25   1.0   0.0

(49)

Matrix multiplication (jik)

    /* jik */
    for (j=0; j<n; j++) {
        for (i=0; i<n; i++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

Inner loop:

A          B             C
(i,*)      (*,j)         (i,j)
Row-wise   Column-wise   Fixed

Misses per inner loop iteration:

A      B     C
0.25   1.0   0.0

(50)

Matrix multiplication (kij)

    /* kij */
    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop:

A       B          C
(i,k)   (k,*)      (i,*)
Fixed   Row-wise   Row-wise

Misses per inner loop iteration:

A     B      C
0.0   0.25   0.25

(51)

Matrix multiplication (ikj)

    /* ikj */
    for (i=0; i<n; i++) {
        for (k=0; k<n; k++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

Inner loop:

A       B          C
(i,k)   (k,*)      (i,*)
Fixed   Row-wise   Row-wise

Misses per inner loop iteration:

A     B      C
0.0   0.25   0.25

(52)

Matrix multiplication (jki)

    /* jki */
    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop:

A             B       C
(*,k)         (k,j)   (*,j)
Column-wise   Fixed   Column-wise

Misses per inner loop iteration:

A     B     C
1.0   0.0   1.0

(53)

Matrix multiplication (kji)

    /* kji */
    for (k=0; k<n; k++) {
        for (j=0; j<n; j++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }

Inner loop:

A             B       C
(*,k)         (k,j)   (*,j)
Column-wise   Fixed   Column-wise

Misses per inner loop iteration:

A     B     C
1.0   0.0   1.0

(54)

Summary of matrix multiplication

ijk (& jik):

• 2 loads, 0 stores

• misses/iter = 1.25

    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

kij (& ikj):

• 2 loads, 1 store

• misses/iter = 0.5

    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

jki (& kji):

• 2 loads, 1 store

• misses/iter = 2.0

    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
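The three loop orders compute the same product and differ only in their access patterns. The following self-contained C sketch shows this for a small matrix. The size MM_N = 4 and the function names are assumptions for illustration, and the kij and jki versions zero c first because they accumulate into it.

```c
enum { MM_N = 4 };   /* small assumed size, for illustration only */

/* ijk: inner loop reads a row of a and a column of b; sum stays in a register. */
static void mm_ijk(double a[MM_N][MM_N], double b[MM_N][MM_N], double c[MM_N][MM_N])
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++) {
            double sum = 0.0;
            for (int k = 0; k < MM_N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

static void mm_zero(double c[MM_N][MM_N])
{
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++)
            c[i][j] = 0.0;
}

/* kij: inner loop walks rows of b and c (stride 1). */
static void mm_kij(double a[MM_N][MM_N], double b[MM_N][MM_N], double c[MM_N][MM_N])
{
    mm_zero(c);
    for (int k = 0; k < MM_N; k++)
        for (int i = 0; i < MM_N; i++) {
            double r = a[i][k];
            for (int j = 0; j < MM_N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: inner loop walks columns of a and c (stride MM_N). */
static void mm_jki(double a[MM_N][MM_N], double b[MM_N][MM_N], double c[MM_N][MM_N])
{
    mm_zero(c);
    for (int j = 0; j < MM_N; j++)
        for (int k = 0; k < MM_N; k++) {
            double r = b[k][j];
            for (int i = 0; i < MM_N; i++)
                c[i][j] += a[i][k] * r;
        }
}
```

At this size everything fits in cache, so the sketch only demonstrates that the results agree; the performance differences discussed in the slides appear for large n.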

(55)

Pentium matrix multiply performance

• Miss rates are helpful but not perfect predictors.

• Code scheduling matters, too.

[Figure: cycles/iteration (0 to 60) versus array size n (25 to 400) for kji, jki, kij, ikj, jik, and ijk. The curves fall into three groups: kji & jki, kij & ikj, and jik & ijk.]

(56)

Improving temporal locality by blocking

• Example: Blocked matrix multiplication

– Here, “block” does not mean “cache block”.

– Instead, it means a sub-block within the matrix.

– Example: N = 8; sub-block size = 4

| A₁₁ A₁₂ |     | B₁₁ B₁₂ |     | C₁₁ C₁₂ |
| A₂₁ A₂₂ |  X  | B₂₁ B₂₂ |  =  | C₂₁ C₂₂ |

C₁₁ = A₁₁B₁₁ + A₁₂B₂₁    C₁₂ = A₁₁B₁₂ + A₁₂B₂₂

C₂₁ = A₂₁B₁₁ + A₂₂B₂₁    C₂₂ = A₂₁B₁₂ + A₂₂B₂₂

Key idea: Sub-blocks (i.e., A_xy) can be treated just like scalars.

(57)

Blocked matrix multiply (bijk)

    /* min(x, y) is assumed to be a helper that returns the smaller value */
    for (jj=0; jj<n; jj+=bsize) {
        for (i=0; i<n; i++)
            for (j=jj; j < min(jj+bsize,n); j++)
                c[i][j] = 0.0;
        for (kk=0; kk<n; kk+=bsize) {
            for (i=0; i<n; i++) {
                for (j=jj; j < min(jj+bsize,n); j++) {
                    sum = 0.0;
                    for (k=kk; k < min(kk+bsize,n); k++) {
                        sum += a[i][k] * b[k][j];
                    }
                    c[i][j] += sum;
                }
            }
        }
    }
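A self-contained version of the bijk loop nest can be written in C as follows, using N = 8 and a sub-block size of 4 as in the earlier example. The names mm_blocked, mm_naive, and min_int are hypothetical, and the fixed array sizes are an assumption made to keep the sketch short.

```c
enum { BN = 8, BSIZE = 4 };   /* N = 8, sub-block size = 4 */

static int min_int(int x, int y)
{
    return x < y ? x : y;
}

/* Blocked matrix multiply (bijk order): c = a * b. */
static void mm_blocked(double a[BN][BN], double b[BN][BN], double c[BN][BN])
{
    for (int jj = 0; jj < BN; jj += BSIZE) {
        /* Clear the column strip of c that this jj iteration accumulates into. */
        for (int i = 0; i < BN; i++)
            for (int j = jj; j < min_int(jj + BSIZE, BN); j++)
                c[i][j] = 0.0;

        for (int kk = 0; kk < BN; kk += BSIZE) {
            for (int i = 0; i < BN; i++) {
                for (int j = jj; j < min_int(jj + BSIZE, BN); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < min_int(kk + BSIZE, BN); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
            }
        }
    }
}

/* Unblocked ijk reference, used only to check the blocked result. */
static void mm_naive(double a[BN][BN], double b[BN][BN], double c[BN][BN])
{
    for (int i = 0; i < BN; i++)
        for (int j = 0; j < BN; j++) {
            double sum = 0.0;
            for (int k = 0; k < BN; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```

The min_int bounds make the loops correct even when the matrix size is not a multiple of the sub-block size, although here 8 divides evenly by 4.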

(58)

Blocked matrix multiply analysis

– Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into a 1 X bsize sliver of C

– Loop over i steps through n row slivers of A & C, using same B

Innermost loop pair:

    for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
            sum = 0.0;
            for (k=kk; k < min(kk+bsize,n); k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] += sum;
        }
    }

[Figure: in A, the row sliver (row i, columns starting at kk) is accessed bsize times. In B, the block (rows starting at kk, columns starting at jj) is reused n times in succession. In C, successive elements of the sliver (row i, columns starting at jj) are updated.]

(59)

Blocked matrix multiply performance

• Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)

– relatively insensitive to array size.

[Figure: cycles/iteration (0 to 60) versus array size n (25 to 400) for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

(60)

Concluding observations

• Programmer can optimize for cache performance

– How data structures are organized

– How data are accessed

• Nested loop structure

• Blocking is a general technique

• All systems favor “cache friendly code”

– Getting absolute optimum performance is very platform specific

• Cache sizes, line sizes, associativities, etc.

– Can get most of the advantage with generic code

• Keep working set reasonably small (temporal locality)

• Use small strides (spatial locality)