Case Study 2: Putting It All Together: Highly Parallel Memory Systems

In the document Computer Architecture Formulas 1. (pages 182-185)

Concept illustrated by this case study

Cross-Cutting Issues: The Design of Memory Hierarchies

The program in Figure 2.32 can be used to evaluate the behavior of a memory system. The key is having accurate timing and then having the program stride through memory to invoke different levels of the hierarchy. Figure 2.32 shows the code in C. The first part is a procedure that uses a standard utility to get an accurate measure of the user CPU time; this procedure may have to be changed to work on some systems. The second part is a nested loop to read and write memory at different strides and cache sizes. To get accurate cache timing, this code is repeated many times. The third part times the nested loop overhead only so that it can be subtracted from overall measured times to see how long the accesses were. The results are output in .csv file format to facilitate importing into spreadsheets.
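The text notes that the timing procedure may have to be changed on some systems. As a minimal sketch, assuming a hosted C implementation, the standard clock() function could stand in for the Microsoft-specific _time64() call. The names get_cpu_seconds and elapsed_cpu_seconds are hypothetical and do not appear in Figure 2.32.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical replacement for get_seconds() in Figure 2.32.
   Returns the processor time reported by the standard clock()
   function, converted to seconds. */
double get_cpu_seconds(void) {
    clock_t ticks = clock();
    assert(ticks != (clock_t)-1);   /* clock() returns -1 if time is unavailable */
    return (double)ticks / CLOCKS_PER_SEC;
}

/* Seconds of processor time elapsed since an earlier reading. */
double elapsed_cpu_seconds(double start) {
    return get_cpu_seconds() - start;
}
```

What clock() measures is implementation-defined in its details, and some runtimes report time elapsed since process start rather than CPU time, so the platform documentation should be checked before relying on it. Whether the finer resolution allows a shorter collection window than the 20 seconds used in the figure is not evaluated here.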

You may need to change ARRAY_MAX depending on the question you are answering and the size of memory on the system you are measuring. Running the program in single-user mode or at least without other active applications will give more consistent results. The code in Figure 2.32 was derived from a program written by Andrea Dusseau at the University of California-Berkeley and was based on a detailed description found in Saavedra-Barrera (1992). It has been modified to fix a number of issues with more modern machines and to run under Microsoft

#include "stdafx.h"
#include <stdio.h>
#include <time.h>

#define ARRAY_MIN (1024)      /* 1/4 smallest cache */
#define ARRAY_MAX (4096*4096) /* 1/4 largest cache */

int x[ARRAY_MAX]; /* array going to stride through */

double get_seconds() { /* routine to read time in seconds */
    __time64_t ltime;
    _time64( &ltime );
    return (double) ltime;
}

int label(int i) { /* generate text labels */
    if (i<1e3) printf("%1dB,",i);
    else if (i<1e6) printf("%1dK,",i/1024);
    else if (i<1e9) printf("%1dM,",i/1048576);
    else printf("%1dG,",i/1073741824);
    return 0;
}

int _tmain(int argc, _TCHAR* argv[]) {
    int register nextstep, i, index, stride;
    int csize;
    double steps, tsteps;
    double loadtime, lastsec, sec0, sec1, sec; /* timing variables */

    /* Initialize output */
    printf(" ,");
    for (stride=1; stride <= ARRAY_MAX/2; stride=stride*2)
        label(stride*sizeof(int));
    printf("\n");

    /* Main loop for each configuration */
    for (csize=ARRAY_MIN; csize <= ARRAY_MAX; csize=csize*2) {
        label(csize*sizeof(int)); /* print cache size this loop */
        for (stride=1; stride <= csize/2; stride=stride*2) {

            /* Lay out path of memory references in array */
            for (index=0; index < csize; index=index+stride)
                x[index] = index + stride; /* pointer to next */
            x[index-stride] = 0; /* loop back to beginning */

            /* Wait for timer to roll over */
            lastsec = get_seconds();
            do sec0 = get_seconds(); while (sec0 == lastsec);

            /* Walk through path in array for twenty seconds */
            /* This gives 5% accuracy with second resolution */
            steps = 0.0; /* number of steps taken */
            nextstep = 0; /* start at beginning of path */
            sec0 = get_seconds(); /* start timer */
            do { /* repeat until collect 20 seconds */
                for (i=stride;i!=0;i=i-1) { /* keep samples same */
                    nextstep = 0;
                    do nextstep = x[nextstep]; /* dependency */
                    while (nextstep != 0);
                }
                steps = steps + 1.0; /* count loop iterations */
                sec1 = get_seconds(); /* end timer */
            } while ((sec1 - sec0) < 20.0); /* collect 20 seconds */
            sec = sec1 - sec0;

            /* Repeat empty loop to subtract loop overhead */
            tsteps = 0.0; /* used to match no. while iterations */
            sec0 = get_seconds(); /* start timer */
            do { /* repeat until same no. iterations as above */
                for (i=stride;i!=0;i=i-1) { /* keep samples same */
                    index = 0;
                    do index = index + stride;
                    while (index < csize);
                }
                tsteps = tsteps + 1.0;
                sec1 = get_seconds(); /* - overhead */
            } while (tsteps<steps); /* until = no. iterations */
            sec = sec - (sec1 - sec0);
            loadtime = (sec*1e9)/(steps*csize);

            /* write out results in .csv format for Excel */
            printf("%4.1f,", (loadtime<0.1) ? 0.1 : loadtime);
        }; /* end of inner for loop */
        printf("\n");
    }; /* end of outer for loop */
    return 0;
}

Figure 2.32 C program for evaluating memory system.

Visual C++. It can be downloaded from http://www.hpl.hp.com/research/cacti/aca_ch2_cs2.c.
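To make the inner workings of Figure 2.32 easier to follow, the sketch below isolates two pieces of it: the layout of the pointer chain and the final latency arithmetic. Each counted step walks the chain stride times, and each walk performs csize/stride dependent loads, so one step corresponds to csize loads; this is why the figure divides by steps*csize. The function names build_chain, walk_chain, and ns_per_access are hypothetical and were introduced only for this illustration.

```c
#include <assert.h>

/* Link every stride-th element to the next one, as the layout loop in
   Figure 2.32 does. The last element visited points back to index 0,
   which also serves as the end-of-walk marker. */
void build_chain(int *x, int csize, int stride) {
    int index;
    assert(csize > 0 && stride > 0 && stride < csize);
    for (index = 0; index < csize; index += stride)
        x[index] = index + stride;   /* pointer to next */
    x[index - stride] = 0;           /* loop back to beginning */
}

/* Follow the chain once from index 0 and return the number of loads.
   Each load depends on the previous one, so they cannot overlap. */
int walk_chain(const int *x) {
    int next = 0, loads = 0;
    do {
        next = x[next];
        loads++;
    } while (next != 0);
    return loads;
}

/* Same arithmetic as the loadtime line of the figure: overhead-corrected
   seconds converted to nanoseconds per load. */
double ns_per_access(double sec, double steps, int csize) {
    return (sec * 1e9) / (steps * csize);
}
```

For an array of 16 elements and a stride of 4, the chain visits indices 0, 4, 8, and 12, so a single walk performs 4 loads.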

The preceding program assumes that program addresses track physical addresses, which is true on the few machines that use virtually addressed caches, such as the Alpha 21264. In general, virtual addresses tend to follow physical addresses shortly after rebooting, so you may need to reboot the machine in order to get smooth lines in your results. To answer the following questions, assume that the sizes of all components of the memory hierarchy are powers of 2. Assume that the size of the page is much larger than the size of a block in a second-level cache (if there is one) and that the size of a second-level cache block is greater than or equal to the size of a block in a first-level cache. An example of the output of the program is plotted in Figure 2.33; the key lists the size of the array that is exercised.
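Because the array sizes double from one curve to the next, a capacity limit would be expected to show up as the first array size whose access time rises sharply over that of the previous size. The following is a minimal sketch of that reading, assuming the per-size times at one fixed stride have been collected into an array. The function name first_jump and the threshold ratio are hypothetical choices and are not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first entry that is at least `ratio` times
   its predecessor, or n if no such entry exists. The entries are
   assumed to be positive access times in ns, ordered by array size. */
size_t first_jump(const double *ns, size_t n, double ratio) {
    assert(ratio > 1.0);
    for (size_t i = 1; i < n; i++)
        if (ns[i] >= ratio * ns[i - 1])
            return i;
    return n;
}
```

Measured curves are noisy, so a threshold that works for one machine may miss or misplace a transition on another; the plot itself remains the primary evidence.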

2.4 [12/12/12/10/12] <2.6> Using the sample program results in Figure 2.33:

a. [12] <2.6> What are the overall size and block size of the second-level cache?

b. [12] <2.6> What is the miss penalty of the second-level cache?

c. [12] <2.6> What is the associativity of the second-level cache?

d. [10] <2.6> What is the size of the main memory?

e. [12] <2.6> What is the paging time if the page size is 4 KB?

[Plot: read time in ns on the y-axis (logarithmic, 1 to 1000) versus stride on the x-axis (4 B to 256 MB), with one curve per array size from 8 KB to 512 MB.]

Figure 2.33 Sample results from program in Figure 2.32.

2.5 [12/15/15/20] <2.6> If necessary, modify the code in Figure 2.32 to measure the following system characteristics. Plot the experimental results with elapsed time on the y-axis and the memory stride on the x-axis. Use logarithmic scales for both axes, and draw a line for each cache size.

a. [12] <2.6> What is the system page size?

b. [15] <2.6> How many entries are there in the TLB?

c. [15] <2.6> What is the miss penalty for the TLB?

d. [20] <2.6> What is the associativity of the TLB?
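One possible starting point for these measurements, offered as a sketch rather than a prescribed solution: the TLB is exercised according to how many distinct pages a strided walk touches, not how many bytes it reads. The hypothetical helper pages_touched below counts those pages for a given array size, stride, and assumed page size, all in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct pages touched when reading offsets
   0, stride_bytes, 2*stride_bytes, ... below array_bytes.
   Offsets only increase, so a new page is detected whenever the
   page number differs from that of the previous access. */
size_t pages_touched(size_t array_bytes, size_t stride_bytes,
                     size_t page_bytes) {
    size_t pages = 0;
    size_t last_page = (size_t)-1;   /* sentinel: no page seen yet */
    assert(array_bytes > 0 && stride_bytes > 0 && page_bytes > 0);
    for (size_t off = 0; off < array_bytes; off += stride_bytes) {
        size_t page = off / page_bytes;
        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}
```

Assuming a 4 KB page (the value used in Exercise 2.4e), a 64 KB array touches 16 pages at any stride up to 4 KB and 8 pages at a stride of 8 KB. Comparing such counts against the points where the measured curves change slope is one way to reason about the number of TLB entries.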

2.6 [20/20] <2.6> In multiprocessor memory systems, lower levels of the memory hierarchy may not be able to be saturated by a single processor but should be able to be saturated by multiple processors working together. Modify the code in Figure 2.32, and run multiple copies at the same time. Can you determine:

a. [20] <2.6> How many actual processors are in your computer system and how many system processors are just additional multithreaded contexts?

b. [20] <2.6> How many memory controllers does your system have?

2.7 [20] <2.6> Can you think of a way to test some of the characteristics of an instruction cache using a program? Hint: The compiler may generate a large number of nonobvious instructions from a piece of code. Try to use simple arithmetic instructions of known length in your instruction set architecture (ISA).

Case Study 3: Studying the Impact of Various
