Concepts illustrated by this case study
■ Performance Characteristics
■ Microbenchmarks
The Shear algorithm, from work by Timothy Denehy and colleagues at the University of Wisconsin [Denehy et al. 2004], uncovers the parameters of a RAID system. The basic idea is to generate a workload of requests to the RAID array and time those requests; by observing which sets of requests take longer, one can infer which blocks are allocated to the same disk.
We define RAID properties as follows. Data are allocated to disks in the RAID at the block level, where a block is the minimal unit of data that the file system reads or writes from the storage system; thus, block size is known by the file system and the fingerprinting software. A chunk is a set of blocks that is allocated contiguously within a disk. A stripe is a set of chunks across each of D data disks. Finally, a pattern is the minimum sequence of data blocks such that block offset i within the pattern is always located on disk j.
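To make these definitions concrete, the following C sketch maps a logical block number to a disk. It assumes a plain RAID 0 array that places chunks on disks in round-robin order, which the definitions above do not require; the type raid0_geometry and both function names are illustrative and are not part of Shear.

```c
#include <assert.h>

/* Hypothetical RAID 0 geometry; all names here are illustrative. */
typedef struct {
    long blocks_per_chunk; /* chunk size, in blocks */
    long num_disks;        /* D, the number of data disks */
} raid0_geometry;

/* Pattern size in blocks: under this assumption the layout repeats every stripe. */
long raid0_pattern_blocks(raid0_geometry g)
{
    assert(g.blocks_per_chunk > 0 && g.num_disks > 0);
    return g.blocks_per_chunk * g.num_disks;
}

/* Disk holding logical block `block`, assuming round-robin chunk placement. */
long raid0_disk_of_block(raid0_geometry g, long block)
{
    assert(block >= 0);
    assert(g.blocks_per_chunk > 0 && g.num_disks > 0);
    long chunk = block / g.blocks_per_chunk; /* which chunk the block is in */
    return chunk % g.num_disks;              /* chunks rotate across disks */
}
```

For example, with 2 blocks per chunk and 4 disks, the pattern is 8 blocks long, blocks 0 and 1 map to disk 0, and block 8 wraps around to disk 0 again.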
D.4 [20/20] <D.2> One can uncover the pattern size with the following code. The code accesses the raw device to avoid file system optimizations. The key to all of the Shear algorithms is to use random requests to avoid triggering any of the prefetch or caching mechanisms within the RAID or within individual disks. The basic idea of this code sequence is to access N random blocks at a fixed interval p within the RAID array and to measure the completion time of each interval.
for (p = BLOCKSIZE; p <= testsize; p += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        request[i] = random()*p;    /* random multiple of the assumed pattern size p */
    }
    begin_time = gettime();
    issue all request[N] to raw device in parallel;
    wait for all request[N] to complete;
    interval_time = gettime() - begin_time;
    printf("PatternSize: %d Time: %d\n", p, interval_time);
}
If you run this code on a RAID array and plot the measured time for the N requests as a function of p, then you will see that the time is highest when all N requests fall on the same disk; thus, the value of p with the highest time corresponds to the pattern size of the RAID.
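The reasoning above can be turned into a rough timing model. The C sketch below estimates the completion time of one interval under several assumptions that are not stated in the text: the array is RAID 0 with round-robin chunk placement, reads on the same disk are serviced one after another, different disks work fully in parallel, and every random read costs the same fixed time. The function name probe_time_ms, its parameters, and the MAX_DISKS cap are hypothetical.

```c
#include <assert.h>

#define MAX_DISKS 64  /* arbitrary cap chosen for this sketch */

/*
 * Estimated time (ms) for n_requests reads issued at random multiples of
 * stride p (in blocks) on a hypothetical RAID 0 array.
 * Assumed model: reads on one disk serialize, disks run in parallel,
 * and every read costs read_ms.
 */
long probe_time_ms(long p, long n_requests, long read_ms,
                   long blocks_per_chunk, long num_disks)
{
    assert(p > 0 && blocks_per_chunk > 0);
    assert(num_disks > 0 && num_disks <= MAX_DISKS);

    long pattern = blocks_per_chunk * num_disks;
    long hits[MAX_DISKS] = {0};

    /* Walk one full period of multiples of p, reduced modulo the pattern. */
    for (long k = 0; k < pattern; k++) {
        long offset = (k * p) % pattern;
        hits[(offset / blocks_per_chunk) % num_disks]++;
    }

    /* The busiest disk bounds the completion time of the interval. */
    long busiest = 0;
    for (long d = 0; d < num_disks; d++)
        if (hits[d] > busiest)
            busiest = hits[d];

    /* busiest / pattern is the fraction of requests landing on that disk. */
    return n_requests * busiest * read_ms / pattern;
}
```

Calling probe_time_ms for each candidate p and plotting the results gives the predicted shape of the curve under this model: the estimate peaks whenever p is a multiple of the pattern size, because every request then lands on one disk. The model ignores seek-time variation and the statistical spread of real random requests.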
a. [20] <D.2> Figure D.26 shows the results of running the pattern size algorithm on an unknown RAID system.
■ What is the pattern size of this storage system?
■ What do the measured times of 0.4, 0.8, and 1.6 seconds correspond to in this storage system?
■ If this is a RAID 0 array, then how many disks are present?
■ If this is a RAID 0 array, then what is the chunk size?
b. [20] <D.2> Draw the graph that would result from running this Shear code on a storage system with the following characteristics:
■ Number of requests, N = 1000
■ Time for a random read on disk, 5 ms
■ RAID level, RAID 0
■ Number of disks, 4
■ Chunk size, 8 KB
D.5 [20/20] <D.2> One can uncover the chunk size with the following code. The basic idea is to perform reads from N patterns chosen at random but always at controlled offsets, c and c – 1, within the pattern.
for (c = 0; c < patternsize; c += BLOCKSIZE) {
    for (i = 0; i < N; i++) {
        requestA[i] = random()*patternsize + c;
        /* add patternsize before the modulo so that c = 0 wraps to the end of the pattern */
        requestB[i] = random()*patternsize + (c - 1 + patternsize)%patternsize;
    }
    begin_time = gettime();
    issue all requestA[N] and requestB[N] to raw device in parallel;
    wait for all requestA[N] and requestB[N] to complete;
    interval_time = gettime() - begin_time;
    printf("ChunkSize: %d Time: %d\n", c, interval_time);
}
Figure D.26 Results from running the pattern size algorithm of Shear on a mock storage system. [Plot: time (s), 0.0 to 1.5, versus pattern size assumed (KB), 32 to 256.]
If you run this code and plot the measured time as a function of c, then you will see that the measured time is lowest when the requestA and requestB reads fall on two different disks. Thus, the values of c with low times correspond to the chunk boundaries between disks of the RAID.
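The same style of model applies to the chunk size probe. The C sketch below predicts the interval time for a given offset c, measured here in blocks rather than bytes. It again assumes RAID 0 with round-robin chunk placement, serialized reads within a disk, and a fixed cost per read; boundary_probe_time_ms and its parameters are hypothetical names.

```c
#include <assert.h>

/*
 * Estimated time (ms) for the chunk size probe at block offset c within the
 * pattern, on a hypothetical RAID 0 array. The A reads all land on the disk
 * holding offset c; the B reads all land on the disk holding offset c - 1.
 */
long boundary_probe_time_ms(long c, long n_requests, long read_ms,
                            long blocks_per_chunk, long num_disks)
{
    assert(blocks_per_chunk > 0 && num_disks > 0);

    long pattern = blocks_per_chunk * num_disks;
    assert(c >= 0 && c < pattern);

    long prev = (c - 1 + pattern) % pattern;   /* offset c - 1, wrapping at 0 */
    long disk_a = (c / blocks_per_chunk) % num_disks;
    long disk_b = (prev / blocks_per_chunk) % num_disks;

    if (disk_a == disk_b)
        return 2 * n_requests * read_ms;       /* one disk services all 2N reads */
    return n_requests * read_ms;               /* two disks service N reads each */
}
```

Under this model the time drops to its low value exactly when c is the first block of a chunk, so the spacing between consecutive low points is the chunk size.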
a. [20] <D.2> Figure D.27 shows the results of running the chunk size algorithm on an unknown RAID system.
■ What is the chunk size of this storage system?
■ What do the measured times of 0.75 and 1.5 seconds correspond to in this storage system?
b. [20] <D.2> Draw the graph that would result from running this Shear code on a storage system with the following characteristics:
■ Number of requests, N = 1000
■ Time for a random read on disk, 5 ms
■ RAID level, RAID 0
■ Number of disks, 8
■ Chunk size, 12 KB
D.6 [10/10/10/10] <D.2> Finally, one can determine the layout of chunks to disks with the following code. The basic idea is to select N random patterns and to exhaustively read together all pairwise combinations of the chunks within the pattern.
Figure D.27 Results from running the chunk size algorithm of Shear on a mock storage system. [Plot: time (s), 0.0 to 1.5, versus boundary offset assumed (KB), 16 to 64.]
/* a and b are the offsets of chunk starts within the pattern, so both loops run up to patternsize */
for (a = 0; a < patternsize; a += chunksize) {
    for (b = a; b < patternsize; b += chunksize) {
        for (i = 0; i < N; i++) {
            requestA[i] = random()*patternsize + a;
            requestB[i] = random()*patternsize + b;
        }
        begin_time = gettime();
        issue all requestA[N] and requestB[N] to raw device in parallel;
        wait for all requestA[N] and requestB[N] to complete;
        interval_time = gettime() - begin_time;
        printf("A: %d B: %d Time: %d\n", a, b, interval_time);
    }
}
After running this code, you can report the measured time as a function of a and b. The simplest way to graph this is to create a two-dimensional table with a and b as the parameters and the time scaled to a shaded value; we use darker shadings for faster times and lighter shadings for slower times. Thus, a light shading indicates that the two offsets of a and b within the pattern fall on the same disk.
Figure D.28 shows the results of running the layout algorithm on a storage system that is known to have a pattern size of 384 KB and a chunk size of 32 KB.
a. [20] <D.2> How many chunks are in a pattern?
b. [20] <D.2> Which chunks of each pattern appear to be allocated on the same disks?
Figure D.28 Results from running the layout algorithm of Shear on a mock storage system. [Shaded grid: chunk, 0 to 10, on both axes.]
c. [20] <D.2> How many disks appear to be in this storage system?
d. [20] <D.2> Draw the likely layout of blocks across the disks.
D.7 [20] <D.2> Draw the graph that would result from running the layout algorithm on the storage system shown in Figure D.29. This storage system has four disks and a chunk size of four 4 KB blocks (16 KB) and is using a RAID 5 Left-Asymmetric layout.
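One way to reason about a layout such as RAID 5 Left-Asymmetric is to compute which disk holds each data chunk. The C sketch below uses the left-asymmetric rule as it is commonly defined, for example in the Linux md driver: parity starts on the last disk and moves one disk to the left with each stripe, and data chunks fill the remaining disks from left to right. This is an assumption about the layout, and Figure D.29 remains the authoritative description; all function names are hypothetical.

```c
#include <assert.h>

/* Disk holding the parity chunk of stripe `stripe` (assumed left-asymmetric rule). */
long raid5_la_parity_disk(long stripe, long num_disks)
{
    assert(stripe >= 0 && num_disks >= 2);
    return (num_disks - 1) - (stripe % num_disks);
}

/* Disk holding data chunk `chunk`; data chunks skip over the parity disk. */
long raid5_la_data_disk(long chunk, long num_disks)
{
    assert(chunk >= 0 && num_disks >= 2);
    long data_disks = num_disks - 1;          /* data chunks per stripe */
    long stripe = chunk / data_disks;
    long index = chunk % data_disks;          /* position among the data chunks */
    long parity = raid5_la_parity_disk(stripe, num_disks);
    return index >= parity ? index + 1 : index;
}

/* Nonzero when two data chunks would contend for the same disk. */
int raid5_la_same_disk(long a, long b, long num_disks)
{
    return raid5_la_data_disk(a, num_disks) == raid5_la_data_disk(b, num_disks);
}
```

With four disks, data chunks 0, 3, and 6 all map to disk 0 under this rule, so a pairwise probe of those chunks would show the slower, same-disk timing.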