
IV. Experiments and Results

4.4 Experiment Results

This section presents the final results for the cases above. We first show the data for each case under the default cache configuration and explain the results, then compare performance across cache block sizes. After the performance discussion, we analyze the cache miss rate of each case to confirm the correctness of our modified version. Finally, we briefly summarize the effect of our optimization method on the cache of the micro-architecture simulator.

Figure 23 shows the experimental results for array addition. The L1 data cache uses the default configuration (256 sets, 32 bytes per block, and a set associativity of 1), so its size is 8 KB. Since this is the first case, we describe the parts of the diagram in detail. The horizontal axis at the bottom shows the number of loop iterations, and the vertical axis on the left shows the ratio of the original SimpleScalar's data cache access time to that of the modified version. There are four columns for each fixed number of loop iterations. The two on the left are the read and write data collected within the optimal loop scope, and the two on the right are the read and write data collected over the whole program. The following figures use the same format.
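The geometry and the plotted ratio described above can be checked with a minimal sketch. Only the numbers (256 sets, 32-byte blocks, associativity 1) come from the text; the names `CacheGeometry`, `split_address`, and `access_time_ratio` are hypothetical and are not taken from SimpleScalar.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CacheGeometry:
    """Illustrative cache description (hypothetical helper, not SimpleScalar code)."""
    num_sets: int
    block_size: int      # bytes per block
    associativity: int   # blocks per set

    def capacity_bytes(self) -> int:
        # Total capacity = sets * bytes per block * blocks per set.
        return self.num_sets * self.block_size * self.associativity

    def split_address(self, addr: int) -> Tuple[int, int, int]:
        """Return (tag, set_index, block_offset) for a byte address."""
        offset = addr % self.block_size
        block_number = addr // self.block_size
        set_index = block_number % self.num_sets
        tag = block_number // self.num_sets
        return tag, set_index, offset


def access_time_ratio(original_time: float, modified_time: float) -> float:
    """Vertical-axis value: original access time divided by modified access time."""
    return original_time / modified_time


# Default L1 data cache as stated in the text.
DEFAULT_DL1 = CacheGeometry(num_sets=256, block_size=32, associativity=1)
```

With these values, `DEFAULT_DL1.capacity_bytes()` gives 8192 bytes, which matches the 8 KB stated for the default configuration. A ratio above 1 means the modified version spends less time on data cache accesses than the original.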

Figure 22 Case : Array Addition (with default configuration of L1 data cache)

As shown in Figure 23, the speedups are inconsistent when the number of loop iterations is small, because with so few iterations the results may be influenced by experimental error. For the largest test data, the modified version's speedup is about 3.4x to 3.5x. The results for the optimal loop and for the whole program also converge as the number of iterations grows; in other words, the four columns in each group become more even from left to right.

Figure 23 Case : Memory Copy (with default configuration of L1 data cache)

Figure 24 shows the results of case 2. The purple columns show the improvement in write data cache access over the whole program. As the number of iterations grows, the dark red columns, which represent the modified write data cache performance within the optimal loop, come to resemble the purple columns. The same holds for the blue and green columns. This means that the results become more stable for large test cases. In conclusion, the speedup is about 3.3x to 3.5x in the largest case.

[Chart: Memory Copy. Horizontal axis: the number of loop iterations. Series: Read (in optimal loop), Write (in optimal loop), Read (program), Write (program). Surviving data labels: 3.33, 3.38, 3.21, 3.35, 3.19, 3.39, 3.45, 3.53.]

Figure 24 Case : Comparison (with default configuration of L1 data cache)

In Figure 25, the blue columns stand out more than the others. The reason is that the modified routine changes only the load mechanism in the optimal loop, and those loads account for a small share of the total load instructions. The dark red columns are all close to 1 because the write data cache mechanism is unchanged, and the green columns grow slightly as the number of iterations goes from 100 to 100000.

Only the purple columns have not been discussed so far. They show that the total write data cache access time is reduced, which seems unreasonable because the write data cache efficiency in the optimal loop is almost the same as before. The explanation follows.

The previous chapter described another optimization that accelerates load or store instructions, and this method applies outside the optimal loop. After tracing the assembly code, we found consecutive store instructions in a certain loop whose count did not vary even when the array size and the number of loop iterations changed. That is why the purple columns show some speedup when the test data is small but little improvement for the large tests.

[Chart: Comparison. Horizontal axis: the number of loop iterations. Series: Read (in optimal loop), Write (in optimal loop), Read (program), Write (program). Surviving data labels: 4.13, 4.06.]

Figure 25 Case : Linear Search (with default configuration of L1 data cache)

In Figure 26, the read data cache time is improved in the optimal loop, with a speedup of about 3.5x to 3.8x. Because of the ratio of load instructions in the loop to other loads, the total read time performance decreased as the number of iterations increased.

Since the number of data writes differs between the original SimpleScalar and the modified one, it makes no sense to show the ratio of write data cache time in the optimal loop in this figure. The purple columns show the same trend as in the previous case: the write data access is improved outside the loop. Although the optimization yields a 2.56x speedup on the small test data, the speedup becomes negligible as the number of program iterations grows.

[Chart: LinearSearch. Horizontal axis: the number of loop iterations. Series: Read (in optimal loop), Write (in optimal loop), Read (program), Write (program). Surviving data labels: 3.81, 3.98.]

Figure 26 Case : Bubble Sort (with default configuration of L1 data cache)

Figure 27 shows the last case in our study. Comparing the read data cache efficiency in the loop with that of the total runtime reveals a strong correlation between them. As the test data gets bigger, the green columns approach the blue columns. This means that almost all load instructions are in the optimal loop when the number of loop iterations is large, so the speedup in the loop is almost identical to the speedup of the total execution time. The purple columns show the same results as in the last two cases, so we do not explain them again here.

[Chart: BubbleSort. Horizontal axis: the number of loop iterations. Series: Read (in optimal loop), Write (in optimal loop), Read (program), Write (program).]
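The relation between the in-loop speedup and the whole-program speedup discussed for the Comparison, Linear Search, and Bubble Sort cases can be illustrated with Amdahl's law. This is a standard model brought in here for explanation only; the thesis does not state that it was used. The function name and the example fractions are hypothetical and are not measured values.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only a fraction is accelerated.

    loop_fraction: share of the original access time spent in the optimized
                   loop (assumed value between 0 and 1, not measured data).
    loop_speedup:  speedup achieved inside the optimized loop.
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if loop_speedup <= 0.0:
        raise ValueError("loop_speedup must be positive")
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


if __name__ == "__main__":
    # Illustrative fractions only: as the loop's share grows toward 1,
    # the whole-program speedup approaches the in-loop speedup.
    for fraction in (0.1, 0.5, 0.9, 1.0):
        print(fraction, round(overall_speedup(fraction, 3.5), 3))
```

Under this model, a loop that holds only a small share of the loads leaves the whole-program ratio close to 1, while a loop that holds almost all of the loads brings the whole-program ratio close to the in-loop ratio, which is the behavior described for the green and blue columns.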

Cache miss rate

Figure 27 Data cache miss rate (Original)

Figures 30 and 31 show the data cache miss rates of the original and modified versions. The miss rates are almost the same in these cases across the different cache configurations. There is still a small difference between the two versions when the cache block size is 16 bytes, but it is very small and does not affect the correctness of our results.

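[Chart: data cache miss rate (Original), loop size 100000. Categories: ArrayAddition, MemCopy, Comparison, LinearSearch, BubbleSort. Vertical axis: miss rate. Series: Original 16 Bytes, Original 32 Bytes, Original 64 Bytes. Surviving data labels: 0.2499, 0.2498.]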

Figure 28 Data cache miss rate (Modified)

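[Chart: data cache miss rate (Modified), loop size 100000. Categories: ArrayAddition, MemCopy, Comparison, LinearSearch, BubbleSort. Vertical axis: miss rate. Series: Modified 16 Bytes, Modified 32 Bytes, Modified 64 Bytes. Surviving data labels, in extraction order: 0.2493, 0.2489, 0.0389, 0.0563, 0.0002; 0.125, 0.1249, 0.0206, 0.0292, 0.0001; 0.0625, 0.0625, 0.0126, 0.0166, 0.0001.]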
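The data labels near 0.25, 0.125, and 0.0625 are what a purely sequential access pattern would produce with 16-, 32-, and 64-byte blocks. The following minimal sketch shows this with a direct-mapped cache model. It assumes 4-byte words, a single array starting at address 0, and 256 sets for every block size; these are simplifications for illustration, and the function is not the simulator's code.

```python
def sequential_miss_rate(num_words: int, block_size: int,
                         num_sets: int = 256, word_size: int = 4) -> float:
    """Miss rate of a direct-mapped cache reading consecutive words once.

    All parameters are assumptions for illustration: one array, aligned at
    address 0, read word by word with no other memory traffic.
    """
    tags = [None] * num_sets          # direct-mapped: one stored tag per set
    misses = 0
    for i in range(num_words):
        block = (i * word_size) // block_size
        index = block % num_sets
        tag = block // num_sets
        if tags[index] != tag:        # cold or conflict miss: fetch the block
            misses += 1
            tags[index] = tag
    return misses / num_words


if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, sequential_miss_rate(100000, size))
```

In this model each block is fetched once and then serves `block_size / word_size` consecutive reads, so the miss rate is `word_size / block_size`. Real programs touch more than one array and also access the stack, so measured rates can differ slightly from these idealized values.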

V. Conclusion and Future Work

In this thesis, we presented a simple method to improve the performance of cache simulation in a micro-architecture simulator, and used some general test code as case studies to verify our hypothesis. After discussing the possible problems of implementing such a method in the modified SimpleScalar, we carried out many experiments, whose results are presented and discussed in the previous chapter. As expected, the results show average speedups of about 3.4x to 3.9x when memory read operations dominate the execution under the default data cache configuration.

This study makes a good case for using DBT to speed up micro-architecture simulations; we could also extend the same idea to other micro-architecture features such as pipeline or load/store buffer simulations. For example, the DBT could try to use similar optimizations at higher levels, such as merging the behavior of the same stage in the pipelines.

Figure 290 Example: Pipeline stages

On the other hand, the current modified SimpleScalar leaves much room for improvement. At present the merge mechanism is preset before the simulation runs, and the speedup factor is fixed at 4. In simulations of future processors, the actual speedup could be greater when the cache line size is more than 4 words, as is rather common for L2 or L3 caches.
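The effect of a fixed merge factor can be sketched as follows. This is only an illustration of the idea of simulating several consecutive accesses to one cache line with a single lookup; the function `count_lookups`, its parameters, and the 4-byte word size are assumptions, and the sketch does not reproduce the DBT implementation described in the earlier chapters.

```python
from typing import Iterable


def count_lookups(addresses: Iterable[int], block_size: int,
                  merge_factor: int) -> int:
    """Count simulated cache lookups when consecutive accesses are merged.

    Up to `merge_factor` consecutive accesses that fall in the same cache
    block are covered by one lookup (hypothetical model for illustration).
    """
    lookups = 0
    current_block = None
    merged = 0                        # accesses covered by the current lookup
    for addr in addresses:
        block = addr // block_size
        if block == current_block and merged < merge_factor:
            merged += 1               # covered by the previous lookup
            continue
        lookups += 1                  # a new lookup is simulated
        current_block = block
        merged = 1
    return lookups


if __name__ == "__main__":
    words = [4 * i for i in range(1024)]   # sequential 4-byte accesses
    for factor in (1, 4, 8):
        print(factor, count_lookups(words, block_size=32, merge_factor=factor))
```

For sequential word accesses and 32-byte (8-word) blocks, a merge factor of 4 reduces the number of lookups by a factor of 4, while a merge factor of 8 would reduce it by a factor of 8. This is consistent with the remark that a larger cache line leaves room for a greater speedup if the merge factor is not fixed in advance.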

References

[1] Ebcioglu, K. (2001, Jun). Dynamic binary translation and optimization. IEEE Transactions on Computers, pp. 529-548.

[2] Krall, A. (1998). Efficient JavaVM just-in-time compilation. Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pp. 205-212.

[3] Austin, T., Larson, E., and Ernst, D. (2002, Feb). SimpleScalar: an infrastructure for computer system modeling. Computer, pp. 59-67.

[4] Burger, D. and Austin, T. (1997, June). The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, pp. 13-25.
