
IV. Experimental Results

4.2. Performance

4.2.1. Execution Time

If our system fails to identify some kinds of data in a Thumb-2 binary, the translated binary will produce incorrect results. Assuming correctness, the performance of the translated binaries is therefore determined by the method used to handle the address mapping table and by the options given to the LLVM optimizer and the LLVM static compiler. How indirect branches are handled has been described in Section 3.3.2.
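As background for the measurements that follow, the sketch below illustrates one plausible shape of such an address mapping table (AMT) and its lookup. The entry layout and names are our own illustration, not mc2llvm's actual data structures.

/* A minimal C sketch of an address mapping table: (guest address,
 * translated-code address) pairs sorted by guest address, searched
 * on every indirect branch.  Names and layout are illustrative. */
#include <stddef.h>

typedef struct {
    unsigned int guest_addr;  /* address in the source Thumb-2 binary */
    void        *host_target; /* entry point of the translated code */
} amt_entry;

/* Binary search over the sorted table; returns NULL when the guest
 * address has no translated counterpart (a missed entry would make
 * the translated binary behave incorrectly, as noted above). */
void *amt_lookup(const amt_entry *amt, size_t n, unsigned int addr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (amt[mid].guest_addr == addr)
            return amt[mid].host_target;
        if (amt[mid].guest_addr < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}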

4.2.1.1. EEMBC

The execution times of the EEMBC benchmarks are shown in Figure 30. We normalize the results to the fastest configuration, which uses the -mem2reg flag when optimizing. The inputs of the first three cases are unstripped binaries, while the last two use stripped binaries. The last two cases differ in that they put the addresses following unconditional branches in the secondary and the primary address mapping table, respectively; the reason for adding such entries has been described in the last paragraph of Section 3.3.2.1. Both stripped cases also use the -mem2reg flag.

[Figure 30 (three parts): EEMBC execution time, normalized; y-axis: execution time (normalized); cases: mem2reg_unstripped, no mem2reg, QEMU, stripped(two), stripped(merge)]

Figure 30. EEMBC execution time

For all of the EEMBC benchmarks, QEMU is much slower than the other cases. This is no surprise: an SBT should be faster than a DBT, because an SBT pays the translation cost statically, off-line, so that cost is not included in the execution time. Considering the geometric mean over all cases, which is the last item of part 3 of Figure 30, the -mem2reg flag provides about a 1.9x performance improvement in the configuration with the smallest address mapping table; therefore, we always choose this flag to get better results. Regardless of how the entries after unconditional branches are accessed in the address mapping table, both stripped cases spend about 58% more time, which is very slow. The reason is that the EEMBC benchmarks are small, only about ten thousand instructions each, so the corresponding address mapping tables are also small: about 800 entries for the unstripped input and 1200 entries for the stripped one. Since the number of return addresses is the same in both AMTs, the ratio of the stripped AMT size to the unstripped AMT size grows when the number of return addresses is small. In EEMBC there are about 600 return addresses on average, which is 73% of the size of the unstripped AMT. We will show with some statistics from CINT2006 that return addresses dominate the size of the address mapping table in common programs. As a result, the higher the size ratio between the two kinds of AMT, the lower the performance. Another possible reason is that the very short execution time, less than one second, raises the proportion of total execution time spent searching the AMT, so the difference in execution time is magnified.
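To make this size argument explicit, here is a sketch in symbols, using the averages from Table 3 below. Let R be the number of return addresses (identical in both tables), and let U and S be the remaining unstripped and stripped entries, with S > U:

\[
\frac{\lvert \mathrm{AMT}_{\mathrm{stripped}} \rvert}{\lvert \mathrm{AMT}_{\mathrm{unstripped}} \rvert}
  \;=\; \frac{R + S}{R + U} \;\longrightarrow\; 1
  \quad \text{as } R \text{ grows.}
\]

With the EEMBC averages from Table 3 (R ≈ 600, unstripped AMT ≈ 819 entries, stripped AMT ≈ 950 entries), the ratio 600/819 ≈ 73% matches the figure quoted above, and the size ratio is 950/819 ≈ 1.16.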

We list some statistical results in Table 3. The last column is the sum of the previous two columns, indicating the number of entries in the case that puts the addresses after unconditional branches into the primary address mapping table. The column named "Total" is the total number of instructions in the binary, and the column named "AMT" is the size of the address mapping table for unstripped binaries. In the figures above, we can see how this approximately two-times-larger address mapping table influences the performance.

Table 3. Statistical information about some of the EEMBC benchmarks

Set          Benchmark         Total    AMT    AMT for stripped   Secondary   Sum
automotive   a2time01          10135    762    892                374         1266
             basefp01           9726    772    884                367         1251
consumer     cjpeg             17016   1364    1573               593         2166
             djpeg             18357   1355    1582               664         2246
networking   pktflowb1m         9778    770    903                379         1282
             routelookup        9528    747    854                379         1233
office       bezier01fixed      9382    735    862                357         1219
             bezier01float      9382    735    862                357         1219
telecom      autcor00data_1    10200    865    1006               390         1396
             conven00data_1    10111    855    990                387         1377
Average (about 60 benchmarks)  10425    819    950                393         1343

In some cases in the figures above, the stripped cases are better, as in canrdr816, memacc816, and puwmod816, which seems unreasonable. Since the execution times of these benchmarks are very short, less than 0.01 seconds in our experiments, indirect branches may seldom occur, and some table entries may be discarded by the LLVM optimizer. Besides, system overhead may also influence the performance, especially when the execution time is very short, so we ran each benchmark many times and took the average.

4.2.1.2. SPEC 2006 CINT

We use both the test data and the reference data of SPEC 2006 CINT to test our translated binaries. There are twelve benchmarks in CINT2006, but some of them are not runnable or generate incorrect results; a summary is shown in Table 4. Some problems, such as the memory issues in 400 and 473, are issues that mc2llvm itself should solve. Since the objective of this thesis is not to find a better way to map the memory layout from the source binary to the target binary, we leave these two benchmarks not runnable. The fork system call is regarded as future work for our system, because a multi-threaded backend must be supported for this purpose. The remaining problems are beyond our control: compilers targeting only the Thumb-2 instruction set are not popular, so such a compiler carries extra limitations, such as features that must be disabled. As a result, the ARM/Thumb-2 mixed ISA is most probably the next ISA our system will support.

Table 4. Reasons why some CINT2006 benchmarks are not runnable

400.perlbench
  Test data: our system does not support the fork system call, which is used in one of the test cases.
  Reference data: the heap grows up to 0x08048000, where the executable's read-only data is loaded.

403.gcc
  The instruction count is too large to compile, and an internal compiler error occurs due to some flag-setting issues.

464.h264
  A floating-point issue makes the result incorrect. Since the result of QEMU is also wrong, it should be a cross-compiler setting issue.

473.astar
  Test data: runnable.
  Reference data: the heap grows up to 0x08048000, where the executable's read-only data is loaded.

483.xalanc
  Cannot be compiled by our cross compiler due to some setting issues.

The execution-time results of CINT2006 are shown below: Figure 31 shows the results with test data, and Figure 32 the results with reference data. We use the execution time of binaries compiled directly for our native system (with GCC 4.4.6 and the -O2 optimization option) as the baseline and compare the others against it. The third and fourth columns list the results without and with the -mem2reg flag, respectively, when optimizing the translated bitcode. Both stripped columns put the addresses after unconditional branches, which can be function entries as described in Section 3.3.2.1, in the secondary address mapping table; the first accesses the table through LLVM switch instructions, while the second uses LLVM indirect branch instructions together with a helper function that searches the entries by binary search. The former always performs better, because the LLVM optimizer can apply more optimizations. However, when the number of possible entries grows, the LLVM optimizer and LLVM static compiler may be unable to scale to this complex work, because they consume too much memory while optimizing. As a result, some translated bitcodes cannot be optimized aggressively, such as 471.omnetpp. The result of 471 in the first stripped version is left blank because it consumes too much memory when compiling to x86 assembly.
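The sketch below contrasts the two dispatch strategies in C; the real system emits LLVM IR, where strategy 1 is a switch instruction and strategy 2 is an indirectbr fed by a search helper, and all names here are hypothetical.

/* Two ways to dispatch a translated indirect branch.  A C stand-in
 * for the LLVM IR the translator emits; names are illustrative. */
#include <stddef.h>
#include <stdio.h>

typedef void (*block_fn)(void);

static void block_8124(void) { puts("translated block 0x8124"); }
static void block_8360(void) { puts("translated block 0x8360"); }

/* Secondary AMT: sorted by guest address, shared by both strategies. */
static const struct { unsigned guest; block_fn host; } amt[] = {
    { 0x8124, block_8124 },
    { 0x8360, block_8360 },
};
static const size_t amt_len = sizeof amt / sizeof amt[0];

/* Strategy 1: one case per table entry.  The LLVM optimizer can lower
 * a switch into jump tables or branch trees, which runs fast, but
 * optimization time and memory grow with the number of entries. */
static void dispatch_switch(unsigned guest)
{
    switch (guest) {
    case 0x8124: block_8124(); return;
    case 0x8360: block_8360(); return;
    default: return; /* unknown target */
    }
}

/* Strategy 2: a helper binary-searches the table and the result feeds
 * an indirect call (standing in for LLVM's indirectbr).  Cheap to
 * compile even for large tables, but every dispatch pays an O(log n)
 * search at run time. */
static void dispatch_helper(unsigned guest)
{
    size_t lo = 0, hi = amt_len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (amt[mid].guest == guest) { amt[mid].host(); return; }
        if (amt[mid].guest < guest) lo = mid + 1; else hi = mid;
    }
}

int main(void)
{
    dispatch_switch(0x8124);
    dispatch_helper(0x8360);
    return 0;
}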

[Figure 31: CINT2006 execution time with test data (measured in seconds), compared with the native result; cases: Native, QEMU, no mem2reg, mem2reg_unstripped, stripped(switch), stripped(helper)]

Figure 31. Result of CINT2006 with test data, compared with the native result

[Figure 32: CINT2006 execution time with reference data (measured in seconds), compared with the native result; same cases as Figure 31]

Figure 32. Result of CINT2006 with ref data, compared with the native result

With test data, our best result takes 2.4 times more time than the native system on average, and 1.9 times more with reference data; however, the performance of 456 and 471 is very poor compared with the native system. Since the compiler can generate better-performing instructions when compiling directly, we do not think this can be improved without other modifications when translating to LLVM IR or additional optimization options when optimizing. Incidentally, the first stripped version appears better than our best result, but it omits the result of 471, on which the other cases perform terribly; therefore, we cannot conclude anything from this smaller value.

Take a look at Figures 33 and 34, which are normalized to our best case, for more detail. We omit the result of QEMU here since its performance is not good, even though after such a long execution most of the translated code must already be in QEMU's code cache. On average, the execution time of QEMU is 5.4 and 6.1 times more than our best result with test data and reference data, respectively. Besides, both stripped cases are only about 10% slower, which we credit to the method for selecting possible function entries described in Section 3.3.2.1. The performance difference between them is not large, but the difference in translation time, which includes translating to LLVM IR with our translator, optimizing with the LLVM optimizer, and compiling with the LLVM static compiler, is large, as shown in Figure 35. The reason is that LLVM switch instructions take more time to optimize. If translation time does not matter and the input program is not too large, the first solution, which uses LLVM switch instructions, is recommended.

[Figure 33: CINT2006 with test data, normalized to the mem2reg case; cases: mem2reg_unstripped, no mem2reg, stripped(switch), stripped(helper)]

Figure 33. Result of CINT2006 with test data, compared with our best result

[Figure 34: CINT2006 with reference data, normalized to the mem2reg case; same cases as Figure 33]

Figure 34. Result of CINT2006 with ref data, compared with our best result

Figure 35. Comparison of translation time when handling stripped binaries

We list some statistical results for CINT2006 in Table 5; benchmarks that are not runnable are also included. Note that the primary address mapping table of a stripped input binary is only about 10% bigger than the unstripped one, so the method used to maintain the secondary table is very important. We give two approaches here, and there must be better methods for handling the address mapping table; hence, the performance with stripped input binaries can certainly be enhanced further.

Table 5. Statistical information about CINT2006

Benchmark       Total    AMT     AMT-stripped   Secondary   Ratio of AMT
400.perlbench   219979   18153   19797          10658       1.09
401.bzip2        21371    1165    1364            732       1.17

The ratio between the address mapping table for stripped input and the best case, which is generated from the symbol table, is much lower in CINT2006 than in EEMBC. Table 6 lists the ratio of the number of return addresses to the size of the address mapping table. Since the number of return addresses is the same in both cases, this comparison is reasonable. We find that the larger the code size, the higher the proportion of return addresses. Therefore, we conclude that return addresses dominate the size of the address mapping table.
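For clarity, our reading of the two "Ratio" columns in Table 6 is the number of return addresses divided by the unstripped and stripped AMT sizes, respectively; for example, for 400.perlbench:

\[
\frac{16156}{18153} \approx 89\%,
\qquad
\frac{16156}{19797} \approx 82\%.
\]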

Table 6. Ratio of the number of return addresses to the AMT size (unstripped and stripped)

Benchmark        Total    Return   AMT     Ratio   AMT-stripped   Ratio
400.perlbench    219979   16156    18153   89%     19797          82%
401.bzip2         21371     896     1165   77%      1364          66%
403.gcc          589937   50281    54678   92%     58545          86%
429.mcf           12789     653      878   74%      1038          63%
445.gobmk        160832   11660    14436   81%     17144          68%
456.hmmer         66438    6290     7062   89%      7581          83%
464.h264          35817    1965     2336   84%      2631          75%
458.sjeng         19739    1588     1914   83%      2120          75%
462.libquantum   119768    5285     6063   87%      6684          79%
471.omnetpp      142032   23870    27260   88%     29589          81%
473.astar         25916    1873     2334   80%      2744          68%

The partition-L-function technique has been described in Section 3.3.4; here we show performance results under different parameters: the partition method, the method for function switching, and the permitted recursion count. The larger each slice of the L-function is, the longer the compilation time and the shorter the execution time. For convenience, we set the size of each slice of the L-function to twenty thousand instructions. All results are normalized to the overall best result. A rough sketch of the DFS partition method is given below, before the comparisons.
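As a hypothetical illustration only (this is our sketch, not mc2llvm's code), the following walks a toy call graph depth-first and opens a new slice whenever the twenty-thousand-instruction budget would be exceeded, so callers and their callees tend to land in the same slice:

/* DFS-flavored partitioning sketch: assign each function to a slice
 * of the L-function, starting a new slice when the size budget would
 * be exceeded.  All names and the graph are illustrative. */
#include <stddef.h>
#include <stdio.h>

#define MAX_FUNCS  8
#define SLICE_SIZE 20000   /* instructions per slice, as in the text */

typedef struct func {
    const char  *name;
    size_t       ninsns;                 /* instruction count */
    struct func *callees[MAX_FUNCS];     /* NULL-terminated */
    int          visited;
    int          slice;                  /* assigned slice id */
} func;

static void assign(func *f, int *slice, size_t *used)
{
    if (!f || f->visited) return;
    f->visited = 1;
    if (*used > 0 && *used + f->ninsns > SLICE_SIZE) {
        (*slice)++;                      /* budget exceeded: new slice */
        *used = 0;
    }
    f->slice = *slice;
    *used += f->ninsns;
    for (int i = 0; f->callees[i]; i++)  /* depth-first walk keeps    */
        assign(f->callees[i], slice, used); /* call chains together  */
}

int main(void)
{
    func leaf = { "leaf",  9000, { NULL },        0, 0 };
    func root = { "main", 15000, { &leaf, NULL }, 0, 0 };
    int slice = 0; size_t used = 0;
    assign(&root, &slice, &used);
    printf("%s -> slice %d, %s -> slice %d\n",
           root.name, root.slice, leaf.name, leaf.slice);
    return 0;
}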

1) Partition method: DFS vs. uniform, in Figure 36.

The results of using DFS to partition functions seem better in most cases, especially for 471; however, 458 performs better with the uniform partition method. There is thus no single correct partition method; we simply choose the one that is more likely to perform well.


Figure 36. DFS vs. Uniform

2) Method for function switching: helper function vs. LLVM switch instructions, in Figure 37.

In this case, we cannot say that either choice is better. The averages show that using switch instructions is slightly better, but the difference is very small. Therefore, it is better to choose the one that takes less time to translate.

Figure 37. Helper function vs. LLVM switch instruction

3) Maximum recursion count: 0 vs. 2048, in Figure 38.

In these four cases, the results look the same, so whether to return to the main L-function every time a function switch occurs is up to the user. The only thing that must be noted is that the maximum recursion count cannot be too large; otherwise, the stack may overflow.

Figure 38. Recursion time comparison: 0 vs. 2048
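A tiny sketch of what such a limit might look like; the names are hypothetical, and in the real system the check would live in the generated dispatch code:

/* Hypothetical recursion guard: allow up to MAX_DEPTH nested
 * L-function switches before unwinding to the main L-function. */
#include <stdio.h>

#define MAX_DEPTH 2048
static int depth = 0;

/* Nonzero: the switch may recurse via a direct call; zero: the
 * caller should return to the main dispatch loop instead. */
static int may_recurse(void)
{
    if (depth >= MAX_DEPTH) return 0;
    depth++;
    return 1;
}

static void leave(void) { depth--; }

int main(void)
{
    if (may_recurse()) {             /* direct nested call path */
        puts("switch via direct call");
        leave();
    } else {
        puts("unwind to main L-function");
    }
    return 0;
}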

Actually, the four cases in 1) and 2) are the same runs, arranged differently. In summary, using DFS for partitioning and LLVM switch instructions for function switching is more likely to yield better performance when the main L-function has to be partitioned to save translation time. In fact, the instruction counts of 401, 429, and 458 are smaller than twenty thousand, so their data are for reference only.
