Next we look at the effect of our optimisations to the baseline AOT translation approach for all our benchmarks. The trace data produced by Avrora gives us a detailed view into the run-time performance and the different types of overhead. We count the number of bytes and cycles spent on each native instruction for both the native C version and our AOT-compiled version, and group them into four categories that roughly match the types of AOT translation overhead discussed in Section 5.1.2:
• PUSH,POP instructions: Matches the type 1 push/pop overhead.
• LD,LDD,ST,STD instructions: Matches the type 2 load/store overhead and directly shows the amount of memory traffic.
• MOV,MOVW instructions: For moves the picture is less clear, since the AOT compiler emits them for various reasons. Without stack caching, it emits moves to replace push/pop pairs; once the mark loops optimisation is added, it also emits moves to save a pinned value when it is popped destructively.
• other instructions: the total overhead, minus the previous three categories. This roughly matches the type 3 overhead.
The overhead from each category is defined as the number of bytes or cycles spent in the AOT version, minus the number spent in the native version for that category, normalised to the total number of bytes or cycles spent in the native C version. The detailed results for each benchmark and for each type of overhead are shown in Tables 7.4 and 7.5. In addition, Table 7.4 lists the time spent in the VM on method calls and allocating objects. The constant array optimisation is already included in these results, since MoteTrack cannot run without it. Its effect is examined separately in Section 7.6.
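The definition above can be sketched in a few lines of Python. This is only an illustration of the metric, not the tooling used to produce the tables: the function names, the dictionary layout, and the cycle counts in the usage example are hypothetical. Only the instruction grouping and the normalisation to the native total are taken from the text, and the time spent in the VM is left out.

```python
# Sketch of the per-category overhead metric described in the text.
# All names and the sample counts below are illustrative assumptions.

CATEGORIES = {
    "push/pop": {"PUSH", "POP"},
    "load/store": {"LD", "LDD", "ST", "STD"},
    "mov(w)": {"MOV", "MOVW"},
}


def categorise(opcode):
    """Map a native opcode to one of the four categories."""
    for name, opcodes in CATEGORIES.items():
        if opcode.upper() in opcodes:
            return name
    return "other"


def per_category(counts):
    """Sum per-opcode counts (cycles or bytes) into the four categories."""
    totals = {"push/pop": 0, "load/store": 0, "mov(w)": 0, "other": 0}
    for opcode, n in counts.items():
        totals[categorise(opcode)] += n
    return totals


def overhead(aot_counts, native_counts):
    """Overhead per category, as a percentage of the native total."""
    aot = per_category(aot_counts)
    native = per_category(native_counts)
    native_total = sum(native.values())
    result = {c: 100.0 * (aot[c] - native[c]) / native_total for c in aot}
    result["total"] = sum(result.values())
    return result


# Hypothetical cycle counts per opcode, for illustration only.
native = {"LD": 200, "ST": 100, "MOVW": 50, "ADD": 400, "PUSH": 25, "POP": 25}
aot = {"LD": 500, "ST": 300, "MOVW": 40, "ADD": 700, "PUSH": 400, "POP": 400}
result = overhead(aot, native)
```

With these made-up counts the mov(w) category comes out negative, because the AOT version spends fewer cycles on moves than the native version. The definition allows this for any individual category.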
Table 7.4 first shows the results without any optimisation to either the AOT compilation process or the original, direct translation of the C source code to Java. This results in a large overhead of up to a 20x slowdown for heap sort, which in this case is mostly due to method calls, since small functions and macros in the C code are not inlined in this version. Optimising the source code reduces overhead dramatically, but this is partly because the other optimisations, which target some of the same overhead, have not yet been applied. For example, in Table 7.4 optimising the source code reduces CoreMark’s overhead by 434%, while the previous section showed that when all other optimisations are applied first, the difference is only 268%. Since the source code optimisations were discussed in the previous section, the rest of this evaluation will focus on the effect of the other optimisations on the already optimised source.
Figure 7.1: Performance overhead per category (y-axis: overhead as % of native C run time)
Figure 7.1 starts with the manually optimised source code and incrementally adds each optimisation to the AOT compiler to show how they combine to reduce performance overhead. We take the average of all benchmarks, and show both the total overhead and the overhead for each instruction category. Figure 7.2 shows the total overhead for each individual benchmark.
Figure 7.2: Performance overhead per benchmark (y-axis: overhead as % of native C run time)
Using the baseline AOT compilation on the optimised sources, the type 1, 2 and 3 overheads are all significant, at 138%, 109%, and 73% respectively, and the 50% overhead in the VM is mainly spent on method calls, since the overhead from allocating temporary objects has already been removed by the source code optimisations. The basic approach does not have many reasons to emit a move, so in some cases the AOT version actually spends fewer cycles on move instructions than the C version, resulting in small negative values. When we improve the peephole optimiser to include non-consecutive push/pop pairs, push/pop overhead drops by 100.2% (of native C performance), but if the push and pop target different registers, they are replaced by a move instruction, and we see an increase of 11.5% in move overhead. For a 16-bit register pair this takes 1 cycle (for a MOVW instruction), instead of 8 cycles for two pushes and two pops. The increase in moves shows that most of the extra cases handled by the improved peephole optimiser are replaced by a move rather than eliminated, since the 11.5% extra move overhead corresponds to a 92% reduction in push/pop overhead.
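The relation between the 11.5% and 92% figures can be checked with a little arithmetic. The sketch below assumes, as a simplification, that every push/pop sequence turned into a move involves a 16-bit register pair, costing 8 cycles before and 1 cycle (a single MOVW) after. The variable names are illustrative.

```python
# Cycle costs for a 16-bit register pair, as stated in the text.
CYCLES_PUSH_POP_PAIR = 8  # two pushes and two pops
CYCLES_MOVW = 1           # a single MOVW

extra_move_overhead = 11.5   # % of native C run time
push_pop_reduction = 100.2   # % of native C run time

# Push/pop overhead that was replaced by moves rather than eliminated
# (assumes all replaced cases are 16-bit pairs).
replaced_by_moves = extra_move_overhead * CYCLES_PUSH_POP_PAIR / CYCLES_MOVW
# Remainder: push/pop overhead removed without needing a move.
eliminated_outright = push_pop_reduction - replaced_by_moves

print(f"replaced by moves:   {replaced_by_moves:.1f}%")
print(f"eliminated outright: {eliminated_outright:.1f}%")
```

Under this assumption, about 92% of the 100.2% drop comes from push/pop sequences that became moves, leaving roughly 8.2% for pairs that were removed entirely.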
Next, stack caching is introduced to utilise all available registers and eliminate most of the push/pop instructions that cannot be handled by the peephole optimiser. As a result the push/pop overhead drops to nearly 0, and so does the move overhead, since most of the moves introduced by the peephole optimiser are also unnecessary when using stack caching.
Having eliminated the type 1 overhead almost completely, popped value caching is added to remove a large number of the unnecessary load instructions. This reduces the memory traffic significantly, as is clear from the reduced load/store overhead, while the other types remain stable. Adding the mark loops optimisation further reduces loads, and this time also stores, by pinning common variables to a register. But it uses slightly more move instructions, and the fact that fewer registers are available for stack caching means stack values are spilled to memory more often. While 53.0% is saved on loads and stores, the push/pop and move overhead increase by 6.0% and 5.6% respectively.
Most of the push/pop and load/store overhead has now been eliminated and the type 3 overhead, unaffected by these optimisations, has become the most significant source of overhead. This type has many different causes, and only part of it can be eliminated with the instruction set optimisations. These optimisations, especially the 16-bit array index, also reduce register pressure, which results in a slight decrease in the other overhead types, although this is minimal in comparison. The CoreMark and FFT benchmarks are the only ones to do 16-bit to 32-bit multiplication, so the average performance improvement for SIMUL is small, but Table 7.4 shows it is significant for these two benchmarks.
Finally, the lightweight optimisation could be applied to almost every method. Lightweight methods still incur some overhead, which will be discussed in more detail in Section 7.7, but since they do not call the VM, the time spent in the VM on method calls is effectively eliminated.
Combined, the optimisations to the AOT compilation process reduce performance overhead from 377% to 67% of native C performance.
Table 7.5: Code size data per benchmark

| | B.sort | H.sort | Bin.Search | XXTEA | MD5 | RC5 | FFT | Outlier | LEC | CoreMark | MoteTrack | HeatCalib | HeatDetect | average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CODE SIZE OVERHEAD USING ORIGINAL SOURCE (% of native C) | | | | | | | | | | | | | | |
| Total | 449.2 | 298.0 | 208.2 | 287.2 | 166.0 | 239.3 | 94.9 | 316.3 | 186.4 | 159.4 | 255.0 | 26.2 | 238.5 | 225.0 |
| push/pop | 159.3 | 99.3 | 71.2 | 140.6 | 110.7 | 108.6 | 47.7 | 92.6 | 60.7 | 69.6 | 78.1 | 31.7 | 93.9 | 89.5 |
| load/store | 128.8 | 65.8 | 76.7 | 68.9 | 40.8 | 56.5 | 20.3 | 103.2 | 71.4 | 51.6 | 75.9 | 22.6 | 56.4 | 64.5 |
| mov(w) | 1.7 | 17.4 | 9.6 | 10.1 | -3.6 | 0.0 | 2.5 | 14.7 | 5.7 | -3.1 | 24.1 | -14.3 | 15.1 | 6.1 |
| other | 159.3 | 115.4 | 50.7 | 67.6 | 18.0 | 74.3 | 24.5 | 105.8 | 48.6 | 41.2 | 76.9 | -13.8 | 73.1 | 64.7 |
| OVERHEAD REDUCTION FROM SOURCE CODE OPTIMISATION (% of native C) | | | | | | | | | | | | | | |
| Source optimisation | -195.0 | -58.4 | -26.0 | -124.2 | +45.9 | +110.2 | +4.5 | -47.4 | +4.3 | -31.2 | -27.7 | 0.0 | -12.7 | -27.5 |
| CODE SIZE OVERHEAD BEFORE COMPILER OPTIMISATIONS (% of native C) | | | | | | | | | | | | | | |
| Total | 254.2 | 239.6 | 182.2 | 163.0 | 211.9 | 349.5 | 99.4 | 268.9 | 190.7 | 128.2 | 227.3 | 26.2 | 225.8 | 197.5 |
| push/pop | 71.2 | 80.5 | 60.3 | 103.7 | 133.3 | 165.3 | 52.6 | 86.3 | 63.6 | 55.2 | 72.8 | 31.7 | 83.3 | 81.5 |
| load/store | 88.1 | 73.8 | 74.0 | 28.4 | 56.7 | 67.9 | 19.7 | 101.1 | 72.9 | 45.8 | 68.2 | 22.6 | 60.1 | 59.9 |
| mov(w) | 10.2 | 9.4 | 4.1 | 2.6 | -1.0 | 2.2 | 4.3 | 4.7 | 5.7 | -3.4 | 19.6 | -14.3 | 16.2 | 4.6 |
| other | 84.7 | 75.8 | 43.8 | 28.2 | 22.9 | 114.1 | 22.8 | 76.8 | 48.6 | 30.5 | 66.7 | -13.8 | 66.2 | 51.3 |
| OVERHEAD REDUCTION PER COMPILER OPTIMISATION (% of native C) | | | | | | | | | | | | | | |
| Impr. peephole | -67.8 | -53.0 | -45.2 | -38.3 | -49.4 | -62.5 | -32.2 | -77.8 | -33.9 | -24.7 | -27.4 | -13.6 | -49.8 | -44.3 |
| Stack caching | -25.4 | -26.2 | -24.7 | -59.4 | -85.4 | -111.2 | -20.9 | -30.6 | -39.7 | -27.6 | -26.7 | -12.6 | -38.3 | -40.7 |
| Pop. val. caching | -16.9 | -29.5 | -6.8 | -6.2 | -18.7 | -18.7 | -13.5 | -5.2 | -18.5 | -9.9 | -26.7 | -8.1 | -20.7 | -15.3 |
| Mark loops | +1.7 | 0.0 | +21.9 | +5.9 | -1.2 | -2.6 | -4.2 | -16.4 | +2.5 | -1.5 | -8.7 | -1.3 | -11.4 | -1.2 |
| Const shift | 0.0 | -6.1 | -6.9 | +1.7 | +2.8 | -16.0 | -4.6 | -2.6 | -1.8 | -1.5 | 0.0 | -1.7 | -0.1 | -2.8 |
| 16-bit array index | -27.2 | -22.8 | -8.2 | -11.6 | -5.1 | -16.7 | -11.6 | -25.8 | -10.7 | -7.4 | -16.9 | -2.2 | -10.7 | -13.6 |
| SIMUL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9.9 | 0.0 | 0.0 | -3.4 | 0.0 | 0.0 | 0.0 | -1.1 |
| Lightw. methods | 0.0 | -2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -5.5 | -3.8 | -3.9 | +0.6 | -1.1 |
| CODE SIZE OVERHEAD AFTER COMPILER OPTIMISATIONS (% of native C) | | | | | | | | | | | | | | |
| Total | 118.6 | 100.0 | 112.3 | 55.1 | 54.9 | 121.8 | 2.5 | 110.5 | 88.6 | 46.7 | 117.1 | -17.2 | 95.4 | 77.4 |
| push/pop | 23.7 | 16.1 | 27.4 | 13.3 | 0.0 | 6.2 | 1.9 | -2.1 | -5.0 | 1.7 | 16.3 | 3.9 | -3.1 | 7.7 |
| load/store | 33.9 | 41.6 | 49.3 | 14.8 | 37.2 | 25.3 | -2.6 | 57.9 | 45.0 | 30.1 | 40.9 | 8.0 | 37.6 | 32.2 |
| mov(w) | 1.7 | 6.7 | 6.8 | 2.5 | -2.4 | 11.9 | -0.8 | 1.1 | 7.1 | -0.2 | 15.4 | -10.7 | 13.3 | 4.0 |
| other | 59.3 | 35.6 | 28.8 | 24.4 | 20.1 | 78.5 | 4.0 | 53.7 | 41.4 | 15.1 | 44.5 | -18.4 | 47.6 | 33.4 |
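The sections of Table 7.5 are related by simple sums: each total is the sum of its four categories, and each later total follows from the previous one plus the listed changes. The short check below illustrates this for the B.sort column only. The variable names are illustrative, and a small tolerance is used because the table values are rounded to one decimal.

```python
import math

# B.sort column of Table 7.5 (code size overhead, % of native C).
original_total = 449.2
source_optimisation = -195.0
before = {"push/pop": 71.2, "load/store": 88.1, "mov(w)": 10.2, "other": 84.7}
compiler_optimisations = {
    "Impr. peephole": -67.8,
    "Stack caching": -25.4,
    "Pop. val. caching": -16.9,
    "Mark loops": +1.7,
    "Const shift": 0.0,
    "16-bit array index": -27.2,
    "SIMUL": 0.0,
    "Lightw. methods": 0.0,
}
after = {"push/pop": 23.7, "load/store": 33.9, "mov(w)": 1.7, "other": 59.3}

TOLERANCE = 0.05  # values in the table are rounded to one decimal

# Total before compiler optimisations: original total plus source optimisation.
before_total = original_total + source_optimisation
# Total after compiler optimisations: add each optimisation's change.
after_total = before_total + sum(compiler_optimisations.values())

# Each total should match the sum of its four categories.
before_consistent = math.isclose(before_total, sum(before.values()), abs_tol=TOLERANCE)
after_consistent = math.isclose(after_total, sum(after.values()), abs_tol=TOLERANCE)

print(f"before compiler optimisations: {before_total:.1f}%")
print(f"after compiler optimisations:  {after_total:.1f}%")
```

For B.sort this reproduces the 254.2% and 118.6% totals listed in the table.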
Figure 7.3: Code size overhead per category (y-axis: overhead as % of native C size; series: total, push/pop, load/store, mov(w), vm+other; shown per benchmark and as the average)