
Next, we take a closer look at some of the benchmarks and see how the effectiveness of each optimisation depends on the characteristics of the source code. Table 7.2 shows the size of each benchmark, the distribution of the executed bytecode instructions, and both the maximum and average number of bytes on the VM stack. We can see some important differences between the benchmarks. While the sort benchmarks on the left are almost completely load/store bound, XXTEA, RC5 and MD5 are much more computation intensive, spending fewer instructions on loads and stores, and more on math or bitwise operations. The left three benchmarks and the outlier detection benchmark have only a few bytes on the stack, but as the benchmarks contain more complex expressions, the number of values on the stack increases.

Tables 7.4 and 7.5 show the performance and code size overhead for each benchmark, split up per instruction category. First, the overhead using the unoptimised compiler and original unoptimised source code is shown, followed by the effect of the source code optimisations. Then the resulting overhead using optimised source but the original AOT compiler is shown, followed by the effect on the total overhead when the different optimisations are incrementally added to the compiler, and finally the resulting overhead per category after applying all optimisations.

First, looking at the effect of the source code optimisations, the performance for all benchmarks improves, except for heat detection which slows down by a small fraction.

One of the optimisations done is to store a computed value or value retrieved from an array in a temporary variable if it is known the value will not change and is needed again later.

After the optimisations to the AOT compiler have been added, accessing local variables is much cheaper than accessing an array, and heat detection’s optimised source is slightly faster. With the baseline AOT compiler, however, the extra loads and stores added by this optimisation are much more expensive, which in this case just tipped the balance to a small loss.
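The temporary-variable optimisation described above can be sketched as follows. This is a minimal illustration in Python rather than the Java used by the benchmarks, and the function names and the rise-counting task are hypothetical; they are not taken from the heat detection source. The first version re-reads array elements, while the second keeps each value in a local variable once it has been loaded.

```python
def count_rises_uncached(samples, threshold):
    """Reads from the array up to three times per iteration."""
    count = 0
    for i in range(1, len(samples)):
        # samples[i] may be read twice and samples[i - 1] once
        if samples[i] > threshold and samples[i] > samples[i - 1]:
            count += 1
    return count


def count_rises_cached(samples, threshold):
    """Reads each element once and keeps it in a local variable."""
    if not samples:
        return 0
    count = 0
    previous = samples[0]
    for i in range(1, len(samples)):
        current = samples[i]      # single array access per iteration
        if current > threshold and current > previous:
            count += 1
        previous = current        # reused instead of re-reading samples[i - 1]
    return count
```

Both functions return the same result. The cached version trades array accesses for extra local variable loads and stores, which is only a gain when local variable access is the cheaper of the two.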

The source code optimisations also result in an average reduction of the code size overhead in Table 7.5, but it increases for some benchmarks due to inlining of small methods.

This particularly affects RC5, for which 10 method calls were inlined. Again, without the other optimisations, the overhead from this is more significant. The difference between the inlined and non-inlined versions with all other optimisations applied is shown in Table 8.3.

Looking at the compiler optimisations, the improved peephole optimiser and stack caching both target the push/pop overhead. Stack caching can eliminate almost all of this overhead and replaces the need for a peephole optimiser, but it is interesting to compare the two. The improved peephole optimiser does well for the simple benchmarks like sorting, binary search and outlier detection, leaving less overhead to remove for stack caching. The more computation-intensive benchmarks contain more complicated expressions, which means there is more distance between a push and a pop, leaving more cases that cannot be handled by the peephole optimiser. For these benchmarks, replacing the peephole optimiser with stack caching yields a big improvement.

The benchmarks on the left spend more time on load/store instructions. This results in higher load/store overhead, and the two optimisations that target this overhead, popped value caching and mark loops, have a big impact. For the computation-intensive benchmarks, the load/store overhead is much smaller, but the higher stack size means stack caching is very important for these benchmarks.

The smaller benchmarks highlight certain specific aspects of our approach, while the larger CoreMark benchmark covers a mix of different types of processing. As a result, it is an average case in almost every row in Table 7.4.

Bubble sort

Next we look at bubble sort in some more detail. After optimisation, most of the stack related overhead has been eliminated and of the 101.2% remaining performance overhead, most is due to other sources. For bubble sort there is a single, clearly identifiable source.

The detailed trace output shows that 79.8% is due to ADD instructions, but bubble sort hardly does any additions. This is a good example of how the simple JVM instruction set leads to less efficient code. To access an array the VM needs to calculate the address of the indexed value, which takes one move and five additions for an array of shorts. This calculation is repeated for each access. The C version is more efficient, using the auto-increment version of the ATmega’s LD and ST instructions to slide a pointer over the array.

Of the remaining 101.2% overhead, 93.1% is caused by these address calculations.
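The difference between the two access styles can be illustrated with a small model. This is only a sketch in Python under stated assumptions: memory is modelled as a flat byte string, the function names are hypothetical, and the counter tracks how often an element address is computed from scratch. It does not reproduce the actual instruction sequence emitted by the VM or by avr-gcc.

```python
import struct

ELEM_BYTES = 2  # a Java short occupies 16 bits


def sum_indexed(memory, base, length):
    """VM style: recompute the element address from base and index on every access."""
    total = 0
    address_calcs = 0
    for i in range(length):
        addr = base + i * ELEM_BYTES          # repeated for each access
        address_calcs += 1
        total += struct.unpack_from("<h", memory, addr)[0]
    return total, address_calcs


def sum_sliding(memory, base, length):
    """C style: compute the start address once, then slide a pointer over the array."""
    total = 0
    ptr = base                                # single address calculation
    address_calcs = 1
    for _ in range(length):
        total += struct.unpack_from("<h", memory, ptr)[0]
        ptr += ELEM_BYTES                     # models the auto-increment of LD/ST
    return total, address_calcs
```

For an array of four shorts, the indexed version performs four address calculations and the sliding version one, while both produce the same sum.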

HeatCalib and FFT

Table 7.5 shows that after optimisation the HeatCalib benchmark has a negative code size overhead. This is because the C versions are compiled using avr-gcc’s -O3 optimisations, optimising for performance instead of code size. In this case, as well as for FFT, this caused avr-gcc to duplicate a part of the code, which improves performance but at the cost of a significantly larger code size.

MoteTrack

The MoteTrack benchmark is by far the slowest of our benchmarks, at a 156% overhead compared to native C. MoteTrack stores a database of reference signatures in flash memory. In C this is a complex struct containing a number of sub-structures and fixed-sized arrays. In Java this becomes a collection of objects and arrays, shown in Figure 8.1.

Since the layout of the complete C structure is known at compile time, the C function to load a reference signature from the database can simply use memcpy_P to copy a block of 80 bytes from flash memory to RAM. In Java, the method to read from flash memory must follow several references to find the locations to put each value. As a result, reading a single signature takes 1455 cycles in Java, and only 735 cycles in C.

After the reference signature is loaded, the fixed offsets in the C structure mean that using the loaded signature is also more efficient in C than in Java, which must again follow a number of references to reach the data. We discuss this in more detail in Section 8.3.

LEC

In Section 1.2.1 we calculated that the LEC compression algorithm reduced the energy spent to transmit the sample ECG data by 650 μJ, at the expense of 246 μJ spent on CPU cycles compressing the data, when implemented in C and using the ATmega128 CPU and CC2420 radio.

A compression algorithm like LEC is a good example of an optimisation that may be part of an application loaded onto a sensor node. However, if the overhead of using a VM is too high, the cost of compression may outweigh the energy saved on transmission. Table 7.4 shows that using the baseline AOT approach, the LEC benchmark has an overhead of 885.3%, which drops to 272.6% after optimising the source code to avoid repeatedly creating a small object. This means the CPU has to stay active longer, and compressing the data would cost 246 μJ × 3.726 ≈ 917 μJ, which is more than the 650 μJ saved on transmission.

Figure 7.5: XXTEA performance overhead for different numbers of pinned register pairs (vertical axis: overhead as % of native C run time; series: total, push/pop, load/store, mov(w), vm+other)

After we apply our optimisations, the overhead is reduced to 84.6%, resulting in 246 μJ × 1.846 ≈ 454 μJ spent on compression. While the savings are less than when using native C to compress the data, our optimisations mean that in this scenario, we can save on transmission costs by using LEC compression, while using the baseline AOT approach, LEC compression would have resulted in a net loss.
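The arithmetic behind this comparison can be written out as a short calculation. The sketch below is in Python and uses only the figures quoted in the text (246 μJ for native compression, 650 μJ saved on transmission); the function names are hypothetical, and it assumes, as the text does, that the energy spent on compression scales linearly with run time.

```python
NATIVE_COMPRESSION_UJ = 246.0   # energy spent compressing in native C
TRANSMISSION_SAVING_UJ = 650.0  # energy saved on the radio by sending less data


def compression_cost_uj(overhead_percent):
    """Energy spent on compression at a given overhead over native C."""
    return NATIVE_COMPRESSION_UJ * (1 + overhead_percent / 100)


def net_saving_uj(overhead_percent):
    """Positive when compression saves more energy than it costs."""
    return TRANSMISSION_SAVING_UJ - compression_cost_uj(overhead_percent)


def break_even_overhead_percent():
    """Overhead at which the compression cost equals the transmission saving."""
    return (TRANSMISSION_SAVING_UJ / NATIVE_COMPRESSION_UJ - 1) * 100


for label, overhead in [("optimised source, baseline AOT", 272.6),
                        ("all optimisations", 84.6)]:
    print(f"{label}: cost {compression_cost_uj(overhead):.0f} uJ, "
          f"net saving {net_saving_uj(overhead):.0f} uJ")
```

Under these assumptions the break-even point lies at an overhead of roughly 164%, so the 84.6% overhead leaves a net saving while the 272.6% overhead does not.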

XXTEA and the mark loops optimisation

The XXTEA benchmark has the highest average stack depth of all benchmarks. As a result, popped value caching does not have much effect: most registers are used for real stack values, leaving few chances to reuse a value that was previously popped from the stack.

When the mark loops optimisation is applied, performance actually degrades by 5%!

Here we have an interesting trade-off: if a register is used to pin a variable, accessing that variable will be cheaper, but this register will no longer be available for stack caching, so more stack values may have to be spilled to memory.

Figure 7.6: Per benchmark performance overhead for different numbers of pinned register pairs (vertical axis: overhead as % of native C run time)

For most benchmarks, using the maximum of 7 register pairs to pin variables was also the best option. At a lower average stack depth, the smaller number of registers available for stack caching is easily compensated for by cheaper variable access. For XXTEA, however, the cost of spilling more stack values to memory outweighs the gains from cheaper variable access when too many variables are pinned. Figure 7.5 shows the overhead for XXTEA from the different instruction categories. When the number of register pairs used to pin variables is increased from 1 to 7, the load/store overhead steadily decreases, but the push/pop and move overhead increase. For XXTEA, the optimum is at 5 pinned register pairs, at which the total overhead is only 43%, instead of 58% at 7 pinned register pairs.

Interestingly, when we pin 7 pairs, the AOT version does fewer loads and stores than the C compiler. Under high register pressure the C version may spill a register value to memory and later load it again, adding extra load/store instructions. When the AOT version pins too many registers, it will also need to spill values, but this adds push/pop instructions instead of loads/stores.

Figure 7.6 shows the performance for each benchmark as the number of pinned register pairs is increased. The benchmarks that stay stable or even slow down when the number of pinned pairs is increased beyond 5 are those with a high stack depth, while the benchmarks with low stack depth, such as sort, binary search and outlier detection, improve significantly. It should be possible to develop a simple heuristic to allow the VM to make a better decision on the number of registers to pin. Since our current VM always pins 7 pairs, we used this as our end result and leave this heuristic to future work.

Table 7.7: Effect of constant arrays on memory consumption and performance

                                RC5             FFT             LEC             MoteTrack
Size of constant data           200             2,048           51              20,560
Using constant arrays           no      yes     no      yes     no      yes     no          yes
Performance overhead            19.5%   19.5%   17.7%   17.7%   86.5%   84.6%   cannot run  156.3%
Size of constant data in flash  1,998   204     26,714  2,052   930     59      cannot run  20,588
Size of constant data in RAM    208     0       2,056   0       67      0       cannot run  0