• 沒有找到結果。

Next we examine the effects of our optimisations on code size. Two factors are important here: the size of the VM itself and the size of the code it generates.

The size overhead for the generated code is shown in figures 7.3 and 7.4, again split up per instruction category and benchmark respectively. For the first three optimisations, the two graphs follow a similar pattern as the performance graphs. These optimisations eliminate the need to emit certain instructions, which reduces code size and improves performance at the same time.

The fourth optimisation, mark loops, moves loads and stores for pinned variables out-side of the loop. This reduces performance overhead by 42%, but the effect on code size varies per benchmark: some are slightly smaller, others slightly larger.

For each variable that is live at the beginning of the loop, the VM emits a load before the mark loop block, so code size is only reduced if the variable is loaded more than once.

Code size may actually increase if the value is then popped destructively, since this causes the VM to emit a mov. Stores follow a similar argument. Also, for small methods the extra registers used may mean more call-saved registers have to be saved in the method prologue. Finally, we get the performance advantage for each run-time iteration, but the effect on code size, whether positive or negative, only once.

-50

Overhead (% of native C size)

Bubble sort

Figure 7.4: Code size overhead per benchmark

The constant shift optimisation unrolls the loop that is normally generated for bit shifts.

This significantly improves performance, but the effect on the code size depends on the number of bits to shift by. The constant load and loop take at least 5 instructions. In most cases the unrolled shifts are smaller, but MD5 and XXTEA show a small increase in code size since they contain shifts by a large number of bits.

Using 16-bit array indexes also reduces code size. The benchmarks here already have the manual source code optimisations, so they use short index variables. This means the infuser emits S2I instructions to cast them to 32-bit ints if the array access instructions expect an int index. Not having to emit those when the array access instructions expect a 16-bit index, and the reduced work the access instruction needs to do, saves 14% code size overhead in addition to the 19% reduction in performance overhead. Using 32-bit variables in the source code also removes the need for S2I instructions, but the extra effort needed to manipulate the 32-bit index variable would make the net code size even larger.

7.4.1 VM code size and break-even point

These more complex code generation techniques do increase the size of our compiler. The first column in Table 7.6 shows the difference in code size between the AOT compiler and

Darjeeling’s interpreter. The basic AOT approach is 6863 bytes larger than the interpreter, and each optimisation adds a little to the size of the VM.

They also generate significantly smaller code. The third column shows the reduction in the generated code size compared to the baseline approach. Here we show the reduction in total size, as opposed to the overhead used elsewhere, to be able to calculate the break-even point. Using the improved peephole optimiser adds 276 bytes to the VM, but it reduces the size of the generated code by 14.6%. If we have more than 1.9 KB available to store user programmes, this reduction will outweigh the increase in VM size. Adding more complex optimisations further increases the VM size, but compared to the baseline approach, the break-even point is well within the range of memory typically available on a sensor node, peaking at at most 17.8 KB.

As is often the case, there is a trade-off between size and performance. The interpreter is smaller than each version of our AOT compiler, and Table 7.2 shows bytecode is smaller than both native C and AOT compiled code, but the interpreter’s performance penalty may be unacceptable in many scenarios. Using AOT compilation we can achieve much better performance, but the most important drawback has been an increase in generated code size. These optimisations help to mitigate this drawback, and both improve performance, and allow us to load more code on a node.

For the smallest devices, or if we want to be able to load especially large programmes, we may decide to use only a selection of optimisations to limit the VM size and still get both a reasonable performance, and most of the code size reduction. For example, dropping the markloop optimisation would reduces the size of the VM by 3 KB but keeps most of the reduction in generated code size, while performance overhead would increase to around 109%.

7.4.2 VM memory consumption

The last column in Table 7.6 shows the amount of data that needs to be kept in memory while translating a method. We would like our VM to be able to load new code while other tasks are running concurrently. Here we only list the data that the VM needs to maintain

Table 7.6: VM size and memory consumption

Size vs Size vs AOT code Break Memory

interpreter baseline reduction even usage

Baseline 6863 B 25 B

Improved peephole 7139 B 276 B (+276) -14.6% 1.9 KB 25 B

Stack caching 7961 B 1098 B (+822) -27.8% 3.9 KB 36 B

Popped value caching 9229 B 2366 B (+1268) -33.1% 7.1 KB 80 B

Markloop 12511 B 5648 B (+3282) -33.4% 16.9 KB 87 B

Const shift 12955 B 6092 B (+444) -34.3% 17.8 KB 87 B

16-bit array index 12935 B 6072 B (-20) -38.7% 15.7 KB 87 B

SIMUL 13001 B 6138 B (+66) -39.2% 15.7 KB 87 B

Lightweight methods 13549 B 6686 B (+548) -39.7% 16.8 KB 87 B The constant shift optimisation adds 170 bytes to the VM size. Because the MoteTrack bench-mark cannot run without it, we cannot calculate the average code size reduction.

in between receiving messages with new code, since this is the amount of memory that will not be available to other tasks during this process. Of course, when new code is being processed, more stack memory is used, but this is freed after a batch of instructions has been translated and can be reused by other applications.

For the baseline approach we only use 25 bytes for a number of commonly used values such as a pointer to the next instruction to be compiled, the number of instructions in the method, etc. The basic stack caching approach adds a 11 byte array to store the state of each register pair we use for stack caching. Popped value caching adds two more arrays of 16-bit elements to store the value tag and age of each value. Mark loops only needs an extra 16-bit variable to mark which registers are pinned, and a few other variables.

Finally, the instruction set optimisations do not require any additional memory. In total, our compiler requires 87 bytes of memory during the compilation process.