The CoreMark benchmark was developed by the Embedded Microprocessor Benchmark Consortium as a general benchmark for embedded CPUs. It consists of three main parts:
• Matrix multiplication
• A state machine
• Linked list processing
Since CoreMark is the largest benchmark, we will use it to discuss some of the challenges when translating its C code to Java.
The biggest complication is that CoreMark makes extensive use of pointers, which do not exist in Java. In cases where a pointer to a simple variable is passed to a function, we simply wrap it in a wrapper object. A more complicated case is the core_list_mergesort function, which takes a function pointer parameter cmp used to compare list elements. Two different implementations exist, cmp_idx and cmp_complex. Here we initially choose the most canonical way to do this in Java, which is to define an interface and pass an object with the right implementation to core_list_mergesort.
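The two translations described above can be sketched as follows. This is a minimal illustration, not the code of the actual port: the names ShortRef, ListCmp, CmpIdx, addTo and smaller are hypothetical, and CmpIdx is a simplified stand-in for cmp_idx that only compares the idx field.

```java
// Illustrative sketch only: ShortRef, ListCmp, CmpIdx and the helper methods
// are hypothetical names, not taken from the actual CoreMark port.
class PointerTranslationSketch {

    /** Wrapper standing in for a C `short *` parameter. */
    static final class ShortRef {
        short value;
        ShortRef(short value) { this.value = value; }
    }

    /** Simplified list element with the two fields used for comparison. */
    static final class ListData {
        short data16;
        short idx;
        ListData(short data16, short idx) { this.data16 = data16; this.idx = idx; }
    }

    /** Interface replacing the C function pointer parameter `cmp`. */
    interface ListCmp {
        int compare(ListData a, ListData b);
    }

    /** Simplified stand-in for cmp_idx: orders elements by their idx field. */
    static final class CmpIdx implements ListCmp {
        public int compare(ListData a, ListData b) { return a.idx - b.idx; }
    }

    /** The callee updates the caller's value through the wrapper object. */
    static void addTo(ShortRef acc, short delta) {
        acc.value = (short) (acc.value + delta);
    }

    /** Returns the smaller element; the comparison is a virtual call. */
    static ListData smaller(ListData a, ListData b, ListCmp cmp) {
        return cmp.compare(a, b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        ShortRef total = new ShortRef((short) 5);
        addTo(total, (short) 3);
        System.out.println(total.value);

        ListData first = new ListData((short) 10, (short) 2);
        ListData second = new ListData((short) 20, (short) 1);
        System.out.println(smaller(first, second, new CmpIdx()).data16);
    }
}
```

Both the wrapper and the comparer are heap objects, and the call through ListCmp is a virtual call.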
The C version of the linked list benchmark takes a block of memory and constructs a linked list inside it by treating it as a collection of list_head and list_data structs, shown in Listing 7.1. One way to mimic this as closely as possible is to use an array of shorts of equal size to the memory block used in the C version, and use indexes into this array instead of C pointers. However this leads to quite messy code.
typedef struct list_data_s {

public static final class ListData {
    public short data16;
    public short idx;
}

public static final class ListHead {
    ListHead next;
    ListData info;
}
Listing 7.1: C and Java version of the CoreMark list data structures
Table 7.3: Effect of manual source optimisation on the CoreMark benchmark

                                            list                  matrix                state                 total
                                            time(a) (vs. nat. C)  time(a) (vs. nat. C)  time(a) (vs. nat. C)  time(a) (vs. nat. C)
native C                                    17.9                  49.2                  18.0                  85.1
baseline                                    124.3 (+594%)         360.5 (+633%)         289.0 (+1506%)        774.4 (+810%)
optimised, using original source            55.9 (+212%)          231.4 (+370%)         75.9 (+322%)          363.4 (+327%)
manually inline small methods               -0.8 (-4%)            -37.3 (-76%)          -17.4 (-97%)          -55.5 (-65%)
use short array index variables             -0.1 (-1%)            -104.6 (-213%)        0.0 (0%)              -104.7 (-123%)
avoid recalculating expressions in a loop   +0.1 (+1%)            -7.6 (-15%)           0.0 (0%)              -7.5 (-9%)
reduce array and object access              -0.1 (-1%)            -18.0 (-37%)          -2.4 (-13%)           -20.5 (-24%)
reduce branch cost in crcu8                 -3.6 (-20%)           -0.5 (-1%)            -3.2 (-18%)           -7.3 (-9%)
using optimised source                      51.4 (+187%)          63.4 (+29%)           52.9 (+194%)          167.9 (+97%)
(non-autom.) avoid creating objects         0.8 (+4%)             0.0 (0%)              -10.6 (-59%)          -9.9 (-12%)
(non-autom.) avoid virtual calls            -22.8 (-127%)         0.0 (0%)              0.0 (0%)              -22.7 (-27%)
after non-automatic optimisations           29.4 (+64%)           63.4 (+29%)           42.3 (+135%)          135.3 (+59%)

(a) in millions of cycles
Instead we choose the more natural Java approach and define two classes to match the structs in C and create instances of these to build the list. This is also the faster option because accessing object fields is faster than array access. The trade-off is memory consumption, since each object has its own 5-byte heap header.
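To make the contrast between the two layouts concrete, the following sketch traverses the same three-element list in both forms. The array layout (two shorts per node: the index of the next node, then data16, with -1 as terminator) and the method names are assumptions made for this illustration; the source does not specify how the array variant would be laid out.

```java
// Illustrative sketch only: the array layout and all method names are
// assumptions, not taken from the actual CoreMark port.
class ListLayoutSketch {

    static final class ListData { short data16; short idx; }
    static final class ListHead { ListHead next; ListData info; }

    /** Terminator for the array variant, playing the role of a NULL pointer. */
    static final short NIL = -1;

    /** Array variant: node at index p stores [next index, data16]. */
    static int sumArray(short[] mem, short head) {
        int sum = 0;
        for (short p = head; p != NIL; p = mem[p]) {
            sum += mem[p + 1];
        }
        return sum;
    }

    /** Object variant: follows references and reads fields directly. */
    static int sumObjects(ListHead head) {
        int sum = 0;
        for (ListHead p = head; p != null; p = p.next) {
            sum += p.info.data16;
        }
        return sum;
    }

    /** Builds an object list holding the given values in order. */
    static ListHead build(short[] values) {
        ListHead head = null;
        for (int i = values.length - 1; i >= 0; i--) {
            ListHead node = new ListHead();
            node.info = new ListData();
            node.info.data16 = values[i];
            node.info.idx = (short) i;
            node.next = head;
            head = node;
        }
        return head;
    }

    public static void main(String[] args) {
        short[] mem = { 2, 10, 4, 20, NIL, 30 };
        System.out.println(sumArray(mem, (short) 0));
        System.out.println(sumObjects(build(new short[] { 10, 20, 30 })));
    }
}
```

The array variant needs index arithmetic for every field access, which is where the messiness mentioned above comes from; the object variant pays for its readability with two heap objects per list element.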
7.2.1 Manual optimisations
After translating the C code to Java, we do some manual optimisations to produce better bytecode. Since CoreMark is the most comprehensive benchmark, we use it to evaluate the effect of these manual optimisations.
Table 7.3 shows the slowdown over the native C version, broken down into CoreMark’s three main components. The baseline version, using the original Java code and without any optimisations, is 810% slower than native C. Even after applying all optimisations to the AOT compilation process, the best we can achieve with the original code is a 327% slowdown.
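The slowdown percentages follow directly from the cycle counts in Table 7.3: the overhead is the Java time divided by the native C time, minus one. A small sketch of this arithmetic, with the hypothetical helper name overheadPercent, reproduces the totals quoted above from the table's "total" column.

```java
// Illustrative sketch only: overheadPercent is a hypothetical helper name.
class OverheadSketch {

    /** Overhead relative to native C in percent, rounded to the nearest integer. */
    static long overheadPercent(double javaCycles, double nativeCycles) {
        return Math.round((javaCycles / nativeCycles - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        double nativeTotal = 85.1;  // millions of cycles, native C total from Table 7.3
        System.out.println(overheadPercent(774.4, nativeTotal));  // baseline total
        System.out.println(overheadPercent(363.4, nativeTotal));  // optimised, original source
    }
}
```

The per-optimisation rows in the table use the same denominator: for example, the 55.5 million cycles saved in total by inlining small methods correspond to about 65% of the 85.1 million cycles of the native version.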
Next we manually optimise the Java source code, starting with the optimisations described in Section 5.2 and adding a small extra optimisation to crcu8, which can easily be reorganised to reduce branch overhead. These are all optimisations that a future optimising infuser could do automatically.
The effect depends greatly on the characteristics of the code. The matrix part of the benchmark benefits most from using short array indexes, while the state machine frequently calls a small method and benefits greatly from inlining it. Combined, these optimisations reduce the overhead for the whole benchmark from 327% to 97%, which shows the importance of a better optimising infuser.
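As a rough illustration of three of the source-level optimisations named in Table 7.3 (short array index variables, avoiding recalculation inside a loop, and reducing array access), the sketch below shows a matrix-vector product before and after. It is a minimal example under our own assumptions, not the CoreMark matrix code; the method names are hypothetical, and the optimised version assumes that n * n fits in a short.

```java
// Illustrative sketch only: not the CoreMark matrix code. Method names are
// hypothetical, and the optimised version assumes n * n fits in a short.
class MatrixLoopSketch {

    /** Baseline: int indexes, row offset recomputed and c[i] updated in every step. */
    static void mulVecBaseline(int n, int[] c, short[] a, short[] b) {
        for (int i = 0; i < n; i++) {
            c[i] = 0;
            for (int j = 0; j < n; j++) {
                c[i] += a[i * n + j] * b[j];
            }
        }
    }

    /** Optimised: short indexes, running row index, local accumulator. */
    static void mulVecOptimised(short n, int[] c, short[] a, short[] b) {
        for (short i = 0; i < n; i++) {
            short k = (short) (i * n);  // row start, computed once per row
            int sum = 0;                // accumulate in a local, store once
            for (short j = 0; j < n; j++) {
                sum += a[k] * b[j];
                k++;
            }
            c[i] = sum;
        }
    }

    public static void main(String[] args) {
        short[] a = { 1, 2, 3, 4 };
        short[] b = { 5, 6 };
        int[] c = new int[2];
        mulVecOptimised((short) 2, c, a, b);
        System.out.println(c[0] + " " + c[1]);
    }
}
```

On a desktop JVM these changes would make little difference; the gains reported here are specific to the AOT-compiled code on the target platform, where 16-bit index arithmetic is cheaper than 32-bit.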
We also applied all these optimisations to the native C version to ensure a fair comparison, but the difference in performance was negligible.
7.2.2 Non-automatic optimisations
After these optimisations, CoreMark is still one of the slower benchmarks. We can improve performance further using two more optimisations. While these cannot be done automatically, even by an optimising infuser, they do not change the meaning of the programme, and a developer writing this code in Java from the start may make similar choices to optimise performance.
Table 7.3 shows that in the native version, over half of the time is spent in the matrix part of the benchmark, but for the final Java version all three parts are much closer together.
The state machine and linked list processing both suffer from a much larger slowdown than the matrix part, which by itself would be the third fastest of all our benchmarks.
One of the reasons for the slow performance of the state machine is that it creates two arrays of 8 ints, and a little wrapper object for a short to mimic a C pointer. Allocating memory on the Java heap is much more expensive than it is for a local C variable.
For linked list processing the biggest source of overhead is in the virtual method call to the comparer objects in core_list_mergesort that was used instead of a function pointer. Virtual methods cannot be made lightweight.
This is the best we can do when strictly translating the C code to Java, using only optimisations that could be done automatically. If this constraint is relaxed, these two sources of overhead can be removed as well. We can avoid repeatedly creating the small arrays and objects in the state machine by creating them once at the beginning of the benchmark and passing them down to the methods that need them. This significantly speeds up the state machine, although the list processing part incurs a small extra overhead because it needs to pass these temporary arrays and objects to the state machine.
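The idea of creating the temporary storage once and passing it down can be sketched as follows. This is a minimal illustration using a toy state machine of our own; the names Scratch, runAllocating, runReusing and scan are hypothetical and do not come from the CoreMark port.

```java
// Illustrative sketch only: a toy state machine, not CoreMark's. All names
// (Scratch, runAllocating, runReusing, scan) are hypothetical.
class ScratchReuseSketch {

    static final int NUM_STATES = 8;

    /** Temporary storage created once by the caller and reused by every call. */
    static final class Scratch {
        final int[] finalCounts = new int[NUM_STATES];
        final int[] transCounts = new int[NUM_STATES];
    }

    /** Allocating version: two fresh heap arrays for every call. */
    static int runAllocating(byte[] input) {
        return scan(input, new int[NUM_STATES], new int[NUM_STATES]);
    }

    /** Reusing version: the scratch space is passed down and only cleared. */
    static int runReusing(byte[] input, Scratch scratch) {
        for (int k = 0; k < NUM_STATES; k++) {
            scratch.finalCounts[k] = 0;
            scratch.transCounts[k] = 0;
        }
        return scan(input, scratch.finalCounts, scratch.transCounts);
    }

    /** Counts how often each state is seen and entered, then folds the counts into a checksum. */
    static int scan(byte[] input, int[] finalCounts, int[] transCounts) {
        int prev = 0;
        for (int i = 0; i < input.length; i++) {
            int state = input[i] & (NUM_STATES - 1);
            finalCounts[state]++;
            if (state != prev) {
                transCounts[state]++;
            }
            prev = state;
        }
        int check = 0;
        for (int k = 0; k < NUM_STATES; k++) {
            check += (k + 1) * finalCounts[k] + transCounts[k];
        }
        return check;
    }

    public static void main(String[] args) {
        byte[] input = { 1, 1, 2, 9 };
        Scratch scratch = new Scratch();  // created once, at the start
        System.out.println(runAllocating(input) == runReusing(input, scratch));
    }
}
```

The extra Scratch parameter has to be threaded through every caller, which corresponds to the small additional cost seen in the list processing part.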
The virtual call to the comparer object in the list benchmark is the most natural way to implement this in Java, but since there are only two implementations, we can make both compare methods static and pass a boolean to select which one to call instead of the comparer object. This saves the virtual method call, and allows ProGuard to inline the methods since they are only called from a single location.
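A minimal sketch of this change is shown below. For brevity it uses an insertion sort over an array rather than core_list_mergesort, and cmpData is a simplified placeholder for cmp_complex; the names sort, cmpIdx, cmpData and useIdx are assumptions made for this illustration.

```java
// Illustrative sketch only: an insertion sort stands in for
// core_list_mergesort, and cmpData is a simplified placeholder for
// cmp_complex. All names are hypothetical.
class StaticCompareSketch {

    static final class ListData {
        short data16;
        short idx;
        ListData(short data16, short idx) { this.data16 = data16; this.idx = idx; }
    }

    static int cmpIdx(ListData a, ListData b) { return a.idx - b.idx; }

    static int cmpData(ListData a, ListData b) { return a.data16 - b.data16; }

    /** Sorts in place; the boolean replaces the comparer object. */
    static void sort(ListData[] items, boolean useIdx) {
        for (int i = 1; i < items.length; i++) {
            ListData current = items[i];
            int j = i - 1;
            // Single call site for each static method, so both can be inlined.
            while (j >= 0
                    && (useIdx ? cmpIdx(items[j], current)
                               : cmpData(items[j], current)) > 0) {
                items[j + 1] = items[j];
                j--;
            }
            items[j + 1] = current;
        }
    }

    public static void main(String[] args) {
        ListData[] items = {
            new ListData((short) 30, (short) 0),
            new ListData((short) 10, (short) 1),
            new ListData((short) 20, (short) 2),
        };
        sort(items, false);
        System.out.println(items[0].data16);
        sort(items, true);
        System.out.println(items[0].data16);
    }
}
```

The approach only works because the set of comparison functions is small and known in advance; with an open set of implementations, the interface-based version would still be needed.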
Combined, this improves the performance of CoreMark to only 59% overhead over native C, right in the middle of the spectrum of the other benchmarks.
Similar to MoteTrack in the previous section, these results point out some weaknesses of Java when used as an embedded VM. The lack of cheap function pointers, or of a way to allocate small temporary objects or arrays in a method’s stack frame, means there will be a significant overhead in situations where the optimisations used here cannot be applied.
We discuss a way to reduce the cost of temporary objects in future VMs in Section 8.8.
In the rest of the evaluation, the manually optimised code is used for all benchmarks.
For CoreMark this includes the two non-automatic optimisations. The optimisation to avoid repeatedly creating temporary objects was also applied to the LEC and MoteTrack benchmarks.
Table 7.4: Performance data per benchmark

                         B.sort     H.sort Bin.Search      XXTEA        MD5        RC5        FFT    Outlier        LEC   CoreMark  MoteTrack  HeatCalib HeatDetect    average

PERFORMANCE OVERHEAD USING ORIGINAL SOURCE (% of native C)
Total                    1277.1     1927.2     1319.4      714.5      470.6      409.9      437.8      549.0      885.3      809.7     1018.7      210.2      203.9      787.2
push/pop                  640.1      356.7      233.7      197.2      115.7       70.1       66.6      207.2      106.6      220.4      166.5       80.9       78.8      195.4
load/store                360.1      197.4      175.3       67.0       46.7       33.2       29.3      190.3      110.7      136.8      218.2       67.6       43.8      129.0
mov(w)                     10.0       41.1        8.4        6.6        3.6        0.1        5.2       21.5        5.1        5.5       38.6       -3.0        9.5       11.7
other                     266.9      331.4      902.1       82.8      104.0       67.8       76.8      130.1      370.6      234.2      220.0       37.4       65.6      222.3
vm                          0.0     1000.6        0.0      361.1      200.4      238.7      260.0       -0.1      292.2      212.9      375.4       27.3        6.2      228.8

OVERHEAD REDUCTION FROM SOURCE CODE OPTIMISATION (% of native C)
Source optimisation      -613.2    -1234.0     -843.6     -464.1     -244.2     -285.6     -315.0      -56.5     -612.7     -433.7     -227.9        0.0        1.7     -409.9

PERFORMANCE OVERHEAD BEFORE COMPILER OPTIMISATIONS (% of native C)
Total                     663.9      693.2      475.8      250.4      226.4      124.3      122.8      492.5      272.6      376.0      790.8      210.2      205.6      377.3
push/pop                  266.9      200.8      202.2      166.4      105.3       61.9       57.2      205.5      105.6      123.8      137.7       80.9       77.5      137.8
load/store                240.3      177.5      191.0       42.5       43.9       28.5       25.2      190.4      111.7       89.2      165.3       67.6       47.6      109.3
mov(w)                     23.3       14.8        4.5        3.9        2.6       -1.2        4.2        8.0        5.1        5.3       17.6       -3.0       10.9        7.4
other                     133.5      118.4       78.1       37.7       74.6       35.1       36.2       88.8       49.0       97.7       94.8       37.4       63.4       72.7
vm                          0.0      181.7        0.0        0.0        0.0        0.0        0.0       -0.1        1.1       60.0      375.4       27.3        6.2       50.1

OVERHEAD REDUCTION PER COMPILER OPTIMISATION (% of native C)
Impr. peephole           -233.5     -157.7     -149.4      -60.3      -48.2      -23.1      -36.5     -186.9      -54.2      -58.8      -60.2      -35.2      -54.5      -89.1
Stack caching             -40.0      -56.0      -57.3      -98.4      -58.0      -39.8      -16.2      -27.8      -67.7      -40.7      -63.1      -41.4      -24.2      -48.6
Pop. val. caching        -133.1      -84.9      -67.4       -6.8      -12.9       -8.8      -10.7      -51.0      -28.8      -24.5      -41.5      -15.4      -15.5      -38.5
Mark loops               -102.9      -46.8      -85.4       +5.0      -10.9       -8.0       -7.9     -114.9      -18.0      -40.0      -54.3      -38.2      -28.6      -42.4
Const shift                 0.0      -17.1      -35.4      -18.4      -45.2      -20.9       -3.8        0.0       -9.6      -10.1        0.0      -17.2       -3.3      -13.9
16-bit array index        -53.2      -34.9      -15.7      -13.9       -5.5       -4.2       -2.8      -36.2       -9.7      -38.9      -19.7       -1.7       -9.0      -18.9
SIMUL                       0.0        0.0        0.0        0.0        0.0        0.0      -27.2        0.0        0.0      -36.6        0.0        0.0        0.0       -4.9
Lightw. methods             0.0     -207.3        0.0        0.0        0.0        0.0        0.0        0.0        0.0      -67.5     -395.7      -30.6       -0.3      -54.0

PERFORMANCE OVERHEAD AFTER COMPILER OPTIMISATIONS (% of native C)
Total                     101.2       88.5       65.2       57.6       45.7       19.5       17.7       75.7       84.6       58.9      156.3       30.5       70.2       67.0
push/pop                    0.0       -2.8        0.0       37.4        0.1        2.9        2.0       -0.2      -13.7        2.5       20.4        5.6        1.7        4.3
load/store                  1.0       29.3       27.0       -2.3       20.3        4.3        2.4        4.5       54.3       17.1       72.0        2.7       13.5       18.9
mov(w)                     10.0        9.4       11.8        5.6        1.5        0.1        2.9        6.8        7.4        9.6       14.9        5.1        4.4        6.9
other                      90.2       52.5       26.4       16.9       23.8       12.2       10.4       64.7       35.5       28.8       35.7       17.0       46.1       35.4
vm                          0.0        0.0        0.0        0.0        0.0        0.0        0.0       -0.1        1.1        0.8       13.2        0.0        4.4        1.5