Chapter 3 Designs
3.2 Array Prefetching
3.2.6 Circular Prefetching
Consider a for loop nested inside an outer loop, as follows:
while (k > 0) {
    ……
    for (int i = 0; i < a.length; i++) {
        Read a[i]
        ……
    }
    ……
}
When the index i approaches the array tail, the RPT will try to prefetch data beyond the end of the array (Figure 3.16 (a)). However, since the hardware knows the array length, we can avoid this situation with a simple comparison. Further, we may prefetch the head of the array for the next entry into the for loop, as in Figure 3.16 (b). This may bring some benefit when the for loop is entered repeatedly.
Figure 3.16 Circular prefetching
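The boundary comparison and the wrap-around to the array head described above can be sketched in Java as follows. This is only an illustration, not the hardware design: the class and method names, the use of element indices instead of byte addresses, the assumption of a positive prefetch distance, and the value -1 for "no prefetch" are all choices made for this example.

```java
// Hypothetical sketch of the circular prefetching idea; all names are illustrative.
final class CircularPrefetchSketch {

    /** Returned when no prefetch should be issued. */
    static final int NO_PREFETCH = -1;

    /**
     * Bounded prefetching: a simple comparison against the array length
     * suppresses any prefetch past the tail (the case of Figure 3.16 (a)).
     */
    static int boundedTarget(int index, int distance, int length) {
        int target = index + distance;
        return (target >= 0 && target < length) ? target : NO_PREFETCH;
    }

    /**
     * Circular prefetching: a target past the tail wraps around to the head
     * of the array, anticipating the next entry into the loop (Figure 3.16 (b)).
     * Assumes distance > 0.
     */
    static int circularTarget(int index, int distance, int length) {
        if (length <= 0) {
            return NO_PREFETCH;
        }
        int target = index + distance;
        if (target < 0) {
            return NO_PREFETCH;
        }
        return target < length ? target : target % length;
    }

    public static void main(String[] args) {
        int length = 8;
        for (int i = 0; i < length; i++) {
            System.out.println("i=" + i
                    + " bounded=" + boundedTarget(i, 2, length)
                    + " circular=" + circularTarget(i, 2, length));
        }
    }
}
```

With a prefetch distance of 2 and an 8-element array, the bounded variant issues nothing for the last two iterations, while the circular variant targets elements 0 and 1 instead.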
Chapter 4
Experiments and Results
This chapter presents the experiments and results on two benchmark suites: Sun’s CLDC HotSpot Implementation Evaluation Kit 1.0.1 and EEMBC’s GrinderBench 1.0. Section 4.1 introduces the environment settings used for the evaluations. Section 4.2 introduces the benchmarks. Section 4.3 presents some analysis of these benchmarks. Section 4.4 shows the results of applying our prefetch mechanisms and compares them with previous studies. Section 4.5 analyzes the memory traffic produced by the related work and by our designs.
4.1 Evaluation Environment
We use the cycle parameters of the Java Optimized Processor (JOP) [29] to simulate the hardware accelerator. JOP is an embedded Java processor implemented on an FPGA. It has a 4-stage pipeline and keeps the stack in internal memory (Figure 4.1). Most bytecodes are translated to microcode. A simple bytecode instruction can be mapped to a single microcode instruction; however, a complex instruction must be synthesized from several microcode instructions. For bytecodes not implemented by JOP, we trap to Sun’s KVM 1.1 on an Intel x86 processor core for software emulation. We use a 4 KB data cache with 16 bytes per line; the prefetch buffer is configured as an 8-line fully associative cache. The average memory latency is set to 50 cycles.
Figure 4.1 Datapath of JOP [24]
4.2 Benchmarks
We use two CLDC benchmark suites for our evaluation: Sun’s CLDC HotSpot Implementation Evaluation Kit (CLDC HI) version 1.0.1 and EEMBC’s GrinderBench (GB) version 1.0 [9].
Sun’s CLDC HI Evaluation Kit 1.0.1 includes four benchmarks, briefly described as follows:
Richard
Simulates the task dispatcher in the kernel of an operating system.
Delta Blue
Solves one-way constraint systems.
Image Manipulation (Processing)
Reads an image file (Sun raster image format) and performs various transformations on it, such as Sobel, threshold, 3x3 convolver, and so forth. After each transformation, it compares the result with an expected result to confirm that the transformation was done properly.
Queen
A solver for the n-queens problem, where the objective is to place n queens on a chessboard so that no queen can attack another. It is a classical problem used to illustrate several techniques such as general search and backtracking.
EEMBC’s GrinderBench 1.0 [9] contains five benchmarks:
Chess
It performs only the logical parts of a chess program, as no graphical output is available. It plays a preset number of games against itself.
Crypto
It contains multiple encrypt/decrypt engines. The following encryption engines are exercised: DES, DESede, IDEA, Blowfish and Twofish.
kXML
It processes a command script that specifies the XML documents to parse and the DOM tree manipulations to perform.
Parallel
This benchmark is used to test the performance of KVM threading capabilities. It accomplishes this by dividing computational tasks among several threads that must then cooperate with each other to complete those tasks. Two parallel algorithms are used: a merge-sort algorithm and a parallel matrix multiplication algorithm.
PNG
PNG is the standard format for image representation in J2ME implementations. This benchmark does the decoding of a PNG image, including decompression, and stores the result internally as header info, color palette(s), and image data.
4.3 Analysis on the Benchmarks
To understand the properties of Java programs, we analyzed the benchmarks. This section presents the results of that analysis for each benchmark, including stall analysis and array stride analysis.
4.3.1 Memory Stalls
Figure 4.2 shows the stall time as a fraction of the total execution time of each benchmark. On average, Sun’s CLDC HI benchmarks spend 15.9% of their execution time on stalls, and EEMBC’s GrinderBench benchmarks spend 25.7%. Reducing memory stall time is therefore worthwhile in order to speed up Java execution.
Figure 4.2 Memory stall time over total execution time (y axis: stall time / execution time)
Now we consider the composition of stall time. The stall distribution depends on the program type. A computation-intensive program, such as Richard, Chess and kXML, often spends more time on bytecode stalls. An array-based program usually spends a larger proportion of time on array stalls; Image Manipulation, Crypto and Parallel belong to this type. On average, the stall time caused by bytecode misses and array misses accounts for more than 50% of the total stall time.
With all stalls eliminated, the number of stay cycles per bytecode block ranges from 30 to 40 cycles. With stalls included, the average number of stay cycles for each benchmark is between 40 and 60 cycles.
Figure 4.4 Average stay time per bytecode block (y axis: number of cycles; series: stalls eliminated, stalls included)
If a pair of consecutively fetched blocks is sequential, we call it a sequential cross; otherwise, it is a non-sequential cross. Figure 4.5 shows the proportions of sequential and non-sequential crosses for each benchmark. On average, sequential crosses account for around half of all crosses. In Image Manipulation in particular, the proportion of sequential crosses is as high as 77.3%. Thus, sequential prefetching can cover a large share of block crosses.
Figure 4.5 Sequential strength of bytecode (series: sequential %, non-sequential %)
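The classification of crosses can be illustrated with a short Java sketch. It assumes, for this example only, that the fetch trace is given as a list of block numbers and that a pair is "sequential" when the second block number is exactly one greater than the first. The class name, the method name and the sample trace are hypothetical and are not taken from the simulator.

```java
// Hypothetical sketch: counting sequential and non-sequential crosses in a block trace.
final class CrossCounterSketch {

    /**
     * Returns {sequentialCrosses, nonSequentialCrosses} for a trace of fetched
     * block numbers. Consecutive fetches from the same block are not crosses.
     */
    static int[] countCrosses(long[] blockTrace) {
        int sequential = 0;
        int nonSequential = 0;
        for (int i = 1; i < blockTrace.length; i++) {
            long prev = blockTrace[i - 1];
            long next = blockTrace[i];
            if (next == prev) {
                continue;               // still inside the same block
            }
            if (next == prev + 1) {
                sequential++;           // falls through into the next block
            } else {
                nonSequential++;        // branch, invocation or return to another block
            }
        }
        return new int[] { sequential, nonSequential };
    }

    public static void main(String[] args) {
        long[] trace = { 10, 11, 12, 40, 41, 10, 11 };  // made-up trace
        int[] counts = countCrosses(trace);
        System.out.println("sequential=" + counts[0] + " non-sequential=" + counts[1]);
    }
}
```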
4.3.3 Stride Distributions of Arrays
We may be concerned with how large the stride field of a stride table needs to be, or, if we want to simplify our design, with which stride magnitudes are sufficient to cover. The stride distributions of each benchmark are shown in Figure 4.6. The x axis is the absolute value of the stride in bytes; the y axis is the cumulative proportion of strides. Most strides in the benchmarks have a magnitude of 4 bytes or less. That is, if our prefetching works for strides of at most 4 bytes, it works for more than 90% of strides.
Figure 4.6 Stride distributions of arrays
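As an illustration of how one point of such a cumulative distribution could be computed, the following Java sketch returns the share of strides whose absolute value does not exceed a given bound. The class name, the method name and the sample strides are made up for this example; they are not measurements from the benchmarks.

```java
// Hypothetical sketch: one point of a cumulative stride distribution.
final class StrideDistributionSketch {

    /** Fraction (0.0 to 1.0) of strides whose absolute value is at most maxBytes. */
    static double cumulativeShare(int[] stridesInBytes, int maxBytes) {
        if (stridesInBytes.length == 0) {
            return 0.0;
        }
        int covered = 0;
        for (int stride : stridesInBytes) {
            if (Math.abs(stride) <= maxBytes) {
                covered++;
            }
        }
        return (double) covered / stridesInBytes.length;
    }

    public static void main(String[] args) {
        int[] strides = { 4, -4, 1, 2, 8, -16, 4, 4, 4, 4 };  // made-up sample
        System.out.println("share with |stride| <= 4 bytes: " + cumulativeShare(strides, 4));
    }
}
```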
4.4 Results of Prefetching
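To evaluate the effects of prefetching, we define the remaining stall ratio (RSR).
Remaining stall ratio(certain data) = (stall time of that data with prefetching) / (stall time of that data without prefetching)
For example:
Remaining stall ratio(bytecode) = (bytecode stall time with prefetching) / (bytecode stall time without prefetching)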
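A small Java sketch of this metric is given below. It assumes that the RSR of a data type is the stall time that remains with prefetching enabled, divided by the stall time of the same data type without prefetching, expressed as a percentage; this reading is inferred from the name and from how the metric is used in the figures. The class name, method name and cycle counts are hypothetical.

```java
// Hypothetical sketch: remaining stall ratio (RSR) under the assumption stated in the lead-in.
final class RemainingStallRatioSketch {

    /**
     * RSR in percent: stall cycles with prefetching over stall cycles without
     * prefetching, for one data type (for example bytecode or array data).
     */
    static double rsrPercent(long stallCyclesWithPrefetch, long stallCyclesWithoutPrefetch) {
        if (stallCyclesWithoutPrefetch <= 0) {
            throw new IllegalArgumentException("baseline stall time must be positive");
        }
        return 100.0 * stallCyclesWithPrefetch / stallCyclesWithoutPrefetch;
    }

    public static void main(String[] args) {
        // Made-up numbers: 1000 stall cycles without prefetching, 600 with prefetching.
        System.out.println("RSR = " + rsrPercent(600, 1000) + "%");
    }
}
```

Under this reading, a lower RSR is better: an RSR of 60% corresponds to a 40% reduction in stall time for that data type.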
4.4.1 Prefetching for Bytecode
First, we discuss the size of the NLPT and its effects. In Figure 4.7, the x axis represents the number of NLPT entries and the y axis represents the remaining stall ratio of bytecode. The left-most point for each benchmark is the RSR(bytecode) of sequential prefetching only. As the table size grows, the RSR(bytecode) decreases, but only slightly in most benchmarks. The NLPT has no effect at all for Queen and is worse than sequential prefetching for Parallel.
Figure 4.7 RSR(bytecode)s to the sizes of NLPT
Now we see how the NBPT performs. In Figure 4.8, the decrease in RSR(bytecode) slows down once the NBPT is larger than 8 to 16 entries. For Queen and PNG, the NBPT yields good stall reductions.
Figure 4.8 RSR(bytecode)s to the sizes of NBPT
Figure 4.9 compares a 16-entry NBPT with a 16-entry NLPT and with sequential prefetching. Adding a table that records non-sequential crosses, instead of relying on sequential prefetching alone, improves prefetching further. If we adopt the NBPT for Java bytecode prefetching, we obtain better performance than with the NLPT, especially for Queen and PNG.
Figure 4.9 A comparison of RSR(bytecode)
4.4.2 Reference Prediction Table for Data Prefetching
This subsection discusses the effects of the reference prediction table (RPT) for data prefetching. The RPT records all load/store instructions, including instance-field accesses, static-field accesses and array accesses. However, our simulations show that the RPT is effective for instance-field accesses only in some special programs and brings no improvement for static-field accesses. Figure 4.10 shows the effects of a 128-entry RPT for instance fields. Sometimes there are strides between instance-field accesses, as indicated in [32]. This property is pronounced in Delta Blue, so the RPT yields a good stall reduction for it. But strides between instances are uncommon in most programs, so the RPT usually cannot effectively eliminate the stalls of instance-field accesses.
Figure 4.10 RSR(InstanceField)s of 128-entry RPT
Figure 4.11 shows the RSR(array)s for different RPT sizes. The RPT performs well for array prefetching, especially for Delta Blue, kXML and PNG.
Figure 4.11 RSR(array)s to the sizes of RPT
A possible variation is to let the RPT record only array instructions, since it is usually not effective for other data types. Figure 4.12 shows the effects of the array-only RPT design on array data; Figure 4.13 depicts the average RSR(array)s of the original RPTs and the array-only RPTs together. Because instructions of other data types occupy space in the RPT, a small array-only RPT unsurprisingly performs better than an original RPT with the same number of entries. However, if we are able to provide a larger RPT, their effects become very close.
Figure 4.12 RSR(array)s to the sizes of RPT
Figure 4.13 Average RSR(array)s of original RPTs and array-only RPTs
4.4.3 Stride Table for Array Prefetching
First, we consider what the values of the predefined H and the prefetch depth of the ST should be. These two variables depend heavily on the individual program. We could profile a program offline and embed the appropriate settings into its class files. However, we can also try to find values that suit most programs by experiment. Figure 4.14 shows the average RSR(array)s of all benchmarks using 8-entry STs. The average RSR(array) is at its minimum when H and the prefetch depth both equal 2. We now apply H=2 and prefetch depth=2 to each benchmark and compare the result with its optimal configuration. As Table 4.1 shows, the differences in RSR(array) between the recommended configuration and the optimal configurations are less than 1.1%. So using H=2 and prefetch depth=2 usually already gives results close to those of the individual optimal configurations. Nevertheless, note that the appropriate values of these two variables may depend heavily on the platform.
Figure 4.14 Results of configurations of H and prefetch depth in ST
Benchmark            Optimal (H, prefetch depth)   Optimal RSR(array)   RSR(array) when (H=2, prefetch depth=2)   Difference
Richard              (8, 1)                        97.544%              97.545%                                   0.001%
Delta Blue           (1, 1)                        49.493%              49.667%                                   0.174%
Image Manipulation   (16, 1)                       51.574%              51.898%                                   0.324%
Queen                (2, 8)                        78.921%              79.985%                                   1.064%
Chess                (1, 8)                        78.002%              78.687%                                   0.685%
Crypto               (2, 2)                        77.257%              77.257%                                   0%
kXML                 (2, 2)                        43.528%              43.528%                                   0%
Parallel             (2, 1)                        83.171%              83.260%                                   0.089%
PNG                  (1, 1)                        90.905%              91.838%                                   0.933%
Table 4.1 RSR(array) differences between using the recommended H and prefetch depth, and their optimal configurations
Next we consider how large a stride table should be. Figure 4.15 shows that the average RSR(array) barely decreases further once the stride table is larger than 8 or 16 entries. So a stride table with 8 to 16 entries is usually sufficient for most embedded Java programs.
Figure 4.15 RSR(array)s to the sizes of ST
If we use the program counter for tagging, the prefetching performance barely improves beyond a 32-entry stride table (Figure 4.16).
Figure 4.16 RSR(array)s to the sizes of PC-tagged ST
Next we compare the array-base-tagged ST and the PC-tagged ST; the simulation result is shown in Figure 4.17. With the same number of entries, an array-base-tagged ST usually has a lower RSR(array) than a PC-tagged ST. A possible reason was discussed in Subsection 3.2.4. However, if most strides are tied to individual instructions rather than to arrays, a PC-tagged ST performs better than an array-base-tagged ST, as in Richard, Parallel and PNG.
ST(both)-256 is a 256-entry ST tagged by both PC and array base. Both-tagging can eliminate the interference of several instructions or several arrays sharing an entry. However, it performs only slightly better than PC-tagging and array-base-tagging, and only for Richard and Delta Blue. This may be because an entry of a both-tagged ST needs a longer time to get a
Figure 4.17 Comparison of tagging approaches for ST (y axis: RSR(array); series: ST(PC)-16, ST(base)-16, ST(both)-256)
In the following we compare the effectiveness of the ST with that of the array-only RPT. As Figure 4.18 shows, the ST usually obtains better array stall reductions than the RPT for Java programs, especially for Queen, Crypto and kXML.
Figure 4.18 Comparisons of ST and array-only RPT
Figure 4.19 presents the effect of each design idea of the ST. The bars labeled “Array-base” are the RSR(array)s of an RPT that adopts array-base-tagging and our 2-state design. Array-base-tagging is effective for Queen and Crypto, although it is slightly worse for Delta Blue, Chess and PNG. The bars labeled “Base+S” are the RSR(array)s when using array-base-tagging plus stride-adaptive prefetching. The stride-adaptive approach is very effective for Chess and kXML, but makes things worse for Richard, Crypto and PNG. “Base+S+C” is base-tagging plus stride-adaptive prefetching and circular prefetching; that is, our final ST design. The results show that circular prefetching slightly improves prefetching performance. Note, however, that no design suits every case, and each may sometimes have negative effects.
Figure 4.19 Effects of each design idea of ST
In conclusion, our design achieves better array prefetching performance than the RPT. On average, the RSR(array) of the ST is 6% better than that of the RPT for Sun’s CLDC HI and 8% better for EEMBC’s GrinderBench.
Finally, Figure 4.20 shows the fraction of unnecessary prefetch signals that can be eliminated by the trigger-block. The trigger-block design eliminates more than 50% of unnecessary signals for Image Manipulation, kXML and PNG; on average, it eliminates 17.2% for Sun’s CLDC HI and 33.9% for EEMBC’s GrinderBench.
Figure 4.20 Effects of trigger-block (y axis: % eliminated by trigger-block)
4.5 Analysis of Memory Traffic
Besides stall reduction, another issue of concern is memory traffic. Useless prefetches produce additional traffic.
First, we define some terms. A true miss is a memory request for which the data accessed is found neither in the cache nor in the prefetch buffer. If a prefetched block is actually required and moved into the cache, we call the prefetch a useful prefetch; otherwise, it is called an unused prefetch. The block of an unused prefetch is never needed before it is evicted from the prefetch buffer. The memory traffic caused by bytecode fetches or prefetches is called bytecode traffic; array traffic is defined similarly.
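The following Java sketch illustrates how these terms could be counted in a simple model. It is not the simulator used in this chapter. The unbounded cache, the FIFO replacement of the prefetch buffer, and all class, field and method names are assumptions made for the example. Prefetches for blocks already in the cache or the buffer are filtered out before they reach memory.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: counting true misses, useful prefetches and unused prefetches.
final class PrefetchAccountingSketch {
    private final Set<Long> cache = new HashSet<>();                  // unbounded cache (simplification)
    private final ArrayDeque<Long> prefetchBuffer = new ArrayDeque<>(); // FIFO replacement (assumption)
    private final int bufferLines;

    int trueMisses;
    int usefulPrefetches;
    int unusedPrefetches;
    int prefetchesIssued;

    PrefetchAccountingSketch(int bufferLines) {
        if (bufferLines < 1) {
            throw new IllegalArgumentException("buffer needs at least one line");
        }
        this.bufferLines = bufferLines;
    }

    /** Demand access to a block. */
    void access(long block) {
        if (cache.contains(block)) {
            return;                          // cache hit, no traffic
        }
        if (prefetchBuffer.remove(block)) {
            usefulPrefetches++;              // prefetched block is needed and moves to the cache
        } else {
            trueMisses++;                    // found neither in the cache nor in the buffer
        }
        cache.add(block);
    }

    /** Prefetch request for a block. */
    void prefetch(long block) {
        if (cache.contains(block) || prefetchBuffer.contains(block)) {
            return;                          // filtered, never issued to memory
        }
        if (prefetchBuffer.size() == bufferLines) {
            prefetchBuffer.removeFirst();    // evicted without ever being used
            unusedPrefetches++;
        }
        prefetchBuffer.addLast(block);
        prefetchesIssued++;
    }

    /** Blocks transferred from memory: demand misses plus issued prefetches. */
    int memoryTraffic() {
        return trueMisses + prefetchesIssued;
    }

    public static void main(String[] args) {
        PrefetchAccountingSketch s = new PrefetchAccountingSketch(8);
        s.access(1);
        s.prefetch(2);
        s.access(2);
        System.out.println("true misses=" + s.trueMisses
                + " useful=" + s.usefulPrefetches
                + " unused=" + s.unusedPrefetches
                + " traffic=" + s.memoryTraffic());
    }
}
```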
4.5.1 Bytecode Traffic
We compare the bytecode traffic of sequential prefetching, a 16-entry NLPT and a 16-entry NBPT. The bytecode traffic without prefetching is normalized to 100%. The results are shown in Figure 4.21, together with the traffic produced by useful prefetches (U.P) only. The fraction above 100% is caused by unused prefetches; the rest, apart from the useful prefetches, consists of true misses. As we can see, the traffic produced by the 16-entry NBPT is slightly larger than that of the 16-entry NLPT in some benchmarks, but no larger than that of sequential prefetching. However, the NBPT-16 produces more useful prefetches on average than both the NLPT-16 and sequential prefetching, especially for Queen and PNG. So even when the effects of unused prefetches are included, the NBPT still performs better than the NLPT.
more tentative prefetches. However, the ST can issue more useful prefetches, especially for Queen, Chess, Crypto and kXML, and so achieves greater stall reductions. So if the prefetch buffer is absent or contention is significant, the conservative policy of the RPT is worth using, or we can simply disable the tentative prefetches of the ST; otherwise, the aggressive policy of the ST provides better performance. A possible variation is described in Subsection 5.2.2.
Chapter 5
Conclusion and Future Works
This chapter presents the conclusion in Section 5.1 and discusses some future work in Section 5.2.
5.1 Conclusion
With the continuous growth of multimedia applications, memory requirements are increasing because of their large amounts of code and data. This problem also exists in embedded devices. Prefetching is a feasible way to reduce memory stall time and speed up execution. We studied bytecode and array prefetching approaches for Java hardware accelerators. Because a Java program usually contains many invocations of small methods, the NBPT includes some subtle design features to handle them. Because strides exist between array accesses, we showed that tagging stride entries by array base is an alternative to PC-tagging for Java. Combined with our 2-state design and stride-adaptive algorithm, it performs better than the PC-tagged RPT. We also analyzed Sun’s CLDC HI and EEMBC’s GrinderBench benchmarks. On average, the NBPT reduces the time spent on bytecode stalls by 40%. The ST design reduces array stall time by 25%; for some array-based programs, around 50% of array stall time is eliminated.
We can also try to apply our mechanisms to mixed-mode JVMs in more advanced environments. A mixed-mode JVM has a selective JIT compiler. It detects hotspots in running Java programs and compiles them into machine code dynamically. Non-compiled code is still executed by interpretation. A hardware accelerator can also be used to accelerate the interpretation.
5.2 Future Works
This section includes some possible variations and applications of our mechanisms, and future study directions.
5.2.1 Prefetching More Bytecode Blocks at a Time
Note that the NBPT only makes a prediction for the next consecutively fetched block. When the memory latency is short, issuing a prefetch for the next block is adequate. However, if the memory latency is longer or the accelerator is improved further, our prefetched block may arrive too late to hide the memory stall. Thus we may want to initiate the prefetch earlier.
For simplicity, suppose we have a block sequence … A, B, C…, where B is in the cache and we only have to prefetch block C. If the memory latency is not too long, we can initiate the prefetch of C when entering B, as in Figure 5.1 (a). However, if the memory latency is longer, our prefetch for C would arrive too late, so that part of the stall is still not hidden (Figure 5.1 (b)). Thus we may want to issue the prefetch for C earlier, for example at the entry of A (Figure 5.1 (c)). In this case, we may assume that block B is already in the cache or has been prefetched, or we may also try to prefetch B and let C be postponed a little. Note that since all prefetches check the cache and the prefetch buffer before actually being issued to memory, prefetches for blocks already present in the cache or the buffer will not degrade prefetching performance much.
To predict C, we can use two different policies. One is to look up the NBPT only once, for B, and assume that (B, C) is sequential; in other words, our prediction is NBPT[A]+1. The other is to look up the NBPT twice in succession, making a prediction for B and then for C; that is, we predict NBPT[NBPT[A]] for C.
Figure 5.1 Timing issue of bytecode prefetching
(a) Short memory latency (b) Prefetch too late (c) Prefetch earlier
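The two policies for predicting C can be sketched in Java as follows. The NBPT is modeled here as a plain map from a block number to its predicted next block. The fall-back to the sequential successor when a block has no entry, and all class and method names, are assumptions made for this illustration rather than details of the actual design.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: predicting the block two steps ahead with an NBPT-like table.
final class TwoStepPredictionSketch {
    private final Map<Long, Long> nbpt = new HashMap<>();  // block -> predicted next block

    /** Records that 'next' was fetched right after 'block'. */
    void record(long block, long next) {
        nbpt.put(block, next);
    }

    /** One-step prediction; assumes the sequential successor when no entry exists. */
    long predictNext(long block) {
        return nbpt.getOrDefault(block, block + 1);
    }

    /** Policy 1: a single lookup, assuming (B, C) is sequential, i.e. NBPT[A] + 1. */
    long predictSingleLookup(long a) {
        return predictNext(a) + 1;
    }

    /** Policy 2: two successive lookups, i.e. NBPT[NBPT[A]]. */
    long predictDoubleLookup(long a) {
        return predictNext(predictNext(a));
    }

    public static void main(String[] args) {
        TwoStepPredictionSketch t = new TwoStepPredictionSketch();
        t.record(10, 20);  // A = 10 is followed by B = 20
        t.record(20, 35);  // B = 20 is followed by C = 35
        System.out.println("single lookup: " + t.predictSingleLookup(10));
        System.out.println("double lookup: " + t.predictDoubleLookup(10));
    }
}
```

The single-lookup policy needs only one table access but is wrong whenever the cross from B to C is non-sequential; the double-lookup policy can follow a non-sequential cross at the cost of a second access.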
More generally, if the program counter is on block A currently and we want to prefetch certain block B which is n blocks later. Suppose the accuracy of NBPT prediction for one