The Proposed Dynamic Code Optimization Scheme

3. Proposed Dynamic Code Optimization System

[Figure: object layout. Recoverable labels: Object Reference, Instance Variable n.]

Fig. 31. Object Format

An array is laid out in memory just like an object. The array format is shown in Fig. 32 (see [14]). To access the array length, read the word at the array reference minus one.
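The length lookup can be illustrated with a minimal sketch. It assumes a word-addressed memory modelled as a Python list and assumes that the array elements start at the reference itself; the names `heap`, `array_length`, and `array_load` are hypothetical and not part of JOP.

```python
# Minimal sketch of the array layout, under the assumptions stated above.
# The heap is a list of words; all addresses and values are illustrative.

def array_length(memory, array_ref):
    """Return the length word stored at (array reference - 1)."""
    return memory[array_ref - 1]

def array_load(memory, array_ref, index):
    """Load one element after a bounds check against the stored length."""
    if not 0 <= index < array_length(memory, array_ref):
        raise IndexError(index)
    return memory[array_ref + index]

heap = [0] * 16
ref = 5                           # hypothetical array reference
heap[ref - 1] = 3                 # length word sits just below the reference
heap[ref:ref + 3] = [10, 20, 30]  # assumed: elements start at the reference
```

With this layout, a bounds check costs one extra memory read at a fixed offset from the reference.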

Fig. 32. Array Format

c. Runtime Class Structure

The runtime class structure of JOP is shown in Fig. 33; it was discussed in subsection i as all classes' information. This class structure is stored in the external memory. To indicate the pointers used in the previous data structures, we draw this class structure again with the pointers Class Reference, Class Method Pointer, MP, and CP.

[Figure: runtime class structure. Fields as listed in the figure: Class Variable 1, Class Variable 2, …; Instance Size; Interface Table; Method Structure 0, Method Structure 1, …; Class Reference; Constant Pool Length; Constant 1, Constant 2, …; Interface Reference 0, Interface Reference 1, …. Pointers into the structure: Class Reference, Class Method Pointer, Current Method (MP), Constant Pool (CP).]

Fig. 33. Runtime Class Structure

3.2 The Proposed Dynamic Code Optimization Scheme

In this section, we propose our dynamic code optimization scheme. First, we analyze the bytecode execution frequency, which gives us the idea for our improvement. Then we compare the access times of the external memory and the internal memory. By reducing the number of dynamic code modifications that do not improve performance, we can make the system more efficient. The architecture overview is given in the last subsection.

i. Analysis of Bytecode Execution Frequency

We have mentioned that HotSpot uses optimistic compilation, which dynamically chooses the instructions that need to be compiled; the rest are executed by the interpreter. The decision is based on execution frequency. This concept is used in many systems. For example, the well-known code morphing processor from Transmeta also uses execution frequency to decide whether code is interpreted or translated. As shown in Fig. 34 (see [22]), the translation threshold is determined by the code execution frequency. When the number of executions of a section of x86 machine code reaches a certain threshold, its address is passed to the translator.

[Figure: control flow. Recoverable labels: Store in Tcache, Execute.]

Fig. 34. Transmeta Code Morphing Software Control Flow

We analyze the bytecode execution frequency using three common benchmark programs. The distribution of bytecode execution frequency is shown in Fig. 35. For each execution frequency, we count the number of bytecodes that are executed that many times. For example, when executing the UDP/IP program, 385 bytecodes are executed 9 times and 210 bytecodes are executed 13 times. Occurrences of the same bytecode (e.g. aload_0) in different methods or sequences are counted as different bytecodes.

The curves in Fig. 35 show a very important rule: almost every bytecode is executed either exactly once or many more than two times. This observation is the critical point in our design.

Fig. 35. Distribution of Bytecode Execution Frequency
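The counting behind such a distribution can be sketched as follows. This is an illustrative sketch only: the trace format (one `(method, pc)` pair per executed bytecode) and the function name `frequency_distribution` are assumptions, and the sample trace is made up rather than taken from the benchmarks.

```python
from collections import Counter

def frequency_distribution(trace):
    """Map an execution count to the number of bytecode sites executed that often.

    A site is a (method, pc) pair, so the same opcode at two different
    locations counts as two different bytecodes.
    """
    per_site = Counter(trace)          # site -> number of executions
    return Counter(per_site.values())  # executions -> number of sites

# Made-up trace: two sites run once, two sites in a loop run nine times each.
trace = [("main", 0), ("main", 1), ("loop", 0), ("loop", 1)]
trace += [("loop", 0), ("loop", 1)] * 8
dist = frequency_distribution(trace)
```

Here `dist` maps 1 to 2 sites and 9 to 2 sites, a toy version of the "once or many times" shape described above.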

ii. Access Time of External Memory & Internal Memory

As mentioned before, typical dynamic code optimization can speed up the execution of an embedded Java VM, but it suffers from the overhead of external memory accesses.

Consider the JOP system. The clock frequency of both the FPGA and the SRAM is 50 MHz, so the clock period is calculated as follows:

T = 1 / (50 MHz) = 20 ns

An internal memory access needs only 1 cycle. An external memory access, however, needs 5 cycles for a read and 7 cycles for a write on JOP, because JOP is designed to run on various development boards. The microcode sequence of an external memory read is shown in Fig. 36.

Fig. 36. Microcode Sequence of External Memory Read

Upon execution of a memory read, the address is stored, the processor waits for the value to arrive, and the value is then pushed onto the top of the operand stack, as in Fig. 30. Each microcode instruction executes in a single cycle, so an external memory read needs 5 cycles.

The microcode sequence of an external memory write, shown in Fig. 37, needs 7 cycles.

Fig. 37. Microcode Sequence of External Memory Write
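The cycle counts above translate into access times as in the following small worked example. The clock frequency and the cycle counts come from the text; the names `CYCLES` and `access_time_ns` are hypothetical.

```python
CLOCK_HZ = 50_000_000        # 50 MHz clock for both FPGA and SRAM
CYCLE_NS = 1e9 / CLOCK_HZ    # clock period in nanoseconds

# Cycles per access on JOP, as stated in the text.
CYCLES = {"internal": 1, "external_read": 5, "external_write": 7}

def access_time_ns(kind, count=1):
    """Time in nanoseconds for `count` accesses of the given kind."""
    return CYCLES[kind] * CYCLE_NS * count
```

Under these figures an external read takes 100 ns and an external write 140 ns, against 20 ns for an internal access, so every code modification that writes to external memory is comparatively expensive.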

As a result, if we can reduce the number of dynamic code modifications that bring no benefit, e.g. modifications of code that is executed exactly once, we can considerably improve the execution time and cut down the power consumption. In the next subsection, we introduce the design of our dynamic code optimization module.

iii. Architecture Overview

From the previous experiment, we know that almost every bytecode is executed either exactly once or many more than two times. In subsection ii, we analyzed the memory access time and found that the access time of external memory is a large overhead in the traditional dynamic code optimization scheme. Based on these two observations, we designed a new dynamic code optimization architecture called JDCO.

To speed up execution and cut down power consumption, we modify the code only when it is necessary. That is, if the code is executed exactly once, we do not perform dynamic code optimization, i.e. we do not construct a new bytecode to replace the original bytecode and store the field or method offset in the operand of the new bytecode. Because method bytecodes are stored in external memory in most embedded systems, including our JOP system (described in subsection 3.1 i), this new module can execute Java programs with dynamic code optimization more efficiently.

However, the execution frequency cannot be determined upon the first encounter of a bytecode, unless we do a "fast-forward" to check whether the bytecode will be executed again, which has unacceptable overhead. Another possible way is to perform a pre-pass that counts the executions of the bytecodes, but this is also very expensive. We propose a simple algorithm that reduces unnecessary modifications with very low overhead. A small memory is synthesized in the FPGA to count the number of executions of each bytecode at run time. For the first execution, no dynamic code modification is performed.

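[Fig. 36 content, microcode sequence of external memory read: stmra, nop, wait, wait, ldmrd]

[Fig. 37 content, microcode sequence of external memory write: stmwa, nop, stmwd, nop, wait, wait, nop]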

The DCO is performed only the second time the code is executed, because we assume, based on the observation of subsection i, that the code will then be executed again and again. From the third execution onward, we can directly use the operand of the new bytecode to speed up execution and cut down power consumption.
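The counting rule can be illustrated with a minimal sketch. The function names and action labels are hypothetical, and the comparison assumes that a traditional scheme rewrites an optimizable bytecode on its first execution; the sample execution totals are made up.

```python
def jdco_action(previous_executions):
    """Decide what to do with an optimizable bytecode, given its counter value."""
    if previous_executions == 0:
        return "interpret"       # first execution: no code modification
    if previous_executions == 1:
        return "optimize"        # second execution: perform DCO
    return "use_optimized"       # third and later: reuse the rewritten bytecode

def count_rewrites(executions_per_site, rewrite_on_first):
    """Number of code rewrites for a list of per-site execution totals.

    rewrite_on_first=True models the assumed traditional scheme;
    rewrite_on_first=False models the rule described above.
    """
    threshold = 1 if rewrite_on_first else 2
    return sum(1 for n in executions_per_site if n >= threshold)

# Made-up totals: three sites run once, two sites run many times.
sites = [1, 1, 1, 9, 13]
traditional = count_rewrites(sites, rewrite_on_first=True)
jdco = count_rewrites(sites, rewrite_on_first=False)
```

In this toy case the delayed rule performs 2 rewrites instead of 5, skipping exactly the sites that are executed once.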

[Figure: JDCO flowchart. Recoverable labels: bytecode branch condition, microcode branch condition, next bytecode; microcode examples iadd: add nxt, isub: sub nxt, idiv: stm b; legend: ■ Hardware, ■ Software.]

Fig. 38. Proposed JDCO Architecture

The flowchart of our JDCO architecture is shown in Fig. 38. Our new modules are marked with a colored background that distinguishes hardware from software implementation modules. At the beginning of the first stage, bytecode fetch, the Java pc points to the bytecode to be executed. The bytecode is passed to a JDCO check module, which checks whether this bytecode can be optimized. For example, if the bytecode carries information that can be recorded to speed up the next execution (e.g. getfield, putfield, etc.), we say that it can be optimized. If the answer of the JDCO check is yes, our system further checks whether this is the first execution of this bytecode to decide whether we should perform DCO. If it is not the first execution, the new JDCO optimization looks up the bytecode in our new jump table to get the JOP pc, which points to our new JDCO module in the second stage, microcode fetch.

JDCO executes this bytecode and obtains the runtime information, which depends on the specific bytecode. It may be the offset of an object field or of a class method, which will not change the next time we execute the same bytecode. The runtime information is passed to a JDCO Java program, which constructs a new bytecode to replace the original bytecode in external memory and stores the runtime information in the operand of this new bytecode.

If the answer of the JDCO check is no, or if it is yes but this is only the first execution of the bytecode, our architecture follows the original procedure. The JOP pc is retrieved from the jump table for execution. The corresponding bytecode implementation is executed whether or not it is a new bytecode that we constructed. The implementation of the bytecode may be a VHDL implementation, a microcode implementation, or a Java code implementation.
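The flow described above can be sketched in software as follows. This is a simplified illustration, not the hardware design: the class `JdcoSketch`, the replacement opcode names `getfield_opt` and `putfield_opt`, the `field_offsets` resolution table, and the per-pc counter dictionary are all assumptions made for the example.

```python
# Hypothetical mapping from an optimizable bytecode to its replacement.
OPTIMIZABLE = {"getfield": "getfield_opt", "putfield": "putfield_opt"}

class JdcoSketch:
    def __init__(self, code, field_offsets):
        self.code = code                    # simulated external memory: [opcode, operand]
        self.field_offsets = field_offsets  # assumed table: constant index -> field offset
        self.counts = {}                    # stands in for the small on-chip counter memory
        self.code_writes = 0                # rewrites of external code memory

    def step(self, pc):
        """Return which path handles the bytecode at pc: 'normal' or 'jdco'."""
        opcode, operand = self.code[pc]
        if opcode not in OPTIMIZABLE:
            return "normal"                 # JDCO check: no (includes rewritten bytecodes)
        seen = self.counts.get(pc, 0)
        self.counts[pc] = seen + 1
        if seen == 0:
            return "normal"                 # first execution: original procedure
        # Second execution: record the runtime information in a new bytecode.
        offset = self.field_offsets[operand]
        self.code[pc] = [OPTIMIZABLE[opcode], offset]
        self.code_writes += 1
        return "jdco"

code = [["iadd", None], ["getfield", 3]]
vm = JdcoSketch(code, {3: 2})
paths = [vm.step(1) for _ in range(3)]
```

In this sketch the third execution takes the normal path again, because the rewritten bytecode is dispatched through the jump table like any other bytecode, and external code memory is written only once.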

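Source document: Research on MPEG-4/21 SoC Design and New-Generation Mobile Communications, Subproject 2: Research on Multimedia Communication Digital Baseband SoC Acceleration Architecture and Embedded Operating System Interface (III) (pp. 57-60)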