Chapter 1 Introduction
1.5 Synopsis
The remainder of this thesis is organized as follows. Chapter 2 discusses the Java just-in-time compiler and the Andes 32-bit/16-bit instruction set architectures. Chapter 3 introduces the Multiple Fixed-width ISA Emitter. Chapter 4 presents and analyzes the experiments and their results. Chapter 5 gives the conclusion and future work.
Chapter 2
Java Just-In-Time compiler and Andes 32bit-16bit Instruction Set Architectures
As the Java language becomes more and more important for programming embedded systems, translation at the bytecode level has been proposed to increase program performance. The Java VM provides a just-in-time compiler architecture, which executes target-machine code to improve performance.
The ANDES ISA has a special architecture that mixes 16-bit and 32-bit instructions without the need for mode-switching instructions. Its 16-bit ISA largely mirrors the 32-bit ISA, but there is an alignment restriction: the memory address accessed by a 32-bit memory instruction must be word-aligned.
In our research, the target is the ANDES processor. We port an existing JVM JIT to the ANDES platform and then modify the code emitter so that it can generate 16-bit as well as 32-bit instructions. We use several benchmark tests to measure the performance of the code emitter.
2.1 CVM Internals
The virtual machine we use is the Connected Device Configuration HotSpot Implementation (CVM) version of the Java VM, which is highly optimized for resource-constrained devices such as consumer electronic products and embedded devices. Portability is the most important benefit of the Java system design. CVM includes a dynamic compiler, which is also called a just-in-time (JIT) compiler. Once a method in a Java program has been used frequently enough, the JIT converts the method’s bytecodes to native code at execution time to improve future performance. This operation has two passes: first, the front end converts Java bytecode to an intermediate representation (IR); second, the back end converts the IR to native code. The architecture is shown in Figure 2.1.
Figure 2.1 Execution of a Java program
2.1.1 JIT Front End
The front end is portable for different execution environments. It converts the bytecode to an intermediate representation (IR). Figure 2.3 is an example of IR.
Figure 2.2 Frontend (the IR generator converts method bytecode into method IR)
2.1.2 JIT Back End
The back end converts the IR to native instructions. An IR tree is parsed by a parser. The parser, which is produced by the Java Code Select (JCS) tool at build time, performs pattern matching for tree-based data structures in which the patterns are specified as a set of JCS rules. These rules are translated into C source code and initialized data structures. Code generation is done with rule-based pattern matching on trees. When there are multiple possibilities, JCS chooses the rule with the least static cost. Figure 2.5(A) is an example of JCS rules.
Figure 2.3 An example of IR. The statement "x = y + 1000;" is compiled to the bytecodes iload y, sipush 1000, iadd, istore x, and translated to the IR tree ASSIGN(LOCAL(X), ADD(LOCAL(Y), CONSTANT(1000))).
Figure 2.4 Backend (the JCS parser, register manager, and emitter convert method IR into method machine code)
Figure 2.5 An example of a JCS rule
The first part of this rule is the result and the second part is a pattern. They are used for pattern matching. For instance, the subtree in Figure 2.5 (B) will be matched by the JCS rule in Figure 2.5 (C). If a subtree can be matched in multiple ways, the rule with the lowest static cost will be selected. The static cost is specified as the third part of a rule. After a match is found, the fourth and the fifth parts of the rule will be used for setting up a register set. This is shown in Figure 2.6(A). First, a bottom-up traversal of the matched tree passes the use register set, shown in Figure 2.6(B). Second, a top-down traversal passes the accept register set, shown in Figure 2.6(C). After these two passes, the register manager knows which registers are provided. Finally, the last part is the semantic actions which will call the code emitter to emit native instructions. It is shown in Figure 2.7(A) and Figure 2.7(B).
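The cost-based choice among competing rules can be illustrated with a small sketch. This is not the parser generated by JCS; the Rule structure, its matches flag, and the function select_rule are hypothetical names introduced only to show how the rule with the lowest static cost would be picked among those whose pattern matched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a JCS rule: only the fields needed to
 * illustrate cost-based selection are kept. */
typedef struct {
    const char *result;   /* first part: the result of the rule        */
    const char *pattern;  /* second part: the tree pattern             */
    int         cost;     /* third part: the static cost               */
    int         matches;  /* nonzero if the pattern matched a subtree  */
} Rule;

/* Return the index of the matching rule with the lowest static cost,
 * or -1 if no rule matches. Ties keep the earliest rule. */
int select_rule(const Rule *rules, size_t n)
{
    int best = -1;
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!rules[i].matches)
            continue;
        if (best < 0 || rules[i].cost < rules[best].cost)
            best = (int)i;
    }
    return best;
}
```

A rule that does not match is never selected, however cheap it is; among the matching rules, only the static cost decides.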
Figure 2.6 Set Register Set
2.2 ANDES Instruction Set Architectures
In our system, we use the ANDES instruction set, which is a RISC-style register-based instruction set. In the Andes ISA, we may freely mix 16-bit and 32-bit instructions without the need for mode-switching instructions. The 16-bit ISA largely mirrors the 32-bit ISA, but there is an alignment restriction: when a 32-bit instruction is written to the code buffer, the address of the memory cell in the code buffer that will hold the instruction must be word-aligned. Otherwise, the 32-bit instruction must be broken into two 16-bit halves, each of which is written to the code buffer separately. A Word-Alignment exception will be thrown when we attempt to write a 32-bit instruction at half-word alignment.
Figure 2.7 JCS rule calls emitter to emit native code
2.2.1 General Purpose Register
Andes 32-bit instructions can access thirty-two 32-bit general-purpose registers (GPR). A 16-bit instruction’s register index can be 5 bits, 4 bits, or 3 bits in different instruction formats. A 3-bit and 4-bit index can only access a part of the GPRs. The 3-bit and 4-bit register indices are mapped to real registers according to Table 2.1.
2.2.2 The Andes Instruction Set
In this section, we introduce the part of the Andes instruction set that is related to our research. In Andes, the memory address accessed by a 32-bit memory instruction has to be word-aligned. Otherwise, a Data Alignment Check exception will be generated. Tables 2.2 through 2.8 show how 32-bit instructions map to their 16-bit counterparts.
Table 2.1 Andes General Purpose Registers
Register  32/16-bit (5)  16-bit (4)  16-bit (3)  Comments
R0        A0             H0          O0
R15       Ta                                     Temporary register for assembler; implied register for slt(s|i)45, b[eq|ne]zs8
R26       P0                                     Reserved for Privileged-mode use.
R27       P1                                     Reserved for Privileged-mode use.
R28       S9/Fp                                  Frame pointer / Saved by callee
R29       Gp                                     Global pointer
R30       Lp                                     Link pointer
R31       Sp                                     Stack pointer
Table 2.2 Add/Sub Instruction
32-bit instruction  16-bit instruction  Special case
ADD                 ADD333
Table 2.3 Move Instruction
32-bit instruction  16-bit instruction  Special case
MOVI                MOVI55
ADDI/ORI            MOV55               ADDI R# R# 0
Table 2.4 Shift Instruction
32-bit instruction  16-bit instruction  Special case
SRAI                SRAI45
SRLI                SRLI45
SLLI                SLLI333
Table 2.5 Bit Field Mask Instruction
32-bit instruction  16-bit instruction  Special case
ZEB                 ZEB333
Table 2.6 Branch and Jump Instruction
32-bit instruction  16-bit instruction  Special case
BEQ                 BEQS38              Branch on Equal, implied R5
BNE                 BNES38              Branch on Not Equal, implied R5
BEQZ                BEQZ38
BNEZ                BNEZ38
J                   J8
JR                  JR5
JRAL                JRAL5
Table 2.7 Load/Store Instruction
32-bit instruction  16-bit instruction  Special case
LWI                 LWI450
                    LWI333
                    LWI37               Load Word with Implied FP
LWI.bi              LWI333.bi
LHI                 LHI333
LBI                 LBI333
SWI                 SWI450
                    SWI333
                    SWI37               Store Word with Implied FP
SWI.bi              SWI333.bi
SHI                 SHI333
SBI                 SBI333
Table 2.8 Compare and Branch Instruction
32-bit instruction  16-bit instruction  Special case
SLTI                SLTI45
SLTSI               SLTSI45
SLT                 SLT45
SLTS                SLTS45
BEQZ                BEQZS8              Branch on Equal Zero, implied R15
BNEZ                BNEZS8              Branch on Not Equal Zero, implied R15
Chapter 3
The Multiple Fixed-width ISA Emitter
The Multiple Fixed-width ISA Emitter can emit 32-bit and 16-bit instructions in any desired mixture. The register manager assigns a register to a particular instruction, and the emitter then determines whether a 16-bit instruction can be used. If not, a 32-bit instruction is generated instead. When a 16-bit instruction is to be generated, the register number must be converted according to Table 2.1. Because, in Andes, the memory address accessed by a 32-bit memory instruction must be word-aligned, when the emitter writes a 32-bit instruction to the code buffer, it has to break that instruction into two 16-bit half-words and write the two half-words separately in order to avoid a Data Alignment Check exception. It is essential for the register manager to choose an appropriate register if the emitter is to generate 16-bit instructions. The JIT writer can set up four register sets (CVMCPU_PHI_REG_SET, CVMCPU_BUSY_SET, CVMCPU_NON_VOLATILE_SET, and CVMCPU_VOLATILE_SET) from which the register manager chooses appropriate registers.
We may tune the four register sets to emit as many 16-bit instructions as possible. For certain patch points, we must make sure that the patched instruction has the same size as the original instruction.
3.1 Multiple Fixed-width ISA Emitter Introduction
When a JCS rule selects an instruction, the emitter is called to emit the instruction to the code buffer. The Multiple Fixed-width ISA Emitter adds a test (“16-bitable” in Figure 3.1(b)) to determine whether the instruction can be emitted as a 16-bit instruction. If so, it translates the 32-bit instruction to the corresponding 16-bit instruction.
Figure 3.1 (a) Original emitter. (b) Adding the “16-bitable” test.
3.1.1 Determine Instruction
In the Andes ISA, there are six formats for 16-bit instructions: the 333-form, 45-form, 37-form, 38-form, 8-form, and 55-form. (The 333-form and the 45-form are the two most popular formats for 16-bit instructions.) Some 32-bit instructions do not even have 16-bit counterparts. The emitter first needs to determine whether a 16-bit instruction can be issued. Figure 3.2 (a) is the flow chart for testing the 333-form and Figure 3.3 (a) is the flow chart for testing the 45-form. For example, in Figure 3.2 (b), an addi instruction has registers R0 and R1 and the immediate value imm. R0 and R1 fall in the range of registers allowed in an addi333 instruction. Furthermore, if the immediate value is no more than 7 (binary 111), this instruction will be translated into a 16-bit instruction in the 333-form.
Figure 3.2 (a) Flow chart of testing the 333-form, which tests whether ((R0 | R1 | imm) >> 3) is zero. (b) An example of Addi333: Addi R0 R1 imm is translated to Addi333 R0 R1 imm.
When the immediate value is larger than 7, the emitter will try other forms, say the 45-form (4 bits for specifying a register and 5 bits for specifying the immediate value.) Figure 3.3 (a) shows the flow chart for testing if the 45-form can be used.
There are other forms for 16-bit instructions. The emitter will try each form in turn.
When no 16-bit form is applicable, a 32-bit instruction will be issued instead.
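The form tests described above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the emitter's actual code: the function and type names (fits_333, fits_45, reg_fits_4bit, choose_addi_form, Form) are hypothetical, immediates are assumed to be unsigned, and only the 333-form and 45-form of an add-immediate are covered.

```c
#include <assert.h>
#include <stdint.h>

/* 333-form: two 3-bit register indices (R0-R7) and a 3-bit immediate (0-7).
 * OR-ing the three values and shifting right by 3 yields 0 only when
 * every one of them fits in 3 bits. */
int fits_333(uint32_t rt, uint32_t ra, uint32_t imm)
{
    return ((rt | ra | imm) >> 3) == 0;
}

/* The 4-bit register field can name R0-R11 and R16-R19 only. */
int reg_fits_4bit(uint32_t r)
{
    return r <= 11 || (r >= 16 && r <= 19);
}

/* 45-form for an add-immediate: destination and source must be the same
 * register, that register must be addressable in 4 bits, and the
 * immediate must fit in 5 bits (0-31). */
int fits_45(uint32_t rt, uint32_t ra, uint32_t imm)
{
    return rt == ra && reg_fits_4bit(rt) && (imm >> 5) == 0;
}

typedef enum { FORM_333, FORM_45, FORM_32 } Form;

/* Try each 16-bit form in turn; fall back to the 32-bit encoding. */
Form choose_addi_form(uint32_t rt, uint32_t ra, uint32_t imm)
{
    assert(rt < 32 && ra < 32);
    if (fits_333(rt, ra, imm)) return FORM_333;
    if (fits_45(rt, ra, imm))  return FORM_45;
    return FORM_32;
}
```

For instance, choose_addi_form(0, 1, 7) selects the 333-form, choose_addi_form(16, 16, 31) selects the 45-form, and choose_addi_form(0, 1, 8) falls back to the 32-bit form because the immediate no longer fits in 3 bits and the two registers differ.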
3.1.2 Translating Registers
A register may be encoded in 3, 4, or 5 bits according to the selected instruction formats. The encoding is shown in Table 3.1. For example, R17 is encoded as 10001 (T1) in 5 bits and as 1101 (H13) in 4 bits.
Figure 3.3 (a) Flow chart for testing the 45-form. (b) An example of ADDI45: Addi R16 R16 31 passes the tests R16 == R16, (31 >> 5) == 0, and 15 < R16 < 20, and is translated to Addi45 H12 H12 31.
When the emitter wants to emit a 16-bit instruction, it tests whether the register assigned by the register manager can be used in a 16-bit instruction. For example, the 333-form is restricted to registers R0 through R7, while the 45-form can use only registers R0-R11 and R16-R19 in the 4-bit field. (There is no restriction for the 5-bit field since 5 bits are enough to address any of the 32 general-purpose registers.) If the assigned register fits in a 16-bit instruction form, the emitter translates the encoding of the register according to Table 3.1. This means that R16-R19 are translated into H12-H15. The flow chart for the translation is shown in Figure 3.4 (a). The registers usable in each form are shown in Figure 3.5.
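The register translation can be written down directly from the mapping in Table 3.1. The following is a minimal sketch; the function names to_4bit_index and to_3bit_index and the -1 return value for registers without a short encoding are our own conventions for illustration.

```c
#include <assert.h>

/* Translate a 5-bit register number into its 4-bit index following
 * Table 3.1: R0-R11 map to H0-H11 and R16-R19 map to H12-H15.
 * Returns -1 when the register has no 4-bit encoding. */
int to_4bit_index(int reg)
{
    assert(reg >= 0 && reg < 32);
    if (reg <= 11)
        return reg;          /* R0-R11  -> H0-H11  */
    if (reg >= 16 && reg <= 19)
        return reg - 4;      /* R16-R19 -> H12-H15 */
    return -1;               /* no 4-bit encoding for this register */
}

/* 3-bit fields can only name R0-R7, with the index unchanged. */
int to_3bit_index(int reg)
{
    assert(reg >= 0 && reg < 32);
    return reg <= 7 ? reg : -1;
}
```

With this sketch, to_4bit_index(17) yields 13, which agrees with the encoding of R17 as 1101 (H13) in 4 bits.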
Table 3.1. The difference between the two kinds of register encoding.
Register 32/16-bit (5 bits) 16-bit (4 bits)
R0 A0 H0
R1 A1 H1
R2 A2 H2
R3 A3 H3
R4 A4 H4
R5 A5 H5
R6 S0 H6
R7 S1 H7
R8 S2 H8
R9 S3 H9
R10 S4 H10
R11 S5 H11
R16 T0 H12
R17 T1 H13
R18 T2 H14
R19 T3 H15
Figure 3.4 (a) Flow chart for translating register encoding. (b) An example of register translation: R16 - 4 = H12, so Addi R16 R16 31 becomes Addi45 H12 H12 31.
Figure 3.5 Register range of 333 mode and 45 mode
3.1.3 Instruction Alignment
In Andes, there is a restriction that the memory address accessed by a 32-bit memory instruction (load/store) must be word-aligned, that is, the least significant two bits of the address must be 0. Therefore, when the emitter places a 32-bit instruction into the code buffer, it breaks the instruction into two half-words, and each half-word is written into the code buffer separately. This is explained in Figure 3.6.
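The half-word writes can be sketched as follows. The CodeBuf structure and the functions emit16 and emit32 are hypothetical stand-ins for the real code buffer interface, and the sketch assumes that the most significant half of a 32-bit instruction is placed first in the instruction stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code buffer: a half-word array with a write cursor
 * counted in half-words. */
typedef struct {
    uint16_t *buf;
    size_t    pos;   /* next free half-word      */
    size_t    cap;   /* capacity in half-words   */
} CodeBuf;

/* Emit a 16-bit instruction. */
void emit16(CodeBuf *cb, uint16_t insn)
{
    assert(cb->pos + 1 <= cb->cap);
    cb->buf[cb->pos++] = insn;
}

/* Emit a 32-bit instruction as two half-word stores so that no 32-bit
 * store is ever issued to a half-word-aligned address. The upper half
 * is written first (an assumption about the instruction layout). */
void emit32(CodeBuf *cb, uint32_t insn)
{
    assert(cb->pos + 2 <= cb->cap);
    cb->buf[cb->pos++] = (uint16_t)(insn >> 16);      /* first half-word  */
    cb->buf[cb->pos++] = (uint16_t)(insn & 0xFFFFu);  /* second half-word */
}
```

Because every store is 16 bits wide, a 32-bit instruction can follow a 16-bit instruction at any half-word boundary without a misaligned 32-bit store.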
3.2 Register Setting
A JIT writer may adjust the register setting to emit more 16-bit instructions.
There are two places in the JIT that can be adjusted: the VM register set and the four code generator register sets.
Figure 3.6 Avoiding Data Alignment Check exceptions when writing a 32-bit instruction into the code buffer (the instruction is written as two half-words, Ins32-1 and Ins32-2).
3.2.1 The VM Register Set
The VM register set contains four special registers: JSP_REG, JFP_REG, CHUNKEND_REG, and CVMCPU_EE_REG. They must be mapped to Andes registers properly. In our emitter, we use register FP for JFP_REG because it can use the special 37-form instructions.
Table 3.2. VM Register Setting
VM Register Register
JSP_REG R11
3.2.2 Code Generator Register Set
There are four code generator register sets, CVMCPU_PHI_REG_SET, CVMCPU_BUSY_SET, CVMCPU_NON_VOLATILE_SET, and CVMCPU_VOLATILE_SET, defined in the header file jitrisc_cpu.h. The four register sets are used by the register manager to set up CVMRM_ANY_REG_SET, CVMRM_SAFE_SET, and CVMRM_UNSAFE_SET. (The CVMRM_EMPTY_SET is always an empty set.) When a JCS rule requests a register, the register manager selects a register out of one of these four register sets. We wish to distribute the registers that can be used to generate 16-bit instructions among these four sets so that such a register is available when a JCS rule requests a register. The best distribution should be determined by extensive benchmarks. Currently, the distribution is shown in Table 3.3.
The register manager sets up the four sets CVMRM_ANY_REG_SET, CVMRM_SAFE_SET, CVMRM_UNSAFE_SET, and CVMRM_EMPTY_SET as follows. The CVMRM_EMPTY_SET is always an empty set. The CVMRM_ANY_REG_SET includes all registers except those in the CVMCPU_BUSY_SET. The CVMRM_SAFE_SET includes all the registers that are in both CVMCPU_NON_VOLATILE_SET and CVMRM_ANY_SET. Equivalently, the CVMRM_SAFE_SET includes all the registers that are in CVMCPU_NON_VOLATILE_SET but not in CVMCPU_BUSY_SET. The CVMRM_UNSAFE_SET includes all the registers that are in both CVMCPU_VOLATILE_SET and CVMRM_ANY_SET. Equivalently, the CVMRM_UNSAFE_SET includes all the registers that are in CVMCPU_VOLATILE_SET but not in CVMCPU_BUSY_SET. Table 3.4 summarizes the above specification in the register manager.
Table 3.3. RISC_CPU Register Setting
RISC_CPU Register Set      Register
CVMCPU_PHI_REG_SET         S1, S4, S5, S6, S7, S8, GP
CVMCPU_BUSY_SET            TA, P0, P1, FP
CVMCPU_NON_VOLATILE_SET    S0-S8, FP, GP
CVMCPU_VOLATILE_SET        ALL & ~CVMCPU_NON_VOLATILE_SET
Table 3.4. Register Manager register setting
JIT RegMan Register Set    Register set
CVMRM_BUSY_SET             CVMCPU_BUSY_SET | 1U<<CVMCPU_SP_REG | 1U<<CVMCPU_JSP_REG | 1U<<CVMCPU_JFP_REG | CVMRM_CHUNKEND_BUSY_BIT | CVMRM_CVMGLOBALS_BUSY_BIT | CVMRM_EE_BUSY_BIT | CVMRM_CP_BUSY_BIT | CVMRM_GC_BUSY_BIT
CVMRM_ANY_REG_SET          ALL & ~(BUSY_SET)
CVMRM_SAFE_SET             (CVMCPU_NON_VOLATILE_SET & CVMRM_ANY_SET)
CVMRM_UNSAFE_SET           (CVMCPU_VOLATILE_SET & CVMRM_ANY_SET)
CVMRM_EMPTY_SET            Always empty set
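The set relations in Table 3.4 are plain bit-mask operations, which the following sketch makes explicit. It does not reproduce the definitions in jitrisc_cpu.h: RegSet, REG, REG_RANGE, RmSets, and derive_sets are hypothetical names, and a register set is assumed to be a 32-bit mask in which bit i stands for register Ri.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t RegSet;   /* bit i set <=> register Ri is in the set */

#define REG(i)            (1u << (i))
/* All registers from R(lo) to R(hi), inclusive. */
#define REG_RANGE(lo, hi) ((0xFFFFFFFFu >> (31 - (hi))) & ~(REG(lo) - 1u))

typedef struct {
    RegSet any;     /* ALL & ~busy                */
    RegSet safe;    /* non-volatile & any         */
    RegSet unsafe;  /* volatile & any             */
    RegSet empty;   /* always the empty set       */
} RmSets;

/* Derive the register manager sets from the busy set and the
 * non-volatile set, following the relations of Table 3.4. */
RmSets derive_sets(RegSet busy, RegSet non_volatile)
{
    const RegSet all = 0xFFFFFFFFu;
    const RegSet vol = all & ~non_volatile;  /* the volatile set */
    RmSets s;
    s.any    = all & ~busy;
    s.safe   = non_volatile & s.any;
    s.unsafe = vol & s.any;
    s.empty  = 0;
    /* safe and unsafe partition the available registers */
    assert((s.safe & s.unsafe) == 0 && (s.safe | s.unsafe) == s.any);
    return s;
}
```

A set such as "R6 through R14 plus R29" would be written REG_RANGE(6, 14) | REG(29). By construction, no busy register appears in any derived set, and every available register is either safe or unsafe but never both.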
3.3 Instruction Patch and Adjust
When the emitter emits a forward branch or a jump to glue code, the address field in the instruction is patched later. Since the size of the actual offset is not known when the instruction is emitted, to be on the safe side, we always use 32-bit instructions for forward branches and jumps to glue code.
Furthermore, the instructions for null checks may also need additional patches. This is discussed in Section 3.3.3.
3.3.1 Forward Branch
When the emitter emits a branch instruction with an unknown offset, it always issues a 32-bit instruction. The address field in this instruction is patched later, when the address of the branch target is known. Figure 3.7 shows how a forward branch instruction is patched.
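The emit-then-patch sequence can be sketched as follows. The instruction layout used here is hypothetical and is NOT the real Andes encoding: the low 16 bits of the 32-bit branch are assumed to hold an offset counted in half-words. The function names emit_forward_branch and patch_forward_branch are likewise our own and do not correspond to functions in the CVM sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for this sketch only: the low 16 bits of a 32-bit
 * branch hold an offset counted in half-words. */
#define OFFSET_MASK 0x0000FFFFu

/* Emit a 32-bit branch with a zero offset as two half-words and return
 * its position (in half-words) so that it can be patched later. */
size_t emit_forward_branch(uint16_t *code, size_t *pos, uint32_t opcode_bits)
{
    size_t patch_point = *pos;
    assert((opcode_bits & OFFSET_MASK) == 0);
    code[(*pos)++] = (uint16_t)(opcode_bits >> 16);
    code[(*pos)++] = 0;   /* offset not yet known */
    return patch_point;
}

/* Once the target is known, fill in the offset. The instruction stays
 * 32 bits wide, so the code emitted after it does not move. */
void patch_forward_branch(uint16_t *code, size_t patch_point, size_t target)
{
    long offset = (long)target - (long)patch_point;
    assert(offset >= 0 && offset <= 0x7FFF);
    code[patch_point + 1] = (uint16_t)offset;
}
```

The point of always reserving 32 bits is visible in patch_forward_branch: patching only overwrites a field and never changes the size of the instruction, so addresses recorded for later instructions remain valid.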
3.3.2 Glue Code
Sometimes the program has to calculate certain special values when it reaches a particular instruction for the first time (for example, ResolveMethodTableOffsetGlue). The emitter issues a “Jarl .glue” instruction to force the program to jump to the glue code. The special value is calculated in the glue code. At the end of the glue code, the calculated value is written to the word immediately following the “Jarl” instruction, and the “Jarl” instruction is changed to a “J .skip” instruction. Having done that, the program continues execution following the “Jarl” instruction. Note that the glue code is executed only once during program execution, because it is a waste of time to calculate the same special value more than once. Changing the “Jarl” instruction to a “J .skip” instruction prevents the glue code from being executed again. Figure 3.8 shows the execution of glue code. A variation of glue code does not compute a special value; however, it is also executed only once, the first time it is encountered. This variation of glue code also needs patching as described above.
Due to the existing implementation of glue code (which was written in assembly language for the 32-bit platform and always patches instructions at word alignment), whenever a “Jarl” instruction may be patched by glue code, that “Jarl” instruction must be word-aligned. In this case, a two-byte “nop16” instruction might be inserted before the “Jarl” instruction in order to satisfy the word-alignment requirement. This is because, in the existing glue code, instructions are always assumed to be word-aligned, while on our target platform (Andes) instructions may be half-word aligned. In the future, we plan to rewrite the glue code. Then the two-byte “nop16” instructions will become unnecessary. On the other hand, if the “Jarl” instruction will not be patched by the glue code, we can choose either a 16-bit (for half-word aligned) or a 32-bit (for word aligned) “Jarl” instruction. Note that the four reserved bytes (i.e., “.word ____”) following the “Jarl” instruction must always be word-aligned. The flow chart is shown in Figure 3.9. The list is shown in Table 3.5.
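The alignment decision described above can be summarized in a short sketch. The enumeration JarlPlan and the function plan_jarl are hypothetical names; the sketch only encodes the choice between a 16-bit instruction, a 32-bit instruction, and a 32-bit instruction preceded by a “nop16”, and emits nothing itself.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    JARL16,             /* 16-bit instruction at a half-word-aligned pc */
    JARL32,             /* 32-bit instruction at a word-aligned pc      */
    NOP16_THEN_JARL32   /* pad with a 2-byte nop16, then 32-bit         */
} JarlPlan;

/* Decide how to emit the instruction at address pc.
 * patched_by_glue is nonzero when the glue code will later patch it,
 * in which case the instruction must start at a word-aligned address. */
JarlPlan plan_jarl(uintptr_t pc, int patched_by_glue)
{
    int word_aligned;
    assert((pc & 1u) == 0);   /* instructions are at least half-word aligned */
    word_aligned = (pc & 3u) == 0;
    if (patched_by_glue)
        return word_aligned ? JARL32 : NOP16_THEN_JARL32;
    return word_aligned ? JARL32 : JARL16;
}
```

In all three cases the instruction ends on a word boundary, so the four reserved bytes that follow it are word-aligned.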
3.3.3 Trap-based Null Checks
Every time the VM references a new object, the reference must be checked to see whether it is null. The JIT performs null checks with traps. When a null-pointer trap occurs, the return address (which is the address of the instruction immediately following the trapping instruction) is saved in the link-pointer register (LP). If the trapping instruction is a 16-bit instruction, the return address is 2 plus the address of the trapping instruction. On the other hand, if the trapping instruction is a 32-bit instruction, the return address is 4 plus the address of the trapping instruction. In Andes, an instruction is 16-bit if and only if the first (leftmost) bit of the instruction is 1. The flow chart is shown in Figure 3.10.
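The size rule and the resulting return address can be sketched as follows. The function names insn_size and trap_return_address are hypothetical; the caller is assumed to supply the first half-word of the trapping instruction, so the sketch does not depend on how the instruction is read from memory.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the instruction whose first half-word is given:
 * 2 if the leftmost bit is 1 (16-bit instruction), otherwise 4. */
unsigned insn_size(uint16_t first_halfword)
{
    return (first_halfword & 0x8000u) ? 2u : 4u;
}

/* Return address after a trap: the address of the trapping instruction
 * plus the size of that instruction. */
uintptr_t trap_return_address(uintptr_t trap_pc, uint16_t first_halfword)
{
    assert((trap_pc & 1u) == 0);   /* instructions are half-word aligned */
    return trap_pc + insn_size(first_halfword);
}
```

The same size test can be used in the other direction: given the return address in LP, the address of the trapping instruction is LP minus 2 or LP minus 4, depending on the width of that instruction.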
Figure 3.8 Patch a forward branch instruction. The branch at StartPc is registered as a patch point (CVMJITcbufPushFixup), and “Branch endPC” is later patched to “Branch offset”.
3.4 Summary
Our emitter issues mixed 16-bit and 32-bit instructions in an attempt to reduce the resulting code size. Due to the alignment requirement in the existing JIT implementation, the emitter has to take care of the alignment of the issued instructions, adding “Nop” instructions when necessary. Because only some, but not all, registers can be used in 16-bit instructions, register allocation must be done carefully in order to generate more 16-bit instructions. We propose a simple heuristic for instruction translation in this thesis. In the next chapter, we will use benchmarks to verify the