
Chapter 2 Code Generation Overview

2.3 Conditional Execution

Conditional execution is a form of predicated execution. In the ARM ISA, almost all instructions are predicated (with the exception of certain v5T instructions). Four flag bits stored in the CPSR (Current Program Status Register) decide whether an instruction is executed. The four flags are Negative (N), Zero (Z), Carry (C), and Overflow (V), and the conditions built from them are listed in Figure 1.

There are sixteen different conditions, as shown in Figure 1. Some condition checks need to examine only one specific bit, and some need to check multiple bits, so the code generated for each check differs. For example, the following addeq instruction checks whether a single condition bit (Z) is set, whereas the addge instruction shown to its right checks whether the N flag is equal to the V flag. In the following code example, r15 is allocated to hold the condition flags (CPSR), and r21, r22, and r23 are temporary registers.

ARM:

    addeq r0, r1, r2          addge r0, r1, r2

MIPS’:

    btst  r21, r15, 1         btst  r21, r15, 0
    beqz  r21, 8              btst  r22, r15, 3
    add   r0, r1, r2          sub   r23, r21, r22
                              beqz  r23, 8
                              add   r0, r1, r2

Instructions with condition code AL do not need to check any condition bits, and the instructions with NV are never executed (like a NOP) except for some extensions in v5T [18]. Our translator generates one or multiple branches to handle condition checks. If the specified condition is not met, the attempted operation is skipped. In the EEMBC benchmark, conditionally executed instructions are used rather frequently. Table 3 shows the percentage of all executed ARM instructions that specify conditions other than AL and NV.

Benchmark      Check condition code
EEMBC-base     10.27%
EEMBC-speed    19.57%
EEMBC-space    19.84%

Table 3. Frequency of the condition code check in the EEMBC benchmark.

Opcode  Mnemonic  Meaning                             Condition flag state
0000    EQ        Equal                               Z
0001    NE        Not equal                           !Z
0010    CS/HS     Carry set/unsigned higher or same   C
0011    CC/LO     Carry clear/unsigned lower          !C
0100    MI        Minus/negative                      N
0101    PL        Plus/positive or zero               !N
0110    VS        Overflow                            V
0111    VC        No overflow                         !V
1000    HI        Unsigned higher                     C && !Z
1001    LS        Unsigned lower or same              !C || Z
1010    GE        Signed greater than or equal        N == V
1011    LT        Signed less than                    N != V
1100    GT        Signed greater than                 Z == 0 && N == V
1101    LE        Signed less than or equal           Z == 1 || N != V
1110    AL        Always                              Any
1111    NV        Special use                         None

Figure 1. All 16 condition codes in the ARM architecture. N, Z, C, and V mean the respective condition bits are set.[18]
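The checks in Figure 1 can be written out as a small reference model. The following is a minimal sketch, not the translator's implementation: the function name `condition_passed` is hypothetical, flags are passed as booleans, and NV is simply treated as "never executed", ignoring the v5T extensions mentioned above.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if the 4-bit ARM condition field `cond` holds.

    n, z, c, v are the current flag values (booleans or 0/1).
    Hypothetical helper for illustration only.
    """
    checks = {
        0b0000: z,                     # EQ
        0b0001: not z,                 # NE
        0b0010: c,                     # CS/HS
        0b0011: not c,                 # CC/LO
        0b0100: n,                     # MI
        0b0101: not n,                 # PL
        0b0110: v,                     # VS
        0b0111: not v,                 # VC
        0b1000: c and not z,           # HI
        0b1001: (not c) or z,          # LS
        0b1010: n == v,                # GE
        0b1011: n != v,                # LT
        0b1100: (not z) and (n == v),  # GT
        0b1101: z or (n != v),         # LE
        0b1110: True,                  # AL
        0b1111: False,                 # NV: assumed never executed in this sketch
    }
    return bool(checks[cond])
```

The single-flag conditions (EQ through VC) correspond to one bit test in the generated code, while HI, LS, GE, LT, GT, and LE combine two or three flags, which is why they need more target instructions.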

2.4 Condition flags handling

The four condition flags in the ARM ISA are N, Z, C, and V. They are located at bits [31:28] of the CPSR register. Two types of ARM instructions modify these four flags: the comparison instructions, and the ALU instructions or the move instructions with the s-bit set.

One target register can be reserved to store all four condition flags. However, accessing individual bits in such a register is rather expensive, since updating a single condition flag requires at least two instructions (i.e. a shift and a logical OR).

    /* Instructions that calculate the new C flag */
    slli r21, r21, 2       // store the C flag in bit[2]
    or   r15, r15, r21     // update the flag register

The overhead for updating condition flags increases for instructions that set more condition bits. Table 4 shows the frequency of all executed ARM instructions in EEMBC that need to update the condition flags.

Benchmark      Percentage
EEMBC-base     9.49%
EEMBC-speed    14.99%
EEMBC-space    15.18%

Table 4. Frequency of instructions that update condition flags in the EEMBC benchmark.

To minimize the instruction overhead of condition flag updates, we allocate each of the four flags to a separate target register. This avoids the shift operation on each flag update. The downside is that three more registers are used, which might otherwise serve optimizations that require additional temporary registers.

Since our target architecture has 16 more general-purpose registers than the ARM architecture, we can afford reserving four registers for the condition flags, given that the frequency of flag update is fairly high in the ARM application binaries.
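The difference between the two flag layouts can be illustrated with a small model. This is a sketch under assumptions: the bit position of C (bit 2) follows the slli example above, the function names are hypothetical, and the packed variant also clears the old bit, a step that the two-instruction sequence above does not show.

```python
C_BIT = 2  # bit position of C in the packed flag register (from the example above)

def set_c_packed(flags_reg, new_c):
    """Update C inside one packed 32-bit register: clear, shift, then OR."""
    cleared = flags_reg & ~(1 << C_BIT) & 0xFFFFFFFF
    return cleared | ((new_c & 1) << C_BIT)

def set_c_separate(regs, new_c):
    """Update C when it has its own register: a single move, no shift or OR."""
    regs["c"] = new_c & 1
    return regs
```

With one register per flag, a flag check also becomes a direct test of that register against zero instead of a bit extraction.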

2.5 Processor mode and Thumb instruction set

ARM supports a regular execution mode and a Thumb execution mode. The Thumb execution mode allows 16-bit instructions to be used. Thumb code can effectively reduce code size, which is considered important for many embedded applications. A special instruction, bx, must be used to switch between the Thumb execution mode and the regular ARM mode. The bx instruction format is as follows:

    bx r1

Executing this instruction switches the processor to Thumb mode if the least significant bit of register r1 is 1; otherwise, the processor stays in ARM mode. Since the prefix of a Thumb instruction is similar to that of an ARM instruction, the translator must know the execution mode to parse the instructions correctly. This makes static translation difficult, because the value of register r1 may not be known at translation time.
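The mode decision made by bx depends only on a run-time register value, which the following sketch makes explicit. The function name `bx_target` is hypothetical, and the model covers only the mode bit and the branch address, not any other architectural side effects.

```python
def bx_target(reg_value):
    """Model the mode selection of `bx rN` from the register's run-time value.

    Returns (mode, address). Bit 0 selects Thumb mode and is not part of
    the branch address. Hypothetical helper for illustration only.
    """
    thumb = bool(reg_value & 1)
    address = (reg_value & ~1) & 0xFFFFFFFF
    return ("thumb" if thumb else "arm", address)
```

Because `reg_value` exists only at run time, a static translator cannot generally tell which decoder to apply to the bytes at the branch target.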

Our static translator does not handle Thumb instructions. Thumb mode execution will be handled by the dynamic translator.

2.6 Register mapping

There are 31 general-purpose registers in the ARM architecture, but only 16 of them are visible to programmers and compilers. The rest are used to speed up exception handling. Furthermore, registers r8-r14 are banked, which means there are multiple physical copies of each register. Which physical registers are actually referenced depends on the current execution mode. Our static binary translator supports only user mode, so we do not need to track which physical registers are used. The number of ARM registers that must be maintained as part of the architecture state is therefore 16. The MIPS’ architecture has 32 32-bit general-purpose registers, so the 16 ARM registers can be mapped directly onto a subset of the target architecture registers. It is worth noting that register r0 in MIPS’ is not a constant zero as it is in the MIPS architecture.

Register mapping is conducted in two steps: (a) ARM registers r0 to r11 are mapped to MIPS’ architecture registers r0 to r11, and (b) ARM registers r12 to r15 are mapped to target registers r28 to r31. ARM registers r12 to r15 include special registers such as the PC, LR, and SP. Mapping them to consecutive target registers is preferred so that the ARM load/store multiple instructions can be translated directly into similar load/store multiple instructions in the target ISA, which requires the registers to be contiguous.
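The two-step mapping reduces to a simple index computation. A minimal sketch, with a hypothetical function name:

```python
def map_arm_register(arm_reg):
    """Map an ARM register number (0-15) to its MIPS' register number.

    r0-r11 map to r0-r11 and r12-r15 map to r28-r31, as described above.
    Hypothetical helper for illustration only.
    """
    if not 0 <= arm_reg <= 15:
        raise ValueError(f"not a user-visible ARM register: r{arm_reg}")
    return arm_reg if arm_reg <= 11 else arm_reg + 16
```

Under this mapping the ARM PC (r15) lives in target register r31.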

Other than the 16 registers reserved for mapping to ARM registers, the remaining registers in the MIPS’ architecture are used for temporary registers and for special usages such as the shifter operand. In our current translator, five of them are used as temporary registers, two of them are reserved for handling privileged mode, and four of them are saved for future usage.

In section 2.4 we discussed the need to allocate a register to hold each condition flag so that the cost of flag updates and checks can be significantly reduced. Initially, we allocated the four condition flags in one register (r15, the register mapped for CPSR). We now separate them and assign one register to each. The allocation of the target registers not mapped to the 16 ARM registers is listed in Table 5.

Index  Usage                     Index  Usage
12     Shifter operand           20     V flag
13     Shifter carry out         21     Temp register 1
14     Top of RAS                22     Temp register 2
15     Unused                    23     Temp register 3
16     Special condition flags   24     Temp register 4
17     N flag                    25     Temp register 5
18     Z flag                    26     Reserved
19     C flag                    27     Reserved

Table 5. New register allocation for the MIPS’ architecture. The ones mapped to the 16 ARM registers (i.e. r0-r11 and r28-r31) are excluded.

2.7 Executable layout

The layouts of the ARM and MIPS’ executables are contrasted in Figure 2. The binary layout of the target executable can be divided into three parts: the ARM program sections, the MIPS’ program sections, and the control management sections. The original ARM program sections are stored as part of the MIPS’ executable. Keeping the ARM program sections is needed for future dynamic translation, and it also allows the target program to access the data in the ARM program sections. All the ARM sections are placed at the same addresses as in the original ARM executable. This decision makes memory accesses much easier to handle, since no additional computation is needed to calculate the memory addresses of operands.

The MIPS’ executable has the regular text, data, and bss sections. These sections are allocated at higher memory addresses, since the lower addresses are used for the ARM sections. The control management part contains the return address stack (RAS), address mapping table, and address stub sections. The purpose of these three sections is discussed in section 2.9.

This initial binary layout placed the ARM binary at the same addresses as in the original binary. This approach has the advantage that all address computation in the ARM binary (such as relocation) can remain the same. The newly generated MIPS’ code and the supporting data structures, such as the address mapping table, were allocated in the address space not used by the ARM binary. However, we later encountered some applications that used that memory space for I/O buffers, which would cause instructions or the address mapping table to be modified at runtime.

Figure 2. Memory layout of the original ARM executable and the translated executable. In 2.b, the target architecture is MIPS’, and RAS means the return address stack.

To avoid this type of memory address overlap, we propose a new binary layout. In the new binary layout, we allocate the translated MIPS’ code and the control management code and data before the ARM.heap section. Allocating the new code in the heap region is safer because the heap is used for dynamic memory allocation, and programmers should never presume how heap space is allocated. One possible downside of this new approach is that the heap space is reduced by a small amount.

Figure 3. The new layout of the translated executable is shown in 3.b. Compared with 2.b, the new sections are moved to between the ARM.Bss section and the ARM.Heap section.

2.8 Control flow graph

Like other static binary translation tools [5][6], we also build the control flow graph of each translation unit. The control flow graph serves the purpose of control management and code optimization.

Our translator constructs the control flow graph by recognizing potential basic blocks and establishing the relations between them. The usual way to recognize basic blocks in binary code is to identify the leader of each basic block. Typical leaders are instructions following direct or indirect branches, instructions targeted by direct branches, and the program entry point. Here we must handle one additional case: the addresses of PC-relative data. Blocks that contain PC-relative data are identified so that the data is not treated as instructions, and they are excluded from the control flow graph.
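The leader rules above can be sketched as follows. This is a simplified illustration, not the translator's actual data structures: each instruction is modelled as a hypothetical `(address, kind, target)` tuple, where `kind` is one of `"other"`, `"direct"`, `"indirect"`, or `"data"` (a PC-relative data word), and the classification into these kinds is assumed to have been done already.

```python
def find_leaders(instrs, entry):
    """Return the sorted addresses of basic block leaders.

    instrs: list of (address, kind, target) tuples sorted by address.
    entry:  address of the program entry point.
    Words classified as "data" never become leaders, so they stay
    outside the control flow graph.
    """
    code_addrs = {addr for addr, kind, _ in instrs if kind != "data"}
    leaders = set()
    if entry in code_addrs:
        leaders.add(entry)                      # program entry point
    for i, (addr, kind, target) in enumerate(instrs):
        if kind not in ("direct", "indirect"):
            continue
        if kind == "direct" and target in code_addrs:
            leaders.add(target)                 # target of a direct branch
        if i + 1 < len(instrs) and instrs[i + 1][1] != "data":
            leaders.add(instrs[i + 1][0])       # instruction after a branch
    return sorted(leaders)
```

For example, a literal pool placed directly after a function return is skipped rather than being decoded as the start of a new block.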

The control flow graph we build is not precise, because the targets of indirect branches may not be known at translation time. In Chapter 3, we will discuss the code optimizations implemented in our translator. Since some optimizations are based on the control flow graph, an indirect branch into the middle of a basic block may invalidate them. Therefore, when the address mapping routine discovers at runtime that the target of an indirect branch is not a recognized basic block, it throws an exception and transfers control to our runtime translation system to handle the case.

2.9 Program control management

Program control management includes updating the ARM-PC and handling direct and indirect branches. The following subsections describe mechanisms used in the translator to manage control transfers.

2.9.1 Lazy update to PC

In the generated code, updating the ARM-PC is required since other instructions may reference the PC. However, updating the ARM-PC for each instruction incurs unacceptably high overhead. Notice that only a small percentage of instructions need to reference the PC explicitly. In many cases, the PC referenced is a known value at translation time, because it is an offset into the text section. Therefore, our translator updates the PC only when it is needed and cannot be resolved at translation time.

This means the translator generates instructions to update the PC before it is to be referenced. For instance, ARM-PC update is generated for an instruction that pushes the current ARM-PC onto the stack.

For PC-relative data access, the ARM-PC value is directly embedded as an immediate of the target instruction as shown in the following example:

ARM:

    add r1, pc, #228

MIPS’:

    movi r31, 0x0000811c   // updated ARM-PC
    addi r1, r31, 228
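The embedded constant can be computed at translation time from the instruction's address. The sketch below assumes ARM (not Thumb) state, where reading the PC yields the address of the current instruction plus 8; the instruction address 0x8114 used in the usage note is inferred from the constant in the example, not stated in the text, and the function name is hypothetical.

```python
def translate_add_pc(rd, imm, insn_addr):
    """Emit MIPS' text for the ARM instruction `add rd, pc, #imm`.

    insn_addr is the ARM address of the instruction being translated.
    In ARM state a read of the PC returns insn_addr + 8, so the value
    is known statically and can be embedded as an immediate.
    """
    pc_value = (insn_addr + 8) & 0xFFFFFFFF
    return [
        f"movi r31, 0x{pc_value:08x}",   # r31 holds the ARM-PC
        f"addi r{rd}, r31, {imm}",
    ]
```

Calling `translate_add_pc(1, 228, 0x8114)` reproduces the two target instructions shown in the example.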

2.9.2 Indirect branch handling

Direct branches can be handled at translation time, since the branch target address in the translated code is known. Indirect branches must be handled differently, since the branch target is usually unknown at compile/translation time (the target is held in a register). Indirect branches can be divided into two categories: structured and unstructured. Structured indirect branches are generated from program structures such as procedure returns and switch statements (or something similar, such as virtual functions or the computed GOTO in Fortran). They may also be used when the branch target is beyond the reach of direct jumps (the MIPS jump instruction has a 26-bit immediate field).

Structured indirect branches can be handled at translation time. Unstructured indirect branches are usually used by assembly programmers in hand-crafted code to handle arbitrary targets, and are considered non-manageable. For return branches, our translator applies the shadow stack technique [14], which is discussed in detail in section 2.9.4.

For switch-related indirect branches, we search the binary backwards from the indirect branch to find where the jump table is. Once the starting address of the jump table is known, the remaining translation is straightforward: each entry in the jump table is replaced by the translated address. This is discussed in section 2.9.5. In addition, we have a general address mapping approach as the safety net for “unstructured” indirect branches. For these branches, the target address is used to search the ARM-to-MIPS’ address mapping table to obtain the translated address.
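Once a jump table has been located, rewriting it amounts to one lookup per entry. The following is a minimal sketch assuming the translator already holds a dictionary from ARM instruction addresses to translated addresses; the names `translate_jump_table` and `addr_map` are hypothetical. Each translated entry keeps the ARM target next to the MIPS’ target, following the description in section 2.9.5.

```python
def translate_jump_table(arm_entries, addr_map):
    """Build the translated jump table from the ARM jump table.

    arm_entries: ARM target addresses read from the discovered jump table.
    addr_map:    ARM address -> translated MIPS' address (assumed available).
    Returns a list of (arm_target, mips_target) pairs.
    """
    translated = []
    for arm_target in arm_entries:
        if arm_target not in addr_map:
            # A target with no translation cannot be resolved statically.
            raise KeyError(f"no translation for ARM address 0x{arm_target:08x}")
        translated.append((arm_target, addr_map[arm_target]))
    return translated
```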

2.9.3 Address mapping table

The ARM-to-MIPS’ mapping table is generated by the translator and stored as part of the data section in the translated executable. The table maps an ARM instruction address to the address of the translated instruction. It is used to provide the target address for “unstructured” (non-return, non-switch) indirect branches (although we currently also use it for switch-based indirect branches). To minimize the size of the table, we do not keep one entry for every ARM instruction. Instead, we attempt to allocate one entry for each basic block. This approach has the drawback that hand-crafted code may set an arbitrary instruction as the target of an indirect branch, and we may fail to keep that address in our table. However, this case should be rare, and when it occurs, execution traps to the runtime system to invoke the dynamic translator. Our initial attempt was successful for normally compiled programs such as the EEMBC suite. Therefore, we further reduce the size of the address mapping table by allocating entries for a more limited set of basic blocks: the entry of a function; the return point of a function call; the entry that immediately follows a function; and the address of a “likely” target stored in the text and data sections. A well-known example of the “likely” target case is a switch table. It is important that we keep the entry address of each function in the table, because function pointers are often used as the target addresses of indirect branches (e.g. virtual function calls).

One challenge here is how to ensure every function entry can be detected. We initially locate the instruction following a function return. Although this approach may collect some false function entries because a function may have multiple return instructions, this is not a serious issue because those extra entries would not cause incorrect execution. The real difficulty comes from PC-relative data and padding for alignment requirements. Table 6 shows a comparison of the number of entries we stored in the address mapping table and the number of basic blocks. The number of selected entries is much lower than the number of basic blocks.

Benchmark      Ratio of the table entries to the number of basic blocks
EEMBC-base     0.388
EEMBC-speed    0.392
EEMBC-space    0.393

Table 6. Ratio of the number of table entries to the number of basic blocks in the EEMBC benchmark.

Each time an unstructured ARM indirect branch is executed, the control will transfer to a stub generated by the binary translator. This stub is used to look up the address mapping table and check if the current entry contains the correct ARM address. Since this table lookup is performed at runtime, it must be efficient to avoid excessive overhead.

A simple hash function is used to hash the ARM address into an index into the table. If the search is a hit, the stub returns a target address for the execution to continue. Hash collisions are resolved by linear probing, and we allow the table to grow as needed at translation time to minimize collisions. For the EEMBC benchmark suite, the generated table entry count is either 1K or 2K. If the search misses in the table, the stub traps to our runtime system.
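The lookup performed by the stub can be modelled as an open-addressing hash table. The sketch below is illustrative only: the text does not specify the hash function or the growth policy, so the word-index hash `(addr >> 2) & (size - 1)` and the 0.5 load-factor threshold are assumptions, as are all function names. A miss is represented by `None`, standing in for the trap to the runtime system.

```python
def _slot(arm_addr, size):
    # Assumed hash: drop the two alignment bits, mask to the table size.
    return (arm_addr >> 2) & (size - 1)

def build_table(pairs, size=1024, max_load=0.5):
    """Build the table at translation time, doubling its size as needed."""
    pairs = list(pairs)
    while len(pairs) > size * max_load:     # grow to keep collisions rare
        size *= 2
    table = [None] * size
    for arm_addr, mips_addr in pairs:
        i = _slot(arm_addr, size)
        while table[i] is not None:         # linear probing on collision
            i = (i + 1) % size
        table[i] = (arm_addr, mips_addr)
    return table

def lookup(table, arm_addr):
    """Run-time lookup: return the MIPS' address, or None on a miss."""
    size = len(table)
    i = _slot(arm_addr, size)
    for _ in range(size):
        entry = table[i]
        if entry is None:
            return None                     # miss: would trap to the runtime
        if entry[0] == arm_addr:
            return entry[1]                 # hit
        i = (i + 1) % size
    return None
```

Keeping the stored ARM address next to the translated address is what lets the stub check that the probed entry really belongs to the requested target.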

The frequency of table lookups in the EEMBC benchmark is less than 1%. At present, these lookups come mainly from switch-based indirect branches. We will eliminate most of them in the next version of the translator and leave the address mapping table for true “unstructured” indirect branches.

2.9.4 Return address stack

Although the address mapping table can handle all indirect branches, searching it is relatively expensive (about 10-12 instructions). To accelerate indirect branch handling, a prediction mechanism is often used in binary translators. If the branch target is as predicted, a direct branch can be used instead.

This approach usually works well; unfortunately, return branches, which are also indirect branches, are difficult to predict since a procedure can often be called from many different places. Hence, we implement the return address stack, also called shadow return address stack in [14], to speed up return branch handling.
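The text does not spell out how the return address stack operates, so the following is only a sketch of the general shadow-stack idea, under the assumption that each call records both the ARM return address and the translated return address, and that a return falls back to the address mapping table when the top entry does not match. The class and method names are hypothetical.

```python
class ReturnAddressStack:
    """Illustrative model of a shadow return address stack."""

    def __init__(self):
        self._stack = []

    def on_call(self, arm_return_addr, mips_return_addr):
        # Translated call site: remember where the return should land.
        self._stack.append((arm_return_addr, mips_return_addr))

    def on_return(self, arm_target, slow_lookup):
        """Resolve a return to ARM address `arm_target`.

        slow_lookup is the fallback, e.g. the address mapping table search.
        """
        if self._stack:
            arm_addr, mips_addr = self._stack.pop()
            if arm_addr == arm_target:
                return mips_addr            # fast path: top of stack matches
        return slow_lookup(arm_target)      # mismatch or empty stack
```

In the common case of properly nested calls and returns, the fast path replaces the 10-12 instruction table search with one comparison.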

2.9.5 Switch table lookup

Just like the return address stack for the return indirect branch handling, some mechanisms are also needed for speeding up switch indirect branch handling. In [19], a method to discover the jump table in the text section is presented. In our binary translator, a similar method is also used to discover the jump table stored in the text section in ARM executables. By applying this mechanism, the starting address and the table size of the jump table can be easily obtained, and so are the switch branch targets.

The translated jump table requires both the ARM address and the MIPS’ address of the switch target. Unlike the ARM executables, which store the jump table in the text section, the translated jump table is stored in the MIPS’ code section. Each entry of the jump table stores both the ARM target address and the MIPS’ target address. Every time a switch related indirect branch in ARM is executed, the translated code will first load both the ARM target address stored in the ARM jump table and the MIPS’ target address

In the document: Static Binary Translation and Optimization of ARM Instruction Set Architecture Applications (pages 15-0)