Thesis Organization - Portable Static Binary Translation of ARM ISA Applications Using the LLVM Compiler Infrastructure

Chapter 1 Introduction

1.6 Thesis Organization

In chapter 2, we describe the key translation issues, including register simulation, memory access simulation, system call handling, and indirect branch handling, and summarize the instruction modeling overhead from the ARM ISA to LLVM IR. In chapter 3, we discuss several translation issues and present improvements that speed up the target ARM binary. In chapter 4, we evaluate the performance of our static binary translator on the EEMBC benchmark. Finally, in chapter 5, we conclude the thesis.

(a) Direct binary translation

(b) Indirect binary translation (Our translation System)

Figure 1.1: Translation Categories

Chapter 2 LLVM IR Generation

In this chapter, we describe the key issues in translating an ARM binary to LLVM IR, including register simulation, translation details, and memory access simulation, together with the technical problems and our solutions. Finally, we summarize the instruction modeling overhead from ARM binary to LLVM IR. Our translator does not handle Thumb instructions or coprocessor instructions, since it supports user mode only.

2.1 Register Simulation

In order to maintain program behavior and control flow, our translator has to simulate ARM general-purpose registers and the status registers.

Figure 2.2 shows the 7 ARM processor modes and all registers. ARM has 31 general-purpose 32-bit registers. 16 of these registers are visible to application programs. The remaining registers are reserved for exception processing.

Registers r8-r14 are banked, which means that there are multiple copies of these registers. These banked registers are switched among the various processor modes. Since our translator supports user-mode applications only, we do not require all registers in all processor modes but only the 16 general-purpose registers in user mode.

All processor state other than the general-purpose register contents is held in the status registers. The current operating processor status is in the Current Program Status Register (CPSR). CPSR holds 4 condition flags (Negative, Zero, Carry and Overflow) and other information that we do not require [9]. Figure 2.1 shows the format of CPSR.

Figure 2.1: CPSR Register Format

The target of our translator is LLVM IR [14], which is a virtual instruction set. LLVM IR has no physical registers. Initially, our translator mapped the 16 general-purpose registers and CPSR to 17 global variables of integer type. However, it is profitable to map the four condition flags in CPSR to four separate global variables instead. The first reason is that the four condition flags are used and updated frequently: almost all ARM instructions can be executed conditionally based on the condition flags, and all ALU instructions and comparison instructions may update them. The second reason is that register mapping in direct binary translation usually has to consider the number of usable registers in the target architecture, whereas our target, LLVM IR, is not limited by the number of physical registers; as a result, we can use as many global variables as we need to gain performance.

Obviously, this simple approach to simulating ARM registers incurs tremendous overhead, since all operands must be loaded from memory before instruction execution and stored back afterwards. Fortunately, the LLVM optimizer provides the "promote memory to register" optimization [15]. This promotion removes a considerable number of memory accesses.

Figure 2.2: 7 ARM Processor Modes And All ARM Registers

1 . ARM:

Figure 2.3: Instruction Modeling : A Complete ARM Instruction

2.2 Translation Detail

In this section, we use the example in figure 2.3 to explain the translation from ARM binary to the corresponding LLVM IR.

2.2.1 Conditional Execution

Almost all ARM instructions can be conditionally executed, which means that they affect the program execution state only if the condition flags in CPSR satisfy the condition specified in the instruction. Otherwise, the instruction acts as a NOP (no operation). Field <cond> in figure 2.4 is the condition field of the instruction. The condition field has 16 possible values, which are listed in table 2.1.

Figure 2.4: Condition Field in an ARM Instruction

Opcode  Mnemonic  Meaning                               Condition flag state
0000    EQ        Equal                                 Z
0001    NE        Not equal                             !Z
0010    CS/HS     Carry set / unsigned higher or same   C
0011    CC/LO     Carry clear / unsigned lower          !C
0100    MI        Minus / negative                      N
0101    PL        Plus / positive or zero               !N
0110    VS        Overflow                              V
0111    VC        No overflow                           !V
1000    HI        Unsigned higher                       C and !Z
1001    LS        Unsigned lower or same                !C or Z
1010    GE        Signed greater than or equal          (N == V)
1011    LT        Signed less than                      (N != V)
1100    GT        Signed greater than                   (Z == 0) and (N == V)
1101    LE        Signed less than or equal             (Z == 1) or (N != V)
1110    AL        Always (unconditional)                -
1111    (NV)      Special                               -

Table 2.1: 16 Combinations Of Condition Field

Our translator translates the conditional execution feature into a checking barrier placed before the instruction body. The checking barrier checks whether the condition flags in CPSR satisfy the condition specified in the instruction; if the condition is satisfied, execution continues with the instruction body, otherwise control branches to the next instruction. In Figure 2.3, the condition field of the ARM instruction is NE, which means that the instruction executes only if the Zero flag is not set.
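The checking barrier amounts to evaluating one predicate per condition code. As a minimal sketch in Python (not the translator's actual output, which is LLVM IR), the hypothetical function condition_passed below encodes Table 2.1; the function name and the convention of passing each flag as 0 or 1 are assumptions made for illustration.

```python
def condition_passed(cond: int, n: int, z: int, c: int, v: int) -> bool:
    """Return True if the 4-bit condition field is satisfied by the flags.

    Hypothetical helper: cond is the <cond> field, and n, z, c, v are the
    simulated condition flags, each 0 or 1.
    """
    checks = {
        0b0000: z == 1,             # EQ
        0b0001: z == 0,             # NE
        0b0010: c == 1,             # CS/HS
        0b0011: c == 0,             # CC/LO
        0b0100: n == 1,             # MI
        0b0101: n == 0,             # PL
        0b0110: v == 1,             # VS
        0b0111: v == 0,             # VC
        0b1000: c == 1 and z == 0,  # HI
        0b1001: c == 0 or z == 1,   # LS
        0b1010: n == v,             # GE
        0b1011: n != v,             # LT
        0b1100: z == 0 and n == v,  # GT
        0b1101: z == 1 or n != v,   # LE
        0b1110: True,               # AL
    }
    if cond not in checks:
        # 0b1111 is the special (NV) encoding and is not modeled here.
        raise ValueError(f"unsupported condition field: {cond:#06b}")
    return checks[cond]
```

For the NE condition in Figure 2.3, condition_passed(0b0001, n, z, c, v) is true exactly when z is 0, so the instruction body runs only when the Zero flag is clear.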

2.2.2 Shifter Operand

Most data-processing instructions take two source operands. One of them is called the shifter operand, which can be either an immediate value or a register. If the shifter operand is a register, it can have a shift operation applied to it. In Figure 2.3, "r0, lsl #8" is a shifter operand: the value of r0 is logically shifted left by 8 bits, and the resulting value is used as an operand of the andsne instruction. Table 2.2 shows the five shift types in ARM.

Our translator translates a shifted register operand into additional instructions placed before the instruction body. In this example, the second source operand r0 is logically shifted left by 8 before it is used by the AND instruction.

Shift type  Meaning                      Instruction percentage in the EEMBC
ASR         Arithmetic shift right       0.94%
LSL         Logical shift left           2.83%
LSR         Logical shift right          2.26%
ROR         Rotate right                 0.15%
RRX         Rotate right with extension  0.14%

Table 2.2: 5 Shift (Rotation) Types
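The extra instructions emitted for a shifted register operand compute one of the five shift types in Table 2.2. The following Python sketch models them on 32-bit values. The function name shift_operand, the string tags for the shift types, and the restriction to shift amounts from 0 to 31 are assumptions for illustration, and the shifter carry-out is deliberately not modeled.

```python
MASK32 = 0xFFFFFFFF


def shift_operand(value: int, shift_type: str, amount: int = 0,
                  carry_in: int = 0) -> int:
    """Apply an ARM shift type to a 32-bit value (sketch, amount in 0..31)."""
    value &= MASK32
    if shift_type == "LSL":
        return (value << amount) & MASK32
    if shift_type == "LSR":
        return value >> amount
    if shift_type == "ASR":
        # Reinterpret as signed so Python's >> replicates the sign bit.
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> amount) & MASK32
    if shift_type == "ROR":
        amount %= 32
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    if shift_type == "RRX":
        # 33-bit rotation by one: the carry flag enters at bit 31.
        return ((carry_in & 1) << 31) | (value >> 1)
    raise ValueError(f"unknown shift type: {shift_type}")
```

The operand "r0, lsl #8" from Figure 2.3 corresponds to shift_operand(r0, "LSL", 8). Only ROR and RRX need more than one primitive operation, which matches the observation that they are the costly but rare cases.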

2.2.3 Updating Condition Flags

The condition flags in CPSR are usually modified by the comparison instructions (CMN, CMP, TEQ or TST) and by arithmetic, logical, and move instructions with the S qualifier. Table 2.3 shows the four condition flags and their meaning after instruction execution.

Our translator generates additional instructions that update the condition flags after the instruction body if the instruction has the S qualifier. In Figure 2.3, the andsne instruction updates the N, Z, and C flags.

Condition flag  Value and meaning
N (Negative)    Result less than 0 ? 1 : 0
Z (Zero)        Result is 0 ? 1 : 0
C (Carry)       Operation produced a carry ? 1 : 0
                Operation produced a borrow ? 0 : 1
V (Overflow)    Operation has signed overflow ? 1 : 0

Table 2.3: 4 Condition Flags
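To make Table 2.3 concrete, the sketch below computes the four flags for a 32-bit addition and subtraction in Python. The helper names are hypothetical and the code is only a model of the semantics, not the LLVM IR our translator emits; the carry and overflow rules follow the table (for subtraction, C is 1 when no borrow occurs).

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def signed_overflow(exact: int) -> int:
    """1 if the exact signed result does not fit in 32 bits, else 0."""
    return int(not (-(1 << 31) <= exact < (1 << 31)))


def add_with_flags(a: int, b: int):
    """Return (result, (N, Z, C, V)) for a flag-setting 32-bit addition."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)                       # carry out of bit 31
    v = signed_overflow(to_signed(a) + to_signed(b))
    return result, (n, z, c, v)


def sub_with_flags(a: int, b: int):
    """Return (result, (N, Z, C, V)) for a flag-setting 32-bit subtraction."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                              # 1 means no borrow
    v = signed_overflow(to_signed(a) - to_signed(b))
    return result, (n, z, c, v)
```

Each flag needs its own comparison or extension step, which is why a single flag-setting ARM instruction expands into several LLVM IR instructions.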

2.3 Memory Access Simulation

In this section, we discuss memory management in the target binary. First, we describe static and dynamic memory allocation. Second, we illustrate memory access with an example and present an issue. Third, we propose a solution to this issue.

Static allocation - Our translator keeps the static sections of the source ARM binary (.text and .data) in the target binary, so that the translated code can access general data in the data section and PC-relative data in the text section. We place these sections at the same addresses in the target binary as in the source binary; as a result, no extra address computation is needed.

Dynamic allocation - Heap and stack allocation are supported by the library; as a result, we discuss the choice of linking approach here. Our source binary is compiled with static linking instead of dynamic linking.

Static linking makes the library part of our translated code, so it avoids the parameter identification problem and makes our translator more portable across operating systems. To illustrate the parameter identification problem: when translating "bl printf", we might translate this call instruction into the LLVM instruction "call i32 (i8*, ...)* @printf(parameter list)". However, it is difficult to identify the parameter list of printf from the source binary. Stack allocation is handled by each function's prolog and epilog, and that code is also part of our translated code.

Example and issue - In figure 2.5, we illustrate a load word instruction. This instruction loads a value from the stack, and the corresponding LLVM IR is easy to understand.

    ARM:
        ldr r0, [sp, #8]
    LLVM IR:
        %tmp1 = load i32* @sp
        %tmp2 = add i32 %tmp1, 8
        %tmp3 = inttoptr i32 %tmp2 to i32*
        %tmp4 = load i32* %tmp3
        store i32 %tmp4, i32* @r0

Figure 2.5: Instruction Modeling : An ARM load word Instruction

An interesting issue is that if we translate stack-operation instructions like general data accesses, the target stack is no longer used (i.e., the target SP register is wasted, since the variable sp is mapped to a data register instead of the SP register). The question is therefore whether we should allocate the target stack for another purpose. We explain our improvement in chapter 3.
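The load in figure 2.5 can be mimicked in a few lines of Python to show the data flow between the simulated registers and memory. This is only a sketch under stated assumptions: the dictionary regs stands in for the global variables, the bytearray memory stands in for the address space (simulated addresses index it directly, mirroring the same-address placement of the source sections), words are little-endian, and the function name is hypothetical.

```python
import struct


def ldr_r0_sp_8(regs: dict, memory: bytearray) -> None:
    """Model of 'ldr r0, [sp, #8]' following the steps in figure 2.5."""
    tmp1 = regs["sp"]                                # load the simulated sp
    tmp2 = (tmp1 + 8) & 0xFFFFFFFF                   # 32-bit address arithmetic
    tmp4 = struct.unpack_from("<I", memory, tmp2)[0] # load the word at tmp2
    regs["r0"] = tmp4                                # store into the simulated r0


# Usage: place a word at address 24 and load it through sp = 16.
regs = {"r0": 0, "sp": 16}
memory = bytearray(64)
struct.pack_into("<I", memory, 24, 0xDEADBEEF)
ldr_r0_sp_8(regs, memory)
```

Note that sp is read like any other simulated register here, which is exactly why the hardware stack pointer of the target is left without a role.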

2.4 Program Layout

We contrast the runtime layout of the source ARM binary and the target ARM binary in figure 2.6. The target sections can be divided into two parts: the sections of the source ARM binary and the target's own sections. Keeping the source ARM text section in the target binary is necessary for dynamic translation in the future, and it also allows the target binary to access PC-relative data in the source ARM binary. All the source ARM sections are placed at the same addresses as in the source ARM binary.

This decision makes memory accesses much easier to handle, since no additional address computation is required. The target's own part consists of the text, data, bss, and stack sections. The target data section contains the additional data used to help memory access simulation, the target text section contains the generated code and the address mapping table used to handle indirect branches, and the target stack is allocated to speed up accesses to global variables.

(a) Source ARM binary

(b) Target ARM binary

Figure 2.6: Runtime Binary Layout

2.5 System Call Handling

As is well known, a system call is a mechanism for communication between a user program and the operating system. Each architecture defines a different instruction in its ISA to raise this kind of exception, such as the Software Interrupt instruction (SWI) in ARM, the SYSCALL instruction in MIPS, and the INT instruction in x86. Our source architecture, ARM, provides SWI to enter Supervisor mode and request a particular supervisor function. Furthermore, the user program has to pass a number of parameters to the operating system, such as the system call number, which indicates which entry in the exception vector table will be performed. After the operating system finishes its exception service routine, it returns the result to the user program. How parameters are passed to the operating system and how the result is returned are defined by the architecture's EABI; for example, ARM passes parameters in registers r0-r3 and returns the result of the SWI instruction in register r0.

The mapping between variables and registers is decided by the register allocation phase; as a result, the EABI standard introduces a technical restriction in our translator: it must map LLVM variables r0-r3 to target physical registers r0-r3 before the SWI instruction executes, and map target physical register r0 back to LLVM variable r0 afterwards, in order to comply with the EABI standard.

Our translator solves this mapping restriction by translating an ARM SWI instruction into a function call and passing the parameters of the system call as the parameters of the function call, so that we can rely on the ARM procedure call standard. In detail, the ARM procedure call standard specifies that physical registers r0-r3 hold the argument values passed to a procedure and that register r0 also holds the result returned from it; the MIPS calling convention is analogous, with its own argument and return registers.

The code in figure 2.7 is the LLVM IR corresponding to an ARM SWI instruction, and it solves the mapping restriction from variables r0-r3 to physical registers r0-r3. However, if the definition of the syscall function were written in LLVM IR, it would not solve the mapping restriction from physical register r0 back to variable r0: the value stored to variable r0 would be the return value of the syscall function instead of the return value of the ARM SWI instruction, since both standards use the same register r0 for their return value. Thus, our translator writes the definition of the syscall function in target assembly code (figure 2.8) to ensure that the value returned in physical register r0 is the return value of the ARM SWI instruction instead of the return value of the syscall function.

    ARM:
        swi 0x00123456
    LLVM IR:
        %swir0 = load i32* @r0
        %swir1 = load i32* @r1
        %swir2 = load i32* @r2
        %swir3 = load i32* @r3
        %swirt = call i32 @syscall(i32 %swir0, i32 %swir1, i32 %swir2, i32 %swir3)
        store i32 %swirt, i32* @r0

Figure 2.7: Instruction Modeling : An ARM System Call Instruction

        .text
        .global syscall
        .align 2
    syscall:
        swi 0x00123456
    .LBB1_1:
        bx lr
        .size syscall, .-syscall

Figure 2.8: Definition Of syscall Function (Written in ARM assembly code)

2.6 Program Control Management

2.6.1 Indirect Branch Handling

Direct branches can be handled easily in our static translator since their target addresses are known at translation time. Generally speaking, indirect branches are classified into two categories: structured and unstructured indirect branches. Structured indirect branches include return-based indirect branches and switch-based indirect branches, which go through the jump table generated for a switch statement. All other indirect branches are unstructured. A typical static binary translator handles return-based indirect branches with a Return Address Stack (RAS) [13], switch-based indirect branches by recovering the jump table, and unstructured indirect branches with an address mapping table.

Our translator adopts similar ideas to handle indirect branches, except for the Return Address Stack. In brief, we do not adopt the Return Address Stack because standard LLVM IR does not support arrays of label type; hence, the Return Address Stack is hard to implement.

2.6.2 Address Mapping Table

Our translator generates an ARM-to-LLVM address mapping table, implemented as a dedicated function, to handle return-based indirect branches and other unstructured indirect branches. The table maps the address of an ARM instruction to the corresponding LLVM label when a non-switch indirect branch occurs.

In order to minimize the size of the address mapping table, our translator does not keep entries for all ARM instructions. Instead, it allocates an entry for the leading instruction of each recognized basic block in our imprecise control flow graph. If the target address is not found in the address mapping table, the executing program throws an exception and aborts the process. These corner cases will be handled by adding LLVM dynamic components in the future.

In order to speed up the search in the address mapping table, a typical static binary translator designs a hash function to index the table. However, as mentioned before, LLVM does not support arrays of label type; as a result, hash functions and other search algorithms are infeasible in our binary translator. The only choice for implementing the address mapping table is the switch statement in LLVM IR. As is well known, if the case values in a switch statement fall in a narrow range, a jump table is generated; otherwise a sequence of if-else statements is generated. Unfortunately, our address mapping table belongs to the latter kind, so the program must perform a linear search in the address mapping table.

2.7 Summary of Instruction Modeling Overhead

In this section, we list the instruction modeling cases that each cost a sequence of LLVM IR instructions.

1. Register Simulation

As mentioned in section 2.1 (Register Simulation), our translator uses global variables to emulate the ARM physical registers. Therefore, the generated program must load the source operands from memory before executing each instruction, and store the destination operand back to memory afterwards. This creates a heavy burden.

2. Redundant Condition Flag Update

It would be possible for our translator to generate code that updates only the flags that are used later. In Figure 2.3, assume that only the Z flag of the andsne instruction is used later; it is then unnecessary to generate code that updates the N and C flags. Nevertheless, our translator generates code that updates all three flags, and we rely on an existing LLVM optimization pass, "global value numbering", to remove the unnecessary code.

3. Updating Condition Flags and Shifter Operand

There are some special operations in the ARM ISA, such as the ROR (32-bit rotation) and RRX (33-bit rotation) shift types, and the carryFrom, borrowFrom, and overflowFrom operations [9] used to update the condition flags. Our translator must emulate each of these single operations with several LLVM IR instructions; however, these instructions occur rarely in the EEMBC benchmark, so they do not impact the performance of the whole binary translation system.

4. Accessing Multiple Registers

The ARM architecture is capable of loading and storing multiple registers. These instructions perform a block transfer of any number of the general-purpose registers to or from memory. LLVM IR does not provide a similar feature for our translator to map the semantics onto; as a result, our translator translates a Load Multiple (LDM) or Store Multiple (STM) instruction into a sequence of Load Register (LDR) or Store Register (STR) instructions.
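The expansion of a block transfer in item 4 can be sketched as follows. This Python model assumes the increment-after addressing mode only and ignores base register write-back; the function name expand_load_multiple and the tuple representation are hypothetical and chosen for illustration.

```python
def expand_load_multiple(reg_mask: int):
    """Expand an LDM register list into single-register loads (sketch).

    reg_mask is the 16-bit register list of the instruction (bit i set
    means register ri is transferred). The result is a list of
    (register_number, byte_offset) pairs, where each offset is relative
    to the base address sampled once before the first load. The
    lowest-numbered register is loaded from the lowest address.
    """
    registers = [i for i in range(16) if reg_mask & (1 << i)]
    return [(reg, 4 * slot) for slot, reg in enumerate(registers)]
```

For example, a register list containing r0, r1, and r4 yields three loads at offsets 0, 4, and 8 from the base address, so the cost of the translated code grows linearly with the number of registers in the list.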

Chapter 3 Translation Issues And Improvement

3.1 Target Stack Allocation

The drawback of our approach to simulating memory access is that the target SP (Stack Pointer) register is no longer used, because we do not use the target stack to simulate the source stack. Registers are a valuable CPU resource; accordingly, our translator allocates a target stack to hold all simulated global registers and data (16 ARM registers and 4 condition flags). The reason is that the ARM architecture requires two load instructions to access a global variable: one to load the address of the global variable and another to load its value. Therefore, we allocate the target stack in our translated binary and copy all global variables onto it in order to accelerate accesses to them. Accessing a local variable on the stack requires merely one load instruction with the help of the SP register. This improvement pays off greatly in the generated program, since global variables must be accessed before and after the translated code of each ARM instruction.

3.2 Address Mapping Table Improvement

Performing a linear search in the address mapping table would cause a bottleneck in our generated programs; hence, our translator improves the design of the address mapping table. First, our translator divides the address mapping table into many smaller address mapping tables according to the quotient of the ARM PC divided by a predefined divisor. Second, a non-switch indirect branch jumps to a runtime dispatcher. This runtime dispatcher performs the same division on the branch target to obtain a quotient. Then, this quotient is used to index a jump table that decides which smaller address mapping table should be searched. The key to speeding up the linear search in each smaller table is the choice of the predefined divisor, so that each smaller address mapping table contains only a couple of entries. With this improvement, the time our translator spends in the address mapping table approaches the time spent by other binary translators that use a hash algorithm.

Figure 3.1 contrasts the structure of original address mapping table against our improved address mapping table.

(a) Structure of Original Address Mapping Table (Linear Search)

(b) Structure of Improved Address Mapping Table (Jump Table plus Linear Search)

Figure 3.1: Structure Of Address Mapping Table
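The two-level lookup described in this section can be sketched in Python as follows. The divisor value, the function names, and the use of a dictionary in place of the jump table are assumptions for illustration; in the translated program both levels are expressed as switch statements in LLVM IR.

```python
DIVISOR = 64  # hypothetical value; the thesis only says it is predefined


def build_tables(entries, divisor=DIVISOR):
    """Split (arm_address, label) pairs into small tables keyed by quotient."""
    tables = {}
    for addr, label in sorted(entries):
        tables.setdefault(addr // divisor, []).append((addr, label))
    return tables


def dispatch(tables, target, divisor=DIVISOR):
    """Runtime dispatcher: pick a small table, then search it linearly."""
    bucket = tables.get(target // divisor, [])  # models the jump table
    for addr, label in bucket:                  # short linear search
        if addr == target:
            return label
    # Models the exception and abort when the target is not a known
    # basic-block leader.
    raise LookupError(f"no mapping for ARM address {target:#x}")


# Usage with three made-up basic-block leaders.
tables = build_tables([(0x8000, "bb0"), (0x8010, "bb1"), (0x8100, "bb2")])
label = dispatch(tables, 0x8010)
```

A smaller divisor produces more tables with fewer entries each, which shortens the linear search at the cost of a larger jump table.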

3.3 Switch Table Recovery

As noted earlier, a jump table is used to implement a switch statement if the case values fall in a narrow range. The indirect branch that indexes the jump table to obtain the target address is called a switch-based indirect branch. Although this kind of indirect branch can be handled by our address mapping table, our translator still recovers the jump tables in the ARM binary to speed up the generated programs [7].

Chapter 4 Experiments And Results

In this chapter, we list our simulation environment, the benchmark, and
