

III. Design and Implementation

3.3. Implementation Detail

3.3.4. Partition the main L-function into several L-functions

The complexity of the LLVM optimizer and static compiler is at least polynomial with power greater than one. Since our translator works quickly, the speed of the LLVM optimizer and static compiler dominates the execution time of our system. If we can reduce the size of the main function, our system can generate the translated binary faster.

The comparison between the one-function method and the multi-function method, which distributes instruction blocks uniformly, is shown in Figure 23. Furthermore, the distribution method can be changed if the user wants. We introduce how our analyzer and translator partition the main L-function and distribute the instruction blocks into different L-functions. Moreover, we also introduce the method we implemented for handling function switching, and some modifications made to fit the optimizations of the LLVM optimizer.

[Figure 23 depicts the original binary being translated either into a single function (the original method) or into several slice functions, slice function 1 through slice function 4 (the new method).]

Figure 23. Comparison between one function and multi-function

3.3.4.1. Overview

Figure 24 shows the framework of the LLVM module that our system generates using multiple LLVM functions. Since the architecture states and the Application Program Status Register (APSR) are regarded as global variables in the LLVM module, we don't have to pass any of them as parameters when switching L-functions.

In the multi-L-function version, the main L-function just handles some initialization routines, such as setting up the stack space and placing the command-line arguments and environment settings on the stack, and then jumps to the L-function that contains the instruction block whose address is the entry point of the input binary. Moreover, since frequent switching between functions may cause stack overflow, we let the slices of L-function return to the main L-function when needed. Therefore, the main L-function must be able to handle function switching, acting as a switching stub.


A slice of L-function may call other slices if the branch target is not in it. Besides, the indirect branch table is also partitioned into several parts, which are appended to the end of each slice of L-function.

Figure 24. The framework of the multi-function version of the LLVM module

3.3.4.2. How many LLVM functions have to be generated

The size of each L-function is user defined: the total size of all instructions in the input binary is divided by this user-defined value to decide how many L-functions to generate. However, we have to avoid making any L-function too small, so an L-function is permitted to be a little larger than the defined size if this situation happens.

3.3.4.3. Strategy for distributing instruction blocks

The cost of a single function switch is not high, but it can seriously hurt performance if switches occur frequently. Our goal is to lower the frequency of function switching by finding instructions that are more likely to be executed sequentially. Therefore, the instructions in the same B-function must be distributed to the same L-function.

Since the only useful information our analyzer can get is the target address of a calling function, our analyzer has to find all function entries in the input binary first. How to find the entries of B-functions has been described in 3.3.2.1. Then our analyzer constructs a graph, regarding each function as a node and a calling action as an edge. Take Figure 25 as an example: F1 calls F2 and F3, so there is one directed edge from F1 to F2 and one from F1 to F3.

Figure 25. An example of function graph

After generating this graph, our analyzer performs depth-first search (DFS) to find all connected components and puts the B-function pointers in a list. Each search starts at a node with zero in-edges, that is, one that no other B-function calls. However, since the graph is directed, this approach would find two components if two functions that no other B-function calls both call the same B-function. To raise the probability that the same component is distributed to the same L-function, our analyzer regards the case described above as only one component.
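A minimal sketch of this grouping step, in Python for illustration (the input format and all names are assumptions; real code would use addresses rather than string labels). It runs DFS from each zero-in-degree node and merges two groups whenever a later search reaches a B-function already visited from an earlier root:

```python
from collections import defaultdict

def group_b_functions(entries, calls):
    """entries: B-function entry labels; calls: (caller, callee) pairs.
    Returns lists of B-functions that belong to the same component,
    treating components that share a callee as a single component."""
    succ = defaultdict(list)
    indeg = {f: 0 for f in entries}
    for caller, callee in calls:
        succ[caller].append(callee)
        indeg[callee] += 1

    owner = {}                      # B-function -> index into groups
    groups = []
    for root in (f for f in entries if indeg[f] == 0):
        gid = len(groups)
        groups.append([])
        merged_into = gid

        def dfs(f):
            nonlocal merged_into
            if f in owner:
                # Reached from an earlier root: fold this group into that
                # one, so shared callees keep both callers together.
                tgt = owner[f]
                if tgt != merged_into:
                    groups[tgt].extend(groups[merged_into])
                    for g in groups[merged_into]:
                        owner[g] = tgt
                    groups[merged_into] = []
                    merged_into = tgt
                return
            owner[f] = merged_into
            groups[merged_into].append(f)
            for nxt in succ[f]:
                dfs(nxt)

        dfs(root)
    return [g for g in groups if g]
```

With F1 and F2 both calling F3, and F4 calling F5, this yields two groups: {F1, F2, F3} and {F4, F5}.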

Finally, our analyzer sorts all components according to their size and then distributes them, beginning with the largest one. A component will be split into several components if no L-function has enough space to hold it. Besides, some flexibility is added here in case a component is less than 10% larger than the maximal space. This approach may lower the compilation speed, but it is worthwhile since the component won't be split.
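The distribution loop can be sketched like this (illustrative Python; the interface, the greedy splitting fallback, and the 10% slack constant mirror the description above but are otherwise assumptions):

```python
def distribute(components, max_size, slack=0.10):
    """components: lists of (b_func, size) pairs.  Place components
    largest-first into L-function slices of at most max_size; a component
    up to `slack` larger than max_size still goes into one slice whole,
    and only components too big even with slack are split."""
    comp_size = lambda c: sum(s for _, s in c)
    slices = []                              # each: [funcs, used_size]
    for comp in sorted(components, key=comp_size, reverse=True):
        need = comp_size(comp)
        for sl in slices:                    # first slice with room
            if sl[1] + need <= max_size:
                sl[0].extend(f for f, _ in comp)
                sl[1] += need
                break
        else:
            if need <= max_size * (1 + slack):
                slices.append([[f for f, _ in comp], need])
            else:                            # split: pack greedily
                cur, used = [], 0
                for f, s in comp:
                    if cur and used + s > max_size:
                        slices.append([cur, used])
                        cur, used = [], 0
                    cur.append(f)
                    used += s
                slices.append([cur, used])
    return [funcs for funcs, _ in slices]
```

Note how a 110-byte component fits whole into a 100-byte slice thanks to the slack, avoiding a split.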

We choose DFS instead of other search methods such as breadth-first search (BFS) because the order in which the B-functions of a connected component are recorded depends on which one is traced first. We want to ensure that not only the B-functions in the same connected component but also those on the longest path in the graph have a higher probability of being distributed into the same L-function.

Our analyzer also implements a uniform distribution method: it just distributes B-functions sequentially until the size of the L-function exceeds the user-defined threshold. Although the performance of this method is usually worse, it has an advantage when handling function switching, as described in the following section.

3.3.4.4. Function mapping table

To find where the instruction block of an indirect branch target is, our analyzer has to maintain a table for this purpose. Since indirect branch targets are either function entries or return addresses, we can use the same method as in 3.3.2.3, except that it uses a call instruction instead of a branch instruction.

We can also use a helper function to find the target LLVM function. Binary search is used in this helper function, and each entry contains the address of a B-function head, the address of the next B-function head, and the L-function the B-function was distributed to. The helper function claims that it has found the target if the target address lies between the first two values of an entry. The advantage of this approach is that even if the target is not one of the function entries, the translated program can still find the corresponding LLVM function. Although binary search is fast, it still takes time if the same target is searched frequently. Therefore, we also add a function table cache to enhance the performance.

Using the helper function is efficient if the uniform distribution strategy is used. Since the B-functions that are distributed to the same L-function lie contiguously in the input binary, these functions can be described by the same table entry. As a result, the number of entries used in the helper function equals the number of slices of LLVM functions, and the search time becomes very low.
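A sketch of the helper function's lookup with its cache, in Python for illustration (the entry layout follows the text: B-function start address, next B-function's start address, and the owning L-function number; the class and method names are invented):

```python
import bisect

class FunctionMap:
    """entries: (start, next_start, l_func) triples.  Maps any address
    inside a B-function -- entry or return address -- to the L-function
    holding it; a one-entry cache short-circuits repeated lookups."""
    def __init__(self, entries):
        self.entries = sorted(entries)
        self.starts = [e[0] for e in self.entries]
        self.cache = None

    def lookup(self, addr):
        # Fast path: same region as last time.
        if self.cache and self.cache[0] <= addr < self.cache[1]:
            return self.cache[2]
        # Binary search for the entry whose [start, next_start) holds addr.
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            raise KeyError(hex(addr))
        start, nxt, l_func = self.entries[i]
        if not (start <= addr < nxt):
            raise KeyError(hex(addr))
        self.cache = (start, nxt, l_func)
        return l_func
```

Because each entry covers the whole range [start, next_start), a return address in the middle of a B-function resolves to the same L-function as the function entry itself.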

3.3.4.5. Function switching

The easiest method for L-function switching is just calling another L-function directly, but this approach causes stack overflow if switching occurs frequently. We use two kinds of method to avoid stack overflow: the first is returning to the main function and then calling another function, and the second is returning to the main L-function only when a counter is larger than the user-defined threshold.

Furthermore, the main L-function must be able to handle function switching, so the target address must be passed between the main function and the other slices of L-function. Therefore, we declare that all slices of L-function have two parameters, the target address and the call counter (omitted if the user sets the function to return to the main L-function every time before function switching), and two return values, the function to call next time and the target address. If the first return value is zero, the program looks up the function mapping table and then calls the corresponding L-function. Figure 26 is an example that uses switch cases to implement the function mapping table; its input is the return value of the previous L-function.

[Figure 26 depicts a switch on the function number returned by the previous L-function: case 0 looks up the function mapping table by target address, while cases 1 through n call L-function 1 through n directly. The accompanying table maps B-function addresses to function numbers, e.g., 0x80c0 → 1, 0x803e → 2, 0x8100 → n, 0x1b0ce → 1.]

Figure 26. An example of a function switching handler

The last modification is that we have to let all slices of L-function jump to the correct instruction block according to the first parameter, the target address. Function switching always occurs when calling a function or returning to the caller, which is the same condition under which indirect branches occur; therefore, the same lookup table can be used for finding the target. We add a branch instruction at the head of each slice of L-function so that it can start executing at the correct position. Figure 27 illustrates the control flow of each slice of L-function. Arrows A and C indicate what we explained above; besides, B, C, and D indicate the flow when handling indirect branches. The mapping table serves both purposes, so the code size of the binary generated by our system can be smaller.

[Figure 27 depicts a slice of L-function: a branch at the entry jumps to the mapping table, the table dispatches into the instruction blocks, and control either goes back to the table for indirect branches or returns to the main function (arrows A–D in the original figure).]

Figure 27. The control flow of each slice of LLVM function

3.3.4.6. Local variable remapping

In some instructions, the result is stored in local variables, and these local variables were created for the main L-function. Therefore, these variables become undefined after partitioning the main L-function into several slices of L-function. We must create these variables for every slice of L-function and substitute the operands of these instructions with the new ones. For convenience, our translator marks instruction blocks that contain several uses of local variables, so this substitution can be performed more efficiently.

3.3.4.7. Global variable remapping

The architecture states and the application program status register are regarded as global variables in our system, and they are always accessed by memory load and store operations. Fortunately, the LLVM optimizer provides a memory-to-register optimization that replaces these memory operations by mapping the values to registers, which enhances performance tremendously. This optimization finds all of the allocation instructions in the bitcode file and promotes the local variables to registers. An example is shown in Figure 28; only one memory reference is needed after optimizing.


Before optimization:

    %0 = load i32* @SP
    %1 = add i32 %0, -4
    %2 = load i32* @R0
    %3 = inttoptr i32 %1 to i32*
    store i32 %2, i32* %3
    br label %L_next

After optimization:

    %0 = add i32 %SP.0, -4
    %1 = inttoptr i32 %0 to i32*
    store i32 %R0.0, i32* %1
    br label %L_next

Figure 28. An example of mem2reg optimization (STR r0, [SP, #-4])

The memory-to-register optimization is only available on local variables, but our system regards the architecture states and APSR as global variables. The reason the original version, which uses only the main L-function, can still benefit from the memory-to-register optimization is that the LLVM optimizer performs a global variable optimization that converts a global variable into a local variable if the global variable is used in only one LLVM function.

For the reason described above, what our system has to do is substitute local variables for the global variables and let the global variables appear only in the main function. However, the latter goal is discarded: all global variables would have to be passed to the other LLVM functions, and the only way is to regard them as parameters, so about twenty parameters would lower the reliability of our system because of the rapid growth of the stack. Since global variables are not accessed frequently in the main L-function, the performance won't be influenced much without the memory-to-register optimization there. As a result, the only modification our system has to make is loading all global variables into local variables when entering an LLVM function, and storing the latest values of the local variables back to the global variables before leaving it.
