Find All Kinds of Data - Implementation Detail

III. Design and Implementation

3.3. Implementation Detail

3.3.1. Find All Kinds of Data

3.3.1.1. Data Structure

Although memory consumption during translation is not a primary concern of our translator, we still want an efficient structure to record the positions of data.

The most intuitive solution is to use several bits to indicate the kind of each halfword. This method is easy to implement, but it costs too much space. For example, a binary of about one megabyte may need about 200 kilobytes, about one fifth of the original binary. Besides, our analyzer and translator would have to read every halfword one by one, even when the region being translated is a large block of data. If our program can jump to the end of a block of data as soon as it attempts to translate the head of that block, translation takes less time.
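The size estimate above can be checked with a few lines of arithmetic. The sketch below assumes 3 bits per halfword, which is the minimum needed to distinguish five kinds (code, PC-relative data, switch table, padding, unknown); the helper name `bitmap_bytes` and the choice of five kinds are assumptions made here for illustration.

```python
import math

def bitmap_bytes(binary_bytes: int, kinds: int) -> int:
    """Size in bytes of a per-halfword kind bitmap (hypothetical helper)."""
    bits_per_halfword = math.ceil(math.log2(kinds))  # 5 kinds -> 3 bits
    halfwords = binary_bytes // 2                    # one entry per 16-bit halfword
    return math.ceil(halfwords * bits_per_halfword / 8)

size = bitmap_bytes(1024 * 1024, kinds=5)
print(size, size / (1024 * 1024))  # bitmap size and its fraction of the binary
```

Under these assumptions a one-megabyte binary needs a bitmap of roughly 190 kilobytes, which is in line with the "about one fifth" figure quoted above.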

If we define a run of contiguous bytes of the same kind as a set, then we can describe a set by giving its start address, its end address, and its kind. Figure 11 shows an example. The input data starts as one undefined set, and our analyzer uses different methods to discover each kind of data and mark it. These sets are mutually exclusive, so no set is ambiguous after analysis.

[Figure 11: an undefined set, bounded by a start address and an end address, is partitioned into Code, PC-relative data, Code, Switch table, Padding, and Code sets.]

Figure 11. An example of set analyzing
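A set as described above could be represented by a small record, and the translator could use it to skip over data. This is a minimal sketch: the names `Kind`, `AddrSet`, and `next_code_address` are hypothetical, and treating the end address as one past the last byte is an assumption, since the text does not say whether the end address is inclusive.

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    UNDEFINED = 0
    CODE = 1
    PC_RELATIVE = 2
    SWITCH_TABLE = 3
    PADDING = 4

@dataclass
class AddrSet:
    start: int  # first byte of the set
    end: int    # one past the last byte (half-open range; an assumption)
    kind: Kind

DATA_KINDS = {Kind.PC_RELATIVE, Kind.SWITCH_TABLE, Kind.PADDING}

def next_code_address(addr: int, sets: list[AddrSet]) -> int:
    """Return addr if it is not inside a data set, else the end of that set."""
    for s in sets:
        if s.kind in DATA_KINDS and s.start <= addr < s.end:
            return s.end
    return addr

# Address range taken from the PC-relative data set shown in Figure 13.
sets = [AddrSet(0x813c, 0x8150, Kind.PC_RELATIVE)]
print(hex(next_code_address(0x813c, sets)))
```

With this representation the translator jumps from the head of a data set directly to its end, instead of visiting each halfword in between.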

Regarding contiguous bytes as a set is very useful: it takes much less memory than a bitmap, especially when a data set is large, and it makes it easy to tell our translator that it has reached data and can skip it. Furthermore, marking a small set inside the undefined set is easy. Our analyzer only needs to create an entry that stores the start and end addresses of the set, and then check whether a set of the same kind lies immediately before or after it. If so, the two sets can be merged, just like a union operation on two sets, as shown in Figure 12.

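[Figure 12: before the union, a Code set starting at 0x80f8 is followed by a PC-relative data set (0x8114 to 0x8118) and a newly marked PC-relative data set (0x8118 to 0x811c) inside the undefined set; after the union, the two are merged into one PC-relative data set (0x8114 to 0x811c).]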

Figure 12. An example of sets union operation
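The union operation of Figure 12 can be sketched as follows. The function name `mark` and the representation of a set as a `(start, end)` tuple with an exclusive end are assumptions for illustration; the addresses are those of Figure 12.

```python
def mark(sets, start, end):
    """Add [start, end) to a list of same-kind sets, merging neighbours."""
    merged = []
    for s, e in sets:
        if e < start or end < s:
            # Disjoint and not touching: keep the existing set unchanged.
            merged.append((s, e))
        else:
            # Overlapping or adjacent: absorb it into the set being added.
            start, end = min(s, start), max(e, end)
    merged.append((start, end))
    merged.sort()
    return merged

pc_relative = [(0x8114, 0x8118)]
pc_relative = mark(pc_relative, 0x8118, 0x811c)
print([(hex(s), hex(e)) for s, e in pc_relative])
```

Because the new set begins exactly where the existing one ends, the two entries collapse into a single entry covering 0x8114 to 0x811c.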

This data structure has a further benefit in saving memory. Since the bytes are read sequentially and their addresses are strictly increasing, our analyzer can discard sets whose addresses are smaller than that of the current halfword. Take Figure 13 and Figure 14 as an example: Figure 13 shows the original version, in which our analyzer does not discard any data that has already been used, and Figure 14 shows the discarding version. The green arrow indicates where our analyzer is currently analyzing, and the blue arrows indicate the LDR instructions that tell our analyzer where the PC-relative data is. As a result, the discarding version uses less memory, since in almost all cases the PC-relative data can be merged into a single set. In our experience, only two set entries are needed for binaries generated by GCC.

[Figure 13: without discarding, the analyzer keeps three PC-relative data entries (0x813c to 0x8150, 0x81dc to 0x81e4, and 0x820c to 0x8210), interleaved with code that begins at 0x80f8.]

Figure 13. An example of non-discarding used data version

[Figure 14: with discarding, the analyzer keeps a single PC-relative data entry (0x813c to 0x8150) following the code that begins at 0x80f8.]

Figure 14. An example of discarding used data version
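The effect of discarding can be illustrated with a short simulation. This is a sketch under stated assumptions: the data ranges follow Figure 13, but the analyzer positions in `events` are hypothetical, as are the helper name `discard_passed` and the use of exclusive end addresses.

```python
from collections import deque

def discard_passed(live: deque, current: int) -> None:
    """Drop sets that end at or before the halfword being analyzed."""
    while live and live[0][1] <= current:
        live.popleft()

# (analyzer position, data set marked at that position).
# Positions are hypothetical; the data ranges follow Figure 13.
events = [
    (0x8120, (0x813c, 0x8150)),
    (0x8160, (0x81dc, 0x81e4)),
    (0x81f0, (0x820c, 0x8210)),
]

live = deque()
peak = 0
for position, data_set in events:
    discard_passed(live, position)  # forget data the analyzer has passed
    live.append(data_set)
    peak = max(peak, len(live))

print(peak, [(hex(s), hex(e)) for s, e in live])
```

In this run the number of live entries never exceeds one, whereas the non-discarding version would hold all three entries at the end.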

Our analyzer uses a linked list to maintain the sets of each kind, as shown in Figure 15, since it needs to pop the front element and insert elements at the back. Normally, a list of sets is in increasing address order, so our program only has to check the first set in the list of a given kind to decide what to do. Therefore, our program can execute faster.

[Figure 15: one list per kind (Code, PC-relative data, Switch table, Padding, Unknown); each node stores an end address and a start address.]

Figure 15. How these sets stored in the memory
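The per-kind lists could be sketched as below, with Python's `collections.deque` standing in for the linked list because it offers the same pop-front and push-back operations. The names `lists`, `push_back`, and `kind_at` are hypothetical, the end addresses are assumed to be exclusive, and `kind_at` assumes it is queried with increasing addresses.

```python
from collections import deque

KINDS = ("code", "pc_relative", "switch_table", "padding", "unknown")
lists = {kind: deque() for kind in KINDS}  # one list per kind

def push_back(kind, start, end):
    """Append a set; callers are assumed to add sets in address order."""
    lists[kind].append((start, end))

def kind_at(addr):
    """Classify addr by looking only at the front set of each list."""
    for kind, sets in lists.items():
        while sets and sets[0][1] <= addr:  # front set already passed: pop it
            sets.popleft()
        if sets and sets[0][0] <= addr:     # addr lies inside the front set
            return kind
    return None                             # no recorded set covers addr

push_back("pc_relative", 0x813c, 0x8150)
push_back("pc_relative", 0x81dc, 0x81e4)
```

Since each list is sorted and passed sets are popped, a lookup never walks a whole list; it inspects only the head of each one.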

3.3.1.2. PC-relative Data

Every time our analyzer finds an LDR-family instruction that uses PC as its base register, it passes the associated information to the set handler. In this way, this kind of data can be found correctly.

8120: 4b06      ldr r3, [pc, #24]
… … …
8126: 4806      ldr r0, [pc, #24]
… … …
813a: bd10      pop {r4, pc}    (end of function)
813c: 00000000  .word 00000000
8140: 000bd36c  .word 000bd36c

Figure 16. An example of PC-relative data (using LDR)

Take Figure 16 as an example. At 0x8120, the address of the word that the program has to load is 0x8120 + 4 + 24 = 0x813c, where 0x8120 + 4 is the value of the current PC. Our analyzer can therefore mark 0x813c to 0x813f as PC-relative data. Alignment is also important when handling PC-relative data. At 0x8126 in Figure 16, 0x8126 + 4 + 24 = 0x8142 is not the correct target address, because the current PC is not word-aligned. The address must be Align(0x8126 + 4, 4) + 24 = 0x8140.
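The address computation above can be written as a small function. The names `align_down` and `ldr_literal_target` are hypothetical; the values checked are the two LDR instructions of Figure 16.

```python
def align_down(value: int, alignment: int) -> int:
    """Round value down to a multiple of alignment (a power of two)."""
    return value & ~(alignment - 1)

def ldr_literal_target(instr_addr: int, imm: int) -> int:
    """Address loaded by a Thumb `ldr rX, [pc, #imm]` at instr_addr."""
    pc = instr_addr + 4              # PC reads as the instruction address + 4
    return align_down(pc, 4) + imm   # the base is the word-aligned PC

print(hex(ldr_literal_target(0x8120, 24)))
print(hex(ldr_literal_target(0x8126, 24)))
```

For 0x8120 the alignment step changes nothing, while for 0x8126 it rounds the PC value 0x812a down to 0x8128 before the offset is added.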

Because PC-relative data is word-aligned, our analyzer can be sure that an instruction at an address that is not word-aligned is not the start of data, so the probability of mistranslation is lower.

3.3.1.3. Switch Table and Padding

A finite state machine (FSM) is used in our analyzer to find switch tables, because an FSM is flexible and easy to implement.

The left side of Figure 17 is our FSM for finding switch cases; the arrows with no number indicate all other inputs that are not listed on the right side of Figure 17. Every time our analyzer reaches the final state, it has collected the information needed to generate switch functions, such as the number of cases, the default target address, and the address of every case. The work of generating the switch tables is done by our translator.

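[Figure 17, left: FSM diagram with a Start state and a Final state; the transitions are labeled with the numbers of the patterns listed below.]

1. Cmp %case, #case_num
2. Bhi #default_target
3. Tbb [pc, %case]
4. Tbh [pc, %case, lsl #1]
5. Adr %reg, #table_head
6. Ldr pc, [%reg, %case, lsl #2]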

Figure 17. Finite State Machine for finding switch cases
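A recognizer in the spirit of Figure 17 might look like the sketch below. It covers only the CMP, BHI, TBB/TBH sequence (patterns 1 to 4); the exact transitions of the original diagram, the handling of patterns 5 and 6, and the rule that any other instruction returns to the start state are assumptions. The instruction encoding as `(mnemonic, operand)` tuples and the name `find_switches` are also hypothetical.

```python
def find_switches(instructions):
    """Return (index, mnemonic, case_num, default_target) for each match."""
    state = "start"
    case_num = default = None
    found = []
    for i, (mnemonic, operand) in enumerate(instructions):
        if mnemonic == "cmp":
            # Pattern 1; a new CMP also restarts a match in progress.
            state, case_num = "seen_cmp", operand
        elif state == "seen_cmp" and mnemonic == "bhi":
            # Pattern 2: remember the default target.
            state, default = "seen_bhi", operand
        elif state == "seen_bhi" and mnemonic in ("tbb", "tbh"):
            # Pattern 3 or 4: final state reached, report the switch.
            found.append((i, mnemonic, case_num, default))
            state = "start"
        else:
            # Any other input (unnumbered arrows): assumed to reset the FSM.
            state = "start"
    return found

program = [("mov", None), ("cmp", 3), ("bhi", 0x8200), ("tbb", None),
           ("cmp", 5), ("add", None), ("tbh", None)]
print(find_switches(program))
```

In the sample program only the first sequence is reported; the second one is interrupted by an unrelated instruction, so the FSM falls back to the start state before it sees the TBH.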

If the input is TBB when the FSM enters the final state, our analyzer has to check whether the number of cases is odd and, if so, mark the byte after the table as a padding byte.
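The padding rule can be sketched as follows, relying on the fact that TBB table entries are one byte each while Thumb instructions must start on a halfword boundary. The function name `tbb_padding` is hypothetical, and the table is assumed to start at a halfword-aligned address; the addresses in the usage lines are made up for illustration.

```python
def tbb_padding(table_start: int, case_count: int):
    """Return the address of the padding byte, or None if none is needed."""
    table_end = table_start + case_count  # one byte per TBB entry
    # An odd number of entries ends on an odd address, so one byte of
    # padding is needed before the next halfword-aligned instruction.
    return table_end if case_count % 2 == 1 else None

print(tbb_padding(0x8204, 5))  # odd count: padding byte follows the table
print(tbb_padding(0x8204, 4))  # even count: no padding
```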

From the document: An LLVM-Based Static Binary Translation System for Thumb-2 Executables (pages 31-35)