Chapter 3 Code Optimization Overview

3.2 Check Condition Code Selectively

For conditional execution instructions, the translated code will test for the condition and skip over the actual operation should the check fail. There are optimization opportunities for multiple conditional execution instructions with the same conditions.

In this case, our translator generates the check-condition instruction only once and adjusts the branch target so that a failed check skips every instruction guarded by that condition.

The code generator will check the next instruction to see whether the two instructions are under the same execution condition. Figure 4 shows one example of this optimization. In Figure 4, the baseline translation would generate four target instructions. However, since the two ARM instructions check the same condition code, our optimization can remove the second branch by modifying the target address of the first branch to skip over two instructions instead of one.

A similar optimization applies to two consecutive instructions with inverse condition codes. Here we cannot eliminate the second check-condition instruction, but we can create a shorter execution path. Consecutive instructions with inverse condition codes imply that if the condition-code check of the first instruction fails, the check of the second instruction must succeed. The offset of the conditional branch for the first instruction can therefore be modified to bypass the second check. This is illustrated in Figure 5.

ARM:

    addeq $Rd1, $Rs1, $Rt1
    subeq $Rd2, $Rs2, $Rt2

MIPS’:

    beqz $R_FLAG_Z, 8
    add  $Rd1, $Rs1, $Rt1
    sub  $Rd2, $Rs2, $Rt2

Figure 4. Example of translating two conditional instructions with the same condition. The condition-code check of the second ARM instruction is eliminated.

ARM:

    addeq $Rd1, $Rs1, $Rt1
    subne $Rd2, $Rs2, $Rt2

MIPS’:

    beqz $R_FLAG_Z, 12
    add  $Rd1, $Rs1, $Rt1
    bnez $R_FLAG_Z, 8
    sub  $Rd2, $Rs2, $Rt2

Figure 5. Example of two conditional instructions with a reverse condition. The branch offset of the first instruction (12, set in bold face in the original figure) is modified.
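The two pairings described in this section can be sketched as a small code generator. This is only an illustration under stated assumptions: the (condition, operation) mini-IR, the function name `translate`, and the "skip N" notation (number of following target lines to jump over, used instead of byte offsets) are all hypothetical, only the EQ/NE conditions are modelled, and the first instruction of a pair is assumed not to modify the flags.

```python
# Hypothetical mini-IR: each ARM instruction is a (condition, operation) pair.
# "al" means unconditional; only EQ/NE are modelled in this sketch.
INVERSE = {"eq": "ne", "ne": "eq"}
# Branch emitted to skip the guarded operation when the condition fails.
SKIP_BRANCH = {"eq": "beqz $R_FLAG_Z", "ne": "bnez $R_FLAG_Z"}


def translate(arm, optimize=True):
    """Return target lines; 'skip N' = number of following lines to jump over."""
    out, i = [], 0
    while i < len(arm):
        cond, op = arm[i]
        if cond == "al":
            out.append(op)
            i += 1
            continue
        # Look one instruction ahead (assumes op itself leaves the flags alone).
        nxt = arm[i + 1] if optimize and i + 1 < len(arm) else None
        if nxt and nxt[0] == cond:
            # Same condition: one branch guards both operations.
            out += [f"{SKIP_BRANCH[cond]}, skip 2", op, nxt[1]]
            i += 2
        elif nxt and nxt[0] == INVERSE[cond]:
            # Inverse condition: a failed first check lands directly on the
            # second operation, bypassing the second check.
            out += [f"{SKIP_BRANCH[cond]}, skip 2", op,
                    f"{SKIP_BRANCH[nxt[0]]}, skip 1", nxt[1]]
            i += 2
        else:
            out += [f"{SKIP_BRANCH[cond]}, skip 1", op]
            i += 1
    return out


same = [("eq", "add $Rd1, $Rs1, $Rt1"), ("eq", "sub $Rd2, $Rs2, $Rt2")]
inverse = [("eq", "add $Rd1, $Rs1, $Rt1"), ("ne", "sub $Rd2, $Rs2, $Rt2")]
for line in translate(same) + ["--"] + translate(inverse):
    print(line)
```

With `optimize=False` the same-condition pair produces four target lines; with the optimization it produces three, and the inverse-condition pair keeps four lines but takes a shorter path when the first check fails.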

3.3 Update condition flag selectively

Translating instructions that update the condition flags incurs a high instruction overhead. For example, if we update all four condition flags for each such instruction, the overhead could be as high as eight additional instructions (i.e. two additional instructions generated per flag). This is one of the most critical areas for optimization.

In [1], the FX!32 translator deals with the same challenge, since the x86 architecture has a condition-code architecture similar to ARM's. FX!32 traces the condition-code dependencies in the control flow graph of each translated unit, and a flag is not updated unless it is actually used. In other words, we try to locate unnecessary flag updates and avoid generating flag-updating instructions for them. For example, if instruction A updates all flags, and instruction B, which follows A, also updates all flags, then there is no need to translate the flag update for A: no instruction uses the flags produced by A, and all of them are overwritten by B. Through this analysis, we selectively update the condition flags and eliminate most of the redundant updates. As in FX!32, we also traverse successor blocks to remove unnecessary flag updates across blocks. The performance impact of this cross-block elimination is very significant.
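The A/B example above amounts to a backward liveness pass over the flags. The following is a minimal sketch of that idea for a single block, not the translator's actual implementation. The function name `needed_flag_updates` and the (uses, defs) encoding are assumptions made for illustration, and `live_out` stands in for whatever the traversal of successor blocks would report.

```python
FLAGS = frozenset("NZCV")


def needed_flag_updates(block, live_out=FLAGS):
    """block: list of (uses, defs) flag strings per instruction, in program order.

    Returns, for each instruction, the subset of the flags it defines that
    must actually be materialised in the translated code.
    live_out: flags assumed to be read after the block (conservatively all).
    """
    live = set(live_out)
    keep = [set() for _ in block]
    for i in range(len(block) - 1, -1, -1):   # walk backwards
        uses, defs = block[i]
        keep[i] = set(defs) & live            # only updates somebody will read
        live -= set(defs)                     # earlier values are overwritten here
        live |= set(uses)                     # flags this instruction reads
    return keep


# A updates all flags, B updates all flags, then a branch reads Z.
block = [("", "NZCV"), ("", "NZCV"), ("Z", "")]
print(needed_flag_updates(block))                  # within-block analysis only
print(needed_flag_updates(block, live_out=set()))  # successors known not to read flags
```

With the conservative default, every flag update of A is dropped but B must still update all four flags. When the successor blocks are known not to read the flags, B only needs to update Z, which is where the cross-block traversal pays off.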

3.4 Special condition code

The previous two subsections deal with eliminating redundant condition checks and updates. However, there are cases where we must check or update the condition flags and the instruction overhead is still high. These are cases where multiple condition flags need to be checked. For example, the GT condition requires the Z, N, and V flags to be checked.

To avoid the cost of updating three flags, we can combine multiple condition-code checks into one special condition. As illustrated in Figure 6, the check of the GE condition can be implemented with the set-less-than instruction in MIPS’.

In order to carry out this optimization, we must ensure there is only one type of check between two condition code updates.

ARM:

    cmp   Rd1, 0
    add   Rd2, Rs1, Rs2
    addge Rd3, Rs3, Rs4
    cmp   Rd1, 0

MIPS’:

    movi R_TEMP_1, 0
    slts R_FLAG_SPECIAL, Rd1, R_TEMP_1
    // The flag is set if Rd1 is less than R_TEMP_1, i.e. the GE condition fails
    add  Rd2, Rs1, Rs2
    bnez R_FLAG_SPECIAL, 8
    // If the flag is set, the condition code is not matched and the add is skipped
    add  Rd3, Rs3, Rs4

Figure 6. An example of replacing ordinary condition flags with a special condition flag.

The multiple condition code update/check cases include HI, LS, GE, LT, GT, and LE. Since these six condition codes are set based on a less-than test, we can use only one set-less-than instruction to update the register allocated to represent the special condition code. This optimization reduces multiple condition bit updates and checks to only one update and check.
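To see why a single less-than result can stand in for several flag checks, the sketch below compares the ARM flag-based definition of each of the six conditions with a one-bit "special" value produced by one set-less-than. It is an illustrative model only: the helper names are hypothetical, and the use of swapped operands (for GT/LE) and of an unsigned set-less-than (for HI/LS) is an assumption about how the mapping could be done, not a description of the MIPS’ instruction set.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def cmp_flags(a, b):
    """N, Z, C, V as set by a 32-bit ARM 'CMP a, b' (computes a - b)."""
    a, b = a & MASK, b & MASK
    r = (a - b) & MASK
    n = r >> 31
    z = int(r == 0)
    c = int(a >= b)                    # carry = no borrow
    v = ((a ^ b) & (a ^ r)) >> 31      # signed overflow of the subtraction
    return n, z, c, v


def slt(a, b):                         # signed set-less-than
    return int(to_signed(a) < to_signed(b))


def sltu(a, b):                        # unsigned set-less-than (assumed available)
    return int((a & MASK) < (b & MASK))


def via_flags(cond, a, b):
    n, z, c, v = cmp_flags(a, b)
    return {"ge": n == v, "lt": n != v,
            "gt": z == 0 and n == v, "le": z == 1 or n != v,
            "hi": c == 1 and z == 0, "ls": c == 0 or z == 1}[cond]


def via_special(cond, a, b):
    # One set-less-than produces the special flag; one test consumes it.
    special = {"ge": slt(a, b), "lt": slt(a, b),
               "gt": slt(b, a), "le": slt(b, a),
               "hi": sltu(b, a), "ls": sltu(b, a)}[cond]
    holds_when = 0 if cond in ("ge", "le", "ls") else 1
    return special == holds_when


samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, 0x80000001, 0xFFFFFFFF]
agree = all(via_flags(c, a, b) == via_special(c, a, b)
            for c in ("ge", "lt", "gt", "le", "hi", "ls")
            for a in samples for b in samples)
print(agree)
```

The restriction stated above still applies: the special flag encodes one particular comparison, so it can only serve checks of a compatible type until the next condition code update.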

3.5 Combined conditional branch

ARM uses condition codes to carry out conditional branches. In some cases, where MIPS’ requires only one compare-and-branch instruction, ARM may need two to get the job done. Such cases include:

1) CMP + conditional branch
2) TST + conditional branch
3) TEQ + conditional branch

Although it seems that the cases above favor MIPS-like architectures, in practice, the advantage is not very significant. This is because the condition flags may be used in instructions other than the conditional branch, so the update of the condition flag cannot be eliminated.

Conditional branch optimization provides some interesting results on the translated EEMBC code. For some functions, the translated binary has even fewer target instructions than the source instructions. This is rather unusual when translating a more complex ISA (i.e. ARM) to a simpler ISA (i.e. MIPS’).
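As a rough sketch of the first case (CMP + conditional branch), the pass below fuses a compare with the branch that follows it, but only when the flags set by the compare have no other reader. The tuple-based IR, the function name `combine_branches`, and the `sole_consumer` set (assumed to come from a flag-usage analysis) are hypothetical; only EQ/NE branches are handled here.

```python
def combine_branches(block, sole_consumer):
    """Fuse 'cmp a, b' + 'beq/bne label' into one compare-and-branch.

    block: list of tuples, e.g. ("cmp", "r0", "r1"), ("beq", "L1").
    sole_consumer: indices of cmp instructions whose flags are read only by
    the branch that immediately follows (assumed result of flag analysis).
    """
    out, i = [], 0
    while i < len(block):
        ins = block[i]
        nxt = block[i + 1] if i + 1 < len(block) else None
        if (ins[0] == "cmp" and nxt is not None
                and nxt[0] in ("beq", "bne") and i in sole_consumer):
            # e.g. ("beq", "r0", "r1", "L1"): compare and branch in one step
            out.append((nxt[0], ins[1], ins[2], nxt[1]))
            i += 2
        else:
            out.append(ins)            # flags still needed: keep both instructions
            i += 1
    return out


block = [("cmp", "r0", "r1"), ("beq", "L1"), ("add", "r2", "r3", "r4")]
print(combine_branches(block, sole_consumer={0}))
print(combine_branches(block, sole_consumer=set()))
```

When the pair is fused, two source instructions become one target instruction, which is how a translated function can end up with fewer instructions than its source.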

3.6 Other optimizations

The optimization methods introduced in the previous sections are part of the code generation components, which work on the ARM IR. Optimizations at code generation time are rather limited since they have no knowledge of instructions that have not yet been emitted. For example, the redundant condition code check elimination and the inverse condition check optimization are limited to consecutive instructions. More powerful optimizations can be implemented after the target IR is generated. For example, classical local optimizations such as DCE (Dead Code Elimination), CSE (Common Sub-expression Elimination), CF (Constant Folding) and CP (Constant Propagation and Copy Propagation) can be applied within a larger scope.

Adding this phase of optimization serves two purposes: 1) it may catch optimization opportunities missed when the ARM binary was compiled (the binary could have been compiled with no optimizations), and 2) it may catch new opportunities introduced by our binary translation.

We have implemented a local optimization phase which identifies and reports on opportunities existing in the target IR. This phase also estimates the potential performance gain in terms of the number of instructions eliminated. The process is iterative: based on profile analysis, we collect code patterns that may be optimized, and these patterns are given to the local optimization phase, which identifies and reports their occurrences across the complete set of benchmarks. Based on the estimated performance gain, we prioritize which transformations to implement first. The performance estimation of these local optimizations is discussed in Chapter 5.
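A toy version of such an estimator is sketched below: it runs constant propagation, constant folding and dead code elimination over one block and reports how many instructions would disappear. Everything here is an assumption made for illustration (the four-field IR, the `li`/`add`/`sub` opcodes, the function name `estimate_savings`, the availability of immediate operands on the target, and the absence of side effects); it is not the local optimizer described in this section.

```python
def estimate_savings(block, live_out):
    """block: list of (dest, op, a, b); op is 'li' (a is an int), 'add' or 'sub'.

    Operands are register names (str) or immediates (int).
    Returns (optimised block, number of instructions eliminated).
    """
    consts, folded = {}, []
    for dest, op, a, b in block:
        a = consts.get(a, a)                      # constant propagation
        b = consts.get(b, b)
        if op in ("add", "sub") and isinstance(a, int) and isinstance(b, int):
            a, b, op = (a + b if op == "add" else a - b), None, "li"  # folding
        if op == "li":
            consts[dest] = a
        else:
            consts.pop(dest, None)                # dest is no longer a known constant
        folded.append((dest, op, a, b))

    live, kept = set(live_out), []
    for dest, op, a, b in reversed(folded):       # dead code elimination
        if dest not in live:
            continue                              # result never read: drop it
        live.discard(dest)
        live.update(x for x in (a, b) if isinstance(x, str))
        kept.append((dest, op, a, b))
    kept.reverse()
    return kept, len(block) - len(kept)


block = [("t1", "li", 4, None), ("t2", "li", 8, None),
         ("t3", "add", "t1", "t2"), ("r0", "add", "r1", "t3")]
optimised, saved = estimate_savings(block, live_out={"r0"})
print(optimised, saved)
```

Summing the eliminated counts weighted by how often each block executes would give an estimate expressed in the same unit as the rest of the evaluation, namely executed instructions.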

Chapter 4 Simulation Environment

The target platform and the tool chain for our binary translation system are still under development. The processor chip is sampling and some test boards are now available. However, we evaluate the performance of our binary translator with simulations, where hooks let us collect more detailed profiles of benchmark execution.

4.1 Simulators

We have conducted both functional and performance simulations. For functional simulation, we have a MIPS’ simulator ported from SID [15]. SID is a simulator that supports multiple ISAs, and also provides support for testing, validation and debugging.

We use a modified GDB 6.3 [16] to verify ARM binaries and collect profiles.

For performance simulation with micro-architecture details, we use the SimpleScalar ARM to measure the executed cycles. We also have a MIPS’ micro-architecture simulator to measure the performance of MIPS’ code. The configurations we used for ARM and MIPS’ are similar: for example, we assume a single issue, in-order processor with 32KB of I-cache and 32KB of D-cache. The MIPS’ has a deeper pipeline than the SimpleScalar ARM, but we assume both have the same clock frequency.

4.2 Benchmarks

The benchmark we use is the EEMBC [17] benchmark suite version 1.1. EEMBC is commonly used by embedded system developers to tune their hardware designs and software tool chains. Its 55 programs are divided into six categories: 8-16 bit, automotive, consumer, networking, office, and telecom. The EEMBC benchmarks can be compiled as normal versions or as lite versions. To speed up our simulations, we use the lite versions.

The ARM compiler we used to compile EEMBC is GCC 3.4.3 with static linking.

To test our static translation comprehensively, we compiled the EEMBC benchmark with 3 different options: EEMBC-base, EEMBC-speed, and EEMBC-space. EEMBC-base is compiled with option “-O0”, EEMBC-speed is compiled with option “-O2”, and EEMBC-space is compiled with “-Os”. Since our translator does not translate the Thumb instruction set, we did not create versions with Thumb instructions.

Chapter 5 Experimental results

In this chapter we evaluate the performance of the optimizations discussed in Chapter 3.

5.1 Baseline code generation

The performance improvements from each optimization discussed in Chapter 3 are presented in Figure 7. The baseline used for comparison is the translated code using the basic translations described in Chapter 2. Performance is measured by the ratio of the dynamic number of executed MIPS’ instructions to the number of executed ARM instructions. The baseline translation, labeled “BASELINE”, has a ratio of 2.58 for EEMBC-base, 3.62 for EEMBC-speed, and 3.6 for EEMBC-space. A ratio of 2.58 means that, on average, each ARM instruction takes 2.58 translated MIPS’ instructions to execute. Optimized ARM binaries tend to have a higher translation ratio because compiler optimizations eliminate many simple instructions, such as copy operations, which can often be translated into a single target instruction. The baseline ratio of 2.58 is broadly in line with past experience of binary translation between general-purpose architectures.

Each performance bar is tagged with the name of an optimization. For example, the bars for the optimization that eliminates redundant condition checks are labeled “CHECK”, and the bars for the optimization that eliminates unnecessary flag updates are labeled “UPDATE”. The optimizations are applied in order, so the performance gain is cumulative. For example, the performance gain of “UPDATE” includes the gain from “CHECK”, and “REGMAPPING” includes the gain contributed by all other optimizations.

5.2 Selectively check condition code

As shown in Table 3, a large fraction of instructions are conditional: they must check the condition code to determine whether to execute. Such instructions make up nearly 20% of both EEMBC-speed and EEMBC-space. This indicates that eliminating redundant condition checks has good potential for performance improvement when translating ARM binaries to other RISC architectures without predicated execution. Although it may be interesting to translate ARM instructions to the Itanium architecture, which does have predicated instructions, there is no practical need to do so because Itanium is not designed for embedded systems.

The bars under the name “CHECK” in Figure 7 are results from applying the redundant condition check elimination. Compared with the baseline translation, this optimization yields no gains for EEMBC-base, and small gains for both EEMBC-speed (from 3.62 to 3.57) and EEMBC-space (from 3.6 to 3.55). This seemingly low performance gain indicates that although conditional execution is frequent in ARM code, there are not many consecutive instructions using the same condition in the compiled EEMBC code. The frequency of using the same condition increased slightly when ARM binaries are optimized (i.e. in EEMBC-speed and EEMBC-space).

However, there is a different way to perform redundant condition check elimination, which would require more complex data flow analysis and incur a much higher transformation cost. Instructions may share the same condition as their execution predicate without being adjacent. We can use a separate phase to find such instructions and group them together, so that a single branch skips the entire group.

This is different from translating predicated instructions. For architectures with predicates, the analysis to determine if two instructions are under the same condition is easier – just check if the common predicate is updated between the two instructions. To determine whether two non-consecutive instructions are under the same execution conditions is a little more complex since multiple condition bits are involved. We are currently evaluating the potential for such an optimization.

5.3 Selectively update condition flag

Table 4 shows the percentage of instructions that may update condition flags in the original ARM programs. In EEMBC-base, instructions that update condition flags are almost as frequent as instructions that check condition codes. For EEMBC-speed and EEMBC-space, flag-update instructions are somewhat less frequent than conditional execution instructions.

The result of applying redundant condition update elimination is shown by the bars labeled “UPDATE” in Figure 7. The ratio for EEMBC-base decreases from 2.58 to 1.87, a 38% performance improvement. The other two benchmarks improve even more: the ratio drops from 3.57 to 2.48 for EEMBC-speed, a 44% gain, and from 3.55 to 2.44 for EEMBC-space, a 46% gain. Redundant flag update elimination is the most significant optimization. When translating ARM binaries to other embedded architectures, it should be the first optimization to consider.
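The percentages quoted here appear to be speedup-style gains, i.e. the old ratio divided by the new ratio, minus one, rather than the fraction of instructions removed. The short calculation below checks that reading against the ratios in the text; the function name `gain` is an assumption, and because the inputs are rounded ratios the results match the quoted figures only to within about one percentage point.

```python
def gain(before, after):
    """Speedup-style gain: how much faster 'after' is relative to 'before'."""
    return before / after - 1.0


def reduction(before, after):
    """Fraction of executed instructions removed (shown for contrast)."""
    return 1.0 - after / before


steps = {
    "EEMBC-base": (2.58, 1.87),
    "EEMBC-speed": (3.57, 2.48),
    "EEMBC-space": (3.55, 2.44),
}
for name, (before, after) in steps.items():
    print(f"{name}: gain {gain(before, after):.1%}, "
          f"instructions removed {reduction(before, after):.1%}")
```

The same formula applies to the gains reported for the later optimizations, since each is measured against the ratio left by the previous step.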

Although flag updating appears to be about as frequent as condition checking, the cost of flag updating is higher, so this optimization yields a higher return.

5.4 Combined conditional branch

Combining condition code checking with a branch into a single compare-and-branch is one of the more interesting optimizations discussed here. Most of the other optimizations eliminate redundancies introduced by binary code translation, but the combined conditional branch transformation also compresses multiple instructions into one. It gives our translator a chance (although small) to execute fewer translated instructions than source instructions.

The result of the combined conditional branch optimization is shown by the bars labeled “CCB” in Figure 7. All three benchmarks improve substantially.

Unlike the previous two optimizations, this optimization improves EEMBC-base more than the other two. The ratio for EEMBC-base decreases from 1.87 to 1.47, a 27% performance gain. For EEMBC-speed and EEMBC-space, the improvement is about 19%.

5.5 Special condition flag

Special condition flag optimization combines multiple condition updates and checks into one condition update and check, thus saving instruction overhead for flag updates and condition checks. In Figure 7, EEMBC-base improves only slightly with this optimization (from 1.46 to 1.39, a gain of about 5%), while EEMBC-speed and EEMBC-space see much greater speedups: about 18% for EEMBC-speed (from 2.08 to 1.77) and 14% for EEMBC-space (from 2.08 to 1.82).

5.6 Check inverse condition code selectively

The inverse condition code check optimization has a minor impact on performance. All three benchmarks gain 2-3% from it. This is no surprise, because Section 5.2 shows that even the same-condition optimization does not yield notable gains. The approach that groups instructions with the same execution condition (i.e. predicate) and uses one branch for each group (as discussed in Section 5.2) can also be applied here to enhance the inverse condition check optimization.

5.7 Register remapping

In the initial design, we allocated the four condition flags in one register, that is, a mapped register for CPSR. The flag checking and updating operations are carried out just like the ARM architecture. After learning the importance of flag emulation, we decided to keep each flag in a separate register to avoid instruction overhead to fetch/store the flag from/to the flag register.
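The overhead being avoided can be seen by modelling both layouts. In the sketch below, the packed layout uses the ARM CPSR bit positions (N = 31, Z = 30, C = 29, V = 28), so every flag read needs a shift and a mask and every flag write needs a clear and an insert. The helper names are hypothetical, and the comments about extra work describe this model, not measured MIPS’ instruction counts.

```python
CPSR_BIT = {"N": 31, "Z": 30, "C": 29, "V": 28}   # ARM CPSR flag positions


def read_packed(cpsr, flag):
    # Packed layout: shift + mask before the flag can be tested.
    return (cpsr >> CPSR_BIT[flag]) & 1


def write_packed(cpsr, flag, value):
    # Packed layout: clear the old bit, then insert the new one.
    bit = 1 << CPSR_BIT[flag]
    return (cpsr & ~bit & 0xFFFFFFFF) | (bit if value else 0)


# Separate layout: one (hypothetical) register per flag, modelled as a dict.
# A read is a plain register use and a write is a plain register definition.
flags = {"N": 0, "Z": 0, "C": 0, "V": 0}

cpsr = write_packed(0, "Z", 1)
flags["Z"] = 1
print(hex(cpsr), read_packed(cpsr, "Z"), flags["Z"])
```

Keeping one register per flag trades register pressure for the removal of these extract and insert steps on every check and update.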

In Figure 7, the bars with name “REGMAPPING” show the performance of this optimization. For EEMBC-base, the gain is about 12% (from 1.36 to 1.21), and the gain is more significant for the optimized ARM binaries. EEMBC-speed gains 21% (from 1.73 to 1.43) and EEMBC-space gains 25% (from 1.77 to 1.42).

With all of the above optimizations, the translated code runs at a ratio of 1.21 for EEMBC-base, 1.43 for EEMBC-speed, and 1.42 for EEMBC-space. The average ratio of the three benchmarks is 1.35. It is generally considered very cost-effective to get many applications ready for a new platform with only 35% instruction path length overhead.


Figure 7. Performance improvement from various optimizations. (Y-axis shows the executed instruction count ratio.)

5.8 Local optimization estimation

As discussed in Section 3.6, local optimizations such as DCE, CSE, CF and CP may be applied to the target architecture IR after code generation from the ARM IR. This exploits redundancy elimination opportunities introduced by the code translator. We implemented a local optimizer to identify such opportunities and estimate their potential contribution. Figure 8 shows the estimate: adding local optimizations may eliminate an additional 5% of target instructions.

5.9 Cycle count of the benchmark

Figure 8. Estimation of the potential of local optimization (blue bars: before local optimization; red bars: after local optimization).

It might be unfair to compare the performance of translated code based merely on the instruction path length. For modern embedded processors, the micro-architecture also plays a very important role. In Figure 9, we compare the performance of ARM code and translated code using simulated execution cycles. As mentioned in Section 4.1, we use SimpleScalar ARM and the MIPS’ SID for cycle simulations. Although the SimpleScalar ARM simulator is easily accessible, its micro-architecture may not be ideal for modeling ARM processors because it first maps ARM instructions into micro-operations, similar to modern Intel x86 implementations. Therefore, we select two configurations of SimpleScalar ARM to compare with our MIPS’ micro-architecture simulation. The first uses the default configuration for ARM, which we call DEFAULT. The second tries to make the simulated ARM compatible with our MIPS’ configuration, which we call COMPATIBLE. The configuration used for our target MIPS’ is single issue, in-order execution, with separate 32KB I-cache and 32KB D-cache, and a 35-cycle cache miss latency.

The measured cycle count ratios are shown in Figure 9. The DEFAULT ARM configuration yields an average CPI close to 1.45 for all three EEMBC benchmarks. Our simulated MIPS’ has an average CPI of 1.6. With the COMPATIBLE setting, which forces the SimpleScalar ARM to issue one micro-operation per cycle, SimpleScalar ARM yields an average CPI close to 1.7. The total execution cycle ratio of MIPS’/DEFAULT becomes 1.31 (EEMBC-base), 1.53 (EEMBC-speed), and 1.53 (EEMBC-space). The total execution cycle ratio of MIPS’/COMPATIBLE becomes 1.10 (EEMBC-base), 1.30 (EEMBC-speed), and 1.31 (EEMBC-space).
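Since total cycles equal executed instructions times CPI, the cycle ratio can be approximated as the instruction ratio from Section 5.7 multiplied by the ratio of the two average CPIs. The sketch below performs that back-of-the-envelope check; the function name `cycle_ratio` is an assumption, and because the CPI values quoted are rounded suite-wide averages, the estimates should only be expected to land near the measured ratios (they come out a few hundredths higher).

```python
# Final instruction ratios from Section 5.7 and the average CPIs quoted above.
INSTR_RATIO = {"EEMBC-base": 1.21, "EEMBC-speed": 1.43, "EEMBC-space": 1.42}
CPI_MIPS = 1.6
CPI_ARM = {"DEFAULT": 1.45, "COMPATIBLE": 1.7}


def cycle_ratio(instr_ratio, cpi_target, cpi_source):
    """(target instructions * target CPI) / (source instructions * source CPI)."""
    return instr_ratio * cpi_target / cpi_source


for config, cpi_arm in CPI_ARM.items():
    for name, ratio in INSTR_RATIO.items():
        estimate = cycle_ratio(ratio, CPI_MIPS, cpi_arm)
        print(f"{config:10s} {name:12s} {estimate:.2f}")
```

The calculation also makes the effect of the COMPATIBLE setting visible: raising the ARM CPI above the MIPS’ CPI turns the CPI factor from a penalty into a small advantage for the translated code.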

5.10 Discussion on predicated execution and conditional branches

In the EEMBC benchmark, there is one program for which the translated code executes fewer instructions than the original ARM program. It is rotate01_lite in EEMBC-base, where the ratio of executed MIPS’ instructions to ARM instructions is 0.94. The lower-than-1 ratio is due to the combined conditional branch transformation: the frequent use of conditional branches in this program gives our binary translator such an opportunity.

However, we notice that there could be other opportunities for our translator to yield
