
Chapter 3 Code Optimization Overview

3.6 Other Optimizations

The optimization methods introduced in the previous sections are part of the code generation components, which work on the ARM IR. Optimizations at code generation time are rather limited because they have no knowledge of instructions that have not yet been emitted. For example, the redundant condition code check elimination and the inverse condition check optimization are limited to consecutive instructions. More powerful optimizations can be implemented after the target IRs are generated. For example, classical local optimizations such as DCE (Dead Code Elimination), CSE (Common Sub-expression Elimination), CF (Constant Folding), and CP (Constant Propagation and Copy Propagation) can be applied within a larger scope.

Adding this phase of optimization may serve two purposes: 1) There may be optimization opportunities missed by the ARM compiler (which could be compiled with no optimizations) and 2) There may be new opportunities introduced during our binary code translation.

We have implemented a local optimization phase which identifies and reports on opportunities existing in the target IRs. This phase also estimates the potential performance gain in terms of the number of instructions eliminated. This optimization phase is iterative. Based on profile analysis, we collect code patterns that may be optimized. These patterns are given to the local optimization phase, which identifies and reports their occurrences across the complete set of benchmarks. Based on the estimated performance gain, we set priorities on which transformations to implement first. The performance estimation of these local optimizations is discussed in Chapter 5.
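As a rough illustration of how such a phase might count opportunities, the sketch below runs constant propagation, constant folding, and dead code elimination over one basic block of a toy three-address IR and reports how many instructions could be removed. The IR format, the opcode names (`li`, `mov`, `add`), and the function `estimate_local_opts` are hypothetical and are not the data structures of the translator described here. The sketch also assumes every instruction is free of side effects.

```python
from typing import List, Set, Tuple, Union

Operand = Union[int, str, None]            # constant, register name, or unused
Instr = Tuple[str, str, Operand, Operand]  # (opcode, dest, src1, src2)


def estimate_local_opts(block: List[Instr], live_out: Set[str]) -> Tuple[List[Instr], int]:
    """Return (optimized block, number of instructions eliminated).

    Assumes every instruction is pure (no stores, no flag side effects).
    """
    consts = {}                    # register -> known constant value
    folded: List[Instr] = []
    for op, dest, a, b in block:
        # Constant propagation: look up operands that hold known constants.
        va = consts.get(a, a) if isinstance(a, str) else a
        vb = consts.get(b, b) if isinstance(b, str) else b
        if op == "li":
            consts[dest] = a
        elif op == "mov" and isinstance(va, int):
            op, a, b = "li", va, None
            consts[dest] = va
        elif op == "add" and isinstance(va, int) and isinstance(vb, int):
            # Constant folding with 32-bit wraparound.
            op, a, b = "li", (va + vb) & 0xFFFFFFFF, None
            consts[dest] = a
        else:
            consts.pop(dest, None)   # dest no longer holds a known constant
        folded.append((op, dest, a, b))

    # Dead code elimination: walk backwards, keep only results that are read.
    live = set(live_out)
    kept: List[Instr] = []
    for op, dest, a, b in reversed(folded):
        if dest not in live:
            continue
        live.discard(dest)
        live.update(x for x in (a, b) if isinstance(x, str))
        kept.append((op, dest, a, b))
    kept.reverse()
    return kept, len(block) - len(kept)


block = [
    ("li", "t0", 4, None),
    ("li", "t1", 8, None),
    ("add", "t2", "t0", "t1"),
    ("mov", "t3", "t2", None),
    ("add", "v0", "t3", "a0"),
]
optimized, eliminated = estimate_local_opts(block, live_out={"v0"})
print(eliminated, "of", len(block), "instructions could be eliminated")
```

In this example the three instructions that compute `t0`, `t1`, and `t2` become dead once the constant 12 is folded into `t3`, so the estimate is 3 of 5 instructions.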

Chapter 4 Simulation Environment

The target platform and the tool chain for our binary translation system are still under development. The processor chip has been under sampling and some test boards are available now. However, the evaluation of the performance of our binary translator is carried out with simulations where we could have hooks to collect more detailed profiles on benchmark execution.

4.1 Simulators

We have conducted both functional and performance simulations. For functional simulation, we have a MIPS’ simulator ported from SID [15]. SID is a simulator that supports multiple ISAs, and also provides support for testing, validation and debugging.

We use a modified GDB 6.3 [16] to verify ARM binaries and collect profiles.

For performance simulation with micro-architecture details, we use the SimpleScalar ARM to measure the executed cycles. We also have a MIPS’ micro-architecture simulator to measure the performance of MIPS’ code. The configurations we used for ARM and MIPS’ are similar: for example, we assume a single issue, in-order processor with 32KB of I-cache and 32KB of D-cache. The MIPS’ has a deeper pipeline than the SimpleScalar ARM, but we assume both have the same clock frequency.

4.2 Benchmarks

The benchmark we use is the EEMBC [17] benchmark suite version 1.1. The EEMBC benchmark is commonly used by embedded system developers to tune their hardware designs and software tool chains. Its 55 programs are divided into six categories: 8-16 bit, automotive, consumer, networking, office, and telecom. The EEMBC benchmark can be compiled as normal versions or as lite versions. To speed up our simulations, we use the lite versions.

The ARM compiler we used to compile EEMBC is GCC 3.4.3 with static linking.

To test our static translation comprehensively, we compiled the EEMBC benchmark with 3 different options: EEMBC-base, EEMBC-speed, and EEMBC-space. EEMBC-base is compiled with option “-O0”, EEMBC-speed is compiled with option “-O2”, and EEMBC-space is compiled with “-Os”. Since our translator does not translate the Thumb instruction set, we did not create versions with Thumb instructions.

Chapter 5 Experimental results

In this Chapter we evaluate the performance of the optimizations discussed in Chapter 3.

5.1 Baseline code generation

The performance improvements from each optimization discussed in Chapter 3 are presented in Figure 7. The baseline used for comparison is the code translated with the basic translations described in Chapter 2. The performance is measured by the ARM to MIPS’ execution ratio, that is, the ratio of the dynamic number of executed MIPS’ instructions to the number of executed ARM instructions. For example, the baseline translation, labeled “BASELINE”, has a ratio of 2.58 for EEMBC-base, 3.62 for EEMBC-speed, and 3.6 for EEMBC-space. A ratio of 2.58 means that, on average, each ARM instruction takes 2.58 translated MIPS’ instructions to execute. Optimized ARM binaries tend to have a higher translation ratio because compiler optimizations eliminate many simple instructions, such as copy operations, which can often be translated into a single target instruction. The baseline ratio of 2.58 is roughly what we expected based on past experience with binary translation of various general-purpose architectures.

The performance bar of each optimization is tagged with a name. For example, the bar for the optimization that eliminates redundant condition checks is labeled “CHECK”, and the bar for the optimization that eliminates unnecessary flag updates is labeled “UPDATE”. The optimizations are implemented in order, so the performance gain is cumulative. For example, the performance gain of “UPDATE” includes the gain from “CHECK”, and “REG MAPPING” includes the gain contributed by all other optimizations.

5.2 Selectively check condition code

As shown in Table 3, a large fraction of instructions are conditional: they must check the condition code to determine whether execution is needed. Such instructions account for nearly 20% of the instructions in both EEMBC-speed and EEMBC-space. This indicates that eliminating redundant condition checks may have good potential for performance improvement when translating ARM binaries to other RISC architectures without predicated execution. Although it may be interesting to translate ARM instructions to the Itanium architecture, which does have predicated instructions, there is no practical need to do so because Itanium is not designed for embedded systems.

The bars under the name “CHECK” in Figure 7 are results from applying the redundant condition check elimination. Compared with the baseline translation, this optimization yields no gains for EEMBC-base, and small gains for both EEMBC-speed (from 3.62 to 3.57) and EEMBC-space (from 3.6 to 3.55). This seemingly low performance gain indicates that although conditional execution is frequent in ARM code, there are not many consecutive instructions using the same condition in the compiled EEMBC code. The frequency of using the same condition increased slightly when ARM binaries are optimized (i.e. in EEMBC-speed and EEMBC-space).
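The effect of sharing one check among consecutive instructions can be illustrated with a small counting model. This is a minimal sketch under stated assumptions: the `ArmInstr` record and the function `count_check_branches` are hypothetical, each conditional ARM instruction is assumed to be guarded by one skip branch in the translated code, and a run of same-condition instructions is assumed to end at any unconditional instruction or at any instruction that updates the flags.

```python
from typing import List, NamedTuple


class ArmInstr(NamedTuple):
    cond: str                 # ARM condition suffix; "al" means always
    text: str                 # assembly text, for readability only
    sets_flags: bool = False  # True if the instruction updates the flags


def count_check_branches(code: List[ArmInstr], merge_runs: bool) -> int:
    """Count the skip branches needed to guard conditional instructions."""
    branches = 0
    guarded = None   # condition already covered by the currently open branch
    for ins in code:
        if ins.cond == "al":
            guarded = None            # an unconditional instruction ends the run
        else:
            if not (merge_runs and ins.cond == guarded):
                branches += 1         # a new condition check is required
            guarded = ins.cond
        if ins.sets_flags:
            guarded = None            # flags changed, so the check must be redone
    return branches


code = [
    ArmInstr("al", "cmp r0, #0", sets_flags=True),
    ArmInstr("ne", "addne r1, r1, #1"),
    ArmInstr("ne", "movne r2, r1"),
    ArmInstr("eq", "moveq r2, #0"),
    ArmInstr("al", "add r3, r3, r2"),
]
print("without merging:", count_check_branches(code, merge_runs=False))
print("with merging:   ", count_check_branches(code, merge_runs=True))
```

For this fragment the count drops from three branches to two. The saving depends entirely on how often equal conditions appear back to back, which is the property the measurements above suggest is uncommon in the compiled EEMBC code.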

However, there is a different way to conduct redundant condition check elimination, which would require more complex data flow analysis and incur a much higher transformation cost. Notice that there may be instructions that have the same condition as their execution predicate but are not next to each other. We can use a separate phase to search for such instructions and group them together. For such a group, a single branch could skip the entire group.

This is different from translating predicated instructions. For architectures with predicates, the analysis to determine if two instructions are under the same condition is easier – just check if the common predicate is updated between the two instructions. To determine whether two non-consecutive instructions are under the same execution conditions is a little more complex since multiple condition bits are involved. We are currently evaluating the potential for such an optimization.

5.3 Selectively update condition flag

Table 4 shows the percentage of instructions that may update condition flags in the original ARM program. In EEMBC-base, instructions that update condition flags are almost as frequent as instructions that check condition codes. For EEMBC-speed and EEMBC-space, flag-update instructions are somewhat less frequent than conditional execution instructions.

The performance result of applying redundant condition update elimination is shown by the bars labeled “UPDATE” in Figure 7. The ratio for EEMBC-base decreases from 2.58 to 1.87, a 38% performance improvement. The other two benchmarks improve even more: the ratio drops from 3.57 to 2.48 for EEMBC-speed, a 44% performance gain, and from 3.55 to 2.44 for EEMBC-space, a 46% performance gain. Redundant flag update elimination is the most significant optimization. When translating ARM binaries to other embedded architectures, it should be the first optimization to consider.

Although flag updating seems to be about as frequent as condition checking, the cost of flag updating is higher, so the optimization yields a higher return.
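The percentage gains quoted in this chapter appear to be computed as the ratio before the optimization divided by the ratio after it, minus one. This formula is inferred from the reported numbers and is an assumption, not something stated in the text. The short sketch below applies it to the “UPDATE” figures; the function name `gain_percent` is hypothetical.

```python
def gain_percent(ratio_before: float, ratio_after: float) -> float:
    """Relative gain, in percent, when the instruction ratio drops."""
    return (ratio_before / ratio_after - 1.0) * 100.0


# (benchmark, ratio before UPDATE, ratio after UPDATE), as reported above
update_results = [
    ("EEMBC-base", 2.58, 1.87),
    ("EEMBC-speed", 3.57, 2.48),
    ("EEMBC-space", 3.55, 2.44),
]
for name, before, after in update_results:
    print(f"{name}: {gain_percent(before, after):.1f}%")
```

With the rounded ratios given in the text, this yields values close to the reported 38%, 44%, and 46%. Small differences are to be expected because the published ratios are themselves rounded to two decimal places.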

5.4 Combined conditional branch

Combining condition code checking with a branch into a single compare-and-branch is a more interesting optimization discussed here. Most of the other optimizations eliminate redundancies introduced by binary code translation, but this combined conditional branch transformation not only eliminates redundancies but also compresses multiple instructions into one instruction. It gives our translator a chance (although small) to reduce the number of translated instructions executed to be even less than the number of source instructions.
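The idea can be sketched as a peephole pass over a toy instruction list. This is a minimal sketch and not the translator's actual code generator: the tuple-based instruction format and the function `combine_cmp_branch` are hypothetical, only the equality and inequality branches are fused (MIPS provides two-register `beq` and `bne`), and the parameter `flags_dead_after_branch` stands in for a liveness analysis that would have to confirm no later instruction reads the flags set by the compare.

```python
from typing import List, Tuple

Instr = Tuple[str, ...]   # e.g. ("cmp", "r0", "r1") or ("bne", "loop")

# ARM conditional branch -> target compare-and-branch opcode (assumed mapping)
FUSABLE = {"beq": "beq", "bne": "bne"}


def combine_cmp_branch(code: List[Instr], flags_dead_after_branch: bool = True) -> List[Instr]:
    """Fuse a compare followed by beq/bne into one compare-and-branch."""
    out: List[Instr] = []
    i = 0
    while i < len(code):
        ins = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if (flags_dead_after_branch
                and ins[0] == "cmp"
                and nxt is not None
                and nxt[0] in FUSABLE):
            # cmp ra, rb ; bcc label  ->  bcc ra, rb, label
            out.append((FUSABLE[nxt[0]], ins[1], ins[2], nxt[1]))
            i += 2
        else:
            out.append(ins)
            i += 1
    return out


source = [
    ("add", "r2", "r2", "r3"),
    ("cmp", "r0", "r1"),
    ("bne", "loop"),
    ("mov", "r4", "r2"),
]
for ins in combine_cmp_branch(source):
    print(ins)
```

Here two source instructions collapse into one target instruction, which is how the translated code can, in favorable cases, execute fewer instructions than the source.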

The performance result of the combined conditional branch is shown by the bars labeled “CCB” in Figure 7. All three benchmarks improve substantially.

Unlike the previous two optimizations, this optimization improves EEMBC-base more than the other two. The ratio for EEMBC-base decreases from 1.87 to 1.47, a 27% performance gain. For EEMBC-speed and EEMBC-space, the improvement is about 19%.

5.5 Special condition flag

Special condition flag optimization combines multiple condition updates and checks into one condition update and check, thus saving the instruction overhead of flag updates and condition checks. In Figure 7, EEMBC-base improves only slightly from this optimization (from 1.46 to 1.39, a gain of about 5%), while EEMBC-speed and EEMBC-space see much greater speedups. The improvement is about 18% for EEMBC-speed (from 2.08 to 1.77) and 14% for EEMBC-space (from 2.08 to 1.82).

5.6 Check inverse condition code selectively

The check inverse condition code optimization has a minor impact on performance. All three benchmarks gain 2-3% from this optimization. This is no surprise because Section 5.2 indicates that even the same-condition optimization does not yield notable performance gains. The approach that groups instructions with the same execution conditions (i.e. predicates) and uses one branch for each group (as discussed in Section 5.2) can also be applied here to enhance the inverse condition check optimization.

5.7 Register remapping

In the initial design, we allocated the four condition flags in one register, that is, a mapped register for CPSR. The flag checking and updating operations are carried out just like the ARM architecture. After learning the importance of flag emulation, we decided to keep each flag in a separate register to avoid instruction overhead to fetch/store the flag from/to the flag register.

In Figure 7, the bars with name “REGMAPPING” show the performance of this optimization. For EEMBC-base, the gain is about 12% (from 1.36 to 1.21), and the gain is more significant for the optimized ARM binaries. EEMBC-speed gains 21% (from 1.73 to 1.43) and EEMBC-space gains 25% (from 1.77 to 1.42).
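The overhead of the packed layout can be made concrete with a small functional model. The N, Z, C, and V bit positions (31 down to 28) are those of the ARM CPSR. The helper names and the dictionary used to model separate flag registers are hypothetical, and the sketch only shows why every access to a packed flag costs extra shift and mask operations.

```python
N_BIT, Z_BIT, C_BIT, V_BIT = 31, 30, 29, 28   # flag positions in the ARM CPSR
MASK32 = 0xFFFFFFFF


def packed_read(cpsr: int, bit: int) -> int:
    """Read one flag from a packed CPSR image: one shift and one mask."""
    return (cpsr >> bit) & 1


def packed_write(cpsr: int, bit: int, value: int) -> int:
    """Write one flag into a packed CPSR image: clear, shift, then merge."""
    cleared = cpsr & ~(1 << bit) & MASK32
    return cleared | ((value & 1) << bit)


# Packed layout: every flag access goes through shift/mask helpers.
cpsr = 0
cpsr = packed_write(cpsr, Z_BIT, 1)
z_packed = packed_read(cpsr, Z_BIT)

# Separate layout: each flag lives in its own (modeled) register,
# so an update or a check is a single move or a single test.
flags = {"n": 0, "z": 0, "c": 0, "v": 0}
flags["z"] = 1
z_separate = flags["z"]

print(hex(cpsr), z_packed, z_separate)
```

Both layouts hold the same information. The separate layout trades four target registers for the removal of the shift, mask, and merge operations on each flag access.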

With all of the above optimizations, the translated code runs at a ratio of 1.21 for EEMBC-base, 1.43 for EEMBC-speed, and 1.42 for EEMBC-space. The average ratio of the three benchmarks is 1.35. It is generally considered very cost-effective to get many applications ready for a new platform with only 35% of instruction path length overhead.


Figure 7. Performance improvement from various optimizations. (Y-axis shows the executed instruction count ratio.)

5.8 Local optimization estimation

As discussed in Section 3.6, local optimizations such as DCE, CSE, CF, and CP may be applied to the target architecture IRs after code generation from the ARM IRs. This exploits redundancy elimination opportunities introduced by the code translator. We implemented a local optimizer to identify such opportunities and estimate the potential contribution of such optimizations. Figure 8 shows the estimation: adding local optimizations may eliminate an additional 5% of target instructions.

Figure 8: Estimation of the potential of local optimization (Blue bars: before local optimization; Red bars: after local optimization)

5.9 Cycle count of the benchmark

It might be unfair to compare the performance of translated code based merely on the instruction path length. For modern embedded processors, the micro-architecture also plays a very important role. In Figure 9, we compare the performance of ARM code and translated code using simulated execution cycles. As mentioned in Section 4.1, we use SimpleScalar ARM and the MIPS’ SID for cycle simulations. Although the SimpleScalar ARM simulator is easily accessible, its micro-architecture may not be ideal for implementing ARM processors because it first maps ARM instructions into micro-operations, similar to modern Intel x86 implementations. Therefore, we select two configurations of SimpleScalar ARM to compare with our MIPS’ micro-architecture simulation. The first uses the default configuration for ARM, which we call DEFAULT. The second tries to make the simulated ARM compatible with our MIPS’ configuration, which we call COMPATIBLE. The configuration we used for our target MIPS’ is single issue, in-order execution, with separate 32KB I-cache and 32KB D-cache, and a 35-cycle cache miss latency.

The measured cycle count ratios are shown in Figure 9. The DEFAULT ARM configuration yields an average CPI close to 1.45 for all three EEMBC benchmarks. Our simulated MIPS’ has an average CPI of 1.6. With the COMPATIBLE setting, which forces the SimpleScalar ARM to issue one micro-operation per cycle, the SimpleScalar ARM yields an average CPI close to 1.7. The total execution cycle ratio of MIPS’/DEFAULT becomes 1.31 (EEMBC-base), 1.53 (EEMBC-speed), and 1.53 (EEMBC-space). The total execution cycle ratio of MIPS’/COMPATIBLE becomes 1.10 (EEMBC-base), 1.30 (EEMBC-speed), and 1.31 (EEMBC-space).
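The relation between the instruction count ratio and the cycle count ratio follows from cycles = instructions × CPI. The sketch below applies that identity to the average figures quoted in this chapter. The function name `cycle_ratio` is hypothetical, and because it uses suite-wide average CPIs rather than per-benchmark values, it only approximates the measured cycle ratios.

```python
def cycle_ratio(instr_ratio: float, cpi_target: float, cpi_source: float) -> float:
    """Cycle ratio = (target instrs * target CPI) / (source instrs * source CPI)."""
    return instr_ratio * cpi_target / cpi_source


AVG_INSTR_RATIO = 1.35   # average MIPS'/ARM instruction ratio after all optimizations
CPI_MIPS = 1.6           # average CPI of the simulated MIPS'
CPI_ARM_DEFAULT = 1.45   # average CPI, SimpleScalar ARM, DEFAULT configuration
CPI_ARM_COMPAT = 1.7     # average CPI, SimpleScalar ARM, COMPATIBLE configuration

print(f"MIPS'/DEFAULT:    {cycle_ratio(AVG_INSTR_RATIO, CPI_MIPS, CPI_ARM_DEFAULT):.2f}")
print(f"MIPS'/COMPATIBLE: {cycle_ratio(AVG_INSTR_RATIO, CPI_MIPS, CPI_ARM_COMPAT):.2f}")
```

The estimate shows the direction of the effect: a source configuration with a higher CPI narrows the cycle gap relative to the instruction count gap, while one with a lower CPI widens it.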

5.10 Discussion on predicated execution and conditional branches

In the EEMBC benchmark, there is one program for which the translated code executes fewer instructions than the original ARM program. It is rotate01_lite in the EEMBC-base benchmark, and its ARM/MIPS’ instruction ratio is 0.94. The lower-than-1 ratio is due to the combined conditional branch transformation. The frequent use of conditional branches in this program gives our binary translator this opportunity.

However, we notice that there could be other opportunities for our translator to yield a lower-than-1 execution ratio for more programs. This is the case we mentioned in Section 5.2. A conditionally executed instruction is translated into a branch and a regular instruction, where the branch may skip the regular instruction. Although we may skip the regular instruction, we still have to execute the branch instruction, so the chance of reducing the number of translated instructions is not obvious. Nevertheless, if we add a separate phase to group instructions with the same or reversed conditions together, we will have a greater opportunity to skip more instructions. In other words, programs with more “predicated false” instructions have greater potential for our translated code to achieve a less-than-1 execution ratio.

Figure 9: Comparison of the instruction count ratio and the cycle count ratio between the two architectures.

Predicated execution generally executes more instructions but may minimize the cost of branch mispredictions. Modern embedded processors may adopt deeper pipelines to achieve a higher clock rate (balanced against power considerations), which increases the cost of a branch misprediction. Our translated code does incur more branches, as shown in Table 7: on average, the translated code has 4% more conditional branches among the executed instructions.


Table 7: Percentage of conditional branches in ARM binary and the MIPS’ binary.

Another consideration is the performance of the memory subsystem. Since our translation must keep the original ARM code, the executable will be at least twice as large as the original ARM binary. This, however, may not have a significant impact on I-cache performance since most instruction accesses are to the translated code. The ARM code section is rarely referenced except for PC-relative data references and when exceptions occur and the code must trap to the runtime system to get help from the dynamic translator.

The primary purpose of a binary translation system is not performance; it is to make more applications available at the time a new architecture is introduced. However, the performance and power efficiency gap should not be too severe to make the new [...]

[...] section. Unfortunately, the EEMBC benchmark suite rarely uses the switch statement, and the improvement from using switch table lookup is nearly zero. To show the power of the switch table lookup, we designed a new test program.

The program we tested contains a switch statement in a loop, and the switch targets are varied when the switch statement is executed. The loop runs about one million times, which minimizes the impact of startup code. Table 8 shows the result after we translate the ARM jump table into MIPS’ table. On average, the switch table lookup approach successfully reduced the instruction path length from 3.07 to 2.09. Almost one-third of the instructions in the original application are eliminated. Apparently, the switch table lookup is an effective way to handle indirect branches caused by switches.
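One way to picture the transformation is shown below. This is a minimal sketch under stated assumptions and not the translator's implementation: the addresses, the `addr_map` dictionary standing in for the address mapping table, and all function names are hypothetical. It contrasts resolving every switch target through the mapping table at run time with translating the jump table once so that the dispatch becomes a single table load.

```python
from typing import Dict, List


def translate_jump_table(arm_table: List[int], addr_map: Dict[int, int]) -> List[int]:
    """Rewrite each ARM target address as its translated MIPS' address."""
    return [addr_map[arm_addr] for arm_addr in arm_table]


def dispatch_without_table(index: int, arm_table: List[int], addr_map: Dict[int, int]) -> int:
    """Load the ARM target, then look it up in the mapping table at run time."""
    arm_target = arm_table[index]
    return addr_map[arm_target]


def dispatch_with_table(index: int, mips_table: List[int]) -> int:
    """With a pre-translated table, the target is a single indexed load."""
    return mips_table[index]


# Hypothetical addresses, for illustration only.
addr_map = {0x8000: 0x400100, 0x8010: 0x400140, 0x8024: 0x4001A0}
arm_table = [0x8000, 0x8010, 0x8024, 0x8010]

mips_table = translate_jump_table(arm_table, addr_map)
for case in range(len(arm_table)):
    assert dispatch_with_table(case, mips_table) == dispatch_without_table(case, arm_table, addr_map)
print([hex(a) for a in mips_table])
```

Both dispatch paths reach the same targets. The saving comes from moving the address lookup out of the loop that executes the switch.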

I-Count Ratio w/o
Space   3.4     2.3     1.1
Avg.    3.071   2.085   0.986

Table 8. Instruction count ratio after applying switch table lookup

5.12 Comparison of code size

Code size is always an important issue for embedded systems, whether static or dynamic. A larger static code size requires more non-volatile storage. A larger dynamic code size demands more memory, which is usually very limited in embedded systems.

Table 9 shows the comparison of the code sizes in ELF format, and both the ARM executable and MIPS’ executable are already stripped. The code size ratios in Table 9 show that the MIPS’ executable needs 80% more storage than the ARM executable.

        ARM binary size   MIPS’ binary size   MIPS’/ARM ratio
Base    105422            183123              1.811

[...] newly generated code. The comparison of the sizes of the .text sections of the two programs is shown in Table 10. It shows that the MIPS’ .text section needs 70% more storage than the ARM .text section. The address mapping table section needs about 16KB to 32KB, which is about 11% of the average code size.

        ARM .text size   MIPS’ .text size   MIPS’/ARM
Base    44174            75073              1.714
Speed   38537            68252              1.771
Space   38376            68152              1.775

Table 10: A comparison of the .text section size of the source program and the target

Table 11 shows the total section sizes for both programs. The ratio of the total section sizes is about 2.6, which is higher than the ratio of the .text section sizes. This may be caused by differences in the sizes of other sections, which are compared in Table 12. The ratio between the non-text section size in the MIPS’ executable and the non-text section size in the ARM executable is about 7.7. This seems worrisome, but the non-text sections account for only 25% (Table 13) of the size of all sections in the ARM executable, so the impact is less severe.

Most of the non-text section space in the MIPS’ executable is used by the ARM.text section and the address mapping table section. Table 14 shows that the ARM.text section and the address mapping table section together account for about 78% of the total storage for non-text sections. We can cut the size of the non-text sections in half by simply removing the ARM.text section. Although not all the pages in these sections would be swapped into memory, the high section size ratio indicates that the MIPS’ program may incur more page faults at runtime. The difference between the executable sizes in Table 9 and the total section sizes in Table 11 shows that the ARM executable needs more space than the MIPS’ executable to store metadata such as file headers.
