
DSP Implementation Environment

3.5 Code Optimization on TI DSP Platform

In this section, we describe several methods that can accelerate our code and reduce the execution time on the C64x DSP. First, we introduce two techniques that can be used to analyze the performance of specific code regions:

• Use the clock( ) and printf( ) functions in C/C++ to time and display the performance of specific code regions. Use the stand-alone simulator (load6x) to run the code for this purpose.

• Use the profile mode of the stand-alone simulator. This can be done by compiling the code with the -mg option and executing load6x with the -g option. Then enable the clock and use profile points and the RUN command in the Code Composer debugger to track the number of CPU clock cycles consumed by a particular section of code. Use “View Statistics” to view the number of cycles consumed.

Figure 3.4: Code development flow for C6000 [17].

Usually, we use the second technique above to analyze the C code performance. The feedback of the optimization result can be obtained with the -mw option. It shows some important results of the assembly optimizer for each code section. We take these results into consideration in improving the computational speed of certain loops in our program.
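The first technique above, timing a code region with clock( ) and printf( ), can be sketched as follows. This is a minimal illustration rather than the code used in this work: the workload sum_of_products and the helper time_region are hypothetical names introduced here, and the value returned by clock( ) is in whatever tick unit the run-time library provides (on the stand-alone simulator this is expected to correspond to CPU cycles, but that should be checked against the tool documentation).

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical workload standing in for the code region being measured. */
static long sum_of_products(const short *a, const short *b, int n)
{
    long acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (long)a[i] * b[i];
    return acc;
}

/* Time one call of the region with clock() and report it with printf().
   Returns the elapsed ticks (minus the measured clock() overhead, clamped
   at zero) and stores the region's result in *result. */
static clock_t time_region(const short *a, const short *b, int n, long *result)
{
    clock_t start, stop, overhead, elapsed;

    assert(a != NULL && b != NULL && result != NULL && n >= 0);

    /* Two back-to-back calls estimate the cost of clock() itself. */
    start = clock();
    stop = clock();
    overhead = stop - start;

    start = clock();
    *result = sum_of_products(a, b, n);
    stop = clock();

    elapsed = stop - start;
    elapsed = (elapsed > overhead) ? (elapsed - overhead) : 0;

    printf("region: %ld ticks (clock overhead: %ld ticks)\n",
           (long)elapsed, (long)overhead);
    return elapsed;
}
```

A caller would pass the two input arrays and a pointer for the result, for example time_region(a, b, n, &r), and read the tick count from the printed line or the return value.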

3.5.1 Compiler Optimization Options [17]

In this subsection, we introduce the compiler options that control the operation of the compiler. The CCS compiler offers high-level language support by transforming C/C++ code into more efficient assembly language source code. The compiler options can be used to optimize either the code size or the execution performance.

The major compiler options we use are -o3, -k, -pm -op2, -mh<n>, -mw, and -mi.

• -on: The “n” denotes the level of optimization (0, 1, 2, and 3), which controls the type and degree of optimization.

– -o3: highest level optimization, whose main features are:

∗ Performs software pipelining.

∗ Performs loop optimizations and loop unrolling.

∗ Removes all functions that are never called.

∗ Reorders function declarations so that the attributes of called functions are known when the caller is optimized.

∗ Propagates arguments into function bodies when all calls pass the same value in the same argument position.

∗ Identifies file-level variable characteristics.

• -k: Keep the assembly file to analyze the compiler feedback.

• -pm -op2: In the CCS compiler option, -pm and -op2 are combined into one option.

– -pm: Gives the compiler global access to the whole program or module and allows it to be more aggressive in ruling out dependencies.

– -op2: Specifies that the module contains no functions or variables that are called or modified from outside the source code provided to the compiler. This improves variable analysis and widens the assumptions the compiler is allowed to make.

• -mh<n>: Allows speculative execution. The appropriate amount of padding, n, must be available in data memory to ensure correct execution. This is normally not a problem, but the requirement must be adhered to.

• -mw: Produce additional compiler feedback. This option has no performance or code size impact.

• -mi: Describes the interrupt threshold to the compiler. If the compiler knows that no interrupts will occur in the code, it can avoid enabling and disabling interrupts before and after software-pipelined loops for improvement in code size and performance. In addition, there is potential for performance improvement where interrupt registers may be utilized in high register pressure loops.

Figure 3.5: Software-pipelined loop [17].

3.5.2 Software Pipelining [18]

Software pipelining is a technique used to schedule instructions from a loop so that multiple iterations of the loop execute in parallel. This is the most important feature we rely on to speed up our system. The compiler always attempts to software-pipeline. Fig. 3.5 illustrates a software-pipelined loop. The stages of the loop are represented by A, B, C, D, and E. In this figure, a maximum of five iterations of the loop can execute at one time. The shaded area represents the loop kernel. In the loop kernel, all five stages execute in parallel.

The area above the kernel is known as the pipelined loop prolog, and the area below the kernel the pipelined loop epilog.
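The prolog, kernel, and epilog structure described above can be made concrete with a small scheduling model. The sketch below is only an illustration under simplifying assumptions that are not taken from [17] or [18]: each of the five stages A to E takes exactly one cycle, and a new iteration starts every cycle. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_STAGES 5   /* stages A..E, as in Fig. 3.5 */

/* Number of loop iterations in flight during a given cycle, assuming each
   stage takes one cycle and a new iteration starts every cycle. */
static int iterations_in_flight(int cycle, int iterations)
{
    int i, count = 0;
    for (i = 0; i < iterations; i++) {
        int stage = cycle - i;      /* stage index of iteration i in this cycle */
        if (stage >= 0 && stage < NUM_STAGES)
            count++;
    }
    return count;
}

/* Classify a cycle as prolog (pipeline filling), kernel (all stages busy),
   or epilog (pipeline draining). */
static const char *phase_of(int cycle, int iterations)
{
    assert(iterations >= NUM_STAGES);
    assert(cycle >= 0 && cycle < iterations + NUM_STAGES - 1);
    if (cycle < NUM_STAGES - 1)
        return "prolog";
    if (cycle < iterations)
        return "kernel";
    return "epilog";
}

/* Print one row per cycle listing the active stage/iteration pairs,
   where "C1" means stage C of iteration 1. */
static void print_schedule(int iterations)
{
    int cycle, i;
    for (cycle = 0; cycle < iterations + NUM_STAGES - 1; cycle++) {
        printf("cycle %2d:", cycle);
        for (i = 0; i < iterations; i++) {
            int stage = cycle - i;
            if (stage >= 0 && stage < NUM_STAGES)
                printf(" %c%d", 'A' + stage, i + 1);
        }
        printf("  (%s)\n", phase_of(cycle, iterations));
    }
}
```

Calling print_schedule(8), for instance, lists 12 cycles in this model: four prolog cycles, four kernel cycles in which all five stages are active, and four epilog cycles.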

However, under the conditions listed below, the compiler does not software-pipeline a loop [17]:

• If a register value lives too long, the code is not software-pipelined.

• If a loop has complex condition code within the body that requires more than five condition registers, the loop is not software-pipelined.

• A software-pipelined loop cannot contain function calls, including code that calls the run-time support routines.

• In a sequence of nested loops, the innermost loop is the only one that can be software-pipelined.

• If a loop contains a conditional break, it is not software-pipelined.
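As an illustration of the last condition, a loop with a conditional break can sometimes be rewritten so that it always runs a fixed number of iterations. The sketch below uses hypothetical function names and is not taken from the implementation described in this work. Whether the compiler actually software-pipelines the second form has to be confirmed from the compiler feedback, and the rewrite gives up the early exit, so it only pays off when the pipelined loop is faster overall.

```c
#include <assert.h>
#include <stddef.h>

/* Version with a conditional break: returns the index of the first sample
   greater than `limit`, or n if there is none. */
static int first_above_break(const short *x, int n, short limit)
{
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] > limit)
            break;                  /* early exit inside the loop body */
    }
    return i;
}

/* Same result without a break: the loop always runs n iterations.
   Scanning from the end means the last update of `first` is the smallest
   matching index. */
static int first_above_nobreak(const short *x, int n, short limit)
{
    int i, first = n;
    assert(x != NULL || n == 0);
    for (i = n - 1; i >= 0; i--) {
        if (x[i] > limit)
            first = i;              /* simple conditional assignment */
    }
    return first;
}
```

Both functions return the same index for the same input; only the control flow inside the loop differs.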

In general, we should maximize the number of loops that satisfy the requirements for software pipelining, since it is a very important optimization technique. The software pipelining information is written to the .asm file that the compiler generates when the -mw option is used. To view this information, we must also enable the -k option, which retains the .asm output of the compiler.

3.5.3 Macros and Intrinsic Functions [17]

Because a software-pipelined loop cannot contain function calls, a loop that calls a function is not pipelined and takes more clock cycles to complete. Replacing such functions with macros is, under some conditions, a good way to optimize. In addition, replacing functions with macros eliminates the code for the function definition and reduces the number of branches. However, macros are expanded at every point where they are used, so they increase the code size.
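The following sketch contrasts the two forms on a small, hypothetical example (a sum of absolute differences); the names are introduced here for illustration only. Note that an optimizing compiler may inline a small function like this on its own, so the benefit of the macro form should be checked in the compiler feedback rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Function form: the loop below contains a call unless the compiler
   inlines it. */
static int abs_diff_fn(int a, int b)
{
    return (a > b) ? (a - b) : (b - a);
}

/* Macro form: expanded in place, so the loop body contains no call.
   Each argument is evaluated more than once, so arguments must not have
   side effects (e.g. do not pass i++). */
#define ABS_DIFF(a, b)  (((a) > (b)) ? ((a) - (b)) : ((b) - (a)))

static long sad_with_function(const short *x, const short *y, int n)
{
    long acc = 0;
    int i;
    assert((x != NULL && y != NULL) || n == 0);
    for (i = 0; i < n; i++)
        acc += abs_diff_fn(x[i], y[i]);
    return acc;
}

static long sad_with_macro(const short *x, const short *y, int n)
{
    long acc = 0;
    int i;
    assert((x != NULL && y != NULL) || n == 0);
    for (i = 0; i < n; i++)
        acc += ABS_DIFF(x[i], y[i]);
    return acc;
}
```

Both loops compute the same value; the macro version trades a larger expanded body at each use for a call-free loop.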

The TI C6000 compiler provides many special functions that map C code directly to in-lined C64x instructions, which increases C code efficiency. These special functions are called intrinsic functions. If an operation has an equivalent intrinsic function, we can replace it with that intrinsic and decrease the execution time. We will introduce how to use the intrinsic functions in Chapter 4.
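As a small preview, the sketch below wraps a saturated 32-bit addition so that it uses the C6000 intrinsic _sadd when built with the TI compiler and a plain C fallback elsewhere. It is an assumption-laden illustration, not the code of Chapter 4: the macro SAT_ADD32 and the helper functions are hypothetical names, the build is assumed to define _TMS320C6X and provide c6x.h under the TI tools, and int is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#ifdef _TMS320C6X
/* Assumed TI build: use the intrinsic, which maps to a single
   saturating-add instruction. */
#include <c6x.h>
#define SAT_ADD32(a, b)  _sadd((a), (b))
#else
/* Portable fallback for other compilers: same result, written in plain C
   with explicit range checks instead of one instruction. */
static int sat_add32_portable(int a, int b)
{
    if (b > 0 && a > INT_MAX - b)
        return INT_MAX;             /* positive overflow saturates */
    if (b < 0 && a < INT_MIN - b)
        return INT_MIN;             /* negative overflow saturates */
    return a + b;
}
#define SAT_ADD32(a, b)  sat_add32_portable((a), (b))
#endif

/* Accumulate n samples with saturation instead of wrap-around. */
static int accumulate_saturated(const int *x, int n)
{
    int i, acc = 0;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        acc = SAT_ADD32(acc, x[i]);
    return acc;
}
```

The wrapper keeps the same source usable on a host machine for testing while still selecting the intrinsic on the DSP build.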

