
DSP Implementation Environment

3.5 Acceleration Rules

In this section, we describe several methods that can accelerate our code and reduce the execution time on the C64x DSP.

3.5.1 Compiler Optimization Options [27]

The compiler supports several options for optimizing code, targeting either code size or execution performance. Our primary concern in this work is execution performance. The easiest way to invoke optimization is to use the cl6x shell program, specifying the -on option on the cl6x command line, where n denotes the level of optimization (0, 1, 2, or 3) and controls the type and degree of optimization:

• -o0:

– Performs control-flow-graph simplification.

– Allocates variables to registers.

– Performs loop rotation.

– Eliminates unused code.

– Simplifies expressions and statements.

– Expands calls to functions declared inline.

• -o1. Performs all -o0 optimizations, and:

– Performs local copy/constant propagation.

– Removes unused assignments.

– Eliminates local common expressions.

• -o2. Performs all -o1 optimizations, and:

– Performs software pipelining.

– Performs loop optimizations.

– Eliminates global common subexpressions.

– Eliminates global unused assignments.

– Converts array references in loops to incremented pointer form.

– Performs loop unrolling.

• -o3. Performs all -o2 optimizations, and:

– Removes all functions that are never called.

– Simplifies functions with return values that are never used.

– Inlines calls to small functions.

– Reorders function declarations so that the attributes of called functions are known when the caller is optimized.

– Propagates arguments into function bodies when all calls pass the same value in the same argument position.

– Identifies file-level variable characteristics.

Table 3.2: Sizes of Different Data Types

Data type    char   short   int   long   float   double
Size (bits)     8      16    32     40      32       64
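As an illustration, a typical command line that compiles a source file at the highest optimization level might look as follows (the file name filter.c is a placeholder, not a file from this work):

```shell
# Compile with level-3 optimization; filter.c is a placeholder name.
cl6x -o3 filter.c
```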

3.5.2 Fixed–Point Coding

The C6000 compiler defines the size of each data type as listed in Table 3.2. The C64x DSP is a fixed-point processor, so it can only perform fixed-point operations natively. Although the C64x DSP can emulate floating-point processing in software, doing so takes many clock cycles. The "char", "short", "int", and "long" types are fixed-point data types, while "float" and "double" are floating-point data types.

3.5.3 Loop Unrolling

Loop unrolling replicates the body of a loop so that all (or several) iterations appear explicitly in the code. It often increases the number of instructions available to execute in parallel, and it combines well with software pipelining. When the code contains conditional instructions, the compiler may be unable to determine whether a branch will be taken, and waiting for the branch decision costs extra time. Loop unrolling removes some of this branching overhead. Fig. 3.9 gives an example of loop unrolling, and Table 3.3 compares the execution cycles and the code size with and without unrolling. The clock cycles clearly decrease after loop unrolling, but the code size increases.

Figure 3.9: Loop unrolling.

Table 3.3: Comparison of Code Before and After Unrolling

                  Before Unrolling   After Unrolling
Execution cycles        436                206
Code size               116                479

3.5.4 Packed Data Processing

Packed data processing means processing several data elements together in one instruction. For example, we may use a single load or store instruction to access multiple data elements that are located consecutively in memory, which enhances data throughput. This technique is also called the single-instruction-multiple-data (SIMD) method. For example, if we place four 8-bit data (char) or two 16-bit data (short) in a 32-bit word, we may do four or two operations in one clock cycle, which improves code efficiency substantially. Some intrinsic functions enhance the efficiency in a similar way. Fig. 3.10 shows an example that uses word access for adding short data.

Figure 3.10: The block diagram of SIMD.

3.5.5 Register and Memory Arrangement

Accessing external memory may take more clock cycles than accessing on-chip data. We can use registers to hold frequently used data in order to reduce the transfer time. The C compiler has a pre-defined way of placing different code segments (such as variable pointers, malloc spaces, and the program code) in the memory. We can set up the linker command (.cmd) file to allocate the memory for different types of data for efficient data reading and writing. The keywords "CODE_SECTION" and "DATA_SECTION" can be used to place the program code or data in the internal memory for greater execution speed.
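As a sketch of the TI-specific syntax (the symbol and section names here are placeholders chosen for illustration, and the fragment requires the TI C6000 compiler and a matching .cmd file):

```c
/* TI C6000 compiler syntax (not portable C): place a data
   buffer and a time-critical routine in named sections that
   the linker command (.cmd) file maps to internal (on-chip)
   memory.  "fast_data" and "fast_code" are example names. */
#pragma DATA_SECTION(delay_line, "fast_data")
short delay_line[256];

#pragma CODE_SECTION(fir_filter, "fast_code")
void fir_filter(const short *x, short *y, int n);
```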

3.5.6 Software Pipelining

Software pipelining is a technique that schedules instructions from a loop so that multiple iterations of the loop execute in parallel. The compiler always attempts to software-pipeline loops.

Fig. 3.11 illustrates a software-pipelined loop. The stages of the loop are represented by A, B, C, D, and E. In this figure, a maximum of five iterations of the loop can execute at one time. The shaded area represents the loop kernel, where all five stages execute in parallel. The area above the kernel is known as the pipelined loop prolog, and the area below it is known as the pipelined loop epilog.

Figure 3.11: Software-pipelined loop.

3.5.7 Macros and Intrinsic Functions

Because a software-pipelined loop cannot contain function calls, a loop that calls functions takes more clock cycles to complete. Converting functions to macros is, under some conditions, a good way to optimize code. In addition, replacing a function with a macro removes the call setup code and reduces the number of branches. However, a macro is expanded each time it is used, so macros increase the code size.

The TI C6000 compiler provides many special functions that map C code directly to inlined C64x instructions, which increases C code efficiency. These special functions are called intrinsic functions. If an operation has an equivalent intrinsic function, we can replace it with the intrinsic and the execution time decreases.

3.5.8 Other Acceleration Rules

Other code acceleration rules include reducing memory accesses, using bit shifts for multiplication or division by powers of two, declaring constant values as constants rather than variables, accessing memory sequentially, and minimizing the use of conditional breaks or complex condition code in loops.

Chapter 4

Simulation and DSP Implementation
