
DSP Implementation

4.2 System Performance

4.2.2 Efficiency Enhancement

4.2.2.1 Modulation Functions

In this section, we describe the techniques used to improve the performance of the modulation function. Fig. 4.5 shows a part of the original modulation program; we see that "if" and "else" statements are used to check the modulation type inside the outer "for" loop. This is inefficient because the modulation type does not change within one data block. In addition, the compiler cannot apply software pipelining to this coding style. Because the modulation can only be one of three types (QPSK, 16QAM, and 64QAM), we separate their handling into three sub-functions, as shown in Figs. 4.6 and 4.7. Table 4.8 compares the execution cycles before and after the modification. The compiler optimization information is shown in Fig. 4.8, and Fig. 4.9 shows a main section of the assembly code of the modulation function together with the corresponding C code.
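As a rough illustration of this restructuring, the sketch below (with illustrative names; it is not the thesis code) tests the modulation type once per block and lets each type be handled by its own branch-free loop that the compiler can software-pipeline:

    /* Sketch of the per-type dispatch; names and scaling are illustrative only. */
    typedef struct { short re, im; } cplx16;

    static void modulate_qpsk (const unsigned char *bits, cplx16 *out, int n);
    static void modulate_16qam(const unsigned char *bits, cplx16 *out, int n);
    static void modulate_64qam(const unsigned char *bits, cplx16 *out, int n);

    void modulate_block(int mod_type, const unsigned char *bits,
                        cplx16 *out, int n_symbols)
    {
        /* The modulation type is fixed for the whole data block, so it is
           tested once here instead of inside the per-symbol loop.          */
        if (mod_type == 0)       modulate_qpsk (bits, out, n_symbols);
        else if (mod_type == 1)  modulate_16qam(bits, out, n_symbols);
        else                     modulate_64qam(bits, out, n_symbols);
    }

    /* Each sub-function is a tight loop with no modulation-type branches. */
    static void modulate_qpsk(const unsigned char *bits, cplx16 *out, int n)
    {
        static const short level[2] = { 23170, -23170 };  /* ~1/sqrt(2) in Q15 */
        int i;
        for (i = 0; i < n; i++) {
            out[i].re = level[bits[2*i    ] & 1];
            out[i].im = level[bits[2*i + 1] & 1];
        }
    }

    static void modulate_16qam(const unsigned char *bits, cplx16 *out, int n)
    { /* analogous loop over 4 bits per symbol (omitted) */ (void)bits; (void)out; (void)n; }

    static void modulate_64qam(const unsigned char *bits, cplx16 *out, int n)
    { /* analogous loop over 6 bits per symbol (omitted) */ (void)bits; (void)out; (void)n; }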

Fig. 4.5: A part of the original modulation program.

Fig. 4.6: A part of the modified program in the modulation function.

Table 4.8: Comparison of the Modulation Function Before and After Optimization

              Original Code                   Revised Code                Improvement
        Cycles/Symbol  Cycles/Sample    Cycles/Symbol  Cycles/Sample
           188973         123.02            8310           5.41             95.60%

Fig. 4.7: The other part of the modified program in the modulation function.

Fig. 4.8: Compiler feedback of the modulation4 function.

4.2.2.2 Framing and De-framing Functions

In Table 4.7, the execution cycles of framing/de-framing seem extraordinarily large.

In this section, we analyze the reasons for the inefficiency of the original code and find ways to improve it through loop unrolling and software pipelining by the compiler.

First, we introduce the original code of the de-framing function and propose a better coding style. As shown in Fig. 4.10, the problem with the original code lies in the cycles wasted on the large number of "or" operations in the "if" statement in every iteration, as shown in the circle denoted "part 1." The same problem exists in the framing function. The proposed C code uses simple techniques to prevent this waste of cycles and does away with the modulo operation, as shown in the "part 1" code in Fig. 4.11.

Another modification of the de-framing function is made to "part 2" in Fig. 4.10 and results in "part 2" in Fig. 4.11. We simply remove the variable "carrier n s" by

Fig. 4.9: Kernel of the assembly code of the modulation4 function.

Table 4.9: Comparison of Framing/De-framing Functions Before and After Optimization

                 Original Code                  Revised Code               Improvement
             Cycles per   Cycles per       Cycles per   Cycles per
              Symbol       Sample            Symbol       Sample
framing        187916       110.40            25676        15.08             86.34%
de-framing     833350       489.62             7373         4.33             99.11%

replacing it with a look-up table that stores the framing/de-framing indexing numbers.

As illustrated in Table 4.9, we obtain a huge improvement after the modifications.

This is because, with the large number of "if," "else," and "or" operations, the compiler cannot apply software pipelining or loop unrolling to the original C code. Detailed information about how the compiler is able to optimize the revised code is given in the CCS compiler feedback shown in Fig. 4.12. From the message "Schedule found with 6 iterations in parallel" we find that the software pipeline is 6 stages deep. Fig. 4.13 shows the kernel of the assembly code of the de-framing function, where the corresponding C code is also illustrated. We can compare the kernels of the assembly code before and after the revision. The assembly code for the original program is shown in Fig. 4.14; we can see that it cannot be software pipelined, so the assembly code is very inefficient.
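To make this kind of rewrite concrete, the following is a minimal sketch in the spirit of the revised de-framing code, with hypothetical names and sizes (the actual carrier layout is set by the system parameters): the data-carrier indices are computed once, so the per-symbol loop contains no "if," "or," or modulo operations and can be software-pipelined.

    #define N_DATA_CARRIERS 1536               /* assumed number of data carriers */

    /* Filled once at initialization with the FFT bin of each data carrier. */
    static short data_carrier_idx[N_DATA_CARRIERS];

    /* De-framing: gather the data carriers out of one received OFDM symbol. */
    void deframe_symbol(const short *freq_re, const short *freq_im,
                        short *out_re, short *out_im)
    {
        int n;
        for (n = 0; n < N_DATA_CARRIERS; n++) {   /* branch-free inner loop */
            out_re[n] = freq_re[data_carrier_idx[n]];
            out_im[n] = freq_im[data_carrier_idx[n]];
        }
    }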

4.2.2.3 FFT and IFFT Functions

The FFT/IFFT functions we use are from TI's DSPLIB [22]. The original programs [2] used FFT/IFFT functions that employ 32-bit operations. Because the C6416 DSP chip can perform four 16-bit multiplications but only two 32-bit multiplications in one cycle, it is more efficient to use 16-bit multiplications. Table 4.10 compares the performance of the FFT functions provided in the DSPLIB.

DSP_fft32x32 is the complex mixed-radix 32×32-bit FFT with rounding, while

Fig. 4.10: Original C code of the de-framing function.

Fig. 4.11: Revised C code of the de-framing function.

Fig. 4.12: Software pipelining information of the revised code for the de-framing function.

Table 4.10: Comparison of Performance of FFT Functions in DSPLIB for N = 2048

                   Code Size    Execution Cycles    Minimum Cycles       Efficiency
                    (Bytes)       per Symbol       Needed per Symbol
DSP_fft32x32          932           28811               11351              39.39%
DSP_ifft32x32         932           28811               11351              39.39%
DSP_fft16x16r         868           15510               11351              73.18%

the inverse FFT of the same type is DSP_ifft32x32. DSP_fft16x16r is the complex mixed-radix 16×16-bit FFT with rounding. The TI DSPLIB does not provide a 16-bit IFFT function, so we have to do the IFFT using the 16-bit FFT function. As shown in Fig. 4.15, we just need to do conjugation before and after the FFT. More detailed usage of these functions can be found in [22].
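A minimal sketch of this idea is given below; fft16() is a placeholder for the DSPLIB 16-bit FFT call (its actual argument list is omitted here), and the data is assumed to be interleaved {re, im} 16-bit pairs:

    extern void fft16(short *x, int n);     /* placeholder for the DSPLIB 16-bit FFT */

    /* IFFT(x) = conj( FFT( conj(x) ) ) / N; the 1/N scaling is assumed to be
       absorbed by the FFT routine's internal scaling and rounding.            */
    void ifft_via_fft(short *x, int n)
    {
        int i;
        for (i = 0; i < n; i++)             /* conjugate before the FFT */
            x[2*i + 1] = (short)(-x[2*i + 1]);

        fft16(x, n);                        /* forward 16-bit FFT       */

        for (i = 0; i < n; i++)             /* conjugate after the FFT  */
            x[2*i + 1] = (short)(-x[2*i + 1]);
    }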

Table 4.11 compares the computational complexity of different FFT algorithms.

The mixed-radix FFT theoretically needs 19974 real multiplications and 68102 real additions in our application, which uses 2048-point FFT/IFFT. So the absolute

Fig. 4.13: Kernel of the assembly code of the revised de-framing function.

Fig. 4.14: Kernel of the assembly code of the original de-framing function.

Fig. 4.15: IFFT implementation using FFT function.

minimum number of execution cycles is max{19974/2, 68102/6} = 11351 for the 32-bit FFT/IFFT and max{19974/4, 68102/6} = 11351 for the 16-bit FFT, since the C64x can complete at most two 32-bit or four 16-bit multiplications and six additions per cycle.

Practically, as shown in Table 4.10, DSP_fft32x32 and DSP_ifft32x32 need 28811 clock cycles and DSP_fft16x16r needs 15510 clock cycles, so the efficiencies are 39.39% and 73.18%, respectively, where the efficiency is defined as Efficiency = (Minimum Cycles Needed)/(Practical Execution Cycles), which indicates how well the compiler schedules the assembly code.

Fig. 4.16 shows the core loop in DSP_fft16x16r. The assembly code shown in the figure uses "DOTP2" and "DOTPN2" instructions to compute intermediate results.

For example, the following code:

    x2[l1  ] = (si10 * yt1_0 + co10 * xt1_0 + 0x8000) >> 16;
    x2[l1+1] = (co10 * yt1_0 - si10 * xt1_0 + 0x8000) >> 16;
    x2[l1+2] = (si11 * yt1_1 + co11 * xt1_1 + 0x8000) >> 16;
    x2[l1+3] = (co11 * yt1_1 - si11 * xt1_1 + 0x8000) >> 16;

is mapped to the assembly code below, as indicated by the ovals in Fig. 4.16:

    DOTP2  .M2  B_xt0_0_yt0_0, B_co20_si20, B_x_l1_0 ;
    DOTPN2 .M2  B_yt0_0_xt0_0, B_co20_si20, B_x_l1_1 ;
    DOTP2  .M2  B_xt0_1_yt0_1, B_co21_si21, B_x_l1_2 ;
    DOTPN2 .M2  B_yt0_1_xt0_1, B_co21_si21, B_x_l1_3 ;
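For illustration, the same arithmetic can be written at the C level with C64x packed-data intrinsics so that the compiler maps it to DOTP2/DOTPN2 directly. The helper below is only a sketch (the function name, packing order, and use of c6x.h are our assumptions, not the DSPLIB source):

    #include <c6x.h>                         /* TI C6000 intrinsic declarations */

    /* One twiddle multiplication of the butterfly, using packed 16-bit data. */
    static inline void twiddle_mul(short co, short si, short xt, short yt,
                                   short *re, short *im)
    {
        int co_si = _pack2(co, si);          /* co in upper half, si in lower */
        int xt_yt = _pack2(xt, yt);          /* xt : yt                       */
        int yt_xt = _pack2(yt, xt);          /* yt : xt                       */

        *re = (short)((_dotp2 (xt_yt, co_si) + 0x8000) >> 16);  /* co*xt + si*yt */
        *im = (short)((_dotpn2(yt_xt, co_si) + 0x8000) >> 16);  /* co*yt - si*xt */
    }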

Table 4.11: Comparison of Computational Complexity of Different FFT Algorithms

Complexity             No. of Real Multiplications      No. of Real Additions
Radix-2 FFT            (3/2)N log2 N − 5N + 8           (7/2)N log2 N − 5N + 8
Radix-4 FFT            (9/8)N log2 N − 3N + 3           (25/8)N log2 N − 3N + 3
Radix-8 FFT            (25/24)N (log2 N − 3) + 4        (73/24)N log2 N − (25/8)N + 4
Split-radix-4/2 FFT    N log2 N − 3N + 4                3N log2 N − 3N + 4
Simplified FFT         4N                               6N

Table 4.12: Comparison of FFT/IFFT Before and After Optimization

             Original Code                  Revised Code                Improvement
         Cycles per   Cycles per       Cycles per   Cycles per
          Symbol       Sample           Symbol       Sample
FFT        32256        15.75            17046        8.32               47.17%
IFFT       35728        17.44            24360       11.89               31.82%


By this modification, the execution cycles of the IFFT and FFT functions in Tables 4.6 and 4.7 become 24360/2048 = 11.89 cycles/sample and 17046/2048 = 8.32 cycles/sample, respectively, as shown in Table 4.12. The DSP_fft16x16r function is used inside the FFT/IFFT functions. The excess clock cycles of FFT/IFFT over the DSP_fft16x16r cycle counts come from the data movement inside our FFT/IFFT functions.

4.2.2.4 SRRC Filter

The C6000 compiler provides intrinsics, which are special functions that map directly to inlined C62x/C64x/C67x instructions, to optimize the C/C++ code quickly. The intrinsic functions provided by TI offer another method for optimizing the program at the C level. A detailed introduction to the intrinsic functions can be found in

Fig. 4.16: A part of the assembly code in DSP_fft16x16r.

Table 4.13: Simulation Data for SRRC_downsample

                     Inclusive Cycles    Exclusive Cycles
SRRC_downsample            226                 140

[21].

In Table 4.7, the reason for the inefficiency of the SRRC_downsample function is the data movement for the SRRC filter buffer, as shown in Fig. 4.17. We can find evidence in the simulation data shown in Table 4.13, where the inclusive cycles are the cycle count for the entire SRRC_downsample function and the exclusive cycles are the cycle count excluding the functions called inside SRRC_downsample. In our program, the called function does the SRRC filtering, and the exclusive cycles are spent purely on data movement in the data buffer, so the number of multiples of real time for the filtering alone is (226 − 140)/52.5 = 1.63.

By using intrinsics, we can accelerate the data movement. As shown in Fig. 4.17, the functions "_amemd8" and "_amemd8_const" are intrinsic functions that provide aligned loads and stores of 8 bytes to memory in a single instruction. So we can perform four 16-bit loads and stores within one instruction. The speedup of the SRRC_downsample function is shown in Table 4.14. Here, we find that the Tx SRRC filter obtains a huge improvement in performance. The reason is not only the use of intrinsics but also the better coding style obtained by removing conditionals, similar to the method used to improve the framing function introduced before. More detailed analyses can be found in [5].
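As a rough sketch of this kind of data movement (the buffer name, length, shift amount, and alignment pragma below are our assumptions, not the thesis code), the delay-line shift can move four 16-bit samples per aligned access:

    #include <c6x.h>                          /* _amemd8, _amemd8_const          */

    #define FILTER_LEN 64                     /* assumed SRRC delay-line length  */
    #define N_NEW      4                      /* assumed samples consumed per call
                                                 (a multiple of 4 keeps 8-byte
                                                 alignment for _amemd8)          */

    #pragma DATA_ALIGN(srrc_buf, 8)           /* double-word alignment required  */
    static short srrc_buf[FILTER_LEN];

    void srrc_shift(const short *new_samples)
    {
        int i;
        /* Move four 16-bit samples (8 bytes) per aligned load/store pair. */
        for (i = 0; i < FILTER_LEN - N_NEW; i += 4)
            _amemd8(&srrc_buf[i]) = _amemd8_const(&srrc_buf[i + N_NEW]);

        /* Append the newly arrived samples at the tail of the buffer. */
        for (i = 0; i < N_NEW; i++)
            srrc_buf[FILTER_LEN - N_NEW + i] = new_samples[i];
    }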
