Software Pipelining - Compiler Level Optimization Techniques

4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

4.2 Compiler Level Optimization Techniques

4.2.3 Software Pipelining

Software pipelining is widely used to exploit instruction-level parallelism (ILP) in loops, and the TI CCS compiler is capable of performing it. More precisely, it is a technique for scheduling the instructions in a loop so that multiple iterations of the loop execute in parallel; slots that would otherwise be empty are filled and the functional units are used more efficiently. The result is highly optimized loop code that can accelerate program execution significantly.

To illustrate how software pipelining actually works, we give an example from [16]. A simple for loop and its code after applying software pipelining are shown in Fig. 4.4(a) and 4.4(b). The loop schedule length is reduced from four control steps to one control step in the software pipelined loop. However, the code size of the software pipelined loop is three times longer than that of the original code in this example. Fig. 4.5(a) and 4.5(b) show the execution records of the original loop and the software pipelined loop, respectively.

(a)

    for i = 1 to n do
        A[i] = E[i-4] + 9;
        B[i] = A[i] * 5;
        C[i] = A[i] + B[i-2];
        D[i] = A[i] * C[i];
        E[i] = D[i] + 30;
    end

(b)

    A[1] = E[-3] + 9;
    A[2] = E[-2] + 9;  B[1] = A[1] * 5;  C[1] = A[1] + B[-1];
    A[3] = E[-1] + 9;  B[2] = A[2] * 5;  C[2] = A[2] + B[0];  D[1] = A[1] * C[1];
    for i = 1 to n-3 do
        A[i+3] = E[i-1] + 9;
        B[i+2] = A[i+2] * 5;
        C[i+2] = A[i+2] + B[i];
        D[i+1] = A[i+1] * C[i+1];
        E[i] = D[i] + 30;
    end
    B[n] = A[n] * 5;  C[n] = A[n] + B[n-2];  D[n-1] = A[n-1] * C[n-1];  E[n-2] = D[n-2] + 30;
    D[n] = A[n] * C[n];  E[n-1] = D[n-1] + 30;
    E[n] = D[n] + 30;

Figure 4.4: (a) The Original Loop. (b) The Loop After Applying Software Pipelining.

In these figures, we can clearly observe that only two (B and C) of the five instructions A, B, C, D and E execute in parallel in the original loop, whereas all five instructions execute in parallel in the software pipelined loop, so program efficiency improves significantly. We can also see that the pipelined code consists of three regions: prologue, loop kernel (repeating schedule) and epilogue. The prologue is the “setup” of the loop; running the prologue code is often called “priming” the loop. The length of the prologue depends on the latency between the beginning and the end of the loop code, i.e., on the number of instructions and their latencies. The epilogue consists of the ending instructions that must be completed after the loop kernel; it is similar to the prologue and is optional. If necessary, it can be rolled into the loop kernel. The prologue and epilogue of a software pipelined loop occupy a large part of the code size, so there is a trade-off between speed and memory size that we have to take into account. However, since the program memory of the Quixote DSP baseboard is quite large and the original FEC code is quite small, code growth should not be a serious issue if we adopt software pipelining in our code.
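The equivalence of the two loops in Fig. 4.4 can be checked with a small C sketch. It is only an illustration under stated assumptions: the index offset OFF, the bound MAX_N, the seed values for E[-3..0] and B[-1..0], and all function names are hypothetical and not taken from [16]. Unsigned arithmetic is used so that overflow wraps identically in both versions. On a sequential machine the statements grouped in one control step simply run one after another; the point is that no statement in a group reads a value written by another statement in the same group.

```c
#include <assert.h>
#include <string.h>

#define OFF   4    /* index offset so that E[-3] and B[-1] are valid (assumption) */
#define MAX_N 64   /* largest n supported by this sketch (assumption) */
#define AT(x, i) ((x)[(i) + OFF])   /* element x[i] of the figure */

typedef struct {
    unsigned A[MAX_N + OFF + 1], B[MAX_N + OFF + 1], C[MAX_N + OFF + 1],
             D[MAX_N + OFF + 1], E[MAX_N + OFF + 1];
} Arrays;

/* Zero everything, then seed the values the loop reads before writing. */
static void init(Arrays *s)
{
    memset(s, 0, sizeof *s);
    for (int i = -3; i <= 0; i++)
        AT(s->E, i) = (unsigned)(i + 4);
    AT(s->B, -1) = 5u;
    AT(s->B, 0) = 6u;
}

/* Fig. 4.4(a): the original loop. */
static void run_original(Arrays *s, int n)
{
    for (int i = 1; i <= n; i++) {
        AT(s->A, i) = AT(s->E, i - 4) + 9u;
        AT(s->B, i) = AT(s->A, i) * 5u;
        AT(s->C, i) = AT(s->A, i) + AT(s->B, i - 2);
        AT(s->D, i) = AT(s->A, i) * AT(s->C, i);
        AT(s->E, i) = AT(s->D, i) + 30u;
    }
}

/* Fig. 4.4(b): prologue, loop kernel and epilogue. */
static void run_pipelined(Arrays *s, int n)
{
    /* Prologue: three control steps that prime the pipeline. */
    AT(s->A, 1) = AT(s->E, -3) + 9u;

    AT(s->A, 2) = AT(s->E, -2) + 9u;
    AT(s->B, 1) = AT(s->A, 1) * 5u;
    AT(s->C, 1) = AT(s->A, 1) + AT(s->B, -1);

    AT(s->A, 3) = AT(s->E, -1) + 9u;
    AT(s->B, 2) = AT(s->A, 2) * 5u;
    AT(s->C, 2) = AT(s->A, 2) + AT(s->B, 0);
    AT(s->D, 1) = AT(s->A, 1) * AT(s->C, 1);

    /* Kernel: five mutually independent statements per control step. */
    for (int i = 1; i <= n - 3; i++) {
        AT(s->A, i + 3) = AT(s->E, i - 1) + 9u;
        AT(s->B, i + 2) = AT(s->A, i + 2) * 5u;
        AT(s->C, i + 2) = AT(s->A, i + 2) + AT(s->B, i);
        AT(s->D, i + 1) = AT(s->A, i + 1) * AT(s->C, i + 1);
        AT(s->E, i)     = AT(s->D, i) + 30u;
    }

    /* Epilogue: three control steps that drain the pipeline. */
    AT(s->B, n)     = AT(s->A, n) * 5u;
    AT(s->C, n)     = AT(s->A, n) + AT(s->B, n - 2);
    AT(s->D, n - 1) = AT(s->A, n - 1) * AT(s->C, n - 1);
    AT(s->E, n - 2) = AT(s->D, n - 2) + 30u;

    AT(s->D, n)     = AT(s->A, n) * AT(s->C, n);
    AT(s->E, n - 1) = AT(s->D, n - 1) + 30u;

    AT(s->E, n)     = AT(s->D, n) + 30u;
}

/* Returns 1 if both versions leave identical arrays, 0 otherwise. */
int pipelined_matches_original(int n)
{
    Arrays a, b;
    assert(n >= 3 && n <= MAX_N);   /* the prologue already writes A[3] */
    init(&a);
    init(&b);
    run_original(&a, n);
    run_pipelined(&b, n);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Calling pipelined_matches_original(16), for instance, runs both versions on the same seed data and compares every array element. The requirement n >= 3 reflects the fact that the prologue and epilogue are written out unconditionally.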

For implementation on the TI C6000 DSP family, the performance of a C loop is greatly influenced by how well the CCS compiler can software pipeline it. The compiler provides feedback that helps programmers fine-tune the loop structure. By understanding this feedback, we can quickly tune our C code to obtain the highest possible performance. The feedback explains the issues encountered in pipelining the loop and reports the results. The compiler goes through three basic stages when compiling a loop [13]:

1. Qualify the loop for software pipelining.

2. Collect loop resource and dependency graph information.

3. Software pipeline the loop.

In the first stage, the compiler tries to identify the loop counter (called the trip counter, after the number of trips through a loop) and any information about it, such as its minimum value (known minimum trip count) and whether it is a multiple of some number (known maximum trip count factor).

If this information about the loop counter is known, the compiler can be more aggressive in performing packed data processing and loop unrolling optimizations.

For example, if the exact value of a loop counter is not known but it is known that the value is a multiple of some number, the compiler may be able to unroll the loop to improve the performance.
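One way to pass such trip count information to the TI C6000 compiler is the MUST_ITERATE pragma, which states the minimum trip count, the maximum trip count and a factor that the trip count is a multiple of. The following is a minimal sketch, not code from our FEC implementation: the function name dot_product and the values 8, 1024 and 4 are assumptions chosen for illustration, and C99 is assumed for the restrict qualifier. The pragma is a promise by the programmer, so the sketch asserts the same conditions at run time. Other compilers do not know this pragma and typically ignore it.

```c
#include <assert.h>

/* Dot product of two 16-bit vectors (hypothetical example function). */
int dot_product(const short *restrict a, const short *restrict b, int n)
{
    int sum = 0;

    /* The pragma below is only valid if these conditions really hold. */
    assert(n >= 8 && n <= 1024 && n % 4 == 0);

    /* Trip count: at least 8, at most 1024, always a multiple of 4. */
    #pragma MUST_ITERATE(8, 1024, 4)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

With a known multiple of 4, the compiler is free to unroll the loop by up to that factor without generating clean-up code for leftover iterations.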

Several conditions must be met before software pipelining is allowed, or legal, from the compiler’s point of view. These conditions are:

The loop cannot contain too many instructions. Loops that are too big typically require more registers than are available and take longer to compile.

The loop cannot call another function unless the called function is inlined. Any break in control flow makes software pipelining impossible, because multiple iterations execute in parallel.

If any of the conditions for software pipelining is not met, qualification of the loop halts and a disqualification message appears. In this situation, software pipelining is not applied to the loop, and the loop will run considerably slower than a pipelined version would.
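The function call condition can be illustrated with a short C sketch. The names saturate16 and add_saturated are hypothetical, and C99 is assumed. The helper is small and declared static inline so that the compiler has the option of inlining it into the loop body; whether it actually does so depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a 32-bit value to the 16-bit range (hypothetical helper). */
static inline short saturate16(int x)
{
    if (x > 32767)
        return 32767;
    if (x < -32768)
        return -32768;
    return (short)x;
}

/* Element-wise saturated addition of two 16-bit vectors. */
void add_saturated(const short *restrict x, const short *restrict y,
                   short *restrict out, int n)
{
    assert(x != NULL && y != NULL && out != NULL);

    /* The only call in the loop body is to an inlinable helper. */
    for (int i = 0; i < n; i++)
        out[i] = saturate16(x[i] + y[i]);
}
```

If the helper were defined in another source file and not inlined, the call would remain in the loop body and the loop would be disqualified according to the condition above.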

In the second stage, the compiler collects loop resource and dependency graph information. It derives the loop-carried dependency bound, the unpartitioned resource bound across all resources, and the partitioned resource bound across all resources, based on our loop code.

In the third stage, the compiler attempts to software pipeline our loop based on the knowledge it collected in the previous two stages. The first thing the compiler attempts during this stage is to schedule the loop at an iteration interval (ii) equal to the smallest value permitted by the bounds obtained in the second stage. Since each bound is a lower limit on ii, this starting value is the largest of the bounds. If the attempt is not successful, the compiler provides an additional feedback message to help explain why it failed, e.g., a register is live too long or no schedule was found. The compiler then retries with ii = (previous failed ii + 1) until it finds a valid schedule, at which point the software pipeline is complete.
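The search over ii described above can be modelled with a few lines of C. This is a toy model, not the compiler's actual algorithm: the callback type schedule_fn, the limit ii_limit and the stand-in toy_scheduler are all assumptions made for illustration. The search starts from the smallest ii that satisfies all three bounds, which is their maximum because each one is a lower bound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if a schedule exists at this ii. */
typedef int (*schedule_fn)(int ii, void *ctx);

/* Returns the first ii at which scheduling succeeds, or -1 if none is
 * found up to ii_limit (the limit is an assumption of this sketch). */
int find_iteration_interval(int dep_bound, int unpart_bound, int part_bound,
                            int ii_limit, schedule_fn try_schedule, void *ctx)
{
    int ii = dep_bound;

    assert(try_schedule != NULL);

    /* Each bound is a lower limit on ii, so start at the largest one. */
    if (unpart_bound > ii)
        ii = unpart_bound;
    if (part_bound > ii)
        ii = part_bound;

    /* On failure, retry with ii = previous failed ii + 1. */
    for (; ii <= ii_limit; ii++) {
        if (try_schedule(ii, ctx))
            return ii;
    }
    return -1;
}

/* Toy stand-in for the scheduler: succeeds once ii reaches *ctx. */
int toy_scheduler(int ii, void *ctx)
{
    const int *needed = ctx;
    return ii >= *needed;
}
```

For example, with bounds 2, 3 and 3 and a toy scheduler that needs ii >= 5, the search tries ii = 3 and ii = 4 before succeeding at ii = 5.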
