Optimization of Implementation on PACDSP
5.2 Architectural Optimization
An important issue in DSP implementation is making use of the architectural advantages. In this section, we introduce some general software optimization techniques, including static rescheduling, loop unrolling, and software pipelining. In addition, the computations are dispatched to different units to exploit the VLIW processor.
Some special SIMD instructions of PACDSP are used to compute or load/store multiple data at the same time. The advantage of SIMD instructions is an increase in computational throughput.
Table 5.7: Comparison of IDCT on Different Platforms
Designs          Processing units   Clock (MHz)   2-D fast algo.   Cycles   Instruction counts
TI C62x [20]     2 MUL, 6 ALU       200           row-column       230      1,840
TI C64x [21]     2 MUL, 6 ALU       600           row-column       154      1,232
IDCT Core [20]   1 ALU              33            direct 2-D       1,208    1,208
PACDSP (ours)∗   2 AU, 2 L/S        200           even-odd         307      1,228
∗Note: If we consider the scalar unit, the instruction count is 1,535 in our implementation.
5.2.1 General Optimization Techniques
For our implementation on PACDSP, we should try to fill all the slots in an instruction packet to get higher performance. Therefore, achieving a full-pipeline implementation is very important. In this subsection, three general optimization techniques are discussed: static rescheduling, loop unrolling, and software pipelining [11]. The purpose of these techniques is to reduce the number of stalls resulting from hazards, and the appropriateness of these techniques for PACDSP is discussed as well.
For the discussion, we use an example of summing the coefficients of a 1-D array that contains eight 8-bit data. Fig. 5.17 shows the corresponding C program. To simplify the application of the different techniques, we use only one instruction slot in the instruction packet.
Figure 5.17: Example C code of vector addition.
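The listing of Fig. 5.17 is not reproduced here. As a minimal sketch of the kind of loop described above, assuming the example simply accumulates eight signed 8-bit elements, it might look as follows. The names sum_array and N are hypothetical and do not come from the original figure.

```c
#include <stdint.h>

#define N 8  /* eight 8-bit elements, as stated in the text */

/* Hypothetical stand-in for the example loop: sum a 1-D array. */
int sum_array(const int8_t data[N])
{
    int sum = 0;
    for (int i = 0; i < N; i++) {
        sum += data[i];  /* one load and one dependent add per iteration */
    }
    return sum;
}
```

Each iteration contains a load followed immediately by an add that consumes the loaded value, which is the dependence pattern the following techniques try to hide.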
Static Rescheduling
In assembly code programming, data dependences may cause stalls in the processor, and these stalls increase the required computation time. There are three types of data hazard, namely, read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW).
In the left half of Fig. 5.18, we simply translate the C program in Fig. 5.17 to PACDSP assembly code. Because the following instruction depends on register D0, and data loaded from memory requires two cycles to become valid in PACDSP, two stalls are inserted after the “LB” instruction. In addition, the conditional branch, whose predicate register is p2, depends on the comparison instruction “SLTI.” The predicate register also needs two cycles to become valid for conditional execution, so two stalls are inserted after the “SLTI” instruction. Therefore, the direct translation, with its three delay slots, contains seven stalls (NOPs) in total, and these stalls significantly degrade the execution speed.
We can utilize the independence of instructions to eliminate the stalls as much as possible. In the right half of Fig. 5.18, we reschedule the order of the assembly code, which reduces the stalls from seven to four. However, since the computation is not very complex, we cannot further reduce the number of stalls simply through rescheduling.
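The effect of reordering can be illustrated with a toy stall counter. This is only a sketch under stated assumptions: a single-issue, in-order machine in which an instruction waits until its source registers are valid, with a hypothetical per-instruction delay field (2 for a load, 0 for simple arithmetic). The Instr type and count_stalls function are illustrative and do not model the real PACDSP pipeline or its branch delay slots.

```c
#include <stddef.h>

#define NUM_REGS 16
#define NO_REG   (-1)

/* Hypothetical instruction record for the toy model. */
typedef struct {
    int dst;    /* destination register, or NO_REG */
    int src1;   /* first source register, or NO_REG */
    int src2;   /* second source register, or NO_REG */
    int delay;  /* extra cycles before dst becomes valid */
} Instr;

/* Count the stall cycles (NOPs) needed to run the sequence in order. */
int count_stalls(const Instr *code, size_t n)
{
    int ready[NUM_REGS] = {0};  /* earliest cycle each register is valid */
    int cycle = 0;              /* next free issue cycle */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;
        if (code[i].src1 != NO_REG && ready[code[i].src1] > issue)
            issue = ready[code[i].src1];
        if (code[i].src2 != NO_REG && ready[code[i].src2] > issue)
            issue = ready[code[i].src2];

        stalls += issue - cycle;  /* cycles spent waiting for operands */
        if (code[i].dst != NO_REG)
            ready[code[i].dst] = issue + 1 + code[i].delay;
        cycle = issue + 1;
    }
    return stalls;
}
```

For a three-instruction body (a load, an add that uses the loaded value, and an independent pointer increment), this model counts two stalls in program order and one stall when the increment is moved between the load and the add.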
Loop Unrolling
Loop unrolling is a general technique to deal with the implementation of an iterative computation, especially if there are stalls in a single iteration.
To use the unrolling technique, we have to find the independent computations in consecutive iterations. We can use different registers to store data from different iterations, and the instructions still need to be scheduled well to reduce the stalls. The number of unrolled iterations depends on the stalls and the independent computations in a single iteration. Figure 5.19 shows the assembly code before and after loop unrolling.
In Fig. 5.19, we see that all the stalls (NOPs) are eliminated. The loop maintenance code and the branch condition must be adjusted to match the new iteration structure.
However, there is a tradeoff between execution time and code size. Although the stalls are all eliminated, the code size increases after loop unrolling. Therefore, we have to assess whether code size is critical. In addition, the number of available registers limits the use of loop unrolling.
Figure 5.18: Example of static rescheduling technique.
Figure 5.19: Example of loop unrolling technique.
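At the C level, the loop unrolling idea discussed above can be sketched as follows. The unroll factor of four and the separate accumulators s0 to s3 are assumptions chosen for illustration; the original figure works on PACDSP assembly, where the separate accumulators correspond to separate registers.

```c
#include <stdint.h>

#define N 8  /* eight 8-bit elements, as stated in the text */

/* Sum the array with the loop body unrolled four times (assumed factor). */
int sum_array_unrolled(const int8_t data[N])
{
    /* Independent accumulators remove the dependence between adds. */
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;

    /* Loop maintenance changes: the index now advances by 4. */
    for (int i = 0; i < N; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The four loads in the body are independent, so a scheduler can issue them back to back and place each add after its load latency has elapsed. The cost is the larger body and the extra registers noted above.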
Software Pipelining
The concept of software pipelining is to reorganize the loop so that instructions from different iterations are interleaved, which separates the dependent instructions within the original loop. Unlike loop unrolling, we only reschedule the loop, so the stalls may not be entirely eliminated. An example of software pipelining is illustrated in Fig. 5.20.
Note that start-up code and clean-up code are needed to interleave the dependent code. Compared to loop unrolling, there are still two stalls. The advantage of software pipelining is the smaller code size. However, the loop overhead cannot be reduced through software pipelining. We can nevertheless apply loop unrolling and software pipelining to our implementation simultaneously and take advantage of both techniques.
Figure 5.20: Example of software pipelining technique.
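The structure of a software-pipelined loop, with its start-up and clean-up code, can be sketched in C as follows. This is an illustrative sketch only: the function name and the two-stage split (load in one iteration, add in the next) are assumptions, and the real scheduling in Fig. 5.20 is done on PACDSP assembly.

```c
#include <stdint.h>

#define N 8  /* eight 8-bit elements, as stated in the text */

/* Sum the array with the load of iteration i overlapped with
 * the add of iteration i-1. */
int sum_array_pipelined(const int8_t data[N])
{
    int sum = 0;

    /* Start-up code: issue the load for iteration 0. */
    int8_t loaded = data[0];

    /* Steady state: each pass loads for iteration i and
     * adds the value loaded for iteration i-1. */
    for (int i = 1; i < N; i++) {
        int8_t next = data[i];
        sum += loaded;
        loaded = next;
    }

    /* Clean-up code: add the value loaded in the last pass. */
    sum += loaded;
    return sum;
}
```

The loop body stays about as small as the original one, which reflects the code-size advantage, while the loop still iterates once per element, so the loop overhead is unchanged.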
5.2.2 Advantages of PACDSP
In order to speed up our implementation on PACDSP, we can utilize the advantages of the VLIW architecture and the SIMD instructions. However, not all computations can be distributed to both clusters, so we have to check whether the characteristics of the computations are suited to the advantages of PACDSP.
In addition, since branch instructions affect the program sequence of both clusters, it is better to put two regular and independent parts of the computation in different clusters. For example, an iterative computation can be separated into two parts if the computations in different iterations are independent. In the MPEG-4 frame-based video decoder, for instance, dequantization (IQ) and IDCT (IT) are very regular computations, which are suitable for distribution over two clusters. Moreover, SIMD instructions are also very helpful for our optimization.
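The actual SIMD instructions of PACDSP are not listed in this section. To illustrate the throughput benefit of operating on several data at once, the following sketch emulates a four-lane 8-bit addition inside one 32-bit word in portable C. The function name add_packed_u8x4 is hypothetical, and the bit-masking method shown is a generic technique, not the PACDSP instruction itself.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed in 32-bit words.
 * Each lane wraps modulo 256 and no carry crosses a lane boundary. */
uint32_t add_packed_u8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the carry lands in bit 7
     * of the same lane and cannot reach the next lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Fold in the top bit of each lane without generating a carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

One call replaces four separate byte additions, which is the kind of gain a hardware SIMD instruction provides in a single issue slot.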
5.2.3 Experiment Result of Architectural Optimization
After our architectural optimization, which includes the general optimization techniques and the use of the advantages of PACDSP, the improvement is shown in Table 5.8. We find that the architectural optimization introduces significant improvement, up to 28.27 percent. It is thus clear that the number of stalls affects the performance greatly. We can increase the performance of our implementation if we reduce the stalls in the assembly code.