Optimization of the Implementation on PACDSP
5.1 Algorithmic Optimization
5.1.1 Algorithmic Optimization for Intra Frames
Intra-frame decoding includes a process that is not applied to inter-encoded frames: the prediction of DC and AC coefficients. Since these predictions are time-consuming, much execution time can be saved if their frequency can be reduced.
In addition, an important property of the DCT is that it concentrates signal energy in the lower-frequency coefficients. That is, if a block is filled with a constant value, there will be only one coefficient, at the DC position, after the transform. In other words, if we can make sure that only a DC component is decoded from the bitstream, the corresponding output block can be obtained by copying the DC component to the entire block. This property is illustrated in Fig. 5.1. There are different methods to skip the prediction and transform; in the following, we introduce the implementation techniques and present the analysis and simulation results.
Figure 5.1: DC spreading from decoded coefficient to output block.
The assembly code that spreads the DC value to the whole block is shown in Fig. 5.2. Four iterations are needed to complete one block, so the execution time is 19 cycles, including the setup of the loop register and address registers. However, we still need several cycles to update the prediction data “DC Store”.
DC_Spread()
DC_Spreading: ; 4 iterations for one block
{ SET_LBCI RBC0,4 | MOVI.L A6,R_Block_2D | COPY D15,D14 | MOVI.L A6,R_Block_2D | COPY D15,D14 }
{ NOP | MOVI.H A6,R_Block_2D | NOP | MOVI.H A6,R_Block_2D | NOP } ; D14, D15 are DC value
Figure 5.2: Assembly code of DC spreading.
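The DC-only shortcut can be illustrated in C. The following is a minimal sketch, not the PACDSP implementation: the function names are hypothetical, and it assumes the standard 8x8 IDCT normalization, under which a block with only a DC coefficient reconstructs to the constant value DC/8 (rounded here half away from zero). Whether the decoder folds this scaling into an earlier stage is not stated in the text. The helper compares the shortcut against a direct-form reference IDCT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8
#define BLOCK_COEFFS (BLOCK_SIZE * BLOCK_SIZE)

/* cos(k*pi/16) for k = 0..8, hardcoded so the sketch needs no math library. */
static const double COS_TAB[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Returns cos(n*pi/16) for n >= 0 using the symmetries of the cosine. */
static double cos_pi16(int n)
{
    n %= 32;                      /* period is 2*pi, i.e. 32 steps */
    if (n > 16) n = 32 - n;       /* cos(2*pi - a) = cos(a) */
    return (n > 8) ? -COS_TAB[16 - n] : COS_TAB[n]; /* cos(pi - a) = -cos(a) */
}

/* Reference 8x8 inverse DCT in direct form (slow, for checking only). */
void idct8x8_ref(const int16_t coeff[BLOCK_COEFFS], int16_t out[BLOCK_COEFFS])
{
    assert(coeff != NULL && out != NULL);
    for (int y = 0; y < BLOCK_SIZE; y++) {
        for (int x = 0; x < BLOCK_SIZE; x++) {
            double sum = 0.0;
            for (int v = 0; v < BLOCK_SIZE; v++) {
                for (int u = 0; u < BLOCK_SIZE; u++) {
                    double cu = (u == 0) ? COS_TAB[4] : 1.0; /* 1/sqrt(2) */
                    double cv = (v == 0) ? COS_TAB[4] : 1.0;
                    sum += cu * cv * coeff[v * BLOCK_SIZE + u]
                         * cos_pi16((2 * x + 1) * u)
                         * cos_pi16((2 * y + 1) * v);
                }
            }
            sum /= 4.0;
            out[y * BLOCK_SIZE + x] =
                (int16_t)(sum >= 0.0 ? sum + 0.5 : sum - 0.5);
        }
    }
}

/* DC-only shortcut: fill the whole block with DC/8 (assumed scaling). */
void dc_spread(int16_t dc, int16_t out[BLOCK_COEFFS])
{
    assert(out != NULL);
    int16_t value = (int16_t)((dc >= 0) ? (dc + 4) / 8 : -((-dc + 4) / 8));
    for (int i = 0; i < BLOCK_COEFFS; i++)
        out[i] = value;
}

/* Returns 1 if the shortcut and the reference IDCT agree for a DC-only block. */
int dc_spread_matches_idct(int16_t dc)
{
    int16_t coeff[BLOCK_COEFFS] = {0};
    int16_t ref[BLOCK_COEFFS];
    int16_t fast[BLOCK_COEFFS];
    coeff[0] = dc;
    idct8x8_ref(coeff, ref);
    dc_spread(dc, fast);
    for (int i = 0; i < BLOCK_COEFFS; i++)
        if (ref[i] != fast[i])
            return 0;
    return 1;
}
```

The reference transform needs 64 multiply-accumulate steps per output sample, while the shortcut is a single fill, which is the saving the 19-cycle loop in Fig. 5.2 exploits.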
Check Skipped Blocks Using CBP and ACPred Flag
In MPEG-4 video, two parameters encoded in the macroblock header can help us reduce the amount of computation. The first one, CBP (Coded Block Pattern), tells us which blocks in a macroblock are variable-length encoded. The second, ACPred Flag, indicates whether AC coefficient prediction is used.
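As a minimal sketch of this header-only check, the C fragment below marks a block as skippable when its CBP bit is zero and the ACPred Flag of the macroblock is zero. The function names are hypothetical, and the bit layout (six blocks per macroblock, with the most significant of the six CBP bits belonging to block 0) is an assumption for illustration rather than something taken from the implementation described here.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_MB 6  /* assumed: 4 luminance + 2 chrominance blocks */

/* Returns 1 if block `blk` has variable-length-coded coefficients.
 * Assumed layout: bit 5 of cbp is block 0, bit 0 is block 5. */
int block_is_coded(uint8_t cbp, int blk)
{
    assert(blk >= 0 && blk < BLOCKS_PER_MB);
    return (cbp >> (BLOCKS_PER_MB - 1 - blk)) & 1;
}

/* Header-only check: the block is skipped only if it carries no coded
 * coefficients and the macroblock does not use AC prediction. */
int can_skip_block_simple(uint8_t cbp, int ac_pred_flag, int blk)
{
    return !block_is_coded(cbp, blk) && !ac_pred_flag;
}

/* Counts the blocks of one macroblock that pass the header-only check. */
int count_skippable_simple(uint8_t cbp, int ac_pred_flag)
{
    int count = 0;
    for (int blk = 0; blk < BLOCKS_PER_MB; blk++)
        count += can_skip_block_simple(cbp, ac_pred_flag, blk);
    return count;
}
```

Note that a set ACPred Flag disables skipping for every block in the macroblock, even those whose predicted AC coefficients would all be zero.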
To find out the proportion of blocks that can be skipped, we use the same test sequences as before. The simulation is run on a PC with 90 frames per sequence, all of which are encoded as intra frames. The simulation results are listed in Table 5.1.
In Table 5.1, we can see that the percentage of skipped blocks is not very high, and that a slow-motion sequence such as “Akiyo” does not have the most skipped blocks among the six test sequences. The results fall short of our expectation because of the parameter ACPred Flag. Since the ACPred Flag is set to 1 if any block in an MB is predicted with AC coefficients, we cannot skip a block that has only a DC component when the ACPred Flag of its MB is nonzero. Therefore, we should improve our method of finding the blocks that can be skipped.
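The block totals and percentages in this analysis follow from simple arithmetic, sketched below in C. The helper names are hypothetical, and the sketch assumes 16x16 macroblocks with six 8x8 blocks each (4:2:0 sampling), which for QCIF (176x144) and 90 frames gives the total of 53,460 blocks.

```c
#include <assert.h>

/* Total number of 8x8 blocks in a sequence.
 * Assumes 16x16 macroblocks with 6 blocks each (4:2:0 sampling). */
long total_blocks(int width, int height, int frames)
{
    assert(width % 16 == 0 && height % 16 == 0 && frames >= 0);
    long mbs_per_frame = (long)(width / 16) * (height / 16);
    return mbs_per_frame * 6 * frames;
}

/* Percentage of skipped blocks among all blocks. */
double skipped_percent(long skipped, long total)
{
    assert(total > 0 && skipped >= 0 && skipped <= total);
    return 100.0 * (double)skipped / (double)total;
}
```

For example, `skipped_percent(8343, total_blocks(176, 144, 90))` reproduces the "foreman" figure of about 15.61%.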
Check Skipped Blocks After AC Prediction
Since the previous simple check cannot precisely identify the blocks to be skipped, we add a check after the prediction of AC coefficients is completed. As in the previous method, we still need to check, through the CBP in the MB header, whether the block data is variable-length encoded. If the corresponding bit in CBP is zero and all the predicted AC coefficients are zero, we can skip this block.
Consequently, we can find all the blocks that can possibly be skipped, but the effort also increases because more conditions have to be checked. We again run a simulation on a PC to get the percentage of skipped blocks in 90 intra-encoded frames. The simulation results are listed in Table 5.2.
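The check after AC prediction can be sketched in C as follows. This is an illustration under stated assumptions, not the PACDSP code: the names are hypothetical, the block is stored in raster order, vertical prediction updates the seven AC coefficients of the first row and horizontal prediction those of the first column, and any rescaling of the predictor by the quantizer ratio is left out. The reconstruction counts the nonzero results so that the skip decision needs no extra pass over the block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_PRED_AC 7  /* AC coefficients covered by one prediction */

typedef enum { AC_PRED_VERTICAL, AC_PRED_HORIZONTAL } ac_pred_dir;

/* Adds the predictor to the first row (vertical) or first column
 * (horizontal) of a raster-ordered 8x8 block and returns how many of the
 * seven reconstructed AC coefficients are nonzero. */
int add_ac_prediction(int16_t block[64], const int16_t pred[N_PRED_AC],
                      ac_pred_dir dir)
{
    assert(block != NULL && pred != NULL);
    int stride = (dir == AC_PRED_VERTICAL) ? 1 : 8;
    int nonzero = 0;
    for (int i = 1; i <= N_PRED_AC; i++) {
        int16_t *c = &block[i * stride];
        *c = (int16_t)(*c + pred[i - 1]);
        if (*c != 0)
            nonzero++;
    }
    return nonzero;
}

/* A block is skipped if it has no coded coefficients (CBP bit is zero)
 * and AC prediction produced no nonzero AC coefficient. */
int can_skip_after_ac_pred(int coded_bit, int nonzero_ac)
{
    return coded_bit == 0 && nonzero_ac == 0;
}

/* Demo: an uncoded block with DC only, predicted vertically with a
 * constant predictor value; returns the resulting skip decision. */
int demo_skip_decision(int16_t pred_value)
{
    int16_t block[64] = {0};
    int16_t pred[N_PRED_AC];
    block[0] = 40;  /* arbitrary DC value for the demo */
    for (int i = 0; i < N_PRED_AC; i++)
        pred[i] = pred_value;
    int nonzero = add_ac_prediction(block, pred, AC_PRED_VERTICAL);
    return can_skip_after_ac_pred(0, nonzero);
}
```

With a zero predictor the block stays DC-only and is skipped; with any nonzero predictor the full dequantization and IDCT path is taken.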
Compared to Table 5.1, we can see in Table 5.2 that the percentage of skipped blocks gets higher with the aid of the new check. Furthermore, the test sequence “grandmother” becomes the one with the most skipped blocks, and it is expected that the performance of this optimization should be highly related to the simulation results listed in Tables 5.1 and 5.2.
Table 5.1: Number of Skipped Blocks in 90 Intra Frames (Check CBP and ACPred Flag Only)
Test Seq. (QCIF)    Total Blocks    Skipped Blocks    %
grandmother         53,460          4,106             7.78
stefan              53,460          2,041             3.82
foreman             53,460          8,343             15.61
akiyo               53,460          6,574             12.30
mobile              53,460          1,422             2.66
football            53,460          5,568             10.42
Conclusion of Optimization for Intra Frames
Based on the analysis of the frequency of skipped blocks in intra-encoded frames, we apply the proposed methods to our implementation on PACDSP. The simulation results are listed in Table 5.3. Note that the execution time is measured on the first encoded frame and is not the average over 90 frames.
In Table 5.3, we can see that the performance of the optimization varies from one sequence to another. The percentage of speedup on PACDSP is less than the percentage of skipped blocks, and this phenomenon can be explained by Amdahl’s Law [7]. In other words, skipping a block does not reduce any computation other than dequantization and IDCT, and we also need more cycles for the condition checking.
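To make the Amdahl's Law argument concrete, the relation can be written out as below. The symbols are introduced here for illustration and do not appear in the original analysis: $f$ is the fraction of decoding time spent in dequantization and IDCT, $r$ is the fraction of blocks skipped, and $c$ is the relative cost of the added condition checks and DC spreading.

```latex
% General form of Amdahl's Law: a fraction p of the work is sped up by a factor s.
S = \frac{1}{(1 - p) + \dfrac{p}{s}}

% Applied to block skipping (f, r, c are illustrative symbols):
\frac{T_{\text{new}}}{T_{\text{old}}} = 1 - f\,r + c,
\qquad
S = \frac{1}{1 - f\,r + c}

% Since f < 1 and c \ge 0, the relative time saved is bounded by r:
1 - \frac{T_{\text{new}}}{T_{\text{old}}} = f\,r - c < r
```

With purely hypothetical values $f = 0.3$, $r = 0.15$ and $c = 0$, the time saved would be only $0.045$, that is 4.5%, even though 15% of the blocks are skipped.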
In conclusion, the above algorithmic optimization for intra-frame decoding is severely limited by the nature of the test sequences. To further improve the performance, we will take advantage of the VLIW architecture and SIMD instructions. The architectural optimization methods will be introduced and applied in the next section.
Vertical_AC_Reconstruction()
; vertical, top row of block C
; 7 elements, so unroll the loop to 2 clusters
; A2 is Q_block, A3 is P_Coeff (AC)
; cluster1 1, cluster2 5
{ NOP | ADDI A2,A2,4 | CLR D12 | ADDI A2,A2,20 | CLR D12 }
; D13 is index of Pred_A
{ NOP | ADDI A3,A3,4 | NOP | ADDI A3,A3,20 | NOP }
{ NOP | MOVI.L A7,Fake_AC_Pred | (p7)ADDI D12,D12,1 | MOVI.L A7,Fake_AC_Pred | (p9)ADDI D12,D12,1 }
{ NOP | MOVI.H A7,Fake_AC_Pred | NOP | MOVI.H
Figure 5.3: Assembly code of new check in vertical AC reconstruction.