Optimization for Image Interpolation and Padding

Implementation and Optimization for ARM9

5.1 Algorithmic Optimization

5.1.3 Optimization for Image Interpolation and Padding

Analysis of Image Interpolation

Each macroblock must be interpolated because its motion vector may be a non-integer number. In the original motion compensation process of the decoder (see Figure 4.2), we interpolate the whole padded image and multiply the motion vector by 2. However, interpolating the whole image consumes many cycles on ARM, and the effort is wasted wherever the motion vector is an integer. A second problem is the storage requirement of 146,432 bytes, which is large for our implementation. Therefore, we propose a block-based interpolation method that is applied only when the motion vector is not an integer.

Table 5.5: Execution Time of Inter (P) Frame Decoding on ARM9

Test Seqs.     Execution Time (cycles)
(QCIF)         Original      CBP Checked   Speedup (%)
grandmother    18,600,472    12,428,187    33.18
stefan         20,837,319    19,004,118     8.80
foreman        19,114,456    16,360,197    14.41

Under the above approach, interpolation can be divided into four categories as follows, where MVx is the horizontal motion and MVy is the vertical motion of a block:

Both MVx and MVy are integer numbers.

MVx is a half-integer number and MVy is an integer number.

MVy is a half-integer number and MVx is an integer number.

Both MVx and MVy are half-integer numbers.

If both MVx and MVy are integer numbers, we can avoid the interpolation process.

Moreover, we can interpolate only the horizontal direction, only the vertical direction, or both according to the category of the block.
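The four categories can be sketched in C as follows. This is a minimal illustration and not the decoder's actual code: the function name `predict_block`, the 8x8 block size, and the assumption that motion vectors are stored in half-pel units (so the lowest bit flags a half-integer component) are all hypothetical. Rounding control is taken as 0, and the reference is assumed to be readable one row and one column beyond the block.

```c
#include <stdint.h>

enum { BLK = 8 };  /* block size assumed for this sketch */

/* Predict one BLK x BLK block at (x, y). mvx and mvy are assumed to be
 * in half-pel units; dst is a packed BLK x BLK buffer. */
void predict_block(const uint8_t *ref, int stride, int x, int y,
                   int mvx, int mvy, uint8_t *dst)
{
    int half_x = mvx & 1;             /* 1 if MVx is a half-integer */
    int half_y = mvy & 1;             /* 1 if MVy is a half-integer */
    int ix = x + (mvx - half_x) / 2;  /* integer part, rounded down */
    int iy = y + (mvy - half_y) / 2;
    const uint8_t *src = ref + iy * stride + ix;
    int r, c;

    switch ((half_y << 1) | half_x) {
    case 0:  /* both integer: plain copy, no interpolation */
        for (r = 0; r < BLK; r++, src += stride, dst += BLK)
            for (c = 0; c < BLK; c++)
                dst[c] = src[c];
        break;
    case 1:  /* MVx half-integer: horizontal interpolation only */
        for (r = 0; r < BLK; r++, src += stride, dst += BLK)
            for (c = 0; c < BLK; c++)
                dst[c] = (uint8_t)((src[c] + src[c + 1] + 1) >> 1);
        break;
    case 2:  /* MVy half-integer: vertical interpolation only */
        for (r = 0; r < BLK; r++, src += stride, dst += BLK)
            for (c = 0; c < BLK; c++)
                dst[c] = (uint8_t)((src[c] + src[c + stride] + 1) >> 1);
        break;
    default: /* both half-integer: average of four neighbours */
        for (r = 0; r < BLK; r++, src += stride, dst += BLK)
            for (c = 0; c < BLK; c++)
                dst[c] = (uint8_t)((src[c] + src[c + 1] + src[c + stride]
                                    + src[c + stride + 1] + 2) >> 2);
        break;
    }
}
```

Selecting the category once per block keeps the test out of the pixel loops, so an integer motion vector costs only a copy.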

To see how much can be saved, we count the number of motion vectors that are half-integer in the horizontal direction, the vertical direction, or both. The results are listed in Table 5.6 (from [15]). In Table 5.6, “Both” means that both the horizontal and the vertical motion are fractional. “MVx” and “MVy” mean that the motion vector is fractional only in the horizontal or only in the vertical direction, respectively.

From Table 5.6, we can also learn more about the individual sequences, in particular their directions of motion. Moreover, we see that more than 50% of the interpolation can be avoided in four of the six test sequences. Thus, if we check the characteristics of the motion vectors before the luminance and chrominance motion compensation of each block, many computations can be saved. The motion compensation flow then becomes as shown in Figure 5.2.

Analysis of Padding

The objective of padding is to obtain more accurate motion estimation. Padding is necessary for motion vectors that point outside the frame. However, much time is spent repeatedly copying the edge values to the exterior regions, and the padded frame requires 36,608 bytes of storage, which is large for our implementation.
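The two storage figures quoted so far can be reproduced by a short calculation. It assumes a QCIF luminance plane of 176 x 144 pixels at one byte per pixel, a border of 16 pixels on every side, and an interpolated image with twice the resolution in each direction. None of these values is stated in the text; they are assumptions chosen because they match the quoted numbers:

```latex
% Assumption: QCIF luminance, 176 x 144, 16-pixel border, 1 byte per pixel
\begin{align*}
(176 + 2 \cdot 16) \times (144 + 2 \cdot 16) &= 208 \times 176 = 36{,}608 \text{ bytes (padded frame)} \\
(2 \cdot 208) \times (2 \cdot 176) &= 416 \times 352 = 146{,}432 \text{ bytes (interpolated padded frame)}
\end{align*}
```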

Table 5.6: Analysis of Necessary Interpolation (from [15])

Bitstream     Total MV    Half-integer MV
(QCIF)        Number      Total     %        Both      %        MVx       %        MVy      %
grandmother   18,204       2,064    11.34       550     3.02       497     2.73    1,017     5.59
stefan        33,744      15,385    45.59     1,954     5.79    10,478    31.05    2,953     8.75
foreman       34,128      15,585    45.67     4,658    13.65     5,994    17.56    4,933    14.45
akiyo         13,552       1,225     9.04       120     0.89       144     1.06      961     7.09
mobile        35,192      21,663    61.56     1,697     4.82    15,933    45.27    4,033    11.46
football      34,604      27,031    77.23    11,164    32.26     9,198    26.58    6,669    19.27

Moreover, in the original MoMuSys code, the frame that needs to be padded is copied to the center of another, bigger frame and padded later. This copy operation is necessary because the original frame size and the order of pixels are fixed in the MoMuSys code. Consequently, the overhead of calculating addresses is considerable. If we can avoid copying the whole image, or even reduce the number of padding operations, the performance will improve.

Our approach is to skip the padding process and instead check whether the target pixel is outside the frame. If it is, we use the value of the edge pixel for motion compensation. Take Figure 5.3 as an example. If the target pixel is at one of the positions marked a’, we use the value at position a. The same holds for b’ and b, h’ and h, etc. Hence, we can skip the padding operation entirely at the cost of some extra computation in block compensation.
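The edge-pixel substitution can be sketched as a coordinate clamp. The helper names `clamp_int` and `ref_pixel` and the row-major, unpadded frame layout are assumptions made for illustration, not taken from the MoMuSys code:

```c
#include <stdint.h>

/* Limit v to the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Fetch a reference pixel from an unpadded, row-major frame.
 * A coordinate outside the frame is mapped to the nearest edge
 * pixel, which is the value padding would have copied there. */
uint8_t ref_pixel(const uint8_t *frame, int width, int height, int x, int y)
{
    x = clamp_int(x, 0, width - 1);
    y = clamp_int(y, 0, height - 1);
    return frame[y * width + x];
}
```

A position above and to the left of the frame maps to the corner pixel, while a position directly above the frame maps to the top-row pixel in the same column. In practice the check would likely be done once per block rather than per pixel, so that blocks lying entirely inside the frame pay no clamping cost.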

Finally, the flow of motion compensation becomes as shown in Figure 5.4.

Experiment Results

Based on the above analysis, we know that both interpolation and padding are time-critical processes because they are pixel-by-pixel operations. Their execution times per decoded P-frame and their storage requirements are shown in Table 5.7.

Figure 5.2: Modified flow of motion compensation with optimized interpolation.

Table 5.7: Execution Time and Storage Requirement of Image Interpolation and Padding on ARM9

Operation        Time (cycles)   Storage (bytes)
Interpolation    1,184,399       146,432
Padding          1,534,275        36,608
Total            2,718,674       183,040

For optimization, we alter the decoding flow to that of Figure 5.4 and further optimize our code using the following methods:

taking computations out of the loops as much as possible, and

changing division operations into shift operations.

Note that the methods listed above are common program optimization methods. They suit our design because the interpolation and padding loops account for a large number of cycles. Take Figure 5.5 as an example. It is a double for-loop, and the pixels in the vertical direction need to be padded. The computation of the variable “destination” is placed outside the second loop, which saves some computations. In other words, if the computation of “destination” were inside the second loop, the same computation would be done 8 times. Moreover, the interpolation is done by a shift operation instead of a division by 2.

Figure 5.3: Example of padding in the upper-left corner of a frame.
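Since Figure 5.5 is not reproduced here, the two techniques can be illustrated with a hypothetical double loop. Only the variable name `destination` comes from the text. The function name, the 8x8 block size, and the vertical half-pel averaging are assumptions made for the sketch:

```c
#include <stdint.h>

/* Vertical half-pel interpolation of an 8x8 block (illustrative only).
 * src must provide 9 readable rows of 8 pixels. */
void interp_vertical_8x8(const uint8_t *src, int src_stride,
                         uint8_t *dst, int dst_stride)
{
    int row, col;

    for (row = 0; row < 8; row++) {
        /* Hoisted: the row addresses are computed once per row here,
         * not 8 times inside the inner loop. */
        uint8_t *destination = dst + row * dst_stride;
        const uint8_t *upper = src + row * src_stride;
        const uint8_t *lower = upper + src_stride;

        for (col = 0; col < 8; col++) {
            /* Shift by 1 replaces the division by 2. */
            destination[col] = (uint8_t)((upper[col] + lower[col] + 1) >> 1);
        }
    }
}
```

A modern optimizing compiler may perform both transformations on its own, but writing them explicitly makes the result independent of the compiler and its settings.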

With our approach, the execution times of interpolation and padding are reduced, and their storage requirements are eliminated entirely. The time for calculating addresses is also reduced: because the previous frame now has the same size as the current frame, part of the address calculation can be applied to both frames.

The simulation results are listed in Table 5.8. The speedup is significant, ranging from 34.07% to 53.65%, and it is largest for grandmother.

The speedup differs among sequences because their motion vectors differ considerably in type. In grandmother, no motion vector points outside the frame, and the sequence contains relatively little motion, so it has fewer fractional motion vectors. In conclusion, the speedup of the proposed method also depends on the characteristics of the test sequence. For further optimization, we focus on the modes of motion compensation and discuss our approach in the next section.

Figure 5.4: Modified flow of motion compensation with optimized padding.

Table 5.8: Execution Time of P-Frame Decoding on ARM9 After Modification of Interpolation and Padding

Test Seqs.     Execution Time (cycles)
(QCIF)         Original      Optimized     Speedup (%)
grandmother    12,428,187     5,760,143    53.65
stefan         19,004,118    12,528,558    34.07
foreman        16,360,197     9,829,931    39.92
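The speedup column is consistent with the relative reduction in cycle count. The symbols below are introduced here for illustration only and are not the author's notation:

```latex
% T_orig, T_opt: cycle counts before and after the modification (notation assumed)
\[
\text{Speedup} = \frac{T_{\text{orig}} - T_{\text{opt}}}{T_{\text{orig}}} \times 100\%,
\qquad
\text{grandmother: } \frac{12{,}428{,}187 - 5{,}760{,}143}{12{,}428{,}187} \times 100\% \approx 53.65\%
\]
```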

From the document: Software Decoding of MPEG-4 Video Using an ARM9 Processor (pages 74-79)