4.2 Low-Level Computational Analysis

In the following analysis, we examine the critical functions of each component coder to identify where the computational effort is greatest and how much instruction-level parallelism is available to the VLIW architecture and SIMD instructions of the PACDSP. In the analysis tables, “Function Name” indicates each function block of the coder, “Cycles Estimation” is the estimated number of execution cycles for the block, “Instruction Counts” is the number of instructions needed for the block, and “Parallelism” denotes the average number of instructions executed in parallel per cycle on the VLIW processor.
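As a small worked example of the “Parallelism” column, the sketch below divides an instruction count by a cycle count. The function name `parallelism` is hypothetical, and the numbers are taken from the “Load MB” row of Table 4.9, using the divisor exactly as printed there.

```python
def parallelism(instruction_count: int, cycles: int) -> float:
    """Average number of instructions executed per cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instruction_count / cycles


# "Load MB" row of Table 4.9: 38 instructions over the printed divisor of 13.
load_mb = parallelism(38, 13)
print(f"Load MB parallelism: {load_mb:.1f}")  # prints "Load MB parallelism: 2.9"
```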

4.2.1 Motion Coder Analysis

Motion estimation is one of the most important components of the video encoder and significantly affects both encoding speed and image quality. Our main goal is to reduce the computational complexity of these functions. Table 4.7 summarizes the major functions in the motion coder and each function's share of the total motion estimation time, as measured on the ADS.

Table 4.6: Profile of Object-Based MPEG-4 Encoding of QCIF P-VOP on ADS [10]

                      foreman qcif          akiyo qcif            stefan qcif
Function Name         Clockticks       %    Clockticks       %    Clockticks       %
MotionEstimation      79,675,422   50.20    48,952,190   45.19    24,251,478   41.60
FullPelMotionEstMB    71,951,245   45.34    40,752,077   37.62    22,069,388   37.86
FindSubPel             7,703,016    4.85     8,183,547    7.55     2,174,324    3.73
TextureCoding         37,139,540   23.40    38,101,536   35.17    10,856,089   18.62
BlockDCT              16,004,337   10.08    15,611,662   14.41     4,768,944    8.18
BlockIDCT             16,252,774   10.24    16,749,806   15.46     4,564,249    7.83
ShapeCoding           35,191,526   22.17    12,907,962   11.91    18,419,663   31.60
ShapeInterMB          30,833,225   19.43    10,436,508    9.63    15,631,836   26.82
CAE MB                 3,231,739    2.04     1,636,398    1.51     2,351,171    4.03
Others                 6,694,822    4.22     8,372,839    7.73     4,764,480    8.17
Total                158,701,310  100.00   108,334,527  100.00    58,291,710  100.00

The reference search method is a full search in raster-scan order, with a check for early termination after each row. The search range is [−16, 16) and the motion vector is specified to half-pixel accuracy. As Table 4.7 shows, the critical function is “SAD MB,” which calculates the SAD (sum of absolute differences) of a 16×16 MB at integer-pixel displacements. After the MB motion vector has been found, an additional search is made for each 8×8 block. The integer block motion estimation uses the MB motion vector as the search center, with a search range of ±2 pixels. “SAD Block” is the function that calculates the SAD of an 8×8 block.
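The full search with row-wise early termination described above can be sketched as follows. This is a minimal illustration in Python, not the encoder's code: the names `sad_mb` and `full_search` are hypothetical, frames are plain lists of pixel rows, "row" is read here as a row of pixels inside the block, and candidates that fall outside the reference frame are simply skipped, which is an assumption made for simplicity.

```python
def sad_mb(cur, ref, cx, cy, rx, ry, best_so_far, size=16):
    """SAD between the size x size block of `cur` at (cx, cy) and of `ref`
    at (rx, ry). Returns None if the running sum reaches `best_so_far`
    after any row (early termination)."""
    total = 0
    for row in range(size):
        cur_row = cur[cy + row]
        ref_row = ref[ry + row]
        for col in range(size):
            total += abs(cur_row[cx + col] - ref_row[rx + col])
        if total >= best_so_far:  # checked once per row
            return None
    return total


def full_search(cur, ref, cx, cy, search_range=16, size=16):
    """Integer-pixel full search in raster-scan order over [-R, R)."""
    height, width = len(ref), len(ref[0])
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range):
        for dx in range(-search_range, search_range):
            rx, ry = cx + dx, cy + dy
            # Assumption: skip candidates outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            sad = sad_mb(cur, ref, cx, cy, rx, ry, best_sad, size)
            if sad is not None:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```

Calling `full_search(cur, ref, 16, 16)` returns the best displacement `(dx, dy)` for the MB whose top-left corner is at (16, 16), together with its SAD. The same `sad_mb` routine with `size=8` and `search_range=2` would approximate the per-block refinement, although the source uses a symmetric ±2 range there.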

Table 4.7: Major Functions in Motion Estimation (ME) [10]

                 Execution Time Percentage in Total for ME
Function Name    foreman qcif    akiyo qcif    stefan qcif
Obtain SR               0.40%         0.61%          0.26%
SAD MB                 81.65%        71.07%         83.38%
SAD Block               3.16%         3.86%          2.43%
ChooseMode              0.53%         0.85%          0.48%
FindSubPel              9.67%        16.72%          7.78%
Others                  4.59%         6.89%          5.67%

To cut the computation spent on the searched displacements by raising the probability of early termination, we replace the original raster-scan order with a spiral search. Experience shows that most motions are within ±5 pixels, so the spiral search can reduce the complexity of the SAD calculation by making early termination more frequent. Fig. 4.2 shows the concept of spiral search.
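One way to generate a spiral (tier-by-tier) scan order is sketched below. It is an assumption-laden illustration: the generator name `spiral_order` is hypothetical, a tier is taken to be the square ring of displacements whose larger absolute component equals the tier index, and the order of points inside a ring is arbitrary. The source does not specify the exact traversal.

```python
def spiral_order(search_range=16):
    """Yield (dx, dy) displacements tier by tier, starting at (0, 0).

    Tier t holds the displacements with max(|dx|, |dy|) == t. Points are
    clipped to the half-open range [-search_range, search_range).
    """
    lo, hi = -search_range, search_range - 1
    yield (0, 0)  # tier 0: the search center
    for tier in range(1, search_range + 1):
        ring = []
        for dx in range(-tier, tier + 1):      # top and bottom edges
            ring.append((dx, -tier))
            ring.append((dx, tier))
        for dy in range(-tier + 1, tier):      # left and right edges
            ring.append((-tier, dy))
            ring.append((tier, dy))
        for dx, dy in ring:
            if lo <= dx <= hi and lo <= dy <= hi:
                yield (dx, dy)


points = list(spiral_order(16))
print(len(points))  # 1024 candidates, the same set as the raster scan
```

Because small displacements are visited first, a good match tends to be found early, so later candidates are more likely to exceed the best SAD after only a few rows. With this ordering, the first 121 candidates are exactly those within ±5 pixels.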

Table 4.8 shows the percentage of early termination in the SAD calculation under two scan orders: raster-scan order and spiral order. Three test sequences with different motion characteristics are used, each run for 10 inter frames on the ADS. In all three sequences, spiral order raises the early-termination rate by roughly 17 to 25 percentage points.

According to Table 4.9, most of the computation in the motion coder is spent in the MB motion search, whose critical component is “SAD MB.” Reducing the SAD calculation therefore directly improves the efficiency of motion estimation.

Table 4.8: Percentage of Early Termination in SAD Calculation Under Different Scan Orders [10]

Scan Order           foreman qcif    akiyo qcif    stefan qcif
Raster-scan order          46.62%        55.33%         43.24%
Spiral order               66.00%        80.66%         60.37%

Figure 4.2: Concept of spiral search. (Figure labels: Search Window; Tier 0, Tier 1, Tier 2, Tier 3, Tier 4.)

Table 4.9: Motion Coder Analysis on PACDSP

Function Name       Cycles Estimation                      Instruction Counts   Parallelism
Load MB             5+8x8                                  38                   38/13=2.9
Count MB Number     7+16x21                                80                   80/28=2.9
Search Range        30                                     85                   85/30=2.8
MB Motion Search    32+SAD MB+5x25+121x(27+SAD MB)         427                  427/169=2.5
Compute 8x8 MV      4x(169+SAD Block+25x(28+SAD Block))    361                  361/130=2.8
Others              22                                     77                   77/22=3.5
SAD MB              2+8x35                                 63                   63/37=1.7
SAD Block           2+2x34                                 64                   64/36=1.8
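To show how the “SAD MB” and “SAD Block” terms dominate the estimates, the sketch below evaluates the cycle expressions of Table 4.9 exactly as printed. It assumes that every SAD call runs to completion (no early termination); the variable names are hypothetical.

```python
# Cycle expressions copied from Table 4.9; "x" in the table is multiplication.
sad_mb = 2 + 8 * 35        # 282 cycles per full 16x16 SAD
sad_block = 2 + 2 * 34     # 70 cycles per full 8x8 SAD

# Assumption: each SAD call is charged its full cost.
mb_motion_search = 32 + sad_mb + 5 * 25 + 121 * (27 + sad_mb)
compute_8x8_mv = 4 * (169 + sad_block + 25 * (28 + sad_block))

print(mb_motion_search)    # 37828
print(compute_8x8_mv)      # 10756

# Share of the MB motion search estimate that is spent inside SAD MB calls.
sad_share = (1 + 121) * sad_mb / mb_motion_search
print(f"{sad_share:.1%}")
```

Under this assumption, SAD calls account for about nine tenths of the MB motion search estimate, which is consistent with “SAD MB” being the critical function.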

4.2.2 Shape Coder Analysis

In lossless shape coding with context-based arithmetic encoding (CAE), there are four modes, each supported by different VOP types, as Table 4.10 shows. In I-VOP coding, only two modes are available for ShapeCoding, and Table 4.5 shows that the CAE operation takes much of the time spent in shape coding. In P-VOP coding, all four CAE modes are available for ShapeCoding. As shown in Table 4.6, the function “ShapeInterMB” may occupy about 10% to 30% of the execution time in P-VOP encoding, depending on the motion characteristics. Because the CAE algorithm has a complicated coding procedure and strong data dependencies, it is hard to exploit the parallel processing capability of the PACDSP. We will focus on the analysis and optimization of “ShapeInterMB.”

Table 4.11 shows the execution cycles, instruction counts, and parallelism of the functions within “ShapeInterMB.” It shows that “AlphaMotionEstimation” is the most time-consuming of these functions, because it performs a full search on the binary alpha plane.
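A full search on a binary alpha plane can be illustrated with a mismatch count computed by XOR and bit counting. The sketch below is a generic illustration and not the encoder's routine: the names `popcount`, `pack_rows`, `alpha_sad`, and `alpha_full_search` are hypothetical, the ±16 search range and the skipping of out-of-frame candidates are assumptions, and the predictor-based starting point used in real shape coding is omitted.

```python
def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def pack_rows(plane, x, y, size=16):
    """Pack each row of a size x size binary block into one integer."""
    rows = []
    for row in plane[y:y + size]:
        bits = 0
        for pixel in row[x:x + size]:
            bits = (bits << 1) | (pixel & 1)
        rows.append(bits)
    return rows


def alpha_sad(cur_rows, ref_rows):
    """Count of mismatching pixels between two packed binary blocks."""
    return sum(popcount(c ^ r) for c, r in zip(cur_rows, ref_rows))


def alpha_full_search(cur, ref, cx, cy, search_range=16, size=16):
    """Exhaustive search for the displacement with the fewest mismatches."""
    height, width = len(ref), len(ref[0])
    cur_rows = pack_rows(cur, cx, cy, size)
    best_err, best_mv = size * size + 1, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Assumption: skip candidates outside the reference plane.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            err = alpha_sad(cur_rows, pack_rows(ref, rx, ry, size))
            if err < best_err:
                best_err, best_mv = err, (dx, dy)
    return best_mv, best_err
```

Packing a row of binary pixels into one word lets a single XOR compare many pixels at once, which is the kind of data-level parallelism that SIMD instructions can exploit; whether PACDSP code is organized this way is not stated in the source.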

4.2.3 Texture Coder Analysis

The floating-point DCT and IDCT of the texture coder are time-consuming functions. Implementing the transforms in fixed point is essential for the PACDSP; we discuss this subject in the next chapter. Because MPEG-4 uses a block-based coding structure, we can distribute the texture coding operations across the two clusters and run them simultaneously. Table 4.12 shows that the program can almost fully utilize the processor units, except in some program loops and branch conditions.
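To illustrate what moving a transform from floating point to fixed point involves, the sketch below compares a direct 8-point DCT-II in floating point with an integer version that uses pre-scaled coefficients. It is a generic illustration only: the Q13 coefficient format, the direct matrix form, and the names `dct8_float` and `dct8_fixed` are assumptions and are not the fixed-point design discussed in the next chapter.

```python
import math

FRAC_BITS = 13  # assumed Q13 coefficient format (hypothetical choice)


def _coeff(k: int, n: int) -> float:
    """Orthonormal 8-point DCT-II basis value for output k, input n."""
    scale = math.sqrt(0.5) if k == 0 else 1.0
    return 0.5 * scale * math.cos((2 * n + 1) * k * math.pi / 16)


def dct8_float(x):
    """Reference 8-point DCT in floating point."""
    return [sum(_coeff(k, n) * x[n] for n in range(8)) for k in range(8)]


# Integer coefficient table, scaled by 2**FRAC_BITS and rounded once.
COEFFS = [[round(_coeff(k, n) * (1 << FRAC_BITS)) for n in range(8)]
          for k in range(8)]


def dct8_fixed(x):
    """8-point DCT using only integer multiply, add, and shift."""
    half = 1 << (FRAC_BITS - 1)  # rounding offset before the shift
    return [(sum(c * v for c, v in zip(row, x)) + half) >> FRAC_BITS
            for row in COEFFS]


sample = [52, -30, 255, -255, 7, 0, 128, -64]
for f, q in zip(dct8_float(sample), dct8_fixed(sample)):
    print(f"{f:9.3f} {q:5d}")
```

For inputs in the range −255 to 255, the integer outputs stay within one unit of the floating-point results, because the coefficient rounding contributes at most about 0.13 and the final shift at most 0.5.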

Table 4.10: CAE Modes and Associated VOP Types

Mode   Intra / Inter MC   Scanning     Supporting VOP
1      Intra              Horizontal   I-VOPs and P-VOPs
2      Intra              Vertical     I-VOPs and P-VOPs
3      Inter MC           Horizontal   P-VOPs
4      Inter MC           Vertical     P-VOPs

Table 4.11: Analysis of the ShapeInterMB Function on PACDSP

Function Name           Cycles Estimation                 Instruction Counts   Parallelism
Initial PredAlpha MB    4x10                              13                   13/10=1.3
FindMVP                 74                                156                  156/74=2.1
FindPredAlpha4MC        8+16x(7+16x10)                    54                   54/25=2.2
Error detection         1+8x20                            54                   54/21=2.6
AlphaMotionEstimation   5+16x(13+16x(29+16x(38+8x6)))     162                  162/91=1.8
AMVbits                 28+8x6+8x6+8x62                   77                   77/22=3.5
Find18x18PredAlphaMC    15+18x(14+9x7)                    89                   89/36=2.5
Others                  24                                47                   47/24=1.9

Table 4.12: Texture Coder Analysis on PACDSP

Function Name       Cycles Estimation   Instruction Counts   Parallelism
BlockDCT            3+4x82              287                  287/85=3.4
BlockQuantH263      20+8x31             145                  145/51=2.8
DCSpreading         7+7x9               35                   35/14=2.5
BlockDequantH263    1+8x31              114                  114/32=3.6
BlockIDCT           9+4x74              276                  276/83=3.3
Clipping            21                  72                   72/21=3.4