
Chapter 4.   Motion Compensation Design

4.1.1.  Motion Vector Predictor

Since the motion vectors of neighboring partitions are often highly correlated, the motion vector of each block is predicted from previously coded partitions, and only the prediction error is transmitted in the H.264/AVC standard to reduce the bit rate. In the motion vector prediction process, the motion vector predictor (MVp) is generated first and then added to the decoded motion vector difference (mvd). The derivation process for MVp is described below and shown in Fig. 4.3:

• For macroblock partitions excluding 16x8 and 8x16 partition sizes: MVp is the median of the motion vectors for partitions A, B and C.

• For 16x8 partitions: MVp for the upper 16x8 block is predicted from B, MVp for the lower is predicted from A.

• For 8x16 partitions: MVp for the left 8x16 block is predicted from A, MVp for the right is predicted from C.


Fig. 4.3. Motion vector predictor scheme (a) macroblock partitions excluding 16x8 and 8x16 partition sizes, (b) 16x8 partitions, (c) 8x16 partitions
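The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the neighbors A, B, and C are all available and use the same reference picture; the availability and reference-index special cases of the standard are left out, and the names `predict_mv` and `reconstruct_mv` are hypothetical.

```python
def median3(a, b, c):
    """Median of three integers."""
    return a + b + c - min(a, b, c) - max(a, b, c)


def predict_mv(partition, index, mv_a, mv_b, mv_c):
    """Return MVp as an (x, y) tuple.

    partition: "16x8", "8x16", or any other partition size string.
    index:     0 for the upper/left block, 1 for the lower/right block.
    mv_a, mv_b, mv_c: (x, y) motion vectors of neighbors A, B, and C.
    """
    if partition == "16x8":
        return mv_b if index == 0 else mv_a   # upper from B, lower from A
    if partition == "8x16":
        return mv_a if index == 0 else mv_c   # left from A, right from C
    # All other partition sizes: component-wise median of A, B, and C.
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def reconstruct_mv(mvp, mvd):
    """Add the decoded motion vector difference to the predictor."""
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

For example, with A = (1, 5), B = (3, -2), and C = (2, 9), a 16x16 partition gets the component-wise median (2, 5) as its predictor.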


generated by interpolation of the integer-pixels. The interpolation method is based on a 6-tap FIR filter with tap values (1, -5, 20, 20, -5, 1). Fig. 4.4 shows the interpolation scheme of the luminance component. The half-pixels b, h, m, and s are derived by applying the 6-tap FIR filter with integer-pixels as inputs.

b1 = ( E − 5 * F + 20 * G + 20 * H − 5 * I + J )    (4.1)

b = Clip1Y( ( b1 + 16 ) >> 5 )    (4.2)

The half-pixel j is obtained by first calculating the intermediate values of the six half-pixel locations in the horizontal or vertical direction then applying 6-tap FIR filter with these intermediates as shown in equation (4.3) to (4.5).

j1 = cc − 5 * dd + 20 * h1 + 20 * m1 − 5 * ee + ff, or    (4.3)

j1 = aa − 5 * bb + 20 * b1 + 20 * s1 − 5 * gg + hh    (4.4)

j = Clip1Y( ( j1 + 512 ) >> 10 )    (4.5)

Table 4.1 shows the bit-width of data during luma interpolation. Notice that the input bit-width of the interpolation process for half-pixel j is 15 bits. Fortunately, equations (4.3) to (4.5) can be simplified by truncating the intermediates to 8 bits, which makes the implementation much easier with a negligible quality degradation of about 0.01 dB [16].

j1 = cc’ − 5 * dd’ + 20 * h + 20 * m − 5 * ee’ + ff’, or    (4.6)

j1 = aa’ − 5 * bb’ + 20 * b + 20 * s − 5 * gg’ + hh’    (4.7)

j = Clip1Y( ( j1 + 16 ) >> 5 )    (4.8)
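The half-pixel equations can be sketched in Python as follows. This is an illustrative model only, not the hardware datapath; the function names are hypothetical, and each function takes the six input samples in filter order.

```python
def clip1y(value):
    """Clip a luma value to the 8-bit range [0, 255]."""
    return max(0, min(255, value))


def six_tap(samples):
    """Unnormalized 6-tap FIR sum with taps (1, -5, 20, 20, -5, 1)."""
    e, f, g, h, i, j = samples
    return e - 5 * f + 20 * g + 20 * h - 5 * i + j


def half_pel(integer_pixels):
    """Equations (4.1) and (4.2): half-pixel from six integer pixels."""
    return clip1y((six_tap(integer_pixels) + 16) >> 5)


def half_pel_j(intermediates):
    """Equations (4.3) to (4.5): j from six unrounded intermediate sums."""
    return clip1y((six_tap(intermediates) + 512) >> 10)


def half_pel_j_simplified(rounded_half_pels):
    """Equations (4.6) to (4.8): j from six intermediates that were
    already rounded and clipped to 8 bits."""
    return clip1y((six_tap(rounded_half_pels) + 16) >> 5)
```

On a flat area where every integer pixel is 100, both the full-precision and the simplified path return 100 for j, which illustrates why the simplification costs little quality.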


Fig. 4.5 shows the interpolation scheme of quarter-pixels. The quarter-pixels labeled a, c, d, n, f, i, k, and q are derived by averaging the two nearest integer-pixel and half-pixel samples. The quarter-pixels e, g, p, and r are derived by averaging the two nearest half-pixels in the diagonal direction.

Table 4.1. Bit-width of data during first luma interpolation

Interpolation            Min       Max      Bit-width
x                        0         255      8
-5x                      -1275     0        12
20x                      0         5100     14
Σ                        -2550     10710    15
(Σ + 16) >> 5            -80       335      10
Clip((Σ + 16) >> 5)      0         255      8
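The Σ and shifted rows of Table 4.1 can be reproduced with a short calculation. The sketch below assumes 8-bit unsigned inputs and two's complement storage for signed values; `six_tap_range` and `signed_bits` are hypothetical helper names.

```python
TAPS = (1, -5, 20, 20, -5, 1)


def six_tap_range(lo=0, hi=255):
    """Smallest and largest possible 6-tap sum for inputs in [lo, hi]."""
    smin = sum(w * (hi if w < 0 else lo) for w in TAPS)
    smax = sum(w * (lo if w < 0 else hi) for w in TAPS)
    return smin, smax


def signed_bits(lo, hi):
    """Two's complement width needed to hold every value in [lo, hi]."""
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


sum_min, sum_max = six_tap_range()
# Python's >> on negative integers is an arithmetic (floor) shift.
shift_min, shift_max = (sum_min + 16) >> 5, (sum_max + 16) >> 5
```

Under these assumptions the sum ranges from -2550 to 10710 and needs 15 bits, and the shifted value ranges from -80 to 335 and needs 10 bits, in agreement with the table.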


Fig. 4.4. Interpolation scheme for luminance component (grey blocks represent integer pixels, which are denoted by upper-case letters)

Fig. 4.5. Interpolation scheme of quarter-pel positions for luminance component


Fractional pixels of the chrominance component are derived as a weighted average of the four nearest integer pixels. The interpolation scheme of the chrominance component is shown in Fig. 4.6 and the equation is:

a = ( ( 8 − x ) * ( 8 − y ) * A + x * ( 8 − y ) * B +

( 8 − x ) * y * C + x * y * D + 32 ) >> 6 (4.9)

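[Figure: the fractional chroma sample lies between integer samples A and B (upper row) and C and D (lower row), at horizontal offsets x and 8 − x and vertical offsets y and 8 − y.]

Fig. 4.6. Interpolation scheme for chrominance component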
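Equation (4.9) maps directly to code. The following sketch assumes x and y are the fractional offsets in eighth-sample units (0 to 7) and that A, B, C, and D are the upper-left, upper-right, lower-left, and lower-right integer samples; the function name is hypothetical.

```python
def chroma_sample(a, b, c, d, x, y):
    """Bilinear chroma interpolation of equation (4.9).

    The four weights always sum to 64, so adding 32 and shifting
    right by 6 is a division by 64 with rounding.
    """
    return ((8 - x) * (8 - y) * a
            + x * (8 - y) * b
            + (8 - x) * y * c
            + x * y * d
            + 32) >> 6
```

With x = y = 0 the result is simply A, so integer positions pass through unchanged.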

4.2. Bandwidth Optimization

The high memory bandwidth requirement in motion compensation is the bottleneck in video decoder design. To alleviate this situation, we first reuse the overlapped data inside a partitioned block, and then we reduce the required reference data according to fractional-pel position. These two bandwidth optimization methods are discussed in this Section.


methods of reusing this overlapped data have been realized in several ways [11] [12].

In [11], a Vertical Integrated Double Z (VIDZ) flow adds a 21x64-bit on-chip memory to reuse the vertically and horizontally overlapped regions between two decomposed 4x4 blocks. In [12], a data reuse scheme with hybrid block size memory access from 4x4 to 8x8 is presented, which reuses the overlapped data inside a 4x8, 8x4, or 8x8 block. However, the external data requests of these methods are based on small block sizes such as 4x4 to 8x8, and the numerous external data accesses may increase the latency of the MC hardware.

In a 4x4-block pipeline design, the general memory access scheme loads 9x9 reference pixels as the interpolation window for each decomposed 4x4 block; we call it Conventional 4x4 Based Data Request. Data reuse exists only between two neighboring blocks and requires additional buffers. To increase the reuse rate of overlapped data and reduce the frequency of external data access, the processing element in our proposed design is scaled up to the block partition size, and we call it Block Size Based Data Request. For example, for a 16x8 block consisting of eight 4x4 blocks, instead of requesting a 9x9 block eight times, the reference data request is a single 21x13 block. Both schemes cover the same reference data, but the number of requests is reduced from eight to one and the number of pixels accessed drops from 648 to 273, which is a 58% reduction. The ideal reduction rates corresponding to different partition sizes are shown in Table 4.2.
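The 16x8 example can be generalized to any partition size. The sketch below assumes the worst case in which both MV components are fractional, so that the 6-tap filter needs five extra rows and columns around the block; the function names are hypothetical.

```python
TAP_MARGIN = 5  # extra rows/columns needed by a 6-tap filter


def window_pixels(width, height):
    """Pixels in the interpolation window of a width x height block."""
    return (width + TAP_MARGIN) * (height + TAP_MARGIN)


def conventional_pixels(width, height):
    """Conventional 4x4 Based Data Request: one 9x9 window per 4x4 block."""
    return (width // 4) * (height // 4) * window_pixels(4, 4)


def block_size_pixels(width, height):
    """Block Size Based Data Request: one window for the whole partition."""
    return window_pixels(width, height)


def reduction_rate(width, height):
    """Fraction of reference pixels saved by the block-size-based request."""
    return 1 - block_size_pixels(width, height) / conventional_pixels(width, height)
```

For a 16x8 partition this gives 648 versus 273 pixels, a reduction of about 58%; for a 4x4 partition the two schemes coincide and the reduction is zero.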


Fig. 4.7. Data reuse of case: (a) 4x4 block, (b) 8x4 block, (c) 4x8 block, and (d) 8x8 block, (e) 16x8 block, (f) 8x16 block, (g) 16x16 block (shaded region means reusable)


4.2.2. Precision Based Data Request

A strategy for choosing the interpolation filters and the valid reference data size for different positions has been proposed [18]. In a 4x4-block pipeline design, it is inefficient to load 9x9 reference pixels into the interpolator for every decomposed 4x4 block, since the required reference data need not always be as large as 9x9. Fig. 4.8 shows the integer-pixel and fractional-pixel positions in the H.264/AVC standard, which has quarter-pixel accuracy. To minimize the memory bandwidth of motion compensation, the pixel positions should be classified. In Fig. 4.8, the sub-pixels a, b, and c are located at vertical integer positions and are produced by horizontal interpolation only (vertical interpolation is not used), which means the required reference data can be reduced to 9x4 as shown in Fig. 4.9 (b). Similarly, for the sub-pixels d, h, and n, horizontal interpolation is not used and the required reference data is reduced to 4x9 (Fig. 4.9 (c)). In the case of vertical or horizontal integer positions, the required reference data can therefore be reduced from 9x9 to 9x4, 4x9, or even 4x4. This strategy can be combined with the Block Size Based Data Request in Section 4.2.1. Table 4.3 shows a classification of the required reference data size for different positions with block partition size M x N (M for the width and N for the height of the current partition).

Table 4.3. Reducing required reference data according to different pixel positions with block partition size M x N

Pixel Position    Interpolation Filters    Required Reference Data
G                 None                     M x N
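The classification described in this section can be sketched as a small function. It assumes motion vectors are stored in quarter-pixel units, so that the two least significant bits of each component give the fractional position; the function name and argument order are hypothetical.

```python
def required_reference_size(m, n, mv_x, mv_y):
    """Return (width, height) of the reference data for an M x N partition.

    mv_x, mv_y: motion vector components in quarter-pixel units.
    A fractional component needs 5 extra samples for the 6-tap filter
    in that direction; an integer component needs none.
    """
    width = m + 5 if (mv_x & 3) else m
    height = n + 5 if (mv_y & 3) else n
    return width, height
```

For a 4x4 block this yields 9x9, 9x4, 4x9, or 4x4 depending on which components are fractional, matching the four cases of Fig. 4.9.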

Fig. 4.8. Integer pixels (blocks with upper-case letter) and fractional pixels (blocks with lower-case letter) of luminance


Fig. 4.9. (a) Horizontal and vertical interpolation (x and y components of MV point to fractional positions). (b) Horizontal interpolation only (y component points to integer position). (c) Vertical interpolation only (x component points to integer position). (d) No interpolation (both x and y point to integer positions)

4.2.3. Simulation Results

To verify the effect of the bandwidth optimization methods, a simulation is performed based on our C model with a DDR400 SDRAM model. We compare the memory bandwidth requirements of three configurations: (1) Block Size Based Data Request, (2) Block Size Based Data Request combined with Precision Based Data Request, and a baseline without these bandwidth optimization methods. Six HD 1080p video sequences, an IBBBBBBBP (GOP=8) hierarchical-B prediction structure, and four QP values are used for the simulation. As illustrated in Table 4.4, the results show that about 62-74% of the data bandwidth is saved with these bandwidth optimization methods. With higher QP values, the block partitions tend toward larger sizes, and the bandwidth reduction rate is higher.

Table 4.4. Reduction of data bandwidth (%) in different sequences and QPs

                   QP=16           QP=24           QP=32           QP=40
                   (1)     (2)     (1)     (2)     (1)     (2)     (1)     (2)
tractor            59.57   63.5    60.72   64.7    62.05   65.86   62.52   66.83
sunflower          60.07   64.8    62.4    66.2    62.99   66.45   63.13   68.39
rush_hour          58.36   62.8    61.17   68.7    62.58   71.6    63.09   74.53
station            60.34   63.2    62.23   67.9    62.89   67.24   63.15   69.58
blue_sky           60.12   66.2    62.26   70.9    62.83   70.53   63.13   70.97
pedestrian_area    59.91   66.7    61.9    68.8    62.64   69.77   63.01   72.65
Average            59.73   64.5    61.78   67.2    62.66   68.8    63      70.5

4.3. Hardware Design

The system specification of our hardware design is an SVC decoder operating at a 135 MHz clock rate with three spatial layers (CIF, 480p, and 1080p), three quality layers, and 60 frames per second. According to the analysis in Chapter 3, with a frame-based decoding flow and a one-pass quality layer decoding MB-pipeline architecture, the total number of processing stages per second over the three spatial layers is

(396 + 1,350 + 8,160) x 60 = 594,360    (4.10)

At 135 MHz, this leaves each stage a budget of about 135,000,000 / 594,360 ≈ 227 cycles.
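The cycle budget implied by equation (4.10) can be checked with a few lines of arithmetic. The macroblock counts are the ones used in the equation; the constant and function names are hypothetical.

```python
CLOCK_HZ = 135_000_000
FRAMES_PER_SECOND = 60
# Macroblocks per frame for each spatial layer, as in equation (4.10).
MBS_PER_FRAME = {"CIF": 396, "480p": 1350, "1080p": 8160}


def stages_per_second():
    """MB-pipeline stages that must complete every second."""
    return sum(MBS_PER_FRAME.values()) * FRAMES_PER_SECOND


def cycle_budget():
    """Whole clock cycles available to each MB-pipeline stage."""
    return CLOCK_HZ // stages_per_second()
```

Under these figures the budget comes out at 227 cycles per stage, the limit that the motion compensation design has to meet.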

There are three processes in our motion compensation design: “Motion Vector Generation”, “Reference Pixel Accessing”, and “Interpolation”. It is hard to finish all of this work within 227 cycles. As a result, we propose a two-stage motion compensation design with separated data access and interpolation. Our proposed INTER stage is decomposed into two MB-pipeline stages to reduce the processing cycles of each stage. The block diagram of the proposed motion compensation architecture is shown in Fig. 4.10.

The first stage, named MVG, contains the “Motion Vector Generation” and “Reference Pixel Accessing” processes. Its main functions are to generate MVs and collect all reference data of the current MB. First, Motion Vector Generation reconstructs the MVs of the current MB, and then the Data Request Generator in Reference Pixel Accessing issues data requests to the Memory Controller of the external memory to access reference pixels. The returned reference pixels are collected in a 21x21-pixel Register Array and then written to the Reference Data Buffer for the next MB-pipeline stage.

The second stage, named INTERP, contains “Interpolation” and pixel reconstruction. The main function of this stage is to interpolate fractional pixels. The Interpolator produces fractional pixels from the reference data, and then Pixel Reconstruction adds these interpolated pixels to the residuals to output the reconstructed pixels.

The details of these three processes are described as follows.


Fig. 4.10. Proposed pipeline architecture of motion compensation design


macroblock, which first generates the motion vector predictor (MVp) and then adds it to the motion vector difference (mvd). The proposed motion vector prediction design takes three cycles to derive the motion data for a 4x4 block: fetching the neighboring MVs in cycle 1, determining the MVp of the current block in cycle 2, and adding mvd to MVp in cycle 3, as illustrated in Fig. 4.11.

The prediction of an inter-coded block in H.264/AVC comes from two directions: the forward reference frame, called List0 (L0), and the backward reference frame, called List1 (L1). The prediction type of an inter-coded block can be L0, L1, or bi-directional (both L0 and L1). The prediction direction is specified per 8x8 block, so a 2-bit predFlag, with one bit for L0 and one bit for L1, identifies the prediction type of each 8x8 block as illustrated in Fig. 4.12. To deal with the variable block size, a 16x16 macroblock is decomposed into sixteen 4x4 forward blocks and sixteen 4x4 backward blocks; the block index is shown in Fig. 4.13. The blocks are decoded in the order of block index from 0 to 31. The motion vector predictor algorithm is introduced in Section 4.1.1.

Before the current MB is processed, all neighboring MVs are loaded into an internal buffer, which takes some preload cycles. The neighboring blocks come from the upper MB, the left MB, or the current MB, and the MVp is predicted from these neighboring MVs. The neighboring blocks of the upper MB come from a Neighboring MV Buffer, which stores the MV information of previously decoded macroblocks as shown in Fig. 4.14. We use an internal SRAM as the neighboring MV buffer; this SRAM stores the bottom four 4x4 blocks of every macroblock in a macroblock row, i.e., 120 macroblocks per row for a 1080p video sequence. Both L0 and L1 need an SRAM, and the total size is 2.46 KBytes. When the MV of a 4x4 block is reconstructed, it is sent to the Current MV Buffer. For a partition larger than 4x4, only the first 4x4 block of the partition needs to go through motion vector prediction; the remaining blocks skip the prediction process and simply copy the MV of the first block. After the motion vector generation of the current MB is finished, the MVs of the bottom four 4x4 blocks are stored in the neighboring MV buffer to serve as the upper neighbors of the next MB row. The worst case occurs when the MB consists of sixteen 4x4 blocks with bi-directional prediction, in which case the motion vector prediction process takes preload + 96 cycles (32 blocks at three cycles each).
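The cycle count of motion vector prediction can be estimated with a simple model. This sketch assumes that every partition of the MB uses the same prediction type, that copying the MV to the remaining 4x4 blocks of a partition costs no extra cycles, and that the preload cycles are accounted for separately; the names are hypothetical.

```python
CYCLES_PER_PREDICTION = 3  # fetch neighbors, derive MVp, add mvd


def mvg_prediction_cycles(num_partitions, bidirectional):
    """Prediction cycles for one MB, excluding the preload cycles.

    num_partitions: number of partitions in the MB (1 for 16x16,
                    up to 16 when every partition is 4x4).
    bidirectional:  True if both L0 and L1 are used.
    """
    directions = 2 if bidirectional else 1
    return CYCLES_PER_PREDICTION * num_partitions * directions
```

Sixteen 4x4 partitions with bi-directional prediction give 96 cycles, the worst case stated above, while a single 16x16 partition predicted from one list needs only 3 cycles.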

Fig. 4.11. Proposed processing element for motion vector prediction


Fig. 4.12. The four prediction flags in a MB

Fig. 4.13. Block index of all 4x4 subblocks

Fig. 4.14. SRAM buffer for neighboring MVs


4.3.2. Reference Pixel Accessing

After the MV of a 4x4 block is reconstructed, the current block information is sent to the Data Request Generator for accessing reference pixels in external memory. As described in Section 4.2.2, the size of the requested reference data is based on the partition size of the current block. Accordingly, a 21x21-pixel register array, the maximum size needed, is used in Reference Pixel Accessing to collect the requested data. The memory controller returns the reference data row by row, with a maximum width of 21 pixels (168 bits), and these rows are collected by the 21x21-pixel Register Array. The requested data size is not always 21x21. However, for every block size, the requested data can be placed in this 21x21 register array starting from the upper-left corner, with a fixed position mapping to every 4x4 block inside the current partition. For example, if the requested data size is 21x21 for a 16x16 partition, the data fill the whole 21x21-pixel register array. If the current partition size is 16x8, the requested data will be 21x13. After one partition is finished, these 21x13 data in the 21x21-pixel register array can be overwritten by the requested data of another partition. The 21x21 Register Array first modifies the returned reference data to recover the boundary extension, and then collects the modified data into the register array. When there are enough reference data for a 4x4 block, i.e., 9 rows of luma pixels or 3 rows of chroma pixels, the corresponding reference data (9x9 luma pixels, or fractional MVs, 3x3 Cb, and 3x3 Cr samples) are written to the Reference Data Buffer for the next MB-pipeline stage, INTERP. Fig. 4.15 shows the data mapping of the Reference Data Buffer. These buffers consist of 16 rows, each row corresponding to the 4x4 block index of a macroblock. The row of the luma buffer contains the 9x9 reference pixels of the luminance component, and the row of the chroma buffer includes fraction MV of x-direction,


Reference Data Buffer, and the buffer size is 6.25 Kbytes.

In the external memory access, the data is returned row by row. That is, the reference data of the current partition becomes ready in order from left to right and top to bottom. As a result, reference data is written to the Reference Data Buffer in raster-scan order within the current partition, as shown in Fig. 4.16.


Fig. 4.15. (a) Data mapping in SRAM of luma reference data buffer, (b) data mapping in SRAM of chroma reference data buffer



Fig. 4.16. Raster-scan order of writing reference data in (a) 16x8 and (b) 8x16 partition size

4.3.3. Interpolation

Our proposed interpolation design adopts a Doubled Hardware of Interpolation Unit (DHIU) scheme, in which two separate interpolation units handle the L0 and L1 directions of a 4x4 block simultaneously. Each interpolation unit consists of a luma interpolator and a chroma interpolator. A 4x4 block in YUV420 format consists of 4x4 luma, 2x2 Cb, and 2x2 Cr samples. In our proposed architecture, all of the reference data for an MB is ready before the current Interpolation MB-pipeline stage, so we process the luminance and chrominance components in parallel. The 6-tap FIR filter design for luma is shown in Fig. 4.17; it replaces the multiply-add operations with shifts and additions. Our proposed luma interpolator uses a separate 1-D structure. It is composed of thirteen 6-tap FIR filters and four bilinear filters, and its throughput is 4 pixels/cycle as shown in Fig. 4.18. Depending on the fractional MVs in the x- and y-directions, processing one 4x4 block takes 5-6 cycles. For the chroma interpolator, a low-cost two-stage chroma filter design was proposed in [15], with a throughput of 1 pixel/cycle. We extend the parallelism to


overall latency of interpolation unit for one 4x4 block is 5-6 cycles. The Pixel Reconstruction takes another 2-3 cycles and 1 cycle for changing 4x4 block index. As a result, the latency of Interpolation takes 128-160 cycles for an MB.
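The 128-160 cycle range follows directly from the per-block figures. A minimal sketch, assuming an MB is always processed as sixteen 4x4 blocks in sequence; the function name and parameter names are hypothetical.

```python
def interpolation_latency_per_mb(interp=(5, 6), recon=(2, 3),
                                 index_change=1, blocks=16):
    """Return (best, worst) INTERP-stage latency in cycles for one MB.

    interp:       (min, max) interpolation cycles per 4x4 block.
    recon:        (min, max) pixel reconstruction cycles per 4x4 block.
    index_change: cycles spent changing the 4x4 block index.
    """
    best = (interp[0] + recon[0] + index_change) * blocks
    worst = (interp[1] + recon[1] + index_change) * blocks
    return best, worst
```

Each 4x4 block therefore takes 8 to 10 cycles, and sixteen blocks take 128 to 160 cycles, which stays below the 227-cycle stage budget.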

Fig. 4.17. 6-tap FIR filter design
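A multiplier-free form of the 6-tap filter can be written with shifts and additions only. The sketch below shows one possible decomposition, 5v = 4v + v and 20v = 16v + 4v, with the symmetric taps grouped first; this grouping is an assumption for illustration and is not necessarily the adder tree drawn in Fig. 4.17.

```python
def times5(v):
    """5 * v as (v << 2) + v."""
    return (v << 2) + v


def times20(v):
    """20 * v as (v << 4) + (v << 2)."""
    return (v << 4) + (v << 2)


def six_tap_shift_add(e, f, g, h, i, j):
    """6-tap sum using only shifts and additions.

    Symmetric taps are added first so that each constant
    multiplication is needed only once.
    """
    return (e + j) + times20(g + h) - times5(f + i)


def six_tap_reference(e, f, g, h, i, j):
    """Direct form with multipliers, used here only for comparison."""
    return e - 5 * f + 20 * g + 20 * h - 5 * i + j
```

Both functions return the same value for any integer inputs, so the shift-add form can replace the multipliers without changing the filter output.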


Fig. 4.18. Interpolator of luminance component

4.4. Implementation Results

This chapter summarizes the implementation results of this work. The proposed motion compensation hardware architecture is implemented in Verilog HDL with UMC 90nm 1P9M CMOS technology. The critical path of this design lies in the 6-tap FIR filter because of its tree adder architecture.

4.4.1. Design Flow

The design flow in our proposed design is shown in Fig. 4.19. First, the system specification is made, and we develop a behavior model in C language. Then, we formulate and analyze the design problem from algorithmic and architectural levels.

After the system architecture is confirmed, the hardware design is implemented with


Fig. 4.19. Design flow in this work

4.4.2. Experiment Results

To observe the behavior of the proposed design, we integrate all components in a top module. Input patterns such as slice_type, mb_type, CurrMbAddr, and other MB information are generated from our C model, and the output results are compared with the golden data generated by the C model.

The experiment results in this Section are simulated in C with a DDR400 SDRAM model. We first calculate the MB processing cycles in the two stages of our MC architecture, and the overall MC processing cycle is the maximum of these two values.

Table 4.5 and Table 4.6 show the results in our simulation.
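Because the overall figure is taken as a per-MB maximum before averaging, the overall average can be larger than either stage average. The sketch below illustrates this with made-up cycle counts for three MBs; the numbers are hypothetical and are not taken from the simulation.

```python
def mc_cycles_per_mb(mvg_cycles, interp_cycles):
    """Overall MC cycles of each MB: the slower of its two stages."""
    return [max(m, i) for m, i in zip(mvg_cycles, interp_cycles)]


def average(values):
    return sum(values) / len(values)


# Hypothetical per-MB cycle counts, for illustration only.
mvg = [200, 110, 150]
interp = [130, 140, 135]
overall = mc_cycles_per_mb(mvg, interp)  # [200, 140, 150]
```

Here the stage averages are about 153 and 135 cycles, while the overall average is about 163 cycles. This may be why the values in Table 4.6 are somewhat higher than the larger of the two stage averages in Table 4.5.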


Table 4.5. Average processing cycles per MB in MVG stage (motion vector generation and reference pixel accessing) and INTERP stage (interpolation)

                   QP=16           QP=24           QP=32           QP=40
                   MVG   INTERP    MVG   INTERP    MVG   INTERP    MVG   INTERP
pedestrian_area    157   135       134   135       132   135       130   132
Average            156   137       137   135       138   136       139   133

Table 4.6. Experiment results of average processing cycles per MB in our proposed motion compensation design

pedestrian_area 166 150 149 146

Average 165 152 153 152

For small QP values, the block partitions tend to be small, which increases the processing cycles in the MVG stage. The INTERP stage is independent of QP since it is only affected by the x- and y-components of the fractional MV. A small QP also results in more intra-coded MBs in the bitstream. For example, there are 26%,


below 160 cycles/MB. For sequences with repetitive backgrounds, such as sunflower with its numerous stamens and petals, station2 with its many rails, or blue_sky with textureless blue sky and shadowed leaves, different QPs may change the position pointed to by the MV, causing the average processing cycles to fluctuate within a range. However, for sequences with specific objects or clearly demarcated content, such as tractor or pedestrian_area, the average processing cycles are lower at high QP values.

The results here consider only inter-coded MBs. If the intra-coded MBs are also taken into consideration, the average processing cycles would be lower.

4.4.3. Gate Count

For a 135 MHz synthesis frequency (the clock period is set to 7.4 ns), the total gate count of this motion compensation design, excluding internal memory, is about 107k.

The gate count of each component is listed in Table 4.7.

Compared to previous works [11][14][15], the gate count increase of this work is mainly caused by the 21x21-pixel register array in the Reference Pixel Accessing part for Partitioned Block Data Reuse (PBDR). The Doubled Hardware of Interpolation Unit (DHIU) scheme in Interpolation also accounts for a large share of the gates.


Table 4.7. List of gate count for previous works and proposed design

Module                        [11]      [14]      [15]          Proposed
Motion Vector Generation      N/A       N/A       16,907        14,780
Reference Pixel Accessing     N/A       N/A       44,542        53,754
Interpolation                 N/A       N/A       55,728        38,546
  - interpolation unit        N/A       20,686    2 * 14,960    2 * 15,571
  - chroma interpolator       N/A       N/A       N/A           2,336
  - pixel reconstruction      N/A       N/A       7,133         7,404
Total                         46,646    43K       117,177       107,080

4.4.4. Comparison

Since no SVC decoder has been published yet, this work is compared with other H.264/AVC designs. Table 4.8 gives the comparison of motion compensation.

Compared to [15], the gate count of our design is smaller and the on-chip SRAM is slightly larger, while the average processing cycle count is almost the same. Moreover, the worst case of interpolation cycles is one-fourth to two-thirds of that in previous works. Generally speaking, the hardware design of this work is competitive with other works.


Gate Count 46.6 k 43 k 117.2 k 107 k

SRAM (Bytes) 168 - 8 k 8.71 k

# of Interpolation Unit 1 1 2 2

Clock Rate 125 MHz 100 MHz 200 MHz 135 MHz

Standard H.264/AVC H.264/AVC H.264/AVC SVC

