
CHAPTER 2 ALGORITHM DESCRIPTION AND ANALYSIS

2.4 Summary

Profiling H.264 decoding on an ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is crucial. This chapter also illustrated inter prediction in H.264/AVC and compared it with that of other standards.

Chapter 3

Motion Compensation Design for MPEG-2/H.264 video decoder

The state-of-the-art video coding standard H.264/AVC provides a compression ratio that significantly outperforms all previous video compression standards. However, unlike the traditional MPEG-x standards, H.264/AVC is not backward compatible with the earlier MPEG-x and H.26x video coding standards. Support for multiple video coding standards is therefore essential in modern multimedia systems. For example, the DVD Forum adopted MPEG-2, H.264/AVC, and VC-1 (also known as WMV-9) as mandatory codecs for the next-generation HD-DVD and Blu-ray formats. In digital video broadcasting (DVB), the DVB-T system, which is designed for digital terrestrial television services, is directly compatible with MPEG-2 coded TV signals.

Furthermore, mobile DVB, now called DVB-H, allows the transmission of H.264/AVC video content because of its high coding efficiency. In particular, DVB-H is backward compatible with DVB-T but transmits a different video format. Designing an efficient video decoder for multi-standard video applications is therefore both a demand and a challenge.

This chapter discusses the design of motion compensation, which dominates the amount of data transfer in the video decoder, for an MPEG-2/H.264 dual-standard video decoder. The rest of the chapter is structured as follows. Section 3.1 describes the motion compensation engine for the H.264/AVC decoder. Section 3.2 discusses and analyzes the combined motion compensation engine for MPEG-2/H.264. Finally, a summary is given in Section 3.3.

3.1 Motion Compensation Engine for H.264/AVC decoder

Fig. 3.1 Motion compensation engine for H.264 video decoder (blocks shown include the motion vector predictor, 4 x 4 MV buffer, line MV FIFO, and address generator)

Fig. 3.1 illustrates the whole motion compensation engine for the H.264/AVC video decoder. First, the line MV FIFO stores decoded motion vectors for motion vector prediction, and the 4 x 4 MV buffer stores the decoded motion vectors for decoding the current MB. Then, the address generator sends the reference address to the memory access controller. The task of this controller is to schedule consecutive access commands and send them to the frame memories. The burst-read data is kept in the read data buffer and then filtered by the interpolator. Finally, the interpolated reference data is added to the residual data and then passed through the de-blocking filter.

Our proposed decoder adopts a ping-pong structured external frame memory [28], in which two memories alternately store the reference frame and the current frame.

The following subsections discuss the details of the modules other than the memory access controller. The frame memory access controller is discussed in detail in Chapter 4.

Subsection 3.1.1 describes the motion vector generator, including the motion vector predictor (MVP) and the related storage. Subsection 3.1.2 presents a data reuse technique for the interpolator. Subsection 3.1.3 analyzes the proposed data reuse technique. Finally, the luma and chroma interpolator designs are reported in Subsections 3.1.4 and 3.1.5, respectively.

3.1.1 Motion Vector Generator

Fig. 3.2 Motion vector information storage for the motion vector predictor in QCIF frame format (labels shown: current MB, frame boundary, next MB 0 to next MB 4)

The motion vector generator mainly consists of the motion vector predictor, the line MV FIFO, and the 4 x 4 MV buffers. A motion vector is generated by summing the motion vector prediction (MVP) and the motion vector difference (MVD). The MVP value is calculated from the neighboring MVs, so decoded motion vectors must be stored for subsequent decoding. The line MV FIFO stores the decoded motion vector pairs (MVX, MVY). The depth and width of the MV FIFO depend on the frame width and the search range, respectively. Once an entry of the MV FIFO is no longer needed, the motion vector pair can be discarded.

The 4 x 4 MV buffers are required because the maximum number of motion vectors per MB is sixteen. The motion vectors for decoding the current MB are stored in these 4 x 4 MV buffers.

As for the total storage required by the motion vector generator, Fig. 3.2 shows an example. A total of 4 x 11 motion vector pairs has to be stored for the QCIF frame format. The required neighboring motion vectors are detailed in Fig. 3.3. To cover all conditions, the storage element is based on the 4 x 4 block, which is the smallest element in the H.264/AVC video decoder. Each square indicates one motion vector pair. Decoding MV0-MV15 in the current MB requires the neighboring motion vectors in the upper-left corner (MVLU), the upper-right corner (MVRU), the upper (MVU0-MVU3), and the left (MVL0-MVL3) positions.
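The 4 x 11 figure for QCIF can be checked with a short calculation. The sketch below assumes one reading of the text, namely that one motion vector pair is kept per 4 x 4 block column across a row of macroblocks; the function name `line_mv_pairs` is hypothetical and not part of the design.

```python
MB_SIZE = 16        # luma macroblock width in pixels
BLOCKS_PER_MB = 4   # number of 4 x 4 block columns in one macroblock


def line_mv_pairs(frame_width: int) -> int:
    """Motion vector pairs kept for one row of macroblocks.

    Assumption: one (MVX, MVY) pair per 4 x 4 block column.
    """
    mbs_per_row = frame_width // MB_SIZE
    return BLOCKS_PER_MB * mbs_per_row


# QCIF is 176 pixels wide, i.e. 11 macroblocks per row.
qcif_pairs = line_mv_pairs(176)
print(qcif_pairs)
```

Under the same assumption, wider frame formats scale linearly with the number of macroblocks per row.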

Fig. 3.3 Neighboring motion vectors needed when decoding all motion vectors in the current macroblock

The detailed architecture of the motion vector generator is shown in Fig. 3.4. Motion vector generation involves two phases. The first loads the MVD into the 4 x 4 MV buffers; the second calculates MV = MVP + MVD and writes the result back into the 4 x 4 MV buffers.

The proposed memory storage can be treated as a two-level memory hierarchy, as shown in Fig. 3.5. Four line MV FIFOs are implemented in SRAM, and local registers store the neighboring motion vectors for the current MB. The local registers that store neighboring motion vectors include the left MV line buffer and the upper-left, upper, upper-right, and left MV registers. The

vectors required in current MB decoding. After the current MB has been decoded, the FIFOs need one push and one pop operation, which together take two cycles, to update the contents of the local registers for decoding the next MB.

Fig. 3.4 (a) Motion vector generator architecture for QCIF format, (b) MV buffer unit

Fig. 3.5 Two-level memory hierarchy for MVP

Fig. 3.6 (a) Block size_position index, (b) directional prediction table (16x8, 8x16), (c) median prediction table (16x16, 8x8), (d) median prediction table (4x4)

MVP is calculated from MVA, MVB, MVC, and MVD, whose values are derived from the neighboring motion vectors according to the block size_position index illustrated in Fig. 3.6 (a). MVA, MVB, MVC, and MVD denote the motion vectors of the left, upper, upper-right, and upper-left neighboring macroblock/partition/block, respectively, as shown in Fig. 2.3 (c). Fig. 3.6 (b)-(d) lists MVA, MVB, MVC, and MVD for each block size_position index. Besides the look-up table (LUT) required for motion vector prediction, many boundary conditions and exceptions have to be handled; they are omitted here for clarity.
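As an illustration of the median case, the following minimal sketch computes a component-wise median of the left, upper, and upper-right neighbors and then reconstructs MV = MVP + MVD. It deliberately ignores the directional cases, the look-up of neighbors by block size_position index, and all boundary conditions, so it is not the full predictor. The function names and the sample vectors are hypothetical.

```python
from typing import Tuple

MV = Tuple[int, int]  # (mv_x, mv_y) in quarter-sample units


def median3(a: int, b: int, c: int) -> int:
    """Median of three integers."""
    return sorted((a, b, c))[1]


def median_mvp(mv_a: MV, mv_b: MV, mv_c: MV) -> MV:
    """Component-wise median of the left (A), upper (B), and upper-right (C) MVs."""
    return (
        median3(mv_a[0], mv_b[0], mv_c[0]),
        median3(mv_a[1], mv_b[1], mv_c[1]),
    )


def reconstruct_mv(mvp: MV, mv_diff: MV) -> MV:
    """MV = MVP + MVD, where mv_diff is the decoded motion vector difference."""
    return (mvp[0] + mv_diff[0], mvp[1] + mv_diff[1])


# Hypothetical neighbors and difference, for illustration only.
mvp = median_mvp((1, 3), (2, 0), (-2, 1))
mv = reconstruct_mv(mvp, (1, -1))
print(mvp, mv)
```

Note that the difference is named `mv_diff` in the sketch to avoid confusion with the upper-left neighbor, which the text also calls MVD.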

3.1.2 Data Reuse Technique for Interpolator

Fig. 3.7 (a) 4 x 4 block window and the corresponding 9 x 9 interpolation window, (b) overlapped region of neighboring interpolation windows

Fig. 3.8 (a) 2 x 2 raster scanning order, (b) row-major 2 x 2 raster scanning order, (c) column-major 2 x 2 raster scanning order

As Fig. 3.7 (a) shows, interpolating the fractional sample values of a 4 x 4 block requires a 9 x 9 interpolation window. If the motion vectors of two neighboring 4 x 4 blocks are the same, the 5 x 9 overlap region between their interpolation windows can be reused. The scanning order of residual decoding within each macroblock is the 2 x 2 raster scanning order shown in Fig. 3.8 (a). Of the two scanning orders illustrated in Fig. 3.8 (b) and (c), the row-major order needs 13 transitions, whereas the column-major order needs only 5. At each transition the overlap region cannot be reused. The column-major order is therefore the better choice because it has fewer transitions.
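The window and overlap sizes quoted above follow directly from the 6-tap filter length. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the calculation assumes two adjacent blocks with identical motion vectors.

```python
FIR_TAPS = 6  # length of the H.264 luma half-sample filter


def window_size(block: int, taps: int = FIR_TAPS) -> int:
    """Side length of the interpolation window for a block of the given size."""
    return block + taps - 1


def overlap_width(block: int, taps: int = FIR_TAPS) -> int:
    """Width shared by the windows of two adjacent blocks with equal MVs."""
    return window_size(block, taps) - block


win = window_size(4)                 # 9 x 9 window for a 4 x 4 block
shared = overlap_width(4)            # 5 shared columns (or rows)
reused_pixels = shared * win         # 5 x 9 overlap region
new_pixels = win * win - reused_pixels
print(win, shared, reused_pixels, new_pixels)
```

When the overlap is reused, only 36 of the 81 window pixels have to be fetched for the second block.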

Fig. 3.9 (a) 2 x 2 raster scanning order, (b) 4 x 4 raster scanning order, (c) extended 2 x 2 raster scanning order

Fig. 3.10 Synchronization buffer scheme for two different scanning orders in inter prediction: (a) 2 x 2 raster scanning order, (b) 4 x 4 raster scanning order

In a video decoding system, inter prediction usually operates at the macroblock level. The decoding order of the 4 x 4 blocks, the smallest block element in the H.264/AVC video decoder, can therefore be chosen freely within each macroblock. With this in mind, the 2 x 2 and 4 x 4 raster scanning orders are depicted in Fig. 3.9 (a) and (b); the column-major 4 x 4 raster scanning order needs only four transitions, fewer than the 2 x 2 raster scan.

pixels in the residual adder, because its scanning order differs from that of the residual decoder, which must follow the 2 x 2 raster scanning order defined in the standard [1].

Fig. 3.11 Content-swap operation (interpolator with attached content buffer)

Fig. 3.12 An example of macroblock partition; (1, 3) indicates (mv_x, mv_y).

To resolve this problem, we attach a content register to the interpolator, as illustrated in Fig. 3.11, and adopt the extended 2 x 2 raster scanning order shown in Fig. 3.9 (c). The size of the content register depends on the local registers in the interpolator. Each gray block in Fig. 3.9 (c) indicates a content-swap operation, which swaps the contents of the interpolator's local registers with those of the content buffer. As a result, if the motion vectors of block 1 and block 4 are the same, the overlapped region in Fig. 3.7 (b) need not be re-fetched when decoding block 4. The extended 2 x 2 raster scanning order thus follows the same 2 x 2 raster scan as the residual decoder while achieving the data reuse of the 4 x 4 raster scanning order. The content-swap operation takes effect only for larger block partitions or when the motion vectors of neighboring blocks are the same. The condition for executing this operation is given by expression (3.1).

\[
\text{content\_swap\_condition} = (\text{mb\_type} == 16\times16) \;||\; (\text{mb\_type} == 16\times8) \qquad (3.1)
\]

However, for the example shown in Fig. 3.12, condition (3.1) evaluates to false. If the equality of neighboring motion vectors is checked instead of the block size, the data in the example of Fig. 3.12 can be reused. The motion vector checks between neighboring blocks are listed in Table 3.1.

Table 3.1 Neighboring MV checking table for content-swap operation

Block number    Checking condition
1               MV1 == MV4
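The difference between the two checks can be sketched as follows. The block-size check uses the larger partitions named in Subsection 3.1.3 (SKIP, 16x16, 16x8), and the MV check uses only the single row of Table 3.1 that is available here (block 1 against block 4). The string encoding of macroblock types, the function names, and the sample vectors are assumptions made for illustration.

```python
from typing import Dict, Tuple

MV = Tuple[int, int]  # (mv_x, mv_y)

# Larger partitions for which the block-size condition takes effect.
LARGE_MB_TYPES = {"SKIP", "16x16", "16x8"}

# Block number -> neighboring block whose MV is compared.
# Only the row for block 1 is known; other rows are intentionally absent.
MV_CHECK_TABLE: Dict[int, int] = {1: 4}


def swap_by_mb_type(mb_type: str) -> bool:
    """Content-swap decision based on the macroblock partition size."""
    return mb_type in LARGE_MB_TYPES


def swap_by_mv(block: int, mvs: Dict[int, MV]) -> bool:
    """Content-swap decision based on equality of neighboring MVs."""
    partner = MV_CHECK_TABLE.get(block)
    return partner is not None and mvs[block] == mvs[partner]


# Hypothetical small-partition macroblock in which blocks 1 and 4 share an MV.
mvs = {1: (2, 0), 4: (2, 0)}
print(swap_by_mb_type("8x8"), swap_by_mv(1, mvs))
```

In this hypothetical case the block-size check rejects the swap, while the MV check allows the overlap to be reused.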

3.1.3 Analysis for Data Reuse Technique

To give a more generic and platform-independent analysis, we analyze the required pixels per MB and the cost overhead of each method. Taking fractional motion compensation into account, the required pixels per MB and the cost overhead of the different methods are summarized in Table 3.2. Assuming that every motion vector contains a fractional part, the best case has one motion vector and the worst case has 16 motion vectors per luma macroblock. Although the required pixels of all methods are the same in the worst case, the required pixels

column-major methods, the 4 x 4 raster scanning order (RSO) performs best; however, it requires an additional synchronization buffer and degrades throughput because its RSO differs from that of the residual decoder. As for the extended methods, condition (3.1) takes effect only for larger block partitions (SKIP, 16x16, 16x8). That is, data cannot be reused in cases such as Fig. 3.12, even though the neighboring motion vectors are the same. To remove this disadvantage, method 5 checks the neighboring motion vectors rather than the block size, so that the required bandwidth in the Fig. 3.12 case is reduced to that of the 4 x 4 RSO. The advantage of the extended methods is that they require only a content buffer, which is smaller than the buffer of method 3, and a small number of extra cycles for the content-swap operation. Method 4 outperforms methods 1, 2, and 3 only for larger block sizes (SKIP, 16x16, 16x8), but Fig. 3.13 shows that these block sizes account for 50% to 90% of all blocks. Furthermore, method 5 covers all cases of method 4 and also takes effect for smaller block sizes such as those in Fig. 3.12. As shown in Fig. 3.14, applying the extended method in our design reduces the required memory bandwidth by about 30% compared with the column-major 2 x 2 RSO method.

Table 3.2 Static analysis of the different methods in H.264/AVC.
Assumption: each motion vector contains a fractional part.

                     Required pixels per luma MB
Method               Worst case    Best case    Fig. 3.12    Cost overhead
1  R 2x2 RSO         1296          1296         1296         0

* R: row-major, C: column-major, RSO: raster scanning order, CS: content-swap operation (one cycle)
* Best case: one MB contains one motion vector
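The value 1296 in the first row is consistent with sixteen 9 x 9 windows fetched without any reuse. The sketch below reproduces that number and adds a deliberately simple savings model, which is an assumption rather than the analysis behind Table 3.2: every reused neighbor saves one 5 x 9 overlap region. The function name `pixels_per_mb` is hypothetical.

```python
WINDOW = 9            # 9 x 9 interpolation window per 4 x 4 block
BLOCKS_PER_MB = 16    # 4 x 4 blocks in one luma macroblock
OVERLAP = 5 * 9       # pixels shared by two adjacent windows with equal MVs


def pixels_per_mb(reused_neighbors: int = 0) -> int:
    """Pixels fetched for one luma MB under a simple reuse model.

    Assumption: each reused neighbor saves exactly one 5 x 9 overlap region,
    and at most 15 of the 16 blocks can reuse data from a predecessor.
    """
    if not 0 <= reused_neighbors <= BLOCKS_PER_MB - 1:
        raise ValueError("reused_neighbors must be between 0 and 15")
    return BLOCKS_PER_MB * WINDOW * WINDOW - reused_neighbors * OVERLAP


no_reuse = pixels_per_mb()      # matches row 1 of Table 3.2
print(no_reuse)
```

The number of reusable neighbors for each scanning order depends on the transition counts discussed in Subsection 3.1.2 and on the motion vectors of the macroblock, so it is left as a parameter here.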

Fig. 3.13 Block size proportion under different bit rates (foreman, QCIF). Horizontal axis: bit rate (kbps); vertical axis: proportion; series: SKIP/16x16/16x8 and other.

Fig. 3.14 Required memory bandwidth for different methods (foreman, QCIF). Horizontal axis: bit rate (kbps); vertical axis: required bandwidth (MByte/s); series: row-major 2x2 RSO, column-major 2x2 RSO, extended column-major 2x2 RSO.

3.1.4 Luma Interpolator Design

Fig. 3.15 (a) Adder-chain based [10], (b) adder-tree based [11]

Fig. 3.16 Separate 1-D interpolator design (no parallel)

This subsection reviews several interpolator designs. Recalling the fractional interpolation of H.264/AVC in Fig. 2.2, a 6-tap FIR filter with coefficients (1, -5, 20, 20, -5, 1) and a bilinear filter are needed for half-pixel and quarter-pixel interpolation in the H.264/AVC video decoder. Trading a tolerable PSNR loss for lower cost, Lie's 4-tap diagonal FIR filter and three-stage recursive algorithm are proposed in [8], and Chen's HVBi scheme (bilinear filtering in both the horizontal and vertical directions) and VBi scheme (vertical bilinear, horizontal FIR) are reported in [9]. However, when the P-frame sequence is very long, such as I + 29 P, the propagation of the PSNR loss may heavily degrade video quality, especially in high-definition frame formats. In contrast, aiming at a standard-compatible design without PSNR loss, Chien [10] and He [11] presented adder-chain and adder-tree based designs, respectively. These two types, depicted in Fig. 3.15, are categorized as 1-D linear filter designs.

To reduce cost, the multipliers can be simplified to adders and shifters. The 1-D linear interpolator is suitable for QCIF video sequences in mobile applications; for HDTV video sequences, however, throughput is a very important issue, and the long execution cycles of the 1-D linear design may lead to poor throughput. As an alternative, Chien [10] also proposed a separate 1-D design that separates horizontal and vertical interpolation and processes them in parallel on a 4 x 4 block basis. This design achieves better throughput, although it may need more storage. Fig. 3.16 shows the separate 1-D interpolator design without parallel processing.
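To make the filter arithmetic concrete, the following sketch evaluates the 6-tap filter with shifts and additions only, using 20x = (x << 4) + (x << 2) and 5x = (x << 2) + x, and then applies the rounding and clipping used for directly computed half samples. It models the arithmetic only, not the adder-chain or adder-tree structure of [10] or [11]; the function names are hypothetical.

```python
def clip255(value: int) -> int:
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))


def fir6_shift_add(e: int, f: int, g: int, h: int, i: int, j: int) -> int:
    """Compute e - 5f + 20g + 20h - 5i + j without multipliers."""
    inner = g + h                      # samples weighted by 20
    outer = f + i                      # samples weighted by -5
    times20 = (inner << 4) + (inner << 2)
    times5 = (outer << 2) + outer
    return (e + j) + times20 - times5


def half_sample(e: int, f: int, g: int, h: int, i: int, j: int) -> int:
    """Half-sample value: rounded, scaled by 1/32, and clipped."""
    return clip255((fir6_shift_add(e, f, g, h, i, j) + 16) >> 5)


def quarter_sample(a: int, b: int) -> int:
    """Bilinear average of two neighboring samples with upward rounding."""
    return (a + b + 1) >> 1


print(half_sample(100, 100, 100, 100, 100, 100), quarter_sample(100, 103))
```

Because the coefficients sum to 32, a flat region of value 100 yields a half sample of 100, which is a quick sanity check of the scaling.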

Table 3.3 Comparison of execution cycles for different architectures

Architecture                   Ideal execution cycles
Adder-chain based 1-D          57
Adder-tree based 1-D           52
Separate 1-D (no parallel)     36
Separate 1-D (2 parallel)      18
Separate 1-D (4 parallel)      9

Assuming that all 9 x 9 pixels of the interpolation window of each 4 x 4 block are ready and can be accessed randomly, Table 3.3 lists the execution cycles of the different architectures. For

adder-networks are used to overlap the row inputs and eliminate the latency overhead of all rows except the first. The total number of cycles required is 57 (5 + 4 x 9 + 4 x 4); the detailed operation is described by Chien [10]. In the adder-tree based 1-D design, the row data can be loaded in parallel without shifting one by one, so there is no latency overhead and the total number of cycles is 52 (4 x 9 + 4 x 4). In the separate 1-D design, the first output appears at the 6th clock cycle and the following 3 outputs are generated over the next 3 clock cycles. The separate 1-D design without parallelism therefore needs 36 ((6 + 3) x 4) cycles to interpolate one 4 x 4 block. Similarly, the separate 1-D designs with 2 and 4 parallel paths require 18 ((6 + 3) x 2) and 9 (6 + 3) cycles, respectively. The 4-parallel design shown in Fig. 3.17 requires a content buffer of 6 x 9 pixels, which can be implemented in local registers or SRAM. However, SRAM needs several cycles to complete a content-swap operation, so we choose local registers in order to execute the content swap in one cycle. We select the 4-parallel separate 1-D architecture because its small number of execution cycles can be hidden behind the data-read cycles of the frame memory. A further reason is that it is easier to combine with the interpolation for MPEG-2 video decoding, as shown in Subsection 3.2.1.
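The cycle counts of Table 3.3 can be reproduced from the expressions given in the text. The sketch below simply encodes those expressions; the function names are hypothetical, and the counts are the ideal values under the stated assumption that all window data is ready.

```python
def adder_chain_cycles() -> int:
    """Adder-chain based 1-D design: 5 cycles of latency plus 4 x 9 and 4 x 4."""
    return 5 + 4 * 9 + 4 * 4


def adder_tree_cycles() -> int:
    """Adder-tree based 1-D design: no latency overhead."""
    return 4 * 9 + 4 * 4


def separate_1d_cycles(parallel: int) -> int:
    """Separate 1-D design with 1, 2, or 4 parallel paths.

    Each group of four outputs takes 6 + 3 cycles; a 4 x 4 block
    needs 4 / parallel such groups.
    """
    if parallel not in (1, 2, 4):
        raise ValueError("parallel must be 1, 2, or 4")
    return (6 + 3) * (4 // parallel)


for name, cycles in [
    ("adder-chain 1-D", adder_chain_cycles()),
    ("adder-tree 1-D", adder_tree_cycles()),
    ("separate 1-D (no parallel)", separate_1d_cycles(1)),
    ("separate 1-D (2 parallel)", separate_1d_cycles(2)),
    ("separate 1-D (4 parallel)", separate_1d_cycles(4)),
]:
    print(f"{name}: {cycles}")
```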

Fig. 3.17 4-parallel separate 1-D luma interpolator with content buffer

3.1.5 Chroma Interpolator Design

Fig 3.18 Interpolation window for each 2 x 2 chroma block


Fig. 3.19 (a) chroma interpolator, (b) vertical/horizontal filter

Because of the 4:2:0 chroma format and the quarter-sample precision of luma inter prediction, chroma inter prediction achieves one-eighth-sample motion resolution. Chroma inter prediction operates on 2 x 2 blocks, and chroma interpolation requires 3 x 3 pixels for each 2 x 2 block, as shown in Fig. 3.18. For a chroma 2 x 2 block consisting of A, B, C, and D, the corresponding fractional samples are e, f, g, and h, with one-eighth-sample precision. Compared with the direct-mapping design with 8 multipliers, whose equation is listed in Fig. 2.2 (d), we rewrite the equation as equation (3.2), which reduces the number of multipliers to 4.

]

Like the luma interpolator, the chroma interpolator can be separated into horizontal and vertical filters. The corresponding separate 1-D design is depicted in Fig. 3.19 (a), and the vertical/horizontal filter is illustrated in Fig. 3.19 (b). Two chroma interpolators are required to generate interpolated values for 2 pixels in parallel, and filtering 2 x 2 pixels takes 3 cycles if all required pixels are ready. With the 2-parallel chroma interpolator design shown in Fig. 3.20, only one cycle of latency is induced.
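The idea of separating the chroma filter can be illustrated with the standard H.264 chroma bilinear formula. The sketch below compares the direct form with one possible separable rearrangement; this rearrangement is an assumption chosen for illustration and is not necessarily the form of equation (3.2). The function names are hypothetical, and A, B are taken as the upper pair and C, D as the lower pair of reference samples.

```python
def chroma_direct(a: int, b: int, c: int, d: int, x_frac: int, y_frac: int) -> int:
    """Direct form of the H.264 chroma bilinear interpolation (fractions 0..7)."""
    return (
        (8 - x_frac) * (8 - y_frac) * a
        + x_frac * (8 - y_frac) * b
        + (8 - x_frac) * y_frac * c
        + x_frac * y_frac * d
        + 32
    ) >> 6


def chroma_separable(a: int, b: int, c: int, d: int, x_frac: int, y_frac: int) -> int:
    """Separable form: horizontal filtering of each row, then vertical filtering."""
    top = (a << 3) + x_frac * (b - a)         # equals (8 - x_frac) * a + x_frac * b
    bottom = (c << 3) + x_frac * (d - c)      # equals (8 - x_frac) * c + x_frac * d
    return ((top << 3) + y_frac * (bottom - top) + 32) >> 6


# The two forms agree for every fractional position of a hypothetical sample set.
samples = (12, 200, 77, 255)
agree = all(
    chroma_direct(*samples, xf, yf) == chroma_separable(*samples, xf, yf)
    for xf in range(8)
    for yf in range(8)
)
print(agree)
```

Because the bilinear filter is symmetric, applying the vertical filter first, as drawn in Fig. 3.20, gives the same result as the horizontal-first order used in the sketch.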

Fig. 3.20 2-parallel chroma interpolator

3.2 Combined Motion Compensation Engine for MPEG-2/H.264 Dual Video Decoder

Our H.264/MPEG-2 dual-standard video decoder is illustrated in Fig. 3.21, and the MPEG-2 decoder part is depicted in Fig. 3.22. Compared with the H.264/AVC standard, MPEG-2 provides neither intra prediction nor an in-loop de-blocking filter, and it supports only half-sample motion precision for both luma and chroma macroblocks. Unlike the median/directional MVP algorithm of H.264/AVC, MPEG-2 motion vectors are determined only by the updated PMV and bitstream side information such as f_code, motion_residual, and motion_code.

The detailed algorithm of motion vector generation is described in [2]. Besides the motion vector generator, a reconfigurable interpolator design for the dual-standard decoder is proposed in Section 3.2.1, and Section 3.2.2 gives the cost analysis.

Fig. 3.21 Motion compensation engine for H.264/MPEG-2 decoder

Fig. 3.22 MPEG-2 motion compensation engine part

3.2.1 Reconfigurable dual-standard interpolator design

The main additional cost of the motion compensation engine when combining it with an MPEG-2 video decoder is the interpolator. This subsection focuses on sharing storage and arithmetic modules between the two standards to minimize the area overhead. For macroblock-based fractional motion compensation in MPEG-2, each 16 x 16 macroblock needs a 17 x 17 interpolation window to interpolate the fractional samples. Each macroblock can be partitioned into four 8 x 8 blocks, each with a 9 x 9 interpolation window whose size is identical to that of the H.264/AVC luma interpolation window for a 4 x 4 block. In addition, the bilinear filter for H.264/AVC luma quarter-sample interpolation can be shared with the bilinear filter for MPEG-2 half-sample interpolation. In the 4-parallel luma interpolator shown in Fig. 3.17, some of the registers and bilinear filters, which are shaded in Fig. 3.23, can be shared with the MPEG-2 interpolator.
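The matching window sizes and the MPEG-2 half-sample arithmetic can be sketched as follows. The code treats the bilinear filter as a 2-tap filter, takes the half-sample flag from the LSB of each motion vector component, and uses the MPEG-2 rounding for two-sample and four-sample averages. It models the arithmetic only, not the shared register data flow; the function names and the small reference array are hypothetical.

```python
from typing import List


def window_size(block: int, taps: int) -> int:
    """Side length of the interpolation window for a block and filter length."""
    return block + taps - 1


def mpeg2_half_sample(ref: List[List[int]], x: int, y: int, mv_x: int, mv_y: int) -> int:
    """Predicted sample at (x, y) for a motion vector in half-sample units.

    Assumption: all referenced positions lie inside `ref`.
    """
    half_x, half_y = mv_x & 1, mv_y & 1          # half-sample flags from the LSBs
    px, py = x + (mv_x >> 1), y + (mv_y >> 1)    # integer part, floor division
    a = ref[py][px]
    if half_x and half_y:
        return (a + ref[py][px + 1] + ref[py + 1][px] + ref[py + 1][px + 1] + 2) >> 2
    if half_x:
        return (a + ref[py][px + 1] + 1) >> 1
    if half_y:
        return (a + ref[py + 1][px] + 1) >> 1
    return a


ref = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
print(window_size(8, 2), window_size(4, 6), mpeg2_half_sample(ref, 0, 0, 1, 1))
```

Both an 8 x 8 block with the 2-tap filter and a 4 x 4 block with the 6-tap filter lead to a 9 x 9 window, which is what allows the local registers to be shared.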

Fig. 3.23 Shared local registers and bilinear filters for MPEG-2

Fig. 3.24 Data flow of (a) vertical bilinear filter, (b) horizontal bilinear filter, (c) both vertical and horizontal bilinear filter (stages 0 to 2)

Besides the shared modules described above, only extra control circuits for the data flow are required for MPEG-2 interpolation. Fig. 3.24 shows the data flow of the vertical and horizontal bilinear filters; the half-sample flag is determined by the LSB of the motion vector. First, we have to consider the IDCT/IIT, which is the last stage of the MPEG-2/H.264 residual decoder. The inverse discrete cosine transform (IDCT) for MPEG-2 is an 8 x 8-block based module, whereas the inverse integer transform (IIT) for H.264/AVC is a 4 x 4-block based process. To combine the modules and share storage, the two modules can be merged into a single multi-mode IDCT whose output is 4 pixels in parallel for both standards. Moreover, only four bilinear filters are available for MPEG-2/H.264, so the filtering of each 8-pixel column has to be separated into two
