
Chapter 2 Motion Compensation Algorithm of

2.9 Summary

The H.264/AVC profiling on an ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is crucial. For HDTV applications, the H.264/AVC main profile provides several coding tools for high-resolution, high-quality video. Bi-prediction and quarter-pel interpolation improve coding efficiency. Weighted prediction, adopted here for the first time by a video standard, is a powerful tool for efficiently coding fading sequences. Direct mode coding, which the main profile adopts for B-slices, reduces the bitstream size. In B-slices, inter prediction uses two reference frames, so the motion compensation hardware is more complex. Furthermore, multiple reference frames may greatly increase the memory requirement. Consequently, a high-definition system requires not only a hardware accelerator but also bandwidth-efficient hardware. Finally, inter prediction in H.264/AVC and its comparison among different standards are also illustrated in this chapter.

Chapter 3

A Bandwidth-efficient Motion Compensation Architecture Design

In video standards such as MPEG-1/2, MPEG-4 and H.264/AVC, motion compensation is an important part of the decoder system and usually dominates system performance because of its high computational demand. Furthermore, the hardware design of motion compensation is more complex than that of other modules such as CAVLC, DCT, intra-prediction and the de-blocking filter. For HDTV applications, motion compensation adopts the new features supported by the H.264/AVC main profile, so both its procedure and its hardware become more complex in ASIC designs. Besides, inter-prediction requires a large number of pixels from previously decoded reference frames to predict the current frame, and external memory serves as the frame memory in our chip design. Moreover, with multiple reference frames, more memory is needed to store the pixels of several previously decoded reference frames. Thus memory bandwidth, that is, the data traffic on the external bus, becomes a bottleneck of motion compensation. A bandwidth-efficient motion compensation hardware accelerator that can be integrated into a simplex architecture therefore has to be designed.

This chapter focuses on motion compensation for high-throughput, low-cost designs. We propose a bandwidth-efficient motion compensation architecture suitable for high-quality systems. First, we introduce the overall bandwidth-efficient motion compensation architecture for the H.264/AVC main profile. The hardware of the individual modules, namely motion vector generation, the interpolator and weighted prediction, is discussed in sub-sections 3.2-3.4, respectively. Finally, simulation results and a summary are given in section 3.5.

3.1 Motion Compensation Engine for H.264/AVC’s Main Profile

The H.264/AVC main profile decoder system is illustrated in Figure 3.1. First, the frame information in the bitstream is decoded by the entropy decoding module, which includes CABAD and CAVLD. According to the macroblock type, the frame pixels are then decoded by intra-prediction or inter-prediction. The bus traffic is handled by a memory controller that serves the modules of the video decoder. Figure 3.2 illustrates the entire bandwidth-efficient motion compensation architecture for the H.264/AVC main profile. In H.264/AVC, a 4x4 block is the smallest of the prediction block types in variable block size (VBS), and each 16x16 block can be decomposed into 4x4 blocks. We adopt a 4x4 block-based pipeline for this motion compensation design because the 4x4 block is the smallest pixel processing unit in H.264/AVC and because such a pipeline reduces the cost of storage buffers and the associated power consumption.

Figure 3.1 The block diagram of H.264/AVC main profile decoder system

Figure 3.2 Motion compensation architecture for HDTV H.264/AVC main profile decoder

Excluding the memory controller, the proposed motion compensation architecture is shown in the gray dotted area of Figure 3.2. The frame memory access controller is discussed in detail in Chapter 4. The motion compensation architecture consists of three major parts: the motion vector generator (MVG), the interpolator and weighted prediction.

The decoded information is first loaded from the bitstream into the MVG, which generates the motion vectors used to predict the current macroblock. In the H.264/AVC main profile, a motion vector is generated by one of two prediction methods: motion vector prediction (MVP) and direct mode coding. The details of the MVG are described in sub-section 3.2. According to the motion vectors produced by the MVG, the corresponding reference pixels are loaded from external frame memory. Memory-related modules such as the memory controller and the address generator are not discussed in this chapter. Interpolators then produce fractional samples for both luma and chroma components. In this design, we employ two interpolators to process the pixels of list-0 and list-1 simultaneously, because in B-slices two motion vectors point to two search areas, one in the list-0 frame and one in the list-1 frame. When a motion vector has an integer value, the corresponding reference pixels are fed directly to weighted prediction without interpolation. At the end of the motion compensation process, weighted prediction (WP) applies a multiplicative weighting factor and an additive offset carried in the bitstream.

The pixels obtained by weighted prediction are added to the residual data to create the unfiltered pixels. Finally, the de-blocking filter loads these pixels and, after filtering, stores the reconstructed pixels into external memory. Because the data bus of the external frame memory is 32 bits wide, the number of pixels that can be loaded into the interpolator per cycle is limited.
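The weighting step described above can be sketched in a few lines. This is a minimal sketch, not the hardware of this design: the rounding and shift terms follow the explicit weighted prediction formulas of the H.264/AVC specification as assumed here, and the names `clip_pixel`, `weighted_uni`, `weighted_bi` and `log_wd` are hypothetical.

```python
def clip_pixel(value, bit_depth=8):
    """Clamp a sample to the valid range for the given bit depth."""
    return max(0, min((1 << bit_depth) - 1, value))


def weighted_uni(pred, weight, offset, log_wd):
    """Single-list weighted prediction of one sample.

    Assumed form: multiply by the weight, round, shift by log_wd,
    then add the offset and clip to the pixel range.
    """
    if log_wd >= 1:
        rounded = (pred * weight + (1 << (log_wd - 1))) >> log_wd
    else:
        rounded = pred * weight
    return clip_pixel(rounded + offset)


def weighted_bi(pred0, pred1, w0, w1, o0, o1, log_wd):
    """Bi-predictive weighted prediction of one sample (list-0 and list-1)."""
    rounded = (pred0 * w0 + pred1 * w1 + (1 << log_wd)) >> (log_wd + 1)
    return clip_pixel(rounded + ((o0 + o1 + 1) >> 1))


# A fade to half brightness plus an offset of 10 on a single-list sample.
print(weighted_uni(100, 16, 10, 5))
# Equal weights on both lists reduce to a rounded-down average here.
print(weighted_bi(100, 200, 32, 32, 0, 0, 5))
```

With `log_wd = 5`, a weight of 32 corresponds to a gain of 1.0, so the second call behaves like plain averaging of the list-0 and list-1 predictions.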

3.2 Motion Vector Predictor Design

To facilitate spatial prediction, we store the motion vectors of one row stripe of 4x4 blocks, the four left neighboring 4x4 blocks and the current 4x4 blocks in a row-store buffer. Figure 3.3 shows the shaded regions that have to be stored to predict the oblique-line region.

Figure 3.3 MV in shaded and oblique line region must be stored in row-FIFO.

The motion vector generator is shown in Figure 3.4. A motion vector is obtained by one of two prediction methods: MVP and direct mode coding. Note that direct mode coding is supported in B-slices. According to the MB type, the motion vector is obtained by the corresponding prediction method and stored into the current motion vector buffer.

Figure 3.4 Motion vector generator

3.2.1 MVp Prediction Module

In the MVP generation method, the motion vector is generated by summing the predicted motion vector (MVp) and the MVD. To calculate MVp, we employ directional segmentation prediction for the 8x16 and 16x8 block types and median prediction for the other block types. These predictions are integrated into the MVp generator, which calculates MVp from the motion vectors of the neighboring blocks in the current frame. The decoded motion vectors therefore have to be stored in a FIFO buffer for subsequent decoding. The FIFO buffer stores the decoded motion vector pairs (MVX, MVY). The depth and width of the MV FIFO depend on the decoded frame width and the search range, respectively. For instance, to support the 1080HD format, the total size of the FIFO buffer is 968 x 10 bits, where (120 blocks x 4 + 4) x 2 = 968 entries of 4x4-block size. SRAM is therefore selected as the FIFO buffer to store the required decoded motion vectors in our design. Once an entry of the FIFO buffer will no longer be used, the stored motion vector pair can be discarded. Furthermore, a 4 x 4 MV buffer is required because the maximum number of motion vectors per MB is 16. The motion vectors for decoding the current MB are stored in this 4 x 4 MV buffer. Because of bi-prediction, two 4 x 4 MV buffers are required to store the current motion vectors for the list-0 frame and the list-1 frame.
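The FIFO sizing above is simple arithmetic and can be checked with a short sketch. It assumes that the 120 blocks are the 16-pixel-wide macroblocks of a 1920-pixel-wide frame; the function names are hypothetical, and the factor of 2 and the 10-bit entry width are taken from the text as given.

```python
def mv_fifo_entries(frame_width_px, mb_width_px=16):
    """Number of 4x4-block entries in the MV FIFO for one frame width.

    Follows the expression in the text: (macroblocks per row x 4 + 4) x 2.
    """
    mbs_per_row = frame_width_px // mb_width_px
    return (mbs_per_row * 4 + 4) * 2


def mv_fifo_bits(frame_width_px, entry_bits=10):
    """Total FIFO storage in bits, using the 10-bit entry width from the text."""
    return mv_fifo_entries(frame_width_px) * entry_bits


print(mv_fifo_entries(1920))  # 968 entries for a 1920-pixel-wide frame
print(mv_fifo_bits(1920))     # 9680 bits in total
```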

Figure 3.5 Neighboring motion vectors required for decoding all motion vectors in current macroblock

When decoding the current macroblock, the required neighboring motion vectors are as shown in Figure 3.5. To cover all VBS conditions, the storage elements are based on the 4 x 4 block, the smallest element in the H.264/AVC video decoder. Each square indicates one motion vector pair. Predicting MV0-MV15 in the current MB requires the neighboring motion vectors in the left-upper corner (MVLU), right-upper corner (MVRU), upper (MVU0-MVU3) and left (MVL0-MVL3) positions. Neighboring motion vectors, except for the current MV, are shifted and stored into the MV FIFO.

[Figure 3.6 (a) labels: 16x16; 16x8_0, 16x8_1; 8x16_0, 8x16_1; 8x8_0 to 8x8_3; 4x4_0 to 4x4_15]

Figure 3.6 (a) block size position index, (b) directional prediction table (16x8, 8x16), (c) median prediction table (16x16, 8x8), (d) median prediction table (4x4)

MVp is calculated from MVA, MVB, MVC and MVD, which are taken from the neighboring motion vectors according to the block size position index of each macroblock type. The block size position index within one macroblock is illustrated in Figure 3.6 (a). MVA, MVB, MVC and MVD denote the motion vectors of the left, upper, right-upper and left-upper neighboring macroblock/partition/block, respectively, as shown in Figure 2.8 (c). Figure 3.6 (b)-(d) lists MVA, MVB, MVC and MVD for every block size position index. When the MB_type of the current macroblock is 16x8 or 8x16, MVp is derived by directional prediction; otherwise median prediction is used. Besides the above look-up tables (LUTs) required for motion vector prediction, many trivial boundary conditions and exceptions have to be handled; we do not describe them here for simplicity.
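The median case can be illustrated with a short sketch. It assumes the usual component-wise median of the left, upper and right-upper neighbors and omits the directional case as well as all boundary conditions and exceptions; the names `median3`, `median_mvp`, `reconstruct_mv` and `mv_diff` are hypothetical, with `mv_diff` standing for the decoded motion vector difference that is added to MVp.

```python
def median3(a, b, c):
    """Median of three integers without sorting."""
    return a + b + c - min(a, b, c) - max(a, b, c)


def median_mvp(mva, mvb, mvc):
    """Component-wise median of the left (A), upper (B) and right-upper (C) MVs."""
    return (median3(mva[0], mvb[0], mvc[0]),
            median3(mva[1], mvb[1], mvc[1]))


def reconstruct_mv(mvp, mv_diff):
    """Final motion vector: predictor plus decoded motion vector difference."""
    return (mvp[0] + mv_diff[0], mvp[1] + mv_diff[1])


mvp = median_mvp((4, -2), (1, 3), (6, 0))
print(mvp)                            # (4, 0)
print(reconstruct_mv(mvp, (-1, 2)))   # (3, 2)
```

Because the x and y components are handled independently, the predictor need not equal any single neighboring vector, as the example shows.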

3.2.2 Direct Mode Coding Design

Besides MVp prediction, the other way to predict the current motion vectors is direct mode coding, which has two types: spatial and temporal [7] [8]. The type is chosen by the user during encoding. As discussed earlier, the PSNR of the spatial mode is better than that of the temporal mode. In our design, we implement both the temporal and spatial modes and integrate them into the MVG module. When the temporal mode is invoked, the temporal direct mode coding module calculates the motion vectors from the picture order counts and the co-located motion vectors in the first list-1 frame. As introduced earlier for the temporal mode, the scalefactor value has to be calculated by Equation 2.4. From Equations 2.5 and 2.6, the two motion vectors for the list-0 and list-1 frames are computed with the scalefactor, so the scalefactor must be computed in advance. Figure 3.7 depicts the hardware that computes the scalefactor. We implement a division-free and multiplication-free design to reduce hardware complexity: multiplexers and shifters replace the division and the multiplication in the gray dotted area, as shown in Figure 3.8.

[Figure 3.7 labels: inputs CurrPoc, List0Poc and List1Poc; CLIP blocks; constants 4000H and 20H; shifts >>1 and >>6; signals TDB, TDD, tp_1, X, X1 and X2; output ScaleFactor]

Figure 3.7 Pre-scalefactor generator design

Figure 3.8 (a) Division free replacement (b) Multiplication-free replacement

Here the CLIP operation restricts TDB, TDD and the scalefactor to the range between -128 and 127. The CLIP operation is expanded as Equation 3.1. The division-free and multiplication-free design efficiently reduces the complexity of this module.

$\mathrm{CLIP} = (-128,\ 127,\ \mathit{input\_value})$ (3.1)

What is more, the process that produces motion vectors in the spatial mode is the same as the median method for MVp prediction. Therefore, the hardware of the MVp prediction module and the spatial direct mode coding predictor can be shared. When the spatial mode is chosen, the prediction process acts as MVp prediction, which needs the motion vectors of the neighboring blocks to generate the motion vector. Hence, we can employ the MVp generator to generate the motion vector without adding the MVD.
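The temporal-mode computation can be sketched in software as follows. This is only a sketch under stated assumptions: Equations 2.4 to 2.6 are not reproduced in this section, so the formulas below follow the temporal direct derivation of the H.264/AVC specification as assumed here, using the constants 4000H and 20H and the shifts >>1 and >>6 that appear in Figure 3.7. It uses an ordinary division and multiplication rather than the multiplexer and shifter replacement of Figure 3.8. All function names are hypothetical, and the clip range of the final scale factor is left as a parameter: the usage lines pass -1024 to 1023, the range assumed from the specification, whereas the text above quotes -128 to 127.

```python
def clip3(low, high, value):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def div_toward_zero(a, b):
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def scale_factor(curr_poc, list0_poc, list1_poc, sf_range):
    """Temporal direct scale factor from three picture order counts."""
    tdb = clip3(-128, 127, curr_poc - list0_poc)
    tdd = clip3(-128, 127, list1_poc - list0_poc)
    if tdd == 0:
        raise ValueError("list-0 and list-1 pictures share the same POC")
    x = div_toward_zero(0x4000 + abs(tdd) // 2, tdd)
    return clip3(sf_range[0], sf_range[1], (tdb * x + 0x20) >> 6)


def temporal_direct_mvs(mv_col, sf):
    """Scale the co-located MV into the list-0 and list-1 motion vectors."""
    mv_l0 = tuple((sf * c + 128) >> 8 for c in mv_col)
    mv_l1 = tuple(a - c for a, c in zip(mv_l0, mv_col))
    return mv_l0, mv_l1


# Current picture halfway between its list-0 and list-1 references.
sf = scale_factor(2, 0, 4, (-1024, 1023))
print(sf)                                 # 128, i.e. 0.5 in 8-bit fixed point
print(temporal_direct_mvs((8, -4), sf))   # ((4, -2), (-4, 2))
```

Since the scale factor depends only on picture order counts, it can be computed once per reference pair and reused for every co-located motion vector, which is why it is worth computing in advance.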

3.3 Bandwidth-Efficient Fractional Interpolator Design

Figure 3.9 (a) 4x4-block and 9x9 interpolation search windows for luma component interpolation (b) overlap region between neighboring blocks

Figure 3.10 (a) 2x2-block and 3x3 interpolation search windows for chroma component interpolation (b) overlap region between neighboring blocks

The interpolator design always dominates the throughput of an H.264/AVC decoder. Interpolating the fractional samples of each 4x4 luma block requires the 9 x 9 interpolation window illustrated in Figure 3.9 (a). If the motion vectors of two neighboring 4 x 4 blocks are the same, the 5 x 9 overlapped region between their interpolation windows can be reused. The overlapped region between neighboring blocks is shown in Figure 3.9 (b). The maximum overlapped region is 65 pixels for the luma search window. For each 2 x 2 chroma block, shown in Figure 3.10, the interpolation search window is 3 x 3 and 5 pixels can be reused between neighboring blocks. Thanks to this property, the overlapped region need not be fetched again when the current block is interpolated. The proposed data-reuse approach and some examples are introduced in sub-section 3.3.1.
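The overlap figures quoted above follow from simple counting, which the sketch below reproduces. It assumes that the neighboring blocks share the same motion vector, so that adjacent windows are offset by exactly one block width or height; the function name and parameters are hypothetical.

```python
def overlap_pixels(window, block, horizontal, vertical):
    """Pixels of a window x window search window already fetched by neighbors.

    Windows of adjacent blocks are offset by `block` pixels, so they share
    `window - block` columns (left neighbor) or rows (upper neighbor).
    """
    shared = window - block
    left = shared * window if horizontal else 0
    upper = shared * window if vertical else 0
    # The corner covered by both neighbors must not be counted twice.
    corner = shared * shared if (horizontal and vertical) else 0
    return left + upper - corner


print(overlap_pixels(9, 4, True, False))  # 45, the 5 x 9 luma overlap
print(overlap_pixels(9, 4, True, True))   # 65, the maximum luma overlap
print(overlap_pixels(3, 2, True, True))   # 5, the maximum chroma overlap
```

In the best luma case only 81 - 65 = 16 of the window pixels are not covered by the two overlaps.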

3.3.1 Data Reuse Technique

Two interpolating orders are considered for each macroblock: the row-major interpolating order shown in Figure 3.11 (a), which follows the scanning order of residual decoding, and the column-major interpolating order illustrated in Figure 3.11 (b). A dotted line indicates a transition between interpolating processes. We adopt the column-major interpolating order because its number of transitions is 5 times less than that of the row-major order. Each transition prevents the overlap region from being reused. Therefore, the column-major order is the better choice because it has fewer transitions.

Figure 3.11 (a) row-major interpolating order (b) column-major interpolating order

For data reuse, Wang’s design [10] proposed an extended 2 x 2 raster scanning order to increase throughput. Although this approach reduces the access cycles for motion compensation by 30%, the improvement is not sufficient for high-definition resolution. Therefore, based on the column-major interpolating order, we propose an extended-2D column-major approach (E2CMA) that reduces the number of read accesses to external memory and thereby achieves an approximately 60% reduction in access cycles.

E2CMA exploits the horizontal and vertical common regions of the interpolation search windows of neighboring blocks for data reuse. Because each 4x4 block needs a 9x9 search window to interpolate fractional pixels and the word length of the data bus is limited to 32 bits, three cycles are required to load the nine pixels of one column into an entry. Therefore, interpolating one 4x4 block takes 27 cycles (3 x 9) in the worst case, which occurs when the MB type is 4x4.


Figure 3.12: Luma component interpolation: (a) Interpolating block 4 and (b) Interpolating block 3 (c) Interpolating block 6


Figure 3.13 Chroma component interpolation: (a) Interpolating 2x2-block 0 (b) Interpolation 2x2-block 1

Figure 3.12 (a)-(c) gives examples of the horizontal and vertical data reuse of E2CMA. A charcoal-gray circle indicates a pixel that has already been stored in a buffer, and a light-gray circle indicates a pixel that must be loaded from external frame memory. Figure 3.12 (a)-(c) depicts three data-reuse cases: (a) horizontal data reuse, (b) vertical data reuse and (c) combined horizontal and vertical data reuse. The MB type is assumed to be 16x16 in these cases. First, the horizontal data-reuse approach is shown for interpolating block 4 in Figure 3.12 (a). It applies a content-switch operation to the content buffers. The pixels in columns 0-4 have already been stored in the content buffers, so only the pixels of the remaining four columns have to be loaded from external memory. After 12 cycles, the 16 interpolated pixels of block 4 have been produced. The vertical data-reuse approach is illustrated in Figure 3.12 (b) for interpolating block 2. In block 2, the upper six pixels of each column have been shifted into the Reuse-Register-File, so only the three lower pixels of each column must be fetched from external memory. One cycle is required to fetch the three lower pixels from external memory and, at the same time, to load the upper six pixels from the reuse registers, so nine cycles are needed in this case. The last case, the horizontal-vertical data-reuse approach, is shown in Figure 3.12 (c). It has the smallest number of interpolating cycles per block: because all pixels in columns 0-4 and the upper six pixels of the remaining four columns are stored in the content buffers and the Reuse-Register-File, respectively, the interpolation of a 4x4 block can be accomplished in four cycles. E2CMA can be applied to all MB types except 4 x 4, so data-reuse utilization is increased.
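The four luma cycle counts quoted above (27, 12, 9 and 4) can be reproduced with a small column-level model. It is a sketch that assumes, as in the text, three load cycles per full nine-pixel column, one cycle per column when the upper six pixels come from the reuse registers, and four remaining columns when the left five are already buffered; the function name is hypothetical.

```python
def luma_load_cycles(reuse_horizontal, reuse_vertical):
    """Load cycles for one 4x4 luma block with a 9x9 search window."""
    columns_to_load = 4 if reuse_horizontal else 9
    cycles_per_column = 1 if reuse_vertical else 3
    return columns_to_load * cycles_per_column


worst = luma_load_cycles(False, False)
for reuse_h, reuse_v in [(False, False), (True, False), (False, True), (True, True)]:
    cycles = luma_load_cycles(reuse_h, reuse_v)
    saving = 100 * (worst - cycles) / worst
    print(f"horizontal={reuse_h} vertical={reuse_v}: {cycles} cycles, {saving:.0f}% saved")
```

The model covers only the per-block load cycles; the overall reduction for a macroblock depends on how many of its blocks fall into each case.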

Because of the 4:2:0 chroma format and the quarter-pel precision of luma inter prediction, chroma inter prediction has one-eighth-pel motion resolution. E2CMA can be applied to chroma component interpolation as well, and Figure 3.13 (a)-(b) gives some examples. Chroma inter prediction operates on 2 x 2 blocks, and the chroma interpolation search window requires 3 x 3 pixels for each 2 x 2 block, as shown in Figure 3.10 (b). Figure 3.13 (a) shows the interpolation of chroma block 0. In this case data reuse cannot be applied, so three cycles are required. In the other case, shown in Figure 3.13 (b), E2CMA is used so that only two interpolation cycles are needed. As in Figure 2.12 (d), for a chroma 2 x 2 block with samples A, B, C and D, the fractional sample i has one-eighth-pel precision. E2CMA thus reduces the required access cycles by 33%.

Table 3.1 Analysis of different interpolating approaches

To give a more generic and platform-independent analysis, we analyze the pixels required per MB for each interpolating approach. Table 3.1 lists the required pixels per luma MB and chroma MB for the different interpolating approaches. Assuming that every motion vector has a fractional part, the best case has one motion vector and the worst case has 16 motion vectors per macroblock. Although all approaches require the same number of pixels in the worst case, the column-major-related approaches require fewer pixels than the row-major approach.

Although the column-major-related approaches perform better than the row-major approach, they require an additional synchronization buffer and degrade throughput because their scan order differs from that of the residual decoder. As for Wang’s approach, data reuse applies to only a few MB types, such as direct, skip, 16x16 and 16x8. The larger block sizes (skip, 16x16, 16x8) account for approximately 50% to 90% of macroblocks, depending on the bit rate; at higher bit rates, however, the improvement of Wang’s method is limited. In contrast, E2CMA can be applied to all MB types except 4x4. Therefore, its performance is better than that of the previous approach.

3.3.2 Combined Luma/Chroma Interpolation Architecture

In this subsection, several published interpolator designs are introduced. Recalling the fractional interpolation for H.264/AVC in Figure 2.11, a 6-tap FIR filter with coefficients (1, -5, 20, 20, -5, 1) and a bilinear filter are needed for the half- and quarter-pel precision of luma interpolation in an H.264/AVC video decoder. Accepting some PSNR loss for lower cost, Lie’s 4-tap diagonal FIR filter and three-stage recursive algorithm are proposed in [21], and Chen’s HVBi (bilinear filter in both horizontal and vertical directions) and VBi (vertical bilinear, horizontal FIR) schemes are introduced in [22]. However, when the frame sequence is very long to support B-slices, such as I + 9 P + 4B, the propagation of the PSNR loss may heavily degrade the video quality, especially in high-definition frame formats such as 1080HD.
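The two luma filters named above can be sketched for a single output sample. This sketch assumes the rounding (+16, >>5 for the half-pel sample and +1, >>1 for the quarter-pel sample) and the clipping to 8-bit pixels of the H.264/AVC specification; it does not cover the center half-pel position, which filters unrounded intermediate values, and the function names are hypothetical.

```python
def clip_pixel(value):
    """Clamp a filtered value to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pel(e, f, g, h, i, j):
    """Half-pel sample between g and h using the taps (1, -5, 20, 20, -5, 1)."""
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip_pixel((acc + 16) >> 5)


def quarter_pel(a, b):
    """Quarter-pel sample: rounded bilinear average of two neighboring samples."""
    return (a + b + 1) >> 1


# A flat area stays flat because the taps sum to 32.
print(half_pel(10, 10, 10, 10, 10, 10))
# A sharp edge overshoots and is clipped to the pixel range.
print(half_pel(0, 0, 255, 255, 0, 0))
print(quarter_pel(10, 13))
```

The clipping matters because the negative taps make the filter overshoot at edges, both above 255 and below 0.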

In contrast, considering PSNR loss and standard compatibility, Chien [23] and He [24] proposed adder-chain-based and adder-tree-based designs, respectively. These two types, depicted in Figure 3.14, are categorized as 1-D linear filter designs. To reduce cost, the multipliers can be simplified to adders and shifters. A 1-D linear interpolator is suitable for QCIF video sequences in mobile applications; for HDTV video sequences, however, throughput is a very important issue, and the long execution cycles of a 1-D linear design may lead to poor throughput. As another choice, Chien [23] also proposed a separate 1-D design that separates horizontal and vertical interpolation and processes them in parallel based on the 4 x 4 block size. This design achieves better throughput, although it may need more storage. Figure 3.15 shows the separate 1-D interpolator design without parallel processing.


Figure 3.14 (a) Adder-chain based [23] (b) Adder-tree based [24] 1-D linear interpolator design

Table 3.2 Comparison of execution cycles for different architectures

Interpolation Architecture            Interpolating cycles
Adder-chain based 1-D [10]            57
Adder-tree based 1-D [10]             52
Separate 1-D (no parallel) [10]       36
Separate 1-D (2 parallels)            18
Separate 1-D (4 parallels) [10]       9


Figure 3.15 Separate 1-D interpolator design (no parallel)

Assuming that all 9 x 9 interpolated data for each 4 x 4 block are ready and they can be accessed randomly, Table 3.2 lists the execution cycles for different architectures. For
