CHAPTER 2 ALGORITHM DESCRIPTION AND ANALYSIS
2.4 Summary
The H.264 profiling on an ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is crucial. The inter prediction for H.264/AVC and the comparison among different standards are also illustrated in this chapter.
Chapter 3
Motion Compensation Design for MPEG-2/H.264 video decoder
The state-of-the-art video coding standard H.264/AVC provides a compression ratio that significantly outperforms all previous video compression standards. However, unlike traditional MPEG-x standards, H.264/AVC lacks backward compatibility with the former MPEG-x and H.26x video coding standards. Therefore, combining multiple video coding standards in one design is essential to support modern multimedia systems. For example, the DVD Forum adopted MPEG-2, H.264/AVC, and VC-1 (also known as WMV-9) as mandatory for the next-generation HD-DVD and Blu-ray formats. As for digital video broadcasting (DVB), the DVB-T system, which is designed for digital terrestrial television services, is directly compatible with MPEG-2 coded TV signals.
Furthermore, mobile DVB, presently called DVB-H, allows the transmission of H.264/AVC video content because of its high coding efficiency. In particular, DVB-H is backward compatible with DVB-T but transmits a different video format. Designing an efficient video decoder for multi-standard video applications is therefore both in demand and challenging.
This chapter discusses the design of motion compensation, which dominates the amount of data transfer in the video decoder, for an MPEG-2/H.264 dual video decoder. The rest of the chapter is structured as follows. Section 3.1 illustrates the motion compensation engine for the H.264/AVC decoder. The combined motion compensation engine for MPEG-2/H.264 and its analysis are discussed in Section 3.2. Finally, a summary is given in Section 3.3.
3.1 Motion Compensation Engine for H.264/AVC decoder
Fig. 3.1 Motion compensation engine for H.264 video decoder (labeled blocks: motion vector predictor, 4 x 4 MV buffer, line MV FIFO, address generator)
Fig. 3.1 illustrates the whole motion compensation engine for the H.264/AVC video decoder. First, the line MV FIFO stores decoded motion vectors for motion vector prediction, and the 4 x 4 MV buffer stores the decoded motion vectors for decoding the current MB. Then, the address generator sends the reference address to the memory access controller. The task of the motion controller is to schedule consecutive access commands and send them to the frame memories. The burst-read data are kept in the read data buffer and then filtered by the interpolator. Finally, the interpolated reference data are added to the residual data and then passed through the de-blocking filter.
Our proposed decoder adopts a ping-pong structured external frame memory [28], in which two memories alternately store the reference frame and the current frame.
The following subsections discuss the details of the modules other than the memory access controller; the frame memory access controller is discussed in detail in Chapter 4.
Subsection 3.1.1 illustrates the motion vector generator, including the motion vector predictor (MVP) and the related storage. Subsection 3.1.2 presents a data reuse technique for the interpolator. Subsection 3.1.3 analyzes the proposed data reuse technique. Finally, the luma and chroma interpolator designs are reported in Subsections 3.1.4 and 3.1.5, respectively.
3.1.1 Motion Vector Generator
Fig. 3.2 Motion vector information storage for the motion vector predictor, QCIF frame format (labels: current MB, next MB 0 to next MB 4, frame boundary; storage indices 0 to 11)
The motion vector generator mainly contains the motion vector predictor, the line MV FIFO and the 4 x 4 MV buffers. A motion vector is generated by summing the motion vector prediction (MVP) and the motion vector difference (MVD). The MVP value is calculated from the neighboring MVs, so decoded motion vectors must be stored for subsequent decoding. The line MV FIFO stores the decoded motion vector pairs (MVX, MVY). The depth and width of the MV FIFO depend on the frame width and the search range, respectively. Once an entry of the MV FIFO will no longer be used, that motion vector pair can be discarded.
A 4 x 4 array of MV buffers is required since the maximum number of motion vectors per MB is sixteen. The motion vectors for decoding the current MB are stored in these 4 x 4 MV buffers.
As for the total storage required by the motion vector generator, Fig. 3.2 shows an example. In total, 4 x 11 motion vector pairs have to be stored for the QCIF frame format. The required neighboring motion vectors are detailed in Fig. 3.3. To cover all conditions, the storage element is based on the 4 x 4 block, the smallest element in the H.264/AVC video decoder. Each square indicates one motion vector pair. Decoding MV0-MV15 in the current MB needs the neighboring motion vectors in the upper-left corner (MVLU), upper-right corner (MVRU), upper (MVU0-MVU3) and left (MVL0-MVL3) positions.
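The 4 x 11 figure can be reproduced with a little arithmetic. The sketch below assumes that the line storage holds one row of 4 x 4-block motion vector pairs per macroblock row, i.e. four pairs per macroblock across the frame width; the function name and constants are illustrative and not taken from the design.

```python
MB_SIZE = 16     # macroblock width in luma pixels
BLOCK_SIZE = 4   # smallest H.264/AVC block width in luma pixels


def line_mv_pairs(frame_width):
    """Return (pairs per macroblock, macroblocks per row, total pairs).

    Assumption: one motion vector pair is kept per 4 x 4 block column
    of every macroblock in a macroblock row.
    """
    mbs_per_row = frame_width // MB_SIZE
    pairs_per_mb = MB_SIZE // BLOCK_SIZE
    return pairs_per_mb, mbs_per_row, pairs_per_mb * mbs_per_row


# QCIF is 176 pixels wide: 4 pairs x 11 macroblocks
print(line_mv_pairs(176))
# CIF is 352 pixels wide: 4 pairs x 22 macroblocks
print(line_mv_pairs(352))
```

Under this assumption the storage grows linearly with the frame width, which is consistent with the FIFO depth depending on the frame width.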
Fig. 3.3 Neighboring motion vectors needed when decoding all motion vectors in current macroblock (labels include MVLU, MVU0-MVU3, MVRU)
The detailed architecture of the motion vector generator is shown in Fig. 3.4. Motion vector generation involves two phases: the first loads MVD into the 4 x 4 MV buffers, and the second calculates MV = MVP + MVD and writes the result back into the 4 x 4 MV buffers.
The proposed memory storage can be treated as the two-level memory hierarchy depicted in Fig. 3.5. Four line MV FIFOs are implemented using SRAM, and local registers store the neighboring motion vectors for the current MB. The local registers that store neighboring motion vectors include the left MV line buffer and the upper-left, upper, upper-right and left MV registers. The
vectors required in current MB decoding. After the current MB has been decoded, the FIFOs need one push and one pop operation, which together occupy two cycles, to update all contents of the local registers for decoding the next MB.
Fig. 3.4 (a) Motion vector generator architecture for QCIF format, (b) MV buffer unit (labels: 4x4 MV buffers, left MV line buffer, MVP; MVD loaded from MV buffer, MV written back to MV buffer; MVA, MVB, MVC, MVD; neighboring MVs from the upper, left, current, upper-right and upper-left MBs)
Fig. 3.5 Two-level memory hierarchical structure for MVP (line MV FIFO, 4x4 MV buffer)
Fig. 3.6 (a) Block size_position index (16x16, 8x8_0 to 8x8_3, 4x4_0 to 4x4_15), (b) directional prediction table (16x8, 8x16), (c) median prediction table (16x16, 8x8), (d) median prediction table (4x4)
MVP is calculated from MVA, MVB, MVC and MVD, whose values are derived from the neighboring motion vectors according to the block size_position index illustrated in Fig. 3.6 (a). MVA, MVB, MVC and MVD denote the motion vectors of the left, upper, upper-right and upper-left neighboring macroblock/partition/block, respectively, as shown in Fig. 2.3 (c). Fig. 3.6 (b)-(d) lists MVA, MVB, MVC and MVD for each block size_position index. Besides the look-up table (LUT) required for motion vector prediction, many minor boundary conditions and exceptions have to be handled; they are not described here for brevity.
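For the median prediction case, the calculation can be sketched as follows. This is a minimal illustration, assuming that MVA, MVB and MVC are all available and that the component-wise median of H.264/AVC applies; the directional cases (16x8, 8x16), the substitution of MVD for an unavailable MVC, and the boundary conditions are left out, and all function names are illustrative.

```python
def median3(a, b, c):
    """Median of three integers."""
    return max(min(a, b), min(max(a, b), c))


def median_mvp(mv_a, mv_b, mv_c):
    """Component-wise median of the left, upper and upper-right MVs."""
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def reconstruct_mv(mvp, mv_diff):
    """MV = MVP + MVD, applied to the (x, y) pair."""
    return (mvp[0] + mv_diff[0], mvp[1] + mv_diff[1])


mvp = median_mvp((1, 3), (2, 0), (-2, 1))
mv = reconstruct_mv(mvp, (2, -1))
print(mvp, mv)
```

Here `mv_diff` stands for the decoded motion vector difference, named differently from the upper-left neighbor MVD to avoid confusing the two.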
3.1.2 Data Reuse Technique for Interpolator
Fig. 3.7 (a) 4x4 block window and the corresponding 9x9 interpolation window, (b) overlapped region for neighboring interpolation window
Fig. 3.8 (a) 2x2 raster scanning order, (b) row-major 2x2 raster scanning order, (c) column-major 2x2 raster scanning order
As Fig. 3.7 (a) shows, interpolating the fractional sample values of a 4 x 4 block needs a 9 x 9 interpolation window. If the motion vectors of two neighboring 4 x 4 blocks are the same, the 5 x 9 overlap region between the two interpolation windows can be reused. The scanning order of residual decoding for each macroblock is the 2 x 2 raster scanning order shown in Fig. 3.8 (a). Considering the two scanning orders illustrated in Fig. 3.8 (b) and (c), the row-major order needs 13 transitions whereas the column-major order needs only 5. At each transition the overlap region cannot be reused, so the column-major order is the better choice because it has fewer transitions.
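The window and overlap sizes follow directly from the 6-tap filter length. The sketch below is a rough model under the assumption that every reused step between two neighboring blocks saves exactly one 5 x 9 overlap region; it does not model the specific scanning orders of Fig. 3.8, and the function names are illustrative.

```python
FILTER_TAPS = 6  # H.264/AVC luma half-sample filter length


def window_size(block=4):
    """Side of the interpolation window for a block x block region."""
    return block + FILTER_TAPS - 1


def overlap_width(block=4):
    """Width of the region shared by two adjacent windows."""
    return window_size(block) - block


def pixels_per_mb(blocks=16, reused_steps=0, block=4):
    """Pixels fetched for one luma MB when `reused_steps` of the
    block-to-block steps can reuse the overlap region."""
    if not 0 <= reused_steps <= blocks - 1:
        raise ValueError("reused_steps must be between 0 and blocks - 1")
    side = window_size(block)
    saved_per_step = overlap_width(block) * side
    return blocks * side * side - reused_steps * saved_per_step


print(window_size(), overlap_width())   # 9 and 5
print(pixels_per_mb())                  # no reuse: 16 windows of 81 pixels
print(pixels_per_mb(reused_steps=10))
```

With no reuse the model gives 16 x 81 = 1296 pixels per luma macroblock; every reused step removes 45 pixels from that total.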
Fig. 3.9 (a) 2x2 raster scanning order, (b) 4x4 raster scanning order, (c) extended 2x2 raster scanning order
Fig. 3.10 Synchronization buffer scheme for two different scanning orders in inter prediction: (a) 2x2 raster scanning order, (b) 4x4 raster scanning order
In a video decoding system, inter prediction is usually processed at the macroblock level. Thus, the decoding order of the 4 x 4 blocks, the smallest block element in the H.264/AVC video decoder, can be chosen freely within each macroblock. With this in mind, the 2 x 2 and 4 x 4 raster scanning orders are depicted in Fig. 3.9 (a) and (b); the column-major 4 x 4 raster scanning order needs only four transitions, fewer than the 2 x 2 raster scan.
pixels in the residual adder, because its scanning order differs from that of the residual decoder, which must follow the 2 x 2 raster scanning order defined in the standard [1].
Fig. 3.11 Content-swap operation (interpolator with attached content buffer)
Fig. 3.12 An example of macroblock partition; (1, 3) indicates (mv_x, mv_y). Motion vectors shown: (1, 3), (1, 3), (2, 0), (2, 0), (-2, 1), (2, 0)
To resolve this problem, we attach a content buffer to the interpolator, as illustrated in Fig. 3.11, and adopt the extended 2 x 2 raster scanning order shown in Fig. 3.9 (c). The size of the content buffer depends on the local registers in the interpolator. Each gray block in Fig. 3.9 (c) indicates a content-swap operation, which swaps the entire content of the interpolator's local registers with that of the content buffer. As a result, if the motion vectors of block 1 and block 4 are the same, the overlapped region in Fig. 3.7 (b) need not be re-fetched when decoding block 4. The extended 2 x 2 raster scanning order therefore follows the same 2 x 2 raster scanning as the residual decoder while achieving the data reuse of the 4 x 4 raster scanning order. The content-swap operation is effective only for larger block partitions or when the motion vectors of the neighboring blocks are the same. The condition for executing this operation is given by expression (3.1).
$$\text{content\_swap\_condition} = (\text{mb\_type} == 16\text{x}16) \;\|\; (\text{mb\_type} == 16\text{x}8) \qquad (3.1)$$
However, for the example shown in Fig. 3.12, condition (3.1) evaluates to false. If the equality of neighboring motion vectors is checked instead of the block size, the data in the example of Fig. 3.12 can also be reused. The motion vector checks between neighboring blocks are listed in Table 3.1.
Table 3.1 Neighboring MV checking table for content-swap operation
Block number    Checking condition
1               MV1 == MV4
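The difference between the two checks can be sketched in a few lines. This is only an illustration: the set of large partitions is taken from the partitions named in the text (SKIP, 16x16, 16x8), the block pair (1, 4) is the single entry of Table 3.1 available here, and the function names and the sample motion vectors are hypothetical.

```python
LARGE_PARTITIONS = {"SKIP", "16x16", "16x8"}


def swap_by_partition(mb_type):
    """Block-size based check: reuse only for large partitions."""
    return mb_type in LARGE_PARTITIONS


def swap_by_mv(mvs, block, neighbor):
    """Motion-vector based check: reuse whenever the two 4 x 4 blocks
    carry the same (mv_x, mv_y) pair, whatever the partition is."""
    return mvs[block] == mvs[neighbor]


# Hypothetical 8x8-partitioned macroblock in which blocks 1 and 4
# happen to share the same motion vector.
mvs = {1: (2, 0), 4: (2, 0)}
print(swap_by_partition("8x8"))   # False: the block-size check misses it
print(swap_by_mv(mvs, 1, 4))      # True: the MV check catches it
```

The motion-vector check covers every case of the block-size check, since all 4 x 4 blocks of a large partition share one motion vector.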
3.1.3 Analysis for Data Reuse Technique
To give a more generic and platform-independent analysis, we analyze the required pixels per MB and the cost overhead of each method. Taking fractional motion compensation of each macroblock into account, the required pixels per MB and the cost overhead of the different methods are summarized in Table 3.2. Assuming that every motion vector has a fractional part, the best case has one motion vector and the worst case has 16 motion vectors per luma macroblock. Although the required pixels are the same for every method in the worst case, requisite pixels
column-major methods, the 4 x 4 raster scanning order (RSO) is the most effective; however, it requires an additional synchronization buffer and degrades throughput because its RSO differs from that of the residual decoder. As for the extended methods, condition (3.1) takes effect only for larger block partitions (SKIP, 16x16, 16x8). That is, it cannot reuse data in cases such as Fig. 3.12, even if the neighboring motion vectors are the same. To remove this disadvantage, method 5 checks the neighboring motion vectors rather than the block size, so that the required bandwidth in the Fig. 3.12 case is reduced to that of the 4 x 4 RSO. The advantage of the extended method is that it requires only a content buffer, which is smaller than the buffer of method 3, and costs only a small number of extra cycles for the content-swap operation. Method 4 outperforms methods 1/2/3 only for larger block sizes (SKIP, 16x16, 16x8), but, as Fig. 3.13 shows, these larger block sizes account for 50 % to 90 % of the blocks. Furthermore, method 5 not only covers all cases of method 4 but also takes effect for smaller block sizes, as in Fig. 3.12. As shown in Fig. 3.14, applying the extended method in our design reduces the required memory bandwidth by about 30 % compared with the column-major 2x2 RSO method.
Table 3.2 Static analyses for different methods in H.264/AVC.
Assumption: each motion vector contains a fractional part.
Method             Required pixels per luma MB                    Cost overhead
                   Worst case    Best case    Fig. 3.12 case
1  R 2 x 2 RSO     1296          1296         1296               0
* R: row-major, C: column-major, RSO: raster scanning order, CS: content-swap operation (one cycle)
* Best case: one MB contains one motion vector
Fig. 3.13 Block proportion under different bit-rate environments (plot: block size proportion, foreman-QCIF; x-axis: bit rate (kbps); y-axis: proportion; series: SKIP/16x16/16x8, Other)
Fig. 3.14 Required memory bandwidth for different methods (plot: foreman-QCIF; x-axis: bit rate (kbps); y-axis: required bandwidth (MByte/s); series: row-major 2x2 RSO, column-major 2x2 RSO, extended column-major 2x2 RSO)
3.1.4 Luma Interpolator Design
Fig. 3.15 1-D linear interpolator design: (a) adder-chain based [10], (b) adder-tree based [11]
Fig. 3.16 Separate 1-D interpolator design (no parallel)
This subsection reviews several interpolator designs. Recalling the fractional interpolation for H.264/AVC in Fig. 2.2, a 6-tap FIR filter with coefficients (1, -5, 20, 20, -5, 1) and a bilinear filter are needed for half- and quarter-pixel interpolation in the H.264/AVC video decoder. To lower cost at an acceptable PSNR loss, Lie proposed a 4-tap diagonal FIR filter and a three-stage recursive algorithm [8], and Chen reported the HVBi scheme (bilinear filtering in both the horizontal and vertical directions) and the VBi scheme (vertical bilinear, horizontal FIR) [9]. However, when the P-frame sequence is very long, such as I + 29 P, the propagation of PSNR loss may heavily degrade video quality, especially in high-definition frame formats. In contrast, to avoid PSNR loss and remain standard-compatible, Chien [10] and He [11] presented adder-chain and adder-tree based designs, respectively. These two types, depicted in Fig. 3.15, are categorized as 1-D linear filter designs.
To reduce cost, multipliers can be simplified to adders and shifters. The 1-D linear interpolator is suitable for QCIF video sequences in mobile applications; for HDTV video sequences, however, throughput is very important, and the long execution time of the 1-D linear design may lead to poor throughput. As an alternative, Chien [10] also proposed a separate 1-D design that separates horizontal and vertical interpolation and processes them in parallel on a 4 x 4 block basis. This design yields better throughput, although it may need more storage. Fig. 3.16 shows the separate 1-D interpolator design without parallel processing.
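The replacement of multipliers by adders and shifters can be illustrated in software. The sketch below applies the 6-tap coefficients (1, -5, 20, 20, -5, 1) with the half-sample rounding and clipping of H.264/AVC, (sum + 16) >> 5 limited to 0..255; the particular shift-add grouping is one possible decomposition chosen for illustration and is not claimed to match the adder networks of [10] or [11].

```python
def fir6_multiply(p):
    """Reference 6-tap filter using explicit multiplications."""
    e, f, g, h, i, j = p
    return e - 5 * f + 20 * g + 20 * h - 5 * i + j


def fir6_shift_add(p):
    """Same filter with shifts and adds only:
    20*x = (x << 4) + (x << 2) and 5*x = (x << 2) + x."""
    e, f, g, h, i, j = p
    centre = g + h          # both taps weighted by 20
    inner = f + i           # both taps weighted by -5
    return (e + j) + ((centre << 4) + (centre << 2)) - ((inner << 2) + inner)


def half_sample(p):
    """Half-sample value: round, scale by 1/32 and clip to 8 bits."""
    value = (fir6_shift_add(p) + 16) >> 5
    return max(0, min(255, value))


row = (10, 20, 30, 40, 50, 60)
print(fir6_multiply(row), fir6_shift_add(row), half_sample(row))
```

Grouping the symmetric taps before scaling halves the number of scaled terms, which is the usual motivation for exploiting the symmetry of the coefficient set.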
Table 3.3 Comparison of execution cycles for different architectures
Architecture                    Ideal execution cycles
Adder-chain based 1-D           57
Adder-tree based 1-D            52
Separate 1-D (no parallel)      36
Separate 1-D (2 parallel)       18
Separate 1-D (4 parallel)       9
Assuming that all 9 x 9 data for interpolating each 4 x 4 block are ready and can be accessed randomly, Table 3.3 lists the execution cycles of the different architectures. For
adder-networks are used to overlap the row inputs and eliminate the latency overhead of every row except the first. The total number of cycles required is 57 (5 + 4 x 9 + 4 x 4); the detailed operation is described by Chien [10]. For the adder-tree based 1-D design, the row data can be loaded in parallel without shifting one by one, hence there is no latency overhead and the total number of cycles is 52 (4 x 9 + 4 x 4). In the separate 1-D design, the first output appears at the 6th clock cycle and the following 3 outputs are generated in the next 3 clock cycles. Therefore, the separate 1-D design without parallelism needs 36 ((6 + 3) x 4) cycles to interpolate one 4 x 4 block. Similarly, the separate 1-D designs with 2 and 4 parallel paths require 18 ((6 + 3) x 2) and 9 (6 + 3) cycles, respectively. The 4-parallel design shown in Fig. 3.17 requires a content buffer of 6 x 9 pixels, which can be implemented in local registers or SRAM. However, SRAM requires several cycles to accomplish the content-swap operation, so we choose local registers in order to execute the content swap in one cycle. We select the 4-parallel separate 1-D architecture because its small number of execution cycles can be hidden behind the data-read cycles from frame memory. In addition, it is easier to combine with the interpolation for MPEG-2 video decoding, as shown in Subsection 3.2.1.
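The cycle counts quoted above can be checked with a short calculation. The sketch simply encodes the expressions given in the text (5 + 4 x 9 + 4 x 4, 4 x 9 + 4 x 4 and (6 + 3) per group of columns); the function names are illustrative, and the model assumes, as the text does, that all window data are available without stalls.

```python
def adder_chain_cycles():
    """Initial latency of 5 cycles plus 4 x 9 and 4 x 4 filtering cycles."""
    return 5 + 4 * 9 + 4 * 4


def adder_tree_cycles():
    """Rows are loaded in parallel, so the initial latency disappears."""
    return 4 * 9 + 4 * 4


def separate_1d_cycles(parallel):
    """First output at cycle 6, three more outputs in 3 cycles; the four
    columns of a 4 x 4 block are split over `parallel` paths."""
    if parallel not in (1, 2, 4):
        raise ValueError("parallel must be 1, 2 or 4")
    return (6 + 3) * (4 // parallel)


print(adder_chain_cycles(), adder_tree_cycles())
print([separate_1d_cycles(n) for n in (1, 2, 4)])
```

The results agree with the entries of Table 3.3.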
Fig. 3.17 4-parallel separate 1-D luma interpolator with content buffer (blocks: FIR filters, four bilinear filters, content buffer)
3.1.5 Chroma Interpolator Design
Fig. 3.18 Interpolation window for each 2 x 2 chroma block
Fig. 3.19 (a) Chroma interpolator, (b) vertical/horizontal filter
Because of the 4:2:0 chroma format and the quarter-sample precision of luma inter prediction, chroma inter prediction achieves one-eighth-sample motion resolution. Chroma inter prediction must be processed on a 2 x 2 block basis, and chroma interpolation requires 3 x 3 pixels for each 2 x 2 block, as shown in Fig. 3.18. For a chroma 2 x 2 block consisting of A, B, C and D, the corresponding fractional samples are e, f, g and h, with one-eighth-sample precision. Compared with the direct-mapping design with 8 multipliers, whose equation is listed in Fig. 2.2 (d), we rewrite the equation as equation (3.2), which reduces the number of multipliers to 4.
]
Similar to the luma interpolator, the chroma interpolator can be separated into horizontal and vertical filters. The corresponding separate 1-D design is depicted in Fig. 3.19 (a), and the vertical/horizontal filter is illustrated in Fig. 3.19 (b). Two chroma interpolators are required to generate interpolated values for 2 pixels in parallel, and filtering 2 x 2 pixels takes 3 cycles if all required pixels are ready. With the 2-parallel chroma interpolator design depicted in Fig. 3.20, only one cycle of latency is induced.
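The separation into a vertical and a horizontal stage can be checked against the direct bilinear formula of H.264/AVC. In the sketch below, A, B, C and D denote the four integer chroma samples surrounding the fractional position (top-left, top-right, bottom-left, bottom-right), which is the naming of the standard's formula and an assumption about how it maps onto the figures; the factored form is one possible rewriting and is not claimed to be the rewritten equation used in this design.

```python
def chroma_direct(a, b, c, d, x_frac, y_frac):
    """Direct form: four weighted samples, rounded and divided by 64."""
    return ((8 - x_frac) * (8 - y_frac) * a + x_frac * (8 - y_frac) * b
            + (8 - x_frac) * y_frac * c + x_frac * y_frac * d + 32) >> 6


def lerp8(p, q, frac):
    """(8 - frac) * p + frac * q written with a single multiplication."""
    return (p << 3) + frac * (q - p)


def chroma_separable(a, b, c, d, x_frac, y_frac):
    """Vertical stage on each column, then one horizontal stage."""
    left = lerp8(a, c, y_frac)
    right = lerp8(b, d, y_frac)
    return (lerp8(left, right, x_frac) + 32) >> 6


samples = (10, 200, 37, 91)
same = all(chroma_direct(*samples, x, y) == chroma_separable(*samples, x, y)
           for x in range(8) for y in range(8))
print(same)
```

Because rounding is applied only once, after both stages, the separable form stays bit-exact with the direct form for every one-eighth-sample position.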
Fig. 3.20 2-parallel chroma interpolator (each path: vertical filter (yFrac), registers, horizontal filter (xFrac), round; inputs A, B, C, D; outputs e, f, g, h)
3.2 Combined Motion Compensation Engine for MPEG-2/H.264 Dual Video Decoder
Our H.264/MPEG-2 dual-standard video decoder is illustrated in Fig. 3.21 and the component of MPEG-2 decoder is depicted in Fig. 3.22. Compared with H.264/AVC standard, MPEG-2 does not provide intra prediction and in-loop de-blocking filter, and only supports half motion precision for both luma and chroma macroblock. Unlike median/directional prediction of MVP algorithm supported in H.264/AVC, motion vectors are only decided by updated PMV and bitstream side information like f_code, motion_residual and motion_code.
The detailed algorithm of motion vector generation can refer to [2]. Besides motion vector generator, a reconfigurable interpolator design for dual-standard is proposed in section 3.2.1 and section 3.2.2 gives the cost analysis.
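For orientation, the way one MPEG-2 motion vector component is derived from PMV, f_code, motion_code and motion_residual can be sketched as below. This follows the motion vector decoding procedure of the MPEG-2 video specification for the plain frame-prediction case only; field prediction, dual-prime handling and the PMV update rules are omitted, and the function name and argument layout are illustrative rather than taken from this design.

```python
def decode_mpeg2_mv_component(pmv, f_code, motion_code, motion_residual=0):
    """Reconstruct one motion vector component (in half-sample units)."""
    r_size = f_code - 1
    f = 1 << r_size
    high = 16 * f - 1        # largest representable component
    low = -16 * f            # smallest representable component
    span = 32 * f            # size of the wrap-around range

    if f == 1 or motion_code == 0:
        delta = motion_code
    else:
        delta = (abs(motion_code) - 1) * f + motion_residual + 1
        if motion_code < 0:
            delta = -delta

    vector = pmv + delta
    # Wrap the result back into [low, high]
    if vector < low:
        vector += span
    if vector > high:
        vector -= span
    return vector


print(decode_mpeg2_mv_component(0, 1, 3))
print(decode_mpeg2_mv_component(10, 3, 2, 1))
print(decode_mpeg2_mv_component(-60, 3, -2, 1))
```

Since the prediction is just the previous vector (PMV), no neighboring-block storage comparable to the line MV FIFO is needed for this step.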
Fig. 3.21 Motion compensation engine for H.264/MPEG-2 decoder (labeled blocks: motion vector predictor for H.264, 4 x 4 MV buffer, line MV FIFO, address generator)
Fig. 3.22 MPEG-2 motion compensation engine part (labeled blocks: motion vector predictor for H.264, 4 x 4 MV buffer, line MV FIFO, address generator)
3.2.1 Reconfigurable dual-standard interpolator design
The main additional cost of the motion compensation engine when combining it with the MPEG-2 video decoder is the interpolator. This subsection focuses on sharing storage and arithmetic modules between the two standards to minimize area overhead. For macroblock-based fractional motion compensation in MPEG-2, each 16 x 16 macroblock needs a 17 x 17 interpolation window to interpolate fractional samples. Each macroblock can be partitioned into four 8 x 8 blocks, each with a 9 x 9 interpolation window, whose size is identical to that of the H.264/AVC luma interpolation window for a 4 x 4 block. In addition, the bilinear filter for H.264/AVC luma quarter-sample interpolation can be shared with the bilinear filter for MPEG-2 half-sample interpolation. For the 4-parallel luma interpolator shown in Fig. 3.17, part of the registers and the bilinear filters, shaded in Fig. 3.23, can be shared with the MPEG-2 interpolator.
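How an 8 x 8 block is produced from its 9 x 9 window can be sketched as follows. The sketch assumes the usual MPEG-2 half-sample rule, in which the least significant bit of each motion vector component selects averaging in that direction and the result is rounded to the nearest integer; the single four-sample expression used here reduces to the two-sample average when only one direction is half-sample. The function name and the test window are illustrative.

```python
def mpeg2_half_pel_block(window, mv_x, mv_y, size=8):
    """Interpolate a size x size block from a (size+1) x (size+1) window.

    `window` starts at the integer position selected by the upper bits
    of the motion vector; the LSBs act as half-sample flags.
    """
    half_x = mv_x & 1
    half_y = mv_y & 1
    block = []
    for r in range(size):
        row = []
        for c in range(size):
            total = (window[r][c] + window[r][c + half_x]
                     + window[r + half_y][c] + window[r + half_y][c + half_x])
            row.append((total + 2) >> 2)
        block.append(row)
    return block


# 9 x 9 test window with a simple gradient
window = [[10 * r + 2 * c for c in range(9)] for r in range(9)]
print(mpeg2_half_pel_block(window, 2, 4)[0][0])   # integer position
print(mpeg2_half_pel_block(window, 3, 0)[0][0])   # horizontal half sample
print(mpeg2_half_pel_block(window, 1, 1)[0][0])   # half sample in both
```

Four such 8 x 8 blocks cover one 16 x 16 macroblock, which is why the 9 x 9 window of the H.264/AVC luma path is large enough to be reused.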
Fig. 3.23 Shared local registers and bilinear filters for MPEG-2 (four bilinear filters; register rows 0 to 8)
Fig. 3.24 Data flow of (a) vertical bilinear filter, (b) horizontal bilinear filter, (c) both vertical and horizontal bilinear filter (stages 0 to 2)
Besides the shared modules described above, only extra control circuits for the data flow are required for MPEG-2 interpolation. Fig. 3.24 shows the data flow of the vertical or horizontal bilinear filter; the half-sample flag is determined by the LSB of the motion vector. First, we have to consider the IDCT/IIT, which is the last stage of the MPEG-2/H.264 residual decoder. The inverse discrete cosine transform (IDCT) for MPEG-2 is an 8 x 8 block based module, whereas the inverse integer transform (IIT) for H.264/AVC is a 4 x 4 block based decoding process. To combine modules and share storage, these two modules can be merged into a single multi-mode IDCT whose output is 4 pixels in parallel for both standards. Moreover, only four bilinear filters are available for MPEG-2/H.264, hence the filtering of each 8-pixel column has to be separated into two