CHAPTER 1 INTRODUCTION
1.2 Thesis Organization
The thesis is organized as follows. Chapter 2 describes and analyzes the algorithms. Chapter 3 first describes the motion compensation engine for the H.264/AVC video decoder and then presents the motion compensation engine for the MPEG-2/H.264 dual video decoder. It also proposes a data reuse technique that reduces the required bandwidth, particularly in H.264/AVC fractional motion compensation. Chapter 4 presents the frame memory organization, including the frame memory access controller for external SDRAM and the merging structured frame organization, which is one of the frame compression methods. Chapter 5 gives the chip implementation. Finally, Chapter 6 presents the conclusion and future work.
Chapter 2
Algorithm Description and Analysis
Fig. 2.1 General structure of H.264 encoder
Fig. 2.2 General structure of H.264 decoder
Fig. 2.1 and Fig. 2.2 show the general structure of the H.264/AVC video encoder and decoder, respectively. The H.264/AVC design covers a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). We concentrate only on the VCL, which efficiently represents the video content. H.264/AVC follows the so-called block-based hybrid video coding approach, which combines temporal and spatial prediction with transform coding.
The main additional blocks compared with prior standards are intra prediction and the in-loop de-blocking filter. Fig. 2.3 and Fig. 2.4 illustrate the general structure of the MPEG-2 encoder and decoder, respectively. Compared with H.264/AVC, the MPEG-2 decoding flow is simpler because it has no intra prediction or in-loop de-blocking filter; the only exception is that its DCT/IDCT is more complicated than the integer transform of the H.264/AVC codec.
Fig. 2.3 General structure of MPEG-2 encoder
Fig. 2.4 General structure of MPEG-2 decoder
This chapter is structured as follows. Section 2.1 presents the software profiling. Section 2.2 describes the H.264/AVC motion compensation algorithm. Finally, Section 2.3 compares it with those of previous video standards.
2.1 Profiling
[Pie chart. Legend as extracted: Others (Intra Prediction, etc.), Write File, PSNR Computation, De-blocking Filter, CAVLC, IQ/IDCT, Ref. Frame Copy, Reconstruction, Motion Compensation. Slice values as extracted: 7%, 8%, 9%, 7%, 9%, 9%, 8%, 11%, 32%.]
Fig. 2.1 H.264/AVC video decoder software profile on ARM processor (JM 8.2)
Fig. 2.1 shows the profile of the H.264/AVC decoder on an ARM processor, using the JM 8.2 reference software. The inter prediction related modules, namely motion compensation, reconstruction, and reference frame copy, account for about 50% of the entire video decoder. This dominant part can be greatly reduced by parallel processing, data reuse, or pipelined processing in an ASIC design.
2.2 Inter Prediction Algorithm for H.264/AVC Standard
The H.264/AVC standard supports more flexible block size selection in inter prediction than any previous standard [1][2]. The block size can be as small as 4x4 for luma and 2x2 for chroma. Fig. 2.2 illustrates all types of partitions.
Macroblock partitions: 16x16, 16x8, 8x16, 8x8
Sub-macroblock partitions: 8x8, 8x4, 4x8, 4x4
Fig. 2.2 Macroblock partitions and sub-macroblock partitions
The H.264/AVC standard also supports high motion resolution, reaching quarter-sample accuracy for luma and eighth-sample accuracy for chroma. Quarter-sample accuracy first appeared in the advanced profile of the MPEG-4 Visual standard; however, H.264/AVC reduces the complexity of the interpolation processing. Luma half-sample interpolation with a 6-tap (1, -5, 20, 20, -5, 1) symmetrical FIR filter and quarter-sample interpolation with a bilinear filter are drawn in Fig. 2.3 (a)-(c). The prediction value of the chroma component is generated using the bilinear interpolator illustrated in Fig. 2.3 (d), and its displacement can reach one-eighth accuracy. Mathematically, both are 2-D interpolations. In hardware, however, these equations can be separated into two 1-D passes to reduce cost, namely a horizontal filter followed by a vertical one, or vice versa.
Fig. 2.3 (a) luma half sample with 6-tap FIR, (b) luma horizontal/vertical quarter sample with bilinear filter, (c) luma diagonal quarter sample with bilinear filter, (d) chroma sample with bilinear filter. Upper-case letters indicate the full samples and lower-case letters indicate the interpolated fractional samples.
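The filters named in Fig. 2.3 can be sketched in a few lines. This is a minimal illustration, assuming 8-bit samples and the usual H.264/AVC rounding offsets; the function names are hypothetical, and the sketch covers only the 1-D half-sample case (the centre half sample, which the standard derives from unrounded intermediate values, is left out).

```python
def clip255(value):
    """Clamp a filtered value to the 8-bit sample range."""
    return max(0, min(255, value))

def luma_half(e, f, g, h, i, j):
    """Half sample between full samples g and h, using the 6-tap
    (1, -5, 20, 20, -5, 1) filter over six samples in one row or column."""
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)

def luma_quarter(p, q):
    """Quarter sample: rounded average (bilinear) of two neighbouring
    full or half samples p and q."""
    return (p + q + 1) >> 1

def chroma_eighth(a, b, c, d, dx, dy):
    """Chroma sample at eighth-sample offset (dx, dy), 0 <= dx, dy < 8,
    from the four surrounding full samples a (top-left), b (top-right),
    c (bottom-left) and d (bottom-right)."""
    return ((8 - dx) * (8 - dy) * a + dx * (8 - dy) * b
            + (8 - dx) * dy * c + dx * dy * d + 32) >> 6
```

On a flat area the filters reproduce the input: the taps sum to 32, so a row of equal samples yields the same value after the shift by 5.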
A motion vector is generated from the motion vector difference (MVD) and the motion vector prediction (MVP), as expressed by (2.1).
$$MV_x = MVD_x + MVP_x, \qquad MV_y = MVD_y + MVP_y \qquad (2.1)$$
The MVD is decoded by the universal variable length decoder (UVLD), and the MVP is predicted from neighboring motion vectors. The MVP algorithm, whose concept is similar to that of MPEG-4, uses directional prediction for 16 x 8 and 8 x 16 block sizes and median prediction for other block sizes. The details of the MVP decision are shown in Fig. 2.4. Median prediction is expressed by (2.2). In addition, some boundary conditions and exceptions have to be handled accurately. For instance, when MVC is not available, its value is replaced by MVD, which here denotes the motion vector of the upper-left neighbor rather than the motion vector difference. We do not go into the details of those boundary conditions here.
$$MVP = \mathrm{median}(MVA, MVB, MVC) \qquad (2.2)$$
Fig. 2.4 (a) directional prediction for 8 x 16 block size, (b) directional prediction for 16 x 8 block size, (c) median prediction
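Median prediction followed by the addition of the decoded difference can be sketched as follows. This is a minimal illustration, assuming all three neighbors are available and that the median is taken separately on each component; the function names are hypothetical, and the directional cases and boundary exceptions mentioned above are left out.

```python
def median3(a, b, c):
    """Median of three integers."""
    return max(min(a, b), min(max(a, b), c))

def predict_mv(mva, mvb, mvc):
    """Component-wise median of the left (A), upper (B) and
    upper-right (C) neighboring motion vectors, each an (x, y) pair."""
    return (median3(mva[0], mvb[0], mvc[0]),
            median3(mva[1], mvb[1], mvc[1]))

def reconstruct_mv(mvp, mv_difference):
    """MV = MVP + MVD, applied to x and y separately."""
    return (mvp[0] + mv_difference[0], mvp[1] + mv_difference[1])

# Illustrative values only: three neighbors and one decoded difference.
mvp = predict_mv((1, 3), (2, 0), (-2, 1))
mv = reconstruct_mv(mvp, (2, -3))
```

Note that the predicted vector need not equal any single neighbor, because x and y are selected independently.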
In addition to the motion-compensated block sizes described in Fig. 2.2, a P macroblock can also be coded in P_SKIP mode. In this mode, neither residual signal nor motion information is transmitted; the motion vector is determined solely from the MVP. The reconstructed data is obtained in a way similar to that of macroblock type P_16x16. Macroblocks coded in P_SKIP mode are often located in large areas with no change or low motion. Besides the above techniques, H.264/AVC also supports multiple reference frames, weighted prediction, and direct mode for B slices. These tools further improve coding efficiency.
Applying a de-blocking filter is a well-known method of improving image quality by alleviating blocking artifacts. In H.264/AVC the de-blocking filter is brought inside the motion-compensated prediction loop, which makes the improvement in quality more conspicuous.
2.3 Comparison among Different Video Standards
Considering frame coding, Table 2.1 lists the fractional motion compensation features of different standards. Fractional interpolation has become more and more important in state-of-the-art video coding. The interpolation window becomes larger for the same block size; that is, many more cycles are required to interpolate each macroblock. For example, H.264/AVC requires a 9 x 9 pixel window to interpolate a luma 4 x 4 block, whereas a window of the same size suffices to filter an 8 x 8 block in an MPEG-2 video decoder. Fig. 2.5 and Fig. 2.6 show the luma and chroma integer/fractional motion vector proportions for H.264/AVC. Note in particular that, unlike in previous standards, luma and chroma interpolation differ in H.264/AVC. That is, the computation resources cannot be shared at either the algorithm level or the hardware level. Combining the luma and chroma parts therefore leaves room for improvement, and we discuss this and its implementation in Chapter 3. At a high bit rate (128 kbps), fractional motion vectors account for about 80%, and even at a low bit rate (32 kbps) they still account for a considerable proportion (40%). The higher the proportion of fractional MVs, the more execution time is needed to read pixel data from frame memory. This gap may become more obvious when SDRAM is used as frame memory. To reduce the number of pixels fetched from frame memory, a data reuse technique for fractional motion compensation is proposed in Chapter 3.
Table 2.1 Comparison of fractional motion compensation among different standards
| Standard | MPEG-1/2 | MPEG-4 | H.264/AVC |
| --- | --- | --- | --- |
| MVP | Update from previous PMV value | Median prediction | Median prediction, directional prediction |
| Luma block unit | 16 x 16 | 8 x 8 | 4 x 4 |
| Luma motion accuracy | Half | Half, quarter | Half, quarter |
| Luma filter | Bilinear | Half sample mode: half, bilinear. Quarter sample mode: half, 8-tap FIR; quarter, 8-tap FIR and bilinear | Half: 6-tap FIR. Quarter: 6-tap FIR and bilinear |
| Luma interpolation window | 17 x 17 | 15 x 15 | 9 x 9 |
| Chroma block unit | 8 x 8 | 4 x 4 | 2 x 2 |
| Chroma motion accuracy | Half | Half, quarter | Eighth |
| Chroma filter | Bilinear | Half sample mode: half, bilinear. Quarter sample mode: half, 8-tap FIR; quarter, 8-tap FIR and bilinear | Bilinear |
| Chroma interpolation window | 9 x 9 | 5 x 5 | 3 x 3 |
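The luma window sizes in Table 2.1 follow from the block size and the length of the longest filter: an N-tap filter needs N - 1 extra samples along each dimension. The short sketch below checks this relation; the function name is hypothetical, and treating the bilinear filter as a 2-tap filter is an assumption made for the illustration.

```python
def interpolation_window(block_size, filter_taps):
    """Side length of the reference window needed to interpolate a
    square block with a separable filter of the given length."""
    return block_size + filter_taps - 1

# (block size, filter taps) taken from the rows of Table 2.1;
# bilinear interpolation is modelled as a 2-tap filter.
luma_windows = {
    "MPEG-1/2": interpolation_window(16, 2),
    "MPEG-4": interpolation_window(8, 8),
    "H.264/AVC": interpolation_window(4, 6),
}
chroma_windows = {
    "MPEG-1/2": interpolation_window(8, 2),
    "H.264/AVC": interpolation_window(2, 2),
}
```

The same 9-sample side therefore covers an 8 x 8 block with a bilinear filter but only a 4 x 4 block with the 6-tap filter, which is the cost increase discussed in the text.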
[Chart: luma integer/fractional motion vector proportion (foreman-QCIF); x-axis: bit rate (kbps), 32 to 128 in steps of 16; y-axis: proportion; series: integer, fraction]
Fig. 2.5 Luma integer/fractional motion vector proportion for H.264/AVC
[Chart: chroma integer/fractional motion vector proportion (foreman-QCIF); x-axis: bit rate (kbps), 32 to 128 in steps of 16; y-axis: proportion; series: integer, fraction]
2.4 Summary
The H.264 profiling on an ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is crucial. This chapter also described inter prediction in H.264/AVC and compared it with that of other standards.
Chapter 3
Motion Compensation Design for MPEG-2/H.264 Video Decoder
The state-of-the-art video coding standard H.264/AVC provides a compression ratio that significantly outperforms all previous video compression standards. However, unlike the traditional MPEG-x standards, H.264/AVC is not backward compatible with the former MPEG-x and H.26x video coding standards. Therefore, combining multiple video coding standards in one design is essential to support modern multimedia systems. For example, the DVD Forum adopted MPEG-2, H.264/AVC, and VC-1 (also known as WMV-9) as mandatory for the next-generation HD-DVD and Blu-ray formats. As for digital video broadcasting (DVB), the DVB-T system, which is designed for digital terrestrial television services, is directly compatible with MPEG-2 coded TV signals. Furthermore, mobile DVB, presently called DVB-H, allows the transmission of H.264/AVC video content because of its high coding efficiency. In particular, DVB-H is backward compatible with DVB-T but transmits a different video format. Designing an efficient video decoder for multi-standard video applications is therefore both a demand and a challenge.
This chapter discusses the design of motion compensation, which dominates the amount of data transfer in the video decoder, for an MPEG-2/H.264 dual video decoder. The rest of the chapter is structured as follows. Section 3.1 illustrates the motion compensation engine for the H.264/AVC decoder. Section 3.2 discusses and analyzes the combined motion compensation engine for MPEG-2/H.264. Finally, Section 3.3 gives a summary.
3.1 Motion Compensation Engine for H.264/AVC decoder
[Block diagram labels: Motion Vector Predictor, 4 x 4 MV Buffer, Line MV FIFO, Address Generator]
Fig 3.1 Motion compensation engine for H.264 video decoder
Fig. 3.1 illustrates the whole motion compensation engine for the H.264/AVC video decoder. First, the line MV FIFO stores decoded motion vectors for motion vector prediction, and the 4 x 4 MV buffer stores the decoded motion vectors of the current MB. Then, the address generator sends the reference address to the memory access controller. The controller schedules consecutive access commands and sends them to the frame memories. The burst-read data is kept in the read data buffer and then filtered by the interpolator. Finally, the interpolated reference data is added to the residual data and then passed through the de-blocking filter. Our proposed decoder adopts a ping-pong structured external frame memory [28], in which two memories alternately store the reference and current frames.
The following subsections discuss the details of the modules other than the memory access controller, which is discussed in detail in Chapter 4. Subsection 3.1.1 illustrates the motion vector generator, including the motion vector predictor (MVP) and the related storage. Subsection 3.1.2 gives the data reuse technique for the interpolator. Subsection 3.1.3 analyzes the proposed data reuse technique. Finally, the luma and chroma interpolator designs are reported in Subsections 3.1.4 and 3.1.5, respectively.
3.1.1 Motion Vector Generator
Fig 3.2 Motion vector information storage for the motion vector predictor (QCIF frame format)
The motion vector generator mainly contains the motion vector predictor, the line MV FIFO, and the 4 x 4 MV buffers. A motion vector is generated by summing the motion vector prediction (MVP) and the motion vector difference (MVD). Because the MVP value is calculated from the neighboring MVs, the decoded motion vectors have to be stored for subsequent decoding. The line MV FIFO stores the decoded motion vector pairs (MVX, MVY). The depth and width of the MV FIFO depend on the frame width and the search range, respectively. Once an entry of the MV FIFO is no longer needed, the motion vector pair can be discarded. The MV buffers are 4 x 4 in size because the maximum number of motion vectors per MB is sixteen. The motion vectors of the current MB are stored in these 4 x 4 MV buffers.
Fig. 3.2 shows an example of the total storage required by the motion vector generator. A total of 4 x 11 motion vector pairs have to be stored for the QCIF frame format. The required neighboring motion vectors are detailed in Fig. 3.3. To cover all conditions, the storage element is based on the 4 x 4 block, which is the smallest element in the H.264/AVC video decoder. Each square indicates one motion vector pair. To decode MV0-MV15 in the current MB, the neighboring motion vectors at the upper-left corner (MVLU), the upper-right corner (MVRU), the upper positions (MVU0-MVU3), and the left positions (MVL0-MVL3) are needed.
Fig 3.3 Neighboring motion vectors needed when decoding all motion vectors in current macroblock
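The 4 x 11 figure quoted for QCIF can be reproduced with a small calculation. The sketch below assumes that the factor 4 is the number of 4 x 4 block columns per macroblock and that 11 is the number of macroblocks in one QCIF row (176 / 16); the function name and the CIF width are illustrative additions, not values from the design.

```python
MB_SIZE = 16          # macroblock width in luma samples
BLOCKS_PER_MB = 4     # 4x4 block columns in one macroblock

def line_mv_pairs(frame_width):
    """Motion vector pairs kept for one line of macroblocks, assuming
    one (MVX, MVY) pair per 4x4 block column."""
    mbs_per_row = frame_width // MB_SIZE
    return BLOCKS_PER_MB * mbs_per_row

qcif_pairs = line_mv_pairs(176)   # QCIF is 176 samples wide
cif_pairs = line_mv_pairs(352)    # CIF width, shown only for comparison
```

Under this assumption the storage grows linearly with the frame width, which is consistent with the FIFO depth depending on the frame width.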
The detailed architecture of the motion vector generator is shown in Fig 3.4. Motion vector generation involves a two-phase operation: the first phase loads the MVD into the 4 x 4 MV buffers, and the second calculates MV = MVP + MVD and stores the result back into the 4 x 4 MV buffers. The proposed storage can be treated as the two-level memory hierarchy depicted in Fig 3.5. Four line MV FIFOs are implemented using SRAM, and local registers store the neighboring motion vectors for the current MB. The local registers that store neighboring motion vectors include the left MV line buffer and the upper-left, upper, upper-right, and left MV registers. The [...] vectors required in current MB decoding. After the current MB has been decoded, the FIFOs need one push and one pop operation, which take two cycles, to update the contents of all local registers for the next MB.
Fig 3.4 (a) motion vector generator architecture for QCIF-format, (b) mv buffer unit
Fig. 3.5 Two-level memory hierarchical structure for MVP
Fig 3.6 (a) block size_position index, (b) directional prediction table (16x8, 8x16), (c) median prediction table (16x16, 8x8), (d) median prediction table (4x4)
The MVP is calculated from MVA, MVB, MVC, and MVD, whose values are derived from neighboring motion vectors according to the block size_position index illustrated in Fig. 3.6 (a). MVA, MVB, MVC, and MVD denote the motion vectors of the left, upper, upper-right, and upper-left neighboring macroblock/partition/block, respectively, as shown in Fig. 2.4 (c). Fig. 3.6 (b)-(d) list MVA, MVB, MVC, and MVD for each block size_position index. Besides the look-up tables (LUTs) required for motion vector prediction, many boundary conditions and exceptions have to be handled. For clarity, we do not describe them here.
3.1.2 Data Reuse Technique for Interpolator
Fig 3.7 (a) 4x4 block window and the corresponding 9x9 interpolation window, (b) overlapped region for neighboring interpolation window
Fig 3.8 (a) 2x2 raster scanning order, (b) row-major 2x2 raster scanning order, (c) column-major 2x2 raster scanning order
As Fig 3.7 (a) shows, interpolating the fractional sample values of a 4 x 4 block requires a 9 x 9 interpolation window. If the motion vectors of two neighboring 4 x 4 blocks are the same, the 5 x 9 region where their interpolation windows overlap can be reused. Residual decoding of each macroblock follows the 2x2 raster scanning order shown in Fig 3.8 (a). Of the two scanning orders illustrated in Fig 3.8 (b) and (c), the row-major one incurs 13 transitions, whereas the column-major one incurs only 5. At each transition the overlap region cannot be reused. The column-major order is therefore the better choice because it has fewer transitions.
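The 9 x 9 window and the 5 x 9 overlap can be checked with a short geometric sketch. It assumes a 6-tap filter that reaches two samples before and three samples after the block in each dimension, and it uses only the integer part of the motion vector; the function names and the coordinate convention are hypothetical.

```python
def window(block_x, block_y, mv_int, block=4, taps=6):
    """Inclusive (left, top, right, bottom) bounds of the reference
    window for a block whose top-left sample is (block_x, block_y)."""
    before = taps // 2 - 1            # samples needed before the block
    side = block + taps - 1           # 9 for a 4x4 block and 6 taps
    left = block_x + mv_int[0] - before
    top = block_y + mv_int[1] - before
    return (left, top, left + side - 1, top + side - 1)

def overlap(w1, w2):
    """(width, height) of the region shared by two windows."""
    width = min(w1[2], w2[2]) - max(w1[0], w2[0]) + 1
    height = min(w1[3], w2[3]) - max(w1[1], w2[1]) + 1
    return (max(width, 0), max(height, 0))

# Two horizontally adjacent 4x4 blocks with the same motion vector.
shared = overlap(window(0, 0, (0, 0)), window(4, 0, (0, 0)))
new_pixels = 9 * 9 - shared[0] * shared[1]
```

With equal motion vectors, only 36 of the 81 window samples of the second block are new; when the vectors differ, the shared region shrinks or disappears.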
Fig 3.9 (a) 2x2 raster scanning order, (b) 4x4 raster scanning order, (c) extended 2x2 raster scanning order
Fig 3.10 Synchronization buffer scheme for two different scanning orders in inter prediction (a) 2x2 raster scanning order, (b) 4x4 raster scanning order
In a video decoding system, inter prediction is usually processed at the macroblock level. Thus, the decoding order of the 4 x 4 blocks, the smallest block element in the H.264/AVC video decoder, can be chosen freely within each macroblock. Based on this, the 2 x 2 and 4 x 4 raster scanning orders are depicted in Fig 3.9 (a) and (b); the column-major 4 x 4 raster scanning order needs only four transitions, fewer than the 2 x 2 raster scan. [...] pixels in residual adder because its scanning order differs from that of the residual decoder, which must follow the 2x2 raster scanning order defined in the standard [1].
Fig. 3.11 Content-swap operation (interpolator with attached content buffer)
[Partition motion vectors shown in the figure: (1, 3), (1, 3), (2, 0), (2, 0), (-2, 1), (2, 0)]
Fig. 3.12 An example of macroblock partition. (1, 3) indicates (mv_x, mv_y).
To resolve this problem, we attach a content register (the content buffer in Fig 3.11) to the interpolator and adopt the extended 2x2 raster scanning order shown in Fig 3.9 (c). The size of the content register depends on the local register in the interpolator. Each gray block in Fig. 3.9 (c) indicates a content-swap operation, which swaps the entire content of the interpolator's local register with that of the content buffer. As a result, if the motion vectors of block 1 and block 4 are the same, the overlapped region in Fig. 3.7 (b) need not be re-fetched when decoding block 4. The extended 2x2 raster scanning order therefore follows the same 2 x 2 raster scanning as the residual decoder while achieving the data reuse of the 4 x 4 raster scanning order. The content-swap operation takes effect only for larger block size partitions or when the motion vectors of the neighboring blocks are the same. The condition for executing this operation is given by expression (3.1).
$$\mathit{content\_swap\_condition} = (\mathit{mb\_type} == 16\mathrm{x}16) \;\|\; (\mathit{mb\_type} == 16\mathrm{x}8) \qquad (3.1)$$
However, for the example shown in Fig. 3.12, condition (3.1) evaluates to false. If the equality of neighboring motion vectors is checked instead of the block size, the example in Fig. 3.12 can also benefit from data reuse. The motion vector checks between neighboring blocks are listed in Table 3.1.
Table 3.1 Neighboring MV checking table for content-swap operation
| Block number | Checking condition |
| --- | --- |
| 1 | MV1 == MV4 |
3.1.3 Analysis for Data Reuse Technique
To give a more generic and platform-independent analysis, we analyze the required pixels per MB and the cost overhead of each method. Considering fractional motion compensation for each macroblock, Table 3.2 summarizes the required pixels per MB and the cost overhead of the different methods. Assuming that every motion vector has a fractional part, the best case has one motion vector and the worst case has 16 motion vectors per luma macroblock. Although the required pixels are the same for every method in the worst case, requisite pixels [...] column-major methods, the 4 x 4 raster scanning order (RSO) gives the best result; however, it requires an additional synchronization buffer and degrades throughput because its RSO differs from that of the residual decoder. As for the extended methods, condition (3.1) takes effect only for larger block partitions (SKIP, 16x16, 16x8). That is, it cannot reuse data in cases such as Fig. 3.12 even if the neighboring motion vectors are the same. To overcome this disadvantage, method 5 checks the neighboring motion vectors rather than the block size, so that the required bandwidth in the Fig. 3.12 case is reduced to that of the 4 x 4 RSO. The advantage of the extended method is that it requires only a content buffer, which is smaller than the buffer of method 3, and a few extra cycles for the content-swap operation. Although method 4 outperforms methods 1, 2, and 3 only for larger block sizes (SKIP, 16x16, 16x8), Fig. 3.13 shows that these block sizes account for 50% to 90% of the total. Furthermore, method 5 not only covers all cases of method 4 but also takes effect for smaller block sizes, such as the case in Fig. 3.12. As shown in Fig. 3.14, applying the extended method in our design reduces the required memory bandwidth by about 30% compared with the column-major 2x2 RSO method.
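The two extremes of this analysis can be bounded with simple arithmetic. The sketch below is an illustration under stated assumptions, not a reproduction of the values in Table 3.2: it assumes the 6-tap luma filter, so a W x H block needs a (W + 5) x (H + 5) window, and the function names are hypothetical.

```python
TAPS = 6  # length of the luma half-sample FIR filter

def window_pixels(block_w, block_h, taps=TAPS):
    """Reference pixels needed to interpolate one block."""
    return (block_w + taps - 1) * (block_h + taps - 1)

def pixels_without_reuse(num_4x4_blocks=16):
    """Every 4x4 block of the macroblock fetches its own 9x9 window,
    as happens when all sixteen motion vectors differ."""
    return num_4x4_blocks * window_pixels(4, 4)

def pixels_single_window():
    """One motion vector for the whole macroblock with complete reuse:
    a single 21x21 window covers all sixteen 4x4 blocks."""
    return window_pixels(16, 16)

upper_bound = pixels_without_reuse()
lower_bound = pixels_single_window()
```

The gap between the two bounds, 1296 versus 441 pixels per luma macroblock, is the room within which the scanning-order and content-swap methods compete.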
Table 3.2 Static analysis of the different methods in H.264/AVC.
Assumption: every motion vector contains a fractional part.
| Method | Required pixels per luma MB: worst case | Required pixels per luma MB: best case | Required pixels per luma MB: Fig. 3.12 case | Cost overhead |