THESIS ORGANIZATION - Adjustable Motion Compensation Memory Architecture for Dual Video Standards

CHAPTER 1 INTRODUCTION

1.2 THESIS ORGANIZATION

This thesis is organized as follows. Chapter 2 describes and analyzes the algorithms. Chapter 3 first describes the motion compensation engine for the H.264/AVC video decoder and then presents the motion compensation engine for the MPEG-2/H.264 dual-standard video decoder. It also proposes a data reuse technique that reduces the required bandwidth, particularly for H.264/AVC fractional motion compensation. Chapter 4 presents the frame memory organization, including the frame memory access controller for external SDRAM and the merging structured frame organization, which is one of the frame compression methods. Chapter 5 covers the chip implementation. Finally, Chapter 6 gives the conclusion and future work.

Chapter 2 Algorithm Description and Analysis

Fig. 2.1 General structure of H.264 encoder

Fig. 2.2 General structure of H.264 decoder

Fig. 2.1 and Fig. 2.2 show the general structure of the H.264/AVC video encoder and decoder, respectively. The H.264/AVC design covers a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). We concentrate only on the VCL, which efficiently represents the video content. H.264/AVC follows the so-called block-based hybrid video coding approach: a hybrid of temporal and spatial prediction in conjunction with transform coding.

The main additional blocks compared with prior standards are intra prediction and the in-loop de-blocking filter. Fig. 2.3 and Fig. 2.4 illustrate the general structure of the MPEG-2 encoder and decoder, respectively. Compared with H.264/AVC, the MPEG-2 decoding flow is simpler because it has neither intra prediction nor an in-loop de-blocking filter; the only exception is that its DCT/IDCT is more complicated than the integer transform of the H.264/AVC codec.

Fig. 2.3 General structure of MPEG-2 encoder

Fig. 2.4 General structure of MPEG-2 decoder

This chapter is structured as follows. Section 2.1 presents the software profiling. Section 2.2 describes the H.264/AVC motion compensation algorithm. Finally, Section 2.3 compares it with those of previous video standards.

2.1 Profiling

[Pie chart legend: Motion Compensation, Reconstruction, Ref. Frame Copy, IQ/IDCT, CAVLC, De-blocking Filter, PSNR Computation, Write File, Others (Intra Prediction, etc.); slice labels shown: 11%, 32%]

Fig. 2.1 H.264/AVC video decoder software profile on ARM processor (JM 8.2)

Fig. 2.1 shows the H.264/AVC software profile on an ARM processor, using the JM 8.2 reference software. The inter prediction related modules, including motion compensation, reconstruction, and reference frame copy, account for about 50% of the entire video decoder.

This dominant part can be greatly reduced by parallel processing, data reuse, or pipelined processing in an ASIC design.

2.2 Inter Prediction Algorithm for H.264/AVC Standard

The H.264/AVC standard supports more flexible block size selection in inter prediction than any previous standard [1][2]. The block size can be as small as 4x4 for luma and 2x2 for chroma. Fig. 2.2 illustrates all partition types.

[Figure: macroblock partitions 16x16, 16x8, 8x16, 8x8; sub-macroblock partitions 8x8, 8x4, 4x8, 4x4]

Fig. 2.2 Macroblock partitions and sub-macroblock partitions

The H.264/AVC standard also supports high motion resolution, reaching quarter-sample accuracy for luma and eighth-sample accuracy for chroma. This resolution first appeared in the advanced profile of the MPEG-4 Visual standard; however, H.264/AVC reduces the complexity of the interpolation processing. Luma half-sample interpolation with a 6-tap (1, -5, 20, 20, -5, 1) symmetric FIR filter and quarter-sample interpolation with a bilinear filter are drawn in Fig 2.3 (a)-(c). The prediction value of the chroma component is generated using the bilinear interpolator illustrated in Fig. 2.3 (d), and the displacement can reach one-eighth accuracy. Mathematically, both are 2-D interpolations. In hardware, however, these equations can be separated into two 1-D filters to reduce cost, namely a horizontal filter followed by a vertical one, or vice versa.

Fig. 2.3 (a) luma half sample with 6-tap FIR, (b) luma horizontal/vertical quarter sample with bilinear filter, (c) luma diagonal quarter sample with bilinear filter, (d) chroma sample with bilinear filter. Upper-case letters indicate the full samples and lower-case letters indicate the interpolated fractional samples
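The filters described above can be sketched in a few lines. This is a minimal illustration, not the thesis hardware: the function names are hypothetical, 8-bit samples are assumed, and the rounding offsets and shifts (+16 >> 5, +1 >> 1, +32 >> 6) follow the usual formulation of the H.264/AVC standard rather than anything stated in this chapter.

```python
def clip255(value):
    """Clamp a value to the 8-bit sample range [0, 255] (assumed bit depth)."""
    return max(0, min(255, value))


def luma_half_sample(e, f, g, h, i, j):
    """Half sample between g and h using the 6-tap (1, -5, 20, 20, -5, 1) filter."""
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip255((acc + 16) >> 5)  # divide by 32 with rounding


def luma_quarter_sample(a, b):
    """Quarter sample as the rounded average of two neighboring samples."""
    return (a + b + 1) >> 1


def chroma_sample(a, b, c, d, dx, dy):
    """Bilinear chroma sample; dx and dy are eighth-sample offsets in 0..7.

    a, b are the upper-left and upper-right full samples,
    c, d are the lower-left and lower-right full samples.
    """
    acc = ((8 - dx) * (8 - dy) * a + dx * (8 - dy) * b
           + (8 - dx) * dy * c + dx * dy * d)
    return (acc + 32) >> 6  # divide by 64 with rounding
```

A flat region is left unchanged by all three filters, which is a quick sanity check: `luma_half_sample(100, 100, 100, 100, 100, 100)` evaluates to 100.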

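The motion vector (MV) is the sum of the motion vector difference (MVD) and the motion vector prediction (MVP), as expressed by (2.1).

$$\mathrm{MV}_x = \mathrm{MVD}_x + \mathrm{MVP}_x, \qquad \mathrm{MV}_y = \mathrm{MVD}_y + \mathrm{MVP}_y \qquad (2.1)$$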

The MVD is decoded by the universal variable length decoder (UVLD), and the MVP is predicted from neighboring motion vectors. The MVP algorithm, whose concept is similar to that of MPEG-4, uses directional prediction for the 16 x 8 and 8 x 16 block sizes and median prediction for the other block sizes. The details of the MVP decision are shown in Fig. 2.4, where MVA, MVB, MVC and MVD denote the motion vectors of the left, upper, upper-right and upper-left neighbors (this MVD is distinct from the motion vector difference). Median prediction is expressed by (2.2).

$$\mathrm{MVP} = \mathrm{Median}(\mathrm{MVA}, \mathrm{MVB}, \mathrm{MVC}) \qquad (2.2)$$

In addition, some boundary conditions and exceptions have to be handled accurately. For instance, when MVC is not available, its value is replaced by MVD. We do not go into the details of these boundary conditions here.

Fig. 2.4 (a) directional prediction for 8 x 16 block size, (b) directional prediction for 16 x 8 block size, (c) median prediction
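As a hedged sketch of the median prediction and the reconstruction MV = MVP + MVD, the code below takes the median per component and applies only the one fallback mentioned in the text (an unavailable upper-right neighbor is replaced by the upper-left one). The function and parameter names are hypothetical, and directional prediction and the remaining boundary cases are deliberately left out.

```python
def median3(a, b, c):
    """Median of three integers."""
    return sorted((a, b, c))[1]


def median_mvp(mv_left, mv_up, mv_upright, mv_upleft=None):
    """Component-wise median of the left, upper and upper-right neighbor MVs.

    Each MV is an (x, y) tuple. If mv_upright is None (not available),
    the upper-left neighbor is used in its place.
    """
    if mv_upright is None:
        mv_upright = mv_upleft
    return (median3(mv_left[0], mv_up[0], mv_upright[0]),
            median3(mv_left[1], mv_up[1], mv_upright[1]))


def reconstruct_mv(mvp, mv_diff):
    """MV = MVP + MVD, where mv_diff is the decoded motion vector difference."""
    return (mvp[0] + mv_diff[0], mvp[1] + mv_diff[1])
```

For example, `median_mvp((1, 3), (2, 0), (-2, 1))` yields `(1, 1)`, and adding a decoded difference of `(2, -3)` gives the motion vector `(3, -2)`.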

In addition to the motion-compensated block sizes described in Fig. 2.2, a P macroblock can also be coded in P_SKIP mode. In this mode, neither a residual signal nor motion information is transmitted; that is, the motion vector is determined by the MVP alone. The reconstructed data is obtained in a manner similar to macroblock type P_16x16. Macroblocks coded as P_SKIP are often located in large areas with no change or low motion. Besides the above techniques, H.264/AVC also supports multiple reference frames, weighted prediction, and direct mode for B slices. These tools further improve coding efficiency.

Applying a de-blocking filter is a well-known method of improving image quality by alleviating blocking artifacts. In H.264/AVC, the de-blocking filter is brought within the motion-compensated prediction loop, which makes the quality improvement more pronounced.

2.3 Comparison among Different Video Standards

Table 2.1 lists the fractional motion compensation features of different standards for frame coding. Fractional interpolation has become increasingly important in state-of-the-art video coding. The interpolation window has grown relative to the block size, so more cycles are needed to interpolate each macroblock. For example, H.264/AVC requires a 9 x 9 pixel window to interpolate a 4 x 4 luma block, whereas a window of the same size suffices to filter an 8 x 8 block in an MPEG-2 video decoder. Fig. 2.5 and Fig. 2.6 show the luma and chroma integer/fractional motion vector proportions for H.264/AVC. Note in particular that, unlike in previous standards, luma and chroma interpolation differ in H.264/AVC, so the computational resources cannot be shared at either the algorithm level or the hardware level. The combination of the luma and chroma parts therefore leaves room for improvement, and we give the discussion and implementation in Chapter 3. In a high bit rate application (128 kbps), fractional motion vectors account for about 80%, and even at a low bit rate (32 kbps) they still account for about 40%. The higher the proportion of fractional MVs, the more execution time is needed to read pixel data from frame memory. This gap becomes more obvious when SDRAM is used as frame memory. To reduce the number of pixels fetched from frame memory, a data reuse technique for fractional motion compensation is proposed in Chapter 3.

Table 2.1 Comparison of fractional motion compensation among different standards

| Feature | MPEG-1/2 | MPEG-4 | H.264/AVC |
|---|---|---|---|
| MVP | Update from previous PMV value | Median prediction | Median prediction, directional prediction |
| Luma block unit | 16 x 16 | 8 x 8 | 4 x 4 |
| Luma motion accuracy | Half | Half, quarter | Half, quarter |
| Luma filter | Bilinear | Half sample mode: half, bilinear. Quarter sample mode: half, 8-tap FIR; quarter, 8-tap FIR and bilinear | Half: 6-tap FIR. Quarter: 6-tap FIR and bilinear |
| Luma interpolation window | 17 x 17 | 15 x 15 | 9 x 9 |
| Chroma block unit | 8 x 8 | 4 x 4 | 2 x 2 |
| Chroma motion accuracy | Half | Half, quarter | Eighth |
| Chroma filter | Bilinear | Half sample mode: half, bilinear. Quarter sample mode: half, 8-tap FIR; quarter, 8-tap FIR and bilinear | Bilinear |
| Chroma interpolation window | 9 x 9 | 5 x 5 | 3 x 3 |
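The luma window sizes in Table 2.1 are consistent with a simple relation: an N-pixel block filtered by a T-tap FIR needs N + T - 1 input pixels along each axis. The sketch below is only a check of that arithmetic; the function name is hypothetical, and treating the bilinear filter as a 2-tap filter is an assumption made for illustration.

```python
def interpolation_window(block_size, filter_taps):
    """Input pixels needed along one axis to filter block_size outputs."""
    return block_size + filter_taps - 1


# Luma windows, widest filter per standard (bilinear treated as 2-tap)
mpeg2_luma = interpolation_window(16, 2)   # 16 x 16 block, bilinear
mpeg4_luma = interpolation_window(8, 8)    # 8 x 8 block, 8-tap FIR
h264_luma = interpolation_window(4, 6)     # 4 x 4 block, 6-tap FIR
```

The same relation explains the example in the text: a 4 x 4 H.264/AVC luma block with a 6-tap filter and an 8 x 8 block with a bilinear filter both need a 9 x 9 window.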

[Plot (foreman, QCIF): luma integer/fractional motion vector proportion versus bit rate (kbps), 32 to 128; series: integer, fraction]

Fig. 2.5 Luma integer/fractional motion vector proportion for H.264/AVC

[Plot (foreman, QCIF): chroma integer/fractional motion vector proportion versus bit rate (kbps), 32 to 128; series: integer, fraction]

Fig. 2.6 Chroma integer/fractional motion vector proportion for H.264/AVC

2.4 Summary

The H.264 profiling on the ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is crucial. This chapter also described the inter prediction of H.264/AVC and its comparison with other standards.

Chapter 3 Motion Compensation Design for MPEG-2/H.264 video decoder

The state-of-the-art video coding standard H.264/AVC provides a compression ratio that significantly outperforms all previous video compression standards. However, unlike the traditional MPEG-x standards, H.264/AVC lacks backward compatibility with the former MPEG-x and H.26x video coding standards. Combining multiple video coding standards in one design is therefore essential to support modern multimedia systems. For example, the DVD Forum adopted MPEG-2, H.264/AVC, and VC-1 (also well known as WMV-9) as mandatory for the next-generation HD-DVD and Blu-ray formats. In digital video broadcasting (DVB), the DVB-T system, which is designed for digital terrestrial television services, is directly compatible with MPEG-2 coded TV signals.

Furthermore, mobile DVB, presently called DVB-H, allows the transmission of H.264/AVC video content because of its high coding efficiency. In particular, DVB-H is backward compatible with DVB-T but transmits a different video format. Designing an efficient video decoder for multi-standard video applications is therefore both a demand and a challenge.

This chapter discusses the design of motion compensation, which dominates the amount of data transfer in the video decoder, for the MPEG-2/H.264 dual-standard video decoder. The rest of the chapter is structured as follows. Section 3.1 describes the motion compensation engine for the H.264/AVC decoder. Section 3.2 discusses the combined motion compensation engine for MPEG-2/H.264 and its analysis. Finally, Section 3.3 gives a summary.

3.1 Motion Compensation Engine for H.264/AVC decoder

[Block diagram; recoverable block labels: motion vector predictor, 4 x 4 MV buffer, line MV FIFO, address generator]

Fig 3.1 Motion compensation engine for H.264 video decoder

Fig. 3.1 illustrates the complete motion compensation engine for the H.264/AVC video decoder. First, the line MV FIFO stores decoded motion vectors for motion vector prediction, and the 4 x 4 MV buffer stores the decoded motion vectors of the MB currently being decoded. Then, the address generator sends the reference address to the memory access controller, whose task is to schedule consecutive access commands and send them to the frame memories. The burst-read data is kept in the read data buffer and then filtered by the interpolator. Finally, the interpolated reference data is added to the residual data and passed through the de-blocking filter.

Our proposed decoder adopts a ping-pong structured external frame memory [28], in which two memories alternately store the reference frame and the current frame.

The following subsections discuss the modules other than the memory access controller, which is discussed in detail in Chapter 4. Subsection 3.1.1 describes the motion vector generator, including the motion vector predictor (MVP) and the related storage. Subsection 3.1.2 presents the data reuse technique for the interpolator. Subsection 3.1.3 analyzes the proposed data reuse technique. Finally, the luma and chroma interpolator designs are reported in subsections 3.1.4 and 3.1.5, respectively.

3.1.1 Motion Vector Generator

Fig 3.2 Motion vector information storage for the motion vector predictor, QCIF frame format.

The motion vector generator mainly consists of the motion vector predictor, the line MV FIFO and the 4 x 4 MV buffers. A motion vector is the sum of the motion vector prediction (MVP) and the motion vector difference (MVD). Because the MVP is calculated from the neighboring MVs, decoded motion vectors must be stored for subsequent decoding. The line MV FIFO stores the decoded motion vector pairs (MVX, MVY). The depth and width of the MV FIFO depend on the frame width and the search range, respectively. Once an entry of the MV FIFO is no longer needed, the motion vector pair can be discarded.

The MV buffers are 4 x 4 in size because the maximum number of motion vectors per MB is sixteen. The motion vectors for decoding the current MB are stored in these 4 x 4 MV buffers.

Fig. 3.2 shows an example of the total storage required by the motion vector generator. A total of 4 x 11 motion vector pairs have to be stored for the QCIF frame format. The required neighboring motion vectors are detailed in Fig. 3.3. To cover all conditions, the storage element is based on the 4 x 4 block, the smallest element in the H.264/AVC video decoder. Each square indicates one motion vector pair. To decode MV0-MV15 in the current MB, the neighboring motion vectors at the upper-left corner (MVLU), upper-right corner (MVRU), upper (MVU0-MVU3) and left (MVL0-MVL3) positions are needed.

Fig 3.3 Neighboring motion vectors needed when decoding all motion vectors in current macroblock
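The storage figure quoted above can be reproduced with a short calculation. This sketch assumes that "4 x 11" means four 4 x 4-block columns per macroblock for each of the 11 macroblocks in a 176-pixel-wide QCIF row; the function names are hypothetical.

```python
def mv_line_storage(frame_width, mb_size=16, block_size=4):
    """Return (MV columns per macroblock, macroblocks per row)."""
    mbs_per_row = frame_width // mb_size
    mv_columns_per_mb = mb_size // block_size
    return mv_columns_per_mb, mbs_per_row


def neighbor_mv_labels():
    """Neighboring MVs needed to decode MV0-MV15 of the current macroblock."""
    upper = ["MVU%d" % i for i in range(4)]
    left = ["MVL%d" % i for i in range(4)]
    return ["MVLU"] + upper + ["MVRU"] + left


qcif_storage = mv_line_storage(176)  # QCIF luma width is 176 pixels
```

Under these assumptions `qcif_storage` is `(4, 11)`, and ten neighboring motion vector pairs are visible to the current macroblock at any time.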

The detailed architecture of the motion vector generator is shown in Fig 3.4. Motion vector generation involves two phases: first, the MVD is loaded into the 4 x 4 MV buffers; second, MV = MVP + MVD is calculated and written back into the 4 x 4 MV buffers.

The proposed memory storage can be treated as the two-level memory hierarchy depicted in Fig 3.5. Four line MV FIFOs are implemented in SRAM, and local registers store the neighboring motion vectors for the current MB. The local registers that store neighboring motion vectors comprise the left MV line buffer and the upper-left, upper, upper-right and left MV registers. The [...] vectors required in current MB decoding. After the current MB has been decoded, the FIFOs need one push and one pop operation, which take two cycles, to update the contents of all local registers for decoding the next MB.

Fig 3.4 (a) motion vector generator architecture for QCIF-format, (b) mv buffer unit

Fig. 3.5 Two-level memory hierarchical structure for MVP

Fig 3.6 (a) block size_position index, (b) directional prediction table (16x8, 8x16), (c) median prediction table (16x16, 8x8), (d) median prediction table (4x4)

The MVP is calculated from MVA, MVB, MVC and MVD, whose values are derived from the neighboring motion vectors according to the block size_position index illustrated in Fig. 3.6 (a). MVA, MVB, MVC and MVD denote the motion vectors of the left, upper, upper-right and upper-left neighboring macroblock/partition/block, respectively, as shown in Fig. 2.4 (c). Fig. 3.6 (b)-(d) list MVA, MVB, MVC and MVD for each block size_position index. Besides this look-up table (LUT) required for motion vector prediction, many boundary conditions and exceptions have to be handled. We omit them here for clarity.

3.1.2 Data Reuse Technique for Interpolator

Fig 3.7 (a) 4x4 block window and the corresponding 9x9 interpolation window, (b) overlapped region for neighboring interpolation window

Fig 3.8 (a) 2x2 raster scanning order, (b) row-major 2x2 raster scanning order, (c) column-major 2x2 raster scanning order

As shown in Fig 3.7 (a), interpolating the fractional sample values of a 4x4 block requires a 9 x 9 interpolation window. If the motion vectors of two neighboring 4 x 4 blocks are the same, the 5 x 9 overlap region between the two interpolation windows can be reused. The scanning order of residual decoding for each macroblock is the 2x2 raster scanning order shown in Fig 3.8 (a).

Considering the two scanning orders illustrated in Fig 3.8 (b) and (c), the row-major order incurs 13 transitions, whereas the column-major order incurs only 5. At each transition the overlap region cannot be reused. The column-major order is therefore the better choice because it has fewer transitions.
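The 5 x 9 overlap follows directly from the window geometry, as the sketch below shows. It assumes the usual 6-tap footprint of two pixels before and three pixels after the block along each axis and identical motion vectors for both blocks; the function names are hypothetical.

```python
def window_span(block_origin, block_size=4, taps=6):
    """Half-open pixel span [start, end) of the interpolation window on one axis."""
    before = taps // 2 - 1  # 2 extra pixels before the block
    after = taps // 2       # 3 extra pixels after the block
    return (block_origin - before, block_origin + block_size + after)


def overlap_1d(span_a, span_b):
    """Number of pixels shared by two half-open spans."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def shared_window(block_a, block_b):
    """(width, height) of the region shared by two blocks' windows.

    block_a and block_b are (x, y) top-left positions; both blocks are
    assumed to use the same motion vector.
    """
    width = overlap_1d(window_span(block_a[0]), window_span(block_b[0]))
    height = overlap_1d(window_span(block_a[1]), window_span(block_b[1]))
    return width, height
```

For two horizontally adjacent blocks, `shared_window((0, 0), (4, 0))` gives `(5, 9)`, so only 4 x 9 = 36 of the 81 window pixels are new when the second block is decoded.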

Fig 3.9 (a) 2x2 raster scanning order, (b) 4x4 raster scanning order, (c) extended 2x2 raster scanning order

Fig 3.10 Synchronization buffer scheme for two different scanning order in inter prediction (a) 2x2 raster scanning order, (b) 4x4 raster scanning order

In a video decoding system, inter prediction usually operates at the macroblock level. Thus, the decoding order of the 4 x 4 blocks, the smallest block element in the H.264/AVC video decoder, can be chosen freely within each macroblock. On this basis, the 2 x 2 and 4 x 4 raster scanning orders are depicted in Fig 3.9 (a) and (b); the column-major 4 x 4 raster scanning order needs only four transitions, fewer than the 2 x 2 raster scan. [...] pixels in the residual adder, because its scanning order differs from that of the residual decoder, which must follow the 2x2 raster scanning order defined in the standard [1].

Fig. 3.11 Content-swap operation (interpolator with attached content buffer)

[Figure: macroblock partition with motion vectors (1, 3), (2, 0), (-2, 1), (2, 0)]

Fig. 3.12 An example of macroblock partition; (1, 3) indicates (mv_x, mv_y).

To resolve this problem, we attach a content register to the interpolator, as illustrated in Fig 3.11, and adopt the extended 2x2 raster scanning order shown in Fig 3.9 (c). The size of the content register depends on the local registers in the interpolator. Each gray block in Fig. 3.9 (c) indicates a content-swap operation, which exchanges the entire contents of the interpolator's local registers with those of the content buffer. As a result, if the motion vectors of block 1 and block 4 are the same, the overlapped region in Fig. 3.7 (b) need not be re-fetched when decoding block 4. The extended 2x2 raster scanning order therefore follows the same 2 x 2 raster scan as the residual decoder while achieving the data reuse of the 4 x 4 raster scanning order. The content-swap operation takes effect only for larger block partitions or when the motion vectors of neighboring blocks are the same. The condition for executing this operation is given by expression (3.1).

$$\mathit{content\_swap\_condition} = (\mathit{mb\_type} == 16\mathrm{x}16) \;\|\; (\mathit{mb\_type} == 16\mathrm{x}8) \qquad (3.1)$$

However, for the example shown in Fig. 3.12, condition (3.1) evaluates to false. If we instead check the equality of neighboring motion vectors rather than the block size, the data in the example of Fig. 3.12 can be reused. The checking table of motion vectors between neighboring blocks is listed in Table 3.1.

Table 3.1 Neighboring MV checking table for content-swap operation

| Block number | Checking condition |
|---|---|
| 1 | MV1 == MV4 |
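The difference between checking the block size and checking motion vector equality can be sketched as follows. This is an illustration only: the function names and the `mb_type` strings are hypothetical, the set of large partitions is taken from the discussion in Section 3.1.3, and only the block pair (1, 4) listed in Table 3.1 is included.

```python
# Partitions for which the block-size condition takes effect (per Section 3.1.3)
LARGE_PARTITIONS = {"SKIP", "16x16", "16x8"}

# Block pairs checked by the MV-based method. Only the pair (1, 4) appears in
# Table 3.1 as given here; any further pairs would be assumptions.
MV_CHECK_PAIRS = {1: 4}


def swap_by_block_size(mb_type):
    """Content-swap decision based on the macroblock partition type only."""
    return mb_type in LARGE_PARTITIONS


def swap_by_mv_equality(block, mvs):
    """Content-swap decision based on equality of neighboring block MVs.

    mvs is a list of sixteen (mv_x, mv_y) tuples indexed by 4 x 4 block number.
    """
    partner = MV_CHECK_PAIRS.get(block)
    return partner is not None and mvs[block] == mvs[partner]


# Hypothetical 8x8-partitioned macroblock in which blocks 1 and 4 share an MV
example_mvs = [(1, 3)] * 16
example_mvs[1] = (2, 0)
example_mvs[4] = (2, 0)
```

In this example the block-size check rejects the swap because the macroblock type is 8x8, while the MV equality check accepts it for block 1, which is the situation the MV-based method is meant to capture.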

3.1.3 Analysis for Data Reuse Technique

To give a more generic and platform-independent analysis, we analyze the requisite pixels per MB and the cost overhead of each method. For fractional motion compensation of each macroblock, the required pixels per MB and the cost overhead of the different methods are summarized in Table 3.2. Assuming that every motion vector contains a fractional part, the best case has one motion vector and the worst case has 16 motion vectors per luma macroblock. Although the requisite pixels are the same for all methods in the worst case, requisite pixels [...] column-major methods, the 4 x 4 raster scanning order (RSO) is the most effective; however, it requires an additional synchronization buffer and degrades throughput because its RSO differs from that of the residual decoder. As for the extended methods, condition (3.1) takes effect only for larger block partitions (SKIP, 16x16, 16x8). That is, data cannot be reused in cases such as Fig. 3.12 even if the neighboring motion vectors are the same. To remove this disadvantage, method 5 checks the neighboring motion vectors rather than the block size, so the required bandwidth in the Fig. 3.12 case is reduced to that of the 4 x 4 RSO. The advantage of the extended methods is that they require only a content buffer, which is smaller than the buffer of method 3, and a few extra cycles for the content-swap operation. Method 4 performs better than methods 1/2/3 for larger block sizes (SKIP, 16x16, 16x8), and Fig. 3.13 shows that these block sizes account for 50% to 90% of the total. Furthermore, method 5 not only covers all cases of method 4 but also takes effect for smaller block sizes such as in Fig. 3.12. As shown in Fig. 3.14, applying the extended method in our design reduces the required memory bandwidth by about 30% compared with the column-major 2x2 RSO method.
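Two bounds on the pixels fetched per luma macroblock follow from the 9 x 9 window alone and may help when reading the analysis. They are derived here under the assumption of a 6-tap filter and fractional motion vectors everywhere, with hypothetical function names; they are not values taken from Table 3.2.

```python
def window(block_size, taps=6):
    """Side length of the interpolation window for a square block."""
    return block_size + taps - 1


def pixels_without_reuse(num_blocks=16, block_size=4):
    """Every 4 x 4 block fetches its own full window (no data reuse)."""
    return num_blocks * window(block_size) ** 2


def pixels_single_window(mb_size=16):
    """One motion vector for the whole macroblock with full data reuse."""
    return window(mb_size) ** 2


upper_bound = pixels_without_reuse()   # 16 windows of 9 x 9 pixels
lower_bound = pixels_single_window()   # one window of 21 x 21 pixels
```

Under these assumptions the bounds are 1296 and 441 pixels, so the attainable saving from data reuse depends strongly on how many neighboring blocks share a motion vector.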

Table 3.2 Static analysis of the different methods in H.264/AVC.

Assumption: every motion vector contains a fractional part.

| Method | Required pixels per luma MB: worst case | Required pixels per luma MB: best case | Required pixels per luma MB: Fig. 3.12 | Cost overhead |
