A Memory Bandwidth-Reduction Motion Compensator for H.264/AVC Application

National Chiao Tung University

College of Electrical and Computer Engineering, Degree Program of Electronics and Electro-Optical Engineering

Master's Thesis

Student: Hao-Min Chen

Advisor: Prof. Chen-Yi Lee

A Memory Bandwidth-Reduction Motion Compensator for H.264/AVC Application

Student: Hao-Min Chen

Advisor: Chen-Yi Lee

National Chiao Tung University

College of Electrical and Computer Engineering, Degree Program of Electronics and Electro-Optical Engineering

Master's Thesis

A Thesis

Submitted to College of Electrical and Computer Engineering, National Chiao Tung University

in Partial Fulfillment of the Requirements for the Degree of

Master of Science in

Electronics and Electro-Optical Engineering

December 2010

Hsinchu, Taiwan, Republic of China

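A Memory Bandwidth-Reduction Motion Compensator for H.264/AVC Application

Student: Hao-Min Chen    Advisor: Prof. Chen-Yi Lee

Master's Program, Degree Program of Electronics and Electro-Optical Engineering, College of Electrical and Computer Engineering, National Chiao Tung University

Abstract (Chinese)

In recent years, demand has grown for the H.264/AVC High Profile video standard, which has been adopted by the HDTV broadcast service of Digital Video Broadcasting and by Blu-ray Disc. The motion compensation unit usually accounts for most of the computation in a video decoding system, because it requires a large amount of data transfer with the reference frame memory. In particular, the most advanced H.264/AVC Main/High Profile supports bi-directional reference pictures, which greatly increases the required memory bandwidth. The proposed memory bandwidth reduction strategies reduce the required memory bandwidth by up to 80% while keeping the same decoding order as the rest of the video decoding system. Compared with a conventional architecture, the reconfigurable interpolator proposed for H.264 saves 20% of the logic gates. Our motion compensation unit supports both H.264 Baseline Profile @ Level 4.0 and H.264 Main/High Profile @ Level 4.0, achieves real-time decoding of 1080 HD at 100.0 MHz, and has a total gate count of 68 K.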


A Memory Bandwidth-Reduction Motion Compensator for

H.264/AVC Application

Student : Hao-Min Chen Advisor : Dr. Chen-Yi Lee

Degree Program of Electrical and Computer Engineering

National Chiao Tung University

ABSTRACT

In recent years, demand has grown for the H.264/AVC High Profile video standard, which has been adopted by the Digital Video Broadcasting (DVB) HDTV broadcast service and the Blu-ray Disc storage format. The motion compensation unit usually accounts for most of the computation time in a video decoding system because of the enormous data transfer with the reference frame memories. In particular, the H.264/AVC Main/High Profile supports bi-predictive reference frames, which significantly increases the required memory bandwidth. The proposed memory bandwidth reduction strategies not only reduce the required memory bandwidth by up to 80% but also maintain the same decoding order as the rest of the video decoding system. The proposed restructured interpolator saves 20% of the logic gates compared with a traditional design. Our motion compensator supports both H.264 Baseline Profile @ Level 4.0 and Main/High Profile @ Level 4.0, achieves real-time decoding of 1080 HD at 100 MHz, and has a total gate count of 68k NAND2-equivalent CMOS logic gates.


Acknowledgements

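First, I would like to thank my advisor, Dean of Research and Development Chen-Yi Lee, for his guidance and encouragement throughout my master's studies. Although I often gave rather comical responses, he always guided me with great patience. Next, I thank the doctoral senior who mentored me, 李曜, leader of the multimedia group in our Si2 laboratory; his effective leadership and generous help kept my research moving forward. I also thank my senior 王勝仁, whose excellent thesis was a valuable reference and helped my research greatly. I am grateful to the members of the Si2 laboratory: 耀琳, who often stayed up late to teach me ICLAB; 明瑜, who never misses an opening; 建辰, who shared his good ideas with me; 義澤, who shared his knowledge of nightlife culture; 人偉, who shared his dating experience; 勝舜, who often made introductions for me; 元雍 and 盈鋒, who often chatted with me after class; 長宏, who is always saying "really"; my senior 子明, who often came back to visit; and 欣儒, who was always asking me to donate money. We did research together, stayed up late together, sang together, chatted together, and ate together, which brought a great deal of joy to the otherwise dull life of a graduate student. Finally, I thank my family, who strongly urged me to pursue further studies. Their support let me devote myself fully to my research without any worries.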


List of Figures

FIG 2.1 GENERAL STRUCTURE OF H.264 ENCODER

FIG 2.2 GENERAL STRUCTURE OF H.264 DECODER

FIG 2.3 H.264/AVC VIDEO DECODER SOFTWARE PROFILE ON ARM PROCESSOR (JM 8.2)

FIG 2.4 MACROBLOCK PARTITIONS AND SUB-MACROBLOCK PARTITIONS

FIG 2.5 (A) LUMA HALF SAMPLE WITH 6-TAP FIR, (B) LUMA QUARTER SAMPLE WITH BILINEAR FILTER, (C) CHROMA SAMPLE WITH BILINEAR FILTER. UPPER-CASE LETTERS INDICATE THE FULL SAMPLES AND LOWER-CASE LETTERS INDICATE THE INTERPOLATED FRACTIONAL SAMPLES

FIG 2.6 (A) DIRECTIONAL PREDICTION FOR 8 X 16 BLOCK SIZE, (B) DIRECTIONAL PREDICTION FOR 16 X 8 BLOCK SIZE, (C) MEDIAN PREDICTION

FIG 2.7 BI-PREDICTION EXAMPLES

FIG 2.8 EXAMPLES OF PREDICTION MODES IN B SLICE MACROBLOCKS

FIG 2.9 EXAMPLE FOR TEMPORAL DIRECT-MODE MOTION VECTOR

FIG 2.10 INTERLACED VIDEO SEQUENCE

FIG 2.11 MACROBLOCK-ADAPTIVE FRAME-FIELD CODING

FIG 3.1 MOTION COMPENSATION ENGINE FOR H.264 VIDEO DECODER

FIG 3.7 MOTION VECTORS INFORMATION STORAGE FOR MOTION VECTOR PREDICTOR FOR QCIF FRAME FORMAT

FIG 3.8 (A) NEIGHBORING MOTION VECTORS NEEDED WHEN DECODING ALL MOTION VECTORS IN CURRENT MBAFF MACROBLOCK, (B) REDUCED AND COMBINED WITH NON-MBAFF VERSION

FIG 3.9 MOTION VECTOR GENERATOR ARCHITECTURE FOR QCIF-FORMAT SUPPORT MBAFF

FIG 3.10 MOTION VECTOR GENERATOR ARCHITECTURE

FIG 3.11 SEPARATE 1-D INTERPOLATOR DESIGN (NO PARALLEL)

FIG 3.12 ONLY ONE HALF PIXEL IS NEEDED

FIG 3.13 ORIGINAL 4-PARALLEL SEPARATE 1-D LUMA INTERPOLATOR

FIG 3.14 ENHANCE 4-PARALLEL SEPARATE 1-D LUMA INTERPOLATOR

FIG 3.15 INTERPOLATION WINDOW FOR EACH 2 X 2 CHROMA BLOCK

FIG 3.16 (A) CHROMA INTERPOLATOR, (B) VERTICAL/HORIZONTAL FILTER

FIG 3.17 2-PARALLEL CHROMA INTERPOLATOR

FIG 3.18 (A) LUMA FIR DESIGN IN CHEN'S [3], (B) BILINEAR FILTER

FIG 3.19 COMBINED LUMA/CHROMA INTERPOLATOR DESIGN FOR H.264

FIG 3.20 (A) PATH OF LUMA FIR INTERPOLATOR, (B) PATH OF CHROMA 1/8 BILINEAR

FIG 3.21 ENTIRE INTERPOLATOR ARCHITECTURE

FIG 3.22 WEIGHTED PREDICTOR DESIGN

FIG 3.23 ENTIRE WEIGHT PREDICTOR ARCHITECTURE

FIG 4.1 4 X 4 BLOCK WINDOW AND THE CORRESPONDING 9 X 9 INTERPOLATION WINDOW

FIG 4.8 EMBEDDED COMPRESS/DECOMPRESS METHOD

FIG 4.9 FRACTIONAL SAMPLE POSITIONS FOR QUARTER SAMPLE LUMA INTERPOLATION

FIG 4.10 FRACTIONAL SAMPLE ONLY NEED HORIZONTAL SAMPLES

FIG 4.11 FRACTIONAL SAMPLE ONLY NEED VERTICAL SAMPLES

FIG 4.12 PRE-FETCH MECHANISM

FIG 4.13 4X4 BLOCK WINDOW AND THE CORRESPONDING 9X9 INTERPOLATION WINDOW AND OVERLAPPED REGION FOR NEIGHBORING INTERPOLATION WINDOW

FIG 4.14 INTRA MB OVERLAP PIXELS REUSING

FIG 4.16 ALL OVERLAP REGION INCLUDE BETWEEN PREVIOUS UPPER MB AND LEFT MB

FIG 4.17 NO OVERLAP REGION CAN BE REUSED

FIG 5.1 MOTION COMPENSATION ENGINE FOR H.264 VIDEO DECODER

FIG 5.2 SIMULATION RESULTS OF BANDWIDTH REDUCTION STRATEGIES

FIG 5.3 COMPARE RELATED WORKS

FIG 5.4 RATIO OF PIXELS POSITION IN AKIYO AND STEFAN SEQUENCE

FIG 5.5 LUMA INTEGER/FRACTIONAL MOTION VECTOR PROPORTION FOR H.264/AVC


List of Tables

TABLE 3.1 MEDIAN PREDICTION TABLE IN MBAFF FRAMES

TABLE 3.2 CO-LOCATED MACROBLOCK TABLE

TABLE 3.3 CO-LOCATED PARTITION TABLE

TABLE 4.2 SUMMARY OF LUMA INTERPOLATION WINDOWS

TABLE 4.3 SUMMARY OF CHROMA INTERPOLATION WINDOWS

TABLE 4.4 STORAGE REQUIREMENT AND LIFETIME ANALYSIS

TABLE 4.5 SUMMARY OF LUMA INTERPOLATION WINDOWS AND REDUCTION PERCENT

TABLE 4.6 SUMMARY OF REDUCTION PERCENT IN DIFFERENT OVERLAP REGION

TABLE 5.1 VIDEO DECODER SPECIFICATION IN OUR DESIGN


Chapter 1 Introduction

1.1 Motivation

In recent years, the newest video coding standard, published jointly as Part 10 of MPEG-4 and ITU-T Recommendation H.264 [1], has provided excellent video compression performance. The H.264/AVC standard provides a technical solution for a wide range of applications, including video-on-demand (VOD), mobile networks, and high-definition TV; broadcast over cable, satellite, cable modem, DSL, or terrestrial channels; interactive or serial storage such as BD; conversational services over ISDN, Ethernet, LAN, wireless, or mobile networks; and multimedia messaging services over DSL, ISDN, etc.

In addition, in Nov. 2004, Digital Video Broadcasting Handheld (DVB-H) [5] mandated support of Main Profile for H.264/AVC SDTV receivers, with an option for the use of High Profile. Support of High Profile is mandated for H.264/AVC HDTV decoders. Moreover, high-definition TV requires huge data transfers, particularly to and from frame memory, so a memory controller that communicates efficiently with frame memory is the most significant component of the entire video decoding system. Within the video decoding system, motion compensation always dominates the total amount of data transfer, especially when SDRAM or DDR-SDRAM is adopted as external frame memory. Motion compensation should also


1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes and analyzes the algorithm. Chapter 3 first presents the motion compensation engine for the H.264/AVC video decoder and then illustrates the motion compensation engine for H.264 High Profile. In Chapter 4, we propose bandwidth reduction strategies that reduce the required bandwidth, particularly in H.264/AVC integer and fractional motion compensation. We also present the frame memory organization and the memory bandwidth analysis. Implementation result is given


Chapter 2 Algorithm Description and Analysis

[Block diagram: current frame, ME, MC, reference frame, choice of intra prediction, intra prediction, T, Q, IQ, IT, filter, reconstructed frame, reorder, entropy encoder, NAL; inter and intra paths]

Fig 2.1 General structure of H.264 encoder

[Block diagram: NAL bitstream, entropy decoder, reorder, IQ, IT, MC, reference frame, intra prediction, filter, reconstructed frame; inter and intra paths]

Fig 2.2 General structure of H.264 decoder

Fig 2.1 and Fig 2.2 show the general structure of the H.264/AVC video encoder and

Network Abstraction Layer (NAL). We discuss only the VCL, which efficiently represents the video content. H.264/AVC follows the so-called block-based hybrid video coding approach, which combines temporal and spatial prediction with transform coding.

This chapter is structured as follows. Section 2.1 illustrates the software profiling. Section 2.2 then describes the H.264/AVC motion compensation algorithm. Finally, section 2.3 presents the H.264/AVC High Profile.

2.1 Profiling

Fig 2.3 H.264/AVC video decoder software profile on ARM processor (JM 8.2)

[Pie chart. Legend entries: Others (Intra Prediction, etc.), Write File, PSNR Computation, De-blocking Filter, CAVLC, IQ/IDCT, Ref. Frame Copy, Reconstruction. Segment values: 7%, 8%, 9%, 7%, 9%, 9%, 8%, 11%, 32%.]

Fig 2.3 [8] shows the H.264/AVC decoder profile on an ARM processor, using the JM 8.2 reference software [7]. The motion compensation related modules, namely motion compensation, reconstruction, and reference frame copy, account for 51% of the entire video decoder profile. Parallel processing, bandwidth reduction, or pipelining in an ASIC design can significantly reduce this dominant part.

2.2 Inter Prediction Algorithm for H.264/AVC Standard

The H.264/AVC standard supports variable block sizes (VBS) in inter prediction [1] [2]. The smallest block size is 4x4 for luma and 2x2 for chroma. Fig 2.4 [1] illustrates all types of partitions.

[Diagram. Macroblock partitions: 16x16, 16x8, 8x16, 8x8. Sub-macroblock partitions: 8x8, 8x4, 4x8, 4x4. Partitions within each type are indexed from 0.]

Fig 2.4 Macroblock partitions and sub-macroblock partitions

The H.264/AVC standard also supports high motion resolution, reaching quarter-sample accuracy for luma and one-eighth-sample accuracy for chroma. Luma half-sample interpolation with a 6-tap (1, -5, 20, 20, -5, 1) symmetrical FIR filter and quarter-sample interpolation with a bilinear filter are illustrated in Fig 2.5 (a)-(b). The prediction value of the chroma component is generated using the bilinear interpolator illustrated in Fig 2.5 (c), and the displacement can achieve one-eighth accuracy. Mathematically, both are 2-D interpolations. In hardware, however, these equations can be split into two 1-D passes to reduce cost: a horizontal filter followed by a vertical one, or vice versa.

(a) Luma half samples:

$$b = \big((E - 5F + 20G + 20H - 5I + J) + 16\big) \gg 5$$

$$h = \big((A - 5C + 20G + 20M - 5R + T) + 16\big) \gg 5$$

$$j = \big((aa - 5\,bb + 20b + 20s - 5\,gg + hh) + 16\big) \gg 5$$

(b) Luma quarter samples:

$$a = (G + b + 1) \gg 1, \qquad d = (G + h + 1) \gg 1$$

$$e = (b + h + 1) \gg 1, \qquad g = (b + m + 1) \gg 1$$

(c) Chroma sample at fractional offset (xFrac, yFrac) between full samples A, B, C, and D:

$$i = \big((8 - xFrac)(8 - yFrac)A + xFrac\,(8 - yFrac)B + (8 - xFrac)\,yFrac\,C + xFrac \cdot yFrac \cdot D + 32\big) \gg 6$$

Fig 2.5 (a) Luma half sample with 6-tap FIR, (b) luma quarter sample with bilinear filter, (c) chroma sample with bilinear filter. Upper-case letters indicate the full samples and lower-case letters indicate the interpolated fractional samples
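The filters in Fig 2.5 can be sketched in a few lines of Python. This is a minimal illustration rather than the hardware design of this thesis: the function names are hypothetical, and the final clipping to the 8-bit range is an assumption taken from the H.264 standard, since it is not written out in the equations of Fig 2.5.

```python
def clip255(x):
    """Clamp a value to the 8-bit sample range (assumed, per the standard)."""
    return max(0, min(255, x))


def half_sample(p0, p1, p2, p3, p4, p5):
    """Half-sample value such as b or h: 6-tap FIR (1, -5, 20, 20, -5, 1),
    rounded, shifted right by 5, then clipped."""
    acc = p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5
    return clip255((acc + 16) >> 5)


def quarter_sample(s0, s1):
    """Quarter-sample value such as a or d: rounded average of two samples."""
    return (s0 + s1 + 1) >> 1


def chroma_sample(A, B, C, D, x_frac, y_frac):
    """Chroma bilinear interpolation; x_frac and y_frac are in 1/8 samples."""
    return ((8 - x_frac) * (8 - y_frac) * A + x_frac * (8 - y_frac) * B
            + (8 - x_frac) * y_frac * C + x_frac * y_frac * D + 32) >> 6


# In a flat region every fractional sample equals the integer sample value.
row = [100] * 6
b = half_sample(*row)
a = quarter_sample(row[2], b)
```

With a zero fractional offset, `chroma_sample` returns the full sample A unchanged, which is a quick sanity check on the weights.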

The motion vector (MV) is the sum of the motion vector difference (MVD) and the motion vector prediction (MVP), as expressed in Eq. 2.1.

$$\mathrm{MVx} = \mathrm{MVDx} + \mathrm{MVPx}, \qquad \mathrm{MVy} = \mathrm{MVDy} + \mathrm{MVPy} \qquad \text{(Eq. 2.1)}$$


The MVD is decoded from the bitstream, and the MVP is predicted from neighboring motion vectors. The MVP algorithm uses directional prediction for the 16 x 8 and 8 x 16 block sizes and median prediction for other block sizes. The details of the MVP decision are shown in Fig 2.6 [8]. Eq. 2.2 expresses median prediction. In addition, some boundary conditions and exceptions have to be handled carefully. For instance, when MVC is not available, its value is replaced by the motion vector of neighboring block D. We do not go into the details of those boundary conditions here.

$$\mathrm{MVP} = \mathrm{median}(\mathrm{MVA}, \mathrm{MVB}, \mathrm{MVC}) \qquad \text{(Eq. 2.2)}$$

[Diagram: neighboring blocks A, B, C, and D around the current MB/sub-MB partition, for the 8x16, 16x8, and median cases]

Fig 2.6 (a) Directional prediction for 8 x 16 block size, (b) directional prediction for 16 x 8 block size, (c) median prediction
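Eq. 2.1 and Eq. 2.2 can be illustrated with a short Python sketch. The function names are hypothetical, motion vectors are assumed to be (x, y) tuples in quarter-sample units, and only the single fallback mentioned above (neighbor D replacing an unavailable neighbor C) is modeled; all other boundary conditions are omitted.

```python
def median3(a, b, c):
    """Median of three integers."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c, mv_d=None):
    """Median prediction (Eq. 2.2), applied to each component separately.
    If neighbor C is unavailable (None), neighbor D is used in its place."""
    if mv_c is None:
        mv_c = mv_d
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))


def reconstruct_mv(mvd, mvp):
    """Eq. 2.1: MV = MVD + MVP for each component."""
    return (mvd[0] + mvp[0], mvd[1] + mvp[1])


mvp = predict_mv((4, -2), (8, 0), (-4, 6))
mv = reconstruct_mv((1, -1), mvp)
```

Here the x components 4, 8, and -4 have median 4 and the y components -2, 0, and 6 have median 0, so the predictor is (4, 0) and the reconstructed vector is (5, -1).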

In addition to the motion-compensated block sizes described in Fig 2.4, a P macroblock can also be coded in P_SKIP mode. In this mode, neither the residual signal nor motion information is transmitted; the motion vector is derived from the MVP alone. The reference pixels are reconstructed in the same way as for macroblock type P_16x16. Macroblocks coded in P_SKIP mode are often located in large areas with no change or slow motion. Beyond these techniques, H.264/AVC also supports multiple reference frames, weighted prediction, and direct mode for B slices, which we present in section 2.3. These tools further improve coding efficiency.


2.3 Inter Prediction for H.264/AVC High profile Standards

Regarding motion compensation, the tools supported by H.264/AVC Main/High Profile are B slices, weighted prediction, and interlaced video.

In an inter-coded macroblock of a B slice, each macroblock partition may be predicted from one or two reference pictures, which may precede or follow the current picture in temporal order. This tool improves coding efficiency by offering more possibilities for selecting the best-matching prediction references for the macroblock partitions in a B slice. Fig 2.7 shows the three reference directions: (a) forward and backward reference pictures, the so-called bi-directional reference, (b) backward references, and (c) forward references [6]. B slices use two lists of coded reference pictures, LIST_0 and LIST_1, each of which can include backward and/or forward coded pictures.

[Diagram: a B picture with (a) forward and backward, (b) two backward, and (c) two forward reference pictures]

Fig 2.7 Bi-prediction examples

A B slice has four prediction modes: (a) direct mode, (b) LIST_0 mode, (c) LIST_1 mode, and (d) bi-predictive mode. Within a macroblock, each partition can choose a different prediction mode. When the 8 x 8 partition size is used, the mode chosen for each 8 x 8 partition applies to all sub-partitions within that partition. Fig 2.8 shows two examples of prediction mode combinations. In bi-predictive mode, two motion-compensated reference regions are obtained from the LIST_0 and LIST_1 pictures respectively. The LIST_0 and/or LIST_1 motion vectors in a bi-predictive macroblock or block are predicted from neighboring motion vectors with the same temporal direction. For instance, a motion vector from the current macroblock pointing to a forward picture is predicted from neighboring vectors that also point to forward pictures.

[Diagram: two macroblocks whose partitions are labeled with the modes Bi-pred, L0, L1, and Direct]

Fig 2.8 Examples of prediction modes in B slice macroblocks

Similar to a skipped P macroblock coded in P_SKIP mode, a B macroblock can also be coded in direct mode. In direct mode, no motion vector is transmitted for the B-slice macroblock or macroblock partition. Instead, the decoder predicts the LIST_0 and LIST_1 motion vectors from neighboring vectors and carries out bi-predictive motion compensation for the block. Two methods, spatial and temporal, can be used to calculate the LIST_0 and LIST_1 motion vectors for direct-mode macroblocks or partitions.

Spatial direct mode is similar to P_SKIP mode. Furthermore, it supports bi-prediction


However, some conditions and exceptions have to be handled carefully. For example, if the co-located macroblock or partition has a motion vector smaller than +/- 1/2 luma sample in magnitude (and in some other conditions), one or both of the predicted vectors are set to zero. We do not go into the details of those conditions here.

Temporal direct mode differs from P_SKIP mode. As in spatial direct mode, motion vectors are derived at 4 x 4 block granularity. The motion vectors mvL0 and mvL1 are derived as scaled versions of the motion vector mvCol of the co-located sub-macroblock partition. The scaling is based on the picture-order-count (POC) distances between the current picture and the LIST_0 and LIST_1 pictures. Fig 2.9 illustrates temporal direct-mode motion vector inference. An object moving at constant velocity is well suited to temporal direct mode. An object that can be predicted as the average of the backward and forward references is well suited to spatial direct mode. A still object is well suited to skip mode. The encoder can use skip/direct mode to save one or two motion vector differences (mvd) in every skip/direct-mode partition, further enhancing compression efficiency.

[Diagram: List 0 reference, current picture, and List 1 reference on a time axis, with the co-located MV, MV L0, MV L1, and the picture-order-count distances]

Fig 2.9 Example for temporal direct-mode motion vector
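The POC-based scaling of temporal direct mode can be sketched as follows. The constants and rounding follow my reading of the temporal direct derivation in the H.264 standard and are not taken from this thesis. The function names are hypothetical, motion vectors are (x, y) tuples in quarter-sample units, and the two reference pictures are assumed to have different POC values so that the divisor is non-zero.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def div_trunc(a, b):
    """Integer division truncating toward zero (Python's // floors instead)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def temporal_direct_mv(mv_col, poc_cur, poc_l0, poc_l1):
    """Derive mvL0 and mvL1 as scaled versions of the co-located vector."""
    tb = clip3(-128, 127, poc_cur - poc_l0)   # current to LIST_0 distance
    td = clip3(-128, 127, poc_l1 - poc_l0)    # LIST_1 to LIST_0 distance
    tx = div_trunc(16384 + abs(div_trunc(td, 2)), td)
    dist_scale = clip3(-1024, 1023, (tb * tx + 32) >> 6)
    mv_l0 = tuple((dist_scale * c + 128) >> 8 for c in mv_col)
    mv_l1 = tuple(l0 - c for l0, c in zip(mv_l0, mv_col))
    return mv_l0, mv_l1


# Current picture halfway between its two references.
mv_l0, mv_l1 = temporal_direct_mv((8, -4), poc_cur=2, poc_l0=0, poc_l1=4)
```

When the current picture lies halfway between the two references, mvL0 is half of mvCol and mvL1 is its mirror image, which matches the constant-velocity case described above.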

Another tool supported in Main/High Profile is weighted prediction (WP), a method of scaling the prediction samples to improve video quality in H.264/AVC decoding. Weighted prediction controls the relative weights of the interpolated regions in the motion-compensated prediction process. For example, WP may be effective in coding 'fade' transitions (where one scene fades into another). Weighted prediction has three modes: default, explicit, and implicit. In default mode, two motion-compensated reference regions are obtained from the LIST_0 and LIST_1 pictures respectively, and each sample of the prediction block is calculated as the average of the LIST_0 and LIST_1 prediction samples, as expressed in Eq. 2.3.

$$\mathrm{predPart} = (\mathrm{predPartL0} + \mathrm{predPartL1} + 1) \gg 1 \qquad \text{(Eq. 2.3)}$$

When explicit or implicit mode is in use, Eq. 2.4 is used to calculate each sample of the prediction block. The two modes differ in how the weighting factors are obtained. In implicit mode, the weighting factors are calculated from the picture-order-count distance between the LIST_0 and LIST_1 reference pictures, similar to temporal direct mode in motion vector prediction. In explicit mode, the encoder determines the weighting factors. In other words, the purpose of implicit mode is to avoid carrying weighted prediction parameters in the bitstream, further enhancing compression efficiency.

$$\mathrm{predPart} = \big((\mathrm{predPartL0} \cdot w_0 + \mathrm{predPartL1} \cdot w_1 + 2^{\mathrm{logWD}}) \gg (\mathrm{logWD} + 1)\big) + \big((o_0 + o_1 + 1) \gg 1\big) \qquad \text{(Eq. 2.4)}$$
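Eq. 2.3 and Eq. 2.4 can be sketched in Python for a single sample. This is a minimal illustration with hypothetical function names. The final clipping to the 8-bit range is an assumption based on the H.264 standard and is not part of the equations as written here.

```python
def clip255(x):
    """Clamp a value to the 8-bit sample range (assumed, per the standard)."""
    return max(0, min(255, x))


def default_bipred(p0, p1):
    """Eq. 2.3: rounded average of the LIST_0 and LIST_1 prediction samples."""
    return (p0 + p1 + 1) >> 1


def weighted_bipred(p0, p1, w0, w1, o0, o1, log_wd):
    """Eq. 2.4: weighted bi-prediction with weights w0, w1 and offsets o0, o1."""
    weighted = (p0 * w0 + p1 * w1 + (1 << log_wd)) >> (log_wd + 1)
    offset = (o0 + o1 + 1) >> 1
    return clip255(weighted + offset)


# Equal weights of 32 with logWD = 5 and zero offsets reduce to Eq. 2.3.
equal = weighted_bipred(100, 103, 32, 32, 0, 0, 5)
# Unequal weights, as might be used during a fade, favor the LIST_0 sample.
fade = weighted_bipred(100, 103, 48, 16, 0, 0, 5)
```

With equal weights and zero offsets the weighted form gives the same result as the default average, which shows that default mode is a special case of Eq. 2.4.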

As for the interlaced video tool, a video signal may be sampled as a sequence of complete frames or of interlaced fields. An interlaced video sequence contains a series of fields, each consisting of either the odd-numbered or the even-numbered lines of a complete video frame. Fig 2.10 illustrates the fields in a video sequence. A field represents half of the data in a complete video frame and is sampled at each temporal interval. The advantage of interlaced video coding is that it can send twice as many fields per second as the number of frames in an equivalent progressive sequence at the same data rate, giving the appearance of smoother motion. For instance, an NTSC video sequence consists of 60 fields per second and, when played back, motion can appear smoother than in an equivalent progressive video sequence containing 30 frames per second.

[Diagram: alternating top fields and bottom fields along the time axis]

Fig 2.10 Interlaced video sequence
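The relation between a frame and its two fields can be shown with a small Python sketch. The function names are hypothetical, a frame is modeled as a list of rows, and rows are assumed to be numbered from 0 so that the top field holds rows 0, 2, 4, and so on.

```python
def split_fields(frame_rows):
    """Split a frame into its top field and bottom field (alternate rows)."""
    top_field = frame_rows[0::2]
    bottom_field = frame_rows[1::2]
    return top_field, bottom_field


def weave_fields(top_field, bottom_field):
    """Interleave two fields back into a complete frame."""
    rows = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        rows.append(top_row)
        rows.append(bottom_row)
    return rows


frame = ["row0", "row1", "row2", "row3"]
top, bottom = split_fields(frame)
```

Each field carries half of the rows, so two fields can be sent in the time budget of one frame.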

Frame coding is more efficient than field coding for progressive video and for static pictures in interlaced video. Conversely, field coding is more efficient for moving pictures in interlaced video. However, often only part of a frame is fast moving. Hence, H.264/AVC Main/High Profile provides another tool for interlaced video, macroblock-adaptive frame/field (MBAFF) coding, which selects frame or field coding at the macroblock level. The corresponding picture-level selection is sometimes called PicAFF. As an extension of PicAFF, MBAFF is used to improve the coding efficiency of pictures with both static and moving regions [21]. In MBAFF mode, the current slice is processed in units of 16 luma samples wide and 32 luma samples

choose to encode each MB pair as (a) a frame macroblock pair or (b) a field macroblock pair, and may select the optimum coding mode for each region of the picture.

[Diagram: a 16 x 32 MB pair split into two 16 x 16 macroblocks in (a) frame MB mode and (b) field MB mode]

Fig 2.11 Macroblock-adaptive frame-field coding

2.4 Bandwidth Requirement for Inter Prediction

As shown so far, interpolation is becoming more and more important in state-of-the-art video coding. With bi-prediction, the interpolation window doubles for the same block; in other words, twice as many cycles are required to interpolate each macroblock. For instance, in a B macroblock, two 9 x 9 interpolation windows are required to interpolate a luma 4 x 4 block, and four 3 x 3 interpolation windows are required to interpolate the two chroma 2 x 2 blocks.

In the worst case, the interpolator needs 398 MB/s for P frames and 796 MB/s for B frames when supporting 1920 x 1088 at 30 fps. In other words, motion compensation demands huge memory bandwidth. Huge data traffic also means large power consumption for bus activity and data operations.

To reduce the bandwidth required from frame memory, memory bandwidth reduction strategies for motion compensation are proposed in Chapter 4.
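The window sizes quoted above follow from the filter lengths, as the Python sketch below illustrates. The function names are hypothetical. It assumes that the 6-tap luma filter needs 5 extra rows and columns and the bilinear chroma filter 1 extra row and column, which reproduces the 9 x 9 and 3 x 3 windows mentioned in the text. It counts samples only and does not model bus width or burst alignment.

```python
def luma_window(width, height):
    """Reference window for a luma block: the 6-tap filter adds 5 samples
    in each dimension."""
    return (width + 5, height + 5)


def chroma_window(width, height):
    """Reference window for a chroma block: the bilinear filter adds 1 sample
    in each dimension."""
    return (width + 1, height + 1)


def samples_per_luma_4x4(bi_pred):
    """Reference samples fetched for one luma 4 x 4 block together with its
    two chroma 2 x 2 blocks (one per chroma component)."""
    luma_w, luma_h = luma_window(4, 4)
    chroma_w, chroma_h = chroma_window(2, 2)
    per_list = luma_w * luma_h + 2 * chroma_w * chroma_h
    return per_list * (2 if bi_pred else 1)


p_samples = samples_per_luma_4x4(bi_pred=False)
b_samples = samples_per_luma_4x4(bi_pred=True)
```

A luma 4 x 4 block produces only 16 output samples but reads an 81-sample window, so small partitions with fractional motion vectors dominate the worst-case traffic.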


2.5 Summary

The H.264/AVC profiling on an ARM processor shows that an efficient hardware accelerator or ASIC design for motion compensation is important. The inter prediction for H.264/AVC Baseline and Main/High profiles and the bandwidth requirement are also illustrated in this


Chapter 3 Motion Compensation Design for H.264/AVC Main/High Profile Video Decoder

The state-of-the-art video coding standard H.264/AVC provides a compression ratio that significantly outperforms all previous video compression standards. In addition, the Main/High profiles of H.264/AVC provide many more tools than Baseline Profile to further enhance the compression ratio. A design that combines multiple video coding profiles is therefore essential to support modern multimedia systems. The challenge is to design an efficient video decoder for multi-profile video applications without significantly increasing complexity.

This chapter discusses the design of motion compensation, which dominates the amount of data transfer in the H.264/AVC video decoder. The rest of the chapter is structured as follows. Section 3.1 illustrates the motion compensation engine for the H.264/AVC decoder. The combined motion compensation engine for H.264/AVC Baseline/Main/High profiles and its analysis are discussed in section 3.2. Finally, a summary is given in section 3.3.


3.1 Motion Compensation Engine for H.264/AVC decoder

Fig 3.1 Motion compensation engine for H.264 video decoder

Fig 3.1 illustrates the whole motion compensation engine for the H.264/AVC video decoder. First, the motion vector generator produces motion vectors from the motion data. Then, the address generator uses each motion vector, together with the memory bandwidth reduction strategies, to generate the address of the reference region and transfers the reference address to the system memory controller (commonly known as the arbiter). The memory access controller schedules consecutive access commands and sends them to the frame memories. The burst-read data is kept in the read data buffer and then filtered by the interpolator. Finally, the interpolated reference data passes through the weighted predictor to produce the motion compensation result. The result is added to the residual data and then passes through the de-blocking filter. Our proposed decoder adopts a ping-pong structured external frame memory [9], in which two memories alternately store the reference and current frames.
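The ping-pong arrangement can be sketched in Python as two banks that swap roles after every decoded frame. This is a conceptual illustration with a hypothetical class name. It models a single reference frame only and says nothing about the actual memory organization, which is discussed in Chapter 4.

```python
class PingPongFrameMemory:
    """Two frame banks: one is written with the current frame while the
    other is read as the reference frame."""

    def __init__(self):
        self.banks = [None, None]
        self.current = 0  # index of the bank being written

    @property
    def reference(self):
        """Frame held in the bank that is currently read as reference."""
        return self.banks[1 - self.current]

    def write_current(self, frame):
        """Store the frame being reconstructed in the current bank."""
        self.banks[self.current] = frame

    def swap(self):
        """After a frame is finished, the two banks exchange roles."""
        self.current = 1 - self.current


memory = PingPongFrameMemory()
memory.write_current("frame 0")
memory.swap()
first_reference = memory.reference
memory.write_current("frame 1")
memory.swap()
second_reference = memory.reference
```

Because the roles alternate, no frame copy is needed between decoding one frame and using it as the reference for the next.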

The following subsections discuss the details of the other modules, excluding the memory bandwidth reduction strategies, which are discussed in detail in Chapter 4. Subsection 3.2 illustrates the motion vector generator (MVG) supporting Main/High Profile, including the motion vector predictor and the related storage. Subsection 3.3 combines the luma and chroma interpolator designs. Subsection 3.4 shows the weighted predictor design. Finally, a summary is presented in section 3.5.

3.2 MVG Supporting Main/High Profile

[Diagram: motion vector storage between the frame boundaries for MB pairs 0 to 11 (top and bottom), the current top and bottom MB, and the following MB pairs]

Fig 3.2 Motion vector information storage for the motion vector predictor for QCIF frame format

Two tools must be handled in the MVG to support Main/High Profile. The first is the B slice type, which doubles the number of motion vectors. The second is MBAFF mode, in which macroblocks are handled as macroblock pairs. As for P slices, Fig 3.2 shows an example of the total storage required by the motion vector generator. For the QCIF frame format, a total of 4 x 11 x 2 motion vectors, each with both components, have to be stored. Fig 3.3 (a) shows the details of the required neighboring motion vectors. To decode T0-T15 in the current top MB, neighboring motion vectors are needed in the left (TL0-TL3, MVL0-MVL3), above (TU0-TU3, MVU0-MVU3), above-right (TRU, MVRU), and above-left (TLU, MVLU) positions. MV buffers of size 4 x 8 are required because the maximum number of motion vectors per MB pair is thirty-two. If we reuse the same 4 x 4 MV buffers and add a few buffers (T10, T11, T14, and T15), the MV buffers can be further reduced. Fig 3.3 (b) shows the reduced

[Diagram: current motion vectors T0-T15 and B0-B15 (MV0-MV15 in the reduced version) with their left (TL0-TL3, BL0-BL3, MVL0-MVL3), above (TU0-TU3, BU0-BU3, MVU0-MVU3), above-right (TRU, BRU, MVRU), and above-left (TLU, BLU, MVLU) neighbors, plus the added buffers T10, T11, T14, and T15]

Fig 3.3 (a) Neighboring motion vectors needed when decoding all motion vectors in current MBAFF macroblock, (b) reduced and combined with non-MBAFF version

Fig 3.4 shows the detailed architecture of the motion vector generator, which combines non-MBAFF and MBAFF modes. When operating in non-MBAFF mode, the TX storages (with X being 5, 7, 13, 15, and so on) can be switched off to save power. As for P slices, Table 3.1 lists MVA, MVB, MVC, and MVD for each block size_position index. The difference is that in MBAFF mode the lookup depends not only on the size_position index but also on whether the current MB pair is frame or field coded, whether the current MB is the top or bottom MB, and whether the relative MB pair is frame or field coded. Therefore, the LUT in MBAFF mode is eight times as complex as in non-MBAFF mode. For cost and area efficiency, we combine the MBAFF and non-MBAFF LUTs. Fortunately, the selection of MVA, MVB, MVC, and MVD is the same as in non-MBAFF mode when, in MBAFF mode, the current MB pair is field coded, the current MB is the bottom MB, and the relative MB pair is field coded. We can therefore use the same LUT for both non-MBAFF and MBAFF modes.

[Block diagram: 4x4 MV buffers, left MV line buffer, line MV FIFO (entries 0 to 11), and the motion vector predictor, which selects mvA, mvB, mvC, and mvD from the upper, left, current, upper-left, and upper-right MB motion vectors; the MVD is loaded from the MV buffer and the resulting MV is written back to the MV buffer]

Fig 3.4 Motion vector generator architecture for QCIF-format support MBAFF

Table 3.1 Median prediction table in MBAFF frames

[Table with columns: block size_position index (16x16, 16x8_0 to 16x8_1, 8x16_0 to 8x16_1, 8x8_0 to 8x8_3, 8x4_0 to 8x4_7, 4x8_0 and following), current MB pair coding (Frame/Field), current MB (Top/Bot.), relative MB pair coding (Frame/Field), mvA, mvB, mvC, mvD. Each row gives the buffer entries (such as TL0, BU0, BRU, BLU, MV1, or X for not available) selected as mvA, mvB, mvC, and mvD for one combination.]
F ie l d T o p F r a m e T o p B o t . F ie l d T o p B o t . F r a m e T o p B o t . F ie l d T o p B o t . F r a m e T o p B o t . 4 x 8 _ 1 4 x 4 _ 0 4 x 4 _ 1 4 x 8 _ 2 4 x 8 _ 3 4 x 8 _ 4 4 x 8 _ 5 4 x 8 _ 6 4 x 8 _ 7 c u r r e n t M B T /B r e la t iv e M B m v A m v B m v C m v D F r a m e T L 1 M V 0 M V 1 T L 0 F ie ld T L 0 B L 0 F r a m e B L 1 M V 0 M V 1 B L 0 F ie ld T L 2 B L 2 F r a m e T L 2 M V 0 M V 1 T L 1 F ie ld T L 1 T L 0 F r a m T L 2 M V 0 M V 1 T L 1 F ie ld B L 1 B L 0 F r a m e M V 2 M V 1 X M V 0 F ie ld F r a m e M V 2 M V 1 X M V 0 F ie ld F r a m e M V 2 M V 1 X M V 0 F ie ld F r a m M V 2 M V 1 X M V 0 F ie ld F r a m e M V 1 B U 2 B U 3 B U 1 F ie ld F r a m e M V 1 T 1 4 T 1 5 T 1 1 F ie ld F r a m e M V 1 B U 2 B U 3 B U 1 F ie ld T U 2 T U 3 T U 1 F r a m M V 1 B U 2 B U 3 B U 1 F ie ld F r a m e M V 4 B U 3 B R U B U 2 F ie ld F r a m e M V 4 T 1 5 X T 1 4 F ie ld F r a m e M V 4 B U 3 B R U B U 2 F ie ld T U 3 T R U T U 2 F r a m M V 4 B U 3 B R U B U 2 F ie ld F r a m e M V 3 M V 4 M V 5 M V 1 F ie ld F r a m e M V 3 M V 4 M V 5 M V 1 F ie ld F r a m e M V 3 M V 4 M V 5 M V 1 F ie ld F r a m M V 3 M V 4 M V 5 M V 1 F ie ld F r a m e M V 6 M V 5 X M V 4 F ie ld F r a m e M V 6 M V 5 X M V 4 F ie ld F r a m e M V 6 M V 5 X M V 4 F ie ld F r a m M V 6 M V 5 X M V 4 F ie ld F r a m e T L 2 M V 2 M V 3 T L 1 F ie ld T L 1 B L 0 F r a m e B L 2 M V 2 M V 3 B L 1 F ie ld T L 3 B L 2 F r a m e B L 0 M V 2 M V 3 T L 3 F ie ld T L 2 T L 1 F r a m B L 0 M V 2 M V 3 T L 3 F ie ld B L 2 B L 1 F r a m e M V 8 M V 3 M V 6 M V 2 F ie ld F r a m e M V 8 M V 3 M V 6 M V 2 F ie ld F r a m e M V 8 M V 3 M V 6 M V 2 F ie ld F r a m M V 8 M V 3 M V 6 M V 2 F ie ld T o p F r a m e T L 3 M V 8 M V 9 T L 2 F ie ld T L 1 B L 1 B o t . F r a m e B L 3 M V 8 M V 9 B L 2 F ie ld T L 3 B L 3 T o p F r a m e B L 2 M V 8 M V 9 B L 1 F ie ld T L 3 T L 2 B o t . F r a m B L 2 M V 8 M V 9 B L 1 F r a m e T o p B o t . F ie l d T o p B o t . 
F ie l d T o p B o t . F r a m e T o p B o t . F ie l d T o p B o t . F r a m e T o p B o t . F ie l d F ie l d T o p B o t . F r a m e T o p B o t . F ie l d T o p B o t . B o t . F r a m e F r a m e T o p B o t . F ie l d T o p B o t . F r a m e T o p B o t . F ie l d T o p B o t . F r a m e T o p B o t . 4 x 4 _ 6 4 x 4 _ 7 4 x 4 _ 8 4 x 4 _ 9 4 x 4 _ 1 0 4 x 4 _ 2 4 x 4 _ 3 4 x 4 _ 4 4 x 4 _ 5 F r a m e T o p B o t . F ie l d T o p

(34)

c u r r e n t M B T /B r e la t iv e M B m v A m v B m v C m v D T o p F r a m e M V 1 0 M V 9 X M V 8 F ie ld B o t . F r a m e M V 1 0 M V 9 X M V 8 F ie ld T o p F r a m e M V 1 0 M V 9 X M V 8 F ie ld B o t . F r a m M V 1 0 M V 9 X M V 8 F ie ld F r a m e M V 9 M V 6 M V 7 M V 3 F ie ld F r a m e M V 9 M V 6 M V 7 M V 3 F ie ld F r a m e M V 9 M V 6 M V 7 M V 3 F ie ld F r a m M V 9 M V 6 M V 7 M V 3 F ie ld F r a m e M V 1 2 M V 7 X M V 6 F ie ld F r a m e M V 1 2 M V 7 X M V 6 F ie ld F r a m e M V 1 2 M V 7 X M V 6 F ie ld F r a m M V 1 2 M V 7 X M V 6 F ie ld F r a m e M V 1 1 M V 1 2 M V 1 3 M V 9 F ie ld F r a m e M V 1 1 M V 1 2 M V 1 3 M V 9 F ie ld F r a m e M V 1 1 M V 1 2 M V 1 3 M V 9 F ie ld F r a m M V 1 1 M V 1 2 M V 1 3 M V 9 F ie ld F r a m e M V 1 4 M V 1 3 X M V 1 2 F ie ld F r a m e M V 1 4 M V 1 3 X M V 1 2 F ie ld F r a m e M V 1 4 M V 1 3 X M V 1 2 F ie ld F r a m M V 1 4 M V 1 3 X M V 1 2 F ie ld T o p B o t . T o p F ie ld F r a m e F r a m e F ie ld F r a m e F ie ld F ie ld F r a m e F ie ld F r a m e B o t . T o p B o t . T o p B o t . T o p B o t . T o p B o t . T o p B o t . T o p B o t . 4 x 4 _ 1 1 4 x 4 _ 1 2 4 x 4 _ 1 3 4 x 4 _ 1 4 4 x 4 _ 1 5

For B slices, we can use hardware sharing to run motion vector prediction (mvp) twice, because the motion vector prediction of LIST_1 can be hidden under the data-read cycles of LIST_0 from frame memory. However, B slices require more than simply running the process twice; many extra conditions must also be considered. For example, Fig 2.8 shows one partition predicted in the L0 direction and a neighboring partition predicted in the L1 direction. When the prediction directions differ, the neighboring MV cannot be used to predict the current MV. For clarity, we do not discuss these conditions here.

In addition to the prediction direction, B slices introduce a direct mode. There are two direct modes in B slices: one is spatial direct mode (SDM) and the other is temporal direct


macroblock is and where the co-located partition is. Because the current picture and the co-located picture can each be coded as field, frame, or MBAFF, there are about eight formulas for determining the co-located macroblock and the co-located partition, and these formulas involve multiplication, division, and remainder operations, which are costly in hardware. However, if we use the macroblock coordinates (x and y), which the system already transfers to the motion compensation unit, to find the co-located macroblock/partition, we can reduce the complexity significantly. Table 3.2 shows the mapping table of the co-located macroblock after this coordinate-based reduction, where Y denotes the y-axis coordinate. Table 3.3 shows the mapping table of the co-located partition. Fig 3.5 shows the entire motion vector generator architecture.

Table 3.2 Co-located macroblock table

Curr | Col | Original equation | New equation
FLD | FRM | 2 * PicWidthInMbs * ( CurrMbAddr / PicWidthInMbs ) + ( CurrMbAddr % PicWidthInMbs ) + PicWidthInMbs * ( yCol / 8 ) | (Y << 1) + blk_Num[3]
FLD | AFRM-FRM | 2 * CurrMbAddr + ( yCol / 8 ) | (Y << 1) + blk_Num[3]
FLD | AFRM-FLD | 2 * CurrMbAddr + bottom_field_flag | (Y << 1) + bottom_field_flag
FRM | FLD | PicWidthInMbs * ( CurrMbAddr / ( 2 * PicWidthInMbs ) ) + ( CurrMbAddr % PicWidthInMbs ) | Y >> 1
AFRM | FLD | CurrMbAddr / 2 | Y >> 1
AFRM-FRM | AFRM-FLD | 2 * ( CurrMbAddr / 2 ) + ( ( topAbsDiffPOC < bottomAbsDiffPOC ) ? 0 : 1 ) | Y[0] = 0
AFRM-FLD | AFRM-FRM | 2 * ( CurrMbAddr / 2 ) + ( yCol / 8 ) | Y[0] = 0 + blk_Num[3]
Other | | CurrMbAddr | Don't change
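The equivalence between the original and the coordinate-based equations can be checked numerically for the two non-MBAFF rows of Table 3.2 (FLD/FRM and FRM/FLD). The sketch below is only an illustration under stated assumptions: macroblock addresses are taken to be raster addresses (address = Y * PicWidthInMbs + x), blk_Num[3] is read as bit 3 of the 4 x 4 block index, and the function names are hypothetical. The MBAFF rows use macroblock-pair addressing and are not covered.

```python
def col_mb_addr_fld_frm(curr_mb_addr, pic_width_in_mbs, y_col):
    """Original equation, row FLD/FRM: needs multiply, divide, and remainder."""
    w = pic_width_in_mbs
    return 2 * w * (curr_mb_addr // w) + (curr_mb_addr % w) + w * (y_col // 8)


def col_mb_addr_frm_fld(curr_mb_addr, pic_width_in_mbs):
    """Original equation, row FRM/FLD."""
    w = pic_width_in_mbs
    return w * (curr_mb_addr // (2 * w)) + (curr_mb_addr % w)


def col_mb_y_fld_frm(y, blk_num):
    """New equation, row FLD/FRM: (Y << 1) + blk_Num[3]."""
    return (y << 1) + ((blk_num >> 3) & 1)


def col_mb_y_frm_fld(y):
    """New equation, row FRM/FLD: Y >> 1."""
    return y >> 1


def to_addr(x, y, pic_width_in_mbs):
    """Assumed raster addressing for non-MBAFF pictures."""
    return y * pic_width_in_mbs + x
```

With these assumptions, the new equations give the row coordinate of the co-located macroblock using only a shift and an add, while the x coordinate is passed through unchanged.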


Table 3.3 Co-located partition table

Fig 3.5 Motion vector generator architecture (blocks: L0 Line MV FIFO, L1 Line MV FIFO, Col/L0 intra flag, L1 Line intra flag, Addr Gen., 4x4 MV buffers, direct predictor, motion vector predictor, MUX; signals: co-located and neighboring MVs, mvA, mvB, mvC, mvD, mvp)

3.3 Interpolator Design

3.3.1 Luma Interpolator Design

Fig 3.6 Separate 1-D interpolator design (no parallel)

In this subsection, several interpolator designs are presented. Recalling the fractional-pixel interpolation for H.264/AVC in Fig 2.5, a 6-tap FIR filter with coefficients (1, -5, 20, 20, -5, 1) and a bilinear filter are needed for half- and quarter-pixel interpolation. For cost and area efficiency, Li's and Shen's interpolator filter unit and two-stage recursive algorithm are proposed in [10] and [11]. These designs are area-efficient and suitable for P slices. For B slices, however, throughput is a critical issue, and the long execution cycles of these designs mean that the real-time requirement of video decoding cannot be met.
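As a concrete reference for the two filters named above, the following sketch computes a single-stage half-pixel sample with the 6-tap coefficients (1, -5, 20, 20, -5, 1) and a quarter-pixel sample with the bilinear average. It is only an illustration: the function names are hypothetical, 8-bit samples are assumed, the rounding (+16, >> 5) follows the luma output expression given with the FIR figures in this chapter, and the center position j, which filters unrounded intermediate values, is not covered. The multiplications are written as shifts and adds, in the spirit of the simplification mentioned for the FIR design.

```python
def clip_pixel(value):
    """Clip to the 8-bit sample range (8-bit video assumed)."""
    return max(0, min(255, value))


def fir6_shift_add(a, b, c, d, e, f):
    """A - 5B + 20C + 20D - 5E + F using only adds and shifts."""
    inner = c + d
    outer = b + e
    twenty_inner = (inner << 4) + (inner << 2)  # 16x + 4x = 20x
    five_outer = (outer << 2) + outer           # 4x + x = 5x
    return a + f - five_outer + twenty_inner


def half_pel(a, b, c, d, e, f):
    """Half-pixel sample from six neighboring integer samples."""
    return clip_pixel((fir6_shift_add(a, b, c, d, e, f) + 16) >> 5)


def quarter_pel(p, q):
    """Quarter-pixel sample: rounded average of two neighboring samples."""
    return (p + q + 1) >> 1
```

For a flat region, for example six samples all equal to 10, the filter sum is 320 and the half-pixel result is again 10, which is a quick sanity check that the coefficients sum to 32.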

In contrast, with throughput and standard compatibility in mind, Chien's design [4] is a separate 1-D architecture that splits horizontal and vertical interpolation and processes them in parallel on a 4 x 4 block basis. This design has better throughput, although it may need more storage. Fig 3.6 shows the separate 1-D interpolator design without parallel processing.

Table 3.3 Comparison of execution cycles for different architectures

Architecture | Ideal execution cycles
Shen's and Li's design | 13
Separate 1-D (no parallel) | 36
Separate 1-D (2 parallel) | 18
Separate 1-D (4 parallel) | 9

Assuming that all 9 x 9 reference samples for each 4 x 4 block are ready and can be accessed randomly, Table 3.3 lists the execution cycles of the different architectures. For Shen's and Li's designs, the number of cycles depends on the fractional pixel position. Positions a, b, c, d, h, and n need 4 clock cycles to finish one 4 x 4 block. Positions e, g, p, and r take 8 cycles to finish one 4 x 4 block. Positions f, j, q, i, and k take 13 cycles; the detailed operation is described in Li's [10] and Shen's [11]. For the separate 1-D design, the first output appears at the 6th clock cycle and the following 3 outputs are generated over the next 3 clock cycles. Therefore, the separate 1-D design without parallelism needs 36 ((6 + 3) x 4) cycles to complete the interpolation of one 4 x 4 block. Similarly, the separate 1-D designs with 2-way and 4-way parallelism require 18 ((6 + 3) x 2) and 9 (6 + 3) cycles, respectively. We therefore select the 4-parallel separate 1-D architecture, because its smaller execution cycle count can be hidden under the data-read cycles from frame memory.
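The cycle counts quoted for the separate 1-D design follow from one small formula, sketched below. The constant and function names are hypothetical, and reading the factor of 4 as the four lines of a 4 x 4 block is an interpretation of the (6 + 3) x 4 expression rather than something the text states explicitly.

```python
FIRST_OUTPUT_CYCLE = 6       # the first result appears at the 6th clock cycle
REMAINING_OUTPUT_CYCLES = 3  # the next 3 results follow in 3 more cycles
LINES_PER_BLOCK = 4          # assumed: four lines per 4 x 4 block


def separate_1d_cycles(parallel_lines):
    """Ideal cycles for one 4 x 4 block with the given degree of parallelism."""
    if parallel_lines <= 0 or LINES_PER_BLOCK % parallel_lines != 0:
        raise ValueError("parallel_lines must divide the number of lines")
    passes = LINES_PER_BLOCK // parallel_lines
    return (FIRST_OUTPUT_CYCLE + REMAINING_OUTPUT_CYCLES) * passes
```

Calling the function with 1, 2, and 4 parallel lines gives the three separate 1-D entries of the execution-cycle table.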

Fig 3.7 Only one half pixel is needed

Fig 3.8 shows the original 4-parallel separate 1-D luma interpolator. To reduce cost, the multipliers in the FIR filter can be simplified to adders and shifters; the FIR design is discussed later. The original 4-parallel separate 1-D interpolator produces both the b and s half pixels at the same time, regardless of which fractional position is being generated. However, only one of the b and s half pixels is needed to produce an interpolated pixel, and checking the MV tells us which one. Therefore, we can modify the 4-parallel separate 1-D interpolator to remove one path of storage and one FIR filter. Similar designs can be seen in [12], [13], and [14], but those designs require four multiplexers, whereas ours requires only one. Fig 3.9 shows the enhanced

[Figure residue: 4-parallel separate 1-D luma interpolator, 13 FIR filters and 4 bilinear filters]

[Figure residue: 4-parallel separate 1-D luma interpolator, 12 FIR filters and 4 bilinear filters]

3.3.2 Chroma Interpolator Design

Fig 3.10 Interpolation window for each 2 x 2 chroma block (integer samples A to I; fractional samples e, f, g, and h)

$$
\begin{aligned}
i &= (8 - xFrac)(8 - yFrac)\,A + xFrac\,(8 - yFrac)\,B + (8 - xFrac)\,yFrac\,C + xFrac\,yFrac\,D \\
  &= (8 - xFrac)\,[(8 - yFrac)\,A + yFrac\,C] + xFrac\,[(8 - yFrac)\,B + yFrac\,D]
\end{aligned}
\tag{Eq. 3.1}
$$

Because of the 4:2:0 chroma format and the quarter-sample precision of luma inter prediction, chroma inter prediction displacements have one-eighth sample accuracy. Chroma inter prediction operates on 2 x 2 blocks when luma inter prediction operates on 4 x 4 blocks. Chroma interpolation requires 3 x 3 reference samples for each 2 x 2 block, as shown in Fig 3.10. For the chroma 2 x 2 block including A, B, C, and D, the corresponding fractional samples are e, f, g, and h, whose precision is one-eighth. Compared with the direct-mapping design with 8 multipliers, whose equation is listed in Fig 2.5 (c), we rewrite the equation as in Eq. 3.1, and the number of multipliers can be reduced to 4.

$$
\begin{aligned}
i &= (8 - xFrac)\,[(8 - yFrac)\,A + yFrac\,C] + xFrac\,[(8 - yFrac)\,B + yFrac\,D] \\
  &= (8 - Frac)\,M + Frac\,N, \qquad M, N = (8 - Frac)\,O + Frac\,P
\end{aligned}
\tag{Eq. 3.2}
$$

We can also rewrite the equation as in Eq. 3.2, where Frac, O, and P stand for the corresponding values in each stage. As with the luma interpolator, the chroma interpolator can therefore be separated into a horizontal and a vertical filter. The corresponding separate 1-D design is illustrated in Fig 3.11 (a), and the vertical/horizontal filter is illustrated in Fig 3.11 (b). A 2-parallel separate 1-D chroma interpolator is required to generate interpolated values two pixels at a time, and it takes 3 cycles to filter 2 x 2 pixels if all required reference samples are ready and can be accessed randomly. With the 2-parallel separate 1-D chroma interpolator design illustrated in Fig 3.12, only one cycle of latency is required.
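The separation of the chroma filter into a vertical and a horizontal 1-D stage can be checked with a few lines of code. The sketch below compares the direct four-term weighted sum with the nested form built from one reusable 1-D filter. The function names are hypothetical, and the final rounding step (+32, >> 6) is taken from the H.264/AVC chroma sample interpolation formula rather than from the equations above, which show only the weighted sum.

```python
def chroma_direct(a, b, c, d, x_frac, y_frac):
    """Direct mapping: four-term weighted sum of the integer samples."""
    return ((8 - x_frac) * (8 - y_frac) * a
            + x_frac * (8 - y_frac) * b
            + (8 - x_frac) * y_frac * c
            + x_frac * y_frac * d)


def filter_1d(frac, first, second):
    """One 1-D stage: (8 - Frac) * first + Frac * second."""
    return (8 - frac) * first + frac * second


def chroma_separable(a, b, c, d, x_frac, y_frac):
    """Vertical stage on each column, then one horizontal stage."""
    left = filter_1d(y_frac, a, c)
    right = filter_1d(y_frac, b, d)
    return filter_1d(x_frac, left, right)


def chroma_sample(a, b, c, d, x_frac, y_frac):
    """Rounded chroma sample (rounding as in the H.264/AVC standard)."""
    return (chroma_separable(a, b, c, d, x_frac, y_frac) + 32) >> 6
```

Because both forms expand to the same polynomial, they agree for every fractional offset from 0 to 7, so the same 1-D filter structure can serve both the vertical and the horizontal pass.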

Fig 3.11 (a) Chroma interpolator, (b) vertical/horizontal filter

Fig 3.12 2-parallel chroma interpolator

3.3.3 Combine Luma and Chroma FIR Design

Fig 3.13 (a) Luma FIR design in Chen's [3], with Luma Output = A - 5B + 20C + 20D - 5E + F, (b) bilinear filter

Note in particular that luma and chroma interpolation in H.264/AVC are different; the computation resources cannot be shared, whether at the algorithm level or the hardware level. Combining the luma and chroma parts is therefore where there is room for improvement. Given the luma and chroma interpolator filters described above, the adders and shifters can be shared once the chroma horizontal/vertical filter in Fig 3.11 (b) is restructured into adders and shifters. In addition, we can further shorten the critical path by merging the rounding stage. The combined interpolator design is shown in Fig 3.14; its cost penalty is two multiplexers and six bitwise ANDs compared with the FIR design proposed in Chen's [3] and shown in Fig 3.13. Fig 3.15 illustrates the decoding paths of the luma FIR filter and the chroma horizontal/vertical filter. Because chroma interpolation in H.264/AVC works on a 2 x 2 block basis, only eight luma FIR filters need to be replaced with combined luma/chroma interpolators. Fig 3.16 shows the entire interpolator architecture for H.264/AVC.

[Fig 3.14 residue: combined luma/chroma interpolator (shifters << 1, << 2, << 3, bitwise AND, rounding coefficient; luma output and chroma output)]

[Fig 3.15 residue: decoding paths of the combined filter. Chroma Output = Frac*X + (8 - Frac)*Y, with XFrac[0..2] and YFrac[0..2] as controls; Luma Output = (A - 5B + 20C + 20D - 5E + F + 16) >> 5]

[Fig 3.16 residue: restructured interpolator design built from FIR and R_FIR units, with R_FIR units reused for Cb and Cr]

3.3.4 Cost Analysis

Table 3.2 Comparison of requisite modules

Item | Wang's [15] ISCAS'05 | Chen's [16] ICASSP'06 | Li's [10] ISCAS'07 | Tsai's [14] MWSCAS'05 | Shen's [11] ICME'09 | Proposed
FIR | 13 | 12 | 4 | 12 | 4 | 12
Bilinear | 2 | 12 | 4 | 4 | 4 | 0
Technology (um) | 0.18 | 0.18 | 0.18 | 0.18 | 0.18 | 0.09
Gate count | 20,686 | 15,000 | 13,027 | 21,506 | 11,823 | 13,201
Working frequency (MHz) | 100 | 150 | 100 | 125 | 100 | 100
Latency (cycles/MB), luma + chroma | 560 | 320 | 304 | 144 + NA | 288 + NA | 144 + 48

Because the multipliers of the 6-tap filter are simplified to adders and shifters in all references, the designs in [10] and [11] share the 6-tap FIR hardware and compute twice to reduce the area cost of the interpolator. However, throughput is critical, and the long execution cycles of these interpolators do not provide enough throughput for B slices. Our restructured interpolator combines the luma and chroma filters and, by examining the MV, removes one filter and one path of storage from the traditional design. Table 3.2 compares our restructured interpolator with other designs. It shows that our interpolator comes close to the gate counts of [10] and [11] while providing sufficient throughput, although it incurs some control overhead to support multi-mode operation.


3.4 Weighted Prediction

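$$
p = \big( ( p_{L0} \cdot W_0 + p_{L1} \cdot W_1 + 2^{\mathrm{logWD}} ) \gg ( \mathrm{logWD} + 1 ) \big) + \big( ( o_0 + o_1 + 1 ) \gg 1 \big)
\tag{Eq. 3.3}
$$

$$
p = \Big\{ \big[ \big( ( p_{L0} \cdot W_0 + 2^{\mathrm{logWD} - 1} ) \gg \mathrm{logWD} \big) + o_0 \big] + \big[ \big( ( p_{L1} \cdot W_1 + 2^{\mathrm{logWD} - 1} ) \gg \mathrm{logWD} \big) + o_1 \big] + 1 \Big\} \gg 1
\tag{Eq. 3.4}
$$

$$
p = \big( ( p_{L0} \cdot W_0 + 2^{\mathrm{logWD} - 1} ) \gg \mathrm{logWD} \big) + o_0 = \big\{ p_{L0} \cdot W_0 + [\, 2^{\mathrm{logWD} - 1} + ( o_0 \ll \mathrm{logWD} ) \,] \big\} \gg \mathrm{logWD}
\tag{Eq. 3.5}
$$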

Weighted prediction is the final stage of motion compensation, after the interpolator. It scales motion-compensated samples to improve video quality in H.264/AVC decoding. In this subsection, a weighted predictor architecture is proposed that works together with the interpolator and eliminates the latency overhead. The weighted prediction architecture proposed by Chen [16] has low complexity, but it has a long critical path and a large memory requirement (1.5 kb). Azevedo's weighted predictor [12] is implemented by direct mapping and requires an embedded memory to store the rounding coefficients. Compared with the direct-mapping design, whose equation is listed in Eq. 3.3, we can use the same predictor twice to generate the predicted value, first for the LIST_0 prediction and then for the LIST_1 prediction, as shown in Eq. 3.4. The rounding and offset terms can be moved forward and combined in the same stage, so the predictor can be further modified to shorten the critical path, as shown in Eq. 3.5. In Eq. 3.5, W0 is the weight factor, whose value depends on the weight flag from the bit-stream. When the weight flag is equal to 0, the weight factor shall be in the range of 2^0 to 2^7, inclusive [1]. From the above discussion, if we examine the two highest bits of the weight factor, we can use an eight-bit multiplier and a shifter instead of a nine-bit multiplier. The predictor is shown in Fig 3.17.

Fig 3.17 Weighted predictor design (labels: predPart, Weight factor[7:0], Weight factor[8:7], << 7, MUX, offset, Round, >> LogWD)
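The step of moving the offset forward and merging it with the rounding term can be verified numerically. The sketch below compares the two-stage form (round, shift, then add the offset) with the merged form (add a combined rounding coefficient, then shift once). The function names are hypothetical, logWD is assumed to be at least 1, the final clipping to the sample range is omitted, and right shifts are arithmetic (floor), which is how Python treats negative integers.

```python
def weighted_two_stage(pred, weight, offset, log_wd):
    """Round and shift the weighted sample, then add the offset."""
    return ((pred * weight + (1 << (log_wd - 1))) >> log_wd) + offset


def weighted_merged(pred, weight, offset, log_wd):
    """Fold the offset into the rounding term so one add and one shift remain."""
    rounding_coefficient = (1 << (log_wd - 1)) + (offset << log_wd)
    return (pred * weight + rounding_coefficient) >> log_wd
```

The two forms agree because adding offset * 2^logWD before a floor shift by logWD is the same as adding offset after it. This is why the rounding coefficient can be prepared ahead of the multiply and taken off the critical path.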

Moreover, when B slices are involved, we use hardware sharing to run the predictor twice. In addition, a 4 x 4 storage array is required to hold intermediate results. Fig 3.18 illustrates the complete weighted predictor design. Like temporal direct mode in the motion vector generator, the weighted predictor has an implicit mode, in which the weighting factors are calculated from the relative temporal positions of the LIST_0 and LIST_1 reference pictures. The weighting factor in implicit mode is derived from the temporal direct mode data path in order to reduce hardware cost. Furthermore, the divider accounts for most of the area cost and computation time in the

dividend is a constant value. Table 3.2 lists the comparison of implementation results. [12] is not included in the comparison because the related details are not available.

Fig 3.18 Entire weight predictor architecture (blocks: Predictor 1, Predictor 2, MUX, Clip, Average, 4x4 Buffer; labels: Luma/Cb, Luma/Cr, P slice, B-L0, B-L1)

Item | ICASSP'06 [16] | Proposed
Multiplier (bits) | 9 | 8
Technology | 0.18 um | 0.09 um
Gate count | 12,960 | 6,412
Working frequency | 87 MHz | 100 MHz


3.5 Summary

In this chapter, a motion compensation engine for an H.264/AVC Main/High Profile decoder is presented. To share hardware across profiles, our MVG uses the same modules and storage to handle P slices and B slices, with and without MBAFF. Our restructured interpolator is area-efficient compared with the traditional design and is suitable for high-throughput applications such as decoding video coded with B slices. In addition, the weighted predictor achieves area efficiency through hardware sharing with temporal direct mode and through a shortened critical path. When the weighted predictor works together with

Chapter 4 Memory Bandwidth Reduction

Fig 4.1 4 x 4 block window and the corresponding 9 x 9 interpolation window (4x4 output pixels interpolated from 9x9 reference pixels)

For luma interpolation, the half-position samples are interpolated with the 6-tap FIR filter, and the quarter-position samples are produced with the bilinear filter. This means the interpolator needs six reference pixels to produce one interpolated pixel. As Fig 4.1 shows, interpolating the fractional sample values of one 4 x 4 block requires a 9 x 9 interpolation window. Chroma interpolation follows a similar concept: interpolating the fractional sample values of one 2 x 2 block requires a 3 x 3 interpolation window. When the frame size is large and the frame rate is high, interpolation places a heavy load on memory bandwidth. Moreover, when motion compensation supports Main/High Profile, it must handle B slices, in which the reference frames increase from one direction to two. From the above, the interpolator requires a memory bandwidth of 398 MB/s in P slices and 796 MB/s in B slices to support 1080 HD @ 30 fps. The heavy load on memory bandwidth also means high power consumption for bus activity and data operations.

The rest of this chapter is organized as follows. Section 4.1 discusses our memory bandwidth reduction strategies. Section 4.2 presents an analysis of the bandwidth reduction limit. Finally, a summary is given in Section 4.3.


4.1 Reduction strategies of memory bandwidth

Memory bandwidth always dominates the performance of the entire video decoder. Several methods have been proposed to reduce the required memory bandwidth, and they fall mainly into two categories: frame recompression and reduction of redundant pixel transfers. Fig 4.2 illustrates the concept of frame recompression. Frame data are compressed before being written to frame memory, and reference frame data are decompressed before being read into the video decoder. However, a frame recompression method must address several issues: the random access capability demanded by motion compensation, low complexity for the sake of area and power, and a minimal number of additional execution cycles for compression and decompression so that the real-time throughput requirement of the video decoder is met. We do not go into detail here, because our system has two dedicated modules: an embedded compressor between motion compensation and frame memory, and an embedded decompressor between frame memory and the de-blocking module.

[Fig 4.2 residue: frame recompression concept; Video Decoder and Frame Memory connected over the global bus through recompress and decompress stages]

The second approach, reducing the transfer of redundant pixels, can itself be divided into two solutions: reducing the number of data fetches and reusing data (pixels). The following subsections discuss the memory bandwidth reduction strategies in detail. Subsection 4.1.1 illustrates the first strategy for reducing data fetches. Subsection 4.1.2 gives the second strategy for reducing data fetches. Subsection 4.1.3 illustrates the first strategy for data reuse. Finally, subsection 4.1.4 presents the second strategy for data reuse.

4.1.1 Exact Fetch Necessary Pixels

Fig 4.3 Fractional sample positions for quarter sample luma interpolation

Fig 4.3 illustrates the luma samples 'a' to 's' at fractional sample positions. The traditional method always fetches a 9 x 9 interpolation window when interpolating a fractional pixel. However, not every fractional sample position requires all of these pixels. For example, the sample at the half-sample position labeled b is derived from the nearest integer-position samples in the horizontal direction. Similarly, the sample at the half-sample position labeled h is derived from the nearest integer-position samples in the vertical direction. Fig 4.4 illustrates that interpolating the samples at positions a, b, and c needs only a 9 x 4 interpolation window. Fig 4.5 illustrates that interpolating the samples at positions d, h, and n needs only a 4 x 9 interpolation window. Based on the motion vector value, we can fetch exactly the necessary pixels instead of the full 9 x 9 interpolation window. As with luma interpolation, the chroma interpolation window can also be determined from the motion vector. Table 4.1 summarizes the luma interpolation windows, and Table 4.2 summarizes the chroma interpolation windows. This strategy is also used in other designs [14], [10], and [11]. The bandwidth reduction results are shown later.
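The window selection described above can be sketched as a small function of the motion vector. This is a minimal illustration, not the thesis implementation: the function name is hypothetical, motion vectors are assumed to be in quarter-sample units so that the two low bits give the fractional part, an integer-position vector is assumed to need only the 4 x 4 block itself, and every position with both fractional parts non-zero is assumed to use the full 9 x 9 window, since the contents of Table 4.1 are not reproduced here.

```python
def luma_window(mv_x, mv_y, block=4, taps=6):
    """Return (width, height) of the reference window for one 4 x 4 block."""
    x_frac = mv_x & 3  # quarter-sample fraction, also valid for negative MVs
    y_frac = mv_y & 3
    extra = taps - 1   # a 6-tap filter needs 5 extra samples along its axis
    width = block + extra if x_frac else block
    height = block + extra if y_frac else block
    return width, height


def window_pixels(mv_x, mv_y):
    """Number of reference pixels fetched for one 4 x 4 block."""
    width, height = luma_window(mv_x, mv_y)
    return width * height
```

Positions a, b, and c (horizontal fraction only) give a 9 x 4 window of 36 pixels, and positions d, h, and n (vertical fraction only) give a 4 x 9 window of 36 pixels, compared with 81 pixels for the fixed 9 x 9 fetch.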

Fig 4.4 Fractional samples that need only horizontal samples (4x4 output pixels interpolated from 9x4 reference pixels)

[Figure residue: 4x4 output pixels interpolated from 4x9 reference pixels]
