Optimal data mapping for motion compensation in H.264 video decoding

Guo-Shiuan Yu and Tian-Sheuan Chang
Dept. of Electronics Engineering, National Chiao-Tung University,
1001 Ta-Hsueh Rd., Hsinchu, Taiwan
e-mail: {isis,tschang}@twins.ee.nctu.edu.tw

Abstract - Long initial access cycles of SDRAM are the major performance burden of motion compensation in a video decoder. To minimize their effect while improving the overall available memory bandwidth, this paper presents an optimal data mapping scheme for motion compensation in H.264 video coding. The scheme allocates the video data to suitable addresses and banks according to the access characteristics of SDRAM and the address transitions in motion compensation. The resulting allocation reduces the required bandwidth of motion compensation by 36% when compared to the previous design for 525SD video sequences.

I. INTRODUCTION

Memory access dominates the performance of a video decoder, especially in motion compensation. In a typical video decoder, the motion compensation unit reads the required reference data from an external SDRAM system. However, a typical SDRAM access consists of a long initial cycle to open a memory row, followed by a burst access to continuous addresses. If the memory accesses have discontinuous addresses, they suffer frequent initial cycles, which degrades performance and increases the memory bandwidth requirement. How to allocate the data in the physical SDRAM is therefore an important task for a video decoder.

Targeting video codec applications, many papers have proposed ways to improve SDRAM bandwidth utilization and achieve efficient memory access. Li [1] develops a bus arbitration algorithm optimized for the different processing units to meet real-time performance. Ling's controller [2] schedules DRAM accesses in a predetermined order to lower the peak bus bandwidth. Kim's memory interface [3] adopts an array-translation technique to reduce power consumption and increase memory bandwidth. Park's history-based memory controller [4] reduces page breaks to lower energy consumption and memory latency.

For H.264 applications, Kang's AHB-based scalable bus architecture and dual memory controller [5] supports 1080 HD at under 130 MHz. Zhu's SDRAM controller [6] applies the main idea of Kim's memory interface to HDTV applications. It focuses on data arrangement and memory mapping to reduce page-activation overhead, so that it not only improves throughput but also lowers power consumption. However, it does not take memory operation scheduling into consideration. With careful scheduling, the extra bandwidth due to page-activation operations can be reduced.

In this paper, we combine data mapping and operation scheduling in our design to minimize the SDRAM initial cycles and thus decrease the bandwidth requirement for real-time decoding. To find the optimal mapping, we first use a simple analytical model to find the best data mapping in theory and in practice. Then we use real video sequences to validate the mapping. Furthermore, we consider not only the accesses within a single motion compensation operation for a block (called intra request), which have a high probability of continuous addresses, but also the accesses between blocks (called inter request), which we handle by operation scheduling. The resulting design saves 37% of memory bandwidth compared with the previous approaches.

The rest of the paper is organized as follows. First, we briefly review motion compensation in H.264 video decoding and SDRAM memory access in Section II. Then we present our analytical model for intra request and its simulated results in Section III. Furthermore, we present our operation scheduling for inter request in Section IV. The final simulation results are shown in Section V. Finally, we conclude this paper in Section VI.

II. OVERVIEW

A. Basics of SDRAM access

Fig. 1. (a) Simplified bank state diagram and (b) access latencies due to different access statuses [7]. [Figure: panel (a) shows the bank states IDLE, ACTIVE, column access, and precharge; panel (b) shows, over cycles 1 to 10, the command sequences for a row miss (precharge, active, column access, burst) and a bank miss (active after tRCD, column access, burst).]


The cycle count of a complete SDRAM access depends heavily on the state of the bank addressed by the access. Fig. 1(a) and Fig. 1(b) show a simplified bank state diagram and the access latencies due to different access statuses: bank miss, row miss, and row hit. From a data access viewpoint, the low cycle count of the row-hit condition is preferred to those of bank miss and row miss. Thus, minimizing such misses is critical to SDRAM performance. A more complete discussion of the various access latencies of an SDRAM access can be found in Lee's paper [7].

B. Memory access in motion compensation

For video applications, a memory request usually fetches a rectangular image region of a predetermined size from frame memory, as in motion compensation, intra prediction, and the deblocking filter process. These data are continuous in the spatial domain, and the areas that may be requested by two consecutive blocks have a high probability of overlapping. For instance, in motion compensation the required data is bounded by the block size and the search range set during encoding. If the search range is L and the block length is N, a 2L by 2L+N rectangle is overlapped. Data in this region has a high probability of lying in a bank already opened by the previous block access. Thus we can find a method to avoid row misses and improve bandwidth utilization. Fig. 2 illustrates an example of this characteristic.

Fig. 2. Possible required area between adjacent blocks. [Figure: Block 0 and Block 1, each of block length N with search range L. The area that may be requested for each block is (2L+N)*(2L+N); the overlapped area is (2L)*(2L+N).]

III. INTRA REQUEST OPTIMIZATION

A. Analytical analysis

According to the characteristics of video data, we can derive the translation between the physical location in memory and the pixel coordinates in the spatial domain so as to reduce row misses.

To ease the analysis without loss of generality, we first reduce this 2-D problem to the 1-D domain. Assume that an SDRAM row contains L pixels and that N continuous data are requested. No row miss occurs if the starting point lies within the first L-N+1 positions of the row (positions 0 through L-N). However, if the starting point lies in the last N-1 pixels, a row miss happens. Assuming the starting position is uniformly distributed, the probability of a row miss is

$$P_{\text{row-miss-1D}} = \frac{N-1}{L} \tag{1}$$

For a constant data length N, a larger row means fewer row misses, and a longer data length leads to a higher row-miss probability for a fixed row size. Extending these observations to the 2-D domain, the total row-miss probability, combining horizontal and vertical row misses, is

$$P_{\text{row-miss-2D}} = \frac{N_x-1}{L_x} + \frac{N_y-1}{L_y} \tag{2}$$

where $L_x$ and $L_y$ denote the horizontal and vertical lengths of the memory window, respectively. However, the row size is fixed for a given type of SDRAM, which implies $L_x \cdot L_y = Q$. Thus the width and height of the row window constrain each other, and the row-miss probability should be adjusted as follows:

$$P_{\text{row-miss-2D}} = \frac{N_x-1}{L_x} + \frac{N_y-1}{Q/L_x} = \frac{Q\,(N_x-1) + L_x^2\,(N_y-1)}{L_x\,Q} \tag{3}$$

where $N_x, N_y, L_x > 0$ and Q denotes the row size. In the worst case, $N_x = N_y$ equals the maximum data length. Thus the row-miss probability has its minimum value when $L_x$ is equal to $\sqrt{Q}$.

The above formula is quite simplified. To be practical, we further consider the characteristics of real video sequences. In H.264, the data length requested for motion compensation is 4, 8, 9, 13, 16, or 21 pixels, depending on the block mode and the sub-pixel motion vector. Besides, the starting point is not uniformly distributed in many video sequences. Positions whose index is divisible by 4, that is, the 0th, 4th, 8th, 12th, ..., 4k-th pixels of the row window in the vertical or horizontal direction, appear with higher probability. Generally speaking, $P_{4k}$ is 1.5 to 2.5 times larger than the others according to our simulation, where $P_{4k}$ denotes the probability of the 0th, 4th, 8th, 12th, ..., 4k-th positions. This is because the smallest block length is 4 and because of the effect of zero motion vectors. Blocks with zero motion vectors are usually referenced for background images. For larger quantization parameters, this effect becomes more significant. Thus, the row-miss probability of H.264 motion compensation is

$$P_{\text{row-miss-AVC}} = \sum_{N_x \in \{4,8,9,13,16,21\}} \Big( P_{N_x} \sum_{n=L_x-N_x+1}^{L_x} P_n \Big) + \sum_{N_y \in \{4,8,9,13,16,21\}} \Big( P_{N_y} \sum_{m=L_y-N_y+1}^{L_y} P_m \Big)$$

$$= \frac{5P_{N_x4} + 10P_{N_x8} + 11P_{N_x9} + 16P_{N_x13} + 20P_{N_x16} + 25P_{N_x21}}{L_x + L_x/4} + \frac{5P_{N_y4} + 10P_{N_y8} + 11P_{N_y9} + 16P_{N_y13} + 20P_{N_y16} + 25P_{N_y21}}{L_y + L_y/4} \tag{4}$$

where $P_{N_x4}$ is the probability that the horizontal data length equals 4 pixels, and we assume $P_{4k}$ is twice the others for simplification.

Combining Eq. (4) with simulation statistics, we find that the row-miss probability function is

$$P_{\text{row-miss-AVC}} = \frac{16.866\,Q + 16.133\,L_x^2}{L_x\,Q} \tag{5}$$

For $L_x > 0$, this function has its minimum value when $L_x$ is approximately equal to $1.02\sqrt{Q}$. A typical Q, the row size of an SDRAM, can be 16384, 8192, or 4096 bits, which is 2048, 1024, or 512 pixels. For our targeted SDRAM, with 2048 pixels in a row, the optimized window size should be a 46x44 rectangle. However, it is hard to implement the translation with a 46x44 window. We therefore adjust the window size to 64x32. Because 32 and 64 are powers of 2, the translation can be easily implemented with bit shifts.
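As a hedged illustration of the analysis above, the following Python sketch evaluates the row-miss functions of Eq. (3) and Eq. (5) and searches for the window width that minimizes them. It is a minimal sketch under stated assumptions: all function names are hypothetical, the coefficients 16.866 and 16.133 are copied from Eq. (5), Q = 2048 pixels is the row size of the targeted SDRAM, and the closed-form minimizer is obtained by setting the derivative with respect to Lx to zero.

```python
import math

# Coefficients copied from Eq. (5); they come from the paper's simulation statistics.
A_HORIZONTAL = 16.866
B_VERTICAL = 16.133


def row_miss_2d(lx: float, q: int, nx: int, ny: int) -> float:
    """Eq. (3): 2-D row-miss function with Ly eliminated through Lx * Ly = Q."""
    return (q * (nx - 1) + lx * lx * (ny - 1)) / (lx * q)


def row_miss_avc(lx: float, q: int) -> float:
    """Eq. (5): H.264-specific row-miss function for a row of q pixels."""
    return (A_HORIZONTAL * q + B_VERTICAL * lx * lx) / (lx * q)


def optimal_width(q: int) -> float:
    """Continuous minimizer of Eq. (5): solve d/dLx = 0 for Lx (assumed derivation)."""
    return math.sqrt(A_HORIZONTAL * q / B_VERTICAL)


def best_integer_width(q: int) -> int:
    """Brute-force cross-check of the closed form over integer widths 1..q."""
    return min(range(1, q + 1), key=lambda lx: row_miss_avc(lx, q))


if __name__ == "__main__":
    q = 2048  # pixels per SDRAM row for the targeted device
    lx = optimal_width(q)
    print(f"continuous optimum: {lx:.2f} x {q / lx:.2f}")
    print(f"best integer width: {best_integer_width(q)}")
    # Compare the power-of-two candidates; only the ordering is meaningful here.
    for width in (64, 32):
        print(f"{width}x{q // width}: {row_miss_avc(width, q):.4f}")
```

Under these assumptions the search gives a width of 46 pixels and a height of about 44, and the model ranks the 64x32 window ahead of 32x64, which is consistent with the window sizes discussed in the text.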

B. Simulation results

Fig. 3 shows the row-miss statistics for different window sizes. The test sequences are crew, night, sailormen, and harbour in 525 SD frame size. Compared with linear translations such as the 1x2048 and 2048x1 window sizes, the 64x32 mapping reduces the row-miss rate by about 84%. Compared with the optimal 46x44 mapping, the 64x32 mapping has a slightly lower row-miss rate due to more frequent horizontal motion and the 4x4 block size. Rows with a large size decrease the probability of a row break, so the 32x32 window has a higher row-miss count than 64x32. Due to the characteristics of the video sequences, horizontal breaks occur more frequently than vertical ones. Thus, the 64x32 mapping leads to better performance.

Fig. 3. Miss rate in different row windows (row-miss probability in %, one column per tested QP setting).

Window    Setting 1      Setting 2      Setting 3
2048x1    44.83032999    47.24833619    54.99977362
64x32     6.820379894    7.529893801    9.3924985
46x44     6.778230921    7.545193255    9.533249351
32x64     6.846469165    7.60661285     9.521609834
32x32     7.403632408    8.116680411    9.96107406
1x2048    44.8864127     49.0488445     58.74551384

Fig. 4. Translation of physical location and image position. [Figure: image positions in the spatial domain mapped to data locations in SDRAM, rows k and k+1.]

C. The memory mappings and operations

Fig. 4 illustrates the mapping between the physical location in memory and the image position in the spatial domain. The latency of a single request can be reduced with bank-interleaving operation, as shown in Fig. 5.

Fig. 5. Bank arrangement with optimization.

With the data arrangement described above, the requests can be classified into three kinds, as shown in Fig. 6, assuming that all required rows are opened at the beginning of every request to reduce the control overhead and ease the hardware design.

Fig. 6. Request classification (Case 1, Case 2, Case 3).

Case 1: all data of a single access are contained in one row. This case clearly introduces no row miss, since all the requested data are stored in one row. The memory operation consists of row activation, data reading, and precharging. Fig. 7 shows the operations under case 1. L+4 cycles are needed to complete this access, where L denotes the number of accessed data.

Case 2: all data of a single access are contained in two rows. The data may be discontinuous in the horizontal or the vertical direction, as illustrated in Fig. 6. In this case we suffer two row misses, since the accessed data are contained in different rows. However, the rows are in different banks, so we can shorten the latency with bank-alternating access. Fig. 7 shows the operations of case 2. We open the rows we may access, read the data in a determined order, and then precharge the opened rows. The total cycle count is L+5.

Case 3: all data of a single access are stored in four rows. The data break in both the horizontal and the vertical direction, as illustrated in Fig. 6. Four row breaks are encountered in this case. Due to the limitation of the SDRAM access cycle, one cycle of latency is introduced to meet the timing requirement. Fig. 7 illustrates the operations. The total number of cycles is L+7.
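The translation and the request classification above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the mapping of Fig. 4 or Fig. 5: the raster ordering of windows, the bank selection from the low bits of the window coordinates, and all function names are hypothetical. The cycle overheads 4, 5, and 7 are the ones given for cases 1 to 3, with one cycle assumed per accessed datum.

```python
WIN_W_SHIFT = 6  # log2 of the 64-pixel window width
WIN_H_SHIFT = 5  # log2 of the 32-pixel window height
WIN_W_MASK = (1 << WIN_W_SHIFT) - 1
WIN_H_MASK = (1 << WIN_H_SHIFT) - 1

# Extra cycles on top of the L data cycles, keyed by the number of rows touched
# (case 1: one row, case 2: two rows, case 3: four rows).
CASE_OVERHEAD = {1: 4, 2: 5, 4: 7}


def pixel_to_address(x, y, windows_per_line):
    """Map pixel (x, y) to (bank, row, column) using shifts and masks only.

    Assumed layout (hypothetical): the bank is taken from the low bit of each
    window coordinate, so the four windows meeting at any corner fall into
    four different banks; rows are numbered in raster order within a bank.
    """
    wx, wy = x >> WIN_W_SHIFT, y >> WIN_H_SHIFT
    bank = ((wy & 1) << 1) | (wx & 1)
    row = (wy >> 1) * ((windows_per_line + 1) >> 1) + (wx >> 1)
    column = ((y & WIN_H_MASK) << WIN_W_SHIFT) | (x & WIN_W_MASK)
    return bank, row, column


def rows_touched(x, y, width, height):
    """Count the row windows covered by a width x height block at (x, y).

    H.264 blocks are at most 21 pixels on a side, which is smaller than the
    window, so the result is 1, 2, or 4.
    """
    cols = ((x + width - 1) >> WIN_W_SHIFT) - (x >> WIN_W_SHIFT) + 1
    rows = ((y + height - 1) >> WIN_H_SHIFT) - (y >> WIN_H_SHIFT) + 1
    return cols * rows


def request_cycles(num_data, touched):
    """Total cycles for one request: L + 4, L + 5, or L + 7."""
    return num_data + CASE_OVERHEAD[touched]


if __name__ == "__main__":
    # A 9x9 block placed at three positions to trigger each case.
    for x, y in ((0, 0), (60, 0), (60, 28)):
        touched = rows_touched(x, y, 9, 9)
        print((x, y), touched, request_cycles(81, touched))
```

With this assumed bank assignment, the windows touched in case 2 and case 3 always lie in different banks, which is the property the bank-alternating access relies on.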


Fig. 7. Request operations in different cases.

The probability distribution of these cases is shown in Fig. 8. With increasing quantization parameters, the cross-bank cases decrease rapidly due to more zero motion vectors at high QP. Besides, this result also shows that case 1 accounts for most of the accesses. This means that we usually need to open only one row in a single request, which reduces the extra bandwidth requirement.

Fig. 8. Distribution of access types (bank cross distribution, in %).

Case      qp20           qp32           qp40
case 3    1.763961321    1.536542139    1.088995216
case 2    23.55624675    20.01307293    14.54669593
case 1    74.67979193    78.45038493    84.36430885

IV. INTER REQUEST OPTIMIZATION

In intra-request optimization, we determined the optimized data mapping to reduce row misses. Furthermore, for successive requests, the requested data has a high probability of being stored in the same row because of the overlapped search range. Such an access gets the same benefit as the intra request, without any row miss. However, there is still a certain amount of data stored in different rows. A row miss will then occur if we close the unused rows by precharging the banks and open the new rows. To reduce such row misses, we must consider when and how to close a row by precharging.

There are two types of precharge command: precharge all banks or precharge a single bank. Precharging each bank separately is the preferred way to reduce row misses. However, individual precharging has the overhead of sending more explicit commands to close the corresponding rows. In contrast, only one command is needed to precharge all banks. With single-bank precharging, we can save one row break, going from two row breaks to one break, which is a relatively small saving when compared with going from one break to zero breaks. The actual gain by simulation is about 0.10% in total memory access cycles. The benefit is so small that we can neglect it. Thus, we choose all-banks precharging as our solution, considering the hardware control cost and bandwidth performance.

V. SIMULATION RESULTS

With the above intra- and inter-request optimizations, we can efficiently reduce the miss rate from 6.8% (without inter-request optimization) to 1.8% in simulation. Table I compares the bandwidth requirement with other designs; the data for [6] is from our own implementation. Our proposed scheme reduces the extra memory access overhead, needs less time to transfer data, and thus saves 37% of bandwidth compared to Zhu's design at 525SD video size.

TABLE I. COMPARISONS OF BANDWIDTH REQUIREMENT.

Scheme      Format    Bandwidth (MBps)
Proposed    QCIF      2.60
            CIF       12.00
            525SD     46.96
            720HD     135.54
Zhu [6]     525SD     73.85
            720HD     187.25

VI. CONCLUSION

This paper presents an optimal data mapping for motion compensation in H.264 video coding. Our scheme can save 37% of memory bandwidth when compared to the previous approach. The scheme can be applied to memory controller design and can work together with the selected on-chip bus. Besides, the scheme can also be applied to other types of memory access in video decoding, since these types are subsets of that in motion compensation.

Acknowledgement

This research is sponsored by National Science Council, Taiwan, R.O.C., under grant NSC-922215-E-009-014.

REFERENCES

[1] J.-H. Li and N. Ling, "Architecture and bus-arbitration schemes for MPEG-2 video decoder," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, pp. 727-736, Aug. 1999.
[2] N. Ling, N.-T. Wang, and D.-J. Ho, "An efficient controller scheme for MPEG-2 video decoder," IEEE Transactions on Consumer Electronics, vol. 44, pp. 451-458, May 1998.
[3] H. Kim and I.-C. Park, "High-performance and low-power memory-interface architecture for video processing applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1160-1170, Nov. 2001.
[4] S.-I. Park, Y. Yi, and I.-C. Park, "High performance memory mode control for HDTV decoders," IEEE Transactions on Consumer Electronics, vol. 49, pp. 1348-1353, Nov. 2003.
[5] H.-Y. Kang, K.-A. Jeong, J.-Y. Bae, Y.-S. Lee, and S.-H. Lee, "MPEG4 AVC/H.264 decoder with scalable bus architecture and dual memory controller," Proc. International Symposium on Circuits and Systems, vol. 2, pp. II-145-8, May 2004.
[6] J. Zhu, L. Hou, W. Wu, R. Wang, C. Huang, and J.-T. Li, "High Performance Synchronous DRAMs Controller in H.264 HDTV Decoder," Proc. International Conference on Solid-State and Integrated Circuits Technology, vol. 3, pp. 1621-1624, Oct. 2004.
[7] K.-B. Lee and C.-W. Jen, "Design and verification for configurable memory controller - Memory interface socket soft IP," Journal of the Chinese Institute of Electrical Engineering, vol. 8, no. 4, pp. 309-323, 2001.