Low power and power aware fractional motion estimation of H.264/AVC for mobile applications

(1)

Low Power and Power Aware Fractional Motion

Estimation of H.264/AVC for Mobile

Applications

Tung-Chien Chen, Yu-Han Chen, Chuan-Yung Tsai and Liang-Gee Chen

DSP/IC

Design Lab., Graduate Institute of Electronics Engineering and Department of Electrical

Engineering, National Taiwan University; Email: djchen, doliamo, cytsai,

lgchen@video.ee.ntu.edu.tw

Abstract- In this paper, the low power design techniques from algo- S stem Bus

rithm to architecture levels areproposed for factional motion estimation j inH.264/AVC.TheproposedAMPD algorithm can reduce 50.8%

power0if

with up to 0.1 dB quality drop. The proposed parallel architecture with ef-

-ficientmemory hierarchy can efficiently reuse data and save 61.6% power. RISC SW Local JControMB-level

Furthermore, thepower awarefunctionalityis included. Our design can = DataReuse

gracefully vary the quality degradation of 0.1-3.9 dB in response to the

7--22.58-1.64 mW power consumption. This power-oriented design is very

,

d

efficient fordifferent mobile applications in various power situations. FrameOn-lineFME Candidate-Memory Interpolation Core level Data

Reuse

I. INTRODUCTION FMEModule _Fast

While highly interactive and recreational multimedia appli- Algorithm

cationsappearmuch fasterthanexpected, there seems to be an

inexhaustibledemand forasmuch higher compression ratio and Fig. 1. FMEsystemarchitectureand thecorrespondingpowerreduction

tech-better video quality as possible. The new video coding stan-

niques.

dard of H.264/AVC[1], whichsignificantly outperforms previ- gives a conclusion. ousstandards, undoubtedlyplays animportant role inthis area.

Thenewtechniques [2] of 1/4-pixelresolution, variable block II. FUNDAMENTAL

sizes(VBS), multiple referenceframes (MRF), andLagrangian A. FMESystem Architecture and Power Reduction Techniques mode decisionin motionestimation (ME)induce huge

compu-tationcomplexity, and hardware accelerationis a must. Figure 1 shows the FME system architecture. In reference Classical hardware architecture focuses on latency,

area,

software [7], in order tosupportMEwithquarter-pelresolution, throughput and hardware complexity as the primary parame- all required pixels areinterpolatedin advanceand stored in the ters to beoptimized ortradedoff against each other.

However,

frame memory. However, the storage space with 16-times of

power-orienteddesign methodologyhasemergedas a a key fac- frame size isrequired for each reference frame, which is a con-torfortworeasons: limitedavailability of power in portable or siderable overheadfor hardwareimplementation. Besides, the wearable devices, and limited capacity to dispose of the heat systembandwidth(BW) will become too large during FME pro-of VLSI circuits. Lowpower and power aware are the most cedure. Therefore,theon-chip on-lineinterpolationisgenerally important issues inthepower-oriented design. Thelow-power adopted [6] In this system structure, the power consumption ME canprovide the maximum compression performanceunder mainlycomes from two parts. One is data access power of inte-the specific power constraint. The power-aware ME can ex- ger referencepixels loading from frame memory to FME core. tent the battery life by varying the compression performance The other is computationpower ofinterpolation andmatching andpowerconsumption according to thesystem power status. cost

calculation.

Manypower-oriented fast algorithmsandhardwarearchitecture Several powerreduction techniques aredescribedasfollows. have been proposed forinteger motionestimation (IME)[3] [4] First, because the referred search areas ofneighboring current

[5],butnonefor fractional motion estimation(FME). However, MBs (CMBs) are considerably overlapped, thesearch window

accordingtoouranalysis, theFMEoccupies45% [6] of the run- (SW) SRAMs are firstly embedded to achieve MB-level data time in H.264/AVC inter predictionandupgradesrate-distortion reuse (DR). The BW and power of system memory can be re-efficiency by 2-6 dB in peak signaltonoise ratio (PSNR). The duced. Second, theparallelarchitectureis designedto achieve advancedFMEdesign with optimizationinpower issues is ur- candidate-levelDR, andthe BW and power of local memory can

gentlydemanded by H.264/AVC compressionsystem. be reduced. Third, thehardware-orientedfastalgorithmcanbe Inthis paper, the first low-powerandpower-awareFME de- used to reduce the computationalpower ofinterpolation engine

sign of H.264/AVC is proposed. The rest ofthis paper is or- and FME core. Last but not least, the gatedclocktechniquecan ganized as follows. In Section II, the related power reduction beappliedto turn theinoperative circuits off.

and poweraware techniques will be described. In Section III, oe wr ucinlt

the power-aware fast algorithm areproposed. The

correspond-ing parallel hardware architectureisdesignedin Section IV with Power aware functionality is another important issue for low-power considerations, while theimplementationandsimu- power-oriented design. It can vary thepower consumption ac-lation results are presented in Section V. Finally, Section VI cordingtothepowerstatus at the expense ofcompression

(2)

2

FullSearch MPD AMPD (dB) (dB)

37

IntegerMotion IntegerMotion IntegerMotion 3 - - 36

XEstimation Estimation Estimation

pMode Decision M Pre- FN

IMV41pa s Best MVs AsetofIMVs AMPD'(32 -X AMPD(N=3)

,of_t4

l-

MPD

(N=l-1)

--MPD(N=1 )

Fractional Fractional Fractional 31 30

Motion Motion Motion 60 110 160 210 260(Kbps) 220 420 620 (Kbps)

CL. Estimation Estimation Estimation

Mode Post-

(a)

Foreman

(QCIF)

(b)

Stefan

(QCIF)

LU ModeDecisionDeiin(B-(d)- _{Decision~~CtS-S}_(dB); _; _ff

BestFMVs BestFMVs BestFMVs 36 3 366

(a) ()(c) 34

.34

/

(a) (b)-(c)FS (N=7) -FS (N=7)F 32 ,A, ---AMPD (N=3) 32 ---AMPD (N=3)

Fig. 2. (a)FS algorithm; (b) MPDfast algorithm; (b) ProposedAMPD fast M (N- MP. (N

algorithm. 30

30-270 470 670 870 (Kbps) 40 90 140 190(Kbps)

formance. ForH.264/AVCstandard,power awarefunctionality

becomesmorenecessary. That is because thecomputationcom- (a) Mobile (QCIF) (b) Table (QCIF) plexity of the video compression algorithm grows much faster _{Fig. 3. Rate-distortion curves of FS, MPD, and AMPD algorithms. The} param-than theprogressofpowersupply. A specifiedpowerconstraint eters are QCIF,30frames/s, I reference frames,±16-pelsearch range(SR),

areusuallymet attheexpenseof considerablequality drop, but and low complexity mode decision.

high quality is usually claimed especially when poweris full. "Mode Pre-Decision" and "Mode Post-Decision". In "Mode The encoder withpower awarefunctionality can notonly have Pre-Decision", not only one but a set of probable modes are se-more adaptability for good balance between compression per- lected after

IME

phase. Firstly, the four sub-modes of each 8 x 8 formance andpower consumption but also extend the battery blockare sorted

according

tothe

Lagrangian matching

costsin life time. In a well-designed power aware system, the com- integer-pel resolution. The costs of the best sub-modes of four pression performance should be varied gracefully inresponse to 8x8blocksarethen summedupas one8x8

partition

costs. The thepower status. Inordertomaximize thecompression perfor- costs of the second sub-modes are summed up as the second manceunder thepowerconstraint, the processing tasks should 8 x 8 partition costs, and so on. There are totally four costs of be turned off

according

tothe

priority,

the 8x8

partitions.

Secondly, the costsofseven

partition,

four III. POWERAWAREFASTALGORITHM 8 x 8 partitions together with the 16 x 8, 8 x 16, and 16 x 16

parti-tions, aresorted from low

(best)

to

high (worst).

N

(N

= I1-

7)

Full search (FS) algorithm can guarantee the highest com-

of

the best partitions are selected for the FME procedure fol-pressing performance. However, the largepowerconsumption lowed by "Mode Post-Decision". Note that, if more than one is verycritical formanyportable applications. Inthis section, 8 x 8 partitions are selected, the re-combination among the sub-a hardware oriented fast algorithm will beproposed with the modes is allowed in "Mode Post-Decision". When N is equal to

power awarefunctionality. one, AMPD has the same performance with MPD. When N is

Figure2(a) shows theconceptof FSalgorithm. H.264/AVC equal to seven, AMPD has the same performance with FS. supportsVBS, and the inter mode decision with FSalgorithmis

donewiththecostsof all blocks andsub-blocks, whichare re-

Figret

3

shows

ther

itortion

Muve

of

aMPdintwhioh

fiedtoarqarerpl esluio. n rertoreuc

te*o

N is set to

three,

together with FS and MPD. The rate-distortion fined toward

quarter-pel

_{* * **}resolution._{1 r} In₁ order₁ to₁reduce the₁ _{s /r 1}com- _~~efficiencvefiiec imroe wihteices_im _roves_with_the_increase_offN,adsostecm_N,_{and so does the} com-putation complexity, a simple fast algorithmdenoted as Mode yc

Pre-Decision (MPD) was evaluated firstly. As shown in Fig. putation complexity. However, this improvement saturates when 2 (b), the inter modedecision ismoved forwardto

IMB

phase. N is set to

three.

That means when N is larger than three, much

2b)theine mod deiso is

~

moe fow

rd

to IM

pase

more

computation complexity

mustbe

paid

with

just

the limited Thatmeanstheselectionofthe best combinationof VBS isdone

in integer-pel resolution. The FME is thus responsible forre-

quality improvement,

which is not efficient for power-oriented

fining theinteger motion vectors (IMVs) of the selected mode design. Therefore, the parameter of "N=3" is chosen. Simi-toward quarter-pel resolution. In VBS, eachMB can be split larly, as for MRF-ME, the fast algorithm modified from [8] is up into4 kinds ofpartition: 16x16, 16x8,8x16, and 8x8. If used. According to the analysis in [8], more than 80% of the

partition

8x8 isselected, each blockcanbe further

split

into4 best reference frames have the

temporal

distance that is smaller

kinds ofsub-partitions: 8x8, 8x4,4x8,4x4.Totally41 blocks than two. When the

maximum

reference frame numberis set and sub-blocksareinvolvedperreference frame and havetotally overthan two, the

quality improvement

is limited with consid-seven-timespixels ofoneMB.Therefore, theMPDcanreduce erably increased computation complexity. Therefore, the maxi-about 6/7computation complexity. However,the rate-distortion mum number of MRF is set to two in our design.

efficiency is seriously degraded for up to one dB in PSNR. That The AMPD can efficiently support power aware functionality. is because the best combination of VBS in integer-pel resolution Because the correlation between mode decision in integer-pel may change after the refinement in quarter-pel resolution. resolution and that in quarter-pel resolution is very high. The To achieve abetter trade-offbetween quality and computation priority of the VBS can be precisely estimated by the sorting complexity, advanced MPD (AMPD) algorithm is proposed. As operation in the "Mode Pre-Decision". By varying the param-shown in Fig. 2 (c), the inter mode decision is divided into eterN,our algorithm can have a good trade-off between power

(3)

3

BH SRHSR

~~-

(-B*-X B--H :> MBwideB MBwdeCRretMB CurntM Search Window SRAMS Side Information

B MBheight SRAM

RSR,SHorizontalsearchrange

SRv SRv Veritcal searchrange Interpolation MVRet/Mode

+ BW IwagewideCs

Q

I~

~~~L- SWoftCMBO

mxBeent 4x4Blewent 4x4Blewent

1 ~~~~~

~~~~SW

otCMBi PU# PU #1 PU #2 g

Mode

(a)Level-CDRScheme ent 4x4Element 4x4Elewent Decision

PU#3 PU#4 PU#S w

rX 4x4EBewent 4x4Eiement 4x4EBement

By PU #6 PU#7 PU #8 Output

~~~~~~ ~~~~~~~~~~&B-~~~~~~~~~~~~~~~~~~~~~~~~Butter

+ RowotFourCur.MBPale& 44ImnP

Bv FourInterpolatedRet.Pale xEemnp

L I4 -XP-ht -

-B, SRH X BH 4-aale Si Si

2-DHadaward+ AO

(b) Level-DDRScheme TransformUnit

Fig. 4. MB-level DR schemes. When process is changed from CMBO to CMB 1,

the dark grayregionis the reused data, and the light gray region is the re- Fig.5. Theproposed parallelhardware with candidate-level DR.

quireddata loadedfromsystem memory. TABLE I

consumption and compression performance. Furthermore, if the PERFORMANCE OF

CANDIDATE

LEVEL DR FOR THE PROPOSED PARALLEL poweris verylow,wecan evensupport only half-pel or integer- ARCHITECTURE.

pel resolution with onereference frame. In the later case, the

No________________________

FME module will only take care of the MC task. The quality

Technology

Paral.

level DR level DR Both approachestoMPEG-4simple profile in this situation.

RAcandidate-level 74.98

28.12 10.18 3.64

IV. LowPOWERARCHITECTURE DESIGN neighboring CMBs. According to Fig. 4 (b), the RA of Level-D

Inthis section,the low-powerdesignmethodology onarchi- scheme is calculatedas:

tecture level is described. The parallel architecture with effi-

BH

X

BV

cient memoryhierarchy is used. Two factors are used to evaluate RAmb-level,level-D R

BH

XBV

theperformance of the proposed low-power architecture: the

re-dundancyaccessfactors(RAs)for MB-levelDRand candidate- The Level-D schemecanminimize the system memoryBW, but levelDR,which are defined as follows. ahugelocal memory size of(Wx

SRV)

+(SRH+BH)x

BV

pix-els isrequired for each reference frame.

System

BW

for

loading

SWs Inour

design,

both the level-C and level-D schemesare

em-minimum requirement bedded and chosen according to the MRF number. In general SW SRAMBWfor reading ref-pels case,tworeference framesareused for thehighest compression RAcand.

level

= minimum requirement performance. Each reference frame is stored in SW SRAMs and reused in level-C scheme, separately. When thepower is

A. MB-levelDataReuse low, the SW SRAMs are configured as the level-D scheme for

For MB-level DR, four

strategies

have been

proposed

with lowestpower

consumption

in

system

bus. In this

situation,

only

differenttradeoffs between local memory size and system bus onereference framecanbe

supported.

BWandare indexed from level-Ato level-D [9]. Because the B. Candidate-levelDataReuse power consumption to access the system memory is the most

criticalpart, the level-C and level-D, that have lower RA fac-

intHge4/AV, Fhe

uo

ar-pixel

ara After

tors, arerecommended for memory technology nowadays. The the integer

ME,rthel

refin

s are

perfoRm

level-C scheme reuses the horizontally overlapped region be- around the best integer search positions with

i1/2

pixel SR. tween two SWs of theneighboring CMBs as shown inFig. 4 The quarter-pixel ME, as well, is then performed around the (a). Only the lightgray partof BH X

(SRV

+

BV)

pixels is re- best halfsearch position with

h1/4

pixel SR. Therefore, each

quiredtoloaded from externalmemoryforeveryCMB.TheRa half and quarter refinement has nine candidates including the of Level-schemecn be calclated

as:refinement

center and its eight neighborhoods. For AMPD al-of Level-Cschemecanbecalculatedas:

gorithm,

thisrefinement

procedure

willbe

iteratively

processed

RAmb

~~~~BH

X

(SRV

+

BV

) _ SRV on the selected blocks and sub-blocks. The best mode is selected

RAblevel.leve!

BH X BV + B~ by considering both the distortion of SATD and the rate parts. Based on the analysis in [6], the hardware architecture is Therequired SW SRAMs' size is (SRH +BH) (SRV+BV)pixels shown in Fig. 5. The 4x4 block is the smallest element of VBS, for each reference frame. As for the Level-D scheme, it can and the SATD is also based on 4x4 blocks. Every blockand fully reuse the horizontally and vertically overlapped SWs of subblock can be decomposed into several 4x4-elements with

(4)

4

PSNR(dB)

9l68llll Intra-leveldata reuse

9.4i8 -50.8% * Inter-leveldatareuse

731 &-24.5% EliFastAlgorithm 38

_)8~~~~~~~Io1Root/LeafGatedClockl =l

-ToolOptimization 3

Power(mnW) l

0 10 20 30 40 36 9- MemoryPower

36 ECorePower

Fig. 6. Performance of the power reduction techniques on each level. The pa- 35

rameters areforeman CIF video,30frames/s, 1 reference frames, and AMPD 0 4 8 12 16 20 Power

with N=3. (mW)

PSNR (dB)

40 _ Fig. 8. Performance ofpower awarefunctionality. Theparameters areforeman

38 CIF video, 30frames/s,and

700kbps.

36 - 2 sumption inFMEis saved after theoptimization in each level.

Figure

7 shows the

compression performance

ofour

design.

34 The best rate-distortioncurvehas

only

0.1dB

quality drop

com-JM,2Ref. -=7 pared with reference software. Because the tasks ofFME can

32 X 1 Ref'N=3 be

dynamically

turned off

according

tothecontent

information,

-~1 Ref,N=2

30 1 Ref'N=1 our

design

can

gracefully

varythe

compression performance

in

1 Ref,N-1 (HalfPrecision) response to the

powerstatus.

The

powerconsumption

versus

1Ref,N=1(integerPrecision) oe oe

28 Bitrates compression performance is shown in Fig. 8. The power of SW

160 320 480 640 800 (Kbps) SRAMs is included. Our design cangracefully vary the

com-pressionperformance of 0.1-3.9 dB with the 22.58-1.64 mW

Fig. 7. The rate-distortion curves of the proposed hardware with power aware power consumption, and is very efficient for different mobile functionality. The test video isforeman,CIF and 30frames/s. applications in variable power situations.

the same MV. Therefore, a 4x4-elementPU is designed and V CONCLUSION

reused for all larger blocks by folding technique. Forthe low

powerconsideration, the intra-candidate and inter-candidateDR In this paper, we contributedalow-powerandpower-aware schemesareapplied. Forintra-candidate DR, each4x4-element designfor FME of H.264/AVC. The fastalgorithmof AMPD is PUisarranged with four degrees of parallelismtoprocessfour proposedto notonlyreduce computation complexity, but also horizontally adjacent pixels ofonecandidate.Forhorizontal fil- provide efficientadaptivitybetweenqualityandpower. The ef-tering, six integer pixels arerequired for interpolating onehalf ficientmemory hierarchy and reconfigurable SW SRAMs are pixel, while night integer pixels are required for four half pix- usedtoachieve MB-level datareuse,while theparallel architec-els. Mosthorizontally adjacent integer pixels canbe reusedby tureisdesignedtoachieve candidate-level datareuse. As shown horizontal filters, and the on-chipmemory BWof SWSRAMs in

experimental

results, the 87% ofpowerconsumptioncanbe canbe reduced. Forinter-candidate DR, wearrangenine4x4- saved after lowpowerconsiderations. The powerconsumption

elementPUs to processthe nine candidates simultaneously. In of 22.58-1.64 mWcan be

efficiently

variedatexpense of 0.1-thisway,theinterpolated fractional pixelscanbereuseby these 3.9 dB quality drop.

4 x4-elementPUs. The redundant computation of interpolation REFERENCES

canbe saved.Besides,theon-chipmemory BWof SW SRAMs

canbe further reduced. TableIsummarizes theperformanceof [1] Joint Video Team, Draft ITU-T Recommendation and Final Draft Inter-_{national Standard}_{of Joint Video Specification,} _ITU-T_{Recommendation} candidate levelDR for theproposed parallel architecture. Af- H.264 and ISO/IEC14496-10AVC, May 2003.

ter candidate-level DR, the reference pixel data canbereused [2] T.Wiegand,G. J.Sullivan, G.Bj0ntegaard, and A.Luthra, "Overviewof 20-timesmoreefficiently. _{[3] M. Miyama, J. Miyakoshi, Y Kuroda, K. Imamura, H. Hashimoto, and}theH.264/AVCvideocodingstandard,"IEEETransactionsonCSVT,2003.

M.Yoshimoto, "Asub-mW MPEG-4 motion estimation processor core for

V. IMPLEMENTATIONAND SIMULATION RESULT mobile video application," IEEE Journal ofSolid-State Circuits, 2004. [4] S.-S. Lin;P.-C.Tseng;C.-P.Lin;L.-G.Chen;, "Multi-modecontent-aware

Theproposed low-power architecture with power-aware fast _Proceedingsmotion estimation_of_IEEEalgorithm for power-aware video_Workshop_on_SIPS,_2004. coding systems," in

algorithm is implemented in TSMC

0.18,u

1P6M technology. [5] H.-W. Cheng and L.-R. Dung, "A content-based methodologyfor power-The totallogicgate countis 125K with maximumoperation fre- awaremotion estimation architecture," IEEE TransactionsonCASII,2005.

of 27MHz. This designcansupportreal-time encoding [6] T.-C

Chen,

Y-W.

Huang,

and L.-G.

Chen,

"Fullyutilized and reusable

ar-chitecture

for

fractional motion estimation

ofH.264/AVC," inProceedings

CIF30fpsvideos withtwoMRF.ThesupportedSRis±32 hor- ofICASSP, 2004.

izontally and ±16 vertically. Figure 6 shows the power reduc- [7] Joint Video Team Reference Software JM8.5,

afterhe

[8]http://bs.hhi.de/ suehring/tmlldownload/,

Sept. 2004.

tion

performance of

logic circuits

(SRAM excluded) atrhe 8

Y-

W. Huang,

B.-Y.

Hsieh, T.-C. Wang,

S.-Y.

Chien,

S.-Y.

Ma, C.-F. Shen, low-power techniques that are applied on each level. Note that and L.-G. Chen, "Analysis and reduction ofreference frames for motion

Synopsys Prime Power is used to estimate the power consump- estimation in MPEG-4 AVC/JVT/H.264," in Proceedings ofICASSP, 2003. tionhere Th_{hon hee.Thelnnovaons onalgorlhm}innvatons n agorihm ad achitctue leels_{andarchltcture evels} [9] J. C._{ory bandwidth analysis for full-search block-matching VLSI architecture,"}

Tuan,

T. S. Chang, and C. W. Jen, "On the data reuse and

mem-contribute the most power reduction. Totally 87% ofpowercon- IEEE Transactions on CSVT, 2002. 5334