Hybrid-mode embedded compression for H.264/AVC video coding system


Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, December 13-16, 2005, Hong Kong

HYBRID-MODE EMBEDDED COMPRESSION FOR H.264/AVC VIDEO CODING SYSTEM

Tung-Chien Chen, Yu-Han Chen, Ke-Chung Wu, Liang-Gee Chen

DSP/IC Design Lab, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan

{djchen, doliamo}@video.ee.ntu.edu.tw, ddddog@gmail.com, lgchen@video.ee.ntu.edu.tw

ABSTRACT

Applications of high-resolution video have great potential for H.264/AVC. However, due to multi-frame motion estimation and the large search range (SR) requirement, the ultrahigh system bandwidth becomes the challenge for a platform-based video codec. In this paper, a hybrid-mode embedded compression (EC) is proposed. Two different strategies are used to compress the reconstructed macroblocks (MBs) of intra- and inter-mode, respectively. Up to 9.2 times of compression ratio (CR) can be achieved even under the lossless-compression constraint. Besides, with resource sharing, this system can be integrated into an H.264/AVC codec with almost no area overhead. According to the simulation results, the system bandwidth can be reduced by 66.2% and 75.3% on average for high and median quality situations.

1. INTRODUCTION

H.264/AVC [1] is the new video coding standard developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). It can save 25%-45% and 50%-70% of the bitrate compared with MPEG-4 Advanced Simple Profile and MPEG-2, respectively. The coding gain mainly comes from new prediction tools, and enormous computation and ultra-high memory bandwidth are the penalties. According to instruction profiling, 2.76 tera-operations/s of computational load and 4.25 tera-bytes/s of memory access are required for real-time encoding of SDTV (YUV420, 720x480, 30 fps) video (JM8.5 [2], baseline options, full search, four reference frames, SR [-32, +31]). For platform-based VLSI systems, in which the high computation requirement can be easily solved by increasing the parallelism of processing elements, the real challenge is the unacceptable bus bandwidth requirement with limited system resources.

The bandwidth mainly comes from the access of reference data during motion estimation (ME). One common solution is to use EC for the frame buffer access [3]. The EC engine compresses the MBs of the reconstructed frame and transmits the resulting bitstream to the frame buffer. When the video codec system performs ME, the EC engine fetches and decodes the compressed data from the system bus. Depending on the target application, there are various EC algorithms, which fall into two categories. Lossy ECs [4][5] have better performance in bandwidth reduction, but quality degradation occurs due to error propagation. Lossless ones [6] guarantee the highest video quality, but their CRs are limited to two, just like lossless image coding [7]. In this paper, a novel EC system is proposed. Unlike previous ECs that process every MB equally, two different strategies are used for inter- and intra-mode reconstructed MBs, respectively. In this way, our EC can losslessly compress the reconstructed MBs with higher CR.

Fig. 1. The EC system for H.264 video encoder.

The rest of this paper is organized as follows. In Section 2, the bandwidth problem is described, followed by the analysis of the required EC system. Then, the hybrid-mode EC system is proposed in Section 3. The corresponding hardware architecture and the performance evaluation are presented in Section 4 and Section 5, respectively. Finally, Section 6 gives a conclusion.

2. PROBLEM STATEMENT

In ME, in order to find the best matched candidate, a search window (SW) within one reference frame has to be searched. A huge amount of reference data must be loaded from the frame buffer to the ME core, and the traffic is very heavy. Because pixels in neighboring candidate blocks overlap considerably, and so do the SWs of neighboring current MBs, the bandwidth of the system bus can be greatly reduced by designing local buffers to store reusable data [8][9]. By means of local memory access, the external memory bandwidth can be reduced.

For high-resolution video applications in H.264, however, the bandwidth is still too large even with the above techniques. A larger frame size means that a larger SR is required to achieve good ME performance, so the system bandwidth increases exponentially. Besides, H.264 supports multiple reference frames [10]. The required reference data are proportional to the number of reference frames, which also heavily increases the system bandwidth. Table 1 and Table 2 show the necessary system bandwidth of ME for different video formats and reference frame numbers, respectively. The analysis is based on the level-C data reuse strategy [9], which is the most frequently adopted nowadays. The system bandwidth increases exponentially with the frame size and linearly with the reference frame number. In the HDTV case with 4 reference frames, the bandwidth requirement is not achievable for a platform-based system.

Table 1. The necessary system bandwidth of ME for different frame sizes with one reference frame.

Format                      SR      Bandwidth
CIF (352x288, 30 fps)       ±16     9.13 MB/s
D1 (720x480, 30 fps)        ±64     93.3 MB/s
HDTV (1280x720, 30 fps)     ±128    470 MB/s
HDTV (1920x1080)            ±256    2.05 GB/s

Table 2. The necessary system bandwidth of ME for multiple reference frames (HDTV, 1280x720, 30 fps, with ±128 SR).

# of ref. frames    System bandwidth
2                   940 MB/s
3                   1.41 GB/s
4                   1.88 GB/s

To achieve further reduction, EC techniques should be used. Figure 1 shows the EC system for the H.264 video encoder. The EC compresses the reconstructed frame data into an embedded bitstream before sending it out, and the compressed data are then read and decoded while the SW is loaded. Since only the compressed bitstream is transmitted on the system bus, the bandwidth can thus be reduced. Several issues are considered for our EC system in H.264 large-frame-size applications:

1. Lossless compression: Most ECs for previous standards are lossy. Lossy compression has better CR at the expense of video quality. Problems such as data mismatch between encoding and decoding may occur, which induces error propagation. Therefore, due to the demand for high video quality in HDTV applications, the compressed reference frame must be perfectly reconstructible.

2. High CR: The system bandwidth reduction ratio is equal to the CR of the EC. The encoding efficiency must be high enough to solve the large system bandwidth problem. In general, the CR limit of lossless image coding is two. However, the EC compresses reconstructed frame data that have already been processed through the DPCM loop of the original encoder. With the information from the original encoder, our EC can achieve higher CR.

3. Resource sharing: There are many encoding schemes with high compression efficiency in the H.264 encoder. Some of them can be chosen as parts of our EC. We can share existing functional blocks in the H.264 codec to reduce the hardware cost of the EC system.
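The bandwidth figures in Table 1 and Table 2 can be approximated with a short calculation. The sketch below is not taken from the paper: it assumes that level-C data reuse reads each reference pixel about 1 + SR_V/N times, where SR_V is the full vertical search span (twice the ± value) and N = 16 is the MB height, and that one byte per luma pixel is transferred. It also assumes 30 fps for the 1920x1080 row, which Table 1 does not state. The function name and parameters are hypothetical.

```python
# Sketch: ME system bandwidth under level-C data reuse.
# Assumption (not stated in the text): redundancy factor = 1 + SR_V / N,
# with SR_V = 2 * sr (full vertical span) and N = 16, one byte per luma pixel.

MB_SIZE = 16  # macroblock height N in pixels


def me_bandwidth_bytes_per_s(width, height, fps, sr, ref_frames=1):
    """Approximate luma bytes/s read from the frame buffer for a +/-sr search range."""
    vertical_span = 2 * sr                     # +/-sr covers 2*sr rows
    redundancy = 1 + vertical_span / MB_SIZE   # assumed level-C access factor
    return width * height * fps * redundancy * ref_frames


formats = [
    ("CIF", 352, 288, 16),
    ("D1", 720, 480, 64),
    ("HDTV 1280x720", 1280, 720, 128),
    ("HDTV 1920x1080", 1920, 1080, 256),   # 30 fps assumed
]
for name, w, h, sr in formats:
    mb_per_s = me_bandwidth_bytes_per_s(w, h, 30, sr) / 1e6
    print(f"{name}: {mb_per_s:.1f} MB/s")

# Table 2: the requirement scales linearly with the number of reference frames.
for refs in (2, 3, 4):
    gb_per_s = me_bandwidth_bytes_per_s(1280, 720, 30, 128, refs) / 1e9
    print(f"{refs} reference frames: {gb_per_s:.2f} GB/s")
```

Under these assumptions the D1 and HDTV rows come out at the values listed in Table 1, and the CIF row differs only in the last digit, which suggests the tables were produced with a closely related formula.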

3. PROPOSED ALGORITHM

In this section, a new hybrid-mode EC system for H.264 is proposed to meet the requirements mentioned in the previous section. In H.264, an MB is either intra mode or inter mode. For each mode, a different strategy is used to compress the reconstructed data of H.264.

Fig. 2. The H.264 intra-frame coding.

3.1. Intra mode MB

The H.264 encoder converts the original frames to a bitstream, and this bitstream can be converted to reconstructed frames by the H.264 decoder. Therefore, though the H.264 bitstream is lossily compressed data for the original frames, it is losslessly compressed data for the reconstructed frames. Based on this idea, the EC system may directly use the H.264 bitstream as the embedded bitstream, which will be transmitted to and stored in the external memory. During ME for the next original frame, the reference frame, i.e., the previous reconstructed frame, can thus be perfectly recovered from this bitstream. However, this scheme sometimes contradicts the demands of EC. If an MB is inter-mode encoded in H.264, temporal information is required when we reconstruct it. In this case, more data are involved when loading the search window, and the system bandwidth cannot be reduced.

In order to properly solve the bandwidth problem, only spatially local data can be involved in the EC system, just like image coding. For an intra-mode coded MB, both the encoding and decoding processes access only local information in intra prediction. Therefore, the H.264 bitstream of an intra-mode coded MB can be directly used as the embedded bitstream without any bandwidth overhead. Please note that the quantization step in intra-mode encoding reduces the energy and results in better CR. Since our lossless EC is embedded with this quantization step, the CR, or bandwidth reduction ratio, is much higher than that of other EC systems using lossless image coding standards. Consider a special case: the reference frame of the first P frame is an I frame, in which all MBs are intra-mode coded. The memory reduction ratio is then equal to the H.264 CR, which is much higher than the limit of lossless image coding. In the general case, the overall CR of our EC system depends on the percentage of intra-mode coded MBs.
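How the overall CR depends on the share of intra-mode MBs can be illustrated numerically. The sketch below assumes the relation between the hybrid CR and the per-mode CRs that is given in Section 5, and uses the hybrid CR values listed in Table 4. The function names are hypothetical, and the conversion from CR to bandwidth reduction, 1 - 1/CR, is an assumption that follows from treating the CR as the ratio of original to compressed traffic.

```python
# Sketch: overall CR as a function of the intra-MB share, and the resulting
# bandwidth reduction. Hybrid CR values are copied from Table 4; the
# function names are illustrative.

def hybrid_cr(alpha, cr_intra, cr_inter):
    """Combine per-mode CRs: 1/CR_hybrid = alpha/CR_intra + (1-alpha)/CR_inter."""
    return 1.0 / (alpha / cr_intra + (1.0 - alpha) / cr_inter)


def bandwidth_reduction(cr):
    """Fraction of bus traffic removed when data shrink by a factor of cr."""
    return 1.0 - 1.0 / cr


# CR_hybrid per sequence from Table 4, keyed by QP.
CR_HYBRID = {
    15: {"Bamboo": 4.082, "Wendy": 1.916, "Crew": 6.024, "Bigships": 2.387},
    30: {"Bamboo": 4.651, "Wendy": 2.786, "Crew": 9.259, "Bigships": 3.279},
}


def mean_reduction(qp):
    """Average bandwidth reduction over the four sequences at one QP."""
    values = [bandwidth_reduction(cr) for cr in CR_HYBRID[qp].values()]
    return sum(values) / len(values)


for qp in (15, 30):
    print(f"QP={qp}: average bandwidth reduction {100 * mean_reduction(qp):.1f}%")

# Sweep the intra share for Bamboo at QP=15 (CR_intra=5.848, CR_inter=2.433).
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: CR_hybrid={hybrid_cr(alpha, 5.848, 2.433):.3f}")
```

With these inputs the two averages round to the 66.2% and 75.3% figures quoted in the paper, and the sweep shows the overall CR moving from the inter-mode value toward the intra-mode value as the intra share grows.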

3.2. Inter mode MB

For the inter-mode coded MB, extra encoding schemes are proposed for our EC system. H.264 intra-frame coding is chosen as the prototype of the embedded encoder in consideration of its good compression capability, low memory usage, and high resource-sharing possibility. Figure 2 shows the original block diagram of H.264 intra-frame coding. For lossless operation, the quantization and inverse quantization are removed first. Besides, in a lossless EC system, since the reconstructed frame must be the same as the original one, the feedback path of the DPCM loop can be omitted. After the above modifications, only the transformation and entropy coding are performed after intra prediction and intra compensation.

We analyze three different transform strategies inherited from the H.264 encoder. The selection depends on both compression performance and hardware considerations.

Scheme 1: The transform is simply the original integer discrete cosine transform (DCT) in the H.264 DPCM loop.

Scheme 2: The Hadamard transform, the other transformation in H.264 involved in both the DPCM loop and the encoder issues, is used.

Scheme 3: No transformation. The residue is directly bypassed and entropy coded.

In schemes 1 and 2, the residues are transformed from the spatial domain to the frequency domain. We must find an inverse transformation from which the residues can be losslessly recovered. That is, for any matrix A, if T is the transform matrix, the inverse transform matrix R must satisfy

$$R\,(T A T^{T})\,R^{T} = A \qquad (1)$$

We separate R into two matrices, B and D, and the equation becomes

$$B\,\bigl(D\,(T A T^{T})\,D^{T}\bigr)\,B^{T} = A \qquad (2)$$

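In scheme 1, $T_{scheme1}$ is the DCT matrix and $D_{scheme1}$ is the IDCT matrix. In order to satisfy (1), the additional matrix $B_{scheme1}$ must be implemented as

$$B_{scheme1} = \begin{pmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/5 & 0 & 0 \\ 0 & 0 & 1/5 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}$$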

In scheme 2, $T_{scheme2}$ is the Hadamard transform matrix and $D_{scheme2}$ is the inverse Hadamard transform matrix, which is the same as the Hadamard transform matrix. Similarly, in order to satisfy (1), the matrix $B_{scheme2}$ is

$$B_{scheme2} = \begin{pmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 1/4 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}$$

The matrix $B_{scheme1}$ is not suitable for hardware implementation because it requires dividers. On the contrary, the matrix $B_{scheme2}$ involves only shifters, and no hardware overhead is required. Therefore, scheme 2 is the better choice of the two. Since neither scheme 2 nor scheme 3 has hardware overhead, they are compared according to compression performance.

Table 3 shows the simulation results. Several video sequences are simulated with the quantization parameter (QP) set to 15. According to the experiment, the CR is better without transformation. This is because there is no quantization process in the EC, and the transformation cannot successfully compact the energy. Hence, scheme 3 is the best choice among all schemes. It is worth noting that the CR of the EC system increases with larger frame sizes and higher QP values of the original encoder.

Table 3. The simulation results of CR in schemes 2 and 3 at QP=15 (frame size: (1) 352x288, (2) 720x288, (3) 720x480, (4) 750x576, (5) 1920x720).

Image name      Scheme 2    Scheme 3
Foreman (1)     1.515       1.567
Stefan (1)      1.197       1.230
Boat (2)        1.460       1.492
Bamboo (3)      2.415       2.439
Wendy (4)       1.574       1.618
Bigships (5)    2.309       2.347

There is another hardware issue addressed in the proposed algorithm. The correlation between the best intra prediction mode of the original MB in the original encoder and that of the reconstructed MB in the embedded encoder is quite high. Therefore, after the intra prediction in the H.264 encoder, the best intra mode information can be reused for our embedded encoder even if the final mode of the current coded MB is not intra mode. Intra prediction accounts for most of the computational complexity in the embedded encoding. With this modification, the embedded encoder can be implemented with fewer hardware resources.
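The scheme 3 path (intra compensation, residue, entropy coding, with no transform or quantization) can be sketched for a single 4x4 block. This is an illustration under stated assumptions, not the paper's implementation: only the vertical intra prediction mode is modeled, and signed Exp-Golomb codes stand in for the entropy coder that the paper shares with the H.264 encoder. All function names are hypothetical.

```python
# Illustrative sketch of the scheme 3 path for one 4x4 block.
# Assumptions: vertical intra prediction only, and signed Exp-Golomb codes
# as a stand-in for the entropy coder shared with the H.264 encoder.

def vertical_predict(top_row):
    """Every row of the predictor repeats the four pixels above the block."""
    return [list(top_row) for _ in range(4)]


def se_encode(value):
    """Signed Exp-Golomb code word for one residue, as a string of bits."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    binary = bin(code_num + 1)[2:]
    return "0" * (len(binary) - 1) + binary


def se_decode(bits, count):
    """Decode `count` signed Exp-Golomb code words from a bit string."""
    values, pos = [], 0
    for _ in range(count):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        code_num = int(bits[pos:pos + zeros + 1], 2) - 1
        pos += zeros + 1
        values.append((code_num + 1) // 2 if code_num % 2 else -(code_num // 2))
    return values


def embedded_encode(block, top_row):
    """Residue = reconstructed pixel - predictor, entropy coded row by row."""
    pred = vertical_predict(top_row)
    return "".join(se_encode(block[r][c] - pred[r][c])
                   for r in range(4) for c in range(4))


def embedded_decode(bits, top_row):
    """Inverse path: decode residues and add the same predictor back."""
    pred = vertical_predict(top_row)
    residues = se_decode(bits, 16)
    return [[pred[r][c] + residues[4 * r + c] for c in range(4)]
            for r in range(4)]


top = [100, 102, 104, 106]
block = [
    [101, 102, 103, 106],
    [100, 103, 104, 105],
    [100, 102, 104, 107],
    [99, 102, 105, 106],
]
bits = embedded_encode(block, top)
print(len(bits), "bits versus", 16 * 8, "bits uncompressed")
print(embedded_decode(bits, top) == block)
```

Since nothing is quantized, the decoder recovers the block exactly; the gain comes only from the residues being small when the prediction is good.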

3.3. Hybrid-mode EC system

There are two modes in the proposed EC system: inter mode and intra mode. The mode selection depends on the mode decision of the H.264 encoder. An inter-predicted MB enters the inter mode, while an intra-predicted MB enters the intra mode. For inter mode, we convert the reconstructed MB to an embedded bitstream by the proposed EC algorithm. For intra mode, the embedded encoder is idle, and the corresponding H.264 bitstream is reused as the embedded bitstream.
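The mode selection described above amounts to a simple dispatch per MB. The sketch below is a hypothetical illustration: the one-bit mode header, the function name, and the representation of bitstreams as strings of "0"/"1" are assumptions made here, since the paper only states that a header in the embedded bitstream lets the decoder determine the MB mode.

```python
# Hypothetical sketch of hybrid-mode selection for one MB.
# Assumed layout: a one-bit header ("1" = intra, "0" = inter) followed by
# the payload. The actual header format is not given in the paper.
from typing import Callable, List

INTRA, INTER = "intra", "inter"


def build_embedded_bitstream(mb_mode: str,
                             h264_mb_bits: str,
                             reconstructed_mb: List[List[int]],
                             inter_encoder: Callable[[List[List[int]]], str]) -> str:
    """Return the bits written to the frame buffer for one MB."""
    if mb_mode == INTRA:
        # Intra mode: reuse the H.264 bitstream; the embedded encoder is idle.
        return "1" + h264_mb_bits
    if mb_mode == INTER:
        # Inter mode: losslessly re-encode the reconstructed MB.
        return "0" + inter_encoder(reconstructed_mb)
    raise ValueError(f"unknown MB mode: {mb_mode!r}")


# Usage with a placeholder inter encoder that returns a fixed payload.
print(build_embedded_bitstream(INTRA, "1011", [[0]], lambda mb: "0101"))
print(build_embedded_bitstream(INTER, "1011", [[0]], lambda mb: "0101"))
```

Passing the inter-mode coder in as a callable keeps the dispatch independent of whichever lossless scheme is plugged in.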

4. ARCHITECTURE

4.1. Embedded encoder

Figure 3 shows the architecture of the embedded encoder. If the current MB is an inter-mode coded MB, the embedded encoder compresses the reconstructed MB into the embedded bitstream. Because the required local information has been prepared during the intra prediction process of the original encoder, and so has the best intra mode, only the intra compensation and entropy coding are involved in our embedded encoder. The predictors generated by intra compensation are subtracted from the reconstructed pixels, and the residues are then coded by the entropy coding module. Please note that these two modules can be shared with the original encoder with an adequate schedule. Both the hardware cost and the computation of the embedded encoder are small. If the current coded MB is an intra-mode MB, the original bitstream of this MB is directly used as the embedded bitstream. The embedded encoder is idle in this situation.

4.2. Embedded decoder

During the loading of SW data, the corresponding embedded bitstream is read from system memory and then decoded by the embedded decoder. The decoding procedure is the reverse of the encoding one. Figure 4 illustrates the architecture of the embedded decoder. The header in the embedded bitstream is decoded first, so that the mode of the loaded MB can be decided. If the MB is an intra-mode MB, the residues are processed through IQ/IDCT and then added to the predictors generated by intra compensation. Otherwise, the bypassing path at the bottom is chosen. Similarly, hardware resource sharing can also be achieved if the schedule of the coding process is carefully designed.

Fig. 3. Proposed architecture of embedded encoder.

Fig. 4. Proposed architecture of embedded decoder.

5. SIMULATION RESULTS

The performance of the proposed hybrid-mode EC is shown in Table 4. The reference software, JM8.5 [2], is modified, and four sequences with SDTV or HDTV formats that match the target applications are selected for the simulation. Two quantization parameters, standing for high and median quality situations, are used. The CR can be obtained by the following equation:

$$\frac{1}{CR_{hybrid}} = \frac{\alpha}{CR_{intra}} + \frac{1-\alpha}{CR_{inter}}$$

where
$CR_{intra}$: CR for intra-mode MBs
$CR_{inter}$: CR for inter-mode MBs
$\alpha$: percentage of intra-predicted MBs

Table 4. The overall simulation results of EC (Bamboo: 720x480; Wendy: 720x576; Crew, Bigships: 1280x720).

Sequence    QP    CR_intra    CR_inter    CR_hybrid
Bamboo      15    5.848       2.433       4.082
Bamboo      30    25.0        2.933       4.651
Wendy       15    4.00        1.618       1.916
Wendy       30    17.86       2.304       2.786
Crew        15    8.065       3.571       6.024
Crew        30    55.56       6.173       9.259
Bigships    15    4.902       2.347       2.387
Bigships    30    23.81       3.279       3.279

The CR of intra-mode MBs is much better than that of inter-mode ones. The overall CR is largely influenced by the percentage of intra-mode MBs. The overall CR, or the bandwidth reduction ratio, can be as high as 9.259. In the worst case, with almost no intra MBs, the CR approaches two, the upper bound of lossless image coding. With our approach, the system bandwidth can be reduced by 66.2% and 75.3% on average for high and median quality situations, respectively. Please note that we did not modify the entropy coding, for the sake of resource sharing. The performance of the proposed EC could be better if appropriate revisions were made according to the statistics.

6. CONCLUSION

In this paper, a hybrid-mode EC for H.264/AVC is proposed to reduce the bandwidth of loading the SW. Two different strategies aimed at intra-mode and inter-mode reconstructed MBs are used to achieve up to 9.2 times CR under the lossless-compression constraint. By resource sharing, the corresponding hardware is designed and integrated into an H.264/AVC codec with almost no area overhead. The simulation results show that our EC can reduce the system bandwidth by 66.2% and 75.3% on average when the QPs are 15 and 30, respectively.

7. REFERENCES

[1] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.

[2] Joint Video Team Reference Software JM8.5, http://bs.hhi.de/~suehring/tml/download/, 2004.

[3] P. H. Frencken, P. H. N. de With, and M. van der Schaar, "An MPEG decoder with embedded compression for memory reduction," IEEE Transactions on Consumer Electronics, vol. 44, no. 3, pp. 545-555, 1998.

[4] M. van der Schaar and P. H. N. de With, "Near-lossless complexity-scalable embedded compression algorithm for cost reduction in DTV receivers," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 923-933, 2000.

[5] Bourge Arnaud and Jung Joel, "Low-power H.264 video decoder with graceful degradation," in Proc. of Visual Communications and Image Processing, 2004, vol. 5308, pp. 372-383.

[6] R. v.d. Vleuten, R. Manniesing, R. Kleihorst, and E. Hendriks, "Implementation of lossless coding for embedded compression," in Proc. of IEEE ProRISC, 1998, pp. 385-389.

[7] Andreas E. Savakis, "Evaluation of algorithms for lossless compression of continuous-tone images," Journal of Electronic Imaging, vol. 11, pp. 75-86, 2002.

[8] M. Y. Hsu, H. C. Chang, and L. G. Chen, "Scalable module-based architecture for MPEG-4 BMA motion estimation," in Proc. of ISCAS, 2001, pp. 245-248.

[9] J. C. Tuan, T. S. Chang, and C. W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Transactions on CSVT, vol. 12, pp. 61-72, Jan. 2002.

[10] T. Wiegand, X. Zhang, and B. Girod, "Long-term memory motion-compensated prediction," IEEE Transactions on CSVT, vol. 9, pp. 70-84, Feb. 1999.