Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, December 13-16, 2005, Hong Kong
HYBRID-MODE EMBEDDED COMPRESSION FOR H.264/AVC VIDEO CODING SYSTEM
Tung-Chien Chen, Yu-Han Chen, Ke-Chung Wu, Liang-Gee Chen
DSP/IC Design Lab, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
{djchen, doliamo}@video.ee.ntu.edu.tw, ddddog@gmail.com, lgchen@video.ee.ntu.edu.tw
ABSTRACT
Applications of high resolution videos have great potential for H.264/AVC. However, due to the multi-frame motion estimation and large search range (SR) requirement, the ultrahigh system bandwidth becomes the challenge for the platform-based video codec. In this paper, a hybrid-mode embedded compression (EC) is proposed. Two different strategies are used to compress the reconstructed macroblocks (MBs) of intra- and inter-mode, respectively. Up to 9.2 times of compression ratio (CR) can be achieved even with the lossless-compression constraint. Besides, with resource sharing, this system can be integrated into an H.264/AVC codec with almost no area overhead. According to the simulation results, the system bandwidth can be reduced by 66.2% and 75.3% on average for high and median quality situations, respectively.
1. INTRODUCTION
H.264/AVC [1] is the new video coding standard developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). It can save 25%-45% and 50%-70% of bitrate compared with MPEG-4 advanced simple profile and MPEG-2, respectively. The coding gain mainly comes from new prediction tools, and enormous computation and ultrahigh memory bandwidth are the penalties. According to instruction profiling, 2.76 tera-operations/s of computational load and 4.25 tera-bytes/s of memory access are required for real-time encoding of SDTV (YUV420, 720x480, 30fps) videos (JM8.5 [2], baseline options, full search, four reference frames, SR [-32, +31]). For platform-based VLSI systems, in which the high computation requirement can be easily met by increasing the parallelism of processing elements, the real challenge is the unacceptable bus bandwidth requirement under limited system resources.
The bandwidth mainly comes from the access of reference data during motion estimation (ME). One common solution is to use EC for the frame buffer access [3]. The EC engine compresses the MBs of the reconstructed frame and transmits the resulting bitstream to the frame buffer. When the video codec system performs ME, the EC engine fetches and decodes the compressed data from the system bus. Depending on the target applications, there are various EC algorithms, which fall into two categories. Lossy ECs [4][5] perform better in bandwidth reduction, but quality degradation occurs due to error propagation. Lossless ones [6] guarantee the highest video quality, but their CRs are limited to two, just like lossless image coding [7]. In this paper, a novel EC system is proposed. Unlike previous ECs that process every MB equally, two different strategies are used for inter- and intra-mode reconstructed MBs, respectively. In this way, our EC can losslessly compress the reconstructed MBs with higher CR.
Fig. 1. The EC system for H.264 video encoder.
The rest of this paper is organized as follows. In Section 2, the bandwidth problem is described, followed by the analysis of the required EC system. Then, the hybrid-mode EC system is proposed in Section 3. The corresponding hardware architecture and the performance evaluation are presented in Section 4 and Section 5, respectively. Finally, Section 6 gives a conclusion.
2. PROBLEM STATEMENT
In ME, in order to find the best matched candidate, a search window (SW) within one reference frame has to be searched. A huge amount of reference data must be loaded from the frame buffer to the ME core, and the traffic is very heavy. Because pixels in neighboring candidate blocks overlap considerably, and so do the SWs of neighboring current MBs, the bandwidth of the system bus can be greatly reduced by designing local buffers to store reusable data [8][9]. By means of local memory access, the external memory bandwidth can be reduced.
For high resolution video applications in H.264, however, the bandwidth is still too large even with the above techniques. A larger frame size means that a larger SR is required to achieve good ME performance, so the system bandwidth increases exponentially. Besides, H.264 supports multiple reference frames [10].
The required reference data are proportional to the number of reference frames, which also heavily increases the system bandwidth. Table 1 and Table 2 show the necessary system bandwidth of ME for different video formats and reference frame numbers, respectively. The analysis is based on the level-C data reuse strategy [9], which is the most frequently adopted nowadays. The system bandwidth increases exponentially with the frame size and linearly with the reference frame number.
Table 1. The necessary system bandwidth of ME for different frame sizes with one reference frame.
Format                      SR      Bandwidth
CIF (352x288, 30fps)        ±16     9.13MB/s
D1 (720x480, 30fps)         ±64     93.3MB/s
HDTV (1280x720, 30fps)      ±128    470MB/s
HDTV (1920x1080)            ±256    2.05GB/s
Table 2. The necessary system bandwidth of ME for multiple reference frames (HDTV, 1280x720, 30fps, with ±128 SR).
# of ref. frames    System bandwidth
2                   940MB/s
3                   1.41GB/s
4                   1.88GB/s
In the HDTV case with 4 reference frames, the bandwidth requirement is not achievable for a platform-based system. For further reduction, EC techniques should be used. Figure 1 shows the EC system for the H.264 video encoder. The EC compresses the reconstructed frame data into an embedded bitstream before sending it out, and the compressed data are then read and decoded while the SW is loaded. Since only the compressed bitstream is transmitted on the system bus, the bandwidth can thus be reduced. Several issues are considered for our EC system in H.264 large frame size applications:
1. Lossless compression: Most ECs for previous standards are lossy. Lossy compression has better CR at the expense of video quality. Problems such as data mismatch between encoding and decoding may occur, which induces error propagation. Therefore, due to the demand for high video quality in HDTV applications, it must be possible to reconstruct the compressed reference frame perfectly.
2. High CR: The system bandwidth reduction ratio is equal to the CR of the EC. The encoding efficiency must be high enough to solve the large system bandwidth problem. In general, the CR limit of lossless image coding is two. However, the EC compresses reconstructed frame data that have already been processed through the DPCM loop of the original encoder. With the information from the original encoder, our EC can achieve higher CR.
3. Resource sharing: There are many encoding schemes with high compression efficiency in the H.264 encoder. Some of them can be chosen as parts of our EC. We can share existing functional blocks in the H.264 codec to reduce the hardware cost of the EC system.
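The entries of Table 1 and Table 2 can be approximately reproduced with a short calculation. The sketch below is not taken from the paper: it assumes the level-C redundancy access factor 1 + SR_V/N with N = 16, counts luma samples only (one byte per pixel), and assumes 30 fps for the 1920x1080 row, whose frame rate the table does not state. The function name me_bandwidth is hypothetical.

```python
N = 16  # macroblock height in pixels

def me_bandwidth(width, height, fps, sr, ref_frames=1):
    """Bytes/s of luma reference data loaded for a search range of +/- sr."""
    sr_v = 2 * sr                  # vertical search span, e.g. +/-16 -> 32 rows
    redundancy = 1 + sr_v / N      # assumed level-C redundancy access factor
    return width * height * fps * redundancy * ref_frames

# Table 1: one reference frame, growing frame size and search range.
formats = [
    ("CIF", 352, 288, 16),
    ("D1", 720, 480, 64),
    ("HDTV 1280x720", 1280, 720, 128),
    ("HDTV 1920x1080", 1920, 1080, 256),   # 30 fps assumed
]
for name, w, h, sr in formats:
    print(f"{name}: {me_bandwidth(w, h, 30, sr) / 1e6:.1f} MB/s")

# Table 2: 1280x720 with +/-128 SR, growing number of reference frames.
for refs in (2, 3, 4):
    print(f"{refs} ref. frames: {me_bandwidth(1280, 720, 30, 128, refs) / 1e6:.0f} MB/s")
```

Under these assumptions the results agree with the tabulated values to within about 0.1%, and the linear growth with the number of reference frames is explicit in the last factor.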
3. PROPOSED ALGORITHM
In this section, a new hybrid-mode EC system for H.264 is proposed to address the issues raised in the previous section. In H.264, an MB is either intra mode or inter mode. For each mode, a different strategy is used to compress the reconstructed data of H.264.
Fig. 2. The H.264 intra-frame coding.
3.1. Intra mode MB
The H.264 encoder converts the original frames to a bitstream, and this bitstream can be converted to reconstructed frames by the H.264 decoder. Therefore, although the H.264 bitstream is lossily compressed data for the original frames, it is losslessly compressed data for the reconstructed frames. Based on this idea, the EC system may directly use the H.264 bitstream as the embedded bitstream, which will be transmitted to and stored in the external memory. During ME for the next original frame, the reference frame, i.e., the previous reconstructed frame, can thus be perfectly recovered from this bitstream. However, this scheme sometimes contradicts the demands of EC. If an MB is inter-mode encoded in H.264, temporal information is required to reconstruct it. In this case, more data are involved when loading the search window, and the system bandwidth cannot be reduced.
In order to properly solve the bandwidth problem, only spatially local data can be involved in the EC system, just as in image coding. For an intra-mode coded MB, both the encoding and decoding processes access only local information in intra prediction. Therefore, the H.264 bitstream of an intra-mode coded MB can be used directly as the embedded bitstream without any bandwidth overhead. Please note that the quantization step in intra-mode encoding reduces the energy and results in better CR. Since our lossless EC is embedded with this quantization step, the CR, or bandwidth reduction ratio, is much higher than that of other EC systems using lossless image coding standards. Consider a special case: the reference frame of the first P frame is an I frame, where all MBs are intra-mode coded. The memory reduction ratio is then equal to the H.264 CR, which is much higher than the limit of lossless image coding. In the general case, the overall CR of our EC system depends on the percentage of intra-mode coded MBs.
3.2. Inter mode MB
For the inter-mode coded MB, an extra encoding scheme is proposed for our EC system. The H.264 intra-frame coding is chosen as the prototype of the embedded encoder in consideration of its good compression capability, low memory usage, and high resource-sharing possibility. Figure 2 shows the original block diagram of H.264 intra-frame coding. For lossless operation, the quantization and inverse quantization are removed first. Besides, in a lossless EC system, since the reconstructed frame must be the same as the original one, the feedback path of the DPCM loop can be omitted. After the above modifications, only the transformation and entropy coding are performed after intra prediction and intra compensation.
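The reason the feedback path can be dropped may be easier to see in a small sketch. The code below is only an illustration under simplifying assumptions, not the proposed hardware: it uses 4x4 blocks, a single DC-style predictor built from the pixels above and to the left (128 when none exist), and no transform or entropy coding. The names dc_predict, encode, and decode are hypothetical.

```python
BLK = 4  # block size used by this sketch

def dc_predict(frame, by, bx):
    """DC predictor from the pixels above and left of the block (128 if none)."""
    neigh = []
    if by > 0:
        neigh += frame[by - 1][bx:bx + BLK]
    if bx > 0:
        neigh += [frame[by + r][bx - 1] for r in range(BLK)]
    return (sum(neigh) + len(neigh) // 2) // len(neigh) if neigh else 128

def encode(frame):
    """Residues predicted straight from the input frame: no feedback loop."""
    h, w = len(frame), len(frame[0])
    res = [[0] * w for _ in range(h)]
    for by in range(0, h, BLK):
        for bx in range(0, w, BLK):
            p = dc_predict(frame, by, bx)
            for r in range(BLK):
                for c in range(BLK):
                    res[by + r][bx + c] = frame[by + r][bx + c] - p
    return res

def decode(res):
    """Rebuild blocks in raster order; earlier blocks supply the neighbours."""
    h, w = len(res), len(res[0])
    out = [[0] * w for _ in range(h)]
    for by in range(0, h, BLK):
        for bx in range(0, w, BLK):
            p = dc_predict(out, by, bx)
            for r in range(BLK):
                for c in range(BLK):
                    out[by + r][bx + c] = res[by + r][bx + c] + p
    return out

frame = [[(7 * r + 3 * c) % 256 for c in range(8)] for r in range(8)]
assert decode(encode(frame)) == frame   # lossless round trip
```

Because nothing is quantized, the neighbours seen by the decoder are identical to the input pixels, so the encoder can predict from its input directly instead of from a locally reconstructed copy.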
We analyze three different transform strategies inherited from the H.264 encoder. The selection depends on both compression performance and hardware considerations.
Scheme 1: The transform is simply the original integer discrete cosine transform (DCT) in the H.264 DPCM loop.
Scheme 2: The Hadamard transform, the other transformation in H.264, which is involved in both the DPCM loop and the encoder issues, is used.
Scheme 3: No transformation. The residue is directly bypassed and entropy coded.
In schemes 1 and 2, the residues are transformed from the spatial domain to the frequency domain. We must find an inverse transformation by which the residues can be losslessly recovered. That is, for any matrix $A$, if $T$ is the transform matrix, the inverse transform matrix $R$ must satisfy
$$R (T A T^T) R^T = A \tag{1}$$
We separate $R$ into two matrices, $R = BD$, and the equation becomes
$$B (D (T A T^T) D^T) B^T = A \tag{2}$$
In scheme 1, $T_{scheme1}$ is the DCT matrix and $D_{scheme1}$ is the IDCT matrix. In order to satisfy (1), the additional matrix $B_{scheme1}$ must be implemented as
$$B_{scheme1} = \begin{pmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/5 & 0 & 0 \\ 0 & 0 & 1/5 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}$$
In scheme 2, $T_{scheme2}$ is the Hadamard transform matrix and $D_{scheme2}$ is the inverse Hadamard transform matrix, which is the same as the Hadamard transform matrix. Similarly, in order to satisfy (1), the matrix $B_{scheme2}$ is
$$B_{scheme2} = \begin{pmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 1/4 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}$$
The matrix $B_{scheme1}$ is not suitable for hardware implementation because it requires dividers. On the contrary, the matrix $B_{scheme2}$ only involves shifters, and no hardware overhead is required. Therefore, scheme 2 is the better choice of the two. Since both scheme 2 and scheme 3 have no hardware overhead, they are compared according to compression performance.
Table 3 shows the simulation results. Several video sequences are simulated with the quantization parameter (QP) set to 15. According to the experiment, the CR is better without transformation. That is because there is no quantization process in the EC, so the transformation cannot successfully compact the energy. Hence, scheme 3 is the best choice among all schemes. It is worth noting that the CR of the EC system increases with larger frame size and higher QP value of the original encoder.
Table 3. The simulation results of CR in schemes 2 and 3 (frame size: (1) 352x288, (2) 720x288, (3) 720x480, (4) 720x576, (5) 1280x720).
Sequence        CR (QP=15)
                Scheme 2    Scheme 3
Foreman (1)     1.515       1.567
Stefan (1)      1.197       1.230
Boat (2)        1.460       1.492
Bamboo (3)      2.415       2.439
Wendy (4)       1.574       1.618
Bigships (5)    2.309       2.347
There is another hardware issue included in the proposed algorithm. The correlation of the best intra prediction mode between the original MB of the original encoder and the reconstructed MB of the embedded encoder is quite high. Therefore, after the intra prediction in the H.264 encoder, the best intra mode information can be reused for our embedded encoder even if the final mode of the current coded MB is not intra mode. The intra prediction occupies most of the computational complexity in the embedded encoding. With this modification, the embedded encoder can be implemented with fewer hardware resources.
3.3. Hybrid-mode EC system
There are two modes in the proposed EC system: inter mode and intra mode. The mode selection follows the mode decision of the H.264 encoder. An inter-predicted MB enters the inter mode, while an intra-predicted MB enters the intra mode. For inter mode, we convert the reconstructed MB to an embedded bitstream by the proposed EC algorithm. For intra mode, the embedded encoder is idle, and the corresponding H.264 bitstream is reused as the embedded bitstream.
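Because every MB contributes either its H.264 bits or its embedded bits, the overall CR of the hybrid system is a weighted harmonic mean of the two per-mode CRs, which is the relation stated in Section 5. The sketch below works through that relation with values copied from Table 4 and converts CR to bandwidth reduction as 1 - 1/CR. The function names are hypothetical, and `a` denotes the fraction of intra-predicted MBs.

```python
def hybrid_cr(a, cr_intra, cr_inter):
    """Overall CR when a fraction `a` of the MBs is intra coded."""
    return 1.0 / (a / cr_intra + (1.0 - a) / cr_inter)

def intra_fraction(cr_hybrid, cr_intra, cr_inter):
    """Solve the same relation for `a` given the three CRs."""
    return (1 / cr_hybrid - 1 / cr_inter) / (1 / cr_intra - 1 / cr_inter)

def bandwidth_reduction(cr):
    """Fraction of frame-buffer traffic removed at compression ratio cr."""
    return 1.0 - 1.0 / cr

# Bamboo at QP=15 (Table 4): roughly 69% of the MBs must be intra coded.
a = intra_fraction(4.082, 5.848, 2.433)
print(f"intra fraction: {a:.2f}, hybrid CR: {hybrid_cr(a, 5.848, 2.433):.3f}")

# CR_hybrid columns of Table 4 (Bamboo, Wendy, Crew, Bigships).
table4_hybrid = {
    15: [4.082, 1.916, 6.024, 2.387],
    30: [4.651, 2.786, 9.259, 3.279],
}
for qp, crs in table4_hybrid.items():
    avg = sum(bandwidth_reduction(c) for c in crs) / len(crs)
    print(f"QP={qp}: average bandwidth reduction {avg:.1%}")
```

Averaging the per-sequence reductions gives 66.2% for QP=15 and 75.3% for QP=30, which matches the averages quoted in the abstract.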
4. ARCHITECTURE
4.1. Embedded encoder
Figure 3 shows the architecture of the embedded encoder. If the current MB is an inter-mode coded MB, the embedded encoder compresses the reconstructed MB into the embedded bitstream. Because the required local information, as well as the best intra mode, has been prepared during the intra prediction process of the original encoder, only the intra compensation and entropy coding are involved in our embedded encoder. Predictors generated by intra compensation are subtracted from the reconstructed pixels, and the residues are then coded by the entropy coding module. Please note that these two modules can be shared with the original encoder with an adequate schedule. Both the hardware cost and the computation of the embedded encoder are small. If the current coded MB is an intra-mode MB, the original bitstream of this MB is directly used as the embedded bitstream. The embedded encoder is idle in this situation.
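The per-MB behaviour described above can be sketched in software as follows. This is a rough illustration and not the proposed architecture: the one-bit mode header, the bit-string representation, and the use of signed Exp-Golomb codes as a stand-in for the shared H.264 entropy coder are all assumptions, and every function name is hypothetical.

```python
def se_encode(v):
    """Signed Exp-Golomb code word for integer v, returned as a bit string."""
    k = 2 * v - 1 if v > 0 else -2 * v      # map signed value to code number
    b = bin(k + 1)[2:]
    return "0" * (len(b) - 1) + b

def se_decode(bits, pos):
    """Decode one code word starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    k = int(bits[pos + zeros:pos + 2 * zeros + 1], 2) - 1
    v = (k + 1) // 2 if k % 2 else -(k // 2)
    return v, pos + 2 * zeros + 1

def embedded_encode(is_intra, h264_bits, recon, pred):
    """Return the assumed header bit followed by the payload for one MB."""
    if is_intra:
        return "1" + h264_bits              # encoder idle: reuse the H.264 bits
    residue = [r - p for r, p in zip(recon, pred)]
    return "0" + "".join(se_encode(v) for v in residue)

def embedded_decode_inter(bits, pred):
    """Recover the reconstructed pixels of an inter-mode MB."""
    assert bits[0] == "0"
    pos, out = 1, []
    for p in pred:
        v, pos = se_decode(bits, pos)
        out.append(p + v)
    return out

recon = [120, 121, 119, 130]                # toy pixel values
pred = [120, 120, 120, 120]                 # predictors from intra compensation
bits = embedded_encode(False, "", recon, pred)
assert embedded_decode_inter(bits, pred) == recon
```

The intra branch performs no computation, which mirrors the idle embedded encoder, while the inter branch needs only a subtraction and the entropy coder.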
4.2. Embedded decoder
While the SW data are loaded, the corresponding embedded bitstream is read from system memory and then decoded by the embedded decoder. The decoding procedure is the reverse of the encoding one. Figure 4 illustrates the architecture of the embedded decoder. The header in the embedded bitstream is decoded first, so that the mode of the loaded MB can be decided. If the MB is an intra-mode MB, the residues are processed through IQ/IDCT and then added to the predictors generated by intra compensation. Otherwise, the bypassing path at the bottom is chosen. Similarly, hardware resource sharing can also be achieved if the schedule of the coding process is carefully designed.
Fig. 3. Proposed architecture of embedded encoder.
Fig. 4. Proposed architecture of embedded decoder.
5. SIMULATION RESULTS
The performance of the proposed hybrid-mode EC is shown in Table 4. The reference software, JM8.5 [2], is modified, and four sequences with SDTV or HDTV formats that match the target applications are selected for the simulation. Two quantization parameters standing for high and median quality situations are used. The CR can be obtained by the following equation:
$$\frac{1}{CR_{hybrid}} = \frac{\alpha}{CR_{intra}} + \frac{1-\alpha}{CR_{inter}}$$
where $CR_{intra}$ is the CR for intra-mode MBs, $CR_{inter}$ is the CR for inter-mode MBs, and $\alpha$ is the percentage of intra-predicted MBs.
Table 4. The overall simulation results of EC (Bamboo: 720x480; Wendy: 720x576; Crew, Bigships: 1280x720).
Sequence    QP    CR_intra    CR_inter    CR_hybrid
Bamboo      15    5.848       2.433       4.082
Bamboo      30    25.0        2.933       4.651
Wendy       15    4.00        1.618       1.916
Wendy       30    17.86       2.304       2.786
Crew        15    8.065       3.571       6.024
Crew        30    55.56       6.173       9.259
Bigships    15    4.902       2.347       2.387
Bigships    30    23.81       3.279       3.279
The CR of the intra-mode MBs is much better than that of the inter-mode ones. The overall CR is therefore largely influenced by the percentage of intra-mode MBs. The overall CR, or the bandwidth reduction ratio, can be as high as 9.259. In the worst case with almost no intra MBs, the CR approaches two, the upper bound of lossless image coding. With our approach, the system bandwidth can be reduced by 66.2% and 75.3% on average for high and median quality situations, respectively. Please note that, for the sake of resource sharing, we did not modify the entropy coding. The performance of the proposed EC can be better if an appropriate revision is made according to the statistics.
6. CONCLUSION
In this paper, a hybrid-mode EC for H.264/AVC is proposed to reduce the bandwidth of loading the SW. Two different strategies aimed at intra-mode and inter-mode reconstructed MBs are used to achieve up to 9.2 times of CR under the lossless-compression constraint. By resource sharing, the corresponding hardware is designed and integrated into the H.264/AVC codec with almost no area overhead. The simulation results show that our EC can reduce the system bandwidth by 66.2% and 75.3% on average when the QPs are 15 and 30, respectively.
7. REFERENCES
[1] Joint Video Team, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.
[2] Joint Video Team Reference Software JM8.5, http://bs.hhi.de/~suehring/tml/download/, 2004.
[3] Peter H. Frencken, P. H. N. de With, and M. van der Schaar, "An MPEG decoder with embedded compression for memory reduction," IEEE Transactions on Consumer Electronics, vol. 44, no. 3, pp. 545-555, 1998.
[4] M. van der Schaar and P. H. N. de With, "Near-lossless complexity-scalable embedded compression algorithm for cost reduction in DTV receivers," IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 923-933, 2000.
[5] Bourge Arnaud and Jung Joel, "Low-power H.264 video decoder with graceful degradation," in Proc. of Visual Communications and Image Processing, 2004, vol. 5308, pp. 372-383.
[6] R. v.d. Vleuten, R. Manniesing, R. Kleihorst, and E. Hendriks, "Implementation of lossless coding for embedded compression," in Proc. of IEEE ProRISC, 1998, pp. 385-389.
[7] Andreas E. Savakis, "Evaluation of algorithms for lossless compression of continuous-tone images," Journal of Electronic Imaging, vol. 11, pp. 75-86, 2002.
[8] M. Y. Hsu, H. C. Chang, and L. G. Chen, "Scalable module-based architecture for MPEG-4 BMA motion estimation," in Proc. of ISCAS, 2001, pp. 245-248.
[9] J. C. Tuan, T. S. Chang, and C. W. Jen, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Transactions on CSVT, vol. 12, pp. 61-72, Jan. 2002.
[10] T. Wiegand, X. Zhang, and B. Girod, "Long-term memory motion-compensated prediction," IEEE Transactions on CSVT, vol. 9, pp. 70-84, Feb. 1999.