Multi-pass algorithm of motion estimation in video encoding for generic GPU


Yu-Cheng Lin, Pei-Lun Li, Chin-Hsiang Chang, Chi-Ling Wu, You-Ming Tsao, and Shao-Yi Chien

Media IC and System Lab
Graduate Institute of Electronics Engineering and Department of Electrical Engineering
National Taiwan University

yucheng@media.ee.ntu.edu.tw

Abstract- The importance of video encoding has grown rapidly as video data communication has become widely needed. In this paper, we propose a multi-pass algorithm to accelerate motion estimation (ME), the dominant part of video encoding, with the graphics processing unit (GPU). By using the multi-pass method to unroll and rearrange the multiple nested loops, the complex ME can be implemented on a GPU. Besides, ME can be executed efficiently with the built-in parallel processing and texture filter of the GPU. Experimental results show that, by utilizing the computing power of the GPU, speed-ups of about two times and 14 times can be achieved for integer-pel ME and MPEG-1/2 half-pel ME, respectively.

I. INTRODUCTION

Digital entertainment with multimedia content is definitely one of the most important applications in the twenty-first century. Fancy multimedia content, especially video content, is usually accompanied by a huge volume of data. Consequently, video encoding and decoding play key roles in most multimedia applications. Thanks to the development of modern technologies, digital still cameras (DSC) and digital video recorders (DV) have become common and popular consumer electronics products. Everyone can be a multimedia content provider as long as he or she has a DSC or DV. In addition, high-resolution video equipment is occupying more and more market share. In the 3G mobile era, there are potential multimedia content providers everywhere. For these reasons, video encoding has become as important as video decoding.

Comparing the computation profile of MPEG-4 in Table I [1] and of H.264 in Table II [2], it is evident that motion estimation is the most significant part of video encoding. That is, improving the performance of motion estimation would actually speed up the whole video encoding. Thus motion estimation is chosen to be accelerated as the starting point of real-time video encoding.

TABLE I

Tools               Datapath Operation (MIPS)   Percentage (%)
Motion Estimation   24,768.2                    97.94
Transform & Quant   432.527                     1.710
Others              88.0320                     0.348

TABLE II
INSTRUCTION PROFILING RESULTS OF H.264 ENCODER.

Tools               Datapath Operation (MIPS)   Percentage (%)
Motion Estimation   152,897                     94.66
Transform & Quant   5,402.50                    3.345
Others              3,225.30                    1.997

On the other hand, more and more commodity PCs and game consoles are equipped with graphics processing units (GPUs). Graphics processors are designed to perform a mass of operations on a crowd of vertices or pixels, and they do this very efficiently because of their inherent multiple parallel pipelines. In recent years, graphics processors have changed from fixed pipelines to programmable pipelines. Figure 1 shows that the vertex and fragment shaders can be programmed to make changes in the routine pipelines [3] [4]. A GPU no longer has a fixed pipeline but is now more appropriately described as a SIMD parallel processor or a streaming processor [5].

Fig. 1. A programmable graphics pipeline [4].

Moreover, the performance of the GPU evolves much faster than the famous Moore's law for the performance of the CPU, that is, 2.4 times/year versus 2 times per 18 months [6] [7]. With all these advantages over the CPU, especially SIMD operation and parallel processing, there is an active research area of using the GPU for non-graphics oriented operations, such as FFT [6] and motion compensation [7]. However, due to the instruction limit of the GPU and the large amount of input data, the complex motion estimation algorithm, which is the most important part of video encoding, has never been implemented on a GPU. Besides, the time-consuming interpolation computation of fractional motion estimation is another challenge for speed-up.

In the past, motion estimation has been implemented with dedicated hardware of multiple parallel processing elements (PEs) [8] [9] [10]. Graphics processors nowadays have a similar parallel architecture because of the graphics processing requirements. Furthermore, motion estimation is block-wise and thus suitable for SIMD-like GPU processing. In the conventional approach, the powerful GPU is idle when no graphics-related process is going on. It is natural to leverage the GPU to off-load some of the CPU's tasks to achieve the maximum co-throughput of CPU and GPU. In this paper, we propose an algorithm to map motion estimation (ME) onto a generic commodity GPU to accelerate video encoding. We expect to accelerate motion estimation with the GPU and to accomplish a real-time video encoding system hereafter. First, by unrolling and rearranging the loops of the motion estimation, the ME can be implemented on the GPU within the limited instruction number by a multi-pass technique. Second, the large input data are bound as textures in the graphics pipeline to reduce the memory access time. Finally, the GPU has an inherent built-in interpolation engine, which readily accelerates the fractional interpolation process. The implementation details are given in the following section. Note that the nVidia GeForce 6800, a well-known GPU, is used as an example in this paper.

The organization of this paper is as follows. The proposed motion estimation algorithm is described in Sec. II. Next, in Sec. III, the experimental results are shown. Finally, a short conclusion is given in Sec. IV.

II. MOTION ESTIMATION IMPLEMENTATION

When used for non-graphics applications, the nVidia GeForce 6 Series can be viewed as two programmable blocks that run serially: the vertex shader for MIMD processing and the fragment shader for SIMD processing, both with support for fp32 operands and intermediate values [4]. The architecture is shown in Fig. 2. As shown in Fig. 3, the memory bandwidth between the GPU and the video memory is about five times as fast as that between the GPU and the system memory. The texture unit can be used as a random-access data fetch unit to make use of the astonishing 35 GB/sec memory bandwidth.

Fig. 2. The GeForce 6 Series architecture for non-graphics applications [4].

Fig. 3. The overall system architecture of a PC [4]. (Labels recoverable from the figure: CPU, system memory, north bridge, south bridge, GPU, video memory, display; bandwidths of 6.4 GB/s, up to 8 GB/s, and up to 35 GB/s to video memory.)

B. Implementation Environment

We implemented the motion estimation (ME) algorithm on an nVidia GeForce 6800 GT graphics card, which features fully programmable vertex and fragment units and 32-bit floating point frame buffers and textures. The fragment processor can sustain 12 floating-point operations per pixel per clock cycle [11]. Both the vertex and fragment shaders were programmed with OpenGL and the Cg computer language and runtime libraries of nVidia Corporation [12] [13]. The ME is inherently an image-based algorithm. As such, we perform the ME by executing several fragment programs successively in a SIMD-like fashion.

C. A Novel Motion Estimation Algorithm

The motion estimation algorithm is intrinsically complex. Besides the sum of absolute difference (SAD) computation and SAD comparison, there are four loops in the classic ME algorithm shown in Fig. 4 [10]. In our proposed ME algorithm shown in Fig. 5, the four classic nested loops are unrolled and rearranged. Abstractly, the operations on all the pixels can be thought of as executed in parallel. Therefore, the inner loop over processing elements is unrolled and executed almost simultaneously. The original four-tiered loop is transformed into a one-tiered loop and a two-tiered loop. The computation complexity is reduced and the parallelism is increased in this way. In fact, there are only 16 pipelines in the GeForce 6800. All the operations are scheduled, folded, and executed as concurrently as possible in the 16 pipelines. Thus we can still think of each pixel of the fragment shader as a PE working in parallel. We refer to the PE-like pixel as a PE and to the pixel of the texture as a pixel here. It is highly recommended in [3] to include as many instructions as possible.

    Loop(rows of macro blocks (MBs)){
        Loop(columns of MBs){
            Loop(rows of search range (SR)){
                Loop(columns of SR){
                    SAD computation;
                    SAD comparison;
                }
            }
        }
    }

Fig. 4. The pseudocode of the classic integer-pel ME algorithm.
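As a point of reference for the four nested loops of Fig. 4, a minimal CPU-side sketch in Python follows. It is an illustration under stated assumptions rather than the implementation measured in this paper: the function names `sad` and `full_search` are hypothetical, frames are plain lists of rows, and candidates that would leave the frame are simply skipped (the paper instead works on padded frames).

```python
def sad(cur, ref, bx, by, dx, dy, mb):
    """Sum of absolute differences between the MB at (bx, by) in `cur`
    and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(mb):
        for x in range(mb):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, mb=16, sr_min=-16, sr_max=15):
    """Classic full-search integer-pel ME.

    Returns {(mb_row, mb_col): (best_sad, (dx, dy))} for every macroblock.
    """
    height, width = len(cur), len(cur[0])
    result = {}
    for by in range(0, height - mb + 1, mb):              # rows of MBs
        for bx in range(0, width - mb + 1, mb):           # columns of MBs
            best = None
            for dy in range(sr_min, sr_max + 1):          # rows of SR
                for dx in range(sr_min, sr_max + 1):      # columns of SR
                    # Assumption: skip candidates that fall outside the frame.
                    inside = (0 <= by + dy and by + dy + mb <= height and
                              0 <= bx + dx and bx + dx + mb <= width)
                    if not inside:
                        continue
                    cost = sad(cur, ref, bx, by, dx, dy, mb)  # SAD computation
                    if best is None or cost < best[0]:        # SAD comparison
                        best = (cost, (dx, dy))
            result[(by // mb, bx // mb)] = best
    return result
```

Calling `full_search(cur, ref)` with the defaults corresponds to 16x16 macroblocks and a 32x32 search range with displacements in [-16, +15]. Every macroblock and every candidate is visited sequentially here, which is the cost that the multi-pass GPU mapping aims to spread over parallel PEs.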

However, the total instruction amount of full-search block matching motion estimation is much more than the instruction limit of the GeForce 6800 GT. Therefore, a multi-pass ME algorithm on the GPU is proposed. We perform the proposed ME algorithm by executing two fragment programs in two passes. We draw quadrilaterals to invoke fragment programs for every PE. Figure 6 shows the block diagram of the passes we propose to implement the ME on the GPU. We take a sequence in CIF format (352x288) for convenience of illustration in Fig. 7. The hollow circles represent candidates in the search range, while the opaque circles represent PEs.

    Pass 1:
    Loop(candidates per processing element){
        Loop(processing elements){
            SAD computation;
            SAD comparison;
        }
    }
    Pass 2:
    Loop(rows of a MB){
        Loop(columns of a MB){
            Loop(processing elements){
                SAD comparison;
            }
        }
    }

Fig. 5. The pseudo code of the proposed integer-pel ME algorithm.

Fig. 6. The block diagram of the proposed ME algorithm. (Blocks recoverable from the figure. CPU: 352x288-pixel frames; padding to the size of a power of 2; 512x512 pixels transferred to video memory as textures. GPU, pass 1: calculating SAD values of integer positions n times (n = 4); SAD comparison; 512x512 local minimum SADs. GPU, pass 2: comparing the local minimum SADs to generate the global minimum SAD; 32x32 integer-pel MVs; calculating SAD values of fractional positions m times (m = 8); SAD comparison; 32x32 MVs.)

Fig. 7. The illustration of the proposed ME algorithm. (Labels recoverable from the figure: (512/16)x(512/16)x32x32 candidates; pass 1 with 512x512 PEs; pass 2 with 32x32 PEs.)

1) The First Pass: Generating Local Minimum SADs: The whole video frames are first transferred from system memory to video memory as textures to make use of the large memory bandwidth in the following operations. The size of a texture must be a power of two. Thus we prepare a padded reference frame as well as a padded current frame, both of size $W_T \times H_T$. Next, a quadrilateral of size $W_{P1} \times H_{P1}$ is drawn, which can be viewed as $W_{P1} \times H_{P1}$ PEs. We take $MB_{row} \times MB_{col}$ pixels as a macroblock (MB) and let the search range (SR) equal $SR_{row} \times SR_{col}$. Taking all the candidates in the search range of every MB in a frame into consideration, the formula for the number of candidates per PE is:

$$ n = \frac{W_T}{MB_{col}} \times \frac{H_T}{MB_{row}} \times SR_{row} \times SR_{col} \times \frac{1}{W_{P1} \times H_{P1}} \qquad (1) $$

Take $W_T = H_T = 512$, $MB_{row} = MB_{col} = 16$, $SR_{row} = SR_{col} = 32$ (displacements in [-16, +15]), and $W_{P1} = H_{P1} = 512$, as illustrated in Fig. 7. The result is $n = 4$, which means that every four candidates are grouped and mapped to a PE in the corresponding MB. We use texture coordinates to allot four different candidates to each PE so that every candidate is taken into account. This is the only loop in the first pass besides the two loops of the SAD computation. Each PE compares the SADs of its four candidates and writes the motion vector with the smallest SAD value to the texture by using the render-to-texture extensions. Note that, in Fig. 7, only four PEs are shown for an MB in the first pass, instead of the 256 PEs of this example case, for simplification.

2) The Second Pass: Generating Global Minimum SADs As Well As Integer-Pel MVs and Generating Fractional-Pel MVs: The remaining computation for the integer-pel ME is much smaller in pass 2. Thus each MB only needs to be processed by one PE. The formula follows:

$$ \text{Quad size of pass 2} = \frac{W_T}{MB_{col}} \times \frac{H_T}{MB_{row}} \qquad (2) $$

In the second pass, each PE compares the local minimum SAD values of its MB, computed in the first pass and transferred via the texture, to find the global minimum SAD value and the corresponding motion vector. The number of local minimum SADs compared per PE is as follows:

$$ W_{P1} \times H_{P1} \times \frac{1}{\frac{W_T}{MB_{col}} \times \frac{H_T}{MB_{row}}} \qquad (3) $$

In the above example as in Fig. 7, the quadrilateral size of pass 2 would be 32x32, and 256 local minimum SAD values would be compared to find an integer-pel motion vector. Then the fractional-pel ME is computed if needed. The fractional-pel motion estimation is similar to the integer-pel motion estimation. After the integer-pel ME is completed, this pass takes the integer-pel MVs and both the reference and current frames as inputs to find the smallest SADs and the corresponding MVs in the neighboring fractional search candidate positions.
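The arithmetic of (1) to (3) and the two-stage minimum search can be checked with a short Python sketch. It is only an illustration under assumptions: the function names are hypothetical, the candidates of one MB are grouped into consecutive runs of n (the paper allots them through texture coordinates and does not state the exact grouping), and ties between equal SAD values are broken by the ordering of the motion-vector tuples.

```python
def next_power_of_two(v):
    """Smallest power of two that is >= v (texture sides must be powers of two)."""
    p = 1
    while p < v:
        p *= 2
    return p


def candidates_per_pe(w_t, h_t, mb_col, mb_row, sr_col, sr_row, w_p1, h_p1):
    """Eq. (1): all candidates in the frame divided by the number of pass-1 PEs."""
    total = (w_t // mb_col) * (h_t // mb_row) * sr_row * sr_col
    return total // (w_p1 * h_p1)


def pass2_quad(w_t, h_t, mb_col, mb_row):
    """Eq. (2): one pass-2 PE per macroblock."""
    return (w_t // mb_col, h_t // mb_row)


def local_minima_per_pe(w_t, h_t, mb_col, mb_row, w_p1, h_p1):
    """Eq. (3): number of pass-1 PEs divided by the number of pass-2 PEs."""
    quad_w, quad_h = pass2_quad(w_t, h_t, mb_col, mb_row)
    return (w_p1 * h_p1) // (quad_w * quad_h)


def two_pass_min(sads, n):
    """Emulate both passes for one MB.

    sads: list of (sad, mv) pairs, one per candidate of the MB.
    Pass 1: each PE keeps the minimum of its n candidates (local minimum).
    Pass 2: a single PE keeps the minimum of all local minima (global minimum).
    Returns (global_minimum, number_of_local_minima).
    """
    local = [min(sads[i:i + n]) for i in range(0, len(sads), n)]  # pass 1
    return min(local), len(local)                                   # pass 2
```

For the CIF example, `next_power_of_two(352)` and `next_power_of_two(288)` both give 512, `candidates_per_pe(512, 512, 16, 16, 32, 32, 512, 512)` gives 4, `pass2_quad(512, 512, 16, 16)` gives (32, 32), and `local_minima_per_pe(512, 512, 16, 16, 512, 512)` gives 256, matching the values quoted in the text. Since the minimum of the local minima equals the minimum over all candidates, splitting the search into two passes does not change the selected motion vector.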

The fragment processor uses the texture unit to fetch data from memory, optionally filtering the data before returning it to the fragment processor [4]. Thus the built-in texture filter is used to do the bilinear interpolation for the fractional candidates.

3) Future Optimization: From the resources' point of view, the integer-pel ME must be divided into two passes due to the instruction limit of the GPU and the large data amount of video encoding. As the instruction capacity grows, the two passes of integer-pel ME may be combined into one pass. More passes might be needed if quarter-pel ME is included. For optimization, all the loops of the two passes could be unrolled and distributed evenly to achieve the maximum performance.

III. EXPERIMENTAL RESULTS

We performed extensive tests on a PC with an Intel Pentium IV 3.00 GHz CPU, 1024 MB memory, and an nVidia GeForce 6800 GT GPU with 128 MB video memory. The video sequence Stefan in CIF (352x288) format at 30 fps, with the architecture in Fig. 7, is used as an example to measure the processing time.

We compare the motion estimation speed achieved using the CPU only against that achieved with GPU acceleration. Six different search ranges are tested for integer-pel and fractional-pel motion estimation. The experimental results are shown in Table III for integer-pel ME and in Table IV for half-pel ME of MPEG-1/2. The full search algorithm is employed, and there is no quality degradation in PSNR for ME on the GPU. It is interesting to observe that the speed-up peaks at the 32x32 search range. That is, speed-ups of about two times and 14 times can be achieved for integer-pel and half-pel ME, respectively. The bilinear interpolation is readily accomplished with the help of the built-in texture filter. Thus the speed-up of the half-pel ME is much more significant than the speed-up of the integer-pel ME. However, as the search range exceeds 32x32, the speed-up declines. The most likely reason is that the speed-up depends greatly on the memory bandwidth requirement. A larger computation amount and memory bandwidth requirement always accompany a bigger search range, which might be the cause of the performance degradation.

TABLE III
THE COMPARISON OF INTEGER-PEL ME BETWEEN CPU AND GPU

Search range   Frame rate, CPU only (frames/s)   Frame rate, CPU + GPU (frames/s)   Speed-up
16x16          6.451                             11.57                              1.794
32x32          1.694                             3.709                              2.189
48x48          0.795                             1.492                              1.877
64x64          0.462                             0.673                              1.457
80x80          0.308                             0.400                              1.299
96x96          0.220                             0.240                              1.091

TABLE IV
THE COMPARISON OF HALF-PEL ME BETWEEN CPU AND GPU

Search range   Frame rate, CPU only (frames/s)   Frame rate, CPU + GPU (frames/s)   Speed-up
16x16          0.805                             7.275                              9.037
32x32          0.217                             3.091                              14.24
48x48          0.102                             1.374                              13.47
64x64          0.06                              0.65                               10.83
80x80          0.04                              0.393                              9.825
96x96          0.028                             0.236                              8.429

IV. CONCLUSION AND FUTURE WORK

In this paper, we have demonstrated that the GPU can accelerate motion estimation together with the CPU. The ME runs about two times as fast on the GPU as on the CPU for the 32x32 search range. Our proposed algorithm outperforms the classic software-based motion estimation computation because it fully utilizes the GPU's parallel computation power. As we are composing this paper, an nVidia GeForce 7800 GT GPU with 256 MB video memory has come onto the market. The performance degradation problem may be solved by using the more powerful GPU.

Our future work is to further investigate the power of the GPU and to think thoroughly about how to partition the workloads in video encoding between CPU and GPU. The whole software-based video encoding architecture needs to be accomplished as well. Combined with the video decoding system [7], this work would develop a complete real-time video codec system in the near future.

REFERENCES

[1] H.-C. Chang, L.-G. Chen, M.-Y. Hsu, and Y.-C. Chang, "Performance analysis and architecture evaluation of MPEG-4 video codec system," in Proc. IEEE International Symposium on Circuits and Systems, May 2000, pp. 449-452.

[2] S.-Y. Chien, Y.-W. Huang, C.-Y. Chen, H. H. Chen, and L.-G. Chen, "Hardware architecture design of video compression for multimedia communication systems," IEEE Commun. Mag., vol. 43, no. 8, pp. 122-131, Aug. 2005.

[3] C. J. Thompson, S. Hahn, and M. Oskin, "Using modern graphics architectures for general-purpose computing: a framework and analysis," in Proc. IEEE/ACM International Symposium on Microarchitecture, Nov. 2002, pp. 306-317.

[4] M. Pharr, Ed., GPU Gems 2. Addison Wesley, 2005.

[5] B. Khailany, W. Dally, U. Kapasi, P. Mattson, J. Namkoong, J. Owens, B. Towles, A. Chang, and S. Rixner, "Imagine: media processing with streams," IEEE Micro, vol. 21, no. 2, pp. 35-46, Mar. 2001.

[6] K. Moreland and E. Angel, "Simulation and computation: The FFT on a GPU," in Proc. ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, July 2003, pp. 112-119.

[7] G. Shen, G.-P. Gao, S. Li, H.-Y. Shum, and Y.-Q. Zhang, "Accelerate video decoding with generic GPU," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, pp. 685-693, May 2005.

[8] K.-M. Yang, M.-T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1317-1325, Oct. 1989.

[9] Y.-S. Jehng, L.-G. Chen, and T.-D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms," IEEE Trans. Signal Processing, vol. 41, no. 2, pp. 889-900, Feb. 1993.

[10] H. Yeo and Y. H. Hu, "A novel modular systolic array architecture for full-search block matching motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 5, pp. 407-416, Oct. 1995.

[11] J. Montrym and H. Moreton, "The GeForce 6800," IEEE Micro, vol. 25, no. 2, pp. 41-51, Mar. 2005.

[12] R. Fernando and M. J. Kilgard, The Cg Tutorial. Addison Wesley, 2003.

[13] (2004) The GPGPU website. [Online]. Available: http://www.gpgpu.org/