Efficient hierarchical motion estimation algorithm based on visual pattern block segmentation


1997 IEEE International Symposium on Circuits and Systems, June 9-12, 1997, Hong Kong

EFFICIENT HIERARCHICAL MOTION ESTIMATION ALGORITHM BASED ON VISUAL PATTERN BLOCK SEGMENTATION

Mei-Juan Chen, Liang-Gee Chen, Ro-Min Weng and Yung-Pin Lee

Department of Electrical Engineering, National Taiwan University

Taipei, Taiwan, R.O.C.

ABSTRACT

A new hierarchical block-matching motion estimation algorithm that segments the image with variable block sizes and visual patterns is presented. Block partition with visual patterns, which exploits the properties of the human visual system (HVS), is employed to obtain more precise motion estimation for detailed regions with complex motion in an image. The hierarchical search expedites the motion estimation and simplifies the processing of stationary areas. Only simple control overhead and little side information are required, owing to the regular decomposition of the visual pattern structures. The proposed method outperforms conventional full search block-matching motion estimation while drastically reducing computational complexity. Extensive experimental results are included in this paper.

1. INTRODUCTION

Motion estimation is a powerful technique for eliminating temporal redundancy in video compression. The block-matching approach [1] is popular and has been widely adopted by several video coding standards, such as H.261 [2], H.263 [3] and MPEG [4]. Because the removal of temporal redundancy between successive image frames relies heavily on block-matching motion estimation, the performance and efficiency of a video coding system depend on the accuracy and speed of the block-matching motion estimation algorithm (BMA).

Block-matching algorithms find the motion vector by block-by-block matching. Jain and Jain [1] originally proposed motion estimation based on dividing the image into fixed-size blocks. The motion vector is the location that has the maximum correlation value between blocks in temporally adjacent frames. Each equal-size block is compared against candidate blocks in a search area of the previous frame to find the best-matched one. The success of a BMA lies in its prediction ability. Among the various BMAs, the full search (or exhaustive search) is the most popular one. However, it usually requires a massive computational effort: in a typical hybrid coding system, 51% of the time is spent on motion estimation. For applications where power consumption and processing speed are critical, fast algorithms are clearly necessary. In the past, various fast search algorithms [6][7] have been proposed to alleviate the computational burden imposed by full search. Most of these fast search algorithms greatly reduce the number of searched positions, based on the assumption that the matching criterion increases monotonically as the search position moves away from the best-match position, and thus reduce the computation time at the expense of accuracy.
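As an illustration of the exhaustive search described above, the following is a minimal sketch rather than the implementation used in the paper. It assumes frames stored as lists of rows of luminance values and uses the mean absolute error as the matching criterion; the function names `mae` and `full_search` and the toy frames are hypothetical.

```python
def mae(cur, ref, bx, by, dx, dy, n):
    """Mean absolute error between the n x n block at (bx, by) in `cur`
    and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)


def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, +p] x [-p, +p] that keeps the
    candidate block inside `ref`; return (best_vector, best_error)."""
    h, w = len(ref), len(ref[0])
    best_vec, best_err = (0, 0), float("inf")
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if not (0 <= bx + dx and bx + dx + n <= w and
                    0 <= by + dy and by + dy + n <= h):
                continue  # candidate falls outside the reference frame
            err = mae(cur, ref, bx, by, dx, dy, n)
            if err < best_err:
                best_vec, best_err = (dx, dy), err
    return best_vec, best_err


# Toy frames (assumed data): a 12x12 texture shifted right by 2 pixels.
ref = [[(7 * x + 13 * y) % 251 for x in range(12)] for y in range(12)]
cur = [[ref[y][max(x - 2, 0)] for x in range(12)] for y in range(12)]
vec, err = full_search(cur, ref, bx=4, by=4, n=4, p=3)
```

With a search range of -7 to +7, this loop visits up to 15 x 15 = 225 candidate positions per block, which is the cost that fast and hierarchical searches try to avoid.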

0-7803-3583-X/97 $10.00 © 1997 IEEE

For BMA, there is an implicit assumption that the motion within each block is uniform. This is not always valid if the fixed block size is too large compared to the real objects in an image. The block effect due to moving edges then becomes noticeable and the quality of the prediction suffers. However, a decrease in the block size means that the number of motion vectors to be transmitted increases, and the large overhead degrades coding efficiency. Quad-tree segmentation has been used in image processing [12]. Earlier work [9][10][11] segmented the image into variable block sizes based on bin-tree or quad-tree decompositions before BMA to reach a compromise between performance and bit-rate. Quad-tree segmentation is one of the popular techniques for variable block size coding. According to [10], the quad-tree is a tree structure representation in which each node, unless it is a leaf, generates four children. Each child occupies a quarter of the area of its parent. The subdivision of a parent node into its four children is guided by a homogeneity test, in which a decision is made whether the four subblocks of the temporal difference are homogeneous. If the test fails, the four subblocks are generated and are regarded as four new independent parent blocks to be tested further. After the quad-tree decomposition, block-matching motion estimation is conducted on the variable-size blocks.
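The quad-tree subdivision can be sketched as a short recursion. This is only an illustration: the text does not spell out the homogeneity test of [10], so the sketch assumes a block is homogeneous when the mean absolute temporal difference of each of its four quadrants stays below a threshold. The names `mean_abs` and `quadtree`, the threshold and the toy data are all hypothetical.

```python
def mean_abs(diff, x, y, n):
    """Mean absolute value of the temporal difference over an n x n block."""
    return sum(abs(diff[y + j][x + i]) for j in range(n) for i in range(n)) / (n * n)


def quadtree(diff, x, y, n, min_size, thresh):
    """Return the leaf blocks (x, y, size) covering the n x n block at (x, y)."""
    if n <= min_size:
        return [(x, y, n)]
    half = n // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    # Assumed homogeneity test: every quadrant's mean |difference| is small.
    if all(mean_abs(diff, qx, qy, half) <= thresh for qx, qy in quads):
        return [(x, y, n)]
    leaves = []
    for qx, qy in quads:
        # Each quadrant becomes a new independent parent block to be tested.
        leaves += quadtree(diff, qx, qy, half, min_size, thresh)
    return leaves


# Toy 8x8 temporal-difference image: activity only in the top-left 2x2 corner.
diff = [[0] * 8 for _ in range(8)]
for j in range(2):
    for i in range(2):
        diff[j][i] = 40
leaves = quadtree(diff, 0, 0, 8, min_size=2, thresh=1.0)
```

In this toy case the three quiet 4x4 quadrants remain single leaves, while the active quadrant is split down to 2x2 blocks, so small blocks cluster around the moving detail.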

Natural video images typically consist of regions with widely varying content and activity. Most motion estimation algorithms treat every block equally, regardless of the differing complexity of motion among blocks. To use computing resources efficiently, different computational effort should be spent on blocks containing different amounts of motion. The video image can thus be segmented into regions having widely different degrees of motion. Especially in typical videoconferencing sequences, certain regions, like the speaker's eyes and mouth in a head-and-shoulders sequence, are critical to our subjective evaluation of quality, and relatively small errors can perceptually have a major degrading effect on the overall quality. Such regions tend to dominate the viewer's attention and are intrinsically more difficult to compensate than the 'background' of the sequence.

Motivated by the above considerations, it is desirable to find a segmentation that allocates less computational effort and bit-rate to homogeneous regions and more to areas with complex motion. In this paper, we devise a hierarchical BMA in which visual pattern structures are utilized to describe the segmentation of the blocks. The hierarchical procedure simplifies the motion estimation of blocks with slow motion. For regions experiencing complicated motion, the algorithm provides precise motion estimation by splitting them into visual pattern blocks, which are designed to take advantage of human visual system characteristics for image coding [13]. The visual pattern design is developed using relevant psychophysical and physiological data. The objective of our approach is to obtain uniform quality over the entire image, and especially to reduce the large degradation in areas with quick motion.


2. PROPOSED HIERARCHICAL ALGORITHM

The hierarchical processing methodology is being proposed for an increasing number of applications in image processing and video coding [8][11]. Its success may be mainly attributed to the similar level of efficiency achieved in each hierarchy. In our proposed algorithm, the hierarchy is defined by the block size. To obtain the flexibility of visual pattern block partition while avoiding the excessive overhead or side information needed to characterize a more sophisticated image segmentation, we use top-down (splitting) bit-quads and visual pattern blocks for image decomposition. From the top to the bottom level, the block sizes 32x32, 16x16, 8x8 and the visual pattern block (VPB) are chosen. The image segmentation does not involve upsampling or downsampling because the image has a single resolution in each hierarchy.

The encoding procedure takes place in three stages. In the first stage, an initial segmentation of the current frame into 32x32 blocks is performed; the procedure starts from the largest possible block (32x32). The first stage consists of determining to which bit-quads pattern (as seen in Figure 1) a 32x32 block should be mapped. For each of the four 16x16 subblocks in a 32x32 block, the average absolute frame difference (FD) is used as the segmentation rule. If this value is greater than some threshold, bit 1 is assigned to indicate that the subblock needs further processing to reach satisfactory quality, and half of the average frame difference of this subblock is taken as the adaptive threshold for the test at the next hierarchy. In general, video statistics are not known a priori. For this reason, we develop an adaptive algorithm that is capable of adjusting the thresholds to the statistics of the video activity. To avoid unnecessary division of blocks due to small luminance changes or random noise, the adaptive threshold is compared against a minimum threshold (min_thresh). If the adaptive threshold is smaller than min_thresh, min_thresh replaces it. If, however, the threshold test is satisfied for the 16x16 block, the operation stops and bit 0 is assigned to the subblock. We can use the bit-quads pattern, which is used in binary image processing [5], to describe the condition of the four 16x16 subblocks in a 32x32 block. If bit 0 corresponds to the subblock, the 16x16 block is treated as 'background' and does not need the second stage. Otherwise, bit 1 indicates that the 16x16 block requires the processing of the next hierarchy (stage 2). Since natural images usually contain numerous large, approximately static regions, larger blocks are adequate. Four bits for the pattern index are needed for each 32x32 block at stage 1. The purpose of the first stage of the coder is to find the stationary areas in order to save further processing effort; no motion estimation operation is involved.
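The stage 1 classification can be sketched as follows. This is a hedged illustration under stated assumptions: the raster ordering of the four subblocks, the packing of the four bits into an index, the initial threshold value and the function names `avg_abs_fd` and `stage1` are not given in the text and are chosen here only for the example.

```python
def avg_abs_fd(cur, prev, x, y, n):
    """Average absolute frame difference over the n x n block at (x, y)."""
    s = sum(abs(cur[y + j][x + i] - prev[y + j][x + i])
            for j in range(n) for i in range(n))
    return s / (n * n)


def stage1(cur, prev, x, y, thresh, min_thresh, n=32):
    """Classify the four subblocks of the n x n block at (x, y).

    Returns (bits, next_thresh): bits[k] is 1 if subblock k needs stage 2,
    and next_thresh[k] is its adaptive threshold (None when the bit is 0)."""
    half = n // 2
    bits, next_thresh = [], []
    for qy in (y, y + half):          # assumed raster order of subblocks
        for qx in (x, x + half):
            fd = avg_abs_fd(cur, prev, qx, qy, half)
            if fd > thresh:
                bits.append(1)
                # Half the FD, but never below the minimum threshold.
                next_thresh.append(max(fd / 2, min_thresh))
            else:
                bits.append(0)
                next_thresh.append(None)
    return bits, next_thresh


# Toy frames (assumed data): a flat 32x32 block with two changed subblocks.
prev = [[100] * 32 for _ in range(32)]
cur = [row[:] for row in prev]
for j in range(16):
    for i in range(16, 32):
        cur[j][i] += 8          # top-right subblock: strong change
for j in range(16, 32):
    for i in range(16):
        cur[j][i] += 1          # bottom-left subblock: faint change
bits, next_thresh = stage1(cur, prev, 0, 0, thresh=2.0, min_thresh=1.0)
index = int("".join(str(b) for b in bits), 2)   # assumed 4-bit packing
```

Only the strongly changed subblock is passed on to stage 2, with an adaptive threshold of half its frame difference. The faint change is absorbed as background, which is the purpose of comparing against a threshold before any block matching is done.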

For each 16x16 block with bit 1 in the bit-quads pattern decided in stage 1, a full search block matching is performed at stage 2. A search interval of only -2 to +2 pixels is sufficient. For each of the four 8x8 blocks, the displaced frame difference (DFD) is employed to determine whether the block should be processed further. If the adaptive threshold obtained at stage 1 is satisfied, the operation stops for the 8x8 block. Otherwise, half of the DFD is taken as the adaptive threshold of the next hierarchy. The min_thresh check is the same as in the first stage. The bit-quads pattern indicates the decomposition structure of stage 2 in the same way as in stage 1.
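A possible reading of stage 2 is sketched below; it is not the authors' code. It assumes the DFD is measured as a mean absolute difference, that the 16x16 vector is reused to evaluate each 8x8 subblock, and that subblocks are taken in raster order. The names `mad` and `stage2` and the toy frames are hypothetical.

```python
def mad(cur, prev, x, y, dx, dy, n):
    """Mean absolute displaced frame difference of the n x n block at (x, y)."""
    return sum(abs(cur[y + j][x + i] - prev[y + dy + j][x + dx + i])
               for j in range(n) for i in range(n)) / (n * n)


def stage2(cur, prev, x, y, thresh, min_thresh, p=2, n=16):
    """Full search in [-p, +p] for the n x n block, then test its four subblocks."""
    h, w = len(prev), len(prev[0])
    cands = [(dx, dy) for dy in range(-p, p + 1) for dx in range(-p, p + 1)
             if 0 <= x + dx <= w - n and 0 <= y + dy <= h - n]
    mv = min(cands, key=lambda v: mad(cur, prev, x, y, v[0], v[1], n))
    half = n // 2
    bits, next_thresh = [], []
    for qy in (y, y + half):
        for qx in (x, x + half):
            dfd = mad(cur, prev, qx, qy, mv[0], mv[1], half)
            split = dfd > thresh      # threshold handed down from stage 1
            bits.append(1 if split else 0)
            next_thresh.append(max(dfd / 2, min_thresh) if split else None)
    return mv, bits, next_thresh


# Toy frames (assumed data): a ramp texture shifted right by one pixel,
# with an extra local change in the bottom-right 8x8 subblock.
prev = [[7 * x + 13 * y for x in range(20)] for y in range(20)]
cur = [[7 * (x - 1) + 13 * y for x in range(20)] for y in range(20)]
for j in range(10, 18):
    for i in range(10, 18):
        cur[j][i] += 6
mv, bits, next_thresh = stage2(cur, prev, 2, 2, thresh=4.0, min_thresh=1.0)
```

Here the 16x16 search recovers the global shift, three of the 8x8 subblocks pass the threshold test, and only the disturbed subblock is flagged for stage 3 with a halved threshold.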

At stage 3, for each 8x8 block with bit 1 in the bit-quads pattern, a full search block matching with a -2 to +2 search range is performed around the motion vector obtained at stage 2. The displaced frame difference (DFD) of each 8x8 block is checked to see whether the threshold test is satisfied. If it is not satisfied, a full search block matching with a -3 to +3 search range around the motion vector obtained so far is carried out. In this stage, we use the visual pattern blocks as the segmentation model (as indicated in Figure 2). The size of the large block is 8x8 and that of the small block is 2x2. The design of the visual pattern blocks in an 8x8 block is defined to encompass orientation. An 8x8 block can be uniform (index 0), partitioned into two regions (index 1-14) with respective orientations 90°, 45°, 0° and -45°, or decomposed into four subblocks (index 15). The regions of different colors in Figure 2 represent different objects, each with homogeneous motion within the block. The algorithm computes the DFD of every 2x2 subblock for every displacement in the search range. The minimum DFD is defined as the sum of the individual minimum DFDs of the different parts of a visual pattern block. Although there are 16 possibilities for segmenting the block into separate objects, the computational complexity is the same as that of an 8x8 block matching, at the expense of additional memory elements. The implementation of pattern 15 needs more bit-rate for coding the motion vectors, so there is a compromise between performance and bit-rate for pattern 15. The index (0-15, as seen in Figure 2) of the pattern with the best minimum DFD is selected and transmitted. The patterns require only a small overhead rate because the shape and possible number of the final regions are restricted to a predetermined set of options. The residual image is concentrated considerably at the boundaries of moving objects. The visual patterns are selected using a simple viewing geometry model in conjunction with measured properties of biological vision, and are suitable for describing the boundaries of objects in a small block. The visual pattern decomposition provides an effective and economical solution to the problem of object segmentation in block-matching motion estimation.
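The pattern selection can be illustrated with a small sketch. It assumes that the DFDs of the sixteen 2x2 subblocks have already been computed once per displacement and stored in a table, so that every pattern is evaluated by regrouping the same stored values. The four patterns below are hypothetical stand-ins, not the actual set of Figure 2, and the tie-breaking toward the coarser pattern is an assumption of this example.

```python
def pattern_cost(sub_dfd, pattern):
    """Sum over regions of the minimum (over displacements) region DFD.

    sub_dfd[d][k]: DFD of 2x2 subblock k (0..15, raster order) at displacement d.
    pattern[k]:    region label of subblock k.
    Returns (cost, {region: index of its best displacement})."""
    cost, vectors = 0, {}
    for region in sorted(set(pattern)):
        members = [k for k, r in enumerate(pattern) if r == region]
        best_d = min(range(len(sub_dfd)),
                     key=lambda d: sum(sub_dfd[d][k] for k in members))
        vectors[region] = best_d
        cost += sum(sub_dfd[best_d][k] for k in members)
    return cost, vectors


def select_pattern(sub_dfd, patterns):
    """Index of the cheapest pattern; on ties the earlier pattern wins."""
    costs = [pattern_cost(sub_dfd, p)[0] for p in patterns]
    return min(range(len(patterns)), key=lambda i: costs[i])


# Hypothetical patterns on the 4x4 grid of 2x2 subblocks.
UNIFORM = [0] * 16
VERTICAL = [0, 0, 1, 1] * 4                    # left / right halves
HORIZONTAL = [0] * 8 + [1] * 8                 # top / bottom halves
QUADS = [0, 0, 1, 1] * 2 + [2, 2, 3, 3] * 2    # four 4x4 subblocks
patterns = [UNIFORM, VERTICAL, HORIZONTAL, QUADS]

# Two candidate displacements: index 0 fits the left half, index 1 the right.
sub_dfd = [
    [0, 0, 9, 9] * 4,
    [9, 9, 0, 0] * 4,
]
best = select_pattern(sub_dfd, patterns)
cost, vectors = pattern_cost(sub_dfd, patterns[best])
```

In this toy case the two-region vertical pattern and the four-subblock pattern both reach zero cost, and the sketch keeps the two-region pattern because it needs fewer motion vectors. This mirrors the compromise between performance and bit-rate mentioned for pattern 15, although the paper does not state how that compromise is resolved.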

The bit-quads and visual patterns are not transmitted; instead, they are built independently by the receiver and the transmitter, and only a few indexes must be communicated from the transmitter to the receiver to define the shape. The small overhead of the proposed method is due to its structural decomposition property, i.e. partitioning the picture frame into subblocks whose sizes, shapes and locations are predetermined and thus are not transmitted. For hierarchical search, matching is first performed on the 16x16 block to obtain an initial estimate of the vector field; the computed vector field is then propagated to the next stage, where it is corrected and again propagated to the next stage until the necessary stage is reached. For each stage, the vector field to be transmitted is composed of small vectors, which might be efficiently coded using an MPEG-like Huffman coder. The hierarchical search combined with bit-quads and visual patterns reduces the overhead for coding motion vectors.
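The propagation of vectors from stage to stage can be sketched as repeated local refinement. The sketch is a simplification: it uses one cost function for all stages, whereas the algorithm above evaluates a different block size at each stage. The stage ranges 2, 2 and 3 are taken from the text, while the names `refine`, `hierarchical_search` and the toy cost are hypothetical.

```python
def refine(cost, center, r):
    """Best vector within +/- r of `center` under the callable cost(dx, dy)."""
    cx, cy = center
    cands = [(cx + dx, cy + dy)
             for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return min(cands, key=lambda v: cost(*v))


def hierarchical_search(cost, ranges=(2, 2, 3)):
    """Refine a vector stage by stage.

    Returns the final vector and the small per-stage corrections that
    would be coded instead of one large vector."""
    vec, corrections = (0, 0), []
    for r in ranges:
        new = refine(cost, vec, r)
        corrections.append((new[0] - vec[0], new[1] - vec[1]))
        vec = new
    return vec, corrections


# Toy cost (assumed): a single minimum at the true displacement (5, -6).
TRUE = (5, -6)


def toy_cost(dx, dy):
    return abs(dx - TRUE[0]) + abs(dy - TRUE[1])


vec, corrections = hierarchical_search(toy_cost)
```

The three ranges add up to 7 pixels, so a displacement anywhere in the final -7 to +7 window is reachable when the cost decreases toward the true vector, and each transmitted correction stays within the small range of its stage. The three stages test at most 25 + 25 + 49 = 99 candidate positions, compared with 225 for a single -7 to +7 full search.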

3. SIMULATION RESULTS

The performance of the proposed algorithm is demonstrated on five video sequences: Claire, Miss America, Susie, Salesman and Table Tennis. These sequences were chosen for their different motions and characteristics. Claire and Miss America are typical head-and-shoulders sequences. Susie and Salesman involve local motions. These four sequences show a speaker imposed on a static background. Table Tennis, however, includes zooming and large displacements. Each sequence consists of 60 luminance-only frames at 30 frames/sec. The performance of the proposed method is evaluated and compared with the conventional full search (block size 16x16), the three-step fast search (block size 16x16) and variable block size quad-tree segmentation block matching [10]. For the quad-tree segmentation block-matching scheme, the largest possible block has 32x32 pixels, whereas the smallest one consists of 4x4 pixels; the full search is then applied. All the schemes use the mean absolute error (MAE) as the matching criterion for simplicity, and the final search ranges of motion estimation are all set to -7 to +7 with integer precision.

The aforementioned methods are compared in terms of PSNR, computational complexity, block number, total bit-rate and subjective quality. The PSNR gives an objective measurement of the quality of the reconstructed image. The computational complexity, which is important for real-time video coding, is expressed in units of one 16x16 block-matching operation. The total bit-rate consists of two components. One is the entropy of the prediction error image. The other is the bit-rate for coding the motion vectors, which is obtained by multiplying the motion vector entropy by the total block number. The measurement of the motion vector bit-rate also includes the overhead for transmitting the pattern indexes used in our strategy, or the code addressing the quad-tree structure for the quad-tree decomposition method.
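The measures above can be sketched as follows. This is an illustration under assumptions: 8-bit luminance (peak value 255), first-order entropy, each motion vector treated as one joint symbol, and the result normalized to bits per pixel as in the table columns. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def psnr(orig, recon, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-size 2-D images."""
    n = len(orig) * len(orig[0])
    mse = sum((a - b) ** 2
              for ro, rr in zip(orig, recon) for a, b in zip(ro, rr)) / n
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def entropy(symbols):
    """First-order entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def total_bits_per_pixel(error_image, vectors, overhead_bits):
    """Prediction-error entropy plus motion-vector cost, both per pixel."""
    pixels = len(error_image) * len(error_image[0])
    err_entropy = entropy([v for row in error_image for v in row])
    # Vector entropy times the number of blocks, plus pattern-index overhead.
    mv_bits = entropy(vectors) * len(vectors) + overhead_bits
    return err_entropy + mv_bits / pixels


# Toy data (assumed): a 4x4 error image and four block vectors.
error_image = [[0, 1, 0, 1]] * 4
vectors = [(0, 0), (0, 0), (1, 0), (1, 0)]
rate = total_bits_per_pixel(error_image, vectors, overhead_bits=4)
quality = psnr([[100] * 4] * 4, [[101] * 4] * 4)
```

Because the vector cost scales with the number of blocks, a segmentation that uses fewer blocks can afford a somewhat higher per-vector entropy and still spend a similar number of bits on motion information.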

Table 1 provides the average performance comparison of the proposed hierarchical algorithm with the other approaches. The superiority over the previous methods is evident. Regardless of picture type, the proposed method outperforms the conventional schemes, even the full search, in objective quality (in terms of PSNR). The full search algorithm needs a massive computational effort. Even for the fast-moving sequence (Table Tennis), the computation of our method is lower than that of the three-step search. The performance improvement and computation reduction of the proposed algorithm arise because large blocks that exhibit low picture activity, such as backgrounds, are usually compensated quite successfully with only small vectors; for regions with complicated motion, however, the algorithm divides the blocks into a lower hierarchy or into visual pattern blocks to estimate motion accurately, and thus attains better accuracy of the reconstructed image and a lower bit-rate for the error images. Owing to the hierarchical procedure, no large amount of computation is needed for the visual pattern blocks, and the vector distribution is smoother, so a lower bit-rate is needed for coding the motion vectors. With respect to bit-rate, the proposed algorithm requires very little overhead for coding the bit-quads and visual pattern indexes. With fewer blocks and little overhead, the bit-rate of the proposed method is comparable to that of the other methods. In [14], the edge information in a block is extracted to guide block matching. Although that scheme makes use of the edge features, its performance is inferior to that of the full search.

Figure 3 illustrates the subjective quality comparison of the pictures reconstructed with the motion-compensated vectors of the proposed algorithm and of the full search for the Table Tennis sequence. The figures show that the proposed algorithm applied to visual pattern blocks compensates the motion so accurately that there is little need for coding the error signal. Both uniform areas and the many edges and details in the image are well compensated. The image quality is clearly better when the proposed algorithm is used. Moreover, the blocking artifact present with fixed-size block matching is almost nonexistent. In particular, observe small details such as the player's hands, the boundary of the banner and the corner of the table. The quad-tree segmentation using the homogeneity test rule of temporal difference tends to produce smaller blocks in a cluster. Although the use of smaller blocks results in higher adaptivity, the correlation among blocks cannot be exploited, which limits the accuracy and the compression ratio achieved. The bit-quads and visual patterns, however, provide a segmentation that fits the real objects. In particular, observe the partition of visual patterns used at the player's left ear, hands and the ball. The simulation shows that our method gives more reliable motion estimation with much lower computation.

4. CONCLUSION

A new hierarchical algorithm for block-matching motion estimation has been presented which combines the concepts of variable block size and visual pattern segmentation. By making use of variable block size segmentation, different computational effort can be spent on regions having different complexity of motion. In particular, we apply the visual patterns of single-image coding to the block decomposition of smaller blocks to obtain very accurate motion compensation for regions with complex motion. We develop an adaptive algorithm that is capable of adjusting the thresholds to the statistics of the video activity. Extensive experiments have shown that the proposed method reduces computational complexity while providing better performance than conventional block-matching methods.

REFERENCES

[1] J.R. Jain and A.K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Trans. Commun., Vol. COM-29, No. 12, pp. 1799-1808, December 1981.

[2] "CCITT Standard H.261, Video codec for audio visual services at p x 64 kbps", July 1990.

[3] ITU-T Study Group 15, Working Party 15/1, "Draft ITU-T Recommendation H.263 - Video Coding for Low Bit-Rate Communication", July 1995.

[4] "ISO/IEC 13818-2 Generic Coding of Moving Pictures and Associated Audio: Recommendation H.262", November 1993.

[5] W.K. Pratt, Digital Image Processing, A Wiley-Interscience Publication, 1991.

[6] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing", Proc. NTC81, pp. G5.3.1-G5.3.5, New Orleans, LA, December 1981.

[7] M.J. Chen, L.G. Chen and T.D. Chiueh, "One-dimensional Motion Estimation Algorithm for Video Coding", IEEE Trans. CAS for Video Technology, Vol. 4, No. 5, pp. 504-509, October 1994.

[8] M. Bierling, "Displacement estimation by hierarchical block-matching", SPIE, Vol. 1001 Visual Communications and Image Processing, pp. 942-951, 1988.

[9] M.H. Chan, Y.B. Yu, A.G. Constantinides, "Variable size block matching motion compensation with applications to video coding", IEE Proceedings, Vol. 137, No. 4, August 1990.

[10] V. Seferidis, M. Ghanbari, "Generalised block-matching motion estimation using quad-tree structured spatial decomposition", IEE Proceedings - Vision Image Signal Processing, Vol. 141, No. 6, pp. 446-452, December 1994.

[11] X. Xia and Y.Q. Shi, "A thresholding hierarchical block matching algorithm for motion estimation", ISCAS, pp. 624-627, 1996.

[12] E. Shusterman and M. Feder, "Image compression via improved quadtree decomposition algorithms", IEEE Trans. Image Processing, Vol. 3, No. 2, pp. 207-215, March 1994.

[13] D. Chen and A.C. Bovik, "Visual Pattern Image Coding", IEEE Trans. Communications, Vol. 38, No. 12, pp. 2137-2145, December 1990.

[14] S. Zhong, F. Chin, Y.S. Cheung and D. Kwan, "Hierarchical motion estimation based on visual patterns for video coding", ICASSP, pp. 2323-2326, 1996.


Sequence          Algorithm     PSNR (dB)   Match    Block Number   Entropy (bits/pixel)
                                                                    DFD     MV+Over.   Total

(label lost)      Proposed      42.95       1216     124            2.15    0.010      2.160
                  Full-Search   42.28       64148    308            2.17    0.006      2.176
                  Three-Step    42.21       7954     308            2.18    0.006      2.186
                  Quadtree      42.43       15392    249            2.17    0.012      2.182

Miss America      Proposed      38.78       4118     218            3.52    0.023      3.543
                  Full-Search   38.52       64148    308            3.50    0.022      3.522
                  Three-Step    38.15       7931     308            3.56    0.021      3.581
                  Quadtree      38.28       13739    154            3.56    0.010      3.570

(label lost)      Proposed      36.18       6123     325            3.82    0.039      3.859
                  Full-Search   35.65       64148    308            3.84    0.012      3.852
                  Three-Step    35.14       7902     308            3.90    0.012      3.912
                  Quadtree      35.45       19093    437            3.87    0.029      3.899

(label lost)      Proposed      --          --       --             3.98    0.024      4.004
                  Full-Search   --          --       --             4.02    0.007      4.027
                  Three-Step    --          --       --             4.04    0.008      4.048
                  Quadtree      --          --       --             4.07    0.017      4.087

(label lost)      Proposed      --          --       --             5.35    0.069      5.419
                  Full-Search   --          --       --             5.28    0.023      5.303
                  Three-Step    --          --       --             5.49    0.025      5.515
                  Quadtree      --          --       --             5.37    0.062      5.432

Unplaced value in the source: 22969.

Table 1. The performance comparison for various algorithms and video sequences

Figure 1. The bit-quads pattern used in stage 1 (segmenting a 32x32 block into four 16x16 blocks) and stage 2 (segmenting a 16x16 block into four 8x8 blocks)

Figure 2. Set of visual pattern blocks used in stage 3 (segmenting an 8x8 block into visual pattern blocks)

Figure 3. Subjective comparison of proposed method and other approaches for Table Tennis sequence. Panels: Original Picture; Proposed Algorithm; Segmentation of Proposed Algorithm; Quad-tree Segmentation.