Analysis and reduction of reference frames for motion estimation in MPEG-4 AVC/JVT/H.264


Yu-Wen Huang¹,*, Bing-Yu Hsieh¹, Tu-Chih Wang¹, Shao-Yi Chien¹, Shyh-Yih Ma², Chun-Fu Shen², and Liang-Gee Chen¹

1. DSP/IC Design Lab., Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, yuwen@video.ee.ntu.edu.tw

2. Vivotek Incorporation, steve@vivotek.com


ABSTRACT

In the new video coding standard, MPEG-4 AVC/JVT/H.264, motion estimation is allowed to use multiple reference frames. The reference software adopts a full search scheme, and the increased computation is proportional to the number of searched reference frames. However, the reduction of prediction residues is highly dependent on the nature of the sequences, not on the number of searched frames. In this paper, we present a method to speed up the matching process for multiple reference frames. For each macroblock, we analyze the information available after intra prediction and motion estimation from the previous frame to determine whether it is necessary to search more frames. The information we use includes the selected mode, inter prediction residues, intra prediction residues, and motion vectors. Simulation results show that the proposed algorithm can avoid searching up to 90% of the unnecessary frames while keeping the average miss rate of optimal frames below 4%.

1. INTRODUCTION

The Joint Video Team (JVT) gathered experts from ISO/IEC MPEG-4 Advanced Video Coding (AVC) and ITU-T H.264 to develop the latest standard. The new standard significantly outperforms previous ones in bit-rate reduction. Compared to the MPEG-4 advanced simple profile, up to 50% of bit-rate reduction can be achieved. This improvement mainly comes from the prediction part [1]. Motion estimation at quarter-pixel accuracy with variable block sizes and multiple reference frames greatly reduces prediction errors. Even if inter-frame prediction cannot find a good match, intra prediction will make up for it instead of directly coding the texture.

The reference software, JM4.3 [2], adopts full search for both inter and intra prediction. Although there are seven block sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4) for motion compensation, the sum of absolute differences (SAD) of a 4x4-block can be reused for the SAD calculation of a larger block. Thus, variable block size motion estimation (ME) does not lead to much increase in computation. Intra prediction allows 4 modes for 16x16-blocks and 9 modes for 4x4-blocks. Its computational load can be estimated as the SAD calculation of thirteen 16x16-blocks plus extra operations for interpolation, which is quite small compared to inter prediction.

*The author thanks SiS Education Foundation for financial support.

Fig. 1. Searching steps of intra/inter prediction with multiple reference frames in H.264 reference software and our method.

Multiple reference frame ME contributes the heaviest computational load. The required operations are proportional to the number of searched frames. Nevertheless, the decrease of prediction residues depends on the nature of the sequences. Sometimes the prediction gain is very significant, but sometimes a lot of computation is wasted without any benefit. In this paper, we present an effective method to accelerate multiple reference frame ME without significant loss of video quality. The rest of this paper is organized as follows. In Section 2, we analyze the statistics of selected modes, residues, and motion vectors for multiple reference frames. In Section 3, we describe our fast algorithm. Simulation results are shown in Section 4. Finally, Section 5 gives a conclusion.
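The SAD reuse that keeps variable block size ME cheap can be sketched as follows. This is a minimal illustration, not the JM4.3 code: the function names and the list-of-rows pixel layout are assumptions made for the example. For one search position, the sixteen 4x4 SADs are computed once and the SADs of the larger partitions are obtained by summation only.

```python
# Hypothetical helper names; a sketch of SAD reuse, not the JM4.3 code.

def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid of SADs, one per 4x4 sub-block of a 16x16 MB.

    cur and ref are 16x16 lists of pixel rows: the current MB and the
    candidate block at one search position in a reference frame.
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def block_sad(grid, top, left, height, width):
    """Sum 4x4 SADs to get the SAD of a larger partition.

    top, left, height and width are given in units of 4x4 blocks.
    """
    return sum(grid[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))


def partition_sads(grid):
    """SADs of the larger partitions, all derived from the 4x4 grid."""
    return {
        "16x16": [block_sad(grid, 0, 0, 4, 4)],
        # 16x8 means 16 pixels wide and 8 pixels high: top and bottom halves.
        "16x8": [block_sad(grid, r, 0, 2, 4) for r in (0, 2)],
        # 8x16 means 8 pixels wide and 16 pixels high: left and right halves.
        "8x16": [block_sad(grid, 0, c, 4, 2) for c in (0, 2)],
        "8x8": [block_sad(grid, r, c, 2, 2) for r in (0, 2) for c in (0, 2)],
    }
```

With this arrangement the pixel differences are evaluated once per search position, whatever the number of partition shapes.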

2. ANALYSIS

The left side of Fig. 1 shows the searching steps in the H.264 reference software. The prediction of a macroblock (MB) is performed mode by mode with a full search scheme. The allowed modes are inter16x16, inter16x8, inter8x16, inter8x8, intra4x4, and intra16x16. Note that the inter8x8 mode can be further partitioned into smaller blocks. Given an inter-mode, the reference software carries out the matching process reference frame by reference frame. The best mode is chosen by minimizing a Lagrangian cost function, which considers both the 2-D 4x4 Hadamard transformed SAD (SATD) and the number of bits required to code the side information. The right side of Fig. 1 illustrates our method. In Table 1, we can see that 80% of the optimal motion vectors (MVs) determined by the reference software belong to the nearest reference frame. Therefore, we first adopt exhaustive search for intra-modes and for inter-modes from the previous frame. Next, we analyze the available information, including the selected mode, intra prediction residues, inter prediction residues, and MVs, to determine whether it is helpful to search more frames. Intuitively, the prediction gain of multiple reference frames mainly results from occluded and uncovered objects.
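The SATD-based Lagrangian cost used for mode selection can be illustrated with a short sketch. The paper does not give the cost formula explicitly, so the form cost = SATD + lambda x bits, the multiplier `lam`, the function names and the absence of any normalisation of the Hadamard sum are all assumptions of this example.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(residual):
    """Sum of absolute 2-D Hadamard coefficients of a 4x4 residual block.

    No normalisation is applied here; an encoder may scale the sum.
    """
    t = matmul4(matmul4(H4, residual), H4)  # H4 is symmetric, so H4^T == H4
    return sum(abs(v) for row in t for v in row)


def lagrangian_cost(satd, side_info_bits, lam):
    """Assumed cost form: distortion plus lam times the side-information bits."""
    return satd + lam * side_info_bits


def best_candidate(candidates, lam):
    """Pick the (name, satd, bits) tuple with the smallest cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))
```

A larger `lam` favours candidates with cheaper side information, for example a single 16x16 MV over four 8x8 MVs.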

0-7803-7663-3/03/$17.00 ©2003 IEEE

This paper was originally published in the Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, & Signal Processing, April 6-10, 2003, Hong Kong (cancelled). Reprinted with permission.


Table 1. Statistics of Reference Frames.

Sequences             Previous Frame   Others
Coastguard            75%              25%
Container             91%              09%
Foreman               76%              24%
Hall Monitor          92%              08%
Mobile Calendar       36%              64%
Mother and Daughter   92%              08%
Silent                91%              09%
Stefan                65%              35%
Table Tennis          87%              13%
Weather               90%              10%
Average               80%              20%

CIF size, search range [-16, +16], 5 reference frames, QP=30.

We first consider the saving of computation for multiple reference frames from the viewpoint of compression. After prediction, residues are transformed, quantized, and then entropy coded. If we can detect that the transformed and quantized coefficients are very close to zero in the first reference frame, we can turn off the matching process for the remaining frames, since more computation will not cause any further reduction in the coded prediction errors. This concept is very simple and effective. Moreover, DCT, Q, IQ, and IDCT can also be skipped by early detection of all-zero quantized coefficients. The quantization steps of 4x4-residues are described in the following equations:

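\[ \mathit{qp\_per} = QP / 6 \tag{1} \]
\[ \mathit{qp\_rem} = QP \,\%\, 6 \tag{2} \]
\[ \mathit{q\_bits} = \mathit{qp\_per} + 15 \tag{3} \]
\[ \mathit{qp\_const} = (1 \ll \mathit{q\_bits}) / 6 \tag{4} \]
\[ QM[i][j] = \bigl( |TR[i][j]| \times \mathit{quant\_coef}[\mathit{qp\_rem}][i][j] + \mathit{qp\_const} \bigr) \gg \mathit{q\_bits} \tag{5} \]

where QP is the quantization parameter (0-51), QM is the 4x4-quantized magnitude, TR is the 4x4-transformed residues, and quant_coef is a 3-D 6x4x4-matrix. The divisions in (1) and (4) are integer divisions, and % in (2) denotes the remainder.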

If inequality (6) holds,

\[ |TR| < (2^{\mathit{q\_bits}} - \mathit{qp\_const}) / \mathit{quant\_coef} = f(QP) \tag{6} \]

the quantized magnitude will be zero, which means the threshold becomes a function of QP and can be implemented as a look-up table. Besides, TR is not available before transformation, so we assume that the residues are Laplacian distributed and find the relation between SAD or SATD and TR. The threshold is then applied directly to SAD or SATD. The detailed derivation is omitted in this paper.

The mode decision result after intra prediction and ME from the previous frame is also a very important cue.
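The all-zero condition of inequality (6) and the look-up table f(QP) can be sketched in a few lines. This is an illustration under stated assumptions: `DC_COEF` holds the six multipliers that the H.264 quantizer uses for the (0,0) coefficient position, which are assumed here to match quant_coef[qp_rem][0][0] in JM4.3, and the table works on a single transformed coefficient, not on the SAD or SATD mapping whose derivation the paper omits.

```python
# Assumed values of quant_coef[qp_rem][0][0] (H.264 quantizer multipliers
# for the (0,0) position); other positions use different multipliers.
DC_COEF = [13107, 11916, 10082, 9362, 8192, 7282]


def quantize(tr, qp, coef):
    """Quantized magnitude of one transformed residue, following (1)-(5).

    coef stands for quant_coef[qp_rem][i][j]; the caller supplies it.
    """
    qp_per = qp // 6                 # (1)
    q_bits = qp_per + 15             # (3)
    qp_const = (1 << q_bits) // 6    # (4)
    return (abs(tr) * coef + qp_const) >> q_bits   # (5)


def max_zero_magnitude(qp, coef):
    """Largest integer |TR| that still quantizes to zero, from (6)."""
    q_bits = qp // 6 + 15
    qp_const = (1 << q_bits) // 6
    # |TR| * coef + qp_const < 2**q_bits, solved for integer |TR|.
    return ((1 << q_bits) - qp_const - 1) // coef


# Look-up table f(QP) for QP = 0..51, built for the (0,0) position only.
ZERO_THRESHOLD = [max_zero_magnitude(qp, DC_COEF[qp % 6]) for qp in range(52)]
```

For QP=30 and the multiplier 13107, a magnitude of 66 still quantizes to zero while 67 does not.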

In Table 2, A/B is defined as follows. A is the percentage of MBs that select a given mode after intra prediction and ME from the previous frame. B is the percentage of A for which the optimal reference frame and mode remain unchanged after 5 frames are searched. We can see that 73%, 4%, 4%, 17%, and 2% of the MBs are selected as 16x16, 16x8, 8x16, 8x8, and intra, respectively, when only the previous frame is searched. After the remaining 4 frames are searched, 90% of the 16x16-MBs still keep their optimal selection. As for 16x8-MBs, 8x16-MBs, 8x8-MBs, and intra MBs, the percentages for which the optimal mode and reference frame do not change are 65%, 65%, 34%, and 7%, respectively. This means that 73%x90% + 4%x65% + 4%x65% + 17%x34% + 2%x7% = 76.82% of MBs need only the previous reference frame, which is quite consistent with the results in Table 1.

Table 2. Statistics of Selected Modes.

Sequences             16x16   16x8    8x16    8x8     Intra
Coastguard            57/92   06/81   06/78   30/39   01/08
Container             92/95   01/66   01/61   04/20   02/02
Foreman               64/88   08/6?   08/66   17/35   03/??
Hall Monitor          90/98   01/53   01/61   07/44   01/14
Mobile Calendar       49/50   06/28   07/28   37/20   01/11
Mother and Daughter   89/97   03/77   03/76   04/27   01/04
Silent                83/98   03/78   03/79   10/37   01/12
Stefan                47/84   06/63   05/63   38/39   04/07
Table Tennis          76/97   04/71   04/73   13/41   03/08
Weather               87/98   01/65   02/61   09/37   01/02
Average               73/90   04/65   04/65   17/34   02/07

CIF size, search range [-16, +16], QP=30. A/B is defined as follows.
A: % of MBs when only prediction from the previous frame is allowed.
B: % of A keeping the same mode and ref. frame after 5 frames are searched.
(? marks a digit that is illegible in the source.)

Fig. 2. Definition of MV compactness of a MB.

Furthermore, when an MB is split into smaller blocks for motion compensation using only the previous frame, it means that the motion is discontinuous. In this case, the MB may cross object boundaries, where occlusion and uncovering often occur. Thus, there is a greater possibility that the best matched candidates belong to the other 4 reference frames. When the intra-mode gives better prediction than the inter-modes from the previous frame, the MB may belong to uncovered parts or new objects. The best candidates are then very likely to be found in the other 4 reference frames, too.
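The 76.82% figure obtained from the averages of Table 2 can be checked with a few lines of Python. The dictionary below copies the "Average" row of Table 2; the function name is chosen for this example only.

```python
# Average (A, B) percentages per mode, read from the "Average" row of Table 2.
AVERAGE_AB = {
    "16x16": (73, 90),
    "16x8": (4, 65),
    "8x16": (4, 65),
    "8x8": (17, 34),
    "intra": (2, 7),
}


def unchanged_share(stats):
    """Percentage of all MBs whose mode and reference frame stay unchanged.

    Each mode contributes A% of the MBs, of which B% remain unchanged.
    """
    return sum(a * b for a, b in stats.values()) / 100.0


print(round(unchanged_share(AVERAGE_AB), 2))  # prints 76.82
```

The 16x16 mode alone contributes 65.7 of the 76.82 percentage points, which is why the selected mode is such a strong cue.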

Now we try to find the correlation between the MV distribution and the optimal reference frames. After ME from the previous frame, we have one MV for a 16x16-MB, two MVs for a 16x8-MB, two MVs for an 8x16-MB, and four MVs for an 8x8-MB. If the best mode is 16x16, 16x8, or 8x16, the MV compactness is defined for each of the 3 modes as shown in Fig. 2. Next, we keep on searching the other 4 frames. If neither the optimal frame nor the mode of an MB changes after ME from 5 frames, we classify the MB as type I. Otherwise, we classify it as type II. The average MV compactness of type I and type II for each sequence is shown in Table 3. The MV compactness of MBs whose optimal reference frame belongs to the other 4 frames tends to be larger than that of MBs predicted from the previous frame. Therefore, if the MV compactness of an MB after ME from the previous frame is very small, we should stop searching the remaining 4 frames.

Table 3. Statistics of Average MV Compactness.

Sequences             16x16       16x8        8x16
Coastguard            08.9/12.6   09.9/12.5   07.8/09.5
Container             07.4/12.3   07.7/09.2   07.1/08.9
Foreman               08.3/16.4   10.1/13.2   10.5/13.2
Hall Monitor          08.4/14.5   09.8/12.7   10.7/12.5
Mobile Calendar       07.1/11.8   07.4/11.2   09.5/10.9
Mother and Daughter   06.9/10.5   07.7/11.4   10.0/11.6
Silent                07.2/11.9   10.4/14.2   11.6/16.2
Stefan                08.4/22.4   11.8/16.5   12.7/17.3
Table Tennis          08.6/17.5   08.1/13.4   10.4/16.7
Weather               07.7/11.8   07.1/08.5   07.2/10.7
Average               07.9/14.2   09.0/12.3   09.8/12.8

CIF size, search range [-16, +16], QP=30. I/II is defined as follows.
I: optimal ref. frame and mode do not change after ME from 5 frames.
II: optimal ref. frame and mode change after ME from 5 frames.
The unit of MV compactness is quarter pixel.

The texture is also taken into consideration. The reduction of residues by applying multiple reference frames is more significant at object boundaries, where occlusion and uncovering often occur. The texture of object boundaries should be more complex than that of flat regions. We use the SATD after intra prediction to represent the texture complexity of an MB. In Table 4, intra prediction and ME from the previous frame are first applied to each MB. Then we focus on the MBs whose best mode is 16x16, 16x8, or 8x16. Again, we keep on searching the remaining 4 frames and classify the MBs into two types. MBs that do not change their optimal reference frame and mode are classified as type P, while MBs that end up with a different reference frame or mode are denoted as type Q. On average, the SATD of intra prediction for type P is smaller than that for type Q. Therefore, if an MB is significantly textured, we should search more frames. However, it is also clear from Table 4 that this threshold value should be adaptive to different scenes or sequences.

There are some exceptions in Table 4, such as the 16x8 and 8x16 modes for Hall Monitor and the 16x8 mode for Silent. These are due to the complicated texture of a stationary background that requires only one reference frame. In these cases, if we do not want to lose video quality, the threshold for the SATD of intra prediction should be small enough, which wastes computation on highly-textured stationary background. Fortunately, we can use the MV compactness to prevent these situations. Our algorithm will be described in the next section. Let us summarize the analysis as follows. After intra prediction and ME from the previous frame:

• If 16x16 mode is selected, the optimal reference frame tends to be unchanged.

• If inter-modes with smaller blocks are selected, searching more frames tends to be helpful.

• If MVs of larger blocks are similar to MVs of smaller blocks, it is likely that no occlusion or uncovering occurs in the MB, so one reference frame may be enough.

• If MVs of larger blocks differ more from MVs of smaller blocks, the MB often crosses object boundaries and thus requires more reference frames.

• If the texture of an MB is very complicated, it may require more reference frames.

Table 4. Statistics of Average SATD of Intra Prediction.

Sequences             16x16       16x8        8x16
Coastguard            ?/?         ?/?         ?/?
Container             3103/5218   4813/5242   ?960/6010
Foreman               2104/3094   3103/3269   2562/3290
Hall Monitor          2341/3875   4015/2149   4182/1950
Mobile Calendar       6165/8371   6843/8239   6290/8723
Mother and Daughter   1482/2472   2488/2388   2250/2501
Silent                2881/2904   3010/2604   3227/3256
Stefan                3723/5981   5795/6113   5975/6188
Table Tennis          3025/3920   3773/3420   3725/3827
Average               3307/4833   4276/4276   4330/4556

CIF size, search range [-16, +16], QP=30. P/Q is defined as follows.
P: optimal ref. frame and mode do not change after ME from 5 frames.
Q: optimal ref. frame and mode change after ME from 5 frames.
(? marks values that are illegible in the source; the Weather row is missing from the source.)

Fig. 3. Fast algorithm for multiple reference frames ME

3. PROPOSED ALGORITHM

According to the above analysis, we propose a fast algorithm for multiple reference frame ME that saves much of the computation of full search while maintaining nearly the same video quality. The steps are shown in Fig. 3. Note that TH_inter is a function of QP and is implemented as a look-up table. TH_MV is empirically obtained. TH_intra must be adaptive to different scenes. Currently, we use the number of intra MBs in the previously coded frame to detect a scene change. If more than 10% of the MBs are intra-coded, we adjust TH_intra. The detailed derivation of TH_intra is omitted due to the limited space. As shown in Fig. 4, we connected ten standard sequences together to show the dynamic adjustment of TH_intra according to the number of intra MBs in the previous frame.
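The decision made after the first reference frame can be sketched as follows. This is not a transcription of Fig. 3, which is not reproduced in the text: the order of the tests, the fall-through case, and all class, field and function names are assumptions that only combine the cues summarized in Section 2. The MV compactness is taken as an input because its definition is given in Fig. 2, and the update rule for TH_intra is left out because its derivation is omitted.

```python
from dataclasses import dataclass


@dataclass
class FirstFrameResult:
    """What is known after intra prediction and ME from the previous frame."""
    best_mode: str          # "16x16", "16x8", "8x16", "8x8" or "intra"
    inter_satd: int         # SATD of the best inter prediction
    intra_satd: int         # SATD of the best intra prediction
    mv_compactness: float   # as defined in Fig. 2, in quarter pixels


def search_more_frames(mb, th_inter, th_mv, th_intra):
    """Return True if the remaining reference frames should be searched.

    The order of the tests is an assumption, not taken from Fig. 3.
    """
    if mb.inter_satd < th_inter:
        return False        # residues expected to quantize to all zeros
    if mb.best_mode in ("8x8", "intra"):
        return True         # split or intra MBs often change after more frames
    if mb.mv_compactness < th_mv:
        return False        # MVs agree: probably no occlusion or uncovering
    return mb.intra_satd > th_intra   # textured MBs tend to need more frames


def scene_changed(num_intra_mbs, num_mbs, ratio=0.10):
    """Scene-change cue: more than 10% of the previous frame is intra-coded."""
    return num_intra_mbs > ratio * num_mbs
```

For a CIF frame of 396 MBs, `scene_changed` becomes true from 40 intra MBs upward, at which point TH_intra would be adjusted.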

4. SIMULATION RESULTS

Figure 5 compares the rate distortion curves of the reference software and the proposed algorithm. It is shown that the maximum peak signal to noise ratio (PSNR) drop is 0.2 dB (Hall Monitor). The average PSNR drop is less than 0.05 dB, so that the two curves for each sequence are hardly distinguishable. Table 5 shows the miss detection rate and the false alarm rate of the proposed algorithm. Note that miss detection of optimal reference frames leads to degradation of PSNR, and a false alarm results in waste of computation. In fact, TH_intra provides an easy trade-off between speed and quality. Adjusting TH_intra cannot decrease the miss detection rate and the false alarm rate at the same time. The higher the TH_intra, the more computation is saved. The lower the TH_intra, the smaller the PSNR drop. The average miss detection rate is only 3.90%, which means 96.10% of MBs can find the optimal reference frame. The average false alarm rate is 28.20%, which means we should further improve our algorithm to save this 28.20% of computation while keeping the miss detection rate from rising. Given a budget of computation resources, how to select TH_intra will also be our future work. Figure 6 shows the average number of searched frames for the reference software and the proposed algorithm. It is shown that 10%-67% of ME operations can be saved.

Fig. 4. Adaptive TH_intra according to the number of intra MBs.

Table 5. Miss Detection and False Alarm Rates.

Sequences             Miss Detection   False Alarm
Coastguard            6.21%            41.15%
Container             1.90%            24.96%
Foreman               2.57%            47.42%
Hall Monitor          5.37%            12.83%
Mobile Calendar       5.86%            31.60%
Mother and Daughter   2.67%            24.57%
Silent                3.73%            19.65%
Stefan                5.62%            42.36%
Table Tennis          2.41%            25.15%
Weather               2.64%            12.30%
Average               3.90%            28.20%

CIF size, search range [-16, +16].

5. CONCLUSION

We proposed a simple and effective fast algorithm for multiple reference frame motion estimation. We first analyzed the information available after intra prediction and motion estimation from the previous frame. Then we applied several threshold values to this information to determine whether it is necessary to search more frames. Experimental results showed that our method can save 10%-67% of ME computation, depending on the sequence, while keeping the quality nearly the same as that of the full search scheme.

6. REFERENCES

[1] Committee Draft of Joint Video Specification (ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC), July 2002.

[2] Joint Video Team (JVT) software JM4.3, October 2002.