Predictive line search: an efficient motion estimation algorithm for MPEG-4 encoding systems on multimedia processors


Abstract—This paper describes an efficient motion-estimation algorithm, the predictive line search (PLS), for real-time implementations of an MPEG-4 encoder on multimedia processors. The motion-vector predictor is used as the starting point in the search process because the correlation between neighboring motion vectors is strong. The line search pattern is used in the proposed algorithm to reduce memory access as well as to exploit the special multimedia processor instructions for sum of absolute differences calculations. Experimental results show that the performance of the PLS is very close to that of the full-search (FS) algorithm. Compared with the well-known diamond search and one-dimensional FS, the PLS shows better performance and robustness, especially for high-motion sequences. A prototype MPEG-4 encoding system is implemented on a 216-MHz multimedia processor with a very long instruction word architecture to verify the effectiveness of the PLS. Real-time encoding of MPEG-4 Simple Profile Level 3 (CIF, 30 fps) can be achieved with only 57% of the processor load.

Index Terms—Motion estimation, multimedia processor, predictive line search (PLS), real-time MPEG-4 encoder.

I. INTRODUCTION

MPEG-4 [1] has become one of the dominant standards for multimedia communication. The main issues addressed by MPEG-4 are content-based interactivity, universal accessibility, and improved compression. In order to support these complex functionalities, the video-coding system must be built on a platform that is both flexible enough for various tools and powerful enough to achieve real-time requirements. Therefore, multimedia processors [2] are the natural choice to implement such a real-time video-coding system because they combine the flexibility of programmable processors and the processing power of parallel architectures.

Manuscript received January 1, 2001; revised November 20, 2002. This paper was recommended by Associate Editor H.-F. Sun.

Y.-W. Huang and L.-G. Chen are with the DSP/IC Design Laboratory, Department of Electrical Engineering and the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, 106 Taiwan, R.O.C. (e-mail: yuwen@video.ee.ntu.edu.tw; lgchen@video.ee.ntu.edu.tw).

S.-Y. Ma and C.-F. Shen are with Vivotek Inc., Taipei County, Taiwan, R.O.C. (e-mail: steve@vivotek.com; diego@vivotek.com).

Digital Object Identifier 10.1109/TCSVT.2002.808093

In almost all video compression standards, including the MPEG-4 visual part, block-matching motion estimation is the most computationally intensive part. The simplest and most effective method of motion estimation is to exhaustively search all the candidates in the search range and find the best-matching position, the one with the lowest distortion; this is called the full search (FS) algorithm. The distortion measure is usually the sum of absolute differences (SAD) because of its simplicity. If the maximum allowable displacement for a motion vector is $p$ pixels, then there are $(2p+1)^2$ candidates to compare for each macroblock, and each comparison needs $N^2$ absolute-difference operations if the size of a macroblock is $N \times N$. Thus, FS motion estimation may consume as much as 80% of the total computational power in a typical video encoding system.
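To make the cost of the FS approach concrete, the following is a minimal sketch of SAD-based exhaustive block matching. It assumes a displacement range of $\pm p$ pixels and an $N \times N$ macroblock; the function names, the frame representation (lists of rows), the rule of skipping candidates outside the frame, and the synthetic test frames are illustrative assumptions, not taken from the paper.

```python
def sad(cur, ref, x, y, dx, dy, n):
    """SAD between the n x n block of `cur` at (x, y) and the block of
    `ref` displaced by (dx, dy). Frames are lists of rows of integers."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + dy + j][x + dx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, x, y, p, n):
    """Test every displacement in [-p, p] x [-p, p]; return (sad, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Assumption: candidates reaching outside the frame are skipped.
            if not (0 <= x + dx <= width - n and 0 <= y + dy <= height - n):
                continue
            cost = sad(cur, ref, x, y, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def full_search_workload(p, n):
    """(candidates, absolute-difference operations) per macroblock."""
    candidates = (2 * p + 1) ** 2
    return candidates, candidates * n * n


def pixel(x, y):
    """Hypothetical synthetic texture used only for this example."""
    return (7 * x + 13 * y + x * y) % 251


# Synthetic 16 x 16 frames: `cur` is `ref` shifted by (dx, dy) = (2, 1).
ref = [[pixel(x, y) for x in range(16)] for y in range(16)]
cur = [[pixel(x + 2, y + 1) for x in range(16)] for y in range(16)]
result = full_search(cur, ref, x=6, y=6, p=3, n=4)
```

With $p = 16$ and $N = 16$, `full_search_workload` gives 1089 candidates and 278 784 absolute-difference operations per macroblock, which illustrates why the exhaustive search dominates the encoder's workload.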

In order to reduce the extremely high complexity of the FS approach, many fast algorithms for block-matching motion estimation have been proposed. The three-step search [3], new three-step search [4], one-dimensional FS (ODFS) [5], four-step search [6], block-based gradient descent search [7], center-biased diamond search (DS) [8], and advanced diamond zonal search [9] are among the most famous fast algorithms.

These algorithms are designed to search as few candidates as possible without a significant drop in quality. However, the features of the MPEG-4 compression standard and the special architecture of multimedia processors are not considered in these algorithms. Therefore, the “fewest-search-point” criterion for optimization of the motion estimation may not be feasible for MPEG-4 video compression systems on multimedia processors. The goal of this paper is to develop an efficient algorithm for block-matching motion estimation optimized for real-time MPEG-4 video coding systems on multimedia processors. The detailed algorithm is described in the next section, followed by experimental results, discussions, and a conclusion.

II. PREDICTIVE LINE SEARCH (PLS) ALGORITHM

A. Motion-Vector Prediction

Since fast motion-estimation algorithms will not search all the candidates in the search range, the distance between the starting point and the best-matching point is directly related to the total number of searched candidates and, therefore, to the complexity.

Many algorithms use the center-biased approach, which starts from the origin because it is the most probable position for the best-matching point. However, the algorithm proposed in this paper, the PLS, starts at the motion-vector predictor to exploit the characteristics of the motion field in natural video and the MPEG-4 motion-vector coding method.

The coding method for motion vectors in the MPEG-4 standard is predictive coding. The motion-vector predictor is obtained by calculating the median of the motion vectors of the three neighboring macroblocks, as shown in Fig. 1. Only the error of motion-vector prediction is coded in the bitstream. The basic principle for motion-vector prediction is that the motion field of natural video is gentle, smooth, and varies slowly [4]. Therefore, the correlation between motion vectors of neighboring macroblocks is very strong.
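The median prediction described above can be sketched as follows. This is a minimal illustration that assumes the median is taken separately for the horizontal and vertical components of the three neighboring vectors (MV1, MV2, MV3 in Fig. 1); the function names and the sample vectors are hypothetical.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def predict_mv(mv1, mv2, mv3):
    """Component-wise median of three neighboring motion vectors (x, y)."""
    return (
        median3(mv1[0], mv2[0], mv3[0]),
        median3(mv1[1], mv2[1], mv3[1]),
    )


def mv_residue(mv, predictor):
    """Prediction error that would be coded instead of the vector itself."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


# Hypothetical neighbors of the current macroblock.
predictor = predict_mv((-4, -2), (-5, -1), (0, -3))
residue = mv_residue((-4, -4), predictor)
```

When neighboring vectors are strongly correlated, the residue is small or zero, so a search that starts at the predictor begins close to the best-matching point.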

1051-8215/03$17.00 © 2003 IEEE

Fig. 1. Motion-vector prediction: the predictor for the current macroblock is the median of MV1, MV2, and MV3.

Fig. 2. Motion-vector distribution for the Foreman sequence. (a) Distribution of the motion vector. (b) Distribution of the motion-vector residue after prediction.

Fig. 2 shows the distribution of motion vectors and the distribution of motion-vector residues after prediction for the Foreman sequence. The search range is (-16, 15) and the macroblock size is 16 × 16 in this case. As we can see in this figure, about 24% of the motion vectors are located at the origin; this makes the center-biased approach feasible. However, after applying the MPEG-4 motion-vector prediction, more than 61% of the motion-vector residues are at the origin. Therefore, if we start our search from the position of the motion-vector predictor, it is very likely that the best-matching point can be obtained in the early stages of the search process and the complexity can be reduced significantly. Also, since the coded bit length of a motion-vector residue increases with the distance from the motion-vector predictor, there is a higher probability of getting shorter motion-vector codes by starting from the predictor. In fact, advanced diamond zonal search [9] also adopts this scheme to further improve the performance, and we will later show the difference between starting from the origin and starting from the motion-vector predictor.

B. Considerations for Multimedia Processors

There are three main features of multimedia processors [2], [10], [11] that may impact the performance of motion estimation. They are: 1) wider data path compared with general-purpose processors; 2) subword parallel architecture (SWP) to deal with multiple pixels simultaneously; and 3) special instructions for SAD calculation. Compared with general-purpose processors, the SAD can be calculated much more efficiently because, in one clock cycle, the processor is able to execute shift, subtract, absolute, and accumulation operations on many pixels in parallel.

Fig. 3. Memory access comparison: (a) one-line search for PLS versus (b) one-point search for DS.

However, this means that control instructions carry more weight in the complexity of a motion-estimation algorithm on multimedia processors, because one control instruction now takes the same time as many SAD operations. For an algorithm to execute efficiently on multimedia processors, it should be as simple as possible to reduce the control overhead.

Another issue for efficient motion estimation is data access. In most fast algorithms, the next search position depends on the result of the current search step and cannot be obtained in advance. Motion estimation requires massive memory access; if a fast algorithm has a regular search pattern, data reuse can be applied and the amount of memory access can be greatly reduced.

Fig. 3 shows an example of regular data access versus irregular data access. The macroblock size is 16 × 16 and the search range is (-16, 16). Fig. 3(a) shows the amount of data required for a line search pattern of 33 consecutive points. Most of the reference pixel data for the next candidate can be obtained by shifting the current reference pixel data. The total number of pixels loaded into registers is $16 \times 16 + (16 + 32) \times 16 = 1024$. On the other hand, for an isolated search point, the data for the current macroblock and the reference macroblock are required, as shown in Fig. 3(b). The total number of pixels loaded into registers is $16 \times 16 + 16 \times 16 = 512$. Comparing the 1024 pixels for 33 candidates with the 512 pixels for only one candidate, the line search pattern is far more efficient in terms of memory access.
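The pixel counts in this comparison can be reproduced with a short calculation. The sketch below assumes that a horizontal line of consecutive candidates shares one strip of reference pixels and that the current macroblock is loaded once; the function names are illustrative.

```python
def pixels_for_line(n, points):
    """Pixels loaded to test `points` consecutive candidates on one line.

    The current n x n macroblock is loaded once, and the reference strip
    is (n + points - 1) pixels wide and n pixels tall.
    """
    current = n * n
    reference = (n + points - 1) * n
    return current + reference


def pixels_for_point(n):
    """Pixels loaded to test one isolated candidate: current + reference."""
    return 2 * n * n


line_total = pixels_for_line(16, 33)        # one-line search, Fig. 3(a)
point_total = pixels_for_point(16)          # one-point search, Fig. 3(b)
line_per_candidate = line_total / 33        # amortized cost per candidate
```

Under these assumptions the line search loads roughly 31 pixels per candidate, against 512 for an isolated point.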

C. The Proposed Algorithm

From the considerations of the above two subsections, we developed our fast algorithm, the PLS algorithm, with simplicity and regular search pattern in mind.

The PLS algorithm is summarized as follows:

Step 1) Search three consecutive lines of candidates centered at the motion-vector predictor. If the motion-vector predictor is located in line $y$, then all points in line $y-1$, line $y$, and line $y+1$ are tested. If the best-matching point is located in line $y-1$, go to Step 2); if the best-matching point is in line $y+1$, go to Step 3); otherwise, go to Step 4).

Step 2) Let $y = y - 1$, then test all points in line $y-1$. If the best-matching point is in line $y$, go to Step 4); otherwise, repeat the current step.

Step 3) Let $y = y + 1$, then test all points in line $y+1$. If the best-matching point is in line $y$, go to Step 4); otherwise, repeat the current step.

Step 4) Report the best-matching point as the position of the motion vector.

Fig. 4. PLS procedure. The search range is (-16, 15), the motion-vector predictor is (-4, -2), and the best-matching point is (-4, -4) in this example.

In short, this method starts from searching three lines around the motion-vector predictor, then searches additional lines in the direction of descending distortion, and stops when the best-matching point is not on the boundary of searched lines.

The search procedure is demonstrated by the example shown in Fig. 4. Assume that the motion-vector predictor is (-4, -2), the true motion vector for this macroblock is (-4, -4), and the search range is (-16, 15). First, the y value of the motion-vector predictor is -2, so all candidates in line -1, line -2, and line -3 are searched (a). The best-matching point in this step is at (-5, -3), which is on the boundary of the searched lines, so an additional line is searched (b). The best-matching point after searching line -4 is at (-4, -4); therefore, line -5 is also searched (c). Finally, since no candidate in line -5 has lower distortion than position (-4, -4), the procedure stops and the motion vector of (-4, -4) is found.
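The procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the distortion is supplied as a hypothetical callable `cost(dx, dy)` (for example, a SAD), ties are broken arbitrarily, the predictor line is clamped to the search range, and the synthetic cost function in the usage lines is an assumption chosen so that the minimum lies at (-4, -4).

```python
def search_line(cost, y, lo, hi):
    """Test every candidate on horizontal line y; return (cost, dx, dy)."""
    return min((cost(dx, y), dx, y) for dx in range(lo, hi + 1))


def predictive_line_search(cost, predictor, lo, hi):
    """Return ((dx, dy), number_of_lines_searched).

    Only the vertical component of the predictor matters here, because
    every candidate on each visited line is tested.
    """
    py = min(max(predictor[1], lo), hi)
    # Step 1: three lines centered on the predictor (clipped to the range).
    lines = [y for y in (py - 1, py, py + 1) if lo <= y <= hi]
    best = min(search_line(cost, y, lo, hi) for y in lines)
    searched = len(lines)

    if best[2] == py - 1:
        step = -1          # Step 2: keep extending toward smaller y
    elif best[2] == py + 1:
        step = 1           # Step 3: keep extending toward larger y
    else:
        return (best[1], best[2]), searched   # Step 4

    y = best[2]
    while lo <= y + step <= hi:
        candidate = search_line(cost, y + step, lo, hi)
        searched += 1
        if candidate[0] >= best[0]:
            break          # best point is no longer on the boundary line
        best = candidate
        y += step
    return (best[1], best[2]), searched


# Synthetic distortion with its minimum at (-4, -4), as in the Fig. 4 example.
def example_cost(dx, dy):
    return abs(dx + 4) + abs(dy + 4)


mv, lines_searched = predictive_line_search(example_cost, (-4, -2), -16, 15)
```

With this synthetic cost, the sketch visits lines -1, -2, and -3, then line -4, then line -5, and stops, mirroring the sequence of the worked example.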

III. EXPERIMENTAL RESULTS

A. Simulation Results

In order to evaluate the performance of the PLS, we apply it to several standard MPEG-4 test sequences. We use two criteria for measuring the performance of motion-estimation algorithms: the mean square error (MSE) and the motion-vector error rate. The MSE is calculated between the motion-compensated image frame and the original image frame. The lower the MSE, the smaller the energy of the prediction error, and therefore the more effective the motion-estimation algorithm is.
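As a minimal sketch of the first criterion, the MSE between an original frame and its motion-compensated prediction can be computed as below. The frame representation (equally sized lists of rows) and the function name are assumptions made for illustration.

```python
def mse(original, compensated):
    """Mean square error between two equally sized frames (lists of rows)."""
    total = 0
    count = 0
    for row_o, row_c in zip(original, compensated):
        for a, b in zip(row_o, row_c):
            total += (a - b) ** 2
            count += 1
    return total / count


# Tiny hypothetical frames: a single pixel differs by 2.
example_mse = mse([[1, 2], [3, 4]], [[1, 2], [3, 6]])
```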

The motion-vector error rate of a fast algorithm is the percentage of motion vectors that are different from those obtained by the full-search algorithm. Since the FS algorithm generates the optimal results, the error rate shows how closely the fast algorithm approaches the optimal solution. Therefore, an efficient and robust fast motion-estimation algorithm should have lower MSE and lower motion-vector error rate for all test sequences.
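The second criterion can be sketched in the same way. The function below assumes the two motion fields are given as equally long lists of (x, y) tuples in the same macroblock order; the name and the sample data are hypothetical.

```python
def mv_error_rate(fast_mvs, fs_mvs):
    """Percentage of motion vectors that differ from the full-search result."""
    if len(fast_mvs) != len(fs_mvs):
        raise ValueError("motion fields must have the same length")
    differing = sum(1 for a, b in zip(fast_mvs, fs_mvs) if a != b)
    return 100.0 * differing / len(fs_mvs)


# Hypothetical motion fields for four macroblocks; one vector differs.
fs_field = [(0, 0), (-4, -4), (1, 0), (2, 3)]
fast_field = [(0, 0), (-4, -3), (1, 0), (2, 3)]
rate = mv_error_rate(fast_field, fs_field)
```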

The results of center-biased DS [8] are also shown in the figures and tables for comparison. It is used for comparison not only because it has a superior balance between simplicity and performance, but also because the MPEG-4 reference software [12] has adopted it as an alternative to the FS algorithm. Predictive diamond search (PDS), which starts from the motion-vector predictor instead of the origin, is also tested to verify the effectiveness of motion-vector predictors and to compare fairly with the PLS. One-dimensional full search (ODFS), which is also efficient in memory access, is simulated as well.

Table I shows the MSE performance for PLS, DS, PDS, ODFS, and FS on eight standard MPEG-4 test sequences. The search range is (-16, 15) in all cases. For sequences where only small motions are involved, such as News, the MSE performance of the five algorithms is very close. FS always has the smallest MSE values, while PLS is better than DS in all cases. On the other hand, for sequences with large motions, such as Foreman and Stefan, PLS outperforms DS significantly, with only slightly higher MSE values than the results of FS. This means that the PLS is very robust, even when very large motion is involved. If we compare the DS and the PDS, the effectiveness of choosing motion-vector predictors as starting points can be clearly seen. The prediction error of PDS is much smaller than that of DS for sequences with large motion, such as Foreman and Stefan. However, our PLS still significantly outperforms PDS for Stefan. As for the ODFS, it is better than PDS but worse than PLS, on average.

Figs. 5 and 6 show the MSE measure versus the frame number for the Foreman and Stefan sequences. As we can see in these figures, the MSE values of results from the PLS stay very close to those from the FS all the time, with only small deviations when the motion is very large. On the other hand, the MSE values of results from the DS rise significantly when sequences have large motions. Note that the curves of PDS and ODFS are omitted for clarity. In fact, for most of the frames, the two omitted curves lie between the curve of PLS and that of DS, and the curve of PDS is slightly higher than that of ODFS.

Fig. 5. MSE comparison between the PLS, the DS, and the FS algorithms. The input sequence is the Foreman sequence in CIF format.

Fig. 6. MSE comparison between the PLS, the DS, and the FS algorithms. The input sequence is the Stefan sequence in CIF format.

TABLE II

MOTION-VECTOR ERROR RATE

Table II shows the comparison of motion-vector error rates for various algorithms. From the table, we can see that the results of PLS are the best, especially in fast-moving sequences such as Foreman and Stefan.

Figs. 7 and 8 show the motion-vector error rates versus the frame number for the Foreman and Stefan sequences. Both sequences have large motion in the scene and, therefore, are used to test the robustness of fast motion-estimation algorithms. From the figures, we can see that the DS is not very reliable when the scene is moving fast, while the results of PLS stay very close to those of FS. Superior robustness of the PLS compared with the DS is shown in these figures. Note that again the curves of PDS and ODFS are omitted for clarity. In fact, for most of the frames in Foreman, the ranks of MV error rate for these fast algorithms, from the best to the worst, are PLS, PDS, DS, and ODFS. For most of the frames in Stefan, the ranks are PLS, ODFS, PDS, and DS.

Fig. 7. Motion-vector error rate comparison between the PLS and the DS algorithms. The input sequence is the Foreman sequence in CIF format.

Fig. 8. Motion-vector error rate comparison between the PLS and the DS algorithms. The input sequence is the Stefan sequence in CIF format.

B. Discussions

The two main features of the PLS are the predictive start point and the line search pattern. The effectiveness of these two methods is analyzed in this subsection.

Table III shows the comparison of the PLS with the center-biased line search (CBLS). The CBLS algorithm is the same as the PLS except that the starting point is always at the origin. Therefore, this comparison is used to show the enhancement of a predictive start point. As can be seen in the table, the MSE performance for the predictive approach is better than that of the center-biased approach. The motion-vector error rates and the number of searched lines of PLS are lower than those of CBLS. The use of a predictive starting point is justified because it brings better performance and lower complexity.

Table IV shows the complexity comparison between the PLS and the other algorithms. The average number of lines searched by the PLS is 3.19. Compared with the FS algorithm, which searches all 32 lines in the search range, the PLS is about ten times faster (32/3.19 ≈ 10).

Although the total number of candidates searched by the PLS ($3.19 \times 32 = 102.08$) is more than those searched by the DS (15.69) and the PDS (14.01), the memory access for PLS is only 40% of that needed by the DS and 45% of that needed by the PDS. The PLS has higher memory access efficiency than the DS, or than any other fast algorithm we are aware of.

The ODFS algorithm first searches a horizontal line, followed by a vertical line, and then a horizontal half line, and finally a vertical half line. Although the number of lines searched by ODFS is $1 + 1 + 0.5 + 0.5 = 3$, which is lower than that of PLS (3.19), PLS is still more efficient in memory access. This is because the data reuse of one single line is more efficient than that of two separated half lines.

The total number of SAD operations that need to be calculated is proportional to the total number of searched candidates, so the computational complexity of the PLS is about 6.51 times that of the DS and 7.28 times that of the PDS. However, since the PLS has lower memory access and smaller control overhead, the overall complexity comparison is platform dependent. On a platform that can calculate the SAD operations efficiently, the speed of the PLS can approach the speed of the DS and the PDS.
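The complexity figures quoted in this subsection follow from simple arithmetic, sketched below. The calculation assumes 32 candidates per line and 32 lines for a (-16, 15) search range; the averages 3.19, 15.69, and 14.01 are the values reported in the text, and the variable names are illustrative.

```python
CANDIDATES_PER_LINE = 32   # assumption: horizontal positions -16..15
LINES_IN_RANGE = 32        # assumption: vertical positions -16..15

PLS_LINES = 3.19           # average lines searched by the PLS (from the text)
DS_CANDIDATES = 15.69      # average candidates searched by the DS
PDS_CANDIDATES = 14.01     # average candidates searched by the PDS

# Average number of candidates tested by the PLS per macroblock.
pls_candidates = PLS_LINES * CANDIDATES_PER_LINE

# Speedup over FS in terms of searched lines.
speedup_over_fs = LINES_IN_RANGE / PLS_LINES

# SAD workload of the PLS relative to the diamond-search variants.
ratio_to_ds = pls_candidates / DS_CANDIDATES
ratio_to_pds = pls_candidates / PDS_CANDIDATES
```

The resulting ratios agree with the quoted 6.51 and 7.28 to within rounding, and the speedup over FS comes out at roughly ten.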

As for the ODFS, its required number of SAD operations is slightly lower than that of PLS, but its memory access is slightly higher than that of PLS. The complexities of these two algorithms are about the same. However, note that PLS has better performance in the quality of motion-compensated frames.

In Table IV, we assumed that the cache is not used by the multimedia processor. In fact, the item of memory access should be replaced by “cache and memory access” because multimedia processors are equipped with a cache to facilitate higher-speed data transfer. However, even if the cache is considered, data transfer still leads to the processing bottleneck because SAD calculation is so efficient on media processors. Furthermore, the size of the cache is limited. If the operands miss in the cache, the access time of memory is an order of magnitude higher than that of the cache, which means that the reduction of memory access is still very important.

Because of gravity, motion in the vertical direction is found to be less significant than motion in the horizontal direction, so we rotated the standard sequences by 90° to test more cases. The results of MSE performance, MV error rate, and complexity are shown in Tables V–VII, respectively. Although the MSE performance and MV error rate of PLS for rotated sequences are not as good as those for the original sequences, PLS is still significantly better than the other fast algorithms. The complexity of PLS for rotated sequences rises a little (3.1%), while that of the other fast algorithms remains almost the same.

C. System Performance

TABLE V

MSE PERFORMANCE COMPARISON FOR SEQUENCES ROTATED BY 90°

TABLE VI

MOTION-VECTOR ERROR RATE FOR SEQUENCES ROTATED BY 90°

We have implemented an MPEG-4 encoder on a multimedia processor, the Equator MAP-CA, which has a very long instruction word (VLIW) core running at a clock frequency of 216 MHz. This processor can process the data of 32 pixels in parallel and has special instructions that can execute shift, subtract, absolute, and accumulation in a single clock cycle. When running a real-time encoder for MPEG-4 Simple Profile Level 3, which deals with CIF (352 × 288) format at 30 frames per second, only 57% of the processing power of the multimedia processor is consumed. The PLS motion estimation is responsible for 58% of the total computation load. Table VIII shows the run-time profiles for Foreman encoded at a target bit rate of 384 Kbits/s. On average, only 18.88 ms is required to encode one single frame.
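The relation between the per-frame encoding time and the processor load can be checked with a short calculation. The sketch below uses only the figures quoted above (18.88 ms per frame, 30 frames per second, 58% motion-estimation share); the function and variable names are illustrative.

```python
def processor_load(ms_per_frame, frames_per_second):
    """Fraction of each second spent encoding at the given frame rate."""
    return ms_per_frame * frames_per_second / 1000.0


# 18.88 ms per frame at 30 frames per second.
load = processor_load(18.88, 30)

# Share of the whole processor spent in PLS motion estimation,
# taking 58% of the encoder's computation load.
me_share = 0.58 * load
```

The computed load is about 0.57 of the processor, consistent with the quoted 57%, and the motion estimation alone occupies roughly one third of the processor.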

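The PLS motion estimation alone accounts for about 58% of the 57% processor load, or roughly one third of the processor. Since the PLS is about ten times faster than the FS algorithm, FS motion estimation would need several times the available processing power, so it is not possible to run the FS algorithm in real time, even on such a powerful multimedia processor. The proposed PLS is a very good alternative.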

Fig. 9 shows the peak signal-to-noise ratio (PSNR) of the Foreman sequence encoded at a target bit rate of 384 Kbits/s. As shown in the figure, the PSNR results of the PLS are very close to the results of FS throughout the whole sequence. On the other hand, the results of the DS deviate from the FS results when large motions are involved. The PLS can achieve the performance of the FS algorithm, even when large motions are involved in the scene.

Fig. 10 shows the rate-distortion curves of the Foreman sequence encoded at a target bit rate of 384 Kbits/s for various motion-estimation algorithms. As shown in the figure, the rate-distortion curve of the PLS is very close to that of FS. On the other hand, the curve of the DS drops significantly from the FS results.

IV. CONCLUSION

An efficient motion-estimation algorithm, the PLS, is described in this paper. The main features of PLS are the predictive starting point and the line search pattern. This search algorithm starts at the position of the motion-vector predictor because strong correlation exists between neighboring motion vectors. The line search pattern in PLS exploits the data reuse concept, so the memory access is very efficient compared with the other fast algorithms considered. From the experimental results, the PSNR performance of the PLS is very close to that of the FS approach, and the PLS is about ten times faster. It is also shown that the PLS is more robust than the DS, which is a very good fast algorithm adopted by the MPEG-4 reference software. A real-time encoder for MPEG-4 Simple Profile

TABLE VII

COMPLEXITY COMPARISON FOR SEQUENCES ROTATED BY 90°. (a) SEARCHED CANDIDATES PER MACROBLOCK. (b) MEMORY ACCESS PER MACROBLOCK

TABLE VIII

RUN-TIME PROFILES FOR FOREMAN

Fig. 9. PSNR comparison for the MPEG-4 encoder using three different motion-estimation algorithms: the FS, the PLS, and the DS algorithms. The input sequence is the Foreman sequence in CIF format and the target bit rate is 384 Kbits/s.

Fig. 10. Rate-distortion curves for different motion-estimation algorithms: the FS, the PLS, and the DS algorithms. The input sequence is the Foreman sequence in CIF format and the target bit rate is 384 Kbits/s.

REFERENCES

[1] T. Sikora, “The MPEG-4 video standard verification model,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 19–31, Feb. 1997.

[2] I. Kuroda and T. Nishitani, “Multimedia processors,” Proc. IEEE, vol. 86, pp. 1203–1221, June 1998.

[3] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, “Motion-compensated interframe coding for video conferencing,” in Proc. Nat. Telecommunication Conf., 1981, pp. C9.6.1–C9.6.5.

[4] R. Li, B. Zeng, and M. L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 438–442, Aug. 1994.

[5] M.-J. Chen, L.-G. Chen, and T.-D. Chiueh, “One-dimensional full search motion estimation algorithm for video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 504–509, Oct. 1994.

[6] L.-M. Po and W.-C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 313–317, June 1996.

[7] L.-K. Liu and E. Feig, “A block-based gradient descent search algorithm for block motion estimation in video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp. 419–422, Aug. 1996.

[8] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, “A novel unrestricted center-biased diamond search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 369–377, Aug. 1998.

[9] A. M. Tourapis, O. C. Au, M. L. Liou, G. Shen, and I. Ahmad, “Optimizing the MPEG-4 encoder—Advanced diamond zonal search,” in Proc. IEEE Int. Symp. Circuits and Systems, 2000, pp. III-674–III-677.

[10] M. Budagavi, W. Rabiner Heinzelman, J. Webb, and R. Talluri, “Wireless MPEG-4 video communication on DSP chips,” IEEE Signal Processing Mag., pp. 36–53, Jan. 2000.

[11] S. Rathnam and G. Slavenburg, “An architectural overview of the programmable multimedia processor, TM-1,” in Proc. IEEE COMPCON, 1996, pp. 319–325.

[12] T. Chiang, H.-J. Lee, and H. Sun, “An overview of the encoding tools in the MPEG-4 reference software,” in Proc. IEEE Int. Symp. Circuits and Systems, 2000, pp. I-295–I-298.