Parallel global elimination algorithm and architecture design for fast block matching motion estimation


Yu-Wen Huang, Chen-Han Tsai, and Liang-Gee Chen

DSP/IC Design Lab., Graduate Institute of Electronics Engineering and

Department of Electrical Engineering, National Taiwan University

{yuwen, phenom, lgchen}@video.ee.ntu.edu.tw

ABSTRACT

The critical path of the hardware for the global elimination algorithm (GEA) is too long to meet the real-time constraints of high-end applications. In this paper, we propose a new parallel GEA and its corresponding architecture. Instead of searching the whole search range sequentially, we divide the candidate blocks into independent groups and find the most probable candidates of each group in parallel, so the design can be built as an array of GEA processing elements with a much shorter critical path. In addition, the GEA processing element is optimized to reduce the gate count by 30%, and 2-D data reuse is organized to save 80% of the SRAM bandwidth, which also reduces power consumption. Simulation results show that our implementation achieves real-time processing of D1 30Hz video with a search range of H[-64, +63.5] V[-32, +31.5] at an operating frequency of 70MHz and a gate count of 113K. Compared with full search, our gate count is six times smaller at the same frequency, and the PSNR loss is at most 0.1-0.2dB.

1. INTRODUCTION

Motion estimation (ME) removes the temporal redundancy between frames and gives coding systems a high compression ratio. The block matching algorithm (BMA) is the usual choice for the ME module in video codecs because of its simplicity and good performance. Among all block matching algorithms, the full-search block matching algorithm (FSBMA) is the most popular but demands the most computation. For example, real-time ME for CIF (352×288) 30Hz video with a [-16, +15] search range requires 9.3 giga-operations per second (GOPS). If the frame size is enlarged to D1 (720×480) 30Hz with a [-32, +31] search range, 127 GOPS is required. Such computational complexity is far beyond the processing capabilities of general-purpose processors.
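The two complexity figures can be reproduced with a short calculation. The sketch below assumes three operations per pixel comparison (subtract, absolute value, accumulate); that operation count is an assumption chosen because it matches the quoted numbers, and the function name is illustrative.

```python
# Rough operation count for full-search block matching (FSBMA).
# Assumption: 3 operations per pixel comparison (subtract, abs, accumulate).
OPS_PER_PIXEL = 3

def fsbma_gops(width, height, fps, positions_per_axis):
    """Giga-operations per second for full search over a square window."""
    pixels_per_second = width * height * fps
    candidates = positions_per_axis ** 2      # e.g. [-16, +15] -> 32 x 32
    return pixels_per_second * candidates * OPS_PER_PIXEL / 1e9

cif_gops = fsbma_gops(352, 288, 30, 32)       # CIF, [-16, +15]
d1_gops = fsbma_gops(720, 480, 30, 64)        # D1, [-32, +31]
print(f"CIF: {cif_gops:.1f} GOPS, D1: {d1_gops:.1f} GOPS")
```

Every pixel of a frame belongs to exactly one block and is compared once per candidate position, so the block size cancels out of the estimate.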

Successive elimination algorithm (SEA) [1][2] can reduce the heavy computation of FSBMA and maintain the same results as FSBMA. We proposed a global elimination algorithm (GEA) [3] to remove the branches and to make the data flow much more regular for hardware design. However, the drawback of our previous GEA architecture is the longer critical path. It is difficult to meet the real-time requirement for high specifications. In this paper, we propose a new parallel GEA and its corresponding architecture to solve the encountered problem. The rest of this paper is organized as follows. The parallel algorithm and architecture are described in Section 2 and Section 3, respectively. Finally, Section 4 gives a conclusion.

2. ALGORITHM

The original GEA is described by Eqs. (1)-(7).

\begin{align}
& -p \le m,\, n \le p-1 \tag{1} \\
& 0 \le u,\, v \le 2^{L}-1 \tag{2} \\
& csum(u,v) = \sum_{i=0}^{N/2^{L}-1} \sum_{j=0}^{N/2^{L}-1} c\left(\tfrac{uN}{2^{L}}+i,\ \tfrac{vN}{2^{L}}+j\right) \tag{3} \\
& ssum(m,n,u,v) = \sum_{i=0}^{N/2^{L}-1} \sum_{j=0}^{N/2^{L}-1} s\left(m+\tfrac{uN}{2^{L}}+i,\ n+\tfrac{vN}{2^{L}}+j\right) \tag{4} \\
& SSAD(m,n) = \sum_{u=0}^{2^{L}-1} \sum_{v=0}^{2^{L}-1} \left| csum(u,v) - ssum(m,n,u,v) \right| \tag{5} \\
& SAD(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| c(i,j) - s(m+i,\ n+j) \right| \tag{6} \\
& MV = \arg\min_{(m,n)\,:\,SSAD(m,n) \le SSAD_{M}} SAD(m,n) \tag{7}
\end{align}

The search range is $[-p, p-1]$, $(m, n)$ denotes a search position, and $(u, v)$ is the subblock index. The level is indicated by $L$, and a block of size $N \times N$ is divided into $2^{L} \times 2^{L}$ subblocks of size $(N/2^{L}) \times (N/2^{L})$. The current block data and the search area data are denoted as $c$ and $s$, respectively. $csum$ is the sum of all pixels within a subblock of the current block, and $ssum$ is the sum of all pixels within a subblock of a candidate block. Originally, the matching criterion is the sum of absolute differences (SAD) over all pixels in the block. Here, we define the subsampled SAD (SSAD) as the sum of absolute differences between $csum$ and $ssum$. After all the SSAD values are calculated, we find the $M$ most probable motion vectors, whose SSAD values are the smallest. The $k$-th smallest SSAD among all candidate blocks is denoted as $SSAD_k$. Finally, we compute the SAD at the $M$ search positions to find the final motion vector ($MV$). In our previous work [3], we identified values of $L$ and $M$ that are suitable for CIF and QCIF sequences under the block size and search ranges considered there. We also proposed an architecture with a systolic module and a 16-pel SAD tree to calculate SSAD and SAD efficiently, and with a comparator tree to record the $M$ most probable motion vectors. The comparator tree is designed to match the throughput at which SSAD values are generated, so its critical path is roughly proportional to $\log_2(M+1)$. However, for high-end applications with larger frame sizes, the search range and the parameter $M$ should be enlarged (e.g., $p$=64, $M$=15 or 31) to obtain high video quality. Moreover, our previous architecture computes SSAD sequentially, so the operating frequency must increase with the search range and the frame size. Consequently, a parallel algorithm and an architecture with a short critical path are needed.

Fig. 1. Scanning order of search positions for SSAD calculation: (a) sequential GEA; (b) parallel GEA with $P$=8.
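The two-stage structure of the sequential GEA (coarse SSAD over every candidate, then full SAD on the few survivors) can be sketched in software as follows. This is a minimal model and not the hardware data flow; the function names, the list-of-rows image layout, and the convention that `area[n + p][m + p]` is the top-left pixel of the candidate at displacement (m, n) are assumptions made for illustration.

```python
def subblock_sums(img, top, left, N, L):
    """Sums of the 2^L x 2^L subblocks of the N x N block at (top, left)."""
    k = 2 ** L                  # subblocks per side
    s = N // k                  # subblock side length
    return [
        sum(img[top + u * s + i][left + v * s + j]
            for i in range(s) for j in range(s))
        for u in range(k) for v in range(k)
    ]

def gea(cur, area, p, L=2, M=3):
    """Return the displacement (m, n) chosen by a software model of GEA.

    cur  : N x N current block (list of rows)
    area : search area; area[n + p][m + p] is the top-left pixel of the
           candidate at displacement (m, n), with -p <= m, n <= p - 1
    """
    N = len(cur)
    csum = subblock_sums(cur, 0, 0, N, L)

    # Coarse stage: SSAD for every candidate in the search range.
    coarse = []
    for n in range(-p, p):
        for m in range(-p, p):
            ssum = subblock_sums(area, n + p, m + p, N, L)
            ssad = sum(abs(a - b) for a, b in zip(csum, ssum))
            coarse.append((ssad, m, n))

    # Global elimination: keep only the M smallest SSAD values.
    coarse.sort(key=lambda t: t[0])

    # Fine stage: full SAD on the surviving candidates only.
    def sad(m, n):
        return sum(abs(cur[i][j] - area[n + p + i][m + p + j])
                   for i in range(N) for j in range(N))

    _, m, n = min(coarse[:M], key=lambda t: sad(t[1], t[2]))
    return m, n
```

Because the selection step is a fixed top-M pick rather than a data-dependent early exit, the amount of work per block does not depend on the image content, which is what makes the flow regular enough for hardware.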

In order to compute the SSAD values of several candidate blocks in parallel, we divide the candidates into $P$ groups. Candidate blocks with the same value of $m \bmod P$, i.e., those in the same column group of Fig. 1(b), are grouped together, and the $M$ most probable motion vectors with the smallest SSAD values are found separately for each group. After all the SSAD values are estimated, the SAD values of the $P \times M$ selected search positions are computed to obtain the final motion vector. Although the $P \times M$ most probable candidates do not necessarily correspond to the $P \times M$ smallest SSAD values within the whole search range, the parallel GEA does not suffer noticeable quality degradation, because the $P \times M$ globally smallest SSAD values usually belong to different groups. The collection of the $M$ candidates from each of the $P$ groups should therefore be similar to the $P \times M$ candidates with the globally smallest SSAD values. In this way, $P$ duplications of the original GEA architecture can be configured as an array of GEA processing elements (PEs) to support parallel scanning of search positions and parallel calculation of SSAD values. Figure 1 illustrates the scanning order. Moreover, $M$ is much smaller than $P \times M$, which indicates that the critical path of the comparator tree in each GEA PE is reduced at the algorithmic level.
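The per-group selection can be illustrated with a small sketch. It assumes the groups are interleaved columns, so that a candidate at horizontal position m belongs to group m mod P; the function name and the dictionary input format are illustrative choices.

```python
def parallel_candidates(ssad, P=8, M=3):
    """Pick the M smallest-SSAD positions in each of P column groups.

    ssad : dict mapping a search position (m, n) to its SSAD value
    Returns a list of P * M positions, group 0 first.
    """
    groups = {g: [] for g in range(P)}
    for (m, n), value in ssad.items():
        # Python's % is non-negative for a positive modulus, so negative
        # horizontal positions map to groups 0..P-1 as well.
        groups[m % P].append((value, m, n))

    picks = []
    for g in range(P):
        # Each group keeps only its own M best candidates.
        picks.extend((m, n) for _, m, n in sorted(groups[g])[:M])
    return picks
```

Each group sorts only its own candidates, which mirrors the point made above: every PE tracks M values instead of P × M, so its comparator logic stays small.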

Many conditions have been tested to verify the quality of our parallel GEA. In our experiments, we embed the parallel GEA with $P$=8 and $M$=3 into an MPEG-4 simple profile encoder. The resolution of $csum$ and $ssum$ is truncated from 12 bits to eight bits in order to save area and to shorten the critical path of the hardware. The other parameter sets are CIF 30Hz [-32, +31.5] 384-2048Kbps and D1 30Hz H[-64, +63.5] V[-32, +31.5] 1536-8192Kbps. The CIF sequences are Foreman, Hall Monitor, Mobile Calendar, Stefan, and Table Tennis. The D1 sequences are two clips from the movie Crouching Tiger, Hidden Dragon. One clip is the scene with two actresses fighting in the courtyard, and the other is the leading actor “playing” with the leading actress on bamboos. Compared with FSBMA, the average PSNR losses for the seven sequences are only 0.16, 0.13, 0.05, 0.00, 0.14, -0.02, and 0.05dB, respectively. Note that Lagrangian mode decision [4] is applied for both GEA and FSBMA.

Fig. 2. Systolic module to generate 16 subblock sums of 4×4 pixels: (a) original, with 12-bit sums; (b) proposed, with 8-bit sums.

3. ARCHITECTURE

In this section, a block size of $N$=16, a level of $L$=2, $P$=8 groups, $M$=3 candidates per group, and a search range as large as H[-64, +63.5] V[-32, +31.5] are used as an example to explain the parallel GEA design. The specification is D1 30Hz.

3.1. Systolic Module

The purpose of the systolic module is to generate 16 subblock sums of 4×4 pixels in parallel. As shown in Fig. 2, the input is a row of 16×1 pixels. After 16 consecutive rows of pixels have been input, the 16 subblock sums at the first search position of a column are produced. The systolic module utilizes vertical data reuse, so the subblock sums at the remaining search positions of the same column are obtained in the following cycles, one search position per cycle. The improved systolic module not only removes the redundant computation of subblock sums but also reduces the resolution of the subblock sums. The gate count of this part is reduced from 6.0K to 4.7K.
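The vertical data reuse can be modeled in software: each incoming row is reduced once to four partial sums, and the sums for consecutive search positions are derived from those partial sums instead of being recomputed from pixels. The sketch below is a behavioral model only, not the register-level structure of Fig. 2; the function name and the truncation by a right shift of four bits (keeping the upper 8 of 12 bits) are assumptions.

```python
def column_subblock_sums(strip, shift=4):
    """Subblock sums for every vertical position of one search column.

    strip : list of rows, 16 pixels each; 16 + K - 1 rows give K positions
    shift : right shift applied to each 12-bit sum (assumed truncation)
    Returns a list of K entries, each a 4 x 4 list of subblock sums.
    """
    # Each row contributes four partial sums of 4 pixels, computed once.
    row_parts = [[sum(row[4 * v:4 * v + 4]) for v in range(4)]
                 for row in strip]

    # quad[t][v] = partial sums of rows t..t+3, updated incrementally
    # with one addition and one subtraction per new row.
    window = [sum(row_parts[r][v] for r in range(4)) for v in range(4)]
    quad = [window[:]]
    for r in range(4, len(strip)):
        window = [window[v] + row_parts[r][v] - row_parts[r - 4][v]
                  for v in range(4)]
        quad.append(window)

    # The block whose top row is `top` uses four stacked 4-row windows.
    return [
        [[quad[top + 4 * u][v] >> shift for v in range(4)] for u in range(4)]
        for top in range(len(strip) - 15)
    ]
```

For a column of 64 search positions the strip has 79 rows, which is where the 79-row figure used later in the bandwidth analysis comes from.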

Fig. 3. SAD tree to compute SSAD/SAD.

Fig. 4. Comparator tree to find the three smallest SSAD values: (a) original; (b) proposed.

3.2. SAD Tree

The SAD tree is illustrated in Fig. 3, and its goal is to compute SSAD/SAD values. An AD unit computes the absolute difference of two eight-bit samples. When the SAD tree is used to generate SSAD values, the inputs are the 16 subblock sums of the current block and the 16 subblock sums of a candidate block. The throughput is the same as that of the systolic module, i.e., one candidate block per cycle (except for the first candidate block in each column of search positions). When the SAD tree is used to compute SAD values, the inputs are rows of current block data and search area data, and its output is fed to a 16-bit accumulator. It takes 16 cycles to compute the SAD of one candidate block. Due to the bit-width reduction of $csum$ and $ssum$, the gate count of this part is reduced from 4.6K to 2.8K.
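The sharing of one tree between the two distortion measures can be sketched as follows. This is a behavioral model under the assumption that all inputs are eight-bit values; the function names are illustrative.

```python
def sad_tree(a, b):
    """One pass through a 16-input tree: sum of |a[i] - b[i]|."""
    assert len(a) == 16 and len(b) == 16
    return sum(abs(x - y) for x, y in zip(a, b))

def ssad(csum, ssum):
    """SSAD in a single pass: the inputs are the 16 subblock sums."""
    return sad_tree(csum, ssum)

def sad(cur, cand):
    """SAD of a 16 x 16 block in 16 passes, one pixel row per pass."""
    acc = 0                     # models the accumulator after the tree
    for row in range(16):
        acc += sad_tree(cur[row], cand[row])
    return acc

# Worst-case outputs with 8-bit inputs (0 versus 255 everywhere).
worst_ssad = ssad([0] * 16, [255] * 16)               # 16 * 255
worst_sad = sad([[0] * 16] * 16, [[255] * 16] * 16)   # 256 * 255
print(worst_ssad, worst_sad)
```

With eight-bit inputs the largest SSAD is 4080, which fits in 12 bits, and the largest SAD is 65280, which fits in the 16-bit accumulator.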

3.3. Comparator Tree

The purpose of the comparator tree is to find the three smallest SSAD values among one group of candidate blocks. Its throughput is matched with that of the systolic module and the SAD tree. The concept is to keep the three smallest SSAD values found so far and their corresponding MVs in registers, compare each incoming SSAD value with the three stored values, and replace the maximum stored value with the new one if the stored value is larger. Figure 4 illustrates the comparator tree. The MAX unit outputs the larger of its two inputs, and the EQU unit tells whether its two inputs are the same. The previous architecture, shown in Fig. 4(a), finds the maximum value and feeds it back to be compared with the three stored values to determine which stored value should be replaced. The CHECK unit ensures that only one stored value is replaced when more than one stored value equals the maximum. The stall signal is activated while invalid SSAD values are generated by the SAD tree during the first 15 cycles of a column of search positions. We shorten the critical path in three aspects. First, at the algorithmic level, the search positions are divided into eight groups. Originally, with $M$=24, the 24 smallest SSAD values had to be found, but now only the three smallest SSAD values in each group are required. Second, the bit-width of SSAD is reduced from 16 bits to 12 bits. Third, as shown in Fig. 4(b), instead of feeding the maximum SSAD value back for comparison, we give each SSAD value a unique 2-bit tag and feed the tag of the maximum SSAD back for comparison. The gate count of this part is reduced from 1.5K to 1.1K.

Fig. 5. Illustration of the motion engine: (a) GEA processing element; (b) system block diagram.
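The register update performed by the comparator tree of Section 3.3 can be modeled as below. This is a behavioral sketch rather than the gate-level design: the class name, the handling of empty registers, and the rule that the lowest tag wins on ties are assumptions. Selecting the register to replace by its tag, rather than by comparing values again, is the idea behind Fig. 4(b).

```python
EMPTY = None  # marks a register that holds no candidate yet (assumption)

class Top3:
    """Keeps the three smallest (ssad, mv) pairs seen so far."""

    def __init__(self):
        self.regs = [EMPTY, EMPTY, EMPTY]

    def update(self, ssad, mv, stall=False):
        if stall:
            return                      # input not valid yet: hold state
        # Fill empty registers first.
        for tag, reg in enumerate(self.regs):
            if reg is EMPTY:
                self.regs[tag] = (ssad, mv)
                return
        # Tag of the largest stored value; on ties the lowest tag wins,
        # so exactly one register is ever replaced.
        max_tag = max(range(3), key=lambda t: (self.regs[t][0], -t))
        if self.regs[max_tag][0] > ssad:
            self.regs[max_tag] = (ssad, mv)

    def result(self):
        return sorted(r for r in self.regs if r is not EMPTY)

tracker = Top3()
tracker.update(0, (0, -1), stall=True)  # stalled input is ignored
for i, value in enumerate([9, 4, 7, 4, 1, 8]):
    tracker.update(value, (0, i))
print(tracker.result())
```

The tie rule plays the role of the CHECK unit in the original design: when several stored values equal the maximum, only one of them is overwritten.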

3.4. Entire Motion Engine

Figure 5(a) illustrates a GEA processing element (PE). The systolic module, SAD tree, MV cost generator, and comparator tree are connected in cascade. The MV cost generator, which requires only 0.6K gates, adds a bias of motion information to the distortion function (known as the Lagrangian method [4]) and provides an additional coding gain of 0.2-1.0 dB in PSNR for the MPEG-4 simple profile encoder. The gate count of a GEA PE is 11.3K.

Figure 5(b) illustrates the entire ME accelerator. Current block data and search area data are loaded from external SDRAM to on-chip SRAMs. We adopt data reuse of the overlapped search area between two horizontally adjacent macroblocks (MBs) to reduce the bus bandwidth from 477 Mbytes/sec to 71 Mbytes/sec. The interpolation circuit is used to generate half-pixels. In addition, thanks to the versatility of the SAD tree, the advanced prediction (AP) mode (four 8×8 MVs for an MB) is also supported. Inter mode selection between the 16×16 and 8×8 configurations of an MB is done after half-pixel ME, and intra/inter mode decision is also included in the accelerator. In general, an MPEG-4 simple profile encoder with our ME accelerator provides coding performance that is about 0.5 dB better in PSNR than the reference software using FSBMA.

The sequential GEA utilizes data reuse only in the vertical direction, through the systolic module, to compute the SSAD values. For one column of 64 search positions, 79 rows of 16×1 pixels are fetched, so 1264 bytes of memory access are required. The parallel GEA utilizes not only vertical but also horizontal data reuse. As mentioned before, the SSAD values of eight columns of search positions are generated in parallel. To achieve this parallel calculation, 79 rows of 23×1 pixels (1817 bytes) are fetched. Let us denote the 23 fetched pixels of a row from left to right as p0-p22. The first 16 pixels, p0-p15, are sent to PE0, p1-p16 are sent to PE1, ..., and p7-p22 are sent to PE7. In this way, compared with the sequential GEA (1264 bytes for one column), the parallel GEA is much more efficient in on-chip SRAM access (1817 bytes for eight columns, i.e., 227 bytes per column on average). The total on-chip SRAM bandwidth for SSAD calculation is thus reduced from 6.55 Gbytes/sec to 1.18 Gbytes/sec.
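The byte counts and bandwidth figures in this paragraph follow from simple arithmetic, reproduced in the sketch below. It assumes one byte per pixel and counts only the integer-pel SSAD stage over 128 horizontal by 64 vertical search positions; the variable names are illustrative.

```python
ROWS = 16 + 64 - 1      # 79 rows cover 64 vertical positions of a 16-row block
COLUMNS = 128           # horizontal search positions in [-64, +63]
P = 8                   # columns processed in parallel
MB_PER_SEC = (720 // 16) * (480 // 16) * 30   # macroblocks per second, D1 30Hz

seq_bytes_per_column = ROWS * 16              # one 16-pixel row per cycle
par_bytes_per_pass = ROWS * (16 + P - 1)      # 23-pixel rows shared by 8 PEs

seq_bw = seq_bytes_per_column * COLUMNS * MB_PER_SEC
par_bw = par_bytes_per_pass * (COLUMNS // P) * MB_PER_SEC

print(seq_bytes_per_column, par_bytes_per_pass, par_bytes_per_pass // P)
print(f"{seq_bw / 1e9:.2f} Gbytes/sec -> {par_bw / 1e9:.2f} Gbytes/sec")
```

The saving comes from the overlap of adjacent columns: eight neighboring 16-pixel windows span only 23 distinct pixels per row instead of 128.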

The numbers of cycles to compute the $csum$ values, SSAD values, integer-pixel SAD values, and half-pixel SAD values are 16, 1264, 384, and 58, respectively. Therefore, about 1730 cycles are required to process an MB (including pipeline latency and mode decision). For D1 30Hz, there are 40,500 MBs per second, so the required frequency is about 70 MHz. We use one 32x128 dual-port SRAM and eight 400x32 dual-port SRAMs to buffer the current block data (16×16 pels × 2) and the search area data (160×80 pels), respectively, for two horizontally adjacent MBs. The advantage of dual-port SRAMs is that the loading of pixels via the bus for the right MB of an MB pair and the ME process for the left MB can use different SRAM ports, so they can be executed at the same time. If only single-port SRAMs are available, the two tasks cannot be performed simultaneously, and the operating frequency must be increased to 100-120 MHz depending on the bus traffic and protocol (assuming a 32-bit bus).
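The cycle budget and memory sizes quoted above can be cross-checked with the following sketch. The breakdown of each stage in the comments is an interpretation that happens to reproduce the quoted numbers, not a statement of the actual schedule; the half-pixel cycle count and the 1730-cycle total are taken directly from the text.

```python
csum_cycles = 16                           # assumed: 16 rows of the current block
ssad_cycles = (128 // 8) * (16 + 64 - 1)   # 16 passes of 79 rows
int_sad_cycles = 8 * 3 * 16                # 24 candidates, 16 rows each
half_sad_cycles = 58                       # taken from the text
core_cycles = csum_cycles + ssad_cycles + int_sad_cycles + half_sad_cycles

cycles_per_mb = 1730                       # text's figure incl. pipeline and mode decision
mb_per_sec = (720 // 16) * (480 // 16) * 30
freq_mhz = cycles_per_mb * mb_per_sec / 1e6

cur_sram_bytes = 32 * 128 // 8             # one 32x128 SRAM (4096 bits)
area_sram_bytes = 8 * 400 * 32 // 8        # eight 400x32 SRAMs

print(core_cycles, mb_per_sec, f"{freq_mhz:.1f} MHz")
print(cur_sram_bytes, area_sram_bytes, cur_sram_bytes + area_sram_bytes)
```

The SRAM totals match the buffered data: 512 bytes hold two 16×16 current blocks, and 12,800 bytes hold a 160×80 search area.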

3.5. Comparison

We compare our implementation with the 1-D semi-systolic FSBMA array architecture [5] because of its high flexibility in search range, scalability of processing elements, and 100% utilization. The results are shown in Table 1. The gate count of our design is four, eight, and 16 times smaller than that of the 512-PE, 1024-PE, and 2048-PE arrays, respectively, while the minimum working frequency of our design for D1 30Hz H[-64, +63] V[-32, +31] is 0.42, 0.84, and 1.67 times that of the three array configurations. Overall, our design is more efficient in area and speed. However, its on-chip SRAM access is larger than that of the FSBMA architectures. As for functionality, our design is the richest and includes integer ME, half ME, AP mode, and Lagrangian mode decision. The video quality of an MPEG-4 simple profile encoder adopting our ME accelerator is 0.1-0.2 dB worse than that of one adopting FSBMA and Lagrangian mode decision, but is significantly better than that of the reference software.

Table 1. Comparison of ME architectures under the specification of D1 30Hz H[-64, +63] V[-32, +31].

| Architecture | 8-Parallel GEA | 512-PE 1-D Array | 1024-PE 1-D Array | 2048-PE 1-D Array |
|---|---|---|---|---|
| Gate Count | 113 K | 448 K | 896 K | 1792 K |
| Working Frequency | 70 MHz | 166 MHz | 83 MHz | 42 MHz |
| Bus Bandwidth | 71 Mbytes/sec | 71 Mbytes/sec | 71 Mbytes/sec | 71 Mbytes/sec |
| SRAM Bandwidth | 1842 Mbytes/sec | 498 Mbytes/sec | 249 Mbytes/sec | 124 Mbytes/sec |
| SRAM Size | 13.312 Kbytes | 13.312 Kbytes | 13.312 Kbytes | 13.312 Kbytes |
| Functionality | Integer ME, Half ME, AP Mode, Lagrangian MB Mode Decision | Integer ME (FSBMA) | Integer ME (FSBMA) | Integer ME (FSBMA) |
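The ratios quoted in the comparison can be rechecked from the Table 1 entries. The short sketch below only restates the table values; the dictionary layout and names are illustrative.

```python
gea = {"gates_k": 113, "freq_mhz": 70}
# Number of PEs -> (gate count in K, working frequency in MHz), from Table 1.
arrays = {512: (448, 166), 1024: (896, 83), 2048: (1792, 42)}

ratios = {}
for pes, (gates_k, freq_mhz) in arrays.items():
    area_ratio = gates_k / gea["gates_k"]      # how many times more gates
    freq_ratio = gea["freq_mhz"] / freq_mhz    # GEA frequency relative to the array
    ratios[pes] = (area_ratio, freq_ratio)
    print(f"{pes}-PE: {area_ratio:.1f}x gates, GEA runs at {freq_ratio:.2f}x frequency")
```

Only against the 2048-PE array does the parallel GEA need a higher clock, and that array uses about 16 times as many gates.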

4. CONCLUSION

This paper presents a new parallel global elimination algorithm and architecture for fast block matching. By rejecting less probable candidate blocks with a simplified distortion estimate, only a few most probable candidates need to be evaluated with the fine distortion measure to determine the final motion vector. The computational complexity of our algorithm is about 10% of that of full search. In addition, candidate blocks are divided into independent groups so that the coarse distortion estimation of several search positions can be executed in parallel. A parallel GEA architecture design is also introduced. Several design techniques, such as systolic flow, 2-D data reuse, reuse of the overlapped search area, and resource sharing, are applied to maximize the overall system performance. Our accelerator is much more area-speed efficient than full-search architectures and provides better coding performance than the MPEG-4 reference software.

5. REFERENCES

[1] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Transactions on Image Processing, vol. 4, no. 1, pp. 105–107, January 1995.

[2] X. Q. Gao, C. J. Duanmu, and C. R. Zou, “A multilevel successive elimination algorithm for block matching motion estimation,” IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 501–504, March 2000.

[3] Y. W. Huang, S. Y. Chien, B. Y. Hsieh, and L. G. Chen, “An efficient and low power architecture design for motion estimation using global elimination algorithm,” in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. 3120–3123.

[4] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan, “Performance comparison of video coding standards using Lagrangian coder control,” in Proc. of IEEE International Conference on Image Processing, 2002.

[5] K. M. Yang, M. T. Sun, and L. Wu, “A family of VLSI designs for the motion compensation block-matching algorithm,” IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317–1325, October 1989.
