Low power full-search block-matching motion estimation chip for H.263+

(1)

Low Power Full-Search Block-Matching Motion

Estimation Chip for H.263+

Jun-Fu Shen , Liang-Gee Chen , Hao-Chieh Chang and Tu-Chih Wang DSPAC Design Lab, Department of Electrical Engineering

National Taiwan University Taipei, Taiwan. R.O.C.

ABSTRACT

In this paper, a low power full-search block matching (FSBM) motion estimation design for the H.263+ low bit rate video coding was proposed. The features of H.263+ such as half-pixel precision and some advanced modes (advance prediction mode, PB-frame mode and reduced resolution update mode) are taken into consideration. This architecture can deal with different

block size and searching range in a single chip without

any latency. We use a 1-D and 2-D mixed architecture to

fulfill this goal. To achieve the purpose of low power and reduce the design period, we use dual supply voltage levels in this chip. This chip is realized by

TSMC 0.6um single-poly triple-metal CMOS

technology. The operation frequency is set at 60MHz to meet the requirement of the real time processing in the reduced resolution update mode in H.263+. The power consumption is 424mW at 60MHz and the throughput is 36 frames per second with CIF format at 60MHz.

Keyword VLSI architecture, Motion Estimation, Videohmage

coding

1. INTRODUCTION

Since the multimedia applications become more and more popular, the bandwidth of communication network has become insufficient obviously. In last ten years, there were many video and audio standards proposed to reduce the bitrate and improved the quality during transmission time. ITU-T announced the H.263 version

2 [l] called H.263+ to meet these requirements. For

H.263+ standard, there are additional sixteen negotiable advanced modes and motion estimation still plays a very important role. Although there are several studies on motion estimation architectures, [2-71, they can not meet the computation requirements of H.263+. Besides,

because of the widespread application of the portable

and wireless communication device, it can't be ignored the low-power issue, especially for the motion estimation, while the higher computation leads to more

power consumption. To prolong the using period

between the battery recharging, it is a trend to develop the low-power motion estimation architecture.

Therefore, this paper proposed a low-power FSBM motion estimation ASIC design for the latest H263+ video standard. The architecture overview is described in section 2. And the architectures of integer-pixel

precision unit (IU) and half-pixel precision unit (HU)

are depicted in section 3 and 4. Section 5 shows the chip

implementation and simulation results. Finally, a

conclusion is given in section 6.

2. ARCHITECTURE OVEWVIEW

Figure 1 is the overview of the whole architecture. The IU and HU are used to calculate the MAE in integer- pixel and half-pixel precision individually. In IU, we use

64 processing elements (PE) that is a tradeoff between

speed and area. The current and previous frame data come from the off chip memory and don't include in our

architecture. To achieve the 100% PE utilization during

the working time, we use four input ports for the previous block data and one input port for the current block data in IU. In HU, the previous block data and the current block data share the same port because of the low requirement of the input bandwidth.

,I M a l * ,I R..., ""I S*! i l L " l l i i "I I l l *

Figure 1. Architecture overview

0-7803-5471 -0/99/$10.0001999 IEEE

(2)

In the output block, we use an identical 30-bit register to output the motion vector and the minimum absolute

error for IU and HU sequentially to reduce the number

of output pad.

3. ARCHITECTURE FOR INTEGER-

PIXEL PRECISION UNIT

Figure 2 is the proposed IU architecture, the current

block data is reused by using the 64 8-bits shift registers in the right side, and the previous block data broadcasts to each PE from the four 8-bits buses. The MAE in each PE is transmitted to the macroblock comparator and the block comparator through the 18-bits bus and the 16-

bits bus. In Figure 2, there are 64 4-bit shift registers

placed in the left side and these registers are used to store the control signal. Using these registers, the design of control signal flow becomes much easy and no global bus is required here.

PE Coolrol Signal PI Control I 4

y

I :/

,

4

4 I ,-I PE1 4 I I 4 4 4

4

Block T111ing Data Block Error Block Candidate M#crobiock Ms!oblork Marroblack

Enor Candidalr Tcaling Dsts

Figure 2. Proposed IU architecture

A. PE Architecture

Figure 3 is the PE cell we proposed here. The major

difference from other PE architecture proposed before is marked as the gray blocks, and those registers that named as G register means that the clock of this register is gated control.

The function of gray part B that we call block

accumulator is to accumulate the MAE of a block. To

reduce the power consumption in each PE, we use gated clock control in the block accumulator. The gray part C that we call macroblock accumulator is the additional part from general PE architecture. It is used to accumulate the MAE of macroblock. In the same

candidate position, the MAE of the macroblock is the

summation of the MAE of four blocks that belong to

this macroblock.. So the data in each block accumulator

can be reused in the macroblock accumulator one after another. For low power consideration, it needn't use the same clock period in block accumulator as in

macroblock accumulator. It only needs to use a 64 time

clock period in macroblock accumulator to reduce the

activity rate in the 18 bits register B and 16 bits register C.

B. Variable block size and searching range

consideration

Considering the variable block size requirement, if we

need the MAE with 8 X 8 block size, the PE just send

the MAE in the block accumulator to the comparator. If

we need the MAE with 16 X 16 block size, the

macroblock accumulator is used. For the 32 X 32 block

size, to achieve no latency in the working time, we

divide the 3 2 x 3 2 block into four 16X 16 sub-blocks.

At Every 256 clock cycles, the block accumulator in

each PE calculate the MAE in the 8 X 8 sub-block. After

repeating four times, the MAE of a 32 X 32 block has

been finished and is sent to the comparator.

To achieve the variable searching range consideration,

we use the searching range partition strategy. For a 32 X

32 searching range, we divided it into sixteen 8 X 8 sub-

searching ranges. And each 64 clock cycles the PE array

deals with only one sub-searching range. For a 64 X 64

searching range, we divide the whole searching range

into 64 8 X 8 sub-searching ranges. Because we use 64

PES architecture and 2-D searching method, we can

easily achieve the variable searching range just divide

the whole searching into several 8 X 8 sub-searching

range and each time processing only one sub-searching range as mentioned above. And the data flow in each sub-searching range is still the same, we still has no

(3)

latency between each sub-searching range.

4. ARCHITECTURE FOR HALF-PIXEL

PRECISION UNIT

The HU is composed with an interpolation pixel generator, a processing element and a comparator as

shown in Figure 4. The input of the current block data

and the previous block data share the same input port because of the low demand of the input bandwidth.

Figure 5 is the architecture of the interpolation pixel

generator in HU we proposed. In this architecture, we

use only two adders and seven registers. The G register means that the clock of this register is gated control. We use only one 8-bit input port because of the low input bandwidth. Figure 6 and Table 1 shows the data flow in the interpolation pixel generator. The input data stream

is C, PO, Pl;.. ., P8, the output data stream is HO, H1,

H2;.-,H7, and the current block pixel C is stored in the register A during the interpolation cycle. From Table 1 we see that for an interpolation cycle, it needs eleven

clock cycles. So for a 8 X 8 block size, it needs 11 X 8 X

cycles. This number is still smaller than the clock cycles we need in IU. We use only one FE in HU. But we have to calculate nine candidates during a block matching, we use nine shift

partial absolute error in each ca

HU.

Table 1. Interpolation pixel generator data flow

5. IMPLEMENTATION

To reduce the design period and meet the low-power requirement, we use full-custom and cell-based hybrid design in this work. In the PE array, we use full-custom design because of its large area and high power consumption. And we use 2.5 volts supply voltage in it. In the other modules such as controller, we use cell- based design to reduce the design period. Since the cell library we used is only support 5-volt supply voltage, we use dual supply voltage levels in our chip.

To minimize the power consumption in the shift registers, we use the TSPC register proposed in [8] here. Because there are two voltage levels in our chip, we need a voltage swing circuit. For the low-to-high level converter, we use the low-to-high level converter proposed in [9]. This chip is implemented with TSMC 0.6um single-poly triple-metal technology and uses the COMPASS cell library. Figure 7 shows the physical layout view with the whole chip. The detail description

of the chip specification is listed in Table 2.

Table 2. Chip specification

6. COWCEEJS~ION

In this paper, we proposed a low-power FSBM motion H.263+ such as half-pixel precision, AP’ mode, PB mode, and RRU mode. By the properly design for the PE cell, this architecture can fit variable block size and searching range requirement in a signal chip and still consumes less power.

This chip was implemented by TSMC 0.6um single- poly triple-metal technology. In the circuit level, we used TSPC register to minimize the power consumption in shift registers and dual voltage levels to realize the hybrid design strategy. From the simulation result, this chip can work at 60MHz and the power consumption is 334mw. It is enough to the real time requirement for RRU mode in H.263+ with QCIF picture format.

architecture thab supports some featu

7. REFERENCE

ITU-T Recommendation H.263 version 2: “Video coding for low bit rate communication,” Sep. 1997.

G. Fujita, T.Onoye, and I. Shirakawa, “A New

Motion Estimation Core Dedicated to H.263 Video Coding,” IEEE Int. Symp. Circuits Syst., vol. 3, pp. 1161-1164, June 1997.

S. H. Nam and M. K. Lee, “Flexible VLSI Architecture of Motion Estimator for Video Image

(4)

r41

[51

[61

,lpul Dall

-

B

Application," IEEE Trans. Circuits Syst., vol. 43, no. 6, pp. 467-470, June 1996.

K. M. Yang, M. T. Sun, and L. Wu, "A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm," IEEE Trans. Circuits

Syst. Video Technol., vol. 36, no. 10, pp1317-

1325, Oct. 1989.

M. J. Chen, L. G. Chen, T. D. Chiueh, and Y. P.

Lee, "A New Block-Matching Criterion for Motion Estimation and its Implementation," IEEE

Trans. Circuits Syst. Video Technol., vol. 5 , no. 3,

pp. 231-236, Jun. 1995.

L. G. Chen, W. T. Chen, Y. S. Jehng and T. D.

Chiueh, "An Efficient Parallel Motion Estimation Algorithm for Digital Image Processing," IEEE

Trans. Circuits Syst. Video Technol., vol. 1, no. 4,

Pixel 8 , ErCO' ,' b Y O t l D O V I C l O i

+

Comparator Interpolation Processing i s , b Error

Pixel Generator Element

pp.378-395, Dec. 1991.

Y. K .Lai, L. G. Chen, H. T. Chen, M. J. Chen, Y.

P. Lee and P. C. Wu, "A Novel Video Siganl

Processor with Programmable Data Arrangement and Efficient Memory Configuration," IEEE Trans. Consumer Electronics, vol. 42, no. 3, pp.526-534, Aug. 1996.

J. Yuan and Svensson C., "High-speed CMOS circuit technique," IEEE J. Solid-state Circuits, vol. 24, no. 1, pp. 62-70, Feb. 1989.

J. S. Caravella and J. H. Quigley, "Three volt to five volt interface circuit with device leakage limited dc power dissipation," IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. 448-451, Sep. 1993.

[7]

[8]

[9]

Figure 6. Input pixels and the half-

Figure 3. Proposed PE architecture pixel positions

Figure 4. Proposed HU architecture

Figure 5. Interpolation pixel generator Figure7.Whole chip physical layout