Platform architecture design for MPEG-4 video coding


PLATFORM ARCHITECTURE DESIGN FOR MPEG-4 VIDEO CODING

Wei-Min Chao, Yung-Chi Chang, Chih-Wei Hsu, and Liang-Gee Chen

DSP/IC Design Lab

Graduate Institute of Electronics Engineering and Department of Electrical Engineering

National Taiwan University

1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan

{hydra,watchmanjeromn,lgchen}@video.ee.ntu.edu.tw

ABSTRACT

This paper presents a cost-effective platform architecture design for MPEG-4 video coding. A fast motion estimator architecture supporting predictive diamond search and spiral full search with halfway termination is implemented to strike a good compromise between compression performance and design cost. An efficient block-level scheduling scheme for the texture coding engine is employed to reduce the hardware cost. Both key modules are integrated into an efficient platform in a hardware/software co-design fashion. With a high degree of optimization at both the algorithm and architecture levels, a cost-efficient video encoder is implemented. It consumes 256.8 mW at 40 MHz and achieves real-time encoding of 30 CIF (352x288) frames per second.

1. INTRODUCTION

The emerging MPEG-4 standard is becoming the main technique for mobile devices and streaming video applications such as smart phones and handheld PDAs. In these applications, low power, low cost, high flexibility, and high performance are the four key issues in implementing a video coding system that meets real-time specifications and accommodates future applications.

Several MPEG-4 video chips have been reported in the past. To satisfy the rich functionality of future multimedia, some are implemented in software [1], [2] on a low-power DSP platform. They have the highest flexibility, but to achieve real-time performance under limited resources, fast algorithms for motion estimation (ME) and the discrete cosine transform (DCT) are applied, and the compression quality degrades as a result. Some [3] use a dedicated hardware methodology to achieve low power and low area cost; their disadvantages are little room for future modification toward advanced algorithms and a higher design effort. Hence, some [4], [5] adopt hybrid software/hardware co-design to balance performance and flexibility for the complex coding flow.

In this paper, a RISC-based platform with hardware accelerators is presented to implement MPEG-4 video encoding algorithms. Optimization is applied at both the algorithm and architecture levels, and both the key components and the optimization of their connections are discussed. First, the coding system is divided into three main subsystems (motion, texture, and bitstream), which are optimized by observing their relationship at the algorithm and architecture levels. In the motion subsystem, a hybrid motion estimator supporting both predictive diamond search and spiral full search with halfway termination, for real-time or high-compression-quality applications respectively, is proposed to reduce the dominant cost in a typical coding system. In the texture subsystem, an efficient interleaving schedule and a substructure sharing technique among quantization and DC/AC prediction are proposed [6] to reduce the cost further. In the bitstream subsystem, to handle the complex bitstream syntax and avoid inefficient bit-level storage, a hardware/software co-operation scheme is applied to bitstream generation. After the optimizations described above, a compact MPEG-4 video encoder chip is implemented; it occupies 5.02x5.13 mm² in a 4-layer-metal, 0.35 µm CMOS standard cell process. It is much smaller than any MPEG-4 video encoder previously reported and achieves the same performance. It consumes 257 mW at 40 MHz and achieves real-time encoding of 30 CIF (352x288) frames per second.

0-7803-7750-8/03/$17.00 ©2003 IEEE

2. MPEG-4 VIDEO ENCODER ARCHITECTURE

Fig. 1 depicts the proposed platform-based system with hardware accelerators that provides the MPEG-4 video coding functionality. The RISC is responsible for macroblock-level hardware scheduling, coding mode decision, motion vector coding, and other high-level procedures. The hardware accelerators improve system performance by processing in parallel where the algorithms allow it. The motion estimator (ME) carries out motion estimation with a search range of -16.0 to +15.5 pixels. The motion compensator (MC) interpolates pixels in reference frames into compensated blocks according to the specified motion vectors. The texture block engine (TBE) carries out the discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), quantization (Q), inverse quantization (IQ), and AC/DC prediction on texture pixels in block units. The bitstream generator (BTS) produces headers, motion information, and texture information as variable length codes. In addition, shared memory provides direct channels from MC to TBE and from TBE to BTS to decrease the traffic on the data bus. The sequencer (SEQ) handles the pixel-by-pixel scheduling of these shared memories without involving the RISC. The DMA, driven by dedicated commands issued by the RISC or SEQ, efficiently generates the proper addresses.

Four global bus channels are used in this system. First, the RISC bus broadcasts control information to each hardware module. After carrying out the operations issued by the RISC, the hardware modules return the processed side information on which the RISC depends to decide the coding modes for macroblocks. At the same time, the source, reference, and reconstructed frames required by the hardware modules pass through the DMA and are provided on the DATA bus. The hardware modules access this data automatically according to a pre-determined schedule. These parts are integrated into a single chip, with the firmware stored off-chip and loaded through the PROGRAM bus so that the chip remains programmable after tape-out. The SHARE bus can transfer DCT coefficients, quantized coefficients, or other intermediate information in the testing mode, which reduces development time and effort.

Fig. 1. System architecture
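As an illustration of what the motion compensator computes, the following is a minimal software sketch of half-pel block interpolation. It assumes the usual bilinear half-sample rule with a rounding-control bit; the function name `half_pel_block`, its parameters, and the convention of passing coordinates in half-pel units are hypothetical and not taken from the chip design, and no boundary padding is modelled.

```python
def half_pel_block(ref, x2, y2, size=8, rounding=0):
    """Return a size x size block whose top-left corner is at (x2/2, y2/2).

    ref      : 2-D list of pixel values (reference frame), indexed ref[y][x]
    x2, y2   : absolute position in half-pel units (assumed non-negative and
               far enough from the frame edge; padding is not modelled)
    rounding : rounding-control bit (0 or 1), an assumed parameter
    """
    x, y = x2 >> 1, y2 >> 1      # integer-pel part of the position
    hx, hy = x2 & 1, y2 & 1      # half-pel flags per axis
    block = []
    for j in range(size):
        row = []
        for i in range(size):
            a = ref[y + j][x + i]
            if not hx and not hy:            # integer position: plain copy
                v = a
            elif hx and not hy:              # horizontal half-pel
                v = (a + ref[y + j][x + i + 1] + 1 - rounding) >> 1
            elif hy and not hx:              # vertical half-pel
                v = (a + ref[y + j + 1][x + i] + 1 - rounding) >> 1
            else:                            # diagonal half-pel: four neighbours
                v = (a + ref[y + j][x + i + 1] + ref[y + j + 1][x + i]
                     + ref[y + j + 1][x + i + 1] + 2 - rounding) >> 2
            row.append(v)
        block.append(row)
    return block
```

For example, `half_pel_block(frame, 2 * 10 + 1, 2 * 4, size=8)` would fetch the 8x8 block at horizontal position 10.5 and vertical position 4.0 of a hypothetical `frame`.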

3. MOTION ESTIMATION

3.1. Algorithm

Motion estimation is the key technique of video coding; it reduces the temporal redundancy of sequences to make compression efficient. Among motion estimation algorithms, the full search block matching (FSBM) algorithm is well known and commonly used in video coding systems because of its good performance and regularity. However, it requires huge computational power to meet real-time requirements. Dedicated hardware with parallel processing is usually employed, which makes the design costly. Moreover, in the MPEG-4 standard the encoder should choose the optimal prediction blocks among various block sizes and at finer pixel precision, which makes it difficult to handle these operations at an acceptable cost while maintaining the same compression quality. To meet the requirements of various applications at an acceptable cost, we adopt two algorithms for motion estimation of 16x16 blocks at integer-pixel precision. One is the spiral full search with halfway termination (called fast full search, FFS), which achieves the same compression efficiency as the full search algorithm. The other is the diamond search starting from a predictor derived from neighboring macroblocks (called predictive diamond search, PDS), which meets the real-time specification at the cost of some visual quality degradation. Afterwards, a hierarchical scheme is applied to estimate motion for the four 8x8 pixel blocks in a macroblock, searching -2 to +2 positions around the previous best motion vector. Half-pixel refinement is also applied to all integer-pixel motion vectors found.

Fig. 2 depicts all stages of motion estimation, which proceed as follows. (1) The predictor is determined from neighboring macroblocks. (2) The PDS mode or FFS mode is employed to find the integer-pixel motion vector. (3) Half-pixel refinement is applied around the motion vector found in phase 2. (4) For the four 8x8 pixel blocks in a macroblock, a spiral search over -2 to +2 is applied to obtain four optimal motion vectors. (5) Half-pixel refinement is applied four times, around the motion vectors found in the previous phase.

Fig. 2. Algorithms of motion estimation (legend: initial phase; refinement phase, large diamond; last phase, small diamond; half-pel refinement)
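The integer-pixel PDS phase can be sketched in software as follows. This is an illustrative sketch only, under stated assumptions: the paper says the predictor is derived from neighboring macroblocks but does not give the rule, so a component-wise median of three neighbors is assumed here; the large and small diamond point sets follow the classic diamond search and are likewise assumed. All function names are hypothetical.

```python
# Assumed search patterns (classic diamond search); the paper names only a
# large and a small diamond, not their exact point sets.
LARGE_DIAMOND = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
                 (0, 2), (-1, 1), (-2, 0), (-1, -1)]
SMALL_DIAMOND = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]


def sad(cur, ref, bx, by, mv, n=16):
    """Sum of absolute differences between the current block and a candidate."""
    mx, my = mv
    return sum(
        abs(cur[by + j][bx + i] - ref[by + my + j][bx + mx + i])
        for j in range(n)
        for i in range(n)
    )


def median_predictor(left, top, top_right):
    """Component-wise median of three neighbouring motion vectors (assumed rule)."""
    return tuple(sorted(c)[1] for c in zip(left, top, top_right))


def predictive_diamond_search(cur, ref, bx, by, pred, n=16, rng=16):
    """Return (motion_vector, sad) for the n x n block at (bx, by).

    The block at (bx, by) is assumed to lie fully inside the frame.
    """
    height, width = len(ref), len(ref[0])

    def valid(mv):
        mx, my = mv
        return (-rng <= mx < rng and -rng <= my < rng
                and 0 <= bx + mx <= width - n
                and 0 <= by + my <= height - n)

    def best_around(center, pattern):
        cands = [(center[0] + dx, center[1] + dy) for dx, dy in pattern]
        cands = [mv for mv in cands if valid(mv)]
        # Ties go to the centre, so every move strictly lowers the SAD
        # and the loop below always terminates.
        return min(cands,
                   key=lambda mv: (sad(cur, ref, bx, by, mv, n), mv != center))

    center = pred if valid(pred) else (0, 0)       # initial phase
    while True:                                    # refinement phase
        best = best_around(center, LARGE_DIAMOND)
        if best == center:
            break
        center = best
    final = best_around(center, SMALL_DIAMOND)     # last phase
    return final, sad(cur, ref, bx, by, final, n)
```

Because the search only follows locally decreasing SAD, it can stop at a local minimum; this is consistent with the quality degradation the text attributes to the PDS mode, whereas the FFS mode examines every candidate.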

3.2. Architecture

Fig. 3 depicts the hardware architecture of the motion estimator supporting PDS and FFS. The architecture mainly comprises three processing stages and two buffers that store the current MB and the search window. Before motion estimation is performed, the video coding system transfers data from external memory into these buffers so that the subsequent sum of absolute differences (SAD) calculations consume no bus bandwidth. Meanwhile, the adder tree accumulates the sum of the pixels in the current MB and saves it in a register for the later mode decision. To speed up data loading and reduce bus traffic, the search window buffer can be loaded using a column-by-column data-reuse scheme. After motion estimation starts, the pattern generation (PG) stage generates the valid candidate positions. These positions are then passed through the FIFO stage and fetched by the distortion calculation (DC) stage. The DC stage calculates the SAD of the candidate positions and finds the minimum one. The accumulation comparison elimination (ACE) unit performs the partial distortion elimination (PDE) algorithm to reduce the computational complexity.

Fig. 3. Architecture of motion estimator (labels recoverable from the figure: data loading path, spiral/diamond pattern generation, 64-bit adder tree, terminate signal, distortion calculation, minimum motion vector with SAD)

3.3. Performance

The MPEG-4 standard defines only the decoder and leaves the implementation of the encoder an open problem. Different algorithms can be adopted under different conditions of cost, bit-rate, and picture quality. We use a novel motion estimator that supports the PDS or FFS algorithm to balance compression performance against design cost. The PDS mode satisfies the real-time specification, while the FFS mode achieves the same compression quality as the MPEG-4 software verification model (VM) [7]. To explore the degradation in the PDS mode, four sequences with different features are used as test patterns. The average PSNR difference between PDS and VM is only 0.136 dB, and the maximum PSNR drop over the test sequences is only 0.618 dB. Even in the frames where the PSNR difference is largest, the two are subjectively indistinguishable. When encoding in the FFS mode, the PSNR and bit-rate of the reconstructed frames are almost the same as those of frames encoded by the VM; the average PSNR is even higher by 0.00625 dB. The general R-D curves for the test sequences are simulated and shown in Fig. 4.
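The reason the FFS mode can match an exhaustive search is that halfway termination abandons a candidate only once its partial SAD already reaches the best SAD found so far, so the winning candidate is never discarded. The sketch below illustrates this under stated assumptions: candidates are visited ring by ring around the zero vector as a stand-in for the spiral order, the termination check is made once per 16-pixel row (the granularity used in the hardware is not assumed), and the names `ring_ordered_offsets` and `search` are hypothetical.

```python
import random


def ring_ordered_offsets(rng=16):
    """Integer offsets in [-rng, rng-1]^2, nearest rings first (spiral stand-in)."""
    pts = [(dx, dy) for dy in range(-rng, rng) for dx in range(-rng, rng)]
    return sorted(pts, key=lambda p: (max(abs(p[0]), abs(p[1])), p))


def search(cur, ref, bx, by, n=16, rng=16, terminate=True):
    """Return (best_mv, best_sad, rows_accumulated) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad, rows = None, float("inf"), 0
    for mx, my in ring_ordered_offsets(rng):
        # Skip candidates that would read outside the reference frame.
        if not (0 <= bx + mx <= width - n and 0 <= by + my <= height - n):
            continue
        total = 0
        for j in range(n):
            cur_row, ref_row = cur[by + j], ref[by + my + j]
            total += sum(abs(cur_row[bx + i] - ref_row[bx + mx + i])
                         for i in range(n))
            rows += 1
            # Halfway termination: this candidate can no longer win.
            if terminate and total >= best_sad:
                break
        if total < best_sad:
            best_mv, best_sad = (mx, my), total
    return best_mv, best_sad, rows


# Small demonstration on synthetic data (seeded pseudo-random texture).
rnd = random.Random(0)
ref = [[rnd.randrange(256) for _ in range(48)] for _ in range(48)]
cur = [[ref[(y - 3) % 48][(x + 5) % 48] for x in range(48)] for y in range(48)]
fast = search(cur, ref, 16, 16, terminate=True)
full = search(cur, ref, 16, 16, terminate=False)
print("same result:", fast[:2] == full[:2])
print("rows accumulated (fast, full):", fast[2], full[2])
```

Both calls return the same motion vector and SAD; only the amount of accumulation work differs, which is the saving the halfway termination targets.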

4. CONFIGURABLE PLATFORM PROTOTYPING

A configurable platform is used to verify the functionality of our architecture design. The prototyping board is connected to the host computer through the PCI interface. Four separate memories with DMA modules handle the PROGRAM, DATA, SHARE, and BITSTREAM buses of our design. An arbiter is responsible for memory access through PCI and memory. The MPEG-4 video encoder design is synthesized and placed on the FPGA chip. The program that runs on the RISC processor is compiled to machine code by the host computer and then sent to the program memory. Raw image data is transferred from the host computer to the frame memory on the prototyping board, and video encoding proceeds concurrently. Afterwards, the bitstream data is stored in the bitstream memory and then read by the host computer. In addition, the shared memory can record intermediate information for debugging in the testing mode.

Fig. 4. R-D curves with PDS and FFS modes

Fig. 5. Reconfigurable platform

5. IMPLEMENTATION

Fig. 6 shows a micrograph of the encoder LSI, and Table 1 lists its characteristics. The LSI contains 828K transistors and is fabricated on a 5.02 x 5.13 mm² die in a 0.35 µm single-poly quadruple-metal CMOS process. The chip has been tested and works successfully. With a 3.3 V supply, it consumes 256.8 mW at a 40 MHz working frequency. Table 2 shows the number of transistors, the area, and the size ratio to the whole LSI for each unit.
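To put the 40 MHz operating point in perspective, the clock-cycle budget per macroblock implied by the stated figures can be worked out as below. This is a back-of-the-envelope sketch, not a figure reported in the paper; the function name is hypothetical, and the full-search workload assumes 32 x 32 integer candidates (offsets -16 to +15 on each axis) and 256 pixels per 16x16 macroblock.

```python
def cycles_per_macroblock(clock_hz, width, height, fps, mb=16):
    """Return (cycle budget per macroblock, macroblocks per frame)."""
    mbs_per_frame = (width // mb) * (height // mb)
    return clock_hz / (mbs_per_frame * fps), mbs_per_frame


budget, mbs = cycles_per_macroblock(40_000_000, 352, 288, 30)
print(mbs)             # 396 macroblocks in a CIF frame (22 x 18)
print(round(budget))   # about 3367 cycles available per macroblock

# Assumed exhaustive integer-pel workload: 32 x 32 candidates, 256 pixels each.
full_search_diffs = 32 * 32 * 16 * 16
print(full_search_diffs)   # 262144 absolute differences per macroblock
print(full_search_diffs / budget)   # differences needed per cycle for ME alone
```

Under these assumptions an exhaustive search would need several tens of absolute differences every cycle even if the whole budget went to motion estimation, which illustrates why the design relies on halfway termination and on the PDS mode to meet the real-time target.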


6. CONCLUSION

In this paper, an efficient platform architecture design with hardware accelerators for an MPEG-4 Simple Profile@Level 3 video encoder is proposed. The hardware modules are written in Verilog and verified in a modular fashion, while the firmware is written in assembly. Co-design and co-simulation are employed to reduce the development time. An efficient reconfigurable FPGA prototyping system is also used to verify the functionality. With the cost-effective hybrid motion estimation and interleaving DCT/IDCT hardware modules, the system is implemented in a 5.02x5.13 mm² die with a 0.35 µm CMOS process. It works at 40 MHz and consumes 256.8 mW while meeting the real-time encoding specification.

7. REFERENCES

[1] A. Hatabu et al., "QVGA/CIF Resolution MPEG-4 Video Codec Based on a Low-Power and General Purpose DSP," SIPS, vol. 23, pp. 27-49, 2002.

[2] T. Kumura et al., "VLSI DSP for Mobile Applications," IEEE Signal Processing Magazine, vol. 23, pp. 27-49, 2002.

[3] M. Takahashi et al., "An MPEG-4 Video LSI with an Error-Resilient Codec Core Based on a Fast Motion Estimation Algorithm," IEEE International Solid-State Circuits Conference, vol. 35, pp. 1713-1721, Feb 2002.

[4] M. Takahashi et al., "A 60-MHz 240-mW MPEG-4 Videophone LSI with 16-Mb Embedded DRAM," IEEE Journal of Solid-State Circuits, vol. 35, pp. 1713-1721, Nov 2000.

[5] J.H. Park et al., "MPEG-4 Video Codec on an ARM core and AMBA," MPEG-4 Proceedings of Workshop and Exhibition, vol. 35, pp. 95-98, June 2001.

[6] C.W. Hsu, W.M. Chao, Y.C. Chang, and L.G. Chen, "Cost-Effective Scheduling of Texture Coding for MPEG-4 Video," IEEE International Conference on Multimedia and Expo (ICME'02), Aug 2002.

[7] T. Sikora, "The MPEG-4 Video Standard Verification Model," IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 19-31, Feb 1997.

Fig. 6. Micrograph of this encoder

Table 1. Characteristics of the encoder chip

Technology                    TSMC 0.35 µm 1P4M CMOS
Die size                      5.02 x 5.13 mm²
Transistor count              828,692 trans.
On-chip memory                39,080 bits
Off-chip memory               2,027,527 bits
Clock frequency               40 MHz
Voltage                       3.3 V
Power consumption             256.8 mW
Package                       208 CQFP
Function                      MPEG-4 SP@L3 video encoder
Motion estimation algorithm   Predictive diamond search, search range -16.0 to +15.5, advanced prediction mode
Encoding complexity           352 x 288 at 30 fps

Table 2. Cost distribution

Unit                 Trans. (k)   Area (mm²)   Size ratio (%)
ME                   288          5.8          22.6
MC                   53           0.3          1.2
DCT/IDCT in TBE      126          1.6          6.2
Q/IQ in TBE          64           0.7          2.9
ACDCP in TBE         22           0.8          3.0
RISC                 112          1.8          7.0
DMA                  19           0.3          1.2
VLC                  95           0.7          2.7
Share MEM            68           2.8          10.9
Others (PAD etc.)    49           10.9         42.3
Total                829          25.8         100.0