JPEG, MPEG-4, and H.264 Codec IP Development
Chung-Jr Lian, Yu-Wen Huang, Hung-Chi Fang, Yung-Chi Chang and Liang-Gee Chen
Graduate Institute of Electronics Engineering and
Department of Electrical Engineering
National Taiwan University, Taipei 10617, Taiwan, R.O.C.
[email protected]
Abstract
This paper summarizes our design experience with various image and video codec IPs. The design issues and methodology of custom video codecs are discussed. The design methodology can be summarized in four stages: system analysis, algorithm optimization, architecture exploration, and code development. Based on these guidelines, several design cases are presented, including the proposed JPEG, MPEG-4, and H.264 architectures.
1. Introduction
Image and video codec IPs play an important role in today’s highly demanding multimedia appliances. Most codecs are implemented as dedicated architectures because of the high computational complexity under real-time constraints. To obtain a high-performance, cost-efficient architecture, designers must first gain an insightful understanding of the characteristics of video data and coding algorithms, and then apply architecture design techniques to achieve highly parallel designs with smooth data flow and high hardware utilization. In the following sections, the design methodology is discussed, followed by several design cases and a conclusion.
2. Design Methodology
The design methodology discussed here is partitioned into four stages: 1) system analysis, 2) algorithm optimization, 3) architecture exploration, and 4) design coding and verification.
System analysis is the first step and identifies the critical problems of the system under design. Profiling tools are used for complexity analysis, for understanding the characteristics of each module, and for bottleneck identification. In a codec system, some modules are computation-intensive, while others are control-intensive. The bottleneck may be computation, memory size, or bandwidth. For video encoders, the profiling data show that the bottleneck is motion estimation (ME). This module is therefore typically implemented as a highly parallelized array processor with carefully designed I/O and local buffer allocation. Data reuse techniques can be applied further to reduce memory bandwidth and share computations. Bitstream parsing in a decoder, in contrast, is characterized by bit-wise processing. Though it is not computationally complicated, a custom architecture is necessary for efficient bit-level operations. Based on the analysis, the design goal is to map each module in a codec to an efficient processing element architecture.
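To illustrate why ME tends to dominate an encoder profile, the following Python sketch performs full-search block matching with a sum-of-absolute-differences (SAD) cost. It is a software model under stated assumptions, not the array processor discussed above: the function names, the SAD criterion, and the frame representation (lists of rows of integers) are illustrative choices.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n current block at (bx, by) and the
    reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search):
    """Return (best_sad, (dx, dy)) over all displacements within
    +/-search that keep the candidate block inside the reference frame."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Hypothetical 16x16 test frames: every reference pixel is unique, and the
# current frame is the reference shifted by 2 pixels in x and 1 pixel in y.
ref = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[ref[(y + 1) % 16][(x + 2) % 16] for x in range(16)] for y in range(16)]
result = full_search(cur, ref, bx=4, by=4, n=4, search=2)
```

Each block examines up to (2*search + 1)^2 candidates of n^2 pixels each, and neighboring candidates overlap heavily in the reference frame. That overlap is what data reuse and local buffering can exploit in a hardware design.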
Hardware-oriented optimization at the algorithmic level is crucial in architecture design. Optimization at a higher level generally has a greater impact on the entire system. Classic examples are the optimization of the discrete cosine transform (DCT) and of fast ME algorithms with respect to hardware cost, processing speed, and power. Hardware feasibility is a further issue in some modules, since some software-based algorithms, such as recursive processing, may not be suitable for dedicated implementations and need to be modified.
There are various methodologies and techniques [1][2] for mapping algorithms to hardware architectures. An architecture is closely tied to the design specifications, such as area, speed, power, and the functions to be provided. Due to the tough real-time constraints, pipelining and parallelizing are the most frequently used techniques in codec designs. Inherent parallelism in an algorithm is extracted and efficiently mapped to multiple processing elements. System pipelining and scheduling must be carefully designed to minimize the inter-module buffer size and increase the hardware utilization.
Coding rules and simulation approaches are important in the Verilog code development stage. Disciplined coding styles help prevent inconsistencies between pre- and post-synthesis behavior, and are also beneficial for design maintenance. Commercial source code linting tools are used for our code checking. During the code development stage, a large number of simulations are necessary to cover various parameters and conditions. Fast simulation and efficient error diagnosis are the keys to shortening the development time. FPGA-PC co-simulation can help speed up the intermediate verification of each module, and it also serves as a platform for final emulation and demonstration of the entire system.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 1530-1591/05 $ 20.00 IEEE
3. Proposed Codec IPs
In this section, design results and experiences of several codec IPs are presented, including JPEG, MPEG-4, and H.264.
JPEG is widely used in digital imaging applications and video surveillance systems. The digital still camera (DSC) is the most typical application. The proposed hardwired JPEG engine [3][4] can easily support both high-speed still image and motion-JPEG processing at a very low clock frequency. The most computation-intensive module, DCT/IDCT, is based on a compact row-column decomposition architecture. The other modules are designed to be cascaded seamlessly such that no extra buffer is required for inter-module data flow smoothing. Because of the fully pipelined, smooth data flow, high throughput and a compact design are achieved.
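Row-column decomposition relies on the separability of the 2-D DCT: an 8x8 transform can be computed as eight 1-D transforms on the rows followed by eight 1-D transforms on the columns of the intermediate result. The Python sketch below is a floating-point software model of that idea only; the function names and the orthonormal scaling are assumptions made for illustration, and it does not reflect the fixed-point datapath or transpose memory of the proposed engine.

```python
import math

N = 8  # block size used by JPEG


def dct_1d(v):
    """Orthonormal N-point DCT-II of the sequence v."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                  for n in range(N))
        out.append(scale * acc)
    return out


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """2-D DCT as a row pass, a transpose, and a second row pass."""
    rows = [dct_1d(r) for r in block]             # 1-D DCT on every row
    cols = [dct_1d(c) for c in transpose(rows)]   # 1-D DCT on every column
    return transpose(cols)                        # back to [v_freq][h_freq]
```

The same 1-D unit serves both passes, which is one reason the decomposition leads to compact hardware: only a transpose buffer separates the two passes.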
A codec IP is usually expected to be a stand-alone processor. In this case, the master processor only has to start the IP and then wait for the output data to be ready. The proposed JPEG engine meets this requirement and is an entirely custom design supporting complete JPEG coding and decoding, including file syntax handling, bit-packing of variable length codes, and Huffman decoding for user-defined tables. Experience shows that although software run-time profiles indicate these tasks are not computation-critical, they usually require more design effort in coding and debugging than the DCT/IDCT module, which has regular processing elements and simple control.
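The bit-packing task mentioned above can be sketched in a few lines of Python. The class below is a hypothetical software model, not the proposed hardware: it concatenates variable length codes into bytes, inserts the 0x00 stuffing byte that JPEG requires after every 0xFF in entropy-coded data, and pads the final byte with 1-bits. The class and method names are illustrative.

```python
class BitPacker:
    """Packs variable length codes MSB-first into a JPEG-style byte stream."""

    def __init__(self):
        self.out = bytearray()
        self.acc = 0     # pending bits, right-aligned
        self.nbits = 0   # number of pending bits in acc

    def put(self, code, length):
        """Append the low `length` bits of `code` to the stream."""
        self.acc = (self.acc << length) | (code & ((1 << length) - 1))
        self.nbits += length
        while self.nbits >= 8:
            self.nbits -= 8
            byte = (self.acc >> self.nbits) & 0xFF
            self.out.append(byte)
            if byte == 0xFF:
                self.out.append(0x00)  # byte stuffing after 0xFF
        self.acc &= (1 << self.nbits) - 1  # keep only the pending bits

    def flush(self):
        """Pad the last partial byte with 1-bits and return the stream."""
        if self.nbits:
            pad = 8 - self.nbits
            self.put((1 << pad) - 1, pad)
        return bytes(self.out)


packer = BitPacker()
packer.put(0b101, 3)      # 3-bit code
packer.put(0b11111, 5)    # 5-bit code completes the first byte
packer.put(0xFF, 8)       # produces 0xFF followed by a stuffed 0x00
packer.put(0b0, 1)        # 1 pending bit, padded on flush
stream = packer.flush()
```

Even this small model shows the control-heavy nature of the task: code lengths vary per symbol, byte boundaries fall at arbitrary positions, and stuffing changes the output length depending on the data.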
In our MPEG-4 encoder [5], a platform-based approach is adopted for the system architecture. The prototype supports real-time encoding of MPEG-4 Simple Profile Level 3 at 40 MHz. The system mainly consists of a RISC processor, an embedded SRAM, a DMA unit, a memory interface, wrappers for dedicated units, two signal buses, and dedicated accelerators for ME/motion compensation (MC), a block engine, and a variable length coder (VLC). The JPEG design experience with modules such as DCT/IDCT, quantization/inverse quantization, and VLC can be transferred to the MPEG-4 design. However, poor scheduling of the modules in the MPEG-4 coding loop requires large buffers and increases cost. Therefore, an interleaved DCT/IDCT scheduling is proposed. For the decoder, a programmable bitstream processor [6] is proposed to efficiently handle bit-level tasks.
The emerging H.264 standard is much more complicated than all previous standards. The computational load is higher, and there are many modes to be processed and then selected. The optimal data flow and pipelining stages are therefore different from those of previous MPEG algorithms. After analysis, a four-stage macroblock pipelining architecture [7] is proposed. The four stages are integer motion estimation, fractional motion estimation, the intra prediction engine, and the entropy coding and deblocking engines. The Lagrangian mode decision is also optimized for dedicated hardware feasibility. The processing capability is HDTV 720p at 30 frames/s with one reference frame and a search range of H±64/V±32 at 108 MHz.
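The quoted operating point implies a cycle budget for each stage of a macroblock pipeline, which the short Python calculation below makes explicit. It assumes that 720p denotes a 1280x720 frame and that macroblocks are 16x16 pixels; the function names are illustrative, and the result is a back-of-the-envelope estimate rather than a figure reported for the chip.

```python
def macroblocks_per_second(width, height, fps, mb_size=16):
    """Number of macroblocks that must be processed per second."""
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    return mbs_per_frame * fps


def cycles_per_macroblock(clock_hz, width, height, fps):
    """Average clock cycles available for each macroblock."""
    return clock_hz / macroblocks_per_second(width, height, fps)


# Assumed 720p geometry: 80 x 45 = 3600 macroblocks per frame.
mb_rate = macroblocks_per_second(1280, 720, 30)
budget = cycles_per_macroblock(108_000_000, 1280, 720, 30)
```

Under these assumptions the encoder must sustain 108,000 macroblocks per second, so at 108 MHz each pipeline stage has on the order of 1000 cycles per macroblock. This is why balancing the workload across the stages matters.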
4. Conclusion
In this paper, our image and video codec IP design experiences are presented. Design concepts, algorithm analysis and optimization, parallelism exploration, and efficient mapping are discussed. An FPGA development platform is adopted to cope with the high verification complexity of codec IP designs. By following this design methodology, many high-performance image and video codec IPs have been developed and successfully transferred to third parties for mass production.
References
[1] S. Y. Kung. VLSI Array Processors. Prentice-Hall, Englewood Cliffs, NJ, 1998.
[2] K. K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley-Interscience, New York, NY, 1999.
[3] C.-J. Lian, L.-G. Chen, H.-C. Chang, and Y.-C. Chang. Design and implementation of JPEG encoder IP core. In Proc. Asia and South Pacific Design Automation Conference (ASP-DAC’01), pages 29–30, Yokohama, Japan, Jan. 2001.
[4] C.-J. Lian, H.-C. Chang, K.-F. Chen, and L.-G. Chen. A JPEG decoder IP core supporting user-defined Huffman table decoding. In Proc. International Symposium on Integrated Circuits, Devices and Systems (ISIC’01), pages 497–500, Singapore, Sept. 2001.
[5] Y.-C. Chang, W.-M. Chao, C.-W. Hsu, and L.-G. Chen. Platform-based MPEG-4 SOC design for video communication. Journal of VLSI Signal Processing Systems, submitted for publication.
[6] Y.-C. Chang, C.-C. Huang, W.-M. Chao, and L.-G. Chen. An efficient embedded bitstream parsing processor for MPEG-4 video decoding system. Journal of VLSI Signal Processing Systems, submitted for publication.
[7] Y.-W. Huang, T.-C. Chen, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, C.-S. Chen, C.-F. Shen, S.-Y. Ma, T.-C. Wang, B.-Y. Hsieh, H.-C. Fang, and L.-G. Chen. A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications. In Proc. IEEE International Solid-State Circuits Conference (ISSCC’05), San Francisco, California, USA, Feb. 2005.