Memory-efficient architecture for JPEG 2000 coprocessor with large tile image


Bing-Fei Wu and Chung-Fu Lin

Abstract—Experimental results show that JPEG 2000 coding with a larger tile size (i.e., 256 × 256 or larger) yields better image quality. However, processing large tile images also requires a relatively large memory in the hardware implementation. For example, a straightforward architecture requires a tile memory of 256 K words to process a 512 × 512 tile image. To reduce hardware resources, we have proposed the quad code-block (QCB)-based discrete wavelet transform method, which reduces the size of the tile memory by a factor of 4. In this paper, the remaining 1/4 tile memory is further reduced through two approaches: the zero-holding extension, with slight image degradation, and the QCB-block size extension, without any image degradation. Processing a 512 × 512 tile image then requires only 12 K words of tile memory with the zero-holding extension, and 13.58 K words with the QCB-block size extension. The low memory requirement makes on-chip memory practicable.

Index Terms—Code-block, discrete wavelet transform (DWT), embedded block coding (EBC), JPEG 2000, quad code-block (QCB), tile size.

I. INTRODUCTION

JPEG 2000 provides a higher compression ratio (CR) and more functions than traditional JPEG. It integrates various functions (e.g., lossless, lossy, resolution, quality, and ROI) into a single codestream. In general, the main codestream is produced by the discrete wavelet transform (DWT), context formation (CF), and the MQ-coder, which can be regarded as the core blocks of the JPEG 2000 standard [1]. After the main compressed data are obtained, rate-distortion optimization is applied to decide the optimal truncation points of the main codestream. Using a large tile size for JPEG 2000 coding achieves a higher CR than using a small tile size [2], but it also requires more memory in the hardware implementation.

Considering these three core blocks, the DWT process requires an entire tile memory to carry out the subband transformation [3]. Afterward, the CF process divides each subband into several code-blocks and performs bit-plane coding (BPC). Several architectures have been proposed to speed up the computation-intensive BPC [4]–[7]. The MQ-coder then compresses the context-based information in a lossless way [8]. Many studies have been devoted to optimizing the individual components. However, the overall system suffers performance degradation and requires more memory to process larger tile images [9]–[11]. These bottlenecks are mainly caused by the various coding flows between the DWT and embedded block coding (EBC) processes [12], [13]. With the quad code-block (QCB)-based DWT [13], the internal buffer can be reduced by a factor of 4, but a large internal memory is still required when processing large tile images (e.g., 64 K words are required for a 512 × 512 tile size). In this paper, we propose two methods with QCB-based DWT to further reduce the internal tile memory. The first uses zero-holding extension for the LL band to process each QCB block. With slight image degradation, the internal tile memory can be reduced from 256 K words to 12 K words for a 512 × 512 tile image. The second increases the QCB-block size to recover the original DWT data path, instead of using zero-holding prediction. Without any image degradation, the QCB-block size extension method requires a tile memory of 13.58 K words to process a 512 × 512 tile image. The low memory requirement makes on-chip memory practicable, and the parallelism between the DWT and EBC processes can be enhanced.

Manuscript received April 16, 2005; revised July 19, 2005. This work was supported by the National Science Council under Grant NSC 94-2213-E-009-062. This paper was recommended by Associate Editor A. Loui.

The authors are with the Department of Electrical and Control Engineering, National Chiao Tung University, Hsinchu, 30050 Taiwan, R.O.C. (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSII.2005.862042

Fig. 1. Block diagram of JPEG 2000 encoder.

The paper is organized as follows. Section II briefly describes the core blocks of JPEG 2000. In Section III, we discuss the QCB-based DWT architecture with zero-holding extension and QCB-block size extension. The simulation results are shown in Section IV. In Section V, we compare the proposed architecture with other related works. A brief summary is given in Section VI.

II. JPEG 2000 BASIC BLOCKS

Fig. 1 shows the basic coding flow of JPEG 2000. First, an image is split into several rectangular tiles, and each tile is coded independently. The 2-D DWT decomposes a tile image into several subbands, LL, LH, HL, and HH, and the LL subband can be decomposed recursively into the next resolution. The DWT coefficients in each subband are partitioned into several code-blocks and processed by EBC independently. The EBC process carries out CF and MQ coding. The CF algorithm codes each code-block bit-plane by bit-plane and generates the context-based information. Then, the context data are coded by the MQ-coder in a lossless way to generate the main codestream. After all the main compressed data are obtained, rate-distortion optimization is applied to decide the optimal truncation points for lossy compression.

Fig. 2. PSNR of different tile sizes of JPEG 2000. (Lenna: 512 × 512 size, 4-level DWT decompositions of 5/3 filter).

Fig. 3. Straightforward implementation of JPEG 2000 coprocessor.

TABLE I

ARCHITECTURAL MODEL OF DIFFERENT METHODS FOR EBC. (L: CODE-BLOCK WIDTH, N: NUMBER OF CODING BIT-PLANE, : 0 TO 1)
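The recursive decomposition described above can be sketched in a few lines of Python. This is only an illustration: the function name `subband_layout` and its parameters are hypothetical, and it assumes a square tile whose side is a power of two and a fixed 32 × 32 code-block.

```python
def subband_layout(tile_size, levels, code_block=32):
    """Return (level, subband_side, code_blocks_per_subband) for each DWT level.

    Assumption: square tile, power-of-two side, square code-blocks.
    """
    layout = []
    side = tile_size
    for level in range(1, levels + 1):
        side //= 2  # each level halves the LL band in both directions
        blocks_per_side = max(1, side // code_block)
        layout.append((level, side, blocks_per_side ** 2))
    return layout


for level, side, blocks in subband_layout(512, 4):
    print(f"level {level}: each subband is {side}x{side}, {blocks} code-block(s)")
```

For a 512 × 512 tile with four levels, the LL band shrinks from 256 × 256 down to 32 × 32, which is why the number of code-blocks per subband drops quickly at the higher levels.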

In general, using a large tile size for JPEG 2000 compression achieves better image quality than using a small tile size [2]. Fig. 2 shows the peak signal-to-noise ratio (PSNR) for different tile sizes. Compared with small tile images, a large tile image provides more possible truncation points for rate-distortion optimization and has fewer tile blocking effects. Thus, it can provide better image quality even at higher CRs. Given the better coding efficiency of large tile images, it is reasonable to require that the hardware architecture support large tile sizes.

The straightforward implementation of a JPEG 2000 coprocessor is shown in Fig. 3 [9]–[11]. It uses an entire tile memory to carry out the 2-D DWT process and to provide the coefficients to the code-block memories for EBC. Multiple EBC processors are used to process the code-blocks in parallel [7], [9]–[11]. Several EBC architectures have also been proposed to accelerate the computation-intensive BPC; Table I classifies them into three speedup methods. These methods greatly decrease the computation cycles with appropriate hardware resources.

Although each component is optimized individually, the overall system may still require a large internal memory and suffer performance degradation when processing a large tile image. As shown in Table II, the tile memory size of the straightforward architecture is proportional to the size of the tile image. For example, it requires 256 K words of memory to process a 512 × 512 tile image, which makes on-chip memory impracticable. Moreover, processing a large tile image also causes latency between DWT and EBC.

TABLE II

MEMORY CONSTRAINT OF STRAIGHTFORWARD ARCHITECTURE

Fig. 4. QCB–DWT process in a tile image (tile size: N × N, code-block size: N/8 × N/8, QCB size: N/4 × N/4).

III. PROPOSED QCB–DWT ARCHITECTURE

To reduce the size of the internal tile memory, we divide a tile image into several QCB blocks before the DWT procedure [13]. As shown in Fig. 4, QCB block 0 goes through the QCB–DWT process and generates four code-blocks: three for EBC and one for the next DWT decomposition, recursively. Thus, three EBC processors can process three code-blocks at the same time, and the size of the internal tile memory can be reduced by a factor of 4. The broken DWT data path can be repaired by processing part of the previous data to recover the original data path [13], [15].

Although the tile memory can be reduced by a factor of 4, a large internal memory is still required to process a large tile image (e.g., 64 K words are required for a 512 × 512 tile image). Thus, we propose two approaches, the zero-holding extension and the QCB-block size extension, to further reduce the remaining 1/4 tile memory. By partially performing higher DWT decompositions, the remaining tile memory can be reduced effectively.
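The memory figures quoted so far follow from simple arithmetic, sketched below in Python. The helper names are hypothetical; the 2-byte word reflects the 16-bit word length used in the comparison of Section V.

```python
def tile_memory_words(tile_size, qcb_based=False):
    """Words of tile memory for a square tile.

    The straightforward architecture stores the whole tile; the QCB-based
    DWT needs one quarter of it.
    """
    words = tile_size * tile_size
    return words // 4 if qcb_based else words


def to_kbytes(words, bytes_per_word=2):
    """Convert a word count to KB, assuming 16-bit words by default."""
    return words * bytes_per_word / 1024


for size in (128, 512):
    full = tile_memory_words(size)
    quarter = tile_memory_words(size, qcb_based=True)
    print(size, full // 1024, "K words ->", quarter // 1024, "K words,",
          to_kbytes(full), "KB ->", to_kbytes(quarter), "KB")
```

For a 512 × 512 tile this gives 256 K words for the straightforward architecture and 64 K words after the factor-of-4 reduction, which is the starting point for the two methods that follow.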

A. Zero-Holding Extension

To reduce the 1/4 tile memory, a simple prediction method is applied to predict the unavailable data belonging to the neighboring QCB blocks. Fig. 5 shows the data flow of the LL band. Once a QCB of the LL band is obtained, it can be decomposed into the next DWT resolution immediately to produce three complete code-blocks. Since some QCB blocks adjacent to this QCB are not yet available, we use zero-holding extension to predict the unavailable data, based on the continuity of the image. Fig. 6(a) shows the periodic symmetric extension defined in the JPEG 2000 standard, which is applied to the start and the end of each complete data path. To predict the unavailable data, we use the zero-holding extension method, as shown in Fig. 6(b). With zero-holding prediction, part of the coefficients in the LL band can be decomposed as soon as the QCB is completely obtained. Although this causes slight image degradation, as simulated in Section IV, the size of the internal tile memory is reduced to be proportional to the number of DWT decompositions, i.e., only two QCB-size memories are required in this case: one is for band, and the other one is for band. For an N × N tile image with 32 × 32 code-block size, the number of DWT levels is defined as , and the internal tile memory size requires words. The zero-holding method requires the same number of memory accesses as the traditional DWT, because the unavailable data are predicted rather than read from memory.

Fig. 5. QCB–DWT for LL band with zero-holding extension.

Fig. 6. Periodic symmetric and zero-holding extension of signal. (a) Periodic symmetric extension of signal. (b) Zero-holding extension of signal.

Fig. 7. Boundary data to recover the data path between different QCB blocks.
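The two boundary treatments can be contrasted with a small Python sketch. It rests on an assumption that should be read with care: "zero-holding" is interpreted here as a zero-order hold, i.e., repeating the nearest available sample, which matches the stated reliance on image continuity but is not spelled out in the text. The function names are hypothetical.

```python
def symmetric_extend(x, n):
    """Whole-sample symmetric extension: mirror about the end samples
    without repeating them, as used at true signal boundaries."""
    left = [x[i] for i in range(n, 0, -1)]
    right = [x[-1 - i] for i in range(1, n + 1)]
    return left + list(x) + right


def zero_hold_extend(x, n):
    """Assumed zero-order hold: repeat the nearest available sample in place
    of data that belongs to a neighboring block and is not yet available."""
    return [x[0]] * n + list(x) + [x[-1]] * n


block = [1, 2, 3, 4]
print(symmetric_extend(block, 2))
print(zero_hold_extend(block, 2))
```

Under this reading, the held samples only approximate the true neighbors, which is consistent with the slight degradation reported along QCB boundaries.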

B. QCB-Block Size Extension

To fully comply with the JPEG 2000 standard, we can carefully increase the QCB-block size at each DWT level to recover the original DWT data path. As shown in Fig. 7, the original DWT data path of the 5/3 and 9/7 filters can be restored by reading two and four previous data, respectively, to produce the precise coefficients. To generate the last coefficient, one and three additional data are required for the 5/3 and 9/7 cases, respectively. Thus, instead of using zero-holding prediction, one can increase the QCB-block size to recover the data path and generate the precise coefficients. Fig. 8 shows the QCB-block size extension for the 5/3 filter at different DWT levels. If the QCB-block size at the final DWT is 64 × 64, we add two valid input data at the start and the end of the QCB-block for the previous DWT level. The additional valid data then preserve the original data path and produce the precise coefficients for the current QCB block. Compared with the zero-holding extension, the size of internal tile memory is defined as , where is 2 or 4 to preserve the additional valid data for the 5/3 or 9/7 filters. Table III shows the memory specifications for different code-block sizes for a 512 × 512 tile size. Choosing small code-blocks can greatly reduce the internal memory size. However, the penalty of QCB-block size extension, , would cause heavy memory accesses relative to the original small QCB-block. Since the number of memory accesses dominates the processing time of traditional DWT, there is a tradeoff between the internal memory size and the additional memory accesses of QCB-based DWT.

Fig. 8. QCB blocks for different DWT levels for the 5/3 filter.

TABLE III

MEMORY SPECIFICATIONS WITH DIFFERENT CHOSEN CODE-BLOCK SIZES
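The claim that a few extra boundary samples recover the exact data path can be checked in one dimension for the 5/3 filter. The Python sketch below is an illustration under stated assumptions, not the authors' hardware data path: it uses the standard reversible 5/3 lifting steps, one decomposition level, and hypothetical function names. The block version reads only two samples before the block and one sample after it.

```python
def dwt53_full(x):
    """One-level 1-D 5/3 integer lifting over a whole even-length signal."""
    n = len(x)

    def at(i):
        # Whole-sample symmetric extension at both ends of the signal.
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
        return x[i]

    d = [at(2 * k + 1) - (at(2 * k) + at(2 * k + 2)) // 2 for k in range(n // 2)]
    # Under symmetric extension the high-pass value before d[0] equals d[0].
    s = [x[2 * k] + ((d[k - 1] if k else d[0]) + d[k] + 2) // 4
         for k in range(n // 2)]
    return s, d


def dwt53_block(x, start, stop):
    """Coefficients for x[start:stop], reading only x[start-2 : stop+1].

    Requires an even start >= 2 and an even stop < len(x).
    """
    first = start // 2
    # The two previous samples rebuild the high-pass value just before the block.
    d_prev = x[start - 1] - (x[start - 2] + x[start]) // 2
    # The last high-pass value reads one sample past the block, x[stop].
    d = [x[2 * k + 1] - (x[2 * k] + x[2 * k + 2]) // 2
         for k in range(first, stop // 2)]
    s = []
    for j, k in enumerate(range(first, stop // 2)):
        prev = d_prev if j == 0 else d[j - 1]
        s.append(x[2 * k] + (prev + d[j] + 2) // 4)
    return s, d


signal = [(7 * i * i + 3 * i) % 256 for i in range(32)]
s_full, d_full = dwt53_full(signal)
s_blk, d_blk = dwt53_block(signal, 8, 24)
print(s_blk == s_full[4:12] and d_blk == d_full[4:12])
```

The block result matches the corresponding slice of the full-signal transform, so no prediction error is introduced; the cost is the extra samples that must be stored and read at each level.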

C. Proposed QCB-Based DWT Architecture

Fig. 9 shows the QCB-based DWT architecture. The tile memory is composed of several QCB-size memories. Table IV specifies the internal memory size of the proposed architecture for the 512 × 512 case. To process a 512 × 512 tile image with four or more levels of DWT decomposition, one needs three memory units (i.e., , and ) to store the coefficients of three QCB blocks belonging to , and bands. Fig. 10 describes the flowchart of QCB-based DWT for the JPEG 2000 coprocessor. Either the zero-holding extension or the QCB-block size extension can be used to perform QCB-based DWT.

Fig. 9. Architecture of QCB-based JPEG 2000 coprocessor.

TABLE IV

MEMORY REQUIREMENT OF THE PROPOSED ARCHITECTURE

Fig. 10. Flowchart of QCB-based DWT for JPEG 2000 coprocessor (tile image: 512 × 512, 4-level DWT).
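The tile memory totals reported in this paper for both methods can be reproduced with the Python sketch below. The formulas are a reconstruction inferred from those totals (12 K words and 13.58 K words for a 512 × 512 tile, 64 × 64 QCB, 5/3 filter), not the authors' stated expressions, and all names are hypothetical. It assumes one QCB memory per halving from the tile side down to the QCB side, and that each earlier level grows by `extra` samples at both the start and the end of the block.

```python
def qcb_memory_count(tile_size, qcb_size=64):
    """Assumed number of QCB memories: log2(tile_size / qcb_size)."""
    return (tile_size // qcb_size).bit_length() - 1


def zero_holding_words(tile_size, qcb_size=64):
    """Tile memory in words when every QCB memory keeps its nominal size."""
    return qcb_memory_count(tile_size, qcb_size) * qcb_size ** 2


def size_extension_words(tile_size, qcb_size=64, extra=2):
    """Tile memory in words when each earlier level is widened.

    extra is 2 for the 5/3 filter and 4 for the 9/7 filter; the widening
    rule (2 * extra more samples per side per level) is an assumption.
    """
    count = qcb_memory_count(tile_size, qcb_size)
    return sum((qcb_size + 2 * extra * i) ** 2 for i in range(count))


zh = zero_holding_words(512)
ext = size_extension_words(512)
print(zh / 1024, "K words with zero-holding extension")
print(round(ext / 1024, 2), "K words with QCB-block size extension")
```

With these assumptions, three memories of 64 × 64, 68 × 68, and 72 × 72 words account for the size-extension total, so exact coefficients cost roughly 13% more tile memory than the zero-holding variant.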

IV. SIMULATION RESULTS

To simulate the image quality of QCB-based DWT with zero-holding extension, we use 512 × 512 tile images and perform 4-level DWT decompositions with the 5/3 filter. Only 16 KB and 24 KB of tile memory are required to process the 256 × 256 and 512 × 512 tile images, respectively. As shown in Figs. 11 and 12, the PSNR of QCB-based DWT approaches that of the traditional DWT used in JPEG 2000, especially at medium and low bit rates. However, it may suffer more image degradation at low CRs. As shown in Fig. 13, QCB512 induces slight blur and quality loss along the boundaries of QCB blocks (e.g., the left eyelashes). Thus, we can choose a 128 × 128 tile size to support high bit rates or lossless coding without any image degradation. This is because the size of the LL band is then exactly the QCB-block size (i.e., 4 K words), so the zero-holding prediction is not applied in the QCB-based DWT.
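For reference, the PSNR metric used in these comparisons can be computed as in the Python sketch below. This is the standard definition for 8-bit samples, not code from the paper; the function name and the flat-list input format are assumptions made for brevity.

```python
import math


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images, e.g. lossless coding
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 4-pixel example with an error of one gray level per pixel.
print(psnr([10, 20, 30, 40], [11, 19, 31, 39]))
```

A lossless reconstruction gives an infinite PSNR, which is why the 128 × 128 tile case, where no prediction is applied, shows no degradation.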

Fig. 11. PSNR of QCB-based DWT for 512 × 512 Lenna with 4-level DWT (i.e., less than 0.5 dB degradation when CR is larger than 32 for QCB512). (a) CR from 0 to 100. (b) CR from 0 to 20.

Fig. 12. PSNR of QCB-based DWT for 512 × 512 Baboon with 4-level DWT.

We also use a software model to evaluate the overall performance of the JPEG 2000 coprocessor. Several test images are chosen to assess the performance of the proposed architecture with 32 × 32 code-block size. The throughput of line-based DWT can be approximated by the number of memory accesses [15], as shown in Fig. 14(a). For the QCB-based DWT, each QCB block requires cycles to access the memory, where is the number of additional data for recovering the original DWT data path at the first level of DWT decomposition. The throughput of EBC is based on the pass-parallel architectural model [5]. The computation cycles of each bit-plane are approximated by the maximal number of context-decision pairs among the three coding passes. The context-decision pairs are coded by the pipelined MQ-coder [8]. As shown in Fig. 14, the traditional DWT method requires cycles to carry out the first level of DWT decomposition and cycles are overlapped with EBC.
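The EBC cycle approximation described above, in which the three coding passes of a bit-plane run in parallel and the slowest pass sets the cost, can be written as a short Python sketch. The function name and the input data are hypothetical; only the max-over-passes rule comes from the text.

```python
def ebc_cycles(pairs_per_bitplane):
    """Approximate cycles for one code-block under a pass-parallel model.

    pairs_per_bitplane holds, for each bit-plane, the number of
    context-decision pairs produced by each of the three coding passes.
    """
    return sum(max(passes) for passes in pairs_per_bitplane)


# Made-up pair counts for two bit-planes of one code-block.
example = [(10, 4, 7), (3, 12, 5)]
print(ebc_cycles(example))
```

Summing this estimate over the code-blocks assigned to each EBC processor gives the EBC side of the timing comparison with the DWT memory-access count.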


Fig. 13. PSNR for CR = 4. (a) Default JPEG2000 [14]: 46.47 dB. (b) QCB512: 41.54 dB.

Fig. 14. Timing diagram of traditional DWT and QCB-based DWT. (a) Total processing time based on traditional DWT. (b) Total processing time based on QCB-based DWT.

TABLE V

PERFORMANCE OF QCB-BASED JPEG 2000 COPROCESSOR WITHOUT ZERO-HOLDING EXTENSION. (TILE SIZE: 128 × 128, 2-LEVEL DWT, 5/3 FILTER)

TABLE VI

PERFORMANCE OF QCB-BASED JPEG 2000 COPROCESSOR WITH QCB-BLOCK EXTENSION. (TILE SIZE: 512 × 512, 4-LEVEL DWT, 5/3 FILTER)

Table V shows the computation cycles for the 128 × 128 tile image. Since QCB-based DWT has higher parallelism than traditional DWT, it reduces the computation cycles by about 8%. For a larger tile size, it saves more computation cycles, since the latency between DWT and EBC is still dominated by the processing of one QCB-block. As shown in Table VI, it reduces the computation cycles by about 15%. Finally, it is reasonable to match the throughput of the QCB–DWT and EBC processors to achieve high hardware utilization and parallelism.

V. COMPARISON

Table VII compares several architectures for the JPEG 2000 coprocessor. For a 128 × 128 tile size, the straightforward architectures require an internal memory of 32 KB to store the entire tile data [9]–[11]. For the 512 × 512 case, however, the internal memory grows to 512 KB. With the QCB-based architecture, we can reduce the internal tile memory by a factor of 4, so that only 8 KB is required to process a 128 × 128 tile image. For a 512 × 512 tile size, the zero-holding extension is further applied to decrease the remaining tile memory. It reduces the internal tile memory to 24 KB with slight image degradation. Moreover, by increasing the QCB-block size at different DWT levels to preserve the original data path, the QCB-based DWT can fully comply with the JPEG 2000 standard.

TABLE VII

COMPARISONS OF DIFFERENT ARCHITECTURES (16-BIT WORD LENGTH)

VI. CONCLUSION

In this paper, we propose two methods for QCB-based DWT to decrease the internal memory size of a JPEG 2000 coprocessor. The QCB-based DWT reduces the tile memory by a factor of 4. The remaining tile memory, which stores the LL band, can be further reduced through the zero-holding extension or the QCB-block size extension method. The proposed architecture can process a 512 × 512 tile image using only 24 KB (27.16 KB) of tile memory with slight (without any) image degradation, which makes on-chip memory practicable.

REFERENCES

[1] ISO/IEC, ISO/IEC 15444-1. Information Technology—JPEG 2000 Image Coding System, 2000.

[2] K. Varma and A. Bell, “JPEG2000-choices and tradeoffs for encoders,” IEEE Signal Process. Mag., no. 11, pp. 70–75, Nov. 2004.

[3] K. Andra, C. Chakrabati, and T. Acharya, “A VLSI architecture for lifting-based forward and inverse wavelet transform,” IEEE Trans. Signal Process., vol. 50, no. 4, pp. 966–977, Apr. 2002.

[4] C. J. Lian, K. F. Chen, H. H. Chen, and L. G. Chen, “Analysis and architecture design of block-coding engine for EBCOT in JPEG2000,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 3, pp. 219–230, Mar. 2003.

[5] J. S. Chiang, Y. S. Lin, and C. Y. Hsieh, “Efficient pass-parallel architecture for EBCOT in JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., vol. 1, May 2002, pp. 773–776.

[6] H. C. Fang, T. C. Wang, C. J. Lian, T. H. Chang, and L. G. Chen, “High speed memory efficient EBCOT architecture for JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., vol. 2, May 2003, pp. 736–739.

[7] H. Yamauchi, S. Okada, K. Taketa, T. Ohyama, Y. Matsuda, T. Mori, S. Okada, T. Watanabe, Y. Matsuo, Y. Yamada, T. Ichikawa, and Y. Matsushita, “Image processor capable of block-noise-free JPEG2000 compression with 30 frames/s for digital camera applications,” in Dig. Tech. Papers ISSCC, San Francisco, CA, Feb. 2003, pp. 46–47.

[8] K. K. Ong, W. H. Chang, Y. C. Tseng, Y. S. Lee, and C. Y. Lee, “A high throughput context-based adaptive arithmetic codec for JPEG2000,” in Proc. IEEE Int. Symp. Circuits Syst., May 2002, pp. 133–136.

[9] K. Andra, C. Chakrabati, and T. Acharya, “A high-performance JPEG2000 architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 3, pp. 209–218, Mar. 2003.

[10] AMPHION Products—CS6510 JPEG2000 Encoder [Online]. Available: http://www.amphion.com/cs6510.html

[11] Alma Technologies—JPEG2K_E [Online]. Available: http://www.alma-tech.com/

[12] M. Y. Chiu, K. B. Lee, and C. W. Jen, “Optimal data transfer and buffering schemes for JPEG 2000 encoder,” in Proc. IEEE Int. Signal Process. Syst., 2003, pp. 177–182.

[13] B. F. Wu and C. F. Lin, “Analysis and architecture design for high performance JPEG2000 coprocessor,” in Proc. IEEE Int. Symp. Circuits Syst., vol. 2, May 2004, pp. 225–228.

[14] JPEG 2000 Software Model [Online]. Available: http://www.ece.uvic.ca/~mdadams/jasper/

[15] C. T. Huang, P. C. Tseng, and L. G. Chen, “Memory analysis and architecture for two-dimensional discrete wavelet transform,” in Proc. IEEE