

4.2 Architecture of Decoder Design

4.2.3 Overall Decoder Design

Fig. 38 shows the pipeline stages of the decompressor. To provide data quickly to the motion compensation unit, the decompressor must support a high throughput. The decompressor is divided into two stages, and each stage needs 2 cycles, so a 4x4 block takes 4 cycles to decode. Decoding a whole macroblock (MB) takes only 34 cycles.

Fig. 38 Overall decoder design

Chapter 5

Design Implementation and Verification

5.1 Design Implementation

In this thesis, we proposed a flexible algorithm which achieves good coding efficiency and is suitable for integration with any video decoder. The proposed architecture is synthesized with a UMC 90-nm CMOS standard-cell library. The operating frequency is 100 MHz. The gate counts of the compressor and decompressor are 15.8K and 14.2K, respectively.

The embedded compressor is divided into 3 pipeline stages, and each stage costs 4 cycles. Although each pipeline stage of the encoder could be shortened to 2 cycles, the de-blocking filter needs 4 cycles to output a complete 4x4 pixel block. To integrate with the original system without any extra buffer, four cycles per stage is the better choice. A longer stage time also reduces the number of 1-D, 4-point DCT functional blocks and therefore decreases the area.

The embedded decompressor is divided into 2 pipeline stages, and each stage costs 2 cycles. The minimum pipeline design would be one cycle per stage with 3 stages in total (two for the 2-D DCT and one for the CGBPZ decoder), but that design requires a 64-bit bus bandwidth. Therefore, 2 stages of 2 cycles each is the fastest design for a system with a 32-bit bus.
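As a sanity check of the per-MB latencies summarized in Table 7 below, the following minimal sketch (an illustration only, assuming the usual fill-then-stream pipeline model and sixteen 4x4 blocks per macroblock) reproduces the 72-cycle and 34-cycle figures:

    def pipeline_latency_cycles(num_blocks, num_stages, cycles_per_stage):
        # Fill the pipeline (num_stages - 1 stage times), then one block
        # leaves the pipeline every stage time.
        return (num_stages - 1) * cycles_per_stage + num_blocks * cycles_per_stage

    BLOCKS_PER_MB = 16  # a 16x16 macroblock holds sixteen 4x4 blocks

    print(pipeline_latency_cycles(BLOCKS_PER_MB, num_stages=3, cycles_per_stage=4))  # 72 (compressor)
    print(pipeline_latency_cycles(BLOCKS_PER_MB, num_stages=2, cycles_per_stage=2))  # 34 (decompressor)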

A summary of the hardware design is given in Table 7.

Table 7 Summary of hardware design

                             Proposed EC
    Function part            Compressor          Decompressor
    Synthesis process        UMC 90 nm
    Operating frequency      CIF @ 5 MHz, HD1080 @ 100 MHz
    Latency/MB               72 cycles           34 cycles
    Gate count               15.8K               14.2K
    Power                    2.78 mW             1.66 mW

5.2 Design Verification

The flow of design verification is shown in Fig. 39. The verification consists of two parts: software and hardware. Test patterns are generated by software and applied as the input of the hardware. At the same time, the software model computes the golden result and compares it with the hardware result. The compressed result is then stored in memory. The coded data is read back by both the software decompressor and the hardware decompressor, and the decoded data is checked to confirm that the software and hardware results match.

Fig. 39 The flow of design verification
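The comparison step of this flow can be pictured with the following sketch. It is illustrative only; the callables gen, sw_enc, hw_enc, sw_dec, and hw_dec are hypothetical hooks standing in for the software model and the RTL simulation interface:

    def verify(num_patterns, gen, sw_enc, hw_enc, sw_dec, hw_dec):
        """Cross-check the hardware compressor/decompressor against the software model."""
        memory = []                                # simple model of the external memory
        for _ in range(num_patterns):
            block = gen()                          # software-generated 4x4 test pattern
            golden = sw_enc(block)                 # golden coded segment from software
            assert hw_enc(block) == golden         # hardware compressor must match software
            memory.append(golden)                  # coded segment stored into the memory model
        for coded in memory:
            # both decompressors read the coded data back; decoded blocks must agree
            assert hw_dec(coded) == sw_dec(coded)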

Chapter 6

System Integration and Experimental Results

In Section 6.1, we first introduce the specification of the SI2 low-power H.264 decoder. The problems that occurred during integration are also discussed. A detailed analysis is given in Section 6.2.

6.1 System Analysis

The overall system block diagram is shown in Fig. 40. Our H.264 decoder works at 100 MHz and performs HD1080 decoding at 30 frames per second. The embedded compressor compresses the data from the deblocking filter: each 4x4 block becomes a 64-bit segment and is then stored in off-chip memory. The embedded decompressor decompresses the coded segments from the external memory and sends them to the motion compensation (MC) unit. The system bus width is 32 bits, and the external memory is 32 bits per entry.


Fig. 40 The overall system block diagram

6.1.1 Interface

The embedded compressor can be considered as an interface between the chip and the external memory. Fig. 41 shows the system interface design for the embedded codec.

The output rate of the deblocking filter is 4 pixels per clock; thus, each pipeline stage of the embedded compressor must take at most 4 cycles to avoid congestion at the input of the embedded compressor. Another interface issue occurs at the input of motion compensation (MC): the data provider of MC switches from the external memory to the proposed embedded decompressor. The input bandwidth of MC in the original system is 4 pixels per cycle, so the basic requirement is that the embedded decompressor must output at least 4 pixels per cycle.

Finally, an address converter is needed. Since the compression ratio is fixed at two, the address converter is easy to implement (a small sketch follows Fig. 41).

Fig. 41 System interface design for embedded codec.
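One possible form of this address converter, shown only as an illustrative sketch (assuming 8-bit pixels, 32-bit memory words, and compressed blocks laid out in block-raster order), maps each 4x4 block to two 32-bit words instead of four:

    def compressed_word_address(block_x, block_y, blocks_per_row, base=0):
        # An uncompressed 4x4 block is 16 pixels x 8 bits = four 32-bit words;
        # with CR = 2 it shrinks to a 64-bit segment, i.e. two 32-bit words.
        block_index = block_y * blocks_per_row + block_x
        return base + 2 * block_index   # word address of the coded segment

Because the segment size is constant, the converter reduces to a shift and an add in hardware.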

6.1.2 Overhead Problem

As introduced in Section 3.1, an embedded compressor suffers from the overhead problem, and the overhead ratio is directly linked to the coding unit. However, the overhead problem in our system differs from that in Section 3.1, since our system is based on 1x4 pixel arrays rather than single pixels. The access behavior of motion compensation with and without the embedded compressor can be analyzed as follows. Here we simply analyze two cases: the best case and the worst case.

If the requested 4x4 block is perfectly aligned with the coded 4x4 block grid, only 2 cycles are needed to fetch it, while the original system needs 4 cycles. This situation is illustrated in Fig. 42.

Fig. 42 Best case on data fetching

The worst situation is the sub-pixel case. For a motion vector (x, y) where neither x nor y is an integer, a 4x4 block needs a 9x9 pixel block to finish the motion compensation. With the embedded compressor, 18 cycles are needed, while the original system needs 27 cycles. Fig. 43 illustrates this analysis. A full-case analysis will be given in Section 6.2.1. In Section 6.2, we will see that an H.264 decoder with an embedded compressor does reduce the number of accesses and can efficiently reduce the access power consumption.

Fig. 43 Worst case: sub-pixel case
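The two cases above can be reproduced by a small counting sketch (an illustration under the stated assumptions: a 32-bit bus carrying 4 pixels per word in the original system, two words per coded 4x4 block with EC, and requests that start on a block boundary):

    import math

    def cycles_original(width, height, pixels_per_word=4):
        # original system: each row of the requested area is fetched word by word
        return height * math.ceil(width / pixels_per_word)

    def cycles_with_ec(width, height, words_per_block=2):
        # with EC, every touched coded 4x4 block costs two 32-bit words (CR = 2)
        return math.ceil(width / 4) * math.ceil(height / 4) * words_per_block

    print(cycles_original(4, 4), cycles_with_ec(4, 4))  # best case: 4 vs 2 cycles
    print(cycles_original(9, 9), cycles_with_ec(9, 9))  # sub-pixel case: 27 vs 18 cycles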

6.1.3 Processing Cycles Problem

The third part of the system analysis is the processing-cycle problem. This problem arises from the tight processing cycles of our low-power H.264 decoder, which works at 100 MHz and performs HD1080i at 30 fps. By a simple division, we find that only 25 cycles are available for motion compensation to deal with a 4x4 block. Therefore, we need a short-cycle design to keep the cycle load added by the embedded compressor small. A detailed analysis is given in Section 6.2.2.
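The "simple division" behind the 25-cycle budget is the following back-of-the-envelope calculation:

    # HD1080 (1920x1080) at 30 frames/s, decoder clock 100 MHz
    blocks_per_frame = (1920 * 1080) // (4 * 4)      # 129,600 4x4 blocks per frame
    blocks_per_second = blocks_per_frame * 30        # 3,888,000 blocks per second
    cycles_per_block = 100_000_000 / blocks_per_second
    print(cycles_per_block)                          # about 25.7, i.e. roughly 25 cycles per 4x4 block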

6.2 System Integration

6.2.1 Access Reduction

Recalling our motivation, we are willing to sacrifice some quality to achieve power reduction. In the following section we introduce how we reduce the number of memory accesses.

The accesses affected by the EC can be separated into two parts. One part is the write accesses from the deblocking filter, which writes data to the external SDRAM. The other part is the read accesses from the motion compensation (MC) unit. The first part is easy to analyze because the write accesses simply write reconstructed frames into SDRAM: after adding the EC (4x4 pixel unit, CR = 2.0), the number of write accesses is always half that of the original system (1x4 pixel arrays).

The read accesses requested by MC are much more complicated. The motion compensation unit requests data based on the motion vector (MV). For further discussion, the values of x and y in a motion vector (x, y) can each be classified into 3 types: aligned, not aligned, and the sub-pixel case (a small classification sketch follows the list).

1) Aligned: the value is a multiple of four, so the request fits the 4x4 coded block grid.

2) Not aligned: the value is an integer but not a multiple of four. The four required pixels may span two 4x4 coded blocks.

3) Sub-pixel case: the value is not an integer but has ½- or ¼-pixel accuracy. Nine pixels are needed to interpolate the four output pixels.
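A small sketch of this classification (illustrative only; it assumes the motion-vector components are expressed in quarter-pixel units, as in H.264):

    def classify_mv_component(v_quarter_pel):
        if v_quarter_pel % 4 != 0:
            return "sub-pixel"      # fractional position: a 9-pixel window is interpolated to 4 pixels
        v_pixel = v_quarter_pel // 4
        if v_pixel % 4 == 0:
            return "aligned"        # fits the 4x4 coded block grid
        return "not aligned"        # integer position, may span two coded blocks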

In Section 6.1.2, we already explained several cases of the access behavior of the system with and without EC. Here we give the analysis of all cases in Table 8. Notice that in every case the number of accesses with EC is less than or equal to that of the original system.

Table 8 Overall cases of read access requested by MC with/without EC

The probability of each case is derived by simulation over four sequences (Akiyo, Foreman, Stefan, Mobile Calendar), 300 frames each, encoded with a GOP size of 30. According to these probabilities, the average reduction achieved on read accesses is 40% of the original accesses.

6.2.2 Processing Cycles Problem

In Section 6.1.3, we described the problem of processing cycles. In this section, we show the feasibility of integrating CGBPZ into our system.

Since we want to avoid changing the original system, we have two constraints. First, the original system specification is HD1080 at 100 MHz and 30 frames per second, which means only 25 cycles are available for each 4x4 block; we want to finish MC together with the proposed EC within 25 cycles. Second, after embedding the EC, we do not want to change the data input mechanism of MC (data must be fed into MC continuously).

Our solution for constraint 2 is to add a new state to the original state machine and to insert a buffer between the embedded decompressor and MC. The signal "MC read data enable" is delayed until the buffer holds enough data to feed MC continuously.

However, this new state costs time. The required cycles of the "MC data read" state are now equal to the original MC processing cycles plus the cycles of the new "EC decode" state, as expressed in (4). The full-case discussion of the "EC decode" cycles plus the original "MC data read" cycles is given in Table 9.

    process_time(MC with EC) = delay(EC decode) + process_time(Original MC)        (4)

Table 9 Full cases of "EC decode" cycles plus original "MC data read" cycles

As Table 9 shows, the total cycle count is much less than 25 cycles in every case except the (sub, sub) case and the (not aligned, sub) case. Using the case probabilities, we can calculate the average number of cycles used for MC plus EC, which is 19.3 cycles. Although the (sub, sub) case uses more than 25 cycles, the system timing constraint is still met because spare cycles are available from the other modes. Therefore, embedding the proposed codec into the original system is feasible.
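The feasibility argument can be expressed as the following sketch. The per-case cycle counts and probabilities come from Tables 8 and 9; the dictionaries passed in are placeholders supplied by the caller, not the measured values:

    def average_mc_cycles(case_probability, case_cycles):
        # Weighted average of ("EC decode" + original "MC data read") cycles
        # over all (x, y) motion-vector cases listed in Table 9.
        return sum(case_probability[c] * case_cycles[c] for c in case_cycles)

    # The system meets timing as long as the weighted average stays within the
    # 25-cycle budget, even though individual cases such as ('sub', 'sub')
    # exceed it; with the measured probabilities the average is about 19.3 cycles.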

6.2.3 Access Reduction Ratio

The access ratio of the system with EC versus the original system is defined as (5):

    Access_ratio = Access_with_EC / Access_Ori.        (5)

According to our simulation, the ratio of read accesses with/without EC is 0.625, and the ratio of write accesses with/without EC is fixed at 0.5. We also obtain that the average read-to-write access ratio in the original system is about 3.51. The overall access ratio (with/without EC) can then be calculated as in (6):

    Overall_ratio = (3.51 x 0.625 + 1 x 0.5) / (3.51 + 1) ≈ 0.60        (6)

Therefore, the average reduction ratio on memory accesses is about 1 - 0.60 = 0.40, i.e. roughly 40%.

6.2.4 Simulation Result on SDRAM Power Reduction

We choose the Micron system-power calculator [21] as the external memory power model, with parameter settings according to [22]. We simulated the memory usage for CIF@50MHz and HD1080@100MHz. The results are shown in Fig. 44 and Fig. 45.

Each figure includes the core power of the H.264 decoder, the SDRAM background power, and the SDRAM access power (R/W) at the respective operating frequency. The power saving for CIF is 7.6 mW, while the power saving for HD1080 is 154.8 mW. This makes sense because the average number of cycles available for a 4x4 block is the same for both video formats, and the R/W access ratio differs only slightly due to the different test sequences. It is therefore reasonable that the power reduction is almost directly proportional to the frame size.
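The "proportional to frame size" observation can be checked with the reported savings and the standard frame dimensions (CIF = 352x288, HD1080 = 1920x1080):

    cif_pixels = 352 * 288          # 101,376 pixels per CIF frame
    hd_pixels = 1920 * 1080         # 2,073,600 pixels per HD1080 frame
    print(hd_pixels / cif_pixels)   # about 20.5x more pixels
    print(154.8 / 7.6)              # about 20.4x more power saved -> nearly proportional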

Fig. 44 Power analysis on CIF @ 5.3 MHz

Fig. 45 Power analysis on HD1080 @ 100 MHz

Chapter 7

Conclusion and Future Work

7.1 Conclusions

In this thesis, we proposed a flexible algorithm which has good coding efficiency and is suitable for integration with any video decoder. With the help of this recompression engine, we can reduce the bandwidth requirement and the external frame memory size, and we can reduce the number of data accesses to achieve the goal of power saving.

The fixed compression ratio allows this extra function to be easily integrated into a system by adding a simple address converter. The proposed architecture is synthesized with a UMC 90-nm CMOS standard-cell library at an operating frequency of 100 MHz. The gate counts of the compressor and decompressor are 15.8K and 14.2K, respectively. The proposed architecture costs 30K gates in total and handles a 4x4 block unit, while the previous MHT work costs 20K gates to handle a 1x8 pixel array. The proposed algorithm not only gains 7.12 dB in quality but also achieves an area-efficient hardware implementation. The peak power consumption of the proposed embedded codec at 100 MHz is 4.445 mW.

7.2 Future Work

The future work consists of three parts. The first concerns coding efficiency. Moving from the proposed FGBPZ to CGBPZ, we traded visual quality for fewer encoding/decoding cycles. CGBPZ achieves a fast coding speed and acceptable visual quality; however, the error propagated through 29 P frames is still noticeable to human eyes. When a compressor/decompressor is embedded in a video decoder, the only way to obtain better visual quality is to reduce the quality loss of each reference frame. Therefore, refining the coding scheme to reduce quality loss is very important.

The second part is to develop adaptive compensation modes according to the different characteristics of video sequences. The compensation method we proposed is based on the average behavior of our test database and reaches the minimum average PSNR loss over that database. However, we also found that other methods perform better on certain video sequences while performing poorly on others. That is why we wish to develop adaptive compensation modes: compensating sequences according to their characteristics can optimize the individual visual quality and is another direction for quality improvement. Since the power consumption of our embedded codec is much less than the power saved by access reduction, adding a reasonable amount of complexity to obtain better performance is worthwhile.

The final part concerns the memory power model. The memory power model we use estimates the memory power consumption from the average data access ratio [21]. For a detailed analysis, some additional factors need to be specified, such as the page-mode design and the burst length: the burst length determines the efficiency of memory read accesses, and the page mode determines the hit rate. Memory power consumption depends on those factors. If we can build a power model that takes them into account, we can analyze our memory power consumption more accurately.

References

[1] M. Li, R. Wang and W. Wu, "The High Throughput and Low Memory Access Design of Sub-pixel Interpolation for H.264/AVC HDTV Decoder," IEEE SIPS'05, pp. 296-301, May 2005.

[2] R. Manniesing, R. Kleihorst, R. V. Vleuten, and E. Hendriks, “Implementation of lossless coding for embedded compression,” IEEE ProRISC, 1998.

[3] T. Y. Lee, “A New Frame-Recompression Algorithm and its Hardware Design for MPEG-2 Video Decoders,” IEEE Trans. CSVT, vol. 13, no. 6, pp. 529-534, June 2003.

[4] Yongie Lee, et al, "A New Frame Recompression Algorithm Integrated with H.264 Video Compression," IEEE Circuits Sys. ISCAS Vol. 6, pp.6110-6113, May 2007.

[5] H. Shim, N. Chang, and M. Pedram, “A compressed frame buffer to reduce display power consumption in mobile systems,” in Proceedings of the Asia and South Pacific Design Automation Conference, pp. 819–824, 2004.

[6] U. Bayazit, L. Chen, and R. Rozploch, “A novel memory compression system for MPEG2 decoders,” in IEEE International Conference of Consumer Electronics, pp. 56–57, 1998.

[7] M. Schaar-Mitrea and P. With, “Near-lossless embedded compression algorithm for cost reduction in DTV receivers,” in IEEE International Conference of Consumer Electronics, pp. 112–113, 1999.

[8] S. Lei, “A quad-tree embedded compression algorithm for memory-saving DTV decoders,” in IEEE International Conference of Consumer Electronics, pp. 120–121, 1999.

[9] E. G. T. Jaspers and P. H. N. With, “Embedded compression for memory resource reduction in MPEG systems,” in IEEE Benelux Signal Processing Symposium, 2002.

[10] G. M. Callico, A. Nunez, R. P. Llopis, and R. Sethuraman, “Low-cost and real-time super-resolution over a video encoder ip,” IEEE, 2003.

[11] R. Bruni et al., “A novel adaptive vector quantization method for memory reduction in MPEG-2 HDTV decoders,” in Proc. Int. Conf. Consumer Electronics, pp. 58-59, 1998.

[12] R. Dugad and N. Ahuja, “A Fast Scheme for Image Size Change in the Compressed Domain,” IEEE Trans. CSVT, vol. 11, no. 4, pp. 461-474, April 2001.

[13] D. Pau et al., “MPEG-2 Decoding with a Reduced RAM Requisite by ADPCM Recompression before Storing MPEG Decompressed Data,” U.S. patent 5838597, Nov. 1998.

[14] C. C. Cheng, P. C. Tseng, C. T. Huang, and L. G. Chen, "Multi-Mode Embedded Compression Codec Engine for Power-Aware Video Coding System," in IEEE, SIPS 2005.

[15] A. Bourge and J. Jung, “Low-Power H.264 Video Decoder with Graceful Degradation,” in SPIE Proc. Visual Communications and Image Processing, vol. 5308, pp. 372-383, Jan. 2004.

[16] B. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1243-1245, Dec. 1984.

[17] Wen-Hsiung Chen et al., “A Fast Computational Algorithm for the Discrete Cosine Transform,” IEEE Transactions on Communications, vol. 25, no. 9, pp. 1004-1009, Sep. 1977.

[18] R. J van der Vleuten et al, “Low-complexity scalable DCT image compression”, IEEE Proc. Image Processing, vol.3, pp. 837-840, Sep. 2000

[19] Y.-C. Fang, C.-Y. Lee, Y.-M. Wang, C.-N. Wang, and T. Chiang, “Low complexity lossless video compression,” in Proc. IEEE Int. Conf. Image Processing (ICIP), vol. 4, pp. 2519-2522, Oct. 2004.

[20] Chieh-Hsien Cheng, “An Embedded Compressor/De-compressor for Video Decoder Using VQ/DCT Hybrid Coding”, master thesis, 2007

[21] Micron Technology Inc., The Micron System-Power Calculator: SDRAM. [Online]. Available: http://www.micron.com/products/dram/syscalc.html

[22] Micron Technology Inc., MT48LC2M32B2 64Mb SDRAM. [Online]. Available: http://www.micron.com/products/dram/

Author Biography

Name: 吳昱德

Place of household registration: Taichung City, Taiwan
Date of birth: February 1, 1984

Education:

1999.09 – 2002.06  National Taichung First Senior High School

2002.09 – 2006.06  Department of Electronics Engineering, National Chiao Tung University

2006.09 – 2008.07  Master's program, Institute of Electronics Engineering, National Chiao Tung University
