Chapter 3 Proposed Algorithm & Architecture

3.5 Summary and Comparison With Related Works

Using a matrix decomposition algorithm, we derive new fast 4x4, Hadamard, and 8x8 inverse integer transforms with low-power, hardware-efficient architectures. The corresponding inverse transform algorithms and hardware implementations, including a hardware-sharing design, are developed for H.264/AVC applications.

By applying the concept of hardware sharing, the proposed hardware schemes for fast inverse integer transforms need fewer shifters and adders than the direct three-mode architecture, which simply implements the 4x4, Hadamard, and 8x8 inverse integer transforms independently (Table 4).

High throughput has already been achieved by previous works such as [10], [12]. Given the state of the art, pushing throughput much higher is not our goal. Instead, we aim to design an inverse integer transform decoder that is well suited to system integration and that balances throughput against overhead, on the premise that the throughput remains acceptable for real-time decoding of full-HD sequences.

We therefore derive a simplified formula for throughput.

First, we obtain the formula for the total number of executed cycles,

PPC

Table 5 also shows the throughput of our 4x4, Hadamard, and 8x8 inverse integer transform designs and of the hardware-sharing design.

In our design, we apply a pipelined architecture so that the proposed IIT scheme achieves acceptable throughput.

On the other hand, we use a UMC 90nm process, which differs from the processes used in previous works, such as TSMC 0.18um and UMC 0.18um. Normalizing our results to TSMC 0.18um or to UMC 0.18um gives the same figures, because both processes use the same 1.8 V supply voltage. To make a fair comparison, Table 5 shows the power consumption after normalization.

Table 5. Normalization of power consumption to UMC 0.18um and TSMC 0.18um

| Design | Technology | Processing unit | Operating frequency / Power consumption |
|---|---|---|---|
| Hwangbo [7]’10 | UMC 0.18um | 4x4, Hadamard, 8x8 | 200MHz / 86.9mW |
| Lai [17]’10 | UMC 90nm | MPEG, VC, H.264 8x8 | 100MHz / 3.4mW |
| Proposed | UMC 90nm | 4x4 | 150MHz / 182.9µW |
| Proposed | UMC 90nm | Hadamard | 150MHz / 151.8µW |
| Proposed | UMC 90nm | 8x8 | 150MHz / 0.68mW |
| Proposed | UMC 90nm | HW sharing 4x4, Hadamard, 8x8 | 150MHz / 1.1mW |
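The normalized figures in Table 5 are consistent with scaling each measured power value by the square of the supply-voltage ratio. The sketch below illustrates this under the assumption of a 1.0 V supply for the 90nm design; that voltage is not stated in the text and is inferred only from the ratio between the measured and normalized values. The function name `normalize_power` is hypothetical.

```python
def normalize_power(power, v_from=1.0, v_to=1.8):
    """Scale a power figure by the square of the supply-voltage ratio.

    v_from = 1.0 V is an assumed supply for the 90nm design (not stated
    in the text); v_to = 1.8 V is the 0.18um supply given in the text.
    """
    return power * (v_to / v_from) ** 2


# Power measured at UMC 90nm, 150MHz (values taken from the text).
measured = {
    "4x4 (uW)": 56.45,
    "Hadamard (uW)": 46.85,
    "8x8 (mW)": 0.21,
}

normalized = {name: normalize_power(p) for name, p in measured.items()}

for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
```

Under this assumption the 4x4, Hadamard, and 8x8 entries of Table 5 are reproduced to the precision shown there. The hardware-sharing entry (1.1mW) is not reproduced exactly from the rounded 0.31mW figure, which scales to about 1.0mW.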

The synthesis results of the proposed architecture and the performance comparison with previous works are shown in Table 6. We focus on power consumption, hardware area, and hardware efficiency. By applying the concept of hardware sharing, our hardware schemes for the inverse integer transforms need fewer shifters and adders than the direct realization architecture, which simply implements the individual transforms independently. The comparison shows that the proposed inverse integer transform design consumes less power and achieves better hardware efficiency than the existing state-of-the-art design.

Table 6. Synthesis results and Comparison

Technology / Power consumption (mW)

Throughput

Hwangbo[7]’10 UMC 0.18um 63.6k 4x4,Hadamard, 8x8

In our design, we apply the matrix decomposition method to achieve low power consumption in the 4x4, Hadamard, 8x8, and hardware-sharing design schemes. At 150MHz, the 4x4, Hadamard, and 8x8 inverse integer transforms consume 56.45µW, 46.85µW, and 0.21mW, with hardware costs of 0.9k, 0.87k, and 4.2k gates, respectively. The hardware-sharing design consumes just 0.31mW with a hardware cost of 4.6k gates.

All four designs consume less power than the previous works. According to the hardware efficiency index in Table 6, the 4x4, Hadamard, 8x8, and hardware-sharing schemes are more efficient than the existing designs. The full-HD system requirement for each block size is 1920x1080 @ 30fps. The 4x4, Hadamard, 8x8, and hardware-sharing designs are suitable for H.264/AVC High Profile.

Chapter 4 System Integration

4.1 System Specification

Table 7 describes the specifications of the proposed architecture for an H.264 video decoder. The proposed architecture is synthesized with the UMC 90-nm CMOS standard-cell library and operates at 150MHz. Block sizes of 4x4, 8x8, and 16x16 are supported. In our H.264 decoder, the processing capability of the 4x4, Hadamard, and 8x8 inverse integer transforms is HD1080p/HD720p/QFHD @ 30fps. Higher video resolutions will be needed in the future, and the proposed architecture and algorithm can therefore support resolutions higher than those H.264 supports.

Table 7. Specification of the video decoder

H.264/AVC decoder

| Item | Specification |
|---|---|
| Process technology | UMC 90nm |
| Block size | 4x4, 8x8, 16x16 |
| Throughput | 4 – 8 pixels/cycle |
| Processing capability, 4x4 inverse transform | HD1080p, 720p, QFHD |
| Processing capability, Hadamard inverse transform | HD1080p, 720p, QFHD |
| Processing capability, 8x8 inverse transform | HD1080p, 720p, QFHD |
| Decoding capability, H.264/AVC | HDTV, 1080p HD, QFHD @30fps |
| Decoding capability, SVC | 720p – 1080p HD @30fps |
| Working frequency, H.264/AVC | 100 MHz |
| Working frequency, SVC | 150 MHz |

4.2 The Integration with H.264/AVC System

In Figure 5, each residual macroblock is transformed, quantized and coded. Previous standards such as MPEG-1, MPEG-2, MPEG-4 and H.263 made use of the 8x8 Discrete Cosine Transform (DCT) as the basic transform. The “baseline” profile of H.264 uses three inverse transforms depending on the type of residual data that is to be coded: a transform for the 4x4 array of luma DC coefficients in intra macroblocks (predicted in 16x16 mode), a transform for the 2x2 array of chroma DC coefficients (in any macroblock), and a transform for all other 4x4 blocks in the residual data. If the optional “adaptive block size transform” mode is used, further inverse transforms are chosen depending on the motion compensation block size (4x4, 8x8).

Data within a macroblock are transmitted in the order also shown in Figure 5. If the macroblock is coded in 16x16 Intra mode, then the block labeled “-1” is transmitted first, containing the DC coefficient of each 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with the DC coefficient set to zero in a 16x16 Intra macroblock).

Blocks 16 and 17 contain a 2x2 array of DC coefficients from the Cb and Cr chroma components respectively. Finally, chroma residual blocks 18-25 (with zero DC coefficients) are sent.
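The transmission order described above can be sketched as a short function. This is only an illustration of the block numbering given in the text; the function name `macroblock_block_order` and its boolean argument are hypothetical.

```python
def macroblock_block_order(intra_16x16):
    """Return the block indices of one macroblock in transmission order."""
    order = []
    if intra_16x16:
        # Block -1 carries the DC coefficient of each 4x4 luma block and
        # is sent first, but only in 16x16 Intra mode.
        order.append(-1)
    order.extend(range(0, 16))   # luma residual blocks 0-15
    order.extend([16, 17])       # 2x2 chroma DC blocks for Cb and Cr
    order.extend(range(18, 26))  # chroma residual blocks 18-25
    return order


print(macroblock_block_order(intra_16x16=True))
```

A 16x16 Intra macroblock therefore carries 27 blocks, and any other macroblock carries 26.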

Figure 30. The Integration with H.264/AVC system block diagram

Figure 30 shows a block diagram of the integration of the inverse integer transform with the H.264/AVC video decoder system. Our architecture processes 4 or 8 pixels per cycle with low power and a small gate count. The inverse transform takes its input from the inverse quantization function and applies a two-dimensional 4x4 or 8x8 inverse integer transform, depending on the mode selected in the hardware-sharing design. The residual data output by the inverse integer transform then becomes the input of the de-blocking filter.
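To make the adder and shifter counts concrete, the following is a minimal software sketch of the standard H.264 4x4 inverse core transform in its usual butterfly form. It is not the matrix-decomposed architecture proposed in this work; it only shows that the 1-D pass needs nothing beyond additions, subtractions, and one-bit right shifts. The function names are hypothetical, and the input is assumed to be an already dequantized 4x4 block.

```python
def inverse_1d(d):
    """1-D 4-point inverse core transform: 8 add/subtract, 2 shifts."""
    d0, d1, d2, d3 = d
    e0 = d0 + d2
    e1 = d0 - d2
    e2 = (d1 >> 1) - d3
    e3 = d1 + (d3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_transform_4x4(block):
    """2-D inverse transform: row pass, column pass, then rounding."""
    rows = [inverse_1d(row) for row in block]
    # cols[j] holds the transformed column j, indexed by row.
    cols = [inverse_1d([rows[i][j] for i in range(4)]) for j in range(4)]
    # Final rounding: add 32 and shift right by 6 bits.
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]


# A block with only a DC coefficient yields a flat residual block.
dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_transform_4x4(dc_only))
```

In a hardware-sharing design, a mode-select signal would route the same adders and shifters to the 4x4, Hadamard, or 8x8 data path instead of instantiating each transform separately.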

Table 8 shows the timing analysis for the different MB sizes. The time required to process a whole frame is calculated as follows:

$$T_{\text{frame}} = N_{\text{block per frame}} \times T_{\text{block}}$$

where N_block per frame is the number of blocks in a frame and T_block is the time required to process one block; N_cycle and T_cycle indicate the number of cycles and the time required per cycle, respectively.

Table 8. Time required to decode full HD and HDTV frames with different MB sizes. Frames in YUV420 format are used.

| Frequency | Format | 4x4 MB (ms) | Hadamard (ms) | 8x8 MB (ms) |
|---|---|---|---|---|
| 150MHz | HD 1080 (1920x1080) | 4.53 | 4.6 | 5.4 |
| 150MHz | HD 720 (1280x768) | 2.14 | 2.18 | 3.1 |
| 625MHz | QFHD (4*HD 1080) | 17.3 | 17.3 | 18 |

For the HD 1080 frame, THD_4x4MB (4.53ms), THD_Hadamard (4.6ms), and THD_8x8MB (5.4ms) are 7.35, 7.23, and 6.1 times shorter than the 33.3ms available for decoding each frame at 30fps. Likewise, for the HDTV frame, THDTV_4x4MB (2.14ms), THDTV_Hadamard (2.18ms), and THDTV_8x8MB (3.1ms) are 15.5, 15.27, and 10.7 times shorter, and for QFHD the processing time is almost 2 times shorter. Thus, the proposed inverse transform architectures meet the real-time constraints for HD1080 and QFHD video signals, and this module can process 1080 HD and QFHD @ 30fps in real time.
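The real-time margins quoted above follow from dividing the per-frame time budget by the processing times in Table 8. A minimal sketch of that arithmetic, using the 33.3ms budget from the text (the names `FRAME_BUDGET_MS` and `realtime_margin` are hypothetical):

```python
FRAME_BUDGET_MS = 33.3  # time available per frame at 30fps, as used in the text

# Per-frame processing times in ms, taken from Table 8.
frame_time_ms = {
    "HD 1080": {"4x4": 4.53, "Hadamard": 4.6, "8x8": 5.4},
    "HD 720": {"4x4": 2.14, "Hadamard": 2.18, "8x8": 3.1},
    "QFHD": {"4x4": 17.3, "Hadamard": 17.3, "8x8": 18.0},
}


def realtime_margin(t_ms, budget_ms=FRAME_BUDGET_MS):
    """How many times the processing time fits into the frame budget."""
    return budget_ms / t_ms


margins = {
    fmt: {mode: realtime_margin(t) for mode, t in times.items()}
    for fmt, times in frame_time_ms.items()
}

for fmt, per_mode in margins.items():
    for mode, m in per_mode.items():
        print(f"{fmt} {mode}: {m:.2f}x")
```

A margin above 1 means the transform finishes within the frame budget; every entry of Table 8 satisfies this, with QFHD leaving the smallest headroom.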

Chapter 5 Conclusion and Future Works

5.1 Conclusion

In this work, we implement 4x4, Hadamard, and 8x8 inverse integer transforms and a hardware-sharing design. We first propose a fast algorithm for 4x4 and 8x8 macroblocks and combine it with pipelining to reduce the complexity of the inverse transform, which saves power, significantly reduces hardware area, and enhances hardware performance. At 150MHz, the 4x4, Hadamard, and 8x8 inverse integer transforms consume only 56.45µW, 46.85µW, and 0.21mW, with hardware costs of 0.9k, 0.87k, and 4.2k gates, respectively. The hardware-sharing design consumes just 0.31mW with a hardware cost of just 4.6k gates. All four designs consume less power than the previous works. The full-HD system requirement for each block size is 1920x1080 @ 30fps.

Our comparison covers power consumption, hardware cost in terms of gate count, critical path delay, throughput, and hardware efficiency, in which our design achieves a better figure (783.6k) than the previous works.

DTUA is used to evaluate the hardware efficiency. It is defined as the ratio of the data throughput rate to the hardware cost in terms of gate count. The higher the DTUA, the more efficient the design. According to the DTUA in Table 6, our four designs are more hardware efficient than the other designs. In Table 6, the proposed hardware-sharing design for the fast 4x4, Hadamard, and 8x8 inverse transforms of H.264/AVC requires a smaller gate count (4,602 gates) than the individual 4x4, Hadamard, and 8x8 inverse integer transforms without hardware sharing (904+873+4209=5986 gates). This component can be used in an H.264 High Profile decoder design, and its inverse can be used in an encoder design as well.
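The gate-count comparison and the DTUA definition above can be checked with a few lines of arithmetic. The sketch below uses only the gate counts quoted in the text; the function name `dtua` is hypothetical, and no throughput values are filled in because they are not given here.

```python
def dtua(throughput, gate_count):
    """Data throughput per unit area: throughput rate over gate count."""
    return throughput / gate_count


# Gate counts quoted in the text.
individual_gates = {"4x4": 904, "Hadamard": 873, "8x8": 4209}
shared_gates = 4602

# Direct realization: the three transforms implemented independently.
direct_total = sum(individual_gates.values())
saving = 1 - shared_gates / direct_total

print(f"direct: {direct_total} gates, shared: {shared_gates} gates")
print(f"gate-count saving from hardware sharing: {saving:.1%}")
```

With these figures, hardware sharing removes roughly 23% of the gates needed by three independent transforms. For a fixed throughput, a lower gate count raises the DTUA in direct proportion.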

5.2 Future Work

In the future, we will focus on new algorithms that reduce the number of adders and shifters, in order to save more power, keep improving performance, and further reduce the hardware area of our design. We will also employ voltage scaling to further reduce power consumption, and gated-clock and multiple-clock techniques to save clock power. Meanwhile, we will try to support the inverse transforms of other standards with the same algorithms.

References

[1] Joint Video Team (JVT) of ISO/IEC MPEG&ITU-T VCEG, “Joint Draft ITU-T Rec. H.264 | ISO/IEC 14496-10 Scalable video coding,” July 2007.

[2] S. Gordon, D. Marpe, and T. Wiegand, “Simplified use of 8x8 Transforms – Update Proposal and results,” JVT-K028, 11th Meeting, Munich, Germany, 15-19 Mar. 2004.

[3] Iain E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd, 2003.

[4] D. Marpe, T. Wiegand, and S. Gordon, “H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application areas,” IEEE International Conference of Image Processing, pp. I-593-I-596, Sep. 2005.

[5] T. C. Wang et al., “Parallel 4x4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264,” IEEE International Symposium on Circuits and Systems, pp.800-803, May 2003.

[6] C. P. Fan “Efficient Fast 1-D 8x8 Inverse Integer Transform for VC-1 Application,” IEEE Transactions on Circuits and Systems for Video Technology, vol.19, no.4, pp.584-590, April 2009.

[7] W. Hwangbo, J. Kim and C.M. Kyung, “A Multi Transform Architecture for H.264/AVC High-Profile Coders,” IEEE Transactions on Multimedia, vol.12, no.3, pp.157-167, Apr. 2010.

[8] G. A. Su, “Low-Cost Hardware Sharing Architecture of Fast 1-D Inverse Transforms for H.264/AVC and AVS Applications,” IEEE Transactions on Circuits and Systems, Part II, vol.55, no.12, pp.1249-1253, Dec. 2008.

[9] L. Z. Liu et al., “A 2-D forward/inverse integer transform processor of H.264 based on highly-parallel architecture,” IEEE International Workshop on System-on-Chip for Real-Time Applications, pp.158-161, July 2004.

[10] K. H. Chen, J. I. Guo, et al., “A high-performance low power direct 2-D transform coding IP design for MPEG-4 AVC/H.264 with a switching power suppression technique,” IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test, pp.291-294, Apr. 2005.

[11] Z. Y. Cheng et al., “High throughput 2-D transform architectures for H.264 advanced video coders,” IEEE Asia-Pacific Conference on Circuits and Systems, pp.1141-1144, Dec. 2004.

[12] W. Hwangbo, J. Kim and C.M Kyung, “A High-Performance 2-D Inverse Transform Architecture for the H.264/AVC Decoder,” IEEE International Symposium of Circuits and Systems, 2007. ISCAS 2007, pp.1613-1616, May 2007.

[13] G.A. Su et al., “Cost Effective Hardware Sharing Architecture for Fast 1-D 8x8 Forward and Inverse Integer Transforms of H.264/AVC High Profile,” IEEE Asia Pacific Conference of Circuits and Systems, 2008. APCCAS 2008, pp.1332-1335, 2008.

[14] M.L. Hsia, and T.C.C. Oscal, “Low-complexity inverse integer transform in H.264/AVC,” International Conference of Multimedia and Expo (ICME), pp.826-830, 2010.

[15] Y.K. Lin, Y.Z. Liao, and T.S. Chang, “An area-efficient Design for Integer Transform in H.264/AVC FRExt,” The 17th VLSI Design/CAD Symposium, 2006.

[16] M. Nadeem et al., “Configurable, Low Power Design for Inverse integer Transform in H.264/AVC,” 8th International Conference on Frontiers of Information Technology (FIT), no.8, Dec. 2010.

[17] Y. K. Lai, and Y. F. Lai, “A Reconfigurable IDCT Architecture for Universal Video Decoders,” IEEE Transactions on Consumer Electronics, vol.56, no.3, pp.1872-1879, August 2010.

[18] H.S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-Complexity Transform and Quantization in H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol.13, no.7, pp.598-603, July 2003.

[19] N.T. Ngo, T.T.T. Do, T.M. Le, “ASIP-controlled Inverse Integer Transform for H.264/AVC Compression”, The 19th IEEE/IFIP International Symposium, pp.158-164, June 2008.

[20] C. P. Fan, and Y. L. Cheng, “Unified and Fast 2-Dimension 4x4 Transform Design for H.264/AVC Texture Coding”, IEEE International Symposium on Intelligent Signal Processing and Communication Systems, pp.473-476, December 2005.

[21] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. New York: Cambridge Univ. Press, 1991, pp. 239-267.
