Summary - Related Works - Low-Power Inverse Integer Transform for H.264/AVC Video Decoders

Chapter 2 Related Works

2.4 Summary

Table 3 summarizes the above approaches. Each has distinct strengths and weaknesses. Our comparison items are support for the 4x4 transform, support for the 8x8 transform, support for the Hadamard transform, power consumption, hardware cost, DTUA, and throughput.

Table 3. Supporting features comparison (Hwangbo, Su)

We also evaluate several previous works and classify them into three strategies: low-power aware, low-hardware-cost aware and high-throughput aware. Figure 22 classifies the previous works by strategy. Each strategy represents the major improvement it makes over the conventional inverse integer transform decoder.

Figure 22. Implementation strategies of previous works

Chapter 3 Proposed Algorithm & Architecture

In this chapter, we propose fast one-dimensional butterfly algorithms and pipelined hardware architectures for the 4x4, Hadamard and 8x8 inverse integer transforms, and in Section 3.4 we propose a hardware-sharing design for the 4x4, Hadamard and 8x8 inverse integer transforms of the H.264 video decoder. Our algorithms use matrix decomposition, which relies on permutation matrices, to reduce the complexity of the inverse integer transforms. This lowers the power consumption and hardware cost and raises the throughput and hardware efficiency in H.264/AVC. All inverse integer transform hardware designs are implemented with a pipelined architecture. As a result, the power consumption and hardware cost of our design are smaller than those of previous works.

The area overhead of the inverse integer transform unit can be reduced by designing new fast butterfly algorithms that let the independent processing units share hardware resources. The following subsections discuss the new fast 4x4, Hadamard and 8x8 butterfly algorithms in more detail.

3.1 Fast 4x4 Inverse Integer Transform

3.1.1 Fast 4x4 Inverse Integer Transform Algorithm

This part proposes the fast 4x4 inverse integer transform algorithm. We first derive the formulas and then the algorithm that is implemented in the hardware design. From the previous chapter, the 4x4 inverse integer transform coefficient matrix (Eq. 3.1) is as follows,

We use the matrix decomposition method to reduce the complexity of the inverse integer transform, which in turn reduces the power consumption and the hardware cost in terms of gate count.

Therefore we define two permutation matrices $T_c$ and $T_r$ as described below,

The matrix $A_{4i}$ is then described as follows,

$$A_{4i} = (T_c)^T \tilde{A}_{4i} (T_r)^T \qquad \text{(Eq. 3.5)}$$

If $\tilde{A}_{4i}$ is then written in 2x2 block-matrix form (Eq. 3.6), we can use the matrix operation rule to derive the following equation (Eq. 3.8),

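Here $\otimes$ denotes the Kronecker product, whose matrix operation is as follows (Eq. 3.9). Assume that the dimension of the matrix A is NxP and that of B is MxQ, so that $A \otimes B$ has dimension NMxPQ,

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1P}B \\ \vdots & \ddots & \vdots \\ a_{N1}B & \cdots & a_{NP}B \end{bmatrix} \qquad \text{(Eq. 3.9)}$$

The symbol $\oplus$ denotes the direct sum, whose matrix operation is expressed as follows (Eq. 3.10),

$$A \oplus B = \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} \qquad \text{(Eq. 3.10)}$$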

If we apply the matrix operation rule of Eq. 3.8 to Eq. 3.5, the 4x4 inverse integer transform coefficient matrix $A_{4i}$ can be re-expressed as follows,

$$A_{4i} = (T_c)^T \tilde{A}_{4i} (T_r)^T = T_c^T (H_2 \otimes I_2)(H_2 \oplus \tilde{Q}_2)\, T_r^T \qquad \text{(Eq. 3.11)}$$
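A factorization of the form in Eq. 3.11 can be checked numerically. The sketch below is only an illustration under stated assumptions: the coefficient matrix `A4I` is the standard H.264 4x4 inverse core transform, and the concrete values chosen for `Q2`, `T_R` and `T_C` are assumptions made for this example, since the corresponding definitions are not reproduced in this section. The helper names are hypothetical.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker product: every entry a[i][j] becomes the block a[i][j] * b."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def direct_sum(a, b):
    """Direct sum: block-diagonal matrix with a and b on the diagonal."""
    top = [list(row) + [0] * len(b[0]) for row in a]
    bottom = [[0] * len(a[0]) + list(row) for row in b]
    return top + bottom

def transpose(m):
    return [list(col) for col in zip(*m)]

# Assumed values (standard H.264 definitions, not copied from this chapter).
A4I = [[1, 1, 1, F(1, 2)],
       [1, F(1, 2), -1, -1],
       [1, F(-1, 2), -1, 1],
       [1, -1, 1, F(-1, 2)]]
H2 = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]
Q2 = [[1, F(1, 2)], [F(1, 2), -1]]  # assumed form of the Q block
T_R = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]  # swaps inputs 1 and 2
T_C = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # swaps outputs 2 and 3

# T_c^T (H2 (x) I2)(H2 (+) Q2) T_r^T
core = matmul(kron(H2, I2), direct_sum(H2, Q2))
FACTORED = matmul(matmul(transpose(T_C), core), transpose(T_R))
MATCHES = FACTORED == A4I
```

With these assumed matrices the product reproduces `A4I` exactly, and the only non-trivial multiplications left are the two factors of 1/2 inside `Q2`, which become right shifts in hardware.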

3.1.2 Fast 4x4 Inverse Integer Transform Architecture

According to Eq. 3.11, the first stage of the hardware design, $T_r^T$, and the last stage, $T_c^T$, are permutations that are implemented as wire connections and require no arithmetic computation. We use 2 pipeline stages to complete the operation. Figure 23 shows the pipelined hardware architecture for the 4x4 inverse integer transform. The pipeline stages simplify the design and speed up the hardware so that it can reach higher resolutions such as HD 1080 and QFHD (4*HD 1080) @ 30fps.

Figure 23. Pipeline hardware architecture for fast 4x4 inverse integer transform.

In the pipelined hardware architecture, we implement the 4x4 inverse integer transform with the fast butterfly data flow shown in Figure 24. The proposed fast 4x4 inverse integer transform needs 2 shift operations and 8 additions.

Figure 24. New fast algorithm of 4x4 inverse integer transform.
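The operation count quoted above can be illustrated in software. The following is a minimal sketch that assumes the standard H.264 1-D 4-point inverse butterfly, which may differ in detail from the data flow of Figure 24. The `OpCounter` helper and the function name are hypothetical and exist only to count operations.

```python
class OpCounter:
    """Counts the additions/subtractions and shifts a data flow performs."""

    def __init__(self):
        self.adds = 0
        self.shifts = 0

    def add(self, a, b):
        self.adds += 1
        return a + b

    def sub(self, a, b):
        self.adds += 1
        return a - b

    def shr(self, a, n):
        self.shifts += 1
        return a >> n


def inverse_4x4_1d(w, ops):
    """1-D 4-point inverse integer transform in butterfly form (assumed flow)."""
    w0, w1, w2, w3 = w
    # First butterfly stage: even inputs use plain add/sub, odd inputs use x >> 1.
    e0 = ops.add(w0, w2)
    e1 = ops.sub(w0, w2)
    e2 = ops.sub(ops.shr(w1, 1), w3)
    e3 = ops.add(w1, ops.shr(w3, 1))
    # Second butterfly stage combines the even and odd halves.
    return [ops.add(e0, e3), ops.add(e1, e2), ops.sub(e1, e2), ops.sub(e0, e3)]


ops = OpCounter()
result = inverse_4x4_1d([4, 2, 0, 0], ops)
```

Under these assumptions the counter ends at 8 additions (counting subtractions as additions) and 2 shifts, which matches the complexity stated in the text.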

3.2 Fast Inverse Hadamard Integer Transform

3.2.1 Inverse Hadamard Integer Transform Algorithm

The Hadamard inverse integer transform is given by

$$W_D = H_{4i} X_D H_{4i} \qquad \text{(Eq. 3.12)}$$

From the previous chapter, the Hadamard inverse integer transform coefficient matrix $H_{4i}$ is as follows. For the Hadamard inverse integer transform, we use the same permutation matrices, $T_r$ and $T_c$, which are again implemented as wire connections and require no arithmetic computation.

3.2.2 Inverse Hadamard Integer Transform Hardware Architecture

Figure 25 shows the 2-stage pipelined hardware architecture for the 4x4 Hadamard inverse integer transform. The pipeline stages simplify the design and speed up the hardware.

Figure 25. Pipeline hardware architecture for fast Hadamard inverse integer transform.

Figure 26 shows the 4x4 Hadamard inverse integer transform implemented with the new fast butterfly data flow. The proposed fast 4x4 Hadamard inverse integer transform needs just 8 additions and no shift operations.

Figure 26. New fast algorithm of 4x4 Hadamard inverse integer transform.
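As an illustration of how a 1-D butterfly realizes the 2-D transform of Eq. 3.12, the sketch below assumes the standard H.264 4x4 Hadamard matrix and applies a shift-free butterfly first to the columns and then to the rows of the block. The function names are hypothetical, and the data flow is an assumed equivalent of Figure 26, not a copy of it.

```python
def inverse_hadamard_1d(w):
    """1-D 4-point Hadamard butterfly: 8 additions/subtractions, no shifts."""
    w0, w1, w2, w3 = w
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = w1 - w3
    e3 = w1 + w3
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def transpose(m):
    return [list(col) for col in zip(*m)]


def inverse_hadamard_2d(block):
    """Computes H * X * H by running the 1-D butterfly on columns, then rows.

    This relies on the assumed Hadamard matrix being symmetric.
    """
    columns_done = transpose([inverse_hadamard_1d(col) for col in transpose(block)])
    return [inverse_hadamard_1d(row) for row in columns_done]


dc_only = [[4, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
flat = inverse_hadamard_2d(dc_only)
```

A block that contains only a DC coefficient spreads that value evenly over all 16 positions, which is the expected behaviour of a Hadamard transform.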

3.3 Fast 8x8 Inverse Integer Transform

3.3.1 Fast 8x8 Inverse Integer Transform Algorithm

From the previous chapter, the 8x8 inverse transform coefficient matrix is as follows,

According to the fast computation in [2], we further decompose the fast 8x8 inverse integer transform into a 3-stage matrix multiplication as below,

$$C_{8i} = C_{8i}^{1}\, C_{8i}^{2}\, C_{8i}^{3}$$

Where $I_4$ is the identity matrix of order 4, and the permutation matrix

Where $I_2$ is the identity matrix of order 2, and the permutation matrix

In Eq. 3.25, 3x/2 can be decomposed into x + (x>>1), where (x>>1) denotes a 1-bit right shift. In Eq. 3.20, $C_{8i}^{3}$ can be further decomposed, and $\tilde{C}_{8i}$ can in turn be further decomposed into a form built from $H_2$, the $Q$ matrices and the permutation $T_r^T$ (Eq. 3.30).

Similarly, in Eq. 3.29, x/2 and x/4 can be replaced by a 1-bit right shift (x>>1) and a 2-bit right shift (x>>2), respectively.

The 8x8 inverse integer transform matrix $C_{8i}$ can then be rewritten as the three-stage product

$$C_{8i} = C_{8i}^{1}\, C_{8i}^{2}\, C_{8i}^{3}$$

3.3.2 Fast 8x8 Inverse Integer Transform Hardware Architecture

As with the 4x4 inverse integer transform, the first stage of the 8x8 inverse integer transform hardware design, the permutation $T_r^T$, and the last stage, $T_c^T$, need no arithmetic computation and are implemented by hard-wired connections. We use 4 pipeline stages to implement the 8x8 inverse integer transform operation. Figure 27 shows the pipelined hardware architecture for the 8x8 inverse integer transform. The pipeline stages simplify the design and speed up the hardware so that it can reach higher resolutions such as HD 1080 and QFHD (4*HD 1080) @ 30fps.

Figure 27. Pipeline hardware architecture for fast 8x8 inverse integer transform.

Figure 28. New fast algorithm of 8x8 inverse integer transform.

Figure 28 shows the 8x8 inverse integer transform implemented with the new fast butterfly data flow. The total complexity of the proposed fast 8x8 inverse integer transform is just 10 shift and 32 addition operations. The next section discusses the hardware sharing architecture of the inverse integer transforms for the H.264/AVC decoder.
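The 10-shift, 32-addition figure can be reproduced with a small software model. The sketch below assumes the standard H.264 1-D 8-point inverse butterfly, in which 3x/2 is realized as x + (x >> 1). It is not taken from Figure 28, and its three computation steps are not meant to mirror the four pipeline stages of Figure 27. `OpCounter` and the function name are hypothetical helpers.

```python
class OpCounter:
    """Counts the additions/subtractions and shifts a data flow performs."""

    def __init__(self):
        self.adds = 0
        self.shifts = 0

    def add(self, a, b):
        self.adds += 1
        return a + b

    def sub(self, a, b):
        self.adds += 1
        return a - b

    def shr(self, a, n):
        self.shifts += 1
        return a >> n


def inverse_8x8_1d(c, ops):
    """1-D 8-point inverse integer transform in butterfly form (assumed flow)."""
    add, sub, shr = ops.add, ops.sub, ops.shr
    # Step 1, even inputs: same structure as the 4-point transform.
    e0 = add(c[0], c[4])
    e2 = sub(c[0], c[4])
    e4 = sub(shr(c[2], 1), c[6])
    e6 = add(c[2], shr(c[6], 1))
    # Step 1, odd inputs: each term x + (x >> 1) stands for 3x/2.
    e1 = sub(sub(sub(c[5], c[3]), c[7]), shr(c[7], 1))
    e3 = sub(sub(add(c[1], c[7]), c[3]), shr(c[3], 1))
    e5 = add(add(sub(c[7], c[1]), c[5]), shr(c[5], 1))
    e7 = add(add(add(c[3], c[5]), c[1]), shr(c[1], 1))
    # Step 2: even butterflies, and odd terms scaled by x >> 2.
    f0 = add(e0, e6)
    f2 = add(e2, e4)
    f4 = sub(e2, e4)
    f6 = sub(e0, e6)
    f1 = add(e1, shr(e7, 2))
    f3 = add(e3, shr(e5, 2))
    f5 = sub(shr(e3, 2), e5)
    f7 = sub(e7, shr(e1, 2))
    # Step 3: final butterflies between the even and odd halves.
    return [add(f0, f7), add(f2, f5), add(f4, f3), add(f6, f1),
            sub(f6, f1), sub(f4, f3), sub(f2, f5), sub(f0, f7)]


ops = OpCounter()
odd_basis = inverse_8x8_1d([0, 8, 0, 0, 0, 0, 0, 0], ops)
```

In this model the even half costs 8 additions and 2 shifts, the odd half 16 additions and 8 shifts, and the final butterflies 8 additions, giving the 32 additions and 10 shifts quoted in the text.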

3.4 Hardware Sharing Algorithm & Architecture

First, we group the 4x4 inverse integer transform, the Hadamard inverse transform and the 8x8 inverse integer transform together, because their inverse transform processes are similar. To achieve low power and hardware savings in the inverse integer transform unit, we investigate sharing the hardware resources between the independent processing units. For the hardware sharing design, we consider the three inverse integer transform equations (Eq. 3.11), (Eq. 3.15) and (Eq. 3.31) together.

From the above three equations, all these operations contain the same block, $(H_2 \otimes I_2)$, so this block can be shared in hardware. The other three blocks, $H_2 \oplus \tilde{Q}_2$, $H_2 \oplus H_2$ and $H_2 \oplus Q_2$, need to be operated in one hardware block. Since $Q_2$ is the main hardware block, the Hadamard transform uses a shift (scaling) in the input circuit to obtain the $H_2$ defined in Eq. 3.7, which saves hardware cost, whereas the 4x4 inverse integer transform needs no scaling in the input circuit because $Q_2 = \tilde{Q}_2$ in (Eq. 3.29) and (Eq. 3.7). Figure 29 shows the hardware sharing architecture for the fast 4x4, Hadamard and 8x8 inverse integer transforms. The shared part of the fast inverse integer transforms in Figure 29 is $(H_2 \otimes I_2)(H_2 \oplus Q_2)$.
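The idea of sharing one 4-point datapath across modes can be sketched in software. This is a simplified behavioural model, not the circuit of Figure 29: it assumes the standard H.264 butterflies and selects the Hadamard mode by bypassing the two shifters, whereas the text describes a scaling step in the input circuit. The function name and the mode strings are hypothetical.

```python
def shared_4pt(w, mode):
    """Shared 4-point butterfly.

    mode "4x4":      odd inputs pass through x >> 1 (the Q block)
    mode "hadamard": shifters are bypassed, so the Q block behaves like H2
    """
    if mode not in ("4x4", "hadamard"):
        raise ValueError("unknown mode: " + mode)
    shift = 1 if mode == "4x4" else 0
    w0, w1, w2, w3 = w
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> shift) - w3
    e3 = w1 + (w3 >> shift)
    # The (H2 (x) I2) stage is identical in every mode, so it is fully shared.
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def even_half_8x8(c):
    """Even half of an assumed 8-point butterfly, reusing the shared block."""
    return shared_4pt([c[0], c[2], c[4], c[6]], "4x4")
```

For example, `shared_4pt([4, 2, 0, 0], "4x4")` and `shared_4pt([1, 2, 3, 4], "hadamard")` run through the same additions and differ only in whether the shifters are active, which is the property the hardware sharing design exploits.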

3.4.1 Comparison And Implementation Of Hardware Sharing Architecture

Figure 29. Hardware sharing architecture for the fast 4x4, Hadamard and 8x8 inverse integer transforms (mode selected by Sel_0 and Sel_1; block Q1a; 4x4 mode uses pipeline stages 1 to 3, 8x8 mode uses pipeline stages 1 to 4).

In Figure 29, the proposed hardware sharing architecture, which supports the 3 inverse integer transform modes at low power, needs 12 shifters, 32 adders and a simple MUX. To achieve hardware sharing, an additional 4 additions and 4 shifters are needed. In Figure 27, $Q_1^T$ requires 12 additions and 4 shifters. For low power and high processing speed, we split it into two pipeline stages in Figure 29. The first stage, $Q_{1a}^T$, requires 8 additions and 4 shifters. The second stage, $Q_{1b}^T$, requires just 4 additions. Table 4 compares the architectures of the new fast inverse integer transforms. The pipeline stages of the hardware sharing architecture are marked with dotted lines in Figure 29.

Table 4. Architecture Comparisons for Fast Inverse Integer Transforms

The complexity of the proposed integer transform architectures is equivalent to that of the state-of-the-art fast algorithm [14], where the fast computation needs 2 shifts and 8 additions for the 4x4 inverse transform and 10 shifts and 32 additions for the 8x8 inverse transform.

3.5 Summary and Comparison With Related Works

Based on the matrix decomposition algorithm, low-power and highly hardware-efficient architectures for the new fast 4x4, 8x8 and Hadamard inverse integer transforms can be derived. The new fast 4x4, Hadamard, 8x8 and hardware sharing inverse transform algorithms and their hardware implementations are developed by utilizing matrix decomposition for H.264/AVC applications.

By applying the concept of hardware sharing, the proposed hardware schemes for the fast inverse integer transforms need fewer shifters and adders than the direct three-mode architecture, which simply implements the individual 4x4, Hadamard and 8x8 inverse integer transforms independently (Table 4).

As for throughput, previous works such as [10] and [12] already achieve high throughput. Given the state of the art, there is little need to push throughput much higher. Our aim is instead to design an inverse integer transform decoder that is well suited to system integration and that balances throughput against overhead, on the premise that the throughput remains acceptable for real-time decoding of full-HD sequences.

Therefore, we derive a simplified formula for throughput as follows.

First, we obtain the formula for the total number of executed cycles, which depends on PPC.

Table 5 also shows the throughput of our designs for the 4x4, Hadamard and 8x8 inverse integer transforms and for the hardware sharing design.

In our design, we apply the pipeline architecture efficiently to obtain acceptable throughput with our proposed IIT scheme.

On the other hand, we use the UMC 90nm technology process, which differs from the processes used in previous works, such as TSMC 0.18um and UMC 0.18um. When our results are normalized to TSMC 0.18um and UMC 0.18um, the normalized result is the same for both because their supply voltage is the same, 1.8 V. To make a fair comparison, Table 5 shows the normalized power consumption.

Table 5. Normalization of power consumption to UMC 0.18um and TSMC 0.18um

Design | Technology | Processing unit | Operating frequency / Power consumption
Hwangbo [7] '10 | UMC 0.18um | 4x4, Hadamard, 8x8 | 200 MHz / 86.9 mW
Lai [17] '10 | UMC 90nm | MPEG, VC, H.264 8x8 | 100 MHz / 3.4 mW
Proposed | UMC 90nm | 4x4 | 150 MHz / 182.9 µW
Proposed | UMC 90nm | Hadamard | 150 MHz / 151.8 µW
Proposed | UMC 90nm | 8x8 | 150 MHz / 0.68 mW
Proposed | UMC 90nm | HW sharing: 4x4, Hadamard, 8x8 | 150 MHz / 1.1 mW
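The normalized values in Table 5 are consistent with scaling power by the square of the supply-voltage ratio. The sketch below illustrates that calculation under two assumptions that are not stated in the text: the 90nm designs are taken to run at a 1.0 V supply, and only the voltage-squared term is applied. The un-normalized inputs are the figures this chapter reports for the proposed designs. The function name is hypothetical.

```python
V_SOURCE = 1.0  # assumed supply voltage of the 90nm designs (not stated in the text)
V_TARGET = 1.8  # supply voltage of the 0.18um processes, as given in the text


def normalize_power(power, v_from=V_SOURCE, v_to=V_TARGET):
    """Scale dynamic power by the square of the supply-voltage ratio."""
    return power * (v_to / v_from) ** 2


# Un-normalized power figures reported for the proposed designs.
p_4x4_uw = normalize_power(56.45)       # microwatts
p_hadamard_uw = normalize_power(46.85)  # microwatts
p_8x8_mw = normalize_power(0.21)        # milliwatts
p_sharing_mw = normalize_power(0.31)    # milliwatts
```

Under these assumptions the 4x4, Hadamard and 8x8 entries round to the values listed in Table 5. The hardware sharing entry comes out at roughly 1.0 mW instead of the listed 1.1 mW, so the exact normalization procedure used for the table may differ from this sketch.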

The synthesis results of the proposed architecture and the performance comparison with previous works are shown in Table 6. We focus on power consumption, hardware area and hardware efficiency. By applying the concept of hardware sharing, our hardware schemes for the inverse integer transforms need fewer shifters and adders than the direct realization architecture, which simply implements the individual transforms independently. The comparison shows that the proposed inverse integer transform design consumes less power and achieves better hardware efficiency than the state-of-the-art existing design.

Table 6. Synthesis results and Comparison

Technology / Power consumption (mW)

Throughput

Hwangbo[7]’10 UMC 0.18um 63.6k 4x4,Hadamard, 8x8

In our design, we apply the matrix decomposition method to obtain low power consumption in our 4x4, Hadamard, 8x8 and hardware sharing design schemes. At 150MHz, the power consumption of our 4x4, Hadamard and 8x8 inverse integer transform architectures is 56.45µW, 46.85µW and 0.21mW, and the area is 0.9k, 0.87k and 4.2k, respectively. For the hardware sharing design, the power consumption and hardware cost are just 0.31mW and 4.6k, respectively.

All four designs consume less power than the previous works. According to the hardware efficiency index for the 4x4, Hadamard, 8x8 and hardware sharing schemes in Table 6, our design is more efficient than the existing designs. The speed requirement of the Full HD system is 1920x1080 @ 30fps. The 4x4, Hadamard, 8x8 and hardware sharing designs are suitable for the H.264/AVC High Profile.

Chapter 4 System Integration

4.1 System Specification

The specifications of the proposed architecture for the H.264 video decoder are described in Table 7. The proposed architecture is synthesized with the UMC 90-nm CMOS standard-cell library and operates at 150MHz. Block sizes of 4x4, 8x8 and 16x16 are supported. In our H.264 decoder, the processing capability of the 4x4, Hadamard and 8x8 inverse integer transforms is HD1080p/HD720p/QFHD@30fps. Higher video resolutions will be needed in the future, and our proposed architecture and algorithm can support resolutions higher than those supported by H.264.

Table 7. The specification of Video decoder

H.264/AVC decoder
Process technology: UMC 90nm
Block size: 4x4, 8x8, 16x16
Throughput: 4 – 8 pixels/cycle
Processing capability:
  4x4 inverse transform: HD1080p, 720p, QFHD
  Hadamard inverse transform: HD1080p, 720p, QFHD
  8x8 inverse transform: HD1080p, 720p, QFHD
Decoding capability:
  H.264/AVC: HDTV, 1080p HD, QFHD @30fps
  SVC: 720p – 1080p HD @30fps
Working frequency:
  H.264/AVC: 100 MHz
  SVC: 150 MHz

4.2 The Integration with H.264/AVC System

In Figure 5, each residual macroblock is transformed, quantized and coded. Previous standards such as MPEG-1, MPEG-2, MPEG-4 and H.263 used the 8x8 Discrete Cosine Transform (DCT) as the basic transform. The “baseline” profile of H.264 uses three inverse transforms, depending on the type of residual data to be coded: a transform for the 4x4 array of luma DC coefficients in intra macroblocks (predicted in 16x16 mode), a transform for the 2x2 array of chroma DC coefficients (in any macroblock), and a transform for all other 4x4 blocks in the residual data. If the optional “adaptive block size transform” mode is used, further inverse transforms are chosen depending on the motion compensation block size (4x4, 8x8).

Data within a macroblock are transmitted in the order also shown in Figure 5. If the macroblock is coded in 16x16 Intra mode, then the block labeled “-1” is transmitted first, containing the DC coefficient of each 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with the DC coefficient set to zero in a 16x16 Intra macroblock).

Blocks 16 and 17 contain a 2x2 array of DC coefficients from the Cb and Cr chroma components respectively. Finally, chroma residual blocks 18-25 (with zero DC coefficients) are sent.

Figure 30. The Integration with H.264/AVC system block diagram

In Figure 30, the block diagram shows the integration of the inverse integer transform with the H.264/AVC video decoder system. Our design can process 4 or 8 pixels per cycle with low power and a small gate count. The input of the inverse transform comes from the inverse quantization function and is followed by a two-dimensional 4x4 or 8x8 inverse integer transform, depending on the mode selected in the hardware sharing design. The residual data output by the inverse integer transform then becomes the input of the de-blocking filter.

Table 8 shows the timing analysis for different MB sizes. For the timing analysis, the time required to process a whole frame is calculated as follows,

$$T_{frame} = N_{block\ per\ frame} \times T_{block}$$

where $N_{cycle}$ and $T_{cycle}$ indicate the number of cycles and the time required per cycle, respectively.

Table 8. Time required to decode full HD and HDTV frames with different MB sizes. Frames in YUV420 format are used.

Frequency | Format | 4x4 MB (ms) | Hadamard (ms) | 8x8 MB (ms)
150MHz | HD 1080 (1920x1080) | 4.53 | 4.6 | 5.4
150MHz | HD 720 (1280x768) | 2.14 | 2.18 | 3.1
625MHz | QFHD (4*HD 1080) | 17.3 | 17.3 | 18

At 4.53ms, THD_4x4MB is 7.35 times faster than the 33.3ms available for decoding each HD frame at 30fps. THD_Hadamard (4.6ms) is 7.23 times faster and THD_8X8MB (5.4ms) is 6.1 times faster. Likewise, for the HDTV frame, THDTV_4x4MB (2.14ms) is 15.5 times faster, THDTV_Hadamard (2.18ms) is 15.27 times faster and THDTV_8x8MB (3.1ms) is 10.7 times faster, and QFHD is almost 2 times faster. Thus, the proposed inverse transform architectures meet the real-time constraints for HD1080 and QFHD video signals.

Therefore, this module can process 1080 HD and QFHD @ 30fps in real time.
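The real-time margins quoted above follow from dividing the per-frame time budget by the measured frame time. The sketch below reproduces that arithmetic from the values in Table 8, using the 33.3ms budget stated in the text. The dictionary layout and function names are hypothetical, and the QFHD row corresponds to the 625MHz entry of the table.

```python
FRAME_BUDGET_MS = 33.3  # time available per frame at 30 fps, as used in the text

# Frame decoding times in milliseconds, copied from Table 8.
TABLE_8_MS = {
    "HD 1080": {"4x4": 4.53, "Hadamard": 4.6, "8x8": 5.4},
    "HD 720": {"4x4": 2.14, "Hadamard": 2.18, "8x8": 3.1},
    "QFHD": {"4x4": 17.3, "Hadamard": 17.3, "8x8": 18.0},
}


def real_time_margin(frame_time_ms, budget_ms=FRAME_BUDGET_MS):
    """How many times faster than the per-frame budget a design runs."""
    return budget_ms / frame_time_ms


def meets_real_time(frame_time_ms, budget_ms=FRAME_BUDGET_MS):
    """True when a frame is processed within the per-frame budget."""
    return frame_time_ms <= budget_ms


MARGINS = {
    fmt: {mode: real_time_margin(t) for mode, t in row.items()}
    for fmt, row in TABLE_8_MS.items()
}
```

Every entry has a margin above 1, so each mode fits within the 30fps budget. The QFHD margins are the smallest, a little under 2.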

Chapter 5 Conclusion and Future Works

5.1 Conclusion

In this work, we implement the 4x4, Hadamard and 8x8 inverse integer transforms and a hardware sharing design. We first propose fast algorithms for the 4x4 and 8x8 blocks and combine them with pipelining to reduce the complexity of the inverse transform, which saves power, significantly reduces the hardware area and enhances the performance of the hardware. The power consumption of our 4x4, Hadamard and 8x8 inverse integer transform architectures is only 56.45µW, 46.85µW and 0.21mW at 150MHz, and the area is 0.9k, 0.87k and 4.2k, respectively. For the hardware sharing design, the power consumption is just 0.31mW and the hardware cost is just 4.6k. All four designs consume less power than the previous works. The speed requirement of the Full HD system is 1920x1080 @ 30fps.

We compare power consumption, hardware cost in terms of gate count, critical path delay, throughput and hardware efficiency. Our hardware efficiency (783.6k) is better than that of the previous works.

DTUA is used to evaluate the hardware efficiency. It is defined as the ratio of the data throughput rate to the hardware cost in terms of gate count. The higher the DTUA, the more efficient the design. According to the DTUA values in Table 6, our four designs are more hardware efficient than the other designs. In Table 6, the proposed hardware sharing design for the fast 4x4, Hadamard and 8x8 inverse transforms of H.264/AVC requires a smaller gate count (4,602 gates) than the individual 4x4, Hadamard and 8x8 inverse integer transforms without hardware sharing (904+873+4209=5986 gates). This component can be used in an H.264 High Profile decoder design, and its inverse can be used in an encoder design as well.
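The gate-count comparison and the DTUA definition above amount to simple arithmetic, sketched below. The gate counts are the ones quoted in the text. The `dtua` helper and the throughput value passed to it in the usage note are hypothetical illustrations, not results from Table 6.

```python
# Gate counts quoted in the text.
GATES_4X4 = 904
GATES_HADAMARD = 873
GATES_8X8 = 4209
GATES_SHARED = 4602

gates_individual = GATES_4X4 + GATES_HADAMARD + GATES_8X8
gates_saved = gates_individual - GATES_SHARED
saving_ratio = gates_saved / gates_individual


def dtua(throughput_per_second, gate_count):
    """Data throughput per unit area: throughput rate divided by gate count."""
    return throughput_per_second / gate_count
```

With these numbers the shared design saves 1384 gates, roughly 23% of the combined cost of the three individual designs. As a usage example with made-up inputs, `dtua(600e6, 1000)` returns 600000.0, meaning 600k units of throughput per gate.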

5.2 Future Work

In the future, we will focus on new algorithms that reduce the number of adders and shifters, in order to save more power, keep improving performance and further reduce the hardware area of our design. We will also employ voltage scaling to further reduce power consumption, and gated-clock and multiple-clock techniques to save clock power. Meanwhile, we will try to support the inverse transforms of other standards with the same algorithms.

References

[1] Joint Video Team (JVT) of ISO/IEC MPEG&ITU-T VCEG, “Joint Draft ITU-T Rec. H.264 | ISO/IEC 14496-10 Scalable video coding,” July 2007.

[2] S. Gordon, D. Marpe, and T. Wiegand, “Simplified use of 8x8 Transforms – Update Proposal and results,” JVT-K028, 11th Meeting, Munich, Germany, 15-19 Mar. 2004.

[3] Iain E. G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd, 2003.

[4] D. Marpe, T. Wiegand, and S. Gordon, “H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application areas,” IEEE International Conference of Image Processing, pp. I-593-I-596, Sep. 2005.

[5] T. C. Wang et al., “Parallel 4x4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264,” IEEE International Symposium on Circuits and Systems, pp.800-803, May 2003.

[6] C. P. Fan, “Efficient Fast 1-D 8x8 Inverse Integer Transform for VC-1 Application,” IEEE Transactions on Circuits and Systems for Video Technology, vol.19, no.4, pp.584-590, April 2009.

[7] W. Hwangbo, J. Kim and C.M. Kyung, “A Multi Transform Architecture for H.264/AVC High-Profile Coders,” IEEE international transactions on multimedia, Vol. 12, No.3, pp.157-167, Apr. 2010.

[8] G. A. Su, “Low-Cost Hardware Sharing Architecture of Fast 1-D Inverse Transforms for H.264/AVC and AVS Applications,” IEEE Transactions on Circuits and Systems, Part II, vol.55, no.12, pp.1249-1253, Dec. 2008.

[9] L. Z. Liu et al., “A 2-D forward/inverse integer transform processor of H.264 based on highly-parallel architecture,” IEEE International Workshop on System-on Chip for Real-Time Applications, pp.158-161, July 2004.

[10] K. H. Chen, J. I. Guo, et al., “A high-performance low power direct 2-D transform coding IP design for MPEG-4 AVC/H.264 with a switching power suppression technique,” IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test, pp.291-294, Apr. 2005.

[11] Z. Y. Cheng et al., “High throughput 2-D transform architectures for H.264 advanced video coders,” IEEE Asia-Pacific Conference on Circuits and Systems, pp.1141-1144, Dec. 2004.

[12] W. Hwangbo, J. Kim and C.M Kyung, “A High-Performance 2-D Inverse Transform Architecture for the H.264/AVC Decoder,” IEEE International Symposium of Circuits and Systems, 2007. ISCAS 2007, pp.1613-1616, May 2007.

[13] G.A. Su et al., “Cost Effective Hardware Sharing Architecture for Fast 1-D 8x8 Forward and Inverse Integer Transforms of H.264/AVC High Profile,” IEEE Asia Pacific Conference of Circuits and Systems, 2008. APCCAS 2008, pp.1332-1335, 2008.

[14] M.L. Hsia, and T.C.C. Oscal, “Low-complexity inverse integer transform in H.264/AVC,” International Conference of Multimedia and Expo (ICME), pp.826-830, 2010.

[15] Y.K Lin, Y.Z Liao, and T.S. Chang, “An area-efficient Design for Integer Transform in H.264/AVC FRExt,” The 17th VLSI Design/CAD symposium, 2006.

[16] M. Nadeem et al., “Configurable, Low Power Design for Inverse integer Transform in H.264/AVC,” 8th International Conference on Frontiers of Information Technology (FIT),
