Thesis Organization - Low-Power Inverse Integer Transform for H.264/AVC Video Decoders

Chapter 1 Introduction

1.3 Thesis Organization

This chapter has introduced the H.264/AVC video coding standard. The rest of this thesis is organized as follows. Chapter 2 describes the traditional inverse integer transform algorithms, gives an overview of the H.264/AVC profiles, and reviews previous works. In Chapter 3, we describe our proposed algorithms and hardware architectures for the 4x4, Hadamard and 8x8 inverse integer transforms. In that chapter we also present the hardware-sharing algorithm and architecture, and discuss in depth the comparison and implementation of the hardware-sharing architecture. Four inverse integer transform module designs based on our proposed algorithm (including the hardware-sharing design), the system integration architecture, and a comparison of the proposed design with others are presented in Chapter 4. Finally, Chapter 5 gives the conclusion and future works.

Chapter 2 Related Works

In this chapter, we give an overview of the traditional H.264/AVC inverse integer transform algorithms for 4x4, Hadamard and 8x8 blocks.

2.1 Inverse Integer Transform Algorithm

2.1.1 Overview of the Inverse Integer Transforms

H.264/AVC uses a macroblock (MB) as its basic data unit. Our input consists of the coefficients and flags decoded by the entropy decoder (CAVLC or CABAC), and it contains a luma part and a chroma part. At the encoder, the inter-prediction or intra-prediction module finds a macroblock similar to the current one in the reference frames or in the present frame. However, the MB found usually does not perfectly match the current one, and the differences are called residuals (or prediction errors), as shown in Figure 4. The residuals are transformed and quantized, then reordered and entropy encoded. At the decoder side, the entropy-encoded data are decoded back to coefficients. After reordering, the coefficients are inversely transformed to residual data. Finally, the residuals are combined with the prediction data to reconstruct an MB.

(a.) Current Frame (b.) Predicted Frame (c.) Residuals

Figure 4. Residuals (prediction errors) between current and reconstructed frame
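The final reconstruction step described above (residuals added back to the prediction) can be sketched as follows. This is only a minimal illustration, assuming 8-bit unsigned samples clipped to the range [0, 255]; the function name `reconstruct_block` and the sample values are hypothetical and are not taken from the thesis.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add residuals to the prediction and clip to the valid sample range.

    Assumption: samples are unsigned with the given bit depth (8 bits here).
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]


# Hypothetical 2x2 example: at the encoder, residual = current - prediction.
current = [[120, 130], [255, 0]]
prediction = [[118, 133], [250, 4]]
residual = [
    [c - p for c, p in zip(c_row, p_row)]
    for c_row, p_row in zip(current, prediction)
]
reconstructed = reconstruct_block(prediction, residual)
```

In this lossless toy case the reconstructed block equals the current block; in a real codec the quantization of the transformed residuals makes the reconstruction approximate.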

A macroblock contains one 16x16 luma block and two 8x8 chroma blocks (Cb, Cr). A 16x16 luma block can be divided into four 8x8 blocks, each consisting of four 4x4 blocks. An 8x8 chroma block contains four 4x4 blocks. In Figure 5, every 4x4 (or 2x2) block is numbered according to decoding order. If a macroblock is coded in intra 16x16 prediction mode, as shown in Figure 5(a), block -1, which contains the DC coefficients of every 4x4 luma block, is processed first. The DC coefficients are filled back into the upper-left corner of each 4x4 block in the 16x16 luma block. Next, the luma residual blocks 0-15 are processed. After the luma block is decoded, chroma DC blocks 16 and 17 are processed and filled back into the upper-left corner of each 4x4 block in the corresponding 8x8 chroma block. Finally, chroma residual blocks 18-25 are processed. If the current macroblock type is not intra 16x16, the processing order is the same except that there is no luma DC block, as shown in Figure 5(b).

(a) Intra 16x16 macroblock (b) Non-Intra 16x16 macroblock

Figure 5. Scanning order of residual blocks within a macroblock

Three kinds of inverse transform are adopted depending on the type of residual block: the 4x4 Hadamard transform for the luma DC block (block -1), the 2x2 Hadamard transform for the chroma DC blocks (blocks 16 and 17), and the 4x4 integer transform for all other 4x4 blocks (blocks 0-15 and 18-25). Figure 6 shows the decoding flow diagram. We will emphasize on the

Figure 6. H.264/AVC Encoding/Decoding flow diagram

If 4x4 inverse transform is employed, the luma part is divided into one luma DC 4x4 block and 16 luma AC 4x4 blocks. On the other hand, if 8x8 transform is applied, the luma part is divided into four 8x8 blocks. The chroma part is divided into two chroma DC 2x2 blocks and eight chroma AC 4x4 blocks in both cases.
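The block numbering and the choice of transform per block type described above can be summarized in a short sketch. It covers only the case where the 4x4 transform is used for luma (the 8x8 luma case is left out), and the function name `residual_block_order` and the string labels are hypothetical.

```python
def residual_block_order(intra_16x16):
    """Return (block_index, block_kind, transform) tuples in processing order.

    Block numbering follows the text: -1 is the luma DC block, 0-15 are the
    luma 4x4 blocks, 16-17 are the chroma DC blocks and 18-25 are the chroma
    4x4 blocks. Only the 4x4 luma transform case is modelled here.
    """
    order = []
    if intra_16x16:
        # The luma DC block exists only for intra 16x16 macroblocks.
        order.append((-1, "luma DC", "4x4 Hadamard"))
    order += [(i, "luma", "4x4 integer") for i in range(0, 16)]
    order += [(i, "chroma DC", "2x2 Hadamard") for i in (16, 17)]
    order += [(i, "chroma", "4x4 integer") for i in range(18, 26)]
    return order


for index, kind, transform in residual_block_order(intra_16x16=True):
    print(index, kind, transform)
```

An intra 16x16 macroblock therefore yields 27 residual blocks in this model, and any other macroblock type yields 26.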

In the following sub-sections, we will describe the traditional 4x4, Hadamard and 8x8 inverse integer transform algorithms.

2.1.2 Traditional 4x4 Inverse Integer Transform

In the H.264/AVC standard, the inverse integer transform operates on 4x4 blocks of residual data after motion-compensated prediction or intra prediction [4]. Only two types of 4x4 inverse transform are defined for the H.264/AVC decoder. The first type is the 4x4 inverse integer transform, defined in Eq. 2.1, where the 4x4 inverse integer transform coefficient matrix $A_{4i}$ is defined in Eq. 2.2.

$$X = A_{4i}^{T}\,(Y \otimes E_i)\,A_{4i} \qquad \text{(Eq. 2.1)}$$

$$A_{4i} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \tfrac{1}{2} & -\tfrac{1}{2} & -1 \\ 1 & -1 & -1 & 1 \\ \tfrac{1}{2} & -1 & 1 & -\tfrac{1}{2} \end{bmatrix} \qquad \text{(Eq. 2.2)}$$

In Eq. 2.1, each element of $Y$ is multiplied by the scaling factor in the same position in matrix $E_i$ [3]. The scaling matrix $E_i$ can be merged into the inverse quantization as a pre-scaling step, which reduces the number of multiplications. Figure 7 shows the traditional 4x4 inverse integer transform algorithm.

Figure 7. Traditional 4x4 inverse integer transform
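As an illustration of the traditional 4x4 inverse integer transform, the following sketch uses the well-known add-and-shift butterfly for the 1-D stage and applies it to rows and then to columns. It is not taken from the thesis: the function names are hypothetical, the scaling by $E_i$ is assumed to have been done already by the inverse quantizer, and the final rounding and right shift that the standard applies after the second pass are omitted.

```python
def inverse_1d_4(w):
    """1-D 4-point inverse integer transform: 8 additions and 2 shifts."""
    w0, w1, w2, w3 = w
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3          # the 1/2 coefficients become right shifts
    e3 = w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_2d_4x4(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [inverse_1d_4(r) for r in block]
    cols = [inverse_1d_4(c) for c in zip(*rows)]
    # cols[j] holds output column j; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


# A block with only a DC coefficient spreads that value over all 16 outputs.
dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_2d_4x4(dc_only))
```

Each 1-D pass needs no multiplier, which is why the transform maps well onto small adder networks in hardware.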

2.1.3 Traditional Inverse Hadamard Integer Transform

The second type is the 4x4 inverse Hadamard transform (also known as the luma DC transform). The inverse Hadamard transform is defined in Eq. 2.3, where $X_D$ is the 4x4 block of DC components of a macroblock coded in intra 16x16 mode.

$$W_D = H_{4i}\,X_D\,H_{4i} \qquad \text{(Eq. 2.3)}$$

The 4x4 inverse Hadamard transform coefficient matrix $H_{4i}$ is defined in Eq. 2.4, and Figure 8 shows the traditional two-dimensional fast inverse Hadamard algorithm.

$$H_{4i} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} \qquad \text{(Eq. 2.4)}$$

The 2x2 Hadamard transform uses the same form for its inverse transform, as given in Eq. 2.5:

$$X_D = H_{2i}\,W_D\,H_{2i}^{T}, \quad \text{with} \quad H_{2i} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \qquad \text{(Eq. 2.5)}$$

The H.264/AVC standard also defines the 2x2 chroma DC transform, whose structure is implied by the 4x4 inverse Hadamard transform.

Figure 8. Traditional 4x4 inverse Hadamard integer transform
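The 4x4 and 2x2 Hadamard transforms can be sketched with additions and subtractions only. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the 4x4 matrix is the one given in Eq. 2.4, and the scaling applied to the DC coefficients after the transform is left out. The helper `matmul` is included only to cross-check the butterfly against the matrix form.

```python
H4 = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]


def matmul(a, b):
    """Plain matrix product, used here only as a reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hadamard_1d_4(x):
    """1-D 4-point Hadamard butterfly: 8 additions, no shifts."""
    x0, x1, x2, x3 = x
    a, b = x0 + x1, x0 - x1
    c, d = x2 + x3, x2 - x3
    return [a + c, a - c, b - d, b + d]


def hadamard_2d_4x4(block):
    """Rows first, then columns; equals H4 * block * H4 because H4 is symmetric."""
    rows = [hadamard_1d_4(r) for r in block]
    cols = [hadamard_1d_4(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def hadamard_2d_2x2(block):
    """2x2 Hadamard transform written out directly (chroma DC case)."""
    (a, b), (c, d) = block
    return [[a + b + c + d, a - b + c - d], [a + b - c - d, a - b - c + d]]


sample = [[5, 1, 0, -2], [3, 0, 0, 0], [-1, 2, 0, 0], [0, 0, 0, 4]]
assert hadamard_2d_4x4(sample) == matmul(matmul(H4, sample), H4)
```

The butterfly needs 8 additions per 1-D pass, compared with 12 for a direct evaluation of the four sums.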

2.1.4 Traditional 8x8 Inverse Integer Transform

The 8x8 forward and inverse integer transforms can be performed in a manner similar to the 4x4 case. The 8x8 forward integer transform can be realized by the equivalent form in Eq. 2.6, where $\tilde{E}_f$ is the scaling matrix. Meanwhile, the 8x8 inverse integer transform is described by Eq. 2.7, where $\tilde{E}_i$ is the scaling matrix.

$$Y = (C_{8f}\,X\,C_{8f}^{T}) \otimes \tilde{E}_f \qquad \text{(Eq. 2.6)}$$

The 8x8 transforms are applied only to luma blocks.

Figure 9. Traditional 8x8 inverse integer transform

The previous sections presented the principles and algorithms of the H.264 inverse integer transforms (4x4, Hadamard and 8x8). In an implementation, the first one-dimensional inverse integer transform block transforms the rows, and the second one-dimensional inverse integer transform block transforms the columns. For example, Figure 9 shows the traditional 8x8 inverse integer transform as used for a hardware implementation.
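The row-column structure described above can be sketched for the 8x8 case as follows. This is an illustration rather than the thesis design: the 1-D stage is the commonly used add-and-shift butterfly for the H.264 8-point inverse transform, the matrix `C8` (the 8x8 integer transform matrix scaled by 8) is included only as a cross-check, all names are hypothetical, and the final rounding shift is omitted.

```python
# 8x8 integer transform matrix, scaled by 8 (used only as a reference).
C8 = [
    [8, 8, 8, 8, 8, 8, 8, 8],
    [12, 10, 6, 3, -3, -6, -10, -12],
    [8, 4, -4, -8, -8, -4, 4, 8],
    [10, -3, -12, -6, 6, 12, 3, -10],
    [8, -8, -8, 8, 8, -8, -8, 8],
    [6, -12, 3, 10, -10, -3, 12, -6],
    [4, -8, 8, -4, -4, 8, -8, 4],
    [3, -6, 10, -12, 12, -10, 6, -3],
]


def inverse_1d_8(w):
    """1-D 8-point inverse integer transform using additions and shifts."""
    w0, w1, w2, w3, w4, w5, w6, w7 = w
    # Even part: same structure as the 4-point butterfly.
    a0, a2 = w0 + w4, w0 - w4
    a4, a6 = (w2 >> 1) - w6, w2 + (w6 >> 1)
    b0, b2, b4, b6 = a0 + a6, a2 + a4, a2 - a4, a0 - a6
    # Odd part.
    a1 = -w3 + w5 - w7 - (w7 >> 1)
    a3 = w1 + w7 - w3 - (w3 >> 1)
    a5 = -w1 + w7 + w5 + (w5 >> 1)
    a7 = w3 + w5 + w1 + (w1 >> 1)
    b1, b3 = a1 + (a7 >> 2), a3 + (a5 >> 2)
    b5, b7 = (a3 >> 2) - a5, a7 - (a1 >> 2)
    return [b0 + b7, b2 + b5, b4 + b3, b6 + b1,
            b6 - b1, b4 - b3, b2 - b5, b0 - b7]


def inverse_1d_8_matrix(w):
    """Reference: x[n] = sum_k C8[k][n] * w[k] / 8 (exact for multiples of 8)."""
    return [sum(C8[k][n] * w[k] for k in range(8)) // 8 for n in range(8)]


def inverse_2d_8x8(block):
    """First 1-D stage on rows, second 1-D stage on columns."""
    rows = [inverse_1d_8(r) for r in block]
    cols = [inverse_1d_8(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# With inputs that are multiples of 8 the shifts are exact, so both forms agree.
coeffs = [8, 16, -8, 24, 0, -16, 32, 8]
assert inverse_1d_8(coeffs) == inverse_1d_8_matrix(coeffs)
```

For general inputs the shifts truncate, so the butterfly and the exact matrix product can differ by small rounding amounts; a conforming decoder follows the shift-based definition.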

2.1.5 Traditional Hardware Sharing Design

Figure 10. Traditional hardware sharing design [19]

To reduce the gate count required by separate transform processors, the multi-transform (hardware-sharing) approach combines the three transform units into one multi-function transform processor that can execute all three transform operations in H.264. In the traditional hardware-sharing architecture shown in Figure 10, the one-dimensional transform can be of any of the transform types. From Figure 7 and Figure 8, we can see that every one-dimensional transform contains 8 arithmetic operations. To get a clear view of how to share hardware among the transforms in a single design, we overlap Figure 7, Figure 8, and Figure 9. In Figure 10, all the adders have three inputs, which means that each adder has a common input that does not change with the transform type. Furthermore, Figure 10 is the full extension of [19] to 64 pixels.
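The idea behind hardware sharing can be illustrated in software with a single 4-point butterfly that serves both the 4x4 integer transform and the 4x4 Hadamard transform. This is a simplified sketch, not the three-input-adder design of [19]: the function name `shared_1d_4` and the `mode` argument are hypothetical, and the only thing the mode changes is whether two operands are shifted before they enter the shared adders.

```python
def shared_1d_4(x, mode):
    """One 4-point adder network shared by two transforms.

    mode "integer":  operands x1 and x3 are halved where needed (shift by 1).
    mode "hadamard": no shift, giving the Hadamard matrix of Eq. 2.4.
    """
    shift = {"integer": 1, "hadamard": 0}[mode]
    x0, x1, x2, x3 = x
    e0 = x0 + x2                  # these two adders are identical in both modes
    e1 = x0 - x2
    e2 = (x1 >> shift) - x3       # only the shifted operand depends on the mode
    e3 = x1 + (x3 >> shift)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


print(shared_1d_4([8, 4, -8, 12], "integer"))
print(shared_1d_4([1, 2, 3, 4], "hadamard"))
```

In hardware the mode flag would correspond to a multiplexer in front of the adder inputs, so all 8 adders are reused and only the operand selection differs between transform types.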

2.2 H.264 Profiles and Levels

H.264/AVC defines four profiles: baseline, extended, main and high profile. Baseline profile is usually used in low bit-rate applications. Extended profile, also called streaming profile, is designed for internet communication. Main profile is suitable for broadcast and storage applications. High profile, also called Fidelity Range Extension (FRExt), is intended for high resolution applications characterized by large block transform and large prediction blocks.

The high profile is further classified into four sub-profiles: High, High 10, High 4:2:2 and High 4:4:4, as depicted in Figure 11. Their features include the 8x8 luma transform, 8x8 spatial luma prediction, custom scaling matrices, deeper sample bit depths and lossless coding. Among them, the 8x8 luma transform is the key.

(Figure 11 labels: Main Profile, 8x8 Luma Transform, 8x8 Spatial Luma Prediction, Perceptual Scaling Matrices, Monochrome Format)

Figure 11. High profile classification and features

Figure 12 shows the profiling result of decoding a high profile video sequence. The inverse integer transform consumes about 17% to 20% of the CPU time. Therefore, we design a low-power inverse integer transform for integration into the H.264/AVC decoder depicted in Figure 1.

Some important H.264 profiles and their special features are:

Baseline Profile: only I and P slices are present; only frame-mode (progressive) pictures are present; only CAVLC is supported.

Main Profile: only I, P and B slices are present; frame and field pictures (progressive and interlaced modes) are present; both CAVLC and CABAC are supported; ASO is not supported; FMO is not supported.

High Profile: only I, P and B slices are present; frame and field pictures (progressive and interlaced modes) are present; both CAVLC and CABAC are supported; ASO is not supported; FMO is not supported; the 8x8 transform is supported; scaling matrices are supported.
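From the point of view of the inverse transform unit, the profile list above mainly decides whether an 8x8 transform path is needed. The lookup below is a small sketch of that idea; the dictionary layout and the function name `required_transforms` are hypothetical, and only the features named in the list are modelled.

```python
PROFILE_FEATURES = {
    "Baseline": {"slices": {"I", "P"}, "entropy": {"CAVLC"},
                 "transform_8x8": False, "scaling_matrices": False},
    "Main": {"slices": {"I", "P", "B"}, "entropy": {"CAVLC", "CABAC"},
             "transform_8x8": False, "scaling_matrices": False},
    "High": {"slices": {"I", "P", "B"}, "entropy": {"CAVLC", "CABAC"},
             "transform_8x8": True, "scaling_matrices": True},
}


def required_transforms(profile):
    """List the inverse transforms a decoder for the given profile must provide."""
    transforms = ["4x4 integer", "4x4 Hadamard", "2x2 Hadamard"]
    if PROFILE_FEATURES[profile]["transform_8x8"]:
        transforms.append("8x8 integer")
    return transforms


print(required_transforms("High"))
```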

Figure 12. H.264 decoder profiling results

All of the High sub-profiles also support monochrome coded video sequences, in addition to typical 4:2:0 video. The difference in capability among these profiles is primarily in the supported sample bit depths and chroma formats. However, the High 4:4:4 profile additionally supports the residual color transform and predictive lossless coding, features that are not found in any other profile. The detailed capabilities of these profiles are shown in Table 2.

Table 2. Coding tools in different profiles of the H.264/AVC standard

Coding Tools | Baseline | Main | Extended | High | High 10 | High 4:2:2 | High 4:4:4
4:2:0 chroma format | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Monochrome video format (4:0:0) | No | No | No | Yes | Yes | Yes | Yes
4:2:2 chroma format | No | No | No | No | No | Yes | Yes
4:4:4 chroma format | No | No | No | No | No | No | Yes
8-bit sample bit depth | Yes | Yes | Yes | Yes | Yes | Yes | Yes
9- and 10-bit sample bit depth | No | No | No | No | Yes | Yes | Yes
11- and 12-bit sample bit depth | No | No | No | No | No | No | Yes
8x8 vs. 4x4 transform adaptivity | No | No | No | Yes | Yes | Yes | Yes
Quantization scaling matrices | No | No | No | Yes | Yes | Yes | Yes
Separate Cb and Cr QP control | No | No | No | Yes | Yes | Yes | Yes
Residual color transform | No | No | No | No | No | No | Yes
Predictive lossless coding | No | No | No | No | No | No | Yes
Flexible Macroblock Ordering (FMO) | Yes | No | Yes | No | No | No | No
Arbitrary Slice Ordering (ASO) | Yes | No | Yes | No | No | No | No

2.3 Previous Works

In recent years, many researchers proposed a number of optimized algorithms to compute the transforms used in H.264/AVC. The major focus of the research has been to develop fast algorithms for the transform unit.

2.3.1 Parallel 4x4 transform and inverse transform Architecture for MPEG-4 AVC/H.264 [5]

The multi-transform approach is good for low power and for saving hardware area. Chen's design [5] is the first multi-transform architecture, and it targets low power: the authors analyze the characteristics of the residuals and propose a switching-power suppression technique to save data-transition power. The design outputs four values every cycle. Their design achieves a throughput of eight pixels per cycle and consumes 14.40 mW at 200 MHz.

Figure 13. (Re-designed) parallel transform architecture

This architecture is very compact for the 4x4 inverse transform: the gate count is only 4983. The processing speed can reach 1 Gpixels/sec at 200 MHz, which is sufficient for existing video formats, including HDTV formats. However, this architecture is limited because it supports only 4x4 blocks. Moreover, extending it to 8x8 would incur almost four times the overhead, and therefore a large power consumption and hardware cost. There is still room in this design to accelerate the processing speed and reduce the hardware cost.
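The claim that 1 Gpixels/sec is sufficient for HDTV can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that HDTV means 1920x1080 at 30 frames per second in 4:2:0 format; these parameters and the function name `required_sample_rate` are not from the thesis.

```python
def required_sample_rate(width, height, fps):
    """Samples per second for 4:2:0 video.

    One full-size luma plane plus two chroma planes, each subsampled by 2
    horizontally and vertically.
    """
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * fps


hd_rate = required_sample_rate(1920, 1080, 30)   # assumed HDTV format
headroom = 1_000_000_000 / hd_rate               # reported 1 Gpixels/sec
print(hd_rate, round(headroom, 1))
```

Under these assumptions the reported throughput exceeds the required sample rate by roughly an order of magnitude.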

2.3.2 Low Cost Hardware Sharing Architecture of Fast Inverse Transforms for H.264/AVC and AVS Applications [8]

In [8], 1-D fast algorithms and their hardware-sharing design for the 1-D inverse transforms of H.264/AVC and AVS are proposed, using the symmetry property of the integer DCT matrix and matrix decompositions. The hardware-sharing architecture for H.264/AVC and AVS is realized with offset computations and a pipelined design. Thus, the hardware cost of the sharing architecture is smaller than that of individual, separate realizations. The design is implemented with pipeline stages to increase the performance of the inverse transform.

Figure 14. Block diagram of the proposed hardware sharing architecture of fast 2x2, 4x4 and 8x8 inverse transforms for H.264/AVC and AVS with four pipeline phases [8]

In this paper, the 1-D transform is further divided into two smaller matrix-vector operations using the even-symmetric and odd-symmetric parts, so its size is smaller. However, the latency increases to 22 cycles because the design consumes only one coefficient per cycle. The power consumption of the 8x8 inverse integer transform in H.264/AVC mode and of the 8x8 inverse transform in AVS mode at 62.5 MHz is 34.266 mW and 37.785 mW, respectively. Because two video standards are supported, extra offset computations using extra registers are needed to fully satisfy both standards. Therefore, the area overhead and power consumption still need to be improved. There is still room in this design to reduce the hardware cost and power consumption.

2.3.3 A High Performance Inverse Integer Transform Architecture for the H.264/AVC Decoder [12]

In this paper, a high-performance inverse transform architecture for the H.264/AVC decoder is proposed. The proposed architecture utilizes the block multiplication and permutation matrices.

This architecture uses the matrix decomposition method to reduce the complexity of 4x4 inverse transform. By applying permutation matrices, the inverse transform matrix is regularized and the inverse Hadamard transform is merged into inverse transform with a minor modification.

Figure 15. 4x4 inverse transform hardware architecture [12]

This design has a higher throughput for computing the inverse transform and the inverse Hadamard transform. It also has a higher hardware efficiency, measured by DTUA, for both transforms. However, in the hardware architecture each of the blocks A2, B2, C2 and D2 uses the traditional 4x4 inverse transform algorithm, and too much extra logic is required to fully satisfy the H.264/AVC standard. Therefore, the area and power consumption still need to be improved.

2.3.4 Configurable, Low-power Design for Inverse Integer Transform in H.264/AVC [16]

This paper presented a configurable, low-power design for the inverse integer transform in H.264/AVC. The power consumption is drastically reduced by employing an input block-type aware algorithm with variable number of operations for the computation of the inverse integer transform. This algorithm takes advantage of significant number of zero-valued transformed coefficients in a typical input block. Additionally, the area overhead was reduced by designing basic configurable processing blocks in order to share the hardware resources (adders) for different input block types.

Figure 16. Functional block diagram: Configurable inverse integer transform unit. [16]

The internal organization of this block is depicted in Figure 16. Since the processing blocks M1-M3 are derived from M4 (Figure 17) and have a similar structure, configurable processing units (CM14, CM24 and CM34) with overlapped functionality can be designed to reduce the hardware resources required for implementation. The configurable processing units (CM14, CM24 and CM34), as the names suggest, can be configured to process either (M1, M4), (M2, M4) or (M3, M4) using the appropriate control signal.

Figure 17. Data flow diagram for (a) M1, (b) M2, (c) M3, and (d) M4 cases.

The internal architecture of these configurable units is depicted in Figure 18(a)-(c). Because of the configurable processing units, the additional (34) adders are no longer required. Furthermore, the input registers (in CM24) are also shared when processing the data vectors.

Figure 18. Data flow diagram for (a) CM14, (b) CM24 and (c) CM34.

The new algorithm is derived from the fast one-dimensional inverse integer transform. This paper focuses on a low-power design that consumes significantly less dynamic power (up to 80% reduction) than existing conventional designs for the inverse integer transform. However, some blocks use the traditional 4x4 inverse transform algorithm, and the processing speed of this architecture is too slow to support high resolutions such as full HD in H.264/AVC.
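The input block-type aware idea of [16] can be illustrated with its simplest case: when only the DC coefficient of a 4x4 block is non-zero, the butterflies can be skipped entirely. The sketch below does not reproduce the M1-M4 cases of [16]; the function names and the returned addition count are hypothetical and only show how the number of operations can vary with the input.

```python
def inverse_1d_4(w):
    """1-D 4-point inverse integer transform (8 additions, 2 shifts)."""
    w0, w1, w2, w3 = w
    e0, e1 = w0 + w2, w0 - w2
    e2, e3 = (w1 >> 1) - w3, w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_2d_4x4(block):
    """Full row-column inverse transform."""
    rows = [inverse_1d_4(r) for r in block]
    cols = [inverse_1d_4(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def inverse_4x4_block_aware(block):
    """Return (result, additions_used), skipping work for DC-only blocks."""
    dc = block[0][0]
    only_dc = all(
        v == 0
        for i, row in enumerate(block)
        for j, v in enumerate(row)
        if (i, j) != (0, 0)
    )
    if only_dc:
        # Every output equals the DC coefficient; no adder is exercised.
        return [[dc] * 4 for _ in range(4)], 0
    # General case: 8 one-dimensional passes with 8 additions each.
    return inverse_2d_4x4(block), 8 * 8


dc_block = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_4x4_block_aware(dc_block))
```

Since many residual blocks contain few non-zero coefficients after quantization, shortcuts of this kind reduce switching activity and therefore dynamic power.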

2.3.5 A Reconfigurable IDCT Architecture for Universal Video Decoders [17]

Reconfigurable architectures have become more and more popular. They not only shorten research and development time but also save fabrication cost. The reconfigurable inverse integer transform architecture proposed in [17] supports three video standards: VC-1, MPEG and H.264/AVC. The block diagram is shown in Figure 19.

Figure 19. Block diagram of reconfigurable inverse integer transform

Figure 20. Architecture of reconfigurable one dimensional inverse integer transform


Figure 21. Architecture of adder kernel a) Even part and b) odd part

They propose a reconfigurable one-dimensional inverse integer transform architecture, combining the two modes in Figure 20, to meet the requirements of various video standards. In the adder kernel unit, any combination of the input signals is one of {00~11} or {0000~1111}. Therefore, the computational results in every row can be generated by the even and odd parts of the adder kernel in Figure 21. The adder kernel can be simplified to only thirteen adders: two adders in the even part (Figure 21a) and eleven adders in the odd part (Figure 21b). The routing network is for the VC-1 inverse integer transform. Stage 3 is the shifter and adder-tree unit, which uses two's complement arithmetic to form the total sums. Stage 4 consists of the post-adders. The reconfigurable inverse integer transform architecture is implemented for universal video decoders. The key point of this paper is to raise the throughput by exploiting parallelism while reducing power consumption. This architecture supports three video standards. The power consumption is 3.4 mW at 100 MHz, the hardware cost is 11.6k, and the throughput is 800 Mpixels/sec, but the throughput is still lower than that of state-of-the-art designs such as [7], [10] and [12]. The paper does not make clear which fast algorithm is used, and supporting different video standards requires many extra registers, so the hardware cost still needs to be improved.

2.4 Summary

Table 3 summarizes the above approaches. Each has distinct strengths and weaknesses. We take 4x4 transform support, 8x8 transform support, Hadamard transform support, power consumption, hardware cost, DTUA and throughput as our comparison items.

Table 3. Supporting features comparison

We also evaluate several previous works and classify them into three strategies: low-power aware, low-hardware-cost aware and high-throughput aware. In Figure 22, we classify the previous works by strategy. Each strategy represents a major improvement over the conventional inverse integer transform decoder.

Figure 22. Implementation strategies of previous works

Chapter 3 Proposed Algorithm & Architecture

In this chapter, we propose fast one-dimensional butterfly algorithms and pipelined hardware architectures for the 4x4, Hadamard and 8x8 inverse integer transforms, and in Section 3.4 we propose their hardware-sharing design for the H.264 video decoder. Our algorithms use the matrix decomposition method, which utilizes permutation matrices, to reduce the complexity of the inverse integer transforms, thereby reducing power consumption and hardware cost and raising throughput and hardware efficiency in H.264/AVC. All inverse integer transform hardware designs are implemented with a pipelined architecture. Thus, the power consumption and hardware cost of our design are smaller than those of previous works.

The area overhead of the inverse integer transform unit can be reduced by sharing hardware resources between the independent processing units, which the new fast butterfly algorithm makes possible. In the following sub-sections, we discuss the new fast 4x4, Hadamard and 8x8 butterfly algorithms in more detail.

3.1 Fast 4x4 Inverse Integer Transform

3.1.1 Fast 4x4 Inverse Integer Transform Algorithm

The fast 4x4 inverse integer transform algorithm is proposed in this part. First we derive the formulas, and then the algorithms that will be implemented in the hardware design. The 4x4 inverse integer transform coefficient matrix from the previous chapter is restated as Eq. 3.1.

We use the matrix decomposition method to reduce the complexity of the inverse integer transform, which in turn reduces the power consumption and the hardware cost in terms of gate count.

Therefore we define two permutation matrices Tc and Tr as described below,

1 0 0 0

The $\tilde{A}_{4i}$ matrix is described as follows:

$$\tilde{A}_{4i} = (T_c)^{T}\,A_{4i}\,(T_r)^{T} \qquad \text{(Eq. 3.5)}$$

We then represent the $\tilde{A}_{4i}$ matrix in 2x2 block-matrix form (Eq. 3.6).

Then we use the matrix operation rules to derive the following equation (Eq. 3.8).

Here $\otimes$ denotes the Kronecker product, defined in Eq. 3.9 for a matrix A of dimension NxP and a matrix B of dimension MxQ.

The symbol $\oplus$ denotes the direct sum, which is expressed in Eq. 3.10.
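The two matrix operations named above, the Kronecker product and the direct sum, can be sketched in a few lines. The function names `kronecker` and `direct_sum` are hypothetical; the definitions are the standard ones, with the Kronecker product of an NxP matrix and an MxQ matrix giving an NMxPQ matrix, and the direct sum giving a block-diagonal matrix.

```python
def kronecker(a, b):
    """Kronecker product: every element a[i][j] is replaced by a[i][j] * b."""
    return [
        [a_val * b_val for a_val in a_row for b_val in b_row]
        for a_row in a
        for b_row in b
    ]


def direct_sum(a, b):
    """Direct sum: block-diagonal matrix with a and b on the diagonal."""
    p, q = len(a[0]), len(b[0])
    top = [list(row) + [0] * q for row in a]
    bottom = [[0] * p + list(row) for row in b]
    return top + bottom


I2 = [[1, 0], [0, 1]]
H2 = [[1, 1], [1, -1]]
# The Kronecker product with an identity matrix is a direct sum of copies.
assert kronecker(I2, H2) == direct_sum(H2, H2)
```

This identity is what makes such decompositions attractive for hardware: a large transform expressed as a direct sum of small blocks can be computed by several small, identical units.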
