Organization of the dissertation - I-Shou University Institutional Repository:Item 987654321/18

Chapter 1 Introduction

1.4 Organization of the dissertation

The rest of this dissertation is organized as follows. In Chapter 2, we first review the main coding architectures of H.264/AVC and H.265/HEVC. To reduce the extremely high computational load and complexity of H.264/AVC, we propose several efficient algorithms, including fast intra-mode decision, early detection of all-zero DCT blocks, and rate control for low-delay video communication.

H.265/HEVC has higher coding efficiency than H.264/AVC, and we also propose an effective transform unit size decision method for it in Chapter 3. In Chapter 4, we describe how to embed the H.264/AVC and H.265/HEVC codecs into a DSP-based development board.

Finally, Chapter 5 shows the conclusions of this study and suggests areas for future research.

Chapter 2 A review of video coding standards

Due to the rapid development of communication technology, especially for video applications, digital multimedia services are becoming more and more popular. If uncompressed video data is transmitted over a limited bandwidth, it incurs a large transmission delay or requires a large amount of storage space. It is therefore necessary to reduce the redundancy of video data through data compression. Thanks to the international organizations dedicated to developing a series of video coding standards, the H.264/AVC standard can achieve much higher coding efficiency than the previous standards.

With the rapid development of electronic technology, 4K2K (or 8K4K) high-resolution panels will become the main specification of large digital TVs in the future. However, H.264/AVC, the current state-of-the-art video coding standard, has difficulty supporting video applications at high definition (HD) and ultrahigh definition (UHD) resolutions. Therefore, the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG), through their Joint Collaborative Team on Video Coding (JCT-VC), started developing a new video compression standard, High Efficiency Video Coding (HEVC), in 2010 to satisfy the UHD requirement [22], and the first version of H.265/HEVC was approved as ITU-T H.265 and ISO/IEC 23008-2 in Jan. 2013 [23-25].

This chapter introduces the main coding architectures of H.264/AVC and H.265/HEVC in the following sections.

2.1 H.264/AVC video coding standard

2.1.1 H.264/AVC codec system

Like prior video coding standards, such as MPEG-1/2/4 and H.261/H.263, H.264/AVC is a block-based hybrid coding scheme [1-4]. However, as mentioned above, the H.264/AVC standard achieves much higher coding efficiency than the previous standards. To achieve high video quality at a high compression ratio, H.264/AVC introduces several complex encoding modules, including intra mode prediction, inter mode prediction, multiple reference frame prediction, the integer cosine transform (ICT), entropy coding and the deblocking filter. Two of the novel features of H.264/AVC are the intra and inter mode prediction modules, which offer a rich set of prediction patterns. The intra prediction module offers 13 luma prediction modes (nine for 4×4 blocks and four for 16×16 MBs), and the inter mode prediction module performs motion estimation (ME) with 7 variable block sizes (modes) ranging from 16×16 to 4×4. H.264/AVC splits every frame into 16×16 macroblocks (MBs). To select the best coding mode, rate-distortion optimization (RDO) is employed for each MB. After prediction, the residual data of each MB is transformed using the integer DCT and quantized. The quantized transform coefficients are reordered, and the syntax elements are entropy coded. Fig. 2.1 and Fig. 2.2 show the architectures of the H.264/AVC encoder and decoder, respectively.

Fig. 2.1 The architecture of the H.264/AVC encoder.

Fig. 2.2 The architecture of the H.264/AVC decoder.

2.1.2 Intra prediction module

The H.264/AVC standard exploits the directional spatial correlation between adjacent MBs or blocks for intra prediction. In other words, the current MB/block is predicted from adjacent pixels in the upper and left MBs/blocks that have already been decoded. H.264/AVC offers a rich set of prediction patterns for intra prediction, i.e., nine prediction modes for 4×4 luma blocks, four prediction modes for 16×16 luma MBs, and four prediction modes for 8×8 chroma blocks. Each mode has its own prediction direction, and the predicted samples are obtained from a weighted average of the decoded values of neighboring MBs/blocks. Fig. 2.3 shows the prediction samples and the nine prediction modes for each 4×4 luma block. It can be seen that 4×4 block prediction is conducted for samples a-p of a block using samples A-Q.

There are eight prediction directions in total and one DC prediction for 4×4 block prediction. To take full advantage of these modes, the H.264/AVC encoder can select the best mode by using rate-distortion optimization (RDO) calculations.
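As an illustration of how a 4×4 block is predicted from its neighbors, the following minimal sketch implements three of the nine modes (vertical, horizontal and DC) and picks the one with the smallest error. The sample names follow Fig. 2.3 (A-D above the block, I-L to its left). The function names and the use of a plain sum of absolute differences in place of the full RDO cost are assumptions made for this example only.

```python
def predict_4x4(mode, above, left):
    """Return a 4x4 predicted block (list of rows) for three of the nine modes.

    above: decoded samples A-D in the row above the block
    left:  decoded samples I-L in the column left of the block
    """
    if mode == "vertical":      # every row repeats the samples above the block
        return [list(above) for _ in range(4)]
    if mode == "horizontal":    # every row is filled with the sample to its left
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":            # rounded mean of the eight neighboring samples
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"mode not covered by this sketch: {mode}")


def sad(block, pred):
    """Sum of absolute differences between the original block and a prediction."""
    return sum(abs(o - p) for ro, rp in zip(block, pred) for o, p in zip(ro, rp))


above = [100, 102, 104, 106]
left = [100, 100, 100, 100]
block = [[100, 102, 104, 106] for _ in range(4)]   # strong vertical structure

costs = {m: sad(block, predict_4x4(m, above, left))
         for m in ("vertical", "horizontal", "dc")}
best = min(costs, key=costs.get)
print(best, costs)
```

For this vertically structured block the vertical mode reproduces the block exactly, so it is chosen over the horizontal and DC modes.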

Fig. 2.3 (a) A 4×4 block (samples a-p) and its neighboring samples (Q at the upper left, A-H above, I-L to the left). (b) Eight prediction modes for 4×4 block prediction.

The RDO mode decision exhaustively searches for the best mode of each MB, i.e., the mode that produces the minimum rate-distortion cost (RDCost), given by

$$ J_{mode}(s, c, mode \mid QP) = SSD(s, c, mode \mid QP) + \lambda_{mode} \cdot R(s, c, mode \mid QP) \tag{2.1} $$

where QP is the MB quantization parameter, λ_mode is the Lagrange multiplier for mode decision, SSD is the sum of the squared differences between the original block s and its reconstruction c, R is the number of bits needed to code the block with the given mode, and mode represents one of the potential prediction modes. According to the RDO procedure of intra prediction, the number of mode combinations for the luma and chroma components in an MB is N_8×8 × (N_4×4 × 16 + N_16×16), where N_8×8, N_4×4, and N_16×16 represent the number of modes for 8×8 chroma blocks, 4×4 luma blocks and 16×16 luma MBs, respectively. This means that an MB requires 4 × (9 × 16 + 4) = 592 different RDO calculations before the best mode is determined. As a result, the complexity of the intra-mode decision is extremely high, which makes real-time implementation difficult. To reduce the number of RDO computations, many fast mode decision methods have been proposed [26-28]. F. Pan et al. proposed a fast intra mode decision method based on an analysis of the edge direction histogram within the block so as to reduce the number of probable modes [26]. J. Kim and J. Jeong proposed a modified version of Pan’s method that uses simple directional masks and adjacent mode information to further speed up the RDO procedure [27]. Although Pan’s and Kim’s algorithms greatly reduce the complexity of intra prediction, they need extra pre-processing time to detect the edge information and analyze the edge direction histogram. Therefore, the benefit of both fast mode decision algorithms is reduced.
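The count of 592 RDO calculations per MB follows directly from the mode numbers given in Section 2.1.2. The short sketch below reproduces the arithmetic; the function name and the reduced candidate count in the second call are hypothetical and only show how the total scales when fewer 4×4 modes are tested.

```python
def rdo_calculations(n_chroma_8x8=4, n_luma_4x4=9, n_luma_16x16=4, blocks_4x4=16):
    """Number of RDO evaluations for one MB: N_8x8 * (N_4x4 * 16 + N_16x16)."""
    return n_chroma_8x8 * (n_luma_4x4 * blocks_4x4 + n_luma_16x16)


full_search = rdo_calculations()                 # all modes tested
reduced = rdo_calculations(n_luma_4x4=4)         # hypothetical: only 4 candidates per 4x4 block
print(full_search, reduced)
```

Because the 4×4 luma term is multiplied by both the sixteen blocks and the four chroma modes, reducing the number of 4×4 candidates has by far the largest effect on the total.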

Best mode selection using RDO

The intra prediction procedure for the luma and chroma components (YCrCb) using RDO can be described as follows:

Step 1 Generate an 8×8 predicted chroma block according to a chroma mode.

Step 2 Determine the best intra mode for the 16×16 MB among the 4 modes. Code the chroma components with the given mode and compute the rate-distortion cost of the MB for the YCrCb components, RDCost_16×16.

Step 3 Select the 16 best intra modes for the sixteen 4×4 luma blocks among the 9 modes. Code the chroma components with the given mode and compute the rate-distortion cost of the MB for the YCrCb components, RDCost_4×4.

Step 4 If RDCost_16×16 > RDCost_4×4, the 4×4 block type is selected; otherwise, the 16×16 block type is selected for the given chroma mode. The minimum cost is saved as RDCost.

Step 5 Repeat Steps 1 to 4 for all chroma intra prediction modes, and choose the one with the minimum RDCost.
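The five steps can be sketched as a pair of nested searches. This is a simplified illustration rather than the reference encoder: the cost callbacks `toy_cost_16` and `toy_cost_4` are hypothetical stand-ins for the real RDCost computation, and the counter only confirms that the exhaustive search evaluates 592 costs per MB.

```python
CHROMA_MODES = range(4)    # four 8x8 chroma modes
LUMA16_MODES = range(4)    # four 16x16 luma modes
LUMA4_MODES = range(9)     # nine 4x4 luma modes


def choose_intra_modes(rd_cost_16, rd_cost_4):
    """Exhaustive intra mode decision; returns (cost, chroma_mode, block_type, luma_modes)."""
    best = None
    for chroma_mode in CHROMA_MODES:                          # Steps 1 and 5
        cost16, mode16 = min((rd_cost_16(chroma_mode, m), m)  # Step 2
                             for m in LUMA16_MODES)
        picks = [min((rd_cost_4(chroma_mode, b, m), m) for m in LUMA4_MODES)
                 for b in range(16)]                          # Step 3
        cost4 = sum(cost for cost, _ in picks)
        if cost16 > cost4:                                    # Step 4
            candidate = (cost4, chroma_mode, "4x4", [m for _, m in picks])
        else:
            candidate = (cost16, chroma_mode, "16x16", mode16)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


calls = {"n": 0}           # counts how many cost evaluations are made


def toy_cost_16(chroma_mode, mode):
    calls["n"] += 1
    return 900 + 7 * ((chroma_mode + mode) % 4)


def toy_cost_4(chroma_mode, block, mode):
    calls["n"] += 1
    return 50 + (3 * block + mode + chroma_mode) % 9


best = choose_intra_modes(toy_cost_16, toy_cost_4)
print(best[0], best[2], calls["n"])
```

With these toy costs the sixteen 4×4 blocks are cheaper than the best 16×16 mode, so the 4×4 block type is selected.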

2.1.3 Integer DCT/Quantization module

In H.264/AVC, the DCT is an integer transform applied to 4×4 blocks of residual data, which avoids inverse transform mismatch problems. The core part of the integer 4×4 DCT can be implemented using only additions and shifts. A scaling multiplication is integrated into the quantization, which reduces the total number of multiplications. The integer 4×4 DCT is

$$ Y = A X A^{T} \tag{2.2} $$

where X is the residual signal, Y is the matrix of transformed coefficients, and A is a 4×4 transformation matrix. The elements of A are approximated so that the transform takes the form

$$ Y = \left( C X C^{T} \right) \otimes E $$

where C is the integer core transform matrix and each element of C X C^T is multiplied by the scaling factor in the same position in matrix E.

The scalar quantization operation in H.264/AVC is defined as

$$ Z_{ij} = \mathrm{round}\left( Y_{ij} / Q_{step} \right) $$

where Q_step is the quantization step size, which is determined by the quantization parameter (QP) ranging from 0 to 51, and Z_ij, 0 ≤ i,j ≤ 3, is the quantized coefficient.
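The following sketch illustrates the core transform C X C^T, the scaling by matrix E and the scalar quantization for a single 4×4 block. The numerical values of C, the scaling factors and the Qstep table are not given in the text above; they are the commonly published values for the H.264/AVC 4×4 transform and are used here as assumptions, as are all function names.

```python
import math

# Assumed integer core matrix of the 4x4 transform (additions and shifts only).
C = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]
A_, B_ = 0.5, math.sqrt(2 / 5)                            # assumed scaling constants
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]   # assumed Qstep for QP 0..5


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def core_transform(x):
    """Integer core transform W = C X C^T."""
    c_t = [list(col) for col in zip(*C)]
    return matmul(matmul(C, x), c_t)


def scaling_factor(i, j):
    """Element E[i][j]: a^2, ab/2 or b^2/4 depending on the position."""
    fi = A_ if i % 2 == 0 else B_ / 2
    fj = A_ if j % 2 == 0 else B_ / 2
    return fi * fj


def qstep(qp):
    """Quantization step size; it doubles for every increase of 6 in QP."""
    return QSTEP_BASE[qp % 6] * 2 ** (qp // 6)


def quantize(x, qp):
    """Z_ij = round(Y_ij / Qstep) with Y = (C X C^T) scaled element-wise by E."""
    w = core_transform(x)
    return [[round(w[i][j] * scaling_factor(i, j) / qstep(qp)) for j in range(4)]
            for i in range(4)]


flat = [[8] * 4 for _ in range(4)]       # constant residual block
print(core_transform(flat)[0][0], qstep(28), quantize(flat, 28)[0][0])
```

A constant block has energy only in the DC coefficient, so every other quantized coefficient is zero.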

2.1.4 Rate control module

Nowadays, real-time video streaming scenarios that require very low end-to-end delay are becoming more and more popular. However, it is very difficult to adjust the encoding parameters directly so that every encoded frame produces a fixed number of bits for a constant bit rate channel. Therefore, a buffer is needed to regulate the bit stream before transmission. A good rate control technique should adjust the output rate to prevent the buffer from overflowing and underflowing: overflow causes frame skipping, and underflow wastes channel resources. Furthermore, the buffer is usually very small in order to meet the low end-to-end delay requirement of real-time communication, which makes overflow and underflow more likely. Low-delay video communication therefore requires more accurate bit allocation and encoder parameter adjustment to achieve suitable rate control.
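The buffer behavior described above can be illustrated with a small simulation. This is a simplified model under stated assumptions: the buffer is drained by a fixed number of bits per frame interval, a frame that does not fit is skipped, and the function name and the numbers are hypothetical.

```python
def simulate_buffer(frame_bits, channel_bits_per_frame, buffer_size):
    """Track encoder-buffer fullness frame by frame and report overflow/underflow."""
    fullness = 0
    events = []
    for i, bits in enumerate(frame_bits):
        if fullness + bits > buffer_size:
            events.append((i, "overflow"))     # frame does not fit and is skipped
        else:
            fullness += bits
        sent = min(fullness, channel_bits_per_frame)
        if sent < channel_bits_per_frame:
            events.append((i, "underflow"))    # part of the channel capacity is wasted
        fullness -= sent
    return events


events = simulate_buffer([8000, 7000, 1000, 500],
                         channel_bits_per_frame=4000, buffer_size=10000)
print(events)
```

With a buffer of only 10000 bits, a single large frame causes a skip and two small frames leave the channel partly idle, which is why a small low-delay buffer needs accurate bit allocation.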

There are two parts that should be considered when designing a rate control scheme. One is the bit allocation for each basic unit according to its complexity. The other is the adjustment of the encoder parameter, i.e., the quantization parameter (QP), used to encode each basic unit so that it matches the target bits. To provide consistent visual quality, the number of bits required to encode a video sequence varies with time, because the complexity of each frame generally differs from that of the other frames in the input sequence. Therefore, an encoder needs a rate control scheme that meets a constrained channel rate by controlling the number of generated bits.

Rate control schemes have been widely studied for video standards. Fig. 2.4 shows the rate control scheme for MPEG-2, H.263 and MPEG-4 using a rate-distortion (R-D) model. The number of bits needed to encode the current basic unit (macroblock, MB) is predicted from the recently encoded basic units. The encoder shown in Fig. 2.4 obtains the motion vectors (MVs) using the motion estimator (ME) and calculates the statistics of the residual frame, i.e., the actual mean absolute difference (MAD), after motion compensation (MC). The rate controller then adjusts the quantization parameter (QP) according to the rate-quantization (R-Q) model.
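The text does not state the form of the R-Q model. As an illustration only, the sketch below assumes a quadratic model of the kind often used in rate control, T = x1·MAD/Q + x2·MAD/Q², and solves it for the quantization step Q that meets a target number of bits T. The parameter names x1 and x2 and all numbers are hypothetical.

```python
import math


def qstep_for_target(target_bits, mad, x1, x2):
    """Solve target = x1*mad/q + x2*mad/q**2 for the positive root q."""
    if x2 == 0:                                   # model degenerates to first order
        return x1 * mad / target_bits
    # target*q^2 - (x1*mad)*q - (x2*mad) = 0, positive root of the quadratic
    disc = (x1 * mad) ** 2 + 4 * target_bits * x2 * mad
    return (x1 * mad + math.sqrt(disc)) / (2 * target_bits)


q_small_budget = qstep_for_target(target_bits=70, mad=5, x1=100, x2=400)
q_large_budget = qstep_for_target(target_bits=140, mad=5, x1=100, x2=400)
print(q_small_budget, q_large_budget)
```

A larger bit budget yields a smaller quantization step, and a larger MAD (a more complex residual) yields a larger step for the same budget.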

Fig. 2.4. Rate control scheme for MPEG-2, H.263 and MPEG-4.

Compared with these previous standards, there is an additional problem for rate control in H.264/AVC, as shown in Fig. 2.5. The problem arises because the H.264/AVC encoder determines motion information by using rate-distortion optimization (RDO) calculations. Before RDO is performed for each MB, the quantization parameter has to be determined from the MAD of the MB. However, the MAD of the MB is only available after RDO has been performed. This is a typical chicken-and-egg dilemma. Therefore, rate control is more difficult in H.264/AVC.

Fig. 2.5 The problem of QP dilemma for rate control scheme in H.264/AVC.

2.2 H.265/HEVC video coding standard

2.2.1. H.265/HEVC codec system

Like H.264/AVC, H.265/HEVC [5] adopts a block-based hybrid coding scheme. The architectures of the H.265/HEVC encoder and decoder are shown in Fig. 2.6 and Fig. 2.7. HEVC can achieve a 50% bit rate reduction in comparison with H.264/AVC High Profile while maintaining the same subjective video quality [29]. This is because HEVC adopts several new coding structures, including the coding unit (CU), prediction unit (PU) and transform unit (TU). The CU can be split by a coding quad-tree structure with 4 depth levels (64×64 to 8×8) for inter/intra prediction. The PU is the unit used for carrying out the prediction processes. When pruning the CU coding quad-tree to find the best structure, the inter prediction module executes 7 different prediction modes, namely SKIP, intra 2N×2N, intra N×N, inter 2N×2N, inter 2N×N, inter N×2N and inter N×N, to find the best mode. In particular, the inter 2N×2N, inter 2N×N, inter N×2N and inter N×N modes need to perform motion estimation (ME) and motion compensation (MC), and the ME process is performed for all possible depth levels and prediction modes.

Fig. 2.6 The architecture of the H.265/HEVC encoder.

Fig. 2.7 The architecture of the H.265/HEVC decoder.

2.2.2. Coding unit (CU) module

The CU is the basic unit of region splitting used for inter/intra prediction, and it allows recursive subdivision into four equally sized blocks. Previous standards, such as H.264/AVC, use a block-based hybrid coding framework in which the basic coding unit is the macroblock. In the usual case of 4:2:0 sampling, a macroblock contains a 16×16 luma block and two 8×8 chroma blocks. H.265/HEVC has a similar structure, the coding tree unit (CTU), which is partitioned into coding blocks by a quad-tree, but the CTU is more flexible. The encoder can even select a CTU that is larger than a macroblock. A CTU consists of luma and chroma coding tree blocks (CTBs), and its luma block size can be chosen as 16×16, 32×32, or 64×64. A larger block size usually gives better compression for HD or UHD resolution video. The CU size can also be described by its hierarchical depth. In Fig. 2.8, the depth is 0 for the largest CU (LCU), a 64×64 CU, the depth is 1 for a 32×32 CU, the depth is 2 for a 16×16 CU, and the depth is 3 for the smallest CU (SCU), an 8×8 CU. Although every node in the CTU is called a CU, the best CU partition is decided according to the rate-distortion cost (RD cost).

After prediction, transform, and quantization, the RD cost is calculated for each CU at every depth. Finally, the CUs with the smallest RD cost constitute the best CTU structure.
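The selection of the best CTU structure can be sketched as a recursive comparison between coding a CU as a whole and coding its four sub-CUs. This is a minimal sketch, not the reference encoder: `toy_rd_cost` is a hypothetical cost function in which a CU larger than 8×8 that covers a single detailed region is penalized, and all names are assumptions.

```python
def best_cu_partition(x, y, size, rd_cost, min_size=8):
    """Return (cost, leaves) of the cheapest quad-tree partition of a CU."""
    whole = rd_cost(x, y, size)
    if size == min_size:                       # smallest CU cannot be split further
        return whole, [(x, y, size)]
    half = size // 2
    split_cost, split_leaves = 0, []
    for dy in (0, half):
        for dx in (0, half):
            cost, leaves = best_cu_partition(x + dx, y + dy, half, rd_cost, min_size)
            split_cost += cost
            split_leaves += leaves
    if split_cost < whole:                     # keep the split only if it is cheaper
        return split_cost, split_leaves
    return whole, [(x, y, size)]


DETAIL = (4, 4)                                # hypothetical position of fine detail


def toy_rd_cost(x, y, size):
    cost = size * size // 4
    covers_detail = x <= DETAIL[0] < x + size and y <= DETAIL[1] < y + size
    if covers_detail and size > 8:
        cost += 10 * size                      # large CUs code the detail poorly
    return cost


cost, leaves = best_cu_partition(0, 0, 64, toy_rd_cost)
print(cost, len(leaves))
```

In this toy case the tree is split down to 8×8 only around the detailed region, while the three flat 32×32 quadrants stay unsplit.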

2.2.3. Prediction unit (PU) module

The PU is the basic unit used for carrying the information related to the prediction processes. Each CU may contain one or more PUs, which differs from the seven fixed block sizes of H.264/AVC, as shown in Figs. 2.9 and 2.10. In general, SKIP mode only supports the 2N×2N partition, and intra CUs have two types of PUs, the 2N×2N partition and the N×N partition. Inter CUs, however, have four symmetric PU types (2N×2N, 2N×N, N×2N and N×N) and four asymmetric PU types (2N×nU, 2N×nD, nL×2N and nR×2N). Therefore, the HEVC encoder enables several different partition modes, including SKIP, inter and intra, for an inter slice, as shown in Fig. 2.11.
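To make the partition names concrete, the sketch below lists the PU sizes (width, height) that each inter partition mode produces for a CU of a given size 2N. The asymmetric modes split one dimension into a quarter and three quarters. The function name and the string keys are assumptions for this example.

```python
def pu_partitions(mode, cu_size):
    """Return the (width, height) of every PU for a partition mode of a 2Nx2N CU."""
    s, h, q = cu_size, cu_size // 2, cu_size // 4
    table = {
        "2Nx2N": [(s, s)],
        "2NxN":  [(s, h), (s, h)],
        "Nx2N":  [(h, s), (h, s)],
        "NxN":   [(h, h)] * 4,
        "2NxnU": [(s, q), (s, s - q)],     # small PU on top
        "2NxnD": [(s, s - q), (s, q)],     # small PU at the bottom
        "nLx2N": [(q, s), (s - q, s)],     # small PU on the left
        "nRx2N": [(s - q, s), (q, s)],     # small PU on the right
    }
    return table[mode]


for mode in ("2NxN", "2NxnU", "nRx2N"):
    print(mode, pu_partitions(mode, 32))
```

Every mode covers the whole CU, so the PU areas always add up to the CU area.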

Fig. 2.8 The comparison of CU size and hierarchical depth in CTU.

Fig. 2.9 The best decision CU of CTU in HEVC.

Fig. 2.10 The prediction unit in H.264/AVC (block sizes 16×16, 16×8, 8×16, 8×8, 8×4, 4×8 and 4×4).

Fig. 2.11 The comparison of PU size and prediction modes in H.265/HEVC.

2.2.4. Transform unit (TU) module

The TU is the basic unit for the transform and quantization used to encode the prediction residual. Like the PU, a TU structure has its root at the CU level. A TU may exceed the size of a PU, but not that of the CU. A TU can be split by a residual quadtree (RQT) with at most 3 depth levels, and TU sizes vary from 32×32 to 4×4 pixels. The relationship among the CU, PU and TU is shown in Fig. 2.12.

Fig. 2.13 shows that the rate-distortion (RD) cost of every partition mode and every CU size has to be calculated by evaluating the PUs and TUs in order to select the optimal CU size and partition mode. The encoding architecture of HEVC in terms of CU, PU and TU is shown in Fig. 2.14. For an LCU, all the PUs and available TUs listed in Table 2.1 have to be exhaustively searched by the rate-distortion optimization (RDO) process, which dramatically increases the computational complexity compared with H.264/AVC. This “try all and select the best” method results in high computational complexity and limits the use of HEVC encoders in real-time applications.

Table 2.1 PU and TU sizes in HEVC.

Depth  Inter PU              Intra PU    TU
0      64×64, 64×32, 32×64   64×64       32×32, 16×16
1      32×32, 32×16, 16×32   32×32       32×32, 16×16, 8×8
2      16×16, 16×8, 8×16     16×16       16×16, 8×8, 4×4
3      8×8, 8×4, 4×8         8×8, 4×4    8×8, 4×4

Fig. 2.12 The relationship among the CU, PU and TU.

Fig. 2.13 Perform the PUs and TUs to select the optimal CU size and partition mode by RDcost.

Fig. 2.14 The CU, PU and TU of encode architecture in HEVC.

2.2.5. RQT structure of HEVC

Integer transform and quantization of HEVC

The forward integer transform in HEVC is an approximation of the DCT specified as a matrix multiplication. The integer DCT is

$$ Y = C X C^{T} \tag{2.8} $$

where X is the residual block, Y is the transformed coefficient matrix, and C is a core transformation matrix [30]. Eq. (2.9) shows the general form of an N×N block transformed by the integer DCT in H.265/HEVC:

$$ Y_{N \times N} = \left( C_{N \times N} \, X_{N \times N} \, C_{N \times N}^{T} \right) \otimes PF_1 \otimes PF_2 \tag{2.9} $$

where all of the above matrices are of size N×N, the symbol ⊗ stands for scalar multiplication, and PF1 and PF2 are post-scaling factors. The main reason for the post-scaling is that the transformed coefficients can then be represented with 16 bits, which avoids a computational burden in hardware.

After the integer DCT, the transformed coefficients are quantized with integer operations in which Q is the quantized coefficient matrix, QP is a quantization parameter ranging from 0 to 51, and << and >> are the binary left-shift and right-shift operators, respectively. The constant n, which is an offset for quantization, is 171 and 85 for intra and inter coding, respectively. The quantization matrix (QM) is selected using the % (mod) operation, and Q_bits is derived from QP using ⌊ ⌋, which denotes rounding down to the nearest integer.
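The quantization described above can be illustrated with an integer-only sketch for a single coefficient. The text does not preserve the exact equations, so the scale table, the derivation of Q_bits and the offset shift used below follow the way the HEVC reference encoder is commonly described and should be read as assumptions; only the offsets 171 and 85, the QP range and the use of %, << and >> come from the text.

```python
# Assumed quantization scale table, indexed by QP % 6.
QM = [26214, 23302, 20560, 18396, 16384, 14564]


def quantize(coeff, qp, log2_size, bit_depth=8, intra=True):
    """Quantize one transformed coefficient with integer multiply, add and shift."""
    transform_shift = 15 - bit_depth - log2_size      # assumed transform scaling
    q_bits = 14 + qp // 6 + transform_shift           # assumed relation of Q_bits to QP
    n = 171 if intra else 85                          # offsets given in the text
    offset = n << (q_bits - 9)
    level = (abs(coeff) * QM[qp % 6] + offset) >> q_bits
    return -level if coeff < 0 else level


# 4x4 block (log2_size = 2), QP = 22: the coefficient is divided by 256 in effect.
print(quantize(4288, 22, 2, intra=True), quantize(4288, 22, 2, intra=False))
```

The coefficient 4288 corresponds to 16.75 quantization steps, so the larger intra offset rounds it up while the smaller inter offset rounds it down, which shows the dead-zone effect of the constant n.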

Best RQT structure of HEVC

Similar to an LCU, a TU is recursively partitioned into smaller TUs using a quadtree structure. The residual samples corresponding to a CU are subdivided into smaller units using an RQT, and the leaf nodes of the RQT are referred to as TUs. In order to achieve the optimal coding performance, the full RQT needs to be pruned to obtain the best RQT structure of TUs by comparing the RD cost of each upper-layer TU with that of its lower-layer sub-TUs, from bottom to top. The minimization of the RD cost is the well-known RDO process. The RD cost function is defined as

$$ J = D + \lambda \cdot R $$

where D is the distortion, R is the number of coded bits and λ is the Lagrange multiplier. The distortion can be estimated after the inverse quantization and inverse transform steps.

The RD cost is the main indicator for achieving the optimal performance. For example, Fig. 2.15 illustrates the optimal decision for a given CU size of 32×32. The leaf nodes (a~j) of the RQT represent the final decision on the size of each TU.

Although the coding efficiency of HEVC is improved by using varying transform block sizes (from 32×32 to 4×4), the computational complexity increases dramatically with the transform kernel size and the transform coding structure [32]. This is because the RQT selects the best TU partition by checking the RD cost of all possible TUs at all RQT depths. From the example shown in Fig. 2.13, we can see that the RD cost evaluation is performed a number of times within each RQT structure: once for the TU size of 32×32, four times for the TU size of 16×16, and 16 times for the TU size of 8×8.
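The number of RD cost evaluations grows by a factor of four with every additional RQT depth. The small sketch below reproduces the counts quoted above for a 32×32 CU; the function name and the `min_tu` parameter are hypothetical.

```python
def rd_evaluations(cu_size, min_tu=8):
    """Number of TUs whose RD cost is checked at each TU size of a full RQT."""
    counts = {}
    size, n = cu_size, 1
    while size >= min_tu:
        counts[size] = n
        size //= 2          # next depth halves the TU width
        n *= 4              # and quadruples the number of TUs
    return counts


counts = rd_evaluations(32)
print(counts, sum(counts.values()))
```

For a 32×32 CU with TUs down to 8×8 this gives 1 + 4 + 16 = 21 evaluations for a single RQT.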

Fig. 2.15 Example of RQT for dividing given coding tree block.

Chapter 3 Proposed efficient video coding method for H.264/AVC and H.265/HEVC

3.1 Fast intra-mode decision in H.264/AVC

In this section, a fast intra-prediction mode decision algorithm for H.264/AVC that exploits the interblock correlation in the intra-mode domain is proposed to reduce the computational complexity. Four modes of neighbouring coded macroblocks/blocks are considered as good candidate intra modes for the current block.

Experimental results show that the proposed method provides a good trade-off between the R-D performance and the computational complexity.
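The idea of reusing neighboring modes as candidates can be illustrated as follows. This is only a sketch of the general principle, not the proposed algorithm itself: the text does not say which neighbors are used, so the choice of the left, upper, upper-left and upper-right blocks, the fallback to the DC mode and all names are assumptions.

```python
DC_MODE = 2     # mode number of DC prediction for 4x4 luma blocks in H.264/AVC


def candidate_modes(mode_map, row, col):
    """Collect the distinct intra modes of already coded neighboring blocks.

    mode_map maps (row, col) of a coded block to its chosen intra mode.
    """
    neighbours = [(row, col - 1), (row - 1, col),
                  (row - 1, col - 1), (row - 1, col + 1)]   # assumed neighbor set
    candidates = []
    for pos in neighbours:
        mode = mode_map.get(pos)
        if mode is not None and mode not in candidates:
            candidates.append(mode)
    return candidates or [DC_MODE]      # assumed fallback when no neighbor is coded


coded = {(0, 0): 0, (0, 1): 1, (0, 2): 0, (1, 0): 2}
print(candidate_modes(coded, 1, 1), candidate_modes({}, 0, 0))
```

Testing only these candidates instead of all modes reduces the number of RDO calculations for the block, at the risk of missing the best mode when the neighbors are poorly correlated with it.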

3.1.1 Proposed fast intra mode decision using interblock correlation

H.264/AVC is a block-based coding scheme in which each frame is encoded block by block in raster scan order, i.e., from left to right and top to bottom. For a luma MB in an I-slice, RDO exhaustively searches the combinations of the 13 predefined intra modes to find the best mode for this MB. Fig. 3.1 shows part of the RDO
