
CHAPTER 1 INTRODUCTION

1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 describes the intra prediction process in the different profiles and the important coding tools used to achieve high compression performance. The additional features of intra prediction in SVC, such as spatial scalability, the new prediction mode, and inter-layer intra prediction between different video types (progressive or interlaced), are also introduced. The architecture design and implementation results of the intra prediction engine, which supports the H.264/AVC High profile and the scalable extension, are then presented in Chapter 3 and Chapter 4, respectively. Finally, conclusions and future work are given in Chapter 5.

Chapter 2

Overview of Intra Prediction in H.264/AVC and Scalable Video Coding

2.1 H.264/AVC Standard Overview

H.264/AVC is the newest international video coding standard from MPEG and the ITU-T Video Coding Experts Group. It provides a variety of powerful but complex techniques to achieve higher video compression, such as spatial prediction in intra coding, adaptive block-size motion compensation, a 4×4 integer transform, context-adaptive entropy coding, and adaptive deblocking filtering.

2.1.1 Profiles and Levels

To address different classes of applications, the standard defines several profiles. Four commonly used profiles are described below.

 Baseline Profile: This profile is widely used for lower-cost applications with limited computing resources, such as videoconferencing and mobile applications.

 Main Profile: Intended as the mainstream consumer profile for broadcast and storage applications. However, the importance of this profile faded when the High profile was developed for those applications.

 Extended Profile: Intended as the streaming video profile; it has relatively high compression capability and some extra tools for error robustness and server stream switching.

 High Profile: The primary profile for broadcast and disc storage applications, particularly for high-definition television applications (for example, HD DVD and Blu-ray Disc).

Figure 1 shows the differences between the coding techniques included in the different profiles of the H.264/AVC standard. Note that the coding tools “Interlace” and “8x8 Integer DCT” in the Main and High profiles refer to Macroblock-Adaptive Frame-Field (MBAFF) coding (or Picture-Adaptive Frame-Field (PAFF) coding) and the luma Intra_8x8 prediction mode, respectively.

Figure 1: Coding techniques contained in the different profiles of the H.264/AVC standard.

These two additional coding tools can further increase compression performance. PAFF and MBAFF were reported to reduce bit rates by 16 to 20% over frame-only coding and by 14 to 16% over PAFF for ITU-R 601 resolution sequences such as “Canoa” and “Mobile and Calendar”, respectively [2]. Moreover, Intra_8x8 offers more options for predicting a block and provides more efficient compression than Intra_4x4 and Intra_16x16 alone. The coding tool differences are listed in more detail in Table 1.

Table 1: Coding tools in different profiles of H.264/AVC standard

Coding Tools | Baseline | Main | Extended | High
I and P Slices | Yes | Yes | Yes | Yes
B Slices | No | Yes | Yes | Yes
SI and SP Slices | No | No | Yes | No
Multiple Reference Frames | Yes | Yes | Yes | Yes
In-Loop Deblocking Filter | Yes | Yes | Yes | Yes
CAVLC Entropy Coding | Yes | Yes | Yes | Yes
CABAC Entropy Coding | No | Yes | No | Yes
Flexible Macroblock Ordering (FMO) | Yes | No | Yes | No
Arbitrary Slice Ordering (ASO) | Yes | No | Yes | No
Redundant Slices (RS) | Yes | No | Yes | No
Data Partitioning | No | No | Yes | No
Interlaced Coding (PicAFF, MBAFF) | No | Yes | Yes | Yes
4:2:0 Chroma Format | Yes | Yes | Yes | Yes
Monochrome Video Format (4:0:0) | No | No | No | Yes
4:2:2 Chroma Format | No | No | No | No
4:4:4 Chroma Format | No | No | No | No
8-Bit Sample Depth | Yes | Yes | Yes | Yes
9- and 10-Bit Sample Depth | No | No | No | No
11- to 14-Bit Sample Depth | No | No | No | No
8x8 vs. 4x4 Transform Adaptivity | No | No | No | Yes
Quantization Scaling Matrices | No | No | No | Yes
Separate Cb and Cr QP Control | No | No | No | Yes
Separate Color Plane Coding | No | No | No | No
Predictive Lossless Coding | No | No | No | No

2.1.2 Encoder/Decoder Block Diagram

As mentioned before, the H.264/AVC standard achieves much higher video compression performance through several new coding tools, at the cost of increased complexity. Figure 2 shows the block diagram of an H.264/AVC video encoder. An embedded decoder inside the encoder performs motion compensation and intra prediction exactly as they would be carried out at the decoder side.

With this embedded decoder, the encoder can foresee the decoded result and compute the residual pixel values precisely, without any mismatch with the decoder. One of the major differences compared with previous standards is the intra prediction in H.264/AVC. Several prediction modes are provided for intra prediction to greatly improve compression in the spatial domain. On the other hand, motion compensation supports variable block sizes, multiple reference frames, and short/long-term prediction to reduce redundancy in the temporal domain. After subtracting the predicted pixels (from intra prediction or motion compensation) from the input video, the residual values are processed by DCT, quantization, and entropy coding to reduce coding redundancy. However, these powerful block-based compression methods introduce another problem, known as the blocking (blocky) effect. To mitigate it, H.264 applies an in-loop deblocking filter to improve visual quality. Finally, the coded bit stream is produced and transmitted.
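As a rough illustration of this reconstruction loop, the following C sketch shows the data flow for one block. The quantize/dequantize helpers are toy stand-ins for the real 4x4 integer transform and quantization; they are used only to make the embedded-decoder idea concrete and are not the actual H.264 operations.

```c
/* Toy stand-in for the 4x4 transform + quantization (NOT the real H.264
 * transform): simple uniform quantization with step QSTEP, just to show
 * the data flow of the embedded decoder inside the encoder. */
#define QSTEP 8

static int quantize(int r)   { return r / QSTEP; }
static int dequantize(int c) { return c * QSTEP; }

/* Encode one 4x4 block: produce coefficients for the entropy coder and the
 * reconstructed pixels that later intra/inter predictions must be built on. */
void encode_block(const int orig[16], const int pred[16],
                  int coeff[16], int recon[16])
{
    for (int i = 0; i < 16; i++) {
        int residual = orig[i] - pred[i];        /* source minus prediction   */
        coeff[i] = quantize(residual);           /* sent to entropy coding    */
        int rec_residual = dequantize(coeff[i]); /* embedded decoder path     */
        recon[i] = pred[i] + rec_residual;       /* what the decoder will see */
    }
}
```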

The decoder lacks the decision components of the encoder, such as the motion estimator and the intra mode decision, and is therefore simpler. Figure 3 shows the block diagram of an H.264/AVC video decoder. After entropy decoding of the input bit stream, the syntax elements are parsed to determine which mode is used in motion compensation or intra prediction. The residual values are then generated by inverse quantization and IDCT. With the residual values and the predicted pixel values, the video can be reconstructed. Finally, the deblocking filter is invoked to reduce the blocking artifacts and improve visual quality.

Figure 2: Block diagram of H.264/AVC video encoder.

Figure 3: Block diagram of H.264/AVC video decoder.

2.1.3 Intra Block Decoding

As shown in Figure 3, intra prediction (the grey block) is one of the major prediction engines in the H.264/AVC standard. As mentioned in the previous section, H.264 provides several intra prediction modes to improve compression in the spatial domain. There are two classes of intra prediction modes in the Baseline and Main profiles of H.264/AVC: the Intra_4x4 prediction modes and the Intra_16x16 prediction modes. A third class, Intra_8x8, is supported in the High profile and will be introduced in Section 2.1.5.

Figure 4: Intra_4x4 prediction mode directions.

The Intra_4x4 class offers a total of nine prediction modes. The arrows in Figure 4 indicate the direction of prediction in each mode. The nine modes are vertical (0), horizontal (1), DC (2), diagonal down-left (3), diagonal down-right (4), vertical-right (5), horizontal-down (6), vertical-left (7), and horizontal-up (8).

The predicted 4x4 blocks (shown in grey in Figure 5) are calculated from the neighboring samples labeled A–M in Figure 5. In DC mode, the prediction is the mean of the neighboring pixel values. In all other modes, each predicted pixel is calculated from a few neighboring pixels, which depend on the prediction mode and on the pixel's position within the 4x4 block.
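As an illustration of the prediction arithmetic, the following C sketch implements three of the nine modes (vertical, horizontal, and DC) for the case where all neighbors are available; the remaining directional modes and the availability handling defined in the standard are omitted.

```c
/* Sketch of three Intra_4x4 modes, using the neighbor labels of Figure 5:
 * above[0..3] = A..D, left[0..3] = I..L. The above-right samples E..H and
 * the handling of unavailable neighbors are not shown. */
void intra4x4_vertical(int pred[4][4], const int above[4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = above[x];          /* copy the pixel directly above */
}

void intra4x4_horizontal(int pred[4][4], const int left[4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = left[y];           /* copy the pixel directly left  */
}

void intra4x4_dc(int pred[4][4], const int above[4], const int left[4])
{
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += above[i] + left[i];
    int dc = (sum + 4) >> 3;                /* rounded mean of A..D and I..L */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = dc;
}
```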

Figure 5: Intra_4x4 prediction modes based on the neighboring pixels.

In the Intra_16x16 class, there are a total of four prediction modes: vertical, horizontal, DC, and plane, as shown in Figure 6. The vertical, horizontal, and DC modes are very similar to the corresponding Intra_4x4 prediction modes described above. The new mode in Intra_16x16 is the plane mode, a linear plane function fitted to the upper and left-hand samples H and V. It works well in areas of smoothly varying luminance.
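The plane mode fits the parameters a, b, and c from the boundary samples and then evaluates a clipped linear function over the 16x16 block. The following C sketch follows the plane-mode formulation of the standard, assuming all neighboring samples are available.

```c
#include <stdint.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Sketch of the Intra_16x16 plane mode, assuming all neighbors are available.
 * above[0..15] are the reconstructed pixels on top of the macroblock,
 * left[0..15] those to its left, and corner is the top-left neighbor. */
void intra16x16_plane(uint8_t pred[16][16],
                      const uint8_t above[16], const uint8_t left[16],
                      uint8_t corner)
{
    int H = 0, V = 0;
    for (int i = 0; i < 8; i++) {
        /* gradients along the top row (H) and the left column (V) */
        H += (i + 1) * (above[8 + i] - (i == 7 ? corner : above[6 - i]));
        V += (i + 1) * (left[8 + i]  - (i == 7 ? corner : left[6 - i]));
    }
    int a = 16 * (above[15] + left[15]);
    int b = (5 * H + 32) >> 6;
    int c = (5 * V + 32) >> 6;

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = clip255((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```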

Figure 6: Intra_16x16 prediction modes based on the neighboring pixels.

For the luma samples of a macroblock, both the Intra_4x4 and the Intra_16x16 prediction modes are valid. For the chroma samples, only four prediction modes are available; they are very similar to the 16x16 luma prediction modes but differ slightly in their parameters.

2.1.4 Macroblock-Adaptive Frame-Field Intra Decoding

Video sources usually come in one of two formats, progressive or interlaced, and different techniques for handling them have been adopted in a range of video standards, such as H.261 [5], H.263 [6], MPEG-1 [7], MPEG-2 [8], MPEG-4 [9], and the newly released H.264/AVC [1]. In particular, a frame of an interlaced video sequence consists of two fields, scanned at different time instants. A video sequence containing interlaced video can therefore be compressed in many different ways, which can be grouped into two main categories:

(a) Fixed Frame/Field and (b) Adaptive Frame/Field.

2.1.4.1 Types of Video Source Formats

In the first category, all frames of a sequence are encoded in either frame or field mode. In the second category, the field or frame structure is selected adaptively depending on the content; the adaptation can be done at either the picture level or the macroblock level.

For picture-level adaptation of frame/field coding (PAFF), an input frame of a sequence can be encoded as one frame or as two fields. If frame mode is chosen, a picture created by interleaving the even- and odd-numbered lines is partitioned into 16x16 macroblocks (MBs) and coded as a whole frame. In field coding, the top and bottom field pictures are each partitioned into 16x16 MBs and coded as separate pictures. For MB-level adaptation of frame/field coding (MBAFF), the frame is partitioned into 16x32 MB pairs, and the two MBs in each MB pair are both coded in frame mode or both in field mode, as shown in Figure 7(a). Any combination of MB pairs can exist in a picture, as illustrated in Figure 7(b) (Figure 7 assumes that the picture width and height are both 64 pixels).

Therefore, there are two decoding units, depending on the video source format. One is the MB, used for the fixed frame/field and picture-level adaptive frame/field formats; the other is the MB pair, used for the MBAFF format. The sizes of an MB and an MB pair are 16x16 and 16x32, respectively. The scanning order of the different decoding units within a frame and the relationship between the current unit and its neighboring units are shown in Figure 8(a) and Figure 8(b).
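The following C sketch illustrates how the 16x32 luma region of an MB pair maps to its two macroblocks in frame and field mode (even rows forming the top field and odd rows the bottom field). It is a simplified illustration rather than decoder code.

```c
#include <stdint.h>

/* Split the 16x32 luma region of an MBAFF macroblock pair into its top and
 * bottom 16x16 macroblocks. 'pair' holds 32 rows of 16 pixels; 'field_mode'
 * corresponds to the mb_field_decoding_flag of the pair. */
void split_mb_pair(const uint8_t pair[32][16], int field_mode,
                   uint8_t top_mb[16][16], uint8_t bot_mb[16][16])
{
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            if (field_mode) {
                top_mb[y][x] = pair[2 * y][x];      /* even rows: top field   */
                bot_mb[y][x] = pair[2 * y + 1][x];  /* odd rows: bottom field */
            } else {
                top_mb[y][x] = pair[y][x];          /* upper half of the pair */
                bot_mb[y][x] = pair[16 + y][x];     /* lower half of the pair */
            }
        }
    }
}
```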


Figure 7: (a) An MB pair coded in frame mode and in field mode, and (b) an example of different combinations of MB pairs in a picture.

In short, frame coding is generally more efficient for progressive video and for static pictures in interlaced video, while field coding is more efficient for moving pictures in interlaced video. Thus, picture-adaptive frame/field (PAFF) coding is supported in H.264 to improve the coding efficiency for interlaced video containing both static and moving pictures. PAFF coding requires that the encoder be able to adaptively select the picture coding mode, which is realized by two-pass coding in the H.264 reference software: frame mode and field mode are each tried, and the one with the smaller rate-distortion cost is chosen as the picture coding mode.
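A minimal sketch of this two-pass decision is shown below; encode_as_frame and encode_as_field are hypothetical routines standing in for the two complete coding passes of the reference software, each returning its rate-distortion cost.

```c
/* Hypothetical coding passes; each returns a rate-distortion cost D + lambda*R. */
double encode_as_frame(const void *picture);
double encode_as_field(const void *picture);

typedef enum { CODE_FRAME, CODE_FIELD } pic_mode_t;

/* Two-pass PAFF decision: code the picture both ways and keep the cheaper mode. */
pic_mode_t choose_paff_mode(const void *picture)
{
    double cost_frame = encode_as_frame(picture);  /* pass 1: frame coding */
    double cost_field = encode_as_field(picture);  /* pass 2: field coding */
    return (cost_field < cost_frame) ? CODE_FIELD : CODE_FRAME;
}
```

The MB pair level decision of MBAFF described next follows the same pattern, applied per 16x32 MB pair instead of per picture.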


Figure 8: (a) Scanning order of the different decoding units in a frame and (b) relationship between the current unit and its neighboring units.

As an extension of PAFF, MBAFF is used to improve the coding efficiency of pictures that contain both static and moving regions. The reference software performs MBAFF coding with a two-pass strategy at the MB pair level: every MB pair is coded in frame mode and in field mode, and the mode with the smaller rate-distortion cost is selected. In general, MBAFF coding is the more efficient choice. Figure 9 shows simulation results for the different video sequence formats on different test sequences [10].

Figure 9: PSNR vs. bit rate curves for (a) the Mobile sequence (I only) and (b) the Stefan sequence.

2.1.4.2 Intra Block Neighboring Information in MBAFF

As mentioned in the previous section, the even and odd rows are gathered into different MBs (the top or bottom MB of an MB pair) according to the coding mode of the MB pair. The neighboring pixels of an intra block therefore become more complex to derive when MBAFF is supported.

Figure 10 shows the neighboring pixels of a 4x4 intra block in an MB pair. When the current MB pair is coded in frame mode, the neighboring pixels are similar to those in the Baseline profile, where the decoding unit is the MB, as illustrated in Figure 10(a). However, when the current MB pair is coded in field mode, the relationship between the current 4x4 block and its neighboring pixels becomes more complicated, and the neighboring pixels are no longer adjacent to the 4x4 block. Depending on the position of the 4x4 block within the MB pair, the neighboring pixels lie in different positions, as shown in Figure 10(b). The principle can be stated simply: pixels in the top (or bottom) field of the current block may only use neighboring pixels that also belong to the top (or bottom) field of the neighboring blocks. If a neighboring block is not coded in field mode, it is first reinterpreted as field blocks, and the corresponding neighboring pixels are then taken from the same field. In total there are four combinations of current and neighboring MB pair modes (frame/frame, frame/field, field/frame, and field/field); here we take two of them (frame/frame and frame/field) as examples.
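The row mapping implied by this "same field" rule can be sketched as follows; the function and argument names are illustrative and not taken from the standard.

```c
/* Sketch of the "same field" rule for MBAFF neighbors: when the current block
 * is field coded but the neighboring MB pair is frame coded, the neighbor's
 * rows are reinterpreted as fields before reference pixels are picked.
 * 'row_in_field' is the row index inside the current field, and
 * 'bottom_field' is 1 for the bottom-field macroblock of the current pair. */
int neighbor_row_in_pair(int row_in_field, int bottom_field,
                         int neighbor_is_field_coded)
{
    if (neighbor_is_field_coded)
        return row_in_field;              /* neighbor already stored per field */
    /* Frame-coded neighbor: the top field occupies the even rows of the
     * 16x32 pair and the bottom field the odd rows. */
    return 2 * row_in_field + bottom_field;
}
```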


Figure 10: Neighboring information in (a) frame mode MB pair and (b) field mode MB pair (assume neighboring MB pairs are all coded in frame mode).

Clearly, supporting MBAFF mode requires additional side information to indicate the frame/field mode of each MB pair. How to efficiently store and decode these neighboring pixels and MB pairs therefore becomes an important issue for the MBAFF video format.

2.1.5 Intra_8x8 Prediction Mode Decoding

For intra prediction, luma Intra_8x8 is an additional intra block type supported in the H.264 High profile, which includes all features of the prior Main profile. Luma Intra_8x8 prediction extends the concepts of Intra_4x4 prediction but adds an extra process, as described in Figure 11.

Figure 11: Difference between Intra_8x8 and Intra_16x16/Intra_4x4

Figure 12: Nine Intra_8x8 prediction modes (note: each neighboring pixel (A~Y) is already filtered by RSFP).

Similar to Intra_4x4, Intra_8x8 has nine directional prediction modes, as shown in Figure 12. The luma value of each sample in a given 8x8 block is predicted from neighboring reconstructed reference pixels according to the prediction mode. Note that, before an Intra_8x8 block is decoded, there is an extra process that Intra_4x4 does not have: the neighboring reference pixels must be filtered first, and these filtered pixels are then used to predict the 8x8 block. An Intra_4x4 block needs 13 neighboring pixels, whereas an Intra_8x8 block needs 25 filtered neighboring pixels.

This new intra coding tool not only improves I-frame coding efficiency significantly [11], but also introduces extra cycles and data dependencies for generating the filtered neighboring pixels.
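The extra process is a low-pass filtering of the reference samples (the RSFP noted in Figure 12). The C sketch below shows the [1 2 1]/4 filter applied to the row above the block, assuming all 25 neighbors are available; the left column is filtered in the same way, and the cases with unavailable samples are handled by slightly different equations in the standard.

```c
#include <stdint.h>

/* Sketch of the reference sample filtering performed before Intra_8x8
 * prediction, for the row above the block only and assuming all neighbors
 * are available. above[0..15] holds the 16 pixels above the 8x8 block
 * (including the 8 above-right samples); corner is the top-left neighbor. */
void filter_intra8x8_above(const uint8_t above[16], uint8_t corner,
                           uint8_t filtered[16])
{
    filtered[0] = (corner + 2 * above[0] + above[1] + 2) >> 2;
    for (int x = 1; x < 15; x++)
        filtered[x] = (above[x - 1] + 2 * above[x] + above[x + 1] + 2) >> 2;
    filtered[15] = (above[14] + 3 * above[15] + 2) >> 2;
}
```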

2.2 SVC Extension of H.264/AVC Standard Overview

Scalable Video Coding (SVC) [12] is the name given to an extension of the H.264/AVC video compression standard. SVC enables the transmission and decoding of partial bit streams to provide video services with lower temporal or spatial resolutions or reduced fidelity while retaining a reconstruction quality that is high relative to the rate of the partial bit streams.

Hence, SVC provides techniques such as graceful degradation in lossy transmission environments and bit rate, format, and power adaptation. These capabilities ease transmission and storage in many applications. Furthermore, SVC achieves significant improvements in coding efficiency, with an increased degree of supported scalability, relative to the scalable profiles of prior video coding standards.

Scalable Video Coding, as specified in the extension of H.264/AVC, allows the construction of bit streams that contain sub-bitstreams which themselves conform to H.264/AVC. For temporal scalability, a sub-bitstream with a lower temporal sampling rate than the complete bitstream can be extracted; complete access units are removed from the bitstream when deriving such a temporal sub-bitstream. For spatial and quality scalability, sub-bitstreams with a lower spatial resolution or lower quality than the complete bitstream can be extracted; in this case, NAL (Network Abstraction Layer) units are removed from the bitstream. Inter-layer prediction (i.e., prediction between the higher spatial resolution or quality layer and the lower spatial resolution or quality layer) is typically used to improve coding efficiency, and is explained further in Section 2.2.3.1.
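Conceptually, the extractor keeps only those NAL units whose layer identifiers do not exceed the target operation point. The following C sketch illustrates this test; the struct mirrors the dependency_id, quality_id, and temporal_id fields of the SVC NAL unit header extension, but the actual bit stream parsing is not shown.

```c
/* Minimal sketch of sub-bitstream extraction: NAL units whose SVC header
 * fields exceed the target operation point are dropped. */
typedef struct {
    int dependency_id;   /* D: spatial (dependency) layer */
    int quality_id;      /* Q: quality (SNR) layer        */
    int temporal_id;     /* T: temporal layer             */
} nal_svc_header;

int keep_nal_unit(const nal_svc_header *h,
                  int target_d, int target_q, int target_t)
{
    return h->dependency_id <= target_d &&
           h->quality_id    <= target_q &&
           h->temporal_id   <= target_t;
}
```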

2.2.1 Profiles and Levels

As a result of the Scalable Video Coding extension [13], the standard contains three additional scalable profiles: Scalable Baseline, Scalable High, and Scalable High Intra. These profiles are defined as a combination of the H.264/AVC profile for the base layer and tools that achieve the scalable extension:

 Scalable Baseline Profile: Primarily targeted at mobile broadcast, conversational, and surveillance applications that require low decoding complexity. In this profile, support for spatial scalable coding is restricted to resolution ratios of 1.5 and 2 between spatial layers in both the horizontal and vertical directions, and to macroblock-aligned cropping. Furthermore, the coding tools for interlaced sources are not included in this profile. Bit streams conforming to the Scalable Baseline profile contain a base layer bit stream that conforms to the restricted Baseline profile of H.264/AVC. It should be noted that the Scalable Baseline profile supports B slices, weighted prediction, CABAC entropy coding, and the 8x8 luma transform in enhancement layers (CABAC and the 8x8 transform only for certain levels), although the base layer has to conform to the restricted Baseline profile, which does not support these tools.

 Scalable High Profile: Primarily designed for broadcast, streaming, and storage applications. In this profile, the restrictions of the Scalable Baseline profile are removed and spatial scalable coding with arbitrary resolution ratios and cropping parameters is supported. Bit streams conforming to the Scalable High profile contain a base layer bit stream that conforms to the High profile of H.264/AVC.

 Scalable High Intra Profile: Primarily designed for professional applications. Bit streams conforming to this profile contain only IDR pictures (for all layers). Besides, the same set of coding tools as for the Scalable High profile is supported.

The coding tool differences between SVC profiles are described in Table 2.

Table 2: Coding tool differences between SVC profiles

Coding Tools | Scalable Baseline | Scalable High | Scalable High Intra
Base Layer Profile | AVC Baseline | AVC High | AVC High Intra
CABAC and 8x8 Transform | Only for certain levels | Yes | Yes
Layer Spatial Ratio | 1:1, 1.5:1, or 2:1 | Unrestricted | Unrestricted
I, P, and B Slices | Limited B Slices | All | Only I Slices
Interlaced Tools (MBAFF & PAFF) | No | Yes | Yes

2.2.2 Encoder/Decoder Block Diagram

An example of combining spatial, quality, and temporal scalability is illustrated in Figure 13, which shows an encoder block diagram with two spatial layers.

The SVC coding structure is organized in dependency layers (i.e., spatial layers). A dependency layer usually represents a specific spatial resolution and is identified by a layer identifier D. The identifier D equals 0 for the base layer, which is an H.264/AVC-compatible layer, and increases by 1 from one spatial layer to the next. Within each dependency layer, the basic H.264 prediction engines of motion compensation and intra prediction are used as in single-layer coding. In addition, inter-layer prediction is exploited to reduce the redundancy between dependency layers and is explained further in Section 2.2.3.1. With this new inter-layer prediction, spatial scalable coding in SVC achieves higher coding efficiency than simulcasting the different spatial resolutions.
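As a very rough illustration of inter-layer intra prediction for dyadic (2:1) spatial scalability, the sketch below upsamples a reconstructed base-layer block and uses it as the prediction for the co-located enhancement-layer block. Nearest-neighbor upsampling is used here purely for readability; the SVC standard specifies dedicated interpolation filters instead (see Section 2.2.3.1 for the actual inter-layer prediction tools).

```c
#include <stdint.h>

/* Simplified inter-layer intra prediction for a 2:1 resolution ratio: the
 * reconstructed 8x8 base-layer block is upsampled to 16x16 and used as the
 * enhancement-layer prediction. Nearest-neighbor upsampling is a placeholder
 * for the interpolation filters defined in the SVC specification. */
void inter_layer_intra_pred(const uint8_t base[8][8], uint8_t pred[16][16])
{
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = base[y / 2][x / 2];  /* upsample base-layer reconstruction */
}
```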

Figure 13: SVC encoder block diagram (note: there can be more than one enhancement layer).

In a special case that the spatial resolutions for two dependency layers are identical, in

