Introduction and Motivation - Acceleration of the H.264 Encoder and Its Scalable Extension Decoder and Implementation on a TI DSP System Platform

Chapter 1 Introduction

1.1 Introduction and Motivation

With the growing popularity of mobile communication, video transmission over wireless channels will become an essential element of our daily life. Many international video compression standards, such as H.261, H.263, MPEG-2 and MPEG-4, have already been widely used in different situations. In this thesis, we concentrate on the H.264/AVC standard and its newest scalable extension (H.264/AVC SVC). Our focus is to implement the H.264/AVC encoder and the H.264/AVC SVC decoder on digital signal processors (DSPs).

H.264/AVC is a recent standard defined by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. It provides better compression of video images together with a range of features supporting high-quality, low-bit-rate streaming video. The basic functional elements (prediction, transform, quantization, entropy encoding) are similar to those in previous standards, but the important refinements in H.264 occur in the details of each functional element.

Scalable video coding is currently being developed as an extension of H.264/AVC. The Joint Video Team of the ISO/IEC MPEG and the ITU-T VCEG is now standardizing this extension. It is intended to encode the signal once, but allow decoding from the partial streams

chips on the board. The DSP chips are Texas Instruments (TI) TMS320C6416T devices. The TMS320C6416T is a fixed-point DSP with a 1 ns instruction cycle time (1 GHz clock). It adopts the advanced VelociTI Very Long Instruction Word (VLIW) architecture, which sustains a throughput of up to eight instructions in parallel and thus lets the processor run faster. In addition, we accelerate the H.264/AVC encoder and the H.264/AVC SVC decoder with DSP coding techniques and several efficient algorithms.

This thesis is organized as follows. Chapter 2 is an overview of the H.264/AVC video standard. Chapter 3 introduces the H.264/AVC SVC. Chapter 4 gives a brief description of the TI DSP chip and its development environment. Chapter 5 describes the algorithm and code acceleration methods of the H.264/AVC encoder and shows the experimental results on the DSP. Chapter 6 describes the acceleration of the H.264/AVC SVC decoder for the DSP and presents experimental results. Finally, Chapter 7 contains the conclusion.

Chapter 2 H.264 Video Coding

H.264/MPEG-4 AVC (Advanced Video Coding) is a video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The first draft design for the H.264/AVC coding standard was adopted in October 1999. VCEG and MPEG then formed the Joint Video Team (JVT), which drafted the new video coding standard as H.264/AVC [1] in March 2003.

H.264 was developed from H.26L and promises to significantly outperform MPEG-4 and H.263, providing better compression of video images together with a range of features supporting high-quality, low-bit-rate streaming video. It uses state-of-the-art coding tools and provides enhanced coding efficiency for a wide range of applications, including video telephony, digital video authoring, digital cameras, and many others. This chapter describes the H.264/AVC standard.

2.1 Overview of H.264 Encoder

The H.264/AVC standard comprises a Video Coding Layer (VCL), designed to efficiently represent the video content, and a Network Abstraction Layer (NAL), which formats the VCL representation and provides header information in a manner appropriate for conveyance by a variety of transport layers or storage media.

redundancy. The entropy coding removes syntax redundancy. In the block diagram, each video frame is passed to the intra or inter prediction part. If the frame type is intra, inter prediction is disabled. Multiple reference frames and variable block size motion estimation are used for inter prediction. The best mode among these prediction modes is chosen in the mode selection block. The prediction is then subtracted from the input frame to form the residual block. The residual blocks are transformed using a 4x4 integer DCT-like transform for luminance and a 2x2 transform for the chrominance DC coefficients; scan and quantization procedures are then applied to the coefficients. The entropy coder receives these quantized coefficients and generates codewords. The reconstruction loop includes dequantization, the inverse transform and the deblocking filter. Finally, the reconstructed frame is written to the frame buffer for motion estimation.

Figure 2-1 H.264/AVC encoder structure [2]

There are three profiles defined in the H.264/AVC standard: baseline, main and extended

CAVLC. In the main profile, B-frame coding is supported and entropy coding can use CABAC.

The extended profile has all the features of the baseline profile and additionally includes B-frame, SI-frame, and SP-frame coding.

2.2 Slice and Slice Groups

A slice is a sequence of macroblocks that are processed in raster-scan order. A picture is split into one or several slices, as shown in Figure 2-2. A picture in H.264/AVC is therefore a collection of one or more slices. Slices are self-contained in the sense that, given the active sequence and picture parameter sets, their syntax elements can be parsed from the bit-stream. Furthermore, the values of the samples in the area of the picture that a slice represents can be correctly decoded without the use of data from other slices, provided that the reference pictures used are identical at the encoder and decoder. Each slice can be coded using one of the following coding types.

I slice: A slice in which all macroblocks of the slice are coded using intra prediction.

P slice: In addition to the coding types of I slice, some macroblocks of the P slice can also be coded using inter prediction with at most one motion-compensated prediction signal per prediction block.

B slice: In addition to the coding types available in a P slice, some macroblocks of the B slice can also be coded using inter prediction with two motion-compensated prediction signals per prediction block.

SP slice: A so-called switching P slice that is coded such that efficient switching between different pre-coded pictures becomes possible.

SI slice: A so-called switching I slice that allows an exact match of a macroblock in

Figure 2-2 Subdivision of a picture into slices [2]

2.3 Inter Prediction

High-quality video sequences usually have high frame rates. Therefore, two successive frames in a video sequence are very likely to be similar. The goal of inter prediction is to exploit this temporal correlation to reduce the amount of data that needs to be encoded. Inter prediction creates a prediction model from one or more previously encoded video frames. The model is formed by shifting samples in the reference frame(s). H.264/AVC supports motion compensation block sizes ranging from 16x16 to 4x4 luminance samples, with many options between the two sizes. The luminance component of each macroblock (16x16 samples) may be split in four ways, as shown in Figure 2-3: 16x16, 16x8, 8x16, or 8x8. If the 8x8 mode is chosen, each of the four 8x8 macroblock partitions within the macroblock may be further split in four ways, as shown in Figure 2-4: 8x8, 8x4, 4x8, or 4x4.

Figure 2-4 Macroblock sub-partitions: 8x8, 8x4, 4x8 and 4x4

Each partition in an inter-coded macroblock is predicted from an area of the same size in a reference picture. The distance between the two areas (the motion vector) has 1/4-pixel resolution (for the luma component). In case the motion vector points to an integer-sample position, the prediction signal consists of the corresponding samples of the reference pictures.

Otherwise, the prediction sample is obtained by interpolation at the non-integer position. The prediction values at half-sample positions are obtained by applying a one-dimensional 6-tap FIR (Finite Impulse Response) filter horizontally and vertically. For example, in Figure 2-5, the half-pixel sample b is calculated from the six horizontal integer samples E, F, G, H, I and J:

$$ b = \big( (E - 5F + 20G + 20H - 5I + J) + 16 \big) \gg 5 $$

Similarly, h is interpolated by filtering A, C, G, M, R and T. Once all of the samples horizontally and vertically adjacent to integer samples have been calculated, the remaining half-pixel positions are calculated by interpolating between six horizontal or vertical half-pixel samples from the first set of operations. For example, the sample at the half-sample position labeled j is obtained by

$$ j = \big( (cc - 5dd + 20h + 20m - 5ee + ff) + 512 \big) \gg 10 $$
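The two half-sample formulas can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, clipping the result to the 8-bit range is an assumption taken from common H.264 practice, and the center position is computed from unnormalized intermediate sums, which is what the rounding offset 512 and the shift by 10 imply.

```python
def six_tap(samples):
    """Apply the (1, -5, 20, 20, -5, 1) filter; the result is not yet normalized."""
    a, b, c, d, e, f = samples
    return a - 5 * b + 20 * c + 20 * d - 5 * e + f


def clip255(x):
    """Clip to the 8-bit sample range (assumed here, not stated in the text)."""
    return max(0, min(255, x))


def half_pel(samples):
    """Half-sample value such as b from six integer samples E..J."""
    return clip255((six_tap(samples) + 16) >> 5)


def center_half_pel(rows):
    """Half-sample value such as j from a 6x6 neighborhood of integer samples.

    Each row is filtered horizontally without normalization, then the six
    intermediate sums are filtered vertically and normalized once.
    """
    intermediate = [six_tap(row) for row in rows]
    return clip255((six_tap(intermediate) + 512) >> 10)
```

On a flat area the filter returns the input value unchanged, and across a sharp edge between 0 and 255 the half-sample value lands midway at 128.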

Once all the half-pixel samples are available, the quarter-pixel positions are produced by linear interpolation. Quarter-pixel positions with two horizontally or vertically adjacent half-

of the luma, the displacements used for chroma have one-eighth sample position accuracy.

Figure 2-5 Filter for fractional-sample accurate motion compensation [2]

2.3.1 Motion Vector Prediction

Encoding a motion vector for each partition can take a significant number of bits, especially if small partition sizes are chosen. Motion vectors of neighboring partitions are often highly correlated, so each motion vector is predicted from the vectors of nearby, previously coded partitions. A predicted vector MVp (Motion Vector Prediction) is formed from previously calculated motion vectors. The motion vector difference MVD, the difference between the current vector and the predicted vector, is coded and transmitted. The method of forming the prediction MVp depends on the motion compensation partition size and on the availability of nearby vectors.

Let E be the current macroblock, macroblock partition or sub-partition; let A be the partition or sub-partition immediately to the left of E; let B be the partition or sub-partition

same size (16x16 in this case). The MVp of the current macroblock E is calculated from the motion vectors of macroblocks A, B and C. In the decoder, the predicted vector MVp is formed in the same way and added to the decoded vector difference MVD.
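As a rough sketch of this prediction, the following assumes the component-wise median of the three neighbors A, B and C, which is the rule H.264/AVC uses for most partition shapes. The function names are hypothetical, and availability handling and the special cases for 16x8 and 8x16 partitions are left out.

```python
def median_mvp(mv_a, mv_b, mv_c):
    """Component-wise median of three (x, y) motion vectors (assumed rule)."""
    return tuple(sorted(component)[1] for component in zip(mv_a, mv_b, mv_c))


def encode_mvd(mv, mvp):
    """Encoder side: only the difference MVD = MV - MVp is transmitted."""
    return (mv[0] - mvp[0], mv[1] - mvp[1])


def decode_mv(mvd, mvp):
    """Decoder side: MV = MVp + MVD, with MVp formed the same way as in the encoder."""
    return (mvd[0] + mvp[0], mvd[1] + mvp[1])
```

For neighbors (4, -2), (6, 0) and (5, 8) the predictor is (5, 0), so a current vector of (7, 1) is sent as the small difference (2, 1).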

Figure 2-6 Choice of neighboring partitions [3]

2.4 Intra Prediction

In the H.264/AVC standard, each 16x16 macroblock (MB) is the basic unit to be encoded. Intra prediction exploits the high correlation of neighboring samples in the spatial domain: the prediction block is formed from previously coded and reconstructed blocks to the left of and/or above the current block, taken before deblocking filtering. For the luma samples, a prediction block may be formed for each 4x4 block (denoted as I4MB) or for an entire MB (denoted as I16MB).

When I4MB prediction is used, each 4x4 block is predicted from spatially neighboring samples, and the best of nine prediction modes is chosen. In addition to the DC prediction mode, eight directional prediction modes are supported, as shown in Figure 2-7. These modes are suitable for predicting directional structures in a picture, such as edges at various angles.

Figure 2-7 Intra 4x4 prediction mode [3]
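To make the 4x4 prediction concrete, here is a minimal sketch of three of the nine modes (vertical, horizontal and DC) together with a mode choice. The function names are hypothetical, both neighbor sets are assumed to be available, and the sum of absolute differences (SAD) is used as the selection criterion purely as an assumption for illustration.

```python
def predict_4x4(mode, above, left):
    """Build a 4x4 prediction block from the 4 samples above and the 4 to the left."""
    if mode == "vertical":      # every column repeats the sample above it
        return [list(above) for _ in range(4)]
    if mode == "horizontal":    # every row repeats the sample to its left
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":            # rounded mean of the eight neighboring samples
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("unsupported mode: " + mode)


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))


def best_mode(block, above, left, modes=("vertical", "horizontal", "dc")):
    """Pick the mode whose prediction is closest to the block (SAD is an assumption)."""
    return min(modes, key=lambda m: sad(block, predict_4x4(m, above, left)))
```

A block whose columns continue the samples above it is predicted exactly by the vertical mode, so its residual is zero.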

Figure 2-8 Intra 16x16 prediction mode [3]

2.5 Mode Decision

In the high-complexity mode of H.264/AVC, the macroblock mode decision is made by minimizing the Lagrangian function [1]:

$$ J(s, c, \mathrm{MODE} \mid QP, \lambda_{\mathrm{MODE}}) = \mathrm{SSD}(s, c, \mathrm{MODE} \mid QP) + \lambda_{\mathrm{MODE}} \times R(s, c, \mathrm{MODE} \mid QP) $$

where J denotes the cost function, which depends on s (the original signal macroblock), c (the reconstructed signal macroblock) and MODE (selected from a set of modes). J is evaluated for a given QP (the macroblock quantization parameter) and λMODE (the Lagrange multiplier for mode decision). SSD is the sum of squared differences between the original macroblock and its reconstruction with QP; it also depends on the original and reconstructed macroblocks as well as on the mode decision (MODE). The Lagrange multiplier, λMODE, depends

Finally, the rate R(s, c, MODE | QP) depends on the original and reconstructed macroblocks with quantization parameter QP, as well as on the chosen MODE, and reflects the number of bits produced for the header(s) (including MODE indicators), motion vector(s) and coefficients.
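The minimization of J can be sketched as below. The distortion and rate figures are made-up illustrative numbers, the function names are hypothetical, and λMODE is passed in as a plain parameter rather than derived from QP.

```python
def rd_cost(ssd, rate_bits, lam):
    """Lagrangian cost J = SSD + lambda_MODE * R for one candidate mode."""
    return ssd + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the mode with the smallest J.

    candidates maps a mode name to an (SSD, rate in bits) pair that the
    encoder is assumed to have measured by actually coding the macroblock.
    """
    return min(candidates, key=lambda m: rd_cost(candidates[m][0], candidates[m][1], lam))


# Hypothetical measurements for one macroblock.
candidates = {
    "I16MB": (1200, 40),
    "P_16x16": (1500, 12),
    "SKIP": (2600, 1),
}
```

With these numbers a small multiplier favors the low-distortion I16MB candidate, while larger multipliers shift the choice toward the cheaper P_16x16 and finally SKIP, which shows how λMODE trades distortion against rate.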

In H.264, MODE is chosen from a set of potential prediction modes as follows:

For Intra slices:

MODE ∈ {I4MB, I16MB}

For P slices: {single reference forward or backward prediction}

MODE ∈ {I4MB, I16MB, SKIP, P_16x16, P_16x8, P_8x16, P_8x8}

For B slices: {bi-directionally predicted slices}

MODE ∈ {I4MB, I16MB, DIRECT, P_16x16, P_16x8, P_8x16, P_8x8}

The DIRECT mode is particular to the B slices, while the SKIP mode implies that no motion or residual information will be encoded.

In the above mode sets, when the best mode is an intra mode (I4MB, I16MB), the mode is chosen by evaluating the Lagrangian function over the mode choices described in Section 2.4.

When the best mode is an inter mode, it is chosen from seven different block sizes (P_16x16, P_16x8, P_8x16, P_8x8, P_8x4, P_4x8, and P_4x4), shown in Figure 2-3 and Figure 2-4. Figure 2-9 shows the flow chart of the H.264/AVC mode decision algorithm.

To find the mode with the least RD cost, we need to calculate the rate and distortion for all modes. For example, when choosing the best mode for a 16x16 macroblock belonging to a P or B slice (luma component only), we need 144 cost evaluations for the best I4MB mode (9 modes times 16 partitions of 4x4 blocks), 4 more evaluations for the I16MB case, 16 more for the best P_8x8 inter mode (4 modes times 4 partitions of 8x8 blocks), and 4 more for selecting the minimal cost among the rest of the modes, which results in 168 evaluations.
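The count above can be reproduced with a few lines of arithmetic. The function and key names are chosen only for this illustration.

```python
def count_cost_evaluations():
    """Tally the RD cost evaluations for one 16x16 luma macroblock in a P or B slice."""
    counts = {
        "I4MB": 9 * 16,    # 9 prediction modes for each of the 16 4x4 blocks
        "I16MB": 4,        # 4 prediction modes for the whole macroblock
        "P_8x8": 4 * 4,    # 4 sub-partition modes for each of the 4 8x8 blocks
        "remaining": 4,    # selecting the minimum among the rest of the modes
    }
    counts["total"] = sum(counts.values())
    return counts
```

The I4MB search alone accounts for 144 of the 168 evaluations, which is why it is a natural target for acceleration.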

2.6 Loop Filter

One particular characteristic of block-based coding is the accidental production of visible block structures. Block edges are typically reconstructed with less accuracy than interior pixels, and “blocking” is generally considered to be one of the most visible artifacts of present compression methods. H.264/AVC defines an adaptive in-loop deblocking filter, where the strength of filtering is controlled by the values of several syntax elements. The deblocking filter is applied after the inverse transform. The filter has two benefits: (1) block edges are smoothed, improving the appearance of decoded images, and (2) the filtered macroblock is used for motion-compensated prediction of further frames in the encoder, resulting in a smaller residual after prediction. The basic idea of the filter is that a relatively large absolute difference between samples near a block edge, measured against a QP-dependent threshold, is quite likely a blocking artifact and should be reduced. However, if the magnitude of that difference is so large that it cannot be explained by the coarseness of the quantization used in the encoding, the edge is more likely to reflect the actual behavior of the source picture and should not be smoothed over. The deblocking filter is an adaptive filter whose strength depends on the compression mode of a macroblock, the quantization parameter, the motion vectors, the frame or field coding decision and the pixel values. When the quantization step size is decreased, the effect of the filter is reduced, and when the quantization step size is very small, the filter is shut off. The filter can also be shut off explicitly or adjusted in overall strength by the encoder at the slice level. More details about the deblocking filter are described in [3].
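The threshold test described above can be sketched as follows. In H.264/AVC the thresholds alpha and beta are looked up from tables indexed by QP and the boundary strength bs is derived from the coding modes. Here all three are simply passed in as hypothetical values, and the function name is illustrative.

```python
def should_filter_edge(p1, p0, q0, q1, alpha, beta, bs):
    """Decide whether the samples across a block edge (p1 p0 | q0 q1) are filtered.

    A step across the edge smaller than alpha, with smooth content on both
    sides (differences below beta), is treated as a blocking artifact. A
    larger step is assumed to be a real edge in the picture and is kept.
    """
    if bs == 0:  # boundary strength 0 means the edge is not filtered at all
        return False
    return (abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)
```

Because the thresholds grow with QP, coarser quantization lets larger steps be treated as artifacts, while at very small QP almost no edge passes the test.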

2.7 Transform and Quantization

The difference between the actual and predicted data is called residual error data. Discrete

intra macroblocks, a transform for the 2x2 array of the chroma DC coefficients and a transform for all other 4x4 blocks in the residual data. In H.264/AVC, the transform is applied to 4x4 blocks, and instead of a 4x4 DCT, a separable integer transform with properties similar to those of a 4x4 DCT is used. The 4x4 integer transform is an approximation of the original floating-point DCT. Since the inverse transform is defined by exact integer operations, inverse transform mismatches are avoided. The 4x4 integer transform is designed to be so simple that it can be implemented using just a few additions, subtractions, and bit shifts. The transform matrix is given as:
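A minimal sketch of the 4x4 forward transform, assuming the core matrix Cf commonly cited for H.264/AVC, with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1) and (1, -2, 2, -1). It computes Y = Cf X Cf^T once by plain matrix multiplication and once with a butterfly that uses only additions, subtractions and shifts. The function names are hypothetical, and the scaling that is normally folded into quantization is omitted.

```python
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_matrix(x):
    """Reference version: Y = CF * X * CF^T."""
    return matmul(matmul(CF, x), transpose(CF))


def butterfly(v):
    """Multiply CF by a 4-vector using only adds, subtracts and shifts."""
    s03, d03 = v[0] + v[3], v[0] - v[3]
    s12, d12 = v[1] + v[2], v[1] - v[2]
    return [s03 + s12, (d03 << 1) + d12, s03 - s12, d03 - (d12 << 1)]


def forward_fast(x):
    """Same transform: butterfly over the columns, then over the rows."""
    cols = [butterfly(col) for col in zip(*x)]   # cols[j] = CF * column j of X
    return [butterfly(row) for row in transpose(cols)]
```

Both versions give identical results, and a constant block produces a single nonzero DC coefficient equal to 16 times the sample value.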

The basic transform coding process is very similar to that of previous standards: a forward transform, zig-zag scanning (shown in Figure 2-10), and scaling and rounding as the quantization process, followed by entropy coding. The flow is shown in Figure 2-11.

Figure 2-10 Zig-Zag Scan

Figure 2-11 Flow of transform and quantization [4]
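The zig-zag scan of a 4x4 block can be sketched as a walk along the anti-diagonals with alternating direction. The function names are hypothetical, and the order produced is the frame-mode zig-zag order, which is assumed here to be the one in Figure 2-10.

```python
def zigzag_order(n=4):
    """Return (row, column) positions of an n x n block in zig-zag order."""
    order = []
    for d in range(2 * n - 1):                  # d = row + column of one anti-diagonal
        rows = [r for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:                          # even diagonals run bottom-left to top-right
            rows.reverse()
        order.extend((r, d - r) for r in rows)
    return order


def zigzag_scan(block):
    """Flatten a square block of coefficients into a zig-zag ordered list."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

The scan places the low-frequency coefficients first, so the trailing part of the list tends to be a long run of zeros after quantization.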

The main function of quantization is to scale down the transformed coefficients and thereby reduce the amount of information to be coded. Because the human visual system is less sensitive to high-frequency image components, some video and image compression standards use larger scaling values for high-frequency data. H.264/AVC uses a scalar quantizer. The basic forward quantizer operation is as follows:

$$ Z_{ij} = \mathrm{round}(Y_{ij} / \mathrm{Qstep}) $$

where Yij is a coefficient of the transform described above, Qstep is the quantizer step size and Zij is the quantized coefficient. A total of 52 values of Qstep are supported by the standard, and these are indexed by a quantization parameter, QP. The values of Qstep corresponding to each QP are shown in Table 2-1. Note that Qstep doubles in size for every increment of 6 in QP. The wide range of quantizer step sizes makes it possible for an encoder to accurately and flexibly control the trade-off between bit rate and quality.

Table 2-1 Quantization step size in H.264 [3]

QP    | 0    | 1    | 2    | 3    | 4 | 5    | 6    | 7    | 8    | 9    | 10 | 11   | 12
Qstep | 0.63 | 0.69 | 0.81 | 0.88 | 1 | 1.13 | 1.25 | 1.38 | 1.63 | 1.75 | 2  | 2.25 | 2.5

QP    | 18 | 24 | 30 | 36 | 42 | 48  | 51
Qstep | 5  | 10 | 20 | 40 | 80 | 160 | 224
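The relationship between QP and Qstep in Table 2-1 can be sketched as below. The six base values are the unrounded step sizes for QP 0 to 5 (an assumption consistent with the rounded table entries), Qstep doubles for every increase of 6 in QP, and ties are rounded away from zero as one possible reading of round(). The function names are hypothetical.

```python
import math

# Unrounded step sizes for QP 0..5; Table 2-1 lists them to two decimals.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Quantizer step size for QP in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


def quantize(y, qp):
    """Z = round(Y / Qstep), rounding ties away from zero (assumed convention)."""
    magnitude = int(math.floor(abs(y) / qstep(qp) + 0.5))
    return magnitude if y >= 0 else -magnitude
```

For example, a coefficient of 100 quantizes to 10 at QP 24 (Qstep 10) but only to 5 at QP 30 (Qstep 20), which illustrates how QP controls the trade-off between bit rate and quality.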

2.8 Entropy Coding

The entropy encoder is responsible for converting the syntax elements into a bit stream, from which the entropy decoder can recover the syntax elements. The H.264/AVC standard defines two entropy coding methods: Context-Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). For the baseline profile, only CAVLC is employed. For the main profile, both CAVLC and CABAC must be supported.

Chapter 3 Scalable Extension of H.264

Motion pictures are transmitted over variable-bandwidth channels, in both wireless and cable networks. They have to be stored on media of different capacities and displayed on a variety of devices, ranging from small mobile terminals to high-resolution video projection systems. Scalable video coding schemes are intended to encode the signal once at the highest resolution, but enable decoding from partial streams at the specific rate and resolution required by a certain application. This scheme provides a simple and flexible solution for transmission over heterogeneous networks, additionally providing adaptability to bandwidth variations and error concealment. An example application is shown in Figure 3-1.

The scalable extension of H.264/AVC was chosen as the starting point of the MPEG Scalable Video Coding (SVC) standardization project in October 2004. In January 2005, MPEG and the Video Coding Experts Group (VCEG) of the ITU-T agreed to jointly finalize the SVC project as an amendment of their H.264/AVC standard. The working draft provides a specification of the bit-stream syntax and the decoding process. The reference encoding process is described in the Joint Scalable Video Model (JSVM). Both can be downloaded from the web site [5]. The new standard is based on the architecture of H.264 [2] and provides three types of scalability: temporal, spatial and SNR. More details about the scalable extension of H.264/AVC can be found in [6] [7].


Figure 3-1 Example of Scalable Video Coding

3.1 The Architecture of Scalable Extension of H.264

The overall structure of the encoder for the scalable extension of H.264 is shown in Figure 3-2. It encodes the video into multiple spatial, temporal and SNR layers for combined scalability.

Spatial scalability is realized by a layered approach. When a frame is compressed, a separate coding layer is used for each frame resolution. The base layer contains the lowest spatial resolution version of each coded frame. The enhancement layers have higher resolutions and can be predicted from the base layer pictures and from previously encoded enhancement layer pictures. The enhancement layer information that can be predicted from the base layer includes the motion vectors, the intra texture and the residual. Constrained inter-layer prediction is used to reduce decoder complexity. Within the same spatial resolution, temporal scalability means a change of frame rate. The temporal scalability is to extend the hybrid video coding approach of H.264/AVC towards motion compensated temporal

techniques with the hierarchical B-frames of H.264/AVC. We can use MCTF to achieve frame-rate scalability. In addition, SNR scalability can be achieved by residual quantization with very few changes to H.264/AVC. This method is similar to the FGS bit-plane coding of MPEG-4 and achieves quality scalability. SNR scalability includes two aspects: Fine Granularity Scalability (FGS) and Coarse Granularity Scalability (CGS).

3.2 Temporal Scalability

Temporal scalability, that is, reduction of the video frame rate, is often used in practice. It is a common approach in cases where insufficient transmission capacity is available. MCTF is a main feature of spatiotemporal wavelet filtering techniques.

3.2.1 MCTF

Motion Compensated Temporal Filtering (MCTF) is based on the lifting scheme. The lifting scheme has two main advantages: it provides an efficient way to compute the wavelet transform, and it ensures perfect reconstruction of the input in the absence of
