Entropy Coding - H.264 Video Coding - Acceleration of an H.264 Encoder and Its Scalable Extension Decoder, and Implementation on a TI DSP System Platform

Chapter 2 H.264 Video Coding

2.8 Entropy Coding

The entropy encoder converts the syntax elements into a bit stream, from which the entropy decoder recovers the syntax elements. The H.264/AVC standard defines two entropy coding methods: Context-Adaptive Variable Length Coding (CAVLC) and Context-Based Adaptive Binary Arithmetic Coding (CABAC). The baseline profile employs only CAVLC. The main profile must support both CAVLC and CABAC.

Chapter 3 Scalable Extension of H.264

Motion pictures are transmitted over variable-bandwidth channels in both wireless and cable networks. They have to be stored on media of different capacities and displayed on a variety of devices, ranging from small mobile terminals to high-resolution video projection systems. Scalable video coding schemes encode the signal once at the highest resolution, but enable decoding from partial streams at the specific rate and resolution required by a given application. This provides a simple and flexible solution for transmission over heterogeneous networks, and it additionally provides adaptability to bandwidth variations and support for error concealment. An example application is shown in Figure 3-1.

The scalable extension of H.264/AVC was chosen as the starting point of the MPEG Scalable Video Coding (SVC) standardization project in October 2004. In January 2005, MPEG and the Video Coding Experts Group (VCEG) of the ITU-T agreed to jointly finalize the SVC project as an amendment of their H.264/AVC standard. The working draft specifies the bit-stream syntax and the decoding process. The reference encoding process is described in the Joint Scalable Video Model (JSVM). Both can be downloaded from the web site [5]. The new standard is based on the architecture of H.264 [2] and provides three types of scalability: temporal, spatial, and SNR. More details about the scalable extension of H.264/AVC can be found in [6] [7].

Figure 3-1 Example of Scalable Video Coding

3.1 The Architecture of Scalable Extension of H.264

The overall structure of the encoder for the scalable extension of H.264 is shown in Figure 3-2. It encodes the video into multiple spatial, temporal, and SNR layers for combined scalability.

Spatial scalability is realized by a layered approach. When we compress a frame, we use a separate coding layer for each frame resolution. The base layer contains the lowest spatial resolution version of each coded frame. The enhancement layers have higher resolutions and can be predicted from the base layer pictures and from previously encoded enhancement layer pictures. The enhancement-layer information that can be predicted from the base layer includes the motion vectors, the intra texture, and the residual. Constrained inter-layer prediction is used to reduce decoder complexity. Within one spatial resolution, temporal scalability means a change of frame rate. Temporal scalability extends the hybrid video coding approach of H.264/AVC towards motion-compensated temporal filtering (MCTF) techniques with the hierarchical-B frames of H.264/AVC. We can use MCTF to achieve frame rate scalability. In addition, SNR scalability can be achieved by residual quantization with very few changes to H.264/AVC. This method is similar to the FGS bit-plane coding of MPEG-4 and achieves quality scalability. SNR scalability has two aspects: Fine Granularity Scalability (FGS) and Coarse Granularity Scalability (CGS).

3.2 Temporal Scalability

Temporal scalability is often used in practice in the form of a reduction of the video frame rate. It is a common approach when the available transmission capacity is insufficient. MCTF is a key component of spatiotemporal wavelet filtering techniques.

3.2.1 MCTF

Motion Compensated Temporal Filtering (MCTF) is based on the lifting scheme. The lifting scheme has two main advantages: it provides an efficient way to compute the wavelet transform, and it ensures perfect reconstruction of the input in the absence of quantization of the wavelet coefficients. The generic lifting scheme consists of three steps: polyphase decomposition, prediction, and update. Figure 3-3 shows a two-channel filter bank with “P” representing the prediction step and “U” representing the update step.

Figure 3-3 Lifting representation of an analysis-synthesis filter bank [8]

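At the analysis side (a), the odd samples s[2k+1] of a given signal s are predicted by a linear combination of the even samples s[2k] using a prediction operator P(s[2k]), and a high-pass signal h[k] is formed by the prediction residuals. A corresponding low-pass signal l[k] is obtained by adding a linear combination of the prediction residuals h[k] to the even samples s[2k] using an update operator U(h[k]):

$$h[k] = s[2k+1] - P(s[2k]), \qquad P(s[2k]) = \sum_i p_i \, s[2(k+i)]$$

$$l[k] = s[2k] + U(h[k]), \qquad U(h[k]) = \sum_i u_i \, h[k+i]$$

The choice of the operators P and U determines the wavelet transform. For the Haar wavelet, the prediction and the update operators are given by

$$P_{Haar}(s[x,2k]) = s[x,2k]$$

$$U_{Haar}(h[x,k]) = \tfrac{1}{2} \, h[x,k]$$

For the 5/3 transform, the prediction and the update operators are given by

$$P_{5/3}(s[x,2k]) = \tfrac{1}{2} \left( s[x,2k] + s[x,2k+2] \right)$$

$$U_{5/3}(h[x,k]) = \tfrac{1}{4} \left( h[x,k] + h[x,k-1] \right)$$

where s[x,k] denotes a video sample at spatial location x and temporal coordinate k in scalable video coding.

The extension to motion-compensated temporal filtering is realized by modifying the prediction and the update operators as follows

$$P_{Haar}(s[x,2k]) = s[x + m_{P0}, 2k - 2r_{P0}]$$

$$U_{Haar}(h[x,k]) = \tfrac{1}{2} \, h[x + m_{U0}, k + r_{U0}]$$

$$P_{5/3}(s[x,2k]) = \tfrac{1}{2} \left( s[x + m_{P0}, 2k - 2r_{P0}] + s[x + m_{P1}, 2k + 2 + 2r_{P1}] \right)$$

$$U_{5/3}(h[x,k]) = \tfrac{1}{4} \left( h[x + m_{U0}, k + r_{U0}] + h[x + m_{U1}, k - 1 - r_{U1}] \right)$$

where m denotes a motion vector and r >= 0 a reference index for the prediction (P) and update (U) steps. With m = 0 and r = 0 these operators reduce to the operators without motion compensation given above.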
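The lifting steps for the Haar case without motion compensation can be sketched in a few lines of Python. This is a minimal illustration, not code from the JSVM: the function names `haar_analysis` and `haar_synthesis` are hypothetical, and the signal is a plain list of numbers rather than a sequence of pictures. It shows why lifting gives perfect reconstruction when nothing is quantized: the synthesis side simply undoes the update and prediction steps in reverse order.

```python
def haar_analysis(s):
    """Split s into low-pass l[k] and high-pass h[k] with Haar lifting."""
    if len(s) % 2:
        raise ValueError("signal length must be even")
    even, odd = s[0::2], s[1::2]                 # polyphase decomposition
    h = [o - e for o, e in zip(odd, even)]       # prediction: h[k] = s[2k+1] - s[2k]
    l = [e + hk / 2 for e, hk in zip(even, h)]   # update: l[k] = s[2k] + h[k] / 2
    return l, h


def haar_synthesis(l, h):
    """Invert haar_analysis by undoing the lifting steps in reverse order."""
    even = [lk - hk / 2 for lk, hk in zip(l, h)]  # undo the update step
    odd = [hk + e for hk, e in zip(h, even)]      # undo the prediction step
    s = [0] * (2 * len(l))
    s[0::2], s[1::2] = even, odd                  # merge the two phases
    return s


if __name__ == "__main__":
    low, high = haar_analysis([10, 12, 8, 8, 3, 7])
    print(low, high)
    print(haar_synthesis(low, high))
```

The low-pass output is the average of each sample pair, which is why it can serve as a half-frame-rate version of the input.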

3.2.2 Scalability Dimensions

The temporal coding structure of MCTF differs from that of hybrid video coding in that the prediction step produces high-pass pictures H^k and, in addition, the update step produces low-pass pictures L^k. Figure 3-4 shows an example of the temporal decomposition of a group of 8 pictures (GOP = 8) using 3 decomposition stages. This structure provides a non-dyadic decomposition in the contrast layer. If only the level 3 pictures obtained after the third decomposition stage are transmitted, the picture sequence that can be reconstructed at the decoder side has 1/4 of the temporal resolution of the input sequence. By additionally transmitting the next level (level 2) pictures, the decoder can reconstruct an approximation of the picture sequence that has 1/2 of the temporal resolution of the input sequence. Finally, if the highest level (level 1) pictures are also transmitted, a reconstructed version of the original input sequence with the full temporal resolution is obtained.

The temporal coding structure of MCTF is an open-loop structure. With MCTF, the encoder can provide better prediction. However, it may cause a mismatch between encoder and decoder in the presence of quantization error, and the update step increases the complexity and the memory requirement. To avoid the complexity of the update step, temporal scalability uses a closed-loop structure known as “hierarchical-B”. The hierarchical architecture of temporal scalability is shown in Figure 3-5, which gives an example of the prediction structure for a group of eight pictures. The first picture of a video sequence is intra-coded as the instantaneous decoder refresh (IDR) picture, which is one kind of key picture. A key picture is independent of all pictures other than key pictures, and the key pictures generally represent the minimal temporal resolution that can be decoded. A key picture is either intra-coded or inter-coded. Once the key pictures are decoded, the picture B¹ is predicted by using the surrounding key pictures A as references. It depends only on the key pictures and, together with the key pictures, represents the next higher temporal resolution. The pictures B² of the next temporal level are predicted by using only pictures of lower temporal levels as references, and so on. This hierarchical prediction structure clearly provides temporal scalability inherently. The main idea is similar to that of the B frames of H.264/AVC.
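The hierarchical-B structure for a group of eight pictures can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the GOP size is a power of two, picture indices are in display order, key pictures sit at multiples of the GOP size, and the level numbering (0 for key pictures, 1 for B¹, 2 for B², and so on) is this sketch's own convention. The names `temporal_level` and `decodable_pictures` are hypothetical.

```python
def temporal_level(index, gop_size=8):
    """Temporal level of a picture in a hierarchical-B GOP (0 = key picture)."""
    if gop_size <= 0 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    stages = gop_size.bit_length() - 1   # log2(gop_size)
    pos = index % gop_size
    if pos == 0:
        return 0                         # key picture
    level = stages
    while pos % 2 == 0:                  # each factor of two moves one level down
        pos //= 2
        level -= 1
    return level


def decodable_pictures(num_pictures, max_level, gop_size=8):
    """Indices of the pictures that remain when levels above max_level are dropped."""
    return [i for i in range(num_pictures)
            if temporal_level(i, gop_size) <= max_level]


if __name__ == "__main__":
    for level in range(4):
        print(level, decodable_pictures(9, level))
```

Dropping the highest remaining level halves the frame rate each time, because every picture references only pictures of lower levels.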

3.3 Spatial Scalability

For spatial scalability, we use an over-sampled pyramid structure to represent multiple resolutions (e.g., QCIF, CIF, and 4CIF). Each spatial resolution can be coded independently of the others, but the information in a higher spatial layer is correlated with the information in the lower spatial layers, so we can code the higher spatial layer efficiently by predicting it from the lower spatial layers. The following techniques turned out to provide gains and are described below:

1. Prediction of a macroblock using the up-sampled lower resolution signal

2. Prediction of motion vectors using the up-sampled lower resolution motion vectors

3. Prediction of the residual signal using the up-sampled residual signal of the lower resolution layer

In inter-layer prediction, motion prediction removes the redundancy of motion information among layers, including the macroblock partitioning, the reference picture indices, and the motion vectors. The macroblock partitioning is obtained by up-sampling the partitioning of the corresponding 8x8 block of the lower resolution layer. For the resulting macroblock partitions, the corresponding sub-macroblock partition of the base layer block is used, as shown in Figure 3-6. The motion vector is scaled by a factor of 2. For the motion information, we introduce two additional modes. In the first mode (Base_layer_mode), no additional motion information is coded. In the second mode (Qpel_refinement_mode), a quarter-sample motion refinement (-1, 0, or +1 for each motion vector component) is transmitted for each motion vector.
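The two motion modes can be illustrated with a small Python sketch. It assumes that motion vectors are stored as (x, y) pairs in quarter-sample units and that the enhancement layer has twice the resolution of the base layer in each dimension; the function name `upsample_mv` is hypothetical and not taken from the JSVM.

```python
def upsample_mv(base_mv, refinement=(0, 0)):
    """Derive an enhancement-layer motion vector from a base-layer one.

    base_mv    -- (x, y) base-layer motion vector in quarter-sample units
    refinement -- per-component refinement in {-1, 0, +1};
                  (0, 0) corresponds to Base_layer_mode, any transmitted
                  value corresponds to Qpel_refinement_mode
    """
    if any(r not in (-1, 0, 1) for r in refinement):
        raise ValueError("each refinement component must be -1, 0, or +1")
    # Scale by 2 for the doubled resolution, then add the refinement.
    return tuple(2 * c + r for c, r in zip(base_mv, refinement))


if __name__ == "__main__":
    print(upsample_mv((3, -5)))            # Base_layer_mode
    print(upsample_mv((3, -5), (1, -1)))   # Qpel_refinement_mode
```

Scaling by 2 alone can only produce even vector components, so the refinement is what lets the enhancement layer reach odd quarter-sample positions.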

Intra texture prediction uses the reconstructed image of the reference layer to predict an enhancement layer. For intra texture prediction, we use the “Intra_BL” mode. The “Intra_BL” mode is allowed only for macroblocks for which the corresponding 8x8 block of the base layer is located inside an intra-coded macroblock, as illustrated in Figure 3-7. In this mode, the prediction signal is obtained directly by de-blocking and up-sampling the corresponding 8x8 luma block of the base layer.

For residual prediction, the residuals of consecutive layers may be correlated. The residual information coded in the lower resolution layer is up-sampled using a bi-linear filter with constant border extension.

Figure 3-6 Up-sampling of motion data [8]

Figure 3-7 Up-sampling of intra texture [8]

3.4 SNR Scalability

For each macroblock, the coded block pattern (CBP) and, conditioned on the CBP, the corresponding residual blocks are transmitted together with the macroblock modes, intra prediction modes, reference picture indices, and motion vectors using the B or P slice syntax of H.264/AVC. To code an SNR enhancement layer, the quantization error between the SNR base layer and the original sub-band pictures is re-quantized using exactly the same methods as for the base layer, but with a finer quantization step size. SNR scalability has two aspects: Coarse Granularity Scalability (CGS) and Fine Granularity Scalability (FGS).
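The idea of re-quantizing the base-layer error with a finer step size can be sketched as follows. This is a simplified illustration that assumes a plain uniform scalar quantizer on a single coefficient; it is not the H.264/AVC quantizer, and the names `quantize`, `reconstruct`, and `snr_layers` as well as the step sizes are hypothetical.

```python
import math


def quantize(value, step):
    """Uniform scalar quantizer: index of the nearest multiple of step."""
    return math.floor(value / step + 0.5)


def reconstruct(level, step):
    """Inverse of quantize: map a quantization index back to a value."""
    return level * step


def snr_layers(coeff, base_step, enh_step):
    """Return (base reconstruction, base + enhancement reconstruction)."""
    base_rec = reconstruct(quantize(coeff, base_step), base_step)
    # The enhancement layer re-quantizes the base-layer quantization error
    # with a finer step size.
    error = coeff - base_rec
    enh_rec = base_rec + reconstruct(quantize(error, enh_step), enh_step)
    return base_rec, enh_rec


if __name__ == "__main__":
    print(snr_layers(37, base_step=16, enh_step=4))
```

A decoder that receives only the base layer uses the first value; one that also receives the enhancement layer uses the second, which lies closer to the original coefficient.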

3.4.1 CGS

The mechanism of CGS is similar to that of spatial scalability. CGS can also be realized by layered coding. Each CGS layer has its own motion information and temporal prediction. The enhancement layer is coded on top of the SNR base layer.

3.4.2 FGS

To support fine granularity scalability (FGS), we introduce so-called progressive refinement slices. The algorithm encodes the coefficients in an order such that “more significant” coefficients are coded first. With the bit stream arranged in this way, we can truncate it at an arbitrary point and still retain the “more significant” coefficients, so that the quality of the SNR base layer can be improved in a fine-granular way. The progressive refinement in FGS uses cyclical block coding. The coefficients are scanned in zig-zag order as shown in Figure 2-10, and the current scan position in a given coding pass differs from one block to another. When a coding pass in one block is finished, we move to the next block and perform a coding pass there. In general, the progressive refinement encodes the DC coefficient of every block in the same cycle first and encodes the other coefficients in the following cycles. This is shown in Figure 3-9. Progressive refinement slices with cyclical block coding improve the quality of all blocks evenly and thereby support fine-granular quality scalability.

[Figure 3-9: panels labeled Block 0, Block 1, Block 2, Block 3]

In an FGS layer, a block is coded using two passes: the significant pass and the refinement pass. The significant pass encodes the coefficients that became significant in the enhancement layer. The refinement pass encodes the coefficients for which a nonzero value has already been coded in a previous coding pass. In each cycle of the significant pass, the coefficients of a block are scanned in zig-zag order, and all zero values are coded up to and including the first nonzero value. Then the next block is processed. Each coding cycle in a block includes an End-of-Block (EOB) symbol, a run index, and a nonzero quantization index. In the refinement pass, the refinement values are coded after all significant values have been coded for all blocks. Figure 3-10 shows an example of a slice that consists of four blocks having sixteen coefficients each.

Block 0:  0 0 0 A 0 B 0 1 0 1 C D E 0 0 0
Block 1:  F 0 1 0 1 0 0 0 0 0 0 0 0 G 0 0
Block 2:  1 1 1 1 0 0 0 H I 0 0 J 0 0 K 0
Block 3:  0 L 0 M 0 1 1 0 0 0 0 0 0 0 0 0

Figure 3-10 Example of significant and refinement pass [11]

Initially, in the significant pass, we encode the values up to the first nonzero value of each block. The values coded for each block in each cycle are as follows:

Cycle 0 = 0 :{ 0 0 0 0 0 1 } 1 :{ 0 1 } 2 :{ 1 } 3 :{ 0 0 0 1 }
Cycle 1 = 0 :{ 0 1 } 1 :{ 0 1 } 2 :{ 1 } 3 :{ 1 }
Cycle 2 = 0 :{ EOB } 1 :{ EOB } 2 :{ 1 } 3 :{ EOB }
Cycle 3 = 2 :{ 1 }
Cycle 4 = 2 :{ EOB }

The EOB symbol indicates that the last significant coefficient of the block has been coded.

In the refinement pass, each cycle can encode only one refinement value for each block. The final coded values can therefore be presented as follows:

Cycle 0 = 0 :{0 0 0 0 0 1} 1 :{0 1} 2 :{1} 3 :{0 0 0 1}
Cycle 1 = 0 :{0 1} 1 :{0 1} 2 :{1} 3 :{1}
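The significant-pass cycles for the blocks of Figure 3-10 can be reproduced with a short Python sketch. It rests on an interpretation of the figure that the text does not state explicitly: letters are taken to be refinement coefficients (already nonzero in the base layer, skipped in the significant pass), "1" a newly significant coefficient, and "0" a zero. In this sketch every block, including block 0, emits EOB once no newly significant coefficient remains. The name `significance_cycles` is hypothetical.

```python
BLOCKS = [
    "0 0 0 A 0 B 0 1 0 1 C D E 0 0 0".split(),
    "F 0 1 0 1 0 0 0 0 0 0 0 0 G 0 0".split(),
    "1 1 1 1 0 0 0 H I 0 0 J 0 0 K 0".split(),
    "0 L 0 M 0 1 1 0 0 0 0 0 0 0 0 0".split(),
]


def significance_cycles(blocks):
    """Return a list of cycles; each cycle maps block index to coded symbols."""
    pos = [0] * len(blocks)          # current scan position per block
    done = [False] * len(blocks)     # True once a block has emitted EOB
    cycles = []
    while not all(done):
        cycle = {}
        for b, coeffs in enumerate(blocks):
            if done[b]:
                continue
            if "1" not in coeffs[pos[b]:]:
                cycle[b] = ["EOB"]   # nothing newly significant is left
                done[b] = True
                continue
            run = []
            while True:
                c = coeffs[pos[b]]
                pos[b] += 1
                if c.isalpha():
                    continue         # refinement coefficient: not coded here
                run.append(int(c))
                if c == "1":
                    break            # stop after the first nonzero value
            cycle[b] = run
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    for n, cycle in enumerate(significance_cycles(BLOCKS)):
        print("Cycle", n, cycle)
```

Because each block advances by only one nonzero value per cycle, truncating the stream after any cycle leaves all blocks refined to a similar degree.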

3.5 Combined Scalability

Figure 3-11 shows an example of the combination of spatial, temporal, and quality scalability. Within one resolution layer, we can use MCTF to achieve temporal scalability. Across resolution layers, we can use inter-layer prediction to code pictures of different resolutions. In addition, we can adjust the quantization in every layer for quality scalability. Together these provide a wide range of temporal, SNR, and spatial scalability.

Chapter 4 DSP Implementation Environment

As discussed previously, our project involves an implementation on digital signal processors (DSPs). In this chapter, we briefly describe the DSP platform environment. Our DSP board is the Sundance module SMT395. Its main chips are the TMS320C6416T DSP made by Texas Instruments and the Xilinx Virtex II Pro FPGA. We introduce the DSP chip and the DSP board. In addition, we introduce the software development tool, Code Composer Studio (CCS), and the code development environment for TI DSPs.

4.1 The DSP Board

The DSP board used in our implementation is the Sundance module SMT395 shown in Figure 4-1. The SMT395 uses the 1GHz 64-bit TMS320C6416T DSP, which is manufactured in 90nm wafer technology and offers high fixed-point processing power. The SMT395 is supported by TI Code Composer Studio and 3L_Diamond_RTOS, which enable full multi-DSP systems with minimal effort by the programmer. It provides a flexible platform for next-generation telecom systems, image processing applications, medical equipment, and industrial solutions. Some specifications of the SMT395 module are listed below [12].

Xilinx Virtex II Pro FPGA. XC2VP30-6 in FF896 package.

256Mbytes of SDRAM @ 133MHz

Two Sundance High-speed Bus (50MHz, 100MHz or 200MHz) ports, 32 bits wide

Eight 2.5Gbit/sec Rocket Serial Links (RSL) for Inter Module communications

8Mbytes FLASH ROM for configuration/booting

JTAG Diagnostics Port

Figure 4-1 SMT395 module

4.2 The TMS320C6416T DSP Chip

wireless infrastructure applications. The functional block and DSP core diagram of TMS320C64x series is shown in Figure 4-2.

Figure 4-2 Block diagram of the TMS320C64x DSPs [13]

The C6000 core CPU consists of 64 general-purpose 32-bit registers and eight functional units.

Features of C6000 devices include [14]:

Executes up to eight instructions per cycle.
Allows designers to develop highly effective RISC-like code for fast development time.
Instruction packing:
    Gives code size equivalence for eight instructions executed serially or in parallel.
    Reduces code size, program fetches, and power consumption.
Conditional execution of all instructions:
    Reduces costly branching.
    Increases parallelism for higher sustained performance.
Efficient code execution on independent functional units:
    Efficient C compiler on DSP benchmark suite.
    Assembly optimizer for fast development and improved parallelization.
8/16/32-bit data support, providing efficient memory support for a variety of applications.
32 x 32-bit integer multiply with 32- or 64-bit result.

The C64x extensions add enhancements to the C6000 architecture, which include:

Quad 8-bit and dual 16-bit extensions for data flow.

Additional functional unit hardware.

Increased orthogonality of the instruction set.

4.2.1 Central Processing Unit of C64x

The C64x DSP core contains 64 32-bit general-purpose registers, a program fetch unit, an instruction decode unit, two data paths, and control registers. Each data path has four functional units and one register file. The four functional units perform four kinds of operations. The first unit is for multiply operations (.M). The second unit is for arithmetic and logic operations (.L). The third unit is for branches, byte shifts, and arithmetic operations (.S). The last unit is for linear and circular address calculation used to load from and store to external memory (.D). The details of the functional units are given in Table 4-1.

Each register file consists of 32 32-bit registers. Each of the four functional units reads and writes directly within its own data path. That is, the functional units .L1, .S1, .M1, and .D1 can write only to register file A, and the same holds for the units of data path B and register file B. However, two cross paths (1X and 2X) allow functional units of one data path to access a 32-bit operand from the register file on the opposite side. Cross path 1X allows the functional units of data path A to read a source operand from register file B. Cross path 2X allows the functional units of data path B to read a source operand from register file A. In the C64x, the CPU pipelines data-cross-path accesses over multiple clock cycles. This allows the same register to be used as a data-cross-path operand by multiple functional units in the same execute packet. The C64x CPU is described in detail in [13].

Table 4-1 Functional units and operations performed [15]

Functional unit      Operations

.L unit (.L1, .L2)   32/40-bit arithmetic and compare operations
                     32-bit logical operations
                     Leftmost 1 or 0 counting for 32 bits
                     Normalization count for 32 and 40 bits
                     Byte shifts
                     Data packing/unpacking
                     5-bit constant generation
                     Dual 16-bit and Quad 8-bit arithmetic operations

.S unit (.S1, .S2)   32-bit arithmetic operations
                     32/40-bit shifts and 32-bit bit-field operations
                     32-bit logical operations
                     Branches
                     Constant generation
                     Data packing/unpacking
                     Dual 16-bit and Quad 8-bit compare operations
                     Dual 16-bit and Quad 8-bit saturated arithmetic operations

.M unit (.M1, .M2)   16 x 16 multiply operations
                     16 x 32 multiply operations
                     Dual 16 x 16 and Quad 8 x 8 multiply operations
                     Dual 16 x 16 multiply with add/subtract operations
                     Quad 8 x 8 multiply with add operations
                     Bit expansion
                     Bit interleaving/de-interleaving
                     Variable shift operations
                     Rotation
                     Galois Field Multiply

.D unit (.D1, .D2)   32-bit add, subtract, linear and circular address calculation
                     Loads and stores with 5-bit constant offset
                     Loads and stores with 15-bit constant offset (.D2 only)

4.2.2 Memory Architecture and Peripherals

The C64x DSP has a two-level cache-based architecture. The level 1 cache is separated into program and data spaces. The level 1 program cache (L1P) is a 16-Kbyte direct-mapped cache, and the level 1 data cache (L1D) is a 16-Kbyte 2-way set-associative cache. The level 2 (L2) memory consists of a 1024-Kbyte memory space that is shared between cache (up to 256 Kbytes) and unified mapped memory.
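The difference between the direct-mapped L1P and the 2-way set-associative L1D can be illustrated by computing which cache set an address falls into. The sketch below is a generic cache model, not TI code. The cache sizes and associativities come from the text; the line sizes (32 bytes for L1P, 64 bytes for L1D) are assumptions of this sketch and should be checked against the device documentation. The name `cache_set_index` is hypothetical.

```python
def cache_set_index(address, cache_bytes, line_bytes, ways):
    """Set index of a byte address in a set-associative cache."""
    num_sets = cache_bytes // (line_bytes * ways)
    return (address // line_bytes) % num_sets


# Sizes and associativity from the text; line sizes are assumed values.
L1P = dict(cache_bytes=16 * 1024, line_bytes=32, ways=1)   # direct mapped
L1D = dict(cache_bytes=16 * 1024, line_bytes=64, ways=2)   # 2-way set associative

if __name__ == "__main__":
    # Two code addresses 16 Kbytes apart compete for the single line of one
    # L1P set, so they evict each other.
    print(cache_set_index(0x0000, **L1P), cache_set_index(0x4000, **L1P))
    # Two data addresses 8 Kbytes apart share an L1D set, but the set has
    # two ways, so both lines can stay resident.
    print(cache_set_index(0x0000, **L1D), cache_set_index(0x2000, **L1D))
```

This kind of calculation is useful when placing frequently used code and data buffers so that they do not map to the same set.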

The EMIF provides the interface between the DSP core and external memory, such as synchronous-burst SRAM (SBSRAM), synchronous DRAM (SDRAM), FIFOs, and asynchronous memories (SRAM and EPROM). The EMIF provides a 64-bit-wide (EMIFA) and a 16-bit-wide (EMIFB) memory interface.

The C64x contains peripherals such as the enhanced direct-memory-access (EDMA) controller, a host-port interface (HPI), PCI, three multi-channel buffered serial ports (McBSPs), three 32-bit general-purpose timers, and sixteen general-purpose I/O pins. The EDMA controller handles all data transfers between the level 2 (L2) cache/memory and the device peripherals, and it has 64 independent channels. The HPI is a 32-/16-bit-wide parallel port through which a host processor can directly access the CPU's memory space. The PCI port supports connection of the DSP to a PCI host via the integrated PCI master/slave bus interface.

4.3 TI DSP Code Development Environment

TI provides DSP users with a GUI development environment for developing and debugging their projects: Code Composer Studio (CCS). In this section, we give a brief introduction to this development environment. The tutorial [16] introduces the key features of CCS. A DSP user needs to be familiar with the development tool in order to build projects on the DSP platform efficiently.

CCS offers real-time analysis capabilities and supports all phases of the development cycle shown in Figure 4-3.

Figure 4-3 Development cycle [16]

The CCS has the following components which work together as shown in Figure 4-4:

TMS320C6000 code generation tools

Code Composer Studio Integrated Development Environment (IDE)

DSP/BIOS plug-ins and API

RTDX plug-in, host interface, and API
