
Chapter 1 INTRODUCTION

1.2 Thesis Organization

This thesis is organized into six chapters. Chapter 1 gives the introduction and motivation of this work. Chapter 2 provides a brief overview of the H.264/AVC standard. Chapter 3 presents a data-mapping-aware memory controller design. Chapter 4 describes the proposed entropy decoder, including a UVLC decoder and a CAVLC decoder with multi-symbol decoding. The implementation results of the memory controller and entropy decoder are shown in Chapter 5. Finally, concluding remarks are given in Chapter 6.

Chapter 2

OVERVIEW OF H.264/AVC STANDARD

H.264 has been developed jointly by the ITU-T VCEG and ISO/IEC MPEG. Its compression efficiency is about four and two times higher than that of earlier video standards such as MPEG-2 and MPEG-4, respectively. H.264/AVC therefore has a great advantage in transmitting more multimedia data over limited channel resources. This gain comes from the many sophisticated and computationally intensive coding tools that H.264/AVC adopts; with their help, H.264/AVC enhances coding efficiency while still maintaining video quality.

2.1 Overview

The basic structure of H.264 is the so-called block-based motion-compensated transform coder. Applications of H.264 include terrestrial and satellite broadcast, storage on optical and magnetic devices such as DVD, wireless and mobile multimedia messaging services, etc. To support this need for flexibility and customizability, H.264 comprises a video coding layer (VCL), which efficiently codes the video content, and a network abstraction layer (NAL), which formats the VCL data and provides header information in a manner appropriate for conveyance over a variety of transport layers or storage media.

Relative to previous standards like MPEG-2, H.264/AVC has the following advanced features to improve coding efficiency and video quality:

‧ Variable block-size motion compensation
‧ Quarter-sample-accurate motion compensation
‧ Multiple reference picture motion compensation
‧ In-loop deblocking filtering
‧ Small block-size transform
‧ Exact-match inverse transform
‧ Arithmetic entropy coding
‧ Context-adaptive entropy coding

Fig.1 Basic coding structure for H.264/AVC

A coded video sequence in H.264/AVC consists of several coded pictures. A coded picture can represent either an interlaced field or a non-interlaced frame.

The primary unit for video coding in H.264 is the macroblock. To take advantage of human visual system characteristics, the chrominance information is down-sampled to the 4:2:0 format, so one macroblock consists of a 16x16 luminance (Y) component and two 8x8 chrominance (Cb, Cr) components. Fig.1 illustrates the basic coding flow for one macroblock; the detailed coding flow can be found in [1].

2.2 Video coding tools

2.2.1 Intra frame prediction

In contrast to previous video coding standards such as H.263 and MPEG-4, where intra prediction is performed in the transform domain, in H.264/AVC it is always conducted in the spatial domain. By referring to neighboring samples of already coded blocks to the left of and/or above the current block, most of the energy in the block can be removed in the intra prediction process. Intra prediction also enhances the compression performance of the small block-size transform. For the luma component there are two kinds of intra prediction: nine 4x4 prediction modes and four 16x16 prediction modes.

When using 4x4 intra prediction, each 4x4 block is predicted from spatially neighboring samples as shown in Fig. 2(a). For each block, one of nine modes can be chosen. In addition to “DC” prediction (where one value is used to predict the entire 4x4 block), eight directional prediction modes are specified as illustrated in Fig. 2(b).

Fig. 2 (a) Intra_4x4 prediction is conducted for samples a–p of a block using samples A–Q. (b) Eight “prediction directions” for Intra_4x4 prediction.

For 16x16 prediction modes, the whole luma component of a macroblock is predicted. Four prediction modes are supported. They are vertical prediction, horizontal prediction, DC prediction and plane prediction.
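As a concrete illustration, two of the 4x4 intra modes can be sketched in a few lines. This is a simplified sketch, not the normative sample derivation (which also handles unavailable neighbors and the remaining directional modes); the function names are our own:

```python
import numpy as np

def intra4x4_vertical(above):
    """Vertical mode: every row repeats the four reconstructed samples above."""
    return np.tile(np.asarray(above, dtype=int), (4, 1))

def intra4x4_dc(above, left):
    """DC mode: every sample is the rounded mean of the eight neighbours."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return np.full((4, 4), dc, dtype=int)

above = [10, 12, 14, 16]   # reconstructed samples above the block
left = [11, 11, 11, 11]    # reconstructed samples to the left
print(intra4x4_vertical(above))   # each row is [10, 12, 14, 16]
print(intra4x4_dc(above, left))   # every sample is 12
```

The encoder would subtract the chosen prediction from the original block and code only the residual, which is why a good directional match removes most of the block's energy.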

The chroma components are predicted with a technique similar to the 16x16 luma prediction, since chroma components are usually smooth over large areas.

Intra prediction and all other forms of prediction are never applied across slice boundaries, in order to keep all slices independent of each other.

2.2.2 Inter frame prediction

The inter prediction in H.264/AVC is a block-matching-based motion estimation and compensation technique that efficiently removes redundant inter-frame information. Each inter macroblock corresponds to a specific partitioning into blocks used for motion compensation. For the luma component, partitions of 16x16, 8x16, 16x8 and 8x8 are supported by the syntax. When the 8x8 partition is chosen, additional syntax is transmitted to specify whether each 8x8 partition is further divided into 8x4, 4x8 or 4x4 blocks. Fig.3 illustrates these partitions.

Fig.3 Inter macroblock partitions

The prediction for each MxN block is obtained by displacing an area of the corresponding reference frame, determined by the motion vector and the reference index. H.264/AVC supports quarter-pixel-accurate motion compensation.

The sub-pixel prediction samples are obtained by interpolating the integer-position samples. For half-pixel positions, the prediction value is interpolated by a one-dimensional 6-tap FIR filter applied horizontally and vertically. For quarter-pixel positions, the interpolation value is generated by averaging the samples at the integer- and half-pixel positions. Fig.4 shows the fractional sample interpolation.
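The two interpolation steps can be sketched directly. The half-pel filter taps (1, −5, 20, 20, −5, 1) and the rounding rules follow the standard; the helper names and sample values are our own:

```python
def half_pel(p):
    """Half-pel sample from six neighbouring integer samples using the
    H.264 6-tap FIR (1, -5, 20, 20, -5, 1), with rounding and 8-bit clipping."""
    a, b, c, d, e, f = p
    v = (a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5
    return min(255, max(0, v))

def quarter_pel(s1, s2):
    """Quarter-pel sample: rounded average of an integer and a half-pel sample."""
    return (s1 + s2 + 1) >> 1

row = [2, 4, 10, 12, 8, 6]
h = half_pel(row)        # half-pel value between the samples 10 and 12 -> 12
q = quarter_pel(10, h)   # quarter-pel value between sample 10 and h -> 11
```

Note that the heavy 6-tap filtering is only needed at half-pel positions; the quarter-pel samples reuse those results with a cheap average, which is exactly why quarter-pel accuracy costs relatively little extra computation.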

Fig.4 Fractional interpolation for motion compensation

The prediction values for the chroma components are always obtained by bilinear interpolation. Since the chroma components are down-sampled, chroma motion compensation has one-eighth-sample position accuracy.
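The chroma bilinear interpolation weights the four surrounding integer samples by the 1/8-pel fractional offsets; the formula below follows the standard's weighting and rounding, while the sample values are illustrative:

```python
def chroma_sample(A, B, C, D, dx, dy):
    """Bilinear chroma interpolation; dx, dy are 1/8-pel fractions in 0..7.
    A, B are the upper and C, D the lower neighbouring integer samples."""
    return ((8 - dx) * (8 - dy) * A + dx * (8 - dy) * B
            + (8 - dx) * dy * C + dx * dy * D + 32) >> 6

print(chroma_sample(100, 60, 40, 20, 0, 0))  # zero fractions return sample A: 100
print(chroma_sample(100, 60, 40, 20, 4, 4))  # centre position: 55
```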

The motion vector components are differentially coded using either median or directional prediction from neighboring blocks. In addition, H.264/AVC supports multiple reference frame prediction; that is, more than one previously coded frame can be used as reference for motion compensation, as illustrated in Fig. 5.
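The median prediction can be sketched component-wise as below. This is a minimal sketch that ignores the standard's special cases for unavailable neighbors and particular partition shapes:

```python
def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of the left (A), above (B) and above-right (C)
    neighbouring motion vectors."""
    return tuple(sorted(comp)[1] for comp in zip(mv_a, mv_b, mv_c))

pred = median_mv((4, -2), (6, 0), (2, 2))   # -> (4, 0)
mv = (5, 1)                                 # actual motion vector
mvd = (mv[0] - pred[0], mv[1] - pred[1])    # only this small difference is coded
```

Because neighboring blocks usually move similarly, the difference mvd is small and cheap to entropy-code.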

Fig. 5 Multiple reference frame motion compensation

2.2.3 Transform

Similar to previous video coding standards, H.264/AVC utilizes transform coding of the prediction residual. However, the transform is applied to 4x4 blocks, and instead of a 4x4 discrete cosine transform, an integer transform with properties similar to the DCT is adopted. Inverse-transform mismatches are thus avoided thanks to the exact integer arithmetic. For 16x16 intra luma prediction and for chroma, an additional Hadamard transform is applied to the DC coefficients of the 4x4 blocks.

There are three reasons for using a smaller transform size. First, with the improved prediction, the residual signal has less spatial correlation, so the transform has less de-correlation work to do; a 4x4 transform is essentially as efficient at removing residual correlation as a larger transform. Second, ringing artifacts are reduced with a smaller transform. Third, the small integer transform requires less computation.
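The exact-match property can be demonstrated with the core 4x4 transform matrix. This sketch omits the norm-scaling that the standard folds into quantization and just shows that the integer matrix loses nothing:

```python
import numpy as np

# Core 4x4 integer transform matrix of H.264/AVC (the norm scaling that the
# standard folds into quantization is omitted here).
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_core(x):
    """W = Cf * X * Cf^T, computed entirely in integer arithmetic."""
    return Cf @ x @ Cf.T

X = np.arange(16).reshape(4, 4)          # a toy residual block
W = forward_core(X)                      # W[0, 0] is the (scaled) DC term
# Cf is invertible, so the block is recovered exactly -- no rounding drift,
# hence encoder and decoder can never mismatch:
X_back = np.linalg.inv(Cf) @ W @ np.linalg.inv(Cf.T)
```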

2.2.4 Quantization

A quantization parameter is used to determine the quantization of transform coefficients in H.264/AVC. The parameter can take 52 values. The quantized transform coefficients of a block are generally scanned in a zig-zag fashion and transmitted using entropy coding, while the 2x2 DC coefficients of the chroma components are scanned in raster-scan order. To simplify the transform, part of its scaling is performed in the quantization stage.
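The zig-zag scan for a 4x4 block (frame coding) can be written out explicitly; the table below is the standard scan order, expressed as (row, col) pairs:

```python
# Zig-zag scan order for a 4x4 block in frame coding, as (row, col) pairs.
ZIGZAG_4x4 = [(0,0),(0,1),(1,0),(2,0),(1,1),(0,2),(0,3),(1,2),
              (2,1),(3,0),(3,1),(2,2),(1,3),(2,3),(3,2),(3,3)]

def zigzag(block):
    """Reorder a 4x4 coefficient array into a list, low frequencies first."""
    return [block[r][c] for r, c in ZIGZAG_4x4]

raster = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(zigzag(raster))  # [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```

The scan groups the high-magnitude low-frequency coefficients at the front and the (usually zero) high-frequency ones at the end, which is what makes the run-length style coding of CAVLC effective.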

2.2.5 Entropy coding

In H.264/AVC, there are two methods of entropy coding. The simpler method, UVLC, uses Exp-Golomb codeword tables for all syntax elements except the quantized transform coefficients. For transmitting the quantized transform coefficients, a more efficient method called Context-Adaptive Variable Length Coding (CAVLC) is employed. In this scheme, VLC tables for various syntax elements are switched depending on already transmitted syntax elements.
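Decoding an unsigned Exp-Golomb codeword follows directly from its structure (n leading zeros, a one, then n info bits); the decoder below follows that structure, with the function name being our own:

```python
def decode_ue(bits):
    """Decode one unsigned Exp-Golomb codeword from a '0'/'1' string.
    Returns (value, number of bits consumed)."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    # codeword = <zeros> leading zeros, a '1', then <zeros> info bits
    info = int(bits[zeros + 1:2 * zeros + 1] or '0', 2)
    return (1 << zeros) - 1 + info, 2 * zeros + 1

print(decode_ue('1'))      # (0, 1)
print(decode_ue('011'))    # (2, 3)
print(decode_ue('00101'))  # (4, 5)
```

Because a single regular rule generates every codeword, UVLC needs no stored tables, which is what makes it attractive for the non-coefficient syntax elements.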

In the CAVLC entropy coding method, the number of nonzero quantized coefficients (N) and the actual magnitudes and positions of the coefficients are coded separately. After zig-zag scanning of the transform coefficients, their statistical distribution typically shows large values in the low-frequency part and small values later in the scan, in the high-frequency part.

The efficiency of entropy coding can be improved further if Context-Adaptive Binary Arithmetic Coding (CABAC) is used. In H.264/AVC, the arithmetic coding core engine and its associated probability estimation are fully specified.

Compared to CAVLC, CABAC typically reduces the bit rate by 5%–15%. More details on CABAC can be found in [6].

2.2.6 In-loop deblocking filter

One particular characteristic of block-based coding is the accidental production of visible block structures. Block edges are typically reconstructed with less accuracy than interior pixels. To ease the blocking artifacts caused by both the prediction and transform operations, an adaptive deblocking filter, which improves the resulting video quality, is applied within the motion compensation loop as an in-loop filter. The filter also typically reduces the bit rate.

2.3 Profiles

A profile defines a set of coding tools or algorithms that can be used to generate a conforming bit stream. H.264/AVC defines three profiles: the Baseline, Main, and Extended profiles. The Baseline profile supports all features in H.264/AVC except the following two feature sets:

‧ Set 1: B slices, weighted prediction, CABAC, field coding, and picture or macroblock adaptive switching between frame and field coding.

‧ Set 2: SP/SI slices, and slice data partitioning.

The first set of additional features is supported by the Main profile. However, the Main profile does not support the FMO, ASO, and redundant picture features that the Baseline profile supports. The Extended profile supports all features of the Baseline and Main profiles except CABAC. The Baseline profile targets lower-cost applications with limited computation resources such as videoconferencing, Internet multimedia and mobile applications. The Main profile serves mainstream consumer applications such as broadcast systems and storage devices. The Extended profile is intended as the streaming video profile, with extra coding tools for data-loss robustness and server stream switching. Fig.6 shows the relationship of these three profiles.

Fig.6 Profiles

Chapter 3

DATA MAPPING AWARE FRAME MEMORY CONTROLLER

3.1 Backgrounds

Fig.7 Simplified architecture of typical 4-bank SDRAM

3.1.1 Features of SDRAM

A brief illustration of the 4-bank SDRAM architecture is shown in Fig.7. Such SDRAMs are three-dimensional structures of banks, rows, and columns. Each bank contains its own row decoder, column decoder, and sense amplifiers, while the four banks share the same command, address and data buses. SDRAMs provide a programmable burst length, CAS latency, and burst type, which are defined through the mode register. While updating this register, all banks must be idle, and the controller must wait for the specified time before initiating a subsequent operation; violating either requirement results in unspecified operation.

Fig.8(a) Typical read access in SDRAM

Fig.8(b) Typical write access in SDRAM

Fig.8(a) and Fig.8(b) show typical read and write accesses. A memory access operation consists of three steps. First, an active command is sent to open a row in a particular bank, which copies the row data into the sense amplifiers. Second, a read or write command is issued to initiate a burst read/write access to the active row; the starting column and bank address are provided on the address bus, and the burst length and type are as previously defined in the mode register. Data for any read/write burst may be truncated by a subsequent read/write command, as shown in Fig.9(a), with the first data element of the new burst following either the last element of a completed burst or the last desired element of a longer burst being truncated. Finally, a precharge command is used to deactivate the open row in a particular bank, or the open rows in all banks; the bank(s) become available for a subsequent memory access once the sense amplifiers are precharged. Many SDRAMs provide an auto-precharge function, which performs an individual-bank precharge without an explicit command right after the read/write access completes. This is accomplished by setting a flag when the read/write command is sent, so another permitted command can be issued during the cycle the precharge command would have occupied, improving command bus utilization.

Fig.9(a) Random read access

Fig.9(b) Random write access

Since each bank can operate independently, row-activation commands can be overlapped to reduce the number of cycles spent on row activation, as shown in Fig.10.

Take a read access as an example: assume we read 8 data words from the SDRAM, where 4 lie in row 0 of bank 0 and the others lie in row 1 of bank 1. Without bank alternating, we need 16 cycles to get the 8 data words, and this access occupies the command bus for 16 cycles. With alternating access, only 12 cycles are needed and the command bus is busy for only 8 cycles, so subsequent commands can be sent to pipeline the following operations. The more data are required, the more cycles can be saved; for 8 data words, we gain a 25% speedup in read access latency.
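The 16-versus-12-cycle example can be reproduced with a toy cycle model. The 4-cycle row-setup figure is an assumption chosen to match the example, not a datasheet value:

```python
def read_cycles(rows, burst, setup=4, interleave=False):
    """Cycles to read `rows` rows of `burst` data each; `setup` models the
    activate-plus-CAS overhead before a row's first datum appears."""
    if interleave and burst >= setup:
        # later rows' setups are fully hidden under the previous row's burst
        return setup + rows * burst
    return rows * (setup + burst)

no_alt = read_cycles(2, 4)                    # 16 cycles without alternating
alt = read_cycles(2, 4, interleave=True)      # 12 cycles with alternating
speedup = (no_alt - alt) / no_alt             # 0.25, i.e. a 25% speedup
```

The model also shows why the gain grows with the request size: only the first setup is ever exposed under interleaving, so every additional row saves a full setup period.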

Fig.10 Bank alternating read access

Some important timing characteristics are listed below. The behavioral model used in our design is Micron's MT48LC8M32B2P 256 Mb SDRAM [7], organized as 4 banks by 4,096 rows by 512 columns by 32 bits.

Table 1 SDRAM timing characteristics
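For the 4-bank by 4,096-row by 512-column by 32-bit part above, a flat word address splits naturally into bank, row and column fields. The bank-in-the-top-bits layout below is just one possible mapping, shown for illustration; Chapter 3's point is precisely that this choice matters:

```python
BANK_BITS, ROW_BITS, COL_BITS = 2, 12, 9   # 4 banks, 4096 rows, 512 columns

def split_address(addr):
    """Decompose a flat 32-bit-word address into (bank, row, column)."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col

print(split_address((1 << 21) | (2 << 9) | 3))  # (bank 1, row 2, column 3)
```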

3.1.2 Related works

Targeting video codec applications, many designs have been proposed to improve SDRAM bandwidth utilization and achieve efficient memory access. Li [8] develops a bus arbitration algorithm optimized for different processing units to meet real-time performance. Ling's [9] controller schedules DRAM accesses in a pre-determined order to lower the peak bus bandwidth. Kim's [10] memory interface adopts an array-translation technique to reduce power consumption and increase memory bandwidth. Park's [11] history-based memory controller reduces page breaks to achieve energy and memory latency reduction.

For H.264 applications, Kang's [12] AHB-based scalable bus architecture and dual memory controller support 1080 HD decoding under 130 MHz. Zhu's [13] SDRAM controller applies the main idea of Kim's memory interface to HDTV applications; it focuses on data arrangement and memory mapping to reduce page-activation overheads, which not only improves throughput but also lowers power consumption. However, it does not take memory operation scheduling into consideration. With careful scheduling, some of the bus bandwidth lost to page-activation operations can be recovered. We therefore combine data mapping and operation scheduling in our design to decrease the bandwidth required for real-time decoding.

3.1.3 Problem definition

For a memory access, each time we activate a closed row we suffer the latency introduced by the SDRAM's inherent structure. For a read access, the latency consists of tRP (precharge), tRCD (activate) and the CAS latency; for a write access, it includes tRP (precharge) and tRCD (activate). There are two methods to ease this overhead. One is to reduce the number of required active commands: either the demanded data should lie in as few rows as possible within a single request, or the probability of a row miss between successive requests must be as small as possible. Since the number of row openings is decreased, the latency we suffer is shortened; Fig.11 shows the effect of row-opening reduction on access latency. The other is to apply bank alternating to hide the latency. The total number of required operations remains the same, but by taking advantage of the bank architecture, a free bank can start another operation while the other banks process requested work, so the latency of the current access is overlapped, as shown in Fig.12. In this case the required data should lie in different banks when a row miss happens, or the bank-interleaving technique will fail to improve data bus utilization.
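The latency components can be summarized in a small model; the timing values here are placeholders standing in for the datasheet numbers of Table 1:

```python
tRP, tRCD, CAS = 2, 2, 2   # placeholder cycle counts, not datasheet values

def first_data_latency(is_read, row_miss):
    """Cycles before the first datum: a row miss pays precharge (tRP) plus
    activate (tRCD); reads additionally pay the CAS latency."""
    latency = (tRP + tRCD) if row_miss else 0
    return latency + (CAS if is_read else 0)

hit = first_data_latency(True, row_miss=False)    # read, row open: CAS only
miss = first_data_latency(True, row_miss=True)    # read, row closed: tRP+tRCD+CAS
```

With these numbers a read row miss costs three times the latency of a row hit, which is why both reducing row misses and hiding them behind other banks pay off.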

Fig.11 Memory access under different row miss count

Fig.12 Cycles overlapped by bank interleaving technique

For video applications, a memory request usually fetches a rectangular image region of known size from the frame memory, as in the motion compensation, intra prediction and deblocking filter processes. These data are spatially continuous, and the areas requested by two consecutive blocks have a high probability of overlapping. For instance, in motion compensation the range of data we may need is bounded by the block size and the search range set during encoding: if the search range is L and the block length is N, a 2L by 2L+N rectangle overlaps, and data in this region have a high probability of having been opened during the previous block access. Thus we can find a method to avoid row misses. Fig.13 illustrates an example of this characteristic.

Fig.13 Possible required area between adjacent blocks

According to these characteristics of video data, the translation between the physical location in memory and the pixel coordinates in the spatial domain affects the probability of row misses. We can analyze the statistics of video sequences to find an optimized translation that minimizes the number of row openings, and the controller can then schedule memory operations to enhance efficiency. As a result, the bandwidth utilization of the same SDRAM can be increased.

3.1.4 Estimation of bandwidth requirement

H.264 achieves high video compression efficiency compared with previous standards such as MPEG-2 and MPEG-4. This is attributed to its advanced features like fractional pixel interpolation, variable block sizes, multiple reference frames, etc. However, these features lead to a high bandwidth requirement in implementation. Below we briefly discuss the bandwidth requirements of the different processing units, assuming the frame width is W, the height is H, the frame rate is F, and all sequences are in 4:2:0 format.

Reference frame storage

In an H.264 decoder, each processed frame must be stored in memory for reference by the following frames. The required bandwidth is

BWRFS = W * H * (1Y + 0.25Cb + 0.25Cr) * F

Loop filter

A deblocking filter is adopted in H.264 to improve subjective quality. For luma, a 16x4 block and a 4x16 block adjacent to the current macroblock are referenced while deblocking it; for chroma, an 8x4 block and a 4x8 block are needed. As a result, the required bandwidth is

BWLP = (W/16) * (H/16) * ((16*4*2)Y + (8*4*2)Cb + (8*4*2)Cr) * F

Fig.14 illustrates the reference blocks for the loop filter.

Fig.14 Reference blocks for deblocking filter

Motion compensation

H.264 supports variable block size motion compensation to enhance coding efficiency as shown in Fig.15.

Fig.15 Block types in H.264

Besides, sub-pixel interpolation is applied to further improve performance, but extra bandwidth is then required to meet real-time decoding.

Table 2 lists the maximum reference block size for different block type.

Considering the worst case, all blocks are broken into the smallest 4x4 size with maximum-size reference blocks, and all frames are P-frames except the first. The required bandwidth is

BWMC = (W/16) * (H/16) * ((9*9*16)Y + (3*3*16)Cb + (3*3*16)Cr) * F

where the effect of the first frame is neglected.

Table 2 Maximum area of reference blocks


Summing the three cases above gives a rough bandwidth requirement for an H.264 decoder. The bandwidth for different frame sizes is listed in Table 3, assuming a frame rate of 30 fps in all cases.

Table 3 Rough estimation of required bandwidth in different frame size

format    width   height  BWRFS   BWLP    BWMC    BWall   unit
1080 HD   1920    1080    93.31   62.21   384.91  540.40  MBps

As the frame size grows, the required bandwidth increases rapidly, and the motion compensation unit dominates the demand for memory bandwidth.
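The three formulas can be evaluated directly; for 1080 HD at 30 fps this reproduces the Table 3 entries (the total differs from the printed 540.40 only by rounding of the summands):

```python
def bandwidth_mbps(w, h, f):
    """Per-unit bandwidth in MB/s for reference frame storage, loop filter
    and worst-case motion compensation, following the formulas above."""
    rfs = w * h * (1 + 0.25 + 0.25) * f                        # BWRFS
    lp = (w / 16) * (h / 16) * (16*4*2 + 8*4*2 + 8*4*2) * f    # BWLP
    mc = (w / 16) * (h / 16) * (9*9*16 + 3*3*16 + 3*3*16) * f  # BWMC
    return tuple(round(x / 1e6, 2) for x in (rfs, lp, mc, rfs + lp + mc))

print(bandwidth_mbps(1920, 1080, 30))  # (93.31, 62.21, 384.91, 540.43)
```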

However, SDRAM bandwidth cannot reach 100% utilization, so the performance must be improved through optimized data arrangement and careful operation scheduling.

3.2 Optimization of memory access

3.2.1 Intra-request optimization

In this section we discuss the memory access operation within a single request and build a mathematical model to describe this behavior. According to the model, we can find an optimized mapping between the physical location and the position in the spatial domain.

