
Fig. 1.7 Hierarchical structure of H.264 video bit-stream (a sequence of NAL units, each consisting of a NAL unit header and an RBSP payload such as SPS, PPS, or slice layer; a slice consists of a slice header and slice data; the slice data carries macroblocks with macroblock/sub-macroblock prediction syntax elements and residual data)

1.4 Thesis Organization

This thesis is organized as follows. First, an overview of the MPEG-2 and H.264/AVC decoding flows is given in Chapter 2. Chapter 3 presents the system-level design considerations and system-level schemes of this work, such as the pipeline architecture, decoding order, system synchronization, low-power mode exploration, and low-power design between modules. Details of the architecture design of each functional block are then described in Chapter 4. Finally, the implementation details are presented in Chapter 5, and the conclusion and summary are given in Chapter 6.

Chapter 2

Overview of MPEG-2 and H.264/AVC Decoding Flow

Overviews of the MPEG-2 and the H.264/AVC decoding flows are given in this chapter. Though there exist some differences between the decoding flows of MPEG-2 and H.264/AVC, similarities such as the inverse discrete cosine transform, inverse quantization, and motion compensation can still be found. From the system point of view, making good use of every functional block that suits both systems is an important issue and a key innovation in designing a multimode video decoder.

2.1 Overview of MPEG-2 Decoding Flow

The decoding process is strictly defined in the standard. With the exception of the Inverse Discrete Cosine Transform (IDCT), the decoding process is defined such that all decoders shall produce numerically identical results. As Fig. 1.2 shows, the decoding process mainly includes variable length decoding, the inverse scanning process, inverse quantization, the inverse DCT, and motion compensation.

2.1.1 Variable length decoding

The DC coefficients are treated separately from the other coefficients. For DC coefficients, a predictor is used for prediction. The predictor shall be reset to a certain value at the start of a slice, whenever a non-intra macroblock is decoded, or whenever a macroblock is skipped. The differential value (dc_dct_differential) is coded in the bit-stream. Thus the decoder can calculate the DC coefficient (QFS[0]) by

    QFS[0] = dc_dct_pred[cc] + dct_diff

where dc_dct_pred[cc] are the values of the three predictors for Y (cc = 0), Cb (cc = 1), and Cr (cc = 2), and dct_diff is the value derived from dc_dct_differential.

For the other coefficients, the values of "run" and "level" can be decoded by table lookup of two VLC tables. The coefficients of each block can then be recovered by the run-length decoding process described below.
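A minimal C sketch of this run-length decoding loop is shown below; it paraphrases the standard's pseudocode, and decode_run_level() is a hypothetical placeholder for the two-table VLC lookup described above.

    /* Minimal sketch of the MPEG-2 run-length decoding loop for one block,
     * paraphrasing the standard's pseudocode.  decode_run_level() is a
     * hypothetical front-end that returns the next (run, level) pair from the
     * VLC tables and signals end-of-block; for intra blocks the DC coefficient
     * is decoded separately, as described above. */
    extern void decode_run_level(int *run, int *level, int *end_of_block);

    void run_level_decode(int QFS[64])
    {
        int run, level, eob;
        int n = 0;

        for (int m = 0; m < 64; m++)        /* clear the coefficient list */
            QFS[m] = 0;

        decode_run_level(&run, &level, &eob);
        while (!eob) {
            n += run;                       /* skip 'run' zero coefficients */
            QFS[n++] = level;               /* place the non-zero 'level'   */
            decode_run_level(&run, &level, &eob);
        }
    }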

2.1.2 Inverse scanning process

After all 64 coefficients have been decoded by the Huffman run-length VLC decoder described above, the inverse scanning process maps these coefficients into a single 8x8 block. Two scan patterns, selected by the parameter "alternate_scan", can be used. Fig. 2.1 shows these two scan patterns.

(a) alternate_scan = 0:

     0  1  5  6 14 15 27 28
     2  4  7 13 16 26 29 42
     3  8 12 17 25 30 41 43
     9 11 18 24 31 40 44 53
    10 19 23 32 39 45 52 54
    20 22 33 38 46 51 55 60
    21 34 37 47 50 56 59 61
    35 36 48 49 57 58 62 63

(b) alternate_scan = 1:

     0  4  6 20 22 36 38 52
     1  5  7 21 23 37 39 53
     2  8 19 24 34 40 50 54
     3  9 18 25 35 41 51 55
    10 17 26 30 42 46 56 60
    11 16 27 31 43 47 57 61
    12 15 28 32 44 48 58 62
    13 14 29 33 45 49 59 63

Fig. 2.1 Inverse scan patterns: (a) alternate_scan = 0, (b) alternate_scan = 1

2.1.3 Inverse quantization

As Fig. 2.2 shows, the inverse quantization process can be divided into three parts: arithmetic, saturation, and mismatch control. In the arithmetic part, the DC coefficient is treated separately from all the other coefficients. The parameter "intra_dc_precision" determines the multiplication factor for intra DC coefficients, which ranges from 1 to 8. For the other coefficients, the weighting matrix W[w][v][u] and the quantiser scale determine the multiplication factor. W[w][v][u] can hold either encoder-defined values or default values. The quantiser scale is obtained by table lookup using the parameters "quantiser_scale_code" and "q_scale_type" in the bit-stream.

In the saturation part, the scaled coefficients F''[v][u] are saturated to F'[v][u], which lie in the range [-2048, +2047]. In mismatch control, a correction is made to just one coefficient, F[7][7], by adding or subtracting one.


Fig. 2.2 Inverse quantization process

In summary, the inverse quantization process may be any process that is numerically equivalent to the combination of the arithmetic, saturation, and mismatch control parts described above; a sketch is given below.
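The following C sketch paraphrases that combination for one 8x8 block. The selected weighting matrix W, the quantiser scale, the intra DC multiplier, and the intra/non-intra flag are assumed to have been derived from the bit-stream already.

    /* Sketch of MPEG-2 inverse quantization for one 8x8 block, paraphrasing the
     * standard: arithmetic part, saturation to [-2048, 2047], and mismatch
     * control on F[7][7].  QF is the inverse-scanned block; W is the weighting
     * matrix already selected for this block type. */
    static int sign3(int x) { return (x > 0) - (x < 0); }

    void inverse_quantize(int F[8][8], const int QF[8][8], const int W[8][8],
                          int quantiser_scale, int intra_dc_mult, int intra)
    {
        int sum = 0;

        for (int v = 0; v < 8; v++) {
            for (int u = 0; u < 8; u++) {
                int f;

                if (intra && u == 0 && v == 0)
                    f = intra_dc_mult * QF[v][u];                  /* DC path */
                else if (intra)
                    f = (QF[v][u] * W[v][u] * quantiser_scale * 2) / 32;
                else
                    f = ((2 * QF[v][u] + sign3(QF[v][u])) * W[v][u]
                         * quantiser_scale) / 32;

                if (f > 2047)  f = 2047;                           /* saturation */
                if (f < -2048) f = -2048;

                F[v][u] = f;
                sum += f;
            }
        }

        if ((sum & 1) == 0)                 /* mismatch control: force an odd sum */
            F[7][7] += (F[7][7] & 1) ? -1 : 1;
    }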

2.1.4 Inverse Discrete-Cosine-Transform (IDCT)

The formula of Inverse Discrete-Cosine-Transform is as follows:

    f(y, x) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u)\, C(v)\, F(v, u)\, \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}

where C(u) = 1/\sqrt{2} for u = 0, C(u) = 1 otherwise, N = 8, and x, y, u, v = 0, ..., N-1.

These transformed values shall be saturated to [-256:+255].
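A direct (non-optimised) C sketch of this 8x8 IDCT, including the final saturation to [-256, +255], is shown below; a practical decoder would use a fast IDCT that meets the standard's accuracy requirements.

    #include <math.h>

    /* Direct evaluation of the 8x8 2-D IDCT formula above, followed by
     * saturation of the result to [-256, 255]. */
    void idct_8x8(const int F[8][8], int f[8][8])
    {
        const int    N  = 8;
        const double PI = 3.14159265358979323846;

        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) {
                double s = 0.0;
                for (int v = 0; v < N; v++) {
                    for (int u = 0; u < N; u++) {
                        double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                        double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                        s += cu * cv * F[v][u]
                           * cos((2 * x + 1) * u * PI / (2.0 * N))
                           * cos((2 * y + 1) * v * PI / (2.0 * N));
                    }
                }
                int val = (int)floor((2.0 / N) * s + 0.5);
                if (val >  255) val =  255;       /* saturate to [-256, 255] */
                if (val < -256) val = -256;
                f[y][x] = val;
            }
        }
    }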

2.1.5 Motion compensation

As Fig. 2.3 shows, the motion compensation process includes many parts. From the bit-stream, parameters such as "f_code", "motion_code", and "motion_residual" can be extracted. With these parameters, a vector decoding module, together with the vector predictors (PMV[r][s][t]), decodes the motion vector vector'[r][s][t]; a sketch of this process is given below.
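The C sketch below paraphrases the standard's vector decoding arithmetic for a single vector component; the predictor update and the dual-prime arithmetic are omitted for brevity.

    #include <stdlib.h>

    /* Sketch of MPEG-2 motion vector decoding for one component.  f_code,
     * motion_code and motion_residual are the bit-stream parameters for this
     * component; pmv is the corresponding predictor PMV[r][s][t]. */
    int decode_vector_component(int f_code, int motion_code,
                                int motion_residual, int pmv)
    {
        int r_size = f_code - 1;
        int f      = 1 << r_size;
        int high   =  16 * f - 1;
        int low    = -16 * f;
        int range  =  32 * f;
        int delta, v;

        if (f == 1 || motion_code == 0) {
            delta = motion_code;
        } else {
            delta = (abs(motion_code) - 1) * f + motion_residual + 1;
            if (motion_code < 0)
                delta = -delta;
        }

        v = pmv + delta;
        if (v < low)  v += range;     /* wrap into the legal range [low, high] */
        if (v > high) v -= range;
        return v;                     /* vector'[r][s][t] */
    }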

After scaling with the appropriate scaling factors, vector[r][s][t] is sent to the address generator of the frame buffers. The pixel values read from the frame buffers are then fed through a half-pel prediction filter. At the final stage, the decoded pixels are calculated by adding the combined predictions p[y][x] to the residual values f[y][x]. A saturation process is needed to clamp the result.

Fig. 2.3 Motion compensation process (vector decoding from the bit-stream with the vector predictors, scaling for colour components, additional dual-prime arithmetic, frame buffer addressing, half-pel prediction filtering, prediction field/frame selection, combining the predictions p[y][x], adding the residual f[y][x], and saturation to produce the decoded pixels d[y][x])

2.2 Overview of H.264/AVC Decoding Flow

The H.264/AVC decoding flow is strictly specified in the standard such that all decoders shall produce numerically identical results. As Fig. 1.6 shows, the decoding process contains entropy decoding, reordering, inverse quantization, the inverse integer discrete cosine transform, intra prediction, motion compensation, and the loopfilter. In this thesis we consider the decoding process of the baseline profile.

2.2.1 Entropy decoding

The H.264 bit-stream mainly contains two types of content: parameters and coefficients. The parameters are mainly coded with the Exp-Golomb code, a Universal Variable Length Code (UVLC), while the residual coefficients are coded with the Context-based Adaptive Variable Length Code (CAVLC).

The entropy decoding process for the Exp-Golomb code is as follows:

    leadingZeroBits = -1;
    for( b = 0; !b; leadingZeroBits++ )
        b = read_bits( 1 );
    codeNum = 2^leadingZeroBits - 1 + read_bits( leadingZeroBits );
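As a worked example, the bits "00101" give leadingZeroBits = 2 and codeNum = 2^2 - 1 + 1 = 4. A self-contained C sketch of this ue(v) decoding is shown below; the MSB-first bit reader is an illustrative assumption, not the decoder's actual interface.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { const uint8_t *buf; size_t bitpos; } bitreader_t;

    /* Read n bits MSB-first from the byte buffer. */
    static unsigned read_bits(bitreader_t *br, unsigned n)
    {
        unsigned v = 0;
        while (n--) {
            unsigned bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
            v = (v << 1) | bit;
            br->bitpos++;
        }
        return v;
    }

    /* ue(v): unsigned Exp-Golomb code, following the pseudocode above. */
    unsigned decode_ue(bitreader_t *br)
    {
        int leadingZeroBits = -1;
        for (unsigned b = 0; !b; leadingZeroBits++)
            b = read_bits(br, 1);
        return (1u << leadingZeroBits) - 1 + read_bits(br, leadingZeroBits);
    }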

The entropy decoding process for CAVLC is much more complicated than that for the Exp-Golomb code. Many different tables are used for decoding parameters such as "TrailingOnes", "TotalCoeff", "level_prefix", "total_zeros", and "run_before". For some syntax elements, such as "TrailingOnes" and "TotalCoeff", more than one table is used for decoding. The VLC is called "Context-based Adaptive" because the CAVLC decoder has to choose the correct table for decoding a certain parameter according to the number of coefficients in the neighboring blocks (the left and upper 4x4 blocks).

By decoding the intermediate parameters such as "trailing_ones_sign_flag", "level_prefix", "level_suffix", "total_zeros", and "run_before", the run and level of each run-level code can be calculated by the procedure defined in the standard. Then a run-level code decoder is used to recover the 16 coefficients of that 4x4 block.

2.2.2 Inverse scanning process

Input to this functional block is a list of 16 coefficients decoded by CAVLC. These 16 coefficients are then inverse scanned with a Zig-Zag scan pattern to form a 4x4 block. Fig. 2.4 shows the Zig-Zag scanning pattern.

0 1 5 6

2 4 7 12

3 8 11 13

9 10 14 15

Fig. 2.4 Zig-Zag scan
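A small C sketch of this inverse scan, reading the positions directly from the table of Fig. 2.4, is shown below.

    /* Inverse Zig-Zag scan for a 4x4 block: the entry of Fig. 2.4 at position
     * (r, c) is the index, in the scanned list, of the coefficient that belongs
     * at block[r][c]. */
    static const int zigzag4x4[4][4] = {
        { 0,  1,  5,  6 },
        { 2,  4,  7, 12 },
        { 3,  8, 11, 13 },
        { 9, 10, 14, 15 }
    };

    void inverse_scan_4x4(int block[4][4], const int list[16])
    {
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
                block[r][c] = list[zigzag4x4[r][c]];
    }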

2.2.3 Inverse quantization & inverse Hadamard transform

In the inverse quantization process, the operations on DC values are separated from those on the other coefficients. Because the 4x4 block is the basic unit in the H.264 system, there are a total of 16 luma DC coefficients in a macroblock. These 16 DC coefficients are first transformed through the inverse Hadamard transform as follows:

    f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix} c \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}

where c is the 4x4 array of decoded luma DC coefficients.

The result of the inverse Hadamard transform is then rescaled according to the given QP_Y: each DC value is multiplied by a scaling factor determined by QP_Y % 6 and shifted according to QP_Y / 6, as specified in the standard.

For the Cb and Cr components of a macroblock, the 4 DC coefficients are first transformed through a 2x2 inverse transform as follows:

    f = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} c \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

where c is the 2x2 array of chroma DC coefficients.

After the inverse transform, the chroma DC values are rescaled in the same manner using the chroma quantization parameter QP_C.

For coefficients other than DC, each coefficient is rescaled by multiplying it with a scaling factor determined by QP % 6 and the coefficient position (i, j), followed by a left shift by QP / 6, as specified in the standard.

2.2.4 Inverse Integer Discrete Cosine Transform

The inverse discrete cosine transform in the H.264 system is much simpler than the traditional inverse discrete cosine transform: the transform coefficients of the 2-D IDCT in H.264 are all simplified so that the transform can be computed with integer additions and shifts. The 1-D inverse transform matrix is as follows:

    \begin{bmatrix} 1 & 1 & 1 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} & -1 & -1 \\ 1 & -\tfrac{1}{2} & -1 & 1 \\ 1 & -1 & 1 & -\tfrac{1}{2} \end{bmatrix}
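A C sketch of the corresponding butterfly is shown below: the 1-D transform (the >>1 operations implement the 1/2 entries of the matrix) is applied to the rows and then to the columns, followed by the final (x + 32) >> 6 rounding used in the standard.

    /* Sketch of the H.264 4x4 inverse core transform. */
    static void inverse_transform_1d(int d0, int d1, int d2, int d3, int out[4])
    {
        int e0 = d0 + d2;
        int e1 = d0 - d2;
        int e2 = (d1 >> 1) - d3;
        int e3 = d1 + (d3 >> 1);
        out[0] = e0 + e3;
        out[1] = e1 + e2;
        out[2] = e1 - e2;
        out[3] = e0 - e3;
    }

    void inverse_transform_4x4(const int block[4][4], int residual[4][4])
    {
        int tmp[4][4], col[4];

        for (int i = 0; i < 4; i++)                       /* horizontal pass */
            inverse_transform_1d(block[i][0], block[i][1],
                                 block[i][2], block[i][3], tmp[i]);

        for (int j = 0; j < 4; j++) {                     /* vertical pass */
            inverse_transform_1d(tmp[0][j], tmp[1][j], tmp[2][j], tmp[3][j], col);
            for (int i = 0; i < 4; i++)
                residual[i][j] = (col[i] + 32) >> 6;      /* final rounding */
        }
    }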

2.2.5 Intra prediction

The intra prediction process is a new prediction process that the MPEG-2 system lacks. There are two classes of intra prediction modes: the Intra_4x4 prediction modes and the Intra_16x16 prediction modes.

There are a total of 9 sub-modes in the Intra_4x4 prediction class. As Fig. 2.5 shows, these 9 modes are vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left, and horizontal-up. In DC mode, the intra prediction process calculates the mean value of the neighboring pixel values. All the other modes are directional modes, and their prediction values can all be written with the following formula:

    pred = ( P0 + P1 + P2 + P3 + 2 ) / 4

where P0, P1, P2, and P3 are neighboring pixel values.

P0, P1, P2, and P3 are different neighboring pixel values depending on the mode and on the position within the 4x4 block. For example, in mode 3 (diagonal down-left), the upper-left corner is predicted by ((A+2B+C)+2)/4, which is equivalent to ((A+B+B+C)+2)/4; and for the upper-right corner in mode 5 (vertical-right), the prediction value is calculated by ((2C+2D)+2)/4, which is equivalent to ((C+C+D+D)+2)/4.
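As an illustration, a C sketch of mode 3 (diagonal down-left) is given below; p[0..7] stand for the eight neighboring samples above the block (A–H in Fig. 2.5), and the bottom-right corner uses the 3-tap special case of the standard.

    /* Sketch of Intra_4x4 mode 3 (diagonal down-left): each predicted sample is
     * a (P0 + P1 + P2 + P3 + 2)/4 average taken along the 45-degree direction. */
    void intra4x4_diag_down_left(unsigned char pred[4][4], const unsigned char p[8])
    {
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                if (x == 3 && y == 3)
                    pred[y][x] = (unsigned char)((p[6] + 3 * p[7] + 2) >> 2);
                else
                    pred[y][x] = (unsigned char)((p[x + y] + 2 * p[x + y + 1]
                                                  + p[x + y + 2] + 2) >> 2);
            }
        }
    }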

Note that the intra prediction process and the residual adding process must be performed iteratively. That is, for a given 4x4 block, the neighboring pixel values (upper and left) used for intra prediction must already have had their residual values added, i.e., they must be fully reconstructed.

Fig. 2.5 Intra_4x4 prediction modes (neighboring samples A–H; modes 0–8: vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left, horizontal-up)

In the Intra_16x16 prediction mode class there are a total of 4 modes: vertical, horizontal, DC, and plane. The vertical and horizontal modes are the easiest ones; the prediction is done by copying the upper or left pixel values directly. In DC mode, the mean value of all the upper and left neighboring pixel values is calculated and the result is assigned to all the pixels in the macroblock. The plane prediction mode is the most complex one. The formula for luma samples is given as follows:

    pred[x, y] = Clip1( ( a + b * (x - 7) + c * (y - 7) + 16 ) >> 5 ),  x, y = 0..15

with

    a = 16 * ( p[-1, 15] + p[15, -1] )
    b = ( 5 * H + 32 ) >> 6,  H = sum over x' = 0..7 of (x' + 1) * ( p[8 + x', -1] - p[6 - x', -1] )
    c = ( 5 * V + 32 ) >> 6,  V = sum over y' = 0..7 of (y' + 1) * ( p[-1, 8 + y'] - p[-1, 6 - y'] )

where p[x, -1] and p[-1, y] are the neighboring samples above and to the left of the macroblock.

Fig. 2.6 Intra_16x16 prediction modes: vertical, horizontal, DC, plane

For the luma samples of a macroblock, both the Intra_4x4 and the Intra_16x16 prediction modes are valid. For chroma samples, only the 4 modes of the Intra_16x16 prediction class are valid, and their formulas differ slightly in parameters from those for luma samples.

2.2.6 Motion compensation

In the motion compensation process, each macroblock can be split into 4 types of partitions: 16x16, 8x16, 16x8, and 8x8. If the macroblock is split into 8x8 partitions, each 8x8 partition (Sub-Macroblock) can be further split into 4 types of partitions: 8x8, 4x8, 8x4, and 4x4. This hierarchical macroblock partitioning gives flexibility to the motion compensation process.

Fig. 2.7 Macroblock partitions (16x16, 8x16, 16x8, 8x8) and Sub-Macroblock partitions (8x8, 4x8, 8x4, 4x4)

The precision of motion vectors is up to 1/4 pixel. Fig. 2.8 shows an example of a motion vector equal to (+1.50, -0.75).

Fig. 2.8 Up to 1/4 motion vector resolution ( mv=(+1.50, -0.75) )

The motion compensation process requires an interpolation process for sub-pixel values. As Fig. 2.9 shows, for interpolating pixels at half-sample motion vector precision, a 6-tap interpolator is used. For example, pixel "b" is calculated by

    b = Clip1( ( E - 5*F + 20*G + 20*H - 5*I + J + 16 ) / 32 )

For interpolating pixels at quarter-sample motion vector precision, a 2-tap interpolator is used: the quarter sample is the rounded average of the two nearest integer- or half-sample values. For example, pixel "n" is calculated as the rounded average of two neighboring samples, one of which is the sample "f".

Fig. 2.9 Interpolation for pixel values
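A C sketch of this interpolation is shown below; the 6-tap weights (1, -5, 20, 20, -5, 1) and the rounding follow the standard's luma interpolation, and the pointer convention (ref pointing at the left integer sample of the half-pel position) is an assumption for illustration.

    /* Half-sample interpolation with the 6-tap filter (1, -5, 20, 20, -5, 1).
     * ref points at the integer sample immediately to the left of the
     * horizontal half-pel position "b". */
    static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

    int half_pel_horizontal(const unsigned char *ref)
    {
        int sum = ref[-2] - 5 * ref[-1] + 20 * ref[0]
                + 20 * ref[1] - 5 * ref[2] + ref[3];
        return clip255((sum + 16) >> 5);
    }

    /* Quarter-sample positions: rounded average of two neighbouring
     * integer/half samples. */
    int quarter_pel(int sample_a, int sample_b)
    {
        return (sample_a + sample_b + 1) >> 1;
    }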

The motion vector MV is calculated by adding the MVD (motion vector difference) to the MVP (motion vector prediction). The MVD is decoded from the bit-stream; the MVP is calculated from the motion vectors of neighboring blocks.
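For the common case the MVP is the component-wise median of the motion vectors of the left, upper, and upper-right neighboring blocks; a C sketch is given below, with the special cases (16x8/8x16 partitions, unavailable neighbors) omitted.

    typedef struct { int x, y; } mv_t;

    /* Median of three values. */
    static int median3(int a, int b, int c)
    {
        int mn = a < b ? (a < c ? a : c) : (b < c ? b : c);
        int mx = a > b ? (a > c ? a : c) : (b > c ? b : c);
        return a + b + c - mn - mx;
    }

    /* MV = MVP + MVD, with MVP taken as the median of the neighbouring motion
     * vectors A (left), B (above) and C (above-right). */
    mv_t predict_mv(mv_t a, mv_t b, mv_t c, mv_t mvd)
    {
        mv_t mv;
        mv.x = median3(a.x, b.x, c.x) + mvd.x;
        mv.y = median3(a.y, b.y, c.y) + mvd.y;
        return mv;
    }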

2.2.7 De-blocking filter

Like MPEG-2, the H.264/AVC system is a block-based video coding system. Although block-based systems let the discrete cosine transform take advantage of spatial correlation and let motion-compensated prediction improve the compression ratio, their disadvantage is the discontinuity at block boundaries, known as the blocking effect, which is caused by quantization loss. Moreover, the blocking effect propagates from frame to frame due to motion compensation. Thus, a de-blocking filter is needed, and one is included in the H.264 standard as an in-loop filter.

As Fig. 2.10 shows, the edge filtering order defined in the standard is a, b, c, d, e, f, g, then h. For a given 4x4 block, any filtering order that processes its left, right, upper, and lower edges in that relative order is standard compliant.

Fig. 2.10 Edge filtering order in a macroblock (edges a, b, c, d, then e, f, g, h)

The filtering process for a given boundary is performed by an interpolator. Each filtering operation changes at most 3 pixel values on each side of the boundary. The choice of filtering outcome depends on the boundary strength and on the gradient of the image samples across the boundary. The boundary strength bS is in the range 0 to 4, from no filtering to the strongest filtering, and is determined by the quantiser, the coding modes of the neighboring blocks, and the gradient of the image samples.

Fig. 2.11 Adjacent pixels p3, p2, p1, p0 | q0, q1, q2, q3 across horizontal and vertical boundaries
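A C sketch of the basic filtering operation for 0 < bS < 4 on one line of samples across an edge is given below; it follows the standard's delta computation and clipping, while the alpha/beta activity checks, the p1/q1 adjustments, and the bS = 4 strong filter are omitted.

    /* Basic deblocking of the p0/q0 pair for 0 < bS < 4.  tc is the clipping
     * threshold derived from bS and the quantiser. */
    static int clip3(int lo, int hi, int x) { return x < lo ? lo : (x > hi ? hi : x); }

    void filter_edge_sample(unsigned char *p0, unsigned char *q0, int p1, int q1, int tc)
    {
        int delta = clip3(-tc, tc, ((((*q0 - *p0) << 2) + (p1 - q1) + 4) >> 3));
        *p0 = (unsigned char)clip3(0, 255, *p0 + delta);
        *q0 = (unsigned char)clip3(0, 255, *q0 - delta);
    }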

Chapter 3

System Design of MPEG-2 and H.264/AVC Decoder

In this chapter, we present several design techniques from the system point of view, including the pipeline scheme, the synchronization problem and its solution, the decoding order, and power-saving techniques.

3.1 MPEG-2 and H.264/AVC Combined System Decoding Flow

Fig. 3.1 shows our MPEG-2/H.264 combined decoder diagram. The inputs to this decoder are the video bit-stream and a video type signal. The video type signal tells the decoder which type of video bit-stream is being fed in.

For an H.264 video bit-stream, the H.264 syntax parser first analyzes the bit-stream syntax, stores the system parameters into system-wide shared registers, and sends the bit-stream down the residual path (CAVLC, 4x4 scaling, 4x4 IDCT) and the prediction path (H.264 intra predictor, H.264 motion compensator). The two paths are summed together with the help of the synchronizer, and the loopfilter then processes the summed pixel values and outputs them both to the frame buffer and to the display.

For an MPEG-2 video bit-stream, the flow is similar to the H.264 decoding flow: the MPEG-2 syntax parser first analyzes the bit-stream, stores the system parameters into the system-wide shared registers, and sends the bit-stream through the MPEG-2 VLC decoder, the 8x8 inverse quantizer, the 8x8 IDCT, and the MPEG-2 motion compensator; an optional MPEG-2 post filter sits at the end of the decoding flow.

For hardware sharing, we share the registers in the syntax parsers, design a combined CAVLC/VLC decoder for entropy decoding, a combined H.264/MPEG-2 motion compensator, a synchronizer for both systems, content memory and frame buffer for both systems, and a de-blocking filter for both systems, which functions as an in-loop filter for the H.264 system and as a post-processing filter for the MPEG-2 system.

Fig. 3.1 MPEG-2/H.264 combined decoder diagram (CAVLC (H.264) / VLC (MPEG-2) combined decoder, H.264 intra predictor, residual path, loopfilter / MPEG-2 post filter, single-port (Nx2)x32 memory, and off-chip frame buffer)

3.2 Hybrid 4x4-Block Level Pipeline with Instantaneous Switching Scheme for H.264/AVC Decoder

3.2.1 Hybrid 4x4-Block Level Pipeline Architecture

The 4x4 block is the smallest group of pixels that the H.264/AVC standard adopts. We can see from the standard that a 4x4 Inverse Discrete Cosine Transform (IDCT), a 4x4-block-based inverse scanning process, and a 4x4 inverse quantization matrix for rescaling are required in decoding an H.264/AVC video sequence. Moreover, the smallest intra prediction unit is the 4x4 block, and so is the smallest unit of the motion compensation process. Thus, in our H.264/AVC decoder design, compared with the conventional macroblock-level pipelining architectures [1] [6], our 4x4-block-level pipelining architecture is more suitable for the 4x4-block-based H.264/AVC system.

Comparing macroblock-level (16x16), block-level (8x8), and 4x4-sub-block-level pipeline parallelism, a trade-off exists between processing cycles and buffer cost. For the processing-cycle issue, referring to Fig. 3.2, we can see that 4x4-sub-block-level pipeline parallelism requires more processing cycles than macroblock-level pipeline parallelism. Although this penalty has to be paid by 4x4-sub-block-level pipeline parallelism, the saving in the required buffer storage makes it worthwhile.

Fig. 3.2 Additional processing cycles required for 4x4-sub-block-level pipeline parallelism

Compared with macroblock-level (16x16) and block-level (8x8) pipeline parallelism, the unit of data processed in each stage of 4x4-sub-block-level pipeline parallelism is much smaller (4x4), so only 4x4-sub-block-sized buffer storage is needed. As Table 3.1 shows, the three different parallelisms exhibit the trade-off between buffer cost and processing cycles. For 4x4-sub-block-level pipeline parallelism, although 1.26 times the processing cycles of macroblock-level pipeline parallelism are required, 15/16 of the buffer storage can be saved.

Table 3.1: Trade-off between processing cycles and buffer cost

    Parallelism         Unit of Data   Buffer Cost   Processing Cycles
    Macroblock-Level    16x16          x16           M cycles/MB
    Block-Level         8x8            x4            1.19*M cycles/MB
    Sub-Block-Level     4x4            x1            1.26*M cycles/MB

Moreover, besides the saving in storage cost, the large amount of power consumed by these buffers, which are active all the time, can be greatly reduced as well. As Table 3.3 shows, the 4x4-sub-block-level storage buffers in the CAVLC and the IDCT consume 1.453 mW and 0.864 mW at a clock frequency of 100 MHz, which together contribute 2.86% of the total power (81.072 mW). If macroblock-level buffers were used instead, the power of these storage buffers would be 23.251 mW and 13.824 mW, 16 times that of the 4x4-sub-block-level pipeline parallelism case.

Table 3.3 Power dissipated by buffers between pipeline stages

                          Storage buffer in CAVLC        Storage buffer in IDCT
    Parallelism           Num. of regs      Power        Num. of regs      Power
    Macroblock-Level      16x16x8 (bits)    23.251 mW    16x16x18 (bits)   13.824 mW
    Block-Level           8x8x8 (bits)      5.813 mW     8x8x18 (bits)     3.456 mW
    Sub-Block-Level       4x4x8 (bits)      1.453 mW     4x4x18 (bits)     0.864 mW

Although we save storage buffer cost and the associated power by adopting 4x4-sub-block-level pipeline parallelism, this parallelism cannot be applied to some other modules in the decoding flow, such as the motion compensator and the loopfilter, because of their macroblock-level characteristics.

The motion compensator must support the inter prediction process for several block sizes, from 4x4, 4x8, 8x4, 8x8, 16x8, and 8x16 up to 16x16. It is hard to divide the inter prediction process for block sizes other than 4x4 into several 4x4-sub-block-sized inter prediction processes, so we choose to keep the traditional macroblock-level pipeline parallelism for the motion compensation stage.

The in-loop filtering operation, i.e., the loopfilter, is also hard to divide into several identical 4x4-sub-block filtering processes, because the neighboring 4x4 sub-blocks it has to fetch are irregular with respect to the inverse scanning sequence. In contrast, the filtering process is almost identical at the macroblock level. Thus, we also choose macroblock-level pipeline parallelism for the loopfilter.

In our overall pipeline design, we combine 4x4-sub-block-level pipeline parallelism with macroblock-level pipeline parallelism into a hybrid pipeline scheme that best suits each module. The pipeline parallelism applied to each decoding module is summarized in Table 3.4.

Table 3.4 Summary of pipeline parallelism applied

    Module               Pipeline parallelism
    Intra predictor      4x4-sub-block-level
    CAVLC                4x4-sub-block-level
    De-quantizer         4x4-sub-block-level
    IDCT                 4x4-sub-block-level
    Motion compensator   Macroblock-level
    Loopfilter           Macroblock-level

3.2.2 Instantaneous Switching Scheme

We also apply an instantaneous switching scheme in our 4x4-sub-block-level pipeline design; that is, we switch the pipeline stage as soon as possible: as soon as all pipelined modules have completed their work, the pipeline is switched to the next stage instantaneously. With this scheme, any pipelined module with an especially long processing time becomes the bottleneck of the whole decoding system, because the pipeline stage can only be switched once all pipelined modules have finished. All the other pipelined modules must then idle and wait for the slowest module, and the bubbles induced in this situation significantly degrade the overall system throughput. Thus, we try to balance the cycle count required by each module, so that the idle time of the pipelined modules (CAVLC, de-quantization, IDCT, etc.) is minimized and the instantaneous switching scheme can be of great help in maximizing system throughput.

