CHAPTER 1 INTRODUCTION
Fig. 1.7 Hierarchical structure of H.264 video bit-stream (NAL units, each with an NAL unit header, carry RBSP payloads such as the SPS, the PPS, and the slice layer; a slice consists of a slice header and slice data; the slice data contains macroblocks with macro-block/sub-macroblock prediction and residual data, built from syntax elements)
1.4 Thesis Organization
This thesis is organized as follows. Chapter 2 describes the decoding flows of MPEG-2 and H.264/AVC. Chapter 3 presents the system-level design considerations and system-level schemes used in this work, such as the pipeline architecture, decoding order, system synchronization, low-power mode exploration, and low-power design between modules. Chapter 4 then details the architecture design of each functional block. Finally, the implementation details are presented in Chapter 5, and the conclusion and summary in Chapter 6.
Chapter 2
Overview of MPEG-2 and H.264/AVC Decoding Flow
This chapter gives an overview of the MPEG-2 and H.264/AVC decoding flows. Although there are some differences between the two decoding flows, they share similarities such as the inverse discrete cosine transform, inverse quantization, and motion compensation. From the system point of view, making good use of every functional block that suits both systems is an important issue and a key innovation in designing a multimode video decoder.
2.1 Overview of MPEG-2 Decoding Flow
The decoding process is strictly defined in the standard. With the exception of the Inverse Discrete Cosine Transform (IDCT), the decoding process is defined such that all decoders shall produce numerically identical results. As Fig. 1.2 shows, the decoding process mainly includes variable length decoding, the inverse scanning process, inverse quantization, the inverse DCT, and motion compensation.
2.1.1 Variable length decoding
The DC coefficients are separated from the other coefficients. For DC coefficients, a predictor is used for prediction. The predictor shall be reset to a certain value at the start of a slice, whenever a non-intra macroblock is decoded, or whenever a macroblock is skipped. The differential value (dc_dct_differential) is coded in the bit-stream, so the decoder can calculate the DC coefficient (QFS[0]) by

    QFS[0] = dc_dct_pred[cc] + dct_diff
    dc_dct_pred[cc] = QFS[0]

where dc_dct_pred[cc] are the values of the three predictors for Y (cc = 0), Cb (cc = 1), and Cr (cc = 2), and dct_diff is the value recovered from dc_dct_differential.
For the other coefficients, the values of "run" and "level" are decoded by look-ups in two VLC tables. The coefficients in a block can then be recovered by the following run-length decoding process:

    n = 0;
    while ( <not end of block> ) {
        <decode run and signed_level>
        for ( m = 0; m < run; m++ ) {
            QFS[n] = 0;
            n++;
        }
        QFS[n] = signed_level;
        n++;
    }
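The run-length expansion above can be sketched in C. This is an illustrative sketch, not the decoder's actual implementation; the (run, level) pair is assumed to have already been decoded from the VLC tables.

```c
/* Expand one decoded (run, level) pair into the coefficient list QFS,
 * starting at index n: `run` zero coefficients followed by the signed
 * level. Returns the next free index. Illustrative sketch only. */
static int expand_pair(int qfs[64], int n, int run, int level)
{
    for (int m = 0; m < run && n < 64; m++)
        qfs[n++] = 0;               /* run of zero coefficients */
    if (n < 64)
        qfs[n++] = level;           /* then the signed level    */
    return n;
}
```

Calling this once per decoded pair, starting after the DC coefficient, fills the 64-entry list QFS.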
2.1.2 Inverse scanning process
After all 64 coefficients have been decoded by the Huffman run-length VLC decoder described above, the inverse scanning process maps these coefficients into a single 8x8 block. Two scan patterns, selected by the parameter "alternate_scan", can be used. Fig. 2.1 shows these two scan patterns.
(a) alternate_scan = 0:

     0  1  5  6 14 15 27 28
     2  4  7 13 16 26 29 42
     3  8 12 17 25 30 41 43
     9 11 18 24 31 40 44 53
    10 19 23 32 39 45 52 54
    20 22 33 38 46 51 55 60
    21 34 37 47 50 56 59 61
    35 36 48 49 57 58 62 63

(b) alternate_scan = 1:

     0  4  6 20 22 36 38 52
     1  5  7 21 23 37 39 53
     2  8 19 24 34 40 50 54
     3  9 18 25 35 41 51 55
    10 17 26 30 42 46 56 60
    11 16 27 31 43 47 57 61
    12 15 28 32 44 48 58 62
    13 14 29 33 45 49 59 63

Fig. 2.1 Inverse scan patterns: (a) alternate_scan = 0, (b) alternate_scan = 1
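As a sketch, the inverse scanning step can be expressed as a table look-up; scan0 below reproduces the alternate_scan = 0 pattern of Fig. 2.1(a).

```c
/* Inverse scan for alternate_scan = 0: scan0[v][u] is the index into the
 * 1-D coefficient list QFS whose value lands at block position (v, u),
 * as in Fig. 2.1(a). Sketch only. */
static const int scan0[8][8] = {
    { 0,  1,  5,  6, 14, 15, 27, 28},
    { 2,  4,  7, 13, 16, 26, 29, 42},
    { 3,  8, 12, 17, 25, 30, 41, 43},
    { 9, 11, 18, 24, 31, 40, 44, 53},
    {10, 19, 23, 32, 39, 45, 52, 54},
    {20, 22, 33, 38, 46, 51, 55, 60},
    {21, 34, 37, 47, 50, 56, 59, 61},
    {35, 36, 48, 49, 57, 58, 62, 63},
};

/* Index of the coefficient placed at block position (v, u). */
static int scan_index(int v, int u) { return scan0[v][u]; }

/* Map the 1-D coefficient list into an 8x8 block. */
static void inverse_scan(const int qfs[64], int qf[8][8])
{
    for (int v = 0; v < 8; v++)
        for (int u = 0; u < 8; u++)
            qf[v][u] = qfs[scan_index(v, u)];
}
```

The alternate_scan = 1 case is identical except for the table contents.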
2.1.3 Inverse quantization
As Fig. 2.2 shows, the inverse quantization process can be divided into three parts: arithmetic, saturation, and mismatch control. In the arithmetic part, the DC coefficient is separated from all the other coefficients. The parameter "intra_dc_precision" determines the multiplication factor for intra DC coefficients, ranging from 1 to 8. For the other coefficients, the multiplication factor is determined by a weighting matrix W[w][v][u] and the quantiser scale. W[w][v][u] can hold either encoder-defined values or default values. The quantiser scale is obtained by table look-up using the parameters "quantiser_scale_code" and "q_scale_type" from the bit-stream.

In the saturation part, the scaled coefficients F''[v][u] are saturated to F'[v][u], which lie in the range [-2048, +2047]. In mismatch control, a correction is made to just one coefficient, F[7][7], by adding or subtracting one so that the sum of all coefficients becomes odd.
Fig. 2.2 Inverse quantization process
In summary, the inverse quantization process is any process numerically equivalent to

    F''[v][u] = ( ( 2 × QF[v][u] + k ) × W[w][v][u] × quantiser_scale ) / 32

where k = 0 for intra blocks and k = Sign( QF[v][u] ) for non-intra blocks, followed by the saturation and mismatch control described above.
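A per-coefficient sketch of the arithmetic and saturation parts for a non-intra block follows; mismatch control, which operates on the whole block, is omitted here.

```c
/* Sign function: +1, 0, or -1. */
static int sgn(int x) { return (x > 0) - (x < 0); }

/* Inverse-quantize one non-intra coefficient:
 * F'' = ((2*QF + Sign(QF)) * W * quantiser_scale) / 32,
 * then saturate to [-2048, +2047]. Sketch only. */
static int iq_coeff(int qf, int w, int quantiser_scale)
{
    long t = (2L * qf + sgn(qf)) * w * quantiser_scale / 32;
    if (t >  2047) t =  2047;   /* saturation part */
    if (t < -2048) t = -2048;
    return (int)t;
}
```

For intra blocks the k term is simply 0 instead of Sign(QF).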
2.1.4 Inverse Discrete-Cosine-Transform (IDCT)
The formula of the Inverse Discrete-Cosine-Transform is as follows:

    f(y, x) = (2 / N) · Σ_u Σ_v C(u) C(v) F(v, u) cos( (2x + 1)uπ / 2N ) cos( (2y + 1)vπ / 2N )

with C(u) = 1/√2 for u = 0 and C(u) = 1 otherwise, where x, y, u, v = 0, …, N−1 and N = 8. The transformed values shall be saturated to the range [-256, +255].
2.1.5 Motion compensation
As Fig. 2.3 shows, the motion compensation process consists of several parts. From the bit-stream, parameters such as "f_code", "motion_code", and "motion_residual" are extracted. With these parameters, a vector decoding module together with the vector predictors (PMV[r][s][t]) decodes the motion vector vector'[r][s][t] following the vector decoding process defined in the standard.

After scaling with certain scaling factors, vector[r][s][t] is sent to the address generator of the frame buffers. The pixel values read from the frame buffers are then fed through a half-pel prediction filter. At the final stage, the decoded pixels are calculated by adding the combined predictions p[y][x] to the residual values f[y][x]. A saturation process is needed to clamp the result.
Fig. 2.3 Motion compensation process (vector decoding with the vector predictors PMV[r][s][t], scaling for color components, additional dual-prime arithmetic, prediction field/frame selection, frame buffer addressing, half-pel prediction filtering, and combining the predictions p[y][x] with the residuals f[y][x], followed by saturation, to produce the decoded pixels)
2.2 Overview of H.264/AVC Decoding Flow
The H.264/AVC decoding flow is strictly specified in the standard such that all decoders shall produce numerically identical results. As Fig. 1.6 shows, the decoding process contains entropy decoding, reordering, inverse quantization, the inverse integer discrete cosine transform, intra prediction, motion compensation, and the loopfilter. In this thesis we consider the decoding process of the baseline profile.
2.2.1 Entropy decoding
The H.264 bit-stream mainly contains two types of content: parameters and coefficients. The parameters are mainly coded with the Exp-Golomb code, a Universal Variable Length Code (UVLC). The residual coefficients are coded with Context-based Adaptive Variable Length Coding (CAVLC).
The entropy decoding process for the Exp-Golomb code is as follows:

    leadingZeroBits = -1;
    for ( b = 0; !b; leadingZeroBits++ )
        b = read_bits( 1 );
    codeNum = 2^leadingZeroBits - 1 + read_bits( leadingZeroBits );
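The pseudocode above maps directly to a small ue(v) decoder. The MSB-first bit reader below is an illustrative helper, not part of the standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader over a byte buffer (illustrative helper). */
typedef struct { const uint8_t *buf; size_t pos; } BitReader;

static uint32_t read_bits(BitReader *br, int n)
{
    uint32_t v = 0;
    while (n-- > 0) {
        v = (v << 1) | ((br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u);
        br->pos++;
    }
    return v;
}

/* Unsigned Exp-Golomb decode, following the pseudocode above:
 * codeNum = 2^leadingZeroBits - 1 + read_bits(leadingZeroBits). */
static uint32_t read_ue(BitReader *br)
{
    int leadingZeroBits = -1;
    for (uint32_t b = 0; !b; leadingZeroBits++)
        b = read_bits(br, 1);
    return (1u << leadingZeroBits) - 1u + read_bits(br, leadingZeroBits);
}

/* Convenience: decode the first ue(v) code in a buffer. */
static uint32_t first_ue(const uint8_t *buf)
{
    BitReader br = { buf, 0 };
    return read_ue(&br);
}
```

For example, the bit strings "1", "010", "011", and "00100" decode to 0, 1, 2, and 3, respectively.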
The entropy decoding process for CAVLC is much more complicated than that for the Exp-Golomb code. Many different tables are used to decode parameters such as "TrailingOnes", "TotalCoeff", "level_prefix", "total_zeros", and "run_before". For some parameters, such as "TrailingOnes" and "TotalCoeff", more than one table is used. The code is called "Context-based Adaptive" because the CAVLC decoder has to choose the correct table for a given parameter according to the number of coefficients in the neighboring blocks (the left and upper 4x4 blocks).

After decoding the intermediate parameters "trailing_ones_sign_flag", "level_prefix", "level_suffix", "total_zeros", and "run_before", the run and level values are calculated by the procedure defined in the standard, and a run-level decoder then recovers the 16 coefficients of the 4x4 block.
2.2.2 Inverse scanning process
Input to this functional block is a list of 16 coefficients decoded by CAVLC. These 16 coefficients are then inverse scanned with a Zig-Zag scan pattern to form a 4x4 block. Fig. 2.4 shows the Zig-Zag scanning pattern.
0 1 5 6
2 4 7 12
3 8 11 13
9 10 14 15
Fig. 2.4 Zig-Zag scan
2.2.3 Inverse quantization & inverse Hadamard transform
In the inverse quantization process, the operations on DC values are separated from the other coefficients. Because the 4x4 block is the basic unit in H.264 systems, there are 16 luma DC coefficients in total in a macroblock. These 16 DC coefficients are first transformed through the 4x4 inverse Hadamard transform

    f = H · c · H^T,   H = | 1  1  1  1 |
                           | 1  1 -1 -1 |
                           | 1 -1 -1  1 |
                           | 1 -1  1 -1 |

where c is the 4x4 array of decoded luma DC coefficients. The result of the inverse Hadamard transform is then scaled with the given QPY; for QPY ≥ 12,

    dcY(i,j) = ( f(i,j) × LevelScale( QPY % 6, 0, 0 ) ) << ( QPY/6 − 2 )

with a corresponding rounded right shift for QPY < 12.

For the Cb and Cr components of a macroblock, the 4 DC coefficients are first transformed through the 2x2 inverse transform

    f = | 1  1 | · c · | 1  1 |
        | 1 -1 |       | 1 -1 |

and after the inverse transform, scaling is performed as

    dcC(i,j) = ( f(i,j) × LevelScale( QPC % 6, 0, 0 ) ) << ( QPC/6 − 1 )

for QPC ≥ 6, and with a right shift by one otherwise.

For coefficients other than DC, the scaling function is

    w'(i,j) = ( w(i,j) × LevelScale( QP % 6, i, j ) ) << ( QP/6 )
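The luma DC inverse Hadamard transform can be sketched with the usual butterfly. had4_at computes one component of the 1-D transform (a scalar interface is chosen here purely for easy checking); the 2-D transform applies it to the columns and then the rows.

```c
/* 1-D inverse Hadamard on (c0, c1, c2, c3); returns component i of H * c
 * with H rows (1,1,1,1), (1,1,-1,-1), (1,-1,-1,1), (1,-1,1,-1). */
static int had4_at(int c0, int c1, int c2, int c3, int i)
{
    int a = c0 + c2, b = c0 - c2;
    int d = c1 + c3, e = c1 - c3;
    int f[4] = { a + d, b + e, b - e, a - d };
    return f[i];
}

/* 2-D inverse Hadamard f = H * c * H^T on the 4x4 luma DC block
 * (H is symmetric, so columns-then-rows with the same butterfly works). */
static void inverse_hadamard_4x4(const int c[4][4], int f[4][4])
{
    int t[4][4];
    for (int j = 0; j < 4; j++)          /* transform columns */
        for (int i = 0; i < 4; i++)
            t[i][j] = had4_at(c[0][j], c[1][j], c[2][j], c[3][j], i);
    for (int i = 0; i < 4; i++)          /* then rows */
        for (int j = 0; j < 4; j++)
            f[i][j] = had4_at(t[i][0], t[i][1], t[i][2], t[i][3], j);
}
```

The QP-dependent scaling step described above is applied to the result of this transform.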
2.2.4 Inverse Integer Discrete Cosine Transform
The Inverse Discrete Cosine Transform in the H.264 system is much simpler than the traditional Inverse Discrete Cosine Transform: the transform coefficients of the 2-D inverse transform are integers and halves, where the halves are implemented with right shifts. The inverse transform matrix is as follows:

    Ci = | 1    1     1    1/2 |
         | 1   1/2   -1   -1   |
         | 1  -1/2   -1    1   |
         | 1   -1     1   -1/2 |

and the 2-D inverse transform is computed as Ci · Y · Ci^T, followed by the rounding defined in the standard.
2.2.5 Intra prediction
The intra prediction process is a new prediction process that the MPEG-2 system lacks. There are two classes of intra prediction modes: Intra_4x4 and Intra_16x16.
There are nine sub-modes in Intra_4x4 prediction. As Fig. 2.5 shows, these nine modes are vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left, and horizontal-up. In DC mode, the intra prediction process calculates the mean value of the neighboring pixel values. All the other modes are directional. For the directional modes, the prediction values can all be written in the form

    Pred = ( P0 + P1 + P2 + P3 + 2 ) / 4

where P0, P1, P2, and P3 are neighboring pixel values, possibly repeated, chosen according to the mode and the position within the 4x4 block. For example, in mode 3 (diagonal down-left), the upper-left corner is predicted by ((A + 2B + C) + 2)/4, which is equivalent to ((A + B + B + C) + 2)/4; and for the upper-right corner in mode 5 (vertical-right), the prediction value is calculated by ((2C + 2D) + 2)/4, which is equivalent to ((C + C + D + D) + 2)/4.

Note that the intra prediction process and the residual adding process must be performed iteratively: for a given 4x4 block, the neighboring pixel values (upper and left) used for intra prediction must already have their residual values added.
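The generic 4-tap rounding average and the mode-3 corner example from the text can be sketched as follows.

```c
/* Generic directional Intra_4x4 prediction tap:
 * Pred = (P0 + P1 + P2 + P3 + 2) / 4. */
static int pred4(int p0, int p1, int p2, int p3)
{
    return (p0 + p1 + p2 + p3 + 2) >> 2;
}

/* Mode 3 (diagonal down-left) upper-left corner:
 * (A + 2B + C + 2) / 4 == (A + B + B + C + 2) / 4. */
static int pred_ddl_corner(int A, int B, int C)
{
    return pred4(A, B, B, C);
}
```

Each directional mode just selects different (possibly repeated) neighbors as the four taps.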
Fig. 2.5 Intra_4x4 prediction modes: 0 (vertical), 1 (horizontal), 2 (DC), 3 (diagonal down-left), 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left), 8 (horizontal-up)
In the Intra_16x16 prediction mode class, there are four modes in total: vertical, horizontal, DC, and plane. The vertical and horizontal modes are the easiest ones; the prediction is done by copying the upper or left pixel values directly. In DC mode, the mean value of all the upper and left neighboring pixel values is calculated and assigned to all the pixels in the macroblock. The plane prediction mode is the most complex one. The formula for luma samples is

    pred[y][x] = Clip1( ( a + b × (x − 7) + c × (y − 7) + 16 ) / 32 )

where a = 16 × ( p[−1][15] + p[15][−1] ), and b and c are horizontal and vertical gradient terms computed from the upper and left neighboring samples as defined in the standard (p[−1][x] and p[y][−1] denote the upper and left neighboring samples, respectively).
Fig. 2.6 Intra_16x16 prediction modes: vertical, horizontal, DC, and plane
For luma samples in a macroblock, both the Intra_4x4 and Intra_16x16 prediction modes are valid. For chroma samples, only the four modes of the Intra_16x16 class are valid, and their formulas use slightly different parameters than the luma formulas.
2.2.6 Motion compensation
In the motion compensation process, each macroblock can be split into four types of partitions: 16x16, 8x16, 16x8, and 8x8. If the macroblock is split into 8x8 partitions, each 8x8 partition (sub-macroblock) can be further split into four types of partitions: 8x8, 4x8, 8x4, and 4x4. This hierarchical macroblock partitioning gives flexibility to the motion compensation process.
Fig. 2.7 Macroblock and Sub-Macroblock partitions: (a) macroblock partitions 16x16, 8x16, 16x8, and 8x8; (b) sub-macroblock partitions 8x8, 4x8, 8x4, and 4x4
The precision of motion vectors is up to a quarter pixel. Fig. 2.8 shows an example of a motion vector equal to (+1.50, −0.75).
Fig. 2.8 Quarter-pel motion vector resolution ( mv = (+1.50, −0.75) )
The motion compensation process requires an interpolation process for sub-pel values. As Fig. 2.9 shows, half-pel positions are interpolated with a 6-tap filter. For example, pixel "b" is calculated from the six horizontally adjacent integer samples E, F, G, H, I, and J by

    b = Clip1( ( E − 5F + 20G + 20H − 5I + J + 16 ) / 32 )

Quarter-pel positions are interpolated with a 2-tap averaging filter: each quarter-pel sample is the rounded average of its two nearest integer- or half-pel samples. For example, pixel "n" in Fig. 2.9 is the rounded average of two such samples, one of them being "f".
Fig. 2.9 Interpolation for pixel values
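The half-pel 6-tap filter quoted above can be sketched in C; Clip1 clamps to the 8-bit sample range.

```c
/* Clip to the 8-bit sample range [0, 255]. */
static int clip1(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

/* Half-pel sample b from six horizontally adjacent integer samples:
 * b = Clip1((E - 5F + 20G + 20H - 5I + J + 16) / 32). */
static int halfpel_6tap(int E, int F, int G, int H, int I, int J)
{
    return clip1((E - 5*F + 20*G + 20*H - 5*I + J + 16) >> 5);
}
```

Note that a flat region (all samples equal) passes through the filter unchanged, since the tap weights sum to 32.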
The motion vector MV is calculated by adding the MVD (motion vector difference) to the MVP (motion vector predictor). The MVD is decoded from the bit-stream, while the MVP is calculated from the motion vectors of neighboring blocks.
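In the basic case, the MVP is the component-wise median of the motion vectors of the left, above, and above-right neighboring blocks; the sketch below assumes that case and omits the special cases defined in the standard (unavailable neighbors, 16x8 and 8x16 partitions).

```c
/* Median of three values. */
static int median3(int a, int b, int c)
{
    int hi = a > b ? a : b;
    int lo = a < b ? a : b;
    return c > hi ? hi : (c < lo ? lo : c);
}

/* One component of MV = MVP + MVD, with MVP the median of the
 * corresponding components of neighbors A, B, and C. Sketch only. */
static int mv_component(int mv_a, int mv_b, int mv_c, int mvd)
{
    return median3(mv_a, mv_b, mv_c) + mvd;
}
```

The same function is applied independently to the horizontal and vertical components.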
2.2.7 De-blocking filter
Like MPEG-2, the H.264/AVC system is a block-based video coding system. Block-based systems allow the discrete cosine transform to exploit spatial correlation and motion-compensated prediction to improve the compression ratio, but their disadvantage is the discontinuity at block boundaries, known as blocking artifacts, caused by quantization loss. Moreover, the blocking artifacts propagate from frame to frame through motion compensation. Thus a de-blocking filter is needed, and it is included in the H.264 standard as an in-loop filter.
As Fig. 2.10 shows, the edge filtering order defined in the standard is a, b, c, d, then e, f, g, h. For a given 4x4 block, any filtering order is standard-compliant as long as its edges are filtered in the order left, right, upper, and lower.
Fig. 2.10 Edge filtering order in a macroblock: edges a, b, c, d (vertical), then e, f, g, h (horizontal)
The filtering process for a given boundary is performed by an interpolator. Each filtering operation can change at most three pixel values on each side of the boundary. The filtering outcome depends on the boundary strength and on the gradient of the image samples across the boundary. The boundary strength bS is in the range 0 to 4, from no filtering to the strongest filtering, determined by the quantiser, the coding modes of the neighboring blocks, and the gradient of the image samples.
Fig. 2.11 Adjacent pixels p3, p2, p1, p0 and q0, q1, q2, q3 across horizontal and vertical boundaries
Chapter 3
System Design Of MPEG-2 and H.264/AVC Decoder
In this chapter, we present system-level design techniques used in this work: the pipeline scheme, the synchronization problem and its solution, the decoding order, and power-saving techniques.
3.1 MPEG-2 and H.264/AVC Combined System Decoding Flow
Fig. 3.1 shows our MPEG-2/H.264 combined decoder diagram. The inputs to this decoder are the video bit-stream and a video-type signal, which informs the decoder which type of bit-stream is being fed.
For an H.264 video bit-stream, the H.264 syntax parser first analyzes the bit-stream and stores the system parameters into system-wide shared registers. The bit-stream then passes through the residual path (CAVLC, 4x4 scaling, 4x4 IDCT) and the prediction path (H.264 intra predictor, H.264 motion compensator); the two paths are summed with the help of the synchronizer, and a loopfilter processes the summed pixel values before outputting them to both the frame buffer and the display.
For an MPEG-2 video bit-stream the flow is similar: the MPEG-2 syntax parser analyzes the bit-stream, stores the system parameters into the system-wide shared registers, and sends the bit-stream through the MPEG-2 VLC decoder, the 8x8 inverse quantizer, the 8x8 IDCT, and the MPEG-2 motion compensator; an optional MPEG-2 post filter sits at the end of the decoding flow.
For hardware sharing, we share the registers in the syntax parsers and design a combined CAVLC/VLC entropy decoder, a combined H.264/MPEG-2 motion compensator, a synchronizer for both systems, content memory and frame buffers for both systems, and a de-blocking filter for both systems that functions as an in-loop filter for H.264 and as a post-processing filter for MPEG-2.
Fig. 3.1 MPEG-2/H.264 Combined Decoder Diagram (video bit-stream input, CAVLC (H.264) / VLC (MPEG-2) combined decoder, H.264 intra predictor, residual path, loopfilter / MPEG-2 post filter, single-port (Nx2)x32 memory, and off-chip frame buffer)
3.2 Hybrid 4x4-Block Level Pipeline with Instantaneous Switching Scheme for H.264/AVC Decoder
3.2.1 Hybrid 4x4-Block Level Pipeline Architecture
The 4x4 block is the smallest group of pixels that the H.264/AVC standard adopts. The standard requires a 4x4 Inverse Discrete Cosine Transform (IDCT), a 4x4-block-based inverse scanning process, and a 4x4 inverse quantization matrix for rescaling when decoding an H.264/AVC video sequence. Moreover, the smallest intra prediction unit is a 4x4 block, and so is the smallest motion compensation unit. Thus, compared with the conventional macroblock-level pipelining architectures [1] [6], the 4x4-block-level pipelining architecture in our H.264/AVC decoder design is more suitable for the 4x4-block-based H.264/AVC system.
Compared with macroblock-level (16x16) and block-level (8x8) pipeline parallelism, a trade-off exists between processing cycles and buffer cost. Regarding processing cycles, Fig. 3.2 shows that 4x4-sub-block-level pipeline parallelism requires more processing cycles than macroblock-level pipeline parallelism. Although this penalty has to be paid, the saving in buffer storage is worth it.
Fig. 3.2 Additional processing cycles required for 4x4-sub-block-level pipeline parallelism
Compared with macroblock-level (16x16) and block-level (8x8) pipeline parallelism, the unit of data processed in each stage of the 4x4-sub-block-level pipeline is much smaller (4x4), so a 4x4-sized buffer between stages is sufficient. Table 3.1 shows the trade-off between buffer cost and processing cycles for the three parallelisms: the 4x4-sub-block-level pipeline requires 1.26 times the processing cycles of the macroblock-level pipeline, but saves 15/16 of the buffer storage.
Table 3.1 Trade-off between processing cycles and buffer cost

Parallelism         Unit of Data   Buffer Cost   Processing Cycles
Macroblock-Level    16x16          x16           M cycles/MB
Block-Level         8x8            x4            1.19*M cycles/MB
Sub-Block-Level     4x4            x1            1.26*M cycles/MB
Moreover, besides the saving in storage cost, the large amount of power dissipated by these buffers, which are active all the time, can be greatly reduced as well. As Table 3.3 shows, the 4x4-sub-block-level storage buffers in the CAVLC and IDCT modules consume 1.453 mW and 0.864 mW at a clock frequency of 100 MHz, together contributing 2.86% of the total power (81.072 mW). If macroblock-level buffers were used instead, these storage buffers would consume 23.251 mW and 13.824 mW, 16 times the power of the 4x4-sub-block-level case.
Table 3.3 Power dissipated by buffers between pipeline stages

                    Storage buffer in CAVLC      Storage buffer in IDCT
Parallelism         Num. of regs    Power        Num. of regs     Power
Macroblock-Level    16x16x8 bits    23.251 mW    16x16x18 bits    13.824 mW
Block-Level         8x8x8 bits      5.813 mW     8x8x18 bits      3.456 mW
Sub-Block-Level     4x4x8 bits      1.453 mW     4x4x18 bits      0.864 mW
Although adopting 4x4-sub-block-level pipeline parallelism saves storage buffer cost and the associated power, this parallelism cannot be applied to some other modules in the decoding flow, such as the motion compensator and the loopfilter, because of their macroblock-level characteristics. The motion compensator must support inter prediction for several block sizes, from 4x4, 4x8, 8x4, 8x8, 16x8, and 8x16 up to 16x16. It is hard to divide the inter prediction process for block sizes other than 4x4 into several 4x4-sub-block-sized inter prediction processes, so we keep the traditional macroblock-level pipeline parallelism for the motion compensation stage.
The in-loop filtering operation, i.e. the loopfilter, is likewise hard to divide into several identical 4x4-sub-block filtering processes, because the neighboring 4x4 sub-blocks it has to fetch are irregular with respect to the inverse scanning sequence. By contrast, the filtering process is almost identical at the macroblock level, so we also choose macroblock-level pipeline parallelism for the loopfilter.
In our overall pipeline design, we combine 4x4-sub-block-level pipeline parallelism with macroblock-level pipeline parallelism into a hybrid pipeline scheme that best suits each module. The pipeline parallelism applied to each decoding module is summarized in Table 3.4.
Table 3.4 Summary of pipeline parallelism applied

Module               Pipeline parallelism
Intra predictor      4x4-sub-block-level
CAVLC                4x4-sub-block-level
De-quantizer         4x4-sub-block-level
IDCT                 4x4-sub-block-level
Motion compensator   Macroblock-level
Loopfilter           Macroblock-level
3.2.2 Instantaneous Switching Scheme
We also apply an instantaneous switching scheme in our 4x4-sub-block-level pipeline design; that is, we switch the pipeline stage as soon as possible: as soon as all pipelined modules complete their work, the pipeline switches to the next stage instantaneously. Under this scheme, any pipelined module with an especially long processing time becomes the bottleneck of the whole decoding system, because the pipeline stage can be switched only when all pipelined modules have completed their work. All the other pipelined modules must then idle and wait for the slow module, and the bubbles induced in this situation would significantly degrade overall system throughput. Thus, we balance the cycle counts required by the modules so that the idle time of the pipelined modules (CAVLC, de-quantization, IDCT, etc.) is minimized, and the instantaneous switching scheme then helps maximize system throughput.