
CHAPTER 2 OVERVIEW OF MPEG-2 AND H.264/AVC DECODING FLOW

2.2 Overview of H.264/AVC Decoding Flow

2.2.7 De-blocking filter

Like MPEG-2, H.264/AVC is a block-based video coding system. Block-based systems allow the discrete cosine transform to exploit spatial correlation and motion-compensated prediction to improve the compression ratio, but their drawback is the discontinuity at block boundaries, known as the blocking effect, which is caused by quantization loss. Moreover, the blocking effect propagates from frame to frame through motion compensation. A de-blocking filter is therefore needed, and it is included in the H.264 standard as an in-loop filter.

As Fig. 2.10 shows, the edge filtering order defined in the standard is a, b, c, d, e, f, g, then h. For a given 4x4 block, any filtering order that processes its edges in the order left, right, upper, then lower is standard compliant.

Fig. 2.10 Edge filtering order in a macroblock

The filtering of a boundary is performed by an interpolator. Each filtering operation can change at most 3 pixel values on each side of the boundary. The filtering outcome depends on the boundary strength and on the gradient of image samples across the boundary. The boundary strength bS ranges from 0 to 4, from no filtering to the strongest filtering, and is determined by the quantiser, the coding modes of the neighboring blocks, and the gradient of image samples.
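
To make the bS selection concrete, the following C sketch illustrates the decision rules in simplified form. The BlockInfo structure and field names are hypothetical and the rules are abbreviated (multiple reference lists are ignored, for instance), so this is an illustration of the idea rather than the exact logic of this design.

#include <stdlib.h>

/* Block descriptors; fields and names are illustrative only. */
typedef struct {
    int is_intra;   /* block coded in an intra mode             */
    int has_coeff;  /* block contains non-zero residual coeffs. */
    int mv_x, mv_y; /* motion vector in quarter-pel units       */
    int ref_idx;    /* reference picture index                  */
} BlockInfo;

/* Simplified bS decision for the edge between blocks p and q. */
int boundary_strength(const BlockInfo *p, const BlockInfo *q, int is_mb_edge)
{
    if (p->is_intra || q->is_intra)
        return is_mb_edge ? 4 : 3;      /* strongest filtering at intra MB edges */
    if (p->has_coeff || q->has_coeff)
        return 2;
    if (p->ref_idx != q->ref_idx ||
        abs(p->mv_x - q->mv_x) >= 4 ||  /* one full-pel or more apart */
        abs(p->mv_y - q->mv_y) >= 4)
        return 1;
    return 0;                           /* bS = 0: no filtering */
}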

Fig. 2.11 Adjacent pixels to horizontal and vertical boundaries

Chapter 3

System Design of MPEG-2 and H.264/AVC Decoder

In this chapter, we present several design techniques from the system point of view: the pipeline scheme, the synchronization problem and its solution, the decoding ordering, and power saving techniques.

3.1 MPEG-2 and H.264/AVC Combined System Decoding Flow

Fig. 3.1 shows the diagram of our MPEG-2/H.264 combined decoder. Its inputs are the video bit-stream and a video type signal. The video type signal informs the decoder which type of video bit-stream is being fed.

For an H.264 video bit-stream, the H.264 syntax parser first analyzes the bit-stream, stores the system parameters into system-wide shared registers, and dispatches the bit-stream to the residual path (CAVLC, 4x4 scaling, 4x4 IDCT) and the prediction path (H.264 intra predictor, H.264 motion compensator). The outputs of the two paths are summed with the help of a synchronizer, a loopfilter processes the summed pixel values, and the result is output both to the frame buffer and to the display.

For an MPEG-2 video bit-stream, the flow is similar: the MPEG-2 syntax parser analyzes the bit-stream, stores the system parameters into the system-wide shared registers, and sends the bit-stream through the MPEG-2 VLC decoder, the 8x8 inverse quantizer, the 8x8 IDCT, and the MPEG-2 motion compensator, with an optional MPEG-2 post filter at the end of the decoding flow.

For hardware sharing, we share the registers in the syntax parsers and design a combined CAVLC/VLC decoder for entropy decoding, a combined H.264/MPEG-2 motion compensator, a synchronizer for both systems, a content memory and frame buffer for both systems, and a de-blocking filter for both systems, which functions as an in-loop filter for H.264 and as a post-processing filter for MPEG-2.

Fig. 3.1 MPEG-2/H.264 Combined Decoder Diagram

3.2 Hybrid 4x4-Block Level Pipeline with Instantaneous Switching Scheme for H.264/AVC Decoder

3.2.1 Hybrid 4x4-Block Level Pipeline Architecture

The 4x4 block is the smallest group of pixels that the H.264/AVC standard adopts. The standard requires a 4x4 Inverse Discrete Cosine Transform (IDCT), a 4x4-block based inverse scanning process, and a 4x4 inverse quantization matrix for rescaling when decoding an H.264/AVC video sequence. Moreover, the smallest intra prediction unit is a 4x4 block, as is the smallest motion compensation block. Thus, compared with the conventional macroblock-level pipelining architecture [1][6], the 4x4-block level pipelining architecture in our H.264/AVC decoder design is more suitable for the 4x4-block based H.264/AVC system.

Compared with macroblock-level (16x16) and block-level (8x8) pipeline parallelism, there is a trade-off between processing cycles and buffer cost. Regarding processing cycles, Fig. 3.2 shows that 4x4-sub-block-level pipeline parallelism requires additional processing cycles beyond those needed by macroblock-level pipeline parallelism.

Although 4x4-sub-block-level pipeline parallelism has to pay this penalty, the saving in buffer storage makes it worthwhile.

Fig. 3.2 Additional processing cycles required for 4x4-sub-block-level pipeline parallelism

Compared with macroblock-level (16x16) and block-level (8x8) pipeline parallelism, the unit of data processed in each stage of 4x4-sub-block-level pipeline parallelism is much smaller (4x4), so a 4x4-sub-block sized buffer is sufficient. Table 3.1 compares the three parallelisms and shows the trade-off between buffer cost and processing cycles. For 4x4-sub-block-level pipeline parallelism, although 1.26 times the processing cycles of macroblock-level pipeline parallelism are required, 15/16 of the buffer storage can be saved.

Table 3.1: Trade-off between processing cycles and buffer cost

Parallelism         Unit of Data   Buffer Cost   Processing Cycles
Macroblock-Level    16x16          x16           M cycles/MB
Block-Level         8x8            x4            1.19*M cycles/MB
Sub-Block-Level     4x4            x1            1.26*M cycles/MB

Moreover, besides the saving in storage cost, the considerable power consumed by these buffers, which are active all the time, can also be greatly reduced. As Table 3.3 shows, the 4x4-sub-block sized storage buffers in the CAVLC and IDCT consume 1.453 mW and 0.864 mW at a clock frequency of 100 MHz, which together contribute 2.86% of the total power (81.072 mW). If macroblock-level sized buffers were used instead, these storage buffers would consume 23.251 mW and 13.824 mW, 16 times the power of the 4x4-sub-block-level pipeline parallelism case.

Table 3.3: Power dissipated by buffers between pipeline stages

Parallelism         Storage buffer in CAVLC          Storage buffer in IDCT
                    Num. of regs     Power           Num. of regs     Power
Macroblock-Level    16x16x8 bits     23.251 mW       16x16x18 bits    13.824 mW
Block-Level         8x8x8 bits       5.813 mW        8x8x18 bits      3.456 mW
Sub-Block-Level     4x4x8 bits       1.453 mW        4x4x18 bits      0.864 mW

Although adopting 4x4-sub-block-level pipeline parallelism saves storage buffer cost and the associated power, it cannot be applied to some other modules in the decoding flow, such as the motion compensator and the loopfilter, because of their macroblock-level characteristics.

The motion compensator must support inter prediction for several block sizes: 4x4, 4x8, 8x4, 8x8, 16x8, 8x16, and 16x16. It is hard to divide the inter prediction process for block modes other than the 4x4 mode into several 4x4-sub-block-sized inter prediction processes. Therefore, we keep the traditional macroblock-level pipeline parallelism for the motion compensation stage.

The in-loop filtering operation, i.e. the loopfilter, is also hard to divide into several identical 4x4-sub-block filtering processes, because the neighboring 4x4 sub-blocks it has to fetch are irregular with respect to the inverse scanning sequence. In contrast, the filtering process is almost identical at the macroblock level. Thus we also choose macroblock-level pipeline parallelism for the loopfilter.

In our overall pipeline design, we combine 4x4-sub-block-level pipeline parallelism with macroblock-level pipeline parallelism into a hybrid pipeline scheme that best suits each module. The pipeline parallelism applied to each decoding module is summarized in Table 3.4.

Table 3.4: Summary of pipeline parallelism applied

Module               Pipeline parallelism
Intra predictor      4x4-sub-block-level
CAVLC                4x4-sub-block-level
De-quantizer         4x4-sub-block-level
IDCT                 4x4-sub-block-level
Motion compensator   Macroblock-level
Loopfilter           Macroblock-level

3.2.2 Instantaneous Switching Scheme

We also apply an instantaneous switching scheme in our 4x4-sub-block-level pipeline design; that is, we switch the pipeline stage as soon as possible. As soon as all pipelined modules complete their work, the pipeline switches to the next stage instantaneously. Under this scheme, any pipelined module with an especially long processing time becomes the bottleneck of the whole decoding system: the pipeline stage can switch only when all pipelined modules have completed their work, so all the other modules must idle and wait for the slow one, and the resulting bubbles degrade overall system throughput. We therefore try to balance the cycle counts required by the modules, so that the idle time of the pipelined modules (CAVLC, de-quantization, IDCT, etc.) is minimized and the instantaneous switching scheme can maximize system throughput.
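
As a behavioural illustration of the switching rule (not the actual controller hardware), the following C sketch advances the stage only when every pipelined module has reported completion; the module count and the advance_pipeline_stage() helper are assumptions made for illustration.

#include <stdbool.h>

#define NUM_PIPE_MODULES 4   /* e.g. CAVLC, de-quantizer, IDCT, intra predictor */

static bool module_done[NUM_PIPE_MODULES];

/* Hypothetical: issues the next 4x4 sub-block to every pipelined module. */
static void advance_pipeline_stage(void) { /* ... */ }

static bool all_modules_done(void)
{
    for (int i = 0; i < NUM_PIPE_MODULES; i++)
        if (!module_done[i])
            return false;
    return true;
}

/* Called every cycle: switch the stage as soon as every module is done. */
void pipeline_tick(void)
{
    if (all_modules_done()) {
        advance_pipeline_stage();
        for (int i = 0; i < NUM_PIPE_MODULES; i++)
            module_done[i] = false;   /* start collecting for the next stage */
    }
    /* otherwise the finished modules stall (bubbles) until the slowest
       module of the current stage completes */
}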

Fig. 3.3 shows an example pipelining schedule of hybrid 4x4-sub-block-level pipeline parallelism with instantaneous switching scheme.

Fig. 3.3 An example of the pipelining schedule

3.3 Efficient 1x4 Column-By-Column Decoding Ordering

Based on our proposed 4x4-sub-block-level pipeline parallelism, we choose 4 pixels per cycle as the overall system throughput. A throughput of 4 pixels per cycle also suits efficient IDCT, inverse quantizer, and inter/intra predictor designs. Constrained by the 4x4-sub-block inverse scanning sequence (which is also the decoding sequence) defined by the H.264/AVC standard, we have two standard-compliant choices of decoding ordering: the 4x1 row-by-row decoding ordering and the 1x4 column-by-column decoding ordering, shown in Fig. 3.4 and Fig. 3.5 respectively. The following analysis of the inter and intra predictors under these two decoding orderings shows that the 1x4 column-by-column ordering is better than the 4x1 row-by-row ordering in both memory accesses and decoding cycles.

Fig. 3.4 4x1 row-by-row decoding ordering

Fig. 3.5 1x4 column-by-column decoding ordering

We now analyze both the inter and intra prediction units under these two decoding orderings.

3.3.1 Analysis on inter prediction unit

In our inter predictor design, also known as the motion compensator, an initialization stage is required before any contiguous output of motion compensated pixel values. The initialization period requires 18 memory accesses to load the 9*6 = 54 neighboring pixel values from the reference frame for the 2-D interpolation (6-tap interpolation followed by 2-tap interpolation) of the target pixel values; 18 cycles (9 pixels per 3 cycles) are also needed for this operation. Once the initialization stage finishes for the 1st group of 4 motion compensated pixel values, the pixel values loaded during initialization can be reused, and only 9 new pixels need to be loaded to compute the following contiguous output. Each subsequent contiguous output of a group of 4 motion compensated pixel values therefore requires only 3 memory accesses and 3 cycles.
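
For reference, the two interpolation steps that this initialization feeds can be sketched as below. The 6-tap filter taps (1, -5, 20, 20, -5, 1) and the 2-tap rounding average follow the H.264/AVC standard; the function names are only illustrative, and the 1-D form shown here omits the intermediate precision handling of the full 2-D case.

static int clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* 6-tap FIR for half-sample positions (taps 1,-5,20,20,-5,1, rounded). */
int halfpel(int a, int b, int c, int d, int e, int f)
{
    return clip255((a - 5*b + 20*c + 20*d - 5*e + f + 16) >> 5);
}

/* 2-tap rounding average for quarter-sample positions. */
int quarterpel(int x, int y)
{
    return (x + y + 1) >> 1;
}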

When decoding an inter predicted macroblock, Fig. 3.6 shows that under the 4x1 row-by-row decoding ordering there are 16 discontinuities in the decoding order (the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, 31st, 35th, 39th, 43rd, 47th, 51st, 55th, 59th and 63rd outputs). Each discontinuous output of a group of 4 pixel values requires an initialization process and is followed by 3 contiguous outputs. Thus, for the 4x1 row-by-row decoding ordering, the total memory accesses and total decoding cycles are

16x18 (discontinuous outputs) + 16x3x3 (contiguous outputs) = 432 memory accesses (3.1)
16x18 (discontinuous outputs) + 16x3x3 (contiguous outputs) = 432 cycles (3.2)

As Fig. 3.7 shows, under the 1x4 column-by-column decoding ordering only 8 discontinuities exist (the 7th, 15th, 23rd, 31st, 39th, 47th, 55th and 63rd outputs), which leads to 8 initialization processes; each discontinuous output is followed by 7 contiguous outputs. Thus, for the 1x4 column-by-column decoding ordering, the total memory accesses and total decoding cycles are

8x18 (discontinuous outputs) + 8x7x3 (contiguous outputs) = 312 memory accesses (3.3)
8x18 (discontinuous outputs) + 8x7x3 (contiguous outputs) = 312 cycles (3.4)

In summary, for an inter predicted macroblock, adopting the 1x4 column-by-column decoding ordering instead of the 4x1 row-by-row ordering saves about 28% of both the memory accesses and the decoding cycles.
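
The counts in Eqs. (3.1)-(3.4) and the 28% figure can be reproduced with a few lines of C; the cost constants simply restate the 18-access initialization and 3-access contiguous output described above.

#include <stdio.h>

int main(void)
{
    const int init_cost = 18;  /* accesses (and cycles) per discontinuous output */
    const int cont_cost = 3;   /* accesses (and cycles) per contiguous output    */

    /* 4x1 row-by-row: 16 discontinuities, each followed by 3 contiguous outputs */
    int row_by_row = 16 * init_cost + 16 * 3 * cont_cost;   /* 432, Eqs. (3.1)-(3.2) */

    /* 1x4 column-by-column: 8 discontinuities, each followed by 7 contiguous outputs */
    int col_by_col = 8 * init_cost + 8 * 7 * cont_cost;     /* 312, Eqs. (3.3)-(3.4) */

    printf("row-by-row : %d\n", row_by_row);
    printf("col-by-col : %d\n", col_by_col);
    printf("saving     : %.0f%%\n",
           100.0 * (row_by_row - col_by_col) / row_by_row); /* about 28%% */
    return 0;
}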

Fig. 3.6 16 initialization processes in inter predicted macroblock under 4x1 row-by-row decoding ordering

Fig. 3.7 8 content switches in inter predicted macroblock under 1x4 column-by-column decoding ordering

3.3.2 Analysis on intra prediction unit

Based on the H.264/AVC standard, for an Intra4x4 predicted macroblock the neighboring pixels, including the upper 8 pixels, the left 4 pixels, and a corner pixel (13 pixels in total), must be loaded before the intra prediction process. Because our intra predictor design follows this rule, these 13 pixels must be read from memory before each Intra4x4 prediction, whatever the prediction mode of the 4x4 sub-block. We found that with the 1x4 column-by-column decoding ordering of Fig. 3.5, every 4th output group of 4 pixels is exactly the 4 left-neighboring pixels that would otherwise have to be fetched for the intra prediction of the next 4x4 block. For example, as Fig. 3.8 shows, the group of 4 pixels in the 3rd output is exactly the 4 left-neighboring pixels needed by the following 4x4 sub-block. This group of 4 pixels can therefore be forwarded directly from the previous output instead of being fetched from memory, which reduces the number of memory accesses. The same situation occurs at the 11th, 19th, 27th, 35th, 43rd, 51st and 59th outputs. For the 4x1 row-by-row decoding ordering, no such property exists. In summary, the memory accesses can be reduced from 3x16 = 48 to 3x8 + 2x8 = 40 (a 17% saving) by adopting the 1x4 column-by-column decoding ordering instead of the 4x1 row-by-row ordering.
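
A minimal C sketch of the forwarding idea is given below; the neighbour structure, the fetch_from_memory() helper, and the column_is_left_neighbour flag are hypothetical names used only to show where the memory read is skipped.

#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t top[8];    /* 8 upper neighbours        */
    uint8_t corner;    /* top-left corner neighbour */
    uint8_t left[4];   /* 4 left neighbours         */
} Intra4x4Neighbours;

/* Hypothetical memory-read helper. */
void fetch_from_memory(uint8_t *dst, int n);

static uint8_t last_output_column[4];  /* most recent 1x4 column produced       */
static int column_is_left_neighbour;   /* set when that column is the left
                                          neighbour of the next 4x4 sub-block   */

void load_neighbours(Intra4x4Neighbours *n)
{
    fetch_from_memory(n->top, 8);
    fetch_from_memory(&n->corner, 1);
    if (column_is_left_neighbour)
        memcpy(n->left, last_output_column, 4);  /* forward: no memory access */
    else
        fetch_from_memory(n->left, 4);
}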

Fig. 3.8 Reduction on memory access of the intra predicted macroblock

3.4 Prediction/Residual Synchronization Scheme

In both the H.264/AVC and MPEG-2 decoder designs there are two decoding paths: the inter/intra prediction path (prediction path) and the residual recovery path (residual path). The prediction path predicts the pixel values, either from the motion vector via the motion compensator or from the intra prediction mode via the intra predictor. The residual path recovers the residual pixel values by first entropy decoding the coded data with CAVLC/CABAC (H.264/AVC) or table-based VLC (MPEG-2); a de-quantization process is then performed on the decoded values. Finally, an inverse discrete cosine transform (IDCT) converts the scaled values into residual values and outputs them at the end of the residual path. The decoder has to add the predicted pixel values from the prediction path to the residual pixel values from the residual path to reconstruct the original picture before the in-loop filter (H.264/AVC) or post-filter (MPEG-2).
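
The reconstruction step itself is a simple clipped addition per sample, as the following C sketch shows for one 4x4 group; the function name and array layout are illustrative.

#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Reconstruct one 4x4 group of samples before the in-loop/post filter. */
void reconstruct_4x4(const uint8_t pred[16], const int16_t resid[16],
                     uint8_t recon[16])
{
    for (int i = 0; i < 16; i++)
        recon[i] = clip255(pred[i] + resid[i]);
}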

The synchronization problem arises at this adder, which adds pixel values coming from two different decoding paths. Because the output timings of the two paths differ, we cannot guarantee that the pixel values from both paths arrive in the same cycle, nor can we predict which output comes earlier. Thus a synchronizer is required. As Fig. 3.9 shows, we developed a variable-length FIFO as a synchronizer between the prediction path and the residual path to solve this problem.

Fig. 3.9 A variable-length FIFO is required for the synchronization between intra/inter predictor and IDCT

Fig. 3.10 shows the operation of this variable-length FIFO (VL-FIFO) solution. The signals "sample_valid" and "IDCT_operation" indicate the output valid timing of the two paths. Because these two signals are not identical, the VL-FIFO stores the output pixel values of whichever path produces them alone, without the corresponding output from the other path. The stored values then wait in the VL-FIFO until the associated values arrive from the other path. In this way, with the VL-FIFO acting as the synchronizer, the residual adder can correctly add the values from the prediction and residual paths in the same cycle.
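
The following C model sketches this behaviour at the granularity of 4-pixel groups; the FIFO depth, the Group type, and the residual_add() call are assumptions made for illustration, not the actual hardware interface.

#include <stdint.h>

#define VLFIFO_DEPTH 64   /* assumed depth */

typedef struct { int16_t data[4]; } Group;   /* one 4-pixel output group */

typedef struct {
    Group buf[VLFIFO_DEPTH];
    int head, tail, count;
} VLFifo;

static VLFifo pred_fifo, resid_fifo;

/* Hypothetical downstream residual adder. */
void residual_add(Group pred, Group resid);

static void push(VLFifo *f, Group g)
{
    f->buf[f->tail] = g;
    f->tail = (f->tail + 1) % VLFIFO_DEPTH;
    f->count++;
}

static Group pop(VLFifo *f)
{
    Group g = f->buf[f->head];
    f->head = (f->head + 1) % VLFIFO_DEPTH;
    f->count--;
    return g;
}

/* Called every cycle with the valid flags of the two paths. */
void synchronize(int pred_valid, Group pred, int resid_valid, Group resid)
{
    if (pred_valid)  push(&pred_fifo, pred);
    if (resid_valid) push(&resid_fifo, resid);

    /* feed the residual adder only when both paths have a matching group */
    while (pred_fifo.count > 0 && resid_fifo.count > 0)
        residual_add(pop(&pred_fifo), pop(&resid_fifo));
}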

Fig. 3.10 Operation of variable-length FIFO as a synchronizer

3.5 Power saving by exploiting Coded-Block-Pattern

H.264/AVC and MPEG-2 both support video at various data rates. High definition video at a high data rate targets high quality applications such as digital home entertainment devices, whereas low definition video at a low data rate targets applications such as video transmission on hand-held devices.

In this section, a power saving technique is introduced for low data-rate applications, especially for video sequences with a high QP (Quantisation Parameter). The main idea is that we can save power by shutting down the inverse quantization and inverse DCT operations for blocks whose coefficients are all zero and bypassing these two modules, because their outputs for such blocks are expected to be all zeros.

As the QP increases, more residual coefficients are quantized to zero, so in high QP video sequences many blocks have all-zero coefficients. The parameter “Coded-Block-Pattern” notifies the decoder of the incoming all-zero-coefficient blocks in advance. Thus, by observing the decoded “Coded-Block-Pattern”, we can foresee the blocks with all-zero coefficients, shut down the inverse quantizer and IDCT, and pass all-zero results directly to the output of the IDCT. Fig. 3.11 shows the block diagram of this power saving technique.
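
A behavioural C sketch of the gating decision is shown below; the inverse_quantize_4x4() and idct_4x4() helpers stand in for the hardware modules and are hypothetical names (in hardware the two modules would be clock-gated rather than skipped in software).

#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the hardware modules. */
void inverse_quantize_4x4(const int16_t in[16], int16_t out[16]);
void idct_4x4(const int16_t in[16], int16_t out[16]);

void decode_residual_block(int cbp_bit, const int16_t coeff[16], int16_t resid[16])
{
    if (!cbp_bit) {
        /* all-zero block: bypass de-quantization and IDCT entirely */
        memset(resid, 0, 16 * sizeof(resid[0]));
        return;
    }
    int16_t scaled[16];
    inverse_quantize_4x4(coeff, scaled);
    idct_4x4(scaled, resid);
}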

Fig. 3.11 Power saving by exploiting Coded-Block-Pattern

Fig. 3.12 shows the simulation results of QP versus bitrate (for the QCIF Foreman sequence) and QP versus the percentage of all-zero-coefficient blocks. As the QP increases, the bit rate decreases because the quantization loss increases, which raises the percentage of all-zero-coefficient blocks. This power saving technique reduces the power dissipated by the inverse quantization and IDCT by 30% to almost 100% as the QP goes from 20 to 50. The power saving is especially large for high QP sequences.

Fig. 3.12 QP versus Bitrate and the percentage of all zero coefficient blocks

3.6 Novel User-Determinable Low Power Mode Exploration

H.264 is becoming more and more popular in video applications and has great potential in hand-held devices such as PDAs and mobile phones, so designing a low-power decoder becomes an important issue. To reduce the power consumption, besides the low-power architecture design introduced in the following chapter, a novel user-determinable low power mode is introduced here.

Table 3.5 shows the power profile of our decoder, reported by PrimePower when decoding an H.264 video sequence at 100 MHz. From this report we can see that the loopfilter consumes the highest power among all modules when decoding both I-frames and P-frames. The power consumed by the loopfilter is mainly contributed by the 4 single-port SRAMs inside it. Because the only purpose of the loopfilter is to smooth the decoded picture, we may shut it down to save considerable power, as long as the unsmoothed picture is acceptable.

Imagine that one day you are watching a TV program on a hand-held device on a train and find that the battery has almost run out. If a power-saving option with an acceptable degradation in picture quality were provided, it would be a very welcome choice.

Fortunately, the content memory, which serves to isolate the loopfilter from the other decoding modules, also becomes useless when the loopfilter is shut down and can be shut down as well.
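A simple sketch of how such a user-determinable mode could be exposed is given below, assuming a hypothetical low_power_mode flag and deblock_macroblock() call; in the real design the loopfilter and the content memory would be clock- or power-gated rather than bypassed in software.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical call standing in for the hardware loopfilter. */
void deblock_macroblock(const uint8_t in[256], uint8_t out[256]);

bool low_power_mode = false;   /* set by the user/application */

void output_macroblock(const uint8_t recon[256], uint8_t out[256])
{
    if (low_power_mode) {
        /* loopfilter and the content memory that isolates it are shut down;
           the unfiltered (slightly blocky) macroblock passes straight through */
        for (int i = 0; i < 256; i++)
            out[i] = recon[i];
    } else {
        deblock_macroblock(recon, out);
    }
}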
