Entropy Coding - Components of Baseline Intra Coding

Chapter 2 Overview of H.264/AVC Standard

2.2. Components of Baseline Intra Coding

2.2.5. Entropy Coding

Two entropy coding methods are supported in the H.264/AVC baseline profile.

1. Exp-Golomb coding

The standard uses exponential Golomb (Exp-Golomb) coding to encode the syntax elements that carry coding information, such as mode and type. An Exp-Golomb codeword is a string of bits consisting of a prefix part and a suffix part, as shown in Fig. 8. Signed syntax element values are first mapped to unsigned code numbers, which are then coded.

(a) Prefix and suffix bitstring form

Bitstring form                   Range of code num
1                                0
0 1 x0                           1 - 2
0 0 1 x1 x0                      3 - 6
0 0 0 1 x2 x1 x0                 7 - 14
0 0 0 0 1 x3 x2 x1 x0            15 - 30
0 0 0 0 0 1 x4 x3 x2 x1 x0       31 - 62
…                                …

(b) Bitstring to code num and (c) code num to signed syntax value

Bitstring         Code num    Syntax value
1                 0           0
0 1 0             1           1
0 1 1             2           -1
0 0 1 0 0         3           2
0 0 1 0 1         4           -2
0 0 1 1 0         5           3
0 0 1 1 1         6           -3
0 0 0 1 0 0 0     7           4
…                 …           …

Fig. 8 (a) Prefix and suffix bitstrings, (b) exp-Golomb bitstrings, (c) mapping for signed bitstrings
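The mapping in Fig. 8 can be reproduced with a few lines of code. The following is a minimal sketch, not the reference implementation; the function names `ue_encode`, `ue_decode`, `se_to_code_num`, and `se_encode` are chosen here for illustration only.

```python
def ue_encode(code_num: int) -> str:
    """Unsigned Exp-Golomb: M zero bits (prefix), then the (M+1)-bit binary of code_num + 1."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    value = code_num + 1
    m = value.bit_length() - 1          # number of leading zeros in the prefix
    return "0" * m + format(value, "b")


def ue_decode(bits: str) -> int:
    """Decode one unsigned Exp-Golomb codeword from the start of a bit string."""
    m = bits.index("1")                 # count the leading zeros
    return int(bits[m:2 * m + 1], 2) - 1


def se_to_code_num(value: int) -> int:
    """Signed mapping of Fig. 8(c): 0, 1, -1, 2, -2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * value - 1 if value > 0 else -2 * value


def se_encode(value: int) -> str:
    """Signed Exp-Golomb: map to a code number, then code it as unsigned."""
    return ue_encode(se_to_code_num(value))


if __name__ == "__main__":
    for v in (0, 1, -1, 2, -2, 3, -3, 4):
        print(v, se_to_code_num(v), se_encode(v))
```

Running the script lists each signed value with its code number and bitstring, which can be compared row by row with Fig. 8(b) and (c).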

2. Context adaptive variable length coding (CAVLC)

In the baseline profile, the standard uses CAVLC to code the quantized coefficients into the bitstream. Fig. 9 illustrates an example of CAVLC coding as a flow diagram. The following items are coded in order: the number of nonzero coefficients, the signs of the trailing ones, the levels of the remaining nonzero coefficients, the total number of zeros, and the runs of zeros between nonzero coefficients. For encoding, the coefficients are scanned in reverse zigzag order; decoding is the reverse of the encoding process. If all coefficients within an 8x8 block are zero, the coding process skips them and signals this case with a special flag, the coded block pattern (CBP).

Fig. 9 Example of CAVLC Coding
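The items listed above can be extracted from a zigzag-ordered coefficient block before any table lookup takes place. The sketch below only derives these symbols; it does not perform the actual variable-length table selection, which in the standard depends on neighboring-block context. The function name `cavlc_symbols` and the dictionary keys are assumptions made for this illustration.

```python
def cavlc_symbols(zigzag):
    """Derive the CAVLC symbols of one block given in zigzag order.

    Levels, signs, and runs are listed in reverse scan order,
    i.e. starting from the highest-frequency nonzero coefficient.
    """
    nonzero = [(i, c) for i, c in enumerate(zigzag) if c != 0]
    total_coeff = len(nonzero)
    if total_coeff == 0:
        return {"total_coeff": 0, "trailing_ones": 0, "t1_signs": [],
                "levels": [], "total_zeros": 0, "run_before": []}

    rev = nonzero[::-1]

    # Trailing ones: up to three consecutive +/-1 values at the high-frequency end.
    trailing_ones = 0
    for _, c in rev:
        if abs(c) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    t1_signs = ["+" if c > 0 else "-" for _, c in rev[:trailing_ones]]
    levels = [c for _, c in rev[trailing_ones:]]

    # Zeros located before the last nonzero coefficient in scan order.
    total_zeros = rev[0][0] + 1 - total_coeff

    # Zeros between each nonzero coefficient and the previous one in scan order;
    # the run of the lowest-frequency coefficient is implied and not listed.
    positions = [i for i, _ in rev]
    run_before = [positions[k] - positions[k + 1] - 1
                  for k in range(total_coeff - 1)]

    return {"total_coeff": total_coeff, "trailing_ones": trailing_ones,
            "t1_signs": t1_signs, "levels": levels,
            "total_zeros": total_zeros, "run_before": run_before}


if __name__ == "__main__":
    block = [0, 3, 0, 1, -1, -1, 0, 1] + [0] * 8
    print(cavlc_symbols(block))
```

For the sample block the sketch reports five nonzero coefficients, three trailing ones with signs +, -, -, remaining levels 1 and 3, and three zeros in total.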

Chapter 3 H.264/AVC Intra Frame Codec

H.264/AVC is widely regarded as the next-generation video standard that will replace the existing MPEG-2 standards. Its spatial-domain intra coding is a newly supported technique that suits not only moving video coding but also still image compression. However, common digital signal processors can hardly cope with its high computational complexity and large data throughput.

In this chapter, a parallel H.264/MPEG-4 AVC baseline profile intra frame codec that supports both encoding and decoding is proposed for digital camera and video applications. This work is mainly based on the previous architecture [11], modified to fit the decoding procedure. The proposed chip supports real-time encoding of high-definition (HD) 720p video (1280x720, 4:2:0) at 30 fps when clocked at 117 MHz, and decoding of 1080p video (1920x1080, 4:2:0) at 30 fps when clocked at 58 MHz. At its highest frequency of 125 MHz, the design can encode still images at 29.62M pixels per second or decode them at 135.60M pixels per second. The results of this work are also published in [12].

3.1. Hardware Oriented Algorithm

3.1.1. Enhanced SATD Function for Mode Decision

The cost function for mode decision is the most important factor in the coding performance of intra-only H.264/AVC. The best-matched prediction mode can be found with rate-distortion optimization (RDO). Although RDO provides the best performance, its complexity hinders its use in hardware design. Thus, the SATD method is adopted to calculate the costs.

The main issue then becomes which transform to use for the SATD computation. The transform should be computationally simple yet effective at estimating the energy of the signal. In [10], a pure 4x4 DHT is adopted for mode decision, but it is far from the real transform used in the encoding process. A better choice for SATD should approximate the effect of the transform and quantization used in H.264/AVC encoding so that it estimates the real bitrate. Therefore, previous works [13][14] use the 4x4 integer transform. Although these approaches achieve better performance than the DHT, they are still not good enough, because the fractional multiplication factors are not taken into consideration. A complete transform function in H.264/AVC includes both the integer transform and the multiplication factors in the quantization formula, as shown in (6) and (7). However, incorporating these factors into the cost function directly costs a lot of computation because they are not simple integers. Besides, these factors cannot be derived directly from the formula, since the quantization parameters must also be included.

To solve these problems, this work adopts the cost function proposed in [11], which combines the integer transform with simplified multiplication factors. The simplified factors are derived from the quantization coefficients shown in Table 2 and Table 3, so that the effects of both transform and quantization are considered. From these tables, the required scaling factors are obtained by approximating the ratios among the reciprocals of the quantization and de-quantization coefficients and simplifying them to integers to reduce computational complexity, as shown in (9) and (10).

$$\text{1/quant\_coef:}\quad p(0,0)^{-1} : p(0,1)^{-1} : p(1,1)^{-1} \approx 30 : 19 : 12 \tag{9}$$

$$\text{1/dequant\_coef:}\quad p(0,0)^{-1} : p(0,1)^{-1} : p(1,1)^{-1} \approx 30 : 25 : 20 \tag{10}$$

In (9) and (10), the symbol p(x,y) represents the quantization and de-quantization coefficients at different positions in Table 2 and Table 3, respectively. Considering the final performance and the implementation cost, the scaling factors derived from the de-quantization table are adopted, as shown in (11). In this formula, a division by 32 is added to keep the cost values from growing; in hardware it can be realized by simple, low-cost wiring. As a result, the cost function estimates the energy of the residuals after transform and quantization more accurately than other methods while remaining computationally simple and suitable for hardware implementation. It provides better quality than [10] and can compensate for the quality loss caused by the plane mode removal discussed in Subsection 3.1.2.
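Since formula (11) is not reproduced in this section, the following is only a sketch of what such a cost function might look like, under stated assumptions: the residual block goes through the 4x4 H.264/AVC forward integer transform, each absolute coefficient is weighted by 30, 25, or 20 according to its position class as in (10), and the sum is divided by 32 with a right shift. The assignment of weights to position classes (both indices even, mixed, both odd) follows the usual layout of the H.264/AVC quantization tables; all function names are hypothetical.

```python
# Forward 4x4 integer transform matrix of H.264/AVC.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def position_weight(i, j, w00=30, w01=25, w11=20):
    """Assumed weight per coefficient position, using the ratios of (10)."""
    if i % 2 == 0 and j % 2 == 0:
        return w00          # same class as p(0,0)
    if i % 2 == 1 and j % 2 == 1:
        return w11          # same class as p(1,1)
    return w01              # same class as p(0,1)


def enhanced_satd_cost(residual):
    """Weighted sum of absolute transformed differences, divided by 32."""
    cf_t = [list(row) for row in zip(*CF)]
    coeffs = matmul4(matmul4(CF, residual), cf_t)
    weighted = sum(position_weight(i, j) * abs(coeffs[i][j])
                   for i in range(4) for j in range(4))
    return weighted >> 5    # division by 32 as a shift (pure wiring in hardware)


if __name__ == "__main__":
    flat = [[1] * 4 for _ in range(4)]
    print(enhanced_satd_cost(flat))
```

A flat residual concentrates all energy in the DC coefficient, so only the weight 30 contributes; for the all-ones block the DC coefficient is 16 and the cost is (30 x 16) >> 5 = 15.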

3.1.2. Intra Plane Mode Removal

The intra prediction modes can be organized into four types according to their prediction properties and computational complexity, as illustrated in Fig. 10. The bypass type is easy to implement since the prediction samples are the same as the boundary pixels. In the average type, the eight neighboring pixels (for 4x4 prediction) or 32 neighboring pixels (for 16x16 prediction) are summed and averaged into a single value used for all prediction samples. The linear type contains most of the directional 4x4 prediction modes, whose samples are linearly interpolated from the boundary pixels. Finally, in the bilinear type, also known as plane prediction, samples are derived by approximating a bilinear transform. Although it has been simplified to integer arithmetic only, the plane mode is still much more computationally complex than the other modes. Besides, its results are hard to reuse for other predictions, and it occupies almost half of the area of the intra prediction unit. The detailed computation of plane prediction is given in Fig. 11.

Fig. 10 Four categorized types of intra prediction modes: bypass, average, linear, and bilinear

A solution to this problem is to eliminate the plane mode from intra prediction and let the other modes take its place, which raises the issue of performance loss. Table 4 shows the probability distribution of the 16x16 prediction modes in different sequences. Macroblocks predicted in plane mode account for only 4.2% on average, and for no more than 5.7% in any sequence except “Akiyo”, which contains much smoother texture. Our simulation shows that prediction with the plane mode reduces the bitrate by only about 1% compared with prediction without it for these video sequences. This 1% bitrate loss can easily be compensated by the enhanced cost function proposed in Subsection 3.1.1. With this modification, we achieve almost the same result as [10] while saving a lot of computation and hardware cost.

(a) Luma 16x16

$$h = \sum_{i=1}^{8} i\,[\,p(7+i,-1) - p(7-i,-1)\,], \qquad v = \sum_{i=1}^{8} i\,[\,p(-1,7+i) - p(-1,7-i)\,]$$

$$a = 16\,[\,p(-1,15) + p(15,-1)\,], \qquad b = (5h + 32) \gg 6, \qquad c = (5v + 32) \gg 6$$

$$\mathrm{Pred}(x,y) = [\,a + b(x-7) + c(y-7) + 16\,] \gg 5$$

(b) Chroma 8x8

$$h = \sum_{i=1}^{4} i\,[\,p(3+i,-1) - p(3-i,-1)\,], \qquad v = \sum_{i=1}^{4} i\,[\,p(-1,3+i) - p(-1,3-i)\,]$$

$$a = 16\,[\,p(-1,7) + p(7,-1)\,], \qquad b = (17h + 16) \gg 5, \qquad c = (17v + 16) \gg 5$$

$$\mathrm{Pred}(x,y) = [\,a + b(x-3) + c(y-3) + 16\,] \gg 5$$

Fig. 11 Intra plane mode for (a) 16x16 (b) 8x8 predictions
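To make the cost of the plane mode concrete, the 16x16 luma case of Fig. 11 can be sketched as follows. This is an illustrative software model, not the hardware datapath; the function name and argument layout (`top[x]` for p(x,-1), `left[y]` for p(-1,y), `corner` for p(-1,-1)) are assumptions. The final clipping to the 8-bit sample range is taken from the standard and is not shown in the figure.

```python
def plane_pred_16x16(top, left, corner):
    """16x16 luma plane prediction from the neighboring boundary samples.

    top[x]  = p(x, -1) for x = 0..15
    left[y] = p(-1, y) for y = 0..15
    corner  = p(-1, -1)
    Returns pred[y][x] as a 16x16 list of lists.
    """
    def p_top(x):
        return corner if x < 0 else top[x]

    def p_left(y):
        return corner if y < 0 else left[y]

    # Weighted horizontal and vertical gradients of the boundary samples.
    h = sum(i * (p_top(7 + i) - p_top(7 - i)) for i in range(1, 9))
    v = sum(i * (p_left(7 + i) - p_left(7 - i)) for i in range(1, 9))

    a = 16 * (left[15] + top[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6

    def clip(sample):
        # Clipping to 8-bit samples, as required by the standard.
        return max(0, min(255, sample))

    return [[clip((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)]
            for y in range(16)]


if __name__ == "__main__":
    pred = plane_pred_16x16([100] * 16, [100] * 16, 100)
    print(pred[0][:4])
```

Every one of the 256 samples needs two multiplications and a shift on top of the gradient sums, which illustrates why this mode is costly compared with the bypass, average, and linear types.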

Table 4 Probability distribution of 16x16 modes in different sequences with 300 I-frames at QP=28

                            16x16 modes
Sequence      Total ratio   Vertical   Horizontal   DC       Plane
Mobile        3.3%          0.8%       1.0%         1.3%     0.2%
Coastguard    10.7%         0.8%       3.8%         4.6%     1.6%
Stefan        20.7%         3.4%       12.8%        2.0%     2.5%
Paris         15.5%         3.2%       4.6%         4.4%     3.4%
Foreman       23.1%         5.1%       4.3%         8.0%     5.7%
Akiyo         47.5%         5.3%       4.8%         25.7%    11.8%

3.1.3. Simulation Results

Table 5 compares the results of encoding six CIF sequences with all intra frames at different QPs for four algorithms: the original SATD function in [10], the SAITD algorithm in [14], the enhanced SATD cost function, and the proposed method that combines the enhanced function with plane mode removal. In most cases, the proposed SATD cost function achieves better coding efficiency than [10] and [14], with almost the same or even better PSNR quality. The enhanced algorithm reduces the bitrate by 0.08% on average over all sequences. After it is combined with plane mode removal, the bitrate increase is compensated and is no larger than 0.06% on average.

Table 5 Comparison among the original code [10], the SAITD algorithm in [14], and the two proposed algorithms for coding of 300 intra frames

                 | JM 8.6 [9]                     | SAITD Algorithm               | Enhanced SATD Cost Function   | Enhanced SATD Cost Function + Plane Mode Removal
Sequence    QP   | SNR Y  SNR U  SNR V  Bit-rate  | SNR Y  SNR U  SNR V  Bit-rate | SNR Y  SNR U  SNR V  Bit-rate | SNR Y  SNR U  SNR V  Bit-rate
Stefan      16   | 46.38  47.27  47.43  10537.41  | +0.10  +0.00  +0.00  +0.11%   | +0.06  +0.00  -0.01  -0.16%   | +0.06  +0.00  +0.00  -0.11%
Stefan      20   | 42.96  44.43  44.51  8143.18   | +0.05  -0.03  -0.03  +0.07%   | +0.02  -0.03  -0.03  -0.18%   | +0.02  -0.03  -0.03  -0.12%
Stefan      24   | 39.63  41.63  41.65  6189.86   | -0.02  -0.14  -0.13  +0.15%   | -0.04  -0.14  -0.13  -0.22%   | -0.04  -0.14  -0.12  -0.14%
Stefan      28   | 36.41  38.95  38.96  4585.29   | -0.02  -0.23  -0.24  +0.15%   | -0.05  -0.23  -0.24  -0.24%   | -0.05  -0.23  -0.25  -0.16%
Stefan      32   | 33.05  37.12  37.08  3246.41   | -0.05  -0.41  -0.43  +0.34%   | -0.08  -0.41  -0.43  -0.21%   | -0.08  -0.42  -0.45  -0.15%
Stefan      36   | 29.96  35.15  35.07  2218.46   | -0.06  -0.48  -0.52  +0.95%   | -0.10  -0.49  -0.52  -0.09%   | -0.11  -0.53  -0.55  -0.06%
Mobile      16   | 45.93  46.25  46.29  15361.27  | +0.04  +0.01  +0.01  +0.05%   | +0.04  +0.01  +0.01  -0.10%   | +0.04  +0.01  +0.01  -0.08%
Mobile      20   | 42.14  42.87  42.89  12233.46  | +0.05  -0.03  -0.04  +0.07%   | +0.03  -0.03  -0.04  -0.12%   | +0.03  -0.03  -0.03  -0.09%
Mobile      24   | 38.49  39.72  39.68  9487.00   | +0.01  -0.11  -0.10  +0.17%   | -0.01  -0.11  -0.10  -0.14%   | -0.01  -0.11  -0.10  -0.10%
Mobile      28   | 35.04  36.88  36.76  7179.38   | +0.03  -0.17  -0.18  +0.26%   | +0.01  -0.17  -0.18  -0.16%   | +0.01  -0.16  -0.18  -0.12%
Mobile      32   | 31.50  34.89  34.67  5200.07   | +0.02  -0.24  -0.24  +0.52%   | -0.01  -0.24  -0.24  -0.14%   | -0.01  -0.24  -0.23  -0.09%
Mobile      36   | 28.28  32.87  32.60  3572.40   | +0.01  -0.31  -0.32  +1.08%   | -0.05  -0.31  -0.22  -0.05%   | -0.05  -0.31  -0.22  +0.01%
Paris       16   | 46.16  47.35  47.63  10114.93  | +0.08  -0.05  -0.05  +0.12%   | +0.06  -0.05  -0.05  -0.09%   | +0.06  -0.05  -0.04  -0.00%
Paris       20   | 42.81  44.69  44.88  7662.99   | +0.02  -0.21  -0.16  +0.25%   | -0.01  -0.21  -0.16  -0.08%   | -0.01  -0.20  -0.15  -0.01%
Paris       24   | 39.59  41.97  42.12  5742.80   | -0.05  -0.35  -0.28  +0.49%   | -0.09  -0.36  -0.28  -0.06%   | -0.09  -0.36  -0.28  +0.05%
Paris       28   | 36.49  39.40  39.54  4235.43   | -0.05  -0.50  -0.34  +0.70%   | -0.20  -0.50  -0.34  -0.06%   | -0.10  -0.52  -0.36  +0.09%
Paris       32   | 33.33  37.51  37.73  3005.78   | -0.14  -0.61  -0.52  +0.82%   | -0.20  -0.61  -0.52  -0.11%   | -0.20  -0.65  -0.52  +0.08%
Paris       36   | 30.36  35.58  35.78  2055.30   | -0.18  -0.64  -0.54  +1.00%   | -0.25  -0.64  -0.55  -0.13%   | -0.25  -0.67  -0.56  +0.08%
Akiyo       16   | 47.34  48.46  49.40  4159.54   | +0.16  -0.06  -0.11  +0.95%   | +0.09  -0.05  -0.12  -0.10%   | +0.10  -0.06  -0.12  +0.42%
Akiyo       20   | 44.89  46.88  47.95  2777.52   | +0.04  -0.26  -0.33  +0.83%   | -0.05  -0.26  -0.34  -0.16%   | -0.05  -0.30  -0.38  +0.18%
Akiyo       24   | 42.65  44.62  46.03  1941.72   | -0.14  -0.35  -0.58  +1.03%   | -0.23  -0.35  -0.58  +0.18%   | -0.25  -0.37  -0.64  +0.56%
Akiyo       28   | 40.33  42.52  43.93  1370.30   | -0.18  -0.53  -0.77  +1.49%   | -0.33  -0.54  -0.76  +0.27%   | -0.35  -0.63  -0.85  +0.62%
Akiyo       32   | 37.77  40.82  42.54  963.42    | -0.32  -0.53  -0.89  +1.71%   | -0.53  -0.52  -0.86  -0.66%   | -0.58  -0.69  -1.09  +0.30%
Akiyo       36   | 35.28  38.80  40.69  672.96    | -0.38  -0.45  -0.78  +2.80%   | -0.65  -0.50  -0.75  +0.17%   | -0.70  -0.64  -0.98  +0.28%
Foreman     16   | 46.26  47.56  48.57  7665.68   | +0.10  +0.03  -0.01  +0.28%   | +0.08  +0.03  -0.01  -0.14%   | +0.08  +0.03  +0.00  -0.04%
Foreman     20   | 42.90  45.17  46.83  5394.72   | +0.13  -0.08  -0.21  +0.59%   | +0.09  -0.08  -0.21  -0.13%   | +0.09  -0.08  -0.21  -0.00%
Foreman     24   | 39.95  42.87  44.78  3678.28   | +0.04  -0.25  -0.33  +1.20%   | -0.03  -0.24  -0.33  -0.08%   | -0.03  -0.25  -0.35  +0.07%
Foreman     28   | 37.26  40.91  42.79  2467.35   | +0.03  -0.32  -0.42  +1.80%   | -0.06  -0.32  -0.42  -0.08%   | -0.06  -0.34  -0.46  +0.10%
Foreman     32   | 34.61  39.80  41.32  1598.24   | -0.05  -0.41  -0.54  +2.62%   | -0.19  -0.41  -0.54  +0.03%   | -0.20  -0.46  -0.60  +0.30%
Foreman     36   | 32.24  38.61  39.81  1022.62   | -0.08  -0.39  -0.49  +4.19%   | -0.31  -0.39  -0.49  -0.00%   | -0.32  -0.45  -0.57  +0.53%
Coastguard  16   | 45.88  48.20  49.05  9454.85   | +0.04  +0.06  +0.05  +0.15%   | +0.04  +0.06  +0.05  -0.20%   | +0.04  +0.06  +0.05  -0.18%
Coastguard  20   | 42.13  46.38  47.64  6969.04   | +0.10  -0.04  -0.10  +0.30%   | +0.09  -0.04  -0.10  -0.21%   | +0.09  -0.05  -0.11  -0.20%
Coastguard  24   | 38.74  44.64  46.13  4959.03   | +0.08  -0.21  -0.12  +0.74%   | +0.06  -0.21  -0.11  -0.16%   | +0.06  -0.24  -0.13  -0.15%
Coastguard  28   | 35.63  43.08  44.72  3437.66   | +0.15  -0.27  -0.15  +1.33%   | +0.12  -0.27  -0.15  -0.11%   | +0.12  -0.35  -0.18  -0.14%
Coastguard  32   | 32.74  41.96  43.72  2236.41   | +0.06  -0.31  -0.17  +1.99%   | +0.00  -0.31  -0.16  -0.13%   | +0.00  -0.44  -0.23  -0.17%
Coastguard  36   | 30.24  40.82  42.72  1418.33   | -0.04  -0.28  -0.11  +2.68%   | -0.15  -0.27  -0.11  -0.17%   | -0.14  -0.39  -0.14  -0.03%

Fig. 12 RD curves of [10] and proposed algorithm for sequence “Stefan” (Stefan CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

Fig. 13 RD curves of [10] and proposed algorithm for sequence “Mobile” (Mobile CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

Fig. 14 RD curves of [10] and proposed algorithm for sequence “Paris” (Paris CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

Fig. 15 RD curves of [10] and proposed algorithm for sequence “Akiyo” (Akiyo CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

Fig. 16 RD curves of [10] and proposed algorithm for sequence “Foreman” (Foreman CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

Fig. 17 RD curves of [10] and proposed algorithm for sequence “Coastguard” (Coastguard CIF; axes: Kbits/frame and SNR Y (dB); curves: JM 8.6, Proposed)

The RD curves of [10] and of our proposed combined algorithm for these six sequences are shown in Fig. 12 to Fig. 17. The QP range for these diagrams is 20 to 32, except for “Akiyo”, whose range is 16 to 28 so that the characteristic of its curve shows clearly. The curves of our algorithm are very close to the original ones, and in high-bitrate coding with lower QPs the performance is even better. This algorithm-level optimization makes the final hardware design simpler while keeping good video quality.

3.2. System Level Scheme

3.2.1. Analysis of Hardware Complexity

To achieve the throughput required for the target video size, the hardware complexity must be analyzed before design. In an H.264/AVC codec, the computational complexity of the encoder is much higher than that of the decoder, since the encoder computes all prediction modes whereas the decoder computes exactly one. Table 6 shows the data throughput for different video sizes at 30 fps. For our target, HD 720p, a data throughput of at least 108,000 macroblocks per second is needed, which is equivalent to 27.65M pixels per second.

Table 6 Data throughput for different video sizes

                        Data throughput
Video size              Mpixels/sec    Kilo-macroblocks/sec
QCIF   176 x 144        0.76           2.97
CIF    352 x 288        3.04           11.88
ITU-R  720 x 576        12.44          48.60
SDTV   720 x 480        10.37          40.50
HDTV   1280 x 720       27.65          108.00
HDTV   1920 x 1080      62.21          243.00

To estimate the operating frequency of the hardware design, we examine the cycles needed for intra coding. To simplify the estimation, the cycles for data transfer between on-chip and off-chip memory are neglected and only the prediction cycles are considered. Under this assumption, the total cycle count for the encoder to predict a macroblock, including one luma and two chroma components, is 3456 (16x16x9 + 16x16x3 + 2x16x4x3), where the plane modes are removed and other operations are excluded. With this cycle count, the necessary encoder frequency is 373.25 MHz for HD 720p and 839.81 MHz for HD 1080p. This speed requirement is far beyond the generally acceptable range of common processors and is hard to implement.
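The arithmetic behind these figures is short enough to check directly. The sketch below recomputes the macroblock throughput of Table 6 and the single-pixel encoder frequency; the function names are chosen for illustration, and the macroblock rate is taken as pixels per second divided by 256, which is how the table values appear to be derived.

```python
# Encoder prediction cycles per macroblock at one pixel per cycle:
# luma 4x4 (9 modes) + luma 16x16 (3 modes, plane removed)
# + two 8x8 chroma blocks (3 modes, plane removed).
ENCODER_CYCLES_PER_MB = 16 * 16 * 9 + 16 * 16 * 3 + 2 * 16 * 4 * 3


def pixels_per_second(width, height, fps=30):
    return width * height * fps


def macroblocks_per_second(width, height, fps=30):
    # One macroblock covers 16x16 = 256 luma pixels.
    return pixels_per_second(width, height, fps) / 256


def required_mhz(width, height, cycles_per_mb, fps=30):
    return macroblocks_per_second(width, height, fps) * cycles_per_mb / 1e6


if __name__ == "__main__":
    print(ENCODER_CYCLES_PER_MB)
    for name, (w, h) in {"720p": (1280, 720), "1080p": (1920, 1080)}.items():
        print(name,
              round(pixels_per_second(w, h) / 1e6, 2), "Mpixels/s,",
              macroblocks_per_second(w, h), "MB/s,",
              round(required_mhz(w, h, ENCODER_CYCLES_PER_MB), 2), "MHz")
```

For 720p this gives 108,000 macroblocks per second and 373.25 MHz, and for 1080p 243,000 macroblocks per second and 839.81 MHz, matching the values quoted above.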

Table 7 Frequency for N-pixel parallel encoder

                        Frequency at N-parallel (MHz)
Video size              N=1       N=2       N=4       N=16
QCIF   176 x 144        10.26     5.13      2.57      0.64
CIF    352 x 288        41.06     20.53     10.26     2.57
ITU-R  720 x 576        167.96    83.98     41.99     10.50
SDTV   720 x 480        139.97    69.98     34.99     8.75
HDTV   1280 x 720       373.25    186.62    93.31     23.33
HDTV   1920 x 1080      839.81    419.90    209.95    52.49

Table 8 Frequency for N-pixel parallel decoder

                        Frequency at N-parallel (MHz)
Video size              N=1       N=2       N=4       N=16
QCIF   176 x 144        1.14      0.57      0.29      0.07
CIF    352 x 288        4.56      2.28      1.14      0.29
ITU-R  720 x 576        18.66     9.33      4.67      1.17
SDTV   720 x 480        15.55     7.78      3.89      0.97
HDTV   1280 x 720       41.47     20.74     10.37     2.59
HDTV   1920 x 1080      93.31     46.66     23.33     5.83

Consequently, we apply pixel-level parallelism to reduce the required frequency. Table 7 and Table 8 show the estimation results for the encoder and the decoder, respectively. For the encoder, a suitable choice for HD 720p is the four-pixel parallel architecture, which runs at 93.31 MHz and needs 864 cycles per macroblock. Under the same conditions, the four-pixel parallel decoder needs only 96 cycles at 10.37 MHz to decode a macroblock for 720p. Thus, our decoder can support larger sizes such as HD 1080p at 23.33 MHz. This design target meets the real-time requirement and is also easy to implement.
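The entries of Table 7 and Table 8 follow from dividing the per-macroblock cycle count by the degree of parallelism N. In the sketch below, the encoder count of 3456 cycles comes from the text, while the decoder count of 384 cycles (one cycle per luma and chroma sample, 16x16 + 2x8x8) is an assumption inferred from the N=1 column of Table 8; the function name is illustrative.

```python
ENCODER_CYCLES_PER_MB = 16 * 16 * 9 + 16 * 16 * 3 + 2 * 16 * 4 * 3  # 3456, from the text
DECODER_CYCLES_PER_MB = 16 * 16 + 2 * 8 * 8                          # 384, inferred assumption


def parallel_estimate(width, height, cycles_per_mb, n, fps=30):
    """Return (cycles per macroblock, required MHz) for N-pixel parallelism."""
    mbs_per_second = width * height * fps / 256
    cycles = cycles_per_mb / n
    return cycles, mbs_per_second * cycles / 1e6


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        enc = parallel_estimate(1280, 720, ENCODER_CYCLES_PER_MB, n)
        dec = parallel_estimate(1920, 1080, DECODER_CYCLES_PER_MB, n)
        print(f"N={n:2d}  720p encoder: {enc[0]:6.0f} cycles, {enc[1]:7.2f} MHz"
              f"   1080p decoder: {dec[0]:5.0f} cycles, {dec[1]:6.2f} MHz")
```

With N=4 the sketch gives 864 cycles and 93.31 MHz for 720p encoding, and 96 cycles and 23.33 MHz for 1080p decoding, in line with the design point chosen above.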

3.2.2. Macroblock Level Pipelining

The estimate in Subsection 3.2.1 counts only the cycles for intra prediction. More cycles are required once other functions, such as data transfer between memories and entropy coding, are considered. These operations add to the cycle count of a macroblock and result in a higher operating frequency.

For example, the CAVLC circuit in [15] takes about 500 cycles to encode a macroblock of high-quality video. This would increase the encoder latency to around 1,400 cycles per macroblock and the required frequency to about 150 MHz. In addition, the zigzag scan in the CAVLC unit also increases the cycle count, since its scan order is quite different from the raster scan used in the prediction engine.

Fig. 18 Ping-pong architecture with macroblock level pipelining for encoder

To solve the above problems, macroblock level pipelining is used, as shown in Fig. 18. This pipeline partition enables the overlapped execution of intra prediction and entropy coding without a large increase in cycles. The cycle count of a macroblock is determined by the longest latency among the processing units in the pipeline. Besides, the design adopts a ping-pong architecture with a two-bank memory located between the prediction loop and entropy coding to resolve the ordering problem. The quantized coefficients of the macroblock currently being predicted are written to one bank of the ping-pong buffer, while the coefficients of the previously predicted macroblock are held in the other bank, ready for CAVLC coding. These ping-pong buffers also benefit the decoding process, where data flows in the opposite direction, by providing data reordering and processing rate smoothing.
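The benefit of overlapping the two stages can be estimated with a simple cycle model. The sketch below assumes fixed stage latencies and a pipeline that advances once per macroblock slot, with the slot length set by the slowest stage; the 864-cycle prediction stage and the 500-cycle CAVLC stage are the figures quoted earlier, and the function names are illustrative, not part of the design.

```python
def sequential_cycles(n_mb, stage_cycles):
    """All stages run one after another for every macroblock."""
    return n_mb * sum(stage_cycles)


def pipelined_cycles(n_mb, stage_cycles):
    """Stages overlap; every slot lasts as long as the slowest stage."""
    slots = n_mb + len(stage_cycles) - 1
    return slots * max(stage_cycles)


def pingpong_schedule(n_mb):
    """For each slot, list (macroblock, bank) written by prediction and read by CAVLC."""
    schedule = []
    for slot in range(n_mb + 1):
        write = (slot, slot % 2) if slot < n_mb else None
        read = (slot - 1, (slot - 1) % 2) if slot >= 1 else None
        schedule.append((write, read))
    return schedule


if __name__ == "__main__":
    stages = (864, 500)          # prediction, CAVLC (cycles per macroblock)
    n_mb = 3600                  # macroblocks in one 1280x720 frame
    print(sequential_cycles(n_mb, stages), pipelined_cycles(n_mb, stages))
    for slot, (write, read) in enumerate(pingpong_schedule(3)):
        print(slot, "write:", write, "read:", read)
```

In this model one 720p frame takes 4,910,400 cycles without pipelining and 3,111,264 cycles with it, and in every slot the bank being written differs from the bank being read.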

3.3. Architecture Design of Intra Codec

3.3.1. Overall Architecture

Fig. 19 Proposed architecture of baseline intra frame codec (labeled elements include the boundary registers, the external upper line buffer, and the ping-pong buffer; datapath rates are 4 pixels/cycle and 1 coefficient/cycle)

Fig. 19 shows the architecture of the proposed codec, which is derived from previous work [11] and based on the algorithm-level optimization and system-level pipelining described in Sections 3.1 and 3.2. The design corresponds directly to both the encoding flow in Fig. 2 and the decoding flow in Fig. 3. It can work as either an intra frame encoder or a decoder, selected by the three switch multiplexers shown in Fig. 19. The architecture consists of three operation phases: the prediction phase, the reconstruction phase, and the bitstream phase.

The prediction phase is the most important part of this design. It mainly contains the intra prediction generator, the forward transform, the cost generation and mode decision unit, quantization, and some buffers and registers. The reconstruction phase, which reconstructs the decoded data, is composed of the inverse transform, de-quantization, and the reconstruction FIFO registers. Four-pixel parallelism is used in these two phases to achieve the required throughput. The bitstream phase is separated from the previous two phases by the ping-pong buffer and uses the CAVLC codec to code or decode the bitstream with a throughput of at least one coefficient per cycle. Compared with the previous encoder-only design [13], the plane prediction buffer and the dual-port memory are saved in this design thanks to the algorithm optimizations.


Fig. 20 Encoder dataflow of the design

Fig. 20 shows the encoder dataflow of this design, where the unused datapath is shown in a light tint. First, the intra prediction unit generates the prediction values of the various modes for the block being predicted, as directed by the schedule control unit. The residuals, obtained as the difference between the prediction samples and the original data, are then transformed by the 4x4 integer transform unit. The transformed coefficients are used to compute the cost for deciding the best mode, and the block with the minimum cost is kept in the block buffer. The coefficients of the chosen best mode are then quantized, stored in the ping-pong memory for entropy coding and, at the same time, sent to the reconstruction phase to be decoded into boundary samples for the prediction of the next block.


Fig. 21 Decoder dataflow of the design

The decoder flow is shown in Fig. 21. Unlike the encoding loop in Fig. 20, the decoder is a straight-through datapath without a loop. The coefficients decoded by the CAVLC decoder in the previous macroblock-level pipeline stage pass through de-quantization and the inverse 4x4 transform to be recovered as residuals. Prediction values, selected by the mode information decoded by the UVLC decoder, are obtained from the intra prediction generator and added to the residuals for reconstruction. All decoded blocks are then sent to the source memory for output and to the boundary registers for the next prediction. Components unused in the decoder, such as mode decision, forward transform, quantization, and the predictor FIFO buffer, are shut down to save power. Detailed information on each component is given in the following subsections.

3.3.2. Schedule of Codec

Because of the variety of prediction modes, the encoder needs far more processing cycles than the decoder and therefore limits the overall performance of the codec. Another performance bottleneck in the encoder is the reconstruction feedback loop, since the next 4x4 block cannot start its computation until its boundary samples have been reconstructed from the previous blocks. This may result in low hardware utilization and longer latency in the prediction phase. In addition, when 16x16 intra prediction is performed, a macroblock-size buffer could be needed to store the processed 4x4 residual data for later mode decision, which raises the hardware cost.

To solve these data dependency problems and eliminate the need for a large buffer, we propose three scheduling techniques that are adopted in the scheduling control unit:

1. Insertion of the 16x16 and 8x8 predictions

The empty cycles spent waiting for reconstructed samples between two intra 4x4 blocks form bubbles, and the 16x16 or 8x8 intra prediction process is inserted into these bubble cycles to pre-compute the costs. Unlike the technique used in [13], the prediction generator predicts four blocks of one 16x16 mode successively in each bubble, instead of one block for four modes. This reduces the registers needed to accumulate the costs of the 16x16 prediction. After processing four blocks, it continues with the next 4x4 prediction. Thus, the utilization of the components in the prediction phase is improved.

2. Early start of next block prediction

Since the 4x4 blocks are processed in Z-scan order, the upper and left boundary samples might not be available at the same time for prediction. To avoid this problem and start the next block earlier, we rearrange the processing order of the prediction modes so that each mode starts as early as its required data is available. For example, the vertical mode is processed before the horizontal mode, since the left boundary pixels may not yet be available. This approach reduces the idle cycles and thus improves the throughput.

3. Recomputation of 16x16 and 8x8 best modes

For the 4x4 block prediction, we use a small buffer to save the residuals of the best mode. However, applying the same strategy to 16x16 or 8x8 predictions would require a large macroblock-size buffer. To solve this problem, we discard the data generated during the prediction process and, if a 16x16 or 8x8 mode is selected as the best mode, recompute its data after the prediction process. This approach may increase the total encoding cycles, but the increase stays within an acceptable range and the buffer cost is reduced.

Fig. 22 shows the pipelined schedule of this codec for processing a macroblock.
