Simulation Result and Discussion - Fast 4x4 Intra Prediction Algorithm for H.264/AVC

CHAPTER 4 FAST 4X4 INTRA PREDICTION ALGORITHM FOR H.264/AVC

4.4. Simulation Result and Discussion

The proposed three-step algorithm and the full search are simulated on five CIF sequences: mobile and calendar, foreman, Stefan, news, and coastguard. For each sequence, 300 frames are encoded with intra-frame coding only. Each sequence is simulated with five fixed QP values ranging from 12 to 44, as shown in Table 5 to Table 8. The RD curves are shown in Fig. 7 to Fig. 11.

From the results, the bit-rate increases by about 1% at almost the same PSNR. The bit-rate increase grows as QP rises from low values, but it shrinks again when QP is high. This phenomenon may be related to the Intra16 mode, the intra prediction mode for 16x16 blocks. At high QP, Intra16 is selected more often, and the Intra16 mode decision still uses the full search algorithm.

For high-motion and low-motion test sequences, the bit-rate increase is almost the same, because each picture is intra coded without using information from other frames. The comparison with [16] is shown in Table 5 to Table 8 for four QP values from 28 to 40. The proposed algorithm outperforms the previous approach.

Table 5. QP = 28, Comparison results

Sequence

Table 6. QP = 32, Comparison results

Sequence

News     4.26  1.20  0.000  -0.02  -31.41  -15.29
Paris    3.25  1.21  0.013  -0.01  -30.45  -15.46
Tempete  3.11  1.00  0.051  -0.02  -31.23  -16.16

Table 8. QP = 40, Comparison results

Sequence
Container  5.18  1.00  0.001  -0.03  -32.75  -16.21
News       5.31  1.38  0.006  -0.03  -31.64  -15.16
Paris      4.91  1.58  0.003  -0.04  -30.39  -15.23
Tempete    3.67  0.78  0.024  -0.03  -31.17  -16.09

CHG BIT: change in bit-rate
CHG PSNR: change in PSNR
CHG T_I: change in intra encoding time
CHG T_AVG: change in average encoding time

Fig. 26. RD-curve of mobile & calendar (PSNR versus bit-rate; curves: FS, TSS)

Fig. 28. RD-curve of Stefan (horizontal axis: bit-rate)

Fig. 29. RD-curve of news (PSNR versus bit-rate; curves: FS, TSS)

Fig. 30. RD-curve of coastguard (PSNR versus bit-rate; curves: FS, TSS)

4.5. Summary of Proposed Intra Prediction Algorithm

We propose a three-step intra prediction mode selection algorithm. Computation is reduced by examining only six of the nine Intra4x4 prediction modes. Simulation results suggest that the three-step algorithm achieves a PSNR similar to that of the full search with only about a 1% increase in bit-rate.

performance of intra-only coding also makes it suitable for still image coding, which is competitive with JPEG2000 [17].

In this chapter, an HDTV-size H.264/AVC intra encoder chip for digital camera and digital video applications is presented. The chip reduces the gate count by omitting the costly plane mode and enhances the video quality with an improved cost function. With careful scheduling and high-performance function units, the developed chip can easily support 29.46M pixels/s still image encoding and real-time intra coding of HDTV 720p@30fps video when clocked at 117.28 MHz in a 0.18 µm CMOS process.

Fig. 31. Flow of H.264/AVC intra coding

5.1. Fundamental of H.264/AVC Intra Coding

The intra coding flow of H.264/AVC is shown in Fig. 31. Each macroblock is predicted using one of nine 4x4 luma prediction modes, four 16x16 luma prediction modes, and four 8x8 chroma prediction modes. The prediction mode with the minimum cost value is then selected as the best mode. The residuals after prediction are further processed by the transform, quantization and inverse quantization (Q/IQ), and the inverse transform, and are reconstructed as the reference for the next macroblock. The quantized coefficients and the mode information are encoded by entropy coding, CAVLC and UVLC.

Fig. 32. Modes of Intra4x4

Fig. 33. Modes of Intra16x16.

There are three classes of intra prediction modes: Intra4x4 and Intra16x16 for luma sample prediction, and Chroma8x8 for chroma sample prediction. Unlike the AC/DC prediction of MPEG-4, H.264/AVC uses directional spatial information from neighboring, already coded blocks to predict the current sample values. Fig. 32 shows the modes of Intra4x4: eight directional modes and one DC prediction mode are adopted. Fig. 33 shows the modes of Intra16x16, which are used for smooth texture. Intra4x4 is more suitable for high-quality applications, while Intra16x16 is suitable for low bit-rate applications. Fig. 34 shows the modes of Chroma8x8, which are the same as those of Intra16x16 except for the mode numbering.

5.2. Hardware Oriented Algorithm Modification

5.2.1. Proposed Mode Decision Method

In the intra encoding flow, the mode decision method is the part that most strongly determines the coding performance. Two mode decision methods are used in the reference software: the basic mode decision method and the rate-distortion optimization (RDO) mode decision method.

The basic mode decision method calculates the cost from a table look-up mode cost and the sum of absolute transformed differences (SATD). The RDO mode decision method uses a weighted sum of the actual encoded bit-rate and the distortion of the reconstructed samples as the cost. Although the RDO mode decision method achieves the best performance, it is computationally intensive and thus unsuitable for high-performance or real-time encoder implementation. Therefore, our intra encoder adopts the basic method to implement the mode decision stage, as shown below.

Basic cost generation function:

Cost = Cost_of_Mode + SATD
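As a minimal sketch of the basic cost function, the following Python code computes SATD with the 4x4 Hadamard transform and selects the candidate with the lowest cost. The helper names (`satd_4x4`, `basic_cost`, `best_mode`), the unnormalized SATD, and the mode-cost values passed in are illustrative assumptions; the reference software may scale SATD and derive the mode cost differently.

```python
# 4x4 Hadamard matrix with +1/-1 entries (assumed ordering).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(original, prediction):
    """Sum of absolute Hadamard-transformed residuals (no normalization)."""
    residual = [[original[i][j] - prediction[i][j] for j in range(4)]
                for i in range(4)]
    h4_t = [list(row) for row in zip(*H4)]
    coeffs = matmul4(matmul4(H4, residual), h4_t)
    return sum(abs(c) for row in coeffs for c in row)


def basic_cost(original, prediction, cost_of_mode):
    """Cost = Cost_of_Mode + SATD."""
    return cost_of_mode + satd_4x4(original, prediction)


def best_mode(original, candidates):
    """candidates maps a mode name to (prediction block, cost_of_mode)."""
    return min(candidates,
               key=lambda mode: basic_cost(original, *candidates[mode]))


if __name__ == "__main__":
    block = [[10] * 4 for _ in range(4)]
    flat8 = [[8] * 4 for _ in range(4)]
    # Hypothetical mode costs: 0 for the first candidate, 4 for the second.
    print(best_mode(block, {"DC": (flat8, 0), "Vertical": (block, 4)}))
```

A constant residual concentrates all of its energy in the single DC coefficient, which is why SATD tracks the transform-domain coding cost more closely than a plain sum of absolute differences would.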

In the reference software, SATD is calculated by applying the 4x4 discrete Hadamard transform (DHT) to the residuals of each prediction mode because of its simplicity. However, since the residuals are processed by the 4x4 discrete cosine transform (DCT) in the encoding flow, using the 4x4 DCT for SATD generates better results than the DHT does, with the side benefit that the 4x4 DCT need not be computed again.

However, the 4x4 DCT in H.264/AVC is divided into two parts: the 4x4 integer transform and the scalar multiplication factors (those involving the factors a and b), which are merged into the quantization stage, as shown in Fig. 35. The reference software adopts the DHT for its simplicity, as an approximation of the 4x4 integer transform. A better way to calculate SATD is to approximate the full 4x4 DCT, provided that the computational complexity stays as low as that of the DHT.

Fig. 35. 4X4 DCT transform of H.264/AVC

Fig. 37. dequant_coef table of inverse quantization.

First, we look at the equations of quantization and inverse quantization:

– Quantization: L = (abs(M) * quant_coef + qp_const) >> q_bits
– Inverse quantization: (L * dequant_coef) << qp_per

Here qp_per, q_bits and qp_const are derived from the quantization parameter. Quantization is calculated with a table look-up constant multiplication and an offset derived from the quantization parameter. Inverse quantization is calculated with only a table look-up constant multiplication. We use the quantization factors, quant_coef, shown in Fig. 36, or the inverse quantization factors, dequant_coef, shown in Fig. 37, to derive the scaling factors.

– quant_coef: [0][0]:[0][1]:[1][1] ~= 30:19:12
– 1/dequant_coef: [0][0]:[0][1]:[1][1] ~= 32:25:20
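The two sets of ratios can be checked numerically. The sketch below assumes the QP % 6 = 0 table entries as they are commonly given for the reference software (13107, 8066, 5243 for quant_coef and 10, 13, 16 for dequant_coef); these values and the helper name `scaled_ratio` are recalled for illustration and are not taken from Fig. 36 or Fig. 37. With these entries, the quant_coef values themselves stand in a ratio of roughly 30:18.5:12, and the reciprocals of dequant_coef in a ratio of roughly 32:24.6:20.

```python
# Assumed QP % 6 == 0 entries at positions [0][0], [0][1], [1][1].
QUANT_COEF_QP0 = {"[0][0]": 13107, "[0][1]": 8066, "[1][1]": 5243}
DEQUANT_COEF_QP0 = {"[0][0]": 10, "[0][1]": 13, "[1][1]": 16}


def scaled_ratio(values, anchor_key, anchor_value):
    """Scale all values so that values[anchor_key] maps to anchor_value."""
    scale = anchor_value / values[anchor_key]
    return {key: value * scale for key, value in values.items()}


# Weights proportional to quant_coef, anchored so that [1][1] equals 12.
quant_ratio = scaled_ratio(QUANT_COEF_QP0, "[1][1]", 12)

# Weights proportional to 1/dequant_coef, anchored so that [0][0] equals 32.
inverse_dequant = {key: 1.0 / value for key, value in DEQUANT_COEF_QP0.items()}
dequant_ratio = scaled_ratio(inverse_dequant, "[0][0]", 32)

if __name__ == "__main__":
    print({key: round(value, 2) for key, value in quant_ratio.items()})
    print({key: round(value, 2) for key, value in dequant_ratio.items()})
```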

Fig. 38. Modified SATD calculation method

Fig. 38 shows our modified method of SATD calculation. In our simulation, the scalar factors derived from inverse quantization perform better than the factors derived from quantization. The reason is that the quantization process is also affected by the offset qp_const. The modified mode decision method gives better results than the reference software.

5.2.2. Intra Prediction Mode

In H.264/AVC intra coding, intra prediction and mode decision are the two most computation-intensive components, since all prediction modes are examined to find the best mode. A parallel architecture is needed to accelerate these components.

After analyzing the intra prediction modes, we can separate them into four types, as shown in Fig. 39. In the bypass type, the prediction samples are the same as the boundary pixels. In the linear type, the prediction samples are linear interpolations of the boundary pixels. In the average type, the prediction samples are the average of all boundary pixels. In the plane type, the prediction samples approximate a bilinear transform using only integer arithmetic, as shown in Fig. 40. The equation of the plane mode is more complex than those of the other modes, and its hardware is hard to share with them.

However, by simulation we found that intra prediction with the plane mode reduces the bit-rate by only about 1% compared with prediction without it. This 1% bit-rate difference can easily be compensated by the enhanced cost function, so that the result is almost the same as that of the basic method in the reference software.

Fig. 39 Four types of intra prediction modes

Fig. 40. Equations of plane mode prediction

The simulation results are shown in Fig. 41 to Fig. 48. Thus, we decided to implement intra coding without the plane prediction mode, based on the trade-off between cost and performance.

Fig. 42. RD curve of Foreman (PSNR versus bit-rate; curves: FS, Proposed)

Fig. 43. RD curve of container

Fig. 44. RD curve of stefan

Fig. 45. RD curve of football

Fig. 46. RD curve of mobile and calendar

Fig. 47. RD curve of tempete

Fig. 49 Architecture of Intra Coding

5.3. Architecture Design of H.264/AVC Intra Coding

5.3.1. System Architecture Design

Fig. 49 shows the intra encoding architecture, which corresponds directly to the coding flow shown in Fig. 31. The architecture consists of the intra prediction unit, the transform unit, the quantization unit and the CAVLC unit. First, the intra prediction unit generates the prediction values for the current block. Then, for each possible mode, the residual pixels after prediction are transformed by the 4x4 integer transform or the DHT (for the DC values of Intra16x16 or Chroma8x8). The transform coefficients are then used to compute the proposed cost function and determine the best mode. The Intra4x4 block with the lower cost is preserved in the buffer. After the best Intra4x4 block is obtained, it goes through the reconstruction path to generate the boundary samples required by the next 4x4 block. The quantized data and the mode information are coded by CAVLC and UVLC, respectively.

intra4x4 block reconstruction, intra16x16 prediction process is inserted into these bubble cycles of intra predictor generation unit to pre-compute the Intra16 cost. Thus, the utilization of intra predictor is improved.

2. Early start of next 4x4 block prediction: before all the boundary samples are available, the prediction mode that uses only the upper samples (the vertical prediction mode) can be started earlier than the other modes.

3. Intra16x16 DC value pre-computing: in the H.264/AVC standard, the sixteen DC coefficients of the Intra16x16 mode have to be transformed again by the DHT. Thus, during reconstruction, the inverse DCT of the AC coefficients cannot start before the inverse DHT, which would require a macroblock-size buffer to store the AC coefficients of the sixteen 4x4 blocks. Using the Intra16x16 prediction insertion described in technique 1, the best Intra16x16 DC values after the DHT are pre-computed and passed from the Q/IQ stage to the DC registers of the IDCT/IDHT stage. This not only saves a macroblock-size buffer but also reduces the overall computation cycles.

5.3.2. Intra Predictor Generation Unit

A reconfigurable, 4-pixel-parallel intra predictor generation unit is proposed. It supports nine Intra4x4 modes, three Intra16x16 modes, and three Chroma8x8 modes. After analyzing the prediction modes, we find that the prediction samples are derived from the boundary pixels using four types of arithmetic equations:

1. (A+B+1)>>1
2. (A+2B+C+2)>>2
3. Bypass (for Vertical and Horizontal modes)
4. DC (Intra4x4: average of 8 pixels, Intra16x16: average of 32 pixels)

(A, B and C are reconstructed boundary pixels.)
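The four equation types can be written out as a short Python sketch. The function names are illustrative, and the rounding of the DC average (adding half the pixel count before shifting) is an assumption that follows the usual H.264/AVC convention when all boundary pixels are available.

```python
def avg2(a, b):
    """Type 1: rounded two-pixel average, (A+B+1)>>1."""
    return (a + b + 1) >> 1


def filt3(a, b, c):
    """Type 2: rounded three-tap filter, (A+2B+C+2)>>2."""
    return (a + 2 * b + c + 2) >> 2


def bypass(a):
    """Type 3: the boundary pixel is copied unchanged."""
    return a


def dc(pixels):
    """Type 4: rounded average of a power-of-two number of boundary pixels."""
    n = len(pixels)
    if n == 0 or n & (n - 1):
        raise ValueError("pixel count must be a power of two (e.g. 8 or 32)")
    shift = n.bit_length() - 1
    return (sum(pixels) + n // 2) >> shift


if __name__ == "__main__":
    print(avg2(1, 2), filt3(10, 20, 30), bypass(7), dc([10] * 8))
```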

Fig. 50 shows the proposed reconfigurable architecture of the intra predictor generation unit. The architecture reuses the partial sums of neighboring predictors to reduce the adder count. For example, in Intra4x4:

Predictor1 = B+2C+D = (B+C)+(C+D)
Predictor2 = A+2B+C = (A+B)+(B+C)

Thus, B+C can be reused to generate two predictor outputs. Some examples are shown in Fig. 51 to Fig. 55.
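The partial-sum reuse can be illustrated in software as follows. This is only a sketch of the arithmetic, not of the data path in Fig. 50; the function names are hypothetical. Each adjacent pair sum is formed once and then shared by the two three-tap outputs that need it.

```python
def filt3_direct(p):
    """Reference form: (p[i] + 2*p[i+1] + p[i+2] + 2) >> 2 for each i."""
    return [(p[i] + 2 * p[i + 1] + p[i + 2] + 2) >> 2
            for i in range(len(p) - 2)]


def filt3_shared(p):
    """Same outputs, with every adjacent pair sum computed only once."""
    pair = [p[i] + p[i + 1] for i in range(len(p) - 1)]
    # Output i is (p[i]+p[i+1]) + (p[i+1]+p[i+2]) + 2, reusing pair[i+1]
    # again for output i+1.
    return [(pair[i] + pair[i + 1] + 2) >> 2 for i in range(len(pair) - 1)]


if __name__ == "__main__":
    boundary = [10, 20, 30, 40, 50]  # hypothetical boundary pixels
    print(filt3_shared(boundary))
```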

Fig. 50. Reconfigurable data path of intra predictor generation unit

Fig. 51. Data path of diagonal down right

Fig. 52. Data path of vertical right

Fig. 53. Data path of horizontal down mode

Fig. 54. Data path of DC prediction mode

Fig. 55. Data path of horizontal mode

Fig. 56. Coding order of residual blocks

5.3.3. Transform Unit

In H.264/AVC, the residual macroblock is divided into sixteen 4x4 luma blocks and eight 4x4 chroma blocks, as shown in Fig. 56. All the 4x4 blocks are transformed by the integer transform. If the intra prediction mode is Intra16x16, the DC values of the sixteen luma blocks are transformed again by the 4x4 discrete Hadamard transform. The 2x2 DC values of the chroma blocks after the DCT are also transformed by a 2x2 DHT.

The transform matrices of the DCT, the IDCT, and the Hadamard transform are shown in Fig. 57 to Fig. 59. The coefficients of the transform matrices have even or odd symmetry in each row or column, so the transforms can be implemented with additions and shifts.

The number of additions in each 1-D transform can be reduced from 16 to 8 with butterflies. The fast algorithm and its butterfly structure are shown in Fig. 60. Because the two forward transforms have the same structure and do not operate at the same time in our system architecture, we can merge them to save area. The inverse DCT and the inverse DHT are merged in the same way as the forward transforms. The transform unit uses an architecture similar to that in [18]. Fig. 61 shows the hardware architecture of the transform unit.
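A minimal sketch of the merged forward butterfly is given below. It assumes the standard H.264/AVC forward core matrix with rows (1,1,1,1), (2,1,-1,-2), (1,-1,-1,1), (1,-2,2,-1) and the usual 4x4 Hadamard matrix; the `shift` parameter, which selects between the two transforms, is an illustrative way of modeling the merged data path and is not taken from Fig. 60 or Fig. 61.

```python
CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
HADAMARD = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]


def butterfly_1d(x, shift):
    """1-D forward transform with 8 additions.

    shift=1 gives the integer DCT core, shift=0 the Hadamard transform.
    """
    s0, s1 = x[0] + x[3], x[1] + x[2]   # stage 1: two sums
    d0, d1 = x[0] - x[3], x[1] - x[2]   # stage 1: two differences
    return [s0 + s1,                    # stage 2: four more additions
            (d0 << shift) + d1,
            s0 - s1,
            d0 - (d1 << shift)]


def matrix_1d(matrix, x):
    """Direct matrix-vector product, used only as a reference."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(butterfly_1d(sample, 1), butterfly_1d(sample, 0))
```

Because the two transforms differ only in the shift applied to the difference terms, a single butterfly with a selectable shift can serve both, which is the kind of sharing the merged unit relies on.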


Fig. 58. Transform matrix of 4x4 IDCT transform


Fig. 59. Transform matrix of Hadamard transform

Fig. 60. Fast algorithms of 4x4 transform

Fig. 61. Hardware architecture of transform unit

5.3.4. Quantization Unit

The quantization and inverse quantization units are shown in Fig. 62. The constants quant_coef, dequant_coef, qp_const, qp_shift, and qp_per are implemented as look-up tables indexed by the QP value. The design also uses the data guarding technique to reduce power consumption when the input value is zero.
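The quantization and inverse quantization equations of Section 5.2.1 can be sketched in Python as follows. Several details are assumptions recalled from common descriptions of the reference software rather than taken from this chapter: qp_per = QP / 6, q_bits = 15 + qp_per, qp_const = 2^q_bits / 3 for intra blocks, and the QP % 6 = 0 table entries. To stay short, the sketch carries only those entries, so it accepts only QP values that are multiples of 6. The early return for zero inputs is merely a software analogue of data guarding.

```python
# Assumed QP % 6 == 0 entries, indexed by position class (see below).
QUANT_COEF_QP0 = (13107, 8066, 5243)
DEQUANT_COEF_QP0 = (10, 13, 16)
Q_BITS = 15  # assumed base shift


def position_class(i, j):
    """0 if both indices are even, 2 if both are odd, 1 otherwise."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 2
    return 1


def qp_params(qp):
    """Derive qp_per, q_bits and the intra rounding offset qp_const."""
    if qp % 6 != 0:
        raise ValueError("this sketch only carries the tables for QP % 6 == 0")
    qp_per = qp // 6
    q_bits = Q_BITS + qp_per
    qp_const = (1 << q_bits) // 3
    return qp_per, q_bits, qp_const


def quantize(m, i, j, qp):
    """L = (abs(M) * quant_coef + qp_const) >> q_bits, with the sign restored."""
    if m == 0:
        return 0  # zero inputs skip the multiplier entirely
    _, q_bits, qp_const = qp_params(qp)
    level = (abs(m) * QUANT_COEF_QP0[position_class(i, j)] + qp_const) >> q_bits
    return level if m > 0 else -level


def dequantize(level, i, j, qp):
    """(L * dequant_coef) << qp_per."""
    qp_per, _, _ = qp_params(qp)
    return (level * DEQUANT_COEF_QP0[position_class(i, j)]) << qp_per


if __name__ == "__main__":
    level = quantize(1000, 0, 0, 24)
    print(level, dequantize(level, 0, 0, 24))
```

The dequantized value is not on the same scale as the input coefficient, because the scaling factors of the forward and inverse transforms are folded into the two tables.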

Fig. 62. Hardware architecture of quantization unit

After the whole macroblock has been processed, the mode with the minimum cost is selected as the best intra prediction mode.

Fig. 63. Hardware architecture of mode decision unit

Fig. 64. CAVLC architecture

5.3.6. CAVLC Unit

The architecture of CAVLC is shown in Fig. 64. The CAVLC encoding process can be divided into two phases: the scanning phase and the encoding phase. The input of CAVLC is four transformed coefficients per cycle. The scanning phase skips the zero coefficients and scans only the nonzero ones, in inverse zigzag order, to speed up the encoding phase. The data are then sent to the corresponding look-up tables in parallel. The resulting codes are buffered and concatenated to form the final bitstream.
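The scanning phase can be sketched in software as below. The sketch assumes the standard 4x4 zigzag order for frame coding and a block stored in raster order; the function name `scan_phase` and the returned structure (nonzero levels in inverse zigzag order, each paired with the number of zeros that precede it in forward zigzag order) are illustrative choices, not the interface of the unit in Fig. 64.

```python
# Assumed 4x4 zigzag order, expressed as raster indices (row * 4 + column).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def scan_phase(block):
    """Return (levels, runs) for a 16-entry raster-order coefficient block.

    levels lists the nonzero coefficients in inverse zigzag order;
    runs[k] counts the zeros immediately before levels[k] in zigzag order.
    """
    zigzag = [block[index] for index in ZIGZAG_4X4]
    levels, runs = [], []
    run = 0
    for coeff in reversed(zigzag):
        if coeff != 0:
            if levels:
                runs.append(run)  # closes the run of the previous level
            levels.append(coeff)
            run = 0
        elif levels:
            run += 1              # zeros after the last nonzero are skipped
    if levels:
        runs.append(run)
    return levels, runs


if __name__ == "__main__":
    block = [0, 3, -1, 0,
             0, -1, 1, 0,
             1, 0, 0, 0,
             0, 0, 0, 0]
    print(scan_phase(block))
```

Only the nonzero levels and their run lengths reach the table look-ups, so the work of the encoding phase depends on the number of nonzero coefficients rather than on the block size.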

Fig. 65. Memory Organization

5.3.7. Memory Organization

In the proposed architecture, two components have memories, organized as shown in Fig. 65. The source buffer stores the input data four pixels at a time, row by row. The coefficient buffer is divided into two parts to facilitate DC value access in Intra16x16 mode. With the ping-pong architecture, the data input phase and the entropy coding phase can be pipelined to improve the encoding throughput.

With the pipelined architecture, 1086 cycles are spent per macroblock, as shown in Fig. 67. The proposed architecture therefore needs a clock of only about 117.28 MHz to meet the real-time requirement of HDTV 720p (1280x720@30Hz).
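The clock requirement follows from simple arithmetic, sketched below under the assumption that the 1086 cycles are the per-macroblock cycle count and that a macroblock covers 16x16 luma pixels. The function names are illustrative.

```python
def required_clock_hz(width, height, fps, cycles_per_mb):
    """Clock frequency needed to process every macroblock in real time."""
    macroblocks_per_frame = (width // 16) * (height // 16)
    return macroblocks_per_frame * fps * cycles_per_mb


def luma_pixel_rate(clock_hz, cycles_per_mb):
    """Luma pixels encoded per second at a given clock frequency."""
    return clock_hz / cycles_per_mb * 256


if __name__ == "__main__":
    # 80 x 45 = 3600 macroblocks per 720p frame.
    print(required_clock_hz(1280, 720, 30, 1086))
    print(luma_pixel_rate(125e6, 1086))
```

Under these assumptions, 3600 macroblocks x 30 frames/s x 1086 cycles gives 117,288,000 cycles/s, consistent with the quoted 117.28 MHz. The same cycle count at a 125 MHz clock corresponds to roughly 29.46M luma pixels/s.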

Fig. 66. Timing schedule of proposed intra coder.

Fig. 67. Timing schedule of proposed architecture

5.4. Implementation Results

To evaluate the accuracy and efficiency of the proposed architecture, the design is implemented in UMC 0.18 µm 1P6M CMOS technology with a cell-based design flow. The chip has an area of 2.4x2.4 mm² (pad limited), as shown in Fig. 68. The design can run at 125 MHz under worst-case conditions. Thus, it can easily support 29.46M pixels/s still image encoding and real-time intra coding of HDTV 720p@30fps video when clocked at 117.28 MHz. Therefore, it is suitable for digital video and digital camera applications.

Table 9. List of gate count

Module                                    Gate count
Intra Predictor                                 3507
Q/IQ                                           22082
DCT (with DC register)                          9985
IDCT (with DC register)                         9836
Boundary Reconstruction Unit                   15697
Cost Generation and Mode Decision Unit         10315
UVLC/CAVLC                                     11965
Controller                                      2781
Boundary Predictor Buffer                       6465
Total                                          92633

Technology: UMC 0.18 µm 1P6M CMOS
Voltage: 1.8 V (Core), 3.3 V (I/O)
Die size: 2.4×2.4 mm²
Core size: 1.28×1.28 mm²
SRAM (all single port):
  Coefficient buffer: 104 x 64 bits x 2 banks
  Source buffer: 96 x 32 bits x 1 bank

Fig. 68 Chip specification

Chapter 6 Conclusion

In this thesis, our contribution has three parts. The first contribution is a deblocking filter architecture that accelerates the deblocking process. The two proposed architectures not only reduce the memory size but also achieve higher speed. The idea is to rearrange the data flow to achieve higher data reusability.

The second contribution is a fast intra coding algorithm that reduces the computational complexity of Intra4x4 prediction. Only six modes are examined instead of the nine modes of the full search method. The fast intra prediction algorithm saves 33% of the computational complexity with only about 1% bit-rate loss. The final contribution is an intra coding architecture that speeds up the computation of intra-frame coding. The proposed cost function gives better quality, and the complex plane mode is skipped to save area. The prediction process is carefully scheduled to achieve high utilization. We hope that our research results can contribute to the convenience of human life.

[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, July 2003.

[3] Information Technology - Generic Coding of Moving Picture and Associated Audio Information: Video, ISO/IEC 13818-2 and ITU-T Recommendation H.262, 1996.

[4] Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263, Feb. 1998.

[5] Information Technology - Coding of Audio-Visual Objects - Part 2: Visual, ISO/IEC 14496-2, 1999.

[6] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan, "Performance comparison of video coding standards using Lagrangian coder control," in Proc. of IEEE International Conference on Image Processing, 2002, vol. 2, pp. 501-504.

[7] Y.-L. Lee and H. W. Park, "Loop filtering and post-filtering for low-bitrates moving picture coding," Signal Processing: Image Commun., vol. 16, pp. 871-890, 2001.

[8] S. D. Kim, J. Yi, H. M. Kim, and J. B. Ra, "A deblocking filter with two separate modes in block-based video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 156-160, Feb. 1999.

[9] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz, "Adaptive deblocking filter," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 614-619, Jul. 2003.

[10] H.264/AVC reference software JM7.2, Jul. 2003.

[11] Y.-W. Huang, T.-W. Chen, B.-Y. Hsieh, T.-C. Wang, T.-H. Chang, and L.-G. Chen, "Architecture design for deblocking filter in H.264/JVT/AVC," in Proc. of Multimedia and Expo, vol. 1, pp. 693-696, Jul. 2003.

[12] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC), Mar. 2003.

[13] H.264/AVC reference software JM8.2, Jul. 2004.

[14] B. Meng and O. C. Au, "Fast intra-prediction mode selection for 4x4 blocks in H.264," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, pp. III-389-92, 6-10 April 2003.

[15] B. Meng, O. C. Au, C.-W. Wong, and H.-K. Lam, "Efficient intra-prediction mode selection for 4x4 blocks in H.264," in Proc. of Int. Conf. on Multimedia and Expo, vol. 3, pp. III-521-4, 6-9 July 2003.

[16] F. Pan, X. Lin, R. Susanto, K. P. Lim, Z. G. Li, G. N. Feng, D. J. Wu, and S. Wu, "Fast Mode Decision for Intra Prediction," JVT-G013, 7th Meeting, Pattaya II, Thailand, 7-14 March 2003.

[17] Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, "Performance comparison: H.26L intra coding vs. JPEG2000," JVT-D039, Klagenfurt, Austria, 22-26 July 2002.

[18] T.-C. Wang, Y.-W. Huang, H.-C. Fang, and L.-G. Chen, "Parallel 4x4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264," in Proc. IEEE Int. Symp. Circuits and Systems, 2003, pp. 800-803.

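National Tainan First Senior High School (Sept. 1996 - June 1999)

B.S., Department of Electronics Engineering, National Chiao Tung University (Sept. 1999 - June 2003)

M.S., System Group, Institute of Electronics, National Chiao Tung University (Sept. 2003 - June 2005)

Awards:

- ROC academic year 93, University IC Design Contest (IC Contest), Graduate/Undergraduate Cell-based Design Category, Excellence Award

- Asia and South Pacific Design Automation Conference (ASP-DAC) 2005, Best Award of Student Design Contest

- ROC academic year 92, University Silicon IP Design Contest (IP Contest), "Star Video Motion Estimation Engine QME", Soft IP Open-Topic Category, Distinction Award

- ROC academic year 91, Yin Chih-Tung Electronics Experiment Project Scholarship. Project title: Automatic generation of Area-Effective Bit-Serial FIR Filters

- ROC academic year 91, first semester (senior year), Department of Electronics Engineering Academic Excellence Award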
