CHAPTER 4 FAST 4X4 INTRA PREDICTION ALGORITHM FOR H.264/AVC
4.4. Simulation Result and Discussion
The proposed three-step algorithm and the full search are simulated on five CIF sequences: mobile and calendar, foreman, Stefan, news, and coastguard. For each sequence, 300 frames are encoded with intra-frame coding. We simulate these sequences with five different fixed QP values, from 12 to 44, as shown in Table 5 to Table 8. The RD curves are shown in Fig. 7 to Fig. 11.
From the results, the bit-rate increases by about 1% with almost the same PSNR. The bit-rate increase grows as QP goes from low to high, but when QP is high the increase becomes smaller. This phenomenon may be related to the Intra16 mode, an intra prediction mode for 16x16 blocks. In the high-QP case, Intra16 is selected more often, and the Intra16 mode decision still uses the full search algorithm, which leaves fewer blocks affected by the fast 4x4 mode selection.
For high-motion and low-motion test sequences, the bit-rate increase is almost the same, because each picture is intra coded without using information from other frames. The comparison with [16] is shown in Table 5 to Table 8 for four different QP values, from 28 to 40. The proposed algorithm outperforms the previous approach.
Table 5. QP = 28, Comparison results
Sequence
Table 6. QP = 32, Comparison results
Sequence
News 4.26 1.20 0.000 -0.02 -31.41 -15.29
Paris 3.25 1.21 0.013 -0.01 -30.45 -15.46
Tempete 3.11 1.00 0.051 -0.02 -31.23 -16.16
Table 8. QP = 40, Comparison results
Sequence
Container 5.18 1.00 0.001 -0.03 -32.75 -16.21
News 5.31 1.38 0.006 -0.03 -31.64 -15.16
Paris 4.91 1.58 0.003 -0.04 -30.39 -15.23
Tempete 3.67 0.78 0.024 -0.03 -31.17 -16.09
CHG BIT: change in bit-rate
CHG PSNR: change in PSNR
CHG T_I: change in intra encoding time
CHG T_AVG: change in average encoding time
Fig. 26. RD-curve of mobile & calendar (PSNR versus bit-rate; curves: FS and TSS)
Fig. 28. RD-curve of Stefan (horizontal axis: bit-rate)
Fig. 29. RD-curve of news (PSNR versus bit-rate; curves: FS and TSS)
Fig. 30. RD-curve of coastguard (PSNR versus bit-rate; curves: FS and TSS)
4.5. Summary of Proposed Intra Prediction Algorithm
We propose a three-step intra prediction mode selection algorithm.
Computation is reduced by examining only six of the nine Intra4x4 prediction modes. Simulation results suggest that the three-step algorithm achieves a PSNR similar to that of the full search with only about a 1% increase in bit-rate.
performance of intra-only coding also makes it suitable for still image coding, which is competitive with JPEG2000 [17].
In this chapter, an HDTV-size H.264/AVC intra encoder chip for digital camera and digital video applications is presented. The chip reduces the gate count by omitting the costly plane mode and enhances the video quality with an improved cost function. With careful scheduling and high-performance function units, the developed chip can easily support 29.46M pixels/s still image encoding and real-time intra coding of HDTV 720p@30fps video when clocked at 117.28MHz in a 0.18um CMOS process.
Fig. 31. Flow of H.264/AVC intra coding
5.1. Fundamental of H.264/AVC Intra Coding
The intra coding flow of H.264/AVC is shown in Fig. 31. The macroblock data are predicted with one of nine 4x4 luma prediction modes, four 16x16 luma prediction modes, and four 8x8 chroma prediction modes. The prediction mode with the minimum cost value is then selected as the best mode. The residuals after prediction are further processed by transform, Q/IQ, and inverse transform, and are reconstructed as the reference for the next macroblock. The quantized coefficients and the mode information are encoded by entropy coding, CAVLC and UVLC.
Fig. 32. Modes of Intra4x4
Fig. 33. Modes of Intra16x16.
There are three classes of intra prediction modes: Intra4x4 and Intra16x16 for luma sample prediction, and chroma8x8 for chroma sample prediction. Different from the AC/DC prediction of MPEG-4, H.264/AVC uses directional spatial information from neighboring, already coded blocks to predict the current sample values. Fig. 32 shows the modes of Intra4x4: eight directional modes and one DC prediction mode are adopted. Fig. 33 shows the modes of Intra16x16, which is used for smooth texture. Intra4x4 is more suitable for high-quality applications, while Intra16x16 is suitable for low-bitrate applications. Fig. 34 shows the modes of chroma8x8, which are the same as those of Intra16x16 but with different mode numbers.
5.2. Hardware Oriented Algorithm Modification
5.2.1. Proposed Mode Decision Method
In the intra encoding flow, the mode decision method is the most important factor in the coding performance. Two mode decision methods are used in the reference software: the basic mode decision method and the rate-distortion optimization (RDO) mode decision method.
The basic mode decision method calculates the cost from a table look-up mode cost and the sum of absolute transformed differences (SATD). The RDO mode decision method uses a weighted sum of the actual encoded bitrate and the distortion of the reconstructed samples as the cost. Although the RDO mode decision method achieves the best performance, it is computationally intensive and thus not suitable for high-performance or real-time encoder implementation. Therefore, our intra encoder adopts the basic method for the mode decision stage, as shown below.
Basic cost generation function:
Cost = Cost_of_Mode + SATD
In the reference software, SATD is calculated by applying the 4x4 discrete Hadamard transform (DHT) to the residuals of each prediction mode because of its simplicity. However, since the residuals are processed by the 4x4 discrete cosine transform (DCT) in the encoding flow, computing SATD with the 4x4 DCT should give better results than the DHT, with the side benefit of avoiding a second computation of the 4x4 DCT.
However, the 4x4 DCT in H.264/AVC is divided into two parts: the 4x4 integer transform and the scalar multiplication factors (the ones with factors a and b), which are merged into the quantization stage, as shown in Fig. 35. The reference software adopts the DHT as a simple approximation of the 4x4 integer transform. A better way to calculate SATD is to approximate the full 4x4 DCT, but with computational complexity as low as that of the DHT.
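The basic cost function with a Hadamard-based SATD can be sketched as follows. This is a minimal illustration, not the reference software's code: the function names `satd_4x4` and `basic_cost` are hypothetical, the residual is assumed to be a 4x4 list of integers, and any normalization that a particular encoder applies to the SATD value is left out.

```python
# 4x4 Hadamard matrix; it is symmetric, so it equals its own transpose.
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(residual):
    """Sum of absolute values of H4 * residual * H4 (unnormalized)."""
    transformed = matmul4(matmul4(H4, residual), H4)
    return sum(abs(v) for row in transformed for v in row)


def basic_cost(cost_of_mode, residual):
    """Cost = Cost_of_Mode + SATD, as in the basic cost generation function."""
    return cost_of_mode + satd_4x4(residual)
```

A mode decision loop would call `basic_cost` once per candidate mode and keep the mode with the smallest value.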
Fig. 35. 4X4 DCT transform of H.264/AVC
Fig. 37. dequant_coef table of inverse quantization.
First, we look at the equations of quantization and inverse quantization.
Quantization
– L = (abs(M) * quant_coef + qp_const) >> q_bits
Inverse quantization
– L * dequant_coef << qp_per
qp_per, q_bits and qp_const are derived from the quantization parameter. Quantization is calculated with a table look-up constant multiplication and an offset derived from the quantization parameter. Inverse quantization is calculated with only a table look-up constant multiplication. We use the quantization factors, quant_coef, shown in Fig. 36, or the inverse quantization factors, dequant_coef, shown in Fig. 37, to derive the scaling factors.
– 1/quant_coef: [0][0]:[0][1]:[1][1] ~= 30:19:12
– 1/dequant_coef: [0][0]:[0][1]:[1][1] ~= 32:25:20
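The two equations and the derivation of the 32:25:20 ratio can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the sign of M is restored after quantization (the equation above only shows the magnitude), and the dequant_coef values 10, 13 and 16 for positions [0][0], [0][1] and [1][1] are assumed from the standard's scaling table for the first QP class, since Fig. 37 is not reproduced here.

```python
def quantize(m, quant_coef, qp_const, q_bits):
    """L = (abs(M) * quant_coef + qp_const) >> q_bits, with the sign of M restored."""
    level = (abs(m) * quant_coef + qp_const) >> q_bits
    return -level if m < 0 else level


def dequantize(level, dequant_coef, qp_per):
    """L * dequant_coef << qp_per."""
    return (level * dequant_coef) << qp_per


def inverse_ratio(coefs, scale):
    """Integer ratio proportional to 1/coef, using an illustrative scale factor."""
    return [round(scale / c) for c in coefs]


# Assumed dequant_coef entries for positions [0][0], [0][1], [1][1].
DEQUANT_SAMPLE = [10, 13, 16]
SCALING = inverse_ratio(DEQUANT_SAMPLE, 320)
```

With the scale chosen as 320, `SCALING` reproduces the 32:25:20 ratio quoted above for 1/dequant_coef.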
Fig. 38. Modified SATD calculation method
Fig. 38 shows our modified method of SATD calculation. In our simulation, the scalar factors derived from inverse quantization perform better than the factors derived from quantization. The reason is that the quantization process is also affected by the offset qp_const. The result of the modified mode decision method is better than that of the reference software.
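One possible reading of the modified SATD is sketched below. Fig. 38 is not reproduced here, so the details are assumptions: the residual is passed through the 4x4 integer transform of H.264/AVC, and each absolute coefficient is weighted by 32, 25 or 20 depending on whether its position falls in the [0][0], [0][1] or [1][1] class. The function names are hypothetical and no normalization is applied.

```python
# Forward 4x4 integer transform matrix of H.264/AVC.
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]

# Assumed weights taken from the 1/dequant_coef ratio 32:25:20.
WEIGHT_EVEN_EVEN, WEIGHT_MIXED, WEIGHT_ODD_ODD = 32, 25, 20


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def integer_transform_4x4(residual):
    """Compute CF * residual * CF^T."""
    cf_t = [list(col) for col in zip(*CF)]
    return matmul4(matmul4(CF, residual), cf_t)


def position_weight(i, j):
    """Weight class of coefficient (i, j): even/even, odd/odd, or mixed."""
    if i % 2 == 0 and j % 2 == 0:
        return WEIGHT_EVEN_EVEN
    if i % 2 == 1 and j % 2 == 1:
        return WEIGHT_ODD_ODD
    return WEIGHT_MIXED


def weighted_satd(residual):
    """Weighted sum of absolute integer-transform coefficients (unnormalized)."""
    coeffs = integer_transform_4x4(residual)
    return sum(position_weight(i, j) * abs(coeffs[i][j])
               for i in range(4) for j in range(4))
```

Because the integer transform output is the same data that later enters quantization, a hardware design following this sketch would not need a separate Hadamard pass for cost generation.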
5.2.2. Intra Prediction Mode
In H.264/AVC intra coding, intra prediction and mode decision are the two computation-intensive components. All prediction modes are examined to find the best mode, so a parallel architecture is needed to accelerate these components.
After analyzing the intra prediction modes, we can separate them into four types, as shown in Fig. 39. In the bypass type, the prediction samples are the same as the boundary pixels. In the linear type, the prediction samples are linear interpolations of the boundary pixels. In the average type, the prediction samples are the average of all boundary pixels. In the plane type, the prediction samples approximate a bilinear transform using only integer arithmetic, as shown in Fig. 40. The equation of the plane mode is more complex than those of the other modes and is hard to share with them.
However, by simulation we found that intra prediction with the plane prediction mode reduces the bit-rate by only about 1% compared with prediction without the plane mode. This 1% bit-rate difference can easily be compensated by the enhanced cost function, which achieves almost the same result as the basic method in the reference software.
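To show why the plane type is costly compared with the other three types, the sketch below follows the Intra16x16 plane prediction equations as they are commonly stated for H.264/AVC. Fig. 40 is not reproduced here, so this is an assumption rather than a copy of the figure; the function name and the argument layout (16 top pixels, 16 left pixels, one corner pixel, 8-bit samples) are hypothetical.

```python
def plane_pred_16x16(top, left, corner):
    """Return a 16x16 prediction block indexed as pred[y][x]."""
    def t(i):
        # Top boundary; index -1 refers to the corner pixel.
        return corner if i < 0 else top[i]

    def l(i):
        # Left boundary; index -1 refers to the corner pixel.
        return corner if i < 0 else left[i]

    # Weighted horizontal and vertical gradients of the boundary.
    h = sum((k + 1) * (t(8 + k) - t(6 - k)) for k in range(8))
    v = sum((k + 1) * (l(8 + k) - l(6 - k)) for k in range(8))
    a = 16 * (left[15] + top[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6

    def clip(value):
        # Clip to the 8-bit sample range.
        return max(0, min(255, value))

    return [[clip((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)] for y in range(16)]
```

Each output sample needs multiplications by position-dependent factors, which is unlike the shift-and-add equations of the bypass, linear and average types and explains why the data path is hard to share.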
Fig. 39. Four types of intra prediction modes
Fig. 40. Equations of plane mode prediction
The simulation results are shown in Fig. 41 to Fig. 48. Thus, we decide to implement the intra coding without the plane prediction mode, based on the trade-off between cost and performance.
Fig. 42. RD curve of Foreman (PSNR versus bit-rate; curves: FS and Proposed)
Fig. 43. RD curve of container (horizontal axis: bit-rate)
Fig. 44. RD curve of stefan (PSNR versus bit-rate; curves: FS and Proposed)
Fig. 45. RD curve of football (PSNR versus bit-rate; curves: FS and Proposed)
Fig. 46. RD curve of mobile and calendar (PSNR versus bit-rate; curves: FS and Proposed)
Fig. 47. RD curve of tempete
Fig. 49. Architecture of Intra Coding
5.3. Architecture Design of H.264/AVC Intra Coding
5.3.1. System Architecture Design
Fig. 49 shows the intra encoding architecture, which corresponds directly to the coding flow shown in Fig. 31. The architecture consists of the intra prediction unit, transform unit, quantization unit and CAVLC unit. First, the intra prediction unit generates the prediction values for the current block.
Then, for each possible mode, the residual pixels after prediction are transformed by the 4x4 integer transform or the DHT (for the DC values of Intra16x16 or Chroma8x8). These transform coefficients are then used by the proposed cost function to determine the best mode. The Intra4x4 block with the lower cost is preserved in the buffer. After the best Intra4x4 block is obtained, it goes through the reconstruction path to generate the boundary samples required by the next 4x4 block. The quantized data and the mode information are coded by CAVLC and UVLC, respectively.
intra4x4 block reconstruction, intra16x16 prediction process is inserted into these bubble cycles of intra predictor generation unit to pre-compute the Intra16 cost. Thus, the utilization of intra predictor is improved.
2. Early start of next 4x4 block prediction: before all the boundary samples are available, the prediction mode that uses only the upper samples (vertical prediction mode) can be started earlier than the other modes.
3. Intra16x16 DC value pre-computing: In the H.264/AVC standard, the sixteen DC coefficients of the Intra16x16 mode have to be transformed again by the DHT. Thus, during reconstruction, the inverse DCT of the AC coefficients cannot start before the inverse DHT, and this would require a macroblock-size buffer to store the AC coefficients of the sixteen 4x4 blocks. Using the Intra16x16 prediction insertion mentioned in technique 1, the best Intra16x16 DC values after the DHT are pre-computed and passed from the Q/IQ stage to the DC registers of the IDCT/IDHT stage. Not only is a macroblock-size buffer saved, but the overall computation cycles are also reduced.
5.3.2. Intra Predictor Generation Unit
A reconfigurable, 4-pixel-parallel intra predictor generation unit is proposed. It supports nine Intra4x4 modes, three Intra16x16 modes, and three Chroma8x8 modes. After analyzing the prediction modes, we find that the prediction samples are derived from the boundary pixels using four types of arithmetic equations:
1. (A+B+1)>>1
2. (A+2B+C+2)>>2
3. Bypass (for Vertical, Horizontal mode)
4. DC (Intra4x4: average of 8 pixels, Intra16x16: average of 32 pixels)
(A, B and C are reconstructed boundary pixels)
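The four equation types listed above can be written out as small functions. This is a sketch with hypothetical function names; the rounding offset in the DC case (half the pixel count added before the shift) is an assumption, since the text only says "average", and the pixel count is assumed to be a power of two.

```python
def avg2(a, b):
    """Type 1: two-tap average with rounding, (A+B+1)>>1."""
    return (a + b + 1) >> 1


def filt3(a, b, c):
    """Type 2: three-tap filter with rounding, (A+2B+C+2)>>2."""
    return (a + 2 * b + c + 2) >> 2


def bypass(a):
    """Type 3: the predictor is the boundary pixel itself."""
    return a


def dc(pixels):
    """Type 4: rounded average of 8 (Intra4x4) or 32 (Intra16x16) pixels."""
    n = len(pixels)
    shift = n.bit_length() - 1  # log2(n) when n is a power of two
    return (sum(pixels) + (n >> 1)) >> shift
```

For example, `filt3(1, 2, 3)` evaluates (1+4+3+2)>>2, which is 2.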
Fig. 50 shows the proposed reconfigurable architecture of the intra predictor generation unit. The architecture reuses the partial sums of neighboring predictors to reduce the adder count.
For example, in Intra4x4:
Predictor1 = B+2C+D = (B+C)+(C+D)
Predictor2 = A+2B+C = (A+B)+(B+C)
Thus, B+C can be reused to generate two predictor outputs. Some examples are shown in Fig. 51 to Fig. 55.
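The partial-sum reuse in the example above can be checked with a short sketch. The function names are hypothetical; `pixels` is assumed to be a list of consecutive boundary pixels (A, B, C, D, ...), and the rounding term and shift of the three-tap equation are applied after the shared sums are combined.

```python
def filt3_direct(a, b, c):
    """Three-tap predictor computed independently: (A+2B+C+2)>>2."""
    return (a + 2 * b + c + 2) >> 2


def filt3_shared(pixels):
    """Three-tap predictors for consecutive pixels, sharing pairwise sums."""
    # Each pairwise sum such as B+C is computed once...
    pair = [pixels[i] + pixels[i + 1] for i in range(len(pixels) - 1)]
    # ...and used by the two neighboring predictors that contain it.
    return [(pair[i] + pair[i + 1] + 2) >> 2 for i in range(len(pair) - 1)]
```

Both forms give identical outputs; the shared form only changes how many additions are needed when several neighboring predictors are produced in the same cycle.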
Fig. 50. Reconfigurable data path of intra predictor generation unit
Fig. 51. Data path of diagonal down right
Fig. 52. Data path of vertical right
Fig. 53. Data path of horizontal down mode
Fig. 54. Data path of DC prediction mode
Fig. 55. Data path of horizontal mode
Fig. 56. Coding order of residual blocks
5.3.3. Transform Unit
In H.264/AVC, the residual macroblock is divided into 16 4x4 luma blocks and 8 4x4 chroma blocks, as shown in Fig. 56. All the 4x4 blocks are transformed by the integer transform. If the intra prediction mode is Intra16x16, the DC values of the 16 luma blocks are transformed again by the 4x4 discrete Hadamard transform. The 2x2 DC values of the chroma blocks after the DCT are also transformed by a 2x2 DHT.
The transform matrices of the DCT, IDCT, and Hadamard transform are shown in Fig. 57 to Fig. 59. The coefficients of the transform matrices have even or odd symmetry in each row or column and can be implemented with additions and shifts. The number of additions in each 1D transform can be reduced from 16 to 8 with butterflies. The fast algorithm and its butterfly structure are shown in Fig. 60. Because the two forward transforms have the same structure and do not operate at the same time in our system architecture, we can merge them to save area. The inverse DCT and inverse DHT are merged in the same way as the forward transforms. The transform unit uses an architecture similar to that in [18]. Fig. 61 shows the hardware architecture of the transform unit.
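A 1D butterfly with eight additions that serves both forward transforms can be sketched as follows. The function name and the `hadamard` flag are hypothetical, and the sketch assumes the usual H.264/AVC forward integer matrix with rows (1,1,1,1), (2,1,-1,-2), (1,-1,-1,1), (1,-2,2,-1) and the Hadamard matrix with rows (1,1,1,1), (1,1,-1,-1), (1,-1,-1,1), (1,-1,1,-1); it is not taken from Fig. 60.

```python
def butterfly_1d(x, hadamard=False):
    """1D forward 4-point transform using 8 additions (plus 2 shifts for the DCT)."""
    # First stage: four additions or subtractions.
    s0 = x[0] + x[3]
    s1 = x[1] + x[2]
    d0 = x[0] - x[3]
    d1 = x[1] - x[2]
    # Second stage: four more; only the odd outputs differ between the transforms.
    if hadamard:
        return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]
```

The only difference between the two modes is whether the odd outputs use a shifted operand, which is why one data path with a small multiplexer can serve both transforms. A 2D transform applies the function to every row and then to every column.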
Fig. 58. Transform matrix of 4x4 IDCT transform
Fig. 59. Transform matrix of Hadamard transform
Fig. 60. Fast algorithms of 4x4 transform
Fig. 61. Hardware architecture of transform unit
5.3.4. Quantization Unit
The quantization and inverse quantization units are shown in Fig. 62. The constant values of quant_coef, dequant_coef, qp_const, qp_shift, and qp_per are implemented as look-up tables indexed by the QP value. The design also uses the data guarding technique to reduce power consumption when the input value is zero.
Fig. 62. Hardware architecture of quantization unit
After the whole macroblock is processed, the mode with the minimum cost is selected as the best intra prediction mode.
Fig. 63. Hardware architecture of mode decision unit
Fig. 64. CAVLC architecture
5.3.6. CAVLC Unit
The architecture of CAVLC is shown in Fig. 64. The CAVLC encoding process can be divided into two phases, the scanning phase and the encoding phase. The input of CAVLC is four transformed coefficients per cycle. The scanning phase skips the zero coefficients and scans only the nonzero ones in the inverse zigzag scan order to speed up the encoding phase. Then, the data are sent to the corresponding look-up tables in parallel. The resulting codes are buffered and concatenated to form the final bitstream.
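The scanning phase can be illustrated with a simplified software sketch. The function name is hypothetical, the block is assumed to be a 4x4 list in raster order, and the output format (each nonzero level paired with the number of zeros directly before it in zigzag order) is an assumption chosen for illustration; the variable-length code tables of the encoding phase are not modeled.

```python
# Raster indices of a 4x4 block in zigzag (frame) scan order.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def scan_nonzero_reverse(block):
    """Return (level, zeros_before) pairs, visiting coefficients in reverse zigzag order."""
    flat = [v for row in block for v in row]
    zigzag = [flat[i] for i in ZIGZAG_4X4]
    symbols = []
    pending = None  # most recent nonzero level whose zero run is still being counted
    run = 0
    for value in reversed(zigzag):
        if value != 0:
            if pending is not None:
                symbols.append((pending, run))
            pending = value
            run = 0
        elif pending is not None:
            run += 1  # zeros after the last nonzero coefficient are never counted
    if pending is not None:
        symbols.append((pending, run))
    return symbols
```

Only the nonzero coefficients produce symbols, so a sparse block reaches the encoding phase with far fewer items than its sixteen positions.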
Fig. 65. Memory Organization
5.3.7. Memory Organization
In the proposed architecture, two components have memories. The organization of the memories is shown in Fig. 65. The source buffer stores the input data row by row, 4 pixels per entry. The coefficient buffer is divided into two parts to facilitate DC value access in Intra16x16 mode. By using a ping-pong architecture, the data input phase and the entropy coding phase can be pipelined to improve the encoding throughput.
1086 cycles are spent per macroblock in the pipelined architecture, as shown in Fig. 67. The proposed architecture therefore needs a clock of only about 117.28MHz to meet the HDTV 720p (1280x720@30Hz) real-time requirement.
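The clock requirement can be reproduced with a few lines of arithmetic. The sketch assumes that the 1086 cycles are counted per 16x16 macroblock and that the pixel rate counts only the 256 luma samples of each macroblock; the function names are hypothetical.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks in a frame whose sides are multiples of 16."""
    return (width // 16) * (height // 16)


def required_clock_hz(width, height, fps, cycles_per_mb):
    """Clock frequency needed to process every macroblock in real time."""
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb


def luma_pixel_rate(clock_hz, cycles_per_mb):
    """Luma pixels processed per second at a given clock frequency."""
    return clock_hz * 256 // cycles_per_mb


CLOCK_720P = required_clock_hz(1280, 720, 30, 1086)
```

Under these assumptions, 720p needs 3600 macroblocks x 30 frames x 1086 cycles = 117,288,000 cycles per second, which matches the 117.28MHz figure. A clock of 125 MHz would give roughly 29.46M luma pixels per second, which suggests that the still-image rate quoted in this chapter corresponds to the worst-case clock.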
Fig. 66. Timing schedule of proposed intra coder.
Fig. 67. Timing schedule of proposed architecture
5.4. Implementation Results
To evaluate the accuracy and efficiency of the proposed architecture, the design is implemented in UMC 0.18µm 1P6M CMOS technology with a cell-based design flow. The chip has an area of 2.4x2.4 mm2 (pad limited), as shown in Fig. 68. The design can achieve 125 MHz in the worst case. Thus, it can easily support 29.46M pixels/s still image encoding and real-time intra coding of HDTV 720p@30fps video when clocked at 117.28MHz. Therefore, it is suitable for digital video or camera applications.
Table 9. List of gate count
Intra Predictor 3507
Q/IQ 22082
DCT(with DC register) 9985
IDCT(with DC register) 9836
Boundary Reconstruction Unit 15697
Cost Generation and Mode Decision Unit 10315
UVLC/CAVLC 11965
Controller 2781
Boundary Predictor Buffer 6465
Total 92633
Technology: UMC 0.18 µm 1P6M CMOS
Voltage: 1.8 V (Core), 3.3 V (I/O)
Die Size: 2.4×2.4 mm2
Core size: 1.28x1.28mm
SRAM (all single port):
Coefficient buffer: 104 x 64 bits x 2 banks
Source buffer: 96 x 32 bits x 1 bank
Fig. 68. Chip specification
Chapter 6 Conclusion
In this thesis, our contribution is in three parts. The first contribution is the deblocking filter architecture, which accelerates the deblocking process. The two proposed architectures not only reduce the memory size but also run at higher speed. The idea is to rearrange the data flow to achieve higher data reusability.
The second contribution is the fast intra coding algorithm, which reduces the computational complexity of Intra4x4 prediction. Six modes are examined instead of the nine modes of the full search method. The fast intra prediction algorithm saves 33% of the computational complexity with only about 1% bit-rate loss. The final contribution is the intra coding architecture, which speeds up the computation of intra-frame coding. The proposed cost function gives better quality, and the complex plane mode is skipped to save area. The prediction process is well scheduled to achieve high utilization. We hope that our research results can contribute to the convenience of human life.
[2] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, July 2003.
[3] Information Technology - Generic Coding of Moving Picture and Associated Audio Information: Video, ISO/IEC 13818-2 and ITU-T Recommendation H.262, 1996.
[4] Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263, Feb. 1998.
[5] Information Technology - Coding of Audio-Visual Objects - Part 2: Visual, ISO/IEC 14496-2, 1999.
[6] A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan, "Performance comparison of video coding standards using Lagrangian coder control," in Proc. IEEE International Conference on Image Processing, 2002, vol. 2, pp. 501-504.
[7] Y.-L. Lee and H. W. Park, "Loop filtering and post-filtering for low-bitrates moving picture coding," Signal Processing: Image Commun., vol. 16, pp. 871-890, 2001.
[8] S. D. Kim, J. Yi, H. M. Kim, and J. B. Ra, "A deblocking filter with two separate modes in block-based video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 156-160, Feb. 1999.
[9] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz, "Adaptive deblocking filter," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 614-619, Jul. 2003.
[10] H.264/AVC reference software JM7.2, Jul. 2003.
[11] Y.-W. Huang, T.-W. Chen, B.-Y. Hsieh, T.-C. Wang, T.-H. Chang, and L.-G. Chen, "Architecture design for deblocking filter in H.264/JVT/AVC," in Proc. of Multimedia and Expo, vol. 1, pp. 693-696, Jul. 2003.
[12] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC), Mar. 2003.
[13] H.264/AVC reference software JM8.2, Jul. 2004.
[14] B. Meng and O. C. Au, "Fast intra-prediction mode selection for 4x4 blocks in H.264," in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, pp. III-389-392, 6-10 April 2003.
[15] B. Meng, O. C. Au, Chi-Wah Wong, and Hong-Kwai Lam, "Efficient intra-prediction mode selection for 4x4 blocks in H.264," in Proc. of Int. Conf. on Multimedia and Expo, vol. 3, pp. III-521-524, 6-9 July 2003.
[16] Feng Pan, Xiao Lin, Rahardja Susanto, Keng Pang Lim, Zheng Guo Li, Ge Nan Feng, Da Jun Wu, and Si Wu, "Fast Mode Decision for Intra Prediction," JVT-G013, 7th Meeting, Pattaya II, Thailand, 7-14 March 2003.
[17] Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, "Performance comparison: H.26L intra coding vs. JPEG2000," JVT-D039, Klagenfurt, Austria, 22-26 July 2002.
[18] T.-C. Wang, Y.-W. Huang, H.-C. Fang, and L.-G. Chen, "Parallel 4x4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264," in Proc. IEEE Int. Symp. Circuits and Systems, 2003, pp. 800-803.
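National Tainan First Senior High School (Sep. 1996 to Jun. 1999)
B.S., Department of Electronics Engineering, National Chiao Tung University (Sep. 1999 to Jun. 2003)
M.S., Systems Group, Institute of Electronics, National Chiao Tung University (Sep. 2003 to Jun. 2005)
Awards:
– Academic year 93 (2004): University IC Design Contest (IC Contest), Graduate/Undergraduate Cell-based Design Category, Excellent Award
– Asia and South Pacific Design Automation Conference (ASP-DAC) 2005 Best Award of Student Design Contest
– Academic year 92 (2003): University Silicon IP Design Contest (IP Contest), Star Video Motion Estimation Engine QME, Soft IP Open Topic Category, Outstanding Award
– Academic year 91 (2002): Yin Chih-Tung Electronics Experiment Project Scholarship
Project title: Automatic generation of Area-Effective Bit-Serial FIR Filters
– Academic year 91 (2002), first semester (senior year): Academic Excellence Award, Department of Electronics Engineering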