Chapter 2 Overview of H.264/AVC Standard
2.2. Components of Intra Coding
2.2.4. Quantization and De-quantization
In the quantization stage, there are 52 values of quantization parameters (QPs) and corresponding quantization steps supplied in H.264/AVC standard. The steps are doubled for increase of every six numbers in QPs. The quantization scaling factors are to change transform of 4x4 or 8x8 block size becoming integer transform to avoid the computational complexity and precision problem.
Chapter 3
H.264/AVC High Profile HDTV Encoder
Video compression technique becomes more and more important while the development of mobile video device and HDTV is growing up. H.264, the latest video standard is well adopted in HDTV and other application since it provides high video quality and excellent coding efficiency. These coding tools provide high coding efficiency, however, also takes huge computational complexity and memory requirement, especially in intra encoding flow. Therefore, hardware accelerator for intra encoding flow of an H.264 encoder, especially which supports high profile, is necessary for real time encoding requirement.
3.1. Design Challenges of Intra Encoding in Our Encoder
Although the previous work [30] is an excellent design as the intra coding flow of baseline profile, we still have many challenges if we extend the previous design to the intra encoding of our high profile encoder. In the followings, we will show the proposed techniques to solve these problems.
z Cycle counts of every MB stage:
In previous work [30], we double the throughput of intra prediction phase. So we only need computing cycles less than 560 cycles. We extend this design to our high profile encoder that the cycle counts of all MB stages are 600 cycles. To support this design in high profile encoder we use the similar intra prediction algorithm in previous design [30] and scheduling the overall encoder function.
z Structure hazard of reconstruction phase
Reconstruction phase has three functions, reconstructing boundary pixels of 4x4
block size for intra prediction, reconstructing boundary pixels of 8x8 block size for intra prediction, and reconstructing data of the final mode as reference. This phase is independent in the third stage by three stage pipelined architecture, but intra mode decision in second stage needs reconstructed boundary pixels. To solve this problem, the process of reconstruction phase is through the second stage and the third stage. We can reuse the reconstruction phase in boundary pixel reconstruction of intra 4x4 and 8x8 prediction, and reconstruction data as inter reference by this method. The scheduling of the reconstruction phase is shown in Section 3.2.1.
z Reduce hardware cost:
In previous design [30], it uses the variable-pixel parallel architecture to reduce hardware cost and utilization of reconstruction phase. But the hardware cost for using this architecture to high profile is too high to afford. We use scheduling methods in Section 3.2.1 and in Chapter 4 to reduce such high hardware cost.
3.2. Proposed Architecture
One challenge for high profile encoder is the special coding tools in main profile and high profile such as intra 8x8 prediction, 8x8 DCT, 8x8 quantization and de-quantization. If we extend current 4x4 prediction and DCT design to 8x8, the area will become four times than previous work. Therefore, hardware and memory share for these new coding tools and existing tools are necessary.
This section introduces the complete H.264 High Profile HDTV encoder. Fig 8 shows the block diagram of our proposed architecture. The encoder contains system control, bus arbiters, and six coding tools including: integer motion estimation (IME), fractional motion estimation (FME), intra prediction, reconstruction, deblocking filter engine, and entropy coding. Besides, internal SRAMs for reference data and residue data are also included in this design. The complete frame data and reconstructed result
are stored in external memory through bus arbiter and bus interface. The bus interface width is design for 128bits.
3.2.1. Scheduling of Encoder
Fig 7. The scheduling of our encoder
Fig 7 shows the scheduling of these three stages expect entropy coding functions.
There are three features in this scheduling. First, if all calculating MB orders are shown in Fig 7, we pre-load reference data for MB 3 in IME stage because of its high complexity. Second, FME and Intra can share residual SRAM and reference SRAM and load data from external memory in time by this special scheduling. Finally, the reconstruction process is through the second stage and the third stage.
During cycle 16 and 382 the reconstruction phase reconstructs data for computing intra prediction of MB 1 and filtering data of MB 0 and MB 1. Besides, after residual re-computation of a best mode is finished, the reconstruction is beginning to quantize them and reconstruct MB immediately in the second stage. This work continues to the third stage begins to finish the reconstruction of a MB and sends into the deblocking engine. By this method, we can quantize residuals needed by entropy coding in time to avoid waiting cycles in the third stage.
3.2.2. Function Units of Intra Encoding in our HP Encoder
This design with three stage pipelined architecture is different with previous works
[34][35] which use four stage pipeline architecture. The second stage is intra prediction and FME. The intra prediction works no matter current frame is I-frame, P-frame, or B-frame. The reconstruction is through the second stage and the third stage and the deblocking filer are placed in the third stage.
The partial data of left MB required by intra prediction is saved in local register for fast data access. Moreover, in this stage, after intra and FME prediction find their best mode and corresponding SATD cost, a final mode decision is also made in this stage.
The final decision and its residue are saved to residue SRAM for further process. The reference data of final mode will be saved in the best ref SRAM for data reuse in reconstruction.
As for the reconstruction stage, its process is through the second stage and the third stage. During the second stage, it computes the reconstruction data for intra prediction of 4x4 and 8x8 block size. And after DCT transform in a best mode re-computation is finished, the reconstruction phase is beginning to quantize them and reconstruct MB immediately in the second stage. After the third stage begins, the reconstruction stage continues to finish the reconstruction of a MB and send it into the deblocking engine.
When the reconstruction is finished, the deblocking engine filters the reconstructed MB and sends the final data to REC SRAM. Because the deblocking filter needs non-deblocking upper MB information, extra external memory is required to save the data of upper MB. The data in REC SRAM will be transferred to external memory by bus arbiter.
Fig 8. The block diagram of proposed H.264 high profile encoder
3.3.Conclusion
In this paper, we propose a high performance H.264 high profile encoder that can support 1080p resolution under 145MHz with smaller area. In the proposed design, we optimize the algorithm and architecture of intra encoding flow and pipelined schedule of the whole design to achieve a high throughput and low hardware cost design. Therefore, our design is much suitable for HDTV applications.
Chapter 4
Architecture Design of Intra Encoding Flow
In Chapter 3, an H.264/AVC high profile progressive encoder is proposed for high definition and all frame size video application. This design is proposed to support both encoding process for any frame size with high quality and low bitrate at the quite low clock rate. In this chapter, we describe the intra prediction, reconstruction and daglocking filter architecture in the encoder which is described in last chapter.
Since our previous work is codec, this encoding flow is also extended to codec design. In comparison with previous design, this work not only has the better quality with high profile encoding process but also supports up to the largest HD size 1080p.
Furthermore, with the modified three-step fast prediction and enhanced SATD algorithm, this codec has the same computing cycles in every MB and increases acceptable hardware cost. Those characteristics make it more suitable for video application products especially high quality requirements such as mobile TV encoder, video conference, HDTV, and so on.
4.1. Design Techniques for Proposed Intra Prediction
Although this work is mainly based on the previous architecture [30], directly extended previous work to the high profile introduce various design problems, such as lengthy cycle counts, structure hazards, data hazards and large area cost. In the followings, we will show the proposed techniques to solve these problems.
z Independent intra 8x8 path for low cycle count:
the intra prediction phase needs 506 cycles to predict the best mode of intra luma 4x4, luma 16x16, and chroma 8x8 block sizes that the intra prediction mode of luma 8x8 block size is impossible to reuse the same hardware under our timing constraints.
The new path for predicting intra luma 8x8 modes is needed to be added in intra prediction phase with only necessary hardware.
z To modify the reconstruction phase to eight-pixel parallel architecture:
If 8x8 transform in high profile is supported when we adopt four-pixel parallel architecture, the computing cycles for this transform is too many (32 cycles per transforming) and temporary register bits are too large when we still adopt 4-pixel parallel reconstruction architecture. So we need to modify the reconstruction circuits from 4-pixel parallel architecture to 8-pixell one.
z To schedule the SRAMs behavior between Intra and FME prediction:
The structure hazard occurs between Intra and FME prediction. For example, the residual coefficients need passing through quantization during computing cycles of the second stage. And reconstruction phase also need quantization circuits to reconstruct data. But we can’t use three quantization circuits which has large area in our design. Furthermore, all SRAMs between 2nd stage and 3rd stage also have similar structure hazards if we only use single port SARMs. The scheduling illustrated in previous chapter help us to avoid those hazards. How to use only one quantization circuits in our chip will be illustrated in Section 4.5.
z To increase the utilization of our design:
Although previous design uses the variable-pixel parallel architecture, its reconstruction phase still has only 11.4% utilization. To increase hardware utilization, we reuse the same reconstruction phase in boundary pixel reconstruction of intra 4x4 and 8x8 prediction, and reconstruction data as inter reference. With this method, the reconstruction phase in our chip can achieve 18.67% utilization. And the addition path in intra prediction can also have 56% utility by re-computation for intra 8x8 boundary values scheduling though this path is only for intra 8x8 prediction.
z Total methods to reduce hardware cost in intra encoding flow:
1. Reuse the reconstruction phase :
We remove the reconstruction circuits in intra prediction circuits and reuse these functions from 3rd stage. We can save 69K hardware cost by this consideration.
2. Avoid the structure hazard of quantization :
We don’t adopt ping-pong buffer architecture as show in previous work [30].
We place the quantization between the prediction residual buffer and entropy coefficients inputs buffer. Thus, we can quantize the prediction residual as entropy coefficients and reconstruct data as reference of other frames at the same time with only one quantization.
3. Reduce the temporary registers in additional path of intra prediction:
In intra 4x4 prediction modes, we uses the best residual buffer and prediction value buffer to reduce its boundary data reconstruction cycle time and complexity. But if we extend this method to intra 8x8 prediction modes, it needs ad least 2560 bits registers. This cost is so high that we adopt re-computation for intra 8x8 boundary values scheduling to save this cost.
4. Increase the utilization of reconstruction phase:
The method to increase the utilization of reconstruction phase is already illustrated. To support this method the functions in reconstruction can support not only 4x4 block size but also 8x8 block size.
5. Save the pipelined buffers between second stage and third stage:
One purpose in our chip is to save pipelined buffer. But it makes the structure hazards in second stage. We solve those hazards by overall scheduling illustrated in previous chapter and Section 4.5. With this scheduling the single port only prediction residual and reference SRAMs can be shared by intra and FME.
6. Remove the boundary registers between different pixel phase:
In previous work we use variable-pixel parallel architecture to save hardware.
But it needs 1344 bits boundary registers at least to fit this architecture and make the computing cycles larger than 600 cycles. That’s one reason why we modify the reconstruction phase from 4-pixel parallel architecture to 8-pixel parallel one.
4.2. Description of Prior Art
4.2.1. Survey of Fast Algorithm
For baseline profile encoding, intra prediction and SATD cost function for mode decision take almost 77% of computation in all functions [13]. This result is reasonable since there are so many modes to decision one best mode for a 4x4 block.
Thus, the SATD function has to be applied to unnecessary prediction modes quite a lot. If we can reduce the computation of intra prediction and its related SATD transform, we can reduce much computation time and power. Thus, we select the modified three-step algorithm in [28] to decrease prediction modes efficiently with acceptable performance loss. As shown as Fig 9. Simulation results in [28] show that it can save about 23% of computation for the intra prediction and related transform with a bit-rate loss of 0.68%. Besides, this algorithm is suitable for the hardware implementation with little comparison circuit.
Fig 9. Decision flow of the modified three-step algorithm for intra prediction
In determining the coding performance of H.264/AVC intra prediction quality, mode decision function is the most important part. To find a best matched prediction mode is to use RDO. Though RDO can provide the best performance, its high complexity hinders its usage in the hardware design. Thus, we adopt SATD method to predict the best mode. We combine the integer transform in [13] [14] and simplified multiplication factors called enhanced SATD (ESATD). The simplified multiplication factors are derived from quantization coefficients.
32
And, we reduce prediction modes further. Macro blocks predicted in plane mode is only 4.2% in average and not larger than 5.7% except the sequence “Akiyo” which
contains much smoother texture. Thus, we remove plane mode prediction with 1%
bit-rate loss, which can be easily compensated by ESATD.
4.2.2. Architecture Design in Previous Work
Boundary
8 pixels/cycle 4 pix/c 1 coef./cycle
4 pixels/cycle
Reconsturction Phase Quantization
Phase
Fig 10. Proposed architecture in previous work
The intra frame encoder design with the modified fast algorithm uses eight-pixel parallelism in the prediction phase but four-pixel parallelism in the reconstruction circuit. The eight-pixel parallel prediction phase which calculates two rows of one 4x4 block significantly improves the throughput and reduces the computing cycles for the computationally critical intra prediction generator by half. The total architecture includes a pair of boundary buffer, two intra prediction engines, an eight-input 4x4 transform, a cost generator with feedback signals for fast mode decision, and some registers. Since only blocks with best modes are allowed to pass through the quantization phase and reconstruction phase, these two phases adopt four-pixel
parallel architecture to save area. We use the current block and best block registers in the quantization phase and the FIFO registers to buffer the data. The bit stream phase, including CAVLC encoder, is similar to [11] with at least one coefficient per cycle.
4.3. Hardware Oriented Algorithm of Intra Prediction
Because the performance of the previous algorithm in [30] is not enough to encode the large frame size like 1080p which may work a little worse than original baseline profile in [10] as shown in Fig 16, we extend all fast intra prediction algorithms in previous work from baseline profile to high profile for this chip. For example, the 8x8 DCT transform and quantization are special functions which are not included in baseline profile. These functions are used in intra luma 8x8 modes prediction in high profile. Thus, we extend the ESATD function in previous work to intra luma 8x8 modes for computing its ESATD with 8x8 DCT transform (8x8 ESATD). For detail, because the multiplier of DC value is 0.25 in 4x4 DCT transform type and 0.125 in 8x8 DCT transform type, the shift parameter in 8x8 ESATD is one more than previous ESATD. Besides, the three step algorithm in [28] is also extended to intra luma 8x8 modes prediction and the plane mode is still removal. Table 1 shows all simulation results.
The Figures from Fig 11 to Fig 17 show the RD-curve diagrams of simulation results for all intra frame prediction by these four algorithms, original baseline profile algorithms, algorithm in [30], original high profile algorithm, and proposed one. We can find the obvious gap between the original high profile and baseline profile algorithm. It can be observed that the usage of the high profile algorithm can decrease 16.36% of bit-rate and increase 0.22 dB of PSNR in average. That’s the reason why we would like to develop this encoder.
The proposed algorithm also has better performance than the original high profile
algorithm in [10]. The Figures from Fig 18 to Fig 24 show the RD-curve diagrams of their simulation results. In most case, the combined algorithm still makes the coding performance in all intra frame prediction much similar in high QP range and even better in low QP range.
Table 1 Comparison among original baseline and high profile in [10], previous algorithm, and proposed one for all intra frames with 1080p size
30
Fig 11. RD-curve of these four algorithm for sequence “blue_sky”
34
Fig 12. RD-curve of these four algorithm for sequence “pedestrian_area”
32
Fig 13. RD-curve of these four algorithm for sequence “riverbed”
36
Fig 14. RD-curve of these four algorithm for sequence “rush_hour”
32
Fig 15. RD-curve of these four algorithm for sequence “station2”
35
Fig 16. RD-curve of these four algorithm for sequence “sunflower”
32
Fig 17. RD-curve of these four algorithm for sequence “tractor”
32
Fig 18. RD-curve of proposed algorithm and previous work for sequence “blue_sky"
35
Fig 19. RD-curve of proposed algorithm and previous work for sequence
“pedestrian_area"
Fig 20. RD-curve of proposed algorithm and previous work for sequence “riverbed"
37
Fig 21. RD-curve of proposed algorithm and previous work for sequence “rush_hour"
33
Fig 22. RD-curve of proposed algorithm and previous work for sequence “station2"
35
Fig 23. RD-curve of proposed algorithm and previous work for sequence “sunflower"
33
Fig 24. RD-curve of proposed algorithm and previous work for sequence “tractor"
4.4. Architecture Design of Intra Prediction
4.4.1. Overall Intra Prediction Circuit
This eight-pixel parallel prediction phase significantly improves the throughput and reduces the cycles for the computationally critical intra prediction path by half. But its throughput is still not enough to share the same hardware when we support intra luma 8x8 modes decision which increase 37.5% computation complexity. To solve this problem, we design the additional path to predict the best mode of intra luma 8x8 modes. To save hardware cost the three scheduling technique, parallel intra 8x8/4x4 computation, interlaced scheduling, and re-computation for intra 8x8 boundary values are used in this phase. We only add a pair of intra prediction engine, two 1-D eight-point eight-by-eight DCT transform, one more ESATD circuits, and a few small buffers. The total architecture includes two pair of boundary buffer, four intra prediction engines, an eight-input 4x4 transform, an eight-input 8x8 transform, two cost generator with feedback signals for fast mode decision, and some registers.
Although the additional path also calculates 8 pixels at the same time, it’s different
Although the additional path also calculates 8 pixels at the same time, it’s different