Chapter 1 Introduction
1.2. Thesis Organization
This thesis is organized into five chapters. Chapter 1 gives the introduction and motivation of this work. Chapter 2 gives a brief overview of the H.264/AVC standard and the intra frame encoding flow. Chapter 3 presents the challenges of designing the high profile encoder and its scheduling. Chapter 4 proposes the intra frame encoding flow and deblocking filter architecture of this encoder chip with a fast prediction technique, together with an intra-frame-only codec design based on this encoding flow. Finally, concluding remarks are given in Chapter 5.
Chapter 2
Overview of H.264/AVC Standard
Earlier standards such as MPEG-1 and MPEG-2 have enabled many popular consumer products such as video CDs and DVDs. As their successor, H.264/AVC offers markedly higher coding efficiency while maintaining the decoded video quality, which makes it more flexible across all kinds of applications. With highly developed signal processing and semiconductor technology, many complicated and computationally intensive coding tools can be supported efficiently in the H.264/AVC standard to improve its coding performance. However, this complexity makes real-time implementation in software alone difficult. To solve this problem, a hardware design is required to reduce the computing time.
2.1. Fundamental of H.264/AVC
2.1.1. Feature of Standard
Fig 1 shows the basic structure diagram of the H.264/AVC encoder, and Fig 2 shows the decoder. As in previous video coding standards, the structure is that of a hybrid coder.
Unlike prior video coding standards, H.264/AVC has many features that enhance coding efficiency by improving the prediction of picture content. We introduce them in the following.
1. Variable block-size motion estimation/compensation.
The standard has more flexibility in the selection of block sizes for motion estimation and compensation than any previous standard. Seven block sizes are introduced: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4. This helps to enhance the efficiency of coding irregularly shaped objects or background behind moving objects.
2. Quarter-sample-accurate motion vector
H.264/AVC enables quarter-sample motion vector accuracy, which first appeared in the advanced profile of the MPEG-4 Visual (Part 2) standard. However, this standard adopts a 6-tap filter to reduce the complexity of interpolation.
3. Multiple reference picture motion estimation and compensation
In previous standards, only one previous picture can be used to predict the values in the incoming picture. The H.264/AVC standard allows prediction across multiple reference pictures for better coding efficiency. In addition, the standard adopts bi-directional prediction coding, which uses both previous and next pictures as references.
4. Spatial-based directional intra prediction coding
In standards before MPEG-4, I-pictures are coded directly. The MPEG-4 Visual standard [3] adopts Intra-AC and Intra-DC prediction for the coding of I-pictures, which uses neighboring transformed blocks to perform the prediction and residual coding. However, these coding methods do not take advantage of the correlation among adjacent neighboring blocks. Thus, H.264/AVC introduces a spatial prediction technique for I-pictures that performs directional pixel prediction before the transform, using the reconstructed neighboring pixels to predict with modes from different directions. With this technique, the coding efficiency for I-pictures can be improved effectively.
5. Small block size integer transform
H.264/AVC uses not only the 8x8 transform block size of prior video standards but also a smaller transform size of 4x4. This allows the encoder to represent signals in a more locally adaptive fashion and reduces the artifacts that appear around edges in the picture.
6. In-loop deblocking filter
Block-based video coding may introduce blocking artifacts due to both the prediction and the residual difference coding in the decoding process. The standard uses an adaptive deblocking filter to solve this problem and improve the resulting video quality. Instead of being an optional feature as in H.263+, the deblocking filter in H.264/AVC is positioned inside the motion compensation loop as an in-loop filter, so that the quality improvement in a single picture extends to inter-picture prediction as well.
7. Context-adaptive entropy coding
For compression of quantized transform coefficients, an efficient variable-length coding (VLC) method is used in H.264/AVC. Two entropy coding methods are applied in H.264/AVC: CAVLC (context-adaptive variable length coding) and CABAC (context-adaptive binary arithmetic coding).
8. Arithmetic entropy coding
Another coding method, context-adaptive binary arithmetic coding (CABAC), is also included in H.264/AVC as the advanced entropy coding. This arithmetic coding can achieve higher efficiency than VLC coding because of its effective probability model of symbol occurrence. Both entropy coding methods use context-based adaptivity to improve performance relative to prior standards. In the standard, CAVLC is mainly used in the baseline profile and CABAC in the high profile, reflecting their respective coding efficiency and complexity.
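The sub-sample interpolation mentioned under feature 2 can be sketched as follows. This is a minimal illustration, not the interpolation unit of this design: the function names are hypothetical, and the tap values (1, -5, 20, 20, -5, 1) with rounding and a shift by 5 are the luma half-sample filter of the standard rather than something stated in this text. Quarter-sample values are then formed as a rounded average of two neighboring samples.

```python
# Hypothetical helper names; 8-bit samples are assumed.
HALF_SAMPLE_TAPS = (1, -5, 20, 20, -5, 1)

def half_sample(p):
    """Half-sample value between p[2] and p[3], from six integer samples."""
    if len(p) != 6:
        raise ValueError("six neighbouring samples are required")
    acc = sum(t * s for t, s in zip(HALF_SAMPLE_TAPS, p))
    # Round, divide by 32 (the taps sum to 32), and clip to the 8-bit range.
    return min(255, max(0, (acc + 16) >> 5))

def quarter_sample(a, b):
    """Quarter-sample value: rounded average of two neighbouring samples."""
    return (a + b + 1) >> 1

if __name__ == "__main__":
    print(half_sample([0, 0, 0, 255, 255, 255]))
    print(quarter_sample(10, 13))
```

Because the taps sum to 32, a flat region is reproduced exactly, and the final clip keeps overshoot near sharp edges inside the sample range.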
Fig 1. Basic structure diagram of H.264/AVC encoder
Fig 2. Basic structure diagram of H.264/AVC decoder
2.1.2. Profile and Level
Fig 3 shows four profiles defined in H.264/AVC, which are baseline, main, extended and high profiles. Baseline profile includes basic coding tools and features, such as I-slice without intra 8x8 prediction modes, P-slice, quarter-sample accurate motion vector, deblocking filter, and CAVLC.
The main profile is the mainstream consumer profile for broadcast systems and storage devices. It contains most of the features of the baseline profile plus other advanced techniques, such as adaptive frame/field coding, interlaced coding, weighted prediction, B-slices, and CABAC.
The extended profile, which includes all the features in baseline profile and main profile except CABAC, is intended as the streaming video profile and has relatively high compression capability with extra tricks for robustness to data losses and server stream switching.
Finally, the high profile is the most complex profile. It supports intra 8x8 prediction modes as well as transform and quantization with an 8x8 block size. In addition, it supports quantization scaling matrices in the encoder. The high profile achieves better performance in both bitrate saving and video quality, at the cost of much more computation. Thus, this profile is widely used in multimedia communication, especially where high quality and low bitrate are required.
Fig 3. Four profiles of H.264/AVC
2.2. Components of Intra Coding
2.2.1. Intra Prediction
Fig 4. Nine modes for intra 4x4 prediction
Fig 5. Nine modes for intra 8x8 prediction
Fig 6. Four modes for intra luma 16x16 or chroma 8x8 prediction
Spatial-domain prediction is the main feature of H.264/AVC intra coding. There are three kinds of intra prediction for luma components: nine 4x4 prediction modes, nine 8x8 prediction modes, and four 16x16 prediction modes.
The 4x4 prediction modes use the thirteen neighboring reconstructed samples, denoted A to M in Fig 4, to predict the block pixels with eight different directions and one average value. The 8x8 prediction modes are very similar to the 4x4 prediction modes, as shown in Fig 5. For the 16x16 prediction modes, the values are predicted from the 32 adjacent boundary pixels of the upper and left macroblocks. Similar procedures are applied to the chroma components, where four 8x8 prediction modes are used with 16 neighboring pixels.
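As an illustration of the 4x4 case, the sketch below implements three of the nine modes: vertical, horizontal, and the average (DC) mode. It assumes the usual labeling in which A to D are the samples above the block and I to L are the samples to its left, and that all neighbors are available. The function name and the string mode labels are hypothetical; the six diagonal modes are omitted for brevity.

```python
def predict_4x4(mode, above, left):
    """Return a 4x4 prediction block as four rows of four samples.

    above = [A, B, C, D] (reconstructed samples above the block)
    left  = [I, J, K, L] (reconstructed samples left of the block)
    """
    if mode == "vertical":
        # Every row copies the samples directly above the block.
        return [list(above) for _ in range(4)]
    if mode == "horizontal":
        # Every row is filled with the sample to its left.
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":
        # Rounded mean of the eight neighbouring samples.
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not covered by this sketch: " + mode)

if __name__ == "__main__":
    for row in predict_4x4("dc", [10, 20, 30, 40], [50, 60, 70, 80]):
        print(row)
```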
2.2.2. Cost Generation and Mode Decision
The best mode decision for intra prediction in [10] can use either the time-consuming rate distortion optimization (RDO) or a much simpler cost accumulation.
RDO uses the weighted sum of the actual encoded bitrate and the distortion measured on the reconstructed samples. Though it can achieve better performance, it is computationally intensive.
An alternative is cost accumulation. Two commonly used cost measures are available in [10]: the sum of absolute differences (SAD) and the sum of absolute transformed differences (SATD). The best mode is finally decided by comparing the accumulated cost of sixteen 4x4 blocks in the 4x4 prediction, four 8x8 blocks in the 8x8 prediction, and sixteen 4x4 blocks in the 16x16 prediction.
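The two cost measures can be sketched for a single 4x4 block as follows. This is a minimal illustration under stated assumptions: the SATD uses a 4x4 Hadamard transform of the residual, and the final division by 2 is a common software normalization that this text does not specify. The helper names are hypothetical, and the enhanced SATD algorithm of this design is not reproduced here.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def residual(orig, pred):
    return [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]

def sad(orig, pred):
    """Sum of absolute differences of a 4x4 block."""
    return sum(abs(v) for row in residual(orig, pred) for v in row)

def satd(orig, pred):
    """Sum of absolute Hadamard-transformed differences of a 4x4 block."""
    t = matmul(matmul(H4, residual(orig, pred)), H4)
    return sum(abs(v) for row in t for v in row) // 2  # assumed normalisation

def best_mode(costs):
    """Pick the mode with the smallest accumulated cost."""
    return min(costs, key=costs.get)

if __name__ == "__main__":
    orig = [[11] * 4 for _ in range(4)]
    pred = [[10] * 4 for _ in range(4)]
    print(sad(orig, pred), satd(orig, pred))
```

For a macroblock, the per-block costs would be accumulated over all blocks of a candidate partition before `best_mode` compares the totals.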
2.2.3. Transform
The transform can be divided into two parts: the 4x4 or 8x8 integer transform, and its fractional scalar multiplication factors, which are merged into the quantization stage. With this method, the transform avoids the precision problem that occurred in previous standards. For a macroblock predicted by the intra luma 16x16 or intra chroma modes, the DC value of each transformed block is further processed by a 4x4 or 2x2 discrete Hadamard transform (DHT). The inverse transform units behave similarly to the transform units.
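The integer part of the 4x4 transform can be sketched as a pair of matrix products, Y = Cf X Cf^T. The matrix below is the forward core transform of the standard; the scaling factors that complete the DCT approximation are left to the quantization stage and are not modeled. The function names are hypothetical, and a hardware implementation would use add and shift butterflies instead of general multiplications.

```python
# Forward core transform matrix of the 4x4 integer transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def matmul(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def core_transform_4x4(block):
    """Integer-only forward transform of a 4x4 residual block."""
    return matmul(matmul(CF, block), transpose(CF))

if __name__ == "__main__":
    ramp = [[0, 1, 2, 3] for _ in range(4)]
    for row in core_transform_4x4(ramp):
        print(row)
```

Every intermediate value is an integer, which is why the encoder and decoder cannot drift apart through rounding in the transform itself.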
2.2.4. Quantization and De-quantization
In the quantization stage, the H.264/AVC standard supplies 52 values of the quantization parameter (QP) with corresponding quantization steps. The step doubles for every increase of six in QP. The quantization scaling factors absorb the fractional multiplications of the 4x4 or 8x8 transform, so that the transform itself becomes an integer transform, avoiding computational complexity and precision problems.
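The doubling rule means that only six base step sizes are needed. The sketch below assumes the commonly tabulated base values for QP 0 to 5 (0.625 to 1.125), which are not listed in this text; the function name is hypothetical.

```python
# Quantization step sizes for QP 0..5; larger QPs repeat them scaled by 2.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)

def qstep(qp):
    """Quantization step size for a QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    # The step doubles each time QP increases by six.
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))

if __name__ == "__main__":
    for qp in (0, 6, 12, 28, 51):
        print(qp, qstep(qp))
```

In hardware this structure lets a quantizer store six multipliers and realize the remaining QP range with a shift by QP/6.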
Chapter 3
H.264/AVC High Profile HDTV Encoder
Video compression is becoming more and more important as mobile video devices and HDTV continue to develop. H.264, the latest video standard, is widely adopted in HDTV and other applications since it provides high video quality and excellent coding efficiency. Its coding tools provide high coding efficiency but also incur huge computational complexity and memory requirements, especially in the intra encoding flow. Therefore, a hardware accelerator for the intra encoding flow of an H.264 encoder, especially one that supports the high profile, is necessary to meet real-time encoding requirements.
3.1. Design Challenges of Intra Encoding in Our Encoder
Although the previous work [30] is an excellent design for the intra coding flow of the baseline profile, many challenges remain if we extend it to the intra encoding of our high profile encoder. In the following, we show the proposed techniques to solve these problems.
• Cycle counts of every MB stage:
In the previous work [30], we doubled the throughput of the intra prediction phase, so fewer than 560 computing cycles are needed. We extend this design to our high profile encoder, in which every MB stage takes 600 cycles. To support this, we use an intra prediction algorithm similar to that of the previous design [30] and schedule the overall encoder functions accordingly.
• Structure hazard of reconstruction phase:
The reconstruction phase has three functions: reconstructing boundary pixels of 4x4 block size for intra prediction, reconstructing boundary pixels of 8x8 block size for intra prediction, and reconstructing data of the final mode as reference. In a three-stage pipelined architecture this phase would sit by itself in the third stage, but the intra mode decision in the second stage needs reconstructed boundary pixels. To solve this problem, the reconstruction phase spans the second and third stages. In this way, the same reconstruction phase is reused for boundary pixel reconstruction of intra 4x4 and 8x8 prediction and for reconstructing data as inter reference. The scheduling of the reconstruction phase is shown in Section 3.2.1.
• Reduce hardware cost:
The previous design [30] uses a variable-pixel parallel architecture to reduce the hardware cost and raise the utilization of the reconstruction phase. However, the hardware cost of applying this architecture to the high profile is too high to afford. We use the scheduling methods in Section 3.2.1 and in Chapter 4 to reduce this cost.
3.2. Proposed Architecture
One challenge for the high profile encoder is the special coding tools in the main and high profiles, such as intra 8x8 prediction, 8x8 DCT, and 8x8 quantization and de-quantization. If we extended the current 4x4 prediction and DCT design to 8x8, the area would become four times that of the previous work. Therefore, hardware and memory sharing between these new coding tools and the existing tools is necessary.
This section introduces the complete H.264 High Profile HDTV encoder. Fig 8 shows the block diagram of our proposed architecture. The encoder contains system control, bus arbiters, and six coding tools: integer motion estimation (IME), fractional motion estimation (FME), intra prediction, reconstruction, deblocking filter engine, and entropy coding. In addition, internal SRAMs for reference data and residue data are included in this design. The complete frame data and reconstructed results are stored in external memory through the bus arbiter and bus interface. The bus interface is designed to be 128 bits wide.
3.2.1. Scheduling of Encoder
Fig 7. The scheduling of our encoder
Fig 7 shows the scheduling of these three stages, except for the entropy coding functions.
There are three features in this scheduling. First, for the MB processing order shown in Fig 7, we pre-load reference data for MB 3 in the IME stage because of its high complexity. Second, with this special scheduling, FME and Intra can share the residual SRAM and reference SRAM and load data from external memory in time. Finally, the reconstruction process spans the second and third stages.
Between cycles 16 and 382, the reconstruction phase reconstructs data for computing the intra prediction of MB 1 and for filtering the data of MB 0 and MB 1. In addition, once the residual re-computation of the best mode is finished, the reconstruction phase immediately begins to quantize the residuals and reconstruct the MB in the second stage. This work continues into the third stage, where the reconstruction of the MB is finished and the result is sent to the deblocking engine. In this way, the residuals needed by entropy coding are quantized in time, avoiding waiting cycles in the third stage.
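The overlap of macroblocks across the three stages can be pictured with a toy model. This is a simplified sketch, not the schedule of Fig 7: it assumes that each stage works on one MB per time slot and that consecutive MBs enter the pipeline back to back. The stage labels and the function name are illustrative only, and the fact that reconstruction already starts in the second stage is not modeled.

```python
# Illustrative stage labels; entropy coding is left out, as in the text.
STAGES = ("IME", "FME + Intra", "Reconstruction + Deblocking")

def schedule(num_mbs):
    """Return, for each time slot, the MB index held by each stage.

    An entry is None when the stage is idle (pipeline fill or drain).
    """
    slots = []
    for t in range(num_mbs + len(STAGES) - 1):
        row = []
        for s in range(len(STAGES)):
            mb = t - s  # stage s lags stage 0 by s slots
            row.append(mb if 0 <= mb < num_mbs else None)
        slots.append(row)
    return slots

if __name__ == "__main__":
    for t, row in enumerate(schedule(4)):
        print(t, dict(zip(STAGES, row)))
```

In steady state, slot t holds MB t in the first stage, MB t-1 in the second, and MB t-2 in the third, which is why a resource needed by two stages at once creates a structure hazard.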
3.2.2. Function Units of Intra Encoding in our HP Encoder
This design uses a three-stage pipelined architecture, unlike previous works [34][35], which use a four-stage pipeline. The second stage performs intra prediction and FME. Intra prediction runs regardless of whether the current frame is an I-frame, P-frame, or B-frame. The reconstruction spans the second and third stages, and the deblocking filter is placed in the third stage.
The partial data of the left MB required by intra prediction is saved in local registers for fast data access. Moreover, after intra prediction and FME find their best modes and corresponding SATD costs, the final mode decision is also made in this stage.
The final decision and its residue are saved to the residue SRAM for further processing. The reference data of the final mode is saved in the best ref SRAM for reuse in reconstruction.
As for the reconstruction, it spans the second and third stages. During the second stage, it computes the reconstruction data for intra prediction of 4x4 and 8x8 block sizes. Once the DCT in the best-mode re-computation is finished, the reconstruction phase immediately begins to quantize the coefficients and reconstruct the MB in the second stage. After the third stage begins, it finishes the reconstruction of the MB and sends it to the deblocking engine.
When the reconstruction is finished, the deblocking engine filters the reconstructed MB and sends the final data to REC SRAM. Because the deblocking filter needs the non-deblocked information of the upper MB, extra external memory is required to store the data of the upper MB. The data in REC SRAM is transferred to external memory by the bus arbiter.
Fig 8. The block diagram of proposed H.264 high profile encoder
3.3. Conclusion
In this chapter, we propose a high performance H.264 high profile encoder that can support 1080p resolution at a clock rate of 145 MHz with a smaller area. In the proposed design, we optimize the algorithm and architecture of the intra encoding flow and the pipelined schedule of the whole design to achieve high throughput and low hardware cost. Therefore, our design is well suited to HDTV applications.
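The relation between clock rate, cycles per MB stage, and frame rate can be checked with a short calculation. This is a rough budget sketch under stated assumptions: the frame is padded to whole 16x16 macroblocks, one MB leaves the pipeline every 600 cycles (the stage length given in Section 3.1), and no cycles are lost to bus stalls or pipeline fill. The target frame rate is not stated in this section, so the result is only an upper bound under these assumptions. The function names are hypothetical.

```python
def mbs_per_frame(width, height):
    """Number of 16x16 macroblocks after padding to whole MBs."""
    return ((width + 15) // 16) * ((height + 15) // 16)

def max_fps(clock_hz, cycles_per_mb, width, height):
    """Upper bound on frames per second for a fully utilised MB pipeline."""
    return clock_hz / (cycles_per_mb * mbs_per_frame(width, height))

if __name__ == "__main__":
    print(mbs_per_frame(1920, 1080))               # 120 x 68 macroblocks
    print(round(max_fps(145e6, 600, 1920, 1080), 1))
```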
Chapter 4
Architecture Design of Intra Encoding Flow
In Chapter 3, an H.264/AVC high profile progressive encoder was proposed for high-definition and arbitrary-frame-size video applications. The design supports encoding of any frame size with high quality and low bitrate at a quite low clock rate. In this chapter, we describe the intra prediction, reconstruction, and deblocking filter architecture of the encoder described in the last chapter.
Since our previous work is a codec, this encoding flow is also extended to a codec design. Compared with the previous design, this work not only achieves better quality through the high profile encoding process but also supports frame sizes up to the largest HD size, 1080p.
Furthermore, with the modified three-step fast prediction and enhanced SATD algorithm, this codec uses the same number of computing cycles for every MB at an acceptable increase in hardware cost. These characteristics make it more suitable for video application products, especially those with high quality requirements such as mobile TV encoders, video conferencing, and HDTV.
4.1. Design Techniques for Proposed Intra Prediction
Although this work is mainly based on the previous architecture [30], directly extending the previous work to the high profile introduces various design problems, such as lengthy cycle counts, structure hazards, data hazards, and large area cost. In the following, we show the proposed techniques to solve these problems.
• Independent intra 8x8 path for low cycle count:
The intra prediction phase needs 506 cycles to predict the best mode of the intra luma 4x4, luma 16x16, and chroma 8x8 block sizes, so intra prediction of the luma 8x8 block size cannot reuse the same hardware under our timing constraints. A new path for predicting intra luma 8x8 modes therefore has to be added to the intra prediction phase, with only the necessary hardware.
• Modify the reconstruction phase to an eight-pixel parallel architecture:
If the 8x8 transform of the high profile were supported with the four-pixel parallel reconstruction architecture, the transform would take too many computing cycles (32 cycles per transform) and too many temporary register bits. We therefore modify the reconstruction circuits from a 4-pixel parallel architecture to an 8-pixel one.
• Schedule the SRAM behavior between intra and FME prediction:
A structure hazard occurs between intra and FME prediction. For example, the residual coefficients need to pass through quantization during the computing cycles of the second stage, and the reconstruction phase also needs quantization circuits to reconstruct data. However, we cannot afford three quantization circuits, which would occupy a large area in our design. Furthermore, all SRAMs between the 2nd and 3rd stages have similar structure hazards if we use only single-port SRAMs. The scheduling illustrated in the previous chapter helps us avoid these hazards. How to use only one quantization circuit in our chip is illustrated in Section 4.5.
• Increase the utilization of our design:
Although the previous design uses the variable-pixel parallel architecture, its reconstruction phase still has only 11.4% utilization. To increase hardware utilization, we reuse the same reconstruction phase for boundary pixel reconstruction of intra 4x4 and 8x8 prediction and for reconstructing data as inter reference. With this method, the reconstruction phase in our chip achieves 18.67% utilization. The additional path in intra prediction also reaches 56% utilization by scheduling the re-computation of intra 8x8 boundary values, even though this path serves only intra 8x8 prediction.
• Methods to reduce hardware cost in the intra encoding flow:
1. Reuse the reconstruction phase:
We remove the reconstruction circuits from the intra prediction circuits and reuse these functions from the 3rd stage. This saves 69K of hardware cost.
2. Avoid the structure hazard of quantization:
We do not adopt the ping-pong buffer architecture shown in previous work [30].
We place the quantization between the prediction residual buffer and the entropy coefficient input buffer. Thus, we can quantize the prediction residual as entropy coefficients and reconstruct data as reference for other frames at the same time with only one quantization unit.
3. Reduce the temporary registers in the additional path of intra prediction:
In intra 4x4 prediction modes, we use the best residual buffer and prediction value buffer to reduce its boundary data reconstruction cycle time and