
CHAPTER 2 OVERVIEW OF THE INTRA PREDICTION IN H.264/AVC AND SVC

2.2 SVC Extension of H.264/AVC Standard Overview

2.2.4 Inter-Layer Intra-Prediction

2.2.4.1 Intra_BL Decoding Processes

To decode this new block type in the enhancement layer, the decoding process is illustrated by the block diagram in Figure 17. The first step is to locate the corresponding blocks in the lower-resolution layer. After the reference blocks in the lower-resolution layer are identified, their reconstructed samples are upsampled to the higher-resolution layer. In the SVC design, the upsampling process for the luma samples consists of applying a separable four-tap poly-phase interpolation filter. The interpolation coefficient values for this filter are provided in Table 3.

The chroma samples are also upsampled but with a different (simpler) interpolation filter which corresponds to bi-linear interpolation. The interpolation coefficients of this filter are also shown in Table 3.
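As a rough illustration of how these separable filters are applied, the following C sketch performs one 1-D pass of the four-tap poly-phase luma filter and of the bi-linear chroma filter. The phase count is assumed to be 16, the coefficient tables are placeholders whose real values come from Table 3, and the final normalization, rounding, and clipping defined by the standard are omitted; all names are illustrative only.

```c
#include <stdint.h>

#define NUM_PHASES 16   /* assumed 1/16-sample phase resolution of the SVC upsampling filters */

/* Placeholder coefficient tables: the actual values are those listed in Table 3. */
static const int luma_coef[NUM_PHASES][4]   = { { 0 } };   /* fill in from Table 3 */
static const int chroma_coef[NUM_PHASES][2] = { { 0 } };   /* fill in from Table 3 */

/* One 1-D pass of the separable four-tap poly-phase luma filter.
 * The same routine is applied horizontally and then vertically. */
static int luma_tap4(const uint8_t *s, int x, int phase)
{
    const int *c = luma_coef[phase];
    return c[0] * s[x - 1] + c[1] * s[x] + c[2] * s[x + 1] + c[3] * s[x + 2];
}

/* One 1-D pass of the bi-linear chroma filter. */
static int chroma_tap2(const uint8_t *s, int x, int phase)
{
    const int *c = chroma_coef[phase];
    return c[0] * s[x] + c[1] * s[x + 1];
}
```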

The use of different interpolation filters for luma and chroma is motivated by complexity considerations. In prior standardized designs, the upsampling filter was bi-linear for both luma and chroma, but this resulted in significantly lower luma prediction quality. Therefore, SVC standardizes different filters for the luma and chroma samples.

Figure 17: Block diagram of Intra_BL decoding process.

Table 3: Interpolation coefficients for Luma and Chroma up-sampling.

Luma Chroma

After upsampling, the SVC decoder also adds the residual difference information to refine the upsampled prediction, in the same way as in H.264. Finally, a deblocking filter that is similar to the one in ordinary H.264/AVC, but with altered boundary-strength calculations, is applied to the decoded result.
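A minimal sketch of this residual refinement step, assuming 8-bit samples and illustrative function names:

```c
#include <stdint.h>

/* Clip a value to the 8-bit sample range, as in H.264 reconstruction. */
static inline uint8_t clip1(int x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Add the decoded residual to the upsampled prediction and clip the result. */
static void add_residual(const uint8_t *pred, const int16_t *resid, uint8_t *recon, int n)
{
    for (int i = 0; i < n; i++)
        recon[i] = clip1(pred[i] + resid[i]);
}
```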

It should be noticed that the up-sampling process in Figure 17 consists of two parts. One is the basic interpolation and the other is the extended vertical interpolation. However, the extended vertical interpolation is applied only to certain combinations of picture types in the base and enhancement layers, such as frame-MBAFF, MBAFF-frame, and field-frame.

Chapter 3

Proposed Intra Prediction Engine with Data Reuse in H.264/AVC HP

Chapters 1 and 2 introduced the intra decoding process in H.264 and SVC with respect to profile, picture type, and prediction source. Based on this, we propose an H.264/SVC intra prediction engine that supports the high profile and every inter-layer prediction type for single-layer decoding of H.264 and multi-layer decoding of SVC, respectively.

Figure 18: Block diagram of SVC intra prediction engine.

Figure 18 shows our proposed H.264/SVC intra prediction engine. The module "Basic Intra Prediction" is used to decode the traditional single-layer intra prediction in H.264 or the base-layer intra prediction in SVC. The other module, "Intra_BL Prediction", is designed to decode the new prediction type Intra_BL in the enhancement layer of SVC.

In this chapter, the design of intra prediction with data reuse in the H.264 high profile is described; the design for SVC inter-layer intra prediction is further described in Chapter 4. To alleviate the bandwidth demand of intra compensation in high-definition video, we reuse the neighboring pixels and optimize the buffer size and access latency. In particular, a dedicated pixel buffer reuses neighboring pixels to realize MB-adaptive frame-field (MBAFF) decoding in intra compensation. Moreover, a base-mode predictor is explored to optimize the area efficiency of the reference sample filtering process (RSFP) in the intra 8x8 modes.

Compared to an intra prediction design supporting only the main profile, the gate-count and SRAM overheads are merely 10% and 7.5%, respectively. Simulation results show that the proposed data-reused intra prediction module requires 14K logic gates and 688 bits of SRAM, and operates at 100 MHz to realize 1080HD video playback at 30 fps.

3.1 Overview

H.264/AVC defines three profiles, each a subset of the coding tools in the standard targeted at specific applications. The baseline and main profiles are intended for video conferencing/mobile and broadcast/storage applications, respectively. The high profile targets the compression of high-quality, high-resolution video and has become the mainstream of high-definition consumer products such as the Blu-ray disc. However, high-profile video is more challenging in terms of implementation cost and access bandwidth since it involves extra coding tools, such as macroblock-adaptive frame-field (MBAFF) coding and 8×8 intra coding, to achieve high compression performance.

MBAFF-coded pictures are partitioned into 16x32 macroblock pairs, and both macroblocks in each pair are coded in either frame or field mode. Compared to purely frame-coded pictures, MBAFF coding requires twice the amount of neighboring-pixel storage and therefore increases implementation cost. To cope with this problem, we propose neighboring buffer memories (upper/left/corner) to reuse the overlapped neighboring pixels of an MB pair. Furthermore, we present a memory hierarchy and a pixel re-ordering process to optimize the overall memory size and the external access efficiency.

On the other hand, the H.264 high profile additionally adopts the intra 8×8 coding tool to improve coding efficiency. It involves a reference sample filtering process (RSFP) before a Luma intra_8x8 block is decoded. The filtered pixels are then used to generate the predicted pixels of the 8×8 block.

Hence, additional processing latency and cost are required, which may impact the overall performance of real-time high-definition video playback. In this chapter, we simplify the RSFP via a base-mode predictor and optimize the processing latency and buffer cost. Compared to the existing design [16], which does not support intra 8×8 coding, this design introduces overheads of only 10% in gate count and 7.5% in buffer SRAM.

Figure 19: Block diagram of the proposed H.264 high-profile intra predictor.

Figure 19 shows the block diagram of the proposed H.264 high-profile intra compensation architecture. A pixel rearranging process, located at the bottom-left of Figure 19, is proposed to reduce the complexity of neighbor fetching when MBAFF coding is enabled. The signal Line SRAM1/2 data_out is connected directly to the intra prediction block to replace the last set of upper buffer memory. As for 8×8 intra coding, a dedicated pixel buffer memory is used to store the filtered neighboring pixels and to reuse the overlapped pixel data. According to the relation between the Luma intra_8x8 modes and the number of filtered pixels needed in each mode, we minimize the number of stored pixels to 17 (i.e., 136 bits). The predicted-pixel output is interfaced to the filtered pixel buffer memory because the RSFP is embedded in the intra prediction generator.

3.2 Memory Hierarchy

A memory hierarchy is an architectural choice advocated for dealing with a long past history of data [18]. In intra prediction, the neighboring pixels are utilized to create a reliable predictor, which leads to a dependency on a long history of data. This dependency can be resolved by storing the upper rows of pixels for predicting the current pixels, but doing so is a challenging issue in terms of implementation cost and access bandwidth. Moreover, if the instantaneously needed upper neighboring pixels are not in the neighboring information memory (i.e., a miss occurs), the decoding process and pipeline schedule will be delayed and disrupted by external SDRAM accesses. For more details, please refer to [18].

Figure 20: Memory hierarchy of H.264 high-profile intra predictor.

To optimize the introduced buffer cost and access efficiency, we use two internal line SRAMs (Line SRAM1 and Line SRAM2) to store the Luma and Chroma upper-line pixels, as illustrated in Figure 20. With a ping-pong mechanism, the upper neighboring pixels of the current MB (or MB pair) are stored in one of them, while the other Line SRAM stores the upper neighboring pixels of the next MB. This memory hierarchy keeps the internal Line SRAM size small and simplifies the decoding pipeline schedule.
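The ping-pong mechanism can be sketched in C as follows. The structure and helper names are hypothetical and only illustrate how the two banks exchange roles between the intra predictor and the external SDRAM traffic; the 640-bit bank size is taken from the 0.64-kb Line SRAMs mentioned in Section 3.5.

```c
#include <stdint.h>

#define LINE_SRAM_BITS  640                 /* 0.64 kb per Line SRAM (Section 3.5) */
#define LINE_SRAM_BYTES (LINE_SRAM_BITS / 8)

typedef struct {
    uint8_t line[2][LINE_SRAM_BYTES];       /* Line SRAM1 and Line SRAM2 */
    int     active;                         /* bank currently serving the intra predictor */
} line_sram_pair_t;

/* Called once per MB (or MB pair): swap roles so that the predictor reads one bank
 * while the other bank is refilled from / written back to external SDRAM. */
static void line_sram_swap(line_sram_pair_t *s)
{
    s->active ^= 1;
}

static const uint8_t *line_sram_read(const line_sram_pair_t *s)
{
    return s->line[s->active];              /* upper neighbors of the current MB (pair) */
}

static uint8_t *line_sram_prefetch_target(line_sram_pair_t *s)
{
    return s->line[s->active ^ 1];          /* bank being prepared for the next MB (pair) */
}
```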

3.3 MBAFF Decoding with Data Reuse Sets

MBAFF is proposed to improve the coding efficiency of interlaced video. However, it introduces a longer data dependency than conventional frame-coded pictures. In this section, we analyze and realize it via upper, left, and corner data reuse sets (DRS) that reuse the pixels and improve the cost and access efficiency.

3.3.1 Upper Data Reuse Sets


Figure 21: The update directions of the upper/left buffer memory in (a) frame-mode and (b) field-mode MB pairs.

For decoding an MBAFF-coded video, the upper buffer memory is used to store the reconstructed upper pixels of the current MB pair. These upper buffers are updated upon completion of the prediction process of every 4×4 block, and each updated sub-row can be reused by the 4×4 blocks beneath it. Depending on the prediction mode of the MB pair, the upper buffer is filled from different directions. If the current MB pair is in frame mode, only one row of the upper buffer (16 pixels) needs to be loaded at first; then, whenever a 4×4 block is decoded, the two sub-rows spanning two rows (8 pixels) of the upper buffer are updated from top to bottom at one time, as illustrated in Figure 21(a). Finally, the new pixels in the two rows are stored to the Line SRAM together once the MB pair has been decoded.

In Figure 21(b), a field-coded MB pair needs to load two rows of the upper buffer (32 pixels), twice as many as a frame-coded MB pair. Then, only one sub-row of the upper buffer memory is updated when a 4×4 block is decoded. Finally, one row of new upper-buffer pixels is stored back to the Line SRAM when the top MB has been decoded, and the other row is stored back when the bottom MB has been decoded. However, the fifth 4×4 block still needs a sub-row of the upper buffer for its prediction, as shown in Figure 22. To reduce the upper buffer memory size, the Line SRAM data_out is used directly. The only penalty of this scheme is that the Line SRAM data_out must hold its value until the fifth 4×4 block is decoded.

Figure 22: Line SRAM data_out replaces the last sub-row of upper buffer.
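The mode-dependent handling of the upper buffer described above can be summarized by the following illustrative helpers; the function names are hypothetical and do not correspond to the actual RTL.

```c
/* Pixels of upper neighbors to preload before decoding an MB pair:
 * frame-coded pair: one row (16 pixels); field-coded pair: two rows (32 pixels). */
static int upper_pixels_to_preload(int field_coded_pair)
{
    return field_coded_pair ? 32 : 16;
}

/* Sub-rows of the upper buffer updated per decoded 4x4 block:
 * frame-coded pair: two sub-rows (8 pixels) at once; field-coded pair: one sub-row. */
static int upper_subrows_updated_per_4x4(int field_coded_pair)
{
    return field_coded_pair ? 1 : 2;
}
```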

3.3.2 Left Data Reuse Sets

The update direction of the left buffer is similar to that of the upper one; it proceeds from left to right. Once the left buffer holds the pixels on the right-hand side of the MB pair, the next MB pair can reuse these new pixels in its subsequent prediction procedures. However, when the modes of the current and previous MB pairs differ, deriving the left neighbors of a 4x4 block becomes complicated. To reduce the computational complexity of the intra predictor, a pixel rearranging process is exploited. If the current MB pair is in frame mode, each sub-column of the left buffer is updated when the corresponding 4x4 block is decoded, as shown in Figure 22(a). On the other hand, if the current MB pair is in field mode, the first and third buffers in each sub-column of the left buffer are updated when each 4×4 block of the top MB is decoded, as shown in Figure 22(b), and the second and fourth buffers in each sub-column are updated when each 4×4 block of the bottom MB is decoded. Hence, we only need to consider the mode of the current MB pair, instead of handling the four coding-mode combinations of the current and previous MB pairs, and the complexity is therefore reduced.

3.3.3 Corner Data Reuse Sets

Using a corner buffer memory can efficiently reuse the upper-left neighboring pixels. We change the position of the corner buffer from the left [16] to the top. Therefore, the total corner buffer size can be reduced by 38% (i.e., 64 bits → 40 bits), because the number of MBs in the horizontal direction is smaller than that of vertical MB pairs. In particular, Figure 23 shows the update directions of the corner buffer. Because the upper neighboring pixels come from either the last row or the row prior to the last row of the upper MB pair, the first corner of the current MB pair has two processing states: reuse and reload. The first corner is reused when 1) the mode of the current MB pair is identical to that of the previous (left) MB pair, or 2) before decoding the bottom MB of a frame-coded MB pair. On the other hand, the first corner is reloaded when 1) the current MB pair has a different mode from the previous (left) MB pair, or 2) before decoding the bottom MB of a field-coded MB pair.

Figure 23: The updated direction of corner pixel buffers.
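One possible reading of the reuse/reload rules above is sketched below; the type and helper names are hypothetical and serve only to make the decision explicit.

```c
typedef enum { FRAME_PAIR, FIELD_PAIR } pair_mode_t;
typedef enum { CORNER_REUSE, CORNER_RELOAD } corner_state_t;

/* Decide the state of the first corner of the current MB pair.
 * before_bottom_mb is nonzero when the bottom MB of the pair is about to be decoded. */
static corner_state_t first_corner_state(pair_mode_t cur, pair_mode_t prev, int before_bottom_mb)
{
    if (before_bottom_mb)
        return (cur == FRAME_PAIR) ? CORNER_REUSE : CORNER_RELOAD;  /* rule 2 */
    return (cur == prev) ? CORNER_REUSE : CORNER_RELOAD;            /* rule 1 */
}
```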

In summary, using the neighboring buffer memories and updating them in different directions according to the mode of the MB pair allows the neighboring pixels to be reused and the access efficiency to be improved. The associated pipeline structure of MBAFF decoding is shown in Figure 24. During the decoding of an MB pair, the interaction between the buffers and one Line SRAM can be completed easily and efficiently, while the other Line SRAM communicates with the external SDRAM at the same time. It should be noticed that the Line SRAM1/2 data_out must be held until the fifth 4x4 block is decoded.

Figure 24: The pipeline scheme of MBAFF decoding.

3.4 Intra 8x8 Decoding with Modified Base-Mode Intra Predictor

Luma intra_8x8 is an additional intra block type supported in the H.264 high profile. Before an intra_8x8 block is decoded, an extra process that does not exist for intra_4x4 and intra_16x16, called the reference sample filtering process (RSFP), is performed. The original neighboring pixels are filtered first, and these filtered pixels are then used to predict the 8×8 block.

3.4.1 Filtered Neighboring Buffer Analysis

For an intra_4x4 block and an intra_8x8 block, 13 neighbors and 25 filtered neighbors are needed, respectively. However, across the Luma intra_8x8 modes, the maximum number of filtered neighbors actually needed is 17, as illustrated in Table 4. Hence, only 17 filtered pixels (i.e., 136 bits) of an 8x8 block need to be stored in our filtered pixel buffer memory, instead of all 25 filtered neighbors.

Table 4: Number of filtered pixels actually needed in intra 8x8 modes.

Prediction Mode of Intra_8x8        # of filtered neighbors
0 (Vertical)                        8
1 (Horizontal)                      8
2 (DC)                              0, 8, or 16
3 (Diagonal down-left)              16
4 (Diagonal down-right)             17
5 (Vertical-right)                  17
6 (Horizontal-down)                 17
7 (Vertical-left)                   16
8 (Horizontal-up)                   8
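For reference, the counts in Table 4 can be kept in a small lookup table; this is only a transcription of the table, with the worst case stored for the availability-dependent DC mode.

```c
/* Filtered-neighbor counts per Luma intra_8x8 mode, transcribed from Table 4.
 * Mode 2 (DC) needs 0, 8, or 16 neighbors depending on availability; the worst case is stored. */
static const int filtered_neighbors_needed[9] = {
    8,   /* 0: Vertical            */
    8,   /* 1: Horizontal          */
    16,  /* 2: DC (worst case)     */
    16,  /* 3: Diagonal down-left  */
    17,  /* 4: Diagonal down-right */
    17,  /* 5: Vertical-right      */
    17,  /* 6: Horizontal-down     */
    16,  /* 7: Vertical-left       */
    8,   /* 8: Horizontal-up       */
};
```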

3.4.2 Reference Sample Filtering Process (RSFP)

In the intra_4x4 process, the prediction formula of each mode except the DC mode has the same form: prediction_out = (A+B+C+D+2) >> 2. For a four-parallel intra pixel generator, a suitable architecture is proposed in [14], as shown in Figure 25.

Figure 25: Intra predictor in [14].

However, if we analyze the relationship among the four output pixels of each mode, some adders can be eliminated owing to the shared terms, so that the hardware cost can be reduced [19], as described in Figure 26. This share-based intra predictor can predict almost every intra_4x4 mode in one cycle. However, exceptions still exist, such as the vertical-right mode, in which there is no shared term between prediction pixels m and n, as illustrated in Figure 27. Therefore, the modified intra predictor is proposed. Compared with the share-based intra prediction generators [14]-[16][19], the proposed base-mode intra predictor not only reduces the area cost (by eliminating four adders) but also guarantees that four predicted pixels are generated in one cycle for every intra_4x4 mode.

Figure 26: Intra predictor in [19].


Figure 27: Modified base-mode intra predictor.

In particular, we use this base-mode predictor to generate the four predicted pixels in one cycle. In the RSFP, the filtering formula has the same form as in intra_4x4 and can likewise be rewritten as filtered_out = (A+2B+C+2) >> 2. Hence, the hardware resource can be shared to generate the filtered pixels, as shown in Figure 28. Notice that an additional process, neighbor distribution, is needed in the intra_8x8 process because we store only 17 filtered pixels instead of 25.

Figure 28: Architecture of an intra 8×8 decoding module.
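A minimal C sketch of the shared base-mode kernel and its four-parallel use in the RSFP is given below. Boundary-sample handling and the neighbor distribution step are omitted, and the function names are illustrative rather than the actual hardware interfaces.

```c
#include <stdint.h>

/* Shared base-mode kernel: out = (A + 2B + C + 2) >> 2, used both for generating
 * predicted pixels and for the reference sample filtering process. */
static inline uint8_t base_mode(int a, int b, int c)
{
    return (uint8_t)((a + 2 * b + c + 2) >> 2);
}

/* Four-parallel RSFP step: filter four interior reference samples per cycle with the
 * same kernel. p must point at the first of four interior samples so that p[-1] and
 * p[4] are valid; end samples follow the special cases defined in the standard. */
static void rsfp_filter4(const uint8_t *p, uint8_t *filtered)
{
    for (int i = 0; i < 4; i++)
        filtered[i] = base_mode(p[i - 1], p[i], p[i + 1]);
}
```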

3.4.3 Latency Reduction

In a four-parallel intra prediction module, the latency of decoding an 8x8 block is increased by 0~5 cycles depending on the mode of the 8×8 block. To reduce this latency penalty, we reserve the filtered pixels when the mode of the first/third 8x8 block in an MB is 3 or 7 and the mode of the second/fourth 8x8 block in the same MB is 0, 2 (if the upper neighbors are available), or 3~7; in this case N = 6, and otherwise N = 0. In this case, once the first/third 8x8 block of the MB has been decoded, the overlapped filtered pixels are directly used to predict the second/fourth 8×8 block. To clarify the extra latency, Eq. (1) gives the extra decoding latency of an intra_8x8 MB, and Table 4 summarizes the number of filtered pixels needed by each 8×8 intra coding mode (i.e., the value of M).

Extra latency = ⌈M/P⌉ + ⌈(M−N)/P⌉ + ⌈M/P⌉ + ⌈(M−N)/P⌉,        (1)
                (first)   (second)    (third)   (fourth 8x8 block)

where M ∈ {0, 8, 16, 17} is the number of filtered pixels required by the mode of each 8x8 block (Table 4), N ∈ {0, 6} is the number of reserved overlapped filtered pixels, and P = 4 is the number of pixels filtered per cycle.

In particular, some examples are listed to clarify the processing behavior of an intra 8×8 block in Figure 29. If the modes of the first and second 8×8 blocks are 3 (diagonal down-left) and 7 (vertical-left), or 3 (diagonal down-left) and 4 (diagonal down-right), only 10 and 11 pixels, respectively, need to be filtered while decoding the second 8×8 block, as described in Figure 29.

Figure 29: Behavior of shared filter.
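Assuming the reconstruction of Eq. (1) above with P = 4, the extra latency can be evaluated as sketched below (hypothetical helper names). For the first example of Figure 29 (modes 3 and 7), the second block needs only 16 − 6 = 10 filtered pixels, i.e. ⌈10/4⌉ = 3 cycles.

```c
/* Integer ceiling division. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Illustrative evaluation of Eq. (1): m[i] is the Table 4 value for the i-th 8x8 block's
 * mode; n2/n4 are 6 when the overlapped filtered pixels of the preceding block are
 * reserved, and 0 otherwise. P = 4 filtered pixels per cycle (four-parallel module). */
static int intra8x8_extra_latency(const int m[4], int n2, int n4)
{
    const int P = 4;
    return ceil_div(m[0], P) + ceil_div(m[1] - n2, P) +
           ceil_div(m[2], P) + ceil_div(m[3] - n4, P);
}
```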

3.5 Simulation Result and Comparison

To enhance the system performance, our design optimizes the area, buffer size, and latency. We use two 0.64-kb Line SRAMs, connected to a 32-bit system bus, to keep the decoding pipeline simple, and a 0.688-kb SRAM to store the reused neighboring pixels.

Table 5: Average cycles needed for decoding a MB in different video sequences.

Test Video Sequence    Intra Prediction @ BL [17]    Proposed Intra Prediction @ HP    Cycle Overhead
Foreman                342.68                        355.81                            3.8%
Grandma                275.63                        285.28                            3.5%
Suzie                  294.90                        307.28                            4.2%

Table 5 shows the average number of cycles needed to decode an I-MB in different video sequences with the proposed design, targeting the 30-fps HD1080 video format at a working frequency of 100 MHz with MBAFF and Luma intra_8x8 enabled. The latency overhead is less than 5% compared to the preliminary architecture [17]. The overall area and buffer memory size for supporting H.264 BP/MP/HP are 14063 gates in UMC 0.18-um technology and 688 bits, as shown in Table 6. The overheads for supporting Luma intra_8x8 are 10% in gate count and 7.5% in memory compared to [16].

Table 6: Comparison results.

                       Chen et al. [16]    Proposal    Overhead
Profile                MP                  HP          -
Process                0.18 um             0.18 um     -
Working Frequency      87 MHz              100 MHz     -
Gate Count             12785               14063       10%
Memory (bit)           640                 688         7.5%

Chapter 4

Proposed Power Efficient Inter-Layer Intra Prediction Engine in SVC

The Scalable Video Coding (SVC) extension of H.264/AVC is the latest standard in video coding. It achieves significant improvements in coding efficiency, together with an increased degree of supported scalability, relative to the scalable profiles of prior video coding standards.

In order to improve coding efficiency in comparison to simulcasting different spatial resolutions, additional so-called inter-layer prediction mechanisms are incorporated in SVC.

For intra-frame decoding, inter-layer intra prediction is considered in SVC.

When the prediction signal of an enhancement-layer macroblock is obtained by inter-layer intra prediction, the new prediction type called Intra_BL is used, for which the corresponding reconstructed intra signal of the co-located 8x8 sub-macroblock in the reference layer is upsampled. In other words, the enhancement layer offers more intra prediction types (intra_4x4, intra_8x8, intra_16x16, and Intra_BL) than traditional H.264 intra prediction. However, the decoding process of Intra_BL is more complex than that of the other intra prediction modes: it resembles motion compensation, which needs an interpolation process to generate the predicted pixels. The difference between them is that the interpolation coefficients in inter-layer intra interpolation are not fixed. Therefore, complexity, processing time, and power consumption become major problems for the decoder.

In this chapter, we propose an architecture design of a high-profile inter-layer intra prediction engine in SVC. It supports the new decoding type, Intra_BL, and the 7 possible picture-type combinations between spatial layers in intra prediction. Besides, we also propose an area-efficient interpolator design, which saves 26% of the hardware area compared to a direct implementation. Moreover, we further propose a power-efficient design, including a memory hierarchy improvement and a computational complexity reduction, which saves 46.43% of the total power consumption compared to our preliminary design.

4.1 Overview

In Chapter 3, the proposed SVC intra prediction engine is illustrated in Figure 18, and

