
Proposed Algorithm

Figure 3.2: Intra and De-blocking Line Buffers Sharing

Assume block 0 is first in the intra prediction stage: it loads the un-filtered reconstructed pixels from the *B line buffer, as shown in Figure 3.3. These pixels must also be stored into registers for later use by the deblocking filter. After the intra prediction process finishes, the predicted pixels are added to the transform residuals to form the reconstructed pixels, which are then stored back into the same addresses in the *B line buffer, as shown in Figure 3.4.
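This read-then-overwrite access pattern on the shared *B buffer can be sketched as follows. This is a simplified software model, not the hardware datapath; the 16-pixel macroblock width, the DC-style predictor, and the function names are illustrative assumptions:

```python
# Simplified model of the shared *B line buffer: intra prediction reads the
# un-filtered reconstructed pixels of the above row, then the newly
# reconstructed (still un-filtered) bottom row of the current block is
# written back to the same addresses.

MB_W = 16  # macroblock width in pixels (illustrative)

def process_block(line_buf_B, mb_x, residuals, predict):
    """Intra-predict one 16-wide block and recycle its *B entries."""
    base = mb_x * MB_W
    above = line_buf_B[base:base + MB_W]          # load un-filtered pixels
    predicted = predict(above)                    # intra prediction
    reconstructed = [min(255, max(0, p + r))      # add transform residuals
                     for p, r in zip(predicted, residuals)]
    line_buf_B[base:base + MB_W] = reconstructed  # store back, same addresses
    return reconstructed

# Toy usage: a predictor that simply copies the above row.
buf = [100] * 32                                  # *B holding two blocks' rows
out = process_block(buf, 0, [5] * 16, lambda a: list(a))
```

The key point the sketch shows is that block 0's slot in *B is reused in place, so no second buffer is needed for the write-back.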

Algorithm 1: Shared Above Line Buffer
Input: Intra Info and Deblocking Info
Output: Deblocked Pixels

1  forall MBNo such that j ≤ Width/16 do
2      Intra Prediction:
3          step 1: Store bottom reconstructed pixels to 1-line buffer_intra
4      Deblocking Filter:
5          step 1: Store bottom 3 lines of vertically filtered pixels to 3-line buffer_df
6  end
7  Next line pixels available
8  forall MBNo such that j ≤ Width/16 do
9      Intra Prediction:
10         step 1: Load reconstructed pixels from 1-line buffer_intra
11         step 2: Store bottom reconstructed pixels to 1-line buffer_intra
12     Deblocking Filter:
13         step 1: Load 3 lines of vertically filtered pixels from 3-line buffer_df
14         step 2: Load reconstructed pixels from 1-line buffer_intra
15         step 3: Data recovery for the reconstructed pixels
16         step 4: Store bottom 3 lines of vertically filtered pixels to 3-line buffer_df
17 end
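The two passes of Algorithm 1 can be sketched as plain Python. Buffer names, the dictionary-based macroblock records, and the `recover` callback are illustrative stand-ins; the data-recovery step itself is discussed in Section 3.2:

```python
# Sketch of Algorithm 1: the first MB row only stores into the shared line
# buffers; every later MB row loads, recovers, and stores again.

def first_mb_row(mbs, buf_intra, buf_df):
    for mb_no, mb in enumerate(mbs):
        buf_intra[mb_no] = mb["bottom_recon_line"]        # intra: store 1 line
        buf_df[mb_no] = mb["bottom_3_filtered_lines"]     # DF: store 3 lines

def later_mb_row(mbs, buf_intra, buf_df, recover):
    for mb_no, mb in enumerate(mbs):
        above_recon = buf_intra[mb_no]                    # intra step 1: load
        buf_intra[mb_no] = mb["bottom_recon_line"]        # intra step 2: store
        above_filtered = buf_df[mb_no]                    # DF step 1: load
        recovered = recover(above_filtered, above_recon)  # DF steps 2-3
        buf_df[mb_no] = mb["bottom_3_filtered_lines"]     # DF step 4: store
        mb["deblock_input"] = recovered

# Toy usage with one-pixel-wide "lines" and a recover stub that just groups
# the 3 filtered lines with the 1 reconstructed line.
buf_intra, buf_df = {}, {}
row0 = [{"bottom_recon_line": [1], "bottom_3_filtered_lines": [[2], [3], [4]]}]
first_mb_row(row0, buf_intra, buf_df)
row1 = [{"bottom_recon_line": [9], "bottom_3_filtered_lines": [[8], [8], [8]]}]
later_mb_row(row1, buf_intra, buf_df, lambda filt, recon: filt + [recon])
```

Note that each buffer slot is overwritten immediately after it is read, which is what lets the two filters share one set of above-line buffers.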

Figure 3.3: Intra and De-blocking Line Buffers Sharing

Block 0 then enters the deblocking stage, as shown in Figure 3.5. The above-line pixels loaded from the 3-line buffer *A, together with the pixels held in registers, are the inputs to the deblocking filter. After the deblocking filter finishes, the 12 pixels marked with a red dotted circle in the 4x4 blocks are stored back into the 3-line buffer *A, as shown in Figure 3.6. The advantage of line buffer sharing is that the original 5 line buffers are reduced to 4, a reduction of nearly 20%.
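The claimed saving follows from a back-of-the-envelope count; the 1920-pixel frame width and 8-bit luma samples below are assumed values for illustration:

```python
# Line-buffer cost before and after sharing, for an assumed 1920-pixel-wide
# frame with 8-bit samples.
WIDTH, BITS = 1920, 8

before = 5 * WIDTH * BITS   # separate buffers: 5 lines in total
after = 4 * WIDTH * BITS    # shared scheme: 3-line *A + 1-line *B
saving = 1 - after / before # fraction of line-buffer bits removed
```

With these numbers the shared scheme drops one full line of SRAM, i.e. a 20% reduction regardless of the frame width.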

Figure 3.4: Intra and De-blocking Line Buffers Sharing

Figure 3.5: Intra and De-blocking Line Buffers Sharing

Figure 3.6: Intra and De-blocking Line Buffers Sharing

3.2 Data Recovery

In the deblocking filter process of the HM-7.1 reference software, the frame-level filtering order is vertical edges first, followed by horizontal edges, as shown in Figure 3.7. In a hardware design, however, frame-level filtering is impractical because of its silicon area cost, so 16x16 macroblock-based filtering is the usual hardware choice. At the 16x16 macroblock level, the right-most four 4x4 blocks must be stored in registers for the next macroblock to the right, while the bottom four 4x4 blocks must be stored in the 3-line buffers for the macroblock row below, as shown in Figure 3.8.
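The per-macroblock bookkeeping can be sketched as follows. The 16x16 list-of-lists layout and the function name are illustrative; only 3 of the 4 bottom lines go to the line buffer, matching the 3-line buffer_df of Algorithm 1:

```python
# After filtering a 16x16 MB, keep the right-most column of 4x4 blocks in
# registers for the next MB to the right, and the bottom 3 lines in the
# line buffer for the MB row below (the 4th line lives in the shared
# intra buffer and is recovered later).

def stash_mb_edges(mb, regs, line_buf, mb_x):
    """mb is a 16x16 list-of-lists of filtered pixels."""
    regs[:] = [row[12:16] for row in mb]           # right-most 4 pixels/row
    line_buf[mb_x] = [mb[13][:], mb[14][:], mb[15][:]]  # bottom 3 lines

# Toy usage: pixel value encodes (row * 16 + column).
mb = [[r * 16 + c for c in range(16)] for r in range(16)]
regs, line_buf = [], {}
stash_mb_edges(mb, regs, line_buf, 0)
```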

As a result, under line buffer sharing, the 12 pixels loaded from the 3-line buffer *A plus the 4 pixels loaded from the 1-line buffer *B, which have not passed through the vertical edge filter, must be filtered again to ensure the data are valid. As shown in Figure 3.9, these pixels must go through vertical edge filtering to guarantee their correctness. The disadvantage of line buffer sharing is that this data recovery must be performed before the deblocking filter proper, which wastes some clock cycles. In the table,
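The recovery step amounts to regrouping the 3 filtered lines from *A with the 1 un-filtered line from *B and re-running the vertical edge filter on the group. The sketch below uses a toy 2-tap smoothing filter as a stand-in; it is not the HEVC vertical edge filter:

```python
# Data-recovery sketch: the 3 lines from *A were vertically filtered, but
# the shared *B line was overwritten by intra with un-filtered reconstructed
# pixels, so the 4-line group must pass the vertical edge filter again.

def recover(lines_from_A, line_from_B, vertical_filter):
    """lines_from_A: 3 filtered lines; line_from_B: 1 un-filtered line."""
    group = lines_from_A + [line_from_B]   # 12 + 4 pixels per 4-pixel column
    return vertical_filter(group)          # refilter to make the data valid

def toy_vertical_filter(lines):
    # Stand-in filter: 2-tap horizontal average within each line.
    return [[(row[i] + row[min(i + 1, len(row) - 1)]) // 2
             for i in range(len(row))] for row in lines]

out = recover([[10, 10], [20, 20], [30, 30]], [40, 48], toy_vertical_filter)
```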

Figure 3.7: Deblocking filter in vertical edge and horizontal edge

Figure 3.8: Memory sharing in deblocking filter

Figure 3.9: Memory sharing with data recovery

3.3 Prediction-based with Memory Hierarchy

The in-loop filter is the most memory-dominated part of the video decoder, so reducing its internal memory is worthwhile. The memory hierarchy shown in Figure 3.10 is adopted to reduce the size of the deblocking filter's 3-line buffer SRAM and thereby save power. The second level of the memory hierarchy exploits the high correlation of data that is reused frequently during decoding: with small SRAMs whose sizes are decided by the designer, pixels can be stored on-chip first, reducing accesses to the external memory. In the proposed method, we exploit the memory hierarchy by utilizing the spatial locality of neighboring pixels to reduce both the line buffer size and the external memory traffic. The HM-7.1 reference software performs the deblocking filter in frame-level coding order; therefore, in Figure 3.11, the edge detection process described in the previous section works because the pixels below the boundary edge in the Q block are available. In a hardware design, however, frame-level coding is a poor choice because of its huge buffer cost. If we instead adopt a 16x16 macroblock pipeline, the pixels below the boundary edge in the Q block are unavailable, and without them the edge detection cannot accurately judge whether the edge is over-sharp or smooth.
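The second-level memory can be modelled as a small on-chip cache of line pixels that absorbs repeated reads by neighboring blocks. The class below is a minimal sketch under assumed parameters (capacity, oldest-entry eviction), not the thesis design:

```python
# A tiny line-pixel cache: recently used above-line pixels stay in a small
# on-chip SRAM (modelled as a capacity-limited dict), so repeated reads by
# neighboring blocks avoid going to external memory.

class LinePixelCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}              # addr -> pixel line (insertion-ordered)
        self.external_reads = 0      # accesses that had to go off-chip

    def read(self, addr, external_mem):
        if addr not in self.store:
            self.external_reads += 1
            if len(self.store) >= self.capacity:
                self.store.pop(next(iter(self.store)))  # evict oldest entry
            self.store[addr] = external_mem[addr]
        return self.store[addr]

# Toy usage: 6 reads with spatial locality cost only 3 external accesses.
mem = {0: [1] * 4, 1: [2] * 4, 2: [3] * 4}
cache = LinePixelCache(capacity=2)
for addr in (0, 0, 1, 1, 0, 2):
    cache.read(addr, mem)
```

The point of the sketch is the ratio: the higher the spatial locality of neighboring-pixel accesses, the larger the fraction of reads served on-chip.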
