
4.4 De-blocking Filter

This section describes the architecture design of the de-blocking filter. As shown in Figure 4.15, the de-blocking filter is partitioned into four parts: the edge detection unit, the cache and left reference buffer, the double transpose buffers, and the filter unit. Due to the proposed algorithm,

(Block diagram: memory controller, de-blocking controller with BS/Rec/Intra information, 1 KB cache, and strong/weak filter units.)

Figure 4.15: Architecture of the de-blocking filter

the second-level memory hierarchy and line-buffer sharing are implemented in this section. In addition, the block filtering order differs from that of the previous standard, H.264; the proposed order is also described here.

4.4.1 Hybrid Filter Order

The filtering order of the de-blocking filter must be considered together with its memory accesses. The architecture in [6], which follows the specification's filtering order, produces repeated memory accesses: the intermediate pixels must be stored and fetched many times for the vertical and horizontal edges. These memory accesses must therefore be scheduled carefully because of the operation-cycle budget. In general, the deblocking filter accounts for roughly one third of the computational complexity of an H.264 video decoder [23]. Unlike H.264, the HEVC filter operates on 8x8

(Processing order 1–24 over the 4x4 filter units of the 16x16 luma block and the Cb/Cr blocks.)

Figure 4.16: Hybrid order of the de-blocking filter

boundaries instead of 4x4. In this subsection, we present a hybrid filtering order that still conforms to the HEVC standard while reducing internal SRAM accesses to further improve access efficiency, as shown in Figure 4.16. The vertical edges are filtered first, followed by the horizontal edges. The basic processing block is 4x4; the numbers in the circles represent the processing order of the 4x4 filter units. To avoid repeated memory accesses, the proposed hybrid scheduling still filters each vertical edge before the corresponding horizontal edge, preserving the pixel dependency.
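The scheduling constraint above can be sketched in software. The following is an illustrative Python model only; the exact interleaving of Figure 4.16 may differ, and the code merely demonstrates that every 4x4 unit's vertical edge is filtered before its horizontal edge:

```python
# Illustrative sketch of a hybrid filtering schedule for one 16x16 luma block,
# which contains a 4x4 grid of 4x4 filter units. Columns of units are walked
# left to right; within each column all vertical edges are filtered first,
# then the horizontal edges, so the vertical-before-horizontal pixel
# dependency always holds. (The numbering in Figure 4.16 may interleave
# columns differently; this is a software model, not the exact hardware order.)

def hybrid_order(units_per_side=4):
    schedule = []
    for col in range(units_per_side):
        for row in range(units_per_side):
            schedule.append(("V", row, col))   # vertical edge of unit (row, col)
        for row in range(units_per_side):
            schedule.append(("H", row, col))   # horizontal edge, after verticals
    return schedule
```

Walking the schedule confirms that no horizontal edge of a unit is ever processed before that unit's vertical edge, which is the property the hybrid order must preserve.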

Figure 4.17: Edge detection of the de-blocking filter

4.4.2 Edge Detection Unit

With the hybrid filtering order applied, the edge detection unit is proposed to determine whether an edge is over-sharpened. As Figure 4.17 shows, the two parameters β and tC are generated from a lookup table, the quantization parameter, and the boundary strength.

Sobel operators are commonly used for edge detection in image processing [24]. However, the edge-decision operation in the HEVC standard differs slightly from the ordinary Sobel operator: in the operator (4.1), the coefficients in the x direction are replaced by the filter [1, −2, 1]. Therefore, to check an edge correctly with this operator, the pixels of both the P and Q 4x4 blocks are sent into the matrix multiplications. If the resulting values are smaller than β, the boundary is a real edge and does not need to be smoothed by the deblocking filter. If the values are larger than β, the boundary is over-sharpened, and applying the deblocking filter is important for improving the video quality.

H = [ 1  −2  1 ]                                    (4.1)
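The decision rule above can be modeled in a few lines of Python. This is a hypothetical software sketch: the function names are illustrative, β normally comes from the QP lookup table, and the rule follows the text's convention that an activity measure at or above β marks an over-sharpened boundary:

```python
# Sketch of the edge decision using the second-difference kernel [1, -2, 1].

def second_difference(a, b, c):
    """Apply the [1, -2, 1] kernel: |a - 2b + c| measures local activity."""
    return abs(a - 2 * b + c)

def edge_over_sharpened(p, q, beta):
    """p, q: three boundary-adjacent pixels on each side, ordered outward
    from the edge as (p0, p1, p2) and (q0, q1, q2).
    Per the decision rule in the text: activity >= beta means the boundary
    is over-sharpened and should be filtered."""
    d = second_difference(p[0], p[1], p[2]) + second_difference(q[0], q[1], q[2])
    return d >= beta
```

A flat region produces an activity of zero (no filtering), while a sharp spike across the boundary exceeds β and triggers the filter.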

4.4.3 Memory Hierarchy and Buffer Sharing

1. Line Buffer Sharing

As shown in Figure 3.5, the unfiltered pixels can be loaded for the deblocking filter, while an intra-coded block also loads unfiltered pixels for intra prediction in the same pipeline stage. Considering the SRAM usage, since the deblocking filter and intra prediction both need to access the SRAM at the same time, an unavoidable structural hazard arises, as shown in Figure 4.18(a). Therefore, data scheduling of the single-port SRAM is used to resolve the memory conflict.
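One way to resolve such a conflict is a fixed-priority arbiter in front of the single-port SRAM. The following Python model is a minimal sketch under that assumption (the actual scheduling policy in the design may differ); only one requester is granted per cycle and the other is stalled until a free cycle:

```python
# Illustrative single-port line-buffer arbitration between intra prediction
# and the deblocking filter. Assumed fixed-priority policy; stalled requests
# are carried over to later cycles.

def schedule_single_port(requests, priority=("intra", "deblock")):
    """requests: per-cycle sets of requesters, e.g. [{"intra", "deblock"}, set()].
    Returns the grant issued each cycle (None if the port is idle)."""
    grants, pending = [], set()
    for cycle_reqs in requests:
        pending |= set(cycle_reqs)
        grant = next((r for r in priority if r in pending), None)
        if grant is not None:
            pending.discard(grant)      # request serviced this cycle
        grants.append(grant)
    return grants
```

With simultaneous requests in cycle 0, intra wins the port and the deblocking access slips into the next free cycle, which is exactly the structural-hazard resolution the text describes.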

Figure 4.18: Structure hazard of the line buffer sharing

2. Memory Hierarchy

The proposed memory hierarchy reduces the internal SRAM required by the deblocking filter and is very effective for achieving low power consumption. In [1], the internal memory consumes about 70% of the power of the video decoder. Since storing the full picture width of pixels in the internal SRAM for horizontal-edge filtering is unnecessary, the proposed prediction-based cache mechanism eliminates rarely used pixels and thus allows a smaller cache. The memory hierarchy keeps the intermediate filtered pixels with high data locality from the deblocking filter in this smaller cache. Therefore, trading off the external memory power consumption against the internal memory power, an analysis is presented to determine the appropriate cache size.

3. Prediction-based Mechanism

The prediction-based mechanism utilizes the edge detection unit in the vertical direction to judge whether a boundary will be filtered. It decides whether the vertically filtered pixels should be saved into the cache in order to improve access efficiency. The pseudo code in Algorithm 2 shows that, at the bottom edge of a 16x16 block, because the pixels of the vertical-direction Q block are not yet available, a pre-edge detection can be performed to foresee whether the edge filter will be on or off. At this point the Sobel results may be only half of the complete results because of the spatial pixel locality. Once the next line of pixels becomes available, the edge detection for the top edge is completed and the tag array is checked to determine a hit or miss. In the next item, a miss-rate analysis is presented to show that the proposed method outperforms the previous design [1]. The architecture of the prediction-based cache is shown in Figure 4.19.

4. Performance Evaluation

The proposed prediction-based mechanism with the second-level memory hierarchy is applied to foresee whether pixels should be stored into the internal SRAM. Increasing the internal memory size to store the intermediate pixels achieves lower external memory

Algorithm 2: Prediction-based judgement
Input: Sobel results and β
Output: TAG

1  forall j such that j ≤ Width/4 do
2      check the edge detection unit:
3      if Sobel result < β/2 then
4          store to SRAM
5          TAG[j] ← 1
6      end
7  end
8  (next line of pixels available)
9  forall j such that j ≤ Width/4 do
10     check TAG[j]:
11     if TAG[j] = 1 and Sobel result < β/2 then
12         HIT
13     end
14 end
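Algorithm 2 can be modeled directly in software. The sketch below is illustrative: the interface (a per-column Sobel result array and the β threshold) is an assumption made for the model, not the exact hardware interface:

```python
# Software model of Algorithm 2 (prediction-based judgement).
# sobel[j] is the (partial) Sobel result for the j-th 4-pixel column,
# beta the threshold from the edge detection unit.

def predict_and_store(sobel, beta, width):
    """First pass, at the bottom edge of a 16x16 block: guess filter on/off
    from the partial Sobel result and tag the columns stored to SRAM."""
    tag = [0] * (width // 4)
    for j in range(width // 4):
        if sobel[j] < beta / 2:        # guessed "filter on": keep the pixels
            tag[j] = 1                 # store to SRAM and mark the tag array
    return tag

def check_hits(tag, sobel, beta, width):
    """Second pass, once the next line of pixels is available: a hit occurs
    when the column was tagged and the completed decision agrees."""
    return ["HIT" if tag[j] == 1 and sobel[j] < beta / 2 else "MISS"
            for j in range(width // 4)]
```

A column whose partial Sobel result stays below β/2 is tagged and later hits; an untagged column whose edge turns out to be on must be fetched from external memory, which is the miss case analyzed next.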

Figure 4.19: Prediction-based cache with memory hierarchy

bandwidth. However, the high hardware cost and the power consumption of a full-speed SRAM are a challenge for the video decoder. Therefore, to set an appropriate SRAM size, analyzing the internal SRAM power and the external memory bandwidth power consumption is important for the decision. In [1], a reduction factor of 8 is shown to be the compromise among external SDRAM bandwidth, SDRAM power consumption, and internal memory size. As the internal SRAM is reduced, the miss-rate penalty increases, and with it the external memory power consumption. As shown in Figure 4.20, a miss arises as follows: if the guessed filter-on/off array entry is 1, the pixels are stored in the SRAM while its capacity allows. However, if the edge is truly on but the top reference

(Example with reduction ratio 4, width 12, and a 3×12-byte SRAM: a miss happens when the guessed filter on/off disagrees with the true horizontal-edge decision.)

Figure 4.20: Miss rate analysis with memory reduction

(Average miss rate versus SRAM size reduction ratio r = 2–32, comparing [1], [17], and the proposed design; annotated averages: 36%, 6.3%, and 1.6%.)

Figure 4.21: Miss rate analysis with memory reduction

sample pixels were not saved in the internal SRAM, a miss happens. In that case a request is sent to the external memory to load the pixels. Compared with [1], the proposed method achieves a better miss rate, as shown in Figure 4.21. With a reduction factor of 2, the miss rate is reduced by up to 36% on average. As the factor doubles further, the reduction becomes less apparent because of the limited memory size.

4.4.4 Strong/Weak Filter

The last part of the deblocking filter is the filtering operation itself, which smooths over-sharpened, discontinuous edges. In this section, the proposed architecture of the strong and weak filter design is presented. The filter engine takes 4 pixels on each side of the edge, called the P block and the Q block. If the edge is strongly over-sharpened, the strong

Table 4.3: Deblocking filter cycle count

           ICME '03 [6]   ISCAS '05 [25]   ISSOC '10 [26]   Proposed
Cycle/MB   504            250              243              231
Gate       18.91K         19.64K           N/A              54K

filter is chosen, and the 3 modified outputs on each side are computed through several adder-based interpolations. When the weak filter is chosen instead, 2 modified outputs on each side are changed, as shown in Figure 4.22. In the modified deblocking filter architecture, we adopt 4-filter parallelism for both the strong and weak filters to enhance the coding speed. As shown in Figure 4.23, the P and Q blocks both enter the filter engine, which contains the strong and weak filters, and the modified output is finally selected according to the edge-strength constraint. In the strong filter architecture, because common terms can be shared within the hardware, the area of the strong filter can be reduced by xx%. The weak filter design consists of delta generation that depends on the 4 pixels on each side.

(Filter input: P block P3–P0 and Q block Q0–Q3 across the block edge; weak and strong filter outputs.)

Figure 4.22: Input and output of the filter engine
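The strong and weak filtering equations can be written out as a short software model. The sketch below uses the HEVC-style luma filter equations; for brevity it omits the tC clipping and the per-pixel enable conditions, so it is a simplified model rather than the complete standard operation:

```python
# Sketch of the strong/weak luma filtering equations (HEVC-style;
# tC clipping and conditional per-pixel enables omitted for brevity).

def strong_filter(p, q):
    """p = (p0, p1, p2, p3), q = (q0, q1, q2, q3): the 4 pixels on each side,
    ordered outward from the edge. Returns the 3 modified outputs on the
    P side; the Q side is symmetric."""
    p0, p1, p2, p3 = p
    q0, q1, _, _ = q
    p0n = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    p1n = (p2 + p1 + p0 + q0 + 2) >> 2
    p2n = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return p0n, p1n, p2n

def weak_filter(p, q):
    """Weak filter: 2 modified outputs (p0', q0') via a shared delta term."""
    p0, p1 = p[0], p[1]
    q0, q1 = q[0], q[1]
    delta = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4
    return p0 + delta, q0 - delta
```

Because both filters build their outputs from overlapping sums of the same P/Q pixels, the common terms (e.g. p1 + p0 + q0) can be computed once and shared, which is the hardware-sharing opportunity exploited in the strong filter architecture above.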

4.4.5 Cycle Analysis

In summary, the number of operation cycles spent in the deblocking stage is important to the pipeline-stage decision. Table 4.3 therefore lists the comparison of operation cycles.

Figure 4.23: Strong and weak filter
