Chapter 1
Introduction
1.2 Objective
To reduce the number of external memory accesses, we propose a processing order for the de-blocking filter and an efficient architecture for the filter.
The proposed design accelerates the filtering process with a pipeline technique while reducing the internal memory size and the number of registers.
Chapter 2
Background and Related Works
In this chapter, we first describe how blocking artifacts arise and the de-blocking filter algorithm of H.264/AVC. We then introduce several de-blocking filter processing orders, down to the sample processing level.
2.1 The blocking artifact
Most video compression standards use JPEG-like block-transform coding to exploit spatial redundancy. In JPEG, the image is divided into 8×8 blocks, and each block is transformed with the discrete cosine transform (DCT). The transformed coefficients are then quantized, using the quantization steps held in the quantization table, and entropy coded. The quantization table is designed to preserve more of the low-frequency coefficients and fewer of the high-frequency coefficients. At low bit rates, a block may be represented by only the DC coefficient and a few AC coefficients.
As a result, the correlation between neighboring blocks may be lost, and the reconstructed image or video shows visible artificial discontinuities. This is the blocking artifact, as shown in figure 2.1.
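To make the effect concrete, the following minimal Python sketch models the extreme low-bit-rate case in one dimension: every 8-sample block keeps only its DC term, which for an orthonormal DCT is equivalent to replacing the block by its mean. The function name `dc_only_reconstruction` and the ramp input are illustrative assumptions and are not part of any codec.

```python
def dc_only_reconstruction(samples, block_size=8):
    """Replace every block by its mean, i.e. keep only the DC term of each block."""
    out = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        mean = sum(block) / len(block)
        out.extend([mean] * len(block))
    return out


ramp = list(range(16))                    # smooth gradient without any real edge
recon = dc_only_reconstruction(ramp)

step_inside = abs(recon[1] - recon[0])    # difference between neighbors inside a block
step_at_edge = abs(recon[8] - recon[7])   # difference across the block boundary
print(step_inside, step_at_edge)          # 0.0 8.0
```

The input changes by 1 between any two neighboring samples, but the reconstruction is flat inside each block and jumps by 8 at the block boundary, which is the kind of discontinuity the de-blocking filter is meant to smooth.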
Causes of blocking artifacts in H.264/AVC:
(1) The prediction error of intra and inter frames is coded with an integer discrete cosine transform (DCT). Coarse quantization of the transform coefficients can produce visually disturbing discontinuities at the block boundaries [4].
(2) The second cause is motion-compensated prediction. Motion-compensated blocks are produced by copying interpolated pixel data, possibly from different locations in different reference frames [4]. Because the copied data rarely fit together perfectly, discontinuities typically appear at the edges of the copied blocks.
Figure 2.1: Illustration of blocking artifact
2.2 De-blocking Filter Algorithm
H.264/AVC applies an in-loop de-blocking filter to eliminate blocking artifacts and produce a smooth frame, as shown in figure 2.2. The intra and inter frame prediction errors are transformed and then quantized. After decoding, the reconstructed block therefore differs from the original block, and discontinuities can appear at the block edges. The de-blocking filter is applied to remove these discontinuities.
The de-blocking filter divides a frame into macro-blocks, and the macro-block is its processing unit. The next macro-block is sent in only after the current macro-block has been completely processed, and the next frame is sent in only after the current frame has been completely processed. The de-blocking filter sits inside the decoding loop, so the filtered frames serve as reference frames for motion-compensated prediction, which yields smaller residual data.
[Figure 2.2 block diagram. Input: NAL. Blocks: entropy decoding, inverse quantization, inverse transform, motion compensation, sub-pel interpolation, de-blocking filter. Output: reconstructed frame, which is also kept as reference frame.]
Figure 2.2: The location of de-blocking filter in H.264/AVC decoder
2.2.1 Input of the de-blocking filter
The inputs of the de-blocking filter are the boundary strength, the threshold variables and the pixels, as shown in figure 2.3. The boundary strength (Bs) is derived from the coding information of the macro-block and is assigned an integer value from 0 to 4. The filter uses Bs to classify each edge and to select a suitable filter for eliminating the blocking artifact.
The input pixels follow a specific filtering order, and each pixel may be filtered multiple times. The next macro-block is sent in only after the current macro-block has been completely processed, and frames are processed in the same way.
The two threshold values α and β depend on the quantization parameter (QP). Together with the content of the frame, they switch the filtering on or off for each individual set of samples, because they help to distinguish whether a discontinuity is a true edge or a blocking artifact.
Figure 2.3: Input of the de-blocking filter
2.2.2 De-blocking Filter Processing Order
As recommended in the H.264/AVC standard, the de-blocking filter processes every macro-block in units of 4×4 pixel blocks. The filtering process shall be performed on a macro-block basis, with all macro-blocks in a frame processed in order of increasing macro-block address. Prior to the operation of the de-blocking filter process for each macro-block, the de-blocked samples of the macro-block or macro-block pair above (if any) and the macro-block or macro-block pair to the left (if any) of the current macro-block shall be available.
The de-blocking filter process is invoked for the luma and chroma components separately. For each luminance macro-block, vertical edges are filtered first, from left to right, and then horizontal edges are filtered from top to bottom. In each direction, the luma de-blocking filter process is performed on four 16-sample edges and the de-blocking filter process for each chroma component is performed on two 8-sample edges.
Sample values above and to the left of the current macro-block that may already have been modified by the de-blocking filter process on previous macro-blocks shall be used as input to the de-blocking filter process on the current macro-block, and they may be further modified during the filtering of the current macro-block. Sample values modified during filtering of vertical edges are used as input for the filtering of the horizontal edges of the same macro-block.
For each luminance macro-block, the four 16-sample vertical edges are filtered first, from left to right, in the order edge 0, edge 1, edge 2 and edge 3, as shown in figure 2.4.
[Figure 2.4 labels: vertical edges 0 to 3 across a luma block 16 pixels wide; samples p3 p2 p1 p0 and q0 q1 q2 q3 on the two sides of an edge]
Figure 2.4: Horizontal filtering across luma vertical edges
The vertical filtering is performed after the horizontal filtering: the four 16-sample horizontal edges are filtered from top to bottom, in the order edge 0, edge 1, edge 2 and edge 3, as shown in figure 2.5.
[Figure 2.5 labels: horizontal edges 0 to 3 across a luma block 16 pixels high; samples p3 p2 p1 p0 and q0 q1 q2 q3 on the two sides of an edge]
Figure 2.5: Vertical filtering across luma horizontal edges
The de-blocking filter process for each chroma component is performed on two 8-sample edges in each direction. For each chroma block, the vertical edges are filtered first, from left to right, in the order edge 0 and edge 1, and then the horizontal edges are filtered from top to bottom, in the order edge 0 and edge 1, as shown in figure 2.6.
[Figure 2.6 labels: vertical edges 0 and 1 and horizontal edges 0 and 1 of a chroma block of 8×8 pixels; samples p1 p0 and q0 q1 q2 q3 on the two sides of an edge]
Figure 2.6: Filtering process of chroma block
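The edge order described in this section can be written down as a short enumeration. The following Python sketch is only an illustration: the function names and the plane labels `luma`, `cb` and `cr` are assumptions, and so is the choice to list the luma plane before the two chroma planes, since the text only states that the components are processed separately.

```python
def edge_order(num_edges):
    """Filtering order inside one plane: all vertical edges, then all horizontal edges."""
    vertical = [("vertical", i) for i in range(num_edges)]      # left to right
    horizontal = [("horizontal", i) for i in range(num_edges)]  # top to bottom
    return vertical + horizontal


def macroblock_edge_order():
    """Edge order for one macro-block: four edges per direction for luma, two for chroma."""
    order = []
    for plane, num_edges in (("luma", 4), ("cb", 2), ("cr", 2)):
        order += [(plane, direction, index) for direction, index in edge_order(num_edges)]
    return order


order = macroblock_edge_order()
for entry in order[:8]:
    print(entry)   # the eight luma edges, vertical edges 0..3 first
```

Each luma entry stands for a 16-sample edge and each chroma entry for an 8-sample edge, so one macro-block has 8 luma edges and 4 edges per chroma component.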
2.3 Boundary Strength
The filter operation is applied to each edge of a 4×4 block. The filter decision depends on the boundary strength and on the gradient of the image samples across the boundary. The boundary strength (Bs) is assigned an integer value from 0 to 4. For luminance blocks, a Bs value is calculated for every edge between two 4×4 blocks. For chrominance blocks, the Bs values are not calculated independently; they are copied from the corresponding luminance edges.
Bs = 4 selects the strongest filter. It is used when one or both sides of the edge are intra coded and the edge is a macro-block boundary. Bs = 3 is used when one of the neighboring blocks is intra coded but the edge is not a macro-block boundary. Bs = 2 means that neither of the two adjacent blocks is intra coded and one of them contains non-zero coefficients. Bs = 1 means that the blocks have different reference frames, a different number of reference frames, or different motion vector values. Otherwise Bs = 0, and no filtering is applied on this edge. The derivation is shown in figure 2.7.
[Figure 2.7 flowchart; recoverable node labels: "Block p or q ...", "Block boundary ...", "Block p and q have different reference frames or different number of reference frames?"]
Figure 2.7: Flowchart of Bs deriving process
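The decision sequence described above can be sketched in a few lines of Python. This is a simplified illustration and not the derivation of the standard: the `BlockInfo` record and its field names are hypothetical, and the motion vector test is a plain inequality, as in the description above.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    """Hypothetical per-block coding information needed for the Bs decision."""
    intra: bool          # block is intra coded
    has_coeffs: bool     # block contains non-zero transform coefficients
    ref_frames: tuple    # reference frames used by the block
    mvs: tuple           # motion vectors used by the block


def boundary_strength(p, q, on_mb_boundary):
    """Return Bs (0..4) for the edge between the 4x4 blocks p and q."""
    if p.intra or q.intra:
        return 4 if on_mb_boundary else 3
    if p.has_coeffs or q.has_coeffs:
        return 2
    # Tuples of different length also compare unequal, which covers
    # the "different number of reference frames" case.
    if p.ref_frames != q.ref_frames or p.mvs != q.mvs:
        return 1
    return 0


intra_blk = BlockInfo(True, True, (), ())
inter_a = BlockInfo(False, False, (0,), ((1, 0),))
inter_b = BlockInfo(False, False, (0,), ((3, 0),))
inter_c = BlockInfo(False, True, (0,), ((1, 0),))

print(boundary_strength(intra_blk, inter_a, True))   # 4
print(boundary_strength(intra_blk, inter_a, False))  # 3
print(boundary_strength(inter_a, inter_c, False))    # 2
print(boundary_strength(inter_a, inter_b, False))    # 1
print(boundary_strength(inter_a, inter_a, False))    # 0
```

Testing the conditions from the strongest case downwards keeps the function a direct transcription of the flowchart: the first condition that matches fixes Bs.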
Table 2.1: Determination of boundary strength
Bs  Block modes and conditions
4   One of the blocks is intra coded and the block boundary is a macro-block boundary.
3   One of the blocks is intra coded but the block boundary is not a macro-block boundary.
2   One of the blocks has coded residuals.
1   One of the following conditions holds:
    Motion compensation from different reference frames.
    Different number of reference frames.
    Different motion vector values.
0   No filtering is applied on this specific edge.
As shown in figures 2.8 and 2.9, the Bs values for the chroma vertical edges 0 and 1 are copied from the corresponding luma macro-block vertical edges 0 and 2. The Bs values for vertical filtering across horizontal edges are obtained in the same way.
[Figure 2.8 label: Bs0]
Figure 2.8: Bs value for horizontal filtering across vertical edges
[Figure 2.9 labels: Bs'0 Bs'1 Bs'2 Bs'3]
Figure 2.9: Bs value for vertical filtering across horizontal edges
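The copying rule for chroma can be illustrated with a small Python sketch. The data layout is an assumption made for the example: each luma edge carries four Bs values, one per 4-sample segment of the 16-sample edge, and for 4:2:0 video each copied value then covers two samples of the 8-sample chroma edge. The function names are hypothetical.

```python
def chroma_bs_from_luma(luma_bs):
    """Copy the Bs values of luma edges 0 and 2 to chroma edges 0 and 1.

    luma_bs[e] holds the four Bs values along luma edge e (e = 0..3).
    """
    if len(luma_bs) != 4:
        raise ValueError("expected Bs values for four luma edges")
    return [list(luma_bs[0]), list(luma_bs[2])]


def bs_for_chroma_sample(chroma_bs, edge, sample):
    """Bs that applies to sample 0..7 along a chroma edge (assumes 4:2:0)."""
    return chroma_bs[edge][sample // 2]


luma_bs = [[4, 4, 4, 4], [2, 2, 1, 0], [3, 3, 0, 0], [1, 1, 1, 1]]
chroma_bs = chroma_bs_from_luma(luma_bs)
print(chroma_bs)                               # [[4, 4, 4, 4], [3, 3, 0, 0]]
print(bs_for_chroma_sample(chroma_bs, 1, 5))   # 0
```

The same function applies unchanged to the horizontal edges, with `luma_bs` holding the Bs values of the four horizontal luma edges.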
2.3.1 Gradient of image samples across the boundary
Figure 2.10 shows the set of eight samples across a boundary between two 4×4 blocks, p3 to p0 in block p and q0 to q3 in block q. Filtering does not take place for edges with Bs equal to zero. A set of samples across an edge is filtered only if Bs is not zero and the gradient conditions given in this section are all true.
The two threshold values α and β depend on the average quantization parameter (QP) of the two 4×4 blocks p and q. Together with the content of the frame, they switch the filtering on or off for each individual set of samples.
When QP is small, blocking distortion produces only a very small gradient across the block boundary. The thresholds are therefore small, and a larger gradient is treated as a true edge of the picture rather than a blocking artifact, so the filter is turned off. When QP is large, blocking distortion can produce a large gradient across the block boundary, so the thresholds are larger and the filter is turned on. Whether the samples p0, p1, p2, q0, q1 and q2 are filtered is determined by Bs, α, β and the content of the frame itself.
The filtering of p0 and q0 takes place if the following conditions are all true.
$Bs \neq 0$ (2.1)
$|p_0 - q_0| < \alpha$ (2.2)
$|p_1 - p_0| < \beta \;\&\; |q_1 - q_0| < \beta$ (2.3)
The filtering of p1 or q1 takes place if the following condition is satisfied, where the first inequality applies to p1 and the second to q1.
$|p_2 - p_0| < \beta \text{ or } |q_2 - q_0| < \beta$ (2.4)
The filtering of p2 or q2 takes place if the following condition is satisfied.
$(|p_2 - p_0| < \beta \text{ or } |q_2 - q_0| < \beta) \;\&\; |p_0 - q_0| < (\alpha \gg 2) + 2$ (2.5)
[Figure 2.10 labels: samples p3 p2 p1 p0 in block p and q0 q1 q2 q3 in block q on the two sides of the block edge; α bounds |p0 − q0|; β bounds |p1 − p0| and |q1 − q0|]
Figure 2.10: Gradient of image samples across the boundary
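The conditions (2.1) to (2.5) can be sketched as small predicates. The Python code below is an illustration under stated assumptions: the function names are hypothetical, samples are passed as lists `p = [p0, p1, p2, p3]` and `q = [q0, q1, q2, q3]`, and the example thresholds α = 50 and β = 11 are simply the Table 2.2 entries for index 36.

```python
def edge_is_filtered(bs, p, q, alpha, beta):
    """Conditions (2.1) to (2.3): decide whether p0 and q0 are filtered at all."""
    return (bs != 0
            and abs(p[0] - q[0]) < alpha
            and abs(p[1] - p[0]) < beta
            and abs(q[1] - q[0]) < beta)


def inner_sample_flags(p, q, beta):
    """Condition (2.4), evaluated separately for the p side and the q side."""
    return abs(p[2] - p[0]) < beta, abs(q[2] - q[0]) < beta


def small_step_across_edge(p, q, alpha):
    """Second part of condition (2.5): the step across the edge is small."""
    return abs(p[0] - q[0]) < (alpha >> 2) + 2


alpha, beta = 50, 11
p = [100, 101, 102, 103]             # p0, p1, p2, p3
q_artifact = [106, 107, 108, 109]    # small step across the edge
q_true_edge = [180, 181, 182, 183]   # large step, likely a real edge

print(edge_is_filtered(2, p, q_artifact, alpha, beta))    # True
print(edge_is_filtered(2, p, q_true_edge, alpha, beta))   # False
print(edge_is_filtered(0, p, q_artifact, alpha, beta))    # False
print(inner_sample_flags(p, q_artifact, beta))            # (True, True)
print(small_step_across_edge(p, q_artifact, alpha))       # True
```

The step of 6 across the first edge is below α and is filtered, whereas the step of 80 across the second edge exceeds α and is preserved as picture content.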
2.3.2 Derivation process for the thresholds for each block edge
Let qPav be a variable specifying the average quantization parameter of two adjacent 4×4 blocks; it determines the thresholds α and β.
It is derived as follows, where qPp and qPq are the quantization parameters of blocks p and q.
$qP_{av} = (qP_p + qP_q + 1) \gg 1$ (2.6)
Let indexA be a variable that is used to access the α table (Table 2.2) as well as the tC0 table (Table 2.3), and let indexB be a variable that is used to access the β table (Table 2.2).
The variables indexA and indexB are derived as follows.
$\mathrm{indexA} = \mathrm{Clip3}(0, 51, qP_{av} + \mathrm{FilterOffsetA})$ (2.7)
$\mathrm{indexB} = \mathrm{Clip3}(0, 51, qP_{av} + \mathrm{FilterOffsetB})$ (2.8)
(FilterOffsetA and FilterOffsetB allow the filter to be made weaker or stronger manually.)
Table 2.2: Derivation of the offset-dependent threshold variables α and β from indexA and indexB
index A (for α ) or index B (for β )
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
α 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 5 6 7 8 9 10 12 13
β 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 3 3 3 3 4 4 4
index A (for α ) or index B (for β )
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
α 15 17 20 22 25 28 32 36 40 45 50 56 63 71 80 90 101 113 127 144 162 182 203 226 255 255
β 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18
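Equations (2.6) to (2.8) and the lookup in Table 2.2 can be combined into one small routine. The Python sketch below copies the α and β rows of Table 2.2 into lists; the function names and the default offsets of zero are assumptions made for the example, and 8-bit samples are assumed.

```python
# Rows of Table 2.2, indexed by indexA (alpha) and indexB (beta), 0..51.
ALPHA = [0] * 16 + [4, 4, 5, 6, 7, 8, 9, 10, 12, 13,
                    15, 17, 20, 22, 25, 28, 32, 36, 40, 45, 50, 56, 63,
                    71, 80, 90, 101, 113, 127, 144, 162, 182, 203, 226, 255, 255]
BETA = [0] * 16 + [2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
                   6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12,
                   12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18]


def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def thresholds(qp_p, qp_q, filter_offset_a=0, filter_offset_b=0):
    """Return (alpha, beta, index_a, index_b) for the edge between blocks p and q."""
    qp_av = (qp_p + qp_q + 1) >> 1                    # equation (2.6)
    index_a = clip3(0, 51, qp_av + filter_offset_a)   # equation (2.7)
    index_b = clip3(0, 51, qp_av + filter_offset_b)   # equation (2.8)
    return ALPHA[index_a], BETA[index_b], index_a, index_b


print(thresholds(28, 29))                      # (22, 7, 29, 29)
print(thresholds(36, 36))                      # (50, 11, 36, 36)
print(thresholds(10, 10))                      # (0, 0, 10, 10)
print(thresholds(51, 51, filter_offset_a=6))   # (255, 18, 51, 51)
```

For an average QP below 16 both thresholds are zero, so conditions (2.2) and (2.3) can never hold and the filter is effectively switched off.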
2.4 Filtering Operation
In H.264/AVC, the core function of the de-blocking filter is the filtering process, which is divided into two modes. The normal mode is applied when Bs is 1 to 3, and the stronger mode is applied when Bs is equal to 4.
2.4.1 Normal mode : ( Bs=1~3 )
For luminance blocks:
The filtering unit needs to read 4 samples (p1, p0, q0, and q1) and updates 2 samples (p0 and q0).
If $|p_2 - p_0| < \beta$:
The filtering unit needs to read 4 samples (p2, p1, p0, and q0) and updates the p1 sample.
If $|q_2 - q_0| < \beta$:
The filtering unit needs to read 4 samples (q2, q1, q0, and p0) and updates the q1 sample.
[Figure 2.11 labels: left 4×1 samples p3 p2 p1 p0, edge, right 4×1 samples q0 q1 q2 q3; legend: update sample, read sample]
Figure 2.11: Normal mode operations for luminance block
For chrominance blocks:
The filtering unit needs to read 4 samples (p1, p0, q0, and q1) and updates 2 samples (p0 and q0).
[Figure 2.12 labels: left 4×1 samples p3 p2 p1 p0, edge, right 4×1 samples q0 q1 q2 q3; legend: update sample, read sample]
Figure 2.12: Normal mode operations for chrominance block
Filtering for edges with Bs less than 4
For luminance blocks, the variables p0' and q0' are derived when conditions (2.1) to (2.3) all hold, and p1' and q1' are derived in addition when the corresponding inequality in (2.4) holds. The remaining samples keep their input values.
For chrominance blocks, only the variables p0' and q0' are derived.
In both cases the correction applied to a sample is limited to a range bounded by a threshold, an operation called clipping. Different clipping ranges are applied for the internal and edge samples [4].
The threshold tC0 is specified in Table 2.3, depending on the values of indexA and Bs.
The threshold tC is determined as follows.
If the edge belongs to a luminance block:
$t_C = t_{C0} + ((|p_2 - p_0| < \beta)\ ?\ 1 : 0) + ((|q_2 - q_0| < \beta)\ ?\ 1 : 0)$
If the edge belongs to a chrominance block:
$t_C = t_{C0} + 1$
Table 2.3: Value of filter clipping variable tC0 as a function of indexA and Bs
index A
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
Bs=1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 4 4 4 5 6 6 7 8 9 10 11 13
Bs=2 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5 5 6 7 8 8 10 11 12 13 15 17
Bs=3 1 2 2 2 2 3 3 3 4 4 4 5 6 6 7 8 9 10 11 13 14 16 18 20 23 25
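For reference, the normal-mode update can be sketched in Python. The formulas in this sketch are written from the H.264/AVC standard rather than taken from this chapter, 8-bit samples are assumed, and the function name `filter_normal` and the list layout `p = [p0, p1, p2, p3]`, `q = [q0, q1, q2, q3]` are hypothetical. The caller is assumed to have checked conditions (2.1) to (2.3) already and to pass the tC0 value read from Table 2.3.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def filter_normal(p, q, beta, tc0, luma=True):
    """Normal-mode filtering (Bs = 1 to 3) of one line of samples across an edge."""
    p0, p1, p2, _ = p
    q0, q1, q2, _ = q
    ap = abs(p2 - p0) < beta
    aq = abs(q2 - q0) < beta
    # Clipping threshold tC as defined in this section.
    tc = tc0 + int(ap) + int(aq) if luma else tc0 + 1

    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p, new_q = list(p), list(q)
    new_p[0] = clip3(0, 255, p0 + delta)   # edge samples are clipped to 8 bits
    new_q[0] = clip3(0, 255, q0 - delta)

    avg = (p0 + q0 + 1) >> 1
    if luma and ap:   # internal sample p1, correction limited by tC0
        new_p[1] = p1 + clip3(-tc0, tc0, (p2 + avg - (p1 << 1)) >> 1)
    if luma and aq:   # internal sample q1, correction limited by tC0
        new_q[1] = q1 + clip3(-tc0, tc0, (q2 + avg - (q1 << 1)) >> 1)
    return new_p, new_q


p = [100, 100, 100, 100]
q = [108, 108, 108, 108]
# tc0 = 3 is the Table 2.3 entry for indexA = 36 and Bs = 2; beta = 11 for indexB = 36.
print(filter_normal(p, q, beta=11, tc0=3))               # ([103, 102, 100, 100], [105, 106, 108, 108])
print(filter_normal(p, q, beta=11, tc0=3, luma=False))   # ([103, 100, 100, 100], [105, 108, 108, 108])
```

In the luma case the step of 8 across the edge is spread over four samples, while in the chroma case only p0 and q0 change, which matches the read and update counts given in this section.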
2.4.2 Stronger mode : ( Bs=4 )
For luminance blocks:
If $|p_2 - p_0| < \beta$ and $|p_0 - q_0| < (\alpha \gg 2) + 2$:
The filtering unit needs to read 6 samples (p3, p2, p1, p0, q0, and q1) and updates 3 samples (p2, p1 and p0).
If $|q_2 - q_0| < \beta$ and $|p_0 - q_0| < (\alpha \gg 2) + 2$:
The filtering unit needs to read 6 samples (q3, q2, q1, q0, p0, and p1) and updates 3 samples (q2, q1 and q0).
[Figure 2.13 labels: left 4×1 samples p3 p2 p1 p0, edge, right 4×1 samples q0 q1 q2 q3; legend: update sample, read sample]
Figure 2.13: Stronger mode operations for luminance block
For chrominance blocks:
The filtering unit needs to read 4 samples (p1, p0, q0, and q1) and updates 2 samples (p0 and q0).
[Figure 2.14 labels: left 4×1 samples p3 p2 p1 p0, edge, right 4×1 samples q0 q1 q2 q3; legend: update sample, read sample]
Figure 2.14: Stronger mode operations for chrominance block
Filtering for edges with Bs equal to 4
For luminance blocks, when conditions (2.1) to (2.3) hold and the stronger-mode condition given above is met on the p side (or on the q side), the variables p0', p1' and p2' (or q0', q1' and q2') are derived.
For chrominance blocks, or when that condition does not hold, only the variables p0' and q0' are derived, and the other samples keep their input values.
[Figure 2.15 flowchart; recoverable labels: input 8 pixels (p3, p2, p1, p0, q0, q1, q2, q3), Bs, α and β; chrominance block; update pixels p0 and q0]
Figure 2.15: Flow chart of filtering process
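The stronger mode can be sketched in the same way. As before, the formulas are written from the H.264/AVC standard rather than taken from this chapter, the function name `filter_strong` and the list layout `p = [p0, p1, p2, p3]`, `q = [q0, q1, q2, q3]` are hypothetical, and conditions (2.1) to (2.3) are assumed to have been checked by the caller.

```python
def filter_strong(p, q, alpha, beta, luma=True):
    """Stronger-mode filtering (Bs = 4) of one line of samples across an edge."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    small_step = abs(p0 - q0) < (alpha >> 2) + 2
    new_p, new_q = list(p), list(q)

    if luma and small_step and abs(p2 - p0) < beta:
        # Three samples on the p side are updated from six input samples.
        new_p[0] = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
        new_p[1] = (p2 + p1 + p0 + q0 + 2) >> 2
        new_p[2] = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    else:
        # Chroma, or condition not met: only p0 is updated.
        new_p[0] = (2 * p1 + p0 + q1 + 2) >> 2

    if luma and small_step and abs(q2 - q0) < beta:
        new_q[0] = (q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3
        new_q[1] = (q2 + q1 + q0 + p0 + 2) >> 2
        new_q[2] = (2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3
    else:
        new_q[0] = (2 * q1 + q0 + p1 + 2) >> 2
    return new_p, new_q


p = [100, 100, 100, 100]
q = [108, 108, 108, 108]
print(filter_strong(p, q, alpha=50, beta=11))               # ([103, 102, 101, 100], [105, 106, 107, 108])
print(filter_strong(p, q, alpha=50, beta=11, luma=False))   # ([102, 100, 100, 100], [106, 108, 108, 108])
```

Compared with the normal mode, the luma result is a smoother ramp across six samples, and no clipping threshold is involved.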
2.5 Related Work
In the H.264/AVC standard, the de-blocking filter processes the vertical edges first, from left to right, and then the horizontal edges, from top to bottom.
The filtering process is performed on the boundary between two 4×4 pixel blocks. A macro-block contains one luma block and two chroma blocks. The luma block has sixteen 4×4 pixel blocks, and each chroma block has four 4×4 pixel blocks. Filtering also requires the eight neighboring 4×4 pixel blocks above and the eight neighboring 4×4 pixel blocks to the left. Filtering one macro-block therefore involves 16 + 2×4 + 8 + 8 = 40 4×4 pixel blocks altogether.
The de-blocking filter processes all macro-blocks in units of one block. Under this criterion, the edges of a 4×4 sub-block are filtered in the following order: the left edge first, the right edge second, the top edge third, and the bottom edge last.
In the processing order figures, each numeral marks an edge between two adjacent 4×4 sub-blocks; filtering one such edge takes four passes of the filter unit, one for each line of samples crossing the edge.
2.5.1 Basic Processing Order
In [9], the basic processing order, shown in figure 2.16, does not exploit the data dependence between neighboring 4×4 pixel blocks. For example, filtering starts with vertical edge 1: block (L1) and block (B0) are sent to the filter from the internal memory through its two ports. After vertical edge 1 has been filtered, both partially filtered blocks (L1) and (B0) are stored back into the internal memory. Likewise, when vertical edge 5 is subsequently filtered according to the basic filtering order, blocks (B0) and (B1) have to be loaded from the internal memory, and after filtering they are stored back into the internal memory. Block (B0) is thus loaded twice and stored twice. This shows that the basic processing order does not exploit the data dependence between neighboring 4×4 pixel blocks.
Assuming a memory system with a 32-bit data bus, the basic processing order needs (4×2×2×16+(4×2×2×4)×2)=384 memory reads and 384 memory writes per macro-block, for a total of 768 memory accesses.
[Figure 2.16 panels: luma, chroma]
Figure 2.16: Basic processing order
2.5.2 Advanced Processing Order
Figure 2.17 shows the advanced filter processing order, which exploits one-dimensional data dependence [9]. For example, filtering starts with vertical edge 1: block (L1) and block (B0) are sent to the filter from the internal memory through its two ports. After vertical edge 1 has been filtered, the partially filtered block (L1) is stored into the internal memory, but block (B0) is kept in a buffer inside the de-blocking filter unit for the next filtering stage. Likewise, when vertical edge 2 is filtered next, only block (B1) has to be loaded from the internal memory, because block (B0) is already buffered in the de-blocking filter unit. In this way, the number of internal memory accesses per 4×4 pixel block can be reduced by up to half in both horizontal and vertical filtering.
Assuming a memory system with a 32-bit data bus, the advanced filter processing order needs (384-16×4×2)=256 memory reads and 256 memory writes per macro-block, for a total of 512 memory accesses.
[Figure 2.17 panels: luma, chroma]
Figure 2.17: Advanced processing order
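One simple counting model reproduces the read counts quoted for the basic order of section 2.5.1 and the advanced order of section 2.5.2. The model is an assumption made for this illustration and is not taken from [9]: a 4×4 block of 8-bit samples occupies four 32-bit words, the basic order loads two blocks for every edge, and the advanced order loads one block to start a row and one further block per edge because the previous block stays buffered in the filter unit.

```python
WORDS_PER_BLOCK = 4   # a 4x4 block of 8-bit samples is four 32-bit words


def block_loads(edges_per_row, rows, reuse):
    """Block loads for one filtering direction of one plane."""
    per_row = edges_per_row + 1 if reuse else 2 * edges_per_row
    return rows * per_row


def reads_per_macroblock(reuse):
    """32-bit memory reads for one macro-block (luma plus two chroma planes)."""
    total_blocks = 0
    # (edges per row of blocks, rows of blocks) for luma, Cb and Cr
    for edges, rows in ((4, 4), (2, 2), (2, 2)):
        total_blocks += 2 * block_loads(edges, rows, reuse)   # both filtering directions
    return total_blocks * WORDS_PER_BLOCK


basic_reads = reads_per_macroblock(reuse=False)
advanced_reads = reads_per_macroblock(reuse=True)
print(basic_reads, 2 * basic_reads)         # 384 768
print(advanced_reads, 2 * advanced_reads)   # 256 512
```

Every loaded block is also written back once, so the total number of accesses is twice the number of reads in both cases.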
2.5.3 2-D Processing Order
Figure 2.18 shows the 2-D filter processing order. This order conforms to the de-blocking filter processing order of the standard and alternates between horizontal filtering and vertical filtering [10]. For example, filtering starts with vertical edge 1: block (L1) and block (B0) are sent to the filter from the internal memory through its two ports. After vertical edge 1 has been filtered, block (L1) is stored back into the internal memory, and block (B0) is buffered in the de-blocking filter unit for the next filtering stage. After the following filtering step, of vertical edge 2, block (B0) is sent to the transpose buffer to wait for the filtering of horizontal edge 3, and block (B1) is buffered in the de-blocking filter unit for the filtering of vertical edge 4.
Assuming a memory system with a 32-bit data bus, the 2-D filter processing order needs (4×12+4×12×2+(4×6+4×2×2)×2)=224 memory reads and 224 memory writes per macro-block, for a total of 448 memory accesses.
[Figure 2.18 panels: luma, chroma]
Figure 2.18: 2-D processing order
2.5.4 2-D Simultaneous Processing Order
Figure 2.19 shows the 2-D simultaneous filter processing order. It interleaves the horizontal filtering of vertical edges with the vertical filtering of horizontal edges and performs the two simultaneously [5]. As the figure shows, one horizontal filter unit and one vertical filter unit operate at the same time. The goal of this method is to reduce the number of clock cycles. The memory system is assumed to consist of dual-port RAMs with a 32-bit data bus.
For example, filtering starts with vertical edge 1: block (L1) and block (B0) are sent to the filter from the internal memory through its two ports. After vertical edge 1 has been filtered, block (L1) is stored back into the internal memory, and block (B0) is buffered in the de-blocking filter unit for the next filtering stage. After the following filtering step, of vertical edge 2, block (B0) is sent to the transpose buffer to wait for the filtering of horizontal edge 3, and block (B1) is buffered in the de-blocking filter unit for the filtering of vertical edge 3.
Figure 2.19: 2-D simultaneous processing order
2.5.5 Summary of related work
Table 2.4: Comparison of the architectures described above
Method                          Basic [9]                     Advanced [9]                  2-D Simultaneous [5]
Cycles/MB                       712                           760                           460
Filtering cycles/MB             392                           440                           140
External memory access cycles   320                           320                           320
Working frequency               100 MHz                       100 MHz                       100 MHz
Edge filters                    1                             1                             2
4×4 array                       2                             2                             3
4×4 FIFO                        0                             0                             9
Technology (μm)                 0.25                          0.25                          0.13
Gate count                      18.91K                        18.91K                        35.99K
Memory architecture             One read and one write SRAM   One read and one write SRAM   Two read and two write SRAM
SRAM requirements