Chapter 2 Scalable Video Coding
2.2 Interframe Wavelet Video Coding
2.2.3 Entropy Coder
Embedded zeroblock coding is used to code the wavelet coefficients in the interframe wavelet coder [6]. With the help of two powerful embedded coding techniques, set partitioning and context modeling, the algorithm features low computational complexity and high compression efficiency. The compression efficiency comes from exploiting the strong statistical dependencies among the quad-tree nodes built up from the wavelet coefficients.
Chapter 3 Interframe Wavelet with H.264/AVC Inter-Prediction Based Motion Compensated Temporal Filtering
3.1 Motivation
Our motivation is to improve the motion-compensated temporal filtering (MCTF) process in [7].
In the interframe wavelet coding structure, the errors between the original and the reconstructed frames accumulate along the temporal hierarchy.
In addition, the temporally filtered low-pass frames serve as reference pictures in temporal scalability: they are the decoded pictures in temporally down-sampled playback. Hence, it is critical to have good-quality MCTF low-pass frames.
3.2 H.264/AVC Inter-Prediction Scheme
3.2.1 Tree Structured Motion Compensation
The basic unit in AVC motion estimation is the 16x16 macroblock structure. The luminance part of each macroblock can be divided into four types of sub-macroblocks, namely, 16x16, 16x8, 8x16 and 8x8, as illustrated in Figure 3-1(a). The 8x8
sub-macroblocks can further be partitioned into 8x8, 8x4, 4x8, and 4x4 blocks, as illustrated in Figure 3-1(b).
Figure 3-1 (a) Four types of sub-macroblocks. (b) Sub-partition for 8x8 sub-macroblocks
3.2.2 Sub-Pel Motion Vectors
Before motion search, the reference picture is padded by extending its border pixels. Full-pel motion estimation is then employed to find the motion vector and the mode with the least cost. After the full-pel search, the interpolated picture is used for the 1/2-pel and 1/4-pel motion search.
3.2.3 Motion Vector Prediction
The motion vector of one block is highly correlated with those of its neighboring blocks. This phenomenon becomes more apparent when the block sizes get smaller.
Thus, we can use the motion vectors of the left, upper-left, upper, and upper-right blocks to predict the current motion vector, which removes much of the redundancy among nearby motion vectors. Figure 3-2 shows an example of the mode decision after employing the AVC-based motion estimation.
Figure 3-2 An example for mode decision
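As a minimal sketch of how neighboring motion vectors can serve as a predictor, the following Python snippet takes the component-wise median of three neighbors, in the spirit of the H.264/AVC median predictor, with the upper-left neighbor substituting for an unavailable upper-right one. The function names and the tuple representation are illustrative assumptions, and the special cases of the standard (directional prediction for 16x8 and 8x16 partitions, reference index checks) are omitted.

```python
from statistics import median


def predict_mv(left, upper, upper_right, upper_left=None):
    """Component-wise median predictor over neighboring motion vectors.

    Each argument is an (x, y) tuple, or None when that neighbor is
    unavailable. This is a simplified illustration, not the full
    H.264/AVC prediction rule.
    """
    # Use the upper-left neighbor when the upper-right one is unavailable.
    third = upper_right if upper_right is not None else upper_left
    # Simplification: any neighbor that is still missing counts as (0, 0).
    cands = [mv if mv is not None else (0, 0) for mv in (left, upper, third)]
    return (median(c[0] for c in cands), median(c[1] for c in cands))


def mv_residual(mv, predictor):
    """Difference that would be coded instead of the raw motion vector."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])
```

Only the residual between the actual vector and the predictor needs to be coded, which is small when neighboring vectors are similar.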
3.3 Connection State Decision
After applying H.264/AVC motion estimation, we obtain motion vectors for each frame pair. For the temporal decomposition, we need to classify each pixel in the frame pair as connected or unconnected. We call this procedure “connection state decision”. Figure 3-3 illustrates the decision procedure with a typical example; its left part shows the situation before the decision. Pixels with a single motion vector connection are drawn as solid-line white circles. A pixel in frame A that is referenced by more than one pixel in frame B, together with the related pixels in B, is called multi-connected and is drawn as a dotted-line white circle. The other pixels in A are not referenced by any pixel in B; we call them unconnected pixels and draw them as black circles.
However, the multi-connected pixels in A are not allowed because MCTF can be accomplished along only one motion trajectory for one pixel. Therefore, we need to decide a good motion vector for each multi-connected pixel.
It is reasonable to expect that a good prediction has a small prediction cost. We use the difference between the predicted and the referenced pixels as the prediction cost. In the left part of Figure 3-3, there are two multi-connected pixels in frame A. The prediction costs of the related pixels in B are computed and compared, and only the connection with the smallest cost is kept.
As a result, each pixel in A has at most one corresponding pixel in B for MCTF. The right part of Figure 3-3 shows how the connection states change after the state decision.
Figure 3-3 Illustration of connect/unconnected state decisions.
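The decision procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: motion vectors are integer-valued and given per pixel, the prediction cost is taken as the absolute pixel difference, and the names `decide_connections`, `motion`, and `links` are hypothetical.

```python
def decide_connections(frame_a, frame_b, motion):
    """Keep a single motion trajectory for every referenced pixel of A.

    frame_a, frame_b: 2-D lists of pixel values with the same size.
    motion: dict mapping a B pixel (m, n) to its integer vector (dm, dn),
            meaning B[m][n] is predicted from A[m - dm][n - dn].
    Returns (links, unconnected_a): links maps each referenced A pixel to
    the one B pixel kept for it; unconnected_a holds the A pixels that no
    B pixel references.
    """
    rows, cols = len(frame_a), len(frame_a[0])
    best = {}  # A pixel -> (lowest cost so far, B pixel with that cost)
    for (m, n), (dm, dn) in motion.items():
        am, an = m - dm, n - dn
        if not (0 <= am < rows and 0 <= an < cols):
            continue  # reference lies outside A: this B pixel stays unconnected
        # Assumed cost: absolute difference of predicted and referenced pixel.
        cost = abs(frame_b[m][n] - frame_a[am][an])
        if (am, an) not in best or cost < best[(am, an)][0]:
            best[(am, an)] = (cost, (m, n))
    links = {a_pix: b_pix for a_pix, (_, b_pix) in best.items()}
    all_a = {(r, c) for r in range(rows) for c in range(cols)}
    return links, all_a - set(links)
```

For a multi-connected A pixel, the B pixels that lose the comparison do not appear in `links` and are therefore treated as unconnected.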
3.4 I-Block Detection
The concepts of I-block for MCTF were described in [7]. We adopted these concepts with some modifications in our MCTF scheme.
In the temporal decomposition stage, we perform motion estimation first and then pixel-based state detection. Each pixel state is either “connected” or “unconnected” (the latter includes the pixels rejected in the multi-connected decision). After that, we decide the state of each macroblock according to the steps below:
Step 1: Unconnected frame detection
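If the number of connected pixels in the two frames is too small, we consider the frame pair unsuitable for MCTF; a scene change may have occurred. In that case we force the states of all pixels to “unconnected”. Let Fconnect be a weighting factor between 0 and 1; the detection criterion is as follows.
\[ \text{frame state} = \begin{cases} \text{unconnected}, & N_{\text{connected}} < F_{\text{connect}} \cdot N_{\text{total}} \\ \text{unchanged}, & \text{otherwise} \end{cases} \]
Here N_connected is the number of connected pixels and N_total is the total number of pixels in the frame.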
Step 2: Connected/Unconnected macroblock detection
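If the frame is not forced to be an unconnected frame in Step 1, we detect the state of each macroblock. Let Fconnect be a weighting factor between 0 and 1; the detection criterion is very similar to that of Step 1, applied to the pixels of one macroblock:
\[ \text{macroblock state} = \begin{cases} \text{unconnected}, & N^{\text{MB}}_{\text{connected}} < F_{\text{connect}} \cdot N^{\text{MB}}_{\text{total}} \\ \text{connected}, & \text{otherwise} \end{cases} \]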
Step 3: I-block detection
However, connected blocks may still be poorly matched after motion estimation. Such blocks tend to produce artifacts in the temporal low-pass frame, which leads to poor visual quality in temporal scalability. The red circles in Figure 3-4 mark obvious artifacts of this kind.
Figure 3-4 The artifacts of the low-pass frame after 4 temporal decompositions.
Therefore, to reduce these artifacts, we detect the poorly matched regions and force their states to “unconnected”. Our I-block size is 16x16. Let B[m,n] be the connected block at location (m,n) of the B frame (predicted frame) and A[m-dm, n-dn] be the motion-compensated block with motion vector (dm,dn) in the A frame (reference frame). We compute the variances of these two blocks and take the minimum, Vmin = min{Var(B[m,n]), Var(A[m-dm, n-dn])}. If the mean squared prediction error (MSE) between the two blocks is larger than the threshold F*Vmin, the block is declared an unconnected block, where F is an adjusting parameter. Based on our experiments, F is taken around 0.7. We will show the visual quality improvement in the experimental results. The decision is based on Eq. (1):
\[ \text{state}(B[m,n]) = \begin{cases} \text{unconnected}, & \mathrm{MSE}\big(B[m,n],\, A[m-d_m, n-d_n]\big) > F \cdot V_{\min} \\ \text{connected}, & \text{otherwise} \end{cases} \qquad (1) \]
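The three detection steps can be sketched in Python as below. This is only an illustration under stated assumptions: the ratio test used for Steps 1 and 2 (connected count below Fconnect times the pixel count) is an assumed form of the criterion, blocks are plain 2-D lists, and all function names are hypothetical.

```python
def too_few_connected(num_connected, num_pixels, f_connect):
    """Steps 1 and 2 (assumed form): True when the connected share is too small.

    Apply it with frame-level counts for Step 1 and with the pixel counts
    of one macroblock (for example 256 for 16x16) for Step 2.
    """
    if not 0.0 <= f_connect <= 1.0:
        raise ValueError("f_connect must lie in [0, 1]")
    return num_connected < f_connect * num_pixels


def variance(block):
    """Population variance of all pixels in a 2-D block."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def mean_squared_error(block_b, block_a_mc):
    """MSE between a B block and its motion-compensated A block."""
    diffs = [(b - a) ** 2
             for row_b, row_a in zip(block_b, block_a_mc)
             for b, a in zip(row_b, row_a)]
    return sum(diffs) / len(diffs)


def is_i_block(block_b, block_a_mc, f=0.7):
    """Step 3: declare the block unconnected when MSE > F * Vmin."""
    v_min = min(variance(block_b), variance(block_a_mc))
    return mean_squared_error(block_b, block_a_mc) > f * v_min
```

Because the threshold scales with the smaller block variance, a flat block tolerates only a small prediction error, while a highly textured block tolerates more.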
After the above steps, temporal filtering starts. In the prediction stage, we generate the high-pass frame according to Eq. (2). The temporal low-pass frame is generated by Eq. (3) and (4) based on the state of connection of each macroblock.
Typically, motion compensation works well on the connected pixels.
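\[ H[m,n] = \frac{B[m,n] - \tilde{A}[m-d_m,\, n-d_n]}{\sqrt{2}} \qquad (2) \]
\[ L[m,n] = \tilde{H}[m+d_m,\, n+d_n] + \sqrt{2}\, A[m,n] \quad \text{(connected)} \qquad (3) \]
\[ L[m,n] = \sqrt{2}\, A[m,n] \quad \text{(unconnected)} \qquad (4) \]
3.5 Motion Cost Function Adjustment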
The rate-distortion cost function J = D + λ*R is used to decide the best motion vectors in the AVC motion estimation, where D is the frame difference and R is the estimated number of motion vector coding bits. However, as the temporal level increases in MCTF, the energy of the temporal low-pass frame also increases. To maintain a constant rate-distortion relation at the higher temporal levels, the λ value is therefore increased by a factor of W for each additional temporal level, as generalized in Eq. (5). The theoretical weighting factor W is 2.
λ(t) = W*λ(t-1), where t is the temporal level index (5)
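Unrolling the recursion in Eq. (5) gives λ(t) = W^t * λ(0). A small sketch, assuming the lowest temporal level is indexed 0 and using hypothetical function names:

```python
def lagrange_multiplier(base_lambda, level, w=2.0):
    """Closed form of lambda(t) = W * lambda(t - 1), with lambda(0) = base_lambda.

    Indexing the lowest temporal level as 0 is an assumption of this sketch.
    """
    if level < 0:
        raise ValueError("temporal level must be non-negative")
    return base_lambda * w ** level


def motion_cost(distortion, rate_bits, lam):
    """Rate-distortion cost J = D + lambda * R for one candidate vector."""
    return distortion + lam * rate_bits
```

With a larger λ at higher temporal levels, a candidate vector must lower the distortion by more to justify the same number of motion bits.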
3.6 Bi-Directional MCTF
In Eq. (2), the H frame is produced from the B frame and its prediction from the A frame, which is forward prediction only. However, a block may find a better match (motion compensation) from the other direction, so backward motion compensation is needed in addition to the forward prediction used in the one-directional case.
This raises an implementation problem. If backward motion compensation is introduced into the forward MCTF scheme, a future reference frame is required, which means the data of the next GOP must be available when the current GOP is encoded or decoded. We solve this problem by using backward MCTF as the basis instead: bi-directional prediction is then obtained by adding forward motion compensation, which needs only the past GOP data as references. Figure 3-5 shows the state of connection of each pixel in the backward MCTF scheme.
Figure 3-5 State of connection of each pixel
After modifying Eq. (2)-(4), we can get Eq. (6)-(8) for backward MCTF. Thus, frame B has both forward and backward motion vectors. The use of bi-directional motion estimation reduces high-pass frame magnitude and thus increases coding efficiency.
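\[ H[m,n] = \frac{A[m,n] - \tilde{B}[m-d_m,\, n-d_n]}{\sqrt{2}} \qquad (6) \]
\[ L[m,n] = \tilde{H}[m+d_m,\, n+d_n] + \sqrt{2}\, B[m,n] \quad \text{(connected)} \qquad (7) \]
\[ L[m,n] = \sqrt{2}\, B[m,n] \quad \text{(unconnected)} \qquad (8) \]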
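A minimal sketch of one backward lifting step in the sense of Eq. (6)-(8): the high-pass frame is the scaled difference between A and its prediction from B, and the low-pass frame is built from B, adding the high-pass value only at connected pixels. The sketch assumes orthonormal Haar scaling by the square root of 2, integer motion vectors (so no interpolation is needed), and references that stay inside the frame; the names `backward_mctf`, `motion`, and `links` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def backward_mctf(frame_a, frame_b, motion, links):
    """One backward Haar lifting step with integer motion.

    motion: dict mapping every A pixel (m, n) to its vector (dm, dn), so
            that A[m][n] is predicted from B[m - dm][n - dn].
    links:  dict mapping each connected B pixel to the single A pixel it is
            paired with; unconnected B pixels are absent.
    Returns (low, high) as 2-D lists of floats.
    """
    rows, cols = len(frame_a), len(frame_a[0])
    # High-pass: A minus its motion-compensated prediction from B.
    high = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            dm, dn = motion[(m, n)]
            high[m][n] = (frame_a[m][n] - frame_b[m - dm][n - dn]) / SQRT2
    # Low-pass: start from the unconnected rule for every B pixel ...
    low = [[SQRT2 * frame_b[m][n] for n in range(cols)] for m in range(rows)]
    # ... and add the paired high-pass value at connected pixels.
    for (bm, bn), (am, an) in links.items():
        low[bm][bn] += high[am][an]
    return low, high
```

For a connected pair the low-pass value equals (A + B) / sqrt(2), the usual orthonormal Haar average, while an unconnected pixel is only rescaled.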
Since each macroblock has both forward and backward motion vectors, a mode decision is necessary. As Figure 3-6 shows, let A_t and B_t denote the t-th frame pair at a certain temporal level. After bi-directional motion estimation, we obtain the motion vectors (d_{x,t}, d_{y,t}) and (d_{x,t-1}, d_{y,t-1}) from B_t and B_{t-1}, respectively, together with the prediction cost of each direction. The motion vector with the minimal prediction cost is chosen for temporal filtering.
Figure 3-6 Bi-directional motion estimation
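The direction decision amounts to a minimum-cost selection per macroblock. A small sketch, with a hypothetical function name and with ties resolved in favor of the forward direction (an arbitrary assumption):

```python
def choose_direction(forward, backward):
    """Pick the prediction direction with the smaller cost.

    Each argument is a (motion_vector, prediction_cost) pair for one
    macroblock. Ties go to the forward direction by assumption.
    """
    fwd_mv, fwd_cost = forward
    bwd_mv, bwd_cost = backward
    if fwd_cost <= bwd_cost:
        return "forward", fwd_mv
    return "backward", bwd_mv
```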
3.7 Experiments and Results
In this section, we will show the experiments and results of I-block, motion cost function adjustment, and bi-directional MCTF.
Poorly matched blocks produce artifacts in the low-pass frames. After deeper temporal decompositions, these artifacts propagate and cause significant subjective quality loss; they also make the motion prediction in the following temporal decompositions inaccurate. Figure 3-7 shows the subjective improvement obtained with I-blocks: the artifacts in the red circles in the left image are reduced in the right image.
Figure 3-7 The low-pass frame after 4 temporal decompositions: Bus_CIF (Left: without I-Block; right: with I-Block)
I-block detection can improve the subjective quality with a very small loss in RD performance. Figure 3-8 shows that the PSNR drops by less than 0.2 dB at high bitrates but improves at low bitrates. If the transmission bandwidth is very small, little of the bitstream remains after extraction, so the low-pass signal makes up a large share of the extracted bitstream and the quality of the low-pass frames becomes very important. Hence, I-block detection contributes a slight improvement at low bitrates.
[Plot: Bus_CIF (15Hz); x-axis: Bitrate; y-axis: PSNR; curves: noI, I]
Figure 3-8 The effect of I-block on RD performance of Bus_CIF sequence.
(“noI”: without I-block detection; “I”: with I-block detection)
After applying the motion cost function adjustment, we can get better RD performance in all bitrates. However, in our experiments the weighting factor in Eq.
(5) needs to be further tuned manually. As Figure 3-9 shows, if W is 2, the RD performance in low rates improves significantly with less than 0.2 dB PSNR loss in high rates, so we choose 2 as our default weighting factor. Figure 3-10 shows the RD performance improvement of Football_CIF sequence.
[Plot: MCF adjustment; x-axis: Bitrate]
Figure 3-9 Comparison of RD curves with different weighting factors of motion cost function (MCF) (Bus_CIF sequence)
[Plot: x-axis: Bitrate; y-axis: PSNR; curves: MCF, noMCF]
Figure 3-10 The improvement by motion cost function adjustment of Football_CIF sequence.
Figure 3-11 shows the RD performance comparison between the proposed scheme and MC-EZBC. We observe about 1 dB PSNR loss at high rates and a slight improvement at very low bitrates. However, the subjective quality is comparable.
Figure 3-12 shows that the proposed scheme may have better subjective quality even with lower PSNR.
[Plot: Football_CIF; x-axis: Bitrate; y-axis: PSNR; curves: Proposed, MC-EZBC]
Figure 3-11 RD performance comparison of proposed scheme and MC-EZBC
Figure 3-12 Subjective quality comparison of proposed scheme (left) and MC-EZBC (right). (The 39th frame of Football_CIF at 500kbps.)
Chapter 4 Interframe Wavelet with Motion Information Scalability
4.1 Motivation
The interframe wavelet has several advantages over the existing video standards, but it may be improved in a few aspects. In our experiments, we observed that the interframe wavelet does not perform well at low bit-rates. It may be due to the following reasons:
First, the motion vector information in the bit-stream is intended for reconstruction at the full temporal frame rate and spatial resolution. In fact, interframe wavelet video coding is designed and optimized for high bit-rates. The coding parameters and strategy may need readjustment when low-rate performance is also a priority.
Another cause of the performance loss is that the motion information takes up a considerable portion of the bit-stream at low bit-rates. For example, the HD sequence “Harbour” has about 400 kbps of motion information. In low bit-rate situations, the motion vectors may not even fit into the total bit budget. The high motion information bit-rate arises mainly because the appropriate motion vector range is larger at higher levels of the temporal pyramid decomposition. In a GOP with four-level temporal decomposition, the motion information of the highest-level pair needs approximately three times more bits than that of the lowest-level pair. Motion vector prediction has been proposed to compress the motion information, but the total motion information is still huge [14].
When low bit-rate and spatial scalability are required together, the performance of the interframe wavelet drops significantly. This is mainly because the motion information has no built-in spatial scalability. For example, for a bit-stream coded at a spatial resolution of 720x480, the truncated bit-stream for the spatially down-scaled resolution of 360x240 still contains all the motion information that is used to reconstruct the 720x480 images.
In the following section, we propose a motion information partitioning scheme that can improve the performance of interframe wavelet video coding at low bit-rates.
4.2 Motion Information Partitioning
In [15], Hang and Tsai proposed motion information scalability for MC-EZBC. In this thesis, we extend this concept with some modifications to partition the motion information generated by the AVC interframe prediction in MCTF. The wavelet coefficient information in the conventional interframe wavelet coding scheme has spatial, temporal, and SNR scalability, but the motion information cannot be partitioned for spatial and SNR scalability. If the required bitrate is very low, the extractor (puller) may fail to extract the bitstream because the motion information alone needs more bits than the specified budget. Also, at low rates, we may want to save some bits from the motion information and spend them on wavelet coefficients to achieve acceptable quality. Therefore, we partition the motion information after motion estimation.
The basic idea is to partition the motion information into multiple “motion layers”. Each layer records the motion vectors with a specified accuracy. The lowest layer denotes a rough representation for the motion vectors and the higher layers are used to improve the accuracy. Particularly, different layers are coded independently so that the motion information can be truncated at the layer boundary.
In AVC interframe prediction, the basic prediction unit is the 16x16 macroblock. Each macroblock can be a combination of 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 partitions, and the corresponding motion vectors have 1/4-pixel accuracy. Let MVi denote the motion vector in motion layer i, and Modei the best prediction mode in that layer. We partition the motion vectors according to the steps below.
Step 1: Do 16x16 block size motion search with integer-pixel accuracy.
In this step, the base-layer motion information is obtained. Since the first motion layer records only rough motion vectors, we allow only macroblock-size motion estimation with integer-pixel accuracy, so Mode0 can only be 16x16. In addition, to simplify the motion partitioning scheme with bi-directional motion estimation, the direction of the motion vectors in the following motion layers is determined in Step 1.
Step 2: Do 16x16 to 8x8 block size motion search with 1/2-pixel accuracy.
The first enhancement motion layer is derived in Step 2 to refine the base-layer motion information. More detailed motion vectors are produced in this step, so Mode1 can be one of {16x16, 16x8, 8x16, 8x8} with half-pixel accuracy. We use the base-layer motion vectors to predict the current motion vectors, and the difference between the current motion vectors and the base-layer ones forms the first enhancement motion layer. If we denote the motion vector obtained in this step as MV, the output residue vector MV1 is (MV - MV0*2), where the factor 2 converts the integer-pixel vector MV0 to half-pixel units.
Step 3: Do all sub-block size motion search with 1/4-pixel accuracy.
The further refined motion vector is obtained in this step. We allow all possible modes with 1/4-pixel accuracy to get the finest motion vector, so Mode2 can be one of {16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4}. We also use the previous motion layers to predict the current motion vectors. The difference between these motion vectors and the sum of the base layer and the first enhancement layer forms the second enhancement layer. If MV denotes the motion vector obtained in this step, the output residue vector MV2 is MV - (MV1 + MV0*2)*2.
Step 4: Encode the above three motion layers separately using CABAC.
After all motion layers are ready, they are encoded with CABAC independently.
The motion information can be truncated at the layer boundary according to the spatial size or bitrate requirement.
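The residue computation of Steps 1 to 3 and the matching reconstruction can be sketched as follows. This is a simplified illustration, not the thesis implementation: it handles one motion vector per macroblock, ignores the sub-block modes and the CABAC coding, and the function names are hypothetical. Vectors are integer tuples expressed in the units of their own layer (integer, half, and quarter pixels).

```python
def partition_mv(mv_int, mv_half, mv_quarter):
    """Split one motion vector into a base layer and two residue layers.

    mv_int:     vector found in Step 1, in integer-pixel units.
    mv_half:    vector found in Step 2, in half-pixel units.
    mv_quarter: vector found in Step 3, in quarter-pixel units.
    """
    mv0 = tuple(mv_int)
    # MV1 = MV - MV0 * 2 (prediction from the base layer in half-pel units).
    mv1 = tuple(h - 2 * b for h, b in zip(mv_half, mv0))
    # MV2 = MV - (MV1 + MV0 * 2) * 2 (prediction in quarter-pel units).
    mv2 = tuple(q - 2 * (r + 2 * b) for q, r, b in zip(mv_quarter, mv1, mv0))
    return mv0, mv1, mv2


def reconstruct_mv(layers):
    """Rebuild a vector from the first k layers (k = 1, 2, or 3).

    Returns (vector, units_per_pixel): a result of ((7, -2), 2) means
    7/2 and -2/2 pixels.
    """
    mv, units = tuple(layers[0]), 1
    for residue in layers[1:]:
        mv = tuple(2 * c + r for c, r in zip(mv, residue))
        units *= 2
    return mv, units
```

Dropping trailing layers before calling `reconstruct_mv` yields a coarser vector, which mirrors truncation of the motion information at a layer boundary.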
As Figure 4-1 shows, the original motion vectors are partitioned into three layers.
Each frame of motion information in the temporal decomposition is divided into base layer and enhancement layers as described earlier.
[Figure: Original: ¼-pixel accuracy. Proposed: base layer (integer-pixel accuracy), 1st enh. layer (½-pixel accuracy), 2nd enh. layer (¼-pixel accuracy).]
Figure 4-1 The base and enhancement motion layers
4.3 Motion Information Scalability
The entire motion vector information is organized into groups as shown in Figure 4-2. The base layers of all temporal levels are needed to reproduce the full-temporal-resolution sequence; hence, the base layers are the highest-priority motion vector data.
Figure 4-2 Motion information of a GOP
If the required bitrate is too small, the extractor drops one or two enhancement layers depending on the conditions. Likewise, if the user wants to extract a spatially down-sampled bitstream, the extractor can drop the appropriate enhancement layers. When only a few scalability levels are needed, the enhancement layers can be reduced to one to save bits in encoding the motion vectors.
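One possible extraction policy is sketched below: keep the base layer and then as many enhancement layers, in order, as fit into a bit budget reserved for motion information. The greedy rule, the budget parameter, and the function name are assumptions made for illustration; the actual conditions used by the extractor are not specified here.

```python
def select_motion_layers(layer_bits, motion_budget_bits):
    """Choose how many motion layers to keep under a bit budget.

    layer_bits: coded sizes in bits, ordered [base, 1st enh., 2nd enh., ...].
    Returns (layers_kept, bits_used). Raises ValueError when even the base
    layer does not fit, which corresponds to a failed extraction.
    """
    if not layer_bits or layer_bits[0] > motion_budget_bits:
        raise ValueError("base motion layer does not fit into the budget")
    kept, used = 0, 0
    for bits in layer_bits:
        if used + bits > motion_budget_bits:
            break  # layers are refinements, so stop at the first one that does not fit
        used += bits
        kept += 1
    return kept, used
```

The bits saved by dropping enhancement layers can then be spent on wavelet coefficients.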
Figure 4-3 Different resolution with different number of motion layers: (a) for two spatial resolutions S0 and S1; (b) for three spatial resolutions S0, S1, and S2.
Here we take Figure 4-3 as an example. Let S0 denote the original spatial resolution, S1 = S0/4, and S2 = S1/4, and let the motion information be partitioned into 3 motion layers. In Figure 4-3 (a), if we need the two spatial resolutions S0 and S1, we extract the top two motion layers (base and first enhancement) for S1 and all motion layers for S0. The motion vectors then have 1/4-pixel accuracy in both S0 and S1, because a 1/2-pixel displacement at S0 corresponds to a 1/4-pixel displacement at the half-width, half-height resolution S1. A similar concept applies to the case of three spatial resolutions, as Figure 4-3 (b) shows.
Figure 4-4 is the basic operation element in reconstructing pictures in the interframe temporal hierarchy. Frames A and B are reconstructed using frames L and H. Frame A is roughly the sum of the original frame L and the motion compensated version of frame H. The drift comes from the motion-compensated portion of frame H because now the motion vectors are truncated. Frame B is essentially the sum of the original frame H and the motion compensated drifted copy of frame L. Because frame L usually has a much larger power than frame H, the magnitude of drift errors in frame A is much higher than that of frame B. These drift errors propagate from the higher temporal levels to the lower levels.
Figure 4-4 The temporal synthesis process
4.4 Experiments and Results
In Figure 3-11, we can observe that the RD curve cannot extend to low rates because the “pull” process fails. The failure is caused by the large amount of motion
information. Therefore, we partition the motion information into two layers and truncate one motion layer at low rates. As Figure 4-5 shows, we can get