Chapter 2 Scalable Video Coding
2.3 Interframe Wavelet Video Coding
2.4.3 Temporal Scalability
The interframe wavelet transform creates a temporal pyramid after MCTF. To reduce the amount of transform data, we can discard temporal high-pass frames, as shown in Figure 2-20(a). To achieve temporal scalability, the truncation process keeps only the subset of frames needed to reconstruct the required level of the temporal pyramid, as shown in Figure 2-20(b).
Figure 2-20 Temporal scalability. (a) Temporal pyramid and temporal down-scaled sequence. (b) The GOP of the temporal scaled sequence.
If the estimated motion trajectory does not perfectly match the true motion in the original video sequence, the temporal filtering process may generate motion artifacts [14].
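The truncation rule for temporal scalability can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the subband labels ("L", "H1" to "H4", following the labels in Figure 2-20), and the dyadic halving of the frame rate per discarded level are chosen here for the example and are not taken from a reference implementation.

```python
def temporal_subbands_to_keep(full_rate_hz, target_rate_hz, levels):
    """List the temporal subbands a decoder needs for a target frame rate.

    Hypothetical naming: "L" is the final low-pass frame and "H1".."H<levels>"
    are the high-pass frames, with H1 produced by the first (finest) MCTF level.
    """
    rate = float(full_rate_hz)
    dropped = 0
    # Assumption: each discarded high-pass level halves the frame rate.
    while rate > target_rate_hz and dropped < levels:
        rate /= 2.0
        dropped += 1
    if rate != target_rate_hz:
        raise ValueError("target rate is not reachable by dyadic down-scaling")
    # Keep the low-pass frame plus the high-pass levels that were not dropped.
    return ["L"] + [f"H{n}" for n in range(levels, dropped, -1)]


if __name__ == "__main__":
    for target in (30, 15, 7.5):
        print(target, temporal_subbands_to_keep(30, target, levels=4))
```

With a four-level pyramid at 30 Hz, a 15 Hz decoder under these assumptions would drop only the H1 frames, and a 7.5 Hz decoder would drop H1 and H2.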
Chapter 3
3D Subband Video Coding Using Barbell Lifting
At the 68th MPEG meeting (March 2004, Munich), MSRA proposed its MCTF structure and 3D ESCOT entropy coder. The 3D ESCOT entropy coder performs almost as well as the 3D EBCOT that JPEG2000 used. Figure 3-1 shows the block diagram of this coding structure [15]. The proposed video coding algorithm rests on two main concepts: Barbell lifting and 3D ESCOT entropy coding. Its motion estimation scheme is not HVSBM but a scheme based on the one used in H.264. We describe these components in the following subsections.
Figure 3-1 The block diagram of the 3D subband video coding using Barbell lifting [15].
3.1 Barbell Lifting
MSRA proposed the Barbell lifting algorithm for temporal decomposition [15]. Barbell lifting takes a set of pixels, instead of a single pixel, as the input, as shown in Figure 3-2. It provides perfect reconstruction and sub-sample decomposition while the transformed coefficients remain critically sampled.
Figure 3-2 The Barbell lifting [15].
Assume that S0, S1, and S2 are three consecutive frames in a video sequence.
The functions applied to the neighboring frames are called Barbell functions; they can be any linear or non-linear functions that take any pixels in the same frame as variables. The Barbell functions can also vary from pixel to pixel. The basic Barbell lifting step is therefore formulated as:
0()
The Barbell lifting includes two stages: a prediction stage and an update stage.
The prediction stage is applied to the video sequence first. It takes the original input frames to generate the high-pass frames, as shown in Figure 3-3.
Figure 3-3 The prediction stage of the Barbell lifting.
Then the update stage uses the available high-pass frames and the even frames to generate the low-pass frames, as shown in Figure 3-4.
Figure 3-4 The update stage of the Barbell lifting.
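The two-stage structure (predict, then update) and its perfect-reconstruction property can be illustrated with a minimal sketch. Several things here are assumptions and not the MSRA formulation: the Barbell functions are replaced by the identity (no motion alignment), the lifting weights -1/2 and 1/4 are those of the 5/3 filter, sequence boundaries are mirrored, and the function names are hypothetical.

```python
def mctf_forward(frames):
    """One temporal lifting level on an even number of frames (lists of floats)."""
    if not frames or len(frames) % 2 != 0:
        raise ValueError("expected a non-empty, even number of frames")
    evens, odds = frames[0::2], frames[1::2]
    # Prediction stage: each odd frame minus the average of its even neighbors.
    highs = []
    for m, odd in enumerate(odds):
        left = evens[m]
        right = evens[m + 1] if m + 1 < len(evens) else evens[m]  # mirror at the end
        highs.append([c - 0.5 * (a + b) for a, b, c in zip(left, right, odd)])
    # Update stage: each even frame plus a quarter of the adjacent high-pass frames.
    lows = []
    for m, even in enumerate(evens):
        prev = highs[m - 1] if m > 0 else highs[m]  # mirror at the start
        lows.append([e + 0.25 * (p + q) for e, p, q in zip(even, prev, highs[m])])
    return lows, highs


def mctf_inverse(lows, highs):
    """Undo the update stage first, then the prediction stage."""
    evens = []
    for m, low in enumerate(lows):
        prev = highs[m - 1] if m > 0 else highs[m]
        evens.append([l - 0.25 * (p + q) for l, p, q in zip(low, prev, highs[m])])
    frames = []
    for m, high in enumerate(highs):
        left = evens[m]
        right = evens[m + 1] if m + 1 < len(evens) else evens[m]
        odd = [h + 0.5 * (a + b) for a, b, h in zip(left, right, high)]
        frames += [evens[m], odd]
    return frames


if __name__ == "__main__":
    video = [[1.0, 2.0], [3.0, 5.0], [4.0, 4.0], [10.0, 0.0]]
    low_frames, high_frames = mctf_forward(video)
    print(low_frames, high_frames)
    print(mctf_inverse(low_frames, high_frames))
```

Because the inverse subtracts exactly what the update stage added and adds back exactly what the prediction stage removed, reconstruction does not depend on the particular functions used inside each stage, which is the property that lets Barbell functions be chosen freely.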
3.1.1 The Prediction Stage
Figure 3-5 The Barbell functions used in the prediction stage.
Figure 3-5 shows some examples of Barbell functions used in the prediction stage. Figure 3-5(a) is the integer motion alignment case, and the Barbell function of this case is:
Figure 3-5(b) is the fractional-pixel motion alignment case and the Barbell function of this case is:
Figure 3-5(c) is the multiple-to-one mapping case and the Barbell function of this case is:
Figure 3-5(d) shows a special case in which the current pixel (x, y) uses its own motion vector and the motion vectors of neighboring pixels to obtain multiple predictions from the previous frame and combines them into a new prediction. The Barbell function of this case is:
)
3.1.2 The Update Stage
The prediction and update stages may be mismatched when pixels in different frames are aligned with motion vectors at fractional-pixel precision or without one-to-one mapping. Generally speaking, the update and prediction stages use the same motion vectors to save the overhead bits needed to code motion vectors, i.e., the motion vector of the update stage is the inverse of the one used in the prediction stage. Figure 3-6 shows the mismatch problem.
Figure 3-6 The mismatch problem of motion in the prediction and update stages.
As shown in Figure 3-6, the mismatch problem is that the prediction has the path
from to but the update has the path from
Barbell lifting can solve this mismatch problem. In the update stage, each obtained high-pass coefficient is distributed to the pixels that were used to calculate it in the prediction stage. Assume that equation (9) is the Barbell function used in the prediction stage. Combining equations (6) and (9), we calculate the high-pass coefficients by:
Then we can calculate low-pass coefficients in the same way by:
It means that each high-pass coefficient is added back exactly to the pixels that were used to predict it.
For the above example, the predicted weight from to is non-zero. So in the proposed technique, the update weight from to , which equals the prediction weight, is also non-zero.
3.2 Spatial Decomposition
Figure 3-7 The frame after 3 level spatial decomposition.
After temporal decomposition, the spatial decomposition is applied to each created residual frame. The filter used here is the Daubechies 9/7 filter and the analysis filter coefficients are shown in Table 3-1 [35]. The coefficients of the Daubechies 9/7 synthesis filter are shown in Table 5-1 [35].
index   Analysis low pass filter   Analysis high pass filter
0        0.6029490182363579         1.115087052456994
±1       0.2668641184428723        -0.5912717631142470
±2      -0.07822326652898785       -0.05754352622849957
±3      -0.01686411844287495        0.09127176311424948
±4       0.02674875741080976
Table 3-1 The coefficients of the Daubechies 9/7 analysis filters.
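As a quick sanity check on Table 3-1, the sketch below applies the analysis filters to a 1D signal. The coefficient values come from the table; everything else is an assumption made for illustration: the function names, the whole-sample symmetric boundary extension, and the choice to evaluate the low-pass filter at even positions and the high-pass filter at odd positions. A constant signal should pass through the low-pass filter unchanged and vanish in the high-pass filter.

```python
# Daubechies 9/7 analysis taps from Table 3-1, listed for indices 0, ±1, ±2, ...
LOW_HALF = [0.6029490182363579, 0.2668641184428723, -0.07822326652898785,
            -0.01686411844287495, 0.02674875741080976]
HIGH_HALF = [1.115087052456994, -0.5912717631142470, -0.05754352622849957,
             0.09127176311424948]


def symmetric_taps(half):
    """Expand taps for indices 0..K into the full symmetric filter -K..K."""
    return half[:0:-1] + half


def filter_at(signal, taps, center):
    """Evaluate a symmetric filter at one position with mirrored boundaries."""
    k = len(taps) // 2
    n = len(signal)
    total = 0.0
    for offset in range(-k, k + 1):
        idx = center + offset
        if idx < 0:
            idx = -idx                 # mirror around the first sample
        if idx >= n:
            idx = 2 * (n - 1) - idx    # mirror around the last sample
        total += taps[offset + k] * signal[idx]
    return total


def analyze_1d(signal):
    """One decomposition level; assumes len(signal) >= 5 so one mirror suffices."""
    low_taps, high_taps = symmetric_taps(LOW_HALF), symmetric_taps(HIGH_HALF)
    low = [filter_at(signal, low_taps, i) for i in range(0, len(signal), 2)]
    high = [filter_at(signal, high_taps, i) for i in range(1, len(signal), 2)]
    return low, high


if __name__ == "__main__":
    print(sum(symmetric_taps(LOW_HALF)), sum(symmetric_taps(HIGH_HALF)))
    print(analyze_1d([5.0] * 8))
```

A 2D spatial decomposition would apply the same 1D step along rows and then columns; that separable extension is not shown here.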
The spatial decomposition can also be done on the LH, HL, and HH subbands of the first level decomposition. Thus we can get the important information in those subbands and code them.
3.3 Multi-Layer Motion Estimation and Coding
The video coding algorithm proposed by MSRA does not use HVSBM for motion estimation. It uses a motion estimation method adopted in H.264 but makes some changes to achieve motion information scalability.
Figure 3-8 Multi-layer motion estimation and coding.
The algorithm uses multi-layer motion estimation and coding, as shown in Figure 3-8. It generates an embedded bitstream for motion, which consists of one base layer and a few enhancement layers. A coarse motion field can be reconstructed from the base layer and successively refined by the subsequent enhancement layers. The motion vectors of the base layer are large and coarse and may be used alone at low bit rates. The motion vectors of the enhancement layers are small refinements that carry detail and are typically used at high bit rates.
3.4 3D ESCOT
After temporal and spatial decomposition, the generated coefficients are coded with 3D Embedded Subband Coding with Optimal Truncation (3D ESCOT) [16]. 3D ESCOT is in principle very similar to EBCOT used in JPEG2000 [9], which deals with 2D image coding. 3D ESCOT can be regarded as a 3D EBCOT because it extends EBCOT to the coding of 3D signals. 3D ESCOT offers high compression efficiency and other functionalities, such as error resilience and random access.
3D ESCOT takes advantage of the orientation-invariant property of wavelet subbands to reduce the number of contexts, and it codes each subband independently so that each subband can be decoded independently. Because of this feature, 3D ESCOT achieves flexible spatial/temporal scalability, and R-D optimization can be done within subbands to improve compression efficiency.
Each subband is divided into 3D coding blocks and these blocks are coded independently.
For each coefficient x[i, j, k] at position [i, j, k], we assign a binary-valued state variable σ[i, j, k], which indicates the significance of this coefficient. χ[i, j, k] is defined as the sign of x[i, j, k]. It is 0 when the sample is positive and 1 when the sample is negative. σ[i, j, k] is initialized to 0 and toggled to 1 when the first non-zero bit-plane value of x[i, j, k] is encoded. There are three coding operations, and which of them is used depends on σ[i, j, k]. Zero coding (ZC) and sign coding (SC) are used to code x[i, j, k] if σ[i, j, k] = 0, and magnitude refinement (MR) is used if σ[i, j, k] = 1. We introduce these three coding operations below.
3.4.1 Zero Coding
If a coefficient x[i, j, k] is not yet significant in the previous bit-planes, i.e., σ[i, j, k] = 0, ZC is used to code the new information about whether it becomes significant in the current bit-plane. ZC uses the significance information of the immediate neighbors of x[i, j, k] as the context to code its own significance information. There are four types of neighbors, as shown in Figure 3-9.
Figure 3-9 Four types of coding neighbors for zero coding (current sample with its horizontal, vertical, temporal, and diagonal neighbors).
1. Immediate horizontal neighbors. The number of these neighbors is 2 and the number of significant ones is denoted by h, 0≦h≦2.
2. Immediate vertical neighbors. The number of these neighbors is 2 and the number of significant ones is denoted by v, 0≦v≦2.
3. Immediate temporal neighbors. The number of these neighbors is 2 and the number of significant ones is denoted by a, 0≦a≦2.
4. Immediate diagonal neighbors. The number of these neighbors is 12 and the number of significant ones is denoted by d, 0≦d≦12.
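The four neighbor counts listed above can be computed as in the following sketch. It rests on assumptions that are not stated in the text: the significance map is represented as a set of significant positions, neighbors outside the coded block count as insignificant, the i, j, and k axes are the horizontal, vertical, and temporal directions (matching their use in the sign coding quantities), and the 12 diagonal neighbors are taken to be the positions that differ from the current sample by one step in exactly two of the three coordinates.

```python
from itertools import product


def zc_neighbor_counts(significant, i, j, k):
    """Return (h, v, a, d) for the coefficient at position (i, j, k).

    `significant` is a set of (i, j, k) tuples whose state variable is 1.
    """
    h = sum((i + s, j, k) in significant for s in (-1, 1))  # horizontal
    v = sum((i, j + s, k) in significant for s in (-1, 1))  # vertical
    a = sum((i, j, k + s) in significant for s in (-1, 1))  # temporal
    d = 0
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        # Assumed diagonal set: exactly one zero offset gives 12 positions.
        if (di, dj, dk).count(0) == 1 and (i + di, j + dj, k + dk) in significant:
            d += 1
    return h, v, a, d


if __name__ == "__main__":
    sig = {(0, 1, 1), (2, 1, 1), (1, 0, 1), (0, 0, 1), (2, 2, 2)}
    print(zc_neighbor_counts(sig, 1, 1, 1))
```

The returned counts are the inputs that a context table such as Table 3-2 would consume.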
Table 3-2 shows the context assignment map of ZC. If the conditions of two or more columns are satisfied at the same time, the lowest-numbered context is selected.
LLL and LLH sub-bands
h         2    1    1    1    0    0    0    0    0    0    0
v         x    ≥1   0    0    2    1    0    0    0    0    0
a         x    x    ≥1   0    0    0    ≥1   0    0    0    0
d         x    x    x    x    x    x    x    3    2    1    0
context   0    0    1    2    3    4    5    6    7    8    9

LHH sub-band
h         2    1    1    1    1    1    0    0    0    0    0
v+a       x    ≥3   ≥1   ≥1   0    0    ≥3   ≥1   ≥1   0    0
d         x    x    ≥4   x    ≥4   x    x    ≥4   x    ≥4   x
context   0    0    1    2    3    4    5    6    7    8    9

HHH sub-band
d         ≥6   ≥4   ≥4   ≥2   ≥2   ≥2   ≥0   ≥0   ≥0   ≥0
h+v+a     x    ≥3   x    ≥4   ≥2   x    ≥4   ≥2   1    0
context   0    1    2    3    4    5    6    7    8    9

Table 3-2 Context assignment map for ZC.
3.4.2 Sign Coding
SC is called to code χ[i, j, k], which is the sign of coefficient x[i, j, k], if x[i, j, k]
becomes significant in the current bit-plane. SC also utilizes high-order context-based arithmetic coding to compress the sign symbols. The context models of arithmetic coding are based on three quantities hs, vs and ts. They are defined as follows:
hs = min{1, max{-1, σ[i-1,j,k] × (1-2χ[i-1,j,k]) + σ[i+1,j,k] × (1-2χ[i+1,j,k])}}, (13)
vs = min{1, max{-1, σ[i,j-1,k] × (1-2χ[i,j-1,k]) + σ[i,j+1,k] × (1-2χ[i,j+1,k])}}, (14)
ts = min{1, max{-1, σ[i,j,k-1] × (1-2χ[i,j,k-1]) + σ[i,j,k+1] × (1-2χ[i,j,k+1])}}. (15)
Table 3-3 shows the context assignment map and sign prediction map of SC. χ̂ is the sign symbol prediction under the given context, and the symbol sent to the arithmetic coder is χ̂ ⊕ χ.
hs = -1
vs        -1   -1   -1    0    0    0    1    1    1
ts        -1    0    1   -1    0    1   -1    0    1
χ̂          0    0    0    0    0    0    0    0    0
context    0    1    2    3    4    5    6    7    8

hs = 0
vs        -1   -1   -1    0    0    0    1    1    1
ts        -1    0    1   -1    0    1   -1    0    1
χ̂          0    0    0    0    0    1    1    1    1
context    9   10   11   12   13   12   11   10    9

hs = 1
vs        -1   -1   -1    0    0    0    1    1    1
ts        -1    0    1   -1    0    1   -1    0    1
χ̂          1    1    1    1    1    1    1    1    1
context    8    7    6    5    4    3    2    1    0

Table 3-3 Context assignment and sign prediction map for SC.
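The three quantities of equations (13) to (15) can be computed as in this sketch. The dictionary representation of σ and χ, the treatment of missing positions as insignificant, and the function names are assumptions made for the example; the arithmetic follows the equations directly.

```python
def clamped_sign_sum(sigma, chi, pos_a, pos_b):
    """Sum the signed significance of two neighbors and clamp it to [-1, 1]."""
    def contribution(pos):
        # sigma is 0 or 1; (1 - 2*chi) maps sign bit 0 to +1 and sign bit 1 to -1.
        return sigma.get(pos, 0) * (1 - 2 * chi.get(pos, 0))
    return min(1, max(-1, contribution(pos_a) + contribution(pos_b)))


def sc_quantities(sigma, chi, i, j, k):
    """Return (hs, vs, ts) for the coefficient at (i, j, k)."""
    hs = clamped_sign_sum(sigma, chi, (i - 1, j, k), (i + 1, j, k))
    vs = clamped_sign_sum(sigma, chi, (i, j - 1, k), (i, j + 1, k))
    ts = clamped_sign_sum(sigma, chi, (i, j, k - 1), (i, j, k + 1))
    return hs, vs, ts


if __name__ == "__main__":
    sigma = {(0, 1, 1): 1, (2, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 2): 1}
    chi = {(0, 1, 1): 0, (2, 1, 1): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 2): 1}
    print(sc_quantities(sigma, chi, 1, 1, 1))
```

In the example, two positive horizontal neighbors clamp to hs = 1, a single negative vertical neighbor gives vs = -1, and temporal neighbors of opposite sign cancel to ts = 0.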
3.4.3 Magnitude Refinement
MR is called to code new information about x[i, j, k] if σ[i, j, k] was switched to 1 in a previous bit-plane, i.e., the coefficient has already become significant. It uses three contexts for arithmetic coding.
1. The context of x[i, j, k] is 0 if MR has not yet been used for x[i, j, k].
2. The context of x[i, j, k] is 1 if MR has been used for x[i, j, k] and x[i, j, k] has at least one significant neighbor by now.
3. Otherwise, the context is 2.
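The three rules above map to a small decision function. The function and parameter names are hypothetical, and how the number of significant neighbors is counted is left to the caller.

```python
def mr_context(mr_used_before, significant_neighbors):
    """Return the magnitude refinement context (0, 1, or 2) for one coefficient."""
    if not mr_used_before:
        return 0  # rule 1: first refinement of this coefficient
    if significant_neighbors >= 1:
        return 1  # rule 2: refined before, with at least one significant neighbor
    return 2      # rule 3: refined before, with no significant neighbor


if __name__ == "__main__":
    print(mr_context(False, 3), mr_context(True, 1), mr_context(True, 0))
```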
3.4.4 Fractional Bit-Plane Coding
The practical coding gain of 3D ESCOT is higher than that of 3D SPIHT because of the high-order context modeling in SC and MR and the use of fractional bit-plane coding [16].
Fractional bit-plane coding provides a practical means of scanning the wavelet coefficients within each bit-plane for rate-distortion (R-D) optimization at different rates. There are three fractional bit-plane passes, and the scanning order in each of them is along the i-direction first, then the j-direction, and the k-direction last.
3.4.4.1 Significance Propagation Pass
Coefficients that are not yet significant but have a “preferred neighborhood” are processed by this pass. A coefficient has a “preferred neighborhood” if and only if it has at least one significant immediate diagonal neighbor in the HHH subband, or at least one significant horizontal, vertical, or temporal neighbor in the other types of subband. For these coefficients, we apply ZC to code their significance information in the current bit-plane. If a coefficient becomes significant in the current bit-plane, SC is then used to code its sign.
3.4.4.2 Magnitude Refinement Pass
Coefficients that became significant in a previous bit-plane are coded in this pass. The bits of these coefficients in the current bit-plane are coded by MR.
3.4.4.3 Normalization Pass
This pass codes the coefficients that were not coded in the previous two passes.
Because these coefficients are not yet significant, they are processed only by ZC and SC.
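The assignment of a coefficient to one of the three passes can be summarized as in the sketch below. The function name, the string labels for the passes, and the use of precomputed neighbor counts h, v, a, and d are assumptions made for illustration. The sketch also ignores the case where a neighbor becomes significant earlier in the same pass, which a real coder would have to track while scanning.

```python
def select_pass(significant, subband, h, v, a, d):
    """Pick the fractional bit-plane pass that codes a coefficient.

    significant: True if the coefficient became significant in an earlier bit-plane.
    subband: subband label such as "LLL", "LLH", "LHH", or "HHH".
    h, v, a, d: numbers of significant horizontal, vertical, temporal, and
    diagonal neighbors.
    """
    if significant:
        return "magnitude_refinement"
    if subband == "HHH":
        preferred = d >= 1            # diagonal neighbors decide in HHH
    else:
        preferred = (h + v + a) >= 1  # h/v/temporal neighbors decide elsewhere
    return "significance_propagation" if preferred else "normalization"


if __name__ == "__main__":
    print(select_pass(False, "LLL", 1, 0, 0, 0))
    print(select_pass(False, "HHH", 1, 0, 0, 0))
    print(select_pass(True, "LHH", 0, 0, 0, 0))
```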
3.5 Bitstream Truncation and Scalability
After 3D ESCOT is applied, an embedded bitstream is generated for each subband. To satisfy the requested bit rate, the bitstreams of the different subbands are truncated and multiplexed together to construct the final bitstream, which is then transmitted to the receiver. The rate control problem is how to truncate and multiplex these bitstreams so that the final bitstream is R-D optimal.
The basic problem of rate control is, given a target bit rate R0, how to construct a bitstream that satisfies the bit rate constraint and minimizes the overall distortion.
Shoham and Gersho proposed an algorithm based on Lagrange’s theorem that can solve this problem [17].
Taubman extended this algorithm to the rate control of EBCOT [9].
EBCOT partitions the subbands representing the image into a collection of relatively small code-blocks, $B_i$, whose embedded bitstreams may be truncated to the rates $R_i^n$. The contribution from $B_i$ to the distortion in the reconstructed image is denoted $D_i^n$ for each truncation point $n$. Assume that the distortion of each code-block is independent and additive. The overall reconstructed image distortion $D$ can then be represented by:
$$D = \sum_i D_i^{n_i}$$
where $n_i$ denotes the truncation point selected for code-block $B_i$. $D_i^n$ is calculated by:
$$D_i^n = w_{b_i}^2 \sum_{k \in B_i} \left(\hat{s}_i^n[k] - s_i[k]\right)^2$$
where $s_i[k]$ is the 2D sequence of subband coefficients in code-block $B_i$, $\hat{s}_i^n[k]$ is the quantized representation of these coefficients associated with truncation point $n$, and $w_{b_i}$ is the L2-norm of the wavelet basis functions for the subband $b_i$ to which code-block $B_i$ belongs.
The R-D optimization algorithm should select the truncation points $n_i$ for each code-block $B_i$ such that the sum of $R_i^{n_i}$ or $D_i^{n_i}$ meets the constraint imposed by Rmax or Dmax.
Recently, several R-D optimization algorithms have been proposed to solve this problem [18]. It is worth noting that all these algorithms apply to convex curves, i.e., curves whose slopes are strictly decreasing. Some R-D optimization algorithms are based on Lagrange’s theorem, such as the Lagrange multiplier method used in EBCOT [9]. Lagrange’s theorem states that the sum of continuous functions with a boundary condition is optimized at the points with equal slopes. Accordingly, any set of truncation points $\{n_i^\lambda\}$ that minimizes
$$D(\lambda) + \lambda R(\lambda) = \sum_i \left(D_i^{n_i^\lambda} + \lambda R_i^{n_i^\lambda}\right)$$
is optimal in the sense that the distortion cannot be reduced without increasing the overall rate, or vice versa. If we can find a value of λ such that the truncation points that minimize D(λ)+λR(λ) yield R(λ)=Rmax, then this set of truncation points must be an optimal solution to the R-D optimization problem.
Because the number of truncation points in a code-block is finite, we generally cannot find a value of λ such that R(λ) exactly equals Rmax. However, since the code-blocks in EBCOT are very small, the total number of truncation points is very large, and we can find the smallest value of λ such that R(λ)≤Rmax.
To find the optimal truncation point set $\{n_i^\lambda\}$ for any given λ, we need to know the rate-distortion (R-D) pair of each truncation point. λ can be viewed as the R-D slope of the optimal truncation point set. We can find the R-D slope of each truncation point by calculating the bitstream length and distortion at that point. Thus we can construct an operational R-D curve for each code-block.
1) Assume n is the number of truncation points, and 0≦j≦n.
2) For j = 0, 1, 2, …, n, j = 0 is the beginning of the code-block, not a truncation point.
The R-D slope of each truncation point j is $(D_i^{j-1} - D_i^j)/(R_i^j - R_i^{j-1})$, where $R_i^j$ is the accumulative bit length of truncation point j in code block i and $D_i^j$ is the accumulative distortion of truncation point j in code block i.
Generally speaking, $R_i^0 = 0$ and $D_i^0$ is the distortion when the coefficients of the code-block are all 0. We just need to package the truncation points with an R-D slope bigger than or equal to λ.
In 3D ESCOT, the end of each fractional bit-plane is a candidate truncation point.
The R-D slope of each truncation point can be obtained by calculating the bitstream length and distortion [16]. Then we can construct an operational R-D curve for each subband and find its convex hull. All valid truncation points must lie on this convex hull so that R-D optimality at each truncation point is guaranteed. If a truncation point does not have a strictly decreasing R-D slope (i.e., it has larger distortion than the previous truncation point), it is discarded. To find the best threshold value λ, we first set an arbitrary value of λ. Every truncation point whose R-D slope is bigger than or equal to λ is packaged.
After we process all of the truncation points, we obtain a candidate final bitstream. If the bit rate of this bitstream is larger than the requested bit rate, λ is set larger and the bitstream is constructed again. Otherwise, λ is set smaller.
We repeat this procedure iteratively to find the final bitstream whose bit rate is smaller than or equal to the requested bit rate.
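The search for λ described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the 3D ESCOT implementation: each block is given as a list of cumulative (rate, distortion) pairs whose first entry is the untruncated starting point with rate 0, rates are strictly increasing, at least one block has a point on its convex hull, and the "set λ larger or smaller" step is realized as a bisection. All function names are hypothetical.

```python
def convex_hull(points):
    """Return (indices, slopes) of truncation points on the lower convex hull.

    points[0] is the starting point (rate 0); slopes are strictly decreasing.
    """
    hull, slopes = [0], []
    for n in range(1, len(points)):
        rate, dist = points[n]
        while True:
            rate0, dist0 = points[hull[-1]]
            if dist >= dist0:
                break  # no distortion reduction: discard this point
            slope = (dist0 - dist) / (rate - rate0)
            if slopes and slope >= slopes[-1]:
                hull.pop()     # previous point lies above the hull
                slopes.pop()
                continue
            hull.append(n)
            slopes.append(slope)
            break
    return hull[1:], slopes


def truncate(blocks, rate_budget, iterations=60):
    """Choose one truncation point per block so the total rate fits the budget."""
    hulls = [convex_hull(points) for points in blocks]

    def select(lam):
        choice, total = [], 0
        for points, (indices, slopes) in zip(blocks, hulls):
            n = 0
            for index, slope in zip(indices, slopes):
                if slope >= lam:   # package points with slope >= lambda
                    n = index
            choice.append(n)
            total += points[n][0]
        return choice, total

    lo = 0.0
    hi = max(s for _, slopes in hulls for s in slopes) + 1.0  # selects nothing
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if select(mid)[1] > rate_budget:
            lo = mid   # rate too high: lambda must be larger
        else:
            hi = mid   # rate fits: try a smaller lambda
    return select(hi)


if __name__ == "__main__":
    blocks = [[(0, 100), (10, 40), (20, 30)], [(0, 50), (10, 30), (30, 20)]]
    print(truncate(blocks, rate_budget=20))
```

For the two example blocks the hull slopes are 6 and 1 for the first block and 2 and 0.5 for the second, so a budget of 20 keeps the first truncation point of each block.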
Chapter 4
Human Visual System
4.1 Human Vision
Figure 4-1 Cross-section of human eye [19].
Figure 4-1 shows the cross-section of a human eye [19]. Through the optics of the eye, the visual input is projected onto the retina, the neural tissue at the back of the eye that contains the photoreceptor mosaic [20]. The photoreceptors sample the image and convert it to signals that can be interpreted by the visual cortex of the brain. Photoreceptors contain rhodopsin, which is very sensitive to light. When rhodopsin receives the energy of light, it decomposes into vitamin A, protein, and an impulse signal. The impulse signal is processed by the bipolar cells and ganglion cells and then passed through the optic nerve into the brain, as shown in Figure 4-2 [21]. The vitamin A, protein, and nutrients are then recombined into rhodopsin by the action of enzymes, so that the rhodopsin can be used again.
Figure 4-2 The process of the visual input signal [21].
There are two types of photoreceptors, rods and cones. Rods are relatively long and thin. They are used for vision over the lower several orders of magnitude of illumination, i.e., under scotopic conditions. Cones are relatively shorter and thicker, and they are less sensitive than rods. They are used for vision over the higher 5 to 6 orders of magnitude of illumination, i.e., under photopic conditions. The cones are concentrated in the fovea, the region of highest visual acuity, which covers approximately two degrees of visual angle on the retina. The cones are also responsible for color vision.
There are three types of cones: L-cones, M-cones, and S-cones. L-cones are also called red cones and are sensitive to long wavelengths. M-cones are also called green cones and are sensitive to medium wavelengths. S-cones are also called blue cones and are sensitive to short wavelengths. Figure 4-3 shows the relative sensitivity of each photoreceptor [21].
Figure 4-3 Relative sensitivity of each photoreceptor [21].
4.2 Color Representation
Colors do not exist in the natural world. To human perception, colors are related to the wavelength of light. As described above, the retina of the human eye contains 3 different color receptors: red, green, and blue. The different cones have different sensitivity curves for light of different frequencies. Thus, the combination of these responses produces the perception of different colors. Due to this structure of the human eye, any color that appears to the human eye can be specified by a weighted combination of three so-called primary colors, RGB. For the purpose of standardization, the CIE (Commission Internationale de l'Éclairage, the International Commission on Illumination) chose the following specific wavelength values for the three primary colors: blue (B) = 435.8nm, green (G) = 546.1nm, and red (R) = 700.0nm.
Trichromatic theory says that any color S can be represented as a combination of these 3 primaries R, G, and B:
S = Rs·R + Gs·G + Bs·B. (21)
Any 3 independent colors can be selected as primaries as long as one is not a mix of the other two. Different sets of primaries are related by linear transformations.
There are several color models, such as CIE RGB, CIE XYZ, CIE YUV, and CIE L*a*b*. We introduce CIE RGB and CIE XYZ here.
1. CIE RGB:
1) R, G, B = three spectral primary sources.
2) Reference white: R = G = B = 1.
3) There exist negative tristimulus values.
4) The color is fully dependent on the wavelength. The three fixed RGB components acting alone cannot generate all spectrum colors (pure colors). This is an unresolved defect for color representation.
2. CIE XYZ
1) All color matching functions are positive.
2) Y = luminance
3) Reference white: X = Y = Z = 1.
4) This model is modified from RGB model such that all spectral tristimulus values