Bitstream Truncation and Scalability - 3D Subband Video Coding Using Barbell Lifting

Chapter 3 3D Subband Video Coding Using Barbell Lifting

3.5 Bitstream Truncation and Scalability

After 3D ESCOT is applied, an embedded bitstream is generated for each subband. To satisfy the requested bit rate, the bitstreams of the different subbands are truncated and multiplexed into a final bitstream, which is then transmitted to the receiver. The rate control problem is how to truncate and multiplex these bitstreams so that the final bitstream is optimal in the rate-distortion (R-D) sense.

The basic problem of rate control is, given a target bit rate R0, how to construct a bitstream that satisfies the bit rate constraint while minimizing the overall distortion.

Shoham and Gersho proposed an algorithm based on Lagrange's theorem that solves this problem [17].

Taubman extended this algorithm to the rate control of EBCOT [9].

EBCOT partitions the subbands representing the image into a collection of relatively small code-blocks, $B_i$, whose embedded bitstreams may be truncated to rates $R_i^n$. The contribution from $B_i$ to the distortion in the reconstructed image is denoted $D_i^n$ for each truncation point $n$. Assuming that the distortions of the code-blocks are independent and additive, the overall distortion $D$ of the reconstructed image can be represented by

$$D = \sum_i D_i^{n_i},$$

where $n_i$ denotes the truncation point selected for code-block $B_i$. $D_i^n$ is calculated by

$$D_i^n = w_{b_i}^2 \sum_{k \in B_i} \left( \hat{s}_i^n[k] - s_i[k] \right)^2,$$

where $s_i[k]$ is the 2D sequence of subband coefficients in code-block $B_i$, $\hat{s}_i^n[k]$ is the quantized representation of these coefficients associated with truncation point $n$, and $w_{b_i}$ is the L2-norm of the wavelet basis functions for the subband $b_i$ to which code-block $B_i$ belongs.
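The distortion of one code-block at a given truncation point is a squared error scaled by the squared basis-function norm. The following is a minimal sketch of that computation; the function names and the flattened coefficient lists are illustrative assumptions, not part of the codec.

```python
def codeblock_distortion(coeffs, quantized, basis_norm):
    """Weighted squared error of one code-block at one truncation point.

    coeffs     -- original subband coefficients (2D block flattened to a list)
    quantized  -- their quantized representation at the truncation point
    basis_norm -- L2-norm of the wavelet basis functions of the subband
    """
    squared_error = sum((q - s) ** 2 for s, q in zip(coeffs, quantized))
    return basis_norm ** 2 * squared_error


def total_distortion(per_block_distortions):
    """Overall distortion under the independence and additivity assumption."""
    return sum(per_block_distortions)


# Hypothetical 4-coefficient code-block.
s = [4.0, -3.0, 7.5, 0.5]
s_hat = [4.0, -2.0, 8.0, 0.0]
d = codeblock_distortion(s, s_hat, basis_norm=2.0)
```

Because the overall distortion is a plain sum, each code-block can be analyzed separately, which is what makes the per-block truncation search tractable.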

The R-D optimization algorithm should select a truncation point $n_i$ for each code-block $B_i$ such that the sum of the rates $R_i^{n_i}$ or of the distortions $D_i^{n_i}$ meets the constraint imposed by $R_{max}$ or $D_{max}$, respectively.

Several R-D optimization algorithms have been proposed to solve this problem [18]. Notably, all of these algorithms apply to convex R-D curves, that is, curves whose slopes are strictly decreasing.

Some R-D optimization algorithms are based on Lagrange's theorem, such as the Lagrange multiplier method used in EBCOT [9]. Lagrange's theorem states that a sum of continuous functions subject to a boundary condition is optimized at the points with equal slopes. Any set of truncation points $\{n_i^\lambda\}$ that minimizes

$$D(\lambda) + \lambda R(\lambda) = \sum_i \left( D_i^{n_i^\lambda} + \lambda R_i^{n_i^\lambda} \right)$$

for some λ is optimal in the sense that the distortion cannot be reduced without increasing the overall rate, and vice versa. If we can find a value of λ such that the truncation points minimizing $D(\lambda) + \lambda R(\lambda)$ yield $R(\lambda) = R_{max}$, then this set of truncation points must be an optimal solution of the R-D optimization problem.

Because the number of truncation points in a code-block is finite, we generally cannot find a value of λ such that R(λ) exactly equals R_max. However, since the code-blocks in EBCOT are very small, the total number of truncation points is very large, and it is sufficient to find the smallest value of λ such that R(λ)≤R_max.
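For a fixed λ, the minimization of D(λ) + λR(λ) separates into an independent choice per code-block. The sketch below illustrates this with an exhaustive search over each block's truncation points; the data layout (a list of `(rate, distortion)` pairs per block, with index 0 meaning that nothing is sent) and the numbers are hypothetical.

```python
def select_truncation_points(blocks, lam):
    """For each code-block pick the index n that minimizes D + lam * R.

    blocks -- list of code-blocks; each is a list of (rate, distortion)
              pairs, one per candidate truncation point
    lam    -- Lagrange multiplier (the R-D slope threshold)
    """
    chosen = []
    for points in blocks:
        best = min(
            range(len(points)),
            key=lambda n: points[n][1] + lam * points[n][0],
        )
        chosen.append(best)
    return chosen


# Two hypothetical code-blocks with cumulative (rate, distortion) pairs.
blocks = [
    [(0, 100), (10, 40), (20, 25), (30, 20)],
    [(0, 50), (8, 30), (16, 26)],
]
low_lambda = select_truncation_points(blocks, 1.0)   # keeps more bits
high_lambda = select_truncation_points(blocks, 4.0)  # keeps fewer bits
```

A larger λ penalizes rate more heavily, so the selected truncation points move toward the start of each bitstream and the total rate drops.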

To find the optimal set of truncation points n_i^λ for a given λ, we need to know the rate-distortion (R-D) pair of each truncation point. λ can be viewed as the R-D slope of the optimal set of truncation points. We can find the R-D slope of each truncation point by calculating the bitstream length and the distortion at that point, and thus construct an operational R-D curve for each code-block.

1) Let n be the number of truncation points in the code-block, and let 0 ≤ j ≤ n.

2) For j = 0, 1, 2, …, n, the index j = 0 denotes the beginning of the code-block; it is not a truncation point.

The R-D slope of truncation point j in code block i is

$$\frac{D_i^{j-1} - D_i^{j}}{R_i^{j} - R_i^{j-1}},$$

where $R_i^j$ is the accumulative bit length of truncation point j in code block i and $D_i^j$ is the accumulative distortion of truncation point j in code block i.

Generally speaking, $R_i^0 = 0$ and $D_i^0$ is the distortion when the coefficients of the code-block are all 0. We just need to package the truncation points whose R-D slope is greater than or equal to λ.
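As a small illustration of the slope definition, the sketch below computes the R-D slope of every truncation point from cumulative rates and distortions, taking the slope as the distortion decrease divided by the rate increase between consecutive points, and then counts how many leading truncation points would be packaged for a given λ. The function names and sample values are assumptions made for this example.

```python
def rd_slopes(rates, distortions):
    """R-D slope of truncation points 1..n.

    rates[j] and distortions[j] are cumulative values; index 0 is the
    beginning of the code-block (rate 0, all coefficients zero).
    """
    return [
        (distortions[j - 1] - distortions[j]) / (rates[j] - rates[j - 1])
        for j in range(1, len(rates))
    ]


def count_included(slopes, lam):
    """Number of leading truncation points whose slope is >= lam.

    Assumes the slopes are already strictly decreasing.
    """
    count = 0
    for slope in slopes:
        if slope < lam:
            break
        count += 1
    return count


slopes = rd_slopes([0, 10, 20, 30], [100, 40, 25, 20])
kept = count_included(slopes, 1.0)
```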

In 3D ESCOT, the end of each fractional bit-plane is a candidate truncation point.

The R-D slope of each truncation point can be obtained by calculating the bitstream length and the distortion [16]. We can then construct an operational R-D curve for each subband and find its convex hull. All valid truncation points must lie on this convex hull so that R-D optimality is guaranteed at each truncation point. A truncation point that breaks the strictly decreasing order of R-D slopes (for example, because it has larger distortion than the previous truncation point) is discarded. To find the best threshold value λ, we first set λ to an arbitrary value. Every truncation point whose R-D slope is greater than or equal to λ is packaged.
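One common way to keep only the truncation points on the convex hull is a single pass with a stack, in which a previously accepted point is dropped whenever a later point would give an equal or steeper slope. This is a sketch of that general technique, not necessarily the procedure used in 3D ESCOT; the names and sample data are hypothetical, and rates are assumed to be strictly increasing.

```python
def hull_truncation_points(rates, dists):
    """Return (indices, slopes) of the truncation points on the convex hull.

    rates, dists -- cumulative rate and distortion, index 0 = nothing sent
    The returned slopes are strictly decreasing.
    """
    hull, slopes = [0], [float("inf")]
    for j in range(1, len(rates)):
        while True:
            k = hull[-1]
            gain = dists[k] - dists[j]
            if gain <= 0:
                break  # point j does not lower the distortion: discard it
            slope = gain / (rates[j] - rates[k])
            if slope < slopes[-1]:
                hull.append(j)
                slopes.append(slope)
                break
            # The previous hull point is dominated by point j: remove it
            # and compare point j against the one before.
            hull.pop()
            slopes.pop()
    return hull[1:], slopes[1:]


indices, slopes = hull_truncation_points(
    [0, 10, 20, 30, 40], [100, 70, 30, 28, 10]
)
```

In this example points 1 and 3 are removed because stopping there is never better than stopping at a neighbouring hull point for any λ.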

After all of the truncation points have been processed, we obtain the final bitstream. If the bit rate of this bitstream is larger than the requested bit rate, λ is increased and the final bitstream is constructed again. Otherwise, λ is decreased.

We repeat this procedure until we find the final bitstream whose bit rate is smaller than or equal to the requested bit rate.
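The search for λ described above can be sketched as a bisection: raise λ when the rate is too high, lower it otherwise. Bisection is one possible choice made for this example, since the text only says that λ is adjusted up or down. The input layout (per block, a list of `(cumulative_rate, slope)` pairs for hull points with strictly decreasing slopes) is likewise an assumption.

```python
def rate_for_threshold(blocks, lam):
    """Total rate when every point with slope >= lam is packaged."""
    total = 0
    for points in blocks:
        rate = 0
        for cumulative_rate, slope in points:
            if slope < lam:
                break
            rate = cumulative_rate
        total += rate
    return total


def find_threshold(blocks, target_rate, iterations=50):
    """Approximate the smallest lam whose total rate is <= target_rate."""
    lo = 0.0
    hi = max(slope for points in blocks for _, slope in points) + 1.0
    if rate_for_threshold(blocks, lo) <= target_rate:
        return lo  # everything fits
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if rate_for_threshold(blocks, mid) > target_rate:
            lo = mid  # rate too high: raise the threshold
        else:
            hi = mid  # rate fits: try a smaller threshold
    return hi


# Hypothetical hull points of two blocks: (cumulative rate, R-D slope).
blocks = [
    [(10, 6.0), (20, 1.5), (30, 0.5)],
    [(8, 2.5), (16, 0.5)],
]
lam = find_threshold(blocks, target_rate=30)
final_rate = rate_for_threshold(blocks, lam)
```

Because the set of truncation points is finite, the achieved rate (28 here) can fall short of the target (30); the gap is the price of not being able to cut between truncation points.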

Chapter 4 Human Visual System

4.1 Human Vision

Figure 4-1 Cross-section of human eye [19].

Figure 4-1 shows the cross-section of a human eye [19]. Through the optics of the eye, the visual input is projected onto the retina, the neural tissue at the back of the eye that contains the photoreceptor mosaic [20]. The photoreceptors sample the image and convert it into signals that can be interpreted by the visual cortex of the brain. Photoreceptors contain rhodopsin, which is very sensitive to light. When rhodopsin absorbs light energy, it decomposes into vitamin A and protein and produces an impulse signal. The impulse signal is processed by the bipolar cells and ganglion cells and then passed through the optic nerve to the brain, as shown in Figure 4-2 [21]. The vitamin A, the protein, and nutrients are then recombined into rhodopsin by the action of enzymes, so that the rhodopsin can be used again.

Figure 4-2 The process of the visual input signal [21].

There are two types of photoreceptors, rods and cones. Rods are relatively long and thin. They operate over the lower several orders of magnitude of illumination, i.e., under scotopic conditions. Cones are relatively short and thick, and they are less sensitive than rods. They operate over the higher 5 to 6 orders of magnitude of illumination, i.e., under photopic conditions. The cones are concentrated in the fovea, the region of highest visual acuity, which covers approximately two degrees of visual angle on the retina. The cones are also responsible for color vision.

There are three types of cones: L-cones, M-cones, and S-cones. L-cones, also called red cones, are sensitive to long wavelengths. M-cones, also called green cones, are sensitive to medium wavelengths. S-cones, also called blue cones, are sensitive to short wavelengths. Figure 4-3 shows the relative sensitivity of each photoreceptor [21].

Figure 4-3 Relative sensitivity of each photoreceptor [21].

4.2 Color Representation

Colors do not exist in the natural world as such; to human perception, color is related to the wavelength of light. As described above, the retina of the human eye contains three different color receptors: red, green, and blue. The different cones have different sensitivity curves to light of different frequencies, so the combination of their responses produces the perception of different colors. Because of this structure of the human eye, any color perceived by the eye can be specified by a weighted combination of the three so-called primary colors, R, G, and B. For the purpose of standardization, the CIE (Commission Internationale de l'Éclairage, International Commission on Illumination) assigned the following specific wavelengths to the three primary colors: blue (B) = 435.8 nm, green (G) = 546.1 nm, and red (R) = 700.0 nm.

Trichromatic theory states that any color S can be represented as a combination of the three primaries R, G, and B:

$$S = R_s \cdot R + G_s \cdot G + B_s \cdot B.$$ (21)

Any three independent colors can be selected as primaries, as long as none of them is a mixture of the other two. Different sets of primaries are related by linear transformations.

There are several color models, such as CIE RGB, CIE XYZ, CIE YUV, and CIE L*a*b*. We introduce CIE RGB and CIE XYZ here.

1. CIE RGB:

1) R, G, B = three spectral primary sources.

2) Reference white: R = G = B = 1.

3) There exist negative tristimulus values.

4) The color is fully dependent on the wavelength. The three fixed RGB components acting alone cannot generate all spectrum colors (pure colors). This is an unresolved defect for color representation.

2. CIE XYZ:

1) All color matching functions are positive.

2) Y = luminance

3) Reference white: X = Y = Z = 1.

4) This model is modified from RGB model such that all spectral tristimulus values are positive.

Generally speaking, each color space can be transformed into another. Equation (22) is the transformation from CIE RGB to CIE XYZ, and equation (23) is the transformation from CIE XYZ to CIE RGB.

(23)

4.3 Contrast Sensitivity

Human perception is more sensitive to luminance contrast than to the absolute value of the luminance. However, due to the complexity of natural images, no single definition of contrast is suitable for all conditions. Generally speaking, three definitions of contrast are widely used.

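In the case of a periodic pattern of symmetrical deviations ranging from $L_{min}$ to $L_{max}$, Michelson contrast is generally used:

$$C_M = \frac{L_{max} - L_{min}}{L_{max} + L_{min}}.$$ (24)

In the case of a stimulus consisting of a luminance increment or decrement $\Delta L$ on a uniform background luminance $L$, Weber contrast is often used:

$$C_W = \frac{\Delta L}{L}.$$ (25)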
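The two global contrast measures are easy to compute directly. The sketch below uses the standard definitions of Michelson and Weber contrast; the function names and luminance values are made up for this example.

```python
def michelson_contrast(l_min, l_max):
    """Michelson contrast of a periodic pattern between l_min and l_max."""
    return (l_max - l_min) / (l_max + l_min)


def weber_contrast(delta_l, background):
    """Weber contrast of an increment delta_l on a uniform background."""
    return delta_l / background


grating = michelson_contrast(20.0, 80.0)
patch = weber_contrast(5.0, 50.0)

# A single bright pixel fixes the global extreme: an image that is flat at
# 100 everywhere except for one pixel at 255 gets a high Michelson contrast.
one_outlier = michelson_contrast(100.0, 255.0)
```

The last value shows how one extreme pixel sets the Michelson contrast of an otherwise flat image, which is the weakness of global measures for complex images.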

These two definitions of contrast are not appropriate for measuring the contrast of complex images. If there are some very bright or very dark points in the image, these points determine the contrast of the whole image. Furthermore, human contrast perception varies with the local average luminance. Peli proposed a local band-limited contrast measure to solve these problems [22]:

$$C_i(x, y) = \frac{BP_i(x, y)}{LP_i(x, y)},$$ (26)

where $BP_i(x, y)$ is the band-pass filtered image of band i at location (x, y) and $LP_i(x, y)$ contains the energy below band i at location (x, y), i.e., the total response at this location of all the bands below band i. Modifications of this contrast definition have been used in a number of vision models and are in good agreement with psychophysical experiments on Gabor patches [38].

We can describe contrast sensitivity as a function of spatial frequency. This function is called the contrast sensitivity function (CSF). Contrast sensitivity is defined as the inverse of the contrast threshold, which is the minimum contrast necessary for an observer to detect the target.

Mannos and Sakrison were the first to apply the HVS to image coding. They modeled the HVS as a nonlinear point transform followed by a modulation transfer function (MTF) of the form [23]:
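The form of the Mannos and Sakrison MTF most commonly quoted in the literature is A(f) = 2.6 (0.0192 + 0.114 f) exp(-(0.114 f)^1.1), with f in cycles/degree. The sketch below evaluates that form as an assumption about the model meant here, to show its band-pass shape; the function name and the sampling grid are arbitrary.

```python
import math


def mannos_sakrison_mtf(f):
    """Commonly quoted Mannos-Sakrison MTF; f in cycles/degree (assumed form)."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))


# Sample the curve from 0 to 60 cycles/degree in steps of 0.1.
freqs = [i / 10 for i in range(0, 601)]
peak_frequency = max(freqs, key=mannos_sakrison_mtf)
```

With these constants the response peaks at roughly 8 cycles/degree and falls off toward both lower and higher frequencies.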

Nill proposed a new type of MTF that can be used for DCT [24]:

(28)

Ngan et al. proposed another new MTF [25]:

(29)

Besides its dependence on spatial frequency, contrast sensitivity also depends on temporal frequency. Thus we can describe contrast sensitivity as a function of both spatial frequency and temporal frequency. Kelly proposed a contrast sensitivity function (CSF) of this kind, which is widely used [26]:

(30)

From this CSF, we can see that human vision is less sensitive at low and high spatial (temporal) frequencies and more sensitive at medium spatial (temporal) frequencies.

4.4 Masking Effect

If a stimulus that is visible by itself cannot be detected due to the presence of another stimulus, the effect is called masking. The opposite effect, facilitation, occurs when a stimulus that is not visible by itself can be detected due to the presence of another stimulus. Masking explains why similar coding artifacts are disturbing in certain regions of an image while they are hardly noticeable elsewhere. There are two types of masking: spatial masking and temporal masking.

Spatial masking is due to the non-uniformity of the background luminance. Because of this masking effect, noise is more visible in flat or textureless areas and less visible in regions with edges and textures, so coding errors may be less visible around sharp edges.

Temporal masking is due to temporal discontinuities in intensity, such as scene changes. The error visibility threshold increases with increasing interframe luminance difference. In addition, if moving objects are not tracked by the eyes, the loss of perceived spatial resolution is substantial.

4.5 Just-Noticeable Distortion

Just-noticeable distortion (JND) is defined as the visibility threshold of distortion; reconstruction errors below this threshold are imperceptible [27].

Sometimes the inverse of the sensitivity is used as the threshold. Human eyes are more sensitive to luminance contrast than to the absolute luminance value, and their ability to detect the difference between an object and its background depends on the average background luminance. Weber's law states that, when a test stimulus is just noticeable against the surrounding luminance, the ratio of the just-noticeable luminance difference to the stimulus luminance is almost constant. Noise in dark areas is less perceptible than noise in regions of high luminance.

Because of the JND, signal components below this threshold can be discarded when forming the encoded bitstream, which reduces the amount of data. Conversely, special signals such as watermarks can be placed in the bitstream below this threshold so that they are not detectable.

The JND profile of a still image is a function of local signal properties, such as background luminance, activity of luminance changes and dominant spatial frequency.

JND is defined below [28]:

where H and W denote the height and width of the still image, f1 represents the error visibility threshold due to texture masking, and f2 represents the error visibility threshold due to the average background luminance. mg(x, y) denotes the maximal weighted average of luminance gradients around the pixel at location (x, y), and bg(x, y) is the average background luminance around the pixel at location (x, y).

mg(x, y) of the pixel at (x, y) is determined by calculating the weighted average of luminance changes around the pixel in four directions [29], as follows:

(32)

Figure 4-4 Operations for calculating the weighted average of luminance changes in four directions.

The value of f1(mg(x, y)) is calculated as shown below:

where the value of β, obtained from subjective tests, is 2/17.

bg(x, y) of the pixel at (x, y) is calculated with a weighted low-pass operator B(i, j), i, j = 1, 2, 3, 4, 5, shown in Figure 4-5 [29]. bg(x, y) is calculated by:

Figure 4-5 The operator for calculating the average background luminance.
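The following sketch shows how a 5x5 weighted low-pass operator can produce the average background luminance of a pixel. The weights in `B` are placeholders chosen only so that the surrounding pixels are averaged and the center pixel is excluded; they are an assumption, not the values of Figure 4-5, and the normalization by the sum of the weights is likewise assumed.

```python
# Hypothetical 5x5 weights (center excluded); NOT taken from Figure 4-5.
B = [
    [1, 1, 1, 1, 1],
    [1, 2, 2, 2, 1],
    [1, 2, 0, 2, 1],
    [1, 2, 2, 2, 1],
    [1, 1, 1, 1, 1],
]


def background_luminance(img, x, y):
    """Weighted average luminance around pixel (x, y).

    img is a list of rows of luminance values; (x, y) must lie at least
    two pixels away from the image border.
    """
    total = 0
    for i in range(5):
        for j in range(5):
            total += img[x - 2 + i][y - 2 + j] * B[i][j]
    return total / sum(map(sum, B))


flat = [[100] * 7 for _ in range(7)]
bg_flat = background_luminance(flat, 3, 3)

flat[3][3] = 255  # a bright center pixel does not change its own background
bg_spike = background_luminance(flat, 3, 3)
```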

The relationship between the visibility threshold and the average background luminance bg(x, y) is shown in Figure 4-6 [28].

Figure 4-6 Error visibility thresholds due to background luminance in the spatial domain [28].

Sometimes we want the JND in the spatial-temporal domain. The computation can be simplified by multiplying the spatial JND by a temporal scale factor, as shown below [28]:

$$JND_{S\text{-}T}(x, y, n) = f_3(ild(x, y, n)) \cdot JND_S(x, y, n),$$

where $JND_S$ is the JND in the spatial domain, $f_3$ is the temporal scale factor, and ild(x, y, n) is the average interframe luminance difference between the nth and (n-1)th frames at pixel (x, y), as shown below:

The empirical results of f3 for all possible interframe luminance difference are shown in Figure 4-7 [28].

Figure 4-7 Error visibility threshold in the spatial-temporal domain, which is modeled as the JND value in the spatial domain scaled by a factor that depends on the interframe luminance difference [28].

It can be seen that the error visibility threshold increases with increasing interframe luminance difference. This agrees with the temporal masking effect: the sensitivity of human vision decreases after a scene change or a large temporal luminance difference.

Chapter 5 Rate Control Algorithm Based on HVS

5.1 Transform R-D Slope Representation

The R-D slope of truncation point j in code block i is usually represented by the value of

$$\frac{D_i^{j-1} - D_i^{j}}{R_i^{j} - R_i^{j-1}},$$

where $R_i^j$ is the accumulative bit length of truncation point j in code block i and $D_i^j$ is the accumulative distortion of truncation point j in code block i. Generally speaking, this value is very large, and the difference between its values at different truncation points is very large too.

We can transform the R-D slope of each truncation point into another representation while keeping the relative order the same. We transform the slope value into an exponential representation and use the exponent as the new R-D slope value of each truncation point, as shown in equation (38).

(38)

The new R-D slope of each truncation point is smaller, and the relative differences between the new slopes are smaller too. Most importantly, the relative order of the new R-D slopes is the same as that of the original R-D slopes. We use this new value as the R-D slope of each truncation point and perform rate control on it.
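One way to realize such an order-preserving compression is to take the integer part of a logarithm of the slope. The base (2) and the use of a floor are assumptions made for this sketch; the text only states that the exponent of an exponential representation is used.

```python
import math


def compress_slope(slope):
    """Map a positive R-D slope to the exponent of its base-2 representation.

    Base 2 and the floor operation are illustrative assumptions.
    """
    return math.floor(math.log2(slope))


original = [5000.0, 1200.0, 40.0, 3.0]
compressed = [compress_slope(s) for s in original]

# The mapping is monotonic, so a decreasing sequence stays non-increasing.
order_kept = all(a >= b for a, b in zip(compressed, compressed[1:]))
```

Under this particular choice, slopes that are close to each other (for example 40 and 35) can map to the same integer, so the order is preserved only in the non-strict sense.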

5.2 Weighting Factor

Human vision has different sensitivity at different spatial frequencies, so we need higher fidelity for the low spatial frequency data, to which vision is more sensitive, and can accept lower fidelity for the high spatial frequency data, to which it is less sensitive. For this reason, we convert the mean-squared error (MSE) distortion into a “visual distortion” when doing rate control. In other words, we multiply the R-D slope of each truncation point by a weighting factor so that the weighted R-D slope is proportional to its importance to human vision. The goal is that rate control based on the new R-D slope values is likely to achieve higher visual quality.

Here, we present a weighting factor only for the Y component of each frame. The discrete wavelet transform decomposes a frame into different spatial subbands. Every subband has its own minimum visibility threshold and thus its own relative visual importance. For this reason, the weighting factor w is decomposed into two factors, the intra-subband weighting factor $w_1$ and the inter-subband weighting factor $w_2$:

$$w = w_1 \cdot w_2.$$ (39)

5.2.1 Intra-Subband Weighting Factor

The intra-subband weighting factor w1 determines the visibility of a truncation point within its own spatial subband; it does not consider the visibility of truncation points in the other spatial subbands. To find the visibility of the error at a truncation point, we need to know the just-noticeable distortion (JND) of that subband.

Watson gives the minimum luminance threshold of each spatial subband in the absence of masking [30]. This minimum threshold applies only to the Y component of the image. The minimum luminance threshold y of each subband is given by [30]:

$$\log y = \log a + k \left( \log f - \log(g_\theta f_0) \right)^2,$$ (40)

where the value of a is 0.495, k is 0.466, and $f_0$ is 0.401. The value of $g_\theta$ is 1.501, 1, and 0.534 for the LL, LH/HL, and HH subbands, respectively. f is the spatial frequency, and its value depends on the viewing conditions. Under the computer monitor viewing condition, the display resolution r is 16 pixels/degree.

The size of our test sequence is 288 pixels in height and 352 pixels in width. The viewing distance is about 3.5 times the height, i.e., about 1000 pixels. The visual angle subtended by the height is 2*tan^-1(288/(1000*2)) = 16.39 degrees, so the display resolution in height is 288/16.39 ≈ 17.6 pixels/degree. The visual angle subtended by the width is 2*tan^-1(352/(1000*2)) = 19.96 degrees, so the display resolution in width is 352/19.96 ≈ 17.6 pixels/degree. We therefore take the display resolution r to be about 16 pixels/degree.
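The viewing-geometry arithmetic can be checked with a few lines of code. This is a sketch using the numbers from the paragraph above (288 x 352 pixels viewed from 1000 pixels away); the function names are made up for the example.

```python
import math


def visual_angle_deg(size_px, distance_px):
    """Visual angle (degrees) subtended by size_px at distance_px."""
    return 2 * math.degrees(math.atan(size_px / (2 * distance_px)))


def display_resolution(size_px, distance_px):
    """Display resolution in pixels per degree of visual angle."""
    return size_px / visual_angle_deg(size_px, distance_px)


angle_h = visual_angle_deg(288, 1000)   # about 16.39 degrees
angle_w = visual_angle_deg(352, 1000)   # about 19.96 degrees
res_h = display_resolution(288, 1000)   # about 17.57 pixels/degree
res_w = display_resolution(352, 1000)   # about 17.63 pixels/degree
```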

The spatial frequency of DWT level λ is $f = r \cdot 2^{-\lambda}$ cycles/degree. Figure 5-1 shows a frame after three levels of DWT and the spatial frequency of each subband. It also shows the minimum threshold y of each subband, calculated by equation (40) without masking, when the maximum spatial frequency is 16.0 cycles/degree.

We summarize the steps for calculating the minimum threshold y as follows.

1) Find the spatial frequency of each level λ by $f = r \cdot 2^{-\lambda}$.

2) Find the value of $g_\theta$ for the corresponding orientation.

3) Use equation (40) to calculate the minimum threshold y of each subband.
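The three steps can be sketched as follows, reading equation (40) as log y = log a + k (log f - log(g_θ f0))^2 with the constants quoted from [30]. Base-10 logarithms and the default r = 16 pixels/degree are assumptions of this sketch, and the function names are illustrative.

```python
import math

A, K, F0 = 0.495, 0.466, 0.401
G_THETA = {"LL": 1.501, "LH": 1.0, "HL": 1.0, "HH": 0.534}


def spatial_frequency(level, r=16.0):
    """Step 1: spatial frequency (cycles/degree) of DWT level `level`."""
    return r * 2.0 ** (-level)


def min_threshold(level, orientation, r=16.0):
    """Steps 2 and 3: minimum threshold y of a subband (base-10 logs assumed)."""
    f = spatial_frequency(level, r)
    g = G_THETA[orientation]
    log_y = math.log10(A) + K * (math.log10(f) - math.log10(g * F0)) ** 2
    return 10 ** log_y


# Thresholds of all subbands of a three-level decomposition.
thresholds = {
    (level, band): min_threshold(level, band)
    for level in (1, 2, 3)
    for band in ("LH", "HL", "HH")
}
thresholds[(3, "LL")] = min_threshold(3, "LL")
```

With these constants the threshold grows with spatial frequency, and at a given level the HH subband has a higher threshold than the LH/HL subbands, so errors there are harder to see.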

Figure 5-1 The level, orientation, spatial frequency, and minimum threshold of each DWT subband.

After we get the minimum threshold of each subband, we need to consider the contrast masking effect of each subband. Peli proposed a definition of contrast that can be used in complex images [22], as shown in equation (26). The problem now is the contrast sensitivity for each subband. If we assume the local luminance to be constant across the whole image and equal to the average value of the coefficients in the lowest spatial subband [31], we can calculate the contrast at each location (i, j) in
