
Chapter 4 Human Visual System

4.5 Just-Noticeable Distortion

Just-noticeable distortion (JND) is defined as the visibility threshold of distortion: reconstruction errors below this threshold are imperceptible [27].

Sometimes the inverse of the sensitivity is used as the threshold. Human eyes are more sensitive to luminance contrast than to absolute luminance. The ability of the human eye to detect the difference between an object and its background depends on the average background luminance. Weber's law states that, when a test stimulus is just noticeable against the surrounding luminance, the ratio of the just-noticeable luminance difference to the stimulus luminance is almost constant. Noise in dark areas is less perceptible than noise in regions of high luminance.

Because of JND, signal components below this threshold can be discarded when the bitstream is encoded, which reduces the amount of data. Conversely, a special signal such as a watermark can be embedded below the threshold so that it is not detectable.

The JND profile of a still image is a function of local signal properties, such as background luminance, activity of luminance changes and dominant spatial frequency.

The spatial JND, JND_S, is defined as follows [28]:

$$JND_S(x, y) = \max\{ f_1(mg(x, y)),\ f_2(bg(x, y)) \}, \quad 0 \le x < H,\ 0 \le y < W \tag{31}$$

where H and W denote the height and width of the still image. f1 represents the error visibility threshold due to texture masking and f2 represents the error visibility threshold due to average background luminance. mg(x, y) denotes the maximal weighted average of luminance gradients around the pixel at location (x, y) and bg(x, y) is the average background luminance around the pixel at location (x, y).

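mg(x, y) is determined by calculating the weighted average of luminance changes around the pixel at (x, y) in four directions [29], as follows:

$$mg(x, y) = \max_{k = 1, 2, 3, 4} \{ |grad_k(x, y)| \} \tag{32}$$

where grad_k(x, y) is the weighted average of luminance changes in direction k, obtained with the operators shown in Figure 4-4.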

Figure 4-4 Operations for calculating the weighted average of luminance changes in four directions.

The value of f1(mg(x, y)) is calculated as shown below:

where the value of β is obtained from a subjective test and is 2/17.

bg(x, y) of the pixel at (x, y) is calculated with a weighted low-pass operator B(i, j), i, j = 1, 2, 3, 4, 5, shown in Figure 4-5 [29]. bg(x, y) is calculated by:

Figure 4-5 The operator for calculating the average background luminance.
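As an illustration, the background luminance can be sketched as a weighted 5x5 average. The values of B below are the ones commonly given for this JND model and are an assumption here, not taken from Figure 4-5; the normalization by the sum of the weights, the clamping at the image border, and the function name are likewise assumptions.

```python
# ASSUMPTION: operator values commonly quoted for this JND model;
# they are not read from Figure 4-5 of this text.
B = [
    [1, 1, 1, 1, 1],
    [1, 2, 2, 2, 1],
    [1, 2, 0, 2, 1],
    [1, 2, 2, 2, 1],
    [1, 1, 1, 1, 1],
]
B_SUM = sum(sum(row) for row in B)  # 32 for the operator above


def background_luminance(image, x, y):
    """Weighted average luminance around pixel (x, y).

    `image` is a list of rows, indexed image[x][y] with 0 <= x < H
    and 0 <= y < W. Coordinates outside the image are clamped to the
    border (an assumption made for this sketch).
    """
    height, width = len(image), len(image[0])
    total = 0
    for i in range(5):
        for j in range(5):
            xx = min(max(x - 2 + i, 0), height - 1)
            yy = min(max(y - 2 + j, 0), width - 1)
            total += image[xx][yy] * B[i][j]
    return total / B_SUM


flat = [[100] * 8 for _ in range(8)]
spike = [[50] * 9 for _ in range(9)]
spike[4][4] = 255  # the centre weight is 0, so this pixel is ignored
```

With this operator the centre pixel does not contribute to its own background estimate, so an isolated bright pixel leaves bg(x, y) at the level of its surroundings.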

The relationship between the visibility threshold and the average background luminance bg(x, y) is shown in Figure 4-6 [28].

Figure 4-6 Error visibility thresholds due to background luminance in the spatial domain [28].

Sometimes we want the JND in the spatial-temporal domain. The computation can be simplified by multiplying the spatial JND, JND_S(x, y), by a temporal scale factor f3, as shown below [28]:

$$JND_{S\text{-}T}(x, y, n) = f_3(ild(x, y, n)) \cdot JND_S(x, y) \tag{36}$$

where ild(x, y, n) is the average interframe luminance difference between the nth and (n-1)th frame at pixel (x, y), as shown below:

The empirical results of f3 for all possible interframe luminance differences are shown in Figure 4-7 [28].

Figure 4-7 Error visibility threshold in the spatial-temporal domain, modeled as a scale factor of the interframe luminance difference applied to the JND value in the spatial domain [28].

The error visibility threshold increases with the interframe luminance difference. This agrees with the temporal masking effect: the sensitivity of human vision decreases after a scene change or a large temporal luminance difference.

Chapter 5 Rate Control Algorithm Based on HVS

5.1 Transform R-D Slope Representation

The R-D slope of truncation point j in code block i is usually represented by the value of $\Delta D_i^j / \Delta R_i^j$, where $R_i^j$ is the accumulated rate of truncation point j in code block i and $D_i^j$ is the accumulated distortion of truncation point j in code block i. Generally speaking, this value is very large, and the difference of this value between truncation points is very large too.

We can transform the R-D slope of each truncation point to another representation while keeping the relative order the same. We convert the R-D slope value to an exponential representation and use the exponent as the new R-D slope value of each truncation point, as shown in equation (38).

The new R-D slope of each truncation point is smaller and the relative difference of them is smaller too. The most important thing is that the relative order of the new R-D slopes of truncation points is kept the same as the original R-D slopes. We use this new value as the R-D slope value for each truncation point and do rate control on this new R-D slope.
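The effect of such a transform can be sketched as follows. The base 2 and the floor operation are assumptions made for illustration, and the sample distortion and rate values are hypothetical.

```python
import math


def exponent_slope(delta_distortion, delta_rate):
    """Integer exponent of the R-D slope.

    ASSUMPTION: base-2 exponent with floor; the base is chosen only
    for this sketch.
    """
    slope = delta_distortion / delta_rate
    return math.floor(math.log2(slope))


# Hypothetical (distortion decrease, rate increase) pairs, listed in
# order of decreasing R-D slope.
samples = [(4.0e9, 10), (3.0e6, 12), (5.0e3, 9), (40.0, 15)]
original = [d / r for d, r in samples]
transformed = [exponent_slope(d, r) for d, r in samples]
```

The original slopes span more than eight orders of magnitude, while the transformed values lie between 1 and 28 and appear in the same order. Because the logarithm is monotonic, the order can never be reversed, although two slopes within the same power of two map to the same integer.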

5.2 Weighting Factor

Human vision has different sensitivity at different spatial frequencies. We therefore need higher fidelity for low spatial frequency data, to which the eye is more sensitive, and can accept lower fidelity for high spatial frequency data, to which it is less sensitive. For this reason, we convert the mean-squared error (MSE) distortion to a “visual distortion” when doing rate control. In other words, we multiply the R-D slope of each truncation point by a weighting factor such that the weighted R-D slope is proportional to its importance to human vision. The goal is that rate control on the new R-D slope values achieves higher visual quality.

Here, we present a weighting factor only for the Y component of each frame. The discrete wavelet transform decomposes a frame into different spatial subbands. Every subband has its own minimum visibility threshold and thus its own relative visual importance. For this reason, the weighting factor w is decomposed into two factors: the intra-subband weighting factor w1 and the inter-subband weighting factor w2. The weighting factor w is:

$$w = w_1 \cdot w_2 \tag{39}$$

5.2.1 Intra-Subband Weighting Factor

The intra-subband weighting factor w1 is used to decide the visibility of the truncation point in the same spatial subband. It does not consider the visibility of the truncation point in the other spatial subbands. To find the visibility of the error of a truncation point, we need to know the just-noticeable-distortion (JND) of that subband.

Watson gives the minimum luminance threshold of each spatial subband without masking effect [30]. This minimum threshold can be used only on the Y component of the image. The minimum luminance threshold y of each subband is given by [30]:

$$\log y = \log a + k \left( \log f - \log(g_\theta f_0) \right)^2 \tag{40}$$

where a is 0.495, k is 0.466, and f0 is 0.401. The value of g_θ is 1.501, 1, and 0.534 for the LL, LH/HL, and HH subbands, respectively. f is the spatial frequency, whose value depends on the viewing condition. Under the computer monitor viewing condition, the display resolution r is 16 pixels/degree.

The size of our test sequence is 288 pixels in height and 352 pixels in width. The viewing distance is about 3.5 times the height, i.e., about 1000 pixels. The visual angle in height under this condition is 2*tan^-1(288/(1000*2)) = 16.39 degrees, so the display resolution in height is 288/16.39 = 17.57 pixels/degree. The visual angle in width is 2*tan^-1(352/(1000*2)) = 19.96 degrees, so the display resolution in width is 352/19.96 = 17.6 pixels/degree. We therefore take the display resolution r to be about 16 pixels/degree.
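The arithmetic above can be checked with a short sketch. The function names are ours; the frame size and the viewing distance of 1000 pixels are the values stated in the text.

```python
import math


def visual_angle_deg(size_pixels, distance_pixels):
    """Visual angle, in degrees, subtended by `size_pixels` at the given distance."""
    return 2.0 * math.degrees(math.atan(size_pixels / (2.0 * distance_pixels)))


def pixels_per_degree(size_pixels, distance_pixels):
    """Average display resolution over the subtended angle."""
    return size_pixels / visual_angle_deg(size_pixels, distance_pixels)


angle_height = visual_angle_deg(288, 1000)   # about 16.39 degrees
angle_width = visual_angle_deg(352, 1000)    # about 19.96 degrees
res_height = pixels_per_degree(288, 1000)    # about 17.6 pixels/degree
res_width = pixels_per_degree(352, 1000)     # about 17.6 pixels/degree
```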

The spatial frequency of DWT level λ is f = r·2^(−λ) cycles/degree. Figure 5-1 shows a frame after three levels of DWT and the spatial frequency of each subband. It also shows the minimum threshold y of each subband, calculated by equation (40) without masking effect, when the maximum spatial frequency is 16.0 cycles/degree.

We summarize the steps of calculating the minimum threshold y as follows.

1) Find the spatial frequency of each level λ by f = r·2^(−λ).

2) Find the value of g_θ for each orientation.

3) Use equation (40) to calculate the minimum threshold y of each subband.
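The three steps can be sketched as follows, using the constants given with equation (40). The use of base-10 logarithms is an assumption (the text does not state the base), and the function names are ours.

```python
import math

A = 0.495
K = 0.466
F0 = 0.401
G_THETA = {"LL": 1.501, "LH": 1.0, "HL": 1.0, "HH": 0.534}


def spatial_frequency(level, r=16.0):
    """Step 1: f = r * 2^(-level), in cycles/degree."""
    return r * 2.0 ** (-level)


def min_threshold(level, orientation, r=16.0):
    """Steps 2 and 3: minimum threshold y of one subband, equation (40).

    ASSUMPTION: the logarithms in equation (40) are base 10.
    """
    f = spatial_frequency(level, r)
    g_theta = G_THETA[orientation]
    log_y = math.log10(A) + K * (math.log10(f) - math.log10(g_theta * F0)) ** 2
    return 10.0 ** log_y


thresholds = {
    (level, orientation): min_threshold(level, orientation)
    for level in (1, 2, 3)
    for orientation in ("LH", "HL", "HH")
}
thresholds[(3, "LL")] = min_threshold(3, "LL")
```

Under these assumptions the threshold is smallest for the (3, LL) subband and largest for the (1, HH) subband, i.e., it grows with spatial frequency over the range used here.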

Figure 5-1 The level, orientation, spatial frequency, and minimum threshold of each DWT subband.

After we get the minimum threshold of each subband, we need to consider the contrast masking effect of each subband. Peli proposed a definition of contrast that can be used in complex images [22], as shown in equation (26). The problem now is the contrast sensitivity for each subband. If we assume the local luminance to be constant across the whole image and equal to the average value of the coefficients in the lowest spatial subband [31], we can calculate the contrast at each location (i, j) in the frame in a simplified way by:

$$c(i, j) = \frac{C(i, j)}{E(C_{lowest\_spatial\_subband})} \tag{41}$$

where E(C_lowest_spatial_subband) is the average of the coefficients in the lowest spatial subband and C(i, j) is the associated wavelet coefficient at location (i, j). In the case shown in Figure 5-1, E(C_lowest_spatial_subband) is the average of the coefficients in the subband (3, LL). Then, c(i, j) is the contrast at location (i, j) in the frame.

The visibility of a signal can be reduced by the presence of another signal, i.e., the contrast masking effect. The masking function is shown in Figure 5-2 and it can be the same for every subband [32].

Figure 5-2 The contrast masking function (horizontal axis: mask contrast*csf; vertical axis: threshold elevation; the two ranges meet at the point (C_M0, C_T0)).

The contrast masking function can be formulated by:

$$C_T(C_M) = C_{T0}, \quad \text{if } C_M < C_{M0}, \tag{42}$$

and

$$C_T(C_M) = C_{T0} \left( C_M / C_{M0} \right)^{\varepsilon}, \quad \text{if } C_M \ge C_{M0}, \tag{43}$$

where C_M is the masking contrast value, C_T is the threshold elevation value, and ε is the slope. The contrast masking function is divided into a threshold range, where the target detection threshold is independent of the masking contrast, and a masking range, where it grows as a power of the masking contrast. The slope ε is one for all subbands, which corresponds to experimentally derived slopes for phase-incoherent (noise) masking [32]. We generally assume that C_T0 = C_M0 [33], which is confirmed by experiments [34]. The values of C_T0 and C_M0 are both 1 [32].

If we normalize both the test threshold and masking contrast axes by the test frequency's threshold in a uniform field (i.e., 1/csf(f)), Figure 5-2 can be used to describe all frequencies, provided the test signal and masking signal have the same frequency [32]. The relationship between the threshold elevation C_T(f, C_M) and the real threshold value T(f, C(f)) is [32]:

$$C_T(f, C_M) = T(f, C(f)) \cdot csf(f) = T(f, C(f)) / T(f, 0), \tag{44}$$

where f is the spatial frequency [32]. The relationship between the real masking contrast value C(f) and the masking contrast value C_M is:

$$C_M = C(f) \cdot csf(f). \tag{45}$$

When there is no masking contrast, the real threshold value takes its minimum T(f, 0), which is the inverse of the corresponding contrast sensitivity function. We obtain the real threshold value T(f, C(f)) by dividing the threshold elevation C_T(f, C_M) by the corresponding contrast sensitivity csf(f); when C(f) is 0, i.e., no masking effect, it equals the value y from equation (40). The real masking contrast value C(f) at location (i, j) in the frame equals the value c(i, j) from equation (41). The minimum real threshold values of the pixels within the same spatial subband are all the same and equal to T(f, 0). Because the real masking contrast value C(f) differs from pixel to pixel, each pixel may have its own real threshold value T(f, C(f)).
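The chain from masking contrast to real threshold in equations (42) to (45) can be sketched as follows. The function names are ours, and taking the absolute value of the contrast (so that negative wavelet coefficients are handled) is an assumption of this sketch.

```python
def threshold_elevation(c_m, c_t0=1.0, c_m0=1.0, epsilon=1.0):
    """Contrast masking function, equations (42) and (43)."""
    if c_m < c_m0:
        return c_t0                             # threshold range
    return c_t0 * (c_m / c_m0) ** epsilon       # masking range


def real_threshold(t0, contrast, epsilon=1.0):
    """Real threshold T(f, C(f)) for one pixel.

    t0 is T(f, 0) = 1 / csf(f); `contrast` is the real masking
    contrast C(f), e.g. c(i, j).
    ASSUMPTION: the magnitude of the contrast is used.
    """
    csf = 1.0 / t0
    c_m = abs(contrast) * csf                                  # equation (45)
    return threshold_elevation(c_m, epsilon=epsilon) / csf     # equation (44)
```

With ε = 1 and C_T0 = C_M0 = 1, this reduces to T(f, C(f)) = max(T(f, 0), |C(f)|): the threshold stays at T(f, 0) until the masking contrast reaches it and then follows the contrast.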

We can use the contrast masking function to find the threshold value at each location (i, j) in a frame. Thus, we can find the real threshold value of every pixel within a subband and choose the smallest one as the real threshold value of the subband. However, if even one pixel has the smallest possible real threshold value, T(f, 0), this value becomes the real threshold value of the subband and the contrast masking effect is of no use.

We have done some experiments: we use the DWT to decompose the Y component of a frame and quantize one subband with different quantization step sizes while leaving the other subbands unquantized. Then, we use the IDWT to reconstruct the frame and find the step size at which the difference between the original and the reconstructed frame becomes visible. We found that this step size is usually larger than the value calculated by the method described above, especially for the lower spatial frequency subbands. The reason is that some pixel in a subband may have the minimum real threshold value T(f, 0), but it does not dominate the overall visual effect. For this reason, we choose the middle real threshold value T(C_middle) of the pixels within a subband as the real threshold value of that subband. T(C) is the real threshold value of the pixel with real masking contrast value C, and C_middle is the middle real masking contrast value among the pixels within the subband. In other words, T(C_middle) is also the middle real threshold value among the pixels within the subband.
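The difference between the two choices can be sketched as follows. We assume here that the "middle value" is the median, and that the masking function has slope 1 with C_T0 = C_M0 = 1, so that the per-pixel threshold is max(T(f, 0), |c|); the contrast values are hypothetical.

```python
import statistics


def pixel_thresholds(t0, contrasts):
    """Per-pixel real thresholds, assuming slope 1 and C_T0 = C_M0 = 1."""
    return [max(t0, abs(c)) for c in contrasts]


def subband_threshold_min(t0, contrasts):
    """Smallest per-pixel threshold (the choice rejected in the text)."""
    return min(pixel_thresholds(t0, contrasts))


def subband_threshold_middle(t0, contrasts):
    """Middle per-pixel threshold.

    ASSUMPTION: "middle" is read as the median; median_low returns an
    actual pixel value when the number of pixels is even.
    """
    return statistics.median_low(pixel_thresholds(t0, contrasts))


contrasts = [0.1, 0.2, 3.0, 0.9, 1.5]   # hypothetical c(i, j) values
t_min = subband_threshold_min(0.5, contrasts)
t_mid = subband_threshold_middle(0.5, contrasts)
```

A single low-contrast pixel pins the minimum at T(f, 0) = 0.5, whereas the middle value (0.9 here) reflects the masking present in most of the subband.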

In order to apply the real threshold values to HVS, we need to convert the real threshold values from the spatial domain to the wavelet domain. We need to estimate the size of the wavelet coefficient of each subband that produces the detectable spatial (impulse) response. To do this, we have a “worst case” formula that estimates the minimum coefficients detection threshold t_JND(λ,θ,C) of the corresponding subband with level λ and orientation θ that can produce the detectable spatial response [31]:

where T(C) is the real threshold value of the corresponding subband obtained above and i_θ is either p_l², p_h², or p_l·p_h for the LL, HH, or LH/HL subbands, respectively. p_l is the maximum coefficient amplitude of the low-pass synthesis filter and p_h is the maximum coefficient amplitude of the high-pass synthesis filter.

The DWT filter we use is the Daubechies 9/7 filter; the synthesis filter coefficients are shown in Table 5-1.

index   Synthesis low pass filter   Synthesis high pass filter
0       1.115087052456994           0.6029490182363579
±1      0.5912717631142470          -0.2668641184428723
±2      -0.05754352622849957        -0.07822326652898785
±3      -0.09127176311424948        0.01686411844287495
±4                                  0.02674875741080976

Table 5-1 The coefficients of the Daubechies 9/7 synthesis filters.

We use equation (46) to calculate t_JND(λ,θ,C) of the decomposed subbands shown in Figure 5-1 and show the result in Figure 5-3. Note that the t_JND(λ,θ,C) shown in Figure 5-3 is calculated without the contrast masking effect, i.e., it equals t_JND(λ,θ,0).

t_JND is also the JND threshold of the corresponding subband, i.e., the maximum error that can be tolerated in the subband without considering the masking effect. For uniform quantization with step size Q, the maximum possible error is Q/2 [30]. Thus, if we quantize the corresponding subband with step size 2*t_JND(λ,θ,0), human vision cannot distinguish the reconstructed frame from the original frame.
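The Q/2 bound can be illustrated with a minimal uniform quantizer. The threshold value and the coefficients below are hypothetical, and rounding to the nearest multiple of the step size is an assumption about the quantizer.

```python
def quantize(value, step):
    """Index of the nearest multiple of `step` (uniform quantizer)."""
    return round(value / step)


def dequantize(index, step):
    """Reconstruction value for a quantizer index."""
    return index * step


t_jnd = 0.75                 # hypothetical t_JND of one subband
step = 2.0 * t_jnd           # step size 2 * t_JND, as in the text
coefficients = [-3.2, -0.7, 0.0, 0.4, 1.49, 2.26, 10.0]
errors = [abs(c - dequantize(quantize(c, step), step)) for c in coefficients]
```

Every reconstruction error is at most step/2 = t_JND, so no coefficient error exceeds the subband threshold.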

We choose t_JND(λ,θ,C_middle) as the minimum coefficients detection threshold for the corresponding subband.

A metric that also accounts for the spatial and spectral summation of individual quantization errors is needed. The probability summation model is adopted in the perceptual distortion metric [36] [37]. The probability summation model considers a set of independent detectors, one at each subband location (λ,θ,x,y) [37], where (λ,θ,x,y) is the location (x, y) within the subband corresponding to level λ and orientation θ. The probability of detecting a distortion at location (λ,θ,x,y) is determined by the psychometric function, as shown below [37]:

$$P(\lambda,\theta,x,y) = 1 - \exp\left( -\left| \frac{e(\lambda,\theta,x,y)}{t_{JND}(\lambda,\theta,x,y)} \right|^{\beta_b} \right) \tag{47}$$

where e(λ,θ,x,y) is the quantization error at location (λ,θ,x,y) and β_b is a parameter whose value is chosen to achieve consistency between (47) and the experimentally determined psychometric function for a given type of distortion. We choose β_b = 4 [36] [37]. t_JND(λ,θ,x,y) is the minimum threshold at that location.

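The highest visual acuity is limited to the size of the foveal region, which covers approximately 2° of visual angle in the HVS. Let F(n1, n2) denote the area in the spatial domain that is centered at location (n1, n2) and covers 2° of visual angle. Then, the probability of detecting a distortion in this region is

$$P(n_1, n_2) = 1 - \prod_{(\lambda,\theta,x,y) \in F(n_1, n_2)} \left( 1 - P(\lambda,\theta,x,y) \right)$$

where P(λ,θ,x,y) is the probability of detecting a distortion at location (λ,θ,x,y).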

The probability summation scheme is developed based on two assumptions [36] [37].

1) A distortion is detected in the foveal region if and only if at least one detector signals the presence of distortion.

2) The probability of detecting a distortion of each detector is independent.

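Substituting equation (48) into (49), we have [37]:

$$P(n_1, n_2) = 1 - \exp\left( -\sum_{(\lambda,\theta,x,y) \in F(n_1, n_2)} \left| \frac{e(\lambda,\theta,x,y)}{t_{JND}(\lambda,\theta,x,y)} \right|^{\beta_b} \right) \tag{50}$$

where e(λ,θ,x,y) is the quantization error at location (λ,θ,x,y) and F(n1, n2) is the foveal region centered at (n1, n2).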
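The two assumptions can be checked numerically. The sketch below assumes the usual exponential form of the psychometric function, 1 − exp(−|e/t|^β), with β = 4; the error and threshold values are hypothetical and the function names are ours.

```python
import math


def detection_probability(error, t_jnd, beta=4.0):
    """Psychometric function of one detector (assumed exponential form)."""
    return 1.0 - math.exp(-(abs(error / t_jnd) ** beta))


def region_probability_product(errors, thresholds, beta=4.0):
    """Assumptions 1 and 2: detected iff at least one independent detector fires."""
    p_no_detection = 1.0
    for e, t in zip(errors, thresholds):
        p_no_detection *= 1.0 - detection_probability(e, t, beta)
    return 1.0 - p_no_detection


def region_probability_closed_form(errors, thresholds, beta=4.0):
    """Same probability after substitution: the product of exponentials
    becomes the exponential of a sum."""
    total = sum(abs(e / t) ** beta for e, t in zip(errors, thresholds))
    return 1.0 - math.exp(-total)


errors = [0.5, 0.8, 0.2]        # hypothetical quantization errors
thresholds = [1.0, 1.0, 0.5]    # hypothetical thresholds
p_product = region_probability_product(errors, thresholds)
p_closed = region_probability_closed_form(errors, thresholds)
```

Both functions return the same value, which is why the probability over a region can be expressed through a single sum of normalized errors raised to the power β.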

The maximum width, height, and depth of a code block in 3D-ESCOT coding are 64, 64, and 4, respectively. Because we consider only one frame at a time and t_JND(λ,θ,C_middle) may differ from frame to frame, the depth of the code block is 1. Although human eyes can see scenery within a visual angle of about 160° to 180°, humans can pay attention only to scenery within a visual angle of about 2° because of the structure of the fovea. If we assume the foveal region is the code block, the maximum visual angle of each code block is larger than 2° in our condition. So we need to modify equation (51) to fit our condition.

From equation (51), we can see that the total “visual error distortion” is

$$\sum_{x} \sum_{y} \left| e(\lambda,\theta,x,y) \right|^{4}$$

and the total “visual error distortion” that can be tolerated is block_height*block_width*t_JND(λ,θ,C_middle)^4. We think that the ratio of these two values can determine the visual error probability. So we rewrite equation (51) accordingly.

A spatial subband may include more than one code block, and each code block has its own height and width. If we consider just one code block at a time, we get:

Combining equations (50) and (53), we obtain the intra-subband weighting factor w1 of the coding pass of the corresponding bitplane, given by equation (54).

The intra-subband weighting factor w1 is different for every truncation point, even if the truncation points are located in the same spatial subband, and w1 is frame-dependent.
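The ratio described above can be sketched for one code block as follows. The mapping from the ratio to a probability with 1 − exp(−ratio) is only one plausible reading and is labeled as an assumption; it is not claimed to be the exact form of w1. The function names and sample values are hypothetical.

```python
import math


def visual_error_ratio(errors, t_jnd, beta=4.0):
    """Accumulated |e|^beta over a code block divided by the tolerable total.

    `errors` is a list of rows (block_height x block_width) and `t_jnd`
    is the threshold of the subband, e.g. t_JND(level, orientation, C_middle).
    """
    block_height = len(errors)
    block_width = len(errors[0])
    total = sum(abs(e) ** beta for row in errors for e in row)
    tolerable = block_height * block_width * t_jnd ** beta
    return total / tolerable


def visual_error_probability(errors, t_jnd, beta=4.0):
    """ASSUMPTION: the ratio is mapped to a probability with the same
    exponential form as the psychometric function."""
    return 1.0 - math.exp(-visual_error_ratio(errors, t_jnd, beta))


at_threshold = [[2.0] * 4 for _ in range(4)]    # every error equals t_jnd
no_error = [[0.0] * 4 for _ in range(4)]
large_error = [[20.0] * 4 for _ in range(4)]    # far above the threshold
```

Under this reading the value approaches 1 when the errors are far above the threshold, which matches the observation that w1 is close to 1 near the most significant bitplane.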

5.2.2 Inter-Subband Weighting Factor

The intra-subband weighting factor w1 is close to 1 when the bitplane is close to the most significant bitplane (large distortion). In other words, if we multiply the R-D slope of each truncation point by w1, the R-D slopes of bitplanes near the most significant bitplane hardly change, while the R-D slopes of bitplanes near the least significant bitplane become smaller. Thus, at low bit rates the visual quality is the same as with the original rate control algorithm. We therefore need another weighting factor that decides the relative visual importance of the same bitplane in different spatial subbands. This is the inter-subband weighting factor w2.

From equations (27), (28), (29), and (30), we can see that the sensitivity at different spatial frequencies is very different. Thus, the difference between the associated inter-subband weighting factors should be large too. However, because we use equation (38) to represent the R-D slope of each truncation point, the relative difference between the inter-subband weighting factors becomes smaller as well.

We use t_JND(λ,θ,0) instead of t_JND(λ,θ,C_middle) to calculate w2. The reason is that a spatial subband with lower t_JND(λ,θ,0) usually has higher sensitivity. If we consider the contrast masking effect, we get t_JND(λ,θ,C_middle), which is greater than or equal to t_JND(λ,θ,0). Thus, the associated inter-subband weighting factor w2 would be smaller.

The t_JND(λ,θ,0) of the lowest spatial subband is the smallest of all the subbands, but its t_JND(λ,θ,C_middle) is usually very large because of the strong contrast masking effect caused by the large wavelet coefficients in this subband. If we used t_JND(λ,θ,C_middle) to calculate w2, the lowest spatial subband would get a lower w2. This does not agree with our experiments, in which the lowest spatial subband has the largest weighting. For this reason, we use t_JND(λ,θ,0) to calculate w2 for each spatial subband.

Assuming the t_JND(λ,θ,0) of the lowest spatial subband is . For the frame showing in

subband

From equation (55), we can see that the inter-subband weighting factor w2 is the same for all the truncation points within the same spatial subband and it is frame-independent.

Combining equations (39), (54), and (55), we obtain the subband weighting factor w, given by equation (56).

We can use w to transform the original distortion to “visual distortion”, i.e., the weighted truncation points are in the order of visual importance.

5.3 Rate Control

Figure 5-4 The flow chart of calculating the subband weighting factor w (blocks: Wavelet Coefficient, Code Block, Code Block Size Information, Distortion Information, Weighting Factor, Transmission Channel; intermediate values: T(0), T(C), t_JND(0), t_JND(C)).

Figure 5-4 shows the flow chart of calculating the subband weighting factor w. We first transform the R-D slope of each truncation point to the new R-D slope by equation (38). Then, we obtain the subband weighting factor of each truncation point by equation (56) and multiply it by the new R-D slope obtained from equation (38). Thus, we obtain the R-D slope value based on “visual distortion”.

We use the new weighted R-D slope to do rate control. A truncation point with a larger weighted R-D slope has a higher probability of being packaged and transmitted.
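A minimal sketch of this selection is given below. It is a simplified greedy scheme under stated assumptions, not the full rate control algorithm: we assume that the weighted slopes within each code block decrease with the truncation index, so that sorting never places a later truncation point before an earlier one of the same block. The record fields, the byte budget, and the sample values are hypothetical.

```python
def select_truncation_points(points, byte_budget):
    """Pick truncation points in order of decreasing weighted R-D slope.

    Each point is a dict with the hypothetical keys "block", "index",
    "slope" (weighted R-D slope) and "bytes" (rate increment).
    Selection stops at the first point that no longer fits.
    """
    ordered = sorted(points, key=lambda p: p["slope"], reverse=True)
    selected = []
    used = 0
    for point in ordered:
        if used + point["bytes"] > byte_budget:
            break
        selected.append((point["block"], point["index"]))
        used += point["bytes"]
    return selected, used


points = [
    {"block": 0, "index": 0, "slope": 30.0, "bytes": 40},
    {"block": 0, "index": 1, "slope": 12.0, "bytes": 50},
    {"block": 1, "index": 0, "slope": 25.0, "bytes": 30},
    {"block": 1, "index": 1, "slope": 8.0, "bytes": 60},
]
selected, used = select_truncation_points(points, byte_budget=130)
```

With a budget of 130 bytes, the three points with the largest weighted slopes are kept (120 bytes) and the last point of block 1 is dropped.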

We show the experimental results in the next subsection and examine the correctness of the proposed rate control algorithm.

5.4 Experimental Results

Here we show two types of the experimental results. One is the correctness of the
