

Sustained attention also affects contrast sensitivity. Ling et al. [2] found that, at the onset of sustained attention, observers need a lower contrast threshold to reach 75% accuracy on an orientation discrimination task; as time goes by, however, the contrast threshold required for successful orientation discrimination increases. Figure 3.5 shows the results of their experiment: sustained attention has the same effect as transient attention for a short duration, but the effect of adaptation takes over afterward.

Figure 3.5: Experimental results of Ling et al. [2]. They conducted an experiment very similar to that of Pestilli et al., but used four Gabor patches instead of two. While subjects adapted to these gratings, they were required either to attend to one of the four patches (sustained attention) or to view all four stimuli (neutral condition). After 50 ms to 16 s, a test patch with a different contrast appeared at one of the four locations, and subjects were asked to report its orientation. The plot records the contrast thresholds at 75% accuracy for different durations.

3.2 Computational model of visual attention

Many computational models of visual attention have been proposed based on the feature integration theory [41]. The theory states that fairly simple visual features are computed over the entire scene in the first step of visual processing, but only the attended features are further processed to form a unified object representation. Itti et al. [3] modeled the salient positions where primates would pay attention in an image. They used several operations to produce a saliency map, to which they applied a winner-take-all algorithm and an inhibition-of-return mechanism to simulate the behavior of attention. Later, Itti [42] extended the model with motion features for video compression and showed that the priority map produced from the saliency map is effective in video compression. Petit et al. [15] modified the saliency map in order to discover the salient locations in an HDR image.
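The winner-take-all selection together with inhibition of return can be read as a simple loop: repeatedly pick the most salient location, then suppress its neighborhood so that attention moves on to the next region. Below is a minimal sketch of that loop in Python; the suppression radius and the reset value are our own illustrative choices, not parameters from Itti et al. [3].

```python
import numpy as np

def attended_locations(saliency, n_fixations=5, inhibition_radius=20):
    """Pick successive attention targets from a saliency map with a
    winner-take-all selection followed by inhibition of return."""
    s = saliency.astype(np.float64)
    h, w = s.shape
    ys, xs = np.mgrid[0:h, 0:w]
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: the most salient location wins the competition.
        y, x = np.unravel_index(np.argmax(s), s.shape)
        fixations.append((y, x))
        # Inhibition of return: suppress a disk around the winner so that
        # attention shifts to the next most salient region.
        suppress = (ys - y) ** 2 + (xs - x) ** 2 <= inhibition_radius ** 2
        s[suppress] = s.min()
    return fixations
```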

In our algorithm, we use the HDR saliency map [15] as the model of transient attention since it shares the bottom-up nature of transient attention. The HDR saliency map is developed from the original saliency map [3]. Figure 3.6 depicts the model of the original saliency map. In this model, Itti et al. [3] first build Gaussian pyramids with nine spatial scales σ ∈ [0, 8], where σ = 0 is the finest scale and σ = 8 the coarsest. They decompose an image into three parts: intensity, color, and orientation. For each part, they perform center-surround differences and across-scale combination to simulate the feature integration theory [42]. The center-surround operator Θ mimics the behavior of visual receptive fields, in which visual neurons are more sensitive to stimuli at the center of the receptive field than to stimuli in its periphery. The Θ operator upsamples a coarser-scale image to an estimated finer-scale image and then subtracts this estimate from the actual finer-scale image to obtain a feature map.
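To make the Θ operation concrete, the following sketch builds a Gaussian pyramid and implements Θ as "upsample the surround level to the center level's resolution, then take the point-wise absolute difference". It assumes OpenCV for the pyramid and the resizing; the helper names are ours.

```python
import cv2
import numpy as np

def gaussian_pyramid(image, levels=9):
    """Gaussian pyramid with spatial scales sigma = 0 (finest) .. levels - 1."""
    pyramid = [image.astype(np.float32)]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid

def center_surround(pyramid, c, s):
    """Theta operator: upsample scale s to the size of scale c, then take
    the absolute difference with the center image at scale c."""
    center = pyramid[c]
    surround = cv2.resize(pyramid[s], (center.shape[1], center.shape[0]),
                          interpolation=cv2.INTER_LINEAR)
    return np.abs(center - surround)
```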


Figure 3.6: Flowchart of the saliency map proposed by Itti et al. [3].


The intensity image I in Figure 3.6 is computed using

I = (r + g + b) / 3, (3.1)

where r, g, and b are the red, green, and blue channels of the input image, respectively. The six intensity feature maps are computed as follows,

I(c, s) = | I(c) Θ I(s) |, (3.2)

where c ∈ {2, 3, 4} and s = c + δ with δ ∈ {3, 4}.
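Reusing the pyramid and Θ helpers sketched above, the intensity image of equation (3.1) and its six feature maps of equation (3.2) can be computed as follows; r, g, and b are assumed to be float arrays of the same size.

```python
def intensity_feature_maps(r, g, b):
    """I = (r + g + b) / 3 and its six center-surround feature maps I(c, s),
    with c in {2, 3, 4} and s = c + delta, delta in {3, 4}."""
    intensity = (r + g + b) / 3.0
    pyramid = gaussian_pyramid(intensity, levels=9)
    feature_maps = {}
    for c in (2, 3, 4):
        for delta in (3, 4):
            feature_maps[(c, c + delta)] = center_surround(pyramid, c, c + delta)
    return intensity, feature_maps
```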

Itti et al. define their color maps R, G, B, and Y according to the opponent process theory [41], which states that there are three opponent channels in the human visual system: red versus green, yellow versus blue, and black versus white.

R = r − (g + b)/2,
G = g − (r + b)/2,
B = b − (r + g)/2,
Y = (r + g)/2 − |r − g|/2 − b. (3.3)

As the responses excited by one color of an opponent channel inhibit those excited by the other color, they define 12 color feature maps as

RG(c, s) = | (R(c) − G(c)) Θ (G(s) − R(s)) |,

BY(c, s) = | (B(c) − Y(c)) Θ (Y(s) − B(s)) |, (3.4)

where c ∈ {2, 3, 4} and s = c + δ with δ ∈ {3, 4}.
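The color channels and the 12 color feature maps can be sketched in the same way. Since the surround term in equation (3.4) carries the opposite member of the opponent pair, with D = R − G the feature map reduces to |D(c) + upsample(D(s))|. The channel definitions follow Itti et al. [3]; the helper names are ours.

```python
def opponent_center_surround(pyramid, c, s):
    """|D(c) - upsample(-D(s))| = |D(c) + upsample(D(s))| for an opponent
    difference image D (e.g. D = R - G), matching equation (3.4)."""
    center = pyramid[c]
    surround = cv2.resize(pyramid[s], (center.shape[1], center.shape[0]),
                          interpolation=cv2.INTER_LINEAR)
    return np.abs(center + surround)

def color_feature_maps(r, g, b):
    """Broadly tuned color channels and the 12 feature maps RG(c, s), BY(c, s)."""
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    pyr_rg = gaussian_pyramid(R - G, levels=9)
    pyr_by = gaussian_pyramid(B - Y, levels=9)
    rg, by = {}, {}
    for c in (2, 3, 4):
        for delta in (3, 4):
            s = c + delta
            rg[(c, s)] = opponent_center_surround(pyr_rg, c, s)
            by[(c, s)] = opponent_center_surround(pyr_by, c, s)
    return rg, by
```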

The orientation feature maps are obtained from the Gabor pyramids O(σ, θ), where σ represents the scale and θ ∈ {0, 45, 90, 135} is the orientation of the Gabor filter:

O(c, s, θ) = | O(c, θ) Θ O(s, θ) |. (3.5)
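A sketch of the orientation channel: each level of the intensity pyramid is filtered with one Gabor kernel per orientation, and the feature maps then reuse the center-surround helper above. The Gabor parameters below are illustrative assumptions, not the values used by Itti et al. [3].

```python
def gabor_pyramids(intensity_pyramid, thetas=(0, 45, 90, 135)):
    """O(sigma, theta): filter every pyramid level with an oriented Gabor kernel."""
    pyramids = {}
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(9, 9), sigma=2.0,
                                    theta=np.deg2rad(theta),
                                    lambd=8.0, gamma=0.5)
        pyramids[theta] = [cv2.filter2D(level, -1, kernel)
                           for level in intensity_pyramid]
    return pyramids

def orientation_feature_maps(gabor_pyrs):
    """O(c, s, theta) = | O(c, theta) Theta O(s, theta) |, equation (3.5)."""
    maps = {}
    for theta, pyramid in gabor_pyrs.items():
        for c in (2, 3, 4):
            for delta in (3, 4):
                maps[(c, c + delta, theta)] = center_surround(pyramid, c, c + delta)
    return maps
```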


All feature maps of each part (intensity, color, and orientation) are normalized and combined into three conspicuity maps using the across-scale addition operator ⊕, which downsamples each feature map to scale four and sums them point by point. These three conspicuity maps are then normalized and summed into the final saliency map.

The intensity conspicuity map, for example, is

Ī = ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(I(c, s)), (3.6)

where N(·) consists of normalizing the values in the map to a fixed range [0, M], finding the global maximum M and the average m̄ of the other local maxima in the map, and multiplying the map by (M − m̄)².
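A sketch of the N(·) operator follows. Detecting local maxima with a plain maximum filter and excluding the global maximum from the average m̄ are our own reading of the description above.

```python
from scipy.ndimage import maximum_filter

def normalize_map(feature_map, M=1.0, neighborhood=7):
    """N(.): rescale to [0, M], then weight the map by (M - m_bar)^2 so that
    maps with one strong peak dominate maps with many comparable peaks."""
    fmap = feature_map.astype(np.float64)
    fmap -= fmap.min()
    if fmap.max() > 0:
        fmap *= M / fmap.max()
    # Local maxima: pixels that equal the maximum of their neighborhood.
    peaks = fmap[fmap == maximum_filter(fmap, size=neighborhood)]
    global_max = fmap.max()
    others = peaks[peaks < global_max]
    m_bar = others.mean() if others.size else 0.0
    return fmap * (global_max - m_bar) ** 2
```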

Itti et al. [43] also extended the original saliency map to handle videos by adding two more features: flicker and motion. The flicker feature maps are calculated from the absolute luminance difference between the current frame and the previous frame. The motion feature maps are computed from the Gabor or intensity pyramids. They shift each pyramid level by one pixel in four directions to obtain the shifted pyramids S_n(σ, θ), and the motion pyramids are computed using the Reichardt model [44],

R_n(σ, θ) = | O_n(σ, θ) ∗ S_{n−1}(σ, θ) − O_{n−1}(σ, θ) ∗ S_n(σ, θ) |, (3.7)

where the subscripts n and n − 1 denote the current and the previous frame, and the symbol ∗ represents a pixel-wise product. Finally, they compute the motion feature maps R_n(c, s, θ) and the motion conspicuity map R̄_n by


R_n(c, s, θ) = | R_n(c, θ) Θ R_n(s, θ) |, (3.8)

and

R̄_n = Σ_θ N( ⊕_{c=2}^{4} ⊕_{s=c+3}^{c+4} N(R_n(c, s, θ)) ), (3.9)

where c ∈ {2, 3, 4} and s = c + δ with δ ∈ {3, 4}.
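The Reichardt correlation of equation (3.7) for a single shift direction can be sketched as below; the shifted pyramids S_n are obtained here with a one-pixel circular shift (np.roll), which is an implementation convenience rather than the exact boundary handling of Itti et al. [43].

```python
def reichardt_pyramid(pyr_curr, pyr_prev, shift=(0, 1)):
    """R_n(sigma) = | O_n * S_{n-1} - O_{n-1} * S_n |, with * a pixel-wise
    product and S the pyramid shifted one pixel in the given direction."""
    motion = []
    for curr, prev in zip(pyr_curr, pyr_prev):
        shifted_curr = np.roll(curr, shift, axis=(0, 1))
        shifted_prev = np.roll(prev, shift, axis=(0, 1))
        motion.append(np.abs(curr * shifted_prev - prev * shifted_curr))
    return motion
```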

CHAPTER 4

Approach

In this chapter, we describe our tone mapping method in detail. Figure 4.1 depicts the flowchart of our algorithm, which consists of two parts: attention map computation and tone mapping adjustment. When we acquire an HDRI, we first calculate its attention map in order to find which regions would draw a viewer's attention. We describe how to compute the attention map in section 4.1.

We adjust our tone mapping function to account for transient attention and adaptation effects based on the psychophysical findings of Pestilli et al. [6]. Following section 3.1, we design an attention function that adjusts a baseline tone mapping function for the neutral condition, pixel by pixel, according to the computed attention map. We also design an adaptation function to reduce the contrast of an HDRI. Finally, the tone mapping functions locally adjusted by the attention function and the adaptation function are combined using a weighted sum. We describe these functions in section 4.2.


Figure 4.1: Our algorithm


4.1 Attention map computation

We first need to determine the attentive and nonattentive regions in an HDR image. Among the several computational models of visual attention, we use the HDR saliency map [15][16], which is developed from the saliency map [3]. As mentioned by Petit et al. [15][16], the prediction of the original saliency map on HDR images is less accurate than on LDR images. Hence, they modified the saliency map in two aspects: intensity and orientation. Table 4.1 shows the differences between the saliency maps proposed by Itti et al. [3] and Petit et al. [15][16]. These modifications are mainly made to handle the larger dynamic range of HDR data, as the original saliency map was designed for LDR images. Petit et al. construct the conspicuity maps at scale 4, and then normalize and sum these conspicuity maps to obtain the saliency map of an HDRI. As the input image and its saliency map differ in size, we resize the saliency map so that the salient and unsalient regions of the input image can be determined.

Table 4.1: The differences between saliency maps of HDR and LDR images

              Saliency Map for LDRI                    Saliency Map for HDRI
Intensity     I(c, s) = | I(c) Θ I(s) |                I(c, s) = | I(c) Θ I(s) | / I(s)
Orientation   O(c, s, θ) = | O(c, θ) Θ O(s, θ) |       O(c, s, θ) = O(c, θ) / I(s)
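We read the intensity modification in Table 4.1 as dividing the center-surround difference by the surround intensity, i.e. a Weber-like contrast. The sketch below assumes the division uses the upsampled surround level and adds a small epsilon for numerical stability.

```python
def hdr_intensity_feature_map(intensity_pyramid, c, s, eps=1e-6):
    """Intensity feature map of the HDR saliency map as we read Table 4.1:
    I(c, s) = | I(c) Theta I(s) | / I(s)."""
    center = intensity_pyramid[c]
    surround = cv2.resize(intensity_pyramid[s],
                          (center.shape[1], center.shape[0]),
                          interpolation=cv2.INTER_LINEAR)
    return np.abs(center - surround) / (surround + eps)
```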

Since the saliency map predicts the locations of visual attention, we use equation 4.1 to obtain the attention map: we rescale the saliency values to [−1, 1] and treat pixels whose values in the attention map are above 0 as attentive regions and those below 0 as nonattentive regions. Our tone mapping function changes according to this attention map.

A(x, y) = (2 / (max − min)) · (S(x, y) − max) + 1, (4.1)
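Equation (4.1) together with the resizing step can be sketched as follows; max and min are the extrema of the saliency map S, and the interpolation choice is ours.

```python
def attention_map(saliency, image_height, image_width):
    """Rescale the saliency map S to [-1, 1] with equation (4.1) and resize
    the result to the input image resolution."""
    s_min, s_max = float(saliency.min()), float(saliency.max())
    A = 2.0 / (s_max - s_min) * (saliency - s_max) + 1.0
    return cv2.resize(A, (image_width, image_height),
                      interpolation=cv2.INTER_LINEAR)
```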
