Chapter 2. Backgrounds
2.4. Intensity-Pair Distribution
In [12], Jen et al. proposed an intensity-pair distribution technique, which was used to enhance image contrast. This distribution possesses both local information and global information of the image content. For a given image, this method tests at each pixel the intensity difference between that pixel and each of its 8-connection neighbors. Figure 2-15 shows an illustration of a pixel and its 8-connection neighbors.
Due to the commutative property of intensity pair, we only check 4 neighboring pixels, instead of 8, as we scan the image in the raster order. That is, for the pixel at E in Figure 2-15, we only check the intensity difference between that pixel and its upper-left pixel (A), upper pixel (B), upper-right pixel (C), and left pixel (D) [12].
Figure 2-15 An illustration of a pixel and its 8-connection neighbors.
After the computation of intensity differences, we may imagine that we have formed an intensity-pair distribution as shown in Figure 2-16. Figure 2-16(a) shows an example of a 2-D image. As we calculate the intensity difference between adjacent pixels, we form four different intensity pairs, {(80, 80), (175, 80), (80, 175), (175, 175)}. If we ignore the pair order and treat (175, 80) and (80, 175) as the same type of pair, these four types of pairs are merged into three types, {(80, 80), (175, 80), (175, 175)}. By counting the total number of occurrences of each type of intensity pair, we generate the intensity-pair distribution shown in Figure 2-16(b). Here, the values at (80, 80), (175, 80), and (175, 175) are 21, 13, and 21, respectively.
Similarly, for the real image shown in Figure 2-16(c), the intensity-pair distribution can be easily calculated, as shown in Figure 2-16(d). In particular, if the intensity difference of an intensity pair is larger than a pre-selected threshold, that intensity pair is treated as an edge pair [12].
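The raster-order counting described in this section can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation of [12]: the function names `intensity_pair_distribution` and `count_edge_pairs`, the 256 intensity levels, and the toy 2×2 input are all hypothetical.

```python
import numpy as np

def intensity_pair_distribution(img, levels=256):
    """Count unordered intensity pairs between each pixel E and its
    upper-left (A), upper (B), upper-right (C), and left (D) neighbors."""
    img = np.asarray(img, dtype=np.intp)
    h, w = img.shape
    dist = np.zeros((levels, levels), dtype=np.int64)
    # (row offset, column offset) of A, B, C, D relative to E
    for dy, dx in [(-1, -1), (-1, 0), (-1, 1), (0, -1)]:
        # range of E positions whose neighbor lies inside the image
        y0, y1 = max(0, -dy), h - max(0, dy)
        x0, x1 = max(0, -dx), w - max(0, dx)
        e = img[y0:y1, x0:x1]
        n = img[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
        # ignore pair order: store each pair as (larger, smaller)
        hi = np.maximum(e, n).ravel()
        lo = np.minimum(e, n).ravel()
        np.add.at(dist, (hi, lo), 1)
    return dist

def count_edge_pairs(dist, threshold):
    """Number of pairs whose intensity difference exceeds the threshold."""
    hi, lo = np.indices(dist.shape)
    return int(dist[(hi - lo) > threshold].sum())

# Hypothetical toy image, not the one in Figure 2-16(a).
toy = [[80, 80],
       [80, 175]]
d = intensity_pair_distribution(toy)
print(d[80, 80], d[175, 80], d[175, 175])
```

Because each adjacent pixel pair is visited exactly once, the distribution of this 2×2 image sums to six pairs (two horizontal, two vertical, two diagonal).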
Figure 2-16 (a) A synthesized image (b) Intensity-pair distribution of (a) (c) A real image (d) Intensity-pair distribution of (c) [12]
By analyzing the content of the intensity-pair distribution, we can obtain useful information for the detection of visually salient regions. This will be discussed in the next chapter.
Chapter 3. Proposed Method
The goal of a saliency map is to capture the regions to which a person is likely to pay more attention. As mentioned earlier, bottom-up methods are more flexible and are applicable to different scenarios. However, the major problems of bottom-up visual saliency models are their complexity and the difficulty of detecting and labeling regions in complex natural scenes. In a bottom-up approach, we aim to detect those regions which are “special” or “abnormal”. In this thesis, we develop our system based on the following two intuitive assumptions:
(1) A region with a strong contrast with respect to its surrounding regions is more likely to be paid attention to.
(2) A region is less attractive to the observer if its property is common in the scene.
With these two assumptions, we develop our saliency region detector based on the infrastructure proposed by Itti [4]. The flow chart of the proposed saliency region detector is illustrated in Figure 3-1. In the following sub-sections, we will explain in detail the sub-modules of this system.
Figure 3-1 Block diagram of the proposed system
3.1. Linear Filtering of Image Data
Similar to Itti’s approach, we decompose an input image into a few feature vectors, including intensity, RG color, and BY color. Here, we ignore the orientation feature since the orientation feature is usually not a dominating factor in natural scenes. In our system, the intensity channel is defined as:
$I = \frac{r + g + b}{3}$    Eq. 3-1
where r, g, and b denote the red, green, and blue components of the input image.
On the other hand, we define the red, green, blue, and yellow hues of the image pixel as:
For each color hue, negative values are set to zero. Each color hue yields the maximal response for the pure, fully-saturated hue and yields zero response for gray colors. These four color hues are then merged together to form two opponent-color channels that mimic the color opponent process in the human visual system [8].
Now that the separate color hues have been obtained, we explain why and how they are combined. The motivation comes from the biology of the human brain, in which there exists a “color opponent-component” system. In the center of their receptive fields, neurons are excited by one color (e.g., red) and inhibited by another (e.g., green). Red/green, green/red, blue/yellow, and yellow/blue are the color-opponent pairs that exist in the human visual cortex [13]. Thus, in our approach, we define the RG color channel to be
$RG = R - G$    Eq. 3-6
and the BY channel to be
$BY = B - Y$    Eq. 3-7
At this point, we have three feature maps extracted from the input image, as shown in Figure 3-2.
Figure 3-2 A sketch diagram of low-level image feature extraction
These two opponent-color channels, together with the I (Intensity) channel, are fed into the following modules to form feature-pair distributions.
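The channel decomposition of this section can be sketched as follows. The intensity, RG, and BY formulas follow Eq. 3-1, Eq. 3-6, and Eq. 3-7. The four hue formulas are an assumption: they are taken from the broadly tuned color channels of Itti's model [4], on which this system is based, and may differ in detail from the definitions used here. The function name `feature_channels` is hypothetical.

```python
import numpy as np

def feature_channels(rgb):
    """Split an H x W x 3 RGB image into the I, RG, and BY channels."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    intensity = (r + g + b) / 3.0                      # Eq. 3-1

    # ASSUMPTION: hue definitions as in Itti's model [4];
    # negative responses are clipped to zero as stated in the text.
    R = np.maximum(r - (g + b) / 2.0, 0.0)
    G = np.maximum(g - (r + b) / 2.0, 0.0)
    B = np.maximum(b - (r + g) / 2.0, 0.0)
    Y = np.maximum((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0.0)

    RG = R - G                                         # Eq. 3-6
    BY = B - Y                                         # Eq. 3-7
    return intensity, RG, BY

# One row of four pixels: pure red, gray, pure yellow, pure blue.
pixels = np.array([[[255, 0, 0], [100, 100, 100], [255, 255, 0], [0, 0, 255]]])
I, RG, BY = feature_channels(pixels)
print(I[0], RG[0], BY[0])
```

Under these assumed hue definitions, a gray pixel gives zero response in both opponent channels, while pure red and pure blue give the maximal positive RG and BY responses, respectively.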
3.2. Feature-Pair Distributions
For each of the I, RG, and BY channels, we compute the feature-pair distribution as proposed in [12]. As mentioned in Section 2.4, Jen et al. proposed the concept of intensity-pair distribution for the enhancement of image contrast. Since this distribution possesses both local information and global information, it may offer useful information for detecting visually salient regions in the image. In a feature-pair distribution, the global information tells us what kinds of features are common in the image, while the local information tells us which portions of the image may exhibit large contrast. Hence, by properly using the pair distributions of the I, RG, and BY channels, we can efficiently detect those image portions with unusual appearance or with stronger contrast.
To establish the feature-pair distribution for the I channel, we check at each pixel the intensity pairs between that pixel and its 8-connection neighbors. Figure 2-15 shows an illustration of a pixel and its 8-connection neighbors. If we denote the I values of these nine pixels as A to I, respectively, then the eight intensity pairs {(E, A), (E, B), (E, C), (E, D), (E, F), (E, G), (E, H), (E, I)} are formed and accumulated in the feature-pair distribution. Clearly, we can expect that the intensity pairs over smooth regions will lie around the 45-degree line, whereas the intensity pairs across edges will lie somewhere away from the 45-degree line.
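As a rough illustration of this accumulation step, the sketch below collects the ordered (center, neighbor) pairs over all eight neighbors. The names `NEIGHBOR_OFFSETS` and `feature_pairs` are hypothetical, and the treatment of border pixels (pairs that would fall outside the image are simply skipped) is an assumption, since the text does not state it.

```python
import numpy as np

# Offsets (row, column) of the eight neighbors A, B, C, D, F, G, H, I of E.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]

def feature_pairs(channel):
    """Return two 1-D arrays (center values, neighbor values), one entry
    per 8-connected pair; pairs leaving the image are skipped."""
    ch = np.asarray(channel)
    h, w = ch.shape
    centers, neighbors = [], []
    for dy, dx in NEIGHBOR_OFFSETS:
        y0, y1 = max(0, -dy), h - max(0, dy)
        x0, x1 = max(0, -dx), w - max(0, dx)
        centers.append(ch[y0:y1, x0:x1].ravel())
        neighbors.append(ch[y0 + dy:y1 + dy, x0 + dx:x1 + dx].ravel())
    return np.concatenate(centers), np.concatenate(neighbors)

# Toy channel: a flat region of value 10 with one bright pixel of value 200.
toy = np.full((3, 3), 10)
toy[1, 1] = 200
c, n = feature_pairs(toy)
on_line = int(np.sum(c == n))   # pairs lying on the 45-degree line
off_line = int(np.sum(c != n))  # pairs across the edge of the bright pixel
print(len(c), on_line, off_line)
```

In this toy case the pairs inside the flat region fall exactly on the 45-degree line, while the pairs that involve the bright pixel fall at (200, 10) and (10, 200), away from it.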
Figure 3-3 shows an example of the intensity-pair distribution. For the airplane image shown in Figure 3-3, since the sky and grass are the major backgrounds of the image, the intensity pairs over these two regions form two major clusters in the intensity-pair distribution. Here, we intentionally colorize these two clusters to indicate their correspondence. On the other hand, the aircraft map to a smaller cluster in the lower-left corner of the distribution. Moreover, the intensity pairs over the sky-grass boundary and the aircraft-sky boundary form four clusters (shown in red) far away from the 45-degree line. Based on this intensity-pair distribution, we can easily deduce that the boundary between the aircraft and the sky exhibits a stronger contrast than the sky-grass boundary. Given that (1) the aircraft are “less common” than the sky and the grass, and (2) the aircraft have a stronger contrast with respect to their background, we may deduce that these two aircraft are likely to catch the attention of most observers.
Figure 3-3 A matching example of modified intensity-pair distribution
This raises a question: how large should the input image be? In Figure 3-4, we show four intensity-pair distributions whose input images range from scale 0 to scale 3. Each time the scale is increased by 1, the height and the width of the image are each reduced by a factor of 2. The choice of scale is image dependent. However, at Scale 0 or Scale 1, the distribution usually contains a large amount of scattered data and requires a longer processing time. Hence, in our approach, we typically work on Scale 2 and Scale 3, as shown in Figure 3-4(c) and (d).
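The scale convention can be illustrated with a short sketch. How the image is actually reduced is not stated in the text, so plain 2×2 block averaging is used here purely as an assumption; the function names `downscale_by_two` and `at_scale` are hypothetical.

```python
import numpy as np

def downscale_by_two(img):
    """Halve height and width by averaging 2x2 blocks (assumed method)."""
    img = np.asarray(img, dtype=np.float64)
    h = img.shape[0] // 2 * 2   # drop a trailing odd row, if any
    w = img.shape[1] // 2 * 2   # drop a trailing odd column, if any
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def at_scale(img, scale):
    """Scale 0 is the input image; each further scale halves both sides."""
    out = np.asarray(img, dtype=np.float64)
    for _ in range(scale):
        out = downscale_by_two(out)
    return out

image = np.arange(64, dtype=np.float64).reshape(8, 8)
for s in range(4):
    print(s, at_scale(image, s).shape)
```

At scale 2 the image has one sixteenth of the original pixels, which shrinks the number of feature pairs accordingly and is consistent with the shorter processing time noted above.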
Figure 3-4 An example of intensity-pair distributions with input images at different scales (a) scale 0 (b) scale 1 (c) scale 2 (d) scale 3
Based on the same concept, we can form the RG-pair distribution for the RG channel and the BY-pair distribution for the BY channel. These three feature-pair distributions offer plentiful clues about the global statistics and the local variations of the image contents.
Figure 3-5 An example of the feature-pair distributions (a) Input image (b) Intensity-pair distribution (c) RG color-pair distribution (d) BY color-pair distribution
3.2.1. Clustering
To identify the most common properties in the image, we need to identify the major clusters in the feature-pair distributions. In the feature-pair distributions obtained in the previous section, there are apparent clusters that can easily be told apart by eye.
Existing clustering algorithms appear to be a good tool for segmenting out each cluster. Figure 3-6 shows an example of the intensity-pair distribution processed by the mean-shift clustering algorithm. The resulting clusters are reasonably good.
Unfortunately, these existing clustering algorithms are usually computationally expensive and time-consuming. These disadvantages conflict with our major requirement that the system should avoid complicated computations and should be fast enough for real-time processing and analysis.
Figure 3-6 An example of intensity-pair distribution after mean-shift clustering (a) intensity-pair distribution (b) result of applying the mean-shift clustering algorithm to (a)
3.2.2. 3-D Histogram Representation
To simplify the computations, we choose another approach that operates over the feature-pair distributions directly. In Figure 3-7, we show the 3-D histogram representation of the feature-pair distribution. This 3-D histogram is formed by dividing the x-y plane into a few uniform cells and counting, for each cell, the total number of feature pairs within that cell. We can expect that, in general, most clusters occur around the diagonal line in the 3-D histogram, since most regions in a natural image are smooth. Moreover, the background elements would yield the largest cluster, since the background usually occupies the largest area in the image. In contrast, foreground objects usually correspond to smaller clusters. In addition, those clusters away from the diagonal line correspond to the boundary regions or the texture regions in the image.
Figure 3-7 The 3-D histogram representation of feature-pair distribution.
In this 3-D histogram, we denote the cell at the intersection of the ith column and jth row as C(i,j). We further define a cell to be a “diagonal” cell when |i-j| ≤ Dth, where Dth is a pre-selected threshold. Conversely, a cell is defined as “off-diagonal” if |i-j| > Dth.
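The cell counting and the diagonal test can be sketched as below. The number of cells (16), the value range (0 to 256), and the names `pair_histogram` and `diagonal_mask` are illustrative assumptions; the text only states that the plane is divided into a few uniform cells.

```python
import numpy as np

def pair_histogram(centers, neighbors, n_cells=16, value_range=(0, 256)):
    """Count the feature pairs that fall into each uniform cell.
    hist[i, j] holds the pairs whose first value lies in bin i and
    whose second value lies in bin j."""
    hist, _, _ = np.histogram2d(centers, neighbors, bins=n_cells,
                                range=[value_range, value_range])
    return hist.astype(np.int64)

def diagonal_mask(n_cells, d_th):
    """True for diagonal cells (|i - j| <= d_th), False for off-diagonal."""
    i, j = np.indices((n_cells, n_cells))
    return np.abs(i - j) <= d_th

# Three hypothetical feature pairs: one smooth pair and two edge pairs.
hist = pair_histogram([10, 10, 200], [12, 200, 10])
mask = diagonal_mask(16, d_th=1)
print(hist[0, 0], hist[0, 12], hist[12, 0], mask[0, 0], mask[0, 12])
```

With 16 cells over 256 levels, each cell spans 16 feature values, so the pair (10, 12) lands in a diagonal cell while (10, 200) and (200, 10) land in mirrored off-diagonal cells.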
3.3. Map-Weighting Algorithm
One of the reasons why we use feature-pair distributions is that a straightforward weighting strategy can be applied to them to form the saliency map. We can expect that, in general, the background elements would yield the largest clusters, since backgrounds usually occupy most of the scene in the spatial domain. In contrast, foreground objects usually occupy a smaller area of the image. Hence, foreground objects, or salient regions, will form smaller clusters than the background. As for the clusters away from the 45-degree line, they apparently represent edge clusters, since there is a strong difference between the central pixel and the neighboring pixel of each pair.
In our approach, instead of using a clustering algorithm, we build a map-weighting algorithm to directly weigh the saliency degree of the cells in the 3-D histogram. This weighting algorithm contains two main parts: the “contrast weight”, which gathers the information concealed in off-diagonal cells, and the “distinction weight”, which determines how likely a diagonal cell is to contain the intensity pairs from a visually salient region. A simple structure of the algorithm is shown in Figure 3-8.
Figure 3-8 A simple structure of map-weighting algorithm
3.3.1. Contrast Weight
As mentioned above, the clusters over the off-diagonal cells correspond to the boundary portions or texture portions in the image. The farther an off-diagonal cell is from the diagonal line, the stronger the contrast exhibited by the intensity pairs within that cell. Intuitively, we may use this kind of information to estimate how likely a region is to attract observers’ attention.
Here, we give an example to explain the calculation of this contrast weight.
Given a smooth region R0 with the feature value f0, this region would correspond to a cluster in the cell that contains the feature pair (f0, f0). If this smooth region has a surrounding region R1 with the feature value f1, we expect there is a cluster at the cell containing the pair (f0, f1) and a cluster at the cell containing the pair (f1, f0). If these two cells are far away from the diagonal line, then there should be a strong contrast between R0 and R1. Moreover, if these two off-diagonal cells contain a large number of feature pairs, it means R0 may share a long boundary with R1.
Figure 3-9 illustrates the concept of contrast weight. In Figure 3-9, lines a to d represent four different profiles of the pair-distribution map on the left of Figure 3-9. In each profile, there is a diagonal cluster, represented by the yellow-green block, together with several off-diagonal clusters. The farther an off-diagonal cluster is from the diagonal line, the larger the weight we assign to it, as represented by the light blue curves in Figure 3-9.
Figure 3-9 Illustration of concept of contrast weight
Figure 3-10 Viewpoint of contrast weight
Hence, given an off-diagonal cell C(i,j), we define its self-contrast-weight as
$self\_contrast\_weight(i,j) = hist(i,j) \cdot |i-j|^{2}$    Eq. 3-8
where hist(i,j) denotes the total number of intensity pairs in the cell C(i,j). Here, we take the square of |i-j| to emphasize those cells far away from the diagonal line.
With the definition of self-contrast-weight for off-diagonal cells, we further define the contrast weight for diagonal cells, which are the cells C(i,j) with |i-j| ≤ Dth. In our algorithm, Dth is chosen to be a small constant. Here, for a diagonal cell at C(i,j), we define its contrast weight as
$contrast\_weight(i,j) = \sum_{\forall k:\ |k-j| > D_{th}} self\_contrast\_weight(k,j)$    Eq. 3-9
That is, we sum up the self-contrast-weights for all the cells along the ith column or along the jth row.
From the 3-D histogram, we have the value of each cell, as shown in Figure 3-11. The number shown in each square represents the height, that is, the number of feature pairs, of the corresponding histogram cell. Figure 3-11 gives an example of the calculation of the contrast weight. In this case, we set Dth = 1. For each diagonal cell, we check all of its horizontal neighbors. For example, in Figure 3-11, the white cell in the red rectangular area has the contrast weight 2830, which is computed as $5 \times 8^2 + 20 \times 7^2 + 15 \times 6^2 + 30 \times 5^2 + 15 \times 4^2 = 2830$, and the white cell in the green area has the contrast weight $15 \times 6^2 + 30 \times 5^2 + 15 \times 4^2 = 1530$. Note that, since Dth = 1, the cells next to the white cell are also considered diagonal cells and are not included in the computation of the contrast weight. Moreover, for these cells next to the white cells, the contrast weights are computed in the same way as that of the white cell. After every horizontal line is scanned, we obtain the contrast weight for every diagonal cell.
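The two worked values above can be reproduced with a short sketch of the contrast-weight computation. The function name `contrast_weights` and the 9×9 layout are hypothetical: Figure 3-11 is not reproduced here, so the nonzero cells are simply placed at the distances from the diagonal that the arithmetic implies. Letting every diagonal cell on a horizontal line share that line's sum is one reading of the text and should be treated as an assumption.

```python
import numpy as np

def contrast_weights(hist, d_th):
    """Contrast weight of each diagonal cell: the sum, along its
    horizontal line, of hist * |i - j|^2 over the off-diagonal cells."""
    hist = np.asarray(hist, dtype=np.int64)
    n = hist.shape[0]
    i, j = np.indices((n, n))       # i: line (row) index, j: position in line
    dist = np.abs(i - j)
    off_diag = dist > d_th
    # self-contrast-weight, defined only for off-diagonal cells
    self_w = np.where(off_diag, hist * dist ** 2, 0)
    line_sum = self_w.sum(axis=1)   # one sum per horizontal line
    # diagonal cells take their line's sum; off-diagonal cells get 0
    return np.where(off_diag, 0, line_sum[:, None])

# Hypothetical layout: counts placed 8, 7, 6, 5, 4 cells away from the
# diagonal cell of line 8, and 6, 5, 4 cells away on line 6.
hist = np.zeros((9, 9), dtype=np.int64)
hist[8, 0:5] = [5, 20, 15, 30, 15]
hist[6, 0:3] = [15, 30, 15]
cw = contrast_weights(hist, d_th=1)
print(cw[8, 8], cw[6, 6])
```

Squaring the distance makes a handful of pairs far from the diagonal count for more than many pairs close to it, which is the emphasis described for the self-contrast-weight.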
Figure 3-11 Example of 2-D view of 3-D histogram
3.3.2. Distinction Weight
After the estimation of the contrast distribution, we further take into account the phenomenon that humans tend to pay less attention to the common regions in the image. Hence, besides the contrast weight, we add one more weight, named the distinction weight, for the diagonal cells. This distinction weight is calculated in an iterative manner.
3.3.2.1. First Iteration
At the beginning, the diagonal cell with the largest value of hist is identified.
Assume this cell is at C(i1,j1) and its hist value is denoted as hist(i1,j1). The distinction weight of this cell is defined as
$distinction\_weight(i_1,j_1) = \frac{1}{hist(i_1,j_1)}$    Eq. 3-10
A sample 3-D histogram for the first iteration is shown in Figure 3-12, where max_hist = hist(i1,j1). The identified cell C(i1,j1) typically corresponds to the image background, the most common portion of the image. Hence, by taking the reciprocal of hist(i1,j1), these background portions are assigned a lower value of distinction. That is, the most common portion of the image is expected to be less visually salient to the observers.
Figure 3-12 An example of distinction weight for first iteration
Since we have kept the coordinates of all the feature pairs within each cell, the weights of the diagonal cells can be mapped back to the spatial domain easily. At the same time, after the identification of the largest peak in the 3-D histogram, the value of the maximal peak is set to zero in order to run the next iteration.
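One possible way to map cell weights back to the spatial domain is sketched below. The text does not specify how the weights of the several pairs at one pixel are combined, so averaging them is an assumption made only for illustration; the function name `map_weights_to_image`, the cell size, and the toy weights are likewise hypothetical. Instead of storing coordinates, this sketch recomputes the cell of each pair from the pixel values, which gives the same correspondence.

```python
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def map_weights_to_image(channel, cell_weight, cell_size):
    """Per-pixel score: mean weight of the histogram cells hit by the
    pixel's 8-neighbor pairs (averaging is an assumption).
    Requires an image of at least 2 x 2 pixels."""
    cells = np.asarray(channel, dtype=np.intp) // cell_size
    h, w = cells.shape
    total = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    for dy, dx in OFFSETS:
        y0, y1 = max(0, -dy), h - max(0, dy)
        x0, x1 = max(0, -dx), w - max(0, dx)
        c = cells[y0:y1, x0:x1]                       # cell of center pixel
        n = cells[y0 + dy:y1 + dy, x0 + dx:x1 + dx]   # cell of neighbor
        total[y0:y1, x0:x1] += cell_weight[c, n]
        count[y0:y1, x0:x1] += 1
    return total / count

# Toy data: flat background of value 10 with one bright pixel of value 200.
channel = np.full((3, 3), 10)
channel[1, 1] = 200
weights = np.zeros((16, 16))
weights[0, 0] = 1.0     # hypothetical weight of the background cell
weights[12, 12] = 5.0   # hypothetical weight of the bright-value cell
saliency = map_weights_to_image(channel, weights, cell_size=16)
print(saliency)
```

Only diagonal cells carry a nonzero weight in this toy example, so a pair that straddles an edge contributes nothing to either of its pixels.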
3.3.2.2. The Second Iteration
After the largest cluster of the 3-D histogram has been identified, we search for the second largest cluster. Assume the second largest cluster is identified at the cell C(i2,j2). Its distinction weight takes into account the distance d between C(i2,j2) and the largest peak C(i1,j1), for the reason that the regions corresponding to C(i2,j2) will not be visually salient to the observer if their feature values are too close to the feature value of the background.
For example, in Figure 3-13, we have three candidate cells with the same number of feature pairs. These three cells lie at three different distances, denoted as d1, d2, and d3, from the largest peak in the histogram. Clearly, d3 is the largest of the three. Hence, the order of the corresponding self_distinction_weights of these three candidate cells would be (3) > (2) > (1). In this case, we pick the farthest cell as the second largest cell. In Figure 3-14, we illustrate the calculation of the distinction weight for the second largest cluster in the 3-D histogram.
Figure 3-13 Discussion of the influence of d
Figure 3-14 An example of distinction weight for second iteration
Similarly, since we have kept the coordinates of all the feature pairs within each cell, the weights of the diagonal cells can be mapped back to the spatial domain. After the identification of the second largest peak in the 3-D histogram, the value of that peak is set to zero in order to run the next iteration.
3.3.2.3. The Next Iteration
After the identification of C(i2,j2), we keep searching for the next largest cluster, C(i3,j3), and define its distinction weight in a similar manner.
The same process is repeated until hist(ik,jk) falls below a pre-defined threshold Hth, which is used to ignore small regions and suppress noise interference.
3.3.2.4. Stop Condition
Noise is always a key problem in image processing, and saliency region detection is no exception. If an input image is corrupted by noise, the noise may degrade the saliency detection result. As for trivial objects, such as the black-circle region in Figure 3-15(a), a trivial region will obviously produce only a small cluster in the feature-pair distribution. In order to suppress noise interference and avoid the detection of such trivial objects, a threshold should be chosen appropriately to stop the iterative search of histogram peaks. In the example shown in Figure 3-16, if we set the threshold Hth to 20, the histogram cells with values below 20 will not go through the algorithm.
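The iterative peak search and its stop condition can be sketched as follows. This sketch covers only the search order and the first-iteration weight (the reciprocal of the largest hist value); the weights of later iterations, which also depend on the distance to the largest peak, and the tie-breaking rule that prefers the farthest cell are deliberately left out. The function name `find_diagonal_peaks` and the toy histogram are hypothetical.

```python
import numpy as np

def find_diagonal_peaks(hist, d_th, h_th):
    """Return [(i, j, count), ...] for diagonal cells in decreasing order
    of count, stopping once the largest remaining count is below h_th."""
    work = np.array(hist, dtype=np.int64)      # working copy
    n = work.shape[0]
    i, j = np.indices((n, n))
    work[np.abs(i - j) > d_th] = 0             # ignore off-diagonal cells
    peaks = []
    while True:
        idx = np.unravel_index(np.argmax(work), work.shape)
        value = int(work[idx])
        if value == 0 or value < h_th:         # stop condition (Hth)
            break
        peaks.append((int(idx[0]), int(idx[1]), value))
        work[idx] = 0                          # zero the peak, then iterate
    return peaks

# Toy histogram: three sizeable diagonal clusters, one small diagonal
# cluster, and one large off-diagonal (edge) cluster.
hist = np.zeros((4, 4), dtype=np.int64)
hist[0, 0], hist[1, 1], hist[2, 2], hist[3, 3] = 100, 30, 10, 25
hist[0, 3] = 500
peaks = find_diagonal_peaks(hist, d_th=1, h_th=20)
first_weight = 1.0 / peaks[0][2]   # first-iteration distinction weight
print(peaks, first_weight)
```

With Hth = 20, the diagonal cell holding only 10 pairs never enters the iteration, and the large off-diagonal cluster is skipped because the search is restricted to diagonal cells.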
Figure 3-15 Sample highlight region of trivial region and noise interference (a) Trivial region (b) With noise interference
Figure 3-16 An example of the pre-defined threshold Hth set to 20
3.3.2.5. Edge Condition
In the above procedure, we only search for histogram peaks over diagonal cells and ignore all off-diagonal cells, where a cell C(i,j) is defined to be a diagonal cell if |i-j| ≤ Dth. This is because an off-diagonal cluster is usually small and only corresponds to the boundary of some region in the image. Since we aim to detect visually salient regions rather than their boundaries, we only need to focus on diagonal cells. The threshold Dth determines how far a cell may depart from the 45-degree line and still be treated as a diagonal cell. In Figure