Chapter 2. Previous Work
2.3. Pattern Recognition-based Method
2.3.2. Recovering Major Occlusion Boundaries
Single-view 3D reconstruction is a popular research topic in computer vision. Although existing methods are not ready for real-time applications because of their high computational complexity, the quality of their results is good enough to be useful. The algorithm of Hoiem et al. [4] describes the properties of the regions and boundaries in the image, and the 3D surfaces of the scene, using learned models. Their representation includes a wide variety of cues: color, position, and alignment of regions; strength and length of boundaries; 3D surface orientation estimates; and depth estimates. In a conditional random field (CRF) model, they also encode Gestalt cues, such as continuity and closure, and enforce consistency between their surface and boundary labels.
To provide an initial conservative hypothesis of the occlusion boundaries, they apply the watershed segmentation algorithm to the soft boundary map provided by the Pb algorithm of Martin et al. [26]. This segmentation produces thousands of regions that preserve nearly all true boundaries. In training, they assign ground truth to this initial hypothesis. Given a new image, their task is to group the small initial regions into objects and to assign figure/ground labels to the remaining boundaries.
To get a final solution, they could simply compute cues over each region and boundary, and perform a single segmentation and labeling step. However, the small regions from the initial over-segmentation do not allow the more complicated cues, such as depth, to be reliable. Furthermore, global reasoning with these initial boundaries is ineffective because most of them are spurious texture edges.
Their solution is to gradually evolve the segmentation by iteratively computing cues over the current segmentation and using them with their learned models to merge regions that are likely to be part of the same object. In each iteration, the growing regions provide better spatial support for complex cues and global reasoning, which in turn improves their ability to determine whether the remaining boundaries are likely to be caused by occlusions. See Fig. 2.3 for an illustration. Each iteration consists of three steps based on the image and the current segmentation: 1) compute cues; 2) assign confidences to boundaries and regions; and 3) remove weak boundaries, forming larger regions for the next segmentation.
Fig 2.3. Illustration of the recovering major occlusion boundaries algorithm. [4]
In most cases, this method can convert 2D images into 3D images, but it is not suitable for real-time applications. Their Matlab implementation takes about 4 minutes for a 600x800 image on a 64-bit 2.6 GHz Athlon running Linux.
Chapter 3. 3D Image Construction from 2D Image
3.1. Algorithm Overview
Fig. 3.1. Flow of the proposed 2D to 3D conversion system.
In this chapter, we propose a fast and effective 2D to 3D conversion algorithm based on the pattern recognition approach. Fig. 3.1 illustrates the flow of the 2D to 3D conversion system, which consists of three main processes: object-based segmentation, depth assignment, and 3D image construction.
For the object-based segmentation, we first use the watershed segmentation algorithm to compute the initial segmentation. Although watershed segmentation preserves object boundaries well, it suffers from over-segmentation and sensitivity to noise. At the second step, a fast neighbor merge process is therefore used to reduce the over-segmentation produced by the watershed segmentation. At the third step, we use the surface layout algorithm [10] to provide geometric information for object detection. At the fourth step, inspired by the method for recovering occlusion boundaries in [4], we propose the object boundary tracing method to detect objects efficiently. After the object boundary tracing, some object segments are still incomplete. Thus, we perform constraint segmentation, which defines a set of conditions for merging segments. After the constraint segmentation process, the object-based segmentation is complete.
Finally, we assign the depths to the objects, and use the DIBR algorithm [28] to generate the images for left and right eyes.
3.2. Object-based Segmentation
3.2.1 Initial Segmentation
In the proposed 2D to 3D conversion system, a precise estimate of the object boundaries is important, so the choice of image segmentation algorithm matters. Among the existing image segmentation algorithms, we adopt watershed segmentation for two reasons: (1) it preserves edges along object boundaries [37]; (2) it is suitable for fast applications [38].
Fig. 3.2. Flow of the initial segmentation process.
Fig. 3.2 shows the stages of the initial segmentation. The first stage smooths the image to reduce noise. At the second stage, the gradient of the smoothed image is computed using Gaussian derivative filters, and the gradient magnitude is then calculated. At the final stage, the gradient magnitude is thresholded and the watershed transform produces an initial image partition.
3.2.1.1. Noise Reduction and Gradient Computation
At the first stage of the initial segmentation, we use a Gaussian filter to smooth the image slightly before computing the image gradient. To compensate for digitization artifacts, we always use a Gaussian with σ = 0.8. It does not produce any visible change to the image but helps remove artifacts.
At the second stage of the initial segmentation, the gradient field of the smoothed image is computed. The derivative of Gaussian with σ = 1.0 and a support size of 9x9 is used to compute the gradients of the smoothed image, $L_x$ and $L_y$. Finally, the gradient magnitude image G(I) is calculated by the following formula:
$G(I) = \sqrt{L_x^2 + L_y^2}$ (3.1)
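The two stages above can be illustrated with a small Python sketch. It is not the implementation used in this work: the function names, the replicated image borders, and the separable 1-D filtering are assumptions made for illustration, while the values σ = 0.8, σ = 1.0, and the 9-tap support follow the text.

```python
import math

def gaussian_kernels(sigma, radius):
    """Return sampled 1-D Gaussian and derivative-of-Gaussian kernels."""
    xs = range(-radius, radius + 1)
    g = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in xs]
    total = sum(g)
    g = [v / total for v in g]
    # Weights x * g(x) / sigma^2: with the correlation below they respond
    # positively where intensity increases along the filtered axis.
    dg = [x * v / (sigma * sigma) for x, v in zip(xs, g)]
    return g, dg

def filter_rows(img, kernel):
    """Correlate every row with a 1-D kernel, replicating border pixels."""
    radius = len(kernel) // 2
    cols = len(img[0])
    return [[sum(kernel[k + radius] * row[min(max(c + k, 0), cols - 1)]
                 for k in range(-radius, radius + 1))
             for c in range(cols)]
            for row in img]

def filter_cols(img, kernel):
    """Same as filter_rows, applied along the vertical axis."""
    transposed = [list(col) for col in zip(*img)]
    filtered = filter_rows(transposed, kernel)
    return [list(col) for col in zip(*filtered)]

def gradient_magnitude(img, smooth_sigma=0.8, deriv_sigma=1.0, radius=4):
    """Smooth the image, then combine L_x and L_y as in Eq. (3.1)."""
    g0, _ = gaussian_kernels(smooth_sigma, radius)
    smoothed = filter_cols(filter_rows(img, g0), g0)
    g, dg = gaussian_kernels(deriv_sigma, radius)   # radius 4 gives 9 taps
    lx = filter_cols(filter_rows(smoothed, dg), g)  # horizontal derivative
    ly = filter_rows(filter_cols(smoothed, dg), g)  # vertical derivative
    return [[math.hypot(a, b) for a, b in zip(ra, rb)]
            for ra, rb in zip(lx, ly)]
```

Here `img` is a list of rows of gray-level values. On a vertical step edge the response peaks at the two columns next to the edge and falls off away from it.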
3.2.1.2. Watersheds Segmentation
In this stage, an initial partition of the image into primitive regions is obtained using the image gradient magnitude and the watershed algorithm. Watershed segmentation is a popular and well-known algorithm that extracts regions as catchment basins, based on a topographic interpretation of the image. The gradient image of the input image is used as the topographic surface, in which the gradient value represents the altitude. The segmentation finds the watershed lines on the gradient image and thus separates the regions. In the following, we briefly describe the parallel watershed transform proposed by Giovani et al. [29].
The algorithm is composed of four major steps: (1) finding the lowest neighbor of each pixel (i.e. the direct path of steepest descent); (2) finding the nearest border of internal pixels of plateaus; (3) propagating uniformly from the borders; and (4) labeling minima by maximal neighbor address and labeling pixels by flooding from the minima. Fig. 3.3 presents the parallel watershed transform, where I is the input image and lab is the output labeled image, which is also used for storing addresses. The statement “for all” denotes that all iterations can be processed in parallel.
Fig. 3.3. Pseudo code of the parallel watershed transform [29].
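To make the idea of the first step concrete, the following Python sketch labels pixels by following the path of steepest descent. It is a simplified serial illustration, not the parallel algorithm of [29]: it uses 8-connectivity, skips the plateau handling of steps (2) and (3) so that every pixel without a strictly lower neighbor becomes its own minimum, and the function name `steepest_descent_labels` is hypothetical.

```python
def steepest_descent_labels(height):
    """Label each pixel with the minimum reached by steepest descent."""
    rows, cols = len(height), len(height[0])

    def lowest_neighbor(r, c):
        # Return the strictly lowest 8-neighbor, or the pixel itself if none.
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and height[nr][nc] < height[best[0]][best[1]]):
                    best = (nr, nc)
        return best

    # Pointer of each pixel to its lowest neighbor (first step).
    parent = {(r, c): lowest_neighbor(r, c)
              for r in range(rows) for c in range(cols)}

    labels = [[0] * cols for _ in range(rows)]
    minima = {}
    for r in range(rows):
        for c in range(cols):
            p = (r, c)
            # Heights strictly decrease along pointers, so this terminates.
            while parent[p] != p:
                p = parent[p]
            labels[r][c] = minima.setdefault(p, len(minima) + 1)
    return labels
```

For the one-row profile `[[1, 2, 3, 2, 1]]` the two end pixels are minima, and the sketch assigns the first three pixels to one basin and the last two to the other.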
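The watershed transform is applied to the thresholded gradient magnitude image $G_T$, where the pixels of G having a value smaller than a given threshold T are set to zero. That is,
$G_T(x, y) = \begin{cases} G(x, y), & \text{if } G(x, y) \ge T \\ 0, & \text{otherwise} \end{cases}$ (3.2)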
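As a small worked example of this thresholding, the sketch below zeroes every gradient value smaller than T. The function name `threshold_gradient` and the list-of-rows image representation are assumptions for illustration.

```python
def threshold_gradient(gradient, threshold):
    """Set gradient values smaller than the threshold to zero."""
    return [[value if value >= threshold else 0.0 for value in row]
            for row in gradient]
```

With T = 0.3, the input `[[0.1, 0.5], [0.3, 0.05]]` becomes `[[0.0, 0.5], [0.3, 0.0]]`, so weak gradients inside homogeneous areas collapse into a common zero level.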
Due to thresholding, many of the regional minima of G located in homogeneous regions are replaced by fewer zero-valued regional minima in $G_T$. This slightly limits the size of the initial image partition and helps prevent over-segmentation in homogeneous regions. Fig. 3.4 shows the results of the initial segmentation process.
Fig. 3.4. The results of the initial segmentation process. (a) original image. (b) gradient image. (c) initial segmentation.
3.2.2 Fast neighbor merge
Even after the above over-segmentation reduction, there remain neighboring regions that should be merged to obtain a meaningful segmentation. The fast neighbor merge method is used to guarantee that segments are large enough.
Fig. 3.5 shows the stages of the fast neighbor merge method. The first stage computes the cues, which are color and texture. At the second stage, we use those cues to decide whether neighboring segments should be merged.
Fig. 3.5. Flow of the Fast neighbor merge method.
3.2.2.1 Cues Computation
Fig. 3.6. Illustration of the HSV color space.
In the fast neighbor merge algorithm, a precise estimate of color distance is important, so the choice of color space matters. We use the Hue-Saturation-Value (HSV) color space [30] because it is close to the human perception of colors. Fig. 3.6 illustrates the HSV color space.
Conceptually, the HSV color space is a cone. Viewed from the circular side of the cone, the hue is represented by the angle of each color relative to the 0° line, which is traditionally assigned to red. The saturation is represented as the distance from the center of the circle. Highly saturated colors are on the outer edge of the cone, whereas gray tones (which have no saturation) are at the center. The value is determined by the color's vertical position in the cone. At the pointed end of the cone, there is no brightness, so all colors are black. At the wide end of the cone are the brightest colors.
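The colors are first transformed from RGB to HSV color space. The color measure between two colors $(h_i, s_i, v_i)$ and $(h_j, s_j, v_j)$ in HSV color space is then given by the formula [31]
$a_{i,j} = 1 - \frac{1}{\sqrt{5}} \left[ (v_i - v_j)^2 + (s_i \cos h_i - s_j \cos h_j)^2 + (s_i \sin h_i - s_j \sin h_j)^2 \right]^{1/2}$ (3.8)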
For every segment, we compute the average RGB value and transform it to HSV color space. Then, we compute the color difference between every pair of neighboring segments.
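The color cue can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: the standard-library `colorsys` conversion stands in for the RGB to HSV equations, the similarity is the Euclidean distance in the HSV cone normalized by √5 and subtracted from one, and taking one minus that similarity as the color difference is an assumption. All function names are hypothetical.

```python
import colorsys
import math

def mean_rgb(pixels):
    """Average a list of (r, g, b) tuples with components in [0, 1]."""
    n = len(pixels)
    return tuple(sum(p[k] for p in pixels) / n for k in range(3))

def hsv_similarity(rgb_a, rgb_b):
    """Similarity in the HSV cone: 1 for identical colors, smaller otherwise."""
    h1, s1, v1 = colorsys.rgb_to_hsv(*rgb_a)  # hue is returned in [0, 1)
    h2, s2, v2 = colorsys.rgb_to_hsv(*rgb_b)
    a1, a2 = 2.0 * math.pi * h1, 2.0 * math.pi * h2  # hue as an angle
    dist = math.sqrt((v1 - v2) ** 2
                     + (s1 * math.cos(a1) - s2 * math.cos(a2)) ** 2
                     + (s1 * math.sin(a1) - s2 * math.sin(a2)) ** 2)
    return 1.0 - dist / math.sqrt(5.0)

def color_difference(pixels_a, pixels_b):
    """Assumed cost term: one minus the similarity of the segment means."""
    return 1.0 - hsv_similarity(mean_rgb(pixels_a), mean_rgb(pixels_b))
```

For example, black and white differ only in value, so their similarity is 1 - 1/√5, whereas fully saturated red and cyan lie on opposite sides of the cone and give 1 - 2/√5.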
Another cue is texture. Similarly to color, texture provides a cue for the geometric class of a segment through its relationship to materials and objects in the world.
To represent texture, we apply a subset of the filter bank designed by Leung and Malik [32]. We generated the filters with the following parameters: 19x19 pixel support, a scale of √2 for the oriented and blob filters, and 6 orientations. The filter bank consists of 6 edge filters, 6 bar filters, 1 Gaussian filter, and 2 Laplacian of Gaussian filters.
We compute the histogram (over the pixels within a segment) of maximum responses. Then, we compute the symmetrized Kullback-Leibler divergence between the histograms of every pair of neighboring segments.
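The texture comparison can be illustrated with a short Python sketch. The definition used here, KL(p‖q) + KL(q‖p) without a factor of one half, and the small constant `eps` added to every bin to avoid division by zero are assumptions; the text does not specify either detail.

```python
import math

def symmetrized_kl(hist_p, hist_q, eps=1e-10):
    """Symmetrized Kullback-Leibler divergence between two histograms."""
    def normalize(hist):
        # Add eps to each bin so empty bins do not produce log(0).
        smoothed = [count + eps for count in hist]
        total = sum(smoothed)
        return [value / total for value in smoothed]

    p, q = normalize(hist_p), normalize(hist_q)
    # KL(p||q) + KL(q||p) simplifies to sum((p - q) * log(p / q)).
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical histograms give zero, and the value grows as the two response distributions diverge. For the histograms `[1, 1]` and `[1, 3]` the result is approximately 0.25·ln 3.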
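Finally, we compute the cost function E, which combines the color and texture information of every pair of neighboring segments i and j, by the formula
$E(i, j) = \alpha\, E_c(i, j) + \beta\, E_t(i, j)$ (3.9)
where $E_c$ is the color term obtained from Eq. (3.8), $E_t$ is the texture term given by the symmetrized Kullback-Leibler divergence, and $\alpha$, $\beta$ are the weighting factors that control the amount of each energy.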
3.2.2.2 Neighbor Merge
In this stage, we use connected components to merge segments. Connected components is one of the simplest methods of image segmentation. During the connected components process, two neighboring segments are merged if their cost E is smaller than a threshold T. The key parameter of the process is this threshold T. We use the following iterative method to determine it:
1. An initial threshold T is chosen.
2. If the cost of two neighboring segments is smaller than the threshold T, the two segments are merged.
3. Increase the threshold T.
4. Go back to step 2 with the updated threshold T. Repeat until the number of segments is smaller than a constant NS = 1000.
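Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function name `merge_until`, the edge-list input, the fixed increment `t_step`, and the union-find bookkeeping are assumptions, and the costs are kept fixed instead of being recomputed from the merged segments.

```python
def merge_until(n_segments, edges, t_init, t_step, max_segments):
    """Merge neighbors with cost below a rising threshold.

    edges is a list of (i, j, cost) tuples for neighboring segments.
    Returns (labels, count): one representative label per segment and
    the number of segments that remain.
    """
    parent = list(range(n_segments))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    count = n_segments
    threshold = t_init                                  # step 1
    max_cost = max((c for _, _, c in edges), default=t_init)
    while count >= max_segments:
        for i, j, cost in edges:                        # step 2
            if cost < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri
                    count -= 1
        if threshold > max_cost:
            break  # every edge has been merged; nothing more can change
        threshold += t_step                             # steps 3 and 4
    return [find(i) for i in range(n_segments)], count
```

For a chain of four segments with costs 0.1, 0.5, and 0.9, `merge_until(4, edges, 0.2, 0.2, 3)` merges the first three segments and leaves two segments in total. The extra stopping test guards against inputs whose neighbor graph cannot be reduced below `max_segments`.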
Fig. 3.7 shows the results of the fast neighbor merge process.
Fig. 3.7. The results of the fast neighbor merge process. (a) Original image. (b) Initial segmentation. (c) The result of this process.
3.2.3 Surface Layout
Fig. 3.8. Surface layout [27]. On these images and elsewhere, main class labels are indicated by colors (green=support, red=vertical, blue=sky) and subclass labels are indicated by markings (left/up/right arrows for planar left/center/right, ‘O’ for porous, ‘X’ for solid).
The surface layout proposed in [27] labels the image with geometric classes, which coarsely describe the 3D scene orientation of each image region, as shown in Fig. 3.8. Every region in the image is categorized into one of three main classes: “support”, “vertical”, and “sky”. Support surfaces are parallel to the ground and could potentially support a solid object. Vertical surfaces are solid surfaces that are too steep to support an object. The sky is the image region corresponding to the open air and clouds. The vertical class is further divided into five subclasses: planar surfaces facing to the “left”, “center”, or “right” of the viewer, and non-planar surfaces that are either “porous” or “solid”.
We believe that the surface layout representation provides useful information for detecting objects in the image. Fig. 3.9 shows the stages of the surface layout. First, the image is partitioned into many superpixels, and we compute cues for each superpixel. To obtain better results, multiple segmentations are used; the same-label likelihood is computed as the cost for merging superpixels into segments. After the multiple segmentations are produced, the homogeneity likelihood is computed for each segment and is used to determine whether the segment is homogeneous. The label likelihood is also computed for each segment and superpixel to determine which category it belongs to. Finally, Bayes' theorem is applied to the label likelihood and the homogeneity likelihood to compute the label confidence for each superpixel. We briefly describe the stages in the following sections.
Fig. 3.9. Flow of the surface layout.
3.2.3.1 Superpixels
The use of superpixels improves the computational efficiency of our algorithm and allows complex statistics to be computed, which enhances our knowledge of the image structure. Unlike the original algorithm in [34], we use our initial segmentation as the superpixels.
3.2.3.2 Cues computation
To determine which orientation is most likely, we need to use all of the available cues: location, color, texture, and perspective. In Table 3.1, we list the set of statistics used for classification.
Table 3.1. Statistics computed to represent superpixels [27]
Surface Cues
Location
L1. Location: normalized x and y, mean
L2. Location: normalized x and y, 10th and 90th pctl
L3. Location: normalized y wrt estimated horizon, 10th, 90th pctl
L4. Location: whether segment is above, below, or straddles estimated horizon
L5. Shape: number of superpixels in segment
L6. Shape: normalized area in image
Color
C1. RGB values: mean
C2. HSV values: C1 in HSV space
C3. Hue: histogram (5 bins)
C4. Saturation: histogram (3 bins)
Texture
T1. LM filters: mean absolute response (15 filters)
T2. LM filters: histogram of maximum responses (15 bins)
Perspective
P1. Long Lines: (number of line pixels)/sqrt(area)
P2. Long Lines: percent of nearly parallel pairs of lines
P3. Line Intersections: histogram over 8 orientations, entropy
P4. Line Intersections: percent right of image center
P5. Line Intersections: percent above image center
P6. Line Intersections: percent far from image center at 8 orientations
P7. Line Intersections: percent very far from image center at 8 orientations
P8. Vanishing Points: (num line pixels with vertical VP membership)/sqrt(area)
P9. Vanishing Points: (num line pixels with horizontal VP membership)/sqrt(area)
P10. Vanishing Points: percent of total line pixels with vertical VP membership
P11. Vanishing Points: x-pos of horizontal VP - segment center (0 if none)
P12. Vanishing Points: y-pos of highest/lowest vertical VP wrt segment center
P13. Vanishing Points: segment bounds wrt horizontal VP
P14. Gradient: x, y center of mass of gradient magnitude wrt segment center
3.2.3.3 Same-label Likelihoods
Same-label likelihoods are learned from training images. The same-label classifier outputs an estimate of $P(y_i = y_j \mid I)$ for the adjacent superpixels i and j and image data I. Here $y_i$ and $y_j$ are the superpixel labels. The same-label classifier is based on cue set L1, L6, C1-C4, and T1-T2 in Table 3.1. In Table 3.2 we list the set of statistics used for computing same-label likelihoods.
Table 3.2. Statistics computed over pairs of superpixels
Boundary cues
Location
the absolute differences of the pixel location values x and y
Color
C1. the absolute differences of the mean RGB
C2. the absolute differences of the mean HSV
C3. the symmetrized Kullback-Leibler divergence of the hue
C4. the symmetrized Kullback-Leibler divergence of the saturation
Texture
T1. the absolute differences of the mean LM filter response
T2. the symmetrized Kullback-Leibler divergence of texture histogram
Shape
S1. the ratio of the area
S2. the fraction of the boundary length divided by the perimeter of the smaller superpixel
S3. the straightness of the boundary
3.2.3.4 Multiple Segmentations
The increased spatial support of superpixels provides much better classification performance than for pixels. Large regions are required to effectively use the more complex cues. We need to compute multiple segmentations and then use the increased spatial support provided by each segment to better evaluate its quality. This method is based on pairwise same-label likelihoods. A diverse sampling of segmentations is produced by varying the number of segments $n_s$ and using a random initialization.
3.2.3.5 Label Likelihood Computation
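The label classifier is used to distinguish among the main classes and the subclasses, and it is based on all of the listed cues. The label classifier outputs an estimate of $P(y_j = v \mid I, h_j)$ for the segment j, where $y_j$ is the segment label, v is a candidate label, I is the image data, and $h_j$ denotes the event that segment j is homogeneous.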
3.2.3.6 Homogeneity Likelihood Computation
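The homogeneity classifier is used to determine whether a segment has a single label or is mixed, and it is based on all of the listed cues. The homogeneity classifier outputs an estimate of $P(h_j \mid I)$ for the segment j, where $h_j$ denotes the event that segment j is homogeneous and I is the image data.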
Fig. 3.10. The result of the confidence images for each of the surface labels.
3.2.3.7 Label Confidences Computation
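In the final stage, we compute the label confidence for each superpixel i using the following formula:
$C(y_i = v \mid I) = \sum_{j} P(y_j = v \mid I, h_j)\, P(h_j \mid I)$ (3.10)
where the sum runs over the segments j that contain superpixel i, $P(y_j = v \mid I, h_j)$ is the label likelihood of segment j, and $P(h_j \mid I)$ is its homogeneity likelihood.
Fig. 3.10 shows the result of the confidence images for each of the surface labels.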
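The combination of label likelihoods and homogeneity likelihoods can be sketched in Python as below. The data layout, the function name `label_confidences`, and the final rescaling so that the confidences of each superpixel sum to one are assumptions made for illustration; the classifiers that produce the likelihoods are not shown.

```python
def label_confidences(n_superpixels, segments, labels):
    """Accumulate label likelihood times homogeneity likelihood.

    segments is a list of (members, homogeneity, label_likelihood) tuples:
    members lists the superpixel indices in the segment, homogeneity is the
    homogeneity likelihood, and label_likelihood maps labels to likelihoods.
    """
    conf = [{v: 0.0 for v in labels} for _ in range(n_superpixels)]
    for members, homogeneity, label_likelihood in segments:
        for i in members:
            for v in labels:
                conf[i][v] += label_likelihood.get(v, 0.0) * homogeneity
    # Assumption: rescale so each superpixel's confidences sum to one.
    for c in conf:
        total = sum(c.values())
        if total > 0.0:
            for v in labels:
                c[v] /= total
    return conf
```

As a usage example, suppose superpixel 0 belongs to a two-superpixel segment (homogeneity 0.8, equally likely “sky” or “vertical”) and to a one-superpixel segment (homogeneity 0.2, certainly “sky”). Its confidences are then 0.6 for “sky” and 0.4 for “vertical”.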
3.2.4 Object Boundary Tracing Method
Many features can be used to detect object boundaries: adjacent regions have different colors or textures, or are misaligned; boundaries are long and smooth, with strong color or texture gradients; and two adjacent regions have different 3D surface characteristics.
So far, we have extracted many features that could be used to detect objects, but how can they be used efficiently? Local methods have difficulty distinguishing the correct boundaries, while global methods have high computational complexity because they require many iterations.
Therefore, we propose an object boundary tracing method to solve this problem. Fig. 3.11 shows the stages of the object boundary tracing method. The first stage is the initial boundary selection, in which obvious object boundaries are labeled using a rule-based method. At the second stage, the rest of the object boundaries are traced from the initial boundaries. At the third stage, segments without an object boundary are merged.
Fig. 3.11. Flow of the object boundary tracing method.
3.2.4.1 Initial Boundary Selection
Many of the features computed in the previous stages can be used to detect object boundaries. Since different situations call for different features, we categorize every object boundary in the image into one of three classes: “gnd-vrt”, “sky-vrt”, and “vrt-vrt”, as listed in Table 3.3. For each class, we use specific features to determine its initial boundaries.
Table 3.3. Features of initial boundary selection.
Class                        Features
for all classes              boundary smoothness
                             edge (color, texture)
for “gnd-vrt” class only     main label likelihood
for “sky-vrt” class only     main label likelihood
for “vrt-vrt” class only     sub-label likelihood
                             if event vrt-gnd-vrt
We use a set of rules to determine the initial boundaries. For example, a boundary of the “sky-vrt” class is selected as an initial object boundary if the following condition is satisfied:
z 1 0.5
0.3 0.3
The condition is expressed in terms of the same-label likelihood and the sky label confidence. Similar conditions are used to detect the other classes of object boundary; the detailed formulas are given in the appendix.
Fig 3.12 shows the result of the initial object boundary selection. The red fragments in the image are selected initial object boundaries.
Fig 3.12. The result of the initial object boundary selection.
3.2.4.2 Object Boundary Tracer
The object boundary tracer starts from an initial object boundary and selects the next object boundary. The selected boundary should have a high edge value, a high label likelihood difference, and the properties of its object boundary class, and the boundary orientation should not change rapidly. This process repeats until the tracer reaches the border of the image or an object boundary that has already been labeled. Fig. 3.13 shows a state of an object tracer in the image domain.
[Fig. 3.13 labels: image axes x and y; current boundary position; next boundary position]
Fig 3.13. A state of an object tracer in image domain
We develop an energy function for the object boundary tracer. The energy function is modeled by three constraints. The first is the boundary tracing constraint, which traces strong boundaries. The second is the different label constraint, which separates different objects. The third is the same label constraint to penalize significant surface label