Depth Estimation using Geometric Models - A Study on 3-D Depth Estimation of Urban Scenes from a Single Image

Chapter 2 Backgrounds

2.2 Depth Estimation using Geometric Models

In contrast to algorithms that attempt to recover absolute depth values using depth perception models, some authors have developed methods that infer only relative depth information and build rough models of the scene geometry.

In [5], Jung et al. used object classification prior to depth value extraction. In their method, image segmentation is performed before object classification. Objects in a single-view image are classified into four types: sky, ground, cubic, and plane.

According to the inferred type, relative depth values are assigned to each object to generate a 3D model. The ground can be regarded as a horizontal plane whose depth value increases toward the position of the vanishing point. The ground depth acts as the base depth map from which the depth of the other types is inferred. Figure 2-5 illustrates the depth assignment for the cubic and plane types. Plane-type objects, such as cars, have a constant depth determined by the position of the bottom of the object. For cubic objects, such as buildings, the depth value varies with the distance from the vanishing point. One result of their algorithm is shown in Figure 2-6.

Figure 2-5 Depth assignment for type PLANE and CUBIC

(a) (b)

Figure 2-6 Experimental result of Jung et al. in [5]. (a) Input image. (b) Depth map.
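The depth assignment described above for the ground and plane types can be illustrated with a minimal sketch. It is not the implementation of [5]: the linear ramp between the bottom row and the vanishing-point row, the function names `ground_depth` and `plane_object_depth`, and the boolean-mask representation of an object are all assumptions made for illustration.

```python
def ground_depth(height, width, vp_row):
    """Relative ground depth per pixel (assumed linear ramp).

    Depth is 0.0 at the bottom image row and 1.0 at the row of the
    vanishing point; rows above the vanishing point are clipped to 1.0.
    Assumes vp_row is strictly above the bottom row.
    """
    bottom = height - 1
    depth = []
    for row in range(height):
        value = (bottom - row) / (bottom - vp_row)
        value = min(max(value, 0.0), 1.0)
        depth.append([value] * width)
    return depth


def plane_object_depth(base_depth, mask):
    """Assign a plane-type object one constant depth.

    The constant is the mean ground depth along the object's bottom row,
    so the object takes the depth of the place where it touches the ground.
    """
    object_rows = [r for r, row in enumerate(mask) if any(row)]
    bottom_row = max(object_rows)
    bottom_values = [
        base_depth[bottom_row][c]
        for c, inside in enumerate(mask[bottom_row])
        if inside
    ]
    constant = sum(bottom_values) / len(bottom_values)
    return [
        [constant if mask[r][c] else base_depth[r][c] for c in range(len(row))]
        for r, row in enumerate(base_depth)
    ]
```

A cubic-type object would instead need a depth that varies across the object with the distance from the vanishing point, which this sketch does not attempt.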

In [10], Barinova et al. focus on the attachment of ground plane and vertical objects. They assume that the urban scene is composed of a flat ground plane with some vertical buildings, whose ground-vertical boundary forms a polyline. Figure 2-7 shows their 3-D model structure.

Figure 2-7 3-D model structure of Barinova et al. [10].

Furthermore, they assume that a man-made environment has regular structures that provide geometric information about the scene, which supports camera calibration, horizon detection and vanishing point estimation. After vanishing point estimation and horizon detection, they use a Conditional Random Field (CRF) model to estimate the parameters of the ground-vertical boundary and thereby infer the orientation of the vertical walls in urban scenes. One example of their results is shown in Figure 2-8.

(a) (b) (c) (d)

Figure 2-8 Experimental result of Barinova et al. in [10]. (a) Input image. (b) Camera calibration, horizon detection and vanishing point estimation. (c) Positions of ground-vertical border along the vertical axis. (d) 3-D model.

In [6, 7], Hoiem et al. argued that people can perceive the depth of a scene once they grasp its overall structure. They exploit the phenomenon of occlusion: in an image, an object that blocks the view of another object is considered to be closer. By recovering the occlusion relationships between objects, the relative depth ordering is determined. Their work can be divided into two parts. First, to understand the geometry of an image, they label the image with geometric classes to form the surface layout of the scene [11]. Second, they use these geometric labels to learn the occlusion boundaries in the image [6, 7]. In the following, we introduce these algorithms proposed by Hoiem et al.

In [11], Hoiem et al. proposed a method to recover the rough surface layout of an outdoor image. To obtain the 3-D structure of the scene, they assign a geometric label to each pixel: ground plane, vertical surface or sky. The vertical surfaces are further subdivided into planar surfaces facing left, right or center, and non-planar surfaces that are either solid or porous. Figure 2-9 shows a classification result. Different colors indicate the main classes (ground, vertical, sky), and the marks indicate the subclass labels of the vertical regions.

Figure 2-9 Geometric labels of Hoiem’s system [11]

In their approach, they use features such as location, color, texture and perspective cues (see Figure 2-10). For the perspective cues, they use vanishing lines to infer the geometric structure. They detect the long lines in the image, and the intersections of these lines are candidate vanishing points. The vanishing points are classified into vertical and horizontal vanishing points. If more pixels in a region contribute to vertical (or horizontal) vanishing points, the region is more likely to be a vertical (or support) surface.

Figure 2-10 The surface cues used in Hoiem’s surface system [11]
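The step of treating intersections of long lines as candidate vanishing points can be sketched with homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. This is only an illustration: the function names, the representation of a line segment as a pair of `(x, y)` endpoints, and the tolerance `eps` are assumptions, and the line detection itself is not shown.

```python
from itertools import combinations


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through image points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(line_a, line_b, eps=1e-9):
    """Intersection of two homogeneous lines, or None if they are parallel.

    Parallel image lines meet at infinity (third coordinate near zero).
    The tolerance is not scale-normalized; it is a simplification.
    """
    x, y, w = cross(line_a, line_b)
    if abs(w) < eps:
        return None
    return (x / w, y / w)


def candidate_vanishing_points(segments):
    """All finite pairwise intersections of the lines through the segments."""
    lines = [line_through(p, q) for p, q in segments]
    points = []
    for line_a, line_b in combinations(lines, 2):
        point = intersection(line_a, line_b)
        if point is not None:
            points.append(point)
    return points
```

In practice the candidates would still have to be clustered and classified as vertical or horizontal, which this sketch leaves out.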

As in some other works, the algorithm is based on over-segmentation, which assumes that all pixels within a segmented region share the same label. After over-segmentation, they extract features within each superpixel, together with some features between adjacent superpixels. To account for the relationship between nearby regions, instead of using an MRF model, they merge the regions that are most likely to have the same label. The advantage of merging regions iteratively is that different cues can be used at different merging steps depending on the region size. For example, perspective information is effective only when a larger region is considered.

Since regions with two different labels may be merged together, Hoiem et al. also use multiple segmentations to avoid committing to any particular segmentation. For each segmentation, they use pre-trained parameters to predict the label likelihood of each segment. The final result is calculated by combining the superpixel label likelihoods from the multiple segmentations. Their surface labels and likelihood estimates are shown in Figure 2-11.

Figure 2-11 The result of Hoiem’s surface layout estimation [11]
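One simple way to combine label likelihoods over multiple segmentations is sketched below. The text above does not specify the combination rule, so the unweighted average used here is an assumption, as are the names `LABELS`, `combine_segmentations` and `most_likely_labels` and the input format, in which each segment is a pair of superpixel ids and a likelihood dictionary.

```python
LABELS = ("ground", "vertical", "sky")


def combine_segmentations(segmentations, num_superpixels):
    """Average, per superpixel, the likelihoods of the segments containing it.

    segmentations: list of segmentations; each segmentation is a list of
    (superpixel_ids, likelihood) pairs, where likelihood maps label -> value.
    """
    totals = [dict.fromkeys(LABELS, 0.0) for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for segments in segmentations:
        for superpixel_ids, likelihood in segments:
            for sp in superpixel_ids:
                counts[sp] += 1
                for label in LABELS:
                    totals[sp][label] += likelihood[label]
    averaged = []
    for sp in range(num_superpixels):
        n = max(counts[sp], 1)  # avoid division by zero for uncovered superpixels
        averaged.append({label: totals[sp][label] / n for label in LABELS})
    return averaged


def most_likely_labels(averaged):
    """Pick the label with the highest combined likelihood per superpixel."""
    return [max(scores, key=scores.get) for scores in averaged]
```

Because every superpixel inherits evidence from several differently shaped segments, a single bad merge in one segmentation has less influence on the final label.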

With the help of surface layout estimation, Hoiem et al. proposed an algorithm to recover the occlusion boundaries and depth ordering of an image. Based on the occlusion boundaries, the figure/ground relationship between nearby objects can be determined.

Using the vertical/ground structure, Hoiem et al. estimate object depth by detecting where vertical objects attach to the ground. For regions that are occluded by other regions, the occlusion relationship is used to estimate the maximum and minimum depth of those regions. One of their results is shown in Figure 2-12. In the left picture, the region to the left of an arrow is in front of the region to the right of the arrow.

(a) (b)

Figure 2-12 Result of Hoiem’s occlusion boundary algorithm [6, 7]. (a) Occlusion boundary result. (b) (Top row) Estimated max depth. (Bottom row) Estimated min depth.
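The idea that pairwise occlusion relationships imply a relative depth ordering can be illustrated with a topological sort. This is a minimal sketch rather than the procedure of [6, 7]: the function name `depth_order` and the `(front, back)` pair format are assumptions, and the sketch yields only an ordering, not the max/min depth estimates mentioned above. It relies on `graphlib`, which is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def depth_order(occlusions):
    """Order regions from nearest to farthest.

    occlusions: iterable of (front, back) pairs, meaning `front` occludes
    `back`. Raises graphlib.CycleError if the relations are contradictory.
    """
    sorter = TopologicalSorter()
    for front, back in occlusions:
        # `back` may only be placed after every region that occludes it.
        sorter.add(back, front)
    return list(sorter.static_order())
```

When two regions are not related by any chain of occlusions, their relative order in the result is arbitrary, which reflects that occlusion alone gives only a partial ordering.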

To identify occlusion boundaries, they train a classifier that assigns each boundary to one of three types: non-occlusion, occluded and occlusion. Their algorithm starts with an over-segmentation, which assumes that most true boundaries are preserved among the edges between the segmented regions. There are usually thousands of regions at the beginning, and the algorithm gradually removes unlikely edges to obtain the final boundaries.

In their work, they use many cues to recognize the boundaries. The cues can be classified as boundary cues, region cues, surface layout cues and depth-based cues. The detailed cues are listed in Figure 2-13. Surface layout cues use the result of the surface layout algorithm, which is very useful for detecting occlusion boundaries since most edges between different surface labels are occlusion boundaries. The geometric labels of the surface layout can also reveal figure/ground information. For example, solid regions are more likely to be in front of planar surfaces.

Figure 2-13 The occlusion cues used in Hoiem’s boundary system [6, 7].

Hoiem et al. use the cues listed above to predict, for each edge, the likelihood that it is a boundary. Moreover, a Conditional Random Field (CRF) model is used to enforce boundary continuity and consistency in the merging process. More precisely, the boundary likelihoods of connected edges are related, so Hoiem et al. consider the labels of the image jointly instead of estimating each boundary confidence alone. For example, at a junction where three edges meet, each edge can take one of three labels, giving 3^3 = 27 label combinations, but only 5 of them are valid. The valid types are shown in Figure 2-14.

Figure 2-14 Illustration of five valid junctions [6, 7]

Many cues need a larger spatial support. Hence, Hoiem et al. use a hierarchical segmentation process. At each iteration, the boundary likelihood of each edge is recalculated and the most unlikely edges are removed, until the boundary likelihoods of all remaining edges are above a given threshold. As regions grow larger and edges become longer, they refine the boundary prediction and remove unlikely boundaries again. This process repeats until no new region forms. The flow chart of their system is shown in Figure 2-15.

Figure 2-15 Illustration of Hoiem’s algorithm [6, 7]
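The iterative loop described above (re-score the edges, remove the least likely boundary, stop once every remaining edge exceeds the threshold) can be sketched as follows. This is a simplified illustration, not the system of [6, 7]: it removes one edge per iteration, omits the CRF, and takes the boundary classifier as a hypothetical caller-supplied function `boundary_likelihood(region_a, region_b)` that is assumed to be symmetric. Regions are represented as frozensets of superpixel ids.

```python
def merge_regions(adjacency, boundary_likelihood, threshold):
    """Greedily merge regions across the least likely boundary.

    adjacency: list of (a, b) pairs of adjacent superpixel ids.
    boundary_likelihood: hypothetical symmetric scoring function taking two
        regions (frozensets of superpixel ids) and returning a float.
    threshold: merging stops once all remaining edges score at least this.
    """
    region_of = {}
    for a, b in adjacency:
        region_of.setdefault(a, frozenset([a]))
        region_of.setdefault(b, frozenset([b]))

    while True:
        # Re-score every edge between two distinct current regions, since
        # the cues change as regions grow.
        scores = {}
        for a, b in adjacency:
            ra, rb = region_of[a], region_of[b]
            if ra != rb:
                scores[frozenset((ra, rb))] = boundary_likelihood(ra, rb)
        if not scores:
            break
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break
        # Remove the weakest edge by merging the two regions it separates.
        ra, rb = tuple(weakest)
        merged = ra | rb
        for sp in merged:
            region_of[sp] = merged

    return set(region_of.values())
```

Re-scoring after every merge is what allows cues that need larger spatial support to become informative in later iterations.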
