Various Depth Cues - Previous Work - 2D-to-3D Image Conversion Using Object-Oriented Segmentation

Chapter 2. Previous Work

2.1. Various Depth Cues

Humans can readily infer depth from a single monocular image on the basis of experience. Such an image contains many monocular cues, such as defocus, texture gradients, linear perspective, and contextual information. For example, objects nearer or farther than the focal plane appear blurred, and the sky in an image is effectively infinitely far away.

In addition, motion parallax is a useful cue for determining the depth of an object. For sequences captured with translational camera motion, near objects move faster across the image than far objects. Depth estimation from these cues has been studied for many years. In Section 2.1, we introduce the principle and associated algorithms of each depth cue.

2.1.1. Depth from Camera Motion

Given two images of the same scene captured from slightly different viewpoints, depth from camera motion can be used to recover the depth of an object. The relative motion between the viewing camera and the observed scene provides a cue for depth perception analogous to binocular disparity. First, a set of corresponding points in the image pair is found. Then, when all camera parameters are known, depth can be retrieved by triangulation. If only the intrinsic camera parameters are known, depth can be recovered up to a scale factor. If no camera parameters are known, the resulting depth is correct only up to a projective transformation. In most cases, no camera parameters are available for 2D video, so they must be recovered by self-calibration [5].
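The triangulation step can be illustrated for the simplest case. The sketch below assumes a rectified image pair, i.e., the camera has translated purely sideways between the two views, so that depth follows the standard relation Z = f * B / d. The function name and the numeric values are hypothetical and do not come from the cited works.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a rectified image pair (assumed setup).

    focal_px:     focal length in pixels
    baseline_m:   camera translation between the two views, in metres
    disparity_px: horizontal shift of a corresponding point, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 800-pixel focal length, 10 cm camera translation.
near = depth_from_disparity(800.0, 0.10, 40.0)  # large disparity, near point
far = depth_from_disparity(800.0, 0.10, 4.0)    # small disparity, far point
```

If the baseline is unknown, the same relation still orders the points by depth, which corresponds to the recovery up to a scale factor mentioned above.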

The typical framework in [6] that uses depth from camera motion is a three-stage procedure composed of feature tracking [7], structure from motion [8], and dense reconstruction. This method can extract absolute depth from 2D video with camera motion. However, to obtain an accurate depth map in the dense reconstruction stage, stereo matching algorithms [9], [10] must be used, and these suffer from high computational complexity. An alternative is realistic stereo-view synthesis (RSVS) [11], which combines structure from motion with the idea of image-based rendering (IBR) [12] to achieve photo-consistency without relying on dense depth estimation.

However, even when the background is static, a scene may contain dynamic elements, i.e., independently moving objects. Under such conditions it is difficult to recover the camera parameters and extract depth information.

2.1.2. Individual Moving Objects

An individual moving object (IMO) is also a depth cue in a 2D-to-3D conversion system. In some cases, motion vector maps can be used directly as depth maps. This approximation holds when all objects move at the same speed. Ideses et al. [13] extract motion vector maps from compressed 2D video and use this information to compute the depth map. However, there are many cases in which the approximation does not hold, for example when an object is stationary or does not move at a constant speed.
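As a rough illustration of using motion vectors directly as depth, the sketch below maps the magnitude of each block's motion vector to a depth level. It is not the method of Ideses et al. [13]; the function name, the block layout, and the convention that a larger level means a nearer block are all assumptions made for this example.

```python
import math


def depth_from_motion_vectors(vectors, max_level=255):
    """Map per-block motion vectors (dx, dy) to depth levels.

    Assumed convention: larger motion -> nearer block -> larger level.
    """
    magnitudes = [[math.hypot(dx, dy) for dx, dy in row] for row in vectors]
    peak = max(max(row) for row in magnitudes)
    if peak == 0:
        # No motion anywhere: the cue carries no depth information.
        return [[0 for _ in row] for row in magnitudes]
    return [[round(max_level * m / peak) for m in row] for row in magnitudes]


# Hypothetical 2x2 grid of block motion vectors.
motion = [[(0, 0), (3, 4)],
          [(6, 8), (0, 0)]]
levels = depth_from_motion_vectors(motion)
```

The stationary blocks receive level 0 regardless of their true distance, which is exactly the failure case described above.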

Moving object segmentation is also useful for 2D-to-3D conversion. In this approach, Kunter et al. [14] extract the foreground objects with a moving object segmentation algorithm [15] and assign depth to them. However, multiple occluding objects and objects with only little motion are difficult to detect.

2.1.3. Defocus

Cameras and eyes have a limited depth of field, so objects nearer or farther than the focal plane appear blurred. In other words, the amount of blur in an image is directly related to the defocus introduced by the optics of the eye or camera that captures it, and can therefore serve as a depth cue.

If a scene can be described simply by estimating which objects are in front, which lie behind those objects without being part of the background, and what is entirely in the background, a relative depth map can be estimated from the blur at the edges that make up the objects and its relation to the degree of focus.

The typical algorithm for the depth-from-focus cue [16] uses a spatial frequency measurement. When an object in an image is defocused, its high spatial frequencies are strongly attenuated. When the object is in focus, its high-frequency components are not attenuated, so its sharp detail appears as rapid intensity changes, i.e., as strong high-frequency content.

However, this method is suitable only for close-up images and does not perform well on other kinds of images.
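The idea of measuring high-frequency content can be sketched with a simple stand-in. The example below uses the mean absolute response of a 4-neighbour Laplacian as a sharpness score. This is a substitute chosen for brevity, not the spatial frequency measurement of [16], and the function name and test patches are hypothetical.

```python
def sharpness(gray):
    """Mean absolute 4-neighbour Laplacian over the interior of a patch.

    gray is a 2D list of intensities. A higher score means more
    high-frequency content, i.e., a patch that is more likely in focus.
    """
    h, w = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            total += abs(lap)
            count += 1
    return total / count if count else 0.0


# Hypothetical patches: a sharp step edge and the same edge blurred to a ramp.
sharp_patch = [[0, 0, 255, 255] for _ in range(4)]
blurred_patch = [[0, 85, 170, 255] for _ in range(4)]
sharp_score = sharpness(sharp_patch)
blurred_score = sharpness(blurred_patch)
```

Comparing such scores across regions gives only a relative ordering, and only where the image contains edges to measure.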

2.1.4. Linear Perspective

Linear perspective refers to the fact that parallel lines, such as railroad tracks, appear to converge with distance, eventually reaching a vanishing point at the horizon. The more the lines converge, the farther away they appear to be. A representative work is the gradient plane assignment approach proposed by Battiato et al. [3]. Their method performs well for single images that contain enough objects of rigid, geometric appearance. First, edge detection is employed to locate the predominant lines in the image. Then, the intersection points of these lines are determined. The intersection with the most intersection points in its neighborhood is taken to be the vanishing point. The major lines close to the vanishing point are marked as vanishing lines. Pixels closer to the vanishing point are assigned a larger depth value, and the density of the gradient planes is also higher there.

This method is suitable for man-made scenes that contain many long, parallel lines.
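The vanishing point selection described above can be sketched as follows. The example assumes that edge detection has already produced lines in the form (a, b, c) with a*x + b*y = c. The function names, the neighborhood radius, and the sample lines are hypothetical, and the code is not the implementation of Battiato et al. [3].

```python
import math
from itertools import combinations


def intersect(l1, l2):
    """Intersection of two lines a*x + b*y = c, or None if they are parallel."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-9:
        return None
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


def vanishing_point(lines, radius=5.0):
    """Return the intersection with the most intersections within `radius`."""
    points = []
    for l1, l2 in combinations(lines, 2):
        p = intersect(l1, l2)
        if p is not None:
            points.append(p)
    if not points:
        return None

    def support(p):
        # Number of intersections (including p itself) in the neighborhood.
        return sum(1 for q in points if math.dist(p, q) <= radius)

    return max(points, key=support)


# Three hypothetical lines through (100, 50) and one unrelated line y = 0.
lines = [(0, 1, 50), (1, 0, 100), (1, -1, 50), (0, 1, 0)]
vp = vanishing_point(lines)
```

The pairwise intersection step grows quadratically with the number of lines, so a practical system would keep only the predominant lines before this stage.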

2.1.5. Texture

Texture also conveys a good 3D impression through two key ingredients: the distortion of individual texels and the rate of change of that distortion across a texture region. The latter is also called the texture gradient. For example, a tiled floor with parallel lines will appear to have tilted lines in an image. Distant patches will show larger variations in line orientation, and nearby patches smaller variations. Similarly, a grass field viewed at different distances will have different texture gradient distributions.

The texture cue is useful for estimating the depth of a planar surface. If the surface is non-planar, shape-from-texture algorithms [19], [20] can be applied to reconstruct the 3D shape of the object surface. However, current algorithms are not suitable for real-time applications.

2.1.6. Relative Height

The relative height cue also provides depth information about an image. In general, objects that are closer in the real world are projected onto the lower part of the 2D image plane. Many photographic images, especially scenery images, exhibit the height cue. Jung et al. [21] proposed a real-time 2D-to-3D conversion framework using the relative height cue, and many pattern recognition-based algorithms [22], [23], [27] also treat position in the image as a cue.
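In its crudest form, the relative height cue amounts to a vertical depth gradient. The sketch below assigns each row a depth level that grows toward the bottom of the image. It is far simpler than the framework of Jung et al. [21] and is not their method; the function name and the convention that a larger level means a nearer pixel are assumptions.

```python
def height_depth_map(height, width, max_level=255):
    """Assign depth levels by row: lower rows get larger (nearer) values."""
    if height == 1:
        return [[max_level] * width]
    return [[round(max_level * y / (height - 1))] * width
            for y in range(height)]


# Hypothetical 4x2 image: the top row is farthest, the bottom row is nearest.
depth_rows = height_depth_map(4, 2)
```

Such a gradient ignores object boundaries entirely, which is why practical systems combine it with other cues.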

2.1.7. Statistical Patterns

Statistical patterns are elements that occur repeatedly in images. When the number or the dimensionality of the input data is large, machine learning techniques can be an effective way to solve the problem. In recent years, machine learning has received increasing interest as a tool for estimating depth maps. In particular, supervised learning uses training data with ground truth to distinguish the geometry, depth, and stage of a scene. A representative and sufficient set of training data, good features, and suitable classifiers are all essential ingredients for satisfactory results. The statistical pattern methods are described in more detail in Section 2.3.
