CHAPTER 2. RELATED WORK
2.1 DEPTH ESTIMATION
2.1.1 Stereo matching
Stereo matching has been a well-known 3D depth sensing method for the last two decades [4]. It is one of the most active research areas in computer vision and serves as an important step in many applications (e.g., view synthesis and image-based rendering). The goal of stereo matching is to determine the disparity map between a pair of images taken of the same scene. Disparity describes the difference in location between corresponding pixels in the two images. However, occluded areas are a major challenge for the accurate computation of visual correspondence: occluded pixels are visible in only one image, so they have no corresponding pixel in the other image.
V. Kolmogorov et al. [5] presented a method that properly addresses occlusions while preserving the advantages of the graph cut algorithm, which is used to solve energy minimization problems in computer vision [6] [7]. J. Sun et al. [8] proposed another method that uses a symmetric stereo model to handle occlusion in dense two-frame stereo. It embeds the visibility constraint within an energy minimization framework, resulting in a symmetric stereo model that treats the left and right images equally. An iterative optimization algorithm approximates the minimum of the energy using belief propagation [9].
The occlusion problem can be alleviated by using multi-view images. Several algorithms take multi-view images as input to deal with occlusion and obtain a more accurate depth map. Some of these depth estimation algorithms are described in the following sub-sections.
2.1.2 Segment-based depth estimation
S. Lee et al. [10] proposed a method of depth map generation for multi-view video in 2008. This method is a segment-based approach and uses the 3D warping technique. It assumes that all pixels in one segment have the same depth value. It employs the “Mean Shift based Image Segmentation” scheme of [11] to segment images. After image segmentation, depth estimation is performed on each segment. To generate the depth image for the center view, both the left and right views are considered simultaneously. Since the conventional matching functions MSE (mean squared error) and MAD (mean absolute difference) are not robust to illumination and color changes between cameras, they use a self-adaptation dissimilarity measure as the matching function [12].
This function adds an absolute gradient difference term to the existing MAD term and uses a weighting factor to balance MAD against MGRAD (mean gradient absolute difference). A refinement step using segment-based belief propagation was later added to this segment-based depth estimation method to remove erroneous depth values [13].
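As a rough illustration of such a matching function, the sketch below combines a MAD term with a mean absolute gradient difference term through a weighting factor. The function names, the simple horizontal forward-difference gradient, and the convex combination (1 - w) * MAD + w * MGRAD are assumptions made for illustration only; the exact definition is the one given in [12].

```python
def mad(block_a, block_b):
    """Mean absolute difference between two equally sized 2-D blocks."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(block_a, block_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def horizontal_gradient(block):
    """Forward difference along each row (an assumed, very simple gradient)."""
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in block]


def dissimilarity(block_a, block_b, weight=0.5):
    """Weighted mix of MAD and MGRAD; the mixing form is an assumption."""
    mgrad = mad(horizontal_gradient(block_a), horizontal_gradient(block_b))
    return (1.0 - weight) * mad(block_a, block_b) + weight * mgrad


# Two blocks that differ only by a constant brightness offset of 50.
dark = [[10, 20, 30], [10, 20, 30]]
bright = [[60, 70, 80], [60, 70, 80]]
offset_only_cost = dissimilarity(dark, bright, weight=1.0)
```

In this example the gradient term ignores the constant offset between the two blocks, which is the kind of robustness to illumination change that motivates adding it to MAD.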
2.1.3 Pixel-based depth estimation
M. Tanimoto et al. [14] proposed a pixel-based depth estimation method for FTV. This is the method that we use to generate the initial per-pixel depth map.
The method assumes that the images captured by each camera are rectified and that the cameras are arranged horizontally at regular intervals. It estimates disparities first and then transforms them to depth using the relationship between depth and disparity. Figure 2-1 shows the relation between disparity and depth.
Figure 2-1: Relation between disparity and depth [14].
From this figure, we can easily describe the relation between depth Z, camera interval I, focal length f, and disparity d by the following equation:
$$Z = \frac{f \cdot I}{d} \qquad (1)$$
Because the camera parameters are given, the values of I and f are already known. Once the disparity d is obtained, the depth Z can be derived from equation (1).
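The conversion can be sketched in a few lines, assuming the relation Z = f * I / d with f and d expressed in pixels and Z returned in the same unit as the camera interval I. The function name and the handling of non-positive disparities are illustrative choices, not part of [14].

```python
def depth_from_disparity(disparity, focal_length, camera_interval):
    """Depth Z = f * I / d for a rectified, horizontally aligned camera pair."""
    if disparity <= 0:
        # Zero disparity corresponds to a point at infinity; rejected here.
        raise ValueError("disparity must be positive")
    return focal_length * camera_interval / disparity


# Hypothetical values: f = 1000 px, I = 5 cm, d = 20 px.
z_cm = depth_from_disparity(20.0, 1000.0, 5.0)
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth.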
The disparity of each pixel is estimated by stereo matching. First, a matching score is calculated for each pixel in the center view at each disparity value in a predefined range. The matching score for a pixel in the center view at disparity d is derived by comparing the intensity value of pixel (x, y) in the center view against pixel (x+d, y) in the left view and pixel (x-d, y) in the right view. Then the graph cut algorithm is used to find the appropriate disparities in a view.
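The matching step can be illustrated with the following sketch. It assumes a single-pixel absolute intensity difference averaged over the left and right views, and it skips a view when the shifted pixel falls outside the image; both choices are assumptions. The final selection uses a simple winner-take-all rule in place of the graph cut optimization, purely to keep the example short.

```python
def matching_score(center, left, right, x, y, d):
    """Compare center (x, y) with left (x+d, y) and right (x-d, y)."""
    width = len(center[y])
    cost = 0.0
    used = 0
    if 0 <= x + d < width:
        cost += abs(center[y][x] - left[y][x + d])
        used += 1
    if 0 <= x - d < width:
        cost += abs(center[y][x] - right[y][x - d])
        used += 1
    # No usable view at this disparity: treat it as infinitely costly.
    return cost / used if used else float("inf")


def best_disparity(center, left, right, x, y, max_disparity):
    """Winner-take-all choice (a stand-in for graph cut optimization)."""
    candidates = range(max_disparity + 1)
    return min(candidates, key=lambda d: matching_score(center, left, right, x, y, d))


# Synthetic one-row images with a true disparity of 2 pixels.
center = [[0, 10, 20, 30, 40, 50, 60, 70]]
left = [[99, 99, 0, 10, 20, 30, 40, 50]]
right = [[20, 30, 40, 50, 60, 70, 99, 99]]
estimated = best_disparity(center, left, right, x=3, y=0, max_disparity=3)
```

Unlike this per-pixel selection, graph cut optimization also enforces smoothness between neighboring pixels, which is why it is preferred in practice.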
After disparity estimation, the depth is derived from the disparity and stored as an 8-bit gray level, with gray level 0 representing the farthest depth and gray level 255 the nearest depth. The depth value Z of pixel (x, y) is transformed into the 8-bit gray value v using the formula described in [15].
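One widely used mapping of this kind quantizes inverse depth linearly between a nearest plane and a farthest plane. The sketch below shows that mapping; whether it matches the formula of [15] term for term is an assumption, and the names z_near and z_far are introduced here for illustration.

```python
def depth_to_gray(z, z_near, z_far):
    """Map depth z to an 8-bit value: z_near -> 255, z_far -> 0."""
    if not (0 < z_near < z_far):
        raise ValueError("expected 0 < z_near < z_far")
    if not (z_near <= z <= z_far):
        raise ValueError("z must lie between z_near and z_far")
    # Linear quantization of inverse depth (assumed form).
    v = 255.0 * (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return int(round(v))


nearest = depth_to_gray(1.0, z_near=1.0, z_far=10.0)
farthest = depth_to_gray(10.0, z_near=1.0, z_far=10.0)
middle = depth_to_gray(2.0, z_near=1.0, z_far=10.0)
```

Quantizing inverse depth rather than depth itself gives finer resolution to objects close to the camera.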
This depth estimation algorithm has been implemented in the depth estimation reference software introduced at an MPEG meeting [16].
2.1.4 Temporal consistency for depth estimation
Since the depth estimation method estimates the depth value frame by frame, the resulting depth maps have low temporal consistency. The depth value of the non-moving background often changes from frame to frame. Several algorithms have been proposed to improve the temporal consistency of the depth map.
S. Lee et al. [17] proposed a depth estimation method that enhances temporal consistency by using a temporally weighted matching function that takes the previous depth value into account. The whole procedure is the same as in their previous work [10] except for the matching function. The matching function refers to the depth value of the previous frame, where λ represents the slope of the weighting function and Dprev(x, y) represents the depth value of pixel (x, y) in the previous frame.
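A temporally weighted cost of this kind might be sketched as follows. The linear penalty λ * |d - Dprev(x, y)| is an assumed form chosen because λ is described as a slope; it is not taken from [17], and the function names are hypothetical.

```python
def temporal_cost(d, d_prev, slope):
    """Assumed linear penalty for deviating from the previous frame's value."""
    return slope * abs(d - d_prev)


def temporally_weighted_cost(similarity_cost, d, d_prev, slope):
    """Similarity cost plus the temporal penalty."""
    return similarity_cost + temporal_cost(d, d_prev, slope)


# A candidate that matches the previous frame is not penalized.
unchanged = temporally_weighted_cost(4.0, d=5, d_prev=5, slope=0.5)
changed = temporally_weighted_cost(4.0, d=7, d_prev=5, slope=0.5)
```

With such a penalty, a candidate value is preferred over the previous frame's value only when its similarity cost is sufficiently lower, which damps frame-to-frame flicker.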
H. Yuan et al. [18] also proposed a depth estimation algorithm that considers the mean absolute gradient to enhance depth accuracy in depth-discontinuous areas. Apart from adding the gradient term to the matching function, their temporal consistency preserving algorithm is based on [17]. They use a motion mask to decide whether the Ctemp(x, y, d) term should be added to the matching function. The motion mask is derived by calculating the MSE of a pixel. If a pixel is not determined to be a motion pixel, the Ctemp(x, y, d) term is zero.
G. Bang et al. [19] proposed a depth estimation scheme that assumes the depth value of the non-moving background does not change from frame to frame. It extracts the non-moving background by calculating the frame difference between the current frame and the previous frame.
When the calculated value is larger than a threshold, the pixel is considered a moving pixel. The mean value of the frame difference over the entire frame is used as the threshold. For fast detection of the non-moving background, the frame is divided into blocks. For each block, the cost of being a non-moving background block is evaluated using the derived threshold thn. When the number of moving pixels is below 10% of all pixels in a block, the block is considered a non-moving block. After the frame has been divided into non-moving background and moving foreground, the graph cut algorithm is applied to each of them separately.
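The block classification described above can be sketched as follows. The function names, the use of the absolute intensity difference as the frame difference, and the square block layout are assumptions for illustration; only the mean-difference threshold and the 10% rule come from the description of [19].

```python
def frame_difference(current, previous):
    """Per-pixel absolute difference between two frames of equal size."""
    return [[abs(c - p) for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, previous)]


def classify_blocks(current, previous, block_size, moving_ratio=0.10):
    """Return a grid of booleans, True where a block is non-moving."""
    diff = frame_difference(current, previous)
    height, width = len(diff), len(diff[0])
    # Threshold: mean of the frame difference over the entire frame.
    threshold = sum(map(sum, diff)) / (height * width)
    grid = []
    for by in range(0, height, block_size):
        grid_row = []
        for bx in range(0, width, block_size):
            pixels = [diff[y][x]
                      for y in range(by, min(by + block_size, height))
                      for x in range(bx, min(bx + block_size, width))]
            moving = sum(1 for value in pixels if value > threshold)
            grid_row.append(moving < moving_ratio * len(pixels))
        grid.append(grid_row)
    return grid


# 4 x 8 frames: the left half is static, the right half changes by 100.
previous_frame = [[0] * 8 for _ in range(4)]
current_frame = [[0, 0, 0, 0, 100, 100, 100, 100] for _ in range(4)]
non_moving = classify_blocks(current_frame, previous_frame, block_size=4)
```

Here the mean frame difference is 50, so every pixel in the right block counts as moving and only the left block is classified as non-moving background.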