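Report - Multi-Core Micro Communication System Research for 3D Video Multimedia, Sub-project 5: Core Technology Research on High-Definition Multi-View Stereoscopic Video (I)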

1. Report

1-1 Introduction

Video is a widely used multimedia format today. Now that video resolution has reached full HD (1920x1080), viewers are looking for a more realistic viewing experience.

As a result, research on 3D video has become increasingly popular. However, traditional content is entirely 2D and cannot be shown directly on a 3D display device.

For this reason, many automatic 2D to 3D conversion algorithms have been proposed to address the shortage of 3D content. However, there is still no fast algorithm that converts monocular still images well.

1-2 Target

Our goal is to develop an algorithm that converts traditional 2D content into 3D content. To make the algorithm widely applicable, the target speed is 30 fps for an HD1080 (1920x1080) sequence. Furthermore, the algorithm can be combined with the other sub-projects to build a transmission system. Such a system would make 3D display much easier to deploy and would broaden its use in many application areas.
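As a rough sense of scale for this target, the pixel throughput and the per-frame time budget implied by 30 fps at 1920x1080 can be worked out directly. This is plain arithmetic on the stated target, not a measurement from the report.

```latex
\[
1920 \times 1080 \times 30 = 62{,}208{,}000 \ \text{pixels/s},
\qquad
\frac{1}{30\ \text{frames/s}} \approx 33.3\ \text{ms per frame}
\]
```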

1-3 Previous work

Recovering 3D information from a 2D video is a basic problem in computer vision. Many depth cues can be used to extract 3D information from a 2D video, but each cue has its own advantages and disadvantages under different conditions.

Iinuma et al. [1] used the defocus cue to estimate depth information from a single frame and the motion cue to convert the video. Cheng et al. [2] used the geometry cue and the motion cue to estimate the depth information. The simple concept and low computational complexity of these methods have allowed them to be adopted in real-time applications. However, these methods do not perform well on a monocular still image.

Another approach is the pattern recognition-based method, in which an image is first partitioned into many regions and each region is assigned to one of several classes that determine its depth. Based on this concept, Battiato et al. [3] classify regions into indoor, outdoor with geometric elements, and outdoor without geometric elements. Their method then uses the information collected in the classification step to estimate the depth. Even though this method can generate high-quality results for a monocular still image, it does not perform well for many types of scenes.

Hoiem et al. [4] also classify regions into several classes first. Then, they extract the boundary information of the regions to merge small regions into objects, and assign a specific depth to each object according to its class. This method can generate high-quality results for many types of scenes, but its boundary extraction and object detection suffer from high computational complexity.

1-4 Proposed algorithm

Motivated by the above issues, we propose an efficient 2D to 3D conversion algorithm for monocular still images that consists of image segmentation, image classification, object boundary tracing, constraint segmentation, and 3D image generation. First, we apply the watershed method for image segmentation, then merge segments using texture and color information to reduce their number and make the subsequent steps more efficient. Second, we adopt the image classification of [5] to recover the geometry of the scene. Third, we propose the object boundary tracing method to make boundary extraction and object detection more efficient. Fourth, we use constraint segmentation to merge incomplete object segments. Finally, we assign a depth to each object and synthesize a stereoscopic image with the depth-image-based rendering (DIBR) algorithm [6]. The experimental results show that the proposed algorithm delivers better depth maps and stereoscopic images, and achieves a speed-up of up to 44.4 times over the previous algorithm in [4].

Fig. 1. Algorithm overview: input image, initial segmentation, surface layout, object boundary tracing method, constraint segmentation, depth assignment and 3D image construction, left view and right view.

Fig. 1 illustrates the flow of our 2D to 3D conversion algorithm, which consists of five stages. In the first stage, we use the watershed algorithm to compute the initial segmentation. Although watershed segmentation preserves object boundaries well, it suffers from over-segmentation, so a neighbor region merging process is applied afterwards. In the second stage, we use the surface layout algorithm [5] to provide the geometric information for object detection. In the third stage, we propose the object boundary tracing method to detect objects efficiently, but some incomplete object segments still remain. Thus, in the fourth stage, we perform the constraint segmentation to merge segments under well-defined conditions. In the final stage, we assign depths to the objects and use the DIBR algorithm [6] to generate the images for the left and right eyes.

1-4-a Initial segmentation

In the proposed 2D to 3D conversion algorithm, the accuracy of object boundary detection is important, so the choice of image segmentation algorithm matters as well. We adopt the watershed image segmentation [7] because it preserves edges along object boundaries and is suitable for fast applications.

Since the computational complexity of our algorithm depends on the number of segments, we propose a neighbor merge method to reduce the number of segments. In this method, we use the color and texture information of each small segment to merge segments into meaningful ones.

For the color information, we consider the color distance between segments, measured on their average colors. We use the Hue-Saturation-Value (HSV) color space, and the color difference is computed by the formula in [8]. For the texture information, we apply a subset of the filter bank in [9] to compute the texture responses of each pixel. The filter bank consists of 6 edge filters, 6 bar filters, 1 Gaussian filter, and 2 Laplacian of Gaussian filters. From the texture responses, the histogram of maximum responses is computed for every segment, and then the symmetrized Kullback-Leibler divergence is computed for every pair of neighboring segments.
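The two dissimilarity measures described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the cone-shaped HSV embedding in `hsv_distance` is an assumption standing in for the formula of [8], and the function names and the smoothing constant `eps` are hypothetical.

```python
import math


def hsv_distance(a, b):
    """Euclidean distance between two HSV colors embedded in a cone.

    a, b: (h, s, v) with h in degrees and s, v in [0, 1].
    ASSUMPTION: the cone embedding is a common choice; the report
    uses the formula of its reference [8], which is not given in the text.
    """
    def to_cone(h, s, v):
        rad = math.radians(h)
        return (s * v * math.cos(rad), s * v * math.sin(rad), v)

    return math.dist(to_cone(*a), to_cone(*b))


def symmetrized_kl(p, q, eps=1e-9):
    """Symmetrized Kullback-Leibler divergence D(p||q) + D(q||p).

    p, q: histograms of equal length (counts or probabilities).
    eps avoids log(0) for empty bins (hypothetical smoothing choice).
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # D(p||q) + D(q||p) simplifies to sum((p - q) * log(p / q)).
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both functions return zero for identical inputs and grow as the two segments differ more, which is the property the merging step relies on.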

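Finally, we compute the edge cost, which combines the above color and texture information, for every pair of neighboring segments by the formula

$E(i, j) = \alpha \, D_{c}(i, j) + \beta \, D_{KL}(i, j)$    (1)

where $\alpha$ and $\beta$ are the weighting factors that control the contributions of the color difference $D_{c}$ and the divergence $D_{KL}$.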

With the edge cost, two small segments are merged if their edge cost is lower than the threshold T. The threshold is automatically and iteratively refined until the number of segments is smaller than a constant.
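The merging loop can be sketched with a union-find structure. This is a simplified illustration under stated assumptions: the initial threshold `t0`, the growth factor `step`, and the stopping rule "at most `max_segments` segments" are hypothetical choices, and edge costs are not recomputed after a merge, which a full implementation would likely do.

```python
def merge_segments(num_segments, edges, max_segments, t0=0.1, step=1.5):
    """Merge neighboring segments whose edge cost is below a growing threshold.

    edges: list of (i, j, cost) for neighboring segments i and j.
    Returns a list that maps each segment id to the id of its merged group.
    ASSUMPTION: t0 > 0 and step > 1 are illustrative, not from the report.
    """
    parent = list(range(num_segments))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ordered = sorted(edges, key=lambda e: e[2])
    count = num_segments
    threshold = t0
    while count > max_segments:
        for i, j, cost in ordered:
            if cost >= threshold or count <= max_segments:
                break
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
                count -= 1
        if not ordered or threshold > ordered[-1][2]:
            break  # every edge was considered; nothing left to merge
        threshold *= step  # relax the threshold and try again
    return [find(x) for x in range(num_segments)]
```

Raising the threshold gradually means the most similar neighbors are merged first, so the segment count approaches the target without merging across strong edges earlier than necessary.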

1-4-b Surface layout

After the initial segmentation, we apply the surface layout algorithm [5] to estimate the geometry of each segment. The surface layout algorithm labels the image with geometric classes, which coarsely describe the 3D scene orientation of each image region. Every region in the image is categorized into one of three main classes: “support”, “vertical”, and “sky”. In addition, the “vertical” class is further divided into five subclasses: “left”, “center”, “right”, “porous”, and “solid”. Among the subclasses, “left”, “center”, and “right” denote planar surfaces facing the left, center, or right of the viewer, while “porous” and “solid” denote non-planar surfaces. With this algorithm, we obtain the geometric information of an image.

1-4-c Object boundary tracing method

With the above two stages, much information is available to help detect objects. However, a local method has difficulty distinguishing the correct boundary, while a global method has high computational complexity because it needs many iterations. Therefore, we propose an object boundary tracing method, which has three stages.

In the first stage, we use a set of rules to determine the initial boundaries from the features of geometry, color, texture, and boundary smoothness. This initial boundary selection labels the obvious object boundaries.

In the second stage, we propose an efficient object boundary tracer to find the object boundaries from the segmentation result of Section 1-4-a. We start from an initial boundary between two segments and trace its extension, which is a boundary between another two segments. The selected boundary should have a high edge cost and a high label likelihood difference between the two segments. In addition, the orientation of the selected boundary must not change rapidly. This process repeats until it reaches the border of the image or an object boundary that has already been labeled.

For the proposed object boundary tracer, we define an energy function composed of the following three constraints.

Constraint 1: boundary tracing constraint:

[Equation (2): expression missing in the source]

Constraint 2: different label constraint:

[Equation (3): expression missing in the source]

Constraint 3: identical label constraint:

[Equation (4): expression missing in the source]

where i and j are adjacent superpixels, $l_i$ and $l_j$ are the superpixel labels, and $l_o$ is the current object label. The first is the boundary tracing constraint, which traces strong boundaries. The second is the different label constraint, which separates different objects. The third is the identical label constraint, which penalizes differing surface labels within an object. The three energies $E_1$, $E_2$, and $E_3$ are combined as

$\hat{E} = \min \{ \alpha E_1 + \beta E_2 + \gamma E_3 \}$    (5)

where $\alpha$, $\beta$, and $\gamma$ are the weighting factors that control the contribution of each energy.

This cost function could be efficiently minimized by a local method.

In the third stage, we merge the segments that are not separated by an object boundary into one.
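The greedy, local character of the second-stage tracer can be sketched as below. This is only an illustration of the described behavior (prefer a high edge cost, a high label likelihood difference, and a small change of orientation, and stop at the image border or at an already labeled boundary). The energy used here is a hypothetical stand-in, since the report's energy expressions are not available in the text, and the fragment dictionary layout and the weights `w` are assumptions.

```python
import math


def trace_boundary(fragments, start, border_nodes, labeled, w=(1.0, 1.0, 1.0)):
    """Greedily extend an object boundary from the fragment index `start`.

    fragments: list of dicts with keys
        'u', 'v'      junction ids at the two ends of the fragment,
        'edge_cost'   color/texture edge cost across the fragment,
        'label_diff'  label likelihood difference across the fragment,
        'angle'       orientation in radians when walked from u to v.
    border_nodes: set of junction ids on the image border.
    labeled: set of fragment indices already marked as object boundary
             (updated in place).
    ASSUMPTION: the weighted energy below is illustrative only.
    """
    a, b, c = w
    path = [start]
    labeled.add(start)
    node, heading = fragments[start]['v'], fragments[start]['angle']
    while node not in border_nodes:
        best, best_energy, best_next = None, math.inf, None
        for k, f in enumerate(fragments):
            if k == path[-1] or node not in (f['u'], f['v']):
                continue
            forward = f['u'] == node
            ang = f['angle'] if forward else f['angle'] + math.pi
            # Absolute turning angle, wrapped to [0, pi].
            turn = abs(math.atan2(math.sin(ang - heading), math.cos(ang - heading)))
            energy = -a * f['edge_cost'] - b * f['label_diff'] + c * turn
            if energy < best_energy:
                best, best_energy = k, energy
                best_next = (f['v'] if forward else f['u'], ang)
        if best is None or best in labeled:
            break  # dead end, or we met an already labeled boundary
        path.append(best)
        labeled.add(best)
        node, heading = best_next
    return path
```

Because each step looks only at the fragments meeting at the current junction, the cost of one trace is linear in its length, in contrast with a global optimization that revisits the whole image on every iteration.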

1-4-d Constraint segmentation

After the proposed object boundary tracing method, some segments in the image are still not complete objects. They can be merged further using the events listed in Table 1. We merge segments if the following conditions are satisfied.

Condition 1:

Condition 2:

Condition 3:

Condition 4:

We check these conditions one by one and merge the segments accordingly. After the constraint segmentation process, the object-based segmentation is complete.

Table 1. Events of constraint segmentation

Event 1: the color of the segment is similar to the other.

Event 2: the label confidence of segment is similar.

Event 3: the shape of the segment is similar to the other.

Event 4: the y-axis position of the segment is similar.

Event 5: the segment is inside of the other segment.

Event 6: the segment is small enough.

1-4-e Depth assignment and 3D image construction

Finally, we assign depths to the objects according to the object segmentation result and the geometry information of Section 1-4-b. Our model of the 3-dimensional space consists of a ground plane, objects that are orthogonal to the ground, and the sky.

First, for each region, we fit a set of line segments to the ground-vertical boundary using the Hough transform. These line segments determine whether a “vertical” segment is planar: if a “vertical” segment contains line segments, it is planar; otherwise, it is non-planar.

Then, we assign depths to the segments according to their type. For the “ground” segment and the planar “vertical” segments, we assign a gradient depth, determined by the position of the horizon line and the behavior of the ground-vertical boundary. For the “sky” segment and the non-planar “vertical” segments, we assign a constant depth according to the segment's position in image coordinates.
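A minimal sketch of these depth rules is given below. It is a simplification under stated assumptions: it works per pixel column rather than per object, handles only the gradient ground depth and the constant depth of non-planar vertical regions (planar vertical gradients are omitted), and the normalized depth range (0 = nearest, 1 = farthest) and all function names are hypothetical.

```python
def ground_depth(row, horizon_row, height, near=0.0, far=1.0):
    """Gradient depth for the ground: `near` at the bottom row, `far` at the horizon."""
    bottom = height - 1
    if row <= horizon_row:
        return far
    t = (bottom - row) / (bottom - horizon_row)
    return near + t * (far - near)


def assign_depth(labels, horizon_row):
    """Assign a depth in [0, 1] to every pixel.

    labels: 2D list (rows from top to bottom) of 'ground', 'vertical', or 'sky'.
    ASSUMPTION: a vertical region takes the constant depth of the ground
    position where it touches the ground; sky is placed at the far plane.
    """
    height, width = len(labels), len(labels[0])
    depth = [[1.0] * width for _ in range(height)]
    for x in range(width):
        contact = None  # depth at the foot of the current vertical run
        for y in range(height - 1, -1, -1):  # scan each column bottom to top
            label = labels[y][x]
            if label == 'ground':
                depth[y][x] = ground_depth(y, horizon_row, height)
                contact = None
            elif label == 'vertical':
                if contact is None:
                    foot = min(y + 1, height - 1)
                    contact = ground_depth(foot, horizon_row, height)
                depth[y][x] = contact
            else:  # 'sky'
                depth[y][x] = 1.0
                contact = None
    return depth
```

For example, in a single column labeled sky, vertical, ground, ground from top to bottom with the horizon at row 0, the vertical pixel receives the same depth as the ground pixel directly beneath it.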

After the depth assignment, we have the disparity map, from which we synthesize the left-eye and right-eye views with the depth-image-based rendering (DIBR) algorithm [6] and generate an anaglyph image.
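The view synthesis step can be illustrated with a simple scanline warp. This is a sketch of the general DIBR idea (shift each pixel horizontally by a disparity derived from its depth, resolve overlaps in favor of the nearer pixel, then fill holes), not the specific algorithm of [6]. The linear depth-to-disparity mapping, the hole-filling rule, and the red-cyan anaglyph composition are assumptions.

```python
def _fill_holes(out, order):
    """Fill None entries with the last valid pixel seen along `order`."""
    last = None
    for x in order:
        if out[x] is None:
            out[x] = last
        else:
            last = out[x]


def render_view(row, depth_row, max_disp, sign):
    """Warp one scanline into a virtual view.

    row: list of pixels; depth_row: depths in [0, 1] (0 = nearest).
    sign: +1 for the left view, -1 for the right view.
    ASSUMPTION: disparity shrinks linearly with depth, and each view
    takes half of the total disparity max_disp.
    """
    width = len(row)
    out = [None] * width
    zbuf = [float("inf")] * width
    for x in range(width):
        shift = round(sign * max_disp * (1.0 - depth_row[x]) / 2)
        tx = x + shift
        # Keep the nearest pixel when several land on the same target.
        if 0 <= tx < width and depth_row[x] < zbuf[tx]:
            out[tx] = row[x]
            zbuf[tx] = depth_row[x]
    # Holes open behind shifted foreground; fill them from the background side.
    order = range(width) if sign > 0 else range(width - 1, -1, -1)
    _fill_holes(out, order)
    _fill_holes(out, reversed(order))
    return out


def anaglyph(left, right):
    """Red channel from the left view, green and blue from the right view."""
    return [[(l[0], r[1], r[2]) for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]
```

Calling `render_view` on every image row with `sign=+1` and `sign=-1` gives the two views, and `anaglyph` combines them when the pixels are `(r, g, b)` tuples.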

1-6 Result

The proposed algorithm was tested on images with sizes from 352x288 to 1024x768, and its computation time was measured on an Intel Core i7 3.33 GHz CPU, as listed in Table 2. As the table shows, texture computation is the bottleneck of the proposed algorithm, and its cost grows sharply for large images. Nevertheless, the texture computation could be easily accelerated with a parallel processor.

Compared to the time distribution of Hoiem's method [4], the proposed 2D to 3D conversion algorithm reduces the computation spent on object boundary tracing and constraint segmentation. Thus, the proposed algorithm is more efficient and needs only 2.25% of the computation time of Hoiem's method.
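The 2.25% figure and the 44.4 times speed-up quoted earlier in the report are two statements of the same ratio, as the following arithmetic shows. The symbols $T_{\text{Hoiem}}$ and $T_{\text{proposed}}$ are introduced here only for illustration.

```latex
\[
\text{speed-up}
= \frac{T_{\text{Hoiem}}}{T_{\text{proposed}}}
= \frac{1}{0.0225}
\approx 44.4
\]
```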

Fig. 2 to Fig. 7 show our generated disparity maps, the synthesized left-view and right-view images, and the anaglyph images. The sequences in Fig. 2 and Fig. 3 are from the standard MPEG-4 video test sequences, and the other sequences are from the database of [4]. In both the depth maps and the synthesized views, the proposed algorithm delivers better results.

Table 2. Computation time on CPU, in seconds, by frame size (CIF and larger).

Fig. 2. Flower garden sequence

Fig. 3. Hall monitor sequence

Fig. 4. Scenery15 sequence

Fig. 6. Outdoor21 sequence

Fig. 7. Structure10 sequence

1-7 Conclusion

In this report, we proposed an efficient 2D to 3D conversion algorithm that automatically converts a still 2D image into a 3D one. With the proposed object boundary tracing method, the computation time is reduced to 2.25% of that of Hoiem's method [4]. The proposed algorithm also delivers better depth maps and stereoscopic images than the previous algorithm.

This work has been published as a paper at CVGIP 2011:

Yi-Chun Chen et al., “Efficient 2D to 3D conversion with Object-Based Segmentation”, Computer Vision, Graphics, and Image Processing (CVGIP), 2011.
