Depth Assignment - 3D Image Construction from 2D Image

Chapter 3. 3D Image Construction from 2D Image

3.3. Depth Assignment

After the object segmentation stage, we assign depth to the objects. Our model of 3-dimensional space consists of a ground plane, objects that are orthogonal to the ground, and the sky. To construct a 3D image for binocular vision, the depth assignment process outputs a disparity map, which encodes the depth information in the range 0-255. In our image coordinate system, the origin is located at the top-left corner, the x-axis points to the right, and the y-axis points downward.

Fig. 3.16. Illustration of each condition for depth assignment.

We assign a different depth to each segment according to its condition. Fig. 3.16 shows the conditions that we consider, and Fig. 3.18 shows the stages of the depth assignment process. First, for each region, we fit a set of line segments to the ground-vertical boundary by using the Hough transform [33]. These line segments determine whether a vertical-labeled segment is planar: if the vertical-labeled segment contains a line segment, it is planar; otherwise it is non-planar. Then we assign depth to each segment.
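The line-fitting step can be illustrated with a minimal Hough transform over boundary pixels. This is a sketch in pure Python, not the thesis implementation; the function name `hough_strongest_line`, the 1-degree angular step, the integer rho bins, the sample boundary, and the `MIN_VOTES` threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def hough_strongest_line(points, theta_steps=180):
    """Return (theta_degrees, rho, votes) of the line with the most votes.

    A line is parameterised as rho = x*cos(theta) + y*sin(theta),
    with theta sampled in [0, 180) degrees and rho rounded to an integer bin.
    """
    votes = Counter()
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(t, rho)] += 1
    (t, rho), count = votes.most_common(1)[0]
    return 180.0 * t / theta_steps, rho, count

# Hypothetical ground-vertical boundary: 20 pixels on the row y = 5,
# spread over 200 columns, plus two outlier pixels.
boundary = [(x, 5) for x in range(0, 200, 10)] + [(30, 40), (110, 2)]
theta, rho, count = hough_strongest_line(boundary)

# Hypothetical planarity rule: enough boundary pixels support one line.
MIN_VOTES = 15
is_planar = count >= MIN_VOTES
```

Here the strongest line is the horizontal row y = 5 (theta of 90 degrees, rho of 5), supported by all 20 collinear pixels, so the sample segment would be treated as planar under the assumed threshold.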

For the ground labeled segment, we compute disparity by the formula:

, ⁄ 255.0⁄ , (3.15)

where H is the height of the image, and hpos is the position of the horizon line in the image, which is computed from the vanishing point or from the highest position of a ground-labeled pixel.
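As an illustration of how a ground disparity can depend on H and hpos, the sketch below ramps the disparity linearly from 0 at the horizon row to 255 at the bottom row. The linear form, the function name `ground_disparity`, and the clamping are assumptions made for this example and may differ from formula (3.15).

```python
def ground_disparity(y, height, hpos):
    """Disparity in [0, 255] for a ground pixel in row y (y grows downward).

    Assumed linear ramp: 0 at the horizon row hpos, 255 at the bottom row.
    """
    bottom = height - 1
    if bottom <= hpos:
        raise ValueError("horizon must lie above the bottom row")
    ratio = (y - hpos) / (bottom - hpos)
    ratio = min(max(ratio, 0.0), 1.0)  # rows above the horizon clamp to 0
    return 255.0 * ratio
```

Under this assumption, ground pixels lower in the image receive larger disparity values than pixels close to the horizon.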

For a vertical-labeled segment that is connected to a ground-labeled segment, if the ground-vertical boundary is a line, we use the following formulas:

, ⁄ 255.0⁄ (3.16)

⁄ (3.17) ,

, (3.18)

where , is the linear equation of the line segment.

For a vertical-labeled segment that is planar, we also use formulas (3.14) and (3.15), but with a different linear equation: the slope is determined by the sub-class, and the line passes through the point at the lowest y-axis position of the segment in the image.

If the segment is non-planar, we use following formula,

, ⁄ 255.0⁄

, (3.19)

where is the lowest y-axis position of the segment in the image.

After the depth assignment process, the disparity map has been computed. Fig. 3.17 shows the result of the depth assignment process.

Fig. 3.17 The result of depth assignment process. (a) Original image. (b) Disparity map.

Fig 3.18. Flow of the depth assignment process.

3.4. 3D Image Construction for Binocular Vision

Fig. 3.19 Flow of the DIBR algorithm.

After obtaining the disparity map, we generate the left-eye and right-eye images with the depth-image-based rendering (DIBR) algorithm [28].

Fig. 3.19 shows the stages of the DIBR algorithm. The concept of DIBR on the parallel camera configuration is shown in Fig. 3.20. In this configuration, an object O is observed from the original center view V^c and from the virtual left-eye and right-eye views V^l and V^r. The object is projected to X^c, X^l, and X^r in the respective image planes. The relationship among the projected positions is

X^l = X^c + (b/2)(f/Z),  X^r = X^c - (b/2)(f/Z),  (3.20)

where Z is the depth of the object from the view plane, f is the focal length, and b is the baseline between V^r and V^l. Because the camera parameters of the original 2-dimensional video are unknown, we simplify the formula to

⁄ 2 ⁄ , (3.21) 2

where d is the disparity computed in Section 3.3 and s is a scale factor that can be adjusted by the user.

Fig. 3.20. Parallel camera configuration for virtual images warping [28]

If the disparity map is given, we can render the virtual left-eye and right-eye view images from the center view image. This rendering process is generally called 3D warping. However, the warped virtual images contain many holes: regions that are visible to the left or right eye but occluded in the center view. To recover these regions, a hole-filling step is added after the 3D warping process, as shown in Fig. 3.19. Hole filling alone, however, suffers from serious texture distortion because large holes cannot be recovered well.
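The warping step and the holes it creates can be sketched on a single image row. This minimal Python example assumes a per-pixel horizontal shift of round(s * d / 2), applied in opposite directions for the two views; the function name `warp_row`, the `None` hole marker, and the rule that the pixel with the larger disparity wins a conflict are illustrative assumptions, not the rendering code of [28].

```python
def warp_row(colors, disparity, scale, direction):
    """Warp one row of the center view into a virtual view.

    direction = +1 for the left-eye view, -1 for the right-eye view.
    Target pixels that receive no source pixel stay None (holes).
    """
    width = len(colors)
    out = [None] * width
    best = [-1] * width  # disparity of the pixel currently stored at each target
    for x in range(width):
        shift = round(scale * disparity[x] / 2)
        tx = x + direction * shift
        if 0 <= tx < width and disparity[x] > best[tx]:
            out[tx] = colors[x]
            best[tx] = disparity[x]
    return out

row = list("abcdefgh")
disp = [0, 0, 0, 4, 4, 0, 0, 0]  # a two-pixel object in front of the background
left = warp_row(row, disp, 1.0, +1)
right = warp_row(row, disp, 1.0, -1)
holes_left = [x for x, c in enumerate(left) if c is None]
```

In this toy row the object pixels `d` and `e` move two columns in each view, which leaves two unfilled columns behind them; these are the holes that the later stages must handle.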

Therefore, a depth smoothing step is applied before the 3D warping process to reduce the size of the holes. In the depth smoothing stage, a directional Gaussian filter is used to reduce geometric distortion, and the filter is applied only to the hole regions. Fig. 3.21 shows the result of the DIBR algorithm.
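Hole width grows with the disparity difference between neighbouring pixels, so smoothing the disparity map before warping shrinks the holes. The sketch below applies a plain horizontal Gaussian to a whole disparity row; it is a simplification, since the text uses a directional Gaussian filter restricted to hole regions. The function names, sigma, and radius are assumptions for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_row(disparity, sigma=2.0, radius=4):
    """Horizontal Gaussian smoothing with edge replication."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(disparity)
    out = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            xi = min(max(x + k - radius, 0), n - 1)  # clamp to the row
            acc += w * disparity[xi]
        out.append(acc)
    return out

def max_jump(values):
    """Largest absolute difference between neighbouring values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

disp = [0.0] * 10 + [200.0] * 10  # sharp depth edge
smoothed = smooth_row(disp)
```

After smoothing, the single jump of 200 is spread over several pixels, so the largest neighbour-to-neighbour difference, and with it the widest hole produced by warping, becomes much smaller.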

Fig. 3.21. The result of DIBR algorithm. (a) Original image. (b) Disparity map. (c) Rendered left view. (d) Rendered right view.

4. Experimental Results and Analysis

4.1. Introduction

In this chapter, we show the experimental results of the proposed 2D-to-3D conversion system on test images. The experimental results cover the 3D results and the execution time. The test images were collected from the Internet. In addition to the 3D results of our proposed system, we include the 3D results of the hybrid depth cueing system [2] and the recovering major occlusion boundaries method [4] for comparison. The source code of the recovering major occlusion boundaries method used for the comparison is provided by [4].

4.2. 3D Results

4.2.1. Our 3D Results

The proposed method has been tested on different types of scenes. The generated disparity maps, rendered left-view and right-view images, and anaglyph images are shown in Fig. 4.1 to Fig. 4.11 for evaluation. The sequences in Fig. 4.1 and Fig. 4.2 are standard MPEG-4 video test sequences. The other sequences are selected from the database of [4].

The test image “flower garden”, shown in Fig. 4.1, is an outdoor scene. Four major parts should be partitioned: sky, ground, tree, and building. The resulting disparity map shows that the depth of the objects is correct.

The test image “Hall monitor”, shown in Fig. 4.2, is an indoor scene. Five major parts should be partitioned: ground, ceiling, left wall, right wall, and man. Even though the objects in the image are not detected well, the depth order is correct. The result also shows that our system can handle planar surfaces.

Fig. 4.1. Flower garden sequence.

Fig. 4.2. Hall_monitor sequence. Panels: Hall_monitor sequence, disparity map, anaglyph, left view, and right view.

Fig. 4.3. Building.

Fig. 4.4. Outdoor0 sequence.

Fig. 4.3 and Fig. 4.4 are outdoor scenes with geometric structure. In Fig. 4.3, the major part of the image is a building, and the resulting depth is correct. The chair in the image is not detected well because the geometric labeling result for the chair is the ground label.

In Fig. 4.4, the depth order is correct, but the woman on the right side of the image is merged with the building. This mistake is caused by the object boundary tracer.

Fig. 4.5. Outdoor1 sequence.

Fig. 4.6. Scenery0 sequence.

Fig. 4.5 and Fig. 4.6 are natural outdoor scenes. The result for Fig. 4.5 is good. In Fig. 4.6, many birds in the image are not detected because the geometric labeling result is wrong.

Fig. 4.7. Scenery1 sequence. Panels: scenery1, disparity map, anaglyph, left view, and right view.

Fig. 4.8. Walking sequence.

Fig. 4.9. Structure sequence. Panels: structure, disparity map, anaglyph, left view, and right view.

Fig. 4.10. Urban sequence.

Fig. 4.11. Alley sequence.

Fig. 4.7, Fig. 4.8, and Fig. 4.10 are natural outdoor scenes with people. The results show that the people in the images are detected well, even people wearing camouflage in the woods.

Fig. 4.9 and Fig. 4.11 are man-made scenes. The result for Fig. 4.9 is good. Although the depth order in Fig. 4.11 is correct, the woman on the right side of the image is merged with the tree, ground, and statue. This makes it impossible to distinguish the depths of these objects in the anaglyph image.

4.2.2. 3D Result Comparison between Different Algorithms

In this section, we compare our method with the hybrid depth cueing system and the recovering major occlusion boundaries method.

The 3D results of the hybrid depth cueing system are shown in Fig. 4.12 and Fig. 4.13. For the flower garden sequence, Fig. 4.12(c) shows the DMP, the DGP, the fused disparity map, the left view, and the right view, where DMP is the depth from motion and DGP is the depth from a single image. Compared with the hybrid result, our disparity map in Fig. 4.12(b) is better because our depth for the building is more accurate. If we consider only depth from a single image, our method computes object depth more accurately, because the DGP cannot compute the depth of objects; it can compute only the depth of the background. For the hall monitor sequence, the hybrid depth cueing system gives a better background depth, but our method uses only a single image to compute the depth of the scene. Without motion information, the hybrid system could not compute the depth of the man.

Fig. 4.12. 3D results of flower garden sequence with different algorithms. (a) Original image (b) Our proposed algorithm. (c) The hybrid depth cueing system.

Fig. 4.13. 3D results of hall monitor sequence with different algorithms. (a) Original image (b) Our proposed algorithm. (c) The hybrid depth cueing system.

The 3D results of the recovering major occlusion boundaries method are shown in Fig. 4.14 to Fig. 4.18. In some cases, our 3D results are comparable to those of the recovering major occlusion boundaries method. In the urban and scenery1 sequences, our method detects more complete objects, because our superpixels are better than those of the original method proposed by Felzenszwalb et al. [34]. Fig. 4.19 compares our method with Felzenszwalb’s method. In some cases our method does not perform as well on object boundaries as the recovering major occlusion boundaries method, but our execution is faster. We report our execution time in Section 4.3. The execution time of the major occlusion boundaries method is reported in [4], but only a MATLAB implementation is available, so we do not compare the execution times directly.

Fig. 4.14. 3D results of walking sequence with different algorithms. (a) Original image. (b) Our proposed algorithm. (c) The recovering major occlusion boundaries method. Panels (b) and (c) each show a disparity map, left view, and right view.

Fig. 4.15. 3D results of scenery1 with different algorithms. (a) Original image. (b) Our proposed algorithm. (c) The recovering major occlusion boundaries method. Panels (b) and (c) each show a disparity map, left view, and right view.

Fig. 4.16. 3D results of alley sequence with different algorithms. (a) Original image (b) Our proposed algorithm. (c) The recovering major occlusion boundaries method.

Fig. 4.17. 3D results of outdoor0 sequence with different algorithms. (a) Original image (b) Our proposed algorithm. (c) The recovering major occlusion boundaries method.

Fig. 4.18. 3D results of urban sequence with different algorithms. (a) Original image. (b) Our proposed algorithm. (c) The recovering major occlusion boundaries method. Panels (b) and (c) each show a disparity map, left view, and right view.

Fig. 4.19. Superpixels computation with different algorithms. (a) Original image (b) Our proposed algorithm. (c) Felzenszwalb’s algorithm.

4.3. Execution Time

In this section, we show the execution time of the proposed 2D-to-3D conversion system. The algorithm was tested on several images with sizes ranging from 352x288 to 1024x768, and the performance of each step was measured. The data presented below are averages over the experiments, in seconds (s). Because the texture computation is time-consuming, it is reported separately from the fast neighbor merge process. Table 4.1 shows the performance of the algorithm on the CPU. These results were obtained on a computer with an Intel Core i7 980 at 3.33 GHz and 6 GB of RAM, running Windows 7, using the Microsoft Visual C++ compiler, version 9.0. Table 4.1 shows that the texture computation is the bottleneck and greatly degrades the speed, especially on large images. However, the texture computation can readily be accelerated on a parallel processor.

Table 4.1. Execution time in seconds (s)

Stage                      352x288   640x480   800x600   1024x768
Initial segmentation        0.0625    0.2236    0.3496     0.6084
Texture computation         1.3492    4.7130    7.3423    12.015
Fast neighbor merge         0.0155    0.0711    0.2356     0.5295
Surface labeling            0.1480    0.3534    0.5153     0.6073
Object boundary tracer      0.0155    0.0456    0.0646     0.0605
Constraint segmentation     0.0000    0.0021    0.0026     0.0032
Depth assignment            0.0000    0.0107    0.0156     0.0197
Total time                  1.5907    5.4090    7.6600    13.824
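To make the bottleneck claim concrete, the short Python sketch below computes the share of the texture computation in the summed stage times of Table 4.1. The shares are taken relative to the sum of the listed stages rather than the reported totals; the variable and function names are chosen for this example only.

```python
sizes = ["352x288", "640x480", "800x600", "1024x768"]

# Per-stage execution times in seconds, copied from Table 4.1.
stage_times = {
    "Initial segmentation":    [0.0625, 0.2236, 0.3496, 0.6084],
    "Texture computation":     [1.3492, 4.7130, 7.3423, 12.015],
    "Fast neighbor merge":     [0.0155, 0.0711, 0.2356, 0.5295],
    "Surface labeling":        [0.1480, 0.3534, 0.5153, 0.6073],
    "Object boundary tracer":  [0.0155, 0.0456, 0.0646, 0.0605],
    "Constraint segmentation": [0.0000, 0.0021, 0.0026, 0.0032],
    "Depth assignment":        [0.0000, 0.0107, 0.0156, 0.0197],
}

def texture_share(index):
    """Fraction of the summed stage time spent in texture computation."""
    total = sum(times[index] for times in stage_times.values())
    return stage_times["Texture computation"][index] / total

shares = {size: texture_share(i) for i, size in enumerate(sizes)}
for size, share in shares.items():
    print(f"{size}: {share:.1%}")
```

At every image size, the texture computation accounts for more than 80% of the summed stage times, which is why accelerating this one stage would have the largest effect on the overall speed.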

5. Conclusion and Future Work

5.1. Conclusion

In this thesis, we proposed a 2D-to-3D conversion system that automatically converts a single 2D image into 3D images. The algorithm combines object-based segmentation with depth assignment, so objects appear more complete on the 3D display. We use the watershed segmentation algorithm to generate the initial segmentation. A fast neighbor merge process is proposed to solve the over-segmentation problem. In addition, the surface labeling algorithm is used to categorize superpixels into appropriate classes. Furthermore, we proposed an object boundary tracing method that detects the objects in the image based on surface information. With the proposed object boundary tracing method, the execution time is much reduced compared with the recovering major occlusion boundaries method.

Experimental results demonstrated that the proposed 2D-to-3D conversion system can achieve better 3D image quality than the hybrid depth cueing system and the recovering major occlusion boundaries method.

5.2. Future work

Two issues remain in our 2D-to-3D conversion system. First, many depth cues are still unused. For example, using temporal information, we could incorporate a video segmentation method to make the object segmentation more accurate. Second, the computational speed of our algorithm remains slow. Therefore, we will work on optimizing the speed of the object segmentation algorithm and on porting the algorithm to a parallel processor.

References

[1] T. Iinuma, H. Murata, S. Yamashita, and K. Oyamada, “Natural Stereo Depth Creation Methodology for a Real-time 2D-to-3D Image Conversion,” SID Symposium Digest of Technical Papers, pp. 1212-1215, 2000.

[2] C. C. Cheng, T. L. Chung, Y. M. Tsai, and L. G. Chen, “Hybrid Depth Cueing for 2D-To-3D Conversion System,” in Proc. of Stereoscopic Displays and Applications XX, 2009.

[3] S. Battiato, A. Capra, S. Curti, and M. L. Cascia, “3D Stereoscopic Image Pairs by Depth-Map Generation,” in Proc. of International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), pp. 124-131, 2004.

[4] D. Hoiem, A. Stein, A. A. Efros, and M. Hebert, “Recovering occlusion boundaries from a single image,” in Proc. of IEEE International Conference on Computer Vision (ICCV), 2007.

[5] R. I. Hartley and A. Zisserman, “Multiple View Geometry in Computer Vision,” Cambridge University Press, Cambridge, UK, 2000.

[6] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, “Visual modeling with a hand-held camera,” International Journal of Computer Vision, vol. 59, no. 3, pp. 207-232, 2004.

[7] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-91-132, Apr. 1991.

[8] T. Jebara, A. Azarbayejani, and A. Pentland, “3D structure from 2D motion,” IEEE Signal Processing Magazine, vol. 16, no. 3, pp. 66-84, 1999.

[9] M. Z. Brown, D. Burschka, and G. Hager, “Advances in Computational Stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 993-1008, 2003.

[10] D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” International Journal of Computer Vision (IJCV), vol. 47, pp. 7-42, 2002.

[11] S. Knorr and T. Sikora, “An image-based rendering (IBR) approach for realistic stereo view synthesis of TV broadcast based on structure from motion,” in Proc. of IEEE International Conference on Image Processing (ICIP), San Antonio, USA, 2007.

[12] L. McMillan, “An Image based approach to three-dimensional computer graphics,” Ph.D. Dissertation, University of North Carolina, 1997.

[13] I. Ideses, L. P. Yaroslavsky, and B. Fishbain, “Real-time 2D to 3D video conversion,” Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 3-9, 2007.

[14] M. Kunter, S. Knorr, A. Krutz, and T. Sikora, “Unsupervised object segmentation for 2D to 3D conversion,” in Proc. of SPIE, vol. 7237, 2009.

[15] A. Krutz, M. Kunter, M. Mandal, M. Frater, and T. Sikora, “Motion-based Object Segmentation using Sprites and Anisotropic Diffusion,” 8th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2007.

[16] S. A. Valencia and R. M. Rodríguez-Dagnino, “Synthesizing Stereo 3D Views from Focus Cues in Monoscopic 2D images,” in Proc. of SPIE, vol. 5006, pp. 377-388, 2003.

[17] J. M. Geusebroek and A. W. M. Smeulders, “A six-stimulus theory for stochastic texture,” International Journal of Computer Vision (IJCV), vol. 62, pp. 7-16, 2005.

[18] V. Nedovic, A. W. M. Smeulders, A. Redert, and J. M. Geusebroek, “Depth estimation via stage classification,” in Proc. of 3DTV Conference, pp. 77-80, 2008.

[19] D. A. Forsyth, “Shape from texture and integrability,” in Proc. of International Conference on Computer Vision (ICCV), vol. 2, pp. 447-452, 2001.

[20] A. M. Loh and R. Hartley, “Shape from Non-Homogeneous, Non-Stationary, Anisotropic, Perspective texture,” in Proc. of the British Machine Vision Conference, 2005.

[21] Y. J. Jung, A. Baik, J. Kim, and D. Park, “A novel 2D-to-3D conversion technique based on relative height depth cue,” in Proc. of SPIE, vol. 7237, 2009.

[22] A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in NIPS, vol. 18, 2005.

[23] A. Saxena, S. H. Chung, and A. Y. Ng, “3-D depth reconstruction from a single still image,” International Journal of Computer Vision (IJCV), vol. 76, no. 1, 2007.

[24] T. Okino, H. Murata, K. Taima, T. Iinuma, and K. Oketani, “New television with 2D to 3D image conversion technologies,” in Proc. of SPIE, Stereoscopic Displays and Virtual Reality Systems III, vol. 2653, pp. 96-103, 1996.

[25] H. Murata, Y. Mori, S. Yamashita, A. Maenaka, S. Okada, K. Oyamada, and S. Kishimoto, “A Real-Time Image Conversion Technique Using Computed Image Depth,” SID Symposium Digest of Technical Papers, vol. 29, pp. 919-922, 1998.

[26] D. Martin, C. Fowlkes, and J. Malik, “Learning to detect natural image boundaries using local brightness, color and texture cues,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 530-549, 2004.

[27] D. Hoiem, A. Efros, and M. Hebert, “Recovering surface layout from an image,” International Journal of Computer Vision (IJCV), vol. 75, no. 1, pp. 151-172, 2007.

[28] Y. R. Horng, Y. C. Tseng, and T. S. Chang, “Stereoscopic Images Generation with Directional Gaussian Filter,” in Proc. of IEEE International Symposium on Circuits and Systems, pp. 2650-2653, 2010.

[29] A. Korbes, R. Lotufo, G. B. Vitor, and J. V. Ferreira, “A proposal for a parallel watershed transform algorithm for real-time segmentation,” in Proc. of Workshop de Visão Computacional (WVC), 2009.

[30] A. R. Smith, “Color gamut transform pairs,” Computer Graphics, vol. 12, pp. 12-19, 1978.

[31] J. R. Smith and S. F. Chang, “VisualSEEk: A fully automated content-based image query system,” in Proc. of ACM Multimedia Conference, pp. 87-98, 1996.

[32] T. Leung and J. Malik, “Representing and recognizing the visual appearance of materials using three-dimensional textons,” International Journal of Computer Vision (IJCV), vol. 43, no. 1, pp. 29-44, 2001.

[33] D. H. Ballard, “Generalizing the Hough Transform to Detect Arbitrary Shapes,” Pattern Recognition, vol. 13, no. 2, pp. 111-122, 1981.

[34] P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision (IJCV), vol. 59, no. 2, 2004.

[35] J. L. Schneiter and N. R. Corby, “Variable depth range camera,” US Patent No. 4,963,017, General Electric Company, Schenectady, N.Y., 1990.

[36] Y. Su, M. T. Sun, and V. Hsu, “Global motion estimation from coarsely sampled motion vector field and the applications,” in Proc. of International Symposium on Circuits and Systems (ISCAS), vol. 2, pp. 628-631, 2003.

[37] S. Makrogiannis, G. Economou, and S. Fotopoulos, “A region dissimilarity relation that combines feature-space and spatial information for color image segmentation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 1, pp. 44-53, 2005.

[38] D. B. K. Trieu and T. Maruyama, “Real-time image segmentation based on a parallel and pipelined watershed algorithm,” Journal of Real-Time Image Processing, vol. 2, no. 4, pp. 319-329, 2007.

Appendix

In this appendix, we briefly describe the formulas and parameters of the object boundary tracing method and the constraint segmentation. Section A.1 gives the detailed formulas and parameters for the object boundary tracing method, and Section A.2 gives those for the constraint segmentation.

A.1

In the initial boundary selection process, we detect the “sky-vrt”, “gnd-vrt”, and “vrt-vrt” classes of initial object boundaries. The formulas for these detectors are listed below.

A boundary of the “gnd-vrt” class belongs to the initial object boundaries if the following conditions are satisfied:

z 0.4

1 2

2 2.0

0.4 0.4

z 0.4 20

0.4 0.4

The denotes the same-label likelihood and the

denotes the ground label confidence. denotes the vertical label confidence. denotes the sky label confidence. denotes the length of boundary in x axis. denotes the length of boundary in y axis. denotes total pixels of boundary.

A boundary of the “sky-vrt” class belongs to the initial object boundaries if the following condition is satisfied:

z 1 0.5

0.3 0.3

A boundary of the “vrt-vrt” class belongs to the initial object boundaries if the following conditions are satisfied:

z 1 0.7

0.4

When two fragments of a junction have the ground label, the other fragments of the junction belong to the “vrt-vrt” class of initial object boundaries if they satisfy the following formula:

0.8

the denotes the subclass of segment i label confidence for segment j.

A.2

In the constraint segmentation process, we merge segments if the following conditions are satisfied.

z Condition 1: 1 2

Event 1:

v v s cos hπ s cos hπ s sin hπ s sin hπ 0.6,

where h, s, v denote value of color in the HSV color space.

Event 2:

Main class Main class

z Condition 2: 1 2 6

Event 1:

v v s cos hπ s cos hπ s sin hπ s sin hπ 1.2,

where h, s, v denote value of color in the HSV color space.

Event 2:

Max denotes the rightmost position of the segment in the image. Min denotes the leftmost position of the segment in the image.

Event 4:

|Mean i Mean j | 0.1

Mean denotes the mean x-axis position of the segment in the image.

z Condition 4: 2 5

Event 2:

Main class Main class subclass subclass

Event 5:

Seg Seg Max Seg , Seg Min Seg , Seg 0.1

Seg denotes the area of the bounding box of segment i.

Table A.1. Events of constraint segmentation

Event 1: the color of the segment is similar to that of the other segment.

Event 2: the label confidence of the segment is similar to that of the other segment.

Event 3: the shape of the segment is similar to that of the other segment.
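Event 1 compares the colors of two segments in HSV space. One way to sketch such a test is a Euclidean distance between the two colors placed in the HSV cone, compared with the threshold 0.6 used in Condition 1. The exact form below, the assumption that h, s, and v are normalised to [0, 1] with the hue mapped to the angle 2*pi*h, and the function names are illustrative and may differ from the formula in A.2.

```python
import math

def hsv_distance(c1, c2):
    """Euclidean distance between two (h, s, v) colors in the HSV cone.

    Assumes h, s, v are each normalised to [0, 1]; hue becomes an angle.
    """
    h1, s1, v1 = c1
    h2, s2, v2 = c2
    a1, a2 = 2.0 * math.pi * h1, 2.0 * math.pi * h2
    return math.sqrt((v1 - v2) ** 2
                     + (s1 * math.cos(a1) - s2 * math.cos(a2)) ** 2
                     + (s1 * math.sin(a1) - s2 * math.sin(a2)) ** 2)

def colors_similar(c1, c2, threshold=0.6):
    """Assumed form of Event 1: the color distance is below the threshold."""
    return hsv_distance(c1, c2) < threshold

# Two nearby colors, and two fully saturated colors with opposite hues.
near = colors_similar((0.10, 0.5, 0.80), (0.12, 0.5, 0.75))
far = colors_similar((0.0, 1.0, 1.0), (0.5, 1.0, 1.0))
```

Working in cone coordinates rather than on raw hue values avoids treating hues on either side of the wrap-around point as far apart, and it reduces the influence of hue for weakly saturated colors.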
