CONTRIBUTION - Analysis and Design of a View Synthesizer

CHAPTER 1 INTRODUCTION

1.4 CONTRIBUTION

1. We analyzed the hardware cost of the VSRS algorithm for the general camera setting, in which the cameras may be rotated so that they do not share the same baseline.

2. We modified the hole-filling method in VSRS to use simple bi-linear interpolation, and analyzed the bandwidth and memory usage of different hardware design approaches.

3. We implemented the whole VSRS algorithm as an ASIC design that meets real-time, high-resolution requirements.

Chapter 2

Related work

View synthesis is an image rendering method that makes use of depth maps. This chapter first introduces the concept of stereo vision and the depth estimation methods associated with view synthesis. Then the video-plus-depth data format used in 3D-TV and FTV systems is described. Finally, the most general view synthesis method, depth-image-based rendering (DIBR), is presented.

2.1 Stereo vision and video-plus-depth concept

Humans perceive depth because the scenes seen by the left eye and the right eye differ horizontally. This difference is called the screen parallax or the disparity, and the brain interprets it as 3D visual perception, as shown in Fig. 2-1. The disparity X_R − X_L can be transformed to the depth Z through an inversely proportional relationship involving the baseline B and the focal length f,

$X_R - X_L = \frac{Bf}{Z}$

The depth map stores depth values sampled in 8 bits and has a variety of applications, such as 3D displays, 3D interactive systems, and multi-view video. A depth map can be obtained by many methods, such as a time-of-flight (TOF) camera, a structure-from-motion algorithm, or a stereo matching algorithm. In both the 3D-TV system [1] and the FTV system [2], the video and its corresponding depth maps are encoded and transmitted by the sender side, and decoded by the receiver side to generate novel views or stereoscopic views. The data format in these systems is called the video-plus-depth format. In this format, the sender side is mainly responsible for video capture, rectification, and depth estimation, while the receiver side performs view synthesis flexibly and adaptively for different display needs by referring to the depth maps [3].
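The inverse relation between disparity and depth described in this section can be sketched as follows. This is a minimal illustration, not code from VSRS; the function names are hypothetical, and the disparity is taken as a magnitude so that the sign convention of X_R − X_L does not matter.

```python
def depth_from_disparity(disparity: float, baseline: float, focal_length: float) -> float:
    """Return Z = B * f / |X_R - X_L| (disparity and focal length in pixels)."""
    if disparity == 0:
        raise ValueError("zero disparity corresponds to a point at infinity")
    return baseline * focal_length / abs(disparity)


def disparity_from_depth(depth: float, baseline: float, focal_length: float) -> float:
    """Return the disparity magnitude B * f / Z for a point at depth Z."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return baseline * focal_length / depth
```

Halving the depth doubles the disparity, which is why near objects shift more between the two views than far ones.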

2.2 Depth-image-based rendering

2.2.1 3D warping

Depth-image-based rendering (DIBR) is an image-rendering technique that uses depth maps for virtual view synthesis. The rendering relies on a 3D warping model driven by the known per-pixel depth. In the 3D world coordinate system shown in Fig. 2-2 (a), a point (X, Y, Z) is mapped to a 2D point (u, v) on the image plane of a camera. In Eq. (2-1), this projection is performed by the projection matrix P, which can be decomposed into the camera intrinsic matrix K, the camera rotation matrix R, and the translation vector T.

(2-1)

Note that the term s in Eq. (2-1) is a scale factor, which depends on the depth Z as in Eq. (2-2).

Next we consider the back-projection from a 2D point to the 3D world coordinate system. In general, a 2D point is mapped to a line in 3D space, which means that one pixel is not back-projected to a unique 3D point. However, if the depth Z is known, a 2D point can be back-projected to a unique point in the 3D world coordinate system through the equation

(2-3)

Based on the projection from 3D to 2D and the back-projection from 2D to 3D, we can extend this concept to the mapping relation between multiple views. For the convergent cameras shown in Fig. 2-2 (b), the two cameras converge on the same point ZC in the 3D world coordinate system. Fig. 2-2 (a) is an example of multi-view projection.

Since the point (u_Src, v_Src) of Cam_Src and the point (u_Dst, v_Dst) of Cam_Dst are projections of the same point (X, Y, Z) in the 3D world coordinate system, these two points satisfy the mapping relation in the following equation,

(2-4)

so that a pixel position in the source view can be warped to the destination view given its Z value and the camera parameters of both cameras.
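The back-projection and re-projection chain described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's Eq. (2-1) to (2-4): the `Camera` class and function names are hypothetical, the intrinsic matrix is assumed to have no skew, `rotation` is assumed to be the camera-to-world rotation, `center` is assumed to be the camera position in world coordinates, and `z` is the depth measured along the source camera's optical axis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Camera:
    fx: float
    fy: float
    cx: float
    cy: float
    rotation: List[List[float]]  # assumed: camera-to-world rotation (3x3)
    center: List[float]          # assumed: camera position in world coordinates


def _mat_vec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def _transpose(m: List[List[float]]) -> List[List[float]]:
    return [list(col) for col in zip(*m)]


def back_project(cam: Camera, u: float, v: float, z: float) -> List[float]:
    """Map pixel (u, v) with known depth z to a unique 3D world point."""
    ray = [(u - cam.cx) * z / cam.fx, (v - cam.cy) * z / cam.fy, z]
    rotated = _mat_vec(cam.rotation, ray)
    return [rotated[i] + cam.center[i] for i in range(3)]


def project(cam: Camera, point: List[float]) -> Tuple[float, float]:
    """Map a 3D world point to pixel coordinates; s is the scale factor."""
    rel = [point[i] - cam.center[i] for i in range(3)]
    x, y, s = _mat_vec(_transpose(cam.rotation), rel)
    return (cam.fx * x / s + cam.cx, cam.fy * y / s + cam.cy)


def warp_pixel(src: Camera, dst: Camera, u: float, v: float, z: float) -> Tuple[float, float]:
    """Warp a source pixel to the destination view via the shared 3D point."""
    return project(dst, back_project(src, u, v, z))
```

With identity rotations and cameras displaced only along X, `warp_pixel` changes only the u coordinate, which is the special case treated in the horizontal shift section.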

Fig. 2-2 (a) Convergent multi-view cameras; each (u, v) in the image planes can project to the same 3D point (X, Y, Z). (b) Two convergent cameras and the convergence point ZC. Adapted from [4]

2.2.2 Horizontal shift

Fig. 2-3 (a) Parallel multi-view cameras; each (u, v) in the image planes can project to the same 3D point (X, Y, Z). (b) Two parallel cameras and the convergence distance ZC. Adapted from [4]

If the cameras are all set parallel on a line as shown in Fig. 2-3, DIBR can be simplified to a mapping method called horizontal shift [5].

For parallel cameras, the camera rotation matrix becomes the identity matrix I, and the translation vector has a nonzero element only in the X dimension, t = (T_X, 0, 0)^T. If the intrinsic matrices of the source view and the destination view are the same, Eq. (2-4) can be rewritten as

$u_{Dst} = u_{Src} + \frac{f_u (T_{X,Src} - T_{X,Dst})}{Z}$, (2-5)

where f_u is the focal length in the intrinsic matrix. This shows that corresponding positions in the two image planes differ only in the horizontal direction. In this camera configuration, view synthesis reduces to a much simpler operation: shifting each pixel horizontally according to its depth value Z.
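The horizontal shift of Eq. (2-5) can be sketched for a single scanline as follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: target positions are rounded to the nearest integer pixel, and when two source pixels land on the same target the one with the smaller depth Z (the nearer one) is kept.

```python
from typing import List, Optional, Sequence


def horizontal_shift(u_src: float, depth: float, focal_u: float,
                     tx_src: float, tx_dst: float) -> float:
    """Eq. (2-5): u_Dst = u_Src + f_u * (T_X,Src - T_X,Dst) / Z."""
    return u_src + focal_u * (tx_src - tx_dst) / depth


def shift_row(row: Sequence, depths: Sequence[float], focal_u: float,
              tx_src: float, tx_dst: float) -> List[Optional[object]]:
    """Forward-warp one scanline; unmapped targets stay None (holes)."""
    out: List[Optional[object]] = [None] * len(row)
    out_depth = [float("inf")] * len(row)
    for u, (value, z) in enumerate(zip(row, depths)):
        target = round(horizontal_shift(u, z, focal_u, tx_src, tx_dst))
        # Assumed collision rule: the nearer pixel (smaller Z) wins.
        if 0 <= target < len(row) and z < out_depth[target]:
            out[target] = value
            out_depth[target] = z
    return out
```

Because only a 1D offset is involved, a hardware implementation needs to buffer pixels only along a row, which is the storage advantage over general 3D warping discussed in Chapter 4.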

Chapter 3

Overview of view synthesis reference software algorithm

3.1 Overview

Fig. 3-1 Block diagram of view synthesis algorithm proposed by MPEG-FTV

According to MPEG-FTV, the view synthesis algorithm synthesizes the virtual (target, synthesized) view based on a two-view framework [6]. The block diagram of the algorithm is shown in Fig. 3-1. First, the camera parameters of the reference views and the synthesized view are used to build the projection matrices for 3D image warping; the projection is further implemented as a homography transform, which is described in detail in the following section. Note that the reference views are usually one on the left and one on the right, for high-quality occlusion handling. The depth map at the synthesized view is mapped pixel by pixel from the reference view using forward warping, and then post-filtered with a median filter. Next, the synthesized depth is reverse warped to the reference view for texture mapping. These processes are applied to both the left view and the right view. Finally, the synthesized textures from the left view and the right view are blended with occlusion handling, and the remaining holes are filled by an inpainting method.

3.2 3D image warping

3.2.1 Projection transform and homography transform

This algorithm uses 3D image warping for view synthesis, as described in Chapter 2.2, for the video-plus-depth data format. Fig. 3-2 illustrates the process of 3D image warping, where the projection represents the mapping between the 3D world coordinate system and the image coordinate system with a 3x4 matrix. Taking a point in the left view and its correspondence in the target view as an example, the warping calculation contains two projection transforms: one using PL and one using PV. Because two projections are applied to every pixel, the projection transform method has high computational complexity.

Fig. 3-2 Warping process using projection or homography transform

On the other hand, the homography method simplifies the warping process by adopting the homography matrix, a 3x3 matrix that relates two image planes. For the same example in Fig. 3-2, mapping a point from the left view to the target view requires only one transform using the homography matrix HLV. Note that one homography matrix corresponds to one specific depth Z in the 3D coordinate system, as shown in Fig. 3-3, because a homography matrix is derived from the forward projection and backward projection at a fixed depth Z. Since a depth map is usually represented as an 8-bit gray-level image, there are 256 homography matrices for the 256 depth levels between two views. Because of the efficiency of the homography method, the released software of MPEG-FTV uses the homography transform to perform the pixel-by-pixel warping process [5], [8].

Fig. 3-3 Homography of 256 depth levels between two views

For the homography relation at a specified depth Z, if the two image planes I_Src and I_Dst are related by the homography matrix H, they satisfy the transform equation

$x_{Dst} = H x_{Src}$ (3-1)

where x_Src is a point with the vector (u_Src, v_Src, 1)^T on the image plane I_Src, and x_Dst is a point with the vector (u_Dst, v_Dst, 1)^T on the image plane I_Dst. Let H be a 3x3 matrix formed by (h00, h01, h02; h10, h11, h12; h20, h21, h22). The transform equation Eq. (3-1) can be expanded as Eq. (3-2), and be further rewritten as Eq. (3-3).

(3-2)

Let h = (h00, h01, h02, h10, h11, h12, h20, h21, h22)^T. Eq. (3-3) can be reformulated into

$A h = 0$ (3-4)

This is a linear equation with 9 unknown values in h. A pair of corresponding points in the two image planes provides two independent equations for solving h, that is, a 2-by-9 linear system. With N pairs of points, a 2N-by-9 linear system is formed. Because h is a homogeneous vector defined only up to scale, the last element h22 can be set to 1, so at least 4 pairs of points are needed to solve h [4]. The linear system is then formulated as the 8-by-8 square one,

(3-5)

Many numerical methods can be applied to solve this linear system. In MPEG-FTV, VSRS estimates the homography vector h using the function “cvFindHomography” in the Open Source Computer Vision Library (OpenCV). This function solves the above linear system by singular value decomposition (SVD). For a hardware implementation of SVD, we should consider precision, computational complexity, and hardware cost. The details of the implementation method are discussed in Chapter 5.1.2.
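The 8-by-8 system obtained from 4 point pairs with h22 fixed to 1 can be sketched as follows. This is a minimal sketch, not the VSRS or OpenCV implementation: it swaps SVD for plain Gaussian elimination with partial pivoting, the function names are hypothetical, and the two rows built per point pair follow from expanding x_Dst = H x_Src and clearing the denominator.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def homography_from_4_points(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Build the 8x8 system (two rows per point pair) and solve with h22 = 1."""
    a: List[List[float]] = []
    b: List[float] = []
    for (u, v), (ud, vd) in zip(src, dst):
        # From ud * (h20*u + h21*v + 1) = h00*u + h01*v + h02
        a.append([u, v, 1, 0, 0, 0, -ud * u, -ud * v])
        b.append(ud)
        # From vd * (h20*u + h21*v + 1) = h10*u + h11*v + h12
        a.append([0, 0, 0, u, v, 1, -vd * u, -vd * v])
        b.append(vd)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h: List[List[float]], u: float, v: float) -> Point:
    """Apply x_Dst = H x_Src and normalize by the third component."""
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return ((h[0][0] * u + h[0][1] * v + h[0][2]) / w,
            (h[1][0] * u + h[1][1] * v + h[1][2]) / w)
```

Elimination is used here only to keep the sketch dependency-free; the precision and cost trade-offs of the solver are a separate question.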

3.2.2 Depth mapping using forward warping


In this step, the depth maps of both the left and right reference views are warped to the virtual view. This warping process is called forward warping, which denotes the 3D warping from the reference view to the virtual view as in Fig. 3-2. For each pixel in the reference view, the warped position in the virtual view is obtained from its depth value and the corresponding homography matrix, using the homography transform in Eq. (3-2). The depth value in the reference view is then copied to the warped position in the virtual view. After pixel-by-pixel warping, a whole new depth map is synthesized in the virtual view. Note that two new depth maps are produced, one warped from the left view and one from the right view.

Since the original depth map in the reference view may be noisy, and the warping process may introduce sampling aliasing, the new depth map in the virtual view usually suffers from small noisy holes. To remove them, the new depth map is post-filtered by a median filter [5], [6], [7].
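The forward warping and median post-filtering steps above can be sketched as follows. This is a simplified illustration with hypothetical names, not the VSRS code: `homographies[d]` is assumed to be the precomputed 3x3 matrix for depth level `d`, warped positions are rounded to integer pixels, holes are marked with -1, a larger depth level is assumed to be nearer and therefore wins collisions, and the median window is fixed at 3x3.

```python
from typing import List, Sequence

HOLE = -1  # assumed marker for unmapped pixels


def forward_warp_depth(depth: List[List[int]],
                       homographies: Sequence[List[List[float]]]) -> List[List[int]]:
    """Copy each reference depth level to its warped position in the virtual view."""
    height, width = len(depth), len(depth[0])
    warped = [[HOLE] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            h = homographies[d]
            w = h[2][0] * u + h[2][1] * v + h[2][2]
            ud = round((h[0][0] * u + h[0][1] * v + h[0][2]) / w)
            vd = round((h[1][0] * u + h[1][1] * v + h[1][2]) / w)
            # Assumed collision rule: keep the larger (nearer) depth level.
            if 0 <= ud < width and 0 <= vd < height and d > warped[vd][ud]:
                warped[vd][ud] = d
    return warped


def median_filter_3x3(image: List[List[int]]) -> List[List[int]]:
    """Replace each pixel by the median of its (border-clipped) 3x3 window."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for v in range(height):
        for u in range(width):
            window = sorted(image[y][x]
                            for y in range(max(0, v - 1), min(height, v + 2))
                            for x in range(max(0, u - 1), min(width, u + 2)))
            out[v][u] = window[len(window) // 2]
    return out
```

An isolated hole surrounded by valid depth values is removed by the median, while a wide hole caused by occlusion survives, which matches the behavior described for the later texture mapping step.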

3.2.3 Texture mapping using reverse warping

In this step, the texture in the virtual view can be synthesized by warping the texture in reference view according to the depth map in the virtual view. This warping process is called reverse warping because the warping direction is from the virtual to reference view, instead of the reference to virtual view in the previous forward warping.

The details of reverse warping are as follows. Using the depth value in the virtual view, which is the result of the previous depth mapping, each position in the virtual view is warped to a position in the reference view. With the corresponding positions, the texture of the reference view is copied pixel by pixel to the virtual view. Because the post-filtering of the depth map removes only small holes, the holes caused by occlusion still remain in the result of this step. Note that this step is performed for the left view and the right view separately, so two synthesized frames for the virtual view are generated.
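Reverse warping as described above can be sketched as follows. The names are hypothetical and several points are assumptions: `inverse_homographies[d]` is taken to be the 3x3 matrix mapping virtual-view positions to reference-view positions for depth level `d`, holes in the virtual depth map are marked with -1, and the texture is sampled at the nearest integer pixel rather than interpolated.

```python
from typing import List, Optional, Sequence


def reverse_warp_texture(virtual_depth: List[List[int]],
                         reference_texture: List[list],
                         inverse_homographies: Sequence[List[List[float]]],
                         hole: int = -1) -> List[List[Optional[object]]]:
    """Fetch texture for each virtual-view pixel from the reference view."""
    height, width = len(virtual_depth), len(virtual_depth[0])
    texture: List[List[Optional[object]]] = [[None] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = virtual_depth[v][u]
            if d == hole:
                continue  # occlusion hole: left for blending and hole-filling
            h = inverse_homographies[d]
            w = h[2][0] * u + h[2][1] * v + h[2][2]
            ur = round((h[0][0] * u + h[0][1] * v + h[0][2]) / w)
            vr = round((h[1][0] * u + h[1][1] * v + h[1][2]) / w)
            if 0 <= ur < width and 0 <= vr < height:
                texture[v][u] = reference_texture[vr][ur]
    return texture
```

Because the loop runs in the pixel order of the virtual view, every output position is written at most once and in raster order, unlike forward warping.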

3.3 Occlusion handling and blending

In the DIBR process, the warped images contain large holes: regions that are exposed in the synthesized view but occluded in the reference view being warped, although they may be visible in the other reference view. To recover these so-called disocclusions, studies of one-view synthesis for stereoscopic video generation have adopted various hole-filling methods, such as linear interpolation [1] and horizontal extrapolation [10]. However, these methods suffer from serious texture distortion because large holes cannot be recovered well. Better filling methods take depth information or gradient characteristics into account during interpolation [11]. To alleviate the difficulty of hole-filling, depth smoothing [12]-[14] is applied before the 3D warping process. The aim of depth smoothing is to reduce the size of the holes by softening the sharp discontinuities in the depth maps.

On the other hand, in the two-view synthesis algorithm we adopt, two synthesized views are produced separately from the left and right reference views. The holes caused by occlusion in one synthesized view can therefore be filled from the other, because some parts of the scene are visible only from the left or only from the right. The blending method proposed in [7] is formulated by the equations

(3-6)

$\alpha = \frac{|t - t_L|}{|t - t_L| + |t - t_R|}$ (3-7)

In Eq. (3-6), there are four blending modes, and the hole pixels are detected in the previous depth mapping step: a pixel is not a hole if it was mapped during the warping process. For the first mode, when the pixel is not a hole in either synthesized view, the blending is a weighted addition using the distance factor in Eq. (3-7), where t is the translation vector in the extrinsic camera parameters of the virtual view, and t_L and t_R are those of the left and right reference views. For the second and third modes, if only one synthesized view is available, the blending uses only the corresponding reference view. For the last mode, if the pixel is a hole in the views synthesized from both the left and right references, it is marked as a “final-hole” and must be filled by another image interpolation method, which is introduced in the next section.

However, boundary noise appears after blending because of mismatches between the depth map and the texture, especially in regions of disparity discontinuity. To eliminate these artifacts, the hole-maps are dilated by one or two pixels, so that the hole borders are extended and filled with background. Fig. 3-4 shows the synthesized image with and without dilating the hole-map.

Because dilation widens the hole borders, the two hole-maps differ in the boundary area. A new truth table of blending modes can therefore be adopted, as in Table 3-1. We define the area where the two hole-maps differ as “Boundary Special”, because it is the border area of a depth discontinuity and may induce boundary noise. Note that its blend mode is different from the general case.

Fig. 3-4 Synthesized image (a) without dilating the hole-map and (b) with dilating the hole-map

Table 3-1 Blend mode truth table (0 if hole, 1 if not hole)

| Reference to L | Reference to R | After dilation, reference to L | After dilation, reference to R | Blend mode | Boundary Special |
|---|---|---|---|---|---|
| 0 | 0 | - | - | Final-hole | 0 |
| 0 | 1 | 0 | - | R only | 0 |
| 1 | 0 | - | 0 | L only | 0 |
| 1 | 1 | 0 | 1 | R only | 1 |
| 1 | 1 | 1 | 0 | L only | 1 |
| 1 | 1 | 0 | 0 | Weighted add | 0 |
| 1 | 1 | 1 | 1 | Weighted add | 0 |
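The truth table above can be sketched as a small decision function. The function names are hypothetical, and the weighted addition rests on assumptions not confirmed by the surviving text: camera positions are reduced to scalars along the baseline, the weight is the normalized distance from the virtual camera to the left camera, and the left view is weighted by (1 − α) and the right view by α, so that the nearer reference contributes more.

```python
from typing import Optional, Tuple


def blend_weight(t: float, t_left: float, t_right: float) -> float:
    """Distance factor: |t - t_L| / (|t - t_L| + |t - t_R|) (scalar positions assumed)."""
    d_left, d_right = abs(t - t_left), abs(t - t_right)
    return d_left / (d_left + d_right)


def blend_mode(left: int, right: int, left_dilated: int, right_dilated: int) -> Tuple[str, int]:
    """Flags are 1 if not a hole, 0 if hole. Returns (mode, boundary_special)."""
    if not left and not right:
        return ("final-hole", 0)
    if not left:
        return ("R only", 0)
    if not right:
        return ("L only", 0)
    # Both views are valid before dilation; check whether dilation disagrees.
    if left_dilated != right_dilated:
        return ("L only", 1) if left_dilated else ("R only", 1)
    return ("weighted add", 0)


def blend_pixel(left_value: float, right_value: float,
                flags: Tuple[int, int, int, int], alpha: float) -> Optional[float]:
    """Blend one pixel; None marks a final-hole left for hole-filling."""
    mode, _ = blend_mode(*flags)
    if mode == "final-hole":
        return None
    if mode == "L only":
        return left_value
    if mode == "R only":
        return right_value
    return (1.0 - alpha) * left_value + alpha * right_value
```

In the "Boundary Special" rows the pixel is valid in both views, but one view lies inside a dilated hole border, so the other view is used alone instead of the weighted addition.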

3.4 Hole-filling

The remaining holes flagged as “final-hole” after blending can be handled by different methods in view synthesis. FTV uses an advanced inpainting method [15] for the general warping mode, and simple linear interpolation for the 1D horizontal shift mode. Müller et al. [9] extrapolate only the background color into holes by examining the depth values on the two sides of the hole border, because the foreground has larger depth values and the background has smaller ones. Oh et al. [16] proposed a depth-based inpainting method that also fills holes with only the background color. Whatever the method, because these final-holes are not visible from any reference view and can only be filled with reference to the surrounding pixels, it suffices to fill them plausibly rather than exactly.

However, inpainting is a frame-based image processing method and is more complex to implement in hardware. Thus, we apply a simple bi-linear interpolation, which acts as a 2D low-pass filter with geometric distance weighting at pixels carrying the final-hole flag. In this thesis, we implemented this bi-linear interpolation in a block-based manner, as shown in Fig. 3-5.

Fig. 3-5 Bi-linear interpolation of hole-filling
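One plausible reading of the block-based, distance-weighted hole-filling is sketched below. This is an assumption-laden illustration, not the thesis's exact filter: each final-hole pixel is replaced by an average of the non-hole pixels inside a block centered on it, weighted by the inverse of the Manhattan distance, and the names and the block centering are hypothetical. With one valid neighbor on each side along a row, this reduces to ordinary linear interpolation.

```python
from typing import List


def fill_final_holes(image: List[List[float]], hole_mask: List[List[bool]],
                     block_h: int, block_w: int) -> List[List[float]]:
    """Fill flagged pixels from valid neighbors within a block_h x block_w window."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    half_h, half_w = block_h // 2, block_w // 2
    for v in range(height):
        for u in range(width):
            if not hole_mask[v][u]:
                continue
            total = 0.0
            weight_sum = 0.0
            for y in range(max(0, v - half_h), min(height, v + half_h + 1)):
                for x in range(max(0, u - half_w), min(width, u + half_w + 1)):
                    if hole_mask[y][x]:
                        continue  # only valid pixels contribute
                    weight = 1.0 / (abs(y - v) + abs(x - u))
                    total += weight * image[y][x]
                    weight_sum += weight
            if weight_sum > 0.0:
                out[v][u] = total / weight_sum
            # Otherwise the hole is larger than the block and stays unfilled.
    return out
```

The last branch shows why the block size matters: a hole wider than the block has no valid pixel in its window and cannot be filled.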

However, the block size is related to the hole size and is a key factor in both internal memory size and performance. When the block is too small, the larger holes at the frame border may not be filled; when the block is too big, the buffer size becomes large and the interpolated texture becomes noisy. Table 3-2 shows the performance of several sequences under different block sizes. The sequence “Ballet” has larger holes, so its performance improves as the block size increases.

The sequences “BookArrival,” “LoveBird1” and “Kendo” have smaller holes, so that when

Fig. 3-6 Performance of bi-linear interpolation for hole-filling in different block size

Table 3-2 Performance of hole-filling by using bi-linear interpolation in different block size: Y-PSNR performance (dB)

| Block size | Ballet | Breakdancers | BookArrival | Lovebird1 | Newspaper | Kendo |
|---|---|---|---|---|---|---|
| 5x3 | 33.18638 | 33.06250 | 36.41172 | 31.80200 | 30.67576 | 33.00001 |
| 9x5 | 33.20828 | 33.16606 | 36.37078 | 31.80157 | 30.67691 | 32.99998 |
| 13x7 | 33.21609 | 33.17187 | 36.35280 | 31.80039 | 30.67858 | 32.99997 |
| 17x9 | 33.21837 | 33.16193 | 36.34878 | 31.79952 | 30.67945 | 32.99996 |
| 21x11 | 33.22026 | 33.14299 | 36.34814 | 31.79897 | 30.67974 | 32.99996 |
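For reference, the Y-PSNR figure reported in the table can be computed as in the following sketch. The function name is hypothetical, and the inputs are assumed to be 8-bit luma planes of equal size given as nested lists.

```python
import math
from typing import List


def y_psnr(reference: List[List[int]], synthesized: List[List[int]],
           peak: float = 255.0) -> float:
    """Return 10 * log10(peak^2 / MSE) in dB over the luma plane."""
    squared_error = 0.0
    count = 0
    for ref_row, syn_row in zip(reference, synthesized):
        for ref_px, syn_px in zip(ref_row, syn_row):
            squared_error += (ref_px - syn_px) ** 2
            count += 1
    if squared_error == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / (squared_error / count))
```

Differences in the third or fourth decimal place, as in several columns of the table, correspond to very small changes in mean squared error.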

[Fig. 3-6 plots: six panels, each showing PSNR against block size BHxBW.]

Chapter 4

Proposed architecture

Our objective is to implement a real-time view synthesis (VS) engine corresponding to the VSRS algorithm for a frame size of HD1080p (1920x1080). There are three main challenges in implementing the VSRS algorithm. First, for general applications, 3D warping requires much more hardware complexity than the horizontal shift method, especially in storage cost. This is because the cameras may be rotated, so that the disparities between views are not only in the horizontal direction, as shown in Fig. 2-2. Hence the data storage grows from 1D to 2D, and its data control becomes complicated.

Second, the VSRS algorithm uses two steps of 3D warping, one for depth mapping and the other for texture mapping. The main advantage of two-step warping is that the warped depth map can be post-filtered for better texture mapping. In addition, because reverse warping proceeds in the pixel order of the target view, the two synthesized views from different reference views can be processed in parallel in the blending and hole-filling steps. However, data storage and access increase because of the additional synthesized depth map, and therefore internal memory and bandwidth utilization become critical in the architecture design.

The third challenge lies in hole-filling. As described in Chapter 3.4, we choose simple bi-linear interpolation with a block size of 9x5 in this step. However, data storage and access are still a challenge because the remaining holes lie at irregular and discontinuous positions.

Our architecture design focuses on solving the above three challenges. Finally the architecture adopts the two frame-level pipeline stages, and each with hierarchical column-level pipeline

4.1 Two frame-level pipelining stages

Because the depth mapping using forward warping and the texture mapping using reverse warping are performed at different positions, the depth mapping must store the warped depth of the virtual view in a reorder buffer for the subsequent texture mapping. The size of this reorder buffer is on the order of the disparity range if the videos are rectified with no rotation, but grows to multiple rows if the cameras are rotated. Taking “Ballet” in Fig. 4-1 as an example, the region of the depth map from row 0 to row 30 in the reference view is forward warped to out-of-order positions in the target view. The first 20 rows of the reference view are warped outside the frame range, which means that the first complete row of the virtual view is collected only after 20 rows have been warped. We therefore need a buffer of frame width by 20 rows, which is 40.96KB, to hold the previously mapped depth, and up to 108KB for HD1080p.
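The two buffer sizes quoted above can be reproduced with the following arithmetic sketch. The text does not state how the figures were derived; the numbers are consistent with 2 bytes per buffered entry, a 1024x768 frame for “Ballet”, and a row count that scales with frame height, and all three are inferences labeled as such in the code. The function names are hypothetical.

```python
def reorder_buffer_bytes(frame_width: int, rows: float, bytes_per_entry: int = 2) -> float:
    """Buffer size for `rows` full-width rows (2 bytes per entry is inferred)."""
    return frame_width * rows * bytes_per_entry


def scaled_rows(rows: float, from_height: int, to_height: int) -> float:
    """Scale a row count in proportion to frame height (an inferred assumption)."""
    return rows * to_height / from_height


# "Ballet" at 1024x768 with 20 buffered rows.
ballet_bytes = reorder_buffer_bytes(1024, 20)

# HD1080p with the row count scaled from 768 to 1080 lines.
hd_bytes = reorder_buffer_bytes(1920, scaled_rows(20, 768, 1080))
```

Under these assumptions `ballet_bytes` is 40960 (40.96KB) and `hd_bytes` is 108000.0 (108KB), matching the figures in the text.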

Fig. 4-1 Warped depth map, row 0 to row 30, of “Ballet”: (a) the reference view and (b) the virtual view.

To eliminate this reorder buffer, we propose an architecture with two frame-level pipeline stages, which performs the depth mapping process and the texture mapping process in different stages. Fig. 4-2 shows the schedule of the proposed two frame-level architecture.

The warped depth is stored in the external memory in the first stage and read back in the second stage for texture mapping.
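The overlap of the two frame-level stages can be illustrated with a small schedule generator. This is only a sketch under assumptions, since the schedule figure is not reproduced here: each stage is assumed to take one frame period, stage 1 (depth mapping) works on frame n while stage 2 (texture mapping) works on frame n − 1, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple


def two_stage_schedule(num_frames: int) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (time_slot, stage1_frame, stage2_frame); None means the stage is idle."""
    schedule = []
    for slot in range(num_frames + 1):
        stage1 = slot if slot < num_frames else None   # depth mapping, writes external memory
        stage2 = slot - 1 if slot >= 1 else None       # texture mapping, reads previous depth
        schedule.append((slot, stage1, stage2))
    return schedule
```

In this model the warped depth of a frame is always complete in external memory before the second stage reads it, at the cost of one frame of latency.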

With the proposed two-level architecture, Table 4-1 shows that the total bandwidth is
