ORGANIZATION OF THIS THESIS - A Study on Techniques for the Generation and Application of Object Movies

CHAPTER 1 INTRODUCTION

1.3. ORGANIZATION OF THIS THESIS

This thesis investigates techniques for producing high-quality OMs, including object movie rig calibration, OM segmentation, and stereoscopic OM generation. Chapter 2 presents a calibration method for object movie rigs that helps users acquire high-quality OMs and obtain the camera parameters. Chapter 3 describes two segmentation methods for removing the backgrounds of OMs. The objective of the proposed segmentation methods is to minimize user intervention. The first method utilizes motion vectors to propagate the corrected information to other frames containing segmentation errors. It works well in most cases, but requires considerable user intervention in some cases because of erroneous motion estimates. Therefore, the second method is proposed to propagate the corrected information efficiently by learning shape priors beforehand. Chapter 4 presents a novel 3D reconstruction approach for obtaining high-quality 3D models from OMs. Chapter 5 describes a novel method, called augmented stereoscopic panoramas, for augmenting stereo panoramas with stereo OMs. With augmented stereo panoramas, the user can enjoy more convincing interaction with better depth perception. A conclusion is given in Chapter 7.

Chapter 2

Object Rig Calibration

In this chapter, we describe a method that assists the user in acquiring high-quality OMs and quickly obtains the camera parameters of the images in OMs. The camera parameters can be used in many applications. In this work, we use them to perform background removal in Chapter 3, and 3D reconstruction and novel view generation in Chapter 4.

Fig. 3 shows the processing flowchart of the proposed calibration method. To calibrate the motorized object rig, we first use the camera mounted on the AutoQTVR to capture some feature points whose 3D positions are known beforehand. The 2D and 3D positions of the feature points are used to estimate the intrinsic and extrinsic camera parameters. In our experiments, the calibration object, called the physical control cube (PCC) [24], is shown in Fig. 4. With the estimated extrinsic camera parameters, we can reconstruct the kinematic model of the rig. Then, we apply a simple and practical model, the complete and parametrically continuous (CPC) model [1][69], to formulate the relation among the three axes. Finally, we provide a visual tool that shows the axes so that users can adjust the motorized object rig. If the intersections of the rays are not close enough, the user adjusts the motorized object rig according to the estimated result, and the axes are then estimated again. The whole process is repeated until the intersections of the rays are close enough. After calibration, reliable extrinsic parameters of the camera are available from the kinematic model.

Fig. 3. Processing flowchart of calibration. The flowchart has two stages, estimation of camera parameters and calibration of the motorized object rig, with the following boxes: obtain a set of images of the calibration object; estimate the intrinsic parameters; estimate the extrinsic parameters (initial values obtained by using the ARToolkit library and refined by ICP); OpenCV library; estimate the relation among axes, using the extrinsic camera parameters and the pan and tilt angles of the captured images; decide whether the rays of the axes intersect closely enough; if not, adjust the position and orientation of the axes and take a new set of photographs of the calibration object; if yes, finish and output.

2.1. Estimation of Camera Parameters

We adopt the method proposed by Zhang [66] to estimate the intrinsic camera parameters. The method performs camera calibration with at least two images of a known planar pattern captured at different orientations.

To estimate the extrinsic camera parameters, we adopt the method presented in [9] and [24]: the method proposed by Kato et al. [27] is first used to obtain a set of initial extrinsic parameters, which are then refined by applying the Iterative Closest Point (ICP) principle [3].

Fig. 4. The calibration object, called physical control cube (PCC), and the extracted feature points used to obtain intrinsic camera parameters.

2.2. Complete and Parametrically Continuous (CPC) Kinematic Model

CPC stands for the complete and parametrically continuous kinematic model [69]. Completeness means that the model provides enough parameters to express any variation of the actual robot structure, and parametric continuity implies that the model has no singularity, which is achieved by adopting a singularity-free line representation [46].

This model was motivated by the special needs of robot calibration. It is assumed that the robot links are rigid. A CPC kinematic model for a revolute/prismatic joint can be represented as follows (we refer the reader to [69] for detailed descriptions):

${}^{i}T_{i+1} = Q_i V_i$ (1)

where ${}^{i}T_{i+1}$ denotes the transformation matrix between any two consecutive joint frames, i.e., from the (i+1)-th reference frame to the i-th reference frame. $Q_i$ is the motion matrix, defined in (2) in terms of the joint value $q_i'$, which is the rotation angle for a revolute joint or the amount of displacement for a prismatic joint. $V_i$ denotes the constant shape matrix, which is a general transformation matrix given by (3).

In (3), the rotation matrix $R_i$ describes the relative orientation of the two consecutive joint axes (details can be found in the Appendix), and $Rot_z(\beta_i)$ is used to align the x- and the y-axes.

Notice that the CPC convention requires that any two consecutive joint axes have a nonnegative inner product, i.e., $b_{i,z} \ge 0$. In general, this requirement can be met by changing the sign of one of the joint values of the consecutive joints, because changing the sign of the joint value is equivalent to reversing the joint axis for both revolute and prismatic joints [53].
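The equivalence between negating a joint value and reversing the joint axis can be checked numerically. The following is a minimal sketch, not the procedure of [53] or [69]: the helper names `rot_axis`, `slide_along` and `enforce_nonnegative_z` are hypothetical, and the axis is assumed to be a unit vector expressed in the previous joint frame.

```python
import math

def rot_axis(axis, angle):
    """Rotation matrix about a unit axis by `angle` (Rodrigues' formula)."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + x * x * k, x * y * k - z * s, x * z * k + y * s],
        [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
        [z * x * k - y * s, z * y * k + x * s, c + z * z * k],
    ]

def slide_along(axis, displacement):
    """Translation produced by a prismatic joint sliding along a unit axis."""
    return [displacement * a for a in axis]

def enforce_nonnegative_z(axis, joint_value):
    """Flip the axis and the joint value together when the axis has a negative z-component.

    This is an assumed way of meeting the b_z >= 0 convention; the joint motion is unchanged.
    """
    if axis[2] < 0:
        return [-a for a in axis], -joint_value
    return list(axis), joint_value

# Hypothetical joint axis with a negative z-component and a joint value of 0.7.
axis, q = [0.6, 0.0, -0.8], 0.7
new_axis, new_q = enforce_nonnegative_z(axis, q)
```

Both `rot_axis(axis, q)` and `rot_axis(new_axis, new_q)` describe the same rotation, and the same holds for the prismatic displacement, which is why the sign change does not alter the kinematics.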

With the CPC kinematic model [69], the kinematic parameter identification problem can be decomposed into kinematic parameter calibration sub-problems, one for each prismatic or revolute joint. Suppose we have a robot with n joints. The transformation matrix from the world reference frame, w, to the end-effector reference frame, n, can then be expressed as the product of the transformation matrices between consecutive joint frames given in (1).
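The chaining of the per-joint transformations in (1) can be sketched in a few lines. This is an illustration under stated assumptions rather than the formulation of [69]: the motion matrix is assumed to be a rotation about, or a translation along, the local z-axis, each shape matrix is passed in as an arbitrary 4x4 homogeneous matrix, and all function names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def rot_z(angle):
    """Assumed motion matrix of a revolute joint: rotation about the local z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def trans_z(d):
    """Assumed motion matrix of a prismatic joint: translation along the local z-axis."""
    T = identity()
    T[2][3] = d
    return T

def joint_transform(kind, q, V):
    """One link transformation, motion matrix Q times shape matrix V, as in (1)."""
    Q = rot_z(q) if kind == "revolute" else trans_z(q)
    return matmul(Q, V)

def chain(joints):
    """Multiply the link transformations of all joints in order."""
    T = identity()
    for kind, q, V in joints:
        T = matmul(T, joint_transform(kind, q, V))
    return T

# Toy chain with identity shape matrices: two rotations followed by a slide.
T = chain([
    ("revolute", 0.3, identity()),
    ("revolute", 0.4, identity()),
    ("prismatic", 2.0, identity()),
])
```

With identity shape matrices the two rotations simply add up, which makes the toy chain easy to verify by hand; a real rig would supply calibrated shape matrices instead.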

2.3. Kinematic Calibration Using the CPC Model

In this section, we introduce how to apply the CPC model to estimate the transformation matrices among the coordinate systems defined on the motorized object rig. As shown in Fig. 5, we define three axes of three different reference frames on the rig. Let $\vec{z}_c$, $\vec{z}_t$ and $\vec{z}_p$ denote the z-axes of the camera coordinate system (CCS), the tilt-axis coordinate system (TCS), and the pan-axis coordinate system (PCS), respectively.

For convenience, let the camera be the “end-effector” of the motorized object rig. Thus, we can obtain the corresponding robot pose with the method described in Section 2.1. In general, the orientations of the x- and the y-axes of the coordinate systems need not be specified in formulating the kinematics of the motorized object rig. Therefore, the redundant parameter $\beta_i$ in (3) can be set to zero, and the transformation matrix from the object coordinate system (OCS) to the camera coordinate system (CCS) can be simplified as follows:

${}^{c}T_{o} = Q_0 V_0 Q_1 V_1 Q_2 V_2$ (7)

where ${}^{b}T_{a}$ denotes the transformation matrix from coordinate system a to coordinate system b.

Since the motorized object rig is composed of two revolute joints, the motion matrix $Q_0$ is a constant matrix which can be set to identity, whereas $Q_1$ and $Q_2$ are the rotation matrices about the $\vec{z}_t$- and the $\vec{z}_p$-axes, respectively. The equations of $Q_0$, $Q_1$ and $Q_2$ are given by (8), in which $\theta_t$ and $\phi_p$ are the rotation angles of the tilt and the pan axes, respectively. Substituting (8) into (7), we have (9), where ${}^{c}R_{o}$ and ${}^{c}t_{o}$ are the rotation matrix and the translation vector of the transformation matrix ${}^{c}T_{o}$. From (9), we have (10) for the rotation part and (11) for the translation part. In the following subsections, we show how to solve for the parameters $R_0$, $\vec{l}_0$, $R_1$, $\vec{l}_1$, $R_2$ and $\vec{l}_2$ in (10) and (11).

Fig. 5. The schematic of motorized object rig.

2.3.1. Rotation Parts

In order to simplify the calibration process, we calibrate one axis at a time. Therefore, when calibrating the tilt axis, the pan axis is held still, i.e., $\phi_p$ can be regarded as a constant, and thus $R_1 \, R_z(\phi_p) \, R_2$ becomes a constant term denoted by X. Substituting X into (10) yields (12). Equation (12) can be rewritten in the homogeneous form

$A \, \vec{b}_0 = \varepsilon$

where $\varepsilon$ denotes the error vector induced by the observation noise, and $\vec{b}_0$ can be estimated by minimizing $\|\varepsilon\|^2$. It is well known that $\vec{b}_0$ is the unit eigenvector of $A^{T}A$ corresponding to the smallest eigenvalue $\lambda$. Note that the direction of $\vec{b}_0$ has to be chosen such that its z-component is positive. By substituting the estimated $\vec{b}_0$ into (4), we obtain the orientation matrix $R_0$.
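The eigenvector result quoted above follows from a short Lagrange-multiplier argument. The derivation below is a standard one and assumes that the error vector has the form $\varepsilon = A\,\vec{b}$ with $\vec{b}$ constrained to unit length; the Lagrangian $L$ is introduced here only for illustration.

```latex
\begin{aligned}
\min_{\vec{b}} \ \|\varepsilon\|^{2} &= \vec{b}^{T} A^{T} A \, \vec{b}
  \quad \text{subject to} \quad \vec{b}^{T}\vec{b} = 1, \\
L(\vec{b}, \lambda) &= \vec{b}^{T} A^{T} A \, \vec{b} - \lambda \, (\vec{b}^{T}\vec{b} - 1), \\
\frac{\partial L}{\partial \vec{b}} = 0 \ &\Longrightarrow \ A^{T} A \, \vec{b} = \lambda \, \vec{b}, \\
\|\varepsilon\|^{2} &= \vec{b}^{T} (\lambda \, \vec{b}) = \lambda .
\end{aligned}
```

Every stationary point is therefore a unit eigenvector of $A^{T}A$, and the residual at that point equals its eigenvalue, so the minimum is attained at the eigenvector with the smallest eigenvalue.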

The stability of the solution for $\vec{b}_0$ can be understood from the following derivation. Substituting (12) into the definition of A shows that, when the difference between two rotation angles, $(\theta_i - \theta_j)$, is close to zero, estimating the rotation axis becomes ill-posed and the solution for $\vec{b}_0$ may not be stable. To avoid this singular configuration, one should make $(\theta_i - \theta_j)$ as large as possible. This provides useful guidance for selecting the joint angles for kinematic calibration.
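The following sketch illustrates one way such an axis estimate could be set up; it is an assumption, not the construction of A used in this thesis. Each observed orientation is modeled as a rotation about an unknown fixed axis followed by a constant rotation, the relative rotation between two observations then leaves the axis unchanged, and the axis is taken as the eigenvector of $A^{T}A$ with the smallest eigenvalue. The Jacobi eigen-solver and all names are hypothetical helpers.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rot_axis(axis, angle):
    """Rotation matrix about a unit axis (Rodrigues' formula)."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [[c + x * x * k, x * y * k - z * s, x * z * k + y * s],
            [y * x * k + z * s, c + y * y * k, y * z * k - x * s],
            [z * x * k - y * s, z * y * k + x * s, c + z * z * k]]

def jacobi_eigh(S, sweeps=30):
    """Eigenvalues and eigenvectors (columns) of a small symmetric matrix by Jacobi rotations."""
    n = len(S)
    A = [row[:] for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                diff = A[q][q] - A[p][p]
                phi = math.pi / 4 if abs(diff) < 1e-15 else 0.5 * math.atan(2.0 * A[p][q] / diff)
                J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
                J[p][p] = J[q][q] = math.cos(phi)
                J[p][q], J[q][p] = math.sin(phi), -math.sin(phi)
                A = matmul(transpose(J), matmul(A, J))  # zeroes the (p, q) entry
                V = matmul(V, J)
    return [A[i][i] for i in range(n)], V

def estimate_axis(rotations):
    """Unit axis b minimizing the stacked residuals of (R_i R_j^T - I) b."""
    rows = []
    for i in range(len(rotations)):
        for j in range(i + 1, len(rotations)):
            rel = matmul(rotations[i], transpose(rotations[j]))
            for r in range(3):
                rows.append([rel[r][c] - (1.0 if r == c else 0.0) for c in range(3)])
    values, vectors = jacobi_eigh(matmul(transpose(rows), rows))
    k = min(range(3), key=lambda i: values[i])
    b = [vectors[r][k] for r in range(3)]
    return [-v for v in b] if b[2] < 0 else b  # keep the z-component positive

# Synthetic observations: rotations about a known axis, times a constant rotation X.
true_axis = [1 / 3, 2 / 3, 2 / 3]
X = rot_axis([0.0, 1.0, 0.0], 0.4)
observations = [matmul(rot_axis(true_axis, math.radians(t)), X) for t in (0, 30, 60, 90)]
estimated_axis = estimate_axis(observations)
```

Under this assumed construction, each pair of observations contributes a term proportional to $\sin^2((\theta_i - \theta_j)/2)$ to the nonzero eigenvalues of $A^{T}A$, so nearly equal angles leave the smallest eigenvalue poorly separated from the others, which agrees with the guidance on choosing well-spread joint angles.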

Once $R_0$ is available, (14) can be rewritten accordingly, and the sign parameter $sign_t$ can be determined by minimizing the objective function in (18).

Since $R_0$ has been calibrated, the tilt axis can be moved when calibrating $R_1$. With some auxiliary definitions introduced for convenience, Equation (20) can be rewritten in a form that again leads to an eigenvalue problem. Solving it gives $\vec{b}_1$, which leads to the rotation matrix $R_1$. The sign parameter $sign_p$ for $\phi_p$ can also be determined by minimizing an objective function similar to (18).

The final orientation parameter $R_2$ can be computed by minimizing an objective function derived from (10). This constrained optimization problem can be solved with a method similar to the one proposed in [3].

2.3.2. Translation Parts

By substituting the estimated rotation matrices into (11), we obtain a set of linear equations in the translation parameters. By moving the pan and the tilt joints to different positions, we obtain an over-determined system in the translation parameters, which can be solved using the least-squares method.
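As an illustration of the least-squares step, the sketch below solves a generic over-determined linear system through its normal equations. The coefficient matrix of the actual rig comes from (11) and is not reproduced here, so the system used in the example is made up, and the function names are hypothetical.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def solve_linear(M, y):
    """Solve a square system M x = y by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def least_squares(A, y):
    """Minimize ||A x - y||^2 by solving the normal equations (A^T A) x = A^T y."""
    At = transpose(A)
    AtA = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in At] for row_i in At]
    Aty = [sum(a * b for a, b in zip(row, y)) for row in At]
    return solve_linear(AtA, Aty)

# Made-up over-determined system: six equations, three unknowns.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
x_true = [10.0, -20.0, 30.0]
y = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x_est = least_squares(A, y)
```

Normal equations are adequate for a small, well-conditioned system like this one; a QR- or SVD-based solver would be the more robust choice when the joint positions are poorly spread.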

2.3.3. Axes Adjustment

After solving the kinematic parameters of the motorized object rig, we can compute its forward kinematic model and, from it, the relation among the $\vec{z}_c$, $\vec{z}_t$ and $\vec{z}_p$ axes, i.e., the orientation and position of these three axes. First, the transformation matrix from the reference frame of the tilt axis to the CCS can be determined as ${}^{c}T_{t} = V_0$. Thus, the unit direction vector of the tilt axis $\vec{z}_t$, denoted by $\vec{O}_t$, can be derived as follows:

$\vec{O}_t = {}^{c}T_{t} \, [0\ 0\ 1\ 0]^{T} = V_0 \, [0\ 0\ 1\ 0]^{T}$ (27)

The position of the tilt axis, denoted by $\vec{P}_t$, is given by

$\vec{P}_t = {}^{c}T_{t} \, [0\ 0\ 0\ 1]^{T} = V_0 \, [0\ 0\ 0\ 1]^{T}$ (28)

Similarly, the orientation and position of the pan axis $\vec{z}_p$, denoted by $\vec{O}_p$ and $\vec{P}_p$, can be found to be

$\vec{O}_p = {}^{c}T_{t} \, {}^{t}T_{p} \, [0\ 0\ 1\ 0]^{T} = V_0 \, Rot_z(\theta_t) \, V_1 \, [0\ 0\ 1\ 0]^{T}$ (29)

and

$\vec{P}_p = {}^{c}T_{t} \, {}^{t}T_{p} \, [0\ 0\ 0\ 1]^{T} = V_0 \, Rot_z(\theta_t) \, V_1 \, [0\ 0\ 0\ 1]^{T}$ (30)

respectively.
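The direction and position of an axis can be read off a homogeneous transformation by applying it to the vectors [0 0 1 0] and [0 0 0 1], as in (29) and (30). The sketch below does this for the tilt and pan axes. The shape matrices `V0` and `V1` are made-up values for illustration, the rotation between them is assumed to be a rotation about the local z-axis by the tilt angle, and all function names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def matvec(T, v):
    return [sum(T[i][k] * v[k] for k in range(4)) for i in range(4)]

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def rigid(R, t):
    """Build a 4x4 homogeneous matrix from a 3x3 rotation and a translation."""
    return [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def axis_from_transform(T):
    """Direction and position, in the camera frame, of the z-axis of the frame mapped by T."""
    direction = matvec(T, [0.0, 0.0, 1.0, 0.0])[:3]
    position = matvec(T, [0.0, 0.0, 0.0, 1.0])[:3]
    return direction, position

def rig_axes(V0, V1, theta_t):
    """Tilt axis from V0, pan axis from V0 * Rot_z(theta_t) * V1."""
    tilt = axis_from_transform(V0)
    pan = axis_from_transform(matmul(matmul(V0, rot_z(theta_t)), V1))
    return tilt, pan

# Made-up shape matrices: V0 tips the z-axis by 90 degrees about x and sits 500 mm away.
ROT_X_90 = [[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
ROT_X_MINUS_90 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, -1.0, 0.0]]
V0 = rigid(ROT_X_90, [0.0, 0.0, 500.0])
V1 = rigid(ROT_X_MINUS_90, [0.0, 0.0, 0.0])
(tilt_dir, tilt_pos), (pan_dir, pan_pos) = rig_axes(V0, V1, math.pi / 6)
```

In this toy configuration both axes pass through the same point 500 mm in front of the camera, which is the kind of well-aligned layout the adjustment procedure aims for.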

By using equations (27)-(30), the positions and orientations of the three axes $\vec{z}_c$, $\vec{z}_t$ and $\vec{z}_p$ can be evaluated and illustrated as shown in Fig. 6(a). The positions of these three axes can then be adjusted to minimize the distance among them. In our experience, when the maximum distance among these three axes is smaller than a threshold of 15 mm, the effect of the misalignment of the three axes is negligible.
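The 15 mm criterion can be checked with a small routine such as the one below. It is only a sketch: it assumes that the "distance among the axes" is measured as the shortest distance between each pair of 3D lines, with each axis given by a point and a direction in millimetres, and the function names and sample axes are hypothetical.

```python
import math
from itertools import combinations

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def norm(a):
    return math.sqrt(dot(a, a))

def line_distance(p1, d1, p2, d2):
    """Shortest distance between two 3D lines, each given by a point and a direction."""
    n = cross(d1, d2)
    w = sub(p2, p1)
    if norm(n) < 1e-12:  # parallel lines: distance from p2 to the first line
        return norm(cross(w, d1)) / norm(d1)
    return abs(dot(w, n)) / norm(n)

def axes_aligned(axes, threshold_mm=15.0):
    """Return (is_aligned, worst_distance) over all pairs of axes."""
    worst = max(line_distance(pa, da, pb, db)
                for (pa, da), (pb, db) in combinations(axes, 2))
    return worst < threshold_mm, worst

# Hypothetical camera, tilt and pan axes as (point, direction) pairs, in mm.
axes = [
    ([0.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
    ([0.0, 0.0, 10.0], [1.0, 0.0, 0.0]),
    ([5.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
aligned, worst = axes_aligned(axes)
```

A rig whose worst pairwise distance exceeds the threshold would be adjusted and re-estimated, following the iterative procedure described at the beginning of this chapter.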

2.4. Experimental Results of Calibration

Our method is implemented on a PC with a 3.0 GHz Pentium 4 CPU and 1 GB of RAM, and the motorized object rig is the AutoQTVR. Fig. 6 shows the result before aligning the three axes of the rig: the estimates of the three axes are shown in Fig. 6(a), and the acquired OM of a toy shark is shown in Fig. 6(b). The estimation and adjustment process is repeated five times to align the three axes of the rig, and the result is shown in Fig. 7. The frontal view in Fig. 7(d) shows that, with our method, the tilt axis can be effectively adjusted to be perpendicular to the pan axis and the optical axis of the camera. Moreover, the top view in Fig. 7(d) shows that the intersections of the three axes are close enough. Some images of the OM of the toy shark are shown in Fig. 7(a). After the visual hull of the shark is constructed, as shown in Fig. 7(c), the centralization process can be performed, and the resulting OM is shown in Fig. 7(b).

The processing time of the calibration process (including capturing time and computation time) depends on the number of photographs used. To reduce the processing time, we have to use a small number of photographs. Therefore, we generate synthetic data to investigate how many photographs are needed and how the number of photographs relates to the accuracy of the estimated parameters. We use 3D Studio Max to render the PCC object with known camera parameters. Three sets of synthetic data with different numbers of images (48, 24 and 12 images, respectively) are generated. The 48-image set is obtained with four different tilt angles ($\theta_t$ = 90°, 60°, 30°, and 0°) and twelve different pan angles ($\phi_p$ from 0° to 330° at intervals of 30°). The 24-image set is taken with three different tilt angles (90°, 60°, 30°) and eight different pan angles, and the 12-image set is taken with three different tilt angles (90°, 60°, 30°) and four different pan angles. In the experiments, our method is applied to the three data sets to estimate the camera parameters, and the estimated parameters are compared with the ground truth. To quantify the error of the estimated camera parameters, some 3D points are randomly selected, their 2D positions are calculated using both the ground-truth and the estimated camera parameters, and the Euclidean distance between the ground-truth position and the estimated position is computed. The results are shown in Table 1. The processing time includes the shooting process and the camera parameter estimation, and the error is the mean Euclidean distance. From our experiments, we found that 12 images are enough to obtain a set of highly accurate parameters. That is, we only need to take 12 pictures in each adjustment-calibration round, and the processing time needed, including capturing and processing, is about 7 minutes.

Table 1. Processing time and accuracy of the calibration process

Number of Images    Processing Time    Euclidean Distance
48                  About 25 min       1.62
24                  About 8 min        1.63
12                  About 7 min        1.63
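The error measure reported in Table 1 can be sketched as follows. This is an illustration under assumptions, not the evaluation code of the thesis: cameras are represented by 3x4 projection matrices, the sample camera values are made up, and the names `project`, `camera_matrix` and `mean_reprojection_error` are hypothetical.

```python
import math
import random

def project(P, point):
    """Project a 3D point with a 3x4 projection matrix and return pixel coordinates (u, v)."""
    X = list(point) + [1.0]
    x, y, w = (sum(P[r][c] * X[c] for c in range(4)) for r in range(3))
    return x / w, y / w

def camera_matrix(f, cx, cy, tz):
    """K [I | t] for a camera with focal length f, principal point (cx, cy) and t = (0, 0, tz)."""
    return [[f, 0.0, cx, cx * tz], [0.0, f, cy, cy * tz], [0.0, 0.0, 1.0, tz]]

def mean_reprojection_error(P_true, P_est, points):
    """Mean Euclidean distance between the projections under the two cameras."""
    total = 0.0
    for X in points:
        u1, v1 = project(P_true, X)
        u2, v2 = project(P_est, X)
        total += math.hypot(u1 - u2, v1 - v2)
    return total / len(points)

# Made-up example: the estimated principal point is off by 2 pixels in u.
random.seed(0)
points = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(100)]
P_true = camera_matrix(1000.0, 320.0, 240.0, 5.0)
P_est = camera_matrix(1000.0, 322.0, 240.0, 5.0)
error = mean_reprojection_error(P_true, P_est, points)
```

Because the only discrepancy in this example is a 2-pixel shift of the principal point, every projected point moves by the same amount and the mean error equals that shift.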


Fig. 6. The OM of the toy shark before calibration. (a) shows the estimated relation among 3 axes, and (b) shows the OM of the toy shark. The cross markers indicate the center of images.


Fig. 7. The result of the toy shark experiment. (a) shows some images of the OM of the toy shark after calibration, while (b) shows the OM after centralization. (c) shows the visual hull of the shark, and (d) shows the estimated axes after calibration.

Chapter 3

Background Removal

To reduce user intervention, the OM background removal system is designed around the following basic idea. First, automatic segmentation is applied to obtain initial segmentation results. If some results are unsatisfactory, the user can correct one of them through the provided user interface. After modification, the corrected result is automatically propagated to the other images and used to refine the segmentation results.

In this work, we treat the segmentation problem as a labeling problem. For a given OM, we assign a label to every pixel. The labels are F (Foreground), B (Background), and U (Uncertain), and the image used to record the labels is called the trimap. The OM notation to which we will refer is defined as follows. Let $I_{\theta,\phi}$ denote the image taken at pan angle $\theta$ and tilt angle $\phi$. An equi-tilt set, denoted by $O_\phi$, consists of the images of the OM that share the same tilt angle $\phi$. Fig. 8 shows a portion of the two equi-tilt sets that are contained in the OM of the pottery owl.

Based on this idea, the flowchart of our system is shown in Fig. 9. It includes three main stages: initial labeling, label updating, and alpha estimation. For initial labeling, we extract reliable foreground and background pixels based on some OM characteristics. The details are described in Section 3.1. For label updating, U pixels are updated using spatial and temporal coherence based on the extracted foreground and background. After label updating, the intermediate segmentation may contain some misclassified pixels. To classify these pixels correctly, the user can make modifications at this point through the provided user interface. After modification, the label updating stage is applied again to obtain more accurate results. After user intervention, most pixels are classified as foreground or background, except for the pixels that may be composites of the foreground and background. For alpha estimation, the method proposed by Chuang et al. [13] can be applied to calculate the alpha value of each U pixel. Using the alpha values, we can produce a smoothly blended contour when we integrate the OM into a new background.
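To make the role of the alpha value concrete, the sketch below composites a foreground color over a new background with the standard compositing equation C = alpha * F + (1 - alpha) * B. It does not implement the alpha estimation of [13]; the alpha values are simply taken as given, and the function names are hypothetical.

```python
def composite_pixel(fg, bg, alpha):
    """Blend one RGB foreground color over a background color with weight alpha in [0, 1]."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))

def composite_image(fg_img, bg_img, alpha_map):
    """Blend two images (nested lists of RGB tuples) pixel by pixel using an alpha map."""
    return [
        [composite_pixel(f, b, a) for f, b, a in zip(fg_row, bg_row, a_row)]
        for fg_row, bg_row, a_row in zip(fg_img, bg_img, alpha_map)
    ]

# One row of three pixels: pure foreground, a mixed contour pixel, pure background.
fg_img = [[(200, 100, 0), (200, 100, 0), (200, 100, 0)]]
bg_img = [[(0, 100, 200), (0, 100, 200), (0, 100, 200)]]
alpha_map = [[1.0, 0.25, 0.0]]
blended = composite_image(fg_img, bg_img, alpha_map)
```

Pixels labeled F and B correspond to alpha values of 1 and 0, so only the U pixels along the contour receive fractional values and are actually mixed.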

Fig. 8. Part of two different equi-tilt sets before applying the OM segmentation method. Except for the two leftmost images in the figure, the images in this thesis are cropped in order to show more examples.

In this thesis, two approaches are proposed to propagate the corrected information. The first method utilizes motion vectors to propagate the corrected information to other frames in which segmentation errors occur. The details are described in Section 3.2. The method works well in most cases, but requires more user intervention in some cases because of erroneous motion estimates.

The situation could be even worse for the first method. To compute the motion field, the motion estimator usually assumes that the sampling rate of the video camera is high enough to keep the frame-to-frame motion small. However, to keep the data size and cost reasonable, the sampling rate of the OM is generally low. A popular alternative approach is to interpolate the dense motion field from a set of image correspondences. Because the difference between the images is caused entirely by changes in the 3D viewpoint, the perspective distortion makes the correspondence problem extremely difficult. In our experience, generating enough correspondences is still a problem, even with popular tools such as the KLT feature tracker [52] or SIFT features [34]. Additionally, to filter out potentially false correspondences, the class of the transformation, e.g., translational, affine, or a more complex transformation, needs to be considered so that the images can be aligned as accurately as possible and a robust estimation can be performed. Translational motion is often the dominant transformation in many of the video sources used to demonstrate information propagation schemes. However, the nature of the transformation present in the OM cannot be easily modeled without 3D object information. In practice, without some user intervention or knowledge of the 3D information, a usable motion field between any possible pair of neighboring images in the OM is quite hard to compute. Therefore, the second approach is proposed to propagate the corrected information efficiently by learning shape priors. The details are described in Section 3.3.

Fig. 9. The flowchart of segmentation method.

3.1. Initial Labeling

From our observation, an OM has three basic characteristics that help the method generate the trimap:

1. When an equi-tilt set of the OM is captured, a large proportion of the background scene is static.

2. Only one object of interest is present in every image of the OM.

3. The foreground and background color distributions are distinct in most cases.

The trimap labeling method comprises B-labeling and F-labeling. Each equi-tilt set of the OM is processed individually by the trimap labeling method. Given an equi-tilt set $O_\phi$, the trimap of each image in $O_\phi$ is initialized to U. During B-labeling, pixels are examined and labeled B based on the color difference. During F-labeling, all pixels that are still labeled U are examined and labeled F based on the background model.

1.) B-labeling: By the first characteristic, if the color of a pixel barely varies throughout the equi-tilt set $O_\phi$, then the pixel should belong to the background and be labeled B. Since an equi-tilt set $O_\phi$ can be treated as a short video sequence, a pixel is labeled B by examining its color difference with respect to the corresponding pixels in both directions of the video sequence. Let $p = [u\ v]^{T}$ denote a pixel of a video frame $I_t$, i.e., an image of the equi-tilt set $O_\phi$. Let $I_t(p)$ be the color of pixel p in the frame $I_t$, and let $N_t$ be the set of neighboring frames of $I_t$. To reduce the effect of camera noise and to account for color changes caused by the lighting, a measure based on the block color difference with respect to the mean is used.
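The B-labeling idea can be sketched as follows. The exact measure is not reproduced in this section, so the sketch makes several assumptions that should be read as illustrative only: frames are grayscale images stored as nested lists, the neighboring frames are the previous and next frames of the sequence, the block difference subtracts each block's mean before comparing, and the block radius and threshold are arbitrary values. All function names are hypothetical.

```python
def block(frame, u, v, r):
    """Values in the (2r+1)x(2r+1) block centered at column u, row v, clipped to the image."""
    h, w = len(frame), len(frame[0])
    return [frame[y][x]
            for y in range(max(0, v - r), min(h, v + r + 1))
            for x in range(max(0, u - r), min(w, u + r + 1))]

def block_difference(frame_a, frame_b, u, v, r):
    """Mean absolute difference between two blocks after removing each block's mean."""
    a, b = block(frame_a, u, v, r), block(frame_b, u, v, r)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum(abs((x - mean_a) - (y - mean_b)) for x, y in zip(a, b)) / len(a)

def b_labeling(frames, t, r=1, threshold=5.0):
    """Label a pixel of frame t as B when it barely changes in both neighboring frames."""
    h, w = len(frames[t]), len(frames[t][0])
    neighbors = [frames[s] for s in (t - 1, t + 1) if 0 <= s < len(frames)]
    trimap = [["U"] * w for _ in range(h)]  # every pixel starts as Uncertain
    for v in range(h):
        for u in range(w):
            if all(block_difference(frames[t], nb, u, v, r) < threshold for nb in neighbors):
                trimap[v][u] = "B"
    return trimap

def make_frame(offset, obj_col):
    """6x6 test frame: flat background with a brightness offset and a 2x2 bright object."""
    frame = [[50 + offset for _ in range(6)] for _ in range(6)]
    for v in (2, 3):
        for u in (obj_col, obj_col + 1):
            frame[v][u] = 200
    return frame

# Three frames of a toy sequence: the object moves while the lighting changes slightly.
frames = [make_frame(0, 0), make_frame(3, 2), make_frame(-2, 4)]
trimap = b_labeling(frames, t=1)
```

Removing the block mean makes the measure insensitive to a uniform brightness change, so the static background is labeled B despite the lighting offsets, while pixels near the moving object stay U and are left for F-labeling and the later stages.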
