Relative Pose Estimation - 基於深度學習方法之相機定位

Chapter 3 Method

3.4 Relative Pose Estimation

As shown in Fig. 3-1, our framework contains two cases, 2D-2D and 2D-3D case, and the difference comes from using depth information or not. In this section, we introduce these cases step by step.

3.4.1 2D-2D case

With correspondence computed from Section 3.3., we can estimate the essential matrix by five-point [41] method in RANSAC [7] loop. Essential matrix can be decomposed into four relative poses, as shown in Fig. 3-3., R, t , R, −t , R⁽, t , (R⁽, −t).

Next, we can verify the correct combination by examining the depth is positive or not based on matched feature points.

Fig. 3-3 Four cases of R/t

Since we cannot determine the correct scale of t, the absolute poses cannot directly

candidate image. Hence, we reformulate the problem to pose hypothesis filtering with RANSAC loop. This process is illustrated in Fig. 3-4.

Fig. 3-4 Pose filtering with RANSAC.

The relative pose estimated from 2D-2D matched features can be expressed as a line. R_,- and t_,- denote the rotation and translation matrix of training image. R_-. and t_-. denote the rotation and translation matrix of relative pose estimated from pipeline, and 𝜶 denote the unknown scale. We express the estimated pose of query image multiplied by relative pose as shown in Equation (3-1).

𝑅₁₂ 𝜶𝑡₁₂

0 1 𝑅₆₁ 𝑡₆₁

0 1 = 𝑅¹²𝑅₆₁ 𝑅₁₂𝑡₆₁+ 𝜶𝑡₁₂

0 1 (3-1)

The absolute camera pose 𝑐 can be described as Equation (3-2) with its camera

projection matrix 𝑅 and 𝑡 . Therefore, with the parameters in Equation (3-1), the estimation line is written as Equation (3-3).

𝑐 = −𝑅^:𝑡 (3-2)

𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑙𝑖𝑛𝑒 = −𝑅₆₁^:𝑅₁₂^: 𝑅₁₂𝑡₆₁− 𝜶𝑅₆₁^:𝑅₁₂^: 𝑡₁₂ (3-3)

In RANSAC loop, we random select two pairs from Top-K retrieved pairs. The hypothesis pose is the closest point to the two selected estimation line. We can easily obtain the point from the midpoint of the common perpendicular line. P_D and P_E are the point in the estimation line, while v_D and v_E are the vector of the lines respectively.

𝑃_D+ 𝑡𝑣_D + 𝑢 𝑣_D×𝑣_E = 𝑃_E + 𝑠𝑣_E (3-4)

ℎ𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝑝𝑜𝑠𝑒 = ^N^O^P6Q^O^PN_E ^R^PSQ^R (3-5)

The orientation of hypothesis is obtained by the spherical linear interpolation

(SLERP) of two quaternions. q_U and q_V are the quaternions of estimated pose from hypothesis pair and ⊗ denotes SLERP operation.

𝑞 = 𝑞_Y ⊗ 𝑞_V (3-6)

To evaluate a hypothesis, we find the distance between hypothesis pose and each estimation line. If the distance is under the threshold, that estimation line would be considered as one of the inlier line set. P_. and v_. is the point and vector of estimation line and P_Z is a point in 3D coordinate from hypothesis pose.

𝑒𝑟𝑟𝑜𝑟 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = ^N^]^N_Q^{^}^×Q^]

] (3-7)

After hypothesis pose passes the inlier test, the final hypothesis should be refined by all the inlier line. The problem is transformed into finding the closest point to multiple estimation lines. We first formulate single line:

_`__a

Then we can obtain the equation of multiple lines assuming that the hypothesis pose lies on the intersection of lines perfectly.

𝐴𝑥 = 𝑏 (3-9) provide solutions once A is rank-deficient. By using SVD, we derive the general solutions of the nearest approach for all cases of multiple lines in 3D space.

𝑥 = 𝐴^P∗ 𝑏 (3-11)

In our approaches, we will examine the number of inlier of 𝑥 to ensure the quality of final hypothesis.

Fig. 3-5 Illustration of closet point to four non-intersecting lines in 3D space As for orientation part of multiple inliers, we intuitively calculate the average of all quaternion vector by SLERP.

3.4.2 2D-3D case

The 2D pixel can be inverse projected into 3D coordinate with known camera intrinsic, camera pose within specific scene, and depth information. After finding correspondence described in Section 3.3, we can solve the localization problem using PnP algorithm since the matched pixels in candidate image are interpreted as 3D points in scene as shown in Fig. 3-6. In RANSAC loop, we adopt P3P to generate the pose hypothesis and set the minimum number inlier threshold as 12. Next, we optimize the pose hypothesis by minimizing the projection error of all inliers points. In this part we simply use the opencv library solvePnPRansac and solvePNP.

Fig. 3-6 Illustration of 2D-3D case

3.5 Depth Estimation

Review our proposed framework, we would need a dense depth map of candidate RGB images when performing 2D-3D pose estimation. Even though some datasets have collected depth information already like 7-Scenes dataset, in this work we want to calculate all the information from solely RGB information.

3.5.1 Depth Estimation Framework

Inspired by self-supervised learning framework of Zhou et al. [33], we modify the origin framework to reconstruct 3D scene depth from continuous image sequence. Since we have the ground truth of each training image pose, the framework would be able to take precise 6-DoF information as one of the supervisory signal to reconstruction. In other words, our procedure doesn’t need to predict the pose of each training image and we aim on training the network predicting the depth information.

The depth estimation framework consists of a neural network, DepthNet, taking RGB frames as input and predicting depth map as output shown in Fig. 3-7. Let D_, denote predicted depth of target view, and T_,→‹ denote the relative pose calculated from ground truth pose T_‹ and T_,. With known camera intrinsic K, we can project p_,, the homogeneous coordinate of a pixel from target view, onto source view p_‹ by

𝑝_S~𝐾𝑇_6→S𝐷₆ 𝑝₆ 𝐾^`D𝑝₆ (3-12)

At this point, the 3D projection error can be computed by the difference between the source view projected from target view and the original source view. After sufficient training epochs, the tuned DepthNet is able to inference more steady and optimized depth result.

Fig. 3-7 Depth estimation training and inference procedure

3.5.2 Loss Term

In this section, we will present each component of our loss, including appearance matching loss and geometric loss. Our overall objective function can be formulated as follows:

ℒ = 𝛼ℒ_b“+ 𝛽ℒ_• (3-13)

where ℒ_–— stands for proposed appearance matching loss and ℒ_˜ stands for geometric loss.

Appearance Matching Loss

The key supervisory signal for our depth estimation is calculated from view synthesis. During the training, we project the source view image I_Uš^‹ to target view image I_Uš^, by inverse warping with estimated D_‹ and ground truth relative pose T_‹→,. The appearance matching loss is described by the appearance difference between target view image I_Uš^, and synthesized image I_›œ^,. Follow the work [42], we use the combination of two terms, L1 distance and single scale SSIM [43], describing the photometric difference as ℒ_–—^, :

ℒ_b“⁶ = ^D_• 𝛼^{D`žžŸ Ÿ}^a¡

¢,Ÿ_£¤^¢

E + 1 − 𝛼 𝐼_YV⁶ − 𝐼_¦§⁶ _D

Y,V ∈• (3-14)

Here, V denotes the points that successfully project from the source view to the target view, while successful projection means that the result pixel coordinate falls in the target image. V means the number of the successful points. We use the simplified SSIM with a 3×3 block filter and set α to 0.5 as a fixed weight.

Geometric Loss

To enforce the geometry consistency on predicted result, we aim on minimize the difference between the depth values of correspondence pixel from different view. With the relative pose T_,→‹ computed from ground truth pose, we can project the predicted

[35] and simply define the geometric loss as:

ℒ_˜ 𝑡, 𝑠 = _•^D ^¬_¬^¢→- ^{“ `¬}^-^® ^“

¢→- “ P¬_-^® “

“∈• (3-15)

where D_‹⁽ is the interpolated depth map from the estimated map D_‹. The difference obtained from each input frame is normalized by the sum of two depth map.

在文檔中基於深度學習方法之相機定位 (頁 21-29)