Chapter 3 Method
3.4 Relative Pose Estimation
As shown in Fig. 3-1, our framework contains two cases, 2D-2D and 2D-3D case, and the difference comes from using depth information or not. In this section, we introduce these cases step by step.
3.4.1 2D-2D case
With correspondence computed from Section 3.3., we can estimate the essential matrix by five-point [41] method in RANSAC [7] loop. Essential matrix can be decomposed into four relative poses, as shown in Fig. 3-3., R, t , R, −t , R(, t , (R(, −t).
Next, we can verify the correct combination by examining the depth is positive or not based on matched feature points.
Fig. 3-3 Four cases of R/t
Since we cannot determine the correct scale of t, the absolute poses cannot directly
candidate image. Hence, we reformulate the problem to pose hypothesis filtering with RANSAC loop. This process is illustrated in Fig. 3-4.
Fig. 3-4 Pose filtering with RANSAC.
The relative pose estimated from 2D-2D matched features can be expressed as a line. R,- and t,- denote the rotation and translation matrix of training image. R-. and t-. denote the rotation and translation matrix of relative pose estimated from pipeline, and 𝜶 denote the unknown scale. We express the estimated pose of query image multiplied by relative pose as shown in Equation (3-1).
𝑅12 𝜶𝑡12
0 1 𝑅61 𝑡61
0 1 = 𝑅12𝑅61 𝑅12𝑡61+ 𝜶𝑡12
0 1 (3-1)
The absolute camera pose 𝑐 can be described as Equation (3-2) with its camera
projection matrix 𝑅 and 𝑡 . Therefore, with the parameters in Equation (3-1), the estimation line is written as Equation (3-3).
𝑐 = −𝑅:𝑡 (3-2)
𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑖𝑜𝑛 𝑙𝑖𝑛𝑒 = −𝑅61:𝑅12: 𝑅12𝑡61− 𝜶𝑅61:𝑅12: 𝑡12 (3-3)
In RANSAC loop, we random select two pairs from Top-K retrieved pairs. The hypothesis pose is the closest point to the two selected estimation line. We can easily obtain the point from the midpoint of the common perpendicular line. PD and PE are the point in the estimation line, while vD and vE are the vector of the lines respectively.
𝑃D+ 𝑡𝑣D + 𝑢 𝑣D×𝑣E = 𝑃E + 𝑠𝑣E (3-4)
ℎ𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠 𝑝𝑜𝑠𝑒 = NOP6QOPNE RPSQR (3-5)
The orientation of hypothesis is obtained by the spherical linear interpolation
(SLERP) of two quaternions. qU and qV are the quaternions of estimated pose from hypothesis pair and ⊗ denotes SLERP operation.
𝑞 = 𝑞Y ⊗ 𝑞V (3-6)
To evaluate a hypothesis, we find the distance between hypothesis pose and each estimation line. If the distance is under the threshold, that estimation line would be considered as one of the inlier line set. P. and v. is the point and vector of estimation line and PZ is a point in 3D coordinate from hypothesis pose.
𝑒𝑟𝑟𝑜𝑟 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = N]NQ^×Q]
] (3-7)
After hypothesis pose passes the inlier test, the final hypothesis should be refined by all the inlier line. The problem is transformed into finding the closest point to multiple estimation lines. We first formulate single line:
_`_a
Then we can obtain the equation of multiple lines assuming that the hypothesis pose lies on the intersection of lines perfectly.
𝐴𝑥 = 𝑏 (3-9) provide solutions once A is rank-deficient. By using SVD, we derive the general solutions of the nearest approach for all cases of multiple lines in 3D space.
𝑥 = 𝐴P∗ 𝑏 (3-11)
In our approaches, we will examine the number of inlier of 𝑥 to ensure the quality of final hypothesis.
Fig. 3-5 Illustration of closet point to four non-intersecting lines in 3D space As for orientation part of multiple inliers, we intuitively calculate the average of all quaternion vector by SLERP.
3.4.2 2D-3D case
The 2D pixel can be inverse projected into 3D coordinate with known camera intrinsic, camera pose within specific scene, and depth information. After finding correspondence described in Section 3.3, we can solve the localization problem using PnP algorithm since the matched pixels in candidate image are interpreted as 3D points in scene as shown in Fig. 3-6. In RANSAC loop, we adopt P3P to generate the pose hypothesis and set the minimum number inlier threshold as 12. Next, we optimize the pose hypothesis by minimizing the projection error of all inliers points. In this part we simply use the opencv library solvePnPRansac and solvePNP.
Fig. 3-6 Illustration of 2D-3D case
3.5 Depth Estimation
Review our proposed framework, we would need a dense depth map of candidate RGB images when performing 2D-3D pose estimation. Even though some datasets have collected depth information already like 7-Scenes dataset, in this work we want to calculate all the information from solely RGB information.
3.5.1 Depth Estimation Framework
Inspired by self-supervised learning framework of Zhou et al. [33], we modify the origin framework to reconstruct 3D scene depth from continuous image sequence. Since we have the ground truth of each training image pose, the framework would be able to take precise 6-DoF information as one of the supervisory signal to reconstruction. In other words, our procedure doesn’t need to predict the pose of each training image and we aim on training the network predicting the depth information.
The depth estimation framework consists of a neural network, DepthNet, taking RGB frames as input and predicting depth map as output shown in Fig. 3-7. Let D, denote predicted depth of target view, and T,→‹ denote the relative pose calculated from ground truth pose T‹ and T,. With known camera intrinsic K, we can project p,, the homogeneous coordinate of a pixel from target view, onto source view p‹ by
𝑝S~𝐾𝑇6→S𝐷6 𝑝6 𝐾`D𝑝6 (3-12)
At this point, the 3D projection error can be computed by the difference between the source view projected from target view and the original source view. After sufficient training epochs, the tuned DepthNet is able to inference more steady and optimized depth result.
Fig. 3-7 Depth estimation training and inference procedure
3.5.2 Loss Term
In this section, we will present each component of our loss, including appearance matching loss and geometric loss. Our overall objective function can be formulated as follows:
ℒ = 𝛼ℒb“+ 𝛽ℒ• (3-13)
where ℒ–— stands for proposed appearance matching loss and ℒ˜ stands for geometric loss.
Appearance Matching Loss
The key supervisory signal for our depth estimation is calculated from view synthesis. During the training, we project the source view image IUš‹ to target view image IUš, by inverse warping with estimated D‹ and ground truth relative pose T‹→,. The appearance matching loss is described by the appearance difference between target view image IUš, and synthesized image I›œ,. Follow the work [42], we use the combination of two terms, L1 distance and single scale SSIM [43], describing the photometric difference as ℒ–—, :
ℒb“6 = D• 𝛼D`žžŸ Ÿa¡
¢,Ÿ£¤¢
E + 1 − 𝛼 𝐼YV6 − 𝐼¦§6 D
Y,V ∈• (3-14)
Here, V denotes the points that successfully project from the source view to the target view, while successful projection means that the result pixel coordinate falls in the target image. V means the number of the successful points. We use the simplified SSIM with a 3×3 block filter and set α to 0.5 as a fixed weight.
Geometric Loss
To enforce the geometry consistency on predicted result, we aim on minimize the difference between the depth values of correspondence pixel from different view. With the relative pose T,→‹ computed from ground truth pose, we can project the predicted
[35] and simply define the geometric loss as:
ℒ˜ 𝑡, 𝑠 = •D ¬¬¢→- “ `¬-® “
¢→- “ P¬-® “
“∈• (3-15)
where D‹( is the interpolated depth map from the estimated map D‹. The difference obtained from each input frame is normalized by the sum of two depth map.