國立臺灣大學電機資訊學院電子工程學研究所 碩士論文
Graduate Institute of Electronics Engineering College of Electrical Engineering and Computer Science
National Taiwan University Master Thesis
針對單一平面目標物之三維姿態的直接分析:
演算法和系統實作
Direct 3D Pose Estimation of a Planar Target:
Algorithm and Implementation
曾泓諭 Hung-Yu Tseng
指導教授:簡韶逸 博士 Advisor: Shao-Yi Chien, Ph.D.
中華民國 105 年 6 月 June 2016
在完成這篇碩士論文的當下,心中只有滿滿的感謝。在研究的這條路上遇到 了許多貴人,給了我許多的幫助,其中我要特別感謝三個人。第一位是我的指導 老師簡韶逸教授,從大三修習老師的專題研究開始,便很敬佩老師的研究態度,
特別是對於吸收新知的熱情。老師給予我很大的研究空間,並沒有特別限制我研 究的走向。除了指導之外,老師更會盡其所能的尋找相關資源,甚至是實驗室外 的專業人士來幫助我們。對於能擁有一位為學生著想的指導老師,我滿懷感恩。
第二位是我的共同指導老師楊明玄老師,在 2015 年春天第一次與老師 Skype 通話以來,就深深的感受到老師對於台灣學生們的照顧,包括投稿 Conference 時 帶著我們日以繼夜的修改 paper、每周 meeting 時給予許多寶貴的意見、幫助我們 找實習、還有在我們低潮時鼓勵我們,特別是 how to survive in this business。回想 起來,真的很感謝老師如此的照顧與關心我們。
第三位是我的實驗室學長吳柏辰學長,學長是我研究生涯中最重要的貴人。
從專題以來,無論是基礎知識、看 paper 的能力、做投影片、上台報告、做研究甚 至是寫程式,學長都是不厭其煩的給予我指導和幫助。除此之外,學長做事情嚴 謹的態度也讓我印象深刻。當然這兩年一起待在博理 421 的日子,投稿、寫程式、
討論問題、看 LOL 比賽還有練肖話,我想我是不會忘記的。真的很謝謝學長!
最後要感謝我的家人。爸爸、媽媽、妹妹、恰恰、爺爺、奶奶、外公和外婆,
始終默默地支持並鼓勵我。我的女友婷瑋,陪伴了我走過這段時光,你們是我能 夠一直向前的最佳動力。以及我的好朋友們,人碩、明仁、昇勳、乃群、冠豪、
令儀、國婷、昱廷、則安、宗緯、廷璋、致睿等等族繁不及備載,為這段日子增 添了許多色彩。最後是實驗室學長們,偉志、嘉洋、熊、柯楊、強強林、岳穎、
明倫、培恒,以及陳凱、賓四,給予了我研究和課業上許多幫助。謝謝你們!
中文摘要
近年來,隨著擴充實境與機器人學的發展,如何即時且準確地分析已知的單 一平面目標物其三維姿態成為了一個重要的議題。即使過去十幾年中,不少有效 率的系統陸續被提出,但由於這些系統只能針對特定的平面目標物,如基準標記 或含有簡易封閉曲線的平面目標物,此問題仍舊缺少一個適用於任意平面目標物 的解決方法。目前針對此問題最好的解決方法為基於特徵點的方法,但是此方法 必須在目標物與相機圖片中的特徵點能對應的前提下才能運作。
為了解決這個問題,在本篇碩士論文中,我們提出了一個表現穩定、能針對 任意平面目標物的直接分析演算法。首先,我們採用模板匹配的概念求出一個近 似的三維姿態。接下來針對此近似的三維姿態,我們提出了一個梯度下降尋找的 演算法來求出更為精準的三維姿態。更進一步,基於所提出的演算法,我們在圖 形處理器上實作了一套分析和追蹤平面目標物之三維姿態的系統。此系統包含了 分析單元和追蹤單元兩部分。分析單元負責計算出起始的三維姿態,是基於我們 提出的演算法所設計的;追蹤單元則負責追蹤三維姿態,其所使用的方法為我們 提出的一種三階層搜尋法。在系統中,無論是分析單元還是追蹤單元都充分利用 了圖形處理器中平行運算的優點,使得我們的系統能夠非常有效率的運作。我們 透過大量的實驗,證明我們所提出的演算法和系統,其表現都比目前基於特徵點
的方法要更精準而且穩定。並於實際應用中,我們的系統達到了每秒 11幀的運算
速度。
Direct 3D Pose Estimation of a Planar Target: Algorithm and Implementation
Hung-Yu Tseng Advisor: Shao-Yi Chien Co-Advisor: Ming-Hsuan Yang
Graduate Institute of Electronics Engineering National Taiwan University
Taipei, Taiwan
June 2016
Abstract
Real-time estimating and tracking accurate 3D poses of a known planar target from a calibrated camera are essential for augmented reality and robotics. Al- though numerous efficient systems have been proposed in the past few decades, it remains a challenging task since the planar targets are limited to fiducial markers and targets with simple contours. The feature-based schemes are the state-of-the- art solutions for obtaining poses of arbitrary planar targets. However the success hinges on whether feature points can be extracted and matched correctly on targets with rich texture.
In this thesis, we propose a robust direct method for 3D pose estimation with high accuracy that performs well on both texture and textureless planar targets.
First, the pose of a planar target with respect to a calibrated camera is approxi- mated estimated by posing it as a template matching problem. Next, the object pose is further refined and disambiguated with a gradient descent search scheme.
In order to make the proposed algorithm applicable, we also develop D-PET, a direct 3D pose estimation and tracking system implemented on graphics comput- ing units (GPU) which is able to obtain poses in real-time. The system consists of a pose estimation unit and a pose tracker. The pose estimation unit is built based on the approximated pose estimation scheme in the proposed algorithm to find the initial pose. A 3-scale search scheme is proposed for the pose tracker to track the pose precisely. Both of them utilize the characteristics of GPU and accomplish the work efficiently. Extensive experiments on both synthetic and real datasets demonstrate that both the proposed algorithm and system perform favor-
i
ably against state-of-the-art feature-based approaches in terms of accuracy and robustness. The proposed system achieves a processing speed of 11 fps on an embedded GPU.
Contents
Abstract i
List of Figures v
List of Tables ix
1 Introduction 1
1.1 3D Pose Estimation . . . 1
1.2 Pose Ambiguity . . . 2
1.3 Contribution . . . 3
1.4 Thesis organization . . . 5
2 Background Knowledge and Related Works 7 2.1 Problem Definition . . . 7
2.2 State-of-the-Art Pose Estimation . . . 9
2.2.1 Marker-based Pose Estimation . . . 9
2.2.2 Feature-based Pose Estimation . . . 10
2.3 Template Matching . . . 11
2.4 Motion Estimation . . . 12
3 Direct Pose Estimation and Tracking 13 3.1 Approximated Pose Estimation . . . 13
3.1.1 ε-Covering Set Construction . . . 14
3.1.2 Coarse-to-Fine Estimation . . . 18 iii
3.2 Pose Refinement . . . 21
3.2.1 Candidate Poses Exploration . . . 22
3.2.2 Refining Pose Estimation . . . 22
4 Experiments 25 4.1 Synthetic Dataset . . . 26
4.1.1 Normal Condition . . . 27
4.1.2 Varying Conditions . . . 29
4.1.3 Refinement Analysis . . . 29
4.2 Real Dataset . . . 38
4.3 Runtime Comparison . . . 42
4.4 Textureless Images . . . 42
5 Direct Pose Estimation and Tracking System 45 5.1 Proposed System . . . 45
5.1.1 Initialization . . . 46
5.1.2 Coarse-to-Fine Estimation. . . 47
5.1.3 Pose Tracker . . . 49
5.2 Evaluation . . . 51
5.2.1 Pose Estimation Unit . . . 52
5.2.2 Pose Tracker . . . 54
5.3 Practical Tests . . . 56
6 Conclusion 59
Reference 61
List of Figures
1.1 Examples of the pose ambiguity. First row: origin images. Sec- ond row: images are rendered magenta boxes according to wrong ambiguous poses. Third row: images are rendered cyan boxes according to correct poses. . . 3 2.1 Coordinate system between the planar target and the camera image. 8 2.2 The algorithm flow of feature-based pose estimation methods. . . 11
3.1 The proposed algorithm consists of two steps. The pose is esti- mated in the first stage and refined in the second stage. . . 14 3.2 (a) Illustration of rotation matrix factorization. θx indicates the
tilt angle between the camera and the target. (b) 2D illustration of rotation around Zt-axis. The linear distance (orange solid line) between points before and after applying rotation is bounded by the arc length (brown dotted line). (c) 3D illustration of rotation around Zt-axis. The linear distance between points is a function of tilt angle θx. . . 16 3.3 An illustration of proposed coarse-to-fine estimation. . . 19 3.4 An example of the coarse-to-fine estimation. According to the
pose pb with minimum Eaestimated in each round of the coarse- to-fine estimation, the projected target image is rendered on the camera image. . . 20
v
3.5 (a) 2D illustration of coarse-to-fine gradient descent search. We carry out the GDS on the coarse level in the beginning. After reaching the local minimum, we move to the finer level and re- peat this coarse-to-fine GDS approach until we obtain minimum within the desired precision. (b) The checking pattern in 2D view, including 1 center checking point and its 4 neighnbors (2 neighn- bors per dimension). . . 23 4.1 Distributions of rotation and translation errors over experiments.
The horizontal lines correspond to the thresholds used to detect unsuccessfully estimated poses. There is a total of 15, 289 poses estimated by each pose estimation approach. . . 26 4.2 The test image was generated from a warping target image accord-
ing to the randomly generated pose on randomly chosen back- ground image. . . 27 4.3 Experimental results on synthetic data under conditions blur and
JPEG compression. D0, D1, D2 indicates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 30 4.4 Experimental results on synthetic data under conditions intensity
and tilt angle. D0, D1, D2 indicates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 31 4.5 Pose estimation results with and without refinement approaches.
The average value of rotation and translation error are both re- duced by the refinement approach. . . 35 4.6 The proposed method without refinement (w/o), refinement with
one candidate (w/ 1), and refinement with two candidates (w/ 2) are evaluated. (a) Distribution of errors. (b) The difference of pose errors before and after applying two kinds of refinement approaches. 35
vii 4.7 Illustration to addressing pose ambiguity problems. (b), (c) are
two cases that the refinement process with two candidate poses performs worse than with one candidate pose. The red dash line indicates the location of the ground truth pose. The purple circle indicates the first initial pose which is the input of the refinemen- t scheme, and the orange circle indicates the second initial pose.
The purple and orange triangle represent the refined first and sec- ond poses, respectively. . . 37
4.8 Estimation results by the proposed direct method on real images under different conditions. The success cases are represented with rendered cyan boxes, and the failure cases are represented with rendered magenta boxes. . . 39
4.9 Pose estimation results on textureless images by proposed direct method. First column: pose estimation target. Other columns:
pose estimation results with rendered box. . . 44
5.1 State chart of proposed D-PET system. The system consists of two main parts, which are the pose estimation unit (PEU) and the pose tracker (PT). . . 46
5.2 Initial ε-covering set construction. Two factors are determined se- quentially while the others are calculated parallelly. The six pose parameters are stored in a grouped 4 floats and 2 floats structure in global memory. . . 47
5.3 The global memory usage for storing poses and appearance dis- tances in the coarse-to-fine estimation. The description below each memory block indicates the number of poses stored in the memory. . . 49
5.4 2D view of the proposed 6D pose search pattern for the pose track- er. The search pattern has five and seven search points in each of the rotation and translation dimensions, respectively. After the point with minimum Ea, which is drawn in orange, is found, the size of the pattern is diminished to search in the finer scale. Total- ly 3 scales are used for the search in the pose tracker. . . 50 5.5 Evaluation results for PEU and APE under different conditions
with different degradation levels. . . 53 5.6 Experimental results by the proposed pose tracker. We use pose
estimation unit to find the initial pose. After that, we use the pose tracker to track the pose until it loses the track. . . 55 5.7 Results in the practical tests. Sample images are rendered cyan
boxes with poses obtained from the proposed system. . . 56 5.8 (a) The picture of the proposed system for the practical tests. (b)
The results of the practical tests for the proposed system. We use two planar targets in the tests, which are texture target Ichiro and textureless target 2Circle. The pixel values are normalized to [0, 1]
for calculating the appearance distance. . . 57
List of Tables
3.1 Step size on each dimension for constructing the ε-covering pose set. . . 18 4.1 Evaluation results under undistorted test images. D0, D1, D2 indi-
cates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. The best val- ues are highlighted in bold. . . 28 4.2 Evaluation results under varying conditions. S, A, D0, D1 and D2
indicates SIFT-based, ASIFT-based, the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 32 4.3 Evaluation results under varying conditions. S, A, D0, D1 and D2
indicates SIFT-based, ASIFT-based, the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 33 4.4 Experimental results of the visual tracking dataset [1] under dif-
ferent conditions. The SIFT-based (S), ASIFT-based (A), and the proposed direct (D) methods are evaluated under different condi- tions (uc: unconstrained; pn: panning; rt: rotation; pd: perspec- tive distortion; zm: zoom; mX: motion blur level X, X = 1...9;
ls: static lighting; ld: dynamic lighting). The best results in each condition are highlighted in bold. . . 40
ix
4.5 Experimental results of the visual tracking dataset [1] under dif- ferent conditions. The SIFT-based (S), ASIFT-based (A), and the proposed direct (D) methods are evaluated under different condi- tions (uc: unconstrained; pn: panning; rt: rotation; pd: perspec- tive distortion; zm: zoom; mX: motion blur level X, X = 1...9;
ls: static lighting; ld: dynamic lighting). The best results in each condition are highlighted in bold. . . 41 4.6 Average runtimes for three approaches on synthetic and real test
data. Although SIFT-based Approach is the fastest method among these three different schemes, its performance is quite limited. . . 43 5.1 Average runtime for APE algorithm on 3.4 GHz Core-i7 CPU and
the proposed pose estimation unit on GTX770 and TX1 board.
The first values in the third row are the runtimes for the pose es- timation unit and the second values are the average runtimes for the pose tracker. . . 52 5.2 Evaluation results for PEU and APE under undistorted condition. . 54
Chapter 1 Introduction
1.1 3D Pose Estimation
Estimating and tracking 3D poses is a classical problem that aims to find the 3D relationship between target objects and the calibrated camera, as as shown in Fig- ure 2.1. With the rapid development of augmented reality [2] and robotics, the demand for obtaining accurate and stable 3D poses becomes increasingly vital.
The problem is even more challenging for the case of arbitrary planar targets.
In the early years, several simple and efficient works have been proposed.
Systems such as [3, 4, 5] are constructed for the planar target with binary pattern, which is called fiducial marker. Fiducial marker plays an important role that en- ables these methods to extract the region of the planar target in the camera image.
To break the limit of fiducial marker, systems [6, 7] which are able to obtain pos- es of the planar target with simple contours are proposed. These methods apply training-based algorithms to recognize simple contours in the camera image and compute the pose efficiently.
Solutions for estimating and tracking the poses of arbitrary planar targets can be categorized into two categories. The first categories are based on features extracted from planar targets. The core idea behind feature-based method is to compute a set of n correspondences between 3D points and their 2D projection-
1
s from which the relative position and orientation between the camera and tar- get can be estimated. In recent years, numerous feature detection and tracking schemes [8, 9, 10, 11, 12] have been developed. In order to match features more robustly, variants of RANSAC algorithms [13, 14, 15] have been used to eliminate outliers before the object pose is estimated from a set of feature correspondences.
Typically the Perspective-n-Point (PnP) [16, 17, 18] algorithms or the other relat- ed method [19] are applied to the correspondences after RANSAC for estimating the pose. Nonetheless, since the success of this method hinges on whether point correspondences can be corretly matched or not, these approaches are less effec- tive when the target image is textureless or the camera image is blurry.
The second categories consists of direct methods that do not depend on fea- tures. Since the seminal work by Lucas and Kanade [20], several algorithms based on global, iterative and nonlinear optimization is proposed [21, 22, 23, 24]. As the 3D pose estimation problem can be reduced to 2D template matching, [25, 26]
estimate the poses through optimizing the parameters which is account for rigid transformation of observed target image. However, these methods rely on ini- tial reference and may be trapped in local minimum. To conquer the limitations, non-iterative approaches [27, 28, 29] are proposed in recent years. Nevertheless, these template matching methods have the shortcoming of misalignment between affine or homography transformation space and pose space. It causes the addition- al pose error produced by transformation matrix decomposition while estimation the 3D pose. Briefly speaking, although these methods are able to work normally in the cases which feature-based methods fail, a direct method suitable for pose estimation and tracking is still lacking in the literature.
1.2 Pose Ambiguity
Due to the lack of the information about the third dimension on the planar tar- get, the pose ambiguity problem as discussed in previous work [30, 16, 31, 32]
3
Figure 1.1: Examples of the pose ambiguity. First row: origin images. Second row: images are rendered magenta boxes according to wrong ambiguous poses.
Third row: images are rendered cyan boxes according to correct poses.
is inevitably bound to occur. Pose ambiguity is related to the situation that the according error function has multiple local minima for a given configure. Based on the observations, one of the ambiguous poses with local minimum will be the correct pose. Figure 1.1 gives several examples. The wrong ambiguous poses and correct poses are represented with magenta and cyan boxes respectively. Since the pose ambiguity is the main cause of jumping pose estimation results in an image sequence, it is an important issue in the problem of 3D pose estimation of a planar target.
1.3 Contribution
In this thesis, we propose a direct 3D pose estimation algorithm which do not de- pend on features. We also utilize the parallel characteristic of the graphics com- puting unit (GPU) and implement D-PET, an embedded real-time direct 3D pose
estimation and tracking system on NVIDIA Jetson TX1 board.
The proposed algorithm estimates the 3D poses of the planar target from a calibrated camera by measuring the similarity between the projected planar target and the 2D image based on appearance. After obtaining an initial pose using an approximated pose estimation scheme, we determine all ambiguous poses and refine them until they converge to local minima. The final pose is chosen as the one with the lowest error among these ambiguous poses. Extensive experiments are conducted to validate the proposed algorithm. We evaluate the performance of proposed algorithm on a synthetic dataset with different types of planar targets and different levels of degraded camera images. Further more, we also evaluate the proposed system on the real dataset by Gauglitz et al. [1] against the state-of- the-art feature-based algorithms.
On the other hand, the proposed system consists of a pose estimation unit and a pose tracker. We reference the approximated pose estimation scheme in the proposed algorithm and build the pose estimation unit. The pose estimation unit is in charge of finding the initial pose for tracking. The proposed pose tracker then takes the initial pose and applies a 3-scale search with a pose search pattern to track the poses. We verify that the pose estimation unit has similar performance compared to the approximated pose estimation scheme through the experiment on synthetic dataset. The performance of the pose tracker is investigated using the real dataset. Finally, we conduct several practical tests to demonstrate the performance of the proposed system in the real world. The main contribution of this thesis can be summarized as follows.
• We propose a non-feature based algorithm to estimate 3D pose of a planar target.
• We develop an efficient direct 3D pose estimation and tracking system im- plemented on NVIDIA embedded GPU.
• Extensive experiments show that both the proposed algorithm and system
5 perform favorably in terms of accuracy and robustness against the state-of- the-art feature-based methods, and the proposed system achieves the pro- cessing speed of 11 fps.
1.4 Thesis organization
In this chapter, we introduce 3D pose estimation and present the main contribution of this thesis. The remainder of the thesis is organized as follows. The background knowledge and related works are mentioned in Chapter 2. In Chapter 3, we de- scribe the proposed direct 3D pose estimation algorithm. The datasets and the evaluation results for the proposed algorithm is provided in Chapter 4. Chapter 5 introduces the proposed system and its performance under several experiments.
Finally, the conclusion is given in Chapter 6.
Chapter 2
Background Knowledge and Related Works
2.1 Problem Definition
The goal of 3D pose estimation is to find the 3D relationship between the target object and the calibrated camera. The coordinate system between the target object and the camera image is shown in Figure 2.1. Given a planar target image It, a camera image Ic, a set of 3D coordinates of points xi= [xi, yi, 0]>, i = 1. . . . , n, n ≥ 3 in target object coordinate and a set of image coordinates ui= [ui, vi]> in cam- era image coordinate, the transformation between these two sets of points can be formulated as
hui hvi h
= K · Tcm
xi yi 0 1
=
fx 0 x0 0 fy x0
0 0 1
h
R|t i
xi yi 0 1
, (2.1)
where K is the intrinsic matrix and Tcm is the extrinsic matrix. In the intrinsic matrix K, ( fx, fy) and (x0, y0) refer to focal length and principle point which are
7
𝑋
𝑌
𝑍
Normalized image plane Camera
Coordinates
𝑍 𝑋 𝑌
Target object Coordinates
𝑥𝑖, 𝑦𝑖, 0 𝑅,𝐭
Camera image Coordinates
𝑢𝑖, 𝑣𝑖
𝐼𝑡
𝐼𝑐
Figure 2.1: Coordinate system between the planar target and the camera image.
treated as known factors. In the extrinsic matrix Tcmon the other hand,
R =
R11 R12 R13 R21 R22 R23 R31 R32 R33
∈ SO(3), t =
tx ty tz
∈ R(3), (2.2)
are the rotation matrix and translation vector, respectively. The physical meaning in (2.1) can be described in two steps, which are explained in detail as follows.
First, the 3D coordinates xiin target object coordinate are transformed to 3D coordinates xci= [xci, yci, zci] in camera coordinate through the extrinsic matrix Tcm, namely
xci yci zci
= Tcm
xi yi 0
=h R|t
i
xi yi 0
. (2.3)
The rotation matrix R is determined by the three degrees of freedom parameter- ized based on the orientation, and the translation vector t is parameterized based on position of the target object with respect to the calibrated camera. The second step is to project the 3D coordinates xciin camera coordinate to 2D coordinates ui
9 in camera image coordinate through the intrinsic matrix K. From (2.1) and (2.3) we know that
ui= fxxci
zci+ x0, vi= fyyci
zci+ y0. (2.4)
The operations behind this formulation are projection and translation. After xciis projected to the normalized camera image plane, the origin point on the normal- ized camera image plane is changed to the top-left corner of the camera image, which results in uiin camera image coordinate.
Given the observed camera-image points ˆui = [ ˆui, ˆvi]>, the pose estimation algorithm needs to determine values for pose p = (R, t) that minimize an appro- priate error function. In principle, there are two possible error functions. One is the reprojection error, which is mostly used in the PnP algorithms,
Er(p) = 1 n
n i=1
∑
h
( ˆui− ui)2+ ( ˆvi− vi)2 i
. (2.5)
Another error function is based on the sum of absolute differences (also known as appearance distance) and is mostly used in direct methods and this thesis,
Ea(p) = 1 nt
nt
i=1
∑
|Ic(ui) − It(xi)|, (2.6) where nt represents the total number of pixels in It.
2.2 State-of-the-Art Pose Estimation
2.2.1 Marker-based Pose Estimation
The approaches in this category require predefined planar targets with simple bi- nary pattern, which is called fiducial markers. To recognize the planar target, these approaches threshold the camera image, extract the rectangular region of the fidu- cial marker and perform pattern checking. After the recognition, these method finds the four corners of the target on the camera image to proceed to the esti- mation stage. A lot of works based on this category are proposed in the past few decades. According to [33], these works can be classified based on the geometry
of the fiducial marker such as square [34, 35, 36, 37, 38], circular [39, 40] and dots [41, 42]. Since these methods do not need complicated computation such as feature extraction or template matching, the computation time is relatively small.
2.2.2 Feature-based Pose Estimation
Since the feature-based pose estimation relies on nature features, it does not re- quire predefined planar targets. The algorithm flow is shown in Figure 2.2. The features in the target image and the camera image are detected and matched in the first stage. There are numerous feature detection, description and matching methods proposed recently. The most classical one must be SIFT [8] which is a multi-scale detection and scale invariant description scheme. SURF [9] applies the basic idea of SIFT and uses integral image for acceleration. To make the process even faster, fast detector and binary descriptor are proposed in numerous works such as BRISK [10], ORB [11] and FREAK [12]. Since all the methods described above are not affine or perspective invariant, they are less effective in pose estimation when the tilt angle between the planar target and the camera is large. Affine-SIFT (ASIFT) [43] matches feature points well because it simulates all obtainable viewpoints for description. However, there is no existing system applying ASIFT since it is more computationally expensive than others.
The pose is estimated after the quality of feature matching is further enhanced by RANSAC algorithms. Typically there are two kinds of methods used in the estimation stage to obtain poses. The first one is the perspective-n-points (PnP) algorithms. Numerous PnP algorithms such as OI [44], RPP [16], EPnP [17], RP- nP [45] and OPnP [18] are proposed to optimize the pose using the reprojection error function which is described in (2.5). Among these PnP algorithms, OPnP shows the dominant performance in terms of efficiency and accuracy. It formu- lates the the PnP problem as an unconstrained optimization problem and applies Gr¨obner bases solver to estimate the pose.
The second kind of method is to decompose the homography transformation
11
Feature Matching
Outlier
Removal Estimation
Figure 2.2: The algorithm flow of feature-based pose estimation methods.
which is estimated from the planar target to camera image. Since the transforma- tion between planar target and camera image can be viewed as a 2D-2D homog- raphy transformation, the 3D pose can be obtained by decompose the homog- raphy. Although the homography decomposition methods proposed in the early years [46, 47, 48] are less effective than PnP algorithms, Collins and Bartoli pro- pose a new analytic solution [19] which have a comparable performance against PnP algorithms with shorter computation time.
2.3 Template Matching
The template matching problem has been widely studied in the literatures, and one important issue is how to efficiently obtain accurate results with evaluating only a subset of the possible transformationes. Since the appearance distances between a template and two sliding windows shifted by a few pixels are usually close due to the nature of image smoothness, Pele and Werman [49] exploit this fact to reduce the time complexity of pattern matching. Alex et al. [50] derive an upper bound of the Euclidean distance based on pixel values according to the spatial overlap of two windows in an image, and use it for a efficient pattern matching.
In [28], Korman et al.show that the 2D affine transformations of a template can be approximated by samples of a density function based on smoothness of a given image and propose a fast matching method, which inspires us to propose the direct
3D pose estimation algorithm.
2.4 Motion Estimation
The refinement method in proposed algorithm and the 3-scale search in D-PET system are motivated by fast motion estimation methods in video coding. Liu and Feig [51] propose the Gradient Descent Search (GDS) algorithm that evaluates the values of a given objective function from a centralized search neighborhood for motion estimation. When the minimum within a neighborhood is found, it is used to determine the position for the next search until it converges. Compared with the full search method, the GDS algorithm achieves similar performance but with much lower computational complexity. Zhu and Ma [52] develop an algorithm for block-based motion estimation based on two designed diamond-shaped search patterns, and it further reduced the required number of search points. A motion estimation method that exploits more elaborated coarse-to-fine search patterns is subsequently developed by Zhu et al. [53].
Chapter 3
Direct Pose Estimation and Tracking
The proposed algorithm consists of two steps, as shown in figure 3.1. First, the 3D pose of a planar target with respect to a calibrated camera is estimated. Second, the object pose is further refined and disambiguated. We describe these steps as follows.
3.1 Approximated Pose Estimation
Let Tpbe the transformation at pose p. Assume a reference point xiin target image transformed is transformed to two points ui1 and ui2with two different poses p1 and 2, respectively. It is shown in [28] that if the spatial distance between ui1and ui2is bounded by a positive value ε,
∀xi∈ It: d(Tp1(xi), Tp2(xi)) = O(ε), (3.1)
then the following equation holds,
|Ea(p1) − Ea(p2)| = O(ε ¯
V
), (3.2)where ¯
V
denotes the mean variation of target image It, which represents the mean value over the entire target image of the maximal difference between each pixel and any of its neighborhood. Since the mean variation ¯V
can be constrained by13
Approximated Pose Estimation
Pose Refinement Camera intrinsic
parameters
Figure 3.1: The proposed algorithm consists of two steps. The pose is estimated in the first stage and refined in the second stage.
filtering It, the difference between Ea(p1) and Ea(p2) is bounded in terms of ε. In the proposed algorithm, we only need to consider a limited number of poses by constructing a ε-covering pose set
S
based on (3.1) and (3.2).3.1.1 ε-Covering Set Construction
In this thesis, we factorize the rotation matrix as R = Rz(θzc)Rx(θx)Rz(θzt) [54]
as shown in Figure 3.2(a). The pose is parameterized as p = [θzc, θx, θzt,tx,ty,tz]>. These Euler angles θzc, θx, and θzt are in the range [−180°, 180°], [0°, 90°], and [−180°, 180°], respectively. According to the factorization and (2.1), the transfor- mation between a reference point xi= [xi, yi, 0] on Itand the corresponding image coordinate ui= [ui, vi] can be formulated as
hui hvi h
=
fx 0 x0 0 fy x0
0 0 1
h
R|t i
xi yi 0
, (3.3)
where
R =
czcczt− cxszcszt −cxcztszc− czcszt sxszc cztszc+ cxczcszt cxczcczt− szcszt −sxczc
sxszt sxczt cx
, t =
tx ty tz
. (3.4)
15 The notation s and c indicates sin and cos operation, respectively.
A pose set
S
is constructed such that any two consecutive poses, pk and pk+∆pk on each dimension, satisfy (3.1) in
S
. To construct the set favorably, the coordinates of xi∈ It are pre-normalized to the range [−1, 1]. Starting with tz, we derive the equation below by using (3.3) for each xi,d(Tptz(xi), Tptz+∆tz(xi)) = s
[(fxxi
tz ) − ( fxxi
tz+ ∆tz)]2+ [(fyyi
tz ) − ( fyyi tz+ ∆tz)]2
= O(1
tz− 1 tz+ ∆tz
).
(3.5)
To make (3.5) satisfy the constraint in (3.1), we use the step size, with tight bound in Big-Theta notation,
∆tz= Θ( εtz2
1 − εtz), (3.6)
which means that (3.5) can be bounded if we construct
S
with the step (3.6) on dimension tz.Since θxdescribes the tilt angle between camera and target image as shown in Figure 3.2(a), we obtain the following equation depending on the current tz,
d(Tpθx(xi), Tpθx+∆θx(xi)) = q
du2i+ dv2i
= O( 1
tz− sin(θx+ ∆θx)− 1 tz− sin(θx)),
(3.7)
for each xi, where
dui= ( fxxi
yisin θx+ tz) − ( fxxi
yisin(θx+ ∆θx) + tz), dvi= ( fyyicos θx
yisin θx+ tz) − ( fyyicos(θx+ ∆θx) yisin(θx+ ∆θx) + tz).
(3.8)
In addition, to make (3.7) satisfy the constraint in (3.1), we set the step size,
∆θx= Θ(sin−1(tz− 1 ε +t 1
z−sin(θx)
) − θx). (3.9)
Let θz0
t = θzt+ ∆θzt, we obtain the following equation based on the current tz
𝜃𝑧𝑡
Tile Angle 𝜃𝑧𝑐
𝜃𝑥
Δ𝜃𝑧𝑡 𝑂(Δ𝜃𝑧𝑡) Rotation around
𝑍𝑡-axis 1
1 Δ𝜃𝑧𝑡
𝑂 Δ𝜃𝑧𝑡 𝑡𝑧− sin 𝜃𝑥 𝑂 Δ𝜃𝑧𝑡
𝑡𝑧+ sin 𝜃𝑥
Δ𝜃𝑧𝑡
(a) (b) (c)
Figure 3.2: (a) Illustration of rotation matrix factorization. θx indicates the tilt angle between the camera and the target. (b) 2D illustration of rotation around Zt-axis. The linear distance (orange solid line) between points before and after applying rotation is bounded by the arc length (brown dotted line). (c) 3D illus- tration of rotation around Zt-axis. The linear distance between points is a function of tilt angle θx.
and θx,
d(Tpθ
zt(xi), Tpθ
zt +∆θzt(xi)) = q
fx2C12+ fy2C22≤ s
fx2C12+ fy2 C2 cx
2
= O
∆θzt
tz+ k sin(θx)
,
(3.10)
, where
C1= cztx− szty sx(sztx+ czty) + tz−
cz0tx− s
z0ty sx(s
z0tx+ c
z0ty) + tz, C2= cx(sztx+ czty)
sx(sztx+ czty) + tz− cx(s
z0tx+ c
zt0y) sx(sz0
tx+ cz0
ty) + tz
(3.11)
The constant k denotes any constant in the range of [−√ 2,√
2]. An illustrative ex- ample of (3.10) is shown in Figure 3.2(b)(c). To make (3.10) satisfy the constraint in (3.1), we set the step size
∆θzt = Θ(ε(tz+ k sin(θx))), (3.12) where larger k means larger bounded steps for constructing
S
. We set k to be 0 for∆θzt in the proposed method.
17 Since θzt denotes 2D image rotation of the planar target, it does not affect the bounded steps for θzc. Let θz0
c= θzc+ ∆θzc, we obtain the following equation depending on the current tz and θx,
d(Tpθzc(xi), Tpθzc +∆θzc(xi))
= s
fx2 czcx− cxszcy sxy+ tz −
cz0cx− cxs
z0cy sxy+ tz
2
+ fy2 szcx+ cxczcy sxy+ tz −
sz0cx+ cxc
z0cy sxy+ tz
2
= O
∆θzc tz+ k sin(θx)
.
(3.13) We can realize (3.13) in a similar way to (3.10). To make (3.13) satisfy the con- straint in (3.1), we set the step size,
∆θzc= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz)), (3.14) which k is set to 0.
Finally, as the bounded steps for tx and ty are also affected by horizontal dis- tance tz and tilt angle θx only, the equations below can be derived,
d(Tptx(xi), Tptx+∆tx(xi))
= s
fx2
x+ tx
sxy+ tz−x+ tx+ ∆tx
sxy+ tz
2
+ fy2
y
sxy+ tz− y sxy+ tz
2
= O
∆tx tz+ k sin(θx)
,
(3.15)
d(Tpty(xi), Tpty+∆ty(xi))
= s
fx2
x
sxy+ tz− x sxy+ tz
2
+ fy2
y+ ty
sxy+ tz−y+ ty+ ∆ty
sxy+ tz
2
= O
∆ty tz+ k sin(θx)
,
(3.16)
To make (3.15) and (3.16) satisfy the constraint in (3.1), we set these step sizes,
∆tx= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz−√
2 sin(θx))), (3.17)
∆ty= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz−√
2 sin(θx))), (3.18)
Table 3.1: Step size on each dimension for constructing the ε-covering pose set.
Dimension Step Size
θzc Θ(εtz)
θx Θ(sin−1(tz− 1
ε+tz−sin(θx)1 ) − θx)
θzt Θ(εtz)
tx Θ(ε(tz−√
2 sin(θx))) ty Θ(ε(tz−√
2 sin(θx)))
tz Θ( εt
z2
1−εtz)
as k is set to −√
2 for pratical consideration.
Table 3.1 summarizes the bounded step size on each domain for constructing ε-covering pose set. From the table we know that the step size of θzc, θzt and tzare function of tz. The step size of θx, txand tyare function of tzand θx. Based on these observations, the construction of the ε-covering set is designed to be nested. First, we construct the set of tz using (3.6) to satisfy the constraint for approximated pose estimation. We then construct the set of θx according to (3.9) within each tz. Finally the set of θzc, θzt, txand tyare constructed sequetially by considering each value of tz and θx.
3.1.2 Coarse-to-Fine Estimation
Due to the large parameter space, the computational and memory costs are pro- hibitively high if the ε-covering pose set is used directly for pose estimation. For fast and accurate pose estimation, a coarse-to-fine approach is employed. As il- lustrated in Figure 3.3, the pose set
S
is first constructed with a coarse ε. After obtaining the best pose pband the associated error measure Ea(pb), we select the poses within a threshold,S
L= {pL | Ea(pL) < Ea(pb) + L}, (3.19)19
1D view of pose space 𝐸𝑎
𝐿 𝜺
𝑆 𝑆
𝑆′ 𝑆𝐿
Figure 3.3: An illustration of proposed coarse-to-fine estimation.
to be considered in the next round. Here the constant L is a threshold set empiri- cally. Based on Ea, we create sets with finer ε0,
S
0= {p0 | ∃pL ∈S
L: (3.1) holds for p0, pL and ε0}, (3.20)and repeat search until we obtain the desired precision parameter ε∗. The best pose in the last set is used as the approximated estimate. Figure 3.4 shows an example of the coarse-to-fine estimation. According to the pose pbwith minimum Ea estimated in each round of the coarse-to-fine estimation, the projected target image is rendered on the camera image. It is obvious that the projected target image is approximately aligned to the target in the camera image as the precision parameter ε becomes finer.
The algorithm is accelerated by random sampling a potion of pixels to approx- imated the error measure Ea0 instead of using all pixels in It to compute Eain (2.6).
According to Hoeffding’s inequality [55], Ea0 is probably close to Eawithin a pre-
Target Image Camera Image
Round 1: ε = 0.25, Ea= 0.051 Round 2: ε = 0.17, Ea= 0.033
Round 3: ε = 0.11, Ea= 0.018 Round 4: ε = 0.072, Ea= 0.015
Round 5: ε = 0.048, Ea= 0.0149 Round 6: ε = 0.032, Ea= 0.013
Figure 3.4: An example of the coarse-to-fine estimation. According to the pose pbwith minimum Eaestimated in each round of the coarse-to-fine estimation, the projected target image is rendered on the camera image.
21 cision parameter δ if the number of sampling pixels m is large enough,
P(|Ea0 − Ea| > δ) ≤ 2e−2δ2m, (3.21)
where P(·) represents the probability measure. This equation suggests that if m is properly selected, the approximation error between Ea0 and Ea can be bounded with high probability. In other words, Ea0 is a approximation of Ea within the probably approximately correct (PAC) framework [55]. With this approximation, the runtime of the proposed algorithm is dramatically reduced.
Moreover, in order to be invariant to different lighting conditions, we normal- izes the intensity term and adds chroma terms to the appearance distance measure.
In implementation, we convert the RGB values of sampling pixels to the YCbCr color space. The appearance distance is then computed as follows,
Ea= 0.5EaY+ 0.25EaCb+ 0.25EaCr. (3.22)
We determine the weight for each term by experiments and find that the proposed algorithm is more invariant to intensity change when the chroma terms are added than the normalized intensity values are only used.
3.2 Pose Refinement
We obtain (R0, t0) after the proposed approximated pose estimation scheme. How- ever, this result is bounded based on the distance in the appearance space rather than the pose space. Therefore the estimated pose and actual pose may be signif- icantly different even when the appearance distance is small, which is often the case when the tilt angle of a target image is large. In the meanwhile, the pose ambiguity problem is likely occur as illustrated in Figure 1.1. Consequently, a pose refinement scheme is proposed to further improve the accuracy and address the ambiguity problem.
3.2.1 Candidate Poses Exploration
In order to address the problem of pose ambiguity, we first transform four corner points xc1, xc2, xc3, and xc4 in the target image It to uc1, uc2, uc3, and uc4 in the camera image Ic with (R0, t0), respectively. We then compute all local minima of the error function (2.5). Finally, only the local minima with the two smallest objective values in (2.6) are plausible poses, and these two ambiguous poses are both chosen as the candidate poses. In practice, IPPE [19] and OPnP [18] are the two methods we consider to apply to get the candidate poses. Through the experiment we describe in Section 4.1, we choose to apply OPnP to accomplish the work.
3.2.2 Refining Pose Estimation
After obtaining the two candidate poses, we can further improve the accuracy us- ing a coarse-to-fine gradient descent search scheme. In contrast to the 2D motion estimation in video coding, we consider a 6D pose motion with infinity resolution in this work. A 2D view of the coarse-to-fine gradient descent search is shown in Figure 3.5(a). The largest blue circle denotes the approximate pose estimated in Section 3.1, and the smaller one (orange) represents the local minimum found by the search pattern at the starting ε-precision. As the minimum under the cur- rent precision level is found, we diminish the precision parameter ε and perform gradient descent search again on the next level. This process is repeated until we obtain the local minimum under the desired precision parameter ε∗. Finally, the pose with smaller Eais chosen from the two refined candidate poses.
The 2D view of the checking pattern in the coarse-to-fine GDS scheme [51]
are shown in Figure 3.5(b). It is formed by 13 checking points, including the center point and its 12 neighbors. These 12 neighbors are ε-away from the cen- ter separately in the 6D pose space. Let pc= [θzc, θx, θzt,tx,ty,tz]> be the center point of the checking pattern in the pose space and Pc be the 6 × 13 matrix with repeating pcin a row. Also let sε= [sθzc, sθx, sθzt, stx, sty, stz]> be the step size listed
23
Coarse Final Candidate Fine
(a) (b)
Figure 3.5: (a) 2D illustration of coarse-to-fine gradient descent search. We carry out the GDS on the coarse level in the beginning. After reaching the local min- imum, we move to the finer level and repeat this coarse-to-fine GDS approach until we obtain minimum within the desired precision. (b) The checking pattern in 2D view, including 1 center checking point and its 4 neighnbors (2 neighnbors per dimension).
in Table 3.1 with precision parameter ε and Sεbe the 6 × 13 matrix with repeating sεin a row. The mathematical description of the checking pattern M can then be written as
M = Pc+ D ◦ Sε, (3.23)
where
D =
0 1 −1 0 0 . . . 0 0 0 0 0 1 −1 . . . 0 0
0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 0 0
0 0 0 0 0 . . . 1 −1
(3.24)
and ◦ represents the Hadamard product. Each column in M represents one of the checking points within the checking pattern. The main steps of the proposed pose estimation method are summarized in Algorithm 1.
Algorithm 1 Proposed Direct 3D Pose Estimation
Input: Target image It, camera image Ic, intrinsic parameters, and precision pa- rameters ε∗c, ε∗f.
Output: Estimated pose result p∗.
1: Create an ε-covering pose set
S
.2: Find pbfrom
S
with Ea0 according to (3.21).3: while ε > ε∗c do
4: Obtain the set
S
L according to (3.19);5: Diminish ε;
6: Replace
S
according to (3.20);7: Find pbfrom
S
with Ea0 according to (3.21);8: end while
9: Explore the candidate poses p1and p2with pb.
10: for i = 1 → 2 do
11: Let pc= piand εi= ε
12: while εi> ε∗f do
13: Find pbfrom (3.23) with Ea0 according to (3.21).
14: if pc6= pbthen
15: pc= pb
16: else
17: Diminish εi;
18: end if
19: end while
20: Let pi= pc
21: end for
22: Return the pose p∗with smaller Eafrom p1and p2
Chapter 4 Experiments
We experimentally evaluate the proposed algorithm for the 3D pose estimation problem using both synthetic and real datasets, and compare it with the feature- based schemes. Through some preliminary experiments, we find SIFT [8] method performs better than other alternative features in terms of repeatability and accu- racy. Similar observations can also be found in [1]. As the ASIFT [43] method is considered the state-of-the-art affine-invariant method to find correspondences under large view change, we use both the SIFT and ASIFT methods in the com- pared feature-based schemes. The RANSAC-based method [13] is then used to eliminate outliers before object pose is estimated by the PnP algorithms. It has been shown that, among the PnP algorithms [16, 17, 18, 56], the OPnP [18] al- gorithm achieves the state-of-the-art results in terms of efficiency and accuracy.
Therefore we use the OPnP algorithm as the pose estimator in the feature-based schemes.
Given the true rotation matrix ˆR and translation vector ˆt, we compute the rotation error of the estimated rotation matrix R by ER(degree) = acosd((Tr(R>· R) − 1)/2), where acosd(·) represents the arc-cosine operation in degrees. Theˆ translation error of the estimated translation vector t is measured by the relative difference between ˆt and t defined as Et(%) = kˆt − tk/kˆtk × 100. We define a pose to be successfully estimated if its both errors are under pre-defined thresholds.
25