針對單一平面目標物之三維姿態的直接分析：演算法和系統實作

(1)

國立臺灣大學電機資訊學院電子工程學研究所碩士論文

Graduate Institute of Electronics Engineering College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

針對單一平面目標物之三維姿態的直接分析：

演算法和系統實作

Direct 3D Pose Estimation of a Planar Target:

Algorithm and Implementation

曾泓諭 Hung-Yu Tseng

指導教授：簡韶逸博士 Advisor: Shao-Yi Chien, Ph.D.

中華民國 105 年 6 月 June 2016

(2)

(3)

(4)

在完成這篇碩士論文的當下，心中只有滿滿的感謝。在研究的這條路上遇到了許多貴人，給了我許多的幫助，其中我要特別感謝三個人。第一位是我的指導老師簡韶逸教授，從大三修習老師的專題研究開始，便很敬佩老師的研究態度，

特別是對於吸收新知的熱情。老師給予我很大的研究空間，並沒有特別限制我研究的走向。除了指導之外，老師更會盡其所能的尋找相關資源，甚至是實驗室外的專業人士來幫助我們。對於能擁有一位為學生著想的指導老師，我滿懷感恩。

第二位是我的共同指導老師楊明玄老師，在 2015 年春天第一次與老師 Skype 通話以來，就深深的感受到老師對於台灣學生們的照顧，包括投稿 Conference 時帶著我們日以繼夜的修改 paper、每周 meeting 時給予許多寶貴的意見、幫助我們找實習、還有在我們低潮時鼓勵我們，特別是 how to survive in this business。回想起來，真的很感謝老師如此的照顧與關心我們。

第三位是我的實驗室學長吳柏辰學長，學長是我研究生涯中最重要的貴人。

從專題以來，無論是基礎知識、看 paper 的能力、做投影片、上台報告、做研究甚至是寫程式，學長都是不厭其煩的給予我指導和幫助。除此之外，學長做事情嚴謹的態度也讓我印象深刻。當然這兩年一起待在博理 421 的日子，投稿、寫程式、

討論問題、看 LOL 比賽還有練肖話，我想我是不會忘記的。真的很謝謝學長！

最後要感謝我的家人。爸爸、媽媽、妹妹、恰恰、爺爺、奶奶、外公和外婆，

始終默默地支持並鼓勵我。我的女友婷瑋，陪伴了我走過這段時光，你們是我能夠一直向前的最佳動力。以及我的好朋友們，人碩、明仁、昇勳、乃群、冠豪、

令儀、國婷、昱廷、則安、宗緯、廷璋、致睿等等族繁不及備載，為這段日子增添了許多色彩。最後是實驗室學長們，偉志、嘉洋、熊、柯楊、強強林、岳穎、

明倫、培恒，以及陳凱、賓四，給予了我研究和課業上許多幫助。謝謝你們！

(5)

中文摘要

近年來，隨著擴充實境與機器人學的發展，如何即時且準確地分析已知的單一平面目標物其三維姿態成為了一個重要的議題。即使過去十幾年中，不少有效率的系統陸續被提出，但由於這些系統只能針對特定的平面目標物，如基準標記或含有簡易封閉曲線的平面目標物，此問題仍舊缺少一個適用於任意平面目標物的解決方法。目前針對此問題最好的解決方法為基於特徵點的方法，但是此方法必須在目標物與相機圖片中的特徵點能對應的前提下才能運作。

為了解決這個問題，在本篇碩士論文中，我們提出了一個表現穩定、能針對任意平面目標物的直接分析演算法。首先，我們採用模板匹配的概念求出一個近似的三維姿態。接下來針對此近似的三維姿態，我們提出了一個梯度下降尋找的演算法來求出更為精準的三維姿態。更進一步，基於所提出的演算法，我們在圖形處理器上實作了一套分析和追蹤平面目標物之三維姿態的系統。此系統包含了分析單元和追蹤單元兩部分。分析單元負責計算出起始的三維姿態，是基於我們提出的演算法所設計的；追蹤單元則負責追蹤三維姿態，其所使用的方法為我們提出的一種三階層搜尋法。在系統中，無論是分析單元還是追蹤單元都充分利用了圖形處理器中平行運算的優點，使得我們的系統能夠非常有效率的運作。我們透過大量的實驗，證明我們所提出的演算法和系統，其表現都比目前基於特徵點

的方法要更精準而且穩定。並於實際應用中，我們的系統達到了每秒 11幀的運算

速度。

(6)

(7)

Direct 3D Pose Estimation of a Planar Target: Algorithm and Implementation

Hung-Yu Tseng Advisor: Shao-Yi Chien Co-Advisor: Ming-Hsuan Yang

Graduate Institute of Electronics Engineering National Taiwan University

Taipei, Taiwan

June 2016

(8)

(9)

Abstract

Real-time estimating and tracking accurate 3D poses of a known planar target from a calibrated camera are essential for augmented reality and robotics. Al- though numerous efficient systems have been proposed in the past few decades, it remains a challenging task since the planar targets are limited to fiducial markers and targets with simple contours. The feature-based schemes are the state-of-the- art solutions for obtaining poses of arbitrary planar targets. However the success hinges on whether feature points can be extracted and matched correctly on targets with rich texture.

In this thesis, we propose a robust direct method for 3D pose estimation with high accuracy that performs well on both texture and textureless planar targets.

First, the pose of a planar target with respect to a calibrated camera is approximated estimated by posing it as a template matching problem. Next, the object pose is further refined and disambiguated with a gradient descent search scheme.

In order to make the proposed algorithm applicable, we also develop D-PET, a direct 3D pose estimation and tracking system implemented on graphics comput- ing units (GPU) which is able to obtain poses in real-time. The system consists of a pose estimation unit and a pose tracker. The pose estimation unit is built based on the approximated pose estimation scheme in the proposed algorithm to find the initial pose. A 3-scale search scheme is proposed for the pose tracker to track the pose precisely. Both of them utilize the characteristics of GPU and accomplish the work efficiently. Extensive experiments on both synthetic and real datasets demonstrate that both the proposed algorithm and system perform favor-

i

(10)

ably against state-of-the-art feature-based approaches in terms of accuracy and robustness. The proposed system achieves a processing speed of 11 fps on an embedded GPU.

(11)

List of Figures

1.1 Examples of the pose ambiguity. First row: origin images. Sec- ond row: images are rendered magenta boxes according to wrong ambiguous poses. Third row: images are rendered cyan boxes according to correct poses. . . 3 2.1 Coordinate system between the planar target and the camera image. 8 2.2 The algorithm flow of feature-based pose estimation methods. . . 11

3.1 The proposed algorithm consists of two steps. The pose is estimated in the first stage and refined in the second stage. . . 14 3.2 (a) Illustration of rotation matrix factorization. θ_x indicates the

tilt angle between the camera and the target. (b) 2D illustration of rotation around Z_t-axis. The linear distance (orange solid line) between points before and after applying rotation is bounded by the arc length (brown dotted line). (c) 3D illustration of rotation around Z_t-axis. The linear distance between points is a function of tilt angle θ_x. . . 16 3.3 An illustration of proposed coarse-to-fine estimation. . . 19 3.4 An example of the coarse-to-fine estimation. According to the

pose p_b with minimum Eaestimated in each round of the coarse- to-fine estimation, the projected target image is rendered on the camera image. . . 20

v

(14)

3.5 (a) 2D illustration of coarse-to-fine gradient descent search. We carry out the GDS on the coarse level in the beginning. After reaching the local minimum, we move to the finer level and repeat this coarse-to-fine GDS approach until we obtain minimum within the desired precision. (b) The checking pattern in 2D view, including 1 center checking point and its 4 neighnbors (2 neighnbors per dimension). . . 23 4.1 Distributions of rotation and translation errors over experiments.

The horizontal lines correspond to the thresholds used to detect unsuccessfully estimated poses. There is a total of 15, 289 poses estimated by each pose estimation approach. . . 26 4.2 The test image was generated from a warping target image accord-

ing to the randomly generated pose on randomly chosen background image. . . 27 4.3 Experimental results on synthetic data under conditions blur and

JPEG compression. D0, D1, D2 indicates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 30 4.4 Experimental results on synthetic data under conditions intensity

and tilt angle. D0, D1, D2 indicates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 31 4.5 Pose estimation results with and without refinement approaches.

The average value of rotation and translation error are both reduced by the refinement approach. . . 35 4.6 The proposed method without refinement (w/o), refinement with

one candidate (w/ 1), and refinement with two candidates (w/ 2) are evaluated. (a) Distribution of errors. (b) The difference of pose errors before and after applying two kinds of refinement approaches. 35

(15)

vii 4.7 Illustration to addressing pose ambiguity problems. (b), (c) are

two cases that the refinement process with two candidate poses performs worse than with one candidate pose. The red dash line indicates the location of the ground truth pose. The purple circle indicates the first initial pose which is the input of the refinemen- t scheme, and the orange circle indicates the second initial pose.

The purple and orange triangle represent the refined first and second poses, respectively. . . 37

4.8 Estimation results by the proposed direct method on real images under different conditions. The success cases are represented with rendered cyan boxes, and the failure cases are represented with rendered magenta boxes. . . 39

4.9 Pose estimation results on textureless images by proposed direct method. First column: pose estimation target. Other columns:

pose estimation results with rendered box. . . 44

5.1 State chart of proposed D-PET system. The system consists of two main parts, which are the pose estimation unit (PEU) and the pose tracker (PT). . . 46

5.2 Initial ε-covering set construction. Two factors are determined se- quentially while the others are calculated parallelly. The six pose parameters are stored in a grouped 4 floats and 2 floats structure in global memory. . . 47

5.3 The global memory usage for storing poses and appearance distances in the coarse-to-fine estimation. The description below each memory block indicates the number of poses stored in the memory. . . 49

(16)

5.4 2D view of the proposed 6D pose search pattern for the pose tracker. The search pattern has five and seven search points in each of the rotation and translation dimensions, respectively. After the point with minimum Ea, which is drawn in orange, is found, the size of the pattern is diminished to search in the finer scale. Total- ly 3 scales are used for the search in the pose tracker. . . 50 5.5 Evaluation results for PEU and APE under different conditions

with different degradation levels. . . 53 5.6 Experimental results by the proposed pose tracker. We use pose

estimation unit to find the initial pose. After that, we use the pose tracker to track the pose until it loses the track. . . 55 5.7 Results in the practical tests. Sample images are rendered cyan

boxes with poses obtained from the proposed system. . . 56 5.8 (a) The picture of the proposed system for the practical tests. (b)

The results of the practical tests for the proposed system. We use two planar targets in the tests, which are texture target Ichiro and textureless target 2Circle. The pixel values are normalized to [0, 1]

for calculating the appearance distance. . . 57

(17)

List of Tables

3.1 Step size on each dimension for constructing the ε-covering pose set. . . 18 4.1 Evaluation results under undistorted test images. D0, D1, D2 indi-

cates the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. The best values are highlighted in bold. . . 28 4.2 Evaluation results under varying conditions. S, A, D0, D1 and D2

indicates SIFT-based, ASIFT-based, the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 32 4.3 Evaluation results under varying conditions. S, A, D0, D1 and D2

indicates SIFT-based, ASIFT-based, the proposed direct method without refinement, refinement with IPPE and refinement with OPnP, respectively. . . 33 4.4 Experimental results of the visual tracking dataset [1] under dif-

ferent conditions. The SIFT-based (S), ASIFT-based (A), and the proposed direct (D) methods are evaluated under different conditions (uc: unconstrained; pn: panning; rt: rotation; pd: perspective distortion; zm: zoom; mX: motion blur level X, X = 1...9;

ls: static lighting; ld: dynamic lighting). The best results in each condition are highlighted in bold. . . 40

ix

(18)

4.5 Experimental results of the visual tracking dataset [1] under different conditions. The SIFT-based (S), ASIFT-based (A), and the proposed direct (D) methods are evaluated under different conditions (uc: unconstrained; pn: panning; rt: rotation; pd: perspective distortion; zm: zoom; mX: motion blur level X, X = 1...9;

ls: static lighting; ld: dynamic lighting). The best results in each condition are highlighted in bold. . . 41 4.6 Average runtimes for three approaches on synthetic and real test

data. Although SIFT-based Approach is the fastest method among these three different schemes, its performance is quite limited. . . 43 5.1 Average runtime for APE algorithm on 3.4 GHz Core-i7 CPU and

the proposed pose estimation unit on GTX770 and TX1 board.

The first values in the third row are the runtimes for the pose estimation unit and the second values are the average runtimes for the pose tracker. . . 52 5.2 Evaluation results for PEU and APE under undistorted condition. . 54

(19)

Chapter 1 Introduction

1.1 3D Pose Estimation

Estimating and tracking 3D poses is a classical problem that aims to find the 3D relationship between target objects and the calibrated camera, as as shown in Fig- ure 2.1. With the rapid development of augmented reality [2] and robotics, the demand for obtaining accurate and stable 3D poses becomes increasingly vital.

The problem is even more challenging for the case of arbitrary planar targets.

In the early years, several simple and efficient works have been proposed.

Systems such as [3, 4, 5] are constructed for the planar target with binary pattern, which is called fiducial marker. Fiducial marker plays an important role that en- ables these methods to extract the region of the planar target in the camera image.

To break the limit of fiducial marker, systems [6, 7] which are able to obtain poses of the planar target with simple contours are proposed. These methods apply training-based algorithms to recognize simple contours in the camera image and compute the pose efficiently.

Solutions for estimating and tracking the poses of arbitrary planar targets can be categorized into two categories. The first categories are based on features extracted from planar targets. The core idea behind feature-based method is to compute a set of n correspondences between 3D points and their 2D projection-

1

(20)

s from which the relative position and orientation between the camera and target can be estimated. In recent years, numerous feature detection and tracking schemes [8, 9, 10, 11, 12] have been developed. In order to match features more robustly, variants of RANSAC algorithms [13, 14, 15] have been used to eliminate outliers before the object pose is estimated from a set of feature correspondences.

Typically the Perspective-n-Point (PnP) [16, 17, 18] algorithms or the other related method [19] are applied to the correspondences after RANSAC for estimating the pose. Nonetheless, since the success of this method hinges on whether point correspondences can be corretly matched or not, these approaches are less effective when the target image is textureless or the camera image is blurry.

The second categories consists of direct methods that do not depend on features. Since the seminal work by Lucas and Kanade [20], several algorithms based on global, iterative and nonlinear optimization is proposed [21, 22, 23, 24]. As the 3D pose estimation problem can be reduced to 2D template matching, [25, 26]

estimate the poses through optimizing the parameters which is account for rigid transformation of observed target image. However, these methods rely on initial reference and may be trapped in local minimum. To conquer the limitations, non-iterative approaches [27, 28, 29] are proposed in recent years. Nevertheless, these template matching methods have the shortcoming of misalignment between affine or homography transformation space and pose space. It causes the addition- al pose error produced by transformation matrix decomposition while estimation the 3D pose. Briefly speaking, although these methods are able to work normally in the cases which feature-based methods fail, a direct method suitable for pose estimation and tracking is still lacking in the literature.

1.2 Pose Ambiguity

Due to the lack of the information about the third dimension on the planar target, the pose ambiguity problem as discussed in previous work [30, 16, 31, 32]

(21)

3

Figure 1.1: Examples of the pose ambiguity. First row: origin images. Second row: images are rendered magenta boxes according to wrong ambiguous poses.

Third row: images are rendered cyan boxes according to correct poses.

is inevitably bound to occur. Pose ambiguity is related to the situation that the according error function has multiple local minima for a given configure. Based on the observations, one of the ambiguous poses with local minimum will be the correct pose. Figure 1.1 gives several examples. The wrong ambiguous poses and correct poses are represented with magenta and cyan boxes respectively. Since the pose ambiguity is the main cause of jumping pose estimation results in an image sequence, it is an important issue in the problem of 3D pose estimation of a planar target.

1.3 Contribution

In this thesis, we propose a direct 3D pose estimation algorithm which do not depend on features. We also utilize the parallel characteristic of the graphics com- puting unit (GPU) and implement D-PET, an embedded real-time direct 3D pose

(22)

estimation and tracking system on NVIDIA Jetson TX1 board.

The proposed algorithm estimates the 3D poses of the planar target from a calibrated camera by measuring the similarity between the projected planar target and the 2D image based on appearance. After obtaining an initial pose using an approximated pose estimation scheme, we determine all ambiguous poses and refine them until they converge to local minima. The final pose is chosen as the one with the lowest error among these ambiguous poses. Extensive experiments are conducted to validate the proposed algorithm. We evaluate the performance of proposed algorithm on a synthetic dataset with different types of planar targets and different levels of degraded camera images. Further more, we also evaluate the proposed system on the real dataset by Gauglitz et al. [1] against the state-of- the-art feature-based algorithms.

On the other hand, the proposed system consists of a pose estimation unit and a pose tracker. We reference the approximated pose estimation scheme in the proposed algorithm and build the pose estimation unit. The pose estimation unit is in charge of finding the initial pose for tracking. The proposed pose tracker then takes the initial pose and applies a 3-scale search with a pose search pattern to track the poses. We verify that the pose estimation unit has similar performance compared to the approximated pose estimation scheme through the experiment on synthetic dataset. The performance of the pose tracker is investigated using the real dataset. Finally, we conduct several practical tests to demonstrate the performance of the proposed system in the real world. The main contribution of this thesis can be summarized as follows.

• We propose a non-feature based algorithm to estimate 3D pose of a planar target.

• We develop an efficient direct 3D pose estimation and tracking system implemented on NVIDIA embedded GPU.

• Extensive experiments show that both the proposed algorithm and system

(23)

5 perform favorably in terms of accuracy and robustness against the state-of- the-art feature-based methods, and the proposed system achieves the processing speed of 11 fps.

1.4 Thesis organization

In this chapter, we introduce 3D pose estimation and present the main contribution of this thesis. The remainder of the thesis is organized as follows. The background knowledge and related works are mentioned in Chapter 2. In Chapter 3, we describe the proposed direct 3D pose estimation algorithm. The datasets and the evaluation results for the proposed algorithm is provided in Chapter 4. Chapter 5 introduces the proposed system and its performance under several experiments.

Finally, the conclusion is given in Chapter 6.

(24)

(25)

Chapter 2 Background Knowledge and Related Works

2.1 Problem Definition

The goal of 3D pose estimation is to find the 3D relationship between the target object and the calibrated camera. The coordinate system between the target object and the camera image is shown in Figure 2.1. Given a planar target image I_t, a camera image I_c, a set of 3D coordinates of points x_i= [x_i, y_i, 0]^>, i = 1. . . . , n, n ≥ 3 in target object coordinate and a set of image coordinates u_i= [u_i, v_i]^> in camera image coordinate, the transformation between these two sets of points can be formulated as





 hu_i hv_i h







= K · T_cm





 x_i y_i 0 1







=







f_x 0 x₀ 0 f_y x₀

0 0 1





 h

R|t i





 x_i y_i 0 1







, (2.1)

where K is the intrinsic matrix and T_cm is the extrinsic matrix. In the intrinsic matrix K, ( fx, f_y) and (x₀, y₀) refer to focal length and principle point which are

7

(26)

𝑋

𝑌

𝑍

Normalized image plane Camera

Coordinates

𝑍 𝑋 𝑌

Target object Coordinates

𝑥_𝑖, 𝑦_𝑖, 0 𝑅,𝐭

Camera image Coordinates

𝑢_𝑖, 𝑣_𝑖

𝐼_𝑡

𝐼_𝑐

Figure 2.1: Coordinate system between the planar target and the camera image.

treated as known factors. In the extrinsic matrix T_cmon the other hand,

R =







R₁₁ R₁₂ R₁₃ R₂₁ R₂₂ R₂₃ R₃₁ R₃₂ R₃₃







∈ SO(3), t =





 t_x t_y t_z







∈ R(3), (2.2)

are the rotation matrix and translation vector, respectively. The physical meaning in (2.1) can be described in two steps, which are explained in detail as follows.

First, the 3D coordinates x_iin target object coordinate are transformed to 3D coordinates x_ci= [x_ci, y_ci, z_ci] in camera coordinate through the extrinsic matrix T_cm, namely





 x_ci y_ci z_ci







= T_cm





 x_i y_i 0







=h R|t

i





 x_i y_i 0







. (2.3)

The rotation matrix R is determined by the three degrees of freedom parameterized based on the orientation, and the translation vector t is parameterized based on position of the target object with respect to the calibrated camera. The second step is to project the 3D coordinates xciin camera coordinate to 2D coordinates ui

(27)

9 in camera image coordinate through the intrinsic matrix K. From (2.1) and (2.3) we know that

u_i= f_xx_ci

z_ci+ x₀, v_i= f_yy_ci

z_ci+ y₀. (2.4)

The operations behind this formulation are projection and translation. After x_ciis projected to the normalized camera image plane, the origin point on the normalized camera image plane is changed to the top-left corner of the camera image, which results in u_iin camera image coordinate.

Given the observed camera-image points ˆu_i = [ ˆu_i, ˆv_i]^>, the pose estimation algorithm needs to determine values for pose p = (R, t) that minimize an appro- priate error function. In principle, there are two possible error functions. One is the reprojection error, which is mostly used in the PnP algorithms,

E_r(p) = 1 n

n i=1

∑

h

( ˆu_i− u_i)²+ ( ˆv_i− v_i)² i

. (2.5)

Another error function is based on the sum of absolute differences (also known as appearance distance) and is mostly used in direct methods and this thesis,

E_a(p) = 1 n_t

nt

i=1

∑

|I_c(u_i) − I_t(x_i)|, (2.6) where n_t represents the total number of pixels in I_t.

2.2 State-of-the-Art Pose Estimation

2.2.1 Marker-based Pose Estimation

The approaches in this category require predefined planar targets with simple binary pattern, which is called fiducial markers. To recognize the planar target, these approaches threshold the camera image, extract the rectangular region of the fiducial marker and perform pattern checking. After the recognition, these method finds the four corners of the target on the camera image to proceed to the estimation stage. A lot of works based on this category are proposed in the past few decades. According to [33], these works can be classified based on the geometry

(28)

of the fiducial marker such as square [34, 35, 36, 37, 38], circular [39, 40] and dots [41, 42]. Since these methods do not need complicated computation such as feature extraction or template matching, the computation time is relatively small.

2.2.2 Feature-based Pose Estimation

Since the feature-based pose estimation relies on nature features, it does not require predefined planar targets. The algorithm flow is shown in Figure 2.2. The features in the target image and the camera image are detected and matched in the first stage. There are numerous feature detection, description and matching methods proposed recently. The most classical one must be SIFT [8] which is a multi-scale detection and scale invariant description scheme. SURF [9] applies the basic idea of SIFT and uses integral image for acceleration. To make the process even faster, fast detector and binary descriptor are proposed in numerous works such as BRISK [10], ORB [11] and FREAK [12]. Since all the methods described above are not affine or perspective invariant, they are less effective in pose estimation when the tilt angle between the planar target and the camera is large. Affine-SIFT (ASIFT) [43] matches feature points well because it simulates all obtainable viewpoints for description. However, there is no existing system applying ASIFT since it is more computationally expensive than others.

The pose is estimated after the quality of feature matching is further enhanced by RANSAC algorithms. Typically there are two kinds of methods used in the estimation stage to obtain poses. The first one is the perspective-n-points (PnP) algorithms. Numerous PnP algorithms such as OI [44], RPP [16], EPnP [17], RP- nP [45] and OPnP [18] are proposed to optimize the pose using the reprojection error function which is described in (2.5). Among these PnP algorithms, OPnP shows the dominant performance in terms of efficiency and accuracy. It formu- lates the the PnP problem as an unconstrained optimization problem and applies Gr¨obner bases solver to estimate the pose.

The second kind of method is to decompose the homography transformation

(29)

11

Feature Matching

Outlier

Removal Estimation

Figure 2.2: The algorithm flow of feature-based pose estimation methods.

which is estimated from the planar target to camera image. Since the transformation between planar target and camera image can be viewed as a 2D-2D homography transformation, the 3D pose can be obtained by decompose the homography. Although the homography decomposition methods proposed in the early years [46, 47, 48] are less effective than PnP algorithms, Collins and Bartoli propose a new analytic solution [19] which have a comparable performance against PnP algorithms with shorter computation time.

2.3 Template Matching

The template matching problem has been widely studied in the literatures, and one important issue is how to efficiently obtain accurate results with evaluating only a subset of the possible transformationes. Since the appearance distances between a template and two sliding windows shifted by a few pixels are usually close due to the nature of image smoothness, Pele and Werman [49] exploit this fact to reduce the time complexity of pattern matching. Alex et al. [50] derive an upper bound of the Euclidean distance based on pixel values according to the spatial overlap of two windows in an image, and use it for a efficient pattern matching.

In [28], Korman et al.show that the 2D affine transformations of a template can be approximated by samples of a density function based on smoothness of a given image and propose a fast matching method, which inspires us to propose the direct

(30)

3D pose estimation algorithm.

2.4 Motion Estimation

The refinement method in proposed algorithm and the 3-scale search in D-PET system are motivated by fast motion estimation methods in video coding. Liu and Feig [51] propose the Gradient Descent Search (GDS) algorithm that evaluates the values of a given objective function from a centralized search neighborhood for motion estimation. When the minimum within a neighborhood is found, it is used to determine the position for the next search until it converges. Compared with the full search method, the GDS algorithm achieves similar performance but with much lower computational complexity. Zhu and Ma [52] develop an algorithm for block-based motion estimation based on two designed diamond-shaped search patterns, and it further reduced the required number of search points. A motion estimation method that exploits more elaborated coarse-to-fine search patterns is subsequently developed by Zhu et al. [53].

(31)

Chapter 3 Direct Pose Estimation and Tracking

The proposed algorithm consists of two steps, as shown in figure 3.1. First, the 3D pose of a planar target with respect to a calibrated camera is estimated. Second, the object pose is further refined and disambiguated. We describe these steps as follows.

3.1 Approximated Pose Estimation

Let Tpbe the transformation at pose p. Assume a reference point x_iin target image transformed is transformed to two points u_i1 and u_i2with two different poses p₁ and 2, respectively. It is shown in [28] that if the spatial distance between u_i1and u_i2is bounded by a positive value ε,

∀x_i∈ I_t: d(T_p₁(x_i), T_p₂(x_i)) = O(ε), (3.1)

then the following equation holds,

|E_a(p₁) − E_a(p₂)| = O(ε ¯

V

^), ^(3.2)

where ¯

V

denotes the mean variation of target image It, which represents the mean value over the entire target image of the maximal difference between each pixel and any of its neighborhood. Since the mean variation ¯

V

can be constrained by

13

(32)

Approximated Pose Estimation

Pose Refinement Camera intrinsic

parameters

Figure 3.1: The proposed algorithm consists of two steps. The pose is estimated in the first stage and refined in the second stage.

filtering I_t, the difference between E_a(p₁) and E_a(p₂) is bounded in terms of ε. In the proposed algorithm, we only need to consider a limited number of poses by constructing a ε-covering pose set

S

based on (3.1) and (3.2).

3.1.1 ε-Covering Set Construction

In this thesis, we factorize the rotation matrix as R = R_z(θzc)R_x(θx)R_z(θzt) [54]

as shown in Figure 3.2(a). The pose is parameterized as p = [θ_z_c, θx, θzt,t_x,t_y,t_z]^>. These Euler angles θ_z_c, θ_x, and θ_z_t are in the range [−180°, 180°], [0°, 90°], and [−180°, 180°], respectively. According to the factorization and (2.1), the transformation between a reference point x_i= [x_i, y_i, 0] on I_tand the corresponding image coordinate u_i= [u_i, v_i] can be formulated as





 hu_i hv_i h







=







f_x 0 x₀ 0 f_y x₀

0 0 1





 h

R|t i





 x_i y_i 0







, (3.3)

where

R =







c_z_cc_z_t− c_xs_z_cs_z_t −c_xc_z_ts_z_c− c_z_cs_z_t s_xs_z_c c_z_ts_z_c+ c_xc_z_cs_z_t c_xc_z_cc_z_t− s_z_cs_z_t −s_xc_z_c

s_xs_z_t s_xc_z_t c_x





 , t =





 t_x t_y t_z







. (3.4)

(33)

15 The notation s and c indicates sin and cos operation, respectively.

A pose set

S

is constructed such that any two consecutive poses, p_k and p_k+

∆p_k on each dimension, satisfy (3.1) in

S

. To construct the set favorably, the coordinates of x_i∈ I_t are pre-normalized to the range [−1, 1]. Starting with t_z, we derive the equation below by using (3.3) for each x_i,

d(T_p_tz(x_i), T_p_tz+∆tz(x_i)) = s

[(f_xx_i

t_z ) − ( f_xx_i

t_z+ ∆t_z)]²+ [(f_yy_i

t_z ) − ( f_yy_i t_z+ ∆t_z)]²

= O(1

t_z− 1 t_z+ ∆tz

).

(3.5)

To make (3.5) satisfy the constraint in (3.1), we use the step size, with tight bound in Big-Theta notation,

∆t_z= Θ( εt_z²

1 − εt_z), (3.6)

which means that (3.5) can be bounded if we construct

S

with the step (3.6) on dimension t_z.

Since θ_xdescribes the tilt angle between camera and target image as shown in Figure 3.2(a), we obtain the following equation depending on the current tz,

d(T_p_θx(x_i), T_p_θx+∆θx(x_i)) = q

d_u²_i+ d_v²_i

= O( 1

t_z− sin(θx+ ∆θx)− 1 t_z− sin(θx)),

(3.7)

for each x_i, where

d_u_i= ( f_xx_i

y_isin θ_x+ t_z) − ( f_xx_i

y_isin(θ_x+ ∆θx) + t_z), d_v_i= ( f_yy_icos θ_x

y_isin θx+ t_z) − ( f_yy_icos(θ_x+ ∆θx) y_isin(θx+ ∆θx) + t_z).

(3.8)

In addition, to make (3.7) satisfy the constraint in (3.1), we set the step size,

∆θx= Θ(sin⁻¹(t_z− 1 ε +_t ¹

z−sin(θ_x)

) − θx). (3.9)

Let θ_z⁰

t = θzt+ ∆θzt, we obtain the following equation based on the current t_z

(34)

𝜃𝑧𝑡

Tile Angle 𝜃𝑧𝑐

𝜃𝑥

Δ𝜃_𝑧_𝑡 𝑂(Δ𝜃_𝑧_𝑡) Rotation around

𝑍𝑡-axis 1

1 Δ𝜃𝑧𝑡

𝑂 Δ𝜃𝑧_𝑡 𝑡𝑧− sin 𝜃𝑥 𝑂 Δ𝜃𝑧𝑡

𝑡𝑧+ sin 𝜃𝑥

Δ𝜃𝑧_𝑡

(a) (b) (c)

Figure 3.2: (a) Illustration of rotation matrix factorization. θ_x indicates the tilt angle between the camera and the target. (b) 2D illustration of rotation around Z_t-axis. The linear distance (orange solid line) between points before and after applying rotation is bounded by the arc length (brown dotted line). (c) 3D illustration of rotation around Z_t-axis. The linear distance between points is a function of tilt angle θ_x.

and θ_x,

d(T_p_θ

zt(x_i), T_p_θ

zt +∆θzt(x_i)) = q

f_x²C₁²+ f_y²C₂²≤ s

f_x²C₁²+ f_y² C₂ c_x

2

= O

∆θz_t

t_z+ k sin(θx)

,

(3.10)

, where

C₁= c_z_tx− s_z_ty s_x(s_z_tx+ c_z_ty) + t_z−

cz⁰_tx− s

z⁰_ty s_x(s

z⁰_tx+ c

z⁰_ty) + t_z, C₂= c_x(s_z_tx+ c_z_ty)

s_x(s_z_tx+ c_z_ty) + t_z− c_x(s

z⁰_tx+ c

z_t⁰y) s_x(s_z⁰

tx+ c_z⁰

ty) + t_z

(3.11)

The constant k denotes any constant in the range of [−√ 2,√

2]. An illustrative example of (3.10) is shown in Figure 3.2(b)(c). To make (3.10) satisfy the constraint in (3.1), we set the step size

∆θ_z_t = Θ(ε(t_z+ k sin(θ_x))), (3.12) where larger k means larger bounded steps for constructing

S

. We set k to be 0 for

∆θzt in the proposed method.

(35)

17 Since θ_z_t denotes 2D image rotation of the planar target, it does not affect the bounded steps for θzc. Let θ_z⁰

c= θzc+ ∆θzc, we obtain the following equation depending on the current t_z and θ_x,

d(Tp_θ_zc(x_i), Tp_θ_{zc +∆θzc}(x_i))

= s

f_x² c_z_cx− c_xs_z_cy s_xy+ t_z −

cz⁰_cx− c_xs

z⁰_cy s_xy+ t_z

2

+ f_y² s_z_cx+ c_xc_z_cy s_xy+ t_z −

sz⁰_cx+ c_xc

z⁰_cy s_xy+ t_z

2

= O

∆θ_z_c t_z+ k sin(θx)

.

(3.13) We can realize (3.13) in a similar way to (3.10). To make (3.13) satisfy the constraint in (3.1), we set the step size,

∆θ_z_c= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz)), (3.14) which k is set to 0.

Finally, as the bounded steps for t_x and t_y are also affected by horizontal distance t_z and tilt angle θ_x only, the equations below can be derived,

d(T_p_tx_(x_i₎, T_p_tx+∆tx(x_i))

= s

f_x²

x+ t_x

s_xy+ t_z−x+ t_x+ ∆tx

s_xy+ t_z

2

+ f_y²

y

s_xy+ t_z− y s_xy+ t_z

2

= O

∆t_x t_z+ k sin(θx)

,

(3.15)

d(T_p_ty_(x_i₎, T_p_ty+∆ty(x_i))

= s

f_x²

x

s_xy+ t_z− x s_xy+ t_z

2

+ f_y²

y+ t_y

s_xy+ t_z−y+ t_y+ ∆ty

s_xy+ t_z

2

= O

∆t_y t_z+ k sin(θx)

,

(3.16)

To make (3.15) and (3.16) satisfy the constraint in (3.1), we set these step sizes,

∆tx= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz−√

2 sin(θ_x))), (3.17)

∆ty= Θ(ε(tz+ k sin(θx))) = Θ(ε(tz−√

2 sin(θx))), (3.18)

(36)

Table 3.1: Step size on each dimension for constructing the ε-covering pose set.

Dimension Step Size

θzc Θ(εtz)

θx Θ(sin⁻¹(t_z− ¹

ε+_{tz−sin(θx)}¹ ) − θx)

θ_z_t Θ(εt_z)

t_x Θ(ε(t_z−√

2 sin(θ_x))) t_y Θ(ε(t_z−√

2 sin(θ_x)))

t_z Θ( ^εt

z2

1−εtz)

as k is set to −√

2 for pratical consideration.

Table 3.1 summarizes the bounded step size on each domain for constructing ε-covering pose set. From the table we know that the step size of θ_z_c, θ_z_t and t_zare function of tz. The step size of θx, txand tyare function of tzand θx. Based on these observations, the construction of the ε-covering set is designed to be nested. First, we construct the set of tz using (3.6) to satisfy the constraint for approximated pose estimation. We then construct the set of θ_x according to (3.9) within each t_z. Finally the set of θzc, θzt, txand tyare constructed sequetially by considering each value of t_z and θ_x.

3.1.2 Coarse-to-Fine Estimation

Due to the large parameter space, the computational and memory costs are pro- hibitively high if the ε-covering pose set is used directly for pose estimation. For fast and accurate pose estimation, a coarse-to-fine approach is employed. As illustrated in Figure 3.3, the pose set

S

is first constructed with a coarse ε. After obtaining the best pose p_band the associated error measure E_a(p_b), we select the poses within a threshold,

S

L= {p_L | E_a(p_L) < E_a(p_b) + L}, (3.19)

(37)

19

1D view of pose space 𝐸_𝑎

𝐿 𝜺

𝑆 𝑆

𝑆^′ 𝑆_𝐿

Figure 3.3: An illustration of proposed coarse-to-fine estimation.

to be considered in the next round. Here the constant L is a threshold set empiri- cally. Based on E_a, we create sets with finer ε⁰,

S

⁰^{= {p}⁰ ^{| ∃p}L ∈

S

L: (3.1) holds for p⁰, p_L and ε⁰}, (3.20)

and repeat search until we obtain the desired precision parameter ε^∗. The best pose in the last set is used as the approximated estimate. Figure 3.4 shows an example of the coarse-to-fine estimation. According to the pose p_bwith minimum E_a estimated in each round of the coarse-to-fine estimation, the projected target image is rendered on the camera image. It is obvious that the projected target image is approximately aligned to the target in the camera image as the precision parameter ε becomes finer.

The algorithm is accelerated by random sampling a potion of pixels to approximated the error measure E_a⁰ instead of using all pixels in I_t to compute E_ain (2.6).

According to Hoeffding’s inequality [55], E_a⁰ is probably close to Eawithin a pre-

(38)

Target Image Camera Image

Round 1: ε = 0.25, E_a= 0.051 Round 2: ε = 0.17, E_a= 0.033

Round 3: ε = 0.11, E_a= 0.018 Round 4: ε = 0.072, E_a= 0.015

Round 5: ε = 0.048, Ea= 0.0149 Round 6: ε = 0.032, Ea= 0.013

Figure 3.4: An example of the coarse-to-fine estimation. According to the pose p_bwith minimum E_aestimated in each round of the coarse-to-fine estimation, the projected target image is rendered on the camera image.

(39)

21 cision parameter δ if the number of sampling pixels m is large enough,

P(|E_a⁰ − E_a| > δ) ≤ 2e^−2δ²^m, (3.21)

where P(·) represents the probability measure. This equation suggests that if m is properly selected, the approximation error between E_a⁰ and E_a can be bounded with high probability. In other words, E_a⁰ is a approximation of E_a within the probably approximately correct (PAC) framework [55]. With this approximation, the runtime of the proposed algorithm is dramatically reduced.

Moreover, in order to be invariant to different lighting conditions, we normal- izes the intensity term and adds chroma terms to the appearance distance measure.

In implementation, we convert the RGB values of sampling pixels to the YCbCr color space. The appearance distance is then computed as follows,

E_a= 0.5E_a_Y+ 0.25E_a_Cb+ 0.25E_a_Cr. (3.22)

We determine the weight for each term by experiments and find that the proposed algorithm is more invariant to intensity change when the chroma terms are added than the normalized intensity values are only used.

3.2 Pose Refinement

We obtain (R⁰, t⁰) after the proposed approximated pose estimation scheme. How- ever, this result is bounded based on the distance in the appearance space rather than the pose space. Therefore the estimated pose and actual pose may be signif- icantly different even when the appearance distance is small, which is often the case when the tilt angle of a target image is large. In the meanwhile, the pose ambiguity problem is likely occur as illustrated in Figure 1.1. Consequently, a pose refinement scheme is proposed to further improve the accuracy and address the ambiguity problem.

(40)

3.2.1 Candidate Poses Exploration

In order to address the problem of pose ambiguity, we first transform four corner points x_c1, x_c2, x_c3, and x_c4 in the target image I_t to u_c1, u_c2, u_c3, and u_c4 in the camera image I_c with (R⁰, t⁰), respectively. We then compute all local minima of the error function (2.5). Finally, only the local minima with the two smallest objective values in (2.6) are plausible poses, and these two ambiguous poses are both chosen as the candidate poses. In practice, IPPE [19] and OPnP [18] are the two methods we consider to apply to get the candidate poses. Through the experiment we describe in Section 4.1, we choose to apply OPnP to accomplish the work.

3.2.2 Refining Pose Estimation

After obtaining the two candidate poses, we can further improve the accuracy using a coarse-to-fine gradient descent search scheme. In contrast to the 2D motion estimation in video coding, we consider a 6D pose motion with infinity resolution in this work. A 2D view of the coarse-to-fine gradient descent search is shown in Figure 3.5(a). The largest blue circle denotes the approximate pose estimated in Section 3.1, and the smaller one (orange) represents the local minimum found by the search pattern at the starting ε-precision. As the minimum under the current precision level is found, we diminish the precision parameter ε and perform gradient descent search again on the next level. This process is repeated until we obtain the local minimum under the desired precision parameter ε^∗. Finally, the pose with smaller Eais chosen from the two refined candidate poses.

The 2D view of the checking pattern in the coarse-to-fine GDS scheme [51]

are shown in Figure 3.5(b). It is formed by 13 checking points, including the center point and its 12 neighbors. These 12 neighbors are ε-away from the center separately in the 6D pose space. Let pc= [θzc, θx, θzt,t_x,t_y,t_z]^> be the center point of the checking pattern in the pose space and P_c be the 6 × 13 matrix with repeating pcin a row. Also let s_ε= [s_θ_zc, s_θ_x, s_θ_zt, s_t_x, s_t_y, s_t_z]^> be the step size listed

(41)

23

Coarse Final Candidate Fine

(a) (b)

Figure 3.5: (a) 2D illustration of coarse-to-fine gradient descent search. We carry out the GDS on the coarse level in the beginning. After reaching the local minimum, we move to the finer level and repeat this coarse-to-fine GDS approach until we obtain minimum within the desired precision. (b) The checking pattern in 2D view, including 1 center checking point and its 4 neighnbors (2 neighnbors per dimension).

in Table 3.1 with precision parameter ε and S_εbe the 6 × 13 matrix with repeating s_εin a row. The mathematical description of the checking pattern M can then be written as

M = P_c+ D ◦ S_ε, (3.23)

where

D =







0 1 −1 0 0 . . . 0 0 0 0 0 1 −1 . . . 0 0

0 0 0 0 0 . . . 0 0

0 0 0 0 0 . . . 1 −1







(3.24)

and ◦ represents the Hadamard product. Each column in M represents one of the checking points within the checking pattern. The main steps of the proposed pose estimation method are summarized in Algorithm 1.

(42)

Algorithm 1 Proposed Direct 3D Pose Estimation

Input: Target image I_t, camera image I_c, intrinsic parameters, and precision parameters ε^∗_c, ε^∗_f.

Output: Estimated pose result p^∗.

1: Create an ε-covering pose set

S

^.

2: Find p_bfrom

S

^{with E}a⁰ according to (3.21).

3: while ε > ε^∗_c do

4: Obtain the set

S

L according to (3.19);

5: Diminish ε;

6: Replace

S

according to (3.20);

7: Find p_bfrom

S

^{with E}a⁰ according to (3.21);

8: end while

9: Explore the candidate poses p₁and p₂with p_b.

10: for i = 1 → 2 do

11: Let p_c= p_iand ε_i= ε

12: while ε_i> ε^∗_f do

13: Find p_bfrom (3.23) with E_a⁰ according to (3.21).

14: if p_c6= p_bthen

15: p_c= p_b

16: else

17: Diminish ε_i;

18: end if

19: end while

20: Let p_i= p_c

21: end for

22: Return the pose p^∗with smaller E_afrom p₁and p₂

(43)

Chapter 4 Experiments

We experimentally evaluate the proposed algorithm for the 3D pose estimation problem using both synthetic and real datasets, and compare it with the feature- based schemes. Through some preliminary experiments, we find SIFT [8] method performs better than other alternative features in terms of repeatability and accuracy. Similar observations can also be found in [1]. As the ASIFT [43] method is considered the state-of-the-art affine-invariant method to find correspondences under large view change, we use both the SIFT and ASIFT methods in the compared feature-based schemes. The RANSAC-based method [13] is then used to eliminate outliers before object pose is estimated by the PnP algorithms. It has been shown that, among the PnP algorithms [16, 17, 18, 56], the OPnP [18] algorithm achieves the state-of-the-art results in terms of efficiency and accuracy.

Therefore we use the OPnP algorithm as the pose estimator in the feature-based schemes.

Given the true rotation matrix ˆR and translation vector ˆt, we compute the rotation error of the estimated rotation matrix R by ER(degree) = acosd((Tr(R^>· R) − 1)/2), where acosd(·) represents the arc-cosine operation in degrees. Theˆ translation error of the estimated translation vector t is measured by the relative difference between ˆt and t defined as E_t(%) = kˆt − tk/kˆtk × 100. We define a pose to be successfully estimated if its both errors are under pre-defined thresholds.

25

針對單一平面目標物之三維姿態的直接分析：演算法和系統實作

Graduate Institute of Electronics Engineering College of Electrical Engineering and Computer Science

National Taiwan University Master Thesis

Direct 3D Pose Estimation of a Planar Target: Algorithm and Implementation

Hung-Yu Tseng Advisor: Shao-Yi Chien Co-Advisor: Ming-Hsuan Yang

Graduate Institute of Electronics Engineering National Taiwan University

Taipei, Taiwan

June 2016

Abstract

Contents

List of Figures

List of Tables

Chapter 1 Introduction

1.1 3D Pose Estimation

1.2 Pose Ambiguity

1.3 Contribution

1.4 Thesis organization

Chapter 2

Background Knowledge and Related Works

2.1 Problem Definition

∑

∑

2.2 State-of-the-Art Pose Estimation

2.2.1 Marker-based Pose Estimation

2.2.2 Feature-based Pose Estimation

Feature Matching

Outlier

Removal Estimation

2.3 Template Matching

2.4 Motion Estimation

Chapter 3

Direct Pose Estimation and Tracking

3.1 Approximated Pose Estimation

V

V

V

S

3.1.1 ε-Covering Set Construction

S

S

S

S

3.1.2 Coarse-to-Fine Estimation

S

S

S

S

3.2 Pose Refinement

3.2.1 Candidate Poses Exploration

3.2.2 Refining Pose Estimation

Coarse Final Candidate Fine

S

S

S

S

S

Chapter 4 Experiments