
SIMULTANEOUS LOCALIZATION, MAPPING AND MOVING OBJECT TRACKING

Chieh-Chih Wang

CMU-RI-TR-04-23

Robotics Institute

Carnegie Mellon University

Pittsburgh, PA 15213

April 2004

Submitted in partial fulfilment of the requirements for the degree of

Doctor of Philosophy

Thesis Committee:

Charles Thorpe, Chair

Martial Hebert

Sebastian Thrun, Stanford University

Hugh Durrant-Whyte, University of Sydney

© Chieh-Chih Wang, MMIV


ABSTRACT

Localization, mapping and moving object tracking serve as the basis for scene understanding, which is a key prerequisite for making a robot truly autonomous.

Simultaneous localization, mapping and moving object tracking (SLAMMOT) involves not only simultaneous localization and mapping (SLAM) in dynamic environments but also detecting and tracking these dynamic objects. It is believed by many that a solution to the SLAM problem would open up a vast range of potential applications for autonomous robots. Accordingly, a solution to the SLAMMOT problem would expand robotic applications in proximity to human beings where robots work not only for people but also with people.

This thesis establishes a new discipline at the intersection of SLAM and moving object tracking. Its contributions are two-fold: theoretical and practical.

From a theoretical perspective, we establish a mathematical framework to integrate SLAM and moving object tracking, which provides a solid basis for understanding and solving the whole problem. We describe two solutions: SLAM with generic objects (GO), and SLAM with detection and tracking of moving objects (DATMO). SLAM with GO calculates a joint posterior over all generic objects and the robot. Such an approach is similar to existing SLAM algorithms, but with additional structure to allow for motion modelling of the generic objects. Unfortunately, it is computationally demanding and infeasible. Consequently, we provide the second solution, SLAM with DATMO, in which the estimation problem is decomposed into two separate estimators. By maintaining separate posteriors for the stationary objects and the moving objects, the resulting estimation problems are much lower dimensional than SLAM with GO.

From a practical perspective, we develop algorithms for dealing with the implementation issues on perception modelling, motion modelling and data association. Regarding perception modelling, a hierarchical object based representation is presented to integrate existing feature-based, grid-based and direct methods. The sampling- and correlation-based range image matching algorithm is developed to tackle the problems arising from uncertain, sparse and featureless measurements. With regard to motion modelling, we describe a move-stop hypothesis tracking algorithm to tackle the difficulties of tracking ground moving objects. Kinematic information from motion modelling as well as geometric information from perception modelling is used to aid data association at different levels.

By following the theoretical guidelines and implementing the described algorithms, we are able to demonstrate the feasibility of SLAMMOT using data collected from the Navlab8 and Navlab11 vehicles at high speeds in crowded urban environments.


ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor Chuck Thorpe for supporting me throughout the years, and for his priceless technical advice and wisdom. Martial Hebert has long been an inspiration to me, and I am grateful for the many enjoyable discussions with him during my Ph.D. career. My gratitude also goes to Sebastian Thrun and Hugh Durrant-Whyte, whose suggestions, insights and critique proved to be invaluable to this thesis.

I would like to thank the members of the Navlab group for their excellent work on building and maintaining the Navlab8 and Navlab11 vehicles, and for their help in collecting data. I would like to specifically acknowledge: Justin Carlson, David Duggins, Arne Suppe, John Kozar, Jay Gowdy, Robert MacLachlan, Christoph Mertz and David Duke.

Thanks also go to the members of the 3D computer vision group and the MISC reading group. The weekly meetings with the MISC reading group have proved to be one of my best learning experiences at CMU. I would like to specifically acknowledge Daniel Huber, Nicolas Vandapel, and Owen Carmichael for their help with Spin Images.

I would like to thank my many friends at CMU with whom I have had the pleasure of working and living over the years. These include Carl Wellington, Vandi Verma, Wei-Tech Ang, Cristian Dima, Wen-Chieh “Steve” Lin, Jing Xiao, Fernando Alfaro, Curt Bererton, Anthony Gallagher, Jinxiang Chai, Kiran Bhat, Aaron Courville, Siddhartha Srinivasa, Liang Zhao (now at Univ. of Maryland), Matt Deans (now at NASA Ames), and Stella Yu (now at Berkeley).

Thanks to the members of my research qualifier committee, John Bares, Simon Baker and Peng Chang (now at Sarnoff), for their feedback on earlier research.

I would like to thank Peter Cheeseman for hiring me as an intern at NASA Ames Research Center in 2002 to work on 3D SLAM, and Dirk Langer for hiring me as an intern in 2001 to work on Spin Images.

Thanks to Suzanne Lyons Muth for her great administrative support.

Special thanks go to my parents, my brother Chieh-Kun, and sister-in-law Tsu-Ying for their support and sacrifices, and for letting me pursue my dreams over the years.

Finally, I would like to thank my wife Jessica Hsiao-Ping for the weekly commutes between Columbus and Pittsburgh in all weather, for providing a welcome distraction from school, for bringing me happiness, and for her support, encouragement and love.

This thesis was funded in part by the U.S. Department of Transportation; the Federal Transit Administration; by Bosch Corporation; and by SAIC Inc. Their support is gratefully acknowledged.


LIST OF FIGURES

1.1 Robotics for safe driving . . . . 2

1.2 A traffic scene on a highway . . . . 4

1.3 A traffic scene in an urban area . . . . 4

1.4 SLAM vs. DATMO . . . . 4

1.5 City-sized SLAM . . . . 6

1.6 Tracking difficulty vs. degrees of freedom . . . . 9

1.7 The Navlab8 testbed . . . 10

1.8 The Navlab11 testbed . . . 10

1.9 Raw data from the Navlab11 testbed . . . 11

1.10 Result of SLAM with DATMO . . . 11

1.11 Thesis overview . . . 12

2.1 The SLAM process, the MOT process and the SLAMMOT process . . . 16

2.2 Compounding of spatial relationships . . . 17

2.3 The inverse relationship . . . 18

2.4 The tail-to-tail relationship . . . 19

2.5 A Dynamic Bayesian Network (DBN) of the SLAM problem . . . 21

2.6 The initialization stage of SLAM . . . 22

2.7 A DBN representing the initialization stage of SLAM . . . 22

2.8 The prediction stage of SLAM . . . 23

2.9 A DBN representing the prediction stage of SLAM . . . 23

2.10 The data association stage of SLAM . . . 24

2.11 A DBN representing the data association stage of SLAM . . . 24

2.12 The update stage of SLAM . . . 25

2.13 A DBN representing the update stage of SLAM . . . 25

2.14 A DBN for multiple model based moving object tracking . . . 28

2.15 The GPB1 algorithm of one cycle for 2 switching models . . . 30

2.16 The GPB2 algorithm of one cycle for 2 switching models . . . 31

2.17 The IMM algorithm of one cycle for 2 switching models . . . 31

2.18 The initialization stage of moving object tracking . . . 34

2.19 A DBN representing the initialization stage of moving object tracking . . . 34

2.20 The prediction stage of moving object tracking . . . 34


2.21 A DBN representing the prediction stage of moving object tracking . . 34

2.22 The data association stage of moving object tracking . . . 35

2.23 A DBN representing the data association stage of moving object tracking . . . 35

2.24 The update stage of moving object tracking . . . 35

2.25 A DBN representing the update stage of moving object tracking . . . . 35

2.26 Model Selection . . . 36

2.27 Move-stop-move object tracking . . . 36

2.28 A DBN for SLAM with Generic Objects . . . 37

2.29 A DBN of the SLAM with DATMO problem of duration three with one moving object and one stationary object . . . 42

2.30 The initialization stage of SLAM with DATMO . . . 42

2.31 A DBN representing the initialization stage of SLAM with DATMO . . 42

2.32 The prediction stage of SLAM with DATMO . . . 43

2.33 A DBN representing the prediction stage of SLAM with DATMO . . . . 43

2.34 The data association stage of SLAM with DATMO . . . 43

2.35 A DBN representing the data association stage of SLAM with DATMO . . . 43

2.36 The update stage of the SLAM part of SLAM with DATMO . . . 43

2.37 A DBN representing the update stage of the SLAM part of SLAM with DATMO . . . 43

2.38 The update stage of the DATMO part of SLAM with DATMO . . . 44

2.39 A DBN representing the update stage of the DATMO part of SLAM with DATMO . . . 44

3.1 Vegetation and plant object: Bush . . . 47

3.2 Curvy object: A building . . . 47

3.3 Circle extraction . . . 48

3.4 Line extraction . . . 48

3.5 Hierarchical object based representation . . . 51

3.6 An example of scan segmentation . . . 52

3.7 Footprints of the measurement from SICK LMS 291 . . . 53

3.8 SICK LMS 211/221/291 noise model . . . 53

3.9 Scan segments of two scans . . . 56

3.10 An initial guess of the relative transformation . . . 57

3.11 Results of segment 1 registration . . . 57

3.12 Results of Figure 3.11 are shown with the whole scans . . . 58

3.13 Sampling-based uncertainty estimation . . . 58

3.14 The corresponding sample means and covariances using different numbers of samples. . . 59

3.15 The occupancy grid map of Segment 1 of A . . . 60

3.16 The occupancy grid map of Segment 1 of B . . . 60

3.17 The normalized correlations of the samples . . . 61

3.18 Mean and covariance estimation by clustering . . . 61

3.19 A wide variety of moving objects in urban areas . . . 63


3.20 Different portions of a moving car . . . 64

3.21 Registration results of the example in Figure 3.20 using the SCRIM algorithm . . . 65

3.22 Aerial photo of the CMU neighborhood . . . 66

3.23 Pose estimates from the inertial measurement system . . . 66

3.24 Generated grid maps along the trajectory . . . 67

3.25 Details of the grid maps . . . 68

3.26 The data association in the large problem . . . 69

3.27 The result without loop-closing . . . 70

3.28 The result with loop-closing . . . 71

4.1 Model complexity . . . 74

4.2 A simulation of the constant acceleration maneuver . . . 78

4.3 The velocity estimates and the probabilities of the CV model and the CA model in the constant acceleration maneuver simulation . . . 78

4.4 A simulation of the constant velocity maneuver . . . 79

4.5 The velocity estimates and the probabilities of the CV model and the CA model in the constant velocity motion simulation . . . 79

4.6 A simulation of the turning maneuver . . . 80

4.7 The velocity estimates and the probabilities of the CV model and the CA model in the turning maneuver simulation . . . 80

4.8 A simulation of the move-stop-move maneuver . . . 80

4.9 The velocity estimates and the probabilities of the CV model and the CA model in the move-stop-move maneuver simulation . . . 80

4.10 The simulation in which a stationary object is tracked using a Kalman filter with the CV model . . . 81

4.11 A simulation in which the stationary object is tracked using a Kalman filter with the stop model . . . 83

4.12 A simulation in which a stationary object is tracked using the stationary process model . . . 83

4.13 A simulation in which an object moving at a constant velocity is tracked using a Kalman filter with the stop model . . . 83

4.14 A simulation in which a constant velocity moving object is tracked using the stationary process model . . . 83

4.15 The nested model set . . . 84

4.16 Occam’s Razor . . . 85

4.17 A simulation of the move-stop-move maneuver tracked by move-stop hypothesis tracking. . . 87

4.18 The enlargement of Figure 4.17 . . . 87

5.1 Gating . . . 92

5.2 Data association in the cluttered . . . 94

5.3 Clustering . . . 94

5.4 Data association in the large . . . 95

5.5 Covariance increasing . . . 96


5.6 The grid-map pair of the same region built at different times: Grid map 1 and Grid map 16 . . . 97

5.7 Recognition and localization results using different scales of grid map 1 and grid map 16 . . . 98

5.8 Two sequences. The relative starting locations of these two sequences are assumed to be unknown . . . 99

5.9 Details of grid map 1-9 . . . 99

5.10 Details of grid map 10-21 . . . 100

5.11 The bar graph of the maximum correlation values of the grid map pairs between the grid map 1-14 sequence and the grid map 15-21 sequence . . . 101

5.12 The slices of Figure 5.11 . . . 101

5.13 The total correlation value of the consecutive grid maps . . . 101

6.1 Performing detection, data association and motion modelling in a global frame . . . 104

6.2 Performing detection, data association and motion modelling with respect to a temporary global coordinate system . . . 104

6.3 Case 1 of detection . . . 106

6.4 Case 2 of detection . . . 106

6.5 Consistency-based detection . . . 107

6.6 Iterated SLAM with DATMO . . . 108

6.7 Multiple vehicle detection and data association . . . 111

6.8 Pedestrian detection and data association . . . 112

6.9 Pedestrian detection and data association . . . 112

6.10 Bus detection and data association . . . 113

6.11 Temporary stationary objects . . . 113

6.12 Tracking results of the example in Figure 3.20 . . . 114

6.13 Detection and data association results . . . 115

6.14 The partial image from the tri-camera system . . . 115

6.15 Raw data of 201 scans . . . 115

6.16 Results of multiple ground vehicle tracking . . . 115

6.17 Speed estimates . . . 116

6.18 An intersection. Pedestrians are pointed out by the arrow. . . 116

6.19 Visual images from the tri-camera system . . . 116

6.20 Raw data of 141 scans . . . 117

6.21 Results of multiple pedestrian tracking . . . 117

6.22 Speed estimates of object A . . . 117

6.23 Probabilities of the CV and CA models of object A . . . 117

6.24 Speed estimates of object B . . . 117

6.25 Probabilities of the CV and CA models of object B . . . 117

6.26 The scene . . . 118

6.27 201 raw scans and the robot trajectory . . . 118

6.28 The visual image from the tri-camera system . . . 118


6.29 The result of move-stop object tracking using IMM with the CV and CA models . . . 118

6.30 Speed estimates from IMM . . . 119

6.31 The result of tracking using the move-stop hypothesis tracking algorithm . . . 119

6.32 A 3-D map of several street blocks . . . 120

6.33 A 3-D model of the Carnegie Museum of Natural History . . . 120

6.34 3-D models of buildings on Filmore street . . . 121

6.35 3-D models of parked cars in front of the Carnegie Museum of Art . . . 121

6.36 3-D models of trees on S. Bellefield avenue. . . 121

6.37 Dramatic changes between consecutive scans due to a sudden start . . 122

6.38 False measurements from a uphill environment . . . 122

6.39 The failure mode of the laser scanners . . . 124

6.40 The direct sun effect on the regular camera . . . 125

6.41 An available digital map . . . 126

6.42 The reconstructed map is overlayed on an aerial photo . . . 126

6.43 The same intersections shown in Figure 6.41 are overlayed on our reconstructed map . . . 126

7.1 Between SLAM with GO and SLAM with DATMO . . . 131

7.2 4-D environments . . . 132


LIST OF TABLES

3.1 Representation comparison . . . 50

3.2 Object saliency score . . . 62

4.1 Model Complexity . . . 86

6.1 Features of SICK laser scanners . . . 110

6.2 Steepness grades of Pittsburgh hills . . . 123


CHAPTER 1 Introduction

One, a robot may not injure a human being, or through inaction, allow a human being to come to harm;

Two, a robot must obey the orders given by human beings except where such orders would conflict with the First Law;

Three, a robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

– Isaac Asimov (1920 - 1992)

“I, Robot”

Scene understanding is a key prerequisite for making a robot truly autonomous.

The scene around the robot consists of stationary and/or moving objects. In applications such as planetary exploration and automated mining, the world around the robot consists of stationary objects, and only the robot can change its and the world’s states. In applications such as elder care, office automation, security and safe driving, the world is dynamic, consisting of both stationary and moving entities.

Establishing the spatial and temporal relationships among the robot, stationary objects and moving objects in the scene serves as the basis for scene understanding. Localization is the process of establishing the spatial relationships between the robot and stationary objects, mapping is the process of establishing the spatial relationships among stationary objects, and moving object tracking is the process of establishing the spatial and temporal relationships between moving objects and the robot or between moving objects and stationary objects.

Localization, mapping and moving object tracking are difficult because of uncertainty and unobservable states in the real world. For instance, perception sensors such as cameras, radar and laser range finders, and motion sensors such as odometry and inertial measurement units are noisy. For moving object tracking, the intentions, or control inputs, of the moving objects are unobservable without using extra sensors mounted on the moving objects.

This dissertation is concerned with the problem of how a robot can accomplish localization, mapping and moving object tracking in the real world. We will provide a theoretical framework that integrates all these problems and explain why all these problems should be solved together. We will find algorithms for efficiently and robustly solving this whole problem of simultaneous localization, mapping and moving object tracking (SLAMMOT). We will demonstrate these algorithms with ample experimental results from a ground vehicle at high speeds in crowded urban areas.

It is believed by many that a solution to the simultaneous localization and mapping (SLAM) problem will open up a vast range of potential applications for autonomous robots (Thorpe and Durrant-Whyte, 2001; Christensen, 2002). We believe that a solution to the simultaneous localization, mapping and moving object tracking problem will expand the potential for robotic applications in proximity to human beings. Robots will be able to work not only for people but also with people. In the next section, we will illustrate the whole problem with an example application, safe driving. See Figure 1.1 for an illustration.

Figure 1.1. Robotics for safe driving. Localization, mapping, and moving object tracking are critical to driving assistance and autonomous driving.


1.1. Safe Driving

To improve driving safety and prevent traffic injuries caused by human factors such as speeding and distraction, techniques to understand the surroundings of the vehicle are critical. We believe that being able to detect and track every stationary object and every moving object, to reason about the dynamic traffic scene, to detect and predict every critical situation, and to warn and assist drivers in advance, is essential to prevent these kinds of accidents.

Localization

In order to detect and track moving objects by using sensors mounted on a moving ground vehicle at high speeds, a precise localization system is essential. It is known that GPS and DGPS often fail in urban areas because of urban canyon effects, and good inertial measurement systems (IMS) are very expensive.

If a stationary object map is available in advance, map-based localization techniques such as those proposed by Olson (2000), Fox et al. (1999) and Dellaert et al. (1999) can be used to increase the accuracy of the pose estimate. Unfortunately, it is difficult to build a usable stationary object map because of temporary stationary objects such as parked cars. Stationary object maps of the same scene built at different times could still be different, which means that we still have to do online map building to update the current stationary object map.

Simultaneous Localization and Mapping

Simultaneous localization and mapping (SLAM) allows robots to operate in an unknown environment, incrementally build a map of this environment and concurrently use this map to localize themselves. Over the last decade, the SLAM problem has attracted immense attention in the mobile robotics literature (Christensen, 2002), and SLAM techniques are at the core of many successful robot systems (Thrun, 2002). However, Wang and Thorpe (2002) have shown that SLAM can perform badly in crowded urban environments because of the static environment assumption. Moving objects have to be detected and filtered out.

Detection and Tracking of Moving Objects

The detection and tracking of moving objects (DATMO) problem has been extensively studied for several decades (Bar-Shalom and Li, 1988, 1995; Blackman and Popoli, 1999).


Even with precise localization, it is not easy to solve the DATMO problem in crowded urban environments because of a wide variety of targets (Wang et al., 2003a).

When cameras are used to detect moving objects, appearance-based approaches are widely used, and objects can be detected whether or not they are moving. If laser scanners are used, feature-based approaches are usually the preferred solutions. Both appearance-based and feature-based methods rely on prior knowledge of the targets.

In urban areas, there are many kinds of moving objects such as pedestrians, animals, wheelchairs, bicycles, motorcycles, cars, buses, trucks and trailers. Velocities range from under 5 mph (such as a pedestrian’s movement) to 50 mph. Figure 1.2 shows a traffic scene on a highway and Figure 1.3 shows a traffic scene in an urban area. When using laser scanners, the features of moving objects can change significantly from scan to scan. As a result, it is very difficult to define features or appearances for detecting specific objects using laser scanners.

Figure 1.2. A traffic scene on a highway.

Figure 1.3. A traffic scene in an urban area.

SLAM vs. DATMO

Figure 1.4. SLAM vs. DATMO.

Both SLAM and DATMO have been studied in isolation. However, when driving in crowded urban environments composed of stationary and moving objects, neither of them is sufficient. The simultaneous localization, mapping and moving object tracking problem aims to tackle the SLAM problem and the DATMO problem at once. Because SLAM provides more accurate pose estimates and a surrounding map, a wide variety of moving objects are detected using the surrounding map without using any predefined features or appearances, and tracking is performed reliably with accurate robot pose estimates. SLAM can be more accurate because moving objects are filtered out of the SLAM process thanks to the moving object location prediction from DATMO. SLAM and DATMO are mutually beneficial. Integrating SLAM with DATMO would satisfy both the safety and navigation demands of safe driving. It would provide a better estimate of the robot’s location and information about the dynamic environment, both of which are critical to driving assistance and autonomous driving.

Although performing SLAM and DATMO at the same time is superior to doing just one or the other, the integrated approach inherits the difficulties and issues from both the SLAM problem and the DATMO problem. Therefore, besides deriving a mathematical formulation to seamlessly integrate SLAM and DATMO, we need to answer the following questions:

• Assuming that the environment is static, can we solve the simultaneous localization and mapping problem from a ground vehicle at high speeds in very large urban environments?

• Assuming that the robot pose estimate is accurate and moving objects are correctly detected, can we solve the moving object tracking problem in crowded urban environments?

• Assuming that the SLAM problem and the DATMO problem can be solved in urban areas, is it feasible to solve the simultaneous localization, mapping and moving object tracking problem? What problems will occur when the SLAM problem and the DATMO problem are solved together?

In the following sections, we will discuss these problems from both theoretical and practical points of view.

1.2. City-Sized Simultaneous Localization and Mapping

Since Smith, Self and Cheeseman first introduced the simultaneous localization and mapping (SLAM) problem (Smith and Cheeseman, 1986; Smith et al., 1990), the SLAM problem has attracted immense attention in the mobile robotics literature. SLAM involves simultaneously estimating locations of newly perceived landmarks and the location of the robot itself while incrementally building a map. The web site of the 2002 SLAM summer school¹ provides comprehensive coverage of the key topics and state of the art in SLAM.

In this section, we address three key issues to accomplish city-sized SLAM: computational complexity, representation, and data association in the large.

Figure 1.5. City-sized SLAM. Top shows the 3D (2.5D) map of several street blocks using the algorithm addressed in (Wang et al., 2003b). Is it possible to accomplish online SLAM in a city?

Computational Complexity

In the SLAM literature, it is known that a key bottleneck of the Kalman filter solution is its computational complexity. Because it explicitly represents correlations of all pairs among the robot and stationary objects, both the computation time and memory requirement scale quadratically with the number of stationary objects in the map. This computational burden restricts applications to those in which the map can have no more than a few hundred stationary objects.
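To make the quadratic growth concrete, the following Python sketch counts the entries of a dense joint covariance matrix. It is only an illustration under stated assumptions, not the implementation used in this thesis: the robot pose is assumed to be planar (3 dimensions), each stationary object is assumed to be a 2-D point landmark, and each matrix entry is assumed to take 8 bytes. The function names are hypothetical.

```python
def ekf_slam_state_size(num_landmarks, robot_dim=3, landmark_dim=2):
    """Length of the joint state vector: robot pose plus all landmarks.

    robot_dim=3 and landmark_dim=2 are assumptions made for this example.
    """
    return robot_dim + landmark_dim * num_landmarks


def ekf_slam_covariance_entries(num_landmarks, robot_dim=3, landmark_dim=2):
    """Number of entries in the full (dense) joint covariance matrix."""
    n = ekf_slam_state_size(num_landmarks, robot_dim, landmark_dim)
    return n * n


if __name__ == "__main__":
    for m in (10, 100, 1000, 10000):
        entries = ekf_slam_covariance_entries(m)
        # Assumed storage cost: 8 bytes per double-precision entry.
        megabytes = entries * 8 / 1e6
        print(f"{m:>6} landmarks -> {entries:>12} entries (~{megabytes:.1f} MB)")
```

Under these assumptions, a tenfold increase in the number of stationary objects increases the storage by roughly a factor of one hundred.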

Recently, this problem has been subject to intense research. Approaches using approximate inference, using exact inference on tractable approximations of the true model, and using approximate inference on an approximate model have been proposed. In this dissertation, we will take advantage of these promising approaches and focus on the representation and data association issues. More details about the computational complexity issue will be addressed in Section 2.2.

¹ http://www.cas.kth.se/SLAM/

Representation

Even with an advanced algorithm to deal with computational complexity, most SLAM applications are still limited to indoor environments (Thrun, 2002) or specific environments and conditions (Guivant et al., 2000) because of significant issues in defining environment representation and identifying an appropriate methodology for fusing data in this representation (Durrant-Whyte, 2001). For instance, feature-based approaches have an elegant solution by using a Kalman filter or an information filter, but it is difficult to extract features robustly and correctly in outdoor environments. Grid-based approaches do not need to extract features, but they do not provide any direct means to estimate and propagate uncertainty and they do not scale well in very large environments.

In Chapter 3, we will address the representation related issues in detail and describe a hierarchical object based representation for overcoming the difficulties of the city-sized SLAM problem.

Data Association in the Large

Given correct data association in the large, or loop detection, SLAM can build a globally consistent map regardless of the size of the map. In order to obtain correct data association in the large, most large scale mapping systems using moving platforms (Zhao and Shibasaki, 2001; Früh and Zakhor, 2003) are equipped with expensive state estimation systems to assure the accuracy of the state estimation. In addition, independent position information from GPS or aerial photos is used to provide global constraints.

Without these aids, the accumulated error of the pose estimate and unmodelled uncertainty in the real world increase the difficulty of loop detection. For dealing with this issue without access to independent position information, our algorithm based on covariance increasing, information exploiting and ambiguity modelling will be presented in Chapter 5.

In this work, we will demonstrate that it is feasible to accomplish city-sized SLAM.


1.3. Moving Object Tracking in Crowded Urban Environments

In order to accomplish moving object tracking in crowded urban areas, three key issues have to be solved: detection, data association in the cluttered, and moving object motion modelling.

Detection

Recall that detection of ground moving objects using feature- or appearance-based approaches is infeasible because of the wide variety of targets in urban areas. In Chapter 6, consistency-based detection and moving object map based detection will be described for robustly detecting moving objects using laser scanners.
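The idea of detecting moving objects from the surrounding map alone, without object-specific features, can be sketched as follows. This is a simplified illustration and not the algorithm described in Chapter 6. It assumes a 2-D occupancy grid stored as a dictionary, a hypothetical cell size of 0.5 m, and the rule that a new range measurement falling in a cell previously observed as free space is a moving object candidate. All names are hypothetical.

```python
import math

# Assumed cell labels for this example only.
FREE, OCCUPIED, UNKNOWN = 0, 1, 2


def to_cell(point, resolution):
    """Map a 2-D point in metres to integer grid cell indices."""
    x, y = point
    return (math.floor(x / resolution), math.floor(y / resolution))


def detect_moving_candidates(grid, scan_points, resolution=0.5):
    """Return scan points that land in cells the map believes are free.

    grid maps cell indices to FREE or OCCUPIED; cells that are missing
    are treated as UNKNOWN and never produce a candidate.
    """
    candidates = []
    for point in scan_points:
        if grid.get(to_cell(point, resolution), UNKNOWN) == FREE:
            candidates.append(point)
    return candidates


if __name__ == "__main__":
    grid = {(0, 0): FREE, (1, 0): FREE, (2, 0): OCCUPIED}
    scan = [(0.7, 0.2), (1.2, 0.1), (5.0, 5.0)]
    print(detect_moving_candidates(grid, scan))
```

In this toy example only the first point is flagged: the second falls on a cell already known to be occupied by a stationary object, and the third falls in unexplored space, where nothing can be concluded.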

Cluttered Environments

Urban areas are often cluttered, as illustrated in Figure 1.3. In the tracking literature, there are a number of techniques for solving data association in the cluttered such as multiple hypothesis tracking (MHT) approaches (Reid, 1979; Cox and Hingorani, 1996) and joint probabilistic data association (JPDA) approaches (Fortmann et al., 1983; Schulz et al., 2001).

In addition to the MHT approach, we use geometric information of moving objects to aid data association in the cluttered because of the rich geometric information contained in laser scanner measurements, which will be discussed in Chapter 3 and Chapter 5.

Motion Modelling

In SLAM, we can use odometry and the identified robot motion model to predict the future location of the robot, so that the SLAM problem is an inference problem. However, in DATMO neither a priori knowledge of moving objects’ motion models nor odometry measurements about moving objects is available. In practice, motion modes of moving objects are often partially unknown and time-varying. Therefore, the motion modes of the moving object tracking have to be learned online. In other words, moving object tracking is a learning problem.

In the tracking literature, multiple model based approaches have been proposed to solve the motion modelling problem. The related approaches will be reviewed in Section 2.3.

Compared to air and marine target tracking, ground moving object tracking (Chong et al., 2000; Shea et al., 2000) is more complex because of more degrees of freedom (e.g., move-stop-move maneuvers). In Chapter 4, we will present a stationary motion model and a move-stop hypothesis tracking algorithm to tackle this issue.
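As a rough illustration of what a motion model provides, the sketch below gives the noise-free prediction step of three simple one-dimensional models: constant velocity (CV), constant acceleration (CA) and a stationary model. The state layouts, the function names and the one-dimensional simplification are assumptions made for this example; the models actually used in this work are defined in Chapter 4.

```python
def predict_cv(state, dt):
    """Constant velocity: state is (position, velocity)."""
    x, v = state
    return (x + v * dt, v)


def predict_ca(state, dt):
    """Constant acceleration: state is (position, velocity, acceleration)."""
    x, v, a = state
    return (x + v * dt + 0.5 * a * dt * dt, v + a * dt, a)


def predict_stationary(state, dt):
    """Stationary: state is (position,) and does not change with time."""
    (x,) = state
    return (x,)


if __name__ == "__main__":
    dt = 0.5  # assumed time step in seconds
    print(predict_cv((0.0, 2.0), dt))
    print(predict_ca((0.0, 2.0, 4.0), dt))
    print(predict_stationary((3.0,), dt))
```

A tracker that assumes only the CV model keeps predicting motion after an object has stopped, which is one reason a move-stop-move maneuver is hard to follow with a single model.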


Figure 1.6. Tracking difficulty vs. degrees of freedom. The more degrees of freedom a moving object has, the more difficult it is to track.

1.4. Simultaneous Localization, Mapping and Moving Object Tracking

After establishing capabilities to solve the SLAM problem and the DATMO problem in urban areas, it is feasible to solve the simultaneous localization, mapping and moving object tracking problem. Because simultaneous localization, mapping and moving object tracking is a more general process based on the integration of SLAM and moving object tracking, it inherits the complexity, data association, representation (perception modelling) and motion modelling issues from the SLAM problem and the DATMO problem. It is clear that the simultaneous localization, mapping and moving object tracking problem is not only an inference problem but also a learning problem.

In Chapter 2, we will present two approaches and derive the corresponding Bayesian formulas for solving the simultaneous localization, mapping and moving object tracking problem: one is SLAM with Generic Objects, or SLAM with GO, and the other is SLAM with DATMO.

1.5. Experimental Setup

Range sensing is essential in robotics for scene understanding. Range information can come from active or passive range sensors. Hebert (2000) presented a broad review of range sensing technologies for robotic applications. In spite of the different characteristics of these range sensing technologies, the theory presented in Chapter 2 is not limited to specific sensors as long as the sensor characteristics are properly modelled.

When using more accurate sensors, inference and learning are more practical and tractable. In order to accomplish simultaneous localization, mapping and moving object tracking from a ground vehicle at high speeds, we mainly focus on issues of using active ranging sensors. SICK scanners² are being used and studied in this work. Data sets collected from the Navlab8 testbed (see Figure 1.7) and the Navlab11 testbed (see Figure 1.8) are used to verify the derived formulas and the developed algorithms. Visual images from the omni-directional camera and the tri-camera system are only for visualization. Figure 1.9 shows a raw data set collected from the Navlab11 testbed. For the purpose of comparison, the result from our algorithms is shown in Figure 1.10 where measurements associated with moving objects are filtered out.

Figure 1.7. Left: the Navlab8 testbed. Right: the SICK PLS100 and the omni-directional camera.

Figure 1.8. Right: the Navlab11 testbed. Left: SICK LMS221, SICK LMS291 and the tri-camera system.

1.6. Thesis Statement

Performing localization, mapping and moving object tracking concurrently is superior to doing just one or the other. We will establish a mathematical framework that integrates all, and demonstrate that it is indeed feasible to accomplish simultaneous localization, mapping and moving object tracking from a ground vehicle at high speeds in crowded urban areas.

2 http://www.sickoptic.com/

Figure 1.9. Raw data from the Navlab11 testbed. This data set contains ∼36,500 scans and the travel distance is ∼5 km.

Figure 1.10. Result of SLAM with DATMO. A globally consistent map is generated and measurements associated with moving objects are filtered out.


1.7. Document Outline

The organization of this dissertation is summarized in Figure 1.11. We will describe the foundations for solving the SLAMMOT problem in Chapter 2 and the practical issues about perception modelling, motion modelling and data association in the rest of the chapters.

Figure 1.11. Thesis overview.

We begin Chapter 2 with a review of the formulations of the SLAM problem and the moving object tracking problem. We establish a mathematical framework to integrate localization, mapping and moving object tracking, which provides a solid basis for understanding and solving the whole problem. We describe two solutions: SLAM with GO, and SLAM with DATMO. SLAM with GO calculates a joint posterior over all objects (robot pose, stationary objects and moving objects). Such an approach is similar to existing SLAM algorithms, but with additional structure to allow for motion modelling of the moving objects. Unfortunately, it is computationally demanding and infeasible. Consequently, we describe SLAM with DATMO, which is feasible given reliable moving object detection.

In Chapter 3, we address perception modelling issues. We provide a comparison of the main paradigms for perception modelling in terms of uncertainty management, sensor characteristics, environment representability, data compression and loop-closing mechanism. To overcome the limitations of these representation methods and accomplish both SLAM and moving object tracking, we present the hierarchical object-based approach to integrate direct methods, grid-based methods and feature-based methods. When data is uncertain and sparse, the pose estimate from direct methods such as the iterated closest point (ICP) algorithm may not be correct and the distribution of the pose estimate may not be described properly. We describe a sampling and correlation based range image matching (SCRIM) algorithm to tackle these issues.

Theoretically, motion modelling is as important as perception modelling in Bayesian approaches. Practically, the performance of tracking strongly relates to motion modelling.

In Chapter 4, we address model selection and model complexity issues in moving object motion modelling. A stationary motion model is added to the model set and the move-stop hypothesis tracking algorithm is applied to tackle the move-stop-move or very slow target tracking problem.

In Chapter 5, three data association problems are addressed: data association in the small, data association in the cluttered and data association in the large. We derive formulas to use rich geometric information from perception modelling as well as kinematics from motion modelling for solving data association. Data association in the large, or the revisiting problem, is very difficult because of accumulated pose estimate errors, unmodelled uncertainty, occlusion, and temporary stationary objects. We will demonstrate that following three principles - covariance increasing, information exploiting and ambiguity modelling - is sufficient for robustly detecting loops in very large scale environments.

In Chapter 6, we address the implementation issues for linking foundations, perception modelling, motion modelling and data association together. We provide two practical and reliable algorithms for detecting moving objects using laser scanners. For verifying the theoretical framework and the described algorithms, we show ample results carried out with Navlab8 and Navlab11 at high speeds in crowded urban and suburban areas. We also point out the limitations of our system due to the 2-D environment assumption and sensor failures.

Finally, we conclude with a summary of this work and suggest future extensions in Chapter 7.


CHAPTER 2 Foundations

The essence of the Bayesian approach is to provide a mathematical rule explaining how you should change your existing beliefs in the light of new evidence.

– In praise of Bayes, the Economist (9/30/00)

BAYESIAN THEORY has been a solid basis for formalizing and solving many statistics, control, machine learning and computer vision problems. The simultaneous localization, mapping and moving object tracking problem involves not only accomplishing SLAM in dynamic environments but also detecting and tracking these dynamic objects. Bayesian theory also provides useful guidance for understanding and solving this problem.

SLAM and moving object tracking can both be treated as processes. SLAM assumes that the surrounding environment is static, containing only stationary objects. The inputs of the SLAM process are measurements from perception sensors such as laser scanners and cameras, and measurements from motion sensors such as odometry and inertial measurement units. The outputs of the SLAM process are robot pose and a stationary object map (see Figure 2.1.a). Given that the sensor platform is stationary or that a precise pose estimate is available, the inputs of the moving object tracking problem are perception measurements and the outputs are locations of moving objects and their motion modes (see Figure 2.1.b). The simultaneous localization, mapping and moving object tracking problem can also be treated as a process without the static environment assumption. The inputs of this process are the same as for the SLAM process, but the outputs are not only the robot pose and the map but also the locations and motion modes of the moving objects (see Figure 2.1.c).

Without considering the perception modelling and data association issues in practice, a key issue of the SLAM problem is complexity, and a key issue of the moving object tracking problem is motion modelling. Because SLAMMOT inherits the complexity issue from the SLAM problem and the motion modelling issue from the moving object tracking problem, the SLAMMOT problem is not only an inference problem but also a learning problem.

(a) the simultaneous localization and mapping (SLAM) process

(b) the moving object tracking (MOT) process

(c) the simultaneous localization, mapping and moving object tracking (SLAMMOT) process

Figure 2.1. The SLAM process, the MOT process and the SLAMMOT process. Z denotes the perception measurements, U denotes the motion measurements, x is the true robot state, M denotes the locations of the stationary objects, O denotes the states of the moving objects and S denotes the motion modes of the moving objects.

In this chapter, we first review uncertain spatial relationships which are essential to the SLAM problem, the MOT problem, and the SLAMMOT problem. We will briefly review the Bayesian formulas of the SLAM problem and the moving object tracking problem.

In addition, Dynamic Bayesian Networks (DBNs)¹ are used to show the dependencies between the variables of these problems and explain how to compute these formulas. We will present two approaches for solving the simultaneous localization, mapping and moving object tracking problem: SLAM with GO and SLAM with DATMO. For the sake of simplicity, we assume that perception modelling and data association problems are solved and both stationary objects and moving objects can be represented by point-features. The details of dealing with these issues will be addressed in the following chapters.

1 For complicated probabilistic problems, computing the Bayesian formula is often computationally intractable. Graphical models (Jordan, 2003) provide a natural tool to visualize the dependencies between the variables of the complex problems, and help simplify the Bayesian formula computations by combining simpler parts and ensuring that the system as a whole is still consistent. Dynamic Bayesian Networks (DBNs) (Murphy, 2002) are directed graphical models of stochastic processes.


2.1. Uncertain Spatial Relationships

For solving the SLAM problem, the MOT problem or the SLAMMOT problem, manipulating uncertain spatial relationships is fundamental. In this section we only intuitively review the spatial relationships for the two dimensional case with three degrees-of-freedom.

See (Smith et al., 1990) for a derivation.

Compounding

In an example in which a moving object is detected by a sonar mounted on a robot, we need to compound the uncertainty from the robot pose estimate and the uncertainty from the sonar measurement in order to correctly represent the location of this moving object and the corresponding distribution with respect to the world coordinate system.

Figure 2.2. Compounding of spatial relationships.

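Given two spatial relationships, $\mathbf{x}_{ij}$ and $\mathbf{x}_{jk}$, the formula for compounding $\mathbf{x}_{ik}$ from $\mathbf{x}_{ij}$ and $\mathbf{x}_{jk}$ is:

$$
\mathbf{x}_{ik} \triangleq \oplus(\mathbf{x}_{ij}, \mathbf{x}_{jk}) =
\begin{bmatrix}
x_{jk}\cos\theta_{ij} - y_{jk}\sin\theta_{ij} + x_{ij} \\
x_{jk}\sin\theta_{ij} + y_{jk}\cos\theta_{ij} + y_{ij} \\
\theta_{ij} + \theta_{jk}
\end{bmatrix}
\tag{2.1}
$$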

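where $\oplus$ is the compounding operator, and $\mathbf{x}_{ij}$ and $\mathbf{x}_{jk}$ are defined by:

$$
\mathbf{x}_{ij} =
\begin{bmatrix} x_{ij} \\ y_{ij} \\ \theta_{ij} \end{bmatrix},
\qquad
\mathbf{x}_{jk} =
\begin{bmatrix} x_{jk} \\ y_{jk} \\ \theta_{jk} \end{bmatrix}
$$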

Let µ be the mean and Σ be the covariance. The first-order estimate of the mean of the compounding operation is:

$$
\mu_{\mathbf{x}_{ik}} \approx \oplus(\mu_{\mathbf{x}_{ij}}, \mu_{\mathbf{x}_{jk}})
\tag{2.2}
$$

The first order estimate of the covariance is:

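$$
\Sigma_{\mathbf{x}_{ik}} \approx \nabla_{\oplus}
\begin{bmatrix}
\Sigma_{\mathbf{x}_{ij}} & \Sigma_{\mathbf{x}_{ij}\mathbf{x}_{jk}} \\
\Sigma_{\mathbf{x}_{jk}\mathbf{x}_{ij}} & \Sigma_{\mathbf{x}_{jk}}
\end{bmatrix}
\nabla_{\oplus}^{T}
\tag{2.3}
$$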


where the Jacobian of the compounding operation, ∇⊕, is defined by:

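$$
\nabla_{\oplus} \triangleq
\frac{\partial \oplus(\mathbf{x}_{ij}, \mathbf{x}_{jk})}{\partial(\mathbf{x}_{ij}, \mathbf{x}_{jk})} =
\begin{bmatrix}
1 & 0 & -(y_{ik} - y_{ij}) & \cos\theta_{ij} & -\sin\theta_{ij} & 0 \\
0 & 1 & (x_{ik} - x_{ij}) & \sin\theta_{ij} & \cos\theta_{ij} & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}
\tag{2.4}
$$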

In the case that the two relationships are independent, we can rewrite the first-order estimate of the covariance as:

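$$
\Sigma_{\mathbf{x}_{ik}} \approx
\nabla_{1\oplus} \Sigma_{\mathbf{x}_{ij}} \nabla_{1\oplus}^{T} +
\nabla_{2\oplus} \Sigma_{\mathbf{x}_{jk}} \nabla_{2\oplus}^{T}
\tag{2.5}
$$

where $\nabla_{1\oplus}$ and $\nabla_{2\oplus}$ are the left and right halves of the compounding Jacobian, that is, $\nabla_{\oplus} = [\nabla_{1\oplus} \;\; \nabla_{2\oplus}]$. The compounding relationship is also called the head-to-tail relationship in (Smith et al., 1990).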
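As an illustration, the compounding operation (2.1), the two halves of its Jacobian (2.4) and the independent-case covariance estimate can be sketched in plain Python. This is a minimal sketch, not the implementation used in this work: the function names (`compound`, `compound_jacobians`, `compound_cov`) and the list-of-lists matrix representation are assumptions made for the example, poses are taken to be `(x, y, theta)` tuples with angles in radians, and the two input relationships are assumed to be independent.

```python
import math


def compound(x_ij, x_jk):
    """Head-to-tail compounding of two (x, y, theta) relationships, eq. (2.1)."""
    x1, y1, t1 = x_ij
    x2, y2, t2 = x_jk
    c, s = math.cos(t1), math.sin(t1)
    return (x2 * c - y2 * s + x1, x2 * s + y2 * c + y1, t1 + t2)


def compound_jacobians(x_ij, x_jk):
    """Left and right 3x3 halves of the compounding Jacobian, eq. (2.4)."""
    x1, y1, t1 = x_ij
    x_ik, y_ik, _ = compound(x_ij, x_jk)
    c, s = math.cos(t1), math.sin(t1)
    j1 = [[1.0, 0.0, -(y_ik - y1)],
          [0.0, 1.0, x_ik - x1],
          [0.0, 0.0, 1.0]]
    j2 = [[c, -s, 0.0],
          [s, c, 0.0],
          [0.0, 0.0, 1.0]]
    return j1, j2


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def compound_cov(x_ij, cov_ij, x_jk, cov_jk):
    """First-order covariance of x_ik when x_ij and x_jk are independent."""
    j1, j2 = compound_jacobians(x_ij, x_jk)
    a = matmul(matmul(j1, cov_ij), transpose(j1))  # contribution of x_ij
    b = matmul(matmul(j2, cov_jk), transpose(j2))  # contribution of x_jk
    return [[a[r][q] + b[r][q] for q in range(3)] for r in range(3)]
```

For a robot at (1, 2, π/2) in the world frame that observes an object 3 units straight ahead, `compound((1.0, 2.0, math.pi / 2), (3.0, 0.0, 0.0))` places the object at approximately (1, 5, π/2) in the world frame, and `compound_cov` shows how the heading uncertainty of the robot inflates the position uncertainty of the object.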

The Inverse Relationship

Figure 2.3. The inverse relationship.

Figure 2.3 shows the inverse relationship. For example, given the robot pose in the world coordinate frame, xij, the origin of the world frame with respect to the robot frame, xji, is:

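$$
\mathbf{x}_{ji} \triangleq \ominus(\mathbf{x}_{ij}) =
\begin{bmatrix}
-x_{ij}\cos\theta_{ij} - y_{ij}\sin\theta_{ij} \\
x_{ij}\sin\theta_{ij} - y_{ij}\cos\theta_{ij} \\
-\theta_{ij}
\end{bmatrix}
\tag{2.6}
$$

where $\ominus$ is the inverse operator.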

The first-order estimate of the mean of the inverse operation is:

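$$
\mu_{\mathbf{x}_{ji}} \approx \ominus(\mu_{\mathbf{x}_{ij}})
$$

and the first-order covariance estimate is:

$$
\Sigma_{\mathbf{x}_{ji}} \approx \nabla_{\ominus} \Sigma_{\mathbf{x}_{ij}} \nabla_{\ominus}^{T}
$$

where the Jacobian for the inverse operation, $\nabla_{\ominus}$, is:

$$
\nabla_{\ominus} \triangleq \frac{\partial \mathbf{x}_{ji}}{\partial \mathbf{x}_{ij}} =
\begin{bmatrix}
-\cos\theta_{ij} & -\sin\theta_{ij} & y_{ji} \\
\sin\theta_{ij} & -\cos\theta_{ij} & -x_{ji} \\
0 & 0 & -1
\end{bmatrix}
\tag{2.7}
$$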


Figure 2.4. The tail-to-tail relationship.

The Tail-to-Tail Relationship

For local navigation or obstacle avoidance, it is more straightforward to use the locations of moving objects in the robot frame than the locations with respect to the world coordinate system. In the example of Figure 2.4, given the locations of the robot $\mathbf{x}_{ij}$ and a moving object $\mathbf{x}_{ik}$ in the world frame, we want to know the location of this moving object, $\mathbf{x}_{jk}$, and its distribution, $\Sigma_{\mathbf{x}_{jk}}$, in the robot frame, which can be calculated recursively by:

$$
\mathbf{x}_{jk} \triangleq \oplus(\ominus(\mathbf{x}_{ij}), \mathbf{x}_{ik}) = \oplus(\mathbf{x}_{ji}, \mathbf{x}_{ik})
\tag{2.8}
$$

This relationship is called the tail-to-tail relationship in (Smith et al., 1990). The first-order estimate of the mean of this tail-to-tail operation is:

$$
\mu_{\mathbf{x}_{jk}} \approx \oplus(\ominus(\mu_{\mathbf{x}_{ij}}), \mu_{\mathbf{x}_{ik}})
\tag{2.9}
$$

and the first-order covariance estimate can be computed in a similar way:

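$$
\Sigma_{\mathbf{x}_{jk}} \approx \nabla_{\oplus}
\begin{bmatrix}
\Sigma_{\mathbf{x}_{ji}} & \Sigma_{\mathbf{x}_{ji}\mathbf{x}_{ik}} \\
\Sigma_{\mathbf{x}_{ik}\mathbf{x}_{ji}} & \Sigma_{\mathbf{x}_{ik}}
\end{bmatrix}
\nabla_{\oplus}^{T}
\approx \nabla_{\oplus}
\begin{bmatrix}
\nabla_{\ominus} \Sigma_{\mathbf{x}_{ij}} \nabla_{\ominus}^{T} & \nabla_{\ominus} \Sigma_{\mathbf{x}_{ij}\mathbf{x}_{ik}} \\
\Sigma_{\mathbf{x}_{ik}\mathbf{x}_{ij}} \nabla_{\ominus}^{T} & \Sigma_{\mathbf{x}_{ik}}
\end{bmatrix}
\nabla_{\oplus}^{T}
\tag{2.10}
$$

Here the compounding Jacobian $\nabla_{\oplus}$ is evaluated at the two inputs of (2.8), $\mathbf{x}_{ji}$ and $\mathbf{x}_{ik}$. Note that this tail-to-tail operation is often used in data association and moving object tracking.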
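The inverse and tail-to-tail relationships can be sketched together, since the latter is built from the former. The following is a minimal illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, poses are `(x, y, theta)` tuples in radians, matrices are lists of rows, and `joint_cov` is assumed to be the 6x6 covariance of the stacked inputs (robot pose first, then object pose, both in the world frame). The covariance is propagated through the chain rule, that is, the compounding Jacobian evaluated at the inverted robot pose and the object pose, multiplied by a block-diagonal matrix holding the Jacobian of the inverse operation (2.7) and an identity block.

```python
import math


def compound(x_ij, x_jk):
    """Head-to-tail compounding, eq. (2.1)."""
    x1, y1, t1 = x_ij
    x2, y2, t2 = x_jk
    c, s = math.cos(t1), math.sin(t1)
    return (x2 * c - y2 * s + x1, x2 * s + y2 * c + y1, t1 + t2)


def inverse(x_ij):
    """Inverse relationship, eq. (2.6)."""
    x, y, t = x_ij
    c, s = math.cos(t), math.sin(t)
    return (-x * c - y * s, x * s - y * c, -t)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def tail_to_tail(x_ij, x_ik):
    """Pose of k expressed in frame j, eq. (2.8)."""
    return compound(inverse(x_ij), x_ik)


def tail_to_tail_cov(x_ij, x_ik, joint_cov):
    """First-order covariance of x_jk from the 6x6 covariance of (x_ij, x_ik)."""
    x_ji = inverse(x_ij)
    x_jk = compound(x_ji, x_ik)
    c, s = math.cos(x_ij[2]), math.sin(x_ij[2])
    # Jacobian of the inverse operation, eq. (2.7).
    j_inv = [[-c, -s, x_ji[1]],
             [s, -c, -x_ji[0]],
             [0.0, 0.0, -1.0]]
    cj, sj = math.cos(x_ji[2]), math.sin(x_ji[2])
    # 3x6 compounding Jacobian, eq. (2.4), evaluated at (x_ji, x_ik).
    j_comp = [[1.0, 0.0, -(x_jk[1] - x_ji[1]), cj, -sj, 0.0],
              [0.0, 1.0, x_jk[0] - x_ji[0], sj, cj, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]]
    # Block-diagonal diag(j_inv, I): maps changes in (x_ij, x_ik) to (x_ji, x_ik).
    g = [[0.0] * 6 for _ in range(6)]
    for r in range(3):
        g[r][:3] = j_inv[r]
        g[r + 3][r + 3] = 1.0
    j_total = matmul(j_comp, g)
    return matmul(matmul(j_total, joint_cov), transpose(j_total))
```

With the robot at (1, 2, π/2) and an object at (1, 5, π/2) in the world frame, `tail_to_tail` returns approximately (3, 0, 0): the object is 3 units straight ahead of the robot and shares its heading.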

Unscented Transform

As addressed above, these uncertain spatial relationships are non-linear functions and are approximated by their first-order Taylor expansions for estimating the means and the covariances of their outputs. In cases where the function is not approximately linear in the likely region of its inputs, or where the Jacobian of the function is unavailable, the unscented transform (Julier, 1999) can be used to improve the accuracy of the estimate. Wan and van der Merwe (2000) show an example of using the unscented transform technique.
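To make the idea concrete, the sketch below propagates a mean and covariance through an arbitrary function using the basic symmetric sigma-point set. It is an illustrative sketch only and not the formulation used in this work: the names `cholesky` and `unscented_transform`, the tuning parameter `kappa` and its default value, and the use of a Cholesky factor as the matrix square root are all assumptions of the example. The input covariance must be positive definite, and the plain weighted average is only reasonable for heading components when their spread is small.

```python
import math


def cholesky(a):
    """Lower-triangular factor l with l * l^T = a (a must be positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def unscented_transform(f, mean, cov, kappa=1.0):
    """Estimate the mean and covariance of f(x) for x with the given mean and cov."""
    n = len(mean)
    scale = n + kappa
    root = cholesky([[scale * v for v in row] for row in cov])
    # Sigma points: the mean itself, then the mean plus and minus each
    # column of the scaled matrix square root.
    points = [list(mean)]
    weights = [kappa / scale]
    for j in range(n):
        col = [root[i][j] for i in range(n)]
        points.append([m + d for m, d in zip(mean, col)])
        points.append([m - d for m, d in zip(mean, col)])
        weights += [0.5 / scale, 0.5 / scale]
    # Push every sigma point through the function, then recombine.
    ys = [f(p) for p in points]
    dim = len(ys[0])
    y_mean = [sum(w * y[i] for w, y in zip(weights, ys)) for i in range(dim)]
    y_cov = [[sum(w * (y[r] - y_mean[r]) * (y[q] - y_mean[q])
                  for w, y in zip(weights, ys))
              for q in range(dim)] for r in range(dim)]
    return y_mean, y_cov
```

No Jacobian is required: the function is only evaluated at the 2n + 1 sigma points. For a linear function the result coincides with the exact transformed mean and covariance, which gives a simple sanity check before applying it to the compounding or inverse operations.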

2.2. Simultaneous Localization and Mapping

In this section, we address the formulation, calculation procedures, computational complexity and practical issues of the SLAM problem.


Formulation of SLAM

The general formula for the SLAM problem can be formalized in the probabilistic form as:

$$
p(x_k, M \mid u_1, u_2, \ldots, u_k, z_0, z_1, \ldots, z_k)
\tag{2.11}
$$

where $x_k$ is the true pose of the robot at time $k$, $u_k$ is the measurement from motion sensors such as odometry and inertial sensors at time $k$, $z_k$ is the measurement from perception sensors such as laser scanners and cameras at time $k$, and $M$ is the stochastic stationary object map, which contains $l$ landmarks, $m^1, m^2, \ldots, m^l$. In addition, we define the following sets to refer to the data leading up to time $k$:

$$
Z_k \triangleq \{z_0, z_1, \ldots, z_k\}
\tag{2.12}
$$

$$
U_k \triangleq \{u_1, u_2, \ldots, u_k\}
\tag{2.13}
$$

Therefore, equation (2.11) can be rewritten as:

$$
p(x_k, M \mid U_k, Z_k)
\tag{2.14}
$$

Using Bayes’ rule and assumptions that the vehicle motion model is Markov and the environment is static, the general recursive Bayesian formula for SLAM can be derived and expressed as: (See (Thrun, 2002; Majumder et al., 2002) for more details.)

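$$
\underbrace{p(x_k, M \mid Z_k, U_k)}_{\text{Posterior at } k}
\propto
\underbrace{p(z_k \mid x_k, M)}_{\text{Update}}
\underbrace{\int p(x_k \mid x_{k-1}, u_k)\,
\underbrace{p(x_{k-1}, M \mid Z_{k-1}, U_{k-1})}_{\text{Posterior at } k-1}\,
dx_{k-1}}_{\text{Prediction}}
\tag{2.15}
$$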

where $p(x_{k-1}, M \mid Z_{k-1}, U_{k-1})$ is the posterior probability at time $k-1$, $p(x_k, M \mid Z_k, U_k)$ is the posterior probability at time $k$, $p(x_k \mid x_{k-1}, u_k)$ is the motion model, and $p(z_k \mid x_k, M)$ is the update stage, which can be interpreted as the perception model.
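The structure of this recursion can be illustrated with a deliberately small discrete example in which the integral becomes a sum. The sketch below is a toy under stated assumptions, not the estimator developed in this work: the robot moves along a one-dimensional corridor of cells, the map consists of a single landmark cell, the belief is a table `belief[x][m]` over robot cell and landmark cell, the measurement is the offset from robot to landmark, and the hypothetical dictionaries `motion_noise` and `meas_noise` map an integer error to its probability.

```python
def predict(belief, u, motion_noise):
    """Prediction: sum over x_{k-1} of p(x_k | x_{k-1}, u_k) * belief[x_{k-1}][m].

    Probability mass that would leave the corridor is dropped; the update
    step renormalises, in line with the proportionality in the recursion.
    """
    n = len(belief)
    out = [[0.0] * len(belief[0]) for _ in range(n)]
    for x_prev in range(n):
        for err, p in motion_noise.items():
            x = x_prev + u + err
            if 0 <= x < n:
                for m, b in enumerate(belief[x_prev]):
                    out[x][m] += p * b
    return out


def update(belief, z, meas_noise):
    """Update: multiply by p(z_k | x_k, M) and renormalise."""
    out = [[b * meas_noise.get(z - (m - x), 0.0) for m, b in enumerate(row)]
           for x, row in enumerate(belief)]
    total = sum(sum(row) for row in out)
    if total == 0.0:
        raise ValueError("measurement has zero probability under the belief")
    return [[b / total for b in row] for row in out]
```

Starting from a belief in which the robot cell is known and the landmark cell is uniform, one call to `predict` followed by one call to `update` yields the joint posterior over robot and landmark cells. Note that the landmark index is untouched by `predict`, which reflects the static environment assumption.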

Calculation Procedures

Equation 2.15 only explains the computation procedures in each time step but does not address the dependency structure of the SLAM problem. Figure 2.5 shows a Dynamic Bayesian Network of the SLAM problem of duration three, which can be used to visualize the dependencies between the robot and stationary objects in the SLAM problem. In this section, we describe the Kalman filter-based solution of Equation 2.15 with visualization aid from Dynamic Bayesian Networks (Paskin, 2003). The EKF-based framework described in this section is identical to that used in (Smith and Cheeseman, 1986; Smith et al., 1990; Leonard and Durrant-Whyte, 1991).

Figure 2.5. A Dynamic Bayesian Network (DBN) of the SLAM problem of duration three. It shows the dependencies among the motion measurements, the robot, the perception measurements and the stationary objects. In this example, there are two stationary objects, $m^1$ and $m^2$. Clear circles denote hidden continuous nodes and shaded circles denote observed continuous nodes. The edges from stationary objects to measurements are determined by data association. We will walk through this in the next pages.

Stage 1: Initialization. Figure 2.6 shows the initialization stage, or adding new stationary objects stage. Although the distributions are shown by ellipses in these figures, the Bayesian formula does not assume that the estimations are Gaussian distributions. In this example, two new stationary objects are detected and added to the map. The state x^S_k of the whole system now is:

$$
x^S_k =
\begin{bmatrix} x_k \\ m^1 \\ m^2 \end{bmatrix}
\tag{2.16}
$$

Let the perception model, $p(z_k \mid x_k, M)$, be described as:

$$
z_k = h(x_k, M) + w_k
\tag{2.17}
$$

where $h$ is the vector-valued perception model and $w_k \sim N(0, R_k)$ is the perception error, an uncorrelated zero-mean Gaussian noise sequence with covariance, $R_k$. Because the $z_k$ are the locations of the stationary objects $M$ with respect to the robot coordinate system, the perception model $h$ is simply the tail-to-tail relationship of the robot and the map. Let the perception sensor return the mean location, $\hat{z}_0^1$, and variance, $R_0^1$, of the stationary object $m^1$ and $\hat{z}_0^2$ and $R_0^2$ of $m^2$. To add these measurements to the map, these measurements are compounded with the robot state estimate and its distribution because these measurements are with respect to the robot coordinate system. Therefore, the mean and covariance of the whole system can be computed as in:

$$
\mu_{x^S_0} =
\begin{bmatrix}
\mu_{x_0} \\
\oplus(\mu_{x_0}, \hat{z}_0^1) \\
\oplus(\mu_{x_0}, \hat{z}_0^2)
\end{bmatrix}
\tag{2.18}
$$

$$
\Sigma_{x^S_0} =
\begin{bmatrix}
\Sigma_{x_0 x_0} & \Sigma_{x_0 m^1} & \Sigma_{x_0 m^2} \\
\Sigma^T_{x_0 m^1} & \Sigma_{m^1 m^1} & \Sigma_{m^1 m^2} \\
\Sigma^T_{x_0 m^2} & \Sigma^T_{m^1 m^2} & \Sigma_{m^2 m^2}
\end{bmatrix}
=
\begin{bmatrix}
\Sigma_{x_0 x_0} & \Sigma_{x_0 x_0} \nabla_{1\oplus}^T & \Sigma_{x_0 x_0} \nabla_{1\oplus}^T \\
\nabla_{1\oplus} \Sigma_{x_0 x_0} & \nabla_{1\oplus} \Sigma_{x_0 x_0} \nabla_{1\oplus}^T + \nabla_{2\oplus} R_0^1 \nabla_{2\oplus}^T & 0 \\
\nabla_{1\oplus} \Sigma_{x_0 x_0} & 0 & \nabla_{1\oplus} \Sigma_{x_0 x_0} \nabla_{1\oplus}^T + \nabla_{2\oplus} R_0^2 \nabla_{2\oplus}^T
\end{bmatrix}
\tag{2.19}
$$

Figure 2.6. The initialization stage of SLAM. Solid squares denote stationary objects and black solid circle denotes the robot. Distributions are shown by ellipses.

Figure 2.7. A DBN representing the initialization stage of SLAM. After this stage, the undirected graphical model is produced in which two stationary objects and the robot state are directly dependent.

This stage is shown as p(xk−1, M | Zk−1, Uk−1) in equation (2.15). Figure 2.7 shows a DBN representing the initialization stage, or the adding new stationary objects stage, in which the undirected graphical model is produced by moralizing² the directed graphical model. The observed nodes are eliminated to produce the final graphical model which shows that two stationary objects and the robot state are directly dependent.
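The bookkeeping of the initialization stage can be sketched for a single newly observed stationary object. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `add_landmark` is hypothetical, the landmark is treated as a three degree-of-freedom `(x, y, theta)` relationship so that the compounding operation can be reused, matrices are lists of rows, and only the robot block, the landmark block and their cross-covariance are built.

```python
import math


def compound(x_ij, x_jk):
    """Head-to-tail compounding of two (x, y, theta) relationships."""
    x1, y1, t1 = x_ij
    x2, y2, t2 = x_jk
    c, s = math.cos(t1), math.sin(t1)
    return (x2 * c - y2 * s + x1, x2 * s + y2 * c + y1, t1 + t2)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def add_landmark(mu_x, cov_x, z, r):
    """Stack a robot estimate and one new landmark into a joint estimate.

    mu_x, cov_x: robot mean (3,) and covariance (3x3) in the world frame.
    z, r: measurement mean (3,) and covariance (3x3) in the robot frame.
    Returns the 6-element mean and the 6x6 covariance of (robot, landmark).
    """
    mu_m = compound(mu_x, z)  # landmark mean in the world frame
    c, s = math.cos(mu_x[2]), math.sin(mu_x[2])
    # Left and right halves of the compounding Jacobian at (mu_x, z).
    j1 = [[1.0, 0.0, -(mu_m[1] - mu_x[1])],
          [0.0, 1.0, mu_m[0] - mu_x[0]],
          [0.0, 0.0, 1.0]]
    j2 = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    cross = matmul(j1, cov_x)  # landmark-robot cross-covariance
    a = matmul(cross, transpose(j1))
    b = matmul(matmul(j2, r), transpose(j2))
    cov_m = [[a[i][j] + b[i][j] for j in range(3)] for i in range(3)]
    cross_t = transpose(cross)  # robot-landmark cross-covariance
    top = [list(cov_x[i]) + cross_t[i] for i in range(3)]
    bottom = [cross[i] + cov_m[i] for i in range(3)]
    return list(mu_x) + list(mu_m), top + bottom
```

The non-zero off-diagonal blocks of the returned covariance are what make the robot state and the new stationary object directly dependent after this stage.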

Stage 2: Prediction. In Figure 2.8, the robot moves and gets a motion measurement $u_1$ from odometry or inertial sensors. Let the robot motion model, $p(x_k \mid x_{k-1}, u_k)$, be

2 In the graphical model literature, moralizing means adding links between unmarried parents who share a common child.
