Simultaneous Localization, Mapping and Moving Object Tracking
Research

The International Journal of Robotics Research, Vol. 26, No. 9, September 2007, pp. 889–916
DOI: 10.1177/0278364907081229

Chieh-Chih Wang, Charles Thorpe, Sebastian Thrun, Martial Hebert and Hugh Durrant-Whyte

Simultaneous Localization, Mapping and Moving Object Tracking

The online version of this article can be found at: http://ijr.sagepub.com/cgi/content/abstract/26/9/889

Published by SAGE Publications (http://www.sagepublications.com) on behalf of Multimedia Archives.


Chieh-Chih Wang
Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei 106, Taiwan
bobwang@ntu.edu.tw

Charles Thorpe
Qatar Campus, Carnegie Mellon University, Pittsburgh, PA 15289, USA
cet@qatar.cmu.edu

Sebastian Thrun
The AI Group, Stanford University, Stanford, CA 94305, USA
thrun@stanford.edu

Martial Hebert
The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
hebert@ri.cmu.edu

Hugh Durrant-Whyte
The ARC Centre of Excellence for Autonomous Systems, The University of Sydney, NSW 2006, Australia
hugh@acfr.usyd.edu.au

Simultaneous Localization, Mapping and Moving Object Tracking

Abstract

Simultaneous localization, mapping and moving object tracking (SLAMMOT) involves both simultaneous localization and mapping (SLAM) in dynamic environments and detecting and tracking these dynamic objects. In this paper, a mathematical framework is established to integrate SLAM and moving object tracking. Two solutions are described: SLAM with generalized objects, and SLAM with detection and tracking of moving objects (DATMO). SLAM with generalized objects calculates a joint posterior over all generalized objects and the robot. Such an approach is similar to existing SLAM algorithms, but with additional structure to allow for motion modeling of generalized objects. Unfortunately, it is computationally demanding and generally infeasible. SLAM with DATMO decomposes the estimation problem into two separate estimators. By maintaining separate posteriors for stationary objects and moving objects, the resulting estimation problems are much lower dimensional than SLAM with generalized objects. Both SLAM and moving object tracking from a moving vehicle in crowded urban areas are daunting tasks. Based on the SLAM with DATMO framework, practical algorithms are proposed which deal with issues of perception modeling, data association, and moving object detection. The implementation of SLAM with DATMO was demonstrated using data collected from the CMU Navlab11 vehicle at high speeds in crowded urban environments. Ample experimental results show the feasibility of the proposed theory and algorithms.

© SAGE Publications 2007, Los Angeles, London, New Delhi and Singapore. Figures 1, 2, 16, 18, 21, 23–25, 27–31, 33, 35–37 appear in color online: http://ijr.sagepub.com


KEY WORDS—mobile robotics, localization, mapping, tracking, detection, robotic perception

1. Introduction

Establishing the spatial and temporal relationships among a robot, stationary objects and moving objects in a scene serves as a basis for scene understanding. Localization is the process of establishing the spatial relationships between the robot and stationary objects, mapping is the process of establishing the spatial relationships among stationary objects, and moving object tracking is the process of establishing the spatial and temporal relationships between moving objects and the robot or between moving and stationary objects. Localization, mapping and moving object tracking are difficult because of uncertainty and unobservable states in the real world. Perception sensors such as cameras, radar and laser range finders, and motion sensors such as odometry and inertial measurement units are noisy. The intentions, or control inputs, of the moving objects are unobservable without using extra sensors mounted on the moving objects.

Over the last decade, the simultaneous localization and mapping (SLAM) problem has attracted immense attention in the mobile robotics and artificial intelligence literature (Smith and Cheeseman, 1986; Thrun, 2002). SLAM involves simultaneously estimating locations of newly perceived landmarks and the location of the robot itself while incrementally building a map. The moving object tracking problem has also been extensively studied for several decades (Bar-Shalom and Li, 1988; Blackman and Popoli, 1999). Moving object tracking involves both state inference and motion model learning. In most applications, SLAM and moving object tracking are considered in isolation. In the SLAM problem, information associated with stationary objects is positive; moving objects are negative, which degrades the performance. Conversely, measurements belonging to moving objects are positive in the moving object tracking problem; stationary objects are considered background and filtered out. In Wang and Thorpe (2002), we pointed out that SLAM and moving object tracking are mutually beneficial. Both stationary objects and moving objects are positive information to scene understanding. In Wang et al. (2003b), we established a mathematical framework to integrate SLAM and moving object tracking, which provides a solid basis for understanding and solving the whole problem, simultaneous localization, mapping and moving object tracking, or SLAMMOT.

It is believed by many that a solution to the SLAM problem will open up a vast range of potential applications for autonomous robots (Thorpe and Durrant-Whyte, 2001; Christensen, 2002). We believe that a solution to the SLAMMOT problem will expand the potential for robotic applications still further, especially in applications which are in close proximity to human beings. Robots will be able to work not only for

Fig. 1. Robotics for safe driving. Localization, mapping, and moving object tracking are critical to driving assistance and autonomous driving.

people but also with people. Figure 1 illustrates a commercial application, safe driving, which motivates the work in this paper.

To improve driving safety and prevent traffic injuries caused by human factors such as speeding and distraction, methods for understanding the surroundings of the vehicle are critical. We believe that being able to detect and track every stationary object and every moving object, to reason about the dynamic traffic scene, to detect and predict critical situations, and to warn and assist drivers in advance, is essential to prevent these kinds of accidents.

To detect and track moving objects using sensors mounted on a moving ground vehicle at high speeds, a precise localization system is essential. It is known that GPS and DGPS often fail in urban areas because of "urban canyon" effects, and good inertial measurement units (IMUs) are very expensive.

If we have a stationary object map in advance, map-based localization techniques (Olson, 2000; Fox et al., 1999; Dellaert et al., 1999) can be used to increase the accuracy of the pose estimate. Unfortunately, it is difficult to build a usable stationary object map because of temporary stationary objects such as parked cars. Stationary object maps of the same scene built at different times could still be different, which means that online map building is required to update the current stationary object map.

SLAM allows robots to operate in an unknown environment, to incrementally build a map of this environment and to simultaneously use this map to localize the robots themselves. However, we have observed (Wang and Thorpe, 2002) that SLAM can perform badly in crowded urban environments because the static environment assumption may be violated; moving objects have to be detected and filtered out.


Even with precise localization, it is not easy to solve the moving object tracking problem in crowded urban environments because of the wide variety of targets (Wang et al., 2003a). When cameras are used to detect moving objects, appearance-based approaches are widely used, and objects can be detected whether they are moving or not. If laser scanners are used, feature-based approaches are usually the preferred solution. Both appearance-based and feature-based methods rely on prior knowledge of the targets. In urban areas, there are many kinds of moving objects such as pedestrians, bicycles, motorcycles, cars, buses, trucks and trailers. Velocities range from under 5 mph (such as a pedestrian) to 50 mph. When using laser scanners, the features of moving objects can change significantly from scan to scan. As a result, it is often difficult to define features or appearances for detecting specific objects.

Both SLAM and moving object tracking have been solved and implemented successfully in isolation. However, when driving in crowded urban environments composed of stationary and moving objects, neither of them is sufficient in isolation. The SLAMMOT problem aims to tackle the SLAM problem and the moving object tracking problem concurrently. SLAM provides more accurate pose estimates together with a surrounding map. Moving objects can be detected using the surrounding map without recourse to predefined features or appearances. Tracking may then be performed reliably with accurate robot pose estimates. SLAM can be more accurate because moving objects are filtered out of the SLAM process thanks to the moving object location prediction. SLAM and moving object tracking are mutually beneficial. Integrating SLAM and moving object tracking would satisfy both the safety and navigation demands of safe driving. It would provide a better estimate of the robot's location and information about the dynamic environment, which are critical to driving assistance and autonomous driving.

In this paper we first establish the mathematical framework for performing SLAMMOT. We describe two algorithms: SLAM with generalized objects, and SLAM with detection and tracking of moving objects (DATMO). SLAM with DATMO decomposes the estimation problem of SLAMMOT into two separate estimators. By maintaining separate posteriors for stationary objects and moving objects, the resulting estimation problems are much lower dimensional than SLAM with generalized objects. This makes it feasible to update both filters in real time. In this paper, SLAM with DATMO is applied and implemented.

There are significant practical issues to be considered in bridging the gap between the presented theory and its applications to real problems such as driving safely at high speeds in crowded urban areas. These issues arise from a number of implicit assumptions in perception modeling and data association. When more accurate sensors are used, these problems are easier to solve, and inference and learning in the SLAM with DATMO problem become more practical and tractable.

Fig. 2. Right: the Navlab11 testbed. Left: SICK LMS221, SICK LMS291 and the tri-camera system.

Therefore, we mainly focus on issues of using active ranging sensors. SICK laser scanners (see Figure 2) are used and studied in this work. Data sets (Wang et al., 2004) collected from the Navlab11 testbed are used to verify the theory and algorithms. Visual images from an omni-directional camera and a tri-camera system are only used for visualization. Sensors carrying global localization information such as GPS and DGPS are not used.

The remainder of this paper is organized as follows. In Section 2, related research is addressed. The formulations and algorithms of SLAM and moving object tracking are briefly reviewed in Section 3. In Section 4, we introduce SLAM with generalized objects and SLAM with DATMO and discuss some important issues such as motion mode learning, computational complexity and interaction. The proposed algorithms, which deal with issues of perception modeling, data association and moving object detection, are described in Section 5, Section 6 and Section 7, respectively. Experimental results which demonstrate the feasibility of the proposed theory and algorithms are in Section 8. Finally, the conclusion and future work are in Section 9.

2. Related Work

The SLAMMOT problem is directly related to a rich body of the literature on SLAM and tracking.

Hähnel et al. (2002) presented an online algorithm to incorporate people tracking into the mapping process with a mobile robot in populated environments. The local minima in the scans are used as the features for people detection. Sampling-based joint probabilistic data association filters are used for tracking people. A hill climbing strategy is used for scan alignment. Their system performs well in indoor environments populated with moving people. However, feature-based detection may fail in environments where a wide variety of moving objects exist. In Hähnel et al. (2003), the EM algorithm is used for segmenting stationary and moving objects without defining features. The technique was tested with


data collected in indoor and outdoor environments. However, this is an off-line algorithm which is not suitable for real-time applications.

The approach in Biswas et al. (2002) and Anguelov et al. (2002) uses simple differencing to segment temporary-stationary objects, and then learns their shape models and identifies classes of these objects using a modified version of the EM algorithm. In Anguelov et al. (2004), a predefined probabilistic model consisting of visual features (shape and color) and behavioral features (its motion model) is learned using the EM algorithm and is used for recognizing door objects in corridor environments. Although recognition could improve the performance of SLAMMOT and provide higher level scene understanding, these off-line algorithms are not feasible for real-time applications, and urban areas contain richer and more complicated objects for which recognition is still a hard problem both theoretically and practically.

Wolf and Sukhatme (2005) proposed to use two modified occupancy grid maps to classify static and moving objects, which is similar to our consistency-based moving object detection algorithm (Wang and Thorpe, 2002). A third map containing static corner features is used for localization. Without dealing with the moving object motion modeling issues and without moving object pose prediction capability, their approach would be less robust than the approach proposed in this paper. Montesano et al. (2005) integrated the SLAMMOT and planning processes to improve robot navigation in dynamic environments, as well as extending the SLAM with DATMO algorithm to jointly solve the moving and stationary object classification problem in an indoor environment. In this paper, the proposed theory of SLAM with generalized objects and SLAM with DATMO addresses more related issues such as interaction, and the proposed algorithms are demonstrated from a ground vehicle at high speeds in crowded urban environments.
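The consistency-based detection idea referred to above can be illustrated with a short sketch: a new range return that lands in a grid cell previously observed to be free is inconsistent with the static map and is flagged as a moving object candidate. The grid resolution, occupancy values and threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hedged sketch of consistency-based moving object detection: a hit in a
# cell previously observed as free indicates a moving object candidate.
FREE, UNKNOWN, OCCUPIED = 0.2, 0.5, 0.8      # illustrative occupancy values

grid = np.full((50, 50), UNKNOWN)            # prior occupancy map
grid[10:20, 10:20] = FREE                    # area previously observed free
grid[30, 30] = OCCUPIED                      # a known stationary object

def classify_hits(grid, hits, free_thresh=0.3):
    """Split current-scan cells into moving candidates vs. static/new."""
    moving, static = [], []
    for (r, c) in hits:
        if grid[r, c] < free_thresh:         # inconsistency: hit in free space
            moving.append((r, c))
        else:                                # consistent, or previously unseen
            static.append((r, c))
    return moving, static

hits = [(15, 15), (30, 30), (40, 40)]        # current scan, in grid frame
moving, static = classify_hits(grid, hits)
# (15, 15) is flagged as moving; the other two are treated as static.
```

A real implementation would transform each scan into the map frame using the SLAM pose estimate before testing consistency; this sketch assumes that registration has already happened.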

The SLAMMOT problem is also closely related to the computer vision literature. Demirdjian and Horaud (2000) addressed the problem of segmenting the observed scene into static and moving objects from a moving stereo rig. They apply a robust method, random sample consensus (RANSAC), to filter out the moving objects and outliers. Although the RANSAC method can tolerate up to 50% outliers, the percentage of moving objects is often higher than that of stationary objects, and degeneracies exist in our applications. With measurements from motion sensors, stationary and moving object maps, and precise localization, our moving object detectors perform reliably in real time.

Recently, the problem of recovering non-rigid shape and motion of dynamic scenes from a moving camera has attracted immense attention in the computer vision literature. Ideas based on factorization techniques and the shape basis representation are presented in Bregler et al. (2000), Brand (2001), Torresani et al. (2001) and Xiao et al. (2004), where different constraints are used for finding the solution. Theoretically, all of these batch approaches are inappropriate to use

Fig. 3. The SLAM process, the MOT process and the SLAMMOT process. Z denotes the perception measurements, U denotes the motion measurements, x is the true robot state, M denotes the locations of the stationary objects, O denotes the states of the moving objects and S denotes the motion modes of the moving objects.

in real time. Practically, these methods are computationally demanding, and many difficulties such as occlusions, motion blur and lighting conditions remain to be addressed.

3. Foundations

SLAM assumes that the surrounding environment is static, containing only stationary objects. The inputs of the SLAM process are measurements from perception sensors such as laser scanners and cameras, and measurements from motion sensors such as odometry and inertial measurement units. The outputs of the SLAM process are the robot pose and a stationary object map (see Figure 3(a)). Given that the sensor platform is stationary or that a precise pose estimate is available, the inputs of the moving object tracking problem are perception measurements and the outputs are the locations of moving objects and their motion modes (see Figure 3(b)). The SLAMMOT problem can also be treated as a process without the static environment assumption. The inputs of this process are the same as for the SLAM process, but the outputs are both the robot pose and map, together with the locations and motion modes of the moving objects (see Figure 3(c)).

Leaving aside perception and data association, the key issue in SLAM is the computational complexity of updating and maintaining the map, and the key issue of the moving object tracking problem is the computational complexity of motion modelling. As SLAMMOT inherits the complexity issue from SLAM and the motion modelling issue from moving object tracking, the SLAMMOT problem is both an inference problem and a learning problem.


In this section we briefly introduce SLAM and moving object tracking.

3.1. Notation

Let $k$ denote the discrete time index, $u_k$ the vector describing a motion measurement from time $k-1$ to time $k$, $z_k$ a measurement from perception sensors such as laser scanners at time $k$, $x_k$ the state vector describing the true pose of the robot at time $k$, and $M_k$ the map containing $l$ landmarks, $m_1, m_2, \ldots, m_l$, at time $k$. In addition, we define the following sets:

$$X_k = \{x_0, x_1, \ldots, x_k\} \quad (1)$$

$$Z_k = \{z_0, z_1, \ldots, z_k\} \quad (2)$$

$$U_k = \{u_1, u_2, \ldots, u_k\}. \quad (3)$$

3.2. Simultaneous Localization and Mapping

The SLAM problem is to determine the robot pose $x_k$ and the stationary object map $M_k$ given perception measurements $Z_k$ and motion measurements $U_k$.

The formula for sequential SLAM can be expressed as

$$p(x_k, M_k \mid U_k, Z_k). \quad (4)$$

Using Bayes' rule and the assumptions that the vehicle motion model is Markov and the environment is static, the general recursive Bayesian formula for SLAM can be derived and expressed as (see Thrun, 2002 and Durrant-Whyte et al., 2003 for more details)

$$\underbrace{p(x_k, M_k \mid Z_k, U_k)}_{\text{Posterior at } k} \propto \underbrace{p(z_k \mid x_k, M_k)}_{\text{Perception model}} \; \underbrace{\int p(x_k \mid x_{k-1}, u_k) \underbrace{p(x_{k-1}, M_{k-1} \mid Z_{k-1}, U_{k-1})}_{\text{Posterior at } k-1} \, \mathrm{d}x_{k-1}}_{\text{Prediction}} \quad (5)$$

where $p(x_{k-1}, M_{k-1} \mid Z_{k-1}, U_{k-1})$ is the posterior probability at time $k-1$, $p(x_k, M_k \mid Z_k, U_k)$ is the posterior probability at time $k$, $p(x_k \mid x_{k-1}, u_k)$ is the motion model, and $p(z_k \mid x_k, M_k)$ is the perception model. The motion model is calculated according to robot kinematics/dynamics. The perception model can be represented in different ways, such as features/landmarks and occupancy grids. Equation (5) explains the computation procedure at each time step. Figure 4 shows a dynamic Bayesian network (DBN) for SLAM over three time steps, which can be used to visualize the dependencies between the robot and stationary objects (Paskin, 2003).

Fig. 4. A dynamic Bayesian network (DBN) of the SLAM problem of duration three. It shows the dependencies among the motion measurements, the robot, the perception measurements and the stationary objects. In this example, there are two stationary objects, $m_1$ and $m_2$. Clear circles denote hidden continuous nodes and shaded circles denote observed continuous nodes. The edges from stationary objects to measurements are determined by data association.

The extended Kalman filter (EKF)-based solution (Smith and Cheeseman, 1986; Smith et al., 1990; Leonard and Durrant-Whyte, 1991) to the SLAM problem is elegant, but computationally complex. Approaches using approximate inference, exact inference on tractable approximations of the true model, and approximate inference on an approximate model have been proposed (Paskin, 2003; Thrun et al., 2002; Bosse et al., 2003; Guivant and Nebot, 2001; Leonard and Feder, 1999; Montemerlo, 2003). Paskin (2003) includes an excellent comparison of these techniques.
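The source of the computational complexity noted above can be seen in a minimal Kalman-filter SLAM skeleton: the joint state stacks the robot pose with every landmark, and each measurement update touches the full covariance matrix. Linear motion and observation models are assumed here purely for brevity (a real EKF linearizes nonlinear models).

```python
import numpy as np

# Hedged sketch of Kalman-filter SLAM bookkeeping: the O(n^2) covariance
# update is the bottleneck discussed in the text. Models are assumed linear.

def kf_slam_predict(x, P, u, Q):
    # Only the robot block (first 2 entries) moves; landmarks are static.
    x = x.copy()
    x[:2] += u
    P = P.copy()
    P[:2, :2] += Q                       # motion noise enters the robot block
    return x, P

def kf_slam_update(x, P, z, idx, R):
    # Relative observation of landmark `idx`: z = landmark - robot + noise.
    n = len(x)
    H = np.zeros((2, n))
    H[:, :2] = -np.eye(2)
    H[:, 2 + 2 * idx: 4 + 2 * idx] = np.eye(2)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)       # gain is n x 2: every state corrects
    x = x + K @ (z - H @ x)
    P = (np.eye(n) - K @ H) @ P          # O(n^2) covariance update
    return x, P

# Robot at the origin plus two landmarks -> 6-dimensional joint state.
x = np.array([0., 0., 5., 0., 0., 5.])
P = np.eye(6)
x, P = kf_slam_predict(x, P, u=np.array([1., 0.]), Q=0.1 * np.eye(2))
x, P = kf_slam_update(x, P, z=np.array([4., 0.]), idx=0, R=0.1 * np.eye(2))
```

Because `K` has one row per state, correlating the robot with all landmarks, the update cost grows quadratically with the number of landmarks, which is exactly the restriction to a few hundred stationary objects mentioned later in the paper.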

3.3. Moving Object Tracking

Just as with SLAM, moving object tracking can be formulated with Bayesian approaches such as Kalman filtering. Moving object tracking is generally easier than SLAM since only the moving object pose is maintained and updated. However, as motion models of moving objects are often time-varying and not known with accuracy, moving object tracking is more difficult than SLAM in terms of online motion model learning.

The general recursive probabilistic formula for moving object tracking can be expressed as

$$p(o_k, s_k \mid Z_k) \quad (6)$$

where $o_k$ is the true state of a moving object at time $k$, $s_k$ is the true motion mode of the moving object at time $k$, and $Z_k$ is the perception measurement set leading up to time $k$. The robot (sensor platform) is assumed to be stationary for the sake of simplicity.


Fig. 5. A DBN for multiple model based moving object tracking. Clear circles denote hidden continuous nodes, clear squares denote hidden discrete nodes and shaded circles denote observed continuous nodes.

Using Bayes' rule, Equation (6) can be rewritten as

$$p(o_k, s_k \mid Z_k) = \underbrace{p(o_k \mid s_k, Z_k)}_{\text{State inference}} \; \underbrace{p(s_k \mid Z_k)}_{\text{Mode learning}} \quad (7)$$

which indicates that the moving object tracking problem can be solved in two stages: the first stage is the mode learning stage, $p(s_k \mid Z_k)$, and the second stage is the state inference stage, $p(o_k \mid s_k, Z_k)$.

Without a priori information, online mode learning of time-series data is a daunting task. In practice, the motion mode of moving objects can be approximately composed of several motion models such as the constant velocity model, the constant acceleration model and the turning model. Therefore the mode learning problem can be simplified to a model selection problem. Figure 5 shows a DBN for multiple model-based moving object tracking.

Multiple model-based moving object tracking is still difficult because the motion mode of moving objects can be time-varying. The only way to avoid the exponentially increasing number of histories is to use approximate and suboptimal approaches, which merge or reduce the number of mode history hypotheses in order to make computation tractable.

Generalized pseudo-Bayesian (GPB) approaches (Tugnait, 1982) apply a simple suboptimal technique which keeps the histories of the target mode with the largest probabilities, discards the rest, and renormalizes the probabilities. In the GPB approach of first order (GPB1), the state estimate at time $k$ is computed under each possible current model. At the end of each cycle, the $r$ motion mode hypotheses are merged into a single hypothesis. The GPB1 approach uses $r$ filters to produce one state estimate. In the GPB approach of second order (GPB2), the state estimate is computed under each possible model at the current time $k$ and the previous time $k-1$. There are $r$ estimates and covariances at time $k-1$. Each is predicted to time $k$ and updated at time $k$ under $r$ hypotheses. After the update stage, the $r^2$ hypotheses are merged into $r$ at the end of each estimation cycle. The GPB2 approach uses $r^2$ filters to produce $r$ state estimates.

In the interacting multiple model (IMM) approach (Blom and Bar-Shalom, 1988), the state estimate at time $k$ is computed under each possible current model using $r$ filters, and each filter uses a suitable mixing of the previous model-conditioned estimates as the initial condition. It has been shown that the IMM approach performs significantly better than the GPB1 algorithm and almost as well as the GPB2 algorithm in practice. Instead of using $r^2$ filters to produce $r$ state estimates as in GPB2, the IMM uses only $r$ filters to produce $r$ state estimates.

In both the GPB and IMM approaches, it is assumed that a model set is given or selected in advance, and tracking is performed based on model averaging over this model set. The performance of moving object tracking strongly depends on the selected motion models. Given the same data set, the tracking results differ according to the selected motion models.
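The IMM cycle described above (mixing, $r$ model-conditioned filters, mode probability update, combination) can be sketched for $r = 2$ models. Both models here are linear constant-velocity Kalman filters differing only in process noise, a common low/high-maneuver pair; the mode transition matrix and all noise levels are illustrative assumptions.

```python
import numpy as np

# Hedged one-cycle IMM sketch with r = 2 linear models (assumed parameters).
dt = 1.0
F = np.array([[1, dt], [0, 1]])              # state: [position, velocity]
H = np.array([[1, 0]])
R = np.array([[0.5]])
Qs = [0.01 * np.eye(2), 1.0 * np.eye(2)]     # low- vs high-maneuver noise
PI = np.array([[0.95, 0.05], [0.05, 0.95]])  # mode transition probabilities

def kf_step(x, P, z, Q):
    x, P = F @ x, F @ P @ F.T + Q                     # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    innov = z - H @ x
    x, P = x + K @ innov, (np.eye(2) - K @ H) @ P     # update
    lik = np.exp(-0.5 * innov @ np.linalg.inv(S) @ innov) / np.sqrt(
        2 * np.pi * np.linalg.det(S))                 # Gaussian likelihood
    return x, P, float(lik)

def imm_cycle(xs, Ps, mu, z):
    # 1) Mixing: each filter starts from a probability-weighted blend.
    c = PI.T @ mu                                     # predicted mode probs
    x_mix, P_mix = [], []
    for j in range(2):
        w = PI[:, j] * mu / c[j]
        xm = sum(w[i] * xs[i] for i in range(2))
        Pm = sum(w[i] * (Ps[i] + np.outer(xs[i] - xm, xs[i] - xm))
                 for i in range(2))
        x_mix.append(xm); P_mix.append(Pm)
    # 2) Model-conditioned filtering (r filters, unlike GPB2's r^2).
    out = [kf_step(x_mix[j], P_mix[j], z, Qs[j]) for j in range(2)]
    # 3) Mode probability update from the filter likelihoods.
    mu_new = np.array([o[2] for o in out]) * c
    mu_new /= mu_new.sum()
    # 4) Combined output estimate (model averaging).
    x_hat = sum(mu_new[j] * out[j][0] for j in range(2))
    return [o[0] for o in out], [o[1] for o in out], mu_new, x_hat

xs = [np.zeros(2), np.zeros(2)]
Ps = [np.eye(2), np.eye(2)]
mu = np.array([0.5, 0.5])
xs, Ps, mu, x_hat = imm_cycle(xs, Ps, mu, z=np.array([1.0]))
```

Note that the cycle carries exactly $r$ state/covariance pairs forward, which is the memory saving over GPB2 emphasized in the text.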

4. SLAMMOT

In the previous section, we briefly described the SLAM and moving object tracking problems. In this section, we address approaches that solve the SLAM and moving object tracking problems concurrently: SLAM with generalized objects and SLAM with DATMO.

4.1. SLAM with Generalized Objects

Without making any hard decisions about whether an object is stationary or moving, the SLAMMOT problem can be handled by calculating a joint posterior over all objects (robot pose, stationary objects, moving objects). Such an approach would be similar to existing SLAM algorithms, but with additional structure to allow for motion mode learning of the generalized objects.

4.1.1. Bayesian Formulation

The formalization of SLAM with generalized objects is straightforward. First we define the generalized object as a hybrid state consisting of the state and the motion mode:

$$y_k^i = \{y_k^i, s_k^i\} \quad \text{and} \quad Y_k = \{y_k^1, y_k^2, \ldots, y_k^l\} \quad (8)$$

where $y_k$ is the true state of the generalized object, $s_k$ is the true motion mode of the generalized object and $l$ is the number of generalized objects. Note that generalized objects can be moving, stationary, or move-stop-move entities. We then use this hybrid variable $Y$ to replace the variable $M$ in Equation (5), and the general recursive probabilistic formula of SLAM with generalized objects can be expressed as

$$p(x_k, Y_k \mid U_k, Z_k). \quad (9)$$


Fig. 6. A DBN for SLAM with generalized objects. It is an integration of the DBN of the SLAM problem (Figure 4) and the DBN of the moving object tracking problem (Figure 5).

Using Bayes' rule and the assumptions that the motion models of the robot and generalized objects are Markov and that there is no interaction among the robot and generalized objects, the general recursive Bayesian formula for SLAM with generalized objects can be derived and expressed as (see Appendix A for the derivation):

$$\underbrace{p(x_k, Y_k \mid U_k, Z_k)}_{\text{Posterior at } k} \propto \underbrace{p(z_k \mid x_k, Y_k)}_{\text{Update}} \iint \underbrace{p(x_k \mid x_{k-1}, u_k)}_{\text{Robot prediction}} \, \underbrace{p(Y_k \mid Y_{k-1})}_{\text{Generalized objects prediction}} \; \underbrace{p(x_{k-1}, Y_{k-1} \mid Z_{k-1}, U_{k-1})}_{\text{Posterior at } k-1} \, \mathrm{d}x_{k-1} \, \mathrm{d}Y_{k-1} \quad (10)$$

where $p(x_{k-1}, Y_{k-1} \mid Z_{k-1}, U_{k-1})$ is the posterior probability at time $k-1$ and $p(x_k, Y_k \mid U_k, Z_k)$ is the posterior probability at time $k$. In the prediction stage, $p(x_k \mid x_{k-1}, u_k)\, p(Y_k \mid Y_{k-1})$, the states of the robot and the generalized objects are predicted independently under the no-interaction assumption. In the update stage, $p(z_k \mid x_k, Y_k)$, the states of the robot and generalized objects as well as their motion models are updated concurrently. For the cases in which interactions among the robot and generalized objects exist, the formula for SLAM with generalized objects is also given in Appendix A.

Figure 6 shows a DBN representing SLAM with generalized objects of duration three with two generalized objects, which integrates the DBNs of the SLAM problem and the moving object tracking problem.

4.1.2. Motion Modeling/Motion Mode Learning

Motion modeling of generalized objects is critical for SLAM with generalized objects. A general mechanism has to be developed for motion modeling of stationary objects, moving objects and objects transitioning between stationary and moving.

The IMM algorithm and its variants (Mazor et al., 1998) have been successfully implemented in many tracking applications for dealing with the moving object motion modeling problem because of their low computational cost and satisfactory performance.

Adding a stationary motion model to the motion model set within the same IMM algorithm for dealing with move-stop-move maneuvers was suggested in Kirubarajan and Bar-Shalom (2000), Coraluppi et al. (2000) and Coraluppi and Carthel (2001). However, as observed in Shea et al. (2000) and Coraluppi and Carthel (2001), all of the estimates tend to degrade when the stop (stationary motion) model is added to the model set and mixed with other moving motion models. This topic is beyond the scope of this paper. We provide a theoretical explanation of this phenomenon and a practical solution, move-stop hypothesis tracking, in Chapter 4 of Wang (2004).

4.1.3. Highly Maneuverable Objects

The framework of SLAM with generalized objects indicates that measurements belonging to moving objects contribute to localization and mapping just as those belonging to stationary objects do. Nevertheless, highly maneuverable objects are difficult to track and often unpredictable in practice. Including them in localization and mapping would have a minimal effect on localization accuracy.

4.1.4. Computational Complexity

In the SLAM literature, it is known that a key bottleneck of the Kalman filter solution is its computational complexity. Because it explicitly represents correlations of all pairs among the robot and stationary objects, both the computation time and memory requirement scale quadratically with the number of stationary objects in the map. This computational burden restricts applications to those in which the map can have no more than a few hundred stationary objects. Recently, this problem has been the subject of intense research.

In the framework of SLAM with generalized objects, the robot, stationary objects and moving objects are generally correlated through the convolution process in the prediction and update stages. Although the formulation of SLAM with generalized objects is elegant, it is clear that SLAM with generalized objects is much more computationally demanding than SLAM because of the required motion modeling of all generalized objects at all time steps. Given that real-time motion modeling of generalized objects and interaction among moving and stationary objects are still open questions, the computational complexity of SLAM with generalized objects is not analyzed further.

4.2. SLAM with DATMO

SLAM with generalized objects is similar to existing SLAM algorithms, but with additional structure to allow for motion modeling of generalized objects. Unfortunately, it is computationally demanding and generally infeasible. Consequently, in this section we provide the second approach, SLAM with DATMO, in which the estimation problem is decomposed into two separate estimators. By maintaining separate posteriors for stationary objects and moving objects, the resulting estimation problem is of lower dimension than SLAM with generalized objects, making it feasible to update both filters in real time.

4.2.1. Bayesian Formulation

Let $o_k^i$ denote the true hybrid state of the $i$th moving object at time $k$:

$$o_k^i = \{\bar{o}_k^i, s_k^i\} \qquad (11)$$

where $\bar{o}_k^i$ is the true state of the moving object and $s_k^i$ is the true motion mode of the moving object.

In SLAM with DATMO, three assumptions are made to simplify the computation of SLAM with generalized objects. One of the key assumptions is that measurements can be decomposed into measurements of stationary and moving objects. This implies that objects can be classified as stationary or moving, so that the general SLAM with DATMO problem can be posed as computing the posterior

$$p(x_k, O_k, M_k \mid Z_k, U_k) \qquad (12)$$

where the variable $O_k = \{o_k^1, o_k^2, \ldots, o_k^n\}$ denotes the true hybrid states of the moving objects, of which there are $n$ in the world at time $k$, and the variable $M_k = \{m_k^1, m_k^2, \ldots, m_k^q\}$ denotes the true locations of the stationary objects, of which there are $q$ in the world at time $k$. The second assumption is that, when estimating the posterior over the stationary object map and the robot pose, the measurements of moving objects carry no information about the stationary landmarks and the robot pose, and neither do their hybrid states $O_k$. The third assumption is that there is no interaction among the robot and the moving objects: the robot and the moving objects move independently of each other.

Using Bayes' rules and the assumptions addressed in Appendix B, the general recursive Bayesian formula for SLAM with DATMO can be derived and expressed as (see Appendix B for the derivation):

Fig. 7. A dynamic Bayesian network of the SLAM with DATMO problem of duration three with one moving object and one stationary object.

$$
\begin{aligned}
p(x_k, O_k, M_k \mid Z_k, U_k) \qquad (13) \\
\propto\; \underbrace{p(z_k^o \mid x_k, O_k)\, p(O_k \mid Z_{k-1}^o, U_k)}_{\text{DATMO}} \cdot \underbrace{p(z_k^m \mid x_k, M_k)\, p(x_k, M_k \mid Z_{k-1}^m, U_k)}_{\text{SLAM}} \\
=\; \underbrace{p(z_k^o \mid O_k, x_k)}_{\text{DATMO Update}} \cdot \underbrace{\int p(O_k \mid O_{k-1})\, p(O_{k-1} \mid Z_{k-1}^o, U_{k-1})\, dO_{k-1}}_{\text{DATMO Prediction}} \\
\cdot\; \underbrace{p(z_k^m \mid M_k, x_k)}_{\text{SLAM Update}} \cdot \underbrace{\int p(x_k \mid u_k, x_{k-1})\, p(x_{k-1}, M_{k-1} \mid Z_{k-1}^m, U_{k-1})\, dx_{k-1}}_{\text{SLAM Prediction}}
\end{aligned}
$$

where $z_k^m$ and $z_k^o$ denote measurements of stationary and moving objects, respectively. Equation (13) shows how the SLAMMOT problem is decomposed into separate posteriors for moving and stationary objects. It also indicates that DATMO should take account of the uncertainty in the pose estimate of the robot, because perception measurements are taken directly from the robot.
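The two-filter structure of Equation (13), a recursive predict/update cycle for the moving-object posterior running alongside the usual SLAM predict/update cycle, can be sketched numerically with two scalar Kalman filters. This is a minimal illustration only: the coupling through the robot pose is ignored for brevity, and all state values and noise parameters are assumptions, not values from the paper.

```python
# Minimal numerical sketch of the two-filter structure in Eq. (13): one
# recursive filter for the robot/map posterior (driven by stationary
# measurements z^m) and an independent one per moving object (driven by
# z^o). Noise values q, r and the unit-step motion are illustrative.

def kf_step(mean, var, u, z, q=0.1, r=0.5):
    """One scalar Kalman predict/update cycle."""
    mean, var = mean + u, var + q              # prediction (motion model)
    gain = var / (var + r)                     # Kalman gain
    return mean + gain * (z - mean), (1.0 - gain) * var

slam = (0.0, 1.0)     # robot pose posterior (mean, variance)
datmo = (0.0, 1.0)    # moving object posterior (mean, variance)

for k in range(1, 6):
    slam = kf_step(*slam, u=1.0, z=float(k))    # SLAM predict + update
    datmo = kf_step(*datmo, u=1.0, z=float(k))  # DATMO predict + update

print(round(slam[0], 2), round(datmo[0], 2))    # → 5.0 5.0
```

Both posteriors converge on their measurements independently, which is exactly the decomposition that makes SLAM with DATMO cheaper than SLAM with generalized objects.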

Figure 7 shows a DBN representing three time steps of an example SLAM with DATMO problem with one moving object and one stationary object.


4.2.2. Detection/Classification

Correctly detecting or classifying moving and stationary objects is essential for successfully implementing SLAM with DATMO. In the tracking literature, a number of approaches have been proposed for detecting moving objects, which can be classified into two categories: with and without the use of thresholding. Gish and Mucci (1987) proposed an approach in which detection and tracking occur simultaneously without using a threshold. This approach is called track before detect (TBD), although detection and tracking are performed simultaneously. However, the high computational requirements of this approach make the implementation infeasible. Arnold et al. (1993) showed that integrating TBD with a dynamic programming algorithm provides an efficient solution for detection without thresholding, which could be a practical way to implement SLAM with generalized objects. In Section 7, we present two reliable approaches for detecting or classifying moving and stationary objects from laser scanners.

4.3. Interaction

Thus far, we have described the SLAMMOT problem, which involves both SLAM in dynamic environments and detection and tracking of these dynamic objects. We presented two solutions, SLAM with generalized objects and SLAM with DATMO. In this section, we discuss one possible extension, taking interaction into account, to improve the algorithms.

The multiple moving object tracking problem can be decoupled and treated as the single moving object tracking problem if the objects are moving independently. However, in many tracking applications objects move dependently, such as sea vessels or fighter aircraft moving in formation. In urban and suburban areas, cars and pedestrians often move in formation as well because of specific traffic conditions. Although the locations of these objects are different, their velocities and accelerations may be nearly the same, so these moving objects tend to have highly correlated motions. Similar to the SLAM problem, the states of these moving objects can be augmented into a system state and then tracked simultaneously. Rogers (1988) proposed an augmented state vector approach which is identical to the SLAM problem in the way it deals with the correlation problem arising from sensor measurement errors.

In Appendix A, we provide the formula for SLAM with generalized objects in the case that interactions among the robot and generalized objects exist. Integrating behavior and interaction learning and inference would improve the performance of SLAM with generalized objects and lead to higher level scene understanding.

Following the proposed framework of SLAM with generalized objects, Wang et al. (2007) introduced a scene interaction model and a neighboring object interaction model to take long-term and short-term interactions between the tracked objects and their surroundings into account, respectively. With the use of the interaction models, they demonstrated that anomalous activity recognition is accomplished easily in crowded urban areas. Interacting pedestrians, bicycles, motorcycles, cars and trucks are successfully tracked in difficult situations with occlusion.

In the rest of the paper, we will demonstrate the feasibility of SLAMMOT from a ground vehicle at high speeds in crowded urban areas. We will describe practical SLAM with DATMO algorithms which deal with the issues of perception modeling, data association and classifying moving and stationary objects. Ample experimental results will be shown verifying the proposed theory and algorithms.

5. Perception Modeling

Perception modeling, or representation, provides a bridge between perception measurements and theory; different representation methods lead to different means of calculating the theoretical formulas. Representation should allow information from different sensors, from different locations and from different time frames to be fused.

In the tracking literature, targets are usually represented by point-features (Blackman and Popoli, 1999). In most air and sea vehicle tracking applications, the geometrical information of the targets is not included because of the limited resolution of perception sensors such as radar and sonar. However, signal-related data such as the amplitude of the radar signal can be included to aid data association and classification. On the other hand, research on mobile robot navigation has produced four major paradigms for environment representation: feature-based approaches (Leonard and Durrant-Whyte, 1991), grid-based approaches (Elfes, 1988; Thrun et al., 1998), direct approaches (Lu and Milios, 1994, 1997), and topological approaches (Choset and Nagatani, 2001).

Since feature-based approaches are used in both MOT and SLAM, it should be straightforward to use feature-based approaches to accomplish SLAMMOT. Unfortunately, according to our experiments it is extremely difficult to define and extract features reliably and robustly in outdoor environments. In this section, we present a hierarchical free-form object representation that integrates direct methods, grid-based approaches and feature-based approaches to overcome these difficulties.

5.1. Hierarchical Free-form Object Based Representation

In outdoor or urban environments, features are extremely difficult to define and extract as both stationary and moving objects do not have specific sizes and shapes. Therefore, instead of using ad hoc approaches to define features in specific


Fig. 8. An example of scan segmentation. The black solid box denotes the robot (2 m by 5 m). Each object has its own grid-map and coordinate system.

environments or for specific objects, free-form objects are used.

At the preprocessing stage, scan points are grouped or segmented into segments. Hoover et al. (1996) proposed a methodology for evaluating range image segmentation algorithms, which are mainly for segmenting a range image into planar or quadric patches. Unfortunately, these methods are infeasible for our applications. Here we use a simple distance criterion, namely that the distance between points in two segments must be longer than 1 meter. Although this simple criterion cannot produce perfect segmentation, more precise segmentation will be accomplished by localization, mapping and tracking using spatial and temporal information over several time frames. An example of scan segmentation is shown in Figure 8. In this framework, the scan segments over different time frames are integrated into free-form objects after the localization, mapping and tracking processes. This approach is hierarchical since the three main representation paradigms are used on different levels. Local localization is accomplished using direct methods, local mapping is accomplished using grid-based approaches and global SLAM is accomplished using feature-based approaches. This representation is also suitable for moving object tracking. Feature-based approaches such as Kalman filtering can be used to manage tracking uncertainty, and the shape information of a moving object is also maintained using grid-maps. Note that a free-form object can be as small as a pedestrian or as big as several street blocks.
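The distance-criterion segmentation described above can be sketched in a few lines. The 1 m threshold comes from the text; the toy scan points and function names are illustrative assumptions.

```python
import math

# Sketch of the simple distance-criterion segmentation: consecutive scan
# points more than break_dist apart start a new segment.

def segment_scan(points, break_dist=1.0):
    segments, current = [], [points[0]]
    for p, q in zip(points, points[1:]):
        if math.hypot(q[0] - p[0], q[1] - p[1]) > break_dist:
            segments.append(current)   # gap found: close current segment
            current = []
        current.append(q)
    segments.append(current)
    return segments

scan = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.1),   # object 1
        (3.0, 0.0), (3.1, 0.2)]               # object 2 (> 1 m gap)
segs = segment_scan(scan)
print([len(s) for s in segs])   # → [3, 2]
```

As the text notes, such a crude split is imperfect; segments are later refined by localization, mapping and tracking over several frames.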

5.2. Local Localization

Registration or localization of scan segments over different time frames can be done using the direct methods, namely

the iterative closest point (ICP) algorithm (Rusinkiewicz and Levoy, 2001). As range images are sparser and more uncertain in outdoor applications than in indoor applications, the pose estimate and the corresponding distribution from the ICP algorithm may not be reliable. Sparse data causes problems in correspondence finding, which directly affect the accuracy of direct methods. If a point–point metric is used in the ICP algorithm, one-to-one correspondence will not be guaranteed with sparse data, which results in decreased accuracy of the transformation estimate and slower convergence. Research on ICP algorithms suggests that minimizing distances between points and tangent planes can converge faster. But because of sparse data and irregular surfaces in outdoor environments, secondary information derived from raw data, such as surface normals, can be unreliable and too sensitive. The other issue is featureless data, which causes correspondence ambiguity as well.

In Wang and Thorpe (2004), we presented a sampling- and correlation-based range image matching (SCRIM) algorithm for taking correspondence errors and measurement noise into account. To deal with the sparse data issues, a sampling-based approach is used to estimate the uncertainty from correspondence errors. Instead of using only one initial relative transformation guess, the registration process is run 100 times with randomly generated initial relative transformations. To deal with the uncertain data issues, a correlation-based approach is used with the grid-based method for estimating the uncertainty from measurement noise along with the sampling-based approach. Measurement points and their corresponding distributions are transformed into occupancy grids using our proposed SICK laser scanner noise model. After the grid maps are built, the correlation of the grid maps is used to evaluate how strongly the grid-maps are related. The samples are then weighted with their normalized correlation responses. We have shown that the covariance estimates from the SCRIM algorithm describe the estimate distribution correctly. See Wang and Thorpe (2004) for more detailed information on the SCRIM algorithm.
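The sampling side of the SCRIM idea can be illustrated on a 1-D toy problem: draw candidate transforms, weight each by how well the transformed scan correlates with a reference occupancy grid, then report the weighted mean and covariance of the samples. This is a rough sketch under assumptions (a real implementation refines each sample with ICP first, and the grid size, wall position, noise model and sample count here are all invented for illustration).

```python
import numpy as np

# Toy 1-D sketch of sampling-based registration with correlation weighting.
rng = np.random.default_rng(0)
ref = np.zeros(40)
ref[18:22] = 1.0                              # reference "grid": a wall near cell 20

def rasterize(points, n=40):
    g = np.zeros(n)
    g[np.clip(points.astype(int), 0, n - 1)] = 1.0
    return g

scan = np.array([19.6, 20.1, 20.4])           # new scan of the same wall
samples = rng.uniform(-5.0, 5.0, 200)         # candidate translations
weights = np.array([np.dot(ref, rasterize(scan + s)) for s in samples])
weights = weights / weights.sum()             # normalized correlation responses

mean = np.sum(weights * samples)              # weighted transform estimate
var = np.sum(weights * (samples - mean) ** 2) # and its (scalar) covariance
print(round(mean, 2), round(var, 3))
```

The weighted covariance captures how ambiguous the registration is: a long featureless wall would spread the correlation response, and hence the covariance, over many candidate transforms.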

5.3. Local Mapping

The results of local localization or registration are integrated into grid-maps via grid-based approaches. Measurements belonging to stationary objects are integrated/updated into the local grid map. Each moving object has its own grid-map, which contains the shape (geometrical information) of this moving object and has its own coordinate system. See Figure 9 for an illustration.

After locally localizing the robot using the SCRIM algorithm, the new measurement is integrated into the local grid map. The Bayesian recursive formula for updating the local grid map is computed as (see Elfes, 1988, 1990 for a derivation):


Fig. 9. Hierarchical free-form object based representation.

$$l_k^{xy} = \log \frac{p(g^{xy} \mid Z_{k-1}^m, z_k^m)}{1 - p(g^{xy} \mid Z_{k-1}^m, z_k^m)} = \log \frac{p(g^{xy} \mid z_k^m)}{1 - p(g^{xy} \mid z_k^m)} + l_{k-1}^{xy} - l_0^{xy} \qquad (14)$$

where $g$ is the grid map, $g^{xy}$ is the occupancy value of the grid cell at $(x, y)$, $l$ is the log-odds ratio, and

$$l_0^{xy} = \log \frac{p(g^{xy})}{1 - p(g^{xy})}. \qquad (15)$$

Theoretically, there are two important requirements for selecting the size and resolution of the local grid maps for accomplishing hierarchical free-form object based SLAM: one is that the local grid maps should not contain loops, and the other is that the quality of the grid map should be maintained at a reasonable level. To satisfy these requirements, the width, length and resolution of the local grid maps can be adjusted on-line in practice.
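The log-odds update of Equations (14)–(15) reduces to adding the log-odds of the inverse sensor model to each touched cell and subtracting the prior. The sketch below assumes a uniform 0.5 prior and invented inverse-sensor-model probabilities; only the update rule itself comes from the equations.

```python
import numpy as np

# Per-cell log-odds occupancy update: l_k = l_{k-1} + logodds(p_meas) - l_0.

def logodds(p):
    return np.log(p / (1.0 - p))

P_PRIOR, P_OCC, P_FREE = 0.5, 0.7, 0.3   # inverse sensor model (assumed values)
l0 = logodds(P_PRIOR)                     # = 0 for a 0.5 prior

grid = np.zeros(10)                       # log-odds map, prior everywhere

def update_cell(grid, idx, p_meas):
    grid[idx] += logodds(p_meas) - l0     # Eq. (14)

for _ in range(3):                        # three "hit" measurements on cell 4
    update_cell(grid, 4, P_OCC)
update_cell(grid, 2, P_FREE)              # one "miss" on cell 2

prob = 1.0 - 1.0 / (1.0 + np.exp(grid))   # back to occupancy probability
print(round(prob[4], 3), round(prob[2], 3), round(prob[0], 3))  # → 0.927 0.3 0.5
```

Repeated hits drive a cell's probability toward 1, misses toward 0, and untouched cells stay at the prior, which is why the same mechanism serves both the stationary object map and the per-object grids.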

For the experiments addressed in this paper, the width and length of the grid maps are set to 160 m and 200 m respectively, and the resolution of the grid map is set to 0.2 m. When the robot arrives at the 40 m boundary of the grid map, a new grid map is initialized. The global pose of a local map and its corresponding distribution are computed according to the robot's global pose and distribution. Figure 10 shows the local grid maps generated along the trajectory using the described parameters. Figure 11 shows the details of the grid maps, which contain information from both stationary objects and moving objects.

Fig. 10. Generated grid maps along the trajectory. The boxes indicate the boundaries of the grid maps.

5.4. Global SLAM

As grid-based approaches need extra computation for loop-closing and all raw scans have to be used to generate a new globally consistent map, the grid-map is only built locally. To accomplish global SLAM in very large environments, each local grid map is treated as a three degree-of-freedom feature as shown in Figure 9, and loop closing is done with the mechanism of the feature-based approaches.

Figure 12 shows the result without loop-closing and Figure 13 shows the result using the feature-based EKF algorithm for loop-closing with correct loop detection. Extension 2 provides a full replay of this loop closing processing. Information from moving objects is filtered out in both figures. The covariance matrix for closing this loop contains only 14 three degree-of-freedom features.

Since we set whole local grid maps as features in the feature-based approaches for loop-closing, the uncertainty inside the local grid maps is not updated with the constraints from loop detection. Although Figure 13 shows a satisfactory result, the coherence of the overlay between grid maps is not guaranteed. Practically, the inconsistency between the grid-maps will not affect the robot's ability to perform tasks. Local navigation can be done with the currently built grid map, which contains the most recent information about the surrounding environment. Global path planning can be done with the globally consistent map from the feature-based approaches in a topological sense. In addition, the quality of the global map can be improved by adjusting the sizes and resolutions of the local grid maps to smooth out the inconsistency between grid maps. At the same time, the grid-maps should be big enough to have high object saliency scores in order to reliably solve the revisiting problem.


Fig. 11. Details of the grid maps. Gray denotes areas occupied by neither moving objects nor stationary objects, whiter than gray denotes areas which are likely to be occupied by moving objects, and darker than gray denotes areas which are likely to be occupied by stationary objects.

Fig. 12. The result without loop-closing. Information from moving objects is filtered out.

5.5. Local Moving Object Grid Map

There is a wide variety of moving objects in urban and suburban environments such as pedestrians, animals, bicycles, motorcycles, cars, trucks, buses and trailers. The critical requirement for safe driving is that all such moving objects be detected and tracked correctly. Figure 14 shows an example of different kinds of moving objects in an urban area where the hierarchical free-form object representation is suitable and

Fig. 13. The result with loop-closing. Information from moving objects is filtered out.

applicable because free-form objects are used without pre-defining features or appearances.

As the number of measurement points belonging to small moving objects such as pedestrians is often less than four, the centroid of the measurement points is used as the state vector of the moving object. The state vector, or object-feature, of a small moving object contains only location without orientation because the geometrical information is insufficient to determine orientation correctly.

However, when tracking large moving objects, using the centroid of the measurements is imprecise. Different portions


Fig. 14. A wide variety of moving objects in urban areas. A: a dump truck, B: a car, C: a truck, D: two pedestrians, E: a truck.

of moving objects are observed over different time frames because of motion and occlusion. This means that the centroids of the measurements over different time frames do not represent the same physical point. Figure 15 shows the different portions of a moving car observed over different time frames.

Therefore, the SCRIM algorithm is used to estimate the relative transformation between the new measurement and the object-grids, and its corresponding distribution. As the online learned motion models of moving objects may not be reliable at the early stage of tracking, the predicted location of the moving object may not be good enough to avoid the local minima problem of the ICP algorithm. Applying the SCRIM algorithm to correctly describe the uncertainty of the pose estimate is especially important.

Since the orientation of a large object can be determined reliably, the state vector, or object-feature, can consist of both location and orientation. In addition, the geometrical information is accumulated and integrated into the object-grids. As a result, not only are the motions of moving objects learned and tracked, but their contours are also built. Figure 16 shows the registration results using the SCRIM algorithm.

The moving objects' own grid maps maintain only their shape information, not their trajectories. Although the trajectories of moving objects can be stored and maintained with lists, it is difficult to retrieve information from the lists of multiple moving objects. Therefore, local moving object grid maps are created to store trajectory information from moving objects

Fig. 15. Different portions of a moving car.

Fig. 16. Registration results of the example in Figure 15 using the SCRIM algorithm. The states are indicated by Box (1), and the final scan points are indicated by .


using the same mechanism as for maintaining local stationary object grid maps. Figure 11 shows examples of the local stationary and moving object grid maps.

By integrating trajectory information from moving cars and pedestrians, lanes and sidewalks can be recognized. This kind of information is extremely important to robots operating in environments occupied by human beings. In exploration applications, robots can go wherever there is no obstacle. However, for tasks in environments shared with human beings, robots at least have to follow the same rules that people obey. For example, a robot car should keep in its lane and should not go onto unoccupied sidewalks. Both the stationary object map and the moving object map provide essential and critical information for accomplishing these tasks.

6. Data Association

In this section, we present simple yet effective solutions for solving data association issues in SLAMMOT.

6.1. Revisiting in SLAM

One of the most important steps needed to solve the global SLAM problem is to robustly detect loops or recognize previously visited areas. This is called the revisiting problem (Stewart et al., 2003; Thrun and Liu, 2003; Hähnel et al., 2003). Figure 17 shows that the robot entered the explored area and the current grid map is not consistent with the pre-built map. The revisiting problem is difficult because of accumulated pose estimate errors, unmodelled uncertainty sources, temporary stationary objects and occlusion. Here we describe one approach, information exploiting, for dealing with these issues.

For loop closing, not only recognizing but also localizing the current measurement within the global map has to be accomplished. Unfortunately, because of temporary stationary objects, occlusion, and featureless areas (Wang, 2004), recognizing and localizing places are difficult even with proper information about which portions of the built map are more likely. For instance, Figure 18 shows that the currently built stationary object maps may be very different from the global stationary object map because of temporary stationary objects such as ground vehicles stopped at traffic lights and parked cars. Since the environments are dynamic, stationary objects may be occluded when the robot is surrounded by big moving objects such as buses and trucks.

In order to deal with the addressed situations, big regions are used for loop-detection instead of raw scans. In large scale regions, large and stable objects such as buildings and street blocks are the dominating factors in the recognition and localization processes, and the effects of temporary stationary objects such as parked cars are minimized. It is also more likely to have more salient areas when the size of the regions is larger.

Fig. 17. The revisiting problem. Because of the accumulated pose error, the current grid map is not consistent with the pre-built map.

Fig. 18. A temporary stationary bus. Rectangles denote the detected and tracked moving objects. The segment numbers of these moving objects are shown.

In other words, the ambiguity of recognition and localization can be removed more easily and robustly. As the measurements at different locations over different times are accumulated and integrated into the local grid maps, the occlusion of stationary objects is reduced as well. Figure 19 shows a grid-map pair of the same region built at different times. Although the details of the local grid-maps are not the same in the same region because of the described reasons, full grid maps contain enough information for place recognition and localization.

Fig. 19. The grid-map pair of the same region built at different times: Grid-map 1 and Grid-map 16. Different moving object activities at different times, occlusion and temporary stationary objects are shown.

Fig. 20. Recognition and localization results using different scales of grid map 1 and grid map 16. From left to right: 1/8 scale, 1/4 scale and 1/2 scale. Two grid maps are shown with respect to the same coordinate system.

As local grid maps are used, visual image registration algorithms from the computer vision literature can be used for recognition and localization. Following the SCRIM algorithm, we use the correlation between local grid maps to verify the recognition (searching) results, and we perform recognition (searching) between two grid maps according to the covariance matrix from the feature-based SLAM process instead of sampling. The search stage is sped up using multi-scale pyramids. Figure 20 shows the recognition and localization results for the examples in Figure 19 using different scales.
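A coarse-to-fine correlation search of the kind used here can be sketched in 1-D: correlate downsampled grids to get a rough alignment, then refine at full resolution around that guess. The map contents, pyramid factors and search windows below are illustrative assumptions, not the system's actual parameters.

```python
import numpy as np

# Coarse-to-fine correlation search between two 1-D "occupancy grids".

def downsample(grid, f):
    n = (len(grid) // f) * f
    return grid[:n].reshape(-1, f).max(axis=1)   # occupancy-preserving pooling

def best_shift(ref, query, shifts):
    scores = {s: np.dot(ref, np.roll(query, s)) for s in shifts}
    return max(scores, key=scores.get)           # highest-correlation shift

ref = np.zeros(64)
ref[[10, 11, 40, 41, 42]] = 1.0                  # stored map of a place
query = np.roll(ref, 6)                          # revisit, displaced by 6 cells

# Coarse search at 1/4 scale, then refine at full scale around the guess.
coarse = 4 * best_shift(downsample(ref, 4), downsample(query, 4), range(-8, 9))
fine = best_shift(ref, query, range(coarse - 4, coarse + 5))
print(fine)   # → -6 (the shift that re-aligns the revisit with the map)
```

The coarse level prunes most of the search window cheaply, which is the point of the multi-scale pyramid; 2-D grids add a rotation dimension but the structure is the same.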

6.2. Data Association in MOT

Once a new moving object is detected, our algorithm initializes a new track for this object, assigning an initial state and motion models to this new moving object. With laser scanners we can only obtain position, not velocity and orientation; therefore our algorithm uses data from different times and then accomplishes data association in order to initialize a new track.

Data association and tracking problems have been extensively studied and a number of statistical data association techniques have been developed, such as the Joint Probabilistic Data Association Filter (JPDAF) (Fortmann et al., 1983) and Multiple Hypothesis Tracking (MHT) (Reid, 1979; Cox and Hingorani, 1996). Our system applies the MHT method, which maintains a hypothesis tree and can revise its decisions as new information arrives. This delayed decision approach is more robust than other approaches. The main disadvantage of the MHT method is its exponential complexity. If the hypothesis tree is too big, it will not be feasible to search all the hypotheses to get the most likely matching set. Fortunately, the number of moving objects in our application is usually less than twenty and most of the moving objects only appear for a short period of time. Also, useful information about moving objects from laser scanners, such as location, size, shape, and velocity, is used for updating the confidence for pruning and merging hypotheses. In practice, the hypothesis tree is always kept at a reasonable size.
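For intuition, the core association step can be reduced to gated nearest-neighbour assignment of scan segments to tracked objects. This is a deliberately simplified stand-in: the actual system uses MHT, which keeps several such assignment hypotheses alive in a tree and prunes them later. The gate size and positions below are illustrative assumptions.

```python
import math

# Gated nearest-neighbour association (single-hypothesis simplification of MHT).

def associate(tracks, detections, gate=2.0):
    pairs, used = {}, set()
    for tid, pred in tracks.items():               # predicted track positions
        candidates = {j: math.hypot(pred[0] - d[0], pred[1] - d[1])
                      for j, d in enumerate(detections) if j not in used}
        if candidates:
            j = min(candidates, key=candidates.get)
            if candidates[j] < gate:               # gating test
                pairs[tid] = j
                used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return pairs, unmatched                        # unmatched segments start new tracks

tracks = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
detections = [(0.5, 0.3), (10.4, -0.2), (30.0, 5.0)]
pairs, new = associate(tracks, detections)
print(pairs, new)   # → {'A': 0, 'B': 1} [2]
```

MHT differs precisely where this sketch commits early: instead of greedily fixing each assignment, it branches on the ambiguous ones and defers the decision until later measurements disambiguate them.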

7. Moving Object Detection

Recall that SLAM with DATMO makes the assumption that the measurements can be decomposed into measurements of stationary and moving objects. This means that correctly detecting moving objects is essential for successfully implementing SLAM with DATMO.

In this section, we describe two approaches for detecting moving objects: a consistency-based approach and a moving object map based approach. Although both approaches rely on thresholding, the experimental results using laser scanners are satisfactory.

7.1. Consistency-based Detection

The consistency-based moving object detection algorithm consists of two parts: the first is the detection of moving points; the second is the combination of the results from segmentation and moving point detection to decide which segments are potential moving objects. The details are as follows: given a new scan, the local surrounding map, and the relative pose estimate from the direct methods, we first transform the local surrounding map into the coordinate frame of the current laser scanner, and then convert the map from a rectangular coordinate system to a polar coordinate system. It is then easy to detect moving points by comparing values along the range axis of the polar coordinate system.

A segment is identified as a potential moving object if the ratio of the number of moving points to the number of total points is greater than 0.5. Note that the consistency-based detector is a motion-based detector, so temporary stationary objects cannot be detected. If the time period between


Fig. 21. Multiple car detection and data association. Top: the solid box denotes the Navlab11 testbed and rectangles denote the detected moving objects. Bottom: the partial image from the tri-camera system. Four lines indicate the detected ground vehicles.

consecutive measurements is very short, the motions of moving objects will be too small to detect. Therefore, in practice an adequate time period should be chosen to maximize the correctness of the consistency-based detection approach.
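The consistency test described above can be sketched as a per-beam range comparison in the polar frame followed by the 0.5 ratio criterion. The 0.5 threshold comes from the text; the map ranges, tolerance value and segment layout are illustrative assumptions.

```python
import numpy as np

# Consistency-based detection sketch: a scan return much shorter than the
# range stored in the local surrounding map means something new (a moving
# object) now occupies previously free space.

def moving_points(scan_ranges, map_ranges, tol=0.5):
    return scan_ranges < (map_ranges - tol)

def is_moving_segment(flags):
    return bool(np.mean(flags) > 0.5)      # ratio criterion from the text

map_ranges = np.array([10.0, 10.0, 10.0, 10.0, 10.0, 10.0])  # polar map (per beam)
scan = np.array([9.9, 10.1, 4.0, 4.2, 4.1, 9.8])             # beams 2-4 hit a new object
flags = moving_points(scan, map_ranges)

segments = {"wall": [0, 1, 5], "car": [2, 3, 4]}             # point indices per segment
labels = {name: is_moving_segment(flags[idx]) for name, idx in segments.items()}
print(labels)   # → {'wall': False, 'car': True}
```

As the text cautions, this motion-based test says nothing about a temporarily stationary object; that case is handled by the moving object map based detector of Section 7.2.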

Figure 21 shows a result of the detection and data association algorithms, and the partial image from the tri-camera system.

7.2. Moving Object Map Based Detection

Detection of pedestrians moving at very low speeds is difficult but possible by including information from the moving object map. From our experimental data, we found that the data associated with a pedestrian is very small, generally 1–4 points. Also, the motion of a pedestrian can be too slow to be detected by the consistency-based detector. As the moving object map contains information from previous moving objects, if a blob is in an area that was previously occupied by moving objects, this object can be recognized as a potential moving object.

8. Experimental Results

So far we have shown the procedures in detail for accomplishing ladar-based SLAM with DATMO from a ground vehicle at high speeds in crowded urban areas. To summarize, Figure 22 shows the flow diagram of SLAM with DATMO, and the steps are briefly described below:

8.0.1. Data collecting and preprocessing

Measurements from motion sensors such as odometry and measurements from perception sensors such as laser scanners are collected. In our applications, laser scans are segmented. The robot pose is predicted using the motion measurement and the robot motion model.

8.0.2. Tracked moving object association

The scan segments are associated with the tracked moving objects with the MHT algorithm. In this stage, the predicted robot pose estimate is used.

8.0.3. Moving object detection and robot pose estimation

Only scan segments not associated with the tracked moving objects are used in this stage. Two algorithms, the consistency-based and the moving object map based detectors, are used to detect moving objects. At the same time, the robot pose estimate is improved with the use of the SCRIM algorithm.

8.0.4. Update of stationary and moving objects

With the updated robot pose estimate, the tracked moving objects are updated and new moving objects are initialized via the IMM algorithm. If the robot arrives at the boundary of the local grid map, a new stationary object grid map and a new moving object grid map are initialized and the global SLAM stage using feature-based approaches is activated. Otherwise, the stationary object grid map and the moving object grid map are updated with the new stationary scan segments and the new moving scan segments, respectively. This completes one cycle of local localization, mapping and moving object tracking.

8.0.5. Global SLAM

The revisiting problem is solved. The local grid maps are treated as three degree-of-freedom features and the global SLAM problem is solved via extended Kalman filtering.


Fig. 22. Flowchart of the SLAM with DATMO algorithm. Dark circles are data, rectangles are processes and gray ovals are inferred or learned variables.

Extension 1 provides a full replay of the SLAM with DATMO processing. As the experimental results of city-sized SLAM and single car tracking have been shown in the previous sections, multiple pedestrian and ground vehicle tracking will be shown to demonstrate the feasibility of the proposed SLAMMOT theory and SLAM with DATMO algorithms. In addition, the effects of the 2-D environment assumption in 3-D environments, the issues of sensor selection and limitation, and ground truth for evaluating SLAMMOT will be addressed in this section.

8.1. Multiple Target Tracking

Figures 23 and 24 illustrate an example of multiple car tracking over about 6 seconds. Figure 23 shows the raw data of the 201 scans in which object B was occluded during the tracking process. Figure 24 shows the tracking results. The occlusion did not affect tracking because the learned motion models provide reliable predictions of the object states. The association was established correctly when object B reappeared in this example. Extension 3 provides a full replay of the multiple car tracking processing.


Fig. 23. Raw data of 201 scans. Measurements associated with stationary objects are filtered out. Measurements are denoted by every 20 scans. Note that object B was occluded during the tracking process.

Fig. 24. Results of multiple ground vehicle tracking. The trajectories of the robot and the tracked moving objects are denoted with text labels. The marked estimates are not from the update stage but from the prediction stage because of occlusion.

Figures 25–28 illustrate an example of pedestrian tracking. Figure 25 shows the scene, in which there are three pedestrians. Figure 26 shows the visual images from the tri-camera system and Figure 27 shows the 141 raw scans. Because of the selected distance criterion in segmentation, object B consists of two pedestrians. Figure 28 shows the tracking result, which demonstrates the ability to deal with occlusion. Extension 4 provides a full replay of this multiple pedestrian tracking processing.

8.2. 2-D Environment Assumption in 3-D Environments

We have demonstrated that it is feasible to accomplish city-sized SLAM, and Figure 13 shows a convincing 2-D map of a very large urban area. In order to build 3-D (2½-D) maps, we mounted another scanner on the top of the Navlab11 vehicle to perform vertical profiling. Accordingly, high-quality 3-D models can be produced. Figures 29, 30 and 31 show the 3-D models of different objects, which can be very useful for applications in civil engineering, architecture, landscape architecture, city planning, etc. Extension 5 provides a video that shows the 3-D city modeling results.

Although the formulations derived in this paper are not restricted to two-dimensional applications, it is more practical and easier to solve the problem in real time by assuming that the ground is flat. For most indoor applications this assumption is fair, but for applications in urban, suburban or highway environments it is not always valid. False measurements due to this assumption are often observed in our experiments. One source is the roll and pitch motion of the robot, which is unavoidable during high-speed turns or sudden stops and starts (see Figure 32). These motions may produce false measurements such as scan data returned from the ground instead of from other objects. Additionally, since the vehicle moves in 3-D environments, uphill environments may cause the laser beam to hit the ground as well (see Figure 33).

Fig. 25. An intersection. Pedestrians are pointed out by the arrow.

Fig. 26. Visual images from the tri-camera system. Black boxes indicate the detected and tracked pedestrians. Note that the images are only for visualization.

Fig. 27. Raw data of 141 scans. Only every 20th scan is shown.

Fig. 28. Results of multiple pedestrian tracking. The final scan points are shown in magenta and the estimates in blue.

Fig. 29. 3-D models of buildings on Filmore street.

Fig. 30. 3-D models of parked cars in front of the Carnegie Museum of Art.

Fig. 31. 3-D models of trees on S. Bellefield avenue.

In order to accomplish 2-D SLAM with DATMO in 3-D environments, it is critical to detect and filter out these false measurements. Our algorithms can detect them implicitly without using separate pitch and roll measurements. First, the false measurements are detected and initialized as new moving objects by our moving object detector. Once data association and tracking are applied to these measurements, their shape and motion inconsistency quickly reveals that they are false. These false measurements also disappear immediately once the motion of the vehicle returns to normal. The results using data from Navlab11 show that our 2-D algorithms can survive in urban and suburban environments. However, these large, fast-moving false alarms may confuse the warning system and cause a sudden overwhelming alert before they are filtered out by the SLAM with DATMO processes. Using 3-D motion and/or 3-D perception sensors to compensate for these effects would be necessary.
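A shape-and-motion consistency test of this kind can be sketched as a simple gate. The thresholds and feature choices below are illustrative assumptions, not the exact test used in our implementation.

```python
def is_false_alarm(track_speeds, track_sizes, vmax=20.0, size_jump=2.0):
    """Flag a track whose estimated speed is physically implausible
    (speeds in m/s) or whose apparent extent (sizes in m) changes too
    abruptly between consecutive scans, as ground-hit returns tend to do."""
    implausible_speed = any(s > vmax for s in track_speeds)
    unstable_shape = any(b / a > size_jump or a / b > size_jump
                         for a, b in zip(track_sizes, track_sizes[1:]))
    return implausible_speed or unstable_shape
```

A ground-hit return typically appears as a large object sweeping across the scan at an absurd apparent speed with an extent that grows and shrinks scan to scan, so either branch of the gate catches it within a few scans.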

8.3. Sensor Selection and Limitation

The derived Bayesian formulations for solving the SLAMMOT problem are not restricted to any specific sensors. In this section, we discuss issues in the selection and limitations of perception and motion sensors.

8.3.1. Perception Sensors

In the tracking literature, there are a number of studies on issues of using different perception sensors (Bar-Shalom and Li, 1995; Blackman and Popoli, 1999). In the SLAM literature, use of different sensors has been proposed as well. For instance, bearing-only sensors such as cameras (Deans, 2005), and range-only sensors such as transceiver–transponders (Kantor and Singh, 2002; Newman and Leonard, 2003) have been used for SLAM. The fundamentals of using heterogeneous sensors for SLAM, MOT, and SLAM with DATMO are the same; the difference is sensor modelling according to sensor characteristics.

Fig. 32. Dramatic changes between consecutive scans due to a sudden start.

Fig. 33. False measurements from an uphill environment.

Although laser scanners are relatively accurate, some failure modes or limitations exist. Laser scanners cannot detect some materials such as glass because the laser beam passes through them. Laser scanners may not detect black objects because the laser light is absorbed. If the surface of an object is not sufficiently diffuse, the laser beam can be reflected away and not returned to the device. In our experiments these failure modes are rarely observed but do happen. In Figure 34, the measurement from the laser scanner missed two black and/or clean cars, which are shown clearly in the visual image from the tri-camera system. Heterogeneous sensor fusion would be necessary to overcome these limitations.

Fig. 34. Failure mode of the laser scanners. Car A and Car B are not shown completely in the laser scanner measurement.

8.3.2. Motion Sensors

In this paper, we have demonstrated that it is indeed feasible to accomplish SLAMMOT using odometry and laser scanners. However, we do not suggest the total abandonment of inexpensive sensors such as compasses and GPS if they are available. With extra information from these inaccurate but inexpensive sensors, inference and learning can be easier and faster. For instance, for the revisiting problem, the computational time for searching can be reduced dramatically in the orientation dimension with a rough global orientation estimate from a compass, and in the translation dimensions with a rough global location estimate from GPS. The saved computational power can be used for other functionalities such as warning and planning.

8.4. Ground Truth

It would of course be nice to have ground truth, to measure the quantitative improvement of localization, mapping and moving object tracking with the methods introduced in this paper.

Fig. 35. An available digital map. The locations of intersections are denoted by circles.

Unfortunately, getting accurate ground truth is difficult, and is beyond the scope of the work in this paper. Several factors make ground truth difficult:

• Localization: collecting GPS data in city environments is problematic, due to reflections from tall buildings and other corrupting effects.

• Mapping: the accuracy and resolution of our mapping results are better than those of available digital maps.

• Moving object tracking: any system that works in the presence of uninstrumented moving objects will have a difficult time assessing the accuracy of tracking data.

Some of these difficulties are illustrated by Figures 35, 36 and 37. Figure 35 shows the locations of intersections on an available digital map. In Figure 37, those same intersections are overlaid on our reconstructed map. In Figure 36, the reconstructed map is overlaid on an aerial photo. Qualitatively, the maps line up, and the scale of the maps is consistent to within the resolution of the digital maps. Quantitative comparisons are much more difficult.

9. Conclusion and Future Work

In this paper, we have developed a theory for performing SLAMMOT. We first presented SLAM with generalized objects, which computes the joint posterior over all generalized objects and the robot. Such an approach is similar to existing SLAM algorithms. Unfortunately, it is computationally
