CHAPTER 1. Introduction
can be attached to the IMU to observe the environment or moving objects for orientation (Bloesch et al., 2015), or the IMU trajectory can be accumulated at the very beginning for matching with the RGB-D sensor observations (Croitoru et al., 2005). To resolve issues with relative orientation, Tian et al. (Tian et al., 2015) set up their system parallel to the Earth frame, which is the frame in which the IMUs’ global orientation is expressed. With this parallel setting, the initial relative orientation between the RGB-D sensor and the IMUs is known. However, this setting limits the feasible orientations of the system. To relax this limitation, we use a virtual relative orientation technique that exploits the initial global orientation of the IMU in the Earth frame. Thus, the proposed system does not have to be parallel to the Earth frame. The second issue regarding relative orientation is resolved by using the virtual relative orientation technique with global orientation measurements. As for the first issue, we work around it by assuming that the initial relative orientations between the RGB-D sensor and the IMUs are given, since providing robust estimates during occlusion is the focus of this dissertation.
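One way to read the virtual relative orientation idea is as frame algebra on unit quaternions. The sketch below is an illustration under stated assumptions, not the implementation used in this dissertation: quaternions are (w, x, y, z) tuples, `q_earth_imu` is assumed to map IMU-frame vectors into the Earth frame, `q_cam_imu0` stands for the given initial relative orientation, and all function names are hypothetical.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_conj(q):
    """Conjugate, which is the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_from_axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)

def earth_to_camera(q_cam_imu0, q_earth_imu0):
    """Earth-to-camera rotation, fixed once from the initial IMU reading.

    Assumption: q_cam_imu0 (IMU -> camera at start-up) is given, and
    q_earth_imu0 (IMU -> Earth at start-up) is the IMU's global orientation.
    """
    return q_mul(q_cam_imu0, q_conj(q_earth_imu0))

def imu_in_camera(q_cam_earth, q_earth_imu_t):
    """Express a later global IMU orientation in the camera frame."""
    return q_mul(q_cam_earth, q_earth_imu_t)

# Hypothetical example: the IMU starts yawed 30 degrees in the Earth frame
# and aligned with the camera, then yaws to 75 degrees.
z_axis = (0.0, 0.0, 1.0)
q_cam_earth = earth_to_camera((1.0, 0.0, 0.0, 0.0),
                              q_from_axis_angle(z_axis, math.radians(30)))
q_cam_imu_t = imu_in_camera(q_cam_earth,
                            q_from_axis_angle(z_axis, math.radians(75)))
```

In this example the IMU has turned 45 degrees about the vertical axis relative to the camera, regardless of how the camera itself is oriented with respect to the Earth frame.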
In order to track human joints under occlusion, not only are multiple heterogeneous sensors used, but an arm model is also built to represent the structure of the human upper extremities. The Denavit-Hartenberg parametric representation (Denavit & Hartenberg, 1955) is used to describe the relationship between joints from the body to the hand. The parameters, such as the lengths of the arm parts, joint rotations, and rotations around the arm parts, are estimated with inverse kinematics techniques when observations from the RGB-D sensor are available. Once the structure has been learned through these parameters, the first type of virtual measurement, known as arm model estimates, is generated with forward kinematics to provide virtual locations for each joint from shoulder to hand. Since the upper extremity structure is estimated over time, the arm model in the proposed system is naturally suited to different patients in different rehabilitation stages. With these generated virtual measurements, we can track joint locations under occlusion.
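The forward kinematics step can be illustrated with the standard Denavit-Hartenberg link transform. The following is a minimal sketch, assuming the classic DH convention (rotation about z by theta, translation d along z, translation a along x, rotation alpha about x) and a hypothetical planar two-link arm; the link lengths and function names are illustrative and are not taken from the proposed arm model.

```python
import math

def dh_transform(theta, d, a, alpha):
    """Homogeneous 4x4 transform of one link in the classic DH convention."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [[ct, -st * ca, st * sa, a * ct],
            [st, ct * ca, -ct * sa, a * st],
            [0.0, sa, ca, d],
            [0.0, 0.0, 0.0, 1.0]]

def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(dh_rows):
    """Chain the link transforms and return the position of each link end."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    positions = []
    for theta, d, a, alpha in dh_rows:
        T = mat_mul(T, dh_transform(theta, d, a, alpha))
        positions.append((T[0][3], T[1][3], T[2][3]))
    return positions

# Hypothetical arm: 0.30 m upper arm and 0.25 m forearm, shoulder rotated
# by 90 degrees and elbow straight. Rows are (theta, d, a, alpha).
arm = [(math.pi / 2, 0.0, 0.30, 0.0),
       (0.0, 0.0, 0.25, 0.0)]
elbow, hand = forward_kinematics(arm)
```

With the shoulder at the origin, `elbow` and `hand` play the role of virtual joint locations that could stand in for missing observations while a joint is occluded.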
1.3. Contribution and Statement
In this dissertation, we propose methods and algorithms to reduce the effects of occlusion. First, the limitations and effects of a single sensor are discussed. In this approach, we propose a framework that performs multiple interacting object tracking in urban areas using a stationary laser scanner, based on given segmentation results and without any classification. Building on the finding that moving objects tend to maintain their behavior, this approach estimates moving objects under occlusion. The framework is composed of a virtual measurement model for tracking in occlusion and interacting object models that describe the interactions between nearby moving objects and the environment.
In our initial study (Wang et al., 2007), (Wan et al., 2008), (Lin et al., 2011), the evaluation was conducted using a manually labeled benchmark; here we propose a benchmark-generating system that uses 3D LIDAR to provide the occluded information in 2D. To quantify the occlusion between 2D and 3D sensors and to build the benchmark system, an occluded area detection module is also proposed to extract occluded grids.
The interacting object models are composed of the scene interaction model and the neighboring object interaction model for long-term and short-term interactions, respectively. The scene interaction model is represented using a stationary occupancy map and moving object maps for the monitored scene. The occupancy-motion grids store the collected moving object information, such as speed and moving direction, and the k-means clustering algorithm (MacQueen, 1967) is applied to cluster the samples in order to provide predictions with means and covariances. The neighboring object interaction model extends the simple following interaction of (Wang et al., 2007) to three kinds of interactions: following, attracting, and repelling. These interacting object models not only solve the challenging modeling problem but also yield higher-level scene understanding. They remain consistent under occlusion, and the weights of the motion models are used to represent a moving object’s motion feature. We also find that the motion features of moving objects tend to vary little, with only small changes in the weights of the motion models. Thus, these interacting object models are applied to compute the virtual measurement model with the stored motion features for tracking in occluded space.
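The clustering step for one occupancy-motion grid cell can be sketched as follows. This is a minimal illustration under assumptions, not the dissertation's implementation: each stored sample is taken to be a 2D velocity (vx, vy) rather than a speed and direction pair, the k-means centers are initialized from the first k samples, and the function names and sample values are hypothetical.

```python
def mean_and_cov(points):
    """Mean and population covariance of a list of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), [[sxx, sxy], [sxy, syy]]

def kmeans(samples, k, iters=50):
    """Plain Lloyd iterations; centers start at the first k samples."""
    centers = list(samples[:k])
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in samples:
            # Assign the sample to the center with the smallest squared distance.
            nearest = min(range(k),
                          key=lambda c: (s[0] - centers[c][0]) ** 2
                                        + (s[1] - centers[c][1]) ** 2)
            clusters[nearest].append(s)
        updated = [mean_and_cov(c)[0] if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return clusters

def motion_modes(samples, k):
    """Return (weight, mean, covariance) for every non-empty cluster."""
    modes = []
    for cluster in kmeans(samples, k):
        if cluster:
            mean, cov = mean_and_cov(cluster)
            modes.append((len(cluster) / len(samples), mean, cov))
    return modes

# Hypothetical velocity samples (m/s) collected in a single grid cell:
# some objects move along x, others along y.
cell_samples = [(1.0, 0.0), (0.0, 1.0), (1.2, 0.0),
                (0.0, 0.8), (0.8, 0.0), (0.0, 1.2)]
modes = motion_modes(cell_samples, k=2)
```

Each returned mode summarizes one typical motion pattern of the cell; the mean and covariance can serve as a prediction for an object entering that cell, which is the role the text assigns to the clustered samples.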
In the second approach, we present a system framework that fuses different sensor measurements to simultaneously track human joints and model adjustable upper extremities in order to provide estimates during occlusion. Based on the proposed heterogeneous sensor simultaneous localization, tracking, and modeling (HS-SLTAM) algorithm, sensors are located in a global frame within which human joints are tracked directly. Occlusions are first reduced by the use of heterogeneous sensors. In this part, two types of virtual measurements are used. The first is arm model estimates, which are generated based on the upper extremity model to provide further estimates during occlusion. The second, virtual relative orientation, is a technique based on sensor data that relaxes limitations on system settings. In order to provide robust estimates during occlusion, the upper extremity model is built into the HS-SLTAM algorithm. The first type of virtual measurement is generated directly from the estimates of the adjustable upper extremity model. Furthermore, with the adjustable upper extremity model, the proposed framework is suitable for different patients and for different stages of the rehabilitation process.
The proposed system is composed of a stationary RGB-D camera and wearable devices on a person to track human joints. Measurements from these sensors are fused to assist stroke patients or other patients in the postoperative rehabilitation stage. The proposed algorithm is developed from multi-robot simultaneous localization and tracking (MR-SLAT) (Chang et al., 2016). First, the state vector is extended to 3D space and the orientation representation is replaced with quaternions to reduce the linearization error. Second, different measurement models are used in HS-SLTAM to take into account different observations from multiple heterogeneous devices; these include location-based, acceleration-based, and orientation-based models. Third, a kinematics model of the upper extremities is part of the proposed system. Based on the Denavit-Hartenberg parameter representation (Denavit & Hartenberg, 1955), the kinematics model can be customized for different subjects in different stages of the rehabilitation process. With these three extensions to the initial study (Chang et al., 2016), we obtain HS-SLTAM, which fuses measurements in 3D space and describes the upper extremities of different subjects with the proposed kinematics model for occlusion.
In this part, we not only provide a heterogeneous sensor fusion framework to track joints during occlusion but also provide an adjustable arm model for the rehabilitation process. With the proposed HS-SLTAM, we reduce occlusion by integrating sensor measurements and provide robust joint tracking under occlusion by utilizing virtual measurements. We develop two types of virtual measurements: arm model estimates, which are used under occlusion to track joints based on the proposed arm model, and virtual relative orientation, a technique for relaxing the system limitation regarding orientation. Because each patient is unique, the rehabilitation process has many stages, and patients improve during the course of rehabilitation, the rehabilitation system should be customizable at each stage of the process. The proposed approach involves a two-phase estimation framework for online training and testing. In the first phase we track joint locations, and in the second phase we construct the arm model. These two phases cooperate to provide model training and testing as well as the rotation rate and variance of joints, so that medical staff can judge the rehabilitation performance. The proposed system demonstrates a way to produce online estimates for each patient and for each rehabilitation exercise.
Even though it is difficult to resolve occlusion, we contribute approaches that tackle the issues and effects caused by occlusion in different environments, including urban traffic and indoor rehabilitation activity, and in different sensor settings, from a single sensor to heterogeneous sensors. The boundaries of existing sensors and sensor fusion frameworks are explored. These approaches not only extend estimation capabilities under occlusion, but also expand the working space from a surface in the traffic scene to a 3D space for postoperative rehabilitation. In the second part in particular, different models of heterogeneous sensors are introduced together with an augmented state space that represents the sensor fusion results. Moreover, the online adjustable relationship model, the upper extremity model, is demonstrated to provide extra measurements for rehabilitation and to evaluate performance during rehabilitation.
The remaining chapters of this dissertation are organized as follows. Chapter 2 describes related work on multiple target tracking, human motion tracking approaches with different sensors, tackling occlusion in urban environments, and multiple sensor fusion approaches, together with a discussion comparing the existing approaches. Chapter 3 introduces the single sensor approach to the occlusion issue. The necessary modules and mathematical foundations of interacting object tracking with the virtual measurement model are described in detail. As for the heterogeneous sensor fusion approach, the framework based on heterogeneous sensor simultaneous localization, tracking, and modeling, the necessary heterogeneous sensor measurement models, and the upper extremity model for occlusion are presented in Chapter 4. The evaluation results of the single sensor approach and the heterogeneous sensor fusion approach are reported in Chapter 5. Finally, Chapter 6 concludes with the findings and achievements of the proposed frameworks in tackling the occlusion issue.
CHAPTER 2
Related Work
In order to help machines and robots understand the world better, it is necessary to explore the limits of current sensor capabilities. Several approaches that exploit different sensors have been developed in specific areas for multiple target tracking. This chapter describes and discusses several aspects of this problem: multiple target tracking itself, its application with different types of sensors, and reducing or dealing with occlusion with either a single sensor or multiple sensor fusion. Existing work in these areas establishes the baseline and the fundamental knowledge, and offers a view of how machine perception can be extended for multiple target tracking.
2.1. Multiple Target Tracking Approaches
The moving object tracking problem can be formulated in the probabilistic robotics representation as
$$P(x_t, s_t \mid Z_t) \tag{2.1}$$
where $x_t$ is the state estimate of the moving target, $s_t$ is the motion mode of the target, and $Z_t$ denotes all measurements up to time $t$. This area has been developed for decades, with numerous approaches in many applications. The moving object tracking problem has been extended to multiple target tracking at least since (Reid, 1979). Reid summarized the state of the art at that time and proposed a framework for multiple target tracking that also addresses many modules and issues important to later approaches, such as initiating tracks, accounting for false or missing reports, and processing sets of dependent reports. This work also points out the challenging data association issue in multiple target tracking.
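As a reading aid, the posterior in (2.1) can be factored with the chain rule of probability. The block below is a standard identity rather than a result of this dissertation, and it assumes that the motion mode takes values in a finite set so that the marginal over modes is a sum.

```latex
\begin{align}
P(x_t, s_t \mid Z_t) &= P(x_t \mid s_t, Z_t)\, P(s_t \mid Z_t), \\
P(x_t \mid Z_t) &= \sum_{s_t} P(x_t \mid s_t, Z_t)\, P(s_t \mid Z_t).
\end{align}
```

The first factor is a state estimate conditioned on one motion mode and the second is the probability of that mode, which is the structure that multiple model approaches exploit when they run one filter per mode and weight the results.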
Data association and motion modeling are two of the most challenging problems in multi-target tracking. Classical approaches such as the multiple hypothesis tracking (MHT) algorithm (Cox & Hingorani, 1996) and the joint probabilistic data association (JPDA) approach (Fortmann et al., 1983) have been extensively applied to solve data association in many applications. Different motion patterns from a wide variety of moving objects make motion modeling difficult. Thus, (Magill, 1965) provides a straightforward multiple model approach, and, with a different type of model interaction, (Blom & Bar-Shalom, 1988) propose the interacting multiple model approach to estimate various motion models for moving object tracking.
Because multiple moving targets perform complicated maneuvers, several multiple motion model approaches have been demonstrated. There are three generations of multiple model approaches for moving object tracking. The first generation is called the autonomous multiple model (AMM). As an example of AMM, Magill (Magill, 1965) estimates the moving object with filters for different motion models and generates the result as a weighted sum of these multiple models. The second generation lets the motion models cooperate with each other, which is why it is also called the cooperating multiple model (CMM). Many different cooperation strategies have been proposed, such as the generalized pseudo-Bayesian algorithms of order n (GPB1, GPB2, GPBn) (Jaffer & Gupta, 1971), (Bar-Shalom & Li, 1993), (Sugimoto & Ishizuka, 1983), and the interacting multiple model (IMM) (Blom & Bar-Shalom, 1988). These approaches use a fixed, common memory depth to reduce the number of hypotheses in the model-switching tree. With its mixing step, IMM provides a more cost-effective reinitialization method than GPBn.
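The weighted-sum output of the first generation and the mixing step of IMM can be sketched in a few lines. The code below is a simplified illustration under assumptions, not a complete filter: each model's filter is assumed to have already produced a state estimate and a measurement likelihood, covariance mixing is omitted, and the function names, transition matrix, and numbers are hypothetical.

```python
def update_model_probabilities(priors, likelihoods):
    """Bayes update of the model probabilities from per-model likelihoods."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def fuse_estimates(weights, estimates):
    """Weighted sum of per-model state estimates (lists of equal length)."""
    dim = len(estimates[0])
    return [sum(w * e[d] for w, e in zip(weights, estimates))
            for d in range(dim)]

def imm_mixing(mu, trans, estimates):
    """IMM mixing step for the state means only.

    mu[i]       probability of model i after the previous update
    trans[i][j] assumed probability of switching from model i to model j
    Returns the predicted model probabilities and, for each model j, the
    mixed initial state used to restart filter j.
    """
    n = len(mu)
    predicted = [sum(trans[i][j] * mu[i] for i in range(n)) for j in range(n)]
    mixed = []
    for j in range(n):
        mixing_weights = [trans[i][j] * mu[i] / predicted[j] for i in range(n)]
        mixed.append(fuse_estimates(mixing_weights, estimates))
    return predicted, mixed

# Hypothetical one-dimensional example with two motion models.
estimates = [[0.0], [4.0]]                       # per-model state estimates
mu = update_model_probabilities([0.5, 0.5], [0.2, 0.6])
combined = fuse_estimates(mu, estimates)          # AMM-style weighted sum
predicted, mixed = imm_mixing(mu, [[0.9, 0.1], [0.1, 0.9]], estimates)
```

The mixing step collapses the model history into a single mixed initial state per filter, which is why the number of hypotheses stays fixed instead of growing with the model-switching tree.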
After the first and second generations of multiple model approaches were proposed, (Li & Jilkov, 2005) summarized them and described the third generation, the variable structure multiple model (VSMM) method. The third generation removes the fixed structure assumption and organizes the motion models into different model sets that vary over time. It is more computationally efficient than the multiple model methods of the previous two generations for a wide variety of motion patterns.
2.2. Motion Modeling regarding Interactions with Scene and Moving Targets
Only a few works address the observation and motion modeling issues of interactions among the tracked objects and the scene implicitly. Khan et al. (Khan et al., 2005) propose a Markov chain Monte Carlo (MCMC)-based particle filter to track interacting ants, in which interactions are modeled through a Markov random field motion prior. Their interaction potential is based only on static poses, which yields no higher-level scene understanding. Smith et al. (Smith et al., 2005) use a simple interaction model to penalize object overlapping. Sullivan and Carlsson (Sullivan & Carlsson, 2006) propose constructing an interaction graph and then applying a two-stage clustering scheme to label the identity of the target. Instead of modeling interactions explicitly, these studies use the term “interaction” to describe situations in which the target and adjacent objects share common measurements and cannot be correctly labeled. In these existing approaches, interactions are treated as adverse information to be handled in order to avoid wrong estimates.
Besides the interactions among moving objects, there are approaches that address scene understanding and semantic mapping. In (Kostavelis & Gasteratos, 2015), Kostavelis and Gasteratos classify existing work in the semantic mapping field according to four major aspects: scalability, topological map, temporal coherence, and inference model. For scalability, they categorize semantic mapping-related works as either indoor or outdoor and as either single-scene or large-scale. For example, Trevor et al. (Trevor et al., 2013) introduce an efficient connected component solution for RGB-D data of a single-scene point cloud. They segment the single-scene point cloud into different objects that are ready to be assigned semantic meanings.
Liu et al. (Liu & von Wichert, 2014) extract the semantics of domestic environments based on the occupancy grid map. Using Markov chain Monte Carlo (MCMC)-based sampling and the maximum posterior solution, (Liu & von Wichert, 2014) group occupancy grids into semantic rooms for their later human-robot interaction application. Sengupta et al. (Sengupta et al., 2012) exploit two conditional random fields to provide a street-level semantic map from visual imagery. The semantic map classifies a 14.8 km route into 13 different classes, such as road, building, and tree. This work, however, focuses on static scenes or objects. Because it does not provide information on moving objects, existing work is unable to model moving objects in such environments. In general, most semantic mapping approaches focus on static scenes or objects, whether the input data comes from visual images, 2D occupancy maps, or 3D point clouds.
Wolf et al. (Wolf & Sukhatme, 2008) not only focus on the semantic terrain mapping problem, but also introduce a solution to the semantic activity mapping problem using supervised learning techniques, namely hidden Markov models (HMMs) and support vector machines (SVMs). In their activity-based semantic mapping, HMMs and SVMs are used to determine the spatial usage of dynamic entities in urban environments. The space is then classified into two categories, street and sidewalk. Nevertheless, the final semantic map from (Wolf & Sukhatme, 2008) is not intended for directly tracking moving objects: it only summarizes how the moving objects utilize space.
2.3. Tackling Occlusion in Urban Environments
Occlusion occurs because of the limited perception capability of sensors, surrounding objects, and the environment. Surrounding objects lead to occlusion when the density of objects is high, and different environments, from crowded urban scenes to free space, yield perception results with different occluded areas. Many approaches in the computer vision literature handle the occlusion problem, from partial occlusion to full occlusion. In (Yilmaz et al., 2004), occlusion is detected and the contour of the tracked object is recovered according to the shape model for tracking. Ross et al. (Ross et al., 2008) introduce a forgetting factor while incrementally updating the eigenbasis and mean of the target appearance for tracking. With its ability to track variations, the method can track under partial occlusion. Tang et al. (Tang et al., 2014) propose a double-person detector model for human detection in order to handle partial occlusion in crowded urban environments.
The occlusion problem arises not only in the computer vision literature but also with other sensors. Nashashibi and Bargeton (Nashashibi & Bargeton, 2008) compensate for occlusion by determining the occluded sides and using object classification to identify the occluded part; a confidence level estimate is then used in Kalman filter-based tracking. Almeida (Almeida, 2010), on the other hand, assumes that points do not change while they are occluded and creates new data with the occluded points to compensate the sensor data for tracking. Wyffels et al. (Wyffels & Campbell, 2015) utilize occlusion as negative information: they assume the tracked targets will exist in the next step and treat missing targets as negative information to estimate the occlusion or to remove nonexistent tracks. Most such works attempt to balance the effects of occlusion and utilize the information to remove false tracks.
Table 2.1. Various approaches and their models for different tracking levels
Item | Galceran et al. (Galceran et al., 2015) | Interacting object tracking (Wang et al., 2007) (Wan et al., 2008) (Lin et al., 2011) | Virtual measurement model (Chen et al., 2018)
Tracking scheme | hGMM | VSMM | VSMM
L1 tracking | Driving behavior model | CV/CA model | CV/CA model
L2 tracking | Road network model | Scene interaction model | Scene interaction model
L3 tracking | N/A | Neighboring interaction model | Neighboring interaction model
In occlusion | Hypotheses with driving behavior model and road network model | VSMM prediction | Virtual measurement model
In (Yu & LaValle, 2012), Yu and LaValle propose to estimate objects in occluded areas. By splitting and merging shadow regions through time, they create a graph of the shadow information spaces to find occluded targets. Then, given the probabilistic disturbance of the observation, the max flow or probability mass propagation algorithm is applied to find the number of hidden agents. (Yu & LaValle, 2012) topologically estimates occluded objects, while (Galceran et al., 2015) attempts to precisely track occluded objects inside shadow regions. Galceran et al. (Galceran et al., 2015) introduce a road network model and a driving behavior model to perform estimation under occlusion. The road network model is composed of a discrete set of policies with a prior map of the environment, while the driving behavior model is composed of prescribed commands of