Object detection and tracking have been researched for a long time and have been developed for different sensors. According to the properties of the sensor, object detection methods can be categorized into two types: beam-type sensor based and vision-based. In the first category, a beam-type sensor, such as a laser range finder or an ultrasonic sensor, provides spatial information by returning the positions of points in the environment. In [34: Wolf et al. 2004], the authors proposed a moving object detection method that constructs a static grid map and compares each scan with this static grid to filter out the dynamic points. However, tracking laser points is challenging, since no other information is available to determine how to link an object point to the same object in the next scan correctly. This is well known as the data association problem [62: Thrun 2005]. Although many hypothesis-based approaches have been developed to overcome the problem, solving data association from spatial information alone is still hard and yields ambiguous results.
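The idea of comparing each scan against a static grid can be illustrated with a minimal sketch. It is not the implementation of [34: Wolf et al. 2004]; it assumes 2D points in metres, a fixed cell size, and a simple hit-count rule for deciding which cells are static. The function names and the `min_hits` parameter are hypothetical.

```python
import math

def cell_of(point, cell_size):
    """Map a 2D point (x, y) in metres to integer grid-cell indices."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))

def build_static_grid(scans, cell_size, min_hits):
    """Mark a cell as static if at least `min_hits` scans observed a point in it
    (assumed rule, chosen for illustration)."""
    hits = {}
    for scan in scans:
        # Count each cell at most once per scan.
        for cell in {cell_of(p, cell_size) for p in scan}:
            hits[cell] = hits.get(cell, 0) + 1
    return {cell for cell, n in hits.items() if n >= min_hits}

def dynamic_points(scan, static_grid, cell_size):
    """Return the points of `scan` that fall outside every static cell."""
    return [p for p in scan if cell_of(p, cell_size) not in static_grid]

# Three earlier scans of a wall near x = 2 m; a moving object appears in only one of them.
history = [
    [(2.05, 0.1), (2.05, 0.6)],
    [(2.05, 0.1), (2.05, 0.6), (1.1, 0.3)],
    [(2.05, 0.1), (2.05, 0.6)],
]
static_grid = build_static_grid(history, cell_size=0.5, min_hits=3)

# In the new scan the wall is still there and the object has moved to x = 0.6 m.
new_scan = [(2.05, 0.1), (2.05, 0.6), (0.6, 0.3)]
moving = dynamic_points(new_scan, static_grid, cell_size=0.5)
```

The sketch separates dynamic points from the static background, but it does not say which object each dynamic point belongs to in the next scan. That remaining step is the data association problem discussed above.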
On the other hand, the vision-based category can be divided into mono camera and RGB-D type sensors. The main difference between these two subcategories is whether a corresponding range image is available for the color image. Object detection based on a mono camera has been researched for a long time, since a camera provides abundant visual information about the object appearance. In [22: Saravanakumar et al. 2010], the authors proposed a background subtraction method to retrieve dynamic objects, whose performance depends on the quality of the background model. To model the background, [23: Lee et al. 2003] proposed using a Gaussian Mixture Model (GMM) built from several image frames. [24: Barnich et al. 2011] proposed the visual background extractor (ViBe) to achieve better performance than GMM. Both methods need several images to construct the background, and thus the sensor cannot move too fast.
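To make the dependence on several frames concrete, the following sketch uses an exponential running average as the background model. This is a deliberately simpler stand-in, not the GMM of [23: Lee et al. 2003] or ViBe [24: Barnich et al. 2011]. Frames are assumed to be grayscale nested lists, and `alpha` and `threshold` are hypothetical parameters.

```python
def update_background(background, frame, alpha):
    """Blend the new frame into the background model (exponential running average)."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]

def foreground_mask(background, frame, threshold):
    """Flag pixels whose intensity differs from the background by more than `threshold`."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]

# Five frames of a static scene with intensity 10 are used to build the model.
frames = [[[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]] for _ in range(5)]
background = frames[0]
for frame in frames[1:]:
    background = update_background(background, frame, alpha=0.1)

# A bright object enters at row 0, column 1.
current = [[10.0, 200.0, 10.0], [10.0, 10.0, 10.0]]
mask = foreground_mask(background, current, threshold=30.0)
```

Because the model is accumulated pixel by pixel over consecutive frames, it is only meaningful while the same pixel keeps observing the same part of the scene, which is why such methods assume a static or slowly moving sensor.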
[25: Enzweiler et al. 2009] mentioned that moving objects can be extracted by estimating the optical flow of the image to find moving pixels. A similar concept is to track features on the object to detect moving objects in the image plane [26: Tang et al. 2008].
In addition, training-based algorithms are popular for detecting specific objects. For example, [21: Dalal et al. 2005] proposed using histograms of oriented gradients (HOG) to detect humans based on the edge orientations of the human figure. [32: Viola et al. 2003] proposed a pedestrian detection method trained on preset pedestrian patterns using Haar wavelets. However, training-based methods must be trained on a set of object patches, and only the specific object class covered by the training data, such as humans or vehicles, can be detected.
A stereo camera provides a color image with corresponding depth, which has abundant [...] and tracking can be constructed more easily by combining the information of the two different spaces. To detect objects, the v-disparity approach was first proposed in [27: Labayrade et al. 2002] and has become increasingly popular. The disparity map is projected to v-disparity space by accumulating, for each image row v, the pixels that share the same disparity value. [7: Hu et al. 2012] and [38: Krotosky et al. 2007] extended the work of Labayrade et al. into the u, v-disparity approach, which uses the Hough transform to extract object bounding boxes. These methods have the drawback that, in complicated scenarios, the lines of an object bounding box become discontinuous in the Hough transform line extraction stage. Therefore, some objects may not be completely enclosed by their bounding boxes.
Other approaches are based on grid mapping. [31: Oniga et al. 2010] constructed a digital elevation map (DEM) to check the height of each grid cell, and a density map to check the measurement density of each grid cell. Both the DEM and the density map are constructed in Cartesian space. Using these two grid maps, obstacle grid cells are extracted and the corresponding object positions in the image are found by perspective mapping. Although the authors accounted for the fact that a distant grid cell receives fewer measurement points due to perspective projection when constructing the density map, extracting obstacle cells by checking the density map alone is incomplete, because the density of a grid cell may also be affected by partial occlusion or missing data. In [29: Perrollaz et al. 2012], Perrollaz et al. proposed a visibility-based occupancy grid calculation method for an efficient and formal treatment of u-disparity occupancy grid construction. Instead of using density to describe the occupancy of a cell, the visibility-based grid map considers, for each grid cell, the ratio between the number of valid disparity pixels and the number of disparity pixels that actually hit (measure) the obstacle, and formally uses a probability formula to describe the occupancy of the cell. Based on occupancy grid mapping, an object can be tracked with a Kalman filter [36: Barth et al. 2009] or a particle filter [35: Danescu et al. 2012] within the Bayesian framework.
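The v-disparity projection mentioned above can be sketched in a few lines. The sketch assumes an integer-valued disparity map stored as nested lists, with negative values marking pixels that have no stereo match. The function name and the toy data are illustrative and are not taken from [27: Labayrade et al. 2002].

```python
def v_disparity(disparity_map, max_disparity):
    """Build the v-disparity image: entry [v][d] counts the pixels on
    image row v whose integer disparity equals d."""
    image = []
    for row in disparity_map:
        histogram = [0] * (max_disparity + 1)
        for d in row:
            # Negative (invalid) and out-of-range disparities are skipped.
            if 0 <= d <= max_disparity:
                histogram[d] += 1
        image.append(histogram)
    return image

# Toy 4x5 disparity map: the ground disparity grows with the row index,
# and an obstacle with constant disparity 3 occupies columns 2-3 of rows 0-2.
disparity_map = [
    [1, 1, 3, 3, 1],
    [2, 2, 3, 3, 2],
    [3, 3, 3, 3, 3],
    [4, 4, -1, 4, 4],
]
vd = v_disparity(disparity_map, max_disparity=4)
for row in vd:
    print(row)
```

In the resulting image the ground appears as a slanted line (one dominant disparity per row, increasing with v), while the obstacle appears as a vertical segment at d = 3. Line extraction with the Hough transform exploits this structure.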
However, such a system encounters the same data association problem as beam-type sensors when it tracks multiple objects. For example, although the particle-based tracking method proposed in [35: Danescu et al. 2012] can track multiple objects in most cases, tracking fails when two objects move across each other. In [36: Barth et al. 2009], the authors proposed a track-before-detect scheme that solves the data association problem by tracking image features and then grouping the features by their 3D motion. In [37: Nedevschi et al. 2007], data association is solved by tracking the features inside the object bounding box. These methods solve the data association problem quite well while the object stays in the camera field of view (FOV). However, they may fail when the object is viewed from a different direction as it returns to the camera FOV. This is because the feature points are too sparse and too distinctive to [...] most cases, the hue and saturation distributions of an object in HSV color space do not change dramatically. Therefore, in this thesis, the color distributions of the object, rather than feature points, are used as the feature vectors that describe the object.
In this thesis, object detection is performed by slightly modifying the visibility-based occupancy grid construction method proposed in [29: Perrollaz et al. 2012], and data association is solved by using the distributions of the hue and saturation of the object as the feature vector. A tracking strategy is proposed to update the state of an object in different situations.
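As a rough illustration of a hue-saturation feature vector, the sketch below builds a normalised 2D hue-saturation histogram from the RGB pixels of an object and compares two such histograms. The bin counts and the Bhattacharyya coefficient as similarity measure are assumptions made for this example only; the measure actually used for data association may differ.

```python
import colorsys
import math

def hs_histogram(pixels, h_bins=8, s_bins=4):
    """Normalised 2D hue-saturation histogram of RGB pixels (channels in 0..255),
    flattened into a feature vector of length h_bins * s_bins."""
    hist = [0.0] * (h_bins * s_bins)
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # h and s lie in [0, 1]; clamp the upper edge into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        hist[hi * s_bins + si] += 1.0
    total = sum(hist)
    return [x / total for x in hist] if total else hist

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalised histograms (1.0 means identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

# The same orange-and-green object under bright and dim light (the value channel
# changes, hue and saturation do not), and a different, blue object.
object_bright = [(200, 100, 40), (220, 110, 30), (180, 90, 30), (40, 200, 60)]
object_dim = [(100, 50, 20), (110, 55, 15), (90, 45, 15), (20, 100, 30)]
other_object = [(30, 30, 200), (40, 35, 220), (20, 25, 180), (60, 40, 200)]

same = bhattacharyya(hs_histogram(object_bright), hs_histogram(object_dim))
different = bhattacharyya(hs_histogram(object_bright), hs_histogram(other_object))
```

Dropping the value channel makes the descriptor insensitive to overall brightness in this toy case, and the histogram does not depend on where individual pixels lie on the object, unlike sparse feature points.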
Figure 2.2: The object detection and tracking categories. (Diagram labels: Vision-based; Local Dynamic Map [33: Wolf].)