CHAPTER 3. MONOCULAR VISUAL NAVIGATION DESIGN
3.3 GROUND PLANE AND OBSTACLE DETECTION
To obtain robust matches between corresponding points, the feature extractor should be invariant to rotation, scale, illumination, and image noise. In previous related works, the Harris and FAST [39] corner detectors have been widely adopted [28-31], but the robustness of these detectors is relatively limited in practical applications. In the last decade, there have been many advances in scale- and rotation-invariant feature extractors, such as the Scale-Invariant Feature Transform (SIFT) [40], which uses an array of image gradient histograms as a descriptor instead of raw image patches. However, the heavy computational load of the SIFT algorithm makes it unsuitable for real-time navigation applications. This work therefore investigates a combination of robustness and execution speed. In the current design, the FAST corner detector is first applied to find interest points. These points are then described with Speeded Up Robust Features (SURF) [41], chosen for its robustness and execution speed. FAST provides high-speed corner detection and can reach full-frame feature detection at up to 400 Hz. SURF preserves robustness while being about three times faster than SIFT in the feature extraction step, and it also provides an indexing value for fast matching. Nevertheless, these methods may still fail under excessive motion blur or abrupt changes in view angle. The speed of the robot, and therefore the camera motion, is limited accordingly in this study.

Fig. 3-3: An example of ray casting in an image after IPT. The white area indicates the ground region. The dark area indicates obstacles. The distances between the camera center and obstacles can be estimated by measuring the length of the rays.
Once features are found and matched, the next step is to determine whether these feature points are on the ground. Consider a ground plane that is projected into two views taken at different positions, as shown in Fig. 3-4.

Fig. 3-4: Homography of coplanar points observed from two images.

With the pinhole camera model, two views of the same plane are related by a unique homography [20]. That is, for a plane $\pi = [n^T\ 1]^T$, the ray through a point $p$ in image $I$ and the ray through its corresponding point $p'$ in image $I'$ meet at a point $P$ in the plane $\pi$. Therefore, if a set of coplanar points $p_i$ with homogeneous image coordinates $(x_i, y_i, 1)^T$ is found, and their correspondences $\{p_i \leftrightarrow p'_i\}$ in the two images are also found, there exists a 3 by 3 homography matrix $H$ such that

$$p'_i \simeq H p_i, \qquad H = \lambda K \left( R - \frac{t\, n^T}{d} \right) K^{-1} \qquad (3)$$
where $K$ is the intrinsic camera parameter matrix, $R$ is the rotation matrix, $t$ is the translation vector, $\lambda$ is the scale factor, and $d$ is the distance between the plane and the camera. To determine $H$, four non-degenerate point correspondences are required, since $H$ has eight degrees of freedom and each point correspondence provides two independent constraints. In the proposed system, the features on the ground are determined initially, i.e., a subset of $p_i$ is known to lie on the ground plane. The homography of the ground plane can thus be determined initially. To further reduce possible matching errors, we apply RANdom SAmple Consensus (RANSAC) to eliminate outliers and robustly determine $H$ [42-43]. Note that the $H$ matrix here is only used to find a robust plane relationship between two frames. It does not affect the homography matrix applied in the IPT computation. Once the homography matrix is determined, each of the remaining corresponding points can be tested for membership in the ground plane by using the back projection technique such that: classification should be. Since the mobile robot can run over certain small obstacles, a small range of tolerance is allowed in the feature classification. This allows the mobile robot to navigate on somewhat uneven surfaces. A fixed value is determined beforehand in the current implementation. Fig. 3-5 illustrates an example of ground feature classification from two images.

Note that homography estimation has its limitations. For instance, in the case of near-zero camera translation, i.e., $t \approx 0$ in (3), no information on coplanarity can be inferred since the plane normal $n$ can be arbitrary [30]. This condition must be checked, since in this case all the points will be classified as ground. A simple test is to determine whether all observed points conform to the same homography. This test will fail only when there is just a single scene plane visible, or when the camera intrinsic parameters are changed. Both of these cases can be avoided in the current design.

Fig. 3-5: An example of ground feature detection. (a) and (b) are images taken from different views. (c) is the classification result of the matched features observed from (a) and (b). The features with blue labels are on the ground, while the features with orange labels are off the ground.
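The two steps described in this section, estimating the ground homography robustly from matched features and then testing the remaining correspondences by back projection, can be sketched roughly as follows. This is a minimal pure-Python illustration and not the implementation used in this work: the function names, the pixel tolerance `tol`, the iteration count, and the choice to fix the last entry of H to 1 (which breaks down if that entry is actually zero) are all assumptions made for the example.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography_from_4(pairs):
    """Estimate H from four ((x, y), (x', y')) pairs, assuming H[2][2] = 1.

    Each pair contributes two linear equations, so four pairs give the
    eight equations needed for the eight remaining unknowns.
    """
    A, b = [], []
    for (x, y), (u, v) in pairs:
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def transfer_error(H, p, q):
    """Pixel distance between q and the projection of p through H."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return float("inf")
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return math.hypot(u - q[0], v - q[1])


def ransac_homography(pairs, tol=2.0, iters=200, seed=0):
    """Keep the 4-point model with the most inliers (hypothetical defaults)."""
    rng = random.Random(seed)
    best_H, best_inliers = None, []
    for _ in range(iters):
        try:
            H = homography_from_4(rng.sample(pairs, 4))
        except ValueError:
            continue  # singular system, e.g. collinear points: draw again
        inliers = [pq for pq in pairs if transfer_error(H, *pq) < tol]
        if len(inliers) > len(best_inliers):
            best_H, best_inliers = H, inliers
    return best_H, best_inliers


def classify_ground(H, pairs, tol=2.0):
    """Label a correspondence as ground when it agrees with H within tol."""
    return [transfer_error(H, p, q) < tol for p, q in pairs]
```

Here `tol` plays the role of the fixed tolerance mentioned above: a larger value lets features on small bumps still count as ground, at the cost of admitting low obstacles.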
3.3.2 Ground Region Segmentation
In Section 3.3.1, the feature points that belong to the ground plane are determined. Pixels that are not selected as feature points, however, still need to be classified as ground or off-ground. In this work, this problem is resolved by first segmenting the image into regions and then classifying each region as ground or off-ground.
The quality of segmentation depends not only on the algorithm used, but also on the targeted application. In this design, the segmentation method should be able to distinguish objects in image frames quickly and robustly under various environmental conditions. In the current design, an image frame is segmented by a multi-scale Mean-shift algorithm in the HSI (Hue-Saturation-Intensity) color space. The Mean-shift algorithm [44] is a nonparametric, iterative clustering technique that neither requires prior knowledge of the number of clusters nor constrains the shape of the clusters. It is therefore suitable for unsupervised color segmentation.
The proposed multi-scale Mean-shift algorithm is summarized as follows:
Step 1. Choose a search window with proper bandwidth.
Step 2. Choose the initial location of the search window.
Step 3. Compute the mean value (centroid) of the data in the search window.
Step 4. Move the center of the search window to the mean location computed in Step 3.
Step 5. Repeat Steps 3 and 4 until convergence.
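The five steps can be illustrated with a minimal one-dimensional sketch. It assumes a flat (uniform) kernel and scalar samples such as hue values; the function names and the rule for merging nearby modes are hypothetical, and the multi-scale and HSI-specific parts of the proposed algorithm are not reproduced here.

```python
def mean_shift_mode(data, start, bandwidth, tol=1e-6, max_iter=100):
    """Move a flat search window from `start` until it settles on a mode."""
    center = start                                   # Step 2: initial location
    for _ in range(max_iter):
        window = [x for x in data if abs(x - center) <= bandwidth]
        if not window:
            break
        mean = sum(window) / len(window)             # Step 3: window centroid
        if abs(mean - center) < tol:                 # Step 5: convergence
            return mean
        center = mean                                # Step 4: move the window
    return center


def mean_shift_cluster(data, bandwidth, merge_tol=None):
    """Start a window at every sample and group samples by converged mode."""
    if merge_tol is None:
        merge_tol = bandwidth / 2.0                  # assumed merging rule
    modes, labels = [], []
    for x in data:
        mode = mean_shift_mode(data, x, bandwidth)   # Step 1: fixed bandwidth
        for k, existing in enumerate(modes):
            if abs(mode - existing) <= merge_tol:
                labels.append(k)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


samples = [0.10, 0.12, 0.11, 0.13, 0.80, 0.82, 0.81, 0.79]
modes_small, labels_small = mean_shift_cluster(samples, bandwidth=0.1)
modes_large, labels_large = mean_shift_cluster(samples, bandwidth=1.0)
```

With the small bandwidth the two groups of samples are separated, whereas the large bandwidth merges them into a single cluster, which is the under-segmentation effect of an overly large bandwidth.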
Note that in Step 1, the bandwidth of the kernel needs to be determined. In particular, the image is under-segmented when the bandwidth is too large, and over-segmented when the bandwidth is too small. In this work, a proper bandwidth is determined dynamically according to the frequency analysis results of the image [45]. For instance, a cluttered image, such as one of a crowd, often has larger energy at high frequencies and requires a smaller bandwidth value. In contrast, a simple image, such as one of a white wall, results in a larger bandwidth value.
Additionally, since the hue values of pixels are unstable under low intensity and low saturation, these colorless pixels should be segmented separately. In the current implementation, the hue value of a pixel is set to 2 if the pixel has a saturation value < 0.1 or an intensity value < 0.1.
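The treatment of colorless pixels can be sketched as below. The RGB-to-HSI conversion follows the common textbook formulas, and the sketch assumes that hue is scaled to the unit interval, so that the value 2 acts as an out-of-range label for colorless pixels; that scaling, the function names, and the parameter names are assumptions rather than details taken from the implementation.

```python
import math

ACHROMATIC_HUE = 2.0  # label value quoted in the text for colorless pixels


def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (hue on the unit interval, saturation, intensity)."""
    intensity = (r + g + b) / 3.0
    saturation = 0.0 if intensity == 0 else 1.0 - min(r, g, b) / intensity
    # The radicand is non-negative in exact arithmetic; clamp rounding noise.
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0:
        return 0.0, saturation, intensity  # gray pixel: hue is undefined
    cos_theta = max(-1.0, min(1.0, 0.5 * ((r - g) + (r - b)) / den))
    theta = math.acos(cos_theta)
    hue = theta if b <= g else 2.0 * math.pi - theta
    return hue / (2.0 * math.pi), saturation, intensity


def stable_hue(r, g, b, s_min=0.1, i_min=0.1):
    """Return the hue, or the achromatic label when the hue is unreliable."""
    hue, saturation, intensity = rgb_to_hsi(r, g, b)
    if saturation < s_min or intensity < i_min:
        return ACHROMATIC_HUE
    return hue
```

Gray pixels and very dark pixels then share one label and are grouped together by the clustering step instead of being scattered by noisy hue values.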
The proposed segmentation algorithm takes 0.1 to 0.5 seconds to process an image of 640 × 480 pixels. To further boost the speed for real-time applications, images are scaled down to one-tenth beforehand. Small objects may be neglected in this case. Therefore, the proposed algorithm estimates the purity of each segment at the original resolution and segments it again if needed. The modified method is on average 10 times faster than segmentation at the original scale.
3.3.3 Obstacle Detection
After the ground and off-ground features are determined and the image is segmented into regions, it is possible to classify the ground region in a pixel-wise manner. As mentioned in Section 1, many previous works warp the image by using the homography matrix and calculate the Sum of Absolute Differences (SAD) between the warped or rectified image and the current image. However, it is difficult to determine a proper threshold value, since the SAD value depends on the environment. Furthermore, homogeneous obstacles may be neglected.
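The weakness of a plain SAD test on homogeneous obstacles can be seen in a small sketch. The patches, the cyclic shift used to mimic the misalignment of an off-ground region after warping, and the function names are all hypothetical and serve only as an illustration.

```python
def sad(patch_a, patch_b):
    """Sum of Absolute Differences between two equally sized gray patches."""
    return sum(abs(a - b)
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def shift_columns(patch, dx):
    """Cyclically shift each row by dx pixels to mimic a misaligned warp."""
    return [row[-dx:] + row[:-dx] for row in patch]


textured = [[0, 255, 0, 255],
            [255, 0, 255, 0]]
uniform = [[128, 128, 128, 128],
           [128, 128, 128, 128]]

sad_textured = sad(textured, shift_columns(textured, 1))
sad_uniform = sad(uniform, shift_columns(uniform, 1))
```

The textured patch produces a large SAD when misaligned by one pixel, while the uniform patch produces a SAD of zero, so no threshold on SAD alone can flag the uniform obstacle.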
The proposed system determines whether a region is on the ground according to the displacement and the distortion of each segment separately. Similar techniques have been used in stereo vision when estimating disparity maps [46]. To do so, corresponding segments need to be found. While segments that contain already-matched feature points can be found easily, homogeneous segments are matched using both feature points and their color distribution. The proposed segment matching method finds the maximum of the overall likelihood function of multiple cues, that is,

$$L(a \mid b) = L_{color}(a \mid b)\, L_{motion}(a \mid b)$$
The color likelihood model is defined to estimate the similarity between the color histograms of two segments. The Bhattacharyya coefficient has been widely used to determine the similarity of two discrete probability distributions [47]. The coefficient lies within the interval [0, 1]; the larger it is, the more similar the two histograms are. Assuming two objects $a$ and $b$ with color histograms $h_a = \{h_{a,1}, \ldots, h_{a,N}\}$ and $h_b = \{h_{b,1}, \ldots, h_{b,N}\}$, the color cue likelihood function between
Segments that fail to be matched are labeled as undetermined. Finally, each matched segment is assigned an initial probability value of 0.5. The probability value of each segment is then updated on the basis of SAD, and the segments are classified as ground or off-ground. Both static and moving obstacles can be observed. Fig. 3-6 shows the result of both segmentation and ground plane labeling based on the images in Fig. 3-5. Note that a piece of paper lying on the ground in the image is classified as ground, as expected.
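The Bhattacharyya coefficient used for the color cue can be computed as in the following sketch. It covers only the coefficient between two normalized histograms; how the coefficient enters the color likelihood function is not reproduced here, and the function names are hypothetical.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [v / total for v in hist]


def bhattacharyya_coefficient(hist_a, hist_b):
    """Similarity in [0, 1] between two histograms with the same bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    pa, pb = normalize(hist_a), normalize(hist_b)
    return sum(math.sqrt(x * y) for x, y in zip(pa, pb))


same = bhattacharyya_coefficient([4, 2, 1, 1], [8, 4, 2, 2])
half = bhattacharyya_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
disjoint = bhattacharyya_coefficient([4, 0, 0, 0], [0, 0, 0, 4])
```

Histograms with the same shape give a coefficient of one regardless of the segment size, partially overlapping histograms give an intermediate value, and histograms with no common bins give zero.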
In summary, the proposed ground plane detection algorithm is able to segment ground regions from two images. The feature classification step determines the features on the ground plane, while also finding the homography matrix H between the ground planes in the two images. The second image can thus be rectified with H. The two images are also segmented according to color homogeneity. The resulting segments are matched with multiple cues and finally classified as ground or off-ground. In practice, the key image for matching is not obtained merely from a single frame, but also from a collection of features and segments tracked over the N previous frames. Therefore, this method can be applied even when the robot is rotating or the camera is temporarily occluded.