Chapter 4 Navigation in Outdoor Environments
4.2 Proposed Navigation Process
4.2.2 Idea of Vehicle Localization by Learned Sequential Landmarks
Although the odometer readings provide the vehicle’s position and direction in the navigation phase, they are usually too imprecise to guide the vehicle correctly to the next position. Therefore, the main task is to localize the vehicle’s position and orientation using the learned landmarks, which in this study include light poles, hydrants, sidewalk curb lines, and tree trunks, as shown in Fig. 4.3. The sequential landmarks and the characteristics of the curb along a path on a sidewalk can be used to obtain the vehicle position on the path, and the learned odometer readings help judge whether the vehicle has arrived at the correct position to detect the landmark. The process of vehicle localization is illustrated in Fig. 4.4. Two different positions of the vehicle at a node in the navigation path, and the relation among the vehicle, the curb, and the light pole, are illustrated in Fig. 4.5.
The proposed vehicle localization technique consists of two major steps. First, an object detection process continuously checks for the next-to-visit landmark after the navigation process starts. When the detection process detects a landmark at a correct node, we acquire the landmark’s depth data with respect to the vehicle by the KINECT device. Second, we obtain the recorded depth data from the learned environment parameters and match the two sets of depth data using the ICP algorithm to estimate the correct position and orientation of the vehicle according to the MSE criterion, as illustrated in Fig. 4.6. This results in a set of 3D space parameters in the CCS, including a pair of translation parameters (Xmse, Zmse) and a rotation angle θmse. The adopted technique to adjust the vehicle to a correct position is described in the following algorithm.
Figure 4.2 Flowchart of navigation process.
Algorithm 4.1 Vehicle localization and position adjustment by learned landmarks.
Input: a color/depth image and recorded landmark depth data D.
Output: None.
Steps:
Step 1. Use the SURF extraction algorithm (described in Chapter 5) to recognize a learned landmark from the input color/depth image. If an object of the learned landmark is recognized successfully by the system, take out the corresponding node data from the learned path information; otherwise, repeat Step 1.
Step 2. Obtain at the vehicle’s current position new depth data D' of the landmark by one of the three KINECT devices as specified in the learned path coordinates and the recorded one in the path data in the CCS using the ICP technique (the detailed method will be described in Chapter 5).
Step 5. Convert the coordinates (X, Y, Z) in the CCS into the coordinates (VX, VY) in the VCS in the following way:
$$V_X = X; \quad V_Y = Z. \tag{4.1}$$
At first, we define a region as the detection window and a threshold value thr for detecting landmark objects in the depth image; both are selected in the learning stage. Then, after the navigation process is started, the detection process examines the detection window in the acquired depth image and decides whether there exists any object of concern. The criterion for this decision is whether the distance between the detected object and the vehicle is smaller than the pre-selected threshold thr. If this condition is satisfied, the SURF extraction algorithm is then applied to extract the object’s feature points in the color image, which are then matched with the learned feature set to recognize the learned landmark. The detection process is illustrated in Fig. 4.7.
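The detection-window test described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function name `object_in_window`, the `(top, left, height, width)` window layout, and the convention that a depth of 0 marks a pixel without a reading are all assumptions.

```python
import numpy as np

def object_in_window(depth_mm, window, thr_mm):
    """Return True if any valid pixel in the detection window is closer than thr_mm.

    depth_mm : 2D array of distances in millimeters.
    window   : (top, left, height, width) of the detection window (hypothetical layout).
    thr_mm   : distance threshold thr selected in the learning stage.
    """
    top, left, height, width = window
    roi = np.asarray(depth_mm)[top:top + height, left:left + width]
    valid = roi[roi > 0]  # assumption: 0 means "no depth reading"
    if valid.size == 0:
        return False
    return bool(valid.min() < thr_mm)

# Usage: a flat background at 3 m with one object pixel at 1.2 m.
depth = np.full((10, 10), 3000)
depth[4, 4] = 1200
print(object_in_window(depth, (2, 2, 5, 5), 1500))
```

Only when this test succeeds would the more expensive SURF extraction and matching be run on the color image.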
Figure 4.3 Three types of landmarks selected for vehicle localization in this study. (a) Light pole. (b) Hydrant. (c) Curb line.
Figure 4.4 The vehicle localization process.
Figure 4.5 Illustration of the learned position and the current position of the vehicle in the GCS.
Figure 4.6 The depth data of a light pole recorded at position L are matched with the depth data newly acquired in the navigation process at position L′. (a) A recorded feature position with respect to the vehicle. (b) A current feature position with respect to the vehicle.
4.3 Algorithm of Navigation in Outdoor Environments
In this section, we describe the detailed process for vehicle navigation in the outdoor environment. With the learned information, the vehicle navigates along the learned path by sequential-node visiting: it visits each recorded node consecutively and conducts the specified work at the learned positions until it reaches the end point of the learned path. The entire navigation process is described in the following algorithm, and a flowchart of the complete navigation process is shown in Fig. 4.8.
Figure 4.7 Flowchart of proposed detection process.
Algorithm 4.2 Navigation Process.
Input: a learned navigation path Npath with relevant guidance parameters, and learned data of environment parameters.
Output: none.
Steps:
Step 1. Choose a start node Nstart and an end node Nend from the learned navigation path Npath, and initialize vehicle navigation from Nstart.
Step 2. Read from Npath a navigation node Nnext and related guidance parameters.
Step 3. Move the vehicle forward to node Nnext and detect the learned landmark.
Step 4. If a sidewalk following mode is adopted, detect the curb line by the curb line detection process (the detailed method will be described in Chapter 6). If successful, modify the vehicle direction accordingly; otherwise, navigate the vehicle in the blind navigation mode.
Step 5. If the detection process detects an object of concern in the detection window and its distance with respect to the vehicle is smaller than a threshold thr, then stop the vehicle and go to the next step; else, go to Step 7.
Step 6. If there exists a light pole or hydrant landmark in the current node Nnext, capture a color/depth image by the KINECT device, use the color/depth image and the learned landmark depth data D as inputs to perform Algorithm 4.1 for vehicle localization, and then go to Step 2.
Step 7. If there exists a ramp landmark in the current node Nnext, adopt the blind navigation mode, adjust the vehicle direction, and then go to Step 2.
Step 8. If there exists a tree trunk landmark in the current node Nnext, compute the position of the landmark center in the image in terms of 3D space coordinates, use the coordinates to localize the vehicle, and then go to Step 2.
Step 9. If there exists a landmark which is pre-selected as the terminal node Nend (recognized to be so by its landmark type and its number), stop the vehicle, and finish the navigation.
Step 10. Repeat Steps 3 through 9.
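The per-node dispatch in Steps 6 through 9 can be sketched as below. This is only an illustration of the sequential-node visiting idea: the `Node` class, the landmark type strings, the action names, and the `"odometry_only"` fallback for nodes without a listed landmark are hypothetical and do not come from the algorithm above. Vehicle motion, curb following, and sensing are omitted.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    landmark: str              # hypothetical labels: "light_pole", "hydrant", "ramp", "tree_trunk", "none"
    is_terminal: bool = False  # True for the pre-selected end node Nend

def localization_action(node):
    """Choose the localization action for a node according to its landmark type."""
    if node.landmark in ("light_pole", "hydrant"):
        return "icp_localization"              # Step 6: perform Algorithm 4.1
    if node.landmark == "ramp":
        return "blind_navigation"              # Step 7
    if node.landmark == "tree_trunk":
        return "landmark_center_localization"  # Step 8
    return "odometry_only"                     # assumption: fallback not stated in the algorithm

def navigate(path):
    """Visit nodes in order and stop after the terminal node (Step 9)."""
    log = []
    for node in path:
        log.append((node.name, localization_action(node)))
        if node.is_terminal:
            break
    return log

path = [Node("n1", "light_pole"), Node("n2", "ramp"),
        Node("n3", "tree_trunk", is_terminal=True), Node("n4", "hydrant")]
print(navigate(path))
```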
Figure 4.8 Flowchart of the detailed proposed navigation process.
Chapter 5 Landmark Detection and Localization Using Depth and Color Images
5.1 Introduction
Vehicle localization is an important task in building the autonomous vehicle navigation system in this study, because it allows the vehicle to be guided to a pre-selected destination successfully. For this purpose, we use pre-selected landmarks to determine the current vehicle position in the learned map during navigation. Deciding which landmarks to use is also an issue. Utilizing the capability of the KINECT device to provide 3D space information through depth images, we select objects which have prominent 3D shape information as landmarks for localization, as illustrated in Fig. 5.1. We choose rectangle-like objects, as in Fig. 5.1(a), as landmarks, because they provide translation and rotation information simultaneously. A method for feature extraction and matching for recognizing the landmark is reviewed in Section 5.2. Unfortunately, not all of the landmarks on the learned path can provide 3D information. Therefore, some other techniques of object detection and localization will be introduced in Chapter 6. In addition, how we convert depth images of landmarks into 3D space coordinates to localize the vehicle will be described in Section 5.3, and a series of algorithms for landmark detection and localization will be described in detail in Section 5.4.
Figure 5.1 Top views of three different types of objects in the depth image captured from the front of the KINECT device. (a) A rectangle. (b) A plane. (c) A cylinder.
5.2 Review of Method of Matching by Speeded Up Robust Features (SURFs)
The SURF extraction method proposed by Herbert Bay et al. [20] in 2006 includes four major stages of computation to generate a set of features, and part of the idea is based on a concept similar to that of the SIFT [21]. In this section, we give a brief review of the SURF, which is divided into two parts: detection of feature points of interest, and description and matching of such points.
5.2.1 Detection of Feature Points of Interest
A main difference between the SURF and the SIFT is that the SURF is based on a Hessian matrix approximation and integral images. The integral image allows fast computation of box-type convolution filters, which reduces the computation time drastically, and the Hessian matrix performs well in accuracy.
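To illustrate why the integral image makes box-type filters cheap, the following minimal sketch computes the sum over any rectangle from four table lookups, independent of the rectangle size. The function names `integral_image` and `box_sum` are chosen here for illustration only.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero first row and column for simpler indexing."""
    img = np.asarray(img, dtype=np.float64)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())
```

A box filter response is then a weighted combination of a few such rectangle sums, so its cost does not grow with the filter size.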
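In more detail, in the theory of the SURF, one pixel in an image I(x, y) can be represented by a Hessian matrix as follows:
$$H(I(x, y)) = \begin{bmatrix} \dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x \partial y} \\ \dfrac{\partial^2 I}{\partial x \partial y} & \dfrac{\partial^2 I}{\partial y^2} \end{bmatrix}, \tag{5.1}$$
and using the convolutions of the Gaussian second-order derivatives, the Hessian matrix H(x, σ) at x at scale σ is defined as follows:
$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}, \tag{5.2}$$
where Lxx(x, σ) is the convolution of the Gaussian second-order derivative ∂²g(σ)/∂x² with the image I at point x, and Lxy(x, σ) and Lyy(x, σ) are interpreted similarly.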
In the choice of the filter, the authors of the SURF consider discretized Gaussian filters to be non-ideal in any case, so they chose to approximate the Hessian matrix with box filters. The 9×9 box filters in Fig. 5.2 for computing the blob response maps are denoted by Dxx, Dyy, and Dxy. Therefore, the determinant of the approximated Hessian matrix can be written as:
$$\det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2. \tag{5.3}$$
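As a small worked example of Equation (5.3), the blob response can be evaluated element-wise on response maps. The function name `hessian_response` and the sample values are illustrative only; computing Dxx, Dyy, and Dxy with box filters is not shown.

```python
import numpy as np

def hessian_response(dxx, dyy, dxy, weight=0.9):
    """Approximated Hessian determinant: Dxx * Dyy - (weight * Dxy)^2."""
    dxx, dyy, dxy = (np.asarray(a, dtype=np.float64) for a in (dxx, dyy, dxy))
    return dxx * dyy - (weight * dxy) ** 2

# Single-pixel example: 2 * 3 - (0.9 * 1)^2 = 6 - 0.81
print(hessian_response(2.0, 3.0, 1.0))
```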
Figure 5.2 Left to right: the box-filter approximations used by the SURF for the second-order Gaussian partial derivatives in the y-direction (Dyy) and the xy-direction (Dxy). The grey regions are equal to zero.
The scale spaces are usually implemented as an image pyramid. The images are repeatedly smoothed with a Gaussian filter and then sub-sampled in order to reach a higher level of the pyramid. In the SIFT, the author subtracts these pyramid layers in order to get the DoG (Difference of Gaussians) images. In the SURF, the scale space is analyzed by up-scaling the filter size rather than iteratively reducing the image size, as shown in Fig. 5.3. Therefore, the SURF avoids the repeated sub-sampling and speeds up the overall computation.
Figure 5.3 Illustration of the scale-space construction in the SIFT and the SURF. (a) Iteratively reducing the image size. (b) Up-scaling the filter size according to the scale s.
A technique similar to that of the SIFT is used in the SURF extraction algorithm to localize the points of interest in the image. Each point is compared with its 8 neighbors in the same scale image and with the 9 corresponding neighbors in each of the two neighboring scale images, 26 neighbors in total, as shown in Fig. 5.4. If the point is a local maximum, it is selected as a candidate feature point. The locations of the candidate maxima of the determinant of the Hessian matrix are then refined by 3D linear interpolation in the scale and image space.
Figure 5.4 Maxima values are detected by comparing a pixel, as marked with X, with its 26 neighbors, as marked with the green circles, in 3×3 regions of the current and adjacent scales.
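The 26-neighbor comparison can be sketched as follows. This is a plain, unoptimized illustration: the array layout `(scales, rows, cols)`, the function name `local_maxima_3d`, and the response threshold are assumptions, and the subsequent interpolation step is not included.

```python
import numpy as np

def local_maxima_3d(responses, threshold=0.0):
    """Return (scale, row, col) indices that are strict maxima of their 3x3x3 neighborhood.

    responses : array of shape (scales, rows, cols) holding blob responses.
    threshold : responses at or below this value are ignored (assumed parameter).
    """
    responses = np.asarray(responses, dtype=np.float64)
    scales, rows, cols = responses.shape
    found = []
    for k in range(1, scales - 1):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                value = responses[k, i, j]
                if value <= threshold:
                    continue
                cube = responses[k - 1:k + 2, i - 1:i + 2, j - 1:j + 2]
                # Strict maximum: the center is the largest value and no neighbor ties it.
                if value == cube.max() and np.count_nonzero(cube == value) == 1:
                    found.append((k, i, j))
    return found

r = np.zeros((3, 5, 5))
r[1, 2, 2] = 5.0
print(local_maxima_3d(r))
```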
5.2.2 Description and Matching of Feature Points of Interest
In this section, we introduce how the SURF feature extractor generates the feature descriptor and matches these features. The descriptor describes the distribution of the intensity content within the neighborhood of the point of interest, similarly to the SIFT. The authors build on the distribution of first-order Haar wavelet responses in the x and y directions rather than the gradient, exploit the integral image for speed, and use only 64 dimensions. This reduces the time for feature computation and matching, and has been shown to increase the robustness at the same time. Furthermore, the authors present a new indexing step based on the sign of the Laplacian.
The first step consists of fixing a reproducible orientation based on information from a circular region around the feature point of interest. Then, a square region aligned to the selected orientation is constructed and the SURF descriptor is extracted from it. Finally, features are matched between two images. The three steps are described in more detail in the following.
1. Orientation assignment
In order to be invariant to image rotation, the Gaussian-weighted Haar wavelet responses in the x and y directions are computed within a circular neighborhood of radius 6s around the interest point, where s is the scale at which the interest point was detected. The responses are represented as points in a space with the horizontal response strength along the abscissa and the vertical response strength along the ordinate, and the dominant orientation is estimated by calculating the sum of the horizontal and vertical responses within a sliding orientation window of size π/3, as shown in Fig. 5.5. The two summed responses then yield a local orientation vector, and the longest such vector over all windows defines the orientation of the point of interest.
Figure 5.5 The dominant orientation of the Gaussian weighted Haar wavelet responses detected by the sliding orientation window.
2. Descriptor based on the sum of Haar wavelet responses
For the extraction of the descriptor, the first step consists of constructing a square region centered around the point of interest and oriented along the orientation selected in the above-mentioned scheme. The region is split up regularly into 4×4 smaller square sub-regions. For each sub-region, we compute Haar wavelet responses at 5×5 regularly spaced sample points. The derivatives dx and dy are defined as the Haar wavelet responses in the horizontal and vertical directions, respectively. Then, the wavelet responses dx and dy are summed up over each sub-region to form a first set of entries in the feature vector. In order to express the polarity of the intensity changes, the sums of the absolute values of the responses, |dx| and |dy|, are also extracted. Hence, each sub-region has a 4D descriptor vector v = (Σdx, Σdy, Σ|dx|, Σ|dy|), as illustrated in Fig. 5.6. Concatenating the vectors of all 4×4 square sub-regions results in a descriptor vector of length 64.
Figure 5.6 To build the descriptor, an oriented quadratic grid with 4×4 square sub-regions is laid over the point of interest. For each square, the wavelet responses are computed from 5×5 samples. For each field, the sums Σdx, Σ|dx|, Σdy, and Σ|dy|, computed relative to the orientation of the grid, are collected.
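The assembly of the 64-dimensional vector from the sub-region sums can be sketched as below. The sketch assumes that the Haar wavelet responses dx and dy have already been sampled on a 20×20 grid (4×4 sub-regions of 5×5 samples) aligned to the dominant orientation; the function name `surf_descriptor` is illustrative, and the Gaussian weighting and any normalization of the final vector are omitted.

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Concatenate (sum dx, sum dy, sum |dx|, sum |dy|) over 4x4 sub-regions of 5x5 samples."""
    dx = np.asarray(dx, dtype=np.float64)
    dy = np.asarray(dy, dtype=np.float64)
    if dx.shape != (20, 20) or dy.shape != (20, 20):
        raise ValueError("expected 20x20 response grids")
    entries = []
    for i in range(4):
        for j in range(4):
            bx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            by = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            entries.extend([bx.sum(), by.sum(), np.abs(bx).sum(), np.abs(by).sum()])
    return np.array(entries)

v = surf_descriptor(np.ones((20, 20)), -np.ones((20, 20)))
print(v.shape, v[:4])
```

The signed sums cancel for alternating patterns while the absolute sums do not, which is why both are kept.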
3. Fast indexing for matching
For fast indexing during matching, the SURF uses the sign of the Laplacian (i.e., the trace of the Hessian matrix) of each point of interest. The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation, as illustrated in Fig. 5.7. In the matching stage, features are compared only if they have the same type of contrast. This information is available at no extra computational cost, as it was already computed during the detection phase.
Figure 5.7 If the two types of contrasts between the two points of interest are different, it means that the candidate points do not match each other.
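The use of the Laplacian sign as a cheap pre-filter can be sketched as follows. The nearest-neighbor search with a Euclidean distance threshold `max_dist` is an assumption made for this illustration, as are the function and parameter names; the source only states that features with different contrast types are not compared.

```python
import numpy as np

def match_features(desc_a, sign_a, desc_b, sign_b, max_dist=0.3):
    """Match each feature in A to its nearest feature in B that has the same Laplacian sign."""
    matches = []
    for i, (da, sa) in enumerate(zip(desc_a, sign_a)):
        best_j, best_d = -1, max_dist
        for j, (db, sb) in enumerate(zip(desc_b, sign_b)):
            if sa != sb:
                continue  # different contrast type: skip without computing a distance
            d = float(np.linalg.norm(np.asarray(da) - np.asarray(db)))
            if d < best_d:
                best_j, best_d = j, d
        if best_j >= 0:
            matches.append((i, best_j))
    return matches

desc_a = np.array([[0.0, 0.0], [1.0, 1.0]])
desc_b = np.array([[0.0, 0.0], [1.0, 1.0]])
print(match_features(desc_a, [1, -1], desc_b, [-1, -1]))
```

In this usage example the first feature of A has no counterpart of the same sign in B, so only the second feature is matched.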
5.3 Vehicle Localization Using an Iterative Method
5.3.1 Conversion of Depth Information into 3D Space Coordinates
The depth data dv provided by the infrared sensor of the KINECT device are usually displayed as a depth image D on the monitor. The original depth data obtained by using the Kinect-for-Windows SDK are distance values ranging from 800 to 4000 in millimeters. In order to display these values on a monitor, they are usually converted into a gray-level image using the following equation:
$$D(x, y) = \frac{d_v(x, y)}{16}, \tag{5.4}$$
where dv(x, y) is the distance value of a pixel at coordinates (x, y) and D(x, y) is the resulting gray level.
By Equation (5.4), the distance range becomes 50 through 250. This shows that the depth values in the depth image have been compressed. Although the influence of this resolution reduction on the calculation result is small, we use the original depth data in this study and save them into a 2D array. A depth image obtained by converting the depth data is shown in Fig. 5.8.
Figure 5.8 A tree landmark appearing in a depth image.
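The compression caused by the division by 16 can be checked with a short sketch. The function name `depth_to_gray` and the use of integer (floor) division for non-multiples of 16 are assumptions; the source does not state how fractional values are rounded.

```python
import numpy as np

def depth_to_gray(depth_mm):
    """Map distances in millimeters to display gray levels by dividing by 16."""
    depth_mm = np.asarray(depth_mm)
    return (depth_mm // 16).astype(np.uint8)  # assumption: fractions are floored

# The sensor range 800..4000 mm maps to gray levels 50..250.
print(depth_to_gray(np.array([800, 2000, 4000])))
```

Because 16 distinct millimeter values collapse to one gray level, the original depth array is kept for the localization computations.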
In order to adjust the vehicle to its correct position by the estimated rotation and translation parameters in the GCS, we have to convert the depth data dv into 3D coordinates (X, Y, Z) in the CCS, and then convert these coordinates into the VCS. For this, based on Equation (2.3), we convert all the depth data dv by the following equations:
where the values width and height specify the 2D array size; the units of u and v are pixels, and those of X(u, v), Y(u, v), and Z(u, v) are millimeters. By Equation (5.5), we obtain the 3D space coordinates in the CCS with respect to the center of the 2D array.
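As an illustration of this kind of conversion, the sketch below back-projects a depth array with a generic pinhole camera model centered on the array. It is not a reproduction of Equation (5.5): the focal length `focal_px`, the axis directions, and the function name `depth_to_points` are assumptions made only for this example.

```python
import numpy as np

def depth_to_points(depth_mm, focal_px=580.0):
    """Back-project a depth array to (X, Y, Z) in millimeters with a pinhole model.

    focal_px is a hypothetical focal length in pixels; u and v are measured
    from the center of the array, with Y pointing up (assumed convention).
    """
    depth = np.asarray(depth_mm, dtype=np.float64)
    height, width = depth.shape
    v, u = np.mgrid[0:height, 0:width]
    z = depth
    x = (u - width / 2.0) * z / focal_px
    y = (height / 2.0 - v) * z / focal_px
    return np.stack([x, y, z], axis=-1)

pts = depth_to_points(np.full((4, 4), 1000.0), focal_px=500.0)
print(pts.shape, pts[2, 2], pts[2, 3])
```

The pixel at the array center maps to X = Y = 0, so the resulting coordinates are expressed with respect to the center of the 2D array, as in the text.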
5.3.2 Localization by an Iterative Algorithm Using 3D Space Coordinates
According to the above discussions, we can obtain the 3D space coordinates of the landmark in the CCS. In order to localize the vehicle’s position on the path, we capture new depth data dv' of the landmark at the vehicle’s current position and match them against the learned data dv to compute the minimum-MSE estimate of the vehicle location, including the rotation angle θmse and the translation parameters (Xmse, Zmse), as described in Section 4.2.2, using the concept of the ICP technique [23].
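The matching idea can be sketched as an exhaustive search over candidate rotations and translations, scoring each candidate by the mean squared distance from every transformed current point to its nearest learned point. This is a simplified illustration under stated assumptions, not the procedure of this study: the rotation sign convention, the discrete candidate lists, and all function names are hypothetical, and the search is a plain grid search rather than a full iterative ICP.

```python
import numpy as np

def transform_xz(points, theta, tx, tz):
    """Rotate points about the Y-axis by theta and translate them in X and Z (assumed convention)."""
    c, s = np.cos(theta), np.sin(theta)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.column_stack([c * x - s * z + tx, y, s * x + c * z + tz])

def nearest_mse(current, learned):
    """Mean squared Euclidean distance from each current point to its nearest learned point."""
    diff = current[:, None, :] - learned[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

def min_mse_search(current, learned, thetas, xs, zs):
    """Return (mse, theta, tx, tz) of the candidate with the smallest MSE."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for theta in thetas:
        for tx in xs:
            for tz in zs:
                mse = nearest_mse(transform_xz(current, theta, tx, tz), learned)
                if mse < best[0]:
                    best = (mse, theta, tx, tz)
    return best

learned = np.array([[0.0, 0.0, 1000.0], [100.0, 0.0, 1000.0],
                    [100.0, 0.0, 1100.0], [250.0, 50.0, 1300.0]])
current = learned - np.array([10.0, 0.0, 20.0])  # same landmark seen from a shifted vehicle
print(min_mse_search(current, learned, [-0.1, 0.0, 0.1], [0.0, 10.0, 20.0], [0.0, 20.0, 40.0]))
```

In this usage example the search recovers the shift of 10 mm in X and 20 mm in Z with zero rotation, which is the correction that would be applied to the vehicle pose.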
Accordingly, we want to rotate and translate the depth data dv' of the landmark repeatedly by the concept of the ICP technique to find the correct vehicle location. For this, we derive some formulas for use in the process. At first, assume that we have calibrated the tilt angle of the used KINECT device by the Kinect-for-Windows SDK before the navigation starts. Then, we want to rotate the depth data dv' of the landmark through the pan angle θ by a rotation matrix described as follows:
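$$R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}; \tag{5.6}$$
also, we want to translate the depth data dv' of the landmark through a vector v = [vx, vy, vz]T by a translation matrix as follows, after representing the 3D space coordinates in the CCS by homogeneous coordinates (with an extra item of 1 in the fourth component):
$$T_v = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ v_x & v_y & v_z & 1 \end{bmatrix}; \tag{5.7}$$
specifically, for a pixel p with 3D coordinates (px, py, pz), the above translation results in:
$$(p_x, p_y, p_z, 1)\, T_v = (p_x + v_x,\; p_y + v_y,\; p_z + v_z,\; 1). \tag{5.8}$$
Moreover, if we concatenate the rotation and the translation together, then if P is any point in the 3D space (say, on a landmark) with learned coordinates (X, Y, Z) and its current coordinates obtained from the KINECT images are (X', Y', Z'), then the latter may be transformed into the former (i.e., from the current version to the learned one) by the following equation:
$$P(X, Y, Z, 1) = P(X', Y', Z', 1)\, R_y(\theta)\, T_v, \tag{5.9}$$
or equivalently, by the following equation:
$$\begin{bmatrix} X & Y & Z & 1 \end{bmatrix} = \begin{bmatrix} X' & Y' & Z' & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ v_x & v_y & v_z & 1 \end{bmatrix}. \tag{5.10}$$
Based on Equation (5.10), we use the concept of the ICP technique to obtain an angle θmse and a translation vector (Xmse, Zmse) which minimize the mean square error (MSE), where the MSE value is computed by the following Euclidean distance:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert P_i - P_{j_i} \right\rVert^2, \tag{5.11}$$
where N is the number of points Pi of the transformed current data and Pji is the point of the learned data corresponding to Pi. Finally, a detailed description of the minimum MSE estimation using the ICP technique is as follows.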
Algorithm 5.1: Minimum MSE estimation of the parameters for vehicle localization using the ICP technique.
Input: the depth data dv' of a landmark and the corresponding learned depth data dv; a range for rotations from θthl to θthr, and a 2D range for translations from (Xthl, Zthl) to (Xthr, Zthr), all retrieved from the database.
Output: minimum MSE estimations of the rotation angle θmse and the translation vector (Xmse, Zmse) of the vehicle in the CCS with respect to the landmark.
Steps:
Step 1. Convert the depth data dv and dv' respectively into the learned position coordinates (X, Y, Z) and the current position coordinates (X', Y', Z') in the CCS by Equation (5.5).
Step 2. For each point Pi with coordinates (X, Y, Z) in a 3D region A with θ from θthl to θthr, X from Xthl to Xthr, and Z from Zthl to Zthr, do the following steps.
Step 2.1 Compute the Euclidean distances from Pi of dv' to every point Pj of dv, and put them into a set M.
Step 2.2 Find the point Pji with the minimum Euclidean distance in M, take it as the one corresponding to Pi, and record the current data (θji, Xji, Zji).
Step 2.3 Repeat Steps 2.1 and 2.2 until all points Pi in A are processed.
Step 3. Substitute all points Pi in A and their corresponding points Pji together with the recorded data (θji, Xji, Zji) into Equation (5.11) to compute the MSE values and record them into a set K.
Step 4. Repeat Steps 2 and 3 until the end of the region A.
Step 5. Find the minimum of the MSE values, ki ∈ K, and take out the corresponding values of θ, X, and Z in region A as the desired rotation angle