3.3 Image Processing and Description
3.3.3 Connected-Component Labeling
To analyze each morphology region of binary image, it is necessary to distinguish each region at the beginning. Connected-component has the property that each pixel is the neighbor of the other pixels in the region in 4- or 8-connectivity.
Connected-component labeling is an algorithm that used to detect connected regions in binary image in computer vision [58: Gonzalez & Woods 2008]. Once the image region is labeled by using connected-component labeling algorithm, many region properties such as area (pixel number), smallest bounding box vertexes and component pixels list can be extracted.
Many ways to achieve connected-component labeling task have been developed.
Here a simple algorithm in recursive version is described in Algorithm 3.2 and illustrated in Figure 3.7. In this thesis, the connected-component labeling and region
properties can be found in MATLAB using the instruction regionprops.
Algorithm 3.2: Simple Connected-Component Labeling with 4-connectivity Input: Binary ImageImage
Output: Connected-Component Labeling Array ConnectedImage 1: [ImageRow,ImageCol] size(= Image);
2: ConnectedImage=zeros(ImageRow,ImageCol);
3: NumberLabel =0 4: fori=1:ImageRow 5: forj=1:ImageCol
6: if Image i j( , )==1and ConnectedImage i j( , )== 0 7: NumberLabel=NumberLabel+1;
8: ConnectedImage i j( , ) 1;=
9: ConnectedImage=CheckNeighbor i( −1, ,j Image ConnectedImage, );
10: ConnectedImage=CheckNeighbor i( +1, ,j Image ConnectedImage, );
11: ConnectedImage=CheckNeighbor i j( , −1,Image ConnectedImage, );
12: ConnectedImage=CheckNeighbor i j( , +1,Image ConnectedImage, );
13: end if 14: end for 15: end for
16: function CheckNeighbor(iIdx, jIdx,ConnectedImage)
17: if Image iIdx( −1, jIdx)==1and ConnectedImage iIdx( −1,jIdx)== 0 18: ConnectedImage=CheckNeighbor i( −1, ,j Image ConnectedImage, );
19: end if
20: if Image iIdx( +1, jIdx)==1and ConnectedImage iIdx( +1,jIdx)== 0 21: ConnectedImage=CheckNeighbor i( +1, ,j Image ConnectedImage, );
22: end if
23: if Image iIdx jIdx( , − ==1) 1and ConnectedImage iIdx jIdx( , − == 1) 0 24: ConnectedImage=CheckNeighbor i j( , −1,Image ConnectedImage, );
25: end if
26: if Image iIdx jIdx( , + ==1) 1and ConnectedImage iIdx jIdx( , + == 1) 0 27: ConnectedImage=CheckNeighbor i j( , +1,Image ConnectedImage, );
28: end if
29: returnConnectedImage 30: end function
3.4
( x xi )
ϕ − is the so called radial basis function, which has many types, for example:
1. Gaussain:
( )2
( )r e εr
ϕ = − (3.15)
2. Multiquadric:
( )r 1 ( r)2
ϕ = + ε (3.16)
3. Inverse multiquadric:
2
( ) 1
1 ( ) r
ϕ r
= ε
+ (3.17)
where r= −x xi is the distance between x and x . i
To obtain the goal of RBF interpolation, it is necessary to determine a proper radial basis function ϕ( )r . After determining the radial basis function, the next step is to train the weights of RBF ωi. Since each known data point has corresponding known output, the training process takes all the known data point x to associate each data j x to i calculate the corresponding weight ωi, that is,
( )i i, 1... .
y x =b i = N (3.18)
1
( ) ( ) , 1...
N
j i j i j
i
y x ω ϕ x x b j N
=
=
⋅ − = = (3.19)where x is the known node and i b is the corresponding known output. Extending the i Equation (3.19), it becomes:
1 1 1 1 2 2 1 1
By rewriting the Equation (3.21) in matrix form, Equation (3.21) becomes:
1,1 1,2 1, 1 1
For simplicity, Equation (3.22) can be expressed as Equation (3.23)
Aω =b (3.23)
If the matrix A is nonsingular, the coefficient ω can be obtained by Equation (3.24).
A b1
ω= − (3.24)
Therefore, due to ωi is known and ( )ϕ r is defined previously, the approximating function can be obtained.
Chapter 4
3D Environment Reconstruction
In this chapter, the proposed 3D environment reconstruction method using stereo camera is presented. The proposed system can be divided into two parts, which are localization and stereo refinement. The overall system architecture can be illustrated by Figure 4.1. For camera at k-th time step, two independent processes perform in parallel. To estimate camera relative movement between current and previous step, feature points captured in these two steps need to be the inputs of localization part to achieve the goal of feature matching process. Some wrong matching pairs cause inaccurate localization result are eliminated by the outlier rejection method based on Random Sample Consensus (RANSAC). For stereo data refinement, wrong measurement pixels in forbidden area are removed by statistic method. The missing data areas which are often called a hole in the remaining disparity map are detected by using connected-component labeling technique. Finally, the missing data regions are filled by dual orthogonal linear interpolation method.
Figure 4.1: The proposed system architecture.
4.1 Stereo Camera Localization and Mapping
One of the advantages of stereo camera is the data structure that combines spatial and color information to a pixel in the image coordinate, that is, Pi =
(
x y z r g bi, , , , ,i i i i i)
. This data structure provides essential information to 3-D reconstruction. Assuming a camera captures a sequence of the local data in the environment with known camera poses, the global environment 3D model can be reconstructed by placing these data in correct positions to world coordinate. Figure 4.2 shows the concept: two landmarks in the environment are captured by two consecutive camera frames with relative cameraStereo Refinement with RANSAC Outlier
Removal
Forbidden Area Segmentation
Hole Detection
Hole Filling Stereo Data Input
Step k
Stereo Data Input Step k-1
Proposed Algorithm Proposed
Algorithm
Stereo Data Input Step k+1 Proposed Algorithm Stereo
Data k-2
Data transformed by the Transformation Matrix
Global Mapping
pose T as in Figure 4.2(a). The landmarks are seen by the camera in k-th and (k−1)-th steps as in Figure 4.2(b) and Figure 4.2(c) respectively. If the transformation
T is known, two measurements in k-th and (k−1)-th steps can be aligned together, and therefore the environment can be reconstructed, as illustrated in Figure 4.2(d).
However, in practice the measurement data is acquired from the camera coordinate without knowing the camera pose in world coordinate. That is, the measurement data points cannot be placed to the correct positions in world coordinate. Fortunately, two consecutive images capture the same landmarks in the environment, as illustrated in Figure 4.2(e) and (f) for example. Assuming the environment is a rigid body, a certain transformation which includes rotation and translation can align these landmarks in current camera coordinate to the landmarks in previous camera coordinate correctly. In other words, the transformation can be found out by using the relationship between the corresponding landmarks in current and previous camera coordinates as illustrated in Figure 4.2(g) and (f). By doing so, the 3-D points in current step can be mapped to the previous coordinate. This transformation is the relative camera pose that is necessary to be estimated, and the transformation estimation process is familiar with the term
“localization”. In ideal, the camera relative motion can be estimated by simply applying the above concept. However, some error matching pairs affect the motion estimation as shown in Figure 4.2(j) and (k). If the motion is estimated inaccurately, the environment
cannot be reconstructed precisely as illustrated in Figure 4.2(l). Therefore, motion estimation with outlier rejection by random sample consensus (RANSAC) is used in this thesis.
The relation between time flow and the each processes of the localization algorithm is shown in Figure 4.3. For k-th step, stereo camera provides two images from right and left CCDs and then calculates the depth map with respect to the target image coordinate at first. The second step is to detect image feature from the target image. In the third step, image features in k-th step are matched to the corresponding features in (k−1)-th step.
(a) (b) (c) (d)
(e) (f) (g) (h)
(i) (j) (k) (l)
Figure 4.2: Illustration of the importance of localization for mapping task (a) The environment is captured by camera in consecutive step.
(b)-(c) The camera measurement in (k−1)-th and k-thsteps in camera coordinates.
(d) If the transformation is known, measurements in k-th step can be transformed to the (k−1)-th step camera coordinate to obtain the environment reconstruction process.
(e)-(f) Feature points extracted in (k−1)-th and k-th step in ideal.
(g) Overlap these measurements in the same camera coordinates. Features in (e) and (f) are matched by estimating the similarity of the local appearance of each feature.
(h) Since the same landmarks captured by the camera in consecutive time steps are matched well, the transformation relation can be estimated by using these landmarks relative positions.
(i) Feature points extracted in (k−1)-th step with wrong feature point extraction for example.
(j)-(k) In practice, there might have some wrong matching pairs that will cause inaccurate or incorrect motion estimation.
(l) If the camera relative motion is estimated inaccurately, the environment cannot reconstruct precisely.
1 T
k− k
Capture Scenario
1 k− For k 1 step−
k For k step
k 1
k−
T
1
k− k TIdeal
1
k− TInaccurate
,
4.1.1 Feature Point Extraction
The purpose of the feature point extraction step is to detect some parts of the scene observed by stereo camera in the environment for feature tracking. Currently, many types of features that can be used to describe the environment have been developed. In order to identify a point located at different image plane position frame by frame, a feature should be robust or so called “salient” in the field of computer vision to distinguish each data point. Point feature can be categorized into two main types, that is, color and spatial spaces.
Stereo camera provides two images from different image planes and then the disparity map is estimated by using these images, which is secondhand information and using spatial type feature such as Normal Aligned Radial Feature (NARF) [42: Steder et al. 2011] might be unreliable for stereo vision. Figure 4.4 shows the NARF detection
results from the depth maps at the same position but in different time steps for example [55: NARF feature from PCL 2013]. Although the frame data are acquired by stereo
camera in the same poses, the NARF features are not robust due to stereo uncertainty and missing data problem, as shown in Figure 4.4(i)-(l). Therefore, the Scalar Invariant Feature Transform (SIFT) which belongs to color type feature point is used in this thesis.
(a) (b) (c) (d)
(e) (f) (g) (h)
(i) (j) (k) (l) Figure 4.4: An example of spatial feature: Normal Aligned Radial Feature (NARF).
(a)-(d) Color image captured at the same position in different time steps.
(e)-(h) Corresponding depth maps in different time steps.
(i)-(l) NARF features extracted from range images in different time steps. Red circles indicate the locations of NARF features.
In feature extraction step, a set of salient points F is detected at k k-th frame in the target imageITar k, . Each feature point has two important data, which are the position in image coordinate PositionImage(Fk i,)=(vk i,,uk i,) and the 1 128× descriptor
, , , ,1 , ,2 , ,128
( k i) k i [ k i , k i , , k i ]
Descriptor F =D = d d d extracted from its local patch. The descriptor is used to determine the correspondence of certain feature in two different images.
Moreover, with original SIFT feature information and the spatial information provided
imag
(a) (b)
(c) (d)
(e) (f) Figure 4.6: Feature extraction by SIFT detector
(a)-(b) Input image and Corresponding depth map
(c)-(d) SIFT features with input image and corresponding depth map.
(e)-(f) Features with non-depth information are marked as red cross sign according to the depth map.
4.1.2 Feature Point Matching in Two Consecutive Frames
Several salient feature points are extracted from each target image frame in the stage of feature detection presented in Section 4.1.1. In order to track 3D positions of the features captured in consecutive images, this section aims at recognizing each feature located at different images and formed as feature point pairs. This process is often called feature matching, and the data flow is illustrated in Figure 4.7. For the consecutive images at k-th and (k−1)-th step, feature sets Fk and Fk−1 are extracted as the inputs of feature matching stage, and the output is the feature matching pairs formed as 2×N matrix FPk k,−1= FPk k, −1,1,FPk k, −1,2,,FPk k,−1,j, ,FPk k,−1,N, where
, 1, k k j
FP − is the -thj matching pair with 2 1× dimension that stored the indexes of two features Fk m, and Fk−1,n.
A traditional way to achieve the goal of feature matching for certain feature in current step is to compare the similarity of feature points from previous feature set.
Since the descriptor represents the local appearance of a feature, the similarity can be evaluated by comparing the point descriptor in current frame to the descriptors in previous frame. Many ways are used to calculate the similarity of two data, such as Euclidean distance used in this thesis. The Euclidean distance measures the square root of sum of square difference of each element between two n-dimension vectors, which is defined as follows.
( )
2j feature in Fk−1,j. Therefore, the feature with the smallest Euclidean distance in descriptors is considered to be the successful matching feature pair of Fk i, and Fk−1,j in
After feature matching step, features in current step will be linked to previous features point-to-point. For a certain feature, its 3D positions ( , , )x y z in stereo camera coordinate are also known from stereo measurements. Therefore, each of the features positions in consecutive time steps are known and the camera related pose can be estimated by using these spatial relations.
Feature Set k-1 Feature Set k Feature Set
k+1
Feature Matching
Pair k-1 Feature Matching Pair k-1
(a) (b)
(c)
Figure 4.8: The result of feature matching by estimate the similarity between two feature descriptor.
(a) Previous image with the detected features. The same as in Figure 4.6(e).
(b) Current image with the detected features. The same as in Figure 4.6(f).
(c) By comparing the similarity, each previous feature is linked to the current feature as the same landmark.
4.1.3 Estimate the relative transformation matrix of rigid body by Least-Squares method using SVD.
After finishing the feature matching stage, points in each matching pair is assumed to be the same landmark in the environment with different positions at current and previous frames in camera coordinate. The goal is to find a proper rotation and translation to fit the current feature set in matching pair to the previous feature set. This task is well known as “point clouds registration.” To find the optimal 3 3× rotation matrix R and the 3 1× translation vector t , the least-squares fitting method using singular value decomposition (SVD) is first proposed in [15: Arun et al. 1987]. To simplify the problem, two point sets are assumed to be a rigid body. Each point in rigid body has the same rotation and translation in an arbitrary motion with respect to its centroid, which is formulated as follows.
1
k k
PFP− = ×R PFP +t (4.4)
Therefore the rotation and translation can be departed in two independent materials.
According to the work proposed by [15: Arun et al. 1987], the overall least-square fitting algorithm is listed in Algorithm 4.1, and the concept is illustrated in Figure 4.9.
Figure 4.9(a) shows that a rigid body is captured by camera in (k−1)-th and k-th steps, where rigid body is located at different positions with respect to different camera
two measurements are drawn in the same coordinate. The dot lines in Figure 4.9(d) stand for the measurement in (k−1)-th step, while the solid lines stand for the measurements in k-th step. To align these measurement points, the rotation part R needs to be estimated before the translation since the translation part depends on
PFPk× . Because each point of the rigid body rotates around its centroidR [60: Spong 2005], the rotation can be estimated by ignoring the translation part by setting two
centroids to the original, as shown in Figure 4.9(f). Therefore the centroids of two point sets are calculated at first as in Algorithm 4.1 line 2-3 and each point is subtracted to its centroid in the product terms of 3 3× covariance matrix in Equation (4.5). The optimal rotation can be calculated using the following covariance matrix:
, 1 1 ,
1
( )( )
Pairs
N
T
k k
i k i k
i
H PFP − X − PFP X
=
=
− − (4.5)Note that the order of the product in Equation (4.5) cannot be changed or the transformation will be the inverse motion. According to the property of SVD, the covariance H can be decomposed into three parts, as shown in Equation (4.6), where
H is the product of these parts and is written as follows.
[ , , ]U S V =SVD H( ) (4.6)
H =USVT (4.7)
Therefore, the optimal least-squares rotation matrix can then be calculated as follows:
R=VUT (4.8)
After the optimal rotation part is calculated, the translation term can be extracted. As the properties of rigid body mention above, each point has the same translation as its centroid. Therefore, the corresponding optimal translation vector can be calculated as follows:
1
k k
t= X − −R X (4.9)
The transformation matrix can be written as a 4 4× matrix combining with rotation matrix and translation vector together into homogeneous transformation form, which is
written as follows [60: Spong 2005]:
11 12 13
Algorithm 4.1: Estimate relative transformation matrix using SVD Input: Feature matching pairsFPk k, −1.
3: Compute the centroid of point set in step k-1,
1 1 1 1 , 1
Figur
Equation (4.10) is the relative camera motion from (k−1)-th to k-th step. In order to place stereo measurements to global coordinate, camera pose needs to be specified. The relation between camera relative motion and the camera pose in each step is illustrated in Figure 4.10. The camera pose in k-thstep can be defined as C and k Following this rule, if the initial pose C is known or predefined by an identity matrix 0 as global coordinate, the camera pose at k-th step can be written as follows:
Each k-th step point can be transformed to initial step camera coordinate as the defined global coordinate, and this relation can be written as follows:
11 12 13
Figure 4.10: Illustration of the relation between camera pose and relative camera motion.
4.1.4 Camera Pose Estimation with RANSAC Outlier Rejection
In previous section, the distinctive feature points are detected in each step, and the corresponding feature points will be matched in two consecutive frame data. The camera relative pose can then be estimated by using the spatial relation between these matching pairs using SVD method. However, using these matching pairs without any selection will cause inaccurate or incorrect localization result. Although the SIFT feature is quite robust comparing to most of the recent feature techniques, there might have some wrong matching cases such as repeating features or similar object in the world. In addition, the uncertainty of each stereo camera measurement point may contribute some drift to the final relative pose. To overcome the above problem, Random Same Consensus (RANSAC) outlier rejection framework is applied to find a best transformation matrix. The modified RANSAC algorithm to this case is listed in
0 4
C =I
Time
x
y
1 1 0
C =M C C2 =M C2 1=M M C2 1 0 Ck =MkCk−1=MkMk−1...C0 M1
Mk
M2
…
x
y
x
y
x
y
M1 M1 M2
0
t= t=1 t=2
t=k
Algorithm 4.2. It is assuming that the best transformation matrix is the model with largest number of inliers. In each iteration step, several matching pairs are selected randomly as sample data to estimate a transformation matrix Mcurrent by SVD (as in
Algorithm 4.1). In order to determine which matching pairs are inliers, the feature points in current frame are transformed by Mcurrent to previous camera coordinate, then each spatial error of the matching pair can be calculated by using Euclidean distance, which can be written as:
( ) (
2) (
2)
2* * * *
, , 1 , , 1 , , 1 , , 1
( i k, i k ) i k i k i k i k i k i k
d PFP PFP x x y y z z
ε = − = − − + − − + − − (4.14)
* , ,
1 1
i k i k
current
PFP PFP
=M
(4.15)
where PFPi k*, =(x*i k, ,y*i k, ,z*i k, )T is the coordinate of PFP which is transformed by i k,
current
M as shown in Equation (4.15). If the Euclidean distance between PFPi k*, and
, 1
PFPi k− is less than a predefine threshold, it is considered to be an inlier. By doing so, the transformation matrix and inlier set are calculated in certain step. Then, after
Iteration
N times iteration, there will be NIteration number of transformation matrix M i and the corresponding inlier setsInliersi, where i=0...NIteration. The best transformation
Mbest is determined by choosing the transformation matrix M with the largest number i of inliers Inliersi, which can be done iteratively without storing all the trying models
M with its inliers i Inliersi as in the line 11-21 in Algorithm 4.2.
Algorithm 4.2: Feature-based localization with RANSAC outlier rejection algorithm Input:
Feature Matching Pair FPk k, −1
Set the number of iterationsNIteration Number of sample pairsNsample
Output:
Transformation MatrixMt t, 1−
Inlier listInliers
1: Initialize best transformation matrixMbest ← φ 2: Initialize best inliers setInliersbest ← φ
3: Initialize best inliers numberNInliers best_ =0
4: Calculate the number of matching pairs NMatch=size FP( k k, −1)
5: for iteration = 1:NIteration
6: SampleSet ←Randomly select Nsample matching pairs inFPk k,−1
7: Compute Current Transform Matrix Mcurrent from SampleSet using SVD (Algorithm 4.1)
8: InliersCurrent ← φ
9: for all FPk k, −1,i in FPk k, −1
10: Compute the spatial error between Mcurrent PFPi k, and PFPi k, −1 by using Euclidean distance, that is, ε=Euclidean M( currentPFPi k, ,PFPi k, −1)
11: if ε<threshold
12: InliersCurrent←InliersCurrent+FPk k, −1,i
13: end if 14: end for
15: Count the number of InliersCurrent, NInliers current_ =size Inliers( Current)
16: Recomputing the transformation matrix Mcurrent by InliersCurrent using SVD (Algorithm 4.1)
17: if NInliers current_ >NInliers best_
18: Mbest ←Mcurrent 19: Inliersbest←InliersCurrent 20: NInliers best_ =NInliers current_
21: end if 22: end for 23: Mk k, −1←Mbest
For better understanding, 2-th and 3-th frame data are taken as (k−1)-th and
-th
k steps for example. The given relative motion is a pure translation along x-axis with positive 0.1m without any rotation, and therefore the transformation matrix is as
follows:
Figure 4.11 (a)-(b) are two target images captured from right CCD of the stereo camera with corresponding feature points in (k−1)-th and k-th steps, respectively. The green circles indicate the features in the (k−1)-th step, while the red dots represent the features in the k-th step. Figure 4.11(c) and (d) show the projecting result of k-th step features from 3D coordinate to image plane by pin-hole model with different transformation matrixes estimated in two iterations. For better estimation iteration case,
features in k-th step are transformed by the following matrix:
1.0000 0.0019 0.0033
Most of the red dots align to the green circles as shown in Figure 4.11(c) and (e). The aligning pairs are equivalence to the 3D spatial inliers since the projection by camera
Most of the red dots align to the green circles as shown in Figure 4.11(c) and (e). The aligning pairs are equivalence to the 3D spatial inliers since the projection by camera