Connected-Component Labeling - Image Processing and Description

3.3 Image Processing and Description

3.3.3 Connected-Component Labeling

To analyze each morphology region of binary image, it is necessary to distinguish each region at the beginning. Connected-component has the property that each pixel is the neighbor of the other pixels in the region in 4- or 8-connectivity.

Connected-component labeling is an algorithm that used to detect connected regions in binary image in computer vision [58: Gonzalez & Woods 2008]. Once the image region is labeled by using connected-component labeling algorithm, many region properties such as area (pixel number), smallest bounding box vertexes and component pixels list can be extracted.

Many ways to achieve connected-component labeling task have been developed.

Here a simple algorithm in recursive version is described in Algorithm 3.2 and illustrated in Figure 3.7. In this thesis, the connected-component labeling and region

properties can be found in MATLAB using the instruction regionprops.

Algorithm 3.2: Simple Connected-Component Labeling with 4-connectivity Input: Binary ImageImage

Output: Connected-Component Labeling Array ConnectedImage 1: [ImageRow,ImageCol] size(= Image);

2: ConnectedImage=zeros(ImageRow,ImageCol);

3: NumberLabel =0 4: fori=1:ImageRow 5: forj=1:ImageCol

6: if Image i j( , )==1^and ConnectedImage i j( , )== 0 7: NumberLabel=NumberLabel+1;

8: ConnectedImage i j( , ) 1;=

9: ConnectedImage=CheckNeighbor i( −1, ,j Image ConnectedImage, );

10: ConnectedImage=CheckNeighbor i( +1, ,j Image ConnectedImage, );

11: ConnectedImage=CheckNeighbor i j( , −1,Image ConnectedImage, );

12: ConnectedImage=CheckNeighbor i j( , +1,Image ConnectedImage, );

13: end if 14: end for 15: end for

16: function CheckNeighbor(iIdx, jIdx,ConnectedImage)

17: if Image iIdx( −1, jIdx)==1^and ConnectedImage iIdx( −1,jIdx)== 0 18: ConnectedImage=CheckNeighbor i( −1, ,j Image ConnectedImage, );

19: end if

20: if Image iIdx( +1, jIdx)==1^and ConnectedImage iIdx( +1,jIdx)== 0 21: ConnectedImage=CheckNeighbor i( +1, ,j Image ConnectedImage, );

22: end if

23: if Image iIdx jIdx( , − ==1) 1^and ConnectedImage iIdx jIdx( , − == 1) 0 24: ConnectedImage=CheckNeighbor i j( , −1,Image ConnectedImage, );

25: end if

26: if Image iIdx jIdx( , + ==1) 1^and ConnectedImage iIdx jIdx( , + == 1) 0 27: ConnectedImage=CheckNeighbor i j( , +1,Image ConnectedImage, );

28: end if

29: returnConnectedImage 30: end function

3.4

( x x_i )

ϕ − is the so called radial basis function, which has many types, for example:

1. Gaussain:

( )2

( )r e ^ε^r

ϕ = ⁻ (3.15)

2. Multiquadric:

( )r 1 ( r)2

ϕ = + ε (3.16)

3. Inverse multiquadric:

( ) 1

1 ( ) r

ϕ r

= ε

+ (3.17)

where r= −x x_i is the distance between x and x . _i

To obtain the goal of RBF interpolation, it is necessary to determine a proper radial basis function ϕ( )r . After determining the radial basis function, the next step is to train the weights of RBF ω_i. Since each known data point has corresponding known output, the training process takes all the known data point x to associate each data _j x to _i calculate the corresponding weight ω_i, that is,

( )_i _i, 1... .

y x =b i = N (3.18)

( ) ( ) , 1...

j i j i j

y x ω ϕ x x b j N



⋅ − = = ^(3.19)

where x is the known node and _i b is the corresponding known output. Extending the _i Equation (3.19), it becomes:

1 1 1 1 2 2 1 1

By rewriting the Equation (3.21) in matrix form, Equation (3.21) becomes:

1,1 1,2 1, 1 1

For simplicity, Equation (3.22) can be expressed as Equation (3.23)

Aω =b (3.23)

If the matrix A is nonsingular, the coefficient ω can be obtained by Equation (3.24).

A b1

ω= ⁻ (3.24)

Therefore, due to ω_i is known and ( )ϕ r is defined previously, the approximating function can be obtained.

Chapter 4 3D Environment Reconstruction

In this chapter, the proposed 3D environment reconstruction method using stereo camera is presented. The proposed system can be divided into two parts, which are localization and stereo refinement. The overall system architecture can be illustrated by Figure 4.1. For camera at k-th time step, two independent processes perform in parallel. To estimate camera relative movement between current and previous step, feature points captured in these two steps need to be the inputs of localization part to achieve the goal of feature matching process. Some wrong matching pairs cause inaccurate localization result are eliminated by the outlier rejection method based on Random Sample Consensus (RANSAC). For stereo data refinement, wrong measurement pixels in forbidden area are removed by statistic method. The missing data areas which are often called a hole in the remaining disparity map are detected by using connected-component labeling technique. Finally, the missing data regions are filled by dual orthogonal linear interpolation method.

Figure 4.1: The proposed system architecture.

4.1 Stereo Camera Localization and Mapping

One of the advantages of stereo camera is the data structure that combines spatial and color information to a pixel in the image coordinate, that is, ^Pⁱ =

(

x y z r g bⁱ^{, , , , ,}ⁱ ⁱ ⁱ ⁱ ⁱ

)

. This data structure provides essential information to 3-D reconstruction. Assuming a camera captures a sequence of the local data in the environment with known camera poses, the global environment 3D model can be reconstructed by placing these data in correct positions to world coordinate. Figure 4.2 shows the concept: two landmarks in the environment are captured by two consecutive camera frames with relative camera

Stereo Refinement with RANSAC Outlier

Removal

Forbidden Area Segmentation

Hole Detection

Hole Filling Stereo Data Input

Step k

Stereo Data Input Step k-1

Proposed Algorithm Proposed

Algorithm

Stereo Data Input Step k+1 Proposed Algorithm Stereo

Data k-2

Data transformed by the Transformation Matrix

Global Mapping

pose T as in Figure 4.2(a). The landmarks are seen by the camera in k-th and (k−1)-th steps as in Figure 4.2(b) and Figure 4.2(c) respectively. If the transformation

T is known, two measurements in k-th and (k−1)-th steps can be aligned together, and therefore the environment can be reconstructed, as illustrated in Figure 4.2(d).

However, in practice the measurement data is acquired from the camera coordinate without knowing the camera pose in world coordinate. That is, the measurement data points cannot be placed to the correct positions in world coordinate. Fortunately, two consecutive images capture the same landmarks in the environment, as illustrated in Figure 4.2(e) and (f) for example. Assuming the environment is a rigid body, a certain transformation which includes rotation and translation can align these landmarks in current camera coordinate to the landmarks in previous camera coordinate correctly. In other words, the transformation can be found out by using the relationship between the corresponding landmarks in current and previous camera coordinates as illustrated in Figure 4.2(g) and (f). By doing so, the 3-D points in current step can be mapped to the previous coordinate. This transformation is the relative camera pose that is necessary to be estimated, and the transformation estimation process is familiar with the term

“localization”. In ideal, the camera relative motion can be estimated by simply applying the above concept. However, some error matching pairs affect the motion estimation as shown in Figure 4.2(j) and (k). If the motion is estimated inaccurately, the environment

cannot be reconstructed precisely as illustrated in Figure 4.2(l). Therefore, motion estimation with outlier rejection by random sample consensus (RANSAC) is used in this thesis.

The relation between time flow and the each processes of the localization algorithm is shown in Figure 4.3. For k-th step, stereo camera provides two images from right and left CCDs and then calculates the depth map with respect to the target image coordinate at first. The second step is to detect image feature from the target image. In the third step, image features in k-th step are matched to the corresponding features in (k−1)-th step.

(a) (b) (c) (d)

(e) (f) (g) (h)

(i) (j) (k) (l)

Figure 4.2: Illustration of the importance of localization for mapping task (a) The environment is captured by camera in consecutive step.

(b)-(c) The camera measurement in (k−1)-th and k-thsteps in camera coordinates.

(d) If the transformation is known, measurements in k-th step can be transformed to the (k−1)-th step camera coordinate to obtain the environment reconstruction process.

(e)-(f) Feature points extracted in (k−1)-th and k-th step in ideal.

(g) Overlap these measurements in the same camera coordinates. Features in (e) and (f) are matched by estimating the similarity of the local appearance of each feature.

(h) Since the same landmarks captured by the camera in consecutive time steps are matched well, the transformation relation can be estimated by using these landmarks relative positions.

(i) Feature points extracted in (k−1)-th step with wrong feature point extraction for example.

(j)-(k) In practice, there might have some wrong matching pairs that will cause inaccurate or incorrect motion estimation.

(l) If the camera relative motion is estimated inaccurately, the environment cannot reconstruct precisely.

1 T

k− ^k

Capture Scenario

1 k− For k 1 step−

k For k step

k 1

k−

k− k T^Ideal

k− T_Inaccurate

4.1.1 Feature Point Extraction

The purpose of the feature point extraction step is to detect some parts of the scene observed by stereo camera in the environment for feature tracking. Currently, many types of features that can be used to describe the environment have been developed. In order to identify a point located at different image plane position frame by frame, a feature should be robust or so called “salient” in the field of computer vision to distinguish each data point. Point feature can be categorized into two main types, that is, color and spatial spaces.

Stereo camera provides two images from different image planes and then the disparity map is estimated by using these images, which is secondhand information and using spatial type feature such as Normal Aligned Radial Feature (NARF) [42: Steder et al. 2011] might be unreliable for stereo vision. Figure 4.4 shows the NARF detection

results from the depth maps at the same position but in different time steps for example [55: NARF feature from PCL 2013]. Although the frame data are acquired by stereo

camera in the same poses, the NARF features are not robust due to stereo uncertainty and missing data problem, as shown in Figure 4.4(i)-(l). Therefore, the Scalar Invariant Feature Transform (SIFT) which belongs to color type feature point is used in this thesis.

(a) (b) (c) (d)

(e) (f) (g) (h)

(i) (j) (k) (l) Figure 4.4: An example of spatial feature: Normal Aligned Radial Feature (NARF).

(a)-(d) Color image captured at the same position in different time steps.

(e)-(h) Corresponding depth maps in different time steps.

(i)-(l) NARF features extracted from range images in different time steps. Red circles indicate the locations of NARF features.

In feature extraction step, a set of salient points F is detected at _k k-th frame in the target imageI_{Tar k}_, . Each feature point has two important data, which are the position in image coordinate Position_Image(F_{k i}_,)=(v_{k i}_,,u_{k i}_,) and the 1 128× descriptor

, , , ,1 , ,2 , ,128

( _{k i}) _{k i} [ _{k i} , _{k i} , , _{k i} ]

Descriptor F =D = d d d extracted from its local patch. The descriptor is used to determine the correspondence of certain feature in two different images.

Moreover, with original SIFT feature information and the spatial information provided

imag

(a) (b)

(e) (f) Figure 4.6: Feature extraction by SIFT detector

(a)-(b) Input image and Corresponding depth map

(c)-(d) SIFT features with input image and corresponding depth map.

(e)-(f) Features with non-depth information are marked as red cross sign according to the depth map.

4.1.2 Feature Point Matching in Two Consecutive Frames

Several salient feature points are extracted from each target image frame in the stage of feature detection presented in Section 4.1.1. In order to track 3D positions of the features captured in consecutive images, this section aims at recognizing each feature located at different images and formed as feature point pairs. This process is often called feature matching, and the data flow is illustrated in Figure 4.7. For the consecutive images at k-th and (k−1)-th step, feature sets F_k and F_k₋₁ are extracted as the inputs of feature matching stage, and the output is the feature matching pairs formed as 2×N matrix FP_{k k}_,₋₁= FP_{k k}_, ₋_1,1,FP_{k k}_, ₋_1,2,,FP_{k k}_,₋_1,_j,  ,FP_{k k}_,₋_1,_N, where

, 1, k k j

FP ₋ is the -thj matching pair with 2 1× dimension that stored the indexes of two features F_{k m}_, and F_k₋_1,_n.

A traditional way to achieve the goal of feature matching for certain feature in current step is to compare the similarity of feature points from previous feature set.

Since the descriptor represents the local appearance of a feature, the similarity can be evaluated by comparing the point descriptor in current frame to the descriptors in previous frame. Many ways are used to calculate the similarity of two data, such as Euclidean distance used in this thesis. The Euclidean distance measures the square root of sum of square difference of each element between two n-dimension vectors, which is defined as follows.

( )

j feature in Fk₋1,j. Therefore, the feature with the smallest Euclidean distance in descriptors is considered to be the successful matching feature pair of F_{k i}_, and F_k₋_1,_j in

After feature matching step, features in current step will be linked to previous features point-to-point. For a certain feature, its 3D positions ( , , )x y z in stereo camera coordinate are also known from stereo measurements. Therefore, each of the features positions in consecutive time steps are known and the camera related pose can be estimated by using these spatial relations.

Feature Set k-1 Feature Set k Feature Set



k+1



Feature Matching

Pair k-1 Feature Matching Pair k-1

(a) (b)

(c)

Figure 4.8: The result of feature matching by estimate the similarity between two feature descriptor.

(a) Previous image with the detected features. The same as in Figure 4.6(e).

(b) Current image with the detected features. The same as in Figure 4.6(f).

4.1.3 Estimate the relative transformation matrix of rigid body by Least-Squares method using SVD.

After finishing the feature matching stage, points in each matching pair is assumed to be the same landmark in the environment with different positions at current and previous frames in camera coordinate. The goal is to find a proper rotation and translation to fit the current feature set in matching pair to the previous feature set. This task is well known as “point clouds registration.” To find the optimal 3 3× rotation matrix R and the 3 1× translation vector t , the least-squares fitting method using singular value decomposition (SVD) is first proposed in [15: Arun et al. 1987]. To simplify the problem, two point sets are assumed to be a rigid body. Each point in rigid body has the same rotation and translation in an arbitrary motion with respect to its centroid, which is formulated as follows.

k k

PFP₋ = ×R PFP +t (4.4)

Therefore the rotation and translation can be departed in two independent materials.

According to the work proposed by [15: Arun et al. 1987], the overall least-square fitting algorithm is listed in Algorithm 4.1, and the concept is illustrated in Figure 4.9.

Figure 4.9(a) shows that a rigid body is captured by camera in (k−1)-th and k-th steps, where rigid body is located at different positions with respect to different camera

two measurements are drawn in the same coordinate. The dot lines in Figure 4.9(d) stand for the measurement in (k−1)-th step, while the solid lines stand for the measurements in k-th step. To align these measurement points, the rotation part R needs to be estimated before the translation since the translation part depends on

PFPk× . Because each point of the rigid body rotates around its centroidR [60: Spong 2005], the rotation can be estimated by ignoring the translation part by setting two

centroids to the original, as shown in Figure 4.9(f). Therefore the centroids of two point sets are calculated at first as in Algorithm 4.1 line 2-3 and each point is subtracted to its centroid in the product terms of 3 3× covariance matrix in Equation (4.5). The optimal rotation can be calculated using the following covariance matrix:

, 1 1 ,

( )( )

Pairs

k k

i k i k

H PFP − X − PFP X



− − ^(4.5)

Note that the order of the product in Equation (4.5) cannot be changed or the transformation will be the inverse motion. According to the property of SVD, the covariance H can be decomposed into three parts, as shown in Equation (4.6), where

H is the product of these parts and is written as follows.

[ , , ]U S V =SVD H( ) (4.6)

H =USVT (4.7)

Therefore, the optimal least-squares rotation matrix can then be calculated as follows:

R=VUT (4.8)

After the optimal rotation part is calculated, the translation term can be extracted. As the properties of rigid body mention above, each point has the same translation as its centroid. Therefore, the corresponding optimal translation vector can be calculated as follows:

k k

t= X − −R X (4.9)

The transformation matrix can be written as a 4 4× matrix combining with rotation matrix and translation vector together into homogeneous transformation form, which is

written as follows [60: Spong 2005]:

11 12 13

Algorithm 4.1: Estimate relative transformation matrix using SVD Input: Feature matching pairsFPk k, ₋1.

3: Compute the centroid of point set in step k-1,

1 1 1 1 , 1

Figur

Equation (4.10) is the relative camera motion from (k−1)-th to k-th step. In order to place stereo measurements to global coordinate, camera pose needs to be specified. The relation between camera relative motion and the camera pose in each step is illustrated in Figure 4.10. The camera pose in k-thstep can be defined as C and _k Following this rule, if the initial pose C is known or predefined by an identity matrix ₀ as global coordinate, the camera pose at k-th step can be written as follows:

Each k-th step point can be transformed to initial step camera coordinate as the defined global coordinate, and this relation can be written as follows:

11 12 13

Figure 4.10: Illustration of the relation between camera pose and relative camera motion.

4.1.4 Camera Pose Estimation with RANSAC Outlier Rejection

In previous section, the distinctive feature points are detected in each step, and the corresponding feature points will be matched in two consecutive frame data. The camera relative pose can then be estimated by using the spatial relation between these matching pairs using SVD method. However, using these matching pairs without any selection will cause inaccurate or incorrect localization result. Although the SIFT feature is quite robust comparing to most of the recent feature techniques, there might have some wrong matching cases such as repeating features or similar object in the world. In addition, the uncertainty of each stereo camera measurement point may contribute some drift to the final relative pose. To overcome the above problem, Random Same Consensus (RANSAC) outlier rejection framework is applied to find a best transformation matrix. The modified RANSAC algorithm to this case is listed in

0 4

C =I

Time

x

y

1 1 0

C =M C C₂ =M C_{2 1}=M M C₂ _{1 0} ^C_k =^M_k^C_k₋₁=^M_k^M_k₋₁^...^C₀ M1

…

x

y

x

y

x

y

M1 M₁ ^M²

t= t=1 t=2



t=k



Algorithm 4.2. It is assuming that the best transformation matrix is the model with largest number of inliers. In each iteration step, several matching pairs are selected randomly as sample data to estimate a transformation matrix M_current by SVD (as in

Algorithm 4.1). In order to determine which matching pairs are inliers, the feature points in current frame are transformed by M_current to previous camera coordinate, then each spatial error of the matching pair can be calculated by using Euclidean distance, which can be written as:

( ) (

) (

)

* * * *

, , 1 , , 1 , , 1 , , 1

( _{i k}, _{i k} ) _{i k} _{i k} _{i k} _{i k} _{i k} _{i k}

d PFP PFP x x y y z z

ε = ₋ = − ₋ + − ₋ + − ₋ (4.14)

* , ,

1 1

i k i k

current

PFP PFP

 =M  

   

  (4.15)

where PFP_{i k}^*_, =(x^*_{i k}_, ,y^*_{i k}_, ,z^*_{i k}_, )^T is the coordinate of PFP which is transformed by _{i k}_,

current

M as shown in Equation (4.15). If the Euclidean distance between PFP_{i k}^*_, and

, 1

PFPi k₋ is less than a predefine threshold, it is considered to be an inlier. By doing so, the transformation matrix and inlier set are calculated in certain step. Then, after

Iteration

N times iteration, there will be N_Iteration number of transformation matrix M _i and the corresponding inlier setsInliers_i, where i=0...N_Iteration. The best transformation

Mbest is determined by choosing the transformation matrix M with the largest number _i of inliers Inliers_i, which can be done iteratively without storing all the trying models

M with its inliers i Inliers_i as in the line 11-21 in Algorithm 4.2.

Algorithm 4.2: Feature-based localization with RANSAC outlier rejection algorithm Input:

Feature Matching Pair FP_{k k}_, ₋₁

Set the number of iterationsN_Iteration Number of sample pairsNsample

Output:

Transformation MatrixM_{t t}_{, 1}₋

Inlier listInliers

1: Initialize best transformation matrixM_best ← φ 2: Initialize best inliers setInliers_best ← φ

3: Initialize best inliers numberNInliers best_ =0

4: Calculate the number of matching pairs N_Match=size FP( _{k k}, ₋1)

5: for iteration = 1:N_Iteration

6: SampleSet ←Randomly select N_sample matching pairs inFP_{k k}_,₋₁

7: Compute Current Transform Matrix M_current from SampleSet using SVD (Algorithm 4.1)

8: Inliers_Current ← φ

9: for all FP_{k k}_, ₋_1,_i in FP_{k k}_, ₋₁

10: Compute the spatial error between M_current PFP_{i k}_, and PFP_{i k}_, ₋₁ by using Euclidean distance, that is, ε=Euclidean M( _currentPFP_{i k}, ,PFP_{i k}, ₋1)

11: if ε<threshold

12: InliersCurrent←InliersCurrent+FPk k, ₋1,i

13: end if 14: end for

15: Count the number of Inliers_Current, NInliers current_{_} =size Inliers( Current)

16: Recomputing the transformation matrix M_current by Inliers_Current using SVD (Algorithm 4.1)

17: if NInliers current_{_} >NInliers best_{_}

18: M_best ←M_current 19: Inliers_best←Inliers_Current 20: NInliers best_ =NInliers current_

21: end if 22: end for 23: Mk k, ₋1←Mbest

For better understanding, 2-th and 3-th frame data are taken as (k−1)-th and

-th

k steps for example. The given relative motion is a pure translation along x-axis with positive 0.1m without any rotation, and therefore the transformation matrix is as

follows:

Figure 4.11 (a)-(b) are two target images captured from right CCD of the stereo camera with corresponding feature points in (k−1)-th and k-th steps, respectively. The green circles indicate the features in the (k−1)-th step, while the red dots represent the features in the k-th step. Figure 4.11(c) and (d) show the projecting result of k-th step features from 3D coordinate to image plane by pin-hole model with different transformation matrixes estimated in two iterations. For better estimation iteration case,

features in k-th step are transformed by the following matrix:

1.0000 0.0019 0.0033

Most of the red dots align to the green circles as shown in Figure 4.11(c) and (e). The aligning pairs are equivalence to the 3D spatial inliers since the projection by camera

在文檔中利用立體攝影機進行色彩與深度感測以達成三維環境重建及物體追蹤 (頁 46-0)