Chapter 5 The Avoidance of Degeneracy in Geometry Transfer
5.3 Robust Trifocal Tensor for Structure from Motion Estimation
5.3.2 The combination of the SfM and trifocal Tensor
National Chengchi University
EXIF tags to be the initial focal length and estimate the focal length. The corresponding points are obtained with Lowe's [60] Scale Invariant Feature Transform (SIFT). RANSAC is then used to estimate the fundamental matrix of each image pair and to filter probable outliers. We think that two-view geometry alone is not enough to remove unreasonable corresponding points. In our experiments, we also find that the reprojection errors of this system remain large.
Therefore, we use the trifocal tensor, that is, three-view geometry, to improve the parameter estimation and obtain more reliable camera parameters that can be used in other research.
5.3.2 The combination of the SfM and trifocal Tensor
If the images are not calibrated, we can obtain the positions and orientations of the cameras with the SfM technique, which uses the detected corresponding points to estimate the camera parameters. Snavely et al. [85][86][87] presented a method that improves on traditional SfM; the system is called Bundler. Its source code is open and it is very popular. It was used in the Photo Tourism project [87], and Furukawa [35] also recommends it to users. However, we tested this system on ground truth data sets [90][103] to estimate its errors. After computing the 3D point coordinates from the ground truth corresponding points and projecting them into the images, we obtain reprojection errors of more than a dozen pixels.
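As a minimal sketch of how such a reprojection error can be measured, the following Python code projects 3D points with a 3x4 projection matrix and averages the pixel distance to the observed points. The function names and the toy camera are illustrative assumptions; they are not part of Bundler or of the ground truth data sets.

```python
import math

def project(P, X):
    """Project a 3D point X = (x, y, z) with a 3x4 projection matrix P (list of rows)."""
    Xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    return (u / w, v / w)

def reprojection_error(P, points_3d, points_2d):
    """Mean Euclidean distance (in pixels) between projected and observed points."""
    errors = [math.dist(project(P, X), x) for X, x in zip(points_3d, points_2d)]
    return sum(errors) / len(errors)

# Hypothetical toy camera: focal length 100 px, principal point (50, 50),
# identity rotation and zero translation.
P = [[100.0, 0.0, 50.0, 0.0],
     [0.0, 100.0, 50.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
points_3d = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
points_2d = [(50.0, 50.0), (103.0, 54.0)]  # the second observation is 5 px off

mean_err = reprojection_error(P, points_3d, points_2d)
print(mean_err)
```

In this toy case the first point reprojects exactly and the second is off by 5 pixels, so the mean error is 2.5 pixels.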
Figure 5-4 The outlier and the transfer by trifocal tensor.
Thus, we think that the system is still not stable: Bundler uses SIFT to detect corresponding points, which include some outliers, and it filters the outliers only by estimating the fundamental matrix with RANSAC. This approach does not consider the degeneracy that affects the epipolar transfer. Therefore, we re-estimate the trifocal tensor and remove the corresponding points that do not conform to the geometric relationship, in order to increase the accuracy.
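The geometric check can be illustrated with the standard point-line-point transfer of the trifocal tensor. The sketch below assumes canonical cameras with P1 = [I | 0], builds the tensor from P2 and P3, and transfers a pair (x, x') to the third view. The camera matrices, the helper names, and the choice of a vertical line through x' are assumptions for illustration only; in our method the tensor is estimated from correspondences, not from known cameras.

```python
def trifocal_from_cameras(P2, P3):
    """T[i][j][k] = a_i^j * b_4^k - a_4^j * b_i^k, assuming P1 = [I | 0]."""
    return [[[P2[j][i] * P3[k][3] - P2[j][3] * P3[k][i]
              for k in range(3)] for j in range(3)] for i in range(3)]

def transfer_point(T, x, x_prime):
    """Transfer (x, x') from views 1 and 2 to view 3: x''^k = x^i l'_j T_i^{jk}."""
    xh = (x[0], x[1], 1.0)
    # Vertical line through x' (an assumption). The transfer is degenerate
    # if this line coincides with the epipolar line of x in view 2.
    line = (1.0, 0.0, -x_prime[0])
    xpp = [sum(xh[i] * line[j] * T[i][j][k] for i in range(3) for j in range(3))
           for k in range(3)]
    return (xpp[0] / xpp[2], xpp[1] / xpp[2])

# Hypothetical cameras: view 2 translated along x, view 3 translated along y.
P2 = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P3 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
T = trifocal_from_cameras(P2, P3)

# The 3D point (0.5, 0.2, 2) projects to x, x', x'' below.
x, x_prime, x_true = (0.25, 0.1), (0.75, 0.1), (0.25, 0.6)
inlier = transfer_point(T, x, x_prime)        # close to x_true
outlier = transfer_point(T, x, (0.95, 0.1))   # corrupted x' gives a shifted point
print(inlier, outlier)
```

A correct pair transfers onto the measured point in the third view, while a corrupted x' moves the transferred point away from it; this distance is the transfer error used to flag suspicious correspondences.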
Figure 5-4 shows how to use the trifocal tensor to filter outliers that the epipolar transfer cannot handle. Assume that we use five views and the corresponding point (x, x', x'', x''', x''''). If the corresponding point x' in I2 is an outlier, we filter it as follows. We estimate the trifocal tensor of every three views. For example, after estimating the trifocal tensor of I1, I2, and I3, the corresponding points (x, x') in I1 and I2 can be transferred to x'' in I3, and the transferred point can be compared with the corresponding point in I3. Because x' is an outlier, the error is large, so we mark each of the three images once. As another example, the trifocal tensor of I1, I3, and I4 is estimated in the same way. Because none of the three corresponding points is an outlier, the point transferred from (x, x'') is close to the corresponding point x''' in I4. The error is therefore small and we do not mark the images. The remaining triples are handled analogously.
Whenever a triple of views contains I2, the error is large and the triple is marked, so I2 accumulates the largest number of marks. We can therefore assume that x' in I2 is probably an outlier and filter it. In this way, we can remove the errors caused by the degeneracy of the epipolar transfer.
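The marking scheme can be sketched as a simple vote over all image triples. In the following Python code, `transfer_error` is a hypothetical callable standing in for the trifocal transfer error of one correspondence in a given triple, and the threshold value is an assumption chosen for the toy example.

```python
from itertools import combinations

def vote_outlier_view(transfer_error, n_views, threshold):
    """Mark all three views of every triple whose transfer error exceeds the
    threshold; return the marks per view and the index of the most-marked view."""
    marks = [0] * n_views
    for triple in combinations(range(n_views), 3):
        if transfer_error(triple) > threshold:
            for view in triple:
                marks[view] += 1
    suspect = max(range(n_views), key=lambda v: marks[v])
    return marks, suspect

# Toy stand-in: every triple containing view index 1 (I2) has a large error.
def toy_error(triple):
    return 40.0 if 1 in triple else 0.5

marks, suspect = vote_outlier_view(toy_error, n_views=5, threshold=3.0)
print(marks, suspect)
```

With five views there are ten triples, six of which contain I2. I2 therefore collects six marks while every other view collects three, so the point in I2 is the one selected for removal.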
The procedure for strengthening the geometric relations of SfM with the trifocal tensor is described below and in Figure 5-5:
Step 1: Corresponding points. Automatically detect the corresponding points in the images and initialize the projection matrices with Bundler and SIFT [60].
Step 2: Parameter estimation. Use the proposed parameter estimation described in section 3 and the corresponding points obtained from Bundler to estimate the trifocal tensor of every three images. If a corresponding point is a probable outlier in many image triples, the probability that it really is an outlier is very high. We therefore remove such points and retain the more reliable ones.
Step 3: SfM estimation. Use Bundler to re-estimate the projection matrices from the corresponding points processed in step 2.
Step 4: Stop condition. Use the corresponding points and the projection matrices obtained in step 3 to estimate the reprojection errors. The procedure stops when the error is smaller than a threshold trp. Otherwise, steps 2 and 3 are repeated until this constraint is satisfied; if the error always stays above trp, the iteration with the smallest reprojection error is selected.
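Steps 2 to 4 can be sketched as the following loop. The callables `filter_outliers`, `run_sfm`, and `mean_error` are hypothetical placeholders for the trifocal filtering, the Bundler re-estimation, and the reprojection error measurement; the iteration cap `max_iter` is also an assumption, since only the threshold trp is specified above.

```python
def refine_cameras(matches, cameras, filter_outliers, run_sfm, mean_error,
                   t_rp, max_iter=20):
    """Repeat filtering (step 2) and SfM (step 3) until the error drops below
    t_rp (step 4); otherwise return the best iteration seen so far."""
    best_error, best_cameras = mean_error(matches, cameras), cameras
    for _ in range(max_iter):
        matches = filter_outliers(matches)       # step 2
        cameras = run_sfm(matches)               # step 3
        error = mean_error(matches, cameras)     # step 4
        if error < best_error:
            best_error, best_cameras = error, cameras
        if error < t_rp:
            break
    return best_error, best_cameras

# Toy stand-ins: "matches" are residuals, filtering drops the largest one,
# "SfM" just reports how many matches remain, and the error is their mean.
toy_matches = [0.5, 0.4, 30.0, 25.0]
best_error, best_cameras = refine_cameras(
    toy_matches, len(toy_matches),
    filter_outliers=lambda m: sorted(m)[:-1],
    run_sfm=lambda m: len(m),
    mean_error=lambda m, c: sum(m) / len(m),
    t_rp=1.0)
print(best_error, best_cameras)
```

Keeping the best iteration separately from the current one reflects the fallback in step 4: the error does not have to decrease in every iteration.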
Figure 5-5 Left: Flowchart of using ROPSO to estimate trifocal tensor. Right: Flowchart of the proposed method.
5.4 Experiments
Following the description in section 5.3 and taking the Herz-Jesu-P8 data set (Figure 3-5 (d)) as an example, we use Bundler to obtain the camera parameters of the seven images and use the ground truth data to estimate their reprojection errors. The average error is about 21.5 pixels, which is quite large. Therefore, we apply our proposed method, which uses the trifocal tensor to handle the degeneracy problem, to obtain more accurate camera parameters.
Figure 5-6 Each iteration reprojection errors for Herz-Jesu-P8 data set (one panel per view, 1st to 7th view).
The proposed method described in section 5.3.2 is implemented. To evaluate the errors, we use the corresponding points provided by Bundler to estimate the residuals and compare them with the reprojection errors calculated from the ground truth data. We find that the residuals and the reprojection errors are almost directly proportional, so the residuals can be used to judge the iterative stop condition. As shown in Figure 5-6, re-estimating the trifocal tensor and filtering the outliers is effective in obtaining more accurate camera parameters. For example, in the eleventh iteration the reprojection errors are reduced to less than 1 pixel. Figure 5-7 shows the original reprojection points in different views produced by Bundler for the Herz-Jesu-P8 data set. The reprojection points are the projections of the 3D points estimated from the corresponding points and the projection matrices given by Bundler. The results show several erroneous points. Figure 5-8 shows the refined results, whose reprojection points are more accurate.
Figure 5-7 The original reprojection points in different views for Herz-Jesu-P8 data set.
Figure 5-8 The refined reprojection points in different views for Herz-Jesu-P8 data set.
Taking the Fountain-P11 data set (Figure 3-5 (c)) as an example, the results are similar, as shown in Figure 5-9. This shows that our idea is feasible. The experiment also shows that the reprojection errors do not decrease monotonically with the number of iterations; in each iteration, the error may increase or decrease irregularly. The reason is that filtering the outliers in an iteration may also remove some inliers. If the remaining outliers still have a large influence on the result and the fewer inliers cannot counterbalance them, the errors increase. Nevertheless, the method clearly filters the outliers, and after several iterations the errors are reduced substantially. This also shows that the trifocal tensor can indeed handle the degeneracy problem of the epipolar transfer. Of course, the method may fail to reach acceptable results after several iterations, depending on the corresponding points obtained from Bundler. Even in that situation, we can still select the best result among all the iterations.
Figure 5-9 Each iteration reprojection errors for Fountain-P11 data set (one panel per view, 1st to 6th view; horizontal axis: Iterations, 0 to 9; vertical axis: Reprojection error (pixels); series: Residual and Ground truth).
Figure 5-10 shows the original reprojection points in different views produced by Bundler for the Fountain-P11 data set. The results show several erroneous points. Figure 5-11 shows the refined results, whose reprojection points are more accurate.
Figure 5-10 The original reprojection points in different views for Fountain-P11 data set.
Figure 5-11 The refined reprojection points in different views for Fountain-P11 data set.
We also show another result on the dinoRing data set from [82]. Figure 5-12 shows the original reprojection points in different views produced by Bundler. As with the data sets above, the results show several erroneous points. Figure 5-13 shows the refined results, whose reprojection points are also more accurate.
Figure 5-12 The original reprojection points in different views for dinoRing data set.
Figure 5-13 The refined reprojection points in different views for dinoRing data set.
5.5 Summary
In this chapter, we discuss the influence of degeneracy on the epipolar transfer and present a mechanism that estimates an accurate trifocal tensor and uses it to improve SfM, yielding more reliable camera parameters for possible future use. We strongly recommend not using the epipolar transfer in multi-view research. The experimental results show that the robust trifocal tensor successfully improves the accuracy of the SfM estimation: the average reprojection errors are reduced from more than a dozen pixels to less than 1 pixel.
Chapter 6
The Avoidance of Degeneracy in Texture Matching
6.1 Introduction
Traditionally, there are many regular block matching algorithms for corresponding point matching. They can be used for motion estimation in video compression [4]. For example, exhaustive search (ES), also known as full search, is the most computationally expensive algorithm of all. Among fast block matching algorithms, three step search (TSS) is one of the earliest methods. New three step search (NTSS) [57] improves TSS by using a center-biased search pattern and a halfway-stop technique to reduce the computational cost. Simple and efficient search (SES) [61], based on the unimodal error surface assumption, is another fast block matching algorithm that improves TSS. Four step search (4SS) [70] is similar to NTSS in that it uses a center-biased search pattern and a halfway-stop technique. Diamond search (DS) [116] is similar to 4SS, but it uses a diamond-shaped search pattern instead of a square one and does not restrict the number of steps. Adaptive rood pattern search (ARPS) [66] uses an adjustable rood-shaped search pattern and the predicted motion vector for pattern matching.
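To make the baseline concrete, the following Python sketch implements exhaustive (full) search with the sum of absolute differences (SAD) as the matching cost. The choice of SAD, the function names, and the synthetic frames are assumptions for illustration; the fast algorithms listed above differ mainly in which candidate positions they visit.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
               for r in range(size) for c in range(size))

def full_search(ref, cur, x, y, size, search_range):
    """Return the displacement (dx, dy) of the block in ref that best matches
    the block at (x, y) in cur, testing every position in the search window."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if 0 <= rx <= w - size and 0 <= ry <= h - size:
                cost = sad(cur, ref, x, y, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]

# Synthetic 8x8 frames: cur is ref shifted so that cur[r][c] == ref[r + 1][c + 2].
ref = [[3 * r * r + 7 * c * c + r * c for c in range(8)] for r in range(8)]
cur = [[ref[r + 1][c + 2] if r + 1 < 8 and c + 2 < 8 else 0
        for c in range(8)] for r in range(8)]

motion = full_search(ref, cur, x=2, y=2, size=3, search_range=2)
print(motion)
```

The search visits all (2 * search_range + 1)^2 candidate positions, which is why full search is the most expensive of the listed methods.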
However, regular block matching algorithms use a regular block and do not consider how the texture changes between views: the pattern may be distorted in different views.
Therefore, more and more research focuses on multiple view stereo matching, such as the popular patch-based reconstruction methods proposed by Furukawa and Ponce [32][33]. They construct a patch in 3D space and project it into the images to calculate the similarity of the correspondences. This solves the problem of matching blocks that are inconsistent across views. It considers not only the traditional photometric consistency but also the visibility consistency, and thus emphasizes multiple view geometry to improve the accuracy of the corresponding points. Many researchers follow this idea, such as Bradley et al. [8], Frahm et al. [30], Goesele et al. [37], and Hiep et al. [43].
Much of the foregoing research can be found in popular benchmarks. Steven et al. [81][82] published a benchmark to compare many multiple view reconstruction methods. The authors use the Stanford Spherical Gantry [55] to construct the capture environment. The benchmark evaluates completeness and accuracy against ground truth data. They provide two data sets for researchers to download and experiment with, and researchers can submit their results for comparison with other methods.
Strecha et al. [90][91] published another popular benchmark, which uses the cumulative error distribution and completeness to evaluate 3D model reconstruction methods. They also provide several data sets including multiple view images, camera parameters, and ground truth data obtained from a laser scanner. Researchers can download them and submit their results for comparison.
However, if the camera parameters are not accurate enough, some correspondences may not be detected because of projection errors. Therefore, Furukawa and Ponce [31] used bundle adjustment to reduce the reprojection errors and refine the camera parameters. The optimized parameters improve the accuracy of the 3D model.
Patch-based similarity matching estimates the similarity with reprojected patches, which are not regular blocks. It produces fewer errors in multiple views with different rotation angles. However, the similarity is affected when the distances from the views to the object differ greatly. We call this the patch-based matching scaling problem.
When the difference between two capture distances is large, the reprojected patches fall into two situations, gathered patch projection and distributed patch projection, as shown in Figure 6-1. Take a three-by-three block as an example. When the patch projection is gathered, the number of distinct projection pixels decreases and pixels repeat. When the patch projection is distributed, the way the projected color values are extracted influences the measurement. Therefore, we need a robust algorithm to solve this problem.
To solve the patch-based matching scaling problem described in section 2.4.3.1, this chapter puts emphasis on two parts. First, we use the mutually supported transform and dynamic Gaussian filtering to improve patch-based matching. Second, an integrated similarity function that considers both photometric consistency and multiple view geometry is used to increase the reliability of the similarity measurement. This avoids matching the correspondences according to only two images and raises the precision. The details are described as follows.
6.2 Mutually Supported Patch Matching
We know that the patch-based matching scaling problem makes the patch projection either gathered or distributed, which causes the similarity measurement to be inaccurate. The gathered patch projection has the more serious influence. As shown in Figure 6-1, because many points of a patch may project onto the same pixels, the projected patch carries less information and the similarity measurement becomes unreasonable. Even if the corresponding point is correct, the similarity value may be small, which causes wrong correspondences to be selected.
Figure 6-1 Patch projection gathered (left) and distributed (right), and mutually supported transform.
Because it is not easy to work with a projected patch that carries less information, we propose the mutually supported transform, which interchanges the projection image and the reference image. This turns every gathered patch projection into a distributed one. As shown in Figure 6-1, when I1 is the reference image and I2 is the projection image, the projected patch is gathered. After the mutually supported transform, I2 becomes the reference image and the gathered patch projection becomes a distributed one.
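One possible way to decide when to interchange the two images is sketched below in Python. The criterion used here, namely that a projection is gathered when several patch points fall onto the same pixel, is our reading of Figure 6-1 and is an assumption, as are the function names and the toy scale factors.

```python
import math

def distinct_pixels(points):
    """Number of different pixels hit by a list of projected (u, v) points."""
    return len({(math.floor(u + 0.5), math.floor(v + 0.5)) for u, v in points})

def choose_reference(projected_in_i2):
    """Given the projections into I2 of a patch defined on the pixel grid of I1,
    keep I1 as reference unless the projection is gathered, then switch to I2."""
    gathered = distinct_pixels(projected_in_i2) < len(projected_in_i2)
    return "I2" if gathered else "I1"

# A three-by-three patch on the pixel grid of I1.
patch_i1 = [(u, v) for v in (10, 11, 12) for u in (10, 11, 12)]

# Toy projections: I2 sees the surface at half scale (gathered)
# or at double scale (distributed).
half_scale = [(0.5 * u, 0.5 * v) for u, v in patch_i1]
double_scale = [(2.0 * u, 2.0 * v) for u, v in patch_i1]

print(choose_reference(half_scale), choose_reference(double_scale))
```

At half scale the nine patch points fall onto only four pixels, so the roles are interchanged; at double scale all nine pixels are distinct and I1 stays the reference.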
6.2.1 Dynamic Gaussian filtering
Because the mutually supported transform changes a gathered patch projection into a distributed one, the distributed case is the other problem we must handle. We use dynamic Gaussian filtering to decide the patch pixel values.
Figure 6-2 Dynamic Gaussian filtering.
As shown in Figure 6-2, take a three-by-three patch as an example. When the projected patch in multiple views is distributed, we can obtain the image pixel values from the neighboring information. The pixel values are calculated by equation (6-2), where G is the Gaussian function in equation (6-1). Each projection pixel value is the Gaussian-weighted average of its neighboring pixels. IN is the set of pixels in the grid centered on the projection pixel. $gray(pixel_x, pixel_y)$ is the pixel value at the image coordinate $(pixel_x, pixel_y)$ and ranges from 0 to 255. $proj_{x_{ij}}$ and $proj_{y_{ij}}$ are the x-coordinate and y-coordinate of the projection pixel in the ith row and jth column.
\[
B_{ij} = \frac{\sum_{pixel \in IN} gray(pixel_x, pixel_y)\, G(pixel_x - proj_{x_{ij}},\; pixel_y - proj_{y_{ij}})}{\sum_{pixel \in IN} G(pixel_x - proj_{x_{ij}},\; pixel_y - proj_{y_{ij}})} \tag{6-2}
\]
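The weighted average of equation (6-2) can be sketched in Python as follows. The isotropic Gaussian with a fixed `sigma`, the square neighborhood of radius `radius`, and the function names are assumptions for illustration; the normalization constant of the Gaussian is omitted because it cancels in the ratio.

```python
import math

def gaussian(dx, dy, sigma):
    """Unnormalized isotropic 2D Gaussian weight (assumed form of G)."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def filtered_value(image, proj_x, proj_y, radius=1, sigma=1.0):
    """Gaussian-weighted average of the pixels around a sub-pixel projection."""
    cx, cy = math.floor(proj_x + 0.5), math.floor(proj_y + 0.5)
    num = den = 0.0
    for py in range(cy - radius, cy + radius + 1):
        for px in range(cx - radius, cx + radius + 1):
            if 0 <= py < len(image) and 0 <= px < len(image[0]):
                w = gaussian(px - proj_x, py - proj_y, sigma)
                num += image[py][px] * w
                den += w
    return num / den

# Toy 5x5 images: a constant image and a horizontal ramp (gray value 10 * column).
flat = [[100 for _ in range(5)] for _ in range(5)]
ramp = [[10 * c for c in range(5)] for _ in range(5)]

v_flat = filtered_value(flat, 2.3, 1.7)
v_center = filtered_value(ramp, 2.0, 2.0)
v_shifted = filtered_value(ramp, 2.3, 2.0)
print(v_flat, v_center, v_shifted)
```

On the ramp, a projection exactly on a pixel center returns that pixel's value, while a projection shifted to the right returns a larger value, which is the sub-pixel behavior that a distributed projection needs.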
6.2.2 Integrated similarity function
Research on image correspondence matching has traditionally focused on two-view images. In this research, we use an integrated similarity function based on photometric consistency and multiple view geometry consistency to measure the similarity and obtain more accurate correspondences.
Equation (6-3) is the integrated similarity function. PSCORE is the score of the 3D point Pt calculated by this function and ranges from -1 to 1. r is the reference image, M is the set of images in multiple views excluding the reference image, and m is one element of M. $S(b_r, b_m)$ is the photometric consistency function. In this research, we use zero-mean normalized cross-correlation as the photometric consistency function. Its value ranges from -1 to 1; the larger the value, the higher the similarity. $b_r$ and $b_m$ are the matching blocks in the reference image and in the multiple view images. $\sin(\theta^{Pt}_{C_r, C_m})$ is the robust geometry factor and ranges from 0 to 1 (the angle must be less than 180 degrees, otherwise the two views cannot capture the same point). The larger the factor, the more robust the geometry. $\theta^{Pt}_{C_r, C_m}$ is the angle $\angle C_r\,Pt\,C_m$, where $C_r$ is the camera center of the reference image, $C_m$ is the camera center of a multiple view image, and Pt is the 3D point. We think that a patch on the surface is more reliable when it is visible from all the different views and at large angles. The robust geometry factor expresses this goal.
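The two ingredients of the integrated similarity function can be sketched in Python as follows. The zero-mean normalized cross-correlation and the sine of the angle at Pt follow the definitions above; the way they are combined here, as the mean over M of their product, is only an assumed form that is consistent with the stated ranges, and the function names and toy data are hypothetical.

```python
import math

def zncc(a, b):
    """Zero-mean normalized cross-correlation of two equally sized blocks."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(p * q for p, q in zip(da, db)) / denom if denom > 0 else 0.0

def sin_angle(c_r, c_m, pt):
    """Sine of the angle C_r-Pt-C_m, computed from the cross product."""
    u = [a - p for a, p in zip(c_r, pt)]
    v = [a - p for a, p in zip(c_m, pt)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return math.hypot(*cross) / (math.hypot(*u) * math.hypot(*v))

def integrated_score(pt, c_r, b_r, others):
    """Assumed combination: mean over M of ZNCC times the geometry factor.
    `others` is a list of (camera center C_m, block b_m) pairs."""
    terms = [zncc(b_r, b_m) * sin_angle(c_r, c_m, pt) for c_m, b_m in others]
    return sum(terms) / len(terms)

# Toy data: one view at 90 degrees with a perfectly correlated block,
# one view at 45 degrees with a perfectly anti-correlated block.
pt, c_r, b_r = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), [1.0, 2.0, 3.0, 4.0]
others = [((0.0, 1.0, 0.0), [2.0, 4.0, 6.0, 8.0]),
          ((1.0, 1.0, 0.0), [4.0, 3.0, 2.0, 1.0])]

score = integrated_score(pt, c_r, b_r, others)
print(score)
```

Weighting each photometric term by the sine of the viewing angle lets wide-baseline views contribute more than nearly parallel ones, which matches the stated intent of the robust geometry factor.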
6.3 Image Selection and Filtering
In section 2.4.3.3, we discussed that the similarity is affected when the object surface is curved rather than planar. According to Figure 2-24, when the matching point lies on a curved object, the larger the angle between the two rays from the two camera centers to the surface, the larger the influence. Therefore, when the similarity is small for views with a large angle and large for views with a small angle, the surface around the matching point can be considered curved. In this case there are two options: first, select the images with small angles; second, ignore this matching point.
6.4 Experiments
In this section, we build our experimental environment and obtain the camera parameters. The mutually supported patch method, dynamic Gaussian filtering, and the integrated similarity function are used to solve the patch-based matching scaling problem. The details are as follows.
6.4.1 Experiment environment
First, we build an experimental environment in 3ds Max to verify our theory. Figure 6-3 shows a lateral view of the experimental environment in 3ds Max. We place 109 cameras around a cube at different viewpoints and capture the corresponding images. The cameras are distributed on a hemisphere formed by six concentric circles.
Figure 6-3 3ds Max experiment environment.
In the experimental environment, the distance between each of the 109 camera centers and the origin of the world coordinate system is 1.5 meters. We use the simulated environment to capture a sequence of images in multiple views.
We have two methods to get the camera parameters. First, we can get cameras’ focal