Histograms of Oriented Gradients (HOG) - Dense Matching by Conditional Random Field

Chapter 3 Proposed System

3.2 Dense Matching by Conditional Random Field

(a) (b) Figure 3-12 CRF model

3.2.2 Histograms of Oriented Gradients (HOG)

The Histograms of Oriented Gradients (HOG) descriptor is based on evaluating normalized local histograms of image gradient orientations over a grid of cells. In Figure 3-13(a), an input patch is divided into several small cells, and each cell holds the gradient magnitudes accumulated along 8 orientations. Figure 3-14 shows an example of HOG for human detection [19, 20], where the HOG descriptor is used to describe a human pattern. Because the CRF model already contains spatial information, we can merge all histograms into a single grid. This reduces the dimension of the proposed modified HOG feature.
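As a rough illustration of the cell histogram described above, the following Python sketch accumulates gradient magnitudes into 8 orientation bins for one patch. The central-difference gradient, the signed 0 to 360 degree binning, and the L2 normalization are assumptions made for the example; the function name `orientation_histogram` is hypothetical and is not taken from the proposed system.

```python
import math

def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations for one cell.

    `patch` is a list of rows of grey values. Central differences are used
    for the gradient (an assumption; the text does not name a filter).
    """
    hist = [0.0] * n_bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)  # in [0, 2*pi)
            bin_index = int(angle / (2.0 * math.pi) * n_bins) % n_bins
            hist[bin_index] += magnitude
    # L2-normalize so the descriptor is less sensitive to contrast changes.
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist

# A horizontal ramp: intensity grows with x, so every gradient points along +x
# and all the weight falls into the first orientation bin.
ramp = [[float(x) for x in range(6)] for _ in range(6)]
print(orientation_histogram(ramp))
```

Merging all cells into one grid, as the text proposes, would amount to calling such a function once on the whole patch instead of once per cell.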

(a) HOG cells (b) Eight orientations

Figure 3-13 HOG descriptor

Figure 3-14 Human’s HOG descriptor [20]

3.2.3 Modified HOG for distortion and occlusion handling

Occlusion is one of the major challenges in the correspondence problem, particularly in the wide-baseline case. In wide-baseline cases, the occluded areas are larger and more distorted than in short-baseline cases. Here, we design the modified HOG descriptor to detect occluded regions and ignore them. In other words, we only extract features from un-occluded regions.

We modify the HOG descriptor by separating the original descriptor into two parts, a left part and a right part. Figure 3-15(b) shows the modified HOG (MHOG) descriptor.

When the modified descriptor is placed on an occlusion boundary, the difference between the left part and the right part will be large. We define a function O, expressed in Equation 3.7, to describe this property.

Figure 3-15 Modified HOG descriptor

Now we apply the MHOG descriptor, which is designed to handle occlusion effects, to the first term of the CRF model. We can write the data cost function as

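$$\varphi_p(d^p) = \left[\mathrm{HOG}(p) - \mathrm{HOG}(p + d^p)\right] \cdot O(p) \qquad (3.6)$$

$$O(p) = \begin{cases} 1, & \text{if } \left|\text{left-part difference} - \text{right-part difference}\right| < \gamma \\ \Gamma, & \text{if } \left|\text{left-part difference} - \text{right-part difference}\right| \ge \gamma \end{cases} \qquad (3.7)$$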

In Equation 3.6, O(p) is an additional penalty function. We compare the feature vector at pixel p in the left image with the feature vector at the corresponding pixel $p + d^p$ in the right image. If the left-part feature distance and the right-part feature distance of MHOG are inconsistent (e.g., one distance is small and the other is large), O(p) multiplies the data cost in Equation 3.6 by $\Gamma$. In this case, pixel p is more likely to be labeled as occluded. Figure 3-16 shows an example.
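The penalty mechanism of Equations 3.6 and 3.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the L2 distance between descriptors, the threshold `gamma=0.5`, and the helper names are hypothetical, while `big_gamma=2.0` mirrors the value shown in Figure 3-16(b). Each descriptor is represented as a pair (left half, right half) of histograms.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def occlusion_penalty(left_p, right_p, left_q, right_q, gamma, big_gamma):
    """O(p): 1 when both halves change by a similar amount, big_gamma otherwise."""
    left_diff = l2_distance(left_p, left_q)
    right_diff = l2_distance(right_p, right_q)
    return 1.0 if abs(left_diff - right_diff) < gamma else big_gamma

def data_cost(desc_p, desc_q, gamma=0.5, big_gamma=2.0):
    """Descriptor distance scaled by the occlusion penalty.

    desc_p, desc_q: (left_half, right_half) histograms at pixel p and at the
    candidate match p + d^p. gamma is a hypothetical threshold.
    """
    (left_p, right_p), (left_q, right_q) = desc_p, desc_q
    full_distance = l2_distance(left_p + right_p, left_q + right_q)
    penalty = occlusion_penalty(left_p, right_p, left_q, right_q, gamma, big_gamma)
    return full_distance * penalty

p = ([1.0, 0.0], [0.0, 1.0])
q_similar = ([0.8, 0.2], [0.2, 0.8])    # both halves change by the same amount
q_boundary = ([1.0, 0.0], [1.0, 0.0])   # only the right half changes
print(data_cost(p, q_similar), data_cost(p, q_boundary))
```

In the second case only one half of the descriptor disagrees, so the cost is multiplied by the larger factor, which makes that match less attractive to the optimizer.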

(a) Input images

(b) Γ = 2 (c) Γ = 3

Figure 3-16 Result comparison for different Γ values

Figure 3-17 SIFT flow [21]

In comparison, as shown in Figure 3-17, SIFT flow cannot detect occluded regions. Hence, occluded regions are forced to match the most similar region.

3.2.4 Matching points by MHOG with spatial constraints

Now we apply the MHOG descriptor to the CRF model. The adopted MHOG descriptor uses a 24 × 24 window around each pixel, with two grids and eight orientations. The first term $\varphi_p$ (data cost) of the CRF formula is a distance measure between the MHOG features of Pixel p in one image and Pixel $p + d^p$ in the other image. Since MHOG provides only statistical information about the gradients, it cannot give us enough spatial information. Hence, we use the second term to compensate for the lack of spatial information. In our experiments, we found that if a larger window is used, the MHOG feature component along the vertical orientation is similar across different views. This is because we have placed the cameras at similar heights without rolling. This characteristic is used when we match feature pairs across images.

In summary, our CRF model has two terms. The first term, $\varphi_p(d^p)$, was discussed above: our system chooses the disparity values that minimize the modified HOG distance over the whole image. The second term of the CRF model is a regularization term, which regulates the first term’s choice. It constrains adjacent pixels with similar intensity values to choose similar disparity values, because if a pixel and its neighbors have similar colors, they probably come from the same surface of an object.

The regularization term of the CRF model is designed to be $\alpha \sum_{pq} G(I^p - I^q) \cdot \lVert d^p - d^q \rVert_2$, where p and q are neighbors and $I^p$ is the intensity value of Pixel p. In this formula, $G(I^p - I^q) = 1$ if $|I^p - I^q| < C$ (in our experiments, C is chosen to be 30). The last factor is the L2 norm of the disparity difference between $d^p$ and $d^q$. Thus, if two neighboring pixels have similar intensity values, they are encouraged to choose similar disparity values.
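A minimal Python sketch of this regularization term follows. It assumes scalar disparities (so the L2 norm reduces to an absolute difference), a 4-connected neighborhood, and that G is 0 when the intensity difference reaches C; the text only states the case G = 1, so the zero branch is an assumption, as are the function name and the toy data.

```python
def smoothness_cost(intensity, disparity, alpha=1.0, c=30):
    """Sum of alpha * G(I^p - I^q) * |d^p - d^q| over 4-connected neighbors.

    G is 1 when |I^p - I^q| < c. Setting G to 0 otherwise is an assumption.
    """
    rows, cols = len(intensity), len(intensity[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            # Visit the right and lower neighbor so each pair is counted once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny >= rows or nx >= cols:
                    continue
                similar = abs(intensity[y][x] - intensity[ny][nx]) < c
                gate = 1.0 if similar else 0.0
                total += alpha * gate * abs(disparity[y][x] - disparity[ny][nx])
    return total

intensity = [[100, 105, 200],
             [100, 105, 200]]
disparity = [[5, 5, 12],
             [5, 6, 12]]
print(smoothness_cost(intensity, disparity))  # 2.0
```

The jump from disparity 5 to 12 costs nothing here because it coincides with a strong intensity edge, whereas the isolated value 6 is penalized through its two similar-intensity neighbors.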

Figure 3-18 Matching result based on the CRF model

3.2.5 Model formulation

In Table 1 we summarize all related formulas of our CRF model.

Table 1 Model formulation

After we have matched the images and obtained pixel correspondences, we can use the correspondences to estimate the relative positions of the cameras. In other words, we can estimate each camera’s extrinsic parameters $R_n$ and $T_n$, where $R_n$ is a rotation matrix, $T_n$ is a translation vector, and $x' = Rx + T$. Figure 3-19 shows an example of the camera geometry.

After estimating R and T, we can calculate the 3D-to-2D projection matrix $P = K[R \mid T]$, where K is the camera’s intrinsic matrix. We can then use the projection matrix P to find a 3D point cloud. Here, we apply the bundle adjustment algorithm used in [5] to build the 3D model.
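To make the role of the projection matrix concrete, the sketch below builds a 3×4 matrix from K, R and T and projects one 3D point to pixel coordinates. The intrinsic values, the identity rotation and the translation are made-up numbers for illustration only, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def projection_matrix(K, R, T):
    """Build the 3x4 matrix P = K [R | T]."""
    Rt = [R[i] + [T[i]] for i in range(3)]  # append T as a fourth column
    return mat_mul(K, Rt)

def project(P, X):
    """Project a 3D point X = (x, y, z) to pixel coordinates (u, v)."""
    Xh = [[X[0]], [X[1]], [X[2]], [1.0]]    # homogeneous column vector
    u, v, w = (row[0] for row in mat_mul(P, Xh))
    return (u / w, v / w)

# Hypothetical camera: 800-pixel focal length, principal point (320, 240).
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
T = [0.5, 0.0, 0.0]

P = projection_matrix(K, R, T)
print(project(P, (0.0, 0.0, 2.0)))  # (520.0, 240.0)
```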

Figure 3-19 Camera geometry

3.3.2 Bundle adjustment

The bundle adjustment in [22] simultaneously refines the 3D coordinates that describe the scene geometry and the relative motion parameters. It is based on the mathematical expression in Equation 3.8.

$$\min_{P_k,\,X} \sum_{X} \sum_{k=1}^{m} D(x_k, P_k X) \qquad (3.8)$$

where $x_k$ is the image point in Image k that corresponds to the 3D point X (in our system, m = 3), $P_k X$ is the 3D point X projected to Image k via the projection matrix $P_k$, and D(x, y) is the L2 norm distance between x and y. After minimizing the sum of the projection errors of all points, we can estimate the 3D point set that coarsely describes the view geometry in front of the cameras. The details will be explained in the next section. After the estimation of the 3D point set, we use the spectral matting algorithm in [23, 24] to refine the 3D point set. In Figure 3-20, we illustrate the search for a 3D point cloud that minimizes the projection errors between the projected points on the 2D image and the original inlier image points. To suppress outlier points, we use the RANdom SAmple Consensus (RANSAC) method in [25] to identify the inlier points.
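The quantity that bundle adjustment minimizes can be illustrated with a short Python sketch that evaluates the summed reprojection error of one 3D point over m = 3 views. The three cameras below are hypothetical normalized cameras (identity intrinsics, pure sideways translation), and the function names are not from the original system; an actual bundle adjustment would additionally run a nonlinear optimizer over this cost.

```python
import math

def project(P, X):
    """Project 3D point X with a 3x4 projection matrix P (list of rows)."""
    Xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(p * x for p, x in zip(row, Xh)) for row in P)
    return (u / w, v / w)

def reprojection_error(cameras, observations, X):
    """Sum over views k of D(x_k, P_k X), with D the L2 distance."""
    return sum(math.dist(x_k, project(P_k, X))
               for P_k, x_k in zip(cameras, observations))

# Three hypothetical cameras shifted along the x axis.
cameras = [[[1.0, 0.0, 0.0, t],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]] for t in (-1.0, 0.0, 1.0)]

X_true = (0.5, 0.2, 4.0)
observations = [project(P, X_true) for P in cameras]

print(reprojection_error(cameras, observations, X_true))           # 0.0
print(reprojection_error(cameras, observations, (0.5, 0.2, 5.0)))  # larger
```

The error vanishes at the point that generated the observations and grows as the candidate point moves away, which is the property the minimization exploits.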

Figure 3-20 Estimation of 3D points by using bundle adjustment

3.3.3 Random sample consensus (RANSAC)

Random sample consensus is an iterative method for estimating the parameters of a mathematical model from a set of observed data that may contain outliers. We use RANSAC to estimate the camera model that fits the largest number of inlier matches across images. Here, RANSAC is used to calculate the projection matrix for the first iteration of bundle adjustment.

Figure 3-21 (a) Data set with many outliers (b) Fitted line with RANSAC
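The line-fitting case of Figure 3-21 gives a compact way to show the RANSAC loop. The sketch below is a generic illustration, not the camera-model estimation used in the system: the iteration count, the inlier threshold, the fixed random seed and the vertical-residual test are all assumptions chosen for the example.

```python
import random

def ransac_line(points, n_iters=200, threshold=0.5, seed=0):
    """Fit y = a*x + b to points containing outliers.

    Returns ((a, b), inliers) for the sampled line with the most inliers.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        # 1. Draw a minimal sample: two points define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2. Count the points that agree with this candidate line.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < threshold]
        # 3. Keep the candidate supported by the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(1.0, 9.0), (4.0, -3.0), (7.0, 30.0)]

model, inliers = ransac_line(points)
print(model, len(inliers))
```

For camera estimation the same loop applies, with the two-point line replaced by a minimal set of matches and the residual replaced by a reprojection distance.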

3.3.4 3D point set refinement by matting refinement

In Section 3.3.2, we build a 3D point cloud by the bundle adjustment process. However, the outcomes are still not good enough: there are some false matches and several unmatched regions at occluded pixels. We assume that our matches around keypoints are accurate and build a confidence map CM(x, y). In Equation 3.9, if pixel (x, y) is near a keypoint, has an inlier match (picked by RANSAC), and is not occluded, then the confidence value at that pixel is equal to one; otherwise, the confidence value is zero.

$$CM(x, y) = \begin{cases} 1, & \text{if } (x, y) \in \text{Near keypoints} \cap \text{Inliers} \cap \text{Not occluded} \\ 0, & \text{otherwise} \end{cases} \qquad (3.9)$$

Figure 3-22 (a) Locations of keypoints (b) Confidence map
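Equation 3.9 is an intersection of three conditions, which the following Python sketch spells out on a tiny grid. How close a pixel must be to count as "near" a keypoint is not specified in the text, so the square neighborhood with `radius` is a hypothetical choice, as are the mask layout and the function name.

```python
def confidence_map(shape, keypoints, inlier_mask, occluded_mask, radius=2):
    """Return CM as a list of rows holding 0 or 1.

    shape: (rows, cols); keypoints: list of (x, y) positions;
    inlier_mask / occluded_mask: lists of rows of booleans.
    """
    rows, cols = shape
    cm = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            near = any(abs(x - kx) <= radius and abs(y - ky) <= radius
                       for kx, ky in keypoints)
            # All three conditions of the equation must hold.
            if near and inlier_mask[y][x] and not occluded_mask[y][x]:
                cm[y][x] = 1
    return cm

inlier_mask = [[True] * 4 for _ in range(3)]
inlier_mask[0][0] = False            # an outlier match
occluded_mask = [[False] * 4 for _ in range(3)]
occluded_mask[1][2] = True           # an occluded pixel

cm = confidence_map((3, 4), [(1, 1)], inlier_mask, occluded_mask, radius=1)
for row in cm:
    print(row)
```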

After we estimate the depth information at the locations whose confidence value is one, we use the spectral matting method in [23, 26] to propagate the estimated depth information to the unknown regions by minimizing the cost function in Equation 3.10, where L is a Laplacian matrix and the term added to $\alpha^{T} L \alpha$ involves a prior map (the estimated depth map at confidence-one pixels). In Equation 3.12, $\Sigma_k$ is a 3×3 covariance matrix, $\mu_k$ is a 3×1 mean vector of the colors in a window $w_k$, and $I_3$ is the 3×3 identity matrix.

The matting affinity in Equation 3.11 is defined by the pixels’ colors and their spatial relations (Equation 3.12).

$$\alpha^{T} L\, \alpha + [\text{prior-map term}] \qquad (3.10)$$

$$L = D - W \qquad (3.11)$$

$$W(i, j) = \sum_{k \mid (i, j) \in w_k} \frac{1}{|w_k|}\left(1 + (I_i - \mu_k)^{T}\left(\Sigma_k + \frac{\epsilon}{|w_k|} I_3\right)^{-1}(I_j - \mu_k)\right) \qquad (3.12)$$

In Equation 3.11, D is a diagonal matrix whose elements are defined as $D(i, i) = \sum_j W(i, j)$, and W is the matrix of affinities W(i, j), each a sum over the windows $w_k$ that contain both pixels i and j. The prior map holds the estimated depth information at the pixels with CM(x, y) = 1. After we solve the optimization problem in Equation 3.10, we obtain the depth values of all pixels. A result of the aforementioned process is shown in Figure 3-23.
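The propagation idea can be illustrated on a one-dimensional row of pixels. The sketch below is a simplified stand-in, not the matting Laplacian of Equation 3.12: it uses a plain intensity-weighted graph Laplacian between neighboring pixels, and it assumes a quadratic data term, lam * sum_i c_i (a_i - b_i)^2, whose minimizer satisfies (L + lam C) a = lam C b. The weight `lam`, the bandwidth `sigma` and all function names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def propagate_depth(intensity, prior, confidence, lam=100.0, sigma=10.0):
    """Spread depth from confident pixels along a 1D chain of pixels."""
    n = len(intensity)
    # Affinity W between neighbors: close to 1 for similar intensities.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        diff = intensity[i] - intensity[i + 1]
        W[i][i + 1] = W[i + 1][i] = math.exp(-(diff ** 2) / (2.0 * sigma ** 2))
    # System matrix (D - W) + lam * C, with D(i, i) = sum_j W(i, j).
    A = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        A[i][i] = sum(W[i]) + lam * confidence[i]
    rhs = [lam * confidence[i] * prior[i] for i in range(n)]
    return solve(A, rhs)

intensity = [50.0, 52.0, 51.0, 200.0, 201.0, 199.0]  # two flat regions
prior = [2.0, 0.0, 0.0, 0.0, 0.0, 8.0]               # depth known at both ends
confidence = [1, 0, 0, 0, 0, 1]

depth = propagate_depth(intensity, prior, confidence)
print([round(d, 3) for d in depth])
```

Each flat region inherits the depth of its confident pixel, and the strong intensity edge in the middle prevents the two depths from blending.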

Figure 3-23 Refined depth map (sky and ground removed)

Figure 3-24 Overview of spectral matting

3.4 Summary

In our system, we combine local and global approaches to find the correspondence between image pairs. First, we use a randomized forest to obtain a rough correspondence of image keypoints. With this initial correspondence, we propagate the keypoints’ correspondence information to the entire image by solving a global optimization problem. Moreover, the CRF model can correct some errors by using spatial constraints. After we have obtained the disparity values of all pixels, we use the RANSAC method to find the inliers whose distribution best fits the camera geometry. After that, we use these inlier disparities to build a 3D point cloud and refine the point cloud by spectral matting. Finally, we convert the point cloud to a mesh of triangles and build the 3D model of the captured scene. The overall system flow is shown below.

Figure 3-25 System flow

Chapter 4 Experimental Results

In this chapter, we show some experimental results. In Section 4.1, we demonstrate matched correspondences, first the outcomes of random forest and then the results after CRF processing. We find that random forest only matches keypoints coarsely. In the next stage, the proposed CRF model corrects random forest’s results and also deals with pixels lacking texture information.

4.1 Matched results

4.1.1 Matched results by using random forest

In Figure 4-1, an outdoor case, we observe that random forest can match some keypoints correctly. Figure 4-2 shows the correctly matched points. The match rate is about 30 out of 100.

Figure 4-1 Matched feature pairs by random forest

Figure 4-2 Some correct matches

In Figure 4-3 and Figure 4-4, we show the matching of some high-texture keypoints. Red circles mark incorrect correspondences.

Figure 4-3 Matched result by random forest (part 1)

Figure 4-4 Matched result by random forest (part 2)

In Figure 4-5, there are many repetitive patterns and regions with texture. As expected, the performance in this case is not good. However, after the CRF processing, we can still obtain many correctly matched pairs.

Figure 4-5 Matching result of random forest

4.1.2 Matched results after CRF correction

Figure 4-6 Result of CRF

Figure 4-7 Some matches of Figure 4-6

Figure 4-8 Some matches of Figure 4-6

In Figure 4-6, two images with different exposure levels are matched by our system (random forest + CRF). We find that the CRF can correct some erroneous correspondences (some erroneous correspondences produced by random forest are shown in Figure 4-1) and can propagate a keypoint’s correspondence information to its neighbors. (Here, we only show some keypoint correspondences.)

Next, we compare the SIFT matching result with that of our system. Figure 4-10 shows some correspondence results of our system. Note that the cameras are separated very widely and that there are many low-texture areas and repetitive patterns in the two images. Matching in this case is very difficult for the SIFT approach.

Figure 4-9 SIFT result (many mismatched points over the region with repetitive patterns)

(a) Our system (RF + CRF) can match repetitive patterns better than SIFT

(b) Our system (RF + CRF) can match repetitive patterns better than SIFT

Figure 4-10 Some matched results of our system

In our system, we use a CRF model to match points with spatial constraints. Hence, we can identify similar patterns at different locations. In brief, we can match keypoints better than the SIFT method.

Figure 4-11 Some matched results of our system

4.2 3D Reconstruction

This section demonstrates some results of 3D reconstruction. Figure 4-12 shows our reconstruction results. The reconstruction fails in the white low-texture regions, since these areas contain too little information for accurate matching.

(a) (b) (c)

(d) (e) (f)

Figure 4-12 (a), (b), (c) Input images (d), (e), (f) Corresponding results of 3D reconstruction

Figure 4-13 Multi-view database “fountain” [7]

In Figure 4-13, we show the multi-view database “fountain” [7]. In [27], Hiep, V.H. and Keriven, R. reconstructed 3D models by using 11 stereo images; their results are shown in Figure 4-16. Here, we only use 3 of the 11 images (the 3 images with red frames in Figure 4-13) to build the 3D model. As illustrated in Figure 4-14, the object shape is clearly visible in our reconstructed model.

(a)

(b)

Figure 4-14 (a) Input images (b) Reconstructed 3D model

Figure 4-15 First row: short-baseline input images. Second row: reconstructed 3D models

In Figure 4-15, we use short-baseline image pairs to build the 3D model. Here, we choose the images with green frames in Figure 4-13. We find that the details are reconstructed more cleanly.

Figure 4-16 Hiep, V.H. and Keriven, R.’s results

Chapter 5 Conclusion

In this thesis, we proposed a wide-baseline stereo matching approach for 3D reconstruction. Our system can match images under different illumination conditions, with the viewpoint orientation changing from about −40° to 40°. Based on random forest and conditional random field, the system can deal with large perspective distortions and occlusions. In addition, the proposed system can deal with images containing repetitive patterns. Matching similar patterns by using only gradient features usually cannot achieve robust and accurate results. In our approach, we add the spatial information around each pixel to our matching strategy. With this arrangement, we can match similar patterns and distorted patterns well. In the last stage of our system, we use RANdom SAmple Consensus (RANSAC) and Bundle Adjustment (BA) to reconstruct a 3D point cloud. Finally, we refine the 3D point cloud by using the spectral matting method and convert the point cloud to a mesh of triangles that represents the 3D model of the captured scene.


Chapter 6 References

[1] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: exploring photo collections in 3D,” ACM Transactions on Graphics (TOG), pp. 835-846.

[2] V. Lepetit and P. Fua, “Keypoint recognition using randomized trees,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1465-1479, 2006.

[3] V. Lepetit, P. Lagger, and P. Fua, “Randomized trees for real-time keypoint recognition,” IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 775-781, 2005.

[4] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[5] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” Computer Vision–European Conference on Computer Vision, pp. 404-417, 2006.

[6] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.

[7] C. Strecha, Multi-view evaluation, http://cvlab.epfl.ch/data, 2008.

[8] Y. Boykov and V. Kolmogorov, “Computing geodesics and minimal surfaces via graph cuts,” IEEE International Conference on Computer Vision, pp. 26-33, 2003.

[9] C. Strecha, R. Fransens, and L. Van Gool, “Combined depth and outlier estimation in multi-view stereo,” IEEE International Conference on Computer Vision, pp. 2394-2401, 2003.

[10] D. Pritchard and W. Heidrich, “Cloth motion capture,” Computer Graphics Forum, pp. 263-271.

[11] Y. Amit and D. Geman, “Shape quantization and recognition with randomized trees,” Neural Computation, vol. 9, no. 7, pp. 1545-1588, 1997.

[12] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.

[13] T. K. Ho, “Random decision forests,” in Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, pp. 278-282, 1995.

[14] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3-42, 2006.

[15] J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” IEEE International Conference on Computer Vision, vol. 2, pp. 1470-1477, 2003.

[16] M. Calonder et al., “BRIEF: Binary robust independent elementary features,” Computer Vision–ECCV 2010, pp. 778-792, 2010.

[17] T. Kanade and M. Okutomi, “A stereo matching algorithm with an adaptive window: Theory and experiment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 9, pp. 920-932, 1994.

[18] N. Salman and M. Yvinec, “High resolution surface reconstruction from overlapping multiple-views,” Proceedings of the 25th Annual Symposium on Computational Geometry, pp. 104-105.

[19] N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” Computer Vision–European Conference on Computer Vision, pp. 428-441, 2006.

[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886-893, 2005.

[21] C. Liu et al., “SIFT flow: Dense correspondence across different scenes,” Computer Vision–European Conference on Computer Vision, pp. 28-42, 2008.

[22] B. Triggs et al., “Bundle adjustment—a modern synthesis,” Vision Algorithms: Theory and Practice, pp. 153-177, 2000.

[23] A. Levin, A. Rav-Acha, and D. Lischinski, “Spectral matting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1699-1712, 2008.

[24] A. Levin, A. Rav-Acha, and D. Lischinski, “Spectral matting,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[25] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.

[26] A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natural image matting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228-242, 2008.

[27] V. H. Hiep et al., “Towards high-resolution large-scale multi-view stereo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 10, pp. 1430-1437.
