Although we have put effort into improving feature matching with deep learning-based methods, the results are still unsatisfactory. We believe that more carefully designed data-driven algorithms could eliminate the overhead of RANSAC and improve robustness. In addition, the correspondences between 2D pixels and 3D depth are still obtained from 2D-2D feature matching. In our view, features extracted with 3D spatial information have the potential to improve image-based localization.