Citation: | Zhenyu Li, Zehui Chen, Xianming Liu, Junjun Jiang. DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation. Machine Intelligence Research, vol. 20, no. 6, pp. 837–854, 2023. https://doi.org/10.1007/s11633-023-1458-0 |
[1] |
K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
|
[2] |
H. Fu, M. M. Gong, C. H. Wang, K. Batmanghelich, D. C. Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 2002–2011, 2018. DOI: 10.1109/CVPR.2018.00214.
|
[3] |
J. H. Lee, M. K. Han, D. W. Ko, I. H. Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation, [Online], Available: https://arxiv.org/abs/1907.10326, 2019.
|
[4] |
S. F. Bhat, I. Alhashim, P. Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 4008–4017, 2021. DOI: 10.1109/CVPR46437.2021.00400.
|
[5] |
R. Ranftl, A. Bochkovskiy, V. Koltun. Vision transformers for dense prediction. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12159–12168, 2021. DOI: 10.1109/ICCV48922.2021.01196.
|
[6] |
A. Saxena, S. H. Chung, A. Y. Ng. Learning depth from single monocular images. In Proceedings of the 18th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 1161–1168, 2005.
|
[7] |
O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Munich, Germany, pp. 234–241, 2015. DOI: 10.1007/978-3-319-24574-4_28.
|
[8] |
L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018. DOI: 10.1109/TPAMI.2017.2699184.
|
[9] |
H. S. Zhao, J. P. Shi, X. J. Qi, X. G. Wang, J. Y. Jia. Pyramid scene parsing network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6230–6239, 2017. DOI: 10.1109/CVPR.2017.660.
|
[10] |
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017.
|
[11] |
L. Huynh, P. Nguyen-Ha, J. Matas, E. Rahtu, J. Heikkilä. Guiding monocular depth estimation using depth-attention volume. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 581–597, 2020. DOI: 10.1007/978-3-030-58574-7_35.
|
[12] |
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[13] |
G. Yang, H. Tang, M. Ding, N. Sebe, E. Ricci. Transformers solve the limited receptive field for monocular depth prediction. In Proceedings of International Conference on Computer Vision, 2021.
|
[14] |
K. Yuan, S. P. Guo, Z. W. Liu, A. J. Zhou, F. W. Yu, W. Wu. Incorporating convolution designs into visual transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 559–568, 2021. DOI: 10.1109/ICCV48922.2021.00062.
|
[15] |
Z. H. Dai, H. X. Liu, Q. V. Le, M. X. Tan. CoAtNet: Marrying convolution and attention for all data sizes. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 3965–3977, 2021.
|
[16] |
T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, R. B. Girshick. Early convolutions help transformers see better. In Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 30392–30400, 2021.
|
[17] |
J. F. Dai, H. Z. Qi, Y. W. Xiong, Y. Li, G. D. Zhang, H. Hu, Y. C. Wei. Deformable convolutional networks. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 764–773, 2017. DOI: 10.1109/ICCV.2017.89.
|
[18] |
X. Z. Zhu, W. J. Su, L. W. Lu, B. Li, X. G. Wang, J. F. Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[19] |
A. Geiger, P. Lenz, C. Stiller, R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013. DOI: 10.1177/0278364913491297.
|
[20] |
N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision, Springer, Florence, Italy, pp. 746–760, 2012. DOI: 10.1007/978-3-642-33715-4_54.
|
[21] |
S. R. Song, S. P. Lichtenberg, J. X. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 567–576, 2015. DOI: 10.1109/CVPR.2015.7298655.
|
[22] |
T. W. Hui, C. C. Loy, X. O. Tang. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 353–369, 2016. DOI: 10.1007/978-3-319-46487-9_22.
|
[23] |
J. Lee, Y. Kim, S. Lee, B. Kim, J. Noh. High-quality depth estimation using an exemplar 3D model for stereo conversion. IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 7, pp. 835–847, 2015. DOI: 10.1109/TVCG.2015.2398440.
|
[24] |
J. X. Dong, J. S. Pan, J. S. Ren, L. Lin, J. H. Tang, M. H. Yang. Learning spatially variant linear representation models for joint filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8355–8370, 2022. DOI: 10.1109/TPAMI.2021.3102575.
|
[25] |
Z. Q. Zhang, X. G. Zhu, Y. W. Li, X. Q. Chen, Y. Guo. Adversarial attacks on monocular depth estimation, [Online], Available: https://arxiv.org/abs/2003.10315, 2020.
|
[26] |
D. Eigen, C. Puhrsch, R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 2366–2374, 2014.
|
[27] |
J. J. Hu, M. Ozay, Y. Zhang, T. Okatani. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Waikoloa, USA, pp. 1043–1051, 2019. DOI: 10.1109/WACV.2019.00116.
|
[28] |
X. B. Yang, L. Y. Zhou, H. Q. Jiang, Z. L. Tang, Y. B. Wang, H. J. Bao, G. F. Zhang. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 12, pp. 3446–3456, 2020. DOI: 10.1109/TVCG.2020.3023634.
|
[29] |
M. X. Tan, Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 6105–6114, 2019.
|
[30] |
G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 2261–2269, 2017. DOI: 10.1109/CVPR.2017.243.
|
[31] |
I. Alhashim, P. Wonka. High quality monocular depth estimation via transfer learning, [Online], Available: https://arxiv.org/abs/1812.11941, 2018.
|
[32] |
Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10002, 2021. DOI: 10.1109/ICCV48922.2021.00986.
|
[33] |
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 213–229, 2020. DOI: 10.1007/978-3-030-58452-8_13.
|
[34] |
S. X. Zheng, J. C. Lu, H. S. Zhao, X. T. Zhu, Z. K. Luo, Y. B. Wang, Y. W. Fu, J. F. Feng, T. Xiang, P. H. S. Torr, L. Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 6877–6886, 2021. DOI: 10.1109/CVPR46437.2021.00681.
|
[35] |
J. B. Jiao, Y. Cao, Y. B. Song, R. Lau. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 55–71, 2018. DOI: 10.1007/978-3-030-01267-0_4.
|
[36] |
W. H. Wang, E. Z. Xie, X. Li, D. P. Fan, K. T. Song, D. Liang, T. Lu, P. Luo, L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 548–558, 2021. DOI: 10.1109/ICCV48922.2021.00061.
|
[37] |
Z. Y. Li, Z. H. Chen, A. Li, L. J. Fang, Q. H. Jiang, X. M. Liu, J. J. Jiang, B. L. Zhou, H. Zhao. SimIPU: Simple 2D image and 3D point cloud unsupervised pre-training for spatial-aware visual representations. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 1500–1508, 2022. DOI: 10.1609/aaai.v36i2.20040.
|
[38] |
R. Garg, V. K. B.G., G. Carneiro, I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 740–756, 2016. DOI: 10.1007/978-3-319-46484-8_45.
|
[39] |
J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, A. Geiger. Sparsity invariant CNNs. In Proceedings of International Conference on 3D Vision, IEEE, Qingdao, China, pp. 11–20, 2017. DOI: 10.1109/3DV.2017.00012.
|
[40] |
J. X. Xiao, A. Owens, A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 1625–1632, 2013. DOI: 10.1109/ICCV.2013.458.
|
[41] |
A. Janoch, S. Karayev, Y. Q. Jia, J. T. Barron, M. Fritz, K. Saenko, T. Darrell. A category-level 3D object dataset: Putting the kinect to work. Consumer Depth Cameras for Computer Vision, A. Fossati, J. Gall, H. Grabner, X. F. Ren, K. Konolige, Eds., London, UK: Springer, pp. 141–165, 2013. DOI: 10.1007/978-1-4471-4640-7_8.
|
[42] |
MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark, [Online], Available: https://github.com/open-mmlab/mmsegmentation, 2020.
|
[43] |
A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097–1105, 2012.
|
[44] |
B. Y. Li, Y. Huang, Z. Y. Liu, D. P. Zou, W. X. Yu. StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12643–12653, 2021. DOI: 10.1109/ICCV48922.2021.01243.
|
[45] |
P. Ji, R. Z. Li, B. Bhanu, Y. Xu. MonoIndoor: Towards good practice of self-supervised monocular depth estimation for indoor environments. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 12767–12776, 2021. DOI: 10.1109/ICCV48922.2021.01255.
|
[46] |
I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N. Navab. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 4th International Conference on 3D Vision, IEEE, Stanford, USA, pp. 239–248, 2016. DOI: 10.1109/3DV.2016.32.
|
[47] |
W. H. Yuan, X. D. Gu, Z. Z. Dai, S. Y. Zhu, P. Tan. Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 3906–3915, 2022. DOI: 10.1109/CVPR52688.2022.00389.
|
[48] |
C. Godard, O. M. Aodha, G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6602–6611, 2017. DOI: 10.1109/CVPR.2017.699.
|
[49] |
A. Johnston, G. Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 4755–4764, 2020. DOI: 10.1109/CVPR42600.2020.00481.
|
[50] |
Y. K. Gan, X. Y. Xu, W. X. Sun, L. Lin. Monocular depth estimation with affinity, vertical pooling, and label enhancement. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 232–247, 2018. DOI: 10.1007/978-3-030-01219-9_14.
|
[51] |
W. Yin, Y. F. Liu, C. H. Shen, Y. L. Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 5683–5692, 2019. DOI: 10.1109/ICCV.2019.00578.
|
[52] |
D. Xu, X. Alameda-Pineda, W. L. Ouyang, E. Ricci, X. G. Wang, N. Sebe. Probabilistic graph attention network with conditional kernels for pixel-wise prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2673–2688, 2022. DOI: 10.1109/TPAMI.2020.3043781.
|
[53] |
S. Aich, J. M. U. Vianney, M. A. Islam, M. Kaur, B. Liu. Bidirectional attention network for monocular depth estimation. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Xi'an, China, pp. 11746–11752, 2021. DOI: 10.1109/ICRA48506.2021.9560885.
|
[54] |
S. Lee, J. Lee, B. Kim, E. Yi, J. Kim. Patch-wise attention network for monocular depth estimation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 1873–1881, 2021. DOI: 10.1609/aaai.v35i3.16282.
|
[55] |
S. Y. Qiao, Y. K. Zhu, H. Adam, A. Yuille, L. C. Chen. ViP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3996–4007, 2021. DOI: 10.1109/CVPR46437.2021.00399.
|
[56] |
X. T. Chen, X. J. Chen, Z. J. Zha. Structure-aware residual pyramid network for monocular depth estimation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 694–700, 2019.
|
[57] |
A. Kolesnikov, L. Beyer, X. H. Zhai, J. Puigcerver, J. Yung, S. Gelly, N. Houlsby. Big transfer (BiT): General visual representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 491–507, 2020. DOI: 10.1007/978-3-030-58558-7_29.
|
[58] |
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 1725–1732, 2014. DOI: 10.1109/CVPR.2014.223.
|
[59] |
J. Hu, L. Shen, G. Sun. Squeeze-and-excitation networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7132–7141, 2018. DOI: 10.1109/CVPR.2018.00745.
|
[60] |
S. Woo, J. Park, J. Y. Lee, I. S. Kweon. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 3–19, 2018. DOI: 10.1007/978-3-030-01234-2_1.
|
[61] |
T. Zhou, H. Z. Fu, G. Chen, Y. Zhou, D. P. Fan, L. Shao. Specificity-preserving RGB-D saliency detection. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 4661–4671, 2021. DOI: 10.1109/ICCV48922.2021.00464.
|
[62] |
W. B. Zhang, G. P. Ji, Z. Wang, K. R. Fu, Q. J. Zhao. Depth quality-inspired feature manipulation for efficient RGB-D salient object detection. In Proceedings of the 29th ACM International Conference on Multimedia, ACM, pp. 731–740, 2021. DOI: 10.1145/3474085.3475240.
|