Citation: Haoyu Lu, Yuqi Huo, Mingyu Ding, Nanyi Fei, Zhiwu Lu. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval. Machine Intelligence Research, vol. 20, no. 4, pp. 569–582, 2023. https://doi.org/10.1007/s11633-022-1386-4
[1] H. Chen, G. G. Ding, X. D. Liu, Z. J. Lin, J. Liu, J. G. Han. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 12652–12660, 2020. DOI: 10.1109/CVPR42600.2020.01267.
|
[2] K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212–228, 2018. DOI: 10.1007/978-3-030-01225-0_13.
|
[3] H. Y. Lu, M. Y. Ding, N. Y. Fei, Y. Q. Huo, Z. W. Lu. LGDN: Language-guided denoising network for video-language modeling. In Proceedings of the 36th Conference on Neural Information Processing Systems, New Orleans, USA, 2022.
|
[4] O. Vinyals, A. Toshev, S. Bengio, D. Erhan. Show and tell: A neural image caption generator. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3156–3164, 2015. DOI: 10.1109/CVPR.2015.7298935.
|
[5] X. Jia, E. Gavves, B. Fernando, T. Tuytelaars. Guiding the long-short term memory model for image caption generation. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 2407–2415, 2015. DOI: 10.1109/ICCV.2015.277.
|
[6] J. Johnson, A. Gupta, L. Fei-Fei. Image generation from scene graphs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 1219–1228, 2018. DOI: 10.1109/CVPR.2018.00133.
|
[7] T. T. Qiao, J. Zhang, D. Q. Xu, D. C. Tao. MirrorGAN: Learning text-to-image generation by redescription. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1505–1514, 2019. DOI: 10.1109/CVPR.2019.00160.
|
[8] A. Karpathy, L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3128–3137, 2015. DOI: 10.1109/CVPR.2015.7298932.
|
[9] Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-TExt representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: 10.1007/978-3-030-58577-8_7.
|
[10] R. Kiros, R. Salakhutdinov, R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. [Online], https://arxiv.org/abs/1411.2539, 2014.
|
[11] L. W. Wang, Y. Li, S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 5005–5013, 2016. DOI: 10.1109/CVPR.2016.541.
|
[12] Y. Q. Huo, M. L. Zhang, G. Z. Liu, H. Y. Lu, Y. Z. Gao, G. X. Yang, J. Y. Wen, H. Zhang, B. G. Xu, W. H. Zheng, Z. Z. Xi, Y. Q. Yang, A. W. Hu, J. M. Zhao, R. C. Li, Y. D. Zhao, L. Zhang, Y. Q. Song, X. Hong, W. Q. Cui, D. Y. Hou, Y. Y. Li, J. Y. Li, P. Y. Liu, Z. Gong, C. H. Jin, Y. C. Sun, S. Z. Chen, Z. W. Lu, Z. C. Dou, Q. Jin, Y. Y. Lan, W. X. Zhao, R. H. Song, J. R. Wen. WenLan: Bridging vision and language by large-scale multi-modal pre-training. [Online], https://arxiv.org/abs/2103.06561, 2021.
|
[13] N. Y. Fei, Z. W. Lu, Y. Z. Gao, G. X. Yang, Y. Q. Huo, J. Y. Wen, H. Y. Lu, R. H. Song, X. Gao, T. Xiang, H. Sun, J. R. Wen. Towards artificial general intelligence via a multimodal foundation model. Nature Communications, vol. 13, no. 1, Article number 3094, 2022. DOI: 10.1038/s41467-022-30761-2.
|
[14] H. Y. Lu, N. Y. Fei, Y. Q. Huo, Y. Z. Gao, Z. W. Lu, J. R. Wen. COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15671–15680, 2022. DOI: 10.1109/CVPR52688.2022.01524.
|
[15] Y. L. Wu, S. H. Wang, G. L. Song, Q. M. Huang. Learning fragment self-attention embeddings for image-text matching. In Proceedings of the 27th ACM International Conference on Multimedia, ACM, Nice, France, pp. 2088–2096, 2019. DOI: 10.1145/3343031.3350940.
|
[16] H. W. Diao, Y. Zhang, L. Ma, H. C. Lu. Similarity reasoning and filtration for image-text matching. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1218–1226, 2021. DOI: 10.1609/aaai.v35i2.16209.
|
[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[18] Z. R. Wu, Y. J. Xiong, S. X. Yu, D. H. Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 3733–3742, 2018. DOI: 10.1109/CVPR.2018.00393.
|
[19] A. van den Oord, Y. Z. Li, O. Vinyals. Representation learning with contrastive predictive coding. [Online], https://arxiv.org/abs/1807.03748, 2018.
|
[20] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio. Learning deep representations by mutual information estimation and maximization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
|
[21] C. X. Zhuang, A. Zhai, D. Yamins. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 6001–6011, 2019. DOI: 10.1109/ICCV.2019.00610.
|
[22] P. Bachman, R. D. Hjelm, W. Buchwalter. Learning representations by maximizing mutual information across views. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 15509–15519, 2019.
|
[23] T. Chen, S. Kornblith, M. Norouzi, G. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pp. 1597–1607, 2020.
|
[24] J. B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. H. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, M. Valko. Bootstrap your own latent: A new approach to self-supervised learning. In Proceedings of the 34th Conference on Neural Information Processing Systems, pp. 21271–21284, 2020.
|
[25] X. L. Chen, K. M. He. Exploring simple Siamese representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 15750–15758, 2021. DOI: 10.1109/CVPR46437.2021.01549.
|
[26] D. Y. She, K. Xu. Contrastive self-supervised representation learning using synthetic data. International Journal of Automation and Computing, vol. 18, no. 4, pp. 556–567, 2021. DOI: 10.1007/s11633-021-1297-9.
|
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, USA, pp. 5998–6008, 2017.
|
[28] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 740–755, 2014. DOI: 10.1007/978-3-319-10602-1_48.
|
[29] P. Young, A. Lai, M. Hodosh, J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, vol. 2, no. 1, pp. 67–78, 2014. DOI: 10.1162/tacl_a_00166.
|
[30] S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 91–99, 2015.
|
[31] K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
|
[32] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 580–587, 2014. DOI: 10.1109/CVPR.2014.81.
|
[33] X. Wei, T. Z. Zhang, Y. Li, Y. D. Zhang, F. Wu. Multi-modality cross attention network for image and sentence matching. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10938–10947, 2020. DOI: 10.1109/CVPR42600.2020.01095.
|
[34] P. Anderson, X. D. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6077–6086, 2018. DOI: 10.1109/CVPR.2018.00636.
|
[35] R. Krishna, Y. K. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017. DOI: 10.1007/s11263-016-0981-7.
|
[36] Z. H. Wang, X. H. Liu, H. S. Li, L. Sheng, J. J. Yan, X. G. Wang, J. Shao. CAMP: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 5763–5772, 2019. DOI: 10.1109/ICCV.2019.00586.
|
[37] Y. Zhang, H. C. Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 707–723, 2018. DOI: 10.1007/978-3-030-01246-5_42.
|
[38] J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
|
[39] K. M. He, H. Q. Fan, Y. X. Wu, S. N. Xie, R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9726–9735, 2020. DOI: 10.1109/CVPR42600.2020.00975.
|
[40] M. U. Gutmann, A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, vol. 13, pp. 307–361, 2012.
|
[41] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
|
[42] Y. H. Liu, M. Ott, N. Goyal, J. F. Du, M. Joshi, D. Q. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. [Online], https://arxiv.org/abs/1907.11692, 2019.
|
[43] V. Nair, G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, pp. 807–814, 2010.
|
[44] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov. DeViSE: A deep visual-semantic embedding model. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 2121–2129, 2013.
|
[45] M. X. Tan, Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 6105–6114, 2019.
|
[46] Q. Zhang, Z. Lei, Z. X. Zhang, S. Z. Li. Context-aware attention network for image-text retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 3533–3542, 2020. DOI: 10.1109/CVPR42600.2020.00359.
|
[47] J. C. Chen, H. X. Hu, H. Wu, Y. N. Jiang, C. H. Wang. Learning the best pooling strategy for visual semantic embedding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 15789–15798, 2021. DOI: 10.1109/CVPR46437.2021.01553.
|
[48] W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
|
[49] Z. Y. Dou, Y. C. Xu, Z. Gan, J. F. Wang, S. H. Wang, L. J. Wang, C. G. Zhu, P. C. Zhang, L. Yuan, N. Y. Peng, Z. C. Liu, M. Zeng. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 18145–18155, 2022. DOI: 10.1109/CVPR52688.2022.01763.
|
[50] X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: 10.1007/978-3-030-58577-8_8.
|
[51] Z. Ji, H. R. Wang, J. G. Han, Y. W. Pang. Saliency-guided attention network for image-sentence matching. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 5753–5762, 2019. DOI: 10.1109/ICCV.2019.00585.
|
[52] W. Li, C. Gao, G. C. Niu, X. Y. Xiao, H. Liu, J. C. Liu, H. Wu, H. F. Wang. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 2592–2607, 2021. DOI: 10.18653/v1/2021.acl-long.202.
|
[53] Y. X. Wang, H. Yang, X. M. Qian, L. Ma, J. Lu, B. Li, X. Fan. Position focused attention network for image-text matching. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, pp. 3792–3798, 2019. DOI: 10.24963/ijcai.2019/526.
|
[54] F. Yan, K. Mikolajczyk. Deep correlation for matching images and text. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3441–3450, 2015. DOI: 10.1109/CVPR.2015.7298966.
|
[55] Y. L. Song, M. Soleymani. Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1979–1988, 2019. DOI: 10.1109/CVPR.2019.00208.
|