Citation: Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao. Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey. Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023. https://doi.org/10.1007/s11633-022-1410-8

Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey

doi: 10.1007/s11633-022-1410-8
More Information
  • Author Bio:

    Xiao Wang received the B. Sc. degree in computer science and technology from West Anhui University, China in 2013, and the Ph. D. degree in computer science from Anhui University, China in 2019. From 2015 to 2016, he was a visiting student with the School of Data and Computer Science, Sun Yat-sen University, China. He was also a visiting student at the UBTECH Sydney Artificial Intelligence Centre, Faculty of Engineering, University of Sydney, Australia in 2019. He completed postdoctoral research at Peng Cheng Laboratory, China from April 2020 to April 2022. He is now an associate professor at the School of Computer Science and Technology, Anhui University, China. He serves as a reviewer for a number of journals and conferences, such as IEEE TCSVT, TIP, IJCV, CVIU, PR, CVPR, ICCV, AAAI, ECCV, ACCV, ACM-MM, WACV, ICLR, etc. He is a member of IEEE, ACM, CCF and CSIG. His research interests include computer vision, event-based vision, machine learning and pattern recognition. E-mail: xiaowang@ahu.edu.cn ORCID iD: 0000-0001-6117-6745

    Guangyao Chen received the B. Sc. degree in computer science and technology from Wuhan University, China in 2018. He is currently a Ph. D. candidate in computer application technology at the School of Computer Science, Peking University, China. His research interests include open-world discovery, out-of-distribution detection and model compression. E-mail: gy.chen@pku.edu.cn ORCID iD: 0000-0002-7255-2109

    Guangwu Qian received the Ph. D. degree in computer science from the College of Computer Science, Sichuan University, China in 2017. Afterwards, he worked as a researcher and algorithm team leader at the AI Research Laboratory, Imsight Technology Co., Ltd., Shenzhen, China. He is currently a postdoctoral fellow at Peng Cheng Laboratory, China. His research interests include conceptors, medical imaging and deep learning. E-mail: qiangw@pcl.ac.cn ORCID iD: 0000-0001-9241-1699

    Pengcheng Gao received the Ph. D. degree in computer applied technology from the University of Chinese Academy of Sciences, China in 2020. He is now engaged in postdoctoral research at Peng Cheng Laboratory, China. His research interests include deep learning, computer vision, facial landmark detection and facial expression analysis. E-mail: gaopch@pcl.ac.cn ORCID iD: 0000-0002-6692-341X

    Xiao-Yong Wei received the Ph. D. degree in computer science from City University of Hong Kong, China in 2009, and worked as a postdoctoral fellow at the University of California, Berkeley, USA from December 2013 to December 2015. He has been a professor and the head of the Department of Computer Science, Sichuan University, China since 2010. He is an adjunct professor at Peng Cheng Laboratory, China, and a visiting professor at the Department of Computing, Hong Kong Polytechnic University. He is a senior member of IEEE, and has served as an associate editor of Interdisciplinary Sciences: Computational Life Sciences since 2020, the program chair of ICMR 2019 and ICIMCS 2012, and a technical committee member of over 20 conferences such as ICCV, CVPR, SIGKDD, ACM MM, ICME, and ICIP. His research interests include multimedia computing, health computing, machine learning and large-scale data mining. E-mail: cswei@scu.edu.cn ORCID iD: 0000-0002-5706-5177

    Yaowei Wang received the Ph. D. degree in computer science from the Graduate University of Chinese Academy of Sciences, China in 2005. He is currently an associate professor with Peng Cheng Laboratory, China. He was a professor at the National Engineering Laboratory for Video Technology Shenzhen (NELVT), Peking University Shenzhen Graduate School, China in 2019. From 2014 to 2015, he worked as an academic visitor at the Vision Laboratory, Queen Mary University of London, UK. He worked at the Department of Electronics Engineering, Beijing Institute of Technology, China from 2005 to 2019. He is the author or coauthor of over 70 refereed journal and conference papers. He was the recipient of the second prize of the National Technology Invention Award in 2017 and the first prize of the CIE Technology Invention Award in 2015. His team was ranked as one of the best performers in the TRECVID CCD/SED tasks from 2009 to 2012 and in PETS 2012. He is a member of IEEE, CIE, CCF and CSIG. His research interests include machine learning, multimedia content analysis and understanding. E-mail: wangyw@pcl.ac.cn (Corresponding author) ORCID iD: 0000-0003-2197-9038

    Yonghong Tian received the Ph. D. degree in computer applied technology from the Institute of Computing Technology, Chinese Academy of Sciences, China in 2005. He is currently a Boya distinguished professor with the Department of Computer Science and Technology, Peking University, China, and is also the deputy director of the Artificial Intelligence Research Center, Peng Cheng Laboratory, China. Prof. Tian is the author or coauthor of over 200 technical articles in refereed journals such as IEEE TPAMI/TNNLS/TIP/TMM/TCSVT/TKDE/TPDS, ACM CSUR/TOIS/TOMM and conferences such as NeurIPS/CVPR/ICCV/AAAI/ACM MM/WWW. He was/is an associate editor of IEEE TCSVT (2018.1–), IEEE TMM (2014.8–2018.8), IEEE Multimedia Magazine (2018.1–), and IEEE Access (2017.1–). He co-initiated the IEEE International Conference on Multimedia Big Data (BigMM) and served as the TPC co-chair of BigMM 2015, and also served as the technical program co-chair of IEEE ICME 2015, IEEE ISM 2015 and IEEE MIPR 2018/2019, and the general co-chair of IEEE MIPR 2020 and ICME 2021. He is a steering member of IEEE ICME (2018–) and IEEE BigMM (2015–), and is a TPC member of more than ten conferences such as CVPR, ICCV, ACM KDD, AAAI, ACM MM and ECCV. He was the recipient of the Chinese National Science Foundation for Distinguished Young Scholars in 2018, two National Science and Technology Awards and three ministerial-level awards in China, and obtained the 2015 EURASIP Best Paper Award of the Journal on Image and Video Processing and the best paper award of IEEE BigMM 2018. He is a senior member of IEEE, CIE and CCF, and a member of ACM. His research interests include neuromorphic vision, brain-inspired computation and multimedia big data. E-mail: tianyh@pcl.ac.cn (Corresponding author) ORCID iD: 0000-0002-2978-5935

    Wen Gao received the Ph. D. degree in electronics engineering from The University of Tokyo, Japan in 1991. He is currently a Boya Chair professor in computer science at Peking University, China, and the director of Peng Cheng Laboratory, China. Before joining Peking University, he was a professor with Harbin Institute of Technology, China from 1991 to 1995. From 1996 to 2006, he was a professor at the Institute of Computing Technology, Chinese Academy of Sciences, China. He has authored or coauthored five books and over 1000 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, computer vision, multimedia retrieval, multimodal interfaces, and bioinformatics. He has served on the editorial boards of several journals, such as ACM CSUR, IEEE Transactions on Image Processing (TIP), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), and IEEE Transactions on Multimedia (TMM), and on the advisory and technical committees of professional organizations. He was the vice president of the National Natural Science Foundation of China (NSFC) from 2013 to 2018 and the president of the China Computer Federation (CCF) from 2016 to 2020. He is the deputy director of China National Standardization Technical Committees. He is an academician of the Chinese Academy of Engineering and a fellow of ACM. He chaired a number of international conferences, such as IEEE ICME 2007, ACM Multimedia 2009, and IEEE ISCAS 2013. His research interests include machine learning, multimedia content analysis and understanding. E-mail: wgao@pku.edu.cn ORCID iD: 0000-0002-8070-802X

  • Received Date: 2022-07-08
  • Accepted Date: 2022-12-13
  • Publish Online: 2023-06-06
  • Publish Date: 2023-08-01
  • With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT), generative pre-trained transformers (GPT), etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper provides new insights and helps fresh researchers track the most cutting-edge works. Specifically, we first introduce the background of multi-modal pre-training by reviewing conventional deep learning and pre-training works in natural language processing, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-trained models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network architectures, and knowledge-enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey.


  • This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
    The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
    To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
    1 https://www.flickr.com/
    2 https://www.mturk.com/
    3 https://www.wikipedia.org/
    4 https://www.instagram.com/
    5 Note that only half a year's results (the year 2022, from January to June) have been counted.
  • [1]
    A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097–1105, 2012.
    [2]
    J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern recognition, Miami, USA, pp. 248–255, 2009. DOI: 10.1109/CVPR.2009.5206848.
    [3]
    K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015. DOI: 10.48550/arXiv.1409.1556.
    [4]
    K. M. He, X. Y Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
    [5]
    C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, USA, pp. 4278–4284, 2017. DOI: 10.1609/aaai.v31i1.11231.
    [6]
    S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735.
    [7]
    J. Pennington, R. Socher, C. Manning. GloVe: Global vectors for word representation. In Proceedings of Conference on Empirical Methods in Natural Language Processing, ACL, Doha, Qatar, pp. 1532–1543, 2014. DOI: 10.3115/v1/D14-1162.
    [8]
    R. Kiros, Y. K. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing systems, Montreal, Canada, pp. 3294–3302, 2015.
    [9]
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing systems, Long Beach, USA, pp. 6000–6010, 2017.
    [10]
    J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
    [11]
    Q. L. Xia, H. Y. Huang, N. Duan, D. D. Zhang, L. Ji, Z. F. Sui, E. Cui, T. Bharti, M. Zhou. XGPT: Cross-modal generative pre-training for image captioning. In Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing, Springer, Qingdao, China, pp. 786–797, 2021. DOI: 10.1007/978-3-030-88480-2_63.
    [12]
    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 1877–1901, 2020.
    [13]
    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Q. Zhou, W. Li, P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, vol. 21, no. 1, Article number 140, 2020.
    [14]
    Z. L. Yang, Z. H. Dai, Y. M. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 5754–5764, 2019.
    [15]
    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
    [16]
    Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10012, 2021. DOI: 10.1109/ICCV48922.2021.00986.
    [17]
    X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: 10.1007/978-3-030-58577-8_8.
    [18]
    Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: 10.1007/978-3-030-58577-8_7.
    [19]
    Y. G. Li, F. Liang, L. C. Zhao, Y. F. Cui, W. L. Ouyang, J. Shao, F. W. Yu, J. J. Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [20]
    Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. [Online], Available: https://arxiv.org/abs/2004.00849, 2020.
    [21]
    C. Jia, Y. F. Yang, Y. Xia, Y. T. Chen, Z. Parekh, H. Pham, Q. Le, Y. H. Sung, Z. Li, T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 4904–4916, 2021.
    [22]
    J. Liu, X. X. Zhu, F. Liu, L. T. Guo, Z. J. Zhao, M. Z. Sun, W. N. Wang, H. Q. Lu, S. Y. Zhou, J. J. Zhang, J. Q. Wang. OPT: Omni-perception pre-trainer for cross-modal understanding and generation. [Online], Available: https://arxiv.org/abs/2107.00249, 2021.
    [23]
    D. Cheng, J. Y Zhou, N. N. Wang, X. B. Gao. Hybrid dynamic contrast and probability distillation for unsupervised person RE-ID. IEEE Transactions on Image Processing, vol. 31, pp. 3334–3346, 2022. DOI: 10.1109/TIP.2022.3169693.
    [24]
    F. L. Chen, D. Z. Zhang, M. L. Han, X. Y. Chen, J. Shi, S. Xu, B. Xu. VLP: A survey on vision-language pre-training. Machine Intelligence Research, vol. 30, pp. 38–56, 2023. DOI: 10.1007/s11633-022-1369-5.
    [25]
    Y. F. Du, Z. K. Liu, J. Y. Li, W. X. Zhao. A survey of vision-language pre-trained models. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 5436–5443, 2022. DOI: 10.24963/ijcai.2022/762.
    [26]
    M. Zaib, Q. Z. Sheng, W. E. Zhang. A short survey of pre-trained language models for conversational AI–A new age in NLP. In Proceedings of Australasian Computer Science Week Multiconference, Melbourne, Australia, Article number 11, 2020. DOI: 10.1145/3373017.3373028.
    [27]
    H. Q. Zhang, H. L. Song, S. Y. Li, M. Zhou, D. W. Song. A survey of controllable text generation using transformer-based pre-trained language models. [Online], Available: https://arxiv.org/abs/2201.05337, 2022.
    [28]
    J. Yang, G. Xiao, Y. L. Shen, W. Jiang, X. Y. Hu, Y. Zhang, J. H. Peng. A survey of knowledge enhanced pre-trained models. [Online], Available: https://arxiv.org/abs/2110.00269, 2021.
    [29]
    D. Yin, L. Dong, H. Cheng, X. D. Liu, K. W. Chang, F. R. Wei, J. F. Gao. A survey of knowledge-intensive NLP with pre-trained language models. [Online], Available: https://arxiv.org/abs/2202.08772, 2022.
    [30]
    P. Bhargava, V. Ng. Commonsense knowledge reasoning and generation with pre-trained language models: A survey. In Proceedings of the 36th AAAI, Conference on Artificial Intelligence, pp. 12317–12325, 2022. DOI: 10.1609/aaai.v36i11.21496.
    [31]
    Q. Liu, M. J. Kusner, P. Blunsom. A survey on contextual embeddings. [Online], Available: https://arxiv.org/abs/2003.07278, 2020.
    [32]
    P. F. Liu, W. Z. Yuan, J. L. Fu, Z. B. Jiang, H. Hayashi, G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. [Online], Available: https://arxiv.org/abs/2107.13586, 2021.
    [33]
    B. Y. Wang, Q. Q Xie, J. H. Pei, Z. H. Chen, P. Tiwari, Z. Li, J. Fu. Pre-trained language models in biomedical domain: A systematic survey. [Online], Available: https://arxiv.org/abs/2110.05006, 2021.
    [34]
    X. P. Qiu, T. X. Sun, Y. G. Xu, Y. F. Shao, N. Dai, X. J. Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. DOI: 10.1007/s11431-020-1647-3.
    [35]
    X. Han, Z. Y. Zhang, N. Ding, Y. X. Gu, X. Liu, Y. Q. Huo, J. Z. Qiu, Y. Yao, A. Zhang, L. Zhang, W. T. Han, M. L. Huang, Q. Jin, Y. Y. Lan, Y. Liu, Z. Y. Liu, Z. W. Lu, X. P. Qiu, R. H. Song, J. Tang, J. R. Wen, J. H. Yuan, W. X. Zhao, J. Zhu. Pre-trained models: Past, present and future. AI Open, vol. 2, pp. 225–250, 2021. DOI: 10.1016/j.aiopen.2021.08.002.
    [36]
    L. D. Ruan, Q. Jin. Survey: Transformer based video-language pre-training. AI Open, vol. 3, pp. 1–13, 2022. DOI: 10.1016/j.aiopen.2022.01.001.
    [37]
    F. Li, H. Zhang, Y. F. Zhang, S. L. Liu, J. Guo, L. M. Ni, P. C. Zhang, L. Zhang. Vision-language intelligence: Tasks, representation learning, and large models. [Online], Available: https://arxiv.org/abs/2203.01922, 2022.
    [38]
    K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, 2023. DOI: 10.1109/TPAMI.2022.3152247.
    [39]
    S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah. Transformers in vision: A survey. ACM Computing Surveys, vol. 54, no. 10, Article number 200, 2022. DOI: 10.1145/3505244.
    [40]
    Y. Liu, Y. Zhang, Y. X. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. C. Shi, J. P. Fan, Z. Q. He. A survey of visual transformers. [Online], Available: https://arxiv.org/abs/2111.06091, 2021.
    [41]
    J. Selva, A. S. Johansen, S. Escalera, K. Nasrollahi, T. B. Moeslund, A. Clapés. Video transformers: A survey. [Online], Available: https://arxiv.org/abs/2201.05991, 2022.
    [42]
    S. W. Guo, C. L. Xie, J. W. Li, L. J. Lyu, T. W. Zhang. Threats to pre-trained language models: Survey and taxonomy. [Online], Available: https://arxiv.org/abs/2202.06862, 2022.
    [43]
    I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, L. A. Ureña-López. A survey on bias in deep NLP. Applied Sciences, vol. 11, no. 7, Article number 3184, 2021. DOI: 10.3390/app11073184.
    [44]
    N. Meade, E. Poole-Dayan, S. Reddy. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 1878–1898, 2022. DOI: 10.18653/v1/2022.acl-long.132.
    [45]
    R. K. Kaliyar. A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of BERT. In Proceedings of the 10th International Conference on Cloud Computing, Data Science & Engineering, IEEE Harbin, pp. 336–340, 2020. DOI: 10.1109/Confluence47617.2020.9058044.
    [46]
    J. J. Peng, K. X. Han. Survey of pre-trained models for natural language processing. In Proceedings of International Conference on Electronic Communications, Internet of Things and Big Data, IEEE Harbin, China, pp. 277–280, 2021. DOI: 10.1109/ICEIB53692.2021.9686420.
    [47]
    S. Yuan, H. Y. Zhao, S. Zhao, J. H. Leng, Y. X. Liang, X. Z. Wang, J. F. Yu, X. Lv, Z. Shao, J. A. He, Y. K. Lin, X. Han, Z. H. Liu, N. Ding, Y. M. Rao, Y. Z. Gao, L. Zhang, M. Ding, C. Fang, Y. S. Wang, M. S. Long, J. Zhang, Y. P. Dong, T. Y. Pang, P. Cui, L. X. Huang, Z. Liang, H. W. Shen, H. Zhang, Q. S. Zhang, Q. X. Dong, Z. X. Tan, M. X. Wang, S. Wang, L. Zhou, H. R. Li, J. W. Bao, Y. W. Pan, W. N. Zhang, Z. Yu, R. Yan, C. C. Shi, M. H. Xu, Z. B. Zhang, G. Q. Wang, X. Pan, M. J. Li, X. Y. Chu, Z. J. Yao, F. W. Zhu, S. L. Cao, W. C. Xue, Z. X. Ma, Z. Y. Zhang, S. D. Hu, Y. J. Qin, C. J. Xiao, Z. N. Zeng, G. Q. Cui, W. Z. Chen, W. L. Zhao, Y. Yao, P. Li, W. Z. Zheng, W. L. Zhao, Z. Y. Wang, B. R. Zhang, N. Y. Fei, A. W. Hu, Z. N. Ling, H. Y. Li, B. X. Cao, X. P. Han, W. D. Zhan, B. B. Chang, H. Sun, J. W. Deng, C. J. Zheng, J. Z. Li, L. Hou, X. G. Cao, J. D. Zhai, Z. Y. Liu, M. S. Sun, J. W. Lu, Z. W. Lu, Q. Jin, R. H. Song, J. R. Wen, Z. C. Lin, L. W. Wang, H. Su, J. Zhu, Z. F. Sui, J. J. Zhang, Y. Liu, X. D. He, M. L. Huang, J. Tang, J. Tang. A roadmap for big model. [Online], Available: https://arxiv.org/abs/2203.14101, 2022.
    [48]
    S. Q. Long, F. Q. Cao, S. C. Han, H. Q. Yang. Vision-and-language pretrained models: A survey. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 5530–5537, 2022. DOI: 10.24963/ijcai.2022/773.
    [49]
    P. Xu, X. T. Zhu, D. A. Clifton. Multimodal learning with transformers: A survey. [Online], Available: https://arxiv.org/abs/2206.06488, 2022.
    [50]
    Y. Lecun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. DOI: 10.1109/5.726791.
    [51]
    G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 2261–2269, 2017. DOI: 10.1109/CVPR.2017.243.
    [52]
    B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth. Recent advances in natural language processing via large pre-trained language models: A survey. [Online], Available: https://arxiv.org/abs/2111.01243, 2021.
    [53]
    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [54]
    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever. Improving language understanding by generative pre-training, [Online], Available: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
    [55]
    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, vol. 1, no. 8, Article number 9, 2019.
    [56]
    C. Rosset. Turing-NLG: A 17-billion-parameter language model by Microsoft, [Online], Available: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020.
    [57]
    W. Zeng, X. Z. Ren, T. Su, H. Wang, Y. Liao, Z. W. Wang, X. Jiang, Z. Z. Yang, K. S. Wang, X. D. Zhang, C. Li, Z. Y. Gong, Y. F. Yao, X. J. Huang, J. Wang, J. F. Yu, Q. Guo, Y. Yu, Y. Zhang, J. Wang, H. T. Tao, D. S. Yan, Z. X. Yi, F. Peng, F. Q. Jiang, H. Zhang, L. F. Deng, Y. H. Zhang, Z. Lin, C. Zhang, S. J. Zhang, M. Y. Guo, S. Z. Gu, G. J. Fan, Y. W. Wang, X. F. Jin, Q. Liu, Y. H. Tian. Pangu-$\alpha $: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. [Online], Available: https://arxiv.org/abs/2104.12369, 2021.
    [58]
    J. Q. Wei, X. Z. Ren, X. G. Li, W. Y. Huang, Y. Liao, Y. S. Wang, J. S. Lin, X. Jiang, X. Chen, Q. Liu. NEZHA: Neural contextualized representation for Chinese language understanding. [Online], Available: https://arxiv.org/abs/1909.00204, 2019.
    [59]
    M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, pp. 1691–1703, 2020.
    [60]
    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
    [61]
    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 213–229, 2020. DOI: 10.1007/978-3-030-58452-8_13.
    [62]
    S. X. Zheng, J. C. Lu, H. S. Zhao, X. T. Zhu, Z. K. Luo, Y. B. Wang, Y. W. Fu, J. F. Feng, T. Xiang, P. H. S. Torr, L. Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 6877–6886, 2021. DOI: 10.1109/CVPR46437.2021.00681.
    [63]
    H. T. Chen, Y. H. Wang, T. Y. Guo, C. Xu, Y. P. Deng, Z. H. Liu, S. W. Ma, C. J. Xu, C. Xu, W. Gao. Pre-trained image processing transformer. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12294–12305, 2021. DOI: 10.1109/CVPR46437.2021.01212.
    [64]
    K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: 10.1109/CVPR52688.2022.01553.
    [65]
    H. B. Bao, L. Dong, S. H. Piao, F. R. Wei. BEiT: BERT pre-training of image transformers. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [66]
    X. Y. Dong, J. M. Bao, T. Zhang, D. D. Chen, W. M. Zhang, L. Yuan, D. Chen, F. Wen, N. H. Yu, B. N. Guo. PeCo: Perceptual codebook for BERT pre-training of vision transformers. [Online], Available: https://arxiv.org/abs/2111.12710, 2021.
    [67]
    S. Schneider, A. Baevski, R. Collobert, M. Auli. Wav2vec: Unsupervised pre-training for speech recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 3465–3469, 2019. DOI: 10.21437/Interspeech.2019-1873.
    [68]
    A. Baevski, M. Auli, A. Mohamed. Effectiveness of self-supervised pre-training for speech recognition. [Online], Available: https://arxiv.org/abs/1911.03912, 2019.
    [69]
    W. N. Hsu, B. Bolte, Y. H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio,Speech,Language Processing, vol. 29, pp. 3451–3460, 2021. DOI: 10.1109/TASLP.2021.3122291.
    [70]
    A. Baevski, Y. H. Zhou, A. Mohamed, M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 1044, 2020.
    [71]
    Y. A. Chung, Y. Zhang, W. Han, C. C. Chiu, J. Qin, R. M. Pang, Y. H. Wu. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, Cartagena, Colombia, pp. 244–250, 2021. DOI: 10.1109/ASRU51503.2021.9688253.
    [72]
    P. P. Zhu, X. Wang, L. Zhu, Z. L. Sun, W. S. Zheng, Y. W. Wang, C. W. Chen. Prompt-based learning for unpaired image captioning. [Online], Available: https://arxiv.org/abs/2205.13125, 2022.
    [73]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
    [74]
    Y. H. Xing, Q. R. Wu, D. Cheng, S. Z. Zhang, G. Q. Liang, Y. N. Zhang. Class-aware visual prompt tuning for vision-language pre-trained model. [Online], Available: https://arxiv.org/abs/2208.08340, 2022.
    [75]
    V. Ordonez, G. Kulkarni, T. Berg. Im2Text: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, pp. 1143–1151, 2011.
    [76]
    P. Young, A. Lai, M. Hodosh, J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In Proceedings of Transactions of the Association for Computational Linguistics, Cambridge, USA, pp. 67–78, 2014. DOI: 10.1162/tacl_a_00166.
    [77]
    M. Hodosh, P. Young, J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013. DOI: 10.1613/jair.3994.
    [78]
    X. L. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. [Online], Available: https://arxiv.org/abs/1504.00325, 2015.
    [79]
    T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 740–755, 2014. DOI: 10.1007/978-3-319-10602-1_48.
    [80]
    R. Krishna, Y. K. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017. DOI: 10.1007/s11263-016-0981-7.
    [81]
    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6325–6334, 2017. DOI: 10.1109/CVPR.2017.670.
    [82]
    N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-gen: The generative fashion dataset and challenge. [Online], Available: https://arxiv.org/abs/1806.08317, 2018.
    [83]
    P. Sharma, N. Ding, S. Goodman, R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2556–2565, 2018. DOI: 10.18653/v1/P18-1238.
    [84]
    D. A. Hudson, C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6693–6702, 2019. DOI: 10.1109/CVPR.2019.00686.
    [85]
    D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. [Online], Available: https://arxiv.org/abs/2001.07966, 2020.
    [86]
    S. Changpinyo, P. Sharma, N. Ding, R. Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3557–3567, 2021. DOI: 10.1109/CVPR46437.2021.00356.
    [87]
    J. Lei, L. C. Yu, M. Bansal, T. Berg. TVQA: Localized, compositional video question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 1369–1379, 2018. DOI: 10.18653/v1/D18-1167.
    [88]
    A. Miech, D. Zhukov, J. B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 2630–2640, 2019. DOI: 10.1109/ICCV.2019.00272.
    [89]
    M. Bain, A. Nagrani, G. Varol, A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1708–1718, 2021. DOI: 10.1109/ICCV48922.2021.00175.
    [90]
    B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L. J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016. DOI: 10.1145/2812802.
    [91]
    C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, A. Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. [Online], Available: https://arxiv.org/abs/2111.02114, 2021.
    [92]
    K. Desai, G. Kaul, Z. Aysola, J. Johnson. RedCaps: Web-curated image-text data created by the people, for the people. In Proceedings of the 1st Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
    [93]
    J. X. Gu, X. J. Meng, G. S. Lu, L. Hou, M. Z. Niu, H. Xu, X. D. Liang, W. Zhang, X. Jiang, C. J. Xu. Wukong: 100 million large-scale Chinese cross-modal pre-training dataset and a foundation framework. [Online], Available: https://arxiv.org/abs/2202.06767, 2022.
    [94]
    Z. Parekh, J. Baldridge, D. Cer, A. Waters, Y. F. Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, ACL, pp. 2855–2870, 2021. DOI: 10.18653/v1/2021.eacl-main.249.
    [95]
    X. L. Zhan, Y. X. Wu, X. Dong, Y. C. Wei, M. L. Lu, Y. C. Zhang, H. Xu, X. D. Liang. Product1M: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 11762–11771, 2021. DOI: 10.1109/ICCV48922.2021.01157.
    [96]
    K. Srinivasan, K. Raman, J. C. Chen, M. Bendersky, M. Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th ACM, International SIGIR Conference on Research and Development in Information Retrieval, pp. 2443–2449, 2021. DOI: 10.1145/3404835.3463257.
    [97]
    C. Sun, A. Shrivastava, S. Singh, A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 843–852, 2017. DOI: 10.1109/ICCV.2017.97.
    [98]
    J. W. Yang, C. Y. Li, P. C. Zhang, X. Y. Dai, B. Xiao, L. Yuan, J. F. Gao. Focal self-attention for local-global interactions in vision transformers. [Online], Available: https://arxiv.org/abs/2107.00641, 2021.
    [99]
    D. Mahajan, R. Girshick, V. Ramanathan, K. M. He, M. Paluri, Y. X. Li, A. Bharambe, L. Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 185–201, 2018. DOI: 10.1007/978-3-030-01216-8_12.
    [100]
    J. Y. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. C. Zhang, P. Wang, A. Wang, L. Jiang, X. Y. Jia, J. Zhang, J. W. Zhang, X. Zou, Z. K. Li, X. D. Deng, J. Liu, J. B. Xue, H. L. Zhou, J. X. Ma, J. Yu, Y. Li, W. Lin, J. R. Zhou, J. Tang, H. X. Yang. M6: A Chinese multimodal pretrainer. [Online], Available: https://arxiv.org/abs/2103.00823, 2021.
    [101]
    X. Dong, X. L. Zhan, Y. X. Wu, Y. C. Wei, X. Y. Wei, M. L. Lu, X. D. Liang. M5Product: A multi-modal pretraining benchmark for e-commercial product downstream tasks. [Online], Available: https://arxiv.org/abs/2109.04275, 2021.
    [102]
    J. Pont-Tuset, J. Uijlings, S. Changpinyo, R. Soricut, V. Ferrari. Connecting vision and language with localized narratives. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 647–664, 2020. DOI: 10.1007/978-3-030-58558-7_38.
    [103]
    Y. Q. Huo, M. L. Zhang, G. Z. Liu, H. Y. Lu, Y. Z. Gao, G. X. Yang, J. Y. Wen, H. Zhang, B. G Xu, W. H. Zheng, Z. Z. Xi, Y. Q. Yang, A. W. Hu, J. M. Zhao, R. C. Li, Y. D. Zhao, L. Zhang, Y. Q. Song, X. Hong, W. Q. Cui, D. Y. Hou, Y. Y. Li, J. Y. Li, P. Y. Liu, Z. Gong, C. H. Jin, Y. C. Sun, S. Z. Chen, Z. W. Lu, Z. C. Dou, Q. Jin, Y. Y. Lan, W. X. Zhao, R. H. Song, J. R. Wen. WenLan: Bridging vision and language by large-scale multi-modal pre-training. [Online], Available: https://arxiv.org/abs/2103.06561, 2021.
    [104]
    Y. Sha, S. Zhao, J. H. Leng, Z. Xue, H. Y. Zhao, J. Tang. WuDaoMM: A large-scale multi-modal dataset for pre-training models. [Online], Available: https://arxiv.org/abs/2203.11480, 2022.
    [105]
    D. L. Chen, F. Liu, X. Y. Du, R. Z. Gao, F. Xu. MEP-3M: A large-scale multi-modal E-commerce products dataset. In Proceedings of IJCAI Workshop on Long-Tailed Distribution Learning, 2021.
    [106]
    N. Y. Fei, Z. W. Lu, Y. Z. Gao, G. X. Yang, Y. Q. Huo, J. Y. Wen, H. Y. Lu, R. H. Song, X. Gao, T. Xiang, H. Sun, J. R. Wen. WenLan 2.0: Make ai imagine via a multimodal foundation model. [Online], Available: https://arxiv.org/abs/2110.14378, 2021.
    [107]
    B. L. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5122–5130, 2017. DOI: 10.1109/CVPR.2017.544.
    [108]
    P. C. Zhang, X. J. Li, X. W. Hu, J. W. Yang, L. Zhang, L. J. Wang, Y. Choi, J. F. Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5575–5584, 2021. DOI: 10.1109/CVPR46437.2021.00553.
    [109]
    G. Li, N. Duan, Y. J. Fang, M. Gong, D. X. Jiang. Unicoder-Vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 11336–11344, 2020. DOI: 10.1609/aaai.v34i07.6795.
    [110]
    J. Y. Lin, A. Yang, Y. C. Zhang, J. Liu, J. R. Zhou, H. X. Yang. InterBERT: Vision-and-language interaction for multi-modal pretraining. [Online], Available: https://arxiv.org/abs/2003.13198, 2020.
    [111]
    Z. R. Wang, J. H. Yu, A. W. Yu, Z. H. Dai, Y. Tsvetkov, Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [112]
    H. Tan, M Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 5100–5111, 2019. DOI: 10.18653/v1/D19-1514.
    [113]
    M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA, pp. 2227–2237, 2018. DOI: 10.18653/v1/N18-1202.
    [114]
    L. Dong, N. Yang, W. H. Wang, F. R. Wei, X. D. Liu, Y. Wang, J. F. Gao, M. Zhou, H. W. Hon. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13042–13054, 2019.
    [115]
    G. Peyré, M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends? in Machine Learning, vol. 11, no. 5–6, pp. 355–607, 2019. DOI: 10.1561/2200000073.
    [116]
    Y. J. Xie, X. F. Wang, R. J. Wang, H. Y. Zha. A fast proximal point method for computing exact wasserstein distance. In Proceedings of the 35th Uncertainty in Artificial Intelligence, Tel Aviv, Israel, pp. 433–453, 2020.
    [117]
    W. T. Hao, C. Y. Li, X. J. Li, L. Carin, J. F. Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 13134–13143, 2020. DOI: 10.1109/CVPR42600.2020.01315.
    [118]
    F. Yu, J. J. Tang, W. C. Yin, Y. Sun, H. Tian, H. Wu, H. F. Wang. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 3208–3216, 2021. DOI: 10.1609/aaai.v35i4.16431.
    [119]
    M. C. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12642–12652, 2021. DOI: 10.1109/CVPR46437.2021.01246.
    [120]
    H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL, pp. 503–513, 2021. DOI: 10.18653/v1/2021.acl-long.42.
    [121]
    L. J. Li, Y. C. Chen, Y. Cheng, Z. Gan, L. C. Yu, J. J. Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 2046–2065, 2020. DOI: 10.18653/v1/2020.emnlp-main.161.
    [122]
    Y. Ling, J. F. Yu, R. Xia. Vision-language pre-training for multimodal aspect-based sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 2149–2159, 2022. DOI: 10.18653/v1/2022.acl-long.152.
    [123]
    Y. H. Cui, Z. Yu, C. Q. Wang, Z. Z. Zhao, J. Zhang, M. Wang, J. Yu. ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 797–806, 2021. DOI: 10.1145/3474085.3475251.
    [124]
    M. H. Guo, T. X. Xu, J. J. Liu, Z. N. Liu, P. T. Jiang, T. J. Mu, S. H. Zhang, R. R. Martin, M. M. Cheng, S. M. Hu. Attention mechanisms in computer vision: A survey. Computational Visual Media, vol. 8, no. 3, pp. 331–368, 2022. DOI: 10.1007/s41095-022-0271-y.
    [125]
    J. N. Li, R. Selvaraju, A. Gotmare, S. Joty, C. M. Xiong, S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 9694–9705, 2021.
    [126]
    W. Suo, M. Y. Sun, P. Wang, Q. Wu. Proposal-free one-stage referring expression via grid-word cross-attention. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, Canada, pp. 1032–1038, 2021. DOI: 10.24963/ijcai.2021/143.
    [127]
    Z. Y. Yang, Y. W. Fang, C. G. Zhu, R. Pryzant, D. D. Chen, Y. Shi, Y. C. Xu, Y. Qian, M. Gao, Y. L. Chen, L. Y. Lu, Y. J. Xie, R. Gmyr, N. Codella, N. Kanda, B. Xiao, L. Yuan, T. Yoshioka, M. Zeng, X. D. Huang. I-code: An integrative and composable multimodal learning framework. [Online], Available: https://arxiv.org/abs/2205.01818, 2022.
    [128]
    L. C. Zhu, Y. Yang. ActBERT: Learning global-local video-text representations. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 8743–8752, 2020. DOI: 10.1109/CVPR42600.2020.00877.
    [129]
    M. M. Wang, J. Z. Xing, Y. Liu. ActionCLIP: A new paradigm for video action recognition. [Online], Available: https://arxiv.org/abs/2109.08472, 2021.
    [130]
    M. L. Li, R. C. Xu, S. H. Wang, L. W. Zhou, X. D. Lin, C. G. Zhu, M. Zeng, H. Ji, S. F. Chang. CLIP-event: Connecting text and images with event structures. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16399–16408, 2022. DOI: 10.1109/CVPR52688.2022.01593.
    [131]
    Y. F. Cui, L. C. Zhao, F. Liang, Y. G. Li, J. Shao. Democratizing contrastive language-image pre-training: A CLIP benchmark of data, model, and supervision. [Online], Available: https://arxiv.org/abs/2203.05796, 2022.
    [132]
    S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. W. Chang, Z. W. Yao, K. Keutzer. How much can CLIP benefit vision-and-language tasks? In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [133]
    D. L. Chen, Z. Wu, F. Liu, Z. Q. Yang, Y. X. Huang, Y. P. Bao, E. J. Zhou. Prototypical contrastive language image pretraining. [Online], Available: https://arxiv.org/abs/2206.10996, 2022.
    [134]
    L. H. Li, M. Yatskar, D. Yin, C. J. Hsieh, K. W. Chang. VisualBERT: A simple and performant baseline for vision and language. [Online], Available: https://arxiv.org/abs/1908.03557, 2019.
    [135]
    J. S. Lu, D. Batra, D. Parikh, S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of 32th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13–23, 2019.
    [136]
    C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 2131–2140, 2019. DOI: 10.18653/v1/D19-1219.
    [137]
    W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [138]
    L. W. Zhou, H. Palangi, L. Zhang, H. D. Hu, J. Corso, J. F. Gao. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 13041–13049, 2020. DOI: 10.1609/aaai.v34i07.7005.
    [139]
    J. S. Lu, V. Goswami, M. Rohrbach, D. Parikh, S. Lee. 12-in-1: Multi-task vision and language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10434–10443, 2020. DOI: 10.1109/CVPR42600.2020.01045.
    [140]
    V. Murahari, D. Batra, D. Parikh, A. Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 336–352, 2020. DOI: 10.1007/978-3-030-58523-5_20.
    [141]
    Y. T. Gao, J. F. Liu, Z. H. Xu, J. Zhang, K. Li, C. H. Shen. PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining. [Online], Available: https://arxiv.org/abs/2204.14095, 2022.
    [142]
    D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251–2260, 2020. DOI: 10.1145/3397271.3401430.
    [143]
    Z. Gan, Y. C. Chen, L. J. Li, C. Zhu, Y. Cheng, J. J. Liu. Large-scale adversarial training for vision-and-language representation learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 555, 2020.
    [144]
    D. D. Song, S. Y. Ma, Z. C. Sun, S. C. Yang, L. J. Liao. KVL-BERT: Knowledge enhanced visual-and-linguistic BERT for visual commonsense reasoning. Knowledge-Based Systems, vol. 230, Article number 107408, 2021. DOI: 10.1016/j.knosys.2021.107408.
    [145]
    J. Cho, J. Lei, H. Tan, M. Bansal. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 1931–1942, 2021.
    [146]
    W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
    [147]
    A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1760–1770, 2021. DOI: 10.1109/ICCV48922.2021.00180.
    [148]
    Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the bOx: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12971–12980, 2021. DOI: 10.1109/CVPR46437.2021.01278.
    [149]
    H. W. Xue, Y. P. Huang, B. Liu, H. W. Peng, J. L. Fu, H. Q. Li, J. B. Luo. Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 4514–4528, 2021.
    [150]
    A. Jain, M. Guo, K. Srinivasan, T. Chen, S. Kudugunta, C. Jia, Y. F. Yang, J. Baldridge. MURAL: Multimodal, multitask retrieval across languages. [Online], Available: https://arxiv.org/abs/2109.05125, 2021.
    [151]
    W. H. Wang, H. B. Bao, L. Dong, F. R. Wei. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. [Online], Available: https://arxiv.org/abs/2111.02358, 2021.
    [152]
    Z. Y. Dou, Y. C. Xu, Z. Gan, J. F. Wang, S. H. Wang, L. J. Wang, C. G. Zhu, P. C. Zhang, L. Yuan, N. Y. Peng, Z. C. Liu, M. Zeng. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 18145–18155, 2022. DOI: 10.1109/CVPR52688.2022.01763.
    [153]
    C. Sun, A. Myers, C. Vondrick, K. Murphy, C. Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 7463–7472, 2019. DOI: 10.1109/ICCV.2019.00756.
    [154]
    C. Sun, F. Baradel, K. Murphy, C. Schmid. Learning video representations using contrastive bidirectional transformer. [Online], Available: https://arxiv.org/abs/1906.05743, 2019.
    [155]
    H. H. Luo, L. Ji, B. T. Shi, H. Y. Huang, N. Duan, T. R. Li, J. Li, T. Bharti, M. Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. [Online], Available: https://arxiv.org/abs/2002.06353, 2020.
    [156]
    A. Urooj, A. Mazaheri, N. Da Vitoria Lobo, M. Shah. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4648–4660, 2020. DOI: 10.18653/v1/2020.findings-emnlp.417.
    [157]
    R. Yan, M. Z. Shou, Y. X. Ge, A. J. Wang, X. D. Lin, G. Y. Cai, J. H. Tang. Video-text pre-training with learned regions. [Online], Available: https://arxiv.org/abs/2112.01194, 2021.
    [158]
    W. Li, C. Gao, G. C. Niu, X. Y. Xiao, H. Liu, J. C. Liu, H. Wu, H. F. Wang. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL, pp. 2592–2607, 2021. DOI: 10.18653/v1/2021.acl-long.202.
    [159]
    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8821–8831, 2021.
    [160]
    L. K. Gui, Q. Y. Huang, S. Som, A. Hauptmann, Y. Bisk, J. F. Gao. Training vision-language transformers from captions alone. [Online], Available: https://arxiv.org/abs/2205.09256, 2022.
    [161]
    M. Ding, Z. Y. Yang, W. Y. Hong, W. D. Zheng, C. Zhou, D. Yin, J. Y. Lin, X. Zou, Z. Shao, H. X. Yang, J. Tang. CogView: Mastering text-to-image generation via transformers. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 19822–19835, 2021.
    [162]
    H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 24206–24221, 2021.
    [163]
    L. Yuan, D. D. Chen, Y. L. Chen, N. Codella, X. Y. Dai, J. F. Gao, H. D. Hu, X. D. Huang, B. X. Li, C. Y. Li, C. Liu, M. C. Liu, Z. C. Liu, Y. M. Lu, Y. Shi, L. J, Wang, J. F. Wang, B. Xiao, Z. Xiao, J. W. Yang, M. Zeng, L. W. Zhou, P. C. Zhang. Florence: A new foundation model for computer vision. [Online], Available: https://arxiv.org/abs/2111.11432, 2021.
    [164]
    S. Bakkali, Z. H. Ming, M. Coustaty, M. Rusiñol, O. R. Terrades. VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification. [Online], Available: https://arxiv.org/abs/2205.12029, 2022.
    [165]
    L. H. Wei, L. X. Xie, W. G. Zhou, H. Q. Li, Q. Tian. MVP: Multimodality-guided visual pre-training. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 337–353, 2022. DOI: 10.1007/978-3-031-20056-4_20.
    [166]
    W. X. Hong, K. X. Ji, J. J. Liu, J. Wang, J. D. Chen, W. Chu. GilBERT: Generative vision-language pre-training for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1379–1388, 2021. DOI: 10.1145/3404835.3462838.
    [167]
    H. Y. Lu, N. Y. Fei, Y. Q. Huo, Y. Z. Gao, Z. W. Lu, J. R. Wen. COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15671–15680, 2022. DOI: 10.1109/CVPR52688.2022.01524.
    [168]
    L. H. Li, H. X. You, Z. C. Wang, A. Zareian, S. F. Chang, K. W. Chang. Unsupervised vision-and-language pre-training without parallel images and captions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5339–5350, 2021.
    [169]
    J. B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. D. Han, Z. T. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan. Flamingo: A visual language model for few-shot learning. [Online], Available: https://arxiv.org/abs/2204.14198, 2022.
    [170]
    M. H. Ni, H. Y. Huang, L. Su, E. Cui, T. Bharti, L. J. Wang, D. D. Zhang, N. Duan. M3P: Learning universal representations via multitask multilingual multimodal pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3976–3985, 2021. DOI: 10.1109/CVPR46437.2021.00397.
    [171]
    J. N. Li, D. X. Li, C. M. Xiong, S. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888–12900, 2022.
    [172]
    C. F. Wu, J. Liang, L. Ji, F. Yang, Y. J. Fang, D. X. Jiang, N. Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 720–736, 2022. DOI: 10.1007/978-3-031-19787-1_41.
    [173]
    J. Y. Yang, J. L. Duan, S. Tran, Y. Xu, S. Chanda, L. Q. Chen, B. Zeng, T. Chilimbi, J. Z. Huang. Vision-language pre-training with triple contrastive learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15650–15659, 2022. DOI: 10.1109/CVPR52688.2022.01522.
    [174]
    X. Dong, X. L. Zhan, Y. X. Wu, Y. C. Wei, M. C. Kampffmeyer, X. Y. Wei, M. L. Lu, Y. W. Wang, X. D. Liang. M5Product: Self-harmonized contrastive learning for E-commercial multi-modal pretraining. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 21220–21230, 2022. DOI: 10.1109/CVPR52688.2022.02057.
    [175]
    B. Yan, M. T. Pei. Clinical-BERT: Vision-language pre-training for radiograph diagnosis and reports generation. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, pp. 2982–2990, 2022. DOI: 10.1609/aaai.v36i3.20204.
    [176]
    Y. W. Zhong, J. W. Yang, P. C. Zhang, C. Y. Li, N. Codella, L. H. Li, L. W. Zhou, X. Y. Dai, L. Yuan, Y. Li, J. F. Gao. RegionCLIP: Region-based language-image pretraining. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16772–16782, 2022. DOI: 10.1109/CVPR52688.2022.01629.
    [177]
    X. W. Liang, F. D. Zhu, L. L. Li, H. Xu, X. D. Liang. Visual-language navigation pretraining via prompt-based environmental self-exploration. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 4837–4851, 2022. DOI: 10.18653/v1/2022.acl-long.332.
    [178]
    L. H. Li, P. C. Zhang, H. T. Zhang, J. W. Yang, C. Y. Li, Y. W. Zhong, L. J. Wang, L. Yuan, L. Zhang, J. N. Hwang, K. W. Chang, J. F. Gao. Grounded language-image pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 10955–10965, 2022. DOI: 10.1109/CVPR52688.2022.01069.
    [179]
    C. Y. Xie, H. Cai, J. F. Song, J. H. Li, F. J. Kong, X. Y. Wu, H. Morimitsu, L. Yao, D. X. Wang, D. W. Leng, X. Y. Ji, Y. F. Deng. Zero and R2D2: A large-scale Chinese cross-modal benchmark and a vision-language framework. [Online], Available: https://arxiv.org/abs/2205.03860, 2022.
    [180]
    N. Mu, A. Kirillov, D. Wagner, S. N. Xie. SLIP: Self-supervision meets language-image pre-training. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 529–544, 2022. DOI: 10.1007/978-3-031-19809-0_30.
    [181]
    L. W. Yao, R. H. Huang, L. Hou, G. S. Lu, M. Z. Niu, H. Xu, X. D. Liang, Z. G. Li, X. Jiang, C. J. Xu. FILIP: Fine-grained interactive language-image pre-training. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [182]
    C. L. Li, M. Yan, H. Y. Xu, F. L. Luo, W. Wang, B. Bi, S. F. Huang. SemVLP: Vision-language pre-training by aligning semantics at multiple levels. [Online], Available: https://arxiv.org/abs/2103.07829, 2021.
    [183]
    J. H. Yu, Z. R. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. H. Wu. CoCa: Contrastive captioners are image-text foundation models. [Online], Available: https://arxiv.org/abs/2205.01917, 2022.
    [184]
    F. L. Chen, X. Y. Chen, J. X. Shi, D. Z. Zhang, J. L. Chang, Q. Tian. HiVLP: Hierarchical vision-language pre-training for fast image-text retrieval. [Online], Available: https://arxiv.org/abs/2205.12105, 2022.
    [185]
    A. Guzhov, F. Raue, J. Hees, A. Dengel. AudioCLIP: Extending CLIP to image, text and audio. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 976–980, 2022. DOI: 10.1109/ICASSP43922.2022.9747631.
    [186]
    H. B. Bao, W. H. Wang, L. Dong, F. R. Wei. VL-BEiT: Generative vision-language pretraining. [Online], Available: https://arxiv.org/abs/2206.01127, 2022.
    [187]
    P. H. Seo, A. Nagrani, A. Arnab, C. Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 17938–17947, 2022. DOI: 10.1109/CVPR52688.2022.01743.
    [188]
    Z. H. Fan, Z. Y. Wei, J. J. Chen, S. Y. Wang, Z. J. Li, J. R. Xu, X. J. Huang. A unified continuous learning framework for multi-modal knowledge discovery and pre-training. [Online], Available: https://arxiv.org/abs/2206.05555, 2022.
    [189]
    H. T. Zhang, P. C. Zhang, X. W. Hu, Y. C. Chen, L. H. Li, X. Y. Dai, L. J. Wang, L. Yuan, J. N. Hwang, J. F. Gao. GLIPv2: Unifying localization and vision-language understanding. [Online], Available: https://arxiv.org/abs/2206.05836, 2022.
    [190]
    B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, N. Houlsby. Multimodal contrastive learning with LIMoE: The language-image mixture of experts. [Online], Available: https://arxiv.org/abs/2206.02770, 2022.
    [191]
    T. Wang, W. H. Jiang, Z. C. Lu, F. Zheng, R. Cheng, C. G. Yin, P. Luo. VLMixer: Unpaired vision-language pre-training via cross-modal CutMix. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 22680–22690, 2022.
    [192]
    A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 2787–2795, 2013.
    [193]
    Z. Wang, J. W. Zhang, J. L. Feng, Z. Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Québec City, Canada, pp. 1112–1119, 2014. DOI: 10.1609/aaai.v28i1.8870.
    [194]
    G. L. Ji, S. Z. He, L. H. Xu, K. Liu, J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, pp. 687–696, 2015. DOI: 10.3115/v1/P15-1067.
    [195]
    Y. K. Lin, Z. Y. Liu, M. S. Sun, Y. Liu, X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, USA, pp. 2181–2187, 2015. DOI: 10.1609/aaai.v29i1.9491.
    [196]
    G. L. Ji, K. Liu, S. Z. He, J. Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, USA, pp. 985–991, 2016.
    [197]
    M. Nickel, V. Tresp, H. P. Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, USA, pp. 809–816, 2011.
    [198]
    R. Socher, D. Q. Chen, C. D. Manning, A. Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 926–934, 2013.
    [199]
    B. S. Yang, W. T. Yih, X. D. He, J. F. Gao, L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015. DOI: 10.48550/arXiv.1412.6575.
    [200]
    A. Bordes, X. Glorot, J. Weston, Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, vol. 94, no. 2, pp. 233–259, 2014. DOI: 10.1007/s10994-013-5363-6.
    [201]
    M. Nickel, L. Rosasco, T. Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, USA, pp. 1955–1961, 2016.
    [202]
    J. Bruna, W. Zaremba, A. Szlam, Y. LeCun. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014. DOI: 10.48550/arXiv.1312.6203.
    [203]
    T. N. Kipf, M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
    [204]
    T. N. Kipf, M. Welling. Variational graph auto-encoders. [Online], Available: https://arxiv.org/abs/1611.07308, 2016.
    [205]
    W. L. Hamilton, R. Ying, J. Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 1025–1035, 2017.
    [206]
    P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
    [207]
    M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling. Modeling relational data with graph convolutional networks. In Proceedings of the 15th International Conference on the Semantic Web, Springer, Heraklion, Greece, pp. 593–607, 2018. DOI: 10.1007/978-3-319-93417-4_38.
    [208]
    C. Shang, Y. Tang, J. Huang, J. B. Bi, X. D. He, B. W. Zhou. End-to-end structure-aware convolutional networks for knowledge base completion. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, pp. 3060–3067, 2019. DOI: 10.1609/aaai.v33i01.33013060.
    [209]
    T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel. Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, pp. 1811–1818, 2018. DOI: 10.1609/aaai.v32i1.11573.
    [210]
    D. Nathani, J. Chauhan, C. Sharma, M. Kaul. Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4710–4723, 2019. DOI: 10.18653/v1/P19-1466.
    [211]
    S. Vashishth, S. Sanyal, V. Nitin, P. Talukdar. Composition-based multi-relational graph convolutional networks. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [212]
    Y. Z. Li, B. W. Yu, X. Mengge, T. W. Liu. Enhancing pre-trained Chinese character representation with word-aligned attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3442–3448, 2020. DOI: 10.18653/v1/2020.acl-main.315.
    [213]
    P. Ke, H. Z. Ji, S. Y. Liu, X. Y. Zhu, M. L. Huang. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 6975–6988, 2020. DOI: 10.18653/v1/2020.emnlp-main.567.
    [214]
    A. Roberts, C. Raffel, N. Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 5418–5426, 2020. DOI: 10.18653/v1/2020.emnlp-main.437.
    [215]
    D. Sachan, Y. H. Zhang, P. Qi, W. L. Hamilton. Do syntax trees help pre-trained transformers extract information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2647–2661, 2021. DOI: 10.18653/v1/2021.eacl-main.228.
    [216]
    J. R. Zhou, Z. S. Zhang, H. Zhao, S. L. Zhang. LIMIT-BERT: Linguistics informed multi-task BERT. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4450–4461, 2020. DOI: 10.18653/v1/2020.findings-emnlp.399.
    [217]
    Z. Y. Zhang, X. Han, Z. Y. Liu, X. Jiang, M. S. Sun, Q. Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1441–1451, 2019. DOI: 10.18653/v1/P19-1139.
    [218]
    M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, N. A. Smith. Knowledge enhanced contextual word representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 43–54, 2019. DOI: 10.18653/v1/D19-1005.
    [219]
    P. Wang, Q. Wu, C. H. Shen, A. Dick, A. van den Hengel. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, pp. 1290–1296, 2017. DOI: 10.24963/ijcai.2017/179.
    [220]
    P. Wang, Q. Wu, C. H. Shen, A. Dick, A. van den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413–2427, 2018. DOI: 10.1109/TPAMI.2017.2754246.
    [221]
    J. Deng, N. Ding, Y. Q. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, H. Adam. Large-scale object classification using label relation graphs. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 48–64, 2014. DOI: 10.1007/978-3-319-10590-1_4.
    [222]
    T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, S. Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, vol. 7, pp. 452–466, 2019. DOI: 10.1162/tacl_a_00276.
    [223]
    Z. L. Yang, P. Qi, S. Z. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 2369–2380, 2018. DOI: 10.18653/v1/D18-1259.
    [224]
    C. Clark, K. Lee, M. W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Minneapolis, USA, pp. 2924–2936, 2019. DOI: 10.18653/v1/N19-1300.
    [225]
    J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA, pp. 809–819, 2018. DOI: 10.18653/v1/N18-1074.
    [226]
    Z. C. Guo, D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, Shanghai, China, pp. 499–508, 2014. DOI: 10.1145/2661829.2661887.
    [227]
    A. Talmor, J. Herzig, N. Lourie, J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4149–4158, 2019. DOI: 10.18653/v1/N19-1421.
    [228]
    C. Bhagavatula, R. Le Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. T. Yih, Y. Choi. Abductive commonsense reasoning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [229]
    B. Y. Lin, W. C. S. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, X. Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Proceedings of Findings of the Association for Computational Linguistics, pp. 1823–1840, 2020. DOI: 10.18653/v1/2020.findings-emnlp.165.
    [230]
    M. Sap, H. Rashkin, D. Chen, R. Le Bras, Y. Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 4463–4473, 2019. DOI: 10.18653/v1/D19-1454.
    [231]
    Y. Bisk, R. Zellers, R. Le Bras, J. F. Gao, Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 7432–7439, 2020. DOI: 10.1609/aaai.v34i05.6239.
    [232]
    B. Zhou, D. Khashabi, Q. Ning, D. Roth. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 3363–3369, 2019. DOI: 10.18653/v1/D19-1332.
    [233]
    B. Zhou, K. Richardson, Q. Ning, T. Khot, A. Sabharwal, D. Roth. Temporal reasoning on implicit events from distant supervision. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1361–1371, 2021. DOI: 10.18653/v1/2021.naacl-main.107.
    [234]
    H. Agrawal, K. Desai, Y. F. Wang, X. L. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, P. Anderson. Nocaps: Novel object captioning at scale. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 8947–8956, 2019. DOI: 10.1109/ICCV.2019.00904.
    [235]
    A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, D. Batra. Visual dialog. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 1080–1089, 2017. DOI: 10.1109/CVPR.2017.121.
    [236]
    P. C. Yang, B. X. Chen, P. Zhang, X. Sun. Visual agreement regularized training for multi-modal machine translation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 9418–9425, 2020. DOI: 10.1609/aaai.v34i05.6484.
    [237]
    S. Antol, A. Agrawal, J. S. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh. VQA: Visual question answering. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2425–2433, 2015. DOI: 10.1109/ICCV.2015.279.
    [238]
    J. Z. Liu, W. H. Chen, Y. Cheng, Z. Gan, L. C. Yu, Y. M. Yang, J. J. Liu. Violin: A large-scale dataset for video-and-language inference. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10897–10907, 2020. DOI: 10.1109/CVPR42600.2020.01091.
    [239]
    A. Suhr, M. Lewis, J. Yeh, Y. Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 217–223, 2017. DOI: 10.18653/v1/P17-2034.
    [240]
    N. Xie, F. Lai, D. Doran, A. Kadav. Visual entailment: A novel task for fine-grained image understanding. [Online], Available: https://arxiv.org/abs/1901.06706, 2019.
    [241]
    I. Dagan, O. Glickman, B. Magnini. The PASCAL recognising textual entailment challenge. In Proceedings of the 1st Pascal Machine Learning Challenges Workshop on Machine Learning Challenges, Springer, Southampton, UK, pp. 177–190, 2005. DOI: 10.1007/11736790_9.
    [242]
    R. Zellers, Y. Bisk, A. Farhadi, Y. Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6713–6724, 2019. DOI: 10.1109/CVPR.2019.00688.
    [243]
    X. Wang, S. F. Zheng, R. Yang, A. H. Zheng, Z. Chen, J. Tang, B. Luo. Pedestrian attribute recognition: A survey. Pattern Recognition, vol. 121, Article number 108220, 2022. DOI: 10.1016/j.patcog.2021.108220.
    [244]
    D. Ghosal, S. Akhtar, D. Chauhan, S. Poria, A. Ekbal, P. Bhattacharyya. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 3454–3466, 2018. DOI: 10.18653/v1/D18-1382.
    [245]
    S. Li, T. Xiao, H. S. Li, B. L. Zhou, D. Y. Yue, X. G. Wang. Person search with natural language description. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5187–5196, 2017. DOI: 10.1109/CVPR.2017.551.
    [246]
    W. Chen, Y. Liu, W. P. Wang, E. Bakker, T. Georgiou, P. Fieguth, L. Liu, M. S. Lew. Deep image retrieval: A survey. [Online], Available: https://arxiv.org/abs/2101.11282, 2021.
    [247]
    J. Gu, E. Stefani, Q. Wu, J. Thomason, X. Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 7606–7623, 2022. DOI: 10.18653/v1/2022.acl-long.524.
    [248]
    S. M. Park, Y. G. Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, 2023. DOI: 10.1007/s10462-022-10174-9.
    [249]
    H. W. Zhang, Y. L. Niu, S. F. Chang. Grounding referring expressions in images by variational context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 4158–4166, 2018. DOI: 10.1109/CVPR.2018.00437.
    [250]
    S. B. Yang, G. B. Li, Y. Z. Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 4140–4149, 2019. DOI: 10.1109/CVPR.2019.00427.
    [251]
    X. P. Ding, N. N. Wang, S. W. Zhang, Z. Y. Huang, X. M. Li, M. Q. Tang, T. L. Liu, X. B. Gao. Exploring language hierarchy for video grounding. IEEE Transactions on Image Processing, vol. 31, pp. 4693–4706, 2022. DOI: 10.1109/TIP.2022.3187288.
    [252]
    Z. H. Tang, Y. Liao, S. Liu, G. B. Li, X. J. Jin, H. X. Jiang, Q. Yu, D. Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8238–8249, 2022. DOI: 10.1109/TCSVT.2021.3085907.
    [253]
    X. Wang, X. J. Shu, Z. P. Zhang, B. Jiang, Y. W. Wang, Y. H. Tian, F. Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 13758–13768, 2021. DOI: 10.1109/CVPR46437.2021.01355.
    [254]
    X. Wang, C. L. Li, R. Yang, T. Z. Zhang, J. Tang, B. Luo. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. [Online], Available: https://arxiv.org/abs/1811.10014, 2018.
    [255]
    Q. Feng, V. Ablavsky, Q. X. Bai, S. Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5847–5856, 2021. DOI: 10.1109/CVPR46437.2021.00579.
    [256]
    Y. Yao, A. Zhang, Z. Y. Zhang, Z. Y. Liu, T. S. Chua, M. S. Sun. CPT: Colorful prompt tuning for pre-trained vision-language models. [Online], Available: https://arxiv.org/abs/2109.11797, 2021.
    [257]
    X. H. He, D. J. Yang, W. X. Feng, T. Fu, A. Akula, V. Jampani, P. Narayana, S. Basu, W. Y. Wang, X. Wang. CPL: Counterfactual prompt learning for vision and language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Abu Dhabi, UAE, pp. 3407–3418, 2022.
    [258]
    M. L. Jia, L. M. Tang, B. C. Chen, C. Cardie, S. Belongie, B. Hariharan, S. N. Lim. Visual prompt tuning. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 709–727, 2022. DOI: 10.1007/978-3-031-19827-4_41.
    [259]
    K. Y. Zhou, J. K. Yang, C. C. Loy, Z. W. Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022. DOI: 10.1007/s11263-022-01653-1.
    [260]
    K. Y. Zhou, J. K. Yang, C. C. Loy, Z. W. Liu. Conditional prompt learning for vision-language models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16795–16804, 2022. DOI: 10.1109/CVPR52688.2022.01631.
    [261]
    Q. Z. Wang, S. Li, H. Qin, A. M. Hao. Robust multi-modal medical image fusion via anisotropic heat diffusion guided low-rank structural analysis. Information Fusion, vol. 26, pp. 103–121, 2015. DOI: 10.1016/j.inffus.2015.01.001.
    [262]
    X. Wang, X. J. Shu, S. Zhang, B. Jiang, Y. W. Wang, Y. H. Tian, F. Wu. MFGNet: Dynamic modality-aware filter generation for RGB-T tracking. IEEE Transactions on Multimedia, 2022. DOI: 10.1109/TMM.2022.3174341.
    [263]
    K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212–228, 2018. DOI: 10.1007/978-3-030-01225-0_13.