Citation: Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao. Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey. Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023. https://doi.org/10.1007/s11633-022-1410-8
[1] |
A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 1097–1105, 2012.
|
[2] |
J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern recognition, Miami, USA, pp. 248–255, 2009. DOI: 10.1109/CVPR.2009.5206848.
|
[3] |
K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015. DOI: 10.48550/arXiv.1409.1556.
|
[4] |
K. M. He, X. Y Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
|
[5] |
C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, USA, pp. 4278–4284, 2017. DOI: 10.1609/aaai.v31i1.11231.
|
[6] |
S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735.
|
[7] |
J. Pennington, R. Socher, C. Manning. GloVe: Global vectors for word representation. In Proceedings of Conference on Empirical Methods in Natural Language Processing, ACL, Doha, Qatar, pp. 1532–1543, 2014. DOI: 10.3115/v1/D14-1162.
|
[8] |
R. Kiros, Y. K. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing systems, Montreal, Canada, pp. 3294–3302, 2015.
|
[9] |
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing systems, Long Beach, USA, pp. 6000–6010, 2017.
|
[10] |
J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
|
[11] |
Q. L. Xia, H. Y. Huang, N. Duan, D. D. Zhang, L. Ji, Z. F. Sui, E. Cui, T. Bharti, M. Zhou. XGPT: Cross-modal generative pre-training for image captioning. In Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing, Springer, Qingdao, China, pp. 786–797, 2021. DOI: 10.1007/978-3-030-88480-2_63.
|
[12] |
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 1877–1901, 2020.
|
[13] |
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Q. Zhou, W. Li, P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, vol. 21, no. 1, Article number 140, 2020.
|
[14] |
Z. L. Yang, Z. H. Dai, Y. M. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 5754–5764, 2019.
|
[15] |
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[16] |
Z. Liu, Y. T. Lin, Y. Cao, H. Hu, Y. X. Wei, Z. Zhang, S. Lin, B. N. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 9992–10012, 2021. DOI: 10.1109/ICCV48922.2021.00986.
|
[17] |
X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. J. Choi, J. F. Gao. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: 10.1007/978-3-030-58577-8_8.
|
[18] |
Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: UNiversal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: 10.1007/978-3-030-58577-8_7.
|
[19] |
Y. G. Li, F. Liang, L. C. Zhao, Y. F. Cui, W. L. Ouyang, J. Shao, F. W. Yu, J. J. Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In Proceedings of the 10th International Conference on Learning Representations, 2022.
|
[20] |
Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. [Online], Available: https://arxiv.org/abs/2004.00849, 2020.
|
[21] |
C. Jia, Y. F. Yang, Y. Xia, Y. T. Chen, Z. Parekh, H. Pham, Q. Le, Y. H. Sung, Z. Li, T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 4904–4916, 2021.
|
[22] |
J. Liu, X. X. Zhu, F. Liu, L. T. Guo, Z. J. Zhao, M. Z. Sun, W. N. Wang, H. Q. Lu, S. Y. Zhou, J. J. Zhang, J. Q. Wang. OPT: Omni-perception pre-trainer for cross-modal understanding and generation. [Online], Available: https://arxiv.org/abs/2107.00249, 2021.
|
[23] |
D. Cheng, J. Y Zhou, N. N. Wang, X. B. Gao. Hybrid dynamic contrast and probability distillation for unsupervised person RE-ID. IEEE Transactions on Image Processing, vol. 31, pp. 3334–3346, 2022. DOI: 10.1109/TIP.2022.3169693.
|
[24] |
F. L. Chen, D. Z. Zhang, M. L. Han, X. Y. Chen, J. Shi, S. Xu, B. Xu. VLP: A survey on vision-language pre-training. Machine Intelligence Research, vol. 30, pp. 38–56, 2023. DOI: 10.1007/s11633-022-1369-5.
|
[25] |
Y. F. Du, Z. K. Liu, J. Y. Li, W. X. Zhao. A survey of vision-language pre-trained models. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 5436–5443, 2022. DOI: 10.24963/ijcai.2022/762.
|
[26] |
M. Zaib, Q. Z. Sheng, W. E. Zhang. A short survey of pre-trained language models for conversational AI–A new age in NLP. In Proceedings of Australasian Computer Science Week Multiconference, Melbourne, Australia, Article number 11, 2020. DOI: 10.1145/3373017.3373028.
|
[27] |
H. Q. Zhang, H. L. Song, S. Y. Li, M. Zhou, D. W. Song. A survey of controllable text generation using transformer-based pre-trained language models. [Online], Available: https://arxiv.org/abs/2201.05337, 2022.
|
[28] |
J. Yang, G. Xiao, Y. L. Shen, W. Jiang, X. Y. Hu, Y. Zhang, J. H. Peng. A survey of knowledge enhanced pre-trained models. [Online], Available: https://arxiv.org/abs/2110.00269, 2021.
|
[29] |
D. Yin, L. Dong, H. Cheng, X. D. Liu, K. W. Chang, F. R. Wei, J. F. Gao. A survey of knowledge-intensive NLP with pre-trained language models. [Online], Available: https://arxiv.org/abs/2202.08772, 2022.
|
[30] |
P. Bhargava, V. Ng. Commonsense knowledge reasoning and generation with pre-trained language models: A survey. In Proceedings of the 36th AAAI, Conference on Artificial Intelligence, pp. 12317–12325, 2022. DOI: 10.1609/aaai.v36i11.21496.
|
[31] |
Q. Liu, M. J. Kusner, P. Blunsom. A survey on contextual embeddings. [Online], Available: https://arxiv.org/abs/2003.07278, 2020.
|
[32] |
P. F. Liu, W. Z. Yuan, J. L. Fu, Z. B. Jiang, H. Hayashi, G. Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. [Online], Available: https://arxiv.org/abs/2107.13586, 2021.
|
[33] |
B. Y. Wang, Q. Q Xie, J. H. Pei, Z. H. Chen, P. Tiwari, Z. Li, J. Fu. Pre-trained language models in biomedical domain: A systematic survey. [Online], Available: https://arxiv.org/abs/2110.05006, 2021.
|
[34] |
X. P. Qiu, T. X. Sun, Y. G. Xu, Y. F. Shao, N. Dai, X. J. Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. DOI: 10.1007/s11431-020-1647-3.
|
[35] |
X. Han, Z. Y. Zhang, N. Ding, Y. X. Gu, X. Liu, Y. Q. Huo, J. Z. Qiu, Y. Yao, A. Zhang, L. Zhang, W. T. Han, M. L. Huang, Q. Jin, Y. Y. Lan, Y. Liu, Z. Y. Liu, Z. W. Lu, X. P. Qiu, R. H. Song, J. Tang, J. R. Wen, J. H. Yuan, W. X. Zhao, J. Zhu. Pre-trained models: Past, present and future. AI Open, vol. 2, pp. 225–250, 2021. DOI: 10.1016/j.aiopen.2021.08.002.
|
[36] |
L. D. Ruan, Q. Jin. Survey: Transformer based video-language pre-training. AI Open, vol. 3, pp. 1–13, 2022. DOI: 10.1016/j.aiopen.2022.01.001.
|
[37] |
F. Li, H. Zhang, Y. F. Zhang, S. L. Liu, J. Guo, L. M. Ni, P. C. Zhang, L. Zhang. Vision-language intelligence: Tasks, representation learning, and large models. [Online], Available: https://arxiv.org/abs/2203.01922, 2022.
|
[38] |
K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, 2023. DOI: 10.1109/TPAMI.2022.3152247.
|
[39] |
S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah. Transformers in vision: A survey. ACM Computing Surveys, vol. 54, no. 10, Article number 200, 2022. DOI: 10.1145/3505244.
|
[40] |
Y. Liu, Y. Zhang, Y. X. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. C. Shi, J. P. Fan, Z. Q. He. A survey of visual transformers. [Online], Available: https://arxiv.org/abs/2111.06091, 2021.
|
[41] |
J. Selva, A. S. Johansen, S. Escalera, K. Nasrollahi, T. B. Moeslund, A. Clapés. Video transformers: A survey. [Online], Available: https://arxiv.org/abs/2201.05991, 2022.
|
[42] |
S. W. Guo, C. L. Xie, J. W. Li, L. J. Lyu, T. W. Zhang. Threats to pre-trained language models: Survey and taxonomy. [Online], Available: https://arxiv.org/abs/2202.06862, 2022.
|
[43] |
I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, L. A. Ureña-López. A survey on bias in deep NLP. Applied Sciences, vol. 11, no. 7, Article number 3184, 2021. DOI: 10.3390/app11073184.
|
[44] |
N. Meade, E. Poole-Dayan, S. Reddy. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 1878–1898, 2022. DOI: 10.18653/v1/2022.acl-long.132.
|
[45] |
R. K. Kaliyar. A multi-layer bidirectional transformer encoder for pre-trained word embedding: A survey of BERT. In Proceedings of the 10th International Conference on Cloud Computing, Data Science & Engineering, IEEE Harbin, pp. 336–340, 2020. DOI: 10.1109/Confluence47617.2020.9058044.
|
[46] |
J. J. Peng, K. X. Han. Survey of pre-trained models for natural language processing. In Proceedings of International Conference on Electronic Communications, Internet of Things and Big Data, IEEE Harbin, China, pp. 277–280, 2021. DOI: 10.1109/ICEIB53692.2021.9686420.
|
[47] |
S. Yuan, H. Y. Zhao, S. Zhao, J. H. Leng, Y. X. Liang, X. Z. Wang, J. F. Yu, X. Lv, Z. Shao, J. A. He, Y. K. Lin, X. Han, Z. H. Liu, N. Ding, Y. M. Rao, Y. Z. Gao, L. Zhang, M. Ding, C. Fang, Y. S. Wang, M. S. Long, J. Zhang, Y. P. Dong, T. Y. Pang, P. Cui, L. X. Huang, Z. Liang, H. W. Shen, H. Zhang, Q. S. Zhang, Q. X. Dong, Z. X. Tan, M. X. Wang, S. Wang, L. Zhou, H. R. Li, J. W. Bao, Y. W. Pan, W. N. Zhang, Z. Yu, R. Yan, C. C. Shi, M. H. Xu, Z. B. Zhang, G. Q. Wang, X. Pan, M. J. Li, X. Y. Chu, Z. J. Yao, F. W. Zhu, S. L. Cao, W. C. Xue, Z. X. Ma, Z. Y. Zhang, S. D. Hu, Y. J. Qin, C. J. Xiao, Z. N. Zeng, G. Q. Cui, W. Z. Chen, W. L. Zhao, Y. Yao, P. Li, W. Z. Zheng, W. L. Zhao, Z. Y. Wang, B. R. Zhang, N. Y. Fei, A. W. Hu, Z. N. Ling, H. Y. Li, B. X. Cao, X. P. Han, W. D. Zhan, B. B. Chang, H. Sun, J. W. Deng, C. J. Zheng, J. Z. Li, L. Hou, X. G. Cao, J. D. Zhai, Z. Y. Liu, M. S. Sun, J. W. Lu, Z. W. Lu, Q. Jin, R. H. Song, J. R. Wen, Z. C. Lin, L. W. Wang, H. Su, J. Zhu, Z. F. Sui, J. J. Zhang, Y. Liu, X. D. He, M. L. Huang, J. Tang, J. Tang. A roadmap for big model. [Online], Available: https://arxiv.org/abs/2203.14101, 2022.
|
[48] |
S. Q. Long, F. Q. Cao, S. C. Han, H. Q. Yang. Vision-and-language pretrained models: A survey. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, pp. 5530–5537, 2022. DOI: 10.24963/ijcai.2022/773.
|
[49] |
P. Xu, X. T. Zhu, D. A. Clifton. Multimodal learning with transformers: A survey. [Online], Available: https://arxiv.org/abs/2206.06488, 2022.
|
[50] |
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. DOI: 10.1109/5.726791.
|
[51] |
G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 2261–2269, 2017. DOI: 10.1109/CVPR.2017.243.
|
[52] |
B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth. Recent advances in natural language processing via large pre-trained language models: A survey. [Online], Available: https://arxiv.org/abs/2111.01243, 2021.
|
[53] |
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
|
[54] |
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever. Improving language understanding by generative pre-training, [Online], Available: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
|
[55] |
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, vol. 1, no. 8, Article number 9, 2019.
|
[56] |
C. Rosset. Turing-NLG: A 17-billion-parameter language model by Microsoft, [Online], Available: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020.
|
[57] |
W. Zeng, X. Z. Ren, T. Su, H. Wang, Y. Liao, Z. W. Wang, X. Jiang, Z. Z. Yang, K. S. Wang, X. D. Zhang, C. Li, Z. Y. Gong, Y. F. Yao, X. J. Huang, J. Wang, J. F. Yu, Q. Guo, Y. Yu, Y. Zhang, J. Wang, H. T. Tao, D. S. Yan, Z. X. Yi, F. Peng, F. Q. Jiang, H. Zhang, L. F. Deng, Y. H. Zhang, Z. Lin, C. Zhang, S. J. Zhang, M. Y. Guo, S. Z. Gu, G. J. Fan, Y. W. Wang, X. F. Jin, Q. Liu, Y. H. Tian. Pangu-$\alpha $: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. [Online], Available: https://arxiv.org/abs/2104.12369, 2021.
|
[58] |
J. Q. Wei, X. Z. Ren, X. G. Li, W. Y. Huang, Y. Liao, Y. S. Wang, J. S. Lin, X. Jiang, X. Chen, Q. Liu. NEZHA: Neural contextualized representation for Chinese language understanding. [Online], Available: https://arxiv.org/abs/1909.00204, 2019.
|
[59] |
M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, I. Sutskever. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, pp. 1691–1703, 2020.
|
[60] |
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
|
[61] |
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 213–229, 2020. DOI: 10.1007/978-3-030-58452-8_13.
|
[62] |
S. X. Zheng, J. C. Lu, H. S. Zhao, X. T. Zhu, Z. K. Luo, Y. B. Wang, Y. W. Fu, J. F. Feng, T. Xiang, P. H. S. Torr, L. Zhang. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 6877–6886, 2021. DOI: 10.1109/CVPR46437.2021.00681.
|
[63] |
H. T. Chen, Y. H. Wang, T. Y. Guo, C. Xu, Y. P. Deng, Z. H. Liu, S. W. Ma, C. J. Xu, C. Xu, W. Gao. Pre-trained image processing transformer. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12294–12305, 2021. DOI: 10.1109/CVPR46437.2021.01212.
|
[64] |
K. M. He, X. L. Chen, S. N. Xie, Y. H. Li, P. Dollár, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15979–15988, 2022. DOI: 10.1109/CVPR52688.2022.01553.
|
[65] |
H. B. Bao, L. Dong, S. H. Piao, F. R. Wei. BEiT: BERT pre-training of image transformers. In Proceedings of the 10th International Conference on Learning Representations, 2022.
|
[66] |
X. Y. Dong, J. M. Bao, T. Zhang, D. D. Chen, W. M. Zhang, L. Yuan, D. Chen, F. Wen, N. H. Yu, B. N. Guo. PeCo: Perceptual codebook for BERT pre-training of vision transformers. [Online], Available: https://arxiv.org/abs/2111.12710, 2021.
|
[67] |
S. Schneider, A. Baevski, R. Collobert, M. Auli. Wav2vec: Unsupervised pre-training for speech recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 3465–3469, 2019. DOI: 10.21437/Interspeech.2019-1873.
|
[68] |
A. Baevski, M. Auli, A. Mohamed. Effectiveness of self-supervised pre-training for speech recognition. [Online], Available: https://arxiv.org/abs/1911.03912, 2019.
|
[69] |
W. N. Hsu, B. Bolte, Y. H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio,Speech,Language Processing, vol. 29, pp. 3451–3460, 2021. DOI: 10.1109/TASLP.2021.3122291.
|
[70] |
A. Baevski, Y. H. Zhou, A. Mohamed, M. Auli. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 1044, 2020.
|
[71] |
Y. A. Chung, Y. Zhang, W. Han, C. C. Chiu, J. Qin, R. M. Pang, Y. H. Wu. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, Cartagena, Colombia, pp. 244–250, 2021. DOI: 10.1109/ASRU51503.2021.9688253.
|
[72] |
P. P. Zhu, X. Wang, L. Zhu, Z. L. Sun, W. S. Zheng, Y. W. Wang, C. W. Chen. Prompt-based learning for unpaired image captioning. [Online], Available: https://arxiv.org/abs/2205.13125, 2022.
|
[73] |
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
|
[74] |
Y. H. Xing, Q. R. Wu, D. Cheng, S. Z. Zhang, G. Q. Liang, Y. N. Zhang. Class-aware visual prompt tuning for vision-language pre-trained model. [Online], Available: https://arxiv.org/abs/2208.08340, 2022.
|
[75] |
V. Ordonez, G. Kulkarni, T. Berg. Im2Text: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, pp. 1143–1151, 2011.
|
[76] |
P. Young, A. Lai, M. Hodosh, J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In Proceedings of Transactions of the Association for Computational Linguistics, Cambridge, USA, pp. 67–78, 2014. DOI: 10.1162/tacl_a_00166.
|
[77] |
M. Hodosh, P. Young, J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013. DOI: 10.1613/jair.3994.
|
[78] |
X. L. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. [Online], Available: https://arxiv.org/abs/1504.00325, 2015.
|
[79] |
T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 740–755, 2014. DOI: 10.1007/978-3-319-10602-1_48.
|
[80] |
R. Krishna, Y. K. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017. DOI: 10.1007/s11263-016-0981-7.
|
[81] |
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6325–6334, 2017. DOI: 10.1109/CVPR.2017.670.
|
[82] |
N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-gen: The generative fashion dataset and challenge. [Online], Available: https://arxiv.org/abs/1806.08317, 2018.
|
[83] |
P. Sharma, N. Ding, S. Goodman, R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2556–2565, 2018. DOI: 10.18653/v1/P18-1238.
|
[84] |
D. A. Hudson, C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6693–6702, 2019. DOI: 10.1109/CVPR.2019.00686.
|
[85] |
D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. [Online], Available: https://arxiv.org/abs/2001.07966, 2020.
|
[86] |
S. Changpinyo, P. Sharma, N. Ding, R. Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3557–3567, 2021. DOI: 10.1109/CVPR46437.2021.00356.
|
[87] |
J. Lei, L. C. Yu, M. Bansal, T. Berg. TVQA: Localized, compositional video question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 1369–1379, 2018. DOI: 10.18653/v1/D18-1167.
|
[88] |
A. Miech, D. Zhukov, J. B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 2630–2640, 2019. DOI: 10.1109/ICCV.2019.00272.
|
[89] |
M. Bain, A. Nagrani, G. Varol, A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1708–1718, 2021. DOI: 10.1109/ICCV48922.2021.00175.
|
[90] |
B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L. J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016. DOI: 10.1145/2812802.
|
[91] |
C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, A. Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. [Online], Available: https://arxiv.org/abs/2111.02114, 2021.
|
[92] |
K. Desai, G. Kaul, Z. Aysola, J. Johnson. RedCaps: Web-curated image-text data created by the people, for the people. In Proceedings of the 1st Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
|
[93] |
J. X. Gu, X. J. Meng, G. S. Lu, L. Hou, M. Z. Niu, H. Xu, X. D. Liang, W. Zhang, X. Jiang, C. J. Xu. Wukong: 100 million large-scale Chinese cross-modal pre-training dataset and a foundation framework. [Online], Available: https://arxiv.org/abs/2202.06767, 2022.
|
[94] |
Z. Parekh, J. Baldridge, D. Cer, A. Waters, Y. F. Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for MS-COCO. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, ACL, pp. 2855–2870, 2021. DOI: 10.18653/v1/2021.eacl-main.249.
|
[95] |
X. L. Zhan, Y. X. Wu, X. Dong, Y. C. Wei, M. L. Lu, Y. C. Zhang, H. Xu, X. D. Liang. Product1M: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 11762–11771, 2021. DOI: 10.1109/ICCV48922.2021.01157.
|
[96] |
K. Srinivasan, K. Raman, J. C. Chen, M. Bendersky, M. Najork. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th ACM, International SIGIR Conference on Research and Development in Information Retrieval, pp. 2443–2449, 2021. DOI: 10.1145/3404835.3463257.
|
[97] |
C. Sun, A. Shrivastava, S. Singh, A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 843–852, 2017. DOI: 10.1109/ICCV.2017.97.
|
[98] |
J. W. Yang, C. Y. Li, P. C. Zhang, X. Y. Dai, B. Xiao, L. Yuan, J. F. Gao. Focal self-attention for local-global interactions in vision transformers. [Online], Available: https://arxiv.org/abs/2107.00641, 2021.
|
[99] |
D. Mahajan, R. Girshick, V. Ramanathan, K. M. He, M. Paluri, Y. X. Li, A. Bharambe, L. Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 185–201, 2018. DOI: 10.1007/978-3-030-01216-8_12.
|
[100] |
J. Y. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. C. Zhang, P. Wang, A. Wang, L. Jiang, X. Y. Jia, J. Zhang, J. W. Zhang, X. Zou, Z. K. Li, X. D. Deng, J. Liu, J. B. Xue, H. L. Zhou, J. X. Ma, J. Yu, Y. Li, W. Lin, J. R. Zhou, J. Tang, H. X. Yang. M6: A Chinese multimodal pretrainer. [Online], Available: https://arxiv.org/abs/2103.00823, 2021.
|
[101] |
X. Dong, X. L. Zhan, Y. X. Wu, Y. C. Wei, X. Y. Wei, M. L. Lu, X. D. Liang. M5Product: A multi-modal pretraining benchmark for e-commercial product downstream tasks. [Online], Available: https://arxiv.org/abs/2109.04275, 2021.
|
[102] |
J. Pont-Tuset, J. Uijlings, S. Changpinyo, R. Soricut, V. Ferrari. Connecting vision and language with localized narratives. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 647–664, 2020. DOI: 10.1007/978-3-030-58558-7_38.
|
[103] |
Y. Q. Huo, M. L. Zhang, G. Z. Liu, H. Y. Lu, Y. Z. Gao, G. X. Yang, J. Y. Wen, H. Zhang, B. G Xu, W. H. Zheng, Z. Z. Xi, Y. Q. Yang, A. W. Hu, J. M. Zhao, R. C. Li, Y. D. Zhao, L. Zhang, Y. Q. Song, X. Hong, W. Q. Cui, D. Y. Hou, Y. Y. Li, J. Y. Li, P. Y. Liu, Z. Gong, C. H. Jin, Y. C. Sun, S. Z. Chen, Z. W. Lu, Z. C. Dou, Q. Jin, Y. Y. Lan, W. X. Zhao, R. H. Song, J. R. Wen. WenLan: Bridging vision and language by large-scale multi-modal pre-training. [Online], Available: https://arxiv.org/abs/2103.06561, 2021.
|
[104] |
Y. Sha, S. Zhao, J. H. Leng, Z. Xue, H. Y. Zhao, J. Tang. WuDaoMM: A large-scale multi-modal dataset for pre-training models. [Online], Available: https://arxiv.org/abs/2203.11480, 2022.
|
[105] |
D. L. Chen, F. Liu, X. Y. Du, R. Z. Gao, F. Xu. MEP-3M: A large-scale multi-modal E-commerce products dataset. In Proceedings of IJCAI Workshop on Long-Tailed Distribution Learning, 2021.
|
[106] |
N. Y. Fei, Z. W. Lu, Y. Z. Gao, G. X. Yang, Y. Q. Huo, J. Y. Wen, H. Y. Lu, R. H. Song, X. Gao, T. Xiang, H. Sun, J. R. Wen. WenLan 2.0: Make ai imagine via a multimodal foundation model. [Online], Available: https://arxiv.org/abs/2110.14378, 2021.
|
[107] |
B. L. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5122–5130, 2017. DOI: 10.1109/CVPR.2017.544.
|
[108] |
P. C. Zhang, X. J. Li, X. W. Hu, J. W. Yang, L. Zhang, L. J. Wang, Y. Choi, J. F. Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5575–5584, 2021. DOI: 10.1109/CVPR46437.2021.00553.
|
[109] |
G. Li, N. Duan, Y. J. Fang, M. Gong, D. X. Jiang. Unicoder-Vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 11336–11344, 2020. DOI: 10.1609/aaai.v34i07.6795.
|
[110] |
J. Y. Lin, A. Yang, Y. C. Zhang, J. Liu, J. R. Zhou, H. X. Yang. InterBERT: Vision-and-language interaction for multi-modal pretraining. [Online], Available: https://arxiv.org/abs/2003.13198, 2020.
|
[111] |
Z. R. Wang, J. H. Yu, A. W. Yu, Z. H. Dai, Y. Tsvetkov, Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In Proceedings of the 10th International Conference on Learning Representations, 2022.
|
[112] |
H. Tan, M Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 5100–5111, 2019. DOI: 10.18653/v1/D19-1514.
|
[113] |
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA, pp. 2227–2237, 2018. DOI: 10.18653/v1/N18-1202.
|
[114] |
L. Dong, N. Yang, W. H. Wang, F. R. Wei, X. D. Liu, Y. Wang, J. F. Gao, M. Zhou, H. W. Hon. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13042–13054, 2019.
|
[115] |
G. Peyré, M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends? in Machine Learning, vol. 11, no. 5–6, pp. 355–607, 2019. DOI: 10.1561/2200000073.
|
[116] |
Y. J. Xie, X. F. Wang, R. J. Wang, H. Y. Zha. A fast proximal point method for computing exact wasserstein distance. In Proceedings of the 35th Uncertainty in Artificial Intelligence, Tel Aviv, Israel, pp. 433–453, 2020.
|
[117] |
W. T. Hao, C. Y. Li, X. J. Li, L. Carin, J. F. Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 13134–13143, 2020. DOI: 10.1109/CVPR42600.2020.01315.
|
[118] |
F. Yu, J. J. Tang, W. C. Yin, Y. Sun, H. Tian, H. Wu, H. F. Wang. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 3208–3216, 2021. DOI: 10.1609/aaai.v35i4.16431.
|
[119] |
M. C. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12642–12652, 2021. DOI: 10.1109/CVPR46437.2021.01246.
|
[120] |
H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL, pp. 503–513, 2021. DOI: 10.18653/v1/2021.acl-long.42.
|
[121] |
L. J. Li, Y. C. Chen, Y. Cheng, Z. Gan, L. C. Yu, J. J. Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 2046–2065, 2020. DOI: 10.18653/v1/2020.emnlp-main.161.
|
[122] |
Y. Ling, J. F. Yu, R. Xia. Vision-language pre-training for multimodal aspect-based sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 2149–2159, 2022. DOI: 10.18653/v1/2022.acl-long.152.
|
[123] |
Y. H. Cui, Z. Yu, C. Q. Wang, Z. Z. Zhao, J. Zhang, M. Wang, J. Yu. ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 797–806, 2021. DOI: 10.1145/3474085.3475251.
|
[124] |
M. H. Guo, T. X. Xu, J. J. Liu, Z. N. Liu, P. T. Jiang, T. J. Mu, S. H. Zhang, R. R. Martin, M. M. Cheng, S. M. Hu. Attention mechanisms in computer vision: A survey. Computational Visual Media, vol. 8, no. 3, pp. 331–368, 2022. DOI: 10.1007/s41095-022-0271-y.
|
[125] |
J. N. Li, R. Selvaraju, A. Gotmare, S. Joty, C. M. Xiong, S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 9694–9705, 2021.
|
[126] |
W. Suo, M. Y. Sun, P. Wang, Q. Wu. Proposal-free one-stage referring expression via grid-word cross-attention. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, Canada, pp. 1032–1038, 2021. DOI: 10.24963/ijcai.2021/143.
|
[127] |
Z. Y. Yang, Y. W. Fang, C. G. Zhu, R. Pryzant, D. D. Chen, Y. Shi, Y. C. Xu, Y. Qian, M. Gao, Y. L. Chen, L. Y. Lu, Y. J. Xie, R. Gmyr, N. Codella, N. Kanda, B. Xiao, L. Yuan, T. Yoshioka, M. Zeng, X. D. Huang. I-code: An integrative and composable multimodal learning framework. [Online], Available: https://arxiv.org/abs/2205.01818, 2022.
|
[128] |
L. C. Zhu, Y. Yang. ActBERT: Learning global-local video-text representations. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 8743–8752, 2020. DOI: 10.1109/CVPR42600.2020.00877.
|
[129] |
M. M. Wang, J. Z. Xing, Y. Liu. ActionCLIP: A new paradigm for video action recognition. [Online], Available: https://arxiv.org/abs/2109.08472, 2021.
|
[130] |
M. L. Li, R. C. Xu, S. H. Wang, L. W. Zhou, X. D. Lin, C. G. Zhu, M. Zeng, H. Ji, S. F. Chang. CLIP-event: Connecting text and images with event structures. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16399–16408, 2022. DOI: 10.1109/CVPR52688.2022.01593.
|
[131] |
Y. F. Cui, L. C. Zhao, F. Liang, Y. G. Li, J. Shao. Democratizing contrastive language-image pre-training: A CLIP benchmark of data, model, and supervision. [Online], Available: https://arxiv.org/abs/2203.05796, 2022.
|
[132] |
S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. W. Chang, Z. W. Yao, K. Keutzer. How much can CLIP benefit vision-and-language tasks? In Proceedings of the 10th International Conference on Learning Representations, 2022.
|
[133] |
D. L. Chen, Z. Wu, F. Liu, Z. Q. Yang, Y. X. Huang, Y. P. Bao, E. J. Zhou. Prototypical contrastive language image pretraining. [Online], Available: https://arxiv.org/abs/2206.10996, 2022.
|
[134] |
L. H. Li, M. Yatskar, D. Yin, C. J. Hsieh, K. W. Chang. VisualBERT: A simple and performant baseline for vision and language. [Online], Available: https://arxiv.org/abs/1908.03557, 2019.
|
[135] |
J. S. Lu, D. Batra, D. Parikh, S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of 32th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13–23, 2019.
|
[136] |
C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 2131–2140, 2019. DOI: 10.18653/v1/D19-1219.
|
[137] |
W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
|
[138] |
L. W. Zhou, H. Palangi, L. Zhang, H. D. Hu, J. Corso, J. F. Gao. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 13041–13049, 2020. DOI: 10.1609/aaai.v34i07.7005.
|
[139] |
J. S. Lu, V. Goswami, M. Rohrbach, D. Parikh, S. Lee. 12-in-1: Multi-task vision and language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10434–10443, 2020. DOI: 10.1109/CVPR42600.2020.01045.
|
[140] |
V. Murahari, D. Batra, D. Parikh, A. Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 336–352, 2020. DOI: 10.1007/978-3-030-58523-5_20.
|
[141] |
Y. T. Gao, J. F. Liu, Z. H. Xu, J. Zhang, K. Li, C. H. Shen. PyramidCLIP: Hierarchical feature alignment for vision-language model pretraining. [Online], Available: https://arxiv.org/abs/2204.14095, 2022.
|
[142] |
D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251–2260, 2020. DOI: 10.1145/3397271.3401430.
|
[143] |
Z. Gan, Y. C. Chen, L. J. Li, C. Zhu, Y. Cheng, J. J. Liu. Large-scale adversarial training for vision-and-language representation learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 555, 2020.
|
[144] |
D. D. Song, S. Y. Ma, Z. C. Sun, S. C. Yang, L. J. Liao. KVL-BERT: Knowledge enhanced visual-and-linguistic BERT for visual commonsense reasoning. Knowledge-Based Systems, vol. 230, Article number 107408, 2021. DOI: 10.1016/j.knosys.2021.107408.
|
[145] |
J. Cho, J. Lei, H. Tan, M. Bansal. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 1931–1942, 2021.
|
[146] |
W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
|
[147] |
A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1760–1770, 2021. DOI: 10.1109/ICCV48922.2021.00180.
|
[148] |
Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the bOx: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12971–12980, 2021. DOI: 10.1109/CVPR46437.2021.01278.
|
[149] |
H. W. Xue, Y. P. Huang, B. Liu, H. W. Peng, J. L. Fu, H. Q. Li, J. B. Luo. Probing inter-modality: Visual parsing with self-attention for vision-and-language pre-training. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 4514–4528, 2021.
|
[150] |
A. Jain, M. Guo, K. Srinivasan, T. Chen, S. Kudugunta, C. Jia, Y. F. Yang, J. Baldridge. MURAL: Multimodal, multitask retrieval across languages. [Online], Available: https://arxiv.org/abs/2109.05125, 2021.
|
[151] |
W. H. Wang, H. B. Bao, L. Dong, F. R. Wei. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. [Online], Available: https://arxiv.org/abs/2111.02358, 2021.
|
[152] |
Z. Y. Dou, Y. C. Xu, Z. Gan, J. F. Wang, S. H. Wang, L. J. Wang, C. G. Zhu, P. C. Zhang, L. Yuan, N. Y. Peng, Z. C. Liu, M. Zeng. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 18145–18155, 2022. DOI: 10.1109/CVPR52688.2022.01763.
|
[153] |
C. Sun, A. Myers, C. Vondrick, K. Murphy, C. Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 7463–7472, 2019. DOI: 10.1109/ICCV.2019.00756.
|
[154] |
C. Sun, F. Baradel, K. Murphy, C. Schmid. Learning video representations using contrastive bidirectional transformer. [Online], Available: https://arxiv.org/abs/1906.05743, 2019.
|
[155] |
H. H. Luo, L. Ji, B. T. Shi, H. Y. Huang, N. Duan, T. R. Li, J. Li, T. Bharti, M. Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. [Online], Available: https://arxiv.org/abs/2002.06353, 2020.
|
[156] |
A. Urooj, A. Mazaheri, N. Da Vitoria Lobo, M. Shah. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4648–4660, 2020. DOI: 10.18653/v1/2020.findings-emnlp.417.
|
[157] |
R. Yan, M. Z. Shou, Y. X. Ge, A. J. Wang, X. D. Lin, G. Y. Cai, J. H. Tang. Video-text pre-training with learned regions. [Online], Available: https://arxiv.org/abs/2112.01194, 2021.
|
[158] |
W. Li, C. Gao, G. C. Niu, X. Y. Xiao, H. Liu, J. C. Liu, H. Wu, H. F. Wang. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL, pp. 2592–2607, 2021. DOI: 10.18653/v1/2021.acl-long.202.
|
[159] |
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8821–8831, 2021.
|
[160] |
L. K. Gui, Q. Y. Huang, S. Som, A. Hauptmann, Y. Bisk, J. F. Gao. Training vision-language transformers from captions alone. [Online], Available: https://arxiv.org/abs/2205.09256, 2022.
|
[161] |
M. Ding, Z. Y. Yang, W. Y. Hong, W. D. Zheng, C. Zhou, D. Yin, J. Y. Lin, X. Zou, Z. Shao, H. X. Yang, J. Tang. CogView: Mastering text-to-image generation via transformers. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 19822–19835, 2021.
|
[162] |
H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 24206–24221, 2021.
|
[163] |
L. Yuan, D. D. Chen, Y. L. Chen, N. Codella, X. Y. Dai, J. F. Gao, H. D. Hu, X. D. Huang, B. X. Li, C. Y. Li, C. Liu, M. C. Liu, Z. C. Liu, Y. M. Lu, Y. Shi, L. J, Wang, J. F. Wang, B. Xiao, Z. Xiao, J. W. Yang, M. Zeng, L. W. Zhou, P. C. Zhang. Florence: A new foundation model for computer vision. [Online], Available: https://arxiv.org/abs/2111.11432, 2021.
|
[164] |
S. Bakkali, Z. H. Ming M. Coustaty, M. Rusiñol, O. R. Terrades. VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification. [Online], Available: https://arxiv.org/abs/2205.12029, 2022.
|
[165] |
L. H. Wei, L. X. Xie, W. G. Zhou, H. Q. Li, Q. Tian. MVP: Multimodality-guided visual pre-training. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 337–353, 2022. DOI: 10.1007/978-3-031-20056-4_20.
|
[166] |
W. X. Hong, K. X. Ji, J. J. Liu, J. Wang, J. D. Chen, W. Chu. GilBERT: Generative vision-language pre-training for image-text retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1379–1388, 2021. DOI: 10.1145/3404835.3462838.
|
[167] |
H. Y. Lu, N. Y. Fei, Y. Q. Huo, Y. Z. Gao, Z. W. Lu, J. R. Wen. COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 5671–15680, 2022. DOI: 10.1109/CVPR52688.2022.01524.
|
[168] |
L. H. Li, H. X. You, Z. C. Wang, A. Zareian, S. F. Chang, K. W. Chang. Unsupervised vision-and-language pre-training without parallel images and captions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5339–5350, 2021.
|
[169] |
J. B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. D. Han, Z. T. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, K. Simonyan. Flamingo: A visual language model for few-shot learning. [Online], Available: https://arxiv.org/abs/2204.14198, 2022.
|
[170] |
M. H. Ni, H. Y. Huang, L. Su, E. Cui, T. Bharti, L. J. Wang, D. D. Zhang, N. Duan. M.3PM.3P: Learning universal representations via multitask multilingual multimodal pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3976–3985, 2021. DOI: 10.1109/CVPR46437.2021.00397.
|
[171] |
J. N. Li, D. X. Li, C. M. Xiong, S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 12888–12900, 2022.
|
[172] |
C. F. Wu, J. Liang, L. Ji, F. Yang, Y. J. Fang, D. X. Jiang, N. Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 720–736, 2022. DOI: 10.1007/978-3-031-19787-1_41.
|
[173] |
J. Y. Yang, J. L. Duan, S. Tran, Y. Xu, S. Chanda, L. Q. Chen, B. Zeng, T. Chilimbi, J. Z. Huang. Vision-language pre-training with triple contrastive learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15650–15659, 2022. DOI: 10.1109/CVPR52688.2022.01522.
|
[174] |
X. Dong, X. L. Zhan, Y. X. Wu, Y. C. Wei, M. C. Kampffmeyer, X. Y. Wei, M. L. Lu, Y. W. Wang, X. D. Liang, X. D. Liang. M5product: Self-harmonized contrastive learning for E-commercial multi-modal pretraining. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 21220–21230, 2022. DOI: 10.1109/CVPR52688.2022.02057.
|
[175] |
B. Yan, M. T. Pei. Clinical-BERT: Vision-language pre-training for radiograph diagnosis and reports generation. In Proceedings of the 36th AAAI, Conference on Artificial Intelligence, pp. 2982–2990, 2022. DOI: 10.1609/aaai.v36i3.20204.
|
[176] |
Y. W. Zhong, J. W. Yang, P. C. Zhang, C. Y. Li, N. Codella, L. H. Li, L. W. Zhou, X. Y. Dai, L. Yuan, Y. Li, J. F. Gao. RegionCLIP: Region-based language-image pretraining. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16772–16782, 2021. DOI: 10.1109/CVPR52688.2022.01629.
|
[177] |
X. W. Liang, F. D. Zhu, L. L. Li, H. Xu, X. D. Liang. Visual-language navigation pretraining via prompt-based environmental self-exploration. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 4837–4851, 2022. DOI: 10.18653/v1/2022.acl-long.332.
|
[178] |
L. H. Li, P. C. Zhang, H. T. Zhang, J. W. Yang, C. Y. Li, Y. W. Zhong, L. J. Wang, L. Yuan, L. Zhang, J. N. Hwang, K. W. Chang, J. F. Gao. Grounded language-image pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 10955–10965, 2022. DOI: 10.1109/CVPR52688.2022.01069.
|
[179] |
C. Y. Xie, H. Cai, J. F. Song, J. H. Li, F. J. Kong, X. Y. Wu, H. Morimitsu, L. Yao, D. X. Wang, D. W. Leng, X. Y. Ji, Y. F. Deng. Zero and R2D2: A large-scale Chinese cross-modal benchmark and A vision-language framework. [Online], Available: https://arxiv.org/abs/2205.03860, 2022.
|
[180] |
N. Mu, A. Kirillov, D. Wagner, S. N. Xie. SLIP: Self-supervision meets language-image pre-training. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 529–544, 2021. DOI: 10.1007/978-3-031-19809-0_30.
|
[181] |
L. W. Yao, R. H. Huang, L. Hou, G. S. Lu, M. Z. Niu, H. Xu, X. D. Liang, Z. G. Li, X. Jiang, C. J. Xu. FILIP: Fine-grained interactive language-image pre-training. In Proceedings of the 10th International Conference on Learning Representations, 2022.
|
[182] |
C. L. Li, M. Yan, H. Y. Xu, F. L. Luo, W. Wang, B. Bi, S. F. Huang. SemVLP: Vision-language pre-training by aligning semantics at multiple levels. [Online], Available: https://arxiv.org/abs/2103.07829, 2021.
|
[183] |
J. H. Yu, Z. R. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. H. Wu. CoCa: Contrastive captioners are image-text foundation models. [Online], Available: https://arxiv.org/abs/2205.01917, 2022.
|
[184] |
F. L. Chen, X. Y. Chen, J. X. Shi, D. Z. Zhang, J. L. Chang, Q. Tian. HiVLP: Hierarchical vision-language pre-training for fast image-text retrieval. [Online], Available: https://arxiv.org/abs/ 2205.12105, 2022.
|
[185] |
A. Guzhov, F. Raue, J. Hees, A. Dengel. Audioclip: Extending clip to image, text and audio. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 976–980, 2022. DOI: 10.1109/ICASSP43922.2022.9747631.
|
[186] |
H. B. Bao, W. H. Wang, L. Dong, F. R. Wei. VL-BEiT: Generative vision-language pretraining. [Online], Available: https://arxiv.org/abs/2206.01127, 2022.
|
[187] |
P. H. Seo, A. Nagrani, A. Arnab, C. Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 17938–17947, 2022. DOI: 10.1109/CVPR52688.2022.01743.
|
[188] |
Z. H. Fan, Z. Y. Wei, J. J. Chen, S. Y. Wang, Z. J. Li, J. R. Xu, X. J. Huang. A unified continuous learning framework for multi-modal knowledge discovery and pre-training. [Online], Available: https://arxiv.org/abs/2206.05555, 2022.
|
[189] |
H. T. Zhang, P. C. Zhang, X. W. Hu, Y. C. Chen, L. H. Li, X. Y. Dai, L. J. Wang, L. Yuan, J. N. Hwang, J. F. Gao. GLIPv2: Unifying localization and vision-language understanding. [Online], Available: https://arxiv.org/abs/2206.05836, 2022.
|
[190] |
B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, N. Houlsby. Multimodal contrastive learning with LIMoE: The language-image mixture of experts. [Online], Available: https://arxiv.org/abs/2206.02770, 2022.
|
[191] |
T. Wang, W. H. Jiang, Z. C. Lu, F. Zheng, R. Cheng, C. G. Yin, L. Ping. VLMixer: Unpaired vision-language pre-training via cross-modal cutmix. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 22680–22690, 2022.
|
[192] |
A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 2787–2795, 2013.
|
[193] |
Z. Wang, J. W. Zhang, J. L. Feng, Z. Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, Québec City, Canada, pp. 1112–1119, 2014. DOI: 10.1609/aaai.v28i1.8870.
|
[194] |
G. L. Ji, S. Z. He, L. H. Xu, K. Liu, J. Zhao. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, pp. 687–696, 2015. DOI: 10.3115/v1/P15-1067.
|
[195] |
Y. K. Lin, Z. Y. Liu, M. S. Sun, Y. Liu, X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, USA, pp. 2181–2187, 2015. DOI: 10.1609/aaai.v29i1.9491.
|
[196] |
G. L. Ji, K. Liu, S. Z. He, J. Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, USA, pp. 985–991, 2016.
|
[197] |
M. Nickel, V. Tresp, H. P. Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, USA, pp. 809–816, 2011.
|
[198] |
R. Socher, D. Q. Chen, C. D. Manning, A. Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, pp. 926–934, 2013.
|
[199] |
B. S. Yang, W. T. Yih, X. D. He, J. F. Gao, L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA, 2015. DOI: 10.48550/arXiv.1412.6575.
|
[200] |
A. Bordes, X. Glorot, J. Weston, Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, vol. 94, no. 2, pp. 233–259, 2014. DOI: 10.1007/s10994-013-5363-6.
|
[201] |
M. Nickel, L. Rosasco, T. Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, USA, pp. 1955–1961, 2016.
|
[202] |
J. Bruna, W. Zaremba, A. Szlam, Y. LeCun. Spectral networks and locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014. DOI: 10.48550/arXiv.1312.6203.
|
[203] |
T. N. Kipf, M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
|
[204] |
T. N. Kipf, M. Welling. Variational graph auto-encoders. [Online], Available: https://arxiv.org/abs/1611.07308, 2016.
|
[205] |
W. L. Hamilton, R. Ying, J. Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 1025–1035, 2017.
|
[206] |
P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
|
[207] |
M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, M. Welling. Modeling relational data with graph convolutional networks. In Proceedings of the 15th International Conference on the Semantic Web, Springer, Heraklion, Greece, pp. 593–607, 2018. DOI: 10.1007/978-3-319-93417-4_38.
|
[208] |
C. Shang, Y. Tang, J. Huang, J. B. Bi, X. D. He, B. W. Zhou. End-to-end structure-aware convolutional networks for knowledge base completion. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, USA, pp. 3060–3067, 2019. DOI: 10.1609/aaai.v33i01.33013060.
|
[209] |
T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel. Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, pp. 1811–1818, 2018. DOI: 10.1609/aaai.v32i1.11573.
|
[210] |
D. Nathani, J. Chauhan, C. Sharma, M. Kaul. Learning attention-based embeddings for relation prediction in knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4710–4723, 2019. DOI: 10.18653/v1/P19-1466.
|
[211] |
S. Vashishth, S. Sanyal, V. Nitin, P. Talukdar. Composition-based multi-relational graph convolutional networks. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
|
[212] |
Y. Z. Li, B. W. Yu, X. Mengge, T. W. Liu. Enhancing pre-trained Chinese character representation with word-aligned attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3442–3448, 2020. DOI: 10.18653/v1/2020.acl-main.315.
|
[213] |
P. Ke, H. Z. Ji, S. Y. Liu, X. Y. Zhu, M. L. Huang. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 6975–6988, 2020. DOI: 10.18653/v1/2020.emnlp-main.567.
|
[214] |
A. Roberts, C. Raffel, N. Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, pp. 5418–5426, 2020. DOI: 10.18653/v1/2020.emnlp-main.437.
|
[215] |
D. Sachan, Y. H. Zhang, P. Qi, W. L. Hamilton. Do syntax trees help pre-trained transformers extract information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2647–2661, 2021. DOI: 10.18653/v1/2021.eacl-main.228.
|
[216] |
J. R. Zhou, Z. S. Zhang, H. Zhao, S. L. Zhang. LIMIT-BERT: Linguistics informed multi-task BERT. In Proceedings of Findings of the Association for Computational Linguistics, pp. 4450–4461, 2020. DOI: 10.18653/v1/2020.findings-emnlp.399.
|
[217] |
Z. Y. Zhang, X. Han, Z. Y. Liu, X. Jiang, M. S. Sun, Q. Liu. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1441–1451, 2019. DOI: 10.18653/v1/P19-1139.
|
[218] |
M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, N. A. Smith. Knowledge enhanced contextual word representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 43–54, 2019. DOI: 10.18653/v1/D19-1005.
|
[219] |
P. Wang, Q. Wu, C. H. Shen, A. Dick, A. van den Hengel. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, pp. 1290–1296, 2017. DOI: 10.24963/ijcai.2017/179.
|
[220] |
P. Wang, Q. Wu, C. H. Shen, A. Dick, A. van den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413–2427, 2018. DOI: 10.1109/TPAMI.2017.2754246.
|
[221] |
J. Deng, N. Ding, Y. Q. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, H. Adam. Large-scale object classification using label relation graphs. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 48–64, 2014. DOI: 10.1007/978-3-319-10590-1_4.
|
[222] |
T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, S. Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, vol. 7, pp. 452–466, 2019. DOI: 10.1162/tacl_a_00276.
|
[223] |
Z. L. Yang, P. Qi, S. Z. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 2369–2380, 2018. DOI: 10.18653/v1/D18-1259.
|
[224] |
C. Clark, K. Lee, M. W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ACL, Minneapolis, USA, pp. 2924–2936, 2019. DOI: 10.18653/v1/N19-1300.
|
[225] |
J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal. FEVER: A large-scale dataset for fact extraction and VERification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA, pp. 809–819, 2018. DOI: 10.18653/v1/N18-1074.
|
[226] |
Z. C. Guo, D. Barbosa. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, Shanghai, China, pp. 499–508, 2014. DOI: 10.1145/2661829.2661887.
|
[227] |
A. Talmor, J. Herzig, N. Lourie, J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4149–4158, 2019. DOI: 10.18653/v1/N19-1421.
|
[228] |
C. Bhagavatula, R. Le Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. T. Yih, Y. Choi. Abductive commonsense reasoning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
|
[229] |
B. Y. Lin, W. C. S. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, X. Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Proceedings of Findings of the Association for Computational Linguistics, pp. 1823–1840, 2020. DOI: 10.18653/v1/2020.findings-emnlp.165.
|
[230] |
M. Sap, H. Rashkin, D. Chen, R. Le Bras, Y. Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 4463–4473, 2019. DOI: 10.18653/v1/D19-1454.
|
[231] |
Y. Bisk, R. Zellers, R. Le Bras, J. F. Gao, Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 7432–7439, 2020. DOI: 10.1609/aaai.v34i05.6239.
|
[232] |
B. Zhou, D. Khashabi, Q. Ning, D. Roth. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, ACL, Hong Kong, China, pp. 3363–3369, 2019. DOI: 10.18653/v1/D19-1332.
|
[233] |
B. Zhou, K. Richardson, Q. Ning, T. Khot, A. Sabharwal, D. Roth. Temporal reasoning on implicit events from distant supervision. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1361–1371, 2021. DOI: 10.18653/v1/2021.naacl-main.107.
|
[234] |
H. Agrawal, K. Desai, Y. F. Wang, X. L. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, P. Anderson. Nocaps: Novel object captioning at scale. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 8947–8956, 2019. DOI: 10.1109/ICCV.2019.00904.
|
[235] |
A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, D. Batra. Visual dialog. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 1080–1089, 2017. DOI: 10.1109/CVPR.2017.121.
|
[236] |
P. C. Yang, B. X. Chen, P. Zhang, X. Sun. Visual agreement regularized training for multi-modal machine translation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, USA, pp. 9418–9425, 2020. DOI: 10.1609/aaai.v34i05.6484.
|
[237] |
S. Antol, A. Agrawal, J. S. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh. VQA: Visual question answering. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2425–2433, 2015. DOI: 10.1109/ICCV.2015.279.
|
[238] |
J. Z. Liu, W. H. Chen, Y. Cheng, Z. Gan, L. C. Yu, Y. M. Yang, J. J. Liu. Violin: A large-scale dataset for video-and-language inference. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10897–10907, 2020. DOI: 10.1109/CVPR42600.2020.01091.
|
[239] |
A. Suhr, M. Lewis, J. Yeh, Y. Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 217–223, 2017. DOI: 10.18653/v1/P17-2034.
|
[240] |
N. Xie, F. Lai, D. Doran, A. Kadav. Visual entailment: A novel task for fine-grained image understanding. [Online], Available: https://arxiv.org/abs/1901.06706, 2019.
|
[241] |
I. Dagan, O. Glickman, B. Magnini. The PASCAL recognising textual entailment challenge. In Proceedings of the 1st PASCAL Machine Learning Challenges Workshop, Springer, Southampton, UK, pp. 177–190, 2005. DOI: 10.1007/11736790_9.
|
[242] |
R. Zellers, Y. Bisk, A. Farhadi, Y. Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6713–6724, 2019. DOI: 10.1109/CVPR.2019.00688.
|
[243] |
X. Wang, S. F. Zheng, R. Yang, A. H. Zheng, Z. Chen, J. Tang, B. Luo. Pedestrian attribute recognition: A survey. Pattern Recognition, vol. 121, Article number 108220, 2022. DOI: 10.1016/j.patcog.2021.108220.
|
[244] |
D. Ghosal, S. Akhtar, D. Chauhan, S. Poria, A. Ekbal, P. Bhattacharyya. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Brussels, Belgium, pp. 3454–3466, 2018. DOI: 10.18653/v1/D18-1382.
|
[245] |
S. Li, T. Xiao, H. S. Li, B. L. Zhou, D. Y. Yue, X. G. Wang. Person search with natural language description. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5187–5196, 2017. DOI: 10.1109/CVPR.2017.551.
|
[246] |
W. Chen, Y. Liu, W. P. Wang, E. Bakker, T. Georgiou, P. Fieguth, L. Liu, M. S. Lew. Deep image retrieval: A survey. [Online], Available: https://arxiv.org/abs/2101.11282, 2021.
|
[247] |
J. Gu, E. Stefani, Q. Wu, J. Thomason, X. Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 7606–7623, 2022. DOI: 10.18653/v1/2022.acl-long.524.
|
[248] |
S. M. Park, Y. G. Kim. Visual language navigation: A survey and open challenges. Artificial Intelligence Review, vol. 56, no. 1, pp. 365–427, 2023. DOI: 10.1007/s10462-022-10174-9.
|
[249] |
H. W. Zhang, Y. L. Niu, S. F. Chang. Grounding referring expressions in images by variational context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 4158–4166, 2018. DOI: 10.1109/CVPR.2018.00437.
|
[250] |
S. B. Yang, G. B. Li, Y. Z. Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 4140–4149, 2019. DOI: 10.1109/CVPR.2019.00427.
|
[251] |
X. P. Ding, N. N. Wang, S. W. Zhang, Z. Y. Huang, X. M. Li, M. Q. Tang, T. L. Liu, X. B. Gao. Exploring language hierarchy for video grounding. IEEE Transactions on Image Processing, vol. 31, pp. 4693–4706, 2022. DOI: 10.1109/TIP.2022.3187288.
|
[252] |
Z. H. Tang, Y. Liao, S. Liu, G. B. Li, X. J. Jin, H. X. Jiang, Q. Yu, D. Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8238–8249, 2022. DOI: 10.1109/TCSVT.2021.3085907.
|
[253] |
X. Wang, X. J. Shu, Z. P. Zhang, B. Jiang, Y. W. Wang, Y. H. Tian, F. Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 13758–13768, 2021. DOI: 10.1109/CVPR46437.2021.01355.
|
[254] |
X. Wang, C. L. Li, R. Yang, T. Z. Zhang, J. Tang, B. Luo. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. [Online], Available: https://arxiv.org/abs/1811.10014, 2018.
|
[255] |
Q. Feng, V. Ablavsky, Q. X. Bai, S. Sclaroff. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5847–5856, 2021. DOI: 10.1109/CVPR46437.2021.00579.
|
[256] |
Y. Yao, A. Zhang, Z. Y. Zhang, Z. Y. Liu, T. S. Chua, M. S. Sun. CPT: Colorful prompt tuning for pre-trained vision-language models. [Online], Available: https://arxiv.org/abs/2109.11797, 2021.
|
[257] |
X. H. He, D. J. Yang, W. X. Feng, T. Fu, A. Akula, V. Jampani, P. Narayana, S. Basu, W. Y. Wang, X. Wang. CPL: Counterfactual prompt learning for vision and language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, Abu Dhabi, UAE, pp. 3407–3418, 2022.
|
[258] |
M. L. Jia, L. M. Tang, B. C. Chen, C. Cardie, S. Belongie, B. Hariharan, S. N. Lim. Visual prompt tuning. In Proceedings of the 17th European Conference on Computer Vision, Springer, Tel Aviv, Israel, pp. 709–727, 2022. DOI: 10.1007/978-3-031-19827-4_41.
|
[259] |
K. Y. Zhou, J. K. Yang, C. C. Loy, Z. W. Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022. DOI: 10.1007/s11263-022-01653-1.
|
[260] |
K. Y. Zhou, J. K. Yang, C. C. Loy, Z. W. Liu. Conditional prompt learning for vision-language models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16795–16804, 2022. DOI: 10.1109/CVPR52688.2022.01631.
|
[261] |
Q. Z. Wang, S. Li, H. Qin, A. M. Hao. Robust multi-modal medical image fusion via anisotropic heat diffusion guided low-rank structural analysis. Information Fusion, vol. 26, pp. 103–121, 2015. DOI: 10.1016/j.inffus.2015.01.001.
|
[262] |
X. Wang, X. J. Shu, S. Zhang, B. Jiang, Y. W. Wang, Y. H. Tian, F. Wu. MFGNet: Dynamic modality-aware filter generation for RGB-T tracking. IEEE Transactions on Multimedia, 2022. DOI: 10.1109/TMM.2022.3174341.
|
[263] |
K. H. Lee, X. Chen, G. Hua, H. D. Hu, X. D. He. Stacked cross attention for image-text matching. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 212–228, 2018. DOI: 10.1007/978-3-030-01225-0_13.
|