Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
Abstract
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous work. 1) Generalizability: existing methods often assume a strong semantic correlation between each text-image pair, and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: many recent works adopt the single-tower architecture with heavy detectors, which is inefficient at inference because the costly computation must be repeated for every text-image pair. To overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture that maps the text and image modalities into a unified feature space where they can be compared directly, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that CMCL readily generalizes to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that CMCL outperforms the state of the art while being much more efficient.
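The cross-modal contrastive loss over global image/text features mentioned above is, in spirit, a symmetric InfoNCE-style objective: within a batch, each matched image-text pair is a positive and all other pairings serve as negatives. The sketch below (in NumPy, with a `temperature` hyperparameter that is our assumption, not taken from the paper) illustrates the general idea; the paper's exact formulation may differ.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between row vectors of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of global features.

    img_feats, txt_feats: (N, D) arrays where row i of each is a matched pair.
    Matched (i, i) pairs are positives; all other in-batch pairings are negatives.
    """
    logits = cosine_sim(img_feats, txt_feats) / temperature
    # log-softmax over texts for each image (rows), and over images for each text (columns)
    i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    # average the negative log-likelihood of the matched (diagonal) pairs in both directions
    return -(np.diag(i2t).mean() + np.diag(t2i).mean()) / 2.0
```

Because both towers embed into the same space, retrieval at inference reduces to a single similarity computation between precomputed image and text features, which is what makes the two-tower design efficient compared to re-running a single-tower model per pair.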