Yu-Jia Zhou, Jing Yao, Zhi-Cheng Dou, Ledell Wu, Ji-Rong Wen. DynamicRetriever: A Pre-trained Model-based IR System Without an Explicit Index. Machine Intelligence Research. https://doi.org/10.1007/s11633-022-1373-9

DynamicRetriever: A Pre-trained Model-based IR System Without an Explicit Index

doi: 10.1007/s11633-022-1373-9
More Information
  • Author Bio:

    Yu-Jia Zhou received the B. Eng. degree in computer science and technology from School of Information, Renmin University of China, China in 2019. He is currently a Ph. D. candidate in computer science at School of Information, Renmin University of China. He won the best student paper award at CCIR 2018. He has been invited as a reviewer for the international conferences SIGIR, KDD and WSDM. His research interests include information retrieval, personalized search, deep learning, and data mining. E-mail: zhouyujia@ruc.edu.cn ORCID iD: 0000-0002-3530-3787

    Jing Yao received the B. Eng. degree in computer science and technology from School of Information, Renmin University of China, China in 2019, and the M. Sc. degree in computer application technology from School of Information, Renmin University of China, China in 2022. She has been invited as a reviewer for the international conferences SIGIR and WSDM. She is now working at Microsoft Research Asia as a researcher. Her research interests include information retrieval, personalized search, and explainable search/recommendation. E-mail: jing_yao@ruc.edu.cn

    Zhi-Cheng Dou received the B. Sc. and Ph. D. degrees in computer science and technology from Nankai University, China in 2003 and 2008, respectively. He is an associate professor in School of Information, Renmin University of China. He worked at Microsoft Research as a researcher from July 2008 to September 2014. He is a member of the IEEE. His research interests include information retrieval, data mining, and big data analytics. E-mail: dou@ruc.edu.cn (Corresponding author) ORCID iD: 0000-0002-9781-948X

    Ledell Wu received the B. Sc. degree in mathematics from Peking University, China in 2009, and the M. Sc. degree in computer science from University of Toronto, Canada in 2011. She is currently a research scientist manager at Beijing Academy of Artificial Intelligence (BAAI), China. She worked as a research engineer at Facebook AI Research from 2013 to 2021. She worked on several research projects with broader impact at Facebook, including a general-purpose embedding system, a large-scale graph embedding system, mono/multilingual entity linking systems, and a dense passage retrieval system. She also studies fairness and biases in machine learning and NLP models. Her research interests include approximation algorithms, the hardness of approximation, privacy, and machine learning. E-mail: wuyu@baai.ac.cn

    Ji-Rong Wen received the B. Sc. and M. Sc. degrees in computer science from Renmin University of China, China, in 1994 and 1996, and the Ph. D. degree in computer science from Chinese Academy of Sciences, China in 1999. He is a professor at Renmin University of China. He was a senior researcher and research manager with Microsoft Research from 2000 to 2014. He is a senior member of the IEEE. His research interests include web data management, information retrieval (especially web IR), and data mining. E-mail: jirong.wen@gmail.com

  • Received Date: 2022-06-30
  • Accepted Date: 2022-08-31
  • Publish Online: 2023-01-11
  • Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have demonstrated their power to improve search quality (especially ranking quality). All of these existing search methods follow a common paradigm, i.e., index-retrieve-rerank: they first build an index of all documents based on document terms (i.e., a sparse inverted index) or representation vectors (i.e., a dense vector index), then retrieve and rerank documents based on the similarity between the query and documents computed by ranking models. In this paper, we explore a new paradigm of information retrieval with no explicit index but only a pre-trained model: all knowledge of the documents is encoded into model parameters, which can be regarded as a differentiable indexer and optimized in an end-to-end manner. Specifically, we propose a pre-trained model-based information retrieval (IR) system called DynamicRetriever, which directly returns document identifiers for a given query. Under this framework, we implement two variants to explore how to train the model from scratch and how to combine the advantages of dense retrieval models. Compared with existing search methods, the model-based IR system parameterizes the traditional static index with a pre-trained model, which converts the document semantic mapping into a dynamic and updatable process. Extensive experiments conducted on the public search benchmark Microsoft machine reading comprehension (MS MARCO) verify the effectiveness and potential of our proposed new paradigm for information retrieval.
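  • The contrast drawn above can be made concrete with a toy sketch (illustrative only, not the authors' implementation: the two-document corpus, function names, and the stub standing in for the trained model are invented here). The first paradigm consults an explicit inverted index; in the model-based paradigm, a trained model returns document identifiers directly, with the corpus "index" living entirely in its parameters.

    ```python
    from collections import defaultdict

    docs = {
        "d1": "neural information retrieval with pre-trained models",
        "d2": "inverted index for sparse term matching",
    }

    # Paradigm 1: index-retrieve -- build an explicit sparse inverted
    # index mapping each term to the documents that contain it.
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in text.split():
            index[term].add(docid)

    def index_retrieve(query):
        # Score each document by the number of matched query terms.
        hits = defaultdict(int)
        for term in query.split():
            for docid in index[term]:
                hits[docid] += 1
        return sorted(hits, key=hits.get, reverse=True)

    # Paradigm 2: model-based -- the retriever is just a trained model
    # that maps a query to a ranked list of docids; no separate index
    # exists. A stub stands in for the paper's pre-trained model.
    def model_retrieve(query, model):
        return model(query)

    print(index_retrieve("pre-trained retrieval"))  # -> ['d1']
    ```

    The point of the sketch is structural: in paradigm 1 the index is a static data structure maintained outside the ranking model, while in paradigm 2 it is replaced by model parameters, which is what makes the "indexer" differentiable and updatable by gradient descent.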

     

  • 1 https://huggingface.co/bert-base-uncased/tree/main
    *These authors contributed equally to this work

    Figures (6) / Tables (6)
