Enhancing Biomedical Information Retrieval with Semantic Search: A Comparative Analysis Using PubMed Data

Adela Ljajić1, Lorenzo Cassano2, Miloš Košprdić1*, Bojana Bašaragin1, Darija Medvecki1 and Nikola Milošević2

1 Institute for Artificial Intelligence Research and Development of Serbia, Novi Sad, Serbia

2 Bayer A.G., Research and Development, Berlin, Germany

milos.kosprdic [at] ivi.ac.rs

Abstract

PubMed excels at retrieving scientific articles from the biomedical literature through keyword matching. However, its ability to understand and answer natural language queries is limited, because it relies on basic text matching and lacks contextual understanding. This limitation becomes a problem when users pose questions in natural language that do not align with the structured vocabulary of the database. To address this, we present an information retrieval system built on indexed data from PubMed articles (title and abstract), which combines lexical and semantic search to retrieve the most relevant results for user queries.

For the vector representation of concatenated titles and abstracts, we employed a sentence transformer model optimized for asymmetric semantic search, since our use case involves short queries searching through longer texts. Lexical indexing used the OpenSearch database, while semantic indexing was handled by Qdrant. We tuned the hybrid search to find the best balance between the lexical and semantic components. Evaluation was conducted on the BioASQ dataset, which comprises 5049 questions, each paired with PubMed articles and annotated by domain experts. We also used this dataset to assess the performance of the PubMed search engine in biomedical question answering, enabling a comparative analysis.
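The retrieval quality figures in this abstract are reported as MAP@10 (mean average precision over the top 10 results). As an illustration only, the following Python sketch shows one common way to compute it from ranked result lists and expert-annotated relevant articles. The function names and the normalization by min(k, number of relevant documents) are assumptions; the abstract does not state which variant of the metric was used.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision over the top-k ranked document ids for one question.

    Assumption: normalized by min(k, number of relevant documents); other
    normalizations exist and the source does not say which one it uses.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10
) -> float:
    """Mean of per-question average precision over all annotated questions."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: two questions with made-up document ids.
example_runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
example_qrels = {"q1": {"a", "c"}, "q2": {"y"}}
example_map = mean_average_precision_at_k(example_runs, example_qrels, k=10)
```

In the toy data, the first question has relevant articles at ranks 1 and 3 and the second at rank 2, so the metric rewards systems that place annotated articles near the top of the list.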

Using the lexical index alone for document retrieval yielded a MAP@10 of 0.411. Through experimentation, we determined that the optimal hybrid query combination uses weights of 0.7 and 0.3 for the lexical and semantic components, respectively. Integrating the best lexical results with the semantic index raised MAP@10 to 0.425. Evaluating the PubMed search engine on the same BioASQ dataset yielded a MAP@10 of 0.153 when MeSH terms were omitted from the search and 0.191 when they were included. By fusing lexical and semantic search, our system achieves higher precision on natural language queries than PubMed's keyword-based approach.
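To make the 0.7/0.3 weighting concrete, the following Python sketch combines lexical and semantic scores for a single query. It is a minimal illustration under stated assumptions: the abstract does not describe how scores are normalized or fused, so the per-query min-max normalization, the weighted sum, and all function names and scores here are hypothetical and do not use the OpenSearch or Qdrant APIs.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1] so the two score sources are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    lexical: Dict[str, float],
    semantic: Dict[str, float],
    w_lex: float = 0.7,   # lexical weight reported in the abstract
    w_sem: float = 0.3,   # semantic weight reported in the abstract
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores (assumed fusion rule).

    A document missing from one result list contributes 0 for that component.
    """
    lex = min_max_normalize(lexical)
    sem = min_max_normalize(semantic)
    combined = {
        doc_id: w_lex * lex.get(doc_id, 0.0) + w_sem * sem.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical scores for one query (lexical relevance scores, vector similarities).
lexical_scores = {"p1": 12.0, "p2": 8.0, "p3": 4.0}
semantic_scores = {"p2": 0.9, "p3": 0.5, "p4": 0.1}
ranking = hybrid_rank(lexical_scores, semantic_scores)
```

With these made-up scores, the article that is strongest lexically still ranks first, while an article found by both searches follows closely behind it, which is the kind of trade-off the weights control.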

Keywords: PubMed, information retrieval, vector database, LLMs, hybrid search

Acknowledgement: The project Verif.ai is a collaborative effort of Bayer A.G. and the Institute for Artificial Intelligence Research and Development of Serbia, funded within the framework of the NGI Search project under Horizon Europe grant agreement No 101069364.