Miloš Košprdić1*, Nikola Prodanović1, Adela Ljajić1, Bojana Bašaragin1, and Nikola Milošević1,2
1Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, Serbia
2Bayer A.G., Research and Development, Mullerstrasse 173, Berlin, Germany
milos.kosprdic [at] ivi.ac.rs
Abstract
Named entity recognition (NER) is an NLP task that involves identifying and classifying named entities in text. Token classification is a crucial subtask of NER that involves assigning labels to individual tokens within a text, indicating the named entity category to which they belong. Fine-tuning large language models (LLMs) on labeled domain datasets has emerged as a powerful technique for improving NER performance. By training a pre-trained LLM such as BERT on domain-specific labeled data, the model learns to recognize named entities specific to that domain with high accuracy. This approach has been applied to a wide range of domains, including the biomedical domain, and has demonstrated significant improvements in NER accuracy.
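To make the token classification formulation concrete, the following minimal sketch shows token-level NER labels in the common BIO scheme; the sentence and entity classes are illustrative and not taken from the paper's datasets.

```python
# Illustrative only: token classification assigns one label per token,
# marking which named entity class (if any) the token belongs to.
tokens = ["Aspirin", "reduces", "the", "risk", "of", "myocardial", "infarction", "."]
labels = ["B-DRUG",  "O",       "O",   "O",    "O",  "B-DISEASE",  "I-DISEASE",   "O"]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```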
Still, the amount of labeled data required to fine-tune pre-trained LLMs is large, and labeling is a time-consuming and expensive process that requires expert domain knowledge. In addition, domains with an open set of classes pose difficulties for traditional machine learning approaches, since the number of classes to predict must be defined in advance.
Our solution to these two problems is based on transforming the data so that the initial multi-class classification problem is factorized into a binary one, and on applying a cross-encoder-based BERT architecture for zero- and few-shot learning, as sketched below.
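The following is a minimal sketch of this binary reformulation as we read it from the abstract: a candidate class name is paired with the input sentence as a cross-encoder input, and a token classification head decides, for each token, whether it belongs to that class. The base model name, the candidate class names, and the label semantics are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the cross-encoder binary formulation (assumptions noted above).
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Binary head: label 1 = "token belongs to the candidate class", label 0 = it does not.
# Before fine-tuning on the unified dataset, this head is randomly initialized.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

sentence = "Aspirin reduces the risk of myocardial infarction."
for candidate_class in ["drug", "disease"]:
    # Class name as the first segment, sentence as the second (cross-encoder style input).
    inputs = tokenizer(candidate_class, sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, 2)
    is_entity = logits.argmax(dim=-1)[0]         # per-token binary decision for this class
```

Because every class is handled through the same binary head, new (unseen) classes can be queried at inference time simply by supplying a different class name, which is what enables the zero- and few-shot setting.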
To create our dataset, we transformed six widely used biomedical datasets containing various biomedical entities, such as genes, drugs, diseases, adverse events, and chemicals, into a uniform format. This transformation enabled us to merge the datasets into a single cohesive dataset covering 26 named entity classes.
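A hedged sketch of one way such a unification step could look: each BIO-tagged sentence from a source corpus is mapped to a common record layout (class name, sentence tokens, binary token mask), after which records from all six corpora can be merged. The record fields and helper function below are illustrative assumptions, not the authors' code.

```python
# Illustrative dataset unification: one binary record per (sentence, class) pair.
def to_uniform_record(tokens, bio_tags, class_name):
    """Convert one BIO-tagged sentence into a binary record for a single class."""
    mask = [1 if tag != "O" and tag.endswith(class_name) else 0 for tag in bio_tags]
    return {"class": class_name, "tokens": tokens, "token_mask": mask}

record = to_uniform_record(
    tokens=["Aspirin", "reduces", "myocardial", "infarction", "."],
    bio_tags=["B-DRUG", "O", "B-DISEASE", "I-DISEASE", "O"],
    class_name="DISEASE",
)
# -> {'class': 'DISEASE', 'tokens': [...], 'token_mask': [0, 0, 1, 1, 0]}
```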
We then fine-tuned two pre-trained language models, BioBERT and PubMedBERT, for the NER task in zero- and few-shot settings. The zero-shot results on 9 evaluation classes are promising for semantically similar classes and improve significantly for almost all classes after providing only a few supporting examples. The best results were obtained with a fine-tuned PubMedBERT model, with average F1 scores of 35.44%, 50.10%, 69.94%, and 79.51% for zero-shot, one-shot, 10-shot, and 100-shot NER, respectively.
Keywords: zero-shot learning, machine learning, deep learning, natural language processing, biomedical named entity recognition