Semantic unification and search of bioinformatics databases

A. Veljković1* and N. Mitić1

1Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia

aleksandar.veljkovic [at] matf.bg.ac.rs

Abstract

Analyzing biological data from multiple sources offers a comprehensive view of a domain and helps identify patterns that are difficult or impossible to observe when focusing solely on individual biological entities. Linking data across databases is challenging because the same entity is often assigned inconsistent properties and identifiers in different databases. Although some databases include identifiers from multiple sources, their search capabilities are restricted to exact property matching, which prevents complex queries involving multiple metadata attributes.

We designed a novel data framework that addresses these challenges by facilitating the linkage and retrieval of information from diverse interconnected biological data sources. To evaluate the effectiveness of the model, we conducted tests and built a knowledge graph from metadata extracted from five public datasets: DisProt, HGNC, Tantigen 2.0, IEDB, and DisGeNET. The resulting graph contains more than 17 million nodes, including 2.5 million distinct biological entity objects, and over 4 million relationships.
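
The linkage problem described above can be illustrated with a minimal sketch, assuming each source record carries a set of identifiers and that two records sharing any identifier refer to the same entity. The record layout, the function name `unify_records`, and the toy records are hypothetical; which identifiers each source actually carries is an assumption, and this is not the BioGraph implementation.

```python
from collections import defaultdict


def unify_records(records):
    """Group record indices that share at least one identifier, transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root for deterministic output.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # identifier -> index of the first record that used it
    for idx, rec in enumerate(records):
        for ident in rec["ids"]:
            if ident in first_seen:
                union(idx, first_seen[ident])
            else:
                first_seen[ident] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Toy records, invented for illustration only.
records = [
    {"source": "HGNC", "ids": {"HGNC:11998", "TP53"}},
    {"source": "DisProt", "ids": {"P04637", "TP53"}},
    {"source": "DisGeNET", "ids": {"P04637"}},
    {"source": "HGNC", "ids": {"HGNC:1100", "BRCA1"}},
]
entity_groups = unify_records(records)
```

Here the first three records end up in one group even though the first and third share no identifier directly; the second record bridges them. Exact-match lookup on a single property would miss that connection.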

Additionally, we designed and implemented a general-purpose procedure for extracting new relationships based on semantic similarity from data transformed into the BioGraph data model.
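
The abstract does not specify the similarity measure, so the following is only a sketch of the general idea: propose a relationship between two entities when their textual metadata is similar enough. It uses bag-of-words cosine similarity as a simple stand-in for semantic similarity. The function names, the threshold value, and the entity descriptions are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propose_relationships(descriptions, threshold=0.5):
    """Return (entity, entity, score) triples whose similarity meets the threshold."""
    vectors = {key: Counter(text.lower().split()) for key, text in descriptions.items()}
    edges = []
    for u, v in combinations(sorted(vectors), 2):
        score = cosine(vectors[u], vectors[v])
        if score >= threshold:
            edges.append((u, v, round(score, 3)))
    return edges


# Invented descriptions, for illustration only.
descriptions = {
    "entity_a": "tumor suppressor protein binding dna",
    "entity_b": "tumor suppressor protein regulating cell cycle",
    "entity_c": "membrane transport channel",
}
edges = propose_relationships(descriptions)
```

With these toy descriptions only the pair `entity_a` and `entity_b` exceeds the threshold. A production system would more plausibly compare learned embeddings than raw token counts, but the thresholded pairwise comparison has the same shape.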

Keywords: Bioinformatics database, semantic search, unification, BioGraph
