Intrinsic disorder of proteins associated with diseases

Lazar Vasović* and Jovana Kovačević

Faculty of Mathematics, University of Belgrade, Belgrade, Serbia

pd212006 [at] alas.matf.bg.ac.rs

Abstract

Numerous publicly accessible databases hold information on the relationship between genes and diseases, each in its own format. This work simplifies their use by integrating them into one standardised database, the Integrated Gene Disease Database (IGDD). IGDD currently contains more than 400,000 rows of gene-disease associations drawn from the following sources: DisGeNET, COSMIC, HumsaVar, Orphanet, ClinVar, HPO, DISEASES. Its features include gene symbol and IDs, UniProt ID, disease name, and Disease Ontology ID. Disease Ontology was chosen because it offers a wide range of possibilities for disease exploration.
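The integration step can be illustrated with a minimal sketch. It assumes each source is first mapped onto a shared schema and that associations are then deduplicated by protein and disease identifier. The source names, column names, and the deduplication key are hypothetical placeholders; they are not taken from IGDD or from the real exports of the databases listed above.

```python
# Hypothetical per-source column maps: raw column name -> standard column name.
# Real database exports use different layouts; these are placeholders.
COLUMN_MAPS = {
    "source_a": {"symbol": "gene_symbol", "uniprot": "uniprot_id",
                 "disease": "disease_name", "doid": "doid"},
    "source_b": {"Gene": "gene_symbol", "UniProtKB": "uniprot_id",
                 "Disease": "disease_name", "DO": "doid"},
}


def standardise(row, column_map):
    """Rename the raw columns of one row to the shared schema."""
    return {std: row[raw].strip() for raw, std in column_map.items()}


def integrate(tables):
    """Merge rows from all sources, keeping one record per (protein, disease).

    `tables` maps a source name to a list of raw rows (dicts).
    Each merged record remembers which sources reported it.
    """
    merged = {}
    for source, rows in tables.items():
        for row in rows:
            rec = standardise(row, COLUMN_MAPS[source])
            key = (rec["uniprot_id"], rec["doid"])  # assumed deduplication key
            entry = merged.setdefault(key, {**rec, "sources": set()})
            entry["sources"].add(source)
    return list(merged.values())


# Toy input with made-up identifiers, for illustration only.
toy_tables = {
    "source_a": [{"symbol": "GENE1", "uniprot": "P00001",
                  "disease": "disease one", "doid": "DOID:0000001"}],
    "source_b": [{"Gene": "GENE1", "UniProtKB": "P00001",
                  "Disease": "disease one", "DO": "DOID:0000001"},
                 {"Gene": "GENE2", "UniProtKB": "P00002",
                  "Disease": "disease two", "DO": "DOID:0000002"}],
}
integrated = integrate(toy_tables)
```

Recording the contributing sources for each association is one possible design choice; it would allow later filtering by how many databases support an association.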

IGDD was further enriched with information on the disorder of the proteins encoded by disease-associated genes, since many proteins lack a fixed, well-defined three-dimensional structure. This lack of structure may be linked to disease-causing mechanisms, which makes it an important feature of a protein. Several disorder measures were used, based on both sequence profiling and advanced statistical methods: amino acid profiles, charge-hydropathy (CH) prediction, PONDR family (VL-XT, VSL2), IUPred family (long, short, ANCHOR), FuzDrop.
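Of the measures listed, charge-hydropathy prediction is simple enough to sketch in a few lines. The sketch below assumes the commonly cited boundary line from the charge-hydropathy plot, <R> = 2.785 <H> - 1.151, with Kyte-Doolittle hydropathy rescaled to the range 0 to 1. Averaging over the whole sequence rather than over sliding windows is a simplification, and the function names are illustrative; the exact implementation used for IGDD is not described here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy <H>, with each value rescaled from [-4.5, 4.5] to [0, 1]."""
    # Non-standard residue codes raise KeyError in this sketch.
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute mean net charge <R>: K and R count +1, D and E count -1."""
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return abs(positive - negative) / len(seq)


def ch_distance(seq):
    """Signed vertical distance from the boundary <R> = 2.785 <H> - 1.151."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    boundary = 2.785 * mean_hydropathy(seq) - 1.151
    return mean_net_charge(seq) - boundary


def is_predicted_disordered(seq):
    """Sequences above the boundary line are predicted to be disordered."""
    return ch_distance(seq) > 0
```

For example, a highly charged, hydrophilic sequence such as a run of glutamates lies above the boundary, while a run of isoleucines lies below it.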

This work focuses on the following question: is there a relationship between particular diseases, or groups of diseases, and the level of disorder of the proteins related to them? No correlation was found between any of the considered disorder measures and the number of diseases a protein is related to. Nor was there a correlation between the depth of diseases in the ontology and the disorder of the related proteins. Additionally, no obvious regularity was observed in the disorder of proteins grouped by the diseases they are related to. Ordered and disordered proteins were found equally in all parts of the ontology.
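A correlation check of this kind can be sketched as follows. The abstract does not state which correlation coefficient was used, so Spearman rank correlation is an assumption here, chosen because it does not require a linear relationship. The function names are illustrative, and the sketch uses only the Python standard library.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting, `x` could hold one disorder score per protein and `y` the number of diseases associated with that protein; a coefficient near zero would be consistent with the absence of a monotonic relationship.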

Regardless of these results, IGDD can be considered a valuable resource for future data analysis and further investigation of gene-disease associations. Its detailed features and large number of associations open the path for many types of studies.

Keywords: gene-disease associations, protein disorder, correlation analysis

Acknowledgement: This research was financially supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia through the scholarship project for young and unemployed doctoral students (Lazar Vasović, contract number 451-03-1271/2022-14/2990) and through Project No. 174021. It is based upon work from COST Action CA21160, named Non-globular proteins in the era of Machine Learning (ML4NGP) and supported by COST (European Cooperation in Science and Technology).