Estimating the dimensionality of omics network embedding space

Milena Stojić1, Noël Malod-Dognin1 and Nataša Pržulj1,2,3,*

1 Barcelona Supercomputing Center (BSC), Barcelona, Spain

2 Department of Computer Science, University College London, London, UK

3 ICREA, Pg. Lluís Companys 23, Barcelona, Spain

* natasha [at] bsc.es

Abstract

Thanks to advances in data-capturing technologies, huge amounts of large-scale biological omics data have been accumulated. These data are naturally modeled as networks in which nodes represent entities (e.g., patients, genes, metabolites) and edges represent interactions between them. Because mining networks directly is computationally complex, current approaches first embed these networks in a low-dimensional vector space and then mine the resulting node embedding vectors for new biomedical knowledge. However, despite successful applications of network-embedding methodologies to mining biological data, there is still no gold-standard approach for determining their key parameter: the number of dimensions of the embedding space. Thus, to set this parameter, most studies rely on computationally inefficient grid searches. Recently, Two Nearest Neighbors (2NN), a methodology that estimates the intrinsic dimensionality of data points in a high-dimensional space, has been successfully applied to estimate the number of dimensions needed to embed synthetic and toy example networks.

In this work, we investigate the applicability of 2NN for determining the dimensionality of biological omics network embedding spaces. On the protein-protein interaction networks and the gene co-expression networks of budding yeast and of Homo sapiens, we relate the obtained dimensionality estimates to various network topological properties and to biomedical downstream analysis tasks.
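To make the 2NN idea concrete, the following is a minimal sketch, not the implementation used in this work. 2NN relies on the ratio between each point's distances to its second and first nearest neighbors; under the assumption of locally uniform density, this ratio follows a Pareto distribution whose exponent is the intrinsic dimension. The sketch uses the closed-form maximum-likelihood estimate of that exponent, which is a simplification: to our understanding, the original 2NN procedure instead fits a line to the empirical cumulative distribution after discarding the largest ratios. The function name `two_nn_dimension` and the synthetic input are hypothetical; in practice the input would be the node embedding vectors of a network, obtained with an embedding method that the abstract does not specify.

```python
import numpy as np


def two_nn_dimension(points):
    """Estimate intrinsic dimension from nearest-neighbor distance ratios.

    Assumption: maximum-likelihood variant of the 2NN estimator,
    d = N / sum(log(r2 / r1)), rather than the linear fit of the
    original method. `points` is an (N, D) array of vectors.
    """
    x = np.asarray(points, dtype=float)

    # Squared Euclidean distances between all pairs, via the Gram matrix.
    sq_norms = np.sum(x ** 2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(dist2, 0.0, out=dist2)   # clip tiny negative rounding errors
    np.fill_diagonal(dist2, np.inf)     # a point is not its own neighbor

    # Two smallest distances per row: first and second nearest neighbor.
    nearest = np.sqrt(np.partition(dist2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    # Drop duplicated points (r1 == 0), for which the ratio is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]

    # Maximum-likelihood estimate of the Pareto exponent.
    return keep.sum() / np.sum(np.log(mu))


if __name__ == "__main__":
    # Hypothetical stand-in for embedding vectors: a 2-dimensional
    # uniform sheet placed in a 5-dimensional ambient space.
    rng = np.random.default_rng(0)
    sheet = np.hstack([rng.random((1000, 2)), np.zeros((1000, 3))])
    print("estimated dimension:", two_nn_dimension(sheet))
```

Because the estimate depends only on the two nearest neighbors of each point, it probes the data at a very local scale; the pairwise-distance matrix in this sketch needs memory quadratic in the number of points, so larger networks would call for a nearest-neighbor index instead.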

Keywords: bioinformatics, network data mining, network embedding, network biology, AI

Acknowledgement: This project has received funding from the European Union’s Framework Programme for Research and Innovation Horizon 2020, Grant Agreement No 860895, the European Research Council (ERC) Consolidator Grant 770827, the Spanish State Research Agency and the Ministry of Science and Innovation MCIN grants PID2022-141920NB-I00/AEI/10.13039/501100011033/FEDER, UE, and PID2022-141920NB-I00/AEI/10.13039/501100011033/ERDF, UE Project: PN046500 and the Department of Research and Universities of the Generalitat de Catalunya code 2021 SGR 01536.