Alexandros Xenos1, Noël Malod-Dognin1 and Nataša Pržulj1,*
1 Barcelona Supercomputing Center, Barcelona, Spain
natasha [at] bsc.es
Abstract
Low-dimensional embeddings are a cornerstone in the modelling and analysis of complex biological networks. Embedding biological networks is challenging, as it involves capturing both the structural (topological) and the semantic information (i.e., node labels) of a graph. Typically, nodes with the same label lie in the same dense subgraph (neighborhood-based similarity), but it has been shown that similarly annotated nodes can lie in different network neighborhoods while having similar wiring patterns (topological similarity). However, current network embedding algorithms do not preserve both types of similarity, which limits the information preserved in the embedding space. Moreover, most existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream analysis tasks. In contrast, word embedding spaces capture semantic relationships linearly, allowing for information retrieval using simple linear operations on word embedding vectors.
In our work, following the NLP paradigm, we introduce novel random-walk-based embeddings that allow mining biological knowledge directly from the embedding space. Namely, we introduce embeddings that place genes with similar biological functions (nodes that are either topologically similar or neighborhood-based similar) close to each other in the space. We exploit this property to predict genes participating in protein complexes and to identify cancer-related genes based on the cosine similarities between the vector representations of the genes. We also go beyond embeddings that preserve one type of similarity by introducing novel graphlet-based representations of the networks that simultaneously capture topological and neighborhood membership information. We use all the different network representations to assess whether it is an intrinsic property of the structure of the data (the input matrix representation) that yields embedding spaces that enable downstream analysis tasks via simple linear operations. Using nine multi-label biological networks and seven single-label networks that are commonly used in machine learning studies, we demonstrate that the more homophilic the network matrix representation, the more linearly organized the corresponding network embedding space, and thus, the better the downstream analysis results. Our results suggest that our new graphlet-based methodologies embed networks into linear spaces, allowing for better mining of the networks and alleviating the need for computationally intensive ML models.
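As an illustration of the cosine-similarity-based mining described above, the following is a minimal sketch rather than the procedure used in this work. It assumes a precomputed embedding matrix with one row per gene, and it scores each candidate gene by its mean cosine similarity to a set of known member genes (for example, known members of a protein complex). The function names, the toy gene labels, the toy vectors, and the choice of averaging over the seed set are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T


def rank_candidates(embeddings, genes, seed_genes, top_k=3):
    """Rank non-seed genes by mean cosine similarity to the seed genes.

    Averaging over the seed set is an assumption made for this sketch.
    """
    index = {g: i for i, g in enumerate(genes)}
    seed_idx = [index[g] for g in seed_genes]
    sims = cosine_similarity_matrix(embeddings)
    scores = sims[:, seed_idx].mean(axis=1)
    seeds = set(seed_genes)
    ranked = sorted(
        ((g, float(scores[index[g]])) for g in genes if g not in seeds),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Toy data (hypothetical): four genes embedded in two dimensions.
genes = ["g1", "g2", "g3", "g4"]
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.2],
])

ranking = rank_candidates(embeddings, genes, seed_genes=["g1"], top_k=2)
for gene, score in ranking:
    print(gene, round(score, 3))
```

In this toy example the genes whose vectors point in nearly the same direction as the seed gene rank first, which is the sense in which a linearly organized embedding space can be mined without training a separate model.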
Keywords: bioinformatics, network biology, network embeddings, machine learning
Acknowledgement: This project has received funding from the European Union’s EU Framework Programme for Research and Innovation Horizon 2020, Grant Agreement No 860895, the European Research Council (ERC) Consolidator Grant 770827, the Spanish State Research Agency and the Ministry of Science and Innovation MCIN grant PID2022-141920NB-I00 / AEI / 10.13039/501100011033 / FEDER, UE, and the Department of Research and Universities of the Generalitat de Catalunya code 2021 SGR 01536.