Exploiting the linear organisation of omics network embedding spaces

Noël Malod-Dognin1*, Alexandros Xenos1, Sergio Doria Belenguer1, and Nataša Pržulj1,2,3

1Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain

2Department of Computer Science, University College London, London WC1E 6BT, UK

3ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain

noel.malod [at] bsc.es

Abstract

Large-scale biological omics data that describe different aspects of cellular functioning are accumulating rapidly. These datasets are typically modelled and analyzed as networks. To ease downstream analyses, recent approaches embed the nodes of a network into a low-dimensional space by using a skip-gram neural network (e.g. DeepWalk, LINE and node2vec). These methods implicitly factorize a positive pointwise mutual information (PPMI) matrix, which can instead be factorized explicitly with Non-negative Matrix Tri-Factorization (NMTF). Importantly, in Natural Language Processing (NLP), word embeddings obtained with similar approaches exhibit linear algebraic structure, which allows analogy questions to be answered with simple linear vector operations. We therefore investigate whether similar linear embedding spaces can be obtained and exploited for biological omics networks.

We introduce the use of PPMI matrices to capture either the neighborhood relationships or the structural (topological) similarities of nodes in a network. By factorizing the PPMI matrix representations of the human Protein-Protein Interaction (PPI) network with NMTF, we demonstrate that the embedding vectors of genes with different Gene Ontology (GO) annotations are linearly separated in the PPI embedding space.
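
As an illustration of what a neighborhood-based PPMI matrix can look like, the sketch below builds one from random-walk co-occurrences on a small toy graph. It is a minimal sketch under stated assumptions, not the construction used in this work: the function name ppmi_matrix, the window parameter and the uniform averaging over walk lengths are hypothetical choices, and the graph is assumed to be undirected with no isolated nodes.

```python
import numpy as np

def ppmi_matrix(adj, window=3):
    """PPMI matrix from random-walk co-occurrences within `window` steps.

    Assumption: `adj` is a symmetric adjacency matrix with no isolated nodes.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    trans = adj / deg[:, None]              # row-stochastic transition matrix

    # Average the 1..window step transition probabilities.
    power = np.eye(len(adj))
    walk = np.zeros_like(adj)
    for _ in range(window):
        power = power @ trans
        walk += power
    walk /= window

    # Joint probability of (start node, visited node); walks start at
    # node i with probability proportional to its degree.
    joint = (deg / deg.sum())[:, None] * walk
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)

    # Pointwise mutual information, with pairs that never co-occur set to 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (p_row * p_col))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)             # keep only the positive part

# Toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
toy_adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]])
toy_ppmi = ppmi_matrix(toy_adj, window=2)
```

The resulting non-negative matrix is the kind of input that a non-negative factorization such as NMTF can then decompose into node embeddings.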

Then, in analogy to the embedding vector of a sentence being obtained as the sum (average) of the embedding vectors of its constituent words in NLP, we show that the embedding vectors of biological functions and of protein complexes can be obtained by averaging the embedding vectors of the genes that participate in them, and that these embeddings can be used to predict protein complex memberships and cancer genes.
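
The averaging idea can be sketched in a few lines. The example below is illustrative only: the gene names, the two-dimensional toy vectors and the use of cosine similarity to rank candidate members are assumptions made for the sketch, and the helper names (group_embedding, cosine, rank_candidates) are hypothetical.

```python
import numpy as np

def group_embedding(gene_vectors, members):
    """Embed a function or complex as the mean of its member gene vectors."""
    return np.mean([gene_vectors[g] for g in members], axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(gene_vectors, members, candidates):
    """Rank candidate genes by similarity to the group embedding (best first)."""
    centre = group_embedding(gene_vectors, members)
    scores = {g: cosine(gene_vectors[g], centre) for g in candidates}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embedding vectors (hypothetical genes, 2 dimensions for readability).
gene_vectors = {
    "g1": np.array([1.0, 0.1]),
    "g2": np.array([0.9, 0.2]),
    "g3": np.array([0.8, 0.0]),
    "g4": np.array([0.1, 1.0]),
}

# Known members g1 and g2 define the complex; g3 and g4 are candidates.
ranking = rank_candidates(gene_vectors, ["g1", "g2"], ["g3", "g4"])
```

In this toy setting, g3 points in nearly the same direction as the averaged vector of g1 and g2, so it ranks ahead of g4 as a candidate member.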

Finally, we investigate the embeddings of cancer and control tissue-specific PPI networks and show that simple subtractions allow for identifying cancer-altered biological functions and cancer genes.
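
A subtraction-based comparison of this kind might look like the sketch below. It assumes that the cancer and control embeddings are expressed in a common space, so that vector differences are meaningful; the abstract does not specify how this is achieved. The function name most_shifted, the toy vectors and the ranking by Euclidean norm of the difference are hypothetical choices for illustration.

```python
import numpy as np

def most_shifted(cancer_vecs, control_vecs, top=2):
    """Rank entities by the norm of (cancer embedding - control embedding).

    Assumption: both embeddings live in the same space, so that
    subtracting them is meaningful.
    """
    shared = sorted(set(cancer_vecs) & set(control_vecs))
    shift = {
        k: float(np.linalg.norm(cancer_vecs[k] - control_vecs[k]))
        for k in shared
    }
    return sorted(shift, key=shift.get, reverse=True)[:top]

# Toy embeddings of three hypothetical biological functions.
control = {
    "f1": np.array([1.0, 0.0]),
    "f2": np.array([0.0, 1.0]),
    "f3": np.array([1.0, 1.0]),
}
cancer = {
    "f1": np.array([1.0, 0.1]),
    "f2": np.array([1.0, 0.0]),
    "f3": np.array([1.0, 1.5]),
}

altered = most_shifted(cancer, control, top=2)
```

Here f2 moves the most between the two conditions and f1 the least, so f2 and f3 are returned as the most altered entries.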

Keywords: bioinformatics, molecular omics networks, network data mining, network embedding

Acknowledgement: This project has received funding from the European Research Council (ERC) Consolidator Grant 770827 and the Spanish State Research Agency AEI 10.13039/501100011033 grant number PID2019-105500GB-I00.
