Biljana T. Stojanović1, Saša N. Malkov2, Miloš V. Beljanski3, Gordana M. Pavlović Lažetić2, Mirjana M. Maljković Ružičić2, Ivan Lj. Čukić2, Nenad S. Mitić2
1 Mathematical Institute SASA, Belgrade, Serbia
2 University of Belgrade, Faculty of Mathematics, Belgrade, Serbia
3 Institute for General and Physical Chemistry, Belgrade, Serbia
nenad.mitic [at] matf.bg.ac.rs
Abstract
This paper presents an approach to clustering individual SARS-CoV-2 protein types based on codon usage (CU) bias measures. Our previous research has shown that clustering based on CU bias measures closely matches the natural clustering by protein type, regardless of virus affiliation. Relative Synonymous Codon Usage (RSCU), Effective Number of Codons (ENC), Effective Number of Codons for individual amino acids (ENCAA), and Relative Codon Bias Strength (RCBS) were calculated to measure the CU bias in the coding sequences of different proteins.
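As an illustration of the simplest of these measures, the sketch below computes RSCU for a single coding sequence using the commonly cited definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It assumes the standard genetic code and skips stop codons and any codon containing ambiguous bases. The function name `rscu` and the handling of amino acids absent from the sequence (they are simply omitted) are illustrative choices, not details taken from the paper.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence.

    Assumption: amino acids that do not occur in the sequence are left out
    of the result instead of being assigned a default value.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families, one per amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":
            families[aa].append(codon)

    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not present in this sequence
        for c in codons:
            # observed count / (family total / family size)
            result[c] = counts[c] * len(codons) / total
    return result
```

For example, in `rscu("GCTGCTGCCGCA")` the four alanine codons receive values that sum to the family size (4), with GCT overrepresented and GCG unused.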
The dataset contains 928,850 complete SARS-CoV-2 isolates with non-ambiguous nucleotide sequences. It comprises 1,145,168 unique protein-coding nucleotide sequences (out of a total of 15,564,504) and the corresponding amino acid sequences. The protein coding sequences are associated with metadata, including the collection date and the WHO virus strain annotation.
Protein coding sequences were clustered within each protein type, for each of the 12 most abundant types. Several clustering algorithms (BIRCH, Kohonen neural network, fuzzy clustering, and probabilistic clustering) were applied to the RSCU, ENC, and RCBS values with a variable number of clusters. WHO group annotations were used to further describe the clusters. In all results, most clusters are homogeneous (with a maximum cluster size of about 19-35% of the input data) and almost pure with respect to a specific WHO group. Each result contains one or two small heterogeneous clusters with mixed WHO groups. These heterogeneous clusters likely denote proteins (isolates) that were present at the transition between two WHO groups. By combining the results of different clustering algorithms, the WHO group membership of SARS-CoV-2 proteins can be described with very high accuracy using protein clustering based on CU bias measures.
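The cluster description used above (relative cluster size and purity with respect to a WHO group) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' procedure: the abstract does not state how purity was computed, so here it is taken to be the fraction of a cluster's members that carry the cluster's most frequent WHO label. The function name `cluster_summary` and the toy labels in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_summary(cluster_ids, who_labels):
    """Summarise each cluster by its share of the data and its WHO-group purity.

    Assumption: purity = (members with the dominant WHO label) / (cluster size).
    """
    if len(cluster_ids) != len(who_labels):
        raise ValueError("cluster_ids and who_labels must have the same length")

    # Count WHO labels separately inside every cluster.
    by_cluster = defaultdict(Counter)
    for cid, who in zip(cluster_ids, who_labels):
        by_cluster[cid][who] += 1

    n = len(cluster_ids)
    summary = {}
    for cid, counts in by_cluster.items():
        size = sum(counts.values())
        dominant, top = counts.most_common(1)[0]
        summary[cid] = {
            "share": size / n,     # fraction of all sequences in this cluster
            "dominant": dominant,  # most frequent WHO group in the cluster
            "purity": top / size,  # 1.0 means a single WHO group
        }
    return summary
```

With toy input such as cluster ids `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]` and labels of three "Alpha", then one "Alpha" and one "Delta", then five "Delta", clusters 0 and 2 come out pure, while the small cluster 1 is mixed, mirroring the kind of heterogeneous transition cluster described above.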
Keywords: SARS-CoV-2 WHO groups, codon usage, clustering, data mining