S. Malkov1*, M. Beljanski1, G. Pavlović Lažetić1, B. Stojanović2, M. Maljković1, A. Veljković1, S. Kapunac1, and N. Mitić1
1Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia
2Mathematical Institute SASA, Knez Mihaila 36, 11000 Belgrade, Serbia
sasa.malkov [at] matf.bg.ac.rs
Abstract
The large number of sequenced SARS-CoV-2 isolates provides an opportunity to observe genomic variability in a massive sample. The goal of our research was to use data mining techniques to study a possible correlation between codon usage and classification by WHO-labels within a given period of time.
The material includes 745,533 isolates with 12,236,672 coding sequences (proteins) from NCBI (10 August 2022). Relative synonymous codon usage (RSCU) was used as the measure of codon usage. Samples were associated with WHO-labels (based on Pango_Id) and with time intervals. For some isolates, the WHO-label was inconsistent with the period in which the respective strain was actually present; these isolates were excluded from the sample. Isolates without assigned WHO-labels were also excluded. In addition, individual coding sequences containing ambiguous nucleotide codes were eliminated.
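To illustrate the measure: the RSCU of a codon is its observed count divided by the count expected if all synonymous codons of the same amino acid were used equally. The following minimal Python sketch computes it for a single coding sequence. It is not the pipeline used in this study. The function name `rscu`, the use of the standard genetic code, the treatment of stop codons as one synonymous family, and the value 0.0 for codons of amino acids absent from the sequence are all assumptions made for illustration. Rejecting sequences with ambiguous nucleotide codes mirrors the filtering step described above.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group codons into synonymous families, one family per amino acid (or stop).
FAMILIES = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    FAMILIES[amino_acid].append(codon)


def rscu(cds):
    """Return a dict mapping each of the 64 codons to its RSCU value.

    Hypothetical helper: raises ValueError if any codon contains a
    character other than A, C, G, T (U is accepted and mapped to T).
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if any(c not in CODON_TABLE for c in codons):
        raise ValueError("sequence contains an ambiguous nucleotide code")

    counts = Counter(codons)
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for c in family:
            # observed count / (family total / family size)
            # Assumption: 0.0 when the amino acid does not occur at all.
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values


if __name__ == "__main__":
    example = rscu("ATGTTTTTTTTC")  # Met, Phe, Phe, Phe
    print(example["TTT"], example["TTC"], example["ATG"])
```

Each coding sequence then yields a 64-component vector (or fewer components if stop codons and single-codon amino acids are dropped), which can serve as input to clustering and classification.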
Clustering was performed for each of the 12 common types of coding sequences (proteins), using multiple methods and varying numbers of clusters. Neural clustering gave the best results. Different protein types show different degrees of RSCU variability. For proteins with little variation in nucleotide content, over 95% of the material belongs to a single cluster, while the other clusters are of negligible size. For proteins with more variation, a larger number of pure clusters (by WHO-labels) is obtained, together with a small number of heterogeneous clusters (about 10% of the material). These heterogeneous clusters contain isolates with different WHO-labels that were present in parallel at some point, as a kind of transitional form between two strains.
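One simple way to quantify how pure a cluster is with respect to WHO-labels is the share of its members that carry the cluster's most frequent label. The sketch below shows this under stated assumptions: the function name `cluster_purity` and the majority-label definition of purity are illustrative choices, since the criterion actually used in this study is not specified here.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, who_labels):
    """Return {cluster_id: (majority_label, cluster_size, purity)}.

    Purity is the fraction of cluster members carrying the majority label
    (an assumed definition, used here only for illustration).
    """
    label_counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, who_labels):
        label_counts[cluster_id][label] += 1

    report = {}
    for cluster_id, counts in label_counts.items():
        majority_label, majority_count = counts.most_common(1)[0]
        size = sum(counts.values())
        report[cluster_id] = (majority_label, size, majority_count / size)
    return report


if __name__ == "__main__":
    # Toy data: cluster 0 is pure, cluster 1 mixes two labels.
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["Delta", "Delta", "Delta", "Delta", "Omicron", "Omicron"]
    for cid, (label, size, purity) in cluster_purity(clusters, labels).items():
        print(cid, label, size, round(purity, 3))
```

A cluster with purity 1.0 is pure in the sense used above, while lower values flag heterogeneous clusters whose members can then be inspected by collection date.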
Several classification models were built on the same sample. Models based on protein types with higher diversity among coding sequences are highly accurate (96–100%). Using the classification models, WHO-labels were then assigned to isolates that previously had none.
Keywords: SARS-CoV-2, RSCU, clustering, classification