Marko Tumbas1* and Marko Đorđević1
1Quantitative Biology Group, Faculty of Biology, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia
marko.tumbas [at] bio.bg.ac.rs
Abstract
CRISPR-cas systems are highly diverse and are currently classified into six major types and over 30 subtypes. Apart from their role in adaptive immunity, some CRISPR-cas subtypes have been shown to be involved in host gene regulation and even in collateral damage that leads to bacteriostatic or lethal outcomes for the host. Together with subtype-specific Cas proteins, CRISPR array spacers direct and influence both the canonical and non-canonical functions of the CRISPR-cas system. A better understanding of spacer adaptation mechanisms is crucial for uncovering the intricacies of the evolutionary arms race between prokaryotes and phages.
Here we present a large-scale analysis of CRISPR array spacers originating from 31845 complete bacterial genomes. All bacterial genomes and 16388 viral genomes were retrieved using the NCBI Datasets API. The CRISPRidentify and CRISPRcasIdentifier tools were used for CRISPR array detection, Cas gene detection, and subtyping. Viral genomes were mapped to their hosts using the latest version of the Virus-Host DB; mapping was performed at the genus level of the host phylogenetic tree. A Gumbel extreme value distribution was used to determine the statistical significance of the Smith-Waterman alignment score of each spacer.
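As an illustration of the significance step only, the following minimal Python sketch scores a spacer against a target with a plain Smith-Waterman recursion, builds a null score distribution from shuffled copies of the spacer, fits a Gumbel distribution to it by the method of moments, and converts the observed score into a p-value. The scoring parameters (match 2, mismatch -1, linear gap -2), the shuffling null model, the moment-based fit, and all function names are assumptions made for this example; the abstract does not specify how the distribution was parameterized.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, linear gap penalty).

    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def null_scores(spacer, target, n=200, seed=0):
    """Scores of shuffled spacers against the target (assumed null model)."""
    rng = random.Random(seed)
    letters = list(spacer)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(smith_waterman_score("".join(letters), target))
    return scores


def fit_gumbel_moments(scores):
    """Method-of-moments estimates (mu, beta) of a Gumbel distribution.

    Requires at least two scores that are not all identical.
    """
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    return -math.expm1(-math.exp(-z))
```

In use, one would call `fit_gumbel_moments(null_scores(spacer, target))` once per spacer-target pair and pass the observed `smith_waterman_score(spacer, target)` to `gumbel_p_value`; for genome-scale targets an optimized aligner would replace the quadratic pure-Python recursion.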
Differences in melting energy and GC content between the identified spacers, their source bacterial genomes, and the infecting bacteriophages were explored for different CRISPR-cas subtypes and for different bacterial genera. Spacers from the extremes of the GC content distribution were aligned to the source bacterial and infecting phage genomes in order to determine their origin.
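The GC content comparison and the selection of spacers from the GC-rich tail can be sketched as follows. This is a minimal example under stated assumptions: the 5% tail fraction, the choice to ignore ambiguous bases such as N, and the function names are hypothetical and are not taken from the analysis described here.

```python
import math


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / counted


def gc_rich_tail(spacers, fraction=0.05):
    """Spacers in the top `fraction` of the GC content distribution.

    The default fraction is an illustrative assumption.
    """
    ranked = sorted(spacers, key=gc_content, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:k]
```

The same `gc_content` function can be applied to whole bacterial and phage genome sequences, so that spacer, host, and phage values are computed on a common basis before they are compared.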
The GC content of the spacers was lower than that of the source bacterial genomes but higher than that of the infecting viral genomes. This observation is consistent with the hypothesis that the majority of CRISPR spacers were acquired from bacteriophage genomes and serve the canonical function. Alignments of spacers from the GC-rich tail of the distribution showed preferential targeting of host genomes, which further supports the hypothesis that GC-rich spacers originated from the bacterial genome and have a non-canonical function.
Keywords: CRISPR-cas, melting energy, extreme value distribution