Bogdan Kirillov*1,2, Aleksandra Vasileva3,4, Oleg Fedorov5, Maxim Panov6, and Konstantin Severinov7
1Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld 1, 121205 Moscow, Russia
2Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Institute of Gene Biology, Russian Academy of Sciences, 34/5 Vavilova Street, 119334 Moscow, Russia
3Peter the Great St. Petersburg Polytechnic University, Politekhnicheskaya St 29, 195251 St. Petersburg, Russia
4Institute of Molecular Genetics, Russian Academy of Sciences, Kurchatov square 2, 123182 Moscow, Russia
5Research Institute for Systems Biology and Medicine, Department of Mathematical Biology and Bioinformatics, Nauchny Proezd 18, 117246 Moscow, Russia
6Artificial Intelligence Cross Center Unit, Technology Innovation Institute, PO Box: 9639, Masdar City, Abu Dhabi, United Arab Emirates
7Waksman Institute of Microbiology, Rutgers, State University of New Jersey, Piscataway, NJ 08854, USA
bogdan.kirillov [at] skoltech.ru
Abstract
The recognition of target DNA sequences during the interference phase of prokaryotic CRISPR-Cas immunity relies on Protospacer-Adjacent Motif (PAM) sequences, which are specific to each Cas effector. PAM identification is a laborious and time-consuming process that requires multiple stages, including in vitro and in vivo cleavage assays followed by Next Generation Sequencing of the targets that withstood cleavage. Determining the PAM is an essential step in the characterisation of any novel Cas9 ortholog, and the PAM itself determines how likely the ortholog is to be used. This study investigates the potential of machine learning to predict PAM sequences for a given Cas9 ortholog from the results of cleavage experiments, employing an Active Learning process akin to Reinforcement Learning from Human Feedback. Machine learning-facilitated PAM identification would streamline and accelerate existing pipelines for describing novel Cas proteins. We demonstrate that simple models trained on a small amount of data are sufficient for confident PAM predictions when training is effectively orchestrated.
Keywords: bioinformatics, CRISPR, machine learning