The use of Active Machine Learning for Protospacer-Adjacent Motif recovery in Class 2 CRISPR-Cas systems

Bogdan Kirillov*1,2, Aleksandra Vasileva3,4, Oleg Fedorov5, Maxim Panov6, and Konstantin Severinov7

1Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld 1, 121205 Moscow, Russia

2Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Institute of Gene Biology, Russian Academy of Sciences, 34/5 Vavilova Street, 119334 Moscow, Russia

3Peter the Great St. Petersburg Polytechnic University, Politekhnicheskaya St 29, 195251 St. Petersburg, Russia

4Institute of Molecular Genetics, Russian Academy of Sciences, Kurchatov square 2, 123182 Moscow, Russia

5Research Institute for Systems Biology and Medicine, Department of Mathematical Biology and Bioinformatics, Nauchny Proezd 18, 117246 Moscow, Russia

6Artificial Intelligence Cross Center Unit, Technology Innovation Institute, PO Box: 9639, Masdar City, Abu Dhabi, United Arab Emirates

7Waksman Institute of Microbiology, Rutgers, State University of New Jersey, Piscataway, NJ 08854, USA

bogdan.kirillov [at] skoltech.ru

Abstract

The recognition of target DNA sequences during the interference phase of prokaryotic CRISPR-Cas immunity relies on Protospacer-Adjacent Motif (PAM) sequences, which are specific to each Cas effector. PAM identification is a laborious and time-consuming process that requires multiple stages, including in vitro and in vivo cleavage assays followed by Next Generation Sequencing of the targets that withstood cleavage. Determining the PAM is an essential step in the characterisation of any novel Cas9 ortholog and informs the likelihood of its practical use. This study investigates the potential of machine learning to predict PAM sequences for a given Cas9 ortholog from the results of cleavage experiments, employing an Active Learning process akin to Reinforcement Learning from Human Feedback. Machine learning-facilitated PAM identification would streamline and accelerate existing pipelines for describing novel Cas proteins. We demonstrate that simple models trained on a small amount of data are sufficient for confident PAM predictions when training is effectively orchestrated.

Keywords: bioinformatics, CRISPR, machine learning
