Pangenomic Alignment: Strings plus Graphs

Travis Gagie

Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 1W5, Canada

travis.gagie [at] dal.ca

Abstract

The use of only one or a few reference genomes for DNA alignment is known to bias research results and medical diagnoses, but aligning against many reference genomes has been problematic. If we represent such a pangenomic reference as a set of strings, then each seed we find in a DNA read may occur in many of the genomes, so even reporting all those occurrences can be slow, and extending and chaining seeds can be infeasible. On the other hand, if we represent it as a graph then, even apart from the significant technical challenges of indexing graphs, we may find many chimeric matches, that is, matches to paths in the graph that do not occur in any single reference genome. The more of humanity’s genetic diversity we try to represent in the graph, the fuzzier it becomes, and the greater the probability of spurious results.

Most research on pangenomic alignment uses either a string representation or a graph representation, but not both. In this talk we first describe how a tool called MONI indexes a pangenomic reference as a set of strings in small space such that later, for each maximal exact match in a given read, we can quickly find that match’s length, the position of one of its occurrences in the set of strings, and the lexicographic rank of the suffix starting with that occurrence. We then describe how a tool called MARIA will, when fully implemented, store a pangenomic reference as a graph in small space such that, given MONI’s output about a maximal exact match, we can quickly report all the non-chimeric occurrences of that match in the graph.
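
To make the three values reported for each maximal exact match (MEM) concrete, the following is a minimal brute-force sketch in Python. It is not MONI's algorithm and makes no attempt to use small space: it builds a plain suffix array by sorting and compares the read against every suffix. The function names, the "$" separator and the tuple layout are assumptions made for illustration only.

```python
def build_text(genomes):
    """Concatenate the genomes, ending each with '$' so matches cannot span two genomes."""
    return "".join(g + "$" for g in genomes)


def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_mems(text, read):
    """Return (read_start, length, text_position, suffix_rank) for each MEM.

    Assumes the read contains no '$'. Brute force, for illustration only.
    """
    sa = suffix_array(text)
    rank_of = {pos: rank for rank, pos in enumerate(sa)}

    # ms[i] = (length, position) of the longest prefix of read[i:] found in text.
    ms = []
    for i in range(len(read)):
        best_len, best_pos = 0, -1
        for pos in sa:
            length = common_prefix_len(text[pos:], read[i:])
            if length > best_len:
                best_len, best_pos = length, pos
        ms.append((best_len, best_pos))

    mems = []
    for i, (length, pos) in enumerate(ms):
        if length == 0:
            continue  # read[i] does not occur in the text at all
        if i > 0 and ms[i - 1][0] > length:
            continue  # the match can be extended to the left, so it is not maximal
        mems.append((i, length, pos, rank_of[pos]))
    return mems
```

For the two toy genomes GATTACA and GATTAGA and the read TACAGATT, this sketch reports three MEMs: TACA, AGA and GATT. The last one occurs in both genomes, but only one of its positions is reported, together with the rank of the suffix starting there.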

Combining MONI and MARIA will give us the advantages of working with both strings and graphs: we index the set of reference genomes, the whole set of reference genomes, and nothing but the set of reference genomes, but for each maximal exact match we output relatively few occurrences in the graph, which are easy to use later in a pipeline.

Keywords: pangenomic alignment, reference genomes, data structures, indexing

Acknowledgement: This talk covers results obtained in collaboration with many other researchers, in particular Christina Boucher and Marco Oliva at the University of Florida, Ben Langmead at Johns Hopkins University and Massimiliano Rossi at Illumina, for MONI; and Andrej Baláž, Adrián Goga and Alessia Petescia at Comenius University, Simon Heumos at the University of Tübingen and Jouni at the UCSC Genomics Institute, for MARIA. The author was funded by NSERC grant RGPIN-07185-2020, NSF/BIO grant DBI-2029552 to Christina Boucher, and NIH/NHGRI grant R01HG011392 to Ben Langmead.
