An integrated platform for genome assembly, comparative genomics and management of genomic variation databases

Jorge Duitama1

1Systems and Computing Engineering Department, Universidad de los Andes, Cra 1 Este 19 A 40, Bogotá, Colombia

ja.duitama [at] uniandes.edu.co

Abstract

The use of long-read DNA sequencing technologies is producing an explosion of high-quality de novo genome assemblies. The availability of these genomes represents a major step forward for evolutionary studies, population genomics, and epidemiology, among other applications. A major bottleneck for many research groups continues to be the availability of tools to build and analyze the large datasets of genomes that can be produced using these technologies. In this talk, I summarize the functionalities developed by my research group in version 4 of the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive analysis of long and short DNA sequencing reads. First, we designed new algorithms for assembly of haploid and diploid samples from long DNA sequencing reads. A minimizers table is constructed from the reads, using k-mer hash codes calculated from rankings relative to the mode of the k-mer count distribution. Statistics collected during this process are used as features to build layout paths. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi and Nanopore sequencing data for different species show that our solution has competitive contiguity and efficiency, as well as superior accuracy in some cases, compared to other currently used software. We also developed a functionality to perform ortholog identification and gene-based alignment of assembled genomes. Proteomes for each genome are extracted, and homology relationships are predicted efficiently by building indexes of amino acid sequences based on k-mer occurrence. Then, genes are clustered into orthogroups based on the topology of the graph induced by the predicted relationships. Gene presence/absence matrices are derived from these orthogroups. If genome assemblies are provided as input, synteny relationships are identified for each pair of genomes.
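As an illustration of the orthogroup step, the following is a minimal sketch and not the NGSEP implementation. It assumes that homology relationships are already available as pairs of genes and that an orthogroup is simply a connected component of the resulting graph, which is the simplest possible reading of "topology of the graph". The function names `orthogroups` and `presence_absence` and the `(genome, gene_id)` tuples are hypothetical.

```python
def orthogroups(edges):
    """Group genes into connected components of the homology graph.

    edges: iterable of (gene_a, gene_b) pairs, where each gene is a
    (genome, gene_id) tuple. Genes without any edge do not appear.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    # Sort for a deterministic output order.
    return sorted(sorted(g) for g in groups.values())


def presence_absence(groups, genomes):
    """One row per orthogroup, one column per genome (1 = present)."""
    return [
        [int(any(genome == g for g, _ in group)) for genome in genomes]
        for group in groups
    ]


# Toy input: three genomes A, B, C with hypothetical gene identifiers.
edges = [
    (("A", "a1"), ("B", "b1")),
    (("B", "b1"), ("C", "c1")),
    (("A", "a2"), ("B", "b2")),
]
groups = orthogroups(edges)
matrix = presence_absence(groups, ["A", "B", "C"])
```

In this toy input the first orthogroup has a member in all three genomes, while the second has no member in genome C. A production tool would likely use a more selective clustering criterion than plain connectivity, since a single spurious homology edge merges two components.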
We also implemented algorithms to align short and long reads to a reference genome. Based on aligned long reads, we improved the classical variant detector so that it also detects large structural variants. Taken together, these developments make NGSEP a comprehensive tool to perform de novo and reference-based analysis of DNA sequencing reads in a wide variety of experimental settings and for different research goals.
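The minimizers table mentioned in the assembly step can be sketched as follows. This is a generic window-minimizer scheme written for illustration, not the NGSEP code. The ranking used here, which orders k-mers by how far their count lies from the mode of the count distribution, is one assumed reading of the description in the abstract. The names `kmers`, `count_mode`, `minimizers` and `build_table` are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_mode(counts):
    """Most common count value among distinct k-mers (ties: smaller value)."""
    freq = Counter(counts.values())
    return min(freq, key=lambda c: (-freq[c], c))


def minimizers(seq, k, w, rank):
    """Pick the lowest-ranked k-mer in every window of w consecutive k-mers.

    Returns a sorted list of (position, k-mer) pairs. Sequences with fewer
    than w k-mers yield no minimizers.
    """
    ks = kmers(seq, k)
    selected = set()
    for start in range(len(ks) - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (rank(ks[i]), i)))
    return [(i, ks[i]) for i in sorted(selected)]


def build_table(reads, k, w):
    """Map each minimizer k-mer to the (read index, position) pairs where it was selected."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    mode = count_mode(counts)

    # Assumed ranking: k-mers whose count is closest to the mode come first;
    # the k-mer string itself breaks ties.
    def rank(km):
        return (abs(counts[km] - mode), km)

    table = {}
    for read_id, read in enumerate(reads):
        for pos, km in minimizers(read, k, w, rank):
            table.setdefault(km, []).append((read_id, pos))
    return table


reads = ["ACGTACGTAC", "CGTACGTACG"]
table = build_table(reads, k=3, w=2)
```

Two reads that share a minimizer in this table become candidates for overlap detection or alignment seeding. Ranking by distance from the mode count is intended to favor k-mers with typical abundance over rare ones, which are often errors, and over highly repeated ones.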

Keywords: bioinformatics, algorithms, DNA sequencing, software, genome assembly

Acknowledgement: This work was supported by the Colombian Ministry of Sciences research fund “Patrimonio Autónomo Fondo Nacional de Financiamiento Para la Ciencia, la Tecnología Y la Innovación Francisco José de Caldas” through the grant with contract number 80740-441-2020, awarded to J Duitama. We also wish to acknowledge the support of the IT Services Department and ExaCore—IT Core-facility of the Vice Presidency for Research & Creation at the Universidad de los Andes, which allowed us to perform the computational analyses.
