The past, the present, and the future of RNA secondary structure prediction

Lazar Vasović1

1Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia

pd212006 [at] alas.matf.bg.ac.rs

Abstract

RNA is a biopolymer whose primary structure is a sequence of nucleobases. While messenger RNA is probably the best known, an increasing number of non-coding RNAs (ncRNAs) are being discovered. To become biologically active, an ncRNA folds intramolecularly, forming segments of paired bases. This secondary structure largely determines the function of an ncRNA, so predicting it is important for newly discovered sequences. Owing to the strong link between the two structural levels, most predictors are data-driven and sequence-based.

The oldest and simplest algorithm was base pair maximization (BPM), which did not take important structural features into account. Another approach exploited the fact that biophysics dictates RNA folding, so it searched for the thermodynamically optimal structure. Statistical learning was the basis of the third group, with probabilistic context-free grammars (PCFGs) being the most influential. These were the state-of-the-art methods at the beginning of the century.
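To make the BPM idea concrete, the following is a minimal sketch of a Nussinov-style dynamic program that maximizes the number of nested base pairs. It is an illustration rather than the implementation assessed in this review: the function name `max_base_pairs`, the set of allowed pairs (Watson-Crick plus G-U wobble), and the minimum hairpin loop length of 3 are all assumptions made for the example.

```python
# Illustrative sketch of base pair maximization (Nussinov-style DP).
# Assumptions (not from the paper): allowed pairs are A-U, G-C, and G-U,
# and a hairpin loop must contain at least `min_loop` unpaired bases.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Return (pair count, dot-bracket string) for one optimal nested structure."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # dp[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one structure that achieves dp[0][n - 1]
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # interval too short to hold any pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)


count, structure = max_base_pairs("GGGAAAUCC")
print(count, structure)
```

Because the objective counts pairs only, the sketch ignores stacking and loop energies, which is one way to read why BPM lags behind the thermodynamic and statistical approaches.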

However, much has changed in recent years, as technological advances have enabled the widespread use of machine learning. Its use in RNA structure prediction ranges from a supplementary role (e.g., estimating the thermodynamic and statistical parameters of traditional methods) to encapsulating the whole prediction process. The greatest success has been reported with transformers, recurrent neural networks, and convolutional neural networks (CNNs).

This paper was designed as a review and aimed to compare several methods theoretically and assess them practically. As expected, model complexity was highly correlated with accuracy. On the subset of simply structured transfer RNA, for example, BPM predicted ~22% of pairings correctly, PCFG ~86%, and CNN ~99%. Other subsets, such as 16S ribosomal RNA, were more challenging, but deep learning always performed best. With the continued growth of computational power and of the amount of annotated data, prediction accuracy is expected to approach that of experimental determination even more closely, at a much lower cost.
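The abstract does not specify how the share of correctly predicted pairings was computed. As one plausible reading, the sketch below measures the fraction of reference base pairs that a prediction recovers, with both structures given in dot-bracket notation. The helper names `pairs_from_dot_bracket` and `fraction_pairs_recovered` are hypothetical, and pseudoknot brackets are deliberately not handled.

```python
# Illustrative metric sketch (an assumption, not the paper's stated metric):
# fraction of reference base pairs that also appear in the prediction.
def pairs_from_dot_bracket(structure):
    """Convert a pseudoknot-free dot-bracket string to a set of (i, j) pairs."""
    stack, pairs = [], set()
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % idx)
            pairs.add((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def fraction_pairs_recovered(predicted, reference):
    """Share of reference pairs present in the prediction (0.0 if none exist)."""
    ref = pairs_from_dot_bracket(reference)
    if not ref:
        return 0.0
    pred = pairs_from_dot_bracket(predicted)
    return len(pred & ref) / len(ref)


# The prediction recovers two of the three reference pairs.
print(fraction_pairs_recovered("((.....))", "(((...)))"))
```

A measure of this kind rewards only recovered pairs; it does not penalize spurious ones, so a full evaluation would typically report a precision-style counterpart alongside it.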

Keywords: RNA structure prediction, review, machine learning

Acknowledgement: This research was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia through the scholarship project for young and unemployed doctoral students, contract number 451-03-1271/2022-14/2990.
