Alexandre G. de Brevern1* and Nenad Mitić2
1 DSIMB Bioinformatics team, INSERM UMR_S 1134, BIGR, Université Paris Cité, Université de la Réunion, Necker Hospital, Paris, France
2 Department of Computer Science, Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
alexandre.debrevern [at] univ-paris-diderot.fr
Abstract
Three-dimensional (3D) protein structures underpin the biological functions that are essential to life. Access to this 3D information is of great interest for both basic and applied research. Traditionally, structures are analyzed by assigning secondary structures (helices, sheets and loops). However, this description does not characterize loops properly and does not capture the fine structural details of the repetitive structures.
As a result, more systematic approaches to describing 3D structures have been developed, known as Structural Alphabets (SAs). Within this framework, Protein Blocks (PBs) are the most successful and most widely applied SA. The 16 PBs, named from PB a to PB p, are pentapeptides that can finely approximate the entire 3D structure. They have a strong sequence-structure relationship and form a grammar in which each PB is preferentially followed by certain PBs. Some transitions are therefore highly preferential: some PBs lead predominantly to only two or three other PBs. However, rare transitions also exist, i.e. transitions observed with a frequency of less than 1%.
The work carried out here first involved analyzing data from the Protein Data Bank to determine which transitions are very common and which are rare. Second, the amino acid frequencies of the PBs involved in these rare transitions were compared with the frequencies classically expected, in order to answer a simple question: are these rare, and therefore unexpected, transitions linked to amino acid compositions that differ from those observed in PBs in general? Finally, a similar analysis was carried out using AlphaFold2 models of the human proteome. This work highlights the specific behavior of a number of PBs and amino acids.
Keywords: bioinformatics, data mining, computer science, protein structures, sequence-structure relationship.
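As an illustration of the transition analysis summarized in the abstract, the following minimal Python sketch counts PB-to-PB transitions in a set of PB sequences and lists those observed with a frequency below 1%. It is not the authors' implementation. The function names, the toy sequences, and the choice to normalize each frequency by the number of occurrences of the preceding PB are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# The 16 Protein Blocks, PB a to PB p.
PB_ALPHABET = "abcdefghijklmnop"


def transition_frequencies(pb_sequences):
    """Return {pb: {next_pb: frequency}} over consecutive PB pairs.

    Assumption: each frequency is normalized by the number of transitions
    leaving the first PB. Pairs containing a character outside the 16 PBs
    (for example an undefined position) are skipped.
    """
    counts = defaultdict(Counter)
    for seq in pb_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            if cur in PB_ALPHABET and nxt in PB_ALPHABET:
                counts[cur][nxt] += 1

    freqs = {}
    for cur, next_counts in counts.items():
        total = sum(next_counts.values())
        freqs[cur] = {nxt: n / total for nxt, n in next_counts.items()}
    return freqs


def rare_transitions(freqs, threshold=0.01):
    """List (pb, next_pb, frequency) for observed transitions below threshold."""
    return sorted(
        (cur, nxt, f)
        for cur, row in freqs.items()
        for nxt, f in row.items()
        if 0 < f < threshold
    )


# Hypothetical toy data, not taken from the Protein Data Bank.
toy_sequences = ["m" * 199 + "n", "dddd"]
toy_freqs = transition_frequencies(toy_sequences)
toy_rare = rare_transitions(toy_freqs)

for cur, nxt, f in toy_rare:
    print(f"PB {cur} -> PB {nxt}: {f:.4f}")
```

With real data, the input would be one PB string per protein chain. The amino acid composition at the positions involved in the rare transitions could then be compared with the overall composition of the same PBs.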