1 PBBD – Protein Building Blocks Database

Motifs of super-secondary structure that are consistent with closed-loop properties were collected manually according to the following principles: 1) sequence length between 10 and 40 residues; 2) Cα-Cα distance between the sequence extremities < 10 Å; 3) at least three apolar and/or hydrophobic amino acids contiguous with the extremities; 4) structural compatibility with one of the more common folding motifs that recur in proteins, such as type I’..IV’ αβ-barrel, β-hairpin, β-corner, Greek key motif, Rossmann fold, βαβ-motif, zinc finger, leucine zipper, helix-turn-helix, α-α corner, and α-hairpin. Elementary building blocks of different lengths, with internal conformations kept invariant and with hydrophobic folding units (HFUs) at the N-terminal and C-terminal ends, were also included in the database. In addition, we collected a broad set of intrinsically disordered regions, which lack a defined three-dimensional structure and cannot be assigned to any of the previously listed categories. Many members of the PBBD were chosen so as to retain a discrete set of attributes, which can be summarized as follows: a) putative capability to fold in isolation, inferred from the distribution of inter-residue contacts present in the native structure; b) membership in a typical supersecondary structure typology; c) conserved biophysical properties; and d) functional and evolutionary sequence patterns. In more detail, the collected motifs contain specific subsequences, namely closed loops, which have been proposed as remnants of the ancestral prototypes that gave rise to modern proteins, and which could provide insights into the early stages of protein evolution.
In addition, many fragments contain, at their N- and C-terminal ends, from 3 to 5 hydrophobic amino acids, namely van der Waals locks, which often correspond to HFUs: such patterns could play a crucial role in driving the folding of polypeptide chains during the stages that lead to the native conformation. Peptides flanked by van der Waals locks are putatively equipped with autonomous folding units (AFUs); this, together with the existence of an appropriate distribution of inter-residue contacts, could enable the rational design of proteins via recombination of structural motifs and could help reveal evolutionary pathways. Therefore, each record of the PBBD was enriched with both structural and gene ontology annotations, in an attempt to determine a non-hierarchical semantic similarity between members of the dataset. Annotations are divided into two main groups: the former includes general information pertaining to both the parent protein and its structural domains, retrieved from the Protein Data Bank (URL: http://www.rcsb.org/), SCOP (URL: http://scop.berkeley.edu/) and CATH (URL: http://www.cathdb.info/); the latter includes patterns that are specific to the fragments stored in the PBBD, retrieved from PROSITE (URL: http://prosite.expasy.org/), the Molecular Modelling Database, the Conserved Domain Database (CDD) (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and the Short Linear Motifs (SLiMs) databases (http://elm.eu.org/). Furthermore, conserved positions of primary van der Waals locks (hydrophobic sequences located between the ends of the closed loops) and secondary van der Waals locks (hydrophobic sequences that establish contacts between different closed loops and/or spatially neighbouring regions) were collected from the Domain Hierarchy and closed Loops (DHcL) server (http://cropnet.pl/dhcl/).
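As an illustration only, selection principles 1) to 3) can be expressed as a simple filter; the motifs themselves were collected manually, so this is a sketch and not the procedure actually used. The hydrophobic residue set, the function names, and the reading of criterion 3 as "at least three hydrophobic residues within a five-residue window at each end" are assumptions made for this example. Criterion 4 (structural compatibility with a known folding motif) is not covered.

```python
import math

# Assumed one-letter codes for apolar/hydrophobic residues;
# the text does not specify which residues were counted.
HYDROPHOBIC = set("AVILMFWC")


def count_hydrophobic(segment):
    """Number of residues in `segment` that belong to the assumed hydrophobic set."""
    return sum(1 for residue in segment if residue in HYDROPHOBIC)


def is_closed_loop_candidate(seq, ca_first, ca_last,
                             min_len=10, max_len=40,
                             max_ca_dist=10.0,
                             min_hydrophobic=3, window=5):
    """Check criteria 1-3 for a fragment.

    seq      : one-letter amino acid sequence of the fragment
    ca_first : (x, y, z) of the first residue's C-alpha atom, in angstroms
    ca_last  : (x, y, z) of the last residue's C-alpha atom, in angstroms
    The `window` parameter is a hypothetical choice, not taken from the text.
    """
    # Criterion 1: sequence length between 10 and 40 residues.
    if not (min_len <= len(seq) <= max_len):
        return False
    # Criterion 2: C-alpha to C-alpha distance between the ends below 10 angstroms.
    if math.dist(ca_first, ca_last) >= max_ca_dist:
        return False
    # Criterion 3 (assumed reading): enough hydrophobic residues near each end.
    n_term = count_hydrophobic(seq[:window])
    c_term = count_hydrophobic(seq[-window:])
    return n_term >= min_hydrophobic and c_term >= min_hydrophobic
```

For example, a made-up 15-residue fragment `"VILAGSDKTEGAVLI"` whose terminal Cα atoms are 6 Å apart passes all three checks, while the same fragment with its ends 12 Å apart does not.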

1.2 Attributes of PBBD sequences

Multiple alignments of motifs stored in the PBBD show a satisfactory correspondence between closed-loop structure and sequence: use of the Gonnet matrix splits αβ-barrel sequences into three distinct strings, each corresponding quite well to a secondary structure typology. Surprisingly, alignment of sequences corresponding to βαβ-motifs also reveals a sufficiently clear separation between the respective secondary structures. The substitution matrix used here evidently discriminates between strings characterised predominantly by hydrophobic and/or nonpolar residues, corresponding preferentially to β-strands, and strings characterised by polar, acidic and/or basic residues, corresponding predominantly to α-helices. In addition, strings in which residues such as Gly, Pro and Asp prevail more likely correspond to hairpins of αβ-barrels, to helix-turn-helix hairpins, and to the flexible coils connecting secondary structures of βαβ motifs. Statistical analysis of the frequencies of residues located within β-strands shows a preference for amino acids such as Val, Ile and Leu. The dipeptides Val-Ile, Val-Leu and Leu-Ile are among those occurring with the highest frequency and are therefore a good, though not unique, fingerprint for identifying β-strands located within closed loops. Among the tripeptides observed with high frequency in β-strands, such as Ile-Val-Val, Leu-Val-Ile, and Ala-Leu-Val, some belong to “prototypes” of prokaryotic closed loops already characterised in previous studies, while the frequencies of binary patterns occurring within these motifs reveal a clear prevalence of nonpolar residues. Analysis of residues occurring in α-helices shows a clear predominance of dipeptides such as Val-Ala, Ala-Leu, Leu-Lys, Arg-Glu and Ala-Ala, and of tripeptides such as X-Ala-Ala (X = Arg, Leu, Lys, Glu), Glu-Glu-Ala, Ala-Leu-Ala and Leu-Lys-Ala.
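The dipeptide and tripeptide statistics described above amount to counting overlapping n-grams within segments of a given secondary structure. The following is a minimal sketch of such a count, assuming the segments are already available as one-letter strings; the function names and the toy input are hypothetical and do not reproduce the PBBD data or the authors' actual analysis.

```python
from collections import Counter


def ngram_counts(segments, n):
    """Count overlapping n-residue peptides (n=2 dipeptides, n=3 tripeptides)
    across a collection of sequence segments."""
    counts = Counter()
    for segment in segments:
        for i in range(len(segment) - n + 1):
            counts[segment[i:i + n]] += 1
    return counts


def ngram_frequencies(segments, n):
    """Relative frequency of each n-residue peptide; empty dict if none occur."""
    counts = ngram_counts(segments, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {peptide: count / total for peptide, count in counts.items()}


# Toy beta-strand segments, invented for illustration only.
toy_strands = ["VIVL", "LVIV"]
dipeptides = ngram_counts(toy_strands, 2)
tripeptides = ngram_counts(toy_strands, 3)
```

With this toy input, `dipeptides.most_common()` ranks `VI` and `IV` first, with two occurrences each. Counts are kept per segment so that no peptide spans the boundary between two unrelated strands.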
On the other hand, binary patterns identified in α-helices often exhibit a regular alternation of polar and nonpolar residues, which is typical of amphipathic helices, whereas in other cases short runs of polar amino acids followed by a contiguous series of hydrophobic and/or nonpolar amino acids can be observed. Hairpin folds typically embedded within αβ-barrels, as well as coils embedded within both βαβ motifs and α-hairpins, reveal, as expected, a clear predominance of residues that interrupt secondary structures (Pro, Gly, Asp) as well as of the peptides Gly-Arg, Gly-Ala-X (X = Ser, Ala, Asp, Arg, Gly, Glu, Lys) and Ala-Pro-X (X = His, Ser, Gly, Glu). Finally, analysis of hydrophobic clusters reveals many patterns that could help discriminate between different secondary structure typologies.
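A binary pattern of the kind discussed above can be obtained by reducing each residue to one of two classes and then inspecting the runs of identical symbols. The sketch below is illustrative only: the nonpolar residue set, the `H`/`P` symbols and the function names are assumptions for this example, and the text does not state which classification was actually used.

```python
from itertools import groupby

# Assumed set of nonpolar/hydrophobic residues (one-letter codes);
# every other residue is treated as polar in this sketch.
NONPOLAR = set("GAVLIMFWPC")


def binary_pattern(seq):
    """Encode a sequence as 'H' (nonpolar) or 'P' (polar) per residue."""
    return "".join("H" if residue in NONPOLAR else "P" for residue in seq)


def run_lengths(pattern):
    """Collapse a binary pattern into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(pattern)]


# Invented helix-like peptide: short alternating runs of the two classes.
helix_like = binary_pattern("LKALEELAK")
# Invented peptide: a polar stretch followed by a hydrophobic stretch.
block_like = binary_pattern("KKEVVIL")
```

Short alternating runs, as in `helix_like`, are the kind of pattern the text associates with amphipathic helices, whereas `block_like` shows a polar run followed by a contiguous nonpolar run.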