1 PBBD – Protein Building Blocks Database
Motifs of super-secondary structure that are consistent with closed loop properties were collected
manually according to the following criteria: 1) amino acid sequence length between 10 and 40 residues;
2) Cα-Cα distance between sequence extremities < 10 Å; 3) at least three apolar and/or hydrophobic
amino acids contiguous with extremities; 4) structural compatibility with one of the more common
folding motifs that are recurrent in proteins, such as type I’..IV’ αβ-barrel, β-hairpin, β-corner,
Greek key motif, Rossmann fold, βαβ-motif, zinc finger, leucine zipper, helix turn helix,
α-α corner, and α-hairpin. Elementary building blocks of different lengths, with invariant internal
conformations and hydrophobic folding units (HFUs) at the N-terminal and C-terminal ends, were
also included in the database. Furthermore, we collected a broad set of intrinsically disordered regions that
lack a defined three-dimensional structure and cannot be assigned to any of the previously described
categories. Many members of the PBBD were chosen so as to retain a number of attributes, which can be summarized as follows:
a) putative capability to fold in isolation, inferred from the distribution of inter-residue contacts present in the native structure;
b) membership in a typical supersecondary structure typology; c) conserved biophysical properties; and d) functional and evolutionary sequence patterns.
In more detail, the collected motifs contain specific subsequences, namely closed loops, which are regarded as remnants of ancestral prototypes that gave
rise to modern proteins and which could provide insight into the early stages of protein evolution. In addition, many fragments
contain, at their N- and C-terminal ends, 3 to 5 hydrophobic amino acids, termed van der Waals locks,
which often correspond to HFUs; such patterns could play a crucial role in driving the folding of
polypeptide chains towards the native conformation. Peptides flanked by van der Waals locks putatively constitute
autonomous folding units (AFUs); this property, together with an appropriate distribution of inter-residue contacts, could support the rational
design of proteins by recombining structural motifs and could help reveal evolutionary pathways. Therefore, each record of the PBBD was enriched with both structural and Gene
Ontology annotations, in an attempt to determine a non-hierarchical semantic similarity between members of the dataset.
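The three quantitative collection criteria stated at the start of this section (sequence length, Cα-Cα distance between the extremities, hydrophobic termini) lend themselves to a simple pre-filter. The following is a minimal Python sketch and not the procedure used to build the PBBD, which was curated manually. The function name, the hydrophobic alphabet, the five-residue terminal window, and the reading of criterion 3 as "at least three hydrophobic residues within each terminal window" are all assumptions introduced for illustration. Criterion 4 (structural compatibility with a known folding motif) is not covered.

```python
import math

# Assumed apolar/hydrophobic alphabet (one-letter codes); the text does not list one.
HYDROPHOBIC = set("AVILMFWC")


def is_closed_loop_candidate(sequence, ca_coords,
                             min_len=10, max_len=40,
                             max_end_distance=10.0,
                             end_window=5, min_hydrophobic=3):
    """Apply criteria 1-3 to one fragment.

    sequence  : one-letter amino acid string
    ca_coords : list of (x, y, z) C-alpha coordinates in angstroms, one per residue
    """
    if len(sequence) != len(ca_coords):
        raise ValueError("sequence and coordinates must have the same length")

    # Criterion 1: length between 10 and 40 residues.
    if not (min_len <= len(sequence) <= max_len):
        return False

    # Criterion 2: C-alpha to C-alpha distance between the extremities below 10 angstroms.
    if math.dist(ca_coords[0], ca_coords[-1]) >= max_end_distance:
        return False

    # Criterion 3 (assumed reading): at least three hydrophobic residues
    # within a short window at each end of the fragment.
    seq = sequence.upper()
    n_term = sum(res in HYDROPHOBIC for res in seq[:end_window])
    c_term = sum(res in HYDROPHOBIC for res in seq[-end_window:])
    return n_term >= min_hydrophobic and c_term >= min_hydrophobic
```

A fragment that passes this filter would still need manual inspection against the motif typologies listed above before being considered for inclusion.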
Annotations are divided into two main groups. The first includes general information on both the parent protein and its structural
domains, retrieved from the Protein Data Bank (URL: http://www.rcsb.org/), SCOP (URL: http://scop.berkeley.edu/) and
CATH (URL: http://www.cathdb.info/). The second includes patterns that are specific to the fragments stored in the PBBD, retrieved from
PROSITE (URL: http://prosite.expasy.org/), the Molecular Modelling Database, the Conserved Domain Database (CDD) (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
and the Short Linear Motifs (SLiMs) database (http://elm.eu.org/). Furthermore, conserved positions of primary van der Waals locks
(hydrophobic sequences located between the ends of the closed loops) and secondary van der Waals locks
(hydrophobic sequences that establish contacts between different closed loops and/or spatially neighbouring regions) were collected
from the Domain Hierarchy and closed Loops (DHcL) server (http://cropnet.pl/dhcl/).
1.2 Attributes of PBBD sequences
Multiple alignments of motifs stored in the PBBD show a satisfactory correspondence between closed-loop
structure and sequence: with the Gonnet matrix, αβ-barrel sequences split into
three distinct strings, each corresponding quite well to a secondary structure typology.
Surprisingly, alignment of sequences corresponding to βαβ-motifs reveals
a reasonably clear separation between the respective secondary structures. The substitution matrix
used here evidently discriminates between strings dominated by hydrophobic and/or nonpolar
residues, corresponding preferentially to β-strands, and strings dominated by polar, acidic
and/or basic residues, corresponding mainly to α-helices. In addition, strings in
which residues such as Gly, Pro and Asp prevail more likely correspond to hairpins of αβ-barrels as well as
to helix-turn-helix hairpins and to the flexible coils connecting secondary structures of βαβ motifs.
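The residue classes invoked above (hydrophobic/nonpolar, polar or charged, and the Gly/Pro/Asp group) can be made concrete by tallying the composition of an aligned string. The sketch below is illustrative only: the class memberships other than Gly, Pro and Asp are assumptions, the function name is hypothetical, and the output is a descriptive profile rather than the discrimination performed by the substitution matrix itself.

```python
# Assumed residue classes (one-letter codes). Only Gly, Pro and Asp are
# named explicitly in the text; the other two sets are illustrative choices.
HYDROPHOBIC = set("AVILMFWC")
POLAR_CHARGED = set("STNQYHKRDE")
BREAKERS = set("GPD")


def composition_profile(segment):
    """Return the fraction of residues of a segment falling in each class.

    Asp belongs to both the polar/charged class and the breaker class,
    so the three fractions are not required to sum to 1.
    """
    seq = segment.upper()
    if not seq:
        raise ValueError("segment must not be empty")
    n = len(seq)
    return {
        "hydrophobic": sum(res in HYDROPHOBIC for res in seq) / n,
        "polar_charged": sum(res in POLAR_CHARGED for res in seq) / n,
        "breaker": sum(res in BREAKERS for res in seq) / n,
    }
```

Under these assumptions, a string with a high hydrophobic fraction would be expected to map to a β-strand, one with a high polar/charged fraction to an α-helix, and one rich in Gly, Pro and Asp to a hairpin or connecting coil.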
Statistical analysis of the frequencies of residues located within β-strands shows a preference for
amino acids such as Val, Ile and Leu. The dipeptides Val-Ile, Val-Leu and Leu-Ile are among the
most frequent and are therefore a good, though not unique, fingerprint for
identifying β-strands located within closed loops. Among the tripeptides observed with high frequency in β-strands,
such as Ile-Val-Val, Leu-Val-Ile, and Ala-Leu-Val, some belong to “prototypes”
of prokaryotic closed loops already characterised in previous studies, while frequencies
of binary patterns occurring within these motifs reveal a clear prevalence of nonpolar residues. Analysis of residues
occurring in α-helices shows a clear predominance of dipeptides such as Val-Ala, Ala-Leu, Leu-Lys,
Arg-Glu and Ala-Ala, and of tripeptides such as X-Ala-Ala (X = Arg, Leu, Lys, Glu), Glu-Glu-Ala, Ala-Leu-Ala
and Leu-Lys-Ala. Binary patterns identified in α-helices, on the other hand, often exhibit a regular
alternation of polar and nonpolar residues, which is typical of amphipathic helices, whereas in other cases
short runs of polar amino acids are followed by a contiguous series of hydrophobic
and/or nonpolar amino acids. Hairpin folds typically embedded within αβ-barrels, as well as coils embedded within both βαβ motifs and
α-hairpins, show, as expected, a clear predominance of residues that interrupt secondary
structures (Pro, Gly, Asp) as well as the peptides Gly-Arg, Gly-Ala-X (X = Ser, Ala, Asp, Arg, Gly, Glu,
Lys) and Ala-Pro-X (X = His, Ser, Gly, Glu). Finally, analysis of hydrophobic clusters reveals many patterns
that could help discriminate between different secondary structure typologies.
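The dipeptide, tripeptide and binary-pattern statistics discussed in this section amount to counting overlapping k-mers and recoding residues into two classes. The following Python sketch shows one way such counts could be obtained. The function names and the hydrophobic alphabet are assumptions, the input sequences are toy examples rather than PBBD data, and the original analysis may have used a different procedure.

```python
from collections import Counter

# Assumed hydrophobic/nonpolar alphabet (one-letter codes); not given in the text.
HYDROPHOBIC = set("AVILMFWC")


def kmer_counts(segments, k):
    """Count overlapping k-mers (k=2 for dipeptides, k=3 for tripeptides)
    over a collection of segments, e.g. all beta-strand sequences."""
    counts = Counter()
    for seg in segments:
        seg = seg.upper()
        for i in range(len(seg) - k + 1):
            counts[seg[i:i + k]] += 1
    return counts


def binary_pattern(segment):
    """Recode a segment as H (hydrophobic/nonpolar) or P (any other residue)."""
    return "".join("H" if res in HYDROPHOBIC else "P" for res in segment.upper())


# Toy usage with made-up strand fragments (hypothetical data).
strands = ["VIVI", "VIL"]
dipeptides = kmer_counts(strands, 2)
print(dipeptides.most_common(1))   # [('VI', 3)]
print(binary_pattern("LKALKA"))    # HPHHPH
```

Ranking the resulting counts with `most_common` gives the kind of high-frequency dipeptide and tripeptide lists reported above, and applying `kmer_counts` to the recoded strings gives the corresponding binary-pattern frequencies.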