|
If bioinformatics is defined as the development of algorithms and databases for understanding biological systems, then structural bioinformatics is the subset that deals, directly or indirectly, with the structure of macromolecules: DNA, RNA and proteins. Structural information about molecules, that is, the 3-dimensional atomic coordinates of structures, is the core from which all other details are derived. It is the primary resource of structural data and is central to everything else. In most cases the files containing atomic coordinates are not directly informative to the majority of structural biologists; thus, there are algorithms (tools) that transform, classify, analyze and then model this primary data. The results of these analyses are stored in other databases, which are termed secondary resources because they contain value-added information. The overall schema starts with a primary resource to which various algorithms are applied in order to generate multiple secondary resources. For example, the Protein Data Bank (PDB) is a primary resource; Combinatorial Extension (CE), a structural comparison method, is an algorithm applied to the primary data; and its results, i.e. the structural alignments of proteins, are captured in secondary resources. Algorithmic tools and secondary resources can be divided into several broad categories, such as visualization, structural classification, structural alignment/structure modeling, structure prediction, and protein-protein/protein-ligand interactions.
Primary Resource
The Protein Data Bank (PDB) is one of the earliest biological databases and was established in 1971 to store 3D biological macromolecular structures. It was originally housed at Brookhaven National Laboratory, USA, and is now managed and maintained by the RCSB (Research Collaboratory for Structural Bioinformatics). The PDB contains publicly available 3D structures of proteins, nucleic acids, and a variety of other complex biomolecules determined by X-ray crystallography, NMR spectroscopy and cryo-electron microscopy. The file format used by the PDB is itself called PDB. It consists of fixed-format records that describe the atomic coordinates, chemical and biochemical features, experimental details of the structure determination, and some structural features such as hydrogen bonds and secondary structure assignments. An important aspect of the PDB is efficient data processing, which consists of three steps: data deposition, validation and annotation.
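Because PDB records are fixed-format, each field is read from a fixed column range rather than by splitting on whitespace. The following is a minimal sketch in Python of reading the coordinate fields of ATOM and HETATM records; the column ranges follow the published PDB format description, while the names Atom, parse_atom_line and parse_atoms are hypothetical and chosen only for illustration. Fields such as occupancy, temperature factor and alternate location are deliberately ignored here.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """A reduced view of one ATOM/HETATM record (illustrative, not complete)."""
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Slice one fixed-width coordinate record by column position.

    PDB columns are 1-based in the format description; Python slices are
    0-based and end-exclusive, so columns 31-38 become line[30:38].
    """
    return Atom(
        serial=int(line[6:11]),       # columns 7-11: atom serial number
        name=line[12:16].strip(),     # columns 13-16: atom name
        res_name=line[17:20].strip(), # columns 18-20: residue name
        chain_id=line[21],            # column 22: chain identifier
        res_seq=int(line[22:26]),     # columns 23-26: residue sequence number
        x=float(line[30:38]),         # columns 31-38: x coordinate (angstroms)
        y=float(line[38:46]),         # columns 39-46: y coordinate
        z=float(line[46:54]),         # columns 47-54: z coordinate
    )


def parse_atoms(text: str) -> list[Atom]:
    """Return all coordinate records in a PDB-format string, skipping other record types."""
    atoms = []
    for line in text.splitlines():
        # Columns 1-6 hold the record name, padded with spaces.
        if line[:6].strip() in {"ATOM", "HETATM"}:
            atoms.append(parse_atom_line(line))
    return atoms
```

A list of such tuples is already enough input for simple downstream tools, for example selecting only the atoms named CA to obtain a backbone trace.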
Secondary Resource
Secondary resources are value-added structural databases; they are frequently the result of data reduction using algorithms and/or human expertise. Secondary resources are grouped by type, namely structural classification, structure prediction, functional assignments, protein-ligand interactions, and protein-protein interactions. Each group is discussed together with the algorithms and methods associated with its secondary resources.
Structural classification is the process of grouping proteins together by their level of 3D and sequence similarity. Clustering proteins by structural similarity is fundamental for the conceptual organization of protein space, as well as for understanding evolutionary relationships among proteins. Structural classification is based on the striking observation, made in the early days of structural biology, that structure is far more highly conserved than sequence. A classification is initially built by cross-comparison of known protein structures, either manually by experts or by using structural alignment algorithms. Hierarchy within a classification is achieved by repeating the structural cross-comparison with different alignment criteria. The majority of structure alignments are pairwise comparisons. The comparison process can be divided into three steps: 1) representation of the two structures in coordinate-independent space, 2) comparison and optimization, and 3) measuring the statistical significance of the alignment against a random set of structures.
Although prediction, such as gene prediction, is common in bioinformatics and computational biology, only in structure prediction is progress measured in a quantitative way, by the Critical Assessment of Structure Prediction (CASP) and the Critical Assessment of Fully Automated Structure Prediction (CAFASP) experiments.
The assumption underlying functional assignments is that proteins with similar sequences have similar functions. Thus, we can transfer what we know about the function of one protein to another, as long as they share a reasonable level of sequence similarity.
With the growth of structural genomics, we rapidly gain knowledge of new protein structures. At the same time, the number of available ligands in both real and virtual libraries, and the number of libraries themselves, are rapidly increasing. It is necessary to manage these structures efficiently in the ligand-design context, for instance by searching for a particular ligand and its potential targets and by visualizing the protein-ligand interactions.
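The three-step pairwise comparison described under structural classification can be illustrated with a deliberately simplified sketch. It assumes that each structure is reduced to a list of (x, y, z) points, one per residue, and it uses an inter-residue distance matrix as the coordinate-independent representation, which is one common choice but not the only one. The optimization step is replaced by a gapless sliding-window search, far simpler than what real alignment programs do, and the "random set of structures" is stood in for by random chains with a fixed step length. All function names are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def distance_matrix(coords):
    """Step 1: pairwise distances do not change under rotation or translation."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def window_dissimilarity(da, db, offset):
    """Mean absolute difference between db and the same-sized window of da at offset."""
    n = len(db)
    total = sum(
        abs(da[offset + i][offset + j] - db[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def best_gapless_match(a, b):
    """Step 2 (simplified): slide the shorter chain b along a, keep the best window.

    Returns (dissimilarity, offset); lower dissimilarity means more similar.
    """
    if len(b) < 2 or len(a) < len(b):
        raise ValueError("need len(a) >= len(b) >= 2")
    da, db = distance_matrix(a), distance_matrix(b)
    return min(
        (window_dissimilarity(da, db, k), k) for k in range(len(a) - len(b) + 1)
    )


def random_chain(n, rng, step=3.8):
    """Background model (an assumption): a random walk with fixed step length."""
    pos = (0.0, 0.0, 0.0)
    chain = [pos]
    for _ in range(n - 1):
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        pos = tuple(p + step * c / norm for p, c in zip(pos, v))
        chain.append(pos)
    return chain


def z_score(score, background_scores):
    """Step 3: how many standard deviations better than the background mean.

    Positive values mean the observed dissimilarity is lower than is typical
    for the background set.
    """
    sd = pstdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (mean(background_scores) - score) / sd
```

In use, one would compute the best match between two structures and then score the same query against a set of background chains to obtain the z-score. Working in distance space avoids having to superimpose the structures first, at the cost of ignoring chirality, since a structure and its mirror image have identical distance matrices.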
|