INTRODUCTION TO BIOLOGICAL DATABASES


As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown rapidly. Examples include nucleotide sequences, protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR.
Bioinformatics is the application of information technology to store, organize, and analyze the vast amount of biological data available as sequences and structures of proteins (the building blocks of organisms) and nucleic acids (the information carriers). Nucleic acid data is available as sequences, while protein data is available as both sequences and structures. A sequence is a one-dimensional representation, whereas a structure records the three-dimensional arrangement of the molecule that the sequence describes.
There are two main functions of biological databases:


  • Make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (a book, a site, or a database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article (genome sequences, for example).
  • Make biological data available in computer-readable form.



Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.
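To illustrate what "computer-readable" can mean in practice, here is a minimal sketch that reads sequence records from plain text. It assumes the widely used FASTA layout (a header line starting with ">" followed by sequence lines), which this article does not itself specify; the function name parse_fasta and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record ID to sequence."""
    parts_by_id = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the record ID
            words = line[1:].split()
            current_id = words[0] if words else ""
            parts_by_id[current_id] = []
        elif current_id is not None:
            # Sequence line: may be one of several lines for the same record
            parts_by_id[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in parts_by_id.items()}


# Hypothetical records, made up for illustration only
example = """>seq1 hypothetical DNA fragment
ATGGCCATTG
TAATGGGCC
>seq2 hypothetical peptide
MAIVMGR
"""

records = parse_fasta(example)
for record_id, sequence in records.items():
    print(record_id, len(sequence), sequence)
```

Once the data is in this form, a program can count, compare, or search the records directly, which is not possible with a printed page.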
Types of data generated by research:
  • Nucleotide sequences (DNA and mRNA)
  • Protein sequences
  • 3-D protein structures
  • Complete genomes and maps


Further research based on this information also yields:
  • Gene Expression data
  • Polymorphism 


Historical Aspects behind the Databases 

The idea of creating a database first arose when Sanger developed a method for sequencing proteins.
The first database was created shortly after the insulin protein sequence became available in 1956. Insulin was the first protein to be sequenced, and its sequence consists of just 51 residues (analogous to the letters in a sentence). In the mid-1960s the first nucleic acid sequence was determined, that of a yeast tRNA with 77 bases (the individual units of nucleic acids). During this period the 3D structures of proteins were also being studied, and the well-known Protein Data Bank was developed as the first protein structure database, with only 10 entries in 1972. It has since grown into a large database with over 10,000 entries. While the early protein sequence databases were maintained at individual laboratories, development of a consolidated formal database, the SWISS-PROT protein sequence database, began in 1986. It now holds about 70,000 protein sequences from more than 5000 model organisms, a small fraction of all known organisms. This wide variety of data resources is now available for study and research by both academic institutions and industry. The data is released as public domain information, in the larger interest of the research community, through the Internet and on CD-ROMs. These databases are constantly updated with additional entries.

Biological Databases 

Biological databases can be broadly classified into sequence databases and structure databases. Sequence databases cover both nucleic acid sequences and protein sequences, whereas structure databases mostly cover proteins.
Databases in general can be classified into:
  1. Primary Database 
  2. Secondary Database 
  3. Composite Databases 
  4. Structure Databases


A primary database contains only the sequence or structure information itself. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures. A primary database is where the data is originally deposited and maintained for further study.
Primary nucleotide sequence repositories: GenBank, EMBL, DDBJ
Primary protein sequence repositories: PIR-PSD (Protein Information Resource Protein Sequence Database) at the NBRF (National Biomedical Research Foundation, USA), and SWISS-PROT at the SIB (Swiss Institute of Bioinformatics, Switzerland)
A secondary database contains information derived from a primary database. A secondary sequence database holds information such as the conserved sequences, signature sequences, and active-site residues of protein families, derived from the multiple sequence alignment of a set of related proteins. A secondary structure database organizes the entries of the PDB, classifying them by structure (all-alpha proteins, all-beta proteins, and so on). It also contains information on the conserved secondary structure motifs of a particular protein. Secondary databases created and hosted by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, from the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.
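As a rough illustration of how conserved residues can be derived from a multiple sequence alignment, the sketch below reports the alignment columns in which every sequence carries the same residue. The three aligned sequences are invented for the example, the function name conserved_positions is hypothetical, and real secondary databases use far more sophisticated pattern and profile methods than this strict identity test.

```python
def conserved_positions(alignment):
    """Return (index, residue) for every column where all sequences agree.

    The input is a list of already aligned, equal-length sequences;
    '-' marks a gap, and columns containing a gap are never reported.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}  # distinct residues in column i
        if len(column) == 1 and "-" not in column:
            conserved.append((i, alignment[0][i]))
    return conserved


# Hypothetical aligned protein fragments (0-based column indices)
aligned = [
    "MKTAYIAKQR",
    "MKSAYLAKQR",
    "MRTAY-AKQK",
]

result = conserved_positions(aligned)
print(result)
# -> [(0, 'M'), (3, 'A'), (4, 'Y'), (6, 'A'), (7, 'K'), (8, 'Q')]
```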

A composite database combines a variety of primary database sources, which removes the need to search multiple resources. Different composite databases draw on different primary databases and apply different criteria in their search algorithms, and they offer various search options. The National Center for Biotechnology Information (NCBI) hosts such nucleotide and protein databases on a large, highly available, redundant array of computer servers and provides free access to researchers. It also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins involved in genetic diseases.

Structure databases, like sequence databases, come in two varieties: primary and secondary. Strictly speaking, there is only one database that stores primary structural data of biological macromolecules, namely the PDB. In the context of this database, the term macromolecule covers three orders of magnitude of molecular weight, from 1000 daltons to 1000 kilodaltons. Small biological and organic molecules have their structures stored in another primary structure database, the CSD (Cambridge Structural Database), which is also widely used in biological studies. It contains the three-dimensional structures of drugs, inhibitors, and fragments or monomers of macromolecules.


Application of the Biological Databases 

Sequence Analysis 

Every newly determined sequence is first compared with the sequences already present in the databases, to identify its function and to check whether it is novel. If the sequence is already in the databases, further study becomes easier; if not, all available information about it is collected and stored in the databases for future reference.
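The lookup described above can be sketched as follows. This is only a toy: the database is a small in-memory dictionary with invented entries, the function name search_database is hypothetical, and similarity is scored with Python's general-purpose difflib.SequenceMatcher rather than with a dedicated sequence alignment method of the kind real database search services use.

```python
from difflib import SequenceMatcher


def search_database(query, database):
    """Return (entry_id, similarity) for the entry closest to the query.

    An exact match returns a similarity of 1.0. Otherwise similarity is
    the SequenceMatcher ratio, a crude stand-in for an alignment score.
    """
    best_id, best_score = None, 0.0
    for entry_id, sequence in database.items():
        if sequence == query:
            return entry_id, 1.0  # sequence is already in the database
        score = SequenceMatcher(None, query, sequence, autojunk=False).ratio()
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id, best_score


# Hypothetical database entries, made up for illustration
toy_db = {
    "entry_A": "ATGGCCATTGTAATGGGCCGC",
    "entry_B": "ATGAAACGCATTAGCACCACC",
    "entry_C": "TTGACAGCTAGCTCAGTCCTA",
}

known = search_database("ATGAAACGCATTAGCACCACC", toy_db)
print(known)  # exact match with entry_B

# One base differs from entry_A, so there is no exact match
near = search_database("ATGGCCATTGTTATGGGCCGC", toy_db)
print(near[0], round(near[1], 2))
```

A query with no close relative in the database would return a low score, which is the signal that the sequence may be new and worth depositing.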

Prediction of Protein Structure 

It is relatively easy to determine the primary structure of a protein, that is, its amino acid sequence, because that sequence is encoded in the DNA. It is much harder to determine the secondary, tertiary, or quaternary structure. For this purpose, either crystallography is used or bioinformatics tools are used to predict the complex protein structure. By comparing new data with existing data, these tools can predict both function and structure.
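The step from DNA to primary structure can be shown in a few lines. The sketch below builds the standard genetic code table and translates a coding sequence codon by codon; it assumes the input is the coding strand, read in frame from its first base, and the example sequence and the function name translate are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into a one-letter protein sequence.

    Reading starts at the first base and stops at the first stop codon.
    Codons containing unexpected characters are translated as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCATTGTAATGGGCCGCTGA")
print(peptide)  # -> MAIVMGR
```

No comparably simple rule maps the resulting amino acid sequence to its folded shape, which is why the higher levels of structure need experiments or prediction tools.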

Genome Annotation 

In genome annotation, a genome is marked up to identify its protein-coding regions and regulatory sequences. It is a very important part of the Human Genome Project, because it is the step that identifies the regulatory sequences.

Comparative Genomics


Comparative genomics is the branch of bioinformatics that studies how genome structure and function are related across different biological species. For this purpose, intergenomic maps are constructed, which enable scientists to trace the evolutionary processes that occur in the genomes of different species. These maps record point mutations as well as duplications of large chromosomal segments, and the underlying data is extracted from the databases.
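At the smallest scale, recording point mutations amounts to comparing two sequences position by position. The sketch below assumes the two sequences are already aligned, have equal length, and differ only by substitutions; the sequences and the function name point_mutations are hypothetical, and real intergenomic comparisons also have to handle insertions, deletions, and rearrangements.

```python
def point_mutations(reference, other):
    """List (index, reference_base, other_base) for each differing position.

    Both sequences must be pre-aligned and of equal length; indices are 0-based.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, other_base)
        for i, (ref_base, other_base) in enumerate(zip(reference, other))
        if ref_base != other_base
    ]


# Hypothetical aligned fragments from two species
mutations = point_mutations("ATGGCCATTG", "ATGACCATCG")
print(mutations)  # -> [(3, 'G', 'A'), (8, 'T', 'C')]
```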

Health and Drug discovery

The tools of bioinformatics are also helpful in drug discovery, diagnosis, and disease management. Complete sequencing of human genes has enabled scientists to make medicines and drugs that can target more than 500 genes. Computational tools and well-defined drug targets have made drug delivery easier and more specific, because only the diseased or mutated cells need to be targeted. The molecular basis of a disease, once stored in the databases, is also easier to look up.

Conclusion 

The present challenge is to handle the huge volume of data, such as that generated by the Human Genome Project, to improve database design, to develop software for database access and manipulation, and to devise data-entry procedures that compensate for the varied computer procedures and systems used in different laboratories. There is no doubt that bioinformatics tools for efficient research will have a significant impact on the biological sciences and on the betterment of human lives.
