Wednesday, June 13, 2007

Information Retrieval

  1. INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

    • To review, GenBank is an annotated collection of all publicly available DNA and protein sequences and is maintained by the National Center for Biotechnology Information (NCBI). As of this writing, GenBank contains 7 million sequence records covering almost 9 billion nucleotide bases. Sequences find their way into GenBank in several ways, most often by direct submission by individual investigators through tools such as Sequin or through “direct deposit” by large genome sequencing centers.

    • GenBank, or any other biological database for that matter, serves little purpose unless the database can be easily searched and entries retrieved in a usable, meaningful format. Otherwise, no one can make use of the information hidden within these millions of bases and amino acids, and the sequencing efforts behind them go to waste. Much effort has gone into making such data accessible to the average user, and the programs and interfaces resulting from these efforts are the focus of this chapter. The discussion centers on querying the NCBI databases because these more “general” repositories are far and away the ones most often accessed by biologists, but attention is also given to a number of smaller, specialized databases that provide information not necessarily found in GenBank.

    Search Engines

    1. Entrez (NCBI)

    Entrez is a retrieval system for searching several linked databases.

    It provides access to:

    PubMed : The biomedical literature.

    GenBank : Nucleotide sequence database; protein sequence database

    Structure : three-dimensional macromolecular structures

    Genome : complete genome assemblies

    OMIM : Online Mendelian Inheritance in Man

    Taxonomy : Organisms in GenBank

    2. SRS (EBI and DDBJ)

    SRS is a data retrieval system that integrates heterogeneous databanks in molecular biology and genome analysis. There are currently several dozen servers worldwide that provide access to over 300 different databanks via the World Wide Web. Additional technology to integrate externally developed applications into the package gives novel and powerful capabilities for biological data analysis.

    Sequence Retrieval tools: ENTREZ from NCBI and SRS (Sequence Retrieval System) from EBI

    Sequence Submission tools: Sequin and BankIt from NCBI and WebIn from EBI

    1. Getting Started with NCBI and Entrez

    • Entrez is the common front-end to all the databases maintained by the NCBI and is an extremely easy system to use. The Entrez main page, as with all NCBI pages, is undemanding in its browser requirements and downloads quickly. Part of the front page is illustrated in the figure. The databases available for searching can be accessed by hyperlinks at the top of the page, or by using the drop-down menu as shown. Once a database has been selected, a search term is entered in the space provided. The search term may be a single word or a Boolean phrase. Clicking on ‘GO’ initiates the search. Hits in the selected database are displayed, together with related records in the same database (known as neighbors) and matching records in other Entrez databases (known as links). Neighbors are ordered by similarity, based on precomputed analysis of sequences, structures or the literature.

    Table – 1. The databases covered by Entrez, listed by category

    Category : Database

    Nucleic acid sequences : Entrez Nucleotides; sequences obtained from GenBank

    Protein sequences : Entrez Protein; sequences obtained from SWISS-PROT, PIR, PRF, PDB and translations from annotated coding regions in GenBank and RefSeq

    3D structures : Entrez Molecular Modeling Database (MMDB)

    PopSet : From GenBank; sets of DNA sequences that have been collected to analyse the evolutionary relatedness of a population

    OMIM : Online Mendelian Inheritance in Man

    Taxonomy : NCBI Taxonomy Database

    Books : Bookshelf

    ProbeSet : Gene Expression Omnibus (GEO)

    3D domains : Domains from the Entrez Molecular Modeling Database (MMDB)

    Literature : PubMed
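
    The search steps described above (pick a database, enter a single word or a Boolean phrase, start the search) can also be scripted. The sketch below is only an illustration, not part of the original discussion: it assumes NCBI's E-utilities ESearch endpoint and its db, term and retmax parameters, and it only builds the request URL without contacting the server. The helper names boolean_phrase and build_search_url are made up for this example.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities ESearch endpoint (not mentioned in the text above).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_phrase(terms, operator="AND"):
    """Join single search words into a Boolean phrase such as 'a AND b'."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    return f" {operator} ".join(terms)


def build_search_url(database, term, max_hits=20):
    """Build (but do not send) a search request for one Entrez database."""
    query = urlencode({"db": database, "term": term, "retmax": max_hits})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    # A Boolean phrase searched against the nucleotide database.
    phrase = boolean_phrase(["insulin", "human"])
    print(build_search_url("nucleotide", phrase))
```

    Keeping URL construction separate from any network call makes the example easy to check by eye; fetching the URL and parsing the reply would be the next step.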

2. DBGET/LinkDB

  • DBGET is an integrated data retrieval system developed and jointly maintained by the Institute for Chemical Research (Kyoto University) and the Human Genome Center (University of Tokyo). It is integrated with more than 20 databases (Table 2), which can be searched one at a time or in combination using the commands bfind (for text searches) or bget (for searches based on accession number). Hits are presented as a list of results together with any available and cited information. LinkDB is an associated database of links (binary relationships) between entries in the different databases available to DBGET and further organism-specific databases, such as the C. elegans database (AceDB), FlyBase and the Saccharomyces Genome Database (SGD). DBGET is closely associated with KEGG, the Kyoto Encyclopedia of Genes and Genomes, which is maintained by the same group.
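
  To make the difference between the two kinds of search and the idea of binary links concrete, here is a toy model in Python. It is a sketch under loud assumptions: the functions only borrow the names bfind and bget, they are not the real DBGET commands, and every database name, accession number and entry below is invented for illustration.

```python
# Toy in-memory "databases"; all names, accessions and texts are invented.
DATABASES = {
    "genes": {
        "G0001": "alcohol dehydrogenase, yeast",
        "G0002": "hexokinase, human",
    },
    "enzymes": {
        "E0001": "alcohol dehydrogenase",
    },
}

# LinkDB-style binary relationships: each link joins exactly two entries.
LINKS = {
    (("genes", "G0001"), ("enzymes", "E0001")),
}


def bget(database, accession):
    """Accession-based retrieval: return the entry text, or None if absent."""
    return DATABASES.get(database, {}).get(accession)


def bfind(keyword, databases=None):
    """Text search over one or several databases; returns (database, accession) hits."""
    names = databases if databases is not None else list(DATABASES)
    keyword = keyword.lower()
    return sorted(
        (name, accession)
        for name in names
        for accession, text in DATABASES.get(name, {}).items()
        if keyword in text.lower()
    )


def linked_entries(database, accession):
    """Follow binary links in either direction from one entry."""
    entry = (database, accession)
    return sorted(b if a == entry else a for a, b in LINKS if entry in (a, b))


if __name__ == "__main__":
    print(bfind("dehydrogenase"))            # text search across all databases
    print(bget("genes", "G0002"))            # lookup by accession number
    print(linked_entries("genes", "G0001"))  # cross-database links
```

  Storing each link as a pair of (database, accession) tuples mirrors the "binary relationship" idea: a link says nothing more than that two entries are connected, and it can be followed from either end.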
