bioinformatics

 

Week 1 lecture

"week1.ppt"

Bioinformatics

Bioinformatics text (Baxevanis & Ouellette): the intersection of molecular and computational biology
  • My favorite: the science of making sense of biological information.
  • Bioinformatics combines the tools and techniques of mathematics, computer science and biology in order to understand the biological significance of a variety of data.

    The successful use of bioinformatics requires a thorough understanding of modern biochemistry and molecular biology, as well as the ability to use techniques from computer science, information technology and mathematics. Ref: Brutlag, Genomics, Bioinformatics and Medicine, http://cmgm.stanford.edu/biochem118/

    Applications for bioinformatics

    Turning raw sequence data from genome projects into useful information about gene function, protein structure, molecular evolution, drug targets and disease mechanisms.

    Linking sequence data with new types of experimental information obtained from microarray and proteomics experiments.

    Identifying/constructing new biochemical pathways.

    Who knows? The only limit is our imagination!

    Course Objectives

    1. Identifying the sorts of activities that are classified as bioinformatics.

    2. Learning about the sorts of databases available for bioinformatics

    3. Pairwise sequence alignment.

    4. Aligning multiple sequences.

    5. Constructing molecular phylogenies

    6. Predicting structures from a sequence.

    7. Visualizing molecules

    8. Analyzing DNA sequences and designing primers.

    9. Finding genes in raw DNA sequences.

    10. Whole genome analysis

    11. DNA fingerprinting.

    12. Metabolic simulation

    13. Artificial intelligence / life

    DATABASES

    Bioinformatics requires rapid access to a large volume of DNA and protein sequence data

    Data must be organized in a common format that can be shared by multiple users (humans and computers)

    Data must be stored in a database = collection of information stored in an organized form on a computer

    Database management system (DBMS) = system that can manipulate data in a large collection of files, cross-referencing between files as needed.
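
    The cross-referencing idea can be sketched with a tiny in-memory database. This is only an illustration using Python's built-in sqlite3 module, not how NCBI stores its data; the table names, column names and the DEMO0001 record are all made up for the example.

```python
import sqlite3

# Hypothetical two-table layout; every name and value here is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequences (
        accession TEXT PRIMARY KEY,
        organism  TEXT,
        residues  TEXT
    );
    CREATE TABLE annotations (
        accession TEXT REFERENCES sequences(accession),
        feature   TEXT,
        start     INTEGER,
        stop      INTEGER
    );
""")
conn.execute("INSERT INTO sequences VALUES (?, ?, ?)",
             ("DEMO0001", "Example organism", "ATGGCCTAA"))
conn.execute("INSERT INTO annotations VALUES (?, ?, ?, ?)",
             ("DEMO0001", "CDS", 1, 9))


def features_for(organism):
    """Cross-reference both tables: list the features of every sequence from one organism."""
    rows = conn.execute(
        "SELECT s.accession, a.feature, a.start, a.stop "
        "FROM sequences AS s JOIN annotations AS a ON a.accession = s.accession "
        "WHERE s.organism = ? ORDER BY s.accession, a.start",
        (organism,),
    )
    return rows.fetchall()


print(features_for("Example organism"))
```

    The query never says which file holds which piece of information; the management system joins the two tables on the shared accession field.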

    National Center for Biotechnology Information (NCBI) is a DBMS:

    • GENBANK is the database
    • ENTREZ, etc. are the managers

    Components:

    DBMS:

    1) Hardware: server that can handle requests from multiple clients and that can store very large amounts of data

    2) DATA

    • sequence
    • annotation
      • must be in a format that can be searched in many ways = by many fields

    for example, every entry in GENBANK is a "record" containing many fields.

    3) Software: allows you to search the database. E.g. Genbank can be searched in several ways: the application ENTREZ searches by text; a different application, BLAST, finds and aligns related sequences; and a third application, VAST, finds and aligns related structures.

    4) Users: the consumers, both humans and the computers they are operating!

    • a problem, because formats that make sense to humans aren't so good for computers, and vice-versa
    • Originally, all the human consumers were very knowledgeable in both computers and molecular biology.
    • Now the general public can access raw data over the internet, making user-friendliness a major problem!

    General principles

    • Data must be organized in a common format that can be shared by multiple users (humans and computers)
      • Formats that make sense to humans often are not best for computers!
      • The Genbank flatfile is good for humans but only OK for computers. More machine-friendly alternatives such as ASN.1 records are therefore gaining favor.
    • Must eliminate redundancy
    • Must avoid inconsistency
    • Must enforce integrity -> proofreading!
    • Must maintain security
    • General rule: search the smallest database likely to contain your target!
      • less likely to get false positives
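
    As a rough illustration of "eliminate redundancy", the sketch below keeps only the first record for each distinct sequence. It assumes the simplest possible rule (exact match, ignoring case); real non-redundant collections use more elaborate criteria, and the function name and the accessions in the usage line are made up.

```python
def remove_redundant(records):
    """Keep the first (accession, sequence) pair seen for each distinct sequence.

    Assumption for this sketch: two entries are redundant only if their
    sequences are identical after ignoring upper/lower case.
    """
    seen = set()
    unique = []
    for accession, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((accession, sequence))
    return unique


# Hypothetical accessions; the second entry duplicates the first.
print(remove_redundant([("A1", "ATGC"), ("A2", "atgc"), ("A3", "GGCC")]))
```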

    Types of Databases

    • Primary databases = depositories of raw data

      e.g. Genbank is a depository of raw DNA sequence data ( http://www.ncbi.nlm.nih.gov )

    • Secondary databases = depositories of annotated and derived data

      e.g. most protein sequences deposited at databanks such as SWISSPROT ( http://www.expasy.ch/prosite/ ) are now derived from nucleic acid sequences!

      Irony: protein sequence databases came first!

    NCBI (http://www.ncbi.nlm.nih.gov) has many databases

    The "search" window on the NCBI homepage lets you select which one you want

    Genbank/EMBL/DDBJ are the primary databases for nucleotide sequences.

    A collaboration between NCBI at NIH http://www.ncbi.nlm.nih.gov

    DDBJ (DNA Database of Japan) http://www.ddbj.nig.ac.jp

    and EBI (European Bioinformatics Institute) http://www.ebi.ac.uk

    You can submit to any one of them: all three exchange information daily

    They store information in (slightly) different formats, with different information systems and different search tools

    • I search all three depending on what I'm looking for and how busy each one is

    subdivided into many divisions

    • nr: non-redundant: most sequences, but does not include EST
    • EST: expressed sequence tags: brute force sequencing of cDNAs
      • full of errors, especially at 5' and 3' ends
    • other divisions include dbSTS, RefSeq, dbSNP.

    most sequences are annotated: combined primary and secondary data

    ENTREZ is the NCBI data retrieval system (DDBJ and EBI use different data retrieval systems)
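
    Besides the web search window, ENTREZ can also be queried by programs. The lecture does not cover this, so treat the following as an assumption-laden sketch: it only builds a request URL for NCBI's E-utilities "efetch" service (no network request is made), and the parameter values should be checked against NCBI's current documentation. The helper name efetch_url is made up.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (an assumption of this sketch;
# not part of the lecture notes).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build a URL asking for one nucleotide record as a plain-text flatfile."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_EFETCH}?{query}"


# X94702 is the example accession used later in these notes.
print(efetch_url("X94702"))
```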

    To learn more about Genbank go to http://www.ncbi.nlm.nih.gov/About/index.html

    Genbank Flatfiles

    The form in which DNA sequence data is stored: i.e. GENBANK's reason for existence

    Each one is a record (e.g. X94702) with 3 parts

    1) Header: Information applying to entire record

    • Locus = relic from time when computers were slower, memory was more expensive and many fewer sequences were deposited
      • "Characterizes" sequences in <10 characters
      • retained because it is still used by many bioinformatics applications
    • INV: Genbank division (INV = invertebrate sequences)
    • Date: when record was last made public
    • Definition: "summarizes" the biology
    • Accession: Reference for the record
      • what gets cited
      • Does not change if sequence is updated!
    • NID/Version: each update gets a different version number and a different gi (geninfo identifier)
    • Keywords: another relic; not widely used because keywords were never applied consistently
    • Source: organism the sequence comes from
    • Reference: Lists authors, their addresses, and whether the data is a direct submission or has been published in one or more articles
    • Some entries also have comments
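
    The difference between a stable accession and a changing version can be made concrete. The sketch below assumes the "accession.version" notation used on the VERSION line of flatfiles; the ".1" suffix in the usage line is chosen for illustration, not looked up, and the function name is made up.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number).

    The accession is what gets cited and stays fixed; the number after the
    dot changes when the sequence is updated. Returns None for the version
    when the identifier carries no suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


print(split_version("X94702.1"))  # version suffix is illustrative
print(split_version("X94702"))
```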

    2) Features: Annotations on the sequence

    • Source: all the info on the source of the DNA clone from which the sequence was obtained, e.g. name of clone (note that this source is different from the source in the header)
    • Gene: where each gene starts and stops (useful in genomic sequences that contain multiple genes)
    • CDS: coding sequence
      • Where each gene's instructions for making proteins start & stop
      • protein id: geninfo identifier of the encoded protein. This is a separate genbank entry which you can open by clicking on the hypertext
    • Translation of the protein
      • The computer-generated translation of the coding sequence (rarely experimentally-verified)
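
    A computer-generated translation of this kind can be sketched in a few lines. The example assumes the standard genetic code and a coding sequence that is already in frame; it is not the program Genbank itself uses, and the short test sequence is made up.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # made-up sequence: Met, Ala, stop
```

    Because the translation is derived mechanically from the annotated start and stop, any error in the CDS coordinates or in the DNA sequence carries straight through to the protein entry.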

    3) Sequence itself
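
    The three-part layout makes flatfiles easy to split by program. The sketch below assumes the usual keywords that open each part (FEATURES for the annotations, ORIGIN for the sequence, // to end the record). The sample record is made up and heavily shortened, and the function name is hypothetical.

```python
# Made-up, shortened record for illustration only; not a real Genbank entry.
SAMPLE = """\
LOCUS       DEMO0001     9 bp    DNA     INV       01-JAN-2003
DEFINITION  Made-up record for illustration only.
ACCESSION   DEMO0001
FEATURES             Location/Qualifiers
     source          1..9
     CDS             1..9
ORIGIN
        1 atggcctaa
//
"""


def split_flatfile(text):
    """Return (header lines, feature lines, sequence string) for one record."""
    parts = {"header": [], "features": [], "sequence": []}
    section = "header"
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):  # start of the annotations
            section = "features"
            continue
        if line.startswith("ORIGIN"):    # start of the sequence itself
            section = "sequence"
            continue
        parts[section].append(line)
    # Drop the position numbers and spaces, keeping only the letters.
    residues = "".join(ch for line in parts["sequence"] for ch in line if ch.isalpha())
    return parts["header"], parts["features"], residues.upper()


header, features, sequence = split_flatfile(SAMPLE)
print(len(header), len(features), sequence)
```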

    Other genomic databases

    Many companies and private research institutes have their own databases

    e.g. TIGR (The Institute for Genomic Research) http://www.tigr.org/tdb/ , Incyte Genomics http://www.incyte.com/index.shtml

    some allow free access to academic users

    others charge for access

    There are also many databases devoted to specific organisms

    Other biological databases

    There are many databases devoted to more specific types of biological information

    Some examples of databases we will be searching later in the course:

    Databases devoted to human genes

    Databases devoted to genes in other organisms

    Databases devoted to protein structure

    Databases devoted to enzymes and metabolic pathways

    Databases devoted to results from microarray experiments: analysis of all the genes expressed in a particular tissue

    Databases devoted to results from proteomics experiments: analysis of all the proteins present in a particular tissue

    Finding databases

    • Search engine queries
    • Surfing links
    • Searching likely institutions/organizations
    • Asking experts

    Using databases

    Specific queries

    Browsing

  • Last update: Monday, January 13, 2003 at 11:50:27 AM.