Bioinformatics combines the tools and
techniques of mathematics, computer science and biology in order
to understand the biological significance of a variety of
data.
The successful use of bioinformatics
requires a thorough understanding of modern biochemistry and
molecular biology as well as the ability to use techniques from computer
science, information technology and mathematics.
Ref: Brutlag: Genomics, Bioinformatics and Medicine
http://cmgm.stanford.edu/biochem118/

Applications for bioinformatics
- Turning raw sequence data from genome projects into useful
information about gene function, protein structure, molecular
evolution, drug targets and disease mechanisms.
- Linking sequence data with new types of
experimental information obtained from microarrays and
proteomics.
- Identifying/constructing new biochemical
pathways.
- Who knows? The only limit is our
imagination!

Course Objectives
1. Identifying the sorts of activities
that are classified as bioinformatics.
2. Learning about the sorts of databases
available for bioinformatics
3. Pairwise sequence
alignment.
4. Aligning multiple
sequences.
5. Constructing molecular
phylogenies
6. Predicting structures from a
sequence.
7. Visualizing molecules
8. Analyzing DNA sequences and designing
primers.
9. Finding genes in raw DNA
sequences.
10. Whole genome analysis
11. DNA fingerprinting.
12. Metabolic simulation
13. Artificial intelligence /
artificial life
DATABASES
Bioinformatics requires rapid access to a
large volume of DNA and protein sequence data
Data must be organized in a common format
that can be shared by multiple users (humans and
computers)
Data must be stored in a database =
collection of information stored in an organized form on a
computer
Database management system (DBMS) =
system that can manipulate data in a large collection of files,
cross-referencing between files as needed.

The National Center for Biotechnology
Information (NCBI) works like a DBMS:
- GENBANK is the database
- ENTREZ, etc. are the
managers
Components:

1) Hardware: server that can handle
requests from multiple clients and that can store very large amounts
of data
2) DATA
- sequence
- annotation
- must be in a format that can be
searched in many ways = by many fields

for example, every entry in GENBANK is a
"record" containing many fields.

3) Software: allows you to search the
database, e.g. Genbank can be searched in multiple ways: the
application ENTREZ searches by text; a different application, BLAST,
finds and aligns related sequences; and a third application, VAST,
finds and aligns related structures.
4) Users: the consumers, both humans and
the computers they are operating!
- a problem, because formats that make
sense to humans aren't so good for computers, and
vice versa
- Originally, all the human consumers
were very knowledgeable in both computers and molecular biology.
- Now the general public can access raw
data over the internet, making user-friendliness a major
problem!
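
The idea that every entry is a "record" with many fields, any of which can be searched, can be sketched in a few lines of Python. This is only an illustration: the field names, the search helper and the three toy entries are invented for the example and do not reflect how Genbank or ENTREZ are actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One database entry; the field names are illustrative only."""
    accession: str
    organism: str
    definition: str
    sequence: str
    keywords: list = field(default_factory=list)


def search(records, **criteria):
    """Return records whose named fields contain the given text.

    Matching is case-insensitive, and every criterion must match.
    """
    hits = []
    for rec in records:
        if all(text.lower() in str(getattr(rec, name)).lower()
               for name, text in criteria.items()):
            hits.append(rec)
    return hits


# Toy data invented for illustration; these are not real Genbank entries.
records = [
    Record("TOY00001", "Homo sapiens", "hypothetical kinase mRNA", "ATGGCC"),
    Record("TOY00002", "Mus musculus", "hypothetical kinase gene", "ATGGCA"),
    Record("TOY00003", "Homo sapiens", "hypothetical actin mRNA", "ATGGAT"),
]

# Search by one field, then by two fields at once.
by_organism = search(records, organism="homo sapiens")
by_two_fields = search(records, organism="homo", definition="kinase")
print([r.accession for r in by_organism])
print([r.accession for r in by_two_fields])
```

Because each piece of information sits in its own named field, the same collection can be queried by organism, by definition, or by any combination, which is what "searched in many ways" means in practice.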
General principles
- Data must be organized in a common
format that can be shared by multiple users (humans and
computers)
- Formats that make sense to humans
often are not best for computers!
- The Genbank flatfile is good for
humans, OK for computers. Alternatives such as ASN.1 records
are therefore gaining favor.
- Must eliminate redundancy
- Must avoid inconsistency
- Must enforce integrity ->
proofreading!
- Must maintain security
- General rule: search the smallest
database likely to contain your target!
- less likely to get false
positives
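
A rough way to see why a smaller database gives fewer false positives: if each unrelated entry has some small probability of matching the query by chance, the expected number of chance hits grows in proportion to the number of entries searched. The sketch below uses that simplified model; the probability and the database sizes are made-up numbers chosen only to show the scaling.

```python
def expected_chance_hits(p_chance, n_unrelated):
    """Expected number of false positives under a simple model in which
    each unrelated entry matches by chance with probability p_chance."""
    return p_chance * n_unrelated


# Assumed values, for illustration only.
p = 1e-6                   # chance-match probability per unrelated entry
small_db_entries = 10_000
large_db_entries = 10_000_000

small = expected_chance_hits(p, small_db_entries)
large = expected_chance_hits(p, large_db_entries)
print(f"small database: {small:.2f} expected chance hits")
print(f"large database: {large:.2f} expected chance hits")
```

With the same query and the same per-entry chance of a spurious match, the database that is 1000 times larger is expected to return 1000 times as many chance hits.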
Types of Databases
- Primary databases = depositories of
raw data
e.g. Genbank is a depository of raw
DNA sequence data (http://www.ncbi.nlm.nih.gov)
- Secondary databases = depositories of
annotated and derived data
e.g. most protein sequences deposited
at databanks such as SWISSPROT (http://www.expasy.ch/prosite/)
are now derived from nucleic acid sequences!
Irony: protein sequence databases came
first!
NCBI (http://www.ncbi.nlm.nih.gov) has many databases
The "search" window on the NCBI homepage
lets you select which one you want
Genbank/EMBL/DDBJ are the primary
databases for nucleotide sequences.
- A collaboration between NCBI
at NIH (http://www.ncbi.nlm.nih.gov),
DDBJ (DNA Database of Japan,
http://www.ddbj.nig.ac.jp)
and EBI (European Bioinformatics
Institute, http://www.ebi.ac.uk)
- Can submit to any one: all three
exchange information daily
- Each stores info in a (slightly) different
format, with different information systems and different search
tools
- I search all three depending on what
I'm looking for and how busy each one is
Genbank is subdivided into many divisions
- nr: non-redundant: most sequences,
but does not include ESTs
- EST: expressed sequence tags: brute
force sequencing of cDNAs
- full of errors, especially at the 5'
and 3' ends
- other divisions include dbSTS,
RefSeq, dbSNP.
Most sequences are annotated: combined
primary and secondary data
ENTREZ is the NCBI data retrieval system
(DDBJ and EBI use different data retrieval systems)
To learn more about Genbank go to
http://www.ncbi.nlm.nih.gov/About/index.html
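
Besides the web pages, NCBI also offers a programmatic interface to ENTREZ called E-utilities, which these notes do not otherwise cover. As a minimal sketch, the code below only builds the address that would request one nucleotide record as a flatfile; it does not contact NCBI. The helper name is made up, and the parameter values (db, rettype, retmode) are written from memory of the E-utilities conventions, so check them against NCBI's current documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build (but do not send) an EFetch URL for a single record.

    efetch_url is a hypothetical helper written for this example.
    """
    query = urlencode({
        "db": db,            # which database to search
        "id": accession,     # accession number of the record
        "rettype": rettype,  # "gb" asks for the flatfile form
        "retmode": retmode,  # plain text rather than XML
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


url = efetch_url("X94702")
print(url)
```

Passing the resulting address to urllib.request.urlopen would download the record; NCBI asks automated users to keep request rates low, so that step is left out here.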
Genbank Flatfiles
The form in which DNA
sequence data is stored, i.e. GENBANK's reason for
existence
Each one is a record (e.g. X94702) with 3
parts
1) Header: Information applying to entire
record
- Locus = relic from time when
computers were slower, memory was more expensive and many fewer
sequences were deposited
- "Characterizes" sequences in
<10 characters
- retained because it is still used
by many bioinformatics applications
- INV: Genbank division (INV =
invertebrate sequences)
- Date: when record was last made
public
- Definition: "summarizes" the
biology
- Accession: Reference for the record
- what gets cited
- Does not change if sequence is
updated!
- NID/Version: each update has a
different version number and a different gi (geninfo identifier)
- Keywords: another relic that isn't
widely used because keywords were never applied consistently
- Source: organism the sequence comes
from
- Reference: lists authors, their addresses, and whether the data
is a direct submission or has been published in one or more articles
- Some entries also have
comments
2) Features: Annotations on the
sequence
- Source: all the info on the source of the DNA clone from which the
sequence was obtained, e.g. name of clone (note that this source is
different from the source in the header)
- Gene: where each gene starts and
stops (useful in genomic sequences that contain multiple
genes)
- CDS: coding sequence
- Where each gene's instructions for
making proteins start & stop
- protein id: geninfo identifier of
the encoded protein. This is a separate Genbank entry which
you can open by clicking on the hyperlink
- Translation of the protein
- The computer-generated translation
of the coding sequence (rarely
experimentally verified)
3) Sequence itself
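
The three-part layout described above (header, features, sequence) is regular enough that a program can split a record apart by looking at the keyword that starts each line. The sketch below does this for a tiny made-up record; TOY00001 and its contents are invented for illustration. Real flatfiles have more keywords, fields that wrap over several lines, and nested feature qualifiers, so an established parser is a better choice for real work.

```python
# A heavily abbreviated, invented record in flatfile-like layout.
TOY_RECORD = """\
LOCUS       TOY00001     24 bp    DNA     linear   INV 01-JAN-2000
DEFINITION  Hypothetical toy record, invented for illustration.
ACCESSION   TOY00001
VERSION     TOY00001.2
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggccaaag gtctgtaaat ggcc
//
"""


def split_flatfile(text):
    """Split a record into header lines, feature lines and the sequence."""
    header, features, seq_parts = [], [], []
    section = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
            continue
        if line.startswith("ORIGIN"):
            section = "sequence"
            continue
        if line.startswith("//"):  # end-of-record marker
            break
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line.strip())
        else:
            # Drop the position number and spaces, keep only the bases.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    return header, features, "".join(seq_parts)


def header_field(header, keyword):
    """Return the text that follows a header keyword such as ACCESSION."""
    for line in header:
        if line.startswith(keyword):
            return line[len(keyword):].strip()
    return None


header, features, sequence = split_flatfile(TOY_RECORD)
# The VERSION line joins the stable accession and the update number.
accession, version = header_field(header, "VERSION").split(".")
print(accession, version, len(sequence), features)
```

Note how the accession stays the same across updates while the number after the dot changes, which is why the accession is what gets cited.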

Other genomic databases
Many companies and institutes have their own
databases
e.g. TIGR
http://www.tigr.org/tdb/ , Incyte Genomics
http://www.incyte.com/index.shtml
- some allow free access to academic
users
- others charge for access
There are also many databases devoted to
specific organisms e.g.
Other biological databases
There are many databases
devoted to more specific types of biological information
Some examples of databases we will be
searching later in the course:
Databases devoted to human
genes
Databases devoted to genes in other
organisms
Databases devoted to protein
structure
Databases devoted to enzymes and
metabolic pathways
Databases devoted to results from
microarray experiments: analysis of all the genes expressed in a
particular tissue
Databases devoted to results from
proteomics experiments: analysis of all the proteins present in a
particular tissue
Databases devoted to results from
metabolomics experiments: analysis of all the metabolites present
in a particular tissue
Finding databases
- Search engine queries
- Surfing links
- Searching likely
institutions/organizations
- Asking experts
Using databases
Specific queries
Browsing
