week 2 lecture
Using databases
Specific queries
- by name (or text)
- by sequence alignment
- by structure alignment
Browsing
- by organism
- by key word
- by chromosome
Data mining
- The exploration & analysis of
large quantities of data in order to discover meaningful (and
previously unknown) patterns
- A major division of computer
science!

Data mining
An introduction to the field
is provided at http://www.the-data-mine.com/
A journal devoted to the field can be
read at
http://www.digimine.com/usama/datamine/
Lots of information and links to many
aspects of data mining are posted at
http://www.kdnuggets.com/software/index.html
The "tools for data mining" at NCBI
describes the services available at NCBI
http://www.ncbi.nlm.nih.gov/Tools/index.html
http://industry.ebi.ac.uk/~brazma/dm.html
provides information about biological datamining
Another good source of information is: http://bioinformatics.weizmann.ac.il/cards/knowledge.html
Many companies make their living mining
biological databases
http://www.gene.com/gene/research/biotechnology/genomics.jsp
Other examples are Millenium Pharmaceuticals http://www.millennium.com/ and Human Genome Sciences http://www.hgsi.com/
Anthony Petrolonis is a recent Wilkes Biology graduate who works for Millenium Pharmaceuticals http://www.millennium.com/ and Michael Donahue is a recent Wilkes Biology graduate who works for Human Genome Sciences http://www.hgsi.com/
One major purpose of datamining is
knowledge discovery
Two types of knowledge
discovery
- Directed (similar to hypothesis
testing)
- Purpose: explain a target field in
terms of the others
- pick a target field based on a
hypothesis and ask the algorithm to classify or predict
it
- example: look for all genes that
contain a homeobox
- example: align twenty bacterial
genomes and look for the trp operon and nearest
neighbors
- Undirected
- purpose: finding interesting
patterns
- example: align twenty bacterial
genomes and look for sequences that tend to occur in the same
place
- example: find all genes that are
turned on in particular cancers, or that respond in the same
way to various treatments.
Undirected KD helps identify
relationships and directed KD helps explain them
Common tasks for data mining
- description
- classification: placing in
categories based on characteristics
- e.g. homeobox proteins, or
G-protein-linked receptors
- Affinity grouping : grouping by
relations (not characteristics)
- e.g. genes activated by heat
shock
- Clustering: segmenting a diverse
population into similar groups according to some similarity
measure
- no pre-defined classes and no
examples!
- e.g. genes that occur in the same
location in bacterial genomes
- Estimation: rules that explain how
to estimate a value given characteristics
- Prediction: rules that explain how to
predict a future value given characteristics
Applications for data mining in
bioinformatics
- Identifying genes from different
organisms that are related to each other
- Identifying genes within the same
organism that are related to each other
- helps to infer the function of
unknown genes identified by sequencing genomes or cDNAs if find
portions that are related to genes of known function
- Identifying genes that are expressed
together in microarray experiments
- Identifying genes that are physically
located next to each other in bacteria
- Hypothesis: they are part of same
biochemical pathway
- Identifying other structures that are
functionally related
Many companies make their living mining
biological databases: the challenge is writing software that can
identify meaningful patterns
- We don't fully understand molecular
biology
- Molecular biology has many quirks
that make programming difficult
- Redundancy
- DNA is
double-stranded
- Many organisms are diploid (or
more)
- Many genes are present in
multiple copies
- The genetic code is
degenerate
- RNA processing, e.g. Splicing and
editing
- Some mutations have much more
effect than others
- Some amino acid substitutions have
much more effect than others on protein structure
- Genes are 1-D but function depends
on the 3-D structure of their products
- many proteins are modified after
synthesis in ways that are presently difficult to
predict
- Many genes have multiple
functions
- many genes (or proteins) interact
in ways that are presently difficult to predict
A common task for data mining in
bioinformatics is searching data bases for specific types of
sequences
- related genes
- sequences which occur in specific
locations
- interesting patterns
- assembling sequences of entire
genomes
General approach is to compare a query
sequence with all the sequences in the database: somehow the
sequences must be aligned.
References for sequence
alignment
http://www.psc.edu/biomed/training/tutorials/sequence/db/index.html
http://www.umanitoba.ca/faculties/afs/plant_science/courses/39_769/lec04/lec04.1.html
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html
http://www.ncbi.nlm.nih.gov:80/BLAST/tutorial/Altschul-1.html
http://www.inf.ethz.ch/personal/cannaroz/courses/compbio/week2/week2/week2.html
Underlying assumptions:
- all life evolved from a common
ancestor: therefore all genes and proteins have diverged from a
common origin.
- For example, many English and
French words have evolved from a common ancestor, and we can
recognize that they are related:
- The same is true for many English
and German words
- We therefore assume that we can
calculate the probability 2 sequences had a common ancestor and
estimate the time since they had a common ancestor
- We also assume that evolution
occurred stepwise via a series of mutations
- We assume that evolution obeys the
rules of DNA replication and repair
- bases change by transitions ( pyr
-> pyr or pur -> pur) , transversions (pyr ->pur or
pur ->pyr) or by insertions and deletions
- we can estimate the probability of
each kind of mutation"




- selection acts on phenotype,
therefore, some mutations will have no effect, others will be
lethal!
These assumptions have been used to
construct mathematical models of evolution based on probability.
These models are then used to estimate
- the most likely way these sequences
could have evolved from a common ancestor
- the probability this
happened
- evolutionary distance is estimated
from number of steps and probability of each one
occurring
Many models are called "hidden Markov
models" ref: http://www.msci.memphis.edu/~giri/compbio/papers/hmmshamir.pdf
They are based on Markov
chains in which the probability of moving from one state to the next
at each step depends only on state it is in
- For example, the probability of
mutating from an A to a G depends only on what the base is now,
not on whether 10 generations ago it was a T.
The probability of moving from one state
to the next at each step is called a transition
probability
If know (or estimate) probability of each
possible event can arrange in a transition matrix based on
probability a particular mutation can occur and will be
allowed
can estimate probability of reaching a
particular state in n number of steps
compare observed events with probability
matrix
Definitions
- Similar - The degree to which
two species or populations share identities.
- Homologous - characters that are
similar due to common ancestry
- Analogous - characters that are
similar due to convergent evolution
- Orthologous - characters that are
homologous with conserved function
- Paralogous - characters that are
homologous with divergent function.
similarity is a
measurement!
Homology is a judgement!
Computers measure
similarity!
We attempt to infer
homology!
Sequence Alignment
Computer scientists call this
a string alignment problem, similar to routing phone calls
Think of this as a path: task is finding
best one
A common approach for aligning 2
sequences is to start with a dot matrix in which you place a dot at
each location that have an alignment


find best path that links the
segments

Searching Genbank is aligning a very long
sequence!
Done using dynamic programming: approach
developed by Needleman &Wunsch
Needleman-Wunsch assumes that can find
optimal path by incrementing optimal subpaths
- need a scoring mechanism!
- Basic problem: correlating scoring
with biology
- General approach, deduct points for
mismatches
- Deduct more for gaps, since
deletions are usually more serious = gap penalties
- have large penalty for the first
missing base, add smaller penalty for each additional
base
- How to deduct for
mismatches?
- For nucleic acids deduct more
for transversions than transitions

But: protein alignments are far more
sensitive!
- 20 possible amino acids cf 4
possible nucleotides
- Changes in nucleotides often do
not affect protein
- Proteins have 3x fewer
residues
- Proteins are what selection acts
on
- Therefore, proteins are preferred
for searches
Attempt to score amino acid
mismatches based on consequence of substitution
- Weighting in favor of
"biologically acceptable"paths
- Margaret Dayhoff waas the first to
try this: she prepared a model of evolution in which some
mutations are more acceptable than others
- These were identified by
comparing sequences of proteins from closely related
organisms
- She developed a unit called the
Point Accepted Mutation (PAM) One PAM = 1% of amino acids have
changed over evolution
- The 1 PAM mutation matrix mutates
amino acids with probability of 1%.
- The 100 PAM mutation matrix
mutates amino acids with probability of 100 mutations/100 amino
acids
- not every aa mutates and some
mutate back to starting point!
- To calculate score, need to
account for differences in normal abundance of various amino
acids
- Dayhoff matrices are 10 x log
(probability (a.a are paired due to common ancestry)/
(probability (pairing would occur by chance)) = log-odds
table

Using PAM
- compare mismatches with PAM matrix to
decide how important they are
- For any possible amino acid
substitution, a value from the scoring matrix is added to the
similarity score
- Therefore: is important to pick
correct scoring matrix!
- for closely related organisms
rules of mutation are most important
- for distantly-related organisms
rules of amino acid substitution are most important
- Therefore must pick matrix
appropriate for expected sequence similarity
- Use small PAM numbers (e.g PAM100)
to compare closely related spp, large numbers (e.g. PAM250) for
more distant
- One criticism of the PAM matrices
is that the PAM units were calibrated using pairs of
closely-related proteins
- Many other matrices have been
developed based on other criteria
- Blosum (BLOcks SUbstitutionMatrix)
matrices were calculated using data from the BLOCKS database
[http://www.blocks.fhcrc.org], which contains alignments of
more distantly-related proteins
- should be more realistic for
comparing distantly-related proteins since are derived from
distantly-related proteins. Derived empirically cf PAM
extrapolated based on theory
- BLOSUM 62 = minimum 62%
identity
- BLOSUM 30 = minimum 30 %
identity-> suitable for comparing distant-related
species
- BLOSUM matrices are generally
superior to PAM matrices for detecting biological
relationships.
- MDM10 and MDM20 matrices are
better suited for closely related species
- You will see all these different
substitution matrices as options at various web sites that allow
you to perform homology searches online
N.B.: basic approach is similar,
argument is how to weight gaps and mismatches!
Algorithms compute score for "paths" by
summing the values assigned by scoring matrix & gap
penalties
NWi,j = max{ NWi-1,j-1 +
s(ai,bj); NWi-k,j + gj; NWi,j-k + gi }
- Match = 1
- Mismatch = -1
- Gap = -2
Path with highest score wins
They then compare score with value
predicted by pure luck= E value
- expected number of hits scoring this
high by dumb luck after screening this many candidates
- Theoretical reason for using smallest
possible database: Each hit is more statistically
significant!
Usually compare with sequences generated
randomly using a DNA or protein sequence model
Alternative = comparing with
composition of the entire database
Needleman-Wunsch calculates global
alignments: align entire sequences
This often doesn't work well for aligning
DNA or protein sequences, because of the following
problems:
- Biological: proteins are often composed of motifs
& domains that are shuffled between different
proteins
- global alignments would miss
these, instead scores similarity of entire protein
- Statistical: theory for computing
significance of global alignments is weak
Local
alignments are therefore weapon of choice
- recognise conserved elements such as
motifs and domains in proteins that are otherwise very
different
- e.g. homeodomains in homeobox
proteins
- strong statistical theory
- 2 types of algorithm: gapped and
ungapped
- Ungapped has stronger theory,
simpler algorithms
- Gapped is more biologically
relevant
- identifies related sequences
despite insertion/deletion
Strongest but slowest local alignment
algorithm is Smith-Waterman
- Computes local alignments using
Needleman-Wunsch algorithm
- Key difference: stops alignment when
score goes to 0
- SWi,j = max{ SWi-1,j-1 + s(ai,bj);
SWi-k,j + gj; SWi,j-k + gi; 0 }
- N-M continues alignment adding
negative cells
- Match = 5, Mismatch = -4; Open
Gap = 0, Extend Gap = -7
- Guaranteed to find best alignment!
- Unfortunately, it is very
slow
- Therefore several faster heuristic
(rule of thumb) methods have been developed to speed up initial
search
FASTA
heuristic approximation of
Smith-Waterman developed by William R. Pearson at the University of Virginia
1) Breaks query and test sequences into
overlapping words (2 a.a. or 6 nt) and looks for matches
- rationale: homologous sequences
should have runs of common nucleotides or amino acids
2) next it scores the hits using
substitution matrix
3) it then tries to join high-scoring
segments using gaps < a certain size (window size)
4) Next it uses Smith-Waterman to find
best alignment within the vicinity of high-scoring
segments
5) Finally, it computes the score for
optimal alignment
- uses linear regression against ln
search set sequence length to calculate z-score
- uses
distribution of the z-score to estimate number of sequences
expected to produce higher z-scores by chance
Several different types of FASTA programs
are available
Fasta3 protein vs protein database or
DNA vs DNA database
Fastx3./fasty3 DNA translated in 3
frames vs protein database. Fastx3 allows frameshifts between codons,
fasty also allows frameshifts within codons (slower)
Tfastx3/tfasty3 protein vs DNA database
allowing frameshifts

Many servers allow you to search genomic or proteiomic databases online. A few will also let you use Smith-Waterman (since it is so much more computationally intensive)
http://fasta.bioch.virginia.edu/
http://www.ebi.ac.uk/fasta33/
http://www.ddbj.nig.ac.jp
http://workbench.sdsc.edu/
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl
http://searchlauncher.bcm.tmc.edu/seq-search/protein-search.html
http://decypher.stanford.edu/index_by_algo.htm

|