| Home
syllabus
general information
homework
lectures
websites
Will Terzaghi's Homepage
Membership
Login |
|
|
|
week 3 lecture
Sequence Alignment
Basic goal: identify similarities
- acquire data to support inference
of homology
Needleman-Wunsch calculates global alignments:
align entire sequences
This often doesn't work well for aligning
DNA or protein sequences, because of the following
problems:
- Biological: proteins are often
composed of motifs & domains that are shuffled between
different proteins
- global alignments would miss these because
they score similarity of entire protein &
penalizes for the portions that are different
- Statistical: theory for computing
significance of global alignments is weak
Local alignments are therefore weapon of
choice
- recognise conserved elements such as
motifs and domains in proteins that are otherwise very
different
- e.g. homeodomains in homeobox
proteins
- strong statistical theory
- 2 types of algorithm: gapped and
ungapped
- Ungapped has stronger theory,
simpler algorithms
- Gapped is more biologically
relevant
- identifies related sequences
despite insertion/deletion
Strongest but slowest local alignment algorithm is
Smith-Waterman
- Guaranteed to find best alignment!
- Unfortunately, it is very slow
- Therefore several faster heuristic (rule of
thumb) methods have been developed to speed up initial search
FASTA: a heuristic approximation of Smith-Waterman
- Breaks query and test sequences into overlapping
words (2 a.a. or 6 nt) and looks for matches
- rationale: homologous sequences should have runs
of common nucleotides or amino acids
- next it scores the hits using substitution matrix
- it then tries to join high-scoring segments using
gaps < a certain size (window size)
- Next it uses Smith-Waterman to find best alignment
within the vicinity of high-scoring segments: speed advantage arises because
have restricted the search space.
- Finally, it computes the score for optimal alignment
- uses linear regression against ln search set
sequence length to calculate z-score (since
the probability of finding an accidental match increases as the size of
the data base increases)
- uses distribution of the z-scores to estimate
number of sequences expected to produce higher z-scores by chance
FASTA
SETTINGS
- K-tuple value: the word size. Defaults at most sites are 2 amino
acids for protein searches and 6 nucleotides for nucleic acid searches.
- You can increase sensitivity by choosing
smaller ktup values, but this will greatly lengthen the search time.
- It will also increase the number of false
positives (sensitivity and selectivity are inversely related).
- Gap Penalties
- Opening: how much to penalize for the introduction
of a gap (i.e. for the first nucleotide or amino acid in a gap)
- extension: how much to penalize for each
additional missing residue.
- Opening is usually much higher than extension,
because insertions and deletions are rare, but they frequently involve several
adjacent residues
- Default values are usually -12 and -2, respectively
for proteins, and -16 and -4 for nucleic acids, but there isn't a strong
theoretical basis for this.
- Optimization Threshold Score
- Defines the width of the band used in
the final score optimization step (i.e. how far you will search to either
side of a high-scoring segment).
- Lower numbers (i.e. 1) provide greater
search sensitivity, but slow the search significantly.
- Higher numbers (i.e. 10) reduce both sensitivity
and search times (remember the sensitivity/selectivity
tradeoff)
- The FASTA program calculates this value automatically, unless
you override it
- E-value Threshold =
Expectation value
limit for displaying scores and alignments.
- Highest expect: only scores with opt scores
lower than this limit will be printed.
- Lowest expect: only scores with opt scores
greater than this limit will be printed. Allows
you to eliminate close relatives if you are looking for more distantly-related
sequences
- number of similarity scores reported
- Maximum number of sequences to report
below E-value threshold. Keeps output under control if you find lots of
sequences related to your query.
- number of alignments reported
- Maximum number of alignments reported.
Keeps output under control if you find lots of sequences related to your
query.
- Omit Histogram: turns off printing of Extreme Value Distribution
Histogram
- You usually don't want to do this unless
you have confidence in your settings, because this provides a measure
of the significance of your results
- the histogram compares the distribution
of your z-scores with those predicted by the Extreme Value Distribution,
and the Kolmogorov-Smirnov statistic describes the maximum deviation of
the observed distribution from the extreme value distribution. If there
is significant deviation you need to try again with different settings.
Interpreting FASTA
results
- Check the histogram and Kolmogorov-Smirnov
statistic to be sure that your results are reliable.
- Use the e-values to see which alignments are
significant.
- As a general rule, alignments with e <
0.05 are probably meaningful, but look at the actual alignments to make
sure that they aren't trivial (for example, if comparing an mRNA sequence
against an EST databank - a databank of mRNA sequences - you want to ensure
that you aren't simply matching poly-A tails)
- Conversely, alignments with e > 0.05
may be interesting, and worth checking using other settings or other algorithms
(this is why most sites allow you to list hits with e-values up to 10)
- remember that your scores depend on
the scoring matrix used, so if you use a matrix such as Blosum 100
or PAM 10 designed for closely-related sequences, a distant relative
will get a lower score (= higher e-value) than it should.
- also, remember that if gene A is
related to gene B, and gene B is related to gene C, gene A must also
be related to gene C even though genes A and C may now have evolved
in quite different ways.
- One of the major tasks of bioinformatics
is separating meaningful from spurious alignments!
BLAST (Basic Local Alignment Search Tool)
- BLAST first breaks query into words (usually 4 aa or 11 nt) and
finds words with high scores (i.e. relatively rare, therefore unlikely to
occur in unrelated sequences)
- BLAST next scans for words that score above
critical value when aligned with query sequence = hits
- Next extend in each direction until run out
of similarity (-> hit a run of negative scores)
- Compile High Scoring Pairs (HSPs) and compute
score
- Compare with expected random occurrence of such score
BLAST
SETTINGS
- BLAST programs ref; http://www.ncbi.nlm.nih.gov/BLAST/producttable.html
- blastn:compares a nucleotide query sequence
against a nucleotide database
- megablast is designed to find nearly
identical matches: uses larger words, has more stringent cutoffs.
- "Search for short and near exact
matches" under Nucleotide BLAST is useful for primer or short nucleotide
motif searches. These will not be found using the default blastn settings
because of the cutoff values.
- blastp compares an amino acid query sequence against a protein
database
- blastx compares a nucleotide query
sequence translated in all reading frames against a protein sequence database
- tblastn compares a protein query sequence
against a nucleotide sequence database dynamically translated in all reading
frames.
- tblastx compares the six-frame translations
of a nucleotide query sequence against the six-frame translations of a nucleotide
sequence database (very time-consuming!)
- PSI blast (position specific iterative) is designed to identify
proteins which contain certain motifs or domains. It is described in more
detail below.
- Pattern-Hit Initiated (PHI) BLAST is designed
to search for proteins that contain a pattern specified by the user.
- Many other settings are very similar to FASTA
- database chooses which database to use
: NR (non-redundant) is the default
- CDD (conserved domain database) allows
you to search for portions of your gene/protein that have been identified
in other genes/proteins
- filter allows you to mask regions that
are commonly encountered such as repetitive sequences or proline-rich
regions in proteins, reducing the number of false positives
- expect: sets the upper limit of the e-value
of matches to report
- matrix: allows you to choose scoring matrix
- Gap Cost and Lambda Ratio: allow you to
choose the cost for opening and extending gaps (within the allowed values)
- Note: you must select "Advanced BLAST"
to control most of these parameters!
- BLAST also allows you more control over the
output format, e.g. how to view the alignments
- Note that NCBI will assign you a hypertext
number for each query
- you will need to click on this number
to view your results.
Interpreting BLAST results
- General idea is same as interpreting FASTA:
did you pick the correct settings, and which results are meaningful?
- Evaluate settings by distribution of scores
- Evaluate matches by e-values, and by the actual
matches detected (i.e. common sense)
BLAST vs FASTA
- Different algorithm
- Different way of computing E:
- BLAST compares with simulated database
- FASTA compares with entire database
- BLASTX, etc compare each mRNA reading frame one at a time
- FASTA does all 3 (or 6 if are looking at both
strands) frames at once
- BLAST is faster and picks up multiple alignments
- FASTA is slower, only picks best matches,
but finds things BLAST misses
- Which is best? Depends on your goals!(remember
from homework2, try several different algorithms)
Recommendations
- Use BLAST for protein, FASTA for DNA
- Refine search using Smith-Waterman
and/or PSI BLAST
- Use protein (or translated) sequences
if possible
- Break up protein sequences >
~250
- Break up nucleic acid sequences > ~1000
- Try several different scoring
matrices
- E< 0.05 is probably valid
- Higher E values may be worth
refining
Recommended matrices
by query length
| Length |
Matrix |
Gap penalties |
| <35 |
PAM-30 |
( 9,1) |
| 35-50 |
PAM-70 |
(10,1) |
| 50-80 |
BLOSUM-80 |
(10,1) |
| >85 |
BLOSUM-62 |
(11,1) |
Finding distant relatives
- since proteins act as tools, selection acts on shapes.
- Therefore, proteins with similar
shapes probably perform similar functions.
- Regions of proteins with similar
shapes probably perform similar functions.
- Probably evolved from common
ancestor.
- Rationale for identifying motifs & domains
- many proteins contain modules that turn
up in other proteins:
- Provide clues about functions of unknown
proteins (Also provide clues about strucures of portions of unknown proteins)
- Database searches using motifs or profiles
(position-specific scoring matrices: PSSM) are often much better at detecting
weak relationships.
How to determine significance of similarities
detected by these techniques?
- Ultimate test: do they adopt similar structures?
- remember, proteins that are similar shapes
perform similar jobs
- portions of proteins that perform similar
shapes perform similar jobs
- Therefore, many programs for aligning 3-D
structures have been developed.
(we will deal with predicting 3-D structures later)
VAST & many other programs attempt aligning
3-D structures
- The VAST algorithm(Vector Alignment Search Tool) looks for statistically
significant structural similarity (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
- Compares likelihood of a particular structure
against probability of it occurring at random
- Focuses on most “surprising”
( = unlikely due to random chance) alignments
- Uses vector alignment techniques to superimpose
two structures by best fit based on secondary structure
- Does not use sequence data-> picks up similarities missed by
BLAST!
Combinatorial extension (http://cl.sdsc.edu/ce.html)
uses a different algorithm
- First identifies aligned fragment pairs which confer structure similarity
- Next attempts extending combinations of aligned
fragment pairs to find best alignment
- Eventually returns optimal alignment, similar
results to VAST
DALI (http://www.ebi.ac.uk/dali/)
- The three-dimensional coordinates of each protein are used to calculate
residue-residue (Calpha-Calpha) distance matrices.
- The distance matrices are first decomposed
into elementary contact patterns, e.g., hexapeptide-hexapeptide submatrices.
- Next, similar contact patterns in the two
matrices are paired and combined into larger consistent sets of pairs.
- A Monte Carlo procedure is used to optimise
a similarity score defined in terms of equivalent intramolecular distances.
- Several alignments are optimised in parallel, leading to simultaneous
detection of the best, second-best and so on solutions.
Documenting Searches (e.g. for publication)
- Algorithm
- Substitution matrix
- Gap penalty
- Database(s)
- Name
- Version
- Computer used

|