bioinformatics

 
Home

syllabus

general information

homework

lectures

websites

Will Terzaghi's Homepage



Membership

Login

 
 

week 3 lecture

Sequence Alignment

Basic goal: identify similarities

  • acquire data to support inference of homology

Needleman-Wunsch calculates global alignments: align entire sequences

This often doesn't work well for aligning DNA or protein sequences, because of the following problems:

  • Biological: proteins are often composed of motifs & domains that are shuffled between different proteins
    • global alignments would miss these because they score similarity of entire protein & penalizes for the portions that are different
  • Statistical: theory for computing significance of global alignments is weak

Local alignments are therefore weapon of choice

  • recognise conserved elements such as motifs and domains in proteins that are otherwise very different
    • e.g. homeodomains in homeobox proteins
  • strong statistical theory
  • 2 types of algorithm: gapped and ungapped
    • Ungapped has stronger theory, simpler algorithms
    • Gapped is more biologically relevant
      • identifies related sequences despite insertion/deletion

Strongest but slowest local alignment algorithm is Smith-Waterman

  • Guaranteed to find best alignment!
  • Unfortunately, it is very slow
    • Therefore several faster heuristic (rule of thumb) methods have been developed to speed up initial search

FASTA: a heuristic approximation of Smith-Waterman

  1. Breaks query and test sequences into overlapping words (2 a.a. or 6 nt) and looks for matches
    • rationale: homologous sequences should have runs of common nucleotides or amino acids
  2. next it scores the hits using substitution matrix
  3. it then tries to join high-scoring segments using gaps < a certain size (window size)
  4. Next it uses Smith-Waterman to find best alignment within the vicinity of high-scoring segments: speed advantage arises because have restricted the search space.
  5. Finally, it computes the score for optimal alignment
    • uses linear regression against ln search set sequence length to calculate z-score (since the probability of finding an accidental match increases as the size of the data base increases)
    • uses distribution of the z-scores to estimate number of sequences expected to produce higher z-scores by chance

FASTA SETTINGS

  1. K-tuple value: the word size. Defaults at most sites are 2 amino acids for protein searches and 6 nucleotides for nucleic acid searches.
    • You can increase sensitivity by choosing smaller ktup values, but this will greatly lengthen the search time.
    • It will also increase the number of false positives (sensitivity and selectivity are inversely related).
  2. Gap Penalties
    • Opening: how much to penalize for the introduction of a gap (i.e. for the first nucleotide or amino acid in a gap)
    • extension: how much to penalize for each additional missing residue.
    • Opening is usually much higher than extension, because insertions and deletions are rare, but they frequently involve several adjacent residues
    • Default values are usually -12 and -2, respectively for proteins, and -16 and -4 for nucleic acids, but there isn't a strong theoretical basis for this.
  3. Optimization Threshold Score
    • Defines the width of the band used in the final score optimization step (i.e. how far you will search to either side of a high-scoring segment).
    • Lower numbers (i.e. 1) provide greater search sensitivity, but slow the search significantly.
    • Higher numbers (i.e. 10) reduce both sensitivity and search times (remember the sensitivity/selectivity tradeoff)
    • The FASTA program calculates this value automatically, unless you override it
  4. E-value Threshold = Expectation value limit for displaying scores and alignments.
    • Highest expect: only scores with opt scores lower than this limit will be printed.
    • Lowest expect: only scores with opt scores greater than this limit will be printed. Allows you to eliminate close relatives if you are looking for more distantly-related sequences
  5. number of similarity scores reported
    • Maximum number of sequences to report below E-value threshold. Keeps output under control if you find lots of sequences related to your query.
  6. number of alignments reported
    • Maximum number of alignments reported. Keeps output under control if you find lots of sequences related to your query.
  7. Omit Histogram: turns off printing of Extreme Value Distribution Histogram
    • You usually don't want to do this unless you have confidence in your settings, because this provides a measure of the significance of your results
    • the histogram compares the distribution of your z-scores with those predicted by the Extreme Value Distribution, and the Kolmogorov-Smirnov statistic describes the maximum deviation of the observed distribution from the extreme value distribution. If there is significant deviation you need to try again with different settings.

Interpreting FASTA results

  1. Check the histogram and Kolmogorov-Smirnov statistic to be sure that your results are reliable.
  2. Use the e-values to see which alignments are significant.
    • As a general rule, alignments with e < 0.05 are probably meaningful, but look at the actual alignments to make sure that they aren't trivial (for example, if comparing an mRNA sequence against an EST databank - a databank of mRNA sequences - you want to ensure that you aren't simply matching poly-A tails)
    • Conversely, alignments with e > 0.05 may be interesting, and worth checking using other settings or other algorithms (this is why most sites allow you to list hits with e-values up to 10)
      • remember that your scores depend on the scoring matrix used, so if you use a matrix such as Blosum 100 or PAM 10 designed for closely-related sequences, a distant relative will get a lower score (= higher e-value) than it should.
      • also, remember that if gene A is related to gene B, and gene B is related to gene C, gene A must also be related to gene C even though genes A and C may now have evolved in quite different ways.
      • One of the major tasks of bioinformatics is separating meaningful from spurious alignments!

BLAST (Basic Local Alignment Search Tool)

  1. BLAST first breaks query into words (usually 4 aa or 11 nt) and finds words with high scores (i.e. relatively rare, therefore unlikely to occur in unrelated sequences)
  2. BLAST next scans for words that score above critical value when aligned with query sequence = hits
  3. Next extend in each direction until run out of similarity (-> hit a run of negative scores)
  4. Compile High Scoring Pairs (HSPs) and compute score
  5. Compare with expected random occurrence of such score
BLAST1: BLAST2: BLAST3: BLAST4:

BLAST SETTINGS

  1. BLAST programs ref; http://www.ncbi.nlm.nih.gov/BLAST/producttable.html
    • blastn:compares a nucleotide query sequence against a nucleotide database
      • megablast is designed to find nearly identical matches: uses larger words, has more stringent cutoffs.
      • "Search for short and near exact matches" under Nucleotide BLAST is useful for primer or short nucleotide motif searches. These will not be found using the default blastn settings because of the cutoff values.
    • blastp compares an amino acid query sequence against a protein database
    • blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database
    • tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
    • tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database (very time-consuming!)
    • PSI blast (position specific iterative) is designed to identify proteins which contain certain motifs or domains. It is described in more detail below.
    • Pattern-Hit Initiated (PHI) BLAST is designed to search for proteins that contain a pattern specified by the user.
  2. Many other settings are very similar to FASTA
    • database chooses which database to use : NR (non-redundant) is the default
      • CDD (conserved domain database) allows you to search for portions of your gene/protein that have been identified in other genes/proteins
    • filter allows you to mask regions that are commonly encountered such as repetitive sequences or proline-rich regions in proteins, reducing the number of false positives
    • expect: sets the upper limit of the e-value of matches to report
    • matrix: allows you to choose scoring matrix
    • Gap Cost and Lambda Ratio: allow you to choose the cost for opening and extending gaps (within the allowed values)
  3. Note: you must select "Advanced BLAST" to control most of these parameters!
  4. BLAST also allows you more control over the output format, e.g. how to view the alignments
  5. Note that NCBI will assign you a hypertext number for each query
    • you will need to click on this number to view your results.

Interpreting BLAST results

  1. General idea is same as interpreting FASTA: did you pick the correct settings, and which results are meaningful?
  2. Evaluate settings by distribution of scores
  3. Evaluate matches by e-values, and by the actual matches detected (i.e. common sense)

BLAST vs FASTA

  1. Different algorithm
  2. Different way of computing E:
    • BLAST compares with simulated database
    • FASTA compares with entire database
  3. BLASTX, etc compare each mRNA reading frame one at a time
    • miss frameshifts
  4. FASTA does all 3 (or 6 if are looking at both strands) frames at once
    • recognizes frameshifts
  5. BLAST is faster and picks up multiple alignments
  6. FASTA is slower, only picks best matches, but finds things BLAST misses
  7. Which is best? Depends on your goals!(remember from homework2, try several different algorithms)

Recommendations

  1. Use BLAST for protein, FASTA for DNA
  2. Refine search using Smith-Waterman and/or PSI BLAST
  3. Use protein (or translated) sequences if possible
  4. Break up protein sequences > ~250
  5. Break up nucleic acid sequences > ~1000
  6. Try several different scoring matrices
  7. E< 0.05 is probably valid
    • Higher E values may be worth refining

Recommended matrices by query length

Length Matrix Gap penalties
<35 PAM-30 ( 9,1)
35-50 PAM-70 (10,1)
50-80 BLOSUM-80 (10,1)
>85 BLOSUM-62 (11,1)

Finding distant relatives

  • since proteins act as tools, selection acts on shapes.
  • Therefore, proteins with similar shapes probably perform similar functions.
  • Regions of proteins with similar shapes probably perform similar functions.
  • Probably evolved from common ancestor.
  • Rationale for identifying motifs & domains
    • many proteins contain modules that turn up in other proteins:
    • Provide clues about functions of unknown proteins (Also provide clues about strucures of portions of unknown proteins)
  • Database searches using motifs or profiles (position-specific scoring matrices: PSSM) are often much better at detecting weak relationships.

How to determine significance of similarities detected by these techniques?

  • Ultimate test: do they adopt similar structures?
    • remember, proteins that are similar shapes perform similar jobs
    • portions of proteins that perform similar shapes perform similar jobs
  • Therefore, many programs for aligning 3-D structures have been developed.
    (we will deal with predicting 3-D structures later)

VAST & many other programs attempt aligning 3-D structures

  • The VAST algorithm(Vector Alignment Search Tool) looks for statistically significant structural similarity (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)

  • Compares likelihood of a particular structure against probability of it occurring at random

  • Focuses on most “surprising” ( = unlikely due to random chance) alignments

  • Uses vector alignment techniques to superimpose two structures by best fit based on secondary structure

  • Does not use sequence data-> picks up similarities missed by BLAST!

Combinatorial extension (http://cl.sdsc.edu/ce.html) uses a different algorithm

  • First identifies aligned fragment pairs which confer structure similarity
    • Local geometry!
  • Next attempts extending combinations of aligned fragment pairs to find best alignment
  • Eventually returns optimal alignment, similar results to VAST

DALI (http://www.ebi.ac.uk/dali/)

  • The three-dimensional coordinates of each protein are used to calculate residue-residue (Calpha-Calpha) distance matrices.
  • The distance matrices are first decomposed into elementary contact patterns, e.g., hexapeptide-hexapeptide submatrices.
  • Next, similar contact patterns in the two matrices are paired and combined into larger consistent sets of pairs.
  • A Monte Carlo procedure is used to optimise a similarity score defined in terms of equivalent intramolecular distances.
  • Several alignments are optimised in parallel, leading to simultaneous detection of the best, second-best and so on solutions.

Documenting Searches (e.g. for publication)

  • Algorithm
  • Substitution matrix
  • Gap penalty
  • Database(s)
  • Name
  • Version
  • Computer used




Last update: Tuesday, January 28, 2003 at 6:15:52 PM.