bioinformatics

 

Week 4 lecture

Multiple sequence alignment

Motifs were identified by multiple sequence alignment

PSI-BLAST performs multiple sequence alignment to create the position-specific scoring matrix (PSSM) for each iteration

General idea: align multiple relatives to determine a consensus, then score candidates based on their deviation from the consensus at each position

Why bother?

  • identify motifs
  • compare evolution of related sequences (directed knowledge discovery)
    • identify conserved regions: likely to be functionally significant
    • identify regions that diverge: tolerate variation
    • measure substitution frequencies, e.g., to develop substitution matrices.
  • see how proteins may evolve from common ancestor to perform different functions: e.g. globins
  • predict structure of portions of a new unknown protein
    • if its amino acids align well, it will probably adopt a similar shape
  • find important structures of unknown function (undirected knowledge discovery)
    • if the same sequence turns up in many genes/proteins it must serve some function!
  • first step in computing molecular phylogenies

BLAST and FASTA identify similar proteins

MSA studies the similarity of groups of proteins identified by BLAST and FASTA

MSA is a much more difficult computing problem than pairwise alignment: all practical programs only approximate the best solution.

Reason: adding each additional sequence to the search is like searching through an additional dimension, e.g., constructing the alignment for 3 sequences is like finding the best path through 3 dimensions.

[Figure: multalign 2]

To illustrate the problem:

  • If FASTA alignment of two 200 aa sequences takes 1 sec
    • aligning three 200 aa sequences takes 200 sec
    • aligning ten 200 aa sequences takes 200^8 sec (about 2.6 × 10^18 sec, or roughly 80 billion years)
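
The arithmetic behind these bullets can be checked with a short Python sketch. It assumes the simple cost model implied above, namely that each additional sequence multiplies the running time by the sequence length; the function name and its parameters are made up for this illustration.

```python
# Rough cost model (an assumption of this illustration): each extra
# sequence multiplies the running time by the sequence length.
def estimated_seconds(n_sequences, length=200, pairwise_seconds=1):
    """Estimated time to align n_sequences of the given length."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    return pairwise_seconds * length ** (n_sequences - 2)

SECONDS_PER_YEAR = 365.25 * 24 * 3600

for n in (2, 3, 10):
    sec = estimated_seconds(n)
    print(f"{n} sequences: {sec:.3g} sec = {sec / SECONDS_PER_YEAR:.3g} years")
```

This is why exact multiple alignment is only practical for a handful of short sequences, and why the programs below rely on heuristics.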

Basic premises:

  1. common ancestry
  2. evolution tolerates little variation in important structural features such as active sites: therefore, these will not change much as the rest of the protein evolves
  3. conversely, "conserved regions" (regions which have similar amino acid sequences) found upon aligning groups of proteins probably perform the same function

In theory, can do multiple alignments without any assumptions of phylogeny

In practice, nearly all programs assume evolution from a common ancestor, then use some kind of evolutionary tree to perform the alignment: circular reasoning!

Most use progressive alignment algorithms:

  • Use pairwise alignments to construct a phylogeny.
  • Use the phylogeny to align the remaining sequences, adding them in decreasing order of relatedness.
  • Develop a consensus sequence

General considerations

  • Start with protein alignments
  • Try several different algorithms & check output
  • If fairly closely related, also try DNA
    • Since alignment is based on phylogeny, DNA may reveal silent mutations
  • High quality data is crucial!
  • Beware of duplicates when screening large data sets

CLUSTAL W: most widely used program

Available within MacVector or at

http://workbench.sdsc.edu/
http://www2.ebi.ac.uk/clustalw/
http://www.ddbj.nig.ac.jp/E-mail/clustalw-e.html
http://decypher.stanford.edu/index_by_algo.htm
http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html

1) aligns each sequence with every other, one pair at a time, and scores the similarity

2) constructs a “distance matrix” from these scores, which measures the relatedness of each pair of sequences

[Figure: distance matrix]

3) uses the distance matrix and the neighbor-joining method to calculate a “phylogenetic guide tree” that organizes the genes/proteins according to the "distance" between them (i.e., their relatedness)

[Figure: guide tree]

4) uses this tree to construct alignment

  • Start with most closely related sequence pair
  • Add sequences one at a time in order of relatedness

[Figure: alignment]

5) calculates the distance between each member using substitution matrices

  • Unlike pairwise alignments, gaps are less costly than many substitutions!
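
Steps 1 to 3 above can be sketched in Python. This is only a minimal illustration under stated assumptions, not CLUSTAL W's actual code: it scores pairs with an arbitrary match/mismatch/gap scheme instead of a substitution matrix, defines distance as 1 minus the fraction of identical alignment columns, and builds the guide tree by simple average-linkage clustering as a stand-in for neighbor-joining. All function names are made up for this example; the sequences are the five short peptides used in the PIMA example later in this lecture.

```python
from itertools import combinations

def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def distance(a, b):
    """1 minus the fraction of identical columns in the pairwise alignment."""
    x, y = global_align(a, b)
    return 1 - sum(p == q for p, q in zip(x, y)) / len(x)

def average_linkage_tree(names, dist):
    """Guide tree as nested tuples; dist[(a, b)] is given for each a < b."""
    def leaf_distance(a, b):
        return dist[(a, b)] if a < b else dist[(b, a)]
    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(leaf_distance(a, b) for a, b in pairs) / len(pairs)
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)
    while len(clusters) > 1:
        # Merge the two closest clusters.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

seqs = {"S1": "TCYGIFVL", "S2": "TCGIFVL", "S3": "SCYGIFVLSGS",
        "S4": "TCFGIFVL", "S5": "ACGIFVLSG"}
names = sorted(seqs)
dist = {(a, b): distance(seqs[a], seqs[b]) for a, b in combinations(names, 2)}
for pair, d in dist.items():
    print(pair, round(d, 3))
print(average_linkage_tree(names, dist))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the closest pair first and adding the remaining sequences or sub-alignments in order (step 4); that part is omitted here.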

Notes:

1) CLUSTAL W weights sequences according to their divergence from the most closely-related pair

  • uses this weight to choose the scoring matrix and gap opening and extension penalties to use.

2) Constructs a "consensus sequence" which lists the most common base or amino acid found at each position

  • it is possible that none of the input sequences exactly matches this consensus!

3) Frequently biologists may have additional information to help tweak the alignment

  • secondary structures adopted by the protein
  • active sites
  • key motifs (e.g. leucine zippers or Zn fingers)

4) Therefore, to make sense of the alignment, try visualizing it in various ways, such as color-coding according to:

  • identities
  • differences
  • amino acid polarity
  • secondary structure
  • other features such as motifs, trans-membrane domains, etc. (MacVector allows you to color-code according to all of these features!)

[Figures: vis1, vis2]
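
Note 2 above (the consensus sequence) is easy to illustrate. The sketch below is a minimal example, not the CLUSTAL W implementation: it takes an already-aligned set of sequences and picks the most common character in each column. The three-residue toy alignment is hypothetical and was chosen so that the consensus matches none of the inputs; gap characters, if present, would simply be counted like any other character.

```python
from collections import Counter

def consensus(aligned):
    """Return the most common character in each column of an alignment."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # most_common(1) breaks ties in favor of the character seen first.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

toy_alignment = ["MKV", "MRL", "AKL"]  # hypothetical, already aligned
cons = consensus(toy_alignment)
print(cons, cons in toy_alignment)  # prints: MKL False
```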

PIMA (available at http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)

Key difference: PIMA computes similarities whereas CLUSTAL W computes distances!

1) performs all pairwise alignments using a substitution matrix based on amino acid properties

[Figure: PIMA substitution matrix]

2) creates a scoring matrix

consider the following sequences

S1 TCYGIFVL
S2 TCGIFVL
S3 SCYGIFVLSGS
S4 TCFGIFVL
S5 ACGIFVLSG

      S1   S2   S3   S4   S5
S1     -   26   40   38   26
S2    26    -   26   26   32
S3    40   26    -   36   36
S4    38   26   36    -   26
S5    26   32   36   26    -

3) Aligns the best-scoring pair first: S1:S3 (score 40)

  • Then adds S4
  • S2 next
  • S5 last

      S1 TCYGIFVL---
      S3 SCYGIFVLSGS
      S4 TCFGIFVL---
      S2 TC-GIFVL---
      S5 AC-GIFVLSG-

    Uses STAR algorithm

    • sequence with best alignment is at center
    • rays are distances to remaining sequences
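
The first two choices in step 3 can be read directly off the score table. The Python sketch below does this under an assumed selection rule: start with the highest-scoring pair, then add the sequence with the best score against any sequence already aligned. That rule is an assumption made for illustration, not a description of PIMA's internals, and it is only meant to cover the first two choices; the order of the later additions depends on details of PIMA that are not given here.

```python
# Pairwise similarity scores copied from the table above.
scores = {
    ("S1", "S2"): 26, ("S1", "S3"): 40, ("S1", "S4"): 38, ("S1", "S5"): 26,
    ("S2", "S3"): 26, ("S2", "S4"): 26, ("S2", "S5"): 32,
    ("S3", "S4"): 36, ("S3", "S5"): 36,
    ("S4", "S5"): 26,
}

def score(a, b):
    """Look up a pairwise score regardless of the order of the names."""
    return scores[(a, b)] if (a, b) in scores else scores[(b, a)]

# Similarities, not distances: the best pair is the one with the HIGHEST score.
best_pair = max(scores, key=scores.get)

# Assumed rule: next comes the sequence with the best score against
# any member of the pair that is already aligned.
all_names = {name for pair in scores for name in pair}
remaining = sorted(all_names - set(best_pair))
next_seq = max(remaining, key=lambda s: max(score(s, m) for m in best_pair))

print(best_pair, next_seq)  # prints: ('S1', 'S3') S4
```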

MAP (available at http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)

Another global program (aligns entire protein)

Good for doing alignments when there are long gaps in some sequences

Very slow!

MSA (available at http://workbench.sdsc.edu/ and http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)

Another global program (aligns entire protein)
Slower algorithm, but nearly optimal
Uses sum-of-pairs criterion
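
The sum-of-pairs criterion scores a multiple alignment by adding up the scores of every pair of sequences in every column. The following Python sketch shows the idea; the match, mismatch and gap values are arbitrary assumptions chosen for illustration (a real program would use a substitution matrix and proper gap penalties), and the three-sequence alignment is a toy example.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other cost nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(aligned):
    """Sum pair_score over all pairs of rows in every column."""
    return sum(pair_score(x, y)
               for column in zip(*aligned)
               for x, y in combinations(column, 2))

toy = ["TC-G", "TCYG", "AC-G"]
print(sum_of_pairs(toy))  # prints: 1
```

A program that uses this criterion searches for the alignment that maximizes the total, which is what makes it slower than progressive methods.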

Many other programs for computing global alignments are available!

Remember that each uses different heuristics to align entire protein!


Many programs look for conserved motifs = local alignment

Just like gapped BLAST, rationale is that many proteins are assembled using different combinations of conserved elements

These form a “concentrated” resource that can be used to reduce background & increase sensitivity of searches

BLOCKS (http://blocks.fhcrc.org/blocks/)

BLOCKS = multiply-aligned ungapped segments corresponding to the most highly-conserved regions

[Figure: BLOCKS]

BLOCKMAKER identifies blocks

  • Assumes proteins are homologous
  • does local alignment using 2 different algorithms
  • Identifies conserved regions within the proteins

[Figure: BLOCKS2]

BLOCK Searcher compares a query with a database of conserved motifs identified by running BLOCKMAKER against the InterPro database

  • compares all possible positions in the query against all blocks in this database
  • constructs a position-specific scoring matrix for each block
  • computes the probability of each amino acid turning up at each position within the block

[Figure: blocksearch]

BLIMPS is an improved version of BLOCK Searcher available at http://workbench.sdsc.edu/
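
The idea behind searching a query against a block can be sketched in Python. This is a simplified illustration, not the BLOCK Searcher or BLIMPS algorithm: it assumes a pseudocount of 1, a uniform background frequency of 1/20 for every amino acid, and a log-odds score summed over the block width. The four-sequence block is hypothetical, and the function names are made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_probabilities(block, pseudocount=1.0):
    """Probability of each amino acid at each position of an ungapped block."""
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    probs = []
    for column in zip(*block):
        counts = Counter(column)
        probs.append({aa: (counts[aa] + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return probs

def score_segment(segment, probs, background=1 / 20):
    """Log-odds score of one query segment against the block."""
    return sum(math.log2(p[aa] / background) for aa, p in zip(segment, probs))

def best_hit(query, probs):
    """Try every position in the query; return (best score, start index)."""
    width = len(probs)
    return max((score_segment(query[i:i + width], probs), i)
               for i in range(len(query) - width + 1))

block = ["GIFVL", "GIFVL", "GLFVL", "GIYVL"]  # hypothetical ungapped block
probs = position_probabilities(block)
best_score, start = best_hit("TCYGIFVLSGS", probs)
print(round(best_score, 2), start)
```

Repeating this for every block in the database, and ranking the hits, gives the flavor of a block search.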

COBBLER (COnsensus Biasing By Locally Embedding Residues) available at http://blocks.fhcrc.org/blocks/

Trick to improve searches.

Use a single sequence, but bias it by substituting the consensus sequence identified by BLOCKMAKER in the conserved regions

Sample output

COBBLER sequence from MOTIF

>unknown gi|1084385| from 1 to 320 with embedded consensus blocks

matataagLVTGGSRGIGLAIAQWLGQEGspvlagfgshaaksfpilstrsiatsgiraqvataekvsagagqsvespvvivtgasrgi kaialslgkagckvlvnyarsskeaeevskeieafggqaltfgGRIINISSVSGAMGNAGQSNYAAAKAGVVGFTKSLAHE mrmkksqwqevidlnltgvflctqaaakimmkkkkIPLGRFGQAEEVAGAVAFLASDaakagvigftktvareyasrninvnavapgfissdmtsklgddinkkiletiplgrygqpe evaglveflainpassyvtgqvftidggmtm
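
In the sample output above, the embedded consensus blocks appear in upper case and the rest of the original sequence in lower case. A minimal Python sketch of that embedding step follows. It is an illustration only: the function name, the 0-based block positions and the short input sequence are hypothetical, and the real COBBLER also decides which sequence to use and where each block belongs.

```python
def embed_consensus(sequence, blocks):
    """Overwrite regions of a sequence with block consensus residues.

    blocks is a list of (start, consensus) pairs with 0-based start
    positions. The original sequence is shown in lower case and the
    embedded consensus in upper case.
    """
    residues = list(sequence.lower())
    for start, cons in blocks:
        end = start + len(cons)
        if start < 0 or end > len(residues):
            raise ValueError("block does not fit inside the sequence")
        residues[start:end] = cons.upper()
    return "".join(residues)

# Hypothetical input: one 11-residue consensus block embedded at position 8.
print(embed_consensus("matataagavtggsrgigl", [(8, "LVTGGSRGIGL")]))
# prints: matataagLVTGGSRGIGL
```

The biased sequence can then be used as an ordinary single-sequence query, which is the point of the trick.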

LAMA (Local Alignment of Multiple Alignments) is a program for comparing protein multiple sequence alignments with each other. http://blocks.fhcrc.org/blocks/help/LAMA_help.html#LAMA

compares BLOCKS with BLOCKS

Other Motif databases

Many websites contain sets of conserved motifs

Many have their own program for searching them (according to their proprietary algorithm)

MEME allows you to discover your own motifs, or to search their database with MAST
http://meme.sdsc.edu/meme/website/

Match-box uses a different algorithm for multiple protein alignments

http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html

Receive very detailed results by email

3motif is a Stanford website for very detailed analysis of short sequences
http://motif.Stanford.EDU/3motif/

PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

Another collection of motifs

Can search protein sequences for these motifs with FingerPRINTScan

also available at http://workbench.sdsc.edu/

PROSITE = database of protein "fingerprints" (motifs and other sequence patterns) in Switzerland

http://www.expasy.ch/prosite/

PFSCAN (at http://workbench.sdsc.edu/) searches profiles stored in the PROSITE and Pfam databases

InterProScan ( http://www.ebi.ac.uk/interpro/scan.html ) searches PROSITE, Pfam, PRINTS and other family and domain databases

Pfam = database of conserved protein domains that were identified by hand
http://www.sanger.ac.uk/Pfam/

ProDom: French website of protein domains
http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php

Many others can be accessed from http://us.expasy.org/tools/#pattern

Each contains an overlapping but distinct set of entries and a different way of accessing them

Each may give you different results!

Some are based solely on protein data, while others contain primarily proteins predicted from mRNA or genomic DNA sequences

General conclusion: try them all!




Last update: Friday, February 7, 2003 at 4:34:42 PM.