Week 4 lecture
Multiple sequence alignment
Motifs were identified by multiple sequence alignment
PSI-BLAST performs multiple sequence alignment
to create the PSSM for each iteration
General idea: align multiple relatives to determine a consensus
score candidates based on their deviation
from consensus at each position
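A minimal sketch of this idea, assuming the relatives are already aligned to equal length. The function names and the toy sequences are hypothetical, and counting mismatches is only the simplest possible deviation score:

```python
from collections import Counter

def consensus(aligned):
    """Most common residue in each column of equal-length aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def deviation(candidate, cons):
    """Number of positions at which the candidate differs from the consensus."""
    return sum(a != b for a, b in zip(candidate, cons))

# Hypothetical toy alignment ("-" marks a gap); not data from the lecture.
relatives = ["TCYGIFVL", "TC-GIFVL", "SCYGIFVL", "TCFGIFVL"]
cons = consensus(relatives)
print(cons)                         # consensus of the four relatives
print(deviation("ACYGIFVL", cons))  # deviation of one candidate
```

A real program would weight each position (e.g. with a PSSM) rather than count mismatches equally.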
Why bother?
- identify motifs
- compare evolution of related sequences
(directed knowledge discovery)
- identify conserved regions: likely functionally
significant
- identify regions that diverge: these tolerate
variation
- measure substitution frequencies, e.g.,
to develop substitution matrices.
- see how proteins may evolve from common ancestor
to perform different functions: e.g. globins
- predict structure of portions of a new unknown
protein
- if its amino acids align well, it will
probably adopt a similar shape
- find important structures of unknown function
(undirected knowledge discovery)
- if the same sequence turns up in many
genes/proteins it must serve some function!
- first step in computing molecular phylogenies
BLAST and FASTA identify similar proteins
MSA studies the similarity of groups of proteins
identified by BLAST and FASTA
A much more difficult computing problem: all practical methods only approximate the
best solution.
Reason: adding each additional sequence to the search is like searching through
an additional dimension, e.g. constructing the alignment for 3 sequences is like finding the
best path through 3 dimensions.
To illustrate the problem:
- If FASTA alignment of two 200 aa sequences
takes 1 sec
- aligning three 200 aa sequences takes
200 sec
- aligning ten 200 aa sequences takes
200 to the eighth power sec
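The arithmetic behind these figures can be sketched as follows, assuming (as the illustration does) that each sequence beyond the second multiplies the work by the sequence length. The function name and defaults are hypothetical:

```python
def estimated_seconds(n_sequences, length=200, pairwise_seconds=1.0):
    """Rough cost of an exact alignment: each sequence beyond the
    second multiplies the search space by `length`."""
    return pairwise_seconds * length ** (n_sequences - 2)

for n in (2, 3, 10):
    print(n, estimated_seconds(n))
```

For ten sequences this gives 200^8 = 2.56 x 10^18 seconds, which is on the order of tens of billions of years. That is why practical programs rely on heuristics.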
Basic premises:
- common ancestry
- evolution tolerates little variation in
important structural features such as active sites: therefore, these will
not change much as the rest of the protein evolves
- conversely, "conserved regions"
(regions which have similar amino acid sequences) found upon aligning groups
of proteins probably perform the same function
In theory, can do multiple alignments without any assumptions of phylogeny
In practice, “all models” assume
evolution from a common ancestor, then use some kind of evolutionary tree to
perform the alignment: circular reasoning!
Most use progressive alignment algorithms
- Use pairwise alignments to construct a phylogeny.
- Use the phylogeny to align the remaining sequences, adding
them in decreasing order of relatedness.
- Develop a consensus sequence
General considerations
- Start with protein alignments
- Try several different algorithms & check
output
- If fairly closely related, also try DNA
- Since alignment is based on phylogeny, DNA may reveal silent
mutations
- High quality data is crucial!
- Beware of duplicates when screening large
data sets
CLUSTAL W: most widely used program
Available within MacVector or at
http://workbench.sdsc.edu/
http://www2.ebi.ac.uk/clustalw/
http://www.ddbj.nig.ac.jp/E-mail/clustalw-e.html
http://decypher.stanford.edu/index_by_algo.htm
http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
1) aligns each sequence with every other, one pair at a time, and
scores the similarity
2) constructs a “distance matrix”
using these scores that measure the relatedness of each pair of sequences
3) uses the distance matrix to calculate a
“phylogenetic guide tree” that organizes the genes/proteins according
to the "distance" between them using the neighbor-joining method
(i.e., their relatedness)
4) uses this tree to construct alignment
- Start with most closely related sequence
pair
- Add sequences one at a time in order of
relatedness
5) calculates the distance between each member
using substitution matrices
- Unlike pairwise alignments, gaps are less
costly than many substitutions!
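Steps 1 and 2 can be sketched as below. This is not CLUSTAL W's actual code: the simple match/mismatch/gap scores, the use of 1 - fraction identity as the distance, and the toy sequences are all assumptions made for illustration. The sketch stops before tree building (step 3) and only picks the closest pair, which is where step 4 would start.

```python
from itertools import combinations

def identity(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b (Needleman-Wunsch) and return the fraction
    of alignment columns that hold identical residues."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the corner, counting columns and identical pairs.
    i, j, same, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return same / columns

# Hypothetical toy sequences, not data from the lecture.
seqs = {"A": "TCYGIFVL", "B": "TCFGIFVL", "C": "SCYGLFVLSG"}

# Steps 1-2: every pair is aligned once and its distance stored.
distance = {(x, y): 1 - identity(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}

# Step 4 starts from the most closely related pair.
closest = min(distance, key=distance.get)
print(distance)
print(closest)
```

A real implementation would score with a substitution matrix and build the guide tree by neighbor-joining from the full distance matrix.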
Notes:
1) CLUSTAL W weights sequences according to
their divergence from the most closely-related pair
- uses this weight to choose the scoring matrix
and gap opening and extension penalties to use.
2) Constructs a "consensus sequence"
which lists the most common base or amino acid found at each position
- none of the queries may match this consensus!
3) Frequently biologists may have additional
information to help tweak the alignment
- secondary structures adopted by the protein
- active sites
- key motifs (e.g. leucine zippers or Zn fingers)
4) Therefore, to make sense of the alignment, try visualizing it
in various ways, such as color-coding according to:
- identities
- differences
- amino acid polarity
- secondary structure
- other features such as motifs, trans-membrane
domains, etc (MacVector allows you to color-code
according to all of these features!)
PIMA (available at http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)
Key difference: PIMA
computes similarities whereas CLUSTAL W computes distances!
1) performs all pairwise alignments using a substitution matrix based
on amino acid properties
2) creates a scoring matrix
consider the following sequences
S1 TCYGIFVL
S2 TCGIFVL
S3 SCYGIFVLSGS
S4 TCFGIFVL
S5 ACGIFVLSG
     |  S1 |  S2 |  S3 |  S4 |  S5 |
  S1 |     |  26 |  40 |  38 |  26 |
  S2 |  26 |     |  26 |  26 |  32 |
  S3 |  40 |  26 |     |  36 |  36 |
  S4 |  38 |  26 |  36 |     |  26 |
  S5 |  26 |  32 |  36 |  26 |     |
3) Aligns best scores first = S1:S3
- Then adds S4
- S2 next
- S5 last
Uses STAR algorithm
- sequence with best alignment is at center
- rays are distances to remaining sequences
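A small sketch using the similarity scores from the matrix above. It finds the best-scoring pair and a candidate center for the star. Taking the center to be the sequence with the highest summed similarity is an assumption made here for illustration; PIMA's own clustering rules may differ.

```python
names = ["S1", "S2", "S3", "S4", "S5"]

# Pairwise similarity scores copied from the scoring matrix (upper triangle).
scores = {
    ("S1", "S2"): 26, ("S1", "S3"): 40, ("S1", "S4"): 38, ("S1", "S5"): 26,
    ("S2", "S3"): 26, ("S2", "S4"): 26, ("S2", "S5"): 32,
    ("S3", "S4"): 36, ("S3", "S5"): 36,
    ("S4", "S5"): 26,
}

# The pair with the highest similarity is aligned first.
best_pair = max(scores, key=scores.get)

# Assumed center criterion: highest total similarity to all other sequences.
totals = {n: sum(s for pair, s in scores.items() if n in pair) for n in names}
center = max(totals, key=totals.get)

print(best_pair)
print(totals)
print(center)
```

Note that higher is better here because these are similarities; with CLUSTAL W-style distances the comparisons would be reversed.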
MAP (available at http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)
Another global program (aligns entire protein)
Good for doing alignments when there are long
gaps in some sequences
Very slow!
MSA (available at http://workbench.sdsc.edu/ and
http://searchlauncher.bcm.tmc.edu/cgi-bin/multi-align/multi-align.pl)
Another global program (aligns entire protein)
Slower algorithm, but nearly optimal
Uses sum-of-pairs criterion
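The sum-of-pairs criterion scores an alignment by adding up the pairwise scores of every pair of sequences in every column. A minimal sketch, assuming simple match/mismatch/gap scores in place of a real substitution matrix (the function name, the scores, and the toy alignment are hypothetical):

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the scores of all residue pairs in every alignment column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue           # gap-gap pairs contribute nothing here
            elif x == "-" or y == "-":
                total += gap       # residue aligned with a gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

# Hypothetical toy alignment of three sequences.
alignment = ["TCYGIFVL", "TC-GIFVL", "TCFGIFVL"]
print(sum_of_pairs(alignment))
```

A program optimizing this criterion searches for the arrangement of gaps that maximizes the total.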
Many other programs for computing global alignments
are available!
Remember that each uses different heuristics to align entire protein!
Many programs look for conserved motifs = local
alignment
As with gapped BLAST, the rationale is that many proteins are assembled
using different combinations of conserved elements
These form a “concentrated” resource
that can be used to reduce background & increase sensitivity of searches
BLOCKS (http://blocks.fhcrc.org/blocks/)
BLOCKS = multiply-aligned ungapped segments corresponding to the
most highly-conserved regions
BLOCKMAKER identifies blocks
- Assumes proteins are homologous
- does local alignment using 2 different algorithms
- Identifies conserved regions within the proteins
BLOCK Searcher compares query with a database
of conserved motifs identified by running BLOCKMAKER against the InterPro
database
- all possible positions against all blocks
in this database
- constructs position-specific scoring matrix
for each block
- Computes the probability of each amino acid turning up at that
position within the block
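The idea of a position-specific scoring matrix for one block can be sketched as follows. This is not the BLOCKS code itself: the uniform background frequency (1/20), the crude pseudocount, the function names, and the toy block are all assumptions for illustration.

```python
import math
from collections import Counter

def column_probabilities(block):
    """Observed frequency of each amino acid at each block position."""
    depth = len(block)
    return [{aa: count / depth for aa, count in Counter(column).items()}
            for column in zip(*block)]

def segment_score(segment, probs, background=0.05, pseudo=0.01):
    """Log-odds score of one query segment against the block.
    background = 1/20 (uniform) and pseudo are simplifying assumptions."""
    return sum(math.log2((p.get(aa, 0.0) + pseudo) / background)
               for aa, p in zip(segment, probs))

def best_position(query, probs):
    """Try every possible start position in the query and return the best."""
    width = len(probs)
    return max(range(len(query) - width + 1),
               key=lambda i: segment_score(query[i:i + width], probs))

# Hypothetical ungapped block of four aligned segments.
block = ["GIFVL", "GIFVL", "GLFVL", "GIFIL"]
probs = column_probabilities(block)

query = "MKTGIFVLSA"
start = best_position(query, probs)
print(probs[1])
print(start, query[start:start + len(probs)])
```

Repeating `best_position` for every block in the database corresponds to comparing all possible positions against all blocks.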
BLIMPS is an improved version of BLOCK Searcher
available at http://workbench.sdsc.edu/
COBBLER (COnsensus Biasing By Locally Embedding Residues)
available at http://blocks.fhcrc.org/blocks/
Trick to improve searches.
Use a single sequence, but bias it by substituting
the consensus sequence identified by BLOCKMAKER in the conserved regions
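The embedding trick can be sketched as below, assuming the block positions within the sequence are already known. The function name, the 0-based coordinates, and the toy sequence are hypothetical. Writing the embedded consensus in upper case mirrors the convention visible in the sample output.

```python
def embed_consensus(sequence, blocks):
    """Replace each conserved region of `sequence` with its block consensus.
    blocks: list of (start, consensus) pairs, start counted from 0."""
    residues = list(sequence.lower())
    for start, cons in blocks:
        end = start + len(cons)
        if start < 0 or end > len(residues):
            raise ValueError("block does not fit inside the sequence")
        # Same-length substitution, so positions downstream are unchanged.
        residues[start:end] = cons.upper()
    return "".join(residues)

# Hypothetical sequence with one conserved region starting at position 4.
print(embed_consensus("mkaatcfgifvlsgq", [(4, "TCYGIFVL")]))
```

The resulting single sequence can then be used as the query in an ordinary database search.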
Sample output
COBBLER sequence from MOTIF
>unknown gi|1084385| from 1 to 320 with embedded consensus blocks
matataagLVTGGSRGIGLAIAQWLGQEGspvlagfgshaaksfpilstrsiatsgiraqvataekvsagagqsvespvvivtgasrgi
kaialslgkagckvlvnyarsskeaeevskeieafggqaltfgGRIINISSVSGAMGNAGQSNYAAAKAGVVGFTKSLAHE
mrmkksqwqevidlnltgvflctqaaakimmkkkkIPLGRFGQAEEVAGAVAFLASDaakagvigftktvareyasrninvnavapgfissdmtsklgddinkkiletiplgrygqpe
evaglveflainpassyvtgqvftidggmtm
LAMA (Local Alignment of Multiple Alignments)
is a program for comparing protein multiple sequence alignments with each other.
http://blocks.fhcrc.org/blocks/help/LAMA_help.html#LAMA
compares BLOCKS with BLOCKS
Other Motif databases
Many websites contain sets of conserved motifs
Many have their own program for searching them
(according to their proprietary algorithm)
MEME allows you to discover your own motifs,
or to search their database with MAST
http://meme.sdsc.edu/meme/website/
Match-box uses a different algorithm for multiple
protein alignments
http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
Receive very detailed results by email
3motif is a Stanford website for very detailed
analysis of short sequences
http://motif.Stanford.EDU/3motif/
PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
Another collection of motifs
Can search protein sequences for these motifs with FingerPRINTScan
also available at http://workbench.sdsc.edu/
PROSITE = database of protein “fingerprints” (motifs
and other sequence patterns), hosted in Switzerland
http://www.expasy.ch/prosite/
PFSCAN (at http://workbench.sdsc.edu/) searches profiles stored in
PROSITE and PFAM databases
InterProScan (http://www.ebi.ac.uk/interpro/scan.html) searches PROSITE, Pfam, PRINTS and
other family and domain databases
PFAM = database of conserved protein domains that were identified
by hand
http://www.sanger.ac.uk/Pfam/
ProDom: French website of protein domains
http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
Many others can be accessed from
http://us.expasy.org/tools/#pattern
Each contains an overlapping but distinct set of entries and a different
way of accessing them
Each may give you different results!
Some are based solely on protein data, while
others contain primarily proteins predicted from mRNA or genomic DNA sequences
General conclusion: try them all!
