week 5 lecture
Computational phylogenetics
Attempts to infer evolutionary relationships from sequence similarity: compute similarity, then attempt to infer how recently organisms had a common ancestor. Basic assumption: evolution has proceeded as a series of bifurcations: two populations became reproductively isolated, then evolved independently. We can therefore represent evolution as a "phylogenetic tree" (a.k.a. dendrogram) with a series of branches.
Goal is to place the nodes and estimate the length of each branch.
Problems:
1) genes don't always evolve in the same way as species!
- therefore, we need to distinguish between gene trees and species trees
- we discuss below the construction of gene trees; to construct species trees we need data from multiple genes
2) Genes don't always evolve in such a simple manner
- new genes may enter a population by gene flow rather than mutation. Many species may exchange genes after they have started to evolve separately; bacteria, in particular, may exchange genes between quite different taxa.
- how do we place the branch for such an event?
- gene conversion may replace one allele with another
- how do we recognize this has happened?
- how do we place the branch for such an event?
- gene duplication may result in two (or more) copies of the same gene which then evolve separately
- which one do we use for aligning with other species?
- deletions may remove a gene from one species.
- Homoplasies (e.g. mutation back to the original sequence, or independent mutations to the same base) may lead us to underestimate the distance between two species
- how do we recognize this has happened?
An enormous number of programs have been written for constructing phylogenies based on molecular data!
Links to many are posted at http://evolution.genetics.washington.edu/phylip/software.html
Steps in constructing a molecular phylogeny
- alignment
- determining the substitution model
- tree building
- tree evaluation
Alignment
Perform a multiple alignment as before, but take more care in the choice of data
- Be sure that you are aligning homologous sequences
- Before we were trying to identify homologous sequences
- now we're using them to try to infer evolutionary relationships
- Be sure that you are aligning orthologous and not paralogous sequences
(unless you're studying the evolution of a gene family)
- Now often discard gapped regions and focus on aligned regions
- we're trying to compare evolution of particular positions
- Use DNA to compare closely related organisms
- Pro: provides better record of evolution
- con: has greater risk of homoplasies
- Use protein for more distantly-related organisms
- Pro: more reliable alignment and less risk of homoplasies
- con: less complete record of evolution. Due to redundancy of genetic code, two organisms with e.g. serine or arginine at a given position may have mutated multiple times.
- Use rRNA for very distantly related organisms
- We can use the programs used last week to align multiple sequences, but you must bear in mind that we are using them for a different purpose.
- Last week we were trying to identify conserved regions or structural similarities
- Now we are trying to use sequence similarities to identify evolutionary relationships. Therefore programs such as CLUSTAL W, which approximate a phylogeny (a guide tree) to construct their alignments, are poorly suited to this purpose.
- Instead, programs such as PIMA or MSA are preferable. Alternatively, MALIGN and TreeAlign determine alignment and phylogeny at the same time by finding the tree that minimizes the total alignment score.
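The step of discarding gapped regions can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function name `drop_gapped_columns`, the dictionary-of-strings alignment format, and the use of "-" as the gap character are all assumptions made for the example.

```python
def drop_gapped_columns(alignment, gap="-"):
    """Return a copy of the alignment keeping only columns with no gap.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    names = list(alignment)
    # Transpose the alignment so that each item is one column (one site).
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if gap not in column]
    # Rebuild each sequence from the columns that survived.
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


example = {"a": "AC-GT", "b": "ACTG-", "c": "A-TGT"}
print(drop_gapped_columns(example))
```

Only the sites where every sequence has a residue are kept, so every remaining column compares the same position across all sequences.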
Determining the substitution model
Tree-building
General problem: as you add sequences, the number of potential trees grows at an enormous rate!
# sequences | # unrooted trees
3 | 1
4 | 3
5 | 15
6 | 105
7 | 945
8 | 10,395
9 | 135,135
10 | 2,027,025
Therefore, we must find a way to reduce the number of trees to consider
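The growth in the number of trees can be checked directly. For unrooted, strictly bifurcating trees with n labelled sequences, the count is the product of the odd numbers 1 x 3 x 5 x ... x (2n - 5). The short Python sketch below (the function name is an assumption for illustration) evaluates that product.

```python
def count_unrooted_trees(n_sequences):
    """Number of distinct unrooted bifurcating trees for n labelled sequences."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2n - 5); the range is empty for n = 3.
    for k in range(3, 2 * n_sequences - 4, 2):
        count *= k
    return count


for n in range(3, 11):
    print(n, count_unrooted_trees(n))
```

Each added sequence can be attached to any existing branch, which is why the count is multiplied by the next odd number every time.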
Three general approaches
- Distance matrix methods: count # differences
- Maximum Parsimony: finds path that requires fewest changes
- Maximum Likelihood: list all possible models then for each calculate the probability of generating the observed data
Distance matrix methods
- Calculate all pairwise distances in a data set
- Construct a tree to minimize distance when all branches are added together
- Recalculate the distance matrix, treating the newly-joined sequences as a new phylogenetic unit
- Continue (recalculating the distance matrix after each sequence is added) until all sequences have been added
- Allows you to measure length of the branches
- Only considers end product
- loses information because it doesn't consider how the sequences could have evolved
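As a rough illustration of the first step, the sketch below counts differences between aligned sequences and builds the table of pairwise distances. The function names and the alignment format are assumptions for the example. The Jukes-Cantor correction is not part of these notes; it is included only as one widely used way of compensating for multiple substitutions at the same site.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of ungapped aligned positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (assumption: equal rates for all changes)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """All pairwise p-distances, keyed by (name_a, name_b)."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


example = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA"}
print(distance_matrix(example))
```

The corrected distance is always at least as large as the raw fraction of differences, reflecting changes that are hidden in the end product.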
1) UPGMA (Unweighted-pair-group method with arithmetic mean)
- chooses the two most closely-related sequences
- Adds each sequence one at a time in order of relatedness as a new branch
- Recalculates the distance matrix after each addition assuming a constant rate of evolution (often wrong!)
- Generates a rooted tree, indicating the polarity of evolution (i.e., what was the starting point, where did each sequence branch off and how long ago did it branch off, assuming a constant rate of evolution)
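A compact sketch of the textbook UPGMA procedure is shown below. It is an illustration under stated assumptions, not the code of any particular package: the function name, the input format (a dictionary of pairwise distances keyed by name pairs) and the nested-tuple output are all choices made for the example. Node heights are set to half the distance between the clusters being joined, which is where the constant-rate assumption enters.

```python
from itertools import combinations


def upgma(dist):
    """Build a rooted tree from {(a, b): distance} covering every leaf pair.

    Returns (tree, height): tree is nested 2-tuples of leaf names, and
    height maps every leaf and internal node to its height above the tips.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size, height = {}, {}
    for pair in dist:
        for leaf in pair:
            size[leaf] = 1
            height[leaf] = 0.0
    while len(size) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Constant-rate assumption: both lineages are equally far from the node.
        height[merged] = d[frozenset((a, b))] / 2
        for other in size:
            if other not in (a, b):
                # Arithmetic mean of distances, weighted by cluster size.
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, height


example = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
           ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree, height = upgma(example)
print(tree, height[tree])
```

If the rates of evolution differ between lineages, the heights (and sometimes the branching order) produced this way will be wrong, which is the weakness noted above.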
2) Neighbor-Joining method
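- chooses the pair of sequences whose joining gives the smallest total branch length, after correcting each pairwise distance for how far both sequences are from all the others (so the pair joined is not always the two with the smallest raw distance)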
- Adds each sequence one at a time in order of relatedness as a new branch
- Recalculates the distance matrix after each addition either making no assumption about mutation rates or using empirical scoring methods
- Use similarity to calculate the branch lengths
- Generates an unrooted tree (i.e., output is a star with branches radiating out from the center rather than a tree)
- Advantages
- fastest tree building method
- can use empirical substitution scoring methods
- Disadvantages
- tests only a single tree
- does not consider intermediate ancestors
- no requirement for an internally-consistent evolutionary model
- misses homoplasies
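The neighbor-joining procedure can be sketched as follows. This is a minimal version of the standard algorithm, written for illustration: the function name, the input format, and the internal node labels ("node0", "node1", ...) are assumptions, and leaf names are assumed not to clash with those labels. At each step the pair minimizing Q(i, j) = (n - 2) d(i, j) - r(i) - r(j) is joined, where r(i) is the sum of distances from i to all other nodes.

```python
from itertools import combinations


def neighbor_joining(dist):
    """dist: {(a, b): distance} for every leaf pair.

    Returns a list of (node, neighbour, branch_length) edges of an unrooted tree.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = []
    for pair in dist:
        for leaf in pair:
            if leaf not in nodes:
                nodes.append(leaf)
    edges, next_id = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: total distance from each node to all the others.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths need not be equal: no constant-rate assumption.
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(new, a, len_a), (new, b, d_ab - len_a)]
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                ) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


example = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
           ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
           ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
for edge in neighbor_joining(example):
    print(edge)
```

Note that the result is a list of edges with no root, and that only one tree is ever built, which matches the disadvantages listed above.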
3) Fitch/Margoliash
- tries different tree topologies and recalculates the distances.
- Uses least-squares approach to find tree with lowest score
- Advantages
- tests > one tree
- still pretty fast
- can use empirical substitution scoring methods
- global optimization of tree by statistical criteria
- Disadvantages
- slower than Neighbor Joining
- does not consider intermediate ancestors: no internally-consistent evolutionary model
- misses homoplasies
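The least-squares score used to compare candidate trees can be sketched as below. This is only the scoring step, under stated assumptions: the tree is given as a list of (node, node, branch_length) edges with branch lengths already chosen, the function names are invented for the example, and the weighting (squared error divided by the squared observed distance) is the one usually associated with Fitch and Margoliash. A full program would also optimize the branch lengths and search over topologies.

```python
from collections import defaultdict


def tree_distance(edges, start, goal):
    """Sum of branch lengths along the path between two nodes of a tree."""
    graph = defaultdict(list)
    for u, v, length in edges:
        graph[u].append((v, length))
        graph[v].append((u, length))
    seen = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in graph[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + length
                stack.append(neighbour)
    return seen[goal]


def fitch_margoliash_score(observed, edges):
    """Weighted sum of squared differences between observed and tree distances."""
    return sum(
        (d_obs - tree_distance(edges, a, b)) ** 2 / d_obs ** 2
        for (a, b), d_obs in observed.items()
    )


observed = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
            ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
tree_1 = [("x", "A", 1), ("x", "B", 2), ("x", "y", 3), ("y", "C", 1), ("y", "D", 2)]
tree_2 = [("x", "A", 1), ("x", "C", 1), ("x", "y", 3), ("y", "B", 2), ("y", "D", 2)]
print(fitch_margoliash_score(observed, tree_1))
print(fitch_margoliash_score(observed, tree_2))
```

The tree with the lower score reproduces the observed distances more closely, so it would be preferred.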
Maximum Parsimony: second approach to tree-building
- finds the path of fewest changes
- attempts to reconstruct evolution!
- Tests various models of how the observed data could have evolved from a common ancestor
- picks the one with the fewest steps
- Model of evolution is critical!
- Use orthologues from study group and from a different group (an outgroup) to identify fewest # mutations needed
- Methods (especially outgroups) are explained at http://www.gwu.edu/~clade/faculty/lipscomb/web.pdf
- Advantages
- reconstructs ancestral nodes: uses all the evolutionary data
- performs better than distance methods using simulated data and real data from pedigrees
- provides numerous "most parsimonious trees"
- Disadvantages
- provides numerous "most parsimonious trees"
- Can only determine position, not length of branches
- slower than matrix methods
- sensitive to order in which sequences are added to tree!
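Counting the changes a given tree requires can be sketched with Fitch's algorithm, a standard way of doing this for one tree at a time. The code is an illustration under assumptions: trees are nested 2-tuples of leaf names, all changes cost the same, and the function names and toy sequences are invented for the example. A parsimony program would repeat this count over many candidate trees and keep those with the fewest steps.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> character observed at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # No state in common: at least one change on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"human": "ACGT", "chimp": "ACGT", "gorilla": "ACTT", "baboon": "GCTT"}
tree_1 = (("human", "chimp"), ("gorilla", "baboon"))
tree_2 = (("human", "gorilla"), ("chimp", "baboon"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The score gives the number of changes but says nothing about how long each branch is, which corresponds to the limitation noted in the list above.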
Maximum Likelihood
- list all possible models then for each calculate the probability of generating the observed data
- DNAML works by successive addition of DNA sequences to a tree, optimizing the tree by maximum likelihood at each step.
- calculates probability of producing that pattern according to each substitution model
- PROTML performs a similar maximum likelihood estimate using protein sequence data (a much more computer-intensive process, since there are 20 rather than four possible states at each position)
- Advantages
- reconstructs ancestral nodes: uses all the evolutionary data
- estimates branch lengths
- Estimates significance of each branch
- Outperforms distance methods using simulated and real data
- Disadvantage
- VERY slow! Time required increases with the fourth power of the number of sequences.
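The likelihood calculation for a single candidate tree can be sketched as follows. This is a toy version under explicit assumptions, not how DNAML is implemented: it uses the Jukes-Cantor substitution model (all changes equally likely), fixed branch lengths supplied by the user, and invented function names and data. Each internal node is a tuple of (child, branch_length) pairs and each leaf is a sequence name.

```python
import math

BASES = "ACGT"


def jc_prob(same, t):
    """Jukes-Cantor probability of ending in a given base after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(node, site):
    """Probability of the data below node, for each possible base at node."""
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, site)
        for b in BASES:
            # Sum over every base the child could have at the end of the branch.
            result[b] *= sum(jc_prob(b == c, t) * below[c] for c in BASES)
    return result


def log_likelihood(tree, alignment):
    """Log-likelihood of the alignment, treating sites as independent."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        like = partial(tree, site)
        # Each base is assumed equally likely (1/4) at the root.
        total += math.log(sum(0.25 * like[b] for b in BASES))
    return total


alignment = {"human": "ACGT", "chimp": "ACGT", "gorilla": "ACTT", "baboon": "GCTT"}
tree_1 = (((("human", 0.1), ("chimp", 0.1)), 0.1),
          ((("gorilla", 0.1), ("baboon", 0.1)), 0.1))
tree_2 = (((("human", 0.1), ("gorilla", 0.1)), 0.1),
          ((("chimp", 0.1), ("baboon", 0.1)), 0.1))
print(log_likelihood(tree_1, alignment), log_likelihood(tree_2, alignment))
```

A real program would also adjust the branch lengths to maximize the likelihood of each tree before comparing trees, which is a large part of why the method is so slow.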
Rooting trees: all these techniques (except UPGMA) provide unrooted trees, with no evolutionary polarity
Most hypotheses involve direction: organisms evolved from a common ancestor
Rooting: estimating the source from which they diverged (i.e., the last common ancestor)
- Distance matrix methods assume the root is halfway between the two most distantly-related organisms (midpoint rooting)
- A more difficult problem for maximum parsimony techniques
- Why outgroups are important: they provide a reference for where the starting point may lie
- somewhere along the branch connecting the outgroup with its closest relative within the group
- for example: when rooting the tree of the great apes we can use baboons as an outgroup
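Placing the root halfway between the two most distant sequences can be sketched as below. The function names and the edge-list tree format (node, node, branch_length) are assumptions for illustration, and branch lengths are assumed to be positive. The function reports which branch contains the midpoint and how far along it the root would sit.

```python
from collections import defaultdict


def _walk(graph, start):
    """Distances from start to every node, plus each node's parent on that path."""
    dist, parent, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for neighbour, length in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                parent[neighbour] = node
                stack.append(neighbour)
    return dist, parent


def midpoint(edges, leaves):
    """Return (u, v, offset): the root lies on branch u-v, offset away from u."""
    graph = defaultdict(list)
    for u, v, length in edges:
        graph[u].append((v, length))
        graph[v].append((u, length))
    best = None
    for a in leaves:
        dist, parent = _walk(graph, a)
        b = max(leaves, key=lambda leaf: dist[leaf])
        if best is None or dist[b] > best[0]:
            best = (dist[b], b, dist, parent)
    total, node, dist, parent = best
    half = total / 2
    # Step back from the far leaf until the next node is within the first half.
    while dist[parent[node]] > half:
        node = parent[node]
    return parent[node], node, half - dist[parent[node]]


edges = [("u", "a", 2), ("u", "b", 3), ("v", "u", 3), ("v", "c", 4),
         ("w", "d", 2), ("w", "e", 1), ("v", "w", 2)]
print(midpoint(edges, ["a", "b", "c", "d", "e"]))
```

Like UPGMA, this placement is only as good as the assumption that the two most distant lineages evolved at similar rates; an outgroup avoids relying on that assumption.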
Tree evaluation: does the tree you come up with make sense?
1. Jumbling sequence addition order
- Most methods are sensitive to the order of addition
- Can test trees by scrambling sequences & trying again
- Many programs allow you to jumble the order of sequence addition, using a random number as "seed"
2. Jackknife resampling
- tests subsets of the original dataset (i.e., the new dataset is smaller than the original)
- Idea: if model is correct, subset of data should give same answer as complete dataset
- Therefore, perform multiple iterations using different subsets each time and see if they give the same answer
3. Bootstrap resampling is sampling with replacement
- Dataset stays the same size, but some sites are changed
- Sample one column at a time, but some columns may be missed while others may be replicated 2 or more times
- Generate many new datasets of same size, but each a bit different:
- some positions have been duplicated, while others have been lost
- Redo the phylogeny with each new dataset
- See how reproducible the original tree is
- we are using the same data, but sampling various subsets of it
- "Bootstrap value" is the number of times a branch appears in these new phylogenies
- If a branch is significant, it should turn up in most of them, and the frequency of its appearance can be used to estimate the statistical significance of the branch
4. Evaluating the data
- Random data should give symmetrical trees, whereas true phylogenies should be skewed
- Many procedures test how skewed the data is
- Most important consideration is data quality!
- Also remember to shuffle order of addition!
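The jackknife and bootstrap resampling steps described above can be sketched as follows. The function names and alignment format are assumptions for illustration. In particular, `find_clades` is a hypothetical callback standing in for whatever tree-building method is being tested: it is assumed to take an alignment and return the set of groups (as frozensets of sequence names) found in the resulting tree.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Sample columns with replacement; the new dataset has the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, keep=0.5):
    """Sample a subset of columns without replacement; the dataset shrinks."""
    length = len(next(iter(alignment.values())))
    columns = sorted(rng.sample(range(length), max(1, int(length * keep))))
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, find_clades, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each group of sequences appears.

    find_clades is a hypothetical, user-supplied function (see the text above).
    """
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    counts = {}
    for _ in range(replicates):
        for clade in find_clades(bootstrap_alignment(alignment, rng)):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / replicates for clade, n in counts.items()}


example = {"a": "ACGTACGTAC", "b": "ACGTACGAAC", "c": "ACGAACGAAT"}
rng = random.Random(1)
print(bootstrap_alignment(example, rng))
print(jackknife_alignment(example, rng))
```

Whole columns are resampled, so the sequences stay aligned with one another in every replicate.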
Phylogenetics programs
A few sites allow you to run PHYLIP online:
Biologists workbench allows you to perform distance matrix and parsimony analysis under the "alignments tools" menu
You can also perform several PHYLIP programs including fastDNAML online at
http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Links to other sites are posted at http://evolution.genetics.washington.edu/phylip/phylipweb.html
Other phylogenetics programs
MacVector allows you to construct phylogenetic trees using UPGMA or neighbor-joining techniques
PAUP = commercial offering
Hennig 86 = maximum parsimony program available from http://www.cladistics.org/education/hennig86.html
Many other programs are available: Joseph Felsenstein has provided links to 194 programs at http://evolution.genetics.washington.edu/phylip/software.html
Most are command line and require you to run output through a tree-drawing program to visualize results
Joseph Felsenstein is both a leader in the field of molecular evolution and quite a character: Check out his rendition of the "Amphioxus song" at
http://newfish.mbl.edu/Course/Resources/amphioxus.html