week 6 lecture
Sequence Analysis
Frequently you wish to search raw sequence data for various
features. In the easiest cases the search is for a specific sequence.
A related but more difficult problem is identifying patterns within sequences.
One common search is for the sites
cut by various restriction enzymes for mapping, or subcloning portions of the
sequence. For example, we frequently wish to subclone the coding sequence into
a plasmid in order to express the encoded protein in bacteria.
- MacVector
- allows you to find which sites are present
(and filter according to criteria such as number of cuts, size of recognition
sequence, etc.)
- also predicts size of the resulting fragments
when you cut with various enzymes
- TACG in the "Nucleic Tools"
at Biology workbench
- also allows you to find which sites are
present (and filter according to criteria such as number of cuts, size
of recognition sequence, etc.)
- also predicts size of the resulting fragments
when you cut with various enzymes
- http://www.firstmarket.com/cutter/cut2.html
- analyses sequences you cut
and paste in
- also allows you to import sequences directly
from GenBank
- not as many features or as user-friendly
as MacVector or TACG
- http://www.arabidopsis.org/cgi-bin/patmatch/RestrictionMapper.pl
- analyses sequences you cut
and paste in
- also allows you to analyze Arabidopsis
sequences by simply typing in the locus or gi number
- not as many features or as user-friendly
as MacVector or TACG
- http://biochem.roche.com/fst/products.htm?/benchmate/
- tells whether a sequence you enter contains
a restriction site
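The kind of search these tools perform can be sketched in a few lines of Python. This is only an illustration, not how MacVector or TACG are implemented: the three enzymes, the `ENZYMES` table, the function names and the test sequence are all chosen for the example, and the code assumes a linear molecule and palindromic recognition sites (so only the top strand needs to be searched).

```python
# Recognition site and cut offset (number of bases into the site after which
# the top strand is cut), e.g. EcoRI cuts G^AATTC.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def find_cut_positions(seq, site, offset):
    """Return the 0-based positions at which the top strand is cut."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)  # allow overlapping sites
    return cuts


def fragment_sizes(seq_len, cuts):
    """Sizes of the fragments produced by cutting a LINEAR molecule."""
    bounds = [0] + sorted(cuts) + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Made-up test sequence with two EcoRI sites and one BamHI site.
seq = "AAAAGAATTCTTTTTTGGATCCAAAAGAATTCCC"

for name, (site, offset) in ENZYMES.items():
    cuts = find_cut_positions(seq, site, offset)
    print(name, "cuts:", len(cuts), "fragments:", fragment_sizes(len(seq), cuts))
```

Filtering by number of cuts (e.g. keeping only enzymes that cut once) is then just a test on `len(cuts)`.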
Another common search is for the
sites cut by various proteolytic enzymes (e.g. trypsin or papain) or various
chemicals in order to do peptide mapping
- MacVector provides
the same suite of features for proteolytic enzymes as for restriction enzymes
- allows you to find which sites are present
(and filter according to criteria such as number of cuts, size of recognition
sequence, etc.)
- also predicts size of the resulting fragments
- You can perform a similar
analysis online at http://us.expasy.org/tools/peptidecutter/
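As a rough illustration of peptide mapping in silico, the sketch below digests a protein with trypsin. It assumes the usual simplified rule (cut after lysine, K, or arginine, R, unless the next residue is proline, P); real tools such as PeptideCutter apply more detailed rules. The function name and the test peptide are made up for the example.

```python
def trypsin_digest(protein):
    """Split a protein (one-letter codes) using a simplified trypsin rule:
    cut after K or R unless the following residue is P."""
    protein = protein.upper()
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        is_last = i + 1 == len(protein)
        if aa in "KR" and not is_last and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])  # whatever remains after the last cut
    return peptides


peptides = trypsin_digest("MKTAYRPLKGG")
print(peptides)                     # the fragments
print([len(p) for p in peptides])   # their sizes in residues
```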
Another common problem is searching for the DNA sequences that encode
proteins
Many programs allow you to detect Open Reading
Frames (ORFs): sequences that run from a potential start codon to a stop codon.
You specify the minimum length, and the genetic code to use.
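A minimal ORF finder along these lines might look like the sketch below. It assumes the standard genetic code (ATG start; TAA, TAG, TGA stops) and scans only the three frames of the strand you give it; to cover the other strand you would run it again on the reverse complement. The function name, the return format and the test sequence are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) for each ORF on this strand.

    start is the 0-based index of the ATG, end is one past the stop codon,
    and min_len is the minimum ORF length in bases (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first potential start codon
            elif codon in STOP_CODONS and start is not None:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC", min_len=9))
```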
Determining the actual coding sequence is trickier,
and often can only be done experimentally.
in mRNA the start codon often isn't the first
AUG
- in bacteria it is the first AUG 3' to the
Shine-Dalgarno sequence (and sometimes isn't even AUG!)
- in eukaryotes it is often the first AUG that lies within a good Kozak sequence
- in eukaryotes the coding sequence in the
genomic DNA is often interrupted by introns
MacVector allows you to search for coding
regions using Fickett's algorithm, which looks for biases in base composition
and base position (these are known to differ
between coding and non-coding regions)
Netstart http://www.cbs.dtu.dk/services/NetStart/
is a program which uses a form of artificial intelligence
called neural networks to predict start sites in mRNA sequences (I'll explain
how these work later in the course; basically, you train the program to
recognize certain types of sequences using training sets of known start
sites, then turn it loose on unknown sequences)
- it can also be used to find start sites
in genomic DNA sequences, although it may be fooled by genes that have
introns near the translation start (and many do)
Many programs will translate RNA (or DNA) sequence
into the protein it encodes. Since you often don't know which strand to read
or which frame it is in they will translate all six frames (3 on the top strand
and 3 on the bottom).
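Six-frame translation is easy to sketch. The example below assumes the standard genetic code, writes stop codons as `*` and any codon containing an unexpected character as `X`; the frame labels (`+1` ... `-3`) and function names are just conventions picked for this illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one frame, ignoring any incomplete codon at the end."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```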
Another common problem is generating the reverse
and complement of a sequence
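The three operations are easy to confuse, so here is a small sketch that keeps them separate. It assumes DNA written 5' to 3' and also handles the IUPAC ambiguity codes (R, Y, N, etc.) that turn up in degenerate primers; the function names are simply chosen for the example.

```python
# Complement table for DNA, including IUPAC ambiguity codes
# (R = A/G pairs with Y = C/T, K = G/T pairs with M = A/C, and so on).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse(seq):
    """Same strand, written backwards (3' to 5')."""
    return seq.upper()[::-1]


def complement(seq):
    """Opposite strand, written 3' to 5'."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Opposite strand, written 5' to 3' (what you usually want)."""
    return complement(seq)[::-1]


print(reverse("ATGC"))             # CGTA
print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```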
Many programs allow you to design primers: short
pieces of single-stranded DNA that should anneal to specific target sequences.
These are some of the most widely used programs
in molecular biology, because primers are used for many different purposes.
- PCR is one of the most widely used procedures
in all of molecular biology, and successful PCR requires good primers! Amplifying
specific pieces of DNA by PCR requires one primer that will anneal to the
target site on the Watson strand and another that will anneal to the
target site on the Crick strand
- Some uses for PCR
- Testing whether a particular sequence
is present
- In DNA: e.g. looking for homologous
sequences in different species
- Measuring the abundance of particular
mRNAs for gene expression studies using reverse-transcriptase PCR
- Testing which version of a particular
sequence is present (e.g. testing for cystic fibrosis)
- DNA fingerprinting
- Cloning genes (or portions of genes)
- Making recombinant DNA: constructing new
genes by putting pieces of DNA together in novel combinations
- can perform site-directed mutagenesis
in the process
- Other uses for primers
- DNA sequencing by the Sanger (dideoxy)
method requires annealing a known primer to the target DNA, then extending
it
- Spotting on microarrays
- Probes for Southern or Northern blots
General considerations for primer design
- 18-30 bases long
- Matched melting temperatures
- GC content between 45 and 55%
- No internal base-pairing (hairpins)
- No base-pairing with the other primer (primer
dimers)
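These considerations can be turned into simple checks. The sketch below is a rough screen, not a substitute for a real primer-design program: the melting temperature uses the Wallace rule (2 °C per A or T, 4 °C per G or C), which is only a crude estimate, and the hairpin and primer-dimer tests just look for short stretches of perfect complementarity. All function names, thresholds and the example primer are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(primer):
    """Percentage of G and C bases."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def wallace_tm(primer):
    """Crude Tm estimate in degrees C: 2 per A/T plus 4 per G/C."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def has_hairpin(primer, stem=4, min_loop=3):
    """True if a stretch of `stem` bases can pair with a later stretch of
    the same primer, leaving at least `min_loop` unpaired bases between."""
    p = primer.upper()
    for i in range(len(p) - 2 * stem - min_loop + 1):
        arm = p[i:i + stem]
        if reverse_complement(arm) in p[i + stem + min_loop:]:
            return True
    return False


def three_prime_dimer(p1, p2, n=4):
    """True if the last n bases of p1 can base-pair somewhere on p2."""
    return reverse_complement(p1[-n:]) in p2.upper()


primer = "ATGCGTACGTTAGCCTAGGA"  # made-up 20-mer
print(len(primer), gc_content(primer), wallace_tm(primer), has_hairpin(primer))
```

To compare a primer pair, compute `wallace_tm` for each and check that the two values are close, then run `three_prime_dimer` in both directions.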
MacVector has many powerful features for designing
primers for PCR or sequencing
Many sites allow you to design primers online
Designing primers
- simplest case is when you are simply testing for the presence of
a sequence
- then all you need to do is find good binding
sites within a reasonable distance 5' and 3' to the target.
- Designing primers to clone specific fragments
or for site-directed mutagenesis is more difficult because they must start
or finish at specific locations
- In this case it is important to test the
primers to make sure they have no hairpins or primer dimers
- You often must relax the program's settings considerably to get any results
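Testing a primer pair against a template can be sketched as a simple "in silico PCR". The version below assumes both primers match the template perfectly (real primers tolerate some mismatches, especially toward the 5' end) and that the template is linear; the function name, template and primers are all invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def pcr_product(template, fwd, rev):
    """Return the predicted product (top strand), or None if either
    primer has no exact binding site.

    fwd anneals to the bottom strand, so it appears as-is in the top strand;
    rev anneals to the top strand, so its reverse complement appears there.
    """
    t = template.upper()
    f = fwd.upper()
    rev_site = reverse_complement(rev)
    start = t.find(f)
    if start == -1:
        return None
    end = t.find(rev_site, start + len(f))  # must lie 3' to the forward primer
    if end == -1:
        return None
    return t[start:end + len(rev_site)]


template = "TTTTATGCGTACGGGGGGGGGGCCATTGCAAAAA"
product = pcr_product(template, "ATGCGTAC", "TGCAATGG")
print(product, len(product))
```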
How to amplify unknown sequence?
- Use Multiple Sequence Alignment to identify
two conserved regions (BLOCKS or motifs) in homologous
sequences (usually proteins)
- Rationale: if they are conserved in all
known sequences, we expect them to be conserved in the unknown sequence as well
- design primers that will anneal to these regions
- reverse translation: because of the degeneracy
of the genetic code, you must consider all possible ways to encode an amino acid sequence
- MacVector will perform reverse translation
- Can be done online at
- options once you have a reverse translation
- design primer to anneal to least degenerate
portion (i.e, a region with amino acids like methionine and tryptophan
that are only encoded by one codon, therefore there is only one option)
- at each degenerate position you must order
a mixture of primers, one with each base that could occur there (up to 4),
since we don't know which is correct
- in the example below (which is a degenerate
primer which we successfully used in my lab) we needed to order primers
in which half had an A at position 3 and the other half had G, half
had an A at position 6 and the other half had G, 1/4 had an A at position
9, 1/4 had C, 1/4 had G, and 1/4 had T, etc. (this is easy to do when
ordering)
- overall, we had 256 different DNA
sequences, therefore we say that this is 256 fold degenerate.
- degeneracy is a nuisance; only 1 primer
in the mix is a perfect match to your target sequence, but many are
only off by one base, so designing the protocols (and getting them
to work) is tricky
- CODEHOP is designed to reduce the problems
of degeneracy: http://blocks.fhcrc.org/blocks/codehop.html
- the primers have a short degenerate portion at the 3’ end (10-12 bases)
and a "clamp" at the 5’ end that uses the most
likely codon for each amino acid
- the 5' end is the portion that will tolerate
a mismatch
- Use Blockmaker to make Blocks
- Select “Codehop” in output
window
- Adjust settings until you get results
- Start with degeneracy & temperature
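The arithmetic behind "fold degeneracy" can be sketched as follows. The code assumes primers written with IUPAC ambiguity codes (R = A or G, Y = C or T, N = any base, etc.) and the standard genetic code; the function names and the short example primer are hypothetical and are not the primer used in the lab.

```python
from itertools import product
from math import prod

# Bases represented by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Number of codons per amino acid in the standard genetic code.
CODONS_PER_AA = {
    "M": 1, "W": 1,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "I": 3,
    "V": 4, "P": 4, "T": 4, "A": 4, "G": 4,
    "L": 6, "S": 6, "R": 6,
}


def primer_degeneracy(primer):
    """Number of distinct sequences in a degenerate primer mix."""
    return prod(len(IUPAC[b]) for b in primer.upper())


def expand(primer):
    """List every sequence in the mix (only sensible for low degeneracy)."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in primer.upper()))]


def peptide_degeneracy(peptide):
    """Number of DNA sequences that could encode this peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide.upper())


# Hypothetical primer for the peptide K-E-G: AAR GAR GGN
print(primer_degeneracy("AARGARGGN"))   # 2 * 2 * 4
print(peptide_degeneracy("KEG"))
print(peptide_degeneracy("MW"))         # Met and Trp have a single codon each
```

Comparing `peptide_degeneracy` across the candidate regions of a conserved block is a quick way to find the least degenerate place to put a primer.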
Cloning in silico
Before constructing a piece of recombinant DNA, it is useful to construct
it in silico. This lets you
- ensure that your strategy will work (e.g.,
the primers you design will work if you're doing it by PCR, and the restriction
enzymes you choose only cut where they're supposed to and create the correct
sorts of overhangs)
- check that the recombinant DNA encodes the desired protein
with no mutations at the junctions
- design strategies to identify clones expressing
the correct recombinant molecules
STEPS
1) Identifying enzymes which do not cut your
sequence, but do cut your vector (plasmids and viruses commonly used as cloning
vectors typically have a "Multiple Cloning Site" which contains the
sequences recognized by a number of different restriction enzymes lined up nose
to tail)
2) Designing primers to amplify your sequence
RULES:
- need 3 C:G pairs at the 5’ end to control "breathing":
DNA constantly snaps open and shut at the ends of linear molecules
- introduce the restriction site for cloning 5’
to the “real” sequence
- add start/stop codons or other mutations near the 5’
end
- anchor the primer with G:C pairs
- try to balance Tm for the 5’ and 3’ primers, and aim for
60 °C
3) Testing your primers
MacVector: Analyze | Primers | test PCR primer
pair
- paste your primers into the windows
- The file containing the DNA that these primers anneal
to must be open!
Can also test online:
4) “Cloning” your amplified sequence
- find the chosen sites in vector and sequence
- copy the amplified sequence starting at 5’
restriction site and ending at 3’ restriction site
- replace the vector Multiple Cloning Site
from 5’ to 3’ site with chosen sequence
5) Testing your recombinant DNA
- print the map of your new molecule and check
that it is the correct size and that restriction sites are where they belong
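Steps 1, 4 and 5 can be sketched as plain string operations. This is a simplified illustration: it treats the vector as a linear string, ignores overhangs, and, following the copy-and-replace description above, swaps everything from the start of the 5’ site to the end of the 3’ site. The enzyme table, function names, and the tiny "vector" and "amplicon" are made up for the example.

```python
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def count_sites(seq, site):
    """Number of (possibly overlapping) occurrences of site in seq."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - len(site) + 1)
               if seq.startswith(site, i))


def usable_enzymes(insert, vector, enzymes=ENZYMES):
    """Step 1: enzymes that do not cut the insert and cut the vector once."""
    return [name for name, site in enzymes.items()
            if count_sites(insert, site) == 0 and count_sites(vector, site) == 1]


def span(seq, site5, site3):
    """Indices from the start of the 5' site to the end of the 3' site."""
    a = seq.find(site5)
    b = seq.find(site3, a + len(site5)) if a != -1 else -1
    if a == -1 or b == -1:
        raise ValueError("both sites must be present, 5' site first")
    return a, b + len(site3)


def clone_in_silico(vector, amplicon, site5, site3):
    """Step 4: replace the vector's site5..site3 region with the amplicon's."""
    va, vb = span(vector.upper(), site5, site3)
    ia, ib = span(amplicon.upper(), site5, site3)
    return vector.upper()[:va] + amplicon.upper()[ia:ib] + vector.upper()[vb:]


coding = "ATGAAATAA"                            # toy coding sequence
vector = "AAAAGAATTCCCCGGATCCTTTT"              # toy vector with a 2-site MCS
amplicon = "GCGGAATTC" + coding + "GGATCCCGC"   # PCR product with added sites

print(usable_enzymes(coding, vector))
recombinant = clone_in_silico(vector, amplicon, "GAATTC", "GGATCC")
print(recombinant, len(recombinant))            # step 5: check the size
```

From here, checking that the junctions are intact is a matter of translating the recombinant sequence in the expected frame and comparing it with the desired protein.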
