bioinformatics

 

week 6 lecture

Sequence Analysis

Frequently you wish to search raw sequence data for various features. In some cases (the easiest) the search is for a specific sequence. A related but more difficult problem is identifying patterns within sequences.

One common search is for the sites cut by various restriction enzymes for mapping, or subcloning portions of the sequence. For example, we frequently wish to subclone the coding sequence into a plasmid in order to express the encoded protein in bacteria.

  • MacVector
    • allows you to find which sites are present (and filter according to criteria such as number of cuts, size of recognition sequence, etc.)
    • also predicts size of the resulting fragments when you cut with various enzymes
  • TACG in the "Nucleic Tools" at Biology workbench
    • also allows you to find which sites are present (and filter according to criteria such as number of cuts, size of recognition sequence, etc.)
    • also predicts size of the resulting fragments when you cut with various enzymes
  • http://www.firstmarket.com/cutter/cut2.html
    • analyses sequences you cut and paste in
    • also allows you to import sequences directly from GenBank
    • not as many features or as user-friendly as MacVector or TACG
  • http://www.arabidopsis.org/cgi-bin/patmatch/RestrictionMapper.pl
    • analyses sequences you cut and paste in
    • also allows you to analyze Arabidopsis sequences by simply typing in the locus or gi number
    • not as many features or as user-friendly as MacVector or TACG
  • http://biochem.roche.com/fst/products.htm?/benchmate/
    • tells whether a sequence you enter contains a restriction site
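
As a rough illustration of what these tools do under the hood, here is a minimal Python sketch that finds restriction sites and predicts fragment sizes for a linear sequence. The enzyme table holds only three well-known enzymes, and the function names and the demo sequence are made up for this example; real programs handle hundreds of enzymes, circular molecules and ambiguous bases.

```python
import re

# Recognition site and top-strand cut offset (bases from the start of the site).
# All three sites are palindromic, so searching one strand is enough.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}

def find_cut_positions(seq, site, offset):
    """Return 0-based positions where the top strand of seq is cut."""
    seq = seq.upper()
    # The lookahead lets overlapping sites be reported as well.
    return [m.start() + offset for m in re.finditer(f"(?={site})", seq)]

def fragment_sizes(seq, cuts):
    """Sizes of the fragments made by cutting a linear sequence at cuts."""
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Hypothetical 34-base test sequence with two EcoRI sites and one BamHI site.
demo = "AAAAGAATTCTTTTTTGGATCCAAAAGAATTCGG"
for name, (site, offset) in ENZYMES.items():
    cuts = find_cut_positions(demo, site, offset)
    print(name, "cuts:", len(cuts), "fragments:", fragment_sizes(demo, cuts))
```

Filtering "according to number of cuts" then amounts to keeping only the enzymes whose list of cut positions has the length you want (zero for enzymes that must not cut your insert).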

Another common search is for the sites cut by various proteolytic enzymes (e.g. trypsin or papain) or various chemicals in order to do peptide mapping

  • MacVector provides the same suite of features for proteolytic enzymes as for restriction enzymes
    • allows you to find which sites are present (and filter according to criteria such as number of cuts, size of recognition sequence, etc.)
    • also predicts size of the resulting fragments
  • You can perform a similar analysis online at http://us.expasy.org/tools/peptidecutter/
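
The same idea works for proteins. The sketch below digests a protein with trypsin using the usual simplified rule (cut after K or R unless the next residue is P). The function name and the test peptide are invented for illustration, and the real tools model many more enzymes and exceptions.

```python
def trypsin_digest(protein):
    """Split a protein after every K or R that is not followed by P."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever is left after the last cut is the C-terminal peptide.
        peptides.append(protein[start:])
    return peptides

# Hypothetical peptide: the R at position 6 is followed by P, so it is not cut.
peptides = trypsin_digest("MKTAYRPLLKGR")
print(peptides, [len(p) for p in peptides])
```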

Another common problem is searching for the DNA sequences that encode proteins

Many programs allow you to detect Open Reading Frames (ORFs): sequences that run from a potential start codon to a stop codon. You specify the minimum length, and the genetic code to use.
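
A minimal sketch of an ORF finder, assuming the standard genetic code, ATG as the only start codon, and a scan of the given strand only (run it again on the reverse complement to cover the other three frames). The function name and the `min_codons` parameter are choices made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on this strand.

    start is 0-based, end is exclusive and includes the stop codon;
    min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first potential start codon in this stretch
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Hypothetical sequence with one short ORF (ATG AAA CCC GGG TAA) in frame 2.
print(find_orfs("CCATGAAACCCGGGTAACC"))
```

Raising `min_codons` is how you suppress the many short ORFs that occur by chance in any sequence.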

Determining the actual coding sequence is trickier, and often can only be done experimentally.

In mRNA the start codon often isn't the first AUG:

  • in bacteria it is the first AUG 3' to the Shine-Dalgarno sequence (and sometimes it isn't even AUG!)
  • in eukaryotes it is often the first AUG that sits in a Kozak consensus sequence
  • in eukaryotes the coding sequence in the genomic DNA is often interrupted by introns

MacVector allows you to search for coding regions using Fickett's algorithm, which looks for biases in base composition and base position (these are known to differ between coding and non-coding regions).

NetStart (http://www.cbs.dtu.dk/services/NetStart/) is a program that uses a form of artificial intelligence called neural networks to predict start sites in mRNA sequences. I'll explain how these work later in the course; basically, you train the program to recognize certain types of sequences using training sets of known start sites, then turn it loose on unknown sequences.

  • it can also be used to find start sites in genomic DNA sequences, although it may be fooled by genes that have introns near the translation start (and many do)

Many programs will translate RNA (or DNA) sequence into the protein it encodes. Since you often don't know which strand to read or which frame it is in, they will translate all six frames (3 on the top strand and 3 on the bottom).
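
Six-frame translation can be sketched in a few lines of Python. This version assumes the standard genetic code, builds the codon table from the conventional TCAG ordering, and writes "X" for any codon containing an ambiguous base; the function names and frame labels are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate from the first base; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(seq):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') to protein."""
    dna = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```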

Another common problem is generating the reverse, the complement, or the reverse complement of a sequence (for example, to read the bottom strand 5' to 3').
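
A minimal Python sketch of the three operations, assuming plain DNA (A, C, G, T) and preserving upper or lower case; the function names are made up for this example.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse(seq):
    return seq[::-1]

def complement(seq):
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    # The bottom strand, written 5' to 3'.
    return complement(seq)[::-1]

print(reverse("ATGC"), complement("ATGC"), reverse_complement("ATGC"))
```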

Many programs allow you to design primers: short pieces of single-stranded DNA that should anneal to specific target sequences.

These are some of the most widely used programs in molecular biology, because primers are used for many different purposes.

  • PCR is one of the most widely used procedures in all of molecular biology, and successful PCR requires good primers! Amplifying a specific piece of DNA by PCR requires two primers: one that anneals to the target site on the Watson strand and another that anneals to the target site on the Crick strand.
[Figure: PCR]
  • Some uses for PCR

    • Testing whether a particular sequence is present
      • In DNA: e.g. looking for homologous sequences in different species
      • Measuring the abundance of particular mRNAs for gene expression studies using reverse-transcriptase PCR
    • Testing which version of a particular sequence is present (e.g. testing for cystic fibrosis)
    • DNA fingerprinting
    [Figure: STR]
    • Cloning genes (or portions of genes)
    • Making recombinant DNA: constructing new genes by putting pieces of DNA together in novel combinations
      • can perform site-directed mutagenesis in the process
  • Other uses for primers
    • DNA sequencing by the Sanger (dideoxy) method requires annealing a known primer to the target DNA, then extending it
[Figure: sequencing]
    • Spotting on microarrays
    • Probes for Southern or Northern blots
[Figure: Southern blot]

General considerations for primer design

  1. 18-30 bases long
  2. Matched melting temperatures
  3. GC content between 45-55%
  4. No internal base-pairing (hairpins)
  5. No base-pairing with the other primer (primer dimers)
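
The five considerations above can be turned into a crude automatic check. The sketch below is only an approximation: it estimates Tm with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is a rough guide for short oligos, and it flags hairpins and primer dimers whenever any 4-base stretch could pair with another part of the primer or with the other primer. The 4-base window, the 5 °C Tm tolerance and all function names are assumptions made for this example; programs like MacVector use far better thermodynamic models.

```python
def revcomp(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def gc_percent(primer):
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)

def wallace_tm(primer):
    """Wallace rule estimate of melting temperature in degrees C."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def can_pair(a, b, k=4):
    """True if any k-base stretch of a could base-pair somewhere in b."""
    a, b = a.upper(), b.upper()
    return any(revcomp(a[i:i + k]) in b for i in range(len(a) - k + 1))

def check_pair(fwd, rev, max_tm_diff=5):
    report = {
        "tm_matched": abs(wallace_tm(fwd) - wallace_tm(rev)) <= max_tm_diff,
        "primer_dimer_risk": can_pair(fwd, rev),
    }
    for name, p in (("fwd", fwd), ("rev", rev)):
        report[name] = {
            "length_ok": 18 <= len(p) <= 30,
            "gc_ok": 45 <= gc_percent(p) <= 55,
            "tm": wallace_tm(p),
            "hairpin_risk": can_pair(p, p),
        }
    return report

# Two made-up 20-mers, not real primers.
print(check_pair("ATGCATGCATGCATGCATGC", "AAAAAAAAAACCCCCCCCCC"))
```

A check this strict flags many usable primers, so treat a "risk" as a prompt to look more closely rather than as a verdict.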

MacVector has many powerful features for designing primers for PCR or sequencing

Many sites allow you to design primers online

Designing primers

  • simplest case is when you are simply testing for the presence of a sequence
    • then all you need to do is find good binding sites within a reasonable distance 5' and 3' to the target.
  • Designing primers to clone specific fragments or for site-directed mutagenesis is more difficult because they must start or finish at specific locations
    • In this case it is important to test the primers to make sure they have no hairpins or primer dimers
    • Often must modify settings dramatically to get results

How to amplify unknown sequence?

  • Use Multiple Sequence Alignment to identify two conserved regions (BLOCKS or motifs) in homologous sequences (usually proteins)
    • Rationale: if they are conserved in all known sequences, expect them to be conserved in the unknown sequence as well
  • design primers that will anneal to these regions
  • options once you have a reverse translation
    • design primer to anneal to the least degenerate portion (i.e., a region with amino acids like methionine and tryptophan that are encoded by only one codon, so there is only one option)
    • at degenerate positions you must order a mixture of primers, one with each possible base, since we don't know which is correct
      • in the example below (which is a degenerate primer which we successfully used in my lab) we needed to order primers in which half had an A at position 3 and the other half had G, half had an A at position 6 and the other half had G, 1/4 had an A at position 9, 1/4 had C, 1/4 had G, and 1/4 had T, etc. (this is easy to do when ordering)
      • overall, we had 256 different DNA sequences; therefore we say that this primer is 256-fold degenerate.
      • degeneracy is a nuisance; only 1 primer in the mix is a perfect match to your target sequence, but many are only off by one base, so designing the protocols (and getting them to work) is tricky
    [Figure: degenerate primer example]
    • CODEHOP is designed to reduce the problems of degeneracy: http://blocks.fhcrc.org/blocks/codehop.html
      • have short degenerate portion at 3’ end (10-12 bases), then at 5’ end have a "clamp" that picks the most likely codon for each amino acid
        • 5' end is portion that will tolerate a mismatch
      • Use Blockmaker to make Blocks
      • Select “Codehop” in output window
      • Adjust settings until you get results
        • Start with degeneracy & temperature
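
The fold-degeneracy arithmetic described above is just the product of the number of possible bases at each position. The sketch below uses the standard IUPAC ambiguity codes (R = A or G, Y = C or T, N = any base, and so on). The primer shown is hypothetical, chosen so that positions 3 and 6 are A/G and position 9 is any base as in the description above; it is not the primer used in the lab.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degeneracy(primer):
    """Number of distinct sequences in a degenerate primer mix."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total

def expand(primer):
    """List every concrete sequence in the mix (only sensible for small mixes)."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in primer.upper()))]

hypothetical = "GARGARGGNAAYCCNGAYTGG"  # 2 x 2 x 4 x 2 x 4 x 2 possibilities
print(degeneracy(hypothetical))
print(expand("GARGGN"))
```

Comparing `degeneracy()` for several candidate regions is a quick way to pick the least degenerate place to put a primer.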

Cloning in silico

Before constructing a piece of recombinant DNA, it is useful to construct it in silico

  • ensuring that your strategy will work (e.g., the primers you design will work if you're doing it by PCR, and the restriction enzymes you choose only cut where they're supposed to and create the correct sorts of overhangs)
  • the recombinant DNA encodes the desired protein with no mutations at the junctions
  • designing strategies to identify clones expressing the correct recombinant molecules

STEPS

1) Identifying enzymes which do not cut your sequence, but do cut your vector (plasmids and viruses commonly used as cloning vectors typically have a "Multiple Cloning Site" which contains the sequences recognized by a number of different restriction enzymes lined up nose to tail)

[Figure: pBS]

2) Designing primers to amplify your sequence

RULES:

  • need 3 C:G pairs at 5’ end to control "breathing": DNA constantly snaps open and shut at the ends of linear molecules
  • introduce restriction site for cloning 5’ to “real” sequences
  • add start/stop or other mutations near 5’ end
  • anchor primer with G:C pairs
  • try to balance Tm for 5’ and 3’ primers, and aim for 60 °C
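
A minimal sketch of how the first two rules translate into primer sequences: each primer is a G/C clamp, then the restriction site, then the part that actually anneals to the target. It assumes palindromic restriction sites (so the site reads the same on both primers), and the "GCG" clamp, the annealing length, the function name and the target sequence are all choices made for this example. Tm balancing is not attempted here.

```python
def revcomp(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def build_cloning_primers(target, site5, site3, anneal_len=20, clamp="GCG"):
    """Return (forward, reverse) primers, both written 5' to 3'.

    target is the top strand of the region to amplify; site5 and site3 are
    palindromic restriction sites to add at its 5' and 3' ends.
    """
    target = target.upper()
    forward = clamp + site5 + target[:anneal_len]
    # The reverse primer anneals to the top strand, so its annealing part is
    # the reverse complement of the last anneal_len bases of the target.
    reverse = clamp + site3 + revcomp(target[-anneal_len:])
    return forward, reverse

# Hypothetical target with EcoRI (GAATTC) and BamHI (GGATCC) sites added;
# anneal_len is kept unrealistically short so the output is easy to read.
fwd, rev = build_cloning_primers("ATGAAACCCGGGTTTTAA", "GAATTC", "GGATCC",
                                 anneal_len=6)
print(fwd, rev)
```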

3) Testing your primers

MacVector: Analyze | Primers | test PCR primer pair

  • paste your primers into windows
  • File containing DNA these primers anneal to must be open!
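
What a primer-pair test does can be sketched very simply: find where the forward primer matches the top strand, find where the reverse primer's reverse complement matches downstream, and report what lies between. This sketch assumes exact matches only, so primers with added 5' tails (clamps or restriction sites) would have to be tested using just their annealing portion; the function name and the sequences are invented for this example.

```python
def revcomp(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def pcr_product(template, fwd, rev):
    """Return the predicted amplicon (top strand), or None if a primer fails."""
    t = template.upper()
    start = t.find(fwd.upper())
    if start == -1:
        return None  # forward primer does not anneal
    rev_site = revcomp(rev)  # where the reverse primer binds, read on the top strand
    end = t.find(rev_site, start)
    if end == -1:
        return None  # reverse primer does not anneal downstream
    return t[start:end + len(rev_site)]

template = "CCCATGAAACCCGGGTTTTAACCC"  # hypothetical template
product = pcr_product(template, "ATGAAA", "TTAAAA")
print(product, len(product))
```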

Can also test online:

4) “Cloning” your amplified sequence

  • find the chosen sites in vector and sequence
  • copy the amplified sequence starting at 5’ restriction site and ending at 3’ restriction site
  • replace the vector Multiple Cloning Site from 5’ to 3’ site with chosen sequence

5) Testing your recombinant DNA

  • print the map of your new molecule and check that it is the correct size and that restriction sites are where they belong
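
Steps 4 and 5 can be sketched at the level of plain strings: take the amplified sequence from the 5' site through the 3' site, swap it in for the matching stretch of the vector, then check the size of the result. This ignores overhangs and strand details (it relies on both molecules being cut with the same two enzymes, so both sites are regenerated at the junctions) and insists that each site occurs exactly once in each molecule. The function name and all sequences are made up for this example.

```python
def clone_in_silico(vector, insert, site5, site3):
    """Replace the vector stretch from site5 through site3 with the insert's."""
    vector, insert = vector.upper(), insert.upper()
    for name, seq in (("vector", vector), ("insert", insert)):
        if seq.count(site5) != 1 or seq.count(site3) != 1:
            raise ValueError(f"each site must occur exactly once in the {name}")

    def span(seq):
        start = seq.index(site5)
        # index() raises ValueError if site3 is not 3' of site5.
        end = seq.index(site3, start + len(site5)) + len(site3)
        return start, end

    v_start, v_end = span(vector)
    i_start, i_end = span(insert)
    return vector[:v_start] + insert[i_start:i_end] + vector[v_end:]

# Toy "vector" whose cloning site is EcoRI-SmaI-BamHI, and a toy PCR product
# carrying a clamp, EcoRI site, coding sequence, BamHI site and clamp.
vector = "TTTT" + "GAATTC" + "CCCGGG" + "GGATCC" + "AAAA"
pcr_product = "GCG" + "GAATTC" + "ATGAAATAA" + "GGATCC" + "CGC"
recombinant = clone_in_silico(vector, pcr_product, "GAATTC", "GGATCC")
print(recombinant, len(recombinant))
```

The length of the result should equal the vector length minus the stretch removed plus the stretch added; a mismatch means a site was found somewhere you did not expect.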




Last update: Saturday, February 22, 2003 at 1:59:09 PM.