bioinformatics

 

 
 

week 7 lecture

Finding genes

Finding genes in prokaryotic DNA is relatively easy

  • 85-88% of the nucleotides are associated with coding sequence in the bacterial genomes that have been completely sequenced.
    • in Escherichia coli there are 4288 genes that have an average of 950 bp of coding sequence and are separated by an average of just 118 bp.
  • Prokaryotes have short, simple promoters that are easy to recognize
  • Transcriptional terminators often consist of short inverted repeats followed by a run of Ts.
  • Therefore, programs that find prokaryotic genes search for:
    1. ORFs 60 or more codons long
    2. promoters at the 5' end
    3. terminators at the 3' end
    4. homology to known genes from other prokaryotes
    5. Shine-Dalgarno sequences
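The first criterion in the list above (ORFs of 60 or more codons) can be sketched in a few lines of Python. This is only a minimal illustration, not how GLIMMER or any other program listed here works: the function names are made up for this example, only ATG is treated as a start codon, and coordinates for the minus strand are reported on the reverse-complemented sequence.

```python
# Minimal ORF scan: ATG ... stop codon, in all six reading frames.
# Assumptions: sequence contains only A, C, G, T; ATG is the only start codon.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """Return a list of (strand, start, end, n_codons) tuples.

    start/end are 0-based positions on the scanned strand (end is exclusive
    and includes the stop codon); n_codons counts the start codon but not
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3    # codons before the stop
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None                   # look for the next ATG
    return orfs


# Toy example: a start codon, 60 alanine codons, then a stop codon.
toy = "ATG" + "GCT" * 60 + "TAA"
print(find_orfs(toy))
```

A real gene finder would combine hits like these with the other signals in the list (promoters, terminators, homology, Shine-Dalgarno sequences) rather than rely on ORF length alone.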

    GLIMMER (available at http://www.tigr.org/software/glimmer/ ) uses interpolated Markov models to identify the coding regions of prokaryotic genomes and distinguish them from noncoding DNA.

    GENEMARK (available at http://opal.biology.gatech.edu/GeneMark/ ) assesses the protein-coding potential of a prokaryotic DNA sequence using Markov models of coding and non-coding regions.
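To give a feel for what "Markov models of coding and non-coding regions" means, here is a toy sketch in Python. It is not GeneMark's algorithm (real programs use higher-order, reading-frame-aware models trained on large data sets); it only shows the basic idea of scoring a sequence by its log-odds under two first-order Markov chains. The training sequences, function names and pseudocount are all invented for the illustration.

```python
import math

BASES = "ACGT"


def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate log P(next base | previous `order` bases) from training sequences."""
    counts = {}
    for seq in seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            row = counts.setdefault(context, dict.fromkeys(BASES, pseudocount))
            row[base] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {b: math.log(c / total) for b, c in row.items()}
    return model


def log_likelihood(seq, model, order=1):
    """Sum of log transition probabilities; unseen contexts fall back to uniform."""
    uniform = math.log(0.25)
    total = 0.0
    for i in range(order, len(seq)):
        row = model.get(seq[i - order:i])
        total += row[seq[i]] if row else uniform
    return total


def coding_log_odds(seq, coding_model, noncoding_model, order=1):
    """Positive values favour the coding model, negative the non-coding model."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, noncoding_model, order))


# Hypothetical toy training data: GC-rich "coding", AT-rich "non-coding".
coding_model = train_markov(["ATGGCGGCGCTGGCGGCCGCG"])
noncoding_model = train_markov(["TTTAAATATTTAAATTATAAT"])

print(coding_log_odds("GCGGCGCTGGCG", coding_model, noncoding_model))
print(coding_log_odds("TTTAAATATTTA", coding_model, noncoding_model))
```

With these toy models the first query scores above zero and the second below zero; with real genomes the two models must of course be trained on known coding and non-coding sequence from the organism of interest.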

    EASYGENE (available at http://www.cbs.dtu.dk/services/EasyGene/ ) uses an artificial intelligence approach to identify genes in prokaryotic genomes. It first generates a training set of genes by finding a set of known genes in the organism, then uses this training set to make a hidden Markov model of this organism's genome. Putative genes are then scored with the HMM.

    FGENESB (available at http://www.softberry.com/berry.phtml?topic=fgenesb ) uses pattern recognition of different types of signals and Markov chain models of coding regions to find genes in prokaryotic DNA.

    BPROM (available at http://www.softberry.com/berry.phtml?topic=bprom ) identifies bacterial promoters and transcription start sites.

NNPP / Prokaryotic (available at http://www.fruitfly.org/seq_tools/promoter.html ) uses neural networks to find prokaryotic (and eukaryotic) promoters.

TRANSTERM (available at http://www.tigr.org/software/transterm.html ) finds rho-independent transcription terminators in bacterial genomes.
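The terminator pattern described earlier (a short inverted repeat followed by a run of Ts) can be searched for with a brute-force scan. The sketch below is an assumption-laden illustration and not TRANSTERM's method: it requires a perfect stem, and the stem length, loop sizes and number of Ts are arbitrary example values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(seq, stem=6, loop_range=(3, 8), min_ts=4):
    """Return (start, end, left_arm, loop_length) for each candidate hairpin.

    A hit is a `stem`-base left arm, a loop of loop_range[0]..loop_range[1]
    bases, a right arm that is the exact reverse complement of the left arm,
    and then at least `min_ts` consecutive Ts. All thresholds are
    illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    last_start = len(seq) - 2 * stem - loop_range[0] - min_ts
    for i in range(last_start + 1):
        left = seq[i:i + stem]
        for loop in range(loop_range[0], loop_range[1] + 1):
            j = i + stem + loop                      # start of the right arm
            right = seq[j:j + stem]
            tail = seq[j + stem:j + stem + min_ts]
            if right == reverse_complement(left) and tail == "T" * min_ts:
                hits.append((i, j + stem + min_ts, left, loop))
    return hits


# Toy sequence: GCCGCC / 4-base loop / GGCGGC, followed by a run of Ts.
toy = "AAGC" + "GCCGCC" + "TTCG" + "GGCGGC" + "TTTTTT" + "AA"
print(find_terminators(toy))
```

Real terminator finders tolerate mismatches in the stem and score hairpin stability, so this exact-match version will miss many true terminators.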

RBSFINDER (available at ftp://ftp.tigr.org/pub/software/RBSfinder/ ) finds Shine-Dalgarno ribosome binding sites in bacterial genomes.
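A Shine-Dalgarno search can be illustrated as a simple motif scan upstream of a candidate start codon. This is a rough sketch, not RBSfinder's algorithm: the consensus AGGAGG, the 20-base search window and the one-mismatch tolerance are assumptions chosen for the example, and the function name is hypothetical.

```python
def best_sd_site(seq, start_codon_pos, motif="AGGAGG", window=20, max_mismatches=1):
    """Return (position, mismatches, site) for the best motif match that lies
    entirely within `window` bases upstream of the start codon, or None."""
    seq = seq.upper()
    lo = max(0, start_codon_pos - window)
    best = None
    for i in range(lo, start_codon_pos - len(motif) + 1):
        site = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(site, motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (i, mismatches, site)
    return best


# Toy example: a perfect AGGAGG site 7 bases upstream of the ATG at position 17.
toy = "TTTTAGGAGGTTTTTTTATGAAA"
print(best_sd_site(toy, toy.index("ATG")))
```

In practice the start codon positions would come from an ORF or gene finder, and the presence of a nearby ribosome binding site is used to choose between alternative start codons.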

By contrast, finding genes in eukaryotic DNA is like looking for a piece of hay in a haystack, since the genes are interspersed with non-coding sequence and are frequently interrupted by introns. Therefore, the general approach is to look for specific patterns that are associated with genes, such as promoters, splice sites, and altered base composition or organization. If enough of these features are found at suitable distances from each other, the conclusion is that a gene must be present.

A very good overview of various programs used for finding genes is provided at

http://www.cs.jhu.edu/~salzberg/appendixa.html

A comprehensive list of links to various gene-finding programs is posted at

http://linkage.rockefeller.edu/wli/gene/programs.html

George Sen has provided a nice overview of gene-finding methodologies. The pdf can be downloaded at

http://cmgm.stanford.edu/biochem218/Projects%202002/Sen.pdf

Finding Promoters

Many programs search for promoters, since these are the sequences that control how a gene is regulated. Promoters consist of binding sites for a number of transcription factors, and the types, numbers and arrangements of these binding sites determine when, where and how actively a gene will be transcribed. Every gene therefore has a promoter 5' to the coding sequence, and promoters tend to have certain features in common.

  • Nearly all eukaryotic promoters have a TATAA-like sequence 30 bp 5' to the transcription start
  • many have a CCAAT-like sequence about 75 bp 5' to the transcription start
  • promoters are often assembled from combinations of a limited number of enhancer elements (= binding sites for transcription factors), just as proteins are assembled from a limited number of amino acids
    • differences between promoters are due to the enhancers present and the order in which they are assembled
  • Therefore many programs look for specific sequences associated with promoters
    • clusters of transcription factor binding sites
  • Other programs use artificial intelligence approaches to recognize sequence patterns that are associated with promoters.
    • An important clue used by many programs is that the dinucleotide 5'-CG-3' is greatly underrepresented in the human genome: it occurs at only 20% of the frequency predicted by chance. However, CG dinucleotides are about 5x more abundant (i.e., at the level predicted by chance) near the 5' end of known genes, from ~1500 bp 5' of the transcription start to about 500 bp 3' of it. These "CpG islands" are therefore useful for identifying promoters.
  • http://www.fruitfly.org/seq_tools/promoter.html uses a neural network approach to find promoters
  • http://www.cbs.dtu.dk/services/promoter/ uses a combination of neural networks and genetic algorithms to find promoters
  • http://www.cbil.upenn.edu/tess/ identifies binding sites for transcription factors
  • http://bimas.dcrt.nih.gov/molbio/signal/ finds matches to published signal sequences in your sequence; most of these signals are transcriptional elements
  • http://argon.cshl.org/genefinder/CPROMOTER/index.htm predicts the transcription start site
  • http://cgsigma.cshl.org/CpG_promoter/ finds promoters by looking for CpG islands
  • http://rulai.cshl.org/tools/FirstEF/ looks for the promoter and first exon using an artificial intelligence approach
  • several different programs for predicting promoters can be run from http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html under "Promoter and Transcription Factor Binding Site Prediction"
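The CpG clue described above can be made concrete by comparing the observed number of CG dinucleotides in a window with the number expected from the C and G content: observed/expected = (number of CG x window length) / (number of C x number of G). The Python sketch below is only an illustration; the 200 bp window, the 0.6 ratio cutoff and the 50% G+C cutoff are commonly used CpG island criteria but are assumptions here, not values from this lecture or from any of the programs listed.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def cpg_windows(seq, window=200, step=50, min_ratio=0.6, min_gc=0.5):
    """Return (start, end, ratio) for each window that passes both cutoffs."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window].upper()
        gc = (w.count("C") + w.count("G")) / window
        ratio = cpg_obs_exp(w)
        if ratio >= min_ratio and gc >= min_gc:
            hits.append((start, start + window, round(ratio, 2)))
    return hits


# Toy sequence: an artificial CG-rich block flanked by AT-rich sequence.
toy = "AT" * 100 + "CG" * 100 + "AT" * 100
for hit in cpg_windows(toy):
    print(hit)
```

Overlapping windows that pass the cutoffs would normally be merged into a single island before looking for a gene nearby.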

Finding other sequences associated with eukaryotic genes

Last week we described programs for identifying coding sequences.

  • The simplest look for a start codon followed by at least 60 codons in the same frame before encountering a stop codon.
  • More sophisticated programs also look for nucleotide biases associated with coding regions
    • differences in nucleotide abundance
    • differences in nucleotide position: e.g., because coding regions are read in codons, their base composition tends to show a pattern that repeats every third base.
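One simple way to see the position-dependent bias mentioned above is to tabulate base frequencies separately for the first, second and third positions of each codon in a chosen reading frame. The sketch below is an illustration only: the function names and the crude "asymmetry" score are invented for this example and are not taken from any particular gene-finding program.

```python
from collections import Counter


def codon_position_composition(seq, frame=0):
    """Return three dicts giving base frequencies at codon positions 1, 2 and 3."""
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    table = []
    for pos in range(3):
        counts = Counter(seq[pos:usable:3])
        total = sum(counts.values()) or 1     # avoid dividing by zero
        table.append({b: counts[b] / total for b in "ACGT"})
    return table


def position_asymmetry(seq, frame=0):
    """Crude score: for each base, the spread (max - min) of its frequency
    across the three codon positions, summed over the four bases.
    Sequences with no codon structure should score near zero."""
    table = codon_position_composition(seq, frame)
    return sum(max(t[b] for t in table) - min(t[b] for t in table) for b in "ACGT")


# Toy examples: a perfectly periodic sequence versus a homopolymer.
print(position_asymmetry("ATGATGATG"))
print(position_asymmetry("AAAAAA"))
```

Computing the score in all three frames of a window and comparing them is one way to guess which frame, if any, is coding.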

Most higher eukaryotic genes are spliced, and many programs have been written that identify splice sites, i.e., intron/exon junctions.

Recognizing poly-Adenylation signals allows you to identify the 3' end of a gene. This can be done online at these sites.

Finding other types of sequences

Matrix attachment sites are characteristic sequences where the chromosome is attached to the nuclear matrix.

Repeated sequences are a problem for many types of sequence analysis

Finding Genes

Many different programs that search for genes in eukaryotic genomic sequences can be accessed online.

The first step is to look for sequences similar to genes identified in other organisms. For example, OTTO goes through a genome and identifies genes that closely match known human genes (based on BLAST alignments). It then compares the genome with databases of ESTs, proteins and gene sequences from other organisms.

Next, most programs look for combinations of several different features associated with genes, such as promoters, start sites, ORFs, etc., in reasonable proximity.

  • If a promoter is found with a translation start, a splice site, etc. reasonably nearby, then the region is called a gene.

Each program assigns different weights to the various factors. For example, last year we searched the same 50,000 bp of raw human DNA with 7 different programs

  • each one found a different number of genes!
    • all agreed on two (which were the only two actually identified by GenBank as residing on this piece of DNA)
    • all found variable numbers of extra genes

Therefore, you should always have several different programs analyze the same piece of DNA!

  • GenBank typically uses three
  • TIGR provides an application COMBINER (available from http://www.tigr.org/software/ ) that uses a voting scheme to combine the predictions of 3 or more gene finders and produce a single best prediction. It is compatible with GlimmerM, Genscan, FGenes, GRAIL, and GeneMark.HMM.
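The idea of combining gene finders by voting can be sketched as a per-base majority vote. This is not TIGR's COMBINER algorithm, whose details are not described here; it is a minimal illustration in which each program reports predicted gene intervals and a position is kept if at least `min_votes` programs cover it. The program names and coordinates are hypothetical.

```python
def vote(predictions, length, min_votes=2):
    """Combine interval predictions from several programs.

    predictions: dict mapping a program name to a list of (start, end)
    intervals (0-based, end exclusive). Returns the merged intervals in
    which at least `min_votes` programs agree.
    """
    votes = [0] * length
    for intervals in predictions.values():
        covered = set()                      # each program votes once per base
        for start, end in intervals:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:
            votes[pos] += 1

    consensus, start = [], None
    for pos, v in enumerate(votes):
        if v >= min_votes and start is None:
            start = pos                      # a consensus region opens
        elif v < min_votes and start is not None:
            consensus.append((start, pos))   # the region closes
            start = None
    if start is not None:
        consensus.append((start, length))
    return consensus


# Hypothetical output from three gene finders on a 300 bp sequence.
predictions = {
    "finder_a": [(10, 50), (80, 90)],
    "finder_b": [(20, 60)],
    "finder_c": [(15, 45), (200, 250)],
}
print(vote(predictions, length=300, min_votes=2))
print(vote(predictions, length=300, min_votes=3))
```

Raising `min_votes` trades sensitivity for specificity, which mirrors the observation above that all programs agreed on the real genes but each also reported extra ones.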

User’s instructions from http://cmgm.stanford.edu/classes/genefind/

  1. Remove repetitive elements (ALUs, etc.)
  2. Database Search on Translated DNA (BlastX or TFasta)
  3. ORF Gene Finding Search (Grail, GenScan, etc)
  4. Translate putative ORFs and do Functional Analysis (Blocks, Motifs, etc)
  5. Always have more than one program analyze your data.

Online Gene-finding websites (Note that these come from all over the world!)

http://www.tigr.org/software/

  • GlimmerM
  • EXONomy
  • Unveil

http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html

http://www.softberry.com/berry.phtml?topic=gfind

http://genes.mit.edu/GENSCAN.html

http://compbio.ornl.gov/grailexp/

http://www.cbs.dtu.dk/services/

http://www.fruitfly.org/seq_tools/genie.html

http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi

http://arete.ibb.waw.pl/PL/html/gene_lang.html

http://www1.imim.es/geneid.html

http://www.itba.mi.cnr.it/webgene/

 




Last update: Friday, February 28, 2003 at 9:10:16 AM.