week 7 lecture
Finding genes
Finding genes in prokaryotic DNA is relatively easy
- 85-88% of the nucleotides are
associated with coding sequence in the bacterial genomes that have been completely
sequenced.
- in Escherichia coli
there are 4288 genes that have an average of 950 bp of coding sequence
and are separated by an average of just 118 bp.
- Prokaryotes have short, simple
promoters that are easy to recognize
- Transcriptional terminators often
consist of short inverted repeats followed by a run of Ts.
Therefore, programs that find prokaryotic genes search for:
- ORFs 60 or more codons long
- Promoters at the 5' end
- Terminators at the 3' end
- Homology to known genes from other prokaryotes
- Shine-Dalgarno sequences
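The first item in the list above (ORFs of 60 or more codons) is the easiest to illustrate. What follows is only a minimal sketch, not the method of any program named in these notes. It assumes ATG as the only start codon and the three standard stop codons, and the function names (find_orfs, reverse_complement) are made up for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """Scan all six reading frames for ORFs of at least min_codons codons.

    Returns (strand, start, end, n_codons) tuples. Coordinates are 0-based,
    end-exclusive, and relative to the strand that was scanned. n_codons
    counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None
    return orfs
```

Real gene finders also accept alternative start codons such as GTG and TTG, and combine the ORF evidence with the other signals in the list.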
GLIMMER (available at http://www.tigr.org/software/glimmer/)
uses interpolated Markov models to identify
the coding regions of prokaryotic genomes and distinguish them from noncoding
DNA.
GENEMARK (available at http://opal.biology.gatech.edu/GeneMark/)
assesses the protein-coding potential of a prokaryotic DNA sequence using
Markov models of coding and non-coding regions.
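To make the idea of "Markov models of coding and non-coding regions" concrete, here is a small sketch that trains two fixed-order Markov chains and scores a sequence by its log-odds. It is a simplification and an assumption on my part: GLIMMER interpolates between several model orders and GeneMark uses frame-dependent models, neither of which is reproduced here. The function names and the pseudocount are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for context in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[context][b] for b in "ACGT") + 4 * pseudocount
        model[context] = {
            b: (counts[context][b] + pseudocount) / total for b in "ACGT"
        }
    return model


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over the sequence.

    Positive scores favour the coding model. Positions whose context or
    base is not A/C/G/T are skipped.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if context in coding and base in coding[context]:
            score += math.log(coding[context][base] / noncoding[context][base])
    return score
```

In practice the two models would be trained on known genes and known intergenic regions of the organism being analyzed, and candidate ORFs with a positive score would be kept.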
EASYGENE (available at http://www.cbs.dtu.dk/services/EasyGene/)
uses an artificial intelligence approach to identify genes in prokaryotic
genomes. It first generates a training set of genes by finding a set of
known genes in the organism, then uses this training set to make a hidden
Markov model (HMM) of this organism's genome. Putative genes are then scored with
the HMM.
FGENESB (available at http://www.softberry.com/berry.phtml?topic=fgenesb)
uses pattern recognition of different types of signals and Markov chain
models of coding regions to find genes in prokaryotic DNA.
BPROM (available at http://www.softberry.com/berry.phtml?topic=bprom)
identifies bacterial promoters and transcription start sites.
NNPP / Prokaryotic (available at http://www.fruitfly.org/seq_tools/promoter.html)
uses neural networks to find prokaryotic (and
eukaryotic) promoters.
TRANSTERM (available at http://www.tigr.org/software/transterm.html)
finds rho-independent transcription terminators in bacterial genomes.
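As noted earlier, these terminators often consist of a short inverted repeat followed by a run of Ts. The sketch below searches for exactly that pattern: a perfect hairpin stem, a short loop, then Ts. It is not TransTerm's algorithm, which scores hairpin stability; the stem length, loop sizes and T-run length are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(seq, stem_len=6, min_loop=3, max_loop=8, min_t=4):
    """Find perfect inverted repeats immediately followed by a run of Ts.

    Returns (start, end, loop_len) tuples: start is the first base of the
    left stem arm, end is one past the last T checked (0-based coordinates).
    """
    seq = seq.upper()
    hits = []
    # p is the position where the T run begins, right after the right arm
    for p in range(2 * stem_len + min_loop, len(seq) - min_t + 1):
        if seq[p:p + min_t] != "T" * min_t:
            continue
        right_arm = seq[p - stem_len:p]
        for loop in range(min_loop, max_loop + 1):
            left_start = p - 2 * stem_len - loop
            if left_start < 0:
                break
            left_arm = seq[left_start:left_start + stem_len]
            if left_arm == revcomp(right_arm):
                hits.append((left_start, p + min_t, loop))
                break
    return hits
```

Requiring a perfect stem is stricter than biology; real terminators tolerate mismatches and G-U pairs, so this sketch will miss many true sites.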
RBSFINDER (available at ftp://ftp.tigr.org/pub/software/RBSfinder/)
finds Shine-Dalgarno ribosome binding sites
in bacterial genomes.
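A simple way to picture a Shine-Dalgarno search is to look just upstream of a candidate start codon for the best match to the consensus AGGAGG. The sketch below does only that. The window size and the minimum number of matching bases are assumptions, not values taken from RBSfinder, and best_sd_site is a hypothetical name.

```python
SD_CONSENSUS = "AGGAGG"


def best_sd_site(seq, start_codon_pos, window=20, min_matches=4):
    """Find the best Shine-Dalgarno-like site upstream of a start codon.

    Scans the `window` bases 5' of start_codon_pos (0-based) and returns
    (site_pos, n_matches, spacer), where spacer is the number of bases
    between the site and the start codon. Returns None if no site reaches
    min_matches. Ties are resolved in favour of the most upstream site.
    """
    seq = seq.upper()
    size = len(SD_CONSENSUS)
    best = None
    first = max(0, start_codon_pos - window)
    for i in range(first, start_codon_pos - size + 1):
        site = seq[i:i + size]
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (i, matches, start_codon_pos - (i + size))
    return best
```

A site with a plausible spacer of a few bases supports the choice of start codon; a candidate gene with no such site may have a mis-assigned start.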
By contrast, finding genes in eukaryotic
DNA is like looking for a needle in a haystack, since the genes are interspersed
with non-coding sequence and are frequently interrupted by introns. Therefore,
the general approach is to look for specific patterns that are associated with
genes, such as promoters, splice sites, and altered base composition or organization.
If enough of these features are found at suitable distances from each other,
the program concludes that a gene is probably present.
A very good overview of various
programs used for finding genes is provided at
http://www.cs.jhu.edu/~salzberg/appendixa.html
A comprehensive list of links to
various gene-finding programs is posted at
http://linkage.rockefeller.edu/wli/gene/programs.html
George Sen has provided a nice overview
of gene-finding methodologies. The pdf can be downloaded at
http://cmgm.stanford.edu/biochem218/Projects%202002/Sen.pdf
Finding Promoters
Many programs search for promoters, since these
are the sequences that control how a gene is regulated. Promoters consist of the
binding sites for a number of transcription factors, and the types, numbers
and arrangements of these binding sites determine when, where and how actively
a gene will be transcribed. All genes have a promoter 5' to the coding
sequence, and promoters tend to have certain features in common.
- Many eukaryotic promoters have a TATAA-like sequence about 30 bp 5' to the
transcription start
- many have a CCAAT-like sequence
about 75 bp 5' to the transcription start
- promoters are often assembled from combinations
of a limited number of enhancer elements (= binding sites for transcription
factors), just as proteins are assembled from a limited number of amino acids
- differences between promoters are due to the enhancers present
and the order in which they are assembled
- Therefore many programs look for specific
sequences associated with promoters
- clusters of transcription factor binding
sites
- Other programs use artificial intelligence
approaches to recognize sequence patterns that are associated with promoters.
- An important clue used by many programs
is that the dinucleotide 5'-CG-3' is greatly underrepresented in the human
genome; it occurs at only about 20% of the frequency predicted by chance. However,
CG dinucleotides are about 5x more abundant (i.e., at the level predicted by chance) in the
vicinity of the 5' end of known genes, from ~1500 bp 5' of the transcription
start to about 500 bp 3' of the transcription start. These "CpG
islands" are therefore useful for identifying promoters.
- http://www.fruitfly.org/seq_tools/promoter.html
uses a neural network approach to find promoters
- http://www.cbs.dtu.dk/services/promoter/ uses
a combination of neural networks and genetic algorithms to find promoters
- http://www.cbil.upenn.edu/tess/
identifies binding sites for transcription factors
- http://bimas.dcrt.nih.gov/molbio/signal/ finds
matches to published signal sequences in your sequence; most of these are
transcriptional elements
- http://argon.cshl.org/genefinder/CPROMOTER/index.htm
predicts the transcription start site
- http://cgsigma.cshl.org/CpG_promoter/ finds
promoters by looking for CpG islands
- http://rulai.cshl.org/tools/FirstEF/ looks
for the promoter and first exon using an artificial intelligence approach
- several different programs for predicting
promoters can be run from http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
under "Promoter and Transcription Factor Binding
Site Prediction"
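The CpG clue described earlier in this section can be measured directly as the ratio of observed CG dinucleotides to the number expected by chance from the C and G content. The sketch below computes that ratio and slides a window along the sequence. The window size, step and thresholds (ratio of at least 0.6, GC content of at least 50%) are commonly quoted rule-of-thumb values and are my assumption here, not values from the programs listed above; the function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CG dinucleotide ratio.

    Expected count by chance = (number of C) * (number of G) / length.
    """
    seq = seq.upper()
    c, g, cg = seq.count("C"), seq.count("G"), seq.count("CG")
    if c == 0 or g == 0:
        return 0.0
    return cg * len(seq) / (c * g)


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq) if seq else 0.0


def cpg_island_windows(seq, window=200, step=50, min_ratio=0.6, min_gc=0.5):
    """Return (start, ratio, gc) for each window that looks like a CpG island."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        ratio, gc = cpg_obs_exp(w), gc_fraction(w)
        if ratio >= min_ratio and gc >= min_gc:
            hits.append((start, round(ratio, 2), round(gc, 2)))
    return hits
```

A ratio near 0.2 corresponds to the genome-wide depletion described above, while a ratio near 1.0 means CG occurs about as often as chance predicts. Overlapping windows that pass would normally be merged into a single island.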
Finding other sequences associated with eukaryotic genes
Last week we described programs for identifying coding sequences.
- The simplest look for a start codon followed
by at least 60 codons in the same frame before encountering a stop codon.
- More sophisticated programs also look for
nucleotide biases associated with coding regions
- differences in nucleotide abundance
- differences in nucleotide position: e.g.,
base composition in coding regions tends to show a pattern that repeats every third base.
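The three-base pattern mentioned in the last bullet can be seen by tabulating which bases occur at codon positions 1, 2 and 3. The sketch below does this and reduces the table to one crude number. The scoring formula is an illustrative assumption of mine, not the statistic used by any particular gene finder, and both function names are hypothetical.

```python
from collections import Counter


def codon_position_freqs(seq, frame=0):
    """Base frequencies at each of the three codon positions in one frame."""
    seq = seq.upper()[frame:]
    counts = [Counter(), Counter(), Counter()]
    for i, base in enumerate(seq):
        if base in "ACGT":
            counts[i % 3][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero on empty input
        freqs.append({b: c[b] / total for b in "ACGT"})
    return freqs


def periodicity_score(seq, frame=0):
    """Crude measure of three-base periodicity.

    For each base, take the spread (max minus min) of its frequency across
    the three codon positions, then sum over the four bases. Sequences with
    no positional bias score 0; the maximum possible value is 3.
    """
    freqs = codon_position_freqs(seq, frame)
    return sum(
        max(p[b] for p in freqs) - min(p[b] for p in freqs) for b in "ACGT"
    )
```

Coding regions tend to score higher than non-coding DNA of similar length, so a windowed version of this score gives a rough coding-potential profile along a sequence.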
Most genes in higher eukaryotes are spliced, and
many programs have been written that identify splice sites, i.e., intron/exon
junctions.
Recognizing poly-Adenylation signals allows you
to identify the 3' end of a gene. This can be done online at these sites.
Finding other types of sequences
Matrix attachment sites are characteristic sequences where
the chromosome is attached to the nuclear matrix.
Repeated sequences are a problem for many types
of sequence analysis
Finding Genes
Many different programs that search for genes
in eukaryotic genomic sequences have been written
and can be accessed online.
The first step is to look for sequences
similar to genes identified in other organisms. For example, OTTO goes through
a genome and identifies genes that are close matches to known human genes (based
on BLAST alignments). It then compares the genome with databases of ESTs, proteins
and gene sequences from other organisms.
Next, most programs look for combinations
of several different features associated with genes, such as promoters, start
sites, ORFs, etc., in reasonable proximity.
- If a promoter is found with a
translation start, a splice site, etc. reasonably nearby, then the region is called
a gene.
Each program assigns different weights to the various
factors. For example, last year we searched the same
50,000 bp of raw human DNA with 7 different programs:
- each one found a different number of genes!
- all agreed on two (which were the only
two actually identified by GenBank as residing on this piece of DNA)
- all found variable numbers of extra genes
Therefore, you should always have several different
programs analyze the same piece of DNA!
- GenBank typically uses three
- TIGR provides an application, COMBINER (available
from http://www.tigr.org/software/),
that uses a voting scheme to combine the predictions of 3 or more gene finders
and produce a single best prediction. It is compatible with GlimmerM, Genscan,
FGenes, GRAIL, and GeneMark.HMM.
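The idea of a voting scheme can be sketched as follows: each program casts one vote per base it calls genic, and only bases with enough votes are kept. This is a simple per-base majority vote written for illustration and should not be taken as COMBINER's actual algorithm. The input format (a dict of program name to intervals) and the function name are assumptions.

```python
def vote_combine(predictions, seq_len, min_votes=None):
    """Combine gene predictions from several programs by per-base voting.

    predictions: dict mapping program name -> list of (start, end) intervals,
                 0-based and end-exclusive.
    min_votes:   votes needed to keep a base; defaults to a strict majority.
    Returns a list of consensus (start, end) intervals.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    votes = [0] * seq_len
    for intervals in predictions.values():
        covered = [False] * seq_len  # each program votes once per base
        for start, end in intervals:
            for i in range(max(0, start), min(seq_len, end)):
                covered[i] = True
        for i, is_covered in enumerate(covered):
            votes[i] += is_covered
    consensus, run_start = [], None
    for i, v in enumerate(votes + [0]):  # trailing 0 closes a final run
        if v >= min_votes and run_start is None:
            run_start = i
        elif v < min_votes and run_start is not None:
            consensus.append((run_start, i))
            run_start = None
    return consensus
```

With three programs the default keeps regions that at least two of them agree on, which mirrors the class exercise above: the genes that every program found were the ones GenBank had annotated.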
User’s instructions from http://cmgm.stanford.edu/classes/genefind/
- Remove repetitive elements (ALUs, etc.)
- Database Search on Translated DNA (BlastX
or TFasta)
- ORF Gene Finding Search (Grail, GenScan, etc)
- Translate putative ORFs and do Functional Analysis (Blocks, Motifs,
etc)
- Always have more than one program analyze
your data.
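The translated-DNA search in the steps above (BlastX or TFasta) works because the DNA is translated in all six reading frames before being compared with protein databases. Here is a minimal sketch of that translation step using the standard genetic code. It is not how BlastX is implemented internally, and the function names and frame labels are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered by first, second, third base
# in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') -> protein string."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For example, six_frame_translation("ATGGCCTAA")["+1"] gives "MA*". Each of the six protein strings can then be searched against a protein database or passed to the functional analysis step.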
Online Gene-finding websites (Note
that these come from all over the world!)
http://www.tigr.org/software/
http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
http://www.softberry.com/berry.phtml?topic=gfind
http://genes.mit.edu/GENSCAN.html
http://compbio.ornl.gov/grailexp/
http://www.cbs.dtu.dk/services/
http://www.fruitfly.org/seq_tools/genie.html
http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi
http://arete.ibb.waw.pl/PL/html/gene_lang.html
http://www1.imim.es/geneid.html
http://www.itba.mi.cnr.it/webgene/
